Retry actions in Ruby

If you've ever worked with Ruby or Rails codebases, you've almost certainly had to deal with actions that require some sort of retry handling. For example, making an HTTP call to a third-party web service might fail for various reasons: a network failure, a server crash, a bad DNS lookup and so on. In such cases your best bet is to be prepared and handle the workflow gracefully. In this post I'm going to cover a few retry strategies available in Ruby.

Basics

For starters, let's create an example worker that makes HTTP calls. I'm going to use the Faraday HTTP client as a reference, since it's my go-to gem for making HTTP calls.

require "faraday"

class Worker
  def perform
    response = Faraday.get("http://foobarservice.com/items")
    if response.success?
      puts "Ok, we've got items!"
    else
      puts "Uh, server responded with #{response.status}"
    end
  end
end

Our worker fetches items from the service and deals with the response data based on the status code, nothing complicated. Now, what happens when the service fails? Boom!

/.rvm/rubies/ruby-2.4.1/lib/ruby/2.4.0/net/http.rb:906:in `rescue in block in connect': Failed to open TCP connection to foobarservice.com:80 (getaddrinfo: nodename nor servname provided, or not known) (Faraday::ConnectionFailed)
  from /Users/sosedoff/.rvm/rubies/ruby-2.4.1/lib/ruby/2.4.0/net/http.rb:903:in `block in connect'
  from /Users/sosedoff/.rvm/rubies/ruby-2.4.1/lib/ruby/2.4.0/timeout.rb:93:in `block in timeout'

We get a Faraday::ConnectionFailed exception and the whole thing crashes. Let's add a guard around that next:

def perform
  response = Faraday.get("http://foobarservice.com/items")
  if response.success?
    puts "Ok, we've go items"
  else
    puts "Uh, server responded with #{response.status}"
  end
rescue Faraday::Error => err # handle all Faraday exceptions
  puts "Oh no, we failed. Error: #{err}"
end

After running the code again, we get:

Oh no, we failed. Error: Failed to open TCP connection to foobarservice.com:80 (getaddrinfo: nodename nor servname provided, or not known)

Good, now our code does not crash when the third-party service misbehaves. But we still want to be able to retry the request and get the data. Next, let's add a basic retry flow:

def perform
  retries = 3 # or any number
  delay = 1 # number of seconds to wait between attempts

  begin
    Faraday.get("http://foobarservice.com/items")
  rescue Faraday::Error => err
    fail "All retries are exhausted" if retries == 0

    puts "Oh no, we failed. Retries left: #{retries -= 1}"
    sleep delay

    retry
  end
end

When the worker tries to make a call and fails, it will retry 3 times with 1 second between attempts. The retry control flow in this case is provided by the retry keyword. After running the code we see this:

Oh no, we failed. Retries left: 2
Oh no, we failed. Retries left: 1
Oh no, we failed. Retries left: 0
retry.rb:12:in `rescue in perform': All retries are exhausted (RuntimeError)

To make things a bit better we could also add a varying delay between attempts, so that the system does not fire off requests and fail too quickly.

def perform
  max_retries = 3
  retry_count = 0
  delay = 1

  begin
    Faraday.get("http://foobarservice.com/items")
  rescue Faraday::Error => err
    fail "All retries are exhausted" if retry_count >= max_retries
    retry_count += 1

    puts "[#{Time.now}] Oh no, we failed. Retries left: #{max_retries - retry_count}"
    sleep delay += retry_count

    retry
  end
end

On every failed attempt we increase the delay between attempts by the number of attempts performed so far.

[2017-10-10 21:19:35 -0500] Oh no, we failed. Retries left: 2
[2017-10-10 21:19:37 -0500] Oh no, we failed. Retries left: 1
[2017-10-10 21:19:41 -0500] Oh no, we failed. Retries left: 0
retry.rb:12:in `rescue in perform': All retries are exhausted (RuntimeError)
  from retry.rb:9:in `perform'
  from retry.rb:23:in `<main>'

You can see the sleep time increases with more attempts. As you've noticed, this strategy is pretty simple. There are a bunch of backoff algorithms that could be applied here, like exponential backoff, but that all depends on the workload.
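To illustrate, here's a rough sketch of the same perform method using exponential backoff with a bit of random jitter (the delays grow as 1s, 2s, 4s, and the jitter helps avoid many clients retrying in lockstep):

def perform
  max_retries = 3
  retry_count = 0
  base_delay  = 1 # seconds

  begin
    Faraday.get("http://foobarservice.com/items")
  rescue Faraday::Error => err
    fail "All retries are exhausted" if retry_count >= max_retries
    retry_count += 1

    # 1s, 2s, 4s, ... plus up to half a second of random jitter
    delay = base_delay * (2 ** (retry_count - 1)) + rand / 2
    puts "[#{Time.now}] Oh no, we failed. Sleeping #{delay.round(2)}s, retries left: #{max_retries - retry_count}"
    sleep delay

    retry
  end
end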

There's still a problem though. Once all the retries are exhausted, a new exception (RuntimeError) will be raised, and that will ultimately crash the code unless accounted for. In some cases such behavior is useful and tells the developer that the program is definitely not working and should be fixed. However, if the task is not critical you can always add a final handling block:

def perform
  max_retries = 3
  retry_count = 0
  delay = 1

  begin
    Faraday.get("http://foobarservice.com/")
  rescue Faraday::Error => err
    puts "[#{Time.now}] Oh no, we failed. Retries left: #{max_retries - retry_count}"
    sleep delay += retry_count
    retry_count += 1

    retry if retry_count < max_retries
  ensure
    # code in the ensure block will always run
    if retry_count == max_retries
      # notify logging/monitoring system here
      return
    end
  end
end

Another caveat with the code above: it will only trigger a retry if the error occurred at the Faraday level. If there's a computational or other problem, like unexpected data in the response, you'd need to add whatever exception you're expecting to the rescue clause:

begin
  # ... code
rescue Faraday::Error, JSON::ParserError => err
  # ... handle error
end

Shared Module

Okay, so we've covered the basics of retrying (well, technically exception handling), but wrapping your code this way makes it harder to read and introduces duplication if used in many places. Next, let's make a module that provides retry functionality and can be included into any class that requires it.

module Retriable
  # Creates module functions for the named methods
  module_function

  class RetryError < StandardError; end

  # These errors will be handled automatically
  DEFAULT_ERROR_CLASSES = [
    Retriable::RetryError,
    StandardError
  ]

  def retry_with(opts = {}, &blk)
    fail("Block is required") if blk.nil?

    classes     = [opts[:errors] || DEFAULT_ERROR_CLASSES].flatten
    attempts    = opts[:attempts] || 3
    delay       = opts[:delay] || 1
    delay_inc   = opts.fetch(:increment_delay, true) # incremental delay is on by default
    delay_sleep = opts.fetch(:sleep, true)           # sleeping between attempts is on by default
    debug       = opts[:debug] == true

    # We need to handle our own retry method
    classes << Retriable::RetryError

    1.upto(attempts) do |i|
      begin
        puts "Trying #{i} attempt..." if debug
        blk.call
        puts "Success" if debug
        return
      rescue Exception => err
        puts "Got an error on #{i} attemp: #{err}" if debug

        if (classes & err.class.ancestors).any?
          delay *= i if delay_inc
          sleep(delay) if delay_sleep
        else
          puts "Unhanded retriable error: #{err}" if debug
          fail(err)
        end
      end
    end

    fail "Retry attempts are exhausted (#{attempts} total)"
  end

  def try_again
    fail(RetryError)
  end
end

There's a lot going on in that code, but it all boils down to a few use cases:

# Include in class
class Foo
  include Retriable
end

# Bare bones, will trigger a retry on any error that inherits from StandardError.
# Total of 3 attempts, with an incremental delay.
retry_with { do_stuff }

# Print all retry-related debugging information
retry_with(debug: true) { do_stuff }

# Max 10 attempts, with no delay in between
retry_with(sleep: false, attempts: 10) { do_stuff }

# Retry only when a specific error is raised.
# Fixed 10-second delay with no increment between retries.
retry_with(
  errors: [MyError, Faraday::Error],
  delay: 10,
  increment_delay: false
) { do_stuff }

# Standalone usage
Retriable.retry_with(...) { do_stuff }

Example worker class using the module:

class Worker
  include Retriable

  def perform
    retry_with(errors: Faraday::Error, delay: 2, debug: true) do
      fetch_items
    end
  end

  private

  def fetch_items
    response = Faraday.get("http://foobarservice.com/items")

    # Uh, we got bad response. Lets try again..
    try_again if !response.success?

    # Good data!
    puts response.body
  end
end

In addition to the automatic retries handled by the module, there's a way to manually trigger a retry from inside the user-defined block with the try_again method. This makes it possible to redo the work upon an undesired outcome even when no exception was raised.

Summary

While the module we've created has a bunch of extra functionality that you might not need, having a simple and straightforward retry mechanism available for your worker classes (or whatever they may be) is always nice.

In general, retry functionality should not be abused, because hiding failures is a bad thing, especially if you're using a catch-all error class like Exception.
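For instance, a catch-all rescue also swallows exceptions you almost never want to retry, like Interrupt (Ctrl-C) or SystemExit. A quick sketch, reusing the hypothetical do_stuff from the examples above:

# Too broad: Exception also covers Interrupt, SystemExit, NoMemoryError and
# friends, so the loop keeps retrying even while the process is shutting down.
3.times do
  begin
    do_stuff
    break
  rescue Exception
    sleep 1
  end
end

# Narrower: only retry the failures you actually consider transient.
3.times do
  begin
    do_stuff
    break
  rescue Faraday::ConnectionFailed, Faraday::TimeoutError
    sleep 1
  end
end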

This tutorial is mostly for educational purposes, so if you don't really want to roll your own module, there are plenty of options out there, like the retriable gem and others.