Retry actions in Ruby
If you've ever worked with Ruby or Rails codebases, you've almost certainly had to deal with actions that require some sort of retry handling. For example, an HTTP call to a third-party web service might fail for various reasons: network failure, server crash, bad DNS lookup and so on. In such cases your best bet is to be prepared and handle the workflow gracefully. In this post I'm going to cover a few retry strategies available in Ruby.
Basics
For starters, let's create an example worker that makes HTTP calls. I'm going to use the Faraday HTTP client as a reference; it's my go-to gem for making HTTP calls.
class Worker
  def perform
    response = Faraday.get("http://foobarservice.com/items")
    if response.success?
      puts "Ok, we've got items!"
    else
      puts "Uh, server responded with #{response.status}"
    end
  end
end
Our worker fetches items from the service and deals with the response data based on the status code, nothing complicated. Now, what happens when the service fails? Boom!
/.rvm/rubies/ruby-2.4.1/lib/ruby/2.4.0/net/http.rb:906:in `rescue in block in connect': Failed to open TCP connection to foobarservice.com:80 (getaddrinfo: nodename nor servname provided, or not known) (Faraday::ConnectionFailed)
from /Users/sosedoff/.rvm/rubies/ruby-2.4.1/lib/ruby/2.4.0/net/http.rb:903:in `block in connect'
from /Users/sosedoff/.rvm/rubies/ruby-2.4.1/lib/ruby/2.4.0/timeout.rb:93:in `block in timeout'
We get a Faraday::ConnectionFailed exception and the whole thing crashes. Let's add a guard around that next:
def perform
  response = Faraday.get("http://foobarservice.com/items")
  if response.success?
    puts "Ok, we've got items!"
  else
    puts "Uh, server responded with #{response.status}"
  end
rescue Faraday::Error => err # handle all Faraday exceptions
  puts "Oh no, we failed. Error: #{err}"
end
After running the code again, we get:
Oh no, we failed. Error: Failed to open TCP connection to foobarservice.com:80 (getaddrinfo: nodename nor servname provided, or not known)
Good, now our code does not crash when the third-party service misbehaves. But we still want to be able to retry the request and get the data. Next, let's add a basic retry flow:
def perform
  retries = 3 # or any number
  delay = 1 # number of seconds to wait between attempts

  begin
    Faraday.get("http://foobarservice.com/items")
  rescue Faraday::Error => err
    fail "All retries are exhausted" if retries == 0
    puts "Oh no, we failed. Retries left: #{retries -= 1}"
    sleep delay
    retry
  end
end
When the worker tries to make a call and fails, it will retry 3 times with 1 second between the attempts. The retry control flow in this case is provided by the retry keyword, which re-runs the begin block from the top. After running the code we see this:
Oh no, we failed. Retries left: 2
Oh no, we failed. Retries left: 1
Oh no, we failed. Retries left: 0
retry.rb:12:in `rescue in perform': All retries are exhausted (RuntimeError)
To make things a bit better we could also add a varying delay between attempts, so the system does not fire off requests and fail too quickly.
def perform
  max_retries = 3
  retry_count = 0
  delay = 1

  begin
    Faraday.get("http://foobarservice.com/items")
  rescue Faraday::Error => err
    fail "All retries are exhausted" if retry_count >= max_retries
    retry_count += 1
    puts "[#{Time.now}] Oh no, we failed. Retries left: #{max_retries - retry_count}"
    sleep(delay += retry_count)
    retry
  end
end
On every failed attempt we increase the delay by the number of attempts performed so far.
[2017-10-10 21:19:35 -0500] Oh no, we failed. Retries left: 2
[2017-10-10 21:19:37 -0500] Oh no, we failed. Retries left: 1
[2017-10-10 21:19:41 -0500] Oh no, we failed. Retries left: 0
retry.rb:12:in `rescue in perform': All retries are exhausted (RuntimeError)
from retry.rb:9:in `perform'
from retry.rb:23:in `<main>'
You can see the sleep time increases with more attempts. As you've noticed, this strategy is pretty simple. There are a bunch of backoff algorithms that could be applied here, like exponential backoff, but which one fits depends on the workload.
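For the sake of illustration, here's a minimal sketch of what an exponential backoff version of the same loop could look like. The base delay and the doubling factor are arbitrary picks for this example and would need tuning for a real workload:

def perform
  max_retries = 3
  retry_count = 0
  base_delay = 1 # seconds, arbitrary starting point

  begin
    Faraday.get("http://foobarservice.com/items")
  rescue Faraday::Error => err
    fail "All retries are exhausted" if retry_count >= max_retries
    retry_count += 1
    # Delay doubles on every attempt: 2, 4, 8 seconds and so on
    sleep base_delay * (2 ** retry_count)
    retry
  end
end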
There's still a problem though. Once all the retries are exhausted, a new exception is thrown (RuntimeError) and that will ultimately crash the code unless accounted for. In some cases such behavior is useful and tells the developer that the program is definitely not working and should be fixed. However, if the task is not critical you can always add a final handling block:
def perform
  max_retries = 3
  retry_count = 0
  delay = 1

  begin
    Faraday.get("http://foobarservice.com/")
  rescue Faraday::Error => err
    puts "[#{Time.now}] Oh no, we failed. Retries left: #{max_retries - retry_count}"
    sleep(delay += retry_count)
    retry_count += 1
    retry if retry_count < max_retries
  ensure
    # code in the ensure block will always run
    if retry_count == max_retries
      # notify logging/monitoring system here
      return
    end
  end
end
Another caveat with the code above: it will only trigger the retry if the error occurred at the Faraday level. If there's a computational or other problem, like unexpected data in the response, you'd need to add whatever exception you're expecting to the rescue block.
begin
  # ... code
rescue Faraday::Error, JSON::ParserError => err
  # ... handle error
end
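For example, a hypothetical version of our worker that parses the response body as JSON would want to retry on parse failures too (JSON::ParserError is the class raised by Ruby's standard json library):

require "json"

def perform
  retries = 3

  begin
    response = Faraday.get("http://foobarservice.com/items")
    items = JSON.parse(response.body) # raises JSON::ParserError on a bad payload
    puts "Got #{items.size} items"
  rescue Faraday::Error, JSON::ParserError => err
    fail "All retries are exhausted" if retries == 0
    retries -= 1
    sleep 1
    retry
  end
end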
Shared Module
Okay, so we've covered the basics of retrying (well, technically exception handling), but wrapping your code this way makes it harder to read and introduces duplication if used in many places. Next, let's make a module that provides retry functionality and can be included into any class that requires it.
module Retriable
  # Creates module functions for the named methods
  module_function

  class RetryError < StandardError; end

  # These errors will be handled automatically
  DEFAULT_ERROR_CLASSES = [
    Retriable::RetryError,
    StandardError
  ]

  def retry_with(opts = {}, &blk)
    fail("Block is required") if blk.nil?

    classes     = [opts[:errors] || DEFAULT_ERROR_CLASSES].flatten
    attempts    = opts[:attempts] || 3
    delay       = opts[:delay] || 1
    delay_inc   = opts.fetch(:increment_delay, true)
    delay_sleep = opts.fetch(:sleep, true)
    debug       = opts[:debug] == true

    # We need to handle our own retry method
    classes << Retriable::RetryError

    1.upto(attempts) do |i|
      begin
        puts "Trying #{i} attempt..." if debug
        blk.call
        puts "Success" if debug
        return
      rescue Exception => err
        puts "Got an error on #{i} attempt: #{err}" if debug

        if (classes & err.class.ancestors).any?
          delay *= i if delay_inc
          sleep(delay) if delay_sleep
        else
          puts "Unhandled retriable error: #{err}" if debug
          fail(err)
        end
      end
    end

    fail "Retry attempts are exhausted (#{attempts} total)"
  end

  def try_again
    fail(RetryError)
  end
end
There's a lot going on in the code. But it all boils down to a few use cases:
# Include in class
class Foo
  include Retriable
end

# Bare bones, will trigger retry on all errors (that inherit from StandardError).
# Total of 3 attempts, with an incremental delay.
retry_with { do_stuff }

# Print all retry-related debugging information
retry_with(debug: true) { do_stuff }

# Max 10 attempts, with no delay in between
retry_with(sleep: false, attempts: 10) { do_stuff }

# Retry only when a specific error is thrown.
# Fixed 10 seconds between retries with no delay increment between retries.
retry_with(
  errors: [MyError, Faraday::Error],
  delay: 10,
  increment_delay: false
) { do_stuff }

# Standalone usage
Retriable.retry_with(...) { do_stuff }
Example worker class using the module:
class Worker
  include Retriable

  def perform
    retry_with(errors: Faraday::Error, delay: 2, debug: true) do
      fetch_items
    end
  end

  private

  def fetch_items
    response = Faraday.get("http://foobarservice.com/items")

    # Uh, we got a bad response. Let's try again..
    try_again if !response.success?

    # Good data!
    puts response.body
  end
end
In addition to the automatic retries handled by the module, there's a way to manually trigger a retry inside the user-defined block with the try_again method. This lets you redo the work upon an undesired outcome, when no exceptions were thrown.
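For example, here's a hypothetical poller built on the same module; the jobs endpoint and the status field are made up for this sketch, and it assumes Faraday and the json library are loaded:

class JobPoller
  include Retriable

  def wait_for_completion(job_id)
    retry_with(attempts: 10, delay: 5, increment_delay: false) do
      response = Faraday.get("http://foobarservice.com/jobs/#{job_id}")
      job = JSON.parse(response.body)

      # No exception was thrown, we just don't like the answer yet
      try_again if job["status"] != "done"

      puts "Job #{job_id} is finished"
    end
  end
end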
Summary
While the module we've created has a bunch of extra functionality that you might not need, having simple and straightforward retry logic available for your worker classes (or whatever else) is always nice.
In general, retry functionality should not be abused, because hiding failures is a bad thing, especially if you're using a catch-all error class like Exception.
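To see why, consider this sketch (do_stuff is a placeholder): a blanket rescue of Exception combined with retry also catches things like Interrupt (Ctrl-C), SignalException and NoMemoryError, which makes the process nearly impossible to stop and hides failures you really want to see:

begin
  do_stuff
rescue Exception => err
  # Also swallows Interrupt, SignalException, NoMemoryError, etc.
  retry
end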
This tutorial is mostly for educational purposes, so if you don't really want to roll your own module, there are plenty of options on the market, like retriable and others.