Background Processing - The Rails 4 Way (2014)


Chapter 18. Background Processing

People count up the faults of those who keep them waiting.

—French Proverb

Users of modern websites have lofty expectations when it comes to application responsiveness, most likely expecting behavior and speed similar to that of desktop applications. Proper user experience guidelines dictate that no HTTP request/response cycle should take more than a second to execute; however, some actions simply cannot meet this time constraint.

Tasks of this nature range from simple, long-running tasks caused by network latency to more complex tasks that require heavy processing on the server. Sending an email and processing video are respective examples. In these situations it is best to execute the actions asynchronously, so that the application remains responsive while the procedures run.

In this chapter these types of tasks are referred to as background jobs. They include any execution that is handled in a separate process from the Rails application. Rails and Ruby have several libraries and techniques for performing this work, most notably:

· Delayed Job

· Sidekiq

· Resque

· Rails Runner

This chapter will cover each of these tools, discussing the strengths and weaknesses of each one so that you may determine what is appropriate for your application.

18.1 Delayed Job

Delayed Job is a robust background processing library that is essentially a highly configurable priority queue. It provides various approaches to handling asynchronous actions, including:

· Custom background jobs

· Permanently marked background methods

· Background execution of methods at runtime

Delayed Job requires a persistence store to save all queue related operations. Along with the delayed_job gem, a backend gem is required to get up and running. Supported options are:

· Active Record with the delayed_job_active_record gem.

· Mongoid (for use with MongoDB) with the delayed_job_mongoid gem.

18.1.1 Getting started

Add the delayed_job and delayed_job_active_record gems to your application’s Gemfile, then run the generator to create your execution and migration scripts.

$ rails generate delayed_job:active_record

This will create the database migration that must be run to set up the delayed_jobs table, as well as the bin/delayed_job script used to run Delayed Job.

To change the default settings for Delayed Job, first add a delayed_job.rb initializer to your config/initializers directory. Options can then be configured by calling various methods on Delayed::Worker, which include settings for changing the behavior of the queue with respect to retries, timeouts, maximum run times, sleep delays, and other options.

Delayed::Worker.destroy_failed_jobs = false
Delayed::Worker.sleep_delay = 30
Delayed::Worker.max_attempts = 5
Delayed::Worker.max_run_time = 1.hour
Delayed::Worker.max_priority = 10

18.1.2 Creating Jobs

Delayed Job can create background jobs using three different techniques; which one you use is a matter of personal style.

The first option is to chain any method that you wish to execute asynchronously after a call to Object#delay. This is good for cases where some common functionality needs to execute in the background in certain situations, but is acceptable to run synchronously in others.

# Execute normally
mailer.send_email(user)

# Execute asynchronously
mailer.delay.send_email(user)

The second technique is to tell Delayed Job to execute every call to a method in the background via the Object.handle_asynchronously macro.

class Mailer
  def send_email(user)
    UserMailer.activation(user).deliver
  end

  handle_asynchronously :send_email
end

tip

Durran says…

When using handle_asynchronously, make sure the declaration is after the method definition, since Delayed Job uses alias_method_chain internally to set up the behavior.
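To see why the order matters, here is a minimal sketch of alias-based method wrapping. This is an illustration only, not Delayed Job's actual implementation: the macro renames the existing method and redefines the name to wrap it, so it can only wrap a method that has already been defined.

```ruby
class Mailer
  def send_email(user)
    "sent to #{user}"
  end

  # Conceptually what handle_asynchronously does: alias the original
  # method away, then redefine the name to wrap it.
  def self.wrap_asynchronously(name)
    alias_method :"#{name}_without_delay", name
    define_method(name) do |*args|
      # A real implementation would enqueue a background job here.
      "enqueued: #{send(:"#{name}_without_delay", *args)}"
    end
  end

  wrap_asynchronously :send_email
end

Mailer.new.send_email("jill") # => "enqueued: sent to jill"
```

If wrap_asynchronously ran before send_email existed, the alias_method call would raise a NameError, which is exactly why the declaration must come after the definition.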

Lastly, you may create a custom job by creating a separate Ruby object that only needs to respond to perform. That job can then be run at any point by telling Delayed Job to enqueue the action.

class EmailJob < Struct.new(:user_id)
  def perform
    user = User.find(user_id)
    UserMailer.activation(user).deliver
  end
end

# Enqueue a job with default settings
Delayed::Job.enqueue EmailJob.new(user.id)

# Enqueue a job with priority of 1
Delayed::Job.enqueue EmailJob.new(user.id), 1

# Enqueue a job with priority of 0, starting tomorrow
Delayed::Job.enqueue EmailJob.new(user.id), 0, 1.day.from_now

18.1.3 Running

To start up Delayed Job workers, use the delayed_job command created by the generator. This allows for starting a single worker or multiple workers, each in its own process, and also provides the ability to stop all workers.

# Start a single worker
RAILS_ENV=staging bin/delayed_job start

# Start multiple workers, each in a separate process
RAILS_ENV=production bin/delayed_job -n 4 start

# Stop all workers
RAILS_ENV=staging bin/delayed_job stop

tip

Durran says…

Delayed Job workers generally have a lifecycle that is equivalent to an application deployment. Because of this, their memory consumption grows over time and may eventually have high swap usage, causing workers to become unresponsive. A good practice is to have a monitoring tool like God or monit watching jobs, and restarting them when their memory usage hits a certain point.

18.1.4 Summary

Delayed Job is an excellent choice where you want ease of setup, need to schedule jobs for later dates, or want to add priorities to jobs in your queue. It works well in situations where the total number of jobs is low and the tasks they execute are not long running or consume large amounts of memory.

Do note that if you are using Delayed Job with a relational database backend and a large number of jobs, performance issues may arise due to the table locking the framework employs. Since workers may have a long lifecycle, be wary of resource consumption caused by workers not releasing memory once jobs finish executing. Also, where job execution can take a long time, higher-priority jobs will still wait for running jobs to complete before being processed. In these cases, a non-relational backend such as MongoDB, or another library such as Sidekiq, may be advisable.

18.2 Sidekiq

Sidekiq is a full-featured background processing library with support for multiple weighted queues, scheduled jobs, and sending asynchronous Action Mailer emails. Like Resque (covered later in this chapter), Sidekiq uses Redis for its storage engine, minimizing the overhead of job processing.

Sidekiq is currently the best-performing and most memory-efficient background processing library in the Ruby ecosystem. It is multithreaded, which allows Sidekiq to process jobs in parallel without the overhead of running multiple processes. This also means Sidekiq can process jobs with a much smaller memory footprint than other background processing libraries, such as Delayed Job or Resque. According to the official documentation, one Sidekiq process can do an order of magnitude more work than its competitors:

You’ll find that you might need 50 200MB resque processes to peg your CPU whereas one 300MB Sidekiq process will peg the same CPU and perform the same amount of work.

Since it’s multithreaded, all code executed by Sidekiq must be thread-safe.
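What thread safety means in practice is that any state shared between jobs must be synchronized. Here is a minimal plain-Ruby sketch (not Sidekiq code) of guarding a shared counter with a Mutex, the simplest synchronization primitive:

```ruby
counter = 0
mutex = Mutex.new

threads = 10.times.map do
  Thread.new do
    1_000.times do
      # Without the mutex, the read-increment-write sequence below
      # could interleave across threads and lose updates.
      mutex.synchronize { counter += 1 }
    end
  end
end
threads.each(&:join)

counter # => 10000
```

Code that mutates shared class-level state, memoized globals, or non-thread-safe library objects inside a worker needs this kind of protection, or better, should avoid shared mutable state entirely.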

18.2.1 Getting Started

To integrate Sidekiq into your Rails application, add the sidekiq gem in your Gemfile and run bundle install.

# Gemfile

gem 'sidekiq'

By default, Sidekiq will assume that Redis can be found at localhost:6379. To override the location of the Redis server used by Sidekiq (for production deployments you will probably need to point Sidekiq to an external Redis server), create a Rails initializer that configures redis in both the Sidekiq.configure_server and Sidekiq.configure_client code blocks.

# config/initializers/sidekiq.rb

Sidekiq.configure_server do |config|
  config.redis = {
    url: 'redis://redis.example.com:6379/10',
    namespace: 'tr4w'
  }
end

Sidekiq.configure_client do |config|
  config.redis = {
    url: 'redis://redis.example.com:6379/10',
    namespace: 'tr4w'
  }
end

Note that setting the :namespace option is completely optional, but highly recommended if Sidekiq is sharing access to a Redis database.

tip

Juanito says…

Sidekiq requires Redis 2.4 or greater.

18.2.2 Workers

To create a worker in Sidekiq, one must create a class in the app/workers folder that includes the module Sidekiq::Worker and responds to perform.

1 classEmailWorker

2 include Sidekiq::Worker

3

4 def perform(user_id)

5 user = User.find(@user_id)

6 UserMailer.activation(user).deliver

7 end

8 end

To enqueue a job on the worker, simply call the perform_async class method passing any arguments required by the perform method of the worker.

EmailWorker.perform_async(1)

Be aware that all worker jobs are stored in the Redis database as JSON objects, meaning you must ensure the arguments provided to your worker can be serialized to JSON. For the sake of clarity, in the above example, instead of passing an instance of User, we provided the worker with an identifier for the record. The worker would then be responsible for querying the User record from the database.
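The constraint can be seen with a plain JSON round trip. This is a sketch for illustration, and the UserRecord struct is hypothetical: simple values survive serialization intact, while a rich object degrades to a string the worker could never rebuild.

```ruby
require 'json'

# Simple arguments survive the trip through Redis intact.
args = [42, "activation"]
restored = JSON.parse(JSON.generate(args))
# restored == [42, "activation"]

# A rich object does not: the generic JSON fallback serializes it
# as its #to_s string, so the worker cannot reconstruct the record.
UserRecord = Struct.new(:id, :email)
user = UserRecord.new(42, "jill@example.com")
dumped = JSON.parse(JSON.generate([user])).first
# dumped is now just a String like "#<struct UserRecord id=42, ...>"
```

Passing the record's id and re-querying inside perform sidesteps the problem, and has the added benefit that the worker always sees the freshest data.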

Sidekiq workers can be configured via the sidekiq_options macro style method. Available options are:

:backtrace

Specifies whether or not to save error backtraces to the retry payload, defaulting to false. The error backtrace is used for display purposes in the Sidekiq web UI. Alternatively, you can specify the number of lines to save (i.e., backtrace: 15).

:queue

The name of the queue for the worker, defaulting to “default”.

:retry

By default, a worker retries a failed job until it completes successfully. Setting the :retry option to false will instruct Sidekiq to run a job only once. Alternatively, you can specify the maximum number of times a job is retried (i.e., retry: 5).

class SomeWorker
  include Sidekiq::Worker
  sidekiq_options queue: :high_priority, retry: 5, backtrace: true

  def perform
    ...
  end
end

18.2.3 Scheduled Jobs

Out of the box, Sidekiq has the ability to schedule when jobs will be executed. To delay the execution of a job for a specific interval, enqueue the job by calling perform_in.

EmailWorker.perform_in(1.hour, 1)

A job can also be scheduled for a specific time using the enqueue method perform_at.

EmailWorker.perform_at(2.days.from_now, 1)
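Under the hood, Sidekiq keeps scheduled jobs in a Redis sorted set scored by their run-at timestamp; a poller periodically moves jobs whose score has passed onto the regular queues. A plain-Ruby approximation of that structure (the names here are illustrative, not Sidekiq's API):

```ruby
# [run_at, job] pairs kept ordered by run_at, like a sorted set.
scheduled = []

enqueue_at = lambda do |run_at, job|
  scheduled << [run_at, job]
  scheduled.sort_by!(&:first)
end

# Everything whose score (run_at) has passed is due for processing.
pop_due = lambda do |now|
  due = scheduled.take_while { |run_at, _| run_at <= now }
  scheduled.shift(due.size)
  due.map(&:last)
end

enqueue_at.call(10, :digest_email)
enqueue_at.call(5, :activation_email)
pop_due.call(6) # => [:activation_email]
```

Because the set is ordered by timestamp, the poller only ever inspects the front of the schedule, which keeps the check cheap no matter how many future jobs are queued.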

18.2.4 Delayed Action Mailer

When Sidekiq is included in a Rails application, it adds three methods to Action Mailer that allow for email deliveries to be executed asynchronously.

Note

The following methods are also available on Active Record classes to execute class methods asynchronously. Calling these methods on Active Record instances is strongly discouraged.

User.delay(1.hour).some_background_operation

18.2.4.1 delay

Calling delay from a mailer will result in the email being added to the DelayedMailer worker for processing.

UserMailer.delay.activation(user.id)

18.2.4.2 delay_for(interval)

Using delay_for, an email can be scheduled for delivery at a specific time interval.

UserMailer.delay_for(10.minutes).status_report(user.id)

18.2.4.3 delay_until(timestamp)

The last Action Mailer method added by Sidekiq is delay_until. Sidekiq will wait until the specified time to attempt delivery of the email.

# These two calls are equivalent:
UserMailer.delay_for(1.day).status_report(user.id)
UserMailer.delay_until(1.day.from_now).status_report(user.id)

18.2.5 Running

To start up Sidekiq workers, run the sidekiq command from the root of your Rails application.

$ bundle exec sidekiq

This starts a Sidekiq process that begins processing against the “default” queue. To use multiple queues, you can pass the name of a queue and an optional weight to the sidekiq command.

$ bundle exec sidekiq -q default -q critical,2

Queues have a weight of 1 by default. If a queue has a higher weight, it will be checked that many more times than a queue with a weight of 1. For instance, in the example above, the critical queue is checked twice as often as default.
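Sidekiq implements the weighting by sampling: conceptually, each queue appears in the candidate list as many times as its weight, and a random pick decides which queue to check next. A sketch of that idea (an approximation for illustration, not Sidekiq's exact code):

```ruby
queues = { "default" => 1, "critical" => 2 }

# Expand each queue name by its weight, then sample uniformly:
# "critical" now has two chances to "default"'s one.
pool = queues.flat_map { |name, weight| [name] * weight }
# pool == ["default", "critical", "critical"]

picks = 6_000.times.map { pool.sample }
# "critical" will be picked roughly twice as often as "default".
```

Because the choice is probabilistic, a heavily weighted queue is favored but never starves the others, unlike a strict priority scheme.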

Stopping jobs involves sending signals to the sidekiq process, which then takes the appropriate action on all processors:

TERM

Signals that Sidekiq should shut down within the -t timeout option. Any jobs that are not completed within the timeout period are pushed back into Redis. These jobs are executed again once Sidekiq restarts. By default, the timeout period is 8 seconds.

USR1

Continues working on current jobs, but stops accepting any new ones.

18.2.5.1 Concurrency

By default, Sidekiq starts up 25 concurrent processors. To explicitly set the number of processors for Sidekiq to use, pass the -c option to the sidekiq command.

$ bundle exec sidekiq -c 100

information

Active Record Database Connections

When using Sidekiq alongside Active Record, ensure that the Active Record connection pool setting, pool, is close to or equal to the number of Sidekiq processors.

production:
  adapter: postgresql
  database: example_production
  pool: 25

18.2.5.2 sidekiq.yml

If you find yourself having to specify different options to the sidekiq command for multiple environments, you can configure Sidekiq using a YAML file.

# config/sidekiq.yml
---
:concurrency: 10
:queues:
  - [default, 1]
  - [critical, 5]
staging:
  :concurrency: 25
production:
  :concurrency: 100

Now, when starting the sidekiq command, pass the path of sidekiq.yml to the -C option.

$ bundle exec sidekiq -e $RAILS_ENV -C config/sidekiq.yml

18.2.6 Error Handling

Sidekiq ships with support to notify the following exception notification services if an error occurs within a worker during processing:

· Airbrake

· Exceptional

· ExceptionNotifier

· Honeybadger

Other services, such as Sentry and New Relic, implement their own Sidekiq middleware that handles the reporting of errors. Installation usually involves adding a single require statement to a Rails initializer.

# config/initializers/sentry.rb

require 'raven/sidekiq'

18.2.7 Monitoring

When Resque was released, it set a precedent for Ruby background processing libraries by shipping with a web interface to monitor your queues and jobs. Sidekiq follows suit and also comes with a Sinatra application that can be run standalone or be mounted with your Rails application.

To run the web interface standalone, create a config.ru file and boot it with any Rack server:

require 'sidekiq'

Sidekiq.configure_client do |config|
  config.redis = { size: 1 }
end

require 'sidekiq/web'
run Sidekiq::Web

If you prefer to access the web interface within your Rails application, explicitly mount Sidekiq::Web to a path in your config/routes.rb file.

require 'sidekiq/web'

Rails.application.routes.draw do
  mount Sidekiq::Web => '/sidekiq'
  ...
end

Since the web interface is a Sinatra application, you will need to add the sinatra gem to your Gemfile.

# Gemfile

gem 'sinatra', '>= 1.3.0', require: nil

18.2.8 Summary

Sidekiq is highly recommended for any Rails application that has a large number of jobs. It’s the fastest and most efficient background processing library available due to it being multithreaded.

With a Redis backend, Sidekiq does not suffer from the potential database locking issues that can arise when using Delayed Job and has significantly better performance with respect to queue management over both Delayed Job and Resque.

Note that Redis stores all of its data in memory, so if you are expecting a large amount of jobs but do not have a significant amount of RAM to spare, you may need to look at a different framework.

18.3 Resque

Resque is a background processing framework that supports multiple queues and, like Sidekiq, uses Redis for its persistent storage. Resque also comes with a Sinatra web application to monitor the queues and jobs.

Resque jobs are Ruby classes or modules that respond to a perform class method. Jobs are stored in Redis as JSON objects, and because of this, only primitives can be passed as arguments to the actions. Resque also provides hooks into the worker and job lifecycles, as well as the ability to configure custom failure mechanisms.

Due to Resque’s use of Redis as its storage engine, the overhead of job processing is minimal. Resque uses a parent/child forking architecture, which makes its resource consumption predictable and easily managed.
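The forking model can be sketched in a few lines of plain Ruby. This is an illustration, not Resque's actual worker loop: the parent forks for each job, the child performs it and exits, and the operating system reclaims whatever memory the job allocated when the child dies.

```ruby
# Run a single job in a forked child, roughly as a Resque worker does.
def work_off(job)
  pid = fork do
    job.call   # perform the job in the child process
    exit!(0)   # exit immediately, skipping at_exit handlers
  end
  Process.wait(pid)  # parent blocks until the child finishes
  $?.success?        # true if the child exited cleanly
end

work_off(-> { "heavy lifting" }) # => true
```

This is why Resque's memory use stays predictable even for leaky jobs, at the cost of one fork per job and one resident process per parallel worker.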

18.3.1 Getting Started

First, add the resque gem to your Gemfile, then configure Resque by creating a Rails initializer and a resque.yml to store the configuration options. The YAML file should contain key/value pairs of environment name to Redis host and port, and the initializer should load the YAML and set the Redis options.

Configuring failure backends can also be done in the same manner - Resque supports persistence to Redis or Airbrake notifications out of the box, but custom backends can be easily created by inheriting from Resque::Failure::Base.

In config/resque.yml:

development: localhost:6379
staging: localhost:6379
production: localhost:6379

In config/initializers/resque.rb:

require 'resque/failure/multiple'
require 'resque/failure/airbrake'
require 'resque/failure/redis'

rails_env = ENV['RAILS_ENV'] || 'development'
config = YAML.load_file(Rails.root.join('config', 'resque.yml'))
Resque.redis = config[rails_env]

Resque::Failure::Airbrake.configure do |config|
  config.api_key = 'abcdefg'
  config.secure = true
end

Resque::Failure::Multiple.classes = [Resque::Failure::Redis,
                                     Resque::Failure::Airbrake]
Resque::Failure.backend = Resque::Failure::Multiple

18.3.2 Creating Jobs

Jobs in Resque are plain old Ruby objects that respond to a perform class method and define which queue they should be processed in. The simplest way to define the queue is to set an instance variable on the job class itself.

class EmailJob
  @queue = :communications

  def self.perform(user_id)
    user = User.find(user_id)
    UserMailer.activation(user).deliver
  end
end

# Enqueue the job
Resque.enqueue(EmailJob, user.id)

18.3.3 Hooks

Resque provides lifecycle hooks that can be used to add additional behavior, for example adding an automatic retry for a failed job. There are two categories of hooks: worker hooks and job hooks.

The available worker hooks are before_first_fork, before_fork, and after_fork. The before hooks execute in the parent process, whereas the after hook executes in the child process. This is important to note, since changes in the parent process will be permanent for the life of the worker, whereas changes in the child process will be lost when the job completes.

# Before the worker's first fork
Resque.before_first_fork do
  puts "Creating worker"
end

# Before every worker fork
Resque.before_fork do |job|
  puts "Forking worker"
end

# After every worker fork
Resque.after_fork do |job|
  puts "Child forked"
end

Job hooks differ slightly from worker hooks in that they are defined on the action classes themselves, and are defined as class methods with the hook name as the prefix. The available hooks for jobs are: before_perform, after_perform, around_perform, and on_failure.

An example job that retries itself automatically on failure and logs some information before processing begins would look like this:

class EmailJob
  class << self
    def perform(user_id)
      user = User.find(user_id)
      UserMailer.activation(user).deliver
    end

    def before_perform_log(*args)
      Logger.info "Starting Email Job"
    end

    def on_failure_retry(error, *args)
      Resque.enqueue self, *args
    end
  end
end

18.3.4 Plugins

Resque has a very good plugin ecosystem that provides additional useful features. Most plugins are modules that you include in the specific job classes that need the extra functionality. Notable plugins are listed below; a complete list can be found at https://github.com/resque/resque/wiki/plugins.

resque-scheduler

A job scheduler built on top of Resque.

resque-throttle

Restricts the frequency that jobs are run.

resque-retry

Adds configurable retry and exponential backoff behavior for failed jobs.

resque_mailer

Adds ability to send Action Mailer emails asynchronously.
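The exponential backoff that resque-retry provides can be expressed simply: the delay before each retry grows as a power of the attempt number, so transient failures get quick retries while persistent ones back off. A sketch of the idea (the formula and base value are assumptions for illustration, not the gem's exact defaults):

```ruby
# Delay in seconds before retry number `attempt` (0-based).
def backoff_delay(attempt, base: 60)
  base * (2**attempt)
end

(0..4).map { |n| backoff_delay(n) }
# => [60, 120, 240, 480, 960]
```

Doubling the delay each time keeps a flaky downstream service from being hammered by a job that will keep failing anyway.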

18.3.5 Running

Resque comes with two rake tasks for running workers: one runs a single worker for one or more queues, the other runs multiple workers. Configuration options are supplied as environment variables when running the tasks, and allow for defining the queues for the workers to monitor, logging verbosity, and the number of workers to start.

# Start 1 worker for the communications queue

$ QUEUE=communications rake environment resque:work

# Start 6 workers for the communications queue

$ QUEUE=communications COUNT=6 rake resque:workers

# Start 2 workers for all queues

$ QUEUE=* COUNT=2 rake resque:workers

Stopping jobs involves sending signals to the parent Resque workers, which then take the appropriate action on the child and themselves:

QUIT

Waits for the forked child to finish processing, then exits

TERM/INT

Immediately kills the child process and exits

USR1

Immediately kills the child process, but leaves the parent worker running

USR2

Finishes processing the child action, then waits for CONT before spawning another

CONT

Continues to start jobs again if it was halted by a USR2

18.3.6 Monitoring

One of the really nice features of Resque is the web interface that it ships with for monitoring your queues and jobs. It can run standalone or be mounted with your Rails application.

To run standalone, simply run resque-web from the command line. If you prefer to access the web interface within your Rails application, explicitly mount an instance of Resque::Server.new to a path in your config/routes.rb file.

require "resque/server"

Rails.application.routes.draw do
  mount Resque::Server.new => '/resque'
  ...
end

18.3.7 Summary

Resque is recommended where a large number of jobs are in play and your code is not threadsafe. It does not support priority queueing but does support multiple queues, which is advantageous when jobs can be categorized together and given pools of workers to run them.

Since it uses a Redis backend, Resque does not suffer from the potential database locking issues that can arise when using Delayed Job. However, being single-threaded means that Resque requires a process for every worker you want to run in parallel.

18.4 Rails Runner

Rails comes with a built-in tool for running tasks independent of the web cycle. The rails runner command simply loads the default Rails environment and then executes some specified Ruby code. Popular uses include:

· Importing “batch” external data

· Executing any (class) method in your models

· Running intensive calculations, delivering e-mails in batches, or executing scheduled tasks

Usages involving rails runner that you should avoid at all costs are:

· Processing incoming e-mail

· Tasks that take longer to run as your database grows

18.4.1 Getting Started

For example, let us suppose that you have a model called “Report.” The Report model has a class method called generate_rankings, which you can call from the command line using:

$ rails runner 'Report.generate_rankings'

Since we have access to all of Rails, we can even use the Active Record finder methods to extract data from our application.

$ rails runner 'User.pluck(:email).each { |e| puts e }'

charles.quinn@highgroove.com

me@seebq.com

bill.gates@microsoft.com

obie@obiefernandez.com

This example demonstrates that we have access to the User model and are able to execute arbitrary Rails code. In this case, we’ve collected some e-mail addresses that we can now spam to our heart’s content. (Just kidding!)

18.4.2 Usage Notes

There are some things to remember when using rails runner. You must specify the production environment using the -e option; otherwise, it defaults to development. The rails runner help option tells us:

$ rails runner -h
Usage: rails runner [options] ('Some.ruby(code)' or a filename)
    -e, --environment=name  Specifies the environment for the runner
                            to operate under (test/development/production).
                            Default: development

Using rails runner, we can easily script any batch operations that need to run using cron or another system scheduler. For example, you might calculate the most popular or highest-ranking product in your e-commerce application every few minutes or nightly, rather than make an expensive query on every request:

$ rails runner -e production 'Product.calculate_top_ranking'

A sample crontab to run that script might look like

0 */5 * * * root /usr/local/bin/ruby \

/apps/exampledotcom/current/script/rails runner -e production \

'Product.calculate_top_ranking'

The script will run every five hours to update the Product model’s top rankings.

18.4.3 Considerations

On the positive side: It doesn’t get any easier and there are no additional libraries to install. That’s about it.

As for negatives: The rails runner process loads the entire Rails environment. For some tasks, particularly short-lived ones, that can be quite wasteful of resources. Also, nothing prevents multiple copies of the same script from running simultaneously, which can be catastrophically bad, depending on the contents of the script.
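One common guard against concurrent copies (an approach you must add yourself; nothing in Rails provides it) is an exclusive, non-blocking file lock: the second copy fails to acquire the lock and exits immediately. The lock path below is illustrative.

```ruby
# Take an exclusive lock on a well-known file. With LOCK_NB, flock
# returns false instead of blocking when another handle holds it.
LOCK_PATH = "/tmp/calculate_top_ranking.lock"

lock = File.open(LOCK_PATH, File::RDWR | File::CREAT)
if lock.flock(File::LOCK_EX | File::LOCK_NB)
  # ... safe to do the batch work here ...
  lock.flock(File::LOCK_UN)  # release when finished
else
  abort "Another copy is already running"
end
```

Because the kernel releases the lock automatically if the process dies, a crashed run cannot leave the script permanently blocked.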

tip

Wilson says…

Do not process incoming e-mail with rails runner. It’s a Denial of Service attack waiting to happen.

18.4.4 Summary

The Rails Runner is useful for short tasks that need to run infrequently, but jobs that require more heavy lifting, reporting, and robust failover mechanisms are best handled by other libraries.

18.5 Conclusion

Most web applications today will need to incorporate some form of asynchronous behavior, and we’ve covered some of the important libraries available when needing to implement background processing. There are many other frameworks and techniques available for handling this, so choose the solution that is right for your needs - just remember to never make your users wait.