Reliably Deploying Rails Applications: Hassle free provisioning, reliable deployment (2014)

9.0 - Monit

Overview

It is inevitable that unexpected behaviour will eventually cause some processes to fail. This could be anything from our database process to Nginx to the application server or background job workers.

Where possible we want the system to take care of itself, to detect that a process has either failed or is performing incorrectly and restart it accordingly. Where this is not possible, we want to be alerted immediately so we can take action to rectify it.

Monit allows us to do just this. It can monitor system parameters such as processes and network interfaces. If a process fails or moves outside a range of defined parameters, Monit can restart it; if a restart fails or there are too many restarts in a given period, Monit can alert us by email. If needed, using an email to SMS service, we can have these alerts delivered by text message.

In the example template, Monit is managed by two cookbooks: monit-tlq and monit_configs-tlq.

Monit-tlq

This cookbook takes care of installing and configuring Monit and contains only one recipe, default.

It begins by installing the Monit package and then updates Monit's main configuration file /etc/monit/monitrc from the template monit-rc.rb.

As usual the attributes are documented in attributes/default.rb, but it's worth looking at the configuration file to see just how simple Monit's setup is:

set daemon <%= node[:monit][:poll_period] || 30 %>

This sets the interval in seconds at which Monit performs its checks. In Monit terminology this is known as a cycle. This is important because, when we define configurations for individual processes we want to monitor, we'll specify many parameters in terms of a number of cycles. A good starting point for most systems is between 30 and 120 seconds.

set logfile syslog facility log_daemon

This line tells Monit to log error and status messages to /var/log/syslog.

<% if node[:monit][:enable_emails] %>
set mailserver <%= node[:monit][:mailserver][:host] %> port <%= node[:monit][:mailserver][:port] %>
    username <%= node[:monit][:mailserver][:username] %>
    password <%= node[:monit][:mailserver][:password] %>
    using tlsv1
    with timeout 30 seconds
    using hostname <%= node[:monit][:mailserver][:hostname] %>
<% node[:monit][:notify_emails].each do |email| %>
  <% if node[:monit][:minimise_alerts] %>
set alert <%= email %> but not on {instance, pid}
  <% else %>
set alert <%= email %>
  <% end %>
<% end %>
<% end %>

This section sets up a mailserver and the users which should be alerted when the appropriate conditions are met. We'll examine the two variations on the set alert lines in more detail later, but in short, the addition of but not on {instance, pid} prevents emails from being sent for events which usually don't require any manual intervention.

set httpd port 2812 and
    use address localhost
    allow localhost
    allow <%= "#{node[:monit][:web_interface][:allow][0]}:#{node[:monit][:web_interface][:allow][1]}" %>

Here we set up Monit's web interface, which allows us to view system status and manually start and stop processes where appropriate.

The username and password (basic auth) are set in the final allow line. In the configuration above, the web interface is bound to localhost (i.e. not externally accessible). While it is possible to add additional allow lines to have the web interface accept connections from additional IPs or ranges and then allow these connections through the firewall, it is not recommended.

The recommended approach to making the web interface externally accessible is to proxy it through Nginx; the process for this is covered later in this chapter.

include /etc/monit/conf.d/*.conf

Finally we tell Monit to include any other configuration files matching /etc/monit/conf.d/*.conf - note the .d convention, meaning that it's a directory.

Rather than having a single monolithic configuration file for everything we want to monitor, we'll have an individual config file for each service we wish to monitor. This modular approach has the benefit of making it easy to debug the monitoring of a particular service, as well as making it much easier to re-use config files across servers with different combinations of services.

Which configuration goes where

The processes and services we're going to monitor fall into two categories: system components and app components.

System components are those which are installed when provisioning, for example Nginx, our database, Redis, memcached and ssh.

App components are generally processes specific to the Rails app(s) we are running on the server. Examples of these include our app server (unicorn) and background workers such as Sidekiq.

The rule followed for the sample configuration is that a component's monitoring should be managed at the time it is added to the server. Specifically, if Chef is used to install something (generally system components) then a suitable Monit configuration should be added by Chef. If on the other hand a component is added via Capistrano (Rails apps and background workers) then the Monit configuration should be defined within the app and managed from Capistrano with the rest of the app's resources.

Monit configurations for system components can be found in the various recipes contained in the cookbook monit_configs-tlq. Monit configs for app components are covered in the second section of this book.

The importance of a custom monitoring configuration

Whilst my Monit configurations are a good starting point, your monitoring requirements are likely to vary based on your uptime requirements. I suggest you fork my recipes and customise them to suit your own requirements.

System level monitoring

Out of the box, Monit can keep track of load averages, memory usage and CPU usage and alert you when they move outside of certain parameters.

Take as an example the below system definition from monit_configs-tlq.

check system localhost
  if loadavg (1min) > 4 then alert
  if loadavg (5min) > 3 then alert
  if memory usage > 75% then alert
  if cpu usage (user) > 70% for 5 cycles then alert
  if cpu usage (system) > 30% for 5 cycles then alert
  if cpu usage (wait) > 20% for 5 cycles then alert

The first line check system localhost tells Monit to, on each cycle (which we defined earlier as a period of time), perform each of the checks listed below it.

The first three are quite simple: they perform a check and, if the criteria are met, "alert" is called, which means an email will be sent as per the alert configuration in the main Monit config.

The second three demonstrate the addition of another condition: for 5 cycles. This means that Monit will only perform the specified action (in this case alert) if the condition is met for 5 checks in a row. This allows us to avoid being alerted by brief spikes which are part of normal operation, instead receiving alerts only if a spike continues for an extended period of time.

Load Average

The load average refers to the system's load, averaged over three time periods: 1, 5 and 15 minutes. You may well recognise these as they're displayed on many server control panels as well as by utilities such as top.

Intuitively, it might seem that a load average of 1.0 is perfect, meaning the system is loaded to exactly its maximum capacity. In practice this is a dangerous position to be in as there is no headroom; if there is any additional load, you'll start to see slowdowns. A 15 minute load average of 0.7 on a single core server is a good rule of thumb for the maximum before you should start looking at reducing the server's load or uprating it. A 1.0 load average on a single core server needs investigating urgently and anything above 1.0 means you have a problem.

If you have more than one core, then you can roughly multiply your maximum acceptable load by the number of cores. So on a 4 core system a load average of 4.0 would be 100% load, 3.0 would be 75% load etc.

So on a 4 core system, we might choose to have the following line:

if loadavg (15min) > 2.8 then alert

Which would tell Monit to alert us if the 15 minute load average was above 2.8 for a single cycle.

Memory Usage

This one is fairly simple: Monit can alert us when the server's memory usage exceeds a certain threshold. Nothing will kill a server's performance faster than swapping, so there should always be some available RAM.

A rule of thumb is that more than 70% - 80% memory usage on an ongoing basis is an indicator that the server either needs uprating or some processes moving off it.

A Monit line to reflect this rule would look like this:

if memory usage > 75% then alert

Or, to reduce the number of alerts caused by brief spikes, the following could be used:

if memory usage > 75% for 20 cycles then alert

Which would only alert us if memory usage was above 75% for 20 consecutive cycles. If our interval is configured as 30 seconds, this would alert us if memory usage was above 75% continually for 10 minutes.

Monitoring Pids

Monitoring based on pids forms the bulk of the monitoring on most production servers. Every process on a Unix system is assigned a unique pid, and using that pid, information such as the process's memory and CPU usage can be determined.

A typical Monit definition for monitoring a pidfile might look like this:

check process nginx with pidfile /var/run/nginx.pid
  start program = "/etc/init.d/nginx start"
  stop program = "/etc/init.d/nginx stop"
  if 15 restarts within 15 cycles then timeout

This simple definition tells Monit to check a process called nginx (this is just a human readable name for the service and can be anything) based on the value in the pidfile at /var/run/nginx.pid.

The pidfile model is extremely simple: when a process starts, it creates a file - its pidfile - in a known location, containing the pid it was allocated. Other applications which need to interact with the process can simply read that file to find out the current pid allocated to it.
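As a quick illustration (the pid and process details shown here are purely illustrative), reading a pidfile and checking whether that pid is still running looks something like this from the shell:

$ cat /var/run/nginx.pid
1234
$ ps -p $(cat /var/run/nginx.pid)
  PID TTY          TIME CMD
 1234 ?        00:00:03 nginx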

Additionally, if there is no pidfile, it's an indicator that the process may not be running. This is not, however, conclusive; since a pidfile is just a standard file, it's entirely possible for it not to have been written due to permission issues, or for it to have been deleted subsequently.

This is a common cause of hard to debug errors: a process running without a pidfile. In this scenario our monitoring may well try to start the process in question, assuming it has failed. If, for example, this is a database server which binds to port 3306, we may then see a large number of failed attempts to start the server, with an error that the port is already in use.

Returning to our simple Monit definition above, in addition to the name and pidfile location, we also define a start command and a stop command. On each check, if the process is not found to be running, Monit will execute the start command.

This is extremely powerful. As long as we can find a pidfile for an application and define a command which can be used to start it, Monit can be used to check that it is running and if not attempt to start it.

As we’ll see below, we can also have Monit alert us to changes in such processes so we know when manual intervention is required or likely to be required.

The final line in our simple definition above is this:

if 15 restarts within 15 cycles then timeout

This line means that if there have been 15 attempts to restart the process in the last 15 cycles, Monit should stop trying to restart it. This deals with scenarios where it is clear that the process is not going to start without manual intervention.

This is particularly important if the start process which is failing involves brief periods of intense CPU usage. Were this qualifier not there, our startup script would be called on every cycle indefinitely, leading to extremely high CPU usage and potentially causing the rest of the system to fail as well, or at least slow down dramatically.
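It's worth knowing that once a check has timed out in this way, Monit stops monitoring that service until told otherwise. Assuming the check is named nginx as in the example above, monitoring can be re-enabled from the command line (or from the web interface) with:

monit monitor nginx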

Finding Pidfiles

Finding the pidfiles for applications can be something of an art form.

The first place to look is the configuration files for the application in question. Often the configuration allows the pid's location to be specified and includes a default value. Note that not all applications will generate a pidfile by default; for some this is an option which has to be enabled.

If there is no mention of a pidfile in the application's configuration, the next place to look is /run, which is the "standard" location for pidfiles in Ubuntu (previously /var/run, which as of 11.10 is just a symlink to /run).

When specifying pidfile locations, /run is a good bet; however, see the section below on pidfile permissions.

Pidfile Permissions

When specifying a pidfile location in a config file, be mindful of permissions. A pidfile is a file like any other, therefore the user who creates it must have write permissions to the relevant part of the filesystem.

If the application in question runs as a specific user, double check that that user has write access to the path you're specifying. If this is not the case, the application may fail to start, or may simply log the error and continue as if nothing had happened.

A common source of this error is a workflow like the following:

· A process is to be run as a non-root user, so a subdirectory in /var/run is created and the relevant user given write access.

· In initial tests this works fine and the configuration is flagged as working.

· The server is restarted and suddenly the pidfile can't be written, Monit can't find the service and sometimes the service itself will not start.

The problem here is that whilst /run looks like a normal part of the filesystem, it's actually a mounted tmpfs. This means that the contents of /run are never persisted to disk - they are stored in RAM - so on reboot the contents are lost.

Therefore after a restart, the folder which was created with appropriate permissions in the initial setup will no longer exist.

There are various approaches to solving this problem; the simplest is to ensure that your application's startup scripts (such as those in /etc/init.d/) include logic to check for the existence and permissions of the pidfile's target location and, if it is not present, create it.

An example of such logic is the following:

# make sure the pid destination is writable
mkdir -p /var/run/an_application/
chown application_user:application_user /run/an_application

For an example of this, see the redis-tlq recipe, specifically the init.d script template in templates/redis-server.erb.

Monitoring Ports

In addition to checking the status of processes, Monit can check whether connections are being accepted on particular ports. So we could expand our Nginx definition above to the following:

check process nginx with pidfile /var/run/nginx.pid
  start program = "/etc/init.d/nginx start"
  stop program = "/etc/init.d/nginx stop"
  if 15 restarts within 15 cycles then timeout
  if failed host 127.0.0.1 port 80 then restart

The additional if failed host line means that on each cycle, Monit will attempt to establish a connection to 127.0.0.1:80. If a connection cannot be established, it will attempt to restart the process.

Here we can see that a restart command is available even though we've only defined start and stop commands. Restart simply calls the stop and start commands sequentially. At the time of writing there was no option to specify a separate restart command, but it has been slated as an upcoming feature for a while so this may change.

Free Space Monitoring

An often overlooked factor in server health is the amount of available free space. Particularly now that disk space is so cheap, it's easy to forget that it's still entirely possible to run out. After your production database server has run out of space once, it's unlikely you'll decide against including checks for this again.

A simple Monit check for available disk space might look like this:

check filesystem rootfs with path /
  if space usage > 80% then alert

This is fairly self explanatory: the filesystem is checked and, if the space in use is over 80%, an alert is sent.
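The same pattern works for any mounted filesystem. As a sketch (the mount point and thresholds here are purely illustrative), a separate data partition holding the database could be given its own, stricter check:

check filesystem datafs with path /var/lib/mysql
  if space usage > 70% then alert
  if space usage > 85% for 5 cycles then alert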

Alerts and avoiding overload

In our original Monit configuration at the start of the chapter we had the following line:

set alert <%= email %>

which translated to:

set alert user@example.com

This is what’s known as a global alert. By default Monit will send alerts to this address whenever the state of anything being monitored (a service) changes, including when:

· A service which should exist, does not exist

· A service which didn’t exist, starts existing

· The pid of a service changes between cycles

· A port connection fails

A default catch-all alert statement like this can generate a lot of email traffic. If, for example, you’re monitoring 5 unicorn workers, every time you deploy you’ll receive at least 5 notifications from Monit telling you that the pids of all 5 unicorn workers have changed.

The danger is that receiving Monit alerts becomes so commonplace that they get a similar treatment to spam: when one is received, the subject is glanced at and then the email is archived. This makes it very easy to miss an important alert.

It is therefore worth spending some time tuning your alerts, starting off with a bias towards alerting too much, then regularly reviewing over the first few weeks of operation to tune out alerts for events which you do not need to know about.

Our sample Monit configuration from the beginning of this chapter included the following alert definition:

set alert <%= email %> but not on {instance, pid}

This is used when the minimise alerts flag is set on the node definition.

This means that, globally, alerts will be sent for all events except instance and pid changes. Whilst I strongly recommend you tune your own configuration, in my experience the above is often sufficient to minimise the amount of alert traffic while ensuring critical events are still sent.
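In addition to the global set alert statements, Monit also lets you attach alert rules to an individual check, which is often the easiest way to tune out noise service by service. As a sketch (the address and the list of events are illustrative, not taken from the example template), our Nginx check could be limited to only the events which genuinely need attention:

check process nginx with pidfile /var/run/nginx.pid
  start program = "/etc/init.d/nginx start"
  stop program = "/etc/init.d/nginx stop"
  if 15 restarts within 15 cycles then timeout
  alert ops@example.com only on { timeout, nonexist, connection }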

The Monit documentation on managing alerts is excellent and well worth reading:

http://mmonit.com/monit/documentation/monit.html#alert_messages

Serving the web interface with Nginx

Using the example template

Monit provides a web interface which displays the current status of all monitored processes and allows an administrator to restart each of them, as well as allowing manual starting of processes for which automatic restart has failed.

In our example configuration at the start of this chapter it is configured to run on port 2812 and be accessible only to localhost.

It’s possible to have Monit serve the web interface to other IPs directly, however I prefer to have all web traffic served via Nginx.

To do this you would add an Nginx virtual host similar to the below:

server {
    listen 80;
    server_name monit.example.com;
    location / {
        proxy_pass http://127.0.0.1:2812;
        proxy_set_header Host $host;
    }
}

This simply means: take all requests for monit.example.com and send them to 127.0.0.1 on port 2812.

The example template provides a handy shortcut for this in the nginx-tlq recipe. In the default recipe you’ll see the following:

# Monit pass through
if @node[:monit_address]
  template "/etc/nginx/sites-enabled/monit" do
    owner "deploy"
    group "deploy"
    mode "0644"
    source "monit_interface.erb"
  end
end

Where monit_interface.erb simply contains:

server {
    listen 80;
    server_name <%= @node[:monit_address] %>;
    location / {
        proxy_pass http://127.0.0.1:2812;
        proxy_set_header Host $host;
    }
}

This means that if you’re using the monit-tlq recipe and include the “monit_address” attribute in your node definition, for example:

1 "monit_address" : "monit.example.com"

Then a virtualhost entry to forward requests for monit.example.com to the Monit interface on 127.0.0.1 will be included.

Serving multiple Monit interfaces from one Nginx instance

An additional benefit of this is that you can use a single Nginx instance to serve the Monit interface from multiple machines. For example you could establish a convention that machine-name.monit.example.com always points to the Monit interface for machine-name. You could then specify in your Monit config files that the admin interface on 2812 is only accessible to the private (internal) IP address of the Nginx instance you’re using to serve Monit interfaces.
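On the machine being monitored, this means adjusting the web interface section of monitrc so that it listens on the private interface and only accepts connections from the proxying Nginx server. A sketch, assuming the monitored machine's private IP is 168.1.1.5 and the Nginx server's private IP is 168.1.1.4 (the latter is hypothetical), might look like this:

# Sketch: bind the web interface to the private interface rather than localhost
# and only allow the proxying Nginx server (hypothetical IP) plus basic auth
set httpd port 2812 and
    use address 168.1.1.5
    allow localhost
    allow 168.1.1.4
    allow admin:a_strong_password

Remember that the firewall on the monitored machine also needs to allow connections to port 2812 from that private IP.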

If the private IP address of the additional server you’re monitoring was 168.1.1.5 then your Nginx virtualhost would look like this:

server {
    listen 80;
    server_name machine-name.monit.example.com;
    location / {
        proxy_pass http://127.0.0.1:2812;
        proxy_set_header Host $host;
    }
}

server {
    listen 80;
    server_name machine2-name.monit.example.com;
    location / {
        proxy_pass http://168.1.1.5:2812;
        proxy_set_header Host $host;
    }
}

The above definition shows Nginx serving both the Monit interface for the machine it runs on (on 127.0.0.1) and the one for the second server on 168.1.1.5.

In the next chapter we’ll look at how to ensure Monit keeps running.

Upstart

Overview

Upstart is the primary utility in Ubuntu for managing startup processes and ensuring key system processes remain started. In this chapter we’ll look at both using Upstart to ensure Monit is always running as well as how we manage the starting and stopping of services in general on an Ubuntu system.

What monitors Monit?

A common query at this stage is: what monitors Monit? If we’re using Monit to make sure everything starts up correctly when the system starts and reloads on failure, how do we make sure Monit itself starts and reloads?

The current tool for doing this on Ubuntu is called Upstart. Upstart takes care of processes which we need to run on boot and allows for them to be re-spawned in the event that they fail.

From this simple description you could be forgiven for thinking that Upstart could be used instead of Monit entirely. This is not, however, the case. Upstart provides simple pid monitoring, i.e. if the pid for a process it monitors no longer exists, it will try to re-spawn it.

Therefore if the process still exists but is stuck, unresponsive or consuming unreasonable amounts of resources, Upstart will not intervene. Additionally it doesn’t provide the alerting functionality which is so integral to a good monitoring configuration.

Upstart Services

Processes which are managed by scripts located in /etc/init are referred to as services. If we look at the monit-tlq recipe, we can see it adds the contents of the template monit-upstart.conf.erb to /etc/init/monit.conf:

# after adding this file run
# initctl reload-configuration
#
# You can manually start and stop Monit like this:
#
# start monit
# stop monit
#

description "Monit service manager"

limit core unlimited unlimited

start on runlevel [2345]
stop on runlevel [!2345]

expect daemon
respawn

exec /usr/bin/monit -c /etc/monit/monitrc

pre-stop exec /usr/bin/monit -c /etc/monit/monitrc quit

This creates an entry for Monit in /etc/init which defines:

· when the process should be automatically started (run levels 2, 3, 4 or 5)

· when it should be automatically stopped (run level no longer 2, 3, 4 or 5)

· that the type of process is expected to be a daemon (background) process

· that the process should be respawned (i.e. started again) if it’s found to no longer exist

It then goes on to define the command for starting Monit:

exec /usr/bin/monit -c /etc/monit/monitrc

This is the path to the Monit executable along with the -c flag, which allows us to specify the config file it should use.

It then defines the command which should be run on stop, i.e. before the process is killed, in this case:

exec /usr/bin/monit -c /etc/monit/monitrc quit

This is the command to gracefully shut down Monit.
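With this job in place, the usual Upstart tooling can be used to inspect and control Monit itself; the process id in the output below is illustrative:

$ initctl status monit
monit start/running, process 1234
$ sudo stop monit
$ sudo start monit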

In the next chapter we’ll look at the steps required to fork my basic Monit configurations and use them in your template configuration.

Forking My Monit Configurations

Overview

In the previous chapter we saw how powerful Monit can be for monitoring services and looked at some basic monitoring configurations. In this chapter we’ll look further at why it’s important to create custom Monit configurations and examine the step by step process for forking my basic Monit configurations on Github and applying the forked version to an existing VPS.

Even if you don’t plan on forking the Monit configurations, this chapter will serve as a simple reference for forking any cookbook or updating a third party cookbook to a newer version.

Why is creating your own Monit configurations so important?

We’ve already touched on the dangers of Monit alerts being too frequent and consequently being treated as spam. The key takeaway should be that there is no one size fits all monitoring configuration. If the system in question provides business critical services, something as small as a pid change might be important; if on the other hand you’re providing a free, advertising based service to a small user base, some level of downtime may be acceptable.

Your monitoring configurations must be defined based on the balance of downtime to personal inconvenience which is acceptable for your service.

I have systems which have now been running for several years with minimal manual intervention. This was achieved by carefully tuning the Monit configurations over the first months to ensure that sources of predictable failure were eliminated or carefully handled by Monit and that suitable alerts would be properly delivered for anything Monit would not be able to deal with.

For these reasons the sample configurations provided are best used as a starting point rather than directly.

Step by Step

For the purposes of this guide I’ll assume that you’ve already run berks install once to pull in my simple Monit configs and have potentially already applied this recipe to your VPS. We’ll therefore cover replacing my Monit configurations with your new ones, as well as how to make changes to your configurations and then re-apply them.

To begin with, fork and clone the repository https://github.com/TalkingQuickly/monit_configs-tlq. Or, if you prefer starting from scratch, you can create a new repository with the standard chef cookbook structure (see 5.2) and commit this to a fresh git repository.
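If you’ve gone the forking route, cloning your fork locally is just the usual git workflow (your_username is a placeholder for your GitHub account):

git clone git@github.com:your_username/monit_configs-tlq.git
cd monit_configs-tlq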

If you’ve forked my repository, your metadata.rb will look something like this:

1 name "monit_configs-tlq"

2 maintainer "Ben Dixon"

3 maintainer_email "ben@hillsbede.co.uk"

4 description "Monit configs for server components"

5 version "0.0.1"

6

7 recipe "monit_configs-tlq::memcached", "Monit config for memcached"

8 recipe "monit_configs-tlq::mongo", "Monit config for mongodb"

9 recipe "monit_configs-tlq::mysql-server", "Monit config for mysql server"

10 recipe "monit_configs-tlq::nginx", "Monit config for nginx"

11 recipe "monit_configs-tlq::redis-server", "Monit config for redis server"

12

13 supports "ubuntu"

Begin by updating the basic metadata about the cookbook, in particular the name. Remember that when we add this new cookbook to our Berksfile, the name we specify here will be the name we refer to it by.

The second section of the file contains a list of all the recipes we will define in this cookbook.

For example:

1 recipe "monit_configs-tlq::memcached", "Monit config for memcached"

This defines a recipe called monit_configs-tlq::memcached, which therefore expects a file called memcached.rb to exist within cookbook_root/recipes, defining the recipe.

Any templates associated with the recipe will be expected to be in templates/default - remember the subdirectories within templates refer to distributions, not specific recipes.
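As an illustration of the shape of these recipes (this is a sketch rather than the exact contents of the cookbook), a recipe such as recipes/memcached.rb generally just renders a Monit config template into /etc/monit/conf.d:

# recipes/memcached.rb - illustrative sketch
# Renders templates/default/memcached.conf.erb into Monit's conf.d directory
template "/etc/monit/conf.d/memcached.conf" do
  owner "root"
  group "root"
  mode "0644"
  source "memcached.conf.erb"
end

Monit will pick the new file up the next time it is reloaded or restarted.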

You can now modify the existing recipes or add your own; this might be as simple as tweaking the scenarios in which alerts are or are not sent for a particular service, or adding specific monitoring for file system changes.

Finally you’ll want to bump the version number in metadata.rb so that Berkshelf recognises that a new version is available.

Once you’ve completed your new cookbook, you can modify your Berksfile to include it. So if, for example, I’d forked my own repository to a new cookbook called monit_configs_strict-tlq, with a GitHub repository available at:

git@github.com:TalkingQuickly/monit_configs_strict-tlq.git

I would update my Berksfile to include:

cookbook 'monit_configs_strict-tlq', git: 'git@github.com:TalkingQuickly/monit_configs_strict-tlq.git'

We can then run:

berks install

If it’s a newly added cookbook or:

berks update monit_configs_strict-tlq

If I’d already installed the cookbook but had since made changes to it and pushed them to the git repository.

Both will update Berksfile.lock and download our new cookbook to our chef repository.

We can then add the new recipes to a node or role’s run_list and, when knife solo cook is next run, the updated cookbooks will be uploaded and applied to the server.
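For example (the recipe chosen here is purely illustrative and depends on which services run on that node), the node definition might gain an entry like:

"run_list": [
  "recipe[monit_configs_strict-tlq::nginx]"
]

followed by re-running knife solo cook against that node, e.g. knife solo cook deploy@yourserver.example.com (the server address is a placeholder).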

In the next chapter we’ll look at installing Nginx as our web server.