Programming Nagios - Learning Nagios 4 (2014)

Chapter 11. Programming Nagios

The previous chapter provided information about monitoring Microsoft Windows machines and several approaches for more advanced monitoring using Nagios.

This chapter focuses on extending Nagios using various programming languages. One of the key features of Nagios is its extensibility. There are multiple ways in which Nagios can be tailored to suit your needs. It is also possible to integrate Nagios tightly with your applications and benefit from a powerful mechanism to schedule and perform checks.

In this chapter, we will cover the following topics:

· Understanding what aspects of Nagios can be customized

· Writing plugins that perform active checks

· Monitoring cloud environments (VMware and Amazon Web Services machines)

· Creating commands to send custom notifications

· Managing Nagios and reading its status information

· Using passive checks for long-running tests

Introducing Nagios customizations

The most exciting aspect of using Nagios is the ability to combine your programming skills with the powerful engine offered by the Nagios daemon. Your own pieces of code can be plugged into the Nagios daemon, and they can communicate with it in various ways.

One of the best things about Nagios is that, in most cases, it does not force you to use a specific language. Whether the language of your choice is PHP, Perl, Tcl, Python, Ruby, or Java, you can easily use any one with Nagios. This is a fundamental difference between Nagios and the majority of monitoring applications. Usually, an application can only be extended in the language in which it is written.

Our code can cooperate with Nagios in various ways, for example, by implementing commands, by sending information to the Nagios daemon, and so on. The first case means that we can create a script or executable that will be run by Nagios, and its output and exit code will be processed by Nagios. Running external commands is used to perform active checks, send notifications, and trigger event handlers. By using the macro substitutions and variables available in the current context (see http://nagios.sourceforge.net/docs/nagioscore/4/en/macrolist.html), we will be able to pass down all of the information that's needed for the command to do its job.

The alternative method of extending Nagios is to send information to it from other applications. The first option is that external applications (such as a web or typical user interface) allow the configuration and management of the Nagios system. This is done by sending control commands to Nagios over its external command pipe. Because this involves opening and writing to a named pipe, which works just like a file, it can be done in any programming language that handles I/O.
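
As a rough illustration of how little is needed to send a control command, the following Python sketch formats one external command line and writes it to the command file. The path /var/nagios/rw/nagios.cmd and the helper names format_command and send_command are assumptions made for this example; adjust them to match your installation.

```python
import time

# Assumed location of the external command file; adjust to your installation.
COMMAND_FILE = "/var/nagios/rw/nagios.cmd"


def format_command(name, *args, now=None):
    """Build one external command line: "[timestamp] NAME;arg1;arg2"."""
    timestamp = int(time.time() if now is None else now)
    fields = [name] + [str(arg) for arg in args]
    return "[%d] %s\n" % (timestamp, ";".join(fields))


def send_command(line, command_file=COMMAND_FILE):
    """Write the command to the command file; Nagios reads it on the other end."""
    with open(command_file, "w") as pipe:
        pipe.write(line)
```

For example, send_command(format_command("DISABLE_HOST_NOTIFICATIONS", "web01")) would ask Nagios to stop sending notifications for a host named web01 (a hypothetical host name).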

Yet another option is for other applications, or a system scheduling mechanism such as cron, to be responsible for running the checks. In this case, a test is carried out on its own and the application itself is responsible for sending the results back to Nagios. Results can be sent directly via the Nagios external command pipe or over the network via the Nagios Service Check Acceptor (NSCA) protocol. Luckily, even sending over a network with NSCA is simple, as results can be sent directly to the standard input of the send_nsca command.
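
A minimal Python sketch of reporting a result through send_nsca might look as follows. It assumes the default tab-separated input format of send_nsca (host, service, return code, output) and a configuration file at /etc/nagios/send_nsca.cfg; the function names and the configuration path are hypothetical.

```python
import subprocess


def format_service_result(host, service, return_code, output):
    """Build one send_nsca input line (tab-separated, newline-terminated)."""
    return "%s\t%s\t%d\t%s\n" % (host, service, return_code, output)


def submit(line, nsca_host, config="/etc/nagios/send_nsca.cfg"):
    """Pipe the result to send_nsca and return its exit code."""
    proc = subprocess.Popen(
        ["send_nsca", "-H", nsca_host, "-c", config],
        stdin=subprocess.PIPE,
    )
    proc.communicate(line.encode("utf-8"))
    return proc.returncode
```

A cron job could call format_service_result after finishing its work and pass the line to submit together with the address of the Nagios server.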

Your software can also get information related to Nagios easily. All that's needed is to monitor Nagios' status.dat file for changes and read it, as it lists all monitored objects along with their current soft and hard states. The format of the file is quite simple, and the task of writing a parser for it is quite easy. The file format and how to parse its content is described later in this chapter.
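
To give an idea of how simple such a parser can be, here is a hedged Python sketch. It assumes the usual layout of status.dat, where each block starts with a line such as "hoststatus {", contains key=value lines, and ends with a line containing only "}"; the function name parse_status is made up for this example.

```python
def parse_status(text):
    """Return a list of (block_type, {key: value}) tuples."""
    blocks = []
    block_type = None
    fields = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if block_type is None:
            # outside a block: only a "name {" line is meaningful
            if line.endswith("{"):
                block_type = line[:-1].strip()
                fields = {}
        elif line == "}":
            # end of the current block
            blocks.append((block_type, fields))
            block_type = None
        elif "=" in line:
            # split on the first "=" only, as values may contain "="
            key, value = line.split("=", 1)
            fields[key] = value
    return blocks
```

Filtering the returned list for blocks of type hoststatus or servicestatus gives the current state of each host or service.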

There are ready-to-use Nagios status file parsers for multiple languages, for example, Pynag for Python (available at http://pynag.org/) and nagios_analyzer for Ruby (available at https://github.com/jbbarth/nagios_analyzer). There are also multiple ready-to-use PHP solutions to parse statuses, for example, Naupy (available at http://sourceforge.net/projects/naupy/).

Over the course of this chapter, we will use various programming languages, such as PHP (http://www.php.net/), Ruby (http://www.ruby-lang.org/), Python (http://www.python.org/), Perl (http://www.perl.org/), Tcl (http://www.tcl.tk/), Java (http://www.oracle.com/technetwork/java/), and C/C++. Even though many people do not know all of these languages, the code will only use the basic functionality of each language so that it remains understandable to readers who are new to it.

Assuming that you need to write a piece of code on your own, the first thing you should start with is choosing the programming language. If you already know a language that would fit this task, stick to it. Otherwise, there are a few candidates to consider. The languages I will recommend are Ruby, Python, or Tcl.

Ruby is a very popular dynamic language that has a large variety of uses. It has a very natural syntax that makes the code easy to read. Python is another popular dynamic language, and its syntax makes it easy to write check commands. Both languages have a wide range of libraries that can be used to interact with other software.

Tcl, on the other hand, is less popular, but a very powerful language in its own way. This is usually my first choice for a programming language. It features a very simple, but very powerful syntax. Tcl is tightly integrated with an event loop that is handy when programming event-driven applications. This is perfectly suitable for communicating with the Nagios server. It also comes with a huge set of protocols and libraries to use, especially the ActiveTcl distribution from ActiveState (http://www.activestate.com/). Throughout this book, Tcl examples will be using the packages available with ActiveTcl distributions. If your Tcl interpreter does not have one or more of these packages, it is recommended that you install the ActiveTcl distribution.

People who are only familiar with PHP can also feel safe about it. It's possible to create various commands and passive check scripts in this language. It is also possible to integrate Nagios with error reporting for your web applications.

Nagios is known to integrate very well with Perl. This chapter shows how both Perl and other languages can be easily integrated with Nagios, so that readers familiar with other languages will also benefit from it and will not need to learn Perl just for the purpose of extending Nagios.

Even though we'll focus only on a few languages, almost any technology can be used. Nagios mainly uses basic functionality for interaction: exit codes, reading a program's output, and passing commands via a pipe. Also, all of its interaction is in text mode, and both active check output and the command pipe use very basic formats.

Programming in C with libnagios

Nagios 4 comes with libnagios. It is a C library that provides various functionality that is used in Nagios, and which can also be reused in other programs. This section will talk about the library, how to install it, and how to use it in your programs. If you are not interested in the development of Nagios-related applications in C, you may skip this section.

The functions in the library also make it easier to create software that interacts with Nagios, such as plugins, event handlers, or programs that send passive check results. The library is built as a part of the Nagios compilation and is created as a statically linked library only (please visit http://en.wikipedia.org/wiki/Static_library for more details). This means that the library is included in the application that uses it, so we do not have to look for it in a shared libraries directory such as /usr/lib.

In order to install the library, go to the source directory of Nagios and run the following commands:

# make install-lib

# make install-headers

For an installation performed according to the steps given in Chapter 2, Installing Nagios 4, the library will be copied to /opt/nagios/lib and the header files will be placed in /opt/nagios/include.

The libnagios library provides multiple platform-independent functions and algorithms that are used throughout Nagios and can be reused.

As an example of using libnagios, let's write a simple program that communicates with a query handler that was introduced in Nagios 4. It will query for information about core scheduling queue:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <libnagios.h>

int main(void)
{
    // socket descriptor
    int sd;
    // buffer for reading output
    char buf[16384];
    // number of bytes read, start of current value and iterator
    int bufsize = 0, last = 0, i;

    // open socket to Nagios query handler
    sd = nsock_unix("/var/nagios/rw/nagios.qh", NSOCK_TCP | NSOCK_CONNECT);
    if (sd < 0)
    {
        printf("Unable to connect to Nagios socket\n");
        exit(3);
    }

    // send "squeuestats" query to core handler
    nsock_printf_nul(sd, "@core squeuestats");

    // read result until \0 is received or the buffer is full
    while (bufsize < (int)sizeof(buf) - 1)
    {
        // stop on end of file or read error
        if (read(sd, buf + bufsize, 1) != 1)
            break;
        // check if this is end of response
        if (buf[bufsize] == '\0')
            break;
        bufsize++;
    }
    buf[bufsize] = 0;

    // print all values separated by semi-colon
    for (i = 0 ; i < bufsize ; i++)
    {
        if (buf[i] == ';')
        {
            buf[i] = '\0';
            printf("%s\n", buf + last);
            last = i + 1;
        }
    }
    // print last value
    printf("%s\n", buf + last);

    // close socket
    close(sd);
    return 0;
}

The program sends a command to the query handler, reads the response, and then prints each individual result that is returned.

To compile the program, simply run the following command:

# gcc -o query_squeue query_squeue.c \
    -I/opt/nagios/include/nagios/lib \
    -L/opt/nagios/lib -lnagios

Here, query_squeue is the output binary name and query_squeue.c is the name of the source code file. The paths for the -I and -L options are valid for installations performed according to the steps given in Chapter 2, Installing Nagios 4, where the library is copied to /opt/nagios/lib and the header files are placed in /opt/nagios/include. If you have installed Nagios from your Linux distribution's packages or to another location, the paths may be different.

Once it is built successfully, we can now run it using the following command:

# ./query_squeue

If the command fails to connect to the socket, please make sure that it is run by a user that has write access to the /var/nagios/rw/nagios.qh file, for example, the nagios user, a member of the nagioscmd group, or root.

After running, the code will print a result similar to the following output:

SERVICE_CHECK=22
COMMAND_CHECK=0
LOG_ROTATION=1
PROGRAM_SHUTDOWN=0
PROGRAM_RESTART=0
CHECK_REAPER=1
ORPHAN_CHECK=1
RETENTION_SAVE=1
STATUS_SAVE=1
SCHEDULED_DOWNTIME=0
SFRESHNESS_CHECK=1
EXPIRE_DOWNTIME=0
HOST_CHECK=4
HFRESHNESS_CHECK=0
RESCHEDULE_CHECKS=0
EXPIRE_COMMENT=0
CHECK_PROGRAM_UPDATE=1
SLEEP=0
USER_FUNCTION=0
SQUEUE_ENTRIES=33

The code communicates with the Nagios query handler that can be used for many interesting things such as receiving information about host and/or service check results. The query handler and its possible uses are described in more detail in Chapter 12, Using the Query Handler.
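
The same exchange does not strictly require libnagios. As a sketch under the assumption that the query handler accepts a NUL-terminated request on its Unix socket and replies with a NUL-terminated list of KEY=value pairs separated by semicolons (as the C example above suggests), a Python version could look like this; the function names are hypothetical.

```python
import socket


def parse_response(data):
    """Split a NUL-terminated "KEY=value;KEY=value" reply into a dict."""
    text = data.split(b"\0", 1)[0].decode("ascii")
    result = {}
    for item in text.split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            result[key] = value
    return result


def query(path, request):
    """Send one request to the query handler and return the raw reply."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        sock.connect(path)
        sock.sendall(request.encode("ascii") + b"\0")
        chunks = []
        while True:
            chunk = sock.recv(4096)
            if not chunk:
                break  # socket closed by Nagios
            chunks.append(chunk)
            if b"\0" in chunk:
                break  # end of response
        return b"".join(chunks)
    finally:
        sock.close()
```

Calling parse_response(query("/var/nagios/rw/nagios.qh", "@core squeuestats")) would then return the scheduling queue statistics as a dictionary.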

Creating custom active checks

One of the most common areas where Nagios can be suited to fit your needs is that of active checks. These are the checks that are scheduled and run by the Nagios daemon. This functionality is described in more detail in Chapter 2, Installing Nagios 4.

Nagios has a project that ships the commonly-used plugins and comes with a large variety of checks that can be performed. Before thinking about writing anything on your own, it is best to check for the standard plugins (described in detail in Chapter 4, Using the Nagios Plugins).

Tip

The Nagios Exchange (http://exchange.nagios.org) website contains multiple ready-to-use plugins for performing active checks. It is recommended that you check whether somebody has already written a similar plugin for your needs.

The reason for this is that even though active checks are quite easy to implement, sometimes a complete implementation that handles errors and command line options parsing is not very easy to create. Typically, proper error handling can take a lot of time to implement. Another thing is that plugins that have already existed for some time have often been thoroughly tested by others. Typical errors would have been already identified and fixed; and sometimes the plugins have been tested in a larger environment, under a wider variety of conditions. Writing check plugins on your own should be preceded by an investigation to find out whether anybody has encountered and solved a similar problem.

Active check commands are very simple to implement. They simply require a plugin to write one or more lines of check output to the standard output stream and to return one of the predefined exit codes: OK (code 0), WARNING (code 1), CRITICAL (code 2), or UNKNOWN (code 3). How active check plugins work is described in more detail at the beginning of Chapter 4, Using the Nagios Plugins.
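
The contract between Nagios and a plugin can be sketched in a few lines of Python. The helper names (evaluate, report, finish) and the simple "higher is worse" threshold logic are assumptions made for illustration only; real plugins often need more elaborate rules.

```python
import sys

# Exit codes that Nagios maps to states.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATE_NAMES = {OK: "OK", WARNING: "WARNING",
               CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate(value, warn, crit):
    """Map a measured value to a state using simple upper thresholds."""
    if value > crit:
        return CRITICAL
    if value > warn:
        return WARNING
    return OK


def report(label, state, message):
    """Build the single line of check output that Nagios will display."""
    return "%s %s: %s" % (label, STATE_NAMES[state], message)


def finish(label, state, message):
    """Print the output line and exit with the matching code."""
    print(report(label, state, message))
    sys.exit(state)
```

A plugin would end by calling finish with the state it computed, for example finish("check_example", evaluate(value, 60, 120), "%d units" % value).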

Testing the correctness of the MySQL database

Let's start with a simple plugin that performs active checks. We'll implement a simple check that connects to a MySQL database and verifies whether the specified tables are structurally correct. It will accept connection information from the command line as a series of arguments. We'll write the script in Python.

From a technical point of view, the check is quite simple—all that's needed is to connect to a server, choose the database, and run the CHECK TABLE (https://dev.mysql.com/doc/refman/5.7/en/check-table.html) command in SQL.

The plugin requires installation of the MySQLdb package for Python (http://sourceforge.net/projects/mysql-python/). We will also need a working MySQL database that we can connect to for testing purposes. It is a good idea to install the MySQL server on your local machine and set up a dummy database with tables to test.

In order to set up a MySQL database server on Ubuntu Linux, install the mysql-server package using the following command:

# apt-get install mysql-server

In Red Hat and Fedora Linux, the package is called mysql-server and the command to install it is as follows:

# yum install mysql-server

After that, you will be able to connect to the database locally as root, either without a password or with the password supplied during database installation.

If you do not have any other database to run the script against, you can use mysql as the database name, as this is a database that all instances of MySQL have.

The following is a sample script that performs the test. It needs to be run with the hostname, username, password, database name, and the list of tables to be checked as arguments. The table names should be separated by commas.

#!/usr/bin/env python
import MySQLdb
import sys

# only perform check if we're loaded as main script
if __name__ == '__main__':
    dbhost = sys.argv[1]
    dbuser = sys.argv[2]
    dbpass = sys.argv[3]
    dbname = sys.argv[4]
    tables = sys.argv[5]
    errors = []
    count = 0

    # connect to the database
    conn = MySQLdb.connect(dbhost, dbuser, dbpass, dbname)
    cursor = conn.cursor()

    # perform check for all tables in the table list
    # (splits the table names by ",")
    for table in tables.split(","):
        cursor.execute("CHECK TABLE %s" % (table))
        row = cursor.fetchone()
        count = count + 1
        # the fourth column (Msg_text) is "OK" for a healthy table
        if row[3] != "OK":
            errors.append(table)

    # handle output: if any errors occurred, report 2, otherwise 0
    if len(errors) == 0:
        print("check_mysql_table: OK %d table(s) checked" % count)
        sys.exit(0)
    else:
        print("check_mysql_table: CRITICAL: errors in %s" %
              (", ".join(errors)))
        sys.exit(2)

The code consists of four parts: initialization, argument parsing, connection, and checking each table. The first part consists of the import statements that load various required modules and make sure that the code is run from the command line. In the second part, the arguments passed by the user are mapped to the various variables. After that, a connection to the database is made. If the connection succeeds, for each table specified when running the command, a CHECK TABLE command (http://dev.mysql.com/doc/refman/5.0/en/check-table.html) will be run. This makes MySQL verify that the table structure is correct.

To use it, let's run it by specifying the connection information and tables tbl1, tbl2, and tbl3:

root@ubuntu:~# /opt/nagios/plugins/check_mysql_table.py \

127.0.0.1 mysqluser secret1 databasename tbl1,tbl2,tbl3

check_mysql_table: OK 3 table(s) checked

As you can see, the script seems quite easy and usable.

Monitoring local time with a time server

The next task is to create a check plugin that compares the local time with the time on a remote machine and issues a warning or critical state if the difference exceeds a specified number. We will use Tcl for this job.

We'll use Tcl's time package (http://tcllib.sourceforge.net/doc/ntp_time.html) to communicate with remote machines. This package comes bundled with ActiveTcl and is a part of the tcllib package available in many Linux distributions.

If you do not have the tcllib and/or time packages, you will need to install them. On Ubuntu Linux, the package is called tcllib and the following command installs it:

apt-get install tcllib

The script will accept the hostname and the warning and critical thresholds in number of seconds. The script will use these to decide on the exit status. It will also output the difference in number of seconds, for informational purposes.

The following is a script to perform a check of the time on a remote machine:

#!/usr/bin/env tclsh
package require time

# retrieve arguments for the script
set host [lindex $argv 0]
set warndiff [lindex $argv 1]
set critdiff [lindex $argv 2]

# retrieve times
set handle [time::gettime $host]
set remotetime [time::unixtime $handle]
time::cleanup $handle
set localtime [clock seconds]

# calculate difference
set diff [expr {abs($remotetime - $localtime)}]

# decide which exit code should be used
if {$diff > $critdiff} {
    puts "check_time CRITICAL: $diff seconds difference"
    exit 2
} elseif {$diff > $warndiff} {
    puts "check_time WARNING: $diff seconds difference"
    exit 1
} else {
    puts "check_time OK: $diff seconds difference"
    exit 0
}

This command is split into three parts: initializing, parsing arguments, and checking status. The first part loads the time package and the second maps the arguments to variables. After that, a connection to the remote host is made, the time on the remote machine is received, and this remote time is compared with the local time. Based on what the difference is, the command returns either a CRITICAL, WARNING, or OK status.

And now, let's run it against a sample machine using the following command:

root@ubuntu:~# /opt/nagios/plugins/check_time.tcl \
    ntp2a.mcc.ac.uk 60 120
check_time WARNING: 76 seconds difference

As shown in the preceding output, the script works properly and returns a WARNING state as the difference is higher than 60, but lower than 120.

Another example is a check that uses libnagios to monitor the Nagios 4 query handler through its @echo handler. This is a query handler that returns whatever is sent to it and is meant mainly for testing the query handler.

The following C code can be used to monitor whether the Nagios query handler is working properly:

#include <string.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <libnagios.h>

int main(int argc, char *argv[])
{
    // socket descriptor
    int sd;
    // buffer for reading output
    char buf[16384];
    // size of the test message
    int test_message_size;
    char *test_message;
    char *qh;

    // get arguments from command line
    if (argc != 3)
    {
        printf("Usage: %s path/to/nagios.qh message\n", argv[0]);
        // invalid arguments are reported as UNKNOWN
        exit(3);
    }
    qh = argv[1];
    test_message = argv[2];
    test_message_size = strlen(test_message);

    // make sure the reply will fit in the buffer
    if (test_message_size > (int)sizeof(buf))
    {
        printf("check_qh: Test message too long\n");
        exit(3);
    }

    // open socket to Nagios query handler
    sd = nsock_unix(qh, NSOCK_TCP | NSOCK_CONNECT);
    if (sd < 0)
    {
        printf("check_qh: Unable to connect to Nagios socket %s\n", qh);
        exit(3);
    }

    // send test message to the @echo handler
    nsock_printf_nul(sd, "@echo %s", test_message);

    // read back as many bytes as were sent
    if (read(sd, buf, test_message_size) != test_message_size)
    {
        printf("check_qh: Invalid returned message size\n");
        exit(2);
    }

    // compare the reply with the original message
    if (memcmp(buf, test_message, test_message_size) != 0)
    {
        printf("check_qh: Invalid message returned\n");
        exit(2);
    }
    else
    {
        printf("check_qh: Correct message received\n");
        exit(0);
    }
}

The code connects to the Nagios query handler, sends the specified message to the @echo query handler, and reads back the same number of bytes. The query handler functionality and how to use it is described in more detail in Chapter 12, Using the Query Handler.

If the message is not the same, or an invalid number of bytes is read, the program returns an error. If Nagios does not return a sufficient number of bytes, one of two things happens: either Nagios closes the socket, which causes read() to return a smaller number of bytes, or the socket stays open and the plugin keeps waiting until Nagios detects that the active check's timeout has elapsed and treats the check as failed.

Writing plugins correctly

We have already created a few sample scripts, and they're working. So, it is possible to use them from Nagios. However, these checks are very far from being complete. They lack error control, parsing, and argument verification.

It is recommended that you write all the commands in a more user-friendly way. The reason is that in most cases, after some time, someone else will take over using and/or maintaining your custom check commands. You might also come back to your own code after a year of working on completely different things. In such cases, having a check command that is user-friendly, has proper comments in the code, and allows debugging will save a lot of time. The standard Nagios plugins guidelines (available at https://nagios-plugins.org/doc/guidelines.html) document good practices for developers of the standard Nagios plugins package. While some parts may be specific to the C language, they are worth reading when developing in other languages as well.

The first thing that should be done is to provide proper handling of arguments—this means using functionality such as the getopt package for Python (http://www.python.org/doc/2.5/lib/module-getopt.html) or the cmdline package for Tcl (http://tcllib.sourceforge.net/doc/cmdline.html) to parse the arguments. This way, functionalities like the --help parameter will work properly and in a more user-friendly way. The majority of programming languages provide such libraries, and it is always recommended to use them.

Another thing worth considering is proper error handling. If connectivity to a remote machine is not possible, the check command should exit with a critical or unknown status. In addition, all other pieces of the code should be wrapped to catch errors depending on whether an error suggests a failure in the service being checked, or is due to a problem outside of a checked service.

Using the example of the first check plugin, we can redesign the beginning of the script to parse the arguments correctly. The reworked plugin sets the values of all of the parameters to their default value and then parses the options and corresponding values based on what the argument is. The script also allows specification of the --verbose flag to tell the plugin that it should report more information on what it is currently doing.

Finally, the connection is wrapped in the try ... except Python statements to catch exceptions when connecting to the MySQL server. This statement is used to detect errors when running the commands between try and except. In this case, if a connection to the database could not be established, the script will handle this and report an error, instead of returning a Python error report.

It's also a good practice to wrap the entire script in a try ... except statement so that all potential errors or unhandled situations are sent to Nagios as a general error. In addition, if the --verbose flag is specified, more information should be displayed. This should ease the debugging of any potential error.
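
One possible shape for such a wrapper is sketched below. It assumes that the actual check is a function returning an (exit code, output) pair; the name run_check and this calling convention are illustrative choices, not part of any Nagios API.

```python
import traceback

UNKNOWN = 3


def run_check(check, verbose=False):
    """Run check() and return (exit_code, output); never raises."""
    try:
        return check()
    except Exception as exc:
        # anything unexpected is reported to Nagios as UNKNOWN
        output = "UNKNOWN: unhandled error: %s" % exc
        if verbose:
            # the full traceback helps when debugging by hand
            output += "\n" + traceback.format_exc()
        return UNKNOWN, output
```

The main part of the script then only needs to print the returned output and call sys.exit() with the returned code.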

The following code extract shows the rewritten beginning of a Python script that uses getopt to parse arguments and has used try ... except to handle errors in connectivity:

# getopt must be imported at the top of the script,
# alongside MySQLdb and sys

# only perform check if we're loaded as main script
if __name__ == '__main__':
    dbhost = 'localhost'
    dbuser = ''
    dbpass = ''
    dbname = ''
    tables = ''
    verbose = False

    try:
        options, args = getopt.getopt(sys.argv[1:],
            "hvH:u:p:d:t:", ["help", "verbose", "hostname=",
            "username=", "password=", "dbname=", "tables="]
        )
    except getopt.GetoptError:
        usage()
        sys.exit(3)

    for name, value in options:
        if name in ("-h", "--help"):
            usage()
            sys.exit(0)
        if name in ("-H", "--hostname"):
            dbhost = value
        if name in ("-u", "--username"):
            dbuser = value
        if name in ("-p", "--password"):
            dbpass = value
        if name in ("-d", "--dbname"):
            dbname = value
        if name in ("-v", "--verbose"):
            verbose = True
        if name in ("-t", "--tables"):
            tables = value

    if verbose:
        print(" Connecting to %s@%s (database %s)" %
              (dbuser, dbhost, dbname))

    try:
        conn = MySQLdb.connect(dbhost, dbuser, dbpass, dbname)
    except Exception:
        print("Unable to connect to database")
        sys.exit(3)

This code also requires the defining of a usage function that prints the usage syntax. This has been left out of our example and is left as an exercise for the reader.

Another change would be to add the reporting of what is currently being done if the --verbose flag is passed. This helps to determine whether the script is idle or is currently trying to check specific table contents.

Similarly, for Tcl, we should use the cmdline package to parse arguments. It's also a good idea to check if all arguments have been specified correctly:

package require cmdline

array set opt [cmdline::getoptions argv {
    {host.arg "127.0.0.1" "Host to connect to"}
    {warntime.arg "300" "Warning threshold (seconds)"}
    {crittime.arg "600" "Critical threshold (seconds)"}
}]

# keep the variable names used by the rest of the script
set host $opt(host)
set warndiff $opt(warntime)
set critdiff $opt(crittime)

if {![string is integer -strict $warndiff] || $warndiff <= 0} {
    puts stderr "Invalid warning time specified"
    exit 3
}

if {![string is integer -strict $critdiff] || $critdiff <= 0} {
    puts stderr "Invalid critical time specified"
    exit 3
}

The preceding code should replace the three lines that read the argv variable in the original script earlier. The remaining part of the check script should stay the same.

Of course, the changes mentioned here are just small examples of how plugins should be written. It's not possible to cover all possible aspects of what plugins should take into account. It's your responsibility as the command's author to make sure that all scenarios are covered in your plugin.

Typically, this means correct error handling—usually related to catching all of the exceptions that the underlying functions might throw. There are also additional things to take into account. For example, if you are writing a networked plugin, the remote server can return error messages that also need to be handled properly.

An important thing worth considering is the proper handling of timeouts.

Usually, a plugin tries to connect in the background. If the connection does not succeed within a specified period of time, the plugin will abort the check and report an error status. This is usually done through the use of child threads or child processes. In event-driven languages, this can be done by scheduling an event that exits with a timeout message after a specified time interval.
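
As a simpler alternative to threads or child processes, a Python plugin running on a Unix-like system can rely on the alarm signal. The sketch below uses that technique; the names CheckTimeout and run_with_timeout are made up for this example, and the approach only works in the main thread of the process.

```python
import signal


class CheckTimeout(Exception):
    """Raised when the check does not finish in time."""


def run_with_timeout(func, seconds):
    """Call func(), raising CheckTimeout if it runs longer than `seconds`."""
    def on_alarm(signum, frame):
        raise CheckTimeout()

    # install our handler and schedule SIGALRM
    previous = signal.signal(signal.SIGALRM, on_alarm)
    signal.alarm(seconds)
    try:
        return func()
    finally:
        # cancel the pending alarm and restore the old handler
        signal.alarm(0)
        signal.signal(signal.SIGALRM, previous)
```

A plugin would catch CheckTimeout around the call, print a message such as "CRITICAL: timed out", and exit with the corresponding status code.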

Checking websites

Nagios ships with a very powerful check_http plugin that allows you to monitor websites in a simple way. This plugin should be enough for a large variety of tasks. However, there are often situations where using only this plugin is not enough.

If you are running a website that is critical to your business, checking only whether the main page is showing up correctly may not be enough. In many cases, you might actually want to be sure that the users are able to log in, orders can be sent out, and reports can be generated correctly.

In such cases, it is not sufficient just to check if a couple of pages work correctly. It might be necessary to write a more complex check that will log you into the website, fill out an order form, send it, and verify whether it shows up in the order history. You may also want to check that a specified text is present on specific pages.

This task is very common when performing automated tests during the development of a site. Not many people perform such tests regularly when the site is in production. A downside of this is that if version control of your website is not very strict, then small bug fixes can break things in a different part of the website and those may go unnoticed for a long time.

One may question whether this is a task for system monitoring or for the testing phase of the development and maintenance cycles. For a number of reasons, this task should be common to both development and maintenance, but it should also be a part of system monitoring. The first reason is that such tests make sure that the overall functionality of the site is working as expected. It can also be used to detect defacing or other unauthorized modification of the page. It can also be used to monitor the response time. Monitoring the web page's functionality should normally be performed rarely, but checks for the web server and the main page should be done more often.

There are a couple of approaches to this problem, depending on what you actually want to monitor. The first one is using the HTTP or HTTPS protocol directly through various libraries: requests for Python (https://github.com/kennethreitz/requests), http for Tcl/Tk (http://www.tcl.tk/man/tcl8.4/TclCmd/http.htm), and LWP for Perl (http://search.cpan.org/~gaas/libwww-perl/lib/LWP.pm). With this approach, you will need to hardcode your URLs along with the queries to send and, in some cases, also implement cookie handling on your own.
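
A minimal sketch of the direct approach is shown below. To stay self-contained it uses Python's standard urllib.request module rather than the requests package, and the function names and messages are illustrative assumptions. It fetches a single page and checks that a given text is present.

```python
import urllib.error
import urllib.request

OK, CRITICAL = 0, 2


def evaluate_page(status, body, expected_text):
    """Decide the state from an HTTP status code and the page body."""
    if status != 200:
        return CRITICAL, "HTTP CRITICAL: status %d" % status
    if expected_text not in body:
        return CRITICAL, "HTTP CRITICAL: text %r not found" % expected_text
    return OK, "HTTP OK: text %r found" % expected_text


def check_url(url, expected_text, timeout=10):
    """Fetch url and evaluate it; network failures are reported as CRITICAL."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = response.read().decode("utf-8", "replace")
            return evaluate_page(response.status, body, expected_text)
    except urllib.error.HTTPError as exc:
        # the server answered, but with an error status
        return CRITICAL, "HTTP CRITICAL: status %d" % exc.code
    except (urllib.error.URLError, OSError) as exc:
        # DNS failure, refused connection, timeout and similar
        return CRITICAL, "HTTP CRITICAL: %s" % exc
```

Checks that span several pages, such as logging in and then reading an order history, would additionally need cookie handling, which is where the frameworks described next become more convenient.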

Another approach is to use automated test frameworks. This includes mechanize for Python (http://wwwsearch.sourceforge.net/mechanize/), webautotest for Tcl (http://sourceforge.net/projects/dqsoftware/), and WWW::Mechanize for Perl (http://search.cpan.org/dist/WWW-Mechanize/). There are also multiple Java frameworks for this, such as HttpUnit (http://httpunit.sourceforge.net/) and HtmlUnit (http://htmlunit.sourceforge.net/). These packages offer automated parsing of HTML, reading of the DOM tree, and operation similar to how a browser would work. This allows scripts to be written at a higher level without having to care about low-level things such as reading and passing values from all fields. A typical script would consist of going to a URL, locating forms, setting values, and sending these values.

The last approach is to use packages that take advantage of Internet Explorer over Component Object Model (COM), which is available at http://www.microsoft.com/com/. This approach uses an entire browser and, therefore, is the most accurate method of testing a website's correctness. It also requires a much larger setup to accomplish the same task: tests need to be performed on a Microsoft Windows system and require a separate account for proper cookie management, for example, in cases where tests need to start after all of the cookies have been removed. Python offers the ability to automate Internet Explorer using the PAMIE package (http://pamie.sourceforge.net/), while for Perl it is SAMIE (http://samie.sourceforge.net/). Tcl offers Internet Explorer automation in the autoie package (http://sourceforge.net/projects/dqsoftware/). For Ruby, the most popular utility is called Watir (http://wtr.rubyforge.org/). In order to use IE- and COM-based automation, you should run all the checks on a Microsoft Windows-based machine and configure it so that the results are sent back via NSCA.

Usually, the best choice is to use automated web testing frameworks. These require less overhead when developing the code to perform checks and tend to cope well with small changes in the way your website works.

As an example, we will write a simple script in Tcl that communicates with a website using the webautotest package. The plugin logs into the backend of a Joomla! content management system (http://www.joomla.org/) and makes sure that the administrative panel loads. Because a successful login depends on the database and on session handling, the test also indicates that these Joomla! mechanisms are working.

The following is the source code of the plugin:

package require http

# initialize Webautotest object

package require webautotest::httpclient

set o [webautotest::httpclient ::#auto]

if {$argc != 3} {

puts "Usage: check_joomla_backend URL username password"

exit 3

}

set url [lindex $argv 0]

set username [lindex $argv 1]

set password [lindex $argv 2]

if {[catch {

# go to your company's Joomla backend

$o navigate $url

# log in and submit form

$o setForm -name login

$o setFormValue username $username

$o setFormValue password $password

$o setFormValue lang en-GB

$o submitForm

# check if "Logged in Users" text can be found on the page

set result [$o regexpDataI "Logged in Users"]

} error]} {

puts "JOOMLA UNKNOWN: error occurred during check."

exit 3

}

if {[llength $result] > 0} {

puts "JOOMLA OK: Administrative panel loaded correctly."

exit 0

} else {

puts "JOOMLA CRITICAL: Administrative panel does not work."

exit 2

}

To check the plugin, simply run the following command:

root@ubuntu:~# /opt/nagios/plugins/check_joomla_backend \

http://joomla.yourcompany.com/administrator/ admin adminpassword

JOOMLA OK: Administrative panel loaded correctly.

Virtualization and clouds

Nowadays, more and more IT systems are moving to private or public clouds. Clouds allow more efficient use of resources and make it possible to scale CPU power, memory, or storage capacity up or down almost instantly.

Clouds can be divided into two forms:

· Public clouds: These are clouds hosted by external companies, which offer the use of their machines as a service. Popular and commonly used public cloud providers include AWS from Amazon and Azure from Microsoft.

· Private clouds: These are setups where the company that wants to use the system also hosts it.

Both have their advantages and disadvantages, and sometimes both types of clouds are used together. There are multiple free and commercial technologies to set up private clouds, with VMware being a very popular one in enterprise IT infrastructure.

Nagios provides many ready-to-use plugins for various types of clouds. If possible, it is always a good idea to use the already-existing plugins. However, often we will need to either retrieve specific information or monitor specific data, in which case, we will need to create our own plugins.

Monitoring VMware

For Intel-based platforms, VMware virtualization (http://www.vmware.com/) is one of the most popular technologies. This spans from desktop solutions to server products. VMware also offers a free virtualization platform called VMware Server (http://www.vmware.com/products/server/).

Although Nagios does not offer a large variety of plugins to monitor VMware ESX and ESXi systems, VMware offers a Perl API that can easily be used to query virtual machines, along with a few of their parameters. On Windows operating systems, there is also the VmCOM API that allows interaction with VMware products.

These functions allow the querying of the virtual machine's status and guest parameters, as well as checking whether the virtual machine is working correctly.

The script shown later in this section is written in Perl. It queries a particular virtual machine's state and makes sure that the machine is working correctly. The script can easily be expanded to monitor CPU usage on a particular machine by querying the cpu.cpusecs parameter with the get_resource() function of a virtual machine object.

Even though the script is configured to connect to a local machine, it is possible to specify different connection parameters so that it will query remote machines. In such a case, it is also necessary to specify the username and password of a user who can log in to the VMware system.

For the script to work, it is necessary that the VmPerl API is configured in your Perl interpreter. In order to check this, please run the following command:

root@ubuntu:~# perl -e 'use VMware::VmPerl;'

If the VmPerl libraries are correctly installed, this command should complete without any warnings or errors. Otherwise, VMware might need to be reconfigured, because VmPerl needs to be recompiled on each minor and major upgrade of Perl. Once the API is available, the plugin can be written as follows:

#!/usr/bin/perl

require VMware::VmPerl::VM;

require VMware::VmPerl::ConnectParams;

if (@ARGV != 2){

printf "Usage: check_vmstatus <machine> <command>\n";

exit(3);

}

($vmpath, $cmd) = @ARGV;

my $params = VMware::VmPerl::ConnectParams::new();

my $vm = VMware::VmPerl::VM::new();

$vm->connect($params, $vmpath);

my $title = $vm->get_config("displayName");

if ($cmd eq "state"){

if ($vm->get_execution_state() != 1)

{

printf "CRITICAL: %s is not running\n", $title;

exit(2);

}

else

{

printf "OK: %s is running\n", $title;

exit(0);

}

}

if ($cmd eq "heartbeat"){

my $hb0 = $vm->get_heartbeat();

sleep(5);

my $hb1 = $vm->get_heartbeat();

if ($hb0 == $hb1)

{

printf "CRITICAL: %s does not respond to events\n", $title;

exit(2);

}

else

{

printf "OK: %s is alive\n", $title;

exit(0);

}

}

printf "UNKNOWN: invalid command %s\n", $cmd;

exit(3);

In order to test the script, simply run the following command:

# /opt/nagios/plugins/check_vmstatus "/path/to/Solaris.vmx" state

OK: Solaris 10 test machine is running

You will need to specify the full path to the .vmx file, and the virtual machine needs to be added to VMware.

Monitoring Amazon Web Services

Amazon Web Services (AWS) is a public cloud. It provides a large variety of services such as storage, CDN, computing, and running of servers. It also provides a monitoring service called CloudWatch, which can easily be integrated with Nagios using the plugins available on Nagios Exchange (visit http://exchange.nagios.org/ for more details). AWS provides a very easy-to-use API, and client libraries exist for all popular programming languages. For some languages, such as Java, Ruby, or Python, Amazon provides the client library itself. For many other languages, unofficial libraries are available; for example, there are complete libraries for Perl and Tcl.

Elastic Compute Cloud (EC2) is an AWS service that allows the running of Linux- or Windows-based virtual machines in the cloud (visit http://aws.amazon.com/ec2/ for more details). A very basic thing that we can do is write code to test whether a specific EC2 instance is currently running. For this example, we'll use Ruby together with the official AWS SDK for Ruby. To install the SDK, simply run the following command:

# gem install aws-sdk

The gem command is part of the RubyGems package (visit http://rubygems.org for more details), and it is the standard way to install additional modules for Ruby.

Once the SDK is installed, it is enough for testing purposes to set the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables, for example:

# export AWS_ACCESS_KEY_ID=XXXXXXXXXXXXXXXXXXXX

# export AWS_SECRET_ACCESS_KEY=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

The values can be retrieved from the IAM section of the AWS console at https://console.aws.amazon.com/iam/home. For production, the values should be stored in a file that only the nagios user can read and should be loaded from the script.
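One possible way to keep the keys out of the environment of an interactive shell is sketched below in Python. The KEY=VALUE file layout, the function names, and the rule of rejecting files that are accessible by group or others are all assumptions made for this illustration; the permission check relies on POSIX file modes.

```python
import os
import stat


def parse_credentials(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    credentials = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        credentials[key.strip()] = value.strip()
    return credentials


def load_credentials(path):
    """Read credentials from a file that only its owner may access."""
    mode = os.stat(path).st_mode
    if mode & (stat.S_IRWXG | stat.S_IRWXO):
        raise PermissionError("%s is accessible by group or others" % path)
    with open(path) as handle:
        return parse_credentials(handle.read())


def export_credentials(path, environ=os.environ):
    """Copy both AWS keys from the file into the environment mapping."""
    credentials = load_credentials(path)
    for name in ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"):
        environ[name] = credentials[name]
```

The file would be owned by the nagios user with mode 0600, and a wrapper could call export_credentials() before starting the actual check.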

We'll now create a simple script to check instance statuses that will take the values from the command-line options:

#!/usr/bin/env ruby

# load Amazon Web Services API

require 'aws-sdk'

# parse options

require 'optparse'

# default values for options

options = {instance: nil, status: :running}

# parse command line options

OptionParser.new do |opts|

opts.banner = \

"Usage: check_ec2_instance options"

# instance has to be in form of i-XXXXXXXX

opts.on("-i", "--instance ID",

/^i-[0-9A-Fa-f]{8}$/,

"Instance to test") do |v|

options[:instance] = v

end

# status can be running or stopped (there are

# other statuses as well, but we check those only)

opts.on("-s", "--status running|stopped",

[:running, :stopped],

"Expected status") do |v|

options[:status] = v.to_sym

end

end.parse!

# verify instance ID is specified, exit otherwise

if options[:instance].nil?

puts "UNKNOWN Instance must be specified"

exit 3

end

# create an EC2 API object and look up the instance

ec2 = AWS::EC2.new

i = ec2.instances[options[:instance]]

# check that instance exists and its status matches expected

if !i.exists?

puts "CRITICAL Instance #{i.id} does not exist"

exit 2

elsif i.status == options[:status]

puts "OK Instance #{i.id} is #{i.status}"

exit 0

else

puts "CRITICAL Instance #{i.id} is #{i.status}"

exit 2

end

The preceding example will return an appropriate message and status, for example:

# ./check_ec2_instance.rb -i i-12345678 -s running

OK Instance i-12345678 is running

# ./check_ec2_instance.rb -i i-12341234 -s stopped

CRITICAL Instance i-12341234 is running

AWS also offers EC2 machines as spot instances. These are the same as normal EC2 instances, but the pricing model works by bidding, which is similar to how stock markets operate. When requesting a spot instance, you specify the maximum price you are willing to pay. If the current price is lower, your machine gets started or continues running, and you are charged the current price. When the current price is higher than the specified one, the machine gets stopped. This makes it possible to run calculations, tests, or other activities that can be done at any time, such as analyzing historical data, at a lower price than with regular instances. Visit http://aws.amazon.com/ec2/spot-instances/ to find out more about spot instances.

If you are using spot instances, it may be a good idea to monitor spot pricing of EC2 instances and show a warning if it exceeds a certain value for some time.

The following code is a complete example that fetches the price history for the specified instance type and availability zone (the price is different for each instance type in each availability zone) and compares the average with the warning and critical thresholds:

#!/usr/bin/env ruby

# load API

require 'aws-sdk'

# parse options

require 'optparse'

# Time#iso8601 is provided by the time library

require 'time'

options = {

zone: "us-east-1a",

type: "m1.small",

hours: 4

}

OptionParser.new do |opts|

opts.banner = \

"Usage: check_spot_pricing options"

opts.on("-z", "--availability_zone zone",

"Availability zone price to check") do |v|

options[:zone] = v

end

opts.on("-t", "--type type",

"Instance type to check" do |v|

options[:type] = v

end

opts.on("-h", "--hours hours", Integer,

"Number of hours to request history for") do |v|

options[:hours] = v.to_i

# make sure at least 2 hours are used

options[:hours] = 2 if options[:hours] < 2

end

opts.on("-w", "--warning price", Float,

"Warning threshold price") do |v|

options[:warning] = v.to_f

end

opts.on("-c", "--critical price", Float,

"Critical threshold price") do |v|

options[:critical] = v.to_f

end

end.parse!

# get pricing history

ec2 = AWS::EC2.new

history = ec2.client.describe_spot_price_history(

instance_types: [options[:type]],

start_time: (Time.now - 3600*options[:hours]).iso8601,

availability_zone: options[:zone]

)

# get list of prices and calculate average value

prices = history[:spot_price_history_set]

.map{|i| i[:spot_price].to_f}

avg = prices.inject(0.0) {|s,i| s+i} / prices.size

# format message and print it along with proper status

msg="Average %s price in %s is %.3f" % \

[options[:type], options[:zone], avg]

if options[:critical] && (avg > options[:critical])

puts "CRITICAL #{msg}"

exit 2

elsif options[:warning] && (avg > options[:warning])

puts "WARNING #{msg}"

exit 1

else

puts "OK #{msg}"

exit 0

end

This will retrieve the pricing history for the specified amount of time, calculate the average value using the inject method (described in more detail at http://apidock.com/ruby/Enumerable/inject), and check whether the average is above the specified thresholds, for example:

# ./check_spot_pricing.rb -w 0.015 -c 0.02 -t m1.small

OK Average m1.small price in us-east-1a is 0.013

# ./check_spot_pricing.rb -w 0.015 -c 0.02 -t m1.large

CRITICAL Average m1.large price in us-east-1a is 0.327
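The threshold logic can also be kept separate from the API call, which makes it easier to test and to handle the case where no price history comes back at all. The Python sketch below illustrates that idea; the function name and the choice to report UNKNOWN for an empty history are assumptions made for this example, not behaviour of the Ruby script above.

```python
def evaluate_prices(prices, warning=None, critical=None):
    """Map a list of spot prices to a Nagios (exit code, message) pair."""
    if not prices:
        # avoid dividing by zero when the API returned no history
        return 3, "UNKNOWN no price history returned"
    average = sum(prices) / len(prices)
    message = "Average price is %.3f" % average
    if critical is not None and average > critical:
        return 2, "CRITICAL " + message
    if warning is not None and average > warning:
        return 1, "WARNING " + message
    return 0, "OK " + message
```

Keeping the comparison in a function that takes plain numbers means the thresholds can be exercised without AWS credentials.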

AWS provides a large variety of services and many options for how those services can be used. If you are using AWS for anything more than just basic functionality, it is a good idea to create custom Nagios plugins to monitor metrics specific to your operations.

Writing commands to send notifications

Another part of Nagios that can be extended to fit your needs is notifications. These are messages that Nagios sends out whenever a problem occurs or is resolved.

One way in which Nagios' notification system can be expanded is to create template-based e-mails. These will send notifications as both plain text and HTML messages. The template of the e-mail will be kept in separate files.

We will use Tcl for this purpose as it contains libraries for MIME (http://tcllib.sourceforge.net/doc/mime.html) and SMTP (http://tcllib.sourceforge.net/doc/smtp.html) functionality. The first one allows the creation of structured e-mails, whereas the latter one is used to send these using an SMTP server.

E-mails that contain content in multiple formats need to be wrapped in the multipart/alternative MIME type. This type will contain two subparts: first the plain text version and then the HTML version. This order makes e-mail clients choose HTML over plain text if both the types are supported.

This part can then be wrapped in a multipart/related MIME type. This allows the embedding of additional files such as images, which can then be used from within an HTML message. This is not used in the example in this section, but can easily be added in the same manner as the text and HTML parts are embedded inside the multipart/alternative MIME type.
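To make the structure concrete, here is a small Python sketch that builds such a message with the standard email package. It only illustrates the multipart/alternative layout described above; the function name is an assumption made for this example.

```python
from email.message import EmailMessage


def build_notification(sender, recipient, subject, text_body, html_body):
    """Build a multipart/alternative message: plain text first, HTML second."""
    message = EmailMessage()
    message["From"] = sender
    message["To"] = recipient
    message["Subject"] = subject
    # the first call creates the text/plain part
    message.set_content(text_body)
    # the second call converts the message to multipart/alternative
    # and appends the text/html part after the plain text one
    message.add_alternative(html_body, subtype="html")
    return message
```

The resulting object could then be handed to smtplib, for example with smtplib.SMTP("localhost").send_message(message).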

In the same way that macro substitution works in Nagios commands, the script will replace certain strings, such as $HOSTSTATE$, within the template. For example, the following fragment can be used in an HTML template:

<tr><td>Notification type</td>

<td><b>$TYPE$</b></td></tr>

Similar macros can be used in plain text templates and will be substituted as well.
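The substitution itself is a plain string replacement. The following Python sketch shows one way to do it; the render function and its behaviour of leaving unknown macros untouched are assumptions made for this illustration.

```python
import re

# a macro is an upper-case name between two dollar signs, e.g. $HOSTSTATE$
MACRO = re.compile(r"\$([A-Z]+)\$")


def render(template, values):
    """Replace every $NAME$ macro that has a value; leave others untouched."""
    return MACRO.sub(
        lambda match: str(values.get(match.group(1), match.group(0))),
        template)
```

Note that values are inserted verbatim; for HTML templates, it may be worth passing them through html.escape() first.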

The following script will allow users to be notified in HTML format through the use of templates:

#!/usr/bin/env tclsh

package require mime

package require smtp

package require fileutil

# map arguments

set mappings {TEMPLATE EMAIL TYPE

HOSTNAME HOSTSTATE HOSTOUTPUT}

if {[llength $argv] != [llength $mappings]} {

puts stderr "Usage: [info script] [join $mappings]"

exit 1

}

# handle arguments

set template [lindex $argv 0]

set to [lindex $argv 1]

foreach name $mappings value $argv {

lappend map "\$$name\$" $value

}

# read template files and map variables accordingly

set textbody [string map $map \

[fileutil::cat $template/body.txt]]

set htmlbody [string map $map \

[fileutil::cat $template/body.html]]

set mailsubject [string map $map \

[fileutil::cat $template/subject.txt]]

# create a list of alternate formats (plain text and html)

set parts [list]

lappend parts [mime::initialize -canonical text/plain \

-encoding 8bit -string $textbody]

lappend parts [mime::initialize -canonical text/html \

-encoding 8bit -string $htmlbody]

# wrap all parts inside multipart/alternative

set parts [mime::initialize -canonical multipart/alternative \

-header [list Subject $mailsubject] \

-header [list To "\"$to\" <$to>"] \

-header [list From "\"Nagios\" <nagios@yourcompany.com>"] \

-parts $parts]

smtp::sendmessage $parts \

-recipients $to \

-originator "nagios@yourcompany.com" \

-servers {localhost}

exit 0

To test it, simply run the following command:

root@ubuntu:~# /opt/nagios/plugins/notify-email-html template1 \

jdoe@yourcompany.com RECOVERY myhost1 OK "OK: host is alive"

This should cause an e-mail to be sent to jdoe@yourcompany.com.

We can now define a command that will send a notification for the host, for example:

define command{

command_name notify-host-by-email-html

command_line $USER5$/notify-email-html template1 '$CONTACTEMAIL$' '$NOTIFICATIONTYPE$' '$HOSTNAME$' '$HOSTSTATE$' '$HOSTOUTPUT$'

}

It will pass the appropriate arguments for the user's e-mail address, notification type, hostname, state, and output from the host check. The command can then be used for one or more contacts by setting the host_notification_commands option, for example:

define contact{

contact_name jdoe

host_notification_period 24x7

host_notification_options d,u,r,f,s

host_notification_commands notify-host-by-email-html

(...)

}

Managing Nagios

Your application might also want to have some control over Nagios. You might want to expose an interface for users to take control of your monitoring system, for example, a web interface or a client-server system. You might also want to handle custom authorization and access control lists, which is beyond the functionality offered by the web interface that Nagios comes with.

In such cases, it is best to create your own system to read the current status, as well as to send commands directly over the external command pipe. In both cases, this is very easy to do from any programming language.

The first thing we can do is to show Nagios' current status. This requires reading the status.dat file, parsing it into a suitable data structure, and then manipulating it. The format of the file is relatively simple: each object is enclosed in a section, and each section contains one or more name=value directives. For example, the following section holds information about the status.dat file itself:

info {

created=1388002190

version=4.0.1

}

All hosts, services, and other objects are defined in the same way as the preceding definition. There can be multiple instances of a specified object type, for example, each hoststatus object definition specifies a single host along with its current status.
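The format described above can be parsed in a few lines. The following Python sketch is one possible reading of it: it assumes that each section starts with a line of the form "type {", ends with a line containing only "}", and that values are split at the first equals sign. The function name is chosen for this example only.

```python
import re

SECTION_START = re.compile(r"^(\w+)\s*\{$")


def parse_status(text):
    """Parse status.dat content into {object_type: [{name: value}, ...]}."""
    result = {}
    current_type = None
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        start = SECTION_START.match(line)
        if start:
            # beginning of a new object section
            current_type, current = start.group(1), {}
        elif line == "}" and current is not None:
            # end of section: store the object under its type
            result.setdefault(current_type, []).append(current)
            current_type, current = None, None
        elif current is not None and "=" in line:
            # split at the first "=" so values may contain "=" themselves
            name, _, value = line.partition("=")
            current[name.strip()] = value
    return result
```

Calling parse_status() on the contents of status.dat returns a dictionary in which, for example, the "hoststatus" key holds one dictionary per host.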

Sending commands to Nagios is also easy. The details of the most commonly used commands were given in Chapter 6, Notifications and Events. Sending a command simply involves opening the external command pipe for writing, writing the command, and closing the pipe again.

Controlling Nagios from an external application is commonly done in PHP to create web applications. Implementing the reading of the current status as well as sending commands to Nagios is relatively easy to do in PHP, as the language offers convenient functions for string manipulation and regular expressions. Your web application also needs to limit commands that a user is able to send to Nagios, as it might be a security risk if your application offers functionalities such as disabling and enabling checks for hosts and/or services.
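The two points above, writing to the pipe and restricting what may be written, can be combined in a small helper. The Python sketch below is only an illustration: the allowlist contents and the function names are assumptions, and the pipe path is the one used elsewhere in this chapter.

```python
import time

# hypothetical allowlist; include only what your application really needs
ALLOWED_COMMANDS = {"SCHEDULE_FORCED_HOST_CHECK", "SCHEDULE_FORCED_SVC_CHECK"}


def format_command(name, *args, now=None):
    """Build one external command line: [timestamp] NAME;arg1;arg2"""
    if name not in ALLOWED_COMMANDS:
        raise ValueError("command %s is not allowed" % name)
    timestamp = int(time.time() if now is None else now)
    fields = [name] + [str(arg) for arg in args]
    return "[%d] %s\n" % (timestamp, ";".join(fields))


def send_command(line, pipe="/var/nagios/rw/nagios.cmd"):
    """Open the command pipe, write a single command, and close it again."""
    with open(pipe, "w") as handle:
        handle.write(line)
```

Rejecting unknown command names before anything is written keeps a bug or a malicious request in the web layer from reaching Nagios.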

The following function reads the Nagios status file and returns it as an array of types of objects:

function readStatus($filename)
{
    $rc = array();
    $objname = "";
    $object_info = array();
    $fh = fopen($filename, "r");
    if ($fh === false)
        return $rc;
    while (($line = fgets($fh)) !== false)
    {
        $line = trim($line);
        // match beginning of an object
        // (preg_match replaces ereg, which was removed in PHP 7)
        if (preg_match('/^(.*) +\{$/', $line, $match))
        {
            // if object data was previously read, store it
            if ($objname != "")
                $rc[$objname][] = $object_info;
            $objname = $match[1];
            // start collecting directives for the new object
            $object_info = array();
        }
        // match name=value, splitting at the first "=" sign
        else if (preg_match('/^([^=]+)=(.*)$/', $line, $match))
        {
            $object_info[trim($match[1])] = $match[2];
        }
    }
    fclose($fh);
    // if object data was previously read, store it
    if ($objname != "")
        $rc[$objname][] = $object_info;
    return $rc;
}

The function reads the file and looks for lines that start with some text, followed by one or more spaces, and end with an opening curly bracket ({). Such a line marks the beginning of an object definition, and the object name is stored. For lines matching the name=value pattern, the name and value are stored if the beginning of an object was previously read.

Whenever a new object is read or when an end of file is reached, information about the previously read object is stored. In this way, the returned value is an array that contains a list of all object types, such as the info definition mentioned above.

It's also relatively easy to write a function that allows you to search for objects by their type so that they match the specified criteria, for example, all of the services associated with a host. A sample code to do this is as follows:

function findObject($status, $object_type, $matching_fields)

{

$rc = array();

// iterate over all objects of said type

foreach ($status[$object_type] as $object)

{

$ok = true;

// iterate over all matching fields query and

// check if they are all set and match value

foreach ($matching_fields as $name => $value)

{

if (!isset($object[$name]) || $object[$name] != $value)

$ok = false;

}

// if all fields matched criteria, add to output list

if ($ok)

$rc[] = $object;

}

return $rc;

}

The function takes all objects of the specified type and checks each of them against the fields and expected values passed as $matching_fields. The current $object is added to the output list only if it has all of the required fields and their values match the expected values.

Next, we can test this by reading the status and finding all of the services on the localhost machine that have critical statuses. This is done by invoking the following sample code:

$s = readStatus("/var/nagios/status.dat");

print_r(findObject($s, "servicestatus",

array("host_name" => "localhost", "last_hard_state" => "2")));

This code will print out an array of all services matching the predefined criteria. This can be used to perform complex searches and show the status depending on many configuration options.

Sending commands to Nagios from PHP is also a very simple thing to do. The following is a class that offers internal functions for sending commands, as well as two sample commands that cause Nagios to schedule the next host or service check on the specified date. If the date is omitted, then the check is run immediately. Please check the following code:

class Nagios

{

var $pipefilename = "/var/nagios/rw/nagios.cmd";

function writeCommand($str)

{

$f = fopen($this->pipefilename, "w");

fwrite($f, "[" . time() . "] " . $str . "\n");

fclose($f);

}

function scheduleHostCheck($host, $when = "")

{

if ($when == "")

$when = time();

$this->writeCommand("SCHEDULE_FORCED_HOST_CHECK;" .

$host . ";" . $when);

}

function scheduleServiceCheck($host, $svc, $when = "")

{

if ($when == "")

$when = time();

$this->writeCommand("SCHEDULE_FORCED_SVC_CHECK;" .

$host . ";" . $svc . ";" . $when);

}

}

A small section of code to test the functionality is as follows:

$n = new Nagios();

$n->scheduleHostCheck("linux1");

$n->scheduleServiceCheck("localhost", "APT", strtotime("+1 day"));

The preceding code initializes an instance of the Nagios class, and then schedules a host check for the linux1 machine immediately. Next, it schedules the APT service check on the localhost machine to occur one day from now.

Implementing additional commands should be as simple as specifying new functions that send commands (http://www.nagios.org/developerinfo/externalcommands/) to Nagios over the external command pipe. Usually, the functionality base grows as the project grows. Hence, we should not define unused functions on a just-in-case basis.

Using passive checks

Nagios offers a very powerful mechanism to schedule tests. However, there are many situations where you might want to perform tests on your own and just tell Nagios what the result is. One of the typical scenarios to use passive tests can be when performing the actual test takes very little time, but the startup overhead is large. This is normal for languages such as Java, whose runtime initialization requires a lot of resources.

Another reason might be that checks are performed on machines other than the one where the Nagios instance is running. In many cases, due to security restrictions, it is not possible to schedule such checks directly from Nagios, because communication not initiated by those machines is blocked. In this case, it's often best to schedule checks on your own and simply submit the results back to Nagios. In cases where such tests are going to be written by you, it's wise to integrate them with a mechanism to send the results over NSCA directly.

Passive checks are responsible for scheduling and performing tests on their own. They can also be started by Nagios event handlers and be run as part of other applications. After a passive check is done, the result needs to be sent to the Nagios server. There are a couple of ways to do this. The easiest way is to send results over the external commands pipe, which is similar to managing Nagios. In this case, the application needs to send proper commands to submit either service or host check results. Nagios will then take care of incorporating the results into its database.
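As a minimal sketch of what those commands look like, the following Python functions format PROCESS_SERVICE_CHECK_RESULT and PROCESS_HOST_CHECK_RESULT lines. The function names are assumptions made for this example, as is the decision to keep only the first line of the output so that a stray newline cannot break the command.

```python
import time


def _first_line(output):
    """Return only the first line of the plugin output."""
    lines = output.splitlines()
    return lines[0] if lines else ""


def service_result(host, service, code, output, now=None):
    """Format a passive service check result (code 0-3: OK to UNKNOWN)."""
    timestamp = int(time.time() if now is None else now)
    return "[%d] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%d;%s\n" % (
        timestamp, host, service, code, _first_line(output))


def host_result(host, code, output, now=None):
    """Format a passive host check result (0 UP, 1 DOWN, 2 UNREACHABLE)."""
    timestamp = int(time.time() if now is None else now)
    return "[%d] PROCESS_HOST_CHECK_RESULT;%s;%d;%s\n" % (
        timestamp, host, code, _first_line(output))
```

The returned string is what an application would write to the external command pipe.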

Another approach is to use NSCA. This is a protocol for sending results over the network. NSCA provides a command to send the results over the network and requires the passing of the configuration file that specifies the protocol, password, and other information. It is described in more detail in Chapter 7, Passive Checks and NSCA.

The following example shows an application that periodically performs tests and sends the results to Nagios over the external command pipe. The code consists of a method to supply information to Nagios and a main loop that performs tests every 5 minutes. It does not contain the actual test, as this will vary depending on your needs. The following is sample Java code that performs the test and reports its results using the Nagios external command pipe:

import java.io.FileWriter;

public class PerformTests
{
    /* write check status to Nagios pipe */
    private static void writeStatus(String host, String svc,
        int code, String output) throws Exception
    {
        long time = System.currentTimeMillis() / 1000;
        FileWriter fw = new FileWriter("/var/nagios/rw/nagios.cmd");
        fw.write("[" + time + "] PROCESS_SERVICE_CHECK_RESULT;" +
            host + ";" + svc + ";" + code + ";" + output + "\n");
        fw.close();
    }

    /* Thread.sleep may be interrupted, so main declares the exception */
    public static void main(String[] args) throws InterruptedException
    {
        while (true)
        {
            int code;
            StringBuffer output = new StringBuffer();
            /* perform actual test and report error if it failed */
            try
            {
                code = performTest(output);
            }
            catch (Exception e)
            {
                code = 3;
                output = new StringBuffer("Error: " + e.getMessage());
            }
            try
            {
                writeStatus("hostname", "serviceDescription",
                    code, output.toString());
            }
            catch (Exception e)
            {
                System.out.println("Problem sending command to Nagios:" +
                    e.getMessage());
            }
            /* wait for 5 minutes between performing tests */
            Thread.sleep(300 * 1000);
        }
    }

    /* placeholder for the actual test */
    private static int performTest(StringBuffer buf)
    {
        return 0;
    }
}

Please note that the actual implementation of the performTest method will perform real tests. The following is a sample test function to connect over JDBC:

private static int performTest(StringBuffer output)

{

String url = "jdbc:mysql://localhost:3306/mysql";

String username = "root";

String password = "yourpassword";

java.sql.Connection conn;

try {

conn = java.sql.DriverManager.

getConnection(url, username, password);

conn.close();

}

catch (Exception exception) {

output.append("JDBC CRITICAL: Unable to connect");

return(2);

}

output.append("JDBC OK: Connection established");

return(0);

}

To run the tests, you will first need to compile the class. Assuming the source code is called PerformTests.java, run the following command:

javac PerformTests.java

Now, you can run the actual test using the following command:

java -cp . PerformTests

This will send reports to Nagios, so you can check the Nagios log file to see whether it has received information from your test checker.

Very often, you will need to create or extend applications to perform checks on remote machines. In this case, NSCA is used to send the check results to the Nagios server.

The following code is a Python class for sending service and host results over NSCA. It uses the subprocess module (http://docs.python.org/2/library/subprocess.html) and allows configuration of the path to the send_nsca command, its configuration file, and the NSCA host and port:

import subprocess

class nscawriter:
    def __init__(self):
        self.nscacommand = "/opt/nagios/bin/send_nsca"
        self.nscaconfig = "/etc/nagios/send_nsca.cfg"
        self.nscahost = "10.0.0.1"
        self.nscaport = 5667

    def open(self):
        # pass arguments as a list so that no shell quoting is needed
        self.process = subprocess.Popen(
            [self.nscacommand,
             "-H", self.nscahost,
             "-p", str(self.nscaport),
             "-c", self.nscaconfig],
            stdin=subprocess.PIPE,
            universal_newlines=True)
        # send_nsca reads the results from its standard input
        self.nscain = self.process.stdin

    def serviceResult(self, host, svc, code, output):
        self.nscain.write(host + "\t" + svc +
            "\t" + str(code) + "\t" + output + "\n")
        self.nscain.flush()

    def hostResult(self, host, code, output):
        self.nscain.write(host +
            "\t" + str(code) + "\t" + output + "\n")
        self.nscain.flush()

    def close(self):
        # closing standard input tells send_nsca that we are done
        self.nscain.close()
        self.process.wait()

In order to test it, we can run the following code. This will submit a host check result for the linux1 machine and a result for the APT service on that host.

if __name__ == "__main__":
    nsca = nscawriter()
    nsca.open()
    nsca.hostResult("linux1", 0, "Host is reachable")
    nsca.serviceResult("linux1", "APT", 0, "No upgrades available")
    nsca.close()

You have to open and close the handle on your own. This is because the send_nsca command has internal timeout handling for reading results from the standard input. For the same reason, it is not possible to use the same NSCA instance to submit results over long periods of time.

Summary

Nagios has many places where it can be extended with external scripts or applications. We have also learned that Nagios is not bound to any specific language and that its real power comes from the fact that you can choose the language you'll use to program your code.

In this chapter, we learned how to create our own plugins to perform active checks. Adding our own commands makes it possible to perform checks using techniques that might not be available using the default Nagios plugin commands. We have also learned how it can be used to create various types of plugins—checking database consistency, monitoring system time differences, websites, and cloud environments.

This chapter also covered how to use passive checks and supply the check results to Nagios. In such a case, we are responsible for performing the test and sending results to Nagios. Nagios will then handle everything that follows from the new status of a host or service, such as triggering event handlers and sending notifications.

We also covered how to send results to Nagios in two different ways. For tests that are running on the same machine where the Nagios process is running, results can be sent using the external command pipe. If the test is running on another machine, this can be done using the NSCA protocol.

We have created a custom notification command that sends e-mails using a predefined template. This can be used to send HTML and plain text notifications using Nagios; these are more readable and nicer than plain, text-only e-mails.

This chapter also discussed how Nagios stores its status information and how it can be read in order to present it to the user or to process the data.

Of course, this chapter does not cover all of the aspects in which Nagios can be customized. Nagios offers an event handling mechanism that you can use for tasks such as automatic recovery or the deployment of backup configuration.

The next chapter talks about using the query handler and Nagios Event Radio Dispatcher (NERD) to communicate with the Nagios process and receive real-time updates about host and service statuses.