
Chapter 8. Monitoring Flume

The user guide for Flume states:

Monitoring in Flume is still a work in progress. Changes can happen very often. Several Flume components report metrics to the JMX platform MBean server. These metrics can be queried using Jconsole.

While JMX is fine for casual browsing of metric values, the number of eyeballs looking at Jconsole doesn't scale when you have hundreds or even thousands of servers sending data all over the place. What you need is a way to watch everything at once. However, what are the important things to look for? That is a very difficult question, but I'll try to point out several important items as we go over the monitoring options in this chapter.

Monitoring the agent process

The most obvious type of monitoring you'll want to perform is Flume agent process monitoring, that is, making sure the agent is still running. There are many products that do this kind of process monitoring, so there is no way we can cover them all. If you work at a company of any reasonable size, chances are there is already a system in place for this. If this is the case, do not go off and build your own. The last thing operations wants is yet another screen to watch 24/7.

Monit

If you do not already have something in place, one freemium option is Monit (http://mmonit.com/monit/). The developers of Monit also offer a paid version with more bells and whistles that you may want to consider. Even in its free form, it can check whether the Flume agent is running, restart it if it isn't, and send you an e-mail when this happens so that you can look into why it died.

Monit does much more, but this functionality is what we will cover here. If you are smart, and I know you are, you will at a minimum add checks on disk, CPU, and memory usage, in addition to what we cover in this chapter.
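As a rough illustration, a Monit process check for a Flume agent might look like the following sketch. The pidfile path, init script, and alert address are assumptions; adjust them to however you launch your agent.

# Pidfile path, init script, and alert address below are assumptions.
set alert ops@example.com

check process flume-agent with pidfile /var/run/flume/flume-agent.pid
  start program = "/etc/init.d/flume-ng-agent start"
  stop program = "/etc/init.d/flume-ng-agent stop"
  if 5 restarts within 5 cycles then alert

With a start program defined, Monit will restart the process when it finds it missing and send the alert so that you can investigate the cause afterwards.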

Nagios

Another option for Flume agent process monitoring is Nagios (http://www.nagios.org/). Like Monit, you can configure it to watch your Flume agents and alert you via a web UI, e-mail, or an SNMP trap. That said, it doesn't have restart capabilities. The community is quite strong, and there are many plugins for other available applications.
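For example, assuming NRPE is installed on the agent host and a check_nrpe command is already defined on the Nagios server, a process check might look something like this sketch (the plugin path and hostname are placeholders):

# On the Flume host (nrpe.cfg): critical unless exactly one agent process,
# identified by its main class, is running.
command[check_flume_agent]=/usr/lib/nagios/plugins/check_procs -c 1:1 -a 'org.apache.flume.node.Application'

# On the Nagios server: the corresponding service definition.
define service {
  use                   generic-service
  host_name             flume-collector-01
  service_description   Flume agent process
  check_command         check_nrpe!check_flume_agent
}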

My company uses this to check the availability of Hadoop web UIs. While not a complete picture of health, it adds useful information to the overall monitoring of our Hadoop ecosystem.

Again, if you already have tools in place at your company, see whether you can reuse them before bringing in another tool.

Monitoring performance metrics

Now that we have covered some options for process monitoring, how do you know whether your application is actually doing the work you think it is? On many occasions, I've seen a stuck syslog-ng process that appeared to be running but just wasn't sending any data. I'm not picking on syslog-ng specifically; all software can do this when it hits conditions it wasn't designed for.

When talking about Flume data flows, you need to monitor the following:

· Data is entering sources at expected rates

· Data isn't overflowing your channels

· Data is exiting sinks at expected rates

Flume has a pluggable monitoring framework, but as mentioned at the beginning of the chapter, it is still very much a work in progress. This does not mean you shouldn't use it, as that would be foolish. It does mean you'll want to plan for extra testing and integration time whenever you upgrade.

Note

While not covered in the Flume documentation, it is common to enable JMX in your Flume JVM (http://bit.ly/javajmx) and use the Nagios JMX plugin (http://bit.ly/nagiosjmx) to alert you about performance abnormalities in your Flume agents.
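For reference, enabling remote JMX usually amounts to adding standard JVM flags like the following at agent startup; the port here is arbitrary, and disabling authentication and SSL is only appropriate on a trusted network.

-Dcom.sun.management.jmxremote
-Dcom.sun.management.jmxremote.port=5445
-Dcom.sun.management.jmxremote.authenticate=false
-Dcom.sun.management.jmxremote.ssl=false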

Ganglia

One of the available monitoring options to watch Flume internal metrics is Ganglia integration. Ganglia (http://ganglia.sourceforge.net/) is an open source monitoring tool that is used to collect metrics and display graphs, and it can be tiered to handle very large installations. To send your Flume metrics to your Ganglia cluster, you need to pass some properties to your agent at startup time:

· flume.monitoring.type: set this to ganglia.

· flume.monitoring.hosts: a comma-separated list of host:port pairs for your gmond process(es), for example, host1:port1,host2:port2.

· flume.monitoring.pollFrequency: the number of seconds between sends of data (the default is 60 seconds).

· flume.monitoring.isGanglia3: set this to true if you are using the older Ganglia 3 protocol; by default, data is sent using the v3.1 wire protocol.

Look at each instance of gmond within the same network broadcast domain (as reachability is based on multicast packets), and find the udp_recv_channel block in gmond.conf. Let's say I had two nearby servers with these two corresponding configuration blocks:

udp_recv_channel {
  mcast_join = 239.2.14.22
  port = 8649
  bind = 239.2.14.22
  retry_bind = true
}

udp_recv_channel {
  mcast_join = 239.2.11.71
  port = 8649
  bind = 239.2.11.71
  retry_bind = true
}

In this case, the IP and port are 239.2.14.22/8649 for the first server and 239.2.11.71/8649 for the second, leading to these startup properties:

-Dflume.monitoring.type=ganglia
-Dflume.monitoring.hosts=239.2.14.22:8649,239.2.11.71:8649

Here, I am using defaults for the poll interval, and I'm also using the newer Ganglia wire protocol.
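If you launch the agent with the flume-ng script, these flags can simply be appended to the command line; the agent name and configuration file shown here are placeholders:

./bin/flume-ng agent -n agent -c conf -f flume.conf \
  -Dflume.monitoring.type=ganglia \
  -Dflume.monitoring.hosts=239.2.14.22:8649,239.2.11.71:8649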

Note

While receiving data via TCP is supported in Ganglia, the current Flume/Ganglia integration only supports sending data using multicast UDP. If you have a large or complicated network setup, you'll want to consult your network engineers if things don't work as you expect.

Internal HTTP server

You can configure the Flume agent to start an HTTP server that outputs JSON, which can be queried by outside mechanisms. Unlike the Ganglia integration, an external entity has to call the Flume agent to poll the data. In theory, you could use Nagios to poll this JSON data and alert on certain conditions, but I have personally never tried it. Of course, this setup is very useful in development and testing, especially if you are writing custom Flume components and want to be sure they are generating useful metrics. Here is a summary of the Java properties you'll need to set at startup of the Flume agent:

· flume.monitoring.type: set this to http.

· flume.monitoring.port: the port number (PORT) to bind the HTTP server to.

The URL for metrics will be http://SERVER_OR_IP_OF_AGENT:PORT/metrics.
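A quick way to check the endpoint from the command line is a plain HTTP GET, for example with curl (substitute your agent's address and configured port):

curl http://SERVER_OR_IP_OF_AGENT:PORT/metrics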

Let's look at the following Flume configuration:

agent.sources = s1
agent.channels = c1
agent.sinks = k1

agent.sources.s1.type=avro
agent.sources.s1.bind=0.0.0.0
agent.sources.s1.port=12345
agent.sources.s1.channels=c1

agent.channels.c1.type=memory

agent.sinks.k1.type=avro
agent.sinks.k1.hostname=192.168.33.33
agent.sinks.k1.port=9999
agent.sinks.k1.channel=c1

Start the Flume agent with these properties:

-Dflume.monitoring.type=http
-Dflume.monitoring.port=44444

Now, when you go to http://SERVER_OR_IP:44444/metrics, you might see something like this:

{
  "SOURCE.s1":{
    "OpenConnectionCount":"0",
    "AppendBatchAcceptedCount":"0",
    "AppendBatchReceivedCount":"0",
    "Type":"SOURCE",
    "EventAcceptedCount":"0",
    "AppendReceivedCount":"0",
    "StopTime":"0",
    "EventReceivedCount":"0",
    "StartTime":"1365128622891",
    "AppendAcceptedCount":"0"},
  "CHANNEL.c1":{
    "EventPutSuccessCount":"0",
    "ChannelFillPercentage":"0.0",
    "Type":"CHANNEL",
    "StopTime":"0",
    "EventPutAttemptCount":"0",
    "ChannelSize":"0",
    "StartTime":"1365128621890",
    "EventTakeSuccessCount":"0",
    "ChannelCapacity":"100",
    "EventTakeAttemptCount":"0"},
  "SINK.k1":{
    "BatchCompleteCount":"0",
    "ConnectionFailedCount":"4",
    "EventDrainAttemptCount":"0",
    "ConnectionCreatedCount":"0",
    "BatchEmptyCount":"0",
    "Type":"SINK",
    "ConnectionClosedCount":"0",
    "EventDrainSuccessCount":"0",
    "StopTime":"0",
    "StartTime":"1365128622325",
    "BatchUnderflowCount":"0"}
}

As you can see, each source, channel, and sink is broken out separately with its corresponding metrics. Each type of source, channel, and sink provides its own set of metric keys, although there is some commonality, so be sure to check what looks interesting. For instance, this Avro source has OpenConnectionCount, which is the number of connected clients (most likely clients sending data in). This can help you decide whether you have the expected number of clients or, perhaps, too many, in which case you need to start tiering your agents.

Generally speaking, the channel's ChannelSize or ChannelFillPercentage metrics will give you a good idea of whether data is coming in faster than it is going out. They will also tell you whether the channel is sized large enough to absorb maintenance windows or outages at your data volume.

Looking at the sink, comparing EventDrainSuccessCount with EventDrainAttemptCount will tell you how often output succeeds relative to how often it is attempted. In this example, I configured an Avro sink pointing at a nonexistent target. As you can see, the ConnectionFailedCount metric is growing, which is a good indicator of persistent connection problems. Even a growing ConnectionCreatedCount metric can indicate that connections are being dropped and reopened too often.

Really, there are no hard and fast rules besides watching ChannelSize/ChannelFillPercentage. Each use case will have its own performance profile, so start small, set up your monitoring, and learn as you go.
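As a starting point, the following minimal Java sketch polls the JSON endpoint described earlier and flags any channel whose fill percentage crosses a threshold. The endpoint address and the 80 percent threshold are assumptions; a production check would use a real JSON parser and plug into whatever alerting you already have.

import java.io.InputStream;
import java.net.URL;
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ChannelFillCheck {
  public static void main(String[] args) throws Exception {
    String endpoint = "http://127.0.0.1:44444/metrics"; // assumed agent address and port
    double warnThreshold = 80.0;                        // assumed warning level, in percent

    // Read the whole JSON response as a single string.
    try (InputStream in = new URL(endpoint).openStream();
         Scanner scanner = new Scanner(in, "UTF-8").useDelimiter("\\A")) {
      String json = scanner.hasNext() ? scanner.next() : "";

      // Crude extraction of every ChannelFillPercentage value; a real check
      // would use a proper JSON parser such as Jackson.
      Matcher m = Pattern.compile("\"ChannelFillPercentage\":\"([0-9.]+)\"").matcher(json);
      while (m.find()) {
        double fill = Double.parseDouble(m.group(1));
        System.out.println((fill >= warnThreshold ? "WARNING" : "OK")
            + ": channel fill at " + fill + "%");
      }
    }
  }
}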

Custom monitoring hooks

If you already have a monitoring system, you may want to take the extra effort to develop a custom monitoring reporting mechanism. You may think this is as simple as implementing the org.apache.flume.instrumentation.MonitorService interface. You do need to do that, but looking at the interface, you will only see a start() and stop() method. Unlike the more obvious interceptor paradigm, the agent expects that if your MonitorService implementation sends data to a receiving service, it will start and stop its own thread that sends data on the expected or configured interval. If, instead, you are going to run a listening service such as the HTTP server, then start() and stop() are used to start and stop that service. The metrics themselves are published internally to JMX by the various sources, sinks, channels, and interceptors, using object names that start with org.apache.flume. Your implementation will need to read these from the platform MBeanServer.
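To make that concrete, here is a minimal sketch of such an implementation. The class name, package, and 60-second interval are my own inventions; it simply dumps every org.apache.flume MBean attribute to standard output, which is where you would instead push values to your monitoring system.

package com.example.flume.monitoring; // hypothetical package

import java.lang.management.ManagementFactory;
import java.util.Set;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import javax.management.MBeanAttributeInfo;
import javax.management.MBeanServer;
import javax.management.ObjectName;

import org.apache.flume.Context;
import org.apache.flume.instrumentation.MonitorService;

public class ConsoleMonitorService implements MonitorService {

  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();

  @Override
  public void start() {
    // Start the reporting thread; a real implementation would make the
    // interval configurable rather than hardcoding 60 seconds.
    scheduler.scheduleAtFixedRate(this::report, 0, 60, TimeUnit.SECONDS);
  }

  @Override
  public void stop() {
    scheduler.shutdown();
  }

  // Depending on your Flume version, MonitorService may also require a
  // configure() method; providing one here keeps the sketch compatible.
  public void configure(Context context) {
    // Read any custom settings here if needed.
  }

  private void report() {
    try {
      MBeanServer server = ManagementFactory.getPlatformMBeanServer();
      // Flume registers its metrics under object names starting with org.apache.flume.
      Set<ObjectName> names =
          server.queryNames(new ObjectName("org.apache.flume.*:*"), null);
      for (ObjectName name : names) {
        for (MBeanAttributeInfo attr : server.getMBeanInfo(name).getAttributes()) {
          Object value = server.getAttribute(name, attr.getName());
          // Replace this println with a push to your monitoring system.
          System.out.println(name + " " + attr.getName() + "=" + value);
        }
      }
    } catch (Exception e) {
      // Log and carry on; the next scheduled run will try again.
      e.printStackTrace();
    }
  }
}

You would then point the agent at it with -Dflume.monitoring.type=com.example.flume.monitoring.ConsoleMonitorService and make sure the class is on the agent's classpath.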

Note

The best advice I can give you, should you decide to implement your own, is to look at the source of two existing implementations (included in the source download referenced in Chapter 2, A Quick Start Guide to Flume) and do what they do. To use your monitoring hook, set the flume.monitoring.type property to the fully qualified class name of your implementation class. Expect to have to rework any custom hooks with new Flume versions until the framework matures and stabilizes.

Summary

In this chapter, we covered monitoring Flume agents both at the process level (is the agent running?) and at the level of internal metrics (is it actually moving data?).

Monit and Nagios were introduced as open source options for process watching.

Next, we covered the Flume agent's internal monitoring metrics using the Ganglia and JSON-over-HTTP implementations that ship with Apache Flume.

Finally, we covered how to integrate a custom monitoring implementation if you need to feed metrics directly into some other tool that Flume doesn't support by default.

In our final chapter, we will discuss some general considerations for your Flume deployment.