Apache Flume: Distributed Log Collection for Hadoop, Second Edition (2015)
Hadoop is a great open source tool for shifting tons of unstructured data into something manageable so that your business can gain better insight into your customers' needs. It's cheap (mostly free), scales horizontally as long as you have space and power in your datacenter, and can handle problems that would crush your traditional data warehouse. That said, a little-known secret is that your Hadoop cluster requires you to feed it data. Otherwise, you just have a very expensive heat generator! You will quickly realize (once you get past the "playing around" phase with Hadoop) that you will need a tool to automatically feed data into your cluster. In the past, you had to come up with a solution for this problem, but no more! Flume was started as a project out of Cloudera, when its integration engineers had to keep writing tools over and over again for their customers to automatically import data. Today, the project lives with the Apache Foundation, is under active development, and boasts of users who have been using it in their production environments for years.
In this book, I hope to get you up and running quickly with an architectural overview of Flume and a quick-start guide. After that, we'll dive deep into the details of many of the more useful Flume components, including the very important file channel for the persistence of in-flight data records and the HDFS Sink for buffering and writing data into HDFS (the Hadoop File System). Since Flume comes with a wide variety of modules, chances are that the only tool you'll need to get started is a text editor for the configuration file.
By the time you reach the end of this book, you should know enough to build a highly available, fault-tolerant, streaming data pipeline that feeds your Hadoop cluster.
What this book covers
Chapter 1, Overview and Architecture, introduces Flume and the problem space that it's trying to address (specifically with regards to Hadoop). An architectural overview of the various components to be covered in later chapters is given.
Chapter 2, A Quick Start Guide to Flume, serves to get you up and running quickly. It includes downloading Flume, creating a "Hello, World!" configuration, and running it.
Chapter 3, Channels, covers the two major channels most people will use and the configuration options available for each of them.
Chapter 4, Sinks and Sink Processors, goes into great detail on using the HDFS Flume output, including compression options and options for formatting the data. Failover options are also covered so that you can create a more robust data pipeline.
Chapter 5, Sources and Channel Selectors, introduces several of the Flume input mechanisms and their configuration options. Also covered is switching between different channels based on data content, which allows the creation of complex data flows.
Chapter 6, Interceptors, ETL, and Routing, explains how to transform data in-flight as well as extract information from the payload to use with Channel Selectors to make routing decisions. Then this chapter covers tiering Flume agents using Avro serialization, as well as using the Flume command line as a standalone Avro client for testing and importing data manually.
Chapter 7, Putting It All Together, walks you through the details of an end-to-end use case from the web server logs to a searchable UI, backed by Elasticsearch as well as archival storage in HDFS.
Chapter 8, Monitoring Flume, discusses various options available for monitoring Flume both internally and externally, including Monit, Nagios, Ganglia, and custom hooks.
Chapter 9, There Is No Spoon – the Realities of Real-time Distributed Data Collection, is a collection of miscellaneous things to consider that are outside the scope of just configuring and using Flume.
What you need for this book
You'll need a computer with a Java Virtual Machine installed, since Flume is written in Java. If you don't have Java on your computer, you can download it from http://java.com/.
You will also need an Internet connection so that you can download Flume to run the Quick Start example.
This book covers Apache Flume 1.5.2.
Who this book is for
This book is for people responsible for implementing the automatic movement of data from various systems to a Hadoop cluster. If it is your job to load data into Hadoop on a regular basis, this book should help you to code yourself out of manual monkey work or from writing a custom tool you'll be supporting for as long as you work at your company.
Only basic knowledge of Hadoop and HDFS is required. Some custom implementations are covered, should your needs necessitate them. For this level of implementation, you will need to know how to program in Java.
Finally, you'll need your favorite text editor, since most of this book covers how to configure various Flume components via an agent's text configuration file.
In this book, you will find a number of styles of text that distinguish between different kinds of information. Here are some examples of these styles, and explanations of their meanings.
Code words in text are shown as follows: "If you want to use this feature, you set the useDualCheckpoints property to true and specify a location for that second checkpoint directory with the backupCheckpointDir property."
A block of code is set as follows:
When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:
Any command-line input or output is written as follows:
$ tar -zxf apache-flume-1.5.2.tar.gz
$ cd apache-flume-1.5.2
New terms and important words are shown in bold. Words that you see on the screen, in menus or dialog boxes for example, appear in the text like this: "Flume was first introduced in Cloudera's CDH3 distribution in 2011."
Warnings or important notes appear in a box like this.
Tips and tricks appear like this.