Field Guide to Hadoop (2015)
Chapter 5. Analytic Helpers
Now that you’ve ingested data into your Hadoop cluster, what’s next? Usually you’ll want to start by simply cleansing or transforming your data. This could be as simple or reformatting fields and removing corrupt records or it could involve all manner of complex aggregation, enrichment, and summarization. Once you’ve cleaned up your data, you may be satisfied to simply push it into a more traditional data store, such as a relational database, and consider your big data work to be done. On the other hand, you may want to continue to work with your data, running specialized machine-learning algorithms to categorize your data or perhaps performing some sort of geospatial analysis.
In this chapter, we’re going to talk about two types of tools:
MapReduce interfaces
General-purpose tools that make it easier to process your data
Analytic libraries
Focused-purpose libraries that include functionality to make it easier to analyze your data
MapReduce Interfaces
In the early days of Hadoop, the only way to process the data in your system was to work with MapReduce in Java, but this approach presented a couple of major problems:
§ Your analytic writers need to not only understand your business and your data, but they also need to understand Java code
§ Pushing a Java archive to Hadoop is more time-consuming than simply authoring a query
For example, the process of developing and testing a simple analytic written directly in MapReduce might look something like the following for a developer:
1. Write about a hundred lines of Java MapReduce.
2. Compile the code into a JAR file (Java archive).
3. Copy the JAR file to cluster.
4. Run the analytic.
5. Find a bug, go back and write some more code.
As you can imagine, this process can be time-consuming, and tinkering with the code can disrupt thinking about the business problem. Fortunately, a robust ecosystem of tools to work with Hadoop and MapReduce have emerged to simplify this process and allow your analysts to spend more time thinking about the business problem at hand.
As you’ll see, these tools generally do a few things:
§ Provide a simpler, more familiar interface to MapReduce
§ Generate immediate feedback by allowing users to build queries interactively
§ Simplify complex operations
Analytic Libraries
While there is much analysis that can be done in MapReduce or Pig, there are some machine-learning algorithms that are distributed as part of Apache Mahout project. Some examples of the kinds of problems suited for Mahout are classification, recommendation, and clustering.
You point machine-learning algorithms at a dataset, and they “learn” something from the data. They fall into two classes: supervised and unsupervised. In supervised learning, the data typically has a set of observations and an outcome value. For example, clinical data about patients would be the observations, and an outcome value might be the presence of a disease. A supervised-learning algorithm, given a new patient’s clinical data, would try to predict the presence of a disease. Unsupervised algorithms do not use a given outcome, and instead attempt to find some hidden pattern in the data. For example, we could take a set of observations of clinical data from patients and try to see if they tend to cluster, so that points inside a cluster would be “close” to one another and the cluster centers would be far from one another. The interpretation of the cluster is not given by the algorithm and is left for the data analyst to discover. You can find the list of supported algorithms on the Mahout home page.
Recommendation algorithms determine the following: based on other people’s ratings, and the similarity of them to you, what would you be likely to rate highly?
Classification algorithms, given a set of observations on an individual, predict some unknown outcome. If the outcome is a binary variable, logistic regression can be used to predict the probability of that outcome. For example, given the set of lab results of a patient, predict the probability that the patient has a given disease. If the outcome is a numeric variable, linear regression can be used to predict the value of that outcome. For example, given this month’s economic conditions, predict the unemployment rate for next month.
Clustering algorithms don’t really answer a question. You frequently use them in the first stage of your analysis to get a feel for the data.
Data analytics is a deep topic—too deep to discuss in any detail here. O’Reilly has an excellent series of books on the topic of data analytics.
Most of the analytics just discussed deal with numerical or categorical data. Increasingly important in the Hadoop world are text analytics and geospatial analytics.
Pig
License |
Apache License, Version 2.0 |
Activity |
High |
Purpose |
High-level data flow language for processing data |
Official Page |
http://pig.apache.org |
Hadoop Integration |
Fully Integrated |
If MapReduce code in Java is the “assembly language” of Hadoop, then Pig is analogous to Python or another high-level language. Why would you want to use Pig rather than MapReduce? Writing in Pig may not be as performant as writing mappers and reducers in Java, but it speeds up your coding and makes it much more maintainable. Pig calls itself a data flow language in which datasets are read in and transformed to other datasets using a combination of procedural thinking as well as some SQL-like constructs.
Pig is so called because “pigs eat everything,” meaning that Pig can accommodate many different forms of input, but is frequently used for transforming text datasets. In many ways, Pig is an admirable extract, transform, and load (ETL) tool. Pig is translated or compiled into MapReduce code and it is reasonably well optimized so that a series of Pig statements do not generate mappers and reducers for each statement and then run them sequentially.
There is a library of shared Pig routines available in the Piggy Bank.
Tutorial Links
There’s a fairly complete guide to get you through the process of installing Pig and writing your first couple scripts. “Working with Pig” is a great overview of the Pig technology.
Example Code
The movie review problem can be expressed quickly in Pig with only five lines of code:
-- Read in all the movie review and find the average rating
for the film Dune
-- the file reviews.csv has lines of form:
name, film_title, rating
reviews = load ‘reviews.csv’ using PigStorage(',')
as (reviewer:chararray, title:chararray,rating:int);
-- Only consider reviews of Dune
duneonly = filter reviews by title == 'Dune';
-- we want to use the Pig builtin AVG function but
-- AVG works on bags, not lists, this creates bags
dunebag = group duneonly by title;
-- now generate the average and then dump it
dunescore = foreach dunebag generate AVG(dune.rating);
dump dunescore;
Hadoop Streaming
License |
Apache License, Version 2.0 |
Activity |
Medium |
Purpose |
Write MapReduce code without Java |
Official Page |
http://hadoop.apache.org/docs/r1.2.1/streaming.html |
Hadoop Integration |
Fully Integrated |
You have some data, you have an idea of what you want to do with it, you understand the concepts of MapReduce, but you don’t have solid Java or MapReduce expertise, and the problem does not really fit into any of the other major tools that Hadoop has to offer. Your solution may be Hadoop Streaming, which allows you to write code in any Linux program that reads from stdin and writes to stdout.
You still need to write mappers and reducers, but in the language of your choice. Your mapper will likely read lines from a text file and produce a key-value pair separated by a tab character. The shuffle phase of the process will be handled by the MapReduce infrastructure, and your reducer will read from standard input (stdin), do its processing, and write its output to standard output (stdout).
The reference in the following “Tutorial Links” section shows a WordCount application in Hadoop Streaming using Python.
Is Streaming going to be as performant as native Java code? Almost certainly not, but if your organization has Ruby or Python or similar skills, you will definitely yield better results than sending your developers off to learn Java before doing any MapReduce projects.
Tutorial Links
There’s an excellent overview of the technology as well as a tutorial available on this web page.
Example Code
We’ll use streaming to compute the average ranking for Dune. Let’s start with our small dataset:
Kevin,Dune,10
Marshall,Dune,1
Kevin,Casablanca,5
Bob,Blazing Saddles,9
The mapper function could be:
#! /usr/bin/python
import sys
for line in sys.stdin:
line = line.strip()
keys = line.split(',')
print( "%s\t%s" % (keys[1], keys[2]) )
The reducer function could be:
#!/usr/bin/python
import sys
count = 0
rating_sum = 0
for input_line in sys.stdin:
input_line = input_line.strip()
title, rating = input_line.split("\t", 1)
rating = float(rating)
if title == 'Dune':
count += 1
rating_sum += rating
dune_avg = rating_sum/count
print("%s\t%f" % ('Dune',dune_avg))
And the job would be run as:
hadoop jar contrib/streaming/hadoop-*streaming*.jar \
-file /home/hduser/mapper.py \
-mapper /home/hduser/movie-mapper.py \
-file /home/hduser/reducer.py \
-reducer /home/hduser/movie-reducer.py \
-input /user/hduser/movie-reviews-in/* \
-output /user/hduser/movie-reviews-out
producing the result:
Dune 5.500000
Mahout
License |
Apache License, Version 2.0 |
Activity |
High |
Purpose |
Machine learning and data analytics |
Official Page |
http://mahout.apache.org |
Hadoop Integration |
API Compatible |
You have a bunch of data in your Hadoop cluster. What are you going to do with it? You might want to do some analytics, or data science, or machine learning. Much of this can be done in some of the tools that come with the standard Apache distribution, such as Pig, MapReduce, or Hive. But more sophisticated uses will involve algorithms that you will not want to code yourself. So you turn to Mahout. What is Mahout? Mahout is a collection of scalable machine-learning algorithms that run on Hadoop. Why is it called Mahout? Mahout is the Hindi word for an elephant handler, as you can see from the logo. The list of algorithms is constantly growing, but as of March 2014, it includes the ones listed in Table 5-1.
Mahout algorithm |
Brief description |
k-means/fuzzy k-means clustering |
Clustering is dividing a set of observation into groups where elements in the group are similar and the groups are distinct |
Latent Dirichlet allocation |
LDA is a modelling technique often used for classifying documents predicated on the use of specific topic terms in the document |
Singular value decomposition |
SVD is difficult to explain succinctly without a lot of linear algebra and eigenvalue background |
Logistic-regression-based classifier |
Logistic regression is used to predict variables that have a zero-one value, such as presence or absense of a disease, or membership in a group |
Complementary naive Bayes classifier |
Another classification scheme making use of Bayes’ theorem (which you may remember from Statistics 101) |
Random forest decision tree-based classifier |
Yet another classifier based on decision trees |
Collaborative filtering |
Used in recommendation systems (if you like X, may we suggest Y) |
Table 5-1. Mahout MapReduce algorithms |
A fuller discussion of all these is well beyond the scope of this book. There are many good introductions to machine learning available. Google is your friend here.
In April 2014, the Mahout community announced that it was moving away from MapReduce to a domain-specific language (DSL) based on Scala to a Spark implementation (described here). Current MapReduce algorithms would continue to be supported, but additions to the code base could not be MapReduce based. In fact, in the latest release, the Mahout community had dropped support for some infrequently used routines.
Tutorial Links
The Mahout folks have an entire page of curated links to books, tutorials, and talks.
Example Code
The process of using Mahout to produce a recommendation system is too complex to present here. Mahout includes an example of a movie ratings recommendation system. The data is available via GroupLens Research.
MLLib
License |
Apache License, Version 2.0 |
Activity |
High |
Purpose |
Machine-learning tools for Spark |
Official Page |
https://spark.apache.org/mllib |
Hadoop Integration |
Fully Integrated |
If you’ve decided to invest in Spark but need some machine-learning tools, then MLLib provides you with a basic set. Similar in functionality to Mahout (described here), MLLib has an ever-growing list of modules that perform many tasks useful to data scientists and big data analytics teams. As of Version 1.2, the list includes (but is not limited to) those in Table 5-2. New algorithms are frequently added.
MLLib algorithm |
Brief description |
Linear SVM and logistic regression |
Prediction using continuous and binary variables |
Classification and regression tree |
Methods to classify data based on binary decisions |
k-means clustering |
Clustering is dividing a set of observation into groups where elements in the group are similar and the groups are distinct |
Recommendation via alternating least squares |
Used in recommendation systems (if you like X, you might like Y) |
Multinomial naive Bayes |
Classification based upon Bayes’ Theorem |
Basic statistics |
Summary statistics, random data generation, correlations |
Feature extraction and transformation |
A number of routines often used in text analytics |
Dimensionality reduction |
Reducing the number of variables in an analytic problem, often used when they are highly correlated |
Table 5-2. MLLib algorithms |
Again, as MLLib lives on Spark, you would be wise to know Scala, Python, or Java to do anything sophisticated with it.
You may wonder whether to choose MLLib or Mahout. In the short run, Mahout is more mature and has a larger set of routines, but the current version of Mahout uses MapReduce and is slower in general (though likely more stable). If the algorithms you need only exist today on Mahout, that solves your problem. Mahout currently has a much larger user community, so if you’re looking for online help with problems, you’re more likely to find it for Mahout. On the other hand, Mahout v2 will move to Spark and Scala, so in the long run, MLLib may well replace Mahout or they may merge efforts.
Tutorial Links
“MLLib: Scalable Machine Learning on Spark” is a thorough but rather technical tutorial that you may find useful.
Example Code
The AMPlab at Berkeley has some example code to do personalized movie recommendations based on collaborative filtering.
Hadoop Image Processing Interface (HIPI)
License |
BSD Simplified |
Activity |
Moderate |
Purpose |
Image Processing |
Official Page |
http://hipi.cs.virginia.edu/index.html |
Hadoop Integration |
API Compatible |
Image processing is a vastly overloaded term. It can mean something as simple as “cleaning up” your image by putting it into focus and making the boundaries more distinct. It can also mean determining what is in your image, or scene analysis. For example, does the image of a lung X-ray show a tumor? Does the image of cells collected in a Pap smear indicate potential cervical cancer? Or it can mean deciding whether a fingerprint image matches a particular image or is similar to one in a set of images.
HIPI is an image-processing package under development at the University of Virginia. While the documentation is sketchy, the main use is the examination of a collection of images and determining their similarity or dissimilarity. This package seems to assume a knowledge of image processing and specific technologies such as the exchangeable image file format (EXIF). As this is a university project, its future is unknown, but there seems to be a resurgence of activity expected in 2015, including Spark integration.
Tutorial Links
HIPI is still a fairly new technology; the best source of information at the moment is this thesis paper.
Example Code
A number of examples of HIPI usage can be found on the project’s official examples page.
SpatialHadoop
License |
Unknown |
Activity |
High |
Purpose |
Spatial Analytics |
Official Page |
http://spatialhadoop.cs.umn.edu |
Hadoop Integration |
API Compatible |
If you’ve been doing much work with spatial data, it’s likely you’re familiar with PostGIS, the open source spatial extension to the open source PostgreSQL database. But what if you want to work in a massively Hadoop environment rather than PostgreSQL? The University of Minnesota’s computer science department has developed SpatialHadoop, which is an open source extension to MapReduce designed to process huge datasets of spatial data in Hadoop. To use SpatialHadoop, you first load data into HDFS and then build a spatial index. Once you index the data, you can execute any of the spatial operations provided in SpatialHadoop such as range query, k-nearest neighbor, and spatial join.
There are high-level calls in SHadoop that generate MapReduce jobs, so it’s possible to use SHadoop without writing MapReduce code. There are clear usage examples at the website.
In addition to the MapReduce implementation, there is an extension to Pig, called Pigeon, that allows spatial queries in Pig Latin. This is available at the project page in GitHub. Pigeon has the stated goal of supporting as many of the PostGIS functions as possible. This is an ambitious but extremely useful goal because PostGIS has a wide following and the ST functions it supports make it fairly simple to do spatial analytics in a high-level language like Pig/Pigeon.
The code is all open source and available on GitHub.
Tutorial Links
The official project page has a handful of links to great tutorials.