

Part V. Big Data and SQL Server Together

Chapter 12. Big Data Analytics

What You Will Learn in This Chapter

· Exploring Data Mining and Predictive Analytics

· Using the Mahout Machine Learning Library

· Building a Recommendation Engine on Hadoop

Up to this point, the focus has been on building a foundation that enables you to capture and store large volumes of disparate data. When this data is collected, as you have seen in previous chapters, it can be easily summarized and aggregated using tools built into the Hadoop ecosystem.

Although this is noteworthy, it alone hardly justifies the time or investment required to implement a big data solution in your organization. The real value for businesses in bringing this data together is that it can be mined for hidden patterns, correlations, and other interesting information that can facilitate better business decision making.

This chapter covers how you can use HDInsight and Hadoop as a big data analytics platform by taking advantage of the Mahout machine learning library to deliver predictive analytics, such as implementing a recommendation engine, and to perform more common data mining in the form of clustering and classification.

Data Science, Data Mining, and Predictive Analytics

Included in just about every big data discussion, the art of data science is one that is built on multiple disciplines. Various skills involving mathematics, statistics, and computer science are combined to allow practitioners in this field, known as data scientists, to broadly explore data and patterns.

Two of the most common techniques used by data scientists to unveil these patterns are data mining and predictive analytics.

Data Mining

Also referred to as machine learning, data mining comes from the computer science field and is a broad term that represents techniques used to discover new information from existing data sets. For many experienced data warehouse and business intelligence professionals, this may be a subject that you are already familiar with.

For those who are not, the distinction is simple. Unlike more traditional business intelligence projects, which merely summarize or aggregate data in various slices, data mining uses specific algorithms to uncover data patterns or trends that would otherwise not be apparent.

To better understand how this applies to big data, review in Table 12.1 some of the more common types of data mining algorithms.

Table 12.1 Common Data Mining Algorithms

Algorithm

Description

Clustering

Rather than a specific algorithm, clustering is a category of unsupervised algorithms that group similar objects into groups or clusters. Typically, a notion of statistical distance, such as Euclidean distance, is used to calculate how far apart two objects are. The k-means algorithm, which groups objects around computed centroids using such a distance measure, is one of the best-known examples of this category.

Classification

Like clustering, classification is not a single algorithm but a group of algorithms that also group objects together. These algorithms are usually supervised and require both a classifier and a training data set that identifies, in machine learning terms, what the groups look like. Once trained, classification algorithms can classify each data point or object into the most appropriate group or bucket.

Regression

An outgrowth of statistics, the regression techniques are used to estimate the relationship between variables. Regression algorithms such as linear regression are commonly used for forecasting important business indicators such as sales volumes.

Association rules

These algorithms are used to explore or discover relationships between objects that exist in large data sets. Common examples of these algorithms include market basket analysis, clickstream analysis, and fraud detection.
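To ground the notion of statistical distance mentioned under clustering, here is a minimal sketch that computes the Euclidean distance between two objects represented as numeric feature vectors. The customer attributes are made-up illustrations, not drawn from any data set used in this chapter:

public class EuclideanDistanceExample {

    // Euclidean distance between two objects represented as numeric feature vectors
    static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        // Two hypothetical customers described by (age, annual spend in $1,000s)
        double[] customerA = {34, 5.2};
        double[] customerB = {36, 4.8};
        // Smaller distances mean more similar customers; prints roughly 2.04
        System.out.println(distance(customerA, customerB));
    }
}

Clustering algorithms such as k-means apply exactly this kind of calculation, repeatedly, to pull similar objects into the same cluster.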

Predictive Analytics

A specialization or subset of the broader data mining field, predictive analytics (like other forms of data mining) is concerned with identifying patterns. The difference, however, is that the patterns identified are used to make a prediction about some behavior or event based on the historical data fed to the model.

Examples of predictive analytics are everywhere around you. They are used to determine whether your loan application should be approved, they are the basis of your credit score, they help recommend what movie you should watch or what book you buy, and they often determine whether you are included in a marketing campaign by your favorite retailer.

Taken together, this field is one of the primary driving factors behind the interest in big data, and it offers huge opportunities for the businesses that capitalize on it by transforming their data into a competitive differentiator. To paint a clearer picture, let's look at three types of models used in predictive analytics: predictive models, descriptive models, and decision models.

Predictive Models

Predictive models use historic data to make predictions on how likely a defined behavior or action is to occur. These models are often used to improve marketing outcomes by predicting buying behaviors and are also often used to assess credit risks and in fraud-detection systems.

Descriptive Models

Often, a business will want to identify how groups of customers behave. Behavior in this regard is often used to identify customers who are loyal, profitable, or at risk of cancelling your product or service. Predicting and proactively handling customer churn is a common scenario where descriptive models excel. Instead of predicting the actions of a single customer, descriptive models identify the many relationships between products and customers and are used to group like customers together.

Decision Models

The basis of a decision model is to improve the outcome of some well-defined business decision, taking into account multiple variables. This type of model combines known data (often including historical decision-making data), the business decision itself, and the probability of some result given the variables or factors involved. The results of these models are often used to write the business rules found in line-of-business applications.

With a high-level understanding of the core concepts behind data mining and predictive analytics, we can turn our attention to looking at how Mahout allows you to easily integrate some of these techniques into your big data architecture.

Introduction to Mahout

The preceding section introduced you at a high level to both data mining and predictive analytics and how they apply to big data. If at this point you are worried that you don't possess the skills or background to successfully build and deliver this type of intelligence within your HDInsight platform, fear not!

The remainder of this chapter introduces you to the Mahout machine learning library and explains how you can use it to deliver meaningful big data analytical solutions without a PhD in statistics or mathematics. So, what is this Mahout thing?

Mahout is an open source, top-level Apache project that encapsulates multiple machine learning algorithms into a single library. Like its Hadoop counterpart, the Mahout community is a vibrant and active community that has continually expanded and improved on Mahout.

For a historical perspective, the Mahout project grew out of two separate projects: Apache Lucene (an open source text indexing project) and Taste (an open source Java library of machine learning algorithms).

Mahout supports two basic implementations. The first is a non-distributed, real-time implementation that involves native, non-Hadoop Java calls directly to the Mahout library. The second scenario, the one we are focused on, is accomplished in a distributed, batch-processing manner using Hadoop. Both of these scenarios abstract away the complexity of the underlying machine learning algorithms.
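To make the first scenario concrete, here is a minimal sketch of an in-process, user-based recommender built on Mahout's Taste API. The class names come from the 0.7 library discussed below; the input file and ID values are placeholders, and the neighborhood size of 10 is an arbitrary starting point:

import java.io.File;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class InProcessRecommender {
    public static void main(String[] args) throws Exception {
        // Ratings file in the userID,itemID,rating format described later in this chapter
        DataModel model = new FileDataModel(new File("MovieRatings.csv"));
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        // Restrict comparisons to the 10 most similar users
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
        // Top five recommendations for the user with ID 1
        for (RecommendedItem item : recommender.recommend(1, 5)) {
            System.out.println(item.getItemID() + ": " + item.getValue());
        }
    }
}

The distributed jobs demonstrated later in this chapter implement the same ideas as prebuilt MapReduce jobs, so no Java code is required there.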

Within the context of big data, Mahout is built around four primary use cases:

· Collaborative filtering (recommendation mining based on user behavior)

· Clustering (grouping similar documents)

· Classification (assigning uncategorized documents to predefined categories)

· Frequent item set mining (market basket analysis)

To get started with the Apache Mahout library, you first need to download the project distribution; it is not included by default with the HDInsight on Azure platform. To download the required files, visit the Apache project site at http://mahout.apache.org/. As of this writing, the currently supported version is 0.7. After you download the mahout-distribution-0.7.zip file, extract its contents using your preferred compression utility.

NOTE

If you are running the Hortonworks Data Platform on premises instead of using HDInsight on Windows Azure, you will find that Mahout is included and ready to use. No further action is required.

In the decompressed folder, you'll find the mahout-core-0.7-job.jar file and the mahout-examples-0.7-job.jar file (which contains a number of prebuilt samples). The mahout-core-0.7-job.jar file contains prebuilt Hadoop jobs for each of the use cases previously mentioned. These jobs do not require any coding and will generate and run the required MapReduce jobs to implement machine learning algorithms in a distributed environment. To use this jar file, you must upload it either directly to your cluster using a Remote Desktop connection or to the Azure Blob Storage account connected to your Azure HDInsight cluster.

NOTE

This section, and the remainder of this chapter, assumes that you have an HDInsight on Windows Azure cluster set up. If you do not have a cluster set up and configured, see Chapter 3, “Configuring Your First Big Data Environment,” for further information.

Building a Recommendation Engine

What caused you to pick up this book? What about the last movie you saw or maybe even the last item of clothing you purchased? Every decision that anyone makes is inevitably based on some preconceived (and often unconscious) opinion. Every day, our opinions develop and become part of the ever-larger library of unconscious factors from which we “borrow” as we face decisions.

Deciding what you like, what you don't, and in some cases what you are indifferent to drives nearly all the purchasing decisions you'll make through your lifetime. Successfully predicting these outcomes is the basis of a recommendation engine.

To set a foundation, let's define what a recommendation engine is and how it works. We must start with an assumption: people with similar interests share common preferences. This is a straightforward idea, demonstrable by simply looking around at your network of friends and family. Note that this doesn't imply that the assumption always holds true, but it holds well enough to produce meaningful and useful recommendations.

This assumption is the basis of simple recommendation engines and will allow us to generate a recommendation, but what is a recommendation? For most, the first thought is some product, good, or service.

However, the recommendation can also be a person if the recommendation engine is implemented on a social media or Internet dating website, for instance. The truth is that the recommendation can be anything, because the recommendation engine doesn't understand the concept of physical things, which is why you will commonly see the subject of a recommendation referred to as simply an item.

Recommendation engines, for our purposes, use one of two paradigms: collaborative filtering or clustering. Collaborative filtering is highly dependent both on the assumption previously discussed and on historical data.

This historical data records interaction or behavior with the items you are attempting to generate recommendations for. The data that can be used is often split into two distinct groups: explicit and implicit. Explicit data is preference information that users state directly, such as product ratings or written feedback. Implicit data, in contrast, is inferred from user behavior, such as purchase or click history. Table 12.2 provides a more complete list of examples for each category.

NOTE

One of the most difficult parts of building a recommendation engine is determining how to quantify and weight non-numeric explicit data. Determining the right balance and scale is as much art as science and requires some experimentation.

In the case of implicit data, when used singularly (that is, only purchase history or only click history), it represents a Boolean data type. It is not necessary to represent the negative cases in these models because missing data translates to false. When multiple implicit data points are combined, it is necessary to scale them appropriately.
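When your input carries only such Boolean signals, the distributed RecommenderJob introduced later in this chapter accepts a booleanData switch that tells it to ignore preference values. The following is a hedged sketch only; the PurchaseHistory.csv file is hypothetical, log-likelihood is a common metric choice for Boolean data, and you should verify the flags against your Mahout version:

hadoop jar c:\mahout\mahout-core-0.7-job.jar
org.apache.mahout.cf.taste.hadoop.item.RecommenderJob
-s SIMILARITY_LOGLIKELIHOOD
--booleanData
--input=/user/<YOUR USERNAME>/chapter15/input/PurchaseHistory.csv
--output=/user/<YOUR USERNAME>/chapter15/output/booleanrecommendations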

Table 12.2 Examples of Explicit and Implicit Data

Explicit

· Ratings
· Feedback
· Demographics
· Psychographics (personality/lifestyle/attitude)
· Ephemeral need (need for a moment)

Implicit

· Purchase history
· Clicks
· Browse history

Clustering, unlike collaborative filtering, focuses instead on an item's taxonomies, attributes, description, or properties. It does not need behavioral or interaction data, and is often a good choice when the data required for collaborative filtering is not available.

To generate recommendations, Mahout supports collaborative filtering and clustering. To understand how each is applied, we can look at the three common recommendation engine implementations:

1. User-to-user collaborative filtering: In a user-to-user recommendation implementation, clusters or neighborhoods of similar users are formed based on some user behavior (for instance, purchasing an item or attending a movie). Because similar users are clustered together, these clusters are then used to generate recommendations.

2. Item-to-item collaborative filtering: The item-to-item recommendation implementation works in a similar manner to that of the user-to-user implementation, except that it works from the context of an item. Instead of looking at similar users, it uses behavior or interaction with an item to determine which item to recommend. Let's illustrate this with a more concrete example. If we were evaluating purchase history and everyone who bought the Star Wars trilogy also purchased the Lord of the Rings boxed set, we would generate a recommendation for Lord of the Rings whenever a customer purchases Star Wars.

3. Content-based clustering: Content-based recommendation engines function in a different manner than the two previously discussed. Instead of looking at behavior or interactions, content-based systems use attributes associated with an item. Item attributes can be anything from physical attributes that describe an item to features or behaviors when describing technology or even people. Examples of this methodology are widely available and can be found on your favorite news aggregator website when similar articles are recommended.

Now that you have a basic understanding, let's look at the specifics for building a recommendation engine using Mahout and HDInsight.

Getting Started

For this demonstration, we are using the GroupLens data set (http://grouplens.org/datasets/movielens/). This data set is publicly available and at its largest contains more than 10 million user movie reviews (10,000 movies and 72,000 users). To work with this data set, you must process the data set into a format that Mahout understands.

For the built-in recommenders, the expected data file format is a comma-separated list containing the following data points:

<Unique User ID>, <Unique Item ID>, <Numeric Rating>

The format of the GroupLens data set is tab-delimited and contains an additional Timestamp column. If you choose, you could process this data set into the correct format using a C# utility application or a MapReduce job.
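As a sketch of that preprocessing step, here is a Java equivalent of the C# utility just mentioned, under the assumption that the raw file is tab-delimited with the rating in the third column; both file names are placeholders:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.PrintWriter;

public class RatingsConverter {
    public static void main(String[] args) throws Exception {
        // Read the tab-delimited GroupLens data (userID, itemID, rating, timestamp)
        // and write the comma-separated userID,itemID,rating format Mahout expects.
        try (BufferedReader in = new BufferedReader(new FileReader("ratings.txt"));
             PrintWriter out = new PrintWriter("MovieRatings.csv")) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] fields = line.split("\t");
                // Drop the trailing Timestamp column
                out.println(fields[0] + "," + fields[1] + "," + fields[2]);
            }
        }
    }
}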

NOTE

A preprocessed file called MovieRatings.csv is supplied and can be downloaded with the Chapter 12 materials from http://www.wiley.com/go/microsoftbigdatasolutions.

After you have your source data file, you must upload it to your HDInsight cluster. To upload the data file to your cluster and add it to HDFS, follow these steps:

1. Connect to the head node of your HDInsight cluster using Remote Desktop. If Remote Desktop has not been enabled, enable it through the Configuration page and set up an account for Remote Access.

2. Copy and paste the MovieRatings.csv file from your local machine to the c:\temp folder on the head node of your HDInsight cluster. If the c:\temp folder does not exist, you will need to create it first.

3. On the desktop, double-click the Hadoop command prompt to open the Hadoop command line.

4. We will use the Hadoop File System shell to copy the data file from the local file system to the Hadoop Distributed File System (HDFS). Enter the following command at the command prompt:

hadoop fs -copyFromLocal c:\temp\MovieRatings.csv /user/<YOUR USERNAME>/chapter15/input/MovieRatings.csv

5. After the command completes, use the following command to verify that the MovieRatings.csv file now exists within HDFS:

hadoop fs -ls /user/<YOUR USERNAME>/chapter15/input/

The data input required to run recommendation jobs using Mahout is now available and ready for processing on HDInsight.

Running a User-to-user Recommendation Job

The first job that we will run uses collaborative filtering to generate user-to-user recommendations for the MovieLens data set. Be sure that you have deployed the Mahout jar files to your HDInsight cluster before continuing.

Before starting the job, let's delve into what's going to happen once we start the recommendation job. First, we will iterate through each user, finding movies that the user has not previously rated. Next, for each movie that the user has not yet reviewed, we will find other users who have reviewed the movie. We will use a statistical measure to calculate the similarity between each such pair of users and then use that similarity to estimate the preference in the form of a weighted average.

At a high level, we can distill this logic of the recommendation job to the following pseudo-code:

for each item i that user u has no preference for
  for each user v that has a preference for i
    compute similarity s between u and v
    incorporate v's preference for i, weighted by s, into a running average
return the top-ranked (by weighted average) items i

Note that this is a drastic simplification of what is really occurring behind the scenes. In fact, we've omitted a key step. If the preceding logic were implemented as is, it would not perform efficiently and would suffer at scale. To remedy this, we can introduce the concept of neighborhoods, or clusters of similar users, to limit the number of similarity comparisons that need to be made.

A detailed explanation of how this works is beyond the scope of this book, but you can find ample material about how Mahout handles neighborhood formation and similarity calculations on the Mahout website.

With a high-level understanding of how user-to-user recommendations are generated, use the hadoop jar command at the Hadoop command line to start the RecommenderJob:

hadoop jar c:\mahout\mahout-core-0.7-job.jar

org.apache.mahout.cf.taste.hadoop.item.RecommenderJob

-s SIMILARITY_PEARSON_CORRELATION

--input=/user/<YOUR USERNAME>/chapter15/input/MovieRatings.csv

--output=/user/<YOUR USERNAME>/chapter15/output/

userrecommendations

NOTE

The preceding command assumes that you have placed the Mahout files in a local directory called c:\Mahout. If your files are located in another directory, adjust the command so that the path to the Mahout *.jar files is correct before using the command.

The RecommenderJob is a prebuilt Hadoop job that accepts multiple arguments. At a minimum, you must specify the similarity metric used to determine similarity between users, the path to your input data file, and the path for your recommendation job output.

A number of other job parameters control how the RecommenderJob behaves and will allow you to customize it without investing time or effort in coding. Table 12.3 describes these additional job parameters.

Table 12.3 RecommenderJob Parameters

Parameter

Description

input

Directory containing the preference data

output

Output directory for the recommender results

similarityClassname

Name of the vector similarity class to use

usersFile

File of user IDs for which to compute recommendations

itemsFile

File of item IDs allowed in the recommendations

filterFile

File of item/user pairs to exclude from the recommendations

numRecommendations

Number of recommendations computed per user

booleanData

Treat the input data as having no preference values

maxPrefsPerUser

Maximum number of preferences considered per user

maxSimilaritiesPerItem

Maximum number of similarities considered per item

minPrefsPerUser

Ignore users with fewer preferences than this in the similarity computation

maxPrefsPerUserInItemSimilarity

Maximum number of preferences to consider per user in the item similarity computation

threshold

Discard item pairs with a similarity value below this threshold

Here is a list of the available similarity metrics:

· Euclidean distance

· Spearman correlation

· Cosine

· Tanimoto coefficient

· Log-likelihood

· Pearson correlation
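Each metric maps to a SIMILARITY_-prefixed name passed via the -s argument (shorthand for --similarityClassname), as in the earlier command. As a sketch that also demonstrates one of the optional parameters from Table 12.3 (the names shown match the Mahout 0.7 distribution, but verify them against your version), the following run swaps in log-likelihood similarity and caps output at 10 recommendations per user:

hadoop jar c:\mahout\mahout-core-0.7-job.jar
org.apache.mahout.cf.taste.hadoop.item.RecommenderJob
-s SIMILARITY_LOGLIKELIHOOD
--numRecommendations=10
--input=/user/<YOUR USERNAME>/chapter15/input/MovieRatings.csv
--output=/user/<YOUR USERNAME>/chapter15/output/loglikelihoodrecommendations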

After the job is started, it will take between 15 and 20 minutes to run if you are using a four-node cluster. During this time, a series of MapReduce jobs runs to process the data and generate the movie recommendations. When the job has completed, you can view the various output files using the following command:

hadoop fs -ls /user/<YOUR USERNAME>/chapter15/output/userrecommendations

You can find the generated recommendations in the part-r-00000 file. To export the file from HDFS to your local file system, use the following command:

hadoop fs -copyToLocal

/user/<YOUR USERNAME>/chapter15/output/userrecommendations/part-r-00000

c:\<LOCAL OUTPUT DIRECTORY>\recommendations.csv

You can review the file to find the recommendation generated for each user. The output from the recommendation job takes the following format:

UserID [ItemID:Estimated Rating, ...]

An example of the output is shown here:

1 [1566:5.0,1036:5.0,1033:5.0,1032:5.0,1031:5.0,3107:5.0]

In this example, for the user identified by the ID of 1, we would recommend the movies identified by the IDs 1566 (The Man from Down Under), 1036 (Drop Dead Fred), 1033 (Homeward Bound II: Lost in San Francisco), and so on. The estimated rating for each of these movies for this specific user is 5.0. You can cross-reference these IDs with the data file provided as part of the GroupLens data set download.
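If you want to consume these results programmatically rather than by eye, the output format is simple to parse. Here is a minimal sketch that splits each line into a user ID and its item:rating pairs; the input file name is the placeholder used in the earlier copyToLocal command:

import java.io.BufferedReader;
import java.io.FileReader;

public class RecommendationReader {
    public static void main(String[] args) throws Exception {
        try (BufferedReader in = new BufferedReader(new FileReader("recommendations.csv"))) {
            String line;
            while ((line = in.readLine()) != null) {
                // Each line looks like: <userID><TAB>[itemID:rating,itemID:rating,...]
                String[] parts = line.split("\t");
                String userId = parts[0];
                String[] pairs = parts[1].replace("[", "").replace("]", "").split(",");
                for (String pair : pairs) {
                    String[] itemAndRating = pair.split(":");
                    System.out.println("User " + userId + ": recommend item "
                            + itemAndRating[0] + " (estimated rating " + itemAndRating[1] + ")");
                }
            }
        }
    }
}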

Running an Item-to-item Recommendation Job

In the previous example, we used the GroupLens data set to generate recommendations by calculating similarity between users. In this demonstration, we instead use the notion of item similarity to determine our item recommendations.

For this exercise, you will reuse the GroupLens data set, because the format and data requirements for the item-to-item RecommenderJob are the same. In fact, a significant amount of overlap exists between the two jobs, including the job parameters.

In the user-to-user example, the Mahout library uses a similarity metric to form neighborhoods or clusters and then makes recommendations based on reviews by statistically similar users. The item-to-item recommender takes a different approach, instead focusing on items (or in our case, movies).

Much like the former example, the item-to-item recommender must calculate the similarity between movies. To accomplish this, the recommender uses both user reviews and the co-occurrence of movie reviews by users to determine this similarity score. Using this notion of similarity, the job can then generate recommendations based on the provided input.
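For comparison with the in-process sketch shown earlier in the chapter, here is how the same item-centric idea looks in Mahout's Taste API; the file path is again a placeholder. Note that no user neighborhood is required, because similarity is computed between items rather than between users:

import java.io.File;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

public class InProcessItemRecommender {
    public static void main(String[] args) throws Exception {
        DataModel model = new FileDataModel(new File("MovieRatings.csv"));
        // PearsonCorrelationSimilarity implements ItemSimilarity as well as UserSimilarity
        ItemSimilarity similarity = new PearsonCorrelationSimilarity(model);
        GenericItemBasedRecommender recommender =
                new GenericItemBasedRecommender(model, similarity);
        // Top five item-based recommendations for the user with ID 1
        for (RecommendedItem item : recommender.recommend(1, 5)) {
            System.out.println(item.getItemID() + ": " + item.getValue());
        }
    }
}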

To generate item-based recommendations, follow these steps:

1. Open the Hadoop command-line console.

2. Mahout uses temporary storage for the intermediate files produced by its intermediate MapReduce jobs. Before you can run a new Mahout job, you need to purge this temporary directory. Use the following command to delete the files in the temporary directory:

hadoop fs -rmr -skipTrash /user/<USER>/temp

3. Enter the Mahout item recommender command to kick off the item-based RecommenderJob:

hadoop jar c:\mahout\mahout-core-0.7-job.jar
org.apache.mahout.cf.taste.hadoop.item.RecommenderJob
-s SIMILARITY_PEARSON_CORRELATION
--input=/user/<USER NAME>/chapter15/input/MovieRatings.csv
--output=/user/<USER NAME>/chapter15/output/itemrecommendations

The item-based RecommenderJob, like the user-based recommender job, will take some time to run to completion. After the job successfully completes, you can browse the results in the /user/<USER NAME>/chapter15/output/itemrecommendations folder in HDFS.

Summary

Big data analytics, and more specifically data mining and predictive analytics, represent the biggest and most potent parts of your big data platform. Taking advantage of this data to gain new insights, identify patterns, and surface new and interesting information will provide a competitive advantage for those businesses that take the leap.

HDInsight and the Mahout machine learning library make this area approachable by abstracting away the complexity (mathematics and statistics!) generally associated with data mining and predictive analytics.

Mahout provides implementations for clustering, classifying, and (as demonstrated) generating recommendations using the concept of collaborative filtering.