Pig Design Patterns (2014)

Preface

This book is a practical guide to realizing the power of analytics in Big Data. It walks the Big Data technologist in you through the process of getting the data ready, applying analytics, and creating value out of the data, all using appropriate design patterns in Pig. We chose Pig to demonstrate how useful it is, which is evident from the following:

· Pig's simple language constructs, which can be learned very easily, together with its extensibility and applicability to both structured and unstructured Big Data, make it the preferred choice over other tools.

· The ease and speed with which patterns can be implemented in Pig to derive meaning out of the apparent randomness in any Big Data is commendable.

· This book guides system architects and developers so they become more proficient at creating complex analytics solutions using Pig. It does so by exposing them to a variety of Pig design patterns, UDFs, tools, and best practices.

By reading this book, you will achieve the following goals:

· Simplify the process of creating complex data pipelines by applying Pig design patterns for data movement across platforms, data ingestion, profiling, validation, transformation, data reduction, and egress

· Create solutions that use patterns for exploratory analysis of multistructured unmodeled data to derive structure from it and move the data to downstream systems for further analysis

· Decipher how Pig can coexist with other tools in the Hadoop ecosystem to create Big Data solutions using design patterns

What this book covers

Chapter 1, Setting the Context for Design Patterns in Pig, lays a basic foundation for design patterns, Hadoop, MapReduce, and its ecosystem components, gradually introducing Pig, its dataflow paradigm, and the language constructs and concepts, with a few basic examples required to make Pig work. It sets the context for understanding the various workloads Pig is most suitable for and how Pig scores better. This chapter is more of a quick practical reference and points to additional references if you are motivated enough to know more about Pig.

Chapter 2, Data Ingest and Egress Patterns, explains the data ingest and egress design patterns that deal with a variety of data sources. The chapter includes specific examples that illustrate the techniques to integrate with external systems that emit multistructured and structured data, using Hadoop as a sink to ingest them. This chapter also explores patterns that output the data from Hadoop to external systems. To explain these ingest and egress patterns, we have considered multiple sources and formats, which include, but are not limited to, logfiles, JSON, XML, MongoDB, Cassandra, HBase, and other common structured data sources. After reading this chapter, you will be better equipped to program patterns related to ingest and egress in your enterprise context, and will be capable of applying this knowledge to use the right Pig programming constructs or write your own UDFs to accomplish these patterns.
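
To give a flavor of such a pattern, the following is a minimal ingest-and-egress sketch using Pig's built-in JsonLoader (available from Pig 0.10 onward) and PigStorage; the input file, schema, and output path are hypothetical:

-- Ingest: load JSON events, declaring the expected schema ('events.json' is a hypothetical file).
events = LOAD 'events.json' USING JsonLoader('user:chararray, url:chararray, ts:long');

-- Egress: write the relation back out as comma-delimited text.
STORE events INTO 'events_out' USING PigStorage(',');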

Chapter 3, Data Profiling Patterns, focuses on the data profiling patterns applied to a multitude of data formats and realizing these patterns in Pig. These patterns include different approaches to using Pig and applying basic and innovative statistical techniques to profile data and find data quality issues. You will learn about ways to program similar patterns in your enterprise context using Pig and write your own UDFs to extend these patterns.
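
As a taste of what a profiling pattern can look like, here is a minimal sketch that computes row counts, non-null counts, and the number of distinct values for one column; the relation, file, and field names are hypothetical:

-- Hypothetical input: delimited transactions with a 'category' column.
txns = LOAD 'transactions.csv' USING PigStorage(',')
       AS (id:long, category:chararray, amount:double);

-- Row count and non-null count for 'category' (COUNT skips nulls, COUNT_STAR does not).
grouped = GROUP txns ALL;
counts = FOREACH grouped GENERATE COUNT_STAR(txns) AS total_rows,
                                  COUNT(txns.category) AS non_null_categories;

-- Number of distinct category values.
cats = FOREACH txns GENERATE category;
uniq = DISTINCT cats;
uniq_grouped = GROUP uniq ALL;
uniq_count = FOREACH uniq_grouped GENERATE COUNT_STAR(uniq) AS distinct_categories;

DUMP counts;
DUMP uniq_count;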

Chapter 4, Data Validation and Cleansing Patterns, is about the data validation and cleansing patterns that are applied to various data formats. The data validation patterns deal with constraints, regex, and other statistical techniques. The data cleansing patterns deal with simple filters, Bloom filters, and other statistical techniques to make the data ready for the transformations that follow.
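
For example, a simple validation-and-cleansing sketch in Pig might combine a regular expression constraint with null filtering; the file, field names, and regex are hypothetical:

-- Hypothetical input: customer records with an email column.
customers = LOAD 'customers.csv' USING PigStorage(',')
            AS (id:long, name:chararray, email:chararray);

-- Validation: keep rows whose key fields are present and whose email matches a simple pattern.
valid = FILTER customers BY (id IS NOT NULL) AND (email MATCHES '.+@.+\\..+');

-- Cleansing: trim and lowercase names before passing the data on.
cleansed = FOREACH valid GENERATE id, LOWER(TRIM(name)) AS name, email;
STORE cleansed INTO 'customers_clean';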

Chapter 5, Data Transformation Patterns, deals with data transformation patterns applied to a wide variety of data types ingested into Hadoop. After reading this chapter, you will be able to choose the right pattern for basic transformations and learn about widely used concepts such as joins, summarization, aggregates, cubes, rollups, generalization, and attribute construction, using Pig's programming constructs and UDFs where necessary.
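
As a small illustration, the following sketch joins two relations and then uses the CUBE operator introduced in Pig 0.11 to aggregate across combinations of dimensions; all file, relation, and field names are assumptions made for illustration:

-- Hypothetical sales and store relations.
sales = LOAD 'sales.csv' USING PigStorage(',') AS (store_id:int, product:chararray, amount:double);
stores = LOAD 'stores.csv' USING PigStorage(',') AS (store_id:int, region:chararray);

-- Transformation: join, then aggregate over all combinations of region and product.
joined = JOIN sales BY store_id, stores BY store_id;
dims = FOREACH joined GENERATE stores::region AS region, sales::product AS product, sales::amount AS amount;
cubed = CUBE dims BY CUBE(region, product);
summary = FOREACH cubed GENERATE FLATTEN(group) AS (region, product), SUM(cube.amount) AS total_sales;
STORE summary INTO 'sales_cube';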

Chapter 6, Understanding Data Reduction Patterns, explains the data reduction patterns applied to the already ingested, scrubbed, and transformed data. After reading this chapter, you will be able to understand and use patterns for dimensionality reduction, sampling techniques, binning, clustering, and irrelevant attribute reduction, thus making the data ready for analytics. This chapter explores various techniques using the Pig language and extends Pig's capability to provide sophisticated usages of data reduction.
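
To illustrate the flavor of a reduction pattern, the following sketch draws a random sample and bins a numeric attribute into coarse bands; the file, sampling fraction, and bin boundaries are hypothetical:

-- Hypothetical input with a numeric 'age' column.
people = LOAD 'people.csv' USING PigStorage(',') AS (id:long, age:int, income:double);

-- Sampling: keep roughly 10 percent of the rows.
sampled = SAMPLE people 0.1;

-- Binning: collapse 'age' into bands with a nested conditional expression.
binned = FOREACH sampled GENERATE id,
    (age < 18 ? 'minor' : (age < 65 ? 'adult' : 'senior')) AS age_band,
    income;
STORE binned INTO 'people_reduced';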

Chapter 7, Advanced Patterns and Future Work, deals with advanced data analytics patterns. These patterns cover the extensibility of the Pig language and explain, with use cases, the methods of integrating with executable code, MapReduce code written in Java, UDFs from PiggyBank, and other sources. The advanced analytics patterns cover natural language processing, clustering, classification, and text indexing.
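
As a hint of the extensibility these patterns rely on, the following sketch registers the PiggyBank JAR and streams records through an external executable; the JAR path and the tokenize.py script are hypothetical and depend on your environment:

-- Register PiggyBank and give one of its string UDFs a short alias (the path is hypothetical).
REGISTER /usr/lib/pig/piggybank.jar;
DEFINE PBUpper org.apache.pig.piggybank.evaluation.string.UPPER();

raw = LOAD 'docs.txt' AS (line:chararray);
upper_lines = FOREACH raw GENERATE PBUpper(line);

-- Stream records through an external executable; tokenize.py is a hypothetical script.
DEFINE tokenize `tokenize.py` SHIP('tokenize.py');
tokens = STREAM raw THROUGH tokenize AS (token:chararray);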

Motivation for this book

The inspiration for writing this book has its roots in the job I do for a living, that is, heading the enterprise practice for Big Data where I am involved in the innovation and delivery of solutions built on the Big Data technology stack.

As part of this role, I am involved in the piloting of many use cases, solution architecture, and the development of multiple Big Data solutions. In my experience, Pig has been a revelation of sorts, and it has a tremendous appeal for users who want to quickly pilot a use case and demonstrate value to the business. I have used Pig to prove rapid gains and solve problems that required a not-so-steep learning curve. At the same time, I have found that the documented knowledge of using Pig in enterprises was nonexistent in some cases and widely scattered where it was available. I personally felt the need for a use-case-pattern-based reference book. Through this book, I wanted to share my experiences and lessons, and communicate to you the usability and advantages of Pig for solving common problems from a patterns viewpoint.

One of the other reasons I chose to write about Pig's design patterns is that I am fascinated with the Pig language: its simplicity, versatility, and extensibility. My constant search for repeatable patterns for implementing Pig recipes in an enterprise context has inspired me to document them for wider usage. I wanted to spread the best practices that I learned while using Pig by contributing to a pattern repository for Pig. I'm intrigued by the unseen possibilities of using Pig in various use cases, and through this book, I plan to stretch the limits of its applicability even further and make Pig more pleasurable to work with.

This book portrays the practical and implementation side of learning Pig. It provides specific reusable solutions to commonly occurring challenges in Big Data enterprises. Its goal is to guide you to quickly map the usage of Pig to your problem context and to design end-to-end Big Data systems from a design pattern outlook.

In this book, a design pattern is a group of enterprise use cases logically tied together so that they can be broken down into discrete solutions that are easy to follow and addressable through Pig. These design patterns address common enterprise problems involved in the creation of complex data pipelines, ingress, egress, transformation, iterative processing, merging, and analysis of large quantities of data.

This book enhances your capability to make better decisions on the applicability of a particular design pattern and use Pig to implement the solution.

Pig Latin has been the language of choice for implementing complex data pipelines, iterative processing of data, and conducting research. All of these use cases involve sequential steps in which data is ingested, cleansed, transformed, and made available to downstream systems. The successful creation of an intricate pipeline that integrates skewed data from multiple data platforms with varying structures forms the cornerstone of any enterprise that leverages Big Data and creates value out of it through analytics.

This book enables you to use these design patterns to simplify the creation of complex data pipelines using Pig: ingesting data from multiple sources, then cleansing, profiling, validating, transforming, and finally presenting large volumes of data.

This book provides in-depth explanations and code examples using Pig, along with the integration of UDFs written in Java. Each chapter contains a set of design patterns that pose and then solve technical challenges relevant to enterprise use cases. The chapters are relatively independent of each other and can be read in any order, since they address design patterns specific to a set of common steps in the enterprise. As an illustration, a reader who is looking to solve a data transformation problem can directly access Chapter 5, Data Transformation Patterns, and quickly start using the code and explanations in that chapter. The book recommends that you use these patterns to solve the same or similar problems you encounter, and create your own patterns if a given design pattern is not suitable in a particular case.

This book's intent is not to be a complete guide to Pig programming but to be a reference book that brings a design patterns perspective to applying Pig. It also intends to empower you to make creative use of the design patterns and build interesting mashups with them.

What you need for this book

You will need access to a single machine (VM) or a multinode Hadoop cluster to execute the Pig scripts given in this book. It is expected that the tools needed to run Pig are already configured. We have used Pig 0.11.0 to test the examples in this book, and it is highly recommended that you have this version installed.

The code for the UDFs in this book is written in different languages such as Java; therefore, it is advisable for you to have access to a machine with development tools (such as Eclipse) that you are comfortable with.

It is recommended to use Pig Pen (Eclipse plugin) on the developer's machine for developing and debugging Pig scripts.

Pig Pen can be downloaded from https://issues.apache.org/jira/secure/attachment/12456988/org.apache.pig.pigpen_0.7.5.jar.

Who this book is for

This book is for experienced developers who are already familiar with Pig and are looking for a use case standpoint that they can relate to the problems of ingestion, profiling, cleansing, transformation, and egress of data encountered in enterprises. These power users of Pig will use the book as a reference for understanding the significance of Pig design patterns in solving their problems.

Knowledge of Hadoop and Pig is mandatory for you to grasp the intricacies of Pig design patterns better. To address this, Chapter 1, Setting the Context for Design Patterns in Pig, contains introductory concepts with simple examples. It is recommended that readers be familiar with Java and Python in order to better comprehend the UDFs that are used as examples in many chapters.

Conventions

In this book, you will find a number of styles of text that distinguish between different kinds of information. Here are a few examples of these styles and an explanation of their meaning.

Code words in text are shown as follows: "From this point onward, we shall call the unpacked Hadoop directory HADOOP_HOME."

A block of code for UDFs written in Java is set as follows:

package com.pigdesignpatterns.myudfs;

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.PigWarning;
import org.apache.pig.data.Tuple;

public class DeIdentifyUDF extends EvalFunc<String> {

    @Override
    public String exec(Tuple input) throws IOException {
        try {
            String plainText = (String) input.get(0);
            String encryptKey = (String) input.get(1);
            String str = "";
            str = encrypt(plainText, encryptKey.getBytes());
            return str;
        } catch (NullPointerException npe) {
            warn(npe.toString(), PigWarning.UDF_WARNING_2);
            return null;
        } catch (StringIndexOutOfBoundsException npe) {
            warn(npe.toString(), PigWarning.UDF_WARNING_3);
            return null;
        } catch (ClassCastException e) {
            warn(e.toString(), PigWarning.UDF_WARNING_4);
            return null;
        }
    }

    // The encrypt() helper is defined elsewhere in the class and is omitted here.
}
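
For context, a UDF like this is typically compiled into a JAR, registered, and then invoked from Pig Latin along the following lines; the JAR name, input file, and field names are assumptions made for illustration:

REGISTER pigdesignpatterns-udfs.jar;
DEFINE DeIdentify com.pigdesignpatterns.myudfs.DeIdentifyUDF();

-- Hypothetical input: records with a sensitive column to be masked.
patients = LOAD 'patients.csv' USING PigStorage(',') AS (name:chararray, ssn:chararray);
masked = FOREACH patients GENERATE name, DeIdentify(ssn, 'secretkey') AS masked_ssn;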

A Pig script is displayed as follows:

Users = load 'users' as (name, age);
Fltrd = filter Users by age >= 18 and age <= 25;
Pages = load 'pages' as (user, url);
Jnd = join Fltrd by name, Pages by user;
Grpd = group Jnd by url;
Smmd = foreach Grpd generate group, COUNT(Jnd) as clicks;
Srtd = order Smmd by clicks desc;
Top5 = limit Srtd 5;
store Top5 into 'top5sites';

Any command-line input or output is written as follows:

>tar -zxvf hadoop-1.x.x.tar.gz

New terms and important words are shown in bold. Words that you see on the screen, in menus or dialog boxes for example, appear in the text like this: "Clicking the Next button moves you to the next screen."

Note

Warnings or important notes appear in a box like this.

Tip

Tips and tricks appear like this.

Third-party libraries

A number of third-party libraries are used for the sake of convenience. They are included in the Maven dependencies, so there is no extra work required to use them. The following list describes the libraries in prevalent use throughout the code examples:

· dataFu: DataFu is a collection of user-defined functions for working with large-scale data in Hadoop and Pig, especially for data mining and statistics. Link: http://search.maven.org/remotecontent?filepath=com/linkedin/datafu/datafu/0.0.10/datafu-0.0.10.jar

· mongo-hadoop-core: This is the plugin for Hadoop that provides the ability to use MongoDB as an input source and/or an output source. Link: http://repo1.maven.org/maven2/org/mongodb/mongo-hadoop-core_1.0.0/1.0.0-rc0/mongo-hadoop-core_1.0.0-1.0.0-rc0.jar

· mongo-hadoop-pig: This is used to load records from the MongoDB database for use in a Pig script and to write to a MongoDB instance. Link: http://repo1.maven.org/maven2/org/mongodb/mongo-hadoop-pig/1.0.0/mongo-hadoop-pig-1.0.0.jar

· mongo-java-driver: This is a Java driver for MongoDB. Link: http://repo1.maven.org/maven2/org/mongodb/mongo-java-driver/2.9.0/mongo-java-driver-2.9.0.jar

· elephant-bird-pig: This is Twitter's open source library of Pig LoadFuncs. Link: http://repo1.maven.org/maven2/com/twitter/elephantbird/elephant-bird-pig/3.0.5/elephant-bird-pig-3.0.5.jar

· elephant-bird-core: This is Twitter's collection of core utilities. Link: http://repo1.maven.org/maven2/com/twitter/elephantbird/elephant-bird-pig/3.0.5/elephant-bird-pig-3.0.5.jar

· hcatalog-pig-adapter: This contains utilities to access data from HCatalog-managed tables. Link: http://search.maven.org/remotecontent?filepath=org/apache/hcatalog/hcatalog-pig-adapter/0.11.0/hcatalog-pig-adapter-0.11.0.jar

· cb2java: This JAR has libraries to dynamically parse COBOL copybooks. Link: http://sourceforge.net/projects/cb2java/files/latest/download

· Avro: This is the Avro core components' library. Link: http://repo1.maven.org/maven2/org/apache/avro/avro/1.7.4/avro-1.7.4.jar

· json-simple: This library is a Java toolkit for JSON, used to encode or decode JSON text. Link: http://www.java2s.com/Code/JarDownload/json-simple/json-simple-1.1.1.jar.zip

· commons-math: This library contains a few mathematical and statistical components. Link: http://repo1.maven.org/maven2/org/apache/commons/commons-math3/3.2/commons-math3-3.2.jar

Datasets

Throughout this book, you'll work with these datasets to provide some variety for the examples. Copies of the exact data used are available in the GitHub repository at https://github.com/pradeep-pasupuleti/pig-design-patterns. Wherever relevant, data that is specific to a chapter exists within chapter-specific subdirectories under the same GitHub location.

The following are the major classifications of datasets, which are used in this book as relevant to the use case discussed:

· The logs dataset contains a month's worth of HTTP requests to the NASA Kennedy Space Center WWW server in Florida. These logs are in the format of Apache access logs.

Note

The dataset is downloaded from the links ftp://ita.ee.lbl.gov/traces/NASA_access_log_Jul95.gz and ftp://ita.ee.lbl.gov/traces/NASA_access_log_Aug95.gz.

Acknowledgement: The logs were collected by Jim Dumoulin of the Kennedy Space Center, and contributed by Martin Arlitt (<mfa126@cs.usask.ca>) and Carey Williamson (<carey@cs.usask.ca>) of the University of Saskatchewan.

· The custom logs dataset contains logs generated by a web application in the custom log format. Web service request and response information is embedded along with the event logs. This is a synthetic dataset created specifically to illustrate the examples in this book.

· The historical NASDAQ stock dataset covers 1970 to 2010 and includes daily open, close, low, high, and trading volume figures. Data is organized alphabetically by ticker symbol.

Note

This dataset is downloaded from the link http://www.infochimps.com/datasets/nasdaq-exchange-daily-1970-2010-open-close-high-low-and-volume/downloads/166853.

· The customer retail transactions dataset has details on category of the product being purchased and customer demographic information. This is a synthetic dataset created specifically to illustrate the examples in this book.

· The automobile insurance claims dataset consists of two files. The automobile_policy_master.csv file contains the vehicle price and the premium paid for it. The file automobile_insurance_claims.csv contains automobile insurance claims data, specifically vehicle repair charges claims. This is a synthetic dataset created specifically to illustrate the examples in this book.

· The MedlinePlus health topic XML files contain records of health topics. Each health topic record includes data elements associated with that topic.

Note

This dataset is downloaded from the link http://www.healthdata.gov/data/dataset/medlineplus-health-topic-xml-files-0.

· This dataset contains a large set of e-mail messages from the Enron corpus, which has about 150 users with an average of 757 messages per user. The dataset is in Avro format, and we have converted it to JSON format for the purpose of this book.

Note

This dataset is downloaded from the link https://s3.amazonaws.com/rjurney_public_web/hadoop/enron.avro.

· The manufacturing dataset for electrical appliances is a synthetic dataset created for the purpose of this book. This dataset contains the following files:

· manufacturing_units.csv: This contains information about each manufacturing unit

· products.csv: This contains details of the products that are manufactured

· manufacturing_units_products.csv: This holds detailed information of products that are manufactured in different manufacturing units

· production.csv: This holds the production details

· The unstructured text dataset contains parts of Wikipedia articles on Computer Science and Information Technology, Big Data, Medicine, and the invention of the telephone, along with a stop words list and a dictionary words list.

· The Outlook contacts dataset is a synthetic dataset created by exporting Outlook contacts for the purpose of this book; it is a CSV file with contact names and job titles as attributes.

· The German credit dataset in CSV format classifies people as good or bad credit risks based on a set of attributes. There are 20 attributes (7 numerical and 13 categorical) with 1,000 instances.

Note

This dataset is downloaded from the link http://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data.

Acknowledgement: Data collected from UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data)), source: Professor Dr. Hans Hofmann, Institut fuer Statistik und Oekonometrie, Universitaet Hamburg.