Chapter 1. Introduction to Amazon Elastic MapReduce

In programming, as in many fields, the hard part isn’t solving problems, but deciding what problems to solve.

— Paul Graham, Great Hackers

On August 6, 2012, the Mars rover Curiosity landed on the red planet, millions of miles from Earth. A great deal of engineering and technical expertise went into this mission. Just as exciting was the information technology behind it and the use of AWS services by NASA’s Jet Propulsion Laboratory (JPL). Shortly before the landing, NASA was able to provision stacks of AWS infrastructure supporting 25 Gbps of throughput to give NASA’s many fans and scientists up-to-the-minute information about the rover and its landing. Today, NASA continues to use AWS to analyze data and give scientists quick access to scientific data from the mission.

Why is this an important event in a book about Amazon Elastic MapReduce? Access to these types of resources used to be available only to governments or very large multi-national corporations. Now this power to analyze volumes of data and support high volumes of traffic in an instant is available to anyone with a laptop and a credit card. What used to take months—with the buildout of large data centers, computing hardware, and networking—can now be done in an instant and for short-term projects in AWS.

Today, businesses need to understand their customers and identify trends to stay ahead of their competition. In finance and corporate security, businesses are being inundated with terabytes and petabytes of information. IT departments with tight budgets are being asked to make sense of the ever-growing amount of data and help businesses stay ahead of the game. Hadoop and the MapReduce framework have been powerful tools to help in this fight. However, this has not eliminated the cost and time needed to build out and maintain vast IT infrastructure to do this work in the traditional data center.

EMR is an in-the-cloud solution hosted in Amazon’s data center that supplies both the computing horsepower and the on-demand infrastructure needed to solve these complex issues of finding trends and understanding vast volumes of data.

Throughout this book, we will explore Amazon EMR and how you can use it to solve data analysis problems in your organization. In many of the examples, we will focus on a common problem many organizations face: analyzing computer log information across multiple disparate systems. Many businesses are required by compliance regulations, such as the Health Insurance Portability and Accountability Act (HIPAA) and the Payment Card Industry Data Security Standard (PCI DSS), to analyze and review log information on a regular, if not daily, basis. Log information from a large enterprise can easily grow into terabytes or petabytes of data. We will build the building blocks of an application that takes in computer log information and analyzes it for trends utilizing Amazon EMR. We will show you how to utilize Amazon EMR services to perform this analysis and discuss the economics and costs of doing so.

Amazon Web Services Used in This Book

AWS has grown greatly over the years from its origins as a provider of remotely hosted infrastructure with virtualized computer instances called Amazon Elastic Compute Cloud (EC2). Today, AWS provides many, if not all, of the building blocks used in modern applications. Throughout this book, we will focus on a number of the key services Amazon provides.

Amazon Elastic MapReduce (EMR)

A book focused on EMR would not be complete without using this key AWS service from Amazon. We will go into much greater detail throughout this book, but in short, Amazon EMR is the in-the-cloud workhorse of the Hadoop framework that allows us to analyze vast amounts of data with a configurable and scalable amount of computing power. Amazon EMR makes heavy use of the Amazon Simple Storage Service (S3) to store analysis results and host data sets for processing, and leverages Amazon EC2’s scalable compute resources to run the Job Flows we develop to perform analysis. There is an additional charge of about 30 percent for the EMR EC2 instances. To read Amazon’s overview of EMR, visit the Amazon EMR web page. As the primary focus of this book, Amazon EMR is used heavily in many of the examples.

Amazon Simple Storage Service (S3)

Amazon S3 is the persistent storage for AWS. It provides a simple web services interface that can be used to store and retrieve any amount of data, at any time, from anywhere on the Web. There are some restrictions, though; data in S3 must be stored in named buckets, and any single object can be no more than 5 terabytes in size. The data stored in S3 is highly durable and is stored in multiple facilities and multiple devices within a facility. Throughout this book, we will use S3 storage to store many of the Amazon EMR scripts, source data, and the results of our analysis.

As with almost all AWS services, there are standard REST- and SOAP-based web service APIs to interact with files stored on S3. S3 gives any developer access to the same highly scalable, reliable, secure, fast, inexpensive infrastructure that Amazon uses to run its own global network of websites. The service aims to maximize benefits of scale and to pass those benefits on to developers. To read Amazon’s overview of S3, visit the Amazon S3 web page. Amazon S3’s permanent storage will be used to store data sets and computed result sets generated by Amazon EMR Job Flows. Applications built with Amazon EMR will need to use S3 for data storage.
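As a concrete illustration, here is a minimal sketch using the AWS SDK for Java to upload a logfile to a bucket and read it back. The bucket name, key, and local file path are hypothetical placeholders, and the client is assumed to pick up credentials from your environment or AWS configuration.

import java.io.File;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3Client;
import com.amazonaws.services.s3.model.S3Object;

public class S3Example {
    public static void main(String[] args) {
        // Uses credentials from the environment or local AWS configuration
        AmazonS3 s3 = new AmazonS3Client();

        // Upload a local logfile into a named bucket (placeholder names)
        s3.putObject("program-emr", "input/sample-syslog.log",
                new File("/tmp/sample-syslog.log"));

        // Read the same object back; the S3Object exposes the stored data
        S3Object object = s3.getObject("program-emr", "input/sample-syslog.log");
        System.out.println("Retrieved " + object.getKey()
                + " (" + object.getObjectMetadata().getContentLength() + " bytes)");
    }
}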

Amazon Elastic Compute Cloud (EC2)

Amazon EC2 makes it possible to run multiple instances of virtual machines on demand inside any one of the AWS regions. The beauty of this service is that you can start as many or as few instances as you need without having to buy or rent physical hardware, as you would with traditional hosting services. In the case of Amazon EMR, this means we can scale the size of our Hadoop cluster to any size we need without thinking about new hardware purchases and capacity planning. Individual EC2 instances come in a variety of sizes and specifications to meet the needs of different types of applications. There are instances tailored for high CPU load, high memory, high I/O, and more. Throughout this book, we will use native EC2 instances for much of the scheduling of Amazon EMR Job Flows and to run many of the mundane administrative and data manipulation tasks associated with our application building blocks. We will, of course, be using the Amazon EMR EC2 instances to do the heavy data crunching and analysis.

To read Amazon’s overview of EC2, visit the Amazon EC2 web page. Amazon EC2 instances are used as part of an Amazon EMR cluster throughout the book. We also utilize EC2 instances for administrative functions and to simulate live traffic and data sets. In building your own application, you can run the administrative and live data on your own internal hosts, and these separate EC2 instances are not a required service in building an application with Amazon EMR.
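To make the on-demand nature of EC2 concrete, the following is a minimal sketch, using the AWS SDK for Java, that launches a single small instance. The AMI ID, key pair name, and instance type are placeholders and assumptions for illustration, not recommendations.

import com.amazonaws.services.ec2.AmazonEC2;
import com.amazonaws.services.ec2.AmazonEC2Client;
import com.amazonaws.services.ec2.model.Instance;
import com.amazonaws.services.ec2.model.RunInstancesRequest;
import com.amazonaws.services.ec2.model.RunInstancesResult;

public class LaunchInstanceExample {
    public static void main(String[] args) {
        AmazonEC2 ec2 = new AmazonEC2Client();

        // Request a single small instance; the AMI ID and key name are placeholders
        RunInstancesRequest request = new RunInstancesRequest()
                .withImageId("ami-xxxxxxxx")
                .withInstanceType("m1.small")
                .withKeyName("my-key-pair")
                .withMinCount(1)
                .withMaxCount(1);

        RunInstancesResult result = ec2.runInstances(request);
        for (Instance instance : result.getReservation().getInstances()) {
            System.out.println("Launched instance " + instance.getInstanceId());
        }
    }
}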

Amazon Glacier

Amazon Glacier is a new offering available in AWS. Glacier is similar to S3 in that it stores almost any amount of data in a secure and durable manner. Glacier is intended for long-term storage of data due to the high latency involved in the storage and retrieval of data. A request to retrieve data from Glacier may take several hours for Amazon to fulfill. For this reason, we will store data that we do not intend to use very often in Amazon Glacier. The benefit of Amazon Glacier is its large cost savings. At the time of this writing, the storage cost in the US East region was $0.01 per gigabyte per month. Comparing this to a cost of $0.076 to $0.095 per gigabyte per month for S3 storage, you can see how the cost savings will add up for large amounts of data: retaining 10 terabytes of old log data, for example, costs roughly $100 per month in Glacier versus roughly $780 to $970 per month in S3. To read Amazon’s overview of Glacier, visit the Amazon Glacier web page. Glacier can be used to reduce data storage costs over S3, but is not a required service in building an Amazon EMR application.

Amazon Data Pipeline

Amazon Data Pipeline is another new offering available in AWS. Data Pipeline is a web service that allows us to build graphical workflows to reliably process and move data between different AWS services. Data Pipeline allows us to create complex user-defined logic to control AWS resource usage and execution of tasks. It allows the user to define schedules, prerequisite conditions, and dependencies to build an operational workflow for AWS. To read Amazon’s overview of Data Pipeline, visit the Amazon Data Pipeline web page. Data Pipeline can reduce the overall administrative costs of an application using Amazon EMR, but is not a required AWS service for building an application.

Amazon Elastic MapReduce

Amazon EMR is an AWS service that allows users to launch and use resizable Hadoop clusters inside of Amazon’s infrastructure. Amazon EMR, like Hadoop, can be used to analyze large data sets. It greatly simplifies the setup and management of the cluster of Hadoop and MapReduce components. EMR instances use Amazon’s prebuilt and customized EC2 instances, which can take full advantage of Amazon’s infrastructure and other AWS services. These EC2 instances are invoked when we start a new Job Flow to form an EMR cluster. A Job Flow is Amazon’s term for the complete data processing that occurs through a number of compute steps in Amazon EMR. A Job Flow is specified by the MapReduce application and its input and output parameters.
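To give a preview of what launching a Job Flow looks like in code, here is a minimal sketch using the AWS SDK for Java that submits a single-step Job Flow running a MapReduce JAR stored in S3. The bucket, JAR path, step name, instance types, and counts are illustrative assumptions and placeholders, not the book’s own configuration.

import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduce;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClient;
import com.amazonaws.services.elasticmapreduce.model.HadoopJarStepConfig;
import com.amazonaws.services.elasticmapreduce.model.JobFlowInstancesConfig;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowRequest;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowResult;
import com.amazonaws.services.elasticmapreduce.model.StepConfig;

public class RunJobFlowExample {
    public static void main(String[] args) {
        AmazonElasticMapReduce emr = new AmazonElasticMapReduceClient();

        // A single step that runs a MapReduce JAR against input data in S3
        StepConfig analysisStep = new StepConfig()
                .withName("Log analysis step")
                .withActionOnFailure("TERMINATE_JOB_FLOW")
                .withHadoopJarStep(new HadoopJarStepConfig()
                        .withJar("s3://program-emr/log-analysis.jar")
                        .withArgs("s3://program-emr/input/", "s3://program-emr/output/"));

        // The cluster this Job Flow runs on: one master and two core instances
        JobFlowInstancesConfig instances = new JobFlowInstancesConfig()
                .withMasterInstanceType("m1.large")
                .withSlaveInstanceType("m1.large")
                .withInstanceCount(3)
                .withKeepJobFlowAliveWhenNoSteps(false);

        RunJobFlowRequest request = new RunJobFlowRequest()
                .withName("Log Analysis Job Flow")
                .withLogUri("s3://program-emr/logs/")
                .withSteps(analysisStep)
                .withInstances(instances);

        RunJobFlowResult result = emr.runJobFlow(request);
        System.out.println("Started Job Flow " + result.getJobFlowId());
    }
}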

Figure 1-1 shows an architectural view of the EMR cluster.

Figure 1-1. Typical Amazon EMR cluster

Amazon EMR performs the computational analysis using the MapReduce framework. The MapReduce framework splits the input data into smaller fragments, or shards, that are distributed to the nodes that compose the cluster. From Figure 1-1, we note that a Job Flow is executed on a series of EC2 instances running the Hadoop components, which are broken up into master, core, and task groups. These individual data fragments are then processed by the MapReduce application running on each of the core and task nodes in the cluster. Based on Amazon EMR terminology, we commonly call the MapReduce application a Job Flow throughout this book.
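To make the map and reduce roles concrete before we dig into them in later chapters, here is a minimal Hadoop sketch that counts ERROR entries in log lines: the mapper emits a count of 1 for each line containing the word ERROR, and the reducer sums those counts. The class names and the choice of keyword are illustrative assumptions, not the book’s later examples.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class ErrorCount {

    // Map phase: each node scans its shard of log lines and emits ("ERROR", 1)
    public static class ErrorMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text keyword = new Text("ERROR");

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            if (value.toString().contains("ERROR")) {
                context.write(keyword, ONE);
            }
        }
    }

    // Reduce phase: sum all of the counts emitted by the mappers
    public static class ErrorReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : values) {
                sum += count.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}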

The master, core, and task cluster groups perform the following key functions in the Amazon EMR cluster:

Master group instance

The master group instance manages the Job Flow and allocates all the needed executables, JARs, scripts, and data shards to the core and task instances. The master node monitors the health and status of the core and task instances and also collects the data from these instances and writes it back to Amazon S3. The master group instances serve a critical function in our Amazon EMR cluster. If a master node is lost, you lose the work in progress by the master and the core and task nodes to which it had delegated work.

Core group instance

Core group instance members run the map and reduce portions of our Job Flow, and store intermediate data to the Hadoop Distributed File System (HDFS) storage in our Amazon EMR cluster. The master node manages the tasks and data delegated to the core and task nodes. Due to the HDFS storage aspects of core nodes, a loss of a core node will result in data loss and possible failure of the complete Job Flow.

Task group instance

The task group is optional. It can do some of the dirty computational work of the map and reduce jobs, but does not have HDFS storage of the data and intermediate results. The lack of HDFS storage on these instances means the data needs to be transferred to these nodes by the master for the task group to do the work in the Job Flow.

The master and core group instances are critical components in the Amazon EMR cluster. A loss of a node in the master or core group can cause an application to fail and need to be restarted. Task groups are optional because they do not control a critical function of the Amazon EMR cluster. In terms of jobs and responsibilities, the master group must maintain the status of tasks. A loss of a node in the master group may mean that the status of a running task cannot be determined or retrieved, leading to Job Flow failure.

The core group runs tasks and maintains the data retained in the Amazon EMR cluster. A loss of a core group node may cause data loss and Job Flow failure.

A task node is only responsible for running tasks delegated to it from the master group and utilizes data maintained by the core group. A failure of a task node will lose any interim calculations; the master node will retry the failed task when it detects the failure in the running job. Because task group nodes do not control the state of jobs or maintain data in the Amazon EMR cluster, task nodes are optional, but they are one of the key areas where the capacity of the Amazon EMR cluster can be expanded or shrunk without affecting the stability of the cluster.
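When a cluster is configured through the Amazon EMR API, these three groups map onto instance group definitions. Below is a minimal sketch, with illustrative instance types and counts, that configures one master node, two core nodes, and an optional task group of four nodes that could later be resized.

import java.util.Arrays;

import com.amazonaws.services.elasticmapreduce.model.InstanceGroupConfig;
import com.amazonaws.services.elasticmapreduce.model.InstanceRoleType;
import com.amazonaws.services.elasticmapreduce.model.JobFlowInstancesConfig;

public class InstanceGroupExample {
    public static void main(String[] args) {
        InstanceGroupConfig master = new InstanceGroupConfig()
                .withInstanceRole(InstanceRoleType.MASTER)
                .withInstanceType("m1.large")
                .withInstanceCount(1);            // manages and monitors the Job Flow

        InstanceGroupConfig core = new InstanceGroupConfig()
                .withInstanceRole(InstanceRoleType.CORE)
                .withInstanceType("m1.large")
                .withInstanceCount(2);            // runs tasks and holds HDFS data

        InstanceGroupConfig task = new InstanceGroupConfig()
                .withInstanceRole(InstanceRoleType.TASK)
                .withInstanceType("m1.large")
                .withInstanceCount(4);            // optional extra compute, no HDFS

        JobFlowInstancesConfig instances = new JobFlowInstancesConfig()
                .withInstanceGroups(Arrays.asList(master, core, task));

        System.out.println("Configured " + instances.getInstanceGroups().size()
                + " instance groups");
    }
}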

Amazon EMR and the Hadoop Ecosystem

As we’ve already seen, Amazon EMR uses Hadoop and its MapReduce framework at its core. Accordingly, many of the other core Apache Software Foundation projects that work with Hadoop also work with Amazon EMR. There are also many other AWS services that may be useful when you’re running and monitoring Amazon EMR applications. Some of these will be covered briefly in this book:

Hive

Hive is a distributed data warehouse that allows you to create a Job Flow using a SQL-like language. Hive can be run from a script loaded in S3 or interactively inside of a running EMR instance. We will explore Hive in Chapter 4.

Pig

Pig is a data flow language. (The language is, not surprisingly, called Pig Latin.) Pig scripts can be loaded into S3 and used to perform the data analysis in a Job Flow. Pig, like Hive, is one of the Job Flow types that can be run interactively inside of a running EMR instance. We cover the details on Pig and Pig Latin in Chapter 3.

Amazon CloudWatch

CloudWatch allows you to monitor the health and progress of Job Flows. It also allows you to set alarms when metrics are outside of normal execution parameters; a minimal alarm sketch follows this list. We will look at Amazon CloudWatch briefly in Chapter 6.
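As an illustration of that monitoring, the sketch below uses the AWS SDK for Java to create an alarm on the IsIdle metric that Amazon EMR publishes to CloudWatch, notifying an SNS topic when a cluster has been sitting idle. The Job Flow ID, SNS topic ARN, threshold, and period values are placeholders and illustrative assumptions.

import com.amazonaws.services.cloudwatch.AmazonCloudWatch;
import com.amazonaws.services.cloudwatch.AmazonCloudWatchClient;
import com.amazonaws.services.cloudwatch.model.ComparisonOperator;
import com.amazonaws.services.cloudwatch.model.Dimension;
import com.amazonaws.services.cloudwatch.model.PutMetricAlarmRequest;
import com.amazonaws.services.cloudwatch.model.Statistic;

public class IdleClusterAlarmExample {
    public static void main(String[] args) {
        AmazonCloudWatch cloudWatch = new AmazonCloudWatchClient();

        // Alarm when the cluster reports itself idle for three 5-minute periods;
        // the Job Flow ID and SNS topic ARN below are placeholders
        PutMetricAlarmRequest alarm = new PutMetricAlarmRequest()
                .withAlarmName("emr-cluster-idle")
                .withNamespace("AWS/ElasticMapReduce")
                .withMetricName("IsIdle")
                .withDimensions(new Dimension()
                        .withName("JobFlowId")
                        .withValue("j-XXXXXXXXXXXXX"))
                .withStatistic(Statistic.Average)
                .withPeriod(300)
                .withEvaluationPeriods(3)
                .withThreshold(1.0)
                .withComparisonOperator(ComparisonOperator.GreaterThanOrEqualToThreshold)
                .withAlarmActions("arn:aws:sns:us-east-1:123456789012:emr-alerts");

        cloudWatch.putMetricAlarm(alarm);
    }
}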

Amazon Elastic MapReduce Versus Traditional Hadoop Installs

So how does using Amazon EMR compare to building out Hadoop in the traditional data center? Many of the AWS cloud considerations we discuss in Appendix B are also relevant to Amazon EMR. Compared to allocating resources and buying hardware in a traditional data center, Amazon EMR can be a great place to start a project because the infrastructure is already available at Amazon. Let’s look at a number of key areas that you should consider before embarking on a new Amazon EMR project.

Data Locality

Amazon EMR uses S3 storage for the input and output of data sets to be processed and analyzed. In order to process data, you need to transport it from the many sources where it currently lives into S3 buckets in Amazon’s cloud. This is not a major issue for projects transitioning from other AWS services, but it may be a barrier for projects that need to transport terabytes or petabytes of data from another cloud provider or a private data center to Amazon’s S3 storage.

In a traditional Hadoop install, the current source locations and the Hadoop cluster may be colocated in the same data center and connected by high-speed internal networks. This lowers the data transport barriers and the amount of time needed to get data into Hadoop for analysis. Figure 1-2 shows the data locations and network topology differences between an Amazon EMR and a traditional Hadoop installation.

Figure 1-2. Comparing data locality between Hadoop and Amazon EMR environments

If this will be a large factor in your project, you should review Amazon’s S3 Import and Export service option. The Import and Export service for S3 allows you to prepare portable storage devices that you can ship to Amazon to import your data into S3. This can greatly decrease the time and costs associated with getting large data sets into S3 for analysis. This approach can also be used in transitioning a project to AWS and EMR to seed the existing data into S3 and add data updates as they occur.

Hardware

Many people point to Hadoop’s use of low-cost hardware to achieve enormous compute capacity as one of the great benefits of using Hadoop compared to purchasing large, specialized hardware configurations. We couldn’t agree more when comparing what Hadoop achieves in terms of cost and compute capacity in this model. However, there are still large upfront costs in building out a modest Hadoop cluster. There are also the ongoing operational costs of electricity, cooling, IT personnel, hardware retirement, capacity planning and buildout, and vendor maintenance contracts on the operating system and hardware.

With Amazon EMR, you only pay for the services you use. You can quickly scale capacity up and down, and if you need more memory or CPU for your application, this is a simple change in your EC2 instance types when you’re creating a new Job Flow. We’ll explore the costs of Amazon EMR in Chapter 6 and help you understand how to estimate costs to determine the best solution for your organization.

Complexity

With the low-cost hardware of Hadoop clusters, many organizations start proof-of-concept data analysis projects with a small Hadoop cluster. The success of these projects leads many organizations to build out their clusters to meet production-level data needs. These projects eventually reach a tipping point of complexity where much of the cost savings gained from the low-cost hardware is lost to the administrative, labor, and data center cost burdens. The time and labor commitments of keeping thousands of Hadoop nodes updated with OS security patches and replacing failing systems can require a great deal of time and IT resources. Estimating these costs and comparing them to Amazon EMR will be covered in detail in Chapter 6.

With Amazon EMR, the EMR cluster nodes exist and are maintained by Amazon. Amazon regularly updates its EC2 Amazon Machine Images (AMI) with newer releases of Hadoop, security patches, and more. By default, a Job Flow will start an EMR cluster with the latest and greatest EC2 AMIs. This removes much of the administrative burden in running and maintaining large Hadoop clusters for data analysis.

Application Building Blocks

In order to show the power of using AWS for building applications, we will build a number of building blocks for a MapReduce log analysis application. In many of our examples throughout this book, we will use these building blocks to perform analysis of common computer logfiles and demonstrate how the same building blocks can be used to attack other common data analysis problems. We will discuss how AWS and Amazon EMR can be utilized to solve different aspects of these analysis problems. Figure 1-3 shows the high-level functional diagram of the AWS components we will use in the upcoming chapters. It also highlights the workflow and interrelationships between these components and how they share data and communicate in the AWS infrastructure.

Figure 1-3. Functional architecture of our data analysis solution

Using our building blocks, we will explore how these can be used to ingest large volumes of log data, perform real-time and batch analysis, and ultimately produce results that can be shared with end users. We will derive meaning and understanding from data and produce actionable results. There are three component areas of the application: the collection stage, the analysis stage, and the coordination and scheduling of work across the many services we use. It might seem like a complex set of systems, interconnections, and storage, but it is really quite simple, and Amazon EMR and AWS provide us with a number of great tools, services, and utilities to solve complex data analysis problems.

In the next set of chapters, we will dive into each component area of the application and highlight key portions of solving data analysis problems:

Collection

In Chapter 2, we will work on the data collection attributes that are the key building blocks for any data analysis project. We will present a number of options in AWS for generating test data to work with and learn how to work with this data in Amazon EMR and S3. Chapter 2 will also explore real-world areas to collect data throughout your enterprise and the tools available to get this data into Amazon S3.

Analysis

In Chapters 2 and 3, we will begin analyzing the data we have collected, using Java code to write map and reduce methods that will be run as Job Flows in Amazon EMR. In Chapter 4, we will show you that you don’t have to be a NASA rocket scientist or a Java programmer to use Amazon EMR. We will revisit the same analysis issues covered in earlier chapters and, using higher-level scripting tools like Pig and Hive, solve the same problems. Hadoop and Amazon EMR allow us to bring to bear a significant number of tools to mine critical information out of our data.

Machine learning

In Chapter 5, we will explore how machine learning can be used in EMR to derive interesting results on data sets. Python is used for the examples in this chapter.

Storage

Storage, and the cost of storing data, is an ever-growing problem for organizations. After you have done your data analysis, you may need to retain the original data and analysis for many years. Depending on the compliance needs of an organization, the retention time can be very long. In Chapter 6, we will look at cost-effective ways to store data for long periods using Amazon Glacier.

By now, you hopefully have an understanding of how AWS and Amazon EMR could provide value to your organization. In the next chapter, you will start getting your hands dirty. You’ll generate some simple log data to analyze and create your first Amazon EMR Job Flow, and then do some simple data frequency analysis on those sample log messages.