Preface - Programming Elastic MapReduce (2014)

Preface

Many organizations have a treasure trove of data stored away in the many silos of information within them. To unlock this information and use it to compete in the marketplace, organizations have begun looking to Hadoop and “Big Data” as the key to gaining an advantage over their competition. Many organizations, however, lack the knowledgeable resources and data center space to launch large-scale Hadoop solutions for their data analysis projects.

Amazon Elastic MapReduce (EMR) is Amazon’s Hadoop solution, running in Amazon’s data centers. It allows organizations to focus on the data analysis problems they want to solve without having to plan data center buildouts and maintain large clusters of machines. Amazon’s pay-as-you-go model is another benefit: organizations can start these projects with no upfront costs and scale quickly as the project grows. We hope this book inspires you to explore Amazon Web Services (AWS) and Amazon EMR, and helps you launch your next great project with the power of Amazon’s cloud to solve your biggest data analysis problems.

This book focuses on the core Amazon technologies needed to build an application using AWS and EMR. We chose an application that analyzes log data as our case study throughout this book to demonstrate the power of EMR. Log analysis is a good case study because it resembles many of the data analysis problems that organizations face. Computer logfiles contain large amounts of diverse data from different sources and can be mined to gain valuable intelligence. More importantly, logfiles are ubiquitous across computer systems and provide a readily available data set with which you can start solving data analysis problems.

Here is an outline of what this book provides:

§ Sample configurations for third-party software

§ Step-by-step configurations for AWS

§ Sample code

§ Best practices

§ Gotchas

The intent is not to provide a book that has all the code, configuration, and so on, to be able to plop this application on AWS and start going. Instead, we will provide guidance to help you see how to put together a system or application in a cloud environment and describe core issues you may face in working within AWS in building your own project.

You will get the most out of this book if you have some experience developing or managing applications built for the traditional data center and now want to learn how to move your applications and data into a cloud environment. You should be comfortable using development toolsets and reviewing code samples, architecture diagrams, and configuration examples to understand the basic concepts covered in this book. We will use the command line and command-line tools in Unix in a number of the examples we present, so it would not hurt to be familiar with navigating the command line and using basic Unix command-line utilities. The examples in this book can be used on Windows systems too, but you may need to install third-party utilities like Cygwin to follow along.

This book will challenge you with new ways of looking at your applications outside of your traditional data center walls, but hopefully it will open your eyes to the possibilities of what you can accomplish when you focus on the problems you are trying to solve rather than the many administrative issues of building out new servers in a private data center.

What Is AWS?

Amazon Web Services is the name of the computing platform started by Amazon in 2006. AWS offers a suite of services that companies and third-party developers can use to build solutions on the computing and software resources hosted in Amazon’s data centers around the globe. Amazon Elastic MapReduce is one of many available AWS services. With the pay-as-you-go model in AWS, developers and companies pay only for the resources they use. This model is changing the way many businesses approach new projects and initiatives. New initiatives can get started and scale within AWS as they build a customer base and grow, without much of the usual upfront cost of buying new servers and infrastructure. Using AWS, companies can spend less effort building and maintaining data centers and physical infrastructure, and focus instead on innovation and on building great solutions.

CLOUD SERVICES AND THEIR IMPACTS

Throughout this book, we discuss the many benefits of AWS and cloud services. Although these services do provide tremendous value to organizations in many ways, they are not always the best option for every project. Running your application in the cloud comes with many of the same impacts and effects as using VMware or other virtualization technology stacks. These impacts can affect application performance and security, and your application in the cloud may be sharing a physical machine with multiple other customers. For most applications, the benefits of cloud computing greatly outweigh these impacts. In Appendix B, we cover a number of the factors that impact cloud-based applications. We suggest reviewing the items in Appendix B before starting your own application to make sure it will be a good fit for AWS and cloud computing.

What’s in This Book?

This book is organized as follows. Chapter 1 introduces cloud computing and helps you understand Amazon Web Services and Amazon Elastic MapReduce. Chapter 2 gets us started exploring the Amazon tools we will be using to examine log data and execute our first Job Flow inside of Amazon EMR. In Chapter 3, we get down to the business of exploring the types of analyses that can be done with Amazon EMR using a number of MapReduce design patterns, and review the results we can get out of log data. In Chapter 5, we delve into machine learning techniques and how these can be implemented and utilized in our application to build intelligent systems that can take action or recommend a solution to a problem. Finally, in Chapter 6, we review project cost estimation for AWS and EMR applications and how to perform cost analysis of a project.

Sign Up for AWS

To get started, you need to sign up for AWS. If you are already an AWS user, you can skip this section because you already have access to each of the AWS services used throughout this book. If you are a new user, we will get you started in this section.

To sign up for AWS, go to the AWS website, as shown in Figure 1.

Figure 1. Amazon Web Services home page

You will need to provide a phone number to verify that you are setting up a valid account, and you will also need to provide a credit card number to allow Amazon to bill you for the usage of AWS services. We will cover how to estimate and review costs and how to set up billing alerts within AWS in Chapter 6.

After signing up for an AWS account, go to your My Account page to review the services to which you now have access. Figure 2 shows the available services under our account, but your results will likely look somewhat different.

TIP

Remember, there are charges associated with the use of AWS, and a number of the examples and exercises in this book will incur charges to your account. With a new AWS account, there is a free tier. To minimize the costs while learning about Amazon Elastic MapReduce, review the free-tier limitations, turn off instances after running through your exercises, and learn how to estimate costs in Chapter 6.

Figure 2. AWS services available after signup

Code Samples in This Book

There are numerous code samples and examples throughout this book. Many of the examples are built using the Java programming language or Hadoop Java libraries. To get the most out of this book and follow along, you need a system set up for Java development, along with the Hadoop Java JAR files, so that you can build an application that Amazon EMR can consume and execute. To get ready to develop and build your next application, review Appendix C to set up your development environment. This is not a requirement, but it will help you get the most value out of the material presented in the chapters.

Conventions Used in This Book

The following typographical conventions are used in this book:

Italic

Indicates new terms, URLs, email addresses, filenames, and file extensions.

Constant width

Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.

Constant width bold

Shows commands or other text that should be typed literally by the user.

Constant width italic

Shows text that should be replaced with user-supplied values or by values determined by context.

TIP

This icon signifies a tip, suggestion, or general note.

WARNING

This icon indicates a warning or caution.