Amazon Web Services Resources and Tools - Programming Elastic MapReduce (2014)

Appendix A. Amazon Web Services Resources and Tools

Throughout the book, we provided a number of AWS links and demonstrated various tools. This appendix is a snapshot of resources that are useful for planning and building applications with Amazon EMR and its supporting services.

Amazon AWS Online Resources

The examples and information in this book reflect the costs and services available at the time of writing. Amazon regularly adds services and service options and adjusts its pricing. We strongly recommend reviewing the latest information on AWS before starting your project.

The following links and information on Amazon’s AWS site should be helpful in using and understanding the services in this book.

Amazon Web Services (AWS) home page

This is a starting point for learning about Amazon Web Services and signing up for service.

Amazon Elastic MapReduce (EMR)

This is the service home page for Amazon Elastic MapReduce. The site provides a detailed description of Amazon Elastic MapReduce, third-party software installation options, and detailed pricing and configuration information.

Amazon Elastic Compute Cloud (EC2)

This is the service home page for Amazon Elastic Compute Cloud. The site provides a detailed description of Amazon EC2 and detailed pricing information. Amazon EC2 is used for a number of the source machines and to run tasks separate from Amazon EMR throughout the book.

Amazon Simple Storage Service (S3)

This is the service home page for Amazon Simple Storage Service (S3). The site provides a detailed description of Amazon S3 and pricing information. Amazon S3 is used to store input and output data for Amazon EMR data analysis. Many of the scripts and applications used for data analysis are stored in S3, and their S3 location is specified in configuring Amazon EMR Job Flows.

Amazon Glacier

This is the service home page for Amazon Glacier. Amazon Glacier is a low-cost, long-term storage solution for data that may be needed in the future but is not currently being processed by EMR or reviewed by system users. Storing such data in Amazon Glacier can cost less than keeping it in online S3 storage.

AWS Data Pipeline

This is the service home page for AWS Data Pipeline. Data Pipeline is used to automate EMR processing and reduce the administrative burden of maintaining an EMR application in AWS.

Amazon AWS Cost Estimation Tools

When an organization moves from internal systems to cloud-based solutions like AWS, the discussion almost always comes down to cost. In Chapter 6, we covered numerous real-world scenarios and estimation techniques for reviewing project costs, and we used the following online cost estimation tools to review and compare costs across those scenarios.

Amazon Web Services Simple Monthly Calculator

This online calculator allows you to input the resources you expect to use in AWS and determine the monthly cost of those services. The tool also allows you to “Save and Share” your calculations, and produces a URL that can be given to others on the project team or stakeholders for review.
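The kind of arithmetic the calculator performs can be sketched in a few lines of Python. This is only a rough illustration, not a substitute for the calculator: the function name and every rate below are hypothetical placeholders rather than actual AWS prices, and the sketch assumes the EMR charge is added to the EC2 charge for each instance hour while ignoring data transfer, request fees, and partial-hour billing.

```python
def estimate_monthly_cost(instance_count, hours_per_day, days_per_month,
                          ec2_hourly_rate, emr_hourly_rate,
                          s3_storage_gb, s3_rate_per_gb_month):
    """Rough monthly estimate in dollars for an EMR cluster plus S3 storage.

    All rates are supplied by the caller; none of them are real AWS prices.
    """
    # Total instance hours consumed by the cluster over the month.
    instance_hours = instance_count * hours_per_day * days_per_month
    # Assumption: each instance hour is billed at the EC2 rate plus the EMR rate.
    compute_cost = instance_hours * (ec2_hourly_rate + emr_hourly_rate)
    # Assumption: storage is billed as a flat rate per GB per month.
    storage_cost = s3_storage_gb * s3_rate_per_gb_month
    return round(compute_cost + storage_cost, 2)


# Hypothetical scenario: 10 instances running 4 hours a day for 30 days,
# with 100 GB kept in S3. Replace the placeholder rates with current pricing.
estimate = estimate_monthly_cost(
    instance_count=10, hours_per_day=4, days_per_month=30,
    ec2_hourly_rate=0.25, emr_hourly_rate=0.0625,
    s3_storage_gb=100, s3_rate_per_gb_month=0.125,
)
print("Estimated monthly cost: $%.2f" % estimate)
```

Keeping the rates as parameters makes it easy to rerun the same scenario whenever Amazon changes its pricing.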

Amazon Web Services Economics Center

The Economics Center helps you compare the costs of running an application in a traditional data center and running the same application in AWS. This tool can be useful in determining cost savings and comparing available resources inside an organization.

AWS Best Practices and Architecture

Amazon provides a number of critical documents that help organizations start building their applications using best practices. Also, for organizations that use third-party components like Microsoft Windows, Oracle, Red Hat Linux, and others, Amazon provides a number of already configured EC2 instances and information on how to build your own Amazon Machine Images (AMI). The following links at AWS are useful for projects that need this information:

Amazon Architecture Center

This AWS site helps developers review software reference architectures that were designed to make best use of AWS services. The site can be useful in building a new application or transitioning an existing application over to AWS. The information will help the development team build applications in AWS that minimize downtime and optimize scalability and performance.

Amazon Security Center

Security is one of the top reasons many organizations have been hesitant to move their critical systems to cloud service providers like AWS. On this site, Amazon provides a great deal of information on the security of AWS and its data centers. The site also explains how AWS meets the requirements of a number of industry compliance regimes such as PCI and HIPAA.

Amazon EC2 Instances

This site demystifies the Amazon EC2 instance sizes of small, medium, large, extra large, and so on, and maps these sizes to their physical equivalents of CPU, memory, and disk space allocations.

Create Your Own AMI

AWS offers preconfigured images for many of the common software configurations that organizations use for their applications. However, you may want to build an Amazon Machine Image of special or in-house software so you can instantly start a preconfigured image with your software. This guide provides details on how to build a custom image to run inside EC2 or EMR.

Amazon EMR Distributions

As a developer working with Amazon EMR, you need to know which features and APIs are available. Fortunately, Amazon provides extensive documentation for all of its AWS services, including developer documentation for EMR.

Amazon regularly updates the version of Hadoop and applies patches to integrate Hadoop with AWS infrastructure and services. Table A-1 lists the versions of Hadoop that are supported in Amazon EMR as of the writing of this book.

Table A-1. Amazon-supported Hadoop versions

Hadoop version    Configuration parameters
1.0.3             --hadoop-version 1.0.3 --ami-version 2.3
0.20.205          --hadoop-version 0.20.205 --ami-version 2.0
0.20              --hadoop-version 0.20 --ami-version 1.0
0.18              --hadoop-version 0.18 --ami-version 1.0

To find out the latest supported versions of Hadoop for EMR, visit the Supported Hadoop Versions section of the EMR Developer Guide.
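If you script the creation of Job Flows, it can help to keep the pairings from Table A-1 in one place. The following is a minimal Python sketch that assumes only the values listed in the table; the names HADOOP_TO_AMI and version_flags are hypothetical helpers for illustration, not part of any AWS tool.

```python
# Hadoop version -> AMI version, copied from Table A-1.
# These values are a snapshot; check the EMR Developer Guide for current ones.
HADOOP_TO_AMI = {
    "1.0.3": "2.3",
    "0.20.205": "2.0",
    "0.20": "1.0",
    "0.18": "1.0",
}


def version_flags(hadoop_version):
    """Return the Table A-1 configuration parameters as an argument list."""
    if hadoop_version not in HADOOP_TO_AMI:
        raise ValueError(
            "Hadoop version not listed in Table A-1: %s" % hadoop_version
        )
    return [
        "--hadoop-version", hadoop_version,
        "--ami-version", HADOOP_TO_AMI[hadoop_version],
    ]


# Build the parameter string for one of the versions in the table.
print(" ".join(version_flags("1.0.3")))
```

Rejecting versions that are not in the table means a mistyped version number is caught before any cluster is requested.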