Chapter 7. Best Practices and Recommendations

Well, this is the grand finale! So far, we have learned how to optimize MapReduce job performance and dedicated a major part of this book to laying down some important fundamentals. Remember that setting up a Hadoop cluster is essentially the challenge of combining high availability, load balancing, and the individual requirements of the services you expect your cluster servers to provide.

In this chapter, we will describe the hardware and application configuration checklists that you can use to optimize your Hadoop MapReduce jobs.

The following topics will be covered in this chapter:

· The common Hadoop cluster checklist

· The BIOS checklist and OS recommendations

· Hadoop best practices and recommendations

· A MapReduce template class to use in your application

Hardware tuning and OS recommendations

Recommendations for system tuning depend on the intrinsic capabilities of the system. The following sections suggest techniques and tips that you can use as baseline reminders when engaging in your MapReduce optimization process.

The Hadoop cluster checklist

The following checklist describes only the minimal set of steps required to get your Hadoop cluster working optimally:

· Check and ensure that all cluster nodes can communicate with each other and you have physical and/or remote management access to each cluster node

· Check whether your cluster is well dimensioned and is able to compensate for the failure of (at least) one node per service

· Check the limitations of your cluster environment (hardware availability resources/rack space, hosting parameters, and so on)

· Define your cluster strategies for a failover to ensure high availability of your services

· Define what you need to back up, and what needs to be saved and where, in order to maximize your Hadoop storage capacity

The BIOS tuning checklist

This topic lists what you should check while installing Hadoop cluster nodes so that they run in an optimal environment. The following checks are to be carried out:

· Check whether all CPU cores on the hardware are fully utilized; otherwise, you can downgrade the CPU frequency.

· Enable Native Command Queuing mode (NCQ), which helps improve I/O performance of modern hard drives by optimizing the drive head's movement. Usually, the NCQ mode can be enabled through the Advanced Host Controller Interface (AHCI) option in BIOS.

· Check whether there are any default BIOS settings that may negatively affect your Hadoop MapReduce jobs.

OS configuration recommendations

In this minimal checklist, we present some recommendations for system tuning, which are a combination of CPU, I/O, and memory techniques. The following are the recommendations:

· Choose a Linux distribution that supports the EXT4 filesystem.

· By default, every file read operation triggers a disk write operation in order to record the time the file was last accessed. This extra disk activity associated with updating the access time is not desirable. You can disable this logging of access time for both files and directories by mounting the filesystem with the noatime option (see the sketch after this list).

Note

nodiratime: This disables updating of the access time when opening directories so that the access time is not modified when enumerating directories.

· Avoid using Logical Volume Management (LVM), which is used to manage disk drives and similar mass-storage devices as this will affect the disk's I/O performance.

· Set the Linux kernel's swappiness parameter (vm.swappiness) to a low value. This informs the Linux kernel that it should try to avoid swapping as much as possible.

· The Linux kernel's I/O scheduler controls how input/output operations are submitted to storage. Experiment with the Completely Fair Queuing (CFQ) I/O scheduler, which resembles a round-robin algorithm in that I/O operations are placed in a circular queue and each operation is allowed a fixed slice of execution time.

· Increase the Linux OS limit on open file descriptors, which may enhance MapReduce job performance.
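
The following is a minimal sketch of where these settings typically live on a Linux system. The device name (/dev/sdb1), mount point (/data/hdfs), user name (hdfs), and values are assumptions for illustration only, not recommendations for your cluster:

# /etc/fstab: mount a Hadoop data disk on EXT4 with access-time updates disabled
/dev/sdb1   /data/hdfs   ext4   defaults,noatime,nodiratime   0 0

# Tell the kernel to avoid swapping (persist the setting in /etc/sysctl.conf)
sysctl -w vm.swappiness=0

# Select the CFQ I/O scheduler for the data disk
echo cfq > /sys/block/sdb/queue/scheduler

# /etc/security/limits.conf: raise the open file descriptor limit for the Hadoop user
hdfs   -   nofile   32768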

Hadoop best practices and recommendations

In order to improve Hadoop performance, this section presents some configuration tips and recommendations that represent a compendium of best practices for applications running on the Hadoop framework.

Deploying Hadoop

Hadoop can be installed manually by downloading its archive from the official website and copying it to the cluster nodes. This will work, but it is not recommended if you want to install Hadoop on a cluster of more than four nodes. Installing Hadoop manually on a large cluster can lead to maintenance and troubleshooting issues, because any configuration change needs to be applied manually to all nodes using Secure Copy Protocol (SCP) or Secure Shell (SSH).

To deploy Hadoop on a large cluster, it is recommended (and a good practice) to use a configuration management system and/or automated deployment tools, such as the Cloudera (http://www.cloudera.com), Hortonworks (http://hortonworks.com), and MapR (http://www.mapr.com) management systems. For additional work, such as application deployment, it is good practice to use tools such as Yum and Puppet.

You can use these tools to build and maintain Hadoop clusters for the following:

· Setup

· Configuration

· Scalability

· Monitoring

· Maintenance

· Troubleshooting

Note

Puppet is a powerful open source tool that helps you to perform administrative tasks such as adding users, installing packages, and updating server configurations based on a centralized specification. You can learn more about Puppet by browsing the following link: http://puppetlabs.com/puppet/what-is-puppet.

Hadoop tuning recommendations

The checklists given in this section will be useful when you prepare for, and follow up on, the MapReduce performance recommendations.

The following is the checklist of memory recommendations:

· Adjust memory settings to avoid a job hanging due to insufficient memory

· Set or define a JVM reuse policy

· Verify the JVM code cache and increase it if necessary

· Analyze garbage collector (GC) cycles (using detailed GC logs), observe whether collection is intensive (which indicates that a large number of object instances are being created in memory), and check the Hadoop framework's heap usage
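
As an illustration of these memory recommendations, the following sketch shows where such settings could be applied in a job driver (for example, in the run() method of the template class presented later in this chapter). It assumes the MRv1 (Hadoop 1.x) parameter names; the heap size and JVM flags are illustrative values, not recommendations:

// Inside the driver's run() method, before the Job object is created
Configuration conf = getConf();
// Give each task JVM more heap, enable detailed GC logging for later analysis,
// and reserve more space for the JVM code cache
conf.set("mapred.child.java.opts",
    "-Xmx1024m -verbose:gc -XX:+PrintGCDetails -XX:ReservedCodeCacheSize=64m");
// Define a JVM reuse policy: -1 reuses the task JVM for an unlimited number of tasks
conf.setInt("mapred.job.reuse.jvm.num.tasks", -1);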

The following are the massive I/O tuning recommendations to ensure that there are no setbacks due to I/O operations:

· In the context of large input data, compress the source data to avoid/reduce massive I/O

· Reduce spilled records from map tasks when you observe a large number of spilled records

Spilled records can be reduced by tuning io.sort.mb, io.sort.record.percent, and io.sort.spill.percent (see the sketch after this list)

· Compress the map output to minimize I/O disk operations

· Implement a Combiner to minimize massive I/O and network traffic (a reducer can be reused as the combiner, as in the following example, only when its logic is commutative and associative)

Add a Combiner with the following line of code:

job.setCombinerClass(Reduce.class);

· Compress the MapReduce job output to minimize large output data effects

The compression parameters are mapred.compress.map.output and mapred.output.compression.type

· Change the replication parameter value to minimize network traffic and massive I/O disk operations
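
The following sketch shows where the parameters mentioned in this list could be set in a job driver, again assuming the MRv1 (Hadoop 1.x) parameter names; the values are illustrative starting points only and should be validated against your own workload:

// Inside the driver's run() method, before the Job object is created
Configuration conf = getConf();
conf.setInt("io.sort.mb", 256);                        // map-side sort buffer size, in MB
conf.setFloat("io.sort.spill.percent", 0.90f);         // buffer usage threshold that triggers a spill
conf.setFloat("io.sort.record.percent", 0.15f);        // fraction of the buffer reserved for record metadata
conf.setBoolean("mapred.compress.map.output", true);   // compress intermediate map output
conf.setBoolean("mapred.output.compress", true);       // compress the final job output
conf.set("mapred.output.compression.type", "BLOCK");   // compression granularity for SequenceFile output
conf.setInt("dfs.replication", 2);                     // a lower replication factor reduces write traffic (the default is 3)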

The Hadoop minimal configuration checklist to validate hardware resources is as follows:

· Define the Hadoop ecosystem components that are required to be installed (and maintained)

· Define how you are going to install Hadoop, manually or using an automated deployment tool (such as Puppet/Yum)

· Choose the underlying core storage such as HDFS, HBase, and so on

· Check whether additional components are required for orchestration, job scheduling, and so on

· Check third-party software dependencies, such as the JVM version

· Check the key parameter configuration of Hadoop, such as HDFS block size, replication factor, and compression

· Define the monitoring policy; what should be monitored and with which tool (for example, Ganglia)

· Install a monitoring tool, such as Nagios or Ganglia, to monitor your Hadoop cluster resources

· Identify (calculate) the amount of required disk space to store the job data

· Identify (calculate) the number of required nodes to perform the job

· Check whether NameNodes and DataNodes have the required minimal hardware resources, such as amount of RAM, number of CPUs, and network bandwidth

· Calculate the number of mapper and reducer tasks required to maximize CPU usage

· Check the number of MapReduce tasks to ensure that sufficient tasks are running

· Avoid using virtual servers for the production environment; use them only for MapReduce application development

· Eliminate map-side spills and reduce-side disk I/O

Using a MapReduce template class code

Most MapReduce applications are similar to each other. Often, you can create one basic application template, customize the map() and reduce() functions, and reuse it.

The following walkthrough describes, block by block, a MapReduce template class that you can enhance and customize to meet your needs (a simplified sketch of such a template is given at the end of this section):

Lines 1-16 of the template are the import declarations for all the Java classes used in the application.

Line 18 declares the MapReduceTemplate class. This class extends the Configured class and implements the Tool interface.

Lines 20-36 define the static class Map, which represents your mapper function. This class extends the Mapper class, and its map() function should be overridden by your code. It is recommended that you catch any exception, to prevent the failure of a map task without terminating its process.

Lines 38-54 define the static class Reduce, which represents your reducer function. This class extends the Reducer class, and its reduce() function should be overridden by your code. As with the mapper, it is recommended that you catch any exception, to prevent the failure of the reduce task without terminating its process.

Lines 56-85 contain optional class implementations. These classes help you define a custom key partitioner and custom grouping and sorting class comparators.

Lines 87-127 comprise the run() method implementation. The MapReduce job is configured and its parameters, such as the Map and Reduce classes, are set within this method. This is also where you can enable compression, set the output key class, the FileOutputFormat class, and so on.

Lines 133-138 contain the main() method implementation. Within this method, a new configuration instance is created and the run() method is called (through ToolRunner) to launch the MapReduce job.
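
The following is a minimal, illustrative sketch of such a template class. It assumes a word-count-style job and the org.apache.hadoop.mapreduce API (Hadoop 2.x); only the class name MapReduceTemplate and the overall structure are taken from the description above, the optional partitioner and comparator classes are omitted, and its line numbering therefore does not match the line ranges quoted in the walkthrough:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MapReduceTemplate extends Configured implements Tool {

    // Mapper: override map() with your own logic; exceptions are caught so that
    // a bad record does not cause the map task to fail.
    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            try {
                StringTokenizer tokenizer = new StringTokenizer(value.toString());
                while (tokenizer.hasMoreTokens()) {
                    word.set(tokenizer.nextToken());
                    context.write(word, ONE);
                }
            } catch (Exception e) {
                context.getCounter("MapReduceTemplate", "MAP_ERRORS").increment(1);
            }
        }
    }

    // Reducer: override reduce() with your own logic; exceptions are caught as well.
    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            try {
                int sum = 0;
                for (IntWritable value : values) {
                    sum += value.get();
                }
                context.write(key, new IntWritable(sum));
            } catch (Exception e) {
                context.getCounter("MapReduceTemplate", "REDUCE_ERRORS").increment(1);
            }
        }
    }

    // run(): configure the MapReduce job (classes, input/output paths, compression, and so on).
    @Override
    public int run(String[] args) throws Exception {
        Job job = Job.getInstance(getConf(), "MapReduceTemplate");
        job.setJarByClass(MapReduceTemplate.class);

        job.setMapperClass(Map.class);
        job.setCombinerClass(Reduce.class); // optional: valid because the reduce logic is commutative and associative
        job.setReducerClass(Reduce.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        return job.waitForCompletion(true) ? 0 : 1;
    }

    // main(): create a configuration instance and launch the job through ToolRunner.
    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new Configuration(), new MapReduceTemplate(), args);
        System.exit(exitCode);
    }
}

You can package this class into a JAR and launch it with, for example, hadoop jar yourjob.jar MapReduceTemplate <input path> <output path>, where yourjob.jar is a placeholder for your own JAR file.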

Some other factors to consider when writing your application are as follows:

· When you have a large output file (multiple GBs), consider using a larger output block size by setting dfs.block.size.

· To improve HDFS write performance, choose an appropriate compressor codec (compression speed versus efficiency) to compress the application's output.

· Avoid writing out more than one output file per reduce.

· Use an appropriate file format for the output of the reducers.

· Use only splittable compression codecs to write out large amounts of compressed textual data. Codecs such as zlib/gzip (and LZO without an index) are counter-productive because a compressed file cannot be split and processed in parallel, which forces the MapReduce framework to process the entire file in a single map. Consider using the SequenceFile format instead, since it supports compression and is splittable (see the sketch after this list).
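
As a sketch of how some of these points could be applied, the following lines, assumed to live in the driver's run() method after the Job object is created, set a larger output block size and a block-compressed, splittable SequenceFile output. They assume the org.apache.hadoop.mapreduce API, imports for SequenceFileOutputFormat, SequenceFile, and SnappyCodec, and that the Snappy native libraries are available on your cluster; the block size value is illustrative:

// Larger HDFS blocks for the files this job writes (256 MB here)
job.getConfiguration().setLong("dfs.block.size", 256L * 1024 * 1024);
// A compressed and splittable container format for the reducers' output
job.setOutputFormatClass(SequenceFileOutputFormat.class);
SequenceFileOutputFormat.setCompressOutput(job, true);
SequenceFileOutputFormat.setOutputCompressionType(job, SequenceFile.CompressionType.BLOCK);
SequenceFileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);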

Summary

In this chapter, we learned about the best practices of Hadoop MapReduce optimization. In order to improve the job performance of Hadoop, several recommendations and tips were presented in the form of checklists, which will help you to ensure that your Hadoop cluster is well configured and dimensioned.

This concludes our journey of learning to optimize and enhance MapReduce job performance. I hope you have enjoyed it as much as I did. I encourage you to keep learning more about the topics covered in this book using the MapReduce papers published by Google, Hadoop's official website, and other sources on the Internet. Optimizing a MapReduce job is an iterative and repeatable process that you need to go through before you find your job's optimum performance. I also recommend that you try different techniques to figure out which tuning solutions are the most efficient in the context of your job, and try combining different optimization techniques!