Chapter 7. Best Practices and Recommendations

Well, this is the grand finale! So far, we have learned how to optimize MapReduce job performance and dedicated a major part of this book to laying down some important fundamentals. Remember that setting up a Hadoop cluster is essentially the challenge of combining high availability, load balancing, and the individual requirements of the services you expect your cluster servers to provide.

In this chapter, we will describe the hardware and application configuration checklists that you can use to optimize your Hadoop MapReduce jobs.

The following topics will be covered in this chapter:

· The common Hadoop cluster checklist

· The BIOS checklist and OS recommendations

· Hadoop best practices and recommendations

· A MapReduce template class to use in your application

Hardware tuning and OS recommendations

Recommendations for system tuning depend on the intrinsic capabilities of the system. The following sections suggest techniques and tips that you can use as baseline reminders when engaging in your MapReduce optimization process.

The Hadoop cluster checklist

The following checklist describes only the minimal set of steps required to get your Hadoop cluster working optimally:

· Check and ensure that all cluster nodes can communicate with each other and you have physical and/or remote management access to each cluster node

· Check whether your cluster is well dimensioned and is able to compensate for the failure of (at least) one node per service

· Check the limitations of your cluster environment (hardware availability resources/rack space, hosting parameters, and so on)

· Define your cluster strategies for a failover to ensure high availability of your services

· Define what you need to back up, and what needs to be saved and where, in order to maximize your Hadoop storage capacity

The BIOS tuning checklist

This topic lists what you should check while installing Hadoop cluster nodes so that they run in an optimal environment. The following checks are to be carried out:

· Check whether all CPU cores on the hardware are fully utilized; otherwise, you can downgrade the CPU frequency.

· Enable Native Command Queuing mode (NCQ), which helps improve I/O performance of modern hard drives by optimizing the drive head's movement. Usually, the NCQ mode can be enabled through the Advanced Host Controller Interface (AHCI) option in BIOS.

· Check whether there are any default BIOS settings that may negatively affect your Hadoop MapReduce jobs.

OS configuration recommendations

In this minimal checklist, we present some recommendations for system tuning, which are a combination of CPU, I/O, and memory techniques. The following are the recommendations:

· Choose a Linux distribution that supports the EXT4 filesystem.

· By default, every file read operation triggers a disk write operation in order to record the time the file was last accessed. This extra disk activity associated with updating the access time is not desirable. You can disable this logging of access time for both files and directories by mounting the filesystem with the noatime option (see the sketch after this list).

Note

nodiratime: This disables updating of the access time when opening directories so that the access time is not modified when enumerating directories.

· Avoid using Logical Volume Management (LVM), which is used to manage disk drives and similar mass-storage devices as this will affect the disk's I/O performance.

· Set the Linux kernel's swappiness parameter (vm.swappiness) to a low value. This informs the Linux kernel that it should try to avoid swapping as much as possible.

· The Linux kernel's I/O scheduler controls how input/output operations are submitted to storage. Experiment with the Completely Fair Queuing (CFQ) I/O scheduler, which resembles a round-robin algorithm in that I/O operations are placed in a circular queue and each operation is allowed a fixed slice of execution time.

· Increase the Linux OS limit on open file descriptors, which may enhance MapReduce job performance.
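
The following is a minimal sketch of where these settings typically live on a Linux system. The device name (/dev/sdb1), mount point (/data/hdfs), user name (hdfs), and values are assumptions for illustration only, not recommendations for your cluster:

# /etc/fstab: mount a Hadoop data disk on EXT4 with access-time updates disabled
/dev/sdb1   /data/hdfs   ext4   defaults,noatime,nodiratime   0 0

# Tell the kernel to avoid swapping (persist the setting in /etc/sysctl.conf)
sysctl -w vm.swappiness=0

# Select the CFQ I/O scheduler for the data disk
echo cfq > /sys/block/sdb/queue/scheduler

# /etc/security/limits.conf: raise the open file descriptor limit for the Hadoop user
hdfs   -   nofile   32768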

Hadoop best practices and recommendations

In order to improve Hadoop performance, this section presents some configuration tips and recommendations that represent a compendium of best practices for applications running on the Hadoop framework.

Deploying Hadoop

Hadoop can be installed manually by downloading its archive from the official website and copying it to the cluster nodes. This will work, but it is not recommended if you want to install Hadoop on a cluster of more than four nodes. Installing Hadoop manually on a large cluster can lead to maintenance and troubleshooting issues, because any configuration change needs to be applied manually to all nodes using Secure Copy Protocol (SCP) or Secure Shell (SSH).

To deploy Hadoop on a large cluster, it is recommended (and a good practice) to use a configuration management system and/or automated deployment tools, such as the Cloudera (http://www.cloudera.com), Hortonworks (http://hortonworks.com), and MapR (http://www.mapr.com) management systems. For additional work, such as application deployment, it is good practice to use tools such as Yum and Puppet.

You can use these tools to build and maintain Hadoop clusters for the following:

· Setup

· Configuration

· Scalability

· Monitoring

· Maintenance

· Troubleshooting

Note

Puppet is a powerful open source tool that helps you to perform administrative tasks such as adding users, installing packages, and updating server configurations based on a centralized specification. You can learn more about Puppet by browsing the following link: http://puppetlabs.com/puppet/what-is-puppet.

Hadoop tuning recommendations

The checklists given in this section will be useful when you prepare for, and follow up on, the MapReduce performance recommendations.

The following is the checklist of memory recommendations:

· Adjust memory settings to avoid a job hanging due to insufficient memory

· Set or define a JVM reuse policy

· Verify the JVM code cache and increase it if necessary

· Analyze garbage collector (GC) cycles (using detailed GC logs), observe whether collection is intensive (which indicates that a large number of object instances are being created in memory), and check the Hadoop framework's heap usage
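
As an illustration of these memory recommendations, the following sketch shows where such settings could be applied in a job driver (for example, in the run() method of the template class presented later in this chapter). It assumes the MRv1 (Hadoop 1.x) parameter names; the heap size and JVM flags are illustrative values, not recommendations:

// Inside the driver's run() method, before the Job object is created
Configuration conf = getConf();
// Give each task JVM more heap, enable detailed GC logging for later analysis,
// and reserve more space for the JVM code cache
conf.set("mapred.child.java.opts",
    "-Xmx1024m -verbose:gc -XX:+PrintGCDetails -XX:ReservedCodeCacheSize=64m");
// Define a JVM reuse policy: -1 reuses the task JVM for an unlimited number of tasks
conf.setInt("mapred.job.reuse.jvm.num.tasks", -1);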

The following are the massive I/O tuning recommendations to ensure that there are no setbacks due to I/O operations:

· In the context of large input data, compress the source data to avoid/reduce massive I/O

· Reduce spilled records from map tasks when you observe a large number of spilled records

Spilled records can be reduced by tuning io.sort.mb, io.sort.record.percent, and io.sort.spill.percent (see the sketch after this list)

· Compress the map output to minimize I/O disk operations

· Implement a Combiner to minimize massive I/O and network traffic (a reducer can be reused as the combiner, as in the following example, only when its logic is commutative and associative)

Add a Combiner with the following line of code:

job.setCombinerClass(Reduce.class);

· Compress the MapReduce job output to minimize large output data effects

The compression parameters are mapred.compress.map.output and mapred.output.compression.type

· Change the replication parameter value to minimize network traffic and massive I/O disk operations
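
The following sketch shows where the parameters mentioned in this list could be set in a job driver, again assuming the MRv1 (Hadoop 1.x) parameter names; the values are illustrative starting points only and should be validated against your own workload:

// Inside the driver's run() method, before the Job object is created
Configuration conf = getConf();
conf.setInt("io.sort.mb", 256);                        // map-side sort buffer size, in MB
conf.setFloat("io.sort.spill.percent", 0.90f);         // buffer usage threshold that triggers a spill
conf.setFloat("io.sort.record.percent", 0.15f);        // fraction of the buffer reserved for record metadata
conf.setBoolean("mapred.compress.map.output", true);   // compress intermediate map output
conf.setBoolean("mapred.output.compress", true);       // compress the final job output
conf.set("mapred.output.compression.type", "BLOCK");   // compression granularity for SequenceFile output
conf.setInt("dfs.replication", 2);                     // a lower replication factor reduces write traffic (the default is 3)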

The Hadoop minimal configuration checklist to validate hardware resources is as follows:

· Define the Hadoop ecosystem components that are required to be installed (and maintained)

· Define how you are going to install Hadoop, manually or using an automated deployment tool (such as Puppet/Yum)

· Choose the underlying core storage such as HDFS, HBase, and so on

· Check whether additional components are required for orchestration, job scheduling, and so on

· Check third-party software dependencies, such as the JVM version

· Check the key parameter configuration of Hadoop, such as HDFS block size, replication factor, and compression

· Define the monitoring policy; what should be monitored and with which tool (for example, Ganglia)

· Install a monitoring tool, such as Nagios or Ganglia, to monitor your Hadoop cluster resources

· Identify (calculate) the amount of required disk space to store the job data

· Identify (calculate) the number of required nodes to perform the job

· Check whether NameNodes and DataNodes have the required minimal hardware resources, such as amount of RAM, number of CPUs, and network bandwidth

· Calculate the number of mapper and reducer tasks required to maximize CPU usage

· Check the number of MapReduce tasks to ensure that sufficient tasks are running

· Avoid using virtual servers for the production environment; use them only for MapReduce application development

· Eliminate map-side spills and reduce-side disk I/O

Using a MapReduce template class code

Most MapReduce applications are similar to each other. Often, you can create one basic application template, customize the map() and reduce() functions, and reuse it.

The following walkthrough describes, block by block, a MapReduce template class that you can enhance and customize to meet your needs (a simplified sketch of such a template is given at the end of this section):

Lines 1-16 of the template are the import declarations for all the Java classes used in the application.

Line 18 declares the MapReduceTemplate class. This class extends the Configured class and implements the Tool interface.

Lines 20-36 define the static class Map, which represents your mapper function. This class extends the Mapper class, and its map() function should be overridden by your code. It is recommended that you catch any exception, to prevent the failure of a map task without terminating its process.

Lines 38-54 define the static class Reduce, which represents your reducer function. This class extends the Reducer class, and its reduce() function should be overridden by your code. As with the mapper, it is recommended that you catch any exception, to prevent the failure of the reduce task without terminating its process.

Lines 56-85 contain optional class implementations. These classes help you define a custom key partitioner and custom grouping and sorting class comparators.

Lines 87-127 comprise the run() method implementation. The MapReduce job is configured and its parameters, such as the Map and Reduce classes, are set within this method. This is also where you can enable compression, set the output key class, the FileOutputFormat class, and so on.

Lines 133-138 contain the main() method implementation. Within this method, a new configuration instance is created and the run() method is called (through ToolRunner) to launch the MapReduce job.
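
The following is a minimal, illustrative sketch of such a template class. It assumes a word-count-style job and the org.apache.hadoop.mapreduce API (Hadoop 2.x); only the class name MapReduceTemplate and the overall structure are taken from the description above, the optional partitioner and comparator classes are omitted, and its line numbering therefore does not match the line ranges quoted in the walkthrough:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MapReduceTemplate extends Configured implements Tool {

    // Mapper: override map() with your own logic; exceptions are caught so that
    // a bad record does not cause the map task to fail.
    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            try {
                StringTokenizer tokenizer = new StringTokenizer(value.toString());
                while (tokenizer.hasMoreTokens()) {
                    word.set(tokenizer.nextToken());
                    context.write(word, ONE);
                }
            } catch (Exception e) {
                context.getCounter("MapReduceTemplate", "MAP_ERRORS").increment(1);
            }
        }
    }

    // Reducer: override reduce() with your own logic; exceptions are caught as well.
    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            try {
                int sum = 0;
                for (IntWritable value : values) {
                    sum += value.get();
                }
                context.write(key, new IntWritable(sum));
            } catch (Exception e) {
                context.getCounter("MapReduceTemplate", "REDUCE_ERRORS").increment(1);
            }
        }
    }

    // run(): configure the MapReduce job (classes, input/output paths, compression, and so on).
    @Override
    public int run(String[] args) throws Exception {
        Job job = Job.getInstance(getConf(), "MapReduceTemplate");
        job.setJarByClass(MapReduceTemplate.class);

        job.setMapperClass(Map.class);
        job.setCombinerClass(Reduce.class); // optional: valid because the reduce logic is commutative and associative
        job.setReducerClass(Reduce.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        return job.waitForCompletion(true) ? 0 : 1;
    }

    // main(): create a configuration instance and launch the job through ToolRunner.
    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new Configuration(), new MapReduceTemplate(), args);
        System.exit(exitCode);
    }
}

You can package this class into a JAR and launch it with, for example, hadoop jar yourjob.jar MapReduceTemplate <input path> <output path>, where yourjob.jar is a placeholder for your own JAR file.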

Some other factors to consider when writing your application are as follows:

· When you have a large output file (multiple GBs), consider using a larger output block size by setting dfs.block.size.

· To improve HDFS write performance, choose an appropriate compressor codec (compression speed versus efficiency) to compress the application's output.

· Avoid writing out more than one output file per reduce.

· Use an appropriate file format for the output of the reducers.

· Use only splittable compression codecs to write out large amounts of compressed textual data. Codecs such as zlib/gzip (and LZO without an index) are counter-productive because a compressed file cannot be split and processed in parallel, which forces the MapReduce framework to process the entire file in a single map. Consider using the SequenceFile format instead, since it supports compression and is splittable (see the sketch after this list).
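
As a sketch of how some of these points could be applied, the following lines, assumed to live in the driver's run() method after the Job object is created, set a larger output block size and a block-compressed, splittable SequenceFile output. They assume the org.apache.hadoop.mapreduce API, imports for SequenceFileOutputFormat, SequenceFile, and SnappyCodec, and that the Snappy native libraries are available on your cluster; the block size value is illustrative:

// Larger HDFS blocks for the files this job writes (256 MB here)
job.getConfiguration().setLong("dfs.block.size", 256L * 1024 * 1024);
// A compressed and splittable container format for the reducers' output
job.setOutputFormatClass(SequenceFileOutputFormat.class);
SequenceFileOutputFormat.setCompressOutput(job, true);
SequenceFileOutputFormat.setOutputCompressionType(job, SequenceFile.CompressionType.BLOCK);
SequenceFileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);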

Summary

In this chapter, we learned about the best practices of Hadoop MapReduce optimization. In order to improve the job performance of Hadoop, several recommendations and tips were presented in the form of checklists, which will help you to ensure that your Hadoop cluster is well configured and dimensioned.

This concludes our journey of learning to optimize and enhance MapReduce job performance. I hope you have enjoyed it as much as I did. I encourage you to keep learning more about the topics covered in this book using the MapReduce papers published by Google, Hadoop's official website, and other sources on the Internet. Optimizing a MapReduce job is an iterative and repeatable process that you need to go through before you find your job's optimum performance. I also recommend that you try different techniques to figure out which tuning solutions are the most efficient in the context of your job, and try combining different optimization techniques!