Apache Hadoop YARN Frameworks - Apache Hadoop YARN: Moving beyond MapReduce and Batch Processing with Apache Hadoop 2 (2014)

12. Apache Hadoop YARN Frameworks

One of the most exciting aspects of YARN is its ability to support multiple programming models and application frameworks. In Hadoop version 1, the only processing model available to users was MapReduce. In Hadoop version 2, MapReduce is separated from the resource management layer of Hadoop and placed into its own application framework. YARN forms a resource management platform, which provides services such as scheduling, fault monitoring, data locality, and more to MapReduce and other frameworks.

The following is a brief survey of emerging open-source frameworks that are being developed to run under YARN. As of this writing, there are many YARN frameworks under active development and the framework landscape is expected to change rapidly. Commercial vendors are also taking advantage of the YARN platform. Each of the following frameworks is at a different stage of development and deployment; please consult each framework's webpage for full details.

Distributed-Shell

Distributed-Shell is an example application that demonstrates how to write applications on top of YARN. It is covered in detail in Chapter 11, “Using Apache Hadoop YARN Distributed-Shell,” and represents a simple method for running shell commands and scripts in containers in parallel on a Hadoop YARN cluster.
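The core idea, running one shell command many times in parallel, can be sketched locally. The following is an illustrative simulation only, not the Distributed-Shell or YARN API: each thread stands in for a YARN container that would run on some cluster node.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Local sketch of the Distributed-Shell idea: run the same shell
// command N times in parallel and collect the exit codes. On a real
// YARN cluster, each task below would be a container on a worker node.
public class ParallelShell {
    public static List<Integer> run(String command, int numTasks) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(numTasks);
        List<Future<Integer>> futures = new ArrayList<>();
        for (int i = 0; i < numTasks; i++) {
            futures.add(pool.submit(() -> {
                Process p = new ProcessBuilder("sh", "-c", command).start();
                return p.waitFor();   // exit code of the shell command
            }));
        }
        List<Integer> exitCodes = new ArrayList<>();
        for (Future<Integer> f : futures) exitCodes.add(f.get());
        pool.shutdown();
        return exitCodes;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run("echo hello", 4));  // prints [0, 0, 0, 0]
    }
}
```

The real Distributed-Shell client additionally stages the script in HDFS and negotiates containers through the ResourceManager, as described in Chapter 11.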

Hadoop MapReduce

As mentioned earlier, MapReduce was the first YARN framework and drove many of YARN's requirements. As described in previous chapters, MapReduce works well, is of production quality, retains almost the same feature set as before, and is, with a few minor exceptions, fully compatible with Hadoop version 1. In addition, it has been thoroughly tested on a large scale, and is integrated tightly with the rest of the Hadoop ecosystem projects, such as Apache Pig, Apache Hive, and Apache Oozie.

One important aspect of the YARN design is the increased "user agility" in choosing different versions of MapReduce to use on a cluster. Indeed, with YARN it is possible to have production jobs using a stable MapReduce version, even as test versions of MapReduce are running concurrently. These test versions allow developers to fix issues, develop new features, and fully test new versions of MapReduce on the same cluster.

Apache Tez

One great example of a new framework that exploits the power of YARN is Apache Tez. Many Hadoop jobs consist of executing a complex directed acyclic graph (DAG) of tasks using separate MapReduce stages. Apache Tez generalizes this process and allows these tasks spread across stages to be run as a single, all-encompassing job. For example, a reduce task of a traditional MapReduce job can feed directly into another reduce task without an intermediate (pass-through) map task. The end result is faster processing of jobs and the promotion of what was previously a batch-oriented job to an interactive query.
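The DAG abstraction can be sketched with a small scheduler. This is an illustration of the concept only, not the Tez API: vertices are processing stages, edges carry data between them, and any stage (including a reduce) can feed any other stage directly.

```java
import java.util.*;

// Sketch of the DAG idea behind Tez (illustrative; not the Tez API).
// Stages run in topological order, so a "reduce" vertex can feed
// another "reduce" vertex with no pass-through map stage in between.
public class DagRunner {
    private final Map<String, List<String>> edges = new LinkedHashMap<>();

    public void addEdge(String from, String to) {
        edges.computeIfAbsent(from, k -> new ArrayList<>()).add(to);
        edges.computeIfAbsent(to, k -> new ArrayList<>());
    }

    // Kahn's algorithm: return the stages in an order where every
    // stage runs only after all of its inputs have completed.
    public List<String> executionOrder() {
        Map<String, Integer> inDegree = new LinkedHashMap<>();
        for (String v : edges.keySet()) inDegree.putIfAbsent(v, 0);
        for (List<String> outs : edges.values())
            for (String t : outs) inDegree.merge(t, 1, Integer::sum);
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> e : inDegree.entrySet())
            if (e.getValue() == 0) ready.add(e.getKey());
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String v = ready.poll();
            order.add(v);
            for (String t : edges.get(v))
                if (inDegree.merge(t, -1, Integer::sum) == 0) ready.add(t);
        }
        return order;
    }

    public static void main(String[] args) {
        DagRunner dag = new DagRunner();
        dag.addEdge("map1", "reduce1");
        dag.addEdge("reduce1", "reduce2");  // reduce feeds reduce directly
        System.out.println(dag.executionOrder());  // prints [map1, reduce1, reduce2]
    }
}
```

In classic MapReduce, the `reduce1 -> reduce2` edge would require writing to HDFS and launching a second job with an identity map stage; Tez eliminates that overhead.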

Tez can be used as a MapReduce replacement for projects such as Apache Hive and Apache Pig. It provides them with a more natural model for their execution plans, together with faster response times and high throughput at petabyte scale. The Apache Tez project is part of the Stinger Initiative, a broad, community-based effort to drive the future of Apache Hive, delivering 100-fold performance improvements at petabyte scale with familiar SQL semantics.

Recently, Tez released a technical preview as part of the Stinger Initiative Phase 3 release. Hortonworks reported that besides Apache Hive, Apache Pig and Cascading are moving toward using Tez. Users are able to run jobs with and without Tez to get an understanding of how much performance gain is possible. The Tez preview can be run in the Hortonworks Sandbox VM or with a full Hortonworks Data Platform-based Apache Hadoop cluster.

For more information, see http://tez.incubator.apache.org/, http://hortonworks.com/hadoop/tez/, and http://hortonworks.com/labs/stinger/.

Apache Giraph

Apache Giraph is an iterative graph processing system built for high scalability. This open-source implementation is based on Google's Pregel, which Google uses to calculate PageRank (pages are vertices connected by edges that represent hyperlinks). Giraph is used by Facebook, Twitter, and LinkedIn to create social graphs of users. Both Giraph and Pregel are based on the Bulk Synchronous Parallel (BSP) model of distributed computation, which was introduced by Leslie Valiant. Giraph adds several features beyond the basic Pregel model, including master computation, shared aggregators, edge-oriented input, out-of-core computation, and more.
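The BSP model can be illustrated with a tiny PageRank computation. This is a single-machine sketch of the idea only, not Giraph's API: in each superstep every vertex sends its rank along its out-edges, a global barrier separates supersteps, and ranks are updated from the received messages.

```java
import java.util.Arrays;

// Minimal BSP-style PageRank sketch (illustrative; not the Giraph API).
// Assumes every vertex has at least one out-edge.
public class BspPageRank {
    public static double[] run(int[][] outEdges, int supersteps) {
        int n = outEdges.length;
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n);
        double damping = 0.85;
        for (int step = 0; step < supersteps; step++) {
            double[] incoming = new double[n];   // "messages" received this superstep
            for (int v = 0; v < n; v++)
                for (int t : outEdges[v])
                    incoming[t] += rank[v] / outEdges[v].length;
            for (int v = 0; v < n; v++)          // after the barrier, update ranks
                rank[v] = (1 - damping) / n + damping * incoming[v];
        }
        return rank;
    }

    public static void main(String[] args) {
        // 3 pages in a cycle: 0 -> 1, 1 -> 2, 2 -> 0; ranks converge to 1/3 each
        int[][] graph = {{1}, {2}, {0}};
        System.out.println(Arrays.toString(run(graph, 20)));
    }
}
```

Giraph distributes exactly this pattern: vertices are partitioned across workers, message exchange happens over the network, and the barrier is coordinated cluster-wide.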

Giraph was originally written to run on standard Hadoop version 1 using the MapReduce framework, but that approach was inefficient and unnatural for various reasons: Giraph runs as a map-only job in which each map task is special (breaking typical MapReduce assumptions) and interacts with the other map tasks (vertices). The native Giraph implementation under YARN provides the user with an iterative processing model not directly available with MapReduce.

Support for YARN has been present in Giraph since its 1.0 release. Giraph's YARN-related abstraction is easy to extend or use as a template for new projects. Giraph takes advantage of the ApplicationMaster to perform more natural job control, which includes the ability to spawn and retire tasks as part of each BSP step. In addition, using the flexibility of YARN, Giraph plans to implement its own web interface to monitor job progress.

For more information, see http://giraph.apache.org/.

Hoya: HBase on YARN

The Hoya project creates dynamic and elastic Apache HBase clusters on top of YARN. It does so with a client application that creates the persistent configuration files, sets up the HBase cluster XML files, and then asks YARN to create an ApplicationMaster. YARN copies all files listed in the client’s application-launch request from HDFS into the local file system of the chosen server, and then executes the command to start the ApplicationMaster.

When the Hoya ApplicationMaster starts, it starts an HBase Master on the local machine, which is the sole HBase Master that Hoya currently manages. In parallel with the Master start-up, Hoya asks YARN for a number of containers matching the number of HBase region servers it needs. For each of these containers, Hoya provides the commands to start the region server and does not run any Hoya-specific code on the worker nodes. The Hoya ApplicationMaster points YARN at those files that need to be on the worker nodes and the necessary commands. YARN then does the rest of the work. Because HBase clusters use Apache ZooKeeper to find each other, as do HBase clients, the HBase services locate each other automatically with neither Hoya nor YARN getting involved.
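The shape of that container negotiation can be sketched with plain data structures. This is an illustration only, not the Hoya or YARN API; the class and field names below are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of what an ApplicationMaster like Hoya's asks of YARN
// (illustrative; not the YARN or Hoya API): one container request per
// desired region server, each carrying the command YARN should run.
public class RegionServerRequests {
    static final class ContainerRequest {
        final int memoryMb;            // resources asked of YARN
        final String launchCommand;    // what YARN runs in the container
        ContainerRequest(int memoryMb, String launchCommand) {
            this.memoryMb = memoryMb;
            this.launchCommand = launchCommand;
        }
    }

    public static List<ContainerRequest> forRegionServers(int count, int memoryMb) {
        List<ContainerRequest> requests = new ArrayList<>();
        for (int i = 0; i < count; i++)
            // No Hoya-specific code runs on the worker: YARN simply
            // executes the start command with the staged HBase files.
            requests.add(new ContainerRequest(memoryMb, "hbase regionserver start"));
        return requests;
    }

    public static void main(String[] args) {
        List<ContainerRequest> reqs = forRegionServers(3, 2048);
        System.out.println(reqs.size() + " region-server containers requested");
    }
}
```

The key design point survives the simplification: the ApplicationMaster only describes resources and commands, while YARN performs the placement and launch.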

For more information, see http://hortonworks.com/blog/introducing-hoya-hbase-on-yarn/.

Dryad on YARN

Similar to Apache Tez, Microsoft's Dryad provides a DAG as the abstraction of execution flow. It has been ported to run natively on YARN, and Dryad on YARN is fully compatible with its non-YARN version.

The ported code is written completely in native C++ and C# for worker nodes. The ApplicationMaster leverages a thin layer of Java interfacing with the ResourceManager so that the native Dryad graph manager can schedule work. Eventually, the Java layer will be replaced by direct interaction with protocol-buffer interfaces. Overall, this project demonstrates, as an aside, that YARN enables applications to be written in the programming language of choice.

For more information, see http://research.microsoft.com/en-us/projects/dryad/.

Apache Spark

Spark was initially developed for applications where keeping data in memory helps performance, such as iterative algorithms, which are common in machine learning, and interactive data mining.

Spark is often compared to MapReduce because it provides parallel processing over HDFS and other Hadoop input sources. Spark differs from MapReduce in two important ways, however. First, Spark holds intermediate results in memory, rather than writing them to disk—an approach that drastically decreases query response times. Second, Spark supports more than just MapReduce functions, greatly expanding the set of possible analyses that can be executed over HDFS data stores. Spark offers a general execution model that can optimize arbitrary operator graphs, and it supports in-memory computing, which lets it query data faster than disk-based engines like MapReduce. It also provides clean, concise APIs in Scala, Java, and Python. Users can also use Spark interactively from the Scala and Python shells to rapidly query big data sets.
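The first difference, keeping intermediate results in memory, is worth making concrete. The following is a single-machine sketch of the caching idea only, not the Spark API; the class and the `loads` counter are hypothetical names standing in for reads from HDFS.

```java
import java.util.List;
import java.util.function.Supplier;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Sketch of the idea behind Spark's speedup (illustrative; not Spark's
// API): load a data set once, keep it in memory, and serve many queries
// against the cached copy instead of re-reading it from disk each time.
public class CachedDataset {
    private final Supplier<List<Integer>> loader;  // stands in for an HDFS read
    private List<Integer> cache;                   // in-memory copy after first load
    public int loads = 0;                          // how many times we hit "disk"

    public CachedDataset(Supplier<List<Integer>> loader) { this.loader = loader; }

    private List<Integer> data() {
        if (cache == null) { cache = loader.get(); loads++; }
        return cache;
    }

    // Two "queries" that share the same cached data set.
    public long countEven() { return data().stream().filter(x -> x % 2 == 0).count(); }
    public int max()        { return data().stream().max(Integer::compare).orElseThrow(); }

    public static void main(String[] args) {
        CachedDataset ds = new CachedDataset(
                () -> IntStream.rangeClosed(1, 10).boxed().collect(Collectors.toList()));
        System.out.println(ds.countEven() + " " + ds.max() + " loads=" + ds.loads);
        // prints "5 10 loads=1": both queries share one load
    }
}
```

An iterative algorithm that scans the same data set dozens of times benefits the same way, which is why machine learning workloads were Spark's original target.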

Since 2013, Spark has been running on production YARN clusters at Yahoo!. The advantage of porting and running Spark on top of YARN is the common resource management and a single underlying data fabric. Spark users can continue to use the same data for building models and share the same physical resources with other Hadoop frameworks.

For more information, see http://spark.incubator.apache.org.

Apache Storm

Apache Storm allows processing of unbounded streams of data in real time. It is designed to be used with any programming language. The basic Storm use cases are real-time analytics, online machine learning, continuous computation, distributed RPC (remote procedure call), ETL (extract, transform, load), and more. Storm provides fast performance, scalability, fault tolerance, and processing guarantees.

Traditional MapReduce jobs are expected to eventually finish, but Storm continuously processes messages until it is stopped. This behavior makes it ideal for a YARN cluster. A Storm cluster has two kinds of nodes, the master node and the worker nodes, both of which can be implemented with an ApplicationMaster. The master node runs a daemon called "Nimbus" that is responsible for distributing code around the cluster, assigning tasks to machines, and monitoring for failures. Each worker node runs a daemon called the "Supervisor," which listens for work assigned to its machine and starts and stops worker processes as necessary based on what Nimbus has assigned to it. Each worker process executes a subset of a topology; a running topology consists of many worker processes spread across many machines.
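The topology model, sources emitting tuples into streams consumed by processing steps, can be sketched in miniature. This is an illustration of the concept only, not the Storm API; a real topology runs continuously, so the end-of-stream marker below exists only to make the sketch terminate.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch of the spout/bolt idea in a Storm topology (illustrative; not
// the Storm API): a "spout" emits a stream of tuples into a queue and a
// "bolt" consumes and transforms them. In Storm, many such workers run
// pieces of one topology across the cluster, continuously.
public class MiniTopology {
    private static final String EOF = "__eof__";  // sketch-only end marker

    public static List<String> run(List<String> sentences) throws Exception {
        BlockingQueue<String> stream = new LinkedBlockingQueue<>();
        List<String> words = new ArrayList<>();

        // Spout: emits sentences into the stream, then signals the end.
        Thread spout = new Thread(() -> {
            sentences.forEach(stream::add);
            stream.add(EOF);
        });
        // Bolt: splits each sentence into words (one step of the topology).
        Thread bolt = new Thread(() -> {
            try {
                for (String s = stream.take(); !s.equals(EOF); s = stream.take())
                    words.addAll(Arrays.asList(s.split(" ")));
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });
        spout.start(); bolt.start();
        spout.join(); bolt.join();
        return words;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run(Arrays.asList("hello yarn", "hello storm")));
        // prints [hello, yarn, hello, storm]
    }
}
```

In a real deployment, Nimbus would assign the spout and bolt to Supervisor-managed worker processes on different machines, with the stream flowing over the network.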

Efforts are under way to run Storm directly under YARN and take advantage of the common resource management substrate.

For more information, see http://storm-project.net/documentation.html.

REEF: Retainable Evaluator Execution Framework

YARN’s flexibility sometimes requires significant effort on the part of application implementers. Writing a custom application on YARN includes building one’s own ApplicationMaster, performing client and container management, and handling aspects of fault tolerance, execution flow, coordination, and other concerns. The REEF project by Microsoft recognizes this challenge and factors out several components that are common to many applications, such as storage management, data caching, fault detection, and checkpoints. Framework designers can build on top of REEF more easily than they can build directly on YARN, and can reuse these common services/libraries. REEF’s design makes it suitable for both MapReduce and DAG-like executions as well as iterative and interactive computations.

Hamster: Hadoop and MPI on the Same Cluster

The Message Passing Interface (MPI) is widely used in high-performance computing (HPC). MPI is primarily a set of optimized message-passing library calls for C, C++, and Fortran that operate over popular server interconnects such as Ethernet and InfiniBand. Because users have full control of their YARN containers, there is no reason why MPI applications cannot run within a Hadoop cluster. The Hamster effort is a work-in-progress that provides a good discussion of the issues involved in mapping MPI to a YARN cluster (see https://issues.apache.org/jira/browse/MAPREDUCE-2911). Currently, an alpha version of MPICH2 is available for YARN that can be used to run MPI applications.

For more information, see https://github.com/clarkyzl/mpich2-yarn.

Wrap-up

Application frameworks for Apache Hadoop YARN are emerging and evolving at a rapid pace. This area is expected to see large amounts of growth as developers create more applications that move beyond MapReduce and take full advantage of the data services and capabilities offered by a shared Hadoop YARN cluster. Indeed, like many successful data processing platforms in use today, Apache Hadoop YARN will eventually migrate to be a behind-the-scenes layer and allow users to move away from execution management and closer to their big data applications and the subsequent discoveries they provide.