Hadoop in Practice, Second Edition (2015)

Part 4. Beyond MapReduce

Chapter 10. Writing a YARN application

This chapter covers

· Understanding key capabilities of a YARN application

· How to write a basic YARN application

· An examination of YARN frameworks and applications

Looking at the source code for any reasonably sized YARN application typically results in words like “complex” and “low-level” being thrown around. At its core, writing a YARN application isn’t that complex, as you’ll discover in this chapter. The complexity with YARN is typically introduced once you need to build more advanced features into your application, such as supporting secure Hadoop clusters or handling failure scenarios, which are complicated in distributed systems regardless of the framework. That being said, there are emerging frameworks that abstract away the YARN APIs and provide common features that you’ll require.

In this chapter, you’ll write a simple YARN application that will run a Linux command on a node in the cluster. Once you’ve run your application, you’ll be introduced to some of the more advanced features that you may need in your YARN application. Finally, this chapter looks at some of the open source YARN abstractions and examines their features.

Before we get started, let’s ease into YARN programming by looking at the building blocks of a YARN application.

10.1. Fundamentals of building a YARN application

This section provides a brief high-level overview of the YARN actors and the basic communication flows that you’ll need to support in your YARN application.

10.1.1. Actors

There are five separate pieces of a YARN application that are either part of the YARN framework or components that you must create yourself (which I call the user space), all of which are shown in figure 10.1.

Figure 10.1. The main actors and communication paths in a YARN application

The actors in a YARN application and the YARN framework include

· YARN client —The YARN client, in the user space, is responsible for launching the YARN application. It sends createApplication and submitApplication requests to the ResourceManager and can also kill the application.

· ResourceManager —In the framework, a single cluster-wide ResourceManager is responsible for receiving container allocation requests and asynchronously notifying clients when resources become available for their containers.

· ApplicationMaster —The ApplicationMaster in the user space is the main coordinator for an application, and it works with the ResourceManager and NodeManagers to request and launch containers.

· NodeManager —In the framework, each node runs a NodeManager that’s responsible for servicing client requests to launch and kill containers.

· Container —The container in the user space is an application-specific process that performs work on behalf of the application. A container could be a simple fork of an existing Linux process (such as the find command to find files), or an application-developed service such as a map or reduce task for MapReduce YARN applications.

The following sections discuss these actors and their roles in your YARN application.

10.1.2. The mechanics of a YARN application

When implementing a YARN application, there are a number of interactions that you need to support. Let’s examine each interaction and what information is relayed between the components.

Resource Allocation

When the YARN client or the ApplicationMaster asks the ResourceManager for a new container, it indicates the resources that the container needs in a Resource object. The ApplicationMaster also sends some additional attributes in a ResourceRequest, as shown in figure 10.2.

Figure 10.2. Resource properties that can be requested for a container

The resourceName specifies the host and rack where the container should be executed, and it can be wildcarded with an asterisk to inform the ResourceManager that the container can be launched on any node in the cluster.

The ResourceManager responds to a resource request with a Container object that represents a single unit of execution (a process). The container includes an ID, a resourceName, and other attributes. Once the YARN client or ApplicationMaster receives this message from the ResourceManager, it can communicate with the NodeManager to launch the container.

Launching a container

Once a client receives the Container from the ResourceManager, it’s ready to talk to the NodeManager associated with that container in order to launch it. Figure 10.3 shows the information that the client sends to the NodeManager as part of the request.

Figure 10.3. Container request metadata

The NodeManager is responsible for downloading any local resources identified in the request (including items such as any libraries required by the application or files in the distributed cache) from HDFS. Once these files are downloaded, the NodeManager launches the container process.

With these YARN preliminaries out of the way, let’s go ahead and start writing a YARN application.

10.2. Building a YARN application to collect cluster statistics

In this section you’ll build a simple YARN application that will launch a single container to execute the vmstat Linux command. As you build this simple example, we’ll focus on the plumbing needed to get a YARN application up and running. The next section covers the more advanced capabilities that you’ll likely require in a full-blown YARN application.

Figure 10.4 shows the various components that you’ll build in this section and their interactions with the YARN framework.

Figure 10.4. An overview of the YARN application that you’ll build

Let’s get started by building the YARN client.

Technique 101 A bare-bones YARN client

The role of the YARN client is to negotiate with the ResourceManager for a YARN application instance to be created and launched. As part of this work, you’ll need to inform the ResourceManager about the system resource requirements of your ApplicationMaster. Once the ApplicationMaster is up and running, the client can choose to monitor the status of the application.

This technique will show you how to write a client that performs the three steps illustrated in figure 10.5.

Figure 10.5. The three activities that your YARN client will perform

Problem

You’re building a YARN application, so you need to write a client to launch your application.

Solution

Use the YarnClient class to create and submit a YARN application.

Discussion

Let’s walk through the code for each of the steps highlighted in figure 10.5, starting with creating a new YARN application.

Creating a YARN application

The first thing your YARN client needs to do is communicate with the ResourceManager about its intent to start a new YARN application. The response from the ResourceManager is a unique application ID, which is used when submitting the application and which the YARN command line also accepts for queries such as retrieving logs.

The following code shows how you can get a handle to a YarnClient instance and use that to create the application:[1]

1 GitHub source: https://github.com/alexholmes/hiped2/blob/master/src/main/java/hip/ch10/dstat/Client.java.
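The listing below is a minimal sketch of that step rather than the book’s exact code (the full Client.java is at the GitHub link in the footnote); it assumes the configuration is picked up from a vanilla YarnConfiguration:

import org.apache.hadoop.yarn.api.protocolrecords.GetNewApplicationResponse;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

...

YarnConfiguration conf = new YarnConfiguration();

// Create and start the client that talks to the ResourceManager.
YarnClient yarnClient = YarnClient.createYarnClient();
yarnClient.init(conf);
yarnClient.start();

// Ask the ResourceManager for a new application; the response carries the
// application ID and details about the cluster's resource capabilities.
YarnClientApplication app = yarnClient.createApplication();
GetNewApplicationResponse appResponse = app.getNewApplicationResponse();
ApplicationId appId = appResponse.getApplicationId();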

The createApplication method will call the ResourceManager, which will return a new application ID. In addition, the YarnClientApplication object contains information about the cluster, such as the resource capabilities that can be used to predetermine container resource properties.

The YarnClient class used in the preceding code contains a number of APIs that result in an RPC call to the ResourceManager. Some of these methods are shown in the following extract from the code:[2]

2 Some queue and security APIs were omitted from the YarnClient class—refer to the YarnClient Javadocs for the complete API: http://hadoop.apache.org/docs/stable/api/org/apache/hadoop/yarn/client/api/YarnClient.html.
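Paraphrasing the Javadocs, the most commonly used methods look roughly like this (signatures abridged; the footnote has the authoritative list):

public abstract class YarnClient extends AbstractService {

  // Ask the ResourceManager for a new application ID.
  public abstract YarnClientApplication createApplication()
      throws YarnException, IOException;

  // Submit the application, causing the ResourceManager to launch its ApplicationMaster.
  public abstract ApplicationId submitApplication(ApplicationSubmissionContext appContext)
      throws YarnException, IOException;

  // Kill a running application.
  public abstract void killApplication(ApplicationId applicationId)
      throws YarnException, IOException;

  // Retrieve the current state and other details for an application.
  public abstract ApplicationReport getApplicationReport(ApplicationId appId)
      throws YarnException, IOException;

  // List applications known to the ResourceManager.
  public abstract List<ApplicationReport> getApplications()
      throws YarnException, IOException;

  // Cluster-wide metrics, such as the number of NodeManagers.
  public abstract YarnClusterMetrics getYarnClusterMetrics()
      throws YarnException, IOException;

  // Per-node reports for nodes in the given states.
  public abstract List<NodeReport> getNodeReports(NodeState... states)
      throws YarnException, IOException;
}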

Creating an application in YARN doesn’t do anything other than inform the ResourceManager of your intent to launch the application. The next step shows what you need to do to have the ResourceManager launch your ApplicationMaster.

Submitting a YARN application

Submitting the YARN application launches your ApplicationMaster in a new container in your YARN cluster. But there are several items you need to configure before you can submit the application, including the following:

· An application name

· The command to launch the ApplicationMaster, along with the classpath and environment settings

· Any JARs, configuration files, and other files that your application needs to perform its work

· The resource requirements for the ApplicationMaster (memory and CPU)

· Which scheduler queue to submit the application to and the application priority within the queue

· Security tokens

Let’s look at the code required to get a basic Java-based ApplicationMaster up and running. We’ll break this code up into two subsections: preparing the ContainerLaunchContext object, and then specifying the resource requirements and submitting the application.

First up is the ContainerLaunchContext, which is where you specify the command to launch your ApplicationMaster, along with any other environmental details required for your application to execute:[3]

3 GitHub source: https://github.com/alexholmes/hiped2/blob/master/src/main/java/hip/ch10/dstat/Client.java.
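Here’s a sketch of that step; the JAR path (/apps/hip/hip.jar), the localized resource name, and the ApplicationMaster class name are assumptions for this example, and the CLASSPATH is abbreviated (further imports omitted):

// Continuing the client sketch.
ContainerLaunchContext amContainer = Records.newRecord(ContainerLaunchContext.class);

// The command the NodeManager executes to start the ApplicationMaster.
// LOG_DIR_EXPANSION_VAR ("<LOG_DIR>") is expanded by the NodeManager into
// the container's log directory.
amContainer.setCommands(Collections.singletonList(
    "$JAVA_HOME/bin/java -Xmx128m hip.ch10.dstat.ApplicationMaster"
        + " 1>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stdout"
        + " 2>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stderr"));

// Tell the NodeManager to localize the application JAR from HDFS.
Path jarPath = new Path("/apps/hip/hip.jar");
FileStatus jarStatus = FileSystem.get(conf).getFileStatus(jarPath);
LocalResource appJar = Records.newRecord(LocalResource.class);
appJar.setResource(ConverterUtils.getYarnUrlFromPath(jarStatus.getPath()));
appJar.setSize(jarStatus.getLen());
appJar.setTimestamp(jarStatus.getModificationTime());
appJar.setType(LocalResourceType.FILE);
appJar.setVisibility(LocalResourceVisibility.APPLICATION);
amContainer.setLocalResources(Collections.singletonMap("hip.jar", appJar));

// Environment for the ApplicationMaster, including a classpath that picks up
// the localized JAR and the Hadoop libraries.
Map<String, String> env = new HashMap<String, String>();
env.put("CLASSPATH", "./*:$HADOOP_CONF_DIR:$HADOOP_COMMON_HOME/share/hadoop/common/*");
amContainer.setEnvironment(env);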

The final steps are specifying the memory and CPU resources needed by the ApplicationMaster, followed by the application submission:[4]

4 GitHub source: https://github.com/alexholmes/hiped2/blob/master/src/main/java/hip/ch10/dstat/Client.java.
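Again as a sketch (the application name and queue are illustrative):

// Continuing the client sketch.
ApplicationSubmissionContext appContext = app.getApplicationSubmissionContext();
appContext.setApplicationName("dstat");
appContext.setAMContainerSpec(amContainer);
appContext.setQueue("default");
appContext.setPriority(Priority.newInstance(0));

// Memory (in MB) and virtual cores for the ApplicationMaster's container.
appContext.setResource(Resource.newInstance(256, 1));

// Hand the application off to the ResourceManager. This call returns as soon
// as the request is accepted -- it doesn't wait for the ApplicationMaster to start.
yarnClient.submitApplication(appContext);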

All container requests sent to the ResourceManager are processed asynchronously, so just because submitApplication returns doesn’t mean your ApplicationMaster is up and running. To figure out the state of your application, you’ll need to poll the ResourceManager for the application status, which will be covered next.

Waiting for the YARN application to complete

After submitting an application, you can poll the ResourceManager for information on the state of your ApplicationMaster. The result will contain details such as

· The state of your application

· The host the ApplicationMaster is running on, and an RPC port (if any) where it’s listening for client requests (not applicable in our example)

· A tracking URL, if supported by the ApplicationMaster, which provides details on the progress of the application (again not supported in our example)

· General information such as the queue name and container start time

Your ApplicationMaster can be in any one of the states shown in figure 10.6 (the states are contained in the enum YarnApplicationState).

Figure 10.6. ApplicationMaster states

The following code performs the final step of your client, which is to regularly poll the ResourceManager until the ApplicationMaster has completed:[5]

5 GitHub source: https://github.com/alexholmes/hiped2/blob/master/src/main/java/hip/ch10/dstat/Client.java.
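A simple polling loop along these lines does the job:

// Continuing the client sketch.
ApplicationReport report = yarnClient.getApplicationReport(appId);
YarnApplicationState state = report.getYarnApplicationState();

// Poll until the application reaches a terminal state.
while (state != YarnApplicationState.FINISHED
    && state != YarnApplicationState.FAILED
    && state != YarnApplicationState.KILLED) {
  Thread.sleep(1000);
  report = yarnClient.getApplicationReport(appId);
  state = report.getYarnApplicationState();
}

System.out.println("Application " + appId + " finished with state " + state);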

Summary

There are a number of more advanced client capabilities that weren’t explored in this section, such as security. Section 10.3 discusses this and other features that you’ll probably want to build into your client.

With your YARN client in place, it’s time to turn to the second half of your YARN application—the ApplicationMaster.

Technique 102 A bare-bones ApplicationMaster

The ApplicationMaster is the coordinator of the YARN application. It’s responsible for asking the ResourceManager for containers and then launching the containers via the NodeManager. Figure 10.7 shows these interactions, which you’ll explore in this technique.

Figure 10.7. The basic functions that your ApplicationMaster will perform

Problem

You’re building a YARN application and need to implement an ApplicationMaster.

Solution

Use the YARN ApplicationMaster APIs to coordinate your work via the ResourceManager and NodeManager.

Discussion

As in the previous technique, we’ll break down the actions that the ApplicationMaster needs to perform.

Register with the ResourceManager

The first step is to register the ApplicationMaster with the ResourceManager. To do so, you need to get a handle to an AMRMClient instance, which you’ll use for all your communication with the ResourceManager:[6]

6 GitHub source: https://github.com/alexholmes/hiped2/blob/master/src/main/java/hip/ch10/dstat/ApplicationMaster.java.
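A minimal registration sketch follows (the full ApplicationMaster.java is at the GitHub link in the footnote); the empty host, port, and tracking URL reflect the fact that this example’s ApplicationMaster doesn’t expose any services:

import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

...

YarnConfiguration conf = new YarnConfiguration();

// Client used for all communication with the ResourceManager.
AMRMClient<ContainerRequest> amRMClient = AMRMClient.createAMRMClient();
amRMClient.init(conf);
amRMClient.start();

// Register this ApplicationMaster; the ResourceManager won't allocate
// containers to an unregistered ApplicationMaster.
amRMClient.registerApplicationMaster("", 0, "");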

Submit a container request and launch it when one is available

Next you’ll need to specify all the containers that you want to request. In this simple example, you’ll request a single container, and you won’t specify a specific host or rack on which it’ll run:[7]

7 GitHub source: https://github.com/alexholmes/hiped2/blob/master/src/main/java/hip/ch10/dstat/ApplicationMaster.java.
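A sketch of the request-and-launch loop is shown below; it also creates the NMClient used to talk to NodeManagers (further imports omitted):

// Continuing the ApplicationMaster sketch: the NMClient talks to the
// NodeManagers that host allocated containers.
NMClient nmClient = NMClient.createNMClient();
nmClient.init(conf);
nmClient.start();

// Ask for one container anywhere in the cluster (no node or rack constraints).
Priority priority = Priority.newInstance(0);
Resource capability = Resource.newInstance(128, 1);    // 128 MB, 1 virtual core
amRMClient.addContainerRequest(
    new AMRMClient.ContainerRequest(capability, null, null, priority));

int launched = 0;
while (launched == 0) {
  // allocate() doubles as the heartbeat and carries any pending requests.
  AllocateResponse response = amRMClient.allocate(0.0f);
  for (Container container : response.getAllocatedContainers()) {
    // Ask the container's NodeManager to run vmstat, redirecting output
    // into the container's log directory.
    ContainerLaunchContext ctx = Records.newRecord(ContainerLaunchContext.class);
    ctx.setCommands(Collections.singletonList(
        "vmstat 1>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stdout"
            + " 2>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stderr"));
    nmClient.startContainer(container, ctx);
    launched++;
  }
  Thread.sleep(1000);
}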

The AMRMClient’s allocate method performs a number of important functions:

· It acts as a heartbeat message to the ResourceManager. If the ResourceManager doesn’t receive a heartbeat message after 10 minutes, it will consider the ApplicationMaster to be in a bad state and will kill the process. The default expiry value can be changed by setting yarn.am.liveness-monitor.expiry-interval-ms.

· It sends any container allocation requests that were added to the client.

· It receives zero or more allocated containers that resulted from container allocation requests.

The first time that allocate is called in this code, the container request will be sent to the ResourceManager. Because the ResourceManager handles container requests asynchronously, the response won’t contain the allocated container. Instead, a subsequent invocation of allocate will return the allocated container.

Wait for the container to complete

At this point you’ve asked the ResourceManager for a container, received a container allocation from the ResourceManager, and communicated with a NodeManager to launch the container. Now you have to continue to call the allocate method and extract from the response any containers that completed:[8]

8 GitHub source: https://github.com/alexholmes/hiped2/blob/master/src/main/java/hip/ch10/dstat/ApplicationMaster.java.
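Sketched out, the completion loop and the final deregistration look like this:

// Continuing the ApplicationMaster sketch.
int completed = 0;
while (completed == 0) {
  // Keep heartbeating; completed-container statuses arrive in the response.
  AllocateResponse response = amRMClient.allocate(0.0f);
  for (ContainerStatus status : response.getCompletedContainersStatuses()) {
    System.out.println("Container " + status.getContainerId()
        + " completed with exit status " + status.getExitStatus());
    completed++;
  }
  Thread.sleep(1000);
}

// Tell the ResourceManager the application is done so it can clean up.
amRMClient.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, "", "");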

Summary

In this technique you used the AMRMClient and NMClient classes to communicate with the ResourceManager and NodeManagers. These clients provide synchronous APIs to the YARN services. They have asynchronous counterparts (AMRMClientAsync and NMClientAsync) that encapsulate the heartbeat functionality and will call back into your code when messages are received from the ResourceManager. The async APIs may make it easier to reason about the interactions with the ResourceManager because the ResourceManager processes everything asynchronously.

There are a few more features that the ResourceManager and NodeManager expose to ApplicationMasters:[9]

9 The complete Javadocs for AMRMClient can be viewed at http://hadoop.apache.org/docs/stable/api/org/apache/hadoop/yarn/client/api/AMRMClient.html.
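Paraphrasing the Javadocs, the AMRMClient API looks roughly like this (signatures abridged):

public abstract class AMRMClient<T extends AMRMClient.ContainerRequest>
    extends AbstractService {

  // Register and unregister this ApplicationMaster with the ResourceManager.
  public abstract RegisterApplicationMasterResponse registerApplicationMaster(
      String appHostName, int appHostPort, String appTrackingUrl)
      throws YarnException, IOException;
  public abstract void unregisterApplicationMaster(FinalApplicationStatus appStatus,
      String appMessage, String appTrackingUrl) throws YarnException, IOException;

  // Queue (or withdraw) container requests; they're sent on the next allocate() call.
  public abstract void addContainerRequest(T req);
  public abstract void removeContainerRequest(T req);

  // Give back a container that was allocated but is no longer needed.
  public abstract void releaseAssignedContainer(ContainerId containerId);

  // Heartbeat: sends pending requests and returns allocations and completions.
  public abstract AllocateResponse allocate(float progressIndicator)
      throws YarnException, IOException;

  // Headroom and cluster size, useful when deciding how much to request.
  public abstract Resource getAvailableResources();
  public abstract int getClusterNodeCount();
}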

Similarly, the NMClient API exposes a handful of mechanisms that you can use to control and get metadata about your containers:[10]

10 The complete Javadocs for NMClient are available at http://hadoop.apache.org/docs/stable/api/org/apache/hadoop/yarn/client/api/NMClient.html.
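Again paraphrasing the Javadocs (signatures abridged):

public abstract class NMClient extends AbstractService {

  // Launch a container on its NodeManager using the supplied launch context.
  public abstract Map<String, ByteBuffer> startContainer(Container container,
      ContainerLaunchContext containerLaunchContext) throws YarnException, IOException;

  // Stop a running container.
  public abstract void stopContainer(ContainerId containerId, NodeId nodeId)
      throws YarnException, IOException;

  // Query a container's status (state, exit code, diagnostics).
  public abstract ContainerStatus getContainerStatus(ContainerId containerId, NodeId nodeId)
      throws YarnException, IOException;

  // Control whether containers started by this client are stopped when the client stops.
  public void cleanupRunningContainersOnStop(boolean enabled) { /* ... */ }
}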

At this point you’ve written the code for a complete YARN application! Next you’ll execute your application on a cluster.

Technique 103 Running the application and accessing logs

At this point you have a functional YARN application. In this section, you’ll look at how to run the application and access its output.

Problem

You want to run your YARN application.

Solution

Use the regular Hadoop command line to launch it and view the container outputs.

Discussion

The hip script that you’ve been using to launch all the examples in this book also works for running the YARN application. Behind the scenes, hip calls the hadoop script to run the examples.

The following example shows the output of running the YARN application that was written in the last two techniques. It runs a vmstat Linux command in a single container:

$ hip --nolib hip.ch10.dstat.basic.Client
client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
Submitting application application_1398974791337_0055
impl.YarnClientImpl: Submitted application application_1398974791337_0055 to ResourceManager at /0.0.0.0:8032
Application application_1398974791337_0055 finished with state FINISHED

If you have log aggregation enabled (see technique 3 for more details), you can issue the following command to view the log output of both the ApplicationMaster and the vmstat container:
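The standard yarn logs command does the job; it takes the application ID reported when you submitted the application:

$ yarn logs -applicationId application_1398974791337_0055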

The ApplicationMaster directed the container’s standard output to the stdout file, and you can see the output of the vmstat command in that file.

Accessing logs when containers fail to start

It’s likely that during the development of your YARN application, either the ApplicationMaster or one of your containers will fail to launch due to missing resources or errors in the startup command. Depending on where the failure occurs, the error will show up either in the container logs or, if the process failed to start outright, in the NodeManager logs.

Retaining localized and log directories

The yarn.nodemanager.delete.debug-delay-sec configuration property controls how long the localized and log directories for the application are kept around. The localized directory contains the command executed by the NodeManager to launch containers (both the ApplicationMaster and the application containers), as well as any JARs and other localized resources that were specified by the application for the container.

It’s recommended that you set this property to a value that gives you enough time to diagnose failures. But don’t set this value too high (say, on the order of days), as this could put pressure on your storage.

An alternative to hunting down ApplicationMaster startup problems is to run an unmanaged ApplicationMaster, which is covered in the next technique.

Technique 104 Debugging using an unmanaged application master

Debugging a YARN ApplicationMaster is a challenge, as it’s launched on a remote node and requires you to pull logs from that node to troubleshoot your code. ApplicationMasters that are launched by the ResourceManager in this way are called managed ApplicationMasters, as shown in figure 10.8.

Figure 10.8. A managed ApplicationMaster

YARN also supports the notion of an unmanaged ApplicationMaster, where the ApplicationMaster is launched on a local node, as seen in figure 10.9. Issues with an ApplicationMaster are easier to diagnose when it’s running on the local host.

Figure 10.9. An unmanaged ApplicationMaster

In this section you’ll discover how to run an unmanaged ApplicationMaster and learn how it can be put to use in your projects.

Problem

You want to run a local instance of an ApplicationMaster.

Solution

Run an unmanaged ApplicationMaster.

Discussion

YARN comes bundled with an application called the UnmanagedAMLauncher, which launches an unmanaged ApplicationMaster. An unmanaged ApplicationMaster is one that isn’t launched by the ResourceManager. The UnmanagedAMLauncher still liaises with the ResourceManager to create a new application, but rather than issuing a submitApplication call to the ResourceManager (as is the case with managed ApplicationMasters), the UnmanagedAMLauncher starts the ApplicationMaster process itself.

When using the UnmanagedAMLauncher, you don’t have to define a YARN client, so all you need to provide are the details required to launch your ApplicationMaster. The following example shows how you can execute the ApplicationMaster that you wrote in the previous techniques:
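A command along these lines will do it (the launcher lives in the hadoop-yarn-applications-unmanaged-am-launcher module; the classpath value here is an assumption, and the exact option names can differ between Hadoop versions, so check the launcher’s usage output):

$ hadoop org.apache.hadoop.yarn.applications.unmanagedamlauncher.UnmanagedAMLauncher \
    -classpath $(hadoop classpath):/path/to/hip.jar \
    -cmd "java hip.ch10.dstat.ApplicationMaster"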

The UnmanagedAMLauncher captures the ApplicationMaster’s standard output and standard error and outputs them to its own standard output. This is useful in situations where your ApplicationMaster is failing to start, in which case the error will be seen in the output of the preceding command, as opposed to being tucked away in the logs of the NodeManager.

Figure 10.10 shows the interactions between the UnmanagedAMLauncher and the ResourceManager.

Figure 10.10. The unmanaged launcher working with the ResourceManager to launch an unmanaged ApplicationMaster

There’s nothing stopping you from writing your own unmanaged ApplicationMaster launcher if the capabilities in UnmanagedAMLauncher are too limited. The following code shows the key step that the UnmanagedAMLauncher takes to tell the ResourceManager that the ApplicationMaster is unmanaged:

ApplicationSubmissionContext appContext = ...;
appContext.setUnmanagedAM(true);

Unmanaged ApplicationMasters are useful as they provide local access to an ApplicationMaster, which can ease your debugging and profiling efforts.

Next, let’s look at some more advanced capabilities that you may want to support in your YARN applications.

10.3. Additional YARN application capabilities

So far in this chapter, we’ve looked at a bare-bones YARN application that launches a Linux command in a container. However, if you’re developing a YARN application, it’s likely that you’ll need to support more sophisticated capabilities. This section highlights some features that you may need to support in your application.

10.3.1. RPC between components

If you have a long-running application, you may want to allow clients to communicate with the ApplicationMaster. Your ApplicationMaster may also need to be able to communicate with containers, and vice versa. An example could be a SQL-on-Hadoop application that allows clients to send queries to the ApplicationMaster, and whose ApplicationMaster then coordinates containers to perform the work.

YARN doesn’t provide you with any plumbing here, so you need to pick an RPC protocol and supporting library. You have a few options:

· Thrift or Avro —Both of these provide an interface definition language (IDL) where you can define endpoints and messages, which are compiled into concrete client and service code that can be easily incorporated into your code. The advantages of these libraries are code generation and schema evolution, allowing your services to evolve over time.

· Protocol Buffers —Google didn’t open source the RPC layer for Protocol Buffers, so you’ll need to roll your own transport. You can use REST over HTTP as the transport and easily implement it using Jersey’s annotations.

· Hadoop’s RPC —Behind the scenes, this uses Protocol Buffers.

Because YARN doesn’t support communication between your components, how can you know which hosts or ports your services are listening on?

10.3.2. Service discovery

YARN can schedule multiple containers on the same node, so hard-wiring the listening port for any service in your container or ApplicationMaster isn’t ideal. Instead, you can pick one of the following strategies:

· If your ApplicationMaster has a built-in service, pass the launched containers the ApplicationMaster host and port details, and have containers call back to the ApplicationMaster with their port number.

· Use ZooKeeper as a service registry by having containers publish their host and port details to ZooKeeper, and have clients look up services in ZooKeeper. This is the strategy that Apache Twill, covered later in this chapter, employs.

Next up is a look at maintaining state in your application so that you can resume from a well-known state in the event of an application restart.

10.3.3. Checkpointing application progress

If your application is long-running and maintains and builds state during its execution, you may need to periodically persist that state so that in the event of a container restart, a container or ApplicationMaster can pick up where it left off. Containers can be killed for a variety of reasons, including making resources available for other users and applications. ApplicationMasters going down are typically the result of an error in your application logic, the node going down, or a cluster restart.

Two services you can use for checkpointing are HDFS and ZooKeeper. Apache Twill, an abstracted framework for writing YARN applications, uses ZooKeeper to checkpoint container and ApplicationMaster state.

One area to be aware of with checkpointing is handling split-brain situations.

10.3.4. Avoiding split-brain

It’s possible that a networking problem will result in the ResourceManager believing that an ApplicationMaster is down and launching a new ApplicationMaster. This can lead to an undesired outcome if your application produces outputs or intermediary data in a way that’s not idempotent.

This was a problem in the early MapReduce YARN application, where task- and job-level commits could be executed more than once, which caused trouble for commit actions that aren’t idempotent.[11] The solution was to introduce a delay before committing and to use the ResourceManager heartbeat to verify that the ApplicationMaster was still valid. Refer to the JIRA ticket for more details.

11 See the JIRA ticket titled “MR AM can get in a split brain situation” at https://issues.apache.org/jira/browse/MAPREDUCE-4832.

10.3.5. Long-running applications

Some YARN applications, such as Impala, are long-running, and as a result have requirements that differ from applications that are more transient in nature. If your application is also long-lived, you should be aware of the following points, some of which are currently being worked on in the community:

· Gang scheduling, which allows a large number of containers to be scheduled in a short period of time (YARN-624).

· Long-lived container support, allowing containers to indicate the fact that they’re long-lived so that the scheduler can make better allocation and management decisions (YARN-1039).

· Anti-affinity settings, so that applications can specify that multiple containers aren’t allocated on the same node (YARN-397).

· Renewal of delegation tokens when running on a secure Hadoop cluster. Kerberos tokens expire, and if they’re not renewed, you won’t be able to access services such as HDFS (YARN-941).

There’s an umbrella JIRA ticket that contains more details: https://issues.apache.org/jira/browse/YARN-896.

Even though Impala is a YARN application, it uses unmanaged containers and its own gang-scheduling mechanism to work around some of the issues with long-running applications. As a result, Cloudera created a project called Llama (http://cloudera.github.io/llama/), which mediates resource management between Impala and YARN to provide these features. Llama may be worth evaluating for your needs.

10.3.6. Security

YARN applications running on secure Hadoop clusters need to pass tokens to the ResourceManager that will be passed on to your application. These tokens are required to access services such as HDFS. Twill, detailed in the next section, provides support for secure Hadoop clusters.

This concludes our overview of additional capabilities that you may need in your YARN applications. Next up is a look at YARN programming abstractions, some of which implement the capabilities discussed in this section.

10.4. YARN programming abstractions

YARN exposes a low-level API and has a steep learning curve, especially if you need to support many of the features that were outlined in the previous section. There are a number of abstractions on top of YARN that simplify the development of YARN applications and help you focus on implementing your application logic without worrying about the mechanics of YARN. Some of these frameworks, such as Twill, also support more advanced capabilities, such as shipping logs to the YARN client and service discovery via ZooKeeper.

In this section I’ll provide a brief summary of three such abstractions: Apache Twill, Spring, and REEF.

10.4.1. Twill

Apache Twill (http://twill.incubator.apache.org/), formerly known as Weave, not only provides a rich and high-level programming abstraction, but also supports many features that you’ll likely require in your YARN application, such as service discovery, log shipping, and resiliency to failure.

The following code shows an example YARN client written in Twill. You’ll note that construction of the YarnTwillRunnerService requires a ZooKeeper connection URL, which is used to register the YARN application. Twill also supports shipping logs to the client (via Kafka), and here you’re adding a log handler to write the container and ApplicationMaster logs to standard output:
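What follows is a sketch based on Twill’s canonical example rather than a verbatim listing; the ZooKeeper connection string and the VmstatRunnable class are assumptions, and the exact service-lifecycle calls vary a little between Twill releases:

import java.io.PrintWriter;

import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.twill.api.TwillController;
import org.apache.twill.api.TwillRunnerService;
import org.apache.twill.api.logging.PrinterLogHandler;
import org.apache.twill.common.Services;
import org.apache.twill.yarn.YarnTwillRunnerService;

public class VmstatTwillClient {
  public static void main(String[] args) throws Exception {
    String zkConnect = "zkhost:2181";   // assumption: your ZooKeeper quorum

    // The runner talks to YARN and registers the application in ZooKeeper.
    TwillRunnerService runner =
        new YarnTwillRunnerService(new YarnConfiguration(), zkConnect);
    runner.start();   // on older Twill/Weave releases this is startAndWait()

    // Launch the runnable and ship its logs (via Kafka) back to this client.
    TwillController controller = runner.prepare(new VmstatRunnable())
        .addLogHandler(new PrinterLogHandler(new PrintWriter(System.out, true)))
        .start();

    // Block until the application terminates, then shut the runner down.
    Services.getCompletionFuture(controller).get();
    runner.stop();
  }
}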

Twill’s programming model uses well-known Java types such as Runnable to model container execution. The following code shows a container that launches the vmstat utility:
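A sketch of such a runnable, assuming the VmstatRunnable class referenced in the client above:

import org.apache.twill.api.AbstractTwillRunnable;

/**
 * Forks the vmstat command inside the container. Error handling is omitted
 * to keep the sketch short.
 */
public class VmstatRunnable extends AbstractTwillRunnable {

  @Override
  public void run() {
    try {
      // Fork vmstat; its output goes to the container's stdout/stderr, which
      // Twill can ship back to the client as log messages.
      Process process = new ProcessBuilder("vmstat").inheritIO().start();
      process.waitFor();
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
  }
}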

Figure 10.11 shows how Twill uses ZooKeeper and Kafka to support features such as log shipping and service discovery.

Figure 10.11. Twill features

You can get a detailed overview of Twill from Terence Yim’s “Harnessing the Power of YARN with Apache Twill” (http://www.slideshare.net/TerenceYim1/twill-apachecon-2014?ref=). Yim also has a couple of blog entries on programming with Twill (formerly Weave).[12]

12 Terence Yim, “Programming with Weave, Part I,” http://blog.continuuity.com/post/66694376303/programming-with-weave-part-i; “Programming with Apache Twill, Part II,” http://blog.continuuity.com/post/73969347586/programming-with-apache-twill-part-ii.

10.4.2. Spring

The 2.x release of Spring for Hadoop (http://projects.spring.io/spring-hadoop/) brings support for simplifying YARN development. It differs from Twill in that it focuses on abstracting the YARN API rather than providing application features; Twill, in contrast, offers log shipping and service discovery. But you may not want the added complexity that these features bring to Twill, and you may instead want more control over your YARN application; if so, Spring for Hadoop may be the better candidate.

Spring for Hadoop provides default implementations of a YARN client, ApplicationMaster, and container that can be overridden to provide application-specific functionality. You can actually write a YARN application without writing any code! The following example is from the Spring Hadoop samples, showing how you can configure a YARN application to run a remote command.[13] This first snippet shows the application context, which configures HDFS, YARN, and the application JARs:

13 “Spring Yarn Simple Command Example,” https://github.com/spring-projects/spring-hadoop-samples/tree/master/yarn/yarn/simple-command.

<beans ...>

  <context:property-placeholder location="hadoop.properties"
      system-properties-mode="OVERRIDE"/>

  <yarn:configuration>
    fs.defaultFS=${hd.fs}
    yarn.resourcemanager.address=${hd.rm}
    fs.hdfs.impl=org.apache.hadoop.hdfs.DistributedFileSystem
  </yarn:configuration>

  <yarn:localresources>
    <yarn:hdfs path="/app/simple-command/*.jar"/>
    <yarn:hdfs path="/lib/*.jar"/>
  </yarn:localresources>

  <yarn:environment>
    <yarn:classpath use-yarn-app-classpath="true"/>
  </yarn:environment>

  <util:properties id="arguments">
    <prop key="container-count">4</prop>
  </util:properties>

  <yarn:client app-name="simple-command">
    <yarn:master-runner arguments="arguments"/>
  </yarn:client>

</beans>

The following code defines the ApplicationMaster properties and tells it to run the vmstat command:

<beans ...>

  <context:property-placeholder location="hadoop.properties"/>

  <bean id="taskScheduler"
      class="org.springframework.scheduling.concurrent.ConcurrentTaskScheduler"/>

  <bean id="taskExecutor"
      class="org.springframework.core.task.SyncTaskExecutor"/>

  <yarn:configuration>
    fs.defaultFS=${SHDP_HD_FS}
    yarn.resourcemanager.address=${SHDP_HD_RM}
    yarn.resourcemanager.scheduler.address=${SHDP_HD_SCHEDULER}
  </yarn:configuration>

  <yarn:localresources>
    <yarn:hdfs path="/app/simple-command/*.jar"/>
    <yarn:hdfs path="/lib/*.jar"/>
  </yarn:localresources>

  <yarn:environment>
    <yarn:classpath use-yarn-app-classpath="true" delimiter=":">
      ./*
    </yarn:classpath>
  </yarn:environment>

  <yarn:master>
    <yarn:container-allocator/>
    <yarn:container-command>
      <![CDATA[
        vmstat
        1><LOG_DIR>/Container.stdout
        2><LOG_DIR>/Container.stderr
      ]]>
    </yarn:container-command>
  </yarn:master>

</beans>

The samples also include a look at how you can extend the client, ApplicationMaster, and container.[14]

14 Example of extending the Spring YARN classes: “Spring Yarn Custom Application Master Service Example,” https://github.com/spring-projects/spring-hadoop-samples/tree/master/yarn/yarn/custom-amservice.

You can find some sample Spring for Hadoop applications on GitHub (https://github.com/spring-projects/spring-hadoop-samples). There’s also a wiki for the project: https://github.com/spring-projects/spring-hadoop/wiki.

10.4.3. REEF

REEF is a framework from Microsoft that simplifies the development of scalable, fault-tolerant applications on a range of resource managers, including YARN and Mesos (www.reef-project.org/; https://github.com/Microsoft-CISL/REEF). REEF has some interesting capabilities, such as container reuse and data caching.

You can find a REEF tutorial on GitHub: https://github.com/Microsoft-CISL/REEF/wiki/How-to-download-and-compile-REEF.

10.4.4. Picking a YARN API abstraction

YARN abstractions are still in their early stages because YARN is a young technology. This section provided a brief overview of three abstractions that you could use to hide away some of the complexities of the YARN API. But which one should you pick for your application?

· Apache Twill looks the most promising, as it already encapsulates many of the features that you’ll need in your application. It has picked best-of-breed technologies such as Kafka and ZooKeeper to support these features.

· Spring for Hadoop may be a better fit if you’re developing a lightweight application and you don’t want a dependency on Kafka or ZooKeeper.

· REEF may be useful if you have some complex application requirements, such as the need to run on multiple execution frameworks, or if you need to support more complex container choreographies and state sharing across containers.

10.5. Chapter summary

This chapter showed you how to write a simple YARN application and then introduced you to some of the more advanced capabilities that you may need in your YARN applications. It also looked at some YARN abstractions that make it easier to write your applications. You’re now all set to go out and start writing the next big YARN application.

This concludes not only this chapter but the book as a whole! I hope you’ve enjoyed the journey and along the way have picked up some tips and tricks that you can employ in your Hadoop applications and environments. If you have any questions about items covered in this book, please head on over to Manning’s forum dedicated to this book and post a question.[15]

15 Manning forum for Hadoop in Practice: http://www.manning-sandbox.com/forum.jspa?forumID=901.