Hadoop in Practice, Second Edition (2015)
Appendix. Installing Hadoop and friends
This appendix contains instructions on how to install Hadoop and other tools that are used in the book.
Getting started quickly with Hadoop
The quickest way to get up and running with Hadoop is to download a preinstalled virtual machine from one of the Hadoop vendors. Following is a list of the popular VMs:
· Cloudera Quickstart VM—http://www.cloudera.com/content/cloudera-content/cloudera-docs/DemoVMs/Cloudera-QuickStart-VM/cloudera_quickstart_vm.html
· Hortonworks Sandbox—http://hortonworks.com/products/hortonworkssandbox/
· MapR Sandbox for Hadoop—http://doc.mapr.com/display/MapR/MapR+Sandbox+for+Hadoop
A.1. Code for the book
Before we get to the instructions for installing Hadoop, let’s get you set up with the code that accompanies this book. The code is hosted on GitHub at https://github.com/alexholmes/hiped2. To get you up and running quickly, there are prepackaged tarballs that don’t require you to build the code—just install and go.
Downloading
First you’ll need to download the most recent release of the code from https://github.com/alexholmes/hiped2/releases.
Installing
The second step is to unpackage the tarball into a directory of your choosing. For example, the following untars the code into /usr/local, the same directory where you’ll install Hadoop:
$ cd /usr/local
$ sudo tar -xzvf <download directory>/hip-<version>-package.tar.gz
Adding the home directory to your path
All the examples in the book assume that the home directory for the code is in your path. The methods for doing this differ by operating system and shell. If you’re on Linux using Bash, then the following should work (use of the single quotes for the second command is required to avoid variable substitution):
$ echo "export HIP_HOME=/usr/local/hip-<version>" >> ~/.bash_profile
$ echo 'export PATH=${PATH}:${HIP_HOME}/bin' >> ~/.bash_profile
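To pick up the new variables in your current shell, reload your profile and run a quick sanity check (a minimal check, assuming the installation path used above):
$ source ~/.bash_profile
$ echo ${HIP_HOME}
$ which hip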
Running an example job
You can run the following commands to test your installation. This assumes that you have a running Hadoop setup (if you don’t, please jump to section A.3):
# create two input files in HDFS
$ hadoop fs -mkdir -p hip/input
$ echo "cat sat mat" | hadoop fs -put - hip/input/1.txt
$ echo "dog lay mat" | hadoop fs -put - hip/input/2.txt
# run the inverted index example
$ hip hip.ch1.InvertedIndexJob --input hip/input --output hip/output
# examine the results in HDFS
$ hadoop fs -cat hip/output/part*
Downloading the sources and building
There are some techniques (such as Avro code generation) that require access to the full sources. First, check out the sources using git:
$ git clone git@github.com:alexholmes/hiped2.git
Set up your environment so that some techniques know where the source is installed:
$ echo "export HIP_SRC=<installation dir>/hiped2" >> ~/.bash_profile
You can build the project using Maven:
$ cd hiped2
$ mvn clean validate package
This generates a target/hip-<version>-package.tar.gz file, which is the same file that’s uploaded to GitHub when releases are made.
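If you want to run your freshly built package rather than a downloaded release, you can install it the same way as the release tarball (a sketch, assuming the HIP_SRC variable set earlier):
$ cd /usr/local
$ sudo tar -xzvf ${HIP_SRC}/target/hip-<version>-package.tar.gz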
A.2. Recommended Java versions
The Hadoop project keeps a list of recommended Java versions that have been proven to work well with Hadoop in production. For details, take a look at “Hadoop Java Versions” on the Hadoop Wiki at http://wiki.apache.org/hadoop/HadoopJavaVersions.
A.3. Hadoop
This section covers installing, configuring, and running the Apache distribution of Hadoop. Please refer to distribution-specific instructions if you’re working with a different distribution of Hadoop.
Apache tarball installation
The following instructions are for users who want to install the tarball version of the vanilla Apache Hadoop distribution. This is a pseudo-distributed setup, not a multi-node cluster installation.[1]
1 Pseudo-distributed mode is when you have all the Hadoop components running on a single host.
First you’ll need to download the tarball from the Apache downloads page at http://hadoop.apache.org/common/releases.html#Download and extract the tarball under /usr/local:
$ cd /usr/local
$ sudo tar -xzf <path-to-apache-tarball>
$ sudo ln -s hadoop-<version> hadoop
$ sudo chown -R <user>:<group> /usr/local/hadoop*
$ mkdir /usr/local/hadoop/tmp
Installation directory for users that don’t have root privileges
If you don’t have root permissions on your host, you can install Hadoop under a different directory and substitute instances of /usr/local in the following instructions with your directory name.
Configuration for pseudo-distributed mode for Hadoop 1 and earlier
The following instructions work for Hadoop version 1 and earlier. Skip to the next section if you’re working with Hadoop 2.
Edit the file /usr/local/hadoop/conf/core-site.xml and make sure it looks like the following:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/usr/local/hadoop/tmp</value>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:8020</value>
</property>
</configuration>
Then edit the file /usr/local/hadoop/conf/hdfs-site.xml and make sure it looks like the following:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<!-- specify this so that running 'hadoop namenode -format'
formats the right dir -->
<name>dfs.name.dir</name>
<value>/usr/local/hadoop/cache/hadoop/dfs/name</value>
</property>
</configuration>
Finally, edit the file /usr/local/hadoop/conf/mapred-site.xml and make sure it looks like the following (you may first need to copy mapred-site.xml.template to mapred-site.xml):
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:8021</value>
</property>
</configuration>
Configuration for pseudo-distributed mode for Hadoop 2
The following instructions work for Hadoop 2. See the previous section if you’re working with Hadoop version 1 and earlier.
Edit the file /usr/local/hadoop/etc/hadoop/core-site.xml and make sure it looks like the following:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/usr/local/hadoop/tmp</value>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:8020</value>
</property>
</configuration>
Then edit the file /usr/local/hadoop/etc/hadoop/hdfs-site.xml and make sure it looks like the following:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
Next, edit the file /usr/local/hadoop/etc/hadoop/mapred-site.xml and make sure it looks like the following (you may first need to copy mapred-site.xml.template to mapred-site.xml):
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
Finally, edit the file /usr/local/hadoop/etc/hadoop/yarn-site.xml and make sure it looks like the following:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
<description>Shuffle service that needs to be set for
Map Reduce to run.</description>
</property>
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
</property>
<property>
<name>yarn.log-aggregation.retain-seconds</name>
<value>2592000</value>
</property>
<property>
<name>yarn.log.server.url</name>
<value>http://0.0.0.0:19888/jobhistory/logs/</value>
</property>
<property>
<name>yarn.nodemanager.delete.debug-delay-sec</name>
<value>-1</value>
<description>Amount of time in seconds to wait before
deleting container resources.</description>
</property>
</configuration>
Set up SSH
Hadoop uses Secure Shell (SSH) to remotely launch processes such as the DataNode and TaskTracker, even when everything is running on a single node in pseudo-distributed mode. If you don’t already have an SSH key pair, create one with the following command:
$ ssh-keygen -b 2048 -t rsa
You’ll need to copy the public key, .ssh/id_rsa.pub, into the authorized_keys file:
$ cp ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys
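SSH is picky about file permissions, so you may also need to lock down your .ssh directory and the authorized_keys file:
$ chmod 700 ~/.ssh
$ chmod 600 ~/.ssh/authorized_keys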
You’ll also need an SSH agent running so that you aren’t prompted to enter your password a bazillion times when starting and stopping Hadoop. Different operating systems have different ways of running an SSH agent, and there are details online for CentOS and other Red Hat derivatives[2] and for OS X.[3] Google is your friend if you’re running on a different system.
2 See the Red Hat Deployment Guide section on “Configuring ssh-agent” at www.centos.org/docs/5/html/5.2/Deployment_Guide/s3-openssh-config-ssh-agent.html.
3 See “Using SSH Agent With Mac OS X Leopard” at www-uxsup.csx.cam.ac.uk/~aia21/osx/leopard-ssh.html.
To verify that the agent is running and has your keys loaded, try opening an SSH connection to the local system:
$ ssh 127.0.0.1
If you’re prompted for a password, the agent’s not running or doesn’t have your keys loaded.
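As an example, the following starts an agent in your current shell and loads your key on most Linux systems (your OS may already provide an agent, so treat this as a fallback):
$ eval "$(ssh-agent -s)"
$ ssh-add ~/.ssh/id_rsa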
Java
You need a current version of Java (1.6 or newer) installed on your system. You’ll need to ensure that the system path includes the binary directory of your Java installation. Alternatively, you can edit /usr/local/hadoop/conf/hadoop-env.sh, uncomment the JAVA_HOME line, and update the value with the location of your Java installation.
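For example, the uncommented line might look like the following (the /usr/java/latest path is only an illustration; point it at your own JDK, and note that on Hadoop 2 the file lives under /usr/local/hadoop/etc/hadoop instead):
# in hadoop-env.sh
export JAVA_HOME=/usr/java/latest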
Environment settings
For convenience, it’s recommended that you add the Hadoop binary directory to your path. The following code shows what you can add to the bottom of your Bash shell profile file in ~/.bash_profile (assuming you’re running Bash):
HADOOP_HOME=/usr/local/hadoop
PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export PATH
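Reload your profile and confirm that the hadoop command resolves from the new path:
$ source ~/.bash_profile
$ hadoop version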
Format HDFS
Next you need to format HDFS. The rest of the commands in this section assume that the Hadoop binary directory exists in your path, as per the preceding instructions. On Hadoop 1 and earlier, type
$ hadoop namenode -format
On Hadoop versions 2 and newer, type
$ hdfs namenode -format
After HDFS has been formatted, you’re ready to start Hadoop.
Starting Hadoop 1 and earlier
A single command can be used to start Hadoop on versions 1 and earlier:
$ start-all.sh
After running the start script, use the jps Java utility to check that all the processes are running. You should see the following output (with the exception of the process IDs, which will be different):
$ jps
23836 JobTracker
23475 NameNode
23982 TaskTracker
23619 DataNode
24024 Jps
23756 SecondaryNameNode
If any of these processes aren’t running, check the logs directory (/usr/local/hadoop/logs) to see why the processes didn’t start correctly. Each of the preceding processes has two output files that can be identified by name and should be checked for errors.
The most common error is that the HDFS formatting step, which I showed earlier, was skipped.
Starting Hadoop 2
The following commands are required to start Hadoop version 2:
$ yarn-daemon.sh start resourcemanager
$ yarn-daemon.sh start nodemanager
$ hadoop-daemon.sh start namenode
$ hadoop-daemon.sh start datanode
$ mr-jobhistory-daemon.sh start historyserver
After running the start script, use the jps Java utility to check that all the processes are running. You should see the output that follows, although the ordering and process IDs will differ:
$ jps
32542 NameNode
1085 Jps
32131 ResourceManager
32613 DataNode
32358 NodeManager
1030 JobHistoryServer
If any of these processes aren’t running, check the logs directory (/usr/local/hadoop/logs) to see why the processes didn’t start correctly. Each of the preceding processes has two output files that can be identified by name and should be checked for errors. The most common error is that the HDFS formatting step, which I showed earlier, was skipped.
Creating a home directory for your user on HDFS
Once Hadoop is up and running, the first thing you’ll want to do is create a home directory for your user. If you’re running on Hadoop 1, the command is
$ hadoop fs -mkdir /user/<your-linux-username>
On Hadoop 2, you’ll run
$ hdfs dfs -mkdir -p /user/<your-linux-username>
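You can confirm that the directory exists by listing /user (Hadoop 1 syntax shown; substitute hdfs dfs on Hadoop 2):
$ hadoop fs -ls /user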
Verifying the installation
The following commands can be used to test your Hadoop installation. The first two commands create a directory and a file in HDFS:
$ hadoop fs -mkdir /tmp
$ echo "the cat sat on the mat" | hadoop fs -put - /tmp/input.txt
Next you want to run a word-count MapReduce job. On Hadoop 1 and earlier, run the following:
$ hadoop jar /usr/local/hadoop/*-examples*.jar wordcount \
/tmp/input.txt /tmp/output
On Hadoop 2, run the following:
$ hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/*-examples*.jar \
wordcount /tmp/input.txt /tmp/output
Examine and verify the MapReduce job outputs on HDFS (the counts correspond to the input file you created earlier):
$ hadoop fs -cat /tmp/output/part*
cat 1
mat 1
on 1
sat 1
the 2
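If you want to rerun the job, first delete the output directory, because MapReduce won’t overwrite existing output (Hadoop 2 syntax shown; on Hadoop 1 use hadoop fs -rmr):
$ hadoop fs -rm -r /tmp/output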
Stopping Hadoop 1
To stop Hadoop 1, use the following command:
$ stop-all.sh
Stopping Hadoop 2
To stop Hadoop 2, use the following commands:
$ mr-jobhistory-daemon.sh stop historyserver
$ hadoop-daemon.sh stop datanode
$ hadoop-daemon.sh stop namenode
$ yarn-daemon.sh stop nodemanager
$ yarn-daemon.sh stop resourcemanager
Just as with starting, the jps command can be used to verify that all the Hadoop processes have stopped.
Hadoop 1.x UI ports
There are a number of web applications in Hadoop. Table A.1 lists them, along with the ports they run on and their URLs (assuming they’re running on the local host, as is the case if you have a pseudo-distributed installation running).
Table A.1. Hadoop 1.x web applications and ports
Component | Default port | Config parameter | Local URL
MapReduce JobTracker | 50030 | mapred.job.tracker.http.address | http://127.0.0.1:50030/
MapReduce TaskTracker | 50060 | mapred.task.tracker.http.address | http://127.0.0.1:50060/
HDFS NameNode | 50070 | dfs.http.address | http://127.0.0.1:50070/
HDFS DataNode | 50075 | dfs.datanode.http.address | http://127.0.0.1:50075/
HDFS SecondaryNameNode | 50090 | dfs.secondary.http.address | http://127.0.0.1:50090/
HDFS Backup and Checkpoint Node | 50105 | dfs.backup.http.address | http://127.0.0.1:50105/
Each of these URLs supports the following common paths:
· /logs —This shows a listing of all the files under hadoop.log.dir. By default, this is under $HADOOP_HOME/logs on each Hadoop node.
· /logLevel —This can be used to view and set the logging levels for Java packages.
· /metrics —This shows JVM and component-level statistics. It’s available in Hadoop 0.21 and newer (not in 1.0, 0.20.x, or earlier).
· /stacks —This shows a stack dump of all the current Java threads in the daemon.
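For example, you can hit these paths with curl against the NameNode UI port (a quick illustration; the other daemons in table A.1 respond in the same way):
# dump the NameNode's current thread stacks
$ curl http://127.0.0.1:50070/stacks
# view the log level for a package (append &level=DEBUG to change it)
$ curl 'http://127.0.0.1:50070/logLevel?log=org.apache.hadoop.hdfs'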
Hadoop 2.x UI ports
There are a number of web applications in Hadoop. Table A.2 lists them, including the ports that they run on and their URLs (assuming they’re running on the local host, as is the case if you have a pseudo-distributed installation running).
Table A.2. Hadoop 2.x web applications and ports
Component | Default port | Config parameter | Local URL
YARN ResourceManager | 8088 | yarn.resourcemanager.webapp.address | http://localhost:8088/cluster
YARN NodeManager | 8042 | yarn.nodemanager.webapp.address | http://localhost:8042/node
MapReduce Job History | 19888 | mapreduce.jobhistory.webapp.address | http://localhost:19888/jobhistory
HDFS NameNode | 50070 | dfs.http.address | http://127.0.0.1:50070/
HDFS DataNode | 50075 | dfs.datanode.http.address | http://127.0.0.1:50075/
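The YARN ResourceManager also exposes a REST API on the same port, which is handy for a quick smoke test of the pseudo-distributed setup described earlier:
$ curl http://localhost:8088/ws/v1/cluster/info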
A.4. Flume
Flume is a log collection and distribution system that can transport data across a large number of hosts into HDFS. It’s an Apache project originally developed by Cloudera.
Chapter 5 contains a section on Flume and how it can be used.
Getting more information
Table A.3 lists some useful resources to help you become more familiar with Flume.
Table A.3. Useful resources
Resource | URL
Flume main page | http://flume.apache.org/
Flume user guide | http://flume.apache.org/FlumeUserGuide.html
Flume Getting Started guide | https://cwiki.apache.org/confluence/display/FLUME/Getting+Started
Installation on Apache Hadoop 1.x systems
Follow the Getting Started guide referenced in the resources.
Installation on Apache Hadoop 2.x systems
If you’re trying to get Flume 1.4 to work with Hadoop 2, follow the Getting Started guide to install Flume. Next, you’ll need to remove the protobuf and guava JARs from Flume’s lib directory because they conflict with the versions bundled with Hadoop 2:
$ mv ${flume_bin}/lib/{protobuf-java-2.4.1.jar,guava-10.0.1.jar} ~/
A.5. Oozie
Oozie is an Apache project that started life inside Yahoo. It’s a Hadoop workflow engine that manages data processing activities.
Getting more information
Table A.4 lists some useful resources to help you become more familiar with Oozie.
Table A.4. Useful resources
Resource | URL
Oozie project page | https://oozie.apache.org/
Oozie Quick Start | https://oozie.apache.org/docs/4.0.0/DG_QuickStart.html
Additional Oozie resources | https://oozie.apache.org/docs/4.0.0/index.html
Installation on Hadoop 1.x systems
Follow the Oozie Quick Start guide, which contains the installation instructions.
Installation on Hadoop 2.x systems
Unfortunately, Oozie 4.0.0 doesn’t play nicely with Hadoop 2. To get Oozie working with Hadoop 2, you’ll first need to download the 4.0.0 tarball from the project page and unpackage it. Next, run the following command to change the Hadoop version being targeted:
$ cd oozie-4.0.0/
$ find . -name pom.xml | xargs sed -ri 's/(2.2.0\-SNAPSHOT)/2.2.0/'
Now all you need to do is target the hadoop-2 profile in Maven:
$ mvn -DskipTests=true -P hadoop-2 clean package assembly:single
A.6. Sqoop
Sqoop is a tool for importing data from relational databases into Hadoop and vice versa. It can support any JDBC-compliant database, and it also has native connectors for efficient data transport to and from MySQL and PostgreSQL.
Chapter 5 contains details on how imports and exports can be performed with Sqoop.
Getting more information
Table A.5 lists some useful resources to help you become more familiar with Sqoop.
Table A.5. Useful resources
Resource | URL
Sqoop project page | http://sqoop.apache.org/
Sqoop User Guide | http://sqoop.apache.org/docs/1.4.4/SqoopUserGuide.html
Installation
Download the Sqoop tarball from the project page. Pick the version that matches your Hadoop installation and explode the tarball. The following instructions assume that you’re installing under /usr/local:
$ sudo tar -xzf \
sqoop-<version>.bin.hadoop-<hadoop-version>.tar.gz \
-C /usr/local/
$ ln -s /usr/local/sqoop-<version> /usr/local/sqoop
Sqoop 2
This book currently covers Sqoop version 1. When selecting which tarball to download, please note that version 1.99.x and newer are the Sqoop 2 versions, so be sure to pick an older version.
If you’re planning on using Sqoop with MySQL, you’ll need to download the MySQL JDBC driver tarball from http://dev.mysql.com/downloads/connector/j/, explode it into a directory, and then copy the JAR file into the Sqoop lib directory:
$ tar -xzf mysql-connector-java-<version>.tar.gz
$ cd mysql-connector-java-<version>
$ sudo cp mysql-connector-java-<version>-bin.jar \
/usr/local/sqoop/lib
To run Sqoop, there are a few environment variables that you may need to set. They’re listed in table A.6.
Table A.6. Sqoop environment variables
Environment variable | Description
JAVA_HOME | The directory where Java is installed. If you have the Sun JDK installed on Red Hat, this would be /usr/java/latest.
HADOOP_HOME | The directory of your Hadoop installation.
HIVE_HOME | Only required if you’re planning on using Hive with Sqoop. Refers to the directory where Hive was installed.
HBASE_HOME | Only required if you’re planning on using HBase with Sqoop. Refers to the directory where HBase was installed.
The /usr/local/sqoop/bin directory contains the binaries for Sqoop. Chapter 5 contains a number of techniques that show how the binaries are used for imports and exports.
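As a quick smoke test of your Sqoop installation, you can list the databases on a MySQL server (a sketch; the localhost connection string and <mysql-user> placeholder are examples you’ll need to adapt):
$ /usr/local/sqoop/bin/sqoop list-databases \
    --connect jdbc:mysql://localhost/ \
    --username <mysql-user> -P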
A.7. HBase
HBase is a real-time, key/value, distributed, column-based database modeled after Google’s BigTable.
Getting more information
Table A.7 lists some useful resources to help you become more familiar with HBase.
Table A.7. Useful resources
Resource | URL
Apache HBase project page | http://hbase.apache.org/
Apache HBase Quick Start | http://hbase.apache.org/book/quickstart.html
Apache HBase Reference Guide | http://hbase.apache.org/book/book.html
Cloudera blog post on HBase Dos and Don’ts | http://blog.cloudera.com/blog/2011/04/hbase-dos-and-donts/
Installation
Follow the installation instructions in the Quick Start guide at https://hbase.apache.org/book/quickstart.html.
A.8. Kafka
Kafka is a publish/subscribe messaging system built by LinkedIn.
Getting more information
Table A.8 lists some useful resources to help you become more familiar with Kafka.
Table A.8. Useful resources
Resource | URL
Kafka project page | http://kafka.apache.org/
Kafka documentation | http://kafka.apache.org/documentation.html
Kafka Quick Start | http://kafka.apache.org/08/quickstart.html
Installation
Follow the installation instructions in the Quick Start guide.
A.9. Camus
Camus is a tool for importing data from Kafka into Hadoop.
Getting more information
Table A.9 lists some useful resources to help you become more familiar with Camus.
Table A.9. Useful resources
Resource | URL
Camus project page | https://github.com/linkedin/camus
Camus Overview | https://github.com/linkedin/camus/wiki/Camus-Overview
Installation on Hadoop 1
Download the code from the 0.8 branch in GitHub, and run the following command to build it:
$ mvn clean package
Installation on Hadoop 2
At the time of writing, the 0.8 version of Camus doesn’t support Hadoop 2. You have a couple of options to get it working—if you’re just experimenting with Camus, you can download a patched version of the code from my GitHub project. Alternatively, you can patch the Maven build files.
Using my patched GitHub project
Download my cloned and patched version of Camus from GitHub and build it just as you would the Hadoop 1 version:
$ wget https://github.com/alexholmes/camus/archive/camus-kafka-0.8.zip
$ unzip camus-kafka-0.8.zip
$ cd camus-camus-kafka-0.8
$ mvn clean package
Patching the Maven build files
If you want to patch the original Camus files, you can do that by taking a look at the patch I applied to my own clone: https://mng.bz/Q8GV.
A.10. Avro
Avro is a data serialization system that provides features such as compression, schema evolution, and code generation. It can be viewed as a more sophisticated version of a SequenceFile.
Chapter 3 contains details on how Avro can be used in MapReduce as well as with basic input/output streams.
Getting more information
Table A.10 lists some useful resources to help you become more familiar with Avro.
Table A.10. Useful resources
Resource | URL
Avro project page | http://avro.apache.org/
Avro issue tracking page | https://issues.apache.org/jira/browse/AVRO
Cloudera blog about Avro use | http://blog.cloudera.com/blog/2011/12/apache-avro-at-richrelevance/
CDH usage page for Avro | http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH5/5.0/CDH5-Installation-Guide/cdh5ig_avro_usage.html
Installation
Avro is a full-fledged Apache project, so you can download the binaries from the downloads link on the Apache project page.
A.11. Apache Thrift
Apache Thrift is essentially Facebook’s version of Protocol Buffers. It offers very similar data-serialization and RPC capabilities. In this book, I use it with Elephant Bird to support Thrift in MapReduce. Elephant Bird currently works with Thrift version 0.7.
Getting more information
Thrift documentation is lacking, something which the project page attests to. Table A.11 lists some useful resources to help you become more familiar with Thrift.
Table A.11. Useful resources
Resource | URL
Thrift project page | http://thrift.apache.org/
Blog post with a Thrift tutorial | http://bit.ly/vXpZ0z
Building Thrift 0.7
To build Thrift, download the 0.7 tarball and extract the contents. You may need to install some Thrift dependencies:
$ sudo yum install automake libtool flex bison pkgconfig gcc-c++ \
boost-devel libevent-devel zlib-devel python-devel \
ruby-devel php53.x86_64 php53-devel.x86_64 openssl-devel
Build and install the native and Java/Python libraries and binaries:
$ ./configure
$ make
$ make check
$ sudo make install
Build the Java library. This step requires Ant to be installed, instructions for which are available in the Apache Ant Manual at http://ant.apache.org/manual/index.html:
$ cd lib/java
$ ant
Copy the Java JAR into Hadoop’s lib directory. The following instructions are for CDH:
# replace the following path with your actual
# Hadoop installation directory
#
# the following is the CDH Hadoop home dir
#
$ export HADOOP_HOME=/usr/lib/hadoop
$ cp lib/java/libthrift.jar $HADOOP_HOME/lib/
A.12. Protocol Buffers
Protocol Buffers is Google’s data serialization and Remote Procedure Call (RPC) library, which is used extensively at Google. In this book, we’ll use it in conjunction with Elephant Bird and Rhipe. Elephant Bird requires version 2.3.0 of Protocol Buffers (and won’t work with any other version), and Rhipe only works with Protocol Buffers version 2.4.0 and newer.
Getting more information
Table A.12 lists some useful resources to help you become more familiar with Protocol Buffers.
Table A.12. Useful resources
Resource | URL
Protocol Buffers project page | http://code.google.com/p/protobuf/
Protocol Buffers Developer Guide | https://developers.google.com/protocol-buffers/docs/overview?csw=1
Protocol Buffers downloads page, containing a link for version 2.3.0 (required for use with Elephant Bird) | http://code.google.com/p/protobuf/downloads/list
Building Protocol Buffers
To build Protocol Buffers, download the 2.3 or 2.4 (2.3 for Elephant Bird and 2.4 for Rhipe) source tarball from http://code.google.com/p/protobuf/downloads and extract the contents.
You’ll need a C++ compiler, which can be installed on 64-bit RHEL systems with the following command:
$ sudo yum install gcc-c++.x86_64
Build and install the native libraries and binaries:
$ cd protobuf-<version>/
$ ./configure
$ make
$ make check
$ sudo make install
Build the Java library:
$ cd java
$ mvn package install
Copy the Java JAR into Hadoop’s lib directory. The following instructions are for CDH:
# replace the following path with your actual
# Hadoop installation directory
#
# the following is the CDH Hadoop home dir
#
$ export HADOOP_HOME=/usr/lib/hadoop
$ cp target/protobuf-java-2.3.0.jar $HADOOP_HOME/lib/
A.13. Snappy
Snappy is a native compression codec developed by Google that offers fast compression and decompression times. Unlike LZOP, Snappy-compressed files can’t be split. The book’s code examples don’t require splittable compression, so we’ll use Snappy because of its time efficiency.
Snappy has been integrated into the Apache distribution of Hadoop since version 1.0.2, and it’s also bundled with Hadoop 2.
Getting more information
Table A.13 lists some useful resources to help you become more familiar with Snappy.
Table A.13. Useful resources
Resource | URL
Google’s Snappy project page | http://code.google.com/p/snappy/
Snappy integration with Hadoop | http://code.google.com/p/hadoop-snappy/
A.14. LZOP
LZOP is a compression codec that can be used to support splittable compression in MapReduce. Chapter 4 has a section dedicated to working with LZOP. In this section we’ll cover how to build and set up your cluster to work with LZOP.
Getting more information
Table A.14 shows a useful resource to help you become more familiar with LZOP.
Table A.14. Useful resource
Resource | URL
Hadoop LZO project maintained by Twitter | https://github.com/twitter/hadoop-lzo
Building LZOP
The following steps walk you through the process of configuring LZOP compression. Before you do this, there are a few things to consider:
· It’s highly recommended that you build the libraries on the same hardware that you have deployed in production.
· All of the installation and configuration steps will need to be performed on any client hosts that will be using LZOP, as well as all the DataNodes in your cluster.
· These steps are for Apache Hadoop distributions. Please refer to distribution-specific instructions if you’re using a different distribution.
Twitter’s LZO project page has instructions on how to download dependencies and build the project. Follow the Building and Configuring section on the project home page.
Configuring Hadoop
You need to configure Hadoop core to be aware of your new compression codecs. Add the following properties to your core-site.xml. The io.compression.codecs value is split across lines here for readability; make sure you remove the newlines and spaces so that there are no whitespace characters between the comma-separated codec class names:
<property>
<name>mapred.compress.map.output</name>
<value>true</value>
</property>
<property>
<name>mapred.map.output.compression.codec</name>
<value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
<property>
<name>io.compression.codecs</name>
<value>org.apache.hadoop.io.compress.GzipCodec,
org.apache.hadoop.io.compress.DefaultCodec,
org.apache.hadoop.io.compress.BZip2Codec,
com.hadoop.compression.lzo.LzoCodec,
com.hadoop.compression.lzo.LzopCodec,
org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
<property>
<name>io.compression.codec.lzo.class</name>
<value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
The value for io.compression.codecs assumes that you have the Snappy compression codec already installed. If you don’t, remove org.apache.hadoop.io.compress.SnappyCodec from the value.
A.15. Elephant Bird
Elephant Bird is a project that provides utilities for working with LZOP-compressed data. It also provides a container format that supports working with Protocol Buffers and Thrift in MapReduce.
Getting more information
Table A.15 shows a useful resource to help you become more familiar with Elephant Bird.
Table A.15. Useful resource
Resource | URL
Elephant Bird project page | https://github.com/kevinweil/elephant-bird
At the time of writing, the current version of Elephant Bird (4.4) doesn’t work with Hadoop 2 due to its use of an incompatible version of Protocol Buffers. To get Elephant Bird working for this book, I had to build the project from trunk, which supports Hadoop 2 (as will version 4.5 when it’s released).
A.16. Hive
Hive is a SQL interface on top of Hadoop.
Getting more information
Table A.16 lists some useful resources to help you become more familiar with Hive.
Table A.16. Useful resources
Resource | URL
Hive project page | http://hive.apache.org/
Getting Started | https://cwiki.apache.org/confluence/display/Hive/GettingStarted
Installation
Follow the installation instructions in Hive’s Getting Started guide.
A.17. R
R is an open source tool for statistical programming and graphics.
Getting more information
Table A.17 lists some useful resources to help you become more familiar with R.
Table A.17. Useful resources
Resource | URL
R project page | http://www.r-project.org/
R function search engine | http://rseek.org/
Installation on Red Hat–based systems
Installing R from Yum makes things easy: it will figure out RPM dependencies and install them for you.
Go to http://www.r-project.org/, click on CRAN, select a download region that’s close to you, select Red Hat, and pick the version and architecture appropriate for your system. Replace the URL in baseurl in the following code and execute the command to add the R mirror repo to your Yum configuration:
$ sudo -s
$ cat << EOF > /etc/yum.repos.d/r.repo
# R-Statistical Computing
[R]
name=R-Statistics
baseurl=http://cran.mirrors.hoobly.com/bin/linux/redhat/el5/x86_64/
enabled=1
gpgcheck=0
EOF
A simple Yum command can be used to install R on 64-bit systems:
$ sudo yum install R.x86_64
Perl-File-Copy-Recursive RPM
On CentOS, the Yum install may fail, complaining about a missing dependency. In this case, you may need to manually install the perl-File-Copy-Recursive RPM (for CentOS you can get it from http://mng.bz/n4C2).
Installation on non–Red Hat systems
Go to http://www.r-project.org/, click on CRAN, select a download region that’s close to you, and select the appropriate binaries for your system.
A.18. RHadoop
RHadoop is an open source tool developed by Revolution Analytics for integrating R with MapReduce.
Getting more information
Table A.18 lists some useful resources to help you become more familiar with RHadoop.
Table A.18. Useful resources
Resource | URL
RHadoop project page | https://github.com/RevolutionAnalytics/RHadoop/wiki
RHadoop downloads and prerequisites | https://github.com/RevolutionAnalytics/RHadoop/wiki/Downloads
rmr/rhdfs installation
Each node in your Hadoop cluster will require the following components:
· R (installation instructions are in section A.17).
· A number of RHadoop and dependency packages
RHadoop requires environment variables that point to the Hadoop binary and the Hadoop Streaming JAR. It’s best to stash these exports in your .bash_profile (or equivalent):
$ export HADOOP_CMD=/usr/local/hadoop/bin/hadoop
$ export HADOOP_STREAMING=${HADOOP_HOME}/share/hadoop/tools/lib/hadoop-streaming-<version>.jar
We’ll focus on the rmr and rhdfs RHadoop packages, which provide MapReduce and HDFS integration with R. Click on the rmr and rhdfs download links on https://github.com/RevolutionAnalytics/RHadoop/wiki/Downloads. Then execute the following commands:
$ sudo -s
$ yum install -y libcurl-devel java-1.7.0-openjdk-devel
$ export HADOOP_CMD=/usr/bin/hadoop
$ R CMD javareconf
$ R
> install.packages(c('rJava'),
                   repos='http://cran.revolutionanalytics.com')
> install.packages(c('RJSONIO', 'itertools', 'digest', 'Rcpp', 'httr',
                     'functional', 'devtools', 'reshape2', 'plyr', 'caTools'),
                   repos='http://cran.revolutionanalytics.com')
> q()
$ R CMD INSTALL rmr2_<version>.tar.gz
$ R CMD INSTALL rhdfs_<version>.tar.gz
If you get an error installing rJava, you may need to set JAVA_HOME and reconfigure R prior to running the rJava installation:
$ sudo -s
$ export JAVA_HOME=/usr/java/latest
$ R CMD javareconf
$ R
> install.packages("rJava")
Test that the rmr package was installed correctly by running the following command—if no error messages are generated, this means you have successfully installed the RHadoop packages.
$ R
> library(rmr2)
A.19. Mahout
Mahout is a predictive analytics project that offers both in-JVM and MapReduce implementations for some of its algorithms.
Getting more information
Table A.19 lists some useful resources to help you become more familiar with Mahout.
Table A.19. Useful resources
Resource | URL
Mahout project page | http://mahout.apache.org/
Mahout downloads | https://cwiki.apache.org/confluence/display/MAHOUT/Downloads
Installation
Mahout should be installed on a node that has access to your Hadoop cluster. Mahout is a client-side library and doesn’t need to be installed on your Hadoop cluster.
Building a Mahout distribution
To get Mahout working with Hadoop 2, I had to check out the code, modify the build file, and then build a distribution. The first step is to check out the code:
$ git clone https://github.com/apache/mahout.git
$ cd mahout
Next you need to modify pom.xml and remove the following section from the file:
<plugin>
<inherited>true</inherited>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-gpg-plugin</artifactId>
<version>1.4</version>
<executions>
<execution>
<goals>
<goal>sign</goal>
</goals>
</execution>
</executions>
</plugin>
Finally, build a distribution:
$ mvn -Dhadoop2.version=2.2.0 -DskipTests clean package -Prelease
This will generate a tarball located at distribution/target/mahout-distribution-1.0-SNAPSHOT.tar.gz, which you can install using the instructions in the next section.
Installing Mahout
Mahout is packaged as a tarball. The following instructions will work on most Linux operating systems.
If you’re installing an official Mahout release, click on the “official release” links on the Mahout download page and select the current release. If Mahout 1 hasn’t yet been released and you want to use Mahout with Hadoop 2, follow the instructions in the previous section to generate the tarball.
Install Mahout using the following instructions:
$ cd /usr/local
$ sudo tar -xzf <path-to-mahout-tarball>
$ sudo ln -s mahout-distribution-<version> mahout
$ sudo chown -R <user>:<group> /usr/local/mahout*
For convenience, it’s worthwhile updating your ~/.bash_profile to export a MAHOUT_HOME environment variable to your installation directory. The following command shows how this can be performed on the command line (the same command can be copied into your bash profile file):
$ export MAHOUT_HOME=/usr/local/mahout
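You can then verify the setup by reloading your profile and invoking the Mahout launcher with no arguments, which should print the list of available programs (a minimal check, assuming the layout above):
$ source ~/.bash_profile
$ ${MAHOUT_HOME}/bin/mahout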