F. HDFS Quick Reference

This appendix is intended for readers who have little or no experience with the Hadoop Distributed File System (HDFS). The following discussion provides minimal background on a few commands that will help get you started with Apache Hadoop HDFS. It is not a full description of HDFS, and it omits many important commands and features. In addition to this quick reference, you are strongly advised to consult these two resources:

- http://hadoop.apache.org/docs/stable1/hdfs_design.html
- http://developer.yahoo.com/hadoop/tutorial/module2.html

The following is a quick command reference that may help you get started with HDFS. Be aware that each command has additional options and that the examples given here are simple use cases.

Quick Command Reference

To interact with HDFS, you must use the hdfs command. The following options are available. Only a few of these will be demonstrated here.

Usage: hdfs [--config confdir] COMMAND
       where COMMAND is one of:
  dfs                  run a file system command on the file systems supported in Hadoop.
  namenode -format     format the DFS file system
  secondarynamenode    run the DFS secondary namenode
  namenode             run the DFS namenode
  journalnode          run the DFS journalnode
  zkfc                 run the ZK Failover Controller daemon
  datanode             run a DFS datanode
  dfsadmin             run a DFS admin client
  haadmin              run a DFS HA admin client
  fsck                 run a DFS filesystem checking utility
  balancer             run a cluster balancing utility
  jmxget               get JMX exported values from NameNode or DataNode
  oiv                  apply the offline fsimage viewer to an fsimage
  oev                  apply the offline edits viewer to an edits file
  fetchdt              fetch a delegation token from the NameNode
  getconf              get config values from configuration
  groups               get the groups which users belong to
  snapshotDiff         diff two snapshots of a directory or diff the
                       current directory contents with a snapshot
  lsSnapshottableDir   list all snapshottable dirs owned by the current user
                       Use -help to see options
  portmap              run a portmap service
  nfs3                 run an NFS version 3 gateway

Most commands print help when invoked w/o parameters.
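
For example, to print the usage of a single dfs subcommand, pass -help (or -usage) followed by the command name:

$ hdfs dfs -help ls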

Starting HDFS and the HDFS Web GUI

HDFS must be started and running on the cluster before it can be used. See Chapter 5, “Installing Apache Hadoop YARN,” for information on how to start and verify HDFS on your cluster.

Get an HDFS Status Report

A status report, similar to what is summarized on the web GUI, can be obtained by entering the following command (the output is truncated here).

$ hdfs dfsadmin -report

Configured Capacity: 747576360960 (696.23 GB)
Present Capacity: 675846991872 (629.43 GB)
DFS Remaining: 302179352576 (281.43 GB)
DFS Used: 373667639296 (348.01 GB)
DFS Used%: 55.29%
Under replicated blocks: 13
Blocks with corrupt replicas: 0
Missing blocks: 0

-------------------------------------------------
Datanodes available: 4 (4 total, 0 dead)

Live datanodes:
.
.
.
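
A quick capacity summary is also available directly from the dfs command; the -h flag prints human-readable sizes:

$ hdfs dfs -df -h /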

Perform an FSCK on HDFS

The health of HDFS can be checked by using the fsck (file system check) option.

$ hdfs fsck /

Connecting to namenode via http://headnode:50070
FSCK started by hdfs (auth:SIMPLE) from /10.0.0.1 for path / at Fri Jan 03 16:32:16 EST 2014
Status: HEALTHY
Total size: 110594648065 B
Total dirs: 311
Total files: 528
Total symlinks: 0
Total blocks (validated): 1341 (avg. block size 82471773 B)
Minimally replicated blocks: 1341 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 13 (0.9694258 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 3
Average block replication: 2.9888144
Corrupt blocks: 0
Missing replicas: 78 (1.9089574 %)
Number of data-nodes: 4
Number of racks: 1
FSCK ended at Fri Jan 03 16:32:16 EST 2014 in 74 milliseconds
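
For more detail, fsck accepts additional flags. For example, the following command also lists every file under the given path, its blocks, and the DataNodes holding each block:

$ hdfs fsck /user/doug -files -blocks -locations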

General HDFS Commands

HDFS provides a series of commands similar to those found in a standard POSIX file system. A list of those commands can be obtained by issuing the following command. A few of these commands will be highlighted here.

$ hdfs dfs

Usage: hadoop fs [generic options]
[-appendToFile <localsrc> ... <dst>]
[-cat [-ignoreCrc] <src> ...]
[-checksum <src> ...]
[-chgrp [-R] GROUP PATH...]
[-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...]
[-chown [-R] [OWNER][:[GROUP]] PATH...]
[-copyFromLocal [-f] [-p] <localsrc> ... <dst>]
[-copyToLocal [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
[-count [-q] <path> ...]
[-cp [-f] [-p] <src> ... <dst>]
[-createSnapshot <snapshotDir> [<snapshotName>]]
[-deleteSnapshot <snapshotDir> <snapshotName>]
[-df [-h] [<path> ...]]
[-du [-s] [-h] <path> ...]
[-expunge]
[-get [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
[-getmerge [-nl] <src> <localdst>]
[-help [cmd ...]]
[-ls [-d] [-h] [-R] [<path> ...]]
[-mkdir [-p] <path> ...]
[-moveFromLocal <localsrc> ... <dst>]
[-moveToLocal <src> <localdst>]
[-mv <src> ... <dst>]
[-put [-f] [-p] <localsrc> ... <dst>]
[-renameSnapshot <snapshotDir> <oldName> <newName>]
[-rm [-f] [-r|-R] [-skipTrash] <src> ...]
[-rmdir [--ignore-fail-on-non-empty] <dir> ...]
[-setrep [-R] [-w] <rep> <path> ...]
[-stat [format] <path> ...]
[-tail [-f] <file>]
[-test -[defsz] <path>]
[-text [-ignoreCrc] <src> ...]
[-touchz <path> ...]
[-usage [cmd ...]]

Generic options supported are
  -conf <configuration file>                    specify an application configuration file
  -D <property=value>                           use value for given property
  -fs <local|namenode:port>                     specify a namenode
  -jt <local|jobtracker:port>                   specify a job tracker
  -files <comma-separated list of files>        specify comma-separated files to be copied to the map reduce cluster
  -libjars <comma-separated list of jars>       specify comma-separated jar files to include in the class path
  -archives <comma-separated list of archives>  specify comma-separated archives to be unarchived on the compute machines.

The general command-line syntax is
bin/hadoop command [genericOptions] [commandOptions]
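
As an example of a generic option, -D overrides a configuration property for a single command. The following sketch (the file names test and test-rep1 are chosen for illustration) copies a local file into HDFS with a replication factor of 1 instead of the configured default:

$ hdfs dfs -D dfs.replication=1 -put test test-rep1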

List Files in HDFS

To list the files in the root HDFS directory, enter the following command:

$ hdfs dfs -ls /

Found 8 items
drwxr-xr-x   - hdfs   hdfs        0 2013-02-06 21:17 /apps
drwxr-xr-x   - hdfs   hadoop      0 2014-01-01 14:17 /benchmarks
drwx------   - mapred hdfs        0 2013-04-25 16:20 /mapred
drwxr-xr-x   - hdfs   hdfs        0 2013-12-17 12:57 /system
drwxrwxr--   - hdfs   hadoop      0 2013-11-21 14:07 /tmp
drwxrwxr-x   - hdfs   hadoop      0 2013-10-31 11:13 /user
drwxr-xr-x   - doug   hdfs        0 2013-10-11 16:24 /usr
drwxr-xr-x   - hdfs   hdfs        0 2013-10-31 21:25 /yarn

To list files in your home directory, enter the following command:

$ hdfs dfs -ls

Found 16 items
drwx------   - doug hadoop          0 2013-04-26 02:00 .Trash
drwxr-xr-x   - doug hadoop          0 2013-10-16 20:25 DistributedShell
-rw-------   3 doug hadoop        488 2013-04-24 16:01 NOTES.txt
drwxr-xr-x   - doug hadoop          0 2013-11-21 14:34 QuasiMonteCarlo_1385061734722_747204430
drwxr-xr-x   - doug hadoop          0 2014-01-02 12:48 TeraGen
drwxr-xr-x   - doug hadoop          0 2014-01-01 16:31 TeraGen-output
-rw-------   3 doug hadoop 1083049567 2013-02-07 01:10 acces_log
drwx------   - doug hadoop          0 2013-04-25 15:01 bin
-rw-r--r--   3 doug hadoop         31 2013-10-16 17:09 ds-test.sh
drwxr-xr-x   - doug hadoop          0 2013-04-25 15:44 id.out
-rw-------   3 doug hadoop       2246 2013-04-25 15:43 passwd
drwxr-xr-x   - doug hadoop          0 2013-05-14 17:07 test
drwxr-xr-x   - doug hadoop          0 2013-05-14 17:23 test-output
drwx------   - doug hadoop          0 2013-05-15 11:21 war-and-peace
drwxr-xr-x   - doug hadoop          0 2013-02-06 15:14 wikipedia
drwxr-xr-x   - doug hadoop          0 2013-08-27 15:54 wikipedia-output

The same result can be obtained by issuing the following command:

$ hdfs dfs -ls /user/doug
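
The -ls command also supports recursive listing, and -du reports space usage; both are useful for exploring a directory tree:

$ hdfs dfs -ls -R /user/doug
$ hdfs dfs -du -s -h /user/doug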

Make a Directory in HDFS

To make a directory in HDFS, use the following command. As with the -ls command, when no path is supplied, the user’s home directory is used (e.g., /user/doug).

$ hdfs dfs -mkdir stuff
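
If intermediate directories in the path do not yet exist, the -p flag creates them as needed (here stuff/subdir is a path chosen for illustration):

$ hdfs dfs -mkdir -p stuff/subdir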

Copy Files to HDFS

To copy a file from your current local directory into HDFS, use the following command. Note that if a full path is not supplied, your home directory on HDFS is assumed. In this case, the file test is placed in the directory stuff that was created previously.

$ hdfs dfs -put test stuff

The file transfer can be confirmed by using the -ls command:

$ hdfs dfs -ls stuff

Found 1 items
-rw-r--r--   3 doug hadoop          0 2014-01-03 17:03 stuff/test
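
By default, -put will refuse to overwrite an existing HDFS file; add the -f flag to force the copy:

$ hdfs dfs -put -f test stuff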

Copy Files from HDFS

Files can be copied back to your local file system using the following command. In this case, the file we copied into HDFS, test, will be copied back to the current local directory with the name test-local.

$ hdfs dfs -get stuff/test test-local
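
A related command, -getmerge, concatenates every file in an HDFS directory into a single local file, which is handy for collecting per-reducer output (merged-local is a file name chosen for illustration):

$ hdfs dfs -getmerge stuff merged-local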

Copy Files within HDFS

The following command will copy a file within HDFS.

$ hdfs dfs -cp stuff/test test.hdfs
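
The replication factor of an existing HDFS file can be changed with -setrep; the -w flag waits until the new replication level has been reached:

$ hdfs dfs -setrep -w 2 test.hdfs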

Delete a File within HDFS

The following command will delete the HDFS file test.hdfs that was created previously.

$ hdfs dfs -rm test.hdfs

Deleted test.hdfs
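
Note that if the HDFS trash feature is enabled (via the fs.trash.interval property in core-site.xml), removed files are first moved to the user’s .Trash directory rather than deleted immediately. The -skipTrash flag bypasses the trash, and -expunge empties it (some-file is a placeholder name):

$ hdfs dfs -rm -skipTrash some-file
$ hdfs dfs -expunge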

Delete a Directory in HDFS

The following command will delete the HDFS directory stuff and all its contents.

$ hdfs dfs -rm -r stuff

Deleted stuff
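
For a directory that is already empty, -rmdir can be used instead; unlike -rm -r, it will fail rather than remove any remaining contents (empty-dir is a placeholder name):

$ hdfs dfs -rmdir empty-dir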

Decommissioning HDFS Nodes

This task is done by the HDFS administrator. To remove an active HDFS node, perform the following steps. The procedure for removing a YARN node running the NodeManager daemon is given in Chapter 6, “Apache Hadoop YARN Administration.” Depending on your installation, the HDFS DataNodes and YARN NodeManagers may run on the same nodes or on different ones.

1. Add the following file path property to the hdfs-site.xml file. In this example, the file name hdfs.excludes is used.

<property>
  <name>dfs.hosts.exclude</name>
  <value>/opt/yarn/hadoop-2.2.0/etc/hadoop/hdfs.excludes</value>
</property>

2. Stop and restart the NameNode daemon.

3. To decommission a node, add the node name (or IP address) to the hdfs.excludes file.

4. Run the following to decommission the node:

hdfs dfsadmin -refreshNodes

5. HDFS will then begin decommissioning the node. Do not shut down or remove the node until this process is complete. The decommission status can be found by running hdfs dfsadmin -report; while the process is under way, the report entry for the node will include the following line:

Decommission Status : Decommission in progress

Once the process is complete, issuing the command hdfs dfsadmin -report will show the following status for the node:

Decommission Status : Decommissioned
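
Because decommissioning can take a long time on a busy cluster, it may be convenient to poll for completion. A minimal shell sketch (assuming the hdfs command is on the administrator's PATH):

$ while hdfs dfsadmin -report | grep -q 'Decommission in progress'; do
    sleep 60
  done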

It is now safe to remove the node. To add the node back, simply remove it from the hdfs.excludes file and rerun hdfs dfsadmin -refreshNodes.

Consult the HDFS documentation for additional information.