
Part III. Storing and Managing Big Data

Chapter 5. Storing and Managing Data in HDFS

What You Will Learn in This Chapter

· Getting to Know the History and Fundamentals of HDFS

· Interacting with HDFS to Manage Files

· Administering HDFS Environments

· Managing Your HDFS Data

This chapter discusses the basics of storing and managing data for use in your big data system. The options for storing data can vary, depending on whether you are using the HDInsight Service or the Hortonworks distribution. Both options offer the Hadoop Distributed File System (HDFS), which is the standard file storage mechanism. HDInsight also offers the option of using Azure Storage Vault (ASV), which presents a full HDFS file system that uses Azure Blob storage “under the hood.”

Quite a bit of complexity underlies the full HDFS implementation, and a complete description of it would take a book of its own. This chapter instead focuses on the core knowledge you need to leverage HDFS. It also provides some details of what happens under the hood, where appropriate, to help you understand how to best use the system. Fortunately, HDFS is a stable, mature product used by a large number of companies on an ongoing basis. In the same way that you can use SQL Server to accomplish a great deal of work without understanding its internal workings, you can use HDFS without worrying about the low-level details.

Understanding the Fundamentals of HDFS

HDFS's origin can be traced back to the Google File System (GFS) (http://research.google.com/archive/gfs.html), which was designed to handle Google's needs for storing and processing large amounts of data. Google released a paper on its file system in 2003, and when Doug Cutting began working on Hadoop, he adopted the GFS approach for HDFS.

The design goals for HDFS were threefold:

· To enable large-scale data sets to be stored and processed

· To provide reliable data storage and be fault tolerant

· To provide support for moving the computation to the data

HDFS addresses these goals through a number of architectural features. The goals also logically lead to some constraints, which HDFS had to address. (The next section breaks down these features and how they address the goals and constraints of a distributed file system designed for large data sets.)

HDFS is implemented using the Java language. This makes it highly portable because modern operating systems offer Java support. Communication in HDFS is handled using Transmission Control Protocol/Internet Protocol (TCP/IP), which also enables it to be very portable.

In HDFS terminology, an installation is usually referred to as a cluster. A cluster is made up of individual nodes. A node is a single computer that participates in the cluster. So when someone refers to a cluster, it encompasses all the individual computers that participate in the cluster.

HDFS runs beside the local file system. Each computer still has a standard file system available on it. For Linux, that may be ext4, and for a Windows server, it's usually going to be New Technology File System (NTFS). HDFS stores its files as files in the local file system, but not in a way that enables you to directly interact with them. This is similar to the way SQL Server or other relational database management systems (RDBMSs) use physical files on disk to store their data: While the files store the data, you don't manipulate the files directly. Instead, you go through the interface that the database provides. HDFS uses the native file system in the same way; it stores data there, but not in a directly useable form.

HDFS uses a write-once, read-many access model. This means that data, once written to a file, cannot be updated. Files can be deleted and then rewritten, but not updated. Although this might seem like a major limitation, in practice, with large data sets, it is often much faster to delete and replace than to perform in-place updates of data.

Now that you have a better understanding of HDFS, in the next sections you look at the architecture behind HDFS, learn about NameNodes and DataNodes, and find out about HDFS support for replication.

HDFS Architecture

Traditionally, data has been centralized rather than spread out. That worked well over the past few decades, as the capability to store ever-increasing amounts of data on a single disk continued to grow. For example, in 1981, you could purchase hard drives that stored around 20MB at a cost of approximately $180 per MB. By 2007, you could get a drive that stored 1TB at a cost of about $0.0004 per MB.

Today, storage needs in big-data scenarios continue to outpace the capacity of even the largest drives (4TB). One early solution to this problem was simply to add more hard drives. If you wanted to store 1 petabyte (1,024TB) of information, you would need 256 4TB hard drives. However, if all the hard drives were placed in the same server, it introduced a single point of failure. Any problem that affected the server could mean the drives weren't accessible, and so the data on the drives could be neither read nor written. The single computer could also introduce a performance bottleneck for access to the hard drives.

HDFS was designed to solve this problem by supporting distribution of the data storage across many nodes. Because the data is spread across multiple nodes, no single computer becomes a bottleneck. By storing redundant copies of the information (discussed in more detail in the section “Data Replication”), a single point of failure is also removed.

This redundancy also enables the use of commodity hardware. (Commodity means nonspecialized, off-the-shelf components.) Special hardware or a unique configuration is not needed for a computer to participate in an HDFS cluster. Commodity hardware tends to be less expensive than more specialized components and can be acquired from a wider variety of vendors.

Many of today's server-class computers include a number of features designed to minimize downtime. This includes things like redundant power supplies, multiple network interfaces, and hard drive controllers capable of managing pools of hard drives in redundant array of independent/inexpensive disks (RAID) setups. Thanks to the data redundancy inherent in HDFS, it can minimize the need for this level of hardware, allowing the use of less-expensive computers.

NOTE

Just because HDFS can be run on commodity hardware and manages redundancy doesn't mean that you should ignore the reliability and performance of the computers used in an HDFS cluster. Using more reliable hardware means less time spent replacing broken components. And, just like any other application, HDFS will benefit from more computing resources to work with. In particular, NameNodes (discussed next in the “NameNodes and DataNodes” section) benefit from reliable hardware and high-performing components.

As already stated, the data being stored in HDFS is spread out and replicated across multiple machines. This makes the system resilient to the failure of any individual machine. Depending on the level of redundancy configured, the system may be able to withstand the loss of multiple machines.

Another area where HDFS enables support for large data sets is in computation. Although HDFS does not perform computation directly, it does support moving the computations closer to the data. In many computer systems, the data is moved from a server to another computer, which performs any needed computations. Then the data may be moved back to the original server or moved to yet another server.

This is a common pattern in applications that leverage a relational database. Data is retrieved from a database server to a client computer, where the application logic to update or process the data is applied. Finally, the data is saved to the database server. This pattern makes sense when you consider that the data is being stored on a single computer. If all the application logic were performed on the database server, a single computationally intensive process could block any other user from performing his or her work. By offloading application logic to client computers, it increases the database server's capability to serve data and spreads the computation work across more machines.

This approach works well for smaller data sets, but it rapidly breaks down when you begin dealing with data sets in the 1TB and up range. Moving that much data across the network can introduce a tremendous amount of latency. In HDFS, though, the data is spread out over many computers. By moving the computations closer to the data, HDFS avoids the overhead of moving the data around, while still getting the benefit of spreading computation over a larger number of computers.

Although HDFS is not directly responsible for performing the computation work, it does provide an interface that allows applications to place their computing tasks closer to the data. MapReduce is a prime example of this. It works hand in hand with HDFS to distribute computational tasks to DataNodes so that the work can be performed as close to the data as possible. Only the results of the computation actually need to be transmitted across the network.

NOTE

You can think of moving the computation closer to the data as being similar to the way that many relational database engines optimize queries by moving filter criteria as close to the original data as possible. For example, if you are joining two tables, and then filtering results by a column from one table, the engine will often move that filter to the initial retrieval of data from the table, thus reducing the amount of data it has to join with the second table.

NameNodes and DataNodes

HDFS uses two primary types of nodes. A NameNode acts as a master node, managing the file system and access to files from clients. DataNodes manage the physical storage of data.

Each HDFS cluster has a single NameNode. The NameNode acts as the coordinator for the cluster and is responsible for managing and communicating to all the DataNodes in a cluster. It manages all the metadata associated with the cluster. One of its primary responsibilities is to manage the file system namespace, which is the layer that presents the distributed data stored in HDFS as though it is in a traditional file system organized as folders and files. The NameNode manages any file system namespace operations, such as creating or deleting files and folders.

The file system namespace presents the appearance that the data is stored in a folder/file structure, but the data is actually split into multiple blocks that are stored on DataNodes. The NameNode controls which blocks of data map to which DataNodes. These blocks of data are usually 64MB, but the setting is configurable.
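If a particular data set works better with a different block size, the setting can also be overridden when a file is loaded. The following is a minimal sketch, assuming a Hadoop 1.x-style configuration where the block size property is named dfs.block.size (the property name and default value vary by version) and using a hypothetical local file path:

hadoop dfs -D dfs.block.size=134217728 -put C:\MSBigDataSolutions\LargeFile.txt /user/MSBigDataSolutions

Here 134217728 is 128MB expressed in bytes; the cluster-wide default is normally set in hdfs-site.xml.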

The DataNode is responsible for the creation of blocks of data in its physical storage and for the deletion of those blocks. It is also responsible for creation of replica blocks from other nodes. The NameNode coordinates this activity, telling the DataNode what blocks to create, delete, or replicate. DataNodes communicate with the NameNode by sending a regular “heartbeat” communication over the network. This heartbeat indicates that the DataNode is operating correctly. A block report is also delivered with the heartbeat and provides a list of all the blocks stored on the DataNode.

The NameNode maintains a transaction history of all changes to the file system, known as the EditLog. It also maintains a file, referred to as the FsImage, that contains the file system metadata. The FsImage and EditLog files are read by the NameNode when it starts up, and the EditLog's transaction history is applied to the FsImage. This brings the FsImage up-to-date with the latest changes recorded by the NameNode. Once the FsImage is updated, it is written back to the file system, and the EditLog is cleared. At this point, the NameNode can begin accepting requests. This process (shown in Figure 5.1) is referred to as checkpointing, and it is run only on startup. It can have some performance impact if the NameNode has accumulated a large EditLog.


Figure 5.1 The checkpointing process

The NameNode is a crucial component of any HDFS cluster. Without a functioning NameNode, the data cannot be accessed. That means that the NameNode is a single point of failure for the cluster. Because of that, the NameNode is one place that using a more fault-tolerant hardware setup is advisable. In addition, setting up a Backup node may help you recover more quickly in the event of a NameNode failure. The Backup node maintains its own copy of the FsImage and EditLog. It receives all the file system transactions from the NameNode and uses that to keep its copy of the FsImage up to date. If the NameNode fails catastrophically, you can use the Backup node's copy of the FsImage to start up a new NameNode more quickly.

NOTE

Despite their name, Backup nodes aren't a direct backup to a NameNode. Rather, they manage the checkpointing process and retain a backup copy of the FsImage and EditLog. A NameNode cannot fail over to a Backup node automatically.

NOTE

Hadoop 2.0 includes several improvements to NameNode availability, with support for Active and Standby NameNodes. These new options will make it much easier to have a highly available HDFS cluster.

Data Replication

One of the critical features of HDFS is its support for data replication. This is critical for creating redundancy in the data, which allows HDFS to be resilient to the failure of one or more nodes. Without this capability, HDFS could not run reliably on commodity hardware and, as a result, would require significantly more investment in highly available servers.

Data replication also enables better performance for large data sets. By spreading copies of the data across multiple nodes, the data can be read in parallel. This enables faster access and processing of large files.

By default, HDFS replicates every file three times. However, the replication level can also be specified per file. This can prove useful if you are storing transient data or data that can be re-created easily. This type of data might not be replicated at all, or only replicated once. The file is replicated at the block level. Therefore, a single file may be (and likely is) made up of blocks stored on multiple nodes. The replicas of these blocks may be stored on still more nodes.
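For example, you can lower the replication factor of an existing file that is easy to re-create by using the setrep command. This sketch uses the sample file path that appears later in this chapter; the -w switch simply waits for the replication change to complete before returning:

hadoop dfs -setrep -w 1 /user/MSBigDataSolutions/SampleData1.txt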

Replicas are created as the client writes the data to HDFS. The first DataNode to receive the data gets it in small chunks. Each chunk is written to the DataNode's local storage and then transferred to the next DataNode. The receiving DataNode carries out the same process, forwarding the processed chunks to the next DataNode. That process is repeated for each chunk, for each DataNode, until the required number of replicas has been created. Because a node can be receiving a chunk to process at the same time that it is sending another chunk to the next node, the process is said to be pipelined.

A key aspect of the data replication capabilities in HDFS is that the replica placement is optimized and is continuing to be improved. The replication process is rack aware; that is, it understands how the computers are physically organized. For data centers with large numbers of computers, it is common to use network racks to hold the computers. Often, each rack has its own network switch to handle network communication between computers in the rack. This switch would then be connected to another switch, which is connected to other network racks. This means that communications between computers in the same rack are generally going to be faster than communications between computers in different racks.

HDFS uses its rack awareness to optimize the placement of replicas within a cluster. In doing so, it balances the need for performance with the need for availability in the case of a hardware failure. In the common scenario, with three replicas, one replica is stored on a node in the local rack. The other two replicas will be stored in a remote rack, on two different nodes in the rack. This approach still delivers good read performance, because a client reading the file can access two unique racks (with their own network connection) for the contents of the file. It also delivers good write performance, because writing a replica to a node in the same rack is significantly faster than writing it to a node in a different rack. This also balances availability; the replicas are located in two separate racks and three nodes. Rack failures are less common than node failures, so replicating across fewer racks doesn't have an appreciable impact on availability.

NOTE

The replica placement approach is subject to change, as the HDFS developers consider it a work in progress. As they learn more about usage patterns, they plan to update the policies to deliver the optimal balance of performance and availability.

HDFS monitors the replication levels of files to ensure the replication factor is being met. If a computer hosting a DataNode were to crash, or a network rack were taken offline, the NameNode would flag the absence of heartbeat messages. If the nodes are offline for too long, the NameNode stops forwarding requests to them, and it also checks the replication factors of any data blocks associated with those nodes. In the event that the replication factor has fallen below the threshold set when the file was created, the NameNode begins replication of those blocks again.

Using Common Commands to Interact with HDFS

This section discusses interacting with HDFS. Even though HDFS is a distributed file system, you can interact with it in a similar way as you do with a traditional file system. However, this section covers some key differences. The command examples in the following sections work with the Hortonworks Data Platform environment setup in Chapter 3, “Installing HDInsight.”

Interfaces for Working with HDFS

By default, HDFS includes two mechanisms for working with it. The primary way to interact with it is by the use of a command-line interface. For status checks, reporting, and browsing the file system, there is also a web-based interface.

The hadoop command is a script that can run several modules of the Hadoop system. The two modules that are used for HDFS are dfs (also known as FsShell) and dfsadmin. The dfs module is used for most common file operations, such as adding or moving files. The dfsadmin module is used for administrative functions.

You can open the command prompt on your Windows Hortonworks Data Platform (HDP) server by double-clicking the Hadoop Command Line shortcut on the desktop or by running cmd.exe from the Start menu. If you start from the Hadoop Command Line shortcut, it will set your current directory to the Hadoop location automatically, as shown in Figure 5.2.


Figure 5.2 The HDFS command prompt

When running commands from the command-line interface, you must use the following format:

hadoop MODULE -command arguments

The command line starts with hadoop, the executable that will interpret the remaining items on the command line. MODULE designates which Hadoop module should be run. Recall that when interacting with the HDFS file system this will be dfs or dfsadmin. -command indicates the specific command in the module that should be run, and the arguments are any specific values necessary to execute the command successfully.

NOTE

You can usually find the full list of commands supported by any module by running the module with no command, as follows:

hadoop dfs

The same holds true for commands. Running a command without arguments lists the help for that command:

hadoop dfs -put

The web interface is useful for viewing the status of the HDFS cluster. However, it does not support making modifications to either the cluster or to the files contained in it. As shown in Figure 5.3, the information provided includes the current space used by the cluster and how much space is still available. It also includes information on any unresponsive nodes. It enables you to browse the file system, as well, by clicking the Browse the file system link.


Figure 5.3 Web interface for HDFS

File Manipulation Commands

Most direct interaction with HDFS involves file manipulation—creating, deleting, or moving files. Remember that HDFS files are write-once, read-many, so there are no commands for updating files. However, you can manipulate the metadata associated with a file, such as the owner or group that has access to it.

NOTE

Unlike some file systems, HDFS does not have the concept of a working directory or a cd command. Most commands require that you provide a complete path to the files or directory you want to work with.

NOTE

The following commands will work on an HDInsight cluster using the standard HDFS implementation as well as ASV (discussed in Chapter 13, “Big Data and the Cloud”). However, you need to adjust the paths for each case. To reference a path in the locally attached distributed file system, use hdfs://<namenodehost>/<path> as the path. To reference a path in ASV, use asv://[<container>@]<accountname>.blob.core.windows.net/<path> as the path. You can change the asv prefix to asvs to use an encrypted connection.
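As an illustration, the following commands list the same logical directory through each addressing scheme. The NameNode host, storage account, and container names shown here are hypothetical placeholders; substitute the values for your own cluster:

hadoop dfs -ls hdfs://mynamenode/user/MSBigDataSolutions

hadoop dfs -ls asv://mycontainer@myaccount.blob.core.windows.net/user/MSBigDataSolutions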

By default, HDInsight creates the directories listed in Table 5.1 during the initial setup.

Table 5.1 Initial HDFS Root Directories

/hive - Directory used by Hive for data storage (see Chapter 6, “Adding Structure with Hive”)

/mapred - Directory used for MapReduce

/user - Directory for user data

You can list the root directories by using the ls or lsr command:

hadoop dfs -ls /

hadoop dfs -lsr /

ls lists the directory contents of the specified folder. In the example, / indicates the root folder. lsr lists directory contents, as well, but it does it recursively for each subfolder it encounters.

Normally, user files are created in a subfolder of the /user folder, with the username being used for the title of the folder. However, this is not a requirement, and you can tailor the folder structure to fit specific scenarios. The following examples use a fictional user named MSBigDataSolutions.

Before adding data to HDFS, there must be a directory to hold it. To create a user directory for the MSBigDataSolutions user, you run the mkdir command:

hadoop dfs -mkdir /user/MSBigDataSolutions

If a directory is created by accident, or it is no longer needed, you can remove it by using the rmr command. rmr is short for remove recursive, and it removes the directory specified and any subdirectories:

hadoop dfs -rmr /user/DirectoryToRemove

After a directory has been selected or created, files can be copied to it. The most common scenario for this is to copy files from the local file system into HDFS using the put command. This example uses the sample data files created in this chapter:

hadoop dfs -put C:\MSBigDataSolutions\SampleData1.txt /user/MSBigDataSolutions

This command loads a single file from the local file system (C:\MSBigDataSolutions\SampleData1.txt) to a directory in HDFS (/user/MSBigDataSolutions). You can use the following command to verify the file was loaded correctly:

hadoop dfs -ls /user/MSBigDataSolutions

put can load multiple files to HDFS simultaneously. You do that by using a folder as the source path, in which case all files in the folder are uploaded. You can also do so by using wildcards in the source system path:

hadoop dfs -put C:\MSBigDataSolutions\SampleData_* /user/MSBigDataSolutions

Two other commands are related to put. copyFromLocal works exactly like the put command, and is simply an alias for it. moveFromLocal also functions like put, with the difference that the local file is deleted after the specified file(s) are loaded into HDFS.
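For example, the following sketch loads a hypothetical staging file (SampleData2.txt is a placeholder name) and removes the local copy once the upload completes:

hadoop dfs -moveFromLocal C:\MSBigDataSolutions\SampleData2.txt /user/MSBigDataSolutions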

Once the files are in HDFS, you have a couple of ways to retrieve them. One option is the cat command. cat displays the contents of the file to the screen, or it can be redirected to another output device:

hadoop dfs -cat /user/MSBigDataSolutions/SampleData1.txt

You can also use the text command to display information. The only difference is that text attempts to convert the file to a text format before displaying it. However, because most data in HDFS is text already, cat will usually work.

To get the contents of a file back to the local file system from HDFS, use the get command:

hadoop dfs -get /user/MSBigDataSolutions/SampleData1.txt C:\MSBigDataSolutions\Output

Just like the put command, get can work with multiple files simultaneously, either by specifying a folder or a wildcard:

hadoop dfs -get /user/MSBigDataSolutions/SampleData_* C:\MSBigDataSolutions\Output

get also has two related commands. copyToLocal works exactly like the get command and is simply an alias for it. moveToLocal also functions like get, with the difference that the HDFS file will be deleted after the specified file(s) are copied to the local file system.
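The following sketch shows the equivalent download using copyToLocal; it assumes the output folder already exists on the local file system:

hadoop dfs -copyToLocal /user/MSBigDataSolutions/SampleData1.txt C:\MSBigDataSolutions\Output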

Copying and moving files and directories within HDFS can be done with the cp and mv commands, respectively:

hadoop dfs -cp /user/MSBigDataSolutions /user/Backup

hadoop dfs -mv /user/MSBigDataSolutions /user/Backup2

You can delete a file in HDFS with the rm command. rm does not remove directories, though. For that, you must use the rmr command:

hadoop dfs -rm /user/MSBigDataSolutions/SampleData1.txt

hadoop dfs -rmr /user/MSBigDataSolutions

NOTE

HDFS is case sensitive. mydirectory and MYDIRECTORY are treated by HDFS as two separate directories. Because HDFS automatically creates directories for you when using some commands (the put, cp, and mv commands, for example), paying attention to case is important, as it can be easy to accidentally create directories. rmr is useful to clean up these directories.

Administrative Functions in HDFS

Security in HDFS follows a straightforward model, where files have an owner and a group. A given file or directory maintains permissions for three scenarios, which it checks in the following order:

1. The owner identity. This is a single-user account, and it tracks the assigned owner of the file or directory. If the user account accessing the file matches the owner identity, the user gets the owner's assigned permissions.

2. The group identity. This is a group account, and any user account with membership in the group gets the group permissions.

3. All other users. If the user account does not match the owner identity, and the user account is not a member of the group, the user will use these permissions.

NOTE

Permissions in Azure Storage work a bit differently. Permissions are managed in the storage account. The HDInsight cluster has full permissions to all containers in the storage account that is set as its default storage. It can also access containers in other storage accounts. If the target container uses the public container or public blob access level, the HDInsight cluster will have read access without additional configuration. If the target container uses the private access level, however, you have to update the core-site.xml within the HDInsight cluster to provide the storage key needed to access the container.

In the current release of HDFS, the host operating system manages user identity. HDFS uses whatever identity the host reports to determine the user's identity. On a Windows server, the user identity reported to HDFS will be the equivalent of the whoami command from a command prompt. The group membership will be the same groups as reported by running the net user [username] command.
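If you are unsure which identity and groups HDFS will see, you can check from the same command prompt before running Hadoop commands. This sketch uses the fictional MSBigDataSolutions account from earlier in the chapter:

whoami

net user MSBigDataSolutions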

NOTE

HDFS also has a super-user account. The super user has access to view and modify everything, because permission checks are effectively bypassed for the super user. The account used to start the NameNode process is always set as the super user. Note that if the NameNode process is started under a different user identity, that account will then be the super user. This can be convenient for development purposes, because a developer starting a local NameNode to work against will automatically be the super user. However, for a production environment, a consistent, secured account should be used to start the NameNode.

Files and folders support a simple set of permissions:

· Read (r) permission: An account has permission to read the contents of the file or directory.

· Write (w) permission: An account has permission to change or modify the file or folder.

· Execute (x) permission: An account can access the children of a directory (the files and subdirectories within it). This permission applies to directories only.

To modify the permissions applied to a file, you can use the chmod command. To add permissions, use the plus (+) sign followed by the appropriate permission letters. For example, to give all users read/write permissions to the SampleData_4.txt file, you use the following command:

hadoop dfs -chmod +rw /user/MSBigDataSolutions/SampleData_4.txt

To remove the permissions, use the minus (-) sign:

hadoop dfs -chmod -rw /user/MSBigDataSolutions/SampleData_4.txt

To control which user the permissions apply to, you can prefix the permission with u, g, or o, which respectively stand for the user who owns the file, the group assigned to the file, or all other users. The following command adds read/write permissions back, but only to the owner of the file:

hadoop dfs -chmod u+rw /user/MSBigDataSolutions/SampleData_4.txt

You can modify the owner and group associated with a file or directory by using chown and chgrp, respectively. To change the owner, you must be running the command as the super-user account:

hadoop dfs -chown NewOwner /user/MSBigDataSolutions/SampleData_4.txt

To change the group associated with a file or a directory, you must be either the current owner or the super user:

hadoop dfs -chgrp NewGroup /user/MSBigDataSolutions/SampleData_4.txt

You can apply all the preceding commands recursively to a directory structure by using -R as an argument to the command. This allows permissions to be changed easily for a large group of files. The following command applies the read/write permission to all files in the MSBigDataSolutions folder:

hadoop dfs -chmod -R +rw /user/MSBigDataSolutions

NOTE

The chmod, chown, and chgrp commands are common commands in UNIX-based systems, but are not found on the Windows platform. HDFS implements versions of these commands internally, and for the most part, they function like their UNIX counterparts. chmod, in particular, supports a number of options, such as specifying the permission set in octal notation, that aren't immediately obvious. You can find more documentation on advanced uses of chmod at http://en.wikipedia.org/wiki/Chmod.
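As a brief illustration of the octal form, the following sketch grants the owner full access, the group read and execute access, and all other users no access, applied recursively to the sample user folder:

hadoop dfs -chmod -R 750 /user/MSBigDataSolutions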

Managing deleted files in HDFS is normally transparent to the user. However, in some cases, it can require intervention by an administrator. By default, deleting files from HDFS does not result in immediate removal of the file. Instead, the file is moved to the /trash folder. (If a file is deleted accidentally, it can be recovered by simply moving it back out of the /trash folder using the -mv command.) By default, the /trash folder is emptied every six hours. To explicitly manage the process, you can run the expunge command to force the trash to be emptied:

hadoop dfs -expunge

You can access other administrative functions in HDFS through the dfsadmin module. dfsadmin can be used for a number of different activities, most of which are fairly uncommon. One useful command it offers is report. This command returns a summary of the space available to the HDFS cluster, how much is actually used, and some basic indicators of replication status and potential corruption. Here is an example:

hadoop dfsadmin -report

When you need to manipulate the available nodes for maintenance, a useful command is refreshNodes, which forces the NameNode to reread the list of available DataNodes and any exclusions:

hadoop dfsadmin -refreshNodes
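A common use of refreshNodes is decommissioning a DataNode for maintenance. A typical sequence, assuming your hdfs-site.xml points the dfs.hosts.exclude property at an exclude file, is to add the node's hostname to that file, tell the NameNode to reread it, and then confirm the node's status in the cluster report:

hadoop dfsadmin -refreshNodes

hadoop dfsadmin -report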

Generally, HDFS will correct any file system problems that it encounters, assuming that the problem is correctable. In some cases, though, you might want to explicitly check for errors. In that case, you can run the fsck command. This checks the HDFS file system and reports any errors back to the user. You can also use the fsck command to move corrupted files to a specific folder or to delete them. This command runs the file system check on the /user directory:

hadoop fsck /user
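fsck supports several switches for more detailed output and for handling corrupt files. The following sketches show a verbose check that lists files, blocks, and block locations, along with the options mentioned above for moving corrupted files (to /lost+found) or deleting them; run the destructive options only after reviewing the report:

hadoop fsck /user -files -blocks -locations

hadoop fsck /user -move

hadoop fsck /user -delete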

Overall, HDFS is designed to minimize the amount of administrative overhead involved. This section has focused on the core pieces of administrative information to provide you with enough information to get up and running without overwhelming you. For more details on administering it, you may want to review the document at http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html.

Moving and Organizing Data in HDFS

HDFS manages the data stored in the Hadoop cluster without any necessary user intervention. In fact, a good portion of the design strategies used for HDFS were adopted to support that goal: a system that minimizes the amount of administration that you need to be concerned with. If you will be working with small clusters, or data on the smaller end of big data, you can safely skip this section. However, there are still scenarios in Hadoop where you can get better performance and scalability by taking a more direct approach, as this section covers.

Moving Data in HDFS

As your big data needs grow, it is not uncommon to create additional Hadoop clusters. Additional clusters are also used to keep workloads separate and to manage single-point-of-failure concerns that arise from having a single NameNode. But what happens if you need access to the same data from multiple clusters? You can export the data, using the dfs -get command to move it back to a local file system and the dfs -put command to load it into the new cluster. However, this is likely to be slow and take a large amount of additional disk space during the copying process.

Fortunately, a tool in HDFS makes this easier: distcp (Distributed Copy). distcp enables a distributed approach to copying large amounts of data. It does this by leveraging MapReduce to distribute the copy process to multiple DataNodes in the cluster. The files to be copied, along with any related directories, are placed in a list. The list is then partitioned among the available nodes, and each node becomes responsible for copying its assigned files.

distcp can be executed by running the distcp module with two arguments: the source directory and the target directory. To reference a different cluster, you use a fully qualified name for the NameNode:

hadoop distcp hdfs://mynamenode:50010/user/MSBigDataSolutions hdfs://mybackupcluster:50010/user/MSBigDataSolutions

distcp can also be used for copying data inside the same cluster. This is useful if you need to copy a large amount of data for backup purposes.
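For example, the following sketch copies a folder to a hypothetical backup location within the same cluster; the second form uses distcp's -update switch, which copies only files that are missing or differ in the target, making it useful for refreshing an existing backup:

hadoop distcp /user/MSBigDataSolutions /backup/MSBigDataSolutions

hadoop distcp -update /user/MSBigDataSolutions /backup/MSBigDataSolutions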

NOTE

If you are using HDInsight with ASV, you will not have as much need to move data between clusters. That is because containers in Azure Storage can be shared between clusters; there's no need to copy the data. However, you may still need to copy data from one container to another. You can do this from the Azure Storage Explorer (http://azurestorageexplorer.codeplex.com) if you would like a graphical user interface (GUI). You can also use the same HDFS commands (including distcp) to work with ASV; just use the appropriate qualifier and reference to the container (for example, asv:///MyAsvContainer/MyData/Test.txt).

Implementing Data Structures for Easier Management

HDFS, being a file system, is organized into directories. Many commands work with directories as well as with files, and a number of them also support the -R parameter for applying the command recursively across all child directories. Security can also be managed more easily for folders than for individual files.

Given this, it is very effective to organize your data files into a hierarchical folder structure that reflects the source, usage, and application of the data.

Consider, for example, a company that manages several websites. The website traffic logs are being captured into HDFS, along with user activity logs. Each activity log has its own distinct format. When creating a folder structure for storing this information, you would consider whether it is more important to segment the data by the site that originated it or by the type of data. Which aspect is more important likely depends on the business needs. For this example, suppose that the originating site is the most critical element, because this company keeps their website information heavily separated and secured. You might use a folder structure like this one:

/user/CompanyWebsiteA/sitelogs

/user/CompanyWebsiteA/useractivity

/user/CompanyWebsiteB/sitelogs

/user/CompanyWebsiteB/useractivity

By structuring the folders in this manner, you can easily implement security for each folder at the website level, to prevent unauthorized access. Conversely, if security were not a critical element, you might choose to reverse the order and store the format of the data first. This would make it easier to know what each folder contains and to do processing that spans all websites.
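A sketch of the commands to create and secure part of this structure might look like the following, assuming a hypothetical WebsiteAUser account owns the CompanyWebsiteA folder (remember that chown must be run as the super user):

hadoop dfs -mkdir /user/CompanyWebsiteA/sitelogs

hadoop dfs -mkdir /user/CompanyWebsiteA/useractivity

hadoop dfs -chown -R WebsiteAUser /user/CompanyWebsiteA

hadoop dfs -chmod -R 750 /user/CompanyWebsiteA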

NOTE

ASV doesn't support a directory hierarchy. Instead, you have a container, and it stores key/value pairs for the data. However, ASV does allow the forward slash (/) to be used inside a key name (for example, CompanyWebsiteA/sitelogs/sitelog1.txt). By using the forward slash, the key keeps the appearance of a folder-based structure.

You can easily modify the folder structures by using the dfs -cp and -mv commands. This means that if a particular folder structure isn't working out, you can try new ones.

Rebalancing Data

Generally, HDFS manages the placement of data across nodes very well. As discussed previously, it attempts to balance the placement of data to ensure a combination of reliability and performance. However, as more data is added to a Hadoop cluster, it is normal to add more nodes to it. This can lead to the cluster being out of balance; that is, some nodes have more data and, therefore, more activity than other nodes. It can also lead to certain nodes having most of the more recently added data, which can create some issues, because newly added data is often more heavily accessed.

You can rebalance the cluster by using the balancer tool, which is simple to use. It takes one optional parameter, which defines a threshold of disk usage variance (expressed in percentage points) to use for the rebalancing process. The default threshold is 10 percent, which you can specify explicitly as shown here:

hadoop balancer -threshold 10

The balancer determines an average space utilization across the cluster. Nodes are considered over- or underutilized if their space usage varies by more than the threshold from the average space utilization. The balancer runs until one of the following occurs:

· All the nodes in the cluster have been balanced.

· It has exceeded three iterations without making progress on balancing.

· The user who started the balancer aborts it by pressing Ctrl+C.

Balancing the data across nodes is an important step to maintaining the performance of the cluster, and it should be carried out whenever there are significant changes to the nodes in a cluster.
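The balancer deliberately limits how much network bandwidth it consumes so that it does not interfere with normal workloads. Depending on your Hadoop version, this limit is controlled by the dfs.balance.bandwidthPerSec property in hdfs-site.xml (specified in bytes per second), and newer releases also allow it to be adjusted at runtime with a dfsadmin command such as the following sketch, which raises the limit to roughly 10MB per second:

hadoop dfsadmin -setBalancerBandwidth 10485760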

Summary

This chapter covered the background of the HDFS file system, along with some of the underlying details, including how NameNodes and DataNodes interact to store information. It introduced the basic commands for working with and administering an HDFS file system, such as ls for listing files, get and put for moving files in and out of HDFS, and rm for removing unnecessary files, as well as some advanced administrative topics, like balancing and data movement, that are important for maintaining your HDFS cluster. The next chapter builds on these topics with a discussion of how the Hive application runs on top of the HDFS file system while presenting the appearance of a traditional RDBMS to applications.