Information Management: Strategies for Gaining a Competitive Advantage with Data (2014)

Chapter 11. Analytical Big Data

Hadoop: Analytics at Scale

Welcome to an inexpensive way to store and process the data it's designed for: a high volume of unstructured data, handled in a highly distributed manner. Learn how Hadoop works, along with other aspects of Hadoop that hold true for the entire NoSQL movement.

Keywords

data warehouse; Hadoop; open source; NoSQL; scale-out; Hortonworks; Hive; Pig; Cloudera; MapReduce; HDFS; DBMS; analytics

A new technology has emerged in recent years, formed initially in 2006 for the needs of the Silicon Valley data elite. These companies had data needs that far surpassed any DBMS budget out there, and the scale was another order of magnitude beyond what those DBMSs were built for. Nor was the timing of that scale certain, given the variability of the data. It's not as if Google wanted to be bound by the calculations enterprises go through, like "by next year at this time, we'll have 3 petabytes, so let's build to that." They didn't know.

Hadoop was originally developed by Doug Cutting, who named it after his son's toy elephant. Eventually, the code for Hadoop (written in Java) was placed into open source, where it remains today; Yahoo is now the biggest contributor to the Hadoop open source project. However, as enterprises far and wide have found uses for Hadoop, several value-added companies such as Hortonworks and Cloudera have emerged to support particular versions of Hadoop, and some have developed supporting code that remains closed source. Most enterprises find working with one of these vendors more functional than pure open source (Apache) Hadoop, but you could go to http://hadoop.apache.org/ and download Hadoop yourself.

Hadoop is quickly making its way down from the largest web-data companies through the Fortune 1000 and will see adoption in certain upper midmarket companies in the next 5 to 10 years.

The description of Hadoop here applies to all of its variations. I have found nothing in the variations that would sway a decision about whether a workload belongs in Hadoop or not. Choosing a particular "distribution" of Hadoop is analogous to choosing a DBMS or business intelligence tool. Getting enterprises appropriately onto the platforms is the focus of this book, and as this applies to Hadoop, there is some information you need to know.

Big Data for Hadoop

Although I frequently get calls during budget season (usually the third quarter) about replacing some of the large DBMS budgets with Hadoop, that seldom makes sense.1 Hadoop will replace little of what we now have (or plan to have, or could have in the future) in our data warehouses. Sometimes the rationale for thinking about Hadoop is a belief that the multiple terabytes in the data warehouse constitute "big data" and that Hadoop is for big data analytics.

I explained in the previous chapter what big data is. You should now know that there is ample data out there that is not in the data warehouse, nor should it be, and that data constitutes the big data of Hadoop. It will ultimately be more data by volume than what is in the data warehouse, perhaps petabytes. It's the data we've willfully chosen to ignore to date. Hadoop specializes in unstructured and semi-structured data such as web logs.

Data updates are not possible yet in Hadoop, but appends are.

Most shops have struggled with the management of this data and have chosen to ignore unstructured and semi-structured data because it's less valuable per byte than the alphanumeric data of the standard data warehouse. Given the cost to store this data in a data warehouse, and the lower ROI per byte, mostly only summaries or critical pieces get managed. Those who once stored this data in their data warehouse have either become disenchanted with the cost or abandoned the idea for other reasons. But there is value in detailed big data.

Hadoop Defined

Hadoop is an important part of the NoSQL movement that usually refers to a couple of open source products—Hadoop Distributed File System (HDFS), a derivative of the Google File System, and MapReduce—although the Hadoop family of products extends into a product set that keeps growing. HDFS and MapReduce were codesigned, developed, and deployed to work together.

Hadoop adoption—a bit of a hurdle to clear—is worth it when the unstructured data to be managed (considering history, too) reaches dozens of terabytes. Hadoop scales very well, and relatively cheaply, so you do not have to accurately predict the data size at the outset. Summaries of the analytics are likely valuable to the data warehouse, so interaction will occur.

The user consumption profile is not necessarily a high number of user queries issued from a modern business intelligence tool (although many access capabilities are being built from those tools to Hadoop), and the ideal resting state of the data is not dimensional. These are data-intensive workloads, and the schemas are more of an afterthought. Fields can vary from record to record. From one record to another, there need not be even one common field, although Hadoop is best for a small number of large files that have some repeatability from record to record.

Record sets that have at least a few similar fields tend to be called “semi-structured,” as opposed to unstructured. Web logs are a good example of semi-structured. Either way, Hadoop is the store for these “nonstructured” sets of big data. Let’s dissect Hadoop by first looking at its file system.

Hadoop Distributed File System

HDFS is based on a paper Google published about their Google File System. It runs on a large cluster of commodity-class nodes (computers). Whenever a node is placed in the IP range specified by a "NameNode" (one of the necessary Java virtual machine processes), it becomes fair game for data storage in the file system and will henceforth report a heartbeat to the NameNode.

Upon adding the node, HDFS may rebalance the nodes by redistributing data to that node.

Sharding can be utilized to spread the data set to nodes across data centers, potentially all across the world, if required.

A rack is a collection of nodes, usually dozens, that are physically stored close together and are connected to a network switch. A Hadoop cluster is a collection of racks. This could include up to thousands of machines.

There is one NameNode in the cluster (and a backup NameNode). This NameNode should receive priority in terms of hardware specification, especially memory. Its metadata—about the file system—should be kept totally in memory. NameNode failures, while not catastrophic (since there is the backup NameNode), do require manual intervention today. DataNodes are slaves to the NameNode.

Hadoop data is not considered sequenced. It is stored in blocks of 64 MB (usually), 128 MB, or 256 MB (although records can span blocks), and each block is replicated a number of times (three by default) to ensure redundancy, instead of relying on RAID or mirroring. Each block is stored as a separate file in the node's local file system (e.g., NTFS). Hadoop programmers have no control over how HDFS works or where it chooses to place the files. The nodes that contain data, which is well over 99% of the nodes in a cluster, are called DataNodes.
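
As a minimal sketch of how an application hands data to HDFS, assuming a reachable cluster and the standard Hadoop client libraries on the classpath (the host name, path, and sizes below are illustrative, not prescribed), a file can be written with an explicit replication factor and block size:

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Illustrative NameNode address; in practice this comes from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");
        FileSystem fs = FileSystem.get(conf);

        Path target = new Path("/landing/weblogs/part-0001.log");
        short replication = 3;                 // default replication factor
        long blockSize = 128L * 1024 * 1024;   // 128 MB blocks
        int bufferSize = 4096;

        // HDFS, not the client, decides which DataNodes hold each block replica.
        try (FSDataOutputStream out =
                 fs.create(target, true, bufferSize, replication, blockSize)) {
            out.write("sample log line\n".getBytes(StandardCharsets.UTF_8));
        }
        fs.close();
    }
}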

It must be noted that NoSQL systems other than Hadoop use HDFS or something very similar. Sometimes it’s called HDFS even if it’s different. Some call their offerings a “version of Hadoop” even if it’s not HDFS. On behalf of the industry, I apologize.

Where the three replicas are placed is entirely up to the NameNode. The objectives are load balancing, fast access, and fault tolerance. Assuming three is the number of replicas, the first copy is written to the node creating the file. The second is written to a separate node within the same rack. This minimizes cross-network traffic. The third copy is written to a node in a different rack to support the possibility of switch failure. Nodes are fully functional computers so they handle these writes to their local disk.

Cloud computing (Chapter 13) is a great fit for Hadoop. In the cloud, a Hadoop cluster can be set up quickly. The elasticity on demand feature of cloud computing is essential when you get into Hadoop-scale data.

MapReduce for Hadoop

MapReduce was discussed in the previous chapter and is the means of data access in Hadoop. I’ll say a few things here about its applicability to Hadoop for bringing the processing to the data, instead of the other way around.

All data is processed on the local node in the map operation. Map and reduce are successive steps in processing a query, with the reducer processing the output of all of the mappers after that output is transferred across the network. Multiple reduce tasks for a single query are not typical.

For example, a query may want a sum by a group, such as sales by region. Hadoop MapReduce can execute this query on each node in the map step. The map step is always the same code executed on each node. The degree of parallelism is configurable. For example, if you specify 10 levels of parallelism and there are 200 nodes, the map will spawn itself 10 times, each instance covering the data of 20 nodes. MapReduce jobs are full scans and batch in nature. The map step will eventually complete the parallel scan and have 10 sums for each group. The sum across these collective groups then needs to be done. This subsequent step to the map is the reduce step. Nodes are selected for the reduce by the JobTracker, and the partial sums are distributed to those nodes for the overall summations. Again, a degree of parallelism is selected; this time, no more nodes will be involved than the degree of parallelism used in the map step.

Both the map and reduce are written in a programming language, usually Java. The jobs need only be written once, identified as a map or reduce, and will spawn according to the parallelism.

This is the embodiment of taking a large problem, breaking it into smaller problems, performing the same function on all of the smaller problems, and, finally, combining the output for a result set.
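
To make the sales-by-region example concrete, here is a minimal sketch of the map and reduce functions using the standard Hadoop MapReduce Java API. The input layout of "region,amount" per line and the class names are illustrative assumptions, not a prescribed format:

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class SalesByRegion {

    // Map: runs locally against each node's blocks, emitting (region, amount) pairs.
    public static class SalesMapper
            extends Mapper<LongWritable, Text, Text, DoubleWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Assumed record layout: "region,amount"
            String[] fields = value.toString().split(",");
            if (fields.length == 2) {
                context.write(new Text(fields[0].trim()),
                              new DoubleWritable(Double.parseDouble(fields[1].trim())));
            }
        }
    }

    // Reduce: receives all partial values for one region and produces the overall sum.
    public static class SumReducer
            extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
        @Override
        protected void reduce(Text region, Iterable<DoubleWritable> amounts, Context context)
                throws IOException, InterruptedException {
            double total = 0.0;
            for (DoubleWritable amount : amounts) {
                total += amount.get();
            }
            context.write(region, new DoubleWritable(total));
        }
    }
}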

Due to the redundant blocks, MapReduce needs a way to determine which blocks to read. If it read each block in the cluster, it would read three times as many as necessary! The JobTracker node manages this activity. It also waves the jobs through as they come to the cluster. There is not much in the way of native workload management, but this is improving through the supporting tools listed below. When the job is ready, off it goes into the cluster. The JobTracker is smart about directing traffic in terms of the blocks it selects for reading for the queries.

The JobTracker schedules the Map tasks and Reduce tasks on “TaskTracker” nodes and monitors for any failing tasks that need to be rescheduled on a different TaskTracker. To achieve the parallelism for your map and reduce tasks, there are many TaskTrackers in a Hadoop cluster. TaskTrackers utilize Java Virtual Machines to run the map and reduce tasks.
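
The driver that packages the map and reduce classes above and submits them for scheduling is similarly short. A sketch, assuming the classes from the previous listing and illustrative HDFS paths; the single reduce task mirrors the single-reducer pattern described earlier:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SalesByRegionDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "sales by region");
        job.setJarByClass(SalesByRegionDriver.class);

        job.setMapperClass(SalesByRegion.SalesMapper.class);
        job.setCombinerClass(SalesByRegion.SumReducer.class); // pre-sum on each node
        job.setReducerClass(SalesByRegion.SumReducer.class);
        job.setNumReduceTasks(1); // a single reduce task, per the description above

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DoubleWritable.class);

        // Illustrative HDFS paths
        FileInputFormat.addInputPath(job, new Path("/landing/sales"));
        FileOutputFormat.setOutputPath(job, new Path("/analytics/sales_by_region"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}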

Queries will scan the entire file, such as those that produce the "you might like" suggestions in social media by digging many levels deep into web logs to return the best fit. Like a data warehouse, Hadoop is read-only, but it is even more restrictive in this regard.

Failover

Unlike DBMS scale-up architectures, Hadoop clusters, with commodity-class nodes, will experience more failure. The replication makes up for this. Hadoop replicates the data across different computers, so that if one goes down, the data are processed on one of the replicated computers. Since there is (usually) 3x replication, there are more than enough copies in the cluster to accommodate failover. HDFS will actually not repair nodes right away. It does it in a “batch” fashion.

Like a lot of things in Hadoop, the fault tolerance is simple, efficient, and effective.

Failure types:

Disk errors and failures

DataNode failures

Switch/Rack failures

NameNode (and backup NameNode) failures

Datacenter failures

Hadoop Distributions

There are several variations of the Hadoop footprint, with more undoubtedly to come. Prominently, there is open source Apache/Yahoo Hadoop, for which support is also available through Hortonworks and Cloudera, currently independent companies. EMC, IBM with BigInsights, and the Oracle Big Data Appliance also come with training and support. So while Hadoop footprints are supported, one must choose a distribution for a technology with which they most likely have no live experience.

Distributions are guaranteed to have components that work together between HDFS, MapReduce, and all of the supporting tools. They are tested and packaged. With the many components, this is not trivial added value.

Supporting Tools

Hive and Pig are part of distributions and are good for accessing HDFS data with some abstraction from MapReduce. Both tools generate MapReduce jobs, since that is the way to access Hadoop data.

Facebook and Yahoo reached a different conclusion than Google about the value of declarative languages like SQL. Facebook produced an SQL-like language called Hive, and Yahoo! produced a slightly more procedural language (with explicit steps) called Pig. Hive and Pig queries are compiled into a sequence of MapReduce jobs.
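
To illustrate the abstraction Hive provides, the sales-by-region query from earlier can be expressed in a few lines of SQL-like HiveQL and run from Java through Hive's JDBC driver. A minimal sketch, assuming a running HiveServer2 and the Hive JDBC driver on the classpath; the table name sales_logs, host, and credentials are illustrative:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuerySketch {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver; the connection URL below is illustrative.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://hiveserver.example.com:10000/default", "analyst", "");
             Statement stmt = conn.createStatement()) {

            // Hive compiles this declarative query into MapReduce jobs behind the scenes.
            ResultSet rs = stmt.executeQuery(
                "SELECT region, SUM(amount) AS total_sales " +
                "FROM sales_logs GROUP BY region");
            while (rs.next()) {
                System.out.println(rs.getString("region") + "\t" + rs.getDouble("total_sales"));
            }
        }
    }
}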

There is also HBase, a column store that works with Hadoop. It is used in the Bing search engine. Its primary use is to hold web page crawler results. It mainly keeps track of URLs (keys), critical web page content (columns), and timestamps for each page (for versioning what changed). It provides a simple interface to the distributed data that allows incremental processing. HBase can be accessed by Hive, Pig, and MapReduce.
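
A minimal sketch of that key/column/timestamp model using the HBase Java client API (the newer Connection/Table style of the client; the table name webtable, column family, and row key are illustrative assumptions):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class CrawlStoreSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("webtable"))) {

            // Row key is the URL; each write creates a new timestamped version of the cell.
            Put put = new Put(Bytes.toBytes("com.example/index.html"));
            put.addColumn(Bytes.toBytes("content"), Bytes.toBytes("html"),
                          Bytes.toBytes("<html>...</html>"));
            table.put(put);

            // Incremental read of the latest version of one cell.
            Result result = table.get(new Get(Bytes.toBytes("com.example/index.html")));
            byte[] html = result.getValue(Bytes.toBytes("content"), Bytes.toBytes("html"));
            System.out.println(Bytes.toString(html));
        }
    }
}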

There is also Sqoop, a package that uses MapReduce jobs to move data between HDFS and a DBMS.

Here are some other components:

Mahout – machine learning

Ambari – reports what is happening in the cluster

Cascading – high level tool that translates to MapReduce (like Pig and Hive)

Oozie – workflow coordination management; you define when you want MapReduce jobs to run, trigger when data is available, and launch MapReduce jobs

Flume – streaming data into Hadoop for MapReduce, Pig, Hive

Protobuf, Avro, Thrift – for serialization

HCatalog – Hadoop metadata; increasingly, business intelligence and ETL tools will be able to access HDFS through HCatalog


FIGURE 11.1 Hortonworks data platform components.

Hadoop Challenges

Hadoop implementation is complex. It’s a programming orientation as opposed to a tool orientation. Programmers rule this world and have often isolated themselves from the internal “system integrators” who deploy tools. These groups often do more than build. They analyze, which is also different from information management in the database-only years.

The server requirements are immense, mostly because we are dealing with large amounts of data that require many servers. However, for many organizations, these commodity servers are harder to implement than enterprise-class servers. Many organizations are used to vertical scaling and not to supporting “server farms.”

The divide between the Hadoop implementation team and management is more pronounced as well. It is interesting that at the same time that hands-off software-as-a-service is reaching new heights in organizations, this labor-intensive, brute-force antithesis of software-as-a-service is making sense in those same organizations.

Large vendors are responding to the Hadoop challenge in two ways, and usually both ways at once. One is the "join them" approach, in which the vendor announces a Hadoop distribution in addition to continuing full support of its current wares for "big data," many of which are now extending their capabilities to an unprecedented scale. The other is to incorporate Hadoop-like capabilities as a hedge against Hadoop and a reinforcement of the road map, much of which began prior to Hadoop's availability.

Hadoop is Not

Hadoop is not good for processing transactions or for other forms of random access. It is best for large scans. It is also not good for processing lots of small files or intensive calculations with little data.

We have become accustomed to real-time interactivity with data, but use cases for Hadoop must fall into batch processing. Hadoop also does not support indexing or an SQL interface—not yet, anyway. And it’s not strictly ACID-compliant, so you would not manage transactions there.

Types of Data Fit for Hadoop

• Sensor Data – Pulsing sensor readers provide this granular activity data

• Clickstream Data – Detailed website visitor clicks and movements

• Social Data – Data entered by users and accumulated on the internet

• Server Logs – Logs interesting for diagnosing processes in detail

• Smart Grid Data – Data to optimize production and prevent grid failure

• Electronic Medical Records – Data for supporting and optimizing clinical trials

• Video and Pictures – High-storage, dense objects that generate patterns

• Unstructured Text – Accumulated text that generates patterns

• Geolocation Data – Location data

• Any high volume data that outscales DBMS value

• Very “cold” enterprise data with no immediate or known utility

The lack of ACID compliance is very interesting, as I know of no one who has tried to do something in Hadoop and failed because it did not have ACID compliance. This is mainly due to good workload selection, but the fact that transactions could fail, even if you're not doing them in Hadoop, gives a lot of technology leaders pause about using Hadoop. Due to its eventual consistency, it could have less-than-full transactions committed, a big no-no for transactions. Again, this is theoretically possible given the eventual consistency. I have not been able to create such a failure, nor do I know of anyone who has. Furthermore, nobody has been able to articulate, in mathematical terms, the risk you are undertaking with transactions in Hadoop. I am NOT advocating doing transactions in Hadoop, just pointing this out.

Are these knockout factors for Hadoop? I don't believe so. In my work, large unstructured batch data seems to have a cost-effective and functional home only in Hadoop. At most, Hadoop will coexist within the enterprise, its less expensive (per node) server farms processing large amounts of unstructured data and passing some of it to relational systems with broader capabilities, while those relational systems continue to do the bulk of an enterprise's processing, especially of structured data.

The advancement of the Hadoop tool set will remove some of these limitations. Hadoop should begin to be matched up against the real challenges of the enterprise now.

Summary

In summary, Hadoop changes how we view the data warehouse. Shops with unstructured data in the dozens of terabytes and more—even to petabytes—may adopt Hadoop. The data warehouse no longer needs to be the biggest data store, although it will still process the most queries and handle most of the alphanumeric workload. The data warehouse will continue to process ad hoc, interactive queries. However, analytic queries that require unstructured and semi-structured data will move to Hadoop.

Action Plan

1. Develop an understanding of the full costs of Hadoop.

2. Ease into Hadoop. Download open source Hadoop.

3. Pick a problem to solve. Don’t start with egregious problems and petabytes of data. As you are learning the technology and its bounds, as well as the organization’s interests, deliver time-blocked projects with business impact.

a. Move any unstructured analytical data you may be storing in a DBMS to Hadoop

b. Explore the organization's need for unstructured analytical data it may not be storing today, with an eye to storing it in Hadoop

4. Extend governance to Hadoop data and implementations.

5. Consider technical integration points with the “legacy” environment.

www.mcknightcg.com/bookch11


1. Except for the occasional use of Hadoop as cold storage for older, seldom-accessed data in a DBMS.