
Data Science For Dummies (2016)

Part 1

Getting Started with Data Science

Chapter 2

Exploring Data Engineering Pipelines and Infrastructure

IN THIS CHAPTER

· Defining big data

· Looking at some sources of big data

· Distinguishing between data science and data engineering

· Hammering down on Hadoop

· Exploring solutions for big data problems

· Checking out a real-world data engineering project

There’s a lot of hype around big data these days, but most people don’t really know or understand what it is or how they can use it to improve their lives and livelihoods. This chapter defines the term big data, explains where big data comes from and how it’s used, and outlines the roles that data engineers and data scientists play in the big data ecosystem. In this chapter, I introduce the fundamental big data concepts you need in order to start generating your own ideas and plans for leveraging big data and data science to improve your lifestyle and business workflow. (Hint: Mastering some of the technologies discussed in this chapter can lead to more opportunities for landing a well-paid position that also offers excellent lifestyle benefits.)

Defining Big Data by the Three Vs

Big data is data that exceeds the processing capacity of conventional database systems because it’s too big, it moves too fast, or it doesn’t fit the structural requirements of traditional database architectures. Whether data volumes rank in the terabyte or petabyte scales, data-engineered solutions must be designed to meet requirements for the data’s intended destination and use.

technicalstuff When you’re talking about regular data, you’re likely to hear the words kilobyte and gigabyte used as measurements — 10³ and 10⁹ bytes, respectively. In contrast, when you’re talking about big data, words like terabyte and petabyte are thrown around instead — 10¹² and 10¹⁵ bytes, respectively. A byte is an 8-bit unit of data.

Three characteristics (known as “the three Vs”) define big data: volume, velocity, and variety. Because the three Vs of big data are continually expanding, newer, more innovative data technologies must continuously be developed to manage big data problems.

remember In a situation where you’re required to adopt a big data solution to overcome a problem that’s caused by your data’s velocity, volume, or variety, you have moved past the realm of regular data — you have a big data problem on your hands.

Grappling with data volume

Big data volume starts as low as 1 terabyte and has no upper limit. If your organization owns at least 1 terabyte of data, it’s probably a good candidate for a big data deployment.

warning In its raw form, most big data is low value — in other words, the value-to-data-quantity ratio is low in raw big data. Big data is composed of huge numbers of very small transactions that come in a variety of formats. These incremental components of big data produce true value only after they’re aggregated and analyzed. Data engineers have the job of rolling it up, and data scientists have the job of analyzing it.

Handling data velocity

Nowadays, a lot of big data is created through automated processes and instrumentation, and because data storage is relatively inexpensive, system velocity is often the limiting factor. Because raw big data is low value, you need systems that can ingest a lot of it, on short order, to generate timely and valuable insights.

In engineering terms, data velocity is data volume per unit time. Big data enters an average system at velocities ranging from 30 kilobytes (KB) per second to as much as 30 gigabytes (GB) per second. Many data-engineered systems are required to have latency of less than 100 milliseconds, measured from the time the data is created to the time the system responds. Throughput requirements can easily be as high as 1,000 messages per second in big data systems! High-velocity, real-time moving data presents an obstacle to timely decision making. The capabilities of data-handling and data-processing technologies often limit data velocities.
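
To put those numbers in perspective, here’s a quick back-of-the-envelope calculation in Python. The stream rate and message size are made-up, illustrative values, not benchmarks from any particular system.

```python
# A quick velocity calculation with illustrative numbers only.
ingest_rate_bytes_per_sec = 30 * 10**6   # assume a 30 MB/s incoming stream
avg_message_size_bytes = 2 * 10**3       # assume roughly 2 KB per message

messages_per_second = ingest_rate_bytes_per_sec / avg_message_size_bytes
daily_volume_gb = ingest_rate_bytes_per_sec * 86_400 / 10**9

print(f"Throughput: {messages_per_second:,.0f} messages/second")
print(f"Daily volume: {daily_volume_gb:,.1f} GB/day")
```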

remember Data ingestion tools come in a variety of flavors. Some of the more popular ones are described in this list:

· Apache Sqoop: You can use this data-transfer tool to quickly move data back and forth between a relational database system and the Hadoop distributed file system (HDFS). HDFS stores big data across clusters of inexpensive commodity servers, which makes big data handling and storage financially feasible, and it is the main storage system used in big data implementations.

· Apache Kafka: This distributed messaging system acts as a message broker whereby messages can quickly be pushed onto, and pulled from, HDFS. You can use Kafka to consolidate and facilitate the data calls and pushes that consumers make to and from the HDFS.

· Apache Flume: This distributed system primarily handles log and event data. You can use it to transfer massive quantities of unstructured data to and from the HDFS.
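
To make the Kafka entry in the preceding list a bit more concrete, here’s a minimal sketch that pushes clickstream events onto a Kafka topic using kafka-python, one common Python client. The client library, broker address, and topic name are my own illustrative choices, not tools prescribed in this chapter.

```python
# A minimal sketch of pushing events onto a Kafka topic with the
# kafka-python client. The broker address and topic name ("clickstream")
# are placeholders -- substitute your own cluster details.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),
)

event = {"user_id": 42, "action": "page_view", "url": "/pricing"}
producer.send("clickstream", value=event)   # queue the message
producer.flush()                            # block until it's delivered
```

A downstream consumer, or a tool such as Flume or a Spark job, would then pull these messages off the topic and land them in HDFS for storage and analysis.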

Dealing with data variety

Big data gets even more complicated when you add unstructured and semistructured data to structured data sources. This high-variety data comes from a multitude of sources, and its most salient characteristic is that it combines datasets with differing underlying structures (structured, unstructured, or semistructured). Heterogeneous, high-variety data is often composed of any combination of graph data, JSON files, XML files, social media data, structured tabular data, weblog data, and data that’s generated from click-streams.

Structured data can be stored, processed, and manipulated in a traditional relational database management system (RDBMS). This data can be generated by humans or machines, and is derived from all sorts of sources, from click-streams and web-based forms to point-of-sale transactions and sensors. Unstructured data has no predefined structure; it’s commonly generated from human activities and doesn’t fit into a structured database format. Such data could be derived from blog posts, emails, and Word documents. Semistructured data doesn’t fit into a structured database system, but it is nonetheless structured by tags that create a form of order and hierarchy in the data. Semistructured data is commonly found in databases and file systems, and it can be stored as log files, XML files, or JSON data files.
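
Here’s a small Python illustration of that difference in practice, using an invented customer record stored two ways: once as a structured (tabular) row and once as a semistructured JSON document.

```python
# The same customer data as a structured (tabular) row and as
# semistructured JSON. The field names are invented for the example.
import csv
import io
import json

# Structured: fixed columns, one value per column.
tabular = io.StringIO("customer_id,name,purchase_total\n101,Ada,42.50\n")
for row in csv.DictReader(tabular):
    print(row["name"], float(row["purchase_total"]))

# Semistructured: tagged fields, nesting, and optional keys are all allowed.
document = json.loads(
    '{"customer_id": 101, "name": "Ada", '
    '"orders": [{"sku": "A-7", "qty": 2}], "notes": null}'
)
print(document["orders"][0]["sku"])
```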

tip Become familiar with the term data lake, which practitioners in the big data industry use to refer to a nonhierarchical data storage system that holds huge volumes of multi-structured data within a flat storage architecture. HDFS can be used as a data lake storage repository, but you can also meet the same requirement in the cloud with Amazon Simple Storage Service (S3), Amazon Web Services’ object storage platform.
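
As a minimal sketch of the data lake idea, the following snippet lands a raw log file in S3 using the boto3 library. The bucket name, key prefix, and file name are placeholders, and the snippet assumes your AWS credentials are already configured.

```python
# Landing a raw file in an S3-based data lake with boto3.
# Bucket name and key prefix are placeholders.
import boto3

s3 = boto3.client("s3")
s3.upload_file(
    Filename="weblogs_2016-01-15.gz",       # local raw file
    Bucket="example-data-lake",             # placeholder bucket name
    Key="raw/weblogs/2016/01/15/part-0.gz"  # flat, prefix-based layout
)
```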

Identifying Big Data Sources

Big data is being continually generated by humans, machines, and sensors everywhere. Typical sources include data from social media, financial transactions, health records, click-streams, log files, and the Internet of things — a web of digital connections that joins together the ever-expanding array of electronic devices we use in our everyday lives. Figure 2-1 shows a variety of popular big data sources.


FIGURE 2-1: Popular sources of big data.

Grasping the Difference between Data Science and Data Engineering

Data science and data engineering are two different branches within the big data paradigm — an approach wherein huge velocities, varieties, and volumes of structured, unstructured, and semistructured data are being captured, processed, stored, and analyzed using a set of techniques and technologies that is completely novel compared to those that were used in decades past.

Both are useful for deriving knowledge and actionable insights from raw data. Both are essential elements for any comprehensive decision-support system, and both are extremely helpful when formulating robust strategies for future business management and growth. Although the terms data science and data engineering are often used interchangeably, they’re distinct domains of expertise. In the following sections, I introduce concepts that are fundamental to data science and data engineering, and then I show you the differences in how these two roles function in an organization’s data processing system.

Defining data science

If science is a systematic method by which people study and explain domain-specific phenomena that occur in the natural world, you can think of data science as the scientific domain that’s dedicated to knowledge discovery via data analysis.

technicalstuff With respect to data science, the term domain-specific refers to the industry sector or subject matter domain that data science methods are being used to explore.

Data scientists use mathematical techniques and algorithmic approaches to derive solutions to complex business and scientific problems. They use data science’s predictive methods to derive insights that are otherwise unattainable. In business and in science, data science methods can provide more robust decision-making capabilities:

· In business, the purpose of data science is to empower businesses and organizations with the data-derived information they need in order to optimize organizational processes for maximum efficiency and revenue generation.

· In science, data science methods are used to derive results and develop protocols for achieving the specific scientific goal at hand.

Data science is a vast and multidisciplinary field. To call yourself a true data scientist, you need to have expertise in math and statistics, computer programming, and your own domain-specific subject matter.

Using data science skills, you can do things like this:

· Use machine learning to optimize energy usage and lower corporate carbon footprints.

· Optimize tactical strategies to achieve goals in business and science.

· Predict unknown contaminant levels from sparse environmental datasets.

· Design automated theft- and fraud-prevention systems to detect anomalies and trigger alarms based on algorithmic results.

· Craft site-recommendation engines for use in land acquisitions and real estate development.

· Implement and interpret predictive analytics and forecasting techniques for net increases in business value.

Data scientists must have extensive and diverse quantitative expertise to be able to solve these types of problems.

technicalstuff Machine learning is the practice of applying algorithms to learn from, and make automated predictions about, data.
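
Here’s a minimal sketch of that idea using scikit-learn and a tiny, invented dataset of past transactions labeled as fraudulent or legitimate: the model learns from the labeled examples and then makes automated predictions about new ones.

```python
# Learning from data to make automated predictions, with scikit-learn.
# The transactions and labels are invented for the example.
from sklearn.linear_model import LogisticRegression

# Each row: [transaction amount in dollars, hour of day]
X = [[25, 14], [3200, 2], [40, 10], [5100, 3], [18, 16], [4700, 1]]
y = [0, 1, 0, 1, 0, 1]   # 1 = fraudulent, 0 = legitimate (invented labels)

model = LogisticRegression()
model.fit(X, y)

# Predict labels for two new, unseen transactions.
print(model.predict([[60, 13], [4900, 2]]))   # typically prints [0 1]
```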

Defining data engineering

If engineering is the practice of using science and technology to design and build systems that solve problems, you can think of data engineering as the engineering domain that’s dedicated to building and maintaining data systems for overcoming data-processing bottlenecks and data-handling problems that arise due to the high volume, velocity, and variety of big data.

Data engineers use skills in computer science and software engineering to design systems for, and solve problems with, handling and manipulating big datasets. Data engineers often have experience working with and designing real-time processing frameworks and massively parallel processing (MPP) platforms (discussed later in this chapter), as well as RDBMSs. They generally code in Java, C++, Scala, and Python. They know how to deploy Hadoop MapReduce or Spark to handle, process, and refine big data into more manageably sized datasets. Simply put, with respect to data science, the purpose of data engineering is to engineer big data solutions by building coherent, modular, and scalable data processing platforms from which data scientists can subsequently derive insights.

remember Most engineered systems are built systems — they are constructed or manufactured in the physical world. Data engineering is different, though. It involves designing, building, and implementing software solutions to problems in the data world — a world that can seem abstract when compared to the physical reality of the Golden Gate Bridge or the Aswan Dam.

Using data engineering skills, you can, for example:

· Build large-scale Software-as-a-Service (SaaS) applications.

· Build and customize Hadoop and MapReduce applications.

· Design and build relational databases and highly scaled distributed architectures for processing big data.

· Build an integrated platform that simultaneously solves problems in data ingestion, data storage, machine learning, and system management — all from one interface.

Data engineers need solid skills in computer science, database design, and software engineering to be able to perform this type of work.

technicalstuff Software-as-a-Service (SaaS) is a term that describes cloud-hosted software services that are made available to users via the Internet.

Comparing data scientists and data engineers

The roles of data scientist and data engineer are frequently confused and intertwined by hiring managers. If you look at position descriptions for companies that are hiring, they often mismatch the titles and roles or simply expect applicants to do both data science and data engineering.

tip If you’re hiring someone to help make sense of your data, be sure to define the requirements clearly before writing the position description. Because data scientists must also have subject-matter expertise in the particular areas in which they work, this requirement generally precludes data scientists from also having expertise in data engineering (although some data scientists do have experience using engineering data platforms). And, if you hire a data engineer who has data science skills, that person generally won’t have much subject-matter expertise outside of the data domain. Be prepared to call in a subject-matter expert to help out.

Because many organizations combine and confuse roles in their data projects, data scientists are sometimes stuck spending a lot of time learning to do the job of a data engineer, and vice versa. To get the highest-quality work product in the least amount of time, hire a data engineer to process your data and a data scientist to make sense of it for you.

Lastly, keep in mind that data engineer and data scientist are just two small roles within a larger organizational structure. Managers, middle-level employees, and organizational leaders also play a huge part in the success of any data-driven initiative. The primary benefit of incorporating data science and data engineering into your projects is to leverage your external and internal data to strengthen your organization’s decision-support capabilities.

Making Sense of Data in Hadoop

Because big data’s three Vs (volume, velocity, and variety) make it impossible to handle big data using traditional relational database management systems, data engineers had to become innovative. To get around the limitations of relational systems, data engineers turn to the Hadoop data processing platform to boil down big data into smaller datasets that are more manageable for data scientists to analyze.

remember When you hear people use the term Hadoop nowadays, they’re generally referring to a Hadoop ecosystem that includes the HDFS (for data storage), MapReduce (for bulk data processing), Spark (for real-time data processing), and YARN (for resource management).

In the following sections, I introduce you to MapReduce, Spark, and the Hadoop distributed file system. I also introduce the programming languages you can use to develop applications in these frameworks.

Digging into MapReduce

MapReduce is a parallel distributed processing framework that can be used to process tremendous volumes of data in-batch — where data is collected and then processed as one unit with processing completion times on the order of hours or days. MapReduce works by converting raw data down to sets of tuples and then combining and reducing those tuples into smaller sets of tuples (with respect to MapReduce, tuples refer to key-value pairs by which data is grouped, sorted, and processed). In layman’s terms, MapReduce uses parallel distributed computing to transform big data into manageable-size data.

technicalstuff Parallel distributed processing refers to a powerful framework where data is processed very quickly via the distribution and parallel processing of tasks across clusters of commodity servers.

MapReduce jobs implement a sequence of map and reduce tasks across a distributed set of servers. In the map task, you split the data into key-value pairs, transform it, and filter it. Then you assign the data to nodes for processing. In the reduce task, you aggregate that data down to smaller-size datasets. Data from the reduce step is transformed into a standard key-value format, where the key acts as the record identifier and the value is the value being identified by the key. The clusters’ computing nodes process the map tasks and reduce tasks that are defined by the user.

This work is done in two steps:

1. Map the data.

The incoming data is first divided into key-value pairs and split into fragments, which are then assigned to map tasks. Each computing cluster (a group of nodes that are connected to each other and perform a shared computing task) is assigned a number of map tasks, which are subsequently distributed among its nodes. Upon processing of the key-value pairs, intermediate key-value pairs are generated. The intermediate key-value pairs are sorted by their key values, and this sorted list is divided into a new set of fragments. The number of these new fragments is the same as the number of reduce tasks.

2. Reduce the data.

Every reduce task has a fragment assigned to it. The reduce task simply processes the fragment and produces an output, which is also a key-value pair. Reduce tasks are also distributed among the different nodes of the cluster. After the task is completed, the final output is written onto a file system.

In short, you can use MapReduce as a batch-processing tool to boil down and begin to make sense of a huge volume, velocity, and variety of data. Map tasks tag the data as (key, value) pairs, and reduce tasks then combine those pairs into smaller sets of data through aggregation operations, which combine multiple values from a dataset into a single value. A diagram of the MapReduce architecture is shown in Figure 2-2.


FIGURE 2-2: The MapReduce architecture.
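
To make the map and reduce steps concrete, here’s a toy, single-machine Python simulation of the classic word-count job. A real MapReduce job distributes this same key-value logic across the nodes of a cluster.

```python
# A toy, single-machine simulation of the map and reduce steps,
# using the classic word-count example.
from collections import defaultdict

documents = ["big data is big", "data velocity and data volume"]

# Map step: emit a (key, value) pair for every word.
mapped = []
for doc in documents:
    for word in doc.split():
        mapped.append((word, 1))

# Shuffle/sort step: group the intermediate pairs by key.
grouped = defaultdict(list)
for key, value in mapped:
    grouped[key].append(value)

# Reduce step: aggregate each group down to a single value.
reduced = {key: sum(values) for key, values in grouped.items()}
print(reduced)   # {'big': 2, 'data': 3, 'is': 1, ...}
```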

tip If your data doesn’t lend itself to being tagged and processed via keys, values, and aggregation, map-and-reduce generally isn’t a good fit for your needs.

Stepping into real-time processing

Do you recall that MapReduce is a batch processor and can’t process real-time, streaming data? Well, sometimes you might need to query big data streams in real-time — and you just can’t do this sort of thing using MapReduce. In these cases, use a real-time processing framework instead.

A real-time processing framework is, as its name implies, a framework that processes data in real-time (or near-real-time) as that data streams and flows into the system. Real-time frameworks process data in microbatches, returning results in a matter of seconds rather than the hours or days that MapReduce batch processing requires. Real-time processing frameworks either

· Lower the overhead of MapReduce tasks to increase the overall time efficiency of the system: Solutions in this category include Apache Storm and Apache Spark for near-real-time stream processing.

· Deploy innovative querying methods to facilitate the real-time querying of big data: Some solutions in this category are Google’s Dremel, Apache Drill, Shark for Apache Hive, and Cloudera’s Impala.

Although MapReduce was historically the main processing framework in a Hadoop system, Spark has recently made some major advances in assuming MapReduce’s position. Spark is an in-memory computing application that you can use to query, explore, analyze, and even run machine learning algorithms on incoming, streaming data in near-real-time. Its power lies in its processing speed — the ability to process and make predictions from streaming big data sources in three seconds flat is no laughing matter. Major vendors such as Cloudera have been pushing for the advancement of Spark so that it can be used as a complete MapReduce replacement, but it isn’t there yet.
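
As a minimal sketch of what near-real-time processing looks like in practice, the following PySpark snippet uses the Spark Streaming API to count words arriving on a local TCP socket in 5-second microbatches. The host, port, and batch interval are illustrative choices, and the exact setup can vary with your Spark version.

```python
# Near-real-time word counting with Spark Streaming (PySpark DStream API).
# Host, port, and batch interval are illustrative choices.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="StreamingWordCount")
ssc = StreamingContext(sc, batchDuration=5)   # 5-second microbatches

lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()   # print each microbatch's counts as they arrive

ssc.start()
ssc.awaitTermination()
```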

Real-time, stream-processing frameworks are quite useful in a multitude of industries — from stock and financial market analyses to e-commerce optimizations, and from real-time fraud detection to optimized order logistics. Regardless of the industry in which you work, if your business is impacted by real-time data streams that are generated by humans, machines, or sensors, a real-time processing framework would be helpful to you in optimizing and generating value for your organization.

Storing data on the Hadoop distributed file system (HDFS)

The Hadoop distributed file system (HDFS) uses clusters of commodity hardware for storing data. Hardware in each cluster is connected, and this hardware is composed of commodity servers — low-cost, low-performing generic servers that offer powerful computing capabilities when run in parallel across a shared cluster. These commodity servers are also called nodes. Commoditized computing dramatically decreases the costs involved in storing big data.

The HDFS is characterized by these three key features:

· HDFS blocks: In data storage, a block is a storage unit that contains some maximum number of records. HDFS blocks are able to store 64MB of data, by default.

· Redundancy: Datasets that are stored in HDFS are broken up and stored on blocks. These blocks are then replicated (three times, by default) and stored on several different servers in the cluster, as backup, or redundancy.

· Fault-tolerance: A system is described as fault tolerant if it is built to continue successful operations despite the failure of one or more of its subcomponents. Because the HDFS has built-in redundancy across multiple servers in a cluster, if one server fails, the system simply retrieves the data from another server.
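
Here’s a quick Python calculation showing what the default block size and replication factor imply for raw storage on the cluster; the dataset size is an illustrative figure.

```python
# What default HDFS settings imply for raw cluster storage.
# The 10 TB dataset size is an illustrative figure.
dataset_tb = 10               # logical data you want to store
replication_factor = 3        # HDFS default replication
block_size_mb = 64            # HDFS default block size (per this chapter)

raw_storage_tb = dataset_tb * replication_factor
# Roughly 1,000,000 MB per TB (decimal units)
block_replicas = dataset_tb * 1_000_000 / block_size_mb * replication_factor

print(f"Raw cluster storage required: {raw_storage_tb} TB")
print(f"Approximate number of block replicas: {block_replicas:,.0f}")
```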

warning Don’t pay storage costs on data you don’t need. Storing big data is relatively inexpensive, but it is definitely not free. In fact, storage costs range up to $20,000 per commodity server in a Hadoop cluster. For this reason, only relevant data should be ingested and stored.

Putting it all together on the Hadoop platform

The Hadoop platform is the premier platform for large-scale data processing, storage, and management. This open-source platform is generally composed of the HDFS, MapReduce, Spark, and YARN, all working together.

Within a Hadoop platform, the workloads of applications that run on the HDFS (like MapReduce and Spark) are divided among the nodes of the cluster, and the output is stored on the HDFS. A Hadoop cluster can be composed of thousands of nodes. To keep the costs of input/output (I/O) processes low, MapReduce jobs are performed as close to the data as possible — the reduce tasks processors are positioned as closely as possible to the outgoing map task data that needs to be processed. This design facilitates the sharing of computational requirements in big data processing.

Hadoop also supports hierarchical organization. Some of its nodes are classified as master nodes, and others are categorized as slaves. The master service, known as JobTracker, is designed to control several slave services. A single slave service (also called TaskTracker) is distributed to each node. The JobTracker controls the TaskTrackers and assigns Hadoop MapReduce tasks to them. YARN, the resource manager, acts as an integrated system that performs resource management and scheduling functions.

HOW JAVA, SCALA, PYTHON, AND SQL FIT INTO YOUR BIG DATA PLANS

MapReduce is implemented in Java, and Spark’s native language is Scala. Great strides have been made, however, to open these technologies to a wider array of users. You can now use Python to program Spark jobs (a library called PySpark), and you can use SQL (discussed in Chapter 16) to query data from the HDFS (using tools like Hive and Spark SQL).
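
For example, here’s a minimal sketch of the Python-plus-SQL combination described above: PySpark reads JSON data from HDFS, registers it as a table, and queries it with Spark SQL. The HDFS path, table name, and column names are placeholders, and the session setup may differ slightly depending on your Spark version.

```python
# Querying HDFS data with Python (PySpark) and SQL (Spark SQL).
# The HDFS path and column names are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ClickstreamSQL").getOrCreate()

clicks = spark.read.json("hdfs:///data/raw/clickstream/")  # placeholder path
clicks.createOrReplaceTempView("clicks")

top_pages = spark.sql("""
    SELECT url, COUNT(*) AS views
    FROM clicks
    GROUP BY url
    ORDER BY views DESC
    LIMIT 10
""")
top_pages.show()
```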

Identifying Alternative Big Data Solutions

Looking past Hadoop, alternative big data solutions are on the horizon. These solutions make it possible to work with big data in real-time or to use alternative database technologies to handle and process it. In the following sections, I introduce you to massively parallel processing (MPP) platforms and the NoSQL databases that allow you to work with big data outside of the Hadoop environment.

remember ACID compliance stands for atomicity, consistency, isolation, and durability compliance, a standard by which accurate and reliable database transactions are guaranteed. In big data solutions, most database systems are not ACID compliant, but this does not necessarily pose a major problem, because most big data systems use a decision support system (DSS) that batch-processes data before that data is read out. A DSS is an information system that is used for organizational decision support. A nontransactional DSS demonstrates no real ACID compliance requirements.

Introducing massively parallel processing (MPP) platforms

Massively parallel processing (MPP) platforms can be used instead of MapReduce as an alternative approach for distributed data processing. If your goal is to deploy parallel processing on a traditional data warehouse, an MPP may be the perfect solution.

To understand how MPP compares to a standard MapReduce parallel-processing framework, consider that MPP runs parallel computing tasks on costly, custom hardware, whereas MapReduce runs them on inexpensive commodity servers. Consequently, MPP processing capabilities are cost restrictive. MPP is quicker and easier to use, however, than standard MapReduce jobs, because MPP platforms can be queried using Structured Query Language (SQL), whereas native MapReduce jobs are written in the more complicated Java programming language.

Introducing NoSQL databases

A traditional RDBMS isn’t equipped to handle big data demands. That’s because it is designed to handle only relational datasets constructed of data that’s stored in clean rows and columns and thus is capable of being queried via SQL. RDBMSs are not capable of handling unstructured and semistructured data. Moreover, RDBMSs simply don’t have the processing and handling capabilities that are needed for meeting big data volume and velocity requirements.

This is where NoSQL comes in. NoSQL databases are non-relational, distributed database systems that were designed to rise to the big data challenge. NoSQL databases step out past the traditional relational database architecture and offer a much more scalable, efficient solution. NoSQL systems facilitate non-SQL data querying of non-relational or schema-free, semistructured and unstructured data. In this way, NoSQL databases are able to handle the structured, semistructured, and unstructured data sources that are common in big data systems.

NoSQL offers four categories of non-relational databases: graph databases, document databases, key-value stores, and column family stores. Because NoSQL offers native functionality for each of these separate types of data structures, it offers very efficient storage and retrieval functionality for most types of non-relational data. This adaptability and efficiency makes NoSQL an increasingly popular choice for handling big data and for overcoming the processing challenges that come along with it.

The NoSQL applications Apache Cassandra and MongoDB are used for data storage and real-time processing. Apache Cassandra is a popular column family store, and MongoDB is a document-oriented NoSQL database that uses dynamic schemas and stores JSON-like documents. MongoDB is the most popular document store on the NoSQL market.
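
Here’s a minimal sketch of working with a document store from Python, using MongoDB’s pymongo driver. The database, collection, and field names are invented for the example, and it assumes a MongoDB server is running locally.

```python
# Storing and retrieving a JSON-like document with pymongo.
# Database, collection, and field names are invented for the example.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
reviews = client["demo_db"]["product_reviews"]

# Documents don't need a predefined schema; fields can vary per record.
reviews.insert_one({"product": "toaster", "stars": 4, "tags": ["kitchen"]})
match = reviews.find_one({"product": "toaster"})
print(match["stars"])
```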

technicalstuff Some people argue that the term NoSQL stands for Not Only SQL, and others argue that it represents Non-SQL databases. The argument is rather complex, and there is no cut-and-dried answer. To keep things simple, just think of NoSQL as a class of non-relational systems that do not fall within the spectrum of RDBMSs that are queried using SQL.

Data Engineering in Action: A Case Study

A Fortune 100 telecommunications company had large datasets that resided in separate data silos — data repositories that are disconnected and isolated from other data storage systems used across the organization. With the goal of deriving data insights that lead to revenue increases, the company decided to connect all of its data silos and then integrate that shared source with other contextual, external, non-enterprise data sources as well.

Identifying the business challenge

The Fortune 100 company was stocked to the gills with all the traditional enterprise systems: ERP, ECM, CRM — you name it. Slowly, over many years, these systems grew and segregated into separate information silos. (Check out Figure 2-3 to see what I mean.) Because of the isolated structure of the data systems, otherwise useful data was lost and buried deep within a mess of separate, siloed storage systems. Even if the company knew what data it had, it would be like pulling teeth to access, integrate, and utilize it. The company rightfully believed that this restriction was limiting its business growth.


FIGURE 2-3: Data silos, joined by a common join point.

To optimize its sales and marketing return on investments, the company wanted to integrate external, open datasets and relevant social data sources that would provide deeper insights into its current and potential customers. But to build this 360-degree view of the target market and customer base, the company needed to develop a sophisticated platform across which the data could be integrated, mined, and analyzed.

The company had the following three goals in mind for the project:

· Manage and extract value from disparate, isolated datasets.

· Take advantage of information from external, non-enterprise, or social data sources to provide new, exciting, and useful services that create value.

· Identify specific trends and issues in competitor activity, product offerings, industrial customer segments, and sales team member profiles.

Solving business problems with data engineering

To meet the company’s goals, data engineers moved the company’s datasets to Hadoop clusters. One cluster hosted the sales data, another hosted the human resources data, and yet another hosted the talent management data. Data engineers then modeled the data using the linked data format — a format that facilitates a joining of the different datasets in the Hadoop clusters.

After this big data platform architecture was put into place, queries that would traditionally have taken several hours to perform could be completed in a matter of minutes. New queries were generated after the platform was built, and these queries also returned results within a few minutes.

Boasting about benefits

The following list describes some of the benefits that the telecommunications company now enjoys as a result of its new big data platform:

· Ease of scaling: Scaling is much easier and cheaper using Hadoop than it was with the old system. Instead of increasing capital and operating expenditures by buying more of the latest generation of expensive computers, servers, and memory capacity, the company opted to grow wider, purchasing additional commodity servers and adding them to the cluster in a matter of hours rather than days.

· Performance: With their distributed processing and storage capabilities, the Hadoop clusters deliver insights faster and produce more insight for less cost.

· High availability and reliability: The company has found that the Hadoop platform provides data protection and high availability even as the clusters grow in size. Additionally, the Hadoop clusters have increased system reliability because of their automatic failover configuration, which automatically switches to redundant, backup data-handling systems if the primary system fails.