MICROSOFT BIG DATA SOLUTIONS (2014)

Part VI. Moving Your Big Data Forward

In This Part

· Chapter 15: Building and Executing Your Big Data Plan

· Chapter 16: Operational Big Data Management

Chapter 15. Building and Executing Your Big Data Plan

What You Will Learn in This Chapter

· Gaining Sponsor and Stakeholder Buy-in

· Identifying Technical Challenges

· Identifying Operational Challenges

· Monitoring Your Solution Post-deployment

Understanding that you need to look at your data infrastructure and reevaluate how you are handling certain sets of data today and in the future is one thing; actually identifying a project and then getting executive sponsor buy-in is another. In most organizations, many opportunities exist for improvement; it's often difficult to find the one opportunity that demonstrates what a new technology or approach can do for an organization while at the same time providing a reasonable return on investment (ROI). In this chapter, we walk through the challenges of building and executing your big data plan. Specifically, the chapter begins by looking at how to gain buy-in from sponsors and stakeholders. Then, you learn about technical and operational challenges and how to overcome them. The chapter concludes with a discussion about what you need to do after you've deployed your solution, including how to plan for its growth.

Gaining Sponsor and Stakeholder Buy-in

One of the most difficult things to do when evaluating a new technology is choosing the project. Those of us who come from a technical background tend to look at most projects from that technical angle. We ask ourselves, “Which one of these applications is a good fit for technology X?” The problem with this approach is that it doesn't align with business needs. That particular business group might not have the budget for any new projects or hardware. The business group may be perfectly happy with their current application and not want any significant changes. Or the business group may simply be resistant to change. We have to remember that, as technologists, we love to jump in and learn new tools. What you may see as a fascinating new product to learn, the business users often see as yet another tool they have to learn to get their job done. In this section, we discuss how to match technologies like Hadoop to business needs within your organization.

Problem Definition

In order to match big data solutions to a project, take a step back and look at the different business units in your organization. What are their current challenges and opportunities? Create a quick chart that maps this out. Talk to these groups about their data challenges. You will often hear many of the same statements coming from the groups who may benefit from big data technologies, such as the following:

· The business can't get access to all the relevant data; we need external data.

· We are missing the ETL window. The data we needed didn't arrive on time.

· We can't predict with confidence if we can't explore data and develop our own models.

· We need to parallelize data operations, but it's too costly and complex.

· We can't keep enough history in our enterprise data warehouse.

These are exactly the business units you want to speak with about leveraging big data solutions. The next step is to describe the solution to them without getting too technical. This step is all about the ROI to the business unit.

Many people hear about big data and assume that the first project that they take on must be an entirely new project solving a new problem with new data and new infrastructure. Although you can do this, it greatly increases your chance of failure. The reason for this is that there are too many unknowns. You are dealing with new technology (Hadoop, Hive, Pig, Oozie, etc.), new hardware architecture, new solution architecture, and new data. Every one of those unknowns will slow you down and create opportunities for mistakes. Instead, look around at existing solutions and ask yourself and the business group, “What new data elements would make this data and these reports more statistically accurate and relevant?” Whether you have current solutions dealing with fraud detection, churn analysis, equipment monitoring, or pricing analysis, additional data elements can likely be included from either inside or outside your organization that will improve your analytical capabilities.

Finally, it is important to understand that this endeavor is likely to be very visible within your organization. Almost every C-level executive has been reading about big data for the last several years and how it will solve many of their data issues, so they will have a great interest in the solution. The initiative isn't trivial, either, and the budget for the solution alone will garner interest from executives in the company. It is vital that your solution have the sponsorship of one of these C-level executives so that they can represent it to the rest of your company. They are your champion. They should be included in regular monthly updates on the status of the project. Inform them of your successes, but also of the roadblocks that are slowing the project down. Your executive sponsor can help rid your organization of those roadblocks to ensure timely completion of the solution.

As part of this process of defining the problem, it is vital to identify your business user population. Your sponsor can usually help direct you to them, but it is up to the solution architect to reach out to this population and interview them. The purpose of the interviews is to identify the key expected outputs of the solution. Of course, simply asking them what they want the solution to do is not enough. After all, they are probably not technologists and don't understand the art of the possible with big data. Instead, ask them probing questions, such as the following:

· What does their job entail?

· How do they measure success?

· What are their current challenges with analytics in their group?

The purpose of these queries is to isolate the questions they will want to ask of the data, and for you to determine whether the data you are collecting actually supports the business drivers for the project. As the architect, do not make the mistake of relying on hearsay; instead, go directly to the source to ensure that information has not been lost in translation.

Scope Management

Scope management is vital to the success of any project. From the beginning, you want to define the business needs of the project, data integration with other sources, data latency requirements, and delivery systems. One key aspect of scope management that often goes overlooked is the use of data profiling very early in the process to flush out a couple of key issues:

· Can the data support the project expectations?

· How clean is the data, and what steps can be taken to improve it at the source systems?

Additionally, you will need to identify the tools in the presentation layer to ensure you have the appropriate data models to support them. Are you going to query the data directly in Hadoop? If so, you need to ensure that each of your tools has a way to connect to Hadoop. Even if you establish connectivity to Hadoop, you may find that the end-user experience isn't as fast as you want it to be. Many architectures move some of the data off of Hadoop and onto a secondary system such as SQL Server or SQL Server Parallel Data Warehouse.
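
If you do offload data to SQL Server for faster end-user queries, Sqoop is one common way to move it. The following is a minimal sketch, not a prescription: the server, database, table, credentials, and HDFS path are hypothetical placeholders, and it assumes the Microsoft JDBC driver has been added to Sqoop's library directory.

# Minimal sketch: export a summarized Hive result set from HDFS into SQL Server.
# Server, database, table, user, and path names are hypothetical placeholders.
sqoop export \
  --connect "jdbc:sqlserver://sqlserver01:1433;databaseName=SalesMart" \
  --username etl_user -P \
  --table DailySalesSummary \
  --export-dir /user/hive/warehouse/daily_sales_summary \
  --input-fields-terminated-by '\001'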

Finally, the presentation layer can be determined. If you are using Power View, do you have the SharePoint infrastructure to host it? If not, are you going to build it? Determine this end-to-end architecture up front; it's vital to establishing the scope of the entire project, knowing what you are getting into, and understanding what kind of buy-in you'll be asking of executives.

Basing the Project Scope on Requirements

Developing the appropriate business requirements documentation is the first step in assigning the right scope to any project. All features of the project should be mapped back to business requirements. But isn't this big data, where we just collect the data in Hadoop and figure out what we want the data for later? No. Every project should have an initial goal; it's also more likely to get funded if the goal is already established. Base the project scope on these goals and objectives.

One key advantage of big data is that, through the schema-on-read approach, you may be collecting additional data from sources and storing it for longer periods of time so that you can find additional uses for it. Those uses will usually come out of additional use cases and future projects. For the purposes of the project being funded, though, stick with the basics and follow the plan.

Managing Change in the Project Scope

Change is inevitable, but that doesn't mean you should accept all changes. You need a standardized process by which to evaluate whether changes are needed to meet the business requirements. The risks of these changes should be evaluated and weighed along with the benefits before a change is accepted. A simple Excel spreadsheet listing each risk, a description of the risk, and what action is being taken on it is all that is needed. An example would be the adoption of a new technology such as Pig. The risk is that you plan on using Pig in your solution; the description might note that no one within the organization knows Pig Latin, so there is a risk of implementing it incorrectly; and the action being taken may be to send someone to training on Pig. Sometimes the risks will outweigh the benefits and prevent changes in scope from occurring. That said, don't let risk be an excuse to prevent a change in scope if it is necessary.

Stakeholder Expectations

Most likely, any executives you are working with to implement your new project have read countless stories during the past couple of years about how big data solves everything. This is what you are up against, and setting expectations from the outset is vitally important. Managing FEAR (false expectations about reality) will be a full-time job for someone on your team. As with all projects, this is done through thorough scoping of the project, communication of goals, timely delivery of milestones, and addressing and clearing any roadblocks that you encounter along the way.

To manage false expectations, you first have to identify them. Communication with the business sponsor about the goals and desired outputs of the project is key here. For example:

· Does the sponsor expect the project to increase sales 25%?

· If so, what are the underlying assumptions about the data that make him believe this number is possible?

· What are the acceptable variances above and below that number?

· Do all of these assumptions have to come true to make that number?

· If any one of these assumptions is incorrect, what happens to that number?

This is your opportunity to listen to what the sponsor really believes he is paying for. After you have finished listening to the sponsor, restate everything he said as you understand it. Does he agree with your understanding? If so, now you have a starting point for a discussion about reality.

A more formal way to gather these requirements is through a three-step process of elicit, confirm, and verify:

· Elicit: What are we going to build?

1. Outputs of the elicit phase are objects such as the business requirements document.

· Confirm: Is this what you asked for?

1. A customer-signed business requirements document.

· Verify: Did you get what you wanted?

1. A customer-signed business requirements document (if a change occurred).

Defining the Criteria for Success

The criteria for success will vary widely from one big data project to another. Many of the criteria are goal oriented, but some are more budget oriented. Getting the details of these criteria into your business requirements document is imperative for setting yourself up for victory. Table 15.1 shows an example of stating your project success criteria; a simple list such as this allows you to concentrate on the success criteria themselves without worrying about any additional metadata. Table 15.2 shows an example of listing the business objectives for the project along with some additional metadata about stakeholders, success criteria, and how to measure them. Both of these lists will provide valuable references as your project progresses.

Table 15.1 Examples of Project Success Criteria

Total project cost does not exceed initial budget by more than Z%.

Delivery schedule does not exceed the project timeline by more than Z%.

All high-priority items identified in the BRD are delivered in the first release.

Daily loads are complete by 6 a.m.

Table 15.2 Examples of Business Objectives and Metadata

Business Objective: Identify the Internet display ads that are most successful in driving customers to buy our products online.

Stakeholder: Marketing

Success Criteria:

Measurement:

Identifying Technical Challenges

No doubt you will encounter technical challenges while building and implementing your big data solution. This section will help you identify some of those challenges before you get started. We'll discuss environmental challenges that you will face and proposed solutions for them. We'll follow that up with additional resources, helping to ensure that your team's skillset is ready for implementing and supporting a big data environment.

Environmental Challenges

There are a few challenges specific to your environment that you will have to take into account. Two of them surround your data: planning for your data volume and for growth through incremental data changes will be vital to your success. Within that data, understanding the privacy laws that apply will be critical both to the success of the solution and to avoiding missteps with governing authorities.

Data Volume and Growth

One of the technical challenges you need to tackle up-front is the initial data volume and growth. You want to consider two important factors here:

· How much total data will comprise the initial project?

· What is the shape of the data files that are going to encompass the initial project?

The initial data volume of the project will give you an idea of the scope of your infrastructure. Second, considering the shape of the files is an important aspect of the initial scoping. Will you be working with a large number of small files or relatively few large files? Hadoop is designed to work with large chunks of data at a time; when files are smaller than the default 64MB block size, the number of map tasks required to complete any submission increases, which slows down each job, potentially significantly. If you are literally talking about thousands or millions of files, each very small (in the low-kilobyte range), you'll probably find it best to aggregate these files before loading them into Hadoop.
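
As a rough illustration of that aggregation step, the sketch below concatenates a day's worth of small log files into a single larger file before loading it into HDFS. The directory layout and file names are hypothetical, and a tool such as Flume (discussed in the next section) may be a better fit if the small files arrive as a continuous stream.

# Minimal sketch: roll many small daily log files into one larger file per day
# so that each file loaded into HDFS is closer to the block size.
# All paths shown are hypothetical placeholders.
for day in /data/weblogs/2014-03-*; do
  cat "$day"/*.log > /staging/weblogs_$(basename "$day").log
done
hadoop fs -mkdir /weblogs
hadoop fs -put /staging/weblogs_*.log /weblogs/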

Incremental Data

Loading data for any big data or data warehouse solution usually comes in two forms: first you perform the initial bulk load of history, and then you need an approach for ingesting incremental data. Once again, you need to consider the size of the files that you will be loading into the Hadoop Distributed File System (HDFS).

If the files are many but relatively small, consider using Flume to queue them up and write them as a larger data set. Otherwise, write your data to Hadoop from its source using HDFS's put command, much as you would write into a staging environment for a data warehouse. Write the full data set to Hadoop and then rely on your transform processes in Hadoop to handle duplicates, remove unwanted data, and perform any additional necessary transforms.
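
A minimal sketch of the put-based approach might look like the following, landing each day's extract in its own dated HDFS directory so that downstream Hive or Pig transforms can pick up only the new data. The directory names and file-naming convention are hypothetical placeholders.

# Minimal sketch: land today's incremental extract in a dated HDFS directory.
# Paths and the dt= naming convention are hypothetical placeholders.
LOAD_DATE=$(date +%Y-%m-%d)
hadoop fs -mkdir /data/sales/incoming/dt=$LOAD_DATE    # use -mkdir -p on Hadoop 2.x
hadoop fs -put /exports/sales_$LOAD_DATE.csv /data/sales/incoming/dt=$LOAD_DATE/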

Privacy Laws

The issue of privacy and big data is a large and diverse subject that could be a book by itself. These privacy issues cross country and cultural barriers. As you collect and store more data about your customers and augment it with additional ambient data from either third parties or public sources, you must be aware of the privacy laws that affect the customers whose data you are collecting.

Some data-retention laws require that you keep data for a certain period once you start collecting it. In the United States, laws such as the Health Insurance Portability and Accountability Act (HIPAA) affect the private health information that healthcare providers and insurance companies compile. HIPAA requires that the holder of this health information ensure its confidentiality and availability. Additional regulations such as the Gramm-Leach-Bliley Act require financial services companies to notify their customers, through a privacy notice, of how they collect and share their data. The European Union has its own set of much stricter and more comprehensive regulations designed to protect consumer information from unauthorized disclosure or collection of personal data.

As a collector of data, it is your responsibility to identify the privacy laws that apply to that data set and the consumers involved. Spend the time to identify the laws that affect the countries of your consumers, your specific industry, and where your data will be located. Understand what data should be de-identified through anonymization, pseudonymization, encryption, or key-coding. Taking the time to understand the laws and comply with them may be the difference between a successful big data project and one that not only gets your organization in hot water but could also be publicly damaging.

Challenges in Skillset

Data analysts usually have a much easier time adapting to the big data environment than database administrators (DBAs). DBAs have a long history of confining data to a particular environment: managing table spaces and foreign key relationships, applying indexes, and relying on lots of SQL tuning to solve performance issues. All mature relational database systems have extensive graphical user interfaces (GUIs) that make this job easier. Generally, this has made DBAs relatively lazy over the years. Why spend the time to script out an add-column statement when you can do it in 30 seconds in the GUI?

The framework that the Hadoop environment provides is very distributed, with many components that need to be learned before a team can get them working together in the final solution. A single solution may leverage HDInsight, Hive, Sqoop, Pig, and Oozie, and those technologies use a variety of different programming paradigms. (Hive-SQL or Pig Latin, anyone?) Don't expect to find any one person who can handle the entire ecosystem himself. This solution will likely require a team of individuals, and they will likely need some help getting ramped up on some or all of the technologies. If you are using existing staff with little to no experience with the Hadoop ecosystem, be prepared to get them some training and to provide ample time for ramp-up. You can expect several months of ramp-up time before your staff is comfortable enough with big data technologies to create an enterprise-ready solution.

Training opportunities abound. Hortonworks provides training for their Hortonworks Data Platform (HDP) on Windows (http://hortonworks.com/hadoop-training/hadoop-on-windows-for-developers/). This is a good place to start because they walk through the basics of Hadoop and Hortonworks Data Platform. In addition, they will walk through the ecosystem of C#, Pig, Hive, HCatalog, Sqoop, Oozie, and Microsoft Excel.

Coursera is another great source of applicable training for your employees on big data concepts and technology. Courses on linear algebra provide a good refresher or ramp-up on the concepts and methods of linear algebra and how to use them to think about computational problems that arise in computer science. Several statistics classes cover the principles of the collection, display, and analysis of data to draw valid and appropriate conclusions from that data. Other applicable courses cover machine learning, data mining, and statistical pattern recognition.

Finally, many universities offer graduate-level courses in big data and business analytics. Universities see the future need for workers able to traverse and understand large sets of data and so are providing the classes to build that workforce. Carnegie Mellon offers a Master of Information Technology Strategy with a concentration in big data and analytics (http://www.cmu.edu/mits/curriculum/concentration/bigdata.html). The MITS degree provides a multidisciplinary education that allows students to understand and conceptualize the development and management of big data information technology solutions.

Stanford University offers a Graduate Certificate on Mining Massive Data Sets (http://scpd.stanford.edu/public/category/courseCategoryCertificateProfile.do?method=load&certificateId=10555807). The four-course certificate teaches “powerful techniques and algorithms for extracting information from large data sets such as the Web, social-network graphs, and large document repositories.”

Identifying Operational Challenges

In this section, we'll cover what you need to do to plan for setup and configuration. In addition, we'll discuss what you need to do to plan for ongoing maintenance.

Planning for Setup/Configuration

An early decision that will need to be made is the quantity and quality of hardware on which to run your Hadoop cluster. Generally, Hadoop is designed to be built on commodity server hardware and JBODs (just a bunch of disks). That doesn't mean that you can run down to your local electronics store, buy a few cheap $800 servers, and be good to go. Commodity hardware is still server-class hardware; the point is that you won't need to spend tens of thousands of dollars per server. You generally purchase two classes of servers for your Hadoop cluster: one for the master nodes and a second for all the worker nodes.

The master server should have more redundancy built into it: multiple power supplies, multiple Ethernet ports, RAID 1 for the operating system LUN, and so forth. The master server also requires more memory than the worker nodes. Generally, you can start with 32GB of memory for the master server of a small cluster and grow that to as much as 128GB or more for a large cluster with more than 250 worker nodes.

The worker servers don't need the redundancy of the master server, but they need to be built with balance in mind. They need to be able to store the data you have planned for your Hadoop cluster, but they also need to be able to process it appropriately when it's time to query the data. You first need to consider how many disks you need and what size. Of course, this depends on the hardware vendor and the configuration of the server that you are purchasing, but after that, you will need to make an educated estimate of your needs.

The first thing to remember is that you will be replicating your data three times. Assuming that you are using the default replication factor of 3, if you need 100TB of space, you will need enough servers to store 300TB. But you aren't done yet. You also need temporary workspace for queries, which can be up to 30% of the drive capacity. Finally, try to always maintain 10% free disk space; history has taught us that when disks have less than 10% free space, performance suffers. Add those requirements up and you are at 420TB of required space. If each server you purchase can store 30TB of data (10 drives × 3TB), you need a minimum of 14 worker nodes for your Hadoop cluster.

The Compression Factor

You may have noticed that we did not consider a compression factor in the previous calculation. You will likely take advantage of compressing data in Hadoop, and Hadoop technologies like Hive and Pig handle various forms of compression very well. So, you may use Zip, RAR, or BZip compression, and a compression factor of 5 to 7 times is not unusual. This will reduce the number of worker nodes required to support your solution.

Required disk space = (Replication factor)(Total data TB)(1.4) / Compression factor
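
For example, plugging the numbers from the earlier 100TB scenario into this formula (and assuming, purely for illustration, a compression factor of 5 and the same 30TB-per-node configuration) shows how much compression changes the picture:

Required disk space (no compression) = (3)(100TB)(1.4) / 1 = 420TB, or 14 worker nodes at 30TB each

Required disk space (5× compression) = (3)(100TB)(1.4) / 5 = 84TB, or 3 worker nodes at 30TB each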

Next you need to consider CPU and memory. Each worker node should have, at a minimum, two quad-core CPUs running at least 2.5GHz. Hex- and octo-core processors should be considered for compute-heavy solutions. The newest chips are not necessary; the mid-level chips will generally give you the processing power you need without generating the heat or consuming the electricity of the most powerful chips. Generally speaking, each task that runs inside a worker node will require anywhere between 2GB and 4GB of memory, so a machine with 96GB of memory will be able to run between 24 and 48 tasks at any given time. Therefore, a system with 14 worker nodes would be able to run 336 to 672 tasks at any given time. Understanding the potential parallel computing requirements of your solution will help you determine whether this is sufficient.

Start the planning and building of your cluster assuming that you will begin with a balanced cluster configuration. If you build your cluster from the beginning with different server classes, different processors and memory, or different storage capacities, you will be spending an inordinate amount of time on resource utilization and balancing. A balanced configuration will allow you to spend less time administering your cluster and more time running awesome solutions that provide value to the line-of-business sponsoring the solution.

Planning for Ongoing Maintenance

There are a few tasks that you should be acquainted with in order to perform ongoing maintenance of a Hadoop cluster. In this section, we'll cover what you need to know in order to stop jobs, add nodes, and finally rebalance nodes if the data becomes skewed.

Stopping a Map-Reduce Job

One requirement for a Hadoop administrator is to start and stop map-reduce jobs. You may be asked to kill a job submitted by a user because it's running longer than the user expected. This might result from there being more data than expected, or perhaps the user is simply running an incorrect algorithm or process. When a job is killed, all processes associated with the job are stopped, the memory associated with the job is released, temporary data written to disk is deleted, and the originator of the job receives notification that the job has failed.

To stop a map-reduce job, complete the following steps:

1. From a Hadoop command prompt, run hadoop job -list.

2. Run hadoop job -kill jobid to kill the job.
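
For example, the following pair of commands lists the running jobs and then kills one of them; the job ID shown is a hypothetical placeholder, so substitute the actual ID reported by the -list output.

# List running jobs, note the ID of the runaway job, then kill it.
# The job ID below is a hypothetical placeholder.
hadoop job -list
hadoop job -kill job_201403150001_0007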

Adding and Removing Cluster Nodes

Usually data nodes are added because there is a need for additional data capacity, although it is entirely possible that the addition is in response to additional I/O bandwidth needs arising from additional computing requirements. Adding a data node is a quick online process: add the node to the cluster's configuration file and then refresh the node list:

hadoop dfsadmin -refreshNodes
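
As a rough sketch of that process on a Linux-based Hadoop 1.x cluster (file names and locations vary by distribution, and HDInsight manages this through its own tooling), the steps might look like the following; the host name and paths are hypothetical placeholders.

# On the master node: add the new host to the slaves file and, if you use one,
# to the dfs.hosts include file (file locations vary by distribution).
echo "workernode15" >> /etc/hadoop/conf/slaves
# Ask the NameNode to re-read its list of permitted data nodes.
hadoop dfsadmin -refreshNodes
# On the new worker itself: start the DataNode and TaskTracker daemons.
hadoop-daemon.sh start datanode
hadoop-daemon.sh start tasktracker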

Rebalancing Cluster Nodes

Hadoop nodes most often become unbalanced when new data nodes are added to a cluster. Those new data nodes receive as much new data as any of the other nodes, but they will never catch up to those nodes in the total amount of data stored on them unless you balance the stored data. To balance the stored data, you can run the following command from a Hadoop command prompt,

hadoop balancer -threshold N

where N is the percentage within which you want each node's utilization to fall. For example, if you want all the nodes to be within 5% of the cluster average (which means two nodes could still differ by as much as 10% in total data stored), run the following:

hadoop balancer -threshold 5

Once you've created a solution that will benefit your business, it's time to hand that solution off to your operations team so that it can operate the daily jobs, respond to user requests, and plan for future growth. In the next section, we'll discuss some of the work necessary to make that transition to the operations team effective.

Going Forward

After deploying your solution, you need to continue to monitor its performance and plan for its growth. You will learn much more about what to monitor in Chapter 16, “Operational Big Data Management.” For now, we'll limit our coverage to handing the solution off to operations and what needs to happen post-deployment.

The Handoff to Operations

The handoff to operations should be thoroughly thought about, discussed, and planned for long before the day comes to actually do it. In fact, the handoff to operations should be planned for from day 1 of the project. Questions you should be thinking about include the following:

· Does anyone in operations understand the big data paradigm?

· Does anyone in operations have experience with Hadoop?

· Who or what group is going to operate the solution?

· If the cluster fails at night, how do we notify and respond?

· If a job fails at night, how do we respond?

These questions, and others, should be discussed from day 1. Why so early? If your organization is spinning up its first big data/Hadoop team for development and no one in operations has experience with either, it will take them many months to come up to speed on big data and Hadoop so that they can effectively support it. Why months? Because they most likely have current duties that they are still responsible for, and other than a few weeks of specialized training, everything else they pick up will come from working part time with the project team. It is vital to the success of the project that you have a skilled and confident operations team ready to support the solution when it is ready to be deployed.

Before deployment of the solution, the development team should work with the operations team to develop a run book that documents the architecture of the solution and provides operations with the responses to specific and expected failures. Typical plans should account for the following:

· If a particular job fails, how to respond. What should one look for to see the state of the job?

· If a node fails, how to respond.

· If connectivity to the source system is down, whom to notify.

· What are the common error messages that your system will surface? What are the steps to resolve these error messages?

· How should the team proactively monitor the environment (for example, performance and space-usage trending)? A simple space-usage trending sketch follows this list.
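
As one small example of space-usage trending, the sketch below appends the cluster-level “DFS Used%” figure from hadoop dfsadmin -report to a log file on a Linux cluster node; scheduled daily from cron (or an equivalent scheduled task), it gives operations a simple growth trend to review. The log path is a hypothetical placeholder.

# Minimal sketch: record the cluster-wide DFS usage once a day for trend analysis.
# The log file location is a hypothetical placeholder.
echo "$(date +%Y-%m-%d) $(hadoop dfsadmin -report | grep -m 1 'DFS Used%')" >> /var/log/hadoop/space_trend.log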

Now that you have developed the run book, you can plan for the handoff to operations and place the solution in production.

After Deployment

During the first week after handing off to operations, the development team and operations need to work closely together to ensure proper knowledge transfer of the solution. Much of this should have been accomplished through the documentation of the solution and the creation of the run book, but we have all seen documentation go unread.

Many tasks need to be done to keep the Hadoop cluster healthy and ready to continue to accept more data and to process that data in an acceptable timeframe as defined by your service level agreements (SLAs). These tasks include the previously mentioned job monitoring, managing the various Hadoop-related logs, dealing with hardware failures, expanding the cluster, and upgrading the system software.

Summary

This chapter covered the basics of what you need to know in order to build and plan for your first big data solution. The core ideas are that you need sponsorship from the executive level from the beginning and that you need to plan and communicate throughout the process of building your solution. Executive, C-level sponsorship provides you with the influence you need to get multiple teams working together to build a new big data solution. Planning for technical and operational challenges requires thinking through what the final architecture will look like and documenting the challenges you will face getting there. Finally, we looked at what it will take to hand off the solution to operations, including developing a run book and monitoring the system for growth.