MICROSOFT BIG DATA SOLUTIONS (2014)

Part VI. Moving Your Big Data Forward

Chapter 16. Operational Big Data Management

What You Will Learn in This Chapter

· Building Hybrid Big Data Environments

· Integrating Cloud and On-premise Solutions

· Preparing for Disaster Recovery and High Availability in Your Big Data Environment

· Complying with Privacy Laws

· Creating Operational Analytics

Operationalizing your big data solution includes integrating multiple sources, providing a platform for analysis and reporting, and providing analytics for the solution so that you can monitor and improve the environment as needed. All of this requires planning and attention to detail.

This chapter focuses on many of those things that you need to do to make your big data implementation a success. The chapter first focuses on hybrid big data environments, where efficient data movement is key. You will then learn about the possibilities for backups and disaster recovery so that your solution can withstand catastrophes. You'll also take a look at privacy issues to ensure that you are always vigilant with personally identifiable information (PII). The chapter then delves into the analytics you will need to collect to ensure that your solution can grow with the additional demands put on it by growing data sets and user bases.

Hybrid Big Data Environments: Cloud and On-premise Solutions Working Together

The promise of working with an elastic-scale cloud solution such as HDInsight may be intriguing to many organizations. HDInsight in the cloud provides your organization with significant deployment agility. It also provides the infrastructure, high availability, and recoverability that would otherwise take months for most organizations to build themselves.

However, your organization may have hundreds or thousands of users who need access to that data for analysis. The result set of the processing you do on HDInsight may be better housed in the enterprise data warehouse sitting in your data center.

Another common scenario in a hybrid environment is when data is born on-premise and you want to take that transactional data and store it in the cloud for additional processing. To do that, you must move a significant amount of data from the data center to the cloud.

Getting these hybrid solutions to work well together is an integral step in the success of your hybrid environment. Throughout this chapter, we examine details of these two scenarios.

When customers ask about hybrid solutions, a common recommendation is to start with the assumption that data born in the cloud stays in the cloud and data born on-premise stays on-premise. This advice limits the data movement discussions and the complexity of a hybrid environment. It also just makes sense.

With the advent of phones and tablets and their applications, as well as other cloud-based applications, a significant amount of data is being born in the cloud today. The amount of data that requires analytics is exploding, and this data can be very large and quickly changing. Traditional on-premise solutions and organizations cannot easily handle this complexity and scale.

If possible, your organization should take advantage of cloud-based technologies for the additional required analysis. If you are using HDInsight and have a requirement to drive additional business intelligence (BI) reporting from the resulting data set, look into a Windows Azure SQL Database and Power BI Solution. With these, you can leverage your existing SQL Server development and administration skillset, but with the power, scalability, and cost-effectiveness of the cloud.

However, there may be a requirement that the resulting data set from HDInsight be hosted on-premise. This can be for a variety of reasons. You might have an existing enterprise data warehouse with which you want to integrate. Regulatory reasons might force you to host that data on-premise. Or you may not be able to overcome the internal politics of an organization not quite ready to host all data outside your organization. Whatever the reason, a hybrid solution is in the cards, and you will need to dive into the details of the architecture.

A great deal of data is still being born on-premise within organizations. These organizations may have identified an opportunity to use Hadoop for their data but don't have the capability to host their own Hadoop solution. This scenario may happen for a couple of reasons. First, they might not have the required capital to set up a decent-sized Hadoop cluster. Second, they might not have the staff to do it. This may be especially true if executive management hasn't yet fully come on board with the project.

The game has changed, and you are working through a real paradigm shift. Your architectures will be much more complex in the short term as you struggle to handle an ever-increasing amount of data and analytics you need to apply to that data.

Ongoing Data Integration with Cloud and On-premise Solutions

When developing a hybrid solution, data movement becomes an important consideration. You will be tasked with moving data to and from the cloud, and you will need a process that moves that data efficiently, whether you are populating a cloud solution with on-premise data or bringing results from the cloud back to your on-premise environment. You will have three main considerations:

1. What tool to use to move the data to and from the cloud

2. What compression codec to use

3. Whether a hybrid approach is actually the right approach (that is, whether you should look at going all cloud or all on-premise)

You can use a number of tools to move data between your on-premise solution and the cloud. The recommended approach is to use the data movement tools that Microsoft has built into the Windows Azure HDInsight PowerShell scripts. Using PowerShell may be the most straightforward way of moving data from your on-premise environment to Windows Azure blob storage. The first thing you need to do is install Windows Azure PowerShell and then Windows Azure HDInsight PowerShell.

NOTE

Follow the instructions at the following website to get Windows Azure HDInsight PowerShell installed and configured for your environment: http://www.windowsazure.com/en-us/manage/services/hdinsight/install-and-configure-powershell-for-hdinsight/.
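Once both modules are installed, it is worth confirming that they load and that your subscription is connected before running any data movement scripts. The following is a minimal sketch; the publish-settings file path is a placeholder for your own download, and the subscription name matches the one used in the upload script that follows:

# Confirm the Azure module is available and load it
Get-Module -ListAvailable Azure
Import-Module Azure

# Connect your subscription (download the file first with Get-AzurePublishSettingsFile)
Import-AzurePublishSettingsFile "C:\Azure\mysubscription.publishsettings"

# Verify the subscription is registered and select it
Get-AzureSubscription
Select-AzureSubscription "msdn"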

For example, the following script will upload the compressed 1987.csv.bz2 file from my local drive to the airlinedata container with a blob name of 1987. The first set of parameters provides the details for where to put the data in Windows Azure, and the second set tells the script what data to put there. We then retrieve the storage account key, create a storage context, and copy the data to Windows Azure using the Set-AzureStorageBlobContent command:

#Configure Azure Blob Storage Parameters
$subscriptionName = "msdn"
$storageAccountName = "sqlpdw"
$containerName = "airlinedata"
$blobName = "1987"

#Configure Local Parameters
$fileName = "C:\Users\brimit\SkyDrive\AirStats\1987.csv.bz2"

# Get the storage account key
Select-AzureSubscription $subscriptionName
$storageaccountkey = Get-AzureStorageKey $storageAccountName | %{$_.Primary}

# Create the storage context object
$destContext = New-AzureStorageContext -StorageAccountName $storageAccountName `
    -StorageAccountKey $storageaccountkey

# Copy the file from local workstation to the Blob container
Set-AzureStorageBlobContent -File $fileName -Container $containerName `
    -Blob $blobName -Context $destContext
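Pulling a result set back down to your on-premise environment works the same way in reverse. The following sketch reuses the storage context created in the script above; the blob name and local destination are hypothetical placeholders for whatever output your HDInsight jobs produce:

# Copy a result blob from the container back to the local workstation
Get-AzureStorageBlobContent -Blob "results/part-r-00000" -Container $containerName `
    -Destination "C:\HDInsightResults\airline-results.txt" -Context $destContext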

Integration Thoughts for Big Data

Your big data solution should not exist in isolation. To get the most benefit out of the solution, it should not be built for just a couple of data scientists who use it for analysis and publish sporadic reports of their findings. If it does exist in isolation, you are not extracting the value that it contains.

Your big data solution needs to be integrated into the rest of your analytics and BI infrastructure. When you integrate your solution with your existing data warehouse and BI infrastructures, more corners of your organization can learn to gain insights from the additional data brought into your organization.

NOTE

Don't worry, you've done data integration before. Your big data solution is now just another data source, albeit with possibly more data, different data types, and new structures that you've never dealt with before. But those are challenges, not barriers.

You likely already have a data warehouse environment that is a subject-oriented reporting environment tightly constrained around dimensional data that drives reports your organization relies on. The data that lives in this data warehouse is likely driven from your transactional environments.

Your new big data environment now houses additional data that likely wasn't cost-effective to store in your existing data warehouse (perhaps because the data was either too big or too complex for efficient storage). Most solutions will include a subset of subject matter experts who hit this data directly, using the skills learned in this book, to extract new insights. But inside this data is a subset that could augment what is already stored in your existing data warehouse and potentially make it much more interesting and valuable to your traditional BI analyst. Our goal is to identify this data and provide an integrated solution for your organization.

These hybrid solutions, shown in Figure 16.1, including Hadoop, a data warehouse, and a BI environment, are the near-future state of a typical analytics environment. Although it makes sense to store log files, social network feeds, user location data, telemetry data, and other big data sources in Hadoop, Hadoop isn't the only environment that can benefit from this data. Your traditional data warehouse can take in some of this data to enhance its value.

image

Figure 16.1 The modern data warehouse

You might be thinking, “Okay, this sounds good, but how do I know which data should be integrated into my current data warehouse?” Well, this is where you trust your data scientists who have access to the data in Hadoop and the subject matter experts who explore your current data warehouse solution daily.

The data scientists will explore the data and look for relationships among the new data being collected. They will also most likely be extracting some of your subject-oriented warehouse data into Hadoop to help augment their data sets. As they explore the data and find relationships in it, they will identify what is useful to the organization. Sometimes a finding is useful only once: the analysis is done, the report is written, and lessons are learned. Many other times, though, they will find insights in the data that can be reused and watched for on a regular basis. It is this reusable data that you want to operationalize into your solution.

After the data has been identified, you want to operationalize the solution. This means building the extract, transform, and load (ETL) layer to move data from Hadoop to your data warehouse. Here is where we will likely deviate from your traditional processes of using a tool like SQL Server Integration Services (SSIS) to do the ETL. The reason for this is that we have already loaded the data onto an immensely powerful data processing engine named Hadoop. Therefore, a recommended approach is to do the transform directly on Hadoop and stage the data in Hadoop in the form needed to move to your data warehouse solution.

This is where Pig proves to be incredibly powerful and the right tool for the job. A Pig script that takes your semistructured data, moves it through a data flow, and creates a structured staging table is an efficient way to take advantage of your 8-, 32-, or 128-node Hadoop cluster. This is certainly more efficient than pulling all of that data off of Hadoop onto a single SSIS server to transform the data for inserting into the data warehouse. Once the data has been staged in Hadoop, you can use Sqoop to move the data directly from Hadoop into the data warehouse tables where the data is needed.
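As a rough illustration of that pattern, a short Pig script might filter and aggregate raw log records into a staging layout, and a Sqoop export could then push the staged rows into the warehouse. The field names, paths, table name, and connection string below are hypothetical placeholders, not part of any sample solution in this book:

-- stage_user_activity.pig: shape raw log records into a structured staging table
raw = LOAD '/data/weblogs' USING PigStorage('\t')
      AS (eventtime:chararray, userid:chararray, url:chararray, ms:int);
by_user = GROUP raw BY userid;
summary = FOREACH by_user GENERATE group AS userid,
          COUNT(raw) AS pageviews, AVG(raw.ms) AS avg_response_ms;
STORE summary INTO '/staging/user_activity' USING PigStorage(',');

A Sqoop export can then load the staged files into the warehouse table:

sqoop export --connect "jdbc:sqlserver://dwserver:1433;databaseName=EDW" --username etl_user --password [password] --table UserActivity --export-dir /staging/user_activity --input-fields-terminated-by ','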

Backups and High Availability in Your Big Data Environment

How well can your organization withstand the loss of your big data environment? Do you need 100% uptime? What happens if a natural disaster affects your data center, making it no longer viable? These are questions every IT shop should be asking itself when building their big data environment. The answers to these questions will influence your high-availability and disaster recovery solutions.

High Availability

High availability is the approach taken to ensure that service level availability will be met during a certain period of time. If users cannot access the system, it is unavailable. Thus, any high-availability approach that you design for your big data solution should ensure that your users can access the system up to and beyond the service levels you have determined are appropriate for that system.

The service level availability that you determine will have a significant impact on your high-availability design. Some organizations will determine that they need 99.9% uptime, whereas others will be primarily using their big data solution for reporting and analysis and will come to a number closer to 95% being appropriate for them. Although 4.9% doesn't sound like a big difference, it amounts to almost 18 days of downtime per year (0.049 × 365 days ≈ 17.9 days), compared to under 9 hours per year at 99.9%.

There are two different types of downtime that you must be concerned with: scheduled and unscheduled downtime. Scheduled downtime occurs when you schedule maintenance on your system, for example, to apply a software patch that requires a reboot or to upgrade your system software.

In contrast, unscheduled downtime arises from unplanned events. It could occur for many reasons, from something as mundane as a power outage to something as significant as a security breach such as a virus or other malware. Other causes include network outages, hardware failures, and software failures.

For scheduled maintenance, you will need to shut down the NameNode service. To ensure the controlled shutdown of the service, follow these steps:

1. Shut down the NameNode monitoring agent:

service hmonitor-namenode-monitor stop

2. Shut down the NameNode process:

service hadoop-namenode stop

When maintenance is complete, follow these steps:

1. Start the NameNode process:

service hadoop-namenode start

2. Start the NameNode monitoring agent:

service hmonitor-namenode-monitor start

3. Verify that the monitoring process has started:

service hmonitor-namenode-monitor status

An important concept to understand is that Hadoop has significant high-availability and robustness features built in to it to withstand unscheduled downtime. Hadoop is built with the understanding that hardware does fail. One of the architectural goals of the Hadoop Distributed File System (HDFS) is the automatic recovery from any failures. With hundreds of servers in your big data solution, there will always be some nonfunctional hardware, and the architecture of HDFS is intended to handle these failures gracefully while repairs are made. The three places of concern are DataNode failures, NameNode failures, and network partitions.

A DataNode may become unresponsive for any number of reasons: a hardware failure such as a failed motherboard or hard drive, a corrupted replica, or any of the many other ways equipment fails. Each DataNode sends a heartbeat to the NameNode on a regular basis. If the NameNode does not receive a heartbeat message, it marks the DataNode as dead and does not send any new data requests to it. Any data that was on that DataNode is no longer available to HDFS, and the replication factor of that data will likely fall below the specified value, which kicks off re-replication of that data to other DataNodes.
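If you want to watch this machinery at work, Hadoop ships a couple of command-line health checks you can run against the cluster. The commands below reflect the Hadoop 1.x tooling used with HDP at the time of this writing; output formats vary by version:

# Report live and dead DataNodes along with per-node capacity and usage
hadoop dfsadmin -report

# Summarize HDFS health, including under-replicated, corrupt, and missing blocks
hadoop fsck /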

Another reason Hadoop clusters become nonresponsive is due to NameNode failures. NameNodes are most likely to fail because of misconfiguration and network issues. This is similar to our collective experience with Windows cluster failovers, where it is usually not hardware issues that cause failovers but external issues such as Active Directory problems, networking, or other configuration issues. Pay heed to these external influences as you diagnose failures within your environment.

In this section we examined some basics of planning for high availability. In the next section, we dive into what happens if something does go wrong: preparing for and dealing with disaster recovery.

Disaster Recovery

By now, you should be aware that Hadoop has many high-availability features built in to help it withstand failures. As you probably know by this point in your career, despite all the features included in products and all the planning we can do, disasters happen. When they do, your service level agreements with your user community will help determine what your recovery strategy will be.

For example, if you incur a complete data center loss because of a natural disaster, how important will it be to recover your Hadoop infrastructure? It may be very low on your priority list, behind many other applications in your organization. Applications such as the transactional systems that run your business and create revenue will likely take priority. In addition, you might have many other customer-facing applications that need to be up before you need to have your big data solution up. You will likely also have applications that your internal customers need to have available to service customers; call centers and financial reporting come to mind. Your organization may have other applications such as your e-mail system that needs to be up and running so that communication may happen. Next, you might have a data warehouse and BI environment that needs to be available to drive internal reporting of month-end processes. Finally, your big data solution may need to be re-created within a few weeks to drive your analytical decisions.

If you need to be back up and running within a day, however, you need a different disaster recovery plan. You can back up your Hortonworks Data Platform (HDP) clusters to an external backup Hadoop cluster using DistCp. DistCp stands for distributed copy and is a bulk data movement tool. By invoking DistCp, you can periodically transfer HDFS data sets from the active Hadoop cluster to the backup Hadoop cluster in another location.

To invoke DistCp, run the following command:

hadoop distcp hdfs://nn1:8020/foo/bar hdfs://nn2:8020/bar/foo

DistCp expands the namespace under /foo/bar on nn1 into a temporary file, partitions its contents among a set of map tasks, and starts a copy on each TaskTracker, copying from nn1 to nn2.

You must use absolute paths with DistCp. Each TaskTracker must be able to communicate with both the source and destination systems. It is recommended that you run the same protocol version on both systems. The source system should be quiesced before invoking DistCp; if clients are writing to the source system while DistCp is running, the copy will fail.

Table 16.1 describes some of the options for DistCp, and an example that combines several of them follows the table.

Table 16.1 Import Options for DistCp

· -p[rbugp]: Preserves r (replication number), b (block size), u (user), g (group), p (permission)

· -i: Ignores failures

· -log <logdir>: Writes logs to <logdir>

· -m <num_maps>: Maximum number of maps used to copy the data

· -update: Overwrites the destination file if its size differs from the source; the default is not to overwrite a file that already exists at the destination
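For example, a nightly refresh of a backup cluster might combine several of these options. The cluster addresses, paths, and log directory here are placeholders:

hadoop distcp -update -prbugp -m 20 -log /backup/distcp_logs hdfs://nn1:8020/data/warehouse hdfs://nn2:8020/data/warehouse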

Big Data Solution Governance

As your big data environment develops, you must mature your ability to govern the solution. Preferably, you will have a team dedicated to governing the environment. What does governance mean in this instance? These team members are the gatekeepers of Hadoop. Specifically, this team manages the rules of engagement for your Hadoop clusters. The kinds of questions that they will answer include the following:

· What are the use cases that go onto the cluster?

· How do users apply for access to the cluster?

· Who owns each set of data?

· What are the user quotas?

· What are our regulatory requirements around information life cycle management?

· What are our security and privacy guidelines?

· How do external systems interact with the system?

· What should be audited and logged?

If there are new questions about the rules of engagement for the cluster, the big data governance team will answer those questions and set up the rules. Essentially, when it comes to big data, this governance team owns the roadmap for your IT organization to get it from its current state to some preferable future state. The governance team is needed because there is a lot of change occurring around data in general and big data in particular. Having a team focused on where your organization needs to go allows the operations team to focus on the current state and on implementing the rules established by the governance team.

The role of the operations team is to implement these rules. The operations team audits and logs the information the governance team deems important, enforces the security and privacy guidelines the governance team outlines, implements the quotas determined for users (one concrete example follows), and provides access to approved users. Ideally, you want different people on the operations team and on the governance team.
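For instance, HDFS name and space quotas are one concrete control the operations team can apply once the governance team sets the limits. The directory and limits below are hypothetical, and the commands reflect the Hadoop 1.x tooling used with HDP:

# Limit a project directory to 1 million names and 10 TB of raw space
hadoop dfsadmin -setQuota 1000000 /user/marketing
hadoop dfsadmin -setSpaceQuota 10t /user/marketing

# Review current usage against the quotas
hadoop fs -count -q /user/marketing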

Creating Operational Analytics

Hadoop is a complex distributed system, and monitoring it is not a trivial task. A variety of information sources within Hadoop provide for monitoring and debugging the various services. To create an operational analytics solution, you must collect these monitors and store them for correlation and trending analysis.

A solution included with HDP is the HDP monitoring dashboard. This solution uses two monitoring systems, called Ganglia and Nagios, to combine certain metrics provided by Hadoop into graphs and alerts that are more easily understood by administrators and managers alike. Using the HDP monitoring dashboard, you can communicate the state of various cluster services and also diagnose common problems.

When it comes to monitoring Hadoop, in many respects it differs little from monitoring a database management system. I like to refer to several system resources as the canaries in the coal mine. These resources (CPU, disk, network, and memory) are vital to the performance of any system retrieving, analyzing, and moving data. The utilization of these resources, both at a point in time and as a trend over time, provides valuable insight into the state and health of a system.

For Hadoop, we want to monitor these values for both the NameNode and DataNodes. For the NameNodes, we want to store and analyze the individual values. For DataNodes, it is usually good enough to report on the aggregated values of all DataNodes so that we can understand overall cluster utilization and look for one of these particular resources causing a performance bottleneck because of overutilization. In particular, you should monitor the following (a quick PowerShell spot check appears after the list):

· % CPU utilization and periodic (1-, 5-, 15-minute) CPU load averages

· Average disk data transfer rates (I/O bandwidth) and number of disk I/O operations per second

· Average memory and swap space utilization

· Average data transfer rate and network latency
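On HDP for Windows nodes, one lightweight way to spot-check these canaries outside of a full monitoring system is the built-in Get-Counter cmdlet. The sample interval and count below are arbitrary:

# Sample CPU, memory, disk, and network counters every 5 seconds for one minute
Get-Counter -Counter @(
    '\Processor(_Total)\% Processor Time',
    '\Memory\Available MBytes',
    '\PhysicalDisk(_Total)\Disk Bytes/sec',
    '\Network Interface(*)\Bytes Total/sec'
) -SampleInterval 5 -MaxSamples 12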

System Center Operations Manager for HDP

At the time of this writing, HDP for Windows does not include interfaces for open source monitoring services such as Ganglia or Nagios. These services are designed to consolidate information provided by Hadoop into a more centralized and meaningful summary of services-related statistics in graphs and alerts. On the other hand, Microsoft and Hortonworks have collaborated to provide an Ambari System Center Operations Manager (SCOM) solution. With this solution, SCOM can monitor availability, capacity, and the health of the cluster and provide you valuable metrics, graphs, and key performance indicators. In this section we'll look at the overall capabilities of the Ambari MP for SCOM, walk through the installation of the product, and finish by examining specific monitoring scenarios.

The Ambari project is aimed at making it easier to manage a Hadoop cluster. Ambari provides an interface and REST APIs for provisioning, managing, and monitoring a Hadoop cluster; its interface is typically a Hadoop management web UI powered by those REST APIs. The Management Pack for SCOM leverages those REST APIs to provide the same monitoring capabilities as the Ambari web UI, but in a familiar enterprise solution like SCOM. Ambari SCOM will first automatically discover all nodes within a Hadoop cluster, then proactively monitor availability, capacity, and health, and finally provide visualizations for dashboards and trend analysis.

The Ambari SCOM architecture is made up of several discrete installations across your environment. It includes the Ambari SCOM Management Pack, which extends SCOM to monitor Hadoop clusters. The Ambari SCOM Server component is installed inside the Hadoop cluster to monitor Hadoop and provide the REST API interface for the SCOM management pack. A database is created on a SQL Server 2012 instance for storing the Hadoop metrics that you collect for SCOM. Additional components include a ClusterLayout ResourceProvider, so that Ambari can read your clusterproperties.txt file and automatically understand your cluster layout for configuration; a Property ResourceProvider that interfaces between the Ambari SCOM Server and SQL Server; and a SQLServerSink that consumes metrics from Hadoop and stores them in SQL Server. At the end of the installation, you will have a solution, such as the one shown in Figure 16.2, that monitors Hadoop, alerts you on certain faults, and takes performance metrics, aggregates them, and stores them in SQL Server and SCOM.

image

Figure 16.2 Ambari SCOM High Level Architecture

Installing the Ambari SCOM Management Pack

Installing the Ambari SCOM Management Pack is much like installing HDP 1.3 in that there are a number of steps involved in a number of locations and you must follow directions very closely. Take your time and be cautious with this installation. You'll be:

· Configuring SQL Server

· Installing the Hadoop Metrics Sink

· Installing and configuring the Ambari SCOM Server in your cluster

· Installing and configuring the Ambari SCOM Management Pack in SCOM

We are making an assumption that you already have a SCOM Server set up within your environment to take advantage of the Management Pack. If you don't, you can download a preview virtual machine from Microsoft: http://www.microsoft.com/en-us/download/details.aspx?id=34780.

To get started, download the Ambari SCOM Management Pack here: http://hortonworks.com/products/hdp-windows/#add_ons.

Configuring SQL Server

First, you will need to configure a SQL Server for the Ambari SCOM database. If you don't have a SQL Server, you will first need to do the installation and come back to this series of steps. The steps involved for configuring SQL Server are:

1. Configure SQL Server to use “mixed mode” authentication. Open SQL Server Management Studio and connect to SQL Server. Right-click on the server and choose Properties. Change Server Authentication to SQL Server and Windows Authentication mode. Click OK.

2. Confirm SQL Server is enabled to use TCP/IP and is active. Open SQL Server Configuration Manager and choose SQL Server Network Configuration. Drill down to Protocols for MSSQLServer and find TCP/IP over to the right. If it is disabled, double-click on TCP/IP and enable it. Then follow the additional instructions about restarting the SQL Server service.

3. Create a username and password for the Ambari SCOM MP to use to capture metrics and store them in SQL Server (a T-SQL sketch follows this list).

4. Extract the contents of the metrics-sink.zip package and open the Hadoop-Metrics-SQLServer-CREATE.ddl script in SQL Server Management Studio. Alternatively you can open it in Notepad and copy and paste the script into a New Query window in SQL Server Management Studio.

5. Execute the DDL script from Step 4 to create the Ambari SCOM database called HadoopMetrics.
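Step 3 can be done through the SQL Server Management Studio UI, or with a few lines of T-SQL such as the following sketch. The login name, password, and role memberships are placeholders; adjust them to your own security standards, and run the database-level statements only after the DDL from step 5 has created HadoopMetrics:

-- Create a SQL login for the metrics sink (step 3)
CREATE LOGIN hadoop_metrics WITH PASSWORD = 'UseAStrongPasswordHere!1';
GO
-- After the HadoopMetrics database exists (step 5), map the login into it
USE HadoopMetrics;
GO
CREATE USER hadoop_metrics FOR LOGIN hadoop_metrics;
ALTER ROLE db_datareader ADD MEMBER hadoop_metrics;
ALTER ROLE db_datawriter ADD MEMBER hadoop_metrics;
GO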

Installing the Hadoop Metrics Sink

In these next steps, we will prepare the Hadoop Metrics Sink files to be used later in the process.

1. Create a c:\Ambari folder on each host within the cluster.

2. Retrieve the Microsoft JDBC Driver 4.0 for SQL Server sqljdbc4.jar file from here (http://www.microsoft.com/en-us/download/details.aspx?id=11774). Download the Linux version of the driver, extract it within the Downloads folder, and then find the sqljdbc4.jar file in the extraction. Copy this sqljdbc4.jar file to each node in the folder created in step 1 (c:\Ambari).

3. Retrieve the metrics-sink-version.jar file from the metrics-sink.zip package extracted during the SQL Server configuration and place it on each node in the folder created in step 1. Your c:\Ambari folder should now look like Figure 16.3, with only these two files.

image

Figure 16.3 Ambari folder on each node in the cluster
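If you have more than a handful of nodes, you may prefer to push the two files out from a single machine rather than copying them by hand. The following sketch uses administrative shares; the node names, source folder, and metrics-sink version are placeholders to adjust for your environment:

# Push the JDBC driver and metrics sink jar to c:\Ambari on every cluster node
$nodes = "hdpnode1", "hdpnode2", "hdpnode3"   # replace with your host names
foreach ($node in $nodes) {
    New-Item -ItemType Directory -Path "\\$node\c$\Ambari" -Force | Out-Null
    Copy-Item "C:\Downloads\sqljdbc4.jar" -Destination "\\$node\c$\Ambari"
    Copy-Item "C:\Downloads\metrics-sink-1.2.5.0.9.0.0-60.jar" -Destination "\\$node\c$\Ambari"
}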

Now you must set up the Hadoop Metrics2 interface on each node in the cluster. This allows Hadoop to use the SQLServerSink and send metric information from your Hadoop cluster to SQL Server.

1. Edit the hadoop-metrics2.properties file on each node in the cluster, located by default in the {C:\hadoop\install\dir}\bin folder. On your single node cluster from Chapter 3, "Setting Up for Big Data with Microsoft," this location is c:\hdp\hadoop\hadoop-1.2.0.1.3.0.0-380\bin\hadoop-metrics2.properties. Replace [server], [port], [user], and [password] with the SQL Server name and port that you configured earlier, along with the username and password that you created for access to the HadoopMetrics database. Your hadoop-metrics2.properties file should look similar to Figure 16.4 when you are done:

*.sink.sql.class=org.apache.hadoop.metrics2.sink.SqlServerSink
namenode.sink.sql.databaseUrl=jdbc:sqlserver://[server]:[port];databaseName=HadoopMetrics;user=[user];password=[password]
datanode.sink.sql.databaseUrl=jdbc:sqlserver://[server]:[port];databaseName=HadoopMetrics;user=[user];password=[password]
jobtracker.sink.sql.databaseUrl=jdbc:sqlserver://[server]:[port];databaseName=HadoopMetrics;user=[user];password=[password]
tasktracker.sink.sql.databaseUrl=jdbc:sqlserver://[server]:[port];databaseName=HadoopMetrics;user=[user];password=[password]
maptask.sink.sql.databaseUrl=jdbc:sqlserver://[server]:[port];databaseName=HadoopMetrics;user=[user];password=[password]
reducetask.sink.sql.databaseUrl=jdbc:sqlserver://[server]:[port];databaseName=HadoopMetrics;user=[user];password=[password]

image

Figure 16.4 Completed hadoop-metrics2.properties file

2. Next, you need to update the Java classpath for each Hadoop service to include the metrics-sink.jar and sqljdbc4.jar files. The service configuration files are in the {c:\hadoop\install\dir}\bin folder; on your single node cluster, that is c:\hdp\hadoop\hadoop-1.2.0.1.3.0.0-380\bin\. You will need to update the historyserver, tasktracker, jobtracker, datanode, namenode, and secondarynamenode files with the following information:

<arguments>… -classpath …;C:\Ambari\metrics-sink-1.2.5.0.9.0.0-60.jar;C:\Ambari\sqljdbc4.jar</arguments>

In other words, add both the metrics-sink and sqljdbc4 paths to the classpath so that they are locatable by the service. You may have more success putting them at the front of the classpath rather than appending them. One way to find the correct files to update is to search for *.xml files within the bin folder; each one of these files needs to be updated. Also, pay attention to the metrics-sink version number and change it as appropriate; you may have a newer metrics-sink file version than in the Ambari documentation or listed here in this chapter. Finally, if you have a multiple node cluster, don't forget to do this on each node in the cluster.

3. Restart Hadoop. From a command prompt, run stop_local_hdp_services.cmd and then run start_local_hdp_services.cmd. If your cluster is more than one node, be sure to also run stop_remote_hdp_services.cmd and start_remote_hdp_services.cmd.

If everything was successful, all your services will start back up as expected.

4. To verify metrics collection, connect to the SQL Server instance hosting the HadoopMetrics database and run the following query:

SELECT * FROM HadoopMetrics.dbo.MetricRecord

If you are successfully collecting metrics, you will have a result set from this query.

Installing and Configuring the Ambari SCOM Server

Now it is time to install and configure the Ambari SCOM Server within your Hadoop cluster:

1. Determine which node in the cluster will run the Ambari SCOM Server. In our single node cluster from Chapter 3, we'll install it along with the other services. In an enterprise cluster, you will want to install it on a separate node within the cluster.

2. Create a new folder: c:\ambari-scom.

3. Extract the server.zip file from the original downloaded zip file into the c:\ambari-scom\ folder. There are three packages in this file that we will install.

4. Extract the ambari-scom-server-version-lib.zip contents locally.

5. Extract the ambari-scom-server-version-conf.zip contents locally.

6. Open the Ambari.properties file within the configuration folder you extracted in step 5. Update the server, databasename, user, and password to the information for your SQL Server hosting the HadoopMetrics database.

7. Open a command prompt and run the org.apache.ambari.scom.AmbariServer class to start the Ambari SCOM Server.

You may need to make a few changes to the following code based on where you have installed components. You may want to type it in Notepad first, before cutting and pasting it into the cmd prompt:

java -server -XX:NewRatio=3 -XX:+UseConcMarkSweepGC ^
  -XX:-UseGCOverheadLimit -XX:CMSInitiatingOccupancyFraction=60 ^
  -Xms512m -Xmx2048m ^
  -cp "C:\Users\Administrator\Downloads\hdp-1.3.0.0-GA\hdp-1.3.0.0-GA;c:\ambari-scom\server\conf;c:\ambari\sqljdbc4.jar;c:\ambari-scom\server\ambari-scom-server-1.2.5.0.9.0.0-60.jar;c:\ambari-scom\server\lib\*" ^
  org.apache.ambari.scom.AmbariServer

8. Verify that the Server API is working by browsing to this URL from the Ambari SCOM server: http://localhost:8080/api/v1/clusters. You should receive a page like Figure 16.5. If you receive any error messages, stop here and pay close attention to your paths in step 7.

image

Figure 16.5 Verifying the Server API
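If you prefer to verify the API from a script or from another machine, the same check can be made with a REST call. This sketch assumes PowerShell 3.0 or later and the default admin/admin credentials mentioned later in the Run As account steps; adjust the host name and credentials for your environment:

# Query the Ambari SCOM Server REST API with basic authentication
$pair  = "admin:admin"
$token = [Convert]::ToBase64String([Text.Encoding]::ASCII.GetBytes($pair))
Invoke-RestMethod -Uri "http://localhost:8080/api/v1/clusters" `
    -Headers @{ Authorization = "Basic $token" }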

9. Within the previously extracted ambari-scom folder, find and run the ambari-scom.msi installer. The Ambari SCOM setup dialog appears. Fill it in as appropriate and click Install. An example is shown in Figure 16.6 with the Start Services option checked.

image

Figure 16.6 Ambari SCOM setup configuration

Once the install is complete, there are new links on the desktop for Start Ambari SCOM Server, Browse Ambari API, and Browse Ambari API Metrics. Go ahead and start the Ambari SCOM Server by clicking the desktop shortcut.

Installing and Configuring the Ambari SCOM Management Pack

You are now ready to install the SCOM Management pack on the SCOM server. In the next steps, you will import the Management Pack, create the Run As Account, and finally configure the Management Pack.

1. On the server running Windows Server 2012, SCOM, and SQL Server, extract the mp.zip file from the ambari-scom-version folder from the earlier steps.

2. Open System Center Operations Manager, click on Administration, and from the Tasks portal, under Actions, choose Import Management Packs.

3. Within the Import Management Packs dialog box, choose Add → Add from Disk. If prompted to search the online catalog, choose No. Browse to where you extracted the mp folder. Choose all three files by clicking the first and then Shift-clicking the last. After all are chosen, click Open. This brings you back to the Import Management Packs page with your three MPs added, as shown in Figure 16.7. Choose Install.

image

Figure 16.7 Selecting the management packs to import

Now that the management packs are imported, the next step is to create a Run as Account to speak with the Ambari SCOM Server installed within your cluster.

1. Go to Administration → Run As Configuration → Accounts. In the Tasks panel, select "Create Run As Account…".

2. You will now walk through the Create Run As Account Wizard. You start on the introduction page. There is nothing to do here, so click Next.

3. On the General Properties page of the Create Run As Account Wizard, choose a Run As account type of Basic Authentication. Give it a display name such as AmbariSCOM.

4. On the Credentials Properties page, choose an Account Name and Password for connecting to the Ambari SCOM server. The defaults are "admin" and "admin". Because you didn't configure these earlier, enter the defaults.

5. On the Distribution Security page, choose the Less Secure option. Click Create.

6. On the Completion page of the Run As Account Wizard, choose Close.

Now that you have configured the Run as Account, the last task that you need to accomplish is configuring the management pack. This will allow SCOM to speak with the Ambari SCOM Server in your Hadoop cluster:

1. Click Authoring → Management Pack Templates → Ambari SCOM. On the right side is a task panel with the option to Add Monitoring Wizard. Click Add Monitoring Wizard.

2. On the Select Monitoring Type page of the Add Monitoring Wizard, choose Ambari SCOM. Click Next.

3. On the General Properties page, provide a name such as AmbariSCOM. Below, under Select destination management pack, choose New.

4. On the General Properties page of the Create Management Pack wizard, choose a name such as Ambari SCOM Customizations and click Next.

5. On the Knowledge page of the Create Management Pack wizard, choose Create.

6. Back on the General Properties page of the Add Monitoring Wizard, ensure the destination management pack is the Ambari SCOM Customizations MP, and click Next.

7. On the Hadoop Details page, provide the Ambari URI, using either the hostname or the IP address of your Ambari SCOM server in your Hadoop cluster. Also, choose the AmbariSCOM credentials for the Credentials Run As Account. The Hadoop Details page should look similar to Figure 16.8. Click Next.

image

Figure 16.8 Hadoop details for Ambari SCOM MP

8. On the Watcher Nodes page of the Add Monitoring Wizard, click Add. This takes you to the Select Proxy Agent dialog, where you choose the Watcher Node that will monitor Hadoop. Click Search and choose an available watcher node. Click OK. This takes you back to the Watcher Nodes page of the Add Monitoring Wizard. Click Next.

9. Complete the Wizard by clicking Create on the Summary page.

Congratulations, you have completed the install of the Ambari SCOM Management Pack. You can now explore your Hadoop cluster statistics by going back to the Monitoring page in SCOM and choosing Ambari SCOM. The available Ambari SCOM folders are shown in Figure 16.9.

image

Figure 16.9 Ambari SCOM folders

It will take a few minutes for SCOM to start collecting statistics from your cluster, so the first good place to look is the Clusters Diagram page. Figure 16.10 shows our cluster built in Chapter 3. Although it's a single node cluster, you can see all the services running and their status, indicated by the green check marks.

image

Figure 16.10 Cluster diagram

The whole purpose of going through the pain of this section is so that you can monitor your Hadoop clusters from a central location. Essential to any good monitoring solution is the ability to alert you when something goes wrong and to keep a history of performance so that you can look back in time. In the next section, we'll take a closer look at the capabilities of the Ambari SCOM Management Pack.

Monitoring with the Ambari SCOM Management Pack

The alerts shown in Table 16.2 are configured by Ambari SCOM.

Table 16.2 Alerts Configured by Ambari SCOM

· Capacity Remaining: Warns if the percentage of available HDFS space falls below the threshold. Threshold: 30 (Warning), 10 (Critical)

· Under-Replicated Blocks: Warns if the number of under-replicated blocks in HDFS is above the threshold. Threshold: 1 (Warning), 5 (Critical)

· Corrupted Blocks: Alerts if there are any corrupted blocks. Threshold: 1

· DataNodes Down: Warns if the percentage of DataNodes that are down is greater than the threshold. Threshold: 10 (Warning), 20 (Critical)

· Failed Jobs: Alerts if the percentage of failed MapReduce jobs is greater than the threshold. Threshold: 10 (Warning), 40 (Critical)

· Invalid TaskTrackers: Alerts if there are any invalid TaskTrackers. Threshold: 1

· Memory Heap Usage (JobTracker): Warns if JobTracker memory heap usage grows above the threshold. Threshold: 80 (Warning), 90 (Critical)

· Memory Heap Usage (NameNode): Warns if NameNode memory heap usage grows above the threshold. Threshold: 80 (Warning), 90 (Critical)

· TaskTrackers Down: Warns if the percentage of TaskTrackers that are down is greater than the threshold. Threshold: 10 (Warning), 20 (Critical)

· TaskTracker Service State: Warns if the TaskTracker service is not available.

· NameNode Service State: Warns if the NameNode service is not available.

· Secondary NameNode Service State: Warns if the Secondary NameNode service is not available.

· JobTracker Service State: Warns if the JobTracker service is not available.

· Oozie Server Service State: Warns if the Oozie Server service is not available.

· Hive Metastore State: Warns if the Hive Metastore server is not available.

· HiveServer State: Warns if the HiveServer service is not available.

· WebHCat Server Service State: Warns if the WebHCat Server service is not available.

On the left side of the SCOM interface, you can browse the Hadoop cluster by traversing the Ambari SCOM tree. Figure 16.9 showed the various diagrams and reports you can navigate to. The reports of greatest interest are the Cluster Summary, HDFS Service Summary, HDFS NameNode, MapReduce Summary, and JobTracker Summary. We'll examine each report for the benefits it provides.

The HDFS Service Summary provides summary insight into files, blocks, disk I/O, and remaining capacity. The File Summary provides insight into how many files have been appended, deleted, and created, and it lets you know how many total files are stored in HDFS. The Blocks Summary provides data on under-replicated blocks, corrupt blocks, missing blocks, and total blocks in the system. Ideally, everything should be zero except Total Blocks. If there are corrupt or missing blocks, it's possible some DataNodes are down, or the replication factor for those blocks was 1 and there are no remaining replicas from which to re-create them. The Disk I/O Summary provides detail on the total bytes read and written in the system. Finally, the Capacity Remaining graph provides the total amount of space left in the system in a nice trending graph. Figure 16.11 provides a view of this set of SCOM graphs.

image

Figure 16.11 HDFS Service summary

The HDFS NameNode Summary provides performance details for the NameNode. Specifically, the summary set of graphs covers memory heap utilization, thread status, garbage collection time in milliseconds, and average RPC wait time. The memory heap utilization is important to monitor because as the number of files your cluster must keep track of grows, the amount of memory needed to maintain file locations grows with it. The graph shows the difference between memory committed and memory used. The Threads Status report details how many threads are runnable, blocked, and waiting. There are also Garbage Collection Time and Average RPC Wait Time reports. If the RPC wait time is high, a job or an application is performing too many NameNode operations. Figure 16.12 shows the NameNode summary graphs.

image

Figure 16.12 NameNode summary

The MapReduce Service Summary shows summary trends of jobs, TaskTrackers, slots utilization, and total number of maps compared to total number of reducers. The Jobs Summary report details jobs submitted, jobs failed, jobs completed, and jobs killed. The TaskTrackers Summary report provides information on TaskTrackers decommissioned, TaskTrackers blacklisted, TaskTrackers graylisted, and the number of TaskTrackers. The Slots Utilization graph shows reserved and occupied reduce slots and reserved and occupied map slots. Finally, the Maps vs. Reducers report details the number of running map and reduce tasks over time. You will likely visit this summary page often when users are complaining about slow performance, as it gives you quick insight into the aggregate activity on the cluster. Figure 16.13 shows the MapReduce Summary report.

image

Figure 16.13 MapReduce summary report

Finally, the JobTracker Summary provides details on memory heap utilization, thread status, garbage collection time, and average RPC wait time (see Figure 16.14). The Memory Heap Utilization graph shows memory heap committed and used for the JobTracker service. The Threads Status report shows the threads runnable, waiting, and blocked. The Garbage Collection Time graph shows the time in milliseconds that it takes to run garbage collection. The Average RPC Wait Time graph shows wait time; if the average RPC wait time grows significantly, look for jobs running a very large number of short-running tasks.

image

Figure 16.14 Jobtracker summary

Finally, you can create your own reports by using the Cluster Services Performance Report interface. Here you can choose which MapReduce and HDFS counters you want to add to a specific report. You can also right-click within the graph to change the timeframe to one of your choosing. This graph can also show alerts and maintenance windows as defined by you. Figure 16.15 shows the Cluster Services Performance report.

image

Figure 16.15 Cluster Services Performance report

Now that you have set up and understand the default reports and alerts that come with the Ambari SCOM Management Pack, you will want to continue to explore and customize the solution. The first step you should take is adding the necessary counters to each cluster node to capture the performance information referred to earlier as the canaries in the coal mine (CPU, disk, network, and memory). After that, you will find there are additional report views you can create. Also, you may want to modify the default thresholds for the alerts that are raised. Finally, if you haven't set up a destination for someone to receive the alerts, you should do so. For test or development environments, that may be the developers. For an operational environment, those alerts should go to the team responsible for the administration of the cluster.

Summary

In this chapter we've covered a great deal of material that you will need to think about as you get closer to operationalizing your solution. We started by discussing the important attributes of working in a hybrid big data environment; specifically, we discussed data integration between cloud and on-premise solutions and the challenges that you will face there. We also discussed some of your options for preparing for backups and high availability of your cluster. We spoke a bit about privacy laws; the important thing to remember there is that you have a responsibility to understand the laws around the data that you are keeping in your big data environment. Finally, we looked at the Ambari SCOM Management Pack for HDP on Windows. This management pack allows us to monitor our Hadoop cluster in SCOM, get alerted on many important thresholds, and report on performance trends for our cluster. The Ambari SCOM Management Pack is an important tool for managing your Hadoop clusters in a proactive and professional manner.