
Part V. Big Data and SQL Server Together

Chapter 14. Big Data in the Real World

What You Will Learn in This Chapter

· Finding Out How Key Industries Use Big Data Analytics

· Understanding Types of Common Analysis

· Making Analytics Operational in Your Organization

This book has covered numerous solutions so far. This chapter focuses on how industry leverages these solutions; for example, how the telecommunications industry plans new development based on customer usage data it is crunching in new ways. This chapter will inspire you and provide ideas (from real-world implementations) for implementing the big data techniques you have learned. Remember, as well, that your clients and employers expect you to “know” these new technologies, so coming to the project table with implementation ideas, in addition to the basic skills, will make you a more valuable team member.

Another concept this chapter covers is how to fail fast. After all, a significant benefit of these new technologies is that they provide for quick prototyping and testing cycles, which in turn allow you, faster than ever before, to redirect efforts away from unproductive avenues to more productive solutions. Companies are trying continuously to enhance their analytic capabilities, and through failing and redirecting more rapidly you can help them get to the right solution faster and with less overhead in the process. To start this discussion, let's review some common industry sectors that apply these techniques.

Common Industry Analytics

Industries that use these types of big data analytics include telecommunications, energy, oil and gas, retail, companies that sell data as a service, IT organizations, and large hosting companies such as Rackspace and GoDaddy. In addition, just about every industry's marketing department is interested in things like social sentiment and measuring their brand impact and presence/influence, as discussed later in this chapter.

Telco

In the telecommunications industry, as with any other major commodity business, organizations seek consumption details that enable them to predict bursts of activity, analyze patterns, and then use this information to create valuable offerings for customers. Those offerings are designed to maximize the return on new infrastructure development and to minimize wasted effort from construction alliances or new ventures. These analyses attempt to define usage patterns by looking at the types of services that people use. Those services could include Internet, Voice over IP (VoIP), landline telephone, and even mobile.

In consumption analysis, algorithms help us understand where and when, from a consumption perspective, people are spending the most. Does a service or product cost customers more with a particular provider simply because of the way they use it? Are they using a service or product more during prime-time hours (which would then influence the pricing model)? These types of models and modeling problems are common, and the solutions explored are usually found in the big data ecosystem.

The Hadoop ecosystem also facilitates the placement of new infrastructure in record time and with increased accuracy (another major analytic area, although not new, in the telecommunications industry). Movement and usage patterns (especially people using their mobile devices for data/voice) significantly influence the placement of this new infrastructure. Tracking users has always been a challenge because of the volume of data such tracking produces. Now, though, all these users can be tracked based on where they move throughout the world and where they're using services. The infrastructure can be improved or enhanced based on these particular usage patterns. That alone can provide a much better experience for the customer, leading to improved customer loyalty and, as customers stay with the same provider, a significant return on investment (ROI).

Energy

In the energy industry, common types of analysis today include analysis of survey data about where people consume different types of energy, along with carbon-impact analysis. These analyses seek to predict customer usage and to optimize the availability of multiple fuel types. Although the focus of such analysis is not new (nor the result of new and better tools), industry players can now complete their analysis with improved accuracy based on significantly larger volumes of data. Those larger volumes of data allow a much more detailed level of analysis for everything from geographic and spatial analysis of three-dimensional survey data to predicting where the most cost-efficient and least environmentally impactful deposit and withdrawal activity could happen.

The energy industry's need to predict customer usage and to optimize the availability of multiple fuel types will continue to grow as the various fuel types become more and more popular (for instance, as geothermal and hydroelectric power begin to move more households away from traditional types of electricity derived from coal or oil). These types of analytics will show whether specific areas of the country could really benefit from the various types of fuel or alternative fuel sources.

Retail

Many of you are already familiar with the types of analysis that retail organizations do. As you shop at Amazon.com, eBay, or any other major retail site, “you may also like this” sections pop up (a technique known as market basket analysis). In large Hadoop implementations, this type of analysis occurs on the fly and behind the scenes, based on your shopping cart and your shopping preferences. The companies are running an analytics model designed to tell you about other products and services that you might be interested in. Mahout is the tool in the Hadoop ecosystem that many organizations use to build and leverage these models and to test solution combinations. This value-added market basket analysis tracks customer patterns and recommends products, which may enable the retailer to sell more and make more money per individual transaction (a key metric in the retail space).
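
As a concrete sketch of the idea, the following Java snippet uses Mahout's Taste recommender API (from the Mahout 0.x line) to answer a “customers who bought this may also like” question. The purchases.csv file, the AlsoBought class, and item ID 42 are hypothetical; a production system would build and serve such models at cluster scale.

```java
import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

public class AlsoBought {
    public static void main(String[] args) throws Exception {
        // Each row of the (hypothetical) file: userID,itemID[,preference]
        DataModel model = new FileDataModel(new File("purchases.csv"));

        // Log-likelihood similarity works well for implicit "bought it" data
        ItemSimilarity similarity = new LogLikelihoodSimilarity(model);
        GenericItemBasedRecommender recommender =
                new GenericItemBasedRecommender(model, similarity);

        // "Customers who bought item 42 may also like..." (top 5)
        List<RecommendedItem> similar = recommender.mostSimilarItems(42L, 5);
        for (RecommendedItem item : similar) {
            System.out.println(item.getItemID() + " " + item.getValue());
        }
    }
}
```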

A new type of analytics in the retail space tracks customers as they move through a store. Some organizations do this with a little chip in a shopping cart. Others do it by getting you to install their smartphone application. However these stores physically track customers, the most important piece of information is where customers are spending most of their time. For example, are they spending most of their time shopping for groceries and spending some time in sporting goods? Do they go to automotive every time they come in to buy groceries? These analytics give retailers an incredible amount of insight into how they can stock their stores and how they can organize the floor plans to direct traffic to the areas that will maximize the amount of money they get per visitor.

The last major type of analysis centers on unproductive inventory and stock management. A number of years ago, a Discovery Channel special on Walmart showcased how Walmart was ahead of its time in analytics (in that they could add weather pattern data into their inventory and stock analysis). For example, when a particular type of weather system was coming through the Midwest, they knew that people would buy more of a certain type of Pop-Tarts, and so they could automatically preorder that. Working that into their inventory management system without anybody having to make any decisions is a good example of this type of analysis.

For this type of event analysis, many companies factor in everything from weather data, to traffic data, to the schedule of a big football game or a big political event, to any other event that would drive a lot of tourism. Any retail organization would find it incredibly valuable to integrate any of those types of cyclical population data points. Major retail organizations are already working hard on doing this, and some, like Walmart, have been doing it for a number of years already.

Data Services

In today's economy, an increasing number of organizations provide, as their sole focus, data as a service to other organizations. These organizations take hard-to-get data and combine it with customer data to provide additional value: everything from integrating weather and population data as a service (as discussed earlier with regard to the retail industry) all the way to mapping data that companies can integrate into their websites for things such as finding your closest pharmacy or grocery store. Companies do this by calling an application programming interface (API), and the API takes care of the lookup for them. These data service organizations focus intently on the acquisition and quality of data and on scaling the response time of the analysis for the particular questions posed. These data companies are very interested in (and in some cases already pioneering) the use of these big data technologies.
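
As a minimal sketch of what such a call can look like from the consumer's side, the following Java snippet queries a hypothetical nearest-location endpoint. The URL, query parameters, API key header, and response format are all invented for illustration; real providers document their own contracts.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class NearestPharmacy {
    public static void main(String[] args) throws Exception {
        // Hypothetical data-as-a-service endpoint and parameters
        URL url = new URL(
            "https://api.example.com/v1/nearest?type=pharmacy&lat=47.61&lon=-122.33");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestProperty("Accept", "application/json");
        conn.setRequestProperty("X-Api-Key", "YOUR_KEY"); // provider-issued key

        // Print the raw response; a real client would parse the JSON
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // e.g. {"name":"...","distanceKm":0.4}
            }
        }
    }
}
```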

IT/Hosting Optimization

Internal IT can also leverage the opportunities presented by Hadoop ecosystem technologies: predicting machine failure, reviewing log data to provide more real-time workload optimization, removing hotspots in a cloud or grid environment, and even reviewing tickets and knowledge bases for common issues using natural language processing. Although this book does not go into detail about natural language processing, many other books do, and many technologies that really excel in natural language processing run in this ecosystem.

Natural language processing comprises a set of tools and software components that developers can leverage to analyze voice recordings or text documents to determine what the user is asking or trying to do. Developers can tell whether the user is happy or sad, positive or negative, forceful or not. We can measure many things through natural language processing, and even simple natural language processing can be applied to the call center ticketing systems of large organizations. Developers can then deal with a significant number of common trouble tickets by either fixing the underlying issue or providing additional user training. Either option may provide significant relief.
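
The following toy sketch illustrates the flavor of such analysis. It is a simple word-list scorer, not a real natural language processing system; a production deployment would use a model trained on labeled tickets, and the word lists here are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class TicketSentiment {
    // Hypothetical keyword lists; a real system would learn these
    private static final Set<String> NEGATIVE = new HashSet<>(
            Arrays.asList("broken", "angry", "refund", "cancel", "slow"));
    private static final Set<String> POSITIVE = new HashSet<>(
            Arrays.asList("thanks", "great", "resolved", "helpful"));

    public static int score(String ticketText) {
        int score = 0;
        for (String token : ticketText.toLowerCase().split("\\W+")) {
            if (POSITIVE.contains(token)) score++;
            if (NEGATIVE.contains(token)) score--;
        }
        return score; // negative scores suggest an unhappy customer
    }

    public static void main(String[] args) {
        // Prints -2: "broken" and "slow" each subtract one
        System.out.println(score("My router is broken and support was slow"));
    }
}
```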

In conjunction with real-time machine workload information coming from the logs of large cloud or grid systems, developers can use native ecosystem tools to begin to analyze that data and initiate processes to move the workload to different parts of the cloud. The resultant elasticity and ability to scale more efficiently may thus allow them to be more responsive to their customers.

Marketing Social Sentiment

The ubiquity of social media provides an unprecedented opportunity for marketers. Brands can now measure their influence and presence across a wide variety of populations and geographies. This information has never before been as readily accessible as it is today, and it is becoming ever more accessible as new networks and new technologies come online.

Measuring an organization's brand/presence influence with respect to marketing impact is generally referred to as social sentiment analysis. Social sentiment analysis, something that most companies say that they want, usually falls within the purview of the marketing department. Real-world social sentiment analysis considers, among other things, the following:

· Positive and negative statements made about your brand

· Responses (negative and positive) and response time to advertisements played (within specific time periods and via specific modes, such as pop-up ads, referral links, and so on)

· Multiple social network user-population impacts/influences of your brand

Operational Analytics

The concept of operational analytics derives from taking your normal day-to-day operational activities in information technology and combining them as seamlessly as possible with the technologies that provide your analytics solutions for your customers. A good example is a company that automatically places ads on its web presence based on real-time click-through analysis. Let's explore this more in the following sections.

Failing Fast

Earlier in this chapter, you came across the term failing fast. This concept allows you, along with the development team, to redirect your efforts based on constructive feedback you receive from those who are using your solution. In existing legacy business intelligence (BI) solutions, it can prove difficult to retrofit a solution based on regular and ongoing feedback. Within the Hadoop ecosystem, the tools provide a more compartmentalized view of the architecture, allowing for more individual retrofitting and adjustment. This point will become more apparent in the next few sections as you learn how these tools will impact your existing BI and reporting implementations.

A New Ecosystem of Technologies

By this point in the book, you can certainly understand that you can leverage many new technologies and opportunities to do some impressive analytics for your organization. As with any new technology, integrating these technologies can prove challenging (but also beneficial at the same time). The following subsections cover a few of the most common areas where these technologies intersect.

Changing the Face of ETL

Extract, transform, and load (ETL) processes are generally considered the backbone of any BI implementation because they are responsible for moving, cleansing, loading, and reviewing the quality of the data that reports and analytics are based on. The challenge with today's ETL is that it can be difficult to adapt or adjust once a large-scale implementation is in place.

As discussed throughout this book, tools such as Pig and Sqoop provide a less robust but more scalable and flexible approach to moving data in and around your environment. In most cases, you will have both your traditional ETL platform, such as SQL Server Integration Services (SSIS), and an ETL process that works within your project implementation. These work together to move the data back and forth between those environments.

Pig, in particular, provides a robust command-based solution. With it, you can do text-based data cleansing, organization, and aggregation, and you can use the MapReduce framework under the covers to scale the work across many nodes and thus process and move volumes of data that a normal ETL server may struggle to handle in the same amount of time. Traditional ETL servers use memory-intensive processes to load data into memory, access it very quickly, and perform the appropriate operation (such as cleansing, removing, sorting, aggregating, and so on). Arguably, by combining the capabilities of Pig with additional tools, such as natural language processing and other types of advanced analysis, developers can do things in conjunction with the ETL process that would previously have required a lot more code and effort to accomplish.
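
As a minimal sketch of this style of ETL, the following Java program drives a small Pig cleanse-and-aggregate job through Pig's PigServer API. The input path, record layout, and output location are hypothetical.

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class ClickstreamEtl {
    public static void main(String[] args) throws Exception {
        // MAPREDUCE spreads the work across the cluster's nodes;
        // ExecType.LOCAL is handy for testing on a single machine
        PigServer pig = new PigServer(ExecType.MAPREDUCE);

        // Load raw tab-delimited log lines and type the fields
        pig.registerQuery(
            "raw = LOAD '/data/clickstream.log' USING PigStorage('\\t') " +
            "AS (ts:chararray, userId:chararray, page:chararray);");

        // Text-based cleansing: drop malformed rows
        pig.registerQuery("valid = FILTER raw BY page IS NOT NULL;");

        // Organize and aggregate: hit counts per page
        pig.registerQuery("byPage = GROUP valid BY page;");
        pig.registerQuery(
            "hits = FOREACH byPage GENERATE group AS page, COUNT(valid) AS n;");

        // Write the cleansed aggregate back to HDFS for downstream use
        pig.store("hits", "/data/out/page_hits");
    }
}
```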

What Does This Do to Business Intelligence?

With all these changes, you are probably asking yourself what this does to BI. The important thing to remember is that nothing changes overnight. Yes, developers do have many new tools and new ways to do some impressive analytics and visualizations, but these new tools will continue to interact with existing platforms in which you have already invested significant intellectual capital and organizational intellectual property.

BI was originally designed as the analytics platform of the future. Now, though, as the future looms (or as we pass through it), we require additional scalability. BI has successfully demonstrated the value of these types of solutions to so many organizations that they are now struggling to keep up with the volume of data and the types of questions that the users and their customer community want to ask. That represents an incredible value proposition: Users wanting more of your solution (instead of you having to convince users of the utility of BI and solicit customers for it).

These data solutions give people and organizations the flexibility to drive conversations as never before. That allows developers to take larger volumes of data that do not fit into traditional BI systems, such as a relational database, an online analytical processing (OLAP) cube, or a report, and provide a platform where users can analyze all of that data and come up with a subset that they know has real value and real meaning for the organization. After they have identified that subset of information, it can be moved into the traditional BI implementation, where the enterprise can begin to consume it and work with it as they would any other data from their sales, customer relationship management (CRM), or reporting system.

This flexibility and agility is something not seen before, but it is similar in some ways to the concept of self-service BI. Many of the self-service tools (such as Power Query and the Power BI suite from Microsoft) work well with Hadoop and the ecosystem it runs on. With Hive, developers have the opportunity to quickly publish data to their end users in a relational abstraction format: data that users can query from desktop tools designed to help them be more exploratory in their analysis. You will learn what exploratory data analysis means a little later in this chapter. For now, just understand that instead of sitting down in the morning and running their normal Monday morning report, we want users to be able to go further. We want users to be able not just to respond to numbers they see on a page, but to turn that response into a deeper query and a deeper investigation than they can perform today: quickly, at their fingertips, without having to request changes or additional infrastructure from the IT organization.
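
A minimal sketch of that publishing pattern follows, assuming a HiveServer2 endpoint and the hypothetical page-hits output from the earlier Pig example. Once the external table exists, self-service desktop tools can reach it through Hive's ODBC driver just as this code reaches it through JDBC.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class PublishToHive {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://hiveserver:10000/default", "user", "");
             Statement stmt = conn.createStatement()) {

            // Relational abstraction over raw files already sitting in HDFS
            stmt.execute(
                "CREATE EXTERNAL TABLE IF NOT EXISTS page_hits (" +
                " page STRING, n BIGINT)" +
                " ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'" +
                " LOCATION '/data/out/page_hits'");

            // End users (or their desktop tools) can now query it like a table
            ResultSet rs = stmt.executeQuery(
                "SELECT page, n FROM page_hits ORDER BY n DESC LIMIT 10");
            while (rs.next()) {
                System.out.println(rs.getString("page") + " " + rs.getLong("n"));
            }
        }
    }
}
```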

New Visualization Impact

These new types of analysis have brought with them new methods and practices for visualizing data, along with a number of recently published books on data visualization best practices. Among the valuable material in these books, one good point concerns nonrelational data (such as spatial data or data coming from nonstandard file types and government sources): the way that you visualize the data determines how accurately you portray its meaning.

Microsoft has advanced visualization tools (for example, Power View and Power Map). Other organizations, such as Tableau, offer additional advanced visualizations, and many organizations are looking at a multivendor approach for their reporting and visualization needs as they explore this young (and evolving) ecosystem. After all, no one vendor has really cornered the market on visualization. Even open source tools give us a lot of visualization ability right from the command line. As we turn our data into analysis, those graphics can be pulled out directly and put into presentations or shared with our user base.

As all of us are aware, an analysis is only as valuable as our ability to communicate it to our end users, executives, and stakeholders. If we cannot visualize and communicate the analysis, its impact, and our recommendations, we will not be successful no matter what technology we are using.

User Audiences

Several user audiences will exist for these types of solutions, so you will want to understand what each of these audiences will be most concerned about as you begin to discuss how these new solutions will affect their day-to-day operations. Let's review some of the more important user audiences.

Consumers

Consumers will be some of your easiest customers. They are primarily concerned with where to go to run reports and where to go to direct their data collection activities.

To complete typical jobs, consumers will likely use, under the covers, Hive tables that you have created and exposed for them. These Hive tables will be the easiest way for them to query the underlying data structures, unless they have a tool such as Power Query from Microsoft, which enables them to go directly to files in the Hadoop Distributed File System (HDFS).

Your consumers will be used to all of their existing tools, so it is important that you help them understand that many of these tools will continue to be used in exactly the same way and that they do not have to replace their existing enterprise BI environment. We are just adding new data sets that enable them to pull additional information, add value to their existing reports, and perform a deeper level of analysis than they normally could. These existing tools (for instance, Excel and Power BI from Microsoft) provide leading-edge capabilities to take advantage of all we are doing with these new tools.

The ecosystem that we cover in this book provides an additional data platform and a set of capabilities for developers to service additional types of data at volumes that were previously much more difficult (and in some cases prohibitive). With these limitations removed, consumers can now reach a new level of information collection, collaboration, and management.

Power Users

Power users are going to be some of the most advanced users of these new solutions. They may do everything from building custom models to creating their own Hive tables and writing their own Pig scripts.

These users will demand the most assistance with the actual tools, as well as with where data is located, how it is loaded into the system, how to access it, and what kind of metadata is available. Power users will work closely with the developers as the developers build out the functional pieces of the platform, which the power users will then build on to take the solution to the next level.

Data Developers

Data developers will be one of your largest and most demanding audiences. In fact, you might be in this category yourself.

Data developers are going to be in charge of everything from sourcing data, to bringing it into HDFS, to loading it into nonrelational data stores such as MongoDB, to creating Hive tables for users to connect to. And then, most likely, they will be responsible for developing export functionality that moves the subsets of valuable data users find during analysis into a format that can be integrated into the enterprise BI system.
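
As a minimal sketch of that export step, the following snippet pushes the hypothetical tab-delimited page-hits output from the earlier examples into a SQL Server table using Sqoop's Java entry point (Sqoop 1.x). The connection string, credentials file, and table name are invented for illustration.

```java
import org.apache.sqoop.Sqoop;

public class ExportToWarehouse {
    public static void main(String[] args) {
        // Equivalent to invoking "sqoop export ..." from the command line
        int exitCode = Sqoop.runTool(new String[] {
            "export",
            "--connect", "jdbc:sqlserver://dwserver:1433;databaseName=Sales",
            "--username", "etl_user",
            "--password-file", "/user/etl/.sqoop.pwd", // kept in HDFS, not inline
            "--table", "page_hits",                    // target table in SQL Server
            "--export-dir", "/data/out/page_hits",     // tab-delimited HDFS output
            "--input-fields-terminated-by", "\t"
        });
        System.exit(exitCode);
    }
}
```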

It is this life cycle, identifying data, making it available for exploration, reviewing the results of that exploration, and then building integration processes for the value-added subsets, that will keep the new big data developer very busy. This is an extremely exciting time for developers. After all, you get to take all that you have learned about acquiring, cleansing, and managing data and apply it to a set of new tools, enabling scalability opportunities and performance limits far beyond what existing systems can offer. If, in addition, you are leveraging the cloud for your biggest solutions, you will have additional opportunities to manage data asynchronously across on-premises solutions, cloud solutions, and enterprise BI solutions wherever they are deployed.

System Administrators

System administrators are usually concerned with ensuring that systems are appropriately deployed, well maintained, and performing at peak level. In the big data ecosystem, all of these concerns still exist, now applied to clusters of Hadoop nodes and a distributed file system.

Many tools require configuration to control how they scale, how they distribute, and how they manage data across different clusters. Some of these clusters may be on premises, some in the cloud, and some may be a hybrid scenario in which some machines are on premises and others take advantage of cloud elasticity for seasonal demand or a particular type of analytic throughput.

System administrators will have their hands full learning all the new configuration options, configuration files, and various other information covered earlier in this book.

NOTE

This book is not an exhaustive reference; you might want to use many other configuration options to manage an environment particular to your organization. It is important to understand all the different options, and there are great references out there to help you do that. You should review the platform documentation for whichever vendor you are using (for instance, the Microsoft documentation for HDInsight or the Hortonworks documentation for the Hortonworks Data Platform).

System administrators will serve as critical partners for the data developers of the future because data developers rely on a certain amount of scalability and distributed processing power in how they build their applications. It is important that these two roles work together and have a strong relationship to provide the right level of scalability, performance, and elasticity in these types of environments.

Analysts/Data Scientists

Many analysts and data scientists have been using these tools for a few years now. Their analyses and requirements served as a major driver in the development and enhancement of the tools in this ecosystem. Being able to run statistical models (for instance, predictive analytics in tools such as Mahout) is critical to the success of people in this role.

Many organizations are still trying to define what it means to be a data scientist in their company, so do not get hung up on the terms data scientist or analyst. In brief, a data scientist reviews internal data along with external data and, based on that, recommends opportunities for the company to measure new things as markers of success or failure for particular projects, divisions, organizations, or markets. For example, many large retail organizations are rapidly growing their data science teams to include people who understand statistics, predictive modeling, regression analysis, market basket analysis, and other advanced analytics.

Do not worry if you (or your developer) do not yet have all the skills you need. You might want to learn some of them if you find them interesting. However, your main goal is to work with these folks to provide the right platform with the right access to additional data sets and to provide opportunities for them to excel at their job.

Summary

In this chapter you learned how common industries are using big data analytics today. These industries included telecommunications, energy, retail, data services, and IT/hosting optimization. You also learned how these industries are turning these initiatives from incubation efforts into operational successes and how you can do the same thing in your company. You learned how these solutions will impact different key roles and who the stakeholders will be in your next big data project, including specific considerations for roles such as developers and system administrators.

You can leverage the new big data technologies covered so far in this book in many different ways. Get creative and share with your organization where you believe they could leverage technology solutions like the ones in this chapter. For more information on specific implementations, ask your platform vendors for customer stories and case studies for a deeper technical description.