Big Data Bootcamp: What Managers Need to Know to Profit from the Big Data Revolution (2014)
Chapter 3. Your Big Data Roadmap
Big Data: Where to Start
Thinking about how you want to act on your results after your data gathering and analysis is complete will help you navigate your way to a successful Big Data outcome.
In some cases, the results of your project may be interactive visualizations or dashboards you present to management, partners, or customers. In other cases, you might implement automated systems that use Big Data to take algorithmic action—to make better financial decisions, change pricing automatically, or deliver more targeted ads.
Your roadmap should also include a plan for pulling together the right team members and for getting access to the necessary data assets you want to analyze. With your vision and key questions in hand, you can then combine the necessary people, technology, and data sources to deliver the answers you need.
Goodbye SQL, Hello NoSQL
First, let’s take a look at one of the highest-profile emerging areas in Big Data: SQL versus NoSQL. In many ways, NoSQL data stores are a step back to the future.
Traditional Relational Database Management Systems (RDBMSs) rely on a table-based data store and a structured query language (SQL) for accessing data. In contrast, NoSQL systems do not use a table-based approach, nor do they use SQL. Instead, they rely on a key-value approach to data storage and lookup. NoSQL systems are actually a lot like the hierarchical Information Management System (IMS) databases that were commonly used before the introduction of relational database systems!
Before relational database systems, data was stored in relatively simple data stores, not unlike the key-value systems of today’s NoSQL offerings. What has changed is that, after a very long period (some 40 years) of using computer systems to store structured data, unstructured data is now growing at a much faster rate than its structured counterpart. In effect, the IMS approach is back in favor, now under the name NoSQL. What has also changed is that just as MySQL became popular due to its open source distribution approach, NoSQL is benefiting from open source popularity as well, primarily in the form of MongoDB and Cassandra.
Like the MySQL database, which was open source but had commercial support available through the company of the same name, MongoDB takes a similar approach. MongoDB is available as free, open source software, with commercial versions and support available from the company of the same name. Cassandra is another NoSQL database, and it’s maintained by the Apache Software Foundation.
MongoDB is the most popular NoSQL database and MongoDB (the company) has raised some $231 million in funding. DataStax offers commercial support for Cassandra and has raised some $83 million to date. Aerospike is another player in the field and has focused on improving NoSQL performance by optimizing its offering specifically to take advantage of the characteristics of flash-based storage. Aerospike recently decided to open-source its database server code.
NoSQL databases are particularly well-suited as document stores. Traditional structured databases require the up-front definition of tables and columns before they can store data. In contrast, NoSQL databases use a dynamic schema that can be changed on the fly.
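To make the contrast concrete, the sketch below mimics a document collection using plain Python dictionaries. It is an illustration of the dynamic-schema idea only, not the API of any particular NoSQL product:

```python
# A "collection" of documents: each record carries its own structure,
# so no table definition is needed before inserting data.
collection = []

# First document: a customer with an email address.
collection.append({"_id": 1, "name": "Alice", "email": "alice@example.com"})

# A later document can introduce fields the first one never declared --
# no up-front schema change is required.
collection.append({"_id": 2, "name": "Bob", "tags": ["premium", "west-coast"]})

# Lookup works key-value style: find documents matching given fields.
def find(coll, **criteria):
    return [doc for doc in coll
            if all(doc.get(k) == v for k, v in criteria.items())]

print(find(collection, name="Bob")[0]["tags"])  # ['premium', 'west-coast']
```

In a relational table, adding the tags field to the second record would first require altering the table definition; here, each document simply describes itself.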
Due to the huge volume of unstructured data now being created, NoSQL databases are surging in popularity. It may be a little too soon to say “goodbye” to SQL databases, but for applications that need to store vast quantities of unstructured data, NoSQL makes a lot of sense.
Compiling and Querying Big Data
Identifying the data sources you want to work with is one of the first major steps in developing your Big Data roadmap. Some data may be public, such as market, population, or weather data. Other sources of data may be in a variety of different locations internally. If you’re looking to analyze and reduce network security issues, your data may be in the form of log files spread out across different network devices.
If your goal is to increase revenue through better sales and marketing, your data may be in the form of web site logs, application dashboards, and various analytics products, both in-house and cloud-based. Or you may be working on a Big Data legal compliance project. In this case, your data is in the form of documents and emails spread across email systems, file servers, and system backups. Financial transaction data may be stored in different data warehouses.
Depending on the tools you choose to use, you may not need to pull all the data together into one repository. Some products can work with the data where it resides, rather than pulling it into a common store. In other cases, you may want to compile all the data into one place so that you can analyze it more quickly.
Once you have access, your next step is to run queries to get the specific data you need. Despite the growing size of unstructured data, the vast majority of Big Data is still queried using SQL and SQL-like approaches. But first, let’s talk about how to get the data into a state you can work with.
One of the biggest issues that typically comes up when working with Big Data is data formatting. For example, log files produced by network and computer systems from different vendors contain similar information but in different formats. All the log files might contain information about what errors occurred, when those errors occurred, and on which device.
But each vendor outputs that error information in a slightly different format. The text identifying the name of a device might be called DeviceName in one log file and device_name in another. Times may be stored in different formats; some systems store time in Coordinated Universal Time (UTC), whereas others record local time, say Pacific Standard Time (PST) if that is the time zone in which a particular server or device is running.
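To illustrate, here is a minimal sketch of that kind of normalization. The field names, formats, and time zone handling are hypothetical, and real log pipelines must also cope with daylight saving time and many more variants:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical records from two vendors: different field names,
# different time conventions (UTC vs. local Pacific Standard Time).
vendor_a = {"DeviceName": "router-01", "Time": "2014-03-01T12:00:00Z"}
vendor_b = {"device_name": "switch-07", "time": "2014-03-01 04:00:00"}

PST = timezone(timedelta(hours=-8))  # fixed offset; real code would handle DST

def normalize(record):
    """Map vendor-specific fields onto one common schema with UTC times."""
    if "DeviceName" in record:                        # vendor A style
        ts = datetime.strptime(record["Time"], "%Y-%m-%dT%H:%M:%SZ")
        ts = ts.replace(tzinfo=timezone.utc)
        name = record["DeviceName"]
    else:                                             # vendor B style
        ts = datetime.strptime(record["time"], "%Y-%m-%d %H:%M:%S")
        ts = ts.replace(tzinfo=PST).astimezone(timezone.utc)
        name = record["device_name"]
    return {"device_name": name, "time_utc": ts.isoformat()}

print(normalize(vendor_a)["time_utc"])  # 2014-03-01T12:00:00+00:00
print(normalize(vendor_b)["time_utc"])  # 2014-03-01T12:00:00+00:00
```

Once both records share one schema and one clock, they can be compared, joined, and queried as a single data set.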
Companies like Informatica have historically developed tools and services to address this issue. However, such approaches require users to define complex rules to transform data into consistent formats. More recently, companies like Trifacta, founded by UC Berkeley Computer Science professor Joe Hellerstein, have sought to simplify the data-transformation challenge through modeling and machine learning software.
Once you have the data in a workable state, your next step is to query the data so you can get the information you need. For structured data stored in Oracle, DB2, MySQL, PostgreSQL, and other structured databases, you’ll use SQL. Simple commands like SELECT allow you to retrieve data from tables. When you want to combine data from multiple tables, you can use a JOIN clause. SQL queries can get quite complex when working with many tables and columns, and expert database administrators can often optimize queries to run more efficiently. In data-intensive applications, the data store and the queries used to access it often become the most critical part of the system.
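For instance, a SELECT and a JOIN can be demonstrated with Python’s built-in sqlite3 module; the customers and orders tables below are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
cur = conn.cursor()

# Structured storage: tables and columns must be defined up front.
cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, "
            "customer_id INTEGER, total REAL)")
cur.executemany("INSERT INTO customers VALUES (?, ?)",
                [(1, "Alice"), (2, "Bob")])
cur.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [(1, 1, 25.0), (2, 1, 40.0), (3, 2, 10.0)])

# SELECT retrieves rows from a single table.
cur.execute("SELECT name FROM customers ORDER BY name")
print(cur.fetchall())  # [('Alice',), ('Bob',)]

# A JOIN combines rows from two tables on a shared key.
cur.execute("""
    SELECT c.name, SUM(o.total)
    FROM customers c JOIN orders o ON o.customer_id = c.id
    GROUP BY c.name ORDER BY c.name
""")
print(cur.fetchall())  # [('Alice', 65.0), ('Bob', 10.0)]
```

The same SELECT/JOIN pattern carries over to the enterprise databases named above; only the connection details change.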
While SQL queries can sometimes be slow to execute, they are typically faster than most Hadoop-based Big Data jobs, which are batch-based. Hadoop, along with MapReduce, has become synonymous with Big Data for its ability to distribute massive Big Data workloads across many commodity servers. But the problem for data analysts used to working with SQL was that Hadoop did not support SQL-style queries, and it took a long time to get results due to Hadoop’s batch-oriented nature.
Note While traditional SQL-oriented databases have until recently provided the easiest way to query data, newer Big Data technologies are catching up fast. Apache Hive, for example, provides a SQL-like interface to Hadoop.
Several technologies have evolved to address the problem. Chief among them is Hive, which is essentially SQL for Hadoop. If you’re working with Hadoop and need a query interface for accessing the data, Hive, which was originally developed at Facebook, can provide that interface. Although it isn’t identical to SQL, the query approach is similar.
After developing your queries, you’ll ultimately want to display your data in visual form. Dashboards and visualizations provide the answer. Some data analysis and presentation can be done in classic spreadsheet applications like Excel. But for interactive and more advanced visualizations, it is worth checking out products from vendors like Tableau and Qliktech. Their products work with query languages, databases, and both structured and unstructured data sources to convert raw analytics results into meaningful insights.
Big Data Analysis: Getting What You Want
The key to getting what you want with Big Data is to adopt an iterative, test-driven approach. Rather than assuming that a particular data analysis is correct, you can draw an informed conclusion and then test to see if it’s correct.
For example, suppose you want to apply Big Data to your marketing efforts, specifically to understand how to optimize your lead-generation channels. If you’re using a range of different channels, like social, online advertising, blogging, and search engine optimization, you’ll first pull the data from all those sources into one place. After doing that, you’ll analyze the results and determine the conversion rates from campaign to lead to prospect to customer, by channel. You may even take things a step further, producing a granular analysis that tells you what time of day, day of week, kinds of tweets, or types of content produce the best results.
From this analysis, you might decide to invest more in social campaigns on a particular day of the week, say Tuesdays. The great news with Big Data is that you can iteratively test your conclusions to see if they are correct. In this case, by running more social media marketing campaigns on Tuesdays, you’ll quickly know whether that additional investment makes sense. You can even run multivariate tests—combining your additional social media investment with a variety of different kinds of content marketing to see which combinations convert the most prospects into customers.
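As a simple sketch of the channel analysis described above, with entirely made-up funnel numbers:

```python
# Hypothetical funnel counts by channel: each tuple is
# (campaign responses, leads, prospects, customers).
funnel = {
    "social": (2000, 400, 100, 20),
    "ads":    (5000, 500,  50,  5),
    "blog":   (1000, 300,  90, 27),
}

def conversion_rates(stages):
    """Rate of each funnel stage relative to the previous one."""
    return [round(b / a, 3) for a, b in zip(stages, stages[1:])]

for channel, stages in sorted(funnel.items()):
    print(channel, conversion_rates(stages))

# The end-to-end rate (customers / responses) shows which channel
# actually converts best, not just which one drives the most volume.
best = max(funnel, key=lambda ch: funnel[ch][-1] / funnel[ch][0])
print(best)  # blog
```

In this hypothetical data, online advertising drives the most raw responses, but the blog converts best end to end, which is exactly the kind of conclusion you would then test with further campaigns.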
Because Big Data makes it more cost-effective to collect data and faster to analyze that data, you can afford to be a lot more iterative and experimental than in the past. That holds true whether you’re applying Big Data to marketing, sales, or to virtually any other part of your business.
Big Data Analytics: Interpreting What You Get
When it comes to interpreting the results you get from Big Data, context is everything. It’s all too easy to take data from a single point in time and assume it is representative. What really matters with Big Data is looking at the trend.
Continuing with the earlier example, if lead conversions from one channel are lower on one day of the week but higher on another, it would be all too easy to assume that those conversion rates will remain the same over time. Similarly, if there are a few errors reported by a computer or piece of network equipment, those might be isolated incidents. But if those errors keep happening or happen at the same time as other errors in a larger network, that could be indicative of a much larger problem.
In reality, it’s crucial to have data points over time and in the context of other events. To determine if a particular day of the week produces better results from social marketing efforts than other days of the week, you need to measure the performance of each of those days over multiple weeks and compare the results. To determine if a system error is indicative of a larger network failure or security attack, you need to be able to look at errors across the entire system.
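To make that concrete, here is a small sketch that averages hypothetical weekday results over four weeks rather than trusting any single observation:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (weekday, conversions) observations over four weeks.
observations = [
    ("Tue", 18), ("Tue", 22), ("Tue", 20), ("Tue", 24),
    ("Thu", 30), ("Thu",  9), ("Thu", 11), ("Thu", 10),
]

by_day = defaultdict(list)
for day, value in observations:
    by_day[day].append(value)

# Averaging over multiple weeks smooths out one-off spikes.
averages = {day: mean(values) for day, values in by_day.items()}
print(averages)  # Tue averages 21, Thu averages 15
```

A single standout Thursday (30 conversions) would be misleading on its own; averaged over four weeks, Tuesdays perform better.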
Perhaps no tool is more powerful for helping to interpret Big Data results than visualization. Geographic visualizations can highlight errors that are occurring in different locations. Time-based visualizations can show trends, such as conversion rates, revenue, leads, and other metrics over time. Putting geographic and time-based data together can produce some of the most powerful visualizations of all. Interactive visualizations can enable you to go backward and forward in time. Such approaches can be applied not only to business but also to education, healthcare, and the public sector. For example, interactive visualizations can show changes in population and GDP on a country-by-country basis over time, which can then be used to evaluate potential investments in those regions going forward.
You’ll learn about visualization in detail later in the book, but at a high level, you can use tools from vendors like Tableau and Qliktech’s Qlikview to create general visualizations. For specific areas, such as geographic visualizations, products like CartoDB make it easy to build visualizations into any web page with just a few lines of HTML and JavaScript.
In some cases, interpreting the data means little to no human interaction at all. For online advertising, financial trading, and pricing, software combined with algorithms can help you interpret the results and try new experiments. Computer programs interpret complex data sets and serve up new advertisements, trade stocks, and increase or decrease pricing based on what’s working and what’s not. Such automated interpretation is likely to take on a larger and larger role as data sets continue to grow in volume and decisions have to be made faster and on a larger scale than ever before. In these systems, human oversight is critical—dashboards that show changes in system behavior, highlight exceptions, and allow for manual intervention are key.
Should I Throw Out My RDBMS?
Practically since relational databases were first invented, people have been trying to replace them. The simple answer is that you shouldn’t throw out your RDBMS. That’s because each Big Data technology has its place, depending on the kind of data you need to store and the kind of access you need to that data.
Relational database systems were invented by Edgar Codd at IBM in the late 1960s and early 1970s. Unlike its predecessor, the Information Management System (IMS), an RDBMS stores data in tables. IMS databases, by contrast, are organized in a hierarchical structure and are often used for recording high-velocity transaction data, such as financial services transactions. Just as radio did not disappear when television came into existence, IMS did not disappear when the RDBMS appeared. Similarly, with newer technologies like Hadoop and MapReduce, the RDBMS is not going away.
First, it would be impractical simply to get rid of classic structured database systems like Oracle, IBM DB2, MySQL, and others. Software engineers know these systems well and have worked with them for years. They have written numerous applications that interface with these systems using the well-known structured query language or SQL. And such systems have evolved over time to support data replication, redundancy, backup, and other enterprise requirements.
Second, the RDBMS continues to serve a very practical purpose. As implied by SQL, the language used to store data in and retrieve it from such databases, an RDBMS is most suitable for storing structured data. Account information, customer records, order information, and the like are typically stored in an RDBMS.
But when it comes to storing unstructured information such as documents, web pages, text, images, and videos, using an RDBMS is likely not the best approach. RDBMS systems are also not ideal if the structure of the data you need to store changes frequently.
In contrast to traditional database systems, the more recently introduced Big Data technologies—while excelling at storing massive amounts of unstructured data—do not do as well with highly structured data. They lack the elegant simplicity of a structured query language. So while storing data is easy, getting it back in the form you need is hard.
That’s why today’s Big Data systems are starting to take on more of the capabilities of traditional database systems, while traditional database systems are taking on the characteristics of today’s Big Data technologies. Two technology approaches at opposite ends of the spectrum are becoming more alike because customers demand it.
Today’s users aren’t willing to give up the capabilities they’ve grown accustomed to over the past 40 years. Instead, they want the best of both worlds—systems that can handle both structured and unstructured data, with the ability to spread that data over many different machines for storage and processing.
Note Big Data systems are less and less strictly SQL or NoSQL. Instead, customers increasingly want the best of both worlds, and vendors as a result are providing hybrid products that combine the strengths of each.
Big Data Hardware
One of the big promises of Big Data is that it can process immense amounts of data on commodity hardware. That means companies can store and process more data at a lower cost than ever before. But it’s important to keep in mind that the most data-intensive companies, such as Facebook, Google, and Amazon, are designing their own custom servers to run today’s most data-intensive applications.
So when it comes to designing your Big Data roadmap, there are a few different approaches you can take to your underlying infrastructure—the hardware that runs all the software that turns all that data into actionable insights.
One option is to build your own infrastructure using commodity hardware. This typically works well for in-house Hadoop clusters, where companies want to maintain their own infrastructure, from the bare metal up to the application. Such an approach means it’s easy to add more computing capacity to your cluster: you just add more servers. But the hidden cost plays out over time. While such clusters are relatively easy to set up, you’re on the hook to manage both the underlying hardware and the application software.
This approach also requires you to have hardware, software, and analytics expertise in-house. If you have the resources, this approach can make the most sense when experimenting with Big Data as a prototype for a larger application. It gives you first-hand knowledge of what is required to set up and operate a Big Data system and gives you the most control over underlying hardware that often has a huge impact on application performance.
Another option is to go with a managed service provider. In this case, your hardware can reside either in-house or at the data center of your provider. In most cases, the provider is responsible for managing hardware failures, power, cooling, and other hardware-related infrastructure needs.
A still further option is to do all your Big Data processing in the cloud. This approach allows you to spin up and down servers as you need them, without making the up-front investment typically required to build your own Big Data infrastructure from scratch. The downside is there’s often a layer of virtualization software between your application and the underlying hardware, which can slow down application performance.
Note Cloud technology allows you to set up servers as necessary and scale the system quickly—and all without investing in your own hardware. However, the virtualization software used in many cloud-based systems slows things down a bit.
In days past, IBM, Sun Microsystems, Dell, HP, and others competed to offer a variety of different hardware solutions. That is no longer the case. These days, it’s a lot more about the software that runs on the hardware than the hardware itself.
A few things have changed. First, designing and manufacturing custom hardware is incredibly expensive and the margins are thin. Second, the scale-out (instead of scale-up) nature of many of today’s applications means that when a customer needs more power or storage, it’s as simple as adding another server. Plus, due to Moore’s Law, more computing power can fit in an ever-smaller space. By packing multiple cores into a single chip, each processor packs more punch. Each server has more memory, and each drive stores more data.
There’s no doubt about it, though. Today’s Big Data applications are memory and storage hogs. In most cases, you should load up your servers, whether they are local or cloud-based, with as much memory as you can. Relative to memory, disks are still relatively slow. As a result, every time a computer has to go to disk to fetch a piece of data, performance suffers. The more data that can be cached in memory, the faster your application—and therefore your data processing—will be.
The other key is to write distributed applications that take advantage of the scale-out hardware approach. This is the approach that companies like Facebook, Google, Amazon, and Twitter have used to build their applications. It’s the approach you should take so that you can easily add more computing capacity to meet the demands of your applications. Every day more data is created, and once you get going with your own Big Data solutions, it’s nearly a sure thing that you’ll want more storage space and faster data processing.
Once your Big Data project becomes large enough, it may make sense to go with hardware specifically optimized for your application. Vendors like Oracle and IBM serve some of the largest applications. Most of their systems are built from commodity hardware parts like Intel processors, standard memory, and so on. But for applications where you need guaranteed reliability and service you can count on, you may decide it’s worth paying the price for their integrated hardware, software, and service offerings.
Until then, however, commodity, off-the-shelf servers should give you everything you need to get started with Big Data. The beauty of scale-out applications when it comes to Big Data is that if a server or hard drive fails, you can simply replace it with another. But definitely buy more memory and bigger disks than you think you’ll need. Your data will grow a lot faster and chew up a lot more memory than your best estimates can anticipate.
Data Scientists and Data Admins: Big Data Managers
Consulting firm McKinsey estimates that Big Data can have a huge impact in many areas, from healthcare to manufacturing to the public sector. In healthcare alone, the United States could reduce healthcare expenditure some 8% annually with effective use of Big Data.
Yet we are facing a shortage of those well-versed in Big Data technologies, analytics, and visualization tools. According to McKinsey, the United States alone faces a shortage of 140,000 to 190,000 data analysts and 1.5 million “managers and analysts with the skills to understand and make decisions based on the analysis of Big Data.”1
There is no standard definition of the roles of data scientist and data analyst. However, data scientists tend to be more versed in the tools required to work with data, while data analysts focus more on the desired outcome—the questions that need to be answered, the business objectives, the key metrics, and the resulting dashboards that will be used on a daily basis. Because of the relative newness of Big Data, such roles often blur.
When it comes to working with structured data stored in a traditional database, there is nothing better than having an experienced database administrator (DBA) on hand. Although building basic SQL queries is straightforward, when working with numerous tables and lots of rows, and when working with data stored in multiple databases, queries can grow complex very quickly. A DBA can figure out how to optimize queries so that they run as efficiently as possible.
Meanwhile, system administrators are typically responsible for managing data repositories and sometimes for managing the underlying systems required to store and process large quantities of data. As organizations continue to capture and store more data—from web sites, applications, customers, and physical systems such as airplane engines, cars, and all of the other devices making up the Internet of Things—the role of system and database administrators will continue to grow in importance. Architecting the right systems from the outset and choosing where and how to store the data locally or in the cloud is critical to a successful long-term Big Data strategy. Such a strategy is the basis for building a proprietary Big Data asset that delivers a strategic competitive advantage.
All of these roles are critical to getting a successful outcome from Big Data. All the world’s data won’t give you the answers you need if you don’t know what questions to ask. Once the answers are available, they need to be put into a form that people can act on. That means translating complex analysis, decision trees, and raw statistics into easy-to-understand dashboards and visualizations.
The good news for those without the depth of skills required to become Big Data experts is that dozens of Big Data related courses are now available online, both from companies and at local colleges and universities. Vendors like IBM and Rackspace also offer Big Data courses online, with Rackspace recently introducing its free-to-view CloudU.2
One Fortune 500 company I work with runs an internal university as part of its ongoing human capital investment. They have recently added Big Data, cloud, and cyber-security courses from my firm, The Big Data Group, to their offerings. In these courses, we not only cover the latest developments in these areas but also discuss Big Data opportunities the participants can explore when they return to their day-to-day jobs.
Given the numbers cited by the McKinsey report, I expect that more companies will invest in Big Data, analytics, and data administration education for executives and employees alike in the months and years ahead. That’s good news for those looking to break into Big Data, change job functions, or simply make better use of data when it comes to making important business decisions.
Chief Data Officer: Big Data Owner
According to industry research firm Gartner, by 2015, 25% of large global organizations will have appointed a chief data officer (CDO).3 The number of CDOs in large organizations doubled from 2012 to 2014. Why the rise of the CDO?
For many years, information technology (IT) was primarily about hardware, software, and managing the resources, vendors, and budgets required to provide technology capabilities to large organizations. A company’s IT resources were a source of competitive advantage. The right IT investments could enable an organization to be more collaborative and nimble than its competitors. Data was simply one component of a larger IT strategy, quite often not even thought about in its own right separate from storage and compute resources.
Now data itself is becoming the source of strategic competitive advantage that IT once was, with storage hardware and computing resources the supporting actors in the Big Data play.
At the same time that companies are looking to gain more advantage from Big Data, the uses and misuses of that data have come under greater scrutiny. The dramatic rise in the amount of information companies are storing, combined with a rising number of cases in which hackers have obtained that personal data, has put data into a class of its own.
2013 saw two very high-profile data breaches. Hackers stole more than 70 million personal data records from retailer Target, including the names, addresses, email addresses, and phone numbers of Target shoppers. Hackers also stole some 38 million user account records from software maker Adobe. Then in 2014, hackers gained access to eBay accounts, putting some 145 million user accounts at risk.
The prominence of data has also increased due to privacy issues. Customers are sharing a lot more personal information than they used to, from credit card numbers to buying preferences, from personal photos to social connections. As a result, companies know a lot more about their customers than they did in years past. Today, just visiting a web site creates a huge trail of digital information. Which products did you spend more time looking at? Which ones did you skip over? What products have you looked at on other web sites that might be related to your purchase decision? All that data is available in the form of digital breadcrumbs that chart your path as you browse the web, whether on your computer, tablet, or mobile phone.
Demand for access to data—not just to the data itself but to the right tools for analyzing it, visualizing it, and sharing it—has also risen. Once the domain of data analysis experts versed in statistics tools like R, actionable insights based on data are now in high demand from all members of the organization. Operations managers, line managers, marketers, sales people, and customer service people all want better access to data so they can better serve customers.
What’s more, the relevant data is just as likely to be in the cloud as it is to be stored on a company’s in-house servers. Who can pull all that data together, make sure the uses of that data adhere to privacy requirements, and provide the necessary tools, training, and access to data to enable people to make sense of it? The CDO.
IT was and continues to be one form of competitive advantage. Data is now rising in importance as another form of competitive advantage—a strategic asset to be built and managed in its own right. As a result, I expect many more organizations to elevate the importance of data to the executive level in the years ahead.
Your Big Data Vision
To realize your Big Data vision, you’ll need the right team members, technology, and data sources. Before you assemble your team, you might want to experiment with a few free, online Big Data resources. The Google Public Data Explorer lets you interact with a variety of different public data sources at www.google.com/publicdata/directory. Even though these may not be the data sources you eventually want to work with, spending a few hours or even a few minutes playing around with the Data Explorer can give you a great sense for the kinds of visualizations and dashboards you might want to build to help you interpret and explain your Big Data results.
With your Big Data outcome in mind, you can begin to assemble your team, technology, and data resources.
Your team consists of both actual and virtual team members. Bring in business stakeholders from different functional areas as you put your Big Data effort together. Even though your project may be focused just on marketing or web site design, it’s likely you’ll need resources from IT and approval from legal or compliance before your project is done.
So many different Big Data technologies are available that the choice can often be overwhelming. Don’t let the technology drive your project—your business requirements should drive your choice of technology.
Finally, you may not be able to get access to all the data you need right away. This is where having the dashboards you ultimately want to deliver designed and sketched out up front can help keep things on track. Even when you can’t fill in every chart, using mock-up visualizations can help make it clear to everyone which data sources are most important in delivering on your Big Data vision.
In the chapters ahead, we’ll explore a variety of different technologies, visualization tools, and data types in detail. Starting with the end in mind and taking an iterative, test-driven approach will help turn your Big Data vision into reality.
1McKinsey & Company Interactive, Big Data: The Next Frontier for Competition. www.mckinsey.com/features/big_data