Big Data Bootcamp: What Managers Need to Know to Profit from the Big Data Revolution (2014)
Chapter 1. Big Data
What It Is, and Why You Should Care
Scour the Internet and you’ll find dozens of definitions of Big Data. There are the three v’s—volume, variety, and velocity. And there are the more technical definitions, like this one from Edd Dumbill, analyst at O’Reilly Media: “Big Data is data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn’t fit the strictures of your database architectures. To gain value from this data, you must choose an alternative way to process it.”1
Such definitions, while accurate, miss the true value of Big Data. Big Data should be measured by the size of its impact, not by the amount of storage space or processing power that it consumes. All too often, the discussion around Big Data gets bogged down in terabytes and petabytes, and in how to store and process the data rather than in how to use it.
As consumers and business users, we don’t care about the size and scale of data. Rather, we want to be able to ask and answer the questions that matter to us. What medicine should we take to address a serious health condition? What information, study tools, and exercises should we give students to help them learn more effectively? How much more should we spend on a marketing campaign? Which features of a new product are our customers using?
That is what Big Data is really all about. It is the ability to capture and analyze data and gain actionable insights from that data at a much lower cost than was historically possible.
What is truly transformative about Big Data is the ease with which we can now use data. No longer do we need complex software that takes months or years to set up and use. Nearly all the analytics power we need is available through simple software downloads or in the cloud.
No longer do we need expensive devices to collect data. Now we can collect performance and driving data from our cars, fitness and location data from GPS watches, and even personal health data from low-cost attachments to our mobile phones. It is the combination of these capabilities—Big Data meets the cloud meets mobile—that is truly changing the game when it comes to making it easy to use and apply data.
Note: Big Data is transformative: You don’t need complex software or expensive data-collection techniques to make use of it. Big Data meeting the cloud and mobile worlds is a game changer for businesses of all sizes.
Big Data Crosses Over Into the Mainstream
So why has Big Data become so hot all of a sudden? Big Data has broken into the mainstream due to three trends coming together.
First, multiple high-profile consumer companies have ramped up their use of Big Data. Social networking behemoth Facebook uses Big Data to track user behavior across its network. The company makes new friend recommendations by figuring out who else you know.
The more friends you have, the more likely you are to stay engaged on Facebook. More friends means you view more content, share more photos, and post more status updates.
Business networking site LinkedIn uses Big Data to connect job seekers with job opportunities. With LinkedIn, headhunters no longer need to cold call potential employees. They can find and contact them via a simple search. Similarly, job seekers can get a warm introduction to a potential hiring manager by connecting to others on the site.
LinkedIn CEO Jeff Weiner recently talked about the future of the site and its economic graph—a digital map of the global economy that will in real time identify “the trends pointing to economic opportunities.”2 The challenge of delivering on such a graph and its predictive capabilities is a Big Data problem.
Second, both of these companies went public in just the last few years—Facebook on NASDAQ, LinkedIn on NYSE. Although these companies and Google are consumer companies on the surface, they are really massive Big Data companies at the core.
The public offerings of these companies—combined with that of Splunk, a provider of operational intelligence software, and that of Tableau Software, a visualization company—significantly increased Wall Street’s interest in Big Data businesses.
As a result, venture capitalists in Silicon Valley are lining up to fund Big Data companies like never before. Big Data is defining the next major wave of startups that Silicon Valley is hoping to take to Wall Street over the next few years.
Accel Partners, an early investor in Facebook, announced a $100 million Big Data Fund in late 2011 and made its first investment from the fund in early 2012. Zetta Venture Partners is a new fund launched in 2013 focused exclusively on Big Data analytics. Zetta was founded by Mark Gorenberg, who was previously a Managing Director at Hummer Winblad.3 Well-known investors Andreessen Horowitz, Greylock Partners, and others have made a number of investments in the space as well.
Third, business people, who are active users of Amazon, Facebook, LinkedIn, and other consumer products with data at their core, started expecting the same kind of fast and easy access to Big Data at work that they were getting at home. If Internet retailer Amazon could use Big Data to recommend books to read, movies to watch, and products to purchase, business users felt their own companies should be able to leverage Big Data too.
Why couldn’t a car rental company, for example, be smarter about which car to offer a renter? After all, the company has information about which car the person rented in the past and the current inventory of available cars. But with new technologies, the company also has access to public information about what’s going on in a particular market—information about conferences, events, and other activities that might impact market demand and availability.
By bringing together internal supply chain data with external market data, the company should be able to predict more accurately which cars to make available and when.
Similarly, retailers should be able to use a mix of internal and external data to set product prices, placement, and assortment on a day-to-day basis. By taking into account a variety of factors—from product availability to consumer shopping habits, including which products tend to sell well together—retailers can increase average basket size and drive higher profits. This in turn keeps their customers happy by having the right products in stock at the right time.
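To make the idea of "products that tend to sell well together" concrete, here is a minimal sketch in Python. It simply counts how often each pair of products shows up in the same basket. The product names and transactions are invented for illustration, and a real retailer would likely use far richer data and methods; this is only one simple way to approach the question.

```python
from collections import Counter
from itertools import combinations


def pair_counts(baskets):
    """Count how often each unordered pair of products appears in the same basket."""
    counts = Counter()
    for basket in baskets:
        # Deduplicate and sort so ("bread", "milk") and ("milk", "bread") share one key.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts


# Hypothetical transactions, invented for illustration only.
baskets = [
    ["bread", "milk", "eggs"],
    ["bread", "milk"],
    ["milk", "eggs"],
    ["bread", "butter"],
]

# The pairs bought together most often are candidates for joint placement or promotion.
for pair, count in pair_counts(baskets).most_common(2):
    print(pair, count)
```

Pairs with high counts are natural candidates for placing side by side or promoting together, which is one way a retailer might try to raise average basket size.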
So while Big Data became hot seemingly overnight, in reality it is the culmination of years of software development, market growth, and pent-up consumer and business user demand.
How Google Puts Big Data Initiatives to Work
If there’s one technology company that has capitalized on that demand and that epitomizes Big Data, it’s search engine giant Google, Inc. According to Google, the company handles an incredible 100 billion search queries per month.4
But Google doesn’t just store links to the web sites that appear in its search results. It also stores all the searches people make, giving the company unparalleled insight into the when, what, and how of human search behavior.
Those insights mean that Google can optimize the advertising it displays to monetize web traffic better than almost every other company on the planet. It also means that Google can predict what people are going to search for next. Put another way, Google knows what you’re looking for before you do!
Google has had to deal, for years, with massive quantities of unstructured data such as web pages, images, and the like rather than more traditional structured data, such as tables that contain names and addresses. As a result, Google’s engineers developed innovative Big Data technologies from the ground up. Such opportunities have helped Google build an army of talented engineers who are drawn to the unique size and scale of Google’s technical challenges.
Another advantage the company has is its infrastructure. The Google search engine itself is designed to work seamlessly across hundreds of thousands of servers. If more processing or storage is required or if a server goes down, Google’s engineers simply add more servers. Some estimates put Google’s total number of servers at greater than a million.
Google’s software technologies were designed with this infrastructure in mind. Two technologies in particular, MapReduce and the Google File System, “reinvented the way Google built its search index,” Wired magazine reported during the summer of 2012.5
Numerous companies are now embracing Hadoop, an open-source derivative of MapReduce and the Google File System. Hadoop, which was pioneered at Yahoo! based on a Google paper about MapReduce, allows for distributed processing of large data sets across many computers.
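For readers curious about what "MapReduce" means in practice, the following is a minimal single-machine sketch of the pattern in Python, using the classic word-count example. It is not Hadoop code and does not use any Hadoop API; the function names are made up for illustration. In a real cluster, the map and reduce steps would run in parallel on many machines, with the framework handling the grouping step in between.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key between the two phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all the values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # On a cluster, each document would be mapped on a different machine.
    # Here everything runs in one process to show the shape of the computation.
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())


# Hypothetical input, invented for illustration only.
documents = ["big data is big", "data moves fast"]
print(word_count(documents))
```

The appeal of the pattern is that each map call and each reduce call is independent of the others, so the work can be spread across as many computers as are available.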
While other companies are just now starting to make use of Hadoop, Google has been using large-scale Big Data technologies for years, giving it an enormous leg up in the industry. Meanwhile, Google is shifting its focus to other, newer technologies. These include Caffeine for content indexing, Pregel for mapping relationships, and Dremel for querying very large quantities of data. Dremel is the basis for the company’s BigQuery offering.6
Now Google is opening up some of its investment in data processing to third parties. Google BigQuery is a web offering that allows interactive analysis of massive data sets containing billions of rows of data. BigQuery is data analytics on-demand, in the cloud. In 2014, Google introduced Cloud Dataflow, a successor to Hadoop and MapReduce, which works with large volumes of both batch-based and streaming-based data.
Previously, companies had to buy expensive installed software and set up their own infrastructure to perform this kind of analysis. With offerings like BigQuery, these same companies can now analyze large data sets without making a huge up-front investment.
Google also has access to a very large volume of machine data generated by people doing searches on its site and across its network. Every time someone enters a search query, Google knows what that person is looking for. Every human action on the Internet leaves a trail, and Google is well positioned to capture and analyze that trail.
Yet Google has even more data available to it beyond search. Companies install products like Google Analytics to track visitors to their own web sites, and Google gets access to that data too. Web sites use Google AdSense to display ads from Google’s network of advertisers on their own web sites, so Google gets insight into how advertisements perform not only on its own site but on other publishers’ sites as well. Google also has vast amounts of mapping data from Google Maps and Google Earth.
Put all that data together and the result is a business that benefits not just from the best technology but from the best information. When it comes to Information Technology (IT), many companies invest heavily in the technology part of IT, but few invest as heavily and as successfully as Google does in the information component of IT.
Note: When it comes to IT, the most forward-thinking companies invest as much in information as they do in technology.
How Big Data Powers Amazon’s Quest to Become the World’s Largest Retailer
Of course, Google isn’t the only major technology company putting Big Data to work. Internet retailer Amazon.com has made some aggressive moves and may pose the biggest long-term threat to Google’s data-driven dominance.
At least one analyst predicts that Amazon will exceed $100B in revenue by 2015, putting it on track to eclipse Walmart as the world’s largest retailer. Like Google, Amazon has vast amounts of data at its disposal, albeit with a much heavier e-commerce bent.
Every time a customer searches for a TV show to watch or a product to buy on the company’s web site, Amazon gets a little more insight about that customer. Based on searches and product purchasing behavior, Amazon can figure out what products to recommend next.
And the company is even smarter than that. It constantly tests new design approaches on its web site to see which approach produces the highest conversion rate.
Think a piece of text on a web page on the Amazon site just happened to be placed there? Think again. Layout, font size, color, buttons, and other elements of the company’s site design are all meticulously tested and retested to deliver the best results.
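The kind of testing described here is often called A/B testing: show two versions of a page to different visitors and compare conversion rates. The sketch below, in Python, shows one common way to check whether a difference is likely to be real rather than chance, using a two-proportion z-test. The visitor and conversion numbers are invented, and nothing here is meant to describe Amazon's actual methods.

```python
import math


def conversion_rate(conversions, visitors):
    """Fraction of visitors who completed the desired action."""
    return conversions / visitors


def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """z statistic for the difference between two conversion rates (B minus A)."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / standard_error


def two_sided_p_value(z):
    """Probability of a result at least this extreme if the two designs perform the same."""
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical results: design A versus design B, 10,000 visitors each.
z = two_proportion_z(conv_a=200, n_a=10_000, conv_b=260, n_b=10_000)
print("A:", conversion_rate(200, 10_000), "B:", conversion_rate(260, 10_000))
print("z:", round(z, 2), "p-value:", round(two_sided_p_value(z), 4))
```

A small p-value suggests the new design really does convert better. With very little traffic, the same difference in rates could easily be noise, which is one reason large sites are well placed to test continuously.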
The data-driven approach doesn’t stop there. According to more than one former employee, the company culture is ruthlessly data-driven. The data shows what’s working and what isn’t, and cases for new business investments must be supported by data.
This incessant focus on data has allowed Amazon to deliver lower prices and better service. Consumers often go directly to Amazon’s web site to search for goods to buy or to make a purchase, skipping search engines like Google entirely.
The battle for control of the consumer reaches even further. Apple, Amazon, Google, and Microsoft—known collectively as The Big Four—are battling it out not just online but in the mobile domain as well.
With consumers spending more and more time on mobile phones and tablets instead of in front of their computers, the company whose mobile device is in the consumer’s hand will have the greatest ability to sell to that consumer and gain the most insight about that consumer’s behavior. The more information a company has about consumers in aggregate and as individuals, the more effectively it can target its content, advertisements, and products to those consumers.
Incredibly, Amazon’s grip reaches all the way from the infrastructure supporting emerging technology companies to the mobile devices on which people consume content. Years ago, Amazon foresaw the value in opening the server and storage infrastructure that is the backbone of its e-commerce platform to others.
Amazon Web Services (AWS), as the company’s public cloud offering is known, provides scalable computing and storage resources to emerging and established companies. While AWS is still relatively early in its growth, one analyst estimate puts the offering at greater than a $3.8 billion annual revenue run rate.7
The availability of such easy-to-access computing power is paving the way for new Big Data initiatives. Companies can and will still invest in building out their own private infrastructure in the form of private clouds, of course. Private clouds—clouds that companies manage and host internally—make sense when dealing with specific security, regulatory, or availability concerns.
But if companies want to take advantage of additional or scalable computing resources quickly, they can simply fire up a bunch of server instances in Amazon’s public cloud. What’s more, Amazon continues to lower the prices of its computing and storage offerings. Because of the company’s massive purchasing power and the scale of its infrastructure, it can negotiate prices for computers and networking equipment that are far lower than those available even to most other large corporations. Amazon’s Web Services offering puts the company front and center not just with its own consumer-facing site and mobile devices like the Kindle Fire, but with infrastructure that supports thousands of other popular web sites as well.
The result is that Big Data analytics no longer requires investing in fixed-cost IT up-front. Users can simply purchase more computing power to perform analysis or more storage to store their data when they need it. Data capture and analysis can be done quickly and easily in the cloud, and users don’t need to make expensive decisions about IT infrastructure up-front. Instead they can purchase just the computing and storage resources they need to meet their Big Data needs and do so at the time and for the duration that those resources are actually needed.
Businesses can now capture and analyze an unprecedented amount of data—data they simply couldn’t afford to analyze or store before and instead had to throw away.
Note: One of the most powerful aspects of Big Data is its scalability. Using cloud resources, including analytics and storage, there is now no limit to the amount of data a company can store, crunch, and make useful.
Big Data Finally Delivers the Information Advantage
Infrastructure like Amazon Web Services combined with the availability of open-source technologies like Hadoop means that companies are finally able to realize the benefits long promised by IT.
For decades, the focus in IT was on the T—the technology. The job of the Chief Information Officer (CIO) was to buy and manage servers, storage, and networks.
Now, however, it is information and the ability to store, analyze, and predict based on that information that is delivering a competitive advantage (Figure 1-1).
When IT first became widely available, companies that adopted it early on were able to move faster and out-execute those that did not. Some credit Microsoft’s rise in the 1990s not just to its ability to deliver the world’s most widely used operating system, but to the company’s internal embrace of email as the standard communication mechanism.
Figure 1-1. Information is becoming the critical asset that technology once was
While many companies were still deciding whether or how to adopt email, at Microsoft, email became the de facto communication mechanism for discussing new hires, product decisions, marketing strategy, and the like. While electronic group communication is now commonplace, at the time it gave the company a speed and collaboration advantage over those companies that had not yet embraced email.
Companies that embrace data and democratize the use of that data across their organizations will benefit from a similar advantage. Companies like Google and Facebook have already benefited from this data democratization.
By opening up their internal data analytics platforms to analysts, managers, and executives throughout their organizations, Google, Facebook, and others have enabled everyone in their organizations to ask business questions of the data and get the answers they need, and to do so quickly. As Ashish Thusoo, a former Big Data leader at Facebook, put it, new technologies have changed the conversation from “what data to store” to “what can we do with more data?”
Facebook, for example, runs its Big Data effort as an internal service. That means the service is designed not for engineers but for end-users—line managers who need to run queries to figure out what’s working and what isn’t.
As a result, managers don’t have to wait days or weeks to find out what site changes are most effective or which advertising approaches work best. They can use the internal Big Data service to get answers to their business questions in real time. And the service is designed with end-user needs in mind, all the way from operational stability to social features that make the results of data analysis easy to share with fellow employees.
The past two decades were about the technology part of IT. In contrast, the next two decades will be about the information part of IT. Companies that can process data faster and integrate public and internal sources of data will gain unique insights that enable them to leapfrog over their competitors.
As J. Andrew Rogers, founder and CTO of the Big Data startup SpaceCurve, put it, “the faster you analyze your data, the greater its predictive value.” Companies are moving away from batch processing (that is, storing data and then running slow analytics processing on the data after the fact) to real-time analytics to gain a competitive advantage.
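The difference between batch and real-time processing can be illustrated with a small Python sketch. The batch version waits until all the data has been stored and then computes an average; the streaming version keeps a running average that is up to date after every new event. The class name and the order values are invented for illustration, and real-time systems in production are considerably more involved.

```python
class RunningMean:
    """Keeps an average that is current after every event, without storing the events."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, value):
        # Incremental update: shift the mean toward the new value by 1/count of the gap.
        self.count += 1
        self.mean += (value - self.mean) / self.count
        return self.mean


def batch_mean(values):
    """Batch approach: the answer is only available once all values have been collected."""
    return sum(values) / len(values)


# Hypothetical order values arriving one at a time.
events = [120, 80, 100, 140]

stream = RunningMean()
for value in events:
    print("after", value, "running mean is", stream.update(value))

print("batch mean, computed after the fact:", batch_mean(events))
```

Both approaches arrive at the same final answer. The streaming version simply has a usable answer at every moment along the way, which is the advantage the move to real time is meant to deliver.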
The good news for executives is that the information advantage that comes from Big Data is no longer exclusively available to companies like Google and Amazon. Open-source technologies like Hadoop are making it possible for many other companies—both established Fortune 1,000 enterprises and emerging startups—to take advantage of Big Data to gain a competitive advantage, and to do so at a reasonable cost. Big Data truly does deliver the long-promised information advantage.
What Big Data Is Disrupting
The big disruption from Big Data is not just the ability to capture and analyze more data than in the past, but to do so at price points that are an order of magnitude cheaper. As prices come down, consumption goes up.
This ironic twist is known as the Jevons paradox, named for William Stanley Jevons, the economist who observed it in coal consumption during the Industrial Revolution. As technological advances make storing and analyzing data more efficient, companies are doing a lot more analysis, not less. This, in a nutshell, is what’s so disruptive about Big Data.
Many large technology companies, from Amazon to Google and from IBM to Microsoft, are getting in on Big Data. Yet dozens of startups are cropping up to deliver open-source and cloud-based Big Data solutions.
While the big companies are focused on horizontal Big Data solutions—platforms for general-purpose analysis—smaller companies are focused on delivering applications for specific lines of business and key verticals. Some products optimize sales efficiency while others provide recommendations for future marketing campaigns by correlating marketing performance across a number of different channels with actual product usage data. There are Big Data products that can help companies hire more efficiently and retain those employees once hired.
Still other products analyze massive quantities of survey data to provide insights into customer needs. Big Data products can evaluate medical records to help doctors and drug makers deliver better medical care. And innovative applications can now use statistics from student attendance and test scores to help students learn more effectively and have a higher likelihood of completing their studies.
Historically, it has been all too easy to say that we don’t have the data we need or that the data is too hard to analyze. Now, the availability of these Big Data Applications means that companies don’t need to develop or deploy all Big Data technology in-house. In many cases they can take advantage of cloud-based services to address their analytics needs. Big Data is making it much, much easier than it has been to work with data, to analyze that data, and to gain actionable insights from it. That truly is disruptive.
Big Data Applications Changing Your Work Day
Big Data Applications, or BDAs, represent the next big wave in the Big Data space. Industry analyst firm CB Insights looked at the funding landscape for Big Data and reported that Big Data companies raised some $1.28 billion in the first half of 2013 alone.8 Since then, investors have continued to pour money into existing infrastructure players. One company, Cloudera, a commercial provider of Hadoop software, announced a massive $900 million funding round in March of 2014, bringing the company’s total funding to $1.2 billion.
Going forward, the focus will shift from the infrastructure necessary to work with large amounts of data to the uses of that data. No longer will the question be where and how to store large quantities of data. Instead, users will ask how they can use all that data to gain insight and obtain competitive advantage.
Note: The era of creating and providing the infrastructure necessary to work with Big Data is nearly over. Going forward, the focus will be on one key question: “How can we use all our data to create new products, sell more, and generally outrun our competitors?”
Splunk, an operational intelligence company, is one existing example of this. Historically, companies had to analyze log files—the files generated by network equipment and servers that make up their IT systems—in a relatively manual process using scripts they developed themselves.
Not only did IT administrators have to maintain the servers, network equipment, and software for the infrastructure of a business, they also had to build their own tools in the form of scripts to determine the cause of issues arising from those systems. And those systems generate an immense amount of data. Every time a user logs in or a file is accessed, every time a piece of software generates a warning or an error, that is another piece of data that administrators have to comb through to figure out what’s going on.
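To give a sense of the kind of homegrown script an administrator might have written, here is a minimal Python sketch that tallies log entries by server and severity. The log format (timestamp, host, level, message) and the sample lines are assumptions made up for this example; real log files vary widely from system to system, which is part of what made this work so laborious.

```python
import re
from collections import Counter

# Assumed line format: "<timestamp> <host> <LEVEL> <message>". Real formats differ.
LOG_PATTERN = re.compile(
    r"^(?P<timestamp>\S+) (?P<host>\S+) (?P<level>[A-Z]+) (?P<message>.*)$"
)


def count_by_host_and_level(lines):
    """Tally log entries per (host, level), skipping lines that do not match the format."""
    counts = Counter()
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match is None:
            continue  # Malformed or unexpected line; a real script might report these.
        counts[(match["host"], match["level"])] += 1
    return counts


# Hypothetical log lines, invented for illustration only.
sample_lines = [
    "2014-03-01T10:00:00 web01 INFO user login ok",
    "2014-03-01T10:00:05 web01 ERROR disk quota exceeded",
    "2014-03-01T10:00:09 db01 WARNING slow query",
    "2014-03-01T10:00:12 web01 ERROR disk quota exceeded",
    "malformed line",
]

for (host, level), count in sorted(count_by_host_and_level(sample_lines).items()):
    print(host, level, count)
```

Even a toy script like this has to make assumptions about the log format and decide what to do with lines that do not fit. Multiply that by every system in a data center and the maintenance burden becomes clear.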
With BDAs, companies no longer have to build the tools themselves. They can take advantage of pre-built applications and focus on running their businesses instead. Splunk’s software, for example, makes it possible to find infrastructure issues easily by searching through IT log files and visualizing the locations and frequency of issues. Of course, the company’s software is primarily installed software, meaning it has to be installed at a customer’s site.
Cloud-based BDAs hold the promise of not requiring companies to install any hardware or software at all. In some ways, they can be thought of as the next logical step after Software as a Service (SaaS) offerings. SaaS offerings, which are software products delivered over the Internet, are relatively well established. As an example, Salesforce.com, which first introduced the “no software” concept over a decade ago, has become the de facto standard for cloud-based Customer Relationship Management (CRM), software that helps companies manage their customer lists and relationships.
SaaS transformed software into something that could be used anytime, anywhere, with little maintenance required on the part of its users. Just as SaaS transformed how we access software, BDAs are transforming how we access data. Moreover, BDAs are moving the value in software from the software itself to the data that that software enables us to act on. Put another way, BDAs have the potential to turn today’s technology companies into tomorrow’s highly valuable information businesses.
BDAs are transforming both our workdays and our personal lives, often at the same time. Opower, for example, is changing the way energy is consumed. The company tracks energy consumption across some 50 million U.S. households by working with 75 different utility companies. The company uses data from smart meters—devices that track household energy usage—to provide consumers with detailed reports on energy consumption. Even a small change in energy consumption can have a big impact when spread across tens of millions of households.
Just as Google has access to incredible amounts of data about how consumers behave on the Internet, Opower has huge amounts of data about how people behave when it comes to energy usage. That kind of data will ultimately give Opower, and companies like it, highly differentiated insights. Although the company has started out by delivering energy reports, by continuing to build up its information assets, it will be well-positioned as a Big Data business.9
BDAs aren’t just appearing in the business world, however. Companies are developing many other data applications that can have a positive impact on our daily lives. In one example, some mobile applications track health-related metrics and make recommendations to improve human behavior. Such products hold the promise of reducing obesity, increasing quality of life, and lowering healthcare costs. They also demonstrate that it is at the intersection of new mobile devices, Big Data, and cloud computing that some of the most innovative and transformative Big Data Applications may yet appear.
Big Data Enables the Move to Real Time
If the last few years of Big Data have been about capturing, storing, and analyzing data at lower cost, the next few years will be about speeding up the access to that data and enabling us to act on it in real time. If you’ve ever clicked on a web site button only to be presented with a wait screen, you know just how frustrating it is to have to wait for a transaction to complete or for a report to be generated.
Contrast that with the response time for a Google search result. Google Instant, which Google introduced in 2010, shows you search results as you type. By introducing the feature, Google ended up serving five to seven times more search result pages for typical searches. When the interface was introduced, people weren’t sure they liked it.10 Now, just a few years later, no one can imagine living without it.
Data analysts, managers, and executives want the Google Instant kind of immediacy in understanding their businesses. As these users of Big Data push for faster and faster results, just adopting Big Data technologies will no longer be sufficient. Sustained competitive advantage will come not from Big Data itself but from the ability to gain insight from information assets faster than others. Interfaces like Google Instant demonstrate just how powerful immediate access can be.
According to IBM, “every day we create 2.5 quintillion bytes of data—so much that 90% of the data in the world has been created in the last two years alone.”11 Industry research firm Forrester estimates that the overall amount of corporate data is growing by 94% per year.12
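It can be hard to picture what a 94% annual growth rate means over time, so here is a short Python sketch of the compounding arithmetic. It takes the Forrester figure quoted above at face value and assumes, purely for illustration, that the rate stays constant from year to year; actual growth will of course vary.

```python
import math


def growth_factor(annual_rate, years):
    """How many times larger the data volume is after compounding for the given years."""
    return (1 + annual_rate) ** years


def doubling_time_years(annual_rate):
    """Years needed for the data volume to double at a constant annual growth rate."""
    return math.log(2) / math.log(1 + annual_rate)


ANNUAL_RATE = 0.94  # The 94% per year estimate cited in the text.

print("Doubling time in years:", round(doubling_time_years(ANNUAL_RATE), 2))
for years in (1, 3, 5):
    print("After", years, "years:", round(growth_factor(ANNUAL_RATE, years), 1), "times today's volume")
```

At this rate, corporate data roughly doubles every year, and a company would be holding about 27 times as much data in five years as it does today. That is the scale a Big Data roadmap has to plan for.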
With this kind of growth, every company needs a Big Data roadmap. At a minimum, companies need to have a strategy for capturing data, from machine log files generated by in-house computer systems to user interactions on web sites, even if they don’t decide what to do with that data until later. As Rogers put it, “data has value far beyond what you originally anticipate—don’t throw it away.”
Tip: “Data has value far beyond what you originally anticipate—don’t throw it away.” —J. Andrew Rogers, CTO, SpaceCurve.
Companies need to plan for exponential growth of their data. While the number of photos, instant messages, and emails is very large, the amount of data generated by networked “sensors” such as mobile phones, GPS units, and other devices is much larger.
Ideally, companies should have a vision for enabling data analysis throughout the organization and for that analysis to be done in as close to real time as possible. By studying the Big Data approaches of Google, Amazon, Facebook, and other tech leaders, you can see what’s possible with Big Data. From there, you can put an effective Big Data strategy in place in your own organization.
Companies that have success with Big Data add one more key element to the mix: a Big Data leader. All the data in the world means nothing if you can’t get insights from it. Your Big Data leader—a Chief Data Officer or a VP of Data Insights—can not only help your entire organization get the right strategy in place but can also guide your organization in getting the actionable insights it needs.
Companies like Google and Amazon have been using data to drive their decisions for years and have become wildly successful in the process. With Big Data, these same capabilities are now available to you. You’ll read a lot more about how to take advantage of these capabilities and formulate your own Big Data roadmap in Chapter 3. But first, let’s take a look at the incredible market drivers and innovative technologies all across the Big Data landscape.13
3. Zetta Venture Partners is an investor in my company, Content Analytics.
13. Some of the material for this chapter appeared in a guest contribution I authored for Harvard Business Review China, January 2013.