Better Business Decisions from Data: Statistical Analysis for Professional Success (2014)

Part VI. Big Data

Chapter 24. Getting Involved with Big Data

What Would You Like To Know?

In the previous chapter, we saw that the procedures for extracting information from big data can be quite simple. In contrast, the setting up of computer systems for applying the procedures to large amounts of data in a reliable and rapid manner requires considerable expertise. Following a summary of the potential applications of big data, we will discuss how businesses can become part of this new and exciting development.


There is hardly any area of human activity where big data is not having an impact, and the trend is likely to continue at an increasing rate. Any summary of the applications tends to end up as a very long list of examples.

Search engine providers such as Google and Yahoo were probably the first to use big data methods. Documents are located by text retrieval on the basis of keywords and similarities.

Retail sales provide appropriate applications. Large supermarkets, such as Walmart and Tesco, have incredible amounts of data, as each sale of each item is recorded via the barcode. It may be that certain items tend to be sold together in the same transaction, or certain items may sell better at certain times. Pricing and stocking can be altered to take advantage, and sales promotions can be timed appropriately. If purchases are made with the use of store cards or loyalty cards, the items sold can be linked to the customer and thus to the purchasing habits of customers in relation to gender, age, address, and so on. Tesco is installing face-scanning devices in its petrol stations to register the gender and likely ages of its customers.
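The idea that certain items tend to be sold together can be checked by counting how often each pair of items shares a transaction. The following is a minimal sketch in Python; the function name `pair_counts` and the baskets are invented for illustration and do not come from any retailer's system.

```python
from collections import Counter
from itertools import combinations


def pair_counts(transactions):
    """Count how often each unordered pair of items shares a transaction."""
    counts = Counter()
    for basket in transactions:
        # Deduplicate and sort so that each pair is counted once per basket
        # and always appears in the same order.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts


# Hypothetical baskets, invented for illustration only.
baskets = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["milk", "tea"],
    ["bread", "butter", "tea"],
]

counts = pair_counts(baskets)
top_pair, top_count = counts.most_common(1)[0]
print(top_pair, top_count)
```

With real barcode data the number of baskets and items would be far larger, and pairs would usually be judged against how often each item sells on its own, but the counting step is the same.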

If a retailer sends vouchers and details of products or special offers by post to potential customers, most of the approaches, of course, will produce no result. But there will be stored data available showing the features of those that have been successful in the past. It thus becomes possible to target customers of the type that are most likely to respond positively. Sales of financial products such as credit cards, insurance, or investment opportunities can be targeted in a similar way. The strategy can be applied by all kinds of businesses involved in sales. Amazon, for example, sells a large proportion of its books by sending recommended titles to customers on the basis of previous purchases. Retention of customers can be improved by identifying and targeting those most likely to depart.

The Internet can provide large amounts of data for the retailer. Every click on a website gives information not only about a sale but also about initial interest in a product, repeated interest, an immediate rejection, a rejection when the price is revealed or when the delivery charge appears, and so on.

Companies providing transport of mail, parcels, and goods generally make use of barcodes and thereby potentially accumulate much data. Information regarding the nature of the goods, their sources, and their destinations allows future resource needs to be anticipated and areas of growth to be identified. Scheduling of deliveries and route planning can be improved.

Financial institutions and banks can use historical data to identify the level of risk in offering loans to particular customers or even the possibility of fraud. The spending habits of credit card customers can reveal those most likely to be interested in other financial products. Fraud detection can also be employed by tax authorities and those responsible for government contracts. Possible fraudulent insurance and warranty claims can be detected.

In the fight against crime, areas more likely to be hit by specific kinds of criminal activity can be highlighted. Likelihood of terrorist attack can be determined, and the probability of repeat crimes from prisoners pending release can be quantified.

Product development is an expensive process, and if the product misses its target market, the result can be disastrous. Traditionally, sampling of prospective customers has been used to establish the desirable features of proposed products; but sampling is expensive, and its effectiveness is limited by the size of the sample. Predictive analytics offers the possibility of relating features of the new product to its desirability among specific types of customer as revealed by previous purchasing patterns.

Medical records show the characteristics and previous medical histories of patients who later develop specific conditions. Relationships can be identified that give warnings of possible future ailments. Similarly, the efficacy of different treatments may be compared. Successful predictions have included the spread of influenza, occurrence of premature births, and risk of death while undergoing surgery. Diagnosis of breast cancer has been improved.

The control of industrial processes reaps benefits such as reducing the number of defective items and avoiding operational problems. Faults in machinery and large industrial installations are often preceded by symptoms such as vibrations, temperature rises, or noises of various kinds. Information that distinguishes between serious and benign symptoms or that indicates a probable time to breakdown is of considerable value. Preventive maintenance scheduling can benefit from such information. In a similar way, the diagnosis of problems with cars and other vehicles from reported symptoms of malfunctioning is possible. The likelihood of failure of electric cables, washing machines, and office equipment has been predicted.

In the energy industries and utilities, monitoring of customer usage with regard to time and location can improve the efficiency of generation and supply.

Charities have benefited from increased donations and lower costs by targeting likely donors. People most in need of help have also been located.

Governments hold vast amounts of data. Some of it is centrally held—census and tax records, for example—and is usefully processed, but much of it is spread over numerous local sites. Combining data stores offers the potential of useful predictions in infrastructure planning, crime fighting, and health care, for example.

Mayer-Schönberger and Cukier (2013) and Siegel (2013) describe many applications in fascinating detail. The latter has a summary table of 147 specific cases of predictive analytics that have produced benefits, usually financial, for the organizations involved.

It should be noted that some of the applications mentioned above are not, strictly speaking, forecasting. Rather, they are searching for answers that are known, by someone somewhere, at the present time. Search engines, for example, locate information that already exists, though for the user the information is for future use. In the case mentioned in the preceding chapter of Watson’s success in playing Jeopardy!, the answers to the factual questions were of course known in advance to the contest producers, and Watson's task was to determine those answers from the information contained in its few terabytes of disk storage.


Rolls Royce has risen from a position of financial difficulties in the 1970s to being a successful global company. It is the world's third largest maker of aircraft engines and the second largest maker of large jet engines. About half of wide-bodied passenger jets and a quarter of smaller aircraft under production are powered by Rolls Royce engines. Also important is its business in marine engines and in the energy industries.

A major factor in its success story has been the collection and application of data. Its jet engines are fitted with monitoring systems that collect temperatures, pressures, flows, rotational speeds, and vibration levels at various locations within the engine. The successful series of Trent engines can be fitted with about 25 sensors. Signals from the sensors are collected during takeoff, climb, and cruise and are transmitted to the company's headquarters in Derby, via radio or satellite link, during each flight of the aircraft. Any unusual engine conditions trigger additional transmissions.

At Derby, the collected data is analyzed automatically using algorithms based on neural networks. Unusual features are studied by skilled engineers to obtain a diagnosis on which decisions can then be made. It may be necessary to notify the maintenance team at the destination airport that there is a need to undertake checks or, alternatively, to give assurance that the engine performance is satisfactory. Either way, the procedure leads to fewer delays and improved passenger safety and satisfaction. Gradual deterioration of an engine can also be identified and inspection schedules agreed on after discussions with the operating company. Sudden changes in engine performance may require more immediate examinations which, again, can be programmed to suit the operator's options without compromising safety. The procedures have led to improved working lives for the engines.
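The principle of flagging unusual sensor readings for an engineer to study can be illustrated with a much simpler rule than the neural networks mentioned above. The sketch below is a stand-in, not the company's method: it marks any reading that lies more than a chosen number of standard deviations from a baseline mean. The function name, the threshold, and the vibration figures are all assumptions made for illustration.

```python
from statistics import mean, stdev


def flag_unusual(baseline, new_readings, threshold=3.0):
    """Return (index, value) for each new reading that lies more than
    `threshold` standard deviations from the baseline mean."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    return [
        (i, x)
        for i, x in enumerate(new_readings)
        if abs(x - mu) > threshold * sigma
    ]


# Hypothetical vibration levels (arbitrary units), invented for illustration.
baseline_vibration = [1.00, 1.02, 0.98, 1.01, 0.99, 1.00, 1.03, 0.97]
flight_vibration = [1.01, 0.99, 1.35, 1.02]

flagged = flag_unusual(baseline_vibration, flight_vibration)
print(flagged)
```

A production system would combine many sensors and flight phases; the point here is only that an automatic first pass narrows down what the skilled engineers need to look at.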

Rolls Royce's utilization of data puts it in a commanding position when it comes to the servicing side of the business. When it sells an engine, it is effectively selling a service for the life of the engine. It would be difficult for another company to break into this corner of the market.

The Big Players

Chapter 1 started with matters that could be handled with pencil and paper. Subsequent chapters concerned calculations that require a pocket calculator, a spreadsheet, and eventually computer packages. This chapter reaches a stage when it is time to get assistance from experts. In spite of what you may have read or been told, handling big data is not easy. The subject is full of new terminology and much jargon, and the procedures require knowledge of programming and other specialized subjects.

The best-known technology for handling big data is probably Apache Hadoop, which was developed by Yahoo in the period 2006 to 2008. It is now an open source data-storage framework that can handle 10 to 100 gigabytes of data and above (Dumbill, 2012). It uses a file system, the Hadoop Distributed File System (HDFS), which is distributed among numerous servers. In real time, it can capture, read, and update large amounts of unstructured data such as social media, clicks, event data, and sensor data. In fact, Hadoop can accept any kind of data, either for processing or long-term storage. There is much replication and redundancy in the system so that server failures do not cause problems.

Hadoop is not a single defined entity but rather an evolving ecosystem embracing numerous auxiliary modules and programs.

Moving data is expensive, so the data processing is carried out where the data resides, though the tasks are distributed to the numerous servers. The processing is by means of MapReduce, which was originally developed by Google. The “map” in MapReduce refers to the filtering and sorting of the data, and the “reduce” refers to a summarizing process. Results of processing are returned to HDFS. MapReduce is used in other databases apart from Hadoop.
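The division of labor between the two steps can be sketched in a few lines of Python. This is a single-process illustration of the idea, not Hadoop's own programming interface: the function names are invented, and on a real cluster the map step would run on the servers where each block of data is stored.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (key, value) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group the emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: summarize all the values for one key, here by summing them."""
    return key, sum(values)


def word_count(documents):
    emitted = []
    for doc in documents:
        # On a cluster, each document would be mapped where it is stored.
        emitted.extend(map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(emitted).items())


# Hypothetical documents, invented for illustration only.
docs = ["big data big ideas", "data at scale"]
result = word_count(docs)
print(result)
```

Because each call to `map_phase` needs only its own document and each call to `reduce_phase` needs only the values for one key, both steps can be spread over many servers without the servers having to exchange the raw data.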

Java programming for loading files in HDFS is tedious. The task is made easier by the use of Pig or Hive. Pig, from Yahoo, is a programming language that can deal with semi-structured data. Hive, from Facebook, is a module that allows Hadoop to be used as a data warehouse, accepting queries in a form similar to SQL, a commonly used language for querying and managing databases.

Improvements in data access are provided by HBase, Sqoop, and Flume. HBase is a database that runs on top of HDFS and gives rapid access to tables holding billions of rows of data. HBase can also be used as a source and destination of data for MapReduce. Sqoop imports data from databases into Hadoop via HDFS or Hive. Flume, from Cloudera, is used for streaming data into HDFS.

Zookeeper organizes the various components, while Oozie manages the work flow. Mahout is a machine learning component.

Other add-ons are used in Hadoop applications, some of which are part of Hadoop and some of which are not. It can be seen from this brief summary that the choice of components for particular circumstances is a job for experts.

The kinds of problems suitable for analysis are varied. Risk exposure can be modeled for the banking and insurance industries. Customer churn can be analyzed. Product preference for Internet sales, retailing generally, advertising, and manufacturing can be identified. Sensor data is used to predict failures for telecommunications operators and data centers. Search behavior can be analyzed for Internet commerce and websites. Threats, fraud, and spam can be identified. Data from any kind of business can be accepted, and various analyses can be tried on it in the search for patterns.

Apache Cassandra is another open source database management system. It was developed at Facebook, and its long list of important users, such as Twitter and Netflix, vouches for its versatility and reliability. It is a distributed system that automatically replicates data across multiple data centers. There are no single points of failure. Compared with Hadoop, it is stronger at handling real-time data and weaker at analysis.

Large companies such as Google, IBM, Microsoft, HP, Amazon, SAP, and Oracle make use of open source facilities, together with their own components, to offer a commercial service to businesses. Cloudera, Teradata, 1010data, Fujitsu, Kognitio, Microstrategy, and NetApp are some of the other companies offering similar services.


In 2009, United States legislation was introduced to protect subprime borrowers. It required lenders to provide fairer rates and fees for their borrowers. Traditionally, lenders in the subprime market depended on rates and fees for their profitability.

Premier Bankcard is an organization providing credit cards for individuals with damaged credit histories. The company is committed to helping individuals receive a second chance with regard to their finances. The new legislation created problems. On the one hand, if too many cards were issued to customers who had not reached a satisfactory level of creditworthiness, there would be losses and pressure from the regulators. On the other hand, too much emphasis on those well on their way to recovery would lead to loss of customers moving to prime card issuers.

Premier decided to employ SAS Business Analytics to identify its best customers: the ones who lie between the two extremes and are on their way to creditworthiness. Aspects also covered in the analyses were rapid response to customer and market data, by daily review and daily forecasting, and fee justification analyses to meet the regulations.

The approach had the beneficial feature of being based on Premier's own data, and not on imported data or performance models.

The characteristics of the ideal customer were identified. It was found, for example, that the best customers had been with Premier on average for five years. Knowing who the best customers are means they can be targeted effectively. Customer retention was seen to be important. Retaining one customer for an extra month earns Premier nearly $12. An improvement of 10% in retention strategy produced $4.8 million.

The results achieved a revenue increase of $50 million, an additional $24 million from better customer retention, and a decrease of $1 million in losses resulting from fraud.

The Smaller Options

With so much publicity being given to big data, many small and medium-sized businesses that are not involved are considering whether they should be, and perhaps wondering what they want from it. Most of these businesses will not have in-house expertise and will rely on commercial providers of big data analysis. Furthermore, some such businesses will avoid involvement with the big players described above and will prefer to start in a more modest way. There are dozens of consultants who can provide big data analysis. These are the smaller players, employing anywhere from a handful of staff to several hundred. Often they will have developed modules to carry out fairly standard analyses of data that can be readily adapted to the needs of different businesses, and this clearly reduces the costs involved. The Internet, of course, provides details of these consulting firms, often with example case studies of their activities, and there are useful directories that include comparisons between the various firms. SourcingLine is a company that provides rankings and reviews of consulting companies in the field of big data analytics.

A business looking for assistance will have plenty of data in the form of past records of activities, and this would clearly be the starting point. Analysis of the existing data is likely to be straightforward, although it is important to be clear about what questions are being asked of the data. It will not be exciting, for example, to be told that more sandals are likely to be sold in the summer and more boots are likely to be sold in the winter.

The initial results will provide a useful introduction but will be of limited value unless new data is fed into the system as it becomes available. Streaming of real-time data is essential for rapid application of the results of analysis and effective control of business operations. If the company recognizes a particular problem for which a solution is required, the data analysis firm can develop a suitable model or use one that it may have available. The model can be supplied and staff trained to use it, applying it as necessary to different sets of data. Any additional problems will require further appropriate models.

So far, the business is probably not locked into an agreement with the solution provider and can shop around, but further analysis of a more advanced nature may require commitment to a more permanent arrangement. The business will supply full details of its activities and request a system that will produce modules capable of analyzing and dealing with predictions and problems. Included will be predictions of potential future problems and the ability to deal with them as they emerge.

With regard to the choice of data analysis company, the same criteria will apply as when engaging any other form of consultancy. Issues such as cost, extent of lock-in, time scale, and security of data will be considered. There may be advantage in commissioning a company that has assisted or specializes in similar businesses and may have appropriate expertise and software readily available.


2degrees is a New Zealand mobile telecommunications company. In four years it won one million customers in the face of long-installed competition.

With no in-house expertise but with a recognition of the value of big data analysis, the company decided to enlist the help of 11Ants Analytics. Churning—that is, customers leaving and moving to a competitor—was a specific problem. Indeed, it is a common problem in the mobile phone business. 2degrees chose to use a suite of modules consisting of a customer analyzer, a customer-churn analyzer, and a model builder. The use of these available modules from 11Ants Analytics meant the work could go ahead quickly.

The results were impressive. Customers most at risk of churning were identified by time on network, days since last top-up, whether the customer number was ported or not, customer plan, and calling behavior over the previous 90 days.

An experiment was run for three months. The customers were classified by their likelihood of churning. The 5% of customers chosen by the 11Ants Churn Analyzer as being most likely to churn were found to be 12.75 times more likely to churn than customers chosen at random. The 10% of customers chosen as most likely to churn were found to be 7.28 times more likely to churn than customers chosen at random.
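Figures such as "12.75 times more likely to churn than customers chosen at random" are usually called lift: the churn rate within the selected group divided by the churn rate across all customers. The sketch below shows the calculation under that assumption. The function name `lift_at`, the scores, and the outcomes are invented for illustration and are not 2degrees or 11Ants data.

```python
def lift_at(scores, churned, fraction):
    """Churn rate among the top `fraction` of customers ranked by score,
    divided by the churn rate across all customers."""
    n = len(scores)
    k = max(1, round(n * fraction))
    # Rank customers from highest to lowest predicted risk of churning.
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    top = ranked[:k]
    top_rate = sum(churned[i] for i in top) / k
    base_rate = sum(churned) / n
    return top_rate / base_rate


# Hypothetical data: 20 customers with model scores from 1.00 down to 0.05,
# and a 1 wherever the customer actually churned during the trial.
scores = [i / 20 for i in range(20, 0, -1)]
churned = [1, 1, 0, 1, 0, 0, 0, 0, 0, 1] + [0] * 10

print(lift_at(scores, churned, 0.10))
print(lift_at(scores, churned, 0.25))
```

A lift of 1 would mean the model does no better than choosing customers at random. As in the experiment described above, lift normally falls as the selected fraction grows, because the larger group takes in more customers who were never going to leave.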

2degrees could now focus on those most at risk and reduce its expenditure on retention marketing. The smaller number to be targeted meant that the retention offers could be more generous. The added benefit was that customers not likely to churn were not annoyed by messages asking them to stay. Also, offers could be aligned to the customers' usage: minutes for talkers, texts for texters.