Better Business Decisions from Data: Statistical Analysis for Professional Success (2014)
Part VI. Big Data
Chapter 25. Concerns with Big Data
The Small Print
Beneficial innovations always have downsides. We accept large numbers of deaths from road accidents and occasional air disasters for the benefits of faster travel. We accept the risk of nuclear war for the benefits of nuclear energy. Big data is not unique in having problem areas—but we are not contemplating the end of civilization!
Data is valuable. It is not just the conclusions from the processed data that have an economic value; the data itself has value because of its potential. OpusData, for example, is a company that sells access to data from The Numbers, a large database containing financial details on about 15,000 movies and 18,000 actors, directors, and technicians. Traditionally, of course, there have been small businesses that have collected data, fairly laboriously, to supply to industries and media organizations for a fee. As the stores of data get larger, the value increases exponentially. It has even been suggested that data stored by a company should be attributed monetary value, which should then be added to the company assets.
In the wrong hands, data can result in serious problems for companies, governments, and the general public. Security is therefore paramount, particularly when a business entrusts its data to cloud storage and processing by a different company. Although companies take extreme precautions, it is well known that leaks of sensitive material have always occurred and still do occur. They can range from a laptop being left on a train to hackers accessing bank accounts. In February 2014, Barclays Bank reported that it was investigating the loss of several thousand files containing customers’ details. It was alleged that the files, which included the customers’ attitudes to risk, had been sold to rogue City traders.
Some businesses, understandably, are put off from engaging with big data because of concerns about security.
Each one of us is extensively documented by data stored by various organizations. Our details are hoovered up in the course of purchases, online searches, social website interactions, financial transactions, and so on. In addition, there are the more obvious traditional repositories of our data, such as electoral registers, employment and tax details, passports, and licenses of various kinds.
In the past, when the data lay dormant, it didn’t really matter; but now, without consent, the data is being used for various purposes of which the general public is only just becoming aware. Sometimes the data is anonymized by the removal of names and addresses, but numerous studies have shown that it is often a trivial analytical task to identify individuals by combining particular characteristics in the records with links to other databases.
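The mechanics of such re-identification are simple. The following sketch, using entirely invented records, shows a linkage attack of the classic kind: an “anonymized” dataset is joined to a public one on a combination of quasi-identifiers (here ZIP code, date of birth, and sex), and a unique match restores the name.

```python
# Hypothetical illustration of a linkage (re-identification) attack.
# All records below are invented for the example.

# "Anonymized" medical records: names removed, quasi-identifiers kept.
anonymized = [
    {"zip": "02138", "dob": "1945-07-31", "sex": "F", "diagnosis": "hypertension"},
    {"zip": "02139", "dob": "1962-01-15", "sex": "M", "diagnosis": "asthma"},
]

# A public list (e.g. a voter roll): names present, same quasi-identifiers.
voter_roll = [
    {"name": "A. Smith", "zip": "02138", "dob": "1945-07-31", "sex": "F"},
    {"name": "B. Jones", "zip": "02140", "dob": "1988-03-02", "sex": "M"},
]

def reidentify(anon_rows, public_rows):
    """Join the two datasets on the quasi-identifier triple."""
    matches = []
    for a in anon_rows:
        for p in public_rows:
            if (a["zip"], a["dob"], a["sex"]) == (p["zip"], p["dob"], p["sex"]):
                matches.append({"name": p["name"], "diagnosis": a["diagnosis"]})
    return matches

print(reidentify(anonymized, voter_roll))
# A unique match ties A. Smith to the hypertension record.
```

The attack needs no special software: a simple join on a handful of shared fields suffices, which is why removing names alone is weak protection.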
Suggestions of what the future holds are reminiscent of George Orwell’s 1984. A married couple chatting at home discover a difference of opinion. The television is switched on and is listening to their conversation. The information is processed, and in the next commercial break an advertisement for marriage counseling appears (FT Reporters, 2013). More serious are issues of possible legal action against individuals on the basis of probability. Should a person be released from prison if there is an 85% chance of his committing a further murder? Should drivers of fast cars be fined for potential speeding? These kinds of questions may sound rather silly, but we can already be prosecuted for actions justified on the basis of probability. Not wearing a seat belt while driving and smoking in public buildings are examples of actions that are harmful—but only potentially harmful.
We are beginning to see reactions to the invasion of privacy by big data. Cornell University students are opposing New York State’s cooperation with inBloom, an organization that seeks to assemble student details in a single database. Of the nine states that joined with inBloom, eight have already pulled out because of issues of privacy. The Washington Post reported that there is considerable concern regarding the activities of the Patient-Centered Outcomes Research Institute (PCORI), which is collecting detailed patient medical records. The aim is to assemble the data for analysis to improve diagnosis and treatment.
There has been similar opposition in the UK. The introduction of identity cards has been strongly objected to. CCTV cameras have had to be removed, after public protests, in a suburb with a large ethnic-minority population. In 2012, the government introduced legislation for the compulsory destruction of DNA samples and profiles, and fingerprint records, of anyone arrested but not convicted of a crime.
The UK government’s plan to bring together the vast amount of medical records that the National Health Service holds, at present scattered among the various doctors’ offices, health centers, and hospitals, has been put on hold. The intention is that the data will be sold to health companies and academics to bring about major improvements in health care; the concern is that breaches in security may result in patients being identified. It is somewhat ironic that it was a British nurse, Florence Nightingale, famous for her nursing of the sick and wounded in the Crimean War, who pioneered the recording of medical data in the 1850s in order to improve treatments.
Handling big data requires special skills. The new kind of scientist—the data scientist—needs to be a combination of statistician, software programmer, and graphics designer. Some knowledge of machine learning, artificial intelligence, and neural networks is required. Furthermore, he or she needs to have an understanding of business goals and have good communication skills. The latter are particularly important because the findings of big-data analysis may have to be put to senior executives who have their own prejudices regarding what action is needed.
Though we can expect to produce sufficient data scientists in the future, there is currently a shortage. This is well illustrated by the experiences of the Institute for Advanced Analytics of North Carolina State University (Burlingame and Nielson, 2012: 60–61). In 2012, there were 38 candidates for the Master of Science in Analytics (MSA). Between them, they had 591 job interviews with 54 employers. One or more offers went to 97% of them, and 47% had three or more offers. Offers covered a range of businesses: banking, finance, consulting, energy, gaming, health care, Internet, pharmaceuticals, research, and software.
A New Concept
The arrival of big data has changed the way we think about statistics. Traditionally, statistics has embodied the principles that correlation should not be taken to imply a causal relationship and that extrapolation is a necessary evil. In the applications of big data analysis, these basic doctrines are not denied, but they are circumvented. The existence of a causal relationship is not considered relevant. If association between variables exists, it can be used to advantage provided that we act quickly, the rapid response minimizing the problem with extrapolation. Some statisticians have reservations about big data on these grounds. Others have noted that with so many conclusions being derived from the sets of data, a proportion will be simply wrong. This point was made in previous chapters when discussing multiple comparisons from the same data.
It has been well recognized in the sciences that the act of observing affects to some extent that which is being observed. Experimental designs and investigative programs take this into account where possible. When the experiment is of limited range, the consequences of this feedback are of little significance, but the applications of big data can affect large numbers of people. If shoppers appear to prefer Jispo cornflakes, marketing techniques will push the sales even higher. Eventually other brands disappear and everyone eats Jispo. Are we beginning to create a population of puppets, all behaving in identical ways?
The handling and processing of big data is probably the most radical innovation in the practice of statistics that has been seen for a very long time. I can visualize the day when statistics based on modest samples comes to be referred to as “traditional statistics” or even “classical statistics.”
World Chief Theo 7D9G gazed intently as the 3D screen around him faded. He had witnessed the interviews of two candidates for the position of Deputy World Chief. The interviews, of course, had not been conducted by him, but by PASWRD. The indispensable PASWRD, or, to give it its full name, Processor and Storage of all World Real-time Data, could interview and do many other things better than any human could.
But there was a problem. PASWRD had reported that there was, in the foreseeable future, no difference, economically or socially, whichever of the two candidates were to be appointed. Theo would have to decide, but he did not anticipate the task with enthusiasm—it was rare for him to have to make a decision without assistance.
He thought for a moment, and then his wrinkled face began to show signs of a smile developing. He mentally set PASWRD into forecasting mode and focused on the image of the control panel. He inserted the hypothetical appointment of candidate A and began to steer a precisely defined path moving into the future. Forecasting beyond five years was not permitted, but Theo was able to override the restriction. Eventually satisfied, he stopped the projection. He then repeated the process with the assumption that candidate B was appointed.
When the second projection had been completed, he thought his chair into a relaxing position. He had the answer. Candidate B would be appointed as the new deputy. And Theo would have an extra four years’ life span—plus or minus, of course, the uncertainty, which PASWRD reported as 2.3 years at a 95% confidence level.