Better Business Decisions from Data: Statistical Analysis for Professional Success (2014)
Part I. Uncertainties
Chapter 2. Sources of Uncertainty
Why “Sure Thing!” Rarely Is
The results of any investigation will, of course, be uncertain, if not completely wrong, if the information on which the investigation is based is not correct. However, in statistical investigations there are additional sources of uncertainty, because of the need to extract a neat and useful conclusion from information that may be extensive and variable.
Statements that appear at first sight to be clear and unambiguous often hide a great deal of uncertainty. In the previous chapter, I used the proposition “All cows eat grass” as an example of an acceptable starting point from which to draw a logical conclusion. Looking closely, you can see that it is a statistical statement. It relates cows to the eating of grass via the word all, which is in effect numerical. If I had said “100% of cows eat grass,” the statistical nature of the statement would have been more apparent. Uncertainties in the statement arise even before we question the statistical claim of 100%. There is the question of what is included in the definitions of cows and eating grass. Am I including young cows, sick cows, or cows in all parts of the world? Do I mean eating grass and nothing else, or eating grass if they were given it? And what do I include in the term grass?
This may seem rather pedantic, but it illustrates that we have to question what, precisely, the things are that a statistical statement claims to relate in some way. A more realistic example could relate to unemployment. “In Wabash, three out of four men are unemployed,” we may read. How have the boundaries of the district been defined for the survey? Is it Wabash town, or is it the total extent covered by Wabash County? Then there is the question of how we are to understand the term unemployed. Does it include the retired, the sick, the imprisoned, the part-time workers, the casual workers, the voluntary workers, or the rich who have no desire or need to work? The way the terms are defined needs to be questioned before the statistics can be considered to have real meaning.
Turning now to the statistical aspects, we appreciate that data are gathered from many different sources. Opinion polls are fruitful and popular. We seem to spend as much time prior to an election listening to the details of polls as we do listening to the election results being declared. Data collected this way cannot be taken at face value and should always be questioned. Do people tell the truth when asked for their opinions or their activities, or even their ages or where they live? Probably not always, but who can really say? Even if they have every intention of being truthful, there is the possibility of misunderstanding the question. More commonly, perhaps, the question forces a difficult judgment or recollection. “Do you replace a light bulb once a week, once a month, or once every three months?” “When did you last speak to a policeman or policewoman?” In addition, many questions require answers that are completely subjective.
Statistics are often taken from “official sources,” and this suggests reliability. However, the question remains how the figures were obtained. We would expect that the number of cars on the roads would be known quite accurately, whereas we accept that the number of illegal immigrants in the country is vague. Between these extremes are debatable areas. The number of street muggings could appear to be low if only the reported and successfully prosecuted cases were included, but could appear much greater if attempted muggings, or even estimated unreported muggings, were included.
Statistics from authoritative sources are sometimes simply not true. Charles Seife (2010) gives numerous examples, ranging from intentional lies to statements that are impossible to verify. In 1950, US Senator Joe McCarthy claimed to have a list of 205 names of people working in the US State Department who were members of the Communist Party. The claim had serious repercussions, yet he never produced the names, and no evidence was ever found that he had such a list. At the other end of the scale, the repercussions may be trivial, as when in 1999 UN Secretary-General Kofi Annan declared a Bosnian boy to be the six billionth person on Earth.
When statistics are quoted, a reference to the source is frequently given. This is, of course, good practice, but it does impart an air of authority that may not be warranted. Rarely does the recipient follow up on the reference to check its validity. The originator may not even have checked the reference but simply have grabbed it from somewhere else. Worse is the situation where the originator has been unfairly selective in his or her choice of statistics from the referenced source. Be aware that organizations with a particular agenda may, in their literature, give references to publications from the same organization, or to those closely allied with it (Taverne 2005: 82-86).
Wikipedia is now an important and frequently used source of information. Bear in mind that it is based on contributions from anyone who wishes to contribute. A degree of self-regulation results, but the information Wikipedia contains at any moment in time is not necessarily correct.
So far we have been considering statistical data, which are in a sense second-hand: they are derived from what others tell us, and those others may have no way of determining the truth of what they quote. But there are other situations where objective measurements are made and data are provided by those who have made the measurements. Factories supplying foodstuffs have weighing machines for controlling the amount of jam or corn flakes that goes into each container. The weighing machines are inspected periodically to ensure accuracy. Though accurate, the machines will be imprecise to some degree. That is to say, when a machine registers one kilogram, the true weight will be one kilogram plus or minus a small possible error. The smaller the possible error, the greater the precision, but there still remains a degree of uncertainty.
A company supplying car parts has to ensure that a bracket, say, is 10 cm long plus or minus 0.5 mm. The latitude permitted is referred to as the tolerance. Within the company, regular measurements of the lengths of brackets are made as they are produced. These measurements, to an accuracy of perhaps 0.1 mm or less, provide a data sample, which when properly processed warns the company that the tolerance is being exceeded, or is in danger of being exceeded. Such situations result in statistical data that are reliable to a degree dependent on the measuring equipment, and with this knowledge the degree of reliability can be quantified.
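The bracket check just described can be sketched in a few lines of Python. This is only an illustration: the measurements are invented, and the function name is my own, not taken from the text.

```python
# Hypothetical sketch of a tolerance check for brackets with a nominal
# length of 10 cm (100.0 mm) and a permitted tolerance of +/- 0.5 mm.
NOMINAL_MM = 100.0
TOLERANCE_MM = 0.5

def within_tolerance(length_mm):
    """Return True if a measured length lies within the permitted tolerance."""
    return abs(length_mm - NOMINAL_MM) <= TOLERANCE_MM

# Invented measurements, recorded to 0.1 mm as in the text.
measurements = [100.1, 99.8, 100.4, 100.6, 99.9]
flagged = [m for m in measurements if not within_tolerance(m)]
print(flagged)  # the 100.6 mm bracket exceeds the tolerance
```

In practice the company would track such measurements over time (a control chart) rather than judge each bracket in isolation, but the pass/fail logic is the same.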
Results of investigations in the various science and technology disciplines are published in reputable and often long-established journals. A process of refereeing the articles submitted to the journals provides good assurance that the results quoted are valid and that any provisos are made clear. References to such journals are a good sign.
Processing the Data
Raw data, which as you have seen already have a degree of uncertainty, are processed by mathematical procedures to allow you to draw conclusions. Recalling what I have said about the truth of mathematics, you might think that the processing will introduce no additional uncertainty. If raw data are factual, we might expect that our conclusions would be factual. However, as you will see, processing introduces further uncertainty. But you will also see that the conclusions are factual. They are factual statements expressing the probability of something being true, or expressing the uncertainty involved in stating that something is true. For example, we might have a conclusion saying that hair restorer A is more effective than hair restorer B with a 90% certainty, or saying that the weight of a randomly chosen bag of sugar is half a kilogram within one hundredth of a kilogram either way, with a 99% certainty. Both statements are factually correct, but neither gives us a precise conclusion regarding the performance of a particular application of hair restorer or the weight of a specific bag of sugar.
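The sugar-bag statement can be made concrete with a small simulation. This is a sketch under an assumed model, not anything from the text: I take the filling process to be normally distributed around the half-kilogram target, with a spread chosen so that roughly 99% of bags fall within 0.01 kg of it (99% of a normal distribution lies within about 2.576 standard deviations of the mean).

```python
import random

# Assumed model: bag weights are normal around 0.5 kg, with the standard
# deviation set so that ~99% of bags land within +/- 0.01 kg.
random.seed(1)  # fixed seed so the sketch is reproducible
TARGET_KG = 0.5
SIGMA_KG = 0.01 / 2.576

weights = [random.gauss(TARGET_KG, SIGMA_KG) for _ in range(100_000)]
within = sum(abs(w - TARGET_KG) <= 0.01 for w in weights) / len(weights)
print(f"proportion within 0.01 kg: {within:.3f}")  # close to 0.99
```

The point of the exercise is that the 99% figure is a statement about the process, not about any single bag: a particular bag either is or is not within 0.01 kg of half a kilogram.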
When such statements are made without the necessary qualifications of uncertainty, they appear to provide proof. “Hair restorer A is more effective than hair restorer B” and “This bag of sugar weighs half a kilogram” are the kinds of statements we usually encounter. With regard to the bag of sugar, the statement is near enough correct, and it would be considered extremely pedantic to insist on a precise statement. But with regard to the hair restorer, the situation is much more serious. The statement, when looked at carefully, is seen to convey almost no useful information, yet it is likely to encourage customers to spend their money on the product.
The uncertainties that arise in statistical processing do not reflect any inadequacy of the mathematical procedures. They arise in the summarizing of the data and in the use of samples to predict the characteristics of the populations from which the samples were drawn.
Raw data are summarized because there is generally too much to allow easy recognition of the important features. Simply picking out bits and pieces to illustrate underlying principles can lead to incorrect conclusions, and may sometimes be done deliberately to justify prejudiced views. Summarizing—averaging, for example—is carried out according to accepted procedures. Nevertheless, any procedure that reduces the data necessarily results in loss of information and therefore some uncertainty.
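How averaging discards information can be seen with two invented data sets (my own, purely for illustration) that share the same mean but differ greatly in spread:

```python
from statistics import mean, stdev

# Two invented data sets with identical averages but very different
# variability: the summary "mean = 10.0" hides the difference entirely.
steady  = [9.9, 10.0, 10.1, 10.0, 10.0]
erratic = [5.0, 15.0, 8.0, 12.0, 10.0]

print(mean(steady), mean(erratic))    # the averages agree
print(stdev(steady), stdev(erratic))  # the spreads do not
```

A second summary figure, such as the standard deviation, recovers some of the lost information, but no short list of summary numbers can reproduce the full data set.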
The second source of uncertainty lies in the difference between a sample and a population, and in the attempt to characterize a population using the features of a sample. It must be recognized that these words are being used as statisticians use them. A population in the statistical sense is not a group of people living in a particular area (though it could be, in a study involving actual people living in some area of interest).
A sample is more easily explained first. It is a set of data of the same kind obtained by some consistent process. We could ask shoppers coming out of a supermarket how many items they purchased. The list of the numbers of items that we obtained would be the sample. The size of the sample would be the number of shoppers we asked, which is the number of values in our sample. In this example, the population would be the replies from the larger number of shoppers, or potential shoppers, that might have been asked, including of course the ones who were actually asked.
Sometimes the sample embraces the entire population. If we produce, by a novel process, 100 pewter tankards and measure and weigh each one to examine the consistency of the process, our sample corresponds to the population. The monthly profits made by a company over a period of time comprise the entire population of results relating to that particular company and can be treated as such to derive performance figures for the company. If, however, the accumulated data were considered to be representative of similar companies, then they would be treated as a sample drawn from a larger population.
My wife’s birthday book shows the birthdays of relatives and friends that she wishes to recall. The number of birthdays in each month of the year is as follows.
The data can be considered a sample or a population, depending on what we wish to do. Considered against the whole of the world population, or a hypothetical large collection of people, the data are a sample. It is not a very reliable sample, because it suggests that many more people are born in November than in January. However, in terms of the people actually included, the data are the population; and it is true to say that the probability of selecting a person at random from the book and finding his or her birthday to be in November is 9/61 (=0.15), rather than the 1/12 (=0.08) we would expect in a much larger sample.
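The arithmetic can be checked in a couple of lines, using only the two figures given in the text (9 November birthdays out of 61 in total):

```python
# The two figures quoted in the text: 9 of the 61 birthdays fall in
# November, against the 1-in-12 rate a large, evenly spread collection
# would suggest.
november = 9
total = 61

p_observed = november / total
p_expected = 1 / 12
print(f"{p_observed:.2f} vs {p_expected:.2f}")  # 0.15 vs 0.08
```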
In each of these examples—the shoppers, the pewter mugs, and so on—the population is finite. In many situations, however, the population is hypothetical and considered to be infinite. If we make repeated measurements of the diameter of the Moon, in order to improve the accuracy of our result, we can consider that the measurements are a sample drawn from a population consisting of an infinite number of possible measurements. If we carry out an experiment to study the effectiveness of a new rat poison, using a sample of rats, we would consider the results applicable to a hypothetical infinite population of rats.
If the sample is to be representative of the population from which it is drawn, it must be a random sample. Random means that of all the possibilities, each is equally likely to be chosen. Thus if we deal 6 cards from a well-shuffled pack of cards, the selection of 6 cards is a random sample from the population of 52 cards. The randomness that is achieved in practice depends on the method of sampling, and it can be difficult in many real situations to ensure that the sample is random. Even when it is random, it is simply one of a very large number of possible random samples that might have been selected. Because the subsequent processing is restricted to the data in the sample that happens to have been selected, the results of the processing, when used to reveal characteristics of the population, will carry uncertainties.
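The card-dealing example can be sketched directly in Python, where `random.sample` gives every card an equal chance of selection:

```python
import random

# Build the population: a 52-card deck.
ranks = "A 2 3 4 5 6 7 8 9 10 J Q K".split()
suits = ["clubs", "diamonds", "hearts", "spades"]
deck = [f"{r} of {s}" for r in ranks for s in suits]

# Draw the sample: 6 distinct cards, each equally likely to appear.
hand = random.sample(deck, 6)
print(hand)
```

Running this repeatedly gives a different hand almost every time, which is the point made above: any one sample is just one of a very large number of equally likely samples.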
A 6-card sample from a well-shuffled pack is easily made random; but returning to the supermarket shoppers, you can see the difficulty of obtaining a random sample. Do we stop men and women, or just women? If both, do we take account of there being more women shoppers than men? And should we spread our enquiries through the day? Perhaps different days of the week would give different results. And what about time of year? And so on. We could restrict the scope of our sample to, say, women shoppers on Friday afternoons in summer, but this of course restricts our population similarly, and restricts the scope of the results that we will obtain from our statistical analysis. Any attempts to apply the results more generally—to women shoppers on any afternoon, summer or winter, say—will introduce further uncertainties.
It should be noted that the information we can obtain, and the uncertainty associated with it, depend entirely on the size of the sample and not on the size of the population from which it is drawn. A poll of 1,000 potential voters will yield the same information whether it relates to a population of 1 million or 10 million potential voters. It is the absolute size of the sample, not its size relative to the population, that is the key value.
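The point can be illustrated with the standard approximate formula for a poll's 95% margin of error, which involves only the sample size n. The formula is ordinary survey arithmetic, not something taken from the text, and it ignores the small finite-population correction that matters only when the sample is a sizeable fraction of the population.

```python
from math import sqrt

def margin_of_error(n, p=0.5):
    """Approximate 95% margin of error for an estimated proportion p
    from a simple random sample of size n. Note that the population
    size appears nowhere in the formula."""
    return 1.96 * sqrt(p * (1 - p) / n)

# A 1,000-voter poll carries the same uncertainty whether the electorate
# is 1 million or 10 million:
print(f"{margin_of_error(1000):.3f}")  # about 0.031, i.e. roughly +/- 3%
```

This is why national polls of quite modest size can be informative: quadrupling the sample halves the margin of error, while enlarging the population changes nothing.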