Better Business Decisions from Data: Statistical Analysis for Professional Success (2014)
Part II. Data
The temptation to form premature theories upon insufficient data is the bane of our profession.
—Arthur Conan Doyle
We now look at how data is obtained. This is the critical first stage in making use of data, since the reliability of the conclusions of any statistical investigation depends on the data being obtained in an appropriate and fair manner. The features and format of the data, and how we can classify the data, are then discussed.
Chapter 4. Sampling
Did Nine out of Ten Really Say That?
An essential feature of a sample is that it is representative of the population from which it is drawn. Unfortunately, it is impossible to predict that this will be so, or even check that it is so when the sample has been obtained. A judgment has to be made as to the adequacy of the sampling procedure in relation to the individual circumstances. This has given rise to many different methods of sampling to cover a wide range of situations.
Problems with Sampling
When the data obtained represents the entire population, the question of how relevant the sample is does not arise: the sample is the population. Thus the monthly profits of a company over a twelve-month period represent the complete picture for that specifically defined twelve-month period. If, however, the sample is drawn from a population larger than the sample, the question of how representative the sample is of the population becomes critical. If the twelve-month sample mentioned above were claimed to be representative of other twelve-month periods—in other words, if it were considered to be a sample from a population of numerous twelve-month periods—the evidence for its wider relevance would need to be examined.
For those carrying out statistical investigations, the adoption of appropriate sampling methods is a priority. The credibility of everything that follows in an investigation hinges on whether the samples are representative of the populations to which the conclusions of the investigation will be applied. If we are not carrying out the investigations but simply looking at results of investigations that others are responsible for, we have a considerable advantage. We have the benefit of hindsight and can assess what populations the samples best represent, and whether these are the appropriate populations or close enough to what we require for our purposes.
Even with proper sampling arrangements, problems can arise. Some of the data may be incorrect. A meter may be misread, or the meter may be faulty. Tallies can be miscounted. A respondent may accidentally or intentionally give a false answer. The question may be worded in a way that invites a particular answer. Charles Seife (2010: 117) gives an amusing example of how the wording of a question is likely to determine the reply. “Do you think it is acceptable to smoke while praying?” is likely to get the answer “No”; whereas “Do you think it is acceptable to pray while smoking?” is likely to get the answer “Yes.”
Worse still is the slanted reporting of the results of a survey when the questions may have already biased the answers received. In 2011, the media reported that a children’s charity had commissioned a survey which included the question, “Are children becoming more feral?” The conclusion from the commissioning charity was that almost 50% of the public felt that children were behaving like animals. A further question asked at what age it was too late to reform children. Although 44% said never too late, and 28% said between 11 and 16 years, it was reported that a quarter of all adults think children are beyond help at the age of 10.
Blastland and Dilnot (2007) give an account of questionable information arising from surveys. It is worthwhile reading for anyone examining the results of an investigation based on sampling. Examples range from the number of immigrants entering the UK every day to the decline in the hedgehog population. The latter is particularly intriguing. The Mammals on the Road Survey, as it is called, is carried out from June to August each year. The numbers of squashed hedgehogs on selected roads are counted. The numbers are decreasing each year, from which it is deduced that the hedgehog population is declining. However, there are many reasons why the sample of dead hedgehogs may not represent the total population of hedgehogs. Traffic density on the selected roads may be changing. Hedgehogs may be evolving and becoming more wary of traffic. Climate change may be altering the time of year when hedgehogs are likely to be on the roads, and so on. Of course, one has to recognize that it is not easy to devise a better method without involving greater expense.
It must be remembered that sampling costs money. There always has to be a compromise between having large samples to maximize the reliability of the results and small samples to minimize costs. As mentioned previously, the reliability of the results depends directly on the size of the sample and does not depend on the size of the population from which the sample is drawn. It does not follow therefore that samples have to be large because the target populations are large, though it may be more difficult to ensure that a small sample is representative of a population when the population is large.
Some of the data required from a survey may be missing, and the reason it is missing may relate to how representative the sample is. For example, older respondents may refuse to state their age, and simply deleting their contribution to the sample will bias the sample in favor of younger respondents. Samples should include a record of any data that has been deleted. David Hand (2008) provides a useful discussion of missing data and other potential problems in sampled data, and he describes ways of dealing with them.
In scientific investigations, certain properties that have fixed values have to be ascertained. For example, the density of pure copper or the rate of decay of a radioactive material may have to be determined as accurately as possible. The laboratory faced with such tasks will repeat the measurements several times, and each time a slightly different value may be obtained.
The set of values constitutes a sample and, since there are in principle an infinite number of such possible values, it is a sample drawn from an infinite population. The sample and the method by which the data is obtained define the population.
Compare this situation with an apparently similar one that is in reality somewhat different. Suppose our scientists are interested in determining accurately the circumference of the Earth around the Equator. Such measurements have been made over many centuries by different investigators. If we were to bring together all the values obtained in the past, we could not say that we had a sample from a single population. Each of the values would have its associated method of measurement and its level of precision and would be representative of an infinite population of such values, all obtained in the same way. But each of the populations would be different. Nevertheless, because all the values are targeted at the same property, namely the circumference of the Earth, it ought to be possible to make use of the collection of data, and indeed it is, by weighting the values, as you shall see in Chapter 7.
Simple Random Sampling
For simple random sampling, each datum from the population must have an equal chance of being selected, and the selection of each must be independent of the selection of any other. This is more difficult to achieve than might appear at first sight.
The first difficulty arises because people are not good at adopting a random procedure. If faced with a tray of apples and asked to select ten at random, people generally make selections that are likely to be biased. Some may select “average-looking” apples, ignoring the very small or very large. Others may attempt to get a full range of sizes from the smallest to the largest. Some will be more concerned with the range of color, others with shape.
A similar difficulty can arise because of the nonrandom times of sampling. An inspector visits a production line in a factory, at supposedly random times, to select an item for quality-control inspection. But production starts at 8:00 a.m., and he is not available until 9:30 a.m. Also, he takes a coffee break between 11:00 and 11:15.
Rather than using the judgments of individuals to establish the randomness of the sampling, it is preferable to make use of random numbers. These are generated by computer and are listed in statistics books. (Strictly speaking, computer-generated numbers are “pseudo-random,” but this is not a problem.) A sequence of random numbers can be used to determine which apples to select or which products to take from the production line.
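As a small illustration, computer-generated pseudo-random numbers can drive the selection directly. The sketch below (the tray of 100 labelled apples is hypothetical, and the seed is fixed only to make the sketch repeatable) picks ten apples so that each has an equal chance and none is chosen twice:

```python
import random

random.seed(42)  # fixed seed so the sketch is repeatable

# A tray of 100 apples, labelled 0-99 (hypothetical).
apples = list(range(100))

# random.sample gives every apple an equal chance of selection,
# with no apple selected more than once.
chosen = random.sample(apples, 10)
print(sorted(chosen))
```

The same idea selects products from a production line: number the items and let the random numbers, not the inspector, decide which to take.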
When surveys are required and people have to be questioned, the difficulties are greater. The population may be widely spread geographically. If the study relates to adult twins, for example, and the results are intended to be applicable to all such twins in the United Kingdom, say, then the population is spread throughout the United Kingdom and the sample has to be randomly selected from this widespread population. Even if available finances allowed the sampling of adult twins to extend so widely, there is still the problem of ensuring randomness. If the twins were located by telephone, those without a phone, on holiday, or away from home for some other reason, for example, would be excluded. And, of course, there are always those people who refuse to take part in surveys and those who never tell the truth in surveys.
Questioning people in the street is easier to do in fine weather. But those who are out in the pouring rain or freezing temperatures, who are unlikely to be questioned and unlikely anyway to be prepared to stop and answer, may have quite different views from fine-weather strollers.
It is because of such difficulties that other sampling methods have been devised. Not all the problems can be overcome: if someone is determined to be untruthful, no sampling method is going to rectify the situation.
Systematic Sampling
For systematic sampling, a number is chosen—say, 10. The sample is then selected by taking every tenth member from the list or from the arrangement of items. The first member is chosen at random. If necessary, the end of the list is assumed to be joined to the beginning, so that counting continues in a circular fashion until the required sample size is reached.
It is important to consider whether the choice of the arbitrary number creates any bias because of patterns in the listing. If the list is of people arranged in family groups, for example, then a number as large as 10 would make it unlikely that two members of the same family would be chosen. If the list were arranged in pairs—man–wife, say—then any even number would bias the results in favor of the wives.
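The procedure can be sketched in a few lines, including the circular counting described above (the list of 100 members and the step of 10 are illustrative):

```python
import random

def systematic_sample(items, step, size):
    """Start at a random position and take every `step`-th item,
    wrapping past the end of the list until `size` items are drawn."""
    start = random.randrange(len(items))
    return [items[(start + i * step) % len(items)] for i in range(size)]

random.seed(1)
members = list(range(1, 101))   # a list of 100 members (illustrative)
picked = systematic_sample(members, step=10, size=10)
print(sorted(picked))
```

Only the starting point is random; everything after that is fixed by the step, which is why patterns in the list matter so much.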
Stratified Random Sampling
If the population under study consists of non-overlapping groups, and the sizes of the groups relative to the size of the population are known, then stratified random sampling can be used. The groups or subpopulations are referred to as strata.
Suppose a survey needs to be carried out to get the views of a town’s adult population on the plans for a new shopping mall. People of different ages could well be expected to have different views, so age could be used to define the strata. A stratum would be a particular age range: for example, 20–29 years. Suppose this age group makes up 25% of the town’s adult population. The sample is then defined as requiring 25% to be within this age range. The other age ranges are used similarly to fix the composition of the sample. This is referred to as proportional allocation.
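Proportional allocation is simple arithmetic on the known strata shares. A sketch, using hypothetical age strata and an assumed overall sample size of 400:

```python
# Hypothetical age strata with their (assumed) shares of the
# town's adult population; the shares must sum to 1.
strata = {"18-19": 0.05, "20-29": 0.25, "30-44": 0.30,
          "45-59": 0.22, "60+": 0.18}

total_sample = 400  # overall sample size we can afford (assumed)

# Each stratum receives its share of the total sample.
allocation = {age: round(total_sample * share)
              for age, share in strata.items()}
print(allocation)   # the 20-29 stratum gets 0.25 * 400 = 100
```

Random sampling is then carried out within each stratum until its allocated number is reached.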
It could be decided that in addition to age affecting the views of the respondents, the geographical location of their homes might have an effect. A second level of stratification might be introduced, dividing the town into a number of districts. If proportional allocation is again applied, it may put unbearable demands on the size of the sample. It may be found that some of the subgroups—for example, the over-60-year-olds in one of the town districts—are represented in the sample by only a handful of individuals. Disproportional allocation could be applied, increasing the number in the sample for these groups but not for the others.
Stratified random sampling is a popular procedure used in surveys, but it is not easy to set up in the most efficient way. It may turn out that the choice of strata was not the most appropriate. In the example above, it might have been better to define annual household income as the strata. Not until the sample results have been processed will some of the shortcomings come to light. In order to achieve a better sampling design, a pilot survey is often undertaken, or results of previous similar surveys are examined.
There is a mathematical procedure for calculating the optimum allocation for a single level of stratification, called the Neyman allocation, but this requires prior knowledge of the variability of the various groups within the strata. Again, a pilot study would be required to provide the information.
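The Neyman allocation gives each stratum a share of the total sample proportional to the stratum size multiplied by its standard deviation, so that more variable strata are sampled more heavily. A sketch with invented pilot-study figures (the strata, sizes, and standard deviations are all assumptions for illustration):

```python
# Each stratum: (population size N_h, standard deviation S_h),
# where S_h would come from a pilot study. Figures are invented.
strata = {
    "20-29": (5000, 4.0),
    "30-59": (12000, 2.5),
    "60+":   (6000, 1.5),
}
n = 300  # total sample size (assumed)

# Neyman allocation: stratum h gets n * (N_h * S_h) / sum(N_k * S_k).
weight = {h: N * S for h, (N, S) in strata.items()}
total = sum(weight.values())
allocation = {h: round(n * w / total) for h, w in weight.items()}
print(allocation)
```

Note how the small but highly variable 20–29 stratum receives a larger share than its population size alone would justify.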
Cluster Sampling
Cluster sampling is used when the population under study is widespread in space or time. For example, it might be necessary to survey fire engine drivers over the whole country, or hospital admissions 24 hours a day, 7 days a week.
To limit sampling costs, the geographical or time extents are divided into compact zones or clusters. For the fire engine drivers, the country could be divided into geographical zones, such as counties. A random sample of the clusters, the primary sampling units, is selected. In multistage cluster sampling, further clustering takes place. The fire stations within the selected counties would be identified. Random sampling would then be applied to select the fire stations for study in each selected county. Clearly, the validity of the results hinges critically on how well the random selection of the clusters represents the population.
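The two stages can be sketched as follows; the counties and fire stations are invented placeholders:

```python
import random

random.seed(7)

# Hypothetical two-stage frame: counties (the primary sampling
# units), each containing a list of fire stations.
frame = {
    "CountyA": ["A1", "A2", "A3", "A4"],
    "CountyB": ["B1", "B2", "B3"],
    "CountyC": ["C1", "C2", "C3", "C4", "C5"],
    "CountyD": ["D1", "D2"],
}

# Stage 1: randomly select 2 counties.
counties = random.sample(list(frame), 2)

# Stage 2: within each selected county, randomly select up to
# 2 fire stations for study.
sample = {c: random.sample(frame[c], min(2, len(frame[c])))
          for c in counties}
print(sample)
```

All the sampling effort is then concentrated in the selected clusters, which is where the cost saving comes from.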
Quota Sampling
Interviewers employed in surveys are frequently given quotas to fulfill. They may be required to interview three middle-aged professionals, six young housewives, two retired pensioners, and so on. This is quota sampling. The quotas are determined from the known constitution of the population, as in stratified sampling.
The advantages of quota sampling are that the required procedure is easily taught, and the correct quotas are obtained even for very small samples, which can then be pooled. However, no element of randomization is involved, and bias can easily arise because the interviewer can choose whom to approach and whom to avoid.
Sequential Sampling
In sequential sampling, the size of the sample is not defined at the outset. Instead, random sampling is continued until a required criterion is met. This is particularly useful when the cost of obtaining each response is high. After each response, the data is analyzed and a decision made whether to obtain a further response.
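A sketch of a sequential stopping rule: here the criterion, chosen purely for illustration, is that the estimated standard error of the mean falls below a threshold. The simulated "costly observation" is a stand-in for a real response:

```python
import random
import statistics

random.seed(3)

def draw_response():
    # Stand-in for one costly observation (hypothetical measurement
    # with true mean 50 and standard deviation 10).
    return random.gauss(50, 10)

# Start with two responses (the minimum needed to estimate spread),
# then keep sampling until the estimated standard error of the mean
# drops below the chosen threshold of 2.0.
responses = [draw_response(), draw_response()]
while statistics.stdev(responses) / len(responses) ** 0.5 > 2.0:
    responses.append(draw_response())

print(len(responses), round(statistics.mean(responses), 1))
```

The sample size emerges from the data rather than being fixed in advance, so no more responses are bought than the criterion requires.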
Databases
The rapid growth of the use of computer systems in business and industry has produced vast databases containing data of all kinds. Banking, insurance, health, and retailing organizations, for example, have data relating to patterns of behavior linking customers, purchasing habits, preferences, products, and so on. Much of the data has been collected because it is easy to do so when the operations of the organizations are computerized. Thus databases are a source of large samples that can be used for further analysis. I shall discuss databases further when I describe data mining and big data in Part VII.
Resampling
If we have a sample from a population, we can ask what other samples drawn from the same population might have looked like. Clearly, they could have consisted of a selection of the values we see in our existing sample, and they could well have duplicated some of those values. This is the thinking behind resampling. We can produce further samples by randomly selecting values from our existing sample.
Suppose we have a sample consisting of the following values:
1 2 3 4 5 6.
If we now select groups of six randomly from these values, we might get
1 3 3 4 5 6,
1 3 4 5 5 5,
and so on.
Numerous additional samples can be generated in this way, and from the samples it is possible to gain information about the population from which the original sample was drawn.
Particular techniques of this type include the jackknife, where one or more values are removed from the original sample each time, and the bootstrap, where a random selection of the values provides each new sample. They are computer-intensive, requiring large numbers of randomly generated samples.
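A bootstrap sketch using the six-value sample above: each new sample is drawn with replacement from the original, and the spread of the resampled means estimates the standard error of the original sample mean. The number of resamples and the seed are arbitrary choices:

```python
import random
import statistics

random.seed(0)
sample = [1, 2, 3, 4, 5, 6]

# Bootstrap: draw new samples of the same size *with replacement*
# from the observed sample, recording each resample's mean.
boot_means = []
for _ in range(10000):
    resample = [random.choice(sample) for _ in sample]
    boot_means.append(statistics.mean(resample))

# The standard deviation of the bootstrap means estimates the
# standard error of the sample mean.
se_hat = statistics.stdev(boot_means)
print(round(se_hat, 2))
```

The jackknife works similarly but forms each new sample by deleting one or more values from the original instead of resampling with replacement.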
Checking for Randomness
If the sample is random, it is not expected that the data viewed in the order they were obtained would show any patterns. Data that is collected over a period of time could show a trend, increasing or decreasing with time, and this would raise suspicions. Similarly, a sample of members of the public answering yes or no to a question should show a random distribution of the two answers. It would be suspicious if most of the yes answers were early in the listing and most of the no answers were later. Equally, of course, it would be suspicious if the two answers alternated in perfect sequence.
A statistical test called the one-sample runs test can be used to check the randomness of a sequence of yes and no answers. The following sequence
YYY N Y NN Y N YYY NN Y NN YYY
has 20 values, 12 of which are Y and 8 of which are N. There are 11 runs: YYY, followed by N, followed by Y, and so on. The number of runs can be compared with published tables to establish whether the sequence is unlikely to be random. Note that it cannot be confirmed that the sequence is random.
Numerical data can be coded in order to carry out the one-sample runs test. The following sequence
5 3 8 4 6 7 4 3 5 8 9 5 4 2 5 6 4 8 6 7
has 20 values, with an average (mean) value of 5.45. The sequence can be rewritten with H representing higher than the mean and L representing lower than the mean. This gives the sequence
LL H L HH LLL HH LLLL H L HHH
which has 10 runs.
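Counting runs is straightforward to automate. The sketch below reproduces both counts from the text: the 11 runs of the yes/no sequence and the 10 runs of the H/L coding of the numerical data:

```python
from statistics import mean

def count_runs(seq):
    """A run is a maximal stretch of identical consecutive symbols,
    so the number of runs is one more than the number of changes."""
    return 1 + sum(1 for a, b in zip(seq, seq[1:]) if a != b)

# The yes/no sequence from the text.
yn = list("YYYNYNNYNYYYNNYNNYYY")
print(count_runs(yn))       # 11 runs

# The numerical sequence, coded H (above the mean) or L (below it).
data = [5, 3, 8, 4, 6, 7, 4, 3, 5, 8, 9, 5, 4, 2, 5, 6, 4, 8, 6, 7]
coded = ["H" if x > mean(data) else "L" for x in data]
print("".join(coded))
print(count_runs(coded))    # 10 runs
```

Either count can then be checked against the published tables for the one-sample runs test.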
The test is of limited use, not only because it cannot confirm that a sequence is random but also because runs arise more commonly than our intuition would suggest (Havil, 2008: 88-102). In a sequence of 100 tosses of a coin, the chance of a run of 5 or more is 0.97; and in a sequence of 200 tosses, there is a better than even chance of observing a run of 8.