
Better Business Decisions from Data: Statistical Analysis for Professional Success (2014)

Part III. Samples

Chapter 7. Numerical Data

Are Your Statistics Normal?

When a sample consists of numerical data, it has many features that can be quantified. These features can be used to summarize the data, to provide information about the population from which the sample was obtained, and to indicate the reliability of such information. Also, the calculated properties of the sample can be used subsequently if the sample becomes part of a further investigation.

A well-known feature of a sample of numerical data is the average value. Indeed, we get a daily dose of averages from the media and from general conversation. But an average value, though having its proper uses, can be extremely misleading when quoted in isolation. A proper consideration of a data sample requires information about how the data is spread over a range of values.

Diagrammatic Representation

Chapter 5 introduced the idea of a distribution and used a sample of sizes of shoes worn by a group of men to plot the distribution as a bar chart (Figure 5-1). Notice that the area covered by the bar chart represents the total number of data, since each bar has a height representing the number of data in the particular group. If the bar chart is shown with the vertical axis representing relative frequency—that is, frequency divided by the total, as in Figure 7-1(a)—the appearance is exactly the same, but the total area covered by the bar chart is now unity and the relative frequency is equivalent to probability. Thus we could deduce from the diagram that the probability of selecting from the group a man who wears a size 8 shoe is 0.24. The diagram may be referred to as a probability distribution. Generally speaking, we use relative frequency as the label for the vertical axis when the data is observed or measured data. When the diagram is theoretical or being used to determine probabilities, we label the axis probability.

Diagrams such as Figure 7-1(a), which display relative frequency and have a numerical sequence along the horizontal axis, are often called histograms. This is to distinguish them from the type of bar chart shown, for example, in Figure 6-1, where frequency is indicated on the vertical axis and where the horizontal axis has no numerical property. The practice is popular and has some advantage, but the term histogram applies strictly to diagrams in which the bars are not all of equal width. This is explained further in the “Grouped Data” section.

Figure 7-1(b) shows the data of Figure 7-1(a) as a relative frequency polygon, the term polygon indicating the joining of points with straight lines.


Figure 7-1. Relative frequency shown as (a) a bar chart and (b) a polygon

Such data can be presented as cumulative values. The shoe size data are extended below to include the cumulative frequency, cumulative relative frequency and cumulative percentage.

[Table: shoe sizes with their frequencies, cumulative frequencies, cumulative relative frequencies, and cumulative percentages]

Figure 7-2 shows the cumulative frequency in the form of (a) a bar chart and (b) a polygon.


Figure 7-2. Cumulative frequency shown as (a) a bar chart and (b) a polygon
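
The cumulative quantities are straightforward to compute directly. Below is a minimal sketch in Python; the shoe-size counts are hypothetical stand-ins (the book's actual figures are in the table above), but the calculation is the same.

```python
# Cumulative frequency, cumulative relative frequency, and cumulative percentage
# for a set of shoe-size counts. The counts are hypothetical, for illustration only.
sizes = [6, 7, 8, 9, 10, 11]
counts = [3, 5, 12, 15, 10, 5]    # frequencies; total is 50

total = sum(counts)
cumulative = 0
print(f"{'Size':>4} {'Freq':>5} {'Cum freq':>9} {'Cum rel freq':>13} {'Cum %':>6}")
for size, freq in zip(sizes, counts):
    cumulative += freq
    rel = cumulative / total
    print(f"{size:>4} {freq:>5} {cumulative:>9} {rel:>13.2f} {100 * rel:>5.0f}%")
```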

The above data are discrete, but if the data are continuous a cumulative frequency graph can contain more information than its corresponding frequency bar chart. To see this, suppose instead of noting the size of shoe worn by each of our volunteers, we had measured the length of his foot. The data, measured in cm and arranged in order of size, might have been as follows:

Group 1

22.1, 22.3, 22.9, 23.7

Group 2

24.2, 24.4, 24.6, 24.6, 25.1, 25.4, 25.5, 25.8, 25.9

Group 3

26.0, 26.3, 26.4, 26.6, 26.7, 26.9, 27.0, 27.3, 27.5,
27.8, 27.8, 27.9

Group 4

28.1, 28.1, 28.2, 28.2, 28.4, 28.5, 28.5, 28.7, 28.8,
28.8, 28.9, 29.1, 29.3, 29.6, 29.8, 29.9

Group 5

30.0, 30.2, 30.5, 30.6, 30.7, 31.0, 31.4, 31.8

Group 6

32.1

When plotted as a bar chart the data have to be grouped. The groups could be, for example, as shown above, 22.0 to 23.9, 24.0 to 25.9, 26.0 to 27.9, and so on. Figure 7-3(a) shows the resulting bar chart. Within each group, the individual values become equivalent to each other, each simply contributing to the total number of values within the group. From the bar chart there is no way of knowing what the individual values are within each group. In contrast, the cumulative frequency graph can be plotted using each value, as shown in Figure 7-3(b). A smooth curve is generally drawn when the data is continuous, and the curve is frequently referred to as an ogive. When the vertical axis is cumulative relative frequency or cumulative probability, the shape of the curve remains the same, but the graph may be referred to as a cumulative distribution function or simply as a distribution function.


Figure 7-3. Frequency and cumulative frequency shown as (a) a bar chart constructed from grouped data and (b) a curve plotted from individual values

Sets of data often show a tendency to cluster around a central value, as in Figure 7-3(a). As we would expect, there are relatively few of the small or large sizes. Most are close to the average size for the group. When the data is centrally clustered, the cumulative frequency graph has a characteristic S-shape, as seen in Figure 7-3(b). The graph additionally provides a convenient way of determining the median, or middle value, as Figure 7-3(b) illustrates. The quartiles, at one quarter and three quarters of the values, are frequently quoted in statistical conclusions and are also shown. The interquartile range embraces the middle half of the data.
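
The median and quartiles read off the graph can be checked directly from the 50 foot-length values. Here is a short sketch using Python's statistics module (quartile conventions differ slightly between packages, so the quoted values are approximate):

```python
# Median, quartiles, and interquartile range of the foot-length data listed above (in cm).
import statistics

lengths = [
    22.1, 22.3, 22.9, 23.7,
    24.2, 24.4, 24.6, 24.6, 25.1, 25.4, 25.5, 25.8, 25.9,
    26.0, 26.3, 26.4, 26.6, 26.7, 26.9, 27.0, 27.3, 27.5, 27.8, 27.8, 27.9,
    28.1, 28.1, 28.2, 28.2, 28.4, 28.5, 28.5, 28.7, 28.8, 28.8, 28.9,
    29.1, 29.3, 29.6, 29.8, 29.9,
    30.0, 30.2, 30.5, 30.6, 30.7, 31.0, 31.4, 31.8,
    32.1,
]

median = statistics.median(lengths)               # middle value: 28.0 cm
q1, q2, q3 = statistics.quantiles(lengths, n=4)   # quartiles; q2 is the median again
print(f"median = {median} cm")
print(f"lower quartile = {q1:.2f} cm, upper quartile = {q3:.2f} cm")
print(f"interquartile range = {q3 - q1:.2f} cm")  # the spread of the middle half of the data
```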

If we have a bar chart with the peak at the low end of the data, the distribution is said to be positively skewed. Family incomes would be expected to be of this kind, a peak occurring well below the midpoint value (Figure 7-4(a)). When the peak of the distribution is towards the high end of the data, the distribution is negatively skewed. If we looked at ages of people at death, we would expect to see the distribution negatively skewed, with most people dying in old age (Figure 7-4(b)).


Figure 7-4. (a) Positively and (b) negatively skewed distributions

Normally Distributed Data

Fortunately, in the statistical term normally distributed the word normal does carry the conventional meaning of “usually encountered” or “of everyday occurrence.” Nevertheless, it is not easy to summarize in a few words what is meant by the important concept of normally distributed data.

Normally distributed data are centrally clustered and symmetrical—i.e., not skewed positively or negatively. They are, however, special in the way the distribution varies across the range of values encompassed.

Heights and weights of people are normally distributed. Suppose we measure the heights of a small sample of men, say 20. We could represent the data in the form of a bar chart with a group width of 8 cm, as shown in Figure 7-5(a). Central clustering around a mean value is clearly shown but the data are presented very coarsely with wide steps in relation to the total width. If we decide to reduce the group width to 4 cm in an attempt to improve the presentation we might end up with Figure 7-5(b). Because we now have so few data in each group, the bar chart begins to lose its shape.

If we now consider having larger samples, we can reduce the group width and still have a sufficient number in each group to represent the distribution of heights in a reliable way. Figure 7-5(c) shows what we might get with a sample size of 10,000 and a group width of 2 cm. The bar chart now has a smoother outline. Extending the process to larger sample sizes and narrower group widths eventually gives a smooth curve, superimposed on the bar chart in Figure 7-5(c), which is the normal distribution. The curve, also known as the Gaussian curve, has a characteristic bell shape. It has an exact though complicated mathematical formula that defines it precisely. It is not, of course, derived from bar charts in the way I may have implied: the description via bar charts is useful in providing a simple and correct view of the meaning of the normal distribution.


Figure 7-5. Distributions of heights of men

Just as in the bar charts, where the number of data within each group is indicated by the area of the corresponding vertical bar, any vertical strip defined under the normal distribution curve represents the relative number of data lying between the horizontal limits of the strip. The proportion of data within the strip relative to the total number of data is thus equal to the proportion of the area within the strip relative to the total area under the curve. Furthermore, this proportion is equal to the probability of a man, chosen at random from the total, having a height lying between the limits of the strip.

Progressing from the bar chart in Figure 7-5(a) to the continuous curve in Figure 7-5(c) necessitates a change in the labeling of the vertical axis. For the bar chart, the label is frequency. Provided the bar width is constant across the whole of the diagram, the scale on the axis will always allow us to read off the frequency. However, once we replace the set of bars with a smooth curve, we can no longer read off frequency: the frequency will depend on the width of strip that we choose. The axis is labeled frequency density.

Clearly, each set of data will have its own scale in terms of the numerical values on the horizontal axis and the frequency on the vertical axis. But the shape of the curve will be the same, provided the data follow a normal distribution. In order to utilize the normal distribution in analyzing data, a standard normal distribution, shown in Figure 7-6, is defined with a peak value located at zero on the horizontal axis. Thus the curve extends symmetrically in the positive and negative directions. The horizontal scale is explained in the “Spread of Data” section of this chapter. The vertical scale is adjusted so that the total area under the curve is 1. The area of any vertical strip then expresses directly the probability of occurrence of values within the strip. Any set of data fitting a normal distribution can be reduced to the standard normal distribution by a change of scale, taken up later in the context of analyzing data.


Figure 7-6. The standard normal distribution

This characteristic curve results whenever the variation of the data is due to numerous random effects. The effects may be intrinsic to the property being measured, as in the example of the heights of the men sampled, but in other situations the effects may be due to errors in the method of measurement. Repeated measurements of the height of Mount Everest would be expected to give a normal distribution of data clustering around a central value. The normal distribution is found to arise in many situations of data collection and is used extensively in subsequent statistical analysis. There are, of course, other special distributions that are encountered, and I will describe some of these in later chapters.

Examples of data that conform to the normal distribution fall into several categories. The first category is where there exists a true value and the sample consists of estimates or measurements of the value, which inevitably are inaccurate to some degree. The inaccuracies arise from random errors in the observation or measurement methods. Repeated measurements of the density of pure copper, various chemical and physical constants, or estimates of the volume of water in the oceans of the world would fall into this category.

The second category is where an attempt has been made to produce items consistent in such properties as size and weight. Because of random fluctuations in materials or manufacturing processes each item is slightly different. Measurements on a number of the items would be expected to follow a normal distribution.

The third category consists of data that are correct (within some error of measurement, of course) but in reality quite different from one another. That is to say, the observed differences are due not to small errors of measurement or manufacturing, as in the previous two categories, but to genuine differences in nature. Nevertheless, the values exhibit a tendency to cluster around a central value, the likelihood of a departure above or below the central value becoming smaller as the size of the departure increases. Examples of such data are the heights and weights of people, examination marks, and intelligence quotients. A comparison between this category and the previous category raises an interesting point. It is as if natural processes attempt to produce everything the same, as we do in our factories, but don't quite succeed because of random errors, just as we don't quite succeed. Viewed this way, categories two and three are in a sense the same.

The fourth category consists of data that theoretically conform to distributions other than the normal distribution but that, under certain circumstances, can be well represented by the normal distribution. Usually it is when samples are large that we find the closest approximations to the normal distribution.

The ability of the normal distribution to represent data that does not conform exactly to the theoretical requirements of the distribution helps to give it its primary role in statistics. In reality, of course, no set of data is likely to conform exactly. The theoretical distribution tapers to infinity in both directions, indicating that there is always a probability, albeit very small, of observing a value of any size. In reality, this cannot be so, not only because of practical limits on the maximum value but also because the low-value tail is limited by the value of zero. Negative values would be meaningless in most situations.

Distribution Type

A data sample may, simply by inspection, be judged to be of a particular type of distribution or approximately so. Data may be seen to be approximately normally distributed, clustering around a central value with few extreme values. Other sets of data may appear, for example, to be uniformly distributed with no evidence of central clustering.

It is possible to make a comparison between the data and an assumed distribution in a way that provides a measure of the likelihood of the data belonging to the distribution. Such a comparison is called a goodness-of-fit test.

The data are laid out in sequence, and calculations are made of the corresponding values that are obtained on the basis of the assumed distribution. For example, we may have data showing how many employees are late for work on different days of the week, and we wish to test the hypothesis that the number late for work is independent of the day of the week. If the hypothesis is correct, the distribution of data should be uniform: that is, the numbers for different days should be the same within the likely random fluctuations. We therefore lay out the expected data, each value being the average (mean) of the actual data:

[Table: number of employees late for work on each day of the week, together with the expected values, each equal to the mean of the observed values]

The differences between the two sets are calculated. Each difference is squared and divided by the expected value, and the results are summed to give a statistic called chi-squared, χ2 (χ being the Greek letter chi). In this example, chi-squared = 8.25. You will appreciate that a value of zero would be obtained if the data agreed exactly with the expected data. So the larger the value is, the more likely it is that the distribution is not uniform. The value obtained is referred to tables of the chi-squared distribution to obtain the probability of there being a dependence on the day of the week, as opposed to the actual number of late arrivals being subject to random fluctuation. The following is an extract from tables of the chi-squared distribution:

[Table: extract from published tables of the chi-squared distribution, giving critical values for various significance levels and degrees of freedom]

In this example we find from the tables that, with 4 degrees of freedom (the term is explained below), 8.25 lies between the values for 10% and 5% significance. Thus we have a greater than one in twenty (5%) chance of being wrong if we claim that the number of late arrivals does depend on the day of the week. The claim would be unreliable. Figure 7-7 shows the distribution of the data and, for comparison, the supposed uniform distribution.


Figure 7-7. Comparison of an observed distribution and a supposed uniform distribution
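
For readers who prefer to check such a calculation by computer, here is a minimal sketch. The daily counts are hypothetical (the original table is not reproduced here) but have been chosen so that chi-squared works out at 8.25, as in the example; scipy's chi-squared distribution stands in for the printed tables.

```python
# Chi-squared goodness-of-fit test against a uniform distribution of late arrivals.
# The observed counts are hypothetical, chosen to give chi-squared = 8.25 as in the text.
from scipy.stats import chi2

observed = {"Mon": 34, "Tue": 19, "Wed": 16, "Thu": 24, "Fri": 27}
expected = sum(observed.values()) / len(observed)     # uniform hypothesis: 24 per day

chi_squared = sum((o - expected) ** 2 / expected for o in observed.values())
dof = len(observed) - 1            # five days, totals made to agree: 4 degrees of freedom
p_value = chi2.sf(chi_squared, dof)

print(f"chi-squared = {chi_squared:.2f} with {dof} degrees of freedom")
print(f"p-value = {p_value:.3f}")  # about 0.08: between the 10% and 5% significance levels
```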

For a non-uniform expected distribution, the required expected values would have to be obtained from tables. To test whether data conformed to a normal distribution, for example, values would be obtained from tables of the normal distribution. The calculation would then proceed as above, the differences between the actual values and the expected normally distributed values being squared and summed.

I need to explain the term degrees of freedom, which I stated above is a feature of the data and which is required to obtain the level of significance from the published tables. In a sense, the freedom referred to is the freedom to be different—and this, I suggest, is a useful way of appreciating what is meant by degrees of freedom. If we have data consisting of just one value, there is no difference involved and no variation or measure of uncertainty. If we have two values, there is one measure of difference, that being the difference between the two values. Thus we have a measure of variation based on a single difference, and we refer to this as one degree of freedom.

With three values—a, b, and c—there are two measures of variation: a–b and b–c. Note that a–c is not a further measure of variation, because its value is fixed by the other two differences. Thus we have two degrees of freedom. With four values, we have three degrees of freedom, and so on.

The degrees of freedom in the above example are shown as four. There are actually five differences involved—i.e., the difference between each of the five daily values and the number 24. However, the value 24 was obtained from the five daily values by ensuring that the totals of the actual and expected values were the same. This restriction removes one degree of freedom, leaving four. When distributions other than a uniform distribution are selected for comparison, there may be additional reductions in the degrees of freedom. This arises when additional features of the assumed distribution have to be calculated from the original data.

Various statistical tests are in standard use to establish the reliability of estimated values from the data, or the likelihood of there being differences or similarities between sets of data. In these tests, use is made of published tables, and the tables generally require the appropriate degrees of freedom of the data to be entered.

The chi-squared test can also show evidence of surprisingly good agreement with a supposed distribution. Too-good agreement should be viewed with some suspicion. Is the data genuine? Has some of the data been removed?

There are other goodness-of-fit tests. The likelihood-ratio test produces a statistic, G2, which is similar to χ2. The Kolmogorov-Smirnov test is based instead on the largest difference between the cumulative distribution of the observed data and the cumulative distribution expected from the assumed distribution.

Averages

The word average in common usage refers, from a mathematical point of view, to the mean value—the sum of all the data in a collection divided by the number of data. The mean value expresses a central value around which the other values are arranged. It is a useful summary of the data, especially when, evident from the nature of the data, there is a central clustering effect. As we said previously, the heights or weights of people would be expected to cluster around the mean value, there being relatively few people of extremely large or small height or weight. The description of the normal distribution in the "Normally Distributed Data" section recognized the symmetry about the peak value, which is the mean value.

Using a mean value where there is no clustering would be misleading and rather pointless in some situations. But not always: the scores when a die is repeatedly thrown show no central clustering, each of the possible six scores occurring roughly equally, but the average score is useful in allowing an estimate of the total score expected after a given number of throws. Thus, the mean value is (1+2+3+4+5+6)/6 = 3.5—so ten throws, say, would be expected to give a total of about 35.

Statisticians use the word expectation to mean the expected mean value as opposed to the achieved mean value. So if we throw a die a number of times, the expectation is 3.5. The actual achieved mean value is likely to be close to 3.5 but could be any number between 1 and 6.

There are two other averages frequently used in statistical presentations: the median and the mode. The median, described in the “Diagrammatic Representation” section and shown in Figure 7-3, is the middle value of the data ordered by size, such that half the data are less than and half are greater than the median. The mode is the most common value—i.e., the value that occurs most frequently in the data. There could be more than one mode in the sample, whereas the mean and median have unique values.

The decision as to which average to use depends on the nature of the data. The use of an inappropriate average can distort the impression gained. If we were looking at the average number of children per family, a calculation of the mean would probably give a non-integer value, 2.23 say. Although no family has 2.23 children, the value could be extremely useful because, given a total number of families, it would allow us to calculate the best estimate of the total number of children. The median would probably lie between 2 and 3, telling us that half the families had 2 or fewer children and half had 3 or more, which is not very informative. The mode, with a value of 2 probably, would at least tell us that the families were more likely to have 2 children than any other number.

If we were looking at family incomes, we would have different considerations. The mean income could be, say, $50,000 per annum. However, there will be in the data a few very high earners with incomes three or four times the mean. Most families will lie well below the mean. Thus there is an upward bias effect that can be misleading. If we work out the median we might find that the value is $40,000, showing that half the families have incomes less than this. If we wish to see what the mode is, we find that because income has continuous values (no finer than a penny, of course), there are insufficient families, or perhaps none, with the same value of income. This could be overcome by rounding off the data or, better, by grouping the data. This might give us an answer that the most common family income is in the range $35,000 to $40,000.
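
A small sketch, using a hypothetical sample of family sizes, shows how the three averages are obtained and how they can differ:

```python
# Mean, median, and mode of a hypothetical sample of family sizes (children per family).
import statistics

children = [0, 1, 1, 2, 2, 2, 2, 3, 3, 4, 5]

print("mean   =", round(statistics.mean(children), 2))  # 2.27: useful for estimating totals
print("median =", statistics.median(children))          # 2: half the families have 2 or fewer
print("mode   =", statistics.mode(children))            # 2: the most common family size
```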

As data fit the normal distribution more exactly, the mean, median, and mode come closer together. As the distribution becomes positively skewed, the mode moves below the mean; as it becomes negatively skewed, it moves above the mean. The median usually lies between the mode and the mean.

Choice of the inappropriate average can give erroneous impressions of the meaning of the data and is often done with the intention of misleading. The situation is made worse when the type of average that has been used is not specified. The moral is to be wary of averages of unspecified type and—even when it is stated that the mean, median, or mode has been quoted—to explore the consequences of viewing the results in terms of the other averages.

Spread of Data

Average values are extremely useful but give no indication of the spread of the values from which they were derived. It is not possible to make any judgment about how valid it will be to base decisions on the average values. Some indication of the spread of the data should accompany any quoted average.

The maximum and minimum values, and the difference between them, the latter being referred to as the range, are easily quoted but of limited use. They give no information as to how the individual values are distributed within the sample. Of course, if one were interested in knowing the weight of the heaviest parcel to be transported or the smallest size of skates to be provided at a skating rink, then the information could be useful.

Of more general use are the quartiles, described in the first section of this chapter and Figure 7-3. The lower quartile, or 25 percentile, is defined such that one quarter of the data lies below it and three quarters above. The upper quartile, or 75 percentile, occupies a corresponding position with a quarter of the data above it and three quarters below. The interquartile range is the difference between the two quartiles and thus embraces the middle 50% of the data. Sometimes other percentiles are quoted: the 90 percentile, for example, embraces the lower 90% of the data.

The most useful measure of the spread of data is the standard deviation. This is calculated using all the data in the sample. The deviation of each data value from the mean value contributes to the standard deviation, but each deviation is effectively weighted, by squaring the value, to give greater contribution to the larger deviations. The squares of all the deviations are totaled and the mean calculated. The square root of this mean value is the standard deviation.

As an example, suppose we have the following rather unlikely, but easy on the eye, values:

2 3 4 4 5 5 6 6 7 8

The mean value is 50/10 = 5.

The deviation of each value from the mean is

–3 –2 –1 –1 0 0 1 1 2 3

and the squares of the deviations are

9 4 1 1 0 0 1 1 4 9

The mean of the squares of the deviations is 30/10 = 3, and the standard deviation is the square root of 3—viz., 1.73.
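
The same calculation can be reproduced in a few lines of code:

```python
# Standard deviation of the ten values in the text, following the steps described above.
import math

values = [2, 3, 4, 4, 5, 5, 6, 6, 7, 8]

mean = sum(values) / len(values)                        # 50 / 10 = 5
squared_deviations = [(v - mean) ** 2 for v in values]  # 9, 4, 1, 1, 0, 0, 1, 1, 4, 9
variance = sum(squared_deviations) / len(values)        # 30 / 10 = 3
standard_deviation = math.sqrt(variance)                # √3 ≈ 1.73

print(f"mean = {mean}, variance = {variance}, standard deviation = {standard_deviation:.2f}")
```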

The standard deviation has particular meaning in relation to the normal distribution. It is half the width of the normal curve measured at the height of the curve's points of inflection. Its position is such that the area under the curve between one standard deviation below the mean and one standard deviation above the mean is 0.683 of the total area under the curve. It follows that 68.3% of the data lies within one standard deviation of the mean value. Two standard deviations either side of the mean value include 95.4% of the data, and three standard deviations include 99.7% of the data. These figures provide a very useful quick way of visualizing the spread of data when the mean and standard deviation are quoted. In the example above, one standard deviation each side of the mean is from 3.27 to 6.73, and 60% (6 of the 10 values) of our data lie within this band.

The preceding discussion allows us to complete the description of the standard normal distribution introduced in the "Normally Distributed Data" section and shown in Figure 7-8. The mean value, which is the central peak value, is located at a value of zero on the horizontal axis, the area under the curve is equal to 1, and now the scale along the horizontal axis is in units of standard deviations. The vertical scale is probability density, but it is not of direct interest, having been selected in order to render the area under the curve equal to unity, given that the horizontal scale is in units of standard deviations.


Figure 7-8. The percentage of data within a number of standard deviations from the mean

The square of the standard deviation is called the variance. It is used extensively in statistical analysis by reason of its special properties, discussed later. It has no readily visualized meaning: indeed, its units are rather odd. If our standard deviation happens to be in dollars, the variance is in dollars squared—or, if you prefer, square dollars (whatever they are!).

Even when the data does not conform well to the normal distribution, the standard deviation still provides a useful measure of the spread of the data. To illustrate this point, consider data that we might accumulate by throwing a die. Because all numbers from 1 to 6 have equal chance of appearing, we would expect to get nearly the same number of each of the scores. The data would conform to a uniform distribution, and the bar chart would be flat-topped, not looking anything like the normal distribution. The mean score is 3.5, and the standard deviation is calculated to be 1.71. So we would predict that about two thirds of the scores would lie between 1.79 and 5.21. In fact, two thirds of the scores are from 2 to 5, which is roughly in agreement.
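
A quick computational check of the die example, treating the six equally likely scores as the whole population:

```python
# Mean and standard deviation of the scores 1 to 6, each equally likely.
import math

scores = [1, 2, 3, 4, 5, 6]
mean = sum(scores) / len(scores)                                # 3.5
variance = sum((s - mean) ** 2 for s in scores) / len(scores)   # 35/12 ≈ 2.92
sd = math.sqrt(variance)                                        # ≈ 1.71

print(f"mean = {mean}, standard deviation = {sd:.2f}")
print(f"one standard deviation either side: {mean - sd:.2f} to {mean + sd:.2f}")
```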

Statistical tables are available that give values for the area under the standard normal distribution curve at various distances from the mean. The total area under the curve is defined as 1, so the partial areas appear as fractions between 0 and 1 and represent directly the probability of occurrence of the required range. The tables are not particularly easy to use. Because the curve is symmetrical, the tables give values for only half of the distribution—the positive, right-hand, half. The economy is justified in view of the extent of the tables demanded by the level of precision required, but it does mean that considerable care has to be taken when probabilities represented by areas extending to both sides of the mean are required.

Figure 7-9 shows the values of the standard normal distribution in a simpler, but abridged, form that is more convenient for obtaining approximate values and for checking that declared values are not grossly in error. The values are given to only two digits, to economize on space; and they are given as percentages, which are more readily appreciated than the conventionally used decimal fractions. Furthermore, the probability between any two limits can be read immediately, whereas the published tables require separate values to be extracted for the two limits and the difference to be then calculated.


Figure 7-9. Tabulated probabilities of occurrence of normally distributed values between lower and upper limits

It needs to be emphasized that the probability of occurrence is represented by an area. We are asking for the probability of occurrence between two values. We cannot ask for the probability of a unique value being observed. In the earlier example of the heights of people, we cannot ask for the probability of an adult being exactly 160 cm tall. This would be a single vertical line on the normal distribution graph and would enclose no area. The answer is that there is no probability of an adult being exactly 160 cm tall. If this seems odd at first sight, note that the word “exactly” is used. We could ask for the probability of an adult being between 159.5 and 160.5 cm tall, between 159.9 and 160.1 cm tall, or between any other closer limits. These narrow strips would have areas representing the required probabilities. The areas would be small, so the resulting probabilities would be small. This is perfectly reasonable inasmuch as the probability is indeed small of finding someone of a very precisely defined height.

The probability of occurrence can be expressed as a proportion. Thus, if the probability of occurrence of an adult of height between 159.5 cm and 160.5 cm is 0.1, one can say that the proportion of adults between 159.5 cm and 160.5 cm tall is 0.1, or one tenth, or one in ten.
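
When printed tables are not to hand, the same areas can be obtained from software. The sketch below uses scipy's normal distribution with an assumed mean of 165 cm and standard deviation of 10 cm (illustrative figures, not taken from the text) to find the probability of a height falling between two limits:

```python
# Probability of a normally distributed height lying between two limits.
# The mean and standard deviation below are assumed, purely for illustration.
from scipy.stats import norm

mean, sd = 165.0, 10.0
lower, upper = 159.5, 160.5

p = norm.cdf(upper, loc=mean, scale=sd) - norm.cdf(lower, loc=mean, scale=sd)
print(f"P({lower} < height < {upper}) = {p:.3f}")   # the area of the strip under the curve

# A single exact value corresponds to a strip of zero width, and hence zero probability:
print(norm.cdf(160.0, mean, sd) - norm.cdf(160.0, mean, sd))   # 0.0
```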

Grouped Data

Data are often not available in detail but are grouped at the outset. Information, for example, may be gathered from a number of people of different ages, but the ages may not be recorded or even obtained individually but simply classified within bands. The bands have to be carefully defined and equally carefully understood.

We might have bands each of ten years in extent. If we define a band from 20 years to 30 years and the next one as 30 years to 40 years, we do not know in which group to locate someone who is 30. To avoid this problem, we have to define the bands as 20 years to 29 years and 30 years to 39 years. If the data is not discrete, a different procedure has to be adopted. Heights of people vary continuously, so we cannot have, for example, a group 130 cm to 139 cm and then one from 140 cm to 149 cm. There is nowhere to locate 139.5 cm. The groups have to be “equal to or greater than 130 cm and less than 140 cm” followed by “equal to or greater than 140 cm and less than 150 cm.” These designations are quite a mouthful and are instead usually shown using mathematical notation as ≥130 to <140 followed by ≥140 to <150.

If a single representative value is quoted for the group, it is usually the midpoint of the group width. Take note, however, that if the values have been rounded off, the midpoint may not be where it seems. If the group is 10 to 19 and the values have been rounded to the nearest whole number, the group actually ranges from 9.5 to 19.5. The midpoint is then 14.5. But if the group is ≥10 to <20, the midpoint is 15.

Sometimes the groups are not of equal width. This may be because of unevenness in the sampling or simply because there is a real shortage of data within certain bands. Ages of people, for example, are more thinly spread between 80 and 100 years than between 20 and 40 years. Notice that when this happens, the area of each block on a bar chart must still represent the total number of data values within the specified band. The following data can be plotted as a relative frequency bar chart, shown in Figure 7-10(a):

[Table: ages of 50 people grouped in ten-year bands, with the relative frequency of each band]

The groups are of equal width. Each person is represented by an area of 0.02 so that the total area is 1.00 for the 50 people. The tail end of the distribution is uneven; to avoid this, the data can be pooled in a wider group, as shown in Figure 7-10(b). The final group has just three members, so the height is 0.02 in order to make the area of the final block equal to 0.06 units. Notice that we cannot now label the axis as relative frequency because the final group (70 to 99 years) has an actual relative frequency of 0.06. The correct designation is relative frequency density, as shown.


Figure 7-10. The difference between (a) a bar chart and (b) a histogram

I can now explain the difference between a bar chart and a histogram. A bar chart represents discrete data or discrete groups of data, and the groups are all of the same width. The vertical axis represents frequency or relative frequency, and the latter is equivalent to probability. In a histogram, which also represents discrete groups of data, the groups are not all of the same width. The vertical axis represents frequency density or relative frequency density, the latter being equivalent to probability density. The vertical axis on a histogram does not represent probability: it is the area of the block that represents the probability of the data being within the limits of the group. The histogram is thus analogous to continuous data curves, which, as explained in relation to the normal distribution, are also labeled probability density and indicate probability by the area under the curve.
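
The distinction can be checked numerically. Here is a sketch using numpy, with hypothetical ages and a final band three times as wide as the others; the area of each block (density multiplied by band width) recovers the relative frequency:

```python
# Relative frequency density for bands of unequal width, using hypothetical ages.
import numpy as np

ages = np.array([23, 25, 31, 34, 38, 42, 45, 47, 52, 55, 58, 61, 64, 68, 72, 81, 95])
edges = [20, 30, 40, 50, 60, 70, 100]      # the last band (70-99) is three times wider

density, _ = np.histogram(ages, bins=edges, density=True)   # relative frequency density
rel_freq = density * np.diff(edges)                         # block area = relative frequency

for lo, hi, d, rf in zip(edges[:-1], edges[1:], density, rel_freq):
    print(f"{lo}-{hi - 1}: density = {d:.4f}, relative frequency = {rf:.3f}")
print(f"total relative frequency = {rel_freq.sum():.2f}")    # 1.00
```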

Pooling and Weighting

Several sets of data can be brought together to provide a pooled mean value. The pooled value is more representative because it is based on more observations.

When pooling results, a weighted mean is often more appropriate in order to allow some values to have a greater influence on the final result. Sometimes this is essential to avoid the result being in error. For example, if I buy ten apples for 20¢ each in one shop and 4 apples for 24¢ each in another shop, the mean cost per apple is clearly not 22¢. The appropriate indicator is the weighted mean, which is the total money paid divided by the total number of apples purchased— i.e., (20 × 10 + 24 × 4) / (10 + 4) = 21.1¢.

The need for weighting is not always so apparent. The tires on the front wheels of my car wear out much faster than those on the rear wheels. I get 45,000 miles from the rear tires but only 15,000 miles from the front ones. So, on average, I get 30,000 miles—that is, (45,000 + 15,000) / 2—from my tires. This is not correct. In 45,000 miles, I will wear out one pair of rear tires and three pairs of front tires, four pairs in all, so a tire lasts on average (45,000 × 1 + 15,000 × 3) / 4, which is 22,500 miles.

Sometimes the need for weighting seems very surprising. Suppose you catch a bus on a regular basis. The buses are scheduled to arrive every 10 minutes, so this is the mean time between buses; but some will be early and some late. If you arrive at the bus stop at random times, what will be your mean waiting time? It seems at first sight that the answer is five minutes, but this is not correct. You are more likely to arrive in one of the longer gaps between buses than in one of the shorter ones, so your waiting time will be slightly longer than five minutes.

A simple example will illustrate this. Think of two buses: one arrives 12 minutes after the previous one, and the second one arrives after a further 8 minutes, so the mean arrival time is 10 minutes. When you arrive in the 12-minute gap, your mean waiting time is 6 minutes; and when you arrive in the 8-minute gap, your mean waiting time is 4 minutes. The longer waiting time is encountered more often than the shorter waiting time, the ratio being 12 to 8, so the overall mean waiting time has to be obtained by weighting:

(6 x 12 + 4 x 8)/20 = 5.2 minutes.
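
All three examples reduce to the same weighted-mean calculation, as a short sketch shows:

```python
# Weighted means for the three examples in the text: apples, tires, and bus waiting times.
def weighted_mean(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Apples: 10 at 20 cents and 4 at 24 cents
print(round(weighted_mean([20, 24], [10, 4]), 1), "cents per apple")   # 21.1

# Tires: in 45,000 miles, one pair of rear tires and three pairs of front tires wear out
print(weighted_mean([45_000, 15_000], [1, 3]), "miles per tire")       # 22500.0

# Buses: the 12-minute gap is met more often than the 8-minute gap
print(weighted_mean([6, 4], [12, 8]), "minutes average wait")          # 5.2
```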

Weighting in such examples is necessary and can be applied unambiguously, but the weighting may sometimes be a matter of judgment. If several similar investigations have been carried out previously, it may be decided that some of them, though of value, are not as reliable as others because of the techniques used. So the less reliable results are pooled with the others but given a lower weighting. In the following calculation, three estimates of the height of Mount Everest—h1, h2, and h3—are pooled, but h3 is given only half the weight of the other two:

Pooled estimate = (2h1 + 2h2 + h3)/5.

Chapter 5 described simple index numbers as, in effect, percentages referred to a selected base value. Many index numbers that are frequently encountered are derived from more complex calculations because the values are averages of several items. Thus the UK Retail Price Index is based on prices of various commodities on a specific date, the prices of the commodities being averaged. Different commodities are purchased in different quantities, so the average price has to be obtained by weighting the average in relation to the quantities purchased. Clearly a loaf of bread costing £1 and a liter of wine costing £6 cannot simply be averaged. If two bottles of wine are purchased for every 35 loaves of bread, we take the price of two bottles of wine, £12, add it to the price of 35 loaves of bread, £35, and divide the result by 37, the total number of items. Thus the weighted average price is (35 x 1 + 2 x 6)/37 = £1.27. Of course, even the initial prices, £1 and £6, would have to be obtained by averaging, taking account of the different types, different brands, and different shops.

The Retail Price Index involves defining the list of commodities to be included and strict procedures for recording the prices at defined outlets at defined times. The commodities are grouped according to type, so an index can be calculated for various groups of commodities. For example, the overall index is constructed from group indices representing household goods, food, housing, and other groups. The household goods index is constructed from section indices representing household consumables, furniture, and other sections. The household consumables section index is constructed from item indices representing envelopes, toilet paper, and other items. The item index for envelopes is constructed from those of specified type purchased in specified shops in specified locations. In total about 700 items are represented in the Retail Price Index.

When the actual quantities purchased are used to determine the weights to be applied in the averaging, the choice still remains as to whether the quantities should be those purchased in the base year or those bought in the current year. The index that results from using the base year quantities is called a Laspeyres index. The index incorporating the current year quantities is a Paasche index and clearly involves more time and expense in its determination.

As an example, suppose we have data for our base year as follows:

Bread £1 per loaf

Relative quantity 35 loaves

Wine £6 per bottle

Relative quantity 2 bottles.

The data for a subsequent year, for which we require the index, are

Bread £1.20 per loaf

Relative quantity 35 loaves

Wine £8 per bottle

Relative quantity 1 bottle.

The Laspeyres index, using base year quantities, is calculated thus:

Base year: £1 × 35 = £35 and £6 × 2 = £12, giving a total of £47.

Current year: £1.20 × 35 = £42 and £8 × 2 = £16, giving a total of £58.

Index = (58 / 47) × 100 = 123.

The Paasche index, using current year quantities, is calculated thus:

Base year: £1 × 35 = £35 and £6 × 1 = £6, giving a total of £41.

Current year: £1.20 × 35 = £42 and £8 × 1 = £8, giving a total of £50.

Index = (50 / 41) × 100 = 122.
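
Both indices can be produced by the same small function, differing only in which year's quantities supply the weights. A sketch using the bread-and-wine figures above:

```python
# Laspeyres and Paasche indices for the bread-and-wine example.
def price_index(base_prices, current_prices, quantities):
    """Cost of a fixed basket at current prices relative to base prices, times 100."""
    base_cost = sum(p * q for p, q in zip(base_prices, quantities))
    current_cost = sum(p * q for p, q in zip(current_prices, quantities))
    return 100 * current_cost / base_cost

base_prices    = [1.00, 6.00]   # bread per loaf, wine per bottle (base year)
current_prices = [1.20, 8.00]   # prices in the current year
base_qty       = [35, 2]        # quantities bought in the base year
current_qty    = [35, 1]        # quantities bought in the current year

laspeyres = price_index(base_prices, current_prices, base_qty)     # weights from the base year
paasche   = price_index(base_prices, current_prices, current_qty)  # weights from the current year
print(f"Laspeyres index = {laspeyres:.0f}")   # (58/47) x 100, about 123
print(f"Paasche index   = {paasche:.0f}")     # (50/41) x 100, about 122
```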

The two indices are quite similar unless the quantities vary appreciably from year to year. A disadvantage of the Paasche index is that indices for different years cannot be compared with each other but only with the base year. The Laspeyres index allows comparison between any two years. The UK Retail Price Index is a Laspeyres-type index, but its derivation is modified in a number of ways. Other well-known index numbers are those illustrating the prices of shares, such as the FTSE 100 and the Dow Jones, and various housing price indices.

Be wary of pooled data that can apparently show a quite different result. The pooling may have been carried out to disguise an embarrassing set of data. Consider the following example.

A company has two new salesmen, Smith and Brown. In their first week, Smith makes 5 sales from 40 contacts, giving him an average of one sale per 8 contacts. Brown makes one sale from 10 contacts. So Smith has the better average. The situation is illustrated in Figure 7-11. In the second week, Smith makes 3 sales from 10 contacts, giving him an average of one sale per 3.33 contacts. Brown makes 10 sales from 40 contacts, giving him an average of one sale per 4 contacts. So Smith again has the better average.


Figure 7-11. Simpson’s paradox

But what happens if we pool the results of the two weeks? Smith has a total of 8 sales from 50 contacts, whereas Brown has a total of 11 sales from 50 contacts. So Brown has the better average. Who is the better salesman? Some might argue that Smith is better because he made the greater contribution to the company’s performance in both weeks. Others could say that Brown is better because his better performance is revealed when a greater amount of data is available. The most realistic conclusion is that there is not sufficient evidence to distinguish between them. The difference between their own performances in the two weeks is greater than the difference between their own and their colleague’s performance. Also, there may be one or more variables affecting the conditions during the two weeks, of which no account has been taken.

This kind of situation is known as Simpson’s paradox and is usually met with surprise. Apart from its curiosity value, it does illustrate very well the fact that statistical results should not be accepted blindly but should always be judged alongside other evidence and practical considerations.
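
For completeness, the Smith and Brown figures can be tabulated in a few lines of code, making the reversal explicit:

```python
# Sales per contact for each week and for the pooled data (the figures from the text).
week1 = {"Smith": (5, 40), "Brown": (1, 10)}     # (sales, contacts)
week2 = {"Smith": (3, 10), "Brown": (10, 40)}

for name in ("Smith", "Brown"):
    s1, c1 = week1[name]
    s2, c2 = week2[name]
    print(f"{name:5}  week 1: {s1 / c1:.3f}   week 2: {s2 / c2:.3f}   "
          f"pooled: {(s1 + s2) / (c1 + c2):.3f}")
# Smith is ahead in both weeks (0.125 vs 0.100 and 0.300 vs 0.250),
# yet behind when the weeks are pooled (0.160 vs 0.220).
```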

FOOD FOR THOUGHT

The consultant’s report lay on the desk. Liz Fisher, Head of Food Processing at Moroney Cookie Company, picked it up and started to read. The company had decided to introduce a new low-sugar cookie, and Liz’s team had produced two recipes, both of which were judged to be marketable.

Graham Consultants had been employed to test the two new cookies on the public before a decision was made as to which one would go into production.

The report described how two shops, already selling Moroney cookies, had each agreed to set up two stalls on a busy afternoon. One stall, handling recipe A, offered each willing customer a sample cookie and then invited the customer to purchase a packet at a reduced price. The second stall did the same with cookies of recipe B. The customers did not know that the two stalls were offering different cookies. The number of customers sampling the cookies and the number of customers purchasing a packet were recorded.

The report concluded that recipe A was more popular than recipe B at both shops, though the difference was not great.

Liz thought the figures looked rather odd. She was not happy with the findings:

Store 1

Recipe A 22 purchases from 24 sampled 92%

Recipe B 89 purchases from 106 sampled 84%

Store 2

Recipe A 50 purchases from 71 sampled 70%

Recipe B 18 purchases from 26 sampled 69%

Suspecting that the situation was not satisfactory, she decided to pool the results from the two stores and got the following result.

Recipe A 72 purchases from 95 sampled 76%

Recipe B 107 purchases from 132 sampled 81%

The situation was now reversed: recipe B was more popular than recipe A! Liz saw that this was an instance of Simpson’s paradox. One or more additional variables were influencing the results. The experimental arrangements at the two stores were not comparable.

She knew that the report had to be rejected and Graham’s would have to investigate the source of the problem. The experiment would have to be repeated with improved controls.

She picked up the phone ….

Estimated Population Properties

The population, to recap, is the complete, perhaps hypothetical, and perhaps infinite, set of data from which the sample was randomly drawn. It is necessary to realize that the information gained from the sample may not be representative of the population characteristics without some modification, although the modifications are generally quite minor. It has been mentioned already that the sample sometimes consists of the entire population, which simplifies matters.

The best estimate of the population mean, μ, is the sample mean, χm. Statisticians use the word expectation rather than mean when speaking of an expected mean rather than a calculated mean. Thus one refers to "the mean of a sample" and "the expectation of the population" from which the sample was drawn.

The best estimate of the standard deviation of the population, σ, is the sample standard deviation, s, slightly modified. The modification is required because sample standard deviations slightly underestimate the population standard deviation, particularly when the sample is small. The sample standard deviation has to be multiplied by the square root of the ratio of n to n−1 to give the estimate of the population standard deviation, σ, where n is the number of values in the sample. Thus

σ = s√(n/(n − 1))

and the estimated population variance is

σ2 = s2 × n/(n − 1)

If the sample is small the alteration of the sample standard deviation may be appreciable, but for large samples the ratio n/(n-1) is close to unity and has a small effect only.

If two samples are pooled to provide a larger single sample, the estimated mean value for the population is obtained in the usual way of obtaining a weighted mean. Thus,

μ = (n1χm1 + n2χm2)/(n1 + n2),

where the suffixes 1 and 2 refer to the two samples. The estimated pooled variance is

σ2 = {(n1 − 1)s12 + (n2 − 1)s22}/(n1 + n2 − 2),

and the estimated pooled standard deviation is the square root of this.

The best estimate of the population proportion is the sample proportion, and pooling is dealt with in exactly the same way as for the estimated population mean.
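
The following sketch applies these corrections to the earlier ten-value sample and to a second, hypothetical sample, and then pools the two. In the pooled-variance step the corrected estimates are used in place of s1 and s2, which is the standard convention, though the text leaves this implicit.

```python
# Estimated population standard deviation from a sample, and pooled estimates from two samples.
import math

def population_sd_estimate(sample):
    """Sample standard deviation (divide-by-n) corrected by the factor sqrt(n/(n-1))."""
    n = len(sample)
    mean = sum(sample) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in sample) / n)
    return s * math.sqrt(n / (n - 1))

sample1 = [2, 3, 4, 4, 5, 5, 6, 6, 7, 8]   # the earlier example: s = 1.73
sample2 = [3, 4, 5, 6, 6, 7]               # a second, hypothetical sample

sigma1 = population_sd_estimate(sample1)    # 1.73 x sqrt(10/9), about 1.83
sigma2 = population_sd_estimate(sample2)
n1, n2 = len(sample1), len(sample2)
m1, m2 = sum(sample1) / n1, sum(sample2) / n2

pooled_mean = (n1 * m1 + n2 * m2) / (n1 + n2)
pooled_variance = ((n1 - 1) * sigma1**2 + (n2 - 1) * sigma2**2) / (n1 + n2 - 2)

print(f"estimated population sd from sample 1: {sigma1:.2f}")
print(f"pooled mean = {pooled_mean:.2f}, pooled sd = {math.sqrt(pooled_variance):.2f}")
```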

Confidence Intervals

The preceding section said that the sample mean provides the best estimate of the population mean (the expectation). Thus if we survey people attending a particular film at the local cinema and find that in a sample of 40 people the mean age is 32 years, then this provides the best estimate of the mean age of the people who did attend or might have attended under the same circumstances. Clearly this may easily be in error, and a useful procedure is to attach confidence limits. These are calculated from the estimated population variance, but before seeing how, it is useful to see how they are presented and what they mean. The result might be quoted as, for example,

Mean age = 32 ± 5 (95% confidence),

meaning that intervals established in this way will include the true population mean in 95% of such investigations.

Note that it does not mean that there is a 95% probability of the true population mean lying in the interval 27 to 37. The true value is either within a given interval or not. The issue is subtle and may be illustrated as follows. Suppose the true population mean is 26. The sample we obtained estimated the mean to be between 27 and 37, which is not correct. However, we were unlucky: on average, nineteen out of twenty similar samples would have produced ranges that include 26. It can be seen that this is different from saying that the true value of 26 has a 95% chance of being between 27 and 37. It has no chance of being so. However, it is clearly a fairly infrequent occurrence to have established a range that does not trap the true value, and it is easy to see how the meaning of confidence limits is often wrongly stated.

Let us now see how the confidence limits are obtained. From the previous description of the normal distribution in the second section of this chapter, we know that a single value drawn from a population has approximately a two-thirds chance of being within one standard deviation of the mean. The single value is the best estimate of the mean, but it would clearly be a very poor one. With just one value—one person in the cinema, for example—we cannot calculate a standard deviation, so we do not even know how poor our estimate is.

In reality, we take a sample and calculate the mean value. This is now our best estimate of the population mean. We have a sample mean of 32 years in our cinema example, obtained from, we suppose, a sample size of 40. We can calculate a standard deviation from the sample and obtain a value of 16 years, say. This allows us to calculate the best estimate of the standard deviation of the population, which is found to be 16.2, after making the minor correction described in the preceding section.

The estimate of the population mean is more reliable than the one from a single data value, but how much more? It turns out that when means of samples are obtained, they themselves are distributed normally but with a smaller standard deviation than that of the population. In fact, the standard deviation of the means of samples of equal size is equal to the standard deviation of the population divided by the square root of the number of data in each sample. So the larger the sample, the more likely the sample mean will be close to the population mean. This is what one would expect. The standard deviation of sample means becomes

σ/√n = 16.2/√40 = 2.56.

Reference to tables of the standard normal distribution shows that there is a probability of 95% that a value lies within 1.96 standard deviations either side of the mean value. In our example,

1.96 × 2.56 = 5.02.

Hence, we have the conclusion that the estimated mean age of those attending or potentially attending the cinema is 32 ± 5 (95% confidence).

It is worth adding that the means of samples are found to be distributed normally, or nearly so, even when the original data departs considerably from a normal distribution.

It is useful to note that the value of 1.96 is always associated with the 95% confidence limits, so there is no need to consult tables of the standard normal distribution on each occasion. Similarly, for other confidence limits, appropriate values that can always be used are summarized as follows:

Mean = χm ± 1.64 σ/√n (90% confidence)

Mean = χm ± 1.96 σ/√n (95% confidence)

Mean = χm ± 2.58 σ/√n (99% confidence)

where

χm = sample mean

σ = estimated standard deviation of the population

n = sample size
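
Putting the cinema figures through these formulas gives the interval quoted earlier:

```python
# 95% confidence limits for the mean age in the cinema example.
import math

n = 40              # sample size
sample_mean = 32    # mean age from the sample
sample_sd = 16      # standard deviation calculated from the sample

sigma = sample_sd * math.sqrt(n / (n - 1))   # estimated population sd, about 16.2
se = sigma / math.sqrt(n)                    # standard deviation of sample means, about 2.56

half_width = 1.96 * se                       # 1.96 for 95% confidence, about 5
print(f"mean age = {sample_mean} ± {half_width:.1f} (95% confidence)")
print(f"interval: {sample_mean - half_width:.1f} to {sample_mean + half_width:.1f}")
```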

So far in this section we have assumed that our sample is large. If our sample is small, less than about 30, we do not use the normal distribution. Instead, we have to refer to tables of a distribution called Student’s-t. This distribution varies as the number of data changes, so we cannot fix the number of standard deviations for a given level of confidence as we did above. As the number of data in the sample increases, the t-distribution comes closer to the normal distribution—hence the need for the t-distribution only for small samples. (Student was the pen name of William Gosset, who devised the test for small samples; the test was not so named because of its use by students of statistics.)

Below are tabulated values from the t-distribution for a number of different sample sizes. The values shown replace the numerical factors in the confidence limits statements above, obtained from the normal distribution. The latter factors are repeated in the bottom line of the tabulation for ease of comparison. The tendency of the t-distribution values to approach the normal distribution values can be appreciated.

[Table: t-distribution factors for various sample sizes and confidence levels, with the corresponding normal-distribution factors repeated in the bottom line]

It can also be seen that smaller samples result in a widening of the confidence limits. This widening is additional to the widening that arises as a result of the smaller value of n in the estimation of the population standard deviation.
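
The factors in such a table can also be obtained from software rather than from printed tables. Here is a sketch using scipy (the two-sided 95% factor corresponds to the 97.5 percentile of the t-distribution):

```python
# t-distribution factors for 95% confidence limits, approaching 1.96 as the sample grows.
from scipy.stats import norm, t

for n in (5, 10, 30, 100):
    dof = n - 1                          # degrees of freedom for a sample of size n
    factor = t.ppf(0.975, dof)           # replaces 1.96 in the confidence-limit formula
    print(f"sample size {n:>3}: factor = {factor:.2f}")

print(f"normal distribution: factor = {norm.ppf(0.975):.2f}")   # 1.96, the large-sample value
```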