# Better Business Decisions from Data: Statistical Analysis for Professional Success (2014)

### Part IV. Comparisons

### Chapter 10. Comparisons with Numerical Data

**Are Today’s Chocolate Bars Smaller Than Yesterday’s?**

Once a numerical sample or population has been characterized in a quantifiable way, as shown in Chapter 7, it can be compared with others to seek differences or similarities. This chapter explains what can be learned from single values, pairs of values, pairs of samples, and sets of samples. In each case, the null hypothesis, that no difference is evidenced, is set up; and, by calculating the appropriate test statistic, it is established whether the null hypothesis should be accepted or not.

Single Value

The null hypothesis is that a single value could have come from a given population. An example might be to investigate whether a bar of chocolate weighing 121g could have come from a production line producing bars with a mean weight of 120g and a standard deviation of 0.5g. The situation would be considered to involve a normal distribution of chocolate bar weights.

We have already seen that the area under the curve of the normal distribution represents the probability of occurrence of the values within the bounds of the area. If we are interested in a 5% level of significance, say, we would be asking whether or not a value as large as 121g would be found within the 5% tail of the normal distribution which has a mean of 120.0g and a standard deviation of 0.5g.

The difference between 121g and 120g is scaled to fit the standard normal distribution by calculating the so-called Z-score, where

Z = (Single Value – Population Mean)/(Standard Deviation)

= (121 – 120)/0.5

= 2.0.

This gives the amount the value being investigated differs from the population mean, in units of standard deviations. Referring to Figure 7-9, the probability of a value being at least 2.0 standard deviations from the mean is 2%. (Read off A = 2.0, B = infinity, which gives 2%.) Complete tables of the normal distribution give the result more accurately as 0.0228 (2.28%). This probability is below the 5% level of significance, and so we conclude that the null hypothesis is incorrect and it is unlikely that the chocolate bar came from the production line. Put another way, there is only a 2.28% chance of our being wrong if we say that the chocolate bar did not come from the production line.
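The calculation above can be sketched with the standard library alone; the function names here are illustrative, not from the book:

```python
from math import erfc, sqrt

def z_score(value, pop_mean, pop_sd):
    # Distance of the value from the population mean, in standard deviations
    return (value - pop_mean) / pop_sd

def upper_tail_probability(z):
    # Area under the standard normal curve beyond z (one-tailed probability)
    return 0.5 * erfc(z / sqrt(2))

z = z_score(121, 120, 0.5)        # 2.0
p = upper_tail_probability(z)     # about 0.0228, i.e. 2.28%
```

Since 0.0228 is below the 0.05 significance level, the null hypothesis is rejected, as in the text.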

The example is a one-tailed test, because we are quoting the probability of a value as high as 121g being observed. In a two-tailed test, we would be inquiring as to the probability of a value being 1g distant from the mean, either above or below. We would thus work with a 2.5% probability in the upper tail and a 2.5% probability in the lower tail to fix the limits corresponding to a 5% probability of the value not being likely to have been selected from the population.

The relation between probability of occurrence and departure from the mean value is shown as follows for the commonly used levels of significance and for one-tailed and two-tailed tests (departures are in units of standard deviations from the mean):

| Level of significance | One-tailed test | Two-tailed test |
|---|---|---|
| 5% | 1.64 | 1.96 |
| 2% | 2.05 | 2.33 |
| 1% | 2.33 | 2.58 |

Use of these preferred values of significance level avoids the need to consult the full table of values for the normal distribution. It is useful to note that the values for two-tailed tests are the same values that we used in setting up confidence limits in Chapter 7. This is not too surprising, because, for example, a 95% probability that a value is within a symmetrical central band is equivalent to a 2.5% probability of it being above the band and a 2.5% probability of it being below.

Mean of a Sample

The null hypothesis is that a sample mean value could have come from a given population. An example, continuing with our chocolate bars, would be that a production line has been serviced and, after servicing, a sample of 100 bars is found to have a mean of 119.9g, compared with a previously determined population mean of 120.0g. To establish whether the production line is now operating satisfactorily, we set up the null hypothesis that the population from which the sample was drawn has a mean of 120.0g. We suppose that the sample had the expected standard deviation of 0.5g, as before.

The procedure is similar to that in the previous example, a Z-score being obtained and referred to tables of the normal distribution. However, because our sample mean is more representative than the single value was in the previous section, we reduce the standard deviation of the sample to get the standard deviation of the mean. This is done by dividing the variance of the sample by the number of data values in the sample and then taking the square root. This gives us the standard deviation of the mean, which is usually called the *standard error* of the mean. Put another way, we have divided the standard deviation of the sample by the square root of the number of data values to get the standard deviation of the mean. Thus, the standard deviation of the mean is 0.5g divided by the square root of 100—i.e., 0.5/10 = 0.05g. This has the effect of reducing the uncertainty of the result. This reduction in standard deviation was used in a similar way in calculating confidence limits in Chapter 7.

The Z-score is

(119.9 – 120.0)/0.05 = –2.0.

The magnitude of this value exceeds that required for 5% significance, almost reaching the 2% level, as can be seen from the values shown in the previous section. We conclude that the null hypothesis should be rejected, there being evidence that the production line is not operating as required. (The negative value obtained for the Z-score simply shows that the value being tested is below the population mean; you will recall that the mean of a standard normal distribution is located at zero.)
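As a sketch, the same test for a sample mean divides by the standard error rather than the standard deviation (the function name is illustrative):

```python
from math import sqrt

def z_for_sample_mean(sample_mean, pop_mean, pop_sd, n):
    # The standard error of the mean is the standard deviation over sqrt(n)
    standard_error = pop_sd / sqrt(n)
    return (sample_mean - pop_mean) / standard_error

z = z_for_sample_mean(119.9, 120.0, 0.5, 100)   # -2.0
```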

It is better to use a large sample because of the reducing effect on the standard error. However, because the reduction is by the square root of the sample size, a situation of diminishing returns sets in. With a sample size of 16, the standard error is reduced by a factor of 4 compared with the standard deviation for a single value. If we wish to reduce it by a factor of 8, we need a sample of size 64. The effort and cost of obtaining samples thus rise rapidly as we attempt to reduce the uncertainty in the results.

If the size of the sample is small, a slightly different procedure is adopted. The Z-score is modified slightly but then referred not to tables of the normal distribution, but to tables of the *t*-distribution (Chapter 7). The *t*-distribution approaches the normal distribution, giving the same results for large samples.

Difference between Variances

The null hypothesis is that two samples having different variances could have been drawn from the same population. This amounts to examining whether the two samples differ significantly, because if they could not have come from the same population, they must have been drawn from different populations.

The ratio of the two variances, F, is calculated by dividing the larger variance, s₁², by the smaller, s₂², to give a value greater than 1,

F = s₁²/s₂².

If n₁ and n₂ are the numbers of data in the two samples, the degrees of freedom are n₁ – 1 and n₂ – 1. The value of F and the degrees of freedom are referred to tables of *Snedecor’s F-values*. The tables are fairly extensive because of the need to cater to each level of significance and the number of data in each of the two samples. Extracts from the tables are shown in the “Multiple Samples” section and in Chapter 16, where further uses of the F-test are illustrated.

If the two variances are not significantly different, they may be pooled and the weighted mean value used as a more reliable estimate of the population variance. Thus, as shown in Chapter 7, the pooled estimated population variance is given by

σ² = {(n₁ – 1)s₁² + (n₂ – 1)s₂²}/(n₁ + n₂ – 2).
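A minimal sketch of the variance-ratio and pooling calculations; the function names and the figures in the usage lines are hypothetical, not from the book:

```python
def variance_ratio(s1_sq, s2_sq):
    # F is always formed with the larger variance on top, so F >= 1
    return max(s1_sq, s2_sq) / min(s1_sq, s2_sq)

def pooled_variance(s1_sq, n1, s2_sq, n2):
    # Weighted mean of the two variances, weighted by degrees of freedom
    return ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2)

F = variance_ratio(0.30, 0.20)                 # about 1.5
pooled = pooled_variance(0.30, 12, 0.20, 10)   # about 0.255
```

The F value, with degrees of freedom 11 and 9 in this hypothetical case, would be referred to the F-tables before deciding whether pooling is justified.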

Difference between Means

The null hypothesis is that two samples having different means could have been drawn from the same population. Notice that the previous test, the variance ratio test, should first be carried out. If the F-test shows the two samples to be significantly different, it might be pointless to ask if the means show the samples to be different. Of course, the F-test is subject to a degree of unreliability, so it becomes a matter of judgment how to proceed.

On the assumption that we continue to examine the two mean values, a Z-score is calculated, expressing the difference between the means in terms of the number of standard deviations. This is similar to what we did in the “Mean of a Sample” section when we compared the mean of a single sample with a population mean. However, we now have two samples, each of which is an estimate of the supposed underlying population. We will use the difference between the two means, as we did before, but the required standard deviation now refers to a new distribution—that is, the distribution of the differences between two samples. The standard deviation to be used here is the standard deviation of the difference. Each of the mean values has its associated variance, expressing its uncertainty. So the sum of the two variances expresses the uncertainty in the difference between the means.

At this stage an example will make clear how to proceed. Assume we have details of sales of a particular product by two sales staff over a period of time, and we wish to make a comparison:

The variance of the difference of the means is σ²/n₁ + σ²/n₂, where σ² is the population variance, which has to be estimated as we do not know its value. The estimate of the population variance, using the sample standard deviations, is

σ² = {(n₁ – 1)s₁² + (n₂ – 1)s₂²}/(n₁ + n₂ – 2).

This is the equation you met in Chapter 7 and in the preceding section for pooling two samples to estimate the population variance. Using the values in the table above gives 30.06, so the variance of the difference of the means is 30.06/30 + 30.06/35—that is, 1.86. The standard deviation of the distribution of the difference of the means is the square root of this, which is 1.36.

The Z-score, the difference between the two means in terms of standard deviations, is therefore (16 – 12)/1.36, which is 2.94. It can be seen from the values for the normal distribution shown in the “Single Value” section of this chapter that this is significant at the 1% level, so we would conclude that the null hypothesis is rejected and the two members of staff differ in performance.

Notice that use is made here of the additive nature of variance: we cannot simply add the two values of standard deviation to get the standard deviation of the difference between the means.
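The arithmetic of this example can be checked as follows, taking the means (16 and 12), sample sizes (30 and 35), and pooled variance (30.06) quoted in the text:

```python
from math import sqrt

pooled_var = 30.06
n1, n2 = 30, 35

# The variance of the difference of the means is the sum of the two variances of the means
var_of_difference = pooled_var / n1 + pooled_var / n2   # about 1.86
sd_of_difference = sqrt(var_of_difference)              # about 1.36
z = (16 - 12) / sd_of_difference                        # about 2.9
```

The slight difference from the 2.94 in the text comes from the text rounding the standard deviation to 1.36 before dividing.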

Means of Paired Data

Paired data frequently arise in before-and-after situations. Thus we could have test results for a group of students before and after a week of revision. For example:

If there had been no effect of the revision sessions, we would expect these increases to be small, with an average close to zero. We can therefore ask whether this distribution of increases differs significantly from the values that might be obtained from a population with a mean value of zero. Our null hypothesis is therefore that the sample of increases could have been drawn from a population of values with a mean of zero.

The calculation can now follow a procedure similar to that used in the previous section, where we compared two sample means. The variance of the difference of the means reduces to the variance of the mean of the increases, and the estimate of the population variance reduces to the variance of the increases.

The sample is small; so, rather than quote a Z-score, the result should be referred to as a value of Student’s *t*, and tables of *t*-values should then be used to determine the level of significance. Samples are commonly small in paired data because exact pairing becomes more difficult when the required sample size gets larger. In this example, the *t*-value of 2.39 is somewhat short of the value required to indicate a 5% level of significance for a sample size of 5. (See the selection of *t*-values in Chapter 7.) It would be concluded therefore that the null hypothesis is accepted and there is insufficient evidence to show that the revision sessions were of any benefit.
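A sketch of the paired calculation; the increases listed here are hypothetical, since the book’s table of test results is not reproduced:

```python
from math import sqrt

def paired_t(differences):
    # t = mean difference / standard error of the mean difference
    n = len(differences)
    mean = sum(differences) / n
    variance = sum((d - mean) ** 2 for d in differences) / (n - 1)  # sample variance
    return mean / sqrt(variance / n)

t = paired_t([3, 5, -1, 4, 2])   # hypothetical before-and-after increases for five students
```

The resulting *t*-value would be compared with the tabulated value for n – 1 = 4 degrees of freedom.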

Multiple Samples

If more than two samples need to be compared, it would be quite possible to compare them in pairs using the methods described above. This, however, would be an unsatisfactory procedure for the following reason. If there were three samples, A, B, and C, there would be three pairs to compare: AB, AC, and BC. If we are testing at the 5% level, we have a 1 in 20 chance of being wrong in each of these comparisons. We have a chance of approximately 3 in 20 of at least one of the results being wrong. The situation gets worse rapidly as we increase the number of samples. Four samples produce six pairs, and five samples produce ten pairs, rendering the probability of being wrong unacceptably high.
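The inflation of the error rate can be sketched directly, assuming the comparisons are independent:

```python
def chance_of_at_least_one_error(comparisons, alpha=0.05):
    # Complement of the probability that every comparison is correct
    return 1 - (1 - alpha) ** comparisons

# 3, 4, and 5 samples give 3, 6, and 10 pairwise comparisons
risks = [chance_of_at_least_one_error(k) for k in (3, 6, 10)]
# roughly 14%, 26%, and 40%
```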

A technique called *variance analysis* (ANOVA) is used in such situations, and it is here that the important role that variance plays in statistical routines becomes apparent. Variance, in spite of often having strange units, has the useful property of being additive. This we have encountered previously where, in order to calculate a mean standard deviation, we first obtained the variances from each standard deviation, then averaged the variances and obtained the mean standard deviation by taking the square root of the mean variance. You saw, similarly, in the “Difference between Means” section, that to get the variance of the difference between two values, we added the two individual variances.

If we have a number of samples, there will be variation of the data within each sample. In addition, the samples will differ from each other. In order to quantify the difference between the samples, it is necessary to separate the variation within the samples and the variation between the samples. The analysis of variance allows this to be done.

From the variances of all the samples, we can obtain a pooled variance. This gives a measure of the variation within the samples. In effect, we are supposing, temporarily, that the samples are in reality drawn from the same population, so each sample variance is an estimate of the population variance. The best estimate of the population variance is then obtained by pooling the several estimates. This is the measure of the within-sample variance.

We can then temporarily remove the variation within each sample by replacing each datum with its sample mean and calculating the variance of the total data. This gives a measure of the variation between the samples. In effect, we are asking what the best estimate of the population variance would be if each sample consisted of a set of identical values having the original mean but zero variance.

If all the samples could have been drawn from the same population, it would be expected that the variation within the samples would be similar to the variation between the samples. Thus the ratio of the between-sample variance to the within-sample variance indicates the extent to which the samples could have a common source. An example will make this clear.

Five soccer players have scored goals, as follows, in a number of matches. The number of matches played is not necessarily the same for each player. The null hypothesis is that the five samples could have been drawn from the same population. In other words, there is no evidence that the performance of the five players differs significantly:

The pooled variance, following the pooling procedure explained in Chapter 7, is 1.65. This is the within-sample variance. The degrees of freedom associated with this variance are obtained by adding the degrees of freedom for each sample: that is, one less than the number of data. So, (2+3+4+4+4) = 17 is the number of degrees of freedom.

To get the between-sample variance, each datum is replaced by its sample mean:

The variance of these values about the overall mean is 3.32; this is the between-sample variance. The degrees of freedom associated with this variance are one less than the number of samples, viz., 4. Note that the sum of the degrees of freedom for the within-sample variance and the between-sample variance, 17 + 4 = 21, is equal to the total degrees of freedom for the 22 data values, i.e., 22 – 1 = 21.

The ratio of the two variances (3.32/1.65 = 2.01), together with their degrees of freedom, is referred to the table of F-values described in the section “Difference between Variances.” An extract from the tables follows:

In this example, the variance ratio, 2.01, is not sufficiently large to indicate a significant difference between the performances of the players. The null hypothesis is accepted.
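The within- and between-sample calculation can be sketched as follows; the goal counts in the usage line are hypothetical, since the book’s table is not reproduced:

```python
def one_way_anova(samples):
    # Returns (between-sample variance, within-sample variance, F ratio)
    k = len(samples)
    n_total = sum(len(s) for s in samples)
    grand_mean = sum(sum(s) for s in samples) / n_total

    # Within: pool the squared deviations about each sample's own mean
    within_ss = sum(sum((x - sum(s) / len(s)) ** 2 for x in s) for s in samples)
    within_var = within_ss / (n_total - k)

    # Between: replace each datum by its sample mean, measure variation about the grand mean
    between_ss = sum(len(s) * (sum(s) / len(s) - grand_mean) ** 2 for s in samples)
    between_var = between_ss / (k - 1)

    return between_var, within_var, between_var / within_var

between, within, F = one_way_anova([[2, 3, 4], [1, 2, 3, 2], [4, 5, 3, 4]])
```

The F ratio, with k – 1 and n – k degrees of freedom, would then be referred to the table of F-values as in the example above.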

The analysis of variance used in this way is referred to as a *one-way analysis* of variance, in that the variation between groups of samples is examined, each sample being of a similar type and possibly drawn from the same population. In Chapter 16, you will see that variance analysis can be applied to sets of samples that differ in some way.

**MANAGING THE MANAGER**

Premier Pressings is a company manufacturing steel pressings for engineering firms making automobiles, washing machines, gas boilers, and similar items. The company has units located in five different cities, each serving local needs.

The chief executive, George Robinson, was concerned that one of his units, Shempton, had been showing low profits over the past six months in comparison with the four other units. He had discussed his concerns with the manager of the Shempton unit, Tom Greeves, to establish what the problem was. The meeting was less than satisfactory: Tom was unable to offer any reasonable explanation for his poor results and claimed that it was a statistical quirk and that no doubt in subsequent months the effect would balance out.

Unconvinced, George decided on further investigation. He called on a senior draftsman from the Design Office who had some knowledge of statistics to have a look at the figures.

The draftsman, Arnold Mason, could see immediately that the average profit over the six-month period was much lower than the profits for any of the other four units, although the variation, month to month, was quite large for all the units. He decided to first check on the consistency or otherwise of the results from the four other units. He listed the six profit values for each of the four units, 24 data in total, and carried out a one-way variance analysis. This gave him a value of the within-sample variance and a value of the between-sample variance. He calculated the variance ratio, F. Reference to tables of F-values showed that the result was not significant, so the four units could be considered to be producing results with a similar amount of spread. He therefore calculated the mean and variance for the 24 values of profit.

The next step was to see if the Shempton results were significantly different from the combined 24 values. The mean and variance of the Shempton results were calculated. A comparison of the two variances gave an F-value that was not significant. However, the comparison of the two mean values showed a significant difference at the 5% level. This indicated that there would be a 1 in 20 chance of being wrong if it were maintained that the Shempton results were inferior to the others.

The CEO, armed with the results, summoned Tom to a further meeting and pointed out that there was good evidence that the Shempton results were not satisfactory. It was accepted that the evidence was not overwhelming; and, in view of a degree of uncertainty, Tom was told that he would be given a further six months to improve his profits. The exercise would be repeated in six months’ time, and Tom’s future would then be considered.