Better Business Decisions from Data: Statistical Analysis for Professional Success (2014)

Part V. Relationships

Chapter 15. Relationships with Descriptive Data

Any Color as Long as It’s Black

Much of the data involved in business operations is descriptive rather than numerical. In product development and marketing we have decisions to make regarding color, shape and packaging. Surveys will have resulted in yes/no answers to questions. Records will show whether a product is popular or unpopular, whether it sells or doesn’t sell.

Nominal Data

If the data is nominal, we speak of association between the variables rather than correlation, and this can be examined by several means. Suppose we wish to know whether a particular medical treatment is effective in helping to cure a complaint. A sample of patients might give the following results:

            Cured   Not cured   Total
Treated       100          40     140
Not treated    30          30      60
Total         130          70     200

Yule’s coefficient of association, Q, can be calculated from the four values in the two-by-two table, making use of the products of the diagonals. With the above values,

Q = (100 × 30 − 30 × 40)/(100 × 30 + 30 × 40) = 0.43.

The value of Q always lies between +1 and −1, the magnitude of the value indicating the strength of the association. The sign, + or −, indicates the direction of the association: in our example, whether the treatment results in more or fewer cures. An improved version of Yule’s coefficient, which requires a slightly more elaborate calculation, is the tetrachoric correlation coefficient. Yule’s coefficient of association cannot be used when there are more than two rows or columns; in that case the polychoric correlation coefficient is used instead.
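As a quick check of the arithmetic, the coefficient is easy to compute in a few lines of code. The Python sketch below (the yules_q helper is simply an illustrative name) takes the four cell counts of a two-by-two table and reproduces the value of 0.43 for the data above.

def yules_q(a, b, c, d):
    """Yule's coefficient of association for a two-by-two table.

    a, b are the counts in the first row (here treated: cured, not cured);
    c, d are the counts in the second row (not treated: cured, not cured).
    Q = (ad - bc) / (ad + bc) and always lies between -1 and +1.
    """
    return (a * d - b * c) / (a * d + b * c)

# Treatment example: 100 and 40 treated (cured, not cured),
# 30 and 30 not treated (cured, not cured)
print(round(yules_q(100, 40, 30, 30), 2))   # 0.43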

The same data can be examined by a so-called contingency test. If the treatment had no effect, we would expect the proportion of cured to not cured to be the same for the treated and the not treated. The values, keeping the totals the same, would appear as follows:

            Cured   Not cured   Total
Treated        91          49     140
Not treated    39          21      60
Total         130          70     200

Thus 91/49 = 39/21 = 130/70. The issue then is whether the actual values depart significantly from these expected values. Our null hypothesis is that they do not: that any differences between the actual and expected values are due only to chance in the sampling.

You saw in Chapter 7 how the chi-squared test can be used to compare two distributions. In effect, we have two distributions here: the distribution of sampled values and the distribution of expected values. Thus the chi-squared test can be used. The first step is to tabulate the differences between the actual and expected values. Each difference is squared and divided by the expected value. The sum of these values is the value of chi-squared.

                         Actual   Expected   Difference   Difference²/Expected
Treated, cured              100         91           +9                   0.89
Treated, not cured           40         49           −9                   1.65
Not treated, cured           30         39           −9                   2.08
Not treated, not cured       30         21           +9                   3.86
Total (chi-squared)                                                       8.48

There is only one degree of freedom because, with the row and column totals fixed, fixing one of the four values in the table determines the other three. From the extract of the tables of the chi-squared distribution shown in Chapter 7, we see that the value of 8.48 is significant at the 1% level. Thus our null hypothesis is rejected, and we conclude that there is strong evidence for the effectiveness of the treatment.
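The whole calculation is straightforward to reproduce in code. The Python sketch below, using the counts from the tables above, builds the expected values from the row and column totals, accumulates chi-squared, and checks the significance against the chi-squared distribution (here with the numpy and scipy libraries); it arrives at the same 8.48 with one degree of freedom.

import numpy as np
from scipy.stats import chi2

# Observed counts: rows are treated / not treated, columns are cured / not cured
observed = np.array([[100, 40],
                     [30, 30]])

row_totals = observed.sum(axis=1)    # 140 treated, 60 not treated
col_totals = observed.sum(axis=0)    # 130 cured, 70 not cured
grand_total = observed.sum()         # 200 patients in all

# Expected counts if the treatment had no effect
expected = np.outer(row_totals, col_totals) / grand_total   # [[91, 49], [39, 21]]

chi_squared = ((observed - expected) ** 2 / expected).sum()
degrees_of_freedom = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_value = chi2.sf(chi_squared, degrees_of_freedom)

print(round(chi_squared, 2), degrees_of_freedom)   # 8.48 with 1 degree of freedom
print(p_value < 0.01)                              # True: significant at the 1% level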

The example above uses a two-by-two table, having two rows and two columns. The procedure can accommodate a larger number of categories within the restriction of two variables. The variables could be, for example, color of hair and place of birth. The following table, a three-by-three, shows a possible set of data from a small sample:

[Table: observed numbers of individuals in each combination of hair color and place of birth, for the sample of 40]

If there were no relation between hair color and place of birth, we would expect the numbers simply to reflect the sizes of the various categories. Thus the table can be recast to show the expected number of individuals in each category. The expected number of brown-haired individuals born in Wales, for example, is shown as 5: a quarter of the 20 brown-haired individuals sampled, because a quarter of all the individuals, 10 out of 40, were born in Wales.

[Table: expected numbers of individuals in each combination of hair color and place of birth, calculated from the category totals]

The decision to be made is whether the two tables are significantly different. If they are not significantly different, we can conclude that there is no evidence of hair color being related to place of birth. The differences between the two tables would be attributed to random errors in the sampling. If we find a significant difference, we would conclude that there is evidence of hair color being related to place of birth, and we would examine the data further to identify which combinations of hair color and place of birth were the source of the relationship.

To establish the level of significance, chi-squared is calculated as before. The difference between each sample value and its expected value is squared and divided by the expected value. These individual contributions are added together to give the accumulated chi-squared for the whole data set. There are four degrees of freedom because, with the row and column totals fixed, fixing four of the nine values in the table determines the other five. In general, for contingency tables, the number of degrees of freedom is one less than the number of rows multiplied by one less than the number of columns. In this example the value of chi-squared is 10.0, though the calculation is not shown; and from the tables of the distribution in Chapter 7, we find that the result is significant at the 5% level.
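The same procedure extends directly to larger tables, and in practice a statistics library will do the bookkeeping. The Python sketch below uses scipy's chi2_contingency, which builds the expected counts from the marginal totals, accumulates chi-squared, and reports the (rows − 1) × (columns − 1) degrees of freedom in one call. Because the three-by-three counts are not reproduced here, the cell values are invented for illustration; only their marginal totals (20 brown-haired, 10 born in Wales, 40 in all) are taken from the text, so the chi-squared printed is not the 10.0 quoted above.

import numpy as np
from scipy.stats import chi2_contingency

# Illustrative counts only: rows are hair colors (brown first), columns are
# places of birth (Wales second). The marginal totals match the text
# (20 brown-haired, 10 born in Wales, 40 individuals), but the individual
# cell counts are invented, so the statistic below is not the text's 10.0.
observed = np.array([[8, 5, 7],
                     [6, 3, 3],
                     [2, 2, 4]])

chi_squared, p_value, dof, expected = chi2_contingency(observed)

print(dof)         # (3 - 1) * (3 - 1) = 4 degrees of freedom
print(expected)    # expected counts from the marginals, e.g. 5 brown-haired Welsh
print(round(chi_squared, 2), p_value)   # accumulated chi-squared and its significance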

Ordinal Data

You saw in Chapter 11 how you can compare two sets of rankings to decide whether there is a significant difference between them. In the example that you looked at, two judges each put seven restaurants in order of preference. The identical technique, using Spearman’s rank coefficient or a similar one, can be used to examine whether two rankings of different attributes are related. Indeed, we mentioned that these ranking techniques are in essence correlation techniques for examining possible relationships. Ranks can be allocated to data from different variables regardless of the nature of the variables, and it is this feature that makes ranking techniques so versatile.

Suppose, for example, that we suspected that our first judge in the previous example was influenced by the size of the restaurant rather than the quality of the food and the service. We could rank the restaurants in order of size and list them alongside the judge’s ranking:

[Table: the judge’s ranking of the seven restaurants alongside their ranking by size]

Spearman’s coefficient is calculated by

ρ = 1 − 6 × (sum of d²)/(n(n² − 1))

where n is the number of items that are ranked and d is the difference between the two ranks given to each item. In our example,

ρ = 1 − 6 × 18/(7 × (49 − 1)) = 0.68.

This value is compared with published tables of ρ to obtain the significance level. A selection of published values was included in Chapter 11. Our value of 0.68 with n equal to 7 does not reach the 5% significance level, so we conclude that there is no significant evidence that the judge was influenced by the size of the restaurant.
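The arithmetic is again simple enough to script. The Python sketch below (spearman_rho is just an illustrative name) applies the formula to the sum of the squared rank differences, 18, for the seven restaurants and reproduces the 0.68; given the full pairs of ranks, scipy's spearmanr function returns both the coefficient and its significance directly.

def spearman_rho(sum_d_squared, n):
    """Spearman's rank correlation coefficient for rankings without ties,
    from the sum of the squared rank differences and the number of items."""
    return 1 - 6 * sum_d_squared / (n * (n ** 2 - 1))

# Restaurant example: seven restaurants, sum of squared rank differences = 18
print(round(spearman_rho(18, 7), 2))   # 0.68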