Better Business Decisions from Data: Statistical Analysis for Professional Success (2014)
Part V. Relationships
Chapter 14. Relationships with Numerical Data
Straight Lines, Curved Lines, and Wiggly Lines
It is frequently required to compare two or more sets of data to decide whether they are related in some way. Some quantities are related because we have defined them to be so. Kilometers are related to miles in a precise way and the relationship can be expressed as a formula:
kilometers = miles × 1.609.
Dollars are related to pounds sterling by a precise rate of exchange, which may vary from day to day and place to place, but which is nevertheless precise for a particular transaction. Generally, however, we are dealing with quantities which may show some relationship but rarely a precise relationship.
Scientific investigations under closely controlled laboratory conditions probably come closest to precise relationships, but even here there are small errors involved in making measurements which give uncertainties in the established relationships. At the other end of the scale we may be looking, for example, for a relationship between the way people vote in an election and how their parents vote. Here, it is likely that any relationship is uncertain, and the role of statistical analysis is to quantify the uncertainties.
When a relationship between two variables is sought, a distinction is made between the independent variable and the dependent variable. In Figure 14-1, the relationship between sales of ice cream and daily noon temperature is shown as a line graph. The temperature is the independent variable and sales is the dependent variable, sales depending on the temperature and not the other way round. Line graphs are commonly used, as here, to show relationships, and there is a convention in plotting them with regard to the choice of the quantities to be located on the two axes. The horizontal axis is used for the independent variable and the vertical axis for the dependent variable. Sometimes it is not clear which is which, both variables being dependent on other factors. We may have a choice as to which we treat as the dependent variable and which we treat as the independent variable. If we measure the temperature and humidity at a location at noon each day and plot temperature against humidity, the choice of axes for the two variables will be arbitrary.
Figure 14-1. Graph of ice cream sales at various daily temperatures, illustrating the difference between the dependent and independent variables
The relation between two variables is the easiest situation to deal with; the difficulties increase rapidly as further variables are introduced. These difficulties are not only in the analysis but also in the decreasing reliability of the conclusions that can be derived.
The raw data that have been collected may allow many different explorations of relationship. If people of different ages are sampled, or if products of different categories are involved, the number of possible pairs of variables can be large. There is a danger that the investigators, rather than deciding at the outset what comparisons are to be examined, will compare everything possible with everything else. The result can be completely unreliable. If, for example, a significance level of 5% is adopted, then about 1 in 20 comparisons can be expected to reach this level of significance purely by chance, even when no real relationship exists. Since the statistical tests to be described can now be rapidly carried out by computer programs, the temptation to search for any possible relationship is very great. When the tests had previously to be carried out by hand, time simply did not permit a far-reaching search for any evidence of relationship, however unlikely.
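The arithmetic behind this warning can be sketched in a few lines. The function name below is hypothetical, and the calculation assumes that the comparisons are independent of one another, which real data rarely guarantee, so the figures are indicative only.

```python
def chance_of_spurious_result(n_comparisons, level=0.05):
    """Probability that at least one of n independent comparisons between
    unrelated variables appears significant at the given level by chance."""
    return 1 - (1 - level) ** n_comparisons


# How the risk grows as more pairs of variables are compared
for n in (1, 5, 20, 50):
    print(n, "comparisons:", round(chance_of_spurious_result(n), 2))
```

With 20 comparisons at the 5% level, the chance of at least one spurious "significant" result is about 64%, which is why deciding the comparisons in advance matters.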
It does seem that we are being overwhelmed by claimed associations at the present time. The media are full of statistical correlations relating to what we think, what we do, what we eat and drink, what we should eat and drink, and so on. I wonder—cynically, I suppose—whether some manufacturers sponsor investigations to seek relationships between their product’s characteristics and just about anything else that might boost the desirability of their product.
In viewing results obtained by others, it is not possible to know how many different pairings of variables were or were not examined. If the raw data are available, or if details of the sampling are known, suspicions may be raised if it appears that the results reported are particularly selective. If the reported results refer to cabbages only, yet a range of vegetables was included in the sampling, some explanation would be called for.
I need to point out that what I have said above applies strictly to relations between pairs of variables. It does not apply to investigations in which the aim is to study simultaneously the effect of several different variables. Such investigations are perfectly proper and will be considered in Chapter 16.
If two quantities are related precisely, the relationship can be represented by a line graph; and if the line is straight, the relationship is said to be linear. The line may pass through the origin of the graph, indicating that the two quantities are proportional to each other. Thus a graph of dollars plotted against pounds sterling, illustrating a rate of exchange, is a straight line graph passing through the origin (Figure 14-2). The formula describing the graph is
pounds sterling = R × US dollars,
R being the rate of exchange.
Figure 14-2. A straight-line conversion graph
Some linear relationships have straight lines that do not pass through the origin. For example, the cost of shipping goods to a particular destination might be $2 per kilogram plus $60. The formula has the form,
Cost ($) = 2 × weight (kg) + 60.
The graphs in Figure 14-3 show that as one of the quantities increases, so does the other. This is called positive correlation. Negative correlation describes relationships in which one quantity decreases as the other increases.
Figure 14-3. A graph of the cost of shipping goods of different weights, presented in several ways to illustrate the visual effects of changing the scale and suppressing the origin
Figure 14-3 illustrates also how the relationship between the variables can be made to appear different by changes of the scale used for plotting the graphs and by suppression of the origin.
When we are dealing with variables that are not precisely related, an initial examination of the data involves plotting a scatter graph. The individual data points are plotted on a graph whose axes represent the two variables involved. By eye, it may be possible to see a rising or falling trend indicating positive or negative correlation. A useful technique, illustrated in Figure 14-4, is to draw a horizontal line positioned so that half the data points are above the line and half are below. A vertical line is then drawn so that half the points are to the left and half are to the right. A count of the points in each quadrant suggests correlation by any appreciable excess in either of the diagonally linked pairs of quadrants.
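The quadrant-counting technique can be sketched as follows. The median of each variable gives the dividing lines, since half the points lie on either side of a median. The data and the function name are hypothetical, and points falling exactly on a dividing line are simply left out of the count, which is one possible convention.

```python
from statistics import median


def quadrant_counts(xs, ys):
    """Count the points in each quadrant formed by the median lines."""
    mx, my = median(xs), median(ys)
    counts = {"upper_right": 0, "upper_left": 0,
              "lower_left": 0, "lower_right": 0}
    for x, y in zip(xs, ys):
        if x == mx or y == my:
            continue  # a point on a dividing line belongs to no quadrant
        if x > mx:
            counts["upper_right" if y > my else "lower_right"] += 1
        else:
            counts["upper_left" if y > my else "lower_left"] += 1
    return counts


# Hypothetical data with a rising trend
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2, 1, 4, 3, 6, 5, 8, 7]
print(quadrant_counts(xs, ys))
```

An excess in the lower-left and upper-right quadrants suggests positive correlation; an excess in the upper-left and lower-right quadrants suggests negative correlation.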
Figure 14-4. Examples of scatter graphs used to explore the existence of correlation between two variables
If there is evidence of correlation, a best-fit straight line can be located by eye. A transparent ruler allows the line to be located so that an equal or nearly equal number of points lie each side of the line and so that the distances of the points from the line are minimized. An improvement to the procedure involves calculating the mean value of each of the two quantities, plotting the values as a point on the graph, and ensuring that the line passes through it.
The gradient of the best-fit line—the steepness or slope, in other words—is the extent the line goes upward divided by the extent it moves to the right. Note that the gradient can mistakenly be taken as a measure of the correlation between the two variables, a steep line appearing to suggest strong correlation. The numerical value of the gradient is, in fact, arbitrary in depending on the units used in measuring the variables. For example, a formula for the time to cook a chicken might be
Time (minutes) = 45 × Weight (kg) + 30,
and the gradient of the graph is 45. If hours are used, the equation is
Time (hours) = 0.75 × Weight (kg) + 0.5,
and the gradient is 0.75. The extent of correlation between the two variables depends on the closeness of the points to the line, regardless of the line’s gradient—provided that there is a gradient. Clearly, if there is no gradient, one of the variables cannot influence the other and there is zero correlation. A glance back at Figure 14-3 confirms that the gradient can be made to look large or small by changes of scale and can therefore misrepresent the extent of correlation between the two variables.
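A short sketch makes the point concrete: changing the units changes the gradient but leaves the correlation coefficient untouched. The cooking times below are hypothetical, made up to scatter slightly about a straight line, and the helper function is an illustration rather than a standard routine.

```python
from math import sqrt


def slope_and_r(xs, ys):
    """Return the gradient of the regression of y on x and the
    product-moment correlation coefficient."""
    n = len(xs)
    xm, ym = sum(xs) / n, sum(ys) / n
    sxx = sum((x - xm) ** 2 for x in xs)
    syy = sum((y - ym) ** 2 for y in ys)
    sxy = sum((x - xm) * (y - ym) for x, y in zip(xs, ys))
    return sxy / sxx, sxy / sqrt(sxx * syy)


weights_kg = [1.0, 1.5, 2.0, 2.5, 3.0]
minutes = [78, 95, 122, 140, 168]      # hypothetical cooking times
hours = [m / 60 for m in minutes]      # the same times in hours

gradient_min, r_min = slope_and_r(weights_kg, minutes)
gradient_hr, r_hr = slope_and_r(weights_kg, hours)
print("gradient in minutes per kg:", gradient_min, "r:", r_min)
print("gradient in hours per kg:  ", gradient_hr, "r:", r_hr)
```

The two gradients differ by a factor of 60, while the two values of r agree.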
The position of the best-fit line can be determined by a statistical procedure called linear regression, which we now need to look at. The word regression is used here with the meaning of estimation, in that the line will be used to estimate the value of one variable from the value of the other.
Suppose we wish to know how fast a particular type of tree grows. We obtain data showing the height of a representative tree as measured each year up to five years old. The points are plotted in Figure 14-5 and the values are as follows:
Figure 14-5. A graph of the height of a tree at different ages with its calculated simple linear regression lines
The difference between each value of x and the mean value of x is shown, together with the square of each difference. The y values are treated similarly. The product of each x difference and the corresponding y difference is included.
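The equation of the best-fit line is given by the formula
y – ym = (x – xm) Sxy / Sxx,
where xm and ym are the mean values of x and y, Sxx is the sum of the squared x differences, and Sxy is the sum of the products of the x and y differences. Inserting the values from above and rearranging gives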
y = 0.21x + 0.07.
The line, which is included in Figure 14-5, passes through the point located at the mean value of x and the mean value of y, and this will always be found to be so. The line is best-fit in that the squares of the deviations of the measured y values from the values predicted by the graph are minimized.
The observant reader will have noticed that although the value of Sxx appears in the equation, Syy does not. This is because there are in reality two best-fit lines, the second one having a similar equation except for the replacement of Sxx by Syy and the transposing of x and y. How could there be two best-fit lines? The reason is that it depends on how the graph is to be used. The line we have just calculated is called the regression of y on x and it is designed to give the best estimate of the y value when the x value is given. Thus, if we know the age of our tree, we can use the graph to estimate its height.
But we may want to use the graph to be able to estimate the age of a tree when we measure its height, which is a different process.
The requirement then is for the line representing the regression of x on y. The equation is
x – xm = (y – ym) Sxy / Syy,
which, when rearranged, gives
y = 0.26x – 0.07.
This second regression line is included in Figure 14-5. The line again passes through the point representing the mean values of x and y, but it has a somewhat different gradient compared to the previous line. The two lines are similar in this example; and, generally speaking, the greater the correlation between the two variables, the closer the two lines will be. Indeed, if there is perfect correlation, a conversion graph for example, there can be only a single line.
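The two regression lines can be computed side by side. The sketch below uses hypothetical ages and heights, not the values from the example above, and the function name and dictionary keys are illustrative. Both lines are expressed as y against x so that their gradients can be compared directly.

```python
from math import sqrt


def regression_lines(xs, ys):
    """Return both regression lines in the form y = a + b*x,
    together with the correlation coefficient r."""
    n = len(xs)
    xm, ym = sum(xs) / n, sum(ys) / n
    sxx = sum((x - xm) ** 2 for x in xs)
    syy = sum((y - ym) ** 2 for y in ys)
    sxy = sum((x - xm) * (y - ym) for x, y in zip(xs, ys))
    b_y_on_x = sxy / sxx   # from y - ym = (x - xm) * Sxy / Sxx
    b_x_on_y = syy / sxy   # x - xm = (y - ym) * Sxy / Syy, rearranged
    return {
        "b_y_on_x": b_y_on_x, "a_y_on_x": ym - b_y_on_x * xm,
        "b_x_on_y": b_x_on_y, "a_x_on_y": ym - b_x_on_y * xm,
        "r": sxy / sqrt(sxx * syy),
    }


ages = [1, 2, 3, 4, 5]                 # years (hypothetical)
heights = [0.3, 0.4, 0.8, 0.8, 1.2]    # meters (hypothetical)
lines = regression_lines(ages, heights)
print(lines)
# The ratio of the two gradients equals r squared, so the lines
# coincide only when the correlation is perfect.
print(lines["b_y_on_x"] / lines["b_x_on_y"], lines["r"] ** 2)
```

The intercepts are chosen so that each line passes through the point of mean values, and the closer r is to 1 or –1, the closer the two gradients become.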
The above example was chosen to illustrate the usefulness of and the difference between the two regression lines. Often, however, it is sensible to use the graph in one direction only: this gives rise to the distinction between the independent variable and the dependent variable, which has been previously described. If we are free to fix the values of one of the variables, then this variable is the independent variable. The other is the dependent variable because its values depend on the fixing of the values of the independent variable. Relationships are commonly used to estimate the value of the dependent variable, and only one regression line is then required.
Sometimes it is known at the outset that the regression line must pass through the origin because, when one of the variables is zero, the other must be zero. The equation is now somewhat simpler, but we need a few additional calculations, as follows.
The equation is
y = (S(xy)/S(x²))x,
where S(xy) is the sum of the products xy and S(x²) is the sum of the squares of the x values. For the tree data this gives
y = 0.23x
if we take x as the independent variable. In other words, we are estimating the height of the tree, y, from its age, x. If y is considered to be the independent variable, to allow estimates of age from a known height, then the equation is
x = (S(xy)/S(y²))y,
which, when rearranged, gives
y = 0.24x.
We might argue that the height of our tree is zero, or very nearly so, when its age is zero and that, therefore, these would be the preferred equations. However, taking a more practical view, we would say that our graphs are for use on trees that have achieved sufficient height to be considered trees; and, furthermore, the rate of growth might be very different when the tree is little more than a seedling and should not be allowed to influence the correlation within the range of practical use. In this case we would use the graphs of Figure 14-5.
Appropriate analysis yields regression lines, but the question remains as to how meaningful the correlation is. A correlation coefficient, r, can be readily calculated at the same time as the regression lines are being determined. The full name of the coefficient is the product-moment correlation coefficient, but it is often referred to as Pearson’s coefficient. The coefficient has the property of always having a value between +1 and –1. A value of +1 indicates perfect positive correlation: all the plotted points lie exactly on the straight line and the line has a rising slope. A value of –1 indicates perfect negative correlation: the points again lie exactly on the line but the slope is descending. A value of 0 indicates no correlation, the plotted points being randomly scattered. A degree of judgment is generally necessary in interpreting the value obtained. A value of around 0.5 indicates some correlation but anything below about 0.4 would raise serious doubts. The equation for r is r = Sxy / √(Sxx × Syy), where Syy is the sum of the squared y differences.
In the tree example above, the data give r = 0.90.
The value of r² can be used to indicate the usefulness of the correlation. With r equal to 0.9, r² is 0.81 and shows that 81% of the variation in the dependent variable is accounted for by the variation in the independent variable. Thus 19% of the variation is due to other factors.
The correlation that has been established relates strictly to the data in this particular sample, whereas we would want to use the correlation in studying other samples of similar trees. In order to justify the use of the correlation to represent the population from which the sample was obtained, it is necessary to determine the significance of the result. This can be done by using tables of critical values for the product-moment correlation coefficient. A selection of values is shown below.
Our value of 0.9 can be seen to be significant at the 5% level for both a one-tail and a two-tail test. If at the outset we were investigating whether there was a significant correlation between tree height and age—in other words, whether the true gradient of the graph was not zero—then we would apply the two-tail test. If we were investigating whether there was a positive correlation between tree height and age—in other words, whether the gradient was greater than zero—we would apply the one-tail test: the second tail corresponding to a negative correlation which would clearly be impossible in our tree example. The point was made previously that statistical tests are designed to establish the significance of hypotheses which are clearly defined at the outset.
It may seem odd that the testing for significance relies on comparing the gradient of the graph with the value of zero. We might have a gradient of 28, say, on one occasion and a gradient of only 0.28 on another. The first value is much further from zero than the second. However, as mentioned previously, the numerical value of the gradient is arbitrary because it depends on the units being used. The criterion for significant correlation is the likelihood of there being a gradient—that is to say, we are looking at the probability of the gradient having any nonzero value.
Confidence intervals can be obtained and represented by bands on either side of the regression line. These indicate the reliability, on average, of predictions made from the line. Somewhat similar are prediction intervals with wider bands, showing the reliability of individual predictions at different positions along the graph.
In some investigations, it may be known that individual points on the graph have different degrees of reliability. Some may be the mean values from large samples and others from small samples. Some may be from more accurate measurements than others. In such instances, each point may be shown with an error bar indicating the individual reliability. A vertical error bar is centered on the plotted point, the length of the bar indicating the confidence limits of the dependent variable. If the independent variable is subject to some uncertainty, there may be a similar horizontal bar centered on the point.
Note that predictions from regression lines are valid only within the range of the values represented. It is not safe to extrapolate a regression line to obtain values outside this range. Effort put into obtaining regression lines in order to extrapolate the data can give dangerously misleading results.
Numerical data can be treated in a ranking procedure as an alternative to obtaining regression lines. I described the method in Chapter 11. Each set of data is arranged in order and given rank numbers of 1 upward. The method is rapid compared with treating the numerical data as we did in the example of linear regression, but the main advantage arises when the data contain extreme values, usually a result of the data not being normally distributed. Samples of salaries, for example, generally contain some very high values that will greatly influence a numerical correlation based on fitting a straight line. When the data are ranked in order of size, there are no extreme values. Note, however, that ranking tests are non-parametric: they do not assume any particular distribution of the data and are not as powerful as parametric tests. Also, although the ranking provides a measure of the extent of correlation, it does not give information regarding the way the two variables are related, other than showing whether the correlation is positive or negative.
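The effect of an extreme value can be illustrated by comparing the product-moment coefficient with a rank-based one. The sketch below uses Spearman's approach (the product-moment coefficient applied to the ranks), which is one common ranking method and is assumed here rather than taken from Chapter 11. The salary figures and function names are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Product-moment correlation coefficient."""
    n = len(xs)
    xm, ym = sum(xs) / n, sum(ys) / n
    sxx = sum((x - xm) ** 2 for x in xs)
    syy = sum((y - ym) ** 2 for y in ys)
    sxy = sum((x - xm) * (y - ym) for x, y in zip(xs, ys))
    return sxy / sqrt(sxx * syy)


def ranks(values):
    """Rank 1 for the smallest value; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


years_experience = [1, 2, 3, 4, 5, 6, 7, 8]
salary = [20, 22, 21, 25, 27, 26, 30, 250]   # $1000s, one extreme value

print("product-moment:", pearson(years_experience, salary))
print("rank-based:    ", pearson(ranks(years_experience), ranks(salary)))
```

The single very high salary drags the product-moment coefficient down, whereas in the ranked data it is merely the eighth of eight values and the rank coefficient stays high.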
When the data are plotted, there may be evidence of a curved rather than a linear relationship. One way of dealing with the situation is to transform the data in order to achieve linearity. The following values show the growth in population of a town over a number of years:
The graph, shown in Figure 14-6(a), is curved and the ever-steepening shape as time increases suggests that taking the square root of each value of population would produce a straighter line.
Figure 14-6(b) shows the re-plotted data and it can be seen that the graph is approximately linear. Examination of significant correlation could be carried out as in the previous section.
Figure 14-6. Graphs of the population growth of a town showing (a) the raw data and (b) the data transformed by plotting the square root of the population
Data can be transformed by applying any mathematical procedure. Commonly used transformations employ squaring, square rooting, cubing, cube rooting, taking the logarithm of one of the variables, and taking the logarithm of both variables.
In scientific work when a law relating the two variables is sought, transformations that are successful can suggest the physical processes underlying the law. To illustrate this, we can consider a well-known law relating the distance, R, of a planet from the Sun and the time, T, it takes to go once round the Sun. If we plot the two variables as shown in Figure 14-7(a), we get a curve. If we transform the variables by plotting the cube root of T against the square root of R, we get a straight line passing through the origin, Figure 14-7(b). (This is equivalent to showing that T² is proportional to R³, which is the way the law is usually expressed. However, a plot of T² against R³ has to be unacceptably large to accommodate the very wide range of the data.) We can then apply linear regression to locate the best line and use it to predict the path of any new minor planet that might be discovered. In reality, of course, the law relating T and R is well known (though not quite as simple as suggested here, because the orbits are elliptical and not perfectly circular) and the features of the orbit of any planet can be accurately calculated.
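The transformation can be tried numerically. The sketch below uses rounded, approximate values for the planets' mean distances (in astronomical units) and orbital periods (in years); these are not taken from the figure. It fits a line through the origin to the transformed data using the gradient S(xy)/S(x²).

```python
# Approximate values: distance in astronomical units, period in years
planets = {
    "Mercury": (0.387, 0.241), "Venus": (0.723, 0.615),
    "Earth": (1.000, 1.000), "Mars": (1.524, 1.881),
    "Jupiter": (5.203, 11.86), "Saturn": (9.537, 29.46),
    "Uranus": (19.19, 84.01), "Neptune": (30.07, 164.8),
}

# Transform: x is the square root of R, y is the cube root of T
x = [r ** 0.5 for r, _ in planets.values()]
y = [t ** (1 / 3) for _, t in planets.values()]

# Gradient of the regression line constrained to pass through the origin
gradient = sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)
print("gradient:", gradient)

# How far each transformed point lies from the fitted line
for name, a, b in zip(planets, x, y):
    print(name, round(b - gradient * a, 4))
```

In these units the gradient comes out very close to 1 and the deviations from the line are tiny, which is the straight-line behavior described above.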
Figure 14-7. Graphs of a planet’s length of year in relation to its distance from the Sun, showing (a) the raw data and (b) the transformed data
Sometimes, however, red herrings crop up. The Titius–Bode law was based on an apparent relationship between the sequence of the planets from the Sun and their distances from the Sun. Figure 14-8(a) shows a plot with the numerical sequence on the x-axis and the distances from the Sun on the y-axis. Note that Neptune had not been discovered when the law was proposed and that Ceres (a prominent asteroid) was considered to be a planet. The graph is curved, and a transformation looks useful. If we transform by taking the logarithm of the distance and re-plotting, we get, with the exclusion of Neptune, a linear relationship, as shown in Figure 14-8(b). The correlation is good: the correlation coefficient, r, has a value of 0.995. When Neptune was discovered, it was found to depart drastically from the supposed relationship. Nowadays, the Titius–Bode law is considered to be no more than a curious coincidence or, at best, the result of several factors that combine to give an apparent simple connection.
Figure 14-8. Graphs of a planet’s distance from the Sun in relation to its numerical sequence from the Sun (Titius–Bode law), showing (a) the raw data and (b) the transformed data
Achieving a linear relationship by use of a transformation is thus a useful and straightforward technique. It does suffer from the problem that minimizing errors in order to get the best fit is itself affected by the transformation. In other words, the best-fit line represents the best fit for the transformed variables but not necessarily the best fit for the variables themselves.
It should be noted that it is always possible to find the equation of a line that will pass through any distribution of points. An equation of the form
y = a + bx,
where a and b are constants, is always a straight line. An equation of the form
y = a + bx + cx²
gives a curve which turns once. An equation of the form
y = a + bx + cx² + dx³
gives a curve which turns twice, and so on. Such equations are called polynomials, and the fitting of such equations is referred to as polynomial regression. If we search for a polynomial equation with no restriction on length, we will always be able to obtain a curve that passes through all our experimental points. Clearly, this becomes a useless exercise: the final equation will have no meaning. We might just as well have drawn by hand a wiggly curve passing through all our points. Common sense dictates the extent to which it is reasonable to proceed down this path.
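The claim that a long enough polynomial will pass through every point can be checked directly. The sketch below builds such a polynomial in Lagrange form, which is one standard construction chosen here for brevity. The data points are hypothetical, and the x values must all be different.

```python
def interpolating_polynomial(points):
    """Return a function p(x): the polynomial of degree len(points) - 1
    that passes exactly through every point (Lagrange form)."""
    def p(x):
        total = 0.0
        for i, (xi, yi) in enumerate(points):
            term = yi
            for j, (xj, _) in enumerate(points):
                if i != j:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return p


# Hypothetical scattered measurements with no real trend
points = [(1, 3.0), (2, 1.0), (3, 4.0), (4, 1.5), (5, 5.0), (6, 2.0)]
p = interpolating_polynomial(points)

for x, y in points:
    print(x, y, round(p(x), 6))   # the curve reproduces every point

# Between and beyond the points the curve is free to swing anywhere
print(p(5.5), p(7))
```

The fit is perfect at the measured points, yet the values in between and beyond them carry no meaning, which is the sense in which the exercise is useless.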
It is evident that there is not a unique best-fit line if we are to allow unrestricted curving of the line: it is always necessary to decide what is to be accepted in terms of the shape of the line or the form of the equation describing the line. Numerous computer packages are available for nonlinear regression. They are essentially trial-and-error procedures, and hence computer-intensive, progressing by iteration to a fit that meets the acceptable criteria and minimizes the errors of the experimental points in comparison with predictions from the line. This is, of course, what we saw with regard to linear correlation, where a straight line was the acceptable criterion and the mathematics minimized the errors of the individual points, though without the need for lengthy iterations.
Two variables may not be related in any apparent, or even predictable, way but may nevertheless be related. Commonly, one of the variables is time. Many things vary with time: indeed, most things do. In the world of business and commerce, a great deal of attention is paid to how various quantities are changing. We wish to see how our profits are rising month by month or year by year. Or we look at the change in the stock market figures each morning in the newspaper. Such data are characterized by their to-and-fro variability, and because of this it is possible to draw numerous conclusions, some of which will be favorable from the point of view of the presenter and others of which will be unfavorable. Figure 14-9(a) shows the variation of the FTSE 100 financial index of shares from its inception in 1984. There is clearly a marked degree of positive correlation, but a search for a quantifiable correlation would be rather pointless.
In Chapter 6, I warned about suppressing the origin when presenting bar charts. The same warning applies to line graphs: the result can be extremely misleading, particularly when the origin is suppressed on the vertical axis, that is, on the axis of the dependent variable. We must bear in mind, however, that sometimes, particularly with regard to graphs showing changes with time, we must suppress the origin. Indeed, when did time start? The time axis clearly can start at any convenient point, and the vertical axis may have to start distant from zero. The graph in Figure 14-9(a) has a true origin, since the FTSE index started in 1984 with a value of 1000, and the graph is useful in showing the historical changes. But if you purchased shares within the past few weeks, you would be more interested in a graph such as Figure 14-9(b), which necessarily has its origin suppressed. A break is shown in the vertical axis, the index value; but to break the time axis would be pedantic in view of what has been said.
Figure 14-9. Graphs of the movements of the UK FTSE 100 index showing (a) the inclusion of the origin on the vertical axis and (b) an acceptable presentation of the suppression of the origin
Similarly, financial data relating to companies may be of interest only over the recent past and suppressing of the origin of graphs may be justified. The justification, however, can provide latitude within which misleading impressions may be given.
The table of figures below shows the monthly profits over a two-year period for a small company. For simplicity, the figures are shown as small numbers in units of $1000.
The data are shown as a line graph in Figure 14-10(a). The ups and downs provide opportunities for the company to present optimistic views from time to time and also for critics to present less favorable commentaries.
To present the data in a way that smooths out the fluctuations, a moving average can be used. This is particularly useful when it is recognized that there could be cyclic variations in the data—seasonal variations, for example.
The average employed can be the mean or the median. We will use a three-month moving average based on mean values. That is to say, we will average the values for the first three months, Jan to Mar, 2008; and then, moving along by one month we will average the values for Feb to Apr, 2008. The next average is for Mar to May, 2008, and so on. The results are shown plotted in Figure 14-10(b). The graph is now smoother, showing a gentle rise with time. The product-moment correlation coefficient is 0.88, compared to 0.70 for the original graph. A graph of the six-month moving average, shown in Figure 14-10(c), fluctuates even less, and the correlation coefficient has increased to 0.99.
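The moving-average calculation can be sketched as follows. The monthly profit figures are hypothetical stand-ins for the table above, and the function name is illustrative. Each average is the mean of a window of consecutive months, and the window moves along one month at a time.

```python
def moving_average(values, window):
    """Mean of each run of `window` consecutive values."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


# Hypothetical monthly profits over two years, in units of $1000
profits = [4, 6, 5, 8, 6, 5, 9, 7, 8, 6, 10, 9,
           8, 11, 9, 12, 10, 9, 13, 11, 12, 14, 12, 15]

three_month = moving_average(profits, 3)
six_month = moving_average(profits, 6)

print(len(three_month), [round(v, 1) for v in three_month])
print(len(six_month), [round(v, 1) for v in six_month])
```

A window of three months turns 24 monthly figures into 22 averages, and a window of six months into 19; the longer the window, the smoother the result and the fewer points remain.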
Figure 14-10. Graphs of the growth of profits of a small company showing (a) the raw data, (b) the three-month moving average, and (c) the six-month moving average
Figure 14-11 shows the data of Figure 14-10(c) with the origin suppressed, the vertical scale extended, and no break in the vertical axis. It can be seen that the effect is to suggest that there has been an improved growth in profits. Also, it becomes apparent that the omission of the origin on the vertical axis (the dependent variable) is more misleading than omission on the horizontal axis (the independent variable).
Figure 14-11. Data from Figure 14-10 (c) with the origin suppressed and the scale changed
Time is always plotted as the independent variable, and, as I previously pointed out, it is not feasible to show a true origin. Some other variables present the same kind of problem. Temperature is often the independent variable; the true zero, which is about –273°C, is never shown except in scientific publications relating to extremely low temperatures. In Figure 14-1, the temperature axis was shown with the origin suppressed and without a break in the axis. Note that 0°C and 0°F are not true zeroes: 20°C is not twice as hot as 10°C. Converting the two temperatures to Fahrenheit (68°F and 50°F, respectively) shows that the apparent doubling is not meaningful.
John and his wife Kate had a small business operating from market stalls in nearby towns. They visited each town once a week on the same day of the week. They sold a range of household essentials, such as kitchen and bathroom cleaning products, soaps, polishes, dusters, and brushes.
Although their overhead was low, they still had difficulty competing on price with the large supermarkets. They considered offering reduced prices for multiple purchases, as the supermarkets did, but were unsure whether this would lead to an increase in profit.
Kate asked her brother, Ted, for advice. He had some business experience and also had some knowledge of statistics.
Ted suggested an experiment. The goods would be sold on the basis of a percentage reduction when two of the same items were purchased. The purpose of the experiment was to find the optimum percentage reduction to apply. If the reduction was too low, say 10%, it would make little difference to sales or profit. If it was too high, 80% say, it would eat into the existing profit margin so much that increased sales would not compensate. Somewhere in between would be an optimum.
Ted suggested that John and Kate should start with a 10% reduction for two weeks and increase the reduction in 5% steps every two weeks up to a maximum reduction of 75%. The profit each day for each two-week period would be recorded.
The experiment was undertaken and the results passed to Ted for analysis. He first plotted a scatter graph of profit against percentage price reduction. He was not surprised to see that a best straight line would not be of any use: it would be approximately horizontal. However, he was pleased to see that there was indication of a profit increase in the central region of the graph. The task was to identify where the peak occurred. He made use of a statistical package to fit a low-order polynomial to the data and found that the peak value of profit was located at about 35% price reduction. John and Kate adopted a “one third off for a purchase of two of the same” practice and were pleased to enjoy a 3% increase in profits.
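Ted's analysis might have looked something like the sketch below. It assumes that the "low-order polynomial" was a quadratic, which the account does not state, and the profit figures are hypothetical. The fit is a least-squares one obtained by solving the normal equations, and the peak of the fitted curve lies where its gradient is zero, at x = –b/(2c).

```python
def fit_quadratic(xs, ys):
    """Least-squares fit of y = a + b*x + c*x**2 via the normal equations."""
    s = [sum(x ** k for x in xs) for k in range(5)]            # sums of x^0 to x^4
    t = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]
    m = [[s[0], s[1], s[2], t[0]],
         [s[1], s[2], s[3], t[1]],
         [s[2], s[3], s[4], t[2]]]
    # Gauss-Jordan elimination on the augmented matrix
    for col in range(3):
        pivot = max(range(col, 3), key=lambda row: abs(m[row][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(3):
            if row != col:
                factor = m[row][col] / m[col][col]
                m[row] = [u - factor * v for u, v in zip(m[row], m[col])]
    a, b, c = (m[i][3] / m[i][i] for i in range(3))
    return a, b, c


# Percentage reductions from 10% to 75% in 5% steps
reductions = list(range(10, 80, 5))
# Hypothetical average daily profit ($) for each two-week period
profit = [100, 104, 109, 111, 114, 113, 112, 110, 105, 101, 94, 88, 79, 70]

a, b, c = fit_quadratic(reductions, profit)
peak = -b / (2 * c)   # a maximum only if c is negative
print("fitted curve peaks at a reduction of about", round(peak), "%")
```

A fitted peak is only as reliable as the scatter in the data allows, so in practice the estimate would be rounded to a convenient offer, as John and Kate did.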
Ted pointed out that more could be done. In the original experiment, other variables had not been separated out. It would be possible to experiment further with a range of price reductions applied to different products and to the different towns that the couple traded in. This was just the beginning of a new marketing strategy.