Better Business Decisions from Data: Statistical Analysis for Professional Success (2014)
Part III. Samples
The tendency of the casual mind is to pick out or stumble upon a sample which supports or defies its prejudices, and then to make it the representative of a whole class.
The raw data may provide all the information that is required and therefore undergo no subsequent processing. In most situations, however, this will not be so. The data may be too extensive to be readily appreciated and may require summarizing. It is essential that the summarizing be done in a suitable manner so that it represents the original data in a fair way. Processing may then be required to estimate the characteristics of the population from which the data were drawn.
Chapter 6. Descriptive Data
Not Every Picture Is Worth a Thousand Words
There is not much that can be done to characterize a sample of descriptive data in comparison with the options available for numerical data. The latter has had the advantages of centuries of development of mathematics. Where possible, and usually by simply counting, descriptive data is rendered numerical. In addition, the frequent use of diagrams provides neat summaries of the data, though there are many ways in which diagrams can mislead.
Nominal data consists of items that can be placed in categories and counted, the categories having no numerical relationship to each other. Thus an employer might group the staff according to the mode of transport used to get to work, and use the total in each group to draw conclusions about the required size of the car park or bicycle shed.
In Figure 6-1(a), the populations of each of four towns are shown in the form of a bar chart. Because there is no numerical relation between the categories (towns), the bars could have been lined up in any order.
The bar chart format is useful in allowing the relative numbers in each category to be visualized: the eye is quite sensitive in spotting small differences between the heights of the bars while at the same time assimilating large differences. Bar charts are sometimes presented in a way that exaggerates the differences between the large and the small bars, as shown in Figure 6-1(b). The origin has been suppressed, giving the impression that the population of Northton is very much greater than that of the others. Suppression of the origin in this fashion is generally not acceptable and should arouse suspicion regarding the intentions behind the presentation of the statistics. A bar chart of this sort was used in advertising Quaker Oats, suggesting that eating the breakfast cereal reduced the level of cholesterol (Seife, 2010: 35-36). The diagram was withdrawn after complaints were received.
In situations where it is considered necessary to exaggerate—for instance, we might wish to ensure that it is clear that Easton has a larger population than Weston—the vertical axis, and possibly the bars, should show breaks as in Figure 6-1(c).
Figure 6-1. Three representations of the same bar chart, showing the visual effects of suppressing the origin and breaking the vertical axis
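The exaggeration produced by suppressing the origin can be put into numbers. The following is a minimal sketch: the populations and the axis starting point are hypothetical values chosen for illustration, since the actual figures behind Figure 6-1 are not given in the text, and the function name `apparent_ratio` is likewise an assumption.

```python
def apparent_ratio(taller: float, shorter: float, axis_start: float = 0.0) -> float:
    """Ratio of the drawn bar heights when the vertical axis starts at axis_start."""
    if axis_start >= shorter:
        raise ValueError("axis_start must lie below the shorter bar's value")
    return (taller - axis_start) / (shorter - axis_start)


# Hypothetical populations (not taken from Figure 6-1).
northton, easton = 12_000, 10_000

true_ratio = apparent_ratio(northton, easton)                # axis starts at zero
suppressed_ratio = apparent_ratio(northton, easton, 9_000)   # origin suppressed

print(f"Axis from 0:     Northton's bar is {true_ratio:.1f} times Easton's")
print(f"Axis from 9,000: Northton's bar is {suppressed_ratio:.1f} times Easton's")
```

With these assumed values, a real difference of 20% is drawn as a bar three times as tall once the axis starts at 9,000.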
When it is important to draw attention to the relative proportions of each of the categories, a pie chart is preferable to a bar chart. Figure 6-2(a) shows the results of an election. The impression given visually is the relative support for each political party rather than the actual number of votes received. However, it is not easy to see whether the Yellow Party or the Blue Party won the election without looking at the numbers. A bar chart, Figure 6-2(b), shows more clearly who won the election, but the impression of proportion of votes is lost.
Figure 6-2. A pie chart and a bar chart representing the same data
Diagrams that consist of two or more pie charts can be visually misleading. In Figure 6-3(a), the numbers of households in two districts are shown and divided into three categories: those that have dogs, those that have cats, and those that have neither. The area of each sector of the pie chart represents the number in each category, and the total area of each pie chart represents the total number of households in each district. Upper Dale, with 3,000 households, has a chart with 50% greater area than that for Lower Dale, which has 2,000 households. To achieve the correct area proportion, the chart for Upper Dale has a diameter only 22% greater than the Lower Dale chart. Because the eye tends to compare the diameters rather than the areas, Upper Dale appears less dominant than it should, which gives a visual bias to the distribution of pets in Lower Dale. The stacked bar chart in Figure 6-3(b) gives a fairer visual impression of the relative numbers of dogs and cats.
Figure 6-3. A pair of pie charts and a stacked bar chart representing the same data
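The 22% figure follows from the fact that the area of a circle grows with the square of its diameter. A short sketch of the arithmetic, using the household totals quoted in the text (the function name `diameter_ratio` is only illustrative):

```python
import math


def diameter_ratio(larger_total: float, smaller_total: float) -> float:
    """Diameter ratio of two pie charts whose areas are proportional to the totals."""
    return math.sqrt(larger_total / smaller_total)


upper_dale, lower_dale = 3000, 2000  # households, as quoted in the text

area_ratio = upper_dale / lower_dale
d_ratio = diameter_ratio(upper_dale, lower_dale)

print(f"Area is {100 * (area_ratio - 1):.0f}% greater")
print(f"Diameter is {100 * (d_ratio - 1):.0f}% greater")
```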
Pictograms can be even more misleading. Figure 6-4(a) shows a comparison of the number of cats in Upper Dale and Lower Dale. The vertical scale indicates the number of cats, so only the vertical height of the image of the cat is significant. However, because the taller cat is also wider, the difference between the numbers of cats appears visually to be greater than it really is. The style of pictogram shown in Figure 6-4(b) is preferable in showing no bias. Here a small image of the cat is used to represent 100 cats in each of the districts.
Figure 6-4. The use of pictograms in charts may be more or less visually misleading, as exemplified in (a) and (b), respectively
The use of three-dimensional images in pictograms can be extremely misleading. Figure 6-5 shows the output of two factories. Visually, it appears that there is not a great difference between the two. However, as the actual cubic meters for each confirm, Factory A has an output almost twice that of Factory B. The illusion occurs because although the volumes of the two cubes represent the outputs correctly, the length of the side of the cube for Factory A is only 25% greater than that for Factory B. Thus: 50 × 50 × 50 = 125,000, and 40 × 40 × 40 = 64,000.
Figure 6-5. A misleading visual comparison of the outputs of two factories
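The same effect is stronger in three dimensions, because volume grows with the cube of the side length. The sketch below simply reproduces the arithmetic from the text, using the side lengths of 50 and 40 given there:

```python
side_a, side_b = 50, 40  # cube side lengths from the text

volume_a = side_a ** 3
volume_b = side_b ** 3

side_ratio = side_a / side_b
volume_ratio = volume_a / volume_b

print(f"Volumes: {volume_a:,} and {volume_b:,}")
print(f"Side of A is {100 * (side_ratio - 1):.0f}% greater than side of B")
print(f"Output of A is {volume_ratio:.2f} times that of B")
```

A side only one quarter longer therefore corresponds to an output almost twice as large.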
When categories overlap, the data is often represented by a Venn diagram. Consider the following data. In a group of 100 students, 30 are not studying a language, 50 are studying French, and 30 are studying German. Since 70 students are studying at least one language but the French and German totals add up to 80, 10 must be studying both French and German. Figure 6-6 shows the data diagrammatically. Enclosed regions represent the different categories, but the actual sizes of the areas enclosed are not intended to represent the numbers within the categories. The intention is purely to illustrate the overlaps. It is therefore important when viewing Venn diagrams to be aware of the actual numbers and avoid visual clues from the sizes of the regions.
Figure 6-6. Venn diagram showing the numbers of students studying French and German
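The numbers in each region of a two-set Venn diagram can be worked out from the totals. The following is a sketch only, assuming (as the example implies) that French and German are the only two languages involved; the function name `venn_two_sets` is hypothetical.

```python
def venn_two_sets(total: int, neither: int, in_a: int, in_b: int) -> dict[str, int]:
    """Split a group into the four regions of a two-set Venn diagram."""
    at_least_one = total - neither
    both = in_a + in_b - at_least_one  # members counted twice in in_a + in_b
    if both < 0 or both > min(in_a, in_b):
        raise ValueError("the totals supplied are inconsistent")
    return {
        "only_a": in_a - both,
        "only_b": in_b - both,
        "both": both,
        "neither": neither,
    }


# Totals from the text: 100 students, 30 with no language, 50 French, 30 German.
regions = venn_two_sets(total=100, neither=30, in_a=50, in_b=30)
print(regions)
```

Here `only_a` is the number studying French alone and `only_b` the number studying German alone; the four regions together account for all 100 students.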
Venn diagrams are useful in visualizing conditional probability (Chapter 3). Suppose we choose a student randomly from those shown in Figure 6-6 but specify the condition that the student studies French. The only students of interest to us are those in the left ellipse, 50 in total. If we ask what the probability is that the student studies German, we see from the overlap region that 10 students would meet the requirement. So the probability is 10/50 = 0.2. If, on the other hand, we specify the condition that the student studies German and ask for the probability that the student studies French, we are concerned only with the right ellipse. The probability is thus 10/30, or approximately 0.33.
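The two conditional probabilities can be checked with exact fractions. This is a minimal sketch using the counts from Figure 6-6 as described in the text; the variable names are illustrative.

```python
from fractions import Fraction

french, german, both = 50, 30, 10  # counts described for Figure 6-6

# Condition on French: only the 50 French students are considered.
p_german_given_french = Fraction(both, french)

# Condition on German: only the 30 German students are considered.
p_french_given_german = Fraction(both, german)

print(f"P(German | French) = {p_german_given_french} = {float(p_german_given_french):.2f}")
print(f"P(French | German) = {p_french_given_german} = {float(p_french_given_german):.2f}")
```

The overlap of 10 students is the same in both cases; only the group it is divided by changes, which is why the two probabilities differ.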
For ordinal data, although pie charts can be used, bar charts have the advantage of allowing the categories to be lined up in logical order. Figure 6-7 shows the number of medals won by a sports club in the form of a bar chart.
Figure 6-7. Bar chart showing the numbers of medals won by a sports club
Nominal data can be rendered numerical insofar as the numbers in each group can be expressed as proportions or percentages of the total. Thus the data in Figure 6-2 yield the following proportions:
Use of proportions or percentages is often adopted to disguise the fact that the numbers involved are very small. It may sound impressive to be told that 12% of the staff of a local company are still working full time at the age of 70 years, but less so when you learn that the number represents just one person.
Ordinal data can be represented as proportions or percentages, as with nominal data. Thus, sales of shirts could be reported as 30% small size, 50% medium, and 20% large.
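One way to guard against percentages that disguise small numbers is to report each percentage together with the count behind it. The sketch below assumes hypothetical shirt sales of 30, 50, and 20 units, chosen only so that the percentages match those quoted above; the function name `summarize` is also an assumption.

```python
def summarize(counts: dict[str, int]) -> dict[str, tuple[int, float]]:
    """Return each category's count together with its percentage of the total."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no observations to summarize")
    return {category: (n, 100 * n / total) for category, n in counts.items()}


# Hypothetical counts; the text gives only the percentages.
shirt_sales = {"small": 30, "medium": 50, "large": 20}

summary = summarize(shirt_sales)
for size, (n, pct) in summary.items():
    print(f"{size}: {pct:.0f}% (n = {n})")
```

Because the dictionary keeps the categories in the order supplied, the logical ordering of ordinal data (small, medium, large) is preserved in the output.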