Bubble, Box Plot, and Error Bar Charts - Learning Highcharts 4 (2015)

Chapter 7. Bubble, Box Plot, and Error Bar Charts

In this chapter, we will explore bubble charts by first studying how the bubble size is determined from various series options, and then familiarizing ourselves with it by replicating a real-life chart. After that, we will study the structure of the box plot chart and discuss an article converting an over-populated spider chart into a box plot. We will use that as an exercise to familiarize ourselves with box plot charts. Finally, we will move on to the error bar series to understand its structure and apply the series to some statistical data.

This chapter assumes that you have some basic knowledge of statistics, such as mean, percentile, standard error, and standard deviation. For readers needing to revise these topics, there are plenty of online materials covering them. Alternatively, the book Statistics For Dummies by Deborah J. Rumsey provides great explanations and covers some fundamental charts such as the box plot.

In this chapter, we will cover the following topics:

·        How bubble size is determined

·        Bubble chart options when reproducing a real-life chart in a step-by-step approach

·        Box plot structure and the series options when replotting data from a spider chart

·        Error bar charts with real-life data

The bubble chart

The bubble series is an extension of the scatter series where each individual data point has a variable size. It is generally used for showing 3-dimensional data on a 2-dimensional chart, and the bubble size reflects the scale between z-values.

Understanding how the bubble size is determined

In Highcharts, bubble size is decided by associating the smallest z-value to the plotOptions.bubble.minSize option and the largest z-value to plotOptions.bubble.maxSize. By default, minSize is set to 8 pixels in diameter, whereas maxSize is set to 20 percent of the plot area size, where the size is the smaller of the plot area's width and height.

There is another option, sizeBy, which also affects the bubble size. The sizeBy option accepts one of two string values: 'width' or 'area'. With 'width', the bubble width is decided by its z-value ratio in the series, in proportion to the minSize and maxSize range. With 'area', the size of the bubble is scaled by taking the square root of the z-value ratio (see http://en.wikipedia.org/wiki/Bubble_chart for more description on the 'area' implementation). This option is for viewers who have different perceptions when comparing the size of circles. To demonstrate this concept, let's take the sizeBy example (http://jsfiddle.net/ZqTTQ/) from the Highcharts online API documentation. Here is the snippet of code from the example:

        plotOptions: {
            series: { minSize: 8, maxSize: 40 }
        },
        series: [{
            data: [ [1, 1, 1], [2, 2, 2],
                    [3, 3, 3], [4, 4, 4], [5, 5, 5] ],
            sizeBy: 'area',
            name: 'Size by area'
        }, {
            data: [ [1, 1, 1], [2, 2, 2],
                    [3, 3, 3], [4, 4, 4], [5, 5, 5] ],
            sizeBy: 'width',
            name: 'Size by width'
        }]

Two sets of bubbles are set with different sizeBy schemes. The minimum and maximum bubble sizes are set to 8 and 40 pixels wide respectively. The following is the screen output of the example:

[Figure: screen output of the sizeBy example]

Both series have exactly the same x, y, and z values, so they have the same bubble sizes at the extremes (1 and 5). For the middle values 2 to 4, the Size by area bubbles start off larger than the Size by width bubbles, and the two schemes gradually converge to the same size. The following table shows the final size in pixels at each z-value for both methods; the value in parentheses is the z-value ratio in the series:

Z-value                  1        2          3          4          5
Size by width (ratio)    8 (0)    16 (0.25)  24 (0.5)   32 (0.75)  40 (1)
Size by area (ratio)     8 (0)    24 (0.5)   31 (0.71)  36 (0.87)  40 (1)

Let's see how the bubble sizes are computed in both approaches. The ratio in Size by width is calculated as (Z - Zmin) / (Zmax - Zmin). So, for z-value 3, the ratio is computed as (3 - 1) / (5 - 1) = 0.5. To evaluate the ratio for the Size by area scheme, simply take the square root of the Size by width ratio. In this case, for z-value 3, it works out as √0.5 ≈ 0.71. We then convert the ratio value into the number of pixels based on the minSize and maxSize range. Size by width with z-value 3 is calculated as:

ratio * (maxSize - minSize) + minSize = 0.5 * (40 - 8) + 8 = 24
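The arithmetic above can be sketched in a few lines of JavaScript. This is only an illustration of the calculation described in this section, not Highcharts' internal code; the function name bubbleDiameter is our own, and the fallback ratio of 0.5 for a series whose z-values are all equal is an assumption:

```javascript
// Sketch of the sizing arithmetic described in the text (not library code).
// bubbleDiameter is a hypothetical helper name.
function bubbleDiameter(z, zMin, zMax, minSize, maxSize, sizeBy) {
    // Position of z within the z range of the series, from 0 to 1.
    // Assumption: use the midpoint when every z-value is the same.
    var ratio = (zMax > zMin) ? (z - zMin) / (zMax - zMin) : 0.5;

    // 'area' takes the square root of the 'width' ratio
    if (sizeBy === 'area') {
        ratio = Math.sqrt(ratio);
    }

    // Convert the ratio into pixels within the minSize to maxSize range
    return ratio * (maxSize - minSize) + minSize;
}

var zValues = [1, 2, 3, 4, 5];

var byWidth = zValues.map(function (z) {
    return Math.round(bubbleDiameter(z, 1, 5, 8, 40, 'width'));
});

var byArea = zValues.map(function (z) {
    return Math.round(bubbleDiameter(z, 1, 5, 8, 40, 'area'));
});

console.log(byWidth); // [ 8, 16, 24, 32, 40 ]
console.log(byArea);  // [ 8, 24, 31, 36, 40 ]
```

The rounded results agree with the sizes listed in the table for both schemes.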

Reproducing a real-life chart

In this section, we will examine bubble series options by replicating a real-life example (MLB Players Chart: http://fivethirtyeight.com/datalab/has-mike-trout-already-peaked/). The following is a bubble chart of baseball players' milestones:

[Figure: the original bubble chart of baseball players' milestones]

First, there are two ways that we can list the data points (the values are derived from best estimations of the graph) in the series. The conventional way is an array of x, y, and z values where x is the age value starting from 21 in this example:

            series: [{
                 type: 'bubble',
                 data: [ [ 21, 100, 1 ],
                         [ 22, 50, 5 ],
                         ....

Alternatively, we can simply set the pointStart option to the initial age value and leave out the x values:

            series: [{
                type: 'bubble',
                pointStart: 21,
                data: [ [ 100, 1 ],
                        [ 50, 5 ],
                        ....
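To see why the two forms are equivalent, the following sketch expands the shorter [y, z] pairs back into [x, y, z] triples. The helper name expandWithPointStart is hypothetical, and the sketch assumes that x advances by pointInterval (1 unless specified) from one point to the next:

```javascript
// Hypothetical helper: rebuild explicit x values from pointStart.
function expandWithPointStart(data, pointStart, pointInterval) {
    // Assumption: x increases by 1 per point unless an interval is given
    var step = (pointInterval === undefined) ? 1 : pointInterval;

    return data.map(function (pair, index) {
        // pair is [y, z]; the x value is derived from the point's position
        return [pointStart + index * step, pair[0], pair[1]];
    });
}

var expanded = expandWithPointStart([ [ 100, 1 ], [ 50, 5 ] ], 21);

console.log(expanded); // [ [ 21, 100, 1 ], [ 22, 50, 5 ] ]
```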

Then, we define the background color and axis titles, and rotate the axis title to the top of the chart. The following is our first try:

[Figure: first attempt at reproducing the chart]

As we can see, there are a number of areas that are not quite right. Let's fix the bubble size and color first. Compared to the original chart, the preceding chart has a larger bubble size for the upper value, and the bubbles should be solid red. We update the plotOptions.bubble option as follows:

      plotOptions: {
           bubble: {
               minSize: 9,
               maxSize: 30,
               color: 'red'
           }
      },

This changes the bubble size perspective to more closely resemble the original chart:

[Figure: the chart with adjusted bubble size and color]

The next step is to fix the y-axis range as we want it to be between 0 and 100 only. So, we apply the following config to the yAxis option:

     yAxis: {
         endOnTick: false,
         startOnTick: false,
         labels: {
             formatter: function() {
                  return (this.value === 100) ?
                       this.value + ' %' : this.value;
             }
         }
     },

By setting the options endOnTick and startOnTick to false, we remove the extra interval at both ends. The label formatter only prints the % sign at the 100 interval. The following chart shows the improvement on the y axis:

[Figure: the chart with the improved y axis]
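Highcharts calls the label formatter with this bound to the label context, so we can try the function outside a chart by supplying this ourselves. In the following sketch, the tick values are assumed for illustration and the variable names are our own:

```javascript
// The same formatter as in the yAxis configuration above
var yAxisLabels = {
    formatter: function () {
        return (this.value === 100) ?
            this.value + ' %' : this.value;
    }
};

// Assumed tick values; a real chart supplies these itself
var ticks = [0, 25, 50, 75, 100];

var labels = ticks.map(function (tick) {
    // Mimic the label context by passing an object with a value property
    return yAxisLabels.formatter.call({ value: tick });
});

console.log(labels); // [ 0, 25, 50, 75, '100 %' ]
```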

The next enhancement is to move the x axis up to the zero value level and refine the axis into even number intervals. We also enable the grid lines on each major interval and increase the width of the axis line to resemble the original chart:

     xAxis: {
         tickInterval: 2,
         offset: -27,
         gridLineColor: '#d1d1d1',
         gridLineWidth: 1,
         lineWidth: 2,
         lineColor: '#7E7F7E',
         labels: {
             y: 28
         },
         minPadding: 0.04,
         maxPadding: 0.15,
         title: ....
     },

The tickInterval property sets the label interval to even numbers, and the offset option pushes the x axis level upwards, in line with the zero value. The interval lines are enabled by setting the gridLineWidth option to a non-zero value. In the original chart, there are extra spaces at both extremes of the x axis for the data labels. We can achieve this by assigning ratio values to both minPadding and maxPadding. The x axis labels are pushed further down by increasing the labels.y value. The following screenshot shows the improvement:

[Figure: the chart with the improved x axis]
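As a rough sketch of what the two padding ratios do: each ratio is taken relative to the data range of the axis, and the axis is extended by that amount at the corresponding end. The function name paddedExtremes and the age range of 21 to 41 are assumptions for illustration, and the extremes in a real chart may still be adjusted by tick alignment:

```javascript
// Hypothetical helper: extend an axis range by padding ratios.
function paddedExtremes(dataMin, dataMax, minPadding, maxPadding) {
    var range = dataMax - dataMin;

    return {
        // Each padding is a fraction of the data range
        min: dataMin - range * minPadding,
        max: dataMax + range * maxPadding
    };
}

// Assumed data range; the real chart takes it from the series
var extremes = paddedExtremes(21, 41, 0.04, 0.15);

// Approximately 20.2 and 44: more room on the right for the longer label
console.log(extremes.min.toFixed(1), extremes.max.toFixed(1));
```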

The final enhancement is to put data labels next to the first and penultimate data points. In order to enable a data label for a particular point, we turn the specific point value into an object configuration with the dataLabels option as follows:

     series: [{
         pointStart: 21,
         data: [{
             y: 100,
             z: 1,
             name: 'Alex Rodriguez <br>in Seattle at age 21',
             dataLabels: {
                 enabled: true,
                 align: 'right',
                 verticalAlign: 'middle',
                 format: '{point.name}',
                 color: 'black',
                 x: 15
             }
         }, ....

We use the name property for the data label content and set the format option pointing to the name property. We also position the label on the right-hand side of the data point and assign the label color.

[Figure: the chart with the first data label enabled]

In the preceding chart, the label font looks quite blurred. This is actually the default setting for dataLabels in the bubble series. (By default, the label is white and positioned inside the bubble, which is filled with the series color, so the label looks clear even with the text shadow effect applied.) Also, the original chart has a connector between the bubble and the data label. Here is our second attempt to enhance the chart:

         { y: 100,
           z: 1,
           name: 'Alex Rodriguez <br>in Seattle at age 21',
           dataLabels: {
               enabled: true,
               align: 'right',
               verticalAlign: 'middle',
               // Connector glyph floated to the left of the label text
               format: '<div style="float:left">' +
                       '<font size="5">∟</font></div>' +
                       '<div>{point.name}</div>',
               color: 'black',
               shadow: false,
               // Needed so that the DIV markup in format is rendered as HTML
               useHTML: true,
               x: -2,
               y: 18,
               style: {
                   fontSize: '13px',
                   // Remove the default text shadow that blurs the label
                   textShadow: 'none'
               }
           }
         },

To remove the blurred effect, we redefine the label style without the CSS textShadow. As for the L-shaped connector, we use an alt-code character (alt-code: 28) with a larger font size. Then, we put inline CSS styles in the format option so that the two DIV boxes, the connector and the text label, sit adjacent to each other. The new arrangement looks considerably more polished:

[Figure: the chart with the restyled label and connector]

We apply the same trick to the other label; here is the final draft of our bubble chart:

[Figure: final draft of the bubble chart]

The only part left undone is the bubble size legend. Unfortunately, Highcharts currently doesn't offer such a feature. However, this can be accomplished by using the chart's Renderer engine to create a circle and text label. We will leave this as an exercise for readers.
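One possible starting point for this exercise is sketched below. It assumes a chart object whose renderer provides circle(x, y, r) and text(str, x, y), each returning an element with attr(), css(), and add(), as the Highcharts Renderer does. The function name drawBubbleLegend, the entry labels, the colors, and all positions are hypothetical:

```javascript
// Hypothetical helper: draw nested circles with text labels as a size legend.
// entries is an array of { label: string, radius: number } in pixels;
// left and baseline position the legend inside the chart.
function drawBubbleLegend(chart, entries, left, baseline) {
    var maxRadius = Math.max.apply(null, entries.map(function (entry) {
        return entry.radius;
    }));
    var centerX = left + maxRadius;

    return entries.map(function (entry) {
        // All circles share one baseline, so the smaller ones nest inside
        var circle = chart.renderer
            .circle(centerX, baseline - entry.radius, entry.radius)
            .attr({ fill: 'none', stroke: 'red', 'stroke-width': 1 })
            .add();

        // Put each label to the right, level with the top of its circle
        var label = chart.renderer
            .text(entry.label, centerX + maxRadius + 8,
                  baseline - 2 * entry.radius + 4)
            .css({ fontSize: '11px' })
            .add();

        return { circle: circle, label: label };
    });
}

// Possible usage inside a chart.events.load handler, where this is the chart
// (the radii are assumed to be half of the maxSize and minSize settings):
// drawBubbleLegend(this, [
//     { label: 'largest z-value', radius: 15 },
//     { label: 'smallest z-value', radius: 4.5 }
// ], 60, 80);
```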

Technically speaking, we can create the same chart with a scatter series instead, with each data point specified in object configuration and a radius precomputed from the z-value ratio assigned to the point's marker.radius option.
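A minimal sketch of that conversion follows. It assumes the 'width' sizing scheme, treats minSize and maxSize as diameters in pixels (so the marker radius is half of the computed size), and uses a hypothetical helper name and sample data:

```javascript
// Hypothetical helper: turn [x, y, z] bubble data into scatter points
// whose marker radius is precomputed from the z-value ratio.
function bubbleToScatter(points, minSize, maxSize) {
    var zValues = points.map(function (point) { return point[2]; });
    var zMin = Math.min.apply(null, zValues);
    var zMax = Math.max.apply(null, zValues);

    return points.map(function (point) {
        // Assumption: use the midpoint when every z-value is the same
        var ratio = (zMax > zMin) ? (point[2] - zMin) / (zMax - zMin) : 0.5;
        var diameter = ratio * (maxSize - minSize) + minSize;

        return {
            x: point[0],
            y: point[1],
            marker: { radius: diameter / 2 }
        };
    });
}

// Sample values for illustration only
var scatterData = bubbleToScatter(
    [ [ 21, 100, 1 ], [ 22, 50, 5 ], [ 23, 75, 3 ] ], 9, 30);

console.log(scatterData.map(function (point) {
    return point.marker.radius;
})); // [ 4.5, 15, 9.75 ]
```

The resulting array can be used as the data option of a series with type 'scatter'; setting the marker symbol to 'circle' for that series keeps the points round.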

Understanding the box plot chart

A box plot is a technical chart that shows data samples in terms of the shape of distribution. Before we can create a box plot chart, we need to understand the basic structure and concept. The following diagram illustrates the structure of a box plot:

[Figure: structure of a box plot]

In order to find out the percentile values, the entire data sample needs to be sorted first. Basically, a box plot is composed of top and bottom whisker values, first (Q1) and third (Q3) quartile values, and the median. The quartile Q1 represents the median value between the 50th percentile and the minimum data. Quartile Q3 works in a similar fashion but with the maximum data. For data with a perfectly symmetric distribution, such as the normal distribution, the box plot is symmetric about the median: the two halves of the box have equal height, as do the two whiskers.

Strictly speaking, there are other types of box plot that differ in how much the percentiles of both whiskers cover. Some use the definition of 1.5 times the inter-quartile range, that is, 1.5 * (Q3 - Q1), or standard deviation. The purpose is to isolate the outlier data and plot them as separate points which can be put into scatter data points along with the box plot. Here, we use the simplest form of box plot: the maximum and minimum data points are regarded as the top and bottom whiskers respectively.
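For comparison, the 1.5 times inter-quartile range variant can be sketched as follows. This is not the form used in this chapter. The sketch assumes that quartiles are computed as the medians of the lower and upper halves of the sorted sample, and that each whisker ends at the most extreme value still within reach of the box; the function names and the sample values are hypothetical:

```javascript
// Median of an array that is already sorted in ascending order
function median(sorted) {
    var mid = Math.floor(sorted.length / 2);
    return (sorted.length % 2 === 1) ?
        sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Hypothetical helper: box plot values plus outliers for one sample
function boxPlotWithOutliers(values) {
    var sorted = values.slice().sort(function (a, b) { return a - b; });
    var half = Math.floor(sorted.length / 2);

    // Assumption: quartiles are the medians of the lower and upper halves
    var q1 = median(sorted.slice(0, half));
    var q3 = median(sorted.slice(sorted.length - half));
    var reach = 1.5 * (q3 - q1);

    var inside = [];
    var outliers = [];
    sorted.forEach(function (value) {
        if (value < q1 - reach || value > q3 + reach) {
            outliers.push(value);
        } else {
            inside.push(value);
        }
    });

    return {
        // low, Q1, median, Q3, high in the order the boxplot series expects
        box: [inside[0], q1, median(sorted), q3, inside[inside.length - 1]],
        outliers: outliers
    };
}

var result = boxPlotWithOutliers([7, 3, 40, 1, 9, 5, 11, 2, 8, 4, 10, 6]);

console.log(result.box);      // [ 1, 3.5, 6.5, 9.5, 11 ]
console.log(result.outliers); // [ 40 ]
```

The box array can be used as one entry of a boxplot series, and the outliers can be plotted as points of a separate scatter series at the same x position.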

Plotting the box plot chart

In order to create a box plot chart, we need to load an additional library, highcharts-more.js:

<script src="http://code.highcharts.com/highcharts-more.js"></script>

Highcharts offers a set of options to shape and style the box plot series, such as the line width, style, and color, which are shown in the following code snippet:

          plotOptions: {
              boxplot: {
                  lineWidth: 2,
                  fillColor: '#808080',
                  medianColor: '#FFFFFF',
                  medianWidth: 2,
                  stemColor: "#808080",
                  stemDashStyle: 'dashdot',
                  stemWidth: 1,
                  whiskerColor: '#808080',
                  whiskerWidth: 2,
                  whiskerLength: '120%'
              }
          },

The lineWidth option is the overall line width of the box plot, and fillColor is the color inside the box. The median options refer to the horizontal median line inside the box, whereas the stem options are for the line between the quartile and whisker. The whiskerLength option is a ratio relative to the width of the quartile box. In this example, we enlarge the whiskerLength option for ease of visualization, as there are a number of box plots packed into the graph.

The series data values for a box plot are listed in array form in ascending order, that is, from the bottom whisker to the top whisker. The following shows a sample of series data:

        series: [{
            type: 'boxplot',
            data: [
                   [16.855, 19.287, 26.537, 31.368, 33.035 ],
                   [16.139, 18.668, 25.33, 30.632, 32.385 ],
                   [12.589, 15.536, 23.5495, 28.960, 30.848 ],
                   [13.395, 16.399, 22.078, 27.013, 29.146 ],
                   ....
            ]
        }]

Making sense with the box plot data

Before we dive into an example with real-life data, it is worth looking at an excellent article (http://junkcharts.typepad.com/junk_charts/2014/04/an-overused-chart-why-it-fails-and-how-to-fix-it.html) by Kaiser Fung, a marketing analytics and data visualization expert who also authored a couple of books on big data crunching. In the article, Kaiser comments on a spider chart from the video Arctic Death Spiral (http://youtu.be/20pjigmWwiw), as follows:

[Figure: spider chart from the Arctic Death Spiral video]

The video demonstrates how the arctic sea ice volume (one series per month, plotted over the years) spirals towards the center at an alarming rate. Kaiser argues that using a spider chart doesn't do justice to the important message in the data. To summarize his arguments:

·        It is difficult for readers to comprehend the real downward trend scale in a circular chart.

·        Humans perceive time series data more naturally in a horizontal progression than in a circular motion.

·        If the movement of monthly data within a year fluctuates more than other years, we will have multiple line series crossing each other. As a result, we have a plateful of spaghetti instead of a comprehensible chart.

In order to fix this, Kaiser suggests that a box plot is the best candidate. Instead of having 12 series lines crammed together, he uses a box plot to represent the annual data distribution. The 12 months' data for each year are sorted, and only the median, quartiles, and extreme values are substituted into the box plot. Although small details are lost due to less data, the range and scale of the downward trend over time are better represented in this case.
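The reduction from 12 monthly values to 5 box plot values per year can be sketched as follows. The monthly figures below are made up for illustration and are not the sea ice data; the helper names are hypothetical, and the quartiles are assumed to be the medians of the lower and upper six months:

```javascript
// Median of an array that is already sorted in ascending order
function median(sorted) {
    var mid = Math.floor(sorted.length / 2);
    return (sorted.length % 2 === 1) ?
        sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Hypothetical helper: reduce one year of values to
// [low, Q1, median, Q3, high] with extremes as whiskers
function toBoxPlotPoint(values) {
    var sorted = values.slice().sort(function (a, b) { return a - b; });
    var half = Math.floor(sorted.length / 2);

    return [
        sorted[0],
        median(sorted.slice(0, half)),
        median(sorted),
        median(sorted.slice(sorted.length - half)),
        sorted[sorted.length - 1]
    ];
}

// Made-up monthly values, keyed by year
var monthlyByYear = {
    '2001': [12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1],
    '2002': [24, 2, 22, 4, 20, 6, 18, 8, 16, 10, 14, 12]
};

var years = Object.keys(monthlyByYear).sort();
var boxPlotData = years.map(function (year) {
    return toBoxPlotPoint(monthlyByYear[year]);
});

console.log(boxPlotData);
// [ [ 1, 3.5, 6.5, 9.5, 12 ], [ 2, 7, 13, 19, 24 ] ]
```

Here, years would serve as the x axis categories and boxPlotData as the data option of the boxplot series.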

The following is the final box plot presentation in Highcharts:

[Figure: the box plot presentation in Highcharts]

The box plot tooltip

Since each point in the box plot series holds several values, the point has different property names (low, q1, median, q3, and high) to refer to them. The following illustrates an example of tooltip.formatter:

      chart: {
          ....
      },
      ....,
      tooltip: {
          formatter: function() {
              return "In year: " + this.x + ", <br>" +
                     "low: " + this.point.low + ", <br>" +
                     "Q1: " + this.point.q1 + ", <br>" +
                     "median: " + this.point.median + ", <br>" +
                     "Q3: " + this.point.q3 + ", <br>" +
                     "high: " + this.point.high;
          }
      },
      series: [{
          ....
      }]

Note that formatter should be added to the tooltip property of the main options object, and not in the series object. Here is what the box plot tooltip looks like:

[Figure: the box plot tooltip]

The error bar chart

An error bar is another technical chart that shows the standard error, which is the standard deviation divided by the square root of the sample size. It means that as the sample size increases, the variation from the sample mean diminishes. The error bar series has color and style options similar to those of the box plot, but they apply only to the whisker and stem:

               plotOptions: {
                   errorbar: {
                       stemColor: "#808080",
                       stemDashStyle: 'dashdot',
                       stemWidth: 2,
                       whiskerColor: '#808080',
                       whiskerWidth: 2,
                       whiskerLength: '20%'
                   }
               },

The same also applies to the tooltip formatter, in which low and high refer to both ends of the error bar. As for the series data option, it takes an array of tuples of lower and upper values:

        series: [{
            type: 'column',
            data: ....
        }, {
            name: 'error range',
            type: 'errorbar',
            data: [
                   [ 22.76, 23.404 ],
                   [ 25.316, 29.976 ],

To demonstrate the error bar, we use the F1 pit stop times from all the teams in each circuit from http://www.formula1.com/results/season/2013. We plot the mean of each circuit in a column series. We then calculate the standard error and apply the result to the mean. Here is a screenshot of the error bar chart:

[Figure: the error bar chart]
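The calculation behind such a chart can be sketched as follows. The pit stop times and circuit names are made up for illustration, the helper name is hypothetical, and the sketch assumes the sample standard deviation (dividing by n - 1) and an error range of one standard error on each side of the mean:

```javascript
// Hypothetical helper: mean and standard error range of one sample
function meanWithStandardError(sample) {
    var n = sample.length;
    var mean = sample.reduce(function (sum, value) {
        return sum + value;
    }, 0) / n;

    var squares = sample.reduce(function (sum, value) {
        return sum + (value - mean) * (value - mean);
    }, 0);

    // Assumption: sample standard deviation, dividing by n - 1
    var deviation = Math.sqrt(squares / (n - 1));

    // Standard error = standard deviation / square root of sample size
    var error = deviation / Math.sqrt(n);

    return { mean: mean, low: mean - error, high: mean + error };
}

// Made-up pit stop times in seconds, keyed by circuit
var pitStops = {
    'Circuit A': [22, 23, 24, 25],
    'Circuit B': [26, 27, 29, 30]
};

var circuits = Object.keys(pitStops);
var stats = circuits.map(function (circuit) {
    return meanWithStandardError(pitStops[circuit]);
});

// Data for the column series and the errorbar series respectively
var columnData = stats.map(function (s) { return s.mean; });
var errorData = stats.map(function (s) { return [s.low, s.high]; });

console.log(columnData); // [ 23.5, 28 ]
console.log(errorData);  // one [low, high] pair per circuit
```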

Note that when displaying an error bar series with a column series, the column series has to be specified before the error bar series in the series array. Otherwise, half of the error bar is blocked by the column, as in the following example:

[Figure: error bars partly hidden behind the columns]

Summary

In this chapter, we have learned about bubble charts and how bubble sizes are determined. We also tested the bubble chart series by recreating a real-life chart. We examined the box plot principle and structure, and practiced how to turn data samples into percentile data and lay them out in a box plot chart. Finally, we studied error bar charts by using some statistical data.

In the next chapter, we will investigate the properties of waterfall, funnel, pyramid, and heatmap charts and experiment with them.