D3.js in Action (2015)
Part 2. The pillars of information visualization
Chapter 5. Layouts
This chapter covers
· Histogram and pie chart layouts
· Simple tweening
· Tree, circle pack, and stack layouts
· Sankey diagrams and word clouds
D3 contains a variety of functions, referred to as layouts, that help you format your data so that it can be presented using a popular charting method. In this chapter we’ll look at several different layouts so that you can understand general layout functionality, learn how to deal with D3’s layout structure, and deploy one of these layouts (some of which are shown in figure 5.1) with your data.
Figure 5.1. Multiple layouts are demonstrated in this chapter, including the circle pack (section 5.3), tree (section 5.4), stack (section 5.5), and Sankey (section 5.6.1), as well as tweening to properly animate shapes like the arcs in pie charts (section 5.2.3).
In each case, as you’ll see with the following examples, when a dataset is associated with a layout, each of the objects in the dataset has attributes that allow for drawing the data. Layouts don’t draw the data, nor are they called like components or referred to in the drawing code like generators. Rather, they’re a preprocessing step that formats your data so that it’s ready to be displayed in the form you’ve chosen. You can update a layout, and then if you rebind that altered data to your graphical objects, you can use the D3 enter/update/exit syntax you encountered in chapter 2 to update your chart. Paired with animated transitions, this can provide you with the framework for an interactive, dynamic chart.
This chapter gives an overview of layout structure by implementing popular layouts such as the histogram, pie chart, tree, and circle packing. Other layouts such as the chord layout and more exotic ones follow the same principles and should be easy to understand after looking at these. We’ll get started with a kind of chart you’ve already worked with, the bar chart or histogram, which has its own layout that helps abstract the process of building this kind of chart.
5.1. Histograms
Before we get into charts that you’ll need layouts for, let’s take a look at a chart that we easily made without a layout. In chapter 2 we made a bar chart based on our Twitter data by using d3.nest(). But D3 has a layout, d3.layout.histogram(), that bins values automatically and provides us with the necessary settings to draw a bar chart based on a scale that we’ve defined. Many people who get started with D3 think it’s a charting library, and that they’ll find a function like d3.layout.histogram that creates a bar chart in a <div> when it’s run. But D3 layouts don’t result in charts; they result in the settings necessary for charts. You have to put in a bit of extra work for charts, but you have enormous flexibility (as you’ll see in this and later chapters) that allows you to make diagrams and charts that you can’t find in other libraries.
Listing 5.1 shows the code to create a histogram layout and associate it with a particular scale. I’ve also included an example of how you can use interactivity to adjust the original layout and rebind the data to your shapes. This changes the histogram from showing the number of tweets that were favorited to the number of tweets that were retweeted.
Listing 5.1. Histogram code
You’re not expected to follow the process of using the histogram to create the results in figure 5.2. You’ll get into that as you look at more layouts throughout this chapter. Notice a few general principles: first, a layout formats the data for display, as I pointed out in the beginning of chapter 4. Second, you still need the same scales and components that you needed when you created a bar chart from raw data without the help of a layout. Third, the histogram is useful because it automatically bins data, whether it’s whole numbers like this or it falls in a range of values in a scale. Finally, if you want to dynamically change a chart using a different dimension of your data, you don’t need to remove the original. You just need to reformat your data using the layout and rebind it to the original elements, preferably with a transition. You’ll see this in more detail in your next example, which uses another type of chart: pie charts.
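To make the binning principle concrete, here is a minimal sketch rather than the chapter's listing. The sample tweets, the favorites array on each tweet, and the helper binValues are assumptions made for illustration. The guarded block shows the corresponding D3 v3 calls and runs only where D3 v3 is loaded.

```javascript
// Hypothetical sample data: each tweet is assumed to carry a favorites array.
var tweets = [
  {favorites: ["a", "b"]}, {favorites: []}, {favorites: ["c"]}, {favorites: ["d"]}
];

// Hand-rolled stand-in for the binning a histogram layout performs: one bin per
// threshold interval, holding its members plus x (start), dx (width), and y (count).
// Unlike D3, this sketch treats every interval as half-open.
function binValues(data, thresholds, accessor) {
  var bins = [];
  for (var i = 0; i < thresholds.length - 1; i++) {
    var bin = data.filter(function (d) {
      var v = accessor(d);
      return v >= thresholds[i] && v < thresholds[i + 1];
    });
    bin.x = thresholds[i];
    bin.dx = thresholds[i + 1] - thresholds[i];
    bin.y = bin.length;
    bins.push(bin);
  }
  return bins;
}

var favoriteBins = binValues(tweets, [0, 1, 2, 3], function (d) {
  return d.favorites.length;
});

// The D3 v3 counterpart, skipped when D3 v3 is not available.
if (typeof d3 !== "undefined" && d3.layout) {
  var histoChart = d3.layout.histogram()
    .bins([0, 1, 2, 3])
    .value(function (d) { return d.favorites.length; });
  var histoData = histoChart(tweets);
}
```

Each bin is an array with extra attributes, which is why you still need scales to turn x, dx, and y into rectangle positions and sizes.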
Figure 5.2. The histogram in its initial state (left) and after we change the measure from favorites to retweets (right) by clicking on one of the bars.
5.2. Pie charts
One of the most straightforward layouts available in D3 is the pie layout, which is used to make pie charts like those shown in figure 5.3. In this section you’ll learn how to create a pie chart and transform it into a ring chart. You’ll also learn how to use tweening to properly transition it when you change its data source. Like all layouts, a pie layout can be created, assigned to a variable, and used as both an object and a function. After you create the layout, you can pass it an array of values (which I’ll refer to as a dataset), and it will compute the necessary starting and ending angles for each of those values to draw a pie chart. When we pass an array of numbers as our dataset to a pie layout in the console as in the following code, it doesn’t produce any kind of graphics but rather results in the response shown in figure 5.4:
var pieChart = d3.layout.pie();
var yourPie = pieChart([1,1,2]);
Figure 5.3. The traditional pie chart (bottom right) represents proportion as an angled slice of a circle. With slight modification, it can be turned into a donut or ring chart (top) or an exploded pie chart (bottom left).
Figure 5.4. A pie layout applied to an array of [1,1,2] shows objects created with a start angle, end angle, and value attribute corresponding to the dataset, as well as the original data, which in this case is a number.
Our pieChart function created a new array of three objects. The startAngle and endAngle for each of the data values, expressed in radians, draw a pie chart with one piece from 0 to pi, the next from pi to 1.5 pi, and the last from 1.5 pi to 2 pi. But this isn’t a drawing, or SVG code like the line and area generators produced.
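The arithmetic behind those angles can be sketched by hand. The function pieAngles below is a hypothetical stand-in, not D3's implementation. It assumes the default behavior described later in this section (largest values are placed first) and returns its results in the original data order.

```javascript
// Hand-rolled approximation of what a pie layout computes for an array of numbers.
function pieAngles(values) {
  var total = values.reduce(function (sum, v) { return sum + v; }, 0);
  // Indices ordered from largest value to smallest.
  var order = values.map(function (v, i) { return i; })
    .sort(function (a, b) { return values[b] - values[a]; });
  var result = [];
  var angle = 0;
  order.forEach(function (i) {
    var sweep = values[i] / total * 2 * Math.PI; // share of the full circle, in radians
    result[i] = {
      data: values[i],
      value: values[i],
      startAngle: angle,
      endAngle: angle + sweep
    };
    angle += sweep;
  });
  return result;
}

var sketchPie = pieAngles([1, 1, 2]);
```

The value 2 is half of the total, so its piece sweeps half the circle (0 to pi), and the two 1 values split the remaining half.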
5.2.1. Drawing the pie layout
These are settings that need to be passed to a generator to make each of the pieces of our pie chart. This particular generator is d3.svg.arc, and it’s instantiated like the generators we worked with in chapter 4. It has a few settings, but the only one we need for this first example is the outerRadius() function, which allows us to set a dynamic or fixed radius for our arcs:
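A minimal sketch of that setup follows. The radius of 100 and the sample slice object are assumptions, and arcPoint is a hypothetical helper that only illustrates the arc generator's angle convention: 0 points to twelve o'clock and angles increase clockwise. The D3 v3 calls run only where D3 v3 is loaded.

```javascript
// An object shaped like one element of the pie layout's output.
var slice = {startAngle: 0, endAngle: Math.PI, value: 2, data: 2};

// Hypothetical helper: where a given angle lands on a circle of the given radius,
// measured clockwise from twelve o'clock around the 0,0 point.
function arcPoint(angle, radius) {
  return [radius * Math.sin(angle), -radius * Math.cos(angle)];
}

if (typeof d3 !== "undefined" && d3.svg) {
  var newArc = d3.svg.arc();
  newArc.outerRadius(100);
  // Logs the SVG path string for this slice's d attribute.
  console.log(newArc(slice));
}
```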
Now that you know how the arc constructor works and that it works with our data, all we need to do is bind the data created by our pie layout and pass it to <path> elements to draw our pie chart. The pie layout is centered on the 0,0 point in the same way as a circle. If we want to draw it at the center of our canvas, we need to create a new <g> element to hold the <path> elements we’ll draw and then move the <g> to the center of the canvas:
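The drawing step might look like the following sketch. The 500 by 500 canvas, the fill and stroke styles, and the helper centerTransform are assumptions. The D3 v3 calls run only where D3 v3 is loaded and an svg element exists on the page.

```javascript
// Hypothetical helper: the transform that moves a <g> to the middle of the canvas.
function centerTransform(width, height) {
  return "translate(" + width / 2 + "," + height / 2 + ")";
}

if (typeof d3 !== "undefined" && d3.layout) {
  var pieChart = d3.layout.pie();
  var newArc = d3.svg.arc().outerRadius(100);

  d3.select("svg")
    .append("g")
    .attr("transform", centerTransform(500, 500))
    .selectAll("path")
    .data(pieChart([1, 1, 2]))      // one datum per pie piece
    .enter()
    .append("path")
    .attr("d", newArc)              // the arc generator turns angles into a path
    .style("fill", "blue")
    .style("stroke", "black")
    .style("stroke-width", "2px");
}
```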
Figure 5.5 shows our pie chart. The pie chart layout, like most layouts, grows more complicated when you want to work with JSON object arrays rather than number arrays. Let’s bring back our tweets.json from chapter 2. We can nest and measure it to transform it from an array of tweets into an array of Twitter users with their number of tweets computed:
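One way to sketch that transformation is shown below. The sample tweets and their user, favorites, and retweets fields are assumptions about tweets.json, and nestByUser is a hypothetical plain-JavaScript stand-in. The guarded block shows the same idea with d3.nest() and d3.sum() and runs only where D3 v3 is loaded.

```javascript
// Hypothetical sample in the assumed shape of tweets.json.
var tweets = [
  {user: "Al", favorites: ["Roy"], retweets: []},
  {user: "Al", favorites: [], retweets: ["Sam"]},
  {user: "Roy", favorites: ["Al", "Sam"], retweets: ["Al"]}
];

// Group tweets by user, then measure each group.
function nestByUser(data) {
  var byUser = {};
  var nested = [];
  data.forEach(function (tweet) {
    if (!byUser[tweet.user]) {
      byUser[tweet.user] = {key: tweet.user, values: []};
      nested.push(byUser[tweet.user]);
    }
    byUser[tweet.user].values.push(tweet);
  });
  nested.forEach(function (el) {
    el.numTweets = el.values.length;
    el.numFavorites = el.values.reduce(function (sum, t) { return sum + t.favorites.length; }, 0);
    el.numRetweets = el.values.reduce(function (sum, t) { return sum + t.retweets.length; }, 0);
  });
  return nested;
}

var nestedTweets = nestByUser(tweets);

// The D3 v3 counterpart, skipped when D3 v3 is not available.
if (typeof d3 !== "undefined" && d3.nest) {
  var d3Nested = d3.nest()
    .key(function (el) { return el.user; })
    .entries(tweets);
  d3Nested.forEach(function (el) {
    el.numTweets = el.values.length;
    el.numFavorites = d3.sum(el.values, function (t) { return t.favorites.length; });
    el.numRetweets = d3.sum(el.values, function (t) { return t.retweets.length; });
  });
}
```

Either way, each element ends up with a key, a values array of the original tweets, and the three totals that the pie layout can later read through its value accessor.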
Figure 5.5. A pie chart showing three pie pieces that subdivide the circle between the values in the array [1,1,2].
5.2.2. Creating a ring chart
If we try to run pieChart(nestedTweets) like with the earlier array illustrated in figure 5.4, it will fail, because it doesn’t know that the numbers we should be using to size our pie pieces come from the .numTweets attribute. Most layouts, pie included, can define where the values are in your array by defining an accessor function to get to those values. In the case of nestedTweets, we define pieChart.value() to point at the numTweets attribute of the dataset it’s being used on. While we’re at it, let’s set a value for our arc generator’s innerRadius() so that we create a donut chart instead of a pie chart. With those changes in place, we can use the same code as before to draw the pie chart in figure 5.6:
pieChart.value(function(d) {
  return d.numTweets;
});
newArc.innerRadius(20);
yourPie = pieChart(nestedTweets);
Figure 5.6. A donut chart showing the number of tweets from our four users represented in the nestedTweets dataset
5.2.3. Transitioning
You’ll notice that for each value in nestedTweets, we totaled the number of tweets, and also used d3.sum() to total the number of retweets and favorites (if any). Because we have this data, we can adjust our pie chart to show pie pieces based not on the number of tweets but on those other values. One of the core uses of a layout in D3 is to update the graphical chart. All we need to do is make changes to the data or layout and then rebind the data to the existing graphical elements. By using a transition, we can see the pie chart change from one form to the other. Running the following code first transforms the pie chart to represent the number of favorites instead of the number of tweets. The next block causes the pie chart to represent the number of retweets. The final forms of the pie chart after running that code are shown in figure 5.7.
Figure 5.7. The pie charts representing, on the left, the total number of favorites and, on the right, the total number of retweets
pieChart.value(function(d) {
  return d.numFavorites;
});
d3.selectAll("path").data(pieChart(nestedTweets))
  .transition().duration(1000).attr("d", newArc);
pieChart.value(function(d) {return d.numRetweets});
d3.selectAll("path").data(pieChart(nestedTweets))
  .transition().duration(1000).attr("d", newArc);
Although the results are what we want, the transition can leave a lot to be desired. Figure 5.8 shows snapshots of the pie chart transitioning from representing the number of tweets to representing the number of favorites. As you’ll see by running the code and comparing these snapshots, the pie chart doesn’t smoothly transition from one state to another but instead distorts quite significantly.
Figure 5.8. Snapshots of the transition of the pie chart representing the number of tweets to the number of favorites. This transition highlights the need to assign key values for data binding and to use tweens for some types of graphical transition, such as that used for arcs.
You see this wonky transition because, as you learned earlier, the default data-binding key is array position. When the pie layout measures data, it also sorts it in order from largest to smallest, to create a more readable chart. But when you call the layout again, it re-sorts the dataset. The data objects are bound to different pieces in the pie chart, and when you transition between them graphically, you see the effect shown in figure 5.8. To prevent this from happening, we need to disable this sort:
pieChart.sort(null);
The result is a smooth graphical transition between numTweets and numRetweets, because the object position in the array remains unchanged, and so the transition in the drawn shapes is straightforward. But if you look closely, you’ll notice that the circle deforms a bit because the default transition() behavior doesn’t deal with arcs well. It’s not transitioning the degrees in our arcs; instead, it’s treating each arc as a geometric shape and transitioning from one to another.
This becomes obvious when you look at the transition from either of those versions of our pie chart to one that shows numFavorites, because some of the objects in our dataset have 0 values for that attribute, and one of them changes size dramatically. To clean this all up and make our pie chart transition properly, we need to change the code. Some of this you’ve already dealt with, like using key values for your created elements and using them in conjunction with exit and update behavior. But to make our pie pieces transition in a smooth graphical manner, we need to extend our transitions to include a custom tween to define how an arc can grow or shrink graphically into a different arc.
Listing 5.2. Updated binding and transitioning for pie layout
The result of the code in listing 5.2 is a pie chart that cleanly transitions the individual arcs or removes them when no data corresponds to the pie pieces. You’ll see more of attrTween and styleTween, as well as a deeper investigation of easing and other transition properties, in later chapters.
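The core of a custom arc tween can be sketched as follows. This is an illustration of the idea, not the chapter's listing: interpolateAngles, arcTween, and updatePie are hypothetical names, and storing the previous angles on the element as _current is an assumed convention. The tween interpolates the start and end angles and asks the arc generator for a fresh path at every step, instead of letting the transition morph one path string into another.

```javascript
// Interpolate between two sets of arc angles; t runs from 0 to 1.
function interpolateAngles(from, to) {
  return function (t) {
    return {
      startAngle: from.startAngle + (to.startAngle - from.startAngle) * t,
      endAngle: from.endAngle + (to.endAngle - from.endAngle) * t
    };
  };
}

// Build a tween factory suitable for transition.attrTween("d", ...).
// D3 calls the returned function once per element, with the element as `this`.
function arcTween(arcGenerator) {
  return function (d) {
    var previous = this._current || {startAngle: 0, endAngle: 0};
    var between = interpolateAngles(previous, d);
    // Remember the new angles so the next transition starts from them.
    this._current = {startAngle: d.startAngle, endAngle: d.endAngle};
    return function (t) { return arcGenerator(between(t)); };
  };
}

// Assumed wiring for D3 v3: bind by user key rather than array position,
// then transition with the custom tween. All three arguments come from the caller.
function updatePie(pieChart, newArc, nestedTweets) {
  d3.selectAll("path")
    .data(pieChart(nestedTweets), function (d) { return d.data.key; })
    .transition()
    .duration(1000)
    .attrTween("d", arcTween(newArc));
}
```

Because the arc generator is called with intermediate angles, every frame of the transition is a valid arc, which is what removes the deformation described earlier.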
We could label each pie piece <path> element, color it according to a measurement or category, or add interactivity. But rather than spend a chapter creating the greatest pie chart application you’ve ever seen, we’ll move on to another kind of layout that’s often used: the circle pack.
5.3. Pack layouts
Hierarchical data is amenable to an entire family of layouts. One of the most popular is circle packing, shown in figure 5.9. Each object is placed graphically inside the hierarchical parent of that object. You can see the hierarchical relationship. As with all layouts, the pack layout expects a default representation of data that may not align with the data you’re working with. Specifically, pack expects a JSON object array where the child elements in a hierarchy are stored in a children attribute that points to an array. In examples of layout implementations on the web, the data is typically formatted to match the expected data format. In our case, we would format our tweets like this:
{id: "All Tweets", children: [
{id: "Al's Tweets", children: [{id: "tweet1"}, {id: "tweet2"}]},
{id: "Roy's Tweets", children: [{id: "tweet1"}, {id: "tweet2"}]}
...
Figure 5.9. Pack layouts are useful for representing nested data. They can be flattened (top), or they can visually represent hierarchy (bottom). (Examples from Bostock, https://github.com/mbostock/d3/wiki/Pack-Layout.)
But it’s better to get accustomed to adjusting the accessor functions of the layout to match our data. This doesn’t mean we don’t have to do any data formatting. We still need to create a root node for circle packing to work (what’s referred to as “All Tweets” in the previous code). But we’ll adjust the accessor function .children() to match the structure of the data as it’s represented in nestedTweets, which stores the child elements in the values attribute. In the following listing, we also override the .value() setting that determines the size of circles and set it to a fixed value, as shown in figure 5.10.
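A minimal sketch of that setup follows. The sample nestedTweets, the 500 by 500 size, and the circle styling are assumptions, and childrenOf and fixedSize are hypothetical helper names. The D3 v3 calls run only where D3 v3 is loaded.

```javascript
// Hypothetical sample in the shape produced by nesting tweets by user.
var nestedTweets = [
  {key: "Al", values: [{content: "tweet1"}, {content: "tweet2"}]},
  {key: "Sam", values: [{content: "tweet3"}]}
];

// The root node we have to create ourselves.
var packableTweets = {id: "All Tweets", values: nestedTweets};

// Children live in values, not in the default children attribute.
function childrenOf(d) { return d.values; }

// Every leaf gets the same size.
function fixedSize() { return 1; }

if (typeof d3 !== "undefined" && d3.layout) {
  var packChart = d3.layout.pack()
    .size([500, 500])
    .children(childrenOf)
    .value(fixedSize);

  d3.select("svg").selectAll("circle")
    .data(packChart.nodes(packableTweets))  // flat array of nodes with x, y, r, depth
    .enter()
    .append("circle")
    .attr("r", function (d) { return d.r; })
    .attr("cx", function (d) { return d.x; })
    .attr("cy", function (d) { return d.y; })
    .style("fill", "none")
    .style("stroke", "black");
}
```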
Figure 5.10. Each tweet is represented by a green circle (A) nested inside an orange circle (B) that represents the user who made the tweet. The users are all nested inside a blue circle (C) that represents our “root” node.
Listing 5.3. Circle packing of nested tweets data
Notice that when the pack layout has a single child (as in the case of Sam, who only made one tweet), the size of the child node is the same as the size of the parent. This can make it look as if Sam is at the same hierarchical level as the other Twitter users who made more tweets. To correct this, we can reduce the radius of each circle by an amount that accounts for its depth in the hierarchy, which acts as a margin of sorts:
.attr("r", function(d) {return d.r - (d.depth * 10)})
If you want to implement margins like those shown in figure 5.11 in the real world, you should use something more sophisticated than just the depth times 10. That scales poorly with a hierarchical dataset with many levels or with a crowded circle-packing layout. If there were one or two more levels in this hierarchy, our fixed margin would result in negative radius values for the circles, so we should use a d3.scale.linear() or other method to set the margin. You can also use the pack layout’s built-in .padding() function to adjust the spacing between circles at the same hierarchical level.
Figure 5.11. An example of a fixed margin based on hierarchical depth. We can create this by reducing the circle size of each node based on its computed “depth” value.
I glossed over the .value() setting on the pack layout earlier. If you have some numerical measurement for your leaf nodes, then you can use that measurement to set their size using .value() and therefore influence the size of their parent nodes. In our case, we can base the size of our leaf nodes (tweets) on the number of favorites and retweets each has received (the same value we used in chapter 4 as our “impact factor”). The results in figure 5.12 reflect this new setting.
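As a sketch, the value accessor could be written like this. The favorites and retweets arrays on each tweet are assumptions about the data, impactOf is a hypothetical helper, and the added 1 is an assumption to keep tweets with no favorites or retweets from having zero size.

```javascript
// Hypothetical impact measure: favorites plus retweets, plus 1 so no leaf has zero size.
function impactOf(tweet) {
  return tweet.favorites.length + tweet.retweets.length + 1;
}

if (typeof d3 !== "undefined" && d3.layout) {
  var packChart = d3.layout.pack()
    .size([500, 500])
    .children(function (d) { return d.values; })
    .value(impactOf);  // leaf size now follows the impact of each tweet
}
```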
Figure 5.12. A circle-packing layout with the size of the leaf nodes set to the impact factor of those nodes
Layouts, like generators and components, are amenable to method chaining. You’ll see examples where the settings and data are all strung together in long chains. As with the pie chart, you could assign interactivity to the nodes or adjust the colors, but this chapter focuses on the general structure of layouts. Notice that circle packing is quite similar to another hierarchical layout known as treemaps. Treemaps pack space more effectively because they’re built out of rectangles, but they can be harder to read. The next layout is another hierarchical layout, known as a dendrogram, that more explicitly draws the hierarchical connections in your data.
5.4. Trees
Another way to show hierarchical data is to lay it out like a family tree, with the parent nodes connected to the child nodes in a dendrogram (figure 5.13).
Figure 5.13. Tree layouts are another useful method for expressing hierarchical relationships and are often laid out vertically (top), horizontally (middle), or radially (bottom). (Examples from Bostock.)
The prefix dendro means “tree,” and in D3 the layout is d3.layout.tree. It follows much the same setup as the pack layout, except that to draw the lines connecting the nodes, we need a new generator, d3.svg.diagonal, which draws a curved line from one point to another.
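The following sketch shows the general shape of that setup, not the chapter's listing. The sample hierarchy, the treeG id, and the styling are assumptions, and collectLinks is a hypothetical helper that mimics the source and target pairs a tree layout hands to the diagonal generator. The D3 v3 calls run only where D3 v3 is loaded.

```javascript
// Hypothetical root node wrapping nested tweets, with children stored in values.
var root = {id: "All Tweets", values: [
  {key: "Al", values: [{content: "tweet1"}, {content: "tweet2"}]},
  {key: "Sam", values: [{content: "tweet3"}]}
]};

// Hand-rolled illustration: one {source, target} pair per parent-child connection.
function collectLinks(node, childrenAccessor) {
  var links = [];
  (childrenAccessor(node) || []).forEach(function (child) {
    links.push({source: node, target: child});
    links = links.concat(collectLinks(child, childrenAccessor));
  });
  return links;
}

var sketchLinks = collectLinks(root, function (d) { return d.values; });

if (typeof d3 !== "undefined" && d3.layout) {
  var treeChart = d3.layout.tree()
    .size([500, 500])
    .children(function (d) { return d.values; });
  var linkGenerator = d3.svg.diagonal();
  var treeNodes = treeChart.nodes(root);   // nodes gain x, y, and depth

  d3.select("svg").append("g").attr("id", "treeG")
    .selectAll("path.link")
    .data(treeChart.links(treeNodes))      // one datum per connection
    .enter()
    .append("path")
    .attr("class", "link")
    .attr("d", linkGenerator)              // curved line from source to target
    .style("fill", "none")
    .style("stroke", "black");
}
```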
Listing 5.4. Callback function to draw a dendrogram
Our dendrogram in figure 5.14 is a bit hard to read. To turn it on its side, we need to adjust the positioning of the <g> elements by flipping the x and y coordinates, which orients the nodes horizontally. We also need to adjust the .projection() of the diagonal generator, which orients the lines horizontally:
linkGenerator.projection(function (d) {return [d.y, d.x]})
...
.append("g")
...
.attr("transform", function(d) {return "translate(" +d.y+","+d.x+")"});
Figure 5.14. A dendrogram laid out vertically using data from tweets.json. The level 0 “root” node (which we created to contain the users) is in blue, the level 1 nodes (which represent users) are in orange, and the level 2 “leaf” nodes (which represent tweets) are in green.
The result, shown in figure 5.15, is more legible because the text isn’t overlapping on the bottom of the canvas. But critical aspects of the chart are still drawn off the canvas. We only see half of the root node and the leaf nodes (the blue and green circles) and can’t read any of the labels of the leaf nodes, which represent our tweets.
Figure 5.15. The same dendrogram as figure 5.14 but laid out horizontally.
We could try to create margins along the height and width of the layout as we did earlier. Or we could provide information about each node as an information box that opens when we click it, as with the soccer data. But a better option is to give the user the ability to drag the canvas up and down and left and right to see more of the visualization.
To do this, we use the D3 zoom behavior, d3.behavior.zoom, which creates a set of event listeners. A behavior is like a component, but instead of creating graphical objects, it creates events (in this case for drag, mousewheel, and double-click) and ties those events to the element that calls the behavior. With each of these events, a zoom object changes its .translate() and/or .scale() values to correspond to the traditional dragging and zooming interaction. You’ll use these changed values to adjust the position of graphical elements in response to user interaction. Like a component, the zoom behavior needs to be called by the element to which you want these events attached. Typically, you call the zoom from the base <svg> element, because then it fires whenever you click anything in your graphical area. When creating the zoom component, you need to define what functions are called on zoomstart, zoom, and zoomend, which correspond (as you might imagine) to the beginning of a zoom event, the event itself, and the end of the event, respectively. Because zoom fires continuously as a user drags the mouse, you may want resource-intensive functions only at the beginning or end of the zoom event. You’ll see more complicated zoom strategies, as well as the use of scale, in chapter 7 when we look at geospatial mapping, which uses zooming extensively.
As with other components, to start a zoom component you create a new instance and set any attributes of it you may need. In our case, we only want the default zoom component, with the zoom event triggering a new function, zoomed(). This function changes the position of the <g> element that holds our chart and allows the user to drag it around:
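A minimal sketch of that wiring follows. The treeG id for the chart's <g> element and the helper panTransform are assumptions. The D3 v3 calls run only where D3 v3 is loaded.

```javascript
// Hypothetical helper: turn the zoom behavior's translate value into a transform string.
function panTransform(translate) {
  return "translate(" + translate[0] + "," + translate[1] + ")";
}

if (typeof d3 !== "undefined" && d3.behavior) {
  var treeZoom = d3.behavior.zoom();

  // Called continuously while the user drags; moves the <g> that holds the chart.
  var zoomed = function () {
    d3.select("#treeG").attr("transform", panTransform(treeZoom.translate()));
  };

  treeZoom.on("zoom", zoomed);
  d3.select("svg").call(treeZoom);  // attach the listeners to the base <svg>
}
```

Only the translate value is used here, so the chart pans but does not change scale.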
Now we can drag and pan our entire chart left and right and up and down. In figure 5.16, we can finally read the text of the tweets by dragging the chart to the left. The ability to zoom and pan gives you powerful interactivity to enhance your charts. It may seem odd that you learned how to use something called zoom and haven’t even dealt with zooming in and out, but panning tends to be more universally useful with charts like these, while changing scale becomes a necessity when dealing with maps.
Figure 5.16. The dendrogram, when dragged to the left, shows the labels for the tweets.
We have other choices besides drawing our tree from top to bottom and left to right. If we tie the position of each node to an angle, and use a diagonal generator subclass created for radial layouts, we can draw our tree diagrams in a radial pattern:
var linkGenerator = d3.svg.diagonal.radial()
.projection(function(d) { return [d.y, d.x / 180 * Math.PI]; });
To make this work well, we need to reduce the size of our chart, because the radial drawing of a tree layout in D3 uses the size to determine the maximum radius, and is drawn out from the 0,0 point of its container like a <circle> element:
treeChart.size([200,200])
With these changes in place, we need only change the positioning of the nodes to take rotation into account:
.attr("transform", function(d) { return "rotate(" + (d.x - 90) +
")translate(" + d.y + ")"; })
Figure 5.17 shows the results of these changes.
Figure 5.17. The same dendrogram laid out in a radial manner. Notice that the <g> elements are rotated, so their child <text> elements are rotated in the same manner.
The dendrogram is a generic way of displaying information. It can be repurposed for menus or information you may not think of as traditionally hierarchical. One example (figure 5.18) is from the work of Jason Davies, who used the dendrogram functionality in D3 to create word trees.
Figure 5.18. Example of using a dendrogram in a word tree by Jason Davies (http://www.jasondavies.com/wordtree/).
Hierarchical layouts are common and well understood by readers. This gives you the option to emphasize the nested container nature of a hierarchy, as we did with the circle pack layout, or the links between parent and child elements, as with the dendrogram.
5.5. Stack layout
You saw the effects of the stack layout in the last chapter when we created a streamgraph, an example of which is shown in figure 5.19. We began with a simple stacking function and then made it more complex. As I pointed out then, D3 actually implements a stack layout, which formats your data so that it can be easily passed to d3.svg.area to draw a stacked graph or streamgraph.
Figure 5.19. The streamgraph used in a New York Times piece on movie grosses (figure from The New York Times, February 23, 2008; http://mng.bz/rV7M)
To implement this, we’ll use the area generator in tandem with the stack layout in listing 5.5. This general pattern should be familiar to you by now:
1. Process the data to match the requirements of the layout.
2. Set the accessor functions of the layout to align it with the dataset.
3. Use the layout to format the data for display.
4. Send the modified data either directly to SVG elements or paired with a generator like d3.svg.diagonal, d3.svg.arc, or d3.svg.area.
The first step is to take our original streamdata.csv data and transform it into an array of movie objects that each have an array of values at points that correspond to the thickness of the section of the streamgraph that they represent.
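That reshaping step can be sketched in plain JavaScript. The row format (a day column plus one column per movie, with string values as a CSV parser would deliver them) is an assumption about streamdata.csv, and toStackData is a hypothetical helper.

```javascript
// Hypothetical rows in the assumed shape of the parsed CSV.
var rows = [
  {day: "1", movie1: "20", movie2: "8"},
  {day: "2", movie1: "18", movie2: "5"},
  {day: "3", movie1: "14", movie2: "3"}
];

// Turn one-row-per-day data into one-object-per-movie data,
// where each movie holds an array of {x: day, y: amount} points.
function toStackData(csvRows) {
  var movieNames = Object.keys(csvRows[0]).filter(function (key) {
    return key !== "day";
  });
  return movieNames.map(function (name) {
    return {
      name: name,
      values: csvRows.map(function (row) {
        return {x: +row.day, y: +row[name]};  // unary plus converts strings to numbers
      })
    };
  });
}

var stackData = toStackData(rows);
```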
Listing 5.5. Stack layout example
After the initial dataset is reformatted, the data in the object array is structured so that the stack layout can deal with it:
[
  {"name":"movie1","values":[{"x":1,"y":20},{"x":2,"y":18},{"x":3,"y":14},
    {"x":4,"y":7},{"x":5,"y":4},{"x":6,"y":3},{"x":7,"y":2},{"x":8,"y":0},
    {"x":9,"y":0},{"x":10,"y":0}]},
  {"name":"movie2","values":[{"x":1,"y":8},{"x":2,"y":5},{"x":3,"y":3},
    {"x":4,"y":3},{"x":5,"y":3},{"x":6,"y":1},{"x":7,"y":0},{"x":8,"y":0},
    {"x":9,"y":0},{"x":10,"y":0}]}
...
The x value is the day, and the y value is the amount of money made by the movie that day, which corresponds to thickness. As with other layouts, if we didn’t format our data this way, we’d need to adjust the .x() and .y() accessors to match our data names for those values. One of the benefits of formatting our data to match the expected data model of the layout is that the layout function is very simple:
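A sketch of that simple layout follows, together with a hand-rolled illustration of what stacking from a zero baseline computes. The sample data is shortened, and stackFromZero is a hypothetical helper, not D3's implementation. The D3 v3 calls run only where D3 v3 is loaded.

```javascript
var stackData = [
  {name: "movie1", values: [{x: 1, y: 20}, {x: 2, y: 18}]},
  {name: "movie2", values: [{x: 1, y: 8}, {x: 2, y: 5}]}
];

// Hand-rolled illustration: each point gets a y0 equal to the summed y values
// of the series drawn beneath it at the same x position.
function stackFromZero(series) {
  var baseline = {};
  series.forEach(function (s) {
    s.values.forEach(function (p) {
      p.y0 = baseline[p.x] || 0;
      baseline[p.x] = p.y0 + p.y;
    });
  });
  return series;
}

stackFromZero(stackData);

// The D3 v3 counterpart: because the points already use x and y,
// only the values accessor needs to be set.
if (typeof d3 !== "undefined" && d3.layout) {
  var stackLayout = d3.layout.stack()
    .values(function (d) { return d.values; });
  stackLayout(stackData);
}
```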
After our stackLayout function processes our dataset, we can get the results by running stackLayout(stackData). The layout adds a y0 value to each point alongside its x and y values, so that y0 and y0 + y correspond to the bottom and top of the object at that x position. If we use the stack layout to create a streamgraph, then it requires a corresponding area generator:
After we have our data, layout, and area generator in order, we can call them all as part of the selection and binding process. This gives a set of SVG <path> elements the necessary shapes to make our chart:
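Put together, the area generator and the binding step might look like the sketch below. The scales, their domains and ranges, the shortened sample data, and the fill color are assumptions, and areaBottom and areaTop are hypothetical helper names. The D3 v3 calls run only where D3 v3 is loaded.

```javascript
// Bottom and top of a stacked point, in data units.
function areaBottom(p) { return p.y0; }
function areaTop(p) { return p.y0 + p.y; }

if (typeof d3 !== "undefined" && d3.layout) {
  var stackData = [
    {name: "movie1", values: [{x: 1, y: 20}, {x: 2, y: 18}]},
    {name: "movie2", values: [{x: 1, y: 8}, {x: 2, y: 5}]}
  ];

  // Assumed scales; the y range is inverted because SVG y grows downward.
  var xScale = d3.scale.linear().domain([1, 10]).range([0, 500]);
  var yScale = d3.scale.linear().domain([0, 70]).range([480, 0]);

  var stackLayout = d3.layout.stack()
    .values(function (d) { return d.values; });

  var stackArea = d3.svg.area()
    .x(function (p) { return xScale(p.x); })
    .y0(function (p) { return yScale(areaBottom(p)); })
    .y1(function (p) { return yScale(areaTop(p)); });

  d3.select("svg").selectAll("path")
    .data(stackLayout(stackData))            // one datum per movie
    .enter()
    .append("path")
    .style("fill", "lightgray")
    .style("stroke", "black")
    .attr("d", function (d) { return stackArea(d.values); });
}
```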
The result, as shown in figure 5.20, isn’t a streamgraph but rather a stacked area chart, which isn’t that different from a streamgraph, as you’ll soon find out.
Figure 5.20. The stack layout default settings, when tied to an area generator, produce a stacked area chart like this one.
The stack layout has an .offset() function that determines the relative positions of the areas that make up the chart. Although we can write our own offset functions to create exotic charts, this function recognizes a few keywords that achieve the typical effects we’re looking for. We’ll use the silhouette keyword, which centers the drawing of the stacked areas around the middle. Another function useful for creating streamgraphs is the .order() function of a stack layout, which determines the order in which areas are drawn, so that you can alternate them like in a streamgraph. We’ll use inside-out because that produces the best streamgraph effect. The last change is to the area constructor, which we’ll update to use the basis interpolator because that gave the best look in our earlier streamgraph example:
stackLayout.offset("silhouette").order("inside-out");
stackArea.interpolate("basis");
This results in a cleaner streamgraph than our example from chapter 4, and is shown in figure 5.21.
Figure 5.21. The streamgraph effect from a stack layout with basis interpolation for the areas and using the silhouette and inside-out settings for the stack layout. This is similar to our hand-built example from chapter 4 and shows the same graphical artifacts from the basis interpolation.
The last time we made a streamgraph, we explored the question of whether it was a useful chart. It is useful, for various reasons, not least of which is because the area in the chart corresponds graphically to the aggregate profit of each movie.
But sometimes a simple stacked bar graph is better. Layouts can be used for various types of charts, and the stack layout is no different. If we restore the .offset() and .order() back to the default settings, we can use the stack layout to create a set of rectangles that makes a traditional stacked bar chart:
stackLayout = d3.layout.stack()
  .values(function(d) { return d.values; });
var heightScale = d3.scale.linear()
  .domain([0, 70])
  .range([0, 480]);
d3.select("svg").selectAll("g.bar")
  .data(stackLayout(stackData))
  .enter()
  .append("g")
  .attr("class", "bar")
  .each(function(d) {
    d3.select(this).selectAll("rect")
      .data(d.values)
      .enter()
      .append("rect")
      .attr("x", function(p) { return xScale(p.x) - 15; })
      .attr("y", function(p) { return yScale(p.y + p.y0); })
      .attr("height", function(p) { return heightScale(p.y); })
      .attr("width", 30)
      .style("fill", movieColors(d.name));
  });
In many ways, the stacked bar chart in figure 5.22 is much more readable than the streamgraph. It presents the same information, but the y-axis tells us exactly how much money a movie made. There’s a reason why bar charts, line charts, and pie charts are the standard chart types found in your spreadsheet. Streamgraphs, stacked bar charts, and stacked area charts are fundamentally the same thing, and rely on the stack layout to format your dataset to draw it. Because you can deploy them equally easily, your decision whether to use one or the other can be based on user testing rather than your ability to create awesome dataviz.
Figure 5.22. A stacked bar chart using the stack layout to determine the position of the rectangles that make up each day’s stacked bar
The layouts we’ve looked at so far, as well as the associated methods and generators, have broad applicability. Now we’ll look at a pair of layouts that don’t come with D3 that are designed for more specific kinds of data: the Sankey diagram and the word cloud. Even though these layouts aren’t as generic as the layouts included in the core D3 library that we’ve looked at, they have some prominent examples and can come in handy.
5.6. Plugins to add new layouts
The examples we’ve touched on in this chapter are a few of the layouts that come with the core D3 library. You’ll see a few more in later chapters, and we’ll focus specifically on the force layout in chapter 6. But layouts outside of core D3 may also be useful to you. These layouts tend to use specifically formatted datasets or different terminology for layout functions.
5.6.1. Sankey diagram
The Sankey diagram provides you with the ability to map flow from one category to another. It’s the kind of diagram used in Google Analytics (figure 5.23) to show event flow or user flow from one part of your website to another. Sankey diagrams consist of two types of objects: nodes and edges. In this case, the nodes are the web pages or events, and the edges are the traffic between them. This differs from the hierarchical data you worked with before, because nodes can have many overlapping connections.
Figure 5.23. Google Analytics uses Sankey diagrams to chart event and user flow for website visitors.
The D3 version of the Sankey layout is a plugin written by Mike Bostock a couple of years ago, and you can find it at https://github.com/d3/d3-plugins along with other interesting D3 plugins. The Sankey layout has a couple of examples and sparse documentation—one of the drawbacks of noncore layouts. Another minor drawback is that they don’t always follow the patterns of the core layouts in D3. To understand the Sankey layout, you need to examine the format of the data, the examples, and the code itself.
D3 Plugins
The core d3.js library that you download comes with quite a few layouts and useful functions, but you can find even more at https://github.com/d3/d3-plugins. Besides the two noncore layouts discussed in this chapter, we’ll look at the geo plugins in chapter 7 when we deal with maps. Also available are a fisheye distortion lens, a canned boxplot layout, a layout for horizon charts, and more exotic plugins for Chernoff faces and implementing the superformula.
The data is a JSON array of nodes and a second JSON array of links. Get used to this format, because it’s the format of most of the network data we’ll use in chapter 6. For our example, we’ll look at the traffic flow in a website that sells milk and milk-based products. We want to see how visitors move through the site from the homepage to the store page to the various product pages. In the parlance of the data format we need to work with, the nodes are the web pages, the links are the visitors who go from one page to another (if any), and the value of each link is the total number of visitors who move from that page to the next.
Listing 5.6. sitestats.json
The nodes array is clear—each object represents a web page. The links array is a bit more opaque, until you realize the numbers represent the array position of nodes in the node array. So when links[0] reads "source": 0, it means that the source is nodes[0], which is the index page of the site. It connects to nodes[1], the about page, and indicates that 25 people navigated from the home page to the about page. That defines our flow—the flow of traffic through a site.
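The index-based lookup described above can be sketched in plain JavaScript. The sample data is hypothetical: only the first link (index to about, 25 visitors) comes from the text, and the `name` property and the helper `resolveLinks` are assumptions made for illustration.

```javascript
// Hypothetical sample in the { nodes, links } shape described above.
// Only the first link (index -> about, 25 visitors) comes from the text.
const siteStats = {
  nodes: [{ name: "index" }, { name: "about" }, { name: "contact" }],
  links: [
    { source: 0, target: 1, value: 25 },
    { source: 0, target: 2, value: 10 }
  ]
};

// Replace each numeric source/target with the node object at that position.
function resolveLinks(data) {
  return data.links.map(function (link) {
    return {
      source: data.nodes[link.source],
      target: data.nodes[link.target],
      value: link.value
    };
  });
}

const resolved = resolveLinks(siteStats);
console.log(resolved[0].source.name + " -> " + resolved[0].target.name + ": " + resolved[0].value);
// prints "index -> about: 25"
```

Reading a link this way makes the flow explicit: each link is a pair of pages plus the number of visitors who moved between them.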
The Sankey layout is initialized like any layout:
Of these settings, the only one you’ve seen before is .size(), which controls the graphical extent that the layout uses. The rest you’d need to figure out by looking at the example, experimenting with different values, or reading the sankey.js code itself. Most of it will quickly make sense, especially if you’re familiar with the .nodes() and .links() convention used in D3 network visualizations. The .layout() setting is pretty hard to understand without diving into the code, but I’ll explain that next.
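As a rough sketch of what such an initialization can look like, the function below chains the settings named above. It assumes D3 version 3 with the Sankey plugin loaded, which exposes the constructor as d3.sankey(). The wrapper `createSankey` and all numeric values are illustrative assumptions, not the book's own listing.

```javascript
// Sketch only: `d3` must be D3 v3 with the Sankey plugin loaded, and `data`
// must have the { nodes, links } shape of sitestats.json.
// The numeric settings are illustrative, not values taken from the book.
function createSankey(d3, data) {
  return d3.sankey()
    .nodeWidth(20)       // horizontal thickness of each node, in pixels
    .nodePadding(200)    // vertical space between nodes in the same column
    .size([460, 460])    // [width, height] of the drawing area
    .nodes(data.nodes)   // array of page objects
    .links(data.links)   // array of { source, target, value } objects
    .layout(200);        // number of passes; computes node and link positions
}
```

Passing `d3` in as an argument keeps the sketch independent of how the library is loaded. In a page where d3 is a global, you would call createSankey(d3, data) inside the callback that loads the JSON file.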
After we define our Sankey layout as in listing 5.7, we need to draw the chart by selecting and binding the necessary SVG elements. In this case, that typically consists of <rect> elements for the nodes and <path> elements for the flows. We’ll also add <text> elements to label the nodes.
Listing 5.7. Sankey drawing code
The implementation of this layout has some interactivity, as shown in figure 5.24. Diagrams like these, with wavy paths overlapping other wavy paths, need interaction to make them legible to your site visitor. In this case, it differentiates one flow from another.
Figure 5.24. A Sankey diagram where the number of visitors is represented in the color of the path. The flow between index and contact has an increased opacity as the result of a mouseover event.
With a Sankey diagram like this at your disposal, you can track the flow of goods, visitors, or anything else through your organization, website, or other system. Although you could expand on this example in any number of ways, I think one of the most useful is also one of the simplest. Remember, layouts aren’t tied to particular shape elements. In some cases, like with the flows in the Sankey diagram, you’ll have a hard time adapting the layout data to any element other than a <path>, but the nodes don’t need to be <rect> elements. If we adjust our code, we can easily make nodes that are circles:
sankey.nodeWidth(1);
// Append a circle to each node group; a <circle> is sized by its radius,
// so we use half the node's computed height (dy) and no height attribute
d3.selectAll(".node").append("circle")
  .attr("r", function(d) { return d.dy / 2; })
  .attr("cy", function(d) { return d.dy / 2; })
  .style("fill", "pink")
  .style("stroke", "gray");
Don’t shy away from experimenting with tweaks to traditional charting methods. Using circles instead of rectangles, like in figure 5.25, may seem frivolous, but it may be a better fit visually, or it may distinguish your Sankey from all the boring sharp-edged Sankeys out there. In the same vein, don’t be afraid of leveraging D3’s capacity for information visualization to teach yourself how a layout works. You’ll remember that the Sankey layout has a layout() function, and you might discover the operation of that function by reading the code. But there’s another way to see how this function works: by using transitions and creating a function that updates the .layout() setting dynamically, you can watch what it does to the chart.
Figure 5.25. A squid-like Sankey diagram
Visualizing Algorithms
Although you may think of data visualization as all the graphics in this book, it’s also simultaneously a graphical representation of the methods you used to process the data. In some cases, like the Sankey diagram here or the force-directed network visualization you’ll see in the next chapter, the algorithm used to sort and arrange the graphical elements is front and center. After you have a layout that displays properly, you can play with the settings and update the elements like you’ve done with the Sankey diagram to better understand how the algorithm works visually.
First we need to add an onclick function to make the chart interactive, as shown in listing 5.8. We’ll attach this function to the <svg> element itself, but you could just as easily add a button like we did in chapter 3.
The moreLayouts() function does two things. It updates the sankey.layout() property by incrementing a variable and setting it to the new value of that variable. It also selects the graphical elements that make up your chart (the <g> and <path> elements) and redraws them with the updated settings. By using transition() and delay(), you’ll see the chart dynamically adjust.
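The two steps described above might be sketched as follows. This assumes D3 version 3, an already configured Sankey layout, and a chart whose flows have the class "link" and whose node groups have the class "node". The factory `makeMoreLayouts`, the step size, and the 500 ms duration are illustrative assumptions.

```javascript
// Sketch only: `d3` is D3 v3 and `sankey` is a configured Sankey layout.
// Assumes paths with class "link" and <g> groups with class "node".
function makeMoreLayouts(d3, sankey, step) {
  var numLayouts = 1;
  return function moreLayouts() {
    numLayouts += step;
    sankey.layout(numLayouts);           // recompute positions with more passes

    d3.selectAll(".link")
      .transition()
      .duration(500)                     // chain .delay() here to stagger the change
      .attr("d", sankey.link());         // redraw each flow path

    d3.selectAll(".node")
      .transition()
      .duration(500)
      .attr("transform", function (d) {  // move each node group to its new spot
        return "translate(" + d.x + "," + d.y + ")";
      });

    return numLayouts;
  };
}

// Usage in a page: d3.select("svg").on("click", makeMoreLayouts(d3, sankey, 20));
```

Keeping the pass counter inside a closure means each click adds to the previous total, so repeated clicks walk the layout through progressively more passes.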
Listing 5.8. Visual layout function for the Sankey diagram
The end result is a visual experience of the effect of the .layout() function. This function specifies the number of passes that the Sankey layout makes to determine the best position of the lines representing flow. You can see some snapshots of this in figure 5.26, which shows the lines sorting themselves out and getting out of each other’s way. This kind of position optimization is a common technique in information visualization, and drives the force-directed network layout that you’ll see in chapter 6. In the case of our Sankey example, even one pass of the layout provides good positioning. That’s because this is a simple dataset, and it stabilizes quickly. As you can see in figure 5.26, and as you click your chart, the layout doesn’t change much with progressively higher numbers of passes in the layout() setting.
Figure 5.26. The Sankey layout algorithm attempts to optimize the positioning of nodes to reduce overlap. The chart reflects the position of nodes after (from left to right) 1 pass, 20 passes, 40 passes, and 200 passes.
It should be clear from this example that when you update the settings of the layout, you can also update the visual display of the layout. You can use animations and transitions by simply selecting the elements and setting their drawing code or position to reflect the changed data. You’ll see much more of this in later chapters.
5.6.2. Word clouds
One of the most popular information visualization charts is also one of the most maligned: the word cloud. Also known as a tag cloud, the word cloud uses text and text size to represent the importance or frequency of words. Figure 5.27 shows a thumbnail gallery of 15 word clouds derived from text in a species biodiversity database. Oftentimes, word clouds rotate the words to set them at right angles or jumble them at random angles to improve the appearance of the graphics. Word clouds, like streamgraphs, receive criticism for being hard to read or presenting too little information. But both are surprisingly popular with audiences.
Figure 5.27. A word or tag cloud uses the size of a word to indicate its importance or frequency in a text, creating a visual summary of text. These word clouds were created by the popular online word cloud generator Wordle (www.wordle.net).
I created these word clouds using my data with the popular Java applet Wordle, which provides an easy UI and a few aesthetic customization choices. Wordle has flooded the internet with word clouds because it lets anyone create visually arresting but problematic graphics by dropping text onto a page. This caused much consternation among data visualization experts, who think word clouds are evil because they embed no analysis in the visualization and only highlight superficial data such as the quantity of words in a blog post.
But word clouds aren’t evil. First of all, they’re popular with audiences. But more than that, words are remarkably effective graphical objects. If you can identify a numerical attribute that indicates the significance of a word, then scaling the size of a word in a word cloud relays that significance to your reader.
So let’s start by assuming we have the right kind of data for a word cloud. Fortunately, we do: the top twenty words used in this chapter, with the number of times each word appears.
Listing 5.9. worddata.csv
text,frequency
layout,63
function,61
data,47
return,36
attr,29
chart,28
array,24
style,24
layouts,22
values,22
need,21
nodes,21
pie,21
use,21
figure,20
circle,19
we'll,19
zoom,19
append,17
elements,17
To create a word cloud with D3, you have to use another layout that isn’t in the core library, created by Jason Davies (who created the sentence trees using the tree layout shown in figure 5.17). The layout implements an algorithm written by Jonathan Feinberg (http://static.mrfeinberg.com/bv_ch03.pdf), so you don’t have to write the placement logic yourself. The layout, d3.layout.cloud(), is available on GitHub at https://github.com/jasondavies/d3-cloud. It requires that you define which attribute determines word size and the size of the area you want the word cloud to fill.
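To make the idea of a size-determining attribute concrete, here is a minimal sketch in plain JavaScript that parses a few rows of worddata.csv and maps frequency to a font size. The helpers `parseWords` and `makeSizeScale` and the 10 to 70 pixel range are assumptions for illustration. In a D3 version 3 page you would more likely load the file with d3.csv and build the scale with d3.scale.linear.

```javascript
// A few rows of worddata.csv, inlined so the sketch runs without a file.
const csvText = "text,frequency\nlayout,63\nfunction,61\ndata,47\nelements,17";

// Parse "text,frequency" rows into objects with a numeric frequency.
function parseWords(csv) {
  return csv.trim().split("\n").slice(1).map(function (row) {
    const parts = row.split(",");
    return { text: parts[0], frequency: Number(parts[1]) };
  });
}

// Build a linear scale from [lowest, highest] frequency to [minSize, maxSize].
function makeSizeScale(words, minSize, maxSize) {
  const freqs = words.map(function (w) { return w.frequency; });
  const lo = Math.min.apply(null, freqs);
  const hi = Math.max.apply(null, freqs);
  return function (frequency) {
    if (hi === lo) return maxSize; // all words equal: avoid dividing by zero
    return minSize + (frequency - lo) / (hi - lo) * (maxSize - minSize);
  };
}

const words = parseWords(csvText);
const wordSize = makeSizeScale(words, 10, 70);
console.log(wordSize(63), wordSize(17)); // prints 70 10
```

The most frequent word gets the largest size and the least frequent gets the smallest, which is the whole encoding a basic word cloud relies on.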
Unlike most other layouts, cloud() fires a custom event "end" that indicates it’s done calculating the most efficient use of space to generate the word cloud. The layout then passes to this event the processed dataset with the position, rotation, and size of the words. We can then run the cloud layout without ever referring to it again, and we don’t even need to assign it to a variable, as we do in the following listing. If we plan to reuse the cloud layout and adjust the settings, we assign it to a variable like with any other layout.
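The event-driven pattern described above could look roughly like this. The sketch assumes D3 version 3 with the d3-cloud plugin loaded. The wrapper `runCloud`, the callback `draw`, the helper `describeWord`, and the 500 by 500 size are hypothetical names and values chosen for illustration.

```javascript
// Sketch only: `d3` is D3 v3 with the d3-cloud plugin loaded.
// `words` is an array of { text, frequency }, `sizeFor` maps a frequency to a
// font size in pixels, and `draw` is a callback you supply.
function runCloud(d3, words, sizeFor, draw) {
  d3.layout.cloud()
    .size([500, 500])                                          // area to fill
    .words(words)
    .fontSize(function (d) { return sizeFor(d.frequency); })   // size attribute
    .on("end", draw)                                           // gets the positioned words
    .start();                                                  // begin the computation
}

// A draw callback can read the fields the layout adds to each word.
function describeWord(d) {
  return d.text + " at translate(" + d.x + "," + d.y + ") rotate(" + d.rotate + ") size " + d.size;
}

console.log(describeWord({ text: "layout", x: 1, y: 2, rotate: 0, size: 70 }));
// prints "layout at translate(1,2) rotate(0) size 70"
```

Because the layout is never stored in a variable here, the only way to get its results is through the "end" callback, which is the point the paragraph above makes.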
Listing 5.10. Creating a word cloud with d3.layout.cloud
This code creates an SVG <text> element for each word, rotated and placed according to the values the layout computed. None of our words are rotated, so we get the staid word cloud shown in figure 5.28.
Figure 5.28. A word cloud with words that are arranged horizontally
It’s simple enough to define rotation, and we only need to set some rotation value in the cloud layout’s .rotate() function:
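One way to produce a slight random rotation is sketched below in plain JavaScript. The helper `randomRotation` and the 20-degree limit are assumptions for illustration. The optional second argument exists only so the random source can be replaced when testing.

```javascript
// Map a random number in [0, 1) to an angle between -maxAngle and +maxAngle.
// The optional `random` argument defaults to Math.random.
function randomRotation(maxAngle, random) {
  const r = (random || Math.random)();
  return (r * 2 - 1) * maxAngle;
}

// With d3-cloud this could be passed as:
//   .rotate(function () { return randomRotation(20); })
const angle = randomRotation(20);
console.log(angle >= -20 && angle < 20); // prints true
```

A small limit keeps the words readable while still breaking up the rigid horizontal rows.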
At this point, we have your traditional word cloud (figure 5.29), and we can tweak the settings and colors to create anything you’ve seen on Wordle. But now let’s take a look at why word clouds get such a bad reputation. We’ve taken an interesting dataset, the most common words in this chapter, and, other than size them by their frequency, done little more than place them on screen and jostle them a bit. We have different channels for expressing data visually, and in this case the best channels that we have, besides size, are color and rotation.
Figure 5.29. A word cloud using the same worddata.csv but with words slightly perturbed by randomizing the rotation property of each word
With that in mind, let’s imagine that we have a keyword list for this book, and that each of these words is in a glossary in the back of the book. We’ll place those keywords in an array and use them to highlight the words in our word cloud that appear in the glossary. The code in the following listing also rotates shorter words 90 degrees and leaves the longer words unrotated so that they’ll be easier to read.
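The two encoding rules described above can be sketched as small plain-JavaScript functions. The keyword list, the 5-character cutoff, and the colors are hypothetical choices for illustration, not values confirmed by the text.

```javascript
// Hypothetical glossary keywords; a real list would come from the book's glossary.
const keywords = ["layout", "zoom", "circle", "style", "append", "attr"];

// Longer words stay horizontal; shorter ones are turned 90 degrees.
// The 5-character cutoff is an assumption for illustration.
function rotationFor(word) {
  return word.text.length > 5 ? 0 : 90;
}

// Keywords get a highlight color; everything else gets a neutral one.
function fillFor(word) {
  return keywords.indexOf(word.text) > -1 ? "red" : "black";
}

console.log(rotationFor({ text: "function" }), fillFor({ text: "zoom" }));
// prints "0 red"
```

With d3-cloud, a function like rotationFor would be passed to .rotate(), and one like fillFor would be used in a .style("fill", ...) call on the <text> elements inside the drawing callback.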
Listing 5.11. Word cloud layout with key word highlighting
The word cloud in figure 5.30 is fundamentally the same, but instead of using color and rotation for aesthetics, we used them to encode information in the dataset. You can read about more controls over the format of your word cloud, including selecting fonts and padding, in the layout’s documentation at https://www.jasondavies.com/wordcloud/about/.
Figure 5.30. This word cloud highlights keywords and places longer words horizontally and shorter words vertically.
Layouts like the word cloud aren’t suitable for as wide a variety of data as some other layouts, but because they’re so easy to deploy and customize, you can combine them with other charts to represent the multiple facets of your data. You’ll see this kind of synchronized chart in chapter 9.
5.7. Summary
In this chapter, we took an in-depth look at D3 layout structure and experimented with several datasets. In doing so, you learned how to use layouts not just to draw one particular chart, but also variations on that chart. You also experimented with interactivity and animation.
In particular, we covered
· Layout structure and functions common to D3 core layouts
· Arc and diagonal generators for drawing arcs and connecting links
· How to make pie charts and donut charts using the pie layout
· Using tweens to better animate the graphical transition for arc segments (pie pieces)
· How to create circle-packing diagrams and format them effectively using the pack layout
· How to create vertical, horizontal, and radial dendrograms using the tree layout
· How to create stacked area charts, streamgraphs, and stacked bar charts using the stack layout
· How to use noncore D3 layouts to build Sankey diagrams and word clouds
Now that you understand layouts in general, in the next chapter we’ll focus on how to represent networks. We’ll spend most of our time working with the force-directed layout, which has much in common with general layouts but is distinguished from them because it’s designed to be interactive and animated. Because the chapter deals with network data, like the kind you used for the Sankey layout in this chapter, you’ll also learn a few tips and tricks for processing and measuring networks.