Data Science For Dummies (2016)
Part 3
Creating Data Visualizations That Clearly Communicate Meaning
IN THIS PART …
Explore the principles of data visualization design.
Use D3.js resources to create dynamic data visualizations.
Work with web-based data visualization applications.
Create maps using spatial data.
Chapter 9
IN THIS CHAPTER
Laying out the basic types of data visualizations
Choosing the perfect data visualization type for the needs of your audience
Picking the perfect design style
Adding context
Crafting clear and powerful visual messages with the right data graphic
Any standard definition of data science will specify that its purpose is to help you extract meaning and value from raw data. Finding and deriving insights from raw data is at the crux of data science, but these insights mean nothing if you don’t know how to communicate your findings to others. Data visualization is an excellent means by which you can visually communicate your data’s meaning. To design visualizations well, however, you must know and truly understand the target audience and the core purpose for which you’re designing. You must also understand the main types of data graphics that are available to you, as well as the significant benefits and drawbacks of each. In this chapter, I present you with the core principles in data visualization design.
A data visualization is a visual representation that’s designed for the purpose of conveying the meaning and significance of data and data insights. Since data visualizations are designed for a whole spectrum of different audiences, different purposes, and different skill levels, the first step to designing a great data visualization is to know your audience. Audiences come in all shapes, forms, and sizes. You could be designing for the young and edgy readers of Rolling Stone magazine or to convey scientific findings to a research group. Your audience might consist of board members and organizational decision makers or a local grassroots organization.
Data Visualizations: The Big Three
Every audience is composed of a unique class of consumers, each with unique data visualization needs, so you have to clarify for whom you’re designing. I first introduce the three main types of data visualizations, and then I explain how to pick the one that best meets the needs of your audience.
Data storytelling for organizational decision makers
Sometimes you have to design data visualizations for a less technical-minded audience, perhaps in order to help members of this audience make better-informed business decisions. The purpose of this type of visualization is to tell your audience the story behind the data. In data storytelling, the audience depends on you to make sense of the data behind the visualization and then turn useful insights into visual stories that they can understand.
With data storytelling, your goal should be to create a clutter-free, highly focused visualization so that members of your audience can quickly extract meaning without having to make much effort. These visualizations are best delivered in the form of static images, but more adept decision makers may prefer to have an interactive dashboard that they can use to do a bit of exploration and what-if modeling.
Data showcasing for analysts
If you’re designing for a crowd of logical, calculating analysts, you can create data visualizations that are rather open-ended. The purpose of this type of visualization is to help audience members visually explore the data and draw their own conclusions.
When using data showcasing techniques, your goal should be to display a lot of contextual information that supports audience members in making their own interpretations. These visualizations should include more contextual data and less conclusive focus so that people can get in and analyze the data for themselves, and then draw their own conclusions. These visualizations are best delivered as static images or dynamic, interactive dashboards.
Designing data art for activists
You could be designing for an audience of idealists, dreamers, and change-makers. When designing for this audience, you want your data visualization to make a point! You can assume that typical audience members aren’t overly analytical. What they lack in math skills, however, they more than compensate for in solid convictions.
These people look to your data visualization as a vehicle by which to make a statement. When designing for this audience, data art is the way to go. The main goal in using data art is to entertain, to provoke, to annoy, or to do whatever it takes to make a loud, clear, attention-demanding statement. Data art has little to no narrative and offers no room for viewers to form their own interpretations.
Data scientists have an ethical responsibility to always represent data accurately. A data scientist should never distort the message of the data to fit what the audience wants to hear, not even for data art! Nontechnical audiences often can’t spot possible distortions, let alone question them. They rely on the data scientist to provide honest and accurate representations, which raises the level of ethical responsibility that the data scientist must assume.
Designing to Meet the Needs of Your Target Audience
To make a functional data visualization, you must get to know your target audience and then design precisely for their needs. But to make every design decision with your target audience in mind, you need to take a few steps to make sure that you truly understand your data visualization’s target consumers.
To gain the insights you need about your audience and your purpose, follow this process:
1. Brainstorm.
Think about a specific member of your visualization’s audience, and make as many educated guesses as you can about that person’s motivations.
Give this (imaginary) audience member a name and a few other identifying characteristics. I always imagine a 45-year-old divorced mother of two named Brenda.
2. Define the purpose of your visualization.
Narrow the purpose of the visualization by deciding exactly what action you want audience members to take, or what outcome you want them to reach, as a result of the visualization.
3. Choose a functional design.
Review the three main data visualization types (discussed earlier in this chapter) and decide which type can best help you achieve your intended outcome.
The following sections spell out this process in detail.
Step 1: Brainstorm (about Brenda)
To brainstorm properly, pull out a sheet of paper and think about your imaginary audience member (Brenda) so that you can create a more functional and effective data visualization. Answer the following questions to help you better understand her, and thus better understand and design for your target audience.
Form a picture of what Brenda’s average day looks like — what she does when she gets out of bed in the morning, what she does over her lunch hour, and what her workplace is like. Also consider how Brenda will use your visualization.
To form a comprehensive view of who Brenda is and how you can best meet her needs, ask these questions:
· Where does Brenda work? What does she do for a living?
· What kind of technical education or experience, if any, does she have?
· How old is Brenda? Is she married? Does she have children? What does she look like? Where does she live?
· What social, political, cause-based, or professional issues are important to Brenda? What does she think of herself?
· What problems and issues does Brenda have to deal with every day?
· How does your data visualization help solve Brenda’s work problems or her family problems? How does it improve her self-esteem?
· Through what avenue will you present the visualization to Brenda — for example, over the Internet or in a staff meeting?
· What does Brenda need to be able to do with your data visualization?
Say that Brenda is the manager of the zoning department in Irvine County. She is 45 years old and a single divorcee with two children who are about to start college. She is deeply interested in local politics and eventually wants to be on the county’s board of commissioners. To achieve that position, she has to get some major “oomph” on her county management résumé. Brenda derives most of her feelings of self-worth from her job and her keen ability to make good management decisions for her department.
Until now, Brenda has been forced to manage her department according to her gut-feel intuition, backed by a few disparate business systems reports. She is not extraordinarily analytical, but she knows enough to understand what she sees. The problem is that Brenda hasn’t had the visualization tools that are necessary to display all the relevant data she should be considering. Because she has neither the time nor the skill to code something herself, she’s been left in the lurch. Brenda is excited that you’ll be attending next Monday’s staff meeting to present the data visualization alternatives available to help her get under way in making data-driven management decisions.
Step 2: Define the purpose
After you brainstorm about the typical audience member (see the preceding section), you can much more easily pinpoint exactly what you’re trying to achieve with the data visualization. Are you attempting to get consumers to feel a certain way about themselves or the world around them? Are you trying to make a statement? Are you seeking to influence organizational decision makers to make good business decisions? Or do you simply want to lay all the data out there, for all viewers to make sense of, and deduce from it what they will?
Return to the hypothetical Brenda: What decisions are you trying to help her make, and what processes are you trying to help her manage? Well, you need to make sense of her data, and then you need to present it to her in a way that she can clearly understand. What’s happening within the inner mechanics of her department? Using your visualization, you seek to guide Brenda into making the most prudent and effective management choices.
Step 3: Choose the most functional visualization type for your purpose
Keep in mind that you have three main types of visualization from which to choose: data storytelling, data art, and data showcasing. If you’re designing for organizational decision makers, you’ll most likely use data storytelling to directly tell your audience what their data means with respect to their line of business. If you’re designing for a social justice organization or a political campaign, data art can best make a dramatic and effective statement with your data. Lastly, if you’re designing for engineers, scientists, or statisticians, stick with data showcasing so that these analytical types have plenty of room to figure things out on their own.
Back to Brenda — because she’s not extraordinarily analytical and because she’s depending on you to help her make excellent data-driven decisions, you need to employ data storytelling techniques. Create either a static or interactive data visualization with some, but not too much, context. The visual elements of the design should tell a clear story so that Brenda doesn’t have to work through tons of complexities to get the point of what you’re trying to tell her about her data and her line of business.
Picking the Most Appropriate Design Style
Analytical types might say that the only purpose of a data visualization is to convey numbers and facts via charts and graphs, with no beauty or design needed. But more artistic-minded folks may insist that they have to feel something in order to truly understand it. Truth be told, a good data visualization is neither artless and dry nor completely abstract in its artistry. Rather, its beauty and design lie somewhere on the spectrum between these two extremes.
To choose the most appropriate design style, you must first consider your audience (discussed earlier in this chapter) and then decide how you want them to respond to your visualization. If you’re looking to entice the audience into taking a deeper, more analytical dive into the visualization, employ a design style that induces a calculating and exacting response in its viewers. But if you want your data visualization to fuel your audience’s passion, use an emotionally compelling design style instead.
Inducing a calculating, exacting response
If you’re designing a data visualization for corporate types, engineers, scientists, or organizational decision makers, keep the design simple and sleek, using the data showcasing or data storytelling visualization. To induce a logical, calculating feel in your audience, include a lot of bar charts, scatter plots, and line charts. Color choices here should be rather traditional and conservative. The look and feel should scream “corporate chic.” (See Figure 9-1.) Visualizations of this style are meant to quickly and clearly communicate what’s happening in the data — direct, concise, and to the point. The best data visualizations of this style convey an elegant look and feel.
FIGURE 9-1: This design style conveys a calculating and exacting feel.
Eliciting a strong emotional response
If you’re designing a data visualization to influence or persuade people, incorporate design artistry that invokes an emotional response in your target audience. These visualizations usually fall under the data art category, but an extremely creative data storytelling piece could also inspire this sort of strong emotional response. Emotionally provocative data visualizations often support the stance of one side of a social, political, or environmental issue. These data visualizations include fluid, artistic design elements that flow and meander, as shown in Figure 9-2. Additionally, rich, dramatic color choices can influence the emotions of the viewer. This style of data visualization leaves a lot of room for artistic creativity and experimentation.
FIGURE 9-2: This design style is intended to evoke an emotional response.
Keep artistic elements relevant — and recognize when they’re likely to detract from the impression you want to make, particularly when you’re designing for analytical types.
Choosing How to Add Context
Adding context helps people understand the value and relative significance of the information your data visualization conveys. Adding context to calculating, exacting data visualization styles helps to create a sense of relative perspective. In pure data art, you should omit context because, with data art, you’re only trying to make a single point and you don’t want to add information that would distract from that point.
Creating context with data
In data showcasing, you should include relevant contextual data for the key metrics shown in your data visualization. For example, suppose you’re creating a data visualization that describes conversion rates for e-commerce sales. The key metric would be the percentage of users who convert to customers by making a purchase. Contextual data that’s relevant to this metric might include shopping cart abandonment rates, average number of sessions before a user makes a purchase, average number of pages visited before making a purchase, or specific pages that are visited before a customer decides to convert. This sort of contextual information helps viewers understand the “why and how” behind sales conversions.
Adding contextual data tends to decentralize the focus of a data visualization, so add this data only in visualizations that are intended for an analytical audience. These folks are in a better position to assimilate the extra information and use it to draw their own conclusions; with other types of audiences, context is only a distraction.
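To make the e-commerce example concrete, here is a minimal Python sketch of how the key metric and two pieces of contextual data might be computed before you chart them. The session records, field layout, and function names are all hypothetical, invented only for illustration. The "sessions per buyer" figure is a simplified stand-in for the average number of sessions before a purchase.

```python
# Hypothetical session records: (user_id, added_to_cart, purchased).
# The layout and values are invented for illustration only.
sessions = [
    ("u1", True, True),
    ("u1", False, False),
    ("u2", True, False),
    ("u3", False, False),
    ("u4", True, True),
    ("u5", True, False),
]

def conversion_rate(records):
    """Key metric: share of distinct users who made at least one purchase."""
    users = {user for user, _, _ in records}
    buyers = {user for user, _, purchased in records if purchased}
    return len(buyers) / len(users)

def cart_abandonment_rate(records):
    """Context: share of sessions with a cart that ended without a purchase."""
    carts = [purchased for _, added, purchased in records if added]
    abandoned = sum(1 for purchased in carts if not purchased)
    return abandoned / len(carts)

def avg_sessions_per_buyer(records):
    """Context: average number of sessions logged by users who bought."""
    buyers = {user for user, _, purchased in records if purchased}
    counts = [sum(1 for user, _, _ in records if user == b) for b in buyers]
    return sum(counts) / len(counts)

key_metric = conversion_rate(sessions)
context = {
    "cart_abandonment_rate": cart_abandonment_rate(sessions),
    "avg_sessions_per_buyer": avg_sessions_per_buyer(sessions),
}
print(f"Conversion rate: {key_metric:.0%}")
for name, value in context.items():
    print(f"  context: {name} = {value:.2f}")
```

In a data showcasing piece, you might plot the conversion rate as the headline figure and place the contextual values alongside it so that analytical viewers can dig into the reasons behind the number.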
Creating context with annotations
Sometimes you can more appropriately create context by including annotations that provide a header and a small description of the context of the data that’s shown. (See Figure 9-3.) This method of creating context is most appropriate for data storytelling or data showcasing. Good annotation is helpful to analytical and non-analytical audiences alike.
Source: Lynda.com, Python for DS
FIGURE 9-3: Using annotation to create context.
Creating context with graphical elements
Another effective way to create context in a data visualization is to include graphical elements that convey the relative significance of the data. Such graphical elements include moving average trend lines, single-value alerts, target trend lines (as shown in Figure 9-4), or predictive benchmarks.
FIGURE 9-4: Using graphical elements to create context.
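If you want to see what sits behind two of these graphical elements, the following Python sketch computes a trailing moving average (the values you would draw as a trend line) and a simple below-target alert. The sales figures, the target value, and the function name are hypothetical, and a trailing window is only one of several ways to define a moving average.

```python
def moving_average(values, window):
    """Trailing moving average; None until a full window of values exists."""
    averages = []
    for i in range(len(values)):
        if i + 1 < window:
            averages.append(None)
        else:
            chunk = values[i + 1 - window : i + 1]
            averages.append(sum(chunk) / window)
    return averages

# Hypothetical weekly sales figures and a hypothetical target value.
weekly_sales = [10, 12, 11, 15, 18, 17, 21]
target = 16

# Values for a moving average trend line, drawn over the raw series.
trend = moving_average(weekly_sales, window=3)

# Single-value alert: the week numbers in which sales fell below the target.
below_target = [week for week, v in enumerate(weekly_sales, start=1) if v < target]

print(trend)
print(below_target)
```

The target itself would appear on the chart as a horizontal line at the target value, which gives viewers an immediate sense of how each data point measures up.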
KNOWING WHEN TO GET PERSUASIVE
Persuasive design needs to confirm or refute a point. It doesn’t leave room for audience interpretation. Persuasive data visualizations generally invoke a strong emotional response. Persuasive design is appropriate for both data art and data storytelling, but because it doesn’t leave much room for audience interpretation, this type of design is not helpful for data showcasing. Use persuasive design when you’re making data visualizations on behalf of social, political, or cause-based organizations.
Selecting the Appropriate Data Graphic Type
Your choice of data graphic type can make or break a data visualization. Because you probably need to represent many different facets of your data, you can mix and match among the different graphical classes and types. Even among the same class, certain graphic types perform better than others; therefore, create test representations to see which graphic type conveys the clearest and most obvious message.
This book introduces only the most commonly used graphic types (among hundreds that are available). Don’t wander too far off the beaten path. The further you stray from familiar graphics, the harder it becomes for people to understand the information you’re trying to convey.
Pick the graphic type that most dramatically displays the data trends you’re seeking to reveal. You can display the same data trend in many ways, but some methods deliver a visual message more effectively than others. The point is to deliver a clear, comprehensive visual message to your audience so that people can use the visualization to help them make sense of the data presented.
Among the most useful types of data graphics are standard chart graphics, comparative graphics, statistical plots, topology structures, and spatial plots and maps. The next few sections take a look at each type in turn.
Standard chart graphics
When making data visualizations for an audience of non-analytical people, stick to standard chart graphics. The more foreign and complex your graphics, the harder it is for non-analytical people to understand them. And not all standard chart types are boring — you have quite a variety to choose from, as the following list makes clear:
· Area: Area charts (see Figure 9-5) are a fun yet simple way to visually compare and contrast attribute values. You can use them to effectively tell a visual story when you’ve chosen data storytelling or data showcasing.
· Bar: Bar charts (see Figure 9-6) are a simple way to visually compare and contrast values of parameters in the same category. Bar charts are best for data storytelling and data showcasing.
· Line: Line charts (see Figure 9-7) most commonly show changes in time-series data, but they can also plot relationships between two, or even three, parameters. Line charts are so versatile that you can use them in all data visualization design types.
· Pie: Pie chart graphics (see Figure 9-8), which are among the most commonly used, provide a simple way to compare values of parameters in the same category. Their simplicity, however, can be a double-edged sword; deeply analytical people tend to scoff at them, precisely because they seem so simple, so you may want to consider omitting them from data-showcasing visualizations.
Source: Lynda.com, Python for DS
FIGURE 9-5: An area chart in three dimensions.
FIGURE 9-6: A bar chart.
Source: Lynda.com, Python for DS
FIGURE 9-7: A line chart.
Source: Lynda.com, Python for DS
FIGURE 9-8: A pie chart.
Comparative graphics
A comparative graphic displays the relative value of multiple parameters in a shared category or the relatedness of parameters within multiple shared categories. The core difference between comparative graphics and standard graphics is that comparative graphics offer you a way to simultaneously compare more than one parameter and category. Standard graphics, on the other hand, let you view and compare the values of only one parameter within a single category. Comparative graphics are geared for an audience that’s at least slightly analytical, so you can easily use these graphics in either data storytelling or data showcasing. Visually speaking, comparative graphics are more complex than standard graphics.
This list shows a few different types of popular comparative graphics:
· Bubble plots (see Figure 9-9) use bubble size and color to demonstrate the relationship between three parameters of the same category.
· Packed circle diagrams (see Figure 9-10) use both circle size and clustering to visualize the relationships between categories, parameters, and relative parameter values.
· Gantt charts (see Figure 9-11) are bar charts that use horizontal bars to visualize scheduling requirements for project management purposes. This type of chart is useful when you’re developing a plan for project delivery. It’s also helpful in determining the sequence in which tasks must be completed in order to meet delivery timelines.
Choose Gantt charts for project management and scheduling.
· Stacked charts (see Figure 9-12) are used to compare multiple attributes of parameters in the same category. To ensure that it doesn’t become difficult to make a visual comparison, resist the urge to include too many parameters.
· Tree maps aggregate parameters of like categories and then use area to show the relative size of each category compared to the whole, as shown in Figure 9-13.
· Word clouds use size and color to show the relative difference in frequency of words used in a body of text, as shown in Figure 9-14. Colors are generally employed to indicate classifications of words by usage type.
FIGURE 9-9: A bubble chart.
FIGURE 9-10: A packed circle diagram.
FIGURE 9-11: A Gantt chart.
FIGURE 9-12: A stacked chart.
FIGURE 9-13: A tree map.
FIGURE 9-14: A simple word cloud.
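The sizing logic behind a word cloud is simple enough to sketch in a few lines of Python. This example counts word frequencies and scales them linearly to font sizes. The function name, the size range, and the sample sentence are assumptions made for illustration, and a real word cloud tool would usually also filter out common words such as "a" and "the".

```python
import re
from collections import Counter

def word_sizes(text, min_size=10, max_size=40, top_n=5):
    """Map the most frequent words to font sizes, scaled linearly by count."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words).most_common(top_n)
    highest = counts[0][1]
    lowest = counts[-1][1]
    spread = max(highest - lowest, 1)  # avoid dividing by zero when counts tie
    return {
        word: min_size + (count - lowest) * (max_size - min_size) / spread
        for word, count in counts
    }

# Hypothetical sample text, invented for illustration.
sample = "data tells a story and good data design lets the data speak"
sizes = word_sizes(sample, top_n=3)
for word, size in sizes.items():
    print(f"{word}: font size {size:.0f}")
```

The most frequent word gets the largest font size, and the least frequent of the selected words gets the smallest.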
Statistical plots
Statistical plots, which show the results of statistical analyses, are usually useful only to a deeply analytical audience (and aren’t useful for making data art). Your statistical-plot choices are described in this list:
· Histogram: A diagram that plots a variable’s frequency and distribution as rectangles on a chart, a histogram (see Figure 9-15) can help you quickly get a handle on the distribution and frequency of data in a dataset.
Get comfortable with histograms. You’ll see a lot of them in the course of making statistical analyses.
· Scatter plot: A terrific way to quickly uncover significant trends and outliers in a dataset, a scatter plot plots data points according to their x- and y-values in order to visually reveal any significant patterns. (See Figure 9-16.) If you use data storytelling or data showcasing, start by generating a quick scatter plot to get a feel for areas in the dataset that may be interesting: areas that could potentially uncover significant relationships or yield persuasive stories.
· Scatter plot matrix: A good choice when you want to explore the relationships between several variables, a scatter plot matrix places its scatter plots in a visual series that shows correlations between multiple variables, as shown in Figure 9-17. Discovering and verifying relationships between variables can help you to identify clusters among variables and identify oddball outliers in your dataset.
Source: Lynda.com, Python for DS
FIGURE 9-15: A histogram.
Source: Lynda.com, Python for DS
FIGURE 9-16: A scatter plot.
Source: Lynda.com, Python for DS
FIGURE 9-17: A scatter plot matrix.
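To get a feel for what a histogram actually counts, consider this small Python sketch, which sorts values into equal-width bins and prints a rough text version of the chart. The dataset, the bin count, and the function name are hypothetical, and equal-width binning is only one of several possible binning choices.

```python
def histogram_counts(values, bins):
    """Count values into equal-width bins spanning min(values)..max(values)."""
    low, high = min(values), max(values)
    width = (high - low) / bins or 1  # fall back to 1 if all values are equal
    counts = [0] * bins
    for v in values:
        # The maximum value would land one past the end, so clamp it
        # into the last bin.
        index = min(int((v - low) / width), bins - 1)
        counts[index] += 1
    edges = [low + i * width for i in range(bins + 1)]
    return edges, counts

# Hypothetical measurements, invented for illustration.
ages = [21, 22, 25, 31, 33, 34, 35, 41, 48, 60]
edges, counts = histogram_counts(ages, bins=4)
for left, right, count in zip(edges, edges[1:], counts):
    print(f"{left:5.1f} to {right:5.1f} | {'#' * count}")
```

Each rectangle in a histogram corresponds to one of these bins, with its height set by the count.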
Topology structures
Topology is the practice of using geometric structures to describe and model the relationships and connectedness between entities and variables in a dataset. You need to understand basic topology structures so that you can accurately structure your visual display to match the fundamental underlying structure of the concepts you’re representing.
The following list describes a series of topological structures that are popular in data science:
· Linear topological structures: Representing a pure one-to-one relationship, linear topological structures are often used in data visualizations that depict time-series flow patterns. Any process that can occur only by way of a sequential series of dependent events is linear (see Figure 9-18), and you can effectively represent it by using this underlying topological structure.
· Graph models: These kinds of models underlie group communication networks and traffic flow patterns. You can use graph topology to represent many-to-many relationships (see Figure 9-19), like those that form the basis of social media platforms.
In a many-to-many relationship structure, each variable or entity has more than one link to the other variables or entities in that same dataset.
· Tree network topology: This topology represents a hierarchical classification, where a network is distributed in a top-down order — nodes act as receivers and distributors of connections, and lines represent the connections between nodes. End nodes act only as receivers and not as distributors. (See Figure 9-20.) Hierarchical classification underlies clustering and machine learning methodologies in data science. Tree network structures can represent one-to-many relationships, such as the ones that underlie a family tree or a taxonomy structure.
FIGURE 9-18: A linear topology.
FIGURE 9-19: A graph mesh network topology.
FIGURE 9-20: A hierarchical tree topology.
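One way to see the difference between these three structures is to write each one down as a plain Python data structure. In the sketch below, every name and link is hypothetical, and the helper functions are illustrative only. They check the properties described in this section: end nodes that only receive, and entities that each have more than one link.

```python
# Linear (one-to-one): each step has exactly one successor.
linear = ["collect", "clean", "analyze", "visualize"]
linear_links = list(zip(linear, linear[1:]))

# Graph (many-to-many): undirected links stored as a set of pairs.
graph_links = {
    ("ann", "bob"), ("ann", "cy"), ("bob", "cy"), ("cy", "dee"), ("dee", "ann"),
}

# Tree (one-to-many): each node maps to the nodes it distributes to.
tree = {
    "animals": ["mammals", "birds"],
    "mammals": ["dogs", "cats"],
    "birds": ["owls"],
    "dogs": [], "cats": [], "owls": [],
}

def end_nodes(tree):
    """End nodes receive a connection but distribute none."""
    return sorted(node for node, children in tree.items() if not children)

def link_counts(links):
    """Number of links touching each node in an undirected graph."""
    counts = {}
    for a, b in links:
        counts[a] = counts.get(a, 0) + 1
        counts[b] = counts.get(b, 0) + 1
    return counts

def is_many_to_many(links):
    """True when every entity has more than one link to other entities."""
    return all(n > 1 for n in link_counts(links).values())

print(linear_links)
print(end_nodes(tree))
print(is_many_to_many(graph_links))
```

Matching your data to one of these shapes first makes it easier to pick a visual layout that reflects the underlying relationships.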
Spatial plots and maps
Spatial plots and maps are two different ways of visualizing spatial data. A map is just a plain figure that represents the location, shape, and size of features on the face of the earth. A spatial plot, which is visually more complex than a map, shows the values for, and location distribution of, a spatial feature’s attributes.
The following list describes a few types of spatial plots and maps that are commonly used in data visualization:
· Choropleth: Despite its fancy name, a choropleth map is really just spatial data plotted out according to area boundary polygons rather than by point, line, or raster coverage. To better understand what I mean, look at Figure 9-21. In this map, each state boundary represents an area boundary polygon. The color and shade of the area within each boundary represent the relative value of the attribute for that state: red areas have a higher attribute value, and blue areas have a lower attribute value.
· Point: Composed of spatial data that is plotted out according to specific point locations, a point map presents data in a graphical point form (see Figure 9-22) rather than in a polygon, line, or raster surface format.
· Raster surface: This spatial map can be anything from a satellite image map to a surface coverage with values that have been interpolated from underlying spatial data points. (See Figure 9-23.)
Source: Lynda.com, Python for DS
FIGURE 9-21: A choropleth map.
FIGURE 9-22: A point map.
FIGURE 9-23: A raster surface map.
Whether you’re a data visualization designer or a consumer, be aware of some common pitfalls in data visualization. Simply put, a data visualization can be misleading if it isn’t constructed correctly. Common problems include pie charts that don’t add up to 100 percent, bar charts with a scale that starts in a strange place, and multicolumn bar charts with vertical axes that don’t match.
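A few of these pitfalls can be checked automatically before a chart goes out the door. The Python sketch below shows one possible set of sanity checks. The function names and the rounding tolerance are hypothetical, and treating any non-zero starting point on a bar chart as "a strange place" is an assumption on my part; it is a common rule of thumb, not the only reading.

```python
def pie_sums_to_whole(shares, tolerance=0.5):
    """Check that pie-chart percentages add up to 100, allowing for rounding."""
    return abs(sum(shares) - 100) <= tolerance

def bar_axis_starts_at_zero(axis_min):
    """Bar lengths are directly comparable only if the value axis starts at 0."""
    return axis_min == 0

def axes_match(axis_ranges):
    """Check that side-by-side bar charts share one (min, max) axis range."""
    return len(set(axis_ranges)) == 1

# Hypothetical chart settings, invented for illustration.
print(pie_sums_to_whole([40, 35, 25]))   # shares that add up to 100
print(pie_sums_to_whole([40, 35, 30]))   # shares that add up to 105
print(bar_axis_starts_at_zero(50))       # axis that begins at 50
print(axes_match([(0, 100), (0, 50)]))   # two panels with different scales
```

Checks like these catch only mechanical errors, so you still need to look at the finished visualization with a critical eye.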
Choosing a Data Graphic
When you want to craft clear and powerful visual messages with the appropriate data graphic, follow the three steps in this section to experiment and determine whether the one you choose can effectively communicate the meaning of the data:
1. Ask the questions that your data visualization should answer, and then examine the visualization to determine whether the answers to those questions jump out at you.
Before thinking about what graphics to use, first consider the questions you want to answer for your audience. In a marketing setting, the audience may want to know why their overall conversion rates are low. Or, if you’re designing for business managers, they may want to know why service times are slower in certain customer service areas than in others.
Though many data graphic types can fulfill the same purpose, whatever you choose, ensure that your choices clearly answer the exact and intended questions.
2. Consider users and media in determining where the data visualization will be used.
Ask who will consume your data visualization, and using which medium, and then determine whether your choice of data graphics makes sense in that context. Will an audience of scientists consume it, or will you use it for content marketing to generate Internet traffic? Do you want to use it to prove a point in a boardroom? Or do you want to support a story in an upcoming newspaper publication? Pick data graphic types that are appropriate for the intended consumers and for the medium through which they’ll consume the visualization.
3. Examine the data visualization a final time to ensure that its message is clearly conveyed using only the data graphic.
If viewers have to stretch their minds to make a visual comparison of data trends, you probably need to use a different graphic type. If they have to read numbers or annotations to get the gist of what’s happening, that’s not good enough. Try out some other graphic forms to see whether you can convey the visual message more effectively.
Just close your eyes and ask yourself the questions that you seek to answer through your data visualization. Then open your eyes and look at your visualization again. Do the answers jump out at you? If not, try another graphic type.