Data Science For Dummies (2016)
Part 6
The Part of Tens
Chapter 23
Ten Free Data Science Tools and Applications
IN THIS CHAPTER
Getting creative with free R packages for data visualization
Using open-source tools for scraping, collecting, and handling data
Analyzing your data with free open-source tools
Having fun with visualizations in advanced open-source applications
Because visualizations are a vitally important part of the data scientist’s toolkit, it should come as no surprise that you can use quite a few free web-based tools to make visualizations in data science. (Check out Chapter 11 for links to a few.) With such tools, you can leverage the brain’s capacity to quickly absorb visual information. Because data visualizations are an effective means of communicating data insights, many tool and application developers work hard to ensure that the platforms they design are simple enough for even beginners to use. These simple applications can sometimes be useful to more advanced data scientists, but at other times, data science experts simply need more technical tools to help them delve deeper into datasets.
In this chapter, I present ten free web-based applications that you can use to complete data science tasks that are more advanced than the ones described in Chapter 11. You can download and install many of these applications on your personal computer, and most of the downloadable applications are available for multiple operating systems.
Always read and understand the licensing requirements of any app you use. Protect yourself by determining how you’re allowed to use the products you create with that app.
Making Custom Web-Based Data Visualizations with Free R Packages
I discuss some easy-to-use web apps for data visualization in Chapter 11, so you may be wondering why I’m presenting yet another set of packages and tools that are useful for creating cool data visualizations. Here’s the simple answer: The tools I present in this section require you to code using the R statistical programming language, which I present in Chapter 15. Although you may not have much fun coding things up yourself, with these packages and tools, you can create results that are more customized for your needs. In the following sections, I discuss using Shiny, rCharts, and rMaps to create neat-looking web-based data visualizations.
Getting Shiny by RStudio
Not long ago, you needed to know how to use a statistics-capable programming language like R if you wanted to do any kind of serious data analysis. And if you needed to make interactive web visualizations, you’d have to know how to code in languages like JavaScript or PHP. Of course, if you wanted to do both simultaneously, you’d have to know how to code in two or three additional programming languages. In other words, web-based data visualization based on statistical analyses was a cumbersome task.
The good news is that things have changed. Due to the work of a few dedicated developers, the walls between analysis and presentation have crumbled. After the 2012 launch of RStudio’s Shiny package (http://shiny.rstudio.com), both statistical analysis and web-based data visualization can be carried out in the same framework.
RStudio (already, by far, the most popular integrated development environment, or IDE, for R) developed the Shiny package to allow R users to create web apps. Web apps made in Shiny run on a web server and are interactive: You can move sliders, select check boxes, or click the data itself. Because these apps run on a server, they’re considered live: When you make changes to the underlying data, those changes are automatically reflected in the appearance of the data visualization. Web apps created in Shiny are also reactive; in other words, their output updates instantly in response to a user interaction, without the user having to click a Submit button.
If you want to quickly use a few lines of code to instantly generate a web-based data visualization application, use R’s Shiny package. What’s more, if you want to customize your web-based data visualization app to be more aesthetically appealing, you can do that by simply editing the HTML, CSS, and JavaScript that underlies the Shiny application.
Because Shiny produces server-side web apps, you need a server host and the know-how to host your web app on a server before you can make useful web apps by using the package.
RStudio runs the public hosting service ShinyApps.io (www.shinyapps.io). You can use that service to host an app for free, or you can pay to host there if your requirements are more resource-intensive. The most basic level of paid service costs $39 per month and promises you 250 hours of application runtime per month.
Charting with rCharts
Although R has always been famous for its beautiful static visualizations, only recently has it become possible to use R to produce web-based interactive data visualizations.
Things changed dramatically with the advent of rCharts (http://ramnathv.github.io/rCharts). The rCharts open-source package for R takes your data and parameters as input and then quickly converts them to a JavaScript code block output. Code block outputs from rCharts can use one of many popular JavaScript data visualization libraries, including NVD3, Highcharts, Rickshaw, xCharts, Polychart, and Morris. To see some examples of data visualizations created by using rCharts, check out the data visualizations located on its GitHub page.
Mapping with rMaps
rMaps (http://rmaps.github.io) is the brother of rCharts. Both of these open-source R packages were crafted by Ramnath Vaidyanathan. Using rMaps, you can create animated or interactive choropleths, heat maps, or even maps that contain annotated location droplets (such as those found in the JavaScript mapping libraries Leaflet, Crosslet, and DataMaps).
rMaps allows you to create a spatial data visualization containing interactive sliders that users can move to select the data range they want to see.
If you’re an R user and you’re accustomed to using the simple R Markdown syntax to create web pages, you’ll be happy to know that you can easily embed both rCharts and rMaps in R Markdown.
If you prefer Python to R, you aren’t being left out of this trend of creating interactive web-based visualizations within one platform. Python users can use server-side web app tools such as Flask (a less user-friendly but more powerful tool than Shiny) and the Bokeh and Mpld3 modules to create client-side JavaScript versions of Python visualizations. The Plotly tool has a Python application programming interface (API), as well as ones for R, MATLAB, and Julia, that you can use to create web-based interactive visualizations directly from your Python IDE or command line. (Check out Flask at http://flask.pocoo.org, Bokeh at http://bokeh.pydata.org, Mpld3 at http://mpld3.github.io, and Plotly at https://plot.ly.)
Examining Scraping, Collecting, and Handling Tools
Whether you need data to support a business analysis or an upcoming journalism piece, web-scraping can help you track down interesting and unique data sources. In web-scraping, you set up automated programs and then let them scour the web for the data you need. I mention the general ideas behind web-scraping in Chapter 18, but in the following sections, I elaborate a bit more on the free tools you can use to scrape data or images, including import.io, ImageQuilts, and DataWrangler.
Scraping data with import.io
Have you ever tried to copy and paste a table from the web into a Microsoft Office document and then not been able to get the columns to line up correctly? Frustrating, right? This is exactly the pain point that import.io was designed to address.
import.io — pronounced “import-eye-oh” — is a free desktop application that you can use to painlessly copy, paste, clean, and format any part of a web page with only a few clicks of the mouse. You can even use import.io to automatically crawl and extract data from multipage lists. (Check out import.io at https://import.io.)
Using import.io, you can scrape data from a simple or complicated series of web pages:
· Simple: Access the web pages through simple hyperlinks that appear on Page 1, Page 2, Page 3.
· Complicated: Fill in a form or choose from a drop-down list, and then submit your scraping request to the tool.
import.io’s most impressive feature is its capability to observe your mouse clicks to learn what you want, and then offer you ways that it can automatically complete your tasks for you. Although import.io learns and suggests tasks, it doesn’t take action on those tasks until after you’ve marked the suggestion as correct. Consequently, these human-augmented interactions lower the risk that the machine will draw an incorrect conclusion due to overguessing.
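To get a feel for what pulling a table out of a web page involves, here is a minimal Python sketch that uses only the standard library. It is not how import.io works internally; the TableExtractor class, the extract_table function, and the sample page are all hypothetical names and data made up for illustration.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that sits inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def extract_table(html):
    """Return the table cells found in an HTML string as a list of rows."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hypothetical sample page; a real scraper would download this text first.
SAMPLE_PAGE = """
<table>
  <tr><th>City</th><th>Population</th></tr>
  <tr><td>Springfield</td><td>30,000</td></tr>
  <tr><td>Shelbyville</td><td>25,000</td></tr>
</table>
"""

rows = extract_table(SAMPLE_PAGE)
for row in rows:
    print(row)
```

Each row comes back as a list of cell strings with the columns already lined up, which is exactly the part that copying and pasting tends to get wrong.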
Collecting images with ImageQuilts
ImageQuilts (http://imagequilts.com) is a Chrome extension developed in part by the legendary Edward Tufte, one of the first great pioneers in data visualization. He popularized the use of the data-ink ratio to judge the effectiveness of charts.
The task that ImageQuilts performs is deceptively simple to describe but quite complex to implement. ImageQuilts makes collages of tens of images and pieces them all together into one “quilt” that’s composed of multiple rows of equal height. This task can be complex because the source images are almost never the same height. ImageQuilts scrapes and resizes the images before stitching them together into one output image. The image quilt shown in Figure 23-1 was derived from a “Labeled for Reuse” search at Google Images for the term data science.
FIGURE 23-1: An ImageQuilts output from the Google Images search term data science.
ImageQuilts even allows you to choose the order of images or to randomize them. You can use the tool to drag and drop any image to any place, remove an image, zoom all images at the same time, or zoom each image individually. You can even use the tool to convert between image colors, from color to grayscale or inverted color (which is handy for making contact sheets of negatives, if you’re one of those rare people who still process analog photography).
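The equal-height trick behind a quilt comes down to a bit of arithmetic: Scale every image to the same row height while keeping its aspect ratio, and then fill rows from left to right. The following Python sketch shows that arithmetic only. It is an assumption about the general approach, not ImageQuilts’ actual code, and the function names, image sizes, and row width are made up.

```python
def scale_to_row_height(sizes, row_height):
    """Resize (width, height) pairs to a common height, keeping aspect ratio."""
    scaled = []
    for width, height in sizes:
        new_width = round(width * row_height / height)
        scaled.append((new_width, row_height))
    return scaled


def pack_rows(scaled, max_row_width):
    """Greedily fill rows left to right, starting a new row when one is full."""
    rows, current, used = [], [], 0
    for width, height in scaled:
        if current and used + width > max_row_width:
            rows.append(current)
            current, used = [], 0
        current.append((width, height))
        used += width
    if current:
        rows.append(current)
    return rows


# Hypothetical source image sizes in pixels (width, height).
source_sizes = [(800, 600), (400, 400), (300, 600), (1000, 500)]

scaled = scale_to_row_height(source_sizes, row_height=100)
quilt = pack_rows(scaled, max_row_width=300)

for row in quilt:
    print(row)
```

With these made-up sizes, the first three images fit in one row and the wide fourth image starts a second row. A real tool also has to fetch and resample the pixels, which this sketch leaves out.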
Wrangling data with DataWrangler
DataWrangler (http://vis.stanford.edu/wrangler) is an online tool that’s supported by the University of Washington Interactive Data Lab. (At the time DataWrangler was developed, this group was called the Stanford Visualization Group.) This same group developed Lyra, an interactive data visualization environment that you can use to create complex visualizations without programming experience.
If your goal is to sculpt your dataset — or clean things up by moving things around like a sculptor would (split this part in two, slice off that bit and move it over there, push this down so that everything below it gets shifted to the right, and so on) — DataWrangler is the tool for you.
You can do manipulations with DataWrangler similar to what you can do in Excel using Visual Basic. For example, you can use DataWrangler or Excel with Visual Basic to copy, paste, and format information from lists on the Internet.
DataWrangler even suggests actions based on your dataset and can repeat complex actions across entire datasets — actions such as eliminating skipped rows, splitting data from one column into two, and turning a header into column data. DataWrangler can also show you where your dataset is missing data.
Missing data can indicate a formatting error that needs to be cleaned up.
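If you want to see what these wrangling steps look like in code, here is a small Python sketch that drops blank rows, splits one column into two, and reports where data is missing. It only mimics the kinds of actions described above; it is not DataWrangler’s own logic, and the function names and sample records are invented for illustration.

```python
# Hypothetical raw table: a header row followed by messy records.
raw_rows = [
    ["name", "location"],
    ["Ada", "London, UK"],
    ["", ""],                      # a skipped (blank) row
    ["Grace", "New York, USA"],
    ["Linus", ""],                 # location is missing
]


def drop_blank_rows(rows):
    """Remove rows in which every cell is empty or whitespace."""
    return [row for row in rows if any(cell.strip() for cell in row)]


def split_column(rows, index, separator, new_names):
    """Split the column at `index` into two columns at the first separator."""
    header, *body = rows
    out = [header[:index] + list(new_names) + header[index + 1:]]
    for row in body:
        left, _, right = row[index].partition(separator)
        out.append(row[:index] + [left.strip(), right.strip()] + row[index + 1:])
    return out


def find_missing(rows):
    """Return (row_number, column_name) for every empty cell below the header."""
    header, *body = rows
    return [
        (row_number, header[col])
        for row_number, row in enumerate(body, start=1)
        for col, cell in enumerate(row)
        if not cell
    ]


cleaned = split_column(drop_blank_rows(raw_rows), 1, ",", ("city", "country"))
missing = find_missing(cleaned)

for row in cleaned:
    print(row)
print("Missing:", missing)
```

The list of missing cells is the part worth a second look, because a gap can mean either that the value was never recorded or that an earlier step split the data incorrectly.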
Looking into Data Exploration Tools
Throughout this book, I talk a lot about free tools that you can use to visualize your data. And although visualization can help clarify and communicate your data’s meaning, you need to make sure that the data insights you’re communicating are correct — that requires great care and attention in the data analysis phase. In the following sections, I introduce you to a few free tools that you can use for some advanced data analysis tasks.
Getting up to speed in Gephi
Remember back in school when you were taught how to use graph paper to do math and then draw graphs of the results? Well, apparently that nomenclature is incorrect. Those things with an x-axis and y-axis are called charts. Graphs are actually network topologies, the same type of network topologies I talk about in Chapter 9.
If this book is your first introduction to network topologies, welcome to this weird and wonderful world. You’re in for a voyage of discovery. Gephi (http://gephi.github.io) is an open-source software package you can use to create graph layouts and then manipulate them to get the clearest and most effective results. The kinds of connection-based visualizations you can create in Gephi are useful in all types of network analyses — from social media data analysis to an analysis of protein interactions or horizontal gene transfers between bacteria.
To illustrate a network analysis, imagine that you want to analyze the interconnectedness of people in your social networks. You can use Gephi to quickly and easily present the different aspects of interconnectedness between your Facebook friends. So, imagine that you’re friends with Alice. You and Alice share 10 of the same friends on Facebook, but Alice also has an additional 200 friends with whom you’re not connected. One of the friends that you and Alice share is named Bob. You and Bob share 20 of the same friends on Facebook also, but Bob has only 5 friends in common with Alice. On the basis of shared friends, you can easily surmise that you and Bob are the most similar, but you can use Gephi to visually graph the friend links between you, Alice, and Bob.
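The shared-friend comparison in this example can be checked with a few lines of Python before you ever open Gephi. The sketch below builds made-up friend lists that match the counts in the example (the generated names such as t0 or yb3 are placeholders, not real data) and counts the overlap for each pair of people.

```python
from itertools import combinations


def people(prefix, count):
    """Generate placeholder friend names such as 'ya0', 'ya1', and so on."""
    return {f"{prefix}{i}" for i in range(count)}


# Hypothetical friend lists built to match the counts in the example.
shared_by_all = people("t", 4)
you_and_alice = people("ya", 5)
you_and_bob = people("yb", 15)

friends = {
    "You": {"Alice", "Bob"} | shared_by_all | you_and_alice | you_and_bob,
    "Alice": {"You", "Bob"} | shared_by_all | you_and_alice | people("a", 200),
    "Bob": {"You", "Alice"} | shared_by_all | you_and_bob,
}


def shared_friend_counts(friends):
    """Count the friends each pair of people has in common."""
    return {
        (a, b): len(friends[a] & friends[b])
        for a, b in combinations(sorted(friends), 2)
    }


counts = shared_friend_counts(friends)
most_similar = max(counts, key=counts.get)

for pair, count in counts.items():
    print(pair, count)
print("Most similar pair:", most_similar)
```

Counting overlaps works fine for three people, but it quickly becomes unreadable for hundreds of them, which is where a graph layout earns its keep.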
To take another example, imagine you have a graph that shows which characters appear in the same chapter as which other characters in Victor Hugo’s immense novel Les Misérables. (Actually, you don’t have to imagine it; Figure 23-2 shows just such a graph, created in the Gephi application.) The larger a bubble, the more often that character appears, and the more lines attached to a bubble, the more that character co-occurs with others. The big bubble in the center-left is, of course, Jean Valjean.
FIGURE 23-2: A moderate-size graph on characters in the book Les Misérables.
When you use Gephi, the application automatically colors your data into different clusters. Looking to the upper-left of Figure 23-2, the cluster in blue (the somewhat-darker color in this black-and-white image) contains characters who mostly appear only with each other. (They’re the friends of Fantine, such as Félix Tholomyès; if you’ve seen only the musical, you won’t recognize them, because they don’t appear in that production.) These characters are connected to the rest of the book’s characters through only one character, Fantine. If a group of characters appeared only together and never with any other characters, they’d be in a separate cluster of their own and not attached to the rest of the graph in any way.
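The last point, about a group that isn’t attached to the rest of the graph, corresponds to what graph theory calls a connected component. The Python sketch below finds such groups with a breadth-first search. It illustrates only that one idea and is not the method Gephi uses to color clusters; the edge list is a tiny made-up sample (the two “Stranger” characters are invented), not the real Les Misérables dataset.

```python
from collections import defaultdict, deque


def connected_components(edges):
    """Group nodes into sets in which every node can reach every other node."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    seen, components = set(), []
    for start in sorted(neighbors):
        if start in seen:
            continue
        # Breadth-first search outward from an unvisited node.
        seen.add(start)
        queue, component = deque([start]), set()
        while queue:
            node = queue.popleft()
            component.add(node)
            for nxt in neighbors[node] - seen:
                seen.add(nxt)
                queue.append(nxt)
        components.append(component)
    return components


# Hypothetical co-occurrence edges for illustration only.
edges = [
    ("Valjean", "Fantine"),
    ("Fantine", "Tholomyès"),
    ("Valjean", "Cosette"),
    ("Stranger A", "Stranger B"),   # appear only with each other
]

components = connected_components(edges)
for component in components:
    print(sorted(component))
```

In this sample, Tholomyès stays in the main group because he is linked through Fantine, whereas the two invented strangers form a separate group of their own.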
To take one final example, check out Figure 23-3, which shows a graph of the U.S. power grid and the degrees of interconnectedness between thousands of power-generation and power-distribution facilities. This type of graph is commonly referred to as a hairball graph, for obvious reasons. You can make it less dense and more visually clear, but making those kinds of adjustments is as much of an art as it is a science. The best way to learn is through practice, trial, and error.
FIGURE 23-3: A Gephi hairball graph of the U.S. power grid.
Machine learning with the WEKA suite
Machine learning is the class of artificial intelligence that’s dedicated to developing and applying algorithms to data, so that the algorithms can automatically learn and detect patterns in large datasets. Waikato Environment for Knowledge Analysis (WEKA; www.cs.waikato.ac.nz/ml/weka) is a popular suite of tools for machine learning tasks. It was written in Java and developed at the University of Waikato, New Zealand.
WEKA is a stand-alone application that you can use to analyze patterns in your datasets and then visualize those patterns in all sorts of interesting ways. For advanced users, WEKA’s true value is derived from its suite of machine-learning algorithms that you can use to cluster or categorize your data. WEKA even allows you to run different machine-learning algorithms in parallel to see which ones perform most efficiently. WEKA can be run through a graphical user interface (GUI) or by command line. Thanks to the well-written Weka Wiki documentation, the learning curve for WEKA isn’t as steep as you might expect for a piece of software this powerful.
Evaluating Web-Based Visualization Tools
As I mention earlier in this chapter, Chapter 11 highlights a lot of free web apps you can use to easily generate unique and interesting data visualizations. As neat as those tools are, two more are worth your time. These tools are a little more sophisticated than many of the ones I cover in Chapter 11, but with that sophistication comes more customizable and adaptable outputs.
Getting a little Weave up your sleeve
Web-Based Analysis and Visualization Environment, or Weave, is the brainchild of Dr. Georges Grinstein at the University of Massachusetts Lowell. Weave is an open-source, collaborative tool that uses Adobe Flash to display data visualizations. (Check it out at www.oicweave.org.)
Because Weave relies on Adobe Flash, you can’t access it with all browsers, particularly those on Apple mobile devices — iPad, iPhone, and so on.
The Weave package is Java software designed to be run on a server with a database engine like MySQL or Oracle, although it can be run on a desktop computer as long as a local host server (such as Apache Tomcat) and database software are both installed. Weave offers an excellent Wiki (http://info.iweave.com/projects/weave/wiki) that explains all aspects of the program, including installation on Mac, Linux, or Windows systems.
You can most easily install Weave on the Windows OS because of Weave’s single installer, which installs the desktop middleware, as well as the server and database dependencies. For the installer to be able to install all of this, though, you need to first install the free Adobe Air runtime environment on your machine.
You can use Weave to automatically access countless open datasets or simply upload your own, as well as generate multiple interactive visualizations (such as charts and maps) that allow your users to efficiently explore even the most complex datasets.
Weave is the perfect tool to create visualizations that allow your audience to see and explore the interrelatedness between subsets of your data. Also, if you update your underlying data source, your Weave data visualizations update in real time as well.
Figure 23-4 shows a demo visualization on Weave’s own server. It depicts every county in the United States, with many columns of data from which to choose. In this example, the map shows county-level obesity data on employed women who are 16 years of age and older. The chart at the bottom-left shows a correlation between obesity and unemployment in this group.
FIGURE 23-4: A figure showing a chart, map, and data table in Weave.
Checking out Knoema’s data visualization offerings
Knoema (http://knoema.com) is an excellent open data source, as I spell out in Chapter 22, but I would be telling only half the story if I didn’t also mention Knoema’s open-source data visualization tools. With these tools, you can create visualizations that enable your audience to easily explore data, drill down on geographic areas or on different indicators, and automatically produce data-driven timelines. Using Knoema, you can quickly export all results into PowerPoint files (.ppt), Excel files (.xls), PDF files (.pdf), JPEG images (.jpg), or PNG images (.png), or even embed them on your website.
If you embed the data visualizations in a web page of your website, those visualizations automatically update if you make changes to the underlying dataset.
Figure 23-5 shows a chart and a table that were quickly, easily, and automatically generated with just two mouse clicks in Knoema. After creating charts and tables in Knoema, you can export the data, further explore it, save it, or embed it in an external website.
FIGURE 23-5: An example of data tables and charts in Knoema.
You can use Knoema to make your own dashboards as well, either from your own data or from open data in Knoema’s repository. Figures 23-6 and 23-7 show two dashboards that I quickly created using Knoema’s Eurostat data on capital and financial accounts.
FIGURE 23-6: A map of Eurostat data in Knoema.
FIGURE 23-7: A line chart of Eurostat data in Knoema.