Why Do You Need Clean Data - Megan Squire (2015)

Megan Squire (2015)

Chapter 1. Why Do You Need Clean Data?

Big data, data mining, machine learning, and visualization—it seems like data is at the center of everything great happening in computing lately. From statisticians to software developers to graphic designers, everyone is suddenly interested in data science. The confluence of cheap hardware, better processing and visualization tools, and massive amounts of freely available data means that we can now discover trends and make predictions more accurately and more easily than ever before.

What you might not have heard, though, is that all of these data science hopes and dreams are predicated on the fact that data is messy. Usually, data has to be moved, compressed, cleaned, chopped, sliced, diced, and subjected to any number of other transformations before it is ready to be used in the algorithms or visualizations that we think of as the heart of data science.

In this chapter, we will cover:

· A simple six-step process you can follow for data science, including cleaning

· Helpful guidelines to communicate how you cleaned your data

· Some tools that you might find helpful for data cleaning

· An introductory example that shows how data cleaning fits into the overall data science process

A fresh perspective

We recently read that The New York Times called data cleaning janitor work and said that 80 percent of a data scientist's time will be spent doing this kind of cleaning. As we can see in the following figure, despite its importance, data cleaning has not really captured the public imagination in the same way as big data, data mining, or machine learning:

A fresh perspective

Who can blame us for not wanting to gather in droves to talk about how fun and super-cool janitor work is? Well, unfortunately—and this is true for actual housekeeping chores as well—we would all be a lot better off if we just got the job done rather than ignoring it, complaining about it, and giving it various demeaning names.

Not convinced yet? Consider a different metaphor instead, you are not a data janitor; you are a data chef. Imagine you have been handed a market basket overflowing with the most gorgeous heirloom vegetables you have ever seen, each one handpicked at the peak of freshness and sustainably produced on an organic farm. The tomatoes are perfectly succulent, the lettuce is crisp, and the peppers are bright and firm. You are excited to begin cooking, but you look around and the kitchen is filthy, the pots and pans have baked-on, caked-on who-knows-what, and, as for tools, you have nothing but a rusty knife and a soggy towel. The sink is broken and you just saw a beetle crawl out from under that formerly beautiful lettuce.

Even a beginner chef knows you should not cook in a place like this. At the very least, you will destroy that perfectly good delicious basket of goodies you have been given. And at worst, you will make people sick. Plus, cooking like this is not even fun, and it would take all day to chop the veggies with an old rusty knife.

Just as you would in a kitchen, it's definitely worth spending time cleaning and preparing your data science workspace, your tools, and your raw materials upfront. The old computer programming adage from the 1960s—garbage in, garbage out—is also true with data science.

The data science process

How does cleaning fit into the rest of the job of data science? Well, the short answer is that it's a critical piece, and it directly affects the processes that come before and after it.

The longer answer relies on describing the data science process in six steps, as shown in the following lists. Data cleaning is right in the middle, at step 3. But rather than thinking of these steps as a linear, start-to-finish framework, we will revisit the steps as needed several times over the course of a project in more of an iterative manner. It is also worth pointing out that not every project will have all the steps; for example, sometimes, we do not have a collection step or a visualization step. It really depends on the particular needs of the project:

1. The first step is to come up with the problem statement. Identify what the problem you are trying to solve is.

2. The next step is data collection and storage. Where did the data come from that is helping you answer this question? Where did you store it and in what format?

3. Then comes data cleaning. Did you change the data at all? Did you delete anything? How did you prepare it for the analysis and mining step next?

4. The next step is data analysis and machine learning. What kind of processing did you do to the data? What transformations? What algorithms did you use? What formulas did you apply? What machine learning algorithms did you use? In what order?

5. Representation and visualization is the next step. How do you show the results of your work? This can be one or more tables, drawings, graphs, charts, network diagrams, word clouds, maps, and so on. Is this the best visualization to represent the data? What alternatives did you consider?

6. The last step is problem resolution. What is the answer to the question or problem you posed in step 1? What limitations do you have on your results? Were there parts of the question that you could not answer with this method? What could you have done differently? What are the next steps?

It makes sense that data cleaning needs to happen before you attempt the analysis / mining / machine learning or visualization steps. Although, remember that, as this is an iterative process, we may revisit cleaning several times during the course of a project. Also, the type of mining or analysis we are going to do will often drive the way we clean the data. We consider cleaning to include a variety of tasks that may be dictated by the analysis method chosen, for example, swapping file formats, changing character encodings, or parsing out pieces of data to operate on.

Data cleaning is also going to be very closely tied to the collection and storage step (step 2). This means that you may have to collect raw data, store it, clean it, store the cleaned data again, collect some more, clean that, combine it with the previous data, clean it again, store it, and so on. As such, it is going to be very important to remember what you did and be able to repeat that process again if needed or tell someone else what you did.

Communicating about data cleaning

As the six-step process is organized around a story arc that starts with a question and ends with the resolution of the question, it works nicely as a reporting framework. If you decide to use the six-step framework as a way to report on your data science process, you will find that you get to write about your cleaning in the third step.

But even if you do not document your data science process in a formal report, you will find that it is extremely helpful to keep careful notes of what you did to the data and in what order.

Remember that even for the smallest, lowest-stakes project, you are always working for an audience of at least two people: you now and you 6 months from now. Believe me when I tell you that the you-in-six-months isn't going to remember what the you-of-today did to clean your data, much less why you did it or how to do it again!

The simplest solution for this is to just keep a log of what you did. The log should include links, screenshots, or copy and pastes of the specific commands you ran as well as a short explanation for why you did this. The following example shows a log for a very small text mining project with embedded links to the output files at each stage as well as links to the cleaning scripts. Don't worry if you are unfamiliar with some of the technologies mentioned in this log. This example shows you what a log might look like:

1. We wrote a SQL query to retrieve a list of every item and its description.

2. In order to conduct a term frequency analysis in Python, we needed data in a specific JSON format. We constructed a PHP script that looped through the results of our query, placing its results in a JSON file (version 1).

3. This file had some formatting errors, such as unescaped quotation marks and embedded HTML tags. These errors were corrected with a second PHP script, which when run, printed this cleaned JSON file (version 2).

Note that our log tries to explain what we did and why we did it. It is short and can include links when possible.

There are many more sophisticated solutions to communicate about data cleaning should you choose to use them, for example, if you are familiar with version control systems such as Git or Subversion, which are usually used to manage a software project, you can probably conceive how to extend them to keep track of your data cleaning. Whatever system you choose, even if it is a simple log, the most important thing is to actually use it. So, pick something that will encourage its use and not impede your progress.

Our data cleaning environment

The approach to data cleaning we are using in this book is a general-purpose, widely applicable one. It does not require or assume that you have any high-end specialty single-vendor database or data analysis products (in fact, these vendors and products may have their own cleaning routines or methods). I have designed the cleaning tutorials in this book around common, everyday issues that you might encounter when using real-world datasets. I have designed the book around real-world data that anyone can access. I'll show you how to clean data using open source, general-purpose software and technologies that are easy to get and are commonly used in the workplace.

Here are some of the tools and technologies that you should be ready to use:

· For nearly every chapter, we will use a terminal window and its command-line interface, such as the Terminal program on Mac OSX or bash on a Linux system. In Windows, some commands will be able to be run using Windows Command Prompt, but other commands may require the use of a more full-featured Windows command-line program, such as CygWin.

· For nearly every chapter, we will use a text editor or programmer's editor, such as Text Wrangler on a Mac, vi or emacs in Linux, or Notepad++ or Sublime Editor on Windows.

· For most chapters, we will need a Python 2.7 client, such as Enthought Canopy, and we will need enough permissions to install packages. Many of the examples will work with Python 3, but some will not, so if you already have that, you may wish to create an alternate 2.7 installation.

· For Chapter 3, Workhorses of Clean Data – Spreadsheets and Text Editors, we will need a spreadsheet program (we will focus on Microsoft Excel and Google Spreadsheets).

· For Chapter 7, RDBMS Cleaning Techniques, we will need a working MySQL installation and the client software to access it.

An introductory example

To get started, let's sharpen our chef's knife with a small example that integrates the six-step framework and illustrates how to tackle a few simple cleaning issues. This example uses the publicly available Enron e-mail dataset. This is a very famous dataset consisting of e-mail messages sent to, from, and between employees working at the now-defunct Enron Corporation. As part of the U.S. Government investigation into accounting fraud at Enron, the e-mails became part of the public record and are now downloadable by anyone. Researchers in a variety of domains have found the e-mails helpful for studying workplace communication, social networks, and more.

Note

You can read more about Enron and the financial scandal that led to its demise on its Wikipedia page at http://en.wikipedia.org/wiki/Enron, and you can read about the Enron e-mail corpus itself on its separate page at http://en.wikipedia.org/wiki/Enron_Corpus.

In this example, we will implement the six-step framework on a simple data science question. Suppose we want to reveal trends and patterns in e-mail usage over time within Enron Corporation. Let's start by counting messages that were sent to/from Enron employees by date. We will then show the counts visually on a graph over time.

First, we need to download the MySQL Enron corpus using the instructions at http://www.ahschulz.de/enron-email-data/. Another (backup) source for this file is https://www.cs.purdue.edu/homes/jpfeiff/enron.html. Following these instructions, we will need to import the data into a new database scheme called Enron on a MySQL server. The data is now ready to be queried using either the MySQL command-line interface or using a web-based tool such as PHPMyAdmin.

Our first count query is shown as follows:

SELECT date(date) AS dateSent, count(mid) AS numMsg

FROM message

GROUP BY dateSent

ORDER BY dateSent;

Right away, we notice that numerous e-mails have incorrect dates, for example, there are a number of dates that seem to predate or postdate the existence of the corporation (for example, 1979) or that were from years that were illogical (for example, 0001 or 2044). E-mail is old but not thatold!

The following table shows an excerpt of a few of the weird lines (the complete result set is about 1300 rows long) All of these dates are formatted correctly; however, some of the dates are definitely wrong:

dateSent

numMsg

0002-03-05

1

0002-03-07

3

0002-03-08

2

0002-03-12

1

1979-12-31

6

1997-01-01

1

1998-01-04

1

1998-01-05

1

1998-10-30

3

These bad dates are most likely due to misconfigured e-mail clients. At this point, we have three choices for what to do:

· Do nothing: Maybe we can just ignore the bad data and get away with building the line graph anyway. But, as the lowest bad date was from the year 0001 and the highest was from the year 2044, we can imagine our line graph with the 1300 tick marks on the time axis, each showing a count of 1 or 2. This graph does not sound very appealing or informative, so doing nothing will not work.

· Fix the data: We could try to figure out what the correct date for each bad message was and produce a corrected dataset that we can then use to build our graph.

· Throw out the affected e-mails: We can just make an informed decision to discard any e-mail that has a date that falls outside a predetermined window.

In order to decide between options 2 and 3, we will need to count how many messages will be affected using only a 1999-2002 window. We can use the following SQL:

SELECT count(*) FROM message

WHERE year(date) < 1998 or year(date) > 2002;

Result: 325

325 messages with bad dates may initially seem like a lot, but then again, they are only about 1 percent of the entire dataset. Depending on our goals, we might decide to fix these dates manually, but let's assume here that we do not mind losing 1 percent of the messages. We can proceed cautiously toward option 3, throwing out the affected e-mails. Here is the amended query:

SELECT date(date) AS dateSent, count(mid) AS numMsg

FROM message

WHERE year(date) BETWEEN 1998 AND 2002

GROUP BY dateSent

ORDER BY dateSent;

The cleaned data now consists of 1,211 rows, each with a count. Here is an excerpt of the new dataset:

dateSent

numMsg

1998-01-04

1

1998-01-05

1

1998-10-30

3

1998-11-02

1

1998-11-03

1

1998-11-04

4

1998-11-05

1

1998-11-13

2

In this example, it looks like there are two questionable dates in January 1998 and no other messages until October, at which point, the messages start coming in more regularly. This seems weird, and it also points to another issue, is it important that we have every date on the x axis, even if there were no e-mails sent that day?

If we answer yes, it is important to show every date, even those with 0 counts; this may mean going through another round of cleaning in order to produce rows that show the date with a zero.

But then again, maybe we can be more strategic about this. Whether we need to have zero values in our raw data actually depends on what tool we are using to create the graph and what type of graph it is, for example, Google Spreadsheets will build a line or bar graph that can automatically detect that there are missing dates on the x axis and will fill in zero values even if they are not given in the initial dataset. In our data, these zero values would be the mysterious missing dates from most of 1998.

The next three figures show each of these tools and how they handle zero values on a date axis. Note the long zero tails at the beginning and end of the Google Spreadsheets representation of the data shown here:

An introductory example

Google Spreadsheets automatically fills in any missing days with a zero.

The D3 JavaScript visualization library will do the same, filling in zero values for missing dates in a range by default, as shown in the next graph.

Tip

For a simple D3 line graph example, take a look at this tutorial: http://bl.ocks.org/mbostock/3883245.

An introductory example

D3 automatically fills in any missing days with a zero.

Excel also has identical date-filling behavior in its default line graph, as shown here:

An introductory example

Excel automatically fills in any missing days with a zero.

Next, we need to consider whether, by allowing zero values for dates, we are also making our x axis substantially longer (my count query yielded 1211 rows, but there are a total of 1822 days in the range specified, which is 1998-2002). Maybe showing zero count days might not work; if the graph is so crowded, we cannot see the gaps anyway.

To compare, we can quickly run the same data into Google Spreadsheets (you can do this in Excel or D3 too), but this time, we will only select our count column to build the graph, thereby forcing Google Spreadsheets to not show dates on the x axis. The result is the true shape of only the data that came from the database count query with no zero count days filled in. The long tails are gone, but the overall shape of the important part of the graph (the middle) remains the same:

An introductory example

The graph now shows only dates with one or more message.

Lucky for us, the shape of the data is similar, save for a shorter head and tail on the graph. Based on this comparison, and based on what we plan to do with the data (remember that all we wanted to do was create a simple line graph), we can feel good about our decision to move forward without specifically creating a dataset showing zero count days.

When all is said and done, the line graphs reveal that Enron had several significant peaks in e-mail traffic. The largest peaks and heaviest traffic occurred in the October and November of 2001, when the scandal broke. The two smaller peaks occurred around June 26-27 of 2001 and December 12-13 of 2000, when similar newsworthy events involving Enron transpired (one involving the California energy crisis and another involving a leadership change at the company).

If you get excited by data analysis, you probably have all sorts of cool ideas for what to do next with this data. And now that you have cleaned data, it will make your analysis tasks easier, hopefully!

Summary

After all that work, it looks like The New York Times was right. As you can see from this simple exercise, data cleaning indeed comprises about 80 percent of the effort of answering even a tiny data-oriented question (in this case, talking through the rationale and choices for data cleaning took 700 words out of the 900-word case study). Data cleaning really is a pivotal part of the data science process, and it involves understanding technical issues and also requires us to make some value judgments. As part of data cleaning, we even had to take into account the desired outcomes of both the analysis and visualization steps even though we had not really completed them yet.

After considering the role of data cleaning as presented in this chapter, it becomes even more obvious how improvements in our cleaning effectiveness could quickly add up to substantial time savings.

The next chapter will describe a few of the fundamentals that will be required for any "data chef" who wants to move into a bigger, better "kitchen", including file formats, data types, and character encodings.