Data Science For Dummies (2016)
Part 4
Computing for Data Science
Chapter 16
Using SQL in Data Science
IN THIS CHAPTER
Getting a grip on relational databases and SQL
Designing great relational databases
Doing data science tasks with SQL functions
SQL, or Structured Query Language, is a standard for creating, maintaining, and securing relational databases. It’s a set of rules that you can use to quickly and efficiently query, update, modify, add, or remove data from large and complex databases. You use SQL, rather than Python or Excel, to do these tasks because SQL is the simplest, fastest way to get the job done. It offers a plain and standardized set of core commands and methods that are easy to use when performing these tasks. In this chapter, I introduce you to basic SQL concepts and explain how you can use SQL to do cool things, like query, join, group, sort, and even text-mine structured datasets.
Although you can use SQL to work with structured data that resides in relational database management systems, you can’t use standard SQL as a solution for handling big data, because you can’t handle big data using relational database technologies. I give you more solutions for handling big data in Chapter 2, where I discuss data engineering and its components. For now, suffice it to say that SQL is simply a tool you can use to manipulate and edit structured data tables. It’s nothing exceedingly innovative, but it can be helpful to use SQL for the data querying and manipulation tasks that often arise in the practice of data science. In this chapter, I introduce the basics of relational databases, SQL, and database design.
Getting a Handle on Relational Databases and SQL
Although the name Structured Query Language suggests that SQL is a programming language, don’t be misled. SQL is not a programming language like R or Python. Rather, it’s a language of commands and syntax that you can use only to create, maintain, and search relational database systems. SQL supports a few common programming forms, like conditionals and loops, but to do anything more complex, you’d have to import your SQL query results into another programming platform and then do the more complex work there.
SQL has become so ubiquitous in the data field that its passionate users commonly debate whether SQL should be pronounced “ess-cue-el” or “sequel.” Most users I’ve met lean toward the latter.
One fundamental characteristic of SQL is that you can use it on only structured data that sits in a relational database. SQL database management systems (DBMSs) optimize their own structure with minimal user input, which enables blazing-fast operational performance.
An index is the lookup table you create in order to index (and point to) data in tables of a database. Although SQL DBMSs are known for their fast structured database querying capabilities, this speed and effectiveness are heavily dependent on good indexing. Good indexing is vital for fast data retrieval in SQL.
Similar to how different web browsers comply with, add to, and ignore different parts of the HTML standard in different ways, SQL rules are interpreted a bit differently, depending on whether you’re working open-source or commercial vendor software applications. Because not every SQL solution is the same, it’s a good idea to know something about the benefits and drawbacks of some of the more popular SQL solutions on the market. Here are the three most popular open-source SQL implementations among data scientists:
· SQLite: This software is more limited than other SQL implementations, especially when it comes to user management and performance-enhancing customizations, but it’s a fantastic place to get started when you’re first learning to use SQL.
· MySQL: MySQL is by far the most popular open-source version of SQL. It offers a complete and powerful version of SQL, and it’s used on the back end of millions of websites.
· PostgreSQL: This software adds object-oriented elements to SQL’s relational language, making it popular with programmers who want to integrate SQL objects into their own platforms’ object model.
Other powerful commercial SQL implementations, such as Oracle and Microsoft SQL Server, are great solutions as well, but they’re designed for use in business rather than as a data science tool.
As you might guess from the name, the most salient aspect of relational databases is that they’re relational: They’re composed of related tables. To illustrate the idea of a relational database, first imagine an Excel spreadsheet with rows, columns, and predefined relationships between shared columns. Then imagine having an Excel workbook with many worksheets (tables), in which every worksheet has a column with the same name as a column in one or more other worksheets. Because these worksheets have a shared relationship, if you use SQL you can use that shared relationship to look up data in all related worksheets. This type of relationship is illustrated in Figure 16-1.
FIGURE 16-1: A relationship between data tables that share a column.
The primary key of a table is a column of values that uniquely identifies every row in that table. A good example of primary keys is the use of ISBN numbers for a table of books or employee ID numbers for a table of employees. A foreign key is a column in one table that matches the primary key of another and is used to link tables.
Keeping the focus on terminology, remember that proper database science often associates particular meanings to particular words, as you can see in this list:
· Columns, called fields, keys, and attributes
· Rows, called records
· Cells, called values
Database science uses a lot of synonyms. For simplicity’s sake, I try to stick to using the words column, row, and cell. And because primary key and foreign key are standard terms, I use them to describe these two special column types.
The main benefits of using relational database management systems (RDBMSs, for short) is that they’re fast, they have large storage and handling capacity (compared to spreadsheet applications such as Excel), and they’re ideal tools to help you maintain data integrity — the consistency and accuracy of data in your database. If you need to make quick and accurate changes and updates to your datasets, you can use SQL and a RDBMS.
Let the following scenario serve as an illustration. This table describes films and lists ratings from viewers:
id title genre rating timestamp rating
1 The Even Couple NULL 2011-08-03 16:04:23 4
2 The Fourth Man Drama 2014-02-19 19:17:16 5
2 The Fourth Man Drama 2010-04-27 10:05:36 4
3 All About Adam Drama 2011-04-05 21:21:05 4
3 All About Adam Drama 2014-02-21 00:11:07 3
4 Dr. Yes Thriller NULL
What happens if you find out that All About Adam is a comedy rather than a drama? If the table were in a simple spreadsheet, you’d have to open the data table, find all instances of the film, and then manually change the genre value for that record. That’s not so difficult in this sample table because only two records are related to that film. But even here, if you forget to change one of these records, this inconsistency would cause a loss of data integrity, which can cause all sorts of unpredictable problems for you down the road.
In contrast, the relational database solution is simple and elegant. Instead of one table for this example, you’d have three:
Film id title
1 The Even Couple
2 The Fourth Man
3 All About Adam
4 Dr. Yes
Genre id genre
2 Drama
3 Drama
4 Thriller
Rating timestamp id rating
2011-08-03 16:04:23 1 4
2014-02-19 19:17:16 2 5
2010-04-27 10:05:36 2 4
2011-04-05 21:21:05 3 4
2014-02-21 00:11:07 3 3
The primary key for the Film and Genre tables is id. The primary key for the Rating table is timestamp — because a film can have more than one rating, id is not a unique field and, consequently, it can’t be used as a primary key. In this example, if you want to look up and change the genre for All About Adam, you’d use Film.id as the primary key and Genre.id as the foreign key. You’d simply use these keys to query the records you need to change and then apply the changes systematically. This systematic approach eliminates the risk of stray errors.
Investing Some Effort into Database Design
If you want to ensure that your database will be useful to you for the foreseeable future, you need to invest time and resources into excellent database design. If you want to create databases that offer fast performance and error-free results, your database design needs to be flawless, or as flawless as you can manage. Before you enter any data into a data table, first carefully consider the tables and columns you want to include, the kinds of data those tables will hold, and the relationships you want to create between those tables.
Every hour you spend planning your database and anticipating future needs can save you countless hours down the road, when your database might hold a million records. Poorly planned databases can easily turn into slow, error-ridden monstrosities — avoid them at all costs.
Keep just a few principles in mind when you design databases:
· Data types
· Constraints
· Normalization
In the next few sections, I help you take a closer look at each topic.
Defining data types
When creating a data table, one of the first things you have to do is define the data type of each column. The data type can be designated from any of the following options:
· Text: If your column is to contain text values, you can classify it as a Character data type with a fixed length or a Text data type of indeterminate length.
· Numerical: If your column is to hold number values, you can classify it as a Numerical data type. These can be stored as integers or floats.
· Date: If your column is to hold date- or time-based values, you can designate this as a Date data type or Date-Time data type.
Text data types are handy, but they’re terrible for searches. If you plan to query a column, assign that column a fixed length.
Designing constraints properly
Properly designed constraints are an important consideration in database design. You can think of them, in the context of SQL, as rules that are used to control the type of data that can be placed in a table. When you’re considering adding constraints, first decide whether each column is allowed to hold a NULL value. (NULL is not the same as blank or zero data; it indicates a total absence of data in a cell.)
For example, if you have a table of products you’re selling, you probably don’t want to allow a NULL in the Price column. In the Product Description column, however, some products may have long descriptions, so you might allow some of the cells in this column to contain NULL values.
Within any data type, you can also constrain exactly what type of input values the column accepts. Imagine that you have a text field for Employee ID, which must contain values that are exactly two letters followed by seven numbers, like this: SD0154919. Because you don’t want your database to accept a typo, you’d define a constraint that requires all values entered into the cells of the Employee ID column to have exactly two letters followed by seven numbers.
Normalizing your database
After you’ve defined the data types and designed constraints, you need to deal with normalization — structuring your database so that any changes, additions, or deletions to the data have to be done only once and won’t result in anomalous, inconsistent data. There are many different degrees and types of normalization (at least seven), but a good, robust, normalized SQL database should have at least the following properties:
· Primary keys: Each table has a primary key, which is a unique value for every row in that column.
· Non-redundancy of columns: No two tables have the same column, unless it’s the primary key of one and the foreign key of the other.
· No multiple dependencies: Every column’s value must depend on only one other column, whose value does not in turn depend on any other column. Calculated values — values such as the total for an invoice, for example — must therefore be done on the fly for each query and should not be hard-coded into the database. This means that Zip codes should be stored in a separate table because a Zip code depends on three columns — address, city, and state.
· Column indexes: As you may recall, in SQL an index is a lookup table that points to data in tables of a database. When you make a column index — an index of a particular column — each record in that column is assigned a unique key value that’s indexed in a lookup table. Column indexing enables faster data retrieval from that column.
It’s an excellent idea to create a column index for frequent searches or to be used as a search criterion. The column index takes up memory, but it increases your search speeds tremendously. It’s easy to set up, too. Just tell your SQL DBMS to index a certain column, and then the system sets it up for you.
If you’re concerned that your queries are slow, first make sure that you have all the indexes you need before trying other, perhaps more involved, troubleshooting efforts.
· Subject-matter segregation: Another feature of good database design is that each table contains data for only one kind of subject matter. This is not exactly a normalization principle per se, but it helps to achieve a similar end.
Consider again the film rating example, from an earlier section:
Film id title
1 The Even Couple
2 The Fourth Man
3 All About Adam
4 Dr. Yes
Genre id genre
2 Drama
3 Drama
4 Thriller
Rating timestamp id rating
2011-08-03 16:04:23 1 4
2014-02-19 19:17:16 2 5
2010-04-27 10:05:36 2 4
2011-04-05 21:21:05 3 4
2014-02-21 00:11:07 3 3
I could have designated Genre to be a separate column in the Film table, but it’s better off in its own table because that allows for the possibility of missing data values (NULLs). Look at the Film table shown above. Film 1 has no genre assigned to it. If the Genre column were included in this table, then Film 1 would have a NULL value there. Rather than have a column that contains a NULL value, it’s much easier to make a separate Genre data table. The primary keys of the Genre table don’t align exactly with those of the Film table, but they don’t need to when you go to join them.
NULL values can be quite problematic when you’re running a SELECT query. When you’re querying based on the value of particular attribute, any records that have a null value for that attribute will not be returned in the query results. Of course, these records would still exist, and they may even fall within the specified range of values you’ve defined for your query, but if the record has a null value, it will be omitted from the query results. In this case, you’re likely to miss them in your analysis.
Any data scientist worth her salt must address many challenges, when dealing with either the data or the science. SQL takes some of the pressure off when you’re dealing with the time-consuming tasks of storing and querying data, saving precious time and effort.
Integrating SQL, R, Python, and Excel into Your Data Science Strategy
Some data scientists are resistant to learning SQL because of the cognitive overhead. They think, “I’ve already memorized a bunch of commands for dealing with data in R or Python. Won’t it be confusing to switch over to a whole new language?” In the case of SQL, no, it’s not that confusing, and it’s worth the small degree of hassle. Although the SQL standard is lengthy, a user commonly needs fewer than 20 commands, and the syntax is human-readable. Making things even easier, SQL commands are written in ALL CAPS, which helps to keep the language distinct and separate in your mind from other programming languages.
If you want to integrate SQL capabilities into your R or Python workflow, every DBMS has a library or module that you can use. Generally, it’s a good idea to take advantage of SQL’s speed by doing as much work in SQL as possible, and then accessing the SQL database from within your scripting language only when necessary. In this type of procedure, you’d translate query results into native R or Python data forms only when you finish with SQL and have all the data you need.
You can also integrate Microsoft Excel with your work in SQL. You can use MySQL to import your databases into Excel using the Data Ribbon (the Ribbon’s Other Sources button, to be precise), or you can save your Excel tables as text files and import them into the DBMS. If you’re not working in MySQL, look around online and you’ll be sure to find plug-ins for integrating other DBMSs into Excel. Some plug-ins are even free.
Narrowing the Focus with SQL Functions
When working with SQL commands, you use functions to perform tasks, and arguments to more narrowly specify those tasks. To query a particular set from within your data tables, for example, use the SELECT function. To combine separate tables into one, use the JOIN function. To place limits on the data that your query returns, use a WHERE argument. As I say in the preceding section, fewer than 20 commands are commonly used in SQL. This section introduces SELECT, FROM, JOIN, WHERE, GROUP, MAX(), MIN(), COUNT(), AVG(), and HAVING.
The most common SQL command is SELECT. You can use this function to generate a list of search results based on designated criteria. To illustrate, imagine the film-rating scenario mentioned earlier in this chapter with a tiny database of movie ratings that contains the three tables Film, Genre, and Rating.
To generate a printout of all data FROM the Rating table, use the SELECT function. Any function with SELECT is called a query, and SELECT functions accept different arguments to narrow down or expand the data that is returned. Since an asterisk (*) represents a wildcard, the asterisk in SELECT * tells the interpreter — the SQL component that carries out all SQL statements — to show every column in the table. You can then use the WHEREargument to limit the output to only certain values. For example, here is the complete Rating table:
Rating timestamp id rating
2011-08-03 16:04:23 1 4
2014-02-19 19:17:16 2 5
2010-04-27 10:05:36 2 4
2011-04-05 21:21:05 3 4
2014-02-21 00:11:07 3 3
If you want to limit your ratings to those made after a certain time, you’d use code like that shown in Listing 16-1.
LISTING 16-1 Using SELECT, WHERE, and DATE() to Query Data
SELECT * FROM Rating
WHERE Rating.timestamp >= date('2014-01-01')
timestamp id rating
2014-02-19 19:17:16 2 5
2014-02-21 00:11:07 3 3
In Listing 16-1, the DATE() function turns a string into a date that can then be compared with the timestamp column.
You can also use SQL to join columns into a new data table. Joins are made on the basis of shared (or compared) data in a particular column (or columns). There are several ways you can execute a join in SQL, but the ones listed here are probably the most popular:
· Inner join: The default JOIN type; returns all records that lie in the intersecting regions between the tables being queried
· Outer join: Returns all records that lie outside the overlapping regions between queried data tables
· Full outer join: Returns all records that lie both inside and outside the overlapping regions between queried data tables — in other words, returns all records for both tables
· Left join: Returns all records that reside in the leftmost table
· Right join: Returns all records that reside in the rightmost table
Be sure to differentiate between an inner join and an outer join because these functions handle missing data in different ways. As an example of a join in SQL, if you want a list of films that includes genres, you use an inner join between the Film and Genre tables to return only the results that intersect (overlap) between the two tables.
To refresh your memory, here are the two tables you’re interested in:
Film id title
1 The Even Couple
2 The Fourth Man
3 All About Adam
4 Dr. Yes
Genre id genre
2 Drama
3 Drama
4 Thriller
Listing 16-2 shows how you’d use an inner join to get the information you want.
LISTING 16-2 An Inner JOIN Function
SELECT Film.id, Film.title, Genre.genre
FROM Film
JOIN Genre On Genre.id=Film.id
id title genre
2 The Fourth Man Drama
3 All About Adam Drama
4 Dr. Yes Thriller
In Listing 16-2, I name specific columns (Film.title and Genre.genre) after the SELECT command. I do this to avoid creating a duplicate id column in the table that results from the JOIN — one id from the Film table and one id from the Genre table. Since the default for JOIN is inner, and inner joins return only records that are overlapping or shared between tables, Film 1 is omitted from the results (due to its missing genre value).
If you want to return all rows, even ones with NULL values, simply do a full outer join, like the one shown in Listing 16-3.
LISTING 16-3 A Full Outer JOIN
SELECT Film.id, Film.title, Genre.genre
FROM Film
FULL JOIN Genre On Genre.id=Film.id
id title genre
1 The Even Couple NULL
2 The Fourth Man Drama
3 All About Adam Drama
4 Dr. Yes Thriller
To aggregate values so that you can figure out the average rating for a film, use the GROUP statement. (GROUP statement commands include MAX(), MIN(), COUNT(), or AVG().) Listing 16-4 shows one way you could aggregate values.
In Listing 16-4, the average rating of films was returned; the AS statement was used in SELECT to rename the column, to make sure it was properly labeled. The Film and Ratings tables had to be joined, and because Dr. Yes had no ratings and an inner join was used, that film was left out.
LISTING 16-4 Using a GROUP Statement to Aggregate Data
SELECT Film.title, AVG(rating) AS avg_rating
FROM Film
JOIN Rating On Film.id=Rating.id
GROUP BY Film.title
title avg_rating
All About Adam 3.5
The Even Couple 4.0
The Fourth Man 4.5
To narrow the results even further, add a HAVING clause at the end, as shown in Listing 16-5.
LISTING 16-5 A HAVING Clause to Narrow Results
SELECT Film.title, AVG(rating) AS avg_rating
FROM Film
JOIN Rating On Film.id=Rating.id
GROUP BY Film.title
HAVING avg_rating >= 4
title avg_rating
The Even Couple 4.0
The Fourth Man 4.5
The code in Listing 16-5 limits the data your query returns so that you get only records of titles that have an average rating greater than or equal to 4.
Though SQL can do some basic text mining, packages such as Natural Language Toolkit in Python (NLTK, at www.nltk.org) and General Architecture for Text Engineering (GATE, at https://gate.ac.uk) are needed in order to do anything more complex than counting words and combinations of words. These more advanced packages can be used for preprocessing of data to extract linguistic items such as parts of speech or syntactic relations, which can then be stored in a relational database for later querying.
MINING TEXT WITH SQL
In this era of big data, more and more analysis is being done on larger and larger amounts of raw text — from books to government procedures and even Twitter feeds. You can use the tm and nltk packages in R and Python, respectively, to process such data, but as scripting languages, they can be rather slow. That’s why users commonly do some text mining in SQL. If you want to generate quick statistics on word counts and frequencies, you can use SQL to your advantage.
When the first SQL standard was published, its originators likely had no idea it would be used for these purposes, but the boundaries of SQL are being pushed and expanded all the time. This flexibility is yet another reason that SQL maintains its place as an indispensable tool among data science practitioners.