Databases and SQL - Data Science from Scratch: First Principles with Python (2015)
Data Science from Scratch: First Principles with Python (2015)
Chapter 23.Databases and SQL
Memory is man’s greatest friend and worst enemy.
Gilbert Parker
The data you need willoften live indatabases, systems designed for efficiently storing and querying data. The bulk of thesearerelationaldatabases, such as Oracle, MySQL, and SQL Server, which store data intablesand are typically queried using Structured Query Language (SQL), a declarative language for manipulating data.
SQL is a pretty essential part of the data scientist’s toolkit. In this chapter, we’llcreate NotQuiteABase, a Python implementation of something that’s not quite a database. We’ll also cover the basics of SQL while showing how they work in our not-quite database, which is the most “from scratch” way I could think of to help you understand what they’re doing. My hope is that solving problems in NotQuiteABase will give you a good sense of how you might solve the same problems using SQL.
CREATE TABLE and INSERT
A relational database is a collection of tables (and of relationships among them).A table is simply a collection of rows, not unlike the matrices we’ve been working with. However, a table also has associated with it afixedschemaconsisting of column names and column types.
For example, imagine ausersdata set containing for each user heruser_id,name, andnum_friends:
users=[[0,"Hero",0],
[1,"Dunn",2],
[2,"Sue",3],
[3,"Chi",3]]
In SQL, we might createthis table with:
CREATETABLEusers(
user_idINTNOTNULL,
nameVARCHAR(200),
num_friendsINT);
Notice that we specified that theuser_idandnum_friendsmust be integers (and thatuser_idisn’t allowed to beNULL, which indicates a missing value and is sort of like ourNone) and that the name should be a string of length 200 or less. NotQuiteABase won’t take types into account, but we’ll behave as if it did.
NOTE
SQL is almost completely case and indentation insensitive. The capitalization and indentation style here is my preferred style. If you start learning SQL, you will surely encounter other examples styled differently.
Notice also that SQL statements need to end with semicolons, and that SQL requires single quotes for its strings.
In NotQuiteABase, you’ll create aTablesimply by specifying the names of its columns. And to insert a row, you’ll use the table’sinsert()method, which takes alistof row values that need to be in the same order as the table’s column names.
Behind the scenes, we’ll store each row as adictfrom column names to values. A real database would never use such a space-wasting representation, but doing so will make NotQuiteABase much easier to work with:
classTable:
def__init__(self,columns):
self.columns=columns
self.rows=[]
def__repr__(self):
"""pretty representation of the table: columns then rows"""
Sometimes you need to update the data that’s already in the database.For instance, if Dunn acquires another friend, you might need to do this:
UPDATEusers
SETnum_friends=3
WHEREuser_id=1;
The key features are:
§ What table to update
§ Which rows to update
§ Which fields to update
§ What their new values should be
We’ll add a similarupdatemethod to NotQuiteABase. Its first argument will be adictwhose keys are the columns to update and whose values are the new values for those fields. And its second argument is apredicatethat returnsTruefor rows that should be updated,Falseotherwise:
defupdate(self,updates,predicate):
forrowinself.rows:
ifpredicate(row):
forcolumn,new_valueinupdates.iteritems():
row[column]=new_value
after which we can simply do this:
users.update({'num_friends':3},# set num_friends = 3
lambdarow:row['user_id']==1)# in rows where user_id == 1
DELETE
There are two ways to delete rows from a table in SQL.The dangerous way deletes every row from a table:
DELETEFROMusers;
The less dangerous way adds aWHEREclauseand only deletes rows that match a certain condition:
If you supply apredicatefunction (i.e., aWHEREclause), thisdeletes only the rows that satisfy it. If you don’t supply one, the default predicate always returnsTrue, and you will delete every row.
For example:
users.delete(lambdarow:row["user_id"]==1)# deletes rows with user_id == 1
users.delete()# deletes every row
SELECT
Typically you don’t inspect SQL tables directly. Instead you query them withaSELECTstatement:
SELECT*FROMusers;-- get the entire contents
SELECT*FROMusersLIMIT2;-- get the first two rows
SELECTuser_idFROMusers;-- only get specific columns
SELECTuser_idFROMusersWHEREname='Dunn';-- only get specific rows
You can also useSELECTstatements to calculate fields:
SELECTLENGTH(name)ASname_lengthFROMusers;
We’ll give ourTableclass aselect()method that returns a newTable. The method accepts two optional arguments:
§ keep_columnsspecifies the name of the columns you want to keep in the result. If you don’t supply it, the result contains all the columns.
§ additional_columnsis a dictionary whose keys are new column names and whose values are functions specifying how to compute the values of the new columns.
If you were to supply neither of them, you’d simply get back a copy of the table:
Ourselect()returns a newTable, while the typical SQLSELECTjust produces some sort of transient result set (unless you explicitly insert the results into a table).
We’ll also needwhere()andlimit()methods. Both are pretty simple:
defwhere(self,predicate=lambdarow:True):
"""return only the rows that satisfy the supplied predicate"""
where_table=Table(self.columns)
where_table.rows=filter(predicate,self.rows)
returnwhere_table
deflimit(self,num_rows):
"""return only the first num_rows rows"""
limit_table=Table(self.columns)
limit_table.rows=self.rows[:num_rows]
returnlimit_table
after which we can easily construct NotQuiteABase equivalents to the preceding SQL statements:
# SELECT * FROM users;
users.select()
# SELECT * FROM users LIMIT 2;
users.limit(2)
# SELECT user_id FROM users;
users.select(keep_columns=["user_id"])
# SELECT user_id FROM users WHERE name = 'Dunn';
users.where(lambdarow:row["name"]=="Dunn") \
.select(keep_columns=["user_id"])
# SELECT LENGTH(name) AS name_length FROM users;
defname_length(row):returnlen(row["name"])
users.select(keep_columns=[],
additional_columns={"name_length":name_length})
Notice that — unlike in the rest of the book — here I use backslash\to continue statements across multiple lines. I find it makes the chained-together NotQuiteABase queries more readable than any other way of writing them.
GROUP BY
Another common SQL operation isGROUP BY, whichgroups together rows with identical values in specified columns and produces aggregate values likeMINandMAXandCOUNTandSUM. (This should remind you of thegroup_byfunction from“Manipulating Data”.)
For example, you might want to find the number of users and the smallestuser_idfor each possible name length:
SELECTLENGTH(name)asname_length,
MIN(user_id)ASmin_user_id,
COUNT(*)ASnum_users
FROMusers
GROUPBYLENGTH(name);
Every field weSELECTneeds to be either in theGROUP BYclause (whichname_lengthis) or an aggregate computation (whichmin_user_idandnum_usersare).
SQL also supports aHAVINGclause that behaves similarly to aWHEREclause except that its filter is applied to the aggregates (whereas aWHEREwould filter out rows before aggregation even took place).
You might want to know the average number of friends for users whose names start with specific letters but only see the results for letters whose corresponding average is greater than 1. (Yes, some of these examples are contrived.)
SELECTSUBSTR(name,1,1)ASfirst_letter,
AVG(num_friends)ASavg_num_friends
FROMusers
GROUPBYSUBSTR(name,1,1)
HAVINGAVG(num_friends)>1;
(Functions for working with strings vary across SQL implementations; some databases might instead useSUBSTRINGor something else.)
You can also compute overall aggregates. In that case, you leave off theGROUP BY:
SELECTSUM(user_id)asuser_id_sum
FROMusers
WHEREuser_id>1;
To add this functionality to NotQuiteABaseTables, we’ll add agroup_by()method. It takes the names of the columns you want to group by, a dictionary of the aggregation functions you want to run over each group, and an optional predicatehavingthat operates on multiple rows.
Then it does the following steps:
1. Creates adefaultdictto maptuples (of the group-by-values) to rows (containing the group-by-values). Recall that you can’t use lists asdictkeys; you have to use tuples.
2. Iterates over the rows of the table, populating thedefaultdict.
3. Creates a new table with the correct output columns.
4. Iterates over thedefaultdictand populates the output table, applying thehavingfilter if any.
(An actual database would almost certainly do this in a more efficient manner.)
Frequently, you’ll want to sort your results.For example, you might want to know the (alphabetically) first two names of your users:
SELECT*FROMusers
ORDERBYname
LIMIT2;
This is easy to implement by giving ourTableanorder_by()method that takes anorderfunction:
deforder_by(self,order):
new_table=self.select()# make a copy
new_table.rows.sort(key=order)
returnnew_table
which we can then use as follows:
friendliest_letters=avg_friends_by_letter \
.order_by(lambdarow:-row["avg_num_friends"]) \
.limit(4)
The SQLORDER BYlets you specifyASC(ascending) orDESC(descending) for each sort field; here we’d have to bake that into ourorderfunction.
JOIN
Relational database tables are oftennormalized, whichmeans that they’re organized to minimize redundancy. For example, when we work with our users’ interests in Python we can just give each user alistcontaining his interests.
SQL tables can’t typically contain lists, so the typical solution is to create a second tableuser_interestscontaining the one-to-many relationship betweenuser_ids and interests. In SQL you might do:
CREATETABLEuser_interests(
user_idINTNOTNULL,
interestVARCHAR(100)NOTNULL
);
whereas in NotQuiteABase you’d create the table:
user_interests=Table(["user_id","interest"])
user_interests.insert([0,"SQL"])
user_interests.insert([0,"NoSQL"])
user_interests.insert([2,"SQL"])
user_interests.insert([2,"MySQL"])
NOTE
There’s still plenty of redundancy — the interest “SQL” is stored in two different places. In a real database you might storeuser_idandinterest_idin theuser_intereststable and then create a third tableinterestsmappinginterest_idtointerestso you could store the interest names only once each. Here that would just make our examples more complicated than they need to be.
When our data lives across different tables, how do we analyze it? ByJOINing the tables together. AJOINcombines rows in the left table with corresponding rows in the right table, where the meaning of “corresponding” is based on how we specify the join.
For example, to find the users interested in SQL you’d query:
SELECTusers.name
FROMusers
JOINuser_interests
ONusers.user_id=user_interests.user_id
WHEREuser_interests.interest='SQL'
TheJOINsays that, for each row inusers, we should look at theuser_idand associate that row with every row inuser_interestscontaining the sameuser_id.
Notice we had to specify which tables toJOINand also which columns to joinON.This is anINNER JOIN, which returns the combinations of rows (and only the combinations of rows) that match according to the specified join criteria.
Thereis also aLEFT JOIN, which — in addition to the combinations of matching rows — returns a row for each left-table row with no matching rows (in which case, the fields that would have come from the right table are allNULL).
Using aLEFT JOIN, it’s easy to count the number of interests each user has:
TheLEFT JOINensures that users with no interests will still have rows in the joined data set (withNULLvalues for the fields coming fromuser_interests), andCOUNTonly counts values that are non-NULL.
The NotQuiteABasejoin()implementation will be more restrictive — it simply joins two tables on whatever columns they have in common. Even so, it’s not trivial to write:
defjoin(self,other_table,left_join=False):
join_on_columns=[cforcinself.columns# columns in
ifcinother_table.columns]# both tables
additional_columns=[cforcinother_table.columns# columns only
ifcnotinjoin_on_columns]# in right table
# all columns from left table + additional_columns from right table
In SQL, there is also aRIGHT JOIN, whichkeeps rows from the right table that have no matches, and aFULL OUTER JOIN, whichkeeps rows from both tables that have no matches. We won’t implement either of those.
Subqueries
In SQL, you canSELECTfrom (andJOIN) the results ofqueries as if they were tables. So if you wanted to find the smallestuser_idof anyone interested in SQL, you could use a subquery. (Of course, you could do the same calculation using aJOIN, but that wouldn’t illustrate subqueries.)
Given the way we’ve designed NotQuiteABase, we get this for free. (Our query results are actual tables.)
likes_sql_user_ids=user_interests \
.where(lambdarow:row["interest"]=="SQL") \
.select(keep_columns=['user_id'])
likes_sql_user_ids.group_by(group_by_columns=[],
aggregates={"min_user_id":min_user_id})
Indexes
To find rows containing a specific value (say, wherenameis “Hero”), NotQuiteABase has to inspect every row in the table.If the table has a lot of rows, this can take a very long time.
Similarly, ourjoinalgorithm is extremely inefficient. For each row in the left table, it inspects every row in the right table to see if it’s a match. With two large tables this could take approximately forever.
Also, you’d often like to apply constraints to some of your columns. For example, in youruserstable you probably don’t want to allow two different users to have the sameuser_id.
Indexes solve all these problems. If theuser_intereststable had an index onuser_id, a smartjoinalgorithm could find matches directly rather than scanning the whole table. If theuserstable had a “unique” index onuser_id, you’d get an error if you tried to insert a duplicate.
Each table in a database can have one or more indexes, which allow you to quickly look up rows by key columns, efficiently join tables together, and enforce unique constraints on columns or combinations of columns.
Designing and using indexes well is somewhat of a black art (which varies somewhat depending on the specific database), but if you end up doing a lot of database work it’s worth learning about.
Query Optimization
Recall the query to find all userswho are interested in SQL:
SELECTusers.name
FROMusers
JOINuser_interests
ONusers.user_id=user_interests.user_id
WHEREuser_interests.interest='SQL'
In NotQuiteABase there are (at least) two different ways to write this query. You could filter theuser_intereststable before performing the join:
user_interests \
.where(lambdarow:row["interest"]=="SQL") \
.join(users) \
.select(["name"])
Or you could filter the results of the join:
user_interests \
.join(users) \
.where(lambdarow:row["interest"]=="SQL") \
.select(["name"])
You’ll end up with the same results either way, but filter-before-join is almost certainly more efficient, since in that casejoinhas many fewer rows to operate on.
In SQL, you generally wouldn’t worry about this. You “declare” the results you want and leave it up to the query engine to execute them (and use indexes efficiently).
NoSQL
A recent trend in databases is toward nonrelational “NoSQL” databases, which don’t represent data in tables.For instance, MongoDB is a popular schema-less database whose elements are arbitrarily complex JSON documents rather than rows.
There are column databases that store data in columns instead of rows (good when data has many columns but queries need few of them), key-value stores that are optimized for retrieving single (complex) values by their keys, databases for storing and traversing graphs, databases that are optimized to run across multiple datacenters, databases that are designed to run in memory, databases for storing time-series data, and hundreds more.
Tomorrow’s flavor of the day might not even exist now, so I can’t do much more than let you know that NoSQL is a thing. So now you know. It’s a thing.
For Further Exploration
§ If you’d like to download a relational database to play with,SQLiteis fast and tiny, whileMySQLandPostgreSQLare larger and featureful. All are free and have lots of documentation.
§ If you want to explore NoSQL,MongoDBis very simple to get started with, which can be both a blessing and somewhat of a curse. It also has pretty good documentation.
§ TheWikipedia article on NoSQLalmost certainly now contains links to databases that didn’t even exist when this book was written.