Joining and Subquerying Data - Basics of Handling Data - Learning MySQL and MariaDB (2015)

Learning MySQL and MariaDB (2015)

Part III. Basics of Handling Data

Chapter 9. Joining and Subquerying Data

Most of the examples used in this book thus far have intentionally involved one table per SQL statement in order to allow you to focus on the basic syntax of each SQL statement. When developing a MySQL or MariaDB database, though, you will often query multiple tables. There are a few methods by which you may do that — you’ve seen some simple examples of them in previous chapters. This chapter covers how to merge results from multiple SQL statements, how to join tables, and how to use subqueries to achieve similar results.

Unifying Results

Let’s start this chapter by looking at a simple method of unifying results from multiple SQL statements. There may be times when you just want the unified results of two SELECT statements that don’t interact with each other. In this situation, you can use the UNION operator, which merges twoSELECT statements to form a unified results set. You can merge many SELECT statements together simply by placing the UNION between them in a chain. Let’s look at an example.

In Counting and Grouping Results, we queried the birds table to get a count of the number of birds in the Pelecanidae family (i.e., Pelicans). Suppose we want to also know how many birds are in the Ardeidae family (i.e., Herons). That’s easy to do: we’d use a copy of the same SELECT, but change the value in the WHERE clause. Suppose further that we want to merge the results of the SELECT statement counting Pelicans with the results of a SELECT counting Herons. We’ll do this with a UNION operator, so we can enter two complete SELECT statements and unite them into one results set. Enter the following in the mysql client:

SELECT 'Pelecanidae' AS 'Family',

COUNT(*) AS 'Species'

FROM birds, bird_families AS families

WHERE birds.family_id = families.family_id

AND families.scientific_name = 'Pelecanidae'

UNION

SELECT 'Ardeidae',

COUNT(*)

FROM birds, bird_families AS families

WHERE birds.family_id = families.family_id

AND families.scientific_name = 'Ardeidae';

+-------------+---------+

| Family | Species |

+-------------+---------+

| Pelecanidae | 10 |

| Ardeidae | 157 |

+-------------+---------+

First notice that the column headings in the results is taken only from the first SELECT statement. Next notice that for the first fields in both SELECT statements, we didn’t reference a column. Instead, we gave plain text within quotes: 'Pelecanidae' and 'Ardeidae'. That’s an acceptable choice in MySQL and MariaDB. It works well when you want to fill a field with text like this. Notice that we gave field aliases for the columns in the first SELECT statement, but not in the second one. MySQL uses the first ones it’s given for the column headings of the results set when using the UNIONoperator. It ignores any field aliases in subsequent SELECT statements, so they’re not needed. If you don’t give aliases, it uses the column names of the first SQL statement of the UNION.

The reason a UNION was somewhat necessary in the preceding example is because we’re using an aggregate function, COUNT() with GROUP BY. We can group by multiple columns, but to get results like this which show separate counts for two specific values of the same column, a UNION or some other method is necessary.

There are a few minor things to know about using a UNION. It’s used only with SELECT statements. The SELECT statements can select columns from different tables. Duplicate rows are combined into a single column in the results set.

You can use the ORDER BY clause to order the unified results. If you want to order the results of a SELECT statements, independently of the unified results, you have to put that SELECT statement within parentheses and add an ORDER BY clause to it. When specifying the columns in the ORDER BYclauses, you cannot preface column names with the table names (e.g., families.scientific_name). If using the column names would be ambiguous, you should instead use column aliases. Let’s expand our previous example to better illustrate how to use the ORDER BY clause with UNION. Let’s get a count for each bird family within two orders: Pelecaniformes and Suliformes. Enter the following:

SELECT families.scientific_name AS 'Family',

COUNT(*) AS 'Species'

FROM birds, bird_families AS families, bird_orders AS orders

WHERE birds.family_id = families.family_id

AND families.order_id = orders.order_id

AND orders.scientific_name = 'Pelecaniformes'

GROUP BY families.family_id

UNION

SELECT families.scientific_name, COUNT(*)

FROM birds, bird_families AS families, bird_orders AS orders

WHERE birds.family_id = families.family_id

AND families.order_id = orders.order_id

AND orders.scientific_name = 'Suliformes'

GROUP BY families.family_id;

+-------------------+---------+

| Family | Species |

+-------------------+---------+

| Pelecanidae | 10 |

| Balaenicipitidae | 1 |

| Scopidae | 3 |

| Ardeidae | 157 |

| Threskiornithidae | 53 |

| Fregatidae | 13 |

| Sulidae | 16 |

| Phalacrocoracidae | 61 |

| Anhingidae | 8 |

+-------------------+---------+

The first five rows are are Pelecaniformes and the remaining rows are Suliformes. The results are not in alphabetical order, but in the order of each SELECT statement and the order that server found the rows for each SELECT statement based on the family_id. If we want to order the results alphabetically by the family name, we have to use an ORDER BY clause, but after the unified results are generated. To do this, we’ll wrap the results set in parentheses to tell MySQL to treat it as a table. Then we’ll select all of the columns and rows of that results set and use the ORDER BY clause to order them based on the family name. To avoid confusion, we’ll add the name of the order to the results. Enter the following:

SELECT * FROM

(

SELECT families.scientific_name AS 'Family',

COUNT(*) AS 'Species',

orders.scientific_name AS 'Order'

FROM birds, bird_families AS families, bird_orders AS orders

WHERE birds.family_id = families.family_id

AND families.order_id = orders.order_id

AND orders.scientific_name = 'Pelecaniformes'

GROUP BY families.family_id

UNION

SELECT families.scientific_name, COUNT(*), orders.scientific_name

FROM birds, bird_families AS families, bird_orders AS orders

WHERE birds.family_id = families.family_id

AND families.order_id = orders.order_id

AND orders.scientific_name = 'Suliformes'

GROUP BY families.family_id ) AS derived_1

ORDER BY Family;

+-------------------+---------+----------------+

| Family | Species | Order |

+-------------------+---------+----------------+

| Anhingidae | 8 | Suliformes |

| Ardeidae | 157 | Pelecaniformes |

| Balaenicipitidae | 1 | Pelecaniformes |

| Fregatidae | 13 | Suliformes |

| Pelecanidae | 10 | Pelecaniformes |

| Phalacrocoracidae | 61 | Suliformes |

| Scopidae | 3 | Pelecaniformes |

| Sulidae | 16 | Suliformes |

| Threskiornithidae | 53 | Pelecaniformes |

+-------------------+---------+----------------+

In these examples, it may seem to be a lot of typing to achieve very little. But there are times — albeit rare times — when UNION is the best or simplest choice. It’s more useful when you retrieve data from very distinct, separate sources or other situations that would require contortions to fit into a single SELECT statement and are executed more easily as separate ones, still giving you a unified results set.

You can get the same results as the previous examples, though, with less effort by using a subquery. Actually, when we put the UNION within parentheses, that became a subquery, just not much of one. We’ll cover subqueries later in this chapter. For now, let’s consider how to join multiple tables in one SQL statement.

Joining Tables

The JOIN clause links two tables together in a SELECT, UPDATE, or DELETE statement. JOIN links tables based on columns with common data for purposes of selecting, updating, or deleting data. In A Little Complexity, for instance, we joined two tables named books and status_names, taking advantage of the design that put identical values in the status column of books and the status_id column of status_names. That way, we could show data from each table about the same book:

SELECT book_id, title, status_name

FROM books JOIN status_names

WHERE status = status_id;

Let’s review the way a join works, using this example. The status and status_id fields both contain numbers that refer to a status. In the books table, the numbers have no intrinsic meaning. But the status_names table associates the numbers with meaningful text. Thus, by joining the tables, you can associate a book with its status.

Sometimes there are alternatives to the JOIN clause. For instance, when constructing an SQL statement that includes multiple tables, a simple method is to list the tables in a comma-separated list in the appropriate position of the SQL statement — for a SELECT statement, you would list them in the FROM clause — and to provide pairing of columns in the WHERE clause on which the tables will be joined. This is the method we have used several times in the previous chapters. Although this method works fine and would seem fairly straightforward, a more agreeable method is to use aJOIN clause to join both tables and to specify the join point columns. When you have an error with an SQL statement, keeping these items together and not having part of them in the WHERE clause makes troubleshooting SQL statements easier.

With JOIN, tables are linked together based on columns with common data for purposes of selecting, updating, or deleting data. The JOIN clause is entered in the relevant statement where tables referenced are specified usually. This precludes the need to join the tables based on key columns in the WHERE clause. The ON operator is used to indicate the pair of columns by which the tables are to be joined (indicated with the equals-sign operator). If needed, you may specify multiple pairs of columns, separated by AND. If the column names by which the two tables are joined are the same in both tables, as an alternative method, the USING operator may be given along with a comma-separated list of columns that both tables have in common, contained within parentheses. The columns must be contained in each table that is joined. To improve performance, join to a column that is indexed.

Here is how the first of these two syntax looks using a JOIN:

SELECT book_id, title, status_name

FROM books

JOIN status_names ON(status = status_id);

This is the same example as before, but without the WHERE clause. It doesn’t need it, because it uses ON instead to indicate the join point. If we were to alter the books table to modify the name of the status column to be status_id, so that the names of both columns on which we join these two tables are the same, we could do the join like this:

SELECT book_id, title, status_name

FROM books

JOIN status_names USING(status_id);

Here we use the keyword USING in the JOIN clause to indicate the identical column by which to join.

These syntaxes are only two of a few possible with the JOIN. They show how you might construct a SELECT statement using a JOIN. It’s basically the same for the UPDATE and DELETE statements. In the next subsections, we’ll consider the methods for using JOIN with each of these three SQL statements, and look at some examples for each.

Selecting a Basic Join

Suppose we want to get a list of species of Geese whose existence is Threatened — that’s a category of conservation states. We will need to construct a SELECT statement that takes data from the birds table and the conservation_status table. The shared data in the birds and theconservation_status tables is the conservation_status_id column of each table. We didn’t have to give the column the same name in each table, but doing so makes it easier to know where to join them.

Enter the following in the mysql client:

SELECT common_name, conservation_state

FROM birds

JOIN conservation_status

ON(birds.conservation_status_id = conservation_status.conservation_status_id)

WHERE conservation_category = 'Threatened'

AND common_name LIKE '%Goose%';

+----------------------------+--------------------+

| common_name | conservation_state |

+----------------------------+--------------------+

| Swan Goose | Vulnerable |

| Lesser White-fronted Goose | Vulnerable |

| Hawaiian Goose | Vulnerable |

| Red-breasted Goose | Endangered |

| Blue-winged Goose | Vulnerable |

+----------------------------+--------------------+

The ON operator specifies the conservation_status_id columns from each table as the common item on which to join the tables. MySQL knows the proper table in which to find the conservation_category and common_name columns, and pulls the rows that match.

That works fine, but it’s a lot to type. Let’s modify this statement to use the USING operator, specifing conservation_status_id just once to make the join. MySQL will understand what to do. Here’s that same SQL statement, but with the USING operator:

SELECT common_name, conservation_state

FROM birds

JOIN conservation_status

USING(conservation_status_id)

WHERE conservation_category = 'Threatened'

AND common_name LIKE '%Goose%';

Now let’s modify the SQL statement to include the bird family. To do that, we’ll have to add another table, the bird_families. Let’s also include Ducks in the list. Try executing the following:

SELECT common_name AS 'Bird',

bird_families.scientific_name AS 'Family', conservation_state AS 'Status'

FROM birds

JOIN conservation_status USING(conservation_status_id)

JOIN bird_families USING(family_id)

WHERE conservation_category = 'Threatened'

AND common_name REGEXP 'Goose|Duck'

ORDER BY Status, Bird;

+----------------------------+----------+-----------------------+

| Bird | Family | Status |

+----------------------------+----------+-----------------------+

| Laysan Duck | Anatidae | Critically Endangered |

| Pink-headed Duck | Anatidae | Critically Endangered |

| Blue Duck | Anatidae | Endangered |

| Hawaiian Duck | Anatidae | Endangered |

| Meller's Duck | Anatidae | Endangered |

| Red-breasted Goose | Anatidae | Endangered |

| White-headed Duck | Anatidae | Endangered |

| White-winged Duck | Anatidae | Endangered |

| Blue-winged Goose | Anatidae | Vulnerable |

| Hawaiian Goose | Anatidae | Vulnerable |

| Lesser White-fronted Goose | Anatidae | Vulnerable |

| Long-tailed Duck | Anatidae | Vulnerable |

| Philippine Duck | Anatidae | Vulnerable |

| Swan Goose | Anatidae | Vulnerable |

| West Indian Whistling-Duck | Anatidae | Vulnerable |

| White-headed Steamer-Duck | Anatidae | Vulnerable |

+----------------------------+----------+-----------------------+

We gave two JOIN clauses in this SQL statement. It doesn’t usually matter which table is listed where. For instance, although bird_families is listed just after the join for the conservation_statustable, MySQL determined that bird_families is to be joined to the birds table. Without usingJOIN, we would have to be more emphatic in specifying the join points, and we would have to list them in the WHERE clause. It would have to be entered like this:

SELECT common_name AS 'Bird',

bird_families.scientific_name AS 'Family', conservation_state AS 'Status'

FROM birds, conservation_status, bird_families

WHERE birds.conservation_status_id = conservation_status.conservation_status_id

AND birds.family_id = bird_families.family_id

AND conservation_category = 'Threatened'

AND common_name REGEXP 'Goose|Duck'

ORDER BY Status, Bird;

That’s a very cluttered WHERE clause, making it difficult to see clearly the conditions by which we’re selecting data from the tables. Using JOIN clauses is much tidier.

Incidentally, the SQL statement with two JOIN clauses used a regular expression — the REGEXP operator in the WHERE clause — to specify that the clause find either Goose or Duck. We also added an ORDER BY clause to order first by Status, then by Bird name.

In this example, though, there’s little point in listing the bird family name, because the birds are all of the same family. Plus, there may be similar birds that we might like to have in the list, but that don’t have the words Goose or Duck in their name. So let’s change that in the SQL statement. Let’s also order the results differently and list birds from the least endangered to the most endangered. Enter the following:

SELECT common_name AS 'Bird from Anatidae',

conservation_state AS 'Conservation Status'

FROM birds

JOIN conservation_status AS states USING(conservation_status_id)

JOIN bird_families USING(family_id)

WHERE conservation_category = 'Threatened'

AND bird_families.scientific_name = 'Anatidae'

ORDER BY states.conservation_status_id DESC, common_name ASC;

+----------------------------+-----------------------+

| Bird from Anatidae | Conservation Status |

+----------------------------+-----------------------+

| Auckland Islands Teal | Vulnerable |

| Blue-winged Goose | Vulnerable |

| Eaton's Pintail | Vulnerable |

| Hawaiian Goose | Vulnerable |

| Lesser White-fronted Goose | Vulnerable |

| Long-tailed Duck | Vulnerable |

| Marbled Teal | Vulnerable |

| Philippine Duck | Vulnerable |

| Salvadori's Teal | Vulnerable |

| Steller's Eider | Vulnerable |

| Swan Goose | Vulnerable |

| West Indian Whistling-Duck | Vulnerable |

| White-headed Steamer-Duck | Vulnerable |

| Bernier's Teal | Endangered |

| Blue Duck | Endangered |

| Brown Teal | Endangered |

| Campbell Islands Teal | Endangered |

| Hawaiian Duck | Endangered |

| Meller's Duck | Endangered |

| Red-breasted Goose | Endangered |

| Scaly-sided Merganser | Endangered |

| White-headed Duck | Endangered |

| White-winged Duck | Endangered |

| White-winged Scoter | Endangered |

| Baer's Pochard | Critically Endangered |

| Brazilian Merganser | Critically Endangered |

| Crested Shelduck | Critically Endangered |

| Laysan Duck | Critically Endangered |

| Madagascar Pochard | Critically Endangered |

| Pink-headed Duck | Critically Endangered |

+----------------------------+-----------------------+

An obvious change to this example is the elimination of bird_families.scientific_name from the list of selected columns, so only two columns appear in the output. Another change, which is cosmetic, is to provide the alias states to the conservation_status table so we could refer to the short alias later instead of the long name.

Finally, the ORDER BY clause orders the output by conservation_status_id, because that value happens to be in the order of severity in the conservation_status table. We want to override the default order, which puts the most threatened species first, so we add the DESC option to put the least threatened first. We’re still ordering results secondarily by the common name of the birds, but using the actual column name this time instead of an alias. This is because we changed the alias for the common_name column from Birds to Birds from Anatidae, because all the results are in that family. We could have used 'Birds from Anatidae' in the ORDER BY clause, but that’s bothersome to type.

Let’s look at one more basic example of a JOIN. Suppose we wanted to get a list of members located in Russia (i.e., where country_id has a value of ru) who have reported sighting a bird from the Scolopacidae family (shore and wader birds like Sandpipers and Curlews). Information on bird sightings is stored in the bird_sightings table. It includes GPS coordinates recorded from a bird list application on the member’s mobile phone when they note the sighting. Enter this SQL statement:

SELECT CONCAT(name_first, ' ', name_last) AS Birder,

common_name AS Bird, location_gps AS 'Location of Sighting'

FROM birdwatchers.humans

JOIN birdwatchers.bird_sightings USING(human_id)

JOIN rookery.birds USING(bird_id)

JOIN rookery.bird_families USING(family_id)

WHERE country_id = 'ru'

AND bird_families.scientific_name = 'Scolopacidae'

ORDER BY Birder;

+-------------------+-------------------+---------------------------+

| Birder | Bird | Location of Sighting |

+-------------------+-------------------+---------------------------+

| Anahit Vanetsyan | Bar-tailed Godwit | 42.81958072; 133.02246094 |

| Elena Bokova | Eurasian Curlew | 51.70469364; 58.63746643 |

| Elena Bokova | Eskimo Curlew | 66.16051056; -162.7734375 |

| Katerina Smirnova | Eurasian Curlew | 42.69096856; 130.78185081 |

+-------------------+-------------------+---------------------------+

This SQL statement joins together four tables, two from the birdwatchers database and two from the birds database. Look closely at this SQL statement and consider the purpose of including each of those four tables. All of them were needed to assemble the results shown. Incidentally, we used the CONCAT() function to concatenate together the member’s first and last name for the Birder field in the results.

There are other types of joins besides a plain JOIN. Let’s do another SELECT using another type of JOIN. For an example of this, we’ll get a list of Egrets and their conservation status. Enter the following SQL statement:

SELECT common_name AS 'Bird',

conservation_state AS 'Status'

FROM birds

LEFT JOIN conservation_status USING(conservation_status_id)

WHERE common_name LIKE '%Egret%'

ORDER BY Status, Bird;

+--------------------+-----------------+

| Bird | Status |

+--------------------+-----------------+

| Great Egret | NULL |

| Cattle Egret | Least Concern |

| Intermediate Egret | Least Concern |

| Little Egret | Least Concern |

| Snowy Egret | Least Concern |

| Reddish Egret | Near Threatened |

| Chinese Egret | Vulnerable |

| Slaty Egret | Vulnerable |

+--------------------+-----------------+

This SELECT statement is like the previous examples, except that instead of using a JOIN, we’re using a LEFT JOIN. This type of join selects rows in the table on the left (i.e., birds) regardless of whether there is a matching row in the table on the right (i.e., conservation_status). Because there is no match on the right, MySQL returns a NULL value for columns it cannot reconcile from the table on the right. You can see this in the results. The Great Egret has a value of NULL for its Status. This is because no value was entered in the conservation_status_id column of the row related to that bird species. It would return NULL if the value of that column is NULL, blank if the column was set to empty (e.g., ''), or any value that does not match in the right table.

Because of the LEFT JOIN, the results show all birds with the word Egret in the common name even if we don’t know their conservation status. It also indicates which Egrets need to set the value of conservation_status_id. We’ll need to update that row and others like it. An UPDATE statement with this same LEFT JOIN can easily do that. We’ll show a couple in the next section.

Updating Joined Tables

If you want to use the UPDATE statement to change the data in multiple tables, or change data in a table based on criteria from multiple tables, you can use the JOIN clause. The syntax of the JOIN clause for UPDATE is the same as it is for SELECT. So let’s go straight to some practical examples. We’ll start with the example at the end of the previous subsection.

Let’s use UPDATE with LEFT JOIN to locate rows in the birds table that don’t have a value in conservation_status_id. We could update all of the rows, but let’s do only rows for one bird family, Ardeidae (i.e., Herons, Egrets, and Bitterns). First, execute this SELECT statement to test our joins and WHERE clause:

SELECT common_name,

conservation_state

FROM birds

LEFT JOIN conservation_status USING(conservation_status_id)

JOIN bird_families USING(family_id)

WHERE bird_families.scientific_name = 'Ardeidae';

If you’re working from the data from the MySQL Resources site, you should have over 150 rows in the results. You’ll notice that many of the rows have nothing in the common_name field. That’s because there are many bird species for which there are scientific names, but no common names. Those rows also have no value for the conservation_status_id. There are also a few rows for bird species that do have common names.

Let’s add another row to the conservation_status, one for an unknown state. We’ll set these unknown rows to that state. Enter these two SQL statements:

INSERT INTO conservation_status (conservation_state)

VALUES('Unknown');

SELECT LAST_INSERT_ID();

+------------------+

| LAST_INSERT_ID() |

+------------------+

| 9 |

+------------------+

In the first SQL statement here we entered only a value for conservation_state. The defaults for the other columns are fine. We’ll use the UPDATE statement to set the rows for the birds in Ardeidae to this new state, so we want to know the conservation_status_id for it. To get that value, we issue a SELECT statement with the LAST_INSERT_ID() function. It returns the identifier generated from the previous SQL statement entered, which added a row for the current client connection (i.e., just us). Let’s use that number to set the conservation_status_id in the birds table for bird species in Ardeidae. If your identification number is different, use what you received in the following SQL statement:

UPDATE birds

LEFT JOIN conservation_status USING(conservation_status_id)

JOIN bird_families USING(family_id)

SET birds.conservation_status_id = 9

WHERE bird_families.scientific_name = 'Ardeidae'

AND conservation_status.conservation_status_id IS NULL;

This UPDATE statement should have changed almost 100 rows on your server. The joins here are the same as we used in the previous SELECT statement, in which we discovered that we did not have a conservation status set for the Great Egret. Notice in the WHERE clause here that one of the conditions is that conservation_status.conservation_status_id has a value of NULL. We could have removed the LEFT JOIN to the conservation_status table and then updated simply all of the rows for the Ardeidae birds that have a NULL value in the conservation_status_id column. But that would not have included any rows that might have other nonmatching values (e.g., a blank column). By including this LEFT JOIN, we updated all of these possibilities. However, it requires the condition that the conservation_status.conservation_status_id is NULL, the column from the right table — it will be assumed NULL if not matched.

Because the method of joining tables is the same for both the SELECT statement and the UPDATE statement, you can easily test the JOIN clauses and WHERE clause using a SELECT first. When that’s successful, you can then execute an UPDATE statement with the same JOIN and WHERE clauses. That’s the best procedure to follow to ensure proper updating of data when joining multiple tables.

Deleting Within Joined Tables

Having used JOIN with SELECT and UPDATE statements, let’s look at some practical examples using DELETE. In Deleting in Multiple Tables, we saw an example of DELETE with a JOIN. In that example, we wanted to delete the rows where the member Elena Bokova has a yahoo.com email address from both the humans and the prize_winners tables from the birdwatchers database. For that purpose, we constructed a DELETE statement that worked fine, but there was potentially a problem with it. Here is that SQL statement again:

DELETE FROM humans, prize_winners

USING humans JOIN prize_winners

WHERE name_first = 'Elena'

AND name_last = 'Bokova'

AND email_address LIKE '%yahoo.com'

AND humans.human_id = prize_winners.human_id;

Compared to the JOIN clauses we’ve been using, the syntax here may look strange. This is how it works with a DELETE statement. Tables from which data is deleted are listed in the FROM clause, while tables used in the WHERE clause to provide filters to determine which rows to delete are listed in a USING clause. The clause "USING humans JOIN prize_winners" just tells the server that those two tables provide the columns in the WHERE clause.

NOTE

Don’t confuse a USING clause, which has JOIN subclauses, with the USING operator, which can be used in a JOIN clause.

As the preceding DELETE SQL statement is constructed, if MySQL finds a row in the humans table where the name and email information match, there has to be a matching row in the prize_winners table for the human_id. If there’s not a row in both, MySQL won’t delete the row in the humanstable and no error will be returned — you might not realize it failed. To allow for this possibility, we could use a LEFT JOIN like so:

DELETE FROM humans, prize_winners

USING humans LEFT JOIN prize_winners

ON humans.human_id = prize_winners.human_id

WHERE name_first = 'Elena'

AND name_last = 'Bokova'

AND email_address LIKE '%yahoo.com';

Notice that for this syntax we moved the valuation of the human_id columns to the USING clause, adding a LEFT JOIN and an ON operator to replace that condition in the WHERE clause. That’s necessary because if there’s not a match in the other table, the WHERE clause won’t include that row in the results to be deleted. With the LEFT JOIN, all of the rows in both the humans and the prize_winners tables that match the criteria given to it will be deleted, and any rows found in the humans table for which there isn’t a match in the prize_winners table, but which match the criteria of the WHEREclause will be deleted also. This prevents what are known as orphaned rows.

For general maintenance, we should check occasionally to see if there are rows in the prize_winners table that don’t have matching rows in the humans table, and then delete them. Someone might have had us delete their account, but we may have forgotten to remove entries for them in related tables. To handle that possibility, we could use RIGHT JOIN instead of LEFT JOIN. We could enter something like this:

DELETE FROM prize_winners

USING humans RIGHT JOIN prize_winners

ON humans.human_id = prize_winners.human_id

WHERE humans.human_id IS NULL;

In this DELETE statement, we listed only the prize_winners table in the FROM clause because that’s the only one from which we want to delete rows. It’s a good policy not to list tables that are not to be affected in the FROM clause of a DELETE statement, even if you think there’s no possible way that there is a row that would be deleted in the other tables.

Because we put the humans table first in the USING clause and the prize_winners table second, we’re doing a RIGHT JOIN so that columns from the table on the right (prize_winners) will be deleted even if there is no value in the table on the left. If we reversed the order of the tables, we would then need a LEFT JOIN for this task.

It’s worth focusing for a moment on the final clause of the previous DELETE statement, a WHERE clause checking for NULLs in one column. As we saw earlier, a LEFT JOIN or RIGHT JOIN can return rows where there was nothing in the column you’re doing the join on. The results contain NULL for the missing value. So in the WHERE clause here, we’re using that as the condition for finding the orphaned rows in the prize_winners table.

There are many contortions to the JOIN clause. The basic JOIN syntaxes that we covered in Selecting a Basic Join are worth learning well; they will be the ones you will use primarily. You will sometimes have a need for using a LEFT JOIN or a RIGHT JOIN. Let’s move on to a related topic that can be valuable in many situations: subqueries.

Subqueries

A subquery is a query within another query, a SELECT statement within another SQL statement. A subquery returns a single value, a row of data, a single column from several rows, or several columns from several rows. These are known respectively as scalar, column, row, and table subqueries. I’ll refer to these distinctions later in this chapter.

Although the same results can be accomplished by using the JOIN clause and sometimes the UNION, depending on the situation, subqueries are a cleaner approach. They make a complex query more modular, which makes it easier to create and to troubleshoot problems. Here are two generic examples of subqueries (we also used a few subqueries in Chapter 8):

UPDATE table_1

SET col_5 = 1

WHERE col_id =

SELECT col_id

FROM table_2

WHERE col_1 = value;

SELECT column_a, column_1

FROM table_1

JOIN

(SELECT column_1, column_2

FROM table_2

WHERE column_2 = value) AS derived_table

USING(col_id);

In the first example, the SELECT statement is an inner query. The UPDATE statement is referred to as the main or outer query. In the second example, the SELECT within parentheses is the inner query and the SELECT outside of the parentheses is the outer query. An outer query containing a subquery can be a SELECT, INSERT, UPDATE, DELETE, DO, or even a SET statement. There are some limitations, though. An outer query cannot generally select data or modify data from the same table of an inner query. This doesn’t apply though if the subquery is part of a FROM clause.

These generic examples may be confusing. Generic examples aren’t usually easy to follow. I’d rather present first the syntax for subqueries, but there is no syntax per se for the use of subqueries — other than the syntax inherent in the SQL statements used for the inner and outer queries. Subqueries are rather a method of constructing combinations of SQL statements. As such, you need only to make sure of two basic factors with subqueries.

The first factor of which you need to be mindful is how a subquery is contained within an outer query, where you position it. For instance, if you construct an outer query which is an UPDATE statement, you could place a subquery in the WHERE clause to provide a set of values to which a column is equal (e.g., as in the first generic example). Or you might locate a subquery in the FROM clause of an outer, SELECT statement (e.g., as in the second generic example). These are where subqueries may be positioned. You can have multiple subqueries within an outer query, but they will be positioned generally within the FROM clause or the WHERE clause.

The second factor is whether the results returned from a subquery are in keeping with the expectations of the outer query. For instance, in the first generic example, the UPDATE clause has a WHERE clause that expects a single value from the subquery. If the subquery returns several values, a row of columns, or a table of results, it will confuse MySQL and cause an error. So you need to be sure that the subquery you construct will return the type of values required by the outer query as you constructed it.

You’ll better understand these factors as we look at examples of them. As mentioned at the start of this section, the different types of subqueries are scalar, column, row, and table subqueries. In the following subsections, we’ll look at each of these types, along with examples of them.

Scalar Subqueries

The most basic subquery is one that returns a single value, a scalar value. This type of subquery is particularly useful in a WHERE clause in conjunction with an = operator, or in other instances where a single value from an expression is permitted. Let’s look at simple example of this. Let’s get a list of bird families that are members of the Galliformes bird order (i.e., Grouse, Partridges, Quails, and Turkeys). This can be done easily with a JOIN in which we join the birds and bird_families tables together based on the order_id for Galliformes. We’ll use instead a scalar subquery to get the order_id we need. Enter this in mysql:

SELECT scientific_name AS Family

FROM bird_families

WHERE order_id =

(SELECT order_id

FROM bird_orders

WHERE scientific_name = 'Galliformes');

+----------------+

| Family |

+----------------+

| Megapodiidae |

| Cracidae |

| Numididae |

| Odontophoridae |

| Phasianidae |

+----------------+

The inner query (i.e., the subquery here) returns one value, the order_id. That’s used to complete the WHERE clause of the outer query. That was pretty simple. Let’s look at another example of a scalar subquery.

We had an example earlier in this chapter, in the section related to using a JOIN, in which we selected members from Russia who had sighted birds of the family Scolopacidae. To thank members in Russia for using our telephone application for recording sightings, we’re going to give a one-year premium membership to one of those members. Enter this hefty SQL statement in mysql:

UPDATE humans

SET membership_type = 'premium',

membership_expiration = DATE_ADD(IFNULL(membership_expiration,

CURDATE()), INTERVAL 1 YEAR)

WHERE human_id =

(SELECT human_id

FROM

(SELECT human_id, COUNT(*) AS sightings, join_date

FROM birdwatchers.bird_sightings

JOIN birdwatchers.humans USING(human_id)

JOIN rookery.birds USING(bird_id)

JOIN rookery.bird_families USING(family_id)

WHERE country_id = 'ru'

AND bird_families.scientific_name = 'Scolopacidae'

GROUP BY human_id) AS derived_1

WHERE sightings > 5

ORDER BY join_date ASC

LIMIT 1);

The most inner query here is basically the same as the one in the example mentioned earlier. The difference is that here we’re not selecting the names involved. Instead, we’re selecting the human_id and the join_date (i.e., the date that the member joined). With the GROUP BY clause, we’re grouping members based on the human_id to get a count with the COUNT() function. Put another way, we’re counting the number of entries of each human_id in the bird_sightings table for the bird family and member country we specified. That subquery will return a table of results; it’s a table subquery. We’ll talk more about that type of subquery later in this chapter.

The query wrapped around the most inner query, which is also a subquery, selects only rows where the number of sightings is more than five. It orders the rows with newer members first based on the date the members joined — we want the newest Russian member reporting several Curlews and the like to be awarded a year of premium membership. This subquery is limited to one row with one column. It’s a scalar query.

The main query in the preceding example is using the single value from the scalar query to determine which member to give one year of premium membership. If we hadn’t added the LIMIT to the scalar query, it would have returned more than one value — it then wouldn’t have been a scalar query. Based on the operator in the WHERE clause of its outer query, MySQL would have returned an error message like this:

ERROR 1242 (ER_SUBSELECT_NO_1_ROW)

SQLSTATE = 21000

Message = "Subquery returns more than 1 row"

As with all subqueries, there’s always a way to get the same results without a subquery, using JOIN or some other method to bring results together in complex ways. To some extent, it’s a matter of style which method you decide to use. I generally prefer subqueries, especially when using them in applications I develop in PHP or Perl. They’re easier for me to decipher months or years later when I want to make changes to a program I’ve written.

Column Subqueries

In the preceding subsection, we discussed instances in which one scalar value was obtained in a WHERE clause. However, there are times when you may want to match multiple values. For those situations, you will need to use the subquery in conjunction with an operator such as IN, which is used to specify a comma-separated list of values. Let’s look at an example of this.

In one of the examples in the previous subsection, we used a scalar subquery to get a list of bird families for the bird order Galliformes. Suppose that we also want the common name of one bird species from each family in the order; we want to randomly select a bird name from each. To do this, we will create a subquery that will select a list of bird family names for the order. Enter the following SQL statement:

SELECT * FROM

(SELECT common_name AS 'Bird',

families.scientific_name AS 'Family'

FROM birds

JOIN bird_families AS families USING(family_id)

JOIN bird_orders AS orders USING(order_id)

WHERE common_name != ''

AND families.scientific_name IN

(SELECT DISTINCT families.scientific_name AS 'Family'

FROM bird_families AS families

JOIN bird_orders AS orders USING(order_id)

WHERE orders.scientific_name = 'Galliformes'

ORDER BY Family)

ORDER BY RAND()) AS derived_1

GROUP BY (Family);

+------------------------+----------------+

| Bird | Family |

+------------------------+----------------+

| White-crested Guan | Cracidae |

| Forsten's Scrubfowl | Megapodiidae |

| Helmeted Guineafowl | Numididae |

| Mountain Quail | Odontophoridae |

| Gray-striped Francolin | Phasianidae |

+------------------------+----------------+

In this example, we have two subqueries, a subquery within a subquery, within an outer query. The most inner subquery is known as a nested subquery. The subqueries here are executed before the outer query, so the results will be available before the WHERE clause of the outer query is executed. In that vein, the nested subquery will be executed before the subquery in which it is contained. In this example, the nested query is contained within the parentheses of the IN operator — the most indented query. That SQL statement selects the bird family name where the name of the order is Galliformes. The DISTINCT flag by the alias Family instructs MySQL to return only one entry for each distinct family name. If we had manually entered that information, it would look like this: ('Cracidae','Megapodiidae','Numididae','Odontophoridae','Phasianidae'). This subquery is a multiple-field or column subquery.

The inner subquery in the preceding example is a table subquery. It selects a list of all birds that are in the list of bird families provided by its subquery. We could just select one bird for each family at this level using a GROUP BY clause to group by the Family name to get one bird species per family. But that would select the first rows found and the results would be the same every time. We want to select randomly each time this SQL statement is executed. To do that, we’re selecting all of the birds for each bird family and then using ORDER BY RAND() to randomly order the rows of the results table. Then we’re wrapping that in another query, the outer query to GROUP BY the bird family. That will give us one entry for each bird family.

Row Subqueries

Row subqueries retrieve a single row of data that is then used by the outer query. It’s used in a WHERE clause to compare one row of columns to one row of columns selected in the subquery. Let’s consider an example of this and then we’ll discuss it more. Suppose another bird-watcher site closes, this one in Eastern Europe. They send us their database, which contains a table with the names of their members, and another table with information members provided related to birds they spotted. We put both of these tables in the birdwatchers database to import into our tables. In the process of importing these members into our humans table, we discover people who are already members of our site. That’s OK: we know how to avoid importing the duplicates. Now we want to import the table of birds spottings. Because there were duplicate members, maybe those members have logged information on birds they saw in the wild on this Eastern European site. So we want to check that each entry is not a duplicate and then import it. Look at this SQL statement:

INSERT INTO bird_sightings

(bird_id, human_id, time_seen, location_gps)

VALUES

(SELECT birds.bird_id, humans.human_id,

date_spotted, gps_coordinates

FROM

(SELECT personal_name, family_name, science_name, date_spotted,

CONCAT(latitude, '; ', longitude) AS gps_coordinates

FROM eastern_birders

JOIN eastern_birders_spottings USING(birder_id)

WHERE

(personal_name, family_name,

science_name, CONCAT(latitude, '; ', longitude) )

NOT IN

(SELECT name_first, name_last, scientific_name, location_gps

FROM humans

JOIN bird_sightings USING(human_id)

JOIN rookery.birds USING(bird_id) ) ) AS derived_1

JOIN humans

ON(personal_name = name_first

AND family_name = name_last)

JOIN rookery.birds

ON(scientific_name = science_name) );

This looks very complicated and can be difficult to understand or construct correctly. Let’s discern the major elements here. Look first at the subquery in parentheses, the nested subquery. We’re selecting data from tables in our database: the names of each person, the bird species and where the member sighted it. This nested subquery is contained within the WHERE clause of another subquery, a row subquery. Notice that a list of columns from the tables of the row subquery is given in parentheses. So the condition of the WHERE clause is that the values of those columns for each row of the joined tables are compared to the values of the columns for each row from joined tables in its subquery. The outer query inserts the relevant values into the bird_sightings table.

The preceding example is certainly an odd one and seemingly, unnecessarily complex. But there are times when a row query like this can be useful. To put our example more simply, if there’s a row with the same human name who spotted the same bird species at the exact same map coordinates, don’t import it. If all of those values are not the same, then insert it into the bird_sightings table. There are other ways, though, you can accomplish this task. For instance, you might do this in stages with multiple SQL statements and a temporary table. You could also do it in stages within a program using one of the languages like Perl and an API like the Perl DBI. But it’s good to know you have the option of doing it within one SQL statement if that’s what you want.

Table Subqueries

A subquery can be used to generate a results set, a table from which an outer query can select data. That is to say, a subquery can be used in a FROM clause as if it were another table in a database. It is said to be a derived table.

There are a few rules related to table subqueries. Each derived table must be assigned an alias — any unique name is fine. You can use the keyword AS for assigning an alias. Each column in a subquery that is in part of a FROM clause must have a unique name. For instance, if you select the same column twice in a subquery, you have to assign at least one of them an alias that is unique. A subquery contained in a FROM clause cannot generally be a correlated subquery; it cannot reference the same table as the outer query.

For an example of a table subquery, let’s use the example near the beginning of this chapter that used a UNION. In that example, we had two SELECT statements which counted the number of rows for birds in two bird families: Pelecanidae and Ardeidae. With a UNION, the results were merged into one results set. That was a bulky method. We can do better with a table subquery. The subquery we’ll use will select just the bird family name for each bird of the two families that we wanted to count. That may seem silly, to list the bird family name multiple times, especially when we already know the name of the bird families we want to count. But that’s how we can count them and use the name for our results set. MySQL won’t display the names multiple times — that will go on behind the scenes. It will display only one entry per family because of the GROUP BY clause. Enter the following:

SELECT family AS 'Bird Family',

COUNT(*) AS 'Number of Birds'

FROM

(SELECT families.scientific_name AS family

FROM birds

JOIN bird_families AS families USING(family_id)

WHERE families.scientific_name IN('Pelecanidae','Ardeidae')) AS derived_1

GROUP BY family;

+-------------+-----------------+

| Bird Family | Number of Birds |

+-------------+-----------------+

| Ardeidae | 157 |

| Pelecanidae | 10 |

+-------------+-----------------+

This a much better way to form this unified results set than using a UNION. We could add more bird family names to the WHERE clause in the subquery to get more rows in the results set, instead of having to copy the SELECT statement for each family we add.

You can see in this example that a table subquery is the same as a table in the FROM clause. We can even give it an alias (e.g., derived_1) as we can with a normal table. The subquery returns a table of results (i.e., the bird family names). The GROUP BY clause tells MySQL to group the results based on the family field, the alias in the subquery for the scientific_name column of the bird_families table. We used that same alias to select that field in the column list of the outer query. When a column in a subquery is set to an alias, you have to use the alias; the column name becomes inaccessible outside the subquery when an alias is given.

Performance Considerations with Subqueries

Performance problems can occur with subqueries if they are not well constructed. There can be a performance drain when a subquery is placed within an IN() operator as part of a WHERE clause of the outer query. It’s generally better to use instead the = operator, along with AND for eachcolumn=value pair. For situations in which you suspect poor performance with a subquery, try reconstructing the SQL statement with JOIN and compare the differences between the two SQL statements using the BENCHMARK() function. For ideas on improving subquery performance, Oracle has tips on their site for Optimizing Subqueries.

Summary

Many developers prefer subqueries — I do. They’re easier to construct and decipher when you have problems later. If you work on a database that is very large and has a huge amount of activity, subqueries may not be a good choice because they can sometimes affect performance. For small databases, though, they’re fine. You should learn to use subqueries and learn how to work without them (i.e, use JOIN) so you can handle any situation presented to you. You cannot be sure which method your next employer and team of developers may being using. It’s best to be versatile.

As for learning to use JOIN, that’s hardly optional. Very few developers don’t use JOIN. Even if you prefer subqueries, they still call for JOIN. You can see this in almost all of the examples of subqueries in this chapter. You may rarely use UNION. But there’s not much to learn there. However, you should be proficient in using JOIN. So don’t avoid them; practice manually entering SQL statements that use them. The act of typing them helps.

Exercises

The goal of the following exercises is to give you practice assembling tables using JOIN and creating subqueries. In the process of doing these exercises, think about how tables and data come together. Try to envision each table as a separate piece of paper with a list of data on it, and how you might place them on a desk to find information on them in relation to each other. In such a scenario, you might tend to place your left index finger at one point on a page on the left and your right index finger on a point on another page on your right. That’s a join. Where you point on each are the join points. As you type the SQL statements in these exercises, think of this scene and say aloud what you’re doing, what you’re telling MySQL to do. It helps to better understand the joining of tables and creating of subqueries.

1. In the birdwatchers database, there is a table called bird_sightings in which there are records of birds that members have seen in the wild. Suppose we have a contest in which we will award a prize based on the most sightings of birds from the order Galliformes. A member gets one point for each sighting of birds in this order.
Construct an SQL statement to count the number of entries from each member. There should be two fields in the results set: one containing the human_id with Birder as the alias; and the second field containing the number of entries with Entries as its alias. To accomplish this, join thebird_sightings table to birds, bird_families, and bird_orders. Remember that these tables are in a different database. You will have to use the COUNT() function and a GROUP BY clause. Do all of this with JOIN and not with subqueries. Your results should look like the following:

2. +--------+---------+

3. | Birder | Entries |

4. +--------+---------+

5. | 19 | 1 |

6. | 28 | 5 |

+--------+---------+


When you have successfully constructed this SQL statement, modify it to join in the humans table. In the column list, replace the field for human_id with the first and last name of the member. Use the CONCAT() function to put them together into a single field (with a space in between the names), with the same alias. Once you make the needed changes and execute it, the results should look like this, but the number of names and points may be different:

+--------------+--------+

| Birder | Points |

+--------------+--------+

| Elena Bokova | 4 |

| Marie Dyer | 8 |

+--------------+--------+

2. In the preceding exercises, you were asked to count the number of bird species the members sighted from the Galliformes. So that the contest is more fun, instead of giving one point for each bird species in that order, give a point for only one bird species per bird family in the bird order. That means that a member doesn’t get more points for sighting the same bird species multiple times. A member also doesn’t get more points for spotting several birds in the same family. Instead, the member has to look through bird guides to find a species for each species and then go looking for one from each in their area. This should make the contest more of an adventure for the members.
To allow for the change to the contest, you will need to modify the SQL statement you constructed at the end of the previous exercise. First, you will need to add a DISTINCT to the start of the column list in the outer query. You’ll need to remove the CONCAT() and GROUP BY. When you’ve done that, execute the SQL statement to make sure you have no errors. You should get a results set that shows multiple entries for some members. Next, place the whole SQL statement inside another SQL statement to make it a subquery. The new, outer query should include CONCAT()and GROUP BY so that it can count the single entries from each family for each member. It should return results like this:

3. +--------------+--------+

4. | Birder | Points |

5. +--------------+--------+

6. | Elena Bokova | 1 |

7. | Marie Dyer | 5 |

+--------------+--------+

3. There are five families in the Galliformes bird order. For the contest described in the last two exercises, the most points that a member could achieve therefore is 5. Change the SQL statement you entered at the end of the previous exercise to list only members who have 5 points. To do this, you will need to wrap the previous SQL statement inside another, creating a nested query. When you execute the full SQL statement, the results should look like this:

4. +------------+--------+

5. | Birder | Points |

6. +------------+--------+

7. | Marie Dyer | 5 |

+------------+--------+