Aggregate Clauses, Aggregate Functions, and Subqueries - SQL Statements and Functions - MySQL in a Nutshell (2008)

MySQL in a Nutshell (2008)

Part II. SQL Statements and Functions

Chapter 10. Aggregate Clauses, Aggregate Functions, and Subqueries

MySQL has many built-in functions that you can use in SQL statements for performing calculations on combinations of values in databases; these are called aggregate functions. They include such types of basic statistical analysis as counting rows, determining the average of a given column’s value, finding the standard deviation, and so forth. The first section of this chapter describes MySQL aggregate functions and includes examples of most of them. The second section provides a tutorial about subqueries. It includes several examples of subqueries in addition to the ones shown in the first section and in various examples throughout this book. Subqueries are included in this chapter because they are often used with GROUP BY and aggregate functions and because they’re another method for grouping selected data.

The following functions are covered in this chapter:

AVG(), BIT_AND(), BIT_OR(), COUNT(), GROUP_CONCAT(), MAX(), MIN(), STD(), STDDEV(), STDDEV_POP(), STDDEV_SAMP(), SUM(), VAR_POP(), VAR_SAMP(), VARIANCE().

Aggregate Functions in Alphabetical Order

This section describes each aggregate function. Many of the examples use a subquery. For detailed information about subqueries, see the Subqueries” section later in this chapter.

A few general aspects of aggregate functions include:

§ Aggregate functions return NULL when they encounter an error.

§ Most uses for aggregate functions include a GROUP BY clause, which is specified in each description. If an aggregate function is used without a GROUP BY clause it operates on all rows.

Name

AVG()

Synopsis

AVG([DISTINCT] column)

This function returns the average or mean of a set of numbers given as the argument. It returns NULL if unsuccessful. The DISTINCT keyword causes the function to count only unique values in the calculation; duplicate values will not factor into the averaging.

When returning multiple rows, you generally want to use this function with the GROUP BY clause that groups the values for each unique item, so that you can get the average for that item. This will be clearer with an example:

SELECT sales_rep_id,

CONCAT(name_first, SPACE(1), name_last) AS rep_name,

AVG(sale_amount) AS avg_sales

FROM sales

JOIN sales_reps USING(sales_rep_id)

GROUP BY sales_rep_id;

This SQL statement returns the average amount of sales in the sales table made by each sales representative. It will total all values found for the sale_amount column, for each unique value for sales_rep_id, and divide by the number of rows found for each of those unique values. If you would like to include sales representatives who made no sales in the results, you’ll need to change the JOIN to a RIGHT JOIN:

SELECT sales_rep_id,

CONCAT(name_first, SPACE(1), name_last) AS rep_name,

FORMAT(AVG(sale_amount), 2) AS avg_sales

FROM sales

RIGHT JOIN sales_reps USING(sales_rep_id)

GROUP BY sales_rep_id;

Sales representatives who made no sales will show up with NULL in the avg_sales column. This version of the statement also includes an enhancement: it rounds the results for avg_sales to two decimal places by adding the FORMAT() function.

If we only want the average sales for the current month, we could add a WHERE clause. However, that would negate the effect of the RIGHT JOIN: sales people without orders for the month wouldn’t appear in the list. To include them, first we need to run a subquery that extracts the sales data that meets the conditions of the WHERE clause, and then we need to join the subquery’s results to another subquery containing a tidy list of the names of sales reps:

SELECT sales_rep_id, rep_name,

IFNULL(avg_sales, 'none') as avg_sales_month

FROM

(SELECT sales_rep_id,

FORMAT(AVG(sale_amount), 2) AS avg_sales

FROM sales

JOIN sales_reps USING(sales_rep_id)

WHERE DATE_FORMAT(date_of_sale, '%Y%m') =

DATE_FORMAT(CURDATE(), '%Y%m')

GROUP BY sales_rep_id) AS active_reps

RIGHT JOIN

(SELECT sales_rep_id,

CONCAT(name_first, SPACE(1), name_last) AS rep_name

FROM sales_reps) AS all_reps

USING(sales_rep_id)

GROUP BY sales_rep_id;

In the first subquery here, we are determining the average sales for each sales rep that had sales for the current month. In the second subquery, we’re putting together a list of names of all sales reps, regardless of sales. In the main query, using the sales_rep_id column as the joining point of the two results sets derived from the subqueries, we are creating a results set that will show the average sales for the month for each rep that had some sales, or (using IFNULL()) the word none for those who had none.

Name

BIT_AND()

Synopsis

BIT_AND(expression)

This function returns the bitwise AND for all bits for the expression given. Use this in conjunction with the GROUP BY clause. The function has a 64-bit precision. If there are no matching rows, before version 4.0.17 of MySQL, –1 is returned. Newer versions return 18446744073709551615, which is the value of 1 for all bits of an unsigned BIGINT column.

Name

BIT_OR()

Synopsis

BIT_OR(expression)

This function returns the bitwise OR for all bits for the expression given. It calculates with a 64-bit precision (BIGINT). It returns 0 if no matching rows are found. Use it in conjunction with the GROUP BY clause.

Name

BIT_XOR()

Synopsis

BIT_XOR(expression)

This function returns the bitwise XOR (exclusive OR) for all bits for the expression given. It calculates with a 64-bit precision (BIGINT). It returns 0 if no matching rows are found. Use it in conjunction with the GROUP BY clause. This function is available as of version 4.1.1 of MySQL.

Name

COUNT()

Synopsis

COUNT([DISTINCT] expression)

This function returns the number of rows retrieved in the SELECT statement for the given column. By default, rows in which the column is NULL are not counted. If the wildcard * is used as the argument, the function counts all rows, including those with NULL values. If you want only a count of the number of rows in the table, you don’t need GROUP BY, and you can still include a WHERE to count only rows meeting specific criteria. If you want a count of the number of rows for each value of a column, you will need to use the GROUP BY clause. As an alternative to usingGROUP BY, you can add the DISTINCT keyword to get a count of unique non-NULL values found for the given column. When you use DISTINCT, you cannot include any other columns in the SELECT statement. You can, however, include multiple columns or expressions within the function. Here is an example:

SELECT branch_name,

COUNT(sales_rep_id) AS number_of_reps

FROM sales_reps

JOIN branch_offices USING(branch_id)

GROUP BY branch_id;

This example joins the sales_reps and branch_offices tables together using the branch_id contained in both tables. We then use the COUNT() function to count the number of sales reps found for each branch (determined by the GROUP BY clause).

Name

GROUP_CONCAT()

Synopsis

GROUP_CONCAT([DISTINCT] expression[, ...]

[ORDER BY {unsigned_integer|column|expression}

[ASC|DESC] [,column...]]

[SEPARATOR character])

This function returns non-NULL values of a group concatenated by a GROUP BY clause, separated by commas. The parameters for this function are included in the parentheses, separated by spaces, not commas. The function returns NULL if the group doesn’t contain non-NULL values.

Duplicates are omitted with the DISTINCT keyword. The ORDER BY clause instructs the function to sort values before concatenating them. Ordering may be based on an unsigned integer value, a column, or an expression. The sort order can be set to ascending with the ASC keyword (default), or to descending with DESC. To use a different separator from a comma, use the SEPARATOR keyword followed by the preferred separator.

The value of the system variable group_concat_max_len limits the number of elements returned. Its default is 1024. Use the SET statement to change the value. This function is available as of version 4.1 of MySQL.

As an example of this function, suppose that we wanted to know how many customers order a particular item. We could enter an SQL statement like this:

SELECT item_nbr AS Item,

GROUP_CONCAT(quantity) AS Quantities

FROM orders

WHERE item_nbr = 100

GROUP BY item_nbr;

+------+------------+

| Item | Quantities |

+------+------------+

| 100 | 7,12,4,8,4 |

+------+------------+

Notice that the quantities aren’t sorted—it’s the item numbers that are sorted by the GROUP BY clause. To sort the quantities within each field and to use a different separator, we would enter something like the following instead:

SELECT item_nbr AS Item,

GROUP_CONCAT(DISTINCT quantity

ORDER BY quantity ASC

SEPARATOR '|')

AS Quantities

FROM orders

WHERE item_nbr = 100

GROUP BY item_nbr;

+------+------------+

| Item | Quantities |

+------+------------+

| 100 | 4|7|8|12 |

+------+------------+

Because the results previously contained a duplicate value (4), we’re eliminating duplicates here by including the DISTINCT keyword.

Name

MAX()

Synopsis

MAX(expression)

This function returns the highest number in the values for a given column. It’s normally used in conjunction with a GROUP BY clause specifying a unique column, so that values are compared for each unique item separately.

As an example of this function, suppose that we wanted to know the maximum sale for each sales person for the month. We could enter the following SQL statement:

SELECT CONCAT(name_first, SPACE(1), name_last) AS rep_name,

MAX(sale_amount) AS biggest_sale

FROM sales

JOIN sales_reps USING(sales_rep_id)

WHERE DATE_FORMAT(date_of_sale, '%Y%m') =

DATE_FORMAT(CURDATE(), '%Y%m')

GROUP BY sales_rep_id DESC;

We’ve given sale_amount as the column for which we want the largest value returned for each sales rep. The WHERE clause indicates that we want only sales for the current month. Notice that the GROUP BY clause includes the DESC keyword. This will order the rows in descending order for the values of the biggest_sale field: the biggest sale at the top, the smallest at the bottom.

Here’s an example of another handy but less obvious use of this function: suppose we have a table in which client profiles are kept by the sales people. When a sales rep changes a client profile through a web interface, instead of updating the existing row, the program we wrote creates a new entry. We use this method to prevent sales people from inadvertently overwriting data and to keep previous client profiles in case someone wants to refer to them later. When the client profile is viewed through the web interface, we want only the latest profile to appear. Retrieving the latest row becomes a bit cumbersome, but we can do this with MAX() and a subquery as follows:

SELECT client_name, profile,

MAX(entry_date) AS last_entry

FROM

(SELECT client_id, entry_date, profile

FROM client_profiles

ORDER BY client_id, entry_date DESC) AS profiles

JOIN clients USING(client_id)

GROUP BY client_id;

In the subquery, we retrieve a list of profiles with the date each has in its entry in the table client_profiles; the results contain the duplicate entries for clients. In the main query, using MAX(), we get the maximum (latest) date for each client. The associated profile is included in the columns selected by the main query. We join the results of the subquery to the clients table to extract the client’s name.

The subquery is necessary so that we get the latest date instead of the oldest. The problem is that the GROUP BY clause orders the fields based on the given column. Without the subquery, the GROUP BY clause would use the value for the entry_date of the first row it finds, which will be the earliest date, not the latest. So we order the data in the subquery with the latest entry for each client first. GROUP BY then takes the first entry of the subquery results, which will be the latest entry.

Name

MIN()

Synopsis

MIN(expression)

This function returns the lowest number in the values for a given column. It’s normally used in conjunction with a GROUP BY clause specifying a unique column, so that values are compared for each unique item separately. Here is an example:

SELECT CONCAT(name_first, SPACE(1), name_last) AS rep_name,

MIN(sale_amount) AS smallest_sale,

MAX(sale_amount) AS biggest_sale

FROM sales

JOIN sales_reps USING(sales_rep_id)

GROUP BY sales_rep_id;

In this example, we retrieve the smallest sale and largest sale made by each sales representative. We use JOIN to join the two tables to get the sales rep’s name. Because MAX() is very similar, see the examples in its description earlier in this chapter for additional ways to use MIN().

Name

STD()

Synopsis

STD(expression)

This function returns the population standard deviation of the given column. This function is an alias for STDDEV(); see the description of that function for an example of its use.

Name

STDDEV()

Synopsis

STDDEV(expression)

This function returns the population standard deviation of the given column. It’s normally used in conjunction with a GROUP BY clause specifying a unique column, so that values are compared for each unique item separately. It returns NULL if no matching rows are found. Here is an example:

SELECT CONCAT(name_first, SPACE(1), name_last) AS rep_name,

SUM(sale_amount) AS total_sales,

COUNT(sale_amount) AS total_tickets,

AVG(sale_amount) AS avg_sale_per_ticket,

STDDEV(sale_amount) AS standard_deviation

FROM sales

JOIN sales_reps USING(sales_rep_id)

GROUP BY sales_rep_id;

This statement employs several aggregate functions. We use SUM() to get the total sales for each sales rep, COUNT() to retrieve the number of orders for the each, AVG() to determine the average sale, and STDDEV() to find out how much each sale made by each sales rep tends to vary from each one’s average sale. Incidentally, statistical functions return several decimal places. To return only two decimal places, you can wrap each function in FORMAT().

Name

STDDEV_POP()

Synopsis

STDDEV_POP(expression)

This function returns the population standard deviation of the given column. It was added in version 5.0.3 of MySQL for compliance with SQL standards. This function is an alias for STDDEV(); see the description of that function earlier in this chapter for an example of its use.

Name

STDDEV_SAMP()

Synopsis

STDDEV_SAMP(expression)

This function returns the sample standard deviation of the given column. It’s normally used in conjunction with a GROUP BY clause specifying a unique column, so that values are compared for each unique item separately. It returns NULL if no matching rows are found. It was added in version 5.0.3 of MySQL for compliance with SQL standards. Here is an example:

SELECT CONCAT(name_first, SPACE(1), name_last) AS rep_name,

AVG(sale_amount) AS avg_sale_per_ticket,

STDDEV_POP(sale_amount) AS population_std_dev,

STDDEV_SAMP(sale_amount) AS sample_std_dev

FROM sales

JOIN sales_reps USING(sales_rep_id)

GROUP BY sales_rep_id;

This SQL statement uses several aggregate functions: AVG() to determine the average sale for each sales rep; STDDEV_POP() to determine how much each sale made by each sales rep tends to vary from each rep’s average sale; and STDDEV_SAMP() to determine the standard deviation from the average based on a sample of the data.

Name

SUM()

Synopsis

SUM([DISTINCT] expression)

This function returns the sum of the values for the given column or expression. It’s normally used in conjunction with a GROUP BY clause specifying a unique column, so that values are compared for each unique item separately. It returns NULL if no matching rows are found. The parameterDISTINCT may be given within the parentheses of the function to add only unique values found for a given column. This parameter was added in version 5.1 of MySQL. Here is an example:

SELECT sales_rep_id,

SUM(sale_amount) AS total_sales

FROM sales

WHERE DATE_FORMAT(date_of_sale, '%Y%m') =

DATE_FORMAT(SUBDATE(CURDATE(), INTERVAL 1 MONTH), '%Y%m')

GROUP BY sales_rep_id;

This statement queries the sales table to retrieve only sales made during the last month. From these results, SUM() returns the total sale amounts aggregated by the sales_rep_id (see Grouping SELECT results” under the SELECT statement in Chapter 6).

Name

VAR_POP()

Synopsis

VAR_POP(expression)

This function returns the variance of a given column, based on the rows selected as a population. It’s synonymous with VARIANCE and was added in version 5.0.3 of MySQL for compliance with SQL standards. See the description of VAR_SAMP() for an example of this function’s use.

Name

VAR_SAMP()

Synopsis

VAR_SAMP(expression)

This function returns the variance of a given column, based on the rows selected as a sample of a given population. It’s normally used in conjunction with a GROUP BY clause specifying a unique column, so that values are compared for each unique item separately. To determine the variance based on the entire population rather than a sample, use VAR_POP(). Both of these functions were added in version 5.0.3 of MySQL for compliance with SQL standards. Here is an example of both:

SELECT CONCAT(name_first, SPACE(1), name_last) AS rep_name,

AVG(sale_amount) AS avg_sale,

STDDEV_POP(sale_amount) AS population_std_dev,

STDDEV_SAMP(sale_amount) AS sample_std_dev,

VAR_POP(sale_amount) AS population_variance,

VAR_SAMP(sale_amount) AS sample_variance

FROM sales

JOIN sales_reps USING(sales_rep_id)

GROUP BY sales_rep_id;

This SQL statement uses several aggregate functions: AVG() to determine the average sale for each sales rep; STDDEV_POP() to determine how much each sale made by each sales rep tends to vary from each rep’s average sale; and STDDEV_SAMP() to determine the standard deviation from the average based on a sample of the data. It also includes VAR_POP() to show the variances based on the population, and VAR_SAMP() to return the variance based on the sample data.

Name

VARIANCE()

Synopsis

VARIANCE(expression)

The variance is determined by taking the difference between each given value and the average of all values given. Each of those differences is then squared, and the results are totaled. The average of that total is then determined to get the variance. This function returns the variance of a given column, based on the rows selected as a population. It’s normally used in conjunction with a GROUP BY clause specifying a unique column, so that values are compared for each unique item separately. This function is available as of version 4.1 of MySQL. Here is an example:

SELECT CONCAT(name_first, SPACE(1), name_last) AS rep_name,

AVG(sale_amount) AS avg_sale,

STDDEV_POP(sale_amount) AS standard_deviation,

VARIANCE(sale_amount) AS variance

FROM sales

JOIN sales_reps USING(sales_rep_id)

GROUP BY sales_rep_id;

This SQL statement uses a few aggregate functions: AVG() to determine the average sale for each sales rep; STDDEV_POP() to determine how much each sale made by each sales rep tends to vary from each rep’s average sale; and VARIANCE() to show the variances based on the population. To comply with SQL standards, VAR_POP() could have been used instead of VARIANCE().

Subqueries

A subquery is a SELECT statement nested within another SQL statement. This feature became available as of version 4.1 of MySQL. Although the same results can be accomplished by using the JOIN clause or UNION, depending on the situation, subqueries are a cleaner approach that is sometimes easier to read. They make a complex query more modular, which makes it easier to create and to troubleshoot. Here is a simple example of a subquery:

SELECT *

FROM

(SELECT col1, col2

FROM table1

WHERE col_id = 1000) AS derived1

ORDER BY col2;

In this example, the subquery or inner query is a SELECT statement specifying two column names. The other query is called the main or outer query. It doesn’t have to be a SELECT. It can be an INSERT, a DELETE, a DO, an UPDATE, or even a SET statement. The outer query generally can’t select data or modify data from the same table as an inner query, but this doesn’t apply if the subquery is part of a FROM clause. A subquery can return a value (a scalar), a field, multiple fields containing values, or a full results set that serves as a derived table.

You can encounter performance problems with subqueries if they are not well constructed. One problem occurs when a subquery is placed within an IN() clause as part of a WHERE clause. It’s generally better to use the = operator for each value, along with AND for each parameter/value pair.

When you see a performance problem with a subquery, try reconstructing the SQL statement with JOIN and compare the differences using the BENCHMARK() function. If the performance is better without a subquery, don’t give up on subqueries. Only in some situations is performance poorer. For those situations where there is a performance drain, MySQL AB is working on improving MySQL subqueries. So performance problems you experience now may be resolved in future versions. You may just need to upgrade to the current release or watch for improvements in future releases.

Single Field Subqueries

The most basic subquery is one that returns a scalar or single value. This type of subquery is particularly useful in a WHERE clause in conjunction with an = operator, or in other instances where a single value from an expression is permitted.

As an example of this situation, suppose that at our fictitious college one of the music teachers, Sonia Oram, has called us saying that she wants a list of students for one of her classes so that she can call them to invite them to a concert. She wants the names and telephone numbers for only the students in her first period Monday morning class.

The way most databases store this data, the course number would be a unique key and would make it easy to retrieve the other data without a subquery. But Sonia doesn’t know the course number, so we enter an SQL statement like this:

SELECT CONCAT(name_first, ' ', name_last) AS student,

phone_home, phone_dorm

FROM students

JOIN course_rosters USING (student_id)

WHERE course_id =

(SELECT course_id

FROM course_schedule

JOIN teachers USING (teacher_id)

WHERE semester_code = '2007AU'

AND class_time = 'monday_01'

AND name_first = 'Sonia'

AND name_last = 'Oram');

Notice in the subquery that we’re joining the course_schedule table with teachers so we can give the teacher’s first and last name in the WHERE clause of the subquery. We’re also indicating in the WHERE clause a specific semester (Autumn 2007) and time slot (Monday, first period). The results of these specifics should be one course identification number because a teacher won’t teach more than one class during a particular class period. That single course number will be used by the WHERE clause of the main query to return the list of students on the class roster for the course, along with their telephone numbers.

If by chance more than one value is returned by the subquery in the previous example, MySQL will return an error:

ERROR 1242 (ER_SUBSELECT_NO_1_ROW)

SQLSTATE = 21000

Message = "Subquery returns more than 1 row"

Despite our supposition, it is possible that a teacher might teach more than one class at a time: perhaps the teacher is teaching one course in violin and another in viola, but each class had so few students that the department head put them together. In such a situation, the teacher would want the data for both course numbers. To use multiple fields derived from a subquery in a WHERE clause like this, we would have to use something other than the = operator, such as IN. For this kind of situation, see the next section on Multiple Fields Subqueries.”

Multiple Fields Subqueries

In the previous section, we discussed instances where one scalar value was obtained from a subquery in a WHERE clause. However, there are times when you may want to match multiple values. For those situations you will need to use the subquery in conjunction with an operator or a clause:ALL, ANY, EXISTS, IN, or SOME.

As an example of a multiple fields subquery—and specifically of a subquery using IN (or using ANY or SOME)—let’s adapt the example from the previous section to a situation where the teacher wants the contact information for students in all of her classes. To do this, we can enter the following SQL statement:

SELECT CONCAT(name_first, ' ', name_last) AS student,

phone_home, phone_dorm

FROM students

JOIN course_rosters USING (student_id)

WHERE course_id IN

(SELECT course_id

FROM course_schedule

JOIN teachers USING (teacher_id)

WHERE semester_code = '2007AU'

AND name_first = 'Sonia'

AND name_last = 'Oram');

In this example, notice that the subquery is contained within the parentheses of the IN clause. Subqueries are executed first, so the results will be available before the WHERE clause is executed. Although a comma-separated list isn’t returned, MySQL still accepts the results so that they may be used by the outer query. The criteria of the WHERE clause here does not specify a specific time slot as the earlier example did, so multiple values are much more likely to be returned.

Instead of IN, you can use ANY or SOME to obtain the same results by the same methods. (ANY and SOME are synonymous.) These two keywords must be preceded by a comparison operator (e.g., =, <, >). For example, we could replace the IN in the SQL previous statement with = ANY or with= SOME and the same results will be returned. IN can be preceded with NOT for negative comparisons: NOT IN(...). This is the same as != ANY (...) and != SOME (...).

Let’s look at another subquery returning multiple values but using the ALL operator. The ALL operator must be preceded by a comparison operator (e.g., =, <, >). As an example of this usage, suppose one of the piano teachers provides weekend seminars for students. Suppose also that he heard a few students are enrolled in all of the seminars he has scheduled for the semester and he wants a list of their names and telephone numbers in advance. We should be able to get that data by entering an SQL statement like the following (though currently it doesn’t work, for reasons to be explained shortly):

SELECT DISTINCT student_id,

CONCAT(name_first, ' ', name_last) AS student

FROM students

JOIN seminar_rosters USING (student_id)

WHERE seminar_id = ALL

(SELECT seminar_id

FROM seminar_schedule

JOIN teachers ON (instructor_id = teacher_id)

WHERE semester_code = '2007AU'

AND name_first = 'Sam'

AND name_last = 'Oram');

In this example, a couple of the tables have different column names for the ID we want, and we have to join one of them with ON instead of USING, but that has nothing to do with the subquery. What’s significant is that this subquery returns a list of seminar identification numbers and is used in the WHERE clause of the main query with = ALL. Unfortunately, although this statement is constructed correctly, it doesn’t work with MySQL at the time of this writing and just returns an empty set. However, it should work in future releases of MySQL, so I’ve included it for future reference. For now, we would have to reorganize the SQL statement like so:

SELECT student_id, student

FROM

(SELECT student_id, COUNT(*)

AS nbr_seminars_registered,

CONCAT(name_first, ' ', name_last)

AS student

FROM students

JOIN seminar_rosters USING (student_id)

WHERE seminar_id IN

(SELECT seminar_id

FROM seminar_schedule

JOIN teachers

ON (instructor_id = teacher_id)

WHERE semester_code = '2007AU'

AND name_first = 'Sam'

AND name_last = 'Oram')

GROUP BY student_id) AS students_registered

WHERE nbr_seminars_registered =

(SELECT COUNT(*) AS nbr_seminars

FROM seminar_schedule

JOIN teachers

ON (instructor_id = teacher_id)

WHERE semester_code = '2007AU'

AND name_first = 'Sam'

AND name_last = 'Oram');

This is much more involved, but it does work with the latest release of MySQL.

The first subquery is used to get the student’s name. This subquery’s WHERE clause uses another subquery to retrieve the list of seminars taught by the professor for the semester, to determine the results set from which the main query will draw its ultimate data. The third subquery counts the number of seminars that the same professor is teaching for the semester. This single value is used with the WHERE clause of the main query. In essence, we’re determining the number of seminars the professor is teaching and which students are registered for all of them.

The last possible method for using multiple fields in a subquery uses EXISTS. With EXISTS, in order for it to return meaningful or desired results, you need to stipulate in the WHERE clauses of the subquery a point in which it is joined to the outer query. Using the example from the previous section involving the teacher Sonia Oram, let’s suppose that we want to retrieve a list of courses that she teaches:

SELECT DISTINCT course_id, course_name

FROM courses

WHERE EXISTS

(SELECT course_id

FROM course_schedule

JOIN teachers USING (teacher_id)

WHERE semester_code = '2007AU'

AND name_first = 'Sonia'

AND name_last = 'Oram'

AND courses.course_id = course_schedule.course_id);

As you can see here, we’ve added EXISTS to the WHERE clause with the subquery in parentheses, similar to using IN. The significant difference is that we added courses.course_id = course_schedule.course_id to the end. Without it, a list of all courses would be returned regardless of the criteria of the WHERE clause in the subquery. Incidentally, if we specified NOT EXISTS instead, we would get all courses except for the ones taught by the teacher given.

Results Set Subqueries

A subquery can be used to generate a results set, which is a table from which an outer query can select data. That is, a subquery can be used in a FROM clause as if it were another table in a database. It is a derived table. Along these lines, each derived table must be named. This is done withAS following the parentheses containing the subquery. A subquery contained in a FROM clause generally cannot be a correlated subquery—that is, it cannot reference the same table as the outer query. The exception is if it’s constructed with a JOIN.

In the following example, let’s consider the subquery separately as though it were a plain query and not a subquery. It will generate a results set containing the student’s ID and the student’s average exam score for a specific course taught during a specific semester. The query uses AVG(), which requires a GROUP BY clause. The problem with GROUP BY is that it will order data only by the columns by which it’s given to group data. In this case, it will order the data by student_id and not list the results by any other, more useful column. If we want to order the data so that the highest student average is first, descending in order to the lowest student average, we have to turn our query into a subquery and have the outer query re-sort the results:

SELECT CONCAT(name_first, ' ', name_last) AS student,

student_id, avg_grade

FROM students

JOIN

(SELECT student_id,

AVG(exam_grade) AS avg_grade

FROM exams

WHERE semester_code = '2007AU'

AND course_id = 1489

GROUP BY student_id) AS grade_averages

USING(student_id)

ORDER BY avg_grade DESC;

The results set (the derived table generated by the subquery in the FROM clause) is named grade_averages. Notice that although the column student_id exists in the derived table, in the table from which it gets its data (i.e., exams) and in the primary table used in the main query (i.e.,students), there is no ambiguity. No error is generated. However, if we wanted to specify that the data be taken from the derived table, we could put grade_averages.student_id in the SELECT of the outer query.

This subquery is a correlated subquery, which is generally not permitted in a FROM clause. It’s allowed in this example because we are using a JOIN to join the results set to the table referenced in the outer query.