Generating Summaries - MySQL Cookbook (2007)

MySQL Cookbook (2007)

Chapter 8. Generating Summaries

Introduction

Database systems are useful for storing and retrieving records, but they can also summarize your data in more concise forms. Summaries are useful when you want the overall picture rather than the details. They’re also typically more readily understood than a long list of records. Summary techniques enable you to answer questions such as “How many?” or “What is the total?” or “What is the range of values?” If you’re running a business, you may want to know how many customers you have in each state, or how much sales volume you’re generating each month. You could determine the per-state count by producing a list of customer records and counting them yourself, but that makes no sense when MySQL can count them for you. Similarly, to determine sales volume by month, a list of raw order information records is not especially useful if you have to add up the order amounts yourself. Let MySQL do it.

The examples just mentioned illustrate two common summary types. The first (the number of customer records per state) is a counting summary. The content of each record is important only for purposes of placing it into the proper group or category for counting. Such summaries are essentially histograms, where you sort items into a set of bins and count the number of items in each bin. The second example (sales volume per month) is an instance of a summary that’s based on the contents of records—sales totals are computed from sales values in individual order records.

Yet another kind of summary produces neither counts nor sums, but simply a list of unique values. This is useful if you don’t care how many instances of each value are present, but only which values are present. If you want to know the states in which you have customers, you want a list of the distinct state names contained in the records, not a list consisting of the state value from every record. Sometimes it’s even useful to apply one summary technique to the result of another summary. For example, to determine how many states your customers live in, generate a list of unique customer states, and then count them.

The type of summaries that you generate may depend on the kind of data you’re working with. A counting summary can be generated from any kind of values, whether they be numbers, strings, or dates. For summaries that involve sums or averages, only numeric values can be used. You can count instances of customer state names to produce a demographic analysis of your customer base, but you cannot add or average state names—that doesn’t make sense.

Summary operations in MySQL involve the following SQL constructs:

§ To compute a summary value from a set of individual values, use one of the functions known as aggregate functions. These are so called because they operate on aggregates (groups) of values. Aggregate functions include COUNT(), which counts rows or values in a query result; MIN()and MAX(), which find smallest and largest values; and SUM() and AVG(), which produce sums and means of values. These functions can be used to compute a value for the entire result set, or with a GROUP BY clause to group the rows into subsets and obtain an aggregate value for each one.

§ To obtain a list of unique values, use SELECT DISTINCT rather than SELECT.

§ To count how many distinct values there are, use COUNT(DISTINCT) rather than COUNT().

The recipes in this chapter first illustrate basic summary techniques, and then show how to perform more complex summary operations. You’ll find additional examples of summary methods in later chapters, particularly those that cover joins and statistical operations. (See Chapters 12 and13.)

Summary queries sometimes involve complex expressions. For summaries that you execute often, keep in mind that views can make queries easier to use. Using Views to Simplify Table Access demonstrates the basic technique of creating a view. Summarizing with COUNT() shows how it applies to summary simplification, and you’ll see easily how it can be used in later sections of the chapter as well.

The primary tables used for examples in this chapter are the driver_log and mail tables. These were also used heavily in Chapter 7, so they should look familiar. A third table used recurrently throughout the chapter is states, which has rows containing a few columns of information for each of the United States:

mysql>SELECT * FROM states ORDER BY name;

+----------------+--------+------------+----------+

| name | abbrev | statehood | pop |

+----------------+--------+------------+----------+

| Alabama | AL | 1819-12-14 | 4530182 |

| Alaska | AK | 1959-01-03 | 655435 |

| Arizona | AZ | 1912-02-14 | 5743834 |

| Arkansas | AR | 1836-06-15 | 2752629 |

| California | CA | 1850-09-09 | 35893799 |

| Colorado | CO | 1876-08-01 | 4601403 |

| Connecticut | CT | 1788-01-09 | 3503604 |

...

The name and abbrev columns list the full state name and the corresponding abbreviation. The statehood column indicates the day on which the state entered the Union. pop is the state population as of July, 2004, as reported by the U.S. Census Bureau.

This chapter uses other tables occasionally as well. You can create most of them with the scripts found in the tables directory of the recipes distribution. Using FULLTEXT Searches describes the kjv table.

Summarizing with COUNT()

Problem

You want to count the number of rows in a table, the number of rows that match certain conditions, or the number of times that particular values occur.

Solution

Use the COUNT() function.

Discussion

To count the number of rows in an entire table or that match particular conditions, use the COUNT() function. For example, to display the contents of the rows in a table, you can use a SELECT * statement, but to count them instead, use SELECT COUNT(*). Without a WHERE clause, the statement counts all the rows in the table, such as in the following statement that shows how many rows the driver_log table contains:

mysql>SELECT COUNT(*) FROM driver_log;

+----------+

| COUNT(*) |

+----------+

| 10 |

+----------+

If you don’t know how many U.S. states there are, this statement tells you:

mysql>SELECT COUNT(*) FROM states;

+----------+

| COUNT(*) |

+----------+

| 50 |

+----------+

COUNT(*) with noWHERE clause is very quick for MyISAM tables. However, for BDB or InnoDB tables, you may want to avoid it because the statement requires a full table scan, which can be slow for large tables. If an approximate row count is all you require, a workaround that avoids a full scan for those storage engines is to extract theTABLE_ROWS value from the INFORMATION_SCHEMA database:

mysql>SELECT TABLE_ROWS FROM INFORMATION_SCHEMA.TABLES

-> WHERE TABLE_SCHEMA = 'cookbook' AND TABLE_NAME = 'states';

+------------+

| TABLE_ROWS |

+------------+

| 50 |

+------------+

Before MySQL 5.0, INFORMATION_SCHEMA is unavailable. Instead, use SHOW TABLE STATUS and extract the value of the Rows column.

To count only the number of rows that match certain conditions, include an appropriate WHERE clause in a SELECT COUNT(*) statement. The conditions can be chosen to make COUNT(*) useful for answering many kinds of questions:

§ How many times did drivers travel more than 200 miles in a day?

§ mysql>SELECT COUNT(*) FROM driver_log WHERE miles > 200;

§ +----------+

§ | COUNT(*) |

§ +----------+

§ | 4 |

+----------+

§ How many days did Suzi drive?

§ mysql>SELECT COUNT(*) FROM driver_log WHERE name = 'Suzi';

§ +----------+

§ | COUNT(*) |

§ +----------+

§ | 2 |

+----------+

§ How many states did the United States consist of at the beginning of the 20th century?

§ mysql>SELECT COUNT(*) FROM states WHERE statehood < '1900-01-01';

§ +----------+

§ | COUNT(*) |

§ +----------+

§ | 45 |

+----------+

§ How many of those states joined the Union in the 19th century?

§ mysql>SELECT COUNT(*) FROM states

§ -> WHERE statehood BETWEEN '1800-01-01' AND '1899-12-31';

§ +----------+

§ | COUNT(*) |

§ +----------+

§ | 29 |

+----------+

The COUNT() function actually has two forms. The form we’ve been using, COUNT(*), counts rows. The other form, COUNT( expr ), takes a column name or expression argument and counts the number of non-NULL values. The following statement shows how to produce both a row count for a table and a count of the number of non-NULL values in one of its columns:

SELECT COUNT(*), COUNT(mycol) FROM mytbl;

The fact that COUNT( expr ) doesn’t count NULL values is useful for producing multiple counts from the same set of rows. To count the number of Saturday and Sunday trips in the driver_log table with a single statement, do this:

mysql>SELECT

-> COUNT(IF(DAYOFWEEK(trav_date)=7,1,NULL)) AS 'Saturday trips',

-> COUNT(IF(DAYOFWEEK(trav_date)=1,1,NULL)) AS 'Sunday trips'

-> FROM driver_log;

+----------------+--------------+

| Saturday trips | Sunday trips |

+----------------+--------------+

| 3 | 1 |

+----------------+--------------+

Or to count weekend versus weekday trips, do this:

mysql>SELECT

-> COUNT(IF(DAYOFWEEK(trav_date) IN (1,7),1,NULL)) AS 'weekend trips',

-> COUNT(IF(DAYOFWEEK(trav_date) IN (1,7),NULL,1)) AS 'weekday trips'

-> FROM driver_log;

+---------------+---------------+

| weekend trips | weekday trips |

+---------------+---------------+

| 4 | 6 |

+---------------+---------------+

The IF() expressions determine, for each column value, whether it should be counted. If so, the expression evaluates to 1 and COUNT() counts it. If not, the expression evaluates to NULL and COUNT() ignores it. The effect is to count the number of values that satisfy the condition given as the first argument to IF().

CREATE A VIEW TO SIMPLIFY USING A SUMMARY

If you need a given summary often, a view might be useful so that you need not type the summarizing expressions repeatedly. For example, the following view implements the weekend versus weekday trip summary:

mysql>CREATE VIEW trip_summary_view AS

-> SELECT

-> COUNT(IF(DAYOFWEEK(trav_date) IN (1,7),1,NULL)) AS weekend_trips,

-> COUNT(IF(DAYOFWEEK(trav_date) IN (1,7),NULL,1)) AS weekday_trips

-> FROM driver_log;

Selecting from this view is much easier than selecting directly from the underlying table:

mysql>SELECT * FROM trip_summary_view;

+---------------+---------------+

| weekend_trips | weekday_trips |

+---------------+---------------+

| 4 | 6 |

+---------------+---------------+

See Also

Summaries and NULL Values further discusses the difference between COUNT(*) and COUNT(expr).

Summarizing with MIN() and MAX()

Problem

You need to determine the smallest or largest of a set of values.

Solution

Use MIN() to find the smallest value, MAX() to find the largest.

Discussion

Finding smallest or largest values is somewhat akin to sorting, except that instead of producing an entire set of sorted values, you select only a single value at one end or the other of the sorted range. This kind of operation applies to questions about smallest, largest, oldest, newest, most expensive, least expensive, and so forth. One way to find such values is to use the MIN() and MAX() functions. (Another way to address these questions is to use LIMIT; see the discussions in Recipes and .)

Because MIN() and MAX() determine the extreme values in a set, they’re useful for characterizing ranges:

§ What date range is represented by the rows in the mail table? What are the smallest and largest messages sent?

§ mysql>SELECT

§ -> MIN(t) AS earliest, MAX(t) AS latest,

§ -> MIN(size) AS smallest, MAX(size) AS largest

§ -> FROM mail;

§ +---------------------+---------------------+----------+---------+

§ | earliest | latest | smallest | largest |

§ +---------------------+---------------------+----------+---------+

§ | 2006-05-11 10:15:08 | 2006-05-19 22:21:51 | 271 | 2394482 |

+---------------------+---------------------+----------+---------+

§ What are the shortest and longest trips in the driver_log table?

§ mysql>SELECT MIN(miles) AS shortest, MAX(miles) AS longest

§ -> FROM driver_log;

§ +----------+---------+

§ | shortest | longest |

§ +----------+---------+

§ | 79 | 502 |

+----------+---------+

§ What are the lowest and highest U.S. state populations?

§ mysql>SELECT MIN(pop) AS 'fewest people', MAX(pop) AS 'most people'

§ -> FROM states;

§ +---------------+-------------+

§ | fewest people | most people |

§ +---------------+-------------+

§ | 506529 | 35893799 |

+---------------+-------------+

§ What are the first and last state names, lexically speaking?

§ mysql>SELECT MIN(name), MAX(name) FROM states;

§ +-----------+-----------+

§ | MIN(name) | MAX(name) |

§ +-----------+-----------+

§ | Alabama | Wyoming |

+-----------+-----------+

MIN() and MAX() need not be applied directly to column values. They also work with expressions or values that are derived from column values. For example, to find the lengths of the shortest and longest state names, do this:

mysql>SELECT

-> MIN(CHAR_LENGTH(name)) AS shortest,

-> MAX(CHAR_LENGTH(name)) AS longest

-> FROM states;

+----------+---------+

| shortest | longest |

+----------+---------+

| 4 | 14 |

+----------+---------+

Summarizing with SUM() and AVG()

Problem

You need to add a set of numbers or find their average.

Solution

Use the SUM() or AVG() functions.

Discussion

SUM() and AVG() produce the total and average (mean) of a set of values:

§ What is the total amount of mail traffic and the average size of each message?

§ mysql>SELECT

§ -> SUM(size) AS 'total traffic',

§ -> AVG(size) AS 'average message size'

§ -> FROM mail;

§ +---------------+----------------------+

§ | total traffic | average message size |

§ +---------------+----------------------+

§ | 3798185 | 237386.5625 |

+---------------+----------------------+

§ How many miles did the drivers in the driver_log table travel? What was the average number of miles traveled per day?

§ mysql>SELECT

§ -> SUM(miles) AS 'total miles',

§ -> AVG(miles) AS 'average miles/day'

§ -> FROM driver_log;

§ +-------------+-------------------+

§ | total miles | average miles/day |

§ +-------------+-------------------+

§ | 2166 | 216.6000 |

+-------------+-------------------+

§ What is the total population of the United States?

§ mysql>SELECT SUM(pop) FROM states;

§ +-----------+

§ | SUM(pop) |

§ +-----------+

§ | 293101881 |

+-----------+

The value represents the population reported for July 2004. The figure shown here differs from the U.S. population reported by the U.S. Census Bureau, because the states table doesn’t contain a count for Washington, D.C.

SUM() and AVG() are strictly numeric functions, so they can’t be used with strings or temporal values. On the other hand, sometimes you can convert nonnumeric values to useful numeric forms. Suppose that a table stores TIME values that represent elapsed time:

mysql>SELECT t1 FROM time_val;

+----------+

| t1 |

+----------+

| 15:00:00 |

| 05:01:30 |

| 12:30:20 |

+----------+

To compute the total elapsed time, use TIME_TO_SEC() to convert the values to seconds before summing them. The resulting sum also will be in seconds; pass it to SEC_TO_TIME() to convert the sum back to TIME format:

mysql>SELECT SUM(TIME_TO_SEC(t1)) AS 'total seconds',

-> SEC_TO_TIME(SUM(TIME_TO_SEC(t1))) AS 'total time'

-> FROM time_val;

+---------------+------------+

| total seconds | total time |

+---------------+------------+

| 117110 | 32:31:50 |

+---------------+------------+

See Also

The SUM() and AVG() functions are especially useful in applications that compute statistics. They’re explored further in Chapter 13 along with STD(), a related function that calculates standard deviations.

Using DISTINCT to Eliminate Duplicates

Problem

You want to know which values are present in a set of values, without displaying duplicate values multiple times. Or you want to know how many distinct values there are.

Solution

Use DISTINCT to select unique values or COUNT(DISTINCT) to count them.

Discussion

One summary operation that doesn’t use aggregate functions is to determine which values or rows are contained in a dataset by eliminating duplicates. Do this with DISTINCT (or DISTINCTROW, which is synonymous). DISTINCT is useful for boiling down a query result, and often is combined with ORDER BY to place the values in more meaningful order. For example, to determine the names of the drivers listed in the driver_log table, use the following statement:

mysql>SELECT DISTINCT name FROM driver_log ORDER BY name;

+-------+

| name |

+-------+

| Ben |

| Henry |

| Suzi |

+-------+

A statement without DISTINCT produces the same names, but is not nearly as easy to understand, even with a small dataset:

mysql>SELECT name FROM driver_log;

+-------+

| name |

+-------+

| Ben |

| Suzi |

| Henry |

| Henry |

| Ben |

| Henry |

| Suzi |

| Henry |

| Ben |

| Henry |

+-------+

To determine how many different drivers there are, use COUNT(DISTINCT):

mysql>SELECT COUNT(DISTINCT name) FROM driver_log;

+----------------------+

| COUNT(DISTINCT name) |

+----------------------+

| 3 |

+----------------------+

COUNT(DISTINCT) ignores NULL values. Should you wish to count NULL as one of the values in the set if it’s present, use one of the following expressions:

COUNT(DISTINCTval) + IF(COUNT(IF(val IS NULL,1,NULL))=0,0,1)

COUNT(DISTINCT val) + IF(SUM(ISNULL(val))=0,0,1)

COUNT(DISTINCT val) + (SUM(ISNULL(val))!=0)

DISTINCT queries often are useful in conjunction with aggregate functions to obtain a more complete characterization of your data. Suppose that you have a customer table that contains a state column indicating the state where customers are located. Applying COUNT(*) to thecustomer table indicates how many customers you have, using DISTINCT on the state values in the table tells you the number of states in which you have customers, and COUNT(DISTINCT) on the state values tells you how many states your customer base represents.

When used with multiple columns, DISTINCT shows the different combinations of values in the columns and COUNT(DISTINCT) counts the number of combinations. The following statements show the different sender/recipient pairs in the mail table and how many such pairs there are:

mysql>SELECT DISTINCT srcuser, dstuser FROM mail

-> ORDER BY srcuser, dstuser;

+---------+---------+

| srcuser | dstuser |

+---------+---------+

| barb | barb |

| barb | tricia |

| gene | barb |

| gene | gene |

| gene | tricia |

| phil | barb |

| phil | phil |

| phil | tricia |

| tricia | gene |

| tricia | phil |

+---------+---------+

mysql> SELECT COUNT(DISTINCT srcuser, dstuser) FROM mail;

+----------------------------------+

| COUNT(DISTINCT srcuser, dstuser) |

+----------------------------------+

| 10 |

+----------------------------------+

DISTINCT works with expressions, too, not just column values. To determine the number of hours of the day during which messages in the mail are sent, count the distinct HOUR() values:

mysql>SELECT COUNT(DISTINCT HOUR(t)) FROM mail;

+-------------------------+

| COUNT(DISTINCT HOUR(t)) |

+-------------------------+

| 12 |

+-------------------------+

To find out which hours those were, list them:

mysql>SELECT DISTINCT HOUR(t) AS hour FROM mail ORDER BY hour;

+------+

| hour |

+------+

| 7 |

| 8 |

| 9 |

| 10 |

| 11 |

| 12 |

| 13 |

| 14 |

| 15 |

| 17 |

| 22 |

| 23 |

+------+

Note that this statement doesn’t tell you how many messages were sent each hour. That’s covered in Date-Based Summaries.

Finding Values Associated with Minimum and Maximum Values

Problem

You want to know the values for other columns in the row that contains a minimum or maximum value.

Solution

Use two statements and a user-defined variable. Or use a subquery. Or use a join.

Discussion

MIN() and MAX() find the endpoints of a range of values, but sometimes when finding a minimum or maximum value, you’re also interested in other values from the row in which the value occurs. For example, you can find the largest state population like this:

mysql>SELECT MAX(pop) FROM states;

+----------+

| MAX(pop) |

+----------+

| 35893799 |

+----------+

But that doesn’t show you which state has this population. The obvious attempt at getting that information looks like this:

mysql>SELECT MAX(pop), name FROM states WHERE pop = MAX(pop);

ERROR 1111 (HY000): Invalid use of group function

Probably everyone tries something like that sooner or later, but it doesn’t work. Aggregate functions such as MIN() and MAX() cannot be used in WHERE clauses, which require expressions that apply to individual rows. The intent of the statement is to determine which row has the maximum population value, and then display the associated state name. The problem is that while you and I know perfectly well what we mean by writing such a thing, it makes no sense at all to MySQL. The statement fails because MySQL uses the WHERE clause to determine which rows to select, but it knows the value of an aggregate function only after selecting the rows from which the function’s value is determined! So, in a sense, the statement is self-contradictory. You can solve this problem by saving the maximum population value in a user-defined variable and then comparing rows to the variable value:

mysql>SET @max = (SELECT MAX(pop) FROM states);

mysql> SELECT pop AS 'highest population', name FROM states

WHERE pop = @max;

+--------------------+------------+

| highest population | name |

+--------------------+------------+

| 35893799 | California |

+--------------------+------------+

For a single-statement solution, use a subquery in the WHERE clause that returns the maximum population value:

mysql>SELECT pop AS 'highest population', name FROM states

-> WHERE pop = (SELECT MAX(pop) FROM states);

+--------------------+------------+

| highest population | name |

+--------------------+------------+

| 35893799 | California |

+--------------------+------------+

This technique also works even if the minimum or maximum value itself isn’t actually contained in the row, but is only derived from it. If you want to know the length of the shortest verse in the King James Version, that’s easy to find:

mysql>SELECT MIN(CHAR_LENGTH(vtext)) FROM kjv;

+-------------------------+

| MIN(CHAR_LENGTH(vtext)) |

+-------------------------+

| 11 |

+-------------------------+

If you want to know, “What verse is that?” do this instead:

mysql>SELECT bname, cnum, vnum, vtext FROM kjv

-> WHERE CHAR_LENGTH(vtext) = (SELECT MIN(CHAR_LENGTH(vtext)) FROM kjv);

+-------+------+------+-------------+

| bname | cnum | vnum | vtext |

+-------+------+------+-------------+

| John | 11 | 35 | Jesus wept. |

+-------+------+------+-------------+

Yet another way to select other columns from rows containing a minimum or maximum value is to use a join. Select the value into another table, and then join it to the original table to select the row that matches the value. To find the row for the state with the highest population, use a join like this:

mysql>CREATE TABLE t SELECT MAX(pop) as maxpop FROM states;

mysql> SELECT states.* FROM states INNER JOIN t ON states.pop = t.maxpop;

+------------+--------+------------+----------+

| name | abbrev | statehood | pop |

+------------+--------+------------+----------+

| California | CA | 1850-09-09 | 35893799 |

+------------+--------+------------+----------+

See Also

For more information about joins, see Chapter 12, in particular, Finding Rows Containing Per-Group Minimum or Maximum Values, which further discusses the problem of finding rows that contain groupwise minimum or maximum values.

Controlling String Case Sensitivity for MIN() and MAX()

Problem

MIN() and MAX() select strings in case-sensitive fashion when you don’t want them to, or vice versa.

Solution

Alter the comparison characteristics of the strings.

Discussion

Chapter 5 discusses how string comparison properties depend on whether the strings are binary or nonbinary:

§ Binary strings are sequences of bytes. They are compared byte by byte using numeric byte values. Character set and lettercase have no meaning for comparisons.

§ Nonbinary strings are sequences of characters. They have a character set and collation and are compared character by character using the order defined by the collation.

These properties also apply when you use a string column as the argument to the MIN() or MAX() functions because they are based on comparison. To alter how these functions work with a string column, you must alter the column’s comparison properties. Controlling Case Sensitivity in String Comparisons discusses how to control these properties, and Controlling Case Sensitivity of String Sorts shows how they apply to string sorts. The same principles apply to finding minimum and maximum string values, so I’ll just summarize here, and you can read Controlling Case Sensitivity of String Sorts for additional details.

§ To compare case-insensitive strings in case-sensitive fashion, order the values using a case-sensitive collation:

§ SELECT

§ MIN(str_col COLLATE latin1_general_cs) AS min,

§ MAX(str_col COLLATE latin1_general_cs) AS max

FROM tbl;

§ To compare case-sensitive strings in case-insensitive fashion, order the values using a case-insensitive collation:

§ SELECT

§ MIN(str_col COLLATE latin1_swedish_ci) AS min,

§ MAX(str_col COLLATE latin1_swedish_ci) AS max

FROM tbl;

Another possibility is to compare values that have all been converted to the same lettercase, which makes lettercase irrelevant. However, that also changes the retrieved values:

SELECT

MIN(UPPER(str_col)) AS min,

MAX(UPPER(str_col)) AS max

FROM tbl;

§ Binary strings compare using numeric byte values, so there is no concept of lettercase involved. However, because letters in different cases have different byte values, comparisons of binary strings effectively are case-sensitive (that is, a and A are unequal). To compare binary strings using a case-insensitive ordering, convert them to nonbinary strings, and apply an appropriate collation:

§ SELECT

§ MIN(CONVERT(str_col USING latin1) COLLATE latin1_swedish_ci) AS min,

§ MAX(CONVERT(str_col USING latin1) COLLATE latin1_swedish_ci) AS max

FROM tbl;

If the default collation is case-insensitive (as is true for latin1), you can omit the COLLATE clause.

Dividing a Summary into Subgroups

Problem

You want to calculate a summary for each subgroup of a set of rows, not an overall summary value.

Solution

Use a GROUP BY clause to arrange rows into groups.

Discussion

The summary statements shown so far calculate summary values over all rows in the result set. For example, the following statement determines the number of records in the mail table, and thus the total number of mail messages that have been sent:

mysql>SELECT COUNT(*) FROM mail;

+----------+

| COUNT(*) |

+----------+

| 16 |

+----------+

Sometimes it’s desirable to break a set of rows into subgroups and summarize each group. Do this by using aggregate functions in conjunction with a GROUP BY clause. To determine the number of messages per sender, group the rows by sender name, count how many times each name occurs, and display the names with the counts:

mysql>SELECT srcuser, COUNT(*) FROM mail

-> GROUP BY srcuser;

+---------+----------+

| srcuser | COUNT(*) |

+---------+----------+

| barb | 3 |

| gene | 6 |

| phil | 5 |

| tricia | 2 |

+---------+----------+

That query summarizes the same column that is used for grouping (srcuser), but that’s not always necessary. Suppose that you want a quick characterization of the mail table, showing for each sender listed in it the total amount of traffic sent (in bytes) and the average number of bytes per message. In this case, you still use the srcuser column to place the rows in groups, but the summary functions operate on the size values:

mysql>SELECT srcuser,

-> SUM(size) AS 'total bytes',

-> AVG(size) AS 'bytes per message'

-> FROM mail GROUP BY srcuser;

+---------+-------------+-------------------+

| srcuser | total bytes | bytes per message |

+---------+-------------+-------------------+

| barb | 156696 | 52232.0000 |

| gene | 1033108 | 172184.6667 |

| phil | 18974 | 3794.8000 |

| tricia | 2589407 | 1294703.5000 |

+---------+-------------+-------------------+

Use as many grouping columns as necessary to achieve as fine-grained a summary as you require. The earlier query that shows the number of messages per sender is a coarse summary. To be more specific and find out how many messages each sender sent from each host, use two grouping columns. This produces a result with nested groups (groups within groups):

mysql>SELECT srcuser, srchost, COUNT(srcuser) FROM mail

-> GROUP BY srcuser, srchost;

+---------+---------+----------------+

| srcuser | srchost | COUNT(srcuser) |

+---------+---------+----------------+

| barb | saturn | 2 |

| barb | venus | 1 |

| gene | mars | 2 |

| gene | saturn | 2 |

| gene | venus | 2 |

| phil | mars | 3 |

| phil | venus | 2 |

| tricia | mars | 1 |

| tricia | saturn | 1 |

+---------+---------+----------------+

The preceding examples in this section have used COUNT(), SUM(), and AVG() for per-group summaries. You can use MIN() or MAX(), too. With a GROUP BY clause, they will tell you the smallest or largest value per group. The following query groups mail table rows by message sender, displaying for each the size of the largest message sent and the date of the most recent message:

mysql>SELECT srcuser, MAX(size), MAX(t) FROM mail GROUP BY srcuser;

+---------+-----------+---------------------+

| srcuser | MAX(size) | MAX(t) |

+---------+-----------+---------------------+

| barb | 98151 | 2006-05-14 14:42:21 |

| gene | 998532 | 2006-05-19 22:21:51 |

| phil | 10294 | 2006-05-17 12:49:23 |

| tricia | 2394482 | 2006-05-14 17:03:01 |

+---------+-----------+---------------------+

You can group by multiple columns and display a maximum for each combination of values in those columns. This query finds the size of the largest message sent between each pair of sender and recipient values listed in the mail table:

mysql>SELECT srcuser, dstuser, MAX(size) FROM mail GROUP BY srcuser, dstuser;

+---------+---------+-----------+

| srcuser | dstuser | MAX(size) |

+---------+---------+-----------+

| barb | barb | 98151 |

| barb | tricia | 58274 |

| gene | barb | 2291 |

| gene | gene | 23992 |

| gene | tricia | 998532 |

| phil | barb | 10294 |

| phil | phil | 1048 |

| phil | tricia | 5781 |

| tricia | gene | 194925 |

| tricia | phil | 2394482 |

+---------+---------+-----------+

When using aggregate functions to produce per-group summary values, watch out for the following trap, which involves selecting nonsummary table columns not related to the grouping columns. Suppose that you want to know the longest trip per driver in the driver_log table. That’s produced by this query:

mysql>SELECT name, MAX(miles) AS 'longest trip'

-> FROM driver_log GROUP BY name;

+-------+--------------+

| name | longest trip |

+-------+--------------+

| Ben | 152 |

| Henry | 300 |

| Suzi | 502 |

+-------+--------------+

But what if you also want to show the date on which each driver’s longest trip occurred? Can you just add trav_date to the output column list? Sorry, that won’t work:

mysql>SELECT name, trav_date, MAX(miles) AS 'longest trip'

-> FROM driver_log GROUP BY name;

+-------+------------+--------------+

| name | trav_date | longest trip |

+-------+------------+--------------+

| Ben | 2006-08-30 | 152 |

| Henry | 2006-08-29 | 300 |

| Suzi | 2006-08-29 | 502 |

+-------+------------+--------------+

The query does produce a result, but if you compare it to the full table (shown following), you’ll see that although the dates for Ben and Henry are correct, the date for Suzi is not:

+--------+-------+------------+-------+

| rec_id | name | trav_date | miles |

+--------+-------+------------+-------+

| 1 | Ben | 2006-08-30 | 152 |← Ben’s longest trip

| 2 | Suzi | 2006-08-29 | 391 |

| 3 | Henry | 2006-08-29 | 300 | ← Henry’s longest trip

| 4 | Henry | 2006-08-27 | 96 |

| 5 | Ben | 2006-08-29 | 131 |

| 6 | Henry | 2006-08-26 | 115 |

| 7 | Suzi | 2006-09-02 | 502 | ← Suzi’s longest trip

| 8 | Henry | 2006-09-01 | 197 |

| 9 | Ben | 2006-09-02 | 79 |

| 10 | Henry | 2006-08-30 | 203 |

+--------+-------+------------+-------+

So what’s going on? Why does the summary statement produce incorrect results? This happens because when you include a GROUP BY clause in a query, the only values that you can select are the grouped columns or summary values calculated from the groups. If you display additional table columns, they’re not tied to the grouped columns and the values displayed for them are indeterminate. (For the statement just shown, it appears that MySQL may simply be picking the first date for each driver, regardless of whether it matches the driver’s maximum mileage value.)

The general solution to the problem of displaying contents of rows associated with minimum or maximum group values involves a join. The technique is described in Chapter 12. For the problem at hand, the required results are produced as follows:

mysql>CREATE TABLE t

-> SELECT name, MAX(miles) AS miles FROM driver_log GROUP BY name;

mysql> SELECT d.name, d.trav_date, d.miles AS 'longest trip'

-> FROM driver_log AS d INNER JOIN t USING (name, miles) ORDER BY name;

+-------+------------+--------------+

| name | trav_date | longest trip |

+-------+------------+--------------+

| Ben | 2006-08-30 | 152 |

| Henry | 2006-08-29 | 300 |

| Suzi | 2006-09-02 | 502 |

+-------+------------+--------------+

Summaries and NULL Values

Problem

You’re summarizing a set of values that may include NULL values and you need to know how to interpret the results.

Solution

Understand how aggregate functions handle NULL values.

Discussion

Most aggregate functions ignore NULL values. Suppose that you have a table expt that records experimental results for subjects who are to be given four tests each and that lists the test score as NULL for those tests that have not yet been administered:

mysql>SELECT subject, test, score FROM expt ORDER BY subject, test;

+---------+------+-------+

| subject | test | score |

+---------+------+-------+

| Jane | A | 47 |

| Jane | B | 50 |

| Jane | C | NULL |

| Jane | D | NULL |

| Marvin | A | 52 |

| Marvin | B | 45 |

| Marvin | C | 53 |

| Marvin | D | NULL |

+---------+------+-------+

By using a GROUP BY clause to arrange the rows by subject name, the number of tests taken by each subject, as well as the total, average, lowest, and highest scores can be calculated like this:

mysql>SELECT subject,

-> COUNT(score) AS n,

-> SUM(score) AS total,

-> AVG(score) AS average,

-> MIN(score) AS lowest,

-> MAX(score) AS highest

-> FROM expt GROUP BY subject;

+---------+---+-------+---------+--------+---------+

| subject | n | total | average | lowest | highest |

+---------+---+-------+---------+--------+---------+

| Jane | 2 | 97 | 48.5000 | 47 | 50 |

| Marvin | 3 | 150 | 50.0000 | 45 | 53 |

+---------+---+-------+---------+--------+---------+

You can see from the results in the column labeled n (number of tests) that the query counts only five values, even though the table contains eight. Why? Because the values in that column correspond to the number of non-NULL test scores for each subject. The other summary columns display results that are calculated only from the non-NULL scores as well.

It makes a lot of sense for aggregate functions to ignore NULL values. If they followed the usual SQL arithmetic rules, adding NULL to any other value would produce a NULL result. That would make aggregate functions really difficult to use because you’d have to filter out NULL values every time you performed a summary, to avoid getting a NULL result. Ugh. By ignoring NULL values, aggregate functions become a lot more convenient.

However, be aware that even though aggregate functions may ignore NULL values, some of them can still produce NULL as a result. This happens if there’s nothing to summarize, which occurs if the set of values is empty or contains only NULL values. The following query is the same as the previous one, with one small difference. It selects only NULL test scores, so there’s nothing for the aggregate functions to operate on:

mysql>SELECT subject,

-> COUNT(score) AS n,

-> SUM(score) AS total,

-> AVG(score) AS average,

-> MIN(score) AS lowest,

-> MAX(score) AS highest

-> FROM expt WHERE score IS NULL GROUP BY subject;

+---------+---+-------+---------+--------+---------+

| subject | n | total | average | lowest | highest |

+---------+---+-------+---------+--------+---------+

| Jane | 0 | NULL | NULL | NULL | NULL |

| Marvin | 0 | NULL | NULL | NULL | NULL |

+---------+---+-------+---------+--------+---------+

For COUNT(), the number of scores per subject is zero and is reported that way. On the other hand, SUM() , AVG(), MIN(), and MAX() return NULL when there are no values to summarize. If you don’t want these functions to produce NULL in the query output, use IFNULL() to map their results appropriately:

mysql>SELECT subject,

-> COUNT(score) AS n,

-> IFNULL(SUM(score),0) AS total,

-> IFNULL(AVG(score),0) AS average,

-> IFNULL(MIN(score),'Unknown') AS lowest,

-> IFNULL(MAX(score),'Unknown') AS highest

-> FROM expt WHERE score IS NULL GROUP BY subject;

+---------+---+-------+---------+---------+---------+

| subject | n | total | average | lowest | highest |

+---------+---+-------+---------+---------+---------+

| Jane | 0 | 0 | 0.0000 | Unknown | Unknown |

| Marvin | 0 | 0 | 0.0000 | Unknown | Unknown |

+---------+---+-------+---------+---------+---------+

COUNT() is somewhat different with regard to NULL values than the other aggregate functions. Like other aggregate functions, COUNT(expr) counts only non-NULL values, but COUNT(*) counts rows, regardless of their content. You can see the difference between the forms of COUNT() like this:

mysql>SELECT COUNT(*), COUNT(score) FROM expt;

+----------+--------------+

| COUNT(*) | COUNT(score) |

+----------+--------------+

| 8 | 5 |

+----------+--------------+

This tells us that there are eight rows in the expt table but that only five of them have the score value filled in. The different forms of COUNT() can be very useful for counting missing values. Just take the difference:

mysql>SELECT COUNT(*) - COUNT(score) AS missing FROM expt;

+---------+

| missing |

+---------+

| 3 |

+---------+

Missing and nonmissing counts can be determined for subgroups as well. The following query does so for each subject, providing an easy way to assess the extent to which the experiment has been completed:

mysql>SELECT subject,

-> COUNT(*) AS total,

-> COUNT(score) AS 'nonmissing',

-> COUNT(*) - COUNT(score) AS missing

-> FROM expt GROUP BY subject;

+---------+-------+-------------+---------+

| subject | total | nonmissing | missing |

+---------+-------+-------------+---------+

| Jane | 4 | 2 | 2 |

| Marvin | 4 | 3 | 1 |

+---------+-------+-------------+---------+

Selecting Only Groups with Certain Characteristics

Problem

You want to calculate group summaries but display the results only for those groups that match certain criteria.

Solution

Use a HAVING clause.

Discussion

You’re familiar with the use of WHERE to specify conditions that individual rows must satisfy to be selected by a statement. It’s natural, therefore, to use WHERE to write conditions that involve summary values. The only trouble is that it doesn’t work. If you want to identify drivers in thedriver_log table who drove more than three days, you’d probably first think to write the statement like this:

mysql>SELECT COUNT(*), name

-> FROM driver_log

-> WHERE COUNT(*) > 3

-> GROUP BY name;

ERROR 1111 (HY000): Invalid use of group function

The problem here is that WHERE specifies the initial constraints that determine which rows to select, but the value of COUNT() can be determined only after the rows have been selected. The solution is to put the COUNT() expression in a HAVING clause instead. HAVING is analogous toWHERE, but it applies to group characteristics rather than to single rows. That is, HAVING operates on the already-selected-and-grouped set of rows, applying additional constraints based on aggregate function results that aren’t known during the initial selection process. The preceding query therefore should be written like this:

mysql>SELECT COUNT(*), name

-> FROM driver_log

-> GROUP BY name

-> HAVING COUNT(*) > 3;

+----------+-------+

| COUNT(*) | name |

+----------+-------+

| 5 | Henry |

+----------+-------+

When you use HAVING, you can still include a WHERE clause—but only to select rows, not to test summary values.

HAVING can refer to aliases, so the previous query can be rewritten like this:

mysql>SELECT COUNT(*) AS count, name

-> FROM driver_log

-> GROUP BY name

-> HAVING count > 3;

+-------+-------+

| count | name |

+-------+-------+

| 5 | Henry |

+-------+-------+

Using Counts to Determine Whether Values Are Unique

Problem

You want to know whether table values are unique.

Solution

Use HAVING in conjunction with COUNT().

Discussion

DISTINCT eliminates duplicates but doesn’t show which values actually were duplicated in the original data. You can use HAVING to find unique values in situations to which DISTINCT does not apply. HAVING can tell you which values were unique or nonunique.

The following statements show the days on which only one driver was active, and the days on which more than one driver was active. They’re based on using HAVING and COUNT() to determine which trav_date values are unique or nonunique:

mysql>SELECT trav_date, COUNT(trav_date)

-> FROM driver_log

-> GROUP BY trav_date

-> HAVING COUNT(trav_date) = 1;

+------------+------------------+

| trav_date | COUNT(trav_date) |

+------------+------------------+

| 2006-08-26 | 1 |

| 2006-08-27 | 1 |

| 2006-09-01 | 1 |

+------------+------------------+

mysql> SELECT trav_date, COUNT(trav_date)

-> FROM driver_log

-> GROUP BY trav_date

-> HAVING COUNT(trav_date) > 1;

+------------+------------------+

| trav_date | COUNT(trav_date) |

+------------+------------------+

| 2006-08-29 | 3 |

| 2006-08-30 | 2 |

| 2006-09-02 | 2 |

+------------+------------------+

This technique works for combinations of values, too. For example, to find message sender/recipient pairs between whom only one message was sent, look for combinations that occur only once in the mail table:

mysql>SELECT srcuser, dstuser

-> FROM mail

-> GROUP BY srcuser, dstuser

-> HAVING COUNT(*) = 1;

+---------+---------+

| srcuser | dstuser |

+---------+---------+

| barb | barb |

| gene | tricia |

| phil | barb |

| tricia | gene |

| tricia | phil |

+---------+---------+

Note that this query doesn’t print the count. The first two examples did so, to show that the counts were being used properly, but you can refer to an aggregate value in a HAVING clause without including it in the output column list.

Grouping by Expression Results

Problem

You want to group rows into subgroups based on values calculated from an expression.

Solution

Put the expression in the GROUP BY clause.

Discussion

GROUPBY, like ORDERBY, can refer to expressions. This means you can use calculations as the basis for grouping. For example, to find the distribution of the lengths of state names, use those lengths as the grouping characteristic:

mysql>SELECT CHAR_LENGTH(name), COUNT(*)

-> FROM states GROUP BY CHAR_LENGTH(name);

+-------------------+----------+

| CHAR_LENGTH(name) | COUNT(*) |

+-------------------+----------+

| 4 | 3 |

| 5 | 3 |

| 6 | 5 |

| 7 | 8 |

| 8 | 12 |

| 9 | 4 |

| 10 | 4 |

| 11 | 2 |

| 12 | 4 |

| 13 | 3 |

| 14 | 2 |

+-------------------+----------+

As with ORDER BY, you can write the grouping expression directly in the GROUP BY clause, or use an alias for the expression (if it appears in the output column list), and refer to the alias in the GROUP BY.

You can group by multiple expressions if you like. To find days of the year on which more than one state joined the Union, group by statehood month and day, and then use HAVING and COUNT() to find the nonunique combinations:

mysql>SELECT

-> MONTHNAME(statehood) AS month,

-> DAYOFMONTH(statehood) AS day,

-> COUNT(*) AS count

-> FROM states GROUP BY month, day HAVING count > 1;

+----------+------+-------+

| month | day | count |

+----------+------+-------+

| February | 14 | 2 |

| June | 1 | 2 |

| March | 1 | 2 |

| May | 29 | 2 |

| November | 2 | 2 |

+----------+------+-------+

Categorizing Noncategorical Data

Problem

You need to summarize a set of values that are not naturally categorical.

Solution

Use an expression to group the values into categories.

Discussion

Grouping by Expression Results showed how to group rows by expression results. One important application for doing so is to provide categories for values that are not particularly categorical. This is useful because GROUPBY works best for columns with repetitive values. For example, you might attempt to perform a population analysis by grouping rows in the states table using values in the pop column. As it happens, that does not work very well due to the high number of distinct values in the column. In fact, they’re all distinct, as the following query shows:

mysql>SELECT COUNT(pop), COUNT(DISTINCT pop) FROM states;

+------------+---------------------+

| COUNT(pop) | COUNT(DISTINCT pop) |

+------------+---------------------+

| 50 | 50 |

+------------+---------------------+

In situations like this, where values do not group nicely into a small number of sets, you can use a transformation that forces them into categories. Begin by determining the range of population values:

mysql>SELECT MIN(pop), MAX(pop) FROM states;

+----------+----------+

| MIN(pop) | MAX(pop) |

+----------+----------+

| 506529 | 35893799 |

+----------+----------+

You can see from that result that if you divide the pop values by five million, they’ll group into six categories—a reasonable number. (The category ranges will be 1 to 5,000,000, 5,000,001 to 10,000,000, and so forth.) To put each population value in the proper category, divide by five million, and use the integer result:

mysql>SELECT FLOOR(pop/5000000) AS 'max population (millions)',

-> COUNT(*) AS 'number of states'

-> FROM states GROUP BY 'max population (millions)';

+---------------------------+------------------+

| max population (millions) | number of states |

+---------------------------+------------------+

| 0 | 29 |

| 1 | 13 |

| 2 | 4 |

| 3 | 2 |

| 4 | 1 |

| 7 | 1 |

+---------------------------+------------------+

Hmm. That’s not quite right. The expression groups the population values into a small number of categories, all right, but doesn’t report the category values properly. Let’s try multiplying the FLOOR() results by five:

mysql>SELECT FLOOR(pop/5000000)*5 AS 'max population (millions)',

-> COUNT(*) AS 'number of states'

-> FROM states GROUP BY 'max population (millions)';

+---------------------------+------------------+

| max population (millions) | number of states |

+---------------------------+------------------+

| 0 | 29 |

| 5 | 13 |

| 10 | 4 |

| 15 | 2 |

| 20 | 1 |

| 35 | 1 |

+---------------------------+------------------+

That still isn’t correct. The maximum state population was 35,893,799, which should go into a category for 40 million, not one for 35 million. The problem here is that the category-generating expression groups values toward the lower bound of each category. To group values toward the upper bound instead, use the following technique. For categories of size n, you can place a value x into the proper category using this expression:

FLOOR((x+(n-1))/n)

So the final form of our query looks like this:

mysql>SELECT FLOOR((pop+4999999)/5000000)*5 AS 'max population (millions)',

-> COUNT(*) AS 'number of states'

-> FROM states GROUP BY 'max population (millions)';

+---------------------------+------------------+

| max population (millions) | number of states |

+---------------------------+------------------+

| 5 | 29 |

| 10 | 13 |

| 15 | 4 |

| 20 | 2 |

| 25 | 1 |

| 40 | 1 |

+---------------------------+------------------+

The result shows clearly that the majority of U.S. states have a population of five million or less.

This technique works for all kinds of numeric values. For example, you can group mail table rows into categories of 100,000 bytes as follows:

mysql>SELECT FLOOR((size+99999)/100000) AS 'size (100KB)',

-> COUNT(*) AS 'number of messages'

-> FROM mail GROUP BY 'size (100KB)';

+--------------+--------------------+

| size (100KB) | number of messages |

+--------------+--------------------+

| 1 | 13 |

| 2 | 1 |

| 10 | 1 |

| 24 | 1 |

+--------------+--------------------+

In some instances, it may be more appropriate to categorize groups on a logarithmic scale. For example, the state population values can be treated that way as follows:

mysql>SELECT FLOOR(LOG10(pop)) AS 'log10(population)',

-> COUNT(*) AS 'number of states'

-> FROM states GROUP BY 'log10(population)';

+-------------------+------------------+

| log10(population) | number of states |

+-------------------+------------------+

| 5 | 7 |

| 6 | 35 |

| 7 | 8 |

+-------------------+------------------+

The query shows the number of states that have populations measured in hundreds of thousands, millions, and tens of millions, respectively.

HOW REPETITIVE IS A SET OF VALUES?

To assess how much repetition is present in a set of values, use the ratio of COUNT(DISTINCT) and COUNT(). If all values are unique, both counts will be the same and the ratio will be 1. This is the case for the t values in the mail table and the pop values in the states table:

mysql>SELECT COUNT(DISTINCT t) / COUNT(t) FROM mail;

+------------------------------+

| COUNT(DISTINCT t) / COUNT(t) |

+------------------------------+

| 1.0000 |

+------------------------------+

mysql> SELECT COUNT(DISTINCT pop) / COUNT(pop) FROM states;

+----------------------------------+

| COUNT(DISTINCT pop) / COUNT(pop) |

+----------------------------------+

| 1.0000 |

+----------------------------------+

For a more repetitive set of values, COUNT(DISTINCT) will be less than COUNT(), and the ratio will be smaller:

mysql>SELECT COUNT(DISTINCT name) / COUNT(name) FROM driver_log;

+------------------------------------+

| COUNT(DISTINCT name) / COUNT(name) |

+------------------------------------+

| 0.3000 |

+------------------------------------+

What’s the practical use for this ratio? A result close to zero indicates a high degree of repetition, which means the values will group into a small number of categories naturally. A result of 1 or close to it indicates many unique values, with the consequence that GROUP BY won’t be very efficient for grouping the values into categories. (That is, there will be a lot of categories, relative to the number of values.) This tells you that to generate a summary, you’ll probably find it necessary to impose an artificial categorization on the values, using the techniques described in this recipe.

Controlling Summary Display Order

Problem

You want to sort the result of a summary statement.

Solution

Use an ORDER BY clause—if GROUP BY doesn’t produce the desired sort order.

Discussion

In MySQL, GROUP BY not only groups, it sorts. Thus, there is often no need for an ORDER BY clause in a summary statement. But you can still use ORDER BY if you want a sort order other than the one that GROUP BY produces by default. For example, to determine the number of days driven and total miles for each person in the driver_log table, use this statement:

mysql>SELECT name, COUNT(*) AS days, SUM(miles) AS mileage

-> FROM driver_log GROUP BY name;

+-------+------+---------+

| name | days | mileage |

+-------+------+---------+

| Ben | 3 | 362 |

| Henry | 5 | 911 |

| Suzi | 2 | 893 |

+-------+------+---------+

But that sorts by the names. If you want to sort drivers according to who drove the most days or miles, add the appropriate ORDER BY clause:

mysql>SELECT name, COUNT(*) AS days, SUM(miles) AS mileage

-> FROM driver_log GROUP BY name ORDER BY days DESC;

+-------+------+---------+

| name | days | mileage |

+-------+------+---------+

| Henry | 5 | 911 |

| Ben | 3 | 362 |

| Suzi | 2 | 893 |

+-------+------+---------+

mysql> SELECT name, COUNT(*) AS days, SUM(miles) AS mileage

-> FROM driver_log GROUP BY name ORDER BY mileage DESC;

+-------+------+---------+

| name | days | mileage |

+-------+------+---------+

| Henry | 5 | 911 |

| Suzi | 2 | 893 |

| Ben | 3 | 362 |

+-------+------+---------+

The ORDER BY clause in these statements refers to an aggregate value by using an alias. In MySQL 5.0 and up, that is not necessary and you can refer directly to aggregate values in ORDER BY clauses. Before MySQL 5.0, you must alias them and use the alias in the ORDER BY.

Sometimes you can reorder a summary without an ORDER BY clause by choosing an appropriate GROUP BY expression. For example, if you count how many states joined the Union on each day of the week, grouped by day name, the results are sorted in lexical order:

mysql>SELECT DAYNAME(statehood), COUNT(*) FROM states

-> GROUP BY DAYNAME(statehood);

+--------------------+----------+

| DAYNAME(statehood) | COUNT(*) |

+--------------------+----------+

| Friday | 8 |

| Monday | 9 |

| Saturday | 11 |

| Thursday | 5 |

| Tuesday | 6 |

| Wednesday | 11 |

+--------------------+----------+

From this you can see that no state entered the Union on a Sunday, but that becomes apparent only after you stare at the query result for a while. The output would be more easily understood were it sorted into day-of-week order. It’s possible to do that by adding an explicit ORDER BY to sort on the numeric day-of-week value, but another way to achieve the same result without ORDER BY is to group by DAYOFWEEK() rather than by DAYNAME():

mysql>SELECT DAYNAME(statehood), COUNT(*)

-> FROM states GROUP BY DAYOFWEEK(statehood);

+--------------------+----------+

| DAYNAME(statehood) | COUNT(*) |

+--------------------+----------+

| Monday | 9 |

| Tuesday | 6 |

| Wednesday | 11 |

| Thursday | 5 |

| Friday | 8 |

| Saturday | 11 |

+--------------------+----------+

The implicit ordering done by GROUP BY can add overhead to query processing. If you don’t care whether output rows are sorted, add an ORDER BY NULL clause to suppress this sorting and eliminate its overhead:

mysql>SELECT name, COUNT(*) AS days, SUM(miles) AS mileage

-> FROM driver_log GROUP BY name;

+-------+------+---------+

| name | days | mileage |

+-------+------+---------+

| Ben | 3 | 362 |

| Henry | 5 | 911 |

| Suzi | 2 | 893 |

+-------+------+---------+

mysql> SELECT name, COUNT(*) AS days, SUM(miles) AS mileage

-> FROM driver_log GROUP BY name ORDER BY NULL;

+-------+------+---------+

| name | days | mileage |

+-------+------+---------+

| Ben | 3 | 362 |

| Suzi | 2 | 893 |

| Henry | 5 | 911 |

+-------+------+---------+

The sorting done by GROUP BY is a MySQL extension. To write statements for MySQL that are less likely to need revision when used with other database systems, you may find it beneficial to add an explicit ORDER BY clause in all cases.

Finding Smallest or Largest Summary Values

Problem

You want to compute per-group summary values but display only the smallest or largest of them.

Solution

Add a LIMIT clause to the statement.

Discussion

MIN()and MAX() find the values at the endpoints of a range of values, but if you want to know the extremes of a set of summary values, those functions won’t work. The arguments to MIN() and MAX() cannot be other aggregate functions. For example, you can easily find per-driver mileage totals:

mysql>SELECT name, SUM(miles)

-> FROM driver_log

-> GROUP BY name;

+-------+------------+

| name | SUM(miles) |

+-------+------------+

| Ben | 362 |

| Henry | 911 |

| Suzi | 893 |

+-------+------------+

But this doesn’t work if you want to select only the row for the driver with the most miles:

mysql>SELECT name, SUM(miles)

-> FROM driver_log

-> GROUP BY name

-> HAVING SUM(miles) = MAX(SUM(miles));

ERROR 1111 (HY000): Invalid use of group function

Instead, order the rows with the largest SUM() values first, and use LIMIT to select the first row:

mysql>SELECT name, SUM(miles) AS 'total miles'

-> FROM driver_log

-> GROUP BY name

-> ORDER BY 'total miles' DESC LIMIT 1;

+-------+-------------+

| name | total miles |

+-------+-------------+

| Henry | 911 |

+-------+-------------+

Note that if there is more than one row with the given summary value, a LIMIT 1 query won’t tell you that. For example, you might attempt to ascertain the most common initial letter for state names like this:

mysql>SELECT LEFT(name,1) AS letter, COUNT(*) AS count FROM states

-> GROUP BY letter ORDER BY count DESC LIMIT 1;

+--------+-------+

| letter | count |

+--------+-------+

| M | 8 |

+--------+-------+

But eight state names also begin with N. If you need to know all most-frequent values when there may be more than one of them, find the maximum count first, and then select those values with a count that matches the maximum:

mysql>SET @max = (SELECT COUNT(*) FROM states

-> GROUP BY LEFT(name,1) ORDER BY COUNT(*) DESC LIMIT 1);

mysql> SELECT LEFT(name,1) AS letter, COUNT(*) AS count FROM states

-> GROUP BY letter HAVING count = @max;

+--------+-------+

| letter | count |

+--------+-------+

| M | 8 |

| N | 8 |

+--------+-------+

Alternatively, put the maximum-count calculation in a subquery and combine the statements into one:

mysql>SELECT LEFT(name,1) AS letter, COUNT(*) AS count FROM states

-> GROUP BY letter HAVING count =

-> (SELECT COUNT(*) FROM states

-> GROUP BY LEFT(name,1) ORDER BY COUNT(*) DESC LIMIT 1);

+--------+-------+

| letter | count |

+--------+-------+

| M | 8 |

| N | 8 |

+--------+-------+

Date-Based Summaries

Problem

You want to produce a summary based on date or time values.

Solution

Use GROUP BY to place temporal values into categories of the appropriate duration. Often this involves using expressions to extract the significant parts of dates or times.

Discussion

To put rows in time order, use an ORDER BY clause to sort a column that has a temporal type. If instead you want to summarize rows based on groupings into time intervals, you need to determine how to categorize each row into the proper interval and use GROUP BY to group them accordingly.

For example, to determine how many drivers were on the road and how many miles were driven each day, group the rows in the driver_log table by date:

mysql>SELECT trav_date,

-> COUNT(*) AS 'number of drivers', SUM(miles) As 'miles logged'

-> FROM driver_log GROUP BY trav_date;

+------------+-------------------+--------------+

| trav_date | number of drivers | miles logged |

+------------+-------------------+--------------+

| 2006-08-26 | 1 | 115 |

| 2006-08-27 | 1 | 96 |

| 2006-08-29 | 3 | 822 |

| 2006-08-30 | 2 | 355 |

| 2006-09-01 | 1 | 197 |

| 2006-09-02 | 2 | 581 |

+------------+-------------------+--------------+

However, this summary will grow lengthier as you add more rows to the table. At some point, the number of distinct dates likely will become so large that the summary fails to be useful, and you’d probably decide to change the category size from daily to weekly or monthly.

When a temporal column contains so many distinct values that it fails to categorize well, it’s typical for a summary to group rows using expressions that map the relevant parts of the date or time values onto a smaller set of categories. For example, to produce a time-of-day summary for rows in the mail table, do this:[11]

mysql>SELECT HOUR(t) AS hour,

-> COUNT(*) AS 'number of messages',

-> SUM(size) AS 'number of bytes sent'

-> FROM mail

-> GROUP BY hour;

+------+--------------------+----------------------+

| hour | number of messages | number of bytes sent |

+------+--------------------+----------------------+

| 7 | 1 | 3824 |

| 8 | 1 | 978 |

| 9 | 2 | 2904 |

| 10 | 2 | 1056806 |

| 11 | 1 | 5781 |

| 12 | 2 | 195798 |

| 13 | 1 | 271 |

| 14 | 1 | 98151 |

| 15 | 1 | 1048 |

| 17 | 2 | 2398338 |

| 22 | 1 | 23992 |

| 23 | 1 | 10294 |

+------+--------------------+----------------------+

To produce a day-of-week summary instead, use the DAYOFWEEK() function:

mysql>SELECT DAYOFWEEK(t) AS weekday,

-> COUNT(*) AS 'number of messages',

-> SUM(size) AS 'number of bytes sent'

-> FROM mail

-> GROUP BY weekday;

+---------+--------------------+----------------------+

| weekday | number of messages | number of bytes sent |

+---------+--------------------+----------------------+

| 1 | 1 | 271 |

| 2 | 4 | 2500705 |

| 3 | 4 | 1007190 |

| 4 | 2 | 10907 |

| 5 | 1 | 873 |

| 6 | 1 | 58274 |

| 7 | 3 | 219965 |

+---------+--------------------+----------------------+

To make the output more meaningful, you might want to use DAYNAME() to display weekday names instead. However, because day names sort lexically (for example, “Tuesday” sorts after “Friday”), use DAYNAME() only for display purposes. Continue to group based on the numeric day values so that output rows sort that way:

mysql>SELECT DAYNAME(t) AS weekday,

-> COUNT(*) AS 'number of messages',

-> SUM(size) AS 'number of bytes sent'

-> FROM mail

-> GROUP BY DAYOFWEEK(t);

+-----------+--------------------+----------------------+

| weekday | number of messages | number of bytes sent |

+-----------+--------------------+----------------------+

| Sunday | 1 | 271 |

| Monday | 4 | 2500705 |

| Tuesday | 4 | 1007190 |

| Wednesday | 2 | 10907 |

| Thursday | 1 | 873 |

| Friday | 1 | 58274 |

| Saturday | 3 | 219965 |

+-----------+--------------------+----------------------+

A similar technique can be used for summarizing month-of-year categories that are sorted by numeric value but displayed by month name.

Uses for temporal categorizations are numerous:

§ DATETIME or TIMESTAMP columns have the potential to contain many unique values. To produce daily summaries, strip off the time of day part to collapse all values occurring within a given day to the same value. Any of the following GROUPBY clauses will do this, although the last one is likely to be slowest:

§ GROUP BY DATE(col_name)

§ GROUP BY FROM_DAYS(TO_DAYS(col_name))

§ GROUP BY YEAR(col_name), MONTH(col_name), DAYOFMONTH(col_name)

GROUP BY DATE_FORMAT(col_name,'%Y-%m-%e')

§ To produce monthly or quarterly sales reports, group by MONTH( col_name ) or QUARTER( col_name ) to place dates into the correct part of the year.

§ To summarize web server activity, store your server’s logs in MySQL and run statements that collapse the rows into different time categories. Using MySQL for Apache Logging discusses how to do this for Apache.


[11] Note that the result includes an entry only for hours of the day actually represented in the data. To generate a summary with an entry for every hour, use a join to fill in the “missing” values. See Using a Join to Fill or Identify Holes in a List.

Working with Per-Group and Overall Summary Values Simultaneously

Problem

You want to produce a report that requires different levels of summary detail. Or you want to compare per-group summary values to an overall summary value.

Solution

Use two statements that retrieve different levels of summary information. Or use a subquery to retrieve one summary value and refer to it in the outer query that refers to other summary values. If it’s necessary only to display multiple summary levels, WITH ROLLUP might be sufficient.

Discussion

Sometimes a report involves different levels of summary information. For example, the following report displays the total number of miles per driver from the driver_log table, along with each driver’s miles as a percentage of the total miles in the entire table:

+-------+--------------+------------------------+

| name | miles/driver | percent of total miles |

+-------+--------------+------------------------+

| Ben | 362 | 16.7128 |

| Henry | 911 | 42.0591 |

| Suzi | 893 | 41.2281 |

+-------+--------------+------------------------+

The percentages represent the ratio of each driver’s miles to the total miles for all drivers. To perform the percentage calculation, you need a per-group summary to get each driver’s miles and also an overall summary to get the total miles. First, run a query to get the overall mileage total:

mysql>SELECT @total := SUM(miles) AS 'total miles' FROM driver_log;

+-------------+

| total miles |

+-------------+

| 2166 |

+-------------+

Now, calculate the per-group values and use the overall total to compute the percentages:

mysql>SELECT name,

-> SUM(miles) AS 'miles/driver',

-> (SUM(miles)*100)/@total AS 'percent of total miles'

-> FROM driver_log GROUP BY name;

+-------+--------------+------------------------+

| name | miles/driver | percent of total miles |

+-------+--------------+------------------------+

| Ben | 362 | 16.7128 |

| Henry | 911 | 42.0591 |

| Suzi | 893 | 41.2281 |

+-------+--------------+------------------------+

To combine the two statements into one, use a subquery that computes the total miles:

mysql>SELECT name,

-> SUM(miles) AS 'miles/driver',

-> (SUM(miles)*100)/(SELECT SUM(miles) FROM driver_log)

-> AS 'percent of total miles'

-> FROM driver_log GROUP BY name;

+-------+--------------+------------------------+

| name | miles/driver | percent of total miles |

+-------+--------------+------------------------+

| Ben | 362 | 16.7128 |

| Henry | 911 | 42.0591 |

| Suzi | 893 | 41.2281 |

+-------+--------------+------------------------+

Another type of problem that uses different levels of summary information occurs when you want to compare per-group summary values with the corresponding overall summary value. Suppose that you want to determine which drivers had a lower average miles per day than the group average. Calculate the overall average in a subquery, and then compare each driver’s average to the overall average using a HAVING clause:

mysql>SELECT name, AVG(miles) AS driver_avg FROM driver_log

-> GROUP BY name

-> HAVING driver_avg < (SELECT AVG(miles) FROM driver_log);

+-------+------------+

| name | driver_avg |

+-------+------------+

| Ben | 120.6667 |

| Henry | 182.2000 |

+-------+------------+

If you just want to display different summary values (and not perform calculations involving one summary level against another), add WITH ROLLUP to the GROUP BY clause:

mysql>SELECT name, SUM(miles) AS 'miles/driver'

-> FROM driver_log GROUP BY name WITH ROLLUP;

+-------+--------------+

| name | miles/driver |

+-------+--------------+

| Ben | 362 |

| Henry | 911 |

| Suzi | 893 |

| NULL | 2166 |

+-------+--------------+

mysql> SELECT name, AVG(miles) AS driver_avg FROM driver_log

-> GROUP BY name WITH ROLLUP;

+-------+------------+

| name | driver_avg |

+-------+------------+

| Ben | 120.6667 |

| Henry | 182.2000 |

| Suzi | 446.5000 |

| NULL | 216.6000 |

+-------+------------+

In each case, the output row with NULL in the name column represents the overall sum or average calculated over all drivers.

WITHROLLUP can present multiple levels of summary, if you group by more than one column. The following statement shows the number of mail messages sent between each pair of users:

mysql>SELECT srcuser, dstuser, COUNT(*)

-> FROM mail GROUP BY srcuser, dstuser;

+---------+---------+----------+

| srcuser | dstuser | COUNT(*) |

+---------+---------+----------+

| barb | barb | 1 |

| barb | tricia | 2 |

| gene | barb | 2 |

| gene | gene | 3 |

| gene | tricia | 1 |

| phil | barb | 1 |

| phil | phil | 2 |

| phil | tricia | 2 |

| tricia | gene | 1 |

| tricia | phil | 1 |

+---------+---------+----------+

Adding WITH ROLLUP causes the output to include an intermediate count for each srcuser value (these are the lines with NULL in the dstuser column), plus an overall count at the end:

mysql>SELECT srcuser, dstuser, COUNT(*)

-> FROM mail GROUP BY srcuser, dstuser WITH ROLLUP;

+---------+---------+----------+

| srcuser | dstuser | COUNT(*) |

+---------+---------+----------+

| barb | barb | 1 |

| barb | tricia | 2 |

| barb | NULL | 3 |

| gene | barb | 2 |

| gene | gene | 3 |

| gene | tricia | 1 |

| gene | NULL | 6 |

| phil | barb | 1 |

| phil | phil | 2 |

| phil | tricia | 2 |

| phil | NULL | 5 |

| tricia | gene | 1 |

| tricia | phil | 1 |

| tricia | NULL | 2 |

| NULL | NULL | 16 |

+---------+---------+----------+

Generating a Report That Includes a Summary and a List

Problem

You want to create a report that displays a summary, together with the list of rows associated with each summary value.

Solution

Use two statements that retrieve different levels of summary information. Or use a programming language to do some of the work so that you can use a single statement.

Discussion

Suppose that you want to produce a report that looks like this:

Name: Ben; days on road: 3; miles driven: 362

date: 2006-08-29, trip length: 131

date: 2006-08-30, trip length: 152

date: 2006-09-02, trip length: 79

Name: Henry; days on road: 5; miles driven: 911

date: 2006-08-26, trip length: 115

date: 2006-08-27, trip length: 96

date: 2006-08-29, trip length: 300

date: 2006-08-30, trip length: 203

date: 2006-09-01, trip length: 197

Name: Suzi; days on road: 2; miles driven: 893

date: 2006-08-29, trip length: 391

date: 2006-09-02, trip length: 502

The report shows, for each driver in the driver_log table, the following information:

§ A summary line showing the driver name, the number of days on the road, and the number of miles driven.

§ A list of the dates and mileages for the individual trips from which the summary values are calculated.

This scenario is a variation on the “different levels of summary information” problem discussed in Working with Per-Group and Overall Summary Values Simultaneously. It may not seem like it at first, because one of the types of information is a list rather than a summary. But that’s really just a “level zero” summary. This kind of problem appears in many other forms:

§ You have a database that lists contributions to candidates in your political party. The party chair requests a printout that shows, for each candidate, the number of contributions and total amount contributed, as well as a list of contributor names and addresses.

§ You want to make a handout for a company presentation that summarizes total sales per sales region with a list under each region showing the sales for each state in the region.

Such problems can be solved using a couple of approaches:

§ Run separate statements to get the information for each level of detail that you require. (A single query won’t produce per-group summary values and a list of each group’s individual rows.)

§ Fetch the rows that make up the lists and perform the summary calculations yourself to eliminate the summary statement.

Let’s use each approach to produce the driver report shown at the beginning of this section. The following implementation (in Python) generates the report using one query to summarize the days and miles per driver, and another to fetch the individual trip rows for each driver:

# select total miles per driver and construct a dictionary that

# maps each driver name to days on the road and miles driven

name_map = { }

cursor = conn.cursor ()

cursor.execute ("""

SELECT name, COUNT(name), SUM(miles)

FROM driver_log GROUP BY name

""")

for (name, days, miles) in cursor.fetchall ():

name_map[name] = (days, miles)

# select trips for each driver and print the report, displaying the

# summary entry for each driver prior to the list of trips

cursor.execute ("""

SELECT name, trav_date, miles

FROM driver_log ORDER BY name, trav_date

""")

cur_name = ""

for (name, trav_date, miles) in cursor.fetchall ():

if cur_name != name: # new driver; print driver's summary info

print "Name: %s; days on road: %d; miles driven: %d" \

% (name, name_map[name][0], name_map[name][1])

cur_name = name

print " date: %s, trip length: %d" % (trav_date, miles)

cursor.close ()

An alternate implementation performs summary calculations in the program. By doing this, you can reduce the number of queries required. If you iterate through the trip list and calculate the per-driver day counts and mileage totals yourself, a single query suffices:

# get list of trips for the drivers

cursor = conn.cursor ()

cursor.execute ("""

SELECT name, trav_date, miles FROM driver_log

ORDER BY name, trav_date

""")

# fetch rows into data structure because we

# must iterate through them multiple times

rows = cursor.fetchall ()

cursor.close ()

# iterate through rows once to construct a dictionary that

# maps each driver name to days on the road and miles driven

# (the dictionary entries are lists rather than tuples because

# we need mutable values that can be modified in the loop)

name_map = { }

for (name, trav_date, miles) in rows:

if not name_map.has_key (name): # initialize entry if nonexistent

name_map[name] = [0, 0]

name_map[name][0] = name_map[name][0] + 1 # count days

name_map[name][1] = name_map[name][1] + miles # sum miles

# iterate through rows again to print the report, displaying the

# summary entry for each driver prior to the list of trips

cur_name = ""

for (name, trav_date, miles) in rows:

if cur_name != name: # new driver; print driver's summary info

print "Name: %s; days on road: %d; miles driven: %d" \

% (name, name_map[name][0], name_map[name][1])

cur_name = name

print " date: %s, trip length: %d" % (trav_date, miles)

Should you require more levels of summary information, this type of problem gets more difficult. For example, you might want the report to show driver summaries and trip logs to be preceded by a line that shows the total miles for all drivers:

Total miles driven by all drivers combined: 2166

Name: Ben; days on road: 3; miles driven: 362

date: 2006-08-29, trip length: 131

date: 2006-08-30, trip length: 152

date: 2006-09-02, trip length: 79

Name: Henry; days on road: 5; miles driven: 911

date: 2006-08-26, trip length: 115

date: 2006-08-27, trip length: 96

date: 2006-08-29, trip length: 300

date: 2006-08-30, trip length: 203

date: 2006-09-01, trip length: 197

Name: Suzi; days on road: 2; miles driven: 893

date: 2006-08-29, trip length: 391

date: 2006-09-02, trip length: 502

In this case, you need either another query to produce the total mileage, or another calculation in your program that computes the overall total.