Statistical Techniques - MySQL Cookbook (2007)

MySQL Cookbook (2007)

Chapter 13. Statistical Techniques

Introduction

This chapter covers several topics that relate to basic statistical techniques. For the most part, these recipes build on those described in earlier chapters, such as the summary techniques discussed in Chapter 8. The examples here thus show additional ways to apply the material from those chapters. Broadly speaking, the topics discussed in this chapter include:

§ Techniques for data characterization, such as calculating descriptive statistics, generating frequency distributions, counting missing values, and calculating least-squares regressions or correlation coefficients

§ Randomization methods, such as how to generate random numbers and apply them to randomizing of a set of rows or to selecting individual items randomly from the rows

§ Rank assignments

Statistics covers such a large and diverse array of topics that this chapter necessarily only scratches the surface and simply illustrates a few of the potential areas in which MySQL may be applied to statistical analysis. Note that some statistical measures can be defined in different ways (for example, do you calculate standard deviation based on n degrees of freedom, or n–1?). For that reason, if the definition I use for a given term doesn’t match the one you prefer, you’ll need to adapt the queries or algorithms shown here to some extent.

You can find scripts related to the examples discussed here in the stats directory of the recipes distribution, and scripts for creating some of the example tables in the tables directory.

Calculating Descriptive Statistics

Problem

You want to characterize a dataset by computing general descriptive or summary statistics.

Solution

Many common descriptive statistics, such as mean and standard deviation, can be obtained by applying aggregate functions to your data. Others, such as median or mode, can be calculated based on counting queries.

Discussion

Suppose that you have a table testscore containing observations representing subject ID, age, sex, and test score:

mysql>SELECT subject, age, sex, score FROM testscore ORDER BY subject;

+---------+-----+-----+-------+

| subject | age | sex | score |

+---------+-----+-----+-------+

| 1 | 5 | M | 5 |

| 2 | 5 | M | 4 |

| 3 | 5 | F | 6 |

| 4 | 5 | F | 7 |

| 5 | 6 | M | 8 |

| 6 | 6 | M | 9 |

| 7 | 6 | F | 4 |

| 8 | 6 | F | 6 |

| 9 | 7 | M | 8 |

| 10 | 7 | M | 6 |

| 11 | 7 | F | 9 |

| 12 | 7 | F | 7 |

| 13 | 8 | M | 9 |

| 14 | 8 | M | 6 |

| 15 | 8 | F | 7 |

| 16 | 8 | F | 10 |

| 17 | 9 | M | 9 |

| 18 | 9 | M | 7 |

| 19 | 9 | F | 10 |

| 20 | 9 | F | 9 |

+---------+-----+-----+-------+

A good first step in analyzing a set of observations is to generate some descriptive statistics that summarize their general characteristics as a whole. Common statistical values of this kind include:

§ The number of observations, their sum, and their range (minimum and maximum)

§ Measures of central tendency, such as mean, median, and mode

§ Measures of variation, such as standard deviation and variance

Aside from the median and mode, all of these can be calculated easily by invoking aggregate functions:

mysql>SELECT COUNT(score) AS n,

-> SUM(score) AS sum,

-> MIN(score) AS minimum,

-> MAX(score) AS maximum,

-> AVG(score) AS mean,

-> STDDEV_SAMP(score) AS 'std. dev.',

-> VAR_SAMP(score) AS 'variance'

-> FROM testscore;

+----+------+---------+---------+--------+-----------+----------+

| n | sum | minimum | maximum | mean | std. dev. | variance |

+----+------+---------+---------+--------+-----------+----------+

| 20 | 146 | 4 | 10 | 7.3000 | 1.8382 | 3.3789 |

+----+------+---------+---------+--------+-----------+----------+

The STDDEV_SAMP() and VAR_SAMP() functions produce sample measures rather than population measures. That is, for a set of n values, they produce a result that is based on n–1 degrees of freedom. If you want the population measures, which are based on n degrees of freedom, use theSTDDEV_POP() and VAR_POP() functions instead. STDDEV() and VARIANCE() are synonyms for STDDEV_POP() and VAR_POP().

Standard deviation can be used to identify outliers—values that are uncharacteristically far from the mean. For example, to select values that lie more than three standard deviations from the mean, you can do something like this:

SELECT @mean := AVG(score), @std := STDDEV_SAMP(score) FROM testscore;

SELECT score FROM testscore WHERE ABS(score-@mean) > @std * 3;

MySQL has no built-in function for computing the mode or median of a set of values, but you can compute them yourself. The mode is the value that occurs most frequently. To determine what it is, count each value and see which one is most common:

mysql>SELECT score, COUNT(score) AS frequency

-> FROM testscore GROUP BY score ORDER BY frequency DESC;

+-------+-----------+

| score | frequency |

+-------+-----------+

| 9 | 5 |

| 6 | 4 |

| 7 | 4 |

| 4 | 2 |

| 8 | 2 |

| 10 | 2 |

| 5 | 1 |

+-------+-----------+

In this case, 9 is the modal score value.

The median of a set of ordered values can be calculated like this:[16]

§ If the number of values is odd, the median is the middle value.

§ If the number of values is even, the median is the average of the two middle values.

Based on that definition, use the following procedure to determine the median of a set of observations stored in the database:

1. Issue a query to count the number of observations. From the count, you can determine whether the median calculation requires one or two values, and what their indexes are within the ordered set of observations.

2. Issue a query that includes an ORDER BY clause to sort the observations and a LIMIT clause to pull out the middle value or values.

3. If there is a single middle value, it is the median. Otherwise, take the average of the middle values.

For example, if a table t contains a score column with 37 values (an odd number), you need to select a single value to get the median, using a statement like this:

SELECT score FROM t ORDER BY score LIMIT 18,1

If the column contains 38 values (an even number), the statement becomes:

SELECT score FROM t ORDER BY score LIMIT 18,2

Then you can select the values returned by the statement and compute the median from their average.

The following Perl function implements a median calculation. It takes a database handle and the names of the database, table, and column that contain the set of observations. Then it generates the statement that retrieves the relevant values and returns their average:

sub median

{

my ($dbh, $db_name, $tbl_name, $col_name) = @_;

my ($count, $limit);

$count = $dbh->selectrow_array ("SELECT COUNT($col_name)

FROM $db_name.$tbl_name");

return undef unless $count > 0;

if ($count % 2 == 1) # odd number of values; select middle value

{

$limit = sprintf ("LIMIT %d,1", ($count-1)/2);

}

else # even number of values; select middle two values

{

$limit = sprintf ("LIMIT %d,2", $count/2 - 1);

}

my $sth = $dbh->prepare ("SELECT $col_name

FROM $db_name.$tbl_name

ORDER BY $col_name $limit");

$sth->execute ();

my ($n, $sum) = (0, 0);

while (my $ref = $sth->fetchrow_arrayref ())

{

++$n;

$sum += $ref->[0];

}

return ($sum / $n);

}

The preceding technique works for a set of values stored in the database. If you happen to have already fetched an ordered set of values into an array @val, you can compute the median like this instead:

if (@val == 0) # if array is empty, median is undefined

{

$median = undef;

}

elsif (@val % 2 == 1) # if array size is odd, median is middle number

{

$median = $val[(@val-1)/2];

}

else # array size is even; median is average

{ # of two middle numbers

$median = ($val[@val/2 - 1] + $val[@val/2]) / 2;

}

The code works for arrays that have an initial subscript of 0; for languages that use 1-based array indexes, adjust the algorithm accordingly.


[16] Note that the definition of median given here isn’t fully general; it doesn’t address what to do if the middle values in the dataset are duplicated.

Per-Group Descriptive Statistics

Problem

You want to produce descriptive statistics for each subgroup of a set of observations.

Solution

Use aggregate functions, but employ a GROUP BY clause to arrange observations into the appropriate groups.

Discussion

Calculating Descriptive Statistics shows how to compute descriptive statistics for the entire set of scores in the testscore table. To be more specific, you can use GROUPBY to divide the observations into groups and calculate statistics for each of them. For example, the subjects in thetestscore table are listed by age and sex, so it’s possible to calculate similar statistics by age or sex (or both) by application of appropriate GROUPBY clauses.

Here’s how to calculate by age:

mysql>SELECT age, COUNT(score) AS n,

-> SUM(score) AS sum,

-> MIN(score) AS minimum,

-> MAX(score) AS maximum,

-> AVG(score) AS mean,

-> STDDEV_SAMP(score) AS 'std. dev.',

-> VAR_SAMP(score) AS 'variance'

-> FROM testscore

-> GROUP BY age;

+-----+---+------+---------+---------+--------+-----------+----------+

| age | n | sum | minimum | maximum | mean | std. dev. | variance |

+-----+---+------+---------+---------+--------+-----------+----------+

| 5 | 4 | 22 | 4 | 7 | 5.5000 | 1.2910 | 1.6667 |

| 6 | 4 | 27 | 4 | 9 | 6.7500 | 2.2174 | 4.9167 |

| 7 | 4 | 30 | 6 | 9 | 7.5000 | 1.2910 | 1.6667 |

| 8 | 4 | 32 | 6 | 10 | 8.0000 | 1.8257 | 3.3333 |

| 9 | 4 | 35 | 7 | 10 | 8.7500 | 1.2583 | 1.5833 |

+-----+---+------+---------+---------+--------+-----------+----------+

By sex:

mysql>SELECT sex, COUNT(score) AS n,

-> SUM(score) AS sum,

-> MIN(score) AS minimum,

-> MAX(score) AS maximum,

-> AVG(score) AS mean,

-> STDDEV_SAMP(score) AS 'std. dev.',

-> VAR_SAMP(score) AS 'variance'

-> FROM testscore

-> GROUP BY sex;

+-----+----+------+---------+---------+--------+-----------+----------+

| sex | n | sum | minimum | maximum | mean | std. dev. | variance |

+-----+----+------+---------+---------+--------+-----------+----------+

| M | 10 | 71 | 4 | 9 | 7.1000 | 1.7920 | 3.2111 |

| F | 10 | 75 | 4 | 10 | 7.5000 | 1.9579 | 3.8333 |

+-----+----+------+---------+---------+--------+-----------+----------+

By age and sex:

mysql>SELECT age, sex, COUNT(score) AS n,

-> SUM(score) AS sum,

-> MIN(score) AS minimum,

-> MAX(score) AS maximum,

-> AVG(score) AS mean,

-> STDDEV_SAMP(score) AS 'std. dev.',

-> VAR_SAMP(score) AS 'variance'

-> FROM testscore

-> GROUP BY age, sex;

+-----+-----+---+------+---------+---------+--------+-----------+----------+

| age | sex | n | sum | minimum | maximum | mean | std. dev. | variance |

+-----+-----+---+------+---------+---------+--------+-----------+----------+

| 5 | M | 2 | 9 | 4 | 5 | 4.5000 | 0.7071 | 0.5000 |

| 5 | F | 2 | 13 | 6 | 7 | 6.5000 | 0.7071 | 0.5000 |

| 6 | M | 2 | 17 | 8 | 9 | 8.5000 | 0.7071 | 0.5000 |

| 6 | F | 2 | 10 | 4 | 6 | 5.0000 | 1.4142 | 2.0000 |

| 7 | M | 2 | 14 | 6 | 8 | 7.0000 | 1.4142 | 2.0000 |

| 7 | F | 2 | 16 | 7 | 9 | 8.0000 | 1.4142 | 2.0000 |

| 8 | M | 2 | 15 | 6 | 9 | 7.5000 | 2.1213 | 4.5000 |

| 8 | F | 2 | 17 | 7 | 10 | 8.5000 | 2.1213 | 4.5000 |

| 9 | M | 2 | 16 | 7 | 9 | 8.0000 | 1.4142 | 2.0000 |

| 9 | F | 2 | 19 | 9 | 10 | 9.5000 | 0.7071 | 0.5000 |

+-----+-----+---+------+---------+---------+--------+-----------+----------+

Generating Frequency Distributions

Problem

You want to know the frequency of occurrence for each value in a table.

Solution

Derive a frequency distribution that summarizes the contents of your dataset.

Discussion

A common application for per-group summary techniques is to generate a breakdown of the number of times each value occurs. This is called a frequency distribution. For the testscore table, the frequency distribution looks like this:

mysql>SELECT score, COUNT(score) AS occurrence

-> FROM testscore GROUP BY score;

+-------+------------+

| score | occurrence |

+-------+------------+

| 4 | 2 |

| 5 | 1 |

| 6 | 4 |

| 7 | 4 |

| 8 | 2 |

| 9 | 5 |

| 10 | 2 |

+-------+------------+

If you express the results in terms of percentages rather than as counts, you produce a relative frequency distribution. To break down a set of observations and show each count as a percentage of the total, use one query to get the total number of observations and another to calculate the percentages for each group:

mysql>SELECT @n := COUNT(score) FROM testscore;

mysql> SELECT score, (COUNT(score)*100)/@n AS percent

-> FROM testscore GROUP BY score;

+-------+---------+

| score | percent |

+-------+---------+

| 4 | 10 |

| 5 | 5 |

| 6 | 20 |

| 7 | 20 |

| 8 | 10 |

| 9 | 25 |

| 10 | 10 |

+-------+---------+

The distributions just shown summarize the number of values for individual scores. However, if the dataset contains a large number of distinct values and you want a distribution that shows only a small number of categories, you may want to lump values into categories and produce a count for each category. “Lumping” techniques are discussed in Categorizing Noncategorical Data.

One typical use of frequency distributions is to export the results for use in a graphing program. In the absence of such a program, you can use MySQL itself to generate a simple ASCII chart as a visual representation of the distribution. For example, to display an ASCII bar chart of the test score counts, convert the counts to strings of * characters:

mysql>SELECT score, REPEAT('*',COUNT(score)) AS occurrences

-> FROM testscore GROUP BY score;

+-------+-------------+

| score | occurrences |

+-------+-------------+

| 4 | ** |

| 5 | * |

| 6 | **** |

| 7 | **** |

| 8 | ** |

| 9 | ***** |

| 10 | ** |

+-------+-------------+

To chart the relative frequency distribution instead, use the percentage values:

mysql>SELECT @n := COUNT(score) FROM testscore;

mysql> SELECT score, REPEAT('*',(COUNT(score)*100)/@n) AS percent

-> FROM testscore GROUP BY score;

+-------+---------------------------+

| score | percent |

+-------+---------------------------+

| 4 | ********** |

| 5 | ***** |

| 6 | ******************** |

| 7 | ******************** |

| 8 | ********** |

| 9 | ************************* |

| 10 | ********** |

+-------+---------------------------+

The ASCII chart method is fairly crude, obviously, but it’s a quick way to get a picture of the distribution of observations, and it requires no other tools.

If you generate a frequency distribution for a range of categories where some of the categories are not represented in your observations, the missing categories will not appear in the output. To force each category to be displayed, use a reference table and a LEFT JOIN (a technique discussed in Using a Join to Fill or Identify Holes in a List). For the testscore table, the possible scores range from 0 to 10, so a reference table should contain each of those values:

mysql>CREATE TABLE ref (score INT);

mysql> INSERT INTO ref (score)

-> VALUES(0),(1),(2),(3),(4),(5),(6),(7),(8),(9),(10);

Then join the reference table to the test scores to generate the frequency distribution:

mysql>SELECT ref.score, COUNT(testscore.score) AS occurrences

-> FROM ref LEFT JOIN testscore ON ref.score = testscore.score

-> GROUP BY ref.score;

+-------+-------------+

| score | occurrences |

+-------+-------------+

| 0 | 0 |

| 1 | 0 |

| 2 | 0 |

| 3 | 0 |

| 4 | 2 |

| 5 | 1 |

| 6 | 4 |

| 7 | 4 |

| 8 | 2 |

| 9 | 5 |

| 10 | 2 |

+-------+-------------+

This distribution includes rows for scores 0 through 3, none of which appear in the frequency distribution shown earlier.

The same principle applies to relative frequency distributions:

mysql>SELECT @n := COUNT(score) FROM testscore;

mysql> SELECT ref.score, (COUNT(testscore.score)*100)/@n AS percent

-> FROM ref LEFT JOIN testscore ON ref.score = testscore.score

-> GROUP BY ref.score;

+-------+---------+

| score | percent |

+-------+---------+

| 0 | 0 |

| 1 | 0 |

| 2 | 0 |

| 3 | 0 |

| 4 | 10 |

| 5 | 5 |

| 6 | 20 |

| 7 | 20 |

| 8 | 10 |

| 9 | 25 |

| 10 | 10 |

+-------+---------+

Counting Missing Values

Problem

A set of observations is incomplete. You want to find out how much so.

Solution

Count the number of NULL values in the set.

Discussion

Values can be missing from a set of observations for any number of reasons: a test may not yet have been administered, something may have gone wrong during the test that requires invalidating the observation, and so forth. You can represent such observations in a dataset as NULL values to signify that they’re missing or otherwise invalid, and then use summary statements to characterize the completeness of the dataset.

If a table t contains values to be summarized along a single dimension, a simple summary will do to characterize the missing values. Suppose that t looks like this:

mysql>SELECT subject, score FROM t ORDER BY subject;

+---------+-------+

| subject | score |

+---------+-------+

| 1 | 38 |

| 2 | NULL |

| 3 | 47 |

| 4 | NULL |

| 5 | 37 |

| 6 | 45 |

| 7 | 54 |

| 8 | NULL |

| 9 | 40 |

| 10 | 49 |

+---------+-------+

COUNT(*) counts the total number of rows, and COUNT(score) counts only the number of nonmissing scores. The difference between the two values is the number of missing scores, and that difference in relation to the total provides the percentage of missing scores. These calculations are expressed as follows:

mysql>SELECT COUNT(*) AS 'n (total)',

-> COUNT(score) AS 'n (nonmissing)',

-> COUNT(*) - COUNT(score) AS 'n (missing)',

-> ((COUNT(*) - COUNT(score)) * 100) / COUNT(*) AS '% missing'

-> FROM t;

+-----------+-----------------+-------------+-----------+

| n (total) | n (nonmissing) | n (missing) | % missing |

+-----------+-----------------+-------------+-----------+

| 10 | 7 | 3 | 30.00 |

+-----------+-----------------+-------------+-----------+

As an alternative to counting NULL values as the difference between counts, you can count them directly using SUM(ISNULL(score)). The ISNULL() function returns 1 if its argument is NULL, zero otherwise:

mysql>SELECT COUNT(*) AS 'n (total)',

-> COUNT(score) AS 'n (nonmissing)',

-> SUM(ISNULL(score)) AS 'n (missing)',

-> (SUM(ISNULL(score)) * 100) / COUNT(*) AS '% missing'

-> FROM t;

+-----------+-----------------+-------------+-----------+

| n (total) | n (nonmissing) | n (missing) | % missing |

+-----------+-----------------+-------------+-----------+

| 10 | 7 | 3 | 30.00 |

+-----------+-----------------+-------------+-----------+

If values are arranged in groups, occurrences of NULL values can be assessed on a per-group basis. Suppose that t contains scores for subjects that are distributed among conditions for two factors A and B, each of which has two levels:

mysql>SELECT subject, A, B, score FROM t ORDER BY subject;

+---------+------+------+-------+

| subject | A | B | score |

+---------+------+------+-------+

| 1 | 1 | 1 | 18 |

| 2 | 1 | 1 | NULL |

| 3 | 1 | 1 | 23 |

| 4 | 1 | 1 | 24 |

| 5 | 1 | 2 | 17 |

| 6 | 1 | 2 | 23 |

| 7 | 1 | 2 | 29 |

| 8 | 1 | 2 | 32 |

| 9 | 2 | 1 | 17 |

| 10 | 2 | 1 | NULL |

| 11 | 2 | 1 | NULL |

| 12 | 2 | 1 | 25 |

| 13 | 2 | 2 | NULL |

| 14 | 2 | 2 | 33 |

| 15 | 2 | 2 | 34 |

| 16 | 2 | 2 | 37 |

+---------+------+------+-------+

In this case, the query uses a GROUP BY clause to produce a summary for each combination of conditions:

mysql>SELECT A, B, COUNT(*) AS 'n (total)',

-> COUNT(score) AS 'n (nonmissing)',

-> COUNT(*) - COUNT(score) AS 'n (missing)',

-> ((COUNT(*) - COUNT(score)) * 100) / COUNT(*) AS '% missing'

-> FROM t

-> GROUP BY A, B;

+------+------+-----------+-----------------+-------------+-----------+

| A | B | n (total) | n (nonmissing) | n (missing) | % missing |

+------+------+-----------+-----------------+-------------+-----------+

| 1 | 1 | 4 | 3 | 1 | 25.00 |

| 1 | 2 | 4 | 4 | 0 | 0.00 |

| 2 | 1 | 4 | 2 | 2 | 50.00 |

| 2 | 2 | 4 | 3 | 1 | 25.00 |

+------+------+-----------+-----------------+-------------+-----------+

Calculating Linear Regressions or Correlation Coefficients

Problem

You want to calculate the least-squares regression line for two variables or the correlation coefficient that expresses the strength of the relationship between them.

Solution

Apply summary functions to calculate the necessary terms.

Discussion

When the data values for two variables X and Y are stored in a database, the least-squares regression for them can be calculated easily using aggregate functions. The same is true for the correlation coefficient. The two calculations are actually fairly similar, and many terms for performing the computations are common to the two procedures.

Suppose that you want to calculate a least-squares regression using the age and test score values for the observations in the testscore table:

mysql>SELECT age, score FROM testscore;

+-----+-------+

| age | score |

+-----+-------+

| 5 | 5 |

| 5 | 4 |

| 5 | 6 |

| 5 | 7 |

| 6 | 8 |

| 6 | 9 |

| 6 | 4 |

| 6 | 6 |

| 7 | 8 |

| 7 | 6 |

| 7 | 9 |

| 7 | 7 |

| 8 | 9 |

| 8 | 6 |

| 8 | 7 |

| 8 | 10 |

| 9 | 9 |

| 9 | 7 |

| 9 | 10 |

| 9 | 9 |

+-----+-------+

The following equation expresses the regression line, where a and b are the intercept and slope of the line:

Y = bX + a

Letting age be X and score be Y, begin by computing the terms needed for the regression equation. These include the number of observations; the means, sums, and sums of squares for each variable; and the sum of the products of each variable:[17]

mysql>SELECT

-> @n := COUNT(score) AS N,

-> @meanX := AVG(age) AS 'X mean',

-> @sumX := SUM(age) AS 'X sum',

-> @sumXX := SUM(age*age) AS 'X sum of squares',

-> @meanY := AVG(score) AS 'Y mean',

-> @sumY := SUM(score) AS 'Y sum',

-> @sumYY := SUM(score*score) AS 'Y sum of squares',

-> @sumXY := SUM(age*score) AS 'X*Y sum'

-> FROM testscore\G

*************************** 1. row ***************************

N: 20

X mean: 7.000000000

X sum: 140

X sum of squares: 1020

Y mean: 7.300000000

Y sum: 146

Y sum of squares: 1130

X*Y sum: 1053

From those terms, calculate the regression slope and intercept as follows:

mysql>SELECT

-> @b := (@n*@sumXY - @sumX*@sumY) / (@n*@sumXX - @sumX*@sumX)

-> AS slope;

+-------------+

| slope |

+-------------+

| 0.775000000 |

+-------------+

mysql> SELECT @a := (@meanY - @b*@meanX) AS intercept;

+----------------------+

| intercept |

+----------------------+

| 1.875000000000000000 |

+----------------------+

The regression equation then is:

mysql>SELECT CONCAT('Y = ',@b,'X + ',@a) AS 'least-squares regression';

+-----------------------------------------+

| least-squares regression |

+-----------------------------------------+

| Y = 0.775000000X + 1.875000000000000000 |

+-----------------------------------------+

To compute the correlation coefficient, many of the same terms are used:

mysql>SELECT

-> (@n*@sumXY - @sumX*@sumY)

-> / SQRT((@n*@sumXX - @sumX*@sumX) * (@n*@sumYY - @sumY*@sumY))

-> AS correlation;

+------------------+

| correlation |

+------------------+

| 0.61173620442199 |

+------------------+


[17] You can see where these terms come from by consulting any standard statistics text.

Generating Random Numbers

Problem

You need a source of random numbers.

Solution

Invoke MySQL’s RAND() function.

Discussion

MySQL has a RAND() function that produces random numbers between 0 and 1:

mysql>SELECT RAND(), RAND(), RAND();

+------------------+--------------------+------------------+

| RAND() | RAND() | RAND() |

+------------------+--------------------+------------------+

| 0.18768198246852 | 0.0052350517411111 | 0.46312934203365 |

+------------------+--------------------+------------------+

When invoked with an integer argument, RAND() uses that value to seed the random number generator. You can use this feature to produce a repeatable series of numbers for a column of a query result. The following example shows that RAND() without an argument produces a different column of values per query, whereas RAND( N ) produces a repeatable column:

mysql>SELECT i, RAND(), RAND(10), RAND(20) FROM t;

+------+------------------+------------------+------------------+

| i | RAND() | RAND(10) | RAND(20) |

+------+------------------+------------------+------------------+

| 1 | 0.60170396939079 | 0.65705152196535 | 0.15888261251047 |

| 2 | 0.10435410784963 | 0.12820613023658 | 0.63553050033332 |

| 3 | 0.71665866180943 | 0.66987611602049 | 0.70100469486881 |

| 4 | 0.27023101623192 | 0.96476222012636 | 0.59843200407776 |

+------+------------------+------------------+------------------+

mysql> SELECT i, RAND(), RAND(10), RAND(20) FROM t;

+------+------------------+------------------+------------------+

| i | RAND() | RAND(10) | RAND(20) |

+------+------------------+------------------+------------------+

| 1 | 0.55794027034001 | 0.65705152196535 | 0.15888261251047 |

| 2 | 0.22995210460383 | 0.12820613023658 | 0.63553050033332 |

| 3 | 0.47593974273274 | 0.66987611602049 | 0.70100469486881 |

| 4 | 0.68984243058585 | 0.96476222012636 | 0.59843200407776 |

+------+------------------+------------------+------------------+

If you want to seed RAND() randomly, pick a seed value based on a source of entropy. Possible sources are the current timestamp or connection identifier, alone or perhaps in combination:

RAND(UNIX_TIMESTAMP())

RAND(CONNECTION_ID())

RAND(UNIX_TIMESTAMP()+CONNECTION_ID())

HOW RANDOM IS RAND()?

Does the RAND() function generate evenly distributed numbers? Check it out for yourself with the following Python script, rand_test.py, from the stats directory of the recipes distribution. It uses RAND() to generate random numbers and constructs a frequency distribution from them, using .1-sized categories. This provides a means of assessing how evenly distributed the values are:

#!/usr/bin/python

# rand_test.pl - create a frequency distribution of RAND() values.

# This provides a test of the randomness of RAND().

# Method: Draw random numbers in the range from 0 to 1.0,

# and count how many of them occur in .1-sized intervals

import MySQLdb

import Cookbook

npicks = 1000 # number of times to pick a number

bucket = [0] * 10 # buckets for counting picks in each interval

conn = Cookbook.connect ()

cursor = conn.cursor ()

for i in range (0, npicks):

cursor.execute ("SELECT RAND()")

(val,) = cursor.fetchone ()

slot = int (val * 10)

if slot > 9:

slot = 9 # put 1.0 in last slot

bucket[slot] = bucket[slot] + 1

cursor.close ()

conn.close ()

# Print the resulting frequency distribution

for slot in range (0, 9):

print "%2d %d" % (slot+1, bucket[slot])

The stats directory also contains equivalent scripts in other languages.

However, it’s probably better to use other seed value sources if you have them. For example, if your system has a /dev/random or /dev/urandom device, you can read the device and use it to generate a value for seeding RAND().

Randomizing a Set of Rows

Problem

You want to randomize a set of rows or values.

Solution

Use ORDER BY RAND().

Discussion

MySQL’s RAND() function can be used to randomize the order in which a query returns its rows. Somewhat paradoxically, this randomization is achieved by adding an ORDER BY clause to the query. The technique is roughly equivalent to a spreadsheet randomization method. Suppose that you have a set of values in a spreadsheet that looks like this:

Patrick

Penelope

Pertinax

Polly

To place these in random order, first add another column that contains randomly chosen numbers:

Patrick .73

Penelope .37

Pertinax .16

Polly .48

Then sort the rows according to the values of the random numbers:

Pertinax .16

Penelope .37

Polly .48

Patrick .73

At this point, the original values have been placed in random order, because the effect of sorting the random numbers is to randomize the values associated with them. To re-randomize the values, choose another set of random numbers, and sort the rows again.

In MySQL, you can achieve a similar effect by associating a set of random numbers with a query result and sorting the result by those numbers. To do this, add an ORDER BY RAND() clause:

mysql>SELECT name FROM t ORDER BY RAND();

+----------+

| name |

+----------+

| Pertinax |

| Penelope |

| Patrick |

| Polly |

+----------+

mysql> SELECT name FROM t ORDER BY RAND();

+----------+

| name |

+----------+

| Patrick |

| Pertinax |

| Penelope |

| Polly |

+----------+

Applications for randomizing a set of rows include any scenario that uses selection without replacement (choosing each item from a set of items until there are no more items left). Some examples of this are:

§ Determining the starting order for participants in an event. List the participants in a table, and select them in random order.

§ Assigning starting lanes or gates to participants in a race. List the lanes in a table, and select a random lane order.

§ Choosing the order in which to present a set of quiz questions.

§ Shuffling a deck of cards. Represent each card by a row in a table, and shuffle the deck by selecting the rows in random order. Deal them one by one until the deck is exhausted.

To use the last example as an illustration, let’s implement a card deck-shuffling algorithm. Shuffling and dealing cards is randomization plus selection without replacement: each card is dealt once before any is dealt twice; when the deck is used up, it is reshuffled to rerandomize it for a new dealing order. Within a program, this task can be performed with MySQL using a table deck that has 52 rows, assuming a set of cards with each combination of 13 face values and 4 suits:

1. Select the entire table, and store it into an array.

2. Each time a card is needed, take the next element from the array.

3. When the array is exhausted, all the cards have been dealt. “Reshuffle” the table to generate a new card order.

Setting up the deck table is a tedious task if you insert the 52 card records by writing all the INSERT statements manually. The deck contents can be generated more easily in combinatorial fashion within a program by generating each pairing of face value with suit. Here’s some PHP code that creates a deck table with face and suit columns, and then populates the table using nested loops to generate the pairings for the INSERT statements:

$result =& $conn->query ("

CREATE TABLE deck

(

face ENUM('A', 'K', 'Q', 'J', '10', '9', '8',

'7', '6', '5', '4', '3', '2') NOT NULL,

suit ENUM('hearts', 'diamonds', 'clubs', 'spades') NOT NULL

)");

if (PEAR::isError ($result))

die ("Cannot issue CREATE TABLE statement\n");

$face_array = array ("A", "K", "Q", "J", "10", "9", "8",

"7", "6", "5", "4", "3", "2");

$suit_array = array ("hearts", "diamonds", "clubs", "spades");

# insert a "card" into the deck for each combination of suit and face

$stmt =& $conn->prepare ("INSERT INTO deck (face,suit) VALUES(?,?)");

if (PEAR::isError ($stmt))

die ("Cannot insert card into deck\n");

foreach ($face_array as $index => $face)

{

foreach ($suit_array as $index2 => $suit)

{

$result =& $conn->execute ($stmt, array ($face, $suit));

if (PEAR::isError ($result))

die ("Cannot insert card into deck\n");

}

}

Shuffling the cards is a matter of issuing this statement:

SELECT face, suit FROM deck ORDER BY RAND();

To do that and store the results in an array within a script, write a shuffle_deck() function that issues the query and returns the resulting values in an array (again shown in PHP):

function shuffle_deck ($conn)

{

$result =& $conn->query ("SELECT face, suit FROM deck ORDER BY RAND()");

if (PEAR::isError ($result))

die ("Cannot retrieve cards from deck\n");

$card = array ();

while ($obj =& $result->fetchRow (DB_FETCHMODE_OBJECT))

$card[] = $obj; # add card record to end of $card array

$result->free ();

return ($card);

}

Deal the cards by keeping a counter that ranges from 0 to 51 to indicate which card to select. When the counter reaches 52, the deck is exhausted and should be shuffled again.

Selecting Random Items from a Set of Rows

Problem

You want to pick an item or items randomly from a set of values.

Solution

Randomize the values, and then pick the first one (or the first few, if you need more than one).

Discussion

When a set of items is stored in MySQL, you can choose one at random as follows:

1. Select the items in the set in random order, using ORDER BY RAND() as described in Randomizing a Set of Rows.

2. Add LIMIT 1 to the query to pick the first item.

For example, a simple simulation of tossing a die can be performed by creating a die table containing rows with values from 1 to 6 corresponding to the six faces of a die cube, and then picking rows from it at random:

mysql>SELECT n FROM die ORDER BY RAND() LIMIT 1;

+------+

| n |

+------+

| 6 |

+------+

mysql> SELECT n FROM die ORDER BY RAND() LIMIT 1;

+------+

| n |

+------+

| 4 |

+------+

mysql> SELECT n FROM die ORDER BY RAND() LIMIT 1;

+------+

| n |

+------+

| 5 |

+------+

mysql> SELECT n FROM die ORDER BY RAND() LIMIT 1;

+------+

| n |

+------+

| 4 |

+------+

As you repeat this operation, you pick a random sequence of items from the set. This is a form of selection with replacement: an item is chosen from a pool of items and then returned to the pool for the next pick. Because items are replaced, it’s possible to pick the same item multiple times when making successive choices this way. Other examples of selection with replacement include:

§ Selecting a banner ad to display on a web page

§ Picking a row for a “quote of the day” application

§ “Pick a card, any card” magic tricks that begin with a full deck of cards each time

If you want to pick more than one item, change the LIMIT argument. For example, to draw five winning entries at random from a table named drawing that contains contest entries, use RAND() in combination with LIMIT:

SELECT * FROM drawing ORDER BY RAND() LIMIT 5;

A special case occurs when you’re picking a single row from a table that you know contains a column with values in the range from 1 to n in unbroken sequence. Under these circumstances, it’s possible to avoid performing an ORDER BY operation on the entire table by picking a random number in that range and selecting the matching row:

SET @id = FLOOR(RAND()*n)+1;

SELECT ... FROM tbl_name WHERE id = @id;

This will be much quicker than ORDER BY RAND() LIMIT 1 as the table size increases.

Assigning Ranks

Problem

You want to assign ranks to a set of values.

Solution

Decide on a ranking method, and then put the values in the desired order and apply the method to them.

Discussion

Some kinds of statistical tests require assignment of ranks. This section describes three ranking methods and shows how each can be implemented by using user-defined variables. The examples assume that a table t contains the following scores, which are to be ranked with the values in descending order:

mysql>SELECT score FROM t ORDER BY score DESC;

+-------+

| score |

+-------+

| 5 |

| 4 |

| 4 |

| 3 |

| 2 |

| 2 |

| 2 |

| 1 |

+-------+

One type of ranking simply assigns each value its row number within the ordered set of values. To produce such rankings, keep track of the row number and use it for the current rank:

mysql>SET @rownum := 0;

mysql> SELECT @rownum := @rownum + 1 AS rank, score

-> FROM t ORDER BY score DESC;

+------+-------+

| rank | score |

+------+-------+

| 1 | 5 |

| 2 | 4 |

| 3 | 4 |

| 4 | 3 |

| 5 | 2 |

| 6 | 2 |

| 7 | 2 |

| 8 | 1 |

+------+-------+

That kind of ranking doesn’t take into account the possibility of ties (instances of values that are the same). A second ranking method does so by advancing the rank only when values change:

mysql>SET @rank = 0, @prev_val = NULL;

mysql> SELECT @rank := IF(@prev_val=score,@rank,@rank+1) AS rank,

-> @prev_val := score AS score

-> FROM t ORDER BY score DESC;

+------+-------+

| rank | score |

+------+-------+

| 1 | 5 |

| 2 | 4 |

| 2 | 4 |

| 3 | 3 |

| 4 | 2 |

| 4 | 2 |

| 4 | 2 |

| 5 | 1 |

+------+-------+

A third ranking method is something of a combination of the other two methods. It ranks values by row number, except when ties occur. In that case, the tied values each get a rank equal to the row number of the first of the values. To implement this method, keep track of the row number and the previous value, advancing the rank to the current row number when the value changes:

mysql>SET @rownum = 0, @rank = 0, @prev_val = NULL;

mysql> SELECT @rownum := @rownum + 1 AS row,

-> @rank := IF(@prev_val!=score,@rownum,@rank) AS rank,

-> @prev_val := score AS score

-> FROM t ORDER BY score DESC;

+------+------+-------+

| row | rank | score |

+------+------+-------+

| 1 | 1 | 5 |

| 2 | 2 | 4 |

| 3 | 2 | 4 |

| 4 | 4 | 3 |

| 5 | 5 | 2 |

| 6 | 5 | 2 |

| 7 | 5 | 2 |

| 8 | 8 | 1 |

+------+------+-------+

Ranks are easy to assign within a program as well. For example, the following Ruby fragment ranks the scores in t using the third ranking method:

dbh.execute("SELECT score FROM t ORDER BY score DESC") do |sth|

rownum = 0

rank = 0

prev_score = nil

puts "Row\tRank\tScore\n"

sth.fetch do |row|

score = row[0]

rownum += 1

rank = rownum if rownum == 1 || prev_score != score

prev_score = score

puts "#{rownum}\t#{rank}\t#{score}"

end

end

The third type of ranking is commonly used outside the realm of statistical methods. Recall that in Choosing Appropriate LIMIT Values, we used a table al_winner that contained the American League pitchers who won 15 or more games during the 2001 baseball season:

mysql>SELECT name, wins FROM al_winner ORDER BY wins DESC, name;

+----------------+------+

| name | wins |

+----------------+------+

| Mulder, Mark | 21 |

| Clemens, Roger | 20 |

| Moyer, Jamie | 20 |

| Garcia, Freddy | 18 |

| Hudson, Tim | 18 |

| Abbott, Paul | 17 |

| Mays, Joe | 17 |

| Mussina, Mike | 17 |

| Sabathia, C.C. | 17 |

| Zito, Barry | 17 |

| Buehrle, Mark | 16 |

| Milton, Eric | 15 |

| Pettitte, Andy | 15 |

| Radke, Brad | 15 |

| Sele, Aaron | 15 |

+----------------+------+

These pitchers can be assigned ranks using the third method as follows:

mysql>SET @rownum = 0, @rank = 0, @prev_val = NULL;

mysql> SELECT @rownum := @rownum + 1 AS row,

-> @rank := IF(@prev_val!=wins,@rownum,@rank) AS rank,

-> name,

-> @prev_val := wins AS wins

-> FROM al_winner ORDER BY wins DESC;

+------+------+----------------+------+

| row | rank | name | wins |

+------+------+----------------+------+

| 1 | 1 | Mulder, Mark | 21 |

| 2 | 2 | Clemens, Roger | 20 |

| 3 | 2 | Moyer, Jamie | 20 |

| 4 | 4 | Garcia, Freddy | 18 |

| 5 | 4 | Hudson, Tim | 18 |

| 6 | 6 | Zito, Barry | 17 |

| 7 | 6 | Sabathia, C.C. | 17 |

| 8 | 6 | Mussina, Mike | 17 |

| 9 | 6 | Mays, Joe | 17 |

| 10 | 6 | Abbott, Paul | 17 |

| 11 | 11 | Buehrle, Mark | 16 |

| 12 | 12 | Milton, Eric | 15 |

| 13 | 12 | Pettitte, Andy | 15 |

| 14 | 12 | Radke, Brad | 15 |

| 15 | 12 | Sele, Aaron | 15 |

+------+------+----------------+------+