Chapter 14. Handling Duplicates

Introduction

Tables or result sets sometimes contain duplicate rows. In some cases this is acceptable. For example, if you conduct a web poll that records date and client IP number along with the votes, duplicate rows may be allowable, because it’s possible for large numbers of votes to appear to originate from the same IP number for an Internet service that routes traffic from its customers through a single proxy host. In other cases, duplicates will be unacceptable, and you’ll want to take steps to avoid them. Operations related to handling of duplicate rows include the following:

§ Preventing duplicates from being created within a table in the first place. If each row in a table is intended to represent a single entity (such as a person, an item in a catalog, or a specific observation in an experiment), the occurrence of duplicates presents significant difficulties in using it that way. Duplicates make it impossible to refer to every row in the table unambiguously, so it’s best to make sure duplicates never occur.

§ Counting the number of duplicates to determine if they are present and to what extent.

§ Identifying duplicated values (or the rows containing them) so you can see what they are and where they occur.

§ Eliminating duplicates to ensure that each row is unique. This may involve removing rows from a table to leave only unique rows. Or it may involve selecting a result set in such a way that no duplicates appear in the output. For example, to display a list of the states in which you have customers, you probably wouldn’t want a long list of state names from all customer records. A list showing each state name only once suffices and is easier to understand.

Several tools are at your disposal for dealing with duplicate rows. Choose them according to the objective that you’re trying to achieve:

§ Creating a table with a primary key or unique index prevents duplicates from being added to the table. MySQL uses the index as a constraint to enforce the requirement that each row in the table contains a unique key in the indexed column or columns.

§ In conjunction with a unique index, the INSERT IGNORE and REPLACE statements enable you to handle insertion of duplicate rows gracefully without generating errors. For bulk-loading operations, the same options are available in the form of the IGNORE or REPLACE modifiers for the LOAD DATA statement.

§ If you need to determine whether a table contains duplicates, use GROUP BY to categorize rows into groups, and COUNT() to see how many rows are in each group. These techniques are described in Chapter 8, in the context of producing summaries, but they’re useful for duplicate counting and identification as well. A counting summary is essentially an operation that groups values into categories to determine how frequently each one occurs.

§ SELECT DISTINCT is useful for removing duplicate rows from a result set (see Removing Duplicate Rows for more information). For an existing table that already contains duplicates, adding a unique index can remove them. If you determine that there are n identical rows in a table, you can use DELETE ... LIMIT to eliminate n–1 instances from that specific set of rows.

Scripts related to the examples shown in this chapter are located in the dups directory of the recipes distribution. For scripts that create the tables used here, look in the tables directory.

Preventing Duplicates from Occurring in a Table

Problem

You want to prevent a table from ever containing duplicates.

Solution

Use a PRIMARY KEY or a UNIQUE index.

Discussion

To make sure that rows in a table are unique, some column or combination of columns must be required to contain unique values in each row. When this requirement is satisfied, you can refer to any row in the table unambiguously by using its unique identifier. To make sure a table has this characteristic, include a PRIMARY KEY or UNIQUE index in the table structure when you create the table. The following table contains no such index, so it would allow duplicate rows:

CREATE TABLE person
(
  last_name   CHAR(20),
  first_name  CHAR(20),
  address     CHAR(40)
);

To prevent multiple rows with the same first and last name values from being created in this table, add a PRIMARY KEY to its definition. When you do this, the indexed columns must be NOT NULL, because a PRIMARY KEY does not allow NULL values:

CREATE TABLE person
(
  last_name   CHAR(20) NOT NULL,
  first_name  CHAR(20) NOT NULL,
  address     CHAR(40),
  PRIMARY KEY (last_name, first_name)
);

The presence of a unique index in a table normally causes an error to occur if you insert a row into the table that duplicates an existing row in the column or columns that define the index. Dealing with Duplicates When Loading Rows into a Table discusses how to handle such errors or modify MySQL’s duplicate-handling behavior.

Another way to enforce uniqueness is to add a UNIQUE index rather than a PRIMARY KEY to a table. The two types of indexes are similar, with the exception that a UNIQUE index can be created on columns that allow NULL values. For the person table, it’s likely that you’d require both the first and last names to be filled in. If so, you still declare the columns as NOT NULL, and the following table definition is effectively equivalent to the preceding one:

CREATE TABLE person
(
  last_name   CHAR(20) NOT NULL,
  first_name  CHAR(20) NOT NULL,
  address     CHAR(40),
  UNIQUE (last_name, first_name)
);

If a UNIQUE index does happen to allow NULL values, NULL is special because it is the one value that can occur multiple times. The rationale for this is that it is not possible to know whether one unknown value is the same as another, so multiple unknown values are allowed. (An exception to this is that BDB tables allow at most one NULL value in a column that has a UNIQUE index.)
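
To illustrate, here is a minimal sketch (the table and column names here are hypothetical, not part of the recipes distribution):

CREATE TABLE uniq_test (c CHAR(10), UNIQUE (c));
INSERT INTO uniq_test (c) VALUES('x');
INSERT INTO uniq_test (c) VALUES('x');     # fails with a duplicate-key error
INSERT INTO uniq_test (c) VALUES(NULL);
INSERT INTO uniq_test (c) VALUES(NULL);    # succeeds; NULL is not treated as a duplicate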

It may of course be that you’d want the person table to reflect the real world, for which people do sometimes have the same name. In this case, you cannot set up a unique index based on the name columns, because duplicate names must be allowed. Instead, each person must be assigned some sort of unique identifier, which becomes the value that distinguishes one row from another. In MySQL, it’s common to accomplish this by using an AUTO_INCREMENT column:

CREATE TABLE person
(
  id          INT UNSIGNED NOT NULL AUTO_INCREMENT,
  last_name   CHAR(20),
  first_name  CHAR(20),
  address     CHAR(40),
  PRIMARY KEY (id)
);

In this case, when you create a row with an id value of NULL, MySQL assigns that column a unique ID automatically. Another possibility is to assign identifiers externally and use those IDs as unique keys. For example, citizens in a given country might have unique taxpayer ID numbers. If so, those numbers can serve as the basis for a unique index:

CREATE TABLE person
(
  tax_id      INT UNSIGNED NOT NULL,
  last_name   CHAR(20),
  first_name  CHAR(20),
  address     CHAR(40),
  PRIMARY KEY (tax_id)
);

See Also

If an existing table already contains duplicate rows that you want to remove, see Eliminating Duplicates from a Table. Chapter 11 further discusses AUTO_INCREMENT columns.

Dealing with Duplicates When Loading Rows into a Table

Problem

You’ve created a table with a unique index to prevent duplicate values in the indexed column or columns. But this results in an error if you attempt to insert a duplicate row, and you want to avoid having to deal with such errors.

Solution

One approach is to just ignore the error. Another is to use an INSERT IGNORE, REPLACE, or INSERT ... ON DUPLICATE KEY UPDATE statement, each of which modifies MySQL’s duplicate-handling behavior. For bulk-loading operations, LOAD DATA has modifiers that enable you to specify how to handle duplicates.

Discussion

By default, MySQL generates an error when you insert a row that duplicates an existing unique key value. Suppose that the person table has the following structure, with a unique index on the last_name and first_name columns:

CREATE TABLE person
(
  last_name   CHAR(20) NOT NULL,
  first_name  CHAR(20) NOT NULL,
  address     CHAR(40),
  PRIMARY KEY (last_name, first_name)
);

An attempt to insert a row with duplicate values in the indexed columns results in an error:

mysql> INSERT INTO person (last_name, first_name)
    -> VALUES('X1','Y1');
Query OK, 1 row affected (0.00 sec)
mysql> INSERT INTO person (last_name, first_name)
    -> VALUES('X1','Y1');
ERROR 1062 (23000): Duplicate entry 'X1-Y1' for key 1

If you’re issuing the statements from the mysql program interactively, you can simply say, “Okay, that didn’t work,” ignore the error, and continue. But if you write a program to insert the rows, an error may terminate the program. One way to avoid this is to modify the program’s error-handling behavior to trap the error and then ignore it. See Checking for Errors for information about error-handling techniques.

If you want to prevent the error from occurring in the first place, you might consider using a two-query method to solve the duplicate-row problem:

§ Issue a SELECT to see whether the row is already present.

§ Issue an INSERT if the row is not present.

But that doesn’t really work: another client might insert the same row after the SELECT and before the INSERT, in which case the error would still occur for your INSERT. To make sure that doesn’t happen, you could use a transaction or lock the tables, but then you’ve gone from two statements to four. MySQL provides three single-query solutions to the problem of handling duplicate rows. Choose from among them according to the duplicate-handling behavior you want to affect:

§ Use INSERT IGNORE rather than INSERT if you want to keep the original row when a duplicate occurs. If the row doesn’t duplicate an existing row, MySQL inserts it as usual. If the row is a duplicate, the IGNORE keyword tells MySQL to discard it silently without generating an error:

mysql> INSERT IGNORE INTO person (last_name, first_name)
    -> VALUES('X2','Y2');
Query OK, 1 row affected (0.00 sec)
mysql> INSERT IGNORE INTO person (last_name, first_name)
    -> VALUES('X2','Y2');
Query OK, 0 rows affected (0.00 sec)

The row count value indicates whether the row was inserted or ignored. From within a program, you can obtain this value by calling the rows-affected function provided by your API (see Recipes and ).

§ Use REPLACE rather than INSERT if you want to replace the original row with the new one when a duplicate occurs. If the row is new, it’s inserted just as with INSERT. If it’s a duplicate, the new row replaces the old one:

mysql> REPLACE INTO person (last_name, first_name)
    -> VALUES('X3','Y3');
Query OK, 1 row affected (0.00 sec)
mysql> REPLACE INTO person (last_name, first_name)
    -> VALUES('X3','Y3');
Query OK, 2 rows affected (0.00 sec)

The rows-affected value in the second case is 2 because the original row is deleted and the new row is inserted in its place.

§ Use INSERT ... ON DUPLICATE KEY UPDATE if you want to modify columns of an existing row when a duplicate occurs. If the row is new, it’s inserted. If it’s a duplicate, the ON DUPLICATE KEY UPDATE clause indicates how to modify the existing row in the table. In other words, this statement can initialize or update a row as necessary. The rows-affected count indicates what happened: 1 for an insert, 2 for an update.

INSERT IGNORE is more efficient than REPLACE because it doesn’t actually insert duplicates. Thus, it’s most applicable when you just want to make sure a copy of a given row is present in a table. REPLACE, on the other hand, is often more appropriate for tables in which other nonkey columns need to be replaced. INSERT ... ON DUPLICATE KEY UPDATE is appropriate when you must insert a record if it doesn’t exist, but just update some of its columns if the new record is a duplicate in the indexed columns.

Suppose that you’re maintaining a table named passtbl for a web application that contains email addresses and password hash values, and that is keyed by email address:

CREATE TABLE passtbl
(
  email     VARCHAR(60) NOT NULL,
  password  VARBINARY(60) NOT NULL,
  PRIMARY KEY (email)
);

How do you create new rows for new users, but change passwords of existing rows for existing users? A typical algorithm for handling row maintenance might look like this:

1. Issue a SELECT to see whether a row already exists with a given email value.

2. If no such row exists, add a new one with INSERT.

3. If the row does exist, update it with UPDATE.

These steps must be performed within a transaction or with the tables locked to prevent other users from changing the tables while you’re using them. In MySQL, you can use REPLACE to simplify both cases to the same single-statement operation:

REPLACE INTO passtbl (email,password) VALUES(address,hash_value);

If no row with the given email address exists, MySQL creates a new one. If a row does exist, MySQL replaces it; in effect, this updates the password column of the row associated with the address.
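
If you’d rather not have the existing row deleted and re-created (which is how REPLACE operates internally), a sketch of the same maintenance operation written with INSERT ... ON DUPLICATE KEY UPDATE (discussed further below) looks like this:

INSERT INTO passtbl (email,password) VALUES(address,hash_value)
  ON DUPLICATE KEY UPDATE password = VALUES(password);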

INSERT IGNORE and REPLACE are useful when you know exactly what values should be stored in the table when you attempt to insert a row. That’s not always the case. For example, you might want to insert a row if it doesn’t exist, but update only certain parts of it otherwise. This commonly occurs when you use a table for counting. Suppose that you’re recording votes for candidates in polls, using the following table:

CREATE TABLE poll_vote
(
  poll_id       INT UNSIGNED NOT NULL AUTO_INCREMENT,
  candidate_id  INT UNSIGNED,
  vote_count    INT UNSIGNED,
  PRIMARY KEY (poll_id, candidate_id)
);

The primary key is the combination of poll and candidate number. The table should be used like this:

§ The first time a vote is received for a given poll candidate, a new row should be inserted with a vote count of 1.

§ For all subsequent votes for that poll candidate, the vote count of the existing record should be incremented.

Neither INSERT IGNORE nor REPLACE is appropriate here because, for all votes except the first, you don’t know what the vote count should be. INSERT ... ON DUPLICATE KEY UPDATE works better. The following example shows how it works, beginning with an empty table:

mysql> SELECT * FROM poll_vote;
Empty set (0.01 sec)
mysql> INSERT INTO poll_vote (poll_id,candidate_id,vote_count) VALUES(14,2,1)
    -> ON DUPLICATE KEY UPDATE vote_count = vote_count + 1;
Query OK, 1 row affected (0.01 sec)
mysql> SELECT * FROM poll_vote;
+---------+--------------+------------+
| poll_id | candidate_id | vote_count |
+---------+--------------+------------+
|      14 |            2 |          1 |
+---------+--------------+------------+
1 row in set (0.01 sec)
mysql> INSERT INTO poll_vote (poll_id,candidate_id,vote_count) VALUES(14,2,1)
    -> ON DUPLICATE KEY UPDATE vote_count = vote_count + 1;
Query OK, 2 rows affected (0.00 sec)
mysql> SELECT * FROM poll_vote;
+---------+--------------+------------+
| poll_id | candidate_id | vote_count |
+---------+--------------+------------+
|      14 |            2 |          2 |
+---------+--------------+------------+
1 row in set (0.00 sec)

For the first INSERT, no row for the candidate exists, so the row is inserted. For the second INSERT, the row already exists, so MySQL just updates the vote count. With INSERT ... ON DUPLICATE KEY UPDATE, there is no need to check whether the row exists because MySQL does that for you. The row count indicates what action the INSERT statement performs: 1 for a new row and 2 for an update to an existing row.
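
If you’re working in the mysql client rather than through an API, one way to inspect that count (assuming MySQL 5.0 or later) is the ROW_COUNT() function, which reports the rows-affected value of the preceding statement:

mysql> INSERT INTO poll_vote (poll_id,candidate_id,vote_count) VALUES(14,2,1)
    -> ON DUPLICATE KEY UPDATE vote_count = vote_count + 1;
mysql> SELECT ROW_COUNT();    # 1 if the row was inserted, 2 if it was updated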

The techniques just described have the benefit of eliminating overhead that might otherwise be required for a transaction. But this benefit comes at the price of portability because they all involve MySQL-specific syntax. If portability is a high priority, you might prefer to stick with a transactional approach.
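
For reference, here is a minimal sketch of what that portable approach looks like for the passtbl example, assuming a transactional storage engine such as InnoDB (the email value is hypothetical):

START TRANSACTION;
SELECT COUNT(*) FROM passtbl WHERE email = 'user@example.com' FOR UPDATE;
# The application issues INSERT if the count is 0, or UPDATE otherwise, and then:
COMMIT;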

See Also

For bulk record-loading operations in which you use the LOAD DATA statement to load a set of rows from a file into a table, duplicate-row handling can be controlled using the statement’s IGNORE and REPLACE modifiers. These produce behavior analogous to that of the INSERT IGNORE and REPLACE statements. See Handling Duplicate Key Values for more information.
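
As a rough sketch (the data file name here is hypothetical), the bulk-load counterpart of INSERT IGNORE looks like this; writing REPLACE in place of IGNORE gives the REPLACE-like behavior instead:

LOAD DATA LOCAL INFILE 'person.txt' IGNORE INTO TABLE person;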

The use of INSERT ... ON DUPLICATE KEY UPDATE for initializing and updating counts is further demonstrated in Recipes and .

Counting and Identifying Duplicates

Problem

You want to determine whether a table contains duplicates, and to what extent they occur. Or you want to see the rows that contain the duplicated values.

Solution

Use a counting summary that looks for and displays duplicated values. To see the rows in which the duplicated values occur, join the summary to the original table to display the matching rows.

Discussion

Suppose that your web site includes a sign-up page that enables visitors to add themselves to your mailing list to receive periodic product catalog mailings. But you forgot to include a unique index in the table when you created it, and now you suspect that some people are signed up multiple times. Perhaps they forgot they were already on the list, or perhaps people added friends to the list who were already signed up. Either way, the result of having duplicate rows is that you mail out duplicate catalogs. This is an additional expense to you, and it annoys the recipients. This section discusses how to find out if duplicates are present in a table, how prevalent they are, and how to display the duplicated rows. (For tables that do contain duplicates, Eliminating Duplicates from a Table describes how to eliminate them.)

To determine whether duplicates occur in a table, use a counting summary (a topic covered in Chapter 8). Summary techniques can be applied to identifying and counting duplicates by grouping rows with GROUP BY and counting the rows in each group using COUNT(). For the examples here, assume that catalog recipients are listed in a table named catalog_list that has the following contents:

mysql> SELECT * FROM catalog_list;
+-----------+-------------+--------------------------+
| last_name | first_name  | street                   |
+-----------+-------------+--------------------------+
| Isaacson  | Jim         | 515 Fordam St., Apt. 917 |
| Baxter    | Wallace     | 57 3rd Ave.              |
| McTavish  | Taylor      | 432 River Run            |
| Pinter    | Marlene     | 9 Sunset Trail           |
| BAXTER    | WALLACE     | 57 3rd Ave.              |
| Brown     | Bartholomew | 432 River Run            |
| Pinter    | Marlene     | 9 Sunset Trail           |
| Baxter    | Wallace     | 57 3rd Ave., Apt 102     |
+-----------+-------------+--------------------------+

Suppose that you want to define “duplicate” using the last_name and first_name columns. That is, recipients with the same name are assumed to be the same person. The following statements are typical of those used to characterize the table and to assess the existence and extent of duplicate values:

§ The total number of rows in the table:

mysql> SELECT COUNT(*) AS rows FROM catalog_list;
+------+
| rows |
+------+
|    8 |
+------+

§ The number of distinct names:

mysql> SELECT COUNT(DISTINCT last_name, first_name) AS 'distinct names'
    -> FROM catalog_list;
+----------------+
| distinct names |
+----------------+
|              5 |
+----------------+

§ The number of rows containing duplicated names:

mysql> SELECT COUNT(*) - COUNT(DISTINCT last_name, first_name)
    -> AS 'duplicate names'
    -> FROM catalog_list;
+-----------------+
| duplicate names |
+-----------------+
|               3 |
+-----------------+

§ The fraction of the rows that contain unique or nonunique names:

mysql> SELECT COUNT(DISTINCT last_name, first_name) / COUNT(*)
    -> AS 'unique',
    -> 1 - (COUNT(DISTINCT last_name, first_name) / COUNT(*))
    -> AS 'nonunique'
    -> FROM catalog_list;
+--------+-----------+
| unique | nonunique |
+--------+-----------+
| 0.6250 |    0.3750 |
+--------+-----------+

These statements help you characterize the extent of duplicates, but they don’t show you which values are duplicated. To see the duplicated names in the catalog_list table, use a summary statement that displays the nonunique values along with the counts:

mysql> SELECT COUNT(*) AS repetitions, last_name, first_name
    -> FROM catalog_list
    -> GROUP BY last_name, first_name
    -> HAVING repetitions > 1;
+-------------+-----------+------------+
| repetitions | last_name | first_name |
+-------------+-----------+------------+
|           3 | Baxter    | Wallace    |
|           2 | Pinter    | Marlene    |
+-------------+-----------+------------+

The statement includes a HAVING clause that restricts the output to include only those names that occur more than once. (If you omit the clause, the summary lists unique names as well, which is useless when you’re interested only in duplicates.) In general, to identify sets of values that are duplicated, do the following:

1. Determine which columns contain the values that may be duplicated.

2. List those columns in the column selection list, along with COUNT(*).

3. List the columns in the GROUP BY clause as well.

4. Add a HAVING clause that eliminates unique values by requiring group counts to be greater than one.

Queries constructed that way have the following form:

SELECT COUNT(*),column_list
FROM tbl_name
GROUP BY column_list
HAVING COUNT(*) > 1

It’s easy to generate duplicate-finding queries like that within a program, given database and table names and a nonempty set of column names. For example, here is a Perl function make_dup_count_query() that generates the proper query for finding and counting duplicated values in the specified columns:

sub make_dup_count_query
{
  # Construct a query that finds and counts duplicated values
  # in the named columns of the given table
  my ($db_name, $tbl_name, @col_name) = @_;

  return (
    "SELECT COUNT(*)," . join (",", @col_name)
    . "\nFROM $db_name.$tbl_name"
    . "\nGROUP BY " . join (",", @col_name)
    . "\nHAVING COUNT(*) > 1"
  );
}

make_dup_count_query() returns the query as a string. If you invoke it like this:

$str = make_dup_count_query ("cookbook", "catalog_list",
                             "last_name", "first_name");

the resulting value of $str is:

SELECT COUNT(*),last_name,first_name
FROM cookbook.catalog_list
GROUP BY last_name,first_name
HAVING COUNT(*) > 1

What you do with the query string is up to you. You can execute it from within the script that creates it, pass it to another program, or write it to a file for execution later. The dups directory of the recipes distribution contains a script named dup_count.pl that you can use to try the function (as well as some translations into other languages). Eliminating Duplicates from a Table discusses use of the make_dup_count_query() function to implement a duplicate-removal technique.

Summary techniques are useful for assessing the existence of duplicates, how often they occur, and displaying which values are duplicated. But if duplicates are determined using only a subset of a table’s columns, a summary in itself cannot display the entire content of the rows that contain the duplicate values. (For example, the summaries shown thus far display counts of duplicated names in the catalog_list table or the names themselves, but don’t show the addresses associated with those names.) To see the original rows containing the duplicate names, join the summary information to the table from which it’s generated. The following example shows how to do this to display the catalog_list rows that contain duplicated names. The summary is written to a temporary table, which then is joined to the catalog_list table to produce the rows that match those names:

mysql> CREATE TABLE tmp
    -> SELECT COUNT(*) AS count, last_name, first_name FROM catalog_list
    -> GROUP BY last_name, first_name HAVING count > 1;
mysql> SELECT catalog_list.*
    -> FROM tmp INNER JOIN catalog_list USING(last_name, first_name)
    -> ORDER BY last_name, first_name;
+-----------+------------+----------------------+
| last_name | first_name | street               |
+-----------+------------+----------------------+
| Baxter    | Wallace    | 57 3rd Ave.          |
| BAXTER    | WALLACE    | 57 3rd Ave.          |
| Baxter    | Wallace    | 57 3rd Ave., Apt 102 |
| Pinter    | Marlene    | 9 Sunset Trail       |
| Pinter    | Marlene    | 9 Sunset Trail       |
+-----------+------------+----------------------+

DUPLICATE IDENTIFICATION AND STRING CASE SENSITIVITY

For strings that have a case-insensitive collation, values that differ only in lettercase are considered the same for comparison purposes. To treat them as distinct values, compare them using a case-sensitive or binary collation. Controlling Case Sensitivity in String Comparisons shows how to do this.
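
As a rough sketch (assuming the name columns use a case-insensitive latin1 collation), the duplicate-count query shown earlier in this recipe can be made to treat lettercase as significant like this:

SELECT COUNT(*) AS repetitions, last_name, first_name
FROM catalog_list
GROUP BY last_name COLLATE latin1_bin, first_name COLLATE latin1_bin
HAVING repetitions > 1;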

Eliminating Duplicates from a Table

Problem

You want to remove duplicate rows from a table so that it contains only unique rows.

Solution

Select the unique rows from the table into a second table that you use to replace the original one. Or add a unique index to the table using ALTER TABLE, which will remove duplicates as it builds the index. Or use DELETE ... LIMIT n to remove all but one instance of a specific set of duplicate rows.

Discussion

Preventing Duplicates from Occurring in a Table discusses how to prevent duplicates from being added to a table by creating it with a unique index. However, if you forget to include a unique index when you create a table, you may discover later that it contains duplicates and that it’s necessary to apply some sort of duplicate-removal technique. The catalog_list table used earlier is an example of this, because it contains several instances in which the same person is listed multiple times:

mysql> SELECT * FROM catalog_list ORDER BY last_name, first_name;
+-----------+-------------+--------------------------+
| last_name | first_name  | street                   |
+-----------+-------------+--------------------------+
| Baxter    | Wallace     | 57 3rd Ave.              |
| BAXTER    | WALLACE     | 57 3rd Ave.              |
| Baxter    | Wallace     | 57 3rd Ave., Apt 102     |
| Brown     | Bartholomew | 432 River Run            |
| Isaacson  | Jim         | 515 Fordam St., Apt. 917 |
| McTavish  | Taylor      | 432 River Run            |
| Pinter    | Marlene     | 9 Sunset Trail           |
| Pinter    | Marlene     | 9 Sunset Trail           |
+-----------+-------------+--------------------------+

The table contains redundant entries and it would be a good idea to remove them, to eliminate duplicate mailings and reduce postage costs. To do this, you have several options:

§ Select the table’s unique rows into another table, and then use that table to replace the original one. The result is to remove the table’s duplicates. This works when “duplicate” means “the entire row is the same as another.”

§ Add a unique index to the table using ALTER TABLE. This operation turns duplicate rows into unique rows, where “duplicate” means “the index values are the same.”

§ You can remove duplicates for a specific set of duplicate rows by using DELETE ... LIMIT n to remove all but one row.

This recipe discusses each of these duplicate-removal methods. When you consider which of them to choose in a particular situation, keep in mind that the applicability of a given method is often determined by several factors:

§ Does the method require the table to have a unique index?

§ If the columns in which duplicate values occur may contain NULL, will the method remove duplicate NULL values?

§ Does the method prevent duplicates from occurring in the future?

Removing duplicates using table replacement

If a row is considered to duplicate another only if the entire row is the same, one way to eliminate duplicates from a table is to select its unique rows into a new table that has the same structure, and then replace the original table with the new one. To perform table replacement, use the following procedure:

1. Create a new table that has the same structure as the original one. CREATE TABLE ... LIKE is useful for this (see Cloning a Table):

mysql> CREATE TABLE tmp LIKE catalog_list;

2. Use INSERT INTO ... SELECT DISTINCT to select the unique rows from the original table into the new one:

mysql> INSERT INTO tmp SELECT DISTINCT * FROM catalog_list;

Select rows from the tmp table to verify that the new table contains no duplicates:

mysql> SELECT * FROM tmp ORDER BY last_name, first_name;
+-----------+-------------+--------------------------+
| last_name | first_name  | street                   |
+-----------+-------------+--------------------------+
| Baxter    | Wallace     | 57 3rd Ave.              |
| Baxter    | Wallace     | 57 3rd Ave., Apt 102     |
| Brown     | Bartholomew | 432 River Run            |
| Isaacson  | Jim         | 515 Fordam St., Apt. 917 |
| McTavish  | Taylor      | 432 River Run            |
| Pinter    | Marlene     | 9 Sunset Trail           |
+-----------+-------------+--------------------------+

3. After creating the new tmp table that contains unique rows, use it to replace the original catalog_list table:

mysql> DROP TABLE catalog_list;
mysql> RENAME TABLE tmp TO catalog_list;

The effective result of this procedure is that catalog_list no longer contains duplicates.

This table-replacement method works in the absence of an index (although it might be slow for large tables). For tables that contain duplicate NULL values, it removes those duplicates. It does not prevent the occurrence of duplicates in the future.

This method requires rows to be completely identical for rows to be considered duplicates. Thus, it treats as distinct those rows for Wallace Baxter that have slightly different street values.

If duplicates are defined only with respect to a subset of the columns in the table, create a new table that has a unique index for those columns, select rows into it using INSERT IGNORE, and then replace the original table with the new one:

mysql> CREATE TABLE tmp LIKE catalog_list;
mysql> ALTER TABLE tmp ADD PRIMARY KEY (last_name, first_name);
mysql> INSERT IGNORE INTO tmp SELECT * FROM catalog_list;
mysql> SELECT * FROM tmp ORDER BY last_name, first_name;
+-----------+-------------+--------------------------+
| last_name | first_name  | street                   |
+-----------+-------------+--------------------------+
| Baxter    | Wallace     | 57 3rd Ave.              |
| Brown     | Bartholomew | 432 River Run            |
| Isaacson  | Jim         | 515 Fordam St., Apt. 917 |
| McTavish  | Taylor      | 432 River Run            |
| Pinter    | Marlene     | 9 Sunset Trail           |
+-----------+-------------+--------------------------+
mysql> DROP TABLE catalog_list;
mysql> RENAME TABLE tmp TO catalog_list;

The unique index prevents rows with duplicate key values from being inserted into tmp, and IGNORE tells MySQL not to stop with an error if a duplicate is found. One shortcoming of this method is that if the indexed columns can contain NULL values, you must use a UNIQUE index rather than a PRIMARY KEY, in which case the index will not remove duplicate NULL keys. (UNIQUE indexes allow multiple NULL values.) This method does prevent occurrence of duplicates in the future.

Removing duplicates by adding an index

To remove duplicates from a table “in place,” add a unique index to the table with ALTER TABLE, using the IGNORE keyword to tell it to discard rows with duplicate key values during the index construction process. The original catalog_list table looks like this without an index:

mysql> SELECT * FROM catalog_list ORDER BY last_name, first_name;
+-----------+-------------+--------------------------+
| last_name | first_name  | street                   |
+-----------+-------------+--------------------------+
| Baxter    | Wallace     | 57 3rd Ave.              |
| BAXTER    | WALLACE     | 57 3rd Ave.              |
| Baxter    | Wallace     | 57 3rd Ave., Apt 102     |
| Brown     | Bartholomew | 432 River Run            |
| Isaacson  | Jim         | 515 Fordam St., Apt. 917 |
| McTavish  | Taylor      | 432 River Run            |
| Pinter    | Marlene     | 9 Sunset Trail           |
| Pinter    | Marlene     | 9 Sunset Trail           |
+-----------+-------------+--------------------------+

Add a unique index, and then check what effect doing so has on the table contents:

mysql> ALTER IGNORE TABLE catalog_list
    -> ADD PRIMARY KEY (last_name, first_name);
mysql> SELECT * FROM catalog_list ORDER BY last_name, first_name;
+-----------+-------------+--------------------------+
| last_name | first_name  | street                   |
+-----------+-------------+--------------------------+
| Baxter    | Wallace     | 57 3rd Ave.              |
| Brown     | Bartholomew | 432 River Run            |
| Isaacson  | Jim         | 515 Fordam St., Apt. 917 |
| McTavish  | Taylor      | 432 River Run            |
| Pinter    | Marlene     | 9 Sunset Trail           |
+-----------+-------------+--------------------------+

If the indexed columns can contain NULL, you must use a UNIQUE index rather than a PRIMARY KEY. In that case, the index will not remove duplicate NULL key values. In addition to removing existing duplicates, the method prevents the occurrence of duplicates in the future.
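
For example, if the name columns had been declared to permit NULL values, a one-line sketch of the same in-place operation with a UNIQUE index rather than a PRIMARY KEY would be:

ALTER IGNORE TABLE catalog_list ADD UNIQUE (last_name, first_name);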

Removing duplicates of a particular row

You can use LIMIT to restrict the effect of a DELETE statement to a subset of the rows that it otherwise would delete. This makes the statement applicable to removing duplicate rows. Suppose that you have a table t with the following contents:

+-------+
| color |
+-------+
| blue  |
| green |
| blue  |
| blue  |
| red   |
| green |
| red   |
+-------+

The table lists blue three times, and green and red twice each. To remove the extra instances of each color, do this:

mysql> DELETE FROM t WHERE color = 'blue' LIMIT 2;
mysql> DELETE FROM t WHERE color = 'green' LIMIT 1;
mysql> DELETE FROM t WHERE color = 'red' LIMIT 1;
mysql> SELECT * FROM t;
+-------+
| color |
+-------+
| blue  |
| green |
| red   |
+-------+

This technique works in the absence of a unique index, and it eliminates duplicate NULL values. It’s handy if you want to remove duplicates only for a specific set of rows within a table. However, if there are many different sets of duplicates that you want to remove, this is not a procedure you’d want to carry out by hand. The process can be automated by using the techniques discussed earlier in Counting and Identifying Duplicates for determining which values are duplicated. There, we wrote a make_dup_count_query() function to generate the statement needed to count the number of duplicate values in a given set of columns in a table. The result of that statement can be used to generate a set of DELETE ... LIMIT n statements that remove duplicate rows and leave only unique rows. The dups directory of the recipes distribution contains code that shows how to generate these statements.
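
To make the idea concrete, here is a sketch of the statements such generated code might emit for the original catalog_list data, based on the duplicate counts shown in Counting and Identifying Duplicates (with a case-insensitive collation, 'Baxter' also matches 'BAXTER'); each statement removes all but one instance of one duplicated name:

DELETE FROM catalog_list WHERE last_name = 'Baxter' AND first_name = 'Wallace' LIMIT 2;
DELETE FROM catalog_list WHERE last_name = 'Pinter' AND first_name = 'Marlene' LIMIT 1;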

In general, using DELETE ... LIMIT n is likely to be slower than removing duplicates by using a second table or by adding a unique index. Those methods keep the data on the server side and let the server do all the work. DELETE ... LIMIT n involves a lot of client-server interaction because it uses a SELECT statement to retrieve information about duplicates, followed by several DELETE statements to remove instances of duplicated rows. Also, this technique does not prevent duplicates from occurring in the future.

Eliminating Duplicates from a Self-Join Result

Problem

Self-joins often produce rows that are “near” duplicates—that is, rows that contain the same values but in different orders. Because of this, SELECT DISTINCT will not eliminate the duplicates.

Solution

Select column values in a specific order within rows to make rows with duplicate sets of values identical. Then you can use SELECT DISTINCT to remove duplicates. Alternatively, retrieve rows in such a way that near-duplicates are not even selected.

Discussion

Self-joins can produce rows that are duplicates in the sense that they contain the same values, yet are not identical. Consider the following statement (originally seen in Comparing a Table to Itself), which uses a self-join to find all pairs of states that joined the Union in the same year:

mysql> SELECT YEAR(s1.statehood) AS year,
    -> s1.name AS name1, s1.statehood AS statehood1,
    -> s2.name AS name2, s2.statehood AS statehood2
    -> FROM states AS s1 INNER JOIN states AS s2
    -> ON YEAR(s1.statehood) = YEAR(s2.statehood) AND s1.name != s2.name
    -> ORDER BY year, s1.name, s2.name;
+------+----------------+------------+----------------+------------+
| year | name1          | statehood1 | name2          | statehood2 |
+------+----------------+------------+----------------+------------+
| 1787 | Delaware       | 1787-12-07 | New Jersey     | 1787-12-18 |
| 1787 | Delaware       | 1787-12-07 | Pennsylvania   | 1787-12-12 |
| 1787 | New Jersey     | 1787-12-18 | Delaware       | 1787-12-07 |
| 1787 | New Jersey     | 1787-12-18 | Pennsylvania   | 1787-12-12 |
| 1787 | Pennsylvania   | 1787-12-12 | Delaware       | 1787-12-07 |
| 1787 | Pennsylvania   | 1787-12-12 | New Jersey     | 1787-12-18 |
...
| 1912 | Arizona        | 1912-02-14 | New Mexico     | 1912-01-06 |
| 1912 | New Mexico     | 1912-01-06 | Arizona        | 1912-02-14 |
| 1959 | Alaska         | 1959-01-03 | Hawaii         | 1959-08-21 |
| 1959 | Hawaii         | 1959-08-21 | Alaska         | 1959-01-03 |
+------+----------------+------------+----------------+------------+

The condition in the ON clause that requires state pair names not to be identical eliminates the trivially duplicate rows showing that each state joined the Union in the same year as itself. But each remaining pair of states still appears twice. For example, there is one row that lists Delaware and New Jersey, and another that lists New Jersey and Delaware. Each such pair of rows may be considered as effective duplicates because they contain the same values. However, because the values are not listed in the same order within the rows, they are not identical and you can’t get rid of the duplicates by adding DISTINCT to the statement.

One way to solve this problem is to make sure that state names are always listed in a specific order within a row. This can be done by selecting the names with a pair of expressions that place the lesser value first in the output column list:

IF(val1<val2,val1,val2) AS lesser_value,
IF(val1<val2,val2,val1) AS greater_value

Applying this technique to the state-pairs query yields the following result, in which the expressions display state names in lexical order within each row:

mysql> SELECT YEAR(s1.statehood) AS year,
    -> IF(s1.name<s2.name,s1.name,s2.name) AS name1,
    -> IF(s1.name<s2.name,s1.statehood,s2.statehood) AS statehood1,
    -> IF(s1.name<s2.name,s2.name,s1.name) AS name2,
    -> IF(s1.name<s2.name,s2.statehood,s1.statehood) AS statehood2
    -> FROM states AS s1 INNER JOIN states AS s2
    -> ON YEAR(s1.statehood) = YEAR(s2.statehood) AND s1.name != s2.name
    -> ORDER BY year, name1, name2;
+------+----------------+------------+----------------+------------+
| year | name1          | statehood1 | name2          | statehood2 |
+------+----------------+------------+----------------+------------+
| 1787 | Delaware       | 1787-12-07 | New Jersey     | 1787-12-18 |
| 1787 | Delaware       | 1787-12-07 | New Jersey     | 1787-12-18 |
| 1787 | Delaware       | 1787-12-07 | Pennsylvania   | 1787-12-12 |
| 1787 | Delaware       | 1787-12-07 | Pennsylvania   | 1787-12-12 |
| 1787 | New Jersey     | 1787-12-18 | Pennsylvania   | 1787-12-12 |
| 1787 | New Jersey     | 1787-12-18 | Pennsylvania   | 1787-12-12 |
...
| 1912 | Arizona        | 1912-02-14 | New Mexico     | 1912-01-06 |
| 1912 | Arizona        | 1912-02-14 | New Mexico     | 1912-01-06 |
| 1959 | Alaska         | 1959-01-03 | Hawaii         | 1959-08-21 |
| 1959 | Alaska         | 1959-01-03 | Hawaii         | 1959-08-21 |
+------+----------------+------------+----------------+------------+

Duplicate rows are still present in the output, but now duplicate pairs are identical and you can eliminate the extra copies by adding DISTINCT to the statement:

mysql> SELECT DISTINCT YEAR(s1.statehood) AS year,
    -> IF(s1.name<s2.name,s1.name,s2.name) AS name1,
    -> IF(s1.name<s2.name,s1.statehood,s2.statehood) AS statehood1,
    -> IF(s1.name<s2.name,s2.name,s1.name) AS name2,
    -> IF(s1.name<s2.name,s2.statehood,s1.statehood) AS statehood2
    -> FROM states AS s1 INNER JOIN states AS s2
    -> ON YEAR(s1.statehood) = YEAR(s2.statehood) AND s1.name != s2.name
    -> ORDER BY year, name1, name2;
+------+----------------+------------+----------------+------------+
| year | name1          | statehood1 | name2          | statehood2 |
+------+----------------+------------+----------------+------------+
| 1787 | Delaware       | 1787-12-07 | New Jersey     | 1787-12-18 |
| 1787 | Delaware       | 1787-12-07 | Pennsylvania   | 1787-12-12 |
| 1787 | New Jersey     | 1787-12-18 | Pennsylvania   | 1787-12-12 |
...
| 1912 | Arizona        | 1912-02-14 | New Mexico     | 1912-01-06 |
| 1959 | Alaska         | 1959-01-03 | Hawaii         | 1959-08-21 |
+------+----------------+------------+----------------+------------+

An alternative approach to removing nonidentical duplicates relies not so much on detecting and eliminating them as on selecting rows in such a way that only one row from each pair ever appears in the query result. This makes it unnecessary to reorder values within output rows or to use DISTINCT. For the state-pairs query, selecting only those rows in which the first state name is lexically less than the second automatically eliminates rows whose names appear in the other order: [18]

mysql> SELECT YEAR(s1.statehood) AS year,
    -> IF(s1.name<s2.name,s1.name,s2.name) AS name1,
    -> IF(s1.name<s2.name,s1.statehood,s2.statehood) AS statehood1,
    -> IF(s1.name<s2.name,s2.name,s1.name) AS name2,
    -> IF(s1.name<s2.name,s2.statehood,s1.statehood) AS statehood2
    -> FROM states AS s1 INNER JOIN states AS s2
    -> ON YEAR(s1.statehood) = YEAR(s2.statehood) AND s1.name < s2.name
    -> ORDER BY year, name1, name2;
+------+----------------+------------+----------------+------------+
| year | name1          | statehood1 | name2          | statehood2 |
+------+----------------+------------+----------------+------------+
| 1787 | Delaware       | 1787-12-07 | New Jersey     | 1787-12-18 |
| 1787 | Delaware       | 1787-12-07 | Pennsylvania   | 1787-12-12 |
| 1787 | New Jersey     | 1787-12-18 | Pennsylvania   | 1787-12-12 |
...
| 1912 | Arizona        | 1912-02-14 | New Mexico     | 1912-01-06 |
| 1959 | Alaska         | 1959-01-03 | Hawaii         | 1959-08-21 |
+------+----------------+------------+----------------+------------+


[18] The same constraint also eliminates those rows in which the state names are identical.