Working with Strings - MySQL Cookbook (2007)

MySQL Cookbook (2007)

Chapter 5. Working with Strings

Introduction

Like most types of data, string values can be compared for equality or inequality or relative ordering. However, strings have some additional features to consider:

§ A string can be binary or nonbinary. Binary strings are used for raw data such as images, music files, or encrypted values. Nonbinary strings are used for character data such as text and are associated with a character set and collation (sorting order).

§ A character set determines which characters are legal in a string. Collations can be chosen according to whether you need comparisons to be case-sensitive or case-insensitive, or to use the rules of a particular language.

§ Data types for binary strings are BINARY, VARBINARY, and BLOB. Data types for nonbinary strings are CHAR, VARCHAR, and TEXT, each of which allows CHARACTER SET and COLLATE attributes. See Choosing a String Data Type for information about choosing data types for string columns.

§ You can convert a binary string to a nonbinary string and vice versa, or convert a nonbinary string from one character set or collation to another.

§ You can use a string in its entirety or extract substrings from it. Strings can be combined with other strings.

§ You can apply pattern-matching operations to strings.

§ FULLTEXT searching is available for efficient queries on large collections of text.

This chapter discusses how to use all those features, so that you can store, retrieve, and manipulate strings according to whatever requirements your applications have.

Scripts to create the tables used in this chapter can be found in the tables directory of the recipes distribution.

String Properties

One property of a string is whether it is binary or nonbinary:

§ A binary string is a sequence of bytes. It can contain any type of information, such as images, MP3 files, or compressed or encrypted data. A binary string is not associated with a character set, even if you store a value such as abc that looks like ordinary text. Binary strings are compared byte by byte using numeric byte values.

§ A nonbinary string is a sequence of characters. It stores text that has a particular character set and collation. The character set defines which characters can be stored in the string. The collation defines the comparison and sorting properties of the characters.

A characteristic of nonbinary strings is that they have a character set. To see which character sets are available, use this statement:

mysql>SHOW CHARACTER SET;

+----------+-----------------------------+---------------------+--------+

| Charset | Description | Default collation | Maxlen |

+----------+-----------------------------+---------------------+--------+

| big5 | Big5 Traditional Chinese | big5_chinese_ci | 2 |

| dec8 | DEC West European | dec8_swedish_ci | 1 |

| cp850 | DOS West European | cp850_general_ci | 1 |

| hp8 | HP West European | hp8_english_ci | 1 |

| koi8r | KOI8-R Relcom Russian | koi8r_general_ci | 1 |

| latin1 | cp1252 West European | latin1_swedish_ci | 1 |

| latin2 | ISO 8859-2 Central European | latin2_general_ci | 1 |

...

| utf8 | UTF-8 Unicode | utf8_general_ci | 3 |

| ucs2 | UCS-2 Unicode | ucs2_general_ci | 2 |

...

The default character set in MySQL is latin1. If you need to store characters from several languages in a single column, consider using one of the Unicode character sets (utf8 or ucs2) because they can represent characters from multiple languages.

Some character sets contain only single-byte characters, whereas others allow multibyte characters. For some multibyte character sets, all characters have a fixed length. Others contain characters of varying lengths. For example, Unicode data can be stored using the ucs2 character set in which all characters take two bytes or the utf8 character set in which characters take from one to three bytes.

You can determine whether a given string contains multibyte characters using the LENGTH() and CHAR_LENGTH() functions, which return the length of a string in bytes and characters, respectively. If LENGTH() is greater than CHAR_LENGTH() for a given string, multibyte characters are present.

§ For the ucs2 Unicode character set, all characters are encoded using two bytes, even if they might be single-byte characters in another character set such as latin1. Thus, every ucs2 string contains multibyte characters:

§ mysql>SET @s = CONVERT('abc' USING ucs2);

§ mysql> SELECT LENGTH(@s), CHAR_LENGTH(@s);

§ +------------+-----------------+

§ | LENGTH(@s) | CHAR_LENGTH(@s) |

§ +------------+-----------------+

§ | 6 | 3 |

+------------+-----------------+

§ The utf8 Unicode character set has multibyte characters, but a given utf8 string might contain only single-byte characters, as in the following example:

§ mysql>SET @s = CONVERT('abc' USING utf8);

§ mysql> SELECT LENGTH(@s), CHAR_LENGTH(@s);

§ +------------+-----------------+

§ | LENGTH(@s) | CHAR_LENGTH(@s) |

§ +------------+-----------------+

§ | 3 | 3 |

+------------+-----------------+

Another characteristic of nonbinary strings is collation, which determines the sort order of characters in the character set. Use SHOW COLLATION to see which collations are available; add a LIKE clause to see the collations for a particular character set:

mysql>SHOW COLLATION LIKE 'latin1%';

+-------------------+---------+----+---------+----------+---------+

| Collation | Charset | Id | Default | Compiled | Sortlen |

+-------------------+---------+----+---------+----------+---------+

| latin1_german1_ci | latin1 | 5 | | Yes | 1 |

| latin1_swedish_ci | latin1 | 8 | Yes | Yes | 1 |

| latin1_danish_ci | latin1 | 15 | | Yes | 1 |

| latin1_german2_ci | latin1 | 31 | | Yes | 2 |

| latin1_bin | latin1 | 47 | | Yes | 1 |

| latin1_general_ci | latin1 | 48 | | Yes | 1 |

| latin1_general_cs | latin1 | 49 | | Yes | 1 |

| latin1_spanish_ci | latin1 | 94 | | Yes | 1 |

+-------------------+---------+----+---------+----------+---------+

In contexts where no collation is indicated, the collation with Yes in the Default column is the default collation used for strings in the given character set. As shown, the default collation for latin1 is latin1_swedish_ci. (Default collations are also displayed by SHOW CHARACTERSET.)

A collation can be case-sensitive (a and A are different), case-insensitive (a and A are the same), or binary (two characters are the same or different based on whether their numeric values are equal). A collation name ending in ci, cs, or bin is case-insensitive, case-sensitive, or binary, respectively.

A binary collation provides a sort order for nonbinary strings that is something like the order for binary strings, in the sense that comparisons for binary strings and binary collations both use numeric values. However, there is a difference: binary string comparisons are always based on single-byte units, whereas a binary collation compares nonbinary strings using character numeric values; depending on the character set, some of these might be multibyte values.

The following example illustrates how collation affects sort order. Suppose that a table contains a latin1 string column and has the following rows:

mysql>CREATE TABLE t (c CHAR(3) CHARACTER SET latin1);

mysql> INSERT INTO t (c) VALUES('AAA'),('bbb'),('aaa'),('BBB');

mysql> SELECT c FROM t;

+------+

| c |

+------+

| AAA |

| bbb |

| aaa |

| BBB |

+------+

By applying the COLLATE operator to the column, you can choose which collation to use for sorting and thus affect the order of the result:

§ A case-insensitive collation sorts a and A together, placing them before b and B. However, for a given letter, it does not necessarily order one lettercase before another, as shown by the following result:

§ mysql>SELECT c FROM t ORDER BY c COLLATE latin1_swedish_ci;

§ +------+

§ | c |

§ +------+

§ | AAA |

§ | aaa |

§ | bbb |

§ | BBB |

+------+

§ A case-sensitive collation puts A and a before B and b, and sorts uppercase before lowercase:

§ mysql>SELECT c FROM t ORDER BY c COLLATE latin1_general_cs;

§ +------+

§ | c |

§ +------+

§ | AAA |

§ | aaa |

§ | BBB |

§ | bbb |

+------+

§ A binary collation sorts characters using their numeric values. Assuming that uppercase letters have numeric values less than those of lowercase letters, a binary collation results in the following ordering:

§ mysql>SELECT c FROM t ORDER BY c COLLATE latin1_bin;

§ +------+

§ | c |

§ +------+

§ | AAA |

§ | BBB |

§ | aaa |

§ | bbb |

+------+

Note that, because characters in different lettercases have different numeric values, a binary collation produces a case-sensitive ordering. However, the order is different than that for the case-sensitive collation.

You can choose a language-specific collation if you require that comparison and sorting operations use the sorting rules of a particular language. For example, if you store strings using the utf8 character set, the default collation (utf8_general_ci) treats ch and ll as two-character strings. If you need the traditional Spanish ordering that treats ch and ll as single characters that follow c and l, respectively, use the utf8_spanish2_ci collation. The two collations produce different results, as shown here:

mysql>CREATE TABLE t (c CHAR(2) CHARACTER SET utf8);

mysql> INSERT INTO t (c) VALUES('cg'),('ch'),('ci'),('lk'),('ll'),('lm');

mysql> SELECT c FROM t ORDER BY c COLLATE utf8_general_ci;

+------+

| c |

+------+

| cg |

| ch |

| ci |

| lk |

| ll |

| lm |

+------+

mysql> SELECT c FROM t ORDER BY c COLLATE utf8_spanish2_ci;

+------+

| c |

+------+

| cg |

| ci |

| ch |

| lk |

| lm |

| ll |

+------+

Choosing a String Data Type

Problem

You need to store string data but aren’t sure which is the most appropriate data type.

Solution

Choose the data type according to the characteristics of the information to be stored and how you need to use it. Consider questions such as these:

§ Are the strings binary or nonbinary?

§ Does case sensitivity matter?

§ What is the maximum string length?

§ Do you want to store fixed- or variable-length values?

§ Do you need to retain trailing spaces?

§ Is there a fixed set of allowable values?

Discussion

MySQL provides several binary and nonbinary string data types. These types come in pairs as shown in the following table.

Binary data type

Nonbinary data type

Maximum length

BINARY

CHAR

255

VARBINARY

VARCHAR

65,535

TINYBLOB

TINYTEXT

255

BLOB

TEXT

65,535

MEDIUMBLOB

MEDIUMTEXT

16,777,215

LONGBLOB

LONGTEXT

4,294,967,295

For the binary data types, the maximum length is the number of bytes the string must be able to hold. For the nonbinary types, the maximum length is the number of characters the string must be able to hold (which for a string containing multibyte characters requires more than that many bytes).

For the BINARY and CHAR data types, MySQL stores column values using a fixed width. For example, values stored in a BINARY(10) or CHAR(10) column always take 10 bytes or 10 characters, respectively. Shorter values are padded to the required length as necessary when stored. ForBINARY, the pad value is 0x00 (the zero-valued byte, also known as ASCII NUL). CHAR values are padded with spaces. Trailing pad bytes or characters are stripped from BINARY and CHAR values when they are retrieved.

For VARBINARY, VARCHAR, and the BLOB and TEXT types, MySQL stores values using only as much storage as required, up to the maximum column length. No padding is added or stripped when values are stored or retrieved.

If you want to preserve trailing pad values that are present in the original strings that are stored, use a data type for which no stripping occurs. For example, if you’re storing character (nonbinary) strings that might end with spaces, and you want to preserve them, use VARCHAR or one of theTEXT data types. The following statements illustrate the difference in trailing-space handling for CHAR and VARCHAR columns:

mysql>CREATE TABLE t (c1 CHAR(10), c2 VARCHAR(10));

mysql> INSERT INTO t (c1,c2) VALUES('abc ','abc ');

mysql> SELECT c1, c2, CHAR_LENGTH(c1), CHAR_LENGTH(c2) FROM t;

+------+------------+-----------------+-----------------+

| c1 | c2 | CHAR_LENGTH(c1) | CHAR_LENGTH(c2) |

+------+------------+-----------------+-----------------+

| abc | abc | 3 | 10 |

+------+------------+-----------------+-----------------+

Thus, if you store a string that contains trailing spaces into a CHAR column, you will find that they’re gone when you retrieve the value. Similar padding and stripping occurs for BINARY columns, except that the pad value is 0x00.

NOTE

Prior to MySQL 5.0.3, VARCHAR and VARBINARY have a maximum length of 255. Also, stripping of trailing pad values for retrieved values applies to VARCHAR and VARBINARY columns, so you should use one of the TEXT or BLOB types if you want to retain trailing spaces or 0x00 bytes.

A table can include a mix of binary and nonbinary string columns, and its nonbinary columns can use different character sets and collations. When you declare a nonbinary string column, use the CHARACTER SET and COLLATE attributes if you require a particular character set and collation. For example, if you need to store utf8 (Unicode) and sjis (Japanese) strings, you might define a table with two columns like this:

CREATE TABLE mytbl

(

utf8data VARCHAR(100) CHARACTER SET utf8 COLLATE utf8_danish_ci,

sjisdata VARCHAR(100) CHARACTER SET sjis COLLATE sjis_japanese_ci

);

It is allowable to omit CHARACTER SET, COLLATE, or both from a column definition:

§ If you specify CHARACTER SET and omit COLLATE, the default collation for the character set is used.

§ If you specify COLLATE and omit CHARACTER SET, the character set implied by the collation name (the first part of the name) is used. For example, utf8_danish_ci and sjis_japanese_ci imply utf8 and sjis, respectively. (This means that the CHARACTER SET attributes could have been omitted from the preceding CREATE TABLE statement.)

§ If you omit both CHARACTER SET and COLLATE, the column is assigned the table default character set and collation. (A table definition can include those attributes following the closing parenthesis at the end of the CREATE TABLE statement. If present, they apply to columns that have no explicit character set or collation of their own. If omitted, the table defaults are taken from the database defaults. The database defaults can be specified when you create the database with the CREATE DATABASE statement. The server defaults apply to the database if they are omitted.)

The server default character set and collation are latin1 and latin1_swedish_ci unless you start the server with the --character-set-server and --collation-server options to specify different values. This means that, by default, strings use the latin1 character set and are not case-sensitive.

MySQL also supports ENUM and SET string types, which are used for data that has a fixed set of allowable values. You can use the CHARACTER SET and COLLATE attributes for these data types as well.

Setting the Client Connection Character Set Properly

Problem

You’re executing SQL statements or producing query results that don’t use the default character set.

Solution

Use SET NAMES or an equivalent method to set your connection to the proper character set.

Discussion

When you send information back and forth between your application and the server, you should tell MySQL what the appropriate character set is. For example, the default character set is latin1, but that may not always be the proper character set to use for connections to the server. If you’re working with Greek data, trying to display it using latin1 will result in gibberish on your screen. If you’re using Unicode strings in the utf8 character set, latin1 might not be sufficient to represent all the characters that you might need.

To deal with this problem, configure your connection with the server to use the appropriate character set. There are various ways to do this:

§ If your client program supports the --default-character-set option, you can use it to set the character set at program invocation time. mysql is one such program. Put the option in an option file so that it takes effect each time you connect to the server:

§ [mysql]

default-character-set=utf8

§ Issue a SET NAMES statement after you connect:

mysql>SET NAMES 'utf8';

SETNAMES also allows the connection collation to be specified:

mysql>SET NAMES 'utf8' COLLATE 'utf8_general_ci';

§ Some programming interfaces provide their own method of setting the character set. MySQL Connector/J for Java clients is one such interface. It detects the character set used on the server side automatically when you connect, but you can specify a different set explicitly by using thecharacterEncoding property in the connection URL. The property value should be the Java-style character-set name. For example, to select utf8, you might use a connection URL like this:

jdbc:mysql://localhost/cookbook?characterEncoding=UTF-8

This is preferable to SET NAMES because MySQL Connector/J performs character set conversion on behalf of the application but is unaware of which character set applies if you use SET NAMES.

By the way, you should make sure that the character set used by your display device matches what you’re using for MySQL. Otherwise, even with MySQL handling the data properly, it might display as garbage. Suppose that you’re using the mysql program in a terminal window and that you configure MySQL to use utf8 and store utf8-encoded Japanese data. If you set your terminal window to use euc-jp encoding, that is also Japanese, but its encoding for Japanese characters is different from utf8, so the data will not display as you expect.

NOTE

ucs2cannot be used as the connection character set.

Writing String Literals

Problem

You need to write literal strings in SQL statements.

Solution

Learn the syntax rules that govern string values.

Discussion

Strings can be written several ways:

§ Enclose the text of the string within single or double quotes:

§ 'my string'

"my string"

Be aware that you cannot use double quotes for quoting strings when the ANSI_QUOTES SQL mode is enabled. With that mode enabled, the server interprets double quote as the quoting character for identifiers such as table or column names, and not for strings. (See Handling Special Characters in Identifiers.) For this reason, it’s preferable to adopt the convention of always writing quoted strings using single quotes. That way, MySQL will interpret them as strings and not as identifiers regardless of the ANSI_QUOTES setting.

§ Use hexadecimal notation. Each pair of hex digits produces one byte of the string. abcd can be written using any of these formats:

§ 0x61626364

§ X'61626364'

x'61626364'

MySQL treats strings written using hex notation as binary strings. Not coincidentally, it’s common for applications to use hex strings when constructing SQL statements that refer to binary values:

INSERT INTO t SET binary_col = 0xdeadbeef;

§ To specify a character set for interpretation of a literal string, use an introducer consisting of a character set name preceded by an underscore:

§ _latin1 'abcd'

_ucs2 'abcd'

An introducer tells the server how to interpret the following string. For _latin1 'abcd', the server produces a string consisting of four single-byte characters. For _ucs2 'abcd', the server produces a string consisting of two two-byte characters because ucs2 is a double-byte character set.

To ensure that a string is a binary string or that a nonbinary string has a specific character set or collation, use the instructions for string conversion given in Changing a String’s Character Set or Collation.

If you need to write a quoted string that includes a quote character and you put the quote into the string as is, a syntax error results:

mysql>SELECT 'I'm asleep';

ERROR 1064 (42000): You have an error in your SQL syntax near 'asleep''

There are several ways to deal with this:

§ Enclose a string containing single quotes within double quotes (assuming that ANSI_QUOTES is disabled):

§ mysql>SELECT "I'm asleep";

§ +------------+

§ | I'm asleep |

§ +------------+

§ | I'm asleep |

+------------+

The converse also works; a string containing double quotes can be enclosed within single quotes:

mysql>SELECT 'He said, "Boo!"';

+-----------------+

| He said, "Boo!" |

+-----------------+

| He said, "Boo!" |

+-----------------+

§ To include a quote character within a string that is quoted by the same kind of quote, either double the quote or precede it by a backslash. When MySQL reads the statement, it will strip off the extra quote or the backslash:

§ mysql>SELECT 'I''m asleep', 'I\'m wide awake';

§ +------------+----------------+

§ | I'm asleep | I'm wide awake |

§ +------------+----------------+

§ | I'm asleep | I'm wide awake |

§ +------------+----------------+

§ mysql> SELECT "He said, ""Boo!""", "And I said, \"Yikes!\"";

§ +-----------------+----------------------+

§ | He said, "Boo!" | And I said, "Yikes!" |

§ +-----------------+----------------------+

§ | He said, "Boo!" | And I said, "Yikes!" |

+-----------------+----------------------+

A backslash turns off any special meaning of the following character. (It causes a temporary escape from normal string processing rules, so sequences such as \' and \" are called escape sequences.) This means that backslash itself is special. To write a literal backslash within a string, you must double it:

mysql>SELECT 'Install MySQL in C:\\mysql on Windows';

+--------------------------------------+

| Install MySQL in C:\mysql on Windows |

+--------------------------------------+

| Install MySQL in C:\mysql on Windows |

+--------------------------------------+

Other escape sequences recognized by MySQL are \b (backspace), \n (newline, also called linefeed), \r (carriage return), \t (tab), and \0 (ASCII NUL).

§ Write the string as a hex value:

§ mysql>SELECT 0x49276D2061736C656570;

§ +------------------------+

§ | 0x49276D2061736C656570 |

§ +------------------------+

§ | I'm asleep |

+------------------------+

See Also

If you are issuing SQL statements from within a program, you can refer to strings or binary values symbolically and let your programming interface take care of quoting: use the placeholder mechanism provided by the language’s database-access API. See Handling Special Characters and NULL Values in Statements for details. Alternatively, load binary values such as images from files using the LOAD_FILE() function; Storing Images or Other Binary Data shows an example.

Checking a String’s Character Set or Collation

Problem

You want to know the character set or collation of a string.

Solution

Use the CHARSET() or COLLATION() function.

Discussion

If you create a table using the following definition, you know that values stored in the column will have a character set of utf8 and a collation of utf8_danish_ci:

CREATE TABLE t (c CHAR(10) CHARACTER SET utf8 COLLATE utf8_danish_ci);

But sometimes it’s not so clear what character set or collation applies to a string. Server configuration affects literal strings and some string functions, and other string functions return values in a specific character set. Symptoms that you have the wrong character set or collation are that a collation-mismatch error occurs for a comparison operation, or a lettercase conversion doesn’t work properly. This section shows how to check what character set or collation a string has. Changing a String’s Character Set or Collation shows how to convert strings to a different character set or collation.

To find out what character set or collation a string has, use the CHARSET() or COLLATION() function. For example, did you know that the USER() function returns a Unicode string?

mysql>SELECT USER(), CHARSET(USER()), COLLATION(USER());

+------------------+-----------------+-------------------+

| USER() | CHARSET(USER()) | COLLATION(USER()) |

+------------------+-----------------+-------------------+

| cbuser@localhost | utf8 | utf8_general_ci |

+------------------+-----------------+-------------------+

String values that take their character set and collation from the current configuration may change properties if the configuration changes. This is true for literal strings:

mysql>SET NAMES 'latin1';

mysql> SELECT CHARSET('abc'), COLLATION('abc');

+----------------+-------------------+

| CHARSET('abc') | COLLATION('abc') |

+----------------+-------------------+

| latin1 | latin1_swedish_ci |

+----------------+-------------------+

mysql> SET NAMES latin7 COLLATE 'latin7_bin';

mysql> SELECT CHARSET('abc'), COLLATION('abc');

+----------------+------------------+

| CHARSET('abc') | COLLATION('abc') |

+----------------+------------------+

| latin7 | latin7_bin |

+----------------+------------------+

For a binary string, the CHARSET() or COLLATION() functions return a value of binary, which means that the string is compared and sorted based on numeric byte values, not character collation values. Several functions return binary strings, such as MD5() and PASSWORD():

mysql>SELECT CHARSET(MD5('a')), COLLATION(MD5('a'));

+-------------------+---------------------+

| CHARSET(MD5('a')) | COLLATION(MD5('a')) |

+-------------------+---------------------+

| binary | binary |

+-------------------+---------------------+

mysql> SELECT CHARSET(PASSWORD('a')), COLLATION(PASSWORD('a'));

+------------------------+--------------------------+

| CHARSET(PASSWORD('a')) | COLLATION(PASSWORD('a')) |

+------------------------+--------------------------+

| binary | binary |

+------------------------+--------------------------+

It can be useful to know that a function or string expression produces a binary string if you’re trying to perform lettercase conversion on the result and it’s not working. See Converting the Lettercase of a Stubborn String for details.

Changing a String’s Character Set or Collation

Problem

You want to convert a string from one character set or collation to another.

Solution

Use the CONVERT() function to convert a string to another character set. Use the COLLATE operator to convert a string to another collation.

Discussion

To convert a string from one character set to another, use the CONVERT() function:

mysql>SET @s1 = 'my string';

mysql> SET @s2 = CONVERT(@s1 USING utf8);

mysql> SELECT CHARSET(@s1), CHARSET(@s2);

+--------------+--------------+

| CHARSET(@s1) | CHARSET(@s2) |

+--------------+--------------+

| latin1 | utf8 |

+--------------+--------------+

To change the collation of a string, use the COLLATE operator:

mysql>SET @s1 = 'my string';

mysql> SET @s2 = @s1 COLLATE latin1_spanish_ci;

mysql> SELECT COLLATION(@s1), COLLATION(@s2);

+-------------------+-------------------+

| COLLATION(@s1) | COLLATION(@s2) |

+-------------------+-------------------+

| latin1_swedish_ci | latin1_spanish_ci |

+-------------------+-------------------+

The new collation must be legal for the character set of the string. For example, you can use the utf8_general_ci collation with utf8 strings, but not with latin1 strings:

mysql>SELECT _latin1 'abc' COLLATE utf8_bin;

ERROR 1253 (42000): COLLATION 'utf8_bin' is not valid for

CHARACTER SET 'latin1'

To convert both the character set and collation of a string, use CONVERT() to change the character set, and apply the COLLATE operator to the result:

mysql>SET @s1 = 'my string';

mysql> SET @s2 = CONVERT(@s1 USING utf8) COLLATE utf8_spanish_ci;

mysql> SELECT CHARSET(@s1), COLLATION(@s1), CHARSET(@s2), COLLATION(@s2);

+--------------+-------------------+--------------+-----------------+

| CHARSET(@s1) | COLLATION(@s1) | CHARSET(@s2) | COLLATION(@s2) |

+--------------+-------------------+--------------+-----------------+

| latin1 | latin1_swedish_ci | utf8 | utf8_spanish_ci |

+--------------+-------------------+--------------+-----------------+

The CONVERT() function can also be used to convert binary strings to nonbinary strings and vice versa. To produce a binary string, use binary; any other character set name produces a nonbinary string:

mysql>SET @s1 = 'my string';

mysql> SET @s2 = CONVERT(@s1 USING binary);

mysql> SET @s3 = CONVERT(@s2 USING utf8);

mysql> SELECT CHARSET(@s1), CHARSET(@s2), CHARSET(@s3);

+--------------+--------------+--------------+

| CHARSET(@s1) | CHARSET(@s2) | CHARSET(@s3) |

+--------------+--------------+--------------+

| latin1 | binary | utf8 |

+--------------+--------------+--------------+

Alternatively, you can produce binary strings using the BINARY operator, which is equivalent to CONVERT( str USING binary):

mysql>SET @s1 = 'my string';

mysql> SET @s2 = BINARY @s2;

mysql> SELECT CHARSET(@s1), CHARSET(@s2);

+--------------+--------------+

| CHARSET(@s1) | CHARSET(@s2) |

+--------------+--------------+

| latin1 | binary |

+--------------+--------------+

Converting the Lettercase of a String

Problem

You want to convert a string to uppercase or lowercase.

Solution

Use the UPPER() or LOWER() function. If they don’t work, see Converting the Lettercase of a Stubborn String.

Discussion

The UPPER() and LOWER() functions convert the lettercase of a string:

mysql>SELECT thing, UPPER(thing), LOWER(thing) FROM limbs;

+--------------+--------------+--------------+

| thing | UPPER(thing) | LOWER(thing) |

+--------------+--------------+--------------+

| human | HUMAN | human |

| insect | INSECT | insect |

| squid | SQUID | squid |

| octopus | OCTOPUS | octopus |

| fish | FISH | fish |

| centipede | CENTIPEDE | centipede |

| table | TABLE | table |

| armchair | ARMCHAIR | armchair |

| phonograph | PHONOGRAPH | phonograph |

| tripod | TRIPOD | tripod |

| Peg Leg Pete | PEG LEG PETE | peg leg pete |

| space alien | SPACE ALIEN | space alien |

+--------------+--------------+--------------+

To convert the lettercase of only part of a string, break it into pieces, convert the relevant piece, and put the pieces back together. Suppose that you want to convert only the initial character of a string to uppercase. The following expression accomplishes that:

CONCAT(UPPER(LEFT(str,1)),MID(str,2))

But it’s ugly to write an expression like that each time you need it. For convenience, define a stored function:

mysql>CREATE FUNCTION initial_cap (s VARCHAR(255))

-> RETURNS VARCHAR(255) DETERMINISTIC

-> RETURN CONCAT(UPPER(LEFT(s,1)),MID(s,2));

You can then capitalize initial characters more easily like this:

mysql>SELECT thing, initial_cap(thing) FROM limbs;

+--------------+--------------------+

| thing | initial_cap(thing) |

+--------------+--------------------+

| human | Human |

| insect | Insect |

| squid | Squid |

| octopus | Octopus |

| fish | Fish |

| centipede | Centipede |

| table | Table |

| armchair | Armchair |

| phonograph | Phonograph |

| tripod | Tripod |

| Peg Leg Pete | Peg Leg Pete |

| space alien | Space alien |

+--------------+--------------------+

For more information about writing stored functions, see Chapter 16.

Converting the Lettercase of a “Stubborn” String

Problem

You want to convert a string to uppercase or lowercase, but UPPER() and LOWER() don’t work.

Solution

You’re probably trying to convert a binary string. Convert it to a nonbinary string so that it has a character set and collation and becomes subject to case mapping.

Discussion

The usual way to convert a string to uppercase or lowercase is to use the UPPER() or LOWER() function:

mysql>SET @s = 'aBcD';

mysql> SELECT UPPER(@s), LOWER(@s);

+-----------+-----------+

| UPPER(@s) | LOWER(@s) |

+-----------+-----------+

| ABCD | abcd |

+-----------+-----------+

But sometimes you’ll run across a string that is “stubborn” and resists lettercase conversion. This is common for columns that have a BINARY or BLOB data type:

mysql>CREATE TABLE t (b BLOB) SELECT 'aBcD' AS b;

mysql> SELECT b, UPPER(b), LOWER(b) FROM t;

+------+----------+----------+

| b | UPPER(b) | LOWER(b) |

+------+----------+----------+

| aBcD | aBcD | aBcD |

+------+----------+----------+

The cause of the problem here is that the column is a binary string: it has no character set or collation and lettercase does not apply. Thus, UPPER() and LOWER() do nothing, which can be confusing. Compounding the confusion is that lettercase conversion of binary strings used to work in older versions of MySQL, but does so no longer. What’s going on? Here is the history:

§ Before MySQL 4.1, all strings, including binary strings, were interpreted with respect to the server’s default character set. Consequently, the UPPER() and LOWER() functions performed case mapping even for binary strings:

§ mysql>SET @s = BINARY 'aBcD';

§ mysql> SELECT @s, LOWER(@s), UPPER(@s);

§ +------+-----------+-----------+

§ | @s | LOWER(@s) | UPPER(@s) |

§ +------+-----------+-----------+

§ | aBcD | abcd | ABCD |

+------+-----------+-----------+

§ In MySQL 4.1, character set handling was revised significantly, with one of the changes being that character set and collation applied only to nonbinary strings. From 4.1 up, a binary string is just a sequence of bytes, and lettercase has no meaning, even if you store what looks like text in the string. As a result, the LOWER() and UPPER() functions do nothing when applied to binary strings:

§ mysql>SET @s = BINARY 'aBcD';

§ mysql> SELECT @s, LOWER(@s), UPPER(@s);

§ +------+-----------+-----------+

§ | @s | LOWER(@s) | UPPER(@s) |

§ +------+-----------+-----------+

§ | aBcD | aBcD | aBcD |

+------+-----------+-----------+

To map a binary string to a given lettercase, convert it to a nonbinary string, choosing a character set that contains an alphabet with uppercase and lowercase characters. The case-conversion functions then will work as you expect because the collation provides case mapping. The following example uses the BLOB column from earlier in this section, but the same principles apply to binary string literals and string expressions:

mysql>SELECT b,

-> UPPER(CONVERT(b USING latin1)) AS upper,

-> LOWER(CONVERT(b USING latin1)) AS lower

-> FROM t;

+------+-------+-------+

| b | upper | lower |

+------+-------+-------+

| aBcD | ABCD | abcd |

+------+-------+-------+

The same kind of case-conversion problem occurs with functions that return binary strings, which is typical for functions such as MD5() or COMPRESS() that perform encryption or compression.

If you’re not sure whether a string expression is binary or nonbinary, use the CHARSET() function. The following example shows that VERSION() returns a nonbinary string, but MD5() returns a binary string:

mysql>SELECT CHARSET(VERSION()), CHARSET(MD5('some string'));

+--------------------+-----------------------------+

| CHARSET(VERSION()) | CHARSET(MD5('some string')) |

+--------------------+-----------------------------+

| utf8 | binary |

+--------------------+-----------------------------+

That result indicates that the string produced by VERSION() can be case-mapped directly, but the string produced by MD5() must first be converted to a nonbinary string:

mysql>SELECT UPPER(VERSION());

+------------------+

| UPPER(VERSION()) |

+------------------+

| 5.1.12-BETA-LOG |

+------------------+

mysql> SELECT UPPER(CONVERT(MD5('some string') USING latin1));

+-------------------------------------------------+

| UPPER(CONVERT(MD5('some string') USING latin1)) |

+-------------------------------------------------+

| 5AC749FBEEC93607FC28D666BE85E73A |

+-------------------------------------------------+

Controlling Case Sensitivity in String Comparisons

Problem

You want to know whether strings are equal or unequal, or which one appears first in lexical order.

Solution

Use a comparison operator. But remember that strings have properties such as case sensitivity that you must take into account. For example, a string comparison might be case-sensitive when you don’t want it to be, or vice versa.

Discussion

As for other data types, you can compare string values for equality, inequality, or relative ordering:

mysql>SELECT 'cat' = 'cat', 'cat' = 'dog';

+---------------+---------------+

| 'cat' = 'cat' | 'cat' = 'dog' |

+---------------+---------------+

| 1 | 0 |

+---------------+---------------+

mysql> SELECT 'cat' != 'cat', 'cat' != 'dog';

+----------------+----------------+

| 'cat' != 'cat' | 'cat' != 'dog' |

+----------------+----------------+

| 0 | 1 |

+----------------+----------------+

mysql> SELECT 'cat' < 'awk', 'cat' < 'dog';

+---------------+---------------+

| 'cat' < 'awk' | 'cat' < 'dog' |

+---------------+---------------+

| 0 | 1 |

+---------------+---------------+

mysql> SELECT 'cat' BETWEEN 'awk' AND 'egret';

+---------------------------------+

| 'cat' BETWEEN 'awk' AND 'egret' |

+---------------------------------+

| 1 |

+---------------------------------+

However, comparison and sorting properties of strings are subject to certain complications that don’t apply to other types of data. For example, sometimes you need to make sure a string operation is case-sensitive that would not otherwise be, or vice versa. This section describes how to do that for ordinary comparisons. Controlling Case Sensitivity in Pattern Matching covers case sensitivity in pattern-matching operations.

String comparison properties depend on whether the operands are binary or nonbinary strings:

§ A binary string is a sequence of bytes and is compared using numeric byte values. Lettercase has no meaning. However, because letters in different cases have different byte values, comparisons of binary strings effectively are case-sensitive. (That is, a and A are unequal.) If you want to compare binary strings so that lettercase does not matter, convert them to nonbinary strings that have a case-insensitive collation.

§ A nonbinary string is a sequence of characters and is compared in character units. (Depending on the character set, some characters might have multiple bytes.) The string has a character set that defines the legal characters and a collation that defines their sort order. The collation also determines whether to consider characters in different lettercases the same in comparisons. If the collation is case-sensitive, and you want a case-insensitive collation (or vice versa), convert the strings to use a collation with the desired case-comparison properties.

By default, strings have a character set of latin1 and a collation of latin1_swedish_ci. This results in case-insensitive string comparisons.

The following example shows how two binary strings that compare as unequal can be handled so that they are equal when compared as case-insensitive nonbinary strings:

mysql>SET @s1 = BINARY 'cat', @s2 = BINARY 'CAT';

mysql> SELECT @s1 = @s2;

+-----------+

| @s1 = @s2 |

+-----------+

| 0 |

+-----------+

mysql> SET @s1 = CONVERT(@s1 USING latin1) COLLATE latin1_swedish_ci;

mysql> SET @s2 = CONVERT(@s2 USING latin1) COLLATE latin1_swedish_ci;

mysql> SELECT @s1 = @s2;

+-----------+

| @s1 = @s2 |

+-----------+

| 1 |

+-----------+

In this case, because latin1_swedish_ci is the default collation for latin1, you can omit the COLLATE operator:

mysql>SET @s1 = CONVERT(@s1 USING latin1);

mysql> SET @s2 = CONVERT(@s2 USING latin1);

mysql> SELECT @s1 = @s2;

+-----------+

| @s1 = @s2 |

+-----------+

| 1 |

+-----------+

The next example shows how to compare two strings that are not case-sensitive (as demonstrated by the first SELECT) in case-sensitive fashion (as demonstrated by the second):

mysql>SET @s1 = _latin1 'cat', @s2 = _latin1 'CAT';

mysql> SELECT @s1 = @s2;

+-----------+

| @s1 = @s2 |

+-----------+

| 1 |

+-----------+

mysql> SELECT @s1 COLLATE latin1_general_cs = @s2 COLLATE latin1_general_cs

-> AS '@s1 = @s2';

+-----------+

| @s1 = @s2 |

+-----------+

| 0 |

+-----------+

If you compare a binary string with a nonbinary string, the comparison treats both operands as binary strings:

mysql>SELECT _latin1 'cat' = BINARY 'CAT';

+------------------------------+

| _latin1 'cat' = BINARY 'CAT' |

+------------------------------+

| 0 |

+------------------------------+

Thus, if you want to compare two nonbinary strings as binary strings, apply the BINARY operator to either one when comparing them:

mysql>SET @s1 = _latin1 'cat', @s2 = _latin1 'CAT';

mysql> SELECT @s1 = @s2, BINARY @s1 = @s2, @s1 = BINARY @s2;

+-----------+------------------+------------------+

| @s1 = @s2 | BINARY @s1 = @s2 | @s1 = BINARY @s2 |

+-----------+------------------+------------------+

| 1 | 0 | 0 |

+-----------+------------------+------------------+

If you find that you’ve declared a column using a type that is not suitable for the kind of comparisons for which you typically use it, use ALTER TABLE to change the type. Suppose that you have a table in which you store news articles:

CREATE TABLE news

(

id INT UNSIGNED NOT NULL AUTO_INCREMENT,

article BLOB,

PRIMARY KEY (id)

);

Here the article column is declared as a BLOB, which is a binary string type. This means that if you store text in the column, comparisons are made without regard to character set. (In effect, they are case-sensitive.) If that’s not what you want, you can convert the column to a nonbinary type that has a case-insensitive collation using ALTER TABLE:

ALTER TABLE news

MODIFY article TEXT CHARACTER SET utf8 COLLATE utf8_general_ci;

Pattern Matching with SQL Patterns

Problem

You want to perform a pattern match rather than a literal comparison.

Solution

Use the LIKE operator and an SQL pattern, described in this section. Or use a regular-expression pattern match, described in Pattern Matching with Regular Expressions.

Discussion

Patterns are strings that contain special characters. These are known as metacharacters because they stand for something other than themselves. MySQL provides two kinds of pattern matching. One is based on SQL patterns and the other on regular expressions. SQL patterns are more standard among different database systems, but regular expressions are more powerful. The two kinds of pattern match uses different operators and different sets of metacharacters. This section describes SQL patterns. Pattern Matching with Regular Expressions describes regular expressions.

The example here uses a table named metal that contains the following rows:

+----------+

| name |

+----------+

| copper |

| gold |

| iron |

| lead |

| mercury |

| platinum |

| silver |

| tin |

+----------+

SQL pattern matching uses the LIKE and NOT LIKE operators rather than = and != to perform matching against a pattern string. Patterns may contain two special metacharacters: _ matches any single character, and % matches any sequence of characters, including the empty string. You can use these characters to create patterns that match a variety of values:

§ Strings that begin with a particular substring:

§ mysql>SELECT name FROM metal WHERE name LIKE 'co%';

§ +--------+

§ | name |

§ +--------+

§ | copper |

+--------+

§ Strings that end with a particular substring:

§ mysql>SELECT name FROM metal WHERE name LIKE '%er';

§ +--------+

§ | name |

§ +--------+

§ | copper |

§ | silver |

+--------+

§ Strings that contain a particular substring at any position:

§ mysql>SELECT name FROM metal WHERE name LIKE '%er%';

§ +---------+

§ | name |

§ +---------+

§ | copper |

§ | mercury |

§ | silver |

+---------+

§ Strings that contain a substring at a specific position (the pattern matches only if pp occurs at the third position of the name column):

§ mysql>SELECT name FROM metal where name LIKE '__pp%';

§ +--------+

§ | name |

§ +--------+

§ | copper |

+--------+

An SQL pattern matches successfully only if it matches the entire comparison value. Of the following two pattern matches, only the second succeeds:

'abc' LIKE 'b'

'abc' LIKE '%b%'

To reverse the sense of a pattern match, use NOT LIKE. The following statement finds strings that contain no i characters:

mysql>SELECT name FROM metal WHERE name NOT LIKE '%i%';

+---------+

| name |

+---------+

| copper |

| gold |

| lead |

| mercury |

+---------+

SQL patterns do not match NULL values. This is true both for LIKE and for NOT LIKE:

mysql>SELECT NULL LIKE '%', NULL NOT LIKE '%';

+---------------+-------------------+

| NULL LIKE '%' | NULL NOT LIKE '%' |

+---------------+-------------------+

| NULL | NULL |

+---------------+-------------------+

USING PATTERNS WITH NONSTRING VALUES

Unlike some other database systems, MySQL allows pattern matches to be applied to nonstring values such as numbers or dates, which can sometimes be useful. The following table shows some ways to test a DATE value d using function calls that extract date parts and using the equivalent pattern matches. The pairs of expressions are true for dates occurring in the year 1976, in the month of April, or on the first day of the month:

Function value test

Pattern match test

YEAR(d) = 1976

d LIKE '1976-%'

MONTH(d) = 4

d LIKE '%-04-%'

DAYOFMONTH(d) = 1

d LIKE '%-01'

In some cases, pattern matches are equivalent to substring comparisons. For example, using patterns to find strings at one end or the other of a string is like using LEFT() or RIGHT():

Pattern match

Substring comparison

str LIKE 'abc%'

LEFT( str ,3) = 'abc'

str LIKE '%abc'

RIGHT( str ,3) = 'abc'

If you’re matching against a column that is indexed and you have a choice of using a pattern or an equivalent LEFT() expression, you’ll likely find that the pattern match is faster. MySQL can use the index to narrow the search for a pattern that begins with a literal string. With LEFT(), it cannot.

Pattern Matching with Regular Expressions

Problem

You want to data type perform a pattern match rather than a literal comparison.

Solution

Use the REGEXP operator and a regular expression pattern, described in this section. Or use an SQL pattern, described in Pattern Matching with SQL Patterns.

Discussion

SQL patterns (see Pattern Matching with SQL Patterns) are likely to be implemented by other database systems, so they’re reasonably portable beyond MySQL. On the other hand, they’re somewhat limited. For example, you can easily write an SQL pattern %abc% to find strings that containabc, but you cannot write a single SQL pattern to identify strings that contain any of the characters a, b, or c. Nor can you match string content based on character types such as letters or digits. For such operations, MySQL supports another type of pattern matching operation based on regular expressions and the REGEXP operator (or NOT REGEXP to reverse the sense of the match). REGEXP matching uses the pattern elements shown in the following table.

Pattern

What the pattern matches

^

Beginning of string

$

End of string

.

Any single character

[...]

Any character listed between the square brackets

[^...]

Any character not listed between the square brackets

p1 | p2 | p3

Alternation; matches any of the patterns p1, p2, or p3

*

Zero or more instances of preceding element

+

One or more instances of preceding element

{ n }

n instances of preceding element

{ m , n }

m through n instances of preceding element

You may already be familiar with these regular expression pattern characters, because many of them are the same as those used by vi, grep, sed, and other Unix utilities that support regular expressions. Most of them are used also in the regular expressions understood by programming languages. (Chapter 10 discusses the use of pattern matching in programs for data validation and transformation.)

Pattern Matching with SQL Patterns showed how to use SQL patterns to match substrings at the beginning or end of a string, or at an arbitrary or specific position within a string. You can do the same things with regular expressions:

§ Strings that begin with a particular substring:

§ mysql>SELECT name FROM metal WHERE name REGEXP '^co';

§ +--------+

§ | name |

§ +--------+

§ | copper |

+--------+

§ Strings that end with a particular substring:

§ mysql>SELECT name FROM metal WHERE name REGEXP 'er$';

§ +--------+

§ | name |

§ +--------+

§ | copper |

§ | silver |

+--------+

§ Strings that contain a particular substring at any position:

§ mysql>SELECT name FROM metal WHERE name REGEXP 'er';

§ +---------+

§ | name |

§ +---------+

§ | copper |

§ | mercury |

§ | silver |

+---------+

§ Strings that contain a particular substring at a specific position:

§ mysql>SELECT name FROM metal WHERE name REGEXP '^..pp';

§ +--------+

§ | name |

§ +--------+

§ | copper |

+--------+

In addition, regular expressions have other capabilities and can perform kinds of matches that SQL patterns cannot. For example, regular expressions can contain character classes, which match any character in the class:

§ To write a character class, use square brackets and list the characters you want the class to match inside the brackets. Thus, the pattern [abc] matches either a, b, or c.

§ Classes may indicate ranges of characters by using a dash between the beginning and end of the range. [a-z] matches any letter, [0-9] matches digits, and [a-z0-9] matches letters or digits.

§ To negate a character class (“match any character but these”), begin the list with a ^ character. For example, [^0-9] matches anything but digits.

MySQL’s regular-expression capabilities also support POSIX character classes. These match specific character sets, as described in the following table.

POSIX class

What the class matches

[:alnum:]

Alphabetic and numeric characters

[:alpha:]

Alphabetic characters

[:blank:]

Whitespace (space or tab characters)

[:cntrl:]

Control characters

[:digit:]

Digits

[:graph:]

Graphic (nonblank) characters

[:lower:]

Lowercase alphabetic characters

[:print:]

Graphic or space characters

[:punct:]

Punctuation characters

[:space:]

Space, tab, newline, carriage return

[:upper:]

Uppercase alphabetic characters

[:xdigit:]

Hexadecimal digits (0-9, a-f, A-F)

POSIX classes are intended for use within character classes, so you use them within square brackets. The following expression matches values that contain any hexadecimal digit character:

mysql>SELECT name, name REGEXP '[[:xdigit:]]' FROM metal;

+----------+----------------------------+

| name | name REGEXP '[[:xdigit:]]' |

+----------+----------------------------+

| copper | 1 |

| gold | 1 |

| iron | 0 |

| lead | 1 |

| mercury | 1 |

| platinum | 1 |

| silver | 1 |

| tin | 0 |

+----------+----------------------------+

Regular expressions can contain alternations. The syntax looks like this:

alternative1|alternative2|...

An alternation is similar to a character class in the sense that it matches if any of the alternatives match. But unlike a character class, the alternatives are not limited to single characters. They can be multiple-character strings or even patterns. The following alternation matches strings that begin with a vowel or end with er:

mysql>SELECT name FROM metal WHERE name REGEXP '^[aeiou]|er$';

+--------+

| name |

+--------+

| copper |

| iron |

| silver |

+--------+

Parentheses can be used to group alternations. For example, if you want to match strings that consist entirely of digits or entirely of letters, you might try this pattern, using an alternation:

mysql>SELECT '0m' REGEXP '^[[:digit:]]+|[[:alpha:]]+$';

+-------------------------------------------+

| '0m' REGEXP '^[[:digit:]]+|[[:alpha:]]+$' |

+-------------------------------------------+

| 1 |

+-------------------------------------------+

However, as the query result shows, the pattern doesn’t work. That’s because the ^ groups with the first alternative, and the $ groups with the second alternative. So the pattern actually matches strings that begin with one or more digits, or strings that end with one or more letters. If you group the alternatives within parentheses, the ^ and $ apply to both of them, and the pattern acts as you expect:

mysql>SELECT '0m' REGEXP '^([[:digit:]]+|[[:alpha:]]+)$';

+---------------------------------------------+

| '0m' REGEXP '^([[:digit:]]+|[[:alpha:]]+)$' |

+---------------------------------------------+

| 0 |

+---------------------------------------------+

Unlike SQL pattern matches, which are successful only if the pattern matches the entire comparison value, regular expressions are successful if the pattern matches anywhere within the value. The following two pattern matches are equivalent in the sense that each one succeeds only for strings that contain a b character, but the first is more efficient because the pattern is simpler:

'abc' REGEXP 'b'

'abc' REGEXP '^.*b.*$'

Regular expressions do not match NULL values. This is true both for REGEXP and for NOT REGEXP:

mysql>SELECT NULL REGEXP '.*', NULL NOT REGEXP '.*';

+------------------+----------------------+

| NULL REGEXP '.*' | NULL NOT REGEXP '.*' |

+------------------+----------------------+

| NULL | NULL |

+------------------+----------------------+

The fact that a regular expression matches a string if the pattern is found anywhere in the string means you must take care not to inadvertently specify a pattern that matches the empty string. If you do, it will match any non-NULL value. For example, the pattern a* matches any number of acharacters, even none. If your goal is to match only strings containing nonempty sequences of a characters, use a+ instead. The + requires one or more instances of the preceding pattern element for a match.

As with SQL pattern matches performed using LIKE, regular expression matches performed with REGEXP sometimes are equivalent to substring comparisons. The ^ and $ metacharacters serve much the same purpose as LEFT( ) or RIGHT(), at least if you’re looking for literal strings:

Pattern match

Substring comparison

str REGEXP '^abc'

LEFT( str ,3) = 'abc'

str REGEXP 'abc$'

RIGHT( str ,3) = 'abc'

For nonliteral strings, it’s typically not possible to construct an equivalent substring comparison. For example, to match strings that begin with any nonempty sequence of digits, you can use this pattern match:

str REGEXP '^[0-9]+'

That is something that LEFT() cannot do (and neither can LIKE, for that matter).

NOTE

A limitation of regular expression (REGEXP) matching compared to SQL pattern (LIKE) matching is that REGEXP works only for single-byte character sets. It cannot be expected to work with multibyte character sets such as utf8 or sjis.

Controlling Case Sensitivity in Pattern Matching

Problem

A pattern match is case-sensitive when you don’t want it to be, or vice versa.

Solution

Alter the case sensitivity of the strings.

Discussion

The case sensitivity of a pattern match operation is like that of a string comparison. That is, it depends on whether the operands are binary or nonbinary strings, and for nonbinary strings, it depends on their collation. See Controlling Case Sensitivity in String Comparisons for discussion of how these factors apply to comparisons.

The default character set and collation are latin1 and latin1_swedish_ci, so pattern match operations are not case-sensitive by default:

mysql>SELECT 'a' LIKE 'A', 'a' REGEXP 'A';

+--------------+----------------+

| 'a' LIKE 'A' | 'a' REGEXP 'A' |

+--------------+----------------+

| 1 | 1 |

+--------------+----------------+

Note that a REGEXP operation that is not case-sensitive can lead to some unintuitive results:

mysql>SELECT 'a' REGEXP '[[:lower:]]', 'a' REGEXP '[[:upper:]]';

+--------------------------+--------------------------+

| 'a' REGEXP '[[:lower:]]' | 'a' REGEXP '[[:upper:]]' |

+--------------------------+--------------------------+

| 1 | 1 |

+--------------------------+--------------------------+

Both expressions are true because [:lower:] and [:upper:] are equivalent when case sensitivity doesn’t matter.

If a pattern match uses different case-sensitive behavior from what you want, control it the same way as for string comparisons: convert the strings to binary or nonbinary as necessary or change the collation of nonbinary strings.

To make a pattern match case-sensitive, use a case-sensitive collation for either operand. For example, with the latin1 character set, use a collation of latin1_general_cs:

mysql>SET @s = 'a' COLLATE latin1_general_cs;

mysql> SELECT @s LIKE 'A', @s REGEXP 'A';

+-------------+---------------+

| @s LIKE 'A' | @s REGEXP 'A' |

+-------------+---------------+

| 0 | 0 |

+-------------+---------------+

Use of a case-sensitive collation also has the effect of causing [:lower:] and [:upper:] in regular expressions to match only lowercase and uppercase characters, respectively. The second expression in the following statement yields a result that really is true only for uppercase letters:

mysql>SET @s = 'a', @s_cs = 'a' COLLATE latin1_general_cs;

mysql> SELECT @s REGEXP '[[:upper:]]', @s_cs REGEXP '[[:upper:]]';

+-------------------------+----------------------------+

| @s REGEXP '[[:upper:]]' | @s_cs REGEXP '[[:upper:]]' |

+-------------------------+----------------------------+

| 1 | 0 |

+-------------------------+----------------------------+

Breaking Apart or Combining Strings

Problem

You want to extract a piece of a string or combine strings to form a larger string.

Solution

To obtain a piece of a string, use a substring-extraction function. To combine strings, use CONCAT().

Discussion

Strings can be broken apart by using appropriate substring-extraction functions. For example, LEFT(), MID(), and RIGHT() extract substrings from the left, middle, or right part of a string:

mysql>SELECT name, LEFT(name,2), MID(name,3,1), RIGHT(name,3) FROM metal;

+----------+--------------+---------------+---------------+

| name | LEFT(name,2) | MID(name,3,1) | RIGHT(name,3) |

+----------+--------------+---------------+---------------+

| copper | co | p | per |

| gold | go | l | old |

| iron | ir | o | ron |

| lead | le | a | ead |

| mercury | me | r | ury |

| platinum | pl | a | num |

| silver | si | l | ver |

| tin | ti | n | tin |

+----------+--------------+---------------+---------------+

For LEFT() and RIGHT(), the second argument indicates how many characters to return from the left or right end of the string. For MID(), the second argument is the starting position of the substring you want (beginning from 1), and the third argument indicates how many characters to return.

The SUBSTRING() function takes a string and a starting position, returning everything to the right of the position. MID() acts the same way if you omit its third argument because MID() is actually a synonym for SUBSTRING():

mysql>SELECT name, SUBSTRING(name,4), MID(name,4) FROM metal;

+----------+-------------------+-------------+

| name | SUBSTRING(name,4) | MID(name,4) |

+----------+-------------------+-------------+

| copper | per | per |

| gold | d | d |

| iron | n | n |

| lead | d | d |

| mercury | cury | cury |

| platinum | tinum | tinum |

| silver | ver | ver |

| tin | | |

+----------+-------------------+-------------+

Use SUBSTRING_INDEX( str , c , n ) to return everything to the right or left of a given character. It searches into a string str for the n-th occurrence of the character c and returns everything to its left. If n is negative, the search for c starts from the right and returns everything to the right of the character:

mysql>SELECT name,

-> SUBSTRING_INDEX(name,'r',1),

-> SUBSTRING_INDEX(name,'i',-1)

-> FROM metal;

+----------+-----------------------------+------------------------------+

| name | SUBSTRING_INDEX(name,'r',1) | SUBSTRING_INDEX(name,'i',-1) |

+----------+-----------------------------+------------------------------+

| copper | coppe | copper |

| gold | gold | gold |

| iron | i | ron |

| lead | lead | lead |

| mercury | me | mercury |

| platinum | platinum | num |

| silver | silve | lver |

| tin | tin | n |

+----------+-----------------------------+------------------------------+

Note that if there is no n-th occurrence of the character, SUBSTRING_INDEX() returns the entire string. SUBSTRING_INDEX() is case-sensitive.

Substrings can be used for purposes other than display, such as to perform comparisons. The following statement finds metal names having a first letter that lies in the last half of the alphabet:

mysql>SELECT name from metal WHERE LEFT(name,1) >= 'n';

+----------+

| name |

+----------+

| platinum |

| silver |

| tin |

+----------+

To combine strings rather than pull them apart, use the CONCAT() function. It concatenates all its arguments and returns the result:

mysql>SELECT CONCAT('Hello, ',USER(),', welcome to MySQL!') AS greeting;

+--------------------------------------------+

| greeting |

+--------------------------------------------+

| Hello, cbuser@localhost, welcome to MySQL! |

+--------------------------------------------+

mysql> SELECT CONCAT(name,' ends in "d": ',IF(RIGHT(name,1)='d','YES','NO'))

-> AS 'ends in "d"?'

-> FROM metal;

+--------------------------+

| ends in "d"? |

+--------------------------+

| copper ends in "d": NO |

| gold ends in "d": YES |

| iron ends in "d": NO |

| lead ends in "d": YES |

| mercury ends in "d": NO |

| platinum ends in "d": NO |

| silver ends in "d": NO |

| tin ends in "d": NO |

+--------------------------+

Concatenation can be useful for modifying column values “in place.” For example, the following UPDATE statement adds a string to the end of each name value in the metal table:

mysql>UPDATE metal SET name = CONCAT(name,'ide');

mysql> SELECT name FROM metal;

+-------------+

| name |

+-------------+

| copperide |

| goldide |

| ironide |

| leadide |

| mercuryide |

| platinumide |

| silveride |

| tinide |

+-------------+

To undo the operation, strip off the last three characters (the CHAR_LENGTH() function returns the length of a string in characters):

mysql>UPDATE metal SET name = LEFT(name,CHAR_LENGTH(name)-3);

mysql> SELECT name FROM metal;

+----------+

| name |

+----------+

| copper |

| gold |

| iron |

| lead |

| mercury |

| platinum |

| silver |

| tin |

+----------+

The concept of modifying a column in place can be applied to ENUM or SET values as well, which usually can be treated as string values even though they are stored internally as numbers. For example, to concatenate a SET element to an existing SET column, use CONCAT() to add the new value to the existing value, preceded by a comma. But remember to account for the possibility that the existing value might be NULL. In that case, set the column value equal to the new element, without the leading comma:

UPDATEtbl_name

SET set_col = IF(set_col IS NULL,val,CONCAT(set_col,',',val));

Searching for Substrings

Problem

You want to know whether a given string occurs within another string.

Solution

Use LOCATE().

Discussion

The LOCATE() function takes two arguments representing the substring that you’re looking for and the string in which to look for it. The return value is the position at which the substring occurs, or 0 if it’s not present. An optional third argument may be given to indicate the position within the string at which to start looking.

mysql>SELECT name, LOCATE('in',name), LOCATE('in',name,3) FROM metal;

+----------+-------------------+---------------------+

| name | LOCATE('in',name) | LOCATE('in',name,3) |

+----------+-------------------+---------------------+

| copper | 0 | 0 |

| gold | 0 | 0 |

| iron | 0 | 0 |

| lead | 0 | 0 |

| mercury | 0 | 0 |

| platinum | 5 | 5 |

| silver | 0 | 0 |

| tin | 2 | 0 |

+----------+-------------------+---------------------+

LOCATE() uses the collation of its arguments to determine whether the search is case-sensitive. See Recipes and for information about changing the comparison properties of the arguments if you want to change the search behavior.

Using FULLTEXT Searches

Problem

You want to search through a lot of text.

Solution

Use a FULLTEXT index.

Discussion

You can use pattern matches to look through any number of rows, but as the amount of text goes up, the match operation can become quite slow. It’s also common to look for the same text in several string columns, which with pattern matching tends to result in unwieldy queries:

SELECT * fromtbl_name

WHERE col1 LIKE 'pat' OR col2 LIKE 'pat' OR col3 LIKE 'pat' ...

A useful alternative is FULLTEXT searching, which is designed for looking through large amounts of text and can search multiple columns simultaneously. To use this capability, add a FULLTEXT index to your table, and then use the MATCH operator to look for strings in the indexed column or columns. FULLTEXT indexing can be used with MyISAM tables for nonbinary string data types (CHAR, VARCHAR, or TEXT).

FULLTEXT searching is best illustrated with a reasonably good-sized body of text. If you don’t have a sample dataset, several repositories of freely available electronic text are available on the Internet. For the examples here, the one I’ve chosen is the complete text of the King James Version of the Bible (KJV), which is relatively large and has the useful property of being nicely structured by book, chapter, and verse. Because of its size, this dataset is not included with the recipes distribution, but is available separately as the mcb-kjv distribution at the MySQL Cookbook web site (see Appendix A). The mcb-kvj distribution includes a filekjv.txt that contains the verse records. Some sample records look like this:

O Genesis 1 1 1 In the beginning God created the heaven and the earth.

O Exodus 2 20 13 Thou shalt not kill.

N Luke 42 17 32 Remember Lot's wife.

Each record contains the following fields:

§ Book section. This is either O or N, signifying the Old or New Testament.

§ Book name and corresponding book number, from 1 to 66.

§ Chapter and verse numbers.

§ Text of the verse.

To import the records into MySQL, create a table named kjv that looks like this:

CREATE TABLE kjv

(

bsect ENUM('O','N') NOT NULL, # book section (testament)

bname VARCHAR(20) NOT NULL, # book name

bnum TINYINT UNSIGNED NOT NULL, # book number

cnum TINYINT UNSIGNED NOT NULL, # chapter number

vnum TINYINT UNSIGNED NOT NULL, # verse number

vtext TEXT NOT NULL # text of verse

) ENGINE = MyISAM;

Then load the kjv.txt file into the table using this statement:

mysql>LOAD DATA LOCAL INFILE 'kjv.txt' INTO TABLE kjv;

You’ll notice that the kjv table contains columns both for book names (Genesis, Exodus, ...) and for book numbers (1, 2, ...). The names and numbers have a fixed correspondence, and one can be derived from the other—a redundancy that means the table is not in normal form. It’s possible to eliminate the redundancy by storing just the book numbers (which take less space than the names), and then producing the names when necessary in query results by joining the numbers to a small mapping table that associates each book number with the corresponding name. But I want to avoid using joins at this point. Thus, the table includes book names so that search results can be interpreted more easily, and numbers so that the results can be sorted easily into book order.

After populating the table, prepare it for use in FULLTEXT searching by adding a FULLTEXT index. This can be done using an ALTER TABLE statement:

mysql>ALTER TABLE kjv ADD FULLTEXT (vtext);

It’s possible to include the index definition in the initial CREATE TABLE statement, but it’s generally faster to create a nonindexed table and then add the index with ALTER TABLE after populating the table than to load a large dataset into an indexed table.

To perform a search using the index, use MATCH() to name the indexed column and AGAINST() to specify what text to look for. For example, you might wonder, “How many times does the name Mizraim occur?” To answer that question, search the vtext column using this statement:

mysql>SELECT COUNT(*) from kjv WHERE MATCH(vtext) AGAINST('Mizraim');

+----------+

| COUNT(*) |

+----------+

| 4 |

+----------+

To find out what those verses are, select the columns you want to see (the example here uses \G so that the results better fit the page):

mysql>SELECT bname, cnum, vnum, vtext

-> FROM kjv WHERE MATCH(vtext) AGAINST('Mizraim')\G

*************************** 1. row ***************************

bname: Genesis

cnum: 10

vnum: 6

vtext: And the sons of Ham; Cush, and Mizraim, and Phut, and Canaan.

*************************** 2. row ***************************

bname: Genesis

cnum: 10

vnum: 13

vtext: And Mizraim begat Ludim, and Anamim, and Lehabim, and Naphtuhim,

*************************** 3. row ***************************

bname: 1 Chronicles

cnum: 1

vnum: 8

vtext: The sons of Ham; Cush, and Mizraim, Put, and Canaan.

*************************** 4. row ***************************

bname: 1 Chronicles

cnum: 1

vnum: 11

vtext: And Mizraim begat Ludim, and Anamim, and Lehabim, and Naphtuhim,

The results come out in book, chapter, and verse number order in this particular instance, but that’s actually just coincidence. By default, FULLTEXT searches compute a relevance ranking and use it for sorting. To make sure a search result is sorted the way you want, add an explicit ORDER BYclause:

SELECT bname, cnum, vnum, vtext

FROM kjv WHERE MATCH(vtext) AGAINST('search string')

ORDER BY bnum, cnum, vnum;

If you want to see the relevance ranking, repeat the MATCH() ... AGAINST() expression in the output column list.

You can include additional criteria to narrow the search further. The following queries perform progressively more specific searches to find out how often the name Abraham occurs in the entire KJV, the New Testament, the Book of Hebrews, and Chapter 11 of Hebrews:

mysql>SELECT COUNT(*) from kjv WHERE MATCH(vtext) AGAINST('Abraham');

+----------+

| COUNT(*) |

+----------+

| 216 |

+----------+

mysql> SELECT COUNT(*) from kjv

-> WHERE MATCH(vtext) AGAINST('Abraham')

-> AND bsect = 'N';

+----------+

| COUNT(*) |

+----------+

| 66 |

+----------+

mysql> SELECT COUNT(*) from kjv

-> WHERE MATCH(vtext) AGAINST('Abraham')

-> AND bname = 'Hebrews';

+----------+

| COUNT(*) |

+----------+

| 10 |

+----------+

mysql> SELECT COUNT(*) from kjv

-> WHERE MATCH(vtext) AGAINST('Abraham')

-> AND bname = 'Hebrews' AND cnum = 11;

+----------+

| COUNT(*) |

+----------+

| 2 |

+----------+

If you expect to use search criteria that include other non-FULLTEXT columns frequently, you can increase the performance of such queries by adding regular indexes to those columns. For example, to index the book, chapter, and verse number columns, do this:

mysql>ALTER TABLE kjv ADD INDEX (bnum), ADD INDEX (cnum), ADD INDEX (vnum);

Search strings in FULLTEXT queries can include more than just a single word, and you might suppose that adding additional words would make a search more specific. But in fact that widens it, because a FULLTEXT search returns rows that contain any of the words. In effect, the query performs an OR search for any of the words. This is illustrated by the following queries, which identify successively larger numbers of verses as additional search words are added:

mysql>SELECT COUNT(*) from kjv

-> WHERE MATCH(vtext) AGAINST('Abraham');

+----------+

| COUNT(*) |

+----------+

| 216 |

+----------+

mysql> SELECT COUNT(*) from kjv

-> WHERE MATCH(vtext) AGAINST('Abraham Sarah');

+----------+

| COUNT(*) |

+----------+

| 230 |

+----------+

mysql> SELECT COUNT(*) from kjv

-> WHERE MATCH(vtext) AGAINST('Abraham Sarah Ishmael Isaac');

+----------+

| COUNT(*) |

+----------+

| 317 |

+----------+

To perform a search in which each word in the search string must be present, see Requiring or Excluding FULLTEXT Search Words.

If you want to use a FULLTEXT search that looks through multiple columns simultaneously, name them all when you construct the index:

ALTER TABLEtbl_name ADD FULLTEXT (col1, col2, col3);

To issue a search query that uses this index, name those same columns in the MATCH() list:

SELECT ... FROMtbl_name

WHERE MATCH(col1, col2, col3) AGAINST('search string');

You’ll need one such FULLTEXT index for each distinct combination of columns that you want to search.

See Also

FULLTEXT indexes provide a quick-and-easy way to set up a basic search engine. One way to use this capability is to provide a web-based interface to the indexed text. This book’s web site (see Appendix A) includes a simple web-based KJV search page that demonstrates this. You can use it as the basis for your own search engine that operates on a different repository of text.

Using a FULLTEXT Search with Short Words

Problem

FULLTEXTsearches for short words return no rows.

Solution

Change the indexing engine’s minimum word length parameter.

Discussion

In a text like the KJV, certain words have special significance, such as “God” and “sin.” However, if you perform FULLTEXT searches on the kjv table for those words, you’ll observe a curious phenomenon—both words appear to be missing from the text entirely:

mysql>SELECT COUNT(*) FROM kjv WHERE MATCH(vtext) AGAINST('God');

+----------+

| COUNT(*) |

+----------+

| 0 |

+----------+

mysql> SELECT COUNT(*) FROM kjv WHERE MATCH(vtext) AGAINST('sin');

+----------+

| COUNT(*) |

+----------+

| 0 |

+----------+

One property of the indexing engine is that it ignores words that are “too common” (that is, words that occur in more than half the rows). This eliminates words such as “the” or “and” from the index, but that’s not what is going on here. You can verify that by counting the total number of rows, and by using SQL pattern matches to count the number of rows containing each word:[9]

mysql>SELECT COUNT(*) AS 'total verses',

-> COUNT(IF(vtext LIKE '%God%',1,NULL)) AS 'verses containing "God"',

-> COUNT(IF(vtext LIKE '%sin%',1,NULL)) AS 'verses containing "sin"'

-> FROM kjv;

+--------------+-------------------------+-------------------------+

| total verses | verses containing "God" | verses containing "sin" |

+--------------+-------------------------+-------------------------+

| 31102 | 4118 | 1292 |

+--------------+-------------------------+-------------------------+

Neither word is present in more than half the verses, so sheer frequency of occurrence doesn’t account for the failure of a FULLTEXT search to find them. What’s really happening is that by default, the indexing engine doesn’t include words less than four characters long. The minimum word length is a configurable parameter, which you can change by setting the ft_min_word_len server variable. For example, to tell the indexing engine to include words as short as three characters, add a line to the [mysqld] group of the /etc/my.cnf file (or whatever option file you put server settings in):

[mysqld]

ft_min_word_len=3

After making this change, restart the server. Next, rebuild the FULLTEXT index to take advantage of the new setting:[10]

mysql>REPAIR TABLE kjv QUICK;

Finally, try the new index to verify that it includes shorter words:

mysql>SELECT COUNT(*) FROM kjv WHERE MATCH(vtext) AGAINST('God');

+----------+

| COUNT(*) |

+----------+

| 3878 |

+----------+

mysql> SELECT COUNT(*) FROM kjv WHERE MATCH(vtext) AGAINST('sin');

+----------+

| COUNT(*) |

+----------+

| 389 |

+----------+

That’s better!

But why do the MATCH() queries find 3,878 and 389 rows, whereas the earlier LIKE queries find 4,118 and 1,292 rows? That’s because the LIKE patterns match substrings and the FULLTEXT search performed by MATCH() matches whole words.


[9] The use of COUNT() to produce multiple counts from the same set of values is described in Summarizing with COUNT().

[10] If you change ft_min_word_len, you must also use REPAIR TABLE to rebuild the indexes for all other tables that have FULLTEXT indexes.

Requiring or Excluding FULLTEXT Search Words

Problem

You want to specifically require or disallow words in a FULLTEXT search.

Solution

Use a Boolean mode search.

Discussion

Normally, FULLTEXT searches return rows that contain any of the words in the search string, even if some of them are missing. For example, the following statement finds rows that contain either of the names David or Goliath:

mysql>SELECT COUNT(*) FROM kjv

-> WHERE MATCH(vtext) AGAINST('David Goliath');

+----------+

| COUNT(*) |

+----------+

| 934 |

+----------+

This behavior is undesirable if you want only rows that contain both words. One way to do this is to rewrite the statement to look for each word separately and join the conditions with AND:

mysql>SELECT COUNT(*) FROM kjv

-> WHERE MATCH(vtext) AGAINST('David')

-> AND MATCH(vtext) AGAINST('Goliath');

+----------+

| COUNT(*) |

+----------+

| 2 |

+----------+

Another way to require multiple words is with a Boolean mode search. To do this, precede each word in the search string by a + character and add IN BOOLEAN MODE after the string:

mysql>SELECT COUNT(*) FROM kjv

-> WHERE MATCH(vtext) AGAINST('+David +Goliath' IN BOOLEAN MODE)

+----------+

| COUNT(*) |

+----------+

| 2 |

+----------+

Boolean mode searches also allow you to exclude words. Just precede any disallowed word by a - character. The following queries select kjv rows containing the name David but not Goliath, or vice versa:

mysql>SELECT COUNT(*) FROM kjv

-> WHERE MATCH(vtext) AGAINST('+David -Goliath' IN BOOLEAN MODE)

+----------+

| COUNT(*) |

+----------+

| 928 |

+----------+

mysql> SELECT COUNT(*) FROM kjv

-> WHERE MATCH(vtext) AGAINST('-David +Goliath' IN BOOLEAN MODE)

+----------+

| COUNT(*) |

+----------+

| 4 |

+----------+

Another useful special character in Boolean searches is *; when appended to a search word, it acts as a wildcard operator. The following statement finds rows containing not only whirl, but also words such as whirls, whirleth, and whirlwind:

mysql>SELECT COUNT(*) FROM kjv

-> WHERE MATCH(vtext) AGAINST('whirl*' IN BOOLEAN MODE);

+----------+

| COUNT(*) |

+----------+

| 28 |

+----------+

For a complete list of Boolean FULLTEXT operators, see the MySQL Reference Manual.

Performing Phrase Searches with a FULLTEXT Index

Problem

You want to perform a FULLTEXT search for a phrase; that is, for words that occur adjacent to each other and in a specific order.

Solution

Use the FULLTEXT phrase-search capability.

Discussion

To find rows that contain a particular phrase, you can’t use a simple FULLTEXT search:

mysql>SELECT COUNT(*) FROM kjv

-> WHERE MATCH(vtext) AGAINST('still small voice');

+----------+

| COUNT(*) |

+----------+

| 548 |

+----------+

The query returns a result, but it’s not the result you’re looking for. A FULLTEXT search computes a relevance ranking based on the presence of each word individually, no matter where it occurs within the vtext column, and the ranking will be nonzero as long as any of the words are present. Consequently, this kind of statement tends to find too many rows.

FULLTEXT searching supports phrase searching in Boolean mode. To use it, place the phrase in double quotes within the search string:

mysql>SELECT COUNT(*) FROM kjv

-> WHERE MATCH(vtext) AGAINST('"still small voice"' IN BOOLEAN MODE);

+----------+

| COUNT(*) |

+----------+

| 1 |

+----------+

A phrase match succeeds if a column contains the same words as in the phrase, in the order specified.