Chapter 10. Importing and Exporting Data

Introduction

Suppose that you have a file named somedata.csv that contains 12 columns of data in comma-separated values (CSV) format. From this file you want to extract only columns 2, 11, 5, and 9, and use them to create database rows in a MySQL table that contains name, birth, height, and weight columns. You need to make sure that the height and weight are positive integers, and convert the birth dates from MM/DD/YY format to CCYY-MM-DD format. How can you do this?

In one sense, that problem is very specialized. But in another, it’s not at all atypical, because data transfer problems with specific requirements occur frequently when you transfer data into MySQL. It would be nice if datafiles were always nicely formatted and ready to load into MySQL with no preparation, but frequently that is not so. As a result, it’s often necessary to preprocess information to put it into a format that MySQL finds acceptable. The reverse also is true; data exported from MySQL may need massaging to be useful for other programs.

Although some data transfer operations are so difficult that they require a great deal of hand checking and reformatting, you can do at least part of the job automatically in most cases. Virtually all transfer problems involve at least some elements of a common set of conversion issues. This chapter discusses what these issues are, how to deal with them by taking advantage of the existing tools at your disposal, and how to write your own tools when necessary. The idea is not to cover all possible import and export situations (an impossible task), but to show some representative techniques and utilities. You can use them as is or adapt them for problems that they don’t handle. (There are also commercial conversion tools that may assist you, but my purpose here is to help you do things yourself.)

The first recipes in the chapter cover MySQL’s native facilities for importing data (the LOAD DATA statement and the mysqlimport command-line program), and for exporting data (the SELECT ... INTO OUTFILE statement and the mysqldump program). For operations that don’t require any data validation or reformatting, these facilities often are sufficient. They might even be sufficient for operations that require reformatting because LOAD DATA can perform preprocessing and SELECT can take advantage of functions and expressions for transforming column values into other formats.

For situations where MySQL's native import and export capabilities do not suffice, the chapter moves on to cover techniques for using external supporting utilities and for writing your own. To some extent, you can avoid writing your own tools by using existing programs. For example, cut can extract columns from a file, and sed and tr can be used as postprocessors to convert query output into other formats. But you'll probably eventually reach the point where you decide to write your own programs. When you do, there are two broad sets of issues to consider:

§ How to manipulate the structure of datafiles. When a file is in a format that isn’t suitable for import, you’ll need to convert it to a different format. This may involve issues such as changing the column delimiters or line-ending sequences, or removing or rearranging columns in the file.

§ How to manipulate the content of datafiles. If you don’t know whether the values contained in a file are legal, you may want to preprocess it to check or reformat them. Numeric values may need to be verified as lying within a specific range, dates may need to be converted to or from ISO format, and so forth.
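
As a small illustration of the first kind of manipulation, standard Unix tools can sometimes do part of the job before you write any custom code. The following command is only a sketch (it assumes a hypothetical tab-delimited file named somedata.txt); note that cut extracts the listed columns but always emits them in their original order, so it cannot rearrange them:

%cut -f2,5,9,11 somedata.txt > extracted.txt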

Source code for the program fragments and scripts discussed in this chapter is located in the transfer directory of the recipes distribution, with the exception that some of the utility functions are contained in library files located in the lib directory. The code for some of the shorter utilities is shown in full. For the longer ones, the chapter generally discusses only how they work and how to use them, but you have access to the source if you want to investigate in more detail how they’re written.

The problems addressed in this chapter involve a lot of text processing and pattern matching. These are particular strengths of Perl, so the program fragments and utilities shown here are written mainly in Perl. Ruby, PHP, Python, and Java provide pattern-matching capabilities, too, so they can of course do many of the same things.

General Import and Export Issues

Incompatible datafile formats and differing rules for interpreting various kinds of values lead to many headaches when transferring data between programs. Nevertheless, certain issues recur frequently. By being aware of them, you’ll be able to identify more easily just what you need to do to solve particular import or export problems.

In its most basic form, an input stream is just a set of bytes with no particular meaning. Successful import into MySQL requires being able to recognize which bytes represent structural information and which represent the data values framed by that structure. Because such recognition is key to decomposing the input into appropriate units, the most fundamental import issues are these:

§ What is the record separator? Knowing this enables you to partition the input stream into records.

§ What is the field delimiter? Knowing this enables you to partition each record into field values. Identifying the data values also might include stripping off quotes from around the values or recognizing escape sequences within them.

The ability to break apart the input into records and fields is important for extracting the data values from it. However, the values may still not be in a form that can be used directly, and you may need to consider other issues:

§ Do the order and number of columns match the structure of the database table? Mismatches require columns to be rearranged or skipped.

§ Do data values need to be validated or reformatted? If the values are in a format that matches MySQL’s expectations, no further processing is necessary. Otherwise, they need to be checked and possibly rewritten.

§ How should NULL or empty values be handled? Are they allowed? Can NULL values even be detected? (Some systems export the NULL value as an empty string, making it impossible to distinguish one from the other.)

For export from MySQL, the issues are somewhat the reverse. You can assume that values stored in the database are valid, but they may require reformatting for use by other programs, and it’s necessary to add column and record delimiters to form an output stream that has a structure other programs can recognize.

The chapter deals with these issues primarily within the context of performing bulk transfers of entire files, but many of the techniques discussed here can be applied in other situations as well. Consider a web-based application that presents a form for a user to fill in and then processes its contents to create a new row in the database. That is a data import situation. Web APIs generally make form contents available as a set of already parsed discrete values, so the application may not need to deal with record and column delimiters. On the other hand, validation issues remain paramount. You really have no idea what kind of values a user is sending your script, so it’s important to check them. Validation is covered in this chapter, and we’ll revisit the issue in Validating Web Input.

File Formats

Data files come in many formats, two of which are used frequently in this chapter:

Tab-delimited format

This is one of the simplest file structures; lines contain values separated by tab characters. A short tab-delimited file might look like this, where the whitespace between column values represents single tab characters:

a b c

a,b,c d e f

Comma-separated values (CSV) format

Files written in CSV format vary somewhat, because there is apparently no actual standard describing the format. However, the general idea is that lines consist of values separated by commas, and values containing internal commas are enclosed within quotes to prevent the commas from being interpreted as value delimiters. It’s also common for values containing spaces to be quoted as well. Here is an example, in which each line contains three values:

a,b,c

"a,b,c","d e",f

It’s trickier to process CSV files than tab-delimited files because characters like quotes and commas have a dual meaning: They may represent file structure or be included in the content of data values.

Another important datafile characteristic is the line-ending sequence. The most common sequences are carriage return, linefeed, and carriage return/linefeed pair, sometimes referred to here by the abbreviations CR, LF, and CRLF.

Datafiles often begin with a row of column labels. For some import operations, the row of labels is an annoyance because you must discard it to avoid having the labels be loaded into your table as a data row. In other cases, the labels are quite useful:

§ For import into existing tables, the labels help you match up datafile columns with the table columns, particularly if they are not in the same order.

§ The labels can be used for column names when creating a new table automatically or semiautomatically from a datafile. For example, Guessing Table Structure from a Datafile discusses a utility that examines a datafile and guesses the CREATE TABLE statement that should be used to create a table from the file. If a label row is present, the utility uses the labels for column names. Otherwise, it’s necessary to make up generic names like c1, c2, and so forth, which isn’t very descriptive.

TAB-DELIMITED, LINEFEED-TERMINATED FORMAT

Although datafiles may be written in many formats, it’s unlikely that you’ll want to include machinery for reading several different formats within each file-processing utility that you write. I don’t want to, either. For that reason, many of the utilities described in this chapter assume for simplicity that their input is in tab-delimited, linefeed-terminated format. (This is also the default format for MySQL’s LOAD DATA statement.) By making this assumption, it becomes easier to write programs that read files.

On the other hand, something has to be able to read data in other formats. To handle that problem, we’ll develop a cvt_file.pl script that can read several types of files (Converting Datafiles from One Format to Another). The script is based on the Perl Text::CSV_XS module, which despite its name can be used for more than just CSV data. cvt_file.pl can convert between many file types, making it possible for other programs that require tab-delimited lines to be used with files not originally written in that format. In other words, you can use cvt_file.pl to convert a file to tab-delimited, linefeed-terminated format, and then any program that expects that format can process the file.

Notes on Invoking Shell Commands

This chapter shows a number of programs that you invoke from the command line using a shell like bash or tcsh under Unix or cmd.exe (“the command prompt”) under Windows. Many of the example commands for these programs use quotes around option values, and sometimes an option value is itself a quote character. Quoting conventions vary from one shell to another, but the following rules seem to work with most of them (including cmd.exe under Windows):

§ For an argument that contains spaces, enclose it within double quotes to prevent the shell from interpreting it as multiple separate arguments. The shell will strip off the quotes and then pass the argument to the command intact.

§ To include a double quote character in the argument itself, precede it with a backslash.

Some shell commands in this chapter are so long that they're shown split across several lines, as you would enter them, with a backslash character as the line-continuation character:

%prog_name \

argument1 \

argument2 ...

That works for Unix, but not for Windows, where you’ll need to omit the continuation characters and type the entire command on one line:

C:\>prog_name argument1 argument2 ...

Importing Data with LOAD DATA and mysqlimport

Problem

You want to load a datafile into a table using MySQL’s built-in import capabilities.

Solution

Use the LOAD DATA statement or the mysqlimport command-line program.

Discussion

MySQL provides a LOAD DATA statement that acts as a bulk data loader. Here’s an example statement that reads a file mytbl.txt from your current directory and loads it into the table mytbl in the default database:

mysql>LOAD DATA LOCAL INFILE 'mytbl.txt' INTO TABLE mytbl;

At some MySQL installations, the LOCAL loading capability may have been disabled for security reasons. If that is true at your site, omit LOCAL from the statement and specify the full pathname to the file. See Specifying the Datafile Location for more information on local versus non-local data loading.

MySQL also includes a utility program named mysqlimport that acts as a wrapper around LOAD DATA so that you can load input files directly from the command line. The mysqlimport command that is equivalent to the preceding LOAD DATA statement looks like this, assuming that mytbl is in the cookbook database:

%mysqlimport --local cookbook mytbl.txt

For mysqlimport, as with other MySQL programs, you may need to specify connection parameter options such as --user or --host (Starting and Stopping mysql).

The following list describes LOAD DATA’s general characteristics and capabilities; mysqlimport shares most of these behaviors. There are some differences that we’ll note as we go along, but for the most part you can interpret references to LOAD DATA as references to mysqlimport as well.

LOAD DATA provides options to address many of the import issues mentioned in the chapter introduction, such as the line-ending sequence for recognizing how to break input into records, the column value delimiter that allows records to be broken into separate values, the quoting character that may enclose column values, quoting and escaping conventions within values, and NULL value representation:

§ By default, LOAD DATA expects the datafile to contain the same number of columns as the table into which you’re loading data, and the datafile columns must be present in the same order as in the table. If the file doesn’t contain a value for every column or the values aren’t in the proper order, you can specify which columns are present and the order in which they appear. If the datafile contains fewer columns than the table, MySQL assigns default values to columns for which no values are present in the datafile.

§ LOAD DATA assumes that data values are separated by tab characters and that lines end with linefeeds (newlines). You can specify the data format explicitly if a file doesn't conform to these conventions.

§ You can indicate that data values may have quotes around them that should be stripped, and you can specify what the quote character is.

§ Several special escape sequences are recognized and converted during input processing. The default escape character is backslash (\), but you can change it if you like. The \N sequence is interpreted to represent a NULL value. The \b, \n, \r, \t, \\, and \0 sequences are interpreted as backspace, linefeed, carriage return, tab, backslash, and ASCII NUL characters. (NUL is a zero-valued byte, which is different from the SQL NULL value.)

§ LOAD DATA provides diagnostic information about which input values cause problems. You can display this information by issuing a SHOW WARNINGS statement after the LOAD DATA statement.

The next few sections describe how to import datafiles into MySQL tables using LOAD DATA or mysqlimport.

Specifying the Datafile Location

Problem

You’re not sure how to tell LOAD DATA where to look for your datafile, particularly if it’s located in another directory.

Solution

It’s a matter of knowing the rules that determine where MySQL looks for the file.

Discussion

You can load files that are located on the server host or on the client host from which you issue the LOAD DATA statement. By default, the MySQL server assumes that the datafile is located on the server host. However, that might not be appropriate in all cases:

§ If you access the MySQL server from a remote client host and have no means of transferring your file to the server host (such as a login account there), you won’t be able to put the file on the server.

§ Even if you have a login account on the server host, your MySQL account must be enabled with the FILE privilege, and the file to be loaded must be either world-readable or located in the data directory for the default database. Most MySQL users do not have the FILE privilege (because it enables them to do dangerous things), and you might not want to make the file world readable (for security reasons) or be able to put it in the database directory.

Fortunately, you can load local files that are located on the client host using LOAD DATA LOCAL rather than LOAD DATA. The only permission you need to import a local file is the ability to read the file yourself. One caveat is that the LOCAL keyword might be disabled by default. You may be able to turn it on using the --local-infile option for mysql. If that doesn’t work, your server has been configured not to allow LOAD DATA LOCAL at all. (Many of the examples in this chapter assume that LOCAL can be used. If that’s not true for your system, you’ll need to adapt these examples. Omit LOCAL from the statement, make sure that the file is located on the MySQL server host, and specify its pathname using the following rules. For example, specify the full pathname.)

If the LOCAL keyword is not present in the LOAD DATA statement, the MySQL server reads the datafile. It looks for the file on the server host using the following rules:

§ An absolute pathname fully specifies the location of the file, beginning from the root of the filesystem. MySQL reads the file from the given location.

§ A relative pathname is interpreted two ways, depending on whether it has a single component or multiple components. For a single-component filename such as mytbl.txt, MySQL looks for the file in the database directory for the default database. (The statement will fail if you have not selected a default database.) For a multiple-component filename such as xyz/mytbl.txt, MySQL looks for the file beginning in the MySQL data directory. That is, it expects to find mytbl.txt in a directory named xyz.

Database directories are located directly under the server’s data directory, so these two statements are equivalent if the default database is cookbook:

mysql>LOAD DATA INFILE 'mytbl.txt' INTO TABLE mytbl;

mysql> LOAD DATA INFILE 'cookbook/mytbl.txt' INTO TABLE mytbl;

If the LOCAL keyword is specified in the LOAD DATA statement, your client program looks for the file on the client host and sends its contents to the server. The client interprets the pathname the same way your command interpreter does:

§ An absolute pathname fully specifies the location of the file, beginning from the root of the filesystem.

§ A relative pathname specifies the location of the file relative to your current directory.

If your file is located on the client host, but you forget to indicate that it’s local, you’ll get an error.

mysql>LOAD DATA INFILE 'mytbl.txt' INTO TABLE mytbl;

ERROR 1045 (28000): Access denied for user: 'user_name@host_name'

(Using password: YES)

That Access denied message can be confusing: if you can connect to the server and issue the LOAD DATA statement, surely you have already gained access to MySQL? What the error actually means is that the MySQL server tried to open mytbl.txt on the server host and could not access it.

If your MySQL server runs on the host from which you issue the LOAD DATA statement, “remote” and “local” refer to the same host. But the rules just discussed for locating datafiles still apply. Without LOCAL, the server reads the datafile. With LOCAL, the client program reads the file and sends its contents to the server.

mysqlimport uses the same rules for finding files as LOAD DATA. By default, it assumes that the datafile is located on the server host. To indicate that the file is local to the client host, specify the --local (or -L) option on the command line.

LOAD DATA assumes that the table is located in the default database unless you specify the database name explicitly. mysqlimport always requires a database argument:

%mysqlimport --local cookbook mytbl.txt

To use LOAD DATA to load a file into a specific database rather than the default database, qualify the table name with the database name. The following statement does this, indicating that the mytbl table is located in the other_db database:

mysql>LOAD DATA LOCAL INFILE 'mytbl.txt' INTO TABLE other_db.mytbl;

LOAD DATA assumes no relationship between the name of the datafile and the name of the table into which you're loading the file's contents. mysqlimport assumes a fixed relationship between the datafile name and the table name. Specifically, it uses the last component of the filename to determine the table name. For example, mysqlimport interprets mytbl, mytbl.txt, mytbl.dat, /tmp/mytbl.txt, /u/paul/data/mytbl.csv, and C:\projects\mytbl.txt all as files containing data for the mytbl table.

NAMING DATAFILES UNDER WINDOWS

Windows systems use \ as the pathname separator in filenames. That’s a bit of a problem, because MySQL interprets backslash as the escape character in string values. To specify a Windows pathname, use either doubled backslashes or forward slashes instead. These two statements show two ways of referring to the same Windows file:

mysql>LOAD DATA LOCAL INFILE 'C:\\projects\\mydata.txt' INTO TABLE mytbl;

mysql> LOAD DATA LOCAL INFILE 'C:/projects/mydata.txt' INTO TABLE mytbl;

If the NO_BACKSLASH_ESCAPES SQL mode is enabled, backslash is not special, and you do not double it:

mysql>SET sql_mode = 'NO_BACKSLASH_ESCAPES';

mysql> LOAD DATA LOCAL INFILE 'C:\projects\mydata.txt' INTO TABLE mytbl;

Specifying the Structure of the Datafile

Problem

You have a data file that’s not in LOAD DATA’s default format.

Solution

Use FIELDS and LINES clauses to tell LOAD DATA how to interpret the file.

Discussion

By default, LOAD DATA assumes that datafiles contain lines that are terminated by linefeed (newline) characters and that data values within a line are separated by tab characters. The following statement does not specify anything about the format of the datafile, so MySQL assumes the default format:

mysql>LOAD DATA LOCAL INFILE 'mytbl.txt' INTO TABLE mytbl;

Two LOAD DATA clauses provide explicit information about the datafile format. A FIELDS clause describes the characteristics of fields within a line, and a LINES clause specifies the line-ending sequence. The following LOAD DATA statement indicates that the input file contains data values separated by colons and lines terminated by carriage returns:

mysql>LOAD DATA LOCAL INFILE 'mytbl.txt' INTO TABLE mytbl

-> FIELDS TERMINATED BY ':'

-> LINES TERMINATED BY '\r';

Each clause follows the table name. If both are present, the FIELDS clause must precede the LINES clause. The line and field termination indicators can contain multiple characters. For example, \r\n indicates that lines are terminated by carriage return/linefeed pairs.

The LINES clause also has a STARTING BY subclause. It specifies the sequence to be stripped from each input record. Like TERMINATED BY, the sequence can have multiple characters. If TERMINATED BY and STARTING BY both are present in the LINES clause, they can appear in any order.

Note that for STARTING BY, everything up to the given sequence is stripped from each line. If you specify STARTING BY 'X' and an input line begins with abcX, all four leading characters are stripped.
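
For example, if every record in a hypothetical datafile were prefixed with the sequence xxx, a statement like this would strip the prefix as each line is read:

mysql>LOAD DATA LOCAL INFILE 'mytbl.txt' INTO TABLE mytbl

-> LINES STARTING BY 'xxx' TERMINATED BY '\n';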

If you use mysqlimport, command options provide the format specifiers. mysqlimport commands that correspond to the first two LOAD DATA statements shown earlier look like this:

%mysqlimport --local cookbook mytbl.txt

% mysqlimport --local --fields-terminated-by=":" --lines-terminated-by="\r" \

cookbook mytbl.txt

The order in which you specify the options doesn’t matter for mysqlimport.

You can use hex notation to specify arbitrary format characters for FIELDS and LINES clauses. This can be useful for loading datafiles that use binary format codes. Suppose that a datafile has lines with Ctrl-A between fields and Ctrl-B at the end of lines. The ASCII values for Ctrl-A and Ctrl-B are 1 and 2, so you represent them as 0x01 and 0x02:

FIELDS TERMINATED BY 0x01 LINES TERMINATED BY 0x02

mysqlimport also understands hex constants for format specifiers. You may find this capability helpful if you don't like remembering how to type escape sequences on the command line or when it's necessary to use quotes around them. Tab is 0x09, linefeed is 0x0a, and carriage return is 0x0d. Here's an example that indicates that the datafile contains tab-delimited lines terminated by CRLF pairs:

%mysqlimport --local --lines-terminated-by=0x0d0a \

--fields-terminated-by=0x09 cookbook mytbl.txt

When you import datafiles, don’t assume that LOAD DATA (or mysqlimport) knows more than it does. It’s important always to keep in mind that LOAD DATA knows nothing at all about the format of your datafile. And always make sure that you do know what its format is. If the file has been transferred from one machine to another, its contents may have been changed in subtle ways of which you’re not aware.

Some LOAD DATA frustrations occur because people expect MySQL to know things that it cannot possibly know. LOAD DATA makes certain assumptions about the structure of input files, represented as the default settings for the line and field terminators, and for the quote and escape character settings. If your input doesn’t match those assumptions, you need to tell MySQL about it.

When in doubt, check the contents of your datafile using a hex dump program or other utility that displays a visible representation of whitespace characters like tab, carriage return, and linefeed. Under Unix, programs such as od or hexdump can display file contents in a variety of formats. If you don't have these or some comparable utility, the transfer directory of the recipes distribution contains hex dumpers written in Perl, Ruby, and Python (hexdump.pl, hexdump.rb, and hexdump.py), as well as programs that display printable representations of all characters of a file (see.pl, see.rb, and see.py). You may find them useful for examining files to see what they really contain. In some cases, you may be surprised to discover that a file's contents are different from what you think. This is, in fact, quite likely if the file has been transferred from one machine to another:

§ An FTP transfer between machines running different operating systems typically translates line endings to those that are appropriate for the destination machine if the transfer is performed in text mode rather than in binary (image) mode. Suppose that you have tab-delimited linefeed-terminated records in a datafile that load into MySQL on a Unix system just fine using the default LOAD DATA settings. If you copy the file to a Windows machine with FTP using a text transfer mode, the linefeeds might be converted to carriage return/linefeed pairs. On that machine, the file will not load properly with the same LOAD DATA statement if its contents have been changed. Does MySQL have any way of knowing that? No. So it’s up to you to tell it, by adding a LINES TERMINATED BY '\r\n' clause to the statement. Transfers between any two systems with dissimilar default line endings can cause these changes.

§ Data files pasted into email messages often do not survive intact. Mail software may wrap (break) long lines or convert line-ending sequences. If you must transfer a datafile by email, it’s best sent as an attachment.
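
For example, a quick way to check a file under Unix before loading it is to look at a character dump of its first few lines. This sketch uses the standard od utility mentioned earlier, piped through head simply to limit the output:

%od -c mytbl.txt | head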

Dealing with Quotes and Special Characters

Problem

Your datafile contains quoted values or escaped characters.

Solution

Tell LOAD DATA to be aware of quote and escape characters so that it doesn’t load data values into the database uninterpreted.

Discussion

The FIELDS clause can specify other format options besides TERMINATED BY. By default, LOAD DATA assumes that values are unquoted, and it interprets the backslash (\) as an escape character for special characters. To indicate the value-quoting character explicitly, use ENCLOSED BY; MySQL will strip that character from the ends of data values during input processing. To change the default escape character, use ESCAPED BY.

The three subclauses of the FIELDS clause (ENCLOSED BY, ESCAPED BY, and TERMINATED BY) may be present in any order if you specify more than one of them. For example, these FIELDS clauses are equivalent:

FIELDS TERMINATED BY ',' ENCLOSED BY '"'

FIELDS ENCLOSED BY '"' TERMINATED BY ','

The TERMINATED BY sequence can consist of multiple characters. If data values are separated within input lines by something like *@*, you indicate that like this:

FIELDS TERMINATED BY '*@*'

To disable escape processing entirely, specify an empty escape sequence:

FIELDS ESCAPED BY ''

When you specify ENCLOSED BY to indicate which quote character should be stripped from data values, it’s possible to include the quote character literally within data values by doubling it or by preceding it with the escape character. For example, if the quote and escape characters are " and\, the input value "a""b\"c" is interpreted as a"b"c.

For mysqlimport, the corresponding command options for specifying quote and escape values are --fields-enclosed-by and --fields-escaped-by. (When using mysqlimport options that include quotes or backslashes or other characters that are special to your command interpreter, remember that you may need to quote or escape the quote or escape characters.)

Importing CSV Files

Problem

You need to load a file that is in CSV format.

Solution

Add the appropriate format-specifier clauses to your LOAD DATA statement.

Discussion

Data files in CSV format contain values that are delimited by commas rather than tabs and that may be quoted with double-quote characters. For example, a CSV file mytbl.txt containing lines that end with carriage return/linefeed pairs can be loaded into mytbl using LOAD DATA:

mysql>LOAD DATA LOCAL INFILE 'mytbl.txt' INTO TABLE mytbl

-> FIELDS TERMINATED BY ',' ENCLOSED BY '"'

-> LINES TERMINATED BY '\r\n';

Or like this using mysqlimport:

%mysqlimport --local --lines-terminated-by="\r\n" \

--fields-terminated-by="," --fields-enclosed-by="\"" \

cookbook mytbl.txt

Reading Files from Different Operating Systems

Problem

Different operating systems use different line-ending sequences.

Solution

That’s why LOAD DATA has a LINES TERMINATED BY clause. Use it to your advantage.

Discussion

The line-ending sequence used in a datafile typically is determined by the system from which the file originated. Unix files normally have lines terminated by linefeeds, which you can indicate in a LOAD DATA statement like this:

LINES TERMINATED BY '\n'

However, because \n happens to be the default line terminator for LOAD DATA, you don’t need to specify a LINES TERMINATED BY clause in this case unless you want to indicate explicitly what the line-ending sequence is.

If your system doesn’t use the Unix default (linefeed), you need to specify the line terminator explicitly. Files created under Mac OS X or Windows often have lines ending in carriage returns or carriage return/linefeed pairs, respectively. To handle these different kinds of line endings, use the appropriate LINES TERMINATED BY clause:

LINES TERMINATED BY '\r'

LINES TERMINATED BY '\r\n'

For example, to load a Windows file that contains tab-delimited fields and lines ending with CRLF pairs, use this LOAD DATA statement:

mysql>LOAD DATA LOCAL INFILE 'mytbl.txt' INTO TABLE mytbl

-> LINES TERMINATED BY '\r\n';

The corresponding mysqlimport command is:

%mysqlimport --local --lines-terminated-by="\r\n" cookbook mytbl.txt

Handling Duplicate Key Values

Problem

Your input contains records that duplicate the values of unique keys in existing table rows.

Solution

Tell LOAD DATA to ignore the new records, or to replace the old ones.

Discussion

By default, an error occurs if you attempt to load a record that duplicates an existing row in the column or columns that form a PRIMARY KEY or UNIQUE index. To control this behavior, specify IGNORE or REPLACE after the filename to tell MySQL to either ignore duplicate rows or to replace old rows with the new ones.

Suppose that you periodically receive meteorological data about current weather conditions from various monitoring stations, and that you store measurements of various types from these stations in a table that looks like this:

CREATE TABLE weatherdata

(

station INT UNSIGNED NOT NULL,

type ENUM('precip','temp','cloudiness','humidity','barometer') NOT NULL,

value FLOAT,

PRIMARY KEY (station, type)

);

To make sure that you have only one row for each station for each type of measurement, the table includes a primary key on the combination of station ID and measurement type. The table is intended to hold only current conditions, so when new measurements for a given station are loaded into the table, they should kick out the station’s previous measurements. To accomplish this, use the REPLACE keyword:

mysql>LOAD DATA LOCAL INFILE 'data.txt' REPLACE INTO TABLE weatherdata;
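
If you want the opposite behavior, keeping the rows already in the table and silently discarding any incoming duplicates, use IGNORE instead:

mysql>LOAD DATA LOCAL INFILE 'data.txt' IGNORE INTO TABLE weatherdata;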

mysqlimport has --ignore and --replace options that have the same effect as the IGNORE and REPLACE keywords for LOAD DATA.

Obtaining Diagnostics About Bad Input Data

Problem

When you issue a LOAD DATA statement, you want to know whether any input values are bad and what’s wrong with them.

Solution

Use the information line displayed by LOAD DATA to determine whether there are any problematic input values. If so, use SHOW WARNINGS to find where they are and what the problems are.

Discussion

When a LOAD DATA statement finishes, it returns a line of information that tells you how many errors or data conversion problems occurred. Suppose that you load a file into a table and see the following message when LOAD DATA finishes:

Records: 134 Deleted: 0 Skipped: 2 Warnings: 13

These values provide some general information about the import operation:

§ Records indicates the number of records found in the file.

§ Deleted and Skipped are related to treatment of input records that duplicate existing table rows on unique index values. Deleted indicates how many rows were deleted from the table and replaced by input records, and Skipped indicates how many input records were ignored in favor of existing rows.

§ Warnings is something of a catchall that indicates the number of problems found while loading data values into columns. Either a value stores into a column properly, or it doesn’t. In the latter case, the value ends up in MySQL as something different, and MySQL counts it as a warning. (Storing a string abc into a numeric column results in a stored value of 0, for example.)

What do these values tell you? The Records value normally should match the number of lines in the input file. If it is different from the file’s line count, that’s a sign that MySQL is interpreting the file as having a format that differs from the format it actually has. In this case, you’re likely also to see a high Warnings value, which indicates that many values had to be converted because they didn’t match the expected data type. (The solution to this problem often is to specify the proper FIELDS and LINES clauses.)

Assuming that your FIELDS and LINES format specifiers are correct, a nonzero Warnings count indicates the presence of bad input values. You can’t tell from the numbers in the LOAD DATA information line which input records had problems or which columns were bad. To get that information, issue a SHOW WARNINGS statement.

Suppose that a table t has this structure:

CREATE TABLE t

(

i INT,

c CHAR(3),

d DATE

);

And suppose that a datafile data.txt looks like this:

1 1 1

abc abc abc

2010-10-10 2010-10-10 2010-10-10

Loading the file into the table causes a number, a string, and a date to be loaded into each of the three columns. Doing so results in a number of data conversions and warnings:

mysql>LOAD DATA LOCAL INFILE 'data.txt' INTO TABLE t;

Query OK, 3 rows affected, 5 warnings (0.01 sec)

Records: 3 Deleted: 0 Skipped: 0 Warnings: 5

To see the warning messages, use SHOW WARNINGS immediately after the LOAD DATA statement:

mysql>SHOW WARNINGS;

+---------+------+--------------------------------------------+

| Level | Code | Message |

+---------+------+--------------------------------------------+

| Warning | 1265 | Data truncated for column 'd' at row 1 |

| Warning | 1264 | Out of range value for column 'i' at row 2 |

| Warning | 1265 | Data truncated for column 'd' at row 2 |

| Warning | 1265 | Data truncated for column 'i' at row 3 |

| Warning | 1265 | Data truncated for column 'c' at row 3 |

+---------+------+--------------------------------------------+

5 rows in set (0.00 sec)

The SHOW WARNINGS output helps you determine which values were converted and why. The resulting table looks like this:

mysql>SELECT * FROM t;

+------+------+------------+

| i | c | d |

+------+------+------------+

| 1 | 1 | 0000-00-00 |

| 0 | abc | 0000-00-00 |

| 2010 | 201 | 2010-10-10 |

+------+------+------------+

Skipping Datafile Lines

Problem

You want LOAD DATA to skip over the first line or lines of your datafile before starting to load records.

Solution

Tell LOAD DATA how many lines to ignore.

Discussion

To skip over the first n lines of a datafile, add an IGNORE n LINES clause to the LOAD DATA statement. For example, if a tab-delimited file begins with a line consisting of column headers, skip that line like this:

mysql>LOAD DATA LOCAL INFILE 'mytbl.txt' INTO TABLE mytbl IGNORE 1 LINES;

mysqlimport supports an --ignore-lines=n option that has the same effect.
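
For example, to skip a single line of column headers when importing the same file with mysqlimport:

%mysqlimport --local --ignore-lines=1 cookbook mytbl.txt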

IGNORE is often useful with files generated by external sources. For example, a program might export data in CSV format with an initial line of column labels. The following statement would be appropriate for skipping the labels in such a file that has carriage return line endings:

mysql>LOAD DATA LOCAL INFILE 'mytbl.txt' INTO TABLE mytbl

-> FIELDS TERMINATED BY ',' ENCLOSED BY '"'

-> LINES TERMINATED BY '\r'

-> IGNORE 1 LINES;

Specifying Input Column Order

Problem

The columns in your datafile aren’t in the same order as the columns in the table into which you’re loading the file.

Solution

Tell LOAD DATA how to match the table and the file by indicating which table columns correspond to the datafile columns.

Discussion

LOAD DATA assumes that the columns in the datafile have the same order as the columns in the table. If that's not true, specify a list to indicate which table columns the datafile columns should be loaded into. Suppose that your table has columns a, b, and c, but successive columns in the datafile correspond to columns b, c, and a. You can load the file like this:

mysql>LOAD DATA LOCAL INFILE 'mytbl.txt' INTO TABLE mytbl (b, c, a);

The equivalent mysqlimport statement uses the --columns option to specify the column list:

%mysqlimport --local --columns=b,c,a cookbook mytbl.txt

Preprocessing Input Values Before Inserting Them

Problem

You have values in a datafile that are not in a format that is suitable for loading into a table. For example, values are in the wrong units, or two input fields must be combined and inserted into a single column.

Solution

LOAD DATA has the capability of performing limited preprocessing of input values before inserting them. This enables you to map input data onto more appropriate values before loading them into your table.

Discussion

Specifying Input Column Order shows how you can specify a column list for LOAD DATA to indicate how input fields correspond to table columns. The column list also can name user-defined variables, such that for each input record, the input fields are assigned to the variables. You can then perform calculations with those variables before inserting the result into the table. These calculations are specified in a SET clause that names one or more col_name=expr assignments, separated by commas.

Suppose that you have a datafile that has the following columns, with the first line providing column labels:

Date Time Name Weight State

2006-09-01 12:00:00 Bill Wills 200 Nevada

2006-09-02 09:00:00 Jeff Deft 150 Oklahoma

2006-09-04 03:00:00 Bob Hobbs 225 Utah

2006-09-07 08:00:00 Hank Banks 175 Texas

Suppose also that the file must be loaded into a table that has these columns:

CREATE TABLE t

(

dt DATETIME,

last_name CHAR(10),

first_name CHAR(10),

weight_kg FLOAT,

st_abbrev CHAR(2)

);

There are several mismatches between the datafile fields and the table columns that must be addressed to be able to import the file:

§ The file contains separate date and time columns that must be combined into date-and-time values for insertion into the DATETIME column.

§ The file contains a name field, which must be split into separate first and last name values for insertion into the first_name and last_name columns.

§ The file contains a weight in pounds, which must be converted to kilograms for insertion into the weight_kg column. (The conversion factor is that 1 lb. equals .454 kg.)

§ The file contains state names, but the table contains two-letter abbreviations. The name can be mapped to the abbreviation by performing a lookup in the states table.

To handle these conversions, assign each input column to a user-defined variable, and write a SET clause to perform the calculations. (Remember to skip the first line that contains the column labels.)

mysql>LOAD DATA LOCAL INFILE 'data.txt' INTO TABLE t

-> IGNORE 1 LINES

-> (@date,@time,@name,@weight_lb,@state)

-> SET dt = CONCAT(@date,' ',@time),

-> first_name = SUBSTRING_INDEX(@name,' ',1),

-> last_name = SUBSTRING_INDEX(@name,' ',-1),

-> weight_kg = @weight_lb * .454,

-> st_abbrev = (SELECT abbrev FROM states WHERE name = @state);

As a result of this import operation, the table contains these rows:

mysql>SELECT * FROM t;

+---------------------+-----------+------------+-----------+-----------+

| dt | last_name | first_name | weight_kg | st_abbrev |

+---------------------+-----------+------------+-----------+-----------+

| 2006-09-01 12:00:00 | Wills | Bill | 90.8 | NV |

| 2006-09-02 09:00:00 | Deft | Jeff | 68.1 | OK |

| 2006-09-04 03:00:00 | Hobbs | Bob | 102.15 | UT |

| 2006-09-07 08:00:00 | Banks | Hank | 79.45 | TX |

+---------------------+-----------+------------+-----------+-----------+

LOAD DATA can perform data value reformatting, as shown in this section. Other examples showing uses for this capability occur later in the chapter. For example, Importing Non-ISO Date Values uses it to perform rewriting of non-ISO dates to ISO format during data import. However, although LOAD DATA can map input values to other values, it cannot outright reject an input record that is found to contain unsuitable values. To do that, you can either preprocess the input file to remove these records or issue a DELETE statement after loading the file.
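
For example, after loading the file just shown into the table t, a cleanup pass along these lines could discard rows whose state lookup failed or whose weight did not convert to a positive number (the conditions are only illustrative; the ones you need depend on your own validation rules):

mysql>DELETE FROM t WHERE st_abbrev IS NULL OR weight_kg <= 0;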

Ignoring Datafile Columns

Problem

Your datafile contains columns that should be ignored rather than loaded into the table.

Solution

That’s not a problem if the columns are at the ends of the input lines. Otherwise, you can use a column list with LOAD DATA that assigns the columns to be ignored to a dummy user-defined variable.

Discussion

Extra columns that occur at the end of input lines are easy to handle. If a line contains more columns than are in the table, LOAD DATA just ignores them (although it might produce a nonzero warning count).

Skipping columns in the middle of lines is a bit more involved. Suppose that you want to load information from a Unix password file /etc/passwd, which contains lines in the following format:

account:password:UID:GID:GECOS:directory:shell

Suppose also that you don’t want to bother loading the password column. A table to hold the information in the other columns looks like this:

CREATE TABLE passwd

(

account CHAR(8), # login name

uid INT, # user ID

gid INT, # group ID

gecos CHAR(60), # name, phone, office, etc.

directory CHAR(60), # home directory

shell CHAR(60) # command interpreter

);

To load the file, we need to specify that the column delimiter is a colon, which is easily handled with a FIELDS clause:

FIELDS TERMINATED BY ':'

However, we must also tell LOAD DATA to skip the second field that contains the password. To do this, add a column list in the statement. The list should include the name of each column to be loaded into the table, and a dummy user-defined variable for any column to be ignored:

mysql>LOAD DATA LOCAL INFILE '/etc/passwd' INTO TABLE passwd

-> FIELDS TERMINATED BY ':'

-> (account,@dummy,uid,gid,gecos,directory,shell);

The corresponding mysqlimport command should include a --columns option:

%mysqlimport --local \

--columns="account,@dummy,uid,gid,gecos,directory,shell" \

--fields-terminated-by=":" cookbook /etc/passwd

See Also

Another approach to ignoring columns is to preprocess the input file to remove columns. Extracting and Rearranging Datafile Columns discusses a utility that can pull out and display datafile columns in any order.

Exporting Query Results from MySQL

Problem

You want to export the result of a query from MySQL into a file or another program.

Solution

Use the SELECT ... INTO OUTFILE statement, or redirect the output of the mysql program.

Discussion

MySQL provides a SELECT ... INTO OUTFILE statement that exports a query result directly into a file on the server host. If you want to capture the result on the client host instead, another way to export a query is to redirect the output of the mysql program. These methods have different strengths and weaknesses, so you should get to know them both and apply whichever one best suits a given situation.

Exporting with the SELECT ... INTO OUTFILE statement

The syntax for this statement combines a regular SELECT with INTO OUTFILE filename. The default output format is the same as for LOAD DATA, so the following statement exports the passwd table into /tmp/passwd.txt as a tab-delimited, linefeed-terminated file:

mysql>SELECT * FROM passwd INTO OUTFILE '/tmp/passwd.txt';

You can change the output format using options similar to those used with LOAD DATA that indicate how to quote and delimit columns and records. For example, to export the passwd table in CSV format with CRLF-terminated lines, use this statement:

mysql>SELECT * FROM passwd INTO OUTFILE '/tmp/passwd.txt'

-> FIELDS TERMINATED BY ',' ENCLOSED BY '"'

-> LINES TERMINATED BY '\r\n';

SELECT ... INTO OUTFILE has the following properties:

§ The output file is created directly by the MySQL server, so the filename should indicate where you want the file to be written on the server host. The location for the file is determined using the same rules as for LOAD DATA without LOCAL, as described in Specifying the Datafile Location. There is no LOCAL version of the statement analogous to the LOCAL version of LOAD DATA.

§ You must have the MySQL FILE privilege to execute the SELECT ... INTO OUTFILE statement.

§ The output file must not already exist. (This prevents MySQL from clobbering files that may be important.)

§ You should have a login account on the server host or some way to access files on that host. SELECT ... INTO OUTFILE will be of no value to you if you cannot retrieve the output file.

§ Under Unix, the file is created world readable and is owned by the account used for running the MySQL server. This means that although you’ll be able to read the file, you may not be able to delete it.

Using the mysql client to export data

Because SELECT ... INTO OUTFILE writes the datafile on the server host, you cannot use it unless your MySQL account has the FILE privilege. To export data into a local file, you must use some other strategy. If all you require is tab-delimited output, you can do a “poor-man’s export” by executing a SELECT statement with the mysql program and redirecting the output to a file. That way you can write query results into a file on your local host without the FILE privilege. Here’s an example that exports the login name and command interpreter columns from the passwd table created earlier in this chapter:

%mysql -e "SELECT account, shell FROM passwd" --skip-column-names \

cookbook > shells.txt

The -e option specifies the statement to execute, and --skip-column-names tells MySQL not to write the row of column names that normally precedes statement output.

Note that MySQL writes NULL values as the string “NULL”. Some sort of postprocessing may be necessary to convert them, depending on what you want to do with the output file.
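
One possible approach, shown here only as a sketch, is a Perl one-liner that rewrites standalone NULL fields in tab-delimited output as \N, the sequence that LOAD DATA interprets as NULL (the shells.txt filename comes from the earlier example; shells2.txt is just an arbitrary output name):

%perl -pe 's/(^|\t)NULL(?=\t|$)/$1\\N/g' shells.txt > shells2.txt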

It’s possible to produce output in formats other than tab-delimited by sending the query result into a post-processing filter that converts tabs to something else. For example, to use hash marks as delimiters, convert all tabs to # characters (TAB indicates where you type a tab character in the command):

%mysql --skip-column-names -e "your statement here" db_name \

| sed -e "s/TAB/#/g" > output_file

You can also use tr for this purpose, although the syntax may vary for different implementations of this utility. For Mac OS X or Linux, the command looks like this:

%mysql --skip-column-names -e "your statement here" db_name \

| tr "\t" "#" > output_file

The mysql commands just shown use --skip-column-names to suppress column labels from appearing in the output. Under some circumstances, it may be useful to include the labels. (For example, they might be useful when importing the file later.) If so, omit the --skip-column-names option from the command. In this respect, exporting query results with mysql is more flexible than SELECT ... INTO OUTFILE because the latter cannot produce output that includes column labels.

See Also

Another way to export query results to a file on the client host is to use the mysql_to_text.pl utility described in Writing Your Own Export Programs. That program has options that enable you to specify the output format explicitly. To export a query result as an Excel spreadsheet, see Exchanging Data Between MySQL and Microsoft Excel.

Exporting Tables as Text Files

Problem

You want to export an entire table to a file.

Solution

Use the mysqldump program with the --tab option.

Discussion

The mysqldump program is used to copy or back up tables and databases. It can write table output either as a text datafile or as a set of INSERT statements that recreate the rows in the table. The former capability is described here, the latter in Exporting Table Contents or Definitions in SQL Format and Copying Tables or Databases to Another Server.

To dump a table as a datafile, you must specify a --tab option that indicates the directory on the MySQL server host to which you want the server to write the file. (The directory must already exist; the server won't create it.) For example, to dump the states table from the cookbook database to a file in the /tmp directory, use a command like this:

%mysqldump --no-create-info --tab=/tmp cookbook states

mysqldump creates a datafile using the table name plus a .txt suffix, so this command writes a file named /tmp/states.txt. This form of mysqldump is in some respects the command-line equivalent of SELECT ... INTO OUTFILE. For example, it writes out a table as a datafile on the server host, and you must have the FILE privilege to use it. See Exporting Query Results from MySQL for a list of general properties of SELECT ... INTO OUTFILE.

If you omit the --no-create-info option, mysqldump also creates a file /tmp/states.sql on your local host that contains the CREATE TABLE statement for the table. (The latter file will be owned by you, unlike the datafile, which is owned by the server.)

You can name multiple tables after the database name, in which case mysqldump writes output files for each of them. If you don’t name any tables, mysqldump writes output for every table in the database.

mysqldump creates datafiles in tab-delimited, linefeed-terminated format by default. To control the output format, use the --fields-enclosed-by, --fields-terminated-by, and --lines-terminated-by options (that is, the same options that mysqlimport understands as format specifiers). For example, to write the states table in CSV format with CRLF line endings, use this command:

%mysqldump --no-create-info --tab=/tmp \

--fields-enclosed-by="\"" --fields-terminated-by="," \

--lines-terminated-by="\r\n" cookbook states

A datafile exported this way can be imported using LOAD DATA or mysqlimport. Be sure to use matching format specifiers when importing if you didn’t dump the table using the default format.
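
For example, a command along these lines could reload the CSV dump produced by the preceding mysqldump command, assuming that the file has been copied to the client host as states.txt (mysqlimport derives the table name states from the filename):

%mysqlimport --local --fields-enclosed-by="\"" --fields-terminated-by="," \

--lines-terminated-by="\r\n" cookbook states.txt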

Exporting Table Contents or Definitions in SQL Format

Problem

You want to export tables or databases as SQL statements to make them easier to import later.

Solution

Use the mysqldump program without the --tab option.

Discussion

As discussed in Exporting Tables as Text Files, mysqldump causes the MySQL server to write tables as text datafiles on the server host when it's invoked with the --tab option. If you omit the --tab option, the server formats the table rows as INSERT statements and returns them to mysqldump, which writes the output on the client host. The output also can include the CREATE TABLE statement for each table. This provides a convenient form of output that you can capture in a file and use later to recreate a table or tables. It's common to use such dump files as backups or for copying tables to another MySQL server. This section discusses how to save dump output in a file; Copying Tables or Databases to Another Server shows how to send it directly to another server over the network.

To export a table in SQL format to a file, use a command like this:

%mysqldump cookbook states > states.txt

That creates an output file states.txt that contains both the CREATE TABLE statement and a set of INSERT statements:

-- MySQL dump 10.10

--

-- Host: localhost Database: cookbook

-- ------------------------------------------------------

-- Server version 5.0.27-log

--

-- Table structure for table `states`

--

CREATE TABLE `states` (

`name` varchar(30) NOT NULL,

`abbrev` char(2) NOT NULL,

`statehood` date default NULL,

`pop` bigint(20) default NULL,

PRIMARY KEY (`abbrev`)

);

--

-- Dumping data for table `states`

--

INSERT INTO `states` VALUES ('Alabama','AL','1819-12-14',4530182);

INSERT INTO `states` VALUES ('Alaska','AK','1959-01-03',655435);

INSERT INTO `states` VALUES ('Arizona','AZ','1912-02-14',5743834);

INSERT INTO `states` VALUES ('Arkansas','AR','1836-06-15',2752629);

INSERT INTO `states` VALUES ('California','CA','1850-09-09',35893799);

INSERT INTO `states` VALUES ('Colorado','CO','1876-08-01',4601403);

...

NOTE

The preceding mysqldump output actually was produced by using the --skip-extended-insert option, which causes each row to be written as a separate INSERT statement. If you omit --skip-extended-insert (which is the usual case), mysqldump writes multiple-row INSERT statements. Those are more difficult for you and me to read, but more efficient for the MySQL server to process.

To dump multiple tables, name them all following the database name argument. To dump an entire database, don’t name any tables after the database. This statement dumps all tables in the cookbook database:

%mysqldump cookbook > cookbook.txt

If you want to dump all tables in all databases, invoke mysqldump like this:

%mysqldump --all-databases > dump.txt

In that case, the output file also includes CREATE DATABASE and USE db_name statements at appropriate places so that when you read in the file later, each table is created in the proper database.

Other options are available to control the output format:

--no-create-info

Suppress the CREATE TABLE statements. Use this option when you want to dump table contents only.

--no-data

Suppress the INSERT statements. Use this option when you want to dump table definitions only.

--add-drop-table

Precede each CREATE TABLE statement with a DROP TABLE statement. This is useful for generating a file that you can use later to recreate tables from scratch.

--no-create-db

Suppress the CREATE DATABASE statements that the --all-databases option normally produces.
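
For example, to capture only the table definitions for the cookbook database, in a form you can use later to recreate the tables from scratch (the output filename is arbitrary):

%mysqldump --no-data --add-drop-table cookbook > cookbook-schema.sql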

Suppose now that you’ve used mysqldump to create an SQL-format dump file. How do you import the file back into MySQL? One common mistake at this point is to use mysqlimport. After all, it’s logical to assume that if mysqldump exports tables, mysqlimport must import them. Right? Sorry, no. That might be logical, but it’s not always correct. It’s true that if you use the --tab option with mysqldump, you can import the resulting datafiles with mysqlimport. But if you dump an SQL-format file, mysqlimport won’t process it properly. Use the mysql program instead. The way you do this depends on what’s in the dump file. If you dumped multiple databases using --all-databases, the file will contain the appropriate USE db_name statements to select the databases to which each table belongs, and you need no database argument on the command line:

% mysql < dump.txt

If you dumped tables from a single database, you’ll need to tell mysql which database to import them into:

% mysql db_name < cookbook.txt

Note that with this second import command, it’s possible to load the tables into a database different from the one from which they came originally. For example, you can use this fact to create copies of a table or tables in a test database to help debug data manipulation statements, without worrying about affecting the original tables.
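For example, assuming you have a scratch database named test for which you have privileges (the name here is arbitrary), you can load the cookbook dump into it like this:

% mysql test < cookbook.txt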

Copying Tables or Databases to Another Server

Problem

You want to copy tables or databases from one MySQL server to another.

Solution

Use mysqldump and mysql together, connected by a pipe.

Discussion

SQL-format output from mysqldump can be used to copy tables or databases from one server to another. Suppose that you want to copy the states table from the cookbook database on the local host to the cb database on the host other-host.example.com. One way to do this is to dump the output into a file (as described in Exporting Table Contents or Definitions in SQL Format):

% mysqldump cookbook states > states.txt

Now copy states.txt to other-host.example.com, and run the following command there to import the table into that MySQL server’s cb database:

% mysql cb < states.txt

To accomplish this without using an intermediary file, send the output of mysqldump directly over the network to the remote MySQL server. If you can connect to both servers from your local host, use this command:

% mysqldump cookbook states | mysql -h other-host.example.com cb

The mysqldump half of the command connects to the local server and writes the dump output to the pipe. The mysql half of the command connects to the remote MySQL server on other-host.example.com. It reads the pipe for input and sends each statement to the other-host.example.com server.

If you cannot connect directly to the remote server using mysql from your local host, send the dump output into a pipe that uses ssh to invoke mysql remotely on other-host.example.com:

% mysqldump cookbook states | ssh other-host.example.com mysql cb

ssh connects to other-host.example.com and launches mysql there. It then reads the mysqldump output from the pipe and passes it to the remote mysql process. ssh can be useful when you want to send a dump over the network to a machine that has the MySQL port blocked by a firewall but that enables connections on the SSH port.

To copy multiple tables over the network, name them all following the database argument of the mysqldump command. To copy an entire database, don’t specify any table names after the database name. mysqldump will dump all the tables contained in the database.

If you’re thinking about invoking mysqldump with the --all-databases option to send all your databases to another server, consider that the output will include the tables in the mysql database that contains the grant tables. If the remote server has a different user population, you probably don’t want to replace that server’s grant tables!
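One way to avoid that, sketched here as a variation on the commands above, is to name only the databases you want with mysqldump’s --databases option. The output still includes the CREATE DATABASE and USE statements, so the databases are re-created under their original names on the remote server, but the mysql database is left alone:

% mysqldump --databases cookbook | mysql -h other-host.example.com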

Writing Your Own Export Programs

Problem

MySQL’s built-in export capabilities don’t suffice.

Solution

Write your own utilities.

Discussion

When existing export software doesn’t do what you want, you can write your own programs. This section describes a Perl script, mysql_to_text.pl, that executes an arbitrary statement and exports its result in the format you specify. It writes output to the client host and can include a row of column labels (two things that SELECT ... INTO OUTFILE cannot do). It also produces multiple output formats more easily than using mysql with a postprocessor, and unlike mysqldump, which can write only SQL-format output to the client, it writes datafiles there. You can find mysql_to_text.pl in the transfer directory of the recipes distribution.

mysql_to_text.pl is based on the Text::CSV_XS module, which you’ll need to obtain if it’s not installed on your system. Once that module has been installed, you can read its documentation like so:

% perldoc Text::CSV_XS

This module is convenient because it makes conversion of query output to CSV format relatively trivial: all you have to do is provide an array of column values, and the module packages them up into a properly formatted output line. But the real benefit of using the Text::CSV_XS module is that it’s configurable; you can tell it what kind of delimiter and quote characters to use. This means that although the module produces CSV format by default, you can configure it to write a variety of output formats. For example, if you set the delimiter to tab and the quote character to undef, Text::CSV_XS generates tab-delimited output. We’ll take advantage of that flexibility in this section for writing mysql_to_text.pl, and later in Converting Datafiles from One Format to Another to write a file-processing utility that converts files from one format to another.
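To see that configurability in isolation, here is a small standalone fragment (a sketch, not part of mysql_to_text.pl) that formats the same row first as CSV and then as tab-delimited text simply by changing the constructor options:

use Text::CSV_XS;

my @row = ("Alabama", "AL", "1819-12-14");

# CSV output: comma delimiter, double-quote quoting
my $csv = Text::CSV_XS->new ({ sep_char => ",", quote_char => "\"", eol => "\n" });
$csv->combine (@row) or die "cannot format row\n";
print $csv->string ();                  # Alabama,AL,1819-12-14

# tab-delimited output: tab delimiter, no quoting
my $tab = Text::CSV_XS->new ({ sep_char => "\t", quote_char => undef, eol => "\n" });
$tab->combine (@row) or die "cannot format row\n";
print $tab->string ();                  # Alabama<TAB>AL<TAB>1819-12-14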

mysql_to_text.pl accepts several command-line options. Some of them are used for specifying MySQL connection parameters (such as --user, --password, and --host). You’re already familiar with these, because they’re used by the standard MySQL clients like mysql. The script also can obtain connection parameters from an option file, if you specify a [client] group in the file. mysql_to_text.pl also accepts the following options:

--execute=query, -e query

Execute query and export its output.

--table=tbl_name, -t tbl_name

Export the contents of the named table. This is equivalent to using --execute to specify a query value of SELECT * FROM tbl_name.

--labels

Include an initial row of column labels in the output.

--delim=str

Set the column delimiter to str. The option value can consist of one or more characters. The default is to use tabs.

--quote=c

Set the column value quote character to c. The default is to not quote anything.

--eol=str

Set the end-of-line sequence to str. The option value can consist of one or more characters. The default is to use linefeeds.

The defaults for the --delim, --quote, and --eol options correspond to those used by LOAD DATA and SELECT ... INTO OUTFILE.

The final argument on the command line should be the database name, unless it’s implicit in the statement. For example, these two commands are equivalent; each exports the passwd table from the cookbook database in colon-delimited format:

% mysql_to_text.pl --delim=":" --table=passwd cookbook

% mysql_to_text.pl --delim=":" --table=cookbook.passwd

To generate CSV output with CRLF line terminators instead, use a command like this:

% mysql_to_text.pl --delim="," --quote="\"" --eol="\r\n" \

--table=cookbook.passwd

That’s a general description of how you use mysql_to_text.pl. Now let’s discuss how it works. The initial part of the mysql_to_text.pl script declares a few variables, and then processes the command-line arguments, using option-processing techniques developed in Techniques for Obtaining Connection Parameters. As it happens, most of the code in the script is devoted to processing the command-line arguments and getting set up to run the query. Very little of it involves interaction with MySQL.

#!/usr/bin/perl

# mysql_to_text.pl - export MySQL query output in user-specified text format

# Usage: mysql_to_text.pl [ options ] [db_name] > text_file

use strict;

use warnings;

use DBI;

use Text::CSV_XS;

use Getopt::Long;

$Getopt::Long::ignorecase = 0; # options are case sensitive

$Getopt::Long::bundling = 1; # allow short options to be bundled

#... construct usage message variable $usage (not shown) ...

# Variables for command line options - all undefined initially

# except for options that control output structure, which is set

# to be tab-delimited, linefeed-terminated.

my $help;

my ($host_name, $password, $port_num, $socket_name, $user_name, $db_name);

my ($stmt, $tbl_name);

my $labels;

my $delim = "\t";

my $quote;

my $eol = "\n";

GetOptions (

# =i means an integer argument is required after the option

# =s means a string value is required after the option

"help" => \$help, # print help message

"host|h=s" => \$host_name, # server host

"password|p=s" => \$password, # password

"port|P=i" => \$port_num, # port number

"socket|S=s" => \$socket_name, # socket name

"user|u=s" => \$user_name, # username

"execute|e=s" => \$stmt, # statement to execute

"table|t=s" => \$tbl_name, # table to export

"labels|l" => \$labels, # generate row of column labels

"delim=s" => \$delim, # column delimiter

"quote=s" => \$quote, # column quoting character

"eol=s" => \$eol # end-of-line (record) delimiter

) or die "$usage\n";

die "$usage\n" if defined $help;

$db_name = shift (@ARGV) if @ARGV;

# One of --execute or --table must be specified, but not both

die "You must specify a query or a table name\n\n$usage\n"

if !defined ($stmt) && !defined ($tbl_name);

die "You cannot specify both a query and a table name\n\n$usage\n"

if defined ($stmt) && defined ($tbl_name);

# If table name was given, use it to create query that selects entire table

$stmt = "SELECT * FROM $tbl_name" if defined ($tbl_name);

# interpret special chars in the file structure options

$quote = interpret_option ($quote);

$delim = interpret_option ($delim);

$eol = interpret_option ($eol);

The interpret_option() function (not shown) processes escape and hex sequences for the --delim, --quote, and --eol options. It interprets \n, \r, \t, and \0 as linefeed, carriage return, tab, and the ASCII NUL character. It also interprets hex values, which can be given in 0xnn form (for example, 0x0d indicates a carriage return).
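The following shows one way such a helper might be written; it is a sketch based on the preceding description, not necessarily identical to the function in the recipes distribution:

sub interpret_option
{
  my $val = $_[0];

  return undef unless defined ($val);
  # hex value such as 0x0d (carriage return)
  return chr (hex ($val)) if $val =~ /^0x[0-9a-fA-F]+$/;
  # escape sequences for linefeed, carriage return, tab, and ASCII NUL
  $val =~ s/\\n/\n/g;
  $val =~ s/\\r/\r/g;
  $val =~ s/\\t/\t/g;
  $val =~ s/\\0/\0/g;
  return $val;
}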

After processing the command-line options, mysql_to_text.pl constructs the data source name (DSN) and connects to the server:

my $dsn = "DBI:mysql:";

$dsn .= ";database=$db_name" if $db_name;

$dsn .= ";host=$host_name" if $host_name;

$dsn .= ";port=$port_num" if $port_num;

$dsn .= ";mysql_socket=$socket_name" if $socket_name;

# read [client] group parameters from standard option files

$dsn .= ";mysql_read_default_group=client";

my %conn_attrs = (PrintError => 0, RaiseError => 1, AutoCommit => 1);

my $dbh = DBI->connect ($dsn, $user_name, $password, \%conn_attrs);

The database name comes from the command line. Connection parameters can come from the command line or an option file. (Use of MySQL option files is covered in Techniques for Obtaining Connection Parameters.)

After establishing a connection to MySQL, the script is ready to execute the query and produce some output. This is where the Text::CSV_XS module comes into play. First, create a CSV object by calling new(), which takes an optional hash of options that control how the object handles data lines. The script prepares and executes the query, prints a row of column labels (if the user specified the --labels option), and writes the rows of the result set:

my $csv = Text::CSV_XS->new ({

sep_char => $delim,

quote_char => $quote,

escape_char => $quote,

eol => $eol,

binary => 1

});

my $sth = $dbh->prepare ($stmt);

$sth->execute ();

if ($labels) # write row of column labels

{

$csv->combine (@{$sth->{NAME}}) or die "cannot process column labels\n";

print $csv->string ();

}

my $count = 0;

while (my @val = $sth->fetchrow_array ())

{

++$count;

$csv->combine (@val) or die "cannot process column values, row $count\n";

print $csv->string ();

}

The sep_char and quote_char options in the new() call set the column delimiter and quoting character. The escape_char option is set to the same value as quote_char so that instances of the quote character occurring within data values are doubled in the output. The eol option indicates the line-termination sequence. Normally, Text::CSV_XS leaves it to you to print the terminator for output lines. By passing a non-undef eol value to new(), the module adds that value to every output line automatically. The binary option is useful for processing data values that contain binary characters.

The column labels are available in $sth->{NAME} after invoking execute(). Each line of output is produced using combine() and string(). The combine() method takes an array of values and converts them to a properly formatted string. string() returns the string so we can print it.

Converting Datafiles from One Format to Another

Problem

You want to convert a file to a different format to make it easier to work with, or so that another program can understand it.

Solution

Use the cvt_file.pl converter script described here.

Discussion

The mysql_to_text.pl script discussed in Writing Your Own Export Programs uses MySQL as a data source and produces output in the format you specify via the --delim, --quote, and --eol options. This section describes cvt_file.pl, a utility that provides similar formatting options, but for both input and output. It reads data from a file rather than from MySQL, and converts it from one format to another. For example, to read a tab-delimited file data.txt, convert it to colon-delimited format, and write the result to tmp.txt, invoke cvt_file.pl like this:

% cvt_file.pl --idelim="\t" --odelim=":" data.txt > tmp.txt

The cvt_file.pl script has separate options for input and output. Thus, whereas mysql_to_text.pl has just a --delim option for specifying the column delimiter, cvt_file.pl has separate --idelim and --odelim options to set the input and output column delimiters. But as a shortcut, --delim is also supported; it sets the delimiter for both input and output. The full set of options that cvt_file.pl understands is as follows:

--idelim=str, --odelim=str, --delim=str

Set the column delimiter for input, output, or both. The option value can consist of one or more characters.

--iquote=c, --oquote=c, --quote=c

Set the column quote character for input, output, or both.

--ieol=str, --oeol=str, --eol=str

Set the end-of-line sequence for input, output, or both. The option value can consist of one or more characters.

--iformat=format, --oformat=format, --format=format

Specify an input format, an output format, or both. This option is shorthand for setting the quote and delimiter values. For example, --iformat=csv sets the input quote and delimiter characters to double quote and comma. --iformat=tab sets them to “no quotes” and tab.

--ilabels, --olabels, --labels

Expect an initial line of column labels for input, write an initial line of labels for output, or both. If you request labels for the output but do not read labels from the input, cvt_file.pl uses column labels of c1, c2, and so forth.

cvt_file.pl assumes the same default file format as LOAD DATA and SELECT ... INTO OUTFILE, that is, tab-delimited lines terminated by linefeeds.

cvt_file.pl can be found in the transfer directory of the recipes distribution. If you expect to use it regularly, you should install it in some directory that’s listed in your search path so that you can invoke it from anywhere. Much of the source for the script is similar to mysql_to_text.pl, so rather than showing the code and discussing how it works, I’ll just give some examples that illustrate how to use it:

§ Read a file in CSV format with CRLF line termination, and write tab-delimited output with linefeed termination:

% cvt_file.pl --iformat=csv --ieol="\r\n" --oformat=tab --oeol="\n" \

data.txt > tmp.txt

§ Read and write CSV format, converting CRLF line terminators to carriage returns:

% cvt_file.pl --format=csv --ieol="\r\n" --oeol="\r" data.txt > tmp.txt

§ Produce a tab-delimited file from the colon-delimited /etc/passwd file:

% cvt_file.pl --idelim=":" /etc/passwd > tmp.txt

§ Convert tab-delimited query output from mysql into CSV format:

% mysql -e "SELECT * FROM profile" cookbook \

| cvt_file.pl --oformat=csv > profile.csv

Extracting and Rearranging Datafile Columns

Problem

You want to pull out columns from a datafile or rearrange them into a different order.

Solution

Use a utility that can produce columns from a file on demand.

Discussion

cvt_file.pl serves as a tool that converts entire files from one format to another. Another common datafile operation is to manipulate a file’s columns. This is necessary, for example, when importing a file into a program that doesn’t understand how to extract or rearrange input columns for itself. To work around this problem, you can rearrange the datafile instead.

Recall that this chapter began with a description of a scenario involving a 12-column CSV file somedata.csv from which only columns 2, 11, 5, and 9 were needed. You can convert the file to tab-delimited format like this:

% cvt_file.pl --iformat=csv somedata.csv > somedata.txt

But then what? If you just want to knock out a short script to extract those specific four columns, that’s fairly easy: write a loop that reads input lines and writes only the columns you want in the proper order. But that would be a special-purpose script, useful only within a highly limited context. With just a little more effort, it’s possible to write a more general utility yank_col.pl that enables you to extract any set of columns. With such a tool, you’d specify the column list on the command line like this:

% yank_col.pl --columns=2,11,5,9 somedata.txt > tmp.txt

Because the script doesn’t use a hardcoded column list, it can be used to pull out an arbitrary set of columns in any order. Columns can be specified as a comma-separated list of column numbers or column ranges. (For example, --columns=1,10,4-7 means columns 1, 10, 4, 5, 6, and 7.) yank_col.pl looks like this:

#!/usr/bin/perl

# yank_col.pl - extract columns from input

# Example: yank_col.pl --columns=2,11,5,9 filename

# Assumes tab-delimited, linefeed-terminated input lines.

use strict;

use warnings;

use Getopt::Long;

$Getopt::Long::ignorecase = 0; # options are case sensitive

my $prog = "yank_col.pl";

my $usage = <<EOF;

Usage: $prog [options] [data_file]

Options:

--help

Print this message

--columns=column-list

Specify columns to extract, as a comma-separated list of column positions

EOF

my $help;

my $columns;

GetOptions (

"help" => \$help, # print help message

"columns=s" => \$columns # specify column list

) or die "$usage\n";

die "$usage\n" if defined $help;

my @col_list = split (/,/, $columns) if defined ($columns);

@col_list or die "$usage\n"; # nonempty column list is required

# make sure column specifiers are positive integers, and convert from

# 1-based to 0-based values

my @tmp;

for (my $i = 0; $i < @col_list; $i++)

{

if ($col_list[$i] =~ /^\d+$/) # single column number

{

die "Column specifier $col_list[$i] is not a positive integer\n"

unless $col_list[$i] > 0;

push (@tmp, $col_list[$i] - 1);

}

elsif ($col_list[$i] =~ /^(\d+)-(\d+)$/) # column range m-n

{

my ($begin, $end) = ($1, $2);

die "$col_list[$i] is not a valid column specifier\n"

unless $begin > 0 && $end > 0 && $begin <= $end;

while ($begin <= $end)

{

push (@tmp, $begin - 1);

++$begin;

}

}

else

{

die "$col_list[$i] is not a valid column specifier\n";

}

}

@col_list = @tmp;

while (<>) # read input

{

chomp;

my @val = split (/\t/, $_, 10000); # split, preserving all fields

# extract desired columns, mapping undef to empty string (can

# occur if an index exceeds number of columns present in line)

@val = map { defined ($_) ? $_ : "" } @val[@col_list];

print join ("\t", @val) . "\n";

}

The input processing loop converts each line to an array of values, and then pulls out from the array the values corresponding to the requested columns. To avoid looping through the array, it uses Perl’s notation that allows a list of subscripts to be specified all at once to request multiple array elements. For example, if @col_list contains the values 2, 6, and 3, these two expressions are equivalent:

($val[2], $val[6], $val[3])

@val[@col_list]

What if you want to extract columns from a file that’s not in tab-delimited format, or produce output in another format? In that case, combine yank_col.pl with the cvt_file.pl script discussed in Converting Datafiles from One Format to Another. Suppose that you want to pull out all but the password column from the colon-delimited /etc/passwd file and write the result in CSV format. Use cvt_file.pl both to preprocess /etc/passwd into tab-delimited format for yank_col.pl and to postprocess the extracted columns into CSV format:

% cvt_file.pl --idelim=":" /etc/passwd \

| yank_col.pl --columns=1,3-7 \

| cvt_file.pl --oformat=csv > passwd.csv

If you don’t want to type all of that as one long command, use temporary files for the intermediate steps:

% cvt_file.pl --idelim=":" /etc/passwd > tmp1.txt

% yank_col.pl --columns=1,3-7 tmp1.txt > tmp2.txt

% cvt_file.pl --oformat=csv tmp2.txt > passwd.csv

% rm tmp1.txt tmp2.txt

FORCING SPLIT() TO RETURN EVERY FIELD

The Perl split() function is extremely useful, but normally it doesn’t return trailing empty fields. This means that if you write only as many fields as split() returns, output lines may not have the same number of fields as input lines. To avoid this problem, pass a third argument to indicate the maximum number of fields to return. This forces split() to return as many fields as are actually present on the line or the number requested, whichever is smaller. If the value of the third argument is large enough, the practical effect is to cause all fields to be returned, empty or not. Scripts shown in this chapter use a field count value of 10,000:

# split line at tabs, preserving all fields

my @val = split (/\t/, $_, 10000);

In the (unlikely?) event that an input line has more fields than that, it will be truncated. If you think that will be a problem, you can bump up the number even higher.

Using the SQL Mode to Control Bad Input Data Handling

Problem

By default, MySQL is forgiving about accepting data values that are invalid, out of range, or otherwise unsuitable for the data types of the columns into which you insert them. (The server accepts the values and attempts to coerce them to the closest legal value.) But you want the server to be more restrictive and not accept bad data.

Solution

Set the SQL mode. Several mode values are available to control how strict the server is. Some of these modes apply generally to all input values. Others apply to specific data types such as dates.

Discussion

Normally, MySQL accepts data and coerces it to the data types of your table columns if the input doesn’t match. Consider the following table, which has integer, string, and date columns:

mysql> CREATE TABLE t (i INT, c CHAR(6), d DATE);

Inserting a row with unsuitable data values into the table causes warnings (which you can see with SHOW WARNINGS), but the values are loaded into the table after being coerced to whatever the closest legal value is (or at least to some value that fits the column):

mysql> INSERT INTO t (i,c,d) VALUES('-1x','too-long string!','1999-02-31');

mysql> SHOW WARNINGS;

+---------+------+----------------------------------------+

| Level   | Code | Message                                |

+---------+------+----------------------------------------+

| Warning | 1265 | Data truncated for column 'i' at row 1 |

| Warning | 1265 | Data truncated for column 'c' at row 1 |

| Warning | 1265 | Data truncated for column 'd' at row 1 |

+---------+------+----------------------------------------+

mysql> SELECT * FROM t;

+------+--------+------------+

| i    | c      | d          |

+------+--------+------------+

|   -1 | too-lo | 0000-00-00 |

+------+--------+------------+

Prior to MySQL 5.0, the way to prevent these warnings was to check the input data on the client side to make sure that it’s legal. This remains a reasonable strategy in certain circumstances (see the sidebar “Server-Side Versus Client-Side Validation” in Validating and Transforming Data), but MySQL 5.0 and up offers you an alternative: let the server check data values on the server side and reject them with an error if they’re invalid. Then you don’t have to check them.

To use this strategy, enable restrictions on input data acceptance by setting the sql_mode system variable. You can set the SQL mode to cause the server to be much more restrictive than it is by default. With the proper restrictions in place, data values that would otherwise result in conversions and warnings result in errors instead. Try the preceding INSERT again after enabling “strict” SQL mode:

mysql> SET sql_mode = 'STRICT_ALL_TABLES';

mysql> INSERT INTO t (i,c,d) VALUES('-1x','too-long string!','1999-02-31');

ERROR 1265 (01000): Data truncated for column 'i' at row 1

Here the statement doesn’t even progress to the second and third data values because the first is invalid for an integer column and the server raises an error.

Even without enabling input restrictions, the server as of MySQL 5.0 is a bit more strict about date checking than previous versions. Before 5.0, the server checked only that the month and day parts of date values are in the ranges 1 to 12 and 1 to 31, respectively. This allowed a date such as '2005-02-31' to be entered. As of MySQL 5.0, the month must be from 1 to 12 (as before), but the day value must be legal for the given month. This means that '2005-02-31' generates a warning now by default.

Although date checking is somewhat more restrictive as of MySQL 5.0, MySQL still allows dates such as '1999-11-00' or '1999-00-00' that have zero parts, or the “zero” date ('0000-00-00'), and this is true even in strict mode. If you want to restrict these kinds of date values, enable the NO_ZERO_IN_DATE and NO_ZERO_DATE SQL modes to cause warnings, or errors in strict mode. For example, to disallow dates with zero parts or “zero” dates, set the SQL mode like this:

mysql> SET sql_mode = 'STRICT_ALL_TABLES,NO_ZERO_IN_DATE,NO_ZERO_DATE';

A simpler way to enable these restrictions, and a few more besides, is to enable TRADITIONAL SQL mode. TRADITIONAL mode is actually a constellation of modes, as you can see by setting the sql_mode value and then displaying it:

mysql> SET sql_mode = 'TRADITIONAL';

mysql> SELECT @@sql_mode\G

*************************** 1. row ***************************

@@sql_mode: STRICT_TRANS_TABLES,STRICT_ALL_TABLES,NO_ZERO_IN_DATE,

NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_ZERO,TRADITIONAL,

NO_AUTO_CREATE_USER

You can read more about the various SQL modes in the MySQL Reference Manual.

The examples shown set the session value of the sql_mode system variable, so they change the SQL mode only for your current connection. To set the mode globally for all clients, set the global sql_mode value (this requires the SUPER privilege):

mysql> SET GLOBAL sql_mode = 'mode_value';
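For example, to make the TRADITIONAL restrictions the default for all clients for the remainder of the server run:

mysql> SET GLOBAL sql_mode = 'TRADITIONAL';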

Validating and Transforming Data

Problem

You need to make sure that the data values contained in a file are legal.

Solution

Check them, possibly rewriting them into a more suitable format.

Discussion

Earlier recipes in this chapter show how to work with the structural characteristics of files, by reading lines and breaking them up into separate columns. It’s important to be able to do that, but sometimes you need to work with the data content of a file, not just its structure:

§ It’s often a good idea to validate data values to make sure they’re legal for the data types into which you’re storing them. For example, you can make sure that values intended for INT, DATE, and ENUM columns are integers, dates in CCYY-MM-DD format, and legal enumeration values, respectively.

§ Data values may need reformatting. Rewriting dates from one format to another is especially common; for example, a program may write dates in MM-DD-YY format that must be rewritten into ISO format for import into MySQL. If a program understands only date and time formats and not a combined date-and-time format (such as MySQL uses for the DATETIME and TIMESTAMP data types), you need to split date-and-time values into separate date and time values.

§ It may be necessary to recognize special values in the file. It’s common to represent NULL with a value that does not otherwise occur in the file, such as -1, Unknown, or N/A. If you don’t want those values to be imported literally, you need to recognize and handle them specially.

This is the first of a set of recipes that describe validation and reformatting techniques that are useful in these kinds of situations. Techniques covered here for checking values include pattern matching and validation against information in a database. It’s not unusual for certain validation operations to come up over and over, in which case you’ll probably find it useful to construct a library of functions. By packaging validation operations as library routines, it is easier to write utilities based on them, and the utilities make it easier to perform command-line operations on entire files so that you can avoid editing them yourself.

SERVER-SIDE VERSUS CLIENT-SIDE VALIDATION

As described in Using the SQL Mode to Control Bad Input Data Handling, you can cause data validation to be done on the server side by setting the SQL mode to be restrictive about accepting bad input data. In this case, the MySQL server raises an error for values that aren’t legal for the data types of the columns into which you insert them.

In the next few sections, the focus is validation on the client side rather than on the server side. Client-side validation can be useful when you require more control over validation than simply receiving an error from the server. (For example, if you test values yourself, it’s often easier to provide more informative messages about the exact nature of problems with the values.) Also, it might be necessary to couple validation with reformatting to transform complex values so that they are compatible with MySQL data types. You have more flexibility to do this on the client side.

If you want to avoid writing your own library routines, look around to see if someone else has already written suitable routines that you can use. For example, if you check the Perl CPAN (cpan.perl.org), you’ll find a Data::Validate module hierarchy. The modules there provide library routines that standardize a number of common validation tasks. Data::Validate::MySQL deals specifically with MySQL data types.

Writing an input-processing loop

Many of the validation recipes shown in the next few sections are typical of those that you perform within the context of a program that reads a file and checks individual column values. The general framework for such a file-processing utility can be written like this:

#!/usr/bin/perl

# loop.pl - Typical input-processing loop

# Assumes tab-delimited, linefeed-terminated input lines.

use strict;

use warnings;

while (<>) # read each line

{

chomp;

# split line at tabs, preserving all fields

my @val = split (/\t/, $_, 10000);

for my $i (0 .. @val - 1) # iterate through columns in line

{

# ... test $val[$i] here ...

}

}

The while() loop reads each input line. Inside the loop, each line is broken into fields. Then the inner for() loop iterates through the fields in the line, allowing each to be processed in sequence. If you’re not applying a given test uniformly to all the fields, replace the for() loop with separate column-specific tests.

This loop assumes tab-delimited, linefeed-terminated input, an assumption that is shared by most of the utilities discussed throughout the rest of this chapter. To use these programs with datafiles in other formats, you may be able to convert the files into tab-delimited format using the cvt_file.pl script discussed in Converting Datafiles from One Format to Another.

Putting common tests in libraries

For a test that you perform often, it may be useful to package it as a library function. This makes the operation easy to perform and also gives it a name that’s likely to make the meaning of the operation clearer than the comparison code itself. For example, the following test performs a pattern match to check that $val consists entirely of digits (optionally preceded by a plus sign), and then makes sure the value is greater than zero:

$valid = ($val =~ /^\+?\d+$/ && $val > 0);

In other words, the test looks for strings that represent positive integers. To make the test easier to use and its intent clearer, you might put it into a function that is used like this:

$valid = is_positive_integer ($val);

The function itself can be defined as follows:

sub is_positive_integer

{

my $s = $_[0];

return ($s =~ /^\+?\d+$/ && $s > 0);

}

Now put the function definition into a library file so that multiple scripts can use it easily. The Cookbook_Utils.pm module file in the lib directory of the recipes distribution is an example of a library file that contains a number of validation functions. Take a look through it to see which functions may be useful in your own programs (or as a model for writing your own library files). To gain access to this module from within a script, include a use statement like this:

use Cookbook_Utils;

You must of course install the module file in a directory where Perl will find it. For details on library installation, see Writing Library Files.

A significant benefit of putting a collection of utility routines into a library file is that you can use it for all kinds of programs. It’s rare for a data manipulation problem to be completely unique. If you can pick and choose at least a few validation routines from a library, it’s possible to reduce the amount of code you need to write, even for highly specialized programs.
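To illustrate how such a routine plugs into the input-processing loop shown earlier, the following sketch checks that the third column of each input line is a positive integer; the column choice is arbitrary, and it assumes that is_positive_integer() has been placed in Cookbook_Utils.pm as just described:

#!/usr/bin/perl
# check_col3.pl - sketch: report lines whose third column is not a positive integer
# Assumes tab-delimited, linefeed-terminated input lines.

use strict;
use warnings;

use Cookbook_Utils;   # assumed to provide is_positive_integer()

while (<>)
{
  chomp;
  my @val = split (/\t/, $_, 10000);          # split, preserving all fields
  my $v = defined ($val[2]) ? $val[2] : "";   # column 3 (0-based index 2)
  warn "line $.: column 3 value '$v' is not a positive integer\n"
    unless is_positive_integer ($v);
}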

Using Pattern Matching to Validate Data

Problem

You need to compare a value to a set of values that is difficult to specify literally without writing a really ugly expression.

Solution

Use pattern matching.

Discussion

Pattern matching is a powerful tool for validation because it enables you to test entire classes of values with a single expression. You can also use pattern tests to break up matched values into subparts for further individual testing or in substitution operations to rewrite matched values. For example, you might break a matched date into pieces so that you can verify that the month is in the range from 1 to 12, and the day is within the number of days in the month. You might use a substitution to reorder MM-DD-YY or DD-MM-YY values into YY-MM-DD format.

The next few sections describe how to use patterns to test for several types of values, but first let’s take a quick tour of some general pattern-matching principles. The following discussion focuses on Perl’s regular expression capabilities. Pattern matching in Ruby, PHP, and Python is similar, although you should consult the relevant documentation for any differences. For Java, use the java.util.regex package.

In Perl, the pattern constructor is / pat /:

$it_matched = ($val =~ /pat/); # pattern match

Put an i after the / pat / constructor to make the pattern match case-insensitive:

$it_matched = ($val =~ /pat/i); # case-insensitive match

To use a character other than slash, begin the constructor with m. This can be useful if the pattern itself contains slashes:

$it_matched = ($val =~ m|pat|); # alternate constructor character

To look for a nonmatch, replace the =~ operator with the !~ operator:

$no_match = ($val !~ /pat/); # negated pattern match

To perform a substitution in $val based on a pattern match, use s/ pat / replacement /. If pat occurs within $val, it’s replaced by replacement. To perform a case-insensitive match, put an i after the last slash. To perform a global substitution that replaces all instances of pat rather than just the first one, add a g after the last slash:

$val =~ s/pat/replacement/; # substitution

$val =~ s/pat/replacement/i; # case-insensitive substitution

$val =~ s/pat/replacement/g; # global substitution

$val =~ s/pat/replacement/ig; # case-insensitive and global

Here’s a list of some of the special pattern elements available in Perl regular expressions:

Pattern

What the pattern matches

^

Beginning of string

$

End of string

.

Any character

\s, \S

Whitespace or nonwhitespace character

\d, \D

Digit or nondigit character

\w, \W

Word (alphanumeric or underscore) or non-word character

[...]

Any character listed between the square brackets

[^...]

Any character not listed between the square brackets

p1 | p2 | p3

Alternation; matches any of the patterns p1, p2, or p3

*

Zero or more instances of preceding element

+

One or more instances of preceding element

{ n }

n instances of preceding element

{ m , n }

m through n instances of preceding element

Many of these pattern elements are the same as those available for MySQL’s REGEXP regular expression operator (Pattern Matching with Regular Expressions).

To match a literal instance of a character that is special within patterns, such as *, ^, or $, precede it with a backslash. Similarly, to include a character within a character class construction that is special in character classes ([, ], or -), precede it with a backslash. To include a literal ^ in a character class, list it somewhere other than as the first character between the brackets.

Many of the validation patterns shown in the following sections are of the form /^pat$/. Beginning and ending a pattern with ^ and $ has the effect of requiring pat to match the entire string that you’re testing. This is common in data validation contexts, because it’s generally desirable to know that a pattern matches an entire input value, not just part of it. (If you want to be sure that a value represents an integer, for example, it doesn’t do you any good to know only that it contains an integer somewhere.) This is not a hard-and-fast rule, however, and sometimes it’s useful to perform a more relaxed test by omitting the ^ and $ characters as appropriate. For example, if you want to strip leading and trailing whitespace from a value, use one pattern anchored only to the beginning of the string, and another anchored only to the end:

$val =~ s/^\s+//; # trim leading whitespace

$val =~ s/\s+$//; # trim trailing whitespace

That’s such a common operation, in fact, that it’s a good candidate for being written as a utility function. The Cookbook_Utils.pm file contains a function trim_whitespace() that performs both substitutions and returns the result:

$val = trim_whitespace ($val);
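The function itself amounts to little more than the two substitutions just shown. A version equivalent in spirit to the one in Cookbook_Utils.pm looks like this (a sketch reconstructed from those substitutions):

sub trim_whitespace
{
  my $s = $_[0];

  $s =~ s/^\s+//;    # trim leading whitespace
  $s =~ s/\s+$//;    # trim trailing whitespace
  return $s;
}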

To remember subsections of a string that is matched by a pattern, use parentheses around the relevant parts of the pattern. After a successful match, you can refer to the matched substrings using the variables $1, $2, and so forth:

if ("abcdef" =~ /^(ab)(.*)$/)

{

$first_part = $1; # this will be ab

$the_rest = $2; # this will be cdef

}

To indicate that an element within a pattern is optional, follow it with a ? character. To match values consisting of a sequence of digits, optionally beginning with a minus sign, and optionally ending with a period, use this pattern:

/^-?\d+\.?$/

You can also use parentheses to group alternations within a pattern. The following pattern matches time values in hh:mm format, optionally followed by AM or PM:

/^\d{1,2}:\d{2}\s*(AM|PM)?$/i

The use of parentheses in that pattern also has the side effect of remembering the optional part in $1. To suppress that side effect, use (?: pat ) instead:

/^\d{1,2}:\d{2}\s*(?:AM|PM)?$/i

That’s sufficient background in Perl pattern matching to allow construction of useful validation tests for several types of data values. The following sections provide patterns that can be used to test for broad content types, numbers, temporal values, and email addresses or URLs.

The transfer directory of the recipes distribution contains a test_pat.pl script that reads input values, matches them against several patterns, and reports which patterns each value matches. The script is easily extensible, so you can use it as a test harness to try your own patterns.
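If you don’t have the recipes distribution at hand, a bare-bones harness in the same spirit is easy to write. The following sketch reads values one per line and reports which of a few sample patterns each value matches; the real test_pat.pl is organized differently and covers many more patterns:

#!/usr/bin/perl
# pattern test harness sketch: report which patterns each input value matches
use strict;
use warnings;

my %pattern =
(
  "unsigned integer" => qr/^\d+$/,
  "ISO date"         => qr/^\d{4}-\d{2}-\d{2}$/,
  "time hh:mm:ss"    => qr/^\d{1,2}:\d{2}:\d{2}$/
);

while (<>)                         # read one value per line
{
  chomp;
  my $val = $_;
  my @hits = grep { $val =~ $pattern{$_} } sort keys %pattern;
  print "$val: ", (@hits ? join (", ", @hits) : "no matches"), "\n";
}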

Using Patterns to Match Broad Content Types

Problem

You want to classify values into broad categories.

Solution

Use a pattern that is similarly broad.

Discussion

If you need to know whether values are empty or nonempty, or consist only of certain types of characters, the patterns listed in the following table may suffice:

Pattern

Type of value the pattern matches

/^$/

Empty value

/./

Nonempty value

/^\s*$/

Whitespace, possibly empty

/^\s+$/

Nonempty whitespace

/\S/

Nonempty, and not just whitespace

/^\d+$/

Digits only, nonempty

/^[a-z]+$/i

Alphabetic characters only (case-insensitive), nonempty

/^\w+$/

Alphanumeric or underscore characters only, nonempty

Using Patterns to Match Numeric Values

Problem

You need to make sure a string looks like a number.

Solution

Use a pattern that matches the type of number you’re looking for.

Discussion

Patterns can be used to classify values into several types of numbers.

Pattern

Type of value the pattern matches

/^\d+$/

Unsigned integer

/^-?\d+$/

Negative or unsigned integer

/^[-+]?\d+$/

Signed or unsigned integer

/^[-+]?(\d+(\.\d*)?|\.\d+)$/

Floating-point number

The pattern /^\d+$/ matches unsigned integers by requiring a nonempty value that consists only of digits from the beginning to the end of the value. If you care only that a value begins with an integer, you can match an initial numeric part and extract it. To do this, match just the initial part of the string (omit the $ that requires the pattern to match to the end of the string) and place parentheses around the \d+ part. Then refer to the matched number as $1 after a successful match:

if ($val =~ /^(\d+)/)

{

$val = $1; # reset value to matched subpart

}

You could also add zero to the value, which causes Perl to perform an implicit string-to-number conversion that discards the nonnumeric suffix:

if ($val =~ /^\d+/)

{

$val += 0;

}

However, if you run Perl with the -w option or include a use warnings line in your script (which I recommend), this form of conversion generates warnings for values that actually have a nonnumeric part. It will also convert string values like 0013 to the number 13, which may be unacceptable in some contexts.

Some kinds of numeric values have a special format or other unusual constraints. Here are a few examples and how to deal with them:

Zip codes

Zip and Zip+4 codes are postal codes used for mail delivery in the United States. They have values like 12345 or 12345-6789 (that is, five digits, possibly followed by a dash and four more digits). To match one form or the other, or both forms, use the following patterns:

Pattern

Type of value the pattern matches

/^\d{5}$/

Zip code, five digits only

/^\d{5}-\d{4}$/

Zip+4 code

/^\d{5}(-\d{4})?$/

Zip or Zip+4 code

Credit card numbers

Credit card numbers typically consist of digits, but it’s common for values to be written with spaces, dashes, or other characters between groups of digits. For example, the following numbers are considered equivalent:

0123456789012345

0123 4567 8901 2345

0123-4567-8901-2345

To match such values, use this pattern:

/^[- \d]+$/

(Perl allows the \d digit specifier within character classes.) However, that pattern doesn’t identify values of the wrong length, and it may be useful to remove extraneous characters. If you require credit card values to contain 16 digits, use a substitution to remove all nondigits, and then check the length of the result:

$val =~ s/\D//g;

$valid = (length ($val) == 16);

Innings pitched

In baseball, one statistic recorded for pitchers is the number of innings pitched, measured in thirds of innings (corresponding to the number of outs recorded.) These values are numeric, but must satisfy a specific additional constraint: a fractional part is allowed, but if present, must consist of a single digit 0, 1, or 2. That is, legal values are of the form 0, .1, .2, 1, 1.1, 1.2, 2, and so forth. To match an unsigned integer (optionally followed by a decimal point and perhaps a fractional digit of 0, 1, or 2), or a fractional digit with no leading integer, use this pattern:

/^(\d+(\.[012]?)?|\.[012])$/

The alternatives in the pattern are grouped within parentheses because otherwise the ^ anchors only the first of them to the beginning of the string, and the $ anchors only the second to the end of the string.

Using Patterns to Match Dates or Times

Problem

You need to make sure a string looks like a date or time.

Solution

Use a pattern that matches the type of temporal value you expect. Be sure to consider issues such as how strict to be about delimiters between subparts and the lengths of the subparts.

Discussion

Dates are a validation headache because they come in so many formats. Pattern tests are extremely useful for weeding out illegal values, but often insufficient for full verification: a date might have a number where you expect a month, but if the number is 13, the date isn’t valid. This section introduces some patterns that match a few common date formats. Performing Validity Checking on Date or Time Subparts revisits this topic in more detail and discusses how to combine pattern tests with content verification.

To require values to be dates in ISO (CCYY-MM-DD) format, use this pattern:

/^\d{4}-\d{2}-\d{2}$/

The pattern requires the - character as the delimiter between date parts. To allow either - or / as the delimiter, use a character class between the numeric parts (the slashes are escaped with a backslash to prevent them from being interpreted as the end of the pattern constructor):

/^\d{4}[-\/]\d{2}[-\/]\d{2}$/

Or you can use a different delimiter around the pattern and avoid the backslashes:

m|^\d{4}[-/]\d{2}[-/]\d{2}$|

To allow any non-digit delimiter (which corresponds to how MySQL operates when it interprets strings as dates), use this pattern:

/^\d{4}\D\d{2}\D\d{2}$/

If you don’t require the full number of digits in each part (to allow leading zeros in values like 03 to be missing, for example), just look for three nonempty digit sequences:

/^\d+\D\d+\D\d+$/

Of course, that pattern is so general that it will also match other values such as U.S. Social Security numbers (which have the format 012-34-5678). To constrain the subpart lengths by requiring two to four digits in the year part and one or two digits in the month and day parts, use this pattern:

/^\d{2,4}?\D\d{1,2}\D\d{1,2}$/

For dates in other formats such as MM-DD-YY or DD-MM-YY, similar patterns apply, but the subparts are arranged in a different order. This pattern matches both of those formats:

/^\d{2}-\d{2}-\d{2}$/

If you need to check the values of individual date parts, use parentheses in the pattern and extract the substrings after a successful match. If you’re expecting dates to be in ISO format, for example, do something like this:

if ($val =~ /^(\d{2,4})\D(\d{1,2})\D(\d{1,2})$/)

{

($year, $month, $day) = ($1, $2, $3);

}

The library file lib/Cookbook_Utils.pm in the recipes distribution contains several of these pattern tests, packaged as function calls. If the date doesn’t match the pattern, they return undef. Otherwise, they return a reference to an array containing the broken-out values for the year, month, and day. This can be useful for performing further checking on the components of the date. For example, is_iso_date() looks for dates that match ISO format. It’s defined as follows:

sub is_iso_date

{

my $s = $_[0];

return undef unless $s =~ /^(\d{2,4})\D(\d{1,2})\D(\d{1,2})$/;

return [ $1, $2, $3 ]; # return year, month, day

}

To use the function, do something like this:

my $ref = is_iso_date ($val);

if (defined ($ref))

{

# $val matched ISO format pattern;

# check its subparts using $ref->[0] through $ref->[2]

}

else

{

# $val didn't match ISO format pattern

}

You’ll often find additional processing necessary with dates, because although date-matching patterns help to weed out values that are syntactically malformed, they don’t assess whether the individual components contain legal values. To do that, some range checking is necessary. That topic is covered later in Performing Validity Checking on Date or Time Subparts.

If you’re willing to skip subpart testing and just want to rewrite the pieces, you can use a substitution. For example, to rewrite values assumed to be in MM-DD-YY format into YY-MM-DD format, do this:

$val =~ s/^(\d+)\D(\d+)\D(\d+)$/$3-$1-$2/;

Time values are somewhat more orderly than dates, usually being written with hours first and seconds last, with two digits per part:

/^\d{2}:\d{2}:\d{2}$/

To be more lenient, you can allow the hours part to have a single digit, or the seconds part to be missing:

/^\d{1,2}:\d{2}(:\d{2})?$/

You can mark parts of the time with parentheses if you want to range-check the individual parts, or perhaps to reformat the value to include a seconds part of 00 if it happens to be missing. However, this requires some care with the parentheses and the ? characters in the pattern if the seconds part is optional. You want to allow the entire :\d{2} at the end of the pattern to be optional, but not to save the : character in $3 if the third time section is present. To accomplish that, use (?: pat ), an alternative grouping notation that doesn’t save the matched substring. Within that notation, use parentheses around the digits to save them. Then $3 will be undef if the seconds part is not present, but will contain the seconds digits otherwise:

if ($val =~ /^(\d{1,2}):(\d{2})(?::(\d{2}))?$/)

{

my ($hour, $min, $sec) = ($1, $2, $3);

$sec = "00" if !defined ($sec); # seconds missing; use 00

$val = "$hour:$min:$sec";

}

To rewrite times from 12-hour format with AM and PM suffixes to 24-hour format, you can do something like this:

if ($val =~ /^(\d{1,2}):(\d{2})(?::(\d{2}))?\s*(AM|PM)?$/i)

{

my ($hour, $min, $sec) = ($1, $2, $3);

# supply missing seconds

$sec = "00" unless defined ($sec);

if ($hour == 12 && (!defined ($4) || uc ($4) eq "AM"))

{

$hour = "00"; # 12:xx:xx AM times are 00:xx:xx

}

elsif ($hour < 12 && defined ($4) && uc ($4) eq "PM")

{

$hour += 12; # PM times other than 12:xx:xx

}

$val = "$hour:$min:$sec";

}

The time parts are placed into $1, $2, and $3, with $3 set to undef if the seconds part is missing. The suffix goes into $4 if it’s present. If the suffix is AM or missing (undef), the value is interpreted as an AM time. If the suffix is PM, the value is interpreted as a PM time.

See Also

This section is just the beginning of what you can do when processing dates for data-transfer purposes. Date and time testing and conversion can be highly idiosyncratic, and the sheer number of issues to consider is mind-boggling:

§ What is the basic date format? Dates come in several common styles, such as ISO (CCYY-MM-DD), U.S. (MM-DD-YY), and British (DD-MM-YY) formats. And these are just some of the more standard formats. Many more are possible. For example, a datafile may contain dates written asJune 17, 1959 or as 17 Jun '59.

§ Are trailing times allowed on dates or perhaps required? When times are expected, is the full time required or just the hour and minute?

§ Do you allow special values like now or today?

§ Are date parts required to be delimited by a certain character, such as - or /, or are other delimiters allowed?

§ Are date parts required to have a specific number of digits? Or are leading zeros on month and year values allowed to be missing?

§ Are months written numerically, or are they represented as month names like January or Jan?

§ Are two-digit year values allowed? Should they be converted to have four digits? If so, what is the conversion rule? (What is the transition point within the range 00 to 99 at which values change from one century to another?)

§ Should date parts be checked to ensure their validity? Patterns can recognize strings that look like dates or times, but while they’re extremely useful for detecting malformed values, they may not be sufficient. A value like 1947-15-99 may match a pattern but isn’t a legal date. Pattern testing is thus most useful in conjunction with range checks on the individual parts of the date.

The prevalence of these issues in data-transfer problems means that you’ll probably end up writing some of your own validators on occasion to handle very specific date formats. Later sections of this chapter can provide additional assistance. For example, Converting Two-Digit Year Values to Four-Digit Form covers conversion of two-digit year values to four-digit form, and Performing Validity Checking on Date or Time Subparts discusses how to perform validity checking on components of date or time values.

Using Patterns to Match Email Addresses or URLs

Problem

You want to determine whether a value looks like an email address or a URL.

Solution

Use a pattern, tuned to the level of strictness you want to enforce.

Discussion

The immediately preceding sections use patterns to identify classes of values such as numbers and dates, which are fairly typical applications for regular expressions. But pattern matching has such widespread applicability that it’s impossible to list all the ways you can use it for data validation. To give some idea of a few other types of values that pattern matching can be used for, this section shows a few tests for email addresses and URLs.

To check values that are expected to be email addresses, the pattern should require at least an @ character with nonempty strings on either side:

/.@./

That’s a pretty minimal test. It’s difficult to come up with a fully general pattern that covers all the legal values and rejects all the illegal ones,[13] but it’s easy to write a pattern that’s at least a little more restrictive. For example, in addition to being nonempty, the username and the domain name should consist entirely of characters other than @ characters or spaces:

/^[^@ ]+@[^@ ]+$/

You may also want to require that the domain name part contain at least two parts separated by a dot:

/^[^@ ]+@[^@ .]+\.[^@ .]+/

To look for URL values that begin with a protocol specifier of http://, ftp://, or mailto:, use an alternation that matches any of them at the beginning of the string. These values contain slashes, so it’s easier to use a different character around the pattern to avoid having to escape the slashes with backslashes:

m#^(http://|ftp://|mailto:)#i

The alternatives in the pattern are grouped within parentheses because otherwise the ^ will anchor only the first of them to the beginning of the string. The i modifier follows the pattern because protocol specifiers in URLs are not case-sensitive. The pattern is otherwise fairly unrestrictive because it allows anything to follow the protocol specifier. I leave it to you to add further restrictions as necessary.


[13] To see how hard it can be to perform pattern matching for email addresses, check Jeffrey E. F. Friedl’s Mastering Regular Expressions (O’Reilly).

Using Table Metadata to Validate Data

Problem

You need to check input values against the legal members of an ENUM or SET column.

Solution

Get the column definition, extract the list of members from it, and check data values against the list.

Discussion

Some forms of validation involve checking input values against information stored in a database. This includes values to be stored in an ENUM or SET column, which can be checked against the valid members stored in the column definition. Database-backed validation also applies when you have values that must match those listed in a lookup table to be considered legal. For example, input records that contain customer IDs can be required to match a row in a customers table, or state abbreviations in addresses can be verified against a table that lists each state. This recipe describes ENUM- and SET-based validation, and Using a Lookup Table to Validate Data discusses how to use lookup tables.

One way to check input values that correspond to the legal values of ENUM or SET columns is to get the list of legal column values into an array using the information in INFORMATION_SCHEMA, and then perform an array membership test. For example, the favorite-color column color from the profile table is an ENUM that is defined as follows:

mysql> SELECT COLUMN_TYPE FROM INFORMATION_SCHEMA.COLUMNS

-> WHERE TABLE_SCHEMA = 'cookbook' AND TABLE_NAME = 'profile'

-> AND COLUMN_NAME = 'color';

+----------------------------------------------------+

| COLUMN_TYPE                                        |

+----------------------------------------------------+

| enum('blue','red','green','brown','black','white') |

+----------------------------------------------------+

If you extract the list of enumeration members from the COLUMN_TYPE value and store them in an array @members, you can perform the membership test like this:

$valid = grep (/^$val$/i, @members);

The pattern constructor begins and ends with ^ and $ to require $val to match an entire enumeration member (rather than just a substring). It also is followed by an i to specify a case-insensitive comparison because the default collation is latin1_swedish_ci, which is case-insensitive. (If you’re working with a column that has a different collation, adjust accordingly.)

In Getting ENUM and SET Column Information, we wrote a function get_enumorset_info() that returns ENUM or SET column metadata. This includes the list of members, so it’s easy to use that function to write another utility routine, check_enum_value(), that gets the legal enumeration values and performs the membership test. The routine takes five arguments: a database handle, the database name, the table name and column name for the ENUM column, and the value to check. It returns true or false to indicate whether the value is legal:

sub check_enum_value

{

my ($dbh, $db_name, $tbl_name, $col_name, $val) = @_;

my $valid = 0;

my $info = get_enumorset_info ($dbh, $db_name, $tbl_name, $col_name);

if ($info && uc ($info->{type}) eq "ENUM")

{

# use case-insensitive comparison because default collation

# (latin1_swedish_ci) is case-insensitive (adjust if you use

# a different collation)

$valid = grep (/^$val$/i, @{$info->{values}});

}

return $valid;

}

For single-value testing, such as to validate a value submitted in a web form, that kind of test works well. However, if you’re going to be testing a lot of values (like an entire column in a datafile), it’s better to read the enumeration values into memory once, and then use them repeatedly to check each of the data values. Furthermore, it’s a lot more efficient to perform hash lookups than array lookups (in Perl at least). To do so, retrieve the legal enumeration values and store them as keys of a hash. Then test each input value by checking whether it exists as a hash key. It’s a little more effort to construct the hash, which is why check_enum_value() doesn’t do so. But for bulk validation, the improved lookup speed more than makes up for the hash construction overhead.[14]

Begin by getting the metadata for the column, and then convert the list of legal enumeration members to a hash:

my $ref = get_enumorset_info ($dbh, $db_name, $tbl_name, $col_name);

my %members;

foreach my $member (@{$ref->{values}})

{

# convert hash key to consistent lettercase

$members{lc ($member)} = 1;

}

The loop makes each enumeration member exist as the key of a hash element. The hash key is what’s important here; the value associated with it is irrelevant. (The example shown sets the value to 1, but you could use undef, 0, or any other value.) Note that the code converts the hash keys to lowercase before storing them. This is done because hash key lookups in Perl are case-sensitive. That’s fine if the values that you’re checking also are case-sensitive, but ENUM columns by default are not. By converting the enumeration values to a given lettercase before storing them in the hash, and then converting the values you want to check similarly, you perform, in effect, a case-insensitive key existence test:

$valid = exists ($members{lc ($val)});

The preceding example converts enumeration values and input values to lowercase. You could just as well use uppercase—as long as you do so for all values consistently.

Note that the existence test may fail if the input value is the empty string. You’ll have to decide how to handle that case on a column-by-column basis. For example, if the column allows NULL values, you might interpret the empty string as equivalent to NULL and thus as being a legal value.
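
A sketch of that policy might look like this; whether the column permits NULL values is something you would determine from the table metadata, so the $nullable variable here is only an assumption:

# treat the empty string as NULL, which is legal only if the column is nullable;
# $nullable is assumed to have been determined from the column metadata
if ($val eq "")
{
  $valid = $nullable;
}
else
{
  $valid = exists ($members{lc ($val)});
}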

The validation procedure for SET values is similar to that for ENUM values, except that an input value might consist of any number of SET members, separated by commas. For the value to be legal, each element in it must be legal. In addition, because “any number of members” includes “none,” the empty string is a legal value for any SET column.

For one-shot testing of individual input values, you can use a utility routine check_set_value() that is similar to check_enum_value():

sub check_set_value

{

my ($dbh, $db_name, $tbl_name, $col_name, $val) = @_;

my $valid = 0;

my $info = get_enumorset_info ($dbh, $db_name, $tbl_name, $col_name);

if ($info && uc ($info->{type}) eq "SET")

{

return 1 if $val eq ""; # empty string is legal element

# use case-insensitive comparison because default collation

# (latin1_swedish_ci) is case-insensitive (adjust if you use

# a different collation)

$valid = 1; # assume valid until we find out otherwise

foreach my $v (split (/,/, $val))

{

if (!grep (/^$v$/i, @{$info->{values}}))

{

$valid = 0; # value contains an invalid element

last;

}

}

}

return $valid;

}

For bulk testing, construct a hash from the legal SET members. The procedure is the same as for producing a hash from ENUM elements:

my $ref = get_enumorset_info ($dbh, $db_name, $tbl_name, $col_name);

my %members;

foreach my $member (@{$ref->{values}})

{

# convert hash key to consistent lettercase

$members{lc ($member)} = 1;

}

To validate a given input value against the SET member hash, convert it to the same lettercase as the hash keys, split it at commas to get a list of the individual elements of the value, and then check each one. If any of the elements are invalid, the entire value is invalid:

$valid = 1; # assume valid until we find out otherwise

foreach my $elt (split (/,/, lc ($val)))

{

if (!exists ($members{$elt}))

{

$valid = 0; # value contains an invalid element

last;

}

}

After the loop terminates, $valid is true if the value is legal for the SET column, and false otherwise. Empty strings are always legal SET values, but this code doesn’t perform any special-case test for an empty string. No such test is necessary, because in that case the split() operation returns an empty list, the loop never executes, and $valid remains true.


[14] If you want to check for yourself the relative efficiency of array membership tests versus hash lookups, try the lookup_time.pl script in the transfer directory of the recipes distribution.

Using a Lookup Table to Validate Data

Problem

You need to check values to make sure they’re listed in a lookup table.

Solution

Issue statements to see whether the values are in the table. However, the way you do this depends on the number of input values and the size of the table.

Discussion

To validate input values against the contents of a lookup table, you can use techniques somewhat similar to those shown in Using Table Metadata to Validate Data for checking ENUM and SET columns. However, whereas ENUM and SET columns are limited to a maximum of 65,535 and 64 member values, respectively, a lookup table can have an essentially unlimited number of values. You might not want to read them all into memory.

Validation of input values against the contents of a lookup table can be done several ways, as illustrated in the following discussion. The tests shown in the examples perform comparisons against values exactly as they are stored in the lookup table. To perform case-insensitive comparisons, remember to convert all values to a consistent lettercase.

Issue individual statements

For one-shot operations, you can test a value by checking whether it’s listed in the lookup table. The following query returns true (nonzero) for a value that is present and false otherwise:

$valid = $dbh->selectrow_array (

"SELECT COUNT(*) FROM $tbl_name WHERE val = ?",

undef, $val);

This kind of test may be suitable for purposes such as checking a value submitted in a web form, but is inefficient for validating large datasets. It has no memory for the results of previous tests for values that have been seen before; consequently, you’ll end up issuing a query for every single input value.

Construct a hash from the entire lookup table

If you’re going to perform bulk validation of a large set of values, it’s more efficient to pull the lookup values into memory, save them in a data structure, and check each input value against the contents of that structure. Using an in-memory lookup avoids the overhead of running a query for each value.

First, run a query to retrieve all the lookup table values and construct a hash from them:

my %members; # hash for lookup values

my $sth = $dbh->prepare ("SELECT val FROM $tbl_name");

$sth->execute ();

while (my ($val) = $sth->fetchrow_array ())

{

$members{$val} = 1;

}

Then check each value by performing a hash key existence test:

$valid = exists ($members{$val});

This reduces the database traffic to a single query. However, for a large lookup table, that could still be a lot of traffic, and you might not want to hold the entire table in memory.

PERFORMING LOOKUPS WITH OTHER LANGUAGES

The example shown here for bulk testing of lookup values uses a Perl hash to determine whether a given value is present in a set of values:

$valid = exists ($members{$val});

Similar data structures exist for other languages. In Ruby, use a hash, and check input values using the has_key? method:

valid = members.has_key?(val)

In PHP, use an associative array, and perform a key lookup like this:

$valid = isset ($members[$val]);

In Python, use a dictionary, and check input values using the has_key() method:

valid = members.has_key (val)

For lookups in Java, use a HashMap, and test values with the containsKey() method:

valid = members.containsKey (val);

The transfer directory of the recipes distribution contains some sample code for lookup operations in each of these languages.

Use a hash as a cache of already-seen lookup values

Another lookup technique is to mix individual statements with a hash that stores lookup value existence information. This approach can be useful if you have a very large lookup table. Begin with an empty hash:

my %members; # hash for lookup values

Then, for each value to be tested, check whether it’s present in the hash. If not, issue a query to see whether the value is present in the lookup table, and record the result of the query in the hash. The validity of the input value is determined by the value associated with the key, not by the existence of the key:

if (!exists ($members{$val})) # haven't seen this value yet

{

my $count = $dbh->selectrow_array (

"SELECT COUNT(*) FROM $tbl_name WHERE val = ?",

undef, $val);

# store true/false to indicate whether value was found

$members{$val} = ($count > 0);

}

$valid = $members{$val};

For this method, the hash acts as a cache, so that you run a lookup query for any given value only once, no matter how many times it occurs in the input. For datasets that have a reasonable number of repeated values, this approach avoids issuing a separate query for every single test, while requiring an entry in the hash only for each unique value. It thus stands between the other two approaches in terms of the tradeoff between database traffic and program memory requirements for the hash.

Note that the hash is used in a somewhat different manner for this method than for the previous method. Previously, the existence of the input value as a key in the hash determined the validity of the value, and the value associated with the hash key was irrelevant. For the hash-as-cache method, the meaning of key existence in the hash changes from “it’s valid” to “it’s been tested before.” For each key, the value associated with it indicates whether the input value is present in the lookup table. (If you store as keys only those values that are found to be in the lookup table, you’ll issue a query for each instance of an invalid value in the input dataset, which is inefficient.)

Converting Two-Digit Year Values to Four-Digit Form

Problem

You need to convert years in date values from two digits to four digits.

Solution

Let MySQL do this for you, or perform the operation yourself if MySQL’s conversion rules aren’t appropriate.

Discussion

Two-digit year values are a problem because the century is not explicit in the data values. If you know the range of years spanned by your input, you can add the century without ambiguity. Otherwise, you can only guess. For example, the date 2/10/69 probably would be interpreted by most people in the U.S. as February 10, 1969. But if it represents Mahatma Gandhi’s birth date, the year is actually 1869.

One way to convert years to four digits is to let MySQL do it. If you store a date containing a two-digit year, MySQL automatically converts it to four-digit form. MySQL uses a transition point of 1970; it interprets values from 00 to 69 as the years 2000 to 2069, and values from 70 to 99 as the years 1970 to 1999. These rules are appropriate for year values in the range from 1970 to 2069. If your values lie outside this range, you should add the proper century yourself before storing them into MySQL.

To use a different transition point, convert years to four-digit form yourself. Here’s a general-purpose routine that converts two-digit years to four digits and allows an arbitrary transition point:

sub yy_to_ccyy

{

my ($year, $transition_point) = @_;

$transition_point = 70 unless defined ($transition_point);

$year += ($year >= $transition_point ? 1900 : 2000) if $year < 100;

return ($year);

}

The function uses MySQL’s transition point (70) by default. An optional second argument may be given to provide a different transition point. yy_to_ccyy() also makes sure the year actually needs converting (is less than 100) before modifying it. That way you can pass year values that do or don’t include the century without checking first. Some sample invocations using the default transition point have the following results:

$val = yy_to_ccyy (60); # returns 2060

$val = yy_to_ccyy (1960); # returns 1960 (no conversion done)

But suppose that you want to convert year values as follows, using a transition point of 50:

00 .. 49 -> 2000 .. 2049

50 .. 99 -> 1950 .. 1999

To do this, pass an explicit transition point argument to yy_to_ccyy():

$val = yy_to_ccyy (60, 50); # returns 1960

$val = yy_to_ccyy (1960, 50); # returns 1960 (no conversion done)

The yy_to_ccyy() function is one of those included in the Cookbook_Utils.pm library file.

Performing Validity Checking on Date or Time Subparts

Problem

A string passes a pattern test as a date or time, but you want to perform further checking to make sure that it’s legal.

Solution

Break up the value into subparts and perform the appropriate range checking on each part.

Discussion

Pattern matching may not be sufficient for checking dates or times. For example, a value like 1947-15-19 might match a date pattern, but it’s not actually legal as a date. If you want to perform more rigorous value testing, combine pattern matching with range checking. Break out the year, month, and day values, and then check that they’re within the proper ranges. Years should be no greater than 9999 (MySQL represents dates to an upper limit of 9999-12-31), month values should be in the range from 1 to 12, and days should be in the range from 1 to the number of days in the month. That latter part is the trickiest: it’s month-dependent, and for February it’s also year-dependent because it changes for leap years.

Suppose that you’re checking input dates in ISO format. In Using Patterns to Match Dates or Times, we used an is_iso_date() function from the Cookbook_Utils.pm library file to perform a pattern match on a date string and break it into component values:

my $ref = is_iso_date ($val);

if (defined ($ref))

{

# $val matched ISO format pattern;

# check its subparts using $ref->[0] through $ref->[2]

}

else

{

# $val didn't match ISO format pattern

}

is_iso_date() returns undef if the value doesn’t satisfy a pattern that matches ISO date format. Otherwise, it returns a reference to an array containing the year, month, and day values.[15] To perform additional checking on the date parts, pass them to is_valid_date(), another library function:

$valid = is_valid_date ($ref->[0], $ref->[1], $ref->[2]);

Or, more concisely:

$valid = is_valid_date (@{$ref});

is_valid_date() checks the parts of a date like this:

sub is_valid_date

{

my ($year, $month, $day) = @_;

# year must be nonnegative, month and day must be positive

return 0 if $year < 0 || $month < 1 || $day < 1;

# check maximum limits on individual parts

return 0 if $year > 9999;

return 0 if $month > 12;

return 0 if $day > days_in_month ($year, $month);

return 1;

}

is_valid_date() requires separate year, month, and day values, not a date string. This forces you to break apart candidate values into components before invoking it, but makes it applicable in more contexts. For example, you can use it to check dates like 12February2003 by mapping the month to its numeric value before calling is_valid_date(). Were is_valid_date() to take a string argument assumed to be in a specific date format, it would be much less general.
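
For instance, a value such as 12February2003 might be checked with a sketch like this, using the is_valid_date() routine just shown. The pattern and the month-name map are assumptions about that particular input format, not part of the library:

$valid = 0;
# hypothetical month-name map and pattern for dates written as DDMonthCCYY
my %month_map =
(
  "january"   => 1, "february" => 2,  "march"    => 3,  "april"    => 4,
  "may"       => 5, "june"     => 6,  "july"     => 7,  "august"   => 8,
  "september" => 9, "october"  => 10, "november" => 11, "december" => 12
);
if ($val =~ /^(\d{1,2})([A-Za-z]+)(\d{4})$/ && exists ($month_map{lc ($2)}))
{
  $valid = is_valid_date ($3, $month_map{lc ($2)}, $1);   # year, month, day
}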

is_valid_date() uses a subsidiary function days_in_month() to determine how many days there are in the month represented by the date. days_in_month() requires both the year and the month as arguments, because if the month is 2 (February), the number of days depends on whether the year is a leap year. This means you must pass a four-digit year value: two-digit years are ambiguous with respect to the century, which makes proper leap-year testing impossible, as discussed in Performing Leap Year Calculations. The days_in_month() and is_leap_year() functions are based on techniques taken straight from that recipe:

sub is_leap_year

{

my $year = $_[0];

return (($year % 4 == 0) && ((($year % 100) != 0) || ($year % 400) == 0));

}

sub days_in_month

{

my ($year, $month) = @_;

my @day_tbl = (31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31);

my $days = $day_tbl[$month-1];

# add a day for Feb of leap years

$days++ if $month == 2 && is_leap_year ($year);

return ($days);

}

To perform validity checking on time values, a similar procedure can be used, although the ranges for the subparts are different: 0 to 23 for the hour, and 0 to 59 for the minute and second. Here is a function is_24hr_time() that checks for values in 24-hour format:

sub is_24hr_time

{

my $s = $_[0];

return undef unless $s =~ /^(\d{1,2})\D(\d{2})\D(\d{2})$/;

return [ $1, $2, $3 ]; # return hour, minute, second

}

The following is_ampm_time() function looks for times in 12-hour format with an optional AM or PM suffix, converting PM times to 24-hour values:

sub is_ampm_time

{

my $s = $_[0];

return undef unless $s =~ /^(\d{1,2})\D(\d{2})\D(\d{2})(?:\s*(AM|PM))?$/i;

my ($hour, $min, $sec) = ($1, $2, $3);

if ($hour == 12 && (!defined ($4) || uc ($4) eq "AM"))

{

$hour = "00"; # 12:xx:xx AM times are 00:xx:xx

}

elsif ($hour < 12 && defined ($4) && uc ($4) eq "PM")

{

$hour += 12; # PM times other than 12:xx:xx

}

return [ $hour, $min, $sec ]; # return hour, minute, second

}

Both functions return undef for values that don’t match the pattern. Otherwise, they return a reference to a three-element array containing the hour, minute, and second values.

NOTE

They don’t perform range checks. To do that, pass the array to is_valid_time(), another utility routine.
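
That routine isn’t shown in this recipe, but the range check itself is straightforward. A minimal sketch (not necessarily identical to the Cookbook_Utils.pm version) might look like this:

sub is_valid_time
{
  my ($hour, $min, $sec) = @_;
  # hours run from 0 to 23; minutes and seconds run from 0 to 59
  return 0 if $hour < 0 || $hour > 23;
  return 0 if $min < 0 || $min > 59;
  return 0 if $sec < 0 || $sec > 59;
  return 1;
}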


[15] The Cookbook_Utils.pm file also contains is_mmddyy_date() and is_ddmmyy_date() routines that match dates in U.S. or British format and return undef or a reference to an array of date parts. (The parts returned are always in year, month, day order, not the order in which the parts appear in the input date string.)

Writing Date-Processing Utilities

Problem

There’s a given date-processing operation that you need to perform frequently, so you want to write a utility that does it for you.

Solution

The utilities in this recipe provide some examples that show how to do that.

Discussion

Due to the idiosyncratic nature of dates, you might find it necessary to write date converters from time to time. This section shows some sample converters that serve various purposes:

§ isoize_date.pl reads a file looking for dates in U.S. format (MM-DD-YY) and converts them to ISO format.

§ cvt_date.pl converts dates to and from any of ISO, U.S., or British formats. It is more general than isoize_date.pl, but requires that you tell it what kind of input to expect and what kind of output to produce.

§ monddccyy_to_iso.pl looks for dates like Feb. 6, 1788 and converts them to ISO format. It illustrates how to map dates with nonnumeric parts to a format that MySQL will understand.

All three scripts are located in the transfer directory of the recipes distribution. They assume datafiles are in tab-delimited, linefeed-terminated format. (Use cvt_file.pl first if you need to work with files that have a different format.)

Our first date-processing utility, isoize_date.pl, looks for dates in U.S. format and rewrites them into ISO format. You’ll recognize that it’s modeled after the general input-processing loop shown in Validating and Transforming Data, with some extra stuff thrown in to perform a specific type of conversion:

#!/usr/bin/perl

# isoize_date.pl - Read input data, look for values that match

# a date pattern, convert them to ISO format. Also converts

# 2-digit years to 4-digit years, using a transition point of 70.

# By default, this looks for dates in MM-DD-[CC]YY format.

# Assumes tab-delimited, linefeed-terminated input lines.

# Does not check whether dates actually are valid (for example,

# won't complain about 13-49-1928).

use strict;

use warnings;

# transition point at which 2-digit years are assumed to be 19XX

# (below they are treated as 20XX)

my $transition = 70;

while (<>)

{

chomp;

my @val = split (/\t/, $_, 10000); # split, preserving all fields

for my $i (0 .. @val - 1)

{

my $val = $val[$i];

# look for strings in MM-DD-[CC]YY format

next unless $val =~ /^(\d{1,2})\D(\d{1,2})\D(\d{2,4})$/;

my ($month, $day, $year) = ($1, $2, $3);

# to interpret dates as DD-MM-[CC]YY instead, replace preceding

# line with the following one:

#my ($day, $month, $year) = ($1, $2, $3);

# convert 2-digit years to 4 digits, and then update value in array

$year += ($year >= $transition ? 1900 : 2000) if $year < 100;

$val[$i] = sprintf ("%04d-%02d-%02d", $year, $month, $day);

}

print join ("\t", @val) . "\n";

}

If you feed isoize_date.pl an input file that looks like this:

Fred 04-13-70

Mort 09-30-69

Brit 12-01-57

Carl 11-02-73

Sean 07-04-63

Alan 02-14-65

Mara 09-17-68

Shepard 09-02-75

Dick 08-20-52

Tony 05-01-60

It produces the following output:

Fred 1970-04-13

Mort 2069-09-30

Brit 2057-12-01

Carl 1973-11-02

Sean 2063-07-04

Alan 2065-02-14

Mara 2068-09-17

Shepard 1975-09-02

Dick 2052-08-20

Tony 2060-05-01

isoize_date.pl serves a specific purpose: it converts only from U.S. to ISO format. It does not perform validity checking on date subparts or allow the transition point for adding the century to be specified. A more general tool would be more useful. The next script, cvt_date.pl, extends the capabilities of isoize_date.pl; it recognizes input dates in ISO, U.S., or British formats and converts any of them to any other. It also can convert two-digit years to four digits, enables you to specify the conversion transition point, and can warn about bad dates. As such, it can be used to preprocess input for loading into MySQL, or to postprocess data exported from MySQL for use by other programs.

cvt_date.pl understands the following options:

--iformat=format, --oformat=format, --format=format

Set the date format for input, output, or both. The default format value is iso; cvt_date.pl also recognizes any string beginning with us or br as indicating U.S. or British date format.

--add-century

Convert two-digit years to four digits.

--columns=column_list

Convert dates only in the named columns. By default, cvt_date.pl looks for dates in all columns. If this option is given, column_list should be a list of one or more column positions or ranges separated by commas. (Ranges can be given as m-n to specify columns m through n.) Positions begin at 1.

--transition=n

Specify the transition point for two-digit to four-digit year conversions. The default transition point is 70. This option turns on --add-century.

--warn

Warn about bad dates. (Note that this option can produce spurious warnings if the dates have two-digit years and you don’t specify --add-century, because leap year testing won’t always be accurate in that case.)

I won’t show the code for cvt_date.pl here (most of it is taken up with processing command-line options), but you can examine the source for yourself if you like. As an example of how cvt_date.pl works, suppose that you have a file newdata.txt with the following contents:

name1 01/01/99 38

name2 12/31/00 40

name3 02/28/01 42

name4 01/02/03 44

Running the file through cvt_date.pl with options indicating that the dates are in U.S. format and that the century should be added produces this result:

% cvt_date.pl --iformat=us --add-century newdata.txt

name1 1999-01-01 38

name2 2000-12-31 40

name3 2001-02-28 42

name4 2003-01-02 44

To produce dates in British format instead with no year conversion, do this:

% cvt_date.pl --iformat=us --oformat=br newdata.txt

name1 01-01-99 38

name2 31-12-00 40

name3 28-02-01 42

name4 02-01-03 44

cvt_date.pl has no knowledge of the meaning of each data column, of course. If you have a nondate column with values that match the pattern, it will rewrite that column, too. To deal with that, specify a --columns option to limit the columns that cvt_date.pl attempts to convert.
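
For example, if the dates in newdata.txt occur only in the second column, you might restrict the conversion like this:

% cvt_date.pl --iformat=us --add-century --columns=2 newdata.txt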

isoize_date.pl and cvt_date.pl both operate on dates written in all-numeric formats. But dates in datafiles often are written differently, in which case it may be necessary to write a special purpose script to process them. Suppose an input file contains dates in the following format (these represent the dates on which U.S. states were admitted to the Union):

Delaware Dec. 7, 1787

Pennsylvania Dec 12, 1787

New Jersey Dec. 18, 1787

Georgia Jan. 2, 1788

Connecticut Jan. 9, 1788

Massachusetts Feb. 6, 1788

Maryland Apr. 28, 1788

South Carolina May 23, 1788

New Hampshire Jun. 21, 1788

Virginia Jun 25, 1788

...

The dates consist of a three-character month abbreviation (possibly followed by a period), the numeric day of the month, a comma, and the numeric year. To import this file into MySQL, you need to convert the dates to ISO format, resulting in a file that looks like this:

Delaware 1787-12-07

Pennsylvania 1787-12-12

New Jersey 1787-12-18

Georgia 1788-01-02

Connecticut 1788-01-09

Massachusetts 1788-02-06

Maryland 1788-04-28

South Carolina 1788-05-23

New Hampshire 1788-06-21

Virginia 1788-06-25

...

That’s a somewhat specialized kind of transformation, although this general type of problem (converting a particular date format to ISO format) is hardly uncommon. To perform the conversion, identify the dates as those values matching an appropriate pattern, map month names to the corresponding numeric values, and reformat the result. The following script, monddccyy_to_iso.pl, illustrates how to do this:

#!/usr/bin/perl

# monddccyy_to_iso.pl - convert dates from mon[.] dd, ccyy to ISO format

# Assumes tab-delimited, linefeed-terminated input

use strict;

use warnings;

my %map = # map 3-char month abbreviations to numeric month

(

"jan" => 1, "feb" => 2, "mar" => 3, "apr" => 4, "may" => 5, "jun" => 6,

"jul" => 7, "aug" => 8, "sep" => 9, "oct" => 10, "nov" => 11, "dec" => 12

);

while (<>)

{

chomp;

my @val = split (/\t/, $_, 10000); # split, preserving all fields

for my $i (0 .. @val - 1)

{

# reformat the value if it matches the pattern, otherwise assume

# that it's not a date in the required format and leave it alone

if ($val[$i] =~ /^([^.]+)\.? (\d+), (\d+)$/)

{

# use lowercase month name

my ($month, $day, $year) = (lc ($1), $2, $3);

if (exists ($map{$month}))

{

$val[$i] = sprintf ("%04d-%02d-%02d",

$year, $map{$month}, $day);

}

else

{

# warn, but don't reformat

warn "$val[$i]: bad date?\n";

}

}

}

print join ("\t", @val) . "\n";

}

The script only does reformatting; it doesn’t validate the dates. To do that, modify the script to use the Cookbook_Utils.pm module by adding this statement after the use warnings line:

use Cookbook_Utils;

That gives the script access to the module’s is_valid_date() routine. To use it, change this line:

if (exists ($map{$month}))

To this:

if (exists ($map{$month}) && is_valid_date ($year, $map{$month}, $day))

Using Dates with Missing Components

Problem

The dates in your data are incomplete; that is, they have missing subparts.

Solution

MySQL can represent them as ISO-format dates using zero for the missing parts.

Discussion

Some applications use dates that are not complete. For example, you may need to work with input values such as Mar/2001 that contain only a month and year. In MySQL, it’s possible to represent such values as ISO-format dates that have zero in the “missing” parts. (The value Mar/2001 can be stored as 2001-03-00.) To convert month/year values to ISO format for import into MySQL, set up a hash to map month names to their numeric values:

my %map = # map 3-char month abbreviations to numeric month

(

"jan" => 1, "feb" => 2, "mar" => 3, "apr" => 4, "may" => 5, "jun" => 6,

"jul" => 7, "aug" => 8, "sep" => 9, "oct" => 10, "nov" => 11, "dec" => 12

);

Now, convert each input value like this:

if ($val =~ /^([a-z]{3})\/(\d{4})$/i)

{

my ($m, $y) = (lc ($1), $2); # use lowercase month name

$val = sprintf ("%04d-%02d-00", $y, $map{$m})

}

After storing the resulting values into MySQL, you can retrieve them for display in the original month/year format by issuing a SELECT statement that rewrites the dates using a DATE_FORMAT() expression:

DATE_FORMAT(date_val,'%b/%Y')

Applications that use strict SQL mode but require zero parts in dates should be careful not to set the NO_ZERO_IN_DATE SQL mode, which causes the server to consider such dates invalid.

Importing Non-ISO Date Values

Problem

Date values to be imported are not in the ISO (CCYY-MM-DD) format that MySQL expects.

Solution

Use an external utility to convert the dates to ISO format before importing the data into MySQL (cvt_date.pl is useful here). Or use LOAD DATA’s capability for preprocessing input data before it gets loaded into the database.

Discussion

Suppose that you have a table that contains three columns, name, date, and value, where date is a DATE column requiring values in ISO format (CCYY-MM-DD). Suppose also that you’re given a datafile newdata.txt to be imported into the table, but its contents look like this:

name1 01/01/99 38

name2 12/31/00 40

name3 02/28/01 42

name4 01/02/03 44

Here the dates are in MM/DD/YY format and must be converted to ISO format to be stored as DATE values in MySQL. One way to do this is to run the file through the cvt_date.pl script shown earlier in the chapter:

% cvt_date.pl --iformat=us --add-century newdata.txt > tmp.txt

You can then load the tmp.txt file into the table. This task also can be accomplished entirely in MySQL with no external utilities by using SQL to perform the reformatting operation. As discussed in Preprocessing Input Values Before Inserting Them, LOAD DATA can preprocess input values before inserting them. Applying that capability to the present problem, the date-rewriting LOAD DATA statement looks like this, using the STR_TO_DATE() function (Changing MySQL’s Date Format) to interpret the input dates:

mysql> LOAD DATA LOCAL INFILE 'newdata.txt'

-> INTO TABLE t (name,@date,value)

-> SET date = STR_TO_DATE(@date,'%m/%d/%y');

With the %y format specifier in STR_TO_DATE(), MySQL converts the two-digit years to four-digit years automatically, so the original MM/DD/YY values end up as ISO values in CCYY-MM-DD format. The resulting data after import looks like this:

+-------+------------+-------+

| name | date | value |

+-------+------------+-------+

| name1 | 1999-01-01 | 38 |

| name2 | 2000-12-31 | 40 |

| name3 | 2001-02-28 | 42 |

| name4 | 2003-01-02 | 44 |

+-------+------------+-------+

This procedure assumes that MySQL’s automatic conversion of two-digit years to four digits produces the correct century values. This means that the year part of the values must correspond to years in the range from 1970 to 2069. If that’s not true, you need to convert the year values some other way. (See Performing Validity Checking on Date or Time Subparts for some ideas.)
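
For example, if the two-digit years in the file actually span 1950 through 2049, you could instead preprocess the file with cvt_date.pl, which accepts a --transition option for exactly this purpose, and then load the resulting file as usual:

% cvt_date.pl --iformat=us --transition=50 newdata.txt > tmp.txt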

If the dates are not in a format that STR_TO_DATE() can handle, perhaps you can write a stored function to handle them and return ISO date values. In that case, the LOAD DATA statement looks like this, where my_date_interp() is the stored function name:

mysql> LOAD DATA LOCAL INFILE 'newdata.txt'

-> INTO TABLE t (name,@date,value)

-> SET date = my_date_interp(@date);

Exporting Dates Using Non-ISO Formats

Problem

You want to export date values using a format other than the ISO (CCYY-MM-DD) format that MySQL uses by default. This might be a requirement when exporting dates from MySQL to applications that don’t understand ISO format.

Solution

Use an external utility to rewrite the dates to non-ISO format after exporting the data from MySQL (cvt_date.pl is useful here). Or use the DATE_FORMAT() function to rewrite the values during the export operation.

Discussion

Suppose that you want to export data from MySQL into an application that doesn’t understand ISO-format dates. One way to do this is to export the data into a file, leaving the dates in ISO format. Then run the file through a utility such as cvt_date.pl that rewrites the dates into the required format.

Another approach is to export the dates directly in the required format by rewriting them with DATE_FORMAT(). Suppose that you have the following table:

CREATE TABLE datetbl

(

i INT,

c CHAR(10),

d DATE,

dt DATETIME,

ts TIMESTAMP

);

Suppose also that you need to export data from this table, but with the dates in any DATE, DATETIME, or TIMESTAMP columns rewritten in U.S. format (MM-DD-CCYY). A SELECT statement that uses the DATE_FORMAT() function to rewrite the dates as required looks like this:

SELECT

i,

c,

DATE_FORMAT(d, '%m-%d-%Y') AS d,

DATE_FORMAT(dt, '%m-%d-%Y %T') AS dt,

DATE_FORMAT(ts, '%m-%d-%Y %T') AS ts

FROM datetbl

Thus, if datetbl contains the following rows:

3 abc 2005-12-31 2005-12-31 12:05:03 2005-12-31 12:05:03

4 xyz 2006-01-31 2006-01-31 12:05:03 2006-01-31 12:05:03

The statement generates output that looks like this:

3 abc 12-31-2005 12-31-2005 12:05:03 12-31-2005 12:05:03

4 xyz 01-31-2006 01-31-2006 12:05:03 01-31-2006 12:05:03

Importing and Exporting NULL Values

Problem

You’re not sure how to represent NULL values in a datafile.

Solution

Try to use a value not otherwise present, so that you can distinguish NULL from all other legitimate non-NULL values. When you import the file, look for that value and convert instances of it to NULL.

Discussion

There’s no standard for representing NULL values in datafiles, which makes them a bit of a problem for import and export operations. Some of the difficulty arises from the fact that NULL indicates the absence of a value, and something that’s not there is not easy to represent literally in a datafile. Using an empty column value is the most obvious thing to do, but that’s ambiguous for string-valued columns because there is no way to distinguish a NULL represented that way from a true empty string. Empty values can be a problem for other data types as well. For example, if you load an empty value with LOAD DATA into a numeric column, it is stored as 0 rather than as NULL, and thus becomes indistinguishable from a true 0 in the input.

The usual strategy for dealing with this problem is to represent NULL using a value that doesn’t otherwise occur in the data. This is how LOAD DATA and mysqlimport handle the issue: they understand the value of \N by convention to mean NULL. (\N is interpreted as NULL when it occurs by itself, and not as part of a larger value such as x\N or \Nx.) For example, if you load the following datafile with LOAD DATA, it will treat the instances of \N as NULL:

str1 13 1997-10-14

str2 \N 2009-05-07

\N 15 \N

\N \N 1973-07-14

But you might want to interpret values other than \N as signifying NULL, and you might have different conventions in different columns. Consider the following datafile:

str1 13 1997-10-14

str2 -1 2009-05-07

Unknown 15

Unknown -1 1973-07-15

The first column contains strings, and Unknown signifies NULL. The second column contains integers, and -1 signifies NULL. The third column contains dates, and an empty value signifies NULL. What to do?

To handle situations like this, use LOAD DATA’s capability for preprocessing input values: specify a column list that assigns input values to user-defined variables, and use a SET clause that maps the special values to true NULL values. If the datafile is named has_nulls.txt, the following LOAD DATA statement properly interprets its contents:

mysql> LOAD DATA LOCAL INFILE 'has_nulls.txt'

-> INTO TABLE t (@c1,@c2,@c3)

-> SET c1 = IF(@c1='Unknown',NULL,@c1),

-> c2 = IF(@c2=-1,NULL,@c2),

-> c3 = IF(@c3='',NULL,@c3);

The resulting data after import looks like this:

+------+------+------------+

| c1 | c2 | c3 |

+------+------+------------+

| str1 | 13 | 1997-10-14 |

| str2 | NULL | 2009-05-07 |

| NULL | 15 | NULL |

| NULL | NULL | 1973-07-15 |

+------+------+------------+

The preceding discussion pertains to interpreting NULL values for import into MySQL, but it’s also necessary to think about NULL values when transferring data in the other direction—from MySQL into other programs. Here are some examples:

§ SELECT ... INTO OUTFILE writes NULL values as \N. Will another program understand that convention? If not, you need to convert \N to something the program will understand. For example, the SELECT statement can export the column using an expression like this:

IFNULL(col_name,'Unknown')

§ You can use mysql in batch mode as an easy way to produce tab-delimited output (Exporting Query Results from MySQL), but one problem with doing so is that NULL values appear in the output as instances of the word “NULL”. If that word occurs nowhere else in the output, you may be able to postprocess it to convert instances of the word to something more appropriate. For example, you can use a one-line sed command:

% sed -e "s/NULL/\\N/g" data.txt > tmp

If the word “NULL” does appear where it represents something other than a NULL value, then it’s ambiguous and you should probably export your data differently. For example, your export statement could use IFNULL() to map NULL values to something else.

Guessing Table Structure from a Datafile

Problem

Someone gives you a datafile and says, “Here, put this into MySQL for me.” But no table yet exists to hold the data.

Solution

Write the CREATE TABLE statement yourself. Or use a utility that guesses the table structure by examining the contents of the datafile.

Discussion

Sometimes you need to import data into MySQL for which no table has yet been set up. You can create the table yourself, based on any knowledge you might have about the contents of the file. Or you might be able to avoid some of the work by using guess_table.pl, a utility located in the transfer directory of the recipes distribution. guess_table.pl reads the datafile to see what kind of information it contains, and then attempts to produce an appropriate CREATE TABLE statement that matches the contents of the file. This script is necessarily imperfect, because column contents sometimes are ambiguous. (For example, a column containing a small number of distinct strings might be a VARCHAR column or an ENUM.) Still, it’s often easier to tweak the CREATE TABLE statement that guess_table.pl produces than to write the entire statement from scratch. This utility also has a diagnostic function, although that’s not its primary purpose. For example, you might believe a column contains only numbers, but if guess_table.pl indicates that it should be created using a VARCHAR type, that tells you the column contains at least one nonnumeric value.

guess_table.pl assumes that its input is in tab-delimited, linefeed-terminated format. It also assumes valid input because any attempt to guess data types based on possibly flawed data is doomed to failure. This means, for example, that if a date column is to be recognized as such, it should be in ISO format. Otherwise, guess_table.pl may characterize it as a VARCHAR column. If a datafile doesn’t satisfy these assumptions, you may be able to reformat it first using the cvt_file.pl and cvt_date.pl utilities described elsewhere in this chapter.

guess_table.pl understands the following options:

--labels

Interpret the first input line as a row of column labels, and use them for table column names. If this option is omitted, guess_table.pl uses default column names of c1, c2, and so forth.

Note that if the file contains a row of labels, and you neglect to specify this option, the labels will be treated as data values by guess_table.pl. The likely result is that the script will characterize all columns as VARCHAR columns (even those that otherwise contain only numeric or temporal values), due to the presence of a nonnumeric or nontemporal value in the column.

--lower, --upper

Force column names in the CREATE TABLE statement to be lowercase or uppercase.

--quote-names, --skip-quote-names

Quote or do not quote table and column identifiers in the CREATE TABLE statement with ` characters (for example, `mytbl`). This can be useful if an identifier is a reserved word. The default is to quote identifiers.

--report

Generate a report rather than a CREATE TABLE statement. The script displays the information that it gathers about each column.

--table=tbl_name

Specify the table name to use in the CREATE TABLE statement. The default name is t.

Here’s an example of how guess_table.pl works. Suppose that a file named stockdat.csv is in CSV format and has the following contents:

commodity,trade_date,shares,price,change

sugar,12-14-2006,1000000,10.50,-.125

oil,12-14-2006,96000,60.25,.25

wheat,12-14-2006,2500000,8.75,0

gold,12-14-2006,13000,103.25,2.25

sugar,12-15-2006,970000,10.60,.1

oil,12-15-2006,105000,60.5,.25

wheat,12-15-2006,2370000,8.65,-.1

gold,12-15-2006,11000,101,-2.25

The first row indicates the column labels, and the following rows contain data records, one per line. The values in the trade_date column are dates, but they are in MM-DD-CCYY format rather than the ISO format that MySQL expects. cvt_date.pl can convert these dates to ISO format. However, both cvt_date.pl and guess_table.pl require input in tab-delimited, linefeed-terminated format. So first let’s use cvt_file.pl to convert the input to tab-delimited, linefeed-terminated format. Then we can convert the dates with cvt_date.pl:

% cvt_file.pl --iformat=csv stockdat.csv > tmp1.txt

% cvt_date.pl --iformat=us tmp1.txt > tmp2.txt

Then feed the resulting file, tmp2.txt, to guess_table.pl:

% guess_table.pl --labels --table=stocks tmp2.txt > stocks.sql

The CREATE TABLE statement that guess_table.pl writes to stocks.sql looks like this:

CREATE TABLE `stocks`

(

`commodity` VARCHAR(5) NOT NULL,

`trade_date` DATE NOT NULL,

`shares` INT UNSIGNED NOT NULL,

`price` DOUBLE UNSIGNED NOT NULL,

`change` DOUBLE NOT NULL

);

guess_table.pl produces that statement based on deductions such as the following:

§ If a column contains only numeric values, it’s assumed to be an INT if no values contain a decimal point, and DOUBLE otherwise.

§ If a numeric column contains no negative values, the column is likely to be UNSIGNED.

§ If a column contains no empty values, guess_table.pl assumes that it’s probably NOT NULL.

§ Columns that cannot be classified as numbers or dates are taken to be VARCHAR columns, with a length equal to the longest value present in the column.

You might want to edit the CREATE TABLE statement that guess_table.pl produces, to make modifications such as increasing the size of character fields, changing VARCHAR to CHAR, or adding indexes. Another reason to edit the statement is that if a column has a name that is a reserved word in MySQL, you can rename it.

To create the table, use the statement produced by guess_table.pl:

% mysql cookbook < stocks.sql

Then you can load the datafile into the table (skipping the initial row of labels):

mysql> LOAD DATA LOCAL INFILE 'tmp2.txt' INTO TABLE stocks

-> IGNORE 1 LINES;

The resulting data after import looks like this:

mysql> SELECT * FROM stocks;

+-----------+------------+---------+--------+--------+

| commodity | trade_date | shares | price | change |

+-----------+------------+---------+--------+--------+

| sugar | 2006-12-14 | 1000000 | 10.5 | -0.125 |

| oil | 2006-12-14 | 96000 | 60.25 | 0.25 |

| wheat | 2006-12-14 | 2500000 | 8.75 | 0 |

| gold | 2006-12-14 | 13000 | 103.25 | 2.25 |

| sugar | 2006-12-15 | 970000 | 10.6 | 0.1 |

| oil | 2006-12-15 | 105000 | 60.5 | 0.25 |

| wheat | 2006-12-15 | 2370000 | 8.65 | -0.1 |

| gold | 2006-12-15 | 11000 | 101 | -2.25 |

+-----------+------------+---------+--------+--------+

Exchanging Data Between MySQL and Microsoft Access

Problem

You want to exchange information between MySQL and Access.

Solution

To use information stored in MySQL, connect to the MySQL server directly from Access. To transfer information from Access to MySQL, use a utility that can perform the transfer directly, or export tables from Access into files and import the files into MySQL.

Discussion

MySQL and Access both understand ODBC, so you can connect to MySQL directly from Access. By making an ODBC connection, Access becomes a frontend through which you use MySQL databases. The mysql.com Connector/ODBC area contains a lot of useful information:

http://www.mysql.com/products/connector/odbc/

An excellent description of the procedures for setting up ODBC and for connecting from Access to MySQL over ODBC can be found in W.J. Gilmore’s article at the DevShed web site:

http://www.devshed.com/c/a/MySQL/MySQL-and-ODBC/

If your tables currently are in Access and you want to move them into MySQL, you’ll need to create tables in MySQL to hold the information and then import the Access data into those tables. Gilmore’s article describes how.

You can also choose to export Access tables to files and then import the files into MySQL. This may be necessary, for example, if your MySQL server is on a different machine and doesn’t allow connections from your Windows box. If you elect to go this route, some of the issues you need to consider are the file format to use, date format conversion, and how to create the MySQL tables for the data if the tables don’t already exist. Several of the scripts described earlier in the chapter (such as cvt_file.pl, cvt_date.pl, and guess_table.pl) can provide assistance in dealing with these issues. The procedure for importing an Access table into MySQL might go something like this:

1. Export the table from Access in some text format, perhaps including the column labels. Should you need to transform the file with other utilities that assume tab-delimited, linefeed-terminated input, it will be most useful to export in that format.

2. If the table contains dates and you did not export them in ISO format, you need to convert them for MySQL. Use cvt_date.pl for this.

3. If the MySQL table into which you want to import the Access data does not exist, create it. The guess_table.pl utility might be helpful at this point for generating a CREATE TABLE statement.

4. Import the datafile into MySQL with LOAD DATA or mysqlimport.

Exchanging Data Between MySQL and Microsoft Excel

Problem

You want to exchange information between MySQL and Excel.

Solution

Your programming language might provide modules to make this task easier. For example, there are Perl modules that read and write Excel spreadsheet files. You can use them to construct your own data transfer utilities.

Discussion

If you need to transfer Excel files into MySQL, check around for modules that let you do this from your chosen programming language. For example, you can read and write Excel spreadsheets from within Perl scripts by installing a few modules:

§ Spreadsheet::ParseExcel::Simple provides an easy-to-use interface for reading Excel spreadsheets.

§ Spreadsheet::WriteExcel::Simple enables you to create files in Excel spreadsheet format.

These Excel modules are available from the Perl CPAN. (They’re actually frontends to other modules, which you also need to install as prerequisites.) After installing the modules, use these commands to read their documentation:

% perldoc Spreadsheet::ParseExcel::Simple

% perldoc Spreadsheet::WriteExcel::Simple

These modules make it relatively easy to write a couple of short scripts for converting spreadsheets to and from tab-delimited file format. Combined with techniques for importing and exporting data into and out of MySQL, these scripts can help you move spreadsheet contents to MySQL tables and vice versa. Use them as is, or adapt them to suit your own purposes.

The following script, from_excel.pl, reads an Excel spreadsheet and converts it to tab-delimited format:

#!/usr/bin/perl

# from_excel.pl - read Excel spreadsheet, write tab-delimited,

# linefeed-terminated output to the standard output.

use strict;

use warnings;

use Spreadsheet::ParseExcel::Simple;

@ARGV or die "Usage: $0 excel-file\n";

my $xls = Spreadsheet::ParseExcel::Simple->read ($ARGV[0]);

foreach my $sheet ($xls->sheets ())

{

while ($sheet->has_data ())

{

my @data = $sheet->next_row ();

print join ("\t", @data) . "\n";

}

}

The to_excel.pl script performs the converse operation of reading a tab-delimited file and writing it in Excel format:

#!/usr/bin/perl

# to_excel.pl - read tab-delimited, linefeed-terminated input, write

# Excel-format output to the standard output.

use strict;

use warnings;

use Spreadsheet::WriteExcel::Simple;

my $ss = Spreadsheet::WriteExcel::Simple->new ();

while (<>) # read each row of input

{

chomp;

my @data = split (/\t/, $_, 10000); # split, preserving all fields

$ss->write_row (\@data); # write row to the spreadsheet

}

binmode (STDOUT);

print $ss->data (); # write the spreadsheet

to_excel.pl assumes input in tab-delimited, linefeed-terminated format. Use it in conjunction with cvt_file.pl to work with files that are not in that format.
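
For example, a CSV file can be converted to a spreadsheet in a single step; the filenames here are only placeholders:

% cvt_file.pl --iformat=csv data.csv | to_excel.pl > data.xls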

Another Excel-related Perl module, Spreadsheet::WriteExcel::FromDB, reads data from a table using a DBI connection and writes it in Excel format. Here’s a short script that exports a MySQL table as an Excel spreadsheet:

#!/usr/bin/perl

# mysql_to_excel.pl - given a database and table name,

# dump the table to the standard output in Excel format.

use strict;

use warnings;

use DBI;

use Spreadsheet::ParseExcel::Simple;

use Spreadsheet::WriteExcel::FromDB;

#... process command-line options (not shown) ...

@ARGV == 2 or die "$usage\n";

my $db_name = shift (@ARGV);

my $tbl_name = shift (@ARGV);

# ... connect to database (not shown) ...

my $ss = Spreadsheet::WriteExcel::FromDB->read ($dbh, $tbl_name);

binmode (STDOUT);

print $ss->as_xls ();

Each of the three utilities writes to its standard output, which you can redirect to capture the results in a file:

% from_excel.pl data.xls > data.txt

% to_excel.pl data.txt > data.xls

% mysql_to_excel.pl cookbook profile > profile.xls

Exporting Query Results as XML

Problem

You want to export the result of a query as an XML document.

Solution

mysql can do that, or you can write your own exporter.

Discussion

You can use mysql to produce XML-format output from a query result (Producing HTML or XML Output).

You can also write your own XML-export program. One way to do this is to issue the query and then write it out, adding all the XML markup yourself. But it’s easier to install a few Perl modules and let them do the work:

§ XML::Generator::DBI issues a query over a DBI connection and passes the result to a suitable output writer.

§ XML::Handler::YAWriter provides one such writer.

The following script, mysql_to_xml.pl, is somewhat similar to mysql_to_text.pl (Writing Your Own Export Programs), but doesn’t take options for such things as the quote or delimiter characters. They are unneeded for reading XML, because that is done by standard XML parsing routines. The options that mysql_to_xml.pl does understand are:

--execute=query, -e query

Execute query, and export its output.

--table=tbl_name, -t tbl_name

Export the contents of the named table. This is equivalent to using --execute to specify a query value of SELECT * FROM tbl_name.

If necessary, you can also specify standard connection parameter options such as --user or --host. The final argument on the command line should be the database name, unless it’s implicit in the query.

Suppose that you want to export the contents of an experimental-data table expt that looks like this:

mysql> SELECT * FROM expt;

+---------+------+-------+

| subject | test | score |

+---------+------+-------+

| Jane | A | 47 |

| Jane | B | 50 |

| Jane | C | NULL |

| Jane | D | NULL |

| Marvin | A | 52 |

| Marvin | B | 45 |

| Marvin | C | 53 |

| Marvin | D | NULL |

+---------+------+-------+

To do that, invoke mysql_to_xml.pl using either of the following commands:

% mysql_to_xml.pl --execute="SELECT * FROM expt" cookbook > expt.xml

% mysql_to_xml.pl --table=cookbook.expt > expt.xml

The resulting XML document, expt.xml, looks like this:

<?xml version="1.0" encoding="UTF-8"?>

<rowset>

<select query="SELECT * FROM expt">

<row>

<subject>Jane</subject>

<test>A</test>

<score>47</score>

</row>

<row>

<subject>Jane</subject>

<test>B</test>

<score>50</score>

</row>

<row>

<subject>Jane</subject>

<test>C</test>

</row>

<row>

<subject>Jane</subject>

<test>D</test>

</row>

<row>

<subject>Marvin</subject>

<test>A</test>

<score>52</score>

</row>

<row>

<subject>Marvin</subject>

<test>B</test>

<score>45</score>

</row>

<row>

<subject>Marvin</subject>

<test>C</test>

<score>53</score>

</row>

<row>

<subject>Marvin</subject>

<test>D</test>

</row>

</select>

</rowset>

Each table row is written as a <row> element. Within a row, column names and values are used as element names and values, one element per column. Note that NULL values are omitted from the output.

The script produces this output with very little code after it processes the command-line arguments and connects to the MySQL server. The XML-related parts of mysql_to_xml.pl are the use statements that pull in the necessary modules and the code to set up and use the XML objects. Given a database handle $dbh and a query string $stmt, there’s not a lot to this process. The code instructs the writer object to send its results to the standard output, and then connects that object to DBI and issues the query:

#!/usr/bin/perl

# mysql_to_xml.pl - given a database and table name,

# dump the table to the standard output in XML format.

use strict;

use warnings;

use DBI;

use XML::Generator::DBI;

use XML::Handler::YAWriter;

#... process command-line options (not shown) ...

# ... connect to database (not shown) ...

# create output writer; "-" means "standard output"

my $out = XML::Handler::YAWriter->new (AsFile => "-");

# set up connection between DBI and output writer

my $gen = XML::Generator::DBI->new (

dbh => $dbh, # database handle

Handler => $out, # output writer

RootElement => "rowset" # document root element

);

# issue query and write XML

$gen->execute ($stmt);

$dbh->disconnect ();

Other languages might have library modules to perform similar XML export operations. For example, the Ruby DBI::Utils::XMLFormatter module has a table method that easily exports a query result as XML. Here’s a simple script that uses it:

#!/usr/bin/ruby -w

# xmlformatter.rb - demonstrate DBI::Utils::XMLFormatter.table method

require "Cookbook"

stmt = "SELECT * FROM expt"

# override statement with command line argument if one was given

stmt = ARGV[0] if ARGV.length > 0

begin

dbh = Cookbook.connect

rescue DBI::DatabaseError => e

puts "Could not connect to server"

puts "Error code: #{e.err}"

puts "Error message: #{e.errstr}"

end

DBI::Utils::XMLFormatter.table(dbh.select_all(stmt))

dbh.disconnect

Importing XML into MySQL

Problem

You want to import an XML document into a MySQL table.

Solution

Set up an XML parser to read the document. Then use the records in the document to construct and execute INSERT statements.

Discussion

Importing an XML document depends on being able to parse the document and extract record contents from it. The way that you do this depends on how the document is written. For example, one format might represent column names and values as attributes of <column> elements:

<?xml version="1.0" encoding="UTF-8"?>

<rowset>

<row>

<column name="subject" value="Jane" />

<column name="test" value="A" />

<column name="score" value="47" />

</row>

<row>

<column name="subject" value="Jane" />

<column name="test" value="B" />

<column name="score" value="50" />

</row>

...

</rowset>

Another format uses column names as element names and column values as the contents of those elements:

<?xml version="1.0" encoding="UTF-8"?>

<rowset>

<row>

<subject>Jane</subject>

<test>A</test>

<score>47</score>

</row>

<row>

<subject>Jane</subject>

<test>B</test>

<score>50</score>

</row>

...

</rowset>

Due to the various structuring possibilities, it’s necessary to make some assumptions about the format you expect the XML document to have. For the example here, I’ll assume the second format just shown. One way to process this kind of document is to use the XML::XPath module, which enables you to refer to elements within the document using path expressions. For example, the path //row selects all the <row> elements under the document root, and the path * selects all child elements of a given element. You can use these paths with XML::XPath to obtain first a list of all the <row> elements, and then for each row a list of all its columns.

The following script, xml_to_mysql.pl, takes three arguments:

% xml_to_mysql.pl db_name tbl_name xml_file

The filename argument indicates which document to import, and the database and table name arguments indicate which table to import it into. xml_to_mysql.pl processes the command-line arguments, connects to MySQL, and then processes the document:

#!/usr/bin/perl

# xml_to_mysql.pl - read XML file into MySQL

use strict;

use warnings;

use DBI;

use XML::XPath;

#... process command-line options (not shown) ...

# ... connect to database (not shown) ...

# Open file for reading

my $xp = XML::XPath->new (filename => $file_name);

my $row_list = $xp->find ("//row"); # find set of <row> elements

print "Number of records: " . $row_list->size () . "\n";

foreach my $row ($row_list->get_nodelist ()) # loop through rows

{

my @name; # array for column names

my @val; # array for column values

my $col_list = $row->find ("*"); # child columns of row

foreach my $col ($col_list->get_nodelist ()) # loop through columns

{

# save column name and value

push (@name, $col->getName ());

push (@val, $col->string_value ());

}

# construct INSERT statement, and then execute it

my $stmt = "INSERT INTO $tbl_name ("

. join (",", @name)

. ") VALUES ("

. join (",", ("?") x scalar (@val))

. ")";

$dbh->do ($stmt, undef, @val);

}

$dbh->disconnect ();

The script creates an XML::XPath object, which opens and parses the document. Then the object is queried for the set of <row> elements, using the path //row. The size of this set indicates how many records the document contains.

To process each row, the script uses the path * to ask for all the child elements of the row object. Each child corresponds to a column within the row; using * as the path for find() is convenient because you don't need to know in advance which columns to expect. xml_to_mysql.pl obtains the name and value from each column and saves them in the @name and @val arrays. After all the columns have been processed, the arrays are used to construct an INSERT statement that names only those columns found to be present in the row and that includes a placeholder for each data value. (Handling Special Characters and NULL Values in Statements discusses placeholder list construction.) Then the script issues the statement, passing the column values to do() to bind them to the placeholders.
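
To make the statement construction concrete, here is a minimal standalone sketch (with hardcoded sample values rather than values pulled from the document) of what the script builds for a three-column row:

#!/usr/bin/perl
# Sketch only: show the INSERT statement built for one sample row.
use strict;
use warnings;

my @name = ("subject", "test", "score");   # column names found in the row
my @val  = ("Jane", "A", 47);              # corresponding column values
my $stmt = "INSERT INTO expt ("
         . join (",", @name)
         . ") VALUES ("
         . join (",", ("?") x scalar (@val))
         . ")";
print "$stmt\n";   # INSERT INTO expt (subject,test,score) VALUES (?,?,?)

Note that the data values themselves never appear in the statement string; they are bound to the placeholders only when the statement is executed with do().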

In the previous section, we used mysql_to_xml.pl to export the contents of the expt table as an XML document. xml_to_mysql.pl can perform the converse operation of importing the document back into MySQL:

% xml_to_mysql.pl cookbook expt expt.xml

As it processes the document, the script generates and executes the following set of statements:

INSERT INTO expt (subject,test,score) VALUES ('Jane','A','47')

INSERT INTO expt (subject,test,score) VALUES ('Jane','B','50')

INSERT INTO expt (subject,test) VALUES ('Jane','C')

INSERT INTO expt (subject,test) VALUES ('Jane','D')

INSERT INTO expt (subject,test,score) VALUES ('Marvin','A','52')

INSERT INTO expt (subject,test,score) VALUES ('Marvin','B','45')

INSERT INTO expt (subject,test,score) VALUES ('Marvin','C','53')

INSERT INTO expt (subject,test) VALUES ('Marvin','D')

Note that these statements do not all insert the same number of columns. MySQL sets the missing columns to their default values; for the expt table, where the score column permits NULL, the rows imported without a score end up with NULL scores, just as in the original table.
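
If you want to verify that, a quick check such as the following sketch (assuming a connected handle $dbh, as in the scripts shown earlier) lists the imported rows whose score ended up NULL:

# Sketch only: list imported rows whose score defaulted to NULL.
# Assumes $dbh is a connected database handle.
my $rows = $dbh->selectall_arrayref
             ("SELECT subject, test FROM expt WHERE score IS NULL");
printf "%s, test %s\n", $_->[0], $_->[1] for @{$rows};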

Epilogue

Recall the scenario with which this chapter began:

Suppose that you have a file named somedata.csv that contains 12 columns of data in comma-separated values (CSV) format. From this file you want to extract only columns 2, 11, 5, and 9, and use them to create database rows in a MySQL table that contains name, birth, height, and weight columns. You need to make sure that the height and weight are positive integers, and convert the birth dates from MM/DD/YY format to CCYY-MM-DD format. How can you do this?

So…how would you do that, based on the techniques discussed in this chapter?

Much of the work can be done using the utility programs developed here. You can convert the file to tab-delimited format with cvt_file.pl, extract the columns in the desired order with yank_col.pl, and rewrite the date column to ISO format with cvt_date.pl:

% cvt_file.pl --iformat=csv somedata.csv \

| yank_col.pl --columns=2,11,5,9 \

| cvt_date.pl --columns=2 --iformat=us --add-century > tmp

The resulting file, tmp, will have four columns representing the name, birth, height, and weight values, in that order. It needs only to have its height and weight columns checked to make sure they contain positive integers. Using the is_positive_integer() library function from the Cookbook_Utils.pm module file, that task can be done with a short special-purpose script that isn't much more than an input loop:

#!/usr/bin/perl

# validate_htwt.pl - height/weight validation example

# Assumes tab-delimited, linefeed-terminated input lines.

# Input columns and the actions to perform on them are as follows:

# 1: name; echo as given

# 2: birth; echo as given

# 3: height; validate as positive integer

# 4: weight; validate as positive integer

use strict;

use warnings;

use Cookbook_Utils;

while (<>)

{

chomp;

my ($name, $birth, $height, $weight) = split (/\t/, $_, 4);

warn "line $.:height $height is not a positive integer\n"

if !is_positive_integer ($height);

warn "line $.:weight $weight is not a positive integer\n"

if !is_positive_integer ($weight);

}
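
Because the script reads lines from any files named on the command line (or from the standard input), checking the converted file is just a matter of running it over tmp (assuming validate_htwt.pl is executable):

% validate_htwt.pl tmp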

The validate_htwt.pl script doesn’t produce any output (except for warning messages) because it doesn’t need to reformat any of the input values. Assuming that tmp passes validation with no errors, it can be loaded into MySQL with a simple LOAD DATA statement:

mysql> LOAD DATA LOCAL INFILE 'tmp' INTO TABLE tbl_name;