HP Vertica Essentials (2014)
Chapter 6. Bulk Loading
Bulk loading is the process of inserting a huge amount of data at once. Bulk loading in Vertica is performed using the COPY command. This chapter will cover topics such as the use of the COPY command, different load methods, and the basics of data transformation.
Using the COPY command
The COPY command can only be used by a superuser. The COPY command provides the flexibility to load and manage data with the help of the following optional parameters:
· Format and arrangement of the incoming data
· Metadata about the data load
· Data transformation
· Error handling
The data to be loaded should be encoded in UTF-8. It is advisable to check the encoding of the file before loading the data. If the data is not in UTF-8, then we can convert it using the following Linux/UNIX iconv command:
iconv -f encoding-of-old-file -t encoding-of-new-file old-file.txt > newfile.txt
This can be illustrated with the help of the following example:
> iconv -f WINDOWS-1251 -t UTF-8 data.txt > data_new.txt
You can also check for the various formats supported by iconv using iconv -l.
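As a quick way to inspect the current encoding before converting, the standard Linux file utility can be used (this is a general operating system tool, not part of Vertica; data.txt is just the sample filename used above):
> file -i data.txt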
It should also be noted that the data should be separated by a proper delimiter character. Before loading the data, you should also check that the delimiter character does not appear inside any CHAR(N) or VARCHAR(N) data values. The default delimiter character is the pipe character (|).
The following is an example of the COPY command with all the possible options:
COPY [TARGET_TABLE]
FROM { STDIN
         [ BZIP | GZIP | UNCOMPRESSED ]
     | 'pathToData' [ ON nodename | ON ANY NODE ]
         [ BZIP | GZIP | UNCOMPRESSED ] [, ...]
     | LOCAL { STDIN | 'pathToData' }
         [ BZIP | GZIP | UNCOMPRESSED ] [, ...]
     }
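For reference, loading from STDIN usually means piping data through the vsql client. The following is a minimal sketch, assuming a pipe-delimited file and the sample table1 used later in this chapter:
> cat /root/data/tab1.txt | vsql -c "COPY table1 FROM STDIN DELIMITER '|';"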
The following is a simple example for loading the data:
COPY table1 FROM '/root/data/tab1.txt';
Here, table1 is the target table while '/root/data/tab1.txt' is the source data.
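The examples in this chapter assume that the target table already exists. A hypothetical definition for table1, sufficient for the simple example above (the column names and types are illustrative assumptions), is as follows:
CREATE TABLE table1 (
    col1 VARCHAR(50),
    col2 VARCHAR(50),
    col3 VARCHAR(50),
    col4 VARCHAR(50)
);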
To load data from the client to the Vertica database cluster, we should use COPY…FROM LOCAL, as shown in the following example:
COPY table1 FROM LOCAL '/root/data/tab1.txt' DELIMITER '~';
We can also provide more than one delimiter. For example, let's say we have data in the following fashion, a|b|c|d~e|f, with both | and ~ being used as delimiters. This can be handled with a per-column delimiter, as shown in the following example:
COPY table1 COLUMN OPTION (col4 DELIMITER '~') FROM '/root/data/tab1.txt' DELIMITER '|'
We can also provide multiple files by giving a comma-separated list as follows:
COPY table1 FROM LOCAL '/root/data/tab1_1.txt', '/root/data/tab1_2.txt'
We can supply compressed (GZIP or BZIP) files as follows:
COPY table1 FROM LOCAL '/root/data/tab1' GZIP
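For reference, a compressed input file such as the one above can be produced with the standard gzip utility; the paths here are just the sample paths used in this chapter:
> gzip -c /root/data/tab1.txt > /root/data/tab1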
We can load data files from any node or from a specific node by using the ON clause with pathToData. This clause is optional; if it is not supplied, Vertica looks for the file on the local node from which the command is invoked, for example:
COPY table1 FROM '/root/data/tab1' ON ANY NODE GZIP;
In the preceding example, ON ANY NODE is the option supplied with pathToData, so Vertica will look for the file on any node in the cluster. Note that ON ANY NODE applies to server-side paths and is not combined with LOCAL.
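A file can also be read from one specific node by naming it; the node name below (v_vmart_node0001) is only a placeholder for an actual node name in your cluster:
COPY table1 FROM '/root/data/tab1.txt' ON v_vmart_node0001 DELIMITER '|';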
Aborting the COPY command
If at any point you feel that something is wrong with the data or the loading process, you can simply cancel the bulk load. All the changes made during the load are rolled back, and the table returns to its original state. Remember, it is not advisable to abort a bulk loading process, but if the situation warrants it, then go ahead.
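One way to cancel a load started from vsql is to press Ctrl + C in that session. A running COPY can also be interrupted from another session with Vertica's INTERRUPT_STATEMENT function; the following is a minimal sketch in which the session ID and statement ID are placeholders that you would first look up in the SESSIONS system table:
SELECT session_id, statement_id FROM sessions;        -- identify the running COPY
SELECT INTERRUPT_STATEMENT('node01-1234:0x2f3', 1);   -- placeholder IDs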
Load methods
Depending on the size of the data, you should select one of the following load methods with the COPY command:
· AUTO: This is the default option. It loads data into WOS and, after WOS is full, continues loading data into ROS. It is good for data less than 100 MB in size. (Please refer to Chapter 5, Performance Improvement, to understand more about ROS and WOS.)
· DIRECT: This loads data directly into ROS containers. It is advised to use the DIRECT load method for data more than 100 MB in size.
· TRICKLE: This loads data only into WOS. After WOS is full, it generates an error. It is suggested to use this method for frequent incremental load operations.
An example of a load method is as follows:
COPY table1 FROM '/root/data/tab1.txt' DIRECT;
For incremental loads, it is suggested to use the NO COMMIT option with COPY and issue a COMMIT at a later stage for better control at the transaction level.
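The following is a minimal sketch of this pattern, assuming the same sample table and file used earlier:
COPY table1 FROM '/root/data/tab1.txt' DIRECT NO COMMIT;
-- inspect or validate the loaded rows here, then either
COMMIT;
-- or ROLLBACK;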
Data transformation
Using the COPY command, we can control which columns of a table the values are inserted into. Moreover, the COPY command supports operators, constants, NULLs, and comments. It should be noted that the COPY command cannot use analytic or aggregate functions while loading the data, although it supports the following types of functions (a short transformation example is sketched after this list):
· Date/time
· Formatting
· Numeric
· String
· Null handling
· System information
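The following is a minimal sketch of a transformation performed during a load; the table, its columns, and the source file are illustrative assumptions and not part of the original text:
CREATE TABLE names (full_name VARCHAR(50));
COPY names (first_part FILLER VARCHAR(20),
            last_part FILLER VARCHAR(20),
            full_name AS first_part || ' ' || last_part)
FROM '/root/data/names.txt' DELIMITER '|';
Here, the two FILLER columns are read from the source file but are not stored; only the concatenated full_name value is loaded.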
You can also ignore columns from the source files; for that, we need to use the FILLER option, as in the following sketch.
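A hedged example of skipping the second column of a source file (the table and column names are assumptions):
CREATE TABLE table2 (col1 INTEGER, col3 VARCHAR(50));
COPY table2 (col1, col2_to_skip FILLER VARCHAR(50), col3) FROM '/root/data/tab2.txt' DELIMITER '|';
Because col2_to_skip is declared as FILLER and is not referenced in any expression, its values are parsed but never stored.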
Summary
Since Vertica is geared toward OLAP workloads, bulk loading is an operation you will perform frequently. As you have observed, bulk loading in Vertica is quite simple and follows the same standards as other relational databases.