Bulk Loading - HP Vertica Essentials (2014)

HP Vertica Essentials (2014)

Chapter 6. Bulk Loading

Bulk loading is the process of inserting a huge amount of data at once. Bulk loading in Vertica is performed using the COPY command. This chapter will cover topics such as the use of the COPY command, different load methods, and the basics of data transformation.

Using the COPY command

The COPY command can only be used by a superuser. The COPY command provides the flexibility to load and manage data with the help of the following optional parameters:

· Format and arrangement of the incoming data

· Metadata about the data load

· Data transformation

· Error handling

The encoding of the data to be loaded should be in the UTF-8 format. It is advisable to check the encoding of the file before loading the data. If the data present is not in the UTF-8 format, then we can convert it using the following Linux/UNIX iconv command:

iconv -f encoding-of-old-file -t encoding-of-new-file old-file.txt > newfile.txt

This can be illustrated with the help of the following example:

> iconv –f WINDOWS-1251 –t UTF-8 data.txt > data_new.txt

You can also check for the various formats supported by iconv using iconv -l.

It should also be noted that data should be segregated with proper delimiter characters. Before loading the data, it should also be checked that no CHAR(N) or VARCHAR(N) data values are included in the delimiter character. The default delimiter character is the pipe character, or |.

The following is an example of the COPY command with all the possible options:

COPY [TARGET_TABLE]

FROM { STDIN

...... [ BZIP | GZIP | UNCOMPRESSED ]

...| 'pathToData' [ ON nodename | ON ANY NODE ]

...... [ BZIP | GZIP | UNCOMPRESSED ] [, ...]

...| LOCAL STDIN | 'pathToData'

...... [ BZIP | GZIP | UNCOMPRESSED ] [, ...]

}

The following is a simple example for loading the data:

COPY table1 FROM '/root/data/tab1.txt';

Here, table1 is the target table while '/root/data/tab1.txt' is the source data.

To load data from the client to the Vertica database cluster, we should use COPY…FROM LOCAL, as shown in the following example:

COPY table1 FROM LOCAL '/root/data/tab1.txt' DELIMITER '~';

We can provide more than one delimiter. For example, let's say we have data in the following fashion, a|b|c|d~e|f, with | and ~ being delimiters, as shown in the following example:

COPY table1 COLUMN OPTION (col4 DELIMITER '~') FROM '/root/data/tab1.txt' DELIMITER '|'

We can also provide multiple files by giving a comma-separated list as follows:

COPY table1 FROM LOCAL '/root/data/tab1_1.txt', '/root/data/tab1_2.txt'

We can supply archives (GZIP and BZIP) containing files as follows:

COPY table1 FROM LOCAL '/root/data/tab1' GZIP

We can supply data files from any node or a specific node by using pathToData. This is an optional parameter, and if not supplied, it finds files in the local node from which the command is invoked, for example:

COPY table1 FROM LOCAL '/root/data/tab1' GZIP ON ANY NODE

In the preceding example, ON ANY NODE is the pathToData exception.

Aborting the COPY command

If at any point in time you feel that something is wrong with the data or loading process, then you can just cancel the bulk load process. All the changes made during this process will be rolled back to its original state. Remember, it is not advisable to abort a bulk loading process, but if the situation warrants it, then go ahead.

Load methods

Depending on the size of the data, you should select one of the following load methods with the COPY command:

Load methods

Description and use

AUTO

This is the default option. It loads data into WOS. After WOS is full, it continues loading data into ROS. It is good for data less than 100 MB in size. (Please refer to Chapter 5, Performance Improveme nt, to understand more on ROS and WOS.)

DIRECT

This loads data directly into ROS containers. It is advised to use the DIRECT load method for data more than 100 MB in size.

TRICKLE

This loads data only into WOS. After WOS is full, it generates an error. It is suggested to use this for frequent incremental load operations.

An example of a load method is as follows:

COPY table1 from '/root/data/tab1.txt' DIRECT

For incremental loads, it is suggested to use the NO COMMIT command and use the COMMIT command at a later stage for better control at the transaction level.

Data transformation

Using the COPY command, we can control the columns in which the values need to be inserted for a table. Moreover, the COPY command supports operators, constants, Nulls, and comments. It should be noted that the COPY command cannot use the analytic and aggregate functions while loading the data, although it supports the following types of functions:

· Date/time

· Formatting

· Numeric

· String

· Null handling

· System information

You can also ignore columns from source files. For that, we need to use the FILLER option.

Summary

Since Vertica is more apt for OLAP purposes, it is imperative for you to perform bulk loading. As you must have observed, bulk loading in Vertica is quite simple and follows the same standards as followed by other relational databases.