Backup and Restore - HP Vertica Essentials (2014)

HP Vertica Essentials (2014)

Chapter 4. Backup and Restore

In this chapter, you will learn how to create database backups and restore backups in Vertica. Vertica provides the vbr.py script to back up, restore, and copy a database. We can create both full and incremental database snapshots, as well as snapshots of specific schemas or tables. For creating the most optimum backup, it is suggested that each node have its own dedicated backup host.

Requirements for backup hosts

The vbr.py utility lets us back up the database to one or more hosts (called backup hosts) that can be outside the database cluster. The backup hosts must have a password-less SSH access for the database administrator account. Also, backup hosts must have the same versions of Python and rsync as the main nodes. The vbr.py utility initially creates a snapshot of the database cluster of which the backup is being created. When the creation of the new snapshot is complete, it is moved to the designated backup location (host and directory). After Vertica copies the snapshot to the backup location, it deletes the snapshot created initially in the main cluster.

Generating the vbr.py configuration file

To invoke vbr.py to set up a configuration file, we should use the following command:

> vbr.py --setupconfig

The script prompts us to answer the following questions for parameters marked with an asterisk, *. For setting all the advanced options, we need to answer the last question with a "yes" during initial setup or change the configuration file manually. A configuration file contains various parameters categorized under various headers. The upcoming sections explain the headers and corresponding parameters present in the configuration file.

Miscellaneous settings

The following table shows the parameters present in the Miscellaneous [Misc] section of the file:

Parameters

Default

Description

snapshotName*

snapshotName

This value provides the prefix of the directory that will contain the snapshot. The characters in the value of snapshotName can include the following:

· Aa-Zz

· 0-9

· Period (.), hyphen (-), and underscore (_)

tempDir

/tmp

The vbr.py utility uses this directory path as the path on all nodes to store temporary data/files during the backup process. Since it is the universal path, it must be present on all nodes in the cluster.

verticaBinDir

/opt/vertica/bin

If Vertica is installed on some other directory than the default location, we need to state where the bin files are present through this parameter.

verticaConfig*

False

Sometimes it is a good idea to backup configuration files. This parameter, when set to true, permits Vertica to back up config files along with data files.

restorePointLimit*

1

If we wish to store incremental snapshots, we need to increase the value of this parameter from 1. The permissible value range is from 1-99.

objects*

None

We can create a partial backup by providing a list of object names in the form of a comma-separated list which is included in a snapshot.

overwrite

True

This parameter is not part of the configuration file and needs to be included manually under the [MISC] header to change its default value. When set to true, a newer object overwrites the existing one in the database whenever OID conflicts occur during restore.

retryCount

2

This number indicates the number of attempts or retries of the backup operation after a fatal error occurs.

retryDelay

1

This is the interval time in seconds in between retries for backup after a failure occurs.

Database access settings

The parameter under the [Database] header needs to be set for database access. The following table lists the parameters present in this section:

Parameters

Default

Description

dbName

N/A

As the name implies, it tells the utility which database to back up. If not supplied with a name, any live database is chosen as a backup candidate.

dbUser*

The current username

This parameter tells the utility which user is authorized to run vbr.py to back up or restore. The user must have admin rights.

dbPromptForPassword*

True

If this parameter is set to False, the utility will not prompt for a password at runtime. If set otherwise, we must also enter the database administrator password in the dbPassword parameter (discussed next).

dbPassword*

None

This identifies the database administrator's password. The vbr.py utility doesn't encrypt the password, and hence precautions must be taken while supplying the password. If dbPromptForPassword is set to tr ue and no password is supplied in the configuration file, we must supply it at runtime (this is more secure).

Data transmission during the backup process

The following table lists the parameters present in the [Transmission] section of the file:

Parameters

Default

Description

encrypt

False

When this parameter is set to true, data transmitted during the backup process is encrypted. As a rule of thumb, the utility may engage one whole core on both source and destination nodes to look for encryption. This may create a significant overhead and the performance may suffer.

checksum

False

In order to ensure that there is no data loss during transmission, we can set this parameter to true. When set to true, rsync performs MD5 checksum over the transmitted data to ensure integrity.

port_rsync

50000

This sets the port number on which rsync will work.

bwlimit

0

This parameter defines the upper limit of the bandwidth of the network to be used in kbps during backup.

hardLinkedLocal

False

This configuration parameter, when set to true, makes the utility create a backup on a local directory instead of a remote location. A local backup is created by hard links instead of copying data.

Mapping

For each node of the database cluster, a separate mapping heading needs to be written. Each heading is numbered ([Mapping1], [Mapping2], and so on) and controls the backup of the respective node. All of the parameters under this header are mandatory in the configuration file. The following table lists the parameters:

Parameters

Default

Description

backupHost*

None

As the name implies, this tells the utility the hostname of the backup node.

backupDir*

None

This is the path to the directory that will host the backup files on the designated host or node where the backup will be stored. This directory must exist and the user must have all rights to it.

dbNode

None

This is Vertica's internal name of the host usually in the form v_databasename_node00xx.

Creating full and incremental backups

In this section, you will learn about the process of creating full and incremental backups. But first, let's understand the requirements.

Understanding the requirements

Before we create a snapshot, we must check the following:

· The database is up and running.

· All of the backup hosts are up and available to the cluster.

· The user account has all permissions to the target directories on the backup node.

· By default, the vbr.py utility searches for the config file at /opt/vertica/config/vbr.ini, and if it is not found, the utility exits with an error. We can supply the config file using the --config-file parameter with vbr.py.

It should be noted that if some nodes are down but the database is running in K-Safe mode, then the backup will still be created.

Running vbr.py

To run the vbr.py utility, use the following command:

> vbr.py --task backup --config-file myconfig.ini

Copying...

xxxxxx out of xxxxxx, 100%

All child processes terminated successfully.

Committing changes on all backup sites...

backup done!

Incremental snapshots

If restorePointLimit is set to a value more than 1, then after creating a full snapshot, the vbr.py utility will create the future backups in an incremental manner. This incremental backup process works in the following manner:

1. The utility obtains the value of the restorePointLimit parameter from the configuration file.

2. If the value is more than 1, then after creating a full backup, incremental backups will be created on subsequent invocation of the utility. It is suggested to create manual backups of snapshots periodically.

3. If after creating the next snapshot the total number of snapshots exceeds the restore point limit, vbr.py automatically deletes the oldest snapshot to maintain the count.

Creating schema and table snapshots

As discussed earlier, it is possible to create object-specific snapshots. These objects can be certain tables or schemas. As mentioned in the table, there are the following two configuration file parameters:

· Objects

· Overwrite

Refer to the table under the Miscellaneous settings section for more information.

Restoring full database snapshots

Restoring full database snapshots in Vertica can be a little tricky due to various constraints. The following are the constraints:

· The database must be down. It is good to have another database cluster to serve the application by the time the restore is complete on the primary cluster.

· The new cluster should have the same number of hosts as the base cluster of which the backup is created. This constraint is included to tackle K-Safety issues that might arise later. In fact, it is important to keep even node names and the IP addresses the same as that of the nodes in the base cluster.

We can use a full database snapshot created in Vertica 5.0 to restore into a 6.0 database. To begin a full database snapshot restore, log in using the database administrator's account created during installation. We cannot run the utility as root.

Restoring from a specific snapshot

At first, the utility restores the most recent snapshot, though it is possible to control which snapshot needs to be restored in case we have saved multiple versions. It is important to bear in mind that snapshots are restored to the same database from which they were created. In addition to that, you cannot restore a partial snapshot into an empty database. The following example exhibits restoring a database using the settings in the myconfig.ini config file:

> vbr.py --task restore --config-file myconfig.ini

Copying...

xxxxxx out of xxxxxx, 100%

All child processes terminated successfully.

restore done!

Restoring from the most recent snapshot

To restore from the most recent of several snapshots, we can use the --archive option of the vby.py command, specifying the full snapshot name with its date and time. The following example illustrates restoring a database snapshot using the settings provided in themyconfig.ini config file, which also contains the superuser's password:

> vbr.py --task restore --config-file myconfig.ini --archive=20131210_308564

The --archive parameter recognizes the archive directory created on October 12, 2011 (_archive20131210) at timestamp 308564. Note that the configuration file contains the snapshot name of the directory, and while running the command, we just need to specify the date and timestamp. If the snapshot directory has the _archive suffix, the directory is archived.

Restoring schema and table snapshots

We can restore object-level snapshots only to the database from which they were created. Only the snapshots created in the same backup are compatible. We can restore object-level snapshots in the same backup location after a full-database snapshot restore. It should be noted that we cannot restore any object-level snapshot into an empty database.

Copying a database from one cluster to another

As discussed earlier, certain administration tasks require the database to shut down. In that case, it is good to have a secondary cluster. The vbr.py utility can do this activity of copying the database from one cluster to another using the copycluster command task. We can create a config file explicitly for copying the database from the base cluster to another cluster. In this config file, it is required to provide the hostnames or IP addresses of nodes of the target cluster as the backup hosts.

Please note that we can use neither incremental snapshots nor object-level snapshot files with the copycluster command. We can only use a full database backup file.

The following example config file can be used to copy a database on a three-node cluster to another three-node cluster (secondary cluster) comprising nodes named bck_host01, bck_host02, and bck_host03. Please note that although backupHost is used, backupDir is not used. This is because it is a cluster-wide doubling and not just a data backup.

[Misc]

snapshotName = km_001

tempDir = /data/tmp

restorePointLimit = 1

verticaConfig = False

retryCount = 1

retryDelay = 1

[Database]

dbName = km

dbUser = dba

dbPassword = sql

dbPromptForPassword = False

[Transmission]

encrypt = True

checksum = False

port_rsync = 50000

bwlimit = 0

hardLinkLocal = False

[Mapping0]

dbNode = v_km_node0001

backupHost = bck_host01

[Mapping1]

dbNode = v_km_node0002

backupHost = bck_host02

[Mapping2]

dbNode = v_km_node0003

backupHost = bck_host03

Copying the database

The following example exhibits copying a cluster using a config file located in the current directory:

> vbr.py --config-file CopyCluster.ini --task copycluster

Copying...

xxxxxx out of xxxxxx, 100%

All child processes terminated successfully.

copycluster done!

The vbr.py utility should be invoked from the source cluster and not from the destination cluster, else the backup database will be copied to the primary cluster. The user must be a superuser.

Using database snapshot functions

Apart from the vbr.py utility, we can also use database snapshot functions that let us create snapshots and remove snapshots. It is worth noting that database snapshot functions do not provide us with great flexibility as vbr.py does. By using database snapshot functions, we are limited to creating a full image backup of the database on the respective nodes. For example, if we have a three-node cluster, then a backup of each node will be created on the respective nodes; this is unlike vbr.py, where we have an option to migrate them to other nodes.

There are the following two types of snapshots that we can create using these functions:

· Durable snapshots: These are like hard links to the actual data files of the database with a proper directory structure. So whatever changes are made, the storage containers will be readily available through these hard links. These snapshots remain persistent throughout.

· Non-durable snapshots: This type of snapshot consists of the actual files containing the database's data and not hard links. So any change made after the snapshot has been created will not be reflected in these snapshots.

There are mainly two types of database snapshot functions, as follows:

· Database_snapshot

· Remove_database_snapshot

Creating database snapshots

We will use the DATABASE_SNAPSHOT SQL function to create snapshots. This function requires the following two parameters:

· The name of the snapshot.

· Whether it is durable or not (true/false). In the case of true, a durable snapshot will be created. In the case of false, a non-durable snapshot will be created.

The following example illustrates the use of the Database_snapshot function:

=> SELECT DATABASE_SNAPSHOT('snapshot_1', true);

DATABASE_SNAPSHOT

----------------------------------------------------

v_km_node0001,v_km_node0002,v_km_node0003

(1 row)

=> SELECT DATABASE_SNAPSHOT('snapshot_2', false);

DATABASE_SNAPSHOT

----------------------------------------------------

v_km_node0001,v_km_node0002,v_km_node0003

(1 row)

If we wish to check what snapshots have been created, we can check the Database_snapshots system table. This system table will not have data of historic snapshots. The following is the sample output when we query the Database_snapshots system table:

Creating database snapshots

The snapshots subdirectory in the catalog directory stores both durable and non-durable snapshots on each node of the cluster. Both types of snapshots have two files, which are as follows:

· snapshotname.txt: This file contains an overview of the snapshot. The following is the output of a durable snapshot:

Creating database snapshots

· snapshotname.ctlg: This file contains all information pertaining to the database catalog.

Removing snapshots

We can manually remove snapshots using the REMOVE_DATABASE_SNAPSHOT SQL function, as shown in the following commands:

=> SELECT REMOVE_DATABASE_SNAPSHOT(snapshot_1');

-[ RECORD 1 ]-----------+------------------------------------------------------------

REMOVE_DATABASE_SNAPSHOT | Removed: v_km_node0001,v_km_node0002,v_km_node0003

=> SELECT REMOVE_DATABASE_SNAPSHOT('snapshot_2');

-[ RECORD 1 ]------------+------------------------------------------------------------

REMOVE_DATABASE_SNAPSHOT | Removed: v_km_node0001,v_km_node0002,v_km_node0003

Now we can check the Database_snapshot system tables. Because we have removed all the snapshots, we will find no rows, as shown in the following command:

=> SELECT * FROM DATABASE_SNAPSHOTS;

(No rows)

Apart from manually removing the snapshots we set, there are two parameters that allow Vertica to perform automatic removal of snapshots after a certain time interval. They are shown in the following table:

Parameters

Description

Default

Example

RemoveSnapshotInterval

This is the interval time in seconds between two checks that Vertica performs for automatically removing snapshots.

3600

SELECT SET_CONFIG_PARAMETER ('RemoveSnapshotInterval', 7200);

SnapshotRetentionTime

This is a lifetime of snapshots in seconds. Vertica will not remove a snapshot automatically until this time limit is reached.

3600

SELECT SET_CONFIG_PARAMETER ('SnapshotRetentionTime', 7200);

Summary

A good backup strategy helps you safeguard against possible data losses in case of any catastrophic events. Vertica provides highly flexible and configurable backup and restore functions, which when optimally used can provide a satisfactory level of data backup.

In the next chapter, we will discuss performance tuning in Vertica and some basic concepts around it.