Monitoring Vertica - HP Vertica Essentials (2014)

Chapter 3. Monitoring Vertica

Monitoring is one of the key database administration tasks. It allows you to find possible bottlenecks in database performance in the most pragmatic fashion. In Vertica, there are several sources of information through which monitoring can be performed. They are as follows:

· System tables

· Logfiles

· Management Console (Enterprise Edition only)

Monitoring through the system tables

Vertica stores most of its information about the various states of the database in system tables. The process of storing such information is completely automatic and runs in the background. By querying these system tables and by relating various types of information fetched from these tables, you can efficiently monitor Vertica. The advantage of using system tables is that we can use them, just like any other user table, to perform aggregation, analytics, and so on. There are more than 100 system tables; we can get the whole list by running the following query:

SELECT * FROM system_tables;

The following screenshot shows the output of the query:

The system_tables table
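Because system_tables behaves like any other table, we can aggregate over it. The following sketch counts how many system tables fall under each schema (column names as shown in the preceding output):

```sql
-- Count the system tables in each schema (V_CATALOG and V_MONITOR)
SELECT table_schema, COUNT(*) AS table_count
FROM system_tables
GROUP BY table_schema
ORDER BY table_schema;
```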

On the basis of the type of information a system table stores, system tables can be grouped into the following categories:

· System information

· System resources

· Background processes

· Workload and performance

System tables are clustered into the following schemas (as shown in the preceding screenshot):

· V_CATALOG: This schema contains information about persistent objects in the catalog

· V_MONITOR: This schema contains information about the temporary system state

Examples of V_CATALOG tables include views, roles, nodes, databases, license_audits, and so on, while examples of V_MONITOR tables include locks, disk_storage, resource_usage, and so on.
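For instance, combining a V_CATALOG table with a V_MONITOR table in one session gives a quick health check; a small sketch (column names may vary slightly between Vertica versions):

```sql
-- Persistent catalog side: which nodes exist and what state they are in
SELECT node_name, node_state FROM nodes;

-- Temporary monitor side: what locks are currently held or requested
SELECT object_name, lock_mode, transaction_description FROM locks;
```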

We do not need to specify a schema (V_CATALOG or V_MONITOR) for querying system tables as they belong to the default schema. It should be noted that there are certain restrictions imposed on system tables; they are as follows:

1. Vertica provides a SQL-monitoring API that can be used on these system tables, but the database will not connect through this API if the Vertica cluster is in a recovering state.

2. Due to the nature of the data stored in these tables, it is not possible to perform Data Definition Language (DDL) and Data Manipulation Language (DML) operations on system tables.

Understanding a system table example

For example, to retrieve information about all the tables in a database, we just need to run the following command:

SELECT * FROM tables;

The following screenshot shows the output of the command. Please note that the table_schema and table_name columns are case sensitive.

The tables system table
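The case sensitivity matters when filtering. The following sketch shows both an exact (case-sensitive) match and a case-insensitive alternative; 'my_table' is just a placeholder name:

```sql
-- Exact match is case sensitive; this only finds a table created
-- with exactly this lowercase name
SELECT table_schema, table_name
FROM tables
WHERE table_name = 'my_table';

-- ILIKE performs a case-insensitive match instead
SELECT table_schema, table_name
FROM tables
WHERE table_name ILIKE 'MY_TABLE';
```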

Looking at events

Vertica logs events at various severity levels, such as Emergency, Alert, Critical, Error, Warning, Notice, Informational, and Debug. Vertica records these events in one of the following ways:

· In the vertica.log file

· In the ACTIVE_EVENTS system table

· Using SNMP

· Using Syslog

This book will cover the usage of the vertica.log file and the ACTIVE_EVENTS system table for logging events.

The following list describes the events that are logged, along with their event type codes:

· Low Disk Space (type 0): This event is logged either when there is not enough space on the physical disk or when there is a hardware issue.

· Read Only File System (type 1): Vertica uses specific directories for storing catalog and data information. When Vertica is unable to create or write to files, this event is logged. Changing access rights may solve the issue.

· Loss Of K Safety (type 2): Vertica defines the minimum number of nodes needed to run a database in K-safe mode. For example, maintaining K-safety of level 2 requires at least five nodes; if the number of active nodes goes below five, this event is logged. It should be ensured that all nodes in the cluster are able to reach each other.

· Current Fault Tolerance at Critical Level (type 3): As discussed earlier, if a database loses K-safety, it shuts down, and this event is logged.

· Too Many ROS Containers (type 4): This may occur for many reasons, primarily when there are a large number of local segments or when Mergeout is not working efficiently. It is suggested that you stop the data load in such events and allow the database to recover.

· Node State Change (type 6): This event occurs when a node changes its state, for example, from startup to up.

· Recovery Failure (type 7): This event occurs when a database is not able to recover properly from a nonfunctional state.

· Recovery Error (type 8): If the number of recovery errors exceeds the value of Max Tries defined in the vbr.py configuration file, this event is triggered.

· Recovery Lock Error (type 9): This event is logged when a recovering node is not able to secure a lock on tables.

· Recovery Projection Retrieval Error (type 10): This event is logged when Vertica is unable to retrieve information about a projection.

· Refresh Error (type 11): This event is logged when a database is not able to refresh itself.

· Refresh Lock Error (type 12): This event is logged when a database is not able to obtain the necessary locks during a refresh.

· Tuple Mover Error (type 13): This event is logged when a database fails to move data from RAM (WOS, or Write Optimized Store) to physical disk (ROS, or Read Optimized Store).

· Timer Service Task Error (type 14): This event is logged when an error occurs in an internal scheduled task.

· Stale Checkpoint (type 15): This event occurs when Write Optimized Store (WOS) data is not completely flushed to the Read Optimized Store (ROS) containers before shutdown. This may result in loss of data.

Looking at events through logfiles

During database startup, Vertica writes logs to the dblog file. This file resides in the catalog-path/<database-name>/ directory on each node. To find out whether the database started successfully, or why it failed to start, look into the dblog file. The following is sample text from a dblog file:

Connected to spread at host 192.168.56.101(192.168.56.101) port 4803

[Thu 07 Mar 2013 16:41:49] SP_connect: DEBUG: Auth list is: NULL

Connected to spread at host 192.168.56.101(192.168.56.101) port 4803

After the database is up and running properly, Vertica starts writing logs to vertica.log. Just like the dblog file, the vertica.log file is maintained on each node of the cluster. It resides in the catalog-path/<database-name>/<node-name>_catalog directory. The following is a sample text:

2013-01-05 04:03:07.225 Main:0x1124df70 [Init] <INFO> Starting up Vertica Analytic Database v6.0.0-1

2013-01-05 04:03:07.225 Main:0x1124df70 [Init] <INFO> Project Codename: BunkerHill

2013-01-05 04:03:07.225 Main:0x1124df70 [Init] <INFO> vertica(v6.0.0-1) built by release@build2.verticacorp.com from releases/VER_6_0_RELEASE_BUILD_0_1_20120611@95490 on 'Mon Jun 11 10:40:47 2012' $BuildId$

2013-01-05 04:03:07.225 Main:0x1124df70 [Init] <INFO> 64-bit Optimized Build

2013-01-05 04:03:07.225 Main:0x1124df70 [Init] <INFO> Compiler Version: 4.1.2 20080704 (Red Hat 4.1.2-52)

2013-01-05 04:03:07.225 Main:0x1124df70 <LOG> @v_km_node0001: 00000/5081: Total swap memory used: 94208
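When scanning vertica.log by hand, filtering on the severity tag between angle brackets is usually enough. The following shell sketch creates a small sample file in the same format (the log lines themselves are illustrative) and filters out everything below WARNING:

```shell
# Build a tiny vertica.log-style sample (illustrative content only)
cat > /tmp/vertica_sample.log <<'EOF'
2013-01-05 04:03:07.225 Main:0x1124df70 [Init] <INFO> Starting up Vertica Analytic Database v6.0.0-1
2013-01-05 04:03:09.001 Main:0x1124df70 [Recover] <WARNING> Node v_km_node0001 is recovering
2013-01-05 04:03:10.410 Main:0x1124df70 [Txn] <ERROR> Could not obtain lock on table t1
EOF

# Keep only WARNING and ERROR entries
grep -E '<(WARNING|ERROR)>' /tmp/vertica_sample.log
```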

Looking at events through the ACTIVE_EVENTS system table

The ACTIVE_EVENTS system table can be used to look at various events logged by Vertica. The following table describes the various fields of the ACTIVE_EVENTS system table:

· node_name: The internal name of the node

· event_code: The ID of the event type

· event_posted_timestamp: The timestamp when the event was initiated

· event_expiration: The timestamp when the event expires

· event_code_description: A description of the event type

· event_problem_description: The actual description of the event

· reporting_node: The internal name of the node on which the event occurred

The following screenshot shows the sample output when we query the ACTIVE_EVENTS system table:

The ACTIVE_EVENTS system table
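Since ACTIVE_EVENTS is an ordinary system table, it can be filtered by event type. The following sketch pulls recent occurrences of one event (event code 4, Too Many ROS Containers, from the list earlier in this chapter):

```sql
-- Most recent "Too Many ROS Containers" events across the cluster
SELECT node_name, event_code, event_code_description,
       event_posted_timestamp
FROM active_events
WHERE event_code = 4
ORDER BY event_posted_timestamp DESC;
```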

Monitoring Vertica through the Management Console

Vertica provides an intuitive web console that can be installed and integrated with Vertica. Vertica Management Console is shipped only with the Enterprise Edition. It can help you perform almost every administration task that is possible through the command-line tools.

Management Console runs only on browsers that support HTML5. By default, it is hosted on port 5450, which can be changed. To access it, we can enter https://xx.xx.xx.xx:5450/webui, where xx.xx.xx.xx is the IP of the host on which Management Console is hosted. It will prompt for a username and password; we can supply the credentials of any user who has admin rights. It is not advised to install Management Console on data-hosting nodes.

Retaining monitoring information

As discussed earlier, system tables provide a plethora of information regarding the various processes running on a database, including queries. However, sometimes we need to retain this monitoring information for later analysis. For this, we can employ Data Collector. Information retained by Data Collector is stored on disk in the DataCollector directory under the Vertica catalog path.

Enabling and disabling Data Collector

To enable or disable Data Collector, we can use the SET_CONFIG_PARAMETER() function. Data Collector is on by default. To disable Data Collector, use the following command:

=> SELECT SET_CONFIG_PARAMETER('EnableDataCollector', '0');

Use the following command to enable Data Collector:

=> SELECT SET_CONFIG_PARAMETER('EnableDataCollector', '1');

Viewing the current data retention policy

To view the data retention policy, we can use the GET_DATA_COLLECTOR_POLICY() function. We can replace the component variable with the actual component name, as shown in the following line of code:

SELECT GET_DATA_COLLECTOR_POLICY('component');

To get the whole list of components, we can query the V_MONITOR.DATA_COLLECTOR system table. In addition to the list, we will also get their current retention policies and statistics about how much data is retained.
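Putting the two together, a short sketch first lists the components and then fetches the policy for one of them (ResourceAcquisitions is used here purely as an example component name):

```sql
-- List all Data Collector components and what they record
SELECT DISTINCT component, description
FROM v_monitor.data_collector;

-- Fetch the current retention policy for one component
SELECT GET_DATA_COLLECTOR_POLICY('ResourceAcquisitions');
```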

Configuring data retention policies

We can set Data Collector policies using the SET_DATA_COLLECTOR_POLICY() function. Only a superuser can modify policies. This function, shown in the following line of code, allows us to change how much memory and how much disk space (both in KB) to retain for each component on all nodes:

SET_DATA_COLLECTOR_POLICY('component', 'memoryKB', 'diskKB')

Failed nodes are not excluded from a policy, as excluding them would create policy inconsistencies across the cluster. Hence, when a failed node rejoins, the latest policy is applied to it.
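For example, the following sketch raises the retention for the AllocationPoolStatistics component (which appears in the sample output later in this section); the KB values are arbitrary illustrations:

```sql
-- Retain up to 1,000 KB in memory and 10,000 KB on disk per node
-- for this component (requires superuser privileges)
SELECT SET_DATA_COLLECTOR_POLICY('AllocationPoolStatistics', '1000', '10000');
```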

Monitoring data collection components

As discussed earlier, when the DATA_COLLECTOR system table is queried, we are presented with a list of Data Collector components, their current retention policies, and statistics about how much data is retained and how much has been discarded. DATA_COLLECTOR also calculates the approximate collection rate to aid in sizing calculations. The following is a simple query that returns all the columns in this system table:

km=> \x
Expanded display is on.
km=> SELECT * FROM data_collector;
-[ RECORD 1 ]----------+-------------------------------------------------------------------------------------------------------------------
node_name              | v_km_node0001
component              | AllocationPoolStatistics
table_name             | dc_allocation_pool_statistics
description            | Information about global memory pools, which generally cannot be recovered without restart
access_restricted      | t
in_db_log              | f
in_vertica_log         | f
memory_buffer_size_kb  | 64
disk_size_kb           | 256
record_too_big_errors  | 0
lost_buffers           | 0
lost_records           | 0
retired_files          | 64
retired_records        | 29674
current_memory_records | 0
current_disk_records   | 1786
current_memory_bytes   | 0
current_disk_bytes     | 251826
first_time             | 2013-03-08 16:00:40.002042+05:30
last_time              | 2013-03-08 16:15:32.001259+05:30
kb_per_day             | 23813.3539369616
-[ RECORD 2 ]----------+-------------------------------------------------------------------------------------------------------------------
node_name              | v_km_node0001
component              | AllocationPoolStatisticsBySecond
table_name             | dc_allocation_pool_statistics_by_second
description            | Information about global memory pools, which generally cannot be recovered without restart (historical, by second)
access_restricted      | t
in_db_log              | f
in_vertica_log         | f
memory_buffer_size_kb  | 64
disk_size_kb           | 256
record_too_big_errors  | 0
lost_buffers           | 0
lost_records           | 0
retired_files          | 292
retired_records        | 29777
current_memory_records | 0
current_disk_records   | 343
current_memory_bytes   | 0
current_disk_bytes     | 219173
first_time             | 2013-03-08 16:12:41.002868+05:30
last_time              | 2013-03-08 16:15:32.001287+05:30
kb_per_day             | 107977.851408599
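The kb_per_day column shown in this output is what makes the table useful for sizing. A short sketch that aggregates it into an approximate per-node growth figure:

```sql
-- Approximate Data Collector growth per node, in MB per day
SELECT node_name,
       SUM(kb_per_day) / 1024 AS mb_per_day
FROM data_collector
GROUP BY node_name
ORDER BY mb_per_day DESC;
```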

Summary

Vertica provides several ways to monitor various processes and facilitates the retention of monitoring information. It is up to the administrator to determine how to make the best use of these tools. In the next chapter, we will learn how to create a backup of Vertica databases and restore them.