This is the multi-page printable view of this section. Click here to print.

Return to the regular view of this page.

Monitoring Vertica

You can monitor the activity and health of a Vertica database through various log files and system tables.

You can monitor the activity and health of a Vertica database through various log files and system tables. Vertica provides various configuration parameters that control monitoring options. You can also use the Management Console to observe database activity.

1 - Monitoring log files

When a Vertica database is running, each in the writes messages into a file named vertica.log.

When a database is running

When a Vertica database is running, each node in the cluster writes messages into a file named vertica.log. For example, the Tuple Mover and the transaction manager write INFO messages into vertica.log at specific intervals even when there is no mergeout activity.

You configure the location of the vertica.log file. By default, the log file is in:

catalog-path/database-name/node-name_catalog/vertica.log
  • catalog-path is the path shown in the NODES system table minus the Catalog directory at the end.

  • database-name is the name of your database.

  • node-name is the name of the node shown in the NODES system table.

To monitor one node in a running database in real time:

  1. Log in to the database administrator account on any node in the cluster.

  2. In a terminal window enter:

    $ tail -f catalog-path/database-name/node-name_catalog/vertica.log
    

catalog-path The catalog pathname specified when you created the database. See Creating a database.
database-name The database name (case sensitive)
node-name The node name, as specified in the database definition. See Viewing a database.

When the database/node is starting up

During system startup, before the Vertica log has been initialized to write messages, each node in the cluster writes messages into a file named dbLog. This log is useful to diagnose situations where the database fails to start before it can write messages into vertica.log. The dblog is located at the following path, using catalog-path and database-name as described above:

catalog-path/database-name/dbLog

See also

2 - Rotating log files

The LogRotate service periodically rotates logs and removes old rotated logs.

The LogRotate service periodically rotates logs and removes old rotated logs. To view previous LogRotate events, see LOG_ROTATE_EVENTS.

If the following files exceed the specified maximum size, they are rotated:

  • vertica.log
  • UDxFencedProcesses.log
  • MemoryReport.log
  • editor.log
  • dbLog

Rotated files are compressed and marked with a timestamp in the same location as the original log file: path/to/logfile.logtimestamp.gz. For example, /scratch_b/qa/VMart/v_vmart_node0001_catalog/vertica.log is rotated to /scratch_b/qa/VMart/v_vmart_node0001_catalog/vertica.log.2023-11-08-14-09-02-381909-05.gz.

If a log file was rotated, the previously rotated logs (.gz) in that directory are checked against the specified maximum age. Rotated logs older than the maximum age are deleted.

Upgrade behavior

The LogRotate service is automatically enabled, but if your database contains the configuration file for the Linux logrotate utility at /opt/vertica/config/logrotate/database_name, the LogRotate service will not run. If you upgraded from Vertica 23.4 or below, you must delete the configuration file on all nodes to use the LogRotate service.

Automatic rotation

To configure and enable the LogRotate service:

  1. Set the maximum allowed size of logs with LogRotateMaxSize. Log files larger than this age are rotated (compressed).
  2. Set the maximum allowed age of rotated logs with LogRotateMaxAge. Rotated (compressed) logs older than this age are deleted.
  3. Set how often the service runs with LogRotateInterval.
  4. Enable the service with EnableLogRotate.

For details on these parameters, see Monitoring parameters.

To view the current values for these parameters, query CONFIGURATION_PARAMETERS.

The following example configures the LogRotate service to automatically run every four hours, rotating logs that are larger than 1 kibibyte and removing rotated logs that are older than 12 days:

=> ALTER DATABASE DEFAULT SET LogRotateInterval = '4h', LogRotateMaxSize = '1k', LogRotateMaxAge = '12 days';

Manual rotation

You can manually rotate the logs on a particular node with the DO_LOGROTATE_LOCAL meta-function. This function takes optional arguments to force the rotation of all logs and override the values of LogRotateMaxSize and LogRotateMaxAge.

To rotate logs that are larger than 1 kilobyte, and then remove rotated logs that are older than 1 day:

=> SELECT do_logrotate_local('max_size=1K;max_age=1 day');

                                             do_logrotate_local
-----------------------------------------------------------------------------------------------------------
 Doing Logrotate
Considering file /scratch_b/qa/VMart/v_vmart_node0001_catalog/vertica.log
File size: 35753 Bytes
Force rotate? no
Renaming to /scratch_b/qa/VMart/v_vmart_node0001_catalog/vertica.log.2023-11-08-13-55-51-651129-05
Opening new log file /scratch_b/qa/VMart/v_vmart_node0001_catalog/vertica.log
Compressing /scratch_b/qa/VMart/v_vmart_node0001_catalog/vertica.log.2023-11-08-13-55-51-651129-05 to /scratch_b/qa/VMart/v_vmart_node0001_catalog/vertica.log.2023-11-08-13-55-51-651129-05.gz
Done with /scratch_b/qa/VMart/v_vmart_node0001_catalog/vertica.log

Considering file /scratch_b/qa/VMart/v_vmart_node0001_catalog/UDxLogs/UDxFencedProcesses.log
File size: 68 Bytes
Force rotate? no
Rotation not required for file /scratch_b/qa/VMart/v_vmart_node0001_catalog/UDxLogs/UDxFencedProcesses.log
Done with /scratch_b/qa/VMart/v_vmart_node0001_catalog/UDxLogs/UDxFencedProcesses.log
(1 row)

To force rotation and then remove all logs older than 4 days (default value of LogRotateMaxAge):

=> SELECT do_logrotate_local('force;max_age=4 days');

                                             do_logrotate_local
-----------------------------------------------------------------------------------------------------------
 Doing Logrotate
Considering file /scratch_b/qa/VMart/v_vmart_node0001_catalog/vertica.log
File size: 4310245 Bytes
Force rotate? yes
Renaming to /scratch_b/qa/VMart/v_vmart_node0001_catalog/vertica.log.2023-11-10-13-45-15-53837-05
Opening new log file /scratch_b/qa/VMart/v_vmart_node0001_catalog/vertica.log
Compressing /scratch_b/qa/VMart/v_vmart_node0001_catalog/vertica.log.2023-11-10-13-45-15-53837-05 to /scratch_b/qa/VMart/v_vmart_node0001_catalog/vertica.log.2023-11-10-13-45-15-53837-05.gz
Done with /scratch_b/qa/VMart/v_vmart_node0001_catalog/vertica.log

Considering file /scratch_b/qa/VMart/v_vmart_node0001_catalog/UDxLogs/UDxFencedProcesses.log
File size: 68 Bytes
Force rotate? yes
Remove old log file /scratch_b/qa/VMart/v_vmart_node0001_catalog/UDxLogs/UDxFencedProcesses.log.2023-11-06-13-18-27-23141-05.gz
Remove old log file /scratch_b/qa/VMart/v_vmart_node0001_catalog/UDxLogs/UDxFencedProcesses.log.2023-11-07-13-18-30-059008-05.gz
Remove old log file /scratch_b/qa/VMart/v_vmart_node0001_catalog/UDxLogs/UDxFencedProcesses.log.2023-11-08-13-47-11-707903-05.gz
Remove old log file /scratch_b/qa/VMart/v_vmart_node0001_catalog/UDxLogs/UDxFencedProcesses.log.2023-11-09-14-09-02-386402-05.gz
Renaming to /scratch_b/qa/VMart/v_vmart_node0001_catalog/UDxLogs/UDxFencedProcesses.log.2023-11-10-13-45-15-647762-05
Opening new log file /scratch_b/qa/VMart/v_vmart_node0001_catalog/UDxLogs/UDxFencedProcesses.log
Compressing /scratch_b/qa/VMart/v_vmart_node0001_catalog/UDxLogs/UDxFencedProcesses.log.2023-11-10-13-45-15-647762-05 to /scratch_b/qa/VMart/v_vmart_node0001_catalog/UDxLogs/UDxFencedProcesses.log.2023-11-10-13-45-15-647762-05.gz
Done with /scratch_b/qa/VMart/v_vmart_node0001_catalog/UDxLogs/UDxFencedProcesses.log


(1 row)

Rotating with the Linux logrotate utility

Most Linux distributions include the logrotate utility. By setting up a logrotate configuration file, you can use the utility to complete one or more of these tasks automatically:

  • Compress and rotate log files

  • Remove log files automatically

  • Email log files to named recipients

You can configure logrotate to complete these tasks at specific intervals, or when log files reach a particular size.

If logrotate is present when Vertica is installed, then Vertica automatically sets this utility to look for configuration files. Thus, logrotate searches for configuration files in the /opt/vertica/config/logrotate directory on each node.

When you create a database, Vertica creates database-specific logrotate configurations on each node in your cluster, which are used by the logrotate utility. It then creates a file with the path /opt/vertica/config/logrotate/dbname for each individual database.

For information about additional settings, use the man logrotate command.

Executing the Python script through the dbadmin logrotate cron job

During the installation of Vertica, the installer configures a cron job for the dbadmin user. This cron job is configured to execute a Python script that runs the logrotate utility. You can view the details of this cron job by viewing the dbadmin.cron file, which is located in the /opt/vertica/config directory.

If you want to customize a cron job to configure logrotate for your Vertica database, you must create the cron job under the dbadmin user.

Using the administration tools logrotate utility

You can use the admintools logrotate option to help configure logrotate scripts for a database and distribute the scripts across the cluster. The logrotate option allows you to specify:

  • How often to rotate logs

  • How large logs can become before being rotated

  • How long to keep the logs

Example:

The following example shows you how to set up log rotation on a weekly schedule and keeps for three months (12 logs).

$ admintools -t logrotate -d <dbname> -r weekly -k 12

See Writing administration tools scripts for more usage information.

Configure logrotate for MC

The Management Console log file is:

/opt/vconsole/log/mc/mconsole.log

To configure logrotate for MC, configure the following file:

/opt/vconsole/temp/webapp/WEB-INF/classes/log4j.xml

Edit the log4j.xml file and set these parameters as follows:

  1. Restrict the size of the log:

    <param name="MaxFileSize" value="1MB"/>
    
  2. Restrict the number of file backups for the log:

    <param name="MaxBackupIndex" value="1"/>
    
  3. Restart MC as the root user:

    # etc/init.d/vertica-consoled restart
    

Rotating logs manually

To implement a custom log rotation process, follow these steps:

  1. Rename or archive the existing vertica.log file. For example:

    $ mv vertica.log vertica.log.1
    
  2. Send the Vertica process the USR1 signal, using either of the following approaches:

    $ killall -USR1 vertica
    

    or

    $ ps -ef | grep -i vertica
    $ kill -USR1 process-id
    

3 - Monitoring process status (ps)

You can use ps to monitor the database and Spread processes running on each node in the cluster.

You can use ps to monitor the database and Spread processes running on each node in the cluster. For example:

$ ps aux | grep /opt/vertica/bin/vertica
$ ps aux | grep /opt/vertica/spread/sbin/spread

You should see one Vertica process and one Spread process on each node for common configurations. To monitor Administration Tools and connector processes:

$ ps aux | grep vertica

There can be many connection processes but only one Administration Tools process.

4 - Monitoring Linux resource usage

You should monitor system resource usage on any or all nodes in the cluster.

You should monitor system resource usage on any or all nodes in the cluster. You can use System Activity Reporting (SAR) to monitor resource usage.

  1. Log in to the database administrator account on any node.

  2. Run the top utility

    $ top
    

    A high CPU percentage in top indicates that Vertica is CPU-bound. For example:

    top - 11:44:28 up 53 days, 23:47,  9 users,  load average: 0.91, 0.97, 0.81
    Tasks: 123 total,   1 running, 122 sleeping,   0 stopped,   0 zombie
    Cpu(s):  26.9%us,  1.3%sy,  0.0%ni, 71.8%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
    Mem:   4053136 total,  3882020k used,  171116 free,   407688 buffers
    Swap:  4192956 total,   176k used,  4192780 free,  1526436 cached
      PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
    13703 dbadmin    1   0 1374m 678m  55m S 99.9 17.1   6:21.70 vertica
     2606 root      16   0 32152  11m 2508 S  1.0  0.3   0:16.97 X
        1 root      16   0  4748  552  456 S  0.0  0.0   0:01.51 init
        2 root      RT  -5     0    0    0 S  0.0  0.0   0:04.92 migration/0
        3 root      34  19     0    0    0 S  0.0  0.0   0:11.75 ksoftirqd/0
    ...
    

    Some possible reasons for high CPU usage are:

    • The Tuple Mover runs automatically and thus consumes CPU time even if there are no connections to the database.

    • The swappiness kernel parameter may not be set to 0. Execute the following command from the Linux command line to see the value of this parameter:

      $ cat /proc/sys/vm/swappiness
      

      If this value is not 0, change it by following the steps in Check for swappiness.

    • Some information sources:

  3. Run the iostat utility. A high idle time in top at the same time as a high rate of blocks read in iostat indicates that Vertica is disk-bound. For example:

    $ /usr/bin/iostat
    Linux 2.6.18-164.el5 (qa01)     02/05/2011
    avg-cpu:  %user   %nice %system %iowait  %steal   %idle
               0.77    2.32    0.76    0.68    0.00   95.47
    Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
    hda               0.37         3.40        10.37    2117723    6464640
    sda               0.46         1.94        18.96    1208130   11816472
    sdb               0.26         1.79        15.69    1114792    9781840
    sdc               0.24         1.80        16.06    1119304   10010328
    sdd               0.22         1.79        15.52    1117472    9676200
    md0               8.37         7.31        66.23    4554834   41284840
    

5 - Monitoring disk space usage

You can use these system tables to monitor disk space usage on your cluster:

System table Description
DISK_STORAGE Monitors the amount of disk storage used by the database on each node.
COLUMN_STORAGE Monitors the amount of disk storage used by each column of each projection on each node.
PROJECTION_STORAGE Monitors the amount of disk storage used by each projection on each node.

6 - Monitoring elastic cluster rebalancing

Vertica includes system tables that can be used to monitor the rebalance status of an elastic cluster and gain general insight to the status of elastic cluster on your nodes.

Vertica includes system tables that can be used to monitor the rebalance status of an elastic cluster and gain general insight to the status of elastic cluster on your nodes.

  • REBALANCE_TABLE_STATUS provides general information about a rebalance. It shows, for each table, the amount of data that has been separated, the amount that is currently being separated, and the amount to be separated. It also shows the amount of data transferred, the amount that is currently being transferred, and the remaining amount to be transferred (or an estimate if storage is not separated).

  • REBALANCE_PROJECTION_STATUS can be used to gain more insight into the details for a particular projection that is being rebalanced. It provides the same type of information as above, but in terms of a projection instead of a table.

In each table, the columns SEPARATED_PERCENT and TRANSFERRED_PERCENTcan be used to determine overall progress.

Historical rebalance information

Historical information about work completed is retained, so query with the table column IS_LATEST to restrict output to only the most recent or current rebalance activity. Historical data may include information about dropped projections or tables. If a table or projection has been dropped and information about the anchor table is not available, then NULL is displayed for the table ID and <unknown> for table name. Information on dropped tables is still useful, for example, in providing justification for the duration of a task.

7 - Monitoring events

To help you monitor your database system, Vertica traps and logs significant events that affect database performance and functionality if you do not address their root causes.

To help you monitor your database system, Vertica traps and logs significant events that affect database performance and functionality if you do not address their root causes. This section describes where events are logged, the types of events that Vertica logs, how to respond to these events, the information that Vertica provides for these events, and how to configure event monitoring.

7.1 - Event logging mechanisms

Vertica posts events to the following mechanisms:.

Vertica posts events to the following mechanisms:

Mechanism Description
vertica.log All events are automatically posted to vertica.log. See Monitoring the Log Files.
ACTIVE_EVENTS This SQL system table provides information about all open events. See Using system tables and ACTIVE_EVENTS.
SNMP To post traps to SNMP, enable global reporting in addition to each individual event you want trapped. See Configuring event reporting.
Syslog To log events to syslog, enable event reporting for each individual event you want logged. See Configuring event reporting.

7.2 - Event codes

The following table lists the event codes that Vertica logs to the events system tables.

The following table lists the event codes that Vertica logs to the events system tables.

Event Code Severity Event Code Description Description/Action
0 Warning Low Disk Space

Warning indicates one of the following issues:

  • Lack of disk space for database

  • Disk failure

  • I/O hardware failure

Action: Add more disk space, or replace the failing disk or hardware as soon as possible.

Check dmesg to see what caused the problem.

Also, use the DISK_RESOURCE_REJECTIONS system table to determine the types of disk space requests that are being rejected and the hosts where they are rejected. See Managing disk space for details.

1 Warning Read Only File System

Database lacks write access to the file system for data or catalog paths. This sometimes occurs if Linux remounts a drive due to a kernel issue.

Action: Give the database write access.

2 Emergency Loss Of K Safety

The database is no longer K-safe because insufficient nodes are functioning within the cluster. Loss of K-safety causes the database to shut down.

Action: Recover the system.

3 Critical Current Fault Tolerance at Critical Level

One or more nodes in the cluster failed. If the database loses one more node, it will no longer be K-safe and shut down.

Action: Restore nodes that failed or shut down.

4 Warning Too Many ROS Containers

Heavy load activity on one or more projections sometimes generates more ROS containers than the Tuple Mover can handle. Vertica allows up to 1024 ROS containers per projection before it rolls back additional load jobs and returns a ROS pushback error message.

Action: The Tuple Mover typically catches up with pending mergeout requests and the Optimizer can resume executing queries on affected tables (see Mergeout).

If this problem does not resolve quickly, or if it occurs frequently, it is probably related to insufficient RAM allocated to MAXMEMORY in the TM resource pool.

5 Informational WOS Over Flow Deprecated
6 Informational Node State Change

The node state changed.

Action: Check node status.

7 Warning Recovery Failure

Database was not restored to a functional state after a hardware or software related failure.

Action: Reasons for the warning can vary, see the event description for details.

8 Warning Recovery Error

Database encountered an error while attempting to recover. If the number of recovery errors exceeds Max Tries, the Recovery Failure event is triggered.

Action: Reasons for the warning can vary, see the event description for details.

9 n/a Recovery Lock Error Unused
10 n/a Recovery Projection Retrieval Error Unused
11 Warning Refresh Error

The database encountered an error while attempting to refresh.

Action: Reasons for the warning can vary, see the event description for details.

12 n/a Refresh Lock Error Unused
13 n/a Tuple Mover Error Deprecated
14 Warning Timer Service Task Error

Error occurred in an internal scheduled task.

Action: None, internal use only

15 Warning Stale Checkpoint Deprecated
16 Notice License Size Compliance

Database size exceeds license size allowance.

Action: See Monitoring database size for license compliance.

17 Notice License Term Compliance

Database is not in compliance with your Vertica license.

Action: Check compliance status with GET_COMPLIANCE_STATUS.

18 Error CRC Mismatch

Cyclic Redundancy Check (CRC) returned an error or errors while fetching data.

Action: Review the vertica.log file or the SNMP trap utility to evaluate CRC errors.

19 Critical/Warning Catalog Sync Exceeds Durability Threshold Severity: Critical when exceeding hard limit, Warning when exceeding soft limit.
20 Critical Cluster Read-only

Eon Mode) Quorum or primary shard coverage loss forced database into read-only mode.

Action: Restart down nodes.

7.3 - Event data

To help you interpret and solve the issue that triggered an event, each event provides a variety of data, depending upon the event logging mechanism used.

To help you interpret and solve the issue that triggered an event, each event provides a variety of data, depending upon the event logging mechanism used.

The following table describes the event data and indicates where it is used.

vertica.log ACTIVE_EVENTS
(column names)
SNMP Syslog Description
N/A NODE_NAME N/A N/A The node where the event occurred.
Event Code EVENT_CODE Event Type Event Code A numeric ID that indicates the type of event. See Event Types in the previous table for a list of event type codes.
Event Id EVENT_ID Event OID Event Id A unique numeric ID that identifies the specific event.
Event Severity

EVENT_

SEVERITY

Event Severity Event Severity

The severity of the event from highest to lowest. These events are based on standard syslog severity types:

0 – Emergency

1 – Alert

2 – Critical

3 – Error

4 – Warning

5 – Notice

6 – Info

7 – Debug

PostedTimestamp

EVENT_

POSTED_

TIMESTAMP

N/A PostedTimestamp The year, month, day, and time the event was reported. Time is provided as military time.
ExpirationTimestamp

EVENT_

EXPIRATION

N/A ExpirationTimestamp The time at which this event expires. If the same event is posted again prior to its expiration time, this field gets updated to a new expiration time.
EventCodeDescription

EVENT_CODE_

DESCRIPTION

Description EventCodeDescription A brief description of the event and details pertinent to the specific situation.
ProblemDescription

EVENT_PROBLEM_

DESCRIPTION

Event Short Description ProblemDescription A generic description of the event.
N/A

REPORTING_

NODE

Node Name N/A The name of the node within the cluster that reported the event.
DatabaseName N/A Database Name DatabaseName The name of the database that is impacted by the event.
N/A N/A Host Name Hostname The name of the host within the cluster that reported the event.
N/A N/A Event Status N/A

The status of the event. It can be either:

1 – Open

2 – Clear

7.4 - Configuring event reporting

Event reporting is automatically configured for .log, and current events are automatically posted to the ACTIVE_EVENTS system table.

Event reporting is automatically configured for vertica.log, and current events are automatically posted to the ACTIVE_EVENTS system table. You can also configure Vertica to post events to syslog and SNMP.

7.4.1 - Configuring reporting for the simple notification service (SNS)

You can monitor (DC) components and send new rows to Amazon Web Services (AWS) Simple Notification Service (SNS).

You can monitor Data collector (DC) components and send new rows to Amazon Web Services (AWS) Simple Notification Service (SNS). SNS notifiers are configured with database-level SNS parameters.

Minimally, to send DC data to SNS topics, you must configure and specify the following:

  • An SNS notifier

  • A Simple Notification Service (SNS) topic

  • An AWS region (SNSRegion)

  • An SNS endpoint (SNSEndpoint) (FIPS only)

  • Credentials to authenticate to AWS

  • Information about how to handle HTTPS

Creating an SNS notifier

To create an SNS notifier, use CREATE NOTIFIER, specifying sns as the ACTION.

SNS configuration

SNS topics and their subscribers should be configured with AWS. For details, see the AWS documentation:

AWS region and endpoint

In most use cases, you only need to set the AWS region with the SNSRegion parameter; if the SNSEndpoint is set to an empty string (default) and the SNSRegion is set, Vertica automatically finds and uses the appropriate endpoint:

=> ALTER DATABASE DEFAULT SET SNSRegion='us-east-1';
=> ALTER DATABASE DEFAULT SET SNSEndpoint='';

If you want to specify an endpoint, its region must match the region specified in SNSRegion:

=> ALTER DATABASE DEFAULT SET SNSEndpoint='sns.us-east-1.amazonaws.com';

If you use FIPS, you should manually set SNSEndpoint to a FIPS-compliant endpoint:

=> ALTER DATABASE DEFAULT SET SNSEndpoint='sns-fips.us-east-1.amazonaws.com';

AWS credentials

AWS credentials can be set with SNSAuth, which takes an access key and secret access key in the following format:

access_key:secret_access_key

To set SNSAuth:

=> ALTER DATABASE DEFAULT SET SNSAuth='VNDDNVOPIUQF917O5PDB:+mcnVONVIbjOnf1ekNis7nm3mE83u9fjdwmlq36Z';

Handling HTTPS

The SNSEnableHttps parameter determines whether the SNS notifier uses TLS to secure the connection between Vertica and AWS. HTTPS is enabled by default and can be manually enabled with:

=> ALTER DATABASE DEFAULT SET SNSEnableHttps=1;

If SNSEnableHttps is enabled, depending on your configuration, you might need to specify a custom set of CA bundles with SNSCAFile or SNSCAPath. Amazon root certificates are typically contained in the set of trusted CA certificates already, so you should not have to set these parameters in most environments:

=> ALTER DATABASE DEFAULT SET SNSCAFile='path/to/ca/bundle.pem'
=> ALTER DATABASE DEFAULT SET SNSCAPath='path/to/ca/bundles/'

HTTPS can be manually disabled with:

=> ALTER DATABASE DEFAULT SET SNSEnableHttps=0;

Examples

The following example creates an SNS topic, subscribes to it with an SQS queue, and then configures an SNS notifier for the DC component LoginFailures:

  1. Create an SNS topic.

  2. Create an SQS queue.

  3. Subscribe the SQS queue to the SNS topic.

  4. Set SNSAuth with your AWS credentials:

    => ALTER DATABASE DEFAULT SET SNSAuth='VNDDNVOPIUQF917O5PDB:+mcnVONVIbjOnf1ekNis7nm3mE83u9fjdwmlq36Z';
    
  5. Set SNSRegion:

    => ALTER DATABASE DEFAULT SET SNSRegion='us-east-1'
    
  6. Enable HTTPS:

    => ALTER DATABASE DEFAULT SET SNSEnableHttps=1;
    
  7. Create an SNS notifier:

    => CREATE NOTIFIER v_sns_notifier ACTION 'sns' MAXPAYLOAD '256K' MAXMEMORYSIZE '10M' CHECK COMMITTED;
    
  8. Verify that the SNS notifier, SNS topic, and SQS queue are properly configured:

    1. Manually send a message from the notifier to the SNS topic with NOTIFY:

      => SELECT NOTIFY('test message', 'v_sns_notifier', 'arn:aws:sns:us-east-1:123456789012:MyTopic')
      
    2. Poll the SQS queue for your message.

  9. Attach the SNS notifier to the LoginFailures component with SET_DATA_COLLECTOR_NOTIFY_POLICY:

    => SELECT SET_DATA_COLLECTOR_NOTIFY_POLICY('LoginFailures', 'v_sns_notifier', 'Login failed!', true)
    

To disable an SNS notifier:

=> SELECT SET_DATA_COLLECTOR_NOTIFY_POLICY('LoginFailures', 'v_sns_notifier', 'Login failed!', false)

7.4.2 - Configuring reporting for syslog

Syslog is a network-logging utility that issues, stores, and processes log messages.

Syslog is a network-logging utility that issues, stores, and processes log messages. It is a useful way to get heterogeneous data into a single data repository.

To log events to syslog, enable event reporting for each individual event you want logged. Messages are logged, by default, to /var/log/messages.

Configuring event reporting to syslog consists of:

  1. Enabling Vertica to trap events for syslog.

  2. Defining which events Vertica traps for syslog.

    Vertica strongly suggests that you trap the Stale Checkpoint event.

  3. Defining which syslog facility to use.

Enabling Vertica to trap events for syslog

To enable event trapping for syslog, issue the following SQL command:

=> ALTER DATABASE DEFAULT SET SyslogEnabled = 1;

To disable event trapping for syslog, issue the following SQL command:

=> ALTER DATABASE DEFAULT SET SyslogEnabled = 0;

Defining events to trap for syslog

To define events that generate a syslog entry, issue the following SQL command, one of the events described in the list below the command:

=> ALTER DATABASE DEFAULT SET SyslogEvents = 'events-list';

where events-list is a comma-delimited list of events, one or more of the following:

  • Low Disk Space

  • Read Only File System

  • Loss Of K Safety

  • Current Fault Tolerance at Critical Level

  • Too Many ROS Containers

  • Node State Change

  • Recovery Failure

  • Recovery Error

  • Recovery Lock Error

  • Recovery Projection Retrieval Error

  • Refresh Error

  • Refresh Lock Error

  • Tuple Mover Error

  • Timer Service Task Error

  • Stale Checkpoint

The following example generates a syslog entry for low disk space and recovery failure:

=> ALTER DATABASE DEFAULT SET SyslogEvents = 'Low Disk Space, Recovery Failure';

Defining the SyslogFacility to use for reporting

The syslog mechanism allows for several different general classifications of logging messages, called facilities. Typically, all authentication-related messages are logged with the auth (or authpriv) facility. These messages are intended to be secure and hidden from unauthorized eyes. Normal operational messages are logged with the daemon facility, which is the collector that receives and optionally stores messages.

The SyslogFacility directive allows all logging messages to be directed to a different facility than the default. When the directive is used, all logging is done using the specified facility, both authentication (secure) and otherwise.

To define which SyslogFacility Vertica uses, issue the following SQL command:

=> ALTER DATABASE DEFAULT SET SyslogFacility = 'Facility_Name';

Where the facility-level argument <Facility_Name> is one of the following:

  • auth

  • authpriv (Linux only)

  • cron

  • uucp (UUCP subsystem)

  • daemon

  • ftp (Linux only)

  • lpr (line printer subsystem)

  • mail (mail system)

  • news (network news subsystem)

  • user (default system)

  • local0 (local use 0)

  • local1 (local use 1)

  • local2 (local use 2)

  • local3 (local use 3)

  • local4 (local use 4)

  • local5 (local use 5)

  • local6 (local use 6)

  • local7 (local use 7)

Trapping other event types

To trap events other than the ones listed above, create a syslog notifier and allow it to trap the desired events with SET_DATA_COLLECTOR_NOTIFY_POLICY.

Events monitored by this notifier type are not logged to MONITORING_EVENTS nor vertica.log.

The following example creates a notifier that writes a message to syslog when the Data collector (DC) component LoginFailures updates:

  1. Enable syslog notifiers for the current database:

    => ALTER DATABASE DEFAULT SET SyslogEnabled = 1;
    
  2. Create and enable a syslog notifier v_syslog_notifier:

    => CREATE NOTIFIER v_syslog_notifier ACTION 'syslog'
        ENABLE
        MAXMEMORYSIZE '10M'
        IDENTIFIED BY 'f8b0278a-3282-4e1a-9c86-e0f3f042a971'
        PARAMETERS 'eventSeverity = 5';
    
  3. Configure the syslog notifier v_syslog_notifier for updates to the LoginFailures DC component with SET_DATA_COLLECTOR_NOTIFY_POLICY:

    => SELECT SET_DATA_COLLECTOR_NOTIFY_POLICY('LoginFailures','v_syslog_notifier', 'Login failed!', true);
    

    This notifier writes the following message to syslog (default location: /var/log/messages) when a user fails to authenticate as the user Bob:

    Apr 25 16:04:58
    vertica_host_01
    vertica:
        Event Posted:
            Event Code:21
            Event Id:0
            Event Severity: Notice [5]
            PostedTimestamp: 2022-04-25 16:04:58.083063
            ExpirationTimestamp: 2022-04-25 16:04:58.083063
            EventCodeDescription: Notifier
            ProblemDescription: (Login failed!)
        {
           "_db":"VMart",
           "_schema":"v_internal",
           "_table":"dc_login_failures",
           "_uuid":"f8b0278a-3282-4e1a-9c86-e0f3f042a971",
           "authentication_method":"Reject",
           "client_authentication_name":"default: Reject",
           "client_hostname":"::1",
           "client_label":"",
           "client_os_user_name":"dbadmin",
           "client_pid":523418,
           "client_version":"",
           "database_name":"dbadmin",
           "effective_protocol":"3.8",
           "node_name":"v_vmart_node0001",
           "reason":"REJECT",
           "requested_protocol":"3.8",
           "ssl_client_fingerprint":"",
           "ssl_client_subject":"",
           "time":"2022-04-25 16:04:58.082568-05",
           "user_name":"Bob"
        }#012
        DatabaseName: VMart
        Hostname: vertica_host_01
    

See also

Event reporting examples

7.4.3 - Configuring reporting for SNMP

Configuring event reporting for SNMP consists of:.

Configuring event reporting for SNMP consists of:

  1. Configuring Vertica to enable event trapping for SNMP as described below.

  2. Importing the Vertica Management Information Base (MIB) file into the SNMP monitoring device.

    The Vertica MIB file allows the SNMP trap receiver to understand the traps it receives from Vertica. This, in turn, allows you to configure the actions it takes when it receives traps.

    Vertica supports the SNMP V1 trap protocol, and it is located in /opt/vertica/sbin/VERTICA-MIB. See the documentation for your SNMP monitoring device for more information about importing MIB files.

  3. Configuring the SNMP trap receiver to handle traps from Vertica.

    SNMP trap receiver configuration differs greatly from vendor to vendor. As such, the directions presented here for configuring the SNMP trap receiver to handle traps from Vertica are generic.

    Vertica traps are single, generic traps that contain several fields of identifying information. These fields equate to the event data described in Monitoring events. However, the format used for the field names differs slightly. Under SNMP, the field names contain no spaces. Also, field names are pre-pended with “vert”. For example, Event Severity becomes vertEventSeverity.

    When configuring your trap receiver, be sure to use the same hostname, port, and community string you used to configure event trapping in Vertica.

    Examples of network management providers:

    • Network Node Manager i

    • IBM Tivoli

    • AdventNet

    • Net-SNMP (Open Source)

    • Nagios (Open Source)

    • Open NMS (Open Source)

7.4.4 - Configuring event trapping for SNMP

The following events are trapped by default when you configure Vertica to trap events for SNMP:.

The following events are trapped by default when you configure Vertica to trap events for SNMP:

  • Low Disk Space

  • Read Only File System

  • Loss of K Safety

  • Current Fault Tolerance at Critical Level

  • Too Many ROS Containers

  • Node State Change

  • Recovery Failure

  • Stale Checkpoint

  • CRC Mismatch

To configure Vertica to trap events for SNMP

  1. Enable Vertica to trap events for SNMP.

  2. Define where Vertica sends the traps.

  3. Optionally redefine which SNMP events Vertica traps.

To enable event trapping for SNMP

Use the following SQL command:

=> ALTER DATABASE DEFAULT SET SnmpTrapsEnabled = 1;

To define where Vertica send traps

Use the following SQL command, where Host_name and port identify the computer where SNMP resides, and CommunityString acts like a password to control Vertica's access to the server:

=> ALTER DATABASE DEFAULT SET SnmpTrapDestinationsList = 'host_name port CommunityString';

For example:

=> ALTER DATABASE DEFAULT SET SnmpTrapDestinationsList = 'localhost 162 public';

You can also specify multiple destinations by specifying a list of destinations, separated by commas:

=> ALTER DATABASE DEFAULT SET SnmpTrapDestinationsList = 'host_name1 port1 CommunityString1, hostname2 port2 CommunityString2';

To define which events Vertica traps

Use the following SQL command, where Event_Name is one of the events in the list below the command:

=> ALTER DATABASE DEFAULT SET SnmpTrapEvents = 'Event_Name1, Even_Name2';
  • Low Disk Space

  • Read Only File System

  • Loss Of K Safety

  • Current Fault Tolerance at Critical Level

  • Too Many ROS Containers

  • Node State Change

  • Recovery Failure

  • Recovery Error

  • Recovery Lock Error

  • Recovery Projection Retrieval Error

  • Refresh Error

  • Tuple Mover Error

  • Stale Checkpoint

  • CRC Mismatch

The following example specifies two event names:

=> ALTER DATABASE DEFAULT SET SnmpTrapEvents = 'Low Disk Space, Recovery Failure';

7.4.5 - Verifying SNMP configuration

To create a set of test events that checks SNMP configuration:.

To create a set of test events that checks SNMP configuration:

  1. Set up SNMP trap handlers to catch Vertica events.

  2. Test your setup with the following command:

    SELECT SNMP_TRAP_TEST();
        SNMP_TRAP_TEST
    --------------------------
     Completed SNMP Trap Test
    (1 row)
    

7.5 - Event reporting examples

The following example illustrates a Too Many ROS Containers event posted and cleared within vertica.log:.

Vertica.log

The following example illustrates a Too Many ROS Containers event posted and cleared within vertica.log:

08/14/15 15:07:59 thr:nameless:0x45a08940 [INFO] Event Posted: Event Code:4 Event Id:0 Event Severity: Warning [4] PostedTimestamp:
2015-08-14 15:07:59.253729 ExpirationTimestamp: 2015-08-14 15:08:29.253729
EventCodeDescription: Too Many ROS Containers ProblemDescription:
Too many ROS containers exist on this node. DatabaseName: TESTDB
Hostname: fc6-1.example.com
08/14/15 15:08:54 thr:Ageout Events:0x2aaab0015e70 [INFO] Event Cleared:
Event Code:4 Event Id:0 Event Severity: Warning [4] PostedTimestamp:
2015-08-14 15:07:59.253729 ExpirationTimestamp: 2015-08-14 15:08:53.012669
EventCodeDescription: Too Many ROS Containers ProblemDescription:
Too many ROS containers exist on this node. DatabaseName: TESTDB
Hostname: fc6-1.example.com

SNMP

The following example illustrates a Too Many ROS Containers event posted to SNMP:

Version: 1, type: TRAPREQUESTEnterprise OID: .1.3.6.1.4.1.31207.2.0.1
Trap agent: 72.0.0.0
Generic trap: ENTERPRISESPECIFIC (6)
Specific trap: 0
.1.3.6.1.4.1.31207.1.1 ---> 4
.1.3.6.1.4.1.31207.1.2 ---> 0
.1.3.6.1.4.1.31207.1.3 ---> 2008-08-14 11:30:26.121292
.1.3.6.1.4.1.31207.1.4 ---> 4
.1.3.6.1.4.1.31207.1.5 ---> 1
.1.3.6.1.4.1.31207.1.6 ---> site01
.1.3.6.1.4.1.31207.1.7 ---> suse10-1
.1.3.6.1.4.1.31207.1.8 ---> Too many ROS containers exist on this node.
.1.3.6.1.4.1.31207.1.9 ---> QATESTDB
.1.3.6.1.4.1.31207.1.10 ---> Too Many ROS Containers

Syslog

The following example illustrates a Too Many ROS Containers event posted and cleared within syslog:

Aug 14 15:07:59 fc6-1 vertica: Event Posted: Event Code:4 Event Id:0 Event Severity: Warning [4] PostedTimestamp: 2015-08-14 15:07:59.253729 ExpirationTimestamp:
2015-08-14 15:08:29.253729 EventCodeDescription: Too Many ROS Containers ProblemDescription:
Too many ROS containers exist on this node. DatabaseName: TESTDB Hostname: fc6-1.example.com
Aug 14 15:08:54 fc6-1 vertica: Event Cleared: Event Code:4 Event Id:0 Event Severity:
Warning [4] PostedTimestamp: 2015-08-14 15:07:59.253729 ExpirationTimestamp:
2015-08-14 15:08:53.012669 EventCodeDescription: Too Many ROS Containers ProblemDescription:
Too many ROS containers exist on this node. DatabaseName: TESTDB Hostname: fc6-1.example.com

8 - Using system tables

Vertica system tables provide information about system resources, background processes, workload, and performance—for example, load streams, query profiles, and tuple mover operations.

Vertica system tables provide information about system resources, background processes, workload, and performance—for example, load streams, query profiles, and tuple mover operations. Vertica collects and refreshes this information automatically.

You can query system tables using expressions, predicates, aggregates, analytics, subqueries, and joins. You can also save system table query results into a user table for future analysis. For example, the following query creates a table, mynode, selecting three node-related columns from the NODES system table:

=> CREATE TABLE mynode AS SELECT node_name, node_state, node_address FROM nodes;
CREATE TABLE
=> SELECT * FROM mynode;
    node_name     | node_state |  node_address
------------------+------------+----------------
 v_vmart_node0001 | UP         | 192.168.223.11
(1 row)

Where system tables reside

System tables are grouped into two schemas:

These schemas reside in the default search path. Unless you change the search path to exclude V_MONITOR or V_CATALOG or both, queries can specify a system table name that omits its schema.

You can query the SYSTEM_TABLES table for all Vertica system tables and their schemas. For example:

SELECT * FROM system_tables ORDER BY table_schema, table_name;

System table categories

Vertica system tables can be grouped into the following areas:

  • System information

  • System resources

  • Background processes

  • Workload and performance

Vertica reserves some memory to help monitor busy systems. Using simple system table queries makes it easier to troubleshoot issues. See also SYSQUERY.

Privileges

You can GRANT and REVOKE privileges on system tables, with the following restrictions:

  • You cannot GRANT privileges on system tables to the SYSMONITOR or PSEUDOSUPERUSER roles.

  • You cannot GRANT on system schemas.

Case-sensitive system table data

Some system table data might be stored in mixed case. For example, Vertica stores mixed-case identifier names the way you specify them in the CREATE statement, even though case is ignored when you reference them in queries. When these object names appear as data in the system tables, you'll encounter errors if you query them with an equality (=) operator because the case must exactly match the stored identifier. In particular, data in columns TABLE_SCHEMA and TABLE_NAME in system table TABLES are case sensitive.

If you don't know how the identifiers are stored, use the case-insensitive operator ILIKE. For example, given the following schema:

=> CREATE SCHEMA SS;
=> CREATE TABLE SS.TT (c1 int);
=> CREATE PROJECTION SS.TTP1 AS SELECT * FROM ss.tt UNSEGMENTED ALL NODES;
=> INSERT INTO ss.tt VALUES (1);

A query that uses the = operator returns 0 rows:

=> SELECT table_schema, table_name FROM v_catalog.tables WHERE table_schema ='ss';
table_schema | table_name
--------------+------------
(0 rows)

A query that uses case-insensitive ILIKE returns the expected results:

=> SELECT table_schema, table_name FROM v_catalog.tables WHERE table_schema ILIKE 'ss';
table_schema | table_name
--------------+------------
 SS           | TT
(1 row)

Examples

The following examples illustrate simple ways to use system tables in queries.


=> SELECT current_epoch, designed_fault_tolerance, current_fault_tolerance FROM SYSTEM;
 current_epoch | designed_fault_tolerance | current_fault_tolerance
---------------+--------------------------+-------------------------
           492 |                        1 |                       1
(1 row)

=> SELECT node_name, total_user_session_count, executed_query_count FROM query_metrics;
    node_name     | total_user_session_count | executed_query_count
------------------+--------------------------+----------------------
 v_vmart_node0001 |                      115 |                  353
 v_vmart_node0002 |                      114 |                   35
 v_vmart_node0003 |                      116 |                   34
(3 rows)

=> SELECT DISTINCT(schema_name), schema_owner FROM schemata;
 schema_name  | schema_owner
--------------+--------------
 v_catalog    | dbadmin
 v_txtindex   | dbadmin
 v_func       | dbadmin
 TOPSCHEMA    | dbadmin
 online_sales | dbadmin
 v_internal   | dbadmin
 v_monitor    | dbadmin
 structs      | dbadmin
 public       | dbadmin
 store        | dbadmin
(10 rows)

9 - Data Collector utility

The Data Collector collects and retains history of important system activities, and records essential performance and resource utilization counters.

The Data Collector collects and retains history of important system activities and records essential performance and resource utilization counters. The Data Collector propagates information to system tables.

You can use the Data Collector in the following ways:

  • Query the past state of system tables and extract aggregate information

  • See what actions users have taken

  • Locate performance bottlenecks

  • Identify potential improvements to Vertica configuration

The Data Collector does not collect data for nodes that are down, so no historical data is available for affected nodes.

The Data Collector works with Workload Analyzer, a tool that intelligently monitors the performance of SQL queries and workloads and recommends tuning actions based on observations of the actual workload history.

Configuring and accessing Data Collector information

The Data Collector retains the data it gathers according to configurable retention policies. The Data Collector is on by default; you can disable it by setting the EnableDataCollector configuration parameter to 0. You can set the parameter at the database level using ALTER DATABASE or the node level using ALTER NODE. You cannot set it at the session or user level.

You can access metadata on collected data of all components using the DATA_COLLECTOR system table. This table includes information for each component about current collection policies and how much data is retained in memory and on disk.

Collected data is logged on disk in the DataCollector directory under the Vertica /catalog path. You can query logged data from component-specific Data Collector tables. You can also manage logged data with Vertica meta-functions; see Managing data collection logs.

9.1 - Configuring data retention policies

maintains retention policies for each Vertica component that it monitors—for example, TupleMoverEvents, or DepotEvictions.

Data collector maintains retention policies for each Vertica component that it monitors—for example, TupleMoverEvents or DepotEvictions. You can identify monitored components by querying the DATA_COLLECTOR system table. For example, the following query returns partition activity components:

=> SELECT DISTINCT component FROM DATA_COLLECTOR 
   WHERE component ILIKE '%partition%';
      component
----------------------
 HiveCustomPartitions
 CopyPartitions
 MovePartitions
 SwapPartitions
(4 rows)

Each component has its own retention policy, which consists of several properties:

  • MEMORY_BUFFER_SIZE_KB: the maximum amount of collected data that the Data Collector buffers in memory before moving it to disk.

  • DISK_SIZE_KB: the maximum disk space allocated for the component's Data Collector table.

  • INTERVAL_TIME: how long data of a given component is retained in the component's Data Collector table (INTERVAL data type).

Vertica sets default values on all properties, which you can modify with the SET_DATA_COLLECTOR_POLICY (using parameters) function or, alternatively, SET_DATA_COLLECTOR_POLICY and SET_DATA_COLLECTOR_TIME_POLICY.

You can view retention policy settings with GET_DATA_COLLECTOR_POLICY. For example, the following statement returns the retention policy for the TupleMoverEvents component:

=> SELECT GET_DATA_COLLECTOR_POLICY('TupleMoverEvents');
                          GET_DATA_COLLECTOR_POLICY
-----------------------------------------------------------------------------
 1000KB kept in memory, 15000KB kept on disk. Time based retention disabled.
(1 row)

Setting retention memory and disk storage

The MEMORY_BUFFER_SIZE_KB and DISK_SIZE_KB properties combine to determine how much collected data is available at any given time. If MEMORY_BUFFER_SIZE_KB is set to 0, the Data Collector does not retain any data for the component either in memory or on disk. If DISK_SIZE_KB is set to 0, then the Data Collector retains only as much component data as it can buffer, as set by MEMORY_BUFFER_SIZE_KB.

For example, the following statement changes memory and disk setting for the ResourceAcquisitions component from its default setting of 1,000 KB memory and 10,000 KB disk space to 1500 KB and 25000 KB, respectively:

=> SELECT SET_DATA_COLLECTOR_POLICY('ResourceAcquisitions', '1500', '25000');
 SET_DATA_COLLECTOR_POLICY
---------------------------
 SET
(1 row)

Consider setting MEMORY_BUFFER_SIZE_KB to a high value in the following cases:

  • Unusually high levels of data collection. If MEMORY_BUFFER_SIZE_KB is set too low, the Data Collector might be unable to flush buffered data to disk quickly enough to keep up with the activity level, which can lead to loss of in-memory data.

  • Very large data collector records—for example, records with very long query strings. The Data Collector uses double-buffering, so it cannot retain in-memory records that are more than half the size of the memory buffer.

Setting time-based retention

By default, all data collected for a given component remains on disk and is accessible in the component's Data Collector table, up to the disk storage limit of that component's retention policy. You can use SET_DATA_COLLECTOR_POLICY to limit how long data is retained in a component's Data Collector table. The following example sets the INTERVAL_TIME property for the TupleMoverEvents component:

=> SELECT SET_DATA_COLLECTOR_TIME_POLICY('TupleMoverEvents ', '30 minutes'::INTERVAL);
SET_DATA_COLLECTOR_TIME_POLICY
--------------------------------
SET
(1 row)

After this call, the corresponding Data Collector table, DC_TUPLE_MOVER_EVENTS, only retains records of activity that occurred in the last 30 minutes. Older data is automatically dropped from this table. For example, after the previous call to SET_DATA_COLLECTOR_TIME_POLICY, the table contains on 30 minutes' worth of data:

=> SELECT CURRENT_TIMESTAMP(0) - '30 minutes'::INTERVAL AS '30 minutes ago';
   30 minutes ago
---------------------
 2020-08-13 07:58:21
(1 row)

=> SELECT time, node_name, session_id, user_name, transaction_id, operation 
   FROM DC_TUPLE_MOVER_EVENTS WHERE node_name='v_vmart_node0001' 
   ORDER BY transaction_id;
             time              |    node_name     |           session_id            | user_name |  transaction_id   | operation
-------------------------------+------------------+---------------------------------+-----------+-------------------+-----------
 2020-08-13 08:16:54.360597-04 | v_vmart_node0001 | v_vmart_node0001-190508:0x375db | dbadmin   | 45035996273807826 | Mergeout
 2020-08-13 08:16:54.397346-04 | v_vmart_node0001 | v_vmart_node0001-190508:0x375db | dbadmin   | 45035996273807826 | Mergeout
 2020-08-13 08:16:54.424002-04 | v_vmart_node0001 | v_vmart_node0001-190508:0x375db | dbadmin   | 45035996273807826 | Mergeout
 2020-08-13 08:16:54.425989-04 | v_vmart_node0001 | v_vmart_node0001-190508:0x375db | dbadmin   | 45035996273807829 | Mergeout
 2020-08-13 08:16:54.456829-04 | v_vmart_node0001 | v_vmart_node0001-190508:0x375db | dbadmin   | 45035996273807829 | Mergeout
 2020-08-13 08:16:54.485097-04 | v_vmart_node0001 | v_vmart_node0001-190508:0x375db | dbadmin   | 45035996273807829 | Mergeout
 2020-08-13 08:19:45.8045-04   | v_vmart_node0001 | v_vmart_node0001-190508:0x37b08 | dbadmin   | 45035996273807855 | Mergeout
 2020-08-13 08:19:45.742-04    | v_vmart_node0001 | v_vmart_node0001-190508:0x37b08 | dbadmin   | 45035996273807855 | Mergeout
 2020-08-13 08:19:45.684764-04 | v_vmart_node0001 | v_vmart_node0001-190508:0x37b08 | dbadmin   | 45035996273807855 | Mergeout
 2020-08-13 08:19:45.799796-04 | v_vmart_node0001 | v_vmart_node0001-190508:0x375db | dbadmin   | 45035996273807865 | Mergeout
 2020-08-13 08:19:45.768856-04 | v_vmart_node0001 | v_vmart_node0001-190508:0x375db | dbadmin   | 45035996273807865 | Mergeout
 2020-08-13 08:19:45.715424-04 | v_vmart_node0001 | v_vmart_node0001-190508:0x375db | dbadmin   | 45035996273807865 | Mergeout
 2020-08-13 08:25:20.465604-04 | v_vmart_node0001 | v_vmart_node0001-190508:0x375db | dbadmin   | 45035996273807890 | Mergeout
 2020-08-13 08:25:20.497266-04 | v_vmart_node0001 | v_vmart_node0001-190508:0x375db | dbadmin   | 45035996273807890 | Mergeout
 2020-08-13 08:25:20.518839-04 | v_vmart_node0001 | v_vmart_node0001-190508:0x375db | dbadmin   | 45035996273807890 | Mergeout
 2020-08-13 08:25:20.52099-04  | v_vmart_node0001 | v_vmart_node0001-190508:0x375db | dbadmin   | 45035996273807893 | Mergeout
 2020-08-13 08:25:20.549075-04 | v_vmart_node0001 | v_vmart_node0001-190508:0x375db | dbadmin   | 45035996273807893 | Mergeout
 2020-08-13 08:25:20.569072-04 | v_vmart_node0001 | v_vmart_node0001-190508:0x375db | dbadmin   | 45035996273807893 | Mergeout
(18 rows)

After 25 minutes elapse, 12 of these records age out of the 30-minute interval and are dropped:

=> SELECT CURRENT_TIMESTAMP(0) - '30 minutes'::INTERVAL AS '30 minutes ago';
   30 minutes ago
---------------------
 2020-08-13 08:23:33
(1 row)

=> SELECT time, node_name, session_id, user_name, transaction_id, operation 
   FROM DC_TUPLE_MOVER_EVENTS WHERE node_name='v_vmart_node0001' 
   ORDER BY transaction_id;
             time              |    node_name     |           session_id            | user_name |  transaction_id   | operation
-------------------------------+------------------+---------------------------------+-----------+-------------------+-----------
 2020-08-13 08:25:20.465604-04 | v_vmart_node0001 | v_vmart_node0001-190508:0x375db | dbadmin   | 45035996273807890 | Mergeout
 2020-08-13 08:25:20.497266-04 | v_vmart_node0001 | v_vmart_node0001-190508:0x375db | dbadmin   | 45035996273807890 | Mergeout
 2020-08-13 08:25:20.518839-04 | v_vmart_node0001 | v_vmart_node0001-190508:0x375db | dbadmin   | 45035996273807890 | Mergeout
 2020-08-13 08:25:20.52099-04  | v_vmart_node0001 | v_vmart_node0001-190508:0x375db | dbadmin   | 45035996273807893 | Mergeout
 2020-08-13 08:25:20.549075-04 | v_vmart_node0001 | v_vmart_node0001-190508:0x375db | dbadmin   | 45035996273807893 | Mergeout
 2020-08-13 08:25:20.569072-04 | v_vmart_node0001 | v_vmart_node0001-190508:0x375db | dbadmin   | 45035996273807893 | Mergeout
(6 rows)

You can use SET_DATA_COLLECTOR_TIME_POLICY to update INTERVAL_TIME for all components by omitting the component argument:

=> SELECT SET_DATA_COLLECTOR_TIME_POLICY('1 day'::INTERVAL);
SET_DATA_COLLECTOR_TIME_POLICY
--------------------------------
SET
(1 row)

=> SELECT DISTINCT component, INTERVAL_SET, INTERVAL_TIME 
   FROM DATA_COLLECTOR WHERE component ILIKE '%partition%';
      component       | INTERVAL_SET | INTERVAL_TIME
----------------------+--------------+---------------
 HiveCustomPartitions | t            | 1
 MovePartitions       | t            | 1
 CopyPartitions       | t            | 1
 SwapPartitions       | t            | 1
(4 rows)

To clear the INTERVAL_TIME policy property, call SET_DATA_COLLECTOR_TIME_POLICY with a negative integer argument:

=> SELECT SET_DATA_COLLECTOR_TIME_POLICY('-1');
 SET_DATA_COLLECTOR_TIME_POLICY
--------------------------------
 SET
(1 row)

=> SELECT DISTINCT component, INTERVAL_SET, INTERVAL_TIME 
   FROM DATA_COLLECTOR WHERE component ILIKE '%partition%';
      component       | INTERVAL_SET | INTERVAL_TIME
----------------------+--------------+---------------
 MovePartitions       | f            | 0
 SwapPartitions       | f            | 0
 HiveCustomPartitions | f            | 0
 CopyPartitions       | f            | 0
(4 rows)

Setting INTERVAL_TIME on a retention policy also sets its INTERVAL_SET property to true.

9.2 - Querying data collector tables

Data Collector tables (prefixed by dc_) are in the V_INTERNAL schema.

You can obtain component-specific data from Data Collector tables. The Data Collector compiles the component data from its log files in a table format that you can query with standard SQL queries. You can identify Data Collector table names for specific components from the DATA_COLLECTOR system table:

=> SELECT DISTINCT component, table_name FROM DATA_COLLECTOR 
   WHERE component ILIKE 'lock%';
  component   |    table_name
--------------+------------------
 LockRequests | dc_lock_requests
 LockReleases | dc_lock_releases
 LockAttempts | dc_lock_attempts
(3 rows)

You can then query those Data Collector tables directly:

=> SELECT * FROM DC_LOCK_ATTEMPTS 
   WHERE description != 'Granted immediately';
-[ RECORD 1 ]------+------------------------------
time               | 2020-08-17 00:14:07.187607-04
node_name          | v_vmart_node0001
session_id         | v_vmart_node0001-319647:0x1d
user_id            | 45035996273704962
user_name          | dbadmin
transaction_id     | 45035996273819050
object             | 0
object_name        | Global Catalog
mode               | X
promoted_mode      | X
scope              | TRANSACTION
start_time         | 2020-08-17 00:14:07.184663-04
timeout_in_seconds | 300
result             | granted
description        | Granted after waiting

9.3 - Managing data collection logs

On startup, Vertica creates a DataCollector directory under the database catalog directory of each node.

On startup, Vertica creates a DataCollector directory under the database catalog directory of each node. This directory contains one or more logs for individual components. For example:

$ pwd
/home/dbadmin/VMart/v_vmart_node0001_catalog/DataCollector

$ ls -1 -g Lock*
-rw------- 1 verticadba 2559879 Aug 17 00:14 LockAttempts_650572441057355.log
-rw------- 1 verticadba  614579 Aug 17 05:28 LockAttempts_650952885486175.log
-rw------- 1 verticadba 2559895 Aug 14 18:31 LockReleases_650306482037650.log
-rw------- 1 verticadba 1411127 Aug 17 05:28 LockReleases_650759468041873.log

The DataCollector directory also contains a pair of SQL template files for each component:

  • CREATE_component_TABLE.sql provides DDL for creating a table that you can use to load Data Collector logs for a given component. For example:

    $ cat CREATE_LockAttempts_TABLE.sql
    \set dcschema 'echo ${DCSCHEMA:-dc}'
    CREATE TABLE :dcschema.dc_lock_attempts(
      "time" TIMESTAMP WITH TIME ZONE,
      "node_name" VARCHAR(128),
      "session_id" VARCHAR(128),
      "user_id" INTEGER,
      "user_name" VARCHAR(128),
      "transaction_id" INTEGER,
      "object" INTEGER,
      "object_name" VARCHAR(128),
      "mode" VARCHAR(128),
      "promoted_mode" VARCHAR(128),
      "scope" VARCHAR(128),
      "start_time" TIMESTAMP WITH TIME ZONE,
      "timeout_in_seconds" INTEGER,
      "result" VARCHAR(128),
      "description" VARCHAR(64000)
    );
    
  • COPY_component_TABLE.sql contains SQL for loading (with COPY) the data log files into the table that the CREATE script creates. For example:

    $ cat COPY_LockAttempts_TABLE.sql
    \set dcpath 'echo ${DCPATH:-$PWD}'
    \set dcschema 'echo ${DCSCHEMA:-dc}'
    \set logfiles '''':dcpath'/LockAttempts_*.log'''
    COPY :dcschema.dc_lock_attempts(
      LockAttempts_start_filler FILLER VARCHAR(64) DELIMITER E'\n',
      "time_nfiller" FILLER VARCHAR(32) DELIMITER ':',
      "time" FORMAT '_internal' DELIMITER E'\n',
      "node_name_nfiller" FILLER VARCHAR(32) DELIMITER ':',
      "node_name" ESCAPE E'\001' DELIMITER E'\n',
      "session_id_nfiller" FILLER VARCHAR(32) DELIMITER ':',
      "session_id" ESCAPE E'\001' DELIMITER E'\n',
      "user_id_nfiller" FILLER VARCHAR(32) DELIMITER ':',
      "user_id" FORMAT 'd' DELIMITER E'\n',
      "user_name_nfiller" FILLER VARCHAR(32) DELIMITER ':',
      "user_name" ESCAPE E'\001' DELIMITER E'\n',
      "transaction_id_nfiller" FILLER VARCHAR(32) DELIMITER ':',
      "transaction_id" FORMAT 'd' DELIMITER E'\n',
      "object_nfiller" FILLER VARCHAR(32) DELIMITER ':',
      "object" FORMAT 'd' DELIMITER E'\n',
      "object_name_nfiller" FILLER VARCHAR(32) DELIMITER ':',
      "object_name" ESCAPE E'\001' DELIMITER E'\n',
      "mode_nfiller" FILLER VARCHAR(32) DELIMITER ':',
      "mode" ESCAPE E'\001' DELIMITER E'\n',
      "promoted_mode_nfiller" FILLER VARCHAR(32) DELIMITER ':',
      "promoted_mode" ESCAPE E'\001' DELIMITER E'\n',
      "scope_nfiller" FILLER VARCHAR(32) DELIMITER ':',
      "scope" ESCAPE E'\001' DELIMITER E'\n',
      "start_time_nfiller" FILLER VARCHAR(32) DELIMITER ':',
      "start_time" FORMAT '_internal' DELIMITER E'\n',
      "timeout_in_seconds_nfiller" FILLER VARCHAR(32) DELIMITER ':',
      "timeout_in_seconds" FORMAT 'd' DELIMITER E'\n',
      "result_nfiller" FILLER VARCHAR(32) DELIMITER ':',
      "result" ESCAPE E'\001' DELIMITER E'\n',
      "description_nfiller" FILLER VARCHAR(32) DELIMITER ':',
      "description" ESCAPE E'\001'
    )  FROM :logfiles RECORD TERMINATOR E'\n.\n' DELIMITER E'\n';
    

Log management functions

You can use two Data Collector functions to manage logs. Both functions can operate on a single component or all components.

  • FLUSH_DATA_COLLECTOR waits until in-memory logs are moved to disk and then flushes the Data Collector, synchronizing the log with disk storage.

  • CLEAR_DATA_COLLECTOR clears all memory and disk records from Data Collector tables and logs and then resets collection statistics in DATA_COLLECTOR.

10 - Monitoring partition reorganization

When you use ALTER TABLE...REORGANIZE, the operation reorganizes the data in the background.

When you use ALTER TABLE...REORGANIZE, the operation reorganizes the data in the background.

You can monitor details of the reorganization process by polling the following system tables:

11 - Monitoring resource pools

You can use the following to find information about resource pools:.

You can use the following to find information about resource pools:

You can also use the Management Console to obtain run-time data on resource pool usage.

Viewing resource pool status

The following example queries RESOURCE_POOL_STATUS for memory size data:

=> SELECT pool_name poolName,
    node_name nodeName,
    max_query_memory_size_kb maxQueryMemSizeKb,
    max_memory_size_kb maxMemSizeKb,
    memory_size_actual_kb memSizeActualKb
    FROM resource_pool_status WHERE pool_name='ceo_pool';
 poolName |     nodeName     | maxQueryMemSizeKb | maxMemSizeKb | memSizeActualKb
----------+------------------+-------------------+--------------+-----------------
 ceo_pool | v_vmart_node0001 |          12179388 |     13532654 |         1843200
 ceo_pool | v_vmart_node0002 |          12191191 |     13545768 |         1843200
 ceo_pool | v_vmart_node0003 |          12191170 |     13545745 |         1843200
(3 rows)

Viewing query resource acquisitions

The following example displays all resources granted to the queries that are currently running. The information shown is stored in system table RESOURCE_ACQUISITIONS table. You can see that the query execution used 708504 KB of memory from the GENERAL pool.

=> SELECT pool_name, thread_count, open_file_handle_count, memory_inuse_kb,
   queue_entry_timestamp, acquisition_timestamp
   FROM V_MONITOR.RESOURCE_ACQUISITIONS WHERE node_name ILIKE '%node0001';

-[ RECORD 1 ]----------+------------------------------
pool_name              | sysquery
thread_count           | 4
open_file_handle_count | 0
memory_inuse_kb        | 4103
queue_entry_timestamp  | 2013-12-05 07:07:08.815362-05
acquisition_timestamp  | 2013-12-05 07:07:08.815367-05
-[ RECORD 2 ]----------+------------------------------
...
-[ RECORD 8 ]----------+------------------------------
pool_name              | general
thread_count           | 12
open_file_handle_count | 18
memory_inuse_kb        | 708504
queue_entry_timestamp  | 2013-12-04 12:55:38.566614-05
acquisition_timestamp  | 2013-12-04 12:55:38.566623-05
-[ RECORD 9 ]----------+------------------------------
...

You can determine how long a query waits in the queue before it can run. To do so, you obtain the difference between acquisition_timestamp and queue_entry_timestamp using a query as this example shows:

=> SELECT pool_name, queue_entry_timestamp, acquisition_timestamp,
    (acquisition_timestamp-queue_entry_timestamp) AS 'queue wait'
   FROM V_MONITOR.RESOURCE_ACQUISITIONS WHERE node_name ILIKE '%node0001';

-[ RECORD 1 ]---------+------------------------------
pool_name             | sysquery
queue_entry_timestamp | 2013-12-05 07:07:08.815362-05
acquisition_timestamp | 2013-12-05 07:07:08.815367-05
queue wait            | 00:00:00.000005
-[ RECORD 2 ]---------+------------------------------
pool_name             | sysquery
queue_entry_timestamp | 2013-12-05 07:07:14.714412-05
acquisition_timestamp | 2013-12-05 07:07:14.714417-05
queue wait            | 00:00:00.000005
-[ RECORD 3 ]---------+------------------------------
pool_name             | sysquery
queue_entry_timestamp | 2013-12-05 07:09:57.238521-05
acquisition_timestamp | 2013-12-05 07:09:57.281708-05
queue wait            | 00:00:00.043187
-[ RECORD 4 ]---------+------------------------------
...

Querying user-defined resource pools

The Boolean column IS_INTERNAL in system tables RESOURCE_POOLS and RESOURCE_POOL_STATUS lets you get data on user-defined resource pools only. For example:

SELECT name, subcluster_oid, subcluster_name, memorysize, maxmemorysize, priority, maxconcurrency
dbadmin->     FROM V_CATALOG.RESOURCE_POOLS where is_internal ='f';
   name       |  subcluster_oid   | subcluster_name | memorysize | maxmemorysize | priority | maxconcurrency
--------------+-------------------+-----------------+------------+---------------+----------+----------------
 load_pool    | 72947297254957395 | default         | 0%         |               |       10 |
 ceo_pool     | 63570532589529860 | c_subcluster    | 250M       |               |       10 |
 ad hoc_pool  |                 0 |                 | 200M       | 200M          |        0 |
 billing_pool | 45579723408647896 | ar_subcluster   | 0%         |               |        0 |              3
 web_pool     |                 0 | analytics_1     | 25M        |               |       10 |              5
 batch_pool   | 47479274633682648 | default         | 150M       | 150M          |        0 |             10
 dept1_pool   |                 0 |                 | 0%         |               |        5 |
 dept2_pool   |                 0 |                 | 0%         |               |        8 |
 dashboard    | 45035996273843504 | analytics_1     | 0%         |               |        0 |
(9 rows)

12 - Monitoring recovery

When your Vertica database is recovering from a failure, it's important to monitor the recovery process.

When your Vertica database is recovering from a failure, it's important to monitor the recovery process. There are several ways to monitor database recovery:

12.1 - Viewing log files on each node

During database recovery, Vertica adds logging information to the .log on each host.

During database recovery, Vertica adds logging information to the vertica.log on each host. Each message is identified with a [Recover]string.

Use the tail command to monitor recovery progress by viewing the relevant status messages, as follows.

$ tail -f catalog-path/database-name/node-name_catalog/vertica.log
01/23/08 10:35:31 thr:Recover:0x2a98700970 [Recover] <INFO> Changing host v_vmart_node0001 startup state from INITIALIZING to RECOVERING
01/23/08 10:35:31 thr:CatchUp:0x1724b80 [Recover] <INFO> Recovering to specified epoch 0x120b6
01/23/08 10:35:31 thr:CatchUp:0x1724b80 [Recover] <INFO> Running 1 split queries
01/23/08 10:35:31 thr:CatchUp:0x1724b80 [Recover] <INFO> Running query: ALTER PROJECTION proj_tradesquotes_0 SPLIT v_vmart_node0001 FROM 73911;

12.2 - Using system tables to monitor recovery

Use the following system tables to monitor recover:.

Use the following system tables to monitor recover:

Specifically, the recovery_status system table includes information about the node that is recovering, the epoch being recovered, the current recovery phase, and running status:

=>select node_name, recover_epoch, recovery_phase, current_completed, is_running from recovery_status;
node_name            | recover_epoch | recovery_phase    | current_completed | is_running
---------------------+---------------+-------------------+-------------------+--------------
 v_vmart_node0001    |               |                   | 0                 | f
 v_vmart_node0002    | 0             | historical pass 1 | 0                 | t
 v_vmart_node0003    | 1             | current           | 0                 | f

The projection_recoveries system table maintains history of projection recoveries. To check the recovery status, you can summarize the data for the recovering node, and run the same query several times to see if the counts change. Differing counts indicate that the recovery is working and in the process of recovering all missing data.

=> select node_name, status , progress from projection_recoveries;
node_name              | status      | progress
-----------------------+-------------+---------
v_vmart_node0001       | running     | 61

To see a single record from the projection_recoveries system table, add limit 1 to the query.

After a recovery has completed, Vertica continues to store information from the most recent recovery in these tables.

12.3 - Viewing cluster state and recovery status

Use the admintools view_cluster tool from the command line to see the cluster state:.

Use the admintools view_cluster tool from the command line to see the cluster state:

$ /opt/vertica/bin/admintools -t view_cluster
DB | Host | State
---------+--------------+------------
<data_base> | 112.17.31.10 | RECOVERING
<data_base> | 112.17.31.11 | UP
<data_base> | 112.17.31.12 | UP
<data_base> | 112.17.31.17 | UP
________________________________

12.4 - Monitoring cluster status after recovery

When recovery has completed:.

When recovery has completed:

  1. Launch Administration Tools.

  2. From the Main Menu, select View Database Cluster State and click OK.

    The utility reports your node's status as UP.

13 - Clearing projection refresh history

To immediately purge this information, call CLEAR_PROJECTION_REFRESHES:.

The PROJECTION_REFRESHES system table records information about successful and unsuccessful refresh operations. This table normally retains data for a projection until replaced by a new refresh of that projection, but you can also purge the table.

To immediately purge data for all completed refreshes, call CLEAR_PROJECTION_REFRESHES:

=> SELECT clear_projection_refreshes();
clear_projection_refreshes
----------------------------
 CLEAR
(1 row)

This function does not clear data for refreshes that are currently in progress.

14 - Monitoring Vertica using notifiers

A Vertica notifier is a push-based mechanism for sending messages from Vertica to endpoints like Apache Kafka or syslog.

A Vertica notifier is a push-based mechanism for sending messages from Vertica to endpoints like Apache Kafka or syslog. For example, you can configure a long-running script to send notifications at various stages and then at the completion of a task.

To use a notifier:

  1. Use CREATE NOTIFIER to create one of the following notifier types:

  2. Send a notification to the NOTIFIER endpoint with any of the following:

    • NOTIFY: Manually sends a message to the NOTIFIER endpoint.

    • SET_DATA_COLLECTOR_NOTIFY_POLICY: Creates a notification policy, which automatically sends a message to the NOTIFIER endpoint when a specified event occurs.

15 - Health Watchdog

Health Watchdog is an automated mechanism that detects and block queries during bad health conditions on the server.

If the database is under a high concurrent load, it leads to a bad health state on the server. The Health Watchdog is designed to mitigate the bad health state by doing the following:

  • Detecting the bad health state.

  • Stopping the transactions from adding to this bad state by blocking DDL/DML transactions.

  • Once the bad health state has been mitagated, allowing all blocked transactions to proceed.

Health Watchdog has three health metrics that it uses to check the server status and enact the mitigation:

  • Truncation Version Lag - tracks the catalog sync service and detects bad health conditions in the server when the current commit version is far ahead of the database truncation version.

  • GCLX Queue Bloat - tracks the GCLX queue size and stops the GCLX requests when the server is bombarded.

  • Mergeout Queue Bloat - tracks the TM queue size and stops DML transactions if the TM pool threads cannot keep up with the number of TM requests.

You can check the status of the server using check_cluster_health and the health_watchdog_blocked_transactions system table.