You can monitor the activity and health of a Vertica database through various log files and system tables. Vertica provides various configuration parameters that control monitoring options. You can also use the Management Console to observe database activity.
Monitoring Vertica
- 1: Monitoring log files
- 2: Rotating log files
- 3: Monitoring process status (ps)
- 4: Monitoring Linux resource usage
- 5: Monitoring disk space usage
- 6: Monitoring elastic cluster rebalancing
- 7: Monitoring events
- 7.1: Event logging mechanisms
- 7.2: Event codes
- 7.3: Event data
- 7.4: Configuring event reporting
- 7.4.1: Configuring reporting for the simple notification service (SNS)
- 7.4.2: Configuring reporting for syslog
- 7.4.3: Configuring reporting for SNMP
- 7.4.4: Configuring event trapping for SNMP
- 7.4.5: Verifying SNMP configuration
- 7.5: Event reporting examples
- 8: Using system tables
- 9: Data Collector utility
- 9.1: Configuring data retention policies
- 9.2: Querying data collector tables
- 9.3: Managing data collection logs
- 10: Monitoring partition reorganization
- 11: Monitoring resource pools
- 12: Monitoring recovery
- 12.1: Viewing log files on each node
- 12.2: Using system tables to monitor recovery
- 12.3: Viewing cluster state and recovery status
- 12.4: Monitoring cluster status after recovery
- 13: Clearing projection refresh history
- 14: Monitoring Vertica using notifiers
- 15: Health Watchdog
1 - Monitoring log files
When a database is running
When a Vertica database is running, each node in the cluster writes messages into a file named vertica.log. For example, the Tuple Mover and the transaction manager write INFO messages into vertica.log at specific intervals, even when there is no mergeout activity.
You configure the location of the vertica.log file. By default, the log file is in:
catalog-path/database-name/node-name_catalog/vertica.log
- catalog-path is the path shown in the NODES system table minus the Catalog directory at the end.
- database-name is the name of your database.
- node-name is the name of the node shown in the NODES system table.
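For example, you can derive these values by querying the NODES system table directly. A minimal sketch (the catalog_path column ends in the Catalog directory, which you drop to get catalog-path):
=> SELECT node_name, catalog_path FROM nodes;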
Note
Vertica often changes the format or content of log files in subsequent releases to benefit both customers and customer support.
To monitor one node in a running database in real time:
- Log in to the database administrator account on any node in the cluster.
- In a terminal window, enter:
  $ tail -f catalog-path/database-name/node-name_catalog/vertica.log
Note
To monitor your overall database (rather than an individual node/host), use the Data Collector, which records system activities and performance. See Data Collector utility for more on Data Collector.

Parameter | Description
---|---
catalog-path | The catalog pathname specified when you created the database. See Creating a database.
database-name | The database name (case sensitive).
node-name | The node name, as specified in the database definition. See Viewing a database.
When the database/node is starting up
During system startup, before the Vertica log has been initialized to write messages, each node in the cluster writes messages into a file named dbLog. This log is useful for diagnosing situations where the database fails to start before it can write messages into vertica.log. The dbLog file is located at the following path, using catalog-path and database-name as described above:
catalog-path/database-name/dbLog
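You can watch this log during startup the same way you tail vertica.log, substituting your actual paths:
$ tail -f catalog-path/database-name/dbLog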
2 - Rotating log files
The LogRotate service periodically rotates logs and removes old rotated logs. To view previous LogRotate events, see LOG_ROTATE_EVENTS.
If the following files exceed the specified maximum size, they are rotated:
vertica.log
UDxFencedProcesses.log
MemoryReport.log
editor.log
dbLog
Rotated files are compressed and marked with a timestamp in the same location as the original log file: path/to/logfile.log.timestamp.gz. For example, /scratch_b/qa/VMart/v_vmart_node0001_catalog/vertica.log is rotated to /scratch_b/qa/VMart/v_vmart_node0001_catalog/vertica.log.2023-11-08-14-09-02-381909-05.gz.
If a log file was rotated, the previously rotated logs (.gz) in that directory are checked against the specified maximum age. Rotated logs older than the maximum age are deleted.
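To see which rotated logs currently exist for a node, you can list the compressed files next to the original log. A minimal sketch, reusing the example paths above:
$ ls -l /scratch_b/qa/VMart/v_vmart_node0001_catalog/vertica.log.*.gz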
Upgrade behavior
The LogRotate service is automatically enabled, but if your database contains the configuration file for the Linux logrotate utility at /opt/vertica/config/logrotate/database_name, the LogRotate service will not run. If you upgraded from Vertica 23.4 or below, you must delete the configuration file on all nodes to use the LogRotate service.
Automatic rotation
To configure and enable the LogRotate service:
- Set the maximum allowed size of logs with LogRotateMaxSize. Log files larger than this size are rotated (compressed).
- Set the maximum allowed age of rotated logs with LogRotateMaxAge. Rotated (compressed) logs older than this age are deleted.
- Set how often the service runs with LogRotateInterval.
- Enable the service with EnableLogRotate.
For details on these parameters, see Monitoring parameters.
To view the current values for these parameters, query CONFIGURATION_PARAMETERS.
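For example, the following query (a hedged sketch using the documented CONFIGURATION_PARAMETERS columns) shows the LogRotate settings in one place:
=> SELECT parameter_name, current_value, default_value
   FROM configuration_parameters
   WHERE parameter_name ILIKE 'LogRotate%' OR parameter_name = 'EnableLogRotate';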
The following example configures the LogRotate service to automatically run every four hours, rotating logs that are larger than 1 kibibyte and removing rotated logs that are older than 12 days:
=> ALTER DATABASE DEFAULT SET LogRotateInterval = '4h', LogRotateMaxSize = '1k', LogRotateMaxAge = '12 days';
Manual rotation
You can manually rotate the logs on a particular node with the DO_LOGROTATE_LOCAL meta-function. This function takes optional arguments to force the rotation of all logs and override the values of LogRotateMaxSize and LogRotateMaxAge.
To rotate logs that are larger than 1 kibibyte, and then remove rotated logs that are older than 1 day:
=> SELECT do_logrotate_local('max_size=1K;max_age=1 day');
do_logrotate_local
-----------------------------------------------------------------------------------------------------------
Doing Logrotate
Considering file /scratch_b/qa/VMart/v_vmart_node0001_catalog/vertica.log
File size: 35753 Bytes
Force rotate? no
Renaming to /scratch_b/qa/VMart/v_vmart_node0001_catalog/vertica.log.2023-11-08-13-55-51-651129-05
Opening new log file /scratch_b/qa/VMart/v_vmart_node0001_catalog/vertica.log
Compressing /scratch_b/qa/VMart/v_vmart_node0001_catalog/vertica.log.2023-11-08-13-55-51-651129-05 to /scratch_b/qa/VMart/v_vmart_node0001_catalog/vertica.log.2023-11-08-13-55-51-651129-05.gz
Done with /scratch_b/qa/VMart/v_vmart_node0001_catalog/vertica.log
Considering file /scratch_b/qa/VMart/v_vmart_node0001_catalog/UDxLogs/UDxFencedProcesses.log
File size: 68 Bytes
Force rotate? no
Rotation not required for file /scratch_b/qa/VMart/v_vmart_node0001_catalog/UDxLogs/UDxFencedProcesses.log
Done with /scratch_b/qa/VMart/v_vmart_node0001_catalog/UDxLogs/UDxFencedProcesses.log
(1 row)
To force rotation and then remove all logs older than 4 days (default value of LogRotateMaxAge):
=> SELECT do_logrotate_local('force;max_age=4 days');
do_logrotate_local
-----------------------------------------------------------------------------------------------------------
Doing Logrotate
Considering file /scratch_b/qa/VMart/v_vmart_node0001_catalog/vertica.log
File size: 4310245 Bytes
Force rotate? yes
Renaming to /scratch_b/qa/VMart/v_vmart_node0001_catalog/vertica.log.2023-11-10-13-45-15-53837-05
Opening new log file /scratch_b/qa/VMart/v_vmart_node0001_catalog/vertica.log
Compressing /scratch_b/qa/VMart/v_vmart_node0001_catalog/vertica.log.2023-11-10-13-45-15-53837-05 to /scratch_b/qa/VMart/v_vmart_node0001_catalog/vertica.log.2023-11-10-13-45-15-53837-05.gz
Done with /scratch_b/qa/VMart/v_vmart_node0001_catalog/vertica.log
Considering file /scratch_b/qa/VMart/v_vmart_node0001_catalog/UDxLogs/UDxFencedProcesses.log
File size: 68 Bytes
Force rotate? yes
Remove old log file /scratch_b/qa/VMart/v_vmart_node0001_catalog/UDxLogs/UDxFencedProcesses.log.2023-11-06-13-18-27-23141-05.gz
Remove old log file /scratch_b/qa/VMart/v_vmart_node0001_catalog/UDxLogs/UDxFencedProcesses.log.2023-11-07-13-18-30-059008-05.gz
Remove old log file /scratch_b/qa/VMart/v_vmart_node0001_catalog/UDxLogs/UDxFencedProcesses.log.2023-11-08-13-47-11-707903-05.gz
Remove old log file /scratch_b/qa/VMart/v_vmart_node0001_catalog/UDxLogs/UDxFencedProcesses.log.2023-11-09-14-09-02-386402-05.gz
Renaming to /scratch_b/qa/VMart/v_vmart_node0001_catalog/UDxLogs/UDxFencedProcesses.log.2023-11-10-13-45-15-647762-05
Opening new log file /scratch_b/qa/VMart/v_vmart_node0001_catalog/UDxLogs/UDxFencedProcesses.log
Compressing /scratch_b/qa/VMart/v_vmart_node0001_catalog/UDxLogs/UDxFencedProcesses.log.2023-11-10-13-45-15-647762-05 to /scratch_b/qa/VMart/v_vmart_node0001_catalog/UDxLogs/UDxFencedProcesses.log.2023-11-10-13-45-15-647762-05.gz
Done with /scratch_b/qa/VMart/v_vmart_node0001_catalog/UDxLogs/UDxFencedProcesses.log
(1 row)
Rotating with the Linux logrotate utility
Deprecated
This functionality is deprecated and will be removed in a future release. You should instead use the native LogRotate service.
Most Linux distributions include the logrotate utility. By setting up a logrotate configuration file, you can use the utility to complete one or more of these tasks automatically:
- Compress and rotate log files
- Remove log files automatically
- Email log files to named recipients
You can configure logrotate to complete these tasks at specific intervals, or when log files reach a particular size.
If logrotate is present when Vertica is installed, Vertica automatically configures the utility to look for configuration files in the /opt/vertica/config/logrotate directory on each node.
When you create a database, Vertica creates database-specific logrotate configurations on each node in your cluster, which are used by the logrotate utility. For each database, Vertica creates a file at the path /opt/vertica/config/logrotate/dbname.
For information about additional settings, use the man logrotate command.
Executing the Python script through the dbadmin logrotate cron job
During the installation of Vertica, the installer configures a cron job for the dbadmin user. This cron job is configured to execute a Python script that runs the logrotate utility. You can view the details of this cron job in the dbadmin.cron file, which is located in the /opt/vertica/config directory.
If you want to customize a cron job to configure logrotate for your Vertica database, you must create the cron job under the dbadmin user.
Using the administration tools logrotate utility
You can use the admintools logrotate option to help configure logrotate scripts for a database and distribute the scripts across the cluster. The logrotate option allows you to specify:
- How often to rotate logs
- How large logs can become before being rotated
- How long to keep the logs
Example:
The following example sets up log rotation on a weekly schedule and keeps three months of logs (12 rotated logs):
$ admintools -t logrotate -d <dbname> -r weekly -k 12
See Writing administration tools scripts for more usage information.
Configure logrotate for MC
The Management Console log file is:
/opt/vconsole/log/mc/mconsole.log
To configure logrotate for MC, configure the following file:
/opt/vconsole/temp/webapp/WEB-INF/classes/log4j.xml
Edit the log4j.xml file and set these parameters as follows:
- Restrict the size of the log:
  <param name="MaxFileSize" value="1MB"/>
- Restrict the number of file backups for the log:
  <param name="MaxBackupIndex" value="1"/>
- Restart MC as the root user:
  # /etc/init.d/vertica-consoled restart
Rotating logs manually
To implement a custom log rotation process, follow these steps:
- Rename or archive the existing vertica.log file. For example:
  $ mv vertica.log vertica.log.1
- Send the Vertica process the USR1 signal, using either of the following approaches:
  $ killall -USR1 vertica
  or
  $ ps -ef | grep -i vertica
  $ kill -USR1 process-id
3 - Monitoring process status (ps)
You can use ps to monitor the database and Spread processes running on each node in the cluster. For example:
$ ps aux | grep /opt/vertica/bin/vertica
$ ps aux | grep /opt/vertica/spread/sbin/spread
You should see one Vertica process and one Spread process on each node for common configurations. To monitor Administration Tools and connector processes:
$ ps aux | grep vertica
There can be many connection processes but only one Administration Tools process.
4 - Monitoring Linux resource usage
You should monitor system resource usage on any or all nodes in the cluster. You can use System Activity Reporting (SAR) to monitor resource usage.
Note
OpenText recommends that you install pstack and sysstat to help monitor Linux resources. The SYSSTAT package contains utilities for monitoring system performance and usage activity, such as sar, as well as tools you can schedule via cron to collect performance and activity data. See the SYSSTAT Web page for details.
The pstack utility lets you print a stack trace of a running process. See the PSTACK man page for details.
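For example, once sysstat is installed, you can sample CPU and memory usage with sar. A minimal sketch (standard sar flags; the 5-second interval and 10-sample count are arbitrary choices):
$ sar -u 5 10   # CPU utilization, 10 samples at 5-second intervals
$ sar -r 5 10   # memory utilization, same sampling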
- Log in to the database administrator account on any node.
- Run the top utility:
  $ top
  A high CPU percentage in top indicates that Vertica is CPU-bound. For example:
  top - 11:44:28 up 53 days, 23:47, 9 users, load average: 0.91, 0.97, 0.81
  Tasks: 123 total, 1 running, 122 sleeping, 0 stopped, 0 zombie
  Cpu(s): 26.9%us, 1.3%sy, 0.0%ni, 71.8%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
  Mem: 4053136 total, 3882020k used, 171116 free, 407688 buffers
  Swap: 4192956 total, 176k used, 4192780 free, 1526436 cached

    PID USER    PR  NI  VIRT  RES  SHR S %CPU %MEM   TIME+  COMMAND
  13703 dbadmin  1   0 1374m 678m  55m S 99.9 17.1  6:21.70 vertica
   2606 root    16   0 32152  11m 2508 S  1.0  0.3  0:16.97 X
      1 root    16   0  4748  552  456 S  0.0  0.0  0:01.51 init
      2 root    RT  -5     0    0    0 S  0.0  0.0  0:04.92 migration/0
      3 root    34  19     0    0    0 S  0.0  0.0  0:11.75 ksoftirqd/0
  ...
  Some possible reasons for high CPU usage are:
  - The Tuple Mover runs automatically and thus consumes CPU time even if there are no connections to the database.
  - The swappiness kernel parameter may not be set to 0. Execute the following command from the Linux command line to see the value of this parameter:
    $ cat /proc/sys/vm/swappiness
    If this value is not 0, change it by following the steps in Check for swappiness.
Run the
iostat
utility. A high idle time intop
at the same time as a high rate of blocks read iniostat
indicates that Vertica is disk-bound. For example:$ /usr/bin/iostat Linux 2.6.18-164.el5 (qa01) 02/05/2011 avg-cpu: %user %nice %system %iowait %steal %idle 0.77 2.32 0.76 0.68 0.00 95.47 Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn hda 0.37 3.40 10.37 2117723 6464640 sda 0.46 1.94 18.96 1208130 11816472 sdb 0.26 1.79 15.69 1114792 9781840 sdc 0.24 1.80 16.06 1119304 10010328 sdd 0.22 1.79 15.52 1117472 9676200 md0 8.37 7.31 66.23 4554834 41284840
5 - Monitoring disk space usage
You can use these system tables to monitor disk space usage on your cluster:
System table | Description
---|---
DISK_STORAGE | Monitors the amount of disk storage used by the database on each node.
COLUMN_STORAGE | Monitors the amount of disk storage used by each column of each projection on each node.
PROJECTION_STORAGE | Monitors the amount of disk storage used by each projection on each node.
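For example, the following query (a hedged sketch using documented DISK_STORAGE columns) summarizes used and free disk space per node:
=> SELECT node_name,
          SUM(disk_space_used_mb) AS used_mb,
          SUM(disk_space_free_mb) AS free_mb
   FROM disk_storage
   GROUP BY node_name
   ORDER BY node_name;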
6 - Monitoring elastic cluster rebalancing
Vertica includes system tables that can be used to monitor the rebalance status of an elastic cluster and gain general insight into the status of the elastic cluster on your nodes.
- REBALANCE_TABLE_STATUS provides general information about a rebalance. It shows, for each table, the amount of data that has been separated, the amount that is currently being separated, and the amount to be separated. It also shows the amount of data transferred, the amount that is currently being transferred, and the remaining amount to be transferred (or an estimate if storage is not separated).
  Note
  If multiple rebalance methods were used for a single table (for example, the table has unsegmented and segmented projections), the table may appear multiple times, once for each rebalance method.
- REBALANCE_PROJECTION_STATUS can be used to gain more insight into the details for a particular projection that is being rebalanced. It provides the same type of information as above, but in terms of a projection instead of a table.
In each table, the SEPARATED_PERCENT and TRANSFERRED_PERCENT columns can be used to determine overall progress.
Historical rebalance information
Historical information about completed work is retained, so query with a predicate on the IS_LATEST column to restrict output to only the most recent or current rebalance activity. Historical data may include information about dropped projections or tables. If a table or projection has been dropped and information about the anchor table is not available, then NULL is displayed for the table ID and <unknown> for the table name. Information on dropped tables is still useful, for example, in providing justification for the duration of a task.
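For example, a minimal sketch that checks overall progress of only the most recent or current rebalance activity:
=> SELECT table_name, rebalance_method, separated_percent, transferred_percent
   FROM rebalance_table_status
   WHERE is_latest;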
7 - Monitoring events
To help you monitor your database system, Vertica traps and logs significant events that can affect database performance and functionality if their root causes are not addressed. This section describes where events are logged, the types of events that Vertica logs, how to respond to these events, the information that Vertica provides for these events, and how to configure event monitoring.
7.1 - Event logging mechanisms
Vertica posts events to the following mechanisms:
Mechanism | Description
---|---
vertica.log | All events are automatically posted to vertica.log. See Monitoring the Log Files.
ACTIVE_EVENTS | This SQL system table provides information about all open events. See Using system tables and ACTIVE_EVENTS.
SNMP | To post traps to SNMP, enable global reporting in addition to each individual event you want trapped. See Configuring event reporting.
Syslog | To log events to syslog, enable event reporting for each individual event you want logged. See Configuring event reporting.
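For example, a quick check of currently open events from ACTIVE_EVENTS (a minimal sketch; column names as documented for that table):
=> SELECT node_name, event_code, event_code_description, event_posted_timestamp
   FROM active_events;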
7.2 - Event codes
The following table lists the event codes that Vertica logs to the events system tables.
Event Code | Severity | Event Code Description | Description/Action
---|---|---|---
0 | Warning | Low Disk Space | Warning indicates that the database is low on disk space, a disk is failing, or there is an I/O hardware failure. Action: Add more disk space, or replace the failing disk or hardware as soon as possible. Check dmesg to see why a disk is failing and resolve the problem accordingly. Also, use the DISK_RESOURCE_REJECTIONS system table to determine the types of disk space requests that are being rejected and the hosts where they are rejected. See Managing disk space for details.
1 | Warning | Read Only File System | Database lacks write access to the file system for data or catalog paths. This sometimes occurs if Linux remounts a drive due to a kernel issue. Action: Give the database write access.
2 | Emergency | Loss Of K Safety | The database is no longer K-safe because insufficient nodes are functioning within the cluster. Loss of K-safety causes the database to shut down. Action: Recover the system.
3 | Critical | Current Fault Tolerance at Critical Level | One or more nodes in the cluster failed. If the database loses one more node, it will no longer be K-safe and will shut down. Action: Restore nodes that failed or shut down.
4 | Warning | Too Many ROS Containers | Heavy load activity on one or more projections sometimes generates more ROS containers than the Tuple Mover can handle. Vertica allows up to 1024 ROS containers per projection before it rolls back additional load jobs and returns a ROS pushback error message. Action: The Tuple Mover typically catches up with pending mergeout requests and the Optimizer can resume executing queries on affected tables (see Mergeout). If this problem does not resolve quickly, or if it occurs frequently, it is probably related to insufficient RAM allocated to MAXMEMORY in the TM resource pool.
5 | Informational | WOS Over Flow | Deprecated
6 | Informational | Node State Change | The node state changed. Action: Check node status.
7 | Warning | Recovery Failure | Database was not restored to a functional state after a hardware or software related failure. Action: Reasons for the warning can vary; see the event description for details.
8 | Warning | Recovery Error | Database encountered an error while attempting to recover. If the number of recovery errors exceeds Max Tries, the Recovery Failure event is triggered. Action: Reasons for the warning can vary; see the event description for details.
9 | n/a | Recovery Lock Error | Unused
10 | n/a | Recovery Projection Retrieval Error | Unused
11 | Warning | Refresh Error | The database encountered an error while attempting to refresh. Action: Reasons for the warning can vary; see the event description for details.
12 | n/a | Refresh Lock Error | Unused
13 | n/a | Tuple Mover Error | Deprecated
14 | Warning | Timer Service Task Error | Error occurred in an internal scheduled task. Action: None, internal use only.
15 | Warning | Stale Checkpoint | Deprecated
16 | Notice | License Size Compliance | Database size exceeds license size allowance. Action: See Monitoring database size for license compliance.
17 | Notice | License Term Compliance | Database is not in compliance with your Vertica license. Action: Check compliance status with GET_COMPLIANCE_STATUS.
18 | Error | CRC Mismatch | Cyclic Redundancy Check (CRC) returned an error or errors while fetching data. Action: Review the vertica.log file for details about the errors.
19 | Critical/Warning | Catalog Sync Exceeds Durability Threshold | Severity: Critical when exceeding the hard limit, Warning when exceeding the soft limit.
20 | Critical | Cluster Read-only | (Eon Mode) Loss of quorum or primary shard coverage forced the database into read-only mode. Action: Restart down nodes.
7.3 - Event data
To help you interpret and solve the issue that triggered an event, each event provides a variety of data, depending upon the event logging mechanism used.
The following table describes the event data and indicates where it is used.
vertica.log | ACTIVE_EVENTS (column names) | SNMP | Syslog | Description
---|---|---|---|---
N/A | NODE_NAME | N/A | N/A | The node where the event occurred.
Event Code | EVENT_CODE | Event Type | Event Code | A numeric ID that indicates the type of event. See Event Types in the previous table for a list of event type codes.
Event Id | EVENT_ID | Event OID | Event Id | A unique numeric ID that identifies the specific event.
Event Severity | EVENT_SEVERITY | Event Severity | Event Severity | The severity of the event from highest to lowest, based on the standard syslog severity types: 0 (Emergency), 1 (Alert), 2 (Critical), 3 (Error), 4 (Warning), 5 (Notice), 6 (Informational), 7 (Debug).
PostedTimestamp | EVENT_POSTED_TIMESTAMP | N/A | PostedTimestamp | The year, month, day, and time the event was reported. Time is provided as military time.
ExpirationTimestamp | EVENT_EXPIRATION | N/A | ExpirationTimestamp | The time at which this event expires. If the same event is posted again prior to its expiration time, this field gets updated to a new expiration time.
EventCodeDescription | EVENT_CODE_DESCRIPTION | Description | EventCodeDescription | A brief description of the event and details pertinent to the specific situation.
ProblemDescription | EVENT_PROBLEM_DESCRIPTION | Event Short Description | ProblemDescription | A generic description of the event.
N/A | REPORTING_NODE | Node Name | N/A | The name of the node within the cluster that reported the event.
DatabaseName | N/A | Database Name | DatabaseName | The name of the database that is impacted by the event.
N/A | N/A | Host Name | Hostname | The name of the host within the cluster that reported the event.
N/A | N/A | Event Status | N/A | The status of the event: open or clear.
7.4 - Configuring event reporting
Event reporting is automatically configured for vertica.log, and current events are automatically posted to the ACTIVE_EVENTS system table. You can also configure Vertica to post events to syslog and SNMP.
7.4.1 - Configuring reporting for the simple notification service (SNS)
You can monitor Data Collector (DC) components and send new rows to the Amazon Web Services (AWS) Simple Notification Service (SNS). SNS notifiers are configured with database-level SNS parameters.
Note
Several SNS configuration parameters have S3 equivalents and fall back to them. This lets you share configurations between S3 and SNS. For example, if you set the values for AWSAuth but not for SNSAuth, Vertica automatically uses the AWSAuth credentials. For brevity, the procedures on this page do not use this fallback behavior and instead use the SNS configuration parameters.
For details, see SNS parameters.
Minimally, to send DC data to SNS topics, you must configure and specify the following:
- An SNS notifier
- A Simple Notification Service (SNS) topic
- An AWS region (SNSRegion)
- An SNS endpoint (SNSEndpoint) (FIPS only)
- Credentials to authenticate to AWS
- Information about how to handle HTTPS
Creating an SNS notifier
To create an SNS notifier, use CREATE NOTIFIER, specifying sns as the ACTION.
SNS configuration
SNS topics and their subscribers should be configured with AWS. For details, see the AWS documentation.
AWS region and endpoint
In most use cases, you only need to set the AWS region with the SNSRegion parameter; if the SNSEndpoint is set to an empty string (default) and the SNSRegion is set, Vertica automatically finds and uses the appropriate endpoint:
=> ALTER DATABASE DEFAULT SET SNSRegion='us-east-1';
=> ALTER DATABASE DEFAULT SET SNSEndpoint='';
If you want to specify an endpoint, its region must match the region specified in SNSRegion:
=> ALTER DATABASE DEFAULT SET SNSEndpoint='sns.us-east-1.amazonaws.com';
If you use FIPS, you should manually set SNSEndpoint to a FIPS-compliant endpoint:
=> ALTER DATABASE DEFAULT SET SNSEndpoint='sns-fips.us-east-1.amazonaws.com';
AWS credentials
AWS credentials can be set with SNSAuth, which takes an access key and secret access key in the following format:
access_key:secret_access_key
To set SNSAuth:
=> ALTER DATABASE DEFAULT SET SNSAuth='VNDDNVOPIUQF917O5PDB:+mcnVONVIbjOnf1ekNis7nm3mE83u9fjdwmlq36Z';
Handling HTTPS
The SNSEnableHttps parameter determines whether the SNS notifier uses TLS to secure the connection between Vertica and AWS. HTTPS is enabled by default and can be manually enabled with:
=> ALTER DATABASE DEFAULT SET SNSEnableHttps=1;
If SNSEnableHttps is enabled, depending on your configuration, you might need to specify a custom set of CA bundles with SNSCAFile or SNSCAPath. Amazon root certificates are typically contained in the set of trusted CA certificates already, so you should not have to set these parameters in most environments:
=> ALTER DATABASE DEFAULT SET SNSCAFile='path/to/ca/bundle.pem';
=> ALTER DATABASE DEFAULT SET SNSCAPath='path/to/ca/bundles/';
HTTPS can be manually disabled with:
=> ALTER DATABASE DEFAULT SET SNSEnableHttps=0;
Examples
The following example creates an SNS topic, subscribes to it with an SQS queue, and then configures an SNS notifier for the DC component LoginFailures:
- Set SNSAuth with your AWS credentials:
  => ALTER DATABASE DEFAULT SET SNSAuth='VNDDNVOPIUQF917O5PDB:+mcnVONVIbjOnf1ekNis7nm3mE83u9fjdwmlq36Z';
- Set SNSRegion:
  => ALTER DATABASE DEFAULT SET SNSRegion='us-east-1';
- Enable HTTPS:
  => ALTER DATABASE DEFAULT SET SNSEnableHttps=1;
- Create an SNS notifier:
  => CREATE NOTIFIER v_sns_notifier ACTION 'sns' MAXPAYLOAD '256K' MAXMEMORYSIZE '10M' CHECK COMMITTED;
- Verify that the SNS notifier, SNS topic, and SQS queue are properly configured:
  - Manually send a message from the notifier to the SNS topic with NOTIFY:
    => SELECT NOTIFY('test message', 'v_sns_notifier', 'arn:aws:sns:us-east-1:123456789012:MyTopic');
  - Poll the SQS queue for your message.
- Attach the SNS notifier to the LoginFailures component with SET_DATA_COLLECTOR_NOTIFY_POLICY:
  => SELECT SET_DATA_COLLECTOR_NOTIFY_POLICY('LoginFailures', 'v_sns_notifier', 'Login failed!', true);
To disable an SNS notifier:
=> SELECT SET_DATA_COLLECTOR_NOTIFY_POLICY('LoginFailures', 'v_sns_notifier', 'Login failed!', false);
7.4.2 - Configuring reporting for syslog
Syslog is a network-logging utility that issues, stores, and processes log messages. It is a useful way to get heterogeneous data into a single data repository.
To log events to syslog, enable event reporting for each individual event you want logged. Messages are logged, by default, to /var/log/messages.
Configuring event reporting to syslog consists of:
- Enabling Vertica to trap events for syslog.
- Defining which events Vertica traps for syslog. Vertica strongly suggests that you trap the Stale Checkpoint event.
- Defining which syslog facility to use.
Enabling Vertica to trap events for syslog
To enable event trapping for syslog, issue the following SQL command:
=> ALTER DATABASE DEFAULT SET SyslogEnabled = 1;
To disable event trapping for syslog, issue the following SQL command:
=> ALTER DATABASE DEFAULT SET SyslogEnabled = 0;
Defining events to trap for syslog
To define events that generate a syslog entry, issue the following SQL command:
=> ALTER DATABASE DEFAULT SET SyslogEvents = 'events-list';
where events-list is a comma-delimited list of one or more of the following events:
- Low Disk Space
- Read Only File System
- Loss Of K Safety
- Current Fault Tolerance at Critical Level
- Too Many ROS Containers
- Node State Change
- Recovery Failure
- Recovery Error
- Recovery Lock Error
- Recovery Projection Retrieval Error
- Refresh Error
- Refresh Lock Error
- Tuple Mover Error
- Timer Service Task Error
- Stale Checkpoint
The following example generates a syslog entry for low disk space and recovery failure:
=> ALTER DATABASE DEFAULT SET SyslogEvents = 'Low Disk Space, Recovery Failure';
Defining the SyslogFacility to use for reporting
The syslog mechanism allows for several different general classifications of logging messages, called facilities. Typically, all authentication-related messages are logged with the auth (or authpriv) facility. These messages are intended to be secure and hidden from unauthorized eyes. Normal operational messages are logged with the daemon facility, which is the collector that receives and optionally stores messages.
The SyslogFacility directive allows all logging messages to be directed to a different facility than the default. When the directive is used, all logging is done using the specified facility, both authentication (secure) and otherwise.
To define which SyslogFacility Vertica uses, issue the following SQL command:
=> ALTER DATABASE DEFAULT SET SyslogFacility = 'Facility_Name';
where the facility-level argument Facility_Name is one of the following:
- auth
- authpriv (Linux only)
- cron
- uucp (UUCP subsystem)
- daemon
- ftp (Linux only)
- lpr (line printer subsystem)
- mail (mail system)
- news (network news subsystem)
- user (default system)
- local0 (local use 0)
- local1 (local use 1)
- local2 (local use 2)
- local3 (local use 3)
- local4 (local use 4)
- local5 (local use 5)
- local6 (local use 6)
- local7 (local use 7)
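For example, to direct Vertica's syslog messages to the local0 facility:
=> ALTER DATABASE DEFAULT SET SyslogFacility = 'local0';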
Trapping other event types
To trap events other than the ones listed above, create a syslog notifier and allow it to trap the desired events with SET_DATA_COLLECTOR_NOTIFY_POLICY.
Events monitored by this notifier type are not logged to MONITORING_EVENTS or vertica.log.
The following example creates a notifier that writes a message to syslog when the Data Collector (DC) component LoginFailures updates:
- Enable syslog notifiers for the current database:
  => ALTER DATABASE DEFAULT SET SyslogEnabled = 1;
- Create and enable a syslog notifier v_syslog_notifier:
  => CREATE NOTIFIER v_syslog_notifier ACTION 'syslog'
      ENABLE
      MAXMEMORYSIZE '10M'
      IDENTIFIED BY 'f8b0278a-3282-4e1a-9c86-e0f3f042a971'
      PARAMETERS 'eventSeverity = 5';
- Configure the syslog notifier v_syslog_notifier for updates to the LoginFailures DC component with SET_DATA_COLLECTOR_NOTIFY_POLICY:
  => SELECT SET_DATA_COLLECTOR_NOTIFY_POLICY('LoginFailures','v_syslog_notifier', 'Login failed!', true);
This notifier writes the following message to syslog (default location: /var/log/messages) when a user fails to authenticate as the user Bob:
Apr 25 16:04:58 vertica_host_01 vertica: Event Posted: Event Code:21 Event Id:0 Event Severity: Notice [5] PostedTimestamp: 2022-04-25 16:04:58.083063 ExpirationTimestamp: 2022-04-25 16:04:58.083063 EventCodeDescription: Notifier ProblemDescription: (Login failed!) { "_db":"VMart", "_schema":"v_internal", "_table":"dc_login_failures", "_uuid":"f8b0278a-3282-4e1a-9c86-e0f3f042a971", "authentication_method":"Reject", "client_authentication_name":"default: Reject", "client_hostname":"::1", "client_label":"", "client_os_user_name":"dbadmin", "client_pid":523418, "client_version":"", "database_name":"dbadmin", "effective_protocol":"3.8", "node_name":"v_vmart_node0001", "reason":"REJECT", "requested_protocol":"3.8", "ssl_client_fingerprint":"", "ssl_client_subject":"", "time":"2022-04-25 16:04:58.082568-05", "user_name":"Bob" }#012 DatabaseName: VMart Hostname: vertica_host_01
See also
Event reporting examples
7.4.3 - Configuring reporting for SNMP
Configuring event reporting for SNMP consists of:
- Configuring Vertica to enable event trapping for SNMP as described below.
- Importing the Vertica Management Information Base (MIB) file into the SNMP monitoring device. The Vertica MIB file allows the SNMP trap receiver to understand the traps it receives from Vertica. This, in turn, allows you to configure the actions it takes when it receives traps. Vertica supports the SNMP V1 trap protocol; the MIB file is located at /opt/vertica/sbin/VERTICA-MIB. See the documentation for your SNMP monitoring device for more information about importing MIB files.
- Configuring the SNMP trap receiver to handle traps from Vertica.
SNMP trap receiver configuration differs greatly from vendor to vendor. As such, the directions presented here for configuring the SNMP trap receiver to handle traps from Vertica are generic.
Vertica traps are single, generic traps that contain several fields of identifying information. These fields equate to the event data described in Monitoring events. However, the format used for the field names differs slightly. Under SNMP, the field names contain no spaces. Also, field names are pre-pended with “vert”. For example, Event Severity becomes vertEventSeverity.
When configuring your trap receiver, be sure to use the same hostname, port, and community string you used to configure event trapping in Vertica.
Examples of network management providers:
- IBM Tivoli
- AdventNet
- Net-SNMP (Open Source)
- Nagios (Open Source)
- Open NMS (Open Source)
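For instance, with the open-source Net-SNMP suite, a hedged sketch of a minimal trap receiver might look like the following (standard snmptrapd options; adjust paths, the port, and the community string for your environment, and note that snmptrapd.conf must authorize the community string):
$ cp /opt/vertica/sbin/VERTICA-MIB ~/.snmp/mibs/        # make the Vertica MIB visible to Net-SNMP
$ echo 'authCommunity log public' >> ~/.snmp/snmptrapd.conf
$ snmptrapd -f -Lo -m +VERTICA-MIB udp:162              # run in the foreground, log traps to stdout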
7.4.4 - Configuring event trapping for SNMP
The following events are trapped by default when you configure Vertica to trap events for SNMP:
- Low Disk Space
- Read Only File System
- Loss of K Safety
- Current Fault Tolerance at Critical Level
- Too Many ROS Containers
- Node State Change
- Recovery Failure
- Stale Checkpoint
- CRC Mismatch
To configure Vertica to trap events for SNMP
1. Enable Vertica to trap events for SNMP.
2. Define where Vertica sends the traps.
3. Optionally redefine which SNMP events Vertica traps.
Note
After you complete steps 1 and 2 above, Vertica automatically traps the default SNMP events. Only perform step 3 if you want to redefine which SNMP events are trapped. Vertica recommends that you trap the Stale Checkpoint event even if you decide to reduce the number of events Vertica traps for SNMP. The specific settings you define have no effect on traps sent to the log. All events are trapped to the log.
To enable event trapping for SNMP
Use the following SQL command:
=> ALTER DATABASE DEFAULT SET SnmpTrapsEnabled = 1;
To define where Vertica sends traps
Use the following SQL command, where host_name and port identify the computer where SNMP resides, and CommunityString acts like a password to control Vertica's access to the server:
=> ALTER DATABASE DEFAULT SET SnmpTrapDestinationsList = 'host_name port CommunityString';
For example:
=> ALTER DATABASE DEFAULT SET SnmpTrapDestinationsList = 'localhost 162 public';
You can also specify multiple destinations by specifying a list of destinations, separated by commas:
=> ALTER DATABASE DEFAULT SET SnmpTrapDestinationsList = 'host_name1 port1 CommunityString1, hostname2 port2 CommunityString2';
Note
Setting multiple destinations sends any SNMP trap notification to all destinations listed.
To define which events Vertica traps
Use the following SQL command, where Event_Name is one of the events in the list below the command:
=> ALTER DATABASE DEFAULT SET SnmpTrapEvents = 'Event_Name1, Event_Name2';
- Low Disk Space
- Read Only File System
- Loss Of K Safety
- Current Fault Tolerance at Critical Level
- Too Many ROS Containers
- Node State Change
- Recovery Failure
- Recovery Error
- Recovery Lock Error
- Recovery Projection Retrieval Error
- Refresh Error
- Tuple Mover Error
- Stale Checkpoint
- CRC Mismatch
Note
The above values are case sensitive.
The following example specifies two event names:
=> ALTER DATABASE DEFAULT SET SnmpTrapEvents = 'Low Disk Space, Recovery Failure';
7.4.5 - Verifying SNMP configuration
To create a set of test events that checks SNMP configuration:
- Set up SNMP trap handlers to catch Vertica events.
- Test your setup with the following command:
  => SELECT SNMP_TRAP_TEST();
      SNMP_TRAP_TEST
  --------------------------
   Completed SNMP Trap Test
  (1 row)
7.5 - Event reporting examples
Vertica.log
The following example illustrates a Too Many ROS Containers event posted and cleared within vertica.log:
08/14/15 15:07:59 thr:nameless:0x45a08940 [INFO] Event Posted: Event Code:4 Event Id:0 Event Severity: Warning [4] PostedTimestamp:
2015-08-14 15:07:59.253729 ExpirationTimestamp: 2015-08-14 15:08:29.253729
EventCodeDescription: Too Many ROS Containers ProblemDescription:
Too many ROS containers exist on this node. DatabaseName: TESTDB
Hostname: fc6-1.example.com
08/14/15 15:08:54 thr:Ageout Events:0x2aaab0015e70 [INFO] Event Cleared:
Event Code:4 Event Id:0 Event Severity: Warning [4] PostedTimestamp:
2015-08-14 15:07:59.253729 ExpirationTimestamp: 2015-08-14 15:08:53.012669
EventCodeDescription: Too Many ROS Containers ProblemDescription:
Too many ROS containers exist on this node. DatabaseName: TESTDB
Hostname: fc6-1.example.com
SNMP
The following example illustrates a Too Many ROS Containers event posted to SNMP:
Version: 1, type: TRAPREQUEST
Enterprise OID: .1.3.6.1.4.1.31207.2.0.1
Trap agent: 72.0.0.0
Generic trap: ENTERPRISESPECIFIC (6)
Specific trap: 0
.1.3.6.1.4.1.31207.1.1 ---> 4
.1.3.6.1.4.1.31207.1.2 ---> 0
.1.3.6.1.4.1.31207.1.3 ---> 2008-08-14 11:30:26.121292
.1.3.6.1.4.1.31207.1.4 ---> 4
.1.3.6.1.4.1.31207.1.5 ---> 1
.1.3.6.1.4.1.31207.1.6 ---> site01
.1.3.6.1.4.1.31207.1.7 ---> suse10-1
.1.3.6.1.4.1.31207.1.8 ---> Too many ROS containers exist on this node.
.1.3.6.1.4.1.31207.1.9 ---> QATESTDB
.1.3.6.1.4.1.31207.1.10 ---> Too Many ROS Containers
Syslog
The following example illustrates a Too Many ROS Containers event posted and cleared within syslog:
Aug 14 15:07:59 fc6-1 vertica: Event Posted: Event Code:4 Event Id:0 Event Severity: Warning [4] PostedTimestamp: 2015-08-14 15:07:59.253729 ExpirationTimestamp:
2015-08-14 15:08:29.253729 EventCodeDescription: Too Many ROS Containers ProblemDescription:
Too many ROS containers exist on this node. DatabaseName: TESTDB Hostname: fc6-1.example.com
Aug 14 15:08:54 fc6-1 vertica: Event Cleared: Event Code:4 Event Id:0 Event Severity:
Warning [4] PostedTimestamp: 2015-08-14 15:07:59.253729 ExpirationTimestamp:
2015-08-14 15:08:53.012669 EventCodeDescription: Too Many ROS Containers ProblemDescription:
Too many ROS containers exist on this node. DatabaseName: TESTDB Hostname: fc6-1.example.com
8 - Using system tables
Vertica system tables provide information about system resources, background processes, workload, and performance—for example, load streams, query profiles, and tuple mover operations. Vertica collects and refreshes this information automatically.
You can query system tables using expressions, predicates, aggregates, analytics, subqueries, and joins. You can also save system table query results into a user table for future analysis. For example, the following query creates a table, mynode, selecting three node-related columns from the NODES system table:
=> CREATE TABLE mynode AS SELECT node_name, node_state, node_address FROM nodes;
CREATE TABLE
=> SELECT * FROM mynode;
node_name | node_state | node_address
------------------+------------+----------------
v_vmart_node0001 | UP | 192.168.223.11
(1 row)
Note
You cannot query system tables if the database cluster is in a recovering state. The database refuses connection requests and cannot be monitored. Vertica also does not support DDL and DML operations on system tables.
Where system tables reside
System tables are grouped into two schemas:
- V_CATALOG schema: Provides information about persistent objects in the catalog
- V_MONITOR schema: Provides information about transient system state
These schemas reside in the default search path. Unless you change the search path to exclude V_MONITOR or V_CATALOG or both, queries can specify a system table name that omits its schema.
You can query the SYSTEM_TABLES table for all Vertica system tables and their schemas. For example:
SELECT * FROM system_tables ORDER BY table_schema, table_name;
System table categories
Vertica system tables can be grouped into the following areas:
- System information
- System resources
- Background processes
- Workload and performance
Vertica reserves some memory to help monitor busy systems. Using simple system table queries makes it easier to troubleshoot issues. See also SYSQUERY.
Note
You can use external monitoring tools or scripts to query the system tables and act upon the information, as necessary. For example, when a host failure causes the K-safety level to fall below the desired level, the tool or script can notify the database administrator and/or appropriate IT personnel of the change, typically in the form of an e-mail.
Privileges
You can GRANT and REVOKE privileges on system tables, with the following restrictions:
- You cannot GRANT privileges on system tables to the SYSMONITOR or PSEUDOSUPERUSER roles.
- You cannot GRANT on system schemas.
Case-sensitive system table data
Some system table data might be stored in mixed case. For example, Vertica stores mixed-case identifier names the way you specify them in the CREATE statement, even though case is ignored when you reference them in queries. When these object names appear as data in the system tables, you'll encounter errors if you query them with an equality (=) operator, because the case must exactly match the stored identifier. In particular, data in the TABLE_SCHEMA and TABLE_NAME columns in system table TABLES is case sensitive.
If you don't know how the identifiers are stored, use the case-insensitive operator ILIKE. For example, given the following schema:
=> CREATE SCHEMA SS;
=> CREATE TABLE SS.TT (c1 int);
=> CREATE PROJECTION SS.TTP1 AS SELECT * FROM ss.tt UNSEGMENTED ALL NODES;
=> INSERT INTO ss.tt VALUES (1);
A query that uses the = operator returns 0 rows:
=> SELECT table_schema, table_name FROM v_catalog.tables WHERE table_schema ='ss';
table_schema | table_name
--------------+------------
(0 rows)
A query that uses the case-insensitive ILIKE operator returns the expected results:
=> SELECT table_schema, table_name FROM v_catalog.tables WHERE table_schema ILIKE 'ss';
table_schema | table_name
--------------+------------
SS | TT
(1 row)
Examples
The following examples illustrate simple ways to use system tables in queries.
=> SELECT current_epoch, designed_fault_tolerance, current_fault_tolerance FROM SYSTEM;
current_epoch | designed_fault_tolerance | current_fault_tolerance
---------------+--------------------------+-------------------------
492 | 1 | 1
(1 row)
=> SELECT node_name, total_user_session_count, executed_query_count FROM query_metrics;
node_name | total_user_session_count | executed_query_count
------------------+--------------------------+----------------------
v_vmart_node0001 | 115 | 353
v_vmart_node0002 | 114 | 35
v_vmart_node0003 | 116 | 34
(3 rows)
=> SELECT DISTINCT(schema_name), schema_owner FROM schemata;
schema_name | schema_owner
--------------+--------------
v_catalog | dbadmin
v_txtindex | dbadmin
v_func | dbadmin
TOPSCHEMA | dbadmin
online_sales | dbadmin
v_internal | dbadmin
v_monitor | dbadmin
structs | dbadmin
public | dbadmin
store | dbadmin
(10 rows)
9 - Data Collector utility
The Data Collector collects and retains history of important system activities and records essential performance and resource utilization counters. The Data Collector propagates information to system tables.
You can use the Data Collector in the following ways:
- Query the past state of system tables and extract aggregate information
- See what actions users have taken
- Locate performance bottlenecks
- Identify potential improvements to Vertica configuration
The Data Collector does not collect data for nodes that are down, so no historical data is available for affected nodes.
The Data Collector works with Workload Analyzer, a tool that intelligently monitors the performance of SQL queries and workloads and recommends tuning actions based on observations of the actual workload history.
Configuring and accessing Data Collector information
The Data Collector retains the data it gathers according to configurable retention policies. The Data Collector is on by default; you can disable it by setting the EnableDataCollector configuration parameter to 0. You can set the parameter at the database level using ALTER DATABASE or the node level using ALTER NODE. You cannot set it at the session or user level.
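For example, to turn the Data Collector off, and back on again, for the entire database:
=> ALTER DATABASE DEFAULT SET EnableDataCollector = 0;
=> ALTER DATABASE DEFAULT SET EnableDataCollector = 1;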
You can access metadata on collected data of all components using the DATA_COLLECTOR system table. This table includes information for each component about current collection policies and how much data is retained in memory and on disk.
Collected data is logged on disk in the DataCollector directory under the Vertica /catalog path. You can query logged data from component-specific Data Collector tables. You can also manage logged data with Vertica meta-functions; see Managing data collection logs.
9.1 - Configuring data retention policies
The Data Collector maintains retention policies for each Vertica component that it monitors—for example, TupleMoverEvents or DepotEvictions. You can identify monitored components by querying the DATA_COLLECTOR system table. For example, the following query returns partition activity components:
=> SELECT DISTINCT component FROM DATA_COLLECTOR
WHERE component ILIKE '%partition%';
component
----------------------
HiveCustomPartitions
CopyPartitions
MovePartitions
SwapPartitions
(4 rows)
Each component has its own retention policy, which consists of several properties:
- MEMORY_BUFFER_SIZE_KB: the maximum amount of collected data that the Data Collector buffers in memory before moving it to disk.
- DISK_SIZE_KB: the maximum disk space allocated for the component's Data Collector table.
- INTERVAL_TIME: how long data of a given component is retained in the component's Data Collector table (INTERVAL data type).
Vertica sets default values on all properties, which you can modify with SET_DATA_COLLECTOR_POLICY and SET_DATA_COLLECTOR_TIME_POLICY.
You can view retention policy settings with GET_DATA_COLLECTOR_POLICY. For example, the following statement returns the retention policy for the TupleMoverEvents component:
=> SELECT GET_DATA_COLLECTOR_POLICY('TupleMoverEvents');
GET_DATA_COLLECTOR_POLICY
-----------------------------------------------------------------------------
1000KB kept in memory, 15000KB kept on disk. Time based retention disabled.
(1 row)
Setting retention memory and disk storage
The MEMORY_BUFFER_SIZE_KB and DISK_SIZE_KB properties combine to determine how much collected data is available at any given time. If MEMORY_BUFFER_SIZE_KB is set to 0, the Data Collector does not retain any data for the component either in memory or on disk. If DISK_SIZE_KB is set to 0, then the Data Collector retains only as much component data as it can buffer, as set by MEMORY_BUFFER_SIZE_KB.
For example, the following statement changes the memory and disk settings for the ResourceAcquisitions component from the default values of 1,000 KB of memory and 10,000 KB of disk space to 1,500 KB and 25,000 KB, respectively:
=> SELECT SET_DATA_COLLECTOR_POLICY('ResourceAcquisitions', '1500', '25000');
SET_DATA_COLLECTOR_POLICY
---------------------------
SET
(1 row)
Consider setting MEMORY_BUFFER_SIZE_KB to a high value in the following cases:
- Unusually high levels of data collection. If MEMORY_BUFFER_SIZE_KB is set too low, the Data Collector might be unable to flush buffered data to disk quickly enough to keep up with the activity level, which can lead to loss of in-memory data.
- Very large data collector records—for example, records with very long query strings. The Data Collector uses double-buffering, so it cannot retain in-memory records that are more than half the size of the memory buffer.
Setting time-based retention
By default, all data collected for a given component remains on disk and is accessible in the component's Data Collector table, up to the disk storage limit of that component's retention policy. You can use SET_DATA_COLLECTOR_TIME_POLICY to limit how long data is retained in a component's Data Collector table. The following example sets the INTERVAL_TIME property for the TupleMoverEvents component:
=> SELECT SET_DATA_COLLECTOR_TIME_POLICY('TupleMoverEvents', '30 minutes'::INTERVAL);
SET_DATA_COLLECTOR_TIME_POLICY
--------------------------------
SET
(1 row)
After this call, the corresponding Data Collector table, DC_TUPLE_MOVER_EVENTS, only retains records of activity that occurred in the last 30 minutes. Older data is automatically dropped from this table. For example, after the previous call to SET_DATA_COLLECTOR_TIME_POLICY, the table contains only 30 minutes' worth of data:
=> SELECT CURRENT_TIMESTAMP(0) - '30 minutes'::INTERVAL AS '30 minutes ago';
30 minutes ago
---------------------
2020-08-13 07:58:21
(1 row)
=> SELECT time, node_name, session_id, user_name, transaction_id, operation
FROM DC_TUPLE_MOVER_EVENTS WHERE node_name='v_vmart_node0001'
ORDER BY transaction_id;
time | node_name | session_id | user_name | transaction_id | operation
-------------------------------+------------------+---------------------------------+-----------+-------------------+-----------
2020-08-13 08:16:54.360597-04 | v_vmart_node0001 | v_vmart_node0001-190508:0x375db | dbadmin | 45035996273807826 | Mergeout
2020-08-13 08:16:54.397346-04 | v_vmart_node0001 | v_vmart_node0001-190508:0x375db | dbadmin | 45035996273807826 | Mergeout
2020-08-13 08:16:54.424002-04 | v_vmart_node0001 | v_vmart_node0001-190508:0x375db | dbadmin | 45035996273807826 | Mergeout
2020-08-13 08:16:54.425989-04 | v_vmart_node0001 | v_vmart_node0001-190508:0x375db | dbadmin | 45035996273807829 | Mergeout
2020-08-13 08:16:54.456829-04 | v_vmart_node0001 | v_vmart_node0001-190508:0x375db | dbadmin | 45035996273807829 | Mergeout
2020-08-13 08:16:54.485097-04 | v_vmart_node0001 | v_vmart_node0001-190508:0x375db | dbadmin | 45035996273807829 | Mergeout
2020-08-13 08:19:45.8045-04 | v_vmart_node0001 | v_vmart_node0001-190508:0x37b08 | dbadmin | 45035996273807855 | Mergeout
2020-08-13 08:19:45.742-04 | v_vmart_node0001 | v_vmart_node0001-190508:0x37b08 | dbadmin | 45035996273807855 | Mergeout
2020-08-13 08:19:45.684764-04 | v_vmart_node0001 | v_vmart_node0001-190508:0x37b08 | dbadmin | 45035996273807855 | Mergeout
2020-08-13 08:19:45.799796-04 | v_vmart_node0001 | v_vmart_node0001-190508:0x375db | dbadmin | 45035996273807865 | Mergeout
2020-08-13 08:19:45.768856-04 | v_vmart_node0001 | v_vmart_node0001-190508:0x375db | dbadmin | 45035996273807865 | Mergeout
2020-08-13 08:19:45.715424-04 | v_vmart_node0001 | v_vmart_node0001-190508:0x375db | dbadmin | 45035996273807865 | Mergeout
2020-08-13 08:25:20.465604-04 | v_vmart_node0001 | v_vmart_node0001-190508:0x375db | dbadmin | 45035996273807890 | Mergeout
2020-08-13 08:25:20.497266-04 | v_vmart_node0001 | v_vmart_node0001-190508:0x375db | dbadmin | 45035996273807890 | Mergeout
2020-08-13 08:25:20.518839-04 | v_vmart_node0001 | v_vmart_node0001-190508:0x375db | dbadmin | 45035996273807890 | Mergeout
2020-08-13 08:25:20.52099-04 | v_vmart_node0001 | v_vmart_node0001-190508:0x375db | dbadmin | 45035996273807893 | Mergeout
2020-08-13 08:25:20.549075-04 | v_vmart_node0001 | v_vmart_node0001-190508:0x375db | dbadmin | 45035996273807893 | Mergeout
2020-08-13 08:25:20.569072-04 | v_vmart_node0001 | v_vmart_node0001-190508:0x375db | dbadmin | 45035996273807893 | Mergeout
(18 rows)
After 25 minutes elapse, 12 of these records age out of the 30-minute interval and are dropped:
=> SELECT CURRENT_TIMESTAMP(0) - '30 minutes'::INTERVAL AS '30 minutes ago';
30 minutes ago
---------------------
2020-08-13 08:23:33
(1 row)
=> SELECT time, node_name, session_id, user_name, transaction_id, operation
FROM DC_TUPLE_MOVER_EVENTS WHERE node_name='v_vmart_node0001'
ORDER BY transaction_id;
time | node_name | session_id | user_name | transaction_id | operation
-------------------------------+------------------+---------------------------------+-----------+-------------------+-----------
2020-08-13 08:25:20.465604-04 | v_vmart_node0001 | v_vmart_node0001-190508:0x375db | dbadmin | 45035996273807890 | Mergeout
2020-08-13 08:25:20.497266-04 | v_vmart_node0001 | v_vmart_node0001-190508:0x375db | dbadmin | 45035996273807890 | Mergeout
2020-08-13 08:25:20.518839-04 | v_vmart_node0001 | v_vmart_node0001-190508:0x375db | dbadmin | 45035996273807890 | Mergeout
2020-08-13 08:25:20.52099-04 | v_vmart_node0001 | v_vmart_node0001-190508:0x375db | dbadmin | 45035996273807893 | Mergeout
2020-08-13 08:25:20.549075-04 | v_vmart_node0001 | v_vmart_node0001-190508:0x375db | dbadmin | 45035996273807893 | Mergeout
2020-08-13 08:25:20.569072-04 | v_vmart_node0001 | v_vmart_node0001-190508:0x375db | dbadmin | 45035996273807893 | Mergeout
(6 rows)
Note
Setting a component policy's INTERVAL_TIME property has no effect on how much data storage the Data Collector retains on disk for that component. Maximum disk storage capacity is determined by the DISK_SIZE_KB property. Setting the INTERVAL_TIME property only affects how long data is retained by the component's Data Collector table.
You can use SET_DATA_COLLECTOR_TIME_POLICY to update INTERVAL_TIME for all components by omitting the component argument:
=> SELECT SET_DATA_COLLECTOR_TIME_POLICY('1 day'::INTERVAL);
SET_DATA_COLLECTOR_TIME_POLICY
--------------------------------
SET
(1 row)
=> SELECT DISTINCT component, INTERVAL_SET, INTERVAL_TIME
FROM DATA_COLLECTOR WHERE component ILIKE '%partition%';
component | INTERVAL_SET | INTERVAL_TIME
----------------------+--------------+---------------
HiveCustomPartitions | t | 1
MovePartitions | t | 1
CopyPartitions | t | 1
SwapPartitions | t | 1
(4 rows)
To clear the INTERVAL_TIME policy property, call SET_DATA_COLLECTOR_TIME_POLICY with a negative integer argument:
=> SELECT SET_DATA_COLLECTOR_TIME_POLICY('-1');
SET_DATA_COLLECTOR_TIME_POLICY
--------------------------------
SET
(1 row)
=> SELECT DISTINCT component, INTERVAL_SET, INTERVAL_TIME
FROM DATA_COLLECTOR WHERE component ILIKE '%partition%';
component | INTERVAL_SET | INTERVAL_TIME
----------------------+--------------+---------------
MovePartitions | f | 0
SwapPartitions | f | 0
HiveCustomPartitions | f | 0
CopyPartitions | f | 0
(4 rows)
Setting INTERVAL_TIME on a retention policy also sets its INTERVAL_SET property to true.
9.2 - Querying data collector tables
Caution
Data Collector tables (prefixed by DC_) are in the V_INTERNAL schema. If you use Data Collector tables in scripts or monitoring tools, be aware that any Vertica upgrade can remove or change them without notice.
You can obtain component-specific data from Data Collector tables. The Data Collector compiles the component data from its log files in a table format that you can query with standard SQL queries. You can identify Data Collector table names for specific components from the DATA_COLLECTOR system table:
=> SELECT DISTINCT component, table_name FROM DATA_COLLECTOR
WHERE component ILIKE 'lock%';
component | table_name
--------------+------------------
LockRequests | dc_lock_requests
LockReleases | dc_lock_releases
LockAttempts | dc_lock_attempts
(3 rows)
You can then query those Data Collector tables directly:
=> SELECT * FROM DC_LOCK_ATTEMPTS
WHERE description != 'Granted immediately';
-[ RECORD 1 ]------+------------------------------
time | 2020-08-17 00:14:07.187607-04
node_name | v_vmart_node0001
session_id | v_vmart_node0001-319647:0x1d
user_id | 45035996273704962
user_name | dbadmin
transaction_id | 45035996273819050
object | 0
object_name | Global Catalog
mode | X
promoted_mode | X
scope | TRANSACTION
start_time | 2020-08-17 00:14:07.184663-04
timeout_in_seconds | 300
result | granted
description | Granted after waiting
9.3 - Managing data collection logs
On startup, Vertica creates a DataCollector
directory under the database catalog directory of each node. This directory contains one or more logs for individual components. For example:
$ pwd
/home/dbadmin/VMart/v_vmart_node0001_catalog/DataCollector
$ ls -1 -g Lock*
-rw------- 1 verticadba 2559879 Aug 17 00:14 LockAttempts_650572441057355.log
-rw------- 1 verticadba 614579 Aug 17 05:28 LockAttempts_650952885486175.log
-rw------- 1 verticadba 2559895 Aug 14 18:31 LockReleases_650306482037650.log
-rw------- 1 verticadba 1411127 Aug 17 05:28 LockReleases_650759468041873.log
The DataCollector directory also contains a pair of SQL template files for each component:
- CREATE_component_TABLE.sql provides DDL for creating a table that you can use to load Data Collector logs for a given component. For example:

  $ cat CREATE_LockAttempts_TABLE.sql
  \set dcschema 'echo ${DCSCHEMA:-dc}'
  CREATE TABLE :dcschema.dc_lock_attempts(
    "time" TIMESTAMP WITH TIME ZONE,
    "node_name" VARCHAR(128),
    "session_id" VARCHAR(128),
    "user_id" INTEGER,
    "user_name" VARCHAR(128),
    "transaction_id" INTEGER,
    "object" INTEGER,
    "object_name" VARCHAR(128),
    "mode" VARCHAR(128),
    "promoted_mode" VARCHAR(128),
    "scope" VARCHAR(128),
    "start_time" TIMESTAMP WITH TIME ZONE,
    "timeout_in_seconds" INTEGER,
    "result" VARCHAR(128),
    "description" VARCHAR(64000)
  );
- COPY_component_TABLE.sql contains SQL for loading (with COPY) the data log files into the table that the CREATE script creates (see the usage sketch after this list). For example:

  $ cat COPY_LockAttempts_TABLE.sql
  \set dcpath 'echo ${DCPATH:-$PWD}'
  \set dcschema 'echo ${DCSCHEMA:-dc}'
  \set logfiles '''':dcpath'/LockAttempts_*.log'''
  COPY :dcschema.dc_lock_attempts(
    LockAttempts_start_filler FILLER VARCHAR(64) DELIMITER E'\n',
    "time_nfiller" FILLER VARCHAR(32) DELIMITER ':',
    "time" FORMAT '_internal' DELIMITER E'\n',
    "node_name_nfiller" FILLER VARCHAR(32) DELIMITER ':',
    "node_name" ESCAPE E'\001' DELIMITER E'\n',
    "session_id_nfiller" FILLER VARCHAR(32) DELIMITER ':',
    "session_id" ESCAPE E'\001' DELIMITER E'\n',
    "user_id_nfiller" FILLER VARCHAR(32) DELIMITER ':',
    "user_id" FORMAT 'd' DELIMITER E'\n',
    "user_name_nfiller" FILLER VARCHAR(32) DELIMITER ':',
    "user_name" ESCAPE E'\001' DELIMITER E'\n',
    "transaction_id_nfiller" FILLER VARCHAR(32) DELIMITER ':',
    "transaction_id" FORMAT 'd' DELIMITER E'\n',
    "object_nfiller" FILLER VARCHAR(32) DELIMITER ':',
    "object" FORMAT 'd' DELIMITER E'\n',
    "object_name_nfiller" FILLER VARCHAR(32) DELIMITER ':',
    "object_name" ESCAPE E'\001' DELIMITER E'\n',
    "mode_nfiller" FILLER VARCHAR(32) DELIMITER ':',
    "mode" ESCAPE E'\001' DELIMITER E'\n',
    "promoted_mode_nfiller" FILLER VARCHAR(32) DELIMITER ':',
    "promoted_mode" ESCAPE E'\001' DELIMITER E'\n',
    "scope_nfiller" FILLER VARCHAR(32) DELIMITER ':',
    "scope" ESCAPE E'\001' DELIMITER E'\n',
    "start_time_nfiller" FILLER VARCHAR(32) DELIMITER ':',
    "start_time" FORMAT '_internal' DELIMITER E'\n',
    "timeout_in_seconds_nfiller" FILLER VARCHAR(32) DELIMITER ':',
    "timeout_in_seconds" FORMAT 'd' DELIMITER E'\n',
    "result_nfiller" FILLER VARCHAR(32) DELIMITER ':',
    "result" ESCAPE E'\001' DELIMITER E'\n',
    "description_nfiller" FILLER VARCHAR(32) DELIMITER ':',
    "description" ESCAPE E'\001'
  ) FROM :logfiles RECORD TERMINATOR E'\n.\n' DELIMITER E'\n';
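To load a component's logs with these templates, you can run them with vsql. A minimal sketch, assuming the template-default dc schema and that you run it from the node's DataCollector directory (DCSCHEMA and DCPATH are the environment variables the templates read):
$ cd /home/dbadmin/VMart/v_vmart_node0001_catalog/DataCollector
$ vsql -c "CREATE SCHEMA IF NOT EXISTS dc;"
$ vsql -f CREATE_LockAttempts_TABLE.sql
$ vsql -f COPY_LockAttempts_TABLE.sql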
Log management functions
You can use two Data Collector functions to manage logs. Both functions can operate on a single component or all components (see the example calls after this list):
- FLUSH_DATA_COLLECTOR waits until in-memory logs are moved to disk and then flushes the Data Collector, synchronizing the log with disk storage.
- CLEAR_DATA_COLLECTOR clears all memory and disk records from Data Collector tables and logs and then resets collection statistics in DATA_COLLECTOR.
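For example, the following calls flush and then clear the logs for a single component. A sketch assuming the ResourceAcquisitions component; call either function with no arguments to operate on all components:
=> SELECT FLUSH_DATA_COLLECTOR('ResourceAcquisitions');
 FLUSH_DATA_COLLECTOR
----------------------
 FLUSH
(1 row)
=> SELECT CLEAR_DATA_COLLECTOR('ResourceAcquisitions');
 CLEAR_DATA_COLLECTOR
----------------------
 CLEAR
(1 row)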
10 - Monitoring partition reorganization
When you use ALTER TABLE...REORGANIZE, the operation reorganizes the data in the background.
You can monitor details of the reorganization process by polling the following system tables (example queries follow this list):
- V_MONITOR.PARTITION_STATUS displays the fraction of each table that is partitioned correctly.
- V_MONITOR.PARTITION_REORGANIZE_ERRORS logs errors issued by the reorganize process.
- V_MONITOR.PARTITIONS displays NULL in the partition_key column for any ROS that was not reorganized.
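A minimal polling sketch; the table name store_orders is hypothetical, and you would repeat these queries while the reorganization runs:
=> SELECT * FROM V_MONITOR.PARTITION_STATUS WHERE table_name = 'store_orders';
=> SELECT * FROM V_MONITOR.PARTITION_REORGANIZE_ERRORS;
=> SELECT * FROM V_MONITOR.PARTITIONS WHERE partition_key IS NULL;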
11 - Monitoring resource pools
You can use the following to find information about resource pools:
- RESOURCE_POOL_STATUS returns current data from resource pools—for example, current memory usage, resources requested and acquired by various requests, and state of the queues.
- RESOURCE_ACQUISITIONS displays all resources granted to the queries that are currently running.
- SHOW SESSION shows, along with other session-level parameters, the current session's resource pool.
You can also use the Management Console to obtain run-time data on resource pool usage.
Note
The Linux top command returns data on overall CPU usage and I/O wait time across the system. Because of file system caching, the resident memory size returned by top is not the best indicator of actual memory use or available reserves.
Viewing resource pool status
The following example queries RESOURCE_POOL_STATUS for memory size data:
=> SELECT pool_name poolName,
node_name nodeName,
max_query_memory_size_kb maxQueryMemSizeKb,
max_memory_size_kb maxMemSizeKb,
memory_size_actual_kb memSizeActualKb
FROM resource_pool_status WHERE pool_name='ceo_pool';
poolName | nodeName | maxQueryMemSizeKb | maxMemSizeKb | memSizeActualKb
----------+------------------+-------------------+--------------+-----------------
ceo_pool | v_vmart_node0001 | 12179388 | 13532654 | 1843200
ceo_pool | v_vmart_node0002 | 12191191 | 13545768 | 1843200
ceo_pool | v_vmart_node0003 | 12191170 | 13545745 | 1843200
(3 rows)
Viewing query resource acquisitions
The following example displays all resources granted to the queries that are currently running. This information is stored in the RESOURCE_ACQUISITIONS system table. In the output below, the query in RECORD 8 used 708504 KB of memory from the GENERAL pool.
=> SELECT pool_name, thread_count, open_file_handle_count, memory_inuse_kb,
queue_entry_timestamp, acquisition_timestamp
FROM V_MONITOR.RESOURCE_ACQUISITIONS WHERE node_name ILIKE '%node0001';
-[ RECORD 1 ]----------+------------------------------
pool_name | sysquery
thread_count | 4
open_file_handle_count | 0
memory_inuse_kb | 4103
queue_entry_timestamp | 2013-12-05 07:07:08.815362-05
acquisition_timestamp | 2013-12-05 07:07:08.815367-05
-[ RECORD 2 ]----------+------------------------------
...
-[ RECORD 8 ]----------+------------------------------
pool_name | general
thread_count | 12
open_file_handle_count | 18
memory_inuse_kb | 708504
queue_entry_timestamp | 2013-12-04 12:55:38.566614-05
acquisition_timestamp | 2013-12-04 12:55:38.566623-05
-[ RECORD 9 ]----------+------------------------------
...
You can determine how long a query waits in the queue before it can run. To do so, obtain the difference between acquisition_timestamp and queue_entry_timestamp, as the following example shows:
=> SELECT pool_name, queue_entry_timestamp, acquisition_timestamp,
(acquisition_timestamp-queue_entry_timestamp) AS 'queue wait'
FROM V_MONITOR.RESOURCE_ACQUISITIONS WHERE node_name ILIKE '%node0001';
-[ RECORD 1 ]---------+------------------------------
pool_name | sysquery
queue_entry_timestamp | 2013-12-05 07:07:08.815362-05
acquisition_timestamp | 2013-12-05 07:07:08.815367-05
queue wait | 00:00:00.000005
-[ RECORD 2 ]---------+------------------------------
pool_name | sysquery
queue_entry_timestamp | 2013-12-05 07:07:14.714412-05
acquisition_timestamp | 2013-12-05 07:07:14.714417-05
queue wait | 00:00:00.000005
-[ RECORD 3 ]---------+------------------------------
pool_name | sysquery
queue_entry_timestamp | 2013-12-05 07:09:57.238521-05
acquisition_timestamp | 2013-12-05 07:09:57.281708-05
queue wait | 00:00:00.043187
-[ RECORD 4 ]---------+------------------------------
...
Querying user-defined resource pools
The Boolean column IS_INTERNAL in system tables RESOURCE_POOLS and RESOURCE_POOL_STATUS lets you get data on user-defined resource pools only. For example:
=> SELECT name, subcluster_oid, subcluster_name, memorysize, maxmemorysize, priority, maxconcurrency
   FROM V_CATALOG.RESOURCE_POOLS WHERE is_internal = 'f';
name | subcluster_oid | subcluster_name | memorysize | maxmemorysize | priority | maxconcurrency
--------------+-------------------+-----------------+------------+---------------+----------+----------------
load_pool | 72947297254957395 | default | 0% | | 10 |
ceo_pool | 63570532589529860 | c_subcluster | 250M | | 10 |
ad hoc_pool | 0 | | 200M | 200M | 0 |
billing_pool | 45579723408647896 | ar_subcluster | 0% | | 0 | 3
web_pool | 0 | analytics_1 | 25M | | 10 | 5
batch_pool | 47479274633682648 | default | 150M | 150M | 0 | 10
dept1_pool | 0 | | 0% | | 5 |
dept2_pool | 0 | | 0% | | 8 |
dashboard | 45035996273843504 | analytics_1 | 0% | | 0 |
(9 rows)
12 - Monitoring recovery
When your Vertica database is recovering from a failure, it's important to monitor the recovery process. There are several ways to monitor database recovery:
12.1 - Viewing log files on each node
During database recovery, Vertica adds logging information to the vertica.log file on each host. Each message is identified with a [Recover] string.
Use the tail command to monitor recovery progress by viewing the relevant status messages. For example:
$ tail -f catalog-path/database-name/node-name_catalog/vertica.log
01/23/08 10:35:31 thr:Recover:0x2a98700970 [Recover] <INFO> Changing host v_vmart_node0001 startup state from INITIALIZING to RECOVERING
01/23/08 10:35:31 thr:CatchUp:0x1724b80 [Recover] <INFO> Recovering to specified epoch 0x120b6
01/23/08 10:35:31 thr:CatchUp:0x1724b80 [Recover] <INFO> Running 1 split queries
01/23/08 10:35:31 thr:CatchUp:0x1724b80 [Recover] <INFO> Running query: ALTER PROJECTION proj_tradesquotes_0 SPLIT v_vmart_node0001 FROM 73911;
12.2 - Using system tables to monitor recovery
Use the following system tables to monitor recovery: RECOVERY_STATUS and PROJECTION_RECOVERIES.
Specifically, the recovery_status system table includes information about the node that is recovering, the epoch being recovered, the current recovery phase, and running status:
=> SELECT node_name, recover_epoch, recovery_phase, current_completed, is_running FROM recovery_status;
node_name | recover_epoch | recovery_phase | current_completed | is_running
---------------------+---------------+-------------------+-------------------+--------------
v_vmart_node0001 | | | 0 | f
v_vmart_node0002 | 0 | historical pass 1 | 0 | t
v_vmart_node0003 | 1 | current | 0 | f
The projection_recoveries system table maintains a history of projection recoveries. To check recovery status, summarize the data for the recovering node and run the same query several times to see whether the counts change. Changing counts indicate that the recovery is working and in the process of recovering all missing data.
=> SELECT node_name, status, progress FROM projection_recoveries;
node_name | status | progress
-----------------------+-------------+---------
v_vmart_node0001 | running | 61
To see a single record from the projection_recoveries system table, add LIMIT 1 to the query.
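For example, applying LIMIT 1 to the query above:
=> SELECT node_name, status, progress FROM projection_recoveries LIMIT 1;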
After a recovery has completed, Vertica continues to store information from the most recent recovery in these tables.
12.3 - Viewing cluster state and recovery status
Use the admintools view_cluster tool from the command line to see the cluster state:
$ /opt/vertica/bin/admintools -t view_cluster
DB | Host | State
---------+--------------+------------
<data_base> | 112.17.31.10 | RECOVERING
<data_base> | 112.17.31.11 | UP
<data_base> | 112.17.31.12 | UP
<data_base> | 112.17.31.17 | UP
________________________________
12.4 - Monitoring cluster status after recovery
When recovery has completed:
- Launch Administration Tools.
- From the Main Menu, select View Database Cluster State and click OK.
The utility reports your node's status as UP.
Note
You can also monitor the state of your database nodes on the Management Console Overview page under the Database section, which tells you the number of nodes that are up, critical, recovering, or down. To get node-specific information, click Manage at the bottom of the page.
13 - Clearing projection refresh history
The PROJECTION_REFRESHES system table records information about successful and unsuccessful refresh operations. This table normally retains data for a projection until replaced by a new refresh of that projection, but you can also purge the table.
To immediately purge data for all completed refreshes, call CLEAR_PROJECTION_REFRESHES:
=> SELECT clear_projection_refreshes();
clear_projection_refreshes
----------------------------
CLEAR
(1 row)
This function does not clear data for refreshes that are currently in progress.
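Before purging, you can check for refreshes that are still executing. A sketch assuming the is_executing column described for PROJECTION_REFRESHES:
=> SELECT node_name, projection_name, refresh_start, is_executing
   FROM PROJECTION_REFRESHES WHERE is_executing;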
14 - Monitoring Vertica using notifiers
A Vertica notifier is a push-based mechanism for sending messages from Vertica to endpoints like Apache Kafka or syslog. For example, you can configure a long-running script to send notifications at various stages and then at the completion of a task.
To use a notifier (see the sketch after these steps):
- Use CREATE NOTIFIER to create a notifier for a supported endpoint type, such as Kafka or syslog.
- Send a notification to the NOTIFIER endpoint with either of the following:
  - NOTIFY: Manually sends a message to the NOTIFIER endpoint.
  - SET_DATA_COLLECTOR_NOTIFY_POLICY: Creates a notification policy, which automatically sends a message to the NOTIFIER endpoint when a specified event occurs.
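A minimal sketch of both steps, assuming a Kafka broker at kafka01.example.com:9092; the notifier name, topic, and message text are placeholders:
=> CREATE NOTIFIER load_notifier
   ACTION 'kafka://kafka01.example.com:9092'
   ENABLE
   MAXMEMORYSIZE '10M';
=> SELECT NOTIFY('Loading stage 1 complete', 'load_notifier', 'vertica_notifications');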
15 - Health Watchdog
A high concurrent load on the database can put the server into a bad health state. The Health Watchdog is designed to mitigate a bad health state by doing the following:
- Detecting the bad health state.
- Blocking DDL/DML transactions so that they stop adding to the bad state.
- Allowing all blocked transactions to proceed once the bad health state has been mitigated.
Note
Only non-superuser commands can be stopped by the Health Watchdog check.
Health Watchdog uses three health metrics to check the server status and enact the mitigation:
- Truncation Version Lag: tracks the catalog sync service and detects bad health conditions in the server when the current commit version is far ahead of the database truncation version.
- GCLX Queue Bloat: tracks the GCLX queue size and stops GCLX requests when the server is bombarded with them.
- Mergeout Queue Bloat: tracks the Tuple Mover (TM) queue size and stops DML transactions if the TM pool threads cannot keep up with the number of TM requests.
You can check the server's status with the check_cluster_health function and the health_watchdog_blocked_transactions system table, as sketched below.
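A minimal sketch, assuming check_cluster_health can be called without arguments in your version:
=> SELECT check_cluster_health();
=> SELECT * FROM health_watchdog_blocked_transactions;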