Monitoring COPY loads and metrics
You can check COPY loads using:
-
Vertica functions
-
LOAD_STREAMS system table
-
LOAD_SOURCES system table
More generally, the EXECUTION_ENGINE_PROFILES system table records information about query events, including loads.
Using Vertica functions
Two meta-functions return COPY metrics for the number of accepted or rejected rows from the most recent COPY statement run in the current session:
-
To get the number of accepted rows, use the GET_NUM_ACCEPTED_ROWS function:
=> select get_num_accepted_rows(); get_num_accepted_rows ----------------------- 11 (1 row)
-
To check the number of rejected rows, use the GET_NUM_REJECTED_ROWS function:
=> select get_num_rejected_rows(); get_num_rejected_rows ----------------------- 0 (1 row)
Note
GET_NUM_ACCEPTED_ROWS and GET_NUM_REJECTED_ROWS support loads from STDIN, COPY LOCAL from a Vertica client, or a single file on the initiator. You cannot use these functions for multi-node loads.Using the CURRENT_LOAD_SOURCE function
You can include the CURRENT_LOAD_SOURCE
function as a part of the COPY statement. Doing so allows you to insert into a column the input file name or value computed by this function.
To insert the file names into a column from multiple source files:
=> COPY t (c1, c2, c3 as CURRENT_LOAD_SOURCE()) FROM '/home/load_file_1' ON exampledb_node02,
'/home/load_file_2' ON exampledb_node03 DELIMITER ',';
Using the LOAD_STREAMS system table
Vertica includes a set of system tables that include monitoring information. The LOAD_STREAMS system table includes information about load stream metrics from COPY and COPY FROM VERTICA statements. Thus, you can query table values to get COPY metrics.
Vertica maintains system table metrics until they reach a designated size quota (in kilobytes). This quota is set through internal processes, which you cannot set or view directly.
Labeling copy streams
COPY can include the STREAM NAME parameter to label its load stream so it is easy to identify in the LOAD_STREAMS system table. For example:
=> COPY mytable FROM myfile DELIMITER '|' STREAM NAME 'My stream name';
Load stream metrics
The following LOAD_STREAMS
columns show on the status of a load as it progresses:
Column name | Value... |
---|---|
ACCEPTED_ROW_COUNT |
Increases during parsing, up to the maximum number of rows in the input file. |
PARSE_COMPLETE_PERCENT |
Remains zero (0) until all named pipes return an EOF. While COPY awaits an EOF from multiple pipes, it can appear to be hung. However, before canceling the COPY statement, check your system CPU and disk accesses to determine if any activity is in progress. In a typical load, the |
SORT_COMPLETE_PERCENT |
Remains at 0 when loading from named pipes or STDIN. After PARSE_COMPLETE_PERCENT reaches 100 percent, SORT_COMPLETE_PERCENT increases to 100 percent. |
Depending on the data sizes, a significant lag can occur between the time PARSE_COMPLETE_PERCENT
reaches 100 percent and the time SORT_COMPLETE_PERCENT
begins to increase.
This example queries load stream data from the LOAD_STREAMS system table:
=> \pset expanded
Expanded display is on.
=> SELECT stream_name, table_name, load_start, accepted_row_count,
rejected_row_count, read_bytes, unsorted_row_count, sorted_row_count,
sort_complete_percent FROM load_streams;
-[ RECORD 1 ]----------+---------------------------
stream_name | fact-13
table_name | fact
load_start | 2010-12-28 15:07:41.132053
accepted_row_count | 900
rejected_row_count | 100
read_bytes | 11975
input_file_size_bytes | 0
parse_complete_percent | 0
unsorted_row_count | 3600
sorted_row_count | 3600
sort_complete_percent | 100
Using the LOAD_SOURCES system table
The LOAD_STREAMS table shows the total number of rows that were loaded or rejected. Grouping this information by source can help you determine from where data is coming. The LOAD_SOURCES system table includes some of the same data as LOAD_STREAMS does but adds this source-specific information. If apportioning is enabled, LOAD_SOURCES also provides information about how loads are apportioned among nodes.
You can use this table to identify causes of differing query results. For example, you can use the following statement to create an external table based on globs:
=> CREATE EXTERNAL TABLE tt AS COPY WITH SOURCE AWS(dir = 'foo', file = '*');
If you select from this table, Vertica loads data from every file in the foo
directory and creates one row in the LOAD_SOURCES table for each file. Suppose you later repeat the query and see different results. You could look at the LOAD_SOURCES table and discover that—between the two queries—somebody added another file to the foo
directory. Because each file is recorded in LOAD_SOURCES, you can see the new file that explains the changed query results.
If you are using many data sources. you might prefer to disable this reporting. To disable reporting, set the LoadSourceStatisticsLimit configuration parameter to 0. This parameter sets the upper bound on the number of sources profiled by LOAD_SOURCES per load. The default value is 256.