Monitoring COPY loads and metrics

You can check COPY loads using:

  • Vertica functions

  • LOAD_STREAMS system table

  • LOAD_SOURCES system table

More generally, the EXECUTION_ENGINE_PROFILES system table records information about query events, including loads.
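For example, a query along the following lines returns profiling counters for load-related operators. The ILIKE filter on the operator name is illustrative only; the operator names and counters that appear depend on your load:

=> SELECT node_name, operator_name, counter_name, counter_value
   FROM execution_engine_profiles
   WHERE operator_name ILIKE '%Load%';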

Using Vertica functions

Two meta-functions return COPY metrics for the number of accepted or rejected rows from the most recent COPY statement run in the current session:

  1. To get the number of accepted rows, use the GET_NUM_ACCEPTED_ROWS function:

    => SELECT GET_NUM_ACCEPTED_ROWS();
  2. To get the number of rejected rows, use the GET_NUM_REJECTED_ROWS function:

    => SELECT GET_NUM_REJECTED_ROWS();

Using the CURRENT_LOAD_SOURCE function

You can include the CURRENT_LOAD_SOURCE function as part of the COPY statement. Doing so allows you to insert the input file name, as computed by this function, into a column.

To insert the file names into a column from multiple source files:

=> COPY t (c1, c2, c3 AS CURRENT_LOAD_SOURCE()) FROM '/home/load_file_1' ON exampledb_node02,
   '/home/load_file_2' ON exampledb_node03 DELIMITER ',';

Using the LOAD_STREAMS system table

Vertica includes a set of system tables that include monitoring information. The LOAD_STREAMS system table includes information about load stream metrics from COPY and COPY FROM VERTICA statements. Thus, you can query table values to get COPY metrics.

Vertica maintains system table metrics until they reach a designated size quota (in kilobytes). This quota is set through internal processes, which you cannot set or view directly.

Labeling copy streams

COPY can include the STREAM NAME parameter to label its load stream so it is easy to identify in the LOAD_STREAMS system table. For example:

=> COPY mytable FROM myfile DELIMITER '|' STREAM NAME 'My stream name';
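You can then retrieve metrics for that specific load by filtering on the label. The stream name below matches the previous example; the selected columns are among those shown later in this section:

=> SELECT stream_name, table_name, accepted_row_count, rejected_row_count
   FROM load_streams
   WHERE stream_name = 'My stream name';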

Load stream metrics

The following LOAD_STREAMS columns show the status of a load as it progresses:

ACCEPTED_ROW_COUNT
    Increases during parsing, up to the maximum number of rows in the input file.

PARSE_COMPLETE_PERCENT
    Remains zero (0) until all named pipes return an EOF. While COPY awaits an EOF from multiple pipes, it can appear to be hung. However, before canceling the COPY statement, check your system CPU and disk accesses to determine whether any activity is in progress.

    In a typical load, the PARSE_COMPLETE_PERCENT value can either increase slowly or jump quickly to 100 percent, if you are loading from named pipes or STDIN.

SORT_COMPLETE_PERCENT
    Remains at 0 when loading from named pipes or STDIN. After PARSE_COMPLETE_PERCENT reaches 100 percent, SORT_COMPLETE_PERCENT increases to 100 percent.

    Depending on the data sizes, a significant lag can occur between the time PARSE_COMPLETE_PERCENT reaches 100 percent and the time SORT_COMPLETE_PERCENT begins to increase.

This example queries load stream data from the LOAD_STREAMS system table:

=> \pset expanded
Expanded display is on.
=> SELECT stream_name, table_name, load_start, accepted_row_count,
   rejected_row_count, read_bytes, input_file_size_bytes, parse_complete_percent,
   unsorted_row_count, sorted_row_count, sort_complete_percent FROM load_streams;
-[ RECORD 1 ]----------+---------------------------
stream_name            | fact-13
table_name             | fact
load_start             | 2010-12-28 15:07:41.132053
accepted_row_count     | 900
rejected_row_count     | 100
read_bytes             | 11975
input_file_size_bytes  | 0
parse_complete_percent | 0
unsorted_row_count     | 3600
sorted_row_count       | 3600
sort_complete_percent  | 100

Using the LOAD_SOURCES system table

The LOAD_STREAMS table shows the total number of rows that were loaded or rejected. Grouping this information by source can help you determine from where data is coming. The LOAD_SOURCES system table includes some of the same data as LOAD_STREAMS does but adds this source-specific information. If apportioning is enabled, LOAD_SOURCES also provides information about how loads are apportioned among nodes.
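For example, a query along these lines totals accepted and rejected rows per source; the column names follow the LOAD_SOURCES table, and the aliases are illustrative:

=> SELECT source_name, SUM(accepted_row_count) AS accepted,
   SUM(rejected_row_count) AS rejected
   FROM load_sources GROUP BY source_name;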

You can use this table to identify causes of differing query results. For example, you can use the following statement to create an external table based on globs:

=> CREATE EXTERNAL TABLE tt AS COPY WITH SOURCE AWS(dir = 'foo', file = '*');

If you select from this table, Vertica loads data from every file in the foo directory and creates one row in the LOAD_SOURCES table for each file. Suppose you later repeat the query and see different results. You could look at the LOAD_SOURCES table and discover that—between the two queries—somebody added another file to the foo directory. Because each file is recorded in LOAD_SOURCES, you can see the new file that explains the changed query results.
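To see which files a given load read, you can list the sources recorded for its table. The table name here matches the external table from the previous example:

=> SELECT source_name, accepted_row_count
   FROM load_sources
   WHERE table_name = 'tt';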

If you are using many data sources, you might prefer to disable this reporting. To disable reporting, set the LoadSourceStatisticsLimit configuration parameter to 0. This parameter sets the upper bound on the number of sources profiled by LOAD_SOURCES per load. The default value is 256.
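One way to change the parameter is with the SET_CONFIG_PARAMETER meta-function, as sketched below; depending on your Vertica version, an ALTER DATABASE ... SET PARAMETER statement may also be available:

=> SELECT SET_CONFIG_PARAMETER('LoadSourceStatisticsLimit', 0);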