Getting data on table statistics

Vertica provides information about statistics for a given table and its columns and partitions in two ways:.

Vertica provides information about statistics for a given table and its columns and partitions in two ways:

  • The query optimizer notifies you about the availability of statistics to process a given query.

  • System table PROJECTION_COLUMNS shows what types of statistics are available for the table columns, and when they were last updated.

Query evaluation

During predicate selectivity estimation, the query optimizer can identify when histograms are not available or are out of date. If the value in the predicate is outside the histogram's maximum range, the statistics are stale. If no histograms are available, then no statistics are available to the plan.

When the optimizer detects stale or no statistics, such as when it encounters a column predicate for which it has no histogram, the optimizer performs the following actions:

  • Displays and logs a message that you should run ANALYZE_STATISTICS.

  • Annotates EXPLAIN-generated query plans with a statistics entry.

  • Ignores stale statistics when it generates a query plan. The optimizer uses other considerations to create a query plan, such as FK-PK constraints.

For example, the following query plan fragment shows no statistics (histograms unavailable):

| | +-- Outer -> STORAGE ACCESS for fact [Cost: 604, Rows: 10K (NO STATISTICS)]

The following query plan fragment shows that the predicate falls outside the histogram range:

| | +-- Outer -> STORAGE ACCESS for fact [Cost: 35, Rows: 1 (PREDICATE VALUE OUT-OF-RANGE)]

Statistics data in PROJECTION_COLUMNS

Two columns in system table PROJECTION_COLUMNS show the status of each table column's statistics, as follows:

  • STATISTICS_TYPE returns the type of statistics that are available for this column, one of the following: NONE, ROWCOUNT, or FULL.

  • STATISTICS_UPDATED_TIMESTAMP returns the last time statistics were collected for this column.

For example, the following sample schema defines a table named trades, which groups the highly-correlated columns bid and ask and stores the stock column separately:

=> CREATE TABLE trades (stock CHAR(5), bid INT, ask INT);
=> CREATE PROJECTION trades_p (
     stock ENCODING RLE, GROUPED(bid ENCODING DELTAVAL, ask))
     AS (SELECT * FROM trades) ORDER BY stock, bid;
=> INSERT INTO trades VALUES('acme', 10, 20);
=> COMMIT;

Query the PROJECTION_COLUMNS table for table trades:

=> SELECT table_name AS table, projection_name AS projection, table_column_name AS column, statistics_type, statistics_updated_timestamp AS last_updated
     FROM projection_columns WHERE table_name = 'trades';
 table  | projection  | column | statistics_type | last_updated
--------+-------------+--------+-----------------+--------------
 trades | trades_p_b0 | stock  | NONE            |
 trades | trades_p_b0 | bid    | NONE            |
 trades | trades_p_b0 | ask    | NONE            |
 trades | trades_p_b1 | stock  | NONE            |
 trades | trades_p_b1 | bid    | NONE            |
 trades | trades_p_b1 | ask    | NONE            |
(6 rows)

The statistics_type column returns NONE for all columns in the trades table, while statistics_updated_timestamp is empty because statistics have not yet been collected on this table.

Now, run ANALYZE_STATISTICS on the stock column:

=> SELECT ANALYZE_STATISTICS ('public.trades', 'stock');
 ANALYZE_STATISTICS
--------------------
                  0
(1 row)

Now, when you query PROJECTION_COLUMNS, it returns the following results:

=> SELECT table_name AS table, projection_name AS projection, table_column_name AS column, statistics_type, statistics_updated_timestamp AS last_updated
     FROM projection_columns WHERE table_name = 'trades';
 table  | projection  | column | statistics_type |         last_updated
--------+-------------+--------+-----------------+-------------------------------
 trades | trades_p_b0 | stock  | FULL            | 2019-04-03 12:00:12.231564-04
 trades | trades_p_b0 | bid    | ROWCOUNT        | 2019-04-03 12:00:12.231564-04
 trades | trades_p_b0 | ask    | ROWCOUNT        | 2019-04-03 12:00:12.231564-04
 trades | trades_p_b1 | stock  | FULL            | 2019-04-03 12:00:12.231564-04
 trades | trades_p_b1 | bid    | ROWCOUNT        | 2019-04-03 12:00:12.231564-04
 trades | trades_p_b1 | ask    | ROWCOUNT        | 2019-04-03 12:00:12.231564-04
(6 rows)

This time, the query results contain several changes:

statistics_type
  • Set to FULL for the stock column, confirming that full statistics were run on this column.

  • Set to ROWCOUNT for the bid and ask columns, confirming that ANALYZE_STATISTICS always invokes ANALYZE_ROW_COUNT on all table columns, even if ANALYZE_STATISTICS specifies a subset of those columns.

statistics_updated_timestamp Set to the same timestamp for all columns, confirming that statistics (either full or row count) were updated on all.