Collecting table statistics

ANALYZE_STATISTICS collects and aggregates data samples and storage information from all nodes that store projections of the target tables.

ANALYZE_STATISTICS collects and aggregates data samples and storage information from all nodes that store projections of the target tables.

You can set the scope of the collection at several levels:

ANALYZE_STATISTICS can also control the size of the data sample that it collects.

Analyze all database tables

If ANALYZE_STATISTICS specifies no table, it collects statistics for all database tables and their projections. For example:

=> SELECT ANALYZE_STATISTICS ('');
 ANALYZE_STATISTICS
--------------------
                  0
(1 row)

Analyze a single table

You can compute statistics on a single table as follows:

=> SELECT ANALYZE_STATISTICS ('public.store_orders_fact');
 ANALYZE_STATISTICS
--------------------
                  0
(1 row)

When you query system table PROJECTION_COLUMNS, it confirms that statistics have been collected on all table columns for all projections of store_orders_fact:

=> SELECT projection_name, statistics_type, table_column_name,statistics_updated_timestamp
    FROM projection_columns WHERE projection_name ilike 'store_orders_fact%' AND table_schema='public';
   projection_name    | statistics_type | table_column_name | statistics_updated_timestamp
----------------------+-----------------+-------------------+-------------------------------
 store_orders_fact_b0 | FULL            | product_key       | 2019-04-04 18:06:55.747329-04
 store_orders_fact_b0 | FULL            | product_version   | 2019-04-04 18:06:55.747329-04
 store_orders_fact_b0 | FULL            | store_key         | 2019-04-04 18:06:55.747329-04
 store_orders_fact_b0 | FULL            | vendor_key        | 2019-04-04 18:06:55.747329-04
 store_orders_fact_b0 | FULL            | employee_key      | 2019-04-04 18:06:55.747329-04
 store_orders_fact_b0 | FULL            | order_number      | 2019-04-04 18:06:55.747329-04
 store_orders_fact_b0 | FULL            | date_ordered      | 2019-04-04 18:06:55.747329-04
 store_orders_fact_b0 | FULL            | date_shipped      | 2019-04-04 18:06:55.747329-04
 store_orders_fact_b0 | FULL            | quantity_ordered  | 2019-04-04 18:06:55.747329-04
 store_orders_fact_b0 | FULL            | shipper_name      | 2019-04-04 18:06:55.747329-04
 store_orders_fact_b1 | FULL            | product_key       | 2019-04-04 18:06:55.747329-04
 store_orders_fact_b1 | FULL            | product_version   | 2019-04-04 18:06:55.747329-04
...
(20 rows)

Analyze table columns

Within a table, you can narrow scope of analysis to a subset of its columns. Doing so can save significant processing overhead for big tables that contain many columns. It is especially useful if you frequently query these tables on specific columns.

For example, instead of collecting statistics on all columns in store_orders_fact, you can select only those columns that are frequently queried: product_key, product_version, order_number, and quantity_shipped:

=> SELECT DROP_STATISTICS('public.store_orders_fact');
=> SELECT ANALYZE_STATISTICS ('public.store_orders_fact', 'product_key, product_version, order_number, quantity_ordered');
 ANALYZE_STATISTICS
--------------------
                  0
(1 row)

If you query PROJECTION_COLUMNS again, it returns the following results:

=> SELECT projection_name, statistics_type, table_column_name,statistics_updated_timestamp
    FROM projection_columns WHERE projection_name ilike 'store_orders_fact%' AND table_schema='public';
   projection_name    | statistics_type | table_column_name | statistics_updated_timestamp
----------------------+-----------------+-------------------+------------------------------
 store_orders_fact_b0 | FULL            | product_key       | 2019-04-04 18:09:40.05452-04
 store_orders_fact_b0 | FULL            | product_version   | 2019-04-04 18:09:40.05452-04
 store_orders_fact_b0 | ROWCOUNT        | store_key         | 2019-04-04 18:09:40.05452-04
 store_orders_fact_b0 | ROWCOUNT        | vendor_key        | 2019-04-04 18:09:40.05452-04
 store_orders_fact_b0 | ROWCOUNT        | employee_key      | 2019-04-04 18:09:40.05452-04
 store_orders_fact_b0 | FULL            | order_number      | 2019-04-04 18:09:40.05452-04
 store_orders_fact_b0 | ROWCOUNT        | date_ordered      | 2019-04-04 18:09:40.05452-04
 store_orders_fact_b0 | ROWCOUNT        | date_shipped      | 2019-04-04 18:09:40.05452-04
 store_orders_fact_b0 | FULL            | quantity_ordered  | 2019-04-04 18:09:40.05452-04
 store_orders_fact_b0 | ROWCOUNT        | shipper_name      | 2019-04-04 18:09:40.05452-04
 store_orders_fact_b1 | FULL            | product_key       | 2019-04-04 18:09:40.05452-04
 store_orders_fact_b1 | FULL            | product_version   | 2019-04-04 18:09:40.05452-04
 store_orders_fact_b1 | ROWCOUNT        | store_key         | 2019-04-04 18:09:40.05452-04
 store_orders_fact_b1 | ROWCOUNT        | vendor_key        | 2019-04-04 18:09:40.05452-04
 store_orders_fact_b1 | ROWCOUNT        | employee_key      | 2019-04-04 18:09:40.05452-04
 store_orders_fact_b1 | FULL            | order_number      | 2019-04-04 18:09:40.05452-04
 store_orders_fact_b1 | ROWCOUNT        | date_ordered      | 2019-04-04 18:09:40.05452-04
 store_orders_fact_b1 | ROWCOUNT        | date_shipped      | 2019-04-04 18:09:40.05452-04
 store_orders_fact_b1 | FULL            | quantity_ordered  | 2019-04-04 18:09:40.05452-04
 store_orders_fact_b1 | ROWCOUNT        | shipper_name      | 2019-04-04 18:09:40.05452-04
(20 rows)

In this case, columns statistics_type is set to FULL only for those columns on which you ran ANALYZE_STATISTICS. The remaining table columns are set to ROWCOUNT, indicating that only row statistics were collected for them.

Data collection percentage

By default, Vertica collects a fixed 10-percent sample of statistical data from disk. Specifying a percentage of data to read from disk gives you more control over deciding between sample accuracy and speed.

The percentage of data you collect affects collection time and accuracy:

  • A smaller percentage is faster but returns a smaller data sample, which might compromise histogram accuracy.

  • A larger percentage reads more data off disk. Data collection is slower, but a larger data sample enables greater histogram accuracy.

For example:

Collect data on all projections for shipping_dimension from 20 percent of the disk:

=> SELECT ANALYZE_STATISTICS ('shipping_dimension', 20);
 ANALYZE_STATISTICS
-------------------
                 0
(1 row)

Collect data from the entire disk by setting the percent parameter to 100:

=> SELECT ANALYZE_STATISTICS ('shipping_dimension', 'shipping_key', 100);
 ANALYZE_STATISTICS
--------------------
                  0
(1 row)

Sampling size

ANALYZE_STATISTICS constructs a column histogram from a set of rows that it randomly selects from all collected data. Regardless of the percentage setting, the function always creates a statistical sample that contains up to (approximately) the smaller of:

  • 217 (131,072) rows

  • Number of rows that fit in 1 GB of memory

If a column has fewer rows than the maximum sample size, ANALYZE_STATISTICS reads all rows from disk and analyzes the entire column.

The following table shows how ANALYZE_STATISTICS, when set to different percentages, obtains a statistical sample from a given column:

Number of column rows % Number of rows read Number of sampled rows
<=max-sample-size 20 All All
400K 10 max-sample-size max-sample-size
4000K 10 400K max-sample-size