Collecting table statistics
ANALYZE_STATISTICS collects and aggregates data samples and storage information from all nodes that store projections of the target tables.
You can set the scope of the collection at several levels:
ANALYZE_STATISTICS can also control the size of the data sample that it collects.
Analyze all database tables
If ANALYZE_STATISTICS specifies no table, it collects statistics for all database tables and their projections. For example:
=> SELECT ANALYZE_STATISTICS ('');
ANALYZE_STATISTICS
--------------------
0
(1 row)
Analyze a single table
You can compute statistics on a single table as follows:
=> SELECT ANALYZE_STATISTICS ('public.store_orders_fact');
ANALYZE_STATISTICS
--------------------
0
(1 row)
When you query system table PROJECTION_COLUMNS, it confirms that statistics have been collected on all table columns for all projections of store_orders_fact
:
=> SELECT projection_name, statistics_type, table_column_name,statistics_updated_timestamp
FROM projection_columns WHERE projection_name ilike 'store_orders_fact%' AND table_schema='public';
projection_name | statistics_type | table_column_name | statistics_updated_timestamp
----------------------+-----------------+-------------------+-------------------------------
store_orders_fact_b0 | FULL | product_key | 2019-04-04 18:06:55.747329-04
store_orders_fact_b0 | FULL | product_version | 2019-04-04 18:06:55.747329-04
store_orders_fact_b0 | FULL | store_key | 2019-04-04 18:06:55.747329-04
store_orders_fact_b0 | FULL | vendor_key | 2019-04-04 18:06:55.747329-04
store_orders_fact_b0 | FULL | employee_key | 2019-04-04 18:06:55.747329-04
store_orders_fact_b0 | FULL | order_number | 2019-04-04 18:06:55.747329-04
store_orders_fact_b0 | FULL | date_ordered | 2019-04-04 18:06:55.747329-04
store_orders_fact_b0 | FULL | date_shipped | 2019-04-04 18:06:55.747329-04
store_orders_fact_b0 | FULL | quantity_ordered | 2019-04-04 18:06:55.747329-04
store_orders_fact_b0 | FULL | shipper_name | 2019-04-04 18:06:55.747329-04
store_orders_fact_b1 | FULL | product_key | 2019-04-04 18:06:55.747329-04
store_orders_fact_b1 | FULL | product_version | 2019-04-04 18:06:55.747329-04
...
(20 rows)
Analyze table columns
Within a table, you can narrow scope of analysis to a subset of its columns. Doing so can save significant processing overhead for big tables that contain many columns. It is especially useful if you frequently query these tables on specific columns.
Important
If you collect statistics on specific columns, be sure to include all columns that you are likely to query. If a query includes other columns in that table, the query optimizer regards the statistics as incomplete for that query and ignores them in its plan.For example, instead of collecting statistics on all columns in store_orders_fact
, you can select only those columns that are frequently queried: product_key
, product_version
, order_number
, and quantity_shipped
:
=> SELECT DROP_STATISTICS('public.store_orders_fact');
=> SELECT ANALYZE_STATISTICS ('public.store_orders_fact', 'product_key, product_version, order_number, quantity_ordered');
ANALYZE_STATISTICS
--------------------
0
(1 row)
If you query PROJECTION_COLUMNS
again, it returns the following results:
=> SELECT projection_name, statistics_type, table_column_name,statistics_updated_timestamp
FROM projection_columns WHERE projection_name ilike 'store_orders_fact%' AND table_schema='public';
projection_name | statistics_type | table_column_name | statistics_updated_timestamp
----------------------+-----------------+-------------------+------------------------------
store_orders_fact_b0 | FULL | product_key | 2019-04-04 18:09:40.05452-04
store_orders_fact_b0 | FULL | product_version | 2019-04-04 18:09:40.05452-04
store_orders_fact_b0 | ROWCOUNT | store_key | 2019-04-04 18:09:40.05452-04
store_orders_fact_b0 | ROWCOUNT | vendor_key | 2019-04-04 18:09:40.05452-04
store_orders_fact_b0 | ROWCOUNT | employee_key | 2019-04-04 18:09:40.05452-04
store_orders_fact_b0 | FULL | order_number | 2019-04-04 18:09:40.05452-04
store_orders_fact_b0 | ROWCOUNT | date_ordered | 2019-04-04 18:09:40.05452-04
store_orders_fact_b0 | ROWCOUNT | date_shipped | 2019-04-04 18:09:40.05452-04
store_orders_fact_b0 | FULL | quantity_ordered | 2019-04-04 18:09:40.05452-04
store_orders_fact_b0 | ROWCOUNT | shipper_name | 2019-04-04 18:09:40.05452-04
store_orders_fact_b1 | FULL | product_key | 2019-04-04 18:09:40.05452-04
store_orders_fact_b1 | FULL | product_version | 2019-04-04 18:09:40.05452-04
store_orders_fact_b1 | ROWCOUNT | store_key | 2019-04-04 18:09:40.05452-04
store_orders_fact_b1 | ROWCOUNT | vendor_key | 2019-04-04 18:09:40.05452-04
store_orders_fact_b1 | ROWCOUNT | employee_key | 2019-04-04 18:09:40.05452-04
store_orders_fact_b1 | FULL | order_number | 2019-04-04 18:09:40.05452-04
store_orders_fact_b1 | ROWCOUNT | date_ordered | 2019-04-04 18:09:40.05452-04
store_orders_fact_b1 | ROWCOUNT | date_shipped | 2019-04-04 18:09:40.05452-04
store_orders_fact_b1 | FULL | quantity_ordered | 2019-04-04 18:09:40.05452-04
store_orders_fact_b1 | ROWCOUNT | shipper_name | 2019-04-04 18:09:40.05452-04
(20 rows)
In this case, columns statistics_type is set to FULL
only for those columns on which you ran ANALYZE_STATISTICS. The remaining table columns are set to ROWCOUNT
, indicating that only row statistics were collected for them.
Note
ANALYZE_STATISTICS always invokes ANALYZE_ROW_COUNT on all table columns, even if ANALYZE_STATISTICS specifies a subset of those columns.Data collection percentage
By default, Vertica collects a fixed 10-percent sample of statistical data from disk. Specifying a percentage of data to read from disk gives you more control over deciding between sample accuracy and speed.
The percentage of data you collect affects collection time and accuracy:
-
A smaller percentage is faster but returns a smaller data sample, which might compromise histogram accuracy.
-
A larger percentage reads more data off disk. Data collection is slower, but a larger data sample enables greater histogram accuracy.
For example:
Collect data on all projections for shipping_dimension
from 20 percent of the disk:
=> SELECT ANALYZE_STATISTICS ('shipping_dimension', 20);
ANALYZE_STATISTICS
-------------------
0
(1 row)
Collect data from the entire disk by setting the percent
parameter to 100:
=> SELECT ANALYZE_STATISTICS ('shipping_dimension', 'shipping_key', 100);
ANALYZE_STATISTICS
--------------------
0
(1 row)
Sampling size
ANALYZE_STATISTICS constructs a column histogram from a set of rows that it randomly selects from all collected data. Regardless of the percentage setting, the function always creates a statistical sample that contains up to (approximately) the smaller of:
-
217 (131,072) rows
-
Number of rows that fit in 1 GB of memory
If a column has fewer rows than the maximum sample size, ANALYZE_STATISTICS reads all rows from disk and analyzes the entire column.
Note
The data collected in a sample range does not indicate how data should be distributed.The following table shows how ANALYZE_STATISTICS, when set to different percentages, obtains a statistical sample from a given column:
Number of column rows | % | Number of rows read | Number of sampled rows |
---|---|---|---|
<= max-sample-size |
20 | All | All |
400K | 10 |
max-sample-size |
max-sample-size |
4000K | 10 | 400K |
max-sample-size |