Collecting database statistics
The Vertica cost-based query optimizer relies on data statistics to produce query plans. If statistics are incomplete or out-of-date, the optimizer is liable to use a suboptimal plan to execute a query.
When you query a table, the Vertica optimizer checks for statistics as follows:

If the table is partitioned, the optimizer checks whether the partitions required by this query have recently been analyzed. If so, it retrieves those statistics and uses them to facilitate query planning.

Otherwise, the optimizer uses table-level statistics, if available.

If no valid partition or table-level statistics are available, the optimizer assumes uniform distribution of data values and equal storage usage for all projections.
Statistics management functions
Vertica provides two functions that generate up-to-date statistics on table data: ANALYZE_STATISTICS and ANALYZE_STATISTICS_PARTITION collect table-level and partition-level statistics, respectively. After computing statistics, the functions store them in the database catalog.
Both functions perform the following operations:

Collect statistics using historical queries (at epoch latest) without any locks.

Perform fast data sampling, which expedites analysis of relatively small tables with a large number of columns.

Recognize deleted data instead of ignoring delete markers.
Vertica also provides several functions that help you manage database statistics: for example, you can export and import statistics, validate statistics, and drop statistics.
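For example, you can save a table's statistics to an XML file with EXPORT_STATISTICS and later restore them with IMPORT_STATISTICS. This is a sketch; the file path and table name below are illustrative:

=> SELECT EXPORT_STATISTICS('/tmp/store_orders_fact_stats.xml', 'public.store_orders_fact');
=> SELECT IMPORT_STATISTICS('/tmp/store_orders_fact_stats.xml');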
After you collect the desired statistics, you can run Workload Analyzer to retrieve hints about underperforming queries and their root causes, and obtain tuning recommendations.
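For example, the following call requests tuning recommendations scoped to a single table (the table name is illustrative):

=> SELECT ANALYZE_WORKLOAD('public.store_orders_fact');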
1  Collecting table statistics
ANALYZE_STATISTICS collects and aggregates data samples and storage information from all nodes that store projections of the target tables.
You can set the scope of the collection at several levels: all database tables, a single table, or specific columns within a table. ANALYZE_STATISTICS can also control the size of the data sample that it collects.
Analyze all database tables
If ANALYZE_STATISTICS specifies no table, it collects statistics for all database tables and their projections. For example:
=> SELECT ANALYZE_STATISTICS ('');
 ANALYZE_STATISTICS
--------------------
                  0
(1 row)
Analyze a single table
You can compute statistics on a single table as follows:
=> SELECT ANALYZE_STATISTICS ('public.store_orders_fact');
 ANALYZE_STATISTICS
--------------------
                  0
(1 row)
Querying the system table PROJECTION_COLUMNS confirms that statistics have been collected on all table columns for all projections of store_orders_fact:
=> SELECT projection_name, statistics_type, table_column_name, statistics_updated_timestamp
   FROM projection_columns WHERE projection_name ilike 'store_orders_fact%' AND table_schema='public';
    projection_name   | statistics_type | table_column_name | statistics_updated_timestamp
----------------------+-----------------+-------------------+-------------------------------
 store_orders_fact_b0 | FULL            | product_key       | 2019-04-04 18:06:55.747329-04
 store_orders_fact_b0 | FULL            | product_version   | 2019-04-04 18:06:55.747329-04
 store_orders_fact_b0 | FULL            | store_key         | 2019-04-04 18:06:55.747329-04
 store_orders_fact_b0 | FULL            | vendor_key        | 2019-04-04 18:06:55.747329-04
 store_orders_fact_b0 | FULL            | employee_key      | 2019-04-04 18:06:55.747329-04
 store_orders_fact_b0 | FULL            | order_number      | 2019-04-04 18:06:55.747329-04
 store_orders_fact_b0 | FULL            | date_ordered      | 2019-04-04 18:06:55.747329-04
 store_orders_fact_b0 | FULL            | date_shipped      | 2019-04-04 18:06:55.747329-04
 store_orders_fact_b0 | FULL            | quantity_ordered  | 2019-04-04 18:06:55.747329-04
 store_orders_fact_b0 | FULL            | shipper_name      | 2019-04-04 18:06:55.747329-04
 store_orders_fact_b1 | FULL            | product_key       | 2019-04-04 18:06:55.747329-04
 store_orders_fact_b1 | FULL            | product_version   | 2019-04-04 18:06:55.747329-04
 ...
(20 rows)
Analyze table columns
Within a table, you can narrow the scope of analysis to a subset of its columns. Doing so can save significant processing overhead for big tables that contain many columns. It is especially useful if you frequently query these tables on specific columns.
Important
If you collect statistics on specific columns, be sure to include all columns that you are likely to query. If a query includes other columns in that table, the query optimizer regards the statistics as incomplete for that query and ignores them in its plan.
For example, instead of collecting statistics on all columns in store_orders_fact, you can select only those columns that are frequently queried: product_key, product_version, order_number, and quantity_ordered:
=> SELECT DROP_STATISTICS('public.store_orders_fact');
=> SELECT ANALYZE_STATISTICS ('public.store_orders_fact', 'product_key, product_version, order_number, quantity_ordered');
 ANALYZE_STATISTICS
--------------------
                  0
(1 row)
If you query PROJECTION_COLUMNS again, it returns the following results:
=> SELECT projection_name, statistics_type, table_column_name, statistics_updated_timestamp
   FROM projection_columns WHERE projection_name ilike 'store_orders_fact%' AND table_schema='public';
    projection_name   | statistics_type | table_column_name | statistics_updated_timestamp
----------------------+-----------------+-------------------+------------------------------
 store_orders_fact_b0 | FULL            | product_key       | 2019-04-04 18:09:40.054520-04
 store_orders_fact_b0 | FULL            | product_version   | 2019-04-04 18:09:40.054520-04
 store_orders_fact_b0 | ROWCOUNT        | store_key         | 2019-04-04 18:09:40.054520-04
 store_orders_fact_b0 | ROWCOUNT        | vendor_key        | 2019-04-04 18:09:40.054520-04
 store_orders_fact_b0 | ROWCOUNT        | employee_key      | 2019-04-04 18:09:40.054520-04
 store_orders_fact_b0 | FULL            | order_number      | 2019-04-04 18:09:40.054520-04
 store_orders_fact_b0 | ROWCOUNT        | date_ordered      | 2019-04-04 18:09:40.054520-04
 store_orders_fact_b0 | ROWCOUNT        | date_shipped      | 2019-04-04 18:09:40.054520-04
 store_orders_fact_b0 | FULL            | quantity_ordered  | 2019-04-04 18:09:40.054520-04
 store_orders_fact_b0 | ROWCOUNT        | shipper_name      | 2019-04-04 18:09:40.054520-04
 store_orders_fact_b1 | FULL            | product_key       | 2019-04-04 18:09:40.054520-04
 store_orders_fact_b1 | FULL            | product_version   | 2019-04-04 18:09:40.054520-04
 store_orders_fact_b1 | ROWCOUNT        | store_key         | 2019-04-04 18:09:40.054520-04
 store_orders_fact_b1 | ROWCOUNT        | vendor_key        | 2019-04-04 18:09:40.054520-04
 store_orders_fact_b1 | ROWCOUNT        | employee_key      | 2019-04-04 18:09:40.054520-04
 store_orders_fact_b1 | FULL            | order_number      | 2019-04-04 18:09:40.054520-04
 store_orders_fact_b1 | ROWCOUNT        | date_ordered      | 2019-04-04 18:09:40.054520-04
 store_orders_fact_b1 | ROWCOUNT        | date_shipped      | 2019-04-04 18:09:40.054520-04
 store_orders_fact_b1 | FULL            | quantity_ordered  | 2019-04-04 18:09:40.054520-04
 store_orders_fact_b1 | ROWCOUNT        | shipper_name      | 2019-04-04 18:09:40.054520-04
(20 rows)
In this case, the statistics_type column is set to FULL only for those columns on which you ran ANALYZE_STATISTICS. The remaining table columns are set to ROWCOUNT, indicating that only row statistics were collected for them.
Note
ANALYZE_STATISTICS always invokes ANALYZE_ROW_COUNT on all table columns, even if ANALYZE_STATISTICS specifies a subset of those columns.
Data collection percentage
By default, Vertica collects a fixed 10-percent sample of statistical data from disk. Specifying the percentage of data to read from disk gives you more control over the trade-off between sample accuracy and collection speed.
The percentage of data you collect affects collection time and accuracy:

A smaller percentage is faster but returns a smaller data sample, which might compromise histogram accuracy.

A larger percentage reads more data off disk. Data collection is slower, but a larger data sample enables greater histogram accuracy.
For example:
Collect data on all projections for shipping_dimension from 20 percent of the disk:
=> SELECT ANALYZE_STATISTICS ('shipping_dimension', 20);
 ANALYZE_STATISTICS
--------------------
                  0
(1 row)
Collect data from the entire disk by setting the percent parameter to 100:
=> SELECT ANALYZE_STATISTICS ('shipping_dimension', 'shipping_key', 100);
 ANALYZE_STATISTICS
--------------------
                  0
(1 row)
Sampling size
ANALYZE_STATISTICS constructs a column histogram from a set of rows that it randomly selects from all collected data. Regardless of the percentage setting, the function always caps the statistical sample at a maximum sample size.
If a column has fewer rows than the maximum sample size, ANALYZE_STATISTICS reads all rows from disk and analyzes the entire column.
Note
The data collected in a sample range does not indicate how data should be distributed.
The following table shows how ANALYZE_STATISTICS, when set to different percentages, obtains a statistical sample from a given column:

 Number of column rows | %  | Number of rows read | Number of sampled rows
-----------------------+----+---------------------+------------------------
 <= max-sample-size    | 20 | All                 | All
 400K                  | 10 | max-sample-size     | max-sample-size
 4000K                 | 10 | 400K                | max-sample-size
Note
When a column specified for ANALYZE_STATISTICS is first in a projection's sort order, the function reads all data from disk to avoid a biased sample.
2  Collecting partition statistics
ANALYZE_STATISTICS_PARTITION collects and aggregates data samples and storage information for a range of partitions in the specified table. Vertica writes the collected statistics to the database catalog.
For example, the following table stores sales data and is partitioned by order dates:
CREATE TABLE public.store_orders_fact
(
product_key int,
product_version int,
store_key int,
vendor_key int,
employee_key int,
order_number int,
date_ordered date NOT NULL,
date_shipped date NOT NULL,
quantity_ordered int,
shipper_name varchar(32)
);
ALTER TABLE public.store_orders_fact PARTITION BY date_ordered::DATE GROUP BY CALENDAR_HIERARCHY_DAY(date_ordered::DATE, 2, 2) REORGANIZE;
ALTER TABLE public.store_orders_fact ADD CONSTRAINT fk_store_orders_product FOREIGN KEY (product_key, product_version) references public.product_dimension (product_key, product_version);
ALTER TABLE public.store_orders_fact ADD CONSTRAINT fk_store_orders_vendor FOREIGN KEY (vendor_key) references public.vendor_dimension (vendor_key);
ALTER TABLE public.store_orders_fact ADD CONSTRAINT fk_store_orders_employee FOREIGN KEY (employee_key) references public.employee_dimension (employee_key);
At the end of each business day you might call ANALYZE_STATISTICS_PARTITION and collect statistics on all data of the latest (today's) partition:
=> SELECT ANALYZE_STATISTICS_PARTITION('public.store_orders_fact', CURRENT_DATE::VARCHAR(10), CURRENT_DATE::VARCHAR(10));
 ANALYZE_STATISTICS_PARTITION
------------------------------
                            0
(1 row)
The function produces a set of fresh statistics for the most recent partition in public.store_orders_fact. If you query this table each morning on yesterday's sales, the optimizer uses these statistics to generate an optimized query plan:
=> EXPLAIN SELECT COUNT(*) FROM public.store_orders_fact WHERE date_ordered = CURRENT_DATE-1;

 QUERY PLAN
------------------------------------------------------------------------------------------------------------
 QUERY PLAN DESCRIPTION:
 ------------------------------
 EXPLAIN SELECT COUNT(*) FROM public.store_orders_fact WHERE date_ordered = CURRENT_DATE-1;
 Access Path:
 +-GROUPBY NOTHING [Cost: 2, Rows: 1] (PATH ID: 1)
 |  Aggregates: count(*)
 |  Execute on: All Nodes
 | +---> STORAGE ACCESS for store_orders_fact [Cost: 1, Rows: 222 (PARTITION-LEVEL STATISTICS)] (PATH ID: 2)
 | |      Projection: public.store_orders_fact_v1_b1
 | |      Filter: (store_orders_fact.date_ordered = '2019-04-01'::date)
 | |      Execute on: All Nodes
Narrowing the collection scope
Like ANALYZE_STATISTICS, ANALYZE_STATISTICS_PARTITION lets you narrow the scope of analysis to a subset of a table's columns. You can also control the size of the data sample that it collects. For details on these options, see Collecting table statistics.
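For example, a sketch that combines both options might restrict today's partition analysis to two frequently queried columns and a 20-percent sample (the column list and percentage are illustrative):

=> SELECT ANALYZE_STATISTICS_PARTITION('public.store_orders_fact',
     CURRENT_DATE::VARCHAR(10), CURRENT_DATE::VARCHAR(10),
     'order_number, quantity_ordered', 20);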
Collecting statistics on multiple partition ranges
If you specify multiple partitions, they must be continuous. Different collections of statistics can overlap. For example, the following table t1 is partitioned on column c1:
=> SELECT export_tables('','t1');
               export_tables
-------------------------------------------
 CREATE TABLE public.t1
 (
     a int,
     b int,
     c1 int NOT NULL
 )
 PARTITION BY (t1.c1);
=> SELECT * FROM t1 ORDER BY c1;
 a  | b  | c1
----+----+----
  1 |  2 |  3
  4 |  5 |  6
  7 |  8 |  9
 10 | 11 | 12
(4 rows)
Given this dataset, you can call ANALYZE_STATISTICS_PARTITION on t1 twice. The successive calls collect statistics for two overlapping ranges of partition keys, 3 through 9 and 6 through 12:
=> SELECT drop_statistics_partition('t1', '', '');
 drop_statistics_partition
---------------------------
                         0
(1 row)

=> SELECT analyze_statistics_partition('t1', '3', '9');
 analyze_statistics_partition
------------------------------
                            0
(1 row)

=> SELECT analyze_statistics_partition('t1', '6', '12');
 analyze_statistics_partition
------------------------------
                            0
(1 row)
=> SELECT table_name, min_partition_key, max_partition_key, row_count FROM table_statistics WHERE table_name = 't1';
 table_name | min_partition_key | max_partition_key | row_count
------------+-------------------+-------------------+-----------
 t1         | 3                 | 9                 |         3
 t1         | 6                 | 12                |         3
(2 rows)
If two statistics collections overlap, Vertica stores only the most recent statistics for each partition range. Thus, given the previous example, Vertica uses only statistics from the second collection for partition keys 6 through 9.
Statistics that are collected for a given range of partition keys always supersede statistics that were previously collected for a subset of that range. For example, given a call to ANALYZE_STATISTICS_PARTITION that specifies partition keys 3 through 12, the collected statistics are a superset of the two sets of statistics collected earlier, so it supersedes both:
=> SELECT analyze_statistics_partition('t1', '3', '12');
 analyze_statistics_partition
------------------------------
                            0
(1 row)

=> SELECT table_name, min_partition_key, max_partition_key, row_count FROM table_statistics WHERE table_name = 't1';
 table_name | min_partition_key | max_partition_key | row_count
------------+-------------------+-------------------+-----------
 t1         | 3                 | 12                |         4
(1 row)
Finally, ANALYZE_STATISTICS_PARTITION collects statistics on partition keys 3 through 6. This collection is a subset of the previous collection, so Vertica retains both sets and uses the latest statistics from each:
=> SELECT analyze_statistics_partition('t1', '3', '6');
 analyze_statistics_partition
------------------------------
                            0
(1 row)

=> SELECT table_name, min_partition_key, max_partition_key, row_count FROM table_statistics WHERE table_name = 't1';
 table_name | min_partition_key | max_partition_key | row_count
------------+-------------------+-------------------+-----------
 t1         | 3                 | 12                |         4
 t1         | 3                 | 6                 |         2
(2 rows)
Supported date/time functions
ANALYZE_STATISTICS_PARTITION can collect partition-level statistics on tables where the partition expression specifies a supported date/time function.
Requirements and restrictions
The following requirements and restrictions apply to ANALYZE_STATISTICS_PARTITION:

The table must be partitioned and cannot contain unpartitioned data.

The table partition expression must specify a single column. The following expressions are supported:

Expressions that specify only the column—that is, partition on all column values. For example:
PARTITION BY ship_date GROUP BY CALENDAR_HIERARCHY_DAY(ship_date, 2, 2)

If the column is a DATE or TIMESTAMP/TIMESTAMPTZ, the partition expression can specify a supported date/time function that returns that column or any portion of it, such as month or year. For example, the following partition expression partitions on the year portion of column order_date:
PARTITION BY YEAR(order_date)

Expressions that perform addition or subtraction on the column. For example:
PARTITION BY YEAR(order_date) - 1

The table partition expression cannot coerce the specified column to another data type.

Vertica collects no statistics from certain projection types.
3  Analyzing row counts
Vertica lets you obtain row counts for projections and for external tables, through ANALYZE_ROW_COUNT and ANALYZE_EXTERNAL_ROW_COUNT, respectively.
Projection row count
ANALYZE_ROW_COUNT is a lightweight operation that collects a minimal set of statistics and aggregate row counts for a projection, and saves them in the database catalog. In many cases, this data satisfies optimizer requirements for producing optimal query plans. This operation is invoked on the following occasions:

At the time intervals specified by configuration parameter AnalyzeRowCountInterval—by default, once a day.

During loads. Vertica updates the catalog with the current aggregate row count data for a given table when the percentage of difference between the last-recorded aggregate projection row count and the current row count exceeds the setting in configuration parameter ARCCommitPercentage.

On calls to metafunctions ANALYZE_STATISTICS and ANALYZE_STATISTICS_PARTITION.
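For example, if the default threshold updates row counts too infrequently for heavily loaded tables, you can lower ARCCommitPercentage database-wide; this is a sketch, and the value 5 is illustrative:

=> ALTER DATABASE DEFAULT SET ARCCommitPercentage = 5;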
You can explicitly invoke ANALYZE_ROW_COUNT through calls to DO_TM_TASK. For example:
=> SELECT DO_TM_TASK('analyze_row_count', 'store_orders_fact_b0');
                                  do_tm_task
------------------------------------------------------------------------------
 Task: row count analyze
 (Table: public.store_orders_fact) (Projection: public.store_orders_fact_b0)

(1 row)
You can change the intervals when Vertica regularly collects rowlevel statistics by setting configuration parameter AnalyzeRowCountInterval. For example, you can change the collection interval to 1 hour (3600 seconds):
=> ALTER DATABASE DEFAULT SET AnalyzeRowCountInterval = 3600;
ALTER DATABASE
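To confirm the new setting, you can query the CONFIGURATION_PARAMETERS system table (a sketch):

=> SELECT parameter_name, current_value FROM configuration_parameters
   WHERE parameter_name = 'AnalyzeRowCountInterval';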
External table row count
ANALYZE_EXTERNAL_ROW_COUNT calculates the exact number of rows in an external table. The optimizer uses this count to optimize queries that access external tables. This is especially useful when an external table participates in a join: the row count enables the optimizer to identify the smaller table to use as the inner input to the join, facilitating better query performance.
The following query calculates the exact number of rows in the external table loader_rejects
:
=> SELECT ANALYZE_EXTERNAL_ROW_COUNT('loader_rejects');
 ANALYZE_EXTERNAL_ROW_COUNT
----------------------------
                          0
(1 row)
4  Canceling statistics collection
To cancel statistics collection mid-analysis, press CTRL+C in vsql or call the INTERRUPT_STATEMENT function.
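For example, from another session you can locate the running collection in the SESSIONS system table and then interrupt it. This is a sketch; the session ID below is a placeholder:

=> SELECT session_id, statement_id FROM sessions
   WHERE current_statement ILIKE 'SELECT ANALYZE_STATISTICS%';
=> SELECT INTERRUPT_STATEMENT('myhost-32204:0x12f6', 1);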
If you want to remove statistics for the specified table or type, call the DROP_STATISTICS() function.
Caution
After you drop statistics, it can be time-consuming to regenerate them.
5  Getting data on table statistics
Vertica provides information about statistics for a given table and its columns and partitions in two ways:

The query optimizer notifies you about the availability of statistics to process a given query.

System table PROJECTION_COLUMNS shows what types of statistics are available for the table columns, and when they were last updated.
Query evaluation
During predicate selectivity estimation, the query optimizer can identify when histograms are not available or are out of date. If the value in the predicate is outside the histogram's maximum range, the statistics are stale. If no histograms are available, then no statistics are available to the plan.
When the optimizer detects stale statistics or none at all, such as when it encounters a column predicate for which it has no histogram, it performs the following actions:

Displays and logs a message that you should run ANALYZE_STATISTICS.

Annotates EXPLAIN-generated query plans with a statistics entry.

Ignores stale statistics when it generates a query plan. The optimizer uses other considerations to create a query plan, such as FK-PK constraints.
For example, the following query plan fragment shows no statistics (histograms unavailable):
| | +-- Outer -> STORAGE ACCESS for fact [Cost: 604, Rows: 10K (NO STATISTICS)]
The following query plan fragment shows that the predicate falls outside the histogram range:
| | +-- Outer -> STORAGE ACCESS for fact [Cost: 35, Rows: 1 (PREDICATE VALUE OUT-OF-RANGE)]
Statistics data in PROJECTION_COLUMNS
Two columns in system table PROJECTION_COLUMNS show the status of each table column's statistics, as follows:

STATISTICS_TYPE returns the type of statistics that are available for this column, one of the following: NONE, ROWCOUNT, or FULL.

STATISTICS_UPDATED_TIMESTAMP returns the last time statistics were collected for this column.
For example, the following sample schema defines a table named trades, which groups the highly correlated columns bid and ask and stores the stock column separately:
=> CREATE TABLE trades (stock CHAR(5), bid INT, ask INT);
=> CREATE PROJECTION trades_p (
stock ENCODING RLE, GROUPED(bid ENCODING DELTAVAL, ask))
AS (SELECT * FROM trades) ORDER BY stock, bid;
=> INSERT INTO trades VALUES('acme', 10, 20);
=> COMMIT;
Query the PROJECTION_COLUMNS table for table trades:
=> SELECT table_name AS table, projection_name AS projection, table_column_name AS column, statistics_type, statistics_updated_timestamp AS last_updated
   FROM projection_columns WHERE table_name = 'trades';
 table  | projection  | column | statistics_type | last_updated
--------+-------------+--------+-----------------+--------------
 trades | trades_p_b0 | stock  | NONE            |
 trades | trades_p_b0 | bid    | NONE            |
 trades | trades_p_b0 | ask    | NONE            |
 trades | trades_p_b1 | stock  | NONE            |
 trades | trades_p_b1 | bid    | NONE            |
 trades | trades_p_b1 | ask    | NONE            |
(6 rows)
The statistics_type column returns NONE for all columns in the trades table, while statistics_updated_timestamp is empty because statistics have not yet been collected on this table.
Now, run ANALYZE_STATISTICS on the stock column:
=> SELECT ANALYZE_STATISTICS ('public.trades', 'stock');
 ANALYZE_STATISTICS
--------------------
                  0
(1 row)
Now, when you query PROJECTION_COLUMNS, it returns the following results:
=> SELECT table_name AS table, projection_name AS projection, table_column_name AS column, statistics_type, statistics_updated_timestamp AS last_updated
   FROM projection_columns WHERE table_name = 'trades';
 table  | projection  | column | statistics_type |         last_updated
--------+-------------+--------+-----------------+-------------------------------
 trades | trades_p_b0 | stock  | FULL            | 2019-04-03 12:00:12.231564-04
 trades | trades_p_b0 | bid    | ROWCOUNT        | 2019-04-03 12:00:12.231564-04
 trades | trades_p_b0 | ask    | ROWCOUNT        | 2019-04-03 12:00:12.231564-04
 trades | trades_p_b1 | stock  | FULL            | 2019-04-03 12:00:12.231564-04
 trades | trades_p_b1 | bid    | ROWCOUNT        | 2019-04-03 12:00:12.231564-04
 trades | trades_p_b1 | ask    | ROWCOUNT        | 2019-04-03 12:00:12.231564-04
(6 rows)
This time, the query results contain several changes:
statistics_type

Set to FULL for the stock column, confirming that full statistics were run on this column.

Set to ROWCOUNT for the bid and ask columns, confirming that ANALYZE_STATISTICS always invokes ANALYZE_ROW_COUNT on all table columns, even if ANALYZE_STATISTICS specifies a subset of those columns.

statistics_updated_timestamp

Set to the same timestamp for all columns, confirming that statistics (either full or row count) were updated on all.
6  Best practices for statistics collection
Call ANALYZE_STATISTICS or ANALYZE_STATISTICS_PARTITION when one or more of the following conditions are true:

Data is bulk loaded for the first time.

A new projection is refreshed.

The number of rows changes significantly.

A new column is added to the table.

Column minimum/maximum values change significantly.

New primary key values with referential integrity constraints are added. The primary key and foreign key tables should be reanalyzed.

Table size notably changes relative to other tables it is joined to—for example, a table that was 50 times larger than another table is now only five times larger.

A notable deviation in data distribution necessitates recalculating histograms—for example, an event causes abnormally high levels of trading for a particular stock.

The database is inactive for an extended period of time.
Overhead considerations
Running ANALYZE_STATISTICS is an efficient but potentially long-running operation. You can run it concurrently with queries and loads in a production environment. However, the function can incur considerable overhead on system resources (CPU and memory), at the expense of queries and load operations. To minimize overhead, consider calling ANALYZE_STATISTICS_PARTITION on those partitions that are subject to significant activity, typically the most recently loaded partitions, including the table's active partition. You can further narrow the scope of both functions by specifying a subset of the table columns, generally those that are queried most often.
You can diagnose and resolve many statistics-related issues by calling ANALYZE_WORKLOAD, which returns tuning recommendations. If you update statistics and find that a query still performs poorly, run it through the Database Designer and choose incremental as the design type.