<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>OpenText Analytics Database 26.2.x – Collecting database statistics</title>
    <link>/en/admin/collecting-db-statistics/</link>
    <description>Recent content in Collecting database statistics on OpenText Analytics Database 26.2.x</description>
    <generator>Hugo -- gohugo.io</generator>
    
	  <atom:link href="/en/admin/collecting-db-statistics/index.xml" rel="self" type="application/rss+xml" />
    
    
      
        
      
    
    
    <item>
      <title>Admin: Collecting table statistics</title>
      <link>/en/admin/collecting-db-statistics/collecting-table-statistics/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      
      <guid>/en/admin/collecting-db-statistics/collecting-table-statistics/</guid>
      <description>
        
        
        &lt;p&gt;&lt;a href=&#34;../../../en/sql-reference/functions/performance-analysis-functions/statistics-management-functions/analyze-statistics/#&#34;&gt;ANALYZE_STATISTICS&lt;/a&gt; collects and aggregates data samples and storage information from all nodes that store projections of the target tables.&lt;/p&gt;
&lt;p&gt;You can set the scope of the collection at several levels:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;#Analyze&#34;&gt;Database&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;#Analyze2&#34;&gt;Table&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;#Analyze3&#34;&gt;Table columns&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;ANALYZE_STATISTICS can also control the size of the data sample that it collects.&lt;/p&gt;
&lt;p&gt;&lt;a name=&#34;Analyze&#34;&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2 id=&#34;analyze-all-database-tables&#34;&gt;Analyze all database tables&lt;/h2&gt;
&lt;p&gt;If ANALYZE_STATISTICS specifies no table, it collects statistics for all database tables and their projections. For example:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;=&amp;gt; SELECT ANALYZE_STATISTICS (&amp;#39;&amp;#39;);
 ANALYZE_STATISTICS
--------------------
                  0
(1 row)
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;&lt;a name=&#34;Analyze2&#34;&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2 id=&#34;analyze-a-single-table&#34;&gt;Analyze a single table&lt;/h2&gt;
&lt;p&gt;You can compute statistics on a single table as follows:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;=&amp;gt; SELECT ANALYZE_STATISTICS (&amp;#39;public.store_orders_fact&amp;#39;);
 ANALYZE_STATISTICS
--------------------
                  0
(1 row)
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;When you query system table &lt;a href=&#34;../../../en/sql-reference/system-tables/v-catalog-schema/projection-columns/#&#34;&gt;PROJECTION_COLUMNS&lt;/a&gt;, it confirms that statistics have been collected on all table columns for all projections of &lt;code&gt;store_orders_fact&lt;/code&gt;:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;=&amp;gt; SELECT projection_name, statistics_type, table_column_name,statistics_updated_timestamp
    FROM projection_columns WHERE projection_name ilike &amp;#39;store_orders_fact%&amp;#39; AND table_schema=&amp;#39;public&amp;#39;;
   projection_name    | statistics_type | table_column_name | statistics_updated_timestamp
----------------------+-----------------+-------------------+-------------------------------
 store_orders_fact_b0 | FULL            | product_key       | 2019-04-04 18:06:55.747329-04
 store_orders_fact_b0 | FULL            | product_version   | 2019-04-04 18:06:55.747329-04
 store_orders_fact_b0 | FULL            | store_key         | 2019-04-04 18:06:55.747329-04
 store_orders_fact_b0 | FULL            | vendor_key        | 2019-04-04 18:06:55.747329-04
 store_orders_fact_b0 | FULL            | employee_key      | 2019-04-04 18:06:55.747329-04
 store_orders_fact_b0 | FULL            | order_number      | 2019-04-04 18:06:55.747329-04
 store_orders_fact_b0 | FULL            | date_ordered      | 2019-04-04 18:06:55.747329-04
 store_orders_fact_b0 | FULL            | date_shipped      | 2019-04-04 18:06:55.747329-04
 store_orders_fact_b0 | FULL            | quantity_ordered  | 2019-04-04 18:06:55.747329-04
 store_orders_fact_b0 | FULL            | shipper_name      | 2019-04-04 18:06:55.747329-04
 store_orders_fact_b1 | FULL            | product_key       | 2019-04-04 18:06:55.747329-04
 store_orders_fact_b1 | FULL            | product_version   | 2019-04-04 18:06:55.747329-04
...
(20 rows)
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;&lt;a name=&#34;Analyze3&#34;&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2 id=&#34;analyze-table-columns&#34;&gt;Analyze table columns&lt;/h2&gt;
&lt;p&gt;Within a table, you can narrow the scope of analysis to a subset of its columns. Doing so can save significant processing overhead for large tables that contain many columns. It is especially useful for tables that you frequently query on specific columns.

&lt;div class=&#34;admonition important&#34; role=&#34;alert&#34;&gt;
&lt;h4 class=&#34;admonition-head&#34;&gt;Important&lt;/h4&gt;
If you collect statistics on specific columns, be sure to include all columns that you are likely to query. If a query includes other columns in that table, the query optimizer regards the statistics as incomplete for that query and ignores them in its plan.
&lt;/div&gt;&lt;/p&gt;
&lt;p&gt;For example, instead of collecting statistics on all columns in &lt;code&gt;store_orders_fact&lt;/code&gt;, you can select only those columns that are frequently queried: &lt;code&gt;product_key&lt;/code&gt;, &lt;code&gt;product_version&lt;/code&gt;, &lt;code&gt;order_number&lt;/code&gt;, and &lt;code&gt;quantity_ordered&lt;/code&gt;:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;=&amp;gt; SELECT DROP_STATISTICS(&amp;#39;public.store_orders_fact&amp;#39;);
=&amp;gt; SELECT ANALYZE_STATISTICS (&amp;#39;public.store_orders_fact&amp;#39;, &amp;#39;product_key, product_version, order_number, quantity_ordered&amp;#39;);
 ANALYZE_STATISTICS
--------------------
                  0
(1 row)
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;If you query &lt;code&gt;PROJECTION_COLUMNS&lt;/code&gt; again, it returns the following results:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;=&amp;gt; SELECT projection_name, statistics_type, table_column_name,statistics_updated_timestamp
    FROM projection_columns WHERE projection_name ilike &amp;#39;store_orders_fact%&amp;#39; AND table_schema=&amp;#39;public&amp;#39;;
   projection_name    | statistics_type | table_column_name | statistics_updated_timestamp
----------------------+-----------------+-------------------+------------------------------
 store_orders_fact_b0 | FULL            | product_key       | 2019-04-04 18:09:40.05452-04
 store_orders_fact_b0 | FULL            | product_version   | 2019-04-04 18:09:40.05452-04
 store_orders_fact_b0 | ROWCOUNT        | store_key         | 2019-04-04 18:09:40.05452-04
 store_orders_fact_b0 | ROWCOUNT        | vendor_key        | 2019-04-04 18:09:40.05452-04
 store_orders_fact_b0 | ROWCOUNT        | employee_key      | 2019-04-04 18:09:40.05452-04
 store_orders_fact_b0 | FULL            | order_number      | 2019-04-04 18:09:40.05452-04
 store_orders_fact_b0 | ROWCOUNT        | date_ordered      | 2019-04-04 18:09:40.05452-04
 store_orders_fact_b0 | ROWCOUNT        | date_shipped      | 2019-04-04 18:09:40.05452-04
 store_orders_fact_b0 | FULL            | quantity_ordered  | 2019-04-04 18:09:40.05452-04
 store_orders_fact_b0 | ROWCOUNT        | shipper_name      | 2019-04-04 18:09:40.05452-04
 store_orders_fact_b1 | FULL            | product_key       | 2019-04-04 18:09:40.05452-04
 store_orders_fact_b1 | FULL            | product_version   | 2019-04-04 18:09:40.05452-04
 store_orders_fact_b1 | ROWCOUNT        | store_key         | 2019-04-04 18:09:40.05452-04
 store_orders_fact_b1 | ROWCOUNT        | vendor_key        | 2019-04-04 18:09:40.05452-04
 store_orders_fact_b1 | ROWCOUNT        | employee_key      | 2019-04-04 18:09:40.05452-04
 store_orders_fact_b1 | FULL            | order_number      | 2019-04-04 18:09:40.05452-04
 store_orders_fact_b1 | ROWCOUNT        | date_ordered      | 2019-04-04 18:09:40.05452-04
 store_orders_fact_b1 | ROWCOUNT        | date_shipped      | 2019-04-04 18:09:40.05452-04
 store_orders_fact_b1 | FULL            | quantity_ordered  | 2019-04-04 18:09:40.05452-04
 store_orders_fact_b1 | ROWCOUNT        | shipper_name      | 2019-04-04 18:09:40.05452-04
(20 rows)
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;In this case, &lt;code&gt;statistics_type&lt;/code&gt; is set to &lt;code&gt;FULL&lt;/code&gt; only for those columns on which you ran ANALYZE_STATISTICS. For the remaining table columns, it is set to &lt;code&gt;ROWCOUNT&lt;/code&gt;, indicating that only &lt;a href=&#34;../../../en/admin/collecting-db-statistics/analyzing-row-counts/&#34;&gt;row statistics&lt;/a&gt; were collected for them.

&lt;div class=&#34;alert admonition note&#34; role=&#34;alert&#34;&gt;
&lt;h4 class=&#34;admonition-head&#34;&gt;Note&lt;/h4&gt;

ANALYZE_STATISTICS always invokes ANALYZE_ROW_COUNT on all table columns, even if ANALYZE_STATISTICS specifies a subset of those columns.

&lt;/div&gt;&lt;/p&gt;
&lt;h2 id=&#34;data-collection-percentage&#34;&gt;Data collection percentage&lt;/h2&gt;
&lt;p&gt;By default, OpenText™ Analytics Database collects a fixed 10-percent sample of statistical data from disk. Specifying a percentage of data to read from disk gives you more control over deciding between sample accuracy and speed.&lt;/p&gt;
&lt;p&gt;The percentage of data you collect affects collection time and accuracy:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;A smaller percentage is faster but returns a smaller data sample, which might compromise histogram accuracy.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;A larger percentage reads more data off disk. Data collection is slower, but &lt;em&gt;a larger data sample enables greater histogram accuracy.&lt;/em&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For example:&lt;/p&gt;
&lt;p&gt;Collect data on all projections for &lt;code&gt;shipping_dimension&lt;/code&gt; from 20 percent of the disk:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;=&amp;gt; SELECT ANALYZE_STATISTICS (&amp;#39;shipping_dimension&amp;#39;, 20);
 ANALYZE_STATISTICS
-------------------
                 0
(1 row)
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Collect data from the entire disk by setting the &lt;code&gt;percent&lt;/code&gt; parameter to 100:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;=&amp;gt; SELECT ANALYZE_STATISTICS (&amp;#39;shipping_dimension&amp;#39;, &amp;#39;shipping_key&amp;#39;, 100);
 ANALYZE_STATISTICS
--------------------
                  0
(1 row)
&lt;/code&gt;&lt;/pre&gt;&lt;h2 id=&#34;sampling-size&#34;&gt;Sampling size&lt;/h2&gt;
&lt;p&gt;ANALYZE_STATISTICS constructs a column histogram from a set of rows that it randomly selects from all collected data. Regardless of the percentage setting, the function always creates a statistical sample that contains up to (approximately) the smaller of:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;2&lt;sup&gt;17&lt;/sup&gt; (131,072) rows&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Number of rows that fit in 1 GB of memory&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If a column has fewer rows than the maximum sample size, ANALYZE_STATISTICS reads all rows from disk and analyzes the entire column.

&lt;div class=&#34;alert admonition note&#34; role=&#34;alert&#34;&gt;
&lt;h4 class=&#34;admonition-head&#34;&gt;Note&lt;/h4&gt;

The data collected in a sample range does not indicate how data should be distributed.

&lt;/div&gt;&lt;/p&gt;
&lt;p&gt;The following table shows how ANALYZE_STATISTICS, when set to different percentages, obtains a statistical sample from a given column:

&lt;table class=&#34;table table-bordered&#34;&gt;
&lt;tr&gt;
&lt;th class=&#34;hcenter&#34;&gt;Number of column rows&lt;/th&gt;
&lt;th class=&#34;hcenter&#34;&gt;%&lt;/th&gt;
&lt;th class=&#34;hcenter&#34;&gt;Number of rows read&lt;/th&gt;
&lt;th class=&#34;hcenter&#34;&gt;Number of sampled rows&lt;/th&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&#34;hright&#34;&gt;&lt;code&gt;&amp;lt;=&lt;/code&gt;&lt;em&gt;&lt;code&gt;max-sample-size&lt;/code&gt;&lt;/em&gt;&lt;/td&gt;
&lt;td class=&#34;hright&#34;&gt;20&lt;/td&gt;
&lt;td class=&#34;hright&#34;&gt;All&lt;/td&gt;
&lt;td class=&#34;hright&#34;&gt;All&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&#34;hright&#34;&gt;400K&lt;/td&gt;
&lt;td class=&#34;hright&#34;&gt;10&lt;/td&gt;
&lt;td class=&#34;hright&#34;&gt;&lt;em&gt;&lt;code&gt;max-sample-size&lt;/code&gt;&lt;/em&gt;&lt;/td&gt;
&lt;td class=&#34;hright&#34;&gt;&lt;em&gt;&lt;code&gt;max-sample-size&lt;/code&gt;&lt;/em&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&#34;hright&#34;&gt;4000K&lt;/td&gt;
&lt;td class=&#34;hright&#34;&gt;10&lt;/td&gt;
&lt;td class=&#34;hright&#34;&gt;400K&lt;/td&gt;
&lt;td class=&#34;hright&#34;&gt;&lt;em&gt;&lt;code&gt;max-sample-size&lt;/code&gt;&lt;/em&gt;&lt;/td&gt;
&lt;/tr&gt;&lt;/table&gt;


&lt;div class=&#34;alert admonition note&#34; role=&#34;alert&#34;&gt;
&lt;h4 class=&#34;admonition-head&#34;&gt;Note&lt;/h4&gt;

When a column specified for ANALYZE_STATISTICS is first in a projection&#39;s sort order, the function reads all data from disk to avoid a biased sample.

&lt;/div&gt;&lt;/p&gt;

      </description>
    </item>
    
    <item>
      <title>Admin: Collecting partition statistics</title>
      <link>/en/admin/collecting-db-statistics/collecting-partition-statistics/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      
      <guid>/en/admin/collecting-db-statistics/collecting-partition-statistics/</guid>
      <description>
        
        
        &lt;p&gt;&lt;a href=&#34;../../../en/sql-reference/functions/performance-analysis-functions/statistics-management-functions/analyze-statistics-partition/#&#34;&gt;ANALYZE_STATISTICS_PARTITION&lt;/a&gt; collects and aggregates data samples and storage information for a range of partitions in the specified table. OpenText™ Analytics Database writes the collected statistics to the database catalog.&lt;/p&gt;
&lt;p&gt;For example, the following table stores sales data and is partitioned by order dates:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;CREATE TABLE public.store_orders_fact
(
    product_key int,
    product_version int,
    store_key int,
    vendor_key int,
    employee_key int,
    order_number int,
    date_ordered date NOT NULL,
    date_shipped date NOT NULL,
    quantity_ordered int,
    shipper_name varchar(32)
);

ALTER TABLE public.store_orders_fact PARTITION BY date_ordered::DATE GROUP BY CALENDAR_HIERARCHY_DAY(date_ordered::DATE, 2, 2) REORGANIZE;
ALTER TABLE public.store_orders_fact ADD CONSTRAINT fk_store_orders_product FOREIGN KEY (product_key, product_version) references public.product_dimension (product_key, product_version);
ALTER TABLE public.store_orders_fact ADD CONSTRAINT fk_store_orders_vendor FOREIGN KEY (vendor_key) references public.vendor_dimension (vendor_key);
ALTER TABLE public.store_orders_fact ADD CONSTRAINT fk_store_orders_employee FOREIGN KEY (employee_key) references public.employee_dimension (employee_key);
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;At the end of each business day you might call ANALYZE_STATISTICS_PARTITION and collect statistics on all data of the latest (today&#39;s) partition:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;=&amp;gt; SELECT ANALYZE_STATISTICS_PARTITION(&amp;#39;public.store_orders_fact&amp;#39;, CURRENT_DATE::VARCHAR(10), CURRENT_DATE::VARCHAR(10));
 ANALYZE_STATISTICS_PARTITION
------------------------------
                            0
(1 row)
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;The function produces a set of fresh statistics for the most recent partition in &lt;code&gt;public.store_orders_fact&lt;/code&gt;. If you query this table each morning on yesterday&#39;s sales, the optimizer uses these statistics to generate an optimized query plan:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;=&amp;gt; EXPLAIN SELECT COUNT(*) FROM public.store_orders_fact WHERE date_ordered = CURRENT_DATE-1;

                                           QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 QUERY PLAN DESCRIPTION:
 ------------------------------

 EXPLAIN SELECT COUNT(*) FROM public.store_orders_fact WHERE date_ordered = CURRENT_DATE-1;

 Access Path:
 +-GROUPBY NOTHING [Cost: 2, Rows: 1] (PATH ID: 1)
 |  Aggregates: count(*)
 |  Execute on: All Nodes
 | +---&amp;gt; STORAGE ACCESS for store_orders_fact [Cost: 1, Rows: 222(PARTITION-LEVEL STATISTICS)] (PATH ID: 2)
 | |      Projection: public.store_orders_fact_v1_b1
 | |      Filter: (store_orders_fact.date_ordered = &amp;#39;2019-04-01&amp;#39;::date)
 | |      Execute on: All Nodes
&lt;/code&gt;&lt;/pre&gt;&lt;h2 id=&#34;narrowing-the-collection-scope&#34;&gt;Narrowing the collection scope&lt;/h2&gt;
&lt;p&gt;Like &lt;a href=&#34;../../../en/sql-reference/functions/performance-analysis-functions/statistics-management-functions/analyze-statistics/#&#34;&gt;ANALYZE_STATISTICS&lt;/a&gt;, ANALYZE_STATISTICS_PARTITION lets you narrow the scope of analysis to a &lt;a href=&#34;../../../en/admin/collecting-db-statistics/collecting-table-statistics/#Analyze3&#34;&gt;subset of a table&#39;s columns&lt;/a&gt;. You can also control the size of the data sample that it collects. For details on these options, see &lt;a href=&#34;../../../en/admin/collecting-db-statistics/collecting-table-statistics/#&#34;&gt;Collecting table statistics&lt;/a&gt;.&lt;/p&gt;
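&lt;p&gt;For example, assuming the optional column-list and percent arguments of ANALYZE_STATISTICS_PARTITION follow the same conventions as ANALYZE_STATISTICS, a hypothetical call might restrict collection for the latest partition of &lt;code&gt;store_orders_fact&lt;/code&gt; to two columns and a 20-percent sample:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;=&amp;gt; SELECT ANALYZE_STATISTICS_PARTITION(&amp;#39;public.store_orders_fact&amp;#39;, CURRENT_DATE::VARCHAR(10), CURRENT_DATE::VARCHAR(10), &amp;#39;product_key, order_number&amp;#39;, 20);
&lt;/code&gt;&lt;/pre&gt;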
&lt;h2 id=&#34;collecting-statistics-on-multiple-partition-ranges&#34;&gt;Collecting statistics on multiple partition ranges&lt;/h2&gt;
&lt;p&gt;If you specify multiple partitions, they must be continuous. Different collections of statistics can overlap. For example, the following table t1 is partitioned on column &lt;code&gt;c1&lt;/code&gt;:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;=&amp;gt; SELECT export_tables(&amp;#39;&amp;#39;,&amp;#39;t1&amp;#39;);
                                          export_tables
-------------------------------------------------------------------------------------------------
CREATE TABLE public.t1
(
    a int,
    b int,
    c1 int NOT NULL
)
PARTITION BY (t1.c1);


=&amp;gt; SELECT * FROM t1 ORDER BY c1;
 a  | b  | c1
----+----+----
  1 |  2 |  3
  4 |  5 |  6
  7 |  8 |  9
 10 | 11 | 12
(4 rows)
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Given this dataset, you can call ANALYZE_STATISTICS_PARTITION on &lt;code&gt;t1&lt;/code&gt; twice. The successive calls collect statistics for two overlapping ranges of partition keys, 3 through 9 and 6 through 12:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;=&amp;gt; SELECT drop_statistics_partition(&amp;#39;t1&amp;#39;, &amp;#39;&amp;#39;, &amp;#39;&amp;#39;);
 drop_statistics_partition
---------------------------
                         0
(1 row)

=&amp;gt; SELECT analyze_statistics_partition(&amp;#39;t1&amp;#39;, &amp;#39;3&amp;#39;, &amp;#39;9&amp;#39;);
 analyze_statistics_partition
------------------------------
                            0
(1 row)

=&amp;gt; SELECT analyze_statistics_partition(&amp;#39;t1&amp;#39;, &amp;#39;6&amp;#39;, &amp;#39;12&amp;#39;);
 analyze_statistics_partition
------------------------------
                            0
(1 row)

=&amp;gt; SELECT table_name, min_partition_key, max_partition_key, row_count FROM table_statistics WHERE table_name = &amp;#39;t1&amp;#39;;
 table_name | min_partition_key | max_partition_key | row_count
------------+-------------------+-------------------+-----------
 t1         | 3                 | 9                 |         3
 t1         | 6                 | 12                |         3
(2 rows)
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;If two statistics collections overlap, the database stores only the most recent statistics for each partition range. Thus, given the previous example, the database uses only statistics from the second collection for partition keys 6 through 9.&lt;/p&gt;
&lt;p&gt;Statistics that are collected for a given range of partition keys always supersede statistics that were previously collected for a subset of that range. For example, given a call to ANALYZE_STATISTICS_PARTITION that specifies partition keys 3 through 12, the collected statistics are a superset of the two sets of statistics collected earlier, so it supersedes both:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;
=&amp;gt; SELECT analyze_statistics_partition(&amp;#39;t1&amp;#39;, &amp;#39;3&amp;#39;, &amp;#39;12&amp;#39;);
 analyze_statistics_partition
------------------------------
                            0
(1 row)

=&amp;gt; SELECT table_name, min_partition_key, max_partition_key, row_count FROM table_statistics WHERE table_name = &amp;#39;t1&amp;#39;;
 table_name | min_partition_key | max_partition_key | row_count
------------+-------------------+-------------------+-----------
 t1         | 3                 | 12                |         4
(1 row)
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Finally, ANALYZE_STATISTICS_PARTITION collects statistics on partition keys 3 through 6. This collection is a subset of the previous collection, so the database retains both sets and uses the latest statistics from each:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;
=&amp;gt; SELECT analyze_statistics_partition(&amp;#39;t1&amp;#39;, &amp;#39;3&amp;#39;, &amp;#39;6&amp;#39;);
 analyze_statistics_partition
------------------------------
                            0
(1 row)

=&amp;gt; SELECT table_name, min_partition_key, max_partition_key, row_count FROM table_statistics WHERE table_name = &amp;#39;t1&amp;#39;;
 table_name | min_partition_key | max_partition_key | row_count
------------+-------------------+-------------------+-----------
 t1         | 3                 | 12                |         4
 t1         | 3                 | 6                 |         2
(2 rows)
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;&lt;a name=&#34;Supporte&#34;&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2 id=&#34;supported-datetime-functions&#34;&gt;Supported date/time functions&lt;/h2&gt;
&lt;p&gt;ANALYZE_STATISTICS_PARTITION can collect partition-level statistics on tables where the partition expression specifies one of the following date/time functions:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href=&#34;../../../en/sql-reference/functions/data-type-specific-functions/datetime-functions/date/#&#34;&gt;DATE&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href=&#34;../../../en/sql-reference/functions/data-type-specific-functions/datetime-functions/date-part/#&#34;&gt;DATE_PART&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href=&#34;../../../en/sql-reference/functions/data-type-specific-functions/datetime-functions/date-trunc/#&#34;&gt;DATE_TRUNC&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href=&#34;../../../en/sql-reference/functions/data-type-specific-functions/datetime-functions/day/#&#34;&gt;DAY&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href=&#34;../../../en/sql-reference/functions/data-type-specific-functions/datetime-functions/dayofmonth/#&#34;&gt;DAYOFMONTH&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href=&#34;../../../en/sql-reference/functions/data-type-specific-functions/datetime-functions/dayofyear/#&#34;&gt;DAYOFYEAR&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href=&#34;../../../en/sql-reference/functions/data-type-specific-functions/datetime-functions/days/#&#34;&gt;DAYS&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href=&#34;../../../en/sql-reference/functions/data-type-specific-functions/datetime-functions/extract/#&#34;&gt;EXTRACT&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href=&#34;../../../en/sql-reference/functions/data-type-specific-functions/datetime-functions/hour/#&#34;&gt;HOUR&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href=&#34;../../../en/sql-reference/functions/data-type-specific-functions/datetime-functions/minute/#&#34;&gt;MINUTE&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href=&#34;../../../en/sql-reference/functions/data-type-specific-functions/datetime-functions/month/#&#34;&gt;MONTH&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href=&#34;../../../en/sql-reference/functions/data-type-specific-functions/datetime-functions/quarter/#&#34;&gt;QUARTER&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href=&#34;../../../en/sql-reference/functions/data-type-specific-functions/datetime-functions/week/#&#34;&gt;WEEK&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href=&#34;../../../en/sql-reference/functions/data-type-specific-functions/datetime-functions/week-iso/#&#34;&gt;WEEK_ISO&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href=&#34;../../../en/sql-reference/functions/data-type-specific-functions/datetime-functions/year/#&#34;&gt;YEAR&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href=&#34;../../../en/sql-reference/functions/data-type-specific-functions/datetime-functions/year-iso/#&#34;&gt;YEAR_ISO&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
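&lt;p&gt;For example, a hypothetical table &lt;code&gt;public.orders&lt;/code&gt; partitioned with &lt;code&gt;PARTITION BY YEAR(order_date)&lt;/code&gt; yields integer partition keys, so you could collect statistics for the 2018 and 2019 partitions as follows:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;=&amp;gt; SELECT ANALYZE_STATISTICS_PARTITION(&amp;#39;public.orders&amp;#39;, &amp;#39;2018&amp;#39;, &amp;#39;2019&amp;#39;);
&lt;/code&gt;&lt;/pre&gt;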
&lt;h2 id=&#34;requirements-and-restrictions&#34;&gt;Requirements and restrictions&lt;/h2&gt;
&lt;p&gt;The following requirements and restrictions apply to ANALYZE_STATISTICS_PARTITION:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;The table must be partitioned and cannot contain unpartitioned data.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The table partition expression must specify a single column. The following expressions are supported:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Expressions that specify only the column—that is, partition on all column values. For example:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;PARTITION BY ship_date GROUP BY CALENDAR_HIERARCHY_DAY(ship_date, 2, 2)
&lt;/code&gt;&lt;/pre&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;If the column is a &lt;a href=&#34;../../../en/sql-reference/data-types/datetime-data-types/date/#&#34;&gt;DATE&lt;/a&gt; or &lt;a href=&#34;../../../en/sql-reference/data-types/datetime-data-types/timestamptimestamptz/#&#34;&gt;TIMESTAMP/TIMESTAMPTZ&lt;/a&gt;, the partition expression can specify a &lt;a href=&#34;../../../en/admin/collecting-db-statistics/collecting-partition-statistics/#Supporte&#34;&gt;supported date/time function&lt;/a&gt; that returns that column or any portion of it, such as month or year. For example, the following partition expression specifies to partition on the year portion of column &lt;code&gt;order_date&lt;/code&gt;:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;PARTITION BY YEAR(order_date)
&lt;/code&gt;&lt;/pre&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Expressions that perform addition or subtraction on the column. For example:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;PARTITION BY YEAR(order_date) -1
&lt;/code&gt;&lt;/pre&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The table partition expression cannot coerce the specified column to another data type.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The database collects no statistics from the following projections:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Live aggregate and Top-K projections&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Projections that are defined to include an SQL function within an expression&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;


      </description>
    </item>
    
    <item>
      <title>Admin: Analyzing row counts</title>
      <link>/en/admin/collecting-db-statistics/analyzing-row-counts/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      
      <guid>/en/admin/collecting-db-statistics/analyzing-row-counts/</guid>
      <description>
        
        
        &lt;p&gt;OpenText™ Analytics Database lets you obtain row counts for projections and for external tables, through &lt;a href=&#34;../../../en/sql-reference/functions/management-functions/storage-functions/do-tm-task/&#34;&gt;ANALYZE_ROW_COUNT&lt;/a&gt; and &lt;a href=&#34;../../../en/sql-reference/functions/performance-analysis-functions/statistics-management-functions/analyze-external-row-count/#&#34;&gt;ANALYZE_EXTERNAL_ROW_COUNT&lt;/a&gt;, respectively.&lt;/p&gt;
&lt;h2 id=&#34;projection-row-count&#34;&gt;Projection row count&lt;/h2&gt;
&lt;p&gt;ANALYZE_ROW_COUNT is a lightweight operation that collects a minimal set of statistics and aggregate row counts for a projection, and saves them in the database catalog. In many cases, this data satisfies optimizer requirements for producing optimal query plans. The operation is invoked on the following occasions:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;At the time intervals specified by configuration parameter &lt;a href=&#34;../../../en/sql-reference/config-parameters/general-parameters/#AnalyzeRowCountInterval&#34;&gt;AnalyzeRowCountInterval&lt;/a&gt;—by default, once a day.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;During loads, the database updates the catalog with the current aggregate row count data for a given table when the percentage of difference between the last-recorded aggregate projection row count and current row count exceeds the setting in configuration parameter &lt;a href=&#34;../../../en/sql-reference/config-parameters/projection-parameters/#ARCCommitPercentage&#34;&gt;ARCCommitPercentage&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;On calls to meta-functions &lt;a href=&#34;../../../en/sql-reference/functions/performance-analysis-functions/statistics-management-functions/analyze-statistics/#&#34;&gt;ANALYZE_STATISTICS&lt;/a&gt; and &lt;a href=&#34;../../../en/sql-reference/functions/performance-analysis-functions/statistics-management-functions/analyze-statistics-partition/#&#34;&gt;ANALYZE_STATISTICS_PARTITION&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
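&lt;p&gt;For example, assuming ARCCommitPercentage is set like other database-level configuration parameters shown in this guide, you might lower it so that aggregate row counts are committed more frequently during loads (the value here is hypothetical):&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;=&amp;gt; ALTER DATABASE DEFAULT SET ARCCommitPercentage = 3;
ALTER DATABASE
&lt;/code&gt;&lt;/pre&gt;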
&lt;p&gt;You can explicitly invoke ANALYZE_ROW_COUNT through calls to &lt;a href=&#34;../../../en/sql-reference/functions/management-functions/storage-functions/do-tm-task/#&#34;&gt;DO_TM_TASK&lt;/a&gt;. For example:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;=&amp;gt; SELECT DO_TM_TASK(&amp;#39;analyze_row_count&amp;#39;, &amp;#39;store_orders_fact_b0&amp;#39;);
                                              do_tm_task
------------------------------------------------------------------------------------------------------
 Task: row count analyze
(Table: public.store_orders_fact) (Projection: public.store_orders_fact_b0)

(1 row)
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;You can change the intervals when the database regularly collects row-level statistics by setting configuration parameter AnalyzeRowCountInterval. For example, you can change the collection interval to 1 hour (3600 seconds):&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;=&amp;gt; ALTER DATABASE DEFAULT SET AnalyzeRowCountInterval = 3600;
ALTER DATABASE
&lt;/code&gt;&lt;/pre&gt;&lt;h2 id=&#34;external-table-row-count&#34;&gt;External table row count&lt;/h2&gt;
&lt;p&gt;&lt;a href=&#34;../../../en/sql-reference/functions/performance-analysis-functions/statistics-management-functions/analyze-external-row-count/#&#34;&gt;ANALYZE_EXTERNAL_ROW_COUNT&lt;/a&gt; calculates the exact number of rows in an external table. The optimizer uses this count when it plans queries that access external tables. It is especially useful when an external table participates in a join: the row count enables the optimizer to identify the smaller table to use as the inner input of the join, which facilitates better query performance.&lt;/p&gt;
&lt;p&gt;The following query calculates the exact number of rows in the external table &lt;code&gt;loader_rejects&lt;/code&gt;:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;=&amp;gt; SELECT ANALYZE_EXTERNAL_ROW_COUNT(&amp;#39;loader_rejects&amp;#39;);
 ANALYZE_EXTERNAL_ROW_COUNT
----------------------------
                 0
&lt;/code&gt;&lt;/pre&gt;
      </description>
    </item>
    
    <item>
      <title>Admin: Canceling statistics collection</title>
      <link>/en/admin/collecting-db-statistics/canceling-statistics-collection/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      
      <guid>/en/admin/collecting-db-statistics/canceling-statistics-collection/</guid>
      <description>
        
        
&lt;p&gt;To cancel statistics collection while it is in progress, press CTRL-C in &lt;a class=&#34;glosslink&#34; href=&#34;../../../en/glossary/vsql/&#34; title=&#34;For more information, see Installing the vsql Client and the more general topic, Using vsql.&#34;&gt;vsql&lt;/a&gt; or call the &lt;a href=&#34;../../../en/sql-reference/functions/management-functions/session-functions/interrupt-statement/&#34;&gt;INTERRUPT_STATEMENT()&lt;/a&gt; function.&lt;/p&gt;
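&lt;p&gt;For example, you can interrupt a long-running ANALYZE_STATISTICS call from another session by passing its session ID and statement ID to INTERRUPT_STATEMENT. The session ID shown here is hypothetical; query the SESSIONS system table for actual values:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;=&amp;gt; SELECT INTERRUPT_STATEMENT(&amp;#39;v_vmart_node0001-12345:0x1a2b&amp;#39;, 1);
&lt;/code&gt;&lt;/pre&gt;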
&lt;p&gt;To remove statistics for a given table or statistics type, call the &lt;a href=&#34;../../../en/sql-reference/functions/performance-analysis-functions/statistics-management-functions/drop-statistics/&#34;&gt;DROP_STATISTICS()&lt;/a&gt; function.

&lt;div class=&#34;admonition caution&#34; role=&#34;alert&#34;&gt;
&lt;h4 class=&#34;admonition-head&#34;&gt;Caution&lt;/h4&gt;

After you drop statistics, it can be time consuming to regenerate them.

&lt;/div&gt;&lt;/p&gt;
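&lt;p&gt;For example, the following call (a sketch; substitute your own table name) removes statistics for all columns of table &lt;code&gt;public.trades&lt;/code&gt;:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;=&amp;gt; SELECT DROP_STATISTICS(&amp;#39;public.trades&amp;#39;);
&lt;/code&gt;&lt;/pre&gt;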

      </description>
    </item>
    
    <item>
      <title>Admin: Getting data on table statistics</title>
      <link>/en/admin/collecting-db-statistics/getting-data-on-table-statistics/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      
      <guid>/en/admin/collecting-db-statistics/getting-data-on-table-statistics/</guid>
      <description>
        
        
        &lt;p&gt;OpenText™ Analytics Database provides information about statistics for a given table and its columns and partitions in two ways:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;The query optimizer notifies you about the availability of statistics to process a given query.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;System table 
&lt;code&gt;&lt;a href=&#34;../../../en/sql-reference/system-tables/v-catalog-schema/projection-columns/#&#34;&gt;PROJECTION_COLUMNS&lt;/a&gt;&lt;/code&gt; shows what types of statistics are available for the table columns, and when they were last updated.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;query-evaluation&#34;&gt;Query evaluation&lt;/h2&gt;
&lt;p&gt;During predicate selectivity estimation, the query optimizer can identify when histograms are not available or are out of date. If the value in the predicate is outside the histogram&#39;s maximum range, the statistics are stale. If no histograms are available, then no statistics are available to the plan.&lt;/p&gt;
&lt;p&gt;When the optimizer detects stale or no statistics, such as when it encounters a column predicate for which it has no histogram, the optimizer performs the following actions:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Displays and logs a message that you should run 
&lt;code&gt;&lt;a href=&#34;../../../en/sql-reference/functions/performance-analysis-functions/statistics-management-functions/analyze-statistics/#&#34;&gt;ANALYZE_STATISTICS&lt;/a&gt;&lt;/code&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Annotates 
&lt;code&gt;&lt;a href=&#34;../../../en/sql-reference/statements/explain/#&#34;&gt;EXPLAIN&lt;/a&gt;&lt;/code&gt;-generated query plans with a statistics entry.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Ignores stale statistics when it generates a query plan. The optimizer uses other considerations to create a query plan, such as FK-PK constraints.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For example, the following query plan fragment shows no statistics (histograms unavailable):&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;| | +-- Outer -&amp;gt; STORAGE ACCESS for fact [Cost: 604, Rows: 10K (&lt;span class=&#34;code-input&#34;&gt;NO STATISTICS&lt;/span&gt;)]
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;The following query plan fragment shows that the predicate falls outside the histogram range:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;| | +-- Outer -&amp;gt; STORAGE ACCESS for fact [Cost: 35, Rows: 1 (&lt;span class=&#34;code-input&#34;&gt;PREDICATE VALUE OUT-OF-RANGE&lt;/span&gt;)]
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;&lt;a name=&#34;Statisti&#34;&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2 id=&#34;statistics-data-in-projection_columns&#34;&gt;Statistics data in PROJECTION_COLUMNS&lt;/h2&gt;
&lt;p&gt;Two columns in system table 
&lt;code&gt;&lt;a href=&#34;../../../en/sql-reference/system-tables/v-catalog-schema/projection-columns/#&#34;&gt;PROJECTION_COLUMNS&lt;/a&gt;&lt;/code&gt; show the status of each table column&#39;s statistics, as follows:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;STATISTICS_TYPE&lt;/code&gt; returns the type of statistics that are available for this column, one of the following: &lt;code&gt;NONE&lt;/code&gt;, &lt;code&gt;ROWCOUNT&lt;/code&gt;, or &lt;code&gt;FULL&lt;/code&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;STATISTICS_UPDATED_TIMESTAMP&lt;/code&gt; returns the last time statistics were collected for this column.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For example, the following sample schema defines a table named &lt;code&gt;trades&lt;/code&gt;, which groups the highly correlated columns &lt;code&gt;bid&lt;/code&gt; and &lt;code&gt;ask&lt;/code&gt; and stores the &lt;code&gt;stock&lt;/code&gt; column separately:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;=&amp;gt; CREATE TABLE trades (stock CHAR(5), bid INT, ask INT);
=&amp;gt; CREATE PROJECTION trades_p (
     stock ENCODING RLE, GROUPED(bid ENCODING DELTAVAL, ask))
     AS (SELECT * FROM trades) ORDER BY stock, bid;
=&amp;gt; INSERT INTO trades VALUES(&amp;#39;acme&amp;#39;, 10, 20);
=&amp;gt; COMMIT;
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Query the &lt;code&gt;PROJECTION_COLUMNS&lt;/code&gt; table for table &lt;code&gt;trades&lt;/code&gt;:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;=&amp;gt; SELECT table_name AS table, projection_name AS projection, table_column_name AS column, statistics_type, statistics_updated_timestamp AS last_updated
     FROM projection_columns WHERE table_name = &amp;#39;trades&amp;#39;;
 table  | projection  | column | statistics_type | last_updated
--------+-------------+--------+-----------------+--------------
 trades | trades_p_b0 | stock  | NONE            |
 trades | trades_p_b0 | bid    | NONE            |
 trades | trades_p_b0 | ask    | NONE            |
 trades | trades_p_b1 | stock  | NONE            |
 trades | trades_p_b1 | bid    | NONE            |
 trades | trades_p_b1 | ask    | NONE            |
(6 rows)
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;The &lt;code&gt;statistics_type&lt;/code&gt; column returns &lt;code&gt;NONE&lt;/code&gt; for all columns in the &lt;code&gt;trades&lt;/code&gt; table, while &lt;code&gt;statistics_updated_timestamp&lt;/code&gt; is empty because statistics have not yet been collected on this table.&lt;/p&gt;
&lt;p&gt;Now, run 
&lt;code&gt;&lt;a href=&#34;../../../en/sql-reference/functions/performance-analysis-functions/statistics-management-functions/analyze-statistics/#&#34;&gt;ANALYZE_STATISTICS&lt;/a&gt;&lt;/code&gt; on the &lt;code&gt;stock&lt;/code&gt; column:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;=&amp;gt; SELECT ANALYZE_STATISTICS (&amp;#39;public.trades&amp;#39;, &amp;#39;stock&amp;#39;);
 ANALYZE_STATISTICS
--------------------
                  0
(1 row)
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Now, when you query &lt;code&gt;PROJECTION_COLUMNS&lt;/code&gt;, it returns the following results:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;=&amp;gt; SELECT table_name AS table, projection_name AS projection, table_column_name AS column, statistics_type, statistics_updated_timestamp AS last_updated
     FROM projection_columns WHERE table_name = &amp;#39;trades&amp;#39;;
 table  | projection  | column | statistics_type |         last_updated
--------+-------------+--------+-----------------+-------------------------------
 trades | trades_p_b0 | stock  | FULL            | 2019-04-03 12:00:12.231564-04
 trades | trades_p_b0 | bid    | ROWCOUNT        | 2019-04-03 12:00:12.231564-04
 trades | trades_p_b0 | ask    | ROWCOUNT        | 2019-04-03 12:00:12.231564-04
 trades | trades_p_b1 | stock  | FULL            | 2019-04-03 12:00:12.231564-04
 trades | trades_p_b1 | bid    | ROWCOUNT        | 2019-04-03 12:00:12.231564-04
 trades | trades_p_b1 | ask    | ROWCOUNT        | 2019-04-03 12:00:12.231564-04
(6 rows)
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;This time, the query results contain several changes:&lt;/p&gt;
&lt;table class=&#34;table table-bordered&#34;&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;statistics_type&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Set to &lt;code&gt;FULL&lt;/code&gt; for the &lt;code&gt;stock&lt;/code&gt; column, confirming that full statistics were run on this column.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Set to &lt;code&gt;ROWCOUNT&lt;/code&gt; for the &lt;code&gt;bid&lt;/code&gt; and &lt;code&gt;ask&lt;/code&gt; columns, confirming that &lt;code&gt;ANALYZE_STATISTICS&lt;/code&gt; always invokes &lt;code&gt;ANALYZE_ROW_COUNT&lt;/code&gt; on all table columns, even if &lt;code&gt;ANALYZE_STATISTICS&lt;/code&gt; specifies a subset of those columns.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;statistics_updated_timestamp&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Set to the same timestamp for all columns, confirming that statistics (either full or row count) were updated on all.&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;

      </description>
    </item>
    
    <item>
      <title>Admin: Best practices for statistics collection</title>
      <link>/en/admin/collecting-db-statistics/best-practices-statistics-collection/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      
      <guid>/en/admin/collecting-db-statistics/best-practices-statistics-collection/</guid>
      <description>
        
        
&lt;p&gt;You should call ANALYZE_STATISTICS or ANALYZE_STATISTICS_PARTITION when one or more of the following conditions are true:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Data is bulk loaded for the first time.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;A new projection is refreshed.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The number of rows changes significantly.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;A new column is added to the table.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Column minimum/maximum values change significantly.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;New primary key values with referential integrity constraints are added. Re-analyze both the primary key and foreign key tables.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Table size notably changes relative to other tables it is joined to—for example, a table that was 50 times larger than another table is now only five times larger.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;A notable deviation in data distribution necessitates recalculating histograms—for example, an event causes abnormally high levels of trading for a particular stock.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The database is inactive for an extended period of time.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
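&lt;p&gt;For example, after bulk loading a table for the first time, you might collect full statistics on it (the table name here is illustrative):&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;=&amp;gt; SELECT ANALYZE_STATISTICS(&amp;#39;public.store_orders_fact&amp;#39;);
&lt;/code&gt;&lt;/pre&gt;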
&lt;h2 id=&#34;overhead-considerations&#34;&gt;Overhead considerations&lt;/h2&gt;
&lt;p&gt;Running ANALYZE_STATISTICS is an efficient but potentially long-running operation. You can run it concurrently with queries and loads in a production environment. However, the function can incur considerable overhead on system resources (CPU and memory), at the expense of queries and load operations. To minimize overhead, consider calling ANALYZE_STATISTICS_PARTITION on those partitions that are subject to significant activity—typically, the most recently loaded partitions, including the table&#39;s active partition. You can further narrow the scope of both functions by specifying a subset of the table columns—generally, those that are queried most often.&lt;/p&gt;
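&lt;p&gt;For example, assuming a table partitioned by date, the following sketch analyzes statistics only for the most recently loaded date range (the table name and range are illustrative):&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;=&amp;gt; SELECT ANALYZE_STATISTICS_PARTITION(&amp;#39;public.store_orders_fact&amp;#39;, &amp;#39;2025-01-01&amp;#39;, &amp;#39;2025-01-31&amp;#39;);
&lt;/code&gt;&lt;/pre&gt;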
&lt;h2 id=&#34;related-tools&#34;&gt;Related tools&lt;/h2&gt;
&lt;p&gt;You can diagnose and resolve many statistics-related issues by calling &lt;a href=&#34;../../../en/sql-reference/functions/performance-analysis-functions/workload-management-functions/analyze-workload/&#34;&gt;ANALYZE_WORKLOAD&lt;/a&gt;, which returns tuning recommendations. If you update statistics and find that a query still performs poorly, run it through the Database Designer and choose &lt;a href=&#34;../../../en/admin/configuring-db/creating-db-design/general-design-settings/#Incremen&#34;&gt;incremental&lt;/a&gt; as the design type.&lt;/p&gt;
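&lt;p&gt;For example, the following sketch requests tuning recommendations scoped to a single table (the table name is illustrative):&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;=&amp;gt; SELECT ANALYZE_WORKLOAD(&amp;#39;public.trades&amp;#39;);
&lt;/code&gt;&lt;/pre&gt;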

      </description>
    </item>
    
  </channel>
</rss>
