Query optimization

When you submit a query to Vertica for processing, the Vertica query optimizer automatically chooses a set of operations to compute the requested result. These operations together are called a query plan. The choice of operations can significantly affect how many resources are needed to compute query results, and overall run-time performance. Optimal performance depends in great part on the projections that are available for a given query.

This section describes the different operations that the optimizer uses and how you can facilitate optimizer performance.

1 - Initial process for improving query performance

To optimize query performance, begin by performing the following tasks:

  1. Run Database Designer.

  2. Check query events proactively.

  3. Review the query plan.

Run Database Designer

Database Designer creates a physical schema for your database that provides optimal query performance. The first time you run Database Designer, you should create a comprehensive design that includes relevant sample queries and data on which to base the design. If you develop performance issues later, consider loading additional queries that you run frequently and then rerunning Database Designer to create an incremental design.

When you run Database Designer, choose the Update Statistics option. The Vertica query optimizer uses statistics about the data to create a query plan. Statistics help the optimizer determine:

  • Multiple eligible projections to answer the query

  • The best order in which to perform joins

  • Data distribution algorithms, such as broadcast and resegmentation

If your statistics become out of date, run the ANALYZE_STATISTICS function to update statistics for a schema, table, or columns. For more information, see Collecting database statistics.
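
For example, the following calls update statistics for an entire table and then for a single column. This is a minimal sketch; the table name public.trades is a placeholder for your own table:

=> SELECT ANALYZE_STATISTICS('public.trades');            -- all columns in public.trades
=> SELECT ANALYZE_STATISTICS('public.trades.trade_date'); -- only the trade_date column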

Check query events proactively

The QUERY_EVENTS system table returns information on query planning, optimization, and execution events.

The EVENT_TYPE column identifies the type of each event.
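
For example, one way to review recent events and their suggested remedies is a query like the following sketch (QUERY_EVENTS is in the V_MONITOR schema):

=> SELECT event_type, event_description, suggested_action
   FROM v_monitor.query_events LIMIT 10;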

Review the query plan

A query plan is a sequence of step-like paths that the Vertica query optimizer selects to access or alter information in your Vertica database. There are two ways to get information about the query plan:

  • Run the EXPLAIN command. Each step (path) represents a single operation that the optimizer uses for its execution strategy.

  • Query the QUERY_PLAN_PROFILES system table. This table provides detailed execution status for currently running queries. Output from the QUERY_PLAN_PROFILES table shows the real-time flow of data and the time and resources consumed for each path in each query plan.
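
A minimal sketch of both approaches follows; the EXPLAIN target query is hypothetical:

=> EXPLAIN SELECT customer_name FROM customers WHERE customer_state = 'MA';

=> SELECT path_id, path_line, running_time
   FROM v_monitor.query_plan_profiles
   ORDER BY transaction_id, statement_id, path_id, path_line_index;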

2 - Column encoding

You can potentially make queries faster by changing column encoding. Encoding reduces the on-disk size of your data so the amount of I/O required for queries is reduced, resulting in faster execution times. Make sure all columns and projections included in the query use the correct data encoding. To do this, take the following steps:

  1. Run Database Designer to create an incremental design. Database Designer implements the optimum encoding and projection design.

  2. After creating the incremental design, update statistics using the ANALYZE_STATISTICS function.

  3. Run EXPLAIN with one or more of the queries you submitted to the design to make sure it is using the new projections.

Alternatively, run DESIGNER_DESIGN_PROJECTION_ENCODINGS to re-evaluate the current encoding and update it if necessary.
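
For example, the following sketch writes proposed encoding changes for a hypothetical projection public.trades_p to a script without deploying them; the last two arguments are the deploy and re-analyze flags:

=> SELECT DESIGNER_DESIGN_PROJECTION_ENCODINGS('public.trades_p', '/tmp/encodings.sql', FALSE, TRUE);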

2.1 - Improving column compression

If you see slow performance or a large storage footprint with your FLOAT data, evaluate the data and your business needs to see if it can be contained in a NUMERIC column with a precision of 18 digits or less. Converting a FLOAT column to a NUMERIC column can improve data compression, reduce the on-disk size of your database, and improve performance of queries on that column.

When you define a NUMERIC data type, you specify the precision and the scale; NUMERIC data are exact representations of data. FLOAT data types represent variable precision and approximate values; they take up more space in the database.

Converting FLOAT columns to NUMERIC columns is most effective when:

  • NUMERIC precision is 18 digits or less. Performance of NUMERIC data is fine-tuned for the common case of 18 digits of precision. Vertica recommends converting FLOAT columns to NUMERIC columns only if they require precision of 18 digits or less.

  • FLOAT precision is bounded, and the values will all fall within a specified precision for a NUMERIC column. One example is monetary values like product prices or financial transaction amounts. For example, a column defined as NUMERIC(11,2) can accommodate prices from 0 to a few million dollars and can store cents, and compresses more efficiently than a FLOAT column.

If you try to load a value into a NUMERIC column that exceeds the specified precision, Vertica returns an error and does not load the data. If you assign a value with more decimal digits than the specified scale, the value is rounded to match the specified scale and stored in that column.
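
The following sketch shows both behaviors, using a hypothetical prices table:

=> CREATE TABLE prices (amount NUMERIC(11,2));
=> INSERT INTO prices VALUES (1234.567);       -- scale exceeds 2: stored as 1234.57
=> INSERT INTO prices VALUES (1234567890123);  -- exceeds precision 11: Vertica returns an error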

See also

Numeric data types

2.2 - Using run length encoding

When you run Database Designer, you can choose to optimize for loads, which minimizes database footprint. In this case, Database Designer applies encodings to columns to maximize query performance. Encoding options include run length encoding (RLE), which replaces sequences (runs) of identical values in a column with a set of pairs, where each pair represents the number of contiguous occurrences for a given value: (occurrences, value).

RLE is generally applicable to a column with low-cardinality, and where identical values are contiguous—typically, because table data is sorted on that column. For example, a customer profile table typically includes a gender column that contains values of F and M only. Sorting on gender ensures runs of F or M values that can be expressed as a set of two pairs: (occurrences, F) and (occurrences, M). So, given 8,147 occurrences of F and 7,956 occurrences of M, and a projection that is sorted primarily on gender, Vertica can apply RLE and store these values as a single set of two pairs: (8147, F) and (7956, M). Doing so reduces this projection’s footprint and improves query performance.
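
A minimal sketch of such a projection, assuming a hypothetical customers table; the ENCODING RLE clause on gender works together with the gender-first sort order:

=> CREATE PROJECTION customers_p (
      customer_key,
      gender ENCODING RLE )
   AS SELECT customer_key, gender FROM customers
   ORDER BY gender, customer_key
   SEGMENTED BY HASH(customer_key) ALL NODES;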

3 - Projections for queries with predicates

If your query contains one or more predicates, you can modify the projections to improve the query's performance, as described in the following two examples.

Queries that use date ranges

This example shows how to encode data using RLE and change the projection sort order to improve the performance of a query that retrieves all data within a given date range.

Suppose you have a query that looks like this:

=> SELECT * FROM trades
   WHERE trade_date BETWEEN '2016-11-01' AND '2016-12-01';

To optimize this query, determine whether all of the projections can perform the SELECT operation in a timely manner. Run a SELECT COUNT(*) statement for each projection, specifying the date range, and note the response time. For example:

=> SELECT COUNT(*) FROM [ projection_name ]
   WHERE trade_date BETWEEN '2016-11-01' AND '2016-12-01';

If one or more of the queries is slow, check the uniqueness of the trade_date column and determine if it needs to be in the projection’s ORDER BY clause and/or can be encoded using RLE. RLE replaces sequences of the same data values within a column by a pair that represents the value and a count. For best results, order the columns in the projection from lowest cardinality to highest cardinality, and use RLE to encode the data in low-cardinality columns.

If the data is unsorted, or if the average number of repeated values is less than 10, trade_date is too close to being unique and cannot be encoded using RLE. In this case, add a new column to minimize the search scope.

The following example adds a new column trade_year:

  1. Determine if the new column trade_year returns a manageable result set. The following query returns the data grouped by trade_year:

    => SELECT DATE_TRUNC('year', trade_date), COUNT(*)
       FROM trades
       GROUP BY DATE_TRUNC('year', trade_date);
    
  2. Assuming that the row count for trade_year = 2007 is near 8k, add a column for trade_year to the trades table. The SELECT statement then becomes:

    => SELECT * FROM trades
       WHERE trade_year = 2007
       AND trade_date BETWEEN '2016-11-01' AND '2016-12-01';
    

    As a result, you have a projection that is sorted on trade_year, which can be encoded using RLE.
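
For example, a minimal sketch of the resulting projection, assuming the trades table now contains trade_year and trade_date, plus a hypothetical stock_symbol column:

=> CREATE PROJECTION trades_p (
      trade_year ENCODING RLE,
      trade_date,
      stock_symbol )
   AS SELECT trade_year, trade_date, stock_symbol FROM trades
   ORDER BY trade_year, trade_date
   SEGMENTED BY HASH(stock_symbol) ALL NODES;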

Queries for tables with a high-cardinality primary key

This example demonstrates how you can modify the projection to improve the performance of queries that select data from a table with a high-cardinality primary key.

Suppose you have the following query:

=> SELECT * FROM [table]
   WHERE pk IN (12345, 12346, 12347,...);

Because the primary key is a high-cardinality column, Vertica has to search a large amount of data.

To optimize the schema for this query, create a new column named buckets and assign it the value of the primary key divided by 10000. In this example, buckets=(int) pk/10000. Use the buckets column to limit the search scope as follows:

=> SELECT * FROM [table]
   WHERE buckets IN (1,...)
   AND pk IN (12345, 12346, 12347,...);

Creating a lower cardinality column and adding it to the query limits the search scope and improves the query performance. In addition, if you create a projection where buckets is first in the sort order, the query may run even faster.
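
One possible way to add and populate the buckets column, assuming a hypothetical trades table with an integer pk column; a SET USING column is populated by calling REFRESH_COLUMNS:

=> ALTER TABLE trades ADD COLUMN buckets INT SET USING pk // 10000;  -- integer division mirrors (int) pk/10000
=> SELECT REFRESH_COLUMNS('trades', 'buckets', 'UPDATE');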

4 - GROUP BY queries

The following sections include several examples that show how you can design your projections to optimize the performance of your GROUP BY queries.

4.1 - GROUP BY implementation options

Vertica implements a query GROUP BY clause with one of these algorithms: GROUPBY PIPELINED or GROUPBY HASH. Both algorithms return the same results. Performance of both is generally similar for queries that return a small number of distinct groups—typically no more than a thousand per node.

You can use EXPLAIN to determine which algorithm the query optimizer chooses for a given query. The following conditions generally determine which algorithm is chosen:

  • GROUPBY PIPELINED requires all GROUP BY data to be specified in the projection's ORDER BY clause. For details, see GROUPBY PIPELINED Requirements below.

    Because GROUPBY PIPELINED only needs to retain in memory the current group data, this algorithm generally requires less memory and executes faster than GROUPBY HASH. Performance improvements are especially notable for queries that aggregate large numbers of distinct groups.

  • GROUPBY HASH is used for any query that does not comply with GROUPBY PIPELINED sort order requirements. In this case, Vertica must build a hash table on GROUP BY columns before it can start grouping the data.

GROUPBY PIPELINED requirements

You can enable use of the GROUPBY PIPELINED algorithm by ensuring that the query and one of its projections comply with GROUPBY PIPELINED requirements. The following conditions apply to GROUPBY PIPELINED. If none of them is true for the query, then Vertica uses GROUPBY HASH.

All examples that follow assume this schema:

CREATE TABLE sortopt (
    a INT NOT NULL,
    b INT NOT NULL,
    c INT,
    d INT
);
CREATE PROJECTION sortopt_p (
   a_proj,
   b_proj,
   c_proj,
   d_proj )
AS SELECT * FROM sortopt
ORDER BY a,b,c 
UNSEGMENTED ALL NODES;
INSERT INTO sortopt VALUES(5,2,13,84);
INSERT INTO sortopt VALUES(14,22,8,115);
INSERT INTO sortopt VALUES(79,9,401,33);

Condition 1

All GROUP BY columns are also included in the projection ORDER BY clause. For example:

  • GROUP BY a; a,b; b,a; a,b,c; or c,a,b: GROUPBY PIPELINED, because columns a, b, and c are included in the projection sort columns.

  • GROUP BY a,b,c,d: GROUPBY HASH, because column d is not part of the projection sort columns.

Condition 2

If the query's GROUP BY clause has fewer columns than the projection's ORDER BY clause, the GROUP BY columns must:

  • Be a subset of ORDER BY columns that are contiguous.

  • Include the first ORDER BY column.

For example:

  • GROUP BY a; a,b; or b,a: GROUPBY PIPELINED, because all GROUP BY columns are a subset of contiguous columns in the projection's ORDER BY clause {a,b,c}, and include column a.

  • GROUP BY a,c: GROUPBY HASH, because GROUP BY columns {a,c} are not contiguous in the projection ORDER BY clause {a,b,c}.

  • GROUP BY b,c: GROUPBY HASH, because GROUP BY columns {b,c} do not include the projection's first ORDER BY column a.

Condition 3

If a query's GROUP BY columns do not appear first in the projection's ORDER BY clause, then any early-appearing projection sort columns that are missing in the query's GROUP BY clause must be present as single-column constant equality predicates in the query's WHERE clause.

For example:

  • SELECT SUM(a) FROM sortopt WHERE a = 10 GROUP BY b: GROUPBY PIPELINED, because all columns preceding b in the projection sort order appear as constant equality predicates.

  • SELECT SUM(a) FROM sortopt WHERE a = 10 GROUP BY a, b: GROUPBY PIPELINED, because grouping column a is redundant but has no effect on algorithm selection.

  • SELECT SUM(a) FROM sortopt WHERE a = 10 GROUP BY b, c: GROUPBY PIPELINED, because all columns preceding b and c in the projection sort order appear as constant equality predicates.

  • SELECT SUM(a) FROM sortopt WHERE a = 10 GROUP BY c, b: GROUPBY PIPELINED, because all columns preceding b and c in the projection sort order appear as constant equality predicates.

  • SELECT SUM(a) FROM sortopt WHERE a = 10 GROUP BY c: GROUPBY HASH, because column b precedes c in the projection sort order but does not appear as a constant equality predicate.

Controlling GROUPBY algorithm choice

It is generally best to allow Vertica to determine which GROUP BY algorithm is best suited for a given query. Occasionally, you might want to use one algorithm over another. In such cases, you can qualify the GROUP BY clause with a GBYTYPE hint:

GROUP BY /*+ GBYTYPE( HASH | PIPE ) */

For example, given the following query, the query optimizer uses the GROUPBY PIPELINED algorithm:

=> EXPLAIN SELECT SUM(a) FROM sortopt GROUP BY a,b;
 ------------------------------
 QUERY PLAN DESCRIPTION:
 ------------------------------

 EXPLAIN SELECT SUM(a) FROM sortopt GROUP BY a,b;

 Access Path:
 +-GROUPBY PIPELINED [Cost: 11, Rows: 3 (NO STATISTICS)] (PATH ID: 1)
 |  Aggregates: sum(sortopt.a)
 |  Group By: sortopt.a, sortopt.b

...

You can use the GBYTYPE hint to force the query optimizer to use the GROUPBY HASH algorithm instead:

=> EXPLAIN SELECT SUM(a) FROM sortopt GROUP BY /*+GBYTYPE(HASH) */ a,b;
 ------------------------------
 QUERY PLAN DESCRIPTION:
 ------------------------------

 EXPLAIN SELECT SUM(a) FROM sortopt GROUP BY /*+GBYTYPE(HASH) */ a,b;

 Access Path:
 +-GROUPBY HASH (LOCAL RESEGMENT GROUPS) [Cost: 11, Rows: 3 (NO STATISTICS)] (PATH ID: 1)
 |  Aggregates: sum(sortopt.a)
 |  Group By: sortopt.a, sortopt.b

...

The GBYTYPE hint can specify a PIPE (GROUPBY PIPELINED algorithm) argument only if the query and one of its projections comply with GROUPBY PIPELINED requirements. Otherwise, Vertica issues a warning and uses GROUPBY HASH.

For example, the following query cannot use the GROUPBY PIPELINED algorithm, as the GROUP BY columns {b,c} do not include the projection's first ORDER BY column a:

=> SELECT SUM(a) FROM sortopt GROUP BY /*+GBYTYPE(PIPE) */ b,c;
WARNING 7765:  Cannot apply Group By Pipe algorithm. Proceeding with Group By Hash and hint will be ignored
 SUM
-----
  79
  14
   5
(3 rows)

4.2 - Avoiding resegmentation during GROUP BY optimization with projection design

To compute the correct result of a query that contains a GROUP BY clause, Vertica must ensure that all rows with the same value in the GROUP BY expressions end up at the same node for final computation. If the projection design already guarantees the data is segmented by the GROUP BY columns, no resegmentation is required at run time.

To avoid resegmentation, the GROUP BY clause must contain all the segmentation columns of the projection, but it can also contain other columns.

When your query includes a GROUP BY clause and joins, if the join depends on the results of the GROUP BY, as in the following example, Vertica performs the GROUP BY first:

=> EXPLAIN SELECT * FROM (SELECT b from foo GROUP BY b) AS F, foo WHERE foo.a = F.b;
Access Path:
+-JOIN MERGEJOIN(inputs presorted) [Cost: 649, Rows: 10K (NO STATISTICS)] (PATH ID: 1)
|  Join Cond: (foo.a = F.b)
|  Materialize at Output: foo.b
|  Execute on: All Nodes
| +-- Outer -> STORAGE ACCESS for foo [Cost: 202, Rows: 10K (NO STATISTICS)] (PATH ID: 2)
| |      Projection: public.foo_super
| |      Materialize: foo.a
| |      Execute on: All Nodes
| |      Runtime Filter: (SIP1(MergeJoin): foo.a)
| +-- Inner -> SELECT [Cost: 245, Rows: 10K (NO STATISTICS)] (PATH ID: 3)
| |      Execute on: All Nodes
| | +---> GROUPBY HASH (SORT OUTPUT) (GLOBAL RESEGMENT GROUPS) (LOCAL RESEGMENT GROUPS) [Cost: 245, Rows: 10K (NO STATISTICS)] (PATH ID: 4)
| | |      Group By: foo.b
| | |      Execute on: All Nodes
| | | +---> STORAGE ACCESS for foo [Cost: 202, Rows: 10K (NO STATISTICS)] (PATH ID: 5)
| | | |      Projection: public.foo_super
| | | |      Materialize: foo.b
| | | |      Execute on: All Nodes

If the result of the join operation is the input to the GROUP BY clause, Vertica performs the join first, as in the following example. The segmentation of those intermediate results may not be consistent with the GROUP BY clause in your query, resulting in resegmentation at run time.

=> EXPLAIN SELECT * FROM foo AS F, foo WHERE foo.a = F.b GROUP BY 1,2,3,4;
Access Path:
+-GROUPBY HASH (LOCAL RESEGMENT GROUPS) [Cost: 869, Rows: 10K (NO STATISTICS)] (PATH ID: 1)
|  Group By: F.a, F.b, foo.a, foo.b
|  Execute on: All Nodes
| +---> JOIN HASH [Cost: 853, Rows: 10K (NO STATISTICS)] (PATH ID: 2) Outer (RESEGMENT)(LOCAL ROUND ROBIN)
| |      Join Cond: (foo.a = F.b)
| |      Execute on: All Nodes
| | +-- Outer -> STORAGE ACCESS for F [Cost: 403, Rows: 10K (NO STATISTICS)] (PUSHED GROUPING) (PATH ID: 3)
| | |      Projection: public.foo_super
| | |      Materialize: F.a, F.b
| | |      Execute on: All Nodes
| | +-- Inner -> STORAGE ACCESS for foo [Cost: 403, Rows: 10K (NO STATISTICS)] (PATH ID: 4)
| | |      Projection: public.foo_super
| | |      Materialize: foo.a, foo.b
| | |      Execute on: All Nodes

If your query does not include joins, the GROUP BY clauses are processed using the existing database projections.

Examples

Assume the following projection:

CREATE PROJECTION ... SEGMENTED BY HASH(a,b) ALL NODES

The following examples show whether resegmentation occurs at run time and why.

  • GROUP BY a: requires resegmentation at run time, because the query does not contain all the projection segmentation columns.

  • GROUP BY a, b: does not require resegmentation at run time, because the GROUP BY clause contains all the projection segmentation columns.

  • GROUP BY a, b, c: does not require resegmentation at run time, because the GROUP BY clause contains all the projection segmentation columns.

  • GROUP BY a+1, b: requires resegmentation at run time because of the expression on column a.

To determine if resegmentation will occur during your GROUP BY query, look at the EXPLAIN-generated query plan.

For example, the following plan uses GROUPBY PIPELINED sort optimization and requires resegmentation to perform the GROUP BY calculation:

+-GROUPBY PIPELINED (RESEGMENT GROUPS) [Cost: 194, Rows: 10K (NO STATISTICS)] (PATH ID: 1)

The following plan uses GROUPBY PIPELINED sort optimization, but does not require resegmentation:

+-GROUPBY PIPELINED [Cost: 459, Rows: 10K (NO STATISTICS)] (PATH ID: 1)

5 - DISTINCT in a SELECT query list

This section describes how to optimize queries that have the DISTINCT keyword in their SELECT list. The techniques for optimizing DISTINCT queries are similar to the techniques for optimizing GROUP BY queries because when processing queries that use DISTINCT, the Vertica optimizer rewrites the query as a GROUP BY query.

The following sections describe specific situations.

Examples in these sections use the following table:

=> CREATE TABLE table1 (
    a INT,
    b INT,
    c INT
);

5.1 - Query has no aggregates in SELECT list

If your query has no aggregates in the SELECT list, internally, Vertica treats the query as if it uses GROUP BY instead.

For example, you can rewrite the following query:

SELECT DISTINCT a, b, c FROM table1;

as:

SELECT a, b, c FROM table1 GROUP BY a, b, c;

For fastest execution, apply the optimization techniques for GROUP BY queries described in GROUP BY queries.

5.2 - COUNT (DISTINCT) and other DISTINCT aggregates

Computing a DISTINCT aggregate generally requires more work than other aggregates. Also, a query that uses a single DISTINCT aggregate consumes fewer resources than a query with multiple DISTINCT aggregates.

Examples

The following query returns the number of distinct values in a column:

=> SELECT COUNT (DISTINCT date_key) FROM date_dimension;

 COUNT
-------
  1826
(1 row)

This example returns the number of distinct return values from an expression:

=> SELECT COUNT (DISTINCT date_key + product_key) FROM inventory_fact;

 COUNT
-------
 21560
(1 row)

You can create an equivalent query using the LIMIT keyword to restrict the number of rows returned:

=> SELECT COUNT(date_key + product_key) FROM inventory_fact GROUP BY date_key LIMIT 10;

 COUNT
-------
   173
    31
   321
   113
   286
    84
   244
   238
   145
   202
(10 rows)

The following query uses GROUP BY to count distinct values within groups:

=> SELECT product_key, COUNT (DISTINCT date_key) FROM inventory_fact
   GROUP BY product_key LIMIT 10;

 product_key | count
-------------+-------
           1 |    12
           2 |    18
           3 |    13
           4 |    17
           5 |    11
           6 |    14
           7 |    13
           8 |    17
           9 |    15
          10 |    12
(10 rows)

The following query returns the number of distinct products and the total inventory within each date key:

=> SELECT date_key, COUNT (DISTINCT product_key), SUM(qty_in_stock) FROM inventory_fact
   GROUP BY date_key LIMIT 10;

 date_key | count |  sum
----------+-------+--------
        1 |   173 |  88953
        2 |    31 |  16315
        3 |   318 | 156003
        4 |   113 |  53341
        5 |   285 | 148380
        6 |    84 |  42421
        7 |   241 | 119315
        8 |   238 | 122380
        9 |   142 |  70151
       10 |   202 |  95274
(10 rows)

This query selects each distinct product_key value and then counts the number of distinct date_key values for all records with the specific product_key value. It also counts the number of distinct warehouse_key values in all records with the specific product_key value:

=> SELECT product_key, COUNT (DISTINCT date_key), COUNT (DISTINCT warehouse_key) FROM inventory_fact
   GROUP BY product_key LIMIT 15;

 product_key | count | count
-------------+-------+-------
           1 |    12 |    12
           2 |    18 |    18
           3 |    13 |    12
           4 |    17 |    18
           5 |    11 |     9
           6 |    14 |    13
           7 |    13 |    13
           8 |    17 |    15
           9 |    15 |    14
          10 |    12 |    12
          11 |    11 |    11
          12 |    13 |    12
          13 |     9 |     7
          14 |    13 |    13
          15 |    18 |    17
(15 rows)

This query selects each distinct product_key value, counts the number of distinct date_key and warehouse_key values for all records with the specific product_key value, and then sums all qty_in_stock values in records with the specific product_key value. It then returns the number of product_version values in records with the specific product_key value:

=> SELECT product_key, COUNT (DISTINCT date_key),
      COUNT (DISTINCT warehouse_key),
      SUM (qty_in_stock),
      COUNT (product_version)
      FROM inventory_fact GROUP BY product_key LIMIT 15;

 product_key | count | count |  sum  | count
-------------+-------+-------+-------+-------
           1 |    12 |    12 |  5530 |    12
           2 |    18 |    18 |  9605 |    18
           3 |    13 |    12 |  8404 |    13
           4 |    17 |    18 | 10006 |    18
           5 |    11 |     9 |  4794 |    11
           6 |    14 |    13 |  7359 |    14
           7 |    13 |    13 |  7828 |    13
           8 |    17 |    15 |  9074 |    17
           9 |    15 |    14 |  7032 |    15
          10 |    12 |    12 |  5359 |    12
          11 |    11 |    11 |  6049 |    11
          12 |    13 |    12 |  6075 |    13
          13 |     9 |     7 |  3470 |     9
          14 |    13 |    13 |  5125 |    13
          15 |    18 |    17 |  9277 |    18
(15 rows)

5.3 - Approximate count distinct functions

The aggregate function COUNT(DISTINCT) computes the exact number of distinct values in a data set. COUNT(DISTINCT) performs well when it executes with the GROUPBY PIPELINED algorithm.

An aggregate COUNT operation performs well on a data set when the following conditions are true:

  • One of the target table's projections has an ORDER BY clause that facilitates sorted aggregation.

  • The number of distinct values is fairly small.

  • Hashed aggregation is not required to execute the query.

Alternatively, consider using the APPROXIMATE_COUNT_DISTINCT function instead of COUNT(DISTINCT) when the following conditions are true:

  • You have a large data set and you do not require an exact count of distinct values.

  • The performance of COUNT(DISTINCT) on a given data set is insufficient.

  • You calculate several distinct counts in the same query.

  • The plan for COUNT(DISTINCT) uses hashed aggregation.

The expected value that APPROXIMATE_COUNT_DISTINCT returns is equal to COUNT(DISTINCT), with an error that is lognormally distributed with standard deviation s. You can control the standard deviation by setting the function's optional error tolerance argument—by default, 1.25 percent.
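
For example, the following sketches reuse the date_dimension table from the earlier examples; the optional second argument sets the error tolerance:

=> SELECT APPROXIMATE_COUNT_DISTINCT(date_key) FROM date_dimension;     -- default error tolerance of 1.25 percent
=> SELECT APPROXIMATE_COUNT_DISTINCT(date_key, 5) FROM date_dimension;  -- wider error tolerance of 5 percent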

Other APPROXIMATE_COUNT_DISTINCT functions

Vertica supports two other functions that you can use together, instead of APPROXIMATE_COUNT_DISTINCT: APPROXIMATE_COUNT_DISTINCT_SYNOPSIS and APPROXIMATE_COUNT_DISTINCT_OF_SYNOPSIS. Use these functions when the following conditions are true:

  • You have a large data set and you don't require an exact count of distinct values.

  • The performance of COUNT(DISTINCT) on a given data set is insufficient.

  • You want to pre-compute the distinct counts and later combine them in different ways.

Use the two functions together as follows:

  1. Pass APPROXIMATE_COUNT_DISTINCT_SYNOPSIS the data set and a normally distributed confidence interval. The function returns a subset of the data, as a binary synopsis object.

  2. Pass the synopsis to the APPROXIMATE_COUNT_DISTINCT_OF_SYNOPSIS function, which then performs an approximate count distinct on the synopsis.

You can also use APPROXIMATE_COUNT_DISTINCT_SYNOPSIS_MERGE, which merges multiple synopses into one synopsis. With this function, you can continually update a "master" synopsis by merging in one or more synopses that cover more recent, shorter periods of time.

Example

The following example shows how to use APPROXIMATE_COUNT_DISTINCT functions to keep an approximate running count of users who click on a given web page within a given time span.

  1. Create the pviews table to store data about website visits—time of visit, web page visited, and visitor:

    
    => CREATE TABLE pviews(
    visit_time TIMESTAMP NOT NULL,
    page_id INTEGER NOT NULL,
    user_id INTEGER NOT NULL)
    ORDER BY page_id, visit_time
    SEGMENTED BY HASH(user_id) ALL NODES KSAFE
    PARTITION BY visit_time::DATE GROUP BY CALENDAR_HIERARCHY_DAY(visit_time::DATE, 2, 2);
    

    pviews is segmented by hashing user_id data, so all visits by a given user are stored on the same segment, on the same node. This prevents inefficient cross-node transfer of data when we later perform a COUNT (DISTINCT user_id).

    The table also uses hierarchical partitioning on time of visit to optimize the ROS storage. Doing so improves performance when filtering data by time.

  2. Load data into pviews:

    => INSERT INTO pviews VALUES
         ('2022-02-01 10:00:02',1002,1),
         ('2022-02-01 10:00:03',1002,2),
         ('2022-02-01 10:00:04',1002,1),
         ('2022-02-01 10:00:05',1002,3),
         ('2022-02-01 10:00:01',1000,1),
         ('2022-02-01 10:00:06',1002,1),
         ('2022-02-01 10:00:07',1002,3),
         ('2022-02-01 10:00:08',1002,1),
         ('2022-02-01 10:00:09',1002,3),
         ('2022-02-01 10:00:12',1002,2),
         ('2022-02-02 10:00:01',1000,1),
         ('2022-02-02 10:00:02',1002,4),
         ('2022-02-02 10:00:03',1002,2),
         ('2022-02-02 10:00:04',1002,1),
         ('2022-02-02 10:00:05',1002,3),
         ('2022-02-02 10:00:06',1002,4),
         ('2022-02-02 10:00:07',1002,3),
         ('2022-02-02 10:00:08',1002,4),
         ('2022-02-02 10:00:09',1002,3),
         ('2022-02-02 10:00:12',1002,2),
         ('2022-03-02 10:00:01',1000,1),
         ('2022-03-02 10:00:02',1002,1),
         ('2022-03-02 10:00:03',1002,2),
         ('2022-03-02 10:00:04',1002,1),
         ('2022-03-02 10:00:05',1002,3),
         ('2022-03-02 10:00:06',1002,4),
         ('2022-03-02 10:00:07',1002,3),
         ('2022-03-02 10:00:08',1002,6),
         ('2022-03-02 10:00:09',1002,5),
         ('2022-03-02 10:00:12',1002,2),
         ('2022-03-02 11:00:01',1000,5),
         ('2022-03-02 11:00:02',1002,6),
         ('2022-03-02 11:00:03',1002,7),
         ('2022-03-02 11:00:04',1002,4),
         ('2022-03-02 11:00:05',1002,1),
         ('2022-03-02 11:00:06',1002,6),
         ('2022-03-02 11:00:07',1002,8),
         ('2022-03-02 11:00:08',1002,6),
         ('2022-03-02 11:00:09',1002,7),
         ('2022-03-02 11:00:12',1002,1),
         ('2022-03-03 10:00:01',1000,1),
         ('2022-03-03 10:00:02',1002,2),
         ('2022-03-03 10:00:03',1002,4),
         ('2022-03-03 10:00:04',1002,1),
         ('2022-03-03 10:00:05',1002,2),
         ('2022-03-03 10:00:06',1002,6),
         ('2022-03-03 10:00:07',1002,9),
         ('2022-03-03 10:00:08',1002,10),
         ('2022-03-03 10:00:09',1002,7),
         ('2022-03-03 10:00:12',1002,1);
     OUTPUT
    --------
         50
    (1 row)
    
    => COMMIT;
    COMMIT
    
  3. Create the pview_summary table by querying pviews with CREATE TABLE...AS SELECT. Each row of this table summarizes data selected from pviews for a given date:

    • partial_visit_count stores the number of rows (website visits) in pviews with that date.

    • daily_users_acdp uses APPROXIMATE_COUNT_DISTINCT_SYNOPSIS to construct a synopsis that approximates the number of distinct users (user_id) who visited that website on that date.

    
    => CREATE TABLE pview_summary AS SELECT
          visit_time::DATE "date",
          COUNT(*) partial_visit_count,
          APPROXIMATE_COUNT_DISTINCT_SYNOPSIS(user_id) AS daily_users_acdp
       FROM pviews GROUP BY 1;
    CREATE TABLE
    => ALTER TABLE pview_summary ALTER COLUMN "date" SET NOT NULL;
    
  4. Update the pview_summary table so it is partitioned like pviews. The REORGANIZE keyword forces immediate repartitioning of the table data:

    => ALTER TABLE pview_summary
         PARTITION BY "date"
         GROUP BY CALENDAR_HIERARCHY_DAY("date", 2, 2) REORGANIZE;
    vsql:/home/ale/acd_ex4.sql:93: NOTICE 8364:  The new partitioning scheme will produce partitions in 2 physical storage containers per projection
    vsql:/home/ale/acd_ex4.sql:93: NOTICE 4785:  Started background repartition table task
    ALTER TABLE
    
  5. Use CREATE TABLE..LIKE to create two ETL tables, pviews_etl and pview_summary_etl with the same DDL as pviews and pview_summary, respectively. These tables serve to process incoming data:

    => CREATE TABLE pviews_etl LIKE pviews INCLUDING PROJECTIONS;
    CREATE TABLE
    => CREATE TABLE pview_summary_etl LIKE pview_summary INCLUDING PROJECTIONS;
    CREATE TABLE
    
  6. Load new data into pviews_etl:

    => INSERT INTO pviews_etl VALUES
         ('2022-03-03 11:00:01',1000,8),
         ('2022-03-03 11:00:02',1002,9),
         ('2022-03-03 11:00:03',1002,1),
         ('2022-03-03 11:00:04',1002,11),
         ('2022-03-03 11:00:05',1002,10),
         ('2022-03-03 11:00:06',1002,12),
         ('2022-03-03 11:00:07',1002,3),
         ('2022-03-03 11:00:08',1002,10),
         ('2022-03-03 11:00:09',1002,1),
         ('2022-03-03 11:00:12',1002,1);
     OUTPUT
    --------
         10
    (1 row)
    
    => COMMIT;
    COMMIT
    
  7. Summarize the new data in pview_summary_etl:

    => INSERT INTO pview_summary_etl SELECT
          visit_time::DATE visit_date,
          COUNT(*) partial_visit_count,
          APPROXIMATE_COUNT_DISTINCT_SYNOPSIS(user_id) AS daily_users_acdp
        FROM pviews_etl GROUP BY visit_date;
     OUTPUT
    --------
          1
    (1 row)
    
  8. Append the pviews_etl data to pviews with COPY_PARTITIONS_TO_TABLE:

    => SELECT COPY_PARTITIONS_TO_TABLE('pviews_etl', '01-01-0000'::DATE, '01-01-9999'::DATE, 'pviews');
                  COPY_PARTITIONS_TO_TABLE
    ----------------------------------------------------
     1 distinct partition values copied at epoch 1403.
    
    (1 row)
    
    => SELECT COPY_PARTITIONS_TO_TABLE('pview_summary_etl', '01-01-0000'::DATE, '01-01-9999'::DATE, 'pview_summary');
                  COPY_PARTITIONS_TO_TABLE
    ----------------------------------------------------
     1 distinct partition values copied at epoch 1404.
    
    (1 row)
    
  9. Create views and distinct (approximate) views by day for all data, including the partition that was just copied from pviews_etl:

    => SELECT
         "date" visit_date,
         SUM(partial_visit_count) visit_count,
         APPROXIMATE_COUNT_DISTINCT_OF_SYNOPSIS(daily_users_acdp) AS daily_users_acd
       FROM pview_summary GROUP BY visit_date ORDER BY visit_date;
     visit_date | visit_count | daily_users_acd
    ------------+-------------+-----------------
     2022-02-01 |          10 |               3
     2022-02-02 |          10 |               4
     2022-03-02 |          20 |               8
     2022-03-03 |          20 |              11
    (4 rows)
    
  10. Create views and distinct (approximate) views by month:

    => SELECT
         DATE_TRUNC('MONTH', "date")::DATE "month",
         SUM(partial_visit_count) visit_count,
         APPROXIMATE_COUNT_DISTINCT_OF_SYNOPSIS(daily_users_acdp) AS monthly_users_acd
       FROM pview_summary GROUP BY month ORDER BY month;
       month    | visit_count | monthly_users_acd
    ------------+-------------+-------------------
     2022-02-01 |          20 |                 4
     2022-03-01 |          40 |                12
    (2 rows)
    
  11. Merge daily synopses into monthly synopses:

    => CREATE TABLE pview_monthly_summary AS SELECT
         DATE_TRUNC('MONTH', "date")::DATE "month",
         SUM(partial_visit_count) partial_visit_count,
         APPROXIMATE_COUNT_DISTINCT_SYNOPSIS_MERGE(daily_users_acdp) AS monthly_users_acdp
       FROM pview_summary GROUP BY month ORDER BY month;
    CREATE TABLE
    
  12. Create views and distinct views by month, generated from the merged synopses:

    => SELECT
         month,
         SUM(partial_visit_count) monthly_visit_count,
         APPROXIMATE_COUNT_DISTINCT_OF_SYNOPSIS(monthly_users_acdp) AS monthly_users_acd
       FROM pview_monthly_summary GROUP BY month ORDER BY month;
       month    | monthly_visit_count | monthly_users_acd
    ------------+---------------------+-------------------
     2022-02-01 |                  20 |                 4
     2022-03-01 |                  40 |                12
    (2 rows)
    
  13. You can use the monthly summary to produce a yearly summary. This approach is likely to be faster than using a daily summary if a lot of data needs to be processed:

    => SELECT
         DATE_TRUNC('YEAR', "month")::DATE "year",
         SUM(partial_visit_count) yearly_visit_count,
         APPROXIMATE_COUNT_DISTINCT_OF_SYNOPSIS(monthly_users_acdp) AS yearly_users_acd
       FROM pview_monthly_summary GROUP BY year ORDER BY year;
        year    | yearly_visit_count | yearly_users_acd
    ------------+--------------------+------------------
     2022-01-01 |                 60 |               12
    (1 row)
    
  14. Drop the ETL tables:

    => DROP TABLE IF EXISTS pviews_etl, pview_summary_etl;
    DROP TABLE
    

5.4 - Single DISTINCT aggregates

Vertica computes a DISTINCT aggregate by first removing all duplicate values of the aggregate's argument to find the distinct values. Then it computes the aggregate.

For example, you can rewrite the following query:

SELECT a, b, COUNT(DISTINCT c) AS dcnt FROM table1 GROUP BY a, b;

as:

SELECT a, b, COUNT(dcnt) FROM
  (SELECT a, b, c AS dcnt FROM table1 GROUP BY a, b, c)
GROUP BY a, b;

For fastest execution, apply the optimization techniques for GROUP BY queries.

5.5 - Multiple DISTINCT aggregates

If your query has multiple DISTINCT aggregates, there is no straightforward SQL rewrite that can compute them. The following query cannot easily be rewritten for improved performance:

SELECT a, COUNT(DISTINCT b), COUNT(DISTINCT c) AS dcnt FROM table1 GROUP BY a;

For a query with multiple DISTINCT aggregates, there is no projection design that can avoid using GROUPBY HASH and resegmenting the data. To improve performance of this query, make sure that it has large amounts of memory available. For more information about memory allocation for queries, see Resource manager.

6 - JOIN queries

In general, you can optimize execution of queries that join multiple tables in several ways:

Other best practices

Vertica also executes joins more efficiently if the following conditions are true:

  • Query construction enables the query optimizer to create a plan where the larger table is defined as the outer input.

  • The columns on each side of the equality predicate are from the same table. For example in the following query, the left and right sides of the equality predicate include only columns from tables T and X, respectively:

    => SELECT * FROM T JOIN X ON T.a + T.b = X.x1 - X.x2;
    

    Conversely, the following query incurs more work to process, because the right side of the predicate includes columns from both tables T and X:

    => SELECT * FROM T JOIN X ON T.a = X.x1 + T.b;
    

6.1 - Hash joins versus merge joins

The Vertica optimizer implements a join with one of the following algorithms:

  • Merge join is used when projections of the joined tables are sorted on the join columns. Merge joins are faster and use less memory than hash joins.

  • Hash join is used when projections of the joined tables are not already sorted on the join columns. In this case, the optimizer builds an in-memory hash table on the inner table's join column. The optimizer then scans the outer table for matches to the hash table, and joins data from the two tables accordingly. The cost of performing a hash join is low if the entire hash table can fit in memory. Cost rises significantly if the hash table must be written to disk.

The optimizer automatically chooses the most appropriate algorithm to execute a query, given the projections that are available.

Facilitating merge joins

To facilitate a merge join, create projections for the joined tables that are sorted on the join predicate columns. The join predicate columns should be the first columns in the ORDER BY clause.

For example, tables first and second are defined as follows, with projections first_p1 and second_p1, respectively. The projections are sorted on data_first and data_second:

CREATE TABLE first ( id INT, data_first INT );
CREATE PROJECTION first_p1 AS SELECT * FROM first ORDER BY data_first;

CREATE TABLE second ( id INT, data_second INT );
CREATE PROJECTION second_p1 AS SELECT * FROM second ORDER BY data_second;

When you join these tables on unsorted columns first.id and second.id, Vertica uses the hash join algorithm:

 EXPLAIN SELECT first.data_first, second.data_second FROM first JOIN second ON first.id = second.id;

 Access Path:
 +-JOIN HASH [Cost: 752, Rows: 300K] (PATH ID: 1) Inner (BROADCAST)

You can facilitate execution of this query with the merge join algorithm by creating projections first_p2 and second_p2, which are sorted on join columns first_p2.id and second_p2.id, respectively:


CREATE PROJECTION first_p2 AS SELECT id, data_first FROM first ORDER BY id SEGMENTED BY hash(id, data_first) ALL NODES;
CREATE PROJECTION second_p2 AS SELECT id, data_second FROM second ORDER BY id SEGMENTED BY hash(id, data_second) ALL NODES;

If the query joins significant amounts of data, the query optimizer uses the merge algorithm:

EXPLAIN SELECT first.data_first, second.data_second FROM first JOIN second ON first.id = second.id;

 Access Path:
 +-JOIN MERGEJOIN(inputs presorted) [Cost: 731, Rows: 300K] (PATH ID: 1) Inner (BROADCAST)

You can also facilitate a merge join by using subqueries to pre-sort the join predicate columns. For example:

SELECT first.id, first.data_first, second.data_second FROM
  (SELECT * FROM first ORDER BY id ) first JOIN (SELECT * FROM second ORDER BY id) second ON first.id = second.id;

6.2 - Identical segmentation

To improve query performance when you join multiple tables, create projections that are identically segmented on the join keys. Identically-segmented projections allow the joins to occur locally on each node, thereby helping to reduce data movement across the network during query processing.

To determine if projections are identically-segmented on the query join keys, create a query plan with EXPLAIN. If the query plan contains RESEGMENT or BROADCAST, the projections are not identically segmented.

The Vertica optimizer chooses a projection to supply rows for each table in a query. If the projections to be joined are segmented, the optimizer evaluates their segmentation against the query join expressions. It thereby determines whether the rows are placed on each node so it can join them without fetching data from another node.

Join conditions for identically segmented projections

A projection p is segmented on join columns if all column references in p’s segmentation expression are a subset of the columns in the join expression.

The following conditions must be true for two segmented projections p1 of table t1 and p2 of table t2 to participate in a join of t1 to t2:

  • The join condition must have the following form:

    t1.j1 = t2.j1 AND t1.j2 = t2.j2 AND ... t1.jN = t2.jN
    
  • The join columns must share the same base data type. For example:

    • If t1.j1 is an INTEGER, t2.j1 can be an INTEGER but it cannot be a FLOAT.

    • If t1.j1 is a CHAR(10), t2.j1 can be any CHAR or VARCHAR (for example, CHAR(10), VARCHAR(10), VARCHAR(20)), but t2.j1 cannot be an INTEGER.

  • If p1 is segmented by an expression on columns {t1.s1, t1.s2, ... t1.sN}, each segmentation column t1.sX must be in the join column set {t1.jX}.

  • If p2 is segmented by an expression on columns {t2.s1, t2.s2, ... t2.sN}, each segmentation column t2.sX must be in the join column set {t2.jX}.

  • The segmentation expressions of p1 and p2 must be structurally equivalent. For example:

    • If p1 is SEGMENTED BY hash(t1.x) and p2 is SEGMENTED BY hash(t2.x), p1 and p2 are identically segmented.

    • If p1 is SEGMENTED BY hash(t1.x) and p2 is SEGMENTED BY hash(t2.x + 1), p1 and p2 are not identically segmented.

  • p1 and p2 must have the same segment count.

  • The assignment of segments to nodes must match. For example, if p1 and p2 use an OFFSET clause, their offsets must match.

  • If Vertica finds projections for t1 and t2 that are not identically segmented, the data is redistributed across the network during query run time, as necessary.

Examples

The following statements create two tables and specify to create identical segments:

=> CREATE TABLE t1 (id INT, x1 INT, y1 INT) SEGMENTED BY HASH(id, x1) ALL NODES;
=> CREATE TABLE t2 (id INT, x1 INT, y1 INT) SEGMENTED BY HASH(id, x1) ALL NODES;

Given this design, the join conditions in the following queries can leverage identical segmentation:

=> SELECT * FROM t1 JOIN t2 ON t1.id = t2.id;
=> SELECT * FROM t1 JOIN t2 ON t1.id = t2.id AND t1.x1 = t2.x1;

Conversely, the join conditions in the following queries require resegmentation:

=> SELECT * FROM t1 JOIN t2 ON t1.x1 = t2.x1;
=> SELECT * FROM t1 JOIN t2 ON t1.id = t2.x1;

6.3 - Joining variable length string data

When you join tables on VARCHAR columns, Vertica calculates how much storage space it requires to buffer join column data. It does so by formatting the column data in one of two ways:

  • Uses the join column metadata to size column data to a fixed length and buffer accordingly. For example, given a column that is defined as VARCHAR(1000), Vertica always buffers 1000 characters.

  • Uses the actual length of join column data, so buffer size varies for each join. For example, given a join on strings Xi, John, and Amrita, Vertica buffers only as much storage as it needs for each join—in this case, 2, 4, and 6 bytes, respectively.

The second approach can improve join query performance. It can also reduce memory consumption, which helps prevent join spills and minimize how often memory is borrowed from the resource manager. In general, these benefits are especially marked in cases where the defined size of a join column significantly exceeds the average length of its data.

Setting and verifying variable length formatting

You can control how Vertica implements joins at the session or database levels, through configuration parameter JoinDefaultTupleFormat, or for individual queries, through the JFMT hint. Vertica supports variable length formatting for all joins except merge and event series joins.
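
For example, the following sketch sets the parameter for the current session; valid values are 'fixed' and 'variable':

=> ALTER SESSION SET JoinDefaultTupleFormat = 'variable';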

Use EXPLAIN VERBOSE to verify whether a given query uses variable character formatting, by checking for these flags:

  • JF_EE_VARIABLE_FORMAT

  • JF_EE_FIXED_FORMAT

7 - ORDER BY queries

You can improve the performance of queries that contain only ORDER BY clauses if the columns in a projection's ORDER BY clause are the same as the columns in the query.

If you define the projection sort order in the CREATE PROJECTION statement, the Vertica query optimizer does not have to sort projection data before performing certain ORDER BY queries.

The following table, sortopt, contains the columns a, b, c, and d. Projection sortopt_p orders data on columns a, b, and c.

CREATE TABLE sortopt (
    a INT NOT NULL,
    b INT NOT NULL,
    c INT,
    d INT
);
CREATE PROJECTION sortopt_p (
   a_proj,
   b_proj,
   c_proj,
   d_proj )
AS SELECT * FROM sortopt
ORDER BY a,b,c 
UNSEGMENTED ALL NODES;
INSERT INTO sortopt VALUES(5,2,13,84);
INSERT INTO sortopt VALUES(14,22,8,115);
INSERT INTO sortopt VALUES(79,9,401,33);

Based on this sort order, if a SELECT * FROM sortopt query contains one of the following ORDER BY clauses, the query does not have to resort the projection:

  • ORDER BY a

  • ORDER BY a, b

  • ORDER BY a, b, c

For example, Vertica does not have to resort the projection in the following query because the sort order includes columns specified in the CREATE PROJECTION..ORDER BY a, b, c clause, which mirrors the query's ORDER BY a, b, c clause:

=> SELECT * FROM sortopt ORDER BY a, b, c;
 a  | b  |  c  |  d
----+----+-----+-----
  5 |  2 |  13 |  84
 14 | 22 |   8 | 115
 79 |  9 | 401 |  33
(3 rows)

If you include column d in the query, Vertica must re-sort the projection data because column d was not defined in the CREATE PROJECTION..ORDER BY clause. Therefore, the ORDER BY d query won't benefit from any sort optimization.

You cannot specify an ASC or DESC clause in the CREATE PROJECTION statement's ORDER BY clause. Vertica always uses an ascending sort order in physical storage, so if your query specifies descending order for any of its columns, the query still causes Vertica to re-sort the projection data. For example, the following query requires Vertica to sort the results:

=> SELECT * FROM sortopt ORDER BY a DESC, b, c;
 a  | b  |  c  |  d
----+----+-----+-----
 79 |  9 | 401 |  33
 14 | 22 |   8 | 115
  5 |  2 |  13 |  84
(3 rows)

See also

CREATE PROJECTION

8 - Analytic functions

The following sections describe how to optimize SQL-99 analytic functions that Vertica supports.

8.1 - Empty OVER clauses

The OVER() clause does not require a windowing clause. If your query uses an analytic function like SUM(x) and you specify an empty OVER() clause, the analytic function is used as a reporting function, where the entire input is treated as a single partition; the aggregate returns the same aggregated value for each row of the result set. The query executes on a single node, potentially resulting in poor performance.

If you add a PARTITION BY clause to the OVER() clause, the query executes on multiple nodes, improving its performance.
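
For example, the following sketch assumes an employee table with deptno and sal columns (such a table is defined later in this section):

=> SELECT deptno, sal, SUM(sal) OVER () AS total_sal FROM employee;                      -- single partition, one node
=> SELECT deptno, sal, SUM(sal) OVER (PARTITION BY deptno) AS dept_total FROM employee;  -- work distributed by deptno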

8.2 - NULL sort order

By default, projection column values are stored in ascending order, but placement of NULL values depends on a column's data type.

NULL placement differences with ORDER BY clauses

The analytic OVER(window-order-clause) and the SQL ORDER BY clause have slightly different semantics:

OVER(ORDER BY ...)

The analytic window order clause uses the ASC or DESC sort order to determine NULLS FIRST or NULLS LAST placement for analytic function results. NULL values are placed as follows:

  • ASC, NULLS LAST — NULL values appear at the end of the sorted result.

  • DESC, NULLS FIRST — NULL values appear at the beginning of the sorted result.

(SQL) ORDER BY

The SQL and Vertica ORDER BY clauses produce different results. The SQL ORDER BY clause specifies only ascending or descending sort order. The Vertica ORDER BY clause determines NULL placement based on the column data type:

  • NUMERIC, INTEGER, DATE, TIME, TIMESTAMP, and INTERVAL columns: NULLS FIRST (NULL values appear at the beginning of a sorted projection.)

  • FLOAT, STRING, and BOOLEAN columns: NULLS LAST (NULL values appear at the end of a sorted projection.)

NULL sort options

If you do not care about NULL placement in queries that involve analytic computations, or if you know that columns do not contain any NULL values, specify NULLS AUTO—irrespective of data type. Vertica chooses the placement that gives the fastest performance, as in the following query. Otherwise, specify NULLS FIRST or NULLS LAST.

=> SELECT x, RANK() OVER (ORDER BY x NULLS AUTO) FROM t;

You can carefully formulate queries so Vertica can avoid sorting the data and increase query performance, as illustrated by the following example. Vertica sorts inputs from table t on column x, as specified in the OVER(ORDER BY) clause, and then evaluates RANK():

=> CREATE TABLE t (
    x FLOAT,
    y FLOAT );
=> CREATE PROJECTION t_p (x, y) AS SELECT * FROM t
   ORDER BY x, y UNSEGMENTED ALL NODES;
=> SELECT x, RANK() OVER (ORDER BY x) FROM t;

In the preceding SELECT statement, Vertica eliminates the sort and executes the query quickly. Because column x is a FLOAT data type, the projection sort order matches the analytic default ordering (ASC + NULLS LAST). Vertica can also avoid having to sort the data when the underlying projection is already sorted.

However, if column x is an INTEGER data type, Vertica must sort the data because the projection sort order for INTEGER data types (ASC + NULLS FIRST) does not match the default analytic ordering (ASC + NULLS LAST). To help Vertica eliminate the sort, specify the placement of NULLs to match the default ordering:

=> SELECT x, RANK() OVER (ORDER BY x NULLS FIRST) FROM t;

If column x is a STRING, the following query eliminates the sort:

=> SELECT x, RANK() OVER (ORDER BY x NULLS LAST) FROM t;

If you omit NULLS LAST in the preceding query, Vertica still eliminates the sort because ASC + NULLS LAST is the default sort specification for both the analytic ORDER BY clause and for string-related columns in Vertica.

8.3 - Runtime sorting of NULL values in analytic functions

By carefully writing queries or creating your design (or both), you can help the Vertica query optimizer skip sorting all columns in a table when performing an analytic function, which can improve query performance.

To minimize Vertica's need to sort projections during query execution, redefine the employee table and specify that NULL values are not allowed in the sort fields:

=> DROP TABLE employee CASCADE;
=> CREATE TABLE employee
   (empno INT,
    deptno INT NOT NULL,
    sal INT NOT NULL);
CREATE TABLE
=> CREATE PROJECTION employee_p AS
   SELECT * FROM employee
   ORDER BY deptno, sal;
CREATE PROJECTION
=> INSERT INTO employee VALUES(101,10,50000);
=> INSERT INTO employee VALUES(103,10,43000);
=> INSERT INTO employee VALUES(104,10,45000);
=> INSERT INTO employee VALUES(105,20,97000);
=> INSERT INTO employee VALUES(108,20,33000);
=> INSERT INTO employee VALUES(109,20,51000);
=> COMMIT;
COMMIT
=> SELECT * FROM employee;
 empno | deptno |  sal
-------+--------+-------
   101 |     10 | 50000
   103 |     10 | 43000
   104 |     10 | 45000
   105 |     20 | 97000
   108 |     20 | 33000
   109 |     20 | 51000
(6 rows)
=> SELECT deptno, sal, empno, RANK() OVER
     (PARTITION BY deptno ORDER BY sal)
   FROM employee;
 deptno |  sal  | empno | ?column?
--------+-------+-------+----------
     10 | 43000 |   103 |        1
     10 | 45000 |   104 |        2
     10 | 50000 |   101 |        3
     20 | 33000 |   108 |        1
     20 | 51000 |   109 |        2
     20 | 97000 |   105 |        3
(6 rows)

9 - LIMIT queries

A query can include a LIMIT clause to limit its result set in two ways:

  • Return a subset of rows from the entire result set.

  • Set window partitions on the result set and limit the number of rows in each window.

Limiting the query result set

Queries that use the LIMIT clause with ORDER BY return a specific subset of rows from the queried dataset. Vertica processes these queries efficiently using Top-K optimization, which is a database query ranking process. Top-K optimization avoids sorting (and potentially writing to disk) an entire data set to find a small number of rows. This can significantly improve query performance.

For example, the following query returns the first 20 rows of data in table store.store_dimension, ordered by number_of_employees in descending order:

=> SELECT store_region, store_city||', '||store_state location, store_name, number_of_employees
     FROM store.store_dimension ORDER BY number_of_employees DESC LIMIT 20;
 store_region |       location       | store_name | number_of_employees
--------------+----------------------+------------+---------------------
 East         | Nashville, TN        | Store141   |                  50
 East         | Manchester, NH       | Store225   |                  50
 East         | Portsmouth, VA       | Store169   |                  50
 SouthWest    | Fort Collins, CO     | Store116   |                  50
 SouthWest    | Phoenix, AZ          | Store232   |                  50
 South        | Savannah, GA         | Store201   |                  50
 South        | Carrollton, TX       | Store8     |                  50
 West         | Rancho Cucamonga, CA | Store102   |                  50
 MidWest      | Lansing, MI          | Store105   |                  50
 West         | Provo, UT            | Store73    |                  50
 East         | Washington, DC       | Store180   |                  49
 MidWest      | Sioux Falls, SD      | Store45    |                  49
 NorthWest    | Seattle, WA          | Store241   |                  49
 SouthWest    | Las Vegas, NV        | Store104   |                  49
 West         | El Monte, CA         | Store100   |                  49
 SouthWest    | Fort Collins, CO     | Store20    |                  49
 East         | Lowell, MA           | Store57    |                  48
 SouthWest    | Arvada, CO           | Store188   |                  48
 MidWest      | Joliet, IL           | Store82    |                  48
 West         | Berkeley, CA         | Store248   |                  48
(20 rows)

Limiting window partitioning results

You can use LIMIT to set window partitioning on query results, and limit the number of rows that are returned in each window:

SELECT ... FROM dataset LIMIT num-rows OVER ( PARTITION BY column-expr-x ORDER BY column-expr-y [ASC | DESC] )

where querying dataset returns num-rows rows in each column-expr-x partition with the highest or lowest values of column-expr-y.

For example, the following statement queries table store.store_dimension and includes a LIMIT clause that specifies window partitioning. In this case, Vertica partitions the result set by store_region, where each partition window displays for one region the two stores with the fewest employees:

=> SELECT store_region, store_city||', '||store_state location, store_name, number_of_employees FROM store.store_dimension
     LIMIT 2 OVER (PARTITION BY store_region ORDER BY number_of_employees ASC);
 store_region |      location       | store_name | number_of_employees
--------------+---------------------+------------+---------------------
 West         | Norwalk, CA         | Store43    |                  10
 West         | Lancaster, CA       | Store95    |                  11
 East         | Stamford, CT        | Store219   |                  12
 East         | New York, NY        | Store122   |                  12
 SouthWest    | North Las Vegas, NV | Store170   |                  10
 SouthWest    | Phoenix, AZ         | Store228   |                  11
 NorthWest    | Bellevue, WA        | Store200   |                  19
 NorthWest    | Portland, OR        | Store39    |                  22
 MidWest      | South Bend, IN      | Store134   |                  10
 MidWest      | Evansville, IN      | Store30    |                  11
 South        | Mesquite, TX        | Store124   |                  10
 South        | Beaumont, TX        | Store226   |                  11
(12 rows)

10 - INSERT-SELECT operations

There are several ways to optimize an INSERT-SELECT query that has the following format:

INSERT /*+direct*/ INTO destination SELECT * FROM source;

10.1 - Matching sort orders

When performing INSERT-SELECT operations, to avoid the sort phase of the INSERT, make sure that the sort order for the SELECT query matches the projection sort order of the target table.

For example, on a single-node database:

=> CREATE TABLE source (col1 INT, col2 INT, col3 INT);
=> CREATE PROJECTION source_p (col1, col2, col3)
     AS SELECT col1, col2, col3 FROM source
     ORDER BY col1, col2, col3
     SEGMENTED BY HASH(col3)
     ALL NODES;
=> CREATE TABLE destination (col1 INT, col2 INT, col3 INT);
=> CREATE PROJECTION destination_p (col1, col2, col3)
     AS SELECT col1, col2, col3 FROM destination
     ORDER BY col1, col2, col3
     SEGMENTED BY HASH(col3)
     ALL NODES;

The following INSERT does not require a sort because the query result has the column order of the projection:

=> INSERT /*+direct*/ INTO destination SELECT * FROM source;

The following INSERT requires a sort because the order of the columns in the SELECT statement does not match the projection order:

=> INSERT /*+direct*/ INTO destination SELECT col1, col3, col2 FROM source;

The following INSERT does not require a sort. Although the column order in the SELECT list does not match the projection order, the explicit ORDER BY causes Vertica to sort the output by col1, col3, col2:

=> INSERT /*+direct*/ INTO destination SELECT col1, col3, col2 FROM source
      GROUP BY col1, col3, col2
      ORDER BY col1, col2, col3 ;
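
To verify whether a given INSERT requires a sort, you can inspect its plan with EXPLAIN. The following is a minimal sketch, using the source and destination tables defined above; if the resulting plan includes a sort path, the sort was not eliminated:

=> EXPLAIN INSERT /*+direct*/ INTO destination SELECT * FROM source;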

10.2 - Identical segmentation

When performing an INSERT-SELECT operation from a segmented source table to a segmented destination table, segment both projections on the same column to avoid resegmenting the data, as in the following example:

CREATE TABLE source (col1 INT, col2 INT, col3 INT);
CREATE PROJECTION source_p (col1, col2, col3) AS
   SELECT col1, col2, col3 FROM source
   SEGMENTED BY HASH(col3) ALL NODES;
CREATE TABLE destination (col1 INT, col2 INT, col3 INT);
CREATE PROJECTION destination_p (col1, col2, col3) AS
    SELECT col1, col2, col3 FROM destination
    SEGMENTED BY HASH(col3) ALL NODES;
INSERT /*+direct*/ INTO destination SELECT * FROM source;

11 - DELETE and UPDATE queries

Vertica is optimized for query-intensive workloads, so DELETE and UPDATE queries might not achieve the same level of performance as other queries. DELETE and UPDATE operations must update all projections, so these operations can be no faster than the slowest projection. For details, see Optimizing DELETE and UPDATE.
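
Because every projection anchored on the target table must be maintained, it can help to review how many projections a DELETE or UPDATE touches before you run it. The following query is a minimal sketch that uses the PROJECTIONS system table; the employee table name is borrowed from the earlier example:

=> SELECT projection_name
     FROM v_catalog.projections
     WHERE anchor_table_name = 'employee';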

12 - Data collector table queries

The Vertica Data Collector extends system table functionality by gathering and retaining information about your database cluster. The Data Collector makes this information available in system tables.

Vertica Analytic Database stores Data Collector data in the DataCollector directory under the Vertica catalog path. Use Data Collector information to query the past state of system tables and extract aggregate information.

In general, queries on Data Collector tables are more efficient when they include only the columns that contain the desired data. Queries are also more efficient when they avoid resegmentation and use time predicates, as described in the following sections.
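
For example, instead of selecting all columns, name only the columns that you need. The following sketch reads three columns from dc_requests_issued; the column names are the same ones used in the join example below:

=> SELECT time, session_id, request
     FROM dc_requests_issued
     WHERE time > 'April 7,2015'::timestamptz;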

Avoiding resegmentation

You can avoid resegmentation when you join the following DC tables on session_id or transaction_id, because all data is local:

  • dc_session_starts

  • dc_session_ends

  • dc_requests_issued

  • dc_requests_completed

Resegmentation is not required when a query includes the node_name column. For example:

=> SELECT dri.transaction_id, dri.request, drc.processed_row_count
    FROM dc_requests_issued dri
    JOIN dc_requests_completed drc
    USING (node_name, session_id, request_id)
    WHERE dri.time between 'April 7,2015'::timestamptz and 'April 8,2015'::timestamptz
    AND drc.time between 'April 7,2015'::timestamptz and 'April 8,2015'::timestamptz;

This query runs efficiently because:

  • The initiator node writes only to dc_requests_issued and dc_requests_completed.

  • Columns session_id and node_name are correlated.

Using time predicates

Use non-volatile functions and TIMESTAMP for the time range predicates. Vertica Analytic Database optimizes SQL performance for DC tables that use the time predicate.

Each DC table has a time column. Use this column to enter the time range as the query predicate.

For example, this query returns data for dates between September 1 and September 10:

=> SELECT * FROM dc_foo
    WHERE time > 'Sept 1, 2015'::timestamptz
    AND time < 'Sept 10, 2015'::timestamptz;

You can change the minimum and maximum time values to adjust the time range for which you want to retrieve data.

You must use non-volatile functions as time predicates; volatile functions cause queries to run inefficiently. The following example returns all queries that started and ended on April 7, 2015, but it runs at less than optimal performance because trunc and timestamp are volatile:

=> SELECT dri.transaction_id, dri.request, drc.processed_row_count
    FROM dc_requests_issued dri
    LEFT JOIN dc_requests_completed drc
    USING (session_id, request_id)
    WHERE trunc(dri.time, 'DDD') > 'April 7,2015'::timestamp
    AND trunc(drc.time, 'DDD') < 'April 8,2015'::timestamp;
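
A more efficient formulation compares the raw time column directly against timestamp literals, so no volatile function runs on every row. This sketch mirrors the efficient join shown in Avoiding resegmentation:

=> SELECT dri.transaction_id, dri.request, drc.processed_row_count
    FROM dc_requests_issued dri
    JOIN dc_requests_completed drc
    USING (node_name, session_id, request_id)
    WHERE dri.time BETWEEN 'April 7,2015'::timestamptz AND 'April 8,2015'::timestamptz
    AND drc.time BETWEEN 'April 7,2015'::timestamptz AND 'April 8,2015'::timestamptz;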