This is the multi-page printable view of this section. Click here to print.

Return to the regular view of this page.

GROUP BY paths

A GROUP BY operation has two algorithms:.

A GROUP BY operation has two algorithms:

  • GROUPBY HASH input is not sorted by the group columns, so Vertica builds a hash table on those group columns in order to process the aggregates and group by expressions.

  • GROUPBY PIPELINED requires that inputs be presorted on the columns specified in the group, which means that Vertica need only retain data in the current group in memory. GROUPBY PIPELINED operations are preferred because they are generally faster and require less memory than GROUPBY HASH. GROUPBY PIPELINED is especially useful for queries that process large numbers of high-cardinality group by columns or DISTINCT aggregates.

If possible, the query optimizer chooses the faster algorithm GROUPBY PIPELINED over GROUPBY HASH.

1 - GROUPBY HASH query plan

Here's an example of how GROUPBY HASH operations look in EXPLAIN output.

Here's an example of how GROUPBY HASH operations look in EXPLAIN output.

=> EXPLAIN SELECT COUNT(DISTINCT annual_income)
     FROM customer_dimension
     WHERE customer_region='NorthWest';

The output shows that the optimizer chose the less efficient GROUPBY HASH path, which means the projection was not presorted on the annual_income column. If such a projection is available, the optimizer would choose the GROUPBY PIPELINED algorithm.

Access Path:
+-GROUPBY NOTHING [Cost: 256, Rows: 1 (NO STATISTICS)] (PATH ID: 1)
|  Aggregates: count(DISTINCT customer_dimension.annual_income)
| +---> GROUPBY HASH (LOCAL RESEGMENT GROUPS) [Cost: 253, Rows: 10K (NO STATISTICS)] (PATH ID: 2)
| |      Group By: customer_dimension.annual_income
| | +---> STORAGE ACCESS for customer_dimension [Cost: 227, Rows: 50K (NO STATISTICS)] (PATH ID: 3)
| | |      Projection: public.customer_dimension_super
| | |      Materialize: customer_dimension.annual_income
| | |      Filter: (customer_dimension.customer_region = 'NorthWest'
...

2 - GROUPBY PIPELINED query plan

If you have a projection that is already sorted on the customer_gender column, the optimizer chooses the faster GROUPBY PIPELINED operation:.

If you have a projection that is already sorted on the customer_gender column, the optimizer chooses the faster GROUPBY PIPELINED operation:

 => EXPLAIN SELECT COUNT(distinct customer_gender) from customer_dimension;
Access Path:
 +-GROUPBY NOTHING [Cost: 22, Rows: 1] (PATH ID: 1)
 |  Aggregates: count(DISTINCT customer_dimension.customer_gender)
 |  Execute on: Query Initiator
 | +---> GROUPBY PIPELINED [Cost: 20, Rows: 10K] (PATH ID: 2)
 | |      Group By: customer_dimension.customer_gender
 | |      Execute on: Query Initiator
 | | +---> STORAGE ACCESS for customer_dimension [Cost: 17, Rows: 50K (3 RLE)] (PATH ID: 3)
 | | |      Projection: public.customer_dimension_DBD_1_rep_vmartdb_design_vmartdb_design_node0001
 | | |      Materialize: customer_dimension.customer_gender
 | | |      Execute on: Query Initiator

Similarly, the use of an equality predicate, such as in the following query, preserves GROUPBY PIPELINED:

=> EXPLAIN SELECT COUNT(DISTINCT annual_income)    FROM customer_dimension
   WHERE customer_gender = 'Female';

 Access Path: +-GROUPBY NOTHING [Cost: 161, Rows: 1] (PATH ID: 1)
 |  Aggregates: count(DISTINCT customer_dimension.annual_income)
 | +---> GROUPBY PIPELINED [Cost: 158, Rows: 10K] (PATH ID: 2)
 | |      Group By: customer_dimension.annual_income
 | | +---> STORAGE ACCESS for customer_dimension [Cost: 144, Rows: 47K] (PATH ID: 3)
 | | |      Projection: public.customer_dimension_DBD_1_rep_vmartdb_design_vmartdb_design_node0001
 | | |      Materialize: customer_dimension.annual_income
 | | |      Filter: (customer_dimension.customer_gender = 'Female')