GROUP BY queries

The following sections include several examples that show how you can design your projections to optimize the performance of your GROUP BY queries.

1: GROUP BY implementation options
2: Avoiding resegmentation during GROUP BY optimization with projection design

The following sections include several examples that show how you can design your projections to optimize the performance of your GROUP BY queries.

1 - GROUP BY implementation options

Vertica implements a query GROUP BY clause with one of these algorithms: GROUPBY PIPELINED or GROUPBY HASH.

Vertica implements a query GROUP BY clause with one of these algorithms: GROUPBY PIPELINED or GROUPBY HASH. Both algorithms return the same results. Performance of both is generally similar for queries that return a small number of distinct groups—typically a thousand per node .

You can use EXPLAIN to determine which algorithm the query optimizer chooses for a given query. The following conditions generally determine which algorithm is chosen:

GROUPBY PIPELINED requires all GROUP BY data to be specified in the projection's ORDER BY clause. For details, see GROUPBY PIPELINED Requirements below.

Because GROUPBY PIPELINED only needs to retain in memory the current group data, this algorithm generally requires less memory and executes faster than GROUPBY HASH. Performance improvements are especially notable for queries that aggregate large numbers of distinct groups.
GROUPBY HASH is used for any query that does not comply with GROUPBY PIPELINED sort order requirements. In this case, Vertica must build a hash table on GROUP BY columns before it can start grouping the data.

GROUPBY PIPELINED requirements

You can enable use of the GROUPBY PIPELINED algorithm by ensuring that the query and one of its projections comply with GROUPBY PIPELINED requirements. The following conditions apply to GROUPBY PIPELINED. If none of them is true for the query, then Vertica uses GROUPBY HASH.

All examples that follow assume this schema:

CREATE TABLE sortopt (
    a INT NOT NULL,
    b INT NOT NULL,
    c INT,
    d INT
);
CREATE PROJECTION sortopt_p (
   a_proj,
   b_proj,
   c_proj,
   d_proj )
AS SELECT * FROM sortopt
ORDER BY a,b,c 
UNSEGMENTED ALL NODES;
INSERT INTO sortopt VALUES(5,2,13,84);
INSERT INTO sortopt VALUES(14,22,8,115);
INSERT INTO sortopt VALUES(79,9,401,33);

Condition 1

All GROUP BY columns are also included in the projection ORDER BY clause. For example:

GROUP BY columns	GROUPBY algorithm	Reason chosen
`a` `a,b` `b,a` `a,b,c` `c,a,b`	GROUPBY PIPELINED	Columns `a`, `b`, and `c` are included in the projection sort columns.
`a,b,c,d`	GROUPBY HASH	Column `d` is not part of the projection sort columns.

Condition 2

If the query's GROUP BY clause has fewer columns than the projection's ORDER BY clause, the GROUP BY columns must:

Be a subset of ORDER BY columns that are contiguous.
Include the first ORDER BY column.

For example:

GROUP BY columns	GROUPBY algorithm	Reason chosen
`a` `a,b` `b,a`	GROUPBY PIPELINED	All GROUP BY columns are a subset of contiguous columns in the projection's ORDER BY clause `{a,b,c}`, and include column `a`.
`a,c`	GROUPBY HASH	GROUP BY columns `{a,c}` are not contiguous in the projection ORDER BY clause `{a,b,c}`.
`b,c`	GROUPBY HASH	GROUP BY columns `{b,c`} do not include the projection's first ORDER BY column `a`.

Condition 3

If a query's GROUP BY columns do not appear first in the projection's ORDER BY clause, then any early-appearing projection sort columns that are missing in the query's GROUP BY clause must be present as single-column constant equality predicates in the query's WHERE clause.

For example:

Query	GROUPBY algorithm	Reason chosen
`SELECT SUM(a) FROM sortopt WHERE a = 10 GROUP BY b`	GROUPBY PIPELINED	All columns preceding `b` in the projection sort order appear as constant equality predicates.
`SELECT SUM(a) FROM sortopt WHERE a = 10 GROUP BY a, b`	GROUPBY PIPELINED	Grouping column `a` is redundant but has no effect on algorithm selection.
`SELECT SUM(a) FROM sortopt WHERE a = 10 GROUP BY b, c`	GROUPBY PIPELINED	All columns preceding `b` and `c` in the projection sort order appear as constant equality predicates.
`SELECT SUM(a) FROM sortopt WHERE a = 10 GROUP BY c, b`	GROUPBY PIPELINED	All columns preceding `b` and `c` in the projection sort order appear as constant equality predicates.
`SELECT SUM(a) FROM sortopt WHERE a = 10 GROUP BY c`	GROUPBY HASH	All columns preceding `c` in the projection sort order do not appear as constant equality predicates.

Controlling GROUPBY algorithm choice

It is generally best to allow Vertica to determine which GROUP BY algorithm is best suited for a given query. Occasionally, you might want to use one algorithm over another. In such cases, you can qualify the GROUP BY clause with a GBYTYPE hint:

GROUP BY /*+ GBYTYPE( HASH | PIPE ) */

For example, given the following query, the query optimizer uses the GROUPBY PIPELINED algorithm:

=> EXPLAIN SELECT SUM(a) FROM sortopt GROUP BY a,b;
 ------------------------------
 QUERY PLAN DESCRIPTION:
 ------------------------------

 EXPLAIN SELECT SUM(a) FROM sortopt GROUP BY a,b;

 Access Path:
 +-GROUPBY PIPELINED [Cost: 11, Rows: 3 (NO STATISTICS)] (PATH ID: 1)
 |  Aggregates: sum(sortopt.a)
 |  Group By: sortopt.a, sortopt.b

...

You can use the GBYTYPE hint to force the query optimizer to use the GROUPBY HASH algorithm instead:

=> EXPLAIN SELECT SUM(a) FROM sortopt GROUP BY /*+GBYTYPE(HASH) */ a,b;
 ------------------------------
 QUERY PLAN DESCRIPTION:
 ------------------------------

 EXPLAIN SELECT SUM(a) FROM sortopt GROUP BY /*+GBYTYPE(HASH) */ a,b;

 Access Path:
 +-GROUPBY HASH (LOCAL RESEGMENT GROUPS) [Cost: 11, Rows: 3 (NO STATISTICS)] (PATH ID: 1)
 |  Aggregates: sum(sortopt.a)
 |  Group By: sortopt.a, sortopt.b

...

The GBYTYPE hint can specify a PIPE (GROUPBY PIPELINED algorithm) argument only if the query and one of its projections comply with GROUPBY PIPELINED requirements. Otherwise, Vertica issues a warning and uses GROUPBY HASH.

For example, the following query cannot use the GROUPBY PIPELINED algorithm, as the GROUP BY columns {b,c} do not include the projection's first ORDER BY column a:

=> SELECT SUM(a) FROM sortopt GROUP BY /*+GBYTYPE(PIPE) */ b,c;
WARNING 7765:  Cannot apply Group By Pipe algorithm. Proceeding with Group By Hash and hint will be ignored
 SUM
-----
  79
  14
   5
(3 rows)

2 - Avoiding resegmentation during GROUP BY optimization with projection design

To compute the correct result of a query that contains a GROUP BY clause, Vertica must ensure that all rows with the same value in the GROUP BY expressions end up at the same node for final computation.

To compute the correct result of a query that contains a GROUP BY clause, Vertica must ensure that all rows with the same value in the GROUP BY expressions end up at the same node for final computation. If the projection design already guarantees the data is segmented by the GROUP BY columns, no resegmentation is required at run time.

To avoid resegmentation, the GROUP BY clause must contain all the segmentation columns of the projection, but it can also contain other columns.

When your query includes a GROUP BY clause and joins, if the join depends on the results of the GROUP BY, as in the following example, Vertica performs the GROUP BY first:

=> EXPLAIN SELECT * FROM (SELECT b from foo GROUP BY b) AS F, foo WHERE foo.a = F.b;
Access Path:
+-JOIN MERGEJOIN(inputs presorted) [Cost: 649, Rows: 10K (NO STATISTICS)] (PATH ID: 1)
|  Join Cond: (foo.a = F.b)
|  Materialize at Output: foo.b
|  Execute on: All Nodes
| +-- Outer -> STORAGE ACCESS for foo [Cost: 202, Rows: 10K (NO STATISTICS)] (PATH ID: 2)
| |      Projection: public.foo_super
| |      Materialize: foo.a
| |      Execute on: All Nodes
| |      Runtime Filter: (SIP1(MergeJoin): foo.a)
| +-- Inner -> SELECT [Cost: 245, Rows: 10K (NO STATISTICS)] (PATH ID: 3)
| |      Execute on: All Nodes
| | +---> GROUPBY HASH (SORT OUTPUT) (GLOBAL RESEGMENT GROUPS) (LOCAL RESEGMENT GROUPS) [Cost: 245, Rows: 10K (NO STATISTICS)] (PATH ID:
4)
| | |      Group By: foo.b
| | |      Execute on: All Nodes
| | | +---> STORAGE ACCESS for foo [Cost: 202, Rows: 10K (NO STATISTICS)] (PATH ID: 5)
| | | |      Projection: public.foo_super
| | | |      Materialize: foo.b
| | | |      Execute on: All Nodes

If the result of the join operation is the input to the GROUP BY clause, Vertica performs the join first, as in the following example. The segmentation of those intermediate results may not be consistent with the GROUP BY clause in your query, resulted in resegmentation at run time.

=> EXPLAIN SELECT * FROM foo AS F, foo WHERE foo.a = F.b GROUP BY 1,2,3,4;
Access Path:
+-GROUPBY HASH (LOCAL RESEGMENT GROUPS) [Cost: 869, Rows: 10K (NO STATISTICS)] (PATH ID: 1)
|  Group By: F.a, F.b, foo.a, foo.b
|  Execute on: All Nodes
| +---> JOIN HASH [Cost: 853, Rows: 10K (NO STATISTICS)] (PATH ID: 2) Outer (RESEGMENT)(LOCAL ROUND ROBIN)
| |      Join Cond: (foo.a = F.b)
| |      Execute on: All Nodes
| | +-- Outer -> STORAGE ACCESS for F [Cost: 403, Rows: 10K (NO STATISTICS)] (PUSHED GROUPING) (PATH ID: 3)
| | |      Projection: public.foo_super
| | |      Materialize: F.a, F.b
| | |      Execute on: All Nodes
| | +-- Inner -> STORAGE ACCESS for foo [Cost: 403, Rows: 10K (NO STATISTICS)] (PATH ID: 4)
| | |      Projection: public.foo_super
| | |      Materialize: foo.a, foo.b
| | |      Execute on: All Nodes

If your query does not include joins, the GROUP BY clauses are processed using the existing database projections.

Examples

Assume the following projection:

CREATE PROJECTION ... SEGMENTED BY HASH(a,b) ALL NODES

The following table explains whether or not resegmentation occurs at run time and why.

`GROUP BY a`	Requires resegmentation at run time. The query does not contain all the projection segmentation columns.
`GROUP BY a, b`	Does not require resegmentation at run time. The `GROUP BY` clause contains all the projection segmentation columns.
`GROUP BY a, b, c`	Does not require resegmentation at run time. The `GROUP BY` clause contains all the projection segmentation columns.
`GROUP BY a+1, b`	Requires resegmentation at run time because of the expression on column `a`.

To determine if resegmentation will occurs during your GROUP BY query, look at the EXPLAIN-generated query plan.

For example, the following plan uses GROUPBY PIPELINED sort optimization and requires resegmentation to perform the GROUP BY calculation:

+-GROUPBY PIPELINED (RESEGMENT GROUPS) [Cost: 194, Rows: 10K (NO STATISTICS)] (PATH ID: 1)

The following plan uses GROUPBY PIPELINED sort optimization, but does not require resegmentation:

+-GROUPBY PIPELINED [Cost: 459, Rows: 10K (NO STATISTICS)] (PATH ID: 1)