Avoiding resegmentation during GROUP BY optimization with projection design

To compute the correct result of a query that contains a GROUP BY clause, Vertica must ensure that all rows with the same value in the GROUP BY expressions end up at the same node for final computation.

To compute the correct result of a query that contains a GROUP BY clause, Vertica must ensure that all rows with the same value in the GROUP BY expressions end up at the same node for final computation. If the projection design already guarantees the data is segmented by the GROUP BY columns, no resegmentation is required at run time.

To avoid resegmentation, the GROUP BY clause must contain all the segmentation columns of the projection, but it can also contain other columns.

When your query includes a GROUP BY clause and joins, if the join depends on the results of the GROUP BY, as in the following example, Vertica performs the GROUP BY first:

=> EXPLAIN SELECT * FROM (SELECT b from foo GROUP BY b) AS F, foo WHERE foo.a = F.b;
Access Path:
+-JOIN MERGEJOIN(inputs presorted) [Cost: 649, Rows: 10K (NO STATISTICS)] (PATH ID: 1)
|  Join Cond: (foo.a = F.b)
|  Materialize at Output: foo.b
|  Execute on: All Nodes
| +-- Outer -> STORAGE ACCESS for foo [Cost: 202, Rows: 10K (NO STATISTICS)] (PATH ID: 2)
| |      Projection: public.foo_super
| |      Materialize: foo.a
| |      Execute on: All Nodes
| |      Runtime Filter: (SIP1(MergeJoin): foo.a)
| +-- Inner -> SELECT [Cost: 245, Rows: 10K (NO STATISTICS)] (PATH ID: 3)
| |      Execute on: All Nodes
| | +---> GROUPBY HASH (SORT OUTPUT) (GLOBAL RESEGMENT GROUPS) (LOCAL RESEGMENT GROUPS) [Cost: 245, Rows: 10K (NO STATISTICS)] (PATH ID:
4)
| | |      Group By: foo.b
| | |      Execute on: All Nodes
| | | +---> STORAGE ACCESS for foo [Cost: 202, Rows: 10K (NO STATISTICS)] (PATH ID: 5)
| | | |      Projection: public.foo_super
| | | |      Materialize: foo.b
| | | |      Execute on: All Nodes

If the result of the join operation is the input to the GROUP BY clause, Vertica performs the join first, as in the following example. The segmentation of those intermediate results may not be consistent with the GROUP BY clause in your query, resulted in resegmentation at run time.

=> EXPLAIN SELECT * FROM foo AS F, foo WHERE foo.a = F.b GROUP BY 1,2,3,4;
Access Path:
+-GROUPBY HASH (LOCAL RESEGMENT GROUPS) [Cost: 869, Rows: 10K (NO STATISTICS)] (PATH ID: 1)
|  Group By: F.a, F.b, foo.a, foo.b
|  Execute on: All Nodes
| +---> JOIN HASH [Cost: 853, Rows: 10K (NO STATISTICS)] (PATH ID: 2) Outer (RESEGMENT)(LOCAL ROUND ROBIN)
| |      Join Cond: (foo.a = F.b)
| |      Execute on: All Nodes
| | +-- Outer -> STORAGE ACCESS for F [Cost: 403, Rows: 10K (NO STATISTICS)] (PUSHED GROUPING) (PATH ID: 3)
| | |      Projection: public.foo_super
| | |      Materialize: F.a, F.b
| | |      Execute on: All Nodes
| | +-- Inner -> STORAGE ACCESS for foo [Cost: 403, Rows: 10K (NO STATISTICS)] (PATH ID: 4)
| | |      Projection: public.foo_super
| | |      Materialize: foo.a, foo.b
| | |      Execute on: All Nodes

If your query does not include joins, the GROUP BY clauses are processed using the existing database projections.

Examples

Assume the following projection:

CREATE PROJECTION ... SEGMENTED BY HASH(a,b) ALL NODES

The following table explains whether or not resegmentation occurs at run time and why.

GROUP BY a Requires resegmentation at run time. The query does not contain all the projection segmentation columns.
GROUP BY a, b Does not require resegmentation at run time. The GROUP BY clause contains all the projection segmentation columns.
GROUP BY a, b, c Does not require resegmentation at run time. The GROUP BY clause contains all the projection segmentation columns.
GROUP BY a+1, b Requires resegmentation at run time because of the expression on column a.

To determine if resegmentation will occurs during your GROUP BY query, look at the EXPLAIN-generated query plan.

For example, the following plan uses GROUPBY PIPELINED sort optimization and requires resegmentation to perform the GROUP BY calculation:

+-GROUPBY PIPELINED (RESEGMENT GROUPS) [Cost: 194, Rows: 10K (NO STATISTICS)] (PATH ID: 1)

The following plan uses GROUPBY PIPELINED sort optimization, but does not require resegmentation:

+-GROUPBY PIPELINED [Cost: 459, Rows: 10K (NO STATISTICS)] (PATH ID: 1)