使用投影设计避免在 GROUP BY 优化期间重新分段

要计算包含 GROUP BY 子句的查询的正确结果,Vertica 必须确保 GROUP BY 表达式中具有相同值的所有行最终都会在相同节点上以便于最终计算。如果投影设计已经确保数据按 GROUP BY 列进行分段,那么运行时则不需要重新分段。

要避免重新分段,GROUP BY 子句必须包含投影的所有分段列,但是它也可以包含其他列。

当查询包括 GROUP BY 子句和联接时,如果联接取决于以下示例所示的 GROUP BY 的结果,那么 Vertica 会首先执行 GROUP BY

=> EXPLAIN SELECT * FROM (SELECT b from foo GROUP BY b) AS F, foo WHERE foo.a = F.b;
Access Path:
+-JOIN MERGEJOIN(inputs presorted) [Cost: 649, Rows: 10K (NO STATISTICS)] (PATH ID: 1)
|  Join Cond: (foo.a = F.b)
|  Materialize at Output: foo.b
|  Execute on: All Nodes
| +-- Outer -> STORAGE ACCESS for foo [Cost: 202, Rows: 10K (NO STATISTICS)] (PATH ID: 2)
| |      Projection: public.foo_super
| |      Materialize: foo.a
| |      Execute on: All Nodes
| |      Runtime Filter: (SIP1(MergeJoin): foo.a)
| +-- Inner -> SELECT [Cost: 245, Rows: 10K (NO STATISTICS)] (PATH ID: 3)
| |      Execute on: All Nodes
| | +---> GROUPBY HASH (SORT OUTPUT) (GLOBAL RESEGMENT GROUPS) (LOCAL RESEGMENT GROUPS) [Cost: 245, Rows: 10K (NO STATISTICS)] (PATH ID:
4)
| | |      Group By: foo.b
| | |      Execute on: All Nodes
| | | +---> STORAGE ACCESS for foo [Cost: 202, Rows: 10K (NO STATISTICS)] (PATH ID: 5)
| | | |      Projection: public.foo_super
| | | |      Materialize: foo.b
| | | |      Execute on: All Nodes

如果联接操作的结果是 GROUP BY 子句的输入,Vertica 会首先执行联接,如以下示例所示。这些中间结果的分段可能不会与查询的 GROUP BY 子句保持一致,从而导致运行时要重新分段。

=> EXPLAIN SELECT * FROM foo AS F, foo WHERE foo.a = F.b GROUP BY 1,2,3,4;
Access Path:
+-GROUPBY HASH (LOCAL RESEGMENT GROUPS) [Cost: 869, Rows: 10K (NO STATISTICS)] (PATH ID: 1)
|  Group By: F.a, F.b, foo.a, foo.b
|  Execute on: All Nodes
| +---> JOIN HASH [Cost: 853, Rows: 10K (NO STATISTICS)] (PATH ID: 2) Outer (RESEGMENT)(LOCAL ROUND ROBIN)
| |      Join Cond: (foo.a = F.b)
| |      Execute on: All Nodes
| | +-- Outer -> STORAGE ACCESS for F [Cost: 403, Rows: 10K (NO STATISTICS)] (PUSHED GROUPING) (PATH ID: 3)
| | |      Projection: public.foo_super
| | |      Materialize: F.a, F.b
| | |      Execute on: All Nodes
| | +-- Inner -> STORAGE ACCESS for foo [Cost: 403, Rows: 10K (NO STATISTICS)] (PATH ID: 4)
| | |      Projection: public.foo_super
| | |      Materialize: foo.a, foo.b
| | |      Execute on: All Nodes

如果查询不包括联接,GROUP BY 子句会使用现有数据库投影进行处理。

示例

假设有以下投影:

CREATE PROJECTION ... SEGMENTED BY HASH(a,b) ALL NODES

下表说明了运行时是否会发生重新分段以及相关原因。

要确定在 GROUP BY 查询期间是否会发生重新分段,请查看 EXPLAIN 生成的查询计划。

例如,以下计划使用了 GROUPBY PIPELINED 排序优化,并需要重新分段以执行 GROUP BY 计算:

+-GROUPBY PIPELINEDRESEGMENT GROUPS[Cost: 194, Rows: 10K (NO STATISTICS)] (PATH ID: 1)

以下计划使用了 GROUPBY PIPELINED 排序优化,但是不需要重新分段:

+-GROUPBY PIPELINED [Cost: 459, Rows: 10K (NO STATISTICS)] (PATH ID: 1)