JOIN queries

1: Hash joins versus merge joins
2: Identical segmentation
3: Joining variable length string data

In general, you can optimize execution of queries that join multiple tables in several ways:

Create projections for the joined tables that are sorted on join predicate columns. This facilitates use of the merge join algorithm, which generally joins tables more efficiently than the hash join algorithm.
Create projections that are identically segmented on the join keys.

Other best practices

Vertica also executes joins more efficiently if the following conditions are true:

Query construction enables the query optimizer to create a plan where the larger table is defined as the outer input.
The columns on each side of the equality predicate are from the same table. For example in the following query, the left and right sides of the equality predicate include only columns from tables T and X, respectively:
```
=> SELECT * FROM T JOIN X ON T.a + T.b = X.x1 - X.x2;
```
Conversely, the following query incurs more work to process, because the right side of the predicate includes columns from both tables T and X:
```
=> SELECT * FROM T JOIN X ON T.a = X.x1 + T.b;
```
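
The difference between the two predicates can be illustrated with a small conceptual sketch in Python. It is not a description of Vertica internals: the row layout and the function names `join_on_keys` and `join_on_mixed_predicate` are assumptions made for illustration. When each side of the equality reads columns from one table only, a key can be computed per row on each side and matched through a hash table. When a side mixes columns from both tables, a naive engine has to evaluate the predicate for each candidate pair of rows.

```python
from collections import defaultdict

def join_on_keys(outer_rows, inner_rows, outer_key, inner_key):
    """Equi-join where each key function reads columns from one table only."""
    table = defaultdict(list)
    for row in inner_rows:
        table[inner_key(row)].append(row)
    return [(o, i) for o in outer_rows for i in table.get(outer_key(o), [])]

def join_on_mixed_predicate(outer_rows, inner_rows, predicate):
    """Naive join for a predicate whose sides mix columns from both tables:
    the predicate is evaluated for every pair of rows."""
    return [(o, i) for o in outer_rows for i in inner_rows if predicate(o, i)]

# Illustrative data only.
T = [{"a": 1, "b": 2}, {"a": 5, "b": 0}]
X = [{"x1": 4, "x2": 1}, {"x1": 5, "x2": 0}]

# T.a + T.b = X.x1 - X.x2: one key per row, computed independently per table.
separable = join_on_keys(
    T, X,
    outer_key=lambda t: t["a"] + t["b"],
    inner_key=lambda x: x["x1"] - x["x2"],
)

# T.a = X.x1 + T.b: the right side needs a row from each table.
mixed = join_on_mixed_predicate(T, X, lambda t, x: t["a"] == x["x1"] + t["b"])
```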

1 - Hash joins versus merge joins

The Vertica optimizer implements a join with one of the following algorithms:

Merge join is used when projections of the joined tables are sorted on the join columns. Merge joins are faster and use less memory than hash joins.
Hash join is used when projections of the joined tables are not already sorted on the join columns. In this case, the optimizer builds an in-memory hash table on the inner table's join column. The optimizer then scans the outer table for matches to the hash table, and joins data from the two tables accordingly. The cost of performing a hash join is low if the entire hash table can fit in memory. Cost rises significantly if the hash table must be written to disk.

The optimizer automatically chooses the most appropriate algorithm to execute a query, given the projections that are available.
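
As a rough illustration of the two algorithms, the following Python sketch joins two small in-memory inputs both ways. It is a textbook-style simplification under stated assumptions, not Vertica's implementation: rows are dictionaries, `hash_join` and `merge_join` are hypothetical helper names, and `merge_join` assumes both inputs are already sorted on the join key.

```python
from collections import defaultdict

def hash_join(outer, inner, key):
    """Build a hash table on the inner input, then probe it while scanning the outer input."""
    table = defaultdict(list)
    for row in inner:
        table[row[key]].append(row)
    result = []
    for o in outer:
        for i in table.get(o[key], []):
            result.append((o, i))
    return result

def merge_join(outer, inner, key):
    """Join two inputs that are already sorted on key by advancing two cursors.
    No hash table is built, so no extra memory is needed for one."""
    result = []
    i = j = 0
    while i < len(outer) and j < len(inner):
        ko, ki = outer[i][key], inner[j][key]
        if ko < ki:
            i += 1
        elif ko > ki:
            j += 1
        else:
            # Find the run of inner rows that share this key.
            j_end = j
            while j_end < len(inner) and inner[j_end][key] == ko:
                j_end += 1
            # Pair every outer row with this key against that run.
            while i < len(outer) and outer[i][key] == ko:
                for r in inner[j:j_end]:
                    result.append((outer[i], r))
                i += 1
            j = j_end
    return result

# Illustrative data, sorted on id.
first = [
    {"id": 1, "data_first": 10},
    {"id": 2, "data_first": 20},
    {"id": 2, "data_first": 21},
    {"id": 4, "data_first": 40},
]
second = [
    {"id": 2, "data_second": 200},
    {"id": 3, "data_second": 300},
    {"id": 4, "data_second": 400},
    {"id": 4, "data_second": 401},
]

hashed = hash_join(first, second, "id")
merged = merge_join(first, second, "id")
```

Both functions return the same four row pairs for this data. The merge variant relies entirely on the inputs being presorted, which is why projections sorted on the join columns make it available.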

Facilitating merge joins

To facilitate a merge join, create projections for the joined tables that are sorted on the join predicate columns. The join predicate columns should be the first columns in the ORDER BY clause.

For example, tables first and second are defined as follows, with projections first_p1 and second_p1, respectively. The projections are sorted on data_first and data_second:

CREATE TABLE first ( id INT, data_first INT );
CREATE PROJECTION first_p1 AS SELECT * FROM first ORDER BY data_first;

CREATE TABLE second ( id INT, data_second INT );
CREATE PROJECTION second_p1 AS SELECT * FROM second ORDER BY data_second;

When you join these tables on unsorted columns first.id and second.id, Vertica uses the hash join algorithm:

 EXPLAIN SELECT first.data_first, second.data_second FROM first JOIN second ON first.id = second.id;

 Access Path:
 +-JOIN HASH [Cost: 752, Rows: 300K] (PATH ID: 1) Inner (BROADCAST)

You can facilitate execution of this query with the merge join algorithm by creating projections first_p2 and second_p2, which are sorted on join columns first_p2.id and second_p2.id, respectively:


CREATE PROJECTION first_p2 AS SELECT id, data_first FROM first ORDER BY id SEGMENTED BY hash(id, data_first) ALL NODES;
CREATE PROJECTION second_p2 AS SELECT id, data_second FROM second ORDER BY id SEGMENTED BY hash(id, data_second) ALL NODES;

If the query joins significant amounts of data, the query optimizer uses the merge algorithm:

EXPLAIN SELECT first.data_first, second.data_second FROM first JOIN second ON first.id = second.id;

 Access Path:
 +-JOIN MERGEJOIN(inputs presorted) [Cost: 731, Rows: 300K] (PATH ID: 1) Inner (BROADCAST)

You can also facilitate a merge join by using subqueries to pre-sort the join predicate columns. For example:

SELECT first.id, first.data_first, second.data_second FROM
  (SELECT * FROM first ORDER BY id ) first JOIN (SELECT * FROM second ORDER BY id) second ON first.id = second.id;

2 - Identical segmentation

To improve query performance when you join multiple tables, create projections that are identically segmented on the join keys. Identically-segmented projections allow the joins to occur locally on each node, thereby helping to reduce data movement across the network during query processing.

To determine if projections are identically-segmented on the query join keys, create a query plan with EXPLAIN. If the query plan contains RESEGMENT or BROADCAST, the projections are not identically segmented.
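
If you capture query plans as text, this check can be scripted. The Python helper below is a minimal sketch: the function name `data_movement_keywords` is hypothetical, and how the plan text is obtained from a client is left out. It searches the plan for the two keywords named above.

```python
import re

def data_movement_keywords(plan_text):
    """Return the redistribution keywords (RESEGMENT, BROADCAST) that appear in a plan."""
    return sorted(set(re.findall(r"\b(RESEGMENT|BROADCAST)\b", plan_text)))

# Plan line taken from the hash join example earlier in this document.
plan = "+-JOIN HASH [Cost: 752, Rows: 300K] (PATH ID: 1) Inner (BROADCAST)"
found = data_movement_keywords(plan)
```

An empty result suggests that the plan joins the data locally. A non-empty result lists which kind of redistribution the plan contains.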

The Vertica optimizer chooses a projection to supply rows for each table in a query. If the projections to be joined are segmented, the optimizer evaluates their segmentation against the query join expressions. It thereby determines whether the rows are placed on each node so it can join them without fetching data from another node.
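
The following Python sketch simulates why segmentation matters for a local join. It rests on simplifying assumptions: three nodes, a toy hash function (`toy_hash`, a sum modulo the node count) in place of a real one, and illustrative rows. When both inputs are segmented on the join column, every matching pair lands on the same node. When the segmentation also includes a column that is not part of the join, matching rows can land on different nodes, so a purely local join misses them.

```python
NODES = 3

def toy_hash(values):
    # Toy stand-in for a real hash function, chosen so the placement is easy to follow.
    return sum(values) % NODES

def segment(rows, columns):
    """Place each row on a node according to the hash of its segmentation columns."""
    placed = {n: [] for n in range(NODES)}
    for row in rows:
        placed[toy_hash([row[c] for c in columns])].append(row)
    return placed

def local_join(seg_a, seg_b, key):
    """Join node by node, using only the rows stored on that node."""
    return sorted(
        (a[key], a["x1"], b["x1"])
        for n in range(NODES)
        for a in seg_a[n]
        for b in seg_b[n]
        if a[key] == b[key]
    )

def global_join(rows_a, rows_b, key):
    """Reference result: join as if all rows were in one place."""
    return sorted(
        (a[key], a["x1"], b["x1"])
        for a in rows_a
        for b in rows_b
        if a[key] == b[key]
    )

rows_a = [{"id": 1, "x1": 0}, {"id": 2, "x1": 0}, {"id": 3, "x1": 0}]
rows_b = [{"id": 1, "x1": 1}, {"id": 2, "x1": 0}, {"id": 3, "x1": 2}]

expected = global_join(rows_a, rows_b, "id")

# Segmented on the join column only: the local join is complete.
by_id = local_join(segment(rows_a, ["id"]), segment(rows_b, ["id"]), "id")

# Segmented on (id, x1) but joined on id only: matching rows are split across nodes.
by_id_x1 = local_join(segment(rows_a, ["id", "x1"]), segment(rows_b, ["id", "x1"]), "id")
```

In this data, `by_id` equals the reference result, while `by_id_x1` finds only one of the three matches. A real system closes that gap by moving rows between nodes at run time.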

Join conditions for identically segmented projections

A projection p is segmented on join columns if all column references in p’s segmentation expression are a subset of the columns in the join expression.

The following conditions must be true for two segmented projections p1 of table t1 and p2 of table t2 to participate in a join of t1 to t2:

The join condition must have the following form:

t1.j1 = t2.j1 AND t1.j2 = t2.j2 AND ... t1.jN = t2.jN

The join columns must share the same base data type. For example:
- If t1.j1 is an INTEGER, t2.j1 can be an INTEGER but it cannot be a FLOAT.
- If t1.j1 is a CHAR(10), t2.j1 can be any CHAR or VARCHAR (for example, CHAR(10), VARCHAR(10), VARCHAR(20)), but t2.j1 cannot be an INTEGER.
If p1 is segmented by an expression on columns {t1.s1, t1.s2, ... t1.sN}, each segmentation column t1.sX must be in the join column set {t1.jX}.
If p2 is segmented by an expression on columns {t2.s1, t2.s2, ... t2.sN}, each segmentation column t2.sX must be in the join column set {t2.jX}.
The segmentation expressions of p1 and p2 must be structurally equivalent. For example:
- If p1 is SEGMENTED BY hash(t1.x) and p2 is SEGMENTED BY hash(t2.x), p1 and p2 are identically segmented.
- If p1 is SEGMENTED BY hash(t1.x) and p2 is SEGMENTED BY hash(t2.x + 1), p1 and p2 are not identically segmented.
p1 and p2 must have the same segment count.
The assignment of segments to nodes must match. For example, if p1 and p2 use an OFFSET clause, their offsets must match.

If Vertica finds projections for t1 and t2 that are not identically segmented, the data is redistributed across the network during query run time, as necessary.
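
Several of these conditions can be expressed as a small checker. The Python sketch below is a deliberately simplified model, not Vertica's logic: the `Segmentation` class and `identically_segmented` function are hypothetical, segmentation expressions are reduced to a function name plus a column list, data types are not checked, and each t1 column is assumed to appear in at most one equality of the join condition.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Segmentation:
    function: str       # for example "hash"
    columns: tuple      # columns in the segmentation expression, in order
    segment_count: int
    offset: int = 0

def identically_segmented(p1, p2, join_pairs):
    """join_pairs lists the (t1_column, t2_column) equalities of the join condition."""
    t1_to_t2 = dict(join_pairs)
    t2_join_columns = set(t1_to_t2.values())
    # Every segmentation column must be one of the join columns.
    if not all(c in t1_to_t2 for c in p1.columns):
        return False
    if not all(c in t2_join_columns for c in p2.columns):
        return False
    # Simplified structural equivalence: same function, and the columns
    # correspond position by position through the join equalities.
    if p1.function != p2.function:
        return False
    if tuple(t1_to_t2[c] for c in p1.columns) != p2.columns:
        return False
    # Same segment count and matching assignment of segments to nodes.
    return p1.segment_count == p2.segment_count and p1.offset == p2.offset

p1 = Segmentation("hash", ("id",), segment_count=3)
p2 = Segmentation("hash", ("id",), segment_count=3)

on_id = identically_segmented(p1, p2, [("id", "id")])
on_id_and_x1 = identically_segmented(p1, p2, [("id", "id"), ("x1", "x1")])
on_x1 = identically_segmented(p1, p2, [("x1", "x1")])
on_id_to_x1 = identically_segmented(p1, p2, [("id", "x1")])
```

With both projections segmented by a hash of `id`, the first two joins qualify and the last two do not, because in those cases a segmentation column is missing from the join columns on one or both sides.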

Tip
If you create custom designs, try to use segmented projections for joins whenever possible.

Examples

The following statements create two tables that are identically segmented on the id column:

=> CREATE TABLE t1 (id INT, x1 INT, y1 INT) SEGMENTED BY HASH(id) ALL NODES;
=> CREATE TABLE t2 (id INT, x1 INT, y1 INT) SEGMENTED BY HASH(id) ALL NODES;

Given this design, the join conditions in the following queries can leverage identical segmentation:

=> SELECT * FROM t1 JOIN t2 ON t1.id = t2.id;
=> SELECT * FROM t1 JOIN t2 ON t1.id = t2.id AND t1.x1 = t2.x1;

Conversely, the join conditions in the following queries require resegmentation:

=> SELECT * FROM t1 JOIN t2 ON t1.x1 = t2.x1;
=> SELECT * FROM t1 JOIN t2 ON t1.id = t2.x1;

3 - Joining variable length string data

When you join tables on VARCHAR columns, Vertica calculates how much storage space it requires to buffer join column data. It does so by formatting the column data in one of two ways:

Uses the join column metadata to size column data to a fixed length and buffer accordingly. For example, given a column that is defined as VARCHAR(1000), Vertica always buffers 1000 characters.
Uses the actual length of join column data, so buffer size varies for each join. For example, given a join on strings Xi, John, and Amrita, Vertica buffers only as much storage as each string needs: 2, 4, and 6 bytes, respectively.

The second approach can improve join query performance. It can also reduce memory consumption, which helps prevent join spills and minimize how often memory is borrowed from the resource manager. In general, these benefits are especially marked in cases where the defined size of a join column significantly exceeds the average length of its data.
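
The arithmetic behind this difference can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, and the model ignores any per-value overhead and the actual buffer layout that Vertica uses. It compares the two approaches for the three strings mentioned above, assuming a join column defined as VARCHAR(1000).

```python
def fixed_buffer_bytes(values, declared_length):
    """Fixed-length formatting: every value is sized to the declared column length."""
    return declared_length * len(values)

def variable_buffer_bytes(values):
    """Variable-length formatting: every value is sized to its actual length in bytes."""
    return sum(len(v.encode("utf-8")) for v in values)

names = ["Xi", "John", "Amrita"]

fixed = fixed_buffer_bytes(names, declared_length=1000)  # 3 * 1000 = 3000
variable = variable_buffer_bytes(names)                  # 2 + 4 + 6 = 12
```

The gap grows with the difference between the declared size of the column and the typical length of its data.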

Setting and verifying variable length formatting

You can control how Vertica implements joins at the session or database levels, through configuration parameter JoinDefaultTupleFormat, or for individual queries, through the JFMT hint. Vertica supports variable length formatting for all joins except merge and event series joins.

Use EXPLAIN VERBOSE to verify whether a given query uses variable length formatting, by checking the query plan for one of these flags:

JF_EE_VARIABLE_FORMAT
JF_EE_FIXED_FORMAT
