OpenText Analytics Database 26.2.x – JOIN queries

Data-Analysis: Hash joins versus merge joins

Mon, 01 Jan 0001 00:00:00 +0000

The database optimizer implements a join with one of the following algorithms:

Merge join is used when projections of the joined tables are sorted on the join columns. Merge joins are faster and uses less memory than hash joins.
Hash join is used when projections of the joined tables are not already sorted on the join columns. In this case, the optimizer builds an in-memory hash table on the inner table's join column. The optimizer then scans the outer table for matches to the hash table, and joins data from the two tables accordingly. The cost of performing a hash join is low if the entire hash table can fit in memory. Cost rises significantly if the hash table must be written to disk.

The optimizer automatically chooses the most appropriate algorithm to execute a query, given the projections that are available.

Facilitating merge joins

To facilitate a merge join, create projections for the joined tables that are sorted on the join predicate columns. The join predicate columns should be the first columns in the ORDER BY clause.

For example, tables first and second are defined as follows, with projections first_p1 and second_p1, respectively. The projections are sorted on data_first and data_second:

CREATE TABLE first ( id INT, data_first INT );
CREATE PROJECTION first_p1 AS SELECT * FROM first ORDER BY data_first;

CREATE TABLE second ( id INT, data_second INT );
CREATE PROJECTION second_p1 AS SELECT * FROM second ORDER BY data_second;

When you join these tables on unsorted columns first.id and second.id, the database uses the hash join algorithm:

 EXPLAIN SELECT first.data_first, second.data_second FROM first JOIN second ON first.id = second.id;

 Access Path:
 +-JOIN HASH [Cost: 752, Rows: 300K] (PATH ID: 1) Inner (BROADCAST)

You can facilitate execution of this query with the merge join algorithm by creating projections first_p2 and second_p2, which are sorted on join columns first_p2.id and second_p2.id, respectively:


CREATE PROJECTION first_p2 AS SELECT id, data_first FROM first ORDER BY id SEGMENTED BY hash(id, data_first) ALL NODES;
CREATE PROJECTION second_p2 AS SELECT id, data_second FROM second ORDER BY id SEGMENTED BY hash(id, data_second) ALL NODES;

If the query joins significant amounts of data, the query optimizer uses the merge algorithm:

EXPLAIN SELECT first.data_first, second.data_second FROM first JOIN second ON first.id = second.id;

 Access Path:
 +-JOIN MERGEJOIN(inputs presorted) [Cost: 731, Rows: 300K] (PATH ID: 1) Inner (BROADCAST)

You can also facilitate a merge join by using subqueries to pre-sort the join predicate columns. For example:

SELECT first.id, first.data_first, second.data_second FROM
  (SELECT * FROM first ORDER BY id ) first JOIN (SELECT * FROM second ORDER BY id) second ON first.id = second.id;

Data-Analysis: Identical segmentation

Mon, 01 Jan 0001 00:00:00 +0000

To improve query performance when you join multiple tables, create projections that are identically segmented on the join keys. Identically-segmented projections allow the joins to occur locally on each node, thereby helping to reduce data movement across the network during query processing.

To determine if projections are identically-segmented on the query join keys, create a query plan with EXPLAIN. If the query plan contains RESEGMENT or BROADCAST, the projections are not identically segmented.

The database optimizer chooses a projection to supply rows for each table in a query. If the projections to be joined are segmented, the optimizer evaluates their segmentation against the query join expressions. It thereby determines whether the rows are placed on each node so it can join them without fetching data from another node.

Join conditions for identically segmented projections

A projection p is segmented on join columns if all column references in p’s segmentation expression are a subset of the columns in the join expression.

The following conditions must be true for two segmented projections p1 of table t1 and p2 of table t2 to participate in a join of t1 to t2:

The join condition must have the following form:

t1.j1 = t2.j1 AND t1.j2 = t2.j2 AND ... t1.jN = t2.jN

The join columns must share the same base data type. For example:
- If t1.j1 is an INTEGER, t2.j1 can be an INTEGER but it cannot be a FLOAT.
- If t1.j1 is a CHAR(10), t2.j1 can be any CHAR or VARCHAR (for example, CHAR(10), VARCHAR(10), VARCHAR(20)), but t2.j1 cannot be an INTEGER.
If p1 is segmented by an expression on columns {t1.s1, t1.s2, ... t1.sN}, each segmentation column t1.sX must be in the join column set {t1.jX}.
If p2 is segmented by an expression on columns {t2.s1, t2.s2, ... t2.sN}, each segmentation column t2.sX must be in the join column set {t2.jX}.
The segmentation expressions of p1 and p2 must be structurally equivalent. For example:
- If p1 is SEGMENTED BY hash(t1.x) and p2 is SEGMENTED BY hash(t2.x), p1 and p2 are identically segmented.
- If p1 is SEGMENTED BY hash(t1.x) and p2 is SEGMENTED BY hash(t2.x + 1), p1 and p2 are not identically segmented.
p1 and p2 must have the same segment count.
The assignment of segments to nodes must match. For example, if p1 and p2 use an OFFSET clause, their offsets must match.
If the database finds projections for t1 and t2 that are not identically segmented, the data is redistributed across the network during query run time, as necessary.

Tip
If you create custom designs, try to use segmented projections for joins whenever possible.

Examples

The following statements create two tables and specify to create identical segments:

=> CREATE TABLE t1 (id INT, x1 INT, y1 INT) SEGMENTED BY HASH(id, x1) ALL NODES;
=> CREATE TABLE t2 (id INT, x1 INT, y1 INT) SEGMENTED BY HASH(id, x1) ALL NODES;

Given this design, the join conditions in the following queries can leverage identical segmentation:

=> SELECT * FROM t1 JOIN t2 ON t1.id = t2.id;
=> SELECT * FROM t1 JOIN t2 ON t1.id = t2.id AND t1.x1 = t2.x1;

Conversely, the join conditions in the following queries require resegmentation:

=> SELECT * FROM t1 JOIN t2 ON t1.x1 = t2.x1;
=> SELECT * FROM t1 JOIN t2 ON t1.id = t2.x1;

Data-Analysis: Joining variable length string data

Mon, 01 Jan 0001 00:00:00 +0000

When you join tables on VARCHAR columns, the database calculates how much storage space it requires to buffer join column data. It does so by formatting the column data in one of two ways:

Uses the join column metadata to size column data to a fixed length and buffer accordingly. For example, given a column that is defined as VARCHAR(1000), the database always buffers 1000 characters.
Uses the actual length of join column data, so buffer size varies for each join. For example, given a join on strings Xi, John, and Amrita, the database buffers only as much storage as it needs for each join—in this case, 2, 4, and 6 bytes, respectively.

The second approach can improve join query performance. It can also reduce memory consumption, which helps prevent join spills and minimize how often memory is borrowed from the resource manager. In general, these benefits are especially marked in cases where the defined size of a join column significantly exceeds the average length of its data.

Setting and verifying variable length formatting

You can control how the database implements joins at the session or database levels, through configuration parameter JoinDefaultTupleFormat, or for individual queries, through the JFMT hint. OpenText™ Analytics Database supports variable length formatting for all joins except merge and event series joins.

Use EXPLAIN VERBOSE to verify whether a given query uses variable character formatting, by checking for these flags:

JF_EE_VARIABLE_FORMAT
JF_EE_FIXED_FORMAT

OpenText Analytics Database 26.2.x – JOIN queries

Data-Analysis: Hash joins versus merge joins

Facilitating merge joins

Data-Analysis: Identical segmentation

Join conditions for identically segmented projections

Tip

Examples

See also

Data-Analysis: Joining variable length string data

Setting and verifying variable length formatting