Range joins

Vertica provides performance optimizations for <, <=, >, >=, and BETWEEN predicates in join ON clauses.

Vertica provides performance optimizations for <, <=, >, >=, and BETWEEN predicates in join ON clauses. These optimizations are particularly useful when a column from one table is restricted to be in a range specified by two columns of another table.

Key ranges

Multiple, consecutive key values can map to the same dimension values. Consider, for example, a table of IPv4 addresses and their owners. Because large subnets (ranges) of IP addresses can belong to the same owner, this dimension can be represented as:

=> CREATE TABLE ip_owners(
      ip_start INTEGER,
      ip_end INTEGER,
      owner_id INTEGER);
=> CREATE TABLE clicks(
      ip_owners INTEGER,
      dest_ip INTEGER);

A query that associates a click stream with its destination can use a join similar to the following, which takes advantage of range optimization:

=> SELECT owner_id, COUNT(*) FROM clicks JOIN ip_owners
   ON clicks.dest_ip BETWEEN ip_start AND ip_end
   GROUP BY owner_id;

Requirements

Operators <, <=, >, >=, or BETWEEN must appear as top-level conjunctive predicates for range join optimization to be effective, as shown in the following examples:

`BETWEEN` as the only predicate:

```
=> SELECT COUNT(*) FROM fact JOIN dim
    ON fact.point BETWEEN dim.start AND dim.end;

```

Comparison operators as top-level predicates (within `AND`):

```
=> SELECT COUNT(*) FROM fact JOIN dim
    ON fact.point > dim.start AND fact.point < dim.end;

```

`BETWEEN` as a top-level predicate (within `AND`):

```
=> SELECT COUNT(*) FROM fact JOIN dim
   ON (fact.point BETWEEN dim.start AND dim.end) AND fact.c <> dim.c;

```

Query not optimized because `OR` is top-level predicate (disjunctive):

```
=> SELECT COUNT(*) FROM fact JOIN dim
   ON (fact.point BETWEEN dim.start AND dim.end) OR dim.end IS NULL;

```

Notes

  • Expressions are optimized in range join queries in many cases.

  • If range columns can have NULL values indicating that they are open-ended, it is possible to use range join optimizations by replacing nulls with very large or very small values:

    => SELECT COUNT(*) FROM fact JOIN dim
       ON fact.point BETWEEN NVL(dim.start, -1) AND NVL(dim.end, 1000000000000);
    
  • If there is more than one set of ranging predicates in the same ON clause, the order in which the predicates are specified might impact the effectiveness of the optimization:

    => SELECT COUNT(*) FROM fact JOIN dim ON fact.point1 BETWEEN dim.start1 AND dim.end1
       AND fact.point2 BETWEEN dim.start2 AND dim.end2;
    

    The optimizer chooses the first range to optimize, so write your queries so that the range you most want optimized appears first in the statement.

  • The use of the range join optimization is not directly affected by any characteristics of the physical schema; no schema tuning is required to benefit from the optimization.

  • The range join optimization can be applied to joins without any other predicates, and to HASH or MERGE joins.

  • To determine if an optimization is in use, search for RANGE in the EXPLAIN plan.