Range joins
Vertica provides performance optimizations for <, <=, >, >=, and BETWEEN
predicates in join ON
clauses. These optimizations are particularly useful when a column from one table is restricted to be in a range specified by two columns of another table.
Key ranges
Multiple, consecutive key values can map to the same dimension values. Consider, for example, a table of IPv4 addresses and their owners. Because large subnets (ranges) of IP addresses can belong to the same owner, this dimension can be represented as:
=> CREATE TABLE ip_owners(
ip_start INTEGER,
ip_end INTEGER,
owner_id INTEGER);
=> CREATE TABLE clicks(
ip_owners INTEGER,
dest_ip INTEGER);
A query that associates a click stream with its destination can use a join similar to the following, which takes advantage of range optimization:
=> SELECT owner_id, COUNT(*) FROM clicks JOIN ip_owners
ON clicks.dest_ip BETWEEN ip_start AND ip_end
GROUP BY owner_id;
Requirements
Operators <, <=, >, >=, or BETWEEN
must appear as top-level conjunctive predicates for range join optimization to be effective, as shown in the following examples:
`BETWEEN` as the only predicate:
```
=> SELECT COUNT(*) FROM fact JOIN dim
ON fact.point BETWEEN dim.start AND dim.end;
```
Comparison operators as top-level predicates (within `AND`):
```
=> SELECT COUNT(*) FROM fact JOIN dim
ON fact.point > dim.start AND fact.point < dim.end;
```
`BETWEEN` as a top-level predicate (within `AND`):
```
=> SELECT COUNT(*) FROM fact JOIN dim
ON (fact.point BETWEEN dim.start AND dim.end) AND fact.c <> dim.c;
```
Query not optimized because `OR` is top-level predicate (disjunctive):
```
=> SELECT COUNT(*) FROM fact JOIN dim
ON (fact.point BETWEEN dim.start AND dim.end) OR dim.end IS NULL;
```
Notes
-
Expressions are optimized in range join queries in many cases.
-
If range columns can have
NULL
values indicating that they are open-ended, it is possible to use range join optimizations by replacing nulls with very large or very small values:=> SELECT COUNT(*) FROM fact JOIN dim ON fact.point BETWEEN NVL(dim.start, -1) AND NVL(dim.end, 1000000000000);
-
If there is more than one set of ranging predicates in the same
ON
clause, the order in which the predicates are specified might impact the effectiveness of the optimization:=> SELECT COUNT(*) FROM fact JOIN dim ON fact.point1 BETWEEN dim.start1 AND dim.end1 AND fact.point2 BETWEEN dim.start2 AND dim.end2;
The optimizer chooses the first range to optimize, so write your queries so that the range you most want optimized appears first in the statement.
-
The use of the range join optimization is not directly affected by any characteristics of the physical schema; no schema tuning is required to benefit from the optimization.
-
The range join optimization can be applied to joins without any other predicates, and to
HASH
orMERGE
joins. -
To determine if an optimization is in use, search for
RANGE
in theEXPLAIN
plan.