Writing event series joins
The examples in this topic contains mismatches between timestamps—just as you'd find in real life situations; for example, there could be a period of inactivity on stocks where no trade occurs, which can present challenges when you want to compare two stocks whose timestamps don't match.
The hTicks and aTicks tables
As described in the example ticks schema, tables, hTicks
is missing input rows for 12:02, 12:03, and 12:04, and aTicks
is missing inputs at 12:01, 12:02, and 12:04.
hTicks | aTicks | |
---|---|---|
|
|
Querying event series data with full outer joins
Using a traditional full outer join, this query finds a match between tables hTicks and aTicks at 12:00 and 12:05 and pads the missing data points with NULL values.
=> SELECT * FROM hTicks h FULL OUTER JOIN aTicks a ON (h.time = a.time);
stock | time | price | stock | time | price
-------+----------+-------+-------+----------+--------
HPQ | 12:00:00 | 50.00 | ACME | 12:00:00 | 340.00
HPQ | 12:01:00 | 51.00 | | |
HPQ | 12:05:00 | 51.00 | ACME | 12:05:00 | 333.80
HPQ | 12:05:00 | 51.00 | ACME | 12:05:00 | 340.20
HPQ | 12:06:00 | 52.00 | | |
| | | ACME | 12:03:00 | 340.10
(6 rows)
To replace the gaps with interpolated values for those missing data points, use the INTERPOLATE predicate to create an event series join. The join condition is restricted to the ON clause, which evaluates the equality predicate on the timestamp columns from the two input tables. In other words, for each row in outer table hTicks, the ON clause predicates are evaluated for each combination of each row in the inner table aTicks.
Simply rewrite the full outer join query to use the INTERPOLATE predicate with either the required PREVIOUS VALUE or NEXT VALUE keywords. Note that a full outer join on event series data is the most common scenario for event series data, where you keep all rows from both tables.
=> SELECT * FROM hTicks h FULL OUTER JOIN aTicks a
ON (h.time INTERPOLATE PREVIOUS VALUE a.time);
Vertica interpolates the missing values (which appear as NULL in the full outer join) using that table's previous or next value, whichever is specified. This example shows INTERPOLATE PREVIOUS. Notice how in the second row, blank cells have been filled using values interpolated from the previous row:
=> SELECT * FROM hTicks h FULL OUTER JOIN aTicks a ON (h.time = a.time);
stock | time | price | stock | time | price
-------+----------+-------+-------+----------+--------
HPQ | 12:00:00 | 50.00 | ACME | 12:00:00 | 340.00
HPQ | 12:01:00 | 51.00 | ACME | 12:00:00 | 340.00
HPQ | 12:01:00 | 51.00 | ACME | 12:03:00 | 340.10
HPQ | 12:05:00 | 51.00 | ACME | 12:05:00 | 333.80
HPQ | 12:05:00 | 52.00 | ACME | 12:05:00 | 340.20
HPQ | 12:06:00 | 52.00 | ACME | 12:05:00 | 340.20
(6 rows)
Note
The output ordering above is different from the regular full outer join because in the event series join, interpolation occurs independently for each stock (hTicks and aTicks), where the data is partitioned and sorted based on the equality predicate. This means that interpolation occurs within, not across, partitions.If you review the regular full outer join output, you can see that both tables have a match in the time column at 12:00 and 12:05, but at 12:01, there is no entry record for ACME. So the operation interpolates a value for ACME (ACME,12:00,340
) based on the previous value in the aTicks table.
Querying event series data with left outer joins
You can also use left and right outer joins. You might, for example, decide you want to preserve only hTicks values. So you'd write a left outer join:
=> SELECT * FROM hTicks h LEFT OUTER JOIN aTicks a
ON (h.time INTERPOLATE PREVIOUS VALUE a.time);
stock | time | price | stock | time | price
-------+----------+-------+-------+----------+--------
HPQ | 12:00:00 | 50.00 | ACME | 12:00:00 | 340.00
HPQ | 12:01:00 | 51.00 | ACME | 12:00:00 | 340.00
HPQ | 12:05:00 | 51.00 | ACME | 12:05:00 | 333.80
HPQ | 12:05:00 | 51.00 | ACME | 12:05:00 | 340.20
HPQ | 12:06:00 | 52.00 | ACME | 12:05:00 | 340.20
(5 rows)
Here's what the same data looks like using a traditional left outer join:
=> SELECT * FROM hTicks h LEFT OUTER JOIN aTicks a ON h.time = a.time;
stock | time | price | stock | time | price
-------+----------+-------+-------+----------+--------
HPQ | 12:00:00 | 50.00 | ACME | 12:00:00 | 340.00
HPQ | 12:01:00 | 51.00 | | |
HPQ | 12:05:00 | 51.00 | ACME | 12:05:00 | 333.80
HPQ | 12:05:00 | 51.00 | ACME | 12:05:00 | 340.20
HPQ | 12:06:00 | 52.00 | | |
(5 rows)
Note that a right outer join has the same behavior with the preserved table reversed.
Querying event series data with inner joins
Note that INNER event series joins behave the same way as normal ANSI SQL-99 joins, where all gaps are omitted. Thus, there is nothing to interpolate, and the following two queries are equivalent and return the same result set:
A regular inner join:
=> SELECT * FROM HTicks h JOIN aTicks a
ON (h.time INTERPOLATE PREVIOUS VALUE a.time);
stock | time | price | stock | time | price
-------+----------+-------+-------+----------+--------
HPQ | 12:00:00 | 50.00 | ACME | 12:00:00 | 340.00
HPQ | 12:05:00 | 51.00 | ACME | 12:05:00 | 333.80
HPQ | 12:05:00 | 51.00 | ACME | 12:05:00 | 340.20
(3 rows)
An event series inner join:
=> SELECT * FROM hTicks h INNER JOIN aTicks a ON (h.time = a.time);
stock | time | price | stock | time | price
-------+----------+-------+-------+----------+--------
HPQ | 12:00:00 | 50.00 | ACME | 12:00:00 | 340.00
HPQ | 12:05:00 | 51.00 | ACME | 12:05:00 | 333.80
HPQ | 12:05:00 | 51.00 | ACME | 12:05:00 | 340.20
(3 rows)
The bid and ask tables
Using the example schema for the bid
and ask
tables, write a full outer join to interpolate the missing data points:
=> SELECT * FROM bid b FULL OUTER JOIN ask a
ON (b.stock = a.stock AND b.time INTERPOLATE PREVIOUS VALUE a.time);
In the below output, the first row for stock HPQ shows nulls because there is no entry record for HPQ before 12:01.
stock | time | price | stock | time | price
-------+----------+--------+-------+----------+--------
ACME | 12:00:00 | 80.00 | ACME | 12:00:00 | 80.00
ACME | 12:00:00 | 80.00 | ACME | 12:02:00 | 75.00
ACME | 12:03:00 | 79.80 | ACME | 12:02:00 | 75.00
ACME | 12:05:00 | 79.90 | ACME | 12:02:00 | 75.00
HPQ | 12:00:00 | 100.10 | | |
HPQ | 12:01:00 | 100.00 | HPQ | 12:01:00 | 101.00
(6 rows)
Note also that the same row (ACME,12:02,75
) from the ask
table appears three times. The first appearance is because no matching rows are present in the bid
table for the row in ask
, so Vertica interpolates the missing value using the ACME value at 12:02 (75.00). The second appearance occurs because the row in bid
(ACME,12:05,79.9
) has no matches in ask
. The row from ask
that contains (ACME,12:02,75
) is the closest row; thus, it is used to interpolate the values.
If you write a regular full outer join, you can see where the mismatched timestamps occur:
=> SELECT * FROM bid b FULL OUTER JOIN ask a ON (b.time = a.time);
stock | time | price | stock | time | price
-------+----------+--------+-------+----------+--------
ACME | 12:00:00 | 80.00 | ACME | 12:00:00 | 80.00
ACME | 12:03:00 | 79.80 | | |
ACME | 12:05:00 | 79.90 | | |
HPQ | 12:00:00 | 100.10 | ACME | 12:00:00 | 80.00
HPQ | 12:01:00 | 100.00 | HPQ | 12:01:00 | 101.00
| | | ACME | 12:02:00 | 75.00
(6 rows)