Pre-aggregating UDTF results

CREATE PROJECTION can define live aggregate projections that invoke user-defined transform functions (UDTFs).

CREATE PROJECTION can define live aggregate projections that invoke user-defined transform functions (UDTFs). To minimize overhead when you query those projections, Vertica processes these functions in the background and stores their results on disk.

Defining projections with UDTFs

The projection definition characterizes UDTFs in one of two ways:

  • Identifies the UDTF as a pre-pass UDTF, which transforms newly loaded data before it is stored in the projection ROS containers.

  • Identifies the UDTF as a batch UDTF, which aggregates and stores projection data.

The projection definition identifies a UDTF as a pre-pass UDTF or batch UDTF in its window partition clause, through the keywords PREPASS or BATCH. A projection can specify one pre-pass or batch UDTF or include both (see UDTF Specification Options).

In all cases, the projection is implicitly segmented and ordered on the PARTITION BY columns.

UDTF specification options

Projections can invoke batch and pre-pass UDTFs singly or in combination.

Single pre-pass UDTF

Vertica invokes the pre-pass UDTF when you load data into the projection's anchor table—for example through COPY or INSERT statements. A pre-pass UDTF transforms the new data and then stores the transformed data in the projection's ROS containers.

Use the following syntax:


CREATE PROJECTION [ IF NOT EXISTS ] [[database.]schema.]projection
[ (
   { projection-column | grouped-clause
   [ ENCODING encoding-type ]
   [ ACCESSRANK integer ] }[,...]
) ]

AS SELECT { table-column | expr-with-table-columns }[,...], prepass-udtf(prepass-args)
    OVER (PARTITION PREPASS BY partition-column-expr[,...])
    [ AS (prepass-output-columns) ] FROM table [[AS] alias]

Single batch UDTF

When invoked singly, a batch UDTF transforms and aggregates projection data on mergeout, data load, and query operations. The UDTF stores aggregated results in the projection's ROS containers. Aggregation is cumulative across mergeout and load operations, and is completed (if necessary) on query execution.

Use the following syntax:


CREATE PROJECTION [ IF NOT EXISTS ] [[database.]schema.]projection
[ (
   { projection-column | grouped-clause
   [ ENCODING encoding-type ]
   [ ACCESSRANK integer ]  }[,...]
) ]
AS SELECT { table-column | expr-with-table-columns }[,...], batch-udtf(batch-args)
   OVER (PARTITION BATCH BY partition-column-expr[,...])
   [ AS (batch-output-columns) FROM table [ [AS] alias ]

Combined pre-pass and batch UDTFs

You can define a projection with a subquery that invokes a pre-pass UDTF. The pre-pass UDTF returns transformed data to the outer batch query. The batch UDTF then iteratively aggregates results across mergeout operations. It completes aggregation (if necessary) on query execution.

Use the following syntax:


CREATE PROJECTION [ IF NOT EXISTS ] [[database.]schema.]projection
[ (
   { projection-column | grouped-clause
   [ ENCODING encoding-type ]
   [ ACCESSRANK integer ] }[,...]
) ]
AS SELECT { table-column | expr-with-table-columns }[,...], batch-udtf(batch-args)
   OVER (PARTITION BATCH BY partition-column-expr[,...]) [ AS (batch-output-columns) ] FROM (
      SELECT { table-column | expr-with-table-columns }[,...], prepass-udtf (prepass-args)
      OVER (PARTITION PREPASS BY partition-column-expr[,...]) [ AS (prepass-output-columns) ] FROM table ) sq-ref

Examples

Single pre-pass UDTF
The following example shows how to use the UDTF text_index, which extracts from a text document strings that occur more than once.

The following projection specifies to invoke text_index as a pre-pass UDTF:


=> CREATE TABLE documents ( doc_id INT PRIMARY KEY, text VARCHAR(140));

=> CREATE PROJECTION index_proj (doc_id, text)
     AS SELECT doc_id, text_index(doc_id, text)
     OVER (PARTITION PREPASS BY doc_id) FROM documents;

The UDTF is invoked whenever data is loaded into the anchor table documents. text_index transforms the newly loaded data, and Vertica stores the transformed data in the live aggregate projection ROS containers.

So, if you load the following data into documents:

=> INSERT INTO documents VALUES
(100, 'A SQL Query walks into a bar. In one corner of the bar are two tables.
 The Query walks up to the tables and asks - Mind if I join you?');
 OUTPUT
--------
      1
(1 row)

text_index transforms the newly loaded data and stores it in the projection ROS containers. When you query the projection, it returns with the following results:


doc_id | frequency |     term
-------+-----------+--------------
100    | 2         | bar
100    | 2         | Query
100    | 2         | tables
100    | 2         | the
100    | 2         | walks

Combined Pre-Pass and Batch UDTFs
The following projection specifies pre-pass and batch UDTFs stv_intersect and aggregate_classified_points, respectively:


CREATE TABLE points( point_id INTEGER, point_type VARCHAR(10), coordinates GEOMETRY(100));

CREATE PROJECTION aggregated_proj
   AS SELECT point_type, aggregate_classified_points( sq.point_id, sq.polygon_id)
   OVER (PARTITION BATCH BY point_type)
   FROM
      (SELECT point_type, stv_intersect(
         point_id, coordinates USING PARAMETERS index=‘polygons’ )
       OVER (PARTITION PREPASS BY point_type) AS (point_id, polygon_id) FROM points) sq;

The pre-pass query UDTF stv_intersect returns its results (a set of point and matching polygon IDs) to the outer batch query. The outer batch query then invokes the UDTF aggregate_classified_points. Vertica aggregates the result set that is returned by aggregate_classified_points whenever a mergeout operation consolidates projection data. Final aggregation (if necessary) occurs when the projection is queried.

The batch UDTF arguments must exactly match the output columns returned by the pre-pass UDTF stv_intersect, in name and order. In this example, the pre-pass subquery explicitly names the pre-pass UDTF output columns point_id and polygon_id. Accordingly, the batch UDTF arguments match them in name and order: sq.point_id and sq.polygon_id.