This is the multi-page printable view of this section. Click here to print.

Return to the regular view of this page.

Pre-aggregating data in projections

Queries that use aggregate functions such as SUM and COUNT can perform more efficiently when they use projections that already contain the aggregated data.

Queries that use aggregate functions such as SUM and COUNT can perform more efficiently when they use projections that already contain the aggregated data. This improved efficiency is especially true for queries on large quantities of data.

For example, a power grid company reads 30 million smart meters that provide data at five-minute intervals. The company records each reading in a database table. Over a given year, three trillion records are added to this table.

The power grid company can analyze these records with queries that include aggregate functions to perform the following tasks:

  • Establish usage patterns.

  • Detect fraud.

  • Measure correlation to external events such as weather patterns or pricing changes.

To optimize query response time, you can create an aggregate projection, which stores the data is stored after it is aggregated.

Aggregate projections

Vertica provides several types of projections for storing data that is returned from aggregate functions or expressions:

  • Live aggregate projection: Projection that contains columns with values that are aggregated from columns in its anchor table. You can also define live aggregate projections that include user-defined transform functions.

  • Top-K projection: Type of live aggregate projection that returns the top k rows from a partition of selected rows. Create a Top-K projection that satisfies the criteria for a Top-K query.

  • Projection that pre-aggregates UDTF results: Live aggregate projection that invokes user-defined transform functions (UDTFs). To minimize overhead when you query those projections of this type, Vertica processes the UDTF functions in the background and stores their results on disk.

  • Projection that contains expressions: Projection with columns whose values are calculated from anchor table columns.

  • Aggregate projections are most useful for queries against large sets of data.

  • For optimal query performance, the size of LAP projections should be a small subset of the anchor table—ideally, between 1 and 10 percent of the anchor table, or smaller, if possible.

Restrictions

  • MERGE operations must be optimized if they are performed on target tables that have live aggregate projections.

  • You cannot update or delete data in temporary tables with live aggregate projections.

Requirements

In the event of manual recovery from an unclean database shutdown, live aggregate projections might require some time to refresh.

1 - Live aggregate projections

A live aggregate projection contains columns with values that are aggregated from columns in its anchor table. When you load data into the table, Vertica aggregates the data before loading it into the live aggregate projection. On subsequent loads—for example, through INSERTor COPY—Vertica recalculates aggregations with the new data and updates the projection.

1.1 - Functions supported for live aggregate projections

Vertica can aggregate results in live aggregate projections from the following aggregate functions:.

Vertica can aggregate results in live aggregate projections from the following aggregate functions:

Aggregate functions with DISTINCT

Live aggregate projections can support queries that include aggregate functions qualified with the keyword DISTINCT. The following requirements apply:

  • The aggregated expression must evaluate to a non-constant.

  • The projection's GROUP BY clause must specify the aggregated expression.

For example, the following query uses SUM(DISTINCT) to calculate the total of all unique salaries in a given region:

SELECT customer_region, SUM(DISTINCT annual_income)::INT
   FROM customer_dimension GROUP BY customer_region;

This query can use the following live aggregate projection, which specifies the aggregated column (annual_income) in its GROUP BY clause:

CREATE PROJECTION public.TotalRegionalIncome
(
 customer_region,
 annual_income,
 Count
)
AS
 SELECT customer_dimension.customer_region,
        customer_dimension.annual_income,
        count(*) AS Count
 FROM public.customer_dimension
 GROUP BY customer_dimension.customer_region,
          customer_dimension.annual_income
;

1.2 - Creating live aggregate projections

You define a live aggregate projection with the following syntax:.

You define a live aggregate projection with the following syntax:


=> CREATE PROJECTION proj-name AS
      SELECT select-expression FROM table
      GROUP BY group-expression;

For full syntax options, see CREATE PROJECTION.

For example:

=> CREATE PROJECTION clicks_agg AS
   SELECT page_id, click_time::DATE click_date, COUNT(*) num_clicks FROM clicks
   GROUP BY page_id, click_time::DATE KSAFE 1;

For an extended discussion, see Live aggregate projection example.

Requirements

The following requirements apply to live aggregate projections:

  • The projection cannot be unsegmented.

  • SELECT and GROUP BY columns must be in the same order. GROUP BY expressions must be at the beginning of the SELECT list.

Restrictions

The following restrictions apply to live aggregate projections:

  • MERGE operations must be optimized if they are performed on target tables that have live aggregate projections.

  • Live aggregate projections can reference only one table.

  • Vertica does not regard live aggregate projections as superprojections, even those that include all table columns.

  • You cannot modify the anchor table metadata of columns that are included in live aggregate projections—for example, a column's data type or default value. You also cannot drop these columns. To make these changes, first drop all live aggregate and Top-K projections that are associated with the table.

1.3 - Live aggregate projection example

This example shows how you can track user clicks on a given web page using the following clicks table:.

This example shows how you can track user clicks on a given web page using the following clicks table:


=> CREATE TABLE clicks(
   user_id INTEGER,
   page_id INTEGER,
   click_time TIMESTAMP NOT NULL);

You can aggregate user-specific activity with the following query:


=> SELECT page_id, click_time::DATE click_date, COUNT(*) num_clicks FROM clicks
   WHERE click_time::DATE = '2015-04-30'
   GROUP BY page_id, click_time::DATE ORDER BY num_clicks DESC;

To facilitate performance of this query, create a live aggregate projection that counts the number of clicks per user:

=> CREATE PROJECTION clicks_agg AS
   SELECT page_id, click_time::DATE click_date, COUNT(*) num_clicks FROM clicks
   GROUP BY page_id, click_time::DATE KSAFE 1;

When you query the clicks table on user clicks, Vertica typically directs the query to the live aggregate projection clicks_agg. As additional data is loaded into clicks, Vertica pre-aggregates the new data and updates clicks_agg, so queries always return with the latest data.

For example:

=> SELECT page_id, click_time::DATE click_date, COUNT(*) num_clicks FROM clicks
    WHERE click_time::DATE = '2015-04-30' GROUP BY page_id, click_time::DATE
    ORDER BY num_clicks DESC;
 page_id | click_date | num_clicks
---------+------------+------------
    2002 | 2015-04-30 |         10
    3003 | 2015-04-30 |          3
    2003 | 2015-04-30 |          1
    2035 | 2015-04-30 |          1
   12034 | 2015-04-30 |          1
(5 rows)

2 - Top-k projections

A Top-K query returns the top k rows from partitions of selected rows.

A Top-K query returns the top k rows from partitions of selected rows. Top-K projections can significantly improve performance of Top-K queries. For example, you can define a table that stores gas meter readings with three columns: gas meter ID, time of meter reading, and the read value:


=> CREATE TABLE readings (
    meter_id INT,
    reading_date TIMESTAMP,
    reading_value FLOAT);

Given this table, the following Top-K query returns the five most recent meter readings for a given meter:


SELECT meter_id, reading_date, reading_value FROM readings
    LIMIT 5 OVER (PARTITION BY meter_id ORDER BY reading_date DESC);

To improve the performance of this query, you can create a Top-K projection, which is a special type of live aggregate projection:


=> CREATE PROJECTION readings_topk (meter_id, recent_date, recent_value)
    AS SELECT meter_id, reading_date, reading_value FROM readings
    LIMIT 5 OVER (PARTITION BY meter_id ORDER BY reading_date DESC);

After you create this Top-K projection and load its data (through START_REFRESH or REFRESH), Vertica typically redirects the query to the projection and returns with the pre-aggregated data.

2.1 - Creating top-k projections

You define a Top-K projection with the following syntax:.

You define a Top-K projection with the following syntax:


CREATE PROJECTION proj-name [(proj-column-spec)]
    AS SELECT select-expression FROM table
    LIMIT num-rows OVER (PARTITION BY expression ORDER BY column-expr);

For full syntax options, see CREATE PROJECTION.

For example:


=> CREATE PROJECTION readings_topk (meter_id, recent_date, recent_value)
    AS SELECT meter_id, reading_date, reading_value FROM readings
    LIMIT 5 OVER (PARTITION BY meter_id ORDER BY reading_date DESC);

For an extended discussion, see Top-k projection examples.

Requirements

The following requirements apply to Top-K projections:

  • The projection cannot be unsegmented.

  • The window partition clause must use PARTITION BY.

  • Columns in PARTITION BY and ORDER BY clauses must be the first columns specified in the SELECT list.

  • You must use the LIMIT option to create a Top-K projection, instead of subqueries. For example, the following SELECT statements are equivalent:

    
    => SELECT symbol, trade_time last_trade, price last_price FROM (
        SELECT symbol, trade_time, price, ROW_NUMBER()
        OVER(PARTITION BY symbol ORDER BY trade_time DESC) rn FROM trades) trds WHERE rn <=1;
    
    => SELECT symbol, trade_time last_trade, price last_price FROM trades
        LIMIT 1 OVER(PARTITION BY symbol ORDER BY trade_time DESC);
    

    Both return the same results:

    
          symbol      |      last_trade       | last_price
    ------------------+-----------------------+------------
     AAPL             | 2011-11-10 10:10:20.5 |   108.4000
     HPQ              | 2012-10-10 10:10:10.4 |    42.0500
    (2 rows)
    

    A Top-K projection that pre-aggregates data for use by both queries must include the LIMIT option:

    
    => CREATE PROJECTION trades_topk AS
         SELECT symbol, trade_time last_trade, price last_price FROM trades
         LIMIT 1 OVER(PARTITION BY symbol ORDER BY trade_time DESC);
    

Restrictions

The following restrictions apply to Top-K projections:

  • Top-K projections can reference only one table.

  • Vertica does not regard Top-K projections as superprojections, even those that include all table columns.

  • You cannot modify the anchor table metadata of columns that are included in Top-K projections—for example, a column's data type or default value. You also cannot drop these columns. To make these changes, first drop all live aggregate and Top-K projections that are associated with the table.

2.2 - Top-k projection examples

The following examples show how to query a table with two Top-K projections for the most-recent trade and last trade of the day for each stock symbol.

The following examples show how to query a table with two Top-K projections for the most-recent trade and last trade of the day for each stock symbol.

  1. Create a table that contains information about individual stock trades:

    • Stock symbol

    • Timestamp

    • Price per share

    • Number of shares

    => CREATE TABLE trades(
        symbol CHAR(16) NOT NULL,
        trade_time TIMESTAMP NOT NULL,
        price NUMERIC(12,4),
        volume INT )
        PARTITION BY (EXTRACT(year from trade_time) * 100 +
        EXTRACT(month from trade_time));
    
  2. Load data into the table:

    
    INSERT INTO trades VALUES('AAPL','2010-10-10 10:10:10'::TIMESTAMP,100.00,100);
    INSERT INTO trades VALUES('AAPL','2010-10-10 10:10:10.3'::TIMESTAMP,101.00,100);
    INSERT INTO trades VALUES ('AAPL','2011-10-10 10:10:10.5'::TIMESTAMP,106.1,1000);
    INSERT INTO trades VALUES ('AAPL','2011-10-10 10:10:10.2'::TIMESTAMP,105.2,500);
    INSERT INTO trades VALUES ('HPQ','2012-10-10 10:10:10.2'::TIMESTAMP,42.01,400);
    INSERT INTO trades VALUES ('HPQ','2012-10-10 10:10:10.3'::TIMESTAMP,42.02,1000);
    INSERT INTO trades VALUES ('HPQ','2012-10-10 10:10:10.4'::TIMESTAMP,42.05,100);
    COMMIT;
    
  3. Create two Top-K projections that obtain the following information from the trades table:

    For each stock symbol, return the most recent trade.

    
    => CREATE PROJECTION trades_topk_a AS SELECT symbol, trade_time last_trade, price last_price
           FROM trades LIMIT 1 OVER(PARTITION BY symbol ORDER BY trade_time DESC);
    
    => SELECT symbol, trade_time last_trade, price last_price FROM trades
       LIMIT 1 OVER(PARTITION BY symbol ORDER BY trade_time DESC);
    
          symbol      |      last_trade       | last_price
    ------------------+-----------------------+------------
     HPQ              | 2012-10-10 10:10:10.4 |    42.0500
     AAPL             | 2011-10-10 10:10:10.5 |   106.1000
    (2 rows)
    


    For each stock symbol, return the last trade on each trading day.

    
    => CREATE PROJECTION trades_topk_b
        AS SELECT symbol, trade_time::DATE trade_date, trade_time, price close_price, volume
        FROM trades LIMIT 1 OVER(PARTITION BY symbol, trade_time::DATE ORDER BY trade_time DESC);
    
    => SELECT symbol, trade_time::DATE trade_date, trade_time, price close_price, volume
        FROM trades LIMIT 1 OVER(PARTITION BY symbol, trade_time::DATE ORDER BY trade_time DESC);
    
          symbol      | trade_date |      trade_time       | close_price | volume
    ------------------+------------+-----------------------+-------------+--------
     HPQ              | 2012-10-10 | 2012-10-10 10:10:10.4 |     42.0500 |    100
     AAPL             | 2011-10-10 | 2011-10-10 10:10:10.5 |    106.1000 |   1000
     AAPL             | 2010-10-10 | 2010-10-10 10:10:10.3 |    101.0000 |    100
    (3 rows)
    

In each scenario, Vertica redirects queries on the trades table to the appropriate Top-K projection and returns the aggregated data from them. As additional data is loaded into this table, Vertica pre-aggregates the new data and updates the Top-K projections, so queries always return with the latest data.

3 - Pre-aggregating UDTF results

CREATE PROJECTION can define live aggregate projections that invoke user-defined transform functions (UDTFs).

CREATE PROJECTION can define live aggregate projections that invoke user-defined transform functions (UDTFs). To minimize overhead when you query those projections, Vertica processes these functions in the background and stores their results on disk.

Defining projections with UDTFs

The projection definition characterizes UDTFs in one of two ways:

  • Identifies the UDTF as a pre-pass UDTF, which transforms newly loaded data before it is stored in the projection ROS containers.

  • Identifies the UDTF as a batch UDTF, which aggregates and stores projection data.

The projection definition identifies a UDTF as a pre-pass UDTF or batch UDTF in its window partition clause, through the keywords PREPASS or BATCH. A projection can specify one pre-pass or batch UDTF or include both (see UDTF Specification Options).

In all cases, the projection is implicitly segmented and ordered on the PARTITION BY columns.

UDTF specification options

Projections can invoke batch and pre-pass UDTFs singly or in combination.

Single pre-pass UDTF

Vertica invokes the pre-pass UDTF when you load data into the projection's anchor table—for example through COPY or INSERT statements. A pre-pass UDTF transforms the new data and then stores the transformed data in the projection's ROS containers.

Use the following syntax:


CREATE PROJECTION [ IF NOT EXISTS ] [[database.]schema.]projection
[ (
   { projection-column | grouped-clause
   [ ENCODING encoding-type ]
   [ ACCESSRANK integer ] }[,...]
) ]

AS SELECT { table-column | expr-with-table-columns }[,...], prepass-udtf(prepass-args)
    OVER (PARTITION PREPASS BY partition-column-expr[,...])
    [ AS (prepass-output-columns) ] FROM table [[AS] alias]

Single batch UDTF

When invoked singly, a batch UDTF transforms and aggregates projection data on mergeout, data load, and query operations. The UDTF stores aggregated results in the projection's ROS containers. Aggregation is cumulative across mergeout and load operations, and is completed (if necessary) on query execution.

Use the following syntax:


CREATE PROJECTION [ IF NOT EXISTS ] [[database.]schema.]projection
[ (
   { projection-column | grouped-clause
   [ ENCODING encoding-type ]
   [ ACCESSRANK integer ]  }[,...]
) ]
AS SELECT { table-column | expr-with-table-columns }[,...], batch-udtf(batch-args)
   OVER (PARTITION BATCH BY partition-column-expr[,...])
   [ AS (batch-output-columns) FROM table [ [AS] alias ]

Combined pre-pass and batch UDTFs

You can define a projection with a subquery that invokes a pre-pass UDTF. The pre-pass UDTF returns transformed data to the outer batch query. The batch UDTF then iteratively aggregates results across mergeout operations. It completes aggregation (if necessary) on query execution.

Use the following syntax:


CREATE PROJECTION [ IF NOT EXISTS ] [[database.]schema.]projection
[ (
   { projection-column | grouped-clause
   [ ENCODING encoding-type ]
   [ ACCESSRANK integer ] }[,...]
) ]
AS SELECT { table-column | expr-with-table-columns }[,...], batch-udtf(batch-args)
   OVER (PARTITION BATCH BY partition-column-expr[,...]) [ AS (batch-output-columns) ] FROM (
      SELECT { table-column | expr-with-table-columns }[,...], prepass-udtf (prepass-args)
      OVER (PARTITION PREPASS BY partition-column-expr[,...]) [ AS (prepass-output-columns) ] FROM table ) sq-ref

Examples

Single pre-pass UDTF
The following example shows how to use the UDTF text_index, which extracts from a text document strings that occur more than once.

The following projection specifies to invoke text_index as a pre-pass UDTF:


=> CREATE TABLE documents ( doc_id INT PRIMARY KEY, text VARCHAR(140));

=> CREATE PROJECTION index_proj (doc_id, text)
     AS SELECT doc_id, text_index(doc_id, text)
     OVER (PARTITION PREPASS BY doc_id) FROM documents;

The UDTF is invoked whenever data is loaded into the anchor table documents. text_index transforms the newly loaded data, and Vertica stores the transformed data in the live aggregate projection ROS containers.

So, if you load the following data into documents:

=> INSERT INTO documents VALUES
(100, 'A SQL Query walks into a bar. In one corner of the bar are two tables.
 The Query walks up to the tables and asks - Mind if I join you?');
 OUTPUT
--------
      1
(1 row)

text_index transforms the newly loaded data and stores it in the projection ROS containers. When you query the projection, it returns with the following results:


doc_id | frequency |     term
-------+-----------+--------------
100    | 2         | bar
100    | 2         | Query
100    | 2         | tables
100    | 2         | the
100    | 2         | walks

Combined Pre-Pass and Batch UDTFs
The following projection specifies pre-pass and batch UDTFs stv_intersect and aggregate_classified_points, respectively:


CREATE TABLE points( point_id INTEGER, point_type VARCHAR(10), coordinates GEOMETRY(100));

CREATE PROJECTION aggregated_proj
   AS SELECT point_type, aggregate_classified_points( sq.point_id, sq.polygon_id)
   OVER (PARTITION BATCH BY point_type)
   FROM
      (SELECT point_type, stv_intersect(
         point_id, coordinates USING PARAMETERS index=‘polygons’ )
       OVER (PARTITION PREPASS BY point_type) AS (point_id, polygon_id) FROM points) sq;

The pre-pass query UDTF stv_intersect returns its results (a set of point and matching polygon IDs) to the outer batch query. The outer batch query then invokes the UDTF aggregate_classified_points. Vertica aggregates the result set that is returned by aggregate_classified_points whenever a mergeout operation consolidates projection data. Final aggregation (if necessary) occurs when the projection is queried.

The batch UDTF arguments must exactly match the output columns returned by the pre-pass UDTF stv_intersect, in name and order. In this example, the pre-pass subquery explicitly names the pre-pass UDTF output columns point_id and polygon_id. Accordingly, the batch UDTF arguments match them in name and order: sq.point_id and sq.polygon_id.

4 - Aggregating data through expressions

You can create projections where one or more columns are defined by expressions.

You can create projections where one or more columns are defined by expressions. An expression can reference one or more anchor table columns. For example, the following table contains two integer columns, a and b:

=> CREATE TABLE values (a INT, b INT);

You can create a projection with an expression that calculates the value of column c as the product of a and b:

=> CREATE PROJECTION values_product (a, b, c)
   AS SELECT a, b, a*b FROM values SEGMENTED BY HASH(a) ALL NODES KSAFE;

When you load data into this projection, Vertica resolves the expression a*b in column c. You can then query the projection instead of the anchor table. Vertica returns the pre-calculated data and avoids the overhead otherwise incurred by resource-intensive computations.

Using expressions in projections also lets you sort or segment data on the calculated results of an expression instead of sorting on single column values.

Support for user-defined scalar functions

Vertica treats user-defined scalar functions (UDSFs) like other expressions. On each load operation, the UDSF is invoked and returns its results. Vertica stores these results on disk, and returns them when you query the projection directly.

In the following example, the projection points_p1 specifies the UDSF zorder, which is invoked whenever data is loaded in the anchor table points. When data is loaded into the projection, Vertica invokes this function and stores its results for fast access by future queries.

=> CREATE TABLE points(point_id INTEGER, lat NUMERIC(12,9), long NUMERIC(12,9));

=> CREATE PROJECTION points_p1
     AS SELECT point_id, lat, long, zorder(lat, long) zorder FROM points
     ORDER BY zorder(lat, long) SEGMENTED BY hash(point_id) ALL NODES;

Requirements

  • Any ORDER BY expression must be in the SELECT list.

  • All projection columns must be named.

Restrictions

  • MERGE operations must be optimized if they are performed on target tables that have live aggregate projections.

  • Unlike live aggregate projections, Vertica does not redirect queries with expressions to an equivalent existing projection.

  • Projection expressions must be immutable—that is, they must always return the same result. For example, a projection cannot include expressions that use TO CHAR (depends on locale) or RANDOM (returns different value at each invocation).

  • Projection expressions cannot include Vertica meta-functions such as ADVANCE_EPOCH, ANALYZE_STATISTICS, EXPORT_TABLES, or START_REFRESH.

4.1 - Querying data through expressions example

The following example uses a table that contains two integer columns, a and b:.

The following example uses a table that contains two integer columns, a and b:

=> CREATE TABLE values (a INT, b INT);

You can create a projection with an expression that calculates the value of column c as the product of a and b:

=> CREATE PROJECTION values_product (a, b, c)
   AS SELECT a, b, a*b FROM values SEGMENTED BY HASH(a) ALL NODES KSAFE;
=> COPY values FROM STDIN DELIMITER ',' DIRECT;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> 3,11
>> 3,55
>> 8,9
>> 8,23
>> 16,41
>> 22,111
>> \.
=>

To query this projection, use the name that Vertica assigns to it or to its buddy projections. For example, the following queries target different instances of the projection defined earlier, and return the same results:

=> SELECT * FROM values_product_b0;
=> SELECT * FROM values_product_b1;

The following example queries the anchor table:

=> SELECT * FROM values;
 a  |  b
----+-----
  3 |  11
  3 |  55
  8 |   9
  8 |  23
 16 |  41
 22 | 111

Given the projection created earlier, querying that projection returns the following values:

VMart=> SELECT * FROM values_product_b0;
 a  |  b  | product
----+-----+---------
  3 |  11 |      33
  3 |  55 |     165
  8 |   9 |      72
  8 |  23 |     184
 16 |  41 |     656
 22 | 111 |    2442

5 - Aggregation information in system tables

You can query the following system table fields for information about live aggregate projections, Top-K projections, and projections with expressions:.

You can query the following system table fields for information about live aggregate projections, Top-K projections, and projections with expressions:

Table Fields
TABLES TABLE_HAS_AGGREGATE_PROJECTION
PROJECTIONS AGGREGATE_TYPE
HAS_EXPRESSIONS
AGGREGATE_TYPE
PROJECTION_COLUMNS COLUMN_EXPRESSION
IS_AGGREGATE
IS_EXPRESSION
ORDER_BY_POSITION
ORDER_BY_TYPE
PARTITION_BY_POSITION