This is the multi-page printable view of this section. Click here to print.

Return to the regular view of this page.

CREATE PROJECTION

Creates metadata for a in the Vertica catalog.

Creates metadata for a projection in the Vertica catalog. Vertica supports four types of projections:

  • Standard projection: Stores a collection of table data in a format that optimizes execution of certain queries on that table.

  • Live aggregate projection: Stores the grouped results of queries that invoke aggregate functions (such as SUM) on table columns.

  • Top-K projection: Stores the top k rows from partitions of selected rows.

  • UDTF projection: Stores newly-loaded data after it is transformed and/or aggregated by user-defined transformation functions (UDTFs).

Complex data types have additional restrictions when used within a projection:

  • Each projection must include at least one column that is a primitive type or native array.

  • An AS SELECT clause can use a complex-type column, but any other expression must be of a scalar type or native array.

  • The ORDER BY, PARTITION BY, and GROUP BY clauses cannot use complex types.

  • If a projection does not include an ORDER BY or segmentation clause, Vertica uses only the primitive columns from the select list to order or segment data.

  • Projection columns cannot be complex types returned from functions such as ARRAY_CAT.

  • TopK and UDTF projections do not support complex types.

1 - Encoding types

Vertica supports various encoding and compression types, specified by the following ENCODING parameter arguments:.

Vertica supports various encoding and compression types, specified by the following ENCODING parameter arguments:

You can set encoding types on a projection column when you create the projection. You can also change the encoding of one or more projection columns for a given table with ALTER TABLE...ALTER COLUMN.

AUTO (default)

AUTO encoding is ideal for sorted, many-valued columns such as primary keys. It is also suitable for general purpose applications for which no other encoding or compression scheme is applicable. Therefore, it serves as the default if no encoding/compression is specified.

Column data type Default encoding type
BINARY/VARBINARY
BOOLEAN
CHAR/VARCHAR
FLOAT
Lempel-Ziv-Oberhumer-based (LZO) compression
DATE/TIME/TIMESTAMP
INTEGER
INTERVAL
Compression scheme based on the delta between consecutive column values.

The CPU requirements for this type are relatively small. In the worst case, data might expand by eight percent (8%) for LZO and twenty percent (20%) for integer data.

BLOCK_DICT

For each block of storage, Vertica compiles distinct column values into a dictionary and then stores the dictionary and a list of indexes to represent the data block.

BLOCK_DICT is ideal for few-valued, unsorted columns where saving space is more important than encoding speed. Certain kinds of data, such as stock prices, are typically few-valued within a localized area after the data is sorted, such as by stock symbol and timestamp, and are good candidates for BLOCK_DICT. By contrast, long CHAR/VARCHAR columns are not good candidates for BLOCK_DICT encoding.

CHAR and VARCHAR columns that contain 0x00 or 0xFF characters should not be encoded with BLOCK_DICT. Also, BINARY/VARBINARY columns do not support BLOCK_DICT encoding.

BLOCK_DICT encoding requires significantly higher CPU usage than default encoding schemes. The maximum data expansion is eight percent (8%).

BLOCKDICT_COMP

This encoding type is similar to BLOCK_DICT except dictionary indexes are entropy coded. This encoding type requires significantly more CPU time to encode and decode and has a poorer worst-case performance. However, if the distribution of values is extremely skewed, using BLOCK_DICT_COMP encoding can lead to space savings.

BZIP_COMP

BZIP_COMP encoding uses the bzip2 compression algorithm on the block contents. See bzip web site for more information. This algorithm results in higher compression than the automatic LZO and gzip encoding; however, it requires more CPU time to compress. This algorithm is best used on large string columns such as VARCHAR, VARBINARY, CHAR, and BINARY. Choose this encoding type when you are willing to trade slower load speeds for higher data compression.

COMMONDELTA_COMP

This compression scheme builds a dictionary of all deltas in the block and then stores indexes into the delta dictionary using entropy coding.

This scheme is ideal for sorted FLOAT and INTEGER-based (DATE/TIME/TIMESTAMP/INTERVAL) data columns with predictable sequences and only occasional sequence breaks, such as timestamps recorded at periodic intervals or primary keys. For example, the following sequence compresses well: 300, 600, 900, 1200, 1500, 600, 1200, 1800, 2400. The following sequence does not compress well: 1, 3, 6, 10, 15, 21, 28, 36, 45, 55.

If delta distribution is excellent, columns can be stored in less than one bit per row. However, this scheme is very CPU intensive. If you use this scheme on data with arbitrary deltas, it can cause significant data expansion.

DELTARANGE_COMP

This compression scheme is primarily used for floating-point data; it stores each value as a delta from the previous one.

This scheme is ideal for many-valued FLOAT columns that are sorted or confined to a range. Do not use this scheme for unsorted columns that contain NULL values, as the storage cost for representing a NULL value is high. This scheme has a high cost for both compression and decompression.

To determine if DELTARANGE_COMP is suitable for a particular set of data, compare it to other schemes. Be sure to use the same sort order as the projection, and select sample data that will be stored consecutively in the database.

DELTAVAL

For INTEGER and DATE/TIME/TIMESTAMP/INTERVAL columns, data is recorded as a difference from the smallest value in the data block. This encoding has no effect on other data types.

DELTAVAL is best used for many-valued, unsorted integer or integer-based columns. CPU requirements for this encoding type are minimal, and data never expands.

GCDDELTA

For INTEGER and DATE/TIME/TIMESTAMP/INTERVAL columns, and NUMERIC columns with 18 or fewer digits, data is recorded as the difference from the smallest value in the data block divided by the greatest common divisor (GCD) of all entries in the block. This encoding has no effect on other data types.

ENCODING GCDDELTA is best used for many-valued, unsorted, integer columns or integer-based columns, when the values are a multiple of a common factor. For example, timestamps are stored internally in microseconds, so data that is only precise to the millisecond are all multiples of 1000. The CPU requirements for decoding GCDDELTA encoding are minimal, and the data never expands, but GCDDELTA may take more encoding time than DELTAVAL.

GZIP_COMP

This encoding type uses the gzip compression algorithm. See gzip web site for more information. This algorithm results in better compression than the automatic LZO compression, but lower compression than BZIP_COMP. It requires more CPU time to compress than LZO but less CPU time than BZIP_COMP. This algorithm is best used on large string columns such as VARCHAR, VARBINARY, CHAR, and BINARY. Use this encoding when you want a better compression than LZO, but at less CPU time than bzip2.

RLE

RLE (run length encoding) replaces sequences (runs) of identical values with a single pair that contains the value and number of occurrences. Therefore, it is best used for low cardinality columns that are present in the ORDER BY clause of a projection.

The Vertica execution engine processes RLE encoding run-by-run and the Vertica optimizer gives it preference. Use it only when run length is large, such as when low-cardinality columns are sorted.

Zstandard compression

Vertica supports three ZSTD compression types:

  • ZSTD_COMP provides high compression ratios. This encoding type has a higher compression than gzip. Use this when you want a better compression than gzip. For general use cases, use this or the ZSTD_FAST_COMP encoding type.

  • ZSTD_FAST_COMP uses the fastest compression level that the zstd library provides. It is the fastest encoding type of the zstd library, but takes up more space than the other two encoding types. For general use cases, use this or the ZSTD_COMP encoding type.

  • ZSTD_HIGH_COMP offers the best compression in the zstd library. It is slower than the other two encoding types. Use this type when you need the best compression, with slower CPU time.

2 - GROUPED clause

Groups two or more columns into a single disk file.

Enterprise Mode only

Groups two or more columns into a single disk file. Doing so minimizes file I/O for the following tasks:

  • Read a large percentage of the columns in a table.

  • Perform single row look-ups.

  • Query against many small columns.

  • Frequently update data in these columns.

You can improve query performance by grouping columns that are always accessed together and are not used in predicates. Once columns are grouped, queries can no longer retrieve from disk records for one column independently of the others.

You can group columns in several ways:

  • Group some of the columns:

    (a, GROUPED(b, c), d)
    
  • Group all of the columns:

    (GROUPED(a, b, c, d))
    
  • Create multiple groupings in the same projection:

    (GROUPED(a, b), GROUPED(c, d))
    

Grouping columns

The following example shows how to group columns bid and ask. The stock column is stored separately.

=> CREATE TABLE trades (stock CHAR(5), bid INT, ask INT);
=> CREATE PROJECTION tradeproj (stock ENCODING RLE,
   GROUPED(bid ENCODING DELTAVAL, ask))
   AS (SELECT * FROM trades) KSAFE 1;

The following example show how to create a projection that uses expressions in the column definition. The projection contains two integer columns a and b, and a third column product_value that stores the product of a and b:

=> CREATE TABLE values (a INT, b INT);
=> CREATE PROJECTION product (a, b, product_value) AS
   SELECT a, b, a*b FROM values ORDER BY a KSAFE;

3 - Hash segmentation clause

A general SQL expression.

Specifies how to segment projection data for distribution across all cluster nodes. You can specify segmentation for a table and a projection. If a table definition specifies segmentation, Vertica uses it for that table's auto-projections.

It is strongly recommended that you use Vertica's built-in HASH function, which distributes data evenly across the cluster, and facilitates optimal query execution.

Syntax

SEGMENTED BY expression ALL NODES [ OFFSET offset ]

Parameters

SEGMENTED BY expression
A general SQL expression. Hash segmentation is the preferred method of segmentation. Vertica recommends using its built-in HASH function, whose arguments resolve to table columns. If you use an expression other than HASH, Vertica issues a warning.

The segmentation expression should specify columns with a large number of unique data values and acceptable skew in their data distribution. In general, primary key columns that meet these criteria are good candidates for hash segmentation.

For details, see Expression Requirements below.

ALL NODES
Automatically distributes data evenly across all nodes when the projection is created. Node ordering is fixed.
OFFSET offset
A zero-based offset that indicates on which node to start segmentation distribution.

This option is not valid for CREATE TABLE and CREATE TEMPORARY TABLE.

Expression requirements

A segmentation expression must specify table columns as they are defined in the source table. Projection column names are not supported.

The following restrictions apply to segmentation expressions:

  • All leaf expressions must be constants or column references to a column in the CREATE PROJECTION 's SELECT list.

  • The expression must return the same value over the life of the database.

  • Aggregate functions are not allowed.

  • The expression must return non-negative INTEGER values in the range 0 <= x < 263, and values are generally distributed uniformly over that range.

Examples

The following CREATE PROJECTION statement creates projection public.employee_dimension_super. It specifies to include all columns in table public.employee_dimension. The hash segmentation clause invokes the Vertica HASH function to segment projection data on the column employee_key; it also includes the ALL NODES clause, which specifies to distribute projection data evenly across all nodes in the cluster:

=> CREATE PROJECTION public.employee_dimension_super
    AS SELECT * FROM public.employee_dimension
    ORDER BY employee_key
    SEGMENTED BY hash(employee_key) ALL NODES;

4 - Live aggregate projection

Stores the grouped results of queries that invoke aggregate functions (such as SUM) on table columns.

Stores the grouped results of queries that invoke aggregate functions (such as SUM) on table columns. For details, see Live aggregate projections.

Vertica does not regard live aggregate projections as superprojections, even those that include all table columns. For other requirements and restrictions, see Creating live aggregate projections.

Syntax

CREATE PROJECTION [ IF NOT EXISTS ] [[database.]schema.]projection
[ (
   { projection-column | grouped-clause
   [ ENCODING encoding-type ]
   [ ACCESSRANK integer ] }[,...]
) ]
AS SELECT { table-column | expr-with-table-columns }[,...] FROM [[database.]schema.]table [ [AS] alias]
   GROUP BY column-expr[,...]
   [ KSAFE [ k-num ] ]

Arguments

IF NOT EXISTS

If an object with the same name exists, return without creating the object. If you do not use this directive and the object already exists, Vertica returns with an error message.

The IF NOT EXISTS clause is useful for SQL scripts where you might not know if the object already exists. The ON ERROR STOP directive can be helpful in scripts.

[database.]schema

Schema for this projection and its anchor table. The value must be the same for both. If you specify a database, it must be the current database.

projection

Name of the projection to create, which must conform to Identifiers. It must also be unique among all names of sequences, tables, projections, views, and models within the same schema.

projection-column

The name of a projection column. The list of projection columns must match the SELECT list columns and expressions in number, type, and sequence.

If projection column names are omitted, Vertica uses the anchor table column names specified in the SELECT list.

grouped-clause
See GROUPED clause.
ENCODING encoding-type

The column encoding type, by default set to AUTO.

ACCESSRANK integer

Overrides the default access rank for a column. Use this parameter to increase or decrease the speed at which Vertica accesses a column. For more information, see Overriding Default Column Ranking.

AS SELECT

Specifies the table data to query:

{table-column | expr-with-table-columns } [ [AS] alias] }[,...]

You can optionally assign an alias to each column expression and reference that alias elsewhere in the SELECT statement.

If you specify projection column names, the two lists of projection columns and table columns/expressions must exactly match in number and order.

GROUP BY column-expr[,...]
One or more column expressions from the SELECT list. The first expression must be the first column expression in the SELECT list, the second expression must be the second column expression in the SELECT list, and so on.

Privileges

Non-superusers:

  • Anchor table owner

  • CREATE privilege on the schema

Examples

See Live aggregate projection example.

5 - Standard projection

Stores a collection of table data in a format that optimizes execution of certain queries on that table.

Stores a collection of table data in a format that optimizes execution of certain queries on that table. For details, see Projections .

Syntax

CREATE PROJECTION [ IF NOT EXISTS ] [[database.]schema.]projection
[ (
   { projection-column | grouped-clause
   [ ENCODING encoding-type ]
   [ ACCESSRANK integer ] }[,...]
) ]
AS SELECT { * | { MATCH_COLUMNS('pattern') | expression [ [AS] alias ] }[,...] }
   FROM [[database.]schema.]table [ [AS] alias]
   [ ORDER BY column[,...] ]
   [ SEGMENTED BY expression ALL NODES [ OFFSET offset ] | UNSEGMENTED ALL NODES ]
   [ KSAFE [ k-num ]
   [ ON PARTITION RANGE BETWEEN min-val AND max-val ] ]

Arguments

IF NOT EXISTS

If an object with the same name exists, return without creating the object. If you do not use this directive and the object already exists, Vertica returns with an error message.

The IF NOT EXISTS clause is useful for SQL scripts where you might not know if the object already exists. The ON ERROR STOP directive can be helpful in scripts.

[database.]schema

Schema for this projection and its anchor table. The value must be the same for both. If you specify a database, it must be the current database.

projection

Name of the projection to create, which must conform to Identifiers. It must also be unique among all names of sequences, tables, projections, views, and models within the same schema.

projection-column

The name of a projection column. The list of projection columns must match the SELECT list columns and expressions in number, type, and sequence.

If projection column names are omitted, Vertica uses the anchor table column names specified in the SELECT list.

grouped-clause
See GROUPED clause.
ENCODING encoding-type

The column encoding type, by default set to AUTO.

ACCESSRANK integer

Overrides the default access rank for a column. Use this parameter to increase or decrease the speed at which Vertica accesses a column. For more information, see Overriding Default Column Ranking.

AS SELECT
Columns or column expressions to select from the specified table:
  • * (asterisk): Returns all columns in the queried tables.

  • MATCH_COLUMNS('pattern'): Returns the names of all columns in the queried anchor table that match pattern.

  • expression[[AS]alias]: Resolves to column data from the queried anchor table. You can optionally assign an alias to each column expression and reference that alias elsewhere in the SELECT statement—for example, in the ORDER BY or segmentation clause.

If you specify projection column names, the two lists of projection columns and table columns/expressions must exactly match in number and order.

ORDER BY column
Columns from the SELECT list on which to sort the projection. The ORDER BY clause can only be set to ASC (the default). Vertica always stores projection data in ascending sort order.

If you order by a column with a collection data type (ARRAY or SET), queries that use that column in an ORDER BY clause perform the sort again. This is because projections and queries perform the ordering differently.

If you omit the ORDER BY clause, Vertica uses the SELECT list to sort the projection.

SEGMENTED BY expression ALL NODES [ OFFSET offset ] | UNSEGMENTED ALL NODES

How to distribute projection data:

If neither the anchor table nor the projection specifies segmentation, Vertica uses hash segmentation using all columns:

SEGMENTED BY HASH(column[,...]) ALL NODES OFFSET 0;
KSAFE k-num

Specifies K-safety for the projection. The value must be equal to or greater than database K-safety. Vertica ignores this option if set for unsegmented projections.

If you omit this clause, Vertica uses database K-safety.

For general information, see K-safety in an Enterprise Mode database.

ON PARTITION RANGE BETWEEN min-val AND max-val

Limits projection data to a range of partition keys. The minimum value must be less than or equal to the maximum value.

Values can be NULL. A null minimum value indicates no lower bound and a null maximum value indicates no upper bound. If both are NULL, this statement drops the projection endpoints, producing a regular projection instead of a range projection.

For other requirements and usage details, see Partition range projections.

Privileges

Non-superusers:

  • Anchor table owner

  • CREATE privilege on the schema

Examples

See:

6 - Top-k projection

Stores the top k rows from partitions of selected rows.

Stores the top k rows from partitions of selected rows. For details, see Top-k projections.

Vertica does not regard Top-K projections as superprojections, even those that include all table columns. For other requirements and restrictions, see Creating top-k projections.

Syntax

CREATE PROJECTION [ IF NOT EXISTS ] [[database.]schema.]projection
[ (
   { projection-column | grouped-clause
   [ ENCODING encoding-type ]
   [ ACCESSRANK integer ] }[,...]
) ]
AS SELECT { table-column | expr-with-table-columns }[,...] FROM [[database.]schema.]table [ [AS] alias]
   LIMIT num-rows OVER ( window-partition-clause window-order-clause )
   [ KSAFE [ k-num ] ]

Arguments

IF NOT EXISTS

If an object with the same name exists, return without creating the object. If you do not use this directive and the object already exists, Vertica returns with an error message.

The IF NOT EXISTS clause is useful for SQL scripts where you might not know if the object already exists. The ON ERROR STOP directive can be helpful in scripts.

[database.]schema

Schema for this projection and its anchor table. The value must be the same for both. If you specify a database, it must be the current database.

projection

Name of the projection to create, which must conform to Identifiers. It must also be unique among all names of sequences, tables, projections, views, and models within the same schema.

projection-column

The name of a projection column. The list of projection columns must match the SELECT list columns and expressions in number, type, and sequence.

If projection column names are omitted, Vertica uses the anchor table column names specified in the SELECT list.

grouped-clause
See GROUPED clause.
ENCODING encoding-type

The column encoding type, by default set to AUTO.

ACCESSRANK integer

Overrides the default access rank for a column. Use this parameter to increase or decrease the speed at which Vertica accesses a column. For more information, see Overriding Default Column Ranking.

AS SELECT

Specifies the table data to query:

{table-column | expr-with-table-columns } [ [AS] alias] }[,...]

You can optionally assign an alias to each column expression and reference that alias elsewhere in the SELECT statement.

If you specify projection column names, the two lists of projection columns and table columns/expressions must exactly match in number and order.

LIMIT num-rows
The number of rows to return from the specified partition.
OVER (window-partition-clause window-order-clause)
Specifies window partitioning by one or more comma-delimited column expressions from the SELECT list. The first partition expression must be the first SELECT list item, the second partition expression the second SELECT list item, and so on.

The order clause is required, and specifies the order in which the top k rows are returned. The default is ascending (ASC) order. All column expressions must be from the SELECT list, where the first window order expression must be the first SELECT list item not specified in the window partition clause.

Top-K projections support ORDER BY NULLS FIRST/LAST.

Privileges

Non-superusers:

  • Anchor table owner

  • CREATE privilege on the schema

Examples

See Top-k projection examples.

7 - UDTF projection

Stores newly-loaded data after it is transformed and/or aggregated by user-defined transformation functions (UDTFs).

Stores newly-loaded data after it is transformed and/or aggregated by user-defined transformation functions (UDTFs). For details and examples, see Pre-aggregating UDTF results.

Syntax

CREATE PROJECTION [ IF NOT EXISTS ] [[database.]schema.]projection
[ (
   { projection-column | grouped-clause
   [ ENCODING encoding-type ]
   [ ACCESSRANK integer ]  }[,...]
) ]
AS { batch-query FROM { prepass-query sq-ref | table [[AS] alias] }
     | prepass-query }

batch-query:

SELECT { table-column | expr-with-table-columns }[,...], batch-udtf(batch-args)
   OVER (PARTITION BATCH BY partition-column-expr[,...])
   [ AS (output-columns) ]

prepass-query:

SELECT { table-column | expr-with-table-columns }[,...], prepass-udtf(prepass-args)
   OVER (PARTITION PREPASS BY partition-column-expr[,...])
   [ AS (output-columns) ] FROM table

Arguments

IF NOT EXISTS

If an object with the same name exists, return without creating the object. If you do not use this directive and the object already exists, Vertica returns with an error message.

The IF NOT EXISTS clause is useful for SQL scripts where you might not know if the object already exists. The ON ERROR STOP directive can be helpful in scripts.

[database.]schema

Schema for this projection and its anchor table. The value must be the same for both. If you specify a database, it must be the current database.

projection

Name of the projection to create, which must conform to Identifiers. It must also be unique among all names of sequences, tables, projections, views, and models within the same schema.

projection-column

The name of a projection column. The list of projection columns must match the SELECT list columns and expressions in number, type, and sequence.

If projection column names are omitted, Vertica uses the anchor table column names specified in the SELECT list.

grouped-clause
See GROUPED clause.
ENCODING encoding-type

The column encoding type, by default set to AUTO.

ACCESSRANK integer

Overrides the default access rank for a column. Use this parameter to increase or decrease the speed at which Vertica accesses a column. For more information, see Overriding Default Column Ranking.

SELECT

Specifies the table data to query:

{table-column | expr-with-table-columns } [ [AS] alias] }[,...]

You can optionally assign an alias to each column expression and reference that alias elsewhere in the SELECT statement.

If you specify projection column names, the two lists of projection columns and table columns/expressions must exactly match in number and order.

batch-udtf(batch-args)
The batch UDTF to invoke each time the following events occur:
  • Tuple mover mergeout

  • Queries on the projection

  • If invoked singly, on data load operations

prepass-udtf(prepass-args)
The pre-pass UDTF to invoke on each load operation such as COPY or INSERT.

If specified in a subquery, the pre-pass UDTF returns transformed data to the batch query for further processing. Otherwise, the pre-pass query results are added to projection data storage.

OVER (PARTITION { BATCH | PREPASS } BY partition-column-expr[,...])
Specifies the UDTF type and how to partition the data it returns:

In both cases, the OVER clause specifies partitioning with one or more column expressions from the SELECT list. The first expression is the first column expression in the SELECT list, the second expression is the second column expression in the SELECT list, and so on.

AS output-columns
Optionally names columns that are returned by the UDTF.

If a pre-pass subquery omits this clause, the outer batch query UDTF arguments (batch-args) must reference the column names as they are defined in the pre-pass UDTF.

table[[AS]alias]
The projection's anchor table, optionally qualified by an alias.
sq-results
Subquery result set that is returned to the outer batch UDTF.

Privileges

Non-superusers:

  • Anchor table owner

  • CREATE privilege on the schema

  • EXECUTE privileges on all UDTFs that are referenced by the projection

Examples

See Pre-aggregating UDTF results.

8 - Unsegmented clause

Specifies to distribute identical copies of table or projection data on all nodes across the cluster.

Specifies to distribute identical copies of table or projection data on all nodes across the cluster. Use this clause to facilitate distributed query execution on tables and projections that are too small to benefit from segmentation.

Vertica uses the same name to identify all instances of an unsegmented projection. For more information about projection name conventions, see Projection naming.

Syntax

UNSEGMENTED ALL NODES

Examples

This example creates an unsegmented projection for table store.store_dimension:


=> CREATE PROJECTION store.store_dimension_proj (storekey, name, city, state)
             AS SELECT store_key, store_name, store_city, store_state
             FROM store.store_dimension
             UNSEGMENTED ALL NODES;
CREATE PROJECTION

=>  SELECT anchor_table_name anchor_table, projection_name, node_name
      FROM PROJECTIONS WHERE projection_basename='store_dimension_proj';
  anchor_table   |   projection_name    |    node_name
-----------------+----------------------+------------------
 store_dimension | store_dimension_proj | v_vmart_node0001
 store_dimension | store_dimension_proj | v_vmart_node0002
 store_dimension | store_dimension_proj | v_vmart_node0003
(3 rows)