Creates metadata for a projection in the Vertica catalog. Vertica supports four types of projections:
Standard projection: Stores a collection of table data in a format that optimizes execution of certain queries on that table.
Live aggregate projection: Stores the grouped results of queries that invoke aggregate functions (such as SUM) on table columns.
Top-K projection: Stores the top k rows from partitions of selected rows.
UDTF projection: Stores newly-loaded data after it is transformed and/or aggregated by user-defined transformation functions (UDTFs).
Complex data types have additional restrictions when used within a projection:
Each projection must include at least one column that is a primitive type or native array.
An AS SELECT clause can use a complex-type column, but any other expression must be of a scalar type or native array.
The ORDER BY, PARTITION BY, and GROUP BY clauses cannot use complex types.
If a projection does not include an ORDER BY or segmentation clause, Vertica uses only the primitive columns from the select list to order or segment data.
Projection columns cannot be complex types returned from functions such as ARRAY_CAT.
TopK and UDTF projections do not support complex types.
1 - Encoding types
Vertica supports various encoding and compression types, specified by the following ENCODING parameter arguments:.
Vertica supports various encoding and compression types, specified by the following ENCODING parameter arguments:
You can set encoding types on a projection column when you create the projection. You can also change the encoding of one or more projection columns for a given table with ALTER TABLE...ALTER COLUMN.
AUTO (default)
AUTO encoding is ideal for sorted, many-valued columns such as primary keys. It is also suitable for general purpose applications for which no other encoding or compression scheme is applicable. Therefore, it serves as the default if no encoding/compression is specified.
Column data type
Default encoding type
BINARY/VARBINARY BOOLEAN CHAR/VARCHAR FLOAT
Lempel-Ziv-Oberhumer-based (LZO) compression
DATE/TIME/TIMESTAMP INTEGER INTERVAL
Compression scheme based on the delta between consecutive column values.
The CPU requirements for this type are relatively small. In the worst case, data might expand by eight percent (8%) for LZO and twenty percent (20%) for integer data.
BLOCK_DICT
For each block of storage, Vertica compiles distinct column values into a dictionary and then stores the dictionary and a list of indexes to represent the data block.
BLOCK_DICT is ideal for few-valued, unsorted columns where saving space is more important than encoding speed. Certain kinds of data, such as stock prices, are typically few-valued within a localized area after the data is sorted, such as by stock symbol and timestamp, and are good candidates for BLOCK_DICT. By contrast, long CHAR/VARCHAR columns are not good candidates for BLOCK_DICT encoding.
CHAR and VARCHAR columns that contain 0x00 or 0xFF characters should not be encoded with BLOCK_DICT. Also, BINARY/VARBINARY columns do not support BLOCK_DICT encoding.
BLOCK_DICT encoding requires significantly higher CPU usage than default encoding schemes. The maximum data expansion is eight percent (8%).
BLOCKDICT_COMP
This encoding type is similar to BLOCK_DICT except dictionary indexes are entropy coded. This encoding type requires significantly more CPU time to encode and decode and has a poorer worst-case performance. However, if the distribution of values is extremely skewed, using BLOCK_DICT_COMP encoding can lead to space savings.
BZIP_COMP
BZIP_COMP encoding uses the bzip2 compression algorithm on the block contents. See bzip web site for more information. This algorithm results in higher compression than the automatic LZO and gzip encoding; however, it requires more CPU time to compress. This algorithm is best used on large string columns such as VARCHAR, VARBINARY, CHAR, and BINARY. Choose this encoding type when you are willing to trade slower load speeds for higher data compression.
COMMONDELTA_COMP
This compression scheme builds a dictionary of all deltas in the block and then stores indexes into the delta dictionary using entropy coding.
This scheme is ideal for sorted FLOAT and INTEGER-based (DATE/TIME/TIMESTAMP/INTERVAL) data columns with predictable sequences and only occasional sequence breaks, such as timestamps recorded at periodic intervals or primary keys. For example, the following sequence compresses well: 300, 600, 900, 1200, 1500, 600, 1200, 1800, 2400. The following sequence does not compress well: 1, 3, 6, 10, 15, 21, 28, 36, 45, 55.
If delta distribution is excellent, columns can be stored in less than one bit per row. However, this scheme is very CPU intensive. If you use this scheme on data with arbitrary deltas, it can cause significant data expansion.
DELTARANGE_COMP
This compression scheme is primarily used for floating-point data; it stores each value as a delta from the previous one.
This scheme is ideal for many-valued FLOAT columns that are sorted or confined to a range. Do not use this scheme for unsorted columns that contain NULL values, as the storage cost for representing a NULL value is high. This scheme has a high cost for both compression and decompression.
To determine if DELTARANGE_COMP is suitable for a particular set of data, compare it to other schemes. Be sure to use the same sort order as the projection, and select sample data that will be stored consecutively in the database.
DELTAVAL
For INTEGER and DATE/TIME/TIMESTAMP/INTERVAL columns, data is recorded as a difference from the smallest value in the data block. This encoding has no effect on other data types.
DELTAVAL is best used for many-valued, unsorted integer or integer-based columns. CPU requirements for this encoding type are minimal, and data never expands.
GCDDELTA
For INTEGER and DATE/TIME/TIMESTAMP/INTERVAL columns, and NUMERIC columns with 18 or fewer digits, data is recorded as the difference from the smallest value in the data block divided by the greatest common divisor (GCD) of all entries in the block. This encoding has no effect on other data types.
ENCODING GCDDELTA is best used for many-valued, unsorted, integer columns or integer-based columns, when the values are a multiple of a common factor. For example, timestamps are stored internally in microseconds, so data that is only precise to the millisecond are all multiples of 1000. The CPU requirements for decoding GCDDELTA encoding are minimal, and the data never expands, but GCDDELTA may take more encoding time than DELTAVAL.
GZIP_COMP
This encoding type uses the gzip compression algorithm. See gzip web site for more information. This algorithm results in better compression than the automatic LZO compression, but lower compression than BZIP_COMP. It requires more CPU time to compress than LZO but less CPU time than BZIP_COMP. This algorithm is best used on large string columns such as VARCHAR, VARBINARY, CHAR, and BINARY. Use this encoding when you want a better compression than LZO, but at less CPU time than bzip2.
RLE
RLE (run length encoding) replaces sequences (runs) of identical values with a single pair that contains the value and number of occurrences. Therefore, it is best used for low cardinality columns that are present in the ORDER BY clause of a projection.
The Vertica execution engine processes RLE encoding run-by-run and the Vertica optimizer gives it preference. Use it only when run length is large, such as when low-cardinality columns are sorted.
ZSTD_COMP provides high compression ratios. This encoding type has a higher compression than gzip. Use this when you want a better compression than gzip. For general use cases, use this or the ZSTD_FAST_COMP encoding type.
ZSTD_FAST_COMP uses the fastest compression level that the zstd library provides. It is the fastest encoding type of the zstd library, but takes up more space than the other two encoding types. For general use cases, use this or the ZSTD_COMP encoding type.
ZSTD_HIGH_COMP offers the best compression in the zstd library. It is slower than the other two encoding types. Use this type when you need the best compression, with slower CPU time.
2 - GROUPED clause
Groups two or more columns into a single disk file.
Enterprise Mode only
Groups two or more columns into a single disk file. Doing so minimizes file I/O for the following tasks:
Read a large percentage of the columns in a table.
Perform single row look-ups.
Query against many small columns.
Frequently update data in these columns.
You can improve query performance by grouping columns that are always accessed together and are not used in predicates. Once columns are grouped, queries can no longer retrieve from disk records for one column independently of the others.
Note
RLE encoding is reduced when an RLE column is grouped with one or more non-RLE columns.
You can group columns in several ways:
Group some of the columns:
(a, GROUPED(b, c), d)
Group all of the columns:
(GROUPED(a, b, c, d))
Create multiple groupings in the same projection:
(GROUPED(a, b), GROUPED(c, d))
Note
Vertica performs dynamic column grouping. For example, to provide better read and write efficiency for small loads, Vertica ignores any projection-defined column grouping (or lack thereof) and groups all columns together by default.
Grouping columns
The following example shows how to group columns bid and ask. The stock column is stored separately.
The following example show how to create a projection that uses expressions in the column definition. The projection contains two integer columns a and b, and a third column product_value that stores the product of a and b:
=> CREATE TABLE values (a INT, b INT);
=> CREATE PROJECTION product (a, b, product_value) AS
SELECT a, b, a*b FROM values ORDER BY a KSAFE;
3 - Hash segmentation clause
A general SQL expression.
Specifies how to segment projection data for distribution across all cluster nodes. You can specify segmentation for a table and a projection. If a table definition specifies segmentation, Vertica uses it for that table's auto-projections.
It is strongly recommended that you use Vertica's built-in
HASH function, which distributes data evenly across the cluster, and facilitates optimal query execution.
Syntax
SEGMENTED BY expression ALL NODES [ OFFSET offset ]
Parameters
SEGMENTED BY expression
A general SQL expression. Hash segmentation is the preferred method of segmentation. Vertica recommends using its built-in
HASH function, whose arguments resolve to table columns. If you use an expression other than HASH, Vertica issues a warning.
The segmentation expression should specify columns with a large number of unique data values and acceptable skew in their data distribution. In general, primary key columns that meet these criteria are good candidates for hash segmentation.
If you create a projection for a table with the OFFSET option, be sure to create enough copies of each projection segment to satisfy system K-safety; otherwise, Vertica regards the projection as unsafe and cannot use it to query the table.
You can ensure K-safety compliance when you create projections by combining OFFSET and
KSAFE options in the CREATE PROJECTION statement. On executing this statement, Vertica automatically creates the necessary number of projection copies.
Expression requirements
A segmentation expression must specify table columns as they are defined in the source table. Projection column names are not supported.
The following restrictions apply to segmentation expressions:
All leaf expressions must be constants or column references to a column in the CREATE PROJECTION 's SELECT list.
The expression must return the same value over the life of the database.
Aggregate functions are not allowed.
The expression must return non-negative INTEGER values in the range
0 <= x < 263, and values are generally distributed uniformly over that range.
Note
If the expression produces a value outside the expected range—for example, a negative value—no error occurs, and the row is added to the projection's first segment.
Examples
The following CREATE PROJECTION statement creates projection public.employee_dimension_super. It specifies to include all columns in table public.employee_dimension. The hash segmentation clause invokes the Vertica HASH function to segment projection data on the column employee_key; it also includes the ALL NODES clause, which specifies to distribute projection data evenly across all nodes in the cluster:
=> CREATE PROJECTION public.employee_dimension_super
AS SELECT * FROM public.employee_dimension
ORDER BY employee_key
SEGMENTED BY hash(employee_key) ALL NODES;
4 - Live aggregate projection
Stores the grouped results of queries that invoke aggregate functions (such as SUM) on table columns.
Stores the grouped results of queries that invoke aggregate functions (such as SUM) on table columns. For details, see Live aggregate projections.
Syntax
CREATE PROJECTION [ IF NOT EXISTS ] [[database.]schema.]projection
[ (
{ projection-column | grouped-clause
[ ENCODING encoding-type ]
[ ACCESSRANK integer ] }[,...]
) ]
AS SELECT { table-column | expr-with-table-columns }[,...] FROM [[database.]schema.]table [ [AS] alias]
GROUP BY column-expr
[ KSAFE [ k-num ] ]
Parameters
IF NOT EXISTS
If an object with the same name exists, do not create it and proceed. If you omit this option and the object exists, Vertica generates a ROLLBACK error message. In both cases, the object is not created if it already exists.
The IF NOT EXISTS clause is useful for SQL scripts where you want to create an object if it does not already exist.
Specifies the schema for this projection and its anchor table, where schema must be the same for both. If you specify a database, it must be the current database.
projection
Identifies the projection to create, where projection conforms to conventions described in Identifiers. It must also be unique among all names of sequences, tables, projections, views, and models within the same schema.
projection-column
The name of a projection column. The list of projection columns must match the SELECT list columns and expressions in number, type, and sequence.
If projection column names are omitted, Vertica uses the anchor table column names specified in the SELECT list.
Overrides the default access rank for a column. Use this parameter to increase or decrease the speed at which Vertica accesses a column. For more information, see Overriding Default Column Ranking.
You can optionally assign an alias to each column expression and reference that alias elsewhere in the SELECT statement.
Note
If you specify projection column names, the two lists of projection columns and table columns/expressions must exactly match in number and order.
GROUP BY column-expr[,...]
One or more column expressions from the SELECT list. The first column-expr must be the first column expression in the SELECT list, the second column-expr must be the second column expression in the SELECT list, and so on.
Stores a collection of table data in a format that optimizes execution of certain queries on that table.
Stores a collection of table data in a format that optimizes execution of certain queries on that table. For details, see Projections .
Syntax
CREATE PROJECTION [ IF NOT EXISTS ] [[database.]schema.]projection
[ (
{ projection-column | grouped-clause
[ ENCODING encoding-type ]
[ ACCESSRANK integer ] }[,...]
) ]
AS SELECT { * | { MATCH_COLUMNS('pattern') | expression [ [AS] alias ] }[,...] }
FROM [[database.]schema.]table [ [AS] alias]
[ ORDER BY column-expr[,...] ]
[ segmentation-spec ]
[ KSAFE [ k-num ]
[ ON PARTITION RANGE BETWEEN min-val AND max-val ] ]
Parameters
IF NOT EXISTS
If an object with the same name exists, do not create it and proceed. If you omit this option and the object exists, Vertica generates a ROLLBACK error message. In both cases, the object is not created if it already exists.
The IF NOT EXISTS clause is useful for SQL scripts where you want to create an object if it does not already exist.
Specifies the schema for this projection and its anchor table, where schema must be the same for both. If you specify a database, it must be the current database.
projection
Identifies the projection to create, where projection conforms to conventions described in Identifiers. It must also be unique among all names of sequences, tables, projections, views, and models within the same schema.
projection-column
The name of a projection column. The list of projection columns must match the SELECT list columns and expressions in number, type, and sequence.
If projection column names are omitted, Vertica uses the anchor table column names specified in the SELECT list.
Overrides the default access rank for a column. Use this parameter to increase or decrease the speed at which Vertica accesses a column. For more information, see Overriding Default Column Ranking.
AS SELECT
Specifies the columns or column expressions to select from the specified table:
Returns the names of all columns in the queried anchor table that match pattern.
expression[[AS]alias]
Resolves to column data from the queried anchor table.
You can optionally assign an alias to each column expression and reference that alias elsewhere in the SELECT statement—for example, in the ORDER BY or segmentation clause.
Note
If you specify projection column names, the two lists of projection columns and table columns/expressions must exactly match in number and order.
ORDER BY
Specifies columns from the SELECT list on which to sort the projection. The ORDER BY clause can only be set to ASC (the default). Vertica always stores projection data in ascending sort order.
If you order by a column with a collection data type (ARRAY or SET), queries that use that column in an ORDER BY clause perform the sort again. This is because projections and queries perform the ordering differently.
If you omit the ORDER BY clause, Vertica uses the SELECT list to sort the projection.
segmentation-spec
Specifies how to distribute projection data with one of the following clauses:
hash-segmentation-clause: Specifies to segment projection data evenly and distribute across cluster nodes:
SEGMENTED BY expression ALL NODES [ OFFSET offset ]
If the anchor table and projection both omit specifying segmentation, the projection is defined with a hash segmentation clause that includes all columns in the SELECT list , as follows:
SEGMENTED BY HASH(column-expr[,...]) ALL NODES OFFSET 0;
Tip
Vertica recommends segmenting large tables.
KSAFE [k-num]
Specifies K-safety for the projection, where k-num must be equal to or greater than database K-safety. Vertica ignores this parameter if set for unsegmented projections. If you omit k-num, Vertica uses database K-safety.
max-range-value can be set to NULL, to specify that the partition range has no upper bound.
min-range-value can be set to NULL, to specify that the partition range has no lower bound.
If both partition range projection min-range-value and max-range-value are set to NULL, it will drop the projection endpoints, becoming a regular projection.
Stores the top k rows from partitions of selected rows.
Stores the top k rows from partitions of selected rows. For details, see Top-k projections.
Syntax
CREATE PROJECTION [ IF NOT EXISTS ] [[database.]schema.]projection
[ (
{ projection-column | grouped-clause
[ ENCODING encoding-type ]
[ ACCESSRANK integer ] }[,...]
) ]
AS SELECT { table-column | expr-with-table-columns }[,...] FROM [[database.]schema.]table [ [AS] alias]
LIMIT num-rows OVER ( window-partition-clause [window-order-clause] )
[ KSAFE [ k-num ] ]
Parameters
IF NOT EXISTS
If an object with the same name exists, do not create it and proceed. If you omit this option and the object exists, Vertica generates a ROLLBACK error message. In both cases, the object is not created if it already exists.
The IF NOT EXISTS clause is useful for SQL scripts where you want to create an object if it does not already exist.
Specifies the schema for this projection and its anchor table, where schema must be the same for both. If you specify a database, it must be the current database.
projection
Identifies the projection to create, where projection conforms to conventions described in Identifiers. It must also be unique among all names of sequences, tables, projections, views, and models within the same schema.
projection-column
The name of a projection column. The list of projection columns must match the SELECT list columns and expressions in number, type, and sequence.
If projection column names are omitted, Vertica uses the anchor table column names specified in the SELECT list.
Overrides the default access rank for a column. Use this parameter to increase or decrease the speed at which Vertica accesses a column. For more information, see Overriding Default Column Ranking.
You can optionally assign an alias to each column expression and reference that alias elsewhere in the SELECT statement.
Note
If you specify projection column names, the two lists of projection columns and table columns/expressions must exactly match in number and order.
LIMIT num-rows
The number of rows to return from the specified partition.
window-partition-clause
Specifies window partitioning by one or more comma-delimited column expressions from the SELECT list. The first partition expression must be the first SELECT list item, the second partition expression the second SELECT list item, and so on.
window-order-clause
Specifies the order in which the top k rows are returned, by default in ascending (ASC) order. All column expressions must be from the SELECT list, where the first window order expression must be the first SELECT list item not specified in the window partition clause.
Vertica does not regard Top-K projections as superprojections, even those that include all table columns. For other requirements and restrictions, see Creating top-k projections.
Stores newly-loaded data after it is transformed and/or aggregated by user-defined transformation functions (UDTFs).
Stores newly-loaded data after it is transformed and/or aggregated by user-defined transformation functions (UDTFs). For details and examples, see Pre-aggregating UDTF results.
CREATE PROJECTION [ IF NOT EXISTS ] [[database.]schema.]projection
[ (
{ projection-column | grouped-clause
[ ENCODING encoding-type ]
[ ACCESSRANK integer ] }[,...]
) ]
AS { [batch-query](#UDTFBatchQuery) FROM { prepass-querysq-ref | table [[AS] alias] }
| prepass-query }
batch-query
SELECT { table-column | expr-with-table-columns }[,...], batch-udtf(batch-args)
OVER (PARTITION BATCH BY partition-column-expr[,...])
[ AS (batch-output-columns) ]
prepass-query
SELECT { table-column | expr-with-table-columns }[,...], prepass-udtf(prepass-args)
OVER (PARTITION PREPASS BY partition-column-expr[,...])
[ AS (prepass-output-columns) ] FROM table
Parameters
IF NOT EXISTS
If an object with the same name exists, do not create it and proceed. If you omit this option and the object exists, Vertica generates a ROLLBACK error message. In both cases, the object is not created if it already exists.
The IF NOT EXISTS clause is useful for SQL scripts where you want to create an object if it does not already exist.
Specifies the schema for this projection and its anchor table, where schema must be the same for both. If you specify a database, it must be the current database.
projection
Identifies the projection to create, where projection conforms to conventions described in Identifiers. It must also be unique among all names of sequences, tables, projections, views, and models within the same schema.
projection-column
The name of a projection column. The list of projection columns must match the SELECT list columns and expressions in number, type, and sequence.
If projection column names are omitted, Vertica uses the anchor table column names specified in the SELECT list.
Overrides the default access rank for a column. Use this parameter to increase or decrease the speed at which Vertica accesses a column. For more information, see Overriding Default Column Ranking.
You can optionally assign an alias to each column expression and reference that alias elsewhere in the SELECT statement.
Note
If you specify projection column names, the two lists of projection columns and table columns/expressions must exactly match in number and order.
batch-udtf(batch-args)
The batch UDTF to invoke each time the following events occur:
Tuple mover mergeout
Queries on the projection
If invoked singly, on data load operations
Important
If the projection definition includes a pre-pass subquery, batch-args must exactly match the pre-pass UDTF output columns, in name and order.
prepass-udtf(prepass-args)
The pre-pass UDTF to invoke on each load operation such as COPY or INSERT.
If specified in a subquery, the pre-pass UDTF returns transformed data to the batch query for further processing. Otherwise, the pre-pass query results are added to projection data storage.
OVER (PARTITION { BATCH | PREPASS } BY partition-column-expr[,...]
Specifies the UDTF type and how to partition the data it returns:
In both cases, the OVER clause specifies partitioning with one or more column expressions from the SELECT list. The first partition-column-expr is the first column expression in the SELECT list, the second partition-column-expr is the second column expression in the SELECT list, and so on.
Note
The projection is implicitly segmented and ordered on PARTITION BY columns.
AS (batch-output-columns) AS (prepass-output-columns)
Optionally names columns that are returned by the UDTF.
If a pre-pass subquery omits this clause, the outer batch query UDTF arguments (batch-args) must reference the column names as they are defined in the pre-pass UDTF.
table[[AS]alias]
Specifies the projection's anchor table, optionally qualified by an alias.
sq-results
Subquery result set that is returned to the outer batch UDTF.
Privileges
Non-superusers:
Anchor table owner
CREATE privilege on the schema
EXECUTE privileges on all UDTFs that are referenced by the projection
Specifies to distribute identical copies of table or projection data on all nodes across the cluster.
Specifies to distribute identical copies of table or projection data on all nodes across the cluster. Use this clause to facilitate distributed query execution on tables and projections that are too small to benefit from segmentation.
Vertica uses the same name to identify all instances of an unsegmented projection. For more information about projection name conventions, see Projection naming.
Syntax
UNSEGMENTED ALL NODES
Examples
This example creates an unsegmented projection for table store.store_dimension:
=> CREATE PROJECTION store.store_dimension_proj (storekey, name, city, state)
AS SELECT store_key, store_name, store_city, store_state
FROM store.store_dimension
UNSEGMENTED ALL NODES;
CREATE PROJECTION
=> SELECT anchor_table_name anchor_table, projection_name, node_name
FROM PROJECTIONS WHERE projection_basename='store_dimension_proj';
anchor_table | projection_name | node_name
-----------------+----------------------+------------------
store_dimension | store_dimension_proj | v_vmart_node0001
store_dimension | store_dimension_proj | v_vmart_node0002
store_dimension | store_dimension_proj | v_vmart_node0003
(3 rows)