1 - Encoding types
Vertica supports various encoding and compression types, specified by the following ENCODING parameter arguments:.
Vertica supports various encoding and compression types, specified by the following ENCODING
parameter arguments:
Note
Vertica supports the following encoding for numeric data types:
-
Precision ≤ 18: AUTO
, BLOCK_DICT
, BLOCKDICT_COMP
, COMMONDELTA_COMP
, DELTAVAL
, GCDDELTA
, and RLE
-
Precision > 18: AUTO
, BLOCK_DICT
, BLOCKDICT_COMP
, RLE
You can set encoding types on a projection column when you create the projection. You can also change the encoding of one or more projection columns for a given table with ALTER TABLE...ALTER COLUMN.
AUTO (default)
AUTO encoding is ideal for sorted, many-valued columns such as primary keys. It is also suitable for general purpose applications for which no other encoding or compression scheme is applicable. Therefore, it serves as the default if no encoding/compression is specified.
Column data type |
Default encoding type |
BINARY/VARBINARY BOOLEAN CHAR/VARCHAR FLOAT |
Lempel-Ziv-Oberhumer-based (LZO) compression |
DATE/TIME/TIMESTAMP INTEGER INTERVAL |
Compression scheme based on the delta between consecutive column values. |
The CPU requirements for this type are relatively small. In the worst case, data might expand by eight percent (8%) for LZO and twenty percent (20%) for integer data.
BLOCK_DICT
For each block of storage, Vertica compiles distinct column values into a dictionary and then stores the dictionary and a list of indexes to represent the data block.
BLOCK_DICT is ideal for few-valued, unsorted columns where saving space is more important than encoding speed. Certain kinds of data, such as stock prices, are typically few-valued within a localized area after the data is sorted, such as by stock symbol and timestamp, and are good candidates for BLOCK_DICT. By contrast, long CHAR/VARCHAR columns are not good candidates for BLOCK_DICT encoding.
CHAR and VARCHAR columns that contain 0x00 or 0xFF characters should not be encoded with BLOCK_DICT. Also, BINARY/VARBINARY columns do not support BLOCK_DICT encoding.
BLOCK_DICT encoding requires significantly higher CPU usage than default encoding schemes. The maximum data expansion is eight percent (8%).
BLOCKDICT_COMP
This encoding type is similar to BLOCK_DICT except dictionary indexes are entropy coded. This encoding type requires significantly more CPU time to encode and decode and has a poorer worst-case performance. However, if the distribution of values is extremely skewed, using BLOCK_DICT_COMP
encoding can lead to space savings.
BZIP_COMP
BZIP_COMP encoding uses the bzip2 compression algorithm on the block contents. See bzip web site for more information. This algorithm results in higher compression than the automatic LZO and gzip encoding; however, it requires more CPU time to compress. This algorithm is best used on large string columns such as VARCHAR, VARBINARY, CHAR, and BINARY. Choose this encoding type when you are willing to trade slower load speeds for higher data compression.
COMMONDELTA_COMP
This compression scheme builds a dictionary of all deltas in the block and then stores indexes into the delta dictionary using entropy coding.
This scheme is ideal for sorted FLOAT and INTEGER-based (DATE/TIME/TIMESTAMP/INTERVAL) data columns with predictable sequences and only occasional sequence breaks, such as timestamps recorded at periodic intervals or primary keys. For example, the following sequence compresses well: 300, 600, 900, 1200, 1500, 600, 1200, 1800, 2400. The following sequence does not compress well: 1, 3, 6, 10, 15, 21, 28, 36, 45, 55.
If delta distribution is excellent, columns can be stored in less than one bit per row. However, this scheme is very CPU intensive. If you use this scheme on data with arbitrary deltas, it can cause significant data expansion.
DELTARANGE_COMP
This compression scheme is primarily used for floating-point data; it stores each value as a delta from the previous one.
This scheme is ideal for many-valued FLOAT columns that are sorted or confined to a range. Do not use this scheme for unsorted columns that contain NULL values, as the storage cost for representing a NULL value is high. This scheme has a high cost for both compression and decompression.
To determine if DELTARANGE_COMP is suitable for a particular set of data, compare it to other schemes. Be sure to use the same sort order as the projection, and select sample data that will be stored consecutively in the database.
DELTAVAL
For INTEGER and DATE/TIME/TIMESTAMP/INTERVAL columns, data is recorded as a difference from the smallest value in the data block. This encoding has no effect on other data types.
DELTAVAL is best used for many-valued, unsorted integer or integer-based columns. CPU requirements for this encoding type are minimal, and data never expands.
GCDDELTA
For INTEGER and DATE/TIME/TIMESTAMP/INTERVAL columns, and NUMERIC columns with 18 or fewer digits, data is recorded as the difference from the smallest value in the data block divided by the greatest common divisor (GCD) of all entries in the block. This encoding has no effect on other data types.
ENCODING GCDDELTA is best used for many-valued, unsorted, integer columns or integer-based columns, when the values are a multiple of a common factor. For example, timestamps are stored internally in microseconds, so data that is only precise to the millisecond are all multiples of 1000. The CPU requirements for decoding GCDDELTA encoding are minimal, and the data never expands, but GCDDELTA may take more encoding time than DELTAVAL.
GZIP_COMP
This encoding type uses the gzip compression algorithm. See gzip web site for more information. This algorithm results in better compression than the automatic LZO compression, but lower compression than BZIP_COMP. It requires more CPU time to compress than LZO but less CPU time than BZIP_COMP. This algorithm is best used on large string columns such as VARCHAR, VARBINARY, CHAR, and BINARY. Use this encoding when you want a better compression than LZO, but at less CPU time than bzip2.
RLE
RLE (run length encoding) replaces sequences (runs) of identical values with a single pair that contains the value and number of occurrences. Therefore, it is best used for low cardinality columns that are present in the ORDER BY clause of a projection.
The Vertica execution engine processes RLE encoding run-by-run and the Vertica optimizer gives it preference. Use it only when run length is large, such as when low-cardinality columns are sorted.
Zstandard compression
Vertica supports three ZSTD compression types:
-
ZSTD_COMP
provides high compression ratios. This encoding type has a higher compression than gzip. Use this when you want a better compression than gzip. For general use cases, use this or the ZSTD_FAST_COMP
encoding type.
-
ZSTD_FAST_COMP
uses the fastest compression level that the zstd library provides. It is the fastest encoding type of the zstd library, but takes up more space than the other two encoding types. For general use cases, use this or the ZSTD_COMP
encoding type.
-
ZSTD_HIGH_COMP
offers the best compression in the zstd library. It is slower than the other two encoding types. Use this type when you need the best compression, with slower CPU time.
2 - GROUPED clause
Groups two or more columns into a single disk file.
Enterprise Mode only
Groups two or more columns into a single disk file. Doing so minimizes file I/O for the following tasks:
-
Read a large percentage of the columns in a table.
-
Perform single row look-ups.
-
Query against many small columns.
-
Frequently update data in these columns.
You can improve query performance by grouping columns that are always accessed together and are not used in predicates. Once columns are grouped, queries can no longer retrieve from disk records for one column independently of the others.
Note
RLE encoding is reduced when an RLE column is grouped with one or more non-RLE columns.
You can group columns in several ways:
-
Group some of the columns:
(a, GROUPED(b, c), d)
-
Group all of the columns:
(GROUPED(a, b, c, d))
-
Create multiple groupings in the same projection:
(GROUPED(a, b), GROUPED(c, d))
Note
Vertica performs dynamic column grouping. For example, to provide better read and write efficiency for small loads, Vertica ignores any projection-defined column grouping (or lack thereof) and groups all columns together by default.
Grouping columns
The following example shows how to group columns bid
and ask
. The stock
column is stored separately.
=> CREATE TABLE trades (stock CHAR(5), bid INT, ask INT);
=> CREATE PROJECTION tradeproj (stock ENCODING RLE,
GROUPED(bid ENCODING DELTAVAL, ask))
AS (SELECT * FROM trades) KSAFE 1;
The following example show how to create a projection that uses expressions in the column definition. The projection contains two integer columns a
and b
, and a third column product_value
that stores the product of a
and b
:
=> CREATE TABLE values (a INT, b INT);
=> CREATE PROJECTION product (a, b, product_value) AS
SELECT a, b, a*b FROM values ORDER BY a KSAFE;
3 - Hash segmentation clause
A general SQL expression.
Specifies how to segment projection data for distribution across all cluster nodes. You can specify segmentation for a table and a projection. If a table definition specifies segmentation, Vertica uses it for that table's auto-projections.
It is strongly recommended that you use Vertica's built-in
HASH
function, which distributes data evenly across the cluster, and facilitates optimal query execution.
Syntax
SEGMENTED BY expression ALL NODES [ OFFSET offset ]
Parameters
SEGMENTED BY
expression
- A general SQL expression. Hash segmentation is the preferred method of segmentation. Vertica recommends using its built-in
HASH
function, whose arguments resolve to table columns. If you use an expression other than HASH
, Vertica issues a warning.
The segmentation expression should specify columns with a large number of unique data values and acceptable skew in their data distribution. In general, primary key columns that meet these criteria are good candidates for hash segmentation.
For details, see Expression Requirements below.
ALL NODES
- Automatically distributes data evenly across all nodes when the projection is created. Node ordering is fixed.
OFFSET
offset
- A zero-based offset that indicates on which node to start segmentation distribution.
This option is not valid for
CREATE TABLE
and
CREATE TEMPORARY TABLE
.
Important
If you create a projection for a table with the OFFSET
option, be sure to create enough copies of each projection segment to satisfy system K-safety; otherwise, Vertica regards the projection as unsafe and cannot use it to query the table.
You can ensure K-safety compliance when you create projections by combining OFFSET
and
KSAFE
options in the CREATE PROJECTION
statement. On executing this statement, Vertica automatically creates the necessary number of projection copies.
Expression requirements
A segmentation expression must specify table columns as they are defined in the source table. Projection column names are not supported.
The following restrictions apply to segmentation expressions:
-
All leaf expressions must be constants or column references to a column in the CREATE PROJECTION
's SELECT
list.
-
The expression must return the same value over the life of the database.
-
Aggregate functions are not allowed.
-
The expression must return non-negative INTEGER
values in the range
0 <= x < 263
, and values are generally distributed uniformly over that range.
Note
If the expression produces a value outside the expected range—for example, a negative value—no error occurs, and the row is added to the projection's first segment.
Examples
The following CREATE PROJECTION
statement creates projection public.employee_dimension_super
. It specifies to include all columns in table public.employee_dimension
. The hash segmentation clause invokes the Vertica HASH
function to segment projection data on the column employee_key
; it also includes the ALL NODES
clause, which specifies to distribute projection data evenly across all nodes in the cluster:
=> CREATE PROJECTION public.employee_dimension_super
AS SELECT * FROM public.employee_dimension
ORDER BY employee_key
SEGMENTED BY hash(employee_key) ALL NODES;
4 - Live aggregate projection
Stores the grouped results of queries that invoke aggregate functions (such as SUM) on table columns.
Stores the grouped results of queries that invoke aggregate functions (such as SUM) on table columns. For details, see Live aggregate projections.
Vertica does not regard live aggregate projections as superprojections, even those that include all table columns. For other requirements and restrictions, see Creating live aggregate projections.
Syntax
CREATE PROJECTION [ IF NOT EXISTS ] [[database.]schema.]projection
[ (
{ projection-column | grouped-clause
[ ENCODING encoding-type ]
[ ACCESSRANK integer ] }[,...]
) ]
AS SELECT { table-column | expr-with-table-columns }[,...] FROM [[database.]schema.]table [ [AS] alias]
GROUP BY column-expr[,...]
[ KSAFE [ k-num ] ]
Arguments
IF NOT EXISTS
If an object with the same name exists, return without creating the object. If you do not use this directive and the object already exists, Vertica returns with an error message.
The IF NOT EXISTS
clause is useful for SQL scripts where you might not know if the object already exists. The ON ERROR STOP directive can be helpful in scripts.
[
database
.]
schema
Schema for this projection and its anchor table. The value must be the same for both. If you specify a database, it must be the current database.
projection
Name of the projection to create, which must conform to Identifiers. It must also be unique among all names of sequences, tables, projections, views, and models within the same schema.
projection-column
The name of a projection column. The list of projection columns must match the SELECT list columns and expressions in number, type, and sequence.
If projection column names are omitted, Vertica uses the anchor table column names specified in the SELECT list.
grouped-clause
- See GROUPED clause.
ENCODING
encoding-type
The column encoding type, by default set to AUTO.
ACCESSRANK
integer
Overrides the default access rank for a column. Use this parameter to increase or decrease the speed at which Vertica accesses a column. For more information, see Overriding Default Column Ranking.
AS SELECT
Specifies the table data to query:
{table-column | expr-with-table-columns } [ [AS] alias] }[,...]
You can optionally assign an alias to each column expression and reference that alias elsewhere in the SELECT statement.
If you specify projection column names, the two lists of projection columns and table columns/expressions must exactly match in number and order.
GROUP BY
column-expr
[,...]
- One or more column expressions from the SELECT list. The first expression must be the first column expression in the SELECT list, the second expression must be the second column expression in the SELECT list, and so on.
Privileges
Non-superusers:
Examples
See Live aggregate projection example.
5 - Standard projection
Stores a collection of table data in a format that optimizes execution of certain queries on that table.
Stores a collection of table data in a format that optimizes execution of certain queries on that table. For details, see Projections .
Syntax
CREATE PROJECTION [ IF NOT EXISTS ] [[database.]schema.]projection
[ (
{ projection-column | grouped-clause
[ ENCODING encoding-type ]
[ ACCESSRANK integer ] }[,...]
) ]
AS SELECT { * | { MATCH_COLUMNS('pattern') | expression [ [AS] alias ] }[,...] }
FROM [[database.]schema.]table [ [AS] alias]
[ ORDER BY column[,...] ]
[ SEGMENTED BY expression ALL NODES [ OFFSET offset ] | UNSEGMENTED ALL NODES ]
[ KSAFE [ k-num ]
[ ON PARTITION RANGE BETWEEN min-val AND max-val ] ]
Arguments
IF NOT EXISTS
If an object with the same name exists, return without creating the object. If you do not use this directive and the object already exists, Vertica returns with an error message.
The IF NOT EXISTS
clause is useful for SQL scripts where you might not know if the object already exists. The ON ERROR STOP directive can be helpful in scripts.
[
database
.]
schema
Schema for this projection and its anchor table. The value must be the same for both. If you specify a database, it must be the current database.
projection
Name of the projection to create, which must conform to Identifiers. It must also be unique among all names of sequences, tables, projections, views, and models within the same schema.
projection-column
The name of a projection column. The list of projection columns must match the SELECT list columns and expressions in number, type, and sequence.
If projection column names are omitted, Vertica uses the anchor table column names specified in the SELECT list.
grouped-clause
- See GROUPED clause.
ENCODING
encoding-type
The column encoding type, by default set to AUTO.
ACCESSRANK
integer
Overrides the default access rank for a column. Use this parameter to increase or decrease the speed at which Vertica accesses a column. For more information, see Overriding Default Column Ranking.
AS SELECT
- Columns or column expressions to select from the specified table:
-
*
(asterisk): Returns all columns in the queried tables.
-
MATCH_COLUMNS('pattern'): Returns the names of all columns in the queried anchor table that match pattern
.
-
expression
[[AS]
alias
]
: Resolves to column data from the queried anchor table. You can optionally assign an alias to each column expression and reference that alias elsewhere in the SELECT statement—for example, in the ORDER BY or segmentation clause.
If you specify projection column names, the two lists of projection columns and table columns/expressions must exactly match in number and order.
ORDER BY
column
- Columns from the SELECT list on which to sort the projection. The ORDER BY clause can only be set to ASC (the default). Vertica always stores projection data in ascending sort order.
If you order by a column with a collection data type (ARRAY or SET), queries that use that column in an ORDER BY clause perform the sort again. This is because projections and queries perform the ordering differently.
If you omit the ORDER BY clause, Vertica uses the SELECT list to sort the projection.
SEGMENTED BY
expression
ALL NODES [ OFFSET
offset
] | UNSEGMENTED ALL NODES
-
How to distribute projection data:
If neither the anchor table nor the projection specifies segmentation, Vertica uses hash segmentation using all columns:
SEGMENTED BY HASH(column[,...]) ALL NODES OFFSET 0;
Tip
Vertica recommends segmenting large tables.
KSAFE
k-num
Specifies K-safety for the projection. The value must be equal to or greater than database K-safety. Vertica ignores this option if set for unsegmented projections.
If you omit this clause, Vertica uses database K-safety.
For general information, see K-safety in an Enterprise Mode database.
ON PARTITION RANGE BETWEEN
min-val
AND
max-val
Limits projection data to a range of partition keys. The minimum value must be less than or equal to the maximum value.
Values can be NULL. A null minimum value indicates no lower bound and a null maximum value indicates no upper bound. If both are NULL, this statement drops the projection endpoints, producing a regular projection instead of a range projection.
For other requirements and usage details, see Partition range projections.
Privileges
Non-superusers:
Examples
See:
6 - Top-k projection
Stores the top k rows from partitions of selected rows.
Stores the top k rows from partitions of selected rows. For details, see Top-k projections.
Vertica does not regard Top-K projections as superprojections, even those that include all table columns. For other requirements and restrictions, see Creating top-k projections.
Syntax
CREATE PROJECTION [ IF NOT EXISTS ] [[database.]schema.]projection
[ (
{ projection-column | grouped-clause
[ ENCODING encoding-type ]
[ ACCESSRANK integer ] }[,...]
) ]
AS SELECT { table-column | expr-with-table-columns }[,...] FROM [[database.]schema.]table [ [AS] alias]
LIMIT num-rows OVER ( window-partition-clause window-order-clause )
[ KSAFE [ k-num ] ]
Arguments
IF NOT EXISTS
If an object with the same name exists, return without creating the object. If you do not use this directive and the object already exists, Vertica returns with an error message.
The IF NOT EXISTS
clause is useful for SQL scripts where you might not know if the object already exists. The ON ERROR STOP directive can be helpful in scripts.
[
database
.]
schema
Schema for this projection and its anchor table. The value must be the same for both. If you specify a database, it must be the current database.
projection
Name of the projection to create, which must conform to Identifiers. It must also be unique among all names of sequences, tables, projections, views, and models within the same schema.
projection-column
The name of a projection column. The list of projection columns must match the SELECT list columns and expressions in number, type, and sequence.
If projection column names are omitted, Vertica uses the anchor table column names specified in the SELECT list.
grouped-clause
- See GROUPED clause.
ENCODING
encoding-type
The column encoding type, by default set to AUTO.
ACCESSRANK
integer
Overrides the default access rank for a column. Use this parameter to increase or decrease the speed at which Vertica accesses a column. For more information, see Overriding Default Column Ranking.
AS SELECT
Specifies the table data to query:
{table-column | expr-with-table-columns } [ [AS] alias] }[,...]
You can optionally assign an alias to each column expression and reference that alias elsewhere in the SELECT statement.
If you specify projection column names, the two lists of projection columns and table columns/expressions must exactly match in number and order.
LIMIT
num-rows
- The number of rows to return from the specified partition.
OVER (
window-partition-clause
window-order-clause
)
- Specifies window partitioning by one or more comma-delimited column expressions from the SELECT list. The first partition expression must be the first SELECT list item, the second partition expression the second SELECT list item, and so on.
The order clause is required, and specifies the order in which the top k
rows are returned. The default is ascending (ASC) order. All column expressions must be from the SELECT list, where the first window order expression must be the first SELECT list item not specified in the window partition clause.
Top-K projections support ORDER BY NULLS FIRST/LAST.
Privileges
Non-superusers:
Examples
See Top-k projection examples.
7 - UDTF projection
Stores newly-loaded data after it is transformed and/or aggregated by user-defined transformation functions (UDTFs).
Stores newly-loaded data after it is transformed and/or aggregated by user-defined transformation functions (UDTFs). For details and examples, see Pre-aggregating UDTF results.
Syntax
CREATE PROJECTION [ IF NOT EXISTS ] [[database.]schema.]projection
[ (
{ projection-column | grouped-clause
[ ENCODING encoding-type ]
[ ACCESSRANK integer ] }[,...]
) ]
AS { batch-query FROM { prepass-query sq-ref | table [[AS] alias] }
| prepass-query }
batch-query:
SELECT { table-column | expr-with-table-columns }[,...], batch-udtf(batch-args)
OVER (PARTITION BATCH BY partition-column-expr[,...])
[ AS (output-columns) ]
prepass-query:
SELECT { table-column | expr-with-table-columns }[,...], prepass-udtf(prepass-args)
OVER (PARTITION PREPASS BY partition-column-expr[,...])
[ AS (output-columns) ] FROM table
Arguments
IF NOT EXISTS
If an object with the same name exists, return without creating the object. If you do not use this directive and the object already exists, Vertica returns with an error message.
The IF NOT EXISTS
clause is useful for SQL scripts where you might not know if the object already exists. The ON ERROR STOP directive can be helpful in scripts.
[
database
.]
schema
Schema for this projection and its anchor table. The value must be the same for both. If you specify a database, it must be the current database.
projection
Name of the projection to create, which must conform to Identifiers. It must also be unique among all names of sequences, tables, projections, views, and models within the same schema.
projection-column
The name of a projection column. The list of projection columns must match the SELECT list columns and expressions in number, type, and sequence.
If projection column names are omitted, Vertica uses the anchor table column names specified in the SELECT list.
grouped-clause
- See GROUPED clause.
ENCODING
encoding-type
The column encoding type, by default set to AUTO.
ACCESSRANK
integer
Overrides the default access rank for a column. Use this parameter to increase or decrease the speed at which Vertica accesses a column. For more information, see Overriding Default Column Ranking.
SELECT
Specifies the table data to query:
{table-column | expr-with-table-columns } [ [AS] alias] }[,...]
You can optionally assign an alias to each column expression and reference that alias elsewhere in the SELECT statement.
If you specify projection column names, the two lists of projection columns and table columns/expressions must exactly match in number and order.
batch-udtf
(
batch-args
)
- The batch UDTF to invoke each time the following events occur:
Important
If the projection definition includes a pre-pass subquery, batch-args
must exactly match the pre-pass UDTF output columns, in name and order.
prepass-udtf
(
prepass-args
)
- The pre-pass UDTF to invoke on each load operation such as COPY or INSERT.
If specified in a subquery, the pre-pass UDTF returns transformed data to the batch query for further processing. Otherwise, the pre-pass query results are added to projection data storage.
OVER (PARTITION { BATCH | PREPASS } BY
partition-column-expr
[,...])
- Specifies the UDTF type and how to partition the data it returns:
In both cases, the OVER clause specifies partitioning with one or more column expressions from the SELECT list. The first expression is the first column expression in the SELECT list, the second expression is the second column expression in the SELECT list, and so on.
Note
The projection is implicitly segmented and ordered on PARTITION BY columns.
AS
output-columns
- Optionally names columns that are returned by the UDTF.
If a pre-pass subquery omits this clause, the outer batch query UDTF arguments (batch-args
) must reference the column names as they are defined in the pre-pass UDTF.
table
[[AS]
alias
]
- The projection's anchor table, optionally qualified by an alias.
sq-results
- Subquery result set that is returned to the outer batch UDTF.
Privileges
Non-superusers:
Examples
See Pre-aggregating UDTF results.
8 - Unsegmented clause
Specifies to distribute identical copies of table or projection data on all nodes across the cluster.
Specifies to distribute identical copies of table or projection data on all nodes across the cluster. Use this clause to facilitate distributed query execution on tables and projections that are too small to benefit from segmentation.
Vertica uses the same name to identify all instances of an unsegmented projection. For more information about projection name conventions, see Projection naming.
Syntax
UNSEGMENTED ALL NODES
Examples
This example creates an unsegmented projection for table store.store_dimension
:
=> CREATE PROJECTION store.store_dimension_proj (storekey, name, city, state)
AS SELECT store_key, store_name, store_city, store_state
FROM store.store_dimension
UNSEGMENTED ALL NODES;
CREATE PROJECTION
=> SELECT anchor_table_name anchor_table, projection_name, node_name
FROM PROJECTIONS WHERE projection_basename='store_dimension_proj';
anchor_table | projection_name | node_name
-----------------+----------------------+------------------
store_dimension | store_dimension_proj | v_vmart_node0001
store_dimension | store_dimension_proj | v_vmart_node0002
store_dimension | store_dimension_proj | v_vmart_node0003
(3 rows)