1 - Encoding types
Vertica supports various encoding and compression types, specified by the following ENCODING parameter arguments:
Note
Vertica supports the following encoding for numeric data types:
-
Precision ≤ 18: AUTO, BLOCK_DICT, BLOCKDICT_COMP, COMMONDELTA_COMP, DELTAVAL, GCDDELTA, and RLE
-
Precision > 18: AUTO, BLOCK_DICT, BLOCKDICT_COMP, and RLE
You can set encoding types on a projection column when you create the projection. You can also change the encoding of one or more projection columns for a given table with ALTER TABLE...ALTER COLUMN.
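As a minimal sketch of both approaches, the statements below set encodings when a projection is created and then change the encoding of one column afterward. The table sales_sample, its columns, and the projection name sales_sample_p are hypothetical, and the PROJECTIONS list assumes the ALTER TABLE...ALTER COLUMN form that names the projections to update.

```sql
-- Hypothetical table used only for illustration.
CREATE TABLE sales_sample (
    sale_id INT,
    region  VARCHAR(16),
    amount  NUMERIC(12,2)
);

-- Set an encoding type per projection column at creation time.
CREATE PROJECTION sales_sample_p (
    sale_id ENCODING DELTAVAL,
    region  ENCODING RLE,
    amount  ENCODING AUTO
)
AS SELECT sale_id, region, amount FROM sales_sample
ORDER BY region, sale_id
SEGMENTED BY HASH(sale_id) ALL NODES;

-- Later, change the encoding of one column in that projection.
-- NUMERIC(12,2) has precision <= 18, so GCDDELTA is among the supported types.
ALTER TABLE sales_sample ALTER COLUMN amount
    ENCODING GCDDELTA PROJECTIONS (sales_sample_p);
```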
AUTO (default)
AUTO encoding is ideal for sorted, many-valued columns such as primary keys. It is also suitable for general purpose applications for which no other encoding or compression scheme is applicable. Therefore, it serves as the default if no encoding/compression is specified.
Column data type | Default encoding type
BINARY/VARBINARY, BOOLEAN, CHAR/VARCHAR, FLOAT | Lempel-Ziv-Oberhumer-based (LZO) compression
DATE/TIME/TIMESTAMP, INTEGER, INTERVAL | Compression scheme based on the delta between consecutive column values
The CPU requirements for this type are relatively small. In the worst case, data might expand by eight percent (8%) for LZO and twenty percent (20%) for integer data.
BLOCK_DICT
For each block of storage, Vertica compiles distinct column values into a dictionary and then stores the dictionary and a list of indexes to represent the data block.
BLOCK_DICT is ideal for few-valued, unsorted columns where saving space is more important than encoding speed. Certain kinds of data, such as stock prices, are typically few-valued within a localized area after the data is sorted, such as by stock symbol and timestamp, and are good candidates for BLOCK_DICT. By contrast, long CHAR/VARCHAR columns are not good candidates for BLOCK_DICT encoding.
CHAR and VARCHAR columns that contain 0x00 or 0xFF characters should not be encoded with BLOCK_DICT. Also, BINARY/VARBINARY columns do not support BLOCK_DICT encoding.
BLOCK_DICT encoding requires significantly higher CPU usage than default encoding schemes. The maximum data expansion is eight percent (8%).
BLOCKDICT_COMP
This encoding type is similar to BLOCK_DICT except dictionary indexes are entropy coded. This encoding type requires significantly more CPU time to encode and decode and has a poorer worst-case performance. However, if the distribution of values is extremely skewed, using BLOCKDICT_COMP encoding can lead to space savings.
BZIP_COMP
BZIP_COMP encoding uses the bzip2 compression algorithm on the block contents. See bzip web site for more information. This algorithm results in higher compression than the automatic LZO and gzip encoding; however, it requires more CPU time to compress. This algorithm is best used on large string columns such as VARCHAR, VARBINARY, CHAR, and BINARY. Choose this encoding type when you are willing to trade slower load speeds for higher data compression.
COMMONDELTA_COMP
This compression scheme builds a dictionary of all deltas in the block and then stores indexes into the delta dictionary using entropy coding.
This scheme is ideal for sorted FLOAT and INTEGER-based (DATE/TIME/TIMESTAMP/INTERVAL) data columns with predictable sequences and only occasional sequence breaks, such as timestamps recorded at periodic intervals or primary keys. For example, the following sequence compresses well: 300, 600, 900, 1200, 1500, 600, 1200, 1800, 2400. The following sequence does not compress well: 1, 3, 6, 10, 15, 21, 28, 36, 45, 55.
If delta distribution is excellent, columns can be stored in less than one bit per row. However, this scheme is very CPU intensive. If you use this scheme on data with arbitrary deltas, it can cause significant data expansion.
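One way to judge whether a column has a favorable delta distribution is to count how often each delta occurs. The following sketch loads the first example sequence into a hypothetical table seq_sample (the pos column is an assumption added only to make the row order explicit) and tallies the deltas between consecutive values.

```sql
-- Hypothetical table holding the first sequence from the text.
CREATE TABLE seq_sample (pos INT, val INT);

INSERT INTO seq_sample
SELECT 1, 300  UNION ALL SELECT 2, 600  UNION ALL SELECT 3, 900
UNION ALL SELECT 4, 1200 UNION ALL SELECT 5, 1500 UNION ALL SELECT 6, 600
UNION ALL SELECT 7, 1200 UNION ALL SELECT 8, 1800 UNION ALL SELECT 9, 2400;

-- Count how often each delta between consecutive values occurs.
SELECT delta, COUNT(*) AS occurrences
FROM (
    SELECT val - LAG(val) OVER (ORDER BY pos) AS delta
    FROM seq_sample
) AS deltas
WHERE delta IS NOT NULL   -- the first row has no predecessor
GROUP BY delta
ORDER BY occurrences DESC;
```

For this sequence the eight deltas take only three distinct values: 300 (four times), 600 (three times), and -900 (once). The second example sequence (1, 3, 6, 10, ...) produces nine deltas that are all different, so a delta dictionary gives no benefit there.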
DELTARANGE_COMP
This compression scheme is primarily used for floating-point data; it stores each value as a delta from the previous one.
This scheme is ideal for many-valued FLOAT columns that are sorted or confined to a range. Do not use this scheme for unsorted columns that contain NULL values, as the storage cost for representing a NULL value is high. This scheme has a high cost for both compression and decompression.
To determine if DELTARANGE_COMP is suitable for a particular set of data, compare it to other schemes. Be sure to use the same sort order as the projection, and select sample data that will be stored consecutively in the database.
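A comparison along these lines can be sketched by creating two projections that share a sort order and differ only in the encoding of the column under test, then checking the space each one uses. The table, projection, and column names below are hypothetical, and the final query assumes the v_monitor.column_storage system table with its used_bytes, column_name, projection_name, and anchor_table_name columns.

```sql
-- Hypothetical table for the comparison.
CREATE TABLE sensor_sample (ts TIMESTAMP, reading FLOAT);

-- Same sort order in both projections; only the encoding of "reading" differs.
CREATE PROJECTION sensor_sample_auto (ts, reading ENCODING AUTO)
AS SELECT ts, reading FROM sensor_sample ORDER BY ts
UNSEGMENTED ALL NODES;

CREATE PROJECTION sensor_sample_dr (ts, reading ENCODING DELTARANGE_COMP)
AS SELECT ts, reading FROM sensor_sample ORDER BY ts
UNSEGMENTED ALL NODES;

-- After loading representative sample data, compare the bytes used
-- by the "reading" column in each projection.
SELECT projection_name, column_name, SUM(used_bytes) AS total_bytes
FROM v_monitor.column_storage
WHERE anchor_table_name = 'sensor_sample'
  AND column_name = 'reading'
GROUP BY projection_name, column_name
ORDER BY total_bytes;
```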
DELTAVAL
For INTEGER and DATE/TIME/TIMESTAMP/INTERVAL columns, data is recorded as a difference from the smallest value in the data block. This encoding has no effect on other data types.
DELTAVAL is best used for many-valued, unsorted integer or integer-based columns. CPU requirements for this encoding type are minimal, and data never expands.
GCDDELTA
For INTEGER and DATE/TIME/TIMESTAMP/INTERVAL columns, and NUMERIC columns with 18 or fewer digits, data is recorded as the difference from the smallest value in the data block divided by the greatest common divisor (GCD) of all entries in the block. This encoding has no effect on other data types.
ENCODING GCDDELTA is best used for many-valued, unsorted integer or integer-based columns whose values are multiples of a common factor. For example, timestamps are stored internally in microseconds, so values that are precise only to the millisecond are all multiples of 1000. The CPU requirements for decoding GCDDELTA encoding are minimal, and the data never expands, but GCDDELTA may take more encoding time than DELTAVAL.
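The arithmetic can be illustrated with a small worked example. The four block values below are hypothetical and chosen as multiples of 1000, in line with the millisecond case above; this is a sketch of the description, not of the on-disk format.

```latex
\begin{aligned}
\text{block values } v_i &: 7000,\ 8000,\ 10000,\ 13000 \\
\min_i v_i &= 7000, \qquad g = \gcd(7000,\ 8000,\ 10000,\ 13000) = 1000 \\
\text{DELTAVAL: } v_i - \min_i v_i &: 0,\ 1000,\ 3000,\ 6000 \\
\text{GCDDELTA: } \frac{v_i - \min_i v_i}{g} &: 0,\ 1,\ 3,\ 6
\end{aligned}
```

Dividing by the common factor leaves much smaller numbers to store than DELTAVAL alone, which is why GCDDELTA pays off when such a factor exists.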
GZIP_COMP
This encoding type uses the gzip compression algorithm. See gzip web site for more information. This algorithm results in better compression than the automatic LZO compression, but lower compression than BZIP_COMP. It requires more CPU time to compress than LZO but less CPU time than BZIP_COMP. This algorithm is best used on large string columns such as VARCHAR, VARBINARY, CHAR, and BINARY. Use this encoding when you want a better compression than LZO, but at less CPU time than bzip2.
RLE
RLE (run length encoding) replaces sequences (runs) of identical values with a single pair that contains the value and number of occurrences. Therefore, it is best used for low cardinality columns that are present in the ORDER BY clause of a projection.
The Vertica execution engine processes RLE encoding run-by-run and the Vertica optimizer gives it preference. Use it only when run length is large, such as when low-cardinality columns are sorted.
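As a sketch of this guidance, the projection below sorts on a low-cardinality column first, so that identical values form long runs, and applies RLE to it. The table orders_sample and its columns are hypothetical.

```sql
-- Hypothetical table: a low-cardinality status column and a many-valued id.
CREATE TABLE orders_sample (order_id INT, status VARCHAR(10), order_date DATE);

-- Rough estimate of run length once sorted by status:
-- average number of rows per distinct value.
SELECT COUNT(*) / NULLIF(COUNT(DISTINCT status), 0) AS avg_rows_per_value
FROM orders_sample;

-- Lead the sort order with the low-cardinality column and encode it with RLE.
CREATE PROJECTION orders_sample_rle (
    status ENCODING RLE,
    order_date,
    order_id
)
AS SELECT status, order_date, order_id FROM orders_sample
ORDER BY status, order_date, order_id
SEGMENTED BY HASH(order_id) ALL NODES;
```

If the estimate is close to 1, runs are short and RLE is unlikely to help.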
Zstandard compression
Vertica supports three ZSTD compression types:
-
ZSTD_COMP provides high compression ratios, higher than gzip. Use it when you want better compression than gzip. For general use cases, use this or the ZSTD_FAST_COMP encoding type.
-
ZSTD_FAST_COMP uses the fastest compression level that the zstd library provides. It is the fastest of the three encoding types, but takes up more space than the other two. For general use cases, use this or the ZSTD_COMP encoding type.
-
ZSTD_HIGH_COMP offers the best compression in the zstd library. It is slower than the other two encoding types. Use this type when you need the best compression and can accept more CPU time.
2 - GROUPED clause
Enterprise Mode only
Groups two or more columns into a single disk file. This minimizes file I/O for workloads that:
-
Read a large percentage of the columns in a table.
-
Perform single row look-ups.
-
Query against many small columns.
-
Frequently update data in these columns.
If you have data that is always accessed together and is not used in predicates, you can increase query performance by grouping these columns. Once columns are grouped, queries can no longer retrieve the records of an individual column from disk independently of the other columns in the group.
Note
The effectiveness of RLE encoding is reduced when an RLE column is grouped with one or more non-RLE columns.
When grouping columns you can:
-
Group some of the columns:
(a, GROUPED(b, c), d)
-
Group all of the columns:
(GROUPED(a, b, c, d))
-
Create multiple groupings in the same projection:
(GROUPED(a, b), GROUPED(c, d))
Note
Vertica performs dynamic column grouping. For example, to provide better read and write efficiency for small loads, Vertica ignores any projection-defined column grouping (or lack thereof) and groups all columns together by default.
The following example shows how to group highly correlated columns bid and ask. The stock column is stored separately.
=> CREATE TABLE trades (stock CHAR(5), bid INT, ask INT);
=> CREATE PROJECTION tradeproj (stock ENCODING RLE,
GROUPED(bid ENCODING DELTAVAL, ask))
AS (SELECT * FROM trades) KSAFE 1;
The following example shows how to create a projection that uses expressions in the column definition. The projection contains two integer columns a and b, and a third column product_value that stores the product of a and b:
=> CREATE TABLE values (a INT, b INT);
=> CREATE PROJECTION product (a, b, product_value) AS
SELECT a, b, a*b FROM values ORDER BY a KSAFE;
3 - Hash segmentation clause
Specifies how to segment projection data for distribution across all cluster nodes. You can specify segmentation for a table and a projection. If a table definition specifies segmentation, Vertica uses it for that table's auto-projections.
It is strongly recommended that you use Vertica's built-in HASH function, which distributes data evenly across the cluster, and facilitates optimal query execution.
Syntax
SEGMENTED BY expression ALL NODES [ OFFSET offset ]
Parameters
SEGMENTED BY expression
- A general SQL expression. Hash segmentation is the preferred method of segmentation. Vertica recommends using its built-in HASH function, whose arguments resolve to table columns. If you use an expression other than HASH, Vertica issues a warning.
The segmentation expression should specify columns with a large number of unique data values and acceptable skew in their data distribution. In general, primary key columns that meet these criteria are good candidates for hash segmentation.
For details, see Expression Requirements below.
ALL NODES
- Automatically distributes data evenly across all nodes when the projection is created. Node ordering is fixed.
OFFSET offset
- A zero-based offset that indicates on which node to start segmentation distribution.
This option is not valid for CREATE TABLE and CREATE TEMPORARY TABLE.
Important
If you create a projection for a table with the OFFSET option, be sure to create enough copies of each projection segment to satisfy system K-safety; otherwise, Vertica regards the projection as unsafe and cannot use it to query the table.
You can ensure K-safety compliance when you create projections by combining OFFSET and KSAFE options in the CREATE PROJECTION statement. On executing this statement, Vertica automatically creates the necessary number of projection copies.
Expression requirements
A segmentation expression must specify table columns as they are defined in the source table. Projection column names are not supported.
The following restrictions apply to segmentation expressions:
-
All leaf expressions must be constants or column references to a column in the CREATE PROJECTION's SELECT list.
-
The expression must return the same value over the life of the database.
-
Aggregate functions are not allowed.
-
The expression must return non-negative INTEGER values in the range 0 <= x < 2^63, and values are generally distributed uniformly over that range.
Note
If the expression produces a value outside the expected range—for example, a negative value—no error occurs, and the row is added to the projection's first segment.
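Before choosing a segmentation column, it can help to check the guidance above on unique values and skew. The queries below are a minimal sketch that uses the public.employee_dimension table and employee_key column from the example in this section; the LIMIT of 10 is an arbitrary choice.

```sql
-- How many distinct values does the candidate column have,
-- relative to the number of rows?
SELECT COUNT(*) AS row_count,
       COUNT(DISTINCT employee_key) AS distinct_keys
FROM public.employee_dimension;

-- Which values occur most often? A few dominant values indicate skew.
SELECT employee_key, COUNT(*) AS occurrences
FROM public.employee_dimension
GROUP BY employee_key
ORDER BY occurrences DESC
LIMIT 10;
```

A column whose distinct count is close to the row count, with no value dominating the second result, fits the description of a good hash segmentation candidate.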
Examples
The following CREATE PROJECTION statement creates projection public.employee_dimension_super. It specifies to include all columns in table public.employee_dimension. The hash segmentation clause invokes the Vertica HASH function to segment projection data on the column employee_key; it also includes the ALL NODES clause, which specifies to distribute projection data evenly across all nodes in the cluster:
=> CREATE PROJECTION public.employee_dimension_super
AS SELECT * FROM public.employee_dimension
ORDER BY employee_key
SEGMENTED BY hash(employee_key) ALL NODES;
4 - Unsegmented clause
Specifies to distribute identical copies of table or projection data on all nodes across the cluster. Use this clause to facilitate distributed query execution on tables and projections that are too small to benefit from segmentation.
Vertica uses the same name to identify all instances of an unsegmented projection. For more information about projection name conventions, see Projection naming.
Syntax
UNSEGMENTED ALL NODES
Examples
This example creates an unsegmented projection for table store.store_dimension:
=> CREATE PROJECTION store.store_dimension_proj (storekey, name, city, state)
AS SELECT store_key, store_name, store_city, store_state
FROM store.store_dimension
UNSEGMENTED ALL NODES;
CREATE PROJECTION
=> SELECT anchor_table_name anchor_table, projection_name, node_name
FROM PROJECTIONS WHERE projection_basename='store_dimension_proj';
anchor_table | projection_name | node_name
-----------------+----------------------+------------------
store_dimension | store_dimension_proj | v_vmart_node0001
store_dimension | store_dimension_proj | v_vmart_node0002
store_dimension | store_dimension_proj | v_vmart_node0003
(3 rows)