Data load

Vertica provides many ways to read data. You can load data into the database from a variety of sources, optionally transforming it in various ways. You can read data in place in its original format using external tables. You can use streaming, and you can import data from other Vertica databases. See Common use cases for an introduction.

Most data-loading operations, including external tables, revolve around the COPY statement, which has many options. This book focuses on COPY-based reads (data load and external tables). Other data-loading options supported by Vertica are described elsewhere in the documentation.

1 - Common use cases

Vertica supports a variety of use cases for reading data. Some of the most common are summarized here with links for more information. This is not a complete list of capabilities.

The COPY statement is central to loading data. See Introduction to the COPY statement for an overview of its use.

Loading data from files

You might have data, perhaps quite a bit of data, that you want to load into Vertica. These files might reside on shared storage, in the cloud, or on local nodes, and might be in a variety of formats.

For information about source locations, see Specifying where to load data from. To handle data in different formats you specify a parser; for more information about the options, see Data formats.

You are not limited to loading data "as-is"; you can also transform it during load. See Transforming data during loads.
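
For example, a load can compute one column from others as it reads the input. The following is a minimal sketch; the table and data are hypothetical:

-- hypothetical table; the third column is computed during the load
=> CREATE TABLE names(first VARCHAR(20), last VARCHAR(20), full_name VARCHAR(60));

=> COPY names(first, last, full_name AS first || ' ' || last) FROM STDIN;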

Loading data from other services

Apache Kafka is a platform for streaming data. Vertica supports streaming data to and from Kafka. See Apache Kafka integration.

Apache Spark is a cluster-computing framework for distributed data. Vertica supports connecting to Spark for data. See Apache Spark integration.

You can copy data directly from another Vertica cluster, instead of exporting to files and then loading those files. See Database export and import.

Read data where it exists (don't import)

Instead of loading data into Vertica, you can read it in place using external tables. External tables can be advantageous in the following cases:

  • If you want to explore data, such as in a data lake, before selecting data to load into Vertica.

  • If you are one of several consumers sharing the same data, for example in a data lake, then reading it in place eliminates concerns about whether query results are up to date. There's only one copy, so all consumers see the same data.

  • If your data changes rapidly but you do not want to stream it into Vertica, you can instead query the latest updates automatically.

  • If you have lower-priority data in Vertica that you still want to be able to query.

When you query an external table, Vertica loads the data it needs from the external source. The Parquet and ORC columnar formats are optimized for this kind of load, so using external tables does not necessarily have a large effect on performance compared to loading data into Vertica native tables.
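
For example, you might define an external table over Parquet files in a data lake and query it in place. This is a sketch; the path and columns are hypothetical:

-- hypothetical path and columns
=> CREATE EXTERNAL TABLE sales_ext (id INT, amount FLOAT, region VARCHAR)
    AS COPY FROM 'webhdfs:///data/lake/sales/*.parquet' PARQUET;

=> SELECT region, SUM(amount) FROM sales_ext GROUP BY region;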

For more information about using external tables, see Working with external data.

Complex types

Some data formats support complex types such as arrays and structs (sets of property-value pairs). You can use strong typing to define columns using the ARRAY and ROW types (including nesting) in native and external tables. See Complex types. Alternatively, you can define tables using flexible (schemaless) complex types without fully specifying them. You can load flexible complex types in the Parquet, ORC, JSON, and Avro formats; see Flexible complex types.
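
For example, a table definition might combine scalar, ROW, and ARRAY columns. This is a sketch; the table and columns are hypothetical:

-- hypothetical table combining scalar, ROW, and ARRAY columns
=> CREATE TABLE customers(
     id INT,
     name VARCHAR(50),
     address ROW(street VARCHAR(100), city VARCHAR(50), zip VARCHAR(10)),
     orders ARRAY[INT]
   );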

Unknown or evolving schema

Sometimes the schema for the data you want to load is unknown or changes over time, particularly with JSON data, which does not embed a schema in data files. There are two primary ways to explore data with an unknown schema, explained in detail in Data exploration:

  • You can inspect sample data files and derive an initial table definition using the INFER_TABLE_DDL function, as in the sketch after this list. For tables with many or complex columns, using this function can save time and reduce manual effort.

  • You can use flex tables to load schemaless data as blobs and inspect it using flex functions. Flex tables are best used for initial exploration of heterogeneous or poly-structured data, not in production databases, because the query-time extraction affects performance.
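
The following sketch derives a table definition from sample Parquet files with INFER_TABLE_DDL; the path and table name are hypothetical:

-- hypothetical path and table name
=> SELECT INFER_TABLE_DDL('/data/samples/*.parquet'
       USING PARAMETERS format = 'parquet', table_name = 'samples');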

If your data changes over time, some parsers emit warnings about mismatches or new columns. You can use ALTER TABLE to modify the table definition as the data evolves. For more information and an example, see Schema evolution.

Messy data

Sometimes data is not clean; values might not match the declared data types, or required values might be missing, or the parser might not be able to interpret a row for other reasons. You might still want to be able to load and explore this data. You can specify how error-tolerant to be and where to record information about rejected data using parameters to the COPY statement. For more information, see Handling messy data.
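
For example, a COPY statement might cap the number of rejected rows and record rejections and exceptions for later review. This is a sketch; the table, paths, and limit are hypothetical:

-- hypothetical table, paths, and rejection limit
=> COPY sales FROM '/data/sales.dat'
   EXCEPTIONS '/tmp/sales_exceptions.log'
   REJECTED DATA '/tmp/sales_rejections.txt'
   REJECTMAX 100;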

2 - Introduction to the COPY statement

Use the COPY statement to load data. COPY is a large and versatile statement with many parameters; for all of the details, see the reference page. In its simplest form, COPY copies data from a source to a table, as follows:

=> COPY target-table FROM data-source;

You also use COPY when defining an external table:

=> CREATE EXTERNAL TABLE target-table (...) AS COPY FROM data-source;

Source data can be a data stream or a file path. For more about the FROM clause, see Specifying where to load data from.

You can specify many details about a data load, including the source location, data format, and how to handle errors.

For a complete list of parameters, see Parameters.

Permissions

Generally, only a superuser can use the COPY statement to bulk-load data. Non-superusers can use COPY in certain cases:

  • To load from a stream on the host (such as STDIN) rather than a file (see Streaming data via JDBC).

  • To load with the FROM LOCAL option.

  • To load into a storage location where the user has been granted permission.

  • To use a user-defined-load function for which the user has permission.

A non-superuser can also perform a batch load with a JDBC prepared statement, which invokes COPY to load data as a background task.

Users must also have read permission to the source from which the data is to be loaded.

3 - Global and column-specific options

You can specify some COPY options globally, for the entire COPY statement, or limit their scope to a column. For example, in the following COPY statement, the first column is delimited by '|' but the others are delimited by commas.

=> COPY employees(id DELIMITER '|', name, department) FROM ... DELIMITER ',';

You could specify a different default for null inputs for one column:

=> COPY employees(id, name, department NULL 'General Admin') FROM ... ;

Alternatively, you can use the COLUMN OPTION parameter to specify column-specific parameters instead of enumerating the columns:

=> COPY employees COLUMN OPTION (department NULL 'General Admin') FROM ... ;

Where both global and column-specific values are provided for the same parameter, the column-specific value governs those columns and the global one governs others.

All parameters can be used globally. The description of each parameter indicates whether it can be restricted to specific columns.

4 - Specifying where to load data from

Each COPY statement requires either a FROM clause to indicate the location of the file or files being loaded or a SOURCE clause when using a user-defined source. For more about the SOURCE clause, see Parameters. This section covers use of the FROM clause.

Loading from a specific path

Use the path-to-data argument to indicate the location of one or more files to load. You can load data from the local file system, NFS, HDFS, and supported cloud object stores.

If the path is a URL, you must use URL encoding for the '%' character. Otherwise, URL encoding ('%NN' where NN is a two-digit hexadecimal number) is permitted but not required.

When copying from the local file system, the COPY statement expects to find files in the same location on every node that participates in the query. If you are using NFS, then you can create an NFS mount point on each node. Doing so allows all database nodes to participate in the load for better performance without requiring files to be copied to all nodes.

Treat NFS mount points as local files in paths:

=> COPY sales FROM '/mount/sales.dat' ON ANY NODE;

You can specify more than one path in the same COPY statement, as in the following example.

=> COPY myTable FROM 'webhdfs:///data/sales/01/*.dat', 'webhdfs:///data/sales/02/*.dat',
    'webhdfs:///data/sales/historical.dat';

For files in HDFS, you can specify a name service (hadoopNS). In this example, the COPY statement is part of the definition of an external table:

=> CREATE EXTERNAL TABLE users (id INT, name VARCHAR(20))
    AS COPY FROM 'webhdfs://hadoopNS/data/users.csv';

If path-to-data resolves to a storage location on a local file system, and the user invoking COPY is not a superuser, these permissions are required:

  • The storage location must have been created with the USER option (see CREATE LOCATION).

  • The user must already have been granted READ access to the storage location where the file or files exist, as described in GRANT (storage location).

Vertica prevents symbolic links from allowing unauthorized access.
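
To satisfy these requirements, a superuser might create a user-accessible storage location and grant read access on it. This is a sketch; the path and user name are hypothetical:

-- hypothetical path and user
=> CREATE LOCATION '/data/ext' ALL NODES USAGE 'USER';

=> GRANT READ ON LOCATION '/data/ext' TO analyst;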

Loading with wildcards (glob)

You can invoke COPY for a large number of files in a shared directory with a single statement such as:

=> COPY myTable FROM '/data/manyfiles/*.dat' ON ANY NODE;

The glob (*) must indicate a set of files, not directories. The following statement fails if /data/manyfiles contains any subdirectories:

=> COPY myTable FROM '/data/manyfiles/*' ON ANY NODE;

If /data/manyfiles contains subdirectories that contain files, you can use a glob in the path for the subdirectories:

=> COPY myTable FROM '/data/manyfiles/*/*' ON ANY NODE;

Sometimes a directory structure partitions the data, as in the following example:

path/created=2016-11-01/*
path/created=2016-11-02/*
path/created=2016-11-03/*
path/...

You still use a glob to read the data, but you can also read the partition values themselves (creation date, in this case) as table columns. See Partitioned file paths.
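
The following sketch assumes a Vertica version that supports the PARTITION COLUMNS clause for such hive-style partition directories; the path and columns are hypothetical:

-- hypothetical table and path; assumes PARTITION COLUMNS support
=> CREATE EXTERNAL TABLE sales (id INT, amount FLOAT, created DATE)
    AS COPY FROM '/data/sales/*/*' PARTITION COLUMNS created;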

Using a wildcard with the ON ANY NODE clause expands the file list on the initiator node. This command then distributes the individual files among all nodes, so that the COPY workload is evenly distributed across the entire cluster. ON ANY NODE is the default for all file systems other than Linux.

Loading from a Vertica client

Use COPY LOCAL to load files on a client system to the Vertica database. For example, to copy a GZIP file from your local client, use a command such as this:

=> COPY store.store_dimension FROM LOCAL '/usr/files/my_data/input_file' GZIP;

You can use a comma-separated list to load multiple files of the same compression type. COPY LOCAL then concatenates the files into a single file, so you cannot combine files with different compression types in the list. When listing multiple files, be sure to specify the type of every input file, such as BZIP, as shown:

=> COPY simple_table FROM LOCAL 'input_file.bz' BZIP, 'input_file.bz' BZIP;

You can load data from a local client from STDIN, as follows:

=> COPY simple_table FROM LOCAL STDIN;

Loading from Kafka or Spark

For information about streaming data from Kafka, see Apache Kafka integration.

For information about using Vertica with Spark data, see Apache Spark integration.

Loading data from an IDOL CFS client

The IDOL Connector Framework Server (CFS) VerticaIndexer feature lets CFS clients connect to your Vertica database using ODBC. After it is connected, CFS uses COPY...FROM LOCAL statements to load IDOL document metadata into an existing flex table. For more information, see the Using flex tables for IDOL data section in Using Flex Tables.

5 - Data formats

COPY supports many data formats, detailed in the sections that follow. You specify a data format by specifying a parser.

By default, COPY uses the DELIMITED parser (Delimited data) to load raw data into the database. Raw input data must be in UTF-8, delimited text format. Other parsers support other data formats.

The syntax for specifying which parser to use varies. The description of each parser includes this information.

The same COPY statement cannot mix raw data types that require different parsers, such as NATIVE and FIXEDWIDTH. You can, however, load data of different formats, using different parsers, into the same table using separate COPY statements.

For information about verifying input data formats, see Handling Non-UTF-8 input.

All parsers described in this section can be used with conventional tables (those created with CREATE TABLE or CREATE EXTERNAL TABLE). Some also support Flex tables (CREATE FLEX TABLE). For more information specific to Flex tables, see Using flex table parsers.

All parsers support all primitive data types and several support one-dimensional arrays of primitive types. Some parsers support other complex types. See the documentation of individual parsers for information about supported types.

5.1 - Delimited data

If you do not specify another parser, Vertica defaults to the DELIMITED parser. You can specify the delimiter, escape characters, how to handle null values, and other parameters in the COPY statement.

The following example shows the default behavior, in which the delimiter character is '|':

=> CREATE TABLE employees (id INT, name VARCHAR(50), department VARCHAR(50));
CREATE TABLE

=> COPY employees FROM STDIN;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> 42|Sheldon Cooper|Physics
>> 17|Howard Wolowitz|Astronomy
>> \.

=> SELECT * FROM employees;
 id |      name       |  department
----+-----------------+--------------
 17 | Howard Wolowitz | Astronomy
 42 | Sheldon Cooper  | Physics
(2 rows)

By default, collection values are delimited by brackets and elements are delimited by commas. Collections must be one-dimensional arrays or sets of scalar types.

=> CREATE TABLE researchers (id INT, name VARCHAR, grants ARRAY[VARCHAR], values ARRAY[INT]);
CREATE TABLE

=> COPY researchers FROM STDIN;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> 42|Sheldon Cooper|[US-7376,DARPA-1567]|[65000,135000]
>> 17|Howard Wolowitz|[NASA-1683,NASA-7867,SPX-76]|[16700,85000,45000]
>> \.

=> SELECT * FROM researchers;
 id |      name       |               grants               |       values
----+-----------------+------------------------------------+---------------------
 17 | Howard Wolowitz | ["NASA-1683","NASA-7867","SPX-76"] | [16700,85000,45000]
 42 | Sheldon Cooper  | ["US-7376","DARPA-1567"]           | [65000,135000]
(2 rows)

To use a special character as a literal, prefix it with an escape character. For example, to include a literal backslash (\) in the loaded data (such as when including a file path), use two backslashes (\\). COPY removes the escape character from the input when it loads escaped characters.

When loading delimited data, two consecutive delimiters indicate a null value, unless the NULL parameter is set otherwise. The final delimiter is optional. For example, the following input is valid for the previous table:


=> COPY employees FROM STDIN;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> 45|Raj|
>> 21|Leonard||
>> \.

=> SELECT * FROM employees;
 id |  name   |  department
----+---------+--------------
 21 | Leonard |
 45 | Raj     |
(2 rows)

By default, if the data has too few values, the load fails. You can use the TRAILING NULLCOLS option to accept any number of missing columns and treat their values as null.

Vertica assumes that data is in the UTF-8 encoding.

The options specific to the DELIMITED parser and their default values are:

Option                  Default
DELIMITER               |
ENCLOSED BY             "
ESCAPE                  \
NULL                    '' (empty string)
COLLECTIONOPEN          [
COLLECTIONCLOSE         ]
COLLECTIONDELIMITER     ,
COLLECTIONNULLELEMENT   null
COLLECTIONENCLOSE       " (double quote)
TRAILING NULLCOLS       (none)

To load delimited data into a Flex table, use the FDELIMITEDPARSER parser.

Changing the column separator (DELIMITER)

The default COPY delimiter is a vertical bar ('|'). The DELIMITER is a single ASCII character used to separate columns within each record of an input source. Between two delimiters, COPY interprets all string data in the input as characters. Do not enclose character strings in quotes, because quote characters are also treated as literals between delimiters.

You can define a different delimiter using any ASCII value in the range E'\000' to E'\177' inclusive. For instance, if you are loading CSV data files, and the files use a comma (',') character as a delimiter, you can change the default delimiter to a comma. You cannot use the same character for both the DELIMITER and NULL options.

If the delimiter character is among a string of data values, use the ESCAPE AS character ('\' by default) to indicate that the delimiter should be treated as a literal.

The COPY statement accepts empty values (two consecutive delimiters) as valid input data for CHAR and VARCHAR data types. COPY stores empty columns as an empty string (''). An empty string is not equivalent to a NULL string.

To indicate a non-printing delimiter character (such as a tab), specify the character in extended string syntax (E'...'). If your database has StandardConformingStrings enabled, use a Unicode string literal (U&'...'). For example, use either E'\t' or U&'\0009' to specify tab as the delimiter.

The following example loads data from a comma-separated file:

=> COPY employees FROM ... DELIMITER ',';

In the following example, the first column has a column-specific delimiter:

=> COPY employees(id DELIMITER ':', name, department) FROM ... DELIMITER ',';

Changing collection delimiters (COLLECTIONDELIMITER, COLLECTIONOPEN, COLLECTIONCLOSE)

The DELIMITER option specifies the value that separates columns in the input. For a column with a collection type (ARRAY or SET), a delimiter is also needed between elements of the collection. In addition, the collection itself has start and end markers. By default, collections are enclosed in brackets and elements are delimited by commas, but you can change these values.

In the following example, collections are enclosed in braces and delimited by periods.

=> COPY researchers FROM STDIN COLLECTIONOPEN '{' COLLECTIONCLOSE '}' COLLECTIONDELIMITER '.';
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> 19|Leonard|{"us-1672"."darpa-1963"}|{16200.16700}
>> \.

=> SELECT * FROM researchers;
 id |      name       |               grants               |       values
----+-----------------+------------------------------------+---------------------
 17 | Howard Wolowitz | ["NASA-1683","NASA-7867","SPX-76"] | [16700,85000,45000]
 42 | Sheldon Cooper  | ["US-7376","DARPA-1567"]           | [65000,135000]
 19 | Leonard         | ["us-1672","darpa-1963"]           | [16200,16700]
(3 rows)

Changing the character enclosing column or collection values (ENCLOSED BY, COLLECTIONENCLOSE)

The ENCLOSED BY parameter lets you set an ASCII character to enclose string values, so that you can embed special characters such as delimiters in the data. The enclosing character is not considered to be part of the data if and only if it is the first and last character of the input. You can use any ASCII value in the range E'\001' to E'\177' inclusive (any ASCII character except NUL: E'\000') for the ENCLOSED BY value. Using double quotation marks (") is common, as shown in the following example.

=> COPY employees FROM STDIN ENCLOSED BY '"';
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> 21|Leonard|Physics
>> 42|"Sheldon"|"Physics"
>> 17|Rajesh "Raj" K|Astronomy
>> \.

=> SELECT * FROM employees;
 id |     name       |  department
----+----------------+--------------
 17 | Rajesh "Raj" K | Astronomy
 21 | Leonard        | Physics
 42 | Sheldon        | Physics
(3 rows)

Notice that while ENCLOSED BY is a double quote, the embedded quotes in Rajesh's name are treated as part of the data because they are not the first and last characters in the column. The quotes that enclose "Sheldon" and "Physics" are dropped because of their positions.

Within a collection value, the COLLECTIONENCLOSE parameter is like ENCLOSED BY for individual elements of the collection.

Changing the null indicator (NULL)

By default, an empty string ('') for a column value means NULL. You can specify a different ASCII value in the range E'\001' to E'\177' inclusive (any ASCII character except NUL: E'\000') as the NULL indicator. You cannot use the same character for both the DELIMITER and NULL options.

A column containing one or more whitespace characters is not NULL unless the sequence of whitespace exactly matches the NULL string.

A NULL is case-insensitive and must be the only value between the data field delimiters. For example, if the null string is NULL and the delimiter is the default vertical bar (|):

|NULL| indicates a null value.

| NULL | does not indicate a null value.

When you use the COPY statement in a script, you must substitute a double-backslash for each null string that includes a backslash. For example, the scripts used to load the example database contain:

COPY ... NULL E'\\n' ...

Changing the null indicator for collection values (COLLECTIONNULLELEMENT)

The NULL option specifies the value to be treated as null for a column value. For a column with a collection type (ARRAY or SET), a separate option specifies how to interpret null elements. By default, "null" indicates a null value. An empty value, meaning two consecutive element delimiters, does not indicate null:

=> COPY researchers FROM STDIN;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> 17|Howard|["nasa-143",,"nasa-6262"]|[10000,1650,15367]
>> 19|Leonard|["us-177",null,"us-6327"]|[16200,64000,26500]
>> \.

=> SELECT * FROM researchers;
 id |  name   |           grants            |       values
----+---------+-----------------------------+---------------------
 17 | Howard  | ["nasa-143","","nasa-6262"] | [10000,1650,15367]
 19 | Leonard | ["us-177",null,"us-6327"]   | [16200,64000,26500]
(2 rows)

Use COLLECTIONNULLELEMENT to specify a different value, as in the following example.

=> COPY researchers from STDIN COLLECTIONNULLELEMENT 'x';
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> 42|Sheldon|[x,"us-1672"]|[x,165000]
>> \.

=> SELECT * FROM researchers;
 id |  name   |           grants            |       values
----+---------+-----------------------------+---------------------
 17 | Howard  | ["nasa-143","","nasa-6262"] | [10000,1650,15367]
 19 | Leonard | ["us-177",null,"us-6327"]   | [16200,64000,26500]
 42 | Sheldon | [null, "us-1672"]           | [null,165000]
(3 rows)

Filling missing columns (TRAILING NULLCOLS)

By default, COPY fails if the input does not contain enough columns. Use the TRAILING NULLCOLS option to instead insert NULL values for any columns that lack data. This option cannot be used with columns that have a NOT NULL constraint.

The following example demonstrates use of this option.

=> CREATE TABLE z (a INT, b INT, c INT );

--- insert with enough data:
=> INSERT INTO z VALUES (1, 2, 3);

=> SELECT * FROM z;
 a | b | c
---+---+---
 1 | 2 | 3
(1 row)

--- insert deficient data:
=> COPY z FROM STDIN TRAILING NULLCOLS;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> 4 | 5 | 6
>> 7 | 8
>> \.

=> SELECT * FROM z;
 a | b | c
---+---+---
 1 | 2 | 3
 4 | 5 | 6
 7 | 8 |
(3 rows)

Changing the escape character (ESCAPE AS, NO ESCAPE)

You can specify an escape character, which enables any special characters to be treated as part of the data. For example, if an element in a CSV file should contain a comma, you can indicate that by prepending the escape character to the comma in the data. The default escape character is a backslash (\).

To change the default to a different character, use the ESCAPE AS option. You can set the escape character to be any ASCII value in the range E'\001' to E'\177' inclusive.

If you do not want any escape character and want to prevent any characters from being interpreted as escape sequences, use the NO ESCAPE option.

ESCAPE AS and NO ESCAPE can be set at both the column and global levels.
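
For example, the following sketches change the escape character for one load and disable escaping for another; the tables and sources are hypothetical:

-- hypothetical tables; sources elided as in the examples above
=> COPY employees FROM ... ESCAPE AS '^';

=> COPY raw_logs FROM ... NO ESCAPE;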

Changing the end-of-line character (RECORD TERMINATOR)

To specify the literal character string that indicates the end of a data file record, use the RECORD TERMINATOR parameter, followed by the string to use. If you do not specify a value, then Vertica attempts to determine the correct line ending, accepting either just a linefeed (E'\n') common on UNIX systems, or a carriage return and linefeed (E'\r\n') common on Windows platforms.

For example, if your file contains comma-separated values terminated by line feeds that you want to maintain, use the RECORD TERMINATOR option to specify an alternative value:

=>  COPY mytable FROM STDIN DELIMITER ',' RECORD TERMINATOR E'\n';

To specify the RECORD TERMINATOR as non-printing characters, use either the extended string syntax or Unicode string literals. The following table lists some common record terminator characters. See String Literals for an explanation of the literal string formats.

Extended String Syntax   Unicode Literal String   Description       ASCII Decimal
E'\b'                    U&'\0008'                Backspace         8
E'\t'                    U&'\0009'                Horizontal tab    9
E'\n'                    U&'\000a'                Linefeed          10
E'\f'                    U&'\000c'                Formfeed          12
E'\r'                    U&'\000d'                Carriage return   13
E'\\'                    U&'\005c'                Backslash         92

If you use the RECORD TERMINATOR option to specify a custom value, be sure the input file matches the value. Otherwise, you may get inconsistent data loads.

If using JDBC, Vertica recommends that you use the following value for RECORD TERMINATOR:

System.getProperty("line.separator")

5.2 - Binary (native) data

You can load binary data using the NATIVE parser option, except with COPY LOCAL, which does not support this option. Since binary-format data does not require the use and processing of delimiters, it precludes the need to convert integers, dates, and timestamps from text to their native storage format, and improves load performance over delimited data. All binary-format files must adhere to the formatting specifications described in Appendix: creating native binary format files.

Native binary format data files are typically larger than their delimited text format counterparts, so compress the data before loading it. The NATIVE parser does not support concatenated compressed binary files. You can load native (binary) format files when developing plug-ins to ETL applications.
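
For example, a compressed native binary file might be loaded as follows. This is a sketch; the table and path are hypothetical:

-- hypothetical table and path; GZIP indicates the input compression
=> COPY employees FROM '/data/employees.bin.gz' GZIP NATIVE;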

There is no copy format to load binary data byte-for-byte because the column and record separators in the data would have to be escaped. Binary data type values are padded and translated on input, and also in the functions, operators, and casts supported.

Loading hexadecimal, octal, and bitstring data

You can use the formats hexadecimal, octal, and bitstring only to load binary columns. To specify these column formats, use the COPY statement's FORMAT options:

  • Hexadecimal

  • Octal

  • Bitstring

The following examples illustrate how to use the FORMAT option.

  1. Create a table:

    => CREATE TABLE t(oct VARBINARY(5),
         hex VARBINARY(5),
         bitstring VARBINARY(5) );
    
  2. Create the projection:

    => CREATE PROJECTION t_p(oct, hex, bitstring) AS SELECT * FROM t;
    
  3. Use a COPY statement with the STDIN clause, specifying each of the formats:

    => COPY t (oct FORMAT 'octal', hex FORMAT 'hex',
               bitstring FORMAT 'bitstring')
       FROM STDIN DELIMITER ',';
    
  4. Enter the data to load, ending the statement with a backslash (\) and a period (.) on a separate line:

    Enter data to be copied followed by a newline.
    End with a backslash and a period on a line by itself.
    >> 141142143144145,0x6162636465,0110000101100010011000110110010001100101
    >> \.
    
  5. Use a SELECT query on table t to view the results:

    => SELECT * FROM t;
     oct   | hex   | bitstring
    -------+-------+-----------
    abcde  | abcde | abcde
    (1 row)
    

COPY loads binary data using the same default formats that are used to input binary data. Since the backslash character ('\') is the default escape character, you must escape octal input values. For example, enter the byte '\141' as '\\141'.

On input, COPY translates string data as follows:

  • Uses the HEX_TO_BINARY function to translate from hexadecimal representation to binary.

  • Uses the BITSTRING_TO_BINARY function to translate from bitstring representation to binary.

Both functions take a VARCHAR argument and return a VARBINARY value.

You can also use the escape character to represent the (decimal) byte 92 by escaping it twice; for example, '\\\\'. Note that vsql inputs the escaped backslash as four backslashes. Equivalent inputs are hex value '0x5c' and octal value '\134' (134 = 1 x 8^2 + 3 x 8^1 + 4 x 8^0 = 92).

You can load a delimiter value if you escape it with a backslash. For example, given delimiter '|', '\\001\|\\002' is loaded as {1,124,2}, which can also be represented in octal format as '\\001\\174\\002'.

If you insert a value with more bytes than fit into the target column, COPY returns an error. For example, if column c1 is VARBINARY(1):

=> INSERT INTO t (c1) values ('ab');
ERROR: 2-byte value too long for type Varbinary(1)

If you implicitly or explicitly cast a value with more bytes than fit the target data type, COPY silently truncates the data. For example:

=> SELECT 'abcd'::binary(2);
 binary
--------
 ab
(1 row)

Hexadecimal data

The optional '0x' prefix indicates that a value is hexadecimal, not decimal, although not all hexadecimal values use A-F; for example, 5396. COPY ignores the 0x prefix when loading the input data.

If there are an odd number of characters in the hexadecimal value, the first character is treated as the low nibble of the first (furthest to the left) byte.

Octal data

Loading octal format data requires that each byte be represented by a three-digit octal code. The first digit must be in the range [0,3] and the second and third digits must both be in the range [0,7].

If the length of an octal value is not a multiple of three, or if one of the three digits is not in the proper range, the value is invalid and COPY rejects the row in which the value appears. If you supply an invalid octal value, COPY returns an error. For example:

=> SELECT '\\000\\387'::binary(8);
ERROR: invalid input syntax for type binary

Rows that contain binary values with invalid octal representations are also rejected. For example, COPY rejects '\\008' because 008 is not a valid octal number (the digit 8 is outside the range [0,7]).

BitString data

Loading bitstring data requires that each character must be zero (0) or one (1), in multiples of eight characters. If the bitstring value is not a multiple of eight characters, COPY treats the first n characters as the low bits of the first byte (furthest to the left), where n is the remainder of the value's length, divided by eight.

Examples

The following example shows VARBINARY HEX_TO_BINARY(VARCHAR) and VARCHAR TO_HEX(VARBINARY) usage.

  1. Create table t and its projection with binary columns:

    => CREATE TABLE t (c BINARY(1));
    => CREATE PROJECTION t_p (c) AS SELECT c FROM t;
    
  2. Insert minimum and maximum byte values, including an IP address represented as a character string:

    => INSERT INTO t values(HEX_TO_BINARY('0x00'));
    => INSERT INTO t values(HEX_TO_BINARY('0xFF'));
    => INSERT INTO t values (V6_ATON('2001:DB8::8:800:200C:417A'));
    

    Use the TO_HEX function to format binary values in hexadecimal on output:

    => SELECT TO_HEX(c) FROM t;
     to_hex
    --------
     00
     ff
     20
    (3 rows)
    

5.3 - Native varchar data

Use the NATIVE VARCHAR parser option when the raw data consists primarily of CHAR or VARCHAR data. COPY performs the conversion to the actual table data types on the database server. This parser option is not supported with COPY LOCAL.

Using NATIVE VARCHAR does not provide the same efficiency as NATIVE. However, NATIVE VARCHAR precludes the need to use delimiters or to escape special characters, such as quotes, which can make working with client applications easier.
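
For example, the following sketch loads a native varchar file; the table and path are hypothetical:

-- hypothetical table and path
=> COPY employees FROM '/data/employees.dat' NATIVE VARCHAR;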

Batch data inserts performed through the Vertica ODBC and JDBC drivers automatically use the NATIVE VARCHAR format.

5.4 - Fixed-width format data

Use the FIXEDWIDTH parser option to bulk load fixed-width data. Use the COLSIZES option to specify the number of bytes for each column. The definition of the table you are loading (COPY table f (x, y, z)) determines the number of COLSIZES values to declare.

To load fixed-width data, use the COLSIZES option to specify the number of bytes for each input column. If any records do not have values, COPY inserts one or more null characters to equal the specified number of bytes. The last record in a fixed-width data file must include a record terminator to determine the end of the load data.

The following COPY options are not supported:

  • DELIMITER

  • ENCLOSED BY

  • ESCAPE AS

  • TRAILING NULLCOLS

Using nulls in fixed-width data

The default NULL string for a fixed-width load cannot be an empty string; instead, it consists of all spaces. The number of spaces depends on the column width declared with the COLSIZES (integer [,...]) option.

For fixed-width loads, the NULL definition depends on whether you specify NULL at the column or statement level:

  • Statement level: NULL must be defined as a single character. The default (or custom) NULL character is repeated for the entire width of the column.

  • Column level: NULL must be defined as a string whose length matches the column width.

For fixed-width loads, if the input data column has fewer values than the specified column size, COPY inserts NULL characters. The number of NULLs must match the declared column width. If you specify a NULL string at the column level, COPY matches the string with the column width.

Defining a null character (statement level)

  1. Create a two-column table (fw):

    => CREATE TABLE fw(co int, ci int);
    CREATE TABLE
    
  2. Copy the table, specifying null as 'N', and enter some data:

    => COPY fw FROM STDIN FIXEDWIDTH colsizes(2,2) null AS 'N' NO COMMIT;
    Enter data to be copied followed by a newline.
    End with a backslash and a period on a line by itself.
    >> NN12
    >> 23NN
    >> NNNN
    >> nnnn
    >> \.
    
  3. Select all (*) from the table:

    => SELECT * FROM fw;
     co | ci
    ----+----
        | 12
     23 |
    (2 rows)
    

Defining a custom record terminator

To define a record terminator other than the COPY default when loading fixed-width data, take these steps:

  1. Create table fw with two columns, co and ci:

    => CREATE TABLE fw(co int, ci int);
    CREATE TABLE
    
  2. Copy table fw, specifying two 2-byte column sizes and a comma (,) as the record terminator:

    => COPY fw FROM STDIN FIXEDWIDTH colsizes(2,2) RECORD TERMINATOR ',';
    Enter data to be copied followed by a newline.
    End with a backslash and a period on a line by itself.
    >> 1234,1444,6666
    >> \.
    
  3. Query all data in table fw:

    => SELECT * FROM fw;
     co | ci
    ----+----
     12 | 34
     14 | 44
    (2 rows)
    

The SELECT output indicates only two values. COPY rejected the third value (6666) because it was not followed by a comma (,) record terminator. Fixed-width data requires a trailing record terminator only if you explicitly specify a record terminator.

Copying fixed-width data

Use COPY FIXEDWIDTH COLSIZES (n [,...]) to load files into a Vertica database. By default, all spaces are NULLs. For example:

=> CREATE TABLE mytest(co int, ci int);
=> CREATE PROJECTION mytest_p1 AS SELECT * FROM mytest SEGMENTED BY HASH(co) ALL NODES;
=> COPY mytest(co,ci) FROM STDIN FIXEDWIDTH colsizes(6,4) NO COMMIT;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> \.
=> SELECT * FROM mytest ORDER BY co;
 co | ci
----+----
(0 rows)

Skipping content in fixed-width data

The COPY statement has two options to skip input data. The SKIP BYTES option is only for fixed-width data loads:

SKIP BYTES num-bytes   Skips the specified number of bytes from the input data.
SKIP num-records       Skips the specified number of records.

The following example uses SKIP BYTES to skip 11 bytes when loading a fixed-width table with two columns (4 and 6 bytes):

  1. Copy a table using SKIP BYTES:

    => COPY fw FROM STDIN FIXEDWIDTH colsizes (4,6) SKIP BYTES 11;
    Enter data to be copied followed by a newline.
    End with a backslash and a period on a line by itself.
    >> 2222666666
    >> 1111999999
    >> 1632641282
    >> \.
    
  2. Query all data in table fw:

    => SELECT * FROM fw ORDER BY co;
      co  |   ci
    ------+--------
     1111 | 999999
     1632 | 641282
    (2 rows)
    

The output confirms that COPY skipped the first 11 bytes of loaded data.

The following example uses SKIP when loading a fixed-width (4,6) table:

  1. Copy a table, using SKIP to skip two records of input data:

    => COPY fw FROM STDIN FIXEDWIDTH colsizes (4,6) SKIP 2;
    Enter data to be copied followed by a newline.
    End with a backslash and a period on a line by itself.
    >> 2222666666
    >> 1111999999
    >> 1632641282
    >> 3333888888
    >> \.
    
  2. Query all data in table fw:

    => SELECT * FROM fw ORDER BY co;
      co  |   ci
    ------+--------
     1632 | 641282
     3333 | 888888
    (2 rows)
    

The output confirms that COPY skipped the first two records of load data.

Trimming characters in fixed-width data loads

Use the TRIM option to trim a character. TRIM accepts a single-byte character, which is trimmed at the beginning and end of the data. For fixed-width data loads, when you specify a TRIM character, COPY first checks to see if the row is NULL. If the row is not null, COPY trims the character(s). The next example instructs COPY to trim the character A, and shows the results:

  1. Copy table fw, specifying TRIM character A:

    => COPY fw FROM STDIN FIXEDWIDTH colsizes(4,6) TRIM 'A';
    Enter data to be copied followed by a newline.
    End with a backslash and a period on a line by itself.
    >> A22A444444
    >> A22AA4444A
    >> \.
    
  2. Query all data in table fw:

    => SELECT * FROM fw ORDER BY co;
     co |   ci
    ----+--------
     22 |   4444
     22 | 444444
    (2 rows)
    

Using padding in fixed-width data loads

By default, the padding character is ' ' (a single space). The padding behavior for fixed-width data loads is similar to how a space is treated in other formats, differing by data type as follows:

Data type                                 Padding
Integer                                   Leading and trailing spaces
Bool                                      Leading and trailing spaces
Float                                     Leading and trailing spaces
[var]Binary                               None, all characters are significant.
[Var]Char                                 Trailing spaces if the string is too large
Date, Interval, Time, Timestamp, TimeTZ   None, all characters are significant. COPY uses an internal algorithm to parse these data types.
Date (formatted)                          Use the COPY FORMAT option string to match the expected column length.
Numerics                                  Leading and trailing spaces

5.5 - ORC data

The ORC (Optimized Row Columnar) format is a column-oriented file format. Vertica has a parser for this format that can take advantage of the columnar layout.

In the COPY statement, specify the parser as follows. Note that this is not the usual PARSER parser-name syntax; COPY supports ORC directly:

=> COPY tableName FROM path ORC[(...)];

The parser takes several optional parameters; see the ORC reference page.

Be aware that if you load from multiple files in the same COPY statement, and any of them is aborted, the entire load aborts. This behavior differs from that for delimited files, where the COPY statement loads what it can and ignores the rest.

Supported data types

Vertica can natively read columns of all Hive primitive data types. For a complete list, see HIVE Data Types (specifically the Numeric, Date/Time, String, and Misc lists). Vertica can also load UUIDs and complex types (arrays and structs in any combination).

The data types you specify for COPY or CREATE EXTERNAL TABLE AS COPY must exactly match the types in the data, though Vertica permits implicit casting among compatible numeric types.

Timestamps and time zones

To correctly report timestamps, Vertica must know what time zone the data was written in. Older versions of the ORC format do not record the time zone.

Hive version 1.2.0 and later records the writer time zone in the stripe footer. Vertica uses that time zone to make sure the timestamp values read into the database match the ones written in the source file. For ORC files that are missing this time zone information, Vertica assumes the values were written in the local time zone and logs an ORC_FILE_INFO event in the QUERY_EVENTS system table. Check for events of this type after your first query to verify that timestamps are being handled as you expected.
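
For example, the following query is one way to check for these events; adjust it to your needs:

-- check QUERY_EVENTS for ORC timestamp handling events
=> SELECT event_details FROM query_events WHERE event_type = 'ORC_FILE_INFO';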

5.6 - Parquet data

Parquet is a column-oriented file format. Vertica has a parser that can take advantage of the columnar layout.

In the COPY statement, specify the parser as follows. Note that this is not the usual PARSER parser-name syntax; COPY supports Parquet directly:


=> COPY tableName FROM path PARQUET[(...)];

The parser takes several optional parameters; see the PARQUET reference page.

Be aware that if you load from multiple files in the same COPY statement, and any of them is aborted, the entire load aborts. This behavior differs from that for delimited files, where the COPY statement loads what it can and ignores the rest.

Schema matching

By default, the Parquet parser uses strong schema matching. This means that the load must consume all columns in the Parquet data in the order they occur in the data. You can, instead, use loose schema matching, which allows you to select the columns and struct fields you want and ignore the rest. Loose schema matching depends on the names of the columns and fields in the data rather than their order, so the names in your table must match those in the data. Types must match or be coercible. For more information on how to use loose schema matching, see Loose Schema Matching on the PARQUET reference page.

Use the do_soft_schema_match_by_name parameter to specify loose schema matching. In the following example, the Parquet data contains more columns than those used in the table.

=> CREATE EXTERNAL TABLE restaurants(name VARCHAR, cuisine VARCHAR)
    AS COPY FROM '/data/rest*.parquet'
    PARQUET(do_soft_schema_match_by_name='True');

=> SELECT * FROM restaurants;
       name        | cuisine
-------------------+----------
 Bob's pizzeria    | Italian
 Bakersfield Tacos | Mexican
(2 rows)

Metadata caching

Parquet files include metadata that Vertica uses when loading data. To avoid repeatedly fetching this data, particularly from remote sources or where API calls incur financial costs, Vertica caches this metadata on each participating node during the planning phase for use during the execution phase.

Vertica uses TEMP storage for the cache, and only if TEMP storage is on the local file system.

You can limit the size of the cache by setting the ParquetMetadataCacheSizeMB configuration parameter. The default is 4GB.

Supported data types

Vertica can natively read columns of all Hive primitive data types. For a complete list, see HIVE Data Types (specifically the Numeric, Date/Time, String, and Misc lists). Vertica can also load UUIDs and complex types (arrays and structs in any combination).

The data types you specify for COPY or CREATE EXTERNAL TABLE AS COPY must exactly match the types in the data, though Vertica permits implicit casting among compatible numeric types.

For the Parquet format only, you can use flexible complex types instead of fully specifying the schema for complex types. See Flexible complex types.

Timestamps and time zones

To correctly report timestamps, Vertica must know what time zone the data was written in. Hive does not record the writer time zone. Vertica assumes timestamp values were written in the local time zone and reports a warning at query time.

Hive provides an option, when writing Parquet files, to record timestamps in the local time zone. If you are using Parquet files that record times in this way, set the UseLocalTzForParquetTimestampConversion configuration parameter to 0 to disable the conversion done by Vertica. (See General parameters.)
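
For example, assuming the ALTER DATABASE syntax for configuration parameters, you might set this parameter at the database level as follows:

-- sketch; assumes ALTER DATABASE ... SET PARAMETER is available in your version
=> ALTER DATABASE DEFAULT SET PARAMETER UseLocalTzForParquetTimestampConversion = 0;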

5.7 - JSON data

Use FJSONPARSER to load data in JSON format.

The schema for JSON data is the set of property names in the property/value pairs. When you load JSON data into a columnar table or materialized columns in a Flex table, the property names in the data must match the column names in the table. You do not need to load all of the columns in the data.

The JSON parser can load data into columns of any scalar type, strongly-typed complex type, or flexible complex type. A flexible complex type means you do not fully specify the schema for that column. You define these columns in the table as LONG VARBINARY, and you can use Flex functions to extract values from them.

In the COPY statement, use the PARSER parameter to specify the JSON parser as in the following example:

=> CREATE EXTERNAL TABLE customers(id INT, address VARCHAR, transactions ARRAY[INT,10])
    AS COPY FROM 'cust.json' PARSER FJSONPARSER();

This parser has several optional parameters, some of which are specific to use with Flex tables and flexible complex types.

Before loading JSON data, consider using a tool such as JSONLint to verify that the data is valid.

If you load JSON data into a Flex table, Vertica loads all data into the __raw__ (VMap) column, including complex types found in the data. You can use Flex functions to extract values.

Strongly-typed complex types

JSON data can contain arrays, structs, and combinations of the two. You can load this data either as flexible (VMap) columns or with strong typing. Strong typing allows you to query values directly, without having to use functions to unpack a VMap column.

Use the ARRAY and ROW types in the table definition as usual:

=> CREATE EXTERNAL TABLE rest
  (name VARCHAR, cuisine VARCHAR,
   location_city ARRAY[VARCHAR(80),50],
   menu ARRAY[ ROW(item VARCHAR(80), price FLOAT), 100 ]
  )
 AS COPY FROM :restdata PARSER FJSONPARSER();

=> SELECT name, location_city, menu FROM rest;
       name        |       location_city        |                                     menu
-------------------+----------------------------+------------------------------------------------------------------------------
 Bob's pizzeria    | ["Cambridge","Pittsburgh"] | [{"item":"cheese pizza","price":8.25},{"item":"spinach pizza","price":10.5}]
 Bakersfield Tacos | ["Pittsburgh"]             | [{"item":"veggie taco","price":9.95},{"item":"steak taco","price":10.95}]
(2 rows)

When loading JSON data into a table with strong typing for complex types, Vertica ignores the parser's flatten_maps and flatten_arrays parameters.

Strong and flexible complex types

An advantage of strong typing is the easier (and more efficient) access in queries. A disadvantage is that additional values found in the data but not included in the column definition are ignored. If the menu struct in this data includes more attributes, such as calories, they are not loaded because the definition of the column only specified item and price. The following example uses flexible complex types to reveal the extra attributes:

=> CREATE EXTERNAL TABLE rest
  (name VARCHAR, cuisine VARCHAR,
   location_city LONG VARBINARY, menu LONG VARBINARY)
 AS COPY FROM :restdata
 PARSER FJSONPARSER(flatten_maps=false);

=> SELECT name, MAPTOSTRING(location_city) as location_city, MAPTOSTRING(menu) AS menu FROM rest;
       name        |       location_city      |                     menu
-------------------+--------------------------+---------------------------------------------------
 Bob's pizzeria    | {
    "0": "Cambridge",
    "1": "Pittsburgh"
} | {
    "0": {
        "calories": "1200",
        "item": "cheese pizza",
        "price": "8.25"
    },
    "1": {
        "calories": "900",
        "item": "spinach pizza",
        "price": "10.50"
    }
}
 Bakersfield Tacos | {
    "0": "Pittsburgh"
}                    | {
    "0": {
        "item": "veggie taco",
        "price": "9.95",
        "vegetarian": "true"
    },
    "1": {
        "item": "steak taco",
        "price": "10.95"
    }
}
(2 rows)

Unmatched fields

The JSON parser loads all fields from the data that are part of the table definition. If the data contains other fields, the parser produces a warning and logs the new fields in a system table. You can review the logged events and decide whether to modify your table definition. For details, see Schema evolution.

Loading from a specific start point

You need not load an entire JSON file. You can use the start_point parameter to load data beginning at a specific key, rather than at the beginning of a file. Data is parsed from after the start_point key until the end of the file, or to the end of the first start_point's value. The parser ignores any subsequent instance of the start_point, even if that key appears multiple times in the input file. If the input data contains only one copy of the start_point key, and that value is a list of JSON elements, the parser loads each element in the list as a row.

If a start_point value occurs more than once in your JSON data, you can use the start_point_occurrence integer parameter to specify the occurrence at which to start parsing.

This example uses the following JSON data, saved to a file named alphanums.json:

 { "A": { "B": { "C": [ { "d": 1, "e": 2, "f": 3 }, { "g": 4, "h": 5, "i": 6 },
{ "j": 7, "k": 8, "l": 9 } ] } } }

Loading this data into a flex table produces the following results:

=> CREATE FLEX TABLE start_json;
CREATE TABLE

=> COPY start_json FROM '/home/dbadmin/data/alphanums.json' PARSER FJSONPARSER();
 Rows Loaded
-------------
           1
(1 row)

=> SELECT maptostring(__raw__) FROM start_json;
                        maptostring
-------------------------------------------------------------------------------
 {
   "A.B.C" : {
      "0.d" : "1",
      "0.e" : "2",
      "0.f" : "3",
      "1.g" : "4",
      "1.h" : "5",
      "1.i" : "6",
      "2.j" : "7",
      "2.k" : "8",
      "2.l" : "9"
   }
}

(1 row)

The following load specifies a start point:

=> TRUNCATE TABLE start_json;
TRUNCATE TABLE

=>  COPY start_json FROM '/home/dbadmin/data/alphanums.json' PARSER FJSONPARSER(start_point='B');
 Rows Loaded
-------------
           1
(1 row)

=> SELECT maptostring(__raw__) FROM start_json;
                         maptostring
--------------------------------------------------------------------------------
 {
   "C" : {
      "0.d" : "1",
      "0.e" : "2",
      "0.f" : "3",
      "1.g" : "4",
      "1.h" : "5",
      "1.i" : "6",
      "2.j" : "7",
      "2.k" : "8",
      "2.l" : "9"
   }
}
(1 row)

Dealing with invalid JSON records

If your JSON data contains syntax errors, your load can fail due to invalid records. You can use the RECORD_TERMINATOR option in the COPY statement to skip these invalid records if your JSON records are consistently delimited by a character like a line break. Setting a record terminator allows the parser to skip over invalid records and continue parsing the rest of the data.

If your records are not consistently marked by a character, you can use the ERROR TOLERANCE option. ERROR TOLERANCE skips entire source files with invalid JSON records, while RECORD_TERMINATOR skips individual malformed JSON records. You can use the two options together.
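
The following sketch combines the two options; the table and path are hypothetical:

-- hypothetical table and path
=> COPY logs FROM '/data/logs/*.json'
   PARSER FJSONPARSER(RECORD_TERMINATOR=E'\n')
   ERROR TOLERANCE;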

The following example uses invalid records:

=> CREATE FLEX TABLE fruits();
CREATE TABLE

=> COPY fruits FROM STDIN PARSER FJSONPARSER(RECORD_TERMINATOR=E'\n');
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> {"name": "orange", "type": "fruit", "color": "orange", "rating": 5 }
>> {"name": "apple", "type": "fruit", "color": "green" }
>> {"name": "blueberry", "type": "fruit", "color": "blue", "rating": 10 }
>> "type": "fruit", "rating": 7 }
>> {"name": "banana", "type" : "fruit", "color": "yellow", "rating": 3 }
>> \.

View the flex table using MAPTOSTRING to confirm that the invalid record was skipped while the rest of the records were successfully loaded:

=> SELECT MAPTOSTRING(__raw__) FROM fruits;
maptostring
--------------------------------------------------------------------------------------------
{
"color" : "orange",
"name" : "orange",
"rating" : "5",
"type" : "fruit"
}
{
"color" : "green",
"name" : "apple",
"type" : "fruit"
}
{
"color" : "blue",
"name" : "blueberry",
"rating" : "10",
"type" : "fruit"
}
{
"color" : "yellow",
"name" : "banana",
"rating" : "3",
"type" : "fruit"
}
(4 rows)

Rejecting data on materialized column type errors

By default, if FJSONPARSER cannot coerce a data value to a type that matches the column definition, it sets the value to NULL. You can choose to instead reject these values using the reject_on_materialized_type_error parameter. If this parameter is true, COPY rejects such rows and reports an error.

If the column is a strongly-typed complex type, as opposed to a flexible complex type, then a type mismatch anywhere in the complex type causes the entire column to be treated as a mismatch. The parser does not partially load complex types; if any ROW field or ARRAY element cannot be coerced, it loads NULL for the column.

The following example attempts to load invalid data. Note that the invalid row is missing from the query results:

=> CREATE TABLE test(one VARCHAR, two INT);
CREATE TABLE

=> COPY test FROM stdin
   PARSER FJSONPARSER(reject_on_materialized_type_error=true);
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> {"one": 1, "two": 2}
>> {"one": "one", "two": "two"}
>> {"one": "one", "two": 2}
>> \.

=> SELECT one, two FROM test;
 one | two
-----+-----
 1   |   2
 one |   2
(2 rows)

Rejecting or omitting empty keys in flex tables

Valid JSON files can include empty key and value pairs. By default, for a Flex table, FJSONPARSER loads them, as in the following example:

=> CREATE FLEX TABLE fruits();
CREATE TABLE

=> COPY fruits FROM STDIN PARSER FJSONPARSER();
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> {"name": "orange", "rating": 5}
>> {"name": "apple", "rating" : 10}
>> {"": "banana", "rating" : 2}
>> \.

=> SELECT MAPTOSTRING(__raw__) FROM fruits;
              maptostring
---------------------------------------
 {
        "name": "orange",
        "rating": "5"
}
 {
        "name": "apple",
        "rating": "10"
}
 {
        "": "banana",
        "rating": "2"
}
(3 rows)

To omit fields with empty keys, use the omit_empty_keys parameter:

=> COPY fruits FROM STDIN PARSER FJSONPARSER(omit_empty_keys=true);
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> {"name": "apple", "rating": 5}
>> {"": "missing", "rating" : 1}
>> {"name": "", "rating" : 3}
>> \.
=> SELECT MAPTOSTRING(__raw__) FROM fruits;
             maptostring
--------------------------------------
 {
        "name": "apple",
        "rating": "5"
}
 {
        "rating": "1"
}
 {
        "name": "",
        "rating": "3"
}
(3 rows)

Note that the second value, with the missing name, still loads with the other (non-empty) field. To instead reject the row entirely, use the reject_on_empty_key parameter:

=> COPY fruits FROM STDIN PARSER FJSONPARSER(reject_on_empty_key=true);
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> {"name" : "apple", "rating" : 5}
>> {"" : "missing", "rating" : 1}
>> {"name" : "", "rating" : 3}
>> \.
=> SELECT MAPTOSTRING(__raw__) FROM fruits;
             maptostring
--------------------------------------
 {
        "name": "apple",
        "rating": "5"
}
 {
        "name": "",
        "rating": "3"
}
(2 rows)

5.8 - Avro data

Use FAVROPARSER to load Avro data files.

Use FAVROPARSER to load Avro data files. This parser supports both columnar and Flex tables.

A column can be of any scalar type, a strongly-typed complex type, or a flexible complex type. A flexible complex type means you do not fully specify the schema for that column. You define these columns in the table as LONG VARBINARY, and you can use Flex functions to extract values from them.
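
For example, the following sketch treats a nested Avro field as a flexible complex type (the table, column names, and file path are hypothetical; the flatten parameters are described under Flexible complex types):

=> CREATE TABLE sensor_events(device VARCHAR, readings LONG VARBINARY);

=> COPY sensor_events FROM '/home/dbadmin/data/events.avro'    -- hypothetical file
   PARSER FAVROPARSER(flatten_maps=false, flatten_arrays=false);

=> SELECT device, MAPTOSTRING(readings) FROM sensor_events;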

The following requirements apply:

  • Avro files must be encoded in the Avro binary serialization format, described in the Apache Avro standard. The parser also supports Snappy and deflate compression.

  • The Avro file must have its related schema in the file being loaded; FAVROPARSER does not support Avro files with separate schema files.

In the COPY statement, use the PARSER parameter to specify the Avro parser as in the following example:

=> COPY weather FROM '/home/dbadmin/data/weather.avro' PARSER FAVROPARSER();

This parser has several optional parameters, some of which are specific to use with Flex tables and flexible complex types.

Avro schemas and columnar tables

Avro includes the schema with the data. When you load Avro data into a columnar table, the column names in the schema in the data must match the column names in the table. You do not need to load all of the columns in the data.

For example, the following Avro schema uses the Avro record type to represent a user profile:

{
  "type": "record",
  "name": "Profile",
  "fields" : [
      {"name": "UserName", "type": "string"},
      {"name": "Email", "type": "string"},
      {"name": "Address", "type": "string"}
   ]
}

To successfully load the data into a columnar table with this schema, each target column name must match the "name" value in the schema. In the following example, the profiles table does not load values corresponding to the schema's Email field because the target column is named EmailAddr:

=> COPY profiles FROM '/home/dbadmin/data/user_profile.avro' PARSER FAVROPARSER();

=> SELECT * FROM profiles;
     UserName    |     EmailAddr      |       Address
-----------------+--------------------+---------------------
    dbadmin      |                    |    123 Main St.

Strongly-typed complex types

Avro data can contain arrays, structs, and combinations of the two. You can read this data either as flexible (VMap) columns or with strong typing. Strong typing allows you to query values directly, without having to use functions to unpack a VMap column.

Use the ARRAY and ROW types in the table definition as usual:

=> CREATE EXTERNAL TABLE rest
  (name VARCHAR, cuisine VARCHAR,
   location_city ARRAY[VARCHAR(80)],
   menu ARRAY[ ROW(item VARCHAR(80), price FLOAT) ]
  )
AS COPY FROM :avro_file PARSER FAVROPARSER();

You can use strong typing in both external tables and native tables.

An alternative to strong typing for complex types is flexible complex types.

Unmatched fields

The Avro parser loads all fields from the data that are part of the table definition. If the data contains other fields, the parser produces a warning and logs the new fields in a system table. You can review the logged events and decide whether to modify your table definition. For details, see Schema evolution.

Rejecting data on materialized column type errors

By default, if FAVROPARSER cannot coerce a data value to a type that matches the column definition, it sets the value to NULL. You can choose to instead reject these values using the reject_on_materialized_type_error parameter. If this parameter is true, COPY rejects such rows and reports an error.

If the column is a strongly-typed complex type, as opposed to a flexible complex type, then a type mismatch anywhere in the complex type causes the entire column to be treated as a mismatch. The parser does not partially load complex types.

If a flex table has a materialized column, the data being loaded must be coercible to that column's type. For example, if a materialized column is declared as a FLOAT and you try to load a VARCHAR value for that key, FAVROPARSER rejects the data row.

5.9 - Matches from regular expressions

You can load flex or columnar tables with the matched results of a regular expression, using the fregexparser.

You can load flex or columnar tables with the matched results of a regular expression, using the fregexparser. This section describes some examples of using the options that the flex parsers support.

Sample regular expression

These examples use the following regular expression, which captures fields including the timestamp, thread_name, and thread_id strings.

This example expression loads any thread_id hex value, regardless of whether it has a 0x prefix: (?<thread_id>(?:0x)?[0-9a-f]+).

'^(?<time>\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d\.\d+)
 (?<thread_name>[A-Za-z ]+):(?<thread_id>(?:0x)?[0-9a-f]+)
-?(?<transaction_id>[0-9a-f])?(?:[(?<component>\w+)]
\<(?<level>\w+)\> )?(?:<(?<elevel>\w+)> @[?(?<enode>\w+)]?: )
?(?<text>.*)'

Using regular expression matches for a flex table

You can load the results from a regular expression into a flex table, using the fregexparser. For a complete example of doing so, see FREGEXPARSER.
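
As a minimal sketch (the table name, pattern, and input line are hypothetical), the following statements load two named capture groups into a flex table and inspect the result:

=> CREATE FLEX TABLE minilog();

=> COPY minilog FROM STDIN PARSER FREGEXPARSER(pattern='(?<level>\w+): (?<text>.*)');
>> WARN: low disk space
>> \.

=> SELECT MAPTOSTRING(__raw__) FROM minilog;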

Using fregexparser for columnar tables

This section illustrates how to load the matched results of a regular expression, applied to a sample Vertica log file, into a columnar table. The next section presents a similar example that uses an external table definition. Using a flex table parser for columnar tables lets you mix data loads in one table. For example, you can load the results of a regular expression in one session, and JSON data in another.

The following basic examples illustrate this usage.

  1. Create a columnar table, vlog, with the following columns:

    => CREATE TABLE vlog (
        "text"             varchar(2322),
        thread_id         varchar(28),
        thread_name     varchar(44),
        "time"             varchar(46),
        component         varchar(30),
        level             varchar(20),
        transaction_id     varchar(32),
        elevel             varchar(20),
        enode             varchar(34)
    );
    
  2. Use COPY to load parts of a log file, using the sample regular expression presented above with the fregexparser. Be sure to remove any line break characters from this expression example before trying it yourself:

    => COPY vlog FROM '/home/dbadmin/data/flex/vertica.log' PARSER
    FRegexParser(pattern=
    '^(?<time>\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d\.\d+)
     (?<thread_name>[A-Za-z ]+):(?<thread_id>(?:0x)?[0-9a-f]+)
    -?(?<transaction_id>[0-9a-f])?(?:[(?<component>\w+)]
    \<(?<level>\w+)\> )?(?:<(?<elevel>\w+)> @[?(?<enode>\w+)]?: )
    ?(?<text>.*)'
    
    ) rejected data as table fregex_reject;
  3. Query the time column:

    => SELECT "time" FROM vlog limit 10;
              time
    -------------------------
     2014-04-02 04:02:02.613
     2014-04-02 04:02:02.613
     2014-04-02 04:02:02.614
     2014-04-02 04:02:51.008
     2014-04-02 04:02:51.010
     2014-04-02 04:02:51.012
     2014-04-02 04:02:51.012
     2014-04-02 04:02:51.013
     2014-04-02 04:02:51.014
     2014-04-02 04:02:51.017
    (10 rows)
    

Using external tables with fregexparser

If you create an external columnar table for your Vertica log file, querying the table returns up-to-date log information. The following basic example illustrates this usage.

  1. Create an external table, vertica_log, using the AS COPY clause and fregexparser to load matched results from the regular expression. For illustrative purposes, this regular expression is split across multiple lines. Remove the line breaks before testing with this expression:

    => CREATE EXTERNAL TABLE public.vertica_log
    (
        "text" varchar(2322),
        thread_id varchar(28),
        thread_name varchar(44),
        "time" varchar(46),
        component varchar(30),
        level varchar(20),
        transaction_id varchar(32),
        elevel varchar(20),
        enode varchar(34)
    )
    AS COPY
    FROM '/home/dbadmin/data/vertica.log'
    PARSER FRegexParser(pattern=
    '^(?<time>\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d\.\d+)
     (?<thread_name>[A-Za-z ]+):(?<thread_id>(?:0x)?[0-9a-f]+)
    -?(?<transaction_id>[0-9a-f])?(?:[(?<component>\w+)]
    \<(?<level>\w+)\> )?(?:<(?<elevel>\w+)> @[?(?<enode>\w+)]?: )
    ?(?<text>.*)'
    
    );
  2. Query from the external table to get updated results:

    => SELECT component, thread_id, time FROM vertica_log limit 10;
     component | thread_id  |          time
    -----------+------------+-------------------------
     Init      | 0x16321430 | 2014-04-02 04:02:02.613
     Init      | 0x16321430 | 2014-04-02 04:02:02.613
     Init      | 0x16321430 | 2014-04-02 04:02:02.613
     Init      | 0x16321430 | 2014-04-02 04:02:02.613
     Init      | 0x16321430 | 2014-04-02 04:02:02.613
     Init      | 0x16321430 | 2014-04-02 04:02:02.613
     Init      | 0x16321430 | 2014-04-02 04:02:02.613
               | 0x16321430 | 2014-04-02 04:02:02.614
               | 0x16321430 | 2014-04-02 04:02:02.614
               | 0x16321430 | 2014-04-02 04:02:02.614
    (10 rows)
    

5.10 - Common event format (CEF) data

Use the flex parser fcefparser to load OpenText ArcSight or other Common Event Format (CEF) log file data into columnar and flexible tables.

Use the flex parser fcefparser to load OpenText ArcSight or other Common Event Format (CEF) log file data into columnar and flexible tables.

When you use the parser to load arbitrary CEF-format files, it interprets key names in the data as virtual columns in your flex table. After loading, you can query your CEF data directly, regardless of which set of keys exists in each row. You can also use the associated flex table data and map functions to manage CEF data access.

Create a flex table and load CEF data

This section uses a sample set of CEF data. All IP addresses have been purposely changed to be inaccurate, and line breaks have been added for illustration.

To use this sample data, copy the following text and remove all line breaks. Save the file as CEF_sample.cef, which is the name used throughout these examples.

CEF:0|ArcSight|ArcSight|6.0.3.6664.0|agent:030|Agent [test] type [testalertng] started|Low|
eventId=1 mrt=1396328238973 categorySignificance=/Normal categoryBehavior=/Execute/Start
categoryDeviceGroup=/Application catdt=Security Mangement categoryOutcome=/Success
categoryObject=/Host/Application/Service art=1396328241038 cat=/Agent/Started
deviceSeverity=Warning rt=1396328238937 fileType=Agent
cs2=<Resource ID\="3DxKlG0UBABCAA0cXXAZIwA\=\="/> c6a4=fe80:0:0:0:495d:cc3c:db1a:de71
cs2Label=Configuration Resource c6a4Label=Agent
IPv6 Address ahost=SKEELES10 agt=888.99.100.1 agentZoneURI=/All Zones/ArcSight
System/Private Address Space
Zones/RFC1918: 888.99.0.0-888.200.255.255 av=6.0.3.6664.0 atz=Australia/Sydney
aid=3DxKlG0UBABCAA0cXXAZIwA\=\= at=testalertng dvchost=SKEELES10 dvc=888.99.100.1
deviceZoneURI=/All Zones/ArcSight System/Private Address Space Zones/RFC1918:
888.99.0.0-888.200.255.255 dtz=Australia/Sydney _cefVer=0.1
  1. Create a flex table logs:

    => CREATE FLEX TABLE logs();
    CREATE TABLE
    
  2. Load the sample CEF file, using the flex parser fcefparser:

    => COPY logs FROM '/home/dbadmin/data/CEF_sample.cef' PARSER fcefparser();
     Rows Loaded
    -------------
               1
    (1 row)
    
  3. Use the maptostring() function to see the contents of the logs flex table:

    => SELECT maptostring(__raw__) FROM logs;
                                      maptostring
    -------------------------------------------------------------------------------------
      {
       "_cefver" : "0.1",
       "agentzoneuri" : "/All Zones/ArcSight System/Private Address
        Space Zones/RFC1918: 888.99.0.0-888.200.255.255",
       "agt" : "888.99.100.1",
       "ahost" : "SKEELES10",
       "aid" : "3DxKlG0UBABCAA0cXXAZIwA==",
       "art" : "1396328241038",
       "at" : "testalertng",
       "atz" : "Australia/Sydney",
       "av" : "6.0.3.6664.0",
       "c6a4" : "fe80:0:0:0:495d:cc3c:db1a:de71",
       "c6a4label" : "Agent IPv6 Address",
       "cat" : "/Agent/Started",
       "catdt" : "Security Mangement",
       "categorybehavior" : "/Execute/Start",
       "categorydevicegroup" : "/Application",
       "categoryobject" : "/Host/Application/Service",
       "categoryoutcome" : "/Success",
       "categorysignificance" : "/Normal",
       "cs2" : "<Resource ID=\"3DxKlG0UBABCAA0cXXAZIwA==\"/>",
       "cs2label" : "Configuration Resource",
       "deviceproduct" : "ArcSight",
       "deviceseverity" : "Warning",
       "devicevendor" : "ArcSight",
       "deviceversion" : "6.0.3.6664.0",
       "devicezoneuri" : "/All Zones/ArcSight System/Private Address Space
        Zones/RFC1918: 888.99.0.0-888.200.255.255",
       "dtz" : "Australia/Sydney",
       "dvc" : "888.99.100.1",
       "dvchost" : "SKEELES10",
       "eventid" : "1",
       "filetype" : "Agent",
       "mrt" : "1396328238973",
       "name" : "Agent [test] type [testalertng] started",
       "rt" : "1396328238937",
       "severity" : "Low",
       "signatureid" : "agent:030",
       "version" : "0"
    }
    
    (1 row)
    

Create a columnar table and load CEF data

This example lets you compare the flex table for CEF data with a columnar table. You do so by creating a new table and loading the same CEF_sample.cef file used in the preceding flex table example.

  1. Create a columnar table, col_logs, defining the prefix names that are hard coded in fcefparser:

    => CREATE TABLE col_logs(version INT,
      devicevendor VARCHAR,
      deviceproduct VARCHAR,
      deviceversion VARCHAR,
      signatureid VARCHAR,
      name VARCHAR,
      severity VARCHAR);
    CREATE TABLE
    
  2. Load the sample file into col_logs, as you did for the flex table:

    => COPY col_logs FROM '/home/dbadmin/data/CEF_sample.cef' PARSER fcefparser();
     Rows Loaded
    -------------
               1
    (1 row)
    
  3. Query the table. The results contain the same information shown in the flex table output.

    => \x
    Expanded display is on.
    VMart=> SELECT * FROM col_logs;
    -[ RECORD 1 ]-+----------------------------------------
    version       | 0
    devicevendor  | ArcSight
    deviceproduct | ArcSight
    deviceversion | 6.0.3.6664.0
    signatureid   | agent:030
    name          | Agent [test] type [testalertng] started
    severity      | Low
    

Compute keys and build a flex table view

In this example, you use a flex helper function to compute keys and build a view for the logs flex table.

  1. Use the compute_flextable_keys_and_build_view function to compute keys and populate a view generated from the logs flex table:

    => SELECT compute_flextable_keys_and_build_view('logs');
                                compute_flextable_keys_and_build_view
    -------------------------------------------------------------------------------------
     Please see public.logs_keys for updated keys
    The view public.logs_view is ready for querying
    (1 row)
    
  2. Query the logs_keys table to see what the function computed from the sample CEF data:

    => SELECT * FROM logs_keys;
           key_name       | frequency | data_type_guess
    ----------------------+-----------+-----------------
     c6a4                 |         1 | varchar(60)
     c6a4label            |         1 | varchar(36)
     categoryobject       |         1 | varchar(50)
     categoryoutcome      |         1 | varchar(20)
     categorysignificance |         1 | varchar(20)
     cs2                  |         1 | varchar(84)
     cs2label             |         1 | varchar(44)
     deviceproduct        |         1 | varchar(20)
     deviceversion        |         1 | varchar(24)
     devicezoneuri        |         1 | varchar(180)
     dvchost              |         1 | varchar(20)
     version              |         1 | varchar(20)
     ahost                |         1 | varchar(20)
     art                  |         1 | varchar(26)
     at                   |         1 | varchar(22)
     cat                  |         1 | varchar(28)
     catdt                |         1 | varchar(36)
     devicevendor         |         1 | varchar(20)
     dtz                  |         1 | varchar(32)
     dvc                  |         1 | varchar(24)
     filetype             |         1 | varchar(20)
     mrt                  |         1 | varchar(26)
     _cefver              |         1 | varchar(20)
     agentzoneuri         |         1 | varchar(180)
     agt                  |         1 | varchar(24)
     aid                  |         1 | varchar(50)
     atz                  |         1 | varchar(32)
     av                   |         1 | varchar(24)
     categorybehavior     |         1 | varchar(28)
     categorydevicegroup  |         1 | varchar(24)
     deviceseverity       |         1 | varchar(20)
     eventid              |         1 | varchar(20)
     name                 |         1 | varchar(78)
     rt                   |         1 | varchar(26)
     severity             |         1 | varchar(20)
     signatureid          |         1 | varchar(20)
    (36 rows)
    
  3. Query several columns from the logs_view:

    => \x
    Expanded display is on.
    VMart=> select version, devicevendor, deviceversion, name, severity, signatureid
      from logs_view;
    -[ RECORD 1 ]-+----------------------------------------
    version       | 0
    devicevendor  | ArcSight
    deviceversion | 6.0.3.6664.0
    name          | Agent [test] type [testalertng] started
    severity      | Low
    signatureid   | agent:030
    

Use the fcefparser delimiter parameter

In this example, you use the fcefparser delimiter parameter to query events located in California, New Mexico, and Arizona.

  1. Create a new columnar table, CEFData3:

    => CREATE TABLE CEFData3(eventId INT, location VARCHAR(20));
    CREATE TABLE
    
  2. Using the delimiter=',' parameter, load some CEF data into the table:

    => COPY CEFData3 FROM stdin PARSER fcefparser(delimiter=',');
    Enter data to be copied followed by a newline.
    End with a backslash and a period on a line by itself.
    >> eventId=1,location=California
    >> eventId=2,location=New Mexico
    >> eventId=3,location=Arizona
    >> \.
    
  3. Query the table:

    
    => SELECT eventId, location FROM CEFData3;
     eventId |  location
    ---------+------------
           1 | California
           2 | New Mexico
           3 | Arizona
    (3 rows)
    

6 - Complex types

Tables can have columns of complex types, including nested complex types.

Tables can have columns of complex types, including nested complex types. You can use the ROW (struct), ARRAY, and SET types in native and external tables, including flex tables. Sets are limited to one-dimensional collections of scalar types. A limited MAP type is available for external tables, but you can use ARRAY and ROW to express a map instead. Selected parsers support loading data with complex types.

You can define a column for heterogeneous combinations of the ARRAY and ROW types: a struct containing array fields or an array of structs. These types can be nested up to the maximum nesting depth of 100.

Restrictions for native tables

Complex types used in native tables have some restrictions, in addition to the restrictions for individual types listed on their reference pages:

  • A native table must have at least one column that is a primitive type or a native array (one-dimensional array of a primitive type). If a flex table has real columns, it must also have at least one column satisfying this restriction.

  • Complex type columns cannot be used in ORDER BY or PARTITION BY clauses nor as FILLER columns.

  • Complex type columns cannot have constraints.

  • Complex type columns cannot use DEFAULT or SET USING.

  • Expressions returning complex types cannot be used as projection columns, and projections cannot be segmented or ordered by columns of complex types.

See CREATE TABLE and ALTER TABLE for additional restrictions.

Deriving a table definition from the data

You can use the INFER_TABLE_DDL function to inspect Parquet, ORC, JSON, or Avro data and produce a starting point for a table definition. This function returns a CREATE TABLE statement, which might require further editing. For columns where the function could not infer the data type, the function labels the type as unknown and emits a warning. For VARCHAR and VARBINARY columns, you might need to adjust the length. Always review the statement the function returns. Especially for tables with many columns, using this function can save time and effort.

Parquet, ORC, and Avro files include schema information, but JSON files do not. For JSON, the function inspects the raw data to produce one or more candidate table definitions. In the following example, two input files differ in the structure of the menu column:

=> SELECT INFER_TABLE_DDL ('/data/*.json'
    USING PARAMETERS table_name='restaurants', format='json',
max_files=3, max_candidates=3);
WARNING 0:  This generated statement contains one or more float types which might lose precision
WARNING 0:  This generated statement contains one or more varchar/varbinary types which default to length 80

                INFER_TABLE_DDL
------------------------------------------------------------------------
 Candidate matched 1 out of 2 total files:
  create table "restaurants"(
    "cuisine" varchar,
    "location_city" Array[varchar],
    "menu" Array[Row(
      "item" varchar,
      "price" float
    )],
    "name" varchar
  );
Candidate matched 1 out of 2 total files:
  create table "restaurants"(
    "cuisine" varchar,
    "location_city" Array[varchar],
    "menu" Array[Row(
      "items" Array[Row(
        "item" varchar,
        "price" numeric
      )],
      "time" varchar
    )],
    "name" varchar
  );

(1 row)

Alternative to strong typing

Though you can fully specify a column representing any combination of ROW and ARRAY types, there might be cases where you prefer a more flexible approach. If the data contains a struct with hundreds of fields, only a few of which you need, you might prefer to extract just those few at query time instead of defining all of the fields. Similarly, if the data structure is likely to change, you might prefer to defer fully specifying the complex types. You can use flexible columns as an alternative to fully specifying the structure of a complex column. This is the same approach used for flex tables, where all data is initially loaded into a single binary column and materialized from there as needed. See Flexible complex types for more information about using this approach.

6.1 - Structs

Columns can contain structs, which store property-value pairs.

Columns can contain structs, which store property-value pairs. For example, a struct representing an address could have strings for the street address and city/state and an integer for the postal code:

{ "street":"150 Cambridgepark Dr.",
  "city":"Cambridge MA",
  "postalcode":02140}

Struct fields can be primitive types or other complex types.

Use the ROW expression to define a struct column. In the following example, the data has columns for customer name, address, and account number, and the address is a struct in the data. The types you declare in Vertica must be compatible with the types in the data you load into them.

=> CREATE TABLE customers (
    name VARCHAR,
    address ROW(street VARCHAR, city VARCHAR, zipcode INT),
    accountID INT);

Within the ROW, you specify the fields and their data types using the same syntax as for columns. Vertica treats the ROW as a single column for purposes of queries.

Structs can contain other structs. In the following example, employees have various personal information, including an address which is itself a struct.

=> CREATE TABLE employees(
    employeeID INT,
    personal ROW(
      name VARCHAR,
      address ROW(street VARCHAR, city VARCHAR, zipcode INT),
      taxID INT),
    department VARCHAR);

Structs can contain arrays of primitive types, of other arrays, or of structs.

=> CREATE TABLE customers(
  name VARCHAR,
  contact ROW(
    street VARCHAR,
    city VARCHAR,
    zipcode INT,
    email ARRAY[VARCHAR]
  ),
  accountid INT );

When defining an external table with Parquet or ORC data, Vertica requires the definition of the table to match the schema of the data. For example, with the data used in the previous employees example, the following definition is an error:

=> CREATE EXTERNAL TABLE employees(
    employeeID INT,
    personal ROW(
      name VARCHAR,
      address ROW(street VARCHAR, city VARCHAR),
      zipcode INT,
      taxID INT),
    department VARCHAR)
  AS COPY FROM '...' PARQUET;
ERROR 9151: Datatype mismatch [...]

The data contains an address struct with three fields (street, city, zipcode), so the external table must also use a ROW with three fields. Changing the ROW to have two fields and promoting one of the fields to the parent ROW is a mismatch. Each ROW must match and, if structs are nested in the data, the complete structure must match.

For native tables, you can specify which columns to load from the data, so you do not need to account for all of them. For the columns you load, the definition of the table must match the schema in the data file. Some parsers report fields found in the data that are not part of the table definition.

Handling nulls

If a struct exists but a field value is null, Vertica assigns NULL as its value in the ROW. A struct where all fields are null is treated as a ROW with null fields. If the struct itself is null, Vertica reads the ROW as NULL.

Queries

See Rows (structs).

Restrictions

ROW columns have several restrictions:

  • Maximum nesting depth is 100.
  • Vertica tables support up to 9800 columns and fields. The ROW itself is not counted, only its fields.
  • ROW columns cannot use any constraints (such as NOT NULL) or defaults.
  • ROW fields cannot be auto_increment or setof.
  • ROW definition must include at least one field.
  • Row is a reserved keyword within a ROW definition, but is permitted as the name of a table or column.
  • Tables containing ROW columns cannot also contain IDENTITY, default, SET USING, or named sequence columns.

6.2 - Arrays

Columns can contain arrays, which store ordered lists of elements of the same type.

Columns can contain arrays, which store ordered lists of elements of the same type. For example, an address column could use an array of strings to store multiple addresses that an individual might have, such as ['668 SW New Lane', '518 Main Ave', '7040 Campfire Dr'].

There are two types of arrays:

  • Native array: a one-dimensional array of a primitive type.

  • Non-native array: all other supported arrays, including arrays that contain other arrays (multi-dimensional arrays) or structs (ROWs). Non-native arrays have some usage restrictions.

Use the ARRAY type to define an array column, specifying the type of its elements (a primitive type, a ROW (struct), or an array):

=> CREATE TABLE orders
   (orderkey    INT,
    custkey     INT,
    prodkey     ARRAY[VARCHAR(10)],
    orderprices ARRAY[DECIMAL(12,2)],
    orderdate   DATE
   );

If an array is multi-dimensional, represent it as an array containing an array:

ARRAY[ARRAY[FLOAT]]
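
For example, a table with a two-dimensional array column might be defined as follows (the table and column names are hypothetical):

=> CREATE TABLE grids(
    id    INT,                    -- primitive column; native tables require at least one
    cells ARRAY[ARRAY[FLOAT]]     -- two-dimensional (non-native) array
   );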

Queries

See Arrays and sets (collections).

Restrictions

  • Native arrays support only data of primitive types, for example, int, UUID, and so on.

  • Array dimensionality is enforced. A column cannot contain arrays of varying dimensions. For example, a column that contains a three-dimensional array can only contain other three-dimensional arrays; it cannot simultaneously include a one-dimensional array. However, the arrays in a column can vary in size, where one array can contain four elements while another contains ten.

  • Array bounds, if specified, are enforced for all operations that load or alter data. Unbounded arrays may have as many elements as will fit in the allotted binary size.

  • An array has a maximum binary size. If this size is not set when the array is defined, a default value is used.

  • Arrays do not support LONG types (like LONG VARBINARY or LONG VARCHAR) or user-defined types (like Geometry).

6.3 - Flexible complex types

When defining tables, you can use strongly-typed complex types to fully describe any combination of structs and arrays.

When defining tables, you can use strongly-typed complex types to fully describe any combination of structs and arrays. However, there are times when you might prefer not to:

  • If the data contains a struct with a very large number of fields, and in your queries you will need only a few of them, you can avoid having to enumerate the rest in the table DDL. Further, a deeply-nested set of structs could exceed the nesting limit for the table if fully specified.

  • If the data schema is still evolving, you can delay finalizing strongly-typed DDL.

  • If you anticipate the introduction of new fields in the data, you can use flexible types to discover them. A table with strong typing, on the other hand, would silently ignore those values. For an example of using flexible types to discover new fields, see Strong and Flexible Typing.

Flexible types are a way to store complex or unstructured data as a binary blob in one column, in a way that allows access to individual elements of that data. This is the same approach that Vertica uses with flex tables, which support loading unstructured or semi-structured data. In a flex table, all data from a source is loaded into a single VMap column named __raw__. From this column you can materialize other columns, such as a specific field in JSON data, or use special lookup functions in queries to read values directly out of the __raw__ column.

Vertica uses a similar approach with complex types. You can describe the types fully using the ROW and ARRAY types in your table definition, or you can instead treat a complex type as a flexible type and not fully describe it. Each complex type that you choose to treat this way becomes its own flex-style column. You are not limited to a single column containing all data as in flex tables; instead, you can treat any complex type column, no matter how deeply it nests other types, as one flex-like column.

Defining flexible columns

To use a flexible complex type, declare the column as LONG VARBINARY. You might also need to set other parameters in the parser, as described in the parser documentation.

Consider a Parquet file with a restaurants table and the following columns:

  • name: varchar

  • cuisine type: varchar

  • location (cities): array[varchar]

  • menu: array of structs, each struct having an item name and a price

This data contains two complex columns, location (an array) and menu (an array of structs). The following example defines both columns as flexible columns by using LONG VARBINARY:

=> CREATE EXTERNAL TABLE restaurants(name VARCHAR, cuisine VARCHAR,
        location_city LONG VARBINARY, menu LONG VARBINARY)
    AS COPY FROM '/data/rest*.parquet'
    PARQUET(allow_long_varbinary_match_complex_type='True');

The allow_long_varbinary_match_complex_type parameter is specific to the Parquet parser. It is required if you define any column as a flexible type. Without this parameter, Vertica tries to match the LONG VARBINARY declaration in the table to a VARBINARY column in the Parquet file, finds a complex type instead, and reports a data-type mismatch.

You need not treat all complex columns as flexible types. The following definition is also valid:

=> CREATE EXTERNAL TABLE restaurants(
            name VARCHAR, cuisine VARCHAR,
            location_city ARRAY[VARCHAR,50],
            menu LONG VARBINARY)
    AS COPY FROM '/data/rest*.parquet'
    PARQUET(allow_long_varbinary_match_complex_type='True');

For many common data formats, you can use the INFER_TABLE_DDL function to derive a table definition from a data file. This function uses strong typing for complex types in almost all cases.

Querying flexible columns

Flexible columns are stored as LONG VARBINARY, so selecting them directly produces unhelpful results. Instead, use the flex mapping functions to extract values from these columns. The MAPTOSTRING function translates the complex type to JSON, as shown in the following example:

=> SELECT name, location_city, MAPTOSTRING(menu) AS menu FROM restaurants;
       name        |       location_city        |                    menu
-------------------+----------------------------+----------------------------------------
 Bob's pizzeria    | ["Cambridge","Pittsburgh"] | {
    "0": {
        "item": "cheese pizza",
        "price": "$8.25"
    },
    "1": {
        "item": "spinach pizza",
        "price": "$10.50"
    }
}
 Bakersfield Tacos | ["Pittsburgh"]             | {
    "0": {
        "item": "veggie taco",
        "price": "$9.95"
    },
    "1": {
        "item": "steak taco",
        "price": "$10.95"
    }
}
(2 rows)

The menu column is an array of structs. Notice that the output is a set of key/value pairs, with the key being the array index. Bob's Pizzeria has two items on its menu, and each value is a struct. The first item ("0") is a struct with an "item" value of "cheese pizza" and a "price" of "$8.25".

You can use keys to access specific values. The following example selects the first menu item from each restaurant. Note that all keys are strings, even array indexes:

=> SELECT name, location_city, menu['0']['item'] AS item, menu['0']['price'] AS price FROM restaurants;
       name        |       location_city        |     item     | price
-------------------+----------------------------+--------------+-------
 Bob's pizzeria    | ["Cambridge","Pittsburgh"] | cheese pizza | $8.25
 Bakersfield Tacos | ["Pittsburgh"]             | veggie taco  | $9.95
(2 rows)

Instead of accessing specific indexes, you can use the MAPITEMS function in a subquery to explode a flexible type, as in the following example:

=>  SELECT name, location_city, menu_items['item'], menu_items['price']
    FROM (SELECT mapitems(menu, name, location_city) OVER(PARTITION BEST)
         AS (indexes, menu_items, name, location_city)
    FROM restaurants) explode_menu;
       name        |       location_city        |  menu_items   | menu_items
-------------------+----------------------------+---------------+------------
 Bob's pizzeria    | ["Cambridge","Pittsburgh"] | cheese pizza  | $8.25
 Bob's pizzeria    | ["Cambridge","Pittsburgh"] | spinach pizza | $10.50
 Bakersfield Tacos | ["Pittsburgh"]             | veggie taco   | $9.95
 Bakersfield Tacos | ["Pittsburgh"]             | steak taco    | $10.95
(4 rows)

For a complete list of flex mapping functions, see Flex data functions.

JSON and Avro flexible types

The parsers for JSON and Avro support both flexible and strong types for complex types. When using flexible complex types or loading into a flex table, use the flatten_maps and flatten_arrays parameters to control how the parser handles complex data. These parsers ignore these parameters for strongly-typed complex types.

The following example demonstrates the use of flexible complex types. Consider a JSON file containing the following data:

{
    "name" : "Bob's pizzeria",
    "cuisine" : "Italian",
    "location_city" : ["Cambridge", "Pittsburgh"],
    "menu" : [{"item" : "cheese pizza", "price" : "$8.25"},
              {"item" : "spinach pizza", "price" : "$10.50"}]
}
{
    "name" : "Bakersfield Tacos",
    "cuisine" : "Mexican",
    "location_city" : ["Pittsburgh"],
    "menu" : [{"item" : "veggie taco", "price" : "$9.95"},
              {"item" : "steak taco", "price" : "$10.95"}]
}

Create a table, using LONG VARBINARY for the flexible complex types, and load data specifying these parameters:

=> CREATE TABLE restaurant(name VARCHAR, cuisine VARCHAR,
        location_city LONG VARBINARY, menu LONG VARBINARY);

=> COPY restaurant FROM '/data/restaurant.json'
   PARSER FJSONPARSER(flatten_maps=false, flatten_arrays=false);

You can use Flex functions and direct access (through indexes) to return readable values:

=> SELECT MAPTOSTRING(location_city), MAPTOSTRING(menu) FROM restaurant;
                   maptostring                   |             maptostring
-------------------------------------------------+--------------------------------------------------------
 {
    "0": "Cambridge",
    "1": "Pittsburgh"
} | {
    "0": {
        "item": "cheese pizza",
        "price": "$8.25"
    },
    "1": {
        "item": "spinach pizza",
        "price": "$10.50"
    }
}
 {
    "0": "Pittsburgh"
}                       | {
    "0": {
        "item": "veggie taco",
        "price": "$9.95"
    },
    "1": {
        "item": "steak taco",
        "price": "$10.95"
    }
}
(2 rows)

=> SELECT menu['0']['item'] FROM restaurant;
     menu
--------------
 cheese pizza
 veggie taco
(2 rows)

The COPY statement shown in this example sets flatten_maps to false. Without that change, the keys for the complex columns would not work as expected, because record and array keys would be "flattened" at the top level. Querying menu['0']['item'] would produce no results. Instead, query flattened values as in the following example:

=> SELECT menu['0.item'] FROM restaurant;
     menu
--------------
 veggie taco
 cheese pizza
(2 rows)

Flattening directives apply to the entire COPY statement. You cannot flatten some columns and not others, or prevent flattening values in a complex column that is itself within a flattened flex table. Because flexible complex types and strongly-typed complex types require different values for flattening, you cannot combine strong and flexible complex types in the same load operation.

6.4 - System tables for complex types

Information about all complex types is recorded in the COMPLEX_TYPES system table.

Information about all complex types is recorded in the COMPLEX_TYPES system table. You must have read permission for the external table that uses a type to see its entries in this system table. Complex types are not shown in the TYPES system table.

For ROW types, each row in COMPLEX_TYPES represents one field of one ROW. The field name is the name used in the table definition if present, or a generated name beginning with _field otherwise. Each row also includes the (generated) name of its containing type, a string beginning with _ct_. ("CT" stands for "complex type".)

The following example defines one external table and then shows the types in COMPLEX_TYPES:

=> CREATE EXTERNAL TABLE warehouse(
    name VARCHAR, id_map MAP<INT,VARCHAR>,
    data row(record INT, total FLOAT, description VARCHAR(100)),
    prices ARRAY[INT], comment VARCHAR(200), sales_total FLOAT, storeID INT)
  AS COPY FROM ... PARQUET;

=> SELECT type_id,type_kind,type_name,field_id,field_name,field_type_name,field_position
    FROM COMPLEX_TYPES ORDER BY type_id,field_name;

      type_id      | type_kind |       type_name       | field_id | field_name  | field_type_name | field_position
-------------------+-----------+-----------------------+----------+-------------+-----------------+----------------
 45035996274278280 | Map       | _ct_45035996274278280 |        6 | key         | int             |              0
 45035996274278280 | Map       | _ct_45035996274278280 |        9 | value       | varchar(80)     |              1
 45035996274278282 | Row       | _ct_45035996274278282 |        9 | description | varchar(80)     |              2
 45035996274278282 | Row       | _ct_45035996274278282 |        6 | record      | int             |              0
 45035996274278282 | Row       | _ct_45035996274278282 |        7 | total       | float           |              1
 45035996274278284 | Array     | _ct_45035996274278284 |        6 |             | int             |              0
(6 rows)

This output shows one row for each field of the complex types defined in the table: the key and value of the MAP, the three fields of the ROW, and the element type of the ARRAY. When a ROW contains another ROW, the field_type_name column uses the generated name of the contained ROW, and the same number, without the leading "_ct_", serves as the field_id.

7 - Schema evolution

When you load data, Vertica must be able to match the columns or fields in the data to the columns defined in the table.

When you load data, Vertica must be able to match the columns or fields in the data to the columns defined in the table. Typically, you use strong typing to specify the table columns, and parsers reject data that does not match the table definition. Some parsers cause the entire load to fail if the data does not match, and some load what they can and produce warnings about unhandled data.

For long-lived databases or more volatile domains, new columns and fields can appear in data, disrupting existing ETL pipelines. Some parsers provide ways to handle differences between the data and the table definition.

Parquet

By default, the Parquet parser uses strong schema matching. This means that the table must consume all of the columns in the Parquet data, in the same order in which they appear in the data. Unmatched columns produce load errors:

=> CREATE TABLE orders_summary(orderid INT, accountid INT);

=> COPY orders_summary FROM '/data/orders.parquet' PARQUET;
ERROR 9135:  Attempt to load 2 columns from a parquet source [/data/orders.parquet] that has 3 columns

You can pick out only the specific columns you want using loose schema matching. Use the do_soft_schema_match_by_name parameter in the PARQUET parser. With loose schema matching, columns in the data are matched to columns in the table by their names. Columns in the data that are not part of the table definition are ignored:

=> COPY orders_summary FROM '/data/orders.parquet'
   PARQUET(do_soft_schema_match_by_name='true');
 Rows Loaded
-------------
           3
(1 row)

Alternatively, you can use ALTER TABLE to add new columns to the table definition and continue to use strong schema matching.
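
For example, assuming the third column in the Parquet data is a ship date, the following sketch (column name hypothetical) extends the table and reloads with strong matching:

=> ALTER TABLE orders_summary ADD COLUMN shipdate DATE;

=> COPY orders_summary FROM '/data/orders.parquet' PARQUET;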

JSON and Avro

By default, the JSON and Avro parsers skip fields in the data that are not part of the table definition and proceed with the rest of the load. This behavior is in contrast to the Parquet and ORC parsers, which by default require that a table consume all of the columns in the data. This means that the JSON and Avro parsers are more flexible in the face of variability in the data, but also that you must monitor your loads more closely for new fields.

When creating tables that will use JSON or Avro data, a good first step is to infer a table definition from a sample data file using INFER_TABLE_DDL. You can use the results to define a table. If later data files add more fields, the parser emits a warning and logs information about the new fields in the UDX_EVENTS system table. You can use this information to decide whether to alter the table definition or ignore the new fields.

If you are only interested in certain fields and don't care about new fields found in data loads, you can use the suppress_warnings parameter on the JSON or Avro parser to ignore them. If you suppress warnings, you can separately check for new fields from time to time with INFER_TABLE_DDL or by loading sample data with warnings enabled and without committing (COPY...NO COMMIT).
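
The following sketch (paths hypothetical) suppresses warnings for routine loads and uses an uncommitted trial load, with warnings enabled, to check a sample file for new fields:

=> COPY restaurants FROM '/data/daily/*.json'          -- routine load, warnings suppressed
   PARSER FJSONPARSER(suppress_warnings=true);

=> COPY restaurants FROM '/data/daily/sample.json'     -- trial load, not committed
   PARSER FJSONPARSER() NO COMMIT;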

For a detailed discussion of this workflow, see Schema changes in Data exploration.

8 - Handling Non-UTF-8 input

Vertica supports loading data files in the Unicode UTF-8 format.

Vertica supports loading data files in the Unicode UTF-8 format. You can load ASCII data, which is UTF-8 compatible. Character sets like ISO 8859-1 (Latin1) are incompatible with UTF-8 and are not directly supported.

If you have data that does not meet the UTF-8 standard, you can modify the data during the load or you can transform the data files before loading.

Checking data format

Before loading data from text files, you can use several Linux tools to ensure that your data is in UTF-8 format. The file command reports the encoding of any text files. For example:

$ file Date_Dimension.tbl
Date_Dimension.tbl: ASCII text

The file command could indicate ASCII text even though the file contains multibyte characters.

To check for multibyte characters in an ASCII file, use the wc command. For example:

$ wc Date_Dimension.tbl
  1828   5484 221822 Date_Dimension.tbl

If the wc command returns an error such as Invalid or incomplete multibyte or wide character, the data file is using an incompatible character set.

This example shows two files that are not UTF-8 data files:

$ file data*
data1.txt: Little-endian UTF-16 Unicode text
data2.txt: ISO-8859 text

The results indicate that neither of the files is in UTF-8 format.

Converting data while loading

You can remove or replace non-UTF-8 characters in text data during the load. The MAKEUTF8 function removes such characters by default, or you can specify a replacement string.

The following example shows how to use this function during a load. The original data is loaded into the orig_name column, and the transformed data is loaded into the name column. Typically you would use a FILLER column for the original value instead of adding the column to the table definition; this example adds the column to show the differences side by side.

=> CREATE TABLE people (orig_name VARCHAR, name VARCHAR);
CREATE TABLE

=> COPY people (orig_name, name AS MAKEUTF8(orig_name)) FROM ...;
 Rows Loaded
-------------
           8
(1 row)

=> SELECT * FROM people;
orig_name |  name
----------+--------
  Dáithí    | Dith
  Fíona     | Fona
  Móirín     | Mirn
  Róisín     | Risn
  Séamus    | Samus
  Séan      | San
  Tiarnán  | Tiarnn
  Áine     | ine
(8 rows)
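
If you prefer to mark removed characters rather than drop them silently, you can supply a replacement string. The following sketch assumes the MAKEUTF8 replacement_string parameter and an arbitrary '?' marker; see the MAKEUTF8 reference for the exact parameter syntax:

=> COPY people (orig_name,
                name AS MAKEUTF8(orig_name USING PARAMETERS replacement_string='?'))
   FROM ...;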

For general information about transforming data, see Transforming data during loads.

Converting files before loading data

To convert files before loading them into Vertica, use the iconv UNIX command. For example, to convert the data2.txt file from the previous example, use the iconv command as follows:

$ iconv -f ISO88599 -t utf-8 data2.txt > data2-utf8.txt

See the man pages for file and iconv for more information.

Checking UTF-8 compliance after loading data

After loading data, use the ISUTF8 function to verify that all of the string-based data in the table is in UTF-8 format. For example, if you loaded data into a table named people that has a VARCHAR column named name, you can use this statement to verify that all of the strings are UTF-8 encoded:

=> SELECT name FROM people WHERE NOT ISUTF8(name);

If all of the strings are in UTF-8 format, the query should not return any rows.

9 - Transforming data during loads

To promote a consistent database and reduce the need for scripts to transform data at the source, you can transform data with an expression as part of loading.

To promote a consistent database and reduce the need for scripts to transform data at the source, you can transform data with an expression as part of loading. Transforming data while loading lets you compute values to insert into a target column, either from other columns or from values in the data that you load as FILLER columns (see Deriving Table Columns From Data File Columns). You can transform data to be loaded into columns of scalar types and native arrays, but not other complex types.

For example, you might have text data that is not compatible with UTF-8, the encoding that Vertica expects. You can use the MAKEUTF8 function during load to remove or replace non-UTF-8 characters, as illustrated in Converting Data While Loading. Or you might want to extract fields for day, month, and year from a single input date.

When transforming data during a load, you load the data into a column normally, and then use that column in an expression to populate another column. The COPY statement must always contain at least one parsed column, which can be a FILLER column. You can intersperse parsed and computed columns in a COPY statement.

The following example extracts day, month, and year columns from a single input date column:

=> CREATE TABLE purchases
    (id INT, year VARCHAR(10), month VARCHAR(10), day VARCHAR(10), ts TIMESTAMP);

=> COPY purchases (id, year AS TO_CHAR(ts,'YYYY'),
                   month AS TO_CHAR(ts,'MM'), day AS TO_CHAR(ts, 'DD'),
                   ts FORMAT 'YYYY-MM-DD') FROM STDIN;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> 1943|2021-03-29
>> 1256|2021-03-30
>> 1512|2021-03-31
>> \.

=> SELECT * FROM purchases;
  id  | year | month | day |         ts
------+------+-------+-----+---------------------
 1256 | 2021 | 03    | 30  | 2021-03-30 00:00:00
 1512 | 2021 | 03    | 31  | 2021-03-31 00:00:00
 1943 | 2021 | 03    | 29  | 2021-03-29 00:00:00
(3 rows)

The input data has two columns, id and ts (timestamp). The COPY statement specifies the format of the timestamp column using the FORMAT option. The TO_CHAR function uses that format information to extract the year, month, and day columns.

Using expressions in COPY statements

The expression in a COPY statement can be as simple as a single column, or more complex, such as a case statement for multiple columns. An expression can specify multiple columns, and multiple expressions can refer to the same parsed column. You can use expressions for columns of all supported data types.

COPY expressions can use many SQL functions, operators, constants, NULLs, and comments.
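
For example, a computed column can apply a CASE expression to a parsed column (the table and threshold are illustrative):

=> CREATE TABLE orders2(id INT, amount FLOAT, size VARCHAR(10));

=> COPY orders2 (id, amount,
                 size AS CASE WHEN amount > 100 THEN 'large' ELSE 'small' END)
   FROM STDIN;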

Requirements and restrictions:

  • COPY expressions cannot use SQL meta-functions, analytic functions, aggregate functions, or computed columns.

  • For computed columns, you must list all parsed columns in the COPY statement expression. Do not specify FORMAT or RAW in the source data for a computed column.

  • The return data type of the expression must be coercible to that of the target column. Parsed column parameters are also coerced to match the expression.

Handling expression errors

Errors in expressions within your COPY statement are SQL errors. As such, they are handled differently from parse errors. When a parse error occurs, COPY rejects the row and adds it to the rejected data file or table. COPY also adds the reason for a rejected row to the exceptions file or the rejected data table. For example, COPY parsing does not implicitly cast data types. If a type mismatch occurs between the data being loaded and a column type (such as attempting to load a text value into a FLOAT column), COPY rejects the row and continues processing.

If an error occurs in an expression in your COPY statement, then by default the entire load fails. For example, if your COPY statement uses a function expression, and a syntax error exists in that expression, the entire load is rolled back. All SQL errors, including ones caused by rolling back the COPY, are stored in the Vertica log file. However, unlike parse rejections and exception messages, SQL expression errors are brief and may require further research.

You can have COPY treat errors in transformation expressions like parse errors. Rejected rows are added to the same file or table, and exceptions are added to the same exceptions file or table. To enable this behavior, set the CopyFaultTolerantExpressions configuration parameter to 1. (See General parameters.)
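
For example, one way to enable this behavior database-wide is the following sketch:

=> ALTER DATABASE DEFAULT SET PARAMETER CopyFaultTolerantExpressions = 1;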

Loading data with expression rejections is potentially slower than loading with the same number of parse rejections. Enable expression rejections if your data has a few bad rows, to allow the rest of the data to be loaded. If you are concerned about the time it takes to complete a load with many bad rows, use the REJECTMAX parameter to set a limit. If COPY finds more than REJECTMAX bad rows, it aborts and rolls back the load.
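
For example, the following sketch (table, path, and limit are illustrative) saves rejected rows to a table but rolls back the load if more than 100 rows are rejected:

=> COPY sales FROM '/data/sales.dat'
   REJECTED DATA AS TABLE sales_rejects
   REJECTMAX 100;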

See Handling messy data for more information about managing rejected data.

Deriving table columns from data file columns

When loading data, your source data might contain one or more columns that do not exist in the target table. Or, the source data and the target table might have matching columns, but you want to omit one or more source columns from the target table.

Use the FILLER parameter to identify a column of source data that COPY can ignore or use to compute new values that are loaded into the target table. The following requirements apply:

  • Define the FILLER parameter data type so it is compatible with the source data. For example, be sure to define a VARCHAR with a length that can contain all source data; otherwise, data might be truncated. You can specify multiple filler columns by using the FILLER parameter more than once in the COPY statement.

  • The name of the filler column must not match the name of any column in the target table.

In the following example, the table has columns for first name, last name, and full name, but the data being loaded contains columns for first, middle, and last names. The COPY statement reads all of the source data but only loads the source columns for first and last names. It constructs the data for the full name by concatenating each of the source data columns, including the middle name. The middle name is read as a FILLER column so it can be used in the concatenation, but is ignored otherwise. (There is no table column for middle name.)

=> CREATE TABLE names(first VARCHAR(20), last VARCHAR(20), full VARCHAR(60));
CREATE TABLE
=> COPY names(first,
              middle FILLER VARCHAR(20),
              last,
              full AS first||' '||middle||' '||last)
      FROM STDIN;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> Marc|Gregory|Smith
>> Sue|Lucia|Temp
>> Jon|Pete|Hamilton
>> \.
=> SELECT * from names;
 first |   last   |        full
-------+----------+--------------------
 Jon   | Hamilton | Jon Pete Hamilton
 Marc  | Smith    | Marc Gregory Smith
 Sue   | Temp     | Sue Lucia Temp
(3 rows)

10 - Distributing a load

Vertica can divide the work of loading data among multiple database nodes, taking advantage of parallelism to speed up the operation.

Vertica can divide the work of loading data among multiple database nodes, taking advantage of parallelism to speed up the operation. How this is done depends on where the data is and what types of parallelism the parsers support.

Vertica can be most effective in distributing a load when the data to be loaded is found in shared storage available to all nodes. Sometimes, however, data is available only on specific nodes, which you must specify.

Types of load parallelism

Vertica supports several types of parallelism. All built-in parsers support distributed load, and some parsers support apportioned load and cooperative parse. The reference pages for individual parsers include the kinds of load parallelism they support.

Load parallelism is enabled by default but can be disabled. The configuration parameters named in this section are described in General parameters.

Distributed load

Vertica distributes files in a multi-file load to several nodes to load in parallel, instead of loading all of them on a single node. Vertica automatically distributes a load if the files are accessible to all nodes and you do not restrict participating nodes.

Apportioned load and cooperative parse both require an input that can be divided at record boundaries. The difference is that cooperative parse does a sequential scan to find record boundaries, while apportioned load first jumps (seeks) to a given position and then scans. Some formats, like generic XML, do not support seeking.

Apportioned load

In an apportioned load, Vertica divides a single large file or other single source into segments (portions), which it assigns to several nodes to load in parallel. Apportioned load divides the load at planning time, based on available nodes and cores on each node.

To use apportioned load, you must ensure that the source is reachable by all participating database nodes. You typically use apportioned load with distributed file systems.

Apportioned load is enabled by default for parsers that support it. To disable it, set the EnableApportionLoad configuration parameter to 0.

Cooperative parse

By default, Vertica parses a data source in a single thread on one database node. If a parser supports cooperative parse, the node instead uses multiple threads to parallelize the parse. Cooperative parse divides a load at execution time, based on how threads are scheduled.

Cooperative parse is enabled by default for parsers that support it. To disable it, set the EnableCooperativeParse configuration parameter to 0.
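
For example, to disable both kinds of parallelism at the database level (a sketch; setting the parameters back to 1 restores the default behavior):

=> ALTER DATABASE DEFAULT SET EnableApportionLoad = 0;
=> ALTER DATABASE DEFAULT SET EnableCooperativeParse = 0;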

Loading on specific nodes

You can indicate which node or nodes should parse an input path by using any of the following:

  • Using the ON ANY NODE clause indicates that the source path is available on all of the nodes. If you specify this clause, COPY parses the files from any node in the cluster. If the path contains a glob, COPY expands the glob on the initiator node. ON ANY NODE is the default for all file systems except Linux.

  • Using the ON nodeset clause indicates that the source file is on all named nodes. If you specify this clause, COPY opens the file and parses it from any node in the set. Be sure that the source file you specify is available and accessible on each applicable cluster node.

ON node and ON ANY NODE load each file once, choosing one node to perform the load. If, instead, you have different data on each node and you want to load all of it, and the path is the same, use ON EACH NODE:

=> COPY myTable FROM '/local_export/*.dat' ON EACH NODE;

If the path is not valid on all nodes, COPY loads the valid paths and produces a warning. If the path is a shared location, COPY loads it only once as for ON ANY NODE.

Use ON EACH NODE when you want to load from the same path on every node and the files are different. For example, if you have machine-specific data, such as system event logs, or if an operation wrote data on each node individually, you can use this clause to load all of it.

If the data to be loaded is on a client, use COPY FROM LOCAL instead of specifying nodes. All local files are loaded and parsed serially with each COPY statement, so you cannot perform parallel loads with the LOCAL option.
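
For example, a minimal sketch that loads a single client-side file (hypothetical path):

=> COPY t FROM LOCAL '/home/user/data/sales.dat' DELIMITER '|';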

Specifying distributed file loads

You can direct individual files in a multi-file load to specific nodes:

=> COPY t FROM '/data/file1.dat' ON v_vmart_node0001, '/data/file2.dat' ON v_vmart_node0002;

You can use globbing (wildcard expansion) to specify a group of files with the ON ANY NODE option:

=> COPY t FROM '/data/*.dat' ON ANY NODE;

You can limit the nodes that participate in an apportioned load. Doing so is useful if you need to balance several concurrent loads. Vertica apportions each load individually; it does not account for other loads that might be in progress on those nodes. You can, therefore, potentially speed up your loads by managing apportioning yourself:

=> COPY t FROM '/data/big1.dat' ON (v_vmart_node0001, v_vmart_node0002, v_vmart_node0003),
        '/data/big2.dat' ON (v_vmart_node0004, v_vmart_node0005);

You can specify a compression type for each path. However, because file compression is a filter, you cannot use apportioned load for a compressed file.
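
For example, the following sketch (hypothetical paths) loads one GZIP-compressed file and one uncompressed file; only the uncompressed file is eligible for apportioning:

=> COPY t FROM '/data/big1.dat.gz' GZIP, '/data/big2.dat' UNCOMPRESSED DELIMITER '|';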

Specifying distributed loads with sources

You can also apportion loads using COPY WITH SOURCE. You can create sources and parsers with the user-defined load (UDL) API. If both the source and parser support apportioned load, and EnableApportionLoad is set, then Vertica attempts to divide the load among nodes.

The following example shows a load that you could apportion:

=> COPY t WITH SOURCE MySource() PARSER MyParser();

The built-in delimited parser supports apportioning, so you can use it with a user-defined source:

=> COPY t WITH SOURCE MySource();

Number of load streams

Although the number of files you can load is not restricted, the optimal number of load streams depends on several factors, including:

  • Number of nodes

  • Physical and logical schemas

  • Host processors

  • Memory

  • Disk space

Using too many load streams can deplete or reduce system memory required for optimal query processing. See Best practices for managing workload resources for advice on configuring load streams.

11 - Using transactions to stage a load

By default, COPY automatically commits itself and other current transactions except when loading temporary tables or querying external tables.

By default, COPY automatically commits itself and other current transactions except when loading temporary tables or querying external tables. You can override this behavior by qualifying the COPY statement with the NO COMMIT option. When you specify NO COMMIT, Vertica does not commit the transaction until you explicitly issue a COMMIT statement.

You can use COPY...NO COMMIT in two ways:

  • Execute multiple COPY commands as a single transaction.

  • Check data for constraint violations before committing the load.

Combine multiple COPY statements in the same transaction

When you combine multiple COPY...NO COMMIT statements in the same transaction, Vertica can consolidate the data for all operations into fewer ROS containers, and thereby perform more efficiently.

For example, the following set of COPY...NO COMMIT statements performs several copy statements sequentially, and then commits them all. In this way, all of the copied data is either committed or rolled back as a single transaction.

COPY... NO COMMIT;
COPY... NO COMMIT;
COPY... NO COMMIT;
COPY X FROM LOCAL NO COMMIT;
COMMIT;

Check constraint violations

If constraints are not enforced in the target table, COPY does not check for constraint violations when it loads data. To troubleshoot loaded data for constraint violations, use COPY...NO COMMIT with ANALYZE_CONSTRAINTS. Doing so enables you to detect constraint violations before you commit the load operation and, if necessary, roll back the operation. For details, see Detecting constraint violations.
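
A minimal sketch of this workflow follows; the table name dim1 and the file path are hypothetical, and ANALYZE_CONSTRAINTS reports any violations found in the uncommitted data:

=> COPY dim1 FROM '/data/dim1.dat' DELIMITER '|' NO COMMIT;
=> SELECT ANALYZE_CONSTRAINTS('dim1');
=> COMMIT;   -- or ROLLBACK if violations were reported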

12 - Handling messy data

Loading data with COPY has two main phases, parsing and loading.

Loading data with COPY has two main phases, parsing and loading. During parsing, if COPY encounters errors it rejects the faulty data and continues loading data. Rejected data is created whenever COPY cannot parse a row of data. Following are some parser errors that can cause a rejected row:

  • Unsupported parser options

  • Incorrect data types for the table into which data is being loaded, including incorrect data types for members of collections

  • Malformed context for the parser in use

  • Missing delimiters

Optionally, COPY can reject data and continue loading when transforming data during the load phase. This behavior is controlled by a configuration parameter. By default, COPY aborts a load if it encounters errors during the loading phase.

Several optional parameters let you determine how strictly COPY handles rejections. For example, you can have COPY fail when it rejects a single row, or allow a specific number of rejections before the load fails. This section presents the parameters to determine how COPY handles rejected data.

Save rejected rows (REJECTED DATA and EXCEPTIONS)

The COPY statement automatically saves a copy of each rejected row in a rejections file. COPY also saves a corresponding explanation of what caused the rejection in an exceptions file. By default, Vertica saves both files in a database catalog subdirectory, called CopyErrorLogs, as shown in this example:

v_mart_node003_catalog/CopyErrorLogs/trans-STDIN-copy-from-rejected-data.1
v_mart_node003_catalog/CopyErrorLogs/trans-STDIN-copy-from-exceptions.1

You can optionally save COPY rejections and exceptions in one of two other ways:

  • Use the REJECTED DATA and EXCEPTIONS options to save both outputs to locations of your choice. REJECTED DATA records rejected rows, while EXCEPTIONS records a description of why each row was rejected. If a path value is an existing directory or ends in '/', or the load includes multiple sources, files are written in that directory. (COPY creates the directory if it does not exist.) If a path value is a file, COPY uses it as a file prefix if multiple files are written.

  • Use the REJECTED DATA AS TABLE option. This option writes both the rejected data and the exception descriptions to the same table. For more information, see Saving rejected data to a table.

If you save rejected data to a table, the table files are stored in the data subdirectory. For example, in a VMart database installation, rejected data table records are stored in the RejectionTableData directory as follows:

$ cd v_mart_node003_data/RejectionTableData/
$ ls
TABLE_REJECTED_RECORDS_"bg"_mytest01.example.-25441:0x6361_45035996273805099_1.1
TABLE_REJECTED_RECORDS_"bg"_mytest01.example.-25441:0x6361_45035996273805113_2.2
.
.
.
TABLE_REJECTED_RECORDS_"delimr"_mytest01.example.-5958:0x3d47_45035996273815749_1.1
TABLE_REJECTED_RECORDS_"delimr"_mytest01.example.-5958:0x3d47_45035996273815749_1.2

COPY LOCAL rejected data

For COPY LOCAL operations, if you use REJECTED DATA or EXCEPTIONS with a file path, the files are written on the client. If you want rejections to be available on all nodes, use REJECTED DATA AS TABLE instead of REJECTED DATA.

Enforce truncating or rejecting rows (ENFORCELENGTH)

When parsing CHAR, VARCHAR, BINARY, or VARBINARY data, values can exceed the declared length of the target column. By default, COPY truncates such values without rejecting the rows. Use the ENFORCELENGTH option to instead reject rows whose values exceed the declared column length.

For example, loading 'abc' into a table column specified as VARCHAR(2) results in COPY truncating the value to 'ab' and loading it. Loading the same row with the ENFORCELENGTH option causes COPY to reject the row.
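
A minimal sketch of the difference, using a hypothetical single-column table:

=> CREATE TABLE short_str (s VARCHAR(2));
=> COPY short_str FROM STDIN;                -- 'abc' is truncated to 'ab' and loaded
=> COPY short_str FROM STDIN ENFORCELENGTH;  -- the same row is rejected instead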

Specify a maximum number of rejections (REJECTMAX)

The REJECTMAX option specifies the maximum number of logical records that can be rejected before a load fails. A rejected row consists of the data that could not be parsed (or optionally transformed) into the corresponding data type during a bulk load. Rejected data does not indicate referential constraints. For information about using constraints, and the option of enforcing constraints during bulk loading, see Constraints.

When the number of rejected records becomes equal to the REJECTMAX value, the load fails. If you do not specify a value for REJECTMAX, or if the value is 0, COPY allows an unlimited number of exceptions to occur.
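
For example, the following sketch (hypothetical table and path) permits a limited number of rejections and saves them to a table for review:

=> COPY t FROM '/data/input.dat' DELIMITER '|'
   REJECTED DATA AS TABLE t_rejects REJECTMAX 10;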

If you allow COPY to reject rows and proceed when it encounters transformation errors, consider using REJECTMAX to limit the impact. See Handling Transformation Errors.

Handling transformation errors

By default, COPY aborts a load if it encounters errors when performing transformations. This is the default because rejecting transformation errors is potentially more expensive than rejecting parse errors. Sometimes, however, you would prefer to load the data anyway and reject the problematic rows, as COPY does for parse errors.

To have COPY treat errors in transformation expressions like parse errors, set the CopyFaultTolerantExpressions configuration parameter to 1. (See General parameters.) Rows that are rejected during transformation, in the expression-evaluation phase of a data load, are written to the same destination as rows rejected during parsing. Use REJECTED DATA or REJECTED DATA AS TABLE to specify the output location.

You might want to enable transformation rejections if your data contains a few bad rows. By enabling these rejections, you can load the majority of your data and proceed. Vertica recommends using REJECTMAX when enabling transformation rejections.

If your data contains many bad values, then the performance for loading the good rows could be worse than with parser errors.

Abort data loads for any error (ABORT ON ERROR)

Using the ABORT ON ERROR option is the most restrictive way to load data, because no exceptions or rejections are allowed. A COPY operation stops if any row is rejected. No data is loaded and Vertica rolls back the command.
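
For example (hypothetical table and path):

=> COPY t FROM '/data/input.dat' DELIMITER '|' ABORT ON ERROR;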

If you use ABORT ON ERROR as part of a CREATE EXTERNAL TABLE statement, the option is used whenever a query references the external table. The offending error is saved in the COPY exceptions or rejected data file.

Understanding row rejections and rollback errors

Depending on the type of error that COPY encounters, Vertica does one of the following:

  • Rejects the offending row and loads other rows into a table

  • Rolls back the entire COPY statement without loading any data

COPY cannot parse rows that contain any of the following:

  • Incompatible data types

  • Missing fields

  • Missing delimiters

COPY rolls back a load if it encounters any of these conditions:

  • Server-side errors, such as lack of memory

  • Primary key or foreign key constraint violations

  • Loading NULL data into a NOT NULL column

  • Transformation errors (by default)

The following example illustrates what happens when Vertica cannot coerce a row to the requested data type. In this COPY statement, "a::INT + b::INT" is a SQL expression in which a and b are derived values:

=> CREATE TABLE t (i INT);
=> COPY t (a FILLER VARCHAR, b FILLER VARCHAR, i AS a::INT + b::INT)
   FROM STDIN;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> cat|dog
>> \.

Vertica cannot coerce the row to the requested data type and rolls back the COPY statement:

ERROR 2827:  Could not convert "cat" from column "*FILLER*".a to an int8

If a resolved to 'cat' and b to 'dog', the next expression 'cat'::INT + 'dog'::INT would return an expression evaluator error:

=> SELECT 'cat'::INT + 'dog'::INT;
ERROR 3681:  Invalid input syntax for integer: "cat"

The following COPY statement would also roll back because Vertica cannot parse the row to the requested data type:

=> COPY t (a FILLER VARCHAR, i AS a::INT) FROM STDIN;

In the following COPY statement, Vertica rejects only the offending row without rolling back the statement. Instead of evaluating the 'cat' row as a VARCHAR type, COPY parses 'cat' directly as an INTEGER.

=> COPY t (a FILLER INT, i AS a) FROM STDIN;

In the following example, transformation errors are rejected instead of aborting the load.

=> ALTER DATABASE DEFAULT SET CopyFaultTolerantExpressions = 1;
ALTER DATABASE

=> CREATE TABLE sales (price INTEGER);
CREATE TABLE

=> COPY sales (f FILLER VARCHAR, price AS f::INT)
   FROM STDIN REJECTED DATA AS TABLE sales_rej;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>>1
>>2a
>>3
>>\.

=> SELECT * FROM sales;
 price
-------
     1
     3
(2 rows)

=> SELECT rejected_data, rejected_reason FROM sales_rej;
 rejected_data |                rejected_reason
---------------+-----------------------------------------------
 Tuple (2a)    | ERROR 2827:  Could not convert "2a" from column "*FILLER*".f to an int8
(1 row)

12.1 - Saving load rejections (REJECTED DATA)

COPY load rejections are data rows that did not load due to a parser exception or, optionally, transformation error.

COPY load rejections are data rows that did not load due to a parser exception or, optionally, transformation error. By default, if you do not specify a rejected data file, COPY saves rejected data files to this location:

catalog_dir/CopyErrorLogs/target_table-source-copy-from-rejected-data.n

where:

  • catalog_dir is the database catalog files directory, for example /home/dbadmin/VMart/v_vmart_node0001_catalog.

  • target_table is the table into which data was loaded.

  • source is the source of the load data, which can be STDIN or a file name, such as baseball.csv.

  • copy-from-rejected-data.n is the default name for a rejected data file, followed by an n suffix indicating the file number, such as .1, .2, or .3. For example, this default file name indicates file 3 after loading from STDIN: fw-STDIN-copy-from-rejected-data.3

Saving rejected data to the default location, or to a location of your choice, lets you review the file contents, resolve problems, and reload the data from the rejected data files. Saving rejected data to a table lets you query the table to see rejected data rows and the reasons (exceptions) why the rows could not be parsed. Vertica recommends saving rejected data to a table.

Multiple rejected data files

Unless a load is very small (< 10MB), COPY creates more than one file to hold rejected rows. Several factors determine how many files COPY creates for rejected data. Here are some of the factors:

  • Number of sources being loaded

  • Total number of rejected rows

  • Size of the source file (or files)

  • Cooperative parsing and number of threads being used

  • UDLs that support apportioned loads

  • For your own user-defined source, the number of objects returned from prepareUDSources()

Naming conventions for rejected files

You can specify one or more rejected data files with the files you are loading. Use the REJECTED DATA parameter to specify a file location and name, and separate consecutive rejected data file names with a comma (,). Do not use the ON ANY NODE option because it is applicable only to load files.
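
For example, the following sketch (hypothetical paths) names a separate rejected data file for each of two load files:

=> COPY t FROM '/data/file1.dat', '/data/file2.dat'
   DELIMITER '|'
   REJECTED DATA '/home/dbadmin/rejects/file1.rejects', '/home/dbadmin/rejects/file2.rejects';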

If you specify one or more files, and COPY requires multiple files for rejected data, COPY uses the rejected data file names you supply as a prefix and appends a numeric suffix to each rejected data file. For example, if you specify the name my_rejects for the REJECTED DATA parameter, and the file you are loading is large enough (> 10MB), several files such as the following will exist:

  • my_rejects-1

  • my_rejects-2

  • my_rejects-3

COPY uses cooperative parsing by default, having the nodes parse a specific part of the file contents. Depending on the file or portion size, each thread generates at least one rejected data file per source file or portion, and returns load results to the initiator node. The file suffix is a thread index when COPY uses multiple threads (.1, .2, .3, and so on).

The maximum number of rejected data files cannot be greater than the number of sources being loaded multiplied by the number of threads available to parse any portion. The resource pool determines the maximum number of threads. For cooperative parse, COPY uses all available threads.

If you use COPY with a UDL that supports apportioned load, the file suffix is an offset value. UDLs that support apportioned loading make cooperative parsing unnecessary. For apportioned loads, COPY creates at least one rejected file per data portion, and more files depending on the size of the load and the number of rejected rows.

For all data loads except COPY LOCAL, COPY behaves as follows:

If you do not specify a rejected data file:

  • For a single data file (pathToData or STDIN), COPY stores one or more rejected data files in the default location.

  • For multiple source files, COPY stores all rejected data in separate files in the default directory, using the source file name as a prefix, as noted.

  • Rejected data files are returned to the initiator node.

If you specify a rejected data file:

  • For a single data file, COPY interprets the rejected data path as a file and stores all rejected data at that location. If parallel processing requires more than one file, COPY appends a numeric suffix. If the path is not a file, COPY returns an error.

  • For multiple source files, COPY interprets the rejected data path as a directory and stores all information in separate files, one for each source. If the path is not a directory, COPY returns an error.

  • COPY accepts only one path per node. For example, if you specify the rejected data path as my_rejected_data, COPY creates a directory of that name on each node. If you provide more than one rejected data path, COPY returns an error.

  • Rejected data files are not shipped to the initiator node.

Maximum length of file names

Loading multiple input files in one statement requires specifying full path names for each file. Keep in mind that long input file names, combined with rejected data file names, can exceed the operating system's maximum length (typically 255 characters). To work around file names that exceed the maximum length, use a path for the rejected data file that differs from the default path; for example, /tmp/<shorter-file-name>.

12.2 - Saving rejected data to a table

Use the REJECTED DATA parameter with the AS TABLE clause to specify a table in which to save rejected data.

Use the REJECTED DATA parameter with the AS TABLE clause to specify a table in which to save rejected data. Saving rejected data to a file is mutually exclusive with using the AS TABLE clause.

When you use the AS TABLE clause, Vertica creates a new table if one does not exist, or appends to an existing table. If no parsing rejections occur during a load, the table exists but is empty. The next time you load data, Vertica inserts any rejected rows to the existing table.

The load rejection tables are a special type of table with the following capabilities and limitations:

  • Support SELECT statements

  • Can use DROP TABLE

  • Cannot be created outside of a COPY statement

  • Do not support DML and DDL activities

  • Are not K-safe

To make the data in a rejected table K-safe, you can do one of the following:

  • Write a CREATE TABLE..AS statement, such as this example:

    => CREATE TABLE new_table AS SELECT * FROM rejected_table;
    
  • Create a table to store rejected records, and run INSERT..SELECT operations into the new table

Using COPY NO COMMIT

If the COPY statement includes options NO COMMIT and REJECTED DATA AS TABLE, and the reject-table does not already exist, Vertica Analytic Database saves the rejected data table as a LOCAL TEMP table and returns a message that a LOCAL TEMP table is being created.

Rejected-data tables are useful for Extract-Load-Transform workflows, where you will likely use temporary tables more frequently. The rejected-data tables let you quickly load data and identify which records failed to load. If you load data into a temporary table that you created using the ON COMMIT DELETE clause, the COPY operation will not commit.

Location of rejected data table records

When you save rejected records to a table, using the REJECTED DATA AS TABLE table_name option, the data for the table is saved in a database data subdirectory, RejectionTableData. For example, for a VMart database, table data files reside here:

/home/dbadmin/VMart/v_vmart_node0001_data/RejectionTableData

Rejected data tables include both rejected data and the reason for the rejection (exceptions), along with other data columns, described next. Vertica suggests that you periodically drop any rejected data tables that you no longer require.

Querying a rejected data table

When you specify a rejected data table when loading data with COPY, you can query that table for information about rejected data after the load operation is complete. For example:

  1. Create the loader table:

    => CREATE TABLE loader(x INT);
    CREATE TABLE
    
  2. Use COPY to load values, saving rejected data to a table, loader_rejects:

    => COPY loader FROM STDIN REJECTED DATA AS TABLE loader_rejects;
    Enter data to be copied followed by a newline.
    End with a backslash and a period on a line by itself.
    >> 1
    >> 2
    >> 3
    >> a
    >> \.
    
  3. Query the loader table after loading data:

    => SELECT * FROM loader;
     x
    ---
     1
     2
     3
    (3 rows)
    
  4. Query the loader_rejects table to see the rejected rows and the reasons for rejection:

    => SELECT * FROM loader_rejects;
    -[ RECORD 1 ]-------------+--------------------------------------------
    node_name                 | v_vmart_node0001
    file_name                 | STDIN
    session_id                | v_vmart_node0001.example.-24016:0x3439
    transaction_id            | 45035996274080923
    statement_id              | 1
    batch_number              | 0
    row_number                | 4
    rejected_data             | a
    rejected_data_orig_length | 1
    rejected_reason           | Invalid integer format 'a' for column 1 (x)
    

The rejected data table has the following columns:

  • node_name (VARCHAR): The name of the Vertica node on which the input load file was located.

  • file_name (VARCHAR): The name of the file being loaded, which applies if you loaded a file (as opposed to using STDIN).

  • session_id (VARCHAR): The session ID number in which the COPY statement occurred.

  • transaction_id (INTEGER): Identifier for the transaction within the session, if any; otherwise NULL.

  • statement_id (INTEGER): The unique identification number of the statement within the transaction that included the rejected data.

  • batch_number (INTEGER): INTERNAL USE. Represents which batch (chunk) the data comes from.

  • row_number (INTEGER): The rejected row number from the input file, or -1 if it could not be determined. The value can be -1 when using cooperative parse. Each parse operation resets the row number, so in an apportioned load, there can be several entries with the same row number but different rows.

  • rejected_data (LONG VARCHAR): The data that was not loaded.

  • rejected_data_orig_length (INTEGER): The length of the rejected data.

  • rejected_reason (VARCHAR): The error that caused the rejected row. This column returns the same message that exists in a load exceptions file when you do not save to a table.

Exporting the rejected records table

You can export the contents of the column rejected_data to a file to capture only the data rejected during the first COPY statement. Then, correct the data in the file, save it, and load the updated file.

To export rejected records:

  1. Create a sample table:

    => CREATE TABLE t (i int);
    CREATE TABLE
    
  2. Copy data directly into the table, using a table to store rejected data:

    => COPY t FROM STDIN REJECTED DATA AS TABLE t_rejects;
    Enter data to be copied followed by a newline.
    End with a backslash and a period on a line by itself.
    >> 1
    >> 2
    >> 3
    >> 4
    >> a
    >> b
    >> c
    >> \.
    
  3. Show only tuples and set the output format:

    => \t
    Showing only tuples.
    => \a
    Output format is unaligned.
    
  4. Output to a file:

    => \o rejected.txt
    => select rejected_data from t_rejects;
    => \o
    
  5. Use the cat command on the saved file:

    
    => \! cat rejected.txt
    a
    b
    c
    

After a file exists, you can fix load errors and use the corrected file as load input to the COPY statement.

12.3 - Saving load exceptions (EXCEPTIONS)

COPY exceptions consist of informational messages describing why a row of data could not be parsed.

COPY exceptions consist of informational messages describing why a row of data could not be parsed. The EXCEPTIONS option lets you specify a file to which COPY writes exceptions. If you omit this option, COPY saves exception files to the following path: catalog-dir/CopyErrorLogs/tablename-sourcefilename-copy-from-exceptions, where:

  • catalog-dir is the directory holding the database catalog files

  • tablename is the name of the table being loaded into

  • sourcefilename is the name of the file being loaded

The file produced by the EXCEPTIONS option indicates the line number and the reason for each exception.

If copying from STDIN, the source file name is STDIN.

You can specify rejected data and exceptions files for individual files in a data load. Separate rejected data and exception file names with commas in the COPY statement.

You must specify a filename in the path to load multiple input files. Keep in mind that long table names combined with long data file names can exceed the operating system's maximum length (typically 255 characters). To work around file names exceeding the maximum length, use a path for the exceptions file that differs from the default path; for example, /tmp/<shorter-file-name>.

If you specify an EXCEPTIONS path:

  • For one data file, the path must be a file, and COPY stores all information in this file.

  • For multiple data files, the path must be a directory. COPY creates one file in this directory for each data file.

  • Exceptions files are not stored on the initiator node.

  • You can specify only one path per node.

If you do not specify the EXCEPTIONS path, COPY stores exception files in the default directory.

12.4 - COPY rejected data and exception files

When executing a COPY statement, and parallel processing is ON (the default setting), COPY creates separate threads to process load files.

When executing a COPY statement, and parallel processing is ON (the default setting), COPY creates separate threads to process load files. Typically, the number of threads depends on the number of node cores in the system. Each node processes part of the load data. If the load succeeds overall, any parser rejections that occur during load processing are written to that node's specific rejected data and exceptions files. If the load fails, the rejected data file contents can be incomplete, or empty. If you do not specify a file name explicitly, COPY uses a default name and location for rejected data files. See the next topic for specifying your own rejected data and exception files.

Both rejected data and exceptions files are saved and stored on a per-node basis. This example uses multiple files as COPY inputs. Since the statement does not include either the REJECTED DATA or EXCEPTIONS parameters, rejected data and exceptions files are written to the default location, the database catalog subdirectory, CopyErrorLogs, on each node:

\set dir `pwd`/data/
\set remote_dir /vertica/test_dev/tmp_ms/
\set file1 '''':dir'C1_large_tbl.dat'''
\set file2 '''':dir'C2_large_tbl.dat'''
\set file3 '''':remote_dir'C3_large_tbl.dat'''
\set file4 '''':remote_dir'C4_large_tbl.dat'''
=> COPY large_tbl FROM :file1 ON site01,
               :file2 ON site01,
               :file3 ON site02,
               :file4 ON site02
               DELIMITER '|';

Specifying rejected data and exceptions files

The optional 'path' element of the COPY REJECTED DATA and EXCEPTIONS parameters lets you specify a non-default path in which to store the files.

If path resolves to a storage location, and the user invoking COPY is not a superuser, the following permissions are required:

  • The storage location must have been created with the USER usage type (see CREATE LOCATION).

  • The user must have been granted access to the storage location where the files are written (see GRANT (storage location)).

Both parameters also have an optional ON nodename clause that uses the specified path:

...[ EXCEPTIONS 'path' [ ON nodename ] [, ...] ]...[ REJECTED DATA 'path' [ ON nodename ] [, ...] ]

While 'path' specifies the location of the rejected data and exceptions files (with their corresponding parameters), the optional ON nodename clause moves any existing rejected data and exception files on the node to the specified path on the same node.

Saving rejected data and exceptions files to a single server

The COPY statement does not have a facility to merge exception and rejected data files after COPY processing is complete. To see the contents of exception and rejected data files, you must access each node's files.

Using VSQL variables for rejected data and exceptions files

This example uses vsql variables to specify the path and file names to use with the exceptions and rejected data parameters (except_s1 and reject_s1). The COPY statement specifies a single input file (large_tbl) on the initiator node:

\set dir `pwd`/data/
\set file1 '''':dir'C1_large_tbl.dat'''
\set except_s1 '''':dir'exceptions'''
\set reject_s1 '''':dir'rejections'''

COPY large_tbl FROM :file1 ON site01 DELIMITER '|'
REJECTED DATA :reject_s1 ON site01
EXCEPTIONS :except_s1 ON site01;

This example uses variables to specify exception and rejected data files (except_s2 and reject_s2) on a remote node. The COPY statement consists of a single input file on a remote node (site02):

\set remote_dir /vertica/test_dev/tmp_ms/
\set except_s2 '''':remote_dir'exceptions'''
\set reject_s2 '''':remote_dir'rejections'''

COPY large_tbl FROM :file1 ON site02 DELIMITER '|'
REJECTED DATA :reject_s2 ON site02
EXCEPTIONS :except_s2 ON site02;

This example uses variables to specify that the exception and rejected data files are on a remote node (indicated by :remote_dir). The inputs to the COPY statement consist of multiple data files on two nodes (site01 and site02). The exceptions and rejected data options use the ON nodename clause with the variables to indicate where the files reside (site01 and site02):

\set dir `pwd`/data/
\set remote_dir /vertica/test_dev/tmp_ms/
\set except_s1 '''':dir''''
\set reject_s1 '''':dir''''
\set except_s2 '''':remote_dir''''
\set reject_s2 '''':remote_dir''''
COPY large_tbl FROM :file1 ON site01,
               :file2 ON site01,
               :file3 ON site02,
               :file4 ON site02
               DELIMITER '|'
               REJECTED DATA :reject_s1 ON site01, :reject_s2 ON site02
               EXCEPTIONS :except_s1 ON site01, :except_s2 ON site02;

12.5 - COPY LOCAL rejection and exception files

Invoking COPY LOCAL (or COPY LOCAL FROM STDIN) does not automatically create rejected data and exceptions files.

Invoking COPY LOCAL (or COPY LOCAL FROM STDIN) does not automatically create rejected data and exceptions files. This behavior differs from using COPY, which saves both files automatically, regardless of whether you use the optional REJECTED DATA and EXCEPTIONS parameters to specify either file explicitly.

Use the REJECTED DATA and EXCEPTIONS parameters with COPY LOCAL and COPY LOCAL FROM STDIN to save the corresponding output files on the client. If you do not use these options, rejected data parsing events (and the exceptions that describe them) are not retained, even if they occur.

You can load multiple input files using COPY LOCAL (or COPY LOCAL FROM STDIN). If you also use the REJECTED DATA and EXCEPTIONS options, the statement writes rejected rows and exceptions to separate files. The files contain all rejected rows and their corresponding exceptions, respectively, regardless of how many input files were loaded.

If COPY LOCAL does not reject any rows, it does not create either file.

Specifying rejected data and exceptions files

To save any rejected data and their exceptions to files:

  1. In the COPY LOCAL (and COPY LOCAL FROM STDIN) statement, use the REJECTED DATA 'path' and the EXCEPTIONS 'path' parameters, respectively.

  2. Specify two different file names for the two options. You cannot use one file for both the REJECTED DATA and the EXCEPTIONS.

  3. When you invoke COPY LOCAL or COPY LOCAL FROM STDIN, the files you specify need not pre-exist. If they do, COPY LOCAL must be able to overwrite them.

You can specify the path and file names with vsql variables:

\set rejected ../except_reject/copyLocal.rejected
\set exceptions ../except_reject/copyLocal.exceptions

When you use the COPY LOCAL or COPY LOCAL FROM STDIN statement, specify the variable names for the files with their corresponding parameters:

=> COPY large_tbl FROM LOCAL 'local_file' REJECTED DATA :rejected EXCEPTIONS :exceptions;
=> COPY large_tbl FROM LOCAL STDIN REJECTED DATA :rejected EXCEPTIONS :exceptions;

13 - Monitoring COPY loads and metrics

You can check COPY loads using Vertica functions and system tables.

You can check COPY loads using:

  • Vertica functions

  • LOAD_STREAMS system table

  • LOAD_SOURCES system table

More generally, the EXECUTION_ENGINE_PROFILES system table records information about query events, including loads.

Using Vertica functions

Two meta-functions return COPY metrics for the number of accepted or rejected rows from the most recent COPY statement run in the current session:

  1. To get the number of accepted rows, use the GET_NUM_ACCEPTED_ROWS function:

    => select get_num_accepted_rows();
     get_num_accepted_rows
    -----------------------
                        11
    (1 row)
    
  2. To check the number of rejected rows, use the GET_NUM_REJECTED_ROWS function:

    => select get_num_rejected_rows();
     get_num_rejected_rows
    -----------------------
                        0
    (1 row)
    

Using the CURRENT_LOAD_SOURCE function

You can include the CURRENT_LOAD_SOURCE function as a part of the COPY statement. Doing so allows you to insert the input file name, or a value computed by this function, into a column.

To insert the file names into a column from multiple source files:

=> COPY t (c1, c2, c3 as CURRENT_LOAD_SOURCE()) FROM '/home/load_file_1' ON  exampledb_node02,
'/home/load_file_2' ON exampledb_node03 DELIMITER ',';

Using the LOAD_STREAMS system table

Vertica includes a set of system tables that include monitoring information. The LOAD_STREAMS system table includes information about load stream metrics from COPY and COPY FROM VERTICA statements. Thus, you can query table values to get COPY metrics.

Vertica maintains system table metrics until they reach a designated size quota (in kilobytes). This quota is set through internal processes, which you cannot set or view directly.

Labeling copy streams

COPY can include the STREAM NAME parameter to label its load stream so it is easy to identify in the LOAD_STREAMS system table. For example:

=> COPY mytable FROM 'myfile' DELIMITER '|' STREAM NAME 'My stream name';
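
You can then query the LOAD_STREAMS system table for that stream, for example:

=> SELECT stream_name, accepted_row_count, rejected_row_count
   FROM load_streams
   WHERE stream_name = 'My stream name';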

Load stream metrics

The following LOAD_STREAMS columns show the status of a load as it progresses:

  • ACCEPTED_ROW_COUNT: Increases during parsing, up to the maximum number of rows in the input file.

  • PARSE_COMPLETE_PERCENT: Remains zero (0) until all named pipes return an EOF. While COPY awaits an EOF from multiple pipes, it can appear to be hung. However, before canceling the COPY statement, check your system CPU and disk accesses to determine if any activity is in progress. In a typical load, the PARSE_COMPLETE_PERCENT value can either increase slowly or jump quickly to 100%, if you are loading from named pipes or STDIN.

  • SORT_COMPLETE_PERCENT: Remains at 0 when loading from named pipes or STDIN. After PARSE_COMPLETE_PERCENT reaches 100 percent, SORT_COMPLETE_PERCENT increases to 100 percent. Depending on the data sizes, a significant lag can occur between the time PARSE_COMPLETE_PERCENT reaches 100 percent and the time SORT_COMPLETE_PERCENT begins to increase.

This example queries load stream data from the LOAD_STREAMS system table:

=> \pset expanded
Expanded display is on.
=> SELECT stream_name, table_name, load_start, accepted_row_count,
   rejected_row_count, read_bytes, input_file_size_bytes, parse_complete_percent,
   unsorted_row_count, sorted_row_count, sort_complete_percent FROM load_streams;
-[ RECORD 1 ]----------+---------------------------
stream_name            | fact-13
table_name             | fact
load_start             | 2010-12-28 15:07:41.132053
accepted_row_count     | 900
rejected_row_count     | 100
read_bytes             | 11975
input_file_size_bytes  | 0
parse_complete_percent | 0
unsorted_row_count     | 3600
sorted_row_count       | 3600
sort_complete_percent  | 100

Using the LOAD_SOURCES system table

The LOAD_STREAMS table shows the total number of rows that were loaded or rejected. Grouping this information by source can help you determine from where data is coming. The LOAD_SOURCES system table includes some of the same data as LOAD_STREAMS does but adds this source-specific information. If apportioning is enabled, LOAD_SOURCES also provides information about how loads are apportioned among nodes.

You can use this table to identify causes of differing query results. For example, you can use the following statement to create an external table based on globs:

=> CREATE EXTERNAL TABLE tt AS COPY WITH SOURCE AWS(dir = 'foo', file = '*');

If you select from this table, Vertica loads data from every file in the foo directory and creates one row in the LOAD_SOURCES table for each file. Suppose you later repeat the query and see different results. You could look at the LOAD_SOURCES table and discover that—between the two queries—somebody added another file to the foo directory. Because each file is recorded in LOAD_SOURCES, you can see the new file that explains the changed query results.

If you are using many data sources, you might prefer to disable this reporting. To disable reporting, set the LoadSourceStatisticsLimit configuration parameter to 0. This parameter sets the upper bound on the number of sources profiled by LOAD_SOURCES per load. The default value is 256.
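
For example, to disable source profiling at the database level:

=> ALTER DATABASE DEFAULT SET LoadSourceStatisticsLimit = 0;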

14 - Using load scripts

You can write and run a load script for the COPY statement using a simple text-delimited file format.

You can write and run a load script for the COPY statement using a simple text-delimited file format. For information about other load formats see Data formats. Vertica recommends that you load the smaller tables before the largest tables. To check data formats before loading, see Handling Non-UTF-8 input.

Using absolute paths in a load script

Unless you are using the COPY FROM LOCAL statement, using COPY on a remote client requires an absolute path for a data file. You cannot use relative paths on a remote client. For a load script, you can use vsql variables to specify the locations of data files relative to your Linux working directory.

To use vsql variables to specify data file locations:

  1. Create a vsql variable containing your Linux current directory.

    \set t_pwd `pwd`
    
  2. Create another vsql variable that uses a path relative to the Linux current directory variable for a specific data file.

    \set input_file '\'':t_pwd'/Date_Dimension.tbl\''
    
  3. Use the second variable in the COPY statement:

    => COPY Date_Dimension FROM :input_file DELIMITER '|';
    
  4. Repeat steps 2 and 3 to load all data files.
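
Putting these steps together, a complete load script might look like the following sketch (the table and file names are illustrative):

\set t_pwd `pwd`
\set date_file '\'':t_pwd'/Date_Dimension.tbl\''
\set prod_file '\'':t_pwd'/Product_Dimension.tbl\''

COPY Date_Dimension FROM :date_file DELIMITER '|';
COPY Product_Dimension FROM :prod_file DELIMITER '|';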

Running a load script

You can run a load script on any host, as long as the data files are on that host.

  1. Change your Linux working directory to the location of the data files.

    $ cd /opt/vertica/doc/retail_example_database
    
  2. Run the Administration Tools.

    $ /opt/vertica/bin/admintools
    
  3. Connect to the database.

  4. Run the load script.

15 - Troubleshooting data loads

You might encounter the following issues when loading data.

You might encounter the following issues when loading data. For issues specific to external tables, see Troubleshooting external tables.

Cannot load data because all columns are complex types

A native table must contain at least one column that is either a scalar type or a native array (one-dimensional array of scalar types). If the data you wish to load consists entirely of complex types, you must add a scalar column to the table before loading the data. You do not need to populate the scalar column.

First, create a native table with the columns in the data plus one scalar column, and then load the data into that table as shown in the following example:

-- Native table with extra scalar column:
=> CREATE TABLE customers(person ROW(name VARCHAR, card VARCHAR), tempval int);
CREATE TABLE

-- Load complex data, ignoring scalar column:
=> COPY customers(person) FROM :custdata PARQUET;
 Rows Loaded
-------------
           3
(1 row)

Load fails because of disk quota

COPY can fail if the volume of data being loaded causes a table or schema to exceed a disk quota:

ERROR 0: Disk Quota for the Table Level Exceeded
HINT: Delete data and PURGE or increase disk quota at the table level

By default, Vertica tables and schemas do not have a disk quota; they are limited only by the capacity of your hardware and license. However, a superuser can set quotas, as described in Disk quotas.

If you encounter a failure due to quota, you cannot simply use DELETE to reduce consumption. DELETE does not free space, because deleted data is still preserved in the storage containers. To reclaim space you must purge deleted data; see Removing table data.
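
For example, a sketch (hypothetical table and predicate) that deletes older rows, commits, and then purges them to reclaim space:

=> DELETE FROM big_table WHERE load_date < '2023-01-01';
=> COMMIT;
=> SELECT PURGE_TABLE('big_table');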

Reads from Parquet files report unexpected data-type mismatches

If a Parquet file contains a column of type STRING but the column in Vertica is of a different type, such as INTEGER, you might see an unclear error message. In this case Vertica reports the column in the Parquet file as BYTE_ARRAY, as shown in the following example:


ERROR 7247:  Datatype mismatch: column 2 in the parquet_cpp source
[/tmp/nation.0.parquet] has type BYTE_ARRAY, expected int

This behavior is specific to Parquet files; with an ORC file the type is correctly reported as STRING. The problem occurs because Parquet does not natively support the STRING type and uses BYTE_ARRAY for strings instead. Because the Parquet file reports its type as BYTE_ARRAY, Vertica has no way to determine if the type is actually a BYTE_ARRAY or a STRING.

Error 7087: wrong number of columns

When loading ORC or Parquet data, you might see an error stating that you have the wrong number of columns:

=> CREATE TABLE nation (nationkey bigint, name varchar(500),
            regionkey bigint, comment varchar(500));
   CREATE TABLE

=> COPY nation from :orc_dir ORC;
   ERROR 7087: Attempt to load 4 columns from an orc source
   [/tmp/orc_glob/test.orc] that has 9 columns

When you load data from ORC or Parquet files, your table must consume all of the data in the file, or this error results. To avoid this problem, add the missing columns to your table definition.

For Parquet data, time zones in timestamp values are not correct

Reading timestamps from a Parquet file in Vertica might result in different values, based on the local time zone. This issue occurs because the Parquet format does not support the SQL TIMESTAMP data type. If you define the column in your table with the TIMESTAMP data type, Vertica interprets timestamps read from Parquet files as values in the local time zone. This same behavior occurs in Hive. When this situation occurs, Vertica produces a warning at query time such as the following:

WARNING 0:  SQL TIMESTAMPTZ is more appropriate for Parquet TIMESTAMP
because values are stored in UTC

When creating the table in Vertica, you can avoid this issue by using the TIMESTAMPTZ data type instead of TIMESTAMP.

Time zones can also be incorrect in ORC data, but the reason is different.

For ORC data, time zones in timestamp values are not correct

Vertica and Hive both use the Apache ORC library to interact with ORC data. The behavior of this library changed with Hive version 1.2.0, so timestamp representation depends on what version was used to write the data.

When writing timestamps, the ORC library now records the time zone in the stripe footer. Vertica looks for this value and applies it when loading timestamps. If the file was written with an older version of the library, the time zone is missing from the file.

If the file does not contain a time zone, Vertica uses the local time zone and logs an ORC_FILE_INFO event in the QUERY_EVENTS system table.

The first time you query a new ORC data source, you should query this table to look for missing time zone information:

=> SELECT event_category, event_type, event_description, operator_name, event_details, COUNT(event_type)
   AS COUNT FROM QUERY_EVENTS WHERE event_type ILIKE 'ORC_FILE_INFO'
   GROUP BY event_category, event_type, event_description, operator_name, event_details
   ORDER BY event_details;
 event_category | event_type    | event_description                                   | operator_name | event_details                                                                                    | count
----------------+---------------+-----------------------------------------------------+---------------+--------------------------------------------------------------------------------------------------+-------
 EXECUTION      | ORC_FILE_INFO | ORC file does not have writer timezone information  | OrcParser     | Timestamp values in the ORC source [data/sales_stats.orc] will be computed using local timezone  | 2
(1 row)

Time zones can also be incorrect in Parquet data, but the reason is different.

Some date and timestamp values are wrong by several days

When Hive writes ORC or Parquet files, it converts dates before 1583 from the Gregorian calendar to the Julian calendar. Vertica does not perform this conversion. If your file contains dates before this time, values in Hive and the corresponding values in Vertica can differ by up to ten days. This difference applies to both DATE and TIMESTAMP values.

16 - Working with external data

An alternative to importing data into Vertica is to query it in place.

An alternative to importing data into Vertica is to query it in place. Querying external data instead of importing it can be advantageous in some cases:

  • If you want to explore data, such as in a data lake, before selecting data to load into Vertica.

  • If you are one of several consumers sharing the same data, for example in a data lake, then reading it in place eliminates concerns about whether query results are up to date. There's only one copy, so all consumers see the same data.

  • If your data changes rapidly but you do not want to stream it into Vertica, you can instead query the latest updates automatically.

  • If you have a very large volume of data and do not want to increase your license capacity.

  • If you have lower-priority data in Vertica that you still want to be able to query.

To query external data, you must describe your data as an external table. Like native tables, external tables have table definitions and can be queried. Unlike native tables, external tables have no catalog and Vertica loads selected data from the external source as needed. For some formats, the query planner can take advantage of partitions and sorting in the data, so querying an external table does not mean you load all of the data at query time. (For more information about native tables, see Working with native tables.)

There is one special type of external data not covered in this section. If you are reading data from Hadoop, and specifically from a Hive data warehouse, then instead of defining your own external tables you can read the schema information from Hive. For more information, see Using the HCatalog Connector.

16.1 - How external tables differ from native tables

You can use external tables in the same ways you use Vertica native tables.

You can use external tables in the same ways you use Vertica native tables. Because the data is external to the database, however, there are some differences in how external tables operate.

Data

The data for an external table can reside anywhere, so long as all database nodes can access it. S3, HDFS, and NFS mount points are common places to find external data. Naturally, querying external data can incur some latency compared to querying locally-stored ROS data, but Vertica has optimizations that can reduce the impact. For example, Vertica can take advantage of node and rack locality for HDFS data.

Because the data is external, Vertica loads external data each time you query it. Vertica is optimized to reduce the volume of read data, including predicate pushdown and partition pruning for formats that support partitioning. The ORC and Parquet formats support these optimizations.

Because the data is read at query time, you must ensure that your users have and retain permission to read the data in its original location. Depending on where the data is stored, you might need to take additional steps to manage access, such as creating AWS IAM roles on S3.

Because the data is not stored in Vertica, external tables do not use superprojections and buddy projections.

Resource consumption

External tables add very little to the Vertica catalog, which reduces the resources that queries consume. Because the data is not stored in Vertica, external tables are not affected by the Tuple Mover and do not cause ROS pushback. Vertica uses a small amount of memory when reading external table data, because the table contents are not part of your database and are parsed each time the external table is used.

Backup and restore

Because the data in external tables is managed outside of Vertica, only the external table definitions, not the data files, are included in database backups. Arrange for a separate backup process for your external table data.

DML support

External tables allow you to read external data. They do not allow you to modify it. Some DML operations are therefore not available for external tables, including:

  • DELETE FROM

  • INSERT INTO

  • SELECT...FOR UPDATE

Sequences and IDENTITY columns

The COPY statement definition for external tables can include IDENTITY columns and sequences. Whenever a SELECT statement queries the external table, sequences and IDENTITY columns are re-evaluated. This results in changing the external table column values, even if the underlying external table data remains the same.

16.2 - Creating external tables

To create an external table, combine a table definition with a copy statement using the CREATE EXTERNAL TABLE AS COPY statement.

To create an external table, combine a table definition with a copy statement using the CREATE EXTERNAL TABLE AS COPY statement. CREATE EXTERNAL TABLE AS COPY uses a subset of parameters from CREATE TABLE and COPY.

You define your table columns as you would for a Vertica native table using CREATE TABLE. You also specify a COPY FROM clause to describe how to read the data, as you would for loading data. How you specify the FROM path depends on where the file is located and the data format. See Specifying where to load data from and Data formats.

As with native tables, you can use the INFER_TABLE_DDL function to derive column definitions from data files in supported formats.

The following example defines an external table for delimited data stored in HDFS:

=> CREATE EXTERNAL TABLE sales (itemID INT, date DATE, price FLOAT)
    AS COPY FROM 'hdfs:///data/ext1.csv' DELIMITER ',';

The following example uses data in the ORC format that is stored in S3. The data has two partition columns. For more information about partitions, see Partitioned file paths.

=> CREATE EXTERNAL TABLE records (id int, name varchar(50), created date, region varchar(50))
   AS COPY FROM 's3://datalake/sales/*/*/*'
   PARTITION COLUMNS created, region;

The following example shows how you can read from all Parquet files in a local directory, with no partitions and no globs:

=> CREATE EXTERNAL TABLE sales (itemID INT, date DATE, price FLOAT)
    AS COPY FROM '/data/sales/*.parquet' PARQUET;

When you create an external table, data is not added to the database and no projections are created. Instead, Vertica performs a syntactic check of the CREATE EXTERNAL TABLE AS COPY statement and stores the table name and COPY statement definition in the catalog. Each time a SELECT query references an external table, Vertica parses and executes the stored COPY statement to obtain the referenced data. Any problems in the table definition, such as incorrect column types, can be discovered only by querying the table.

Successfully returning data from an external table requires that the COPY definition be correct, and that other dependencies, such as files, nodes, and other resources are accessible and available at query time. If the table definition uses globs (wildcards), and files are added or deleted, the data in the external table can change between queries.

ORC and Parquet formats

When defining an external table for ORC or Parquet data, you must define all of the data columns in the file. If you omit data columns, queries using the table abort with an error.

If you load from multiple ORC or Parquet files in the same COPY statement, and any of them is aborted, the entire load aborts. This behavior differs from that for delimited files, where the COPY statement loads what it can and ignores the rest.

Iceberg format

You can create an external table for data stored by Apache Iceberg using CREATE EXTERNAL TABLE ICEBERG. An Iceberg table consists of data files and metadata describing the schema. Unlike other external tables, an Iceberg external table need not specify column definitions (DDL). The information is read from Iceberg metadata at query time. For certain data types you can adjust column definitions when creating the table, for example to specify VARCHAR sizes.

Special considerations for external tables

If the maximum length of a column is smaller than the actual data, such as a VARCHAR that is too short, Vertica truncates the data and logs the event.

You can see unexpected query results if constraints on columns cause values to be rejected:

  • If you specify a NOT NULL column constraint and the data contains null values, those rows are rejected.

  • If you use ENFORCELENGTH, values that are too long are rejected rather than being truncated.

  • When reading ORC data, if you declare a scalar precision and some data does not fit, that row is rejected. For example, if you specify a column as Decimal(6,5), a value of 123.456 is rejected.

One way to know if column constraints have caused data to be rejected is if COUNT on a column returns a different value than COUNT(*).
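
For example, using the sales table defined earlier, you can compare the two counts; a difference can indicate rejected rows (or legitimate NULL values) in that column:

=> SELECT COUNT(price), COUNT(*) FROM sales;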

The JSON and Avro parsers produce warnings when the data contains columns or fields that are not part of the table definition. External tables load data at query time, so these warnings appear for each query. You can use the parser's suppress_warnings parameter to prevent these warnings in external tables. See the parser reference pages for an example.

When using the COPY parameter ON ANY NODE, confirm that the source file definition is identical on all nodes. Specifying different external files can produce inconsistent results.

You can take advantage of partitioning to limit the amount of data that Vertica reads. For more information about using partitioned data, see Partitioned file paths.

Canceling a CREATE EXTERNAL TABLE AS COPY statement can cause unpredictable results. If you realize after beginning the operation that your table definition is incorrect (for example, you inadvertently specify the wrong external location), wait for the query to complete. When the external table exists, use DROP TABLE to remove its definition.

After you create an external table, analyze its row count to improve query performance. See Improving Query Performance for External Tables.

Required permissions

In addition to having permission in Vertica, users must have read access to the external data.

  • For data on the local disk this access is governed by local file permissions.

  • For data in HDFS, access might be governed by Kerberos authentication. See Accessing kerberized HDFS data.

  • For data on S3, you need access through an AWS IAM role. See S3 object store.

  • For data in GCS, you must enable S3 compatibility before reading data. See Google Cloud Storage (GCS) object store.

By default, you must also be a database superuser to access external tables through a SELECT statement.

In most cases, to allow users without superuser access to query external tables, an administrator must create a USER storage location and grant those users read access to the location. See CREATE LOCATION and GRANT (storage location). This location must be a parent of the path used in the COPY statement when creating the external table. This requirement does not apply to external tables stored in HDFS. The following example shows granting access to a user named Bob to any external table whose data is located under /tmp (including in subdirectories to any depth):

=> CREATE LOCATION '/tmp' ALL NODES USAGE 'user';
=> GRANT ALL ON LOCATION '/tmp' TO Bob;

Organizing external table data

If the data you store in external tables changes regularly (for instance, each month in the case of storing recent historical data), you can use partitioning in combination with wildcards (globs) to make parsing the stored COPY statement definition more dynamic. For instance, if you store monthly data on an NFS mount, you could organize monthly files within a top-level directory for a calendar year, such as:

/year=2018/month=01/

You can then read the year and month values from the directory names in the COPY statement:

=> CREATE EXTERNAL TABLE archive (...) AS COPY FROM '/nfs_name/*/*/*' PARTITION COLUMNS year, month;

Whenever a Vertica query references the external table archive, and Vertica parses the COPY statement, all stored data in the top-level monthly directory is accessible to the query. If the query filters by year or month, such as in a WHERE clause, Vertica skips irrelevant directories when evaluating the glob. See Partitioned file paths for more information.
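
For example, assuming year and month are defined as integer partition columns, a query such as the following lets Vertica skip all directories other than /year=2018/month=01/:

=> SELECT * FROM archive WHERE year = 2018 AND month = 1; -- assumes integer partition columns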

Validating table definitions

When you create an external table, Vertica validates the syntax of the CREATE EXTERNAL TABLE AS COPY FROM statement. For example, if you omit a required keyword in the statement, creating the external table fails:

=> CREATE EXTERNAL TABLE ext (ts timestamp, d varchar)
    AS COPY '/home/dbadmin/designer.log';
ERROR 2778:  COPY requires a data source; either a FROM clause or a WITH SOURCE for a user-defined source

Checking other components of the COPY definition, such as path statements and node availability, does not occur until a SELECT query references the external table.

To validate an external table definition, run a SELECT query that references the external table. Check that the returned query data is what you expect. If the query does not return data correctly, check the COPY exception and rejected data log files.

Because the COPY definition determines what occurs when you query an external table, COPY statement errors can reveal underlying problems. For more information about COPY exceptions and rejections, see Handling messy data.

Viewing external table definitions

When you create an external table, Vertica stores the COPY definition statement in the table_definition column of the TABLES system table.

To list all external tables, use a SELECT * query that filters on a non-empty table_definition, as shown:

=> SELECT * FROM TABLES WHERE table_definition <> '';

Use a query such as the following to list the external table definitions:

=> SELECT table_name, table_definition FROM TABLES;
 table_name |                                table_definition
------------+----------------------------------------------------------------------
 t1         | COPY            FROM 'TMPDIR/external_table.dat'  DELIMITER ','
 t1_copy    | COPY            FROM 'TMPDIR/external_table.dat'  DELIMITER ','
 t2         | COPY FROM 'TMPDIR/external_table2.dat' DELIMITER ','
(3 rows)

16.3 - Partitioned file paths

Data files are sometimes partitioned in the file system using the directory structure.

Data files are sometimes partitioned in the file system using the directory structure. Partitioning allows you to move values out of the raw data, where they have to be included for each row, and into the directory structure. Partitioning can improve query performance by allowing entire directories to be skipped. Partitioning can also save disk space if partition values are not repeated in the data.

Partitioning by date is one common use:

/data/created=2016-11-01/*
/data/created=2016-11-02/*
/data/created=2016-11-03/*
/data/...

The files matched by these paths do not contain a created column, because this information is expressed through the file system. Vertica can read the partitioned values (dates, in this example) into a table column (created, in this example).

Data can be partitioned by more than one value:

/data/created=2016-11-01/region=northeast/*
/data/created=2016-11-01/region=central/*
/data/created=2016-11-01/region=southeast/*
/data/created=2016-11-01/...
/data/created=2016-11-02/region=northeast/*
/data/created=2016-11-02/region=central/*
/data/created=2016-11-02/region=southeast/*
/data/created=2016-11-02/...
/data/created=2016-11-03/...
/data/...

If data is partitioned by more than one value, the partitions must appear in the same order in all directory paths in the glob. In this example, creation date comes before region and must do so consistently.

COPY syntax

To read values from partitioned files, use wildcards (globs) in the COPY path and the PARTITION COLUMNS option:

=> CREATE EXTERNAL TABLE records (id int, name varchar(50), created date, region varchar(50))
   AS COPY FROM 'webhdfs:///data/*/*/*'
   PARTITION COLUMNS created, region;

The path includes one wildcard (*) for each level of directory partitioning and then one more for the files. The number of wildcards must be at least one more than the number of partitioned columns. Data files must include at least one real column; you cannot represent data entirely through directory structure.

The first part of each partition directory name must match the column name in the table definition. COPY parses the string after the = for the values. Empty values, for example a directory named created=, are treated as null values. For backward compatibility, a value of __HIVE_DEFAULT_PARTITION__ also means null.

Values that cannot be coerced to the correct types are rejected in the same way that non-coercible values in data are rejected.

Query execution

When executing queries with predicates, Vertica skips subdirectories that do not satisfy the predicate. This process is called partition pruning and it can significantly improve query performance. The following example reads only the partitions for the specified region, for all dates. Although the data is also partitioned by date, the query does not restrict the date.

=> SELECT * FROM records WHERE region='northeast';

To verify that Vertica is pruning partitions, look in the explain plan for a message similar to the following:

files with unmatched partition have been pruned
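
You can generate the plan with EXPLAIN, for example:

=> EXPLAIN SELECT * FROM records WHERE region='northeast';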

Partitions on object stores

When reading partitioned data from a file system, COPY can prune unneeded directories at each level, skipping entire directories of files. Object stores (S3, GCS, and Azure) do not have directories, so COPY cannot prune paths in the same way.

By default, when querying data on an object store, COPY first fetches a list of all object names. It then prunes unneeded ones before fetching the objects themselves. This approach is called flat listing. Accessing object stores can have high latency, and flat listing maximizes the number of results returned per request. Flat listing retrieves more object names than are needed but does so in as few calls to the object store as possible. In most cases this strategy is an acceptable tradeoff.

If the number of partitions at each level is large, or if there are many layers of partitioning, fetching all of the object names can have a noticeable performance cost. In this situation, you can set the ObjectStoreGlobStrategy configuration parameter to have COPY use hierarchical listing instead of flat listing. With hierarchical listing, COPY fetches information about objects one level at a time, so that COPY can prune unproductive paths earlier.
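
For example, a statement such as the following (a sketch; the 'Hierarchical' value name is an assumption, with flat listing as the default) enables hierarchical listing at the database level:

=> ALTER DATABASE DEFAULT SET ObjectStoreGlobStrategy = 'Hierarchical'; -- value name assumed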

Consider data that is partitioned by date and then by region:

/data/created=2016-11-01/region=northeast/*
/data/created=2016-11-01/region=central/*
/data/created=2016-11-01/region=southeast/*
/data/created=2016-11-01/...
/data/created=2016-11-02/region=northeast/*
/data/created=2016-11-02/region=central/*
/data/created=2016-11-02/region=southeast/*
/data/created=2016-11-02/...
/data/created=2016-11-03/...
/data/...

Suppose a query needs only created=2016-11-02 and region=central. With the flat strategy, COPY retrieves the names of all objects in all dates and all regions before pruning most of them. With the hierarchical strategy, COPY first fetches the names of all of the created values, without expanding the paths further. In a second call, it then fetches the names of all of the region values below only that date, ignoring other paths. Finally, it fetches the names of the objects only below that path.

Creating partition structures

To create partitioned file structures, you can use Hive or the file exporters. For information about using Hive, see Hive primer for Vertica integration.
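
With the file exporters, the columns you name in the OVER (PARTITION BY) clause become key=value directory levels in the output. The following sketch (the output path and source table name are placeholders) writes Parquet files partitioned by created and region:

=> EXPORT TO PARQUET(directory = 's3://datalake/sales') -- placeholder path
   OVER(PARTITION BY created, region)
   AS SELECT * FROM sales_history; -- placeholder source table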

You can create partitions for columns of any simple data type. As a best practice, however, you should avoid partitioning columns with BOOLEAN, FLOAT, and NUMERIC types.

16.4 - Querying external tables

After you create an external table, you can query it as you would query any other table.

After you create an external table, you can query it as you would query any other table. Suppose you have created the following external tables:

=> CREATE EXTERNAL TABLE catalog (id INT, description VARCHAR, category VARCHAR)
    AS COPY FROM 'hdfs:///dat/catalog.csv' DELIMITER ',';
CREATE TABLE
=> CREATE EXTERNAL TABLE inventory(storeID INT, prodID INT, quantity INT)
    AS COPY FROM 'hdfs:///dat/inventory.csv' DELIMITER ',';
CREATE TABLE

You can now write queries against these tables, such as the following:

=> SELECT * FROM catalog;
 id |     description      |  category
----+----------------------+-------------
 10 | 24in monitor         | computers
 11 | 27in monitor         | computers
 12 | 24in IPS monitor     | computers
 20 | 1TB USB drive        | computers
 21 | 2TB USB drive        | computers
 22 | 32GB USB thumb drive | computers
 30 | 40in LED TV          | electronics
 31 | 50in LED TV          | electronics
 32 | 60in plasma TV       | electronics
(9 rows)

=> SELECT * FROM inventory;
 storeID | prodID | quantity
---------+--------+----------
     502 |     10 |       17
     502 |     11 |        2
     517 |     10 |        1
     517 |     12 |        2
     517 |     12 |        4
     542 |     10 |        3
     542 |     11 |       11
     542 |     12 |        1
(8 rows)

=> SELECT inventory.storeID,catalog.description,inventory.quantity
    FROM inventory JOIN catalog ON inventory.prodID = catalog.id;
 storeID |   description    | quantity
---------+------------------+----------
     502 | 24in monitor     |       17
     517 | 24in monitor     |        1
     542 | 24in monitor     |        3
     502 | 27in monitor     |        2
     542 | 27in monitor     |       11
     517 | 24in IPS monitor |        2
     517 | 24in IPS monitor |        4
     542 | 24in IPS monitor |        1
(8 rows)

One important difference between external tables and Vertica native tables is that querying an external table reads the external data every time. (See How external tables differ from native tables.) Specifically, each time a SELECT query references the external table, Vertica parses the COPY statement definition again to access the data. Certain errors in either your table definition or your data do not become apparent until you run a query, so test your external tables before deploying them in a production environment.

Handling errors

Querying external table data with an incorrect COPY FROM statement definition can result in many rejected rows. The ExternalTablesExceptionsLimit configuration parameter sets the maximum number of rejections that Vertica retains for such queries; the default value is 100. Setting ExternalTablesExceptionsLimit to -1 removes the limit, but doing so is not recommended.
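
For example, to raise the limit for the database (a sketch; the SET PARAMETER form and the value shown are illustrative):

=> ALTER DATABASE DEFAULT SET PARAMETER ExternalTablesExceptionsLimit = 1000; -- illustrative value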

If COPY errors reach the maximum number of rejections, the external table query continues, but COPY generates a warning in the vertica.log file and does not report subsequent rejected rows.

Using the ExternalTablesExceptionsLimit configuration parameter differs from using the COPY statement REJECTMAX parameter to set a low rejection threshold. The REJECTMAX value controls how many rejected rows to permit before causing the load to fail. If COPY encounters a number of rejected rows equal to or greater than REJECTMAX, COPY aborts execution instead of logging a warning in vertica.log.

Improving query performance for external tables

Queries that include joins perform better if the smaller table is the inner one. For native tables, the query optimizer uses cardinality to choose the inner table. For external tables, the query optimizer uses the row count if available.

After you create an external table, use ANALYZE_EXTERNAL_ROW_COUNT to collect this information. Calling this function is potentially expensive because it has to materialize one column of the table to be able to count the rows, so do this analysis when your database is not busy with critical queries. (This is why Vertica does not perform this operation automatically when you create the table.)
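
For example, to collect the row count for the records external table:

=> SELECT ANALYZE_EXTERNAL_ROW_COUNT('records');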

The query optimizer uses the results of your most-recent call to this function when planning queries. If the volume of data changes significantly, therefore, you should run it again to provide updated statistics. A difference of a few percent does not matter, but if your data volume grows by 20% or more, you should repeat this operation when able.

If your data is partitioned, Vertica automatically prunes partitions that cannot affect query results, causing less data to be loaded.

For ORC and Parquet data written with Hive version 0.14 and later, Vertica automatically uses predicate pushdown to further improve query performance. Predicate pushdown moves parts of the query execution closer to the data, reducing the amount of data that must be read from disk or across the network. ORC files written with earlier versions of Hive might not contain the statistics required to perform this optimization. When executing a query against a file that lacks these statistics, Vertica logs an EXTERNAL_PREDICATE_PUSHDOWN_NOT_SUPPORTED event in the QUERY_EVENTS system table. If you are seeing performance problems with your queries, check this table for these events.
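
For example, a query such as the following (the EVENT_TYPE and EVENT_DETAILS column names are assumed from the QUERY_EVENTS system table) shows whether recent queries skipped this optimization:

=> SELECT event_type, event_details FROM query_events
       WHERE event_type = 'EXTERNAL_PREDICATE_PUSHDOWN_NOT_SUPPORTED'; -- column names assumed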

Using external tables with user-defined load (UDL) functions

You can use external tables in conjunction with UDL functions that you create. For more information about using UDLs, see User Defined Load (UDL).

16.5 - Monitoring external tables

Vertica records information about external tables in system tables.

Vertica records information about external tables in system tables. You can use these tables to track your external data and queries against it.

The TABLES system table contains data about all tables, both native and external. The TABLE_DEFINITION column is specific to external tables. You can query this column to see all external data sources currently in use, as in the following example:

=> SELECT table_name, create_time, table_definition FROM tables WHERE table_definition != '';
  table_name   |          create_time          |                     table_definition
---------------+-------------------------------+-----------------------------------------------------------
 customers_orc | 2018-03-21 11:07:30.159442-04 | COPY from '/home/dbadmin/sample_orc_files/0*' ORC
 miscprod      | 2018-06-26 17:40:04.012121-04 | copy from '/home/dbadmin/data/prod.csv'
 students      | 2018-06-26 17:46:50.695024-04 | copy from '/home/dbadmin/students.csv'
 numbers       | 2018-06-26 17:53:52.407441-04 | copy from '/home/dbadmin/tt.dat'
 catalog       | 2018-06-26 18:12:28.598519-04 | copy from '/home/dbadmin/data/prod.csv' delimiter ','
 inventory     | 2018-06-26 18:13:06.951802-04 | copy from '/home/dbadmin/data/stores.csv' delimiter ','
 test          | 2018-06-27 16:31:39.170866-04 | copy from '/home/dbadmin/data/stores.csv' delimiter ','
(7 rows)

The EXTERNAL_TABLE_DETAILS table provides more details, including file sizes. Vertica computes the values in this table at query time, which is potentially expensive, so consider restricting the query by schema or table.

=> SELECT table_name, source_format, total_file_size_bytes FROM external_table_details;
  table_name   | source_format | total_file_size_bytes
---------------+---------------+-----------------------
 customers_orc | ORC           |             619080883
 miscprod      | DELIMITED     |                   254
 students      | DELIMITED     |                   763
 numbers       | DELIMITED     |                    30
 catalog       | DELIMITED     |                   254
 inventory     | DELIMITED     |                    74
 test          | DELIMITED     |                    74
(7 rows)
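
To keep this query inexpensive, you can restrict it to a specific table, for example:

=> SELECT table_name, source_format, total_file_size_bytes
       FROM external_table_details
       WHERE table_name = 'customers_orc';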

If the size of an external table changes significantly over time, you should rerun ANALYZE_EXTERNAL_ROW_COUNT() to gather updated statistics. See Improving Query Performance for External Tables.

The LOAD_SOURCES table shows information for loads currently in progress. This table does not record information about loads of ORC or Parquet data.

16.6 - Troubleshooting external tables

You might encounter the following issues when creating or querying external tables.

You might encounter the following issues when creating or querying external tables. For general data-load troubleshooting, see Troubleshooting data loads.

File not found or permission denied

If a query against an external table produces a file or permission error, ensure that the user executing the query has the necessary permissions in both Vertica and the file system. See the permissions section in Creating external tables.

Error 7226: cannot find partition column

When querying external tables backed by partitioned data, you might see an error message stating that a partition column is missing:

ERROR 7226: Cannot find partition column [region] in parquet source
[/data/table_int/int_original/000000_0]

If you create an external table and then change the partition structure, for example by renaming a column, you must then re-create the external table. If you see this error, update your table to match the partitioning on disk.

For more information about partition structure, see Partitioned file paths.

Error 6766: is a directory

When querying data, you might see an error message stating that an input file is a directory:

ERROR 6766: Error reading from orc parser input stream
[/tmp/orc_glob/more_nations]: Is a directory

This error occurs if the glob in the table's COPY FROM clause matches an empty directory. This error occurs only for files in the Linux file system; empty directories in HDFS are ignored.

To correct the error, make the glob more specific. Instead of *, for example, use *.orc.