Data load
Vertica provides many ways to read data. You can load data into the database from a variety of sources, optionally transforming it in various ways. You can read data in place in its original format using external tables. You can use streaming, and you can import data from other Vertica databases. See Common use cases for an introduction.
Most data-loading operations, including external tables, revolve around the COPY statement, which has many options. This section focuses on COPY-based reads (data load and external tables). Other data-loading options supported by Vertica, such as the Kafka and Spark integrations, are described elsewhere in the documentation.
1 - Common use cases
Vertica supports a variety of use cases for reading data. Some of the most common are summarized here with links for more information. This is not a complete list of capabilities.
The COPY statement is central to loading data. See Introduction to the COPY statement for an overview of its use.
Loading data from files
You might have data, perhaps quite a bit of data, that you want to load into Vertica. These files might reside on shared storage, in the cloud, or on local nodes, and might be in a variety of formats.
For information about source locations, see Specifying where to load data from. To handle data in different formats you specify a parser; for more information about the options, see Data formats.
You are not limited to loading data "as-is"; you can also transform it during load. See Transforming data during loads.
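For example, the following sketch uses a FILLER column to compute a loaded value from a raw input value; the orders table and source path are hypothetical, and Transforming data during loads describes the supported expressions:
=> CREATE TABLE orders (order_id INT, order_month INT);
=> COPY orders (order_id,
                raw_date FILLER DATE,                         -- hypothetical raw input column
                order_month AS EXTRACT(MONTH FROM raw_date))  -- derived value computed during load
   FROM '/data/orders.dat';                                   -- hypothetical path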
Loading data from other services
Apache Kafka is a platform for streaming data. Vertica supports streaming data to and from Kafka. See Apache Kafka integration.
Apache Spark is a cluster-computing framework for distributed data. Vertica supports connecting to Spark for data. See Apache Spark integration.
You can copy data directly from another Vertica cluster, instead of exporting to files and then loading those files. See Database export and import.
Read data where it exists (don't import)
Instead of loading data into Vertica, you can read it in place using external tables. External tables can be advantageous in the following cases:
-
If you want to explore data, such as in a data lake, before selecting data to load into Vertica.
-
If you are one of several consumers sharing the same data, for example in a data lake, then reading it in place eliminates concerns about whether query results are up to date. There's only one copy, so all consumers see the same data.
-
If your data changes rapidly but you do not want to stream it into Vertica, you can instead query the latest updates automatically.
-
If you have lower-priority data in Vertica that you still want to be able to query.
When you query an external table, Vertica loads the data it needs from the external source. The Parquet and ORC columnar formats are optimized for this kind of load, so using external tables does not necessarily have a large effect on performance compared to loading data into Vertica native tables.
For more information about using external tables, see Working with external data.
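For example, a minimal sketch of an external table over Parquet files in an object store; the table definition and S3 path are hypothetical:
=> CREATE EXTERNAL TABLE sales (itemID INT, price FLOAT, sold_at TIMESTAMP)
   AS COPY FROM 's3://examplebucket/sales/*.parquet' PARQUET;
=> SELECT COUNT(*) FROM sales;   -- data is read from the object store at query time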
Complex types
Some data formats support complex types such as arrays and structs (sets of property-value pairs). You can use strong typing to define columns using the ARRAY and ROW types (including nesting) in native and external tables. See Complex types. Alternatively, you can define tables using flexible (schemaless) complex types without fully specifying them. You can load flexible complex types in the Parquet, ORC, JSON, and Avro formats; see Flexible complex types.
Unknown schema
Sometimes the schema for the data you wish to load is unknown. Vertica provides two options in this case.
You can inspect sample data files and derive a table definition using the INFER_TABLE_DDL function. This function returns a good starting point for column definitions, including ones using complex types. You might still need to make adjustments, particularly for partition columns of unknown types or to adjust sizes of VARCHAR values. Particularly for tables with many or complex columns, using this function can save time and reduce manual effort.
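For example, a sketch of deriving a table definition from Parquet files; the path and table name are hypothetical:
=> SELECT INFER_TABLE_DDL('/data/people/*.parquet'            -- hypothetical sample files
       USING PARAMETERS format = 'parquet', table_name = 'people');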
Alternatively, you can load schemaless data into a Flex table. All data is loaded into a single binary column (a VMap) and is extracted at query time. This approach is particularly helpful if you need to support heterogeneous data in one table, such as JSON data with different properties. Loading into a Flex table is generally like loading into any other table, but for some formats you need to use a different parser. For more information about Flex tables and Flex-specific parsers, see Flex tables.
Messy data
Sometimes data is not clean; values might not match the declared data types, or required values might be missing, or the parser might not be able to interpret a row for other reasons. You might still want to be able to load and explore this data. You can specify how error-tolerant to be and where to record information about rejected data using parameters to the COPY statement. For more information, see Handling messy data.
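For example, a minimal sketch that records rejected rows and tolerates up to 10 rejections before aborting; the paths are hypothetical:
=> COPY employees FROM '/data/employees.csv' DELIMITER ','
   REJECTED DATA '/tmp/employees_rejected.txt'   -- rejected rows are written here
   EXCEPTIONS '/tmp/employees_exceptions.txt'    -- reasons for rejection are written here
   REJECTMAX 10;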
2 - Introduction to the COPY statement
Use the COPY statement to load data. COPY is a large and versatile statement with many parameters; for all of the details, see the reference page. In its simplest form, COPY loads data from a source into a table, as follows:
=> COPY target-table FROM data-source
You also use COPY when defining an external table:
=> CREATE EXTERNAL TABLE target-table (...) AS COPY FROM data-source
Source data can be a data stream or a file path. For more about the FROM clause, see Specifying where to load data from.
You can specify many details about a data load; for a complete list of parameters, see Parameters.
Permissions
Generally, only a superuser can use the COPY statement to bulk-load data. Non-superusers can use COPY in certain cases:
-
To load from a stream on the host (such as STDIN) rather than a file (see Streaming data via JDBC).
-
To load with the FROM LOCAL option.
-
To load into a storage location where the user has been granted permission.
-
To use a user-defined load function for which the user has permission.
A non-superuser can also perform a batch load with a JDBC prepared statement, which invokes COPY to load data as a background task.
Users must also have read permission to the source from which the data is to be loaded.
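For example, a sketch of granting a non-superuser read access to a USER storage location; the path and user name are hypothetical:
=> CREATE LOCATION '/data/shared' ALL NODES USAGE 'USER';   -- hypothetical USER storage location
=> GRANT READ ON LOCATION '/data/shared' TO load_user;      -- hypothetical user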
3 - Global and column-specific options
You can specify some COPY options globally, for the entire COPY statement, or limit their scope to a column. For example, in the following COPY statement, the first column is delimited by '|' but the others are delimited by commas.
=> COPY employees(id DELIMITER '|', name, department) FROM ... DELIMITER ',';
You could specify a different default for null inputs for one column:
=> COPY employees(id, name, department NULL 'General Admin') FROM ... ;
Alternatively, you can use the COLUMN OPTION parameter to specify column-specific parameters instead of enumerating the columns:
=> COPY employees COLUMN OPTION (department NULL 'General Admin') FROM ... ;
Where both global and column-specific values are provided for the same parameter, the column-specific value governs those columns and the global one governs others.
All parameters can be used globally. The description of each parameter indicates whether it can be restricted to specific columns.
4 - Specifying where to load data from
Each COPY statement requires either a FROM clause to indicate the location of the file or files being loaded or a SOURCE clause when using a user-defined source. For more about the SOURCE clause, see Parameters. This section covers use of the FROM clause.
Loading from a specific path
Use the path-to-data argument to indicate the location of the file to load. You can load data from the following locations:
-
The local file system.
-
NFS, through a mount point on the local file system.
-
HDFS, using a URL of the form [[s]web]hdfs://[nameservice]/path. For more information about HDFS URLs and configuration, see HDFS file system.
If a data file you want to read resides on an HDFS cluster that uses Kerberos authentication, Vertica uses the current user's principal, session doAs user, or session delegation token. See Accessing kerberized HDFS data for more information about these options.
-
S3, Google Cloud Storage, or Azure Blob Storage, for data in text, delimited, Parquet, and ORC formats only. For more information about reading from these file systems, see S3 object store, Google Cloud Storage (GCS) object store, and Azure Blob Storage object store.
If the path is a URL, you must use URL encoding for the '%' character. Otherwise, URL encoding ('%NN' where NN is a two-digit hexadecimal number) is permitted but not required.
When copying from the local file system, the COPY statement expects to find files in the same location on every node that participates in the query. If you are using NFS, then you can create an NFS mount point on each node. Doing so allows all database nodes to participate in the load for better performance without requiring files to be copied to all nodes.
Treat NFS mount points as local files in paths:
=> COPY sales FROM '/mount/sales.dat' ON ANY NODE;
You can specify more than one path in the same COPY statement, as in the following example.
=> COPY myTable FROM 'webhdfs:///data/sales/01/*.dat', 'webhdfs:///data/sales/02/*.dat',
'webhdfs:///data/sales/historical.dat';
For files in HDFS, you can specify a name service (hadoopNS). In this example, the COPY statement is part of the definition of an external table:
=> CREATE EXTERNAL TABLE users (id INT, name VARCHAR(20))
AS COPY FROM 'webhdfs://hadoopNS/data/users.csv';
If path-to-data resolves to a storage location on a local file system, and the user invoking COPY is not a superuser, these permissions are required:
-
The storage location must have been created with the USER option (see CREATE LOCATION).
-
The user must already have been granted READ access to the storage location where the file or files exist, as described in GRANT (storage location).
Vertica prevents symbolic links from allowing unauthorized access.
Loading with wildcards (glob)
You can invoke COPY for a large number of files in a shared directory with a single statement such as:
=> COPY myTable FROM '/data/manyfiles/*.dat' ON ANY NODE;
The glob (*) must indicate a set of files, not directories. The following statement fails if /data/manyfiles contains any subdirectories:
Using a wildcard with the ON ANY NODE clause expands the file list on the initiator node. This command then distributes the individual files among all nodes, so that the COPY workload is evenly distributed across the entire cluster.
ON ANY NODE is the default for all file systems other than the Linux file system, as in the following example:
=> COPY myTable FROM 'webhdfs:///data/manyfiles/*';
You can also distribute a file set across a subset of nodes, which you might do to balance concurrent loads. For example, this command distributes the loading of individual files among the three named nodes:
=> COPY myTable FROM '/mydirectory/ofmanyfiles/*.dat'
ON (v_vmart_node0001, v_vmart_node0002, v_vmart_node0003);
Distributing file loads across nodes depends on two configuration parameters, EnableApportionLoad and EnableApportionFileLoad. Both are enabled by default. See General parameters for more information about these parameters.
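If you need to change these settings, a sketch using the SET_CONFIG_PARAMETER meta-function at the database level might look like this:
=> SELECT SET_CONFIG_PARAMETER('EnableApportionLoad', 0);      -- disable apportioned loads
=> SELECT SET_CONFIG_PARAMETER('EnableApportionFileLoad', 0);  -- disable apportioned file loads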
Loading from a Vertica client
Use COPY LOCAL to load files on a client system to the Vertica database. For example, to copy a GZIP file from your local client, use a command such as this:
=> COPY store.store_dimension FROM LOCAL '/usr/files/my_data/input_file' GZIP;
You can use a comma-separated list to load multiple files of the same compression type. COPY LOCAL then concatenates the files into a single file, so you cannot combine files with different compression types in the list. When listing multiple files, be sure to specify the type of every input file, such as BZIP, as shown:
=> COPY simple_table FROM LOCAL 'input_file.bz' BZIP, 'input_file.bz' BZIP;
You can load data from a local client from STDIN, as follows:
=> COPY simple_table FROM LOCAL STDIN;
Loading from Kafka or Spark
For information about streaming data from Kafka, see Apache Kafka integration.
For information about using Vertica with Spark data, see Apache Spark integration.
Loading data from an IDOL CFS client
The IDOL Connector Framework Server (CFS) VerticaIndexer feature lets CFS clients connect to your Vertica database using ODBC. After it is connected, CFS uses COPY...FROM LOCAL statements to load IDOL document metadata into an existing flex table. For more information, see the Using flex tables for IDOL data section in Using Flex Tables.
5 - Data formats
COPY supports many data formats, detailed in the sections that follow. You specify a data format by specifying a parser.
By default, COPY uses the DELIMITED parser (Delimited data) to load raw data into the database. Raw input data must be in UTF-8, delimited text format. Other parsers support other data formats.
The syntax for specifying which parser to use varies. The description of each parser includes this information.
The same COPY statement cannot mix raw data types that require different parsers, such as NATIVE and FIXEDWIDTH. You can, however, load data of different formats, using different parsers, into the same table using separate COPY statements.
For information about verifying input data formats, see Handling Non-UTF-8 input.
All parsers described in this section can be used with conventional tables (those created with CREATE TABLE or CREATE EXTERNAL TABLE). Some also support Flex tables (CREATE FLEX TABLE). For more information specific to Flex tables, see Using flex table parsers.
All parsers support all primitive data types and several support one-dimensional arrays of primitive types. Some parsers support other complex types. See the documentation of individual parsers for information about supported types.
5.1 - Delimited data
If you do not specify another parser, Vertica defaults to the DELIMITED parser. You can specify the delimiter, escape characters, how to handle null values, and other parameters in the COPY statement.
The following example shows the default behavior, in which the delimiter character is '|':
=> CREATE TABLE employees (id INT, name VARCHAR(50), department VARCHAR(50));
CREATE TABLE
=> COPY employees FROM STDIN;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> 42|Sheldon Cooper|Physics
>> 17|Howard Wolowitz|Astronomy
>> \.
=> SELECT * FROM employees;
id | name | department
----+-----------------+--------------
17 | Howard Wolowitz | Astronomy
42 | Sheldon Cooper | Physics
(2 rows)
By default, collection values are delimited by brackets and elements are delimited by commas. Collections must be one-dimensional arrays or sets of scalar types.
=> CREATE TABLE researchers (id INT, name VARCHAR, grants ARRAY[VARCHAR], values ARRAY[INT]);
CREATE TABLE
=> COPY researchers FROM STDIN;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> 42|Sheldon Cooper|[US-7376,DARPA-1567]|[65000,135000]
>> 17|Howard Wolowitz|[NASA-1683,NASA-7867,SPX-76]|[16700,85000,45000]
>> \.
=> SELECT * FROM researchers;
id | name | grants | values
----+-----------------+------------------------------------+---------------------
17 | Howard Wolowitz | ["NASA-1683","NASA-7867","SPX-76"] | [16700,85000,45000]
42 | Sheldon Cooper | ["US-7376","DARPA-1567"] | [65000,135000]
(2 rows)
To use a special character as a literal, prefix it with an escape character. For example, to include a literal backslash (\) in the loaded data (such as when including a file path), use two backslashes (\\). COPY removes the escape character from the input when it loads escaped characters.
When loading delimited data, two consecutive delimiters indicate a null value, unless the NULL parameter is set otherwise. The final delimiter is optional. For example, the following input is valid for the previous table:
=> COPY employees FROM STDIN;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> 45|Raj|
>> 21|Leonard||
>> \.
=> SELECT * FROM employees;
id | name | department
----+---------+--------------
21 | Leonard |
45 | Raj     |
(2 rows)
By default, if the data has too few values, the load fails. You can use the TRAILING NULLCOLS option to accept any number of missing columns and treat their values as null.
Vertica assumes that data is in the UTF-8 encoding.
For the options specific to the DELIMITED parser and their default values, see the DELIMITED reference page.
To load delimited data into a Flex table, use the FDELIMITEDPARSER parser.
Changing the column separator (DELIMITER)
The default COPY delimiter is a vertical bar ('|'). The DELIMITER is a single ASCII character used to separate columns within each record of an input source. Between two delimiters, COPY interprets all string data in the input as characters. Do not enclose character strings in quotes, because quote characters are also treated as literals between delimiters.
You can define a different delimiter using any ASCII value in the range E'\000' to E'\177' inclusive. For instance, if you are loading CSV data files, and the files use a comma (',') character as a delimiter, you can change the default delimiter to a comma. You cannot use the same character for both the DELIMITER and NULL options.
If the delimiter character is among a string of data values, use the ESCAPE AS character ('\' by default) to indicate that the delimiter should be treated as a literal.
The COPY statement accepts empty values (two consecutive delimiters) as valid input data for CHAR and VARCHAR data types. COPY stores empty columns as an empty string (''). An empty string is not equivalent to a NULL string.
To indicate a non-printing delimiter character (such as a tab), specify the character in extended string syntax (E'...'). If your database has StandardConformingStrings enabled, use a Unicode string literal (U&'...'). For example, use either E'\t' or U&'\0009' to specify tab as the delimiter.
The following example loads data from a comma-separated file:
=> COPY employees FROM ... DELIMITER ',';
In the following example, the first column has a column-specific delimiter:
=> COPY employees(id DELIMITER ':', name, department) FROM ... DELIMITER ',';
Changing collection delimiters (COLLECTIONDELIMITER, COLLECTIONOPEN, COLLECTIONCLOSE)
The DELIMITER option specifies the value that separates columns in the input. For a column with a collection type (ARRAY or SET), a delimiter is also needed between elements of the collection. In addition, the collection itself has start and end markers. By default, collections are enclosed in brackets and elements are delimited by commas, but you can change these values.
In the following example, collections are enclosed in braces and delimited by periods.
=> COPY researchers FROM STDIN COLLECTIONOPEN '{' COLLECTIONCLOSE '}' COLLECTIONDELIMITER '.';
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> 19|Leonard|{"us-1672"."darpa-1963"}|{16200.16700}
>> \.
=> SELECT * FROM researchers;
id | name | grants | values
----+-----------------+------------------------------------+---------------------
17 | Howard Wolowitz | ["NASA-1683","NASA-7867","SPX-76"] | [16700,85000,45000]
42 | Sheldon Cooper | ["US-7376","DARPA-1567"] | [65000,135000]
19 | Leonard | ["us-1672","darpa-1963"] | [16200,16700]
(3 rows)
Changing the character enclosing column or collection values (ENCLOSED BY, COLLECTIONENCLOSE)
The ENCLOSED BY parameter lets you set an ASCII character to delimit characters to embed in string values. The enclosing character is not considered to be part of the data if and only if it is the first and last character of the input. You can use any ASCII value in the range E'\001' to E'\177' inclusive (any ASCII character except NULL: E'\000') for the ENCLOSED BY value. Using double quotation marks (") is common, as shown in the following example.
=> COPY employees FROM STDIN ENCLOSED BY '"';
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> 21|Leonard|Physics
>> 42|"Sheldon"|"Physics"
>> 17|Rajesh "Raj" K|Astronomy
>> \.
=> SELECT * FROM employees;
id | name | department
----+----------------+--------------
17 | Rajesh "Raj" K | Astronomy
21 | Leonard | Physics
42 | Sheldon | Physics
(3 rows)
Notice that while ENCLOSED BY is a double quote, the embedded quotes in Rajesh's name are treated as part of the data because they are not the first and last characters in the column. The quotes that enclose "Sheldon" and "Physics" are dropped because of their positions.
Within a collection value, the COLLECTIONENCLOSE parameter is like ENCLOSED BY for individual elements of the collection.
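For example, assuming behavior analogous to ENCLOSED BY, a sketch in which an enclosed element can contain the collection delimiter (the input row is hypothetical):
=> COPY researchers FROM STDIN COLLECTIONENCLOSE '"';
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> 23|Penny|["us-609,rev2","darpa-1772"]|[25000,35000]
>> \.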
Changing the null indicator (NULL)
By default, an empty string ('') for a column value means NULL. You can specify a different ASCII value in the range E'\001' to E'\177' inclusive (any ASCII character except NUL: E'\000') as the NULL indicator. You cannot use the same character for both the DELIMITER and NULL options.
A column containing one or more whitespace characters is not NULL unless the sequence of whitespace exactly matches the NULL string.
A NULL is case-insensitive and must be the only value between the data field delimiters. For example, if the null string is NULL and the delimiter is the default vertical bar (|):
|NULL| indicates a null value.
| NULL | does not indicate a null value.
When you use the COPY statement in a script, you must substitute a double-backslash for each null string that includes a backslash. For example, the scripts used to load the example database contain:
COPY ... NULL E'\\n' ...
Changing the null indicator for collection values (COLLECTIONNULLELEMENT)
The NULL option specifies the value to be treated as null for a column value. For a column with a collection type (ARRAY or SET), a separate option specifies how to interpret null elements. By default, "null" indicates a null value. An empty value, meaning two consecutive element delimiters, does not indicate null:
=> COPY researchers FROM STDIN;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> 17|Howard|["nasa-143",,"nasa-6262"]|[10000,1650,15367]
>> 19|Leonard|["us-177",null,"us-6327"]|[16200,64000,26500]
>> \.
=> SELECT * FROM researchers;
id | name | grants | values
----+---------+-----------------------------+---------------------
17 | Howard | ["nasa-143","","nasa-6262"] | [10000,1650,15367]
19 | Leonard | ["us-177",null,"us-6327"] | [16200,64000,26500]
(2 rows)
Use COLLECTIONNULLELEMENT to specify a different value, as in the following example.
=> COPY researchers from STDIN COLLECTIONNULLELEMENT 'x';
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> 42|Sheldon|[x,"us-1672"]|[x,165000]
>> \.
=> SELECT * FROM researchers;
id | name | grants | values
----+---------+-----------------------------+---------------------
17 | Howard | ["nasa-143","","nasa-6262"] | [10000,1650,15367]
19 | Leonard | ["us-177",null,"us-6327"] | [16200,64000,26500]
42 | Sheldon | [null, "us-1672"] | [null,165000]
(3 rows)
Filling missing columns (TRAILING NULLCOLS)
By default, COPY fails if the input does not contain enough columns. Use the TRAILING NULLCOLS option to instead insert NULL values for any columns that lack data. This option cannot be used with columns that have a NOT NULL constraint.
The following example demonstrates use of this option.
=> CREATE TABLE z (a INT, b INT, c INT );
--- insert with enough data:
=> INSERT INTO z VALUES (1, 2, 3);
=> SELECT * FROM z;
a | b | c
---+---+---
1 | 2 | 3
(1 row)
--- insert deficient data:
=> COPY z FROM STDIN TRAILING NULLCOLS;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> 4 | 5 | 6
>> 7 | 8
>> \.
=> SELECT * FROM z;
a | b | c
---+---+---
1 | 2 | 3
4 | 5 | 6
7 | 8 |
(3 rows)
Changing the escape character (ESCAPE AS, NO ESCAPE)
You can specify an escape character, which enables any special characters to be treated as part of the data. For example, if an element from a CSV file should contain a comma, you can indicate that by pre-pending the escape character to the comma in the data. The default escape character is a backslash (\).
To change the default to a different character, use the ESCAPE AS option. You can set the escape character to be any ASCII value in the range E'\001' to E'\177' inclusive.
If you do not want any escape character and want to prevent any characters from being interpreted as escape sequences, use the NO ESCAPE option.
ESCAPE AS and NO ESCAPE can be set at both the column and global levels.
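For example, a sketch that loads comma-delimited input with a tilde as the escape character, so the escaped comma in the name is treated as data (the input row is hypothetical):
=> COPY employees FROM STDIN DELIMITER ',' ESCAPE AS '~';
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> 45,Rajesh Koothrappali~, PhD,Astronomy
>> \.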
Changing the end-of-line character (RECORD TERMINATOR)
To specify the literal character string that indicates the end of a data file record, use the RECORD TERMINATOR parameter, followed by the string to use. If you do not specify a value, then Vertica attempts to determine the correct line ending, accepting either just a linefeed (E'\n'), common on UNIX systems, or a carriage return and linefeed (E'\r\n'), common on Windows platforms.
For example, if your file contains comma-separated values terminated by line feeds that you want to maintain, use the RECORD TERMINATOR option to specify an alternative value:
=> COPY mytable FROM STDIN DELIMITER ',' RECORD TERMINATOR E'\n';
To specify the RECORD TERMINATOR as non-printing characters, use either the extended string syntax or Unicode string literals. The following table lists some common record terminator characters. See String Literals for an explanation of the literal string formats.
Extended String Syntax | Unicode Literal String | Description     | ASCII Decimal
-----------------------+------------------------+-----------------+--------------
E'\b'                  | U&'\0008'              | Backspace       | 8
E'\t'                  | U&'\0009'              | Horizontal tab  | 9
E'\n'                  | U&'\000a'              | Linefeed        | 10
E'\f'                  | U&'\000c'              | Formfeed        | 12
E'\r'                  | U&'\000d'              | Carriage return | 13
E'\\'                  | U&'\005c'              | Backslash       | 92
If you use the RECORD TERMINATOR option to specify a custom value, be sure the input file matches the value. Otherwise, you may get inconsistent data loads.
Note
The record terminator cannot be the same as DELIMITER, NULL, ESCAPE, or ENCLOSED BY.
If using JDBC, Vertica recommends that you use the following value for the RECORD TERMINATOR:
System.getProperty("line.separator")
5.2 - Binary (native) data
You can load binary data using the NATIVE parser option, except with COPY LOCAL, which does not support this option. Since binary-format data does not require the use and processing of delimiters, it precludes the need to convert integers, dates, and timestamps from text to their native storage format, and improves load performance over delimited data. All binary-format files must adhere to the formatting specifications described in Appendix: creating native binary format files.
Native binary format data files are typically larger than their delimited text format counterparts, so compress the data before loading it. The NATIVE parser does not support concatenated compressed binary files. You can load native (binary) format files when developing plug-ins to ETL applications.
There is no copy format to load binary data byte-for-byte because the column and record separators in the data would have to be escaped. Binary data type values are padded and translated on input, and also in the functions, operators, and casts supported.
Loading hexadecimal, octal, and bitstring data
You can use the formats hexadecimal, octal, and bitstring only to load binary columns. To specify these column formats, use the COPY statement's FORMAT options:
-
Hexadecimal
-
Octal
-
Bitstring
The following examples illustrate how to use the FORMAT option.
-
Create a table:
=> CREATE TABLE t(oct VARBINARY(5),
hex VARBINARY(5),
bitstring VARBINARY(5) );
-
Create the projection:
=> CREATE PROJECTION t_p(oct, hex, bitstring) AS SELECT * FROM t;
-
Use a COPY statement with the STDIN clause, specifying each of the formats:
=> COPY t (oct FORMAT 'octal', hex FORMAT 'hex',
bitstring FORMAT 'bitstring')
FROM STDIN DELIMITER ',';
-
Enter the data to load, ending the statement with a backslash (\) and a period (.) on a separate line:
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> 141142143144145,0x6162636465,0110000101100010011000110110010001100101
>> \.
-
Use a SELECT query on table t to view the results:
=> SELECT * FROM t;
oct | hex | bitstring
-------+-------+-----------
abcde | abcde | abcde
(1 row)
COPY uses the same default format to load binary data as is used to input binary data. Since the backslash character ('\') is the default escape character, you must escape octal input values. For example, enter the byte '\141' as '\\141'.
Note
If you enter an escape character followed by an invalid octal digit or an escape character being escaped, COPY returns an error.
On input, COPY translates string data as follows:
Both functions take a VARCHAR argument and return a VARBINARY value.
You can also use the escape character to represent the (decimal) byte 92 by escaping it twice; for example, '\\\\'. Note that vsql inputs the escaped backslash as four backslashes. Equivalent inputs are the hex value '0x5c' and the octal value '\134' (134 = 1 x 8^2 + 3 x 8^1 + 4 x 8^0 = 92).
You can load a delimiter value if you escape it with a backslash. For example, given delimiter '|', '\\001\|\\002' is loaded as {1,124,2}, which can also be represented in octal format as '\\001\\174\\002'.
If you insert a value with more bytes than fit into the target column, COPY returns an error. For example, if column c1 is VARBINARY(1):
=> INSERT INTO t (c1) values ('ab');
ERROR: 2-byte value too long for type Varbinary(1)
If you implicitly or explicitly cast a value with more bytes than fit the target data type, COPY silently truncates the data. For example:
=> SELECT 'abcd'::binary(2);
binary
--------
ab
(1 row)
Hexadecimal data
The optional '0x' prefix indicates that a value is hexadecimal, not decimal, although not all hexadecimal values use A-F; for example, 5396. COPY ignores the 0x prefix when loading the input data.
If there are an odd number of characters in the hexadecimal value, the first character is treated as the low nibble of the first (furthest to the left) byte.
Octal data
Loading octal format data requires that each byte be represented by a three-digit octal code. The first digit must be in the range [0,3] and the second and third digits must both be in the range [0,7].
If the length of an octal value is not a multiple of three, or if one of the three digits is not in the proper range, the value is invalid and COPY rejects the row in which the value appears. If you supply an invalid octal value, COPY returns an error. For example:
=> SELECT '\\000\\387'::binary(8);
ERROR: invalid input syntax for type binary
Rows that contain binary values with invalid octal representations are also rejected. For example, COPY rejects '\\008' because '\\008' is not a valid octal number.
BitString data
Loading bitstring data requires that each character must be zero (0) or one (1), in multiples of eight characters. If the bitstring value is not a multiple of eight characters, COPY treats the first n characters as the low bits of the first byte (furthest to the left), where n is the remainder of the value's length, divided by eight.
Examples
The following example shows VARBINARY HEX_TO_BINARY(VARCHAR) and VARCHAR TO_HEX(VARBINARY) usage.
-
Create table t and its projection with binary columns:
=> CREATE TABLE t (c BINARY(1));
=> CREATE PROJECTION t_p (c) AS SELECT c FROM t;
-
Insert minimum and maximum byte values, including an IP address represented as a character string:
=> INSERT INTO t values(HEX_TO_BINARY('0x00'));
=> INSERT INTO t values(HEX_TO_BINARY('0xFF'));
=> INSERT INTO t values (V6_ATON('2001:DB8::8:800:200C:417A'));
Use the TO_HEX function to format binary values in hexadecimal on output:
=> SELECT TO_HEX(c) FROM t;
to_hex
--------
00
ff
20
(3 rows)
5.3 - Native varchar data
Use the NATIVE VARCHAR parser option when the raw data consists primarily of CHAR or VARCHAR data. COPY performs the conversion to the actual table data types on the database server. This parser option is not supported with COPY LOCAL.
Using NATIVE VARCHAR does not provide the same efficiency as NATIVE. However, NATIVE VARCHAR precludes the need to use delimiters or to escape special characters, such as quotes, which can make working with client applications easier.
Note
NATIVE VARCHAR does not support concatenated compressed files.
Batch data inserts performed through the Vertica ODBC and JDBC drivers automatically use the NATIVE VARCHAR format.
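In a COPY statement, you specify this parser with the NATIVE VARCHAR keywords, for example (the file path is hypothetical):
=> COPY employees FROM '/data/employees.bin' NATIVE VARCHAR;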
5.4 - Fixed-width format data
Use the FIXEDWIDTH parser option to bulk load fixed-width data. You must use the COLSIZES option to specify the number of bytes for each column. The definition of the table you are loading (COPY table f (x, y, z)) determines the number of COLSIZES values to declare.
To load fixed-width data, use the COLSIZES option to specify the number of bytes for each input column. If any records do not have values, COPY inserts one or more null characters to equal the specified number of bytes. The last record in a fixed-width data file must include a record terminator to determine the end of the load data.
The following COPY options are not supported:
-
DELIMITER
-
ENCLOSED BY
-
ESCAPE AS
-
TRAILING NULLCOLS
Using nulls in fixed-width data
The default NULL string for a fixed-width load cannot be an empty string, and instead, consists of all spaces. The number of spaces depends on the column width declared with the COLSIZES (integer, [,...]) option.
For fixed-width loads, the NULL definition depends on whether you specify NULL at the column or statement level:
-
Statement level: NULL must be defined as a single character. The default (or custom) NULL character is repeated for the entire width of the column.
-
Column level: NULL must be defined as a string whose length matches the column width.
For fixed-width loads, if the input data column has fewer values than the specified column size, COPY inserts NULL characters. The number of NULLs must match the declared column width. If you specify a NULL string at the column level, COPY matches the string with the column width.
Note
To turn off NULLs, use the NULL AS option and specify NULL AS ''.
Defining a null character (statement level)
-
Create a two-column table (fw):
=> CREATE TABLE fw(co int, ci int);
CREATE TABLE
-
Copy the table, specifying null as 'N', and enter some data:
=> COPY fw FROM STDIN FIXEDWIDTH colsizes(2,2) null AS 'N' NO COMMIT;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> NN12
>> 23NN
>> NNNN
>> nnnn
>> \.
-
Select all (*) from the table:
=> SELECT * FROM fw;
co | ci
----+----
| 12
23 |
(2 rows)
Defining a custom record terminator
To define a record terminator other than the COPY default when loading fixed-width data, take these steps:
-
Create table fw with two columns, co and ci:
=> CREATE TABLE fw(co int, ci int);
CREATE TABLE
-
Copy table fw, specifying two 2-byte column sizes, and specifying a comma (,) as the record terminator:
=> COPY fw FROM STDIN FIXEDWIDTH colsizes(2,2) RECORD TERMINATOR ',';
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> 1234,1444,6666
>> \.
-
Query all data in table fw:
=> SELECT * FROM fw;
co | ci
----+----
12 | 34
14 | 44
(2 rows)
The SELECT output indicates only two values. COPY rejected the third value (6666) because it was not followed by a comma (,) record terminator. Fixed-width data requires a trailing record terminator only if you explicitly specify a record terminator.
Copying fixed-width data
Use COPY FIXEDWIDTH COLSIZES (n [,...]) to load files into a Vertica database. By default, all spaces are NULLs. For example:
=> CREATE TABLE mytest(co int, ci int);
=> CREATE PROJECTION mytest_p1 AS SELECT * FROM mytest SEGMENTED BY HASH(co) ALL NODES;
=> COPY mytest(co,ci) FROM STDIN FIXEDWIDTH colsizes(6,4) NO COMMIT;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> \.
=> SELECT * FROM mytest ORDER BY co;
co | ci
----+----
(0 rows)
Skipping content in fixed-width data
The COPY statement has two options to skip input data. The SKIP BYTES option is only for fixed-width data loads:

SKIP BYTES num-bytes | Skips the specified number of bytes from the input data.
SKIP num-records     | Skips the specified number of records.
The following example uses SKIP BYTES to skip 11 bytes when loading a fixed-width table with two columns (4 and 6 bytes):
-
Copy a table using SKIP BYTES:
=> COPY fw FROM STDIN FIXEDWIDTH colsizes (4,6) SKIP BYTES 11;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> 2222666666
>> 1111999999
>> 1632641282
>> \.
-
Query all data in table fw:
=> SELECT * FROM fw ORDER BY co;
co | ci
------+--------
1111 | 999999
1632 | 641282
(2 rows)
The output confirms that COPY skipped the first 11 bytes of loaded data.
The following example uses SKIP when loading a fixed-width (4,6) table:
-
Copy a table, using SKIP to skip two records of input data:
=> COPY fw FROM STDIN FIXEDWIDTH colsizes (4,6) SKIP 2;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> 2222666666
>> 1111999999
>> 1632641282
>> 3333888888
>> \.
-
Query all data in table fw:
=> SELECT * FROM fw ORDER BY co;
co | ci
------+--------
1632 | 641282
3333 | 888888
(2 rows)
The output confirms that COPY skipped the first two records of load data.
Trimming characters in fixed-width data loads
Use the TRIM option to trim a character. TRIM accepts a single-byte character, which is trimmed at the beginning and end of the data. For fixed-width data loads, when you specify a TRIM character, COPY first checks to see if the row is NULL. If the row is not null, COPY trims the character(s). The next example instructs COPY to trim the character A, and shows the results:
-
Copy table fw, specifying TRIM character A:
=> COPY fw FROM STDIN FIXEDWIDTH colsizes(4,6) TRIM 'A';
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> A22A444444
>> A22AA4444A
>> \.
-
Query all data in table fw:
=> SELECT * FROM fw ORDER BY co;
co | ci
----+--------
22 | 4444
22 | 444444
(2 rows)
Using padding in fixed-width data loads
By default, the padding character is ' ' (a single space). The padding behavior for fixed-width data loads is similar to how a space is treated in other formats, differing by data type as follows:
Data type                                            | Padding
-----------------------------------------------------+---------------------------------------------------------------------------------------------------
Integer                                              | Leading and trailing spaces
Bool                                                 | Leading and trailing spaces
Float                                                | Leading and trailing spaces
[var]Binary                                          | None, all characters are significant.
[Var]Char                                            | Trailing spaces if string is too large
Date, Interval, Time, Timestamp, TimestampTZ, TimeTZ | None, all characters are significant. COPY uses an internal algorithm to parse these data types.
Date (formatted)                                     | Use the COPY FORMAT option string to match the expected column length.
Numerics                                             | Leading and trailing spaces
5.5 - ORC data
The ORC (Optimized Row Columnar) format is a column-oriented file format. Vertica has a parser for this format that can take advantage of the columnar layout.
In the COPY statement, specify the parser as follows:
=> COPY tableName FROM path ORC[(...)];
The parser takes several optional parameters; see the ORC (parser) reference page. If the data is partitioned, see in particular the hive_partition_cols parameter and Using partition columns. While partitioning is primarily a benefit when using external tables, you can load partitioned data into Vertica tables using this parser.
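For example, a sketch of an external table over Hive-partitioned ORC data; the path and columns are hypothetical:
=> CREATE EXTERNAL TABLE sales (itemID INT, price FLOAT, sale_date DATE)
   AS COPY FROM 'webhdfs:///data/sales/*/*'         -- hypothetical partitioned layout
   ORC(hive_partition_cols='sale_date');            -- read sale_date from partition directories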
Be aware that if you load from multiple files in the same COPY statement, and any of them is aborted, the entire load aborts. This behavior differs from that for delimited files, where the COPY statement loads what it can and ignores the rest.
Supported data types
Vertica can natively read columns of all Hive primitive data types. For a complete list, see HIVE Data Types (specifically the Numeric, Date/Time, String, and Misc lists). Vertica can also load UUIDs and complex types (arrays and structs in any combination).
The data types you specify for COPY or CREATE EXTERNAL TABLE AS COPY must exactly match the types in the data, though Vertica permits implicit casting among compatible numeric types.
Timestamps and time zones
To correctly report timestamps, Vertica must know what time zone the data was written in. Older versions of the ORC format do not record the time zone.
Hive version 1.2.0 and later records the writer time zone in the stripe footer. Vertica uses that time zone to make sure the timestamp values read into the database match the ones written in the source file. For ORC files that are missing this time zone information, Vertica assumes the values were written in the local time zone and logs an ORC_FILE_INFO event in the QUERY_EVENTS system table. Check for events of this type after your first query to verify that timestamps are being handled as you expected.
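A sketch of such a check against the QUERY_EVENTS system table might look like this:
=> SELECT event_type, event_description
   FROM query_events
   WHERE event_type = 'ORC_FILE_INFO';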
5.6 - Parquet data
Parquet is a column-oriented file format. Vertica has a parser that can take advantage of the columnar layout.
In the COPY statement, specify the parser as follows:
=> COPY tableName FROM path PARQUET[(...)];
The parser takes several optional parameters; see the PARQUET (parser) reference page. If the data is partitioned, see in particular the hive_partition_cols parameter and Using partition columns. While partitioning is primarily a benefit when using external tables, you can load partitioned data into Vertica tables using this parser.
Be aware that if you load from multiple files in the same COPY statement, and any of them is aborted, the entire load aborts. This behavior differs from that for delimited files, where the COPY statement loads what it can and ignores the rest.
Schema matching
By default, the Parquet parser uses strong schema matching. This means that the load must consume all columns in the Parquet data in the order they occur in the data. You can, instead, use loose schema matching, which allows you to select the columns you want and ignore the rest. Loose schema matching depends on the names of the columns in the data rather than their order, so the column names in your table must match those in the data. Types must match or be coercible. For more information on how to use loose schema matching, see Loose Schema Matching on the PARQUET (parser) reference page.
You can use loose schema matching for columns of primitive types and one-dimensional arrays of primitive types. For other complex types, use flexible complex types instead. See Flexible complex types.
Use the do_soft_schema_match_by_name parameter to specify loose schema matching. In the following example, the Parquet data contains more columns than those used in the table.
=> CREATE EXTERNAL TABLE restaurants(name VARCHAR, cuisine VARCHAR)
AS COPY FROM '/data/rest*.parquet'
PARQUET(do_soft_schema_match_by_name='True');
=> SELECT * FROM restaurants;
name | cuisine
-------------------+----------
Bob's pizzeria | Italian
Bakersfield Tacos | Mexican
(2 rows)
Parquet files include metadata that Vertica uses when loading data. To avoid repeatedly fetching this data, particularly from remote sources or where API calls incur financial costs, Vertica caches this metadata on each participating node during the planning phase for use during the execution phase.
Vertica uses TEMP storage for the cache, and only if TEMP storage is on the local file system.
You can limit the size of the cache by setting the ParquetMetadataCacheSizeMB configuration parameter. The default is 4GB.
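For example, a sketch that lowers the cache limit to 1GB with the SET_CONFIG_PARAMETER meta-function:
=> SELECT SET_CONFIG_PARAMETER('ParquetMetadataCacheSizeMB', 1024);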
Supported data types
Vertica can natively read columns of all Hive primitive data types. For a complete list, see HIVE Data Types (specifically the Numeric, Date/Time, String, and Misc lists). Vertica can also load UUIDs and complex types (arrays and structs in any combination).
The data types you specify for COPY or CREATE EXTERNAL TABLE AS COPY must exactly match the types in the data, though Vertica permits implicit casting among compatible numeric types.
For the Parquet format only, you can use flexible complex types instead of fully specifying the schema for complex types. See Flexible complex types.
Timestamps and time zones
To correctly report timestamps, Vertica must know what time zone the data was written in. Hive does not record the writer time zone. Vertica assumes timestamp values were written in the local time zone and reports a warning at query time.
Hive provides an option, when writing Parquet files, to record timestamps in the local time zone. If you are using Parquet files that record times in this way, set the UseLocalTzForParquetTimestampConversion configuration parameter to 0 to disable the conversion done by Vertica. (See General parameters.)
5.7 - JSON data
Use FJSONPARSER to load data in JSON format.
The schema for JSON data is the set of property names in the property/value pairs. When you load JSON data into a columnar table or materialized columns in a Flex table, the property names in the data must match the column names in the table. You do not need to load all of the columns in the data.
The JSON parser can load data into columns of any scalar type, strongly-typed complex type, or flexible complex type. A flexible complex type means you do not fully specify the schema for that column. You define these columns in the table as LONG VARBINARY, and you can use Flex functions to extract values from them.
In the COPY statement, use the PARSER parameter to specify the JSON parser as in the following example:
=> CREATE EXTERNAL TABLE customers(id INT, address VARCHAR, transactions ARRAY[INT,10])
AS COPY FROM 'cust.json' PARSER FJSONPARSER();
This parser has several optional parameters, some of which are specific to use with Flex tables and flexible complex types.
Before loading JSON data, consider using a tool such as JSONLint to verify that the data is valid.
If you load JSON data into a Flex table, Vertica loads all data into the __raw__ (VMap) column, including complex types found in the data. You can use Flex functions to extract values.
Strongly-typed complex types
JSON data can contain arrays, structs, and combinations of the two. You can load this data either as flexible (VMap) columns or with strong typing. Strong typing allows you to query values directly, without having to use functions to unpack a VMap column.
Use the ARRAY and ROW types in the table definition as usual:
=> CREATE EXTERNAL TABLE rest
(name VARCHAR, cuisine VARCHAR,
location_city ARRAY[VARCHAR(80),50],
menu ARRAY[ ROW(item VARCHAR(80), price FLOAT), 100 ]
)
AS COPY FROM :restdata PARSER FJSONPARSER();
=> SELECT name, location_city, menu FROM rest;
name | location_city | menu
-------------------+----------------------------+------------------------------------------------------------------------------
Bob's pizzeria | ["Cambridge","Pittsburgh"] | [{"item":"cheese pizza","price":8.25},{"item":"spinach pizza","price":10.5}]
Bakersfield Tacos | ["Pittsburgh"] | [{"item":"veggie taco","price":9.95},{"item":"steak taco","price":10.95}]
(2 rows)
When loading JSON data into a table with strong typing for complex types, Vertica ignores the parser's flatten_maps and flatten_arrays parameters.
Strong and flexible complex types
An advantage of strong typing is the easier (and more efficient) access in queries. A disadvantage is that additional values found in the data but not included in the column definition are ignored. If the menu struct in this data includes more attributes, such as calories, they are not loaded because the definition of the column only specified item and price. The following example uses flexible complex types to reveal the extra attributes:
=> CREATE EXTERNAL TABLE rest
(name VARCHAR, cuisine VARCHAR,
location_city LONG VARBINARY, menu LONG VARBINARY)
AS COPY FROM :restdata
PARSER FJSONPARSER(flatten_maps=false);
=> SELECT name, MAPTOSTRING(location_city) as location_city, MAPTOSTRING(menu) AS menu FROM rest;
name | location_city | menu
-------------------+--------------------------+---------------------------------------------------
Bob's pizzeria | {
"0": "Cambridge",
"1": "Pittsburgh"
} | {
"0": {
"calories": "1200",
"item": "cheese pizza",
"price": "8.25"
},
"1": {
"calories": "900",
"item": "spinach pizza",
"price": "10.50"
}
}
Bakersfield Tacos | {
"0": "Pittsburgh"
} | {
"0": {
"item": "veggie taco",
"price": "9.95",
"vegetarian": "true"
},
"1": {
"item": "steak taco",
"price": "10.95"
}
}
(2 rows)
Loading from a specific start point
You need not load an entire JSON file. You can use the start_point parameter to load data beginning at a specific key, rather than at the beginning of a file. Data is parsed from after the start_point key until the end of the file, or to the end of the first start_point's value. The parser ignores any subsequent instance of the start_point, even if that key appears multiple times in the input file. If the input data contains only one copy of the start_point key, and that value is a list of JSON elements, the parser loads each element in the list as a row.
If a start_point value occurs more than once in your JSON data, you can use the start_point_occurrence integer parameter to specify the occurrence at which to start parsing.
This example uses the following JSON data, saved to a file named alphanums.json:
{ "A": { "B": { "C": [ { "d": 1, "e": 2, "f": 3 }, { "g": 4, "h": 5, "i": 6 },
{ "j": 7, "k": 8, "l": 9 } ] } } }
Loading this data into a flex table produces the following results:
=> CREATE FLEX TABLE start_json;
CREATE TABLE
=> COPY start_json FROM '/home/dbadmin/data/alphanums.json' PARSER FJSONPARSER();
Rows Loaded
-------------
1
(1 row)
=> SELECT maptostring(__raw__) FROM start_json;
maptostring
-------------------------------------------------------------------------------
{
"A.B.C" : {
"0.d" : "1",
"0.e" : "2",
"0.f" : "3",
"1.g" : "4",
"1.h" : "5",
"1.i" : "6",
"2.j" : "7",
"2.k" : "8",
"2.l" : "9"
}
}
(1 row)
The following load specifies a start point:
=> TRUNCATE TABLE start_json;
TRUNCATE TABLE
=> COPY start_json FROM '/home/dbadmin/data/alphanums.json' PARSER FJSONPARSER(start_point='B');
Rows Loaded
-------------
1
(1 row)
=> SELECT maptostring(__raw__) FROM start_json;
maptostring
--------------------------------------------------------------------------------
{
"C" : {
"0.d" : "1",
"0.e" : "2",
"0.f" : "3",
"1.g" : "4",
"1.h" : "5",
"1.i" : "6",
"2.j" : "7",
"2.k" : "8",
"2.l" : "9"
}
}
(1 row)
Dealing with invalid JSON records
If your JSON data contains syntax errors, your load can fail due to invalid records. You can use the RECORD_TERMINATOR option in the COPY statement to skip these invalid records if your JSON records are consistently delimited by a character like a line break. Setting a record terminator allows the parser to skip over invalid records and continue parsing the rest of the data.
If your records are not consistently marked by a character, you can use the ERROR TOLERANCE option. ERROR TOLERANCE skips entire source files with invalid JSON records, while RECORD_TERMINATOR skips individual malformed JSON records. You can use the two options together.
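For example, a sketch combining both options, so that a malformed record skips only that record while a wholly unparseable source file is skipped without aborting the load; the file paths are hypothetical:
=> COPY fruits FROM '/data/fruits1.json', '/data/fruits2.json'
   PARSER FJSONPARSER(RECORD_TERMINATOR=E'\n')
   ERROR TOLERANCE;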
The following example uses invalid records:
=> CREATE FLEX TABLE fruits();
CREATE TABLE
=> COPY fruits FROM STDIN PARSER FJSONPARSER(RECORD_TERMINATOR=E'\n');
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> {"name": "orange", "type": "fruit", "color": "orange", "rating": 5 }
>> {"name": "apple", "type": "fruit", "color": "green" }
>> {"name": "blueberry", "type": "fruit", "color": "blue", "rating": 10 }
>> "type": "fruit", "rating": 7 }
>> {"name": "banana", "type" : "fruit", "color": "yellow", "rating": 3 }
>> \.
View the flex table using MAPTOSTRING to confirm that the invalid record was skipped while the rest of the records were successfully loaded:
=> SELECT MAPTOSTRING(__raw__) FROM fruits;
maptostring
--------------------------------------------------------------------------------------------
{
"color" : "orange",
"name" : "orange",
"rating" : "5",
"type" : "fruit"
}
{
"color" : "green",
"name" : "apple",
"type" : "fruit"
}
{
"color" : "blue",
"name" : "blueberry",
"rating" : "10",
"type" : "fruit"
}
{
"color" : "yellow",
"name" : "banana",
"rating" : "3",
"type" : "fruit"
}
(4 rows)
Rejecting data on materialized column type errors
By default, if FJSONPARSER cannot coerce a data value to a type that matches the column definition, it sets the value to NULL. You can choose to instead reject these values using the reject_on_materialized_type_error parameter. If this parameter is true, COPY rejects such rows and reports an error.
If the column is a strongly-typed complex type, as opposed to a flexible complex type, then a type mismatch anywhere in the complex type causes the entire column to be treated as a mismatch. The parser does not partially load complex types; if any ROW field or ARRAY element cannot be coerced, it loads NULL for the column.
The following example attempts to load invalid data. Note that the invalid row is missing from the query results:
=> CREATE TABLE test(one VARCHAR, two INT);
CREATE TABLE
=> COPY test FROM stdin
PARSER FJSONPARSER(reject_on_materialized_type_error=true);
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> {"one": 1, "two": 2}
>> {"one": "one", "two": "two"}
>> {"one": "one", "two": 2}
>> \.
=> SELECT one, two FROM test;
one | two
-----+-----
1 | 2
one | 2
(2 rows)
Rejecting or omitting empty keys in flex tables
Valid JSON files can include empty key and value pairs. By default, for a Flex table, FJSONPARSER loads them, as in the following example:
=> CREATE FLEX TABLE fruits();
CREATE TABLE
=> COPY fruits FROM STDIN PARSER FJSONPARSER();
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> {"name": "orange", "rating": 5}
>> {"name": "apple", "rating" : 10}
>> {"": "banana", "rating" : 2}
>> \.
=> SELECT MAPTOSTRING(__raw__) FROM fruits;
maptostring
---------------------------------------
{
"name": "orange",
"rating": "5"
}
{
"name": "apple",
"rating": "10"
}
{
"": "banana",
"rating": "2"
}
(3 rows)
To omit fields with empty keys, use the omit_empty_keys parameter:
=> COPY fruits FROM STDIN PARSER FJSONPARSER(omit_empty_keys=true);
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> {"name": "apple", "rating": 5}
>> {"": "missing", "rating" : 1}
>> {"name": "", "rating" : 3}
>> \.
=> SELECT MAPTOSTRING(__raw__) FROM fruits;
maptostring
--------------------------------------
{
"name": "apple",
"rating": "5"
}
{
"rating": "1"
}
{
"name": "",
"rating": "3"
}
(3 rows)
Note that the second value, with the missing name, still loads with the other (non-empty) field. To instead reject the row entirely, use the reject_on_empty_key parameter:
=> COPY fruits FROM STDIN PARSER FJSONPARSER(reject_on_empty_key=true);
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> {"name" : "apple", "rating" : 5}
>> {"" : "missing", "rating" : 1}
>> {"name" : "", "rating" : 3}
>> \.
=> SELECT MAPTOSTRING(__raw__) FROM fruits;
maptostring
--------------------------------------
{
"name": "apple",
"rating": "5"
}
{
"name": "",
"rating": "3"
}
(2 rows)
5.8 - Avro data
Use FAVROPARSER to load Avro data files.
Use FAVROPARSER to load Avro data files. This parser supports both columnar and Flex tables.
A column can be of any scalar type, a strongly-typed complex type, or a flexible complex type. A flexible complex type means you do not fully specify the schema for that column. You define these columns in the table as LONG VARBINARY, and you can use Flex functions to extract values from them.
The following requirements apply:
-
Avro files must be encoded in the Avro binary serialization encoding format, described in the Apache Avro standard. The parser also supports Snappy and deflate compression.
-
The Avro file must have its related schema in the file being loaded; FAVROPARSER does not support Avro files with separate schema files.
In the COPY statement, use the PARSER parameter to specify the Avro parser as in the following example:
=> COPY weather FROM '/home/dbadmin/data/weather.avro' PARSER FAVROPARSER();
This parser has several optional parameters, some of which are specific to use with Flex tables and flexible complex types.
Avro schemas and columnar tables
Avro includes the schema with the data. When you load Avro data into a columnar table, the column names in the schema in the data must match the column names in the table. You do not need to load all of the columns in the data.
For example, the following Avro schema uses the Avro record type to represent a user profile:
{
"type": "record",
"name": "Profile",
"fields" : [
{"name": "UserName", "type": "string"},
{"name": "Email", "type": "string"},
{"name": "Address", "type": "string"}
]
}
To successfully load the data into a columnar table with this schema, each target column name must match the "name" value in the schema. In the following example, the profiles table does not load values corresponding to the schema's Email field because the target column is named EmailAddr:
=> COPY profiles FROM '/home/dbadmin/data/user_profile.avro' PARSER FAVROPARSER();
=> SELECT * FROM profiles;
UserName | EmailAddr | Address
-----------------+--------------------+---------------------
dbadmin | | 123 Main St.
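To load all three fields, the target column names must match the schema. The following sketch assumes the same user_profile.avro file; the corrected table name profiles2 is hypothetical:
=> CREATE TABLE profiles2(UserName VARCHAR, Email VARCHAR, Address VARCHAR);
CREATE TABLE
=> COPY profiles2 FROM '/home/dbadmin/data/user_profile.avro' PARSER FAVROPARSER();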
Strongly-typed complex types
Avro data can contain arrays, structs, and combinations of the two. You can read this data either as flexible (VMap) columns or with strong typing. Strong typing allows you to query values directly, without having to use functions to unpack a VMap column.
Use the ARRAY and ROW types in the table definition as usual:
=> CREATE EXTERNAL TABLE rest
(name VARCHAR, cuisine VARCHAR,
location_city ARRAY[VARCHAR(80)],
menu ARRAY[ ROW(item VARCHAR(80), price FLOAT) ]
)
AS COPY FROM :avro_file PARSER FAVROPARSER();
You can use strong typing in both external tables and native tables.
An alternative to strong typing for complex types is flexible complex types.
Rejecting data on materialized column type errors
By default, if FAVROPARSER cannot coerce a data value to a type that matches the column definition, it sets the value to NULL. You can choose to instead reject these values using the reject_on_materialized_type_error
parameter. If this parameter is true, COPY rejects such rows and reports an error.
If the column is a strongly-typed complex type, as opposed to a flexible complex type, then a type mismatch anywhere in the complex type causes the entire column to be treated as a mismatch. The parser does not partially load complex types.
If a flex table has a materialized column, the data being loaded must be coercible to that column's type. For example, if a materialized column is declared as a FLOAT and you try to load a VARCHAR value for that key, FAVROPARSER rejects the data row.
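For example, in the following hypothetical sketch (the table and file names are not part of the documented examples), any row in events.avro whose reading value cannot be coerced to FLOAT would be rejected:
=> CREATE FLEX TABLE sensor_events(reading FLOAT);
CREATE TABLE
=> COPY sensor_events FROM '/home/dbadmin/data/events.avro' PARSER FAVROPARSER();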
5.9 - Matches from regular expressions
You can load flex or columnar tables with the matched results of a regular expression, using the fregexparser.
You can load flex or columnar tables with the matched results of a regular expression, using the fregexparser. This section describes some examples of using the options that the flex parsers support.
Sample regular expression
These examples use the following regular expression, which searches information that includes the timestamp, date, thread_name, and thread_id strings.
Caution
For display purposes, this sample regular expression adds new line characters to split long lines of text. To use this expression in a query, first copy and edit the example to remove any new line characters.
This example expression loads any thread_id hex value, regardless of whether it has a 0x prefix: (?<thread_id>(?:0x)?[0-9a-f]+).
'^(?<time>\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d\.\d+)
(?<thread_name>[A-Za-z ]+):(?<thread_id>(?:0x)?[0-9a-f]+)
-?(?<transaction_id>[0-9a-f])?(?:[(?<component>\w+)]
\<(?<level>\w+)\> )?(?:<(?<elevel>\w+)> @[?(?<enode>\w+)]?: )
?(?<text>.*)'
Using regular expression matches for a flex table
You can load the results from a regular expression into a flex table, using the fregexparser. For a complete example of doing so, see FREGEXPARSER.
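As a minimal sketch (the table name and log path are hypothetical), create a flex table and load it with fregexparser, substituting the sample expression above, with newlines removed, for the placeholder:
=> CREATE FLEX TABLE flog();
CREATE TABLE
=> COPY flog FROM '/home/dbadmin/data/vertica.log' PARSER fregexparser(pattern='<sample-regular-expression>');
You can then inspect the loaded keys with MAPTOSTRING(__raw__), or compute keys and build a view with compute_flextable_keys_and_build_view.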
Using fregexparser for columnar tables
This section illustrates how to load the results of a regular expression used with a sample log file for a Vertica database. By using an external table definition, it presents an example of using fregexparser to load data into a columnar table. Using a flex table parser for columnar tables lets you mix data loads in one table. For example, you can load the results of a regular expression in one session, and JSON data in another.
The following basic examples illustrate this usage.
-
Create a columnar table, vlog
, with the following columns:
=> CREATE TABLE vlog (
"text" varchar(2322),
thread_id varchar(28),
thread_name varchar(44),
"time" varchar(46),
component varchar(30),
level varchar(20),
transaction_id varchar(32),
elevel varchar(20),
enode varchar(34)
);
-
Use COPY to load parts of a log file into the vlog table, using the sample regular expression presented above with the fregexparser. Be sure to remove any newline characters from this expression example before trying it yourself:
=> COPY vlog FROM '/home/dbadmin/data/flex/vertica.log' PARSER
FRegexParser(pattern='^(?<time>\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d\.\d+)
(?<thread_name>[A-Za-z ]+):(?<thread_id>(?:0x)?[0-9a-f]+)
-?(?<transaction_id>[0-9a-f])?(?:[(?<component>\w+)]
\<(?<level>\w+)\> )?(?:<(?<elevel>\w+)> @[?(?<enode>\w+)]?: )
?(?<text>.*)'
) rejected data as table fregex_reject;
-
Query the time column:
=> SELECT "time" FROM vlog LIMIT 10;
time
-------------------------
2014-04-02 04:02:02.613
2014-04-02 04:02:02.613
2014-04-02 04:02:02.614
2014-04-02 04:02:51.008
2014-04-02 04:02:51.010
2014-04-02 04:02:51.012
2014-04-02 04:02:51.012
2014-04-02 04:02:51.013
2014-04-02 04:02:51.014
2014-04-02 04:02:51.017
(10 rows)
Using external tables with fregexparser
If you create an external columnar table for your Vertica log file, querying the table returns updated log information. The following basic example illustrates this usage.
-
Create a columnar table, vertica_log, using the AS COPY clause and fregexparser to load matched results from the regular expression. For illustrative purposes, this regular expression includes newline characters to split long text lines. Remove them before testing with this expression:
=> CREATE EXTERNAL TABLE public.vertica_log
(
"text" varchar(2322),
thread_id varchar(28),
thread_name varchar(44),
"time" varchar(46),
component varchar(30),
level varchar(20),
transaction_id varchar(32),
elevel varchar(20),
enode varchar(34)
)
AS COPY
FROM '/home/dbadmin/data/vertica.log'
PARSER FRegexParser(pattern='^(?<time>\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d\.\d+)
(?<thread_name>[A-Za-z ]+):(?<thread_id>(?:0x)?[0-9a-f]+)
-?(?<transaction_id>[0-9a-f])?(?:[(?<component>\w+)]
\<(?<level>\w+)\> )?(?:<(?<elevel>\w+)> @[?(?<enode>\w+)]?: )
?(?<text>.*)'
);
-
Query from the external table to get updated results:
=> SELECT component, thread_id, time FROM vertica_log limit 10;
component | thread_id | time
-----------+------------+-------------------------
Init | 0x16321430 | 2014-04-02 04:02:02.613
Init | 0x16321430 | 2014-04-02 04:02:02.613
Init | 0x16321430 | 2014-04-02 04:02:02.613
Init | 0x16321430 | 2014-04-02 04:02:02.613
Init | 0x16321430 | 2014-04-02 04:02:02.613
Init | 0x16321430 | 2014-04-02 04:02:02.613
Init | 0x16321430 | 2014-04-02 04:02:02.613
| 0x16321430 | 2014-04-02 04:02:02.614
| 0x16321430 | 2014-04-02 04:02:02.614
| 0x16321430 | 2014-04-02 04:02:02.614
(10 rows)
5.10 - Common event format (CEF) data
Use the flex parser fcefparser to load OpenText ArcSight or other Common Event Format (CEF) log file data into columnar and flexible tables.
Use the flex parser fcefparser
to load OpenText ArcSight or other Common Event Format (CEF) log file data into columnar and flexible tables.
When you use the parser to load arbitrary CEF-format files, it interprets key names in the data as virtual columns in your flex table. After loading, you can query your CEF data directly, regardless of which set of keys exist in each row. You can also use the associated flex table data and map functions to manage CEF data access.
Create a flex table and load CEF data
This section uses a sample set of CEF data. All IP addresses have been purposely changed to be inaccurate, and newline characters were added for display. To use this sample data, copy the following text, remove all newlines, and save the file as CEF_sample.cef, which is the name used throughout these examples.
CEF:0|ArcSight|ArcSight|6.0.3.6664.0|agent:030|Agent [test] type [testalertng] started|Low|
eventId=1 mrt=1396328238973 categorySignificance=/Normal categoryBehavior=/Execute/Start
categoryDeviceGroup=/Application catdt=Security Mangement categoryOutcome=/Success
categoryObject=/Host/Application/Service art=1396328241038 cat=/Agent/Started
deviceSeverity=Warning rt=1396328238937 fileType=Agent
cs2=<Resource ID\="3DxKlG0UBABCAA0cXXAZIwA\=\="/> c6a4=fe80:0:0:0:495d:cc3c:db1a:de71
cs2Label=Configuration Resource c6a4Label=Agent
IPv6 Address ahost=SKEELES10 agt=888.99.100.1 agentZoneURI=/All Zones/ArcSight
System/Private Address Space
Zones/RFC1918: 888.99.0.0-888.200.255.255 av=6.0.3.6664.0 atz=Australia/Sydney
aid=3DxKlG0UBABCAA0cXXAZIwA\=\= at=testalertng dvchost=SKEELES10 dvc=888.99.100.1
deviceZoneURI=/All Zones/ArcSight System/Private Address Space Zones/RFC1918:
888.99.0.0-888.200.255.255 dtz=Australia/Sydney _cefVer=0.1
-
Create a flex table logs:
=> CREATE FLEX TABLE logs();
CREATE TABLE
-
Load the sample CEF file, using the flex parser fcefparser:
=> COPY logs FROM '/home/dbadmin/data/CEF_sample.cef' PARSER fcefparser();
Rows Loaded
-------------
1
(1 row)
-
Use the maptostring()
function to see the contents of the logs
flex table:
=> SELECT maptostring(__raw__) FROM logs;
maptostring
-------------------------------------------------------------------------------------
{
"_cefver" : "0.1",
"agentzoneuri" : "/All Zones/ArcSight System/Private Address
Space Zones/RFC1918: 888.99.0.0-888.200.255.255",
"agt" : "888.99.100.1",
"ahost" : "SKEELES10",
"aid" : "3DxKlG0UBABCAA0cXXAZIwA==",
"art" : "1396328241038",
"at" : "testalertng",
"atz" : "Australia/Sydney",
"av" : "6.0.3.6664.0",
"c6a4" : "fe80:0:0:0:495d:cc3c:db1a:de71",
"c6a4label" : "Agent IPv6 Address",
"cat" : "/Agent/Started",
"catdt" : "Security Mangement",
"categorybehavior" : "/Execute/Start",
"categorydevicegroup" : "/Application",
"categoryobject" : "/Host/Application/Service",
"categoryoutcome" : "/Success",
"categorysignificance" : "/Normal",
"cs2" : "<Resource ID=\"3DxKlG0UBABCAA0cXXAZIwA==\"/>",
"cs2label" : "Configuration Resource",
"deviceproduct" : "ArcSight",
"deviceseverity" : "Warning",
"devicevendor" : "ArcSight",
"deviceversion" : "6.0.3.6664.0",
"devicezoneuri" : "/All Zones/ArcSight System/Private Address Space
Zones/RFC1918: 888.99.0.0-888.200.255.255",
"dtz" : "Australia/Sydney",
"dvc" : "888.99.100.1",
"dvchost" : "SKEELES10",
"eventid" : "1",
"filetype" : "Agent",
"mrt" : "1396328238973",
"name" : "Agent [test] type [testalertng] started",
"rt" : "1396328238937",
"severity" : "Low",
"signatureid" : "agent:030",
"version" : "0"
}
(1 row)
Create a columnar table and load CEF data
This example lets you compare the flex table for CEF data with a columnar table. You do so by creating a new table and loading the same CEF_sample.cef file used in the preceding flex table example.
-
Create a columnar table, col_logs
, defining the prefix names that are hard coded in fcefparser
:
=> CREATE TABLE col_logs(version INT,
devicevendor VARCHAR,
deviceproduct VARCHAR,
deviceversion VARCHAR,
signatureid VARCHAR,
name VARCHAR,
severity VARCHAR);
CREATE TABLE
-
Load the sample file into col_logs, as you did for the flex table:
=> COPY col_logs FROM '/home/dbadmin/data/CEF_sample.cef' PARSER fcefparser();
Rows Loaded
-------------
1
(1 row)
-
Query the table. The results contain the same information shown in the flex table output.
=> \x
Expanded display is on.
VMart=> SELECT * FROM col_logs;
-[ RECORD 1 ]-+----------------------------------------
version | 0
devicevendor | ArcSight
deviceproduct | ArcSight
deviceversion | 6.0.3.6664.0
signatureid | agent:030
name | Agent [test] type [testalertng] started
severity | Low
Compute keys and build a flex table view
In this example, you use a flex helper function to compute keys and build a view for the logs
flex table.
-
Use the compute_flextable_keys_and_build_view
function to compute keys and populate a view generated from the logs
flex table:
=> SELECT compute_flextable_keys_and_build_view('logs');
compute_flextable_keys_and_build_view
-------------------------------------------------------------------------------------
Please see public.logs_keys for updated keys
The view public.logs_view is ready for querying
(1 row)
-
Query the logs_keys
table to see what the function computed from the sample CEF data:
=> SELECT * FROM logs_keys;
key_name | frequency | data_type_guess
----------------------+-----------+-----------------
c6a4 | 1 | varchar(60)
c6a4label | 1 | varchar(36)
categoryobject | 1 | varchar(50)
categoryoutcome | 1 | varchar(20)
categorysignificance | 1 | varchar(20)
cs2 | 1 | varchar(84)
cs2label | 1 | varchar(44)
deviceproduct | 1 | varchar(20)
deviceversion | 1 | varchar(24)
devicezoneuri | 1 | varchar(180)
dvchost | 1 | varchar(20)
version | 1 | varchar(20)
ahost | 1 | varchar(20)
art | 1 | varchar(26)
at | 1 | varchar(22)
cat | 1 | varchar(28)
catdt | 1 | varchar(36)
devicevendor | 1 | varchar(20)
dtz | 1 | varchar(32)
dvc | 1 | varchar(24)
filetype | 1 | varchar(20)
mrt | 1 | varchar(26)
_cefver | 1 | varchar(20)
agentzoneuri | 1 | varchar(180)
agt | 1 | varchar(24)
aid | 1 | varchar(50)
atz | 1 | varchar(32)
av | 1 | varchar(24)
categorybehavior | 1 | varchar(28)
categorydevicegroup | 1 | varchar(24)
deviceseverity | 1 | varchar(20)
eventid | 1 | varchar(20)
name | 1 | varchar(78)
rt | 1 | varchar(26)
severity | 1 | varchar(20)
signatureid | 1 | varchar(20)
(36 rows)
-
Query several columns from the logs_view:
=> \x
Expanded display is on.
VMart=> select version, devicevendor, deviceversion, name, severity, signatureid
from logs_view;
-[ RECORD 1 ]-+----------------------------------------
version | 0
devicevendor | ArcSight
deviceversion | 6.0.3.6664.0
name | Agent [test] type [testalertng] started
severity | Low
signatureid | agent:030
Use the fcefparser delimiter parameter
In this example, you use the fcefparser delimiter parameter to load and query events located in California, New Mexico, and Arizona.
-
Create a new columnar table, CEFData3:
=> CREATE TABLE CEFData3(eventId INT, location VARCHAR(20));
CREATE TABLE
-
Using the delimiter=','
parameter, load some CEF data into the table:
=> COPY CEFData3 FROM stdin PARSER fcefparser(delimiter=',');
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> eventId=1,location=California
>> eventId=2,location=New Mexico
>> eventId=3,location=Arizona
>> \.
-
Query the table:
=> SELECT eventId, location FROM CEFData3;
eventId | location
---------+------------
1 | California
2 | New Mexico
3 | Arizona
(3 rows)
6 - Complex types
Tables can have columns of complex types, including nested complex types.
Tables can have columns of complex types, including nested complex types. You can use the ROW (struct), ARRAY, and SET types in native and external tables, including flex tables. Sets are limited to one-dimensional collections of scalar types. A limited MAP type is available for external tables, but you can use ARRAY and ROW to express a map instead. Selected parsers support loading data with complex types.
You can define a column for heterogeneous combinations of the ARRAY and ROW types: a struct containing array fields or an array of structs. These types can be nested up to the maximum nesting depth of 100.
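For example, the following sketch (a hypothetical table, not part of the documented examples) combines the two: a ROW that contains an ARRAY field, and an ARRAY of ROWs:
=> CREATE TABLE orders_nested(
    orderkey INT,
    contact ROW(name VARCHAR, emails ARRAY[VARCHAR(50)]),
    items ARRAY[ROW(sku VARCHAR(20), qty INT)]);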
Restrictions for native tables
Complex types used in native tables have some restrictions, in addition to the restrictions for individual types listed on their reference pages:
-
A native table must have at least one column that is a primitive type or a native array (one-dimensional array of a primitive type). If a flex table has real columns, it must also have at least one column satisfying this restriction.
-
Complex type columns cannot be used in ORDER BY or PARTITION BY clauses nor as FILLER columns.
-
Complex type columns cannot have constraints.
-
Expressions returning complex types cannot be used as projection columns, and projections cannot be segmented or ordered by columns of complex types.
-
Tables with columns of complex types cannot use DEFAULT and SET USING.
See CREATE TABLE and ALTER TABLE for additional restrictions.
Deriving a table definition from the data
You can use the INFER_TABLE_DDL function to inspect Parquet, ORC, or Avro data and produce a starting point for a table definition. This function returns a CREATE TABLE statement, which might require further editing. For columns where the function could not infer the data type, the function labels the type as unknown and emits a warning. For VARCHAR and VARBINARY columns, you might need to adjust the length. Always review the statement the function returns, but especially for tables with many columns, using this function can save time and effort.
In the following example, the data contains VARCHAR fields, and the function warns that you might need to adjust their lengths:
=> SELECT INFER_TABLE_DDL('/data/people/*.parquet'
USING PARAMETERS format = 'parquet', table_name = 'employees', table_type = 'native');
WARNING 9311: This generated statement contains one or more varchar/varbinary columns which default to length 80
INFER_TABLE_DDL
------------------------------------------------------------------------------------------------------------------
create table "employees"(
"employeeID" int,
"personal" Row(
"name" varchar,
"address" Row(
"street" varchar,
"city" varchar,
"zipcode" int
),
"taxID" int
),
"department" varchar
);
(1 row)
Alternative to strong typing
Though you can fully specify a column representing any combination of ROW and ARRAY types, there might be cases where you prefer a more flexible approach. If the data contains a struct with hundreds of fields, only a few of which you need, you might prefer to extract just those few at query time instead of defining all of the fields. Similarly, if the data structure is likely to change, you might prefer to defer fully specifying the complex types. You can use flexible columns as an alternative to fully specifying the structure of a complex column. This is the same approach used for flex tables, where all data is initially loaded into a single binary column and materialized from there as needed. See Flexible complex types for more information about using this approach.
6.1 - Structs
Columns can contain structs, which store (typed) property-value pairs.
Columns can contain structs, which store (typed) property-value pairs. For example, an address column could use a struct with strings for the street address, city/state, and postal code, such as { "street":"150 Cambridgepark Dr.", "city":"Cambridge MA", "postalcode":"02140"}
. (This is a Hive display format, not literally what is stored in the data file.) Struct fields can be primitive types or other complex types.
Use the ROW expression to define a struct column. In the following example, the data has columns for customer name, address, and account number, and the address is a struct in the data. The types you declare in Vertica must be compatible with the types in the data you load into them.
=> CREATE TABLE customers (
name VARCHAR,
address ROW(street VARCHAR, city VARCHAR, zipcode INT),
accountID INT);
Within the ROW, you specify the fields and their data types using the same syntax as for columns. Vertica treats the ROW as a single column for purposes of queries.
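For example, you can refer to individual fields with dot notation. A minimal sketch using the customers table above (see Rows (structs) for the full query syntax):
=> SELECT name, address.city FROM customers;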
Structs can contain other structs. In the following example, employees have various personal information, including an address which is itself a struct.
=> CREATE TABLE employees(
employeeID INT,
personal ROW(
name VARCHAR,
address ROW(street VARCHAR, city VARCHAR, zipcode INT),
taxID INT),
department VARCHAR);
Structs can contain arrays of primitive types, arrays, or structs.
=> CREATE TABLE customers(
name VARCHAR,
contact ROW(
street VARCHAR,
city VARCHAR,
zipcode INT,
email ARRAY[VARCHAR]
),
accountid INT );
When defining an external table, Vertica requires the definition of the table to match the schema of the external data. For example, with the data used in the previous employees example, the following definition is an error:
=> CREATE EXTERNAL TABLE employees(
employeeID INT,
personal ROW(
name VARCHAR,
address ROW(street VARCHAR, city VARCHAR),
zipcode INT,
taxID INT),
department VARCHAR)
AS COPY FROM '...' PARQUET;
ERROR 9151: Datatype mismatch [...]
The data contains an address struct with three fields (street, city, zipcode), so the external table must also use a ROW with three fields. Changing the ROW to have two fields and promoting one of the fields to the parent ROW is a mismatch. Each ROW must match and, if structs are nested in the data, the complete structure must match.
For native tables, you can specify which columns to load from the data, so you do not need to account for all of them. For the columns you load, the definition of the table must match the schema in the data file.
Handling nulls
If a struct exists but a field value is null, Vertica assigns NULL as its value in the ROW. A struct where all fields are null is treated as a ROW with null fields. If the struct itself is null, Vertica reads the ROW as NULL.
Queries
See Rows (structs).
Restrictions
ROW columns have several restrictions:
-
The maximum nesting depth is 100.
-
Vertica tables support up to 9800 columns and fields. The ROW itself is not counted, only its fields.
-
ROW columns cannot use any constraints (such as NOT NULL) or defaults.
-
ROW fields cannot be auto_increment or setof.
-
A ROW definition must include at least one field.
-
"Row" is a reserved keyword within a ROW definition, but is permitted as the name of a table or column.
-
ROW columns cannot be modified using ALTER TABLE...ALTER COLUMN.
-
Tables containing ROW columns cannot also contain identity, auto-increment, default, SET USING, or sequence columns.
6.2 - Arrays
Columns can contain arrays, which store ordered lists of elements of the same type.
Columns can contain arrays, which store ordered lists of elements of the same type. For example, an address column could use an array of strings to store multiple addresses that an individual might have, such as ['668 SW New Lane', '518 Main Ave', '7040 Campfire Dr']
.
There are two types of arrays:
-
Native array: a one-dimensional array of a primitive type.
-
Non-native array: all other supported arrays, including arrays that contain other arrays (multi-dimensional arrays) or structs (ROWs). Non-native arrays have some usage restrictions.
Use the ARRAY type to define an array column, specifying the type of its elements (a primitive type, a ROW (struct), or an array):
=> CREATE TABLE orders
(orderkey INT,
custkey INT,
prodkey ARRAY[VARCHAR(10)],
orderprices ARRAY[DECIMAL(12,2)],
orderdate DATE
);
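For illustration, the following sketch (with hypothetical values) inserts one row using array literals and then reads single elements by index; array indexes are zero-based:
=> INSERT INTO orders VALUES
    (1001, 50, ARRAY['P100','P207'], ARRAY[14.99,89.00], '2021-03-29');
=> SELECT orderkey, prodkey[0], orderprices[1] FROM orders;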
If an array is multi-dimensional, represent it as an array containing an array:
ARRAY[ARRAY[FLOAT]]
Queries
See Arrays and sets (collections).
Restrictions
-
Native arrays support only data of primitive types, for example, int, UUID, and so on.
-
Array dimensionality is enforced. A column cannot contain arrays of varying dimensions. For example, a column that contains a three-dimensional array can only contain other three-dimensional arrays; it cannot simultaneously include a one-dimensional array. However, the arrays in a column can vary in size, where one array can contain four elements while another contains ten.
-
Array bounds, if specified, are enforced for all operations that load or alter data. Unbounded arrays may have as many elements as will fit in the allotted binary size.
-
An array has a maximum binary size. If this size is not set when the array is defined, a default value is used.
-
Arrays do not support LONG types (like LONG VARBINARY or LONG VARCHAR) or user-defined types (like Geometry).
6.3 - Flexible complex types
When defining tables, you can use strongly-typed complex types to fully describe any combination of structs and arrays.
When defining tables, you can use strongly-typed complex types to fully describe any combination of structs and arrays. However, there are times when you might prefer not to:
-
If the data contains a struct with a very large number of fields, and in your queries you will need only a few of them, you can avoid having to enumerate the rest in the table DDL. Further, a deeply-nested set of structs could exceed the nesting limit for the table if fully specified.
-
If the data schema is still evolving, you can delay finalizing strongly-typed DDL.
-
If you anticipate the introduction of new fields in the data, you can use flexible types to discover them. A table with strong typing, on the other hand, would silently ignore those values. For an example of using flexible types to discover new fields, see Strong and Flexible Typing.
Flexible types are a way to store complex or unstructured data as a binary blob in one column, in a way that allows access to individual elements of that data. This is the same approach that Vertica uses with flex tables, which support loading unstructured or semi-structured data. In a flex table, all data from a source is loaded into a single VMap column named __raw__
. From this column you can materialize other columns, such as a specific field in JSON data, or use special lookup functions in queries to read values directly out of the __raw__
column.
Vertica uses a similar approach with complex types. You can describe the types fully using the ROW and ARRAY types in your table definition, or you can instead treat a complex type as a flexible type and not fully describe it. Each complex type that you choose to treat this way becomes its own flex-style column. You are not limited to a single column containing all data as in flex tables; instead, you can treat any complex type column, no matter how deeply it nests other types, as one flex-like column.
Defining flexible columns
To use a flexible complex type, declare the column as LONG VARBINARY. You might also need to set other parameters in the parser, as described in the parser documentation.
Consider a Parquet file with a restaurants table and the following columns:
-
name: varchar
-
cuisine type: varchar
-
location (cities): array[varchar]
-
menu: array of structs, each struct having an item name and a price
This data contains two complex columns, location (an array) and menu (an array of structs). The following example defines both columns as flexible columns by using LONG VARBINARY:
=> CREATE EXTERNAL TABLE restaurants(name VARCHAR, cuisine VARCHAR,
location_city LONG VARBINARY, menu LONG VARBINARY)
AS COPY FROM '/data/rest*.parquet'
PARQUET(allow_long_varbinary_match_complex_type='True');
The allow_long_varbinary_match_complex_type
parameter is specific to the Parquet parser. It is required if you define any column as a flexible type. Without this parameter, Vertica tries to match the LONG VARBINARY declaration in the table to a VARBINARY column in the Parquet file, finds a complex type instead, and reports a data-type mismatch.
You need not treat all complex columns as flexible types. The following definition is also valid:
=> CREATE EXTERNAL TABLE restaurants(
name VARCHAR, cuisine VARCHAR,
location_city ARRAY[VARCHAR,50],
menu LONG VARBINARY)
AS COPY FROM '/data/rest*.parquet'
PARQUET(allow_long_varbinary_match_complex_type='True');
For Parquet, ORC, and Avro data, you can use the INFER_TABLE_DDL function to derive a table definition from a data file. This function uses strong typing for complex types.
Querying flexible columns
Flexible columns are stored as LONG VARBINARY, so selecting them directly produces unhelpful results. Instead, use the flex mapping functions to extract values from these columns. The MAPTOSTRING function translates the complex type to JSON, as shown in the following example:
=> SELECT name, location_city, MAPTOSTRING(menu) AS menu FROM restaurants;
name | location_city | menu
-------------------+----------------------------+----------------------------------------
Bob's pizzeria | ["Cambridge","Pittsburgh"] | {
"0": {
"item": "cheese pizza",
"price": "$8.25"
},
"1": {
"item": "spinach pizza",
"price": "$10.50"
}
}
Bakersfield Tacos | ["Pittsburgh"] | {
"0": {
"item": "veggie taco",
"price": "$9.95"
},
"1": {
"item": "steak taco",
"price": "$10.95"
}
}
(2 rows)
The menu column is an array of structs. Notice that the output is a set of key/value pairs, with the key being the array index. Bob's Pizzeria has two items on its menu, and each value is a struct. The first item ("0") is a struct with an "item" value of "cheese pizza" and a "price" of "$8.25".
You can use keys to access specific values. The following example selects the first menu item from each restaurant. Note that all keys are strings, even array indexes:
=> SELECT name, location_city, menu['0']['item'] AS item, menu['0']['price'] AS price FROM restaurants;
name | location_city | item | price
-------------------+----------------------------+--------------+-------
Bob's pizzeria | ["Cambridge","Pittsburgh"] | cheese pizza | $8.25
Bakersfield Tacos | ["Pittsburgh"] | veggie taco | $9.95
(2 rows)
Instead of accessing specific indexes, you can use the MAPITEMS function in a subquery to explode a flexible type, as in the following example:
=> SELECT name, location_city, menu_items['item'], menu_items['price']
FROM (SELECT mapitems(menu, name, location_city) OVER(PARTITION BEST)
AS (indexes, menu_items, name, location_city)
FROM restaurants) explode_menu;
name | location_city | menu_items | menu_items
-------------------+----------------------------+---------------+------------
Bob's pizzeria | ["Cambridge","Pittsburgh"] | cheese pizza | $8.25
Bob's pizzeria | ["Cambridge","Pittsburgh"] | spinach pizza | $10.50
Bakersfield Tacos | ["Pittsburgh"] | veggie taco | $9.95
Bakersfield Tacos | ["Pittsburgh"] | steak taco | $10.95
(4 rows)
For a complete list of flex mapping functions, see Flex table data functions.
JSON and Avro flexible types
The parsers for JSON and Avro support both flexible and strong types for complex types. When using flexible complex types or loading into a flex table, use the flatten_maps
and flatten_arrays
parameters to control how the parser handles complex data. These parsers ignore these parameters for strongly-typed complex types.
The following example demonstrates the use of flexible complex types. Consider a JSON file containing the following data:
{
"name" : "Bob's pizzeria",
"cuisine" : "Italian",
"location_city" : ["Cambridge", "Pittsburgh"],
"menu" : [{"item" : "cheese pizza", "price" : "$8.25"},
{"item" : "spinach pizza", "price" : "$10.50"}]
}
{
"name" : "Bakersfield Tacos",
"cuisine" : "Mexican",
"location_city" : ["Pittsburgh"],
"menu" : [{"item" : "veggie taco", "price" : "$9.95"},
{"item" : "steak taco", "price" : "$10.95"}]
}
Create a table, using LONG VARBINARY for the flexible complex types, and load data specifying these parameters:
=> CREATE TABLE restaurant(name VARCHAR, cuisine VARCHAR,
location_city LONG VARBINARY, menu LONG VARBINARY);
=> COPY restaurant FROM '/data/restaurant.json'
PARSER FJSONPARSER(flatten_maps=false, flatten_arrays=false);
You can use Flex functions and direct access (through indexes) to return readable values:
=> SELECT MAPTOSTRING(location_city), MAPTOSTRING(menu) FROM restaurant;
maptostring | maptostring
-------------------------------------------------+--------------------------------------------------------
{
"0": "Cambridge",
"1": "Pittsburgh"
} | {
"0": {
"item": "cheese pizza",
"price": "$8.25"
},
"1": {
"item": "spinach pizza",
"price": "$10.50"
}
}
{
"0": "Pittsburgh"
} | {
"0": {
"item": "veggie taco",
"price": "$9.95"
},
"1": {
"item": "steak taco",
"price": "$10.95"
}
}
(2 rows)
=> SELECT menu['0']['item'] FROM restaurant;
menu
--------------
cheese pizza
veggie taco
(2 rows)
The COPY statement shown in this example sets flatten_maps
to false. Without that change, the keys for the complex columns would not work as expected, because record and array keys would be "flattened" at the top level. Querying menu['0']['item'] would produce no results. Instead, query flattened values as in the following example:
=> SELECT menu['0.item'] FROM restaurant;
menu
--------------
veggie taco
cheese pizza
(2 rows)
Flattening directives apply to the entire COPY statement. You cannot flatten some columns and not others, or prevent flattening values in a complex column that is itself within a flattened flex table. Because flexible complex types and strongly-typed complex types require different values for flattening, you cannot combine strong and flexible complex types in the same load operation.
6.4 - System tables for complex types
Information about all complex types is recorded in the COMPLEX_TYPES system table.
Information about all complex types is recorded in the COMPLEX_TYPES system table. You must have read permission for the external table that uses a type to see its entries in this system table. Complex types are not shown in the TYPES system table.
For ROW types, each row in COMPLEX_TYPES represents one field of one ROW. The field name is the name used in the table definition if present, or a generated name beginning with _field otherwise. Each row also includes the (generated) name of its containing type, a string beginning with _ct_. ("CT" stands for "complex type".)
The following example defines one external table and then shows the types in COMPLEX_TYPES:
=> CREATE EXTERNAL TABLE warehouse(
name VARCHAR, id_map MAP<INT,VARCHAR>,
data row(record INT, total FLOAT, description VARCHAR(100)),
prices ARRAY[INT], comment VARCHAR(200), sales_total FLOAT, storeID INT)
AS COPY FROM ... PARQUET;
=> SELECT type_id,type_kind,type_name,field_id,field_name,field_type_name,field_position
FROM COMPLEX_TYPES ORDER BY type_id,field_name;
type_id | type_kind | type_name | field_id | field_name | field_type_name | field_position
-------------------+-----------+-----------------------+----------+-------------+-----------------+----------------
45035996274278280 | Map | _ct_45035996274278280 | 6 | key | int | 0
45035996274278280 | Map | _ct_45035996274278280 | 9 | value | varchar(80) | 1
45035996274278282 | Row | _ct_45035996274278282 | 9 | description | varchar(80) | 2
45035996274278282 | Row | _ct_45035996274278282 | 6 | record | int | 0
45035996274278282 | Row | _ct_45035996274278282 | 7 | total | float | 1
45035996274278284 | Array | _ct_45035996274278284 | 6 | | int | 0
(6 rows)
This output shows the fields for the MAP, ROW, and ARRAY types defined in the warehouse table. When a ROW contains another ROW, the field_type_name column for the nested field uses the generated name of the contained ROW, and the same number, minus the leading "_ct_", serves as the field_id.
7 - Handling Non-UTF-8 input
Vertica supports loading data files in the Unicode UTF-8 format.
Vertica supports loading data files in the Unicode UTF-8 format. You can load ASCII data, which is UTF-8 compatible. Character sets like ISO 8859-1 (Latin1) are incompatible with UTF-8 and are not directly supported.
If you have data that does not meet the UTF-8 standard, you can modify the data during the load or you can transform the data files before loading.
Before loading data from text files, you can use several Linux tools to ensure that your data is in UTF-8 format. The file
command reports the encoding of any text files. For example:
$ file Date_Dimension.tbl
Date_Dimension.tbl: ASCII text
The file
command could indicate ASCII text even though the file contains multibyte characters.
To check for multibyte characters in an ASCII file, use the wc
command. For example:
$ wc Date_Dimension.tbl
1828 5484 221822 Date_Dimension.tbl
If the wc
command returns an error such as Invalid or incomplete multibyte or wide character
, the data file is using an incompatible character set.
This example shows two files that are not UTF-8 data files:
$ file data*
data1.txt: Little-endian UTF-16 Unicode text
data2.txt: ISO-8859 text
The results indicate that neither of the files is in UTF-8 format.
Converting data while loading
You can remove or replace non-UTF-8 characters in text data during the load. The MAKEUTF8 function removes such characters by default, or you can specify a replacement string.
The following example shows how to use this function during a load. The original data is loaded into the orig_name
column, and the transformed data is loaded into the name
column. Typically you would use a FILLER column for the original value instead of adding the column to the table definition; this example adds the column to show the differences side by side.
=> CREATE TABLE people (orig_name VARCHAR, name VARCHAR);
CREATE TABLE
=> COPY people (orig_name, name AS MAKEUTF8(orig_name)) FROM ...;
Rows Loaded
-------------
8
(1 row)
=> SELECT * FROM people;
orig_name | name
----------+--------
Dáithí | Dith
Fíona | Fona
Móirín | Mirn
Róisín | Risn
Séamus | Samus
Séan | San
Tiarnán | Tiarnn
Áine | ine
(8 rows)
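To substitute a replacement string rather than removing the characters, MAKEUTF8 accepts a replacement string. A minimal sketch, assuming the same load as above and MAKEUTF8's replacement_string parameter:
=> COPY people (orig_name, name AS MAKEUTF8(orig_name USING PARAMETERS replacement_string='*')) FROM ...;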
For general information about transforming data, see Transforming data during loads.
Converting files before loading data
To convert files before loading them into Vertica, use the iconv
UNIX command. For example, to convert the data2.txt
file from the previous example, use the iconv
command as follows:
$ iconv -f ISO88599 -t utf-8 data2.txt > data2-utf8.txt
See the man pages for file
and iconv
for more information.
Checking UTF-8 compliance after loading data
After loading data, use the ISUTF8 function to verify that all of the string-based data in the table is in UTF-8 format. For example, if you loaded data into a table named people
that has a VARCHAR column named name
, you can use this statement to verify that all of the strings are UTF-8 encoded:
=> SELECT name FROM people WHERE NOT ISUTF8(name);
If all of the strings are in UTF-8 format, the query should not return any rows.
8 - Transforming data during loads
To promote a consistent database and reduce the need for scripts to transform data at the source, you can transform data with an expression as part of loading.
To promote a consistent database and reduce the need for scripts to transform data at the source, you can transform data with an expression as part of loading. Transforming data while loading lets you compute values to insert into a target column, either from other columns or from values in the data that you load as FILLER columns (see Deriving Table Columns From Data File Columns). You can transform data to be loaded into columns of scalar types and native arrays, but not other complex types.
For example, you might have text data that is not compatible with UTF-8, the encoding that Vertica expects. You can use the MAKEUTF8 function during load to remove or replace non-UTF-8 characters, as illustrated in Converting Data While Loading. Or you might want to extract fields for day, month, and year from a single input date.
When transforming data during a load, you load the data into a column normally, and then use that column in an expression to populate another column. The COPY statement must always contain at least one parsed column, which can be a FILLER column. You can intersperse parsed and computed columns in a COPY statement.
The following example extracts day, month, and year columns from a single input date column:
=> CREATE TABLE purchases
(id INT, year VARCHAR(10), month VARCHAR(10), day VARCHAR(10), ts TIMESTAMP);
=> COPY purchases (id, year AS TO_CHAR(ts,'YYYY'),
month AS TO_CHAR(ts,'MM'), day AS TO_CHAR(ts, 'DD'),
ts FORMAT 'YYYY-MM-DD') FROM STDIN;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> 1943|2021-03-29
>> 1256|2021-03-30
>> 1512|2021-03-31
>> \.
=> SELECT * FROM purchases;
id | year | month | day | ts
------+------+-------+-----+---------------------
1256 | 2021 | 03 | 30 | 2021-03-30 00:00:00
1512 | 2021 | 03 | 31 | 2021-03-31 00:00:00
1943 | 2021 | 03 | 29 | 2021-03-29 00:00:00
(3 rows)
The input data has two columns, id
and ts
(timestamp). The COPY statement specifies the format of the timestamp column using the FORMAT option. The TO_CHAR function uses that format information to extract the year
, month
, and day
columns.
Using expressions in COPY statements
The expression in a COPY statement can be as simple as a single column, or more complex, such as a case statement for multiple columns. An expression can specify multiple columns, and multiple expressions can refer to the same parsed column. You can use expressions for columns of all supported data types.
COPY expressions can use many SQL functions, operators, constants, NULLs, and comments.
Requirements and restrictions:
-
COPY expressions cannot use SQL meta-functions, analytic functions, aggregate functions, or computed columns.
-
For computed columns, you must list all parsed columns in the COPY statement expression. Do not specify FORMAT or RAW in the source data for a computed column.
-
The return data type of the expression must be coercible to that of the target column. Parsed column parameters are also coerced to match the expression.
Handling expression errors
Errors in expressions within your COPY statement are SQL errors. As such, they are handled differently from parse errors. When a parse error occurs, COPY rejects the row and adds it to the rejected data file or table. COPY also adds the reason for a rejected row to the exceptions file or the rejected data table. For example, COPY parsing does not implicitly cast data types. If a type mismatch occurs between the data being loaded and a column type (such as attempting to load a text value into a FLOAT column), COPY rejects the row and continues processing.
If an error occurs in an expression in your COPY statement, then by default the entire load fails. For example, if your COPY statement uses a function expression, and a syntax error exists in that expression, the entire load is rolled back. All SQL errors, including ones caused by rolling back the COPY, are stored in the Vertica log file. However, unlike parse rejections and exception messages, SQL expression errors are brief and may require further research.
You can have COPY treat errors in transformation expressions like parse errors. Rejected rows are added to the same file or table, and exceptions are added to the same exceptions file or table. To enable this behavior, set the CopyFaultTolerantExpressions configuration parameter to 1. (See General parameters.)
Loading data with expression rejections is potentially slower than loading with the same number of parse rejections. Enable expression rejections if your data has a few bad rows, to allow the rest of the data to be loaded. If you are concerned about the time it takes to complete a load with many bad rows, use the REJECTMAX parameter to set a limit. If COPY finds more than REJECTMAX bad rows, it aborts and rolls back the load.
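The following sketch (the table, path, and expression are hypothetical) enables expression rejections at the database level, then caps rejections for a single load and captures them in a table:
=> ALTER DATABASE DEFAULT SET PARAMETER CopyFaultTolerantExpressions = 1;
=> COPY sales(id, amount FILLER FLOAT, amount_cents AS (amount * 100)::INT)
   FROM '/data/sales.dat' REJECTED DATA AS TABLE sales_rejects REJECTMAX 100;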
See Handling messy data for more information about managing rejected data.
Deriving table columns from data file columns
When loading data, your source data might contain one or more columns that do not exist in the target table. Or, the source and target tables have matched columns, but you want to omit one or more source columns from the target table.
Use the FILLER parameter to identify a column of source data that COPY can ignore or use to compute new values that are loaded into the target table. The following requirements apply:
-
Define the FILLER parameter data type so it is compatible with the source data. For example, be sure to define a VARCHAR in the target table so its length can contain all source data; otherwise, data might be truncated. You can specify multiple filler columns by using the FILLER parameter more than once in the COPY statement.
Important
If the source field's data type is VARCHAR, be sure to set the VARCHAR length to ensure that the combined length of all FILLER source fields does not exceed the target column's defined length; otherwise, the COPY command might return with an error.
-
The name of the filler column must not match the name of any column in the target table.
In the following example, the table has columns for first name, last name, and full name, but the data being loaded contains columns for first, middle, and last names. The COPY statement reads all of the source data but only loads the source columns for first and last names. It constructs the data for the full name by concatenating each of the source data columns, including the middle name. The middle name is read as a FILLER column so it can be used in the concatenation, but is ignored otherwise. (There is no table column for middle name.)
=> CREATE TABLE names(first VARCHAR(20), last VARCHAR(20), full VARCHAR(60));
CREATE TABLE
=> COPY names(first,
middle FILLER VARCHAR(20),
last,
full AS first||' '||middle||' '||last)
FROM STDIN;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> Marc|Gregory|Smith
>> Sue|Lucia|Temp
>> Jon|Pete|Hamilton
>> \.
=> SELECT * from names;
first | last | full
-------+----------+--------------------
Jon | Hamilton | Jon Pete Hamilton
Marc | Smith | Marc Gregory Smith
Sue | Temp | Sue Lucia Temp
(3 rows)
9 - Distributing a load
Vertica can divide the work of loading data among multiple database nodes, taking advantage of parallelism to speed up the operation.
Vertica can divide the work of loading data among multiple database nodes, taking advantage of parallelism to speed up the operation. How this is done depends on where the data is and what types of parallelism the parsers support.
Vertica can be most effective in distributing a load when the data to be loaded is found in shared storage available to all nodes. Sometimes, however, data is available only on specific nodes, which you must specify.
Types of load parallelism
Vertica supports several types of parallelism. The default (DELIMITED) parser uses load parallelism, and user-defined parsers can enable it.
-
Distributed load: Files in a multi-file load are loaded on several nodes in parallel, instead of all being loaded on a single node.
-
Apportioned load: A single large file or other single source is divided into segments (portions), which are assigned to several nodes to be loaded in parallel. Apportioned load is enabled by default. To disable it, set the EnableApportionLoad configuration parameter to 0.
-
Cooperative parse: A source being loaded on a single node uses multi-threading to parallelize the parse. Cooperative parse is enabled by default. To disable it, set the EnableCooperativeParse configuration parameter to 0.
See General parameters for information about the configuration parameters.
Loading on specific nodes
You can indicate which node or nodes should parse an input path by using any of the following:
Using the ON ANY NODE clause indicates that the source file to load is available on all of the nodes. If you specify this clause, COPY opens the file and parses it from any node in the cluster. ON ANY NODE is the default for HDFS and S3 paths.
Using the ON nodeset clause indicates that the source file is on all named nodes. If you specify this clause, COPY opens the file and parses it from any node in the set. Be sure that the source file you specify is available and accessible on each applicable cluster node.
If the data to be loaded is on a client, use COPY FROM LOCAL instead of specifying nodes. All local files are loaded and parsed serially with each COPY statement, so you cannot perform parallel loads with the LOCAL option.
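For example, a minimal sketch of a client-side load (the file path is hypothetical):
=> COPY t FROM LOCAL '/home/user/data/file1.dat' DELIMITER '|';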
Specifying distributed file loads
You can direct individual files in a multi-file load to specific nodes, as in the following example of distributed load.
=> COPY t FROM '/data/file1.dat' ON v_vmart_node0001, '/data/file2.dat' ON v_vmart_node0002;
You can use globbing (wildcard expansion) to specify a group of files with the ON ANY NODE directive, as in the following example.
-
If apportioned load is enabled (the default), Vertica assigns different files to different nodes. Both the EnableApportionLoad and EnableApportionedFileLoad configuration parameters must be set to 1.
-
If apportioned load is disabled, a single node loads all the data.
=> COPY t FROM '/data/*.dat' ON ANY NODE;
If you have a single file instead of a group of files, you can still, potentially, benefit from apportioned load. The file must be large enough to divide into portions at least equal to ApportionedFileMinimumPortionSizeKB in size, and this size must be large enough to contain at least one whole record. You must also use a parser that supports apportioned load. The delimited parser built into Vertica supports apportioned load, but other parsers might not.
The following example shows how you can load a single large file using multiple nodes.
=> COPY t FROM '/data/bigfile.dat' ON ANY NODE;
You can limit the nodes that participate in an apportioned load. Doing so is useful if you need to balance several concurrent loads. Vertica apportions each load individually; it does not account for other loads that might be in progress on those nodes. You can, therefore, potentially speed up your loads by managing apportioning yourself.
The following example shows how you can apportion loads on specific nodes.
=> COPY t FROM '/data/big1.dat' ON (v_vmart_node0001, v_vmart_node0002, v_vmart_node0003),
'/data/big2.dat' ON (v_vmart_node0004, v_vmart_node0005);
Loaded files can be of different formats, such as BZIP, GZIP, and others. However, because file compression is a filter, you cannot use apportioned load for a compressed file.
Specifying distributed loads with sources
You can also apportion loads using COPY WITH SOURCE. You can create sources and parsers with the user-defined load (UDL) API. If both the source and parser support apportioned load, and EnableApportionLoad is set, then Vertica attempts to divide the load among nodes.
The following example shows a load that you could apportion.
=> COPY t WITH SOURCE MySource() PARSER MyParser();
The built-in delimited parser supports apportioning, so you can use it with a user-defined source, as in the following example.
=> COPY t WITH SOURCE MySource();
Number of load streams
Although the number of files you can load is not restricted, the optimal number of load streams depends on several factors. Using too many load streams can deplete or reduce the system memory required for optimal query processing. See Best practices for managing workload resources for advice on configuring load streams.
10 - Using transactions to stage a load
By default, COPY automatically commits itself and other current transactions except when loading temporary tables or querying external tables.
By default, COPY
automatically commits itself and other current transactions except when loading temporary tables or querying external tables. You can override this behavior by qualifying the COPY
statement with the NO COMMIT
option. When you specify NO COMMIT
, Vertica does not commit the transaction until you explicitly issue a COMMIT
statement.
You can use COPY...NO COMMIT
in two ways:
Combine multiple COPY statements in the same transaction
When you combine multiple COPY...NO COMMIT
statements in the same transaction, Vertica can consolidate the data for all operations into fewer ROS containers, and thereby perform more efficiently.
For example, the following set of COPY...NO COMMIT
statements performs several copy statements sequentially, and then commits them all. In this way, all of the copied data is either committed or rolled back as a single transaction.
COPY... NO COMMIT;
COPY... NO COMMIT;
COPY... NO COMMIT;
COPY X FROM LOCAL NO COMMIT;
COMMIT;
Tip
Be sure to commit or roll back any previous DML operations before you use COPY...NO COMMIT
. Otherwise, COPY...NO COMMIT
is liable to include earlier operations that are still in progress, such as INSERT
, in its own transaction. In this case, the previous operation and copy operation are combined as a single transaction and committed together.
Check constraint violations
If constraints are not enforced in the target table, COPY
does not check for constraint violations when it loads data. To troubleshoot loaded data for constraint violations, use COPY...NO COMMIT
with
ANALYZE_CONSTRAINTS
. Doing so enables you detect constraint violations before you commit the load operation and, if necessary, roll back the operation. For details, see Detecting constraint violations.
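A minimal sketch of this pattern (the table name and path are hypothetical):
=> COPY customer_dim FROM '/data/customers.dat' NO COMMIT;
=> SELECT ANALYZE_CONSTRAINTS('customer_dim');
=> ROLLBACK;  -- or COMMIT, if no violations are reported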
11 - Handling messy data
Loading data with COPY has two main phases, parsing and loading.
Loading data with COPY
has two main phases, parsing and loading. During parsing, if COPY
encounters errors it rejects the faulty data and continues loading data. Rejected data is created whenever COPY
cannot parse a row of data. Following are some parser errors that can cause a rejected row:
-
Unsupported parser options
-
Incorrect data types for the table into which data is being loaded, including incorrect data types for members of collections
-
Malformed context for the parser in use
-
Missing delimiters
Optionally, COPY
can reject data and continue loading when transforming data during the load phase. This behavior is controlled by a configuration parameter. By default, COPY
aborts a load if it encounters errors during the loading phase.
Several optional parameters let you determine how strictly COPY
handles rejections. For example, you can have COPY
fail when it rejects a single row, or allow a specific number of rejections before the load fails. This section presents the parameters to determine how COPY
handles rejected data.
Save rejected rows (REJECTED DATA and EXCEPTIONS)
The COPY
statement automatically saves a copy of each rejected row in a rejected-data file. COPY
also saves a corresponding explanation of what caused the rejection in an exceptions file. By default, Vertica saves both files in a database catalog subdirectory, called CopyErrorLogs
, as shown in this example:
v_mart_node003_catalog\CopyErrorLogs\trans-STDIN-copy-from-rejected-data.1
v_mart_node003_catalog\CopyErrorLogs\trans-STDIN-copy-from-exceptions.1
You can optionally save COPY
rejections and exceptions in one of two other ways:
-
Use the REJECTED DATA reject_path and EXCEPTIONS except_path parameters to save both outputs to locations of your choice. REJECTED DATA records rejected rows, while EXCEPTIONS records a description of why each row was rejected. If a path value is an existing directory or ends in '/', or the load includes multiple sources, files are written in that directory. (COPY creates the directory if it does not exist.) If a path value is a file, COPY uses it as a file prefix if multiple files are written.
-
Use the REJECTED DATA AS TABLE reject_table clause. This option writes both the rejected data and the exception descriptions to the same table. For more information, see Saving rejected data to a table.
Note
Vertica recommends saving rejected data to a table. However, saving to a table excludes saving to a default or specific rejected data file.
If you save rejected data to a table, the table files are stored in the data subdirectory. For example, in a VMart database installation, rejected data table records are stored in the RejectionTableData
directory as follows:
$ cd v_mart_node003_data\RejectionTableData\
$ ls
TABLE_REJECTED_RECORDS_"bg"_mytest01.example.-25441:0x6361_45035996273805099_1.1
TABLE_REJECTED_RECORDS_"bg"_mytest01.example.-25441:0x6361_45035996273805113_2.2
.
.
.
TABLE_REJECTED_RECORDS_"delimr"_mytest01.example.-5958:0x3d47_45035996273815749_1.1
TABLE_REJECTED_RECORDS_"delimr"_mytest01.example.-5958:0x3d47_45035996273815749_1.2
COPY LOCAL rejected data
For COPY LOCAL
operations, if you use REJECTED DATA
or EXCEPTIONS
with a file path, the files are written on the client. If you want rejections to be available on all nodes, use REJECTED DATA AS TABLE
instead of REJECTED DATA
.
Enforce truncating or rejecting rows (ENFORCELENGTH)
When parsing data of type CHAR
, VARCHAR
, BINARY
, or VARBINARY
, values can exceed the declared column length. By default, COPY truncates such values without rejecting the rows. Use the ENFORCELENGTH parameter to reject rows whose values exceed the declared length instead of truncating them.
For example, loading 'abc'
into a table column specified as VARCHAR(2)
results in COPY
truncating the value to 'ab'
and loading it. Loading the same row with the ENFORCELENGTH
parameter causes COPY
to reject the row.
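For example, a sketch that rejects over-length values into a rejections table instead of truncating them (the table and column names are hypothetical):
=> CREATE TABLE short_codes (code VARCHAR(2));
=> COPY short_codes FROM STDIN ENFORCELENGTH REJECTED DATA AS TABLE short_codes_rejects;
With this statement, a value such as 'abc' lands in short_codes_rejects rather than being loaded as 'ab'.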
Note
Vertica supports NATIVE
and NATIVE VARCHAR
values up to 65K. If any value exceeds this limit, COPY
rejects the row, even when ENFORCELENGTH
is not in use.
Specify a maximum number of rejections (REJECTMAX)
The REJECTMAX
parameter specifies the maximum number of logical records that can be rejected before a load fails. A rejected row consists of the data that could not be parsed (or optionally transformed) into the corresponding data type during a bulk load. Rejected rows do not include data that violates referential constraints. For information about using constraints, and the option of enforcing constraints during bulk loading, see Constraints.
When the number of rejected records becomes equal to the REJECTMAX
value, the load fails. If you do not specify a value for REJECTMAX
, or if the value is 0, COPY
allows an unlimited number of exceptions to occur.
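For example, a sketch that fails the load as soon as the number of rejections reaches ten (the table and file names are hypothetical):
=> COPY orders FROM '/data/orders.csv' DELIMITER ',' REJECTMAX 10;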
If you allow COPY
to reject rows and proceed when it encounters transformation errors, consider using REJECTMAX
to limit the impact. See Handling Transformation Errors.
By default, COPY
aborts a load if it encounters errors when performing transformations. This is the default because rejecting transformation errors is potentially more expensive than rejecting parse errors. Sometimes, however, you might prefer to load the data anyway and reject the problematic rows, as COPY does for parse errors.
To have COPY
treat errors in transformation expressions like parse errors, set the CopyFaultTolerantExpressions configuration parameter to 1. (See General parameters.) Rows that are rejected during transformation, in the expression-evaluation phase of a data load, are written to the same destination as rows rejected during parsing. Use REJECTED DATA
or REJECTED DATA AS TABLE
to specify the output location.
You might want to enable transformation rejections if your data contains a few bad rows. By enabling these rejections, you can load the majority of your data and proceed. Vertica recommends using REJECTMAX
when enabling transformation rejections.
If your data contains many bad values, the performance of loading the good rows can be worse than it would be for data that produces only parser errors.
Abort data loads for any error (ABORT ON ERROR)
Using the ABORT ON ERROR
argument is the most restrictive way to load data, because no exceptions or rejections are allowed. A COPY
operation stops if any row is rejected. No data is loaded and Vertica rolls back the command.
If you use ABORT ON ERROR as part of a CREATE EXTERNAL TABLE AS COPY FROM
statement, the option is used whenever a query references the external table. The offending error is saved in the COPY
exceptions or rejected data file.
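For example, a sketch that refuses to load anything if even one row cannot be parsed (the table and file names are hypothetical):
=> COPY orders FROM '/data/orders.csv' DELIMITER ',' ABORT ON ERROR;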
Understanding row rejections and rollback errors
Depending on the type of error that COPY
encounters, Vertica does one of the following:
Note
If you specify ABORT ON ERROR
with the COPY
statement, the load automatically rolls back if COPY
cannot parse any row.
The following summarizes the reasons for rejected rows or rollbacks.
Rejected rows: COPY cannot parse rows that contain any of the following:
-
Incompatible data types
-
Missing fields
-
Missing delimiters
Load rollback: COPY rolls back a load if it encounters any of these conditions:
-
Server-side errors, such as lack of memory
-
Primary key or foreign key constraint violations
-
Loading NULL data into a NOT NULL column
-
Transformation errors (by default)
This example illustrates what happens when Vertica cannot coerce a row to the requested data type. In the following COPY statement, "a::INT + b::INT" is a SQL expression in which a and b are derived values:
=> CREATE TABLE t (i INT);
=> COPY t (a FILLER VARCHAR, b FILLER VARCHAR, i AS a::INT + b::INT)
FROM STDIN;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> cat|dog
>> \.
Vertica cannot parse the row to the requested data type and rejects the row:
ERROR 2827: Could not convert "cat" from column "*FILLER*".a to an int8
If a
resolved to 'cat'
and b
to 'dog'
, the next expression 'cat'::INT + 'dog'::INT
would return an expression evaluator error:
=> SELECT 'cat'::INT + 'dog'::INT;
ERROR 3681: Invalid input syntax for integer: "cat"
The following COPY
statement would also roll back because Vertica cannot parse the row to the requested data type:
=> COPY t (a FILLER VARCHAR, i AS a::INT) FROM STDIN;
In the following COPY
statement, Vertica rejects only the offending row without rolling back the statement. Instead of evaluating the 'cat'
row as a VARCHAR type, COPY parses 'cat'
directly as an INTEGER.
=> COPY t (a FILLER INT, i AS a) FROM STDIN;
In the following example, transformation errors are rejected instead of aborting the load.
=> ALTER DATABASE DEFAULT SET CopyFaultTolerantExpressions = 1;
ALTER DATABASE
=> CREATE TABLE sales (price INTEGER);
=> COPY sales FROM STDIN REJECTED DATA AS TABLE sales_rej;
dollars
\.
=> SELECT rejected_data, rejected_reason FROM sales_rej;
rejected_data | rejected_reason
---------------+-----------------------------------------------
dollars | Invalid integer format 'dollars' for column 1 (price)
(1 row)
11.1 - Saving load rejections (REJECTED DATA)
COPY load rejections are data rows that did not load due to a parser exception or, optionally, transformation error.
COPY
load rejections are data rows that did not load due to a parser exception or, optionally, transformation error. By default, if you do not specify a rejected data file, COPY
saves rejected data files to this location:
catalog_dir/CopyErrorLogs/target_table-source-copy-from-rejected-data.n
where:
-
catalog_dir: The database catalog files directory, for example: /home/dbadmin/VMart/v_vmart_node0001_catalog
-
target_table: The table into which data was loaded.
-
source: The source of the load data, which can be STDIN or a file name, such as baseball.csv.
-
copy-from-rejected-data.n: The default name for a rejected data file, followed by an n suffix indicating the file number, such as .1, .2, .3. For example, this default file name indicates file 3 after loading from STDIN: fw-STDIN-copy-from-rejected-data.3.
Saving rejected data to the default location, or to a location of your choice, lets you review the file contents, resolve problems, and reload the data from the rejected data files. Saving rejected data to a table lets you query the table to see rejected data rows and the reasons (exceptions) why the rows could not be parsed. Vertica recommends saving rejected data to a table.
Multiple rejected data files
Unless a load is very small (< 10MB), COPY creates more than one file to hold rejected rows. Several factors determine how many files COPY creates for rejected data, including the following:
-
Number of sources being loaded
-
Total number of rejected rows
-
Size of the source file (or files)
-
Cooperative parsing and number of threads being used
-
UDLs that support apportioned loads
-
For your own COPY parser, the number of objects returned from prepareUDSources()
Naming conventions for rejected files
You can specify one or more rejected data files with the files you are loading. Use the REJECTED DATA parameter to specify a file location and name, and separate consecutive rejected data file names with a comma (,). Do not use the ON ANY NODE
option because it is applicable only to load files.
If you specify one or more files, and COPY requires multiple files for rejected data, COPY uses the rejected data file names you supply as a prefix, and appends a numeric suffix to each rejected data file. For example, if you specify the name my_rejects for the REJECTED DATA parameter, and the file you are loading is large enough (> 10MB), several files such as the following will exist (a corresponding COPY statement is sketched after this list):
-
my_rejects-1
-
my_rejects-2
-
my_rejects-3
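A hedged sketch of the corresponding COPY statement, with hypothetical table, file, and path names:
=> COPY big_table FROM '/data/large_input.dat' DELIMITER '|'
   REJECTED DATA '/loads/my_rejects';
Because the source is large, COPY may write several rejected data files such as my_rejects-1, my_rejects-2, and so on, as described above.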
COPY uses cooperative parsing by default, having the nodes parse a specific part of the file contents. Depending on the file or portion size, each thread generates at least one rejected data file per source file or portion, and returns load results to the initiator node. The file suffix is a thread index when COPY uses multiple threads (.1, .2, .3, and so on).
The maximum number of rejected data files cannot be greater than the number of sources being loaded per thread parsing any portion. The resource pool determines the maximum number of threads; cooperative parse uses all available threads.
If you use COPY with a UDL that supports apportioned load, the file suffix is an offset value. UDLs that support apportioned loading make cooperative parsing unnecessary. For apportioned loads, COPY creates at least one rejected data file per data portion, and more files depending on the size of the load and the number of rejected rows.
For all data loads except COPY LOCAL
, COPY
behaves as follows:
If you do not specify a rejected data file:
-
For a single data file (pathToData or STDIN), COPY stores one or more rejected data files in the default location.
-
For multiple source files, COPY stores all rejected data in separate files in the default directory, using the source file name as a file name prefix, as noted.
-
Rejected data files are returned to the initiator node.
If you specify a rejected data file:
-
For one data file, COPY interprets the rejected data path as a file, and stores all rejected data at that location. If more than one file is required because of parallel processing, COPY appends a numeric suffix. If the path is not a file, COPY returns an error.
-
For multiple source files, COPY interprets the rejected data path as a directory. COPY stores all information in separate files, one for each source. If the path is not a directory, COPY returns an error. COPY accepts only one path per node. For example, if you specify the rejected data path as my_rejected_data, COPY creates a directory of that name on each node. If you provide more than one rejected data path, COPY returns an error.
-
Rejected data files are not shipped to the initiator node.
Maximum length of file names
Loading multiple input files in one statement requires specifying full path names for each file. Keep in mind that long input file names, combined with rejected data file names, can exceed the operating system's maximum length (typically 255 characters). To work around file names that exceed the maximum length, use a path for the rejected data file that differs from the default path; for example, /tmp/<shorter-file-name>.
11.2 - Saving rejected data to a table
Use the REJECTED DATA parameter with the AS TABLE clause to specify a table in which to save rejected data.
Use the REJECTED DATA
parameter with the AS TABLE
clause to specify a table in which to save rejected data. Saving rejected data to a file is mutually exclusive with using the AS TABLE
clause.
When you use the AS TABLE
clause, Vertica creates a new table if one does not exist, or appends to an existing table. If no parsing rejections occur during a load, the table exists but is empty. The next time you load data, Vertica inserts any rejected rows to the existing table.
The load rejection tables are a special type of table with the following capabilities and limitations:
-
Support SELECT
statements
-
Can use DROP TABLE
-
Cannot be created outside of a COPY
statement
-
Do not support DML and DDL activities
-
Are not K-safe
To make the data in a rejected table K-safe, you can do one of the following:
-
Write a CREATE TABLE..AS
statement, such as this example:
=> CREATE TABLE new_table AS SELECT * FROM rejected_table;
-
Create a table to store rejected records, and run INSERT...SELECT operations into the new table (see the sketch after this list)
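A minimal sketch of the second approach, assuming you have already created a regular table named saved_rejects (a hypothetical name) whose columns match the rejection table:
=> INSERT INTO saved_rejects SELECT * FROM rejected_table;
=> COMMIT;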
Using COPY NO COMMIT
If the COPY
statement includes options NO COMMIT
and REJECTED DATA AS TABLE
, and the reject-table
does not already exist, Vertica Analytic Database saves the rejected data table as a LOCAL TEMP table and returns a message that a LOCAL TEMP table is being created.
Rejected-data tables are useful for Extract-Load-Transform workflows, where you will likely use temporary tables more frequently. The rejected-data tables let you quickly load data and identify which records failed to load. If you load data into a temporary table that you created using the ON COMMIT DELETE
clause, the COPY
operation will not commit.
Location of rejected data table records
When you save rejected records to a table, using the REJECTED DATA AS TABLE
table_name
option, the data for the table is saved in a database data subdirectory, RejectionTableData
. For example, for a VMart
database, table data files reside here:
/home/dbadmin/VMart/v_vmart_node0001_data/RejectionTableData
Rejected data tables include both rejected data and the reason for the rejection (exceptions), along with other data columns, described next. Vertica suggests that you periodically drop any rejected data tables that you no longer require.
Querying a rejected data table
When you specify a rejected data table when loading data with COPY
, you can query that table for information about rejected data after the load operation is complete. For example:
-
Create the loader
table:
=> CREATE TABLE loader(x INT);
CREATE TABLE
-
Use COPY
to load values, saving rejected data to a table, loader_rejects
:
=> COPY loader FROM STDIN REJECTED DATA AS TABLE loader_rejects;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> 1
>> 2
>> 3
>> a
>> \.
-
Query the loader
table after loading data:
=> SELECT * FROM loader;
x
---
1
2
3
(3 rows)
-
Query the loader_rejects
table to see its column rows:
=> SELECT * FROM loader_rejects;
-[ RECORD 1 ]-------------+--------------------------------------------
node_name | v_vmart_node0001
file_name | STDIN
session_id | v_vmart_node0001.example.-24016:0x3439
transaction_id | 45035996274080923
statement_id | 1
batch_number | 0
row_number | 4
rejected_data | a
rejected_data_orig_length | 1
rejected_reason | Invalid integer format 'a' for column 1 (x)
The rejected data table has the following columns:
-
node_name (VARCHAR): The name of the Vertica node on which the input load file was located.
-
file_name (VARCHAR): The name of the file being loaded, which applies if you loaded a file (as opposed to using STDIN).
-
session_id (VARCHAR): The session ID number in which the COPY statement occurred.
-
transaction_id (INTEGER): Identifier for the transaction within the session, if any; otherwise NULL.
-
statement_id (INTEGER): The unique identification number of the statement within the transaction that included the rejected data. Tip: You can use the session_id, transaction_id, and statement_id columns to create joins with many system tables. For example, if you join against the QUERY_REQUESTS table using those three columns, the QUERY_REQUESTS.REQUEST column contains the actual COPY statement (as a string) used to load this data.
-
batch_number (INTEGER): INTERNAL USE. Represents which batch (chunk) the data comes from.
-
row_number (INTEGER): The rejected row number from the input file, or -1 if it could not be determined. The value can be -1 when using cooperative parse. Each parse operation resets the row number, so in an apportioned load, there can be several entries with the same row number but different rows.
-
rejected_data (LONG VARCHAR): The data that was not loaded.
-
rejected_data_orig_length (INTEGER): The length of the rejected data.
-
rejected_reason (VARCHAR): The error that caused the rejected row. This column returns the same message that exists in a load exceptions file when you do not save to a table.
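Building on the tip above, a hedged sketch of a join against QUERY_REQUESTS that recovers the COPY statement which produced the rejections (assuming the rejected-data table is named loader_rejects):
=> SELECT q.request
   FROM loader_rejects r
   JOIN query_requests q
   ON r.session_id = q.session_id
   AND r.transaction_id = q.transaction_id
   AND r.statement_id = q.statement_id;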
Exporting the rejected records table
You can export the contents of the column rejected_data
to a file to capture only the data rejected during the first COPY statement. Then, correct the data in the file, save it, and load the updated file.
To export rejected records:
-
Create a sample table:
=> CREATE TABLE t (i int);
CREATE TABLE
-
Copy data directly into the table, using a table to store rejected data:
=> COPY t FROM STDIN REJECTED DATA AS TABLE t_rejects;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> 1
>> 2
>> 3
>> 4
>> a
>> b
>> c
>> \.
-
Show only tuples and set the output format:
=> \t
Showing only tuples.
=> \a
Output format is unaligned.
-
Output to a file:
=> \o rejected.txt
=> select rejected_data from t_rejects;
=> \o
-
Use the cat
command on the saved file:
=> \! cat rejected.txt
a
b
c
After a file exists, you can fix load errors and use the corrected file as load input to the COPY
statement.
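Once you have corrected the values in the saved file (here renamed rejected_fixed.txt, a hypothetical name), a minimal reload might look like this:
=> COPY t FROM LOCAL 'rejected_fixed.txt';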
11.3 - Saving load exceptions (EXCEPTIONS)
COPY exceptions consist of informational messages describing why a row of data could not be parsed.
COPY exceptions consist of informational messages describing why a row of data could not be parsed. The EXCEPTIONS option lets you specify a file to which COPY writes exceptions. If you omit this option, COPY saves exception files to the following path: catalog-dir/CopyErrorLogs/table-sourcefile-copy-from-exceptions, where:
-
catalog-dir
is the directory holding the database catalog files
-
table
is the name of the table being loaded into
-
sourcefile
is the name of the file being loaded
Note
REJECTED DATA AS TABLE is mutually exclusive with EXCEPTIONS.
The file produced by the EXCEPTIONS option indicates the line number and the reason for each exception.
If copying from STDIN, the source file name is STDIN.
You can specify rejected data and exceptions files for individual files in a data load. Separate rejected data and exception file names with commas in the COPY statement.
You must specify a filename in the path to load multiple input files. Keep in mind that long table names combined with long data file names can exceed the operating system's maximum length (typically 255 characters). To work around file names exceeding the maximum length, use a path for the exceptions file that differs from the default path; for example, /tmp/<shorter-file-name>
.
If you specify an EXCEPTIONS path:
-
For one data file, the path must be a file, and COPY stores all information in this file.
-
For multiple data files, the path must be a directory. COPY creates one file in this directory for each data file.
-
Exceptions files are not stored on the initiator node.
-
You can specify only one path per node.
If you do not specify the EXCEPTIONS path, COPY stores exception files in the default directory.
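For example, a sketch that saves exceptions and rejected data for two source files to explicit per-source paths (all table, file, and path names are hypothetical):
=> COPY t FROM '/data/file1.dat', '/data/file2.dat' DELIMITER '|'
   EXCEPTIONS '/tmp/exceptions1', '/tmp/exceptions2'
   REJECTED DATA '/tmp/rejects1', '/tmp/rejects2';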
11.4 - COPY rejected data and exception files
When executing a COPY statement with parallel processing ON (the default setting), COPY creates separate threads to process load files.
When executing a COPY statement with parallel processing ON (the default setting), COPY creates separate threads to process load files. Typically, the number of threads depends on the number of node cores in the system. Each node processes part of the load data. If the load succeeds overall, any parser rejections that occur during load processing are written to that node's specific rejected data and exceptions files. If the load fails, the rejected data file contents can be incomplete or empty. If you do not specify a file name explicitly, COPY uses a default name and location for rejected data files. See Specifying rejected data and exceptions files, below, for how to supply your own.
Both rejected data and exceptions files are saved and stored on a per-node basis. This example uses multiple files as COPY
inputs. Since the statement does not include either the REJECTED DATA
or EXCEPTIONS
parameters, rejected data and exceptions files are written to the default location, the database catalog subdirectory, CopyErrorLogs
, on each node:
\set dir `pwd`/data/
\set remote_dir /vertica/test_dev/tmp_ms/
\set file1 '''':dir'C1_large_tbl.dat'''
\set file2 '''':dir'C2_large_tbl.dat'''
\set file3 '''':remote_dir'C3_large_tbl.dat'''
\set file4 '''':remote_dir'C4_large_tbl.dat'''
=> COPY large_tbl FROM :file1 ON site01, :file2 ON site01,
:file3 ON site02,
:file4 ON site02
DELIMITER '|';
Specifying rejected data and exceptions files
The optional COPY REJECTED DATA and EXCEPTIONS parameters each take a 'path' element that lets you specify a non-default path in which to store the files.
If path resolves to a storage location and the user invoking COPY is not a superuser, the user must have been granted read access to that storage location (see CREATE LOCATION and GRANT (storage location)).
Both parameters also have an optional ON
nodename clause that uses the specified path:
...[ EXCEPTIONS 'path' [ ON nodename ] [, ...] ]...[ REJECTED DATA 'path' [ ON nodename ] [, ...] ]
While 'path' specifies the location of the rejected data and exceptions files (with their corresponding parameters), the optional ON
nodename clause moves any existing rejected data and exception files on the node to the specified path on the same node.
Saving rejected data and exceptions files to a single server
The COPY
statement does not have a facility to merge exception and rejected data files after COPY
processing is complete. To see the contents of exception and rejected data files requires accessing each node's specific files.
Note
To save all exceptions and rejected data files on a network host, be sure to give each node's files unique names, so that different cluster nodes do not overwrite other nodes' files. For instance, if you set up a server with two directories (/vertica/exceptions
and /vertica/rejections
), specify file names for each Vertica cluster node to identify each node, such as node01_exceptions.txt
and node02_exceptions.txt
. This way, each cluster node's files are easily distinguishable in the exceptions and rejections directories.
Using VSQL variables for rejected data and exceptions files
This example uses vsql
variables to specify the path and file names to use with the exceptions
and rejected data
parameters (except_s1
and reject_s1
). The COPY statement specifies a single input file for table large_tbl on the initiator node:
\set dir `pwd`/data/
\set file1 '''':dir'C1_large_tbl.dat'''
\set except_s1 '''':dir'exceptions'''
\set reject_s1 '''':dir'rejections'''
COPY large_tbl FROM :file1 ON site01 DELIMITER '|'
REJECTED DATA :reject_s1 ON site01
EXCEPTIONS :except_s1 ON site01;
This example uses variables to specify exception and rejected data files (except_s2 and reject_s2) on a remote node. The COPY statement consists of a single input file on a remote node (site02):
\set remote_dir /vertica/test_dev/tmp_ms/
\set except_s2 '''':remote_dir'exceptions'''
\set reject_s2 '''':remote_dir'rejections'''
COPY large_tbl FROM :file1 ON site02 DELIMITER '|'
REJECTED DATA :reject_s2 ON site02
EXCEPTIONS :except_s2 ON site02;
This example uses variables to specify that the exception and rejected data files are on a remote node (indicated by :remote_dir
). The inputs to the COPY statement consist of multiple data files on two nodes (site01
and site02
). The exceptions
and rejected data
options use the ON
nodename
clause with the variables to indicate where the files reside (site01
and site02
):
\set dir `pwd`/data/
\set remote_dir /vertica/test_dev/tmp_ms/
\set except_s1 '''':dir''''
\set reject_s1 '''':dir''''
\set except_s2 '''':remote_dir''''
\set reject_s2 '''':remote_dir''''
COPY large_tbl FROM :file1 ON site01,
:file2 ON site01,
:file3 ON site02,
:file4 ON site02
DELIMITER '|'
REJECTED DATA :reject_s1 ON site01, :reject_s2 ON site02
EXCEPTIONS :except_s1 ON site01, :except_s2 ON site02;
11.5 - COPY LOCAL rejection and exception files
Invoking COPY LOCAL (or COPY LOCAL FROM STDIN) does not automatically create rejected data and exceptions files.
Invoking COPY LOCAL (or COPY LOCAL FROM STDIN) does not automatically create rejected data and exceptions files. This behavior differs from using COPY, which saves both files automatically, regardless of whether you use the optional REJECTED DATA
and EXCEPTIONS
parameters to specify either file explicitly.
Use the REJECTED DATA
and EXCEPTIONS
parameters with COPY LOCAL and COPY LOCAL FROM STDIN to save the corresponding output files on the client. If you do not use these options, rejected data parsing events (and the exceptions that describe them) are not retained, even if they occur.
You can load multiple input files using COPY LOCAL (or COPY LOCAL FROM STDIN). If you also use the REJECTED DATA
and EXCEPTIONS
options, the statement writes rejected rows and exceptions to separate files. Each file contains all rejected rows or all corresponding exceptions, regardless of how many input files were loaded.
If COPY LOCAL does not reject any rows, it does not create either file.
Note
Because COPY LOCAL (and COPY LOCAL FROM STDIN) must write any rejected rows and exceptions to the client, you cannot use the [ON nodename ]
clause with either the rejected data
or exceptions
options.
Specifying rejected data and exceptions files
To save any rejected data and their exceptions to files:
-
In the COPY LOCAL (and COPY LOCAL FROM STDIN) statement, use the REJECTED DATA 'path'
and the EXCEPTIONS 'path'
parameters, respectively.
-
Specify two different file names for the two options. You cannot use one file for both the REJECTED DATA
and the EXCEPTIONS
.
-
When you invoke COPY LOCAL or COPY LOCAL FROM STDIN, the files you specify need not pre-exist. If they do, COPY LOCAL must be able to overwrite them.
You can specify the path and file names with vsql variables:
\set rejected ../except_reject/copyLocal.rejected
\set exceptions ../except_reject/copyLocal.exceptions
Note
Using COPY LOCAL
does not support storing rejected data in a table, as you can when using the COPY
statement.
When you use the COPY LOCAL or COPY LOCAL FROM STDIN statement, specify the variable names for the files with their corresponding parameters (the data file name in the first statement is a placeholder for your own client-side file):
=> COPY large_tbl FROM LOCAL 'datafile.dat' REJECTED DATA :rejected EXCEPTIONS :exceptions; -- 'datafile.dat' is a placeholder
=> COPY large_tbl FROM LOCAL STDIN REJECTED DATA :rejected EXCEPTIONS :exceptions;
12 - Monitoring COPY loads and metrics
You can check COPY loads using Vertica functions and system tables.
You can check COPY loads using:
-
Vertica functions (GET_NUM_ACCEPTED_ROWS and GET_NUM_REJECTED_ROWS)
-
The LOAD_STREAMS system table
-
The LOAD_SOURCES system table
More generally, the EXECUTION_ENGINE_PROFILES system table records information about query events, including loads.
Using Vertica functions
Two meta-functions return COPY metrics for the number of accepted or rejected rows from the most recent COPY statement run in the current session:
-
To get the number of accepted rows, use the GET_NUM_ACCEPTED_ROWS function:
=> select get_num_accepted_rows();
get_num_accepted_rows
-----------------------
11
(1 row)
-
To check the number of rejected rows, use the GET_NUM_REJECTED_ROWS function:
=> select get_num_rejected_rows();
get_num_rejected_rows
-----------------------
0
(1 row)
Note
GET_NUM_ACCEPTED_ROWS and GET_NUM_REJECTED_ROWS support loads from STDIN, COPY LOCAL from a Vertica client, or a single file on the initiator. You cannot use these functions for multi-node loads.
Using the CURRENT_LOAD_SOURCE function
You can include the CURRENT_LOAD_SOURCE
function as a part of the COPY statement. Doing so allows you to insert into a column the input file name or a value computed by this function.
To insert the file names into a column from multiple source files:
=> COPY t (c1, c2, c3 as CURRENT_LOAD_SOURCE()) FROM '/home/load_file_1' ON exampledb_node02,
'/home/load_file_2' ON exampledb_node03 DELIMITER ',';
Using the LOAD_STREAMS system table
Vertica includes a set of system tables that include monitoring information. The LOAD_STREAMS system table includes information about load stream metrics from COPY and COPY FROM VERTICA statements. Thus, you can query table values to get COPY metrics.
Vertica maintains system table metrics until they reach a designated size quota (in kilobytes). This quota is set through internal processes, which you cannot set or view directly.
Labeling copy streams
COPY can include the STREAM NAME parameter to label its load stream so it is easy to identify in the LOAD_STREAMS system table. For example:
=> COPY mytable FROM myfile DELIMITER '|' STREAM NAME 'My stream name';
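You can then filter LOAD_STREAMS by that label, for example:
=> SELECT stream_name, accepted_row_count, rejected_row_count
   FROM load_streams WHERE stream_name = 'My stream name';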
Load stream metrics
The following LOAD_STREAMS
columns show the status of a load as it progresses:
-
ACCEPTED_ROW_COUNT: Increases during parsing, up to the maximum number of rows in the input file.
-
PARSE_COMPLETE_PERCENT: Remains zero (0) until all named pipes return an EOF. While COPY awaits an EOF from multiple pipes, it can appear to be hung. However, before canceling the COPY statement, check your system CPU and disk accesses to determine if any activity is in progress. In a typical load, the PARSE_COMPLETE_PERCENT value can either increase slowly or jump quickly to 100%, if you are loading from named pipes or STDIN.
-
SORT_COMPLETE_PERCENT: Remains at 0 when loading from named pipes or STDIN. After PARSE_COMPLETE_PERCENT reaches 100 percent, SORT_COMPLETE_PERCENT increases to 100 percent.
Depending on the data sizes, a significant lag can occur between the time PARSE_COMPLETE_PERCENT
reaches 100 percent and the time SORT_COMPLETE_PERCENT
begins to increase.
This example queries load stream data from the LOAD_STREAMS system table:
=> \pset expanded
Expanded display is on.
=> SELECT stream_name, table_name, load_start, accepted_row_count,
   rejected_row_count, read_bytes, input_file_size_bytes, parse_complete_percent,
   unsorted_row_count, sorted_row_count, sort_complete_percent FROM load_streams;
-[ RECORD 1 ]----------+---------------------------
stream_name | fact-13
table_name | fact
load_start | 2010-12-28 15:07:41.132053
accepted_row_count | 900
rejected_row_count | 100
read_bytes | 11975
input_file_size_bytes | 0
parse_complete_percent | 0
unsorted_row_count | 3600
sorted_row_count | 3600
sort_complete_percent | 100
Using the LOAD_SOURCES system table
The LOAD_STREAMS table shows the total number of rows that were loaded or rejected. Grouping this information by source can help you determine from where data is coming. The LOAD_SOURCES system table includes some of the same data as LOAD_STREAMS does but adds this source-specific information. If apportioning is enabled, LOAD_SOURCES also provides information about how loads are apportioned among nodes.
You can use this table to identify causes of differing query results. For example, you can use the following statement to create an external table based on globs:
=> CREATE EXTERNAL TABLE tt AS COPY WITH SOURCE AWS(dir = 'foo', file = '*');
If you select from this table, Vertica loads data from every file in the foo
directory and creates one row in the LOAD_SOURCES table for each file. Suppose you later repeat the query and see different results. You could look at the LOAD_SOURCES table and discover that—between the two queries—somebody added another file to the foo
directory. Because each file is recorded in LOAD_SOURCES, you can see the new file that explains the changed query results.
If you are using many data sources, you might prefer to disable this reporting. To disable reporting, set the LoadSourceStatisticsLimit configuration parameter to 0. This parameter sets the upper bound on the number of sources profiled by LOAD_SOURCES per load. The default value is 256.
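For example, a minimal sketch that disables per-source profiling at the database level:
=> ALTER DATABASE DEFAULT SET LoadSourceStatisticsLimit = 0;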
13 - Using load scripts
You can write and run a load script for the COPY statement using a simple text-delimited file format.
You can write and run a load script for the COPY statement using a simple text-delimited file format. For information about other load formats see Data formats. Vertica recommends that you load the smaller tables before the largest tables. To check data formats before loading, see Handling Non-UTF-8 input.
Using absolute paths in a load script
Unless you are using the COPY FROM LOCAL
statement, using COPY
on a remote client requires an absolute path for a data file. You cannot use relative paths on a remote client. For a load script, you can use vsql variables to specify the locations of data files relative to your Linux working directory.
To use vsql variables to specify data file locations:
-
Create a vsql variable containing your Linux current directory.
\set t_pwd `pwd`
-
Create another vsql variable that uses a path relative to the Linux current directory variable for a specific data file.
\set input_file '\'':t_pwd'/Date_Dimension.tbl\''
-
Use the second variable in the COPY statement:
=> COPY Date_Dimension FROM :input_file DELIMITER '|';
-
Repeat steps 2 and 3 to load all data files.
Note
COPY FROM LOCAL
does not require an absolute path for data files. You can use paths that are relative to the client's directory system.
Running a load script
You can run a load script on any host, as long as the data files are on that host.
-
Change your Linux working directory to the location of the data files.
$ cd /opt/vertica/doc/retail_example_database
-
Run the Administration Tools.
$ /opt/vertica/bin/admintools
-
Connect to the database.
-
Run the load script.
14 - Troubleshooting data loads
You might encounter the following issues when loading data.
You might encounter the following issues when loading data. For issues specific to external tables, see Troubleshooting external tables.
Cannot load data because all columns are complex types
A native table must contain at least one column that is either a scalar type or a native array (one-dimensional array of scalar types). If the data you wish to load consists entirely of complex types, you must add a scalar column to the table before loading the data. You do not need to populate the scalar column.
First, create a native table with the columns in the data plus one scalar column, and then load the data into that table as shown in the following example:
-- Native table with extra scalar column:
=> CREATE TABLE customers(person ROW(name VARCHAR, card VARCHAR), tempval int);
CREATE TABLE
-- Load complex data, ignoring scalar column:
=> COPY customers(person) FROM :custdata PARQUET;
Rows Loaded
-------------
3
(1 row)
Reads from Parquet files report unexpected data-type mismatches
If a Parquet file contains a column of type STRING but the column in Vertica is of a different type, such as INTEGER, you might see an unclear error message. In this case Vertica reports the column in the Parquet file as BYTE_ARRAY, as shown in the following example:
ERROR 7247: Datatype mismatch: column 2 in the parquet_cpp source
[/tmp/nation.0.parquet] has type BYTE_ARRAY, expected int
This behavior is specific to Parquet files; with an ORC file the type is correctly reported as STRING. The problem occurs because Parquet does not natively support the STRING type and uses BYTE_ARRAY for strings instead. Because the Parquet file reports its type as BYTE_ARRAY, Vertica has no way to determine if the type is actually a BYTE_ARRAY or a STRING.
Error 7087: wrong number of columns
When loading ORC or Parquet data, you might see an error stating that you have the wrong number of columns:
=> CREATE TABLE nation (nationkey bigint, name varchar(500),
regionkey bigint, comment varchar(500));
CREATE TABLE
=> COPY nation from :orc_dir ORC;
ERROR 7087: Attempt to load 4 columns from an orc source
[/tmp/orc_glob/test.orc] that has 9 columns
When you load data from ORC or Parquet files, your table must consume all of the data in the file, or this error results. To avoid this problem, add the missing columns to your table definition.
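If you are not sure which columns the file contains, one option is to let Vertica derive a matching definition with INFER_TABLE_DDL (the path and table name below are hypothetical):
=> SELECT INFER_TABLE_DDL('/tmp/orc_glob/test.orc' USING PARAMETERS format = 'orc', table_name = 'nation');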
For Parquet data, time zones in timestamp values are not correct
Reading timestamps from a Parquet file in Vertica might result in different values, based on the local time zone. This issue occurs because the Parquet format does not support the SQL TIMESTAMP data type. If you define the column in your table with the TIMESTAMP data type, Vertica interprets timestamps read from Parquet files as values in the local time zone. This same behavior occurs in Hive. When this situation occurs, Vertica produces a warning at query time such as the following:
WARNING 0: SQL TIMESTAMPTZ is more appropriate for Parquet TIMESTAMP
because values are stored in UTC
When creating the table in Vertica, you can avoid this issue by using the TIMESTAMPTZ data type instead of TIMESTAMP.
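For example, a minimal external table sketch that declares the timestamp column as TIMESTAMPTZ (the table and path are hypothetical):
=> CREATE EXTERNAL TABLE events (id INT, created TIMESTAMPTZ)
   AS COPY FROM '/data/events/*.parquet' PARQUET;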
Time zones can also be incorrect in ORC data, but the reason is different.
For ORC data, time zones in timestamp values are not correct
Vertica and Hive both use the Apache ORC library to interact with ORC data. The behavior of this library changed with Hive version 1.2.0, so timestamp representation depends on what version was used to write the data.
When writing timestamps, the ORC library now records the time zone in the stripe footer. Vertica looks for this value and applies it when loading timestamps. If the file was written with an older version of the library, the time zone is missing from the file.
If the file does not contain a time zone, Vertica uses the local time zone and logs an ORC_FILE_INFO event in the QUERY_EVENTS system table.
The first time you query a new ORC data source, you should query this table to look for missing time zone information:
=> SELECT event_category, event_type, event_description, operator_name, event_details, COUNT(event_type)
AS COUNT FROM QUERY_EVENTS WHERE event_type ILIKE 'ORC_FILE_INFO'
GROUP BY event_category, event_type, event_description, operator_name, event_details
ORDER BY event_details;
event_category | event_type | event_description | operator_name | event_details | count
----------------+---------------+----------------------------------------------------+---------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------+-------
EXECUTION | ORC_FILE_INFO | ORC file does not have writer timezone information | OrcParser | Timestamp values in the ORC source [data/sales_stats.orc] will be computed using local timezone | 2
(1 row)
Time zones can also be incorrect in Parquet data, but the reason is different.
Some date and timestamp values are wrong by several days
When Hive writes ORC or Parquet files, it converts dates before 1583 from the Gregorian calendar to the Julian calendar. Vertica does not perform this conversion. If your file contains dates before this time, values in Hive and the corresponding values in Vertica can differ by up to ten days. This difference applies to both DATE and TIMESTAMP values.
15 - Working with external data
An alternative to importing data into Vertica is to query it in place.
An alternative to importing data into Vertica is to query it in place. Querying external data instead of importing it can be advantageous in some cases:
-
If you want to explore data, such as in a data lake, before selecting data to load into Vertica.
-
If you are one of several consumers sharing the same data, for example in a data lake, then reading it in place eliminates concerns about whether query results are up to date. There's only one copy, so all consumers see the same data.
-
If your data changes rapidly but you do not want to stream it into Vertica, you can instead query the latest updates automatically.
-
If you have a very large volume of data and do not want to increase your license capacity.
-
If you have lower-priority data in Vertica that you still want to be able to query.
To query external data, you must describe your data as an external table. Like native tables, external tables have table definitions and can be queried. Unlike native tables, external tables have no catalog and Vertica loads selected data from the external source as needed. For some formats, the query planner can take advantage of partitions and sorting in the data, so querying an external table does not mean you load all of the data at query time. (For more information about native tables, see Working with native tables.)
There is one special type of external data not covered in this section. If you are reading data from Hadoop, and specifically from a Hive data warehouse, then instead of defining your own external tables you can read the schema information from Hive. For more information, see Using the HCatalog Connector.
15.1 - How external tables differ from native tables
You can use external tables in the same ways you use Vertica native tables.
You can use external tables in the same ways you use Vertica native tables. Because the data is external to the database, however, there are some differences in how external tables operate.
Data
The data for an external table can reside anywhere, so long as all database nodes can access it. S3, HDFS, and NFS mount points are common places to find external data. Naturally, querying external data can incur some latency compared to querying locally-stored ROS data, but Vertica has optimizations that can reduce the impact. For example, Vertica can take advantage of node and rack locality for HDFS data.
Because the data is external, Vertica loads external data each time you query it. Vertica is optimized to reduce the volume of read data, including predicate pushdown and partition pruning for formats that support partitioning. The ORC and Parquet formats support these optimizations.
Because the data is read at query time, you must ensure that your users have and retain permission to read the data in its original location. Depending on where the data is stored, you might need to take additional steps to manage access, such as creating AWS IAM roles on S3.
Because the data is not stored in Vertica, external tables do not use superprojections and buddy projections.
Resource consumption
External tables add very little to the Vertica catalog, which reduces the resources that queries consume. Because the data is not stored in Vertica, external tables are not affected by the Tuple Mover and do not cause ROS pushback. Vertica uses a small amount of memory when reading external table data, because the table contents are not part of your database and are parsed each time the external table is used.
Backup and restore
Because the data in external tables is managed outside of Vertica, only the external table definitions, not the data files, are included in database backups. Arrange for a separate backup process for your external table data.
DML support
External tables allow you to read external data. They do not allow you to modify it. Some DML operations are therefore not available for external tables, including:
-
DELETE FROM
-
INSERT INTO
-
SELECT...FOR UPDATE
Sequences and identity columns
The COPY statement definition for external tables can include identity columns and sequences. Whenever a SELECT statement queries the external table, sequences and identity columns are re-evaluated. This can change the external table column values, even if the underlying external data remains the same.
15.2 - Creating external tables
To create an external table you combine a table definition with a copy statement using the CREATE EXTERNAL TABLE AS COPY statement.
To create an external table you combine a table definition with a copy statement using the CREATE EXTERNAL TABLE AS COPY statement. CREATE EXTERNAL TABLE AS COPY uses a subset of parameters from CREATE TABLE and COPY.
You define your table columns as you would for a Vertica native table using CREATE TABLE. You also specify a COPY FROM clause to describe how to read the data, as you would for loading data. How you specify the FROM path depends on where the file is located and the data format. See Specifying where to load data from and Data formats.
As with native tables, you can use the INFER_TABLE_DDL function to derive column definitions from data files in the Parquet, ORC, and Avro formats.
When defining an external table for ORC or Parquet data, you must define all of the data columns in the file. You may omit partition columns. If you omit data columns, queries using the table abort with an error. For other data formats, you can select only the data columns of interest.
If you load from multiple ORC or Parquet files in the same COPY statement, and any of them is aborted, the entire load aborts. This behavior differs from that for delimited files, where the COPY statement loads what it can and ignores the rest.
When you create an external table, data is not added to the database and no projections are created. Instead, Vertica performs a syntactic check of the CREATE EXTERNAL TABLE AS COPY statement and stores the table name and COPY statement definition in the catalog. Each time a SELECT query references an external table, Vertica parses and executes the stored COPY statement to obtain the referenced data. Any problems in the table definition, such as incorrect column types, can be discovered only by querying the table.
Successfully returning data from an external table requires that the COPY definition be correct, and that other dependencies, such as files, nodes, and other resources are accessible and available at query time. If the table definition uses globs (wildcards), and files are added or deleted, the data in the external table can change between queries.
The following example defines an external table for delimited data stored in HDFS:
=> CREATE EXTERNAL TABLE sales (itemID INT, date DATE, price FLOAT)
AS COPY FROM 'hdfs:///dat/ext1.csv' DELIMITER ',';
The following example uses data in the ORC format that is stored in S3. The data has two partition columns, represented as directories in the file structure. For more information about partitions, see Using partition columns.
=> CREATE EXTERNAL TABLE transactions (id int, name varchar(50), created date, region varchar(50))
AS COPY FROM 's3://datalake/sales/*/*/*'
ORC(hive_partition_cols='created,region');
The following example shows how you can read from all Parquet files in a local directory, with no partitions and no globs:
=> CREATE EXTERNAL TABLE sales (itemID INT, date DATE, price FLOAT)
AS COPY FROM '/data/sales/*.parquet' PARQUET;
When using the ORC and Parquet formats, Vertica supports some additional options in the COPY
statement and data structures for columns. See ORC (parser) and PARQUET (parser).
If ORC or Parquet data is partitioned, Vertica expects Hive-style partitioning. If you see unexpected results when reading data, verify that globs in your file paths correctly align with the partition structure. See Troubleshooting external tables.
Special considerations for external tables
If the maximum length of a column is smaller than the actual data, such as a VARCHAR that is too short, Vertica truncates the data and logs the event.
You can see unexpected query results if constraints on columns cause values to be rejected:
-
If you specify a NOT NULL column constraint and the data contains null values, those rows are rejected.
-
If you use ENFORCELENGTH, values that are too long are rejected rather than being truncated.
-
When reading ORC data, if you declare a scalar precision and some data does not fit, that row is rejected. For example, if you specify a column as Decimal(6,5), a value of 123.456 is rejected.
One way to know if column constraints have caused data to be rejected is if COUNT on a column returns a different value than COUNT(*).
When using the COPY parameter ON ANY NODE, confirm that the source file definition is identical on all nodes. Specifying different external files can produce inconsistent results.
If your data is in Parquet or ORC format, you can take advantage of partitioning to limit the amount of data that Vertica reads. These formats are special in this respect because they embed metadata in the file headers. For more information about using partitioned data, see Using partition columns.
Canceling a CREATE EXTERNAL TABLE AS COPY statement can cause unpredictable results. If you realize after beginning the operation that your table definition is incorrect (for example, you inadvertently specify the wrong external location), wait for the query to complete. When the external table exists, use DROP TABLE to remove its definition.
Tip
When working with a new external data source, consider setting REJECTMAX to 1 to make problems in the data apparent. Testing in this way allows you to discover problems in the data before running production queries against it.
After you create an external table, analyze its row count to improve query performance. See Improving Query Performance for External Tables.
Required permissions
In addition to having permission in Vertica, users must have read access to the external data.
-
For data on the local disk this access is governed by local file permissions.
-
For data in HDFS, access might be governed by Kerberos authentication. See Accessing kerberized HDFS data.
-
For data on S3, you need access through an AWS IAM role. See S3 object store.
For data in GCS, you must enable S3 compatibility before reading data. See Google Cloud Storage (GCS) object store.
By default, you must also be a database superuser to access external tables through a SELECT statement.
In most cases, to allow users without superuser access to query external tables, an administrator must create a USER storage location and grant those users read access to the location. See CREATE LOCATION and GRANT (storage location). This location must be a parent of the path used in the COPY statement when creating the external table. This requirement does not apply to external tables stored in HDFS. The following example shows granting access to a user named Bob to any external table whose data is located under /tmp
(including in subdirectories to any depth):
=> CREATE LOCATION '/tmp' ALL NODES USAGE 'user';
=> GRANT ALL ON LOCATION '/tmp' to Bob;
Organizing external table data
If the data you store in external tables changes regularly (for instance, each month in the case of storing recent historical data), your COPY definition statement can use wildcards (globs) to make parsing the stored COPY statement definition more dynamic. For instance, if you store monthly data on an NFS mount, you could organize monthly files within a top-level directory for a calendar year, such as:
/2018/monthly/
In this case, the external table COPY statement includes a wildcard definition such as the following:
=> CREATE EXTERNAL TABLE archive (...) AS COPY FROM '/nfs_name/2018/monthly/*'
Whenever a Vertica query references the external table archive
, and Vertica parses the COPY statement, all stored data in the top-level monthly
directory is accessible to the query.
Validating table definitions
When you create an external table, Vertica validates the syntax of the CREATE EXTERNAL TABLE AS COPY FROM statement. For example, if you omit a required keyword in the statement, creating the external table fails:
=> CREATE EXTERNAL TABLE ext (ts timestamp, d varchar)
AS COPY '/home/dbadmin/designer.log';
ERROR 2778: COPY requires a data source; either a FROM clause or a WITH SOURCE for a user-defined source
Checking other components of the COPY definition, such as path statements and node availability, does not occur until a SELECT query references the external table.
To validate an external table definition, run a SELECT query that references the external table. Check that the returned query data is what you expect. If the query does not return data correctly, check the COPY exception and rejected data log files.
Because the COPY definition determines what occurs when you query an external table, COPY statement errors can reveal underlying problems. For more information about COPY exceptions and rejections, see Handling messy data.
Viewing external table definitions
When you create an external table, Vertica stores the COPY definition statement in the table_definition column of the TABLES system table.
To list all external tables (tables with a stored COPY definition), use a SELECT * query with a filter on table_definition, as shown:
=> SELECT * FROM TABLES WHERE table_definition <> '';
Use a query such as the following to list the external table definitions:
=> SELECT table_name, table_definition FROM TABLES;
table_name | table_definition
------------+----------------------------------------------------------------------
t1 | COPY FROM 'TMPDIR/external_table.dat' DELIMITER ','
t1_copy | COPY FROM 'TMPDIR/external_table.dat' DELIMITER ','
t2 | COPY FROM 'TMPDIR/external_table2.dat' DELIMITER ','
(3 rows)
15.3 - Querying external tables
After you create an external table, you can query it as you would query any other table.
After you create an external table, you can query it as you would query any other table. Suppose you have created the following external tables:
=> CREATE EXTERNAL TABLE catalog (id INT, description VARCHAR, category VARCHAR)
AS COPY FROM 'hdfs:///dat/catalog.csv' DELIMITER ',';
CREATE TABLE
=> CREATE EXTERNAL TABLE inventory(storeID INT, prodID INT, quantity INT)
AS COPY FROM 'hdfs:///dat/inventory.csv' DELIMITER ',';
CREATE TABLE
You can now write queries against these tables, such as the following:
=> SELECT * FROM catalog;
id | description | category
----+----------------------+-------------
10 | 24in monitor | computers
11 | 27in monitor | computers
12 | 24in IPS monitor | computers
20 | 1TB USB drive | computers
21 | 2TB USB drive | computers
22 | 32GB USB thumb drive | computers
30 | 40in LED TV | electronics
31 | 50in LED TV | electronics
32 | 60in plasma TV | electronics
(9 rows)
=> SELECT * FROM inventory;
storeID | prodID | quantity
---------+--------+----------
502 | 10 | 17
502 | 11 | 2
517 | 10 | 1
517 | 12 | 2
517 | 12 | 4
542 | 10 | 3
542 | 11 | 11
542 | 12 | 1
(8 rows)
=> SELECT inventory.storeID,catalog.description,inventory.quantity
FROM inventory JOIN catalog ON inventory.prodID = catalog.id;
storeID | description | quantity
---------+------------------+----------
502 | 24in monitor | 17
517 | 24in monitor | 1
542 | 24in monitor | 3
502 | 27in monitor | 2
542 | 27in monitor | 11
517 | 24in IPS monitor | 2
517 | 24in IPS monitor | 4
542 | 24in IPS monitor | 1
(8 rows)
One important difference between external tables and Vertica native tables is that querying an external table reads the external data every time. (See How external tables differ from native tables.) Specifically, each time a select query references the external table, Vertica parses the COPY statement definition again to access the data. Certain errors in either your table definition or your data do not become apparent until you run a query, so test your external tables before deploying them in a production environment.
Handling errors
Querying external table data with an incorrect COPY FROM statement definition can potentially result in many rejected rows. To limit the number of rejections, Vertica sets the maximum number of retained rejections with the ExternalTablesExceptionsLimit
configuration parameter. The default value is 100. Setting the ExternalTablesExceptionsLimit
to –1
removes the limit, but is not recommended.
If COPY errors reach the maximum number of rejections, the external table query continues, but COPY generates a warning in the vertica.log
file and does not report subsequent rejected rows.
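For example, a sketch that lowers the retention limit while you debug a new external table (the value shown is arbitrary):
=> ALTER DATABASE DEFAULT SET ExternalTablesExceptionsLimit = 10;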
Using the ExternalTablesExceptionsLimit
configuration parameter differs from using the COPY statement REJECTMAX
parameter to set a low rejection threshold. The REJECTMAX
value controls how many rejected rows to permit before causing the load to fail. If COPY encounters a number of rejected rows equal to or greater than REJECTMAX
, COPY aborts execution instead of logging a warning in vertica.log
.
Queries that include joins perform better if the smaller table is the inner one. For native tables, the query optimizer uses cardinality to choose the inner table. For external tables, the query optimizer uses the row count if available.
After you create an external table, use ANALYZE_EXTERNAL_ROW_COUNT to collect this information. Calling this function is potentially expensive because it has to materialize one column of the table to be able to count the rows, so do this analysis when your database is not busy with critical queries. (This is why Vertica does not perform this operation automatically when you create the table.)
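For example, using the catalog external table defined earlier in this section:
=> SELECT ANALYZE_EXTERNAL_ROW_COUNT('catalog');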
The query optimizer uses the results of your most-recent call to this function when planning queries. If the volume of data changes significantly, therefore, you should run it again to provide updated statistics. A difference of a few percent does not matter, but if your data volume grows by 20% or more, you should repeat this operation when able.
If you have ORC or Parquet data, you can take advantage of optimizations including partition pruning and predicate pushdown. See Improving query performance.
Using external tables with user-defined load (UDL) functions
You can use external tables in conjunction with UDL functions that you create. For more information about using UDLs, see User Defined Load (UDL).
15.3.1 - Using partition columns
An ORC or Parquet file contains data columns. To these files you can add partition columns at write time. The data files do not store values for partition columns; instead, when writing the files you divide them into groups (partitions) based on column values. You can use partitioning to improve the performance of queries that restrict results by the partitioned column.
For example, if you have a table with a date column, and you know you will be writing queries restricted to particular dates, you can partition by date. Thus, Vertica can skip reading some files entirely when executing your date-restricted queries. This behavior is called partition pruning.
You can create partitions regardless of where you store the files—in HDFS, in an S3 bucket, on a local file system, or in a shared file system such as NFS.
You can use Hive or EXPORT TO PARQUET to create partitions, or you can create them manually. For information about creating partitions as part of exporting data from Vertica, see Partitioning and sorting data. See Improving Query Performance for information about tuning partitions.
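As a sketch of the export-based approach, the following statement writes a hypothetical sales table into Hive-style partition directories keyed on its created column; the table name, column, and output path are illustrative:
=> EXPORT TO PARQUET(directory = 'webhdfs:///path/sales_export')
   OVER(PARTITION BY created)
   AS SELECT * FROM sales;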
Partition structure
By default, both Hive and Vertica write Hadoop columnar format files that contain the data for all table columns without partitioning. The column data is laid out in stripes, or groups of row data. When Vertica loads this data it reads all of the stripes.
If you partition the data, however, you can avoid writing some of that data into the files and thus reduce the amount to be read. Instead of storing a column's data in the files, you create a directory structure that partitions the data based on the value in a column.
For example, if the data includes a date column, you can write each date as a separate partition. Each partition is a directory with a name of the form "column=value". If you have a date column named "created" that is partitioned by day, you would have the following directory structure:
path/created=2016-11-01/*
path/created=2016-11-02/*
path/created=2016-11-03/*
path/...
As this example shows, the files in each subdirectory contain all columns except the "created" column.
You can partition by more than one column, creating a layered structure. For example, adding another partitioned column, "region", to the preceding example would produce the following directory structure:
path/created=2016-11-01/region=northeast/*
path/created=2016-11-01/region=central/*
path/created=2016-11-01/region=southeast/*
path/created=2016-11-01/...
path/created=2016-11-02/region=northeast/*
path/created=2016-11-02/region=central/*
path/created=2016-11-02/region=southeast/*
path/created=2016-11-02/...
path/created=2016-11-03/...
path/...
With this change, the data files contain all columns except "created" and "region".
Note
The files must contain at least one real (not partitioned) column. You cannot partition by every column in a table.
You can create partitions for columns of any simple data type. As a best practice, however, you should avoid partitioning columns with BOOLEAN, FLOAT, and NUMERIC types.
Under some circumstances Hive writes a partition with a value of __HIVE_DEFAULT_PARTITION__. Vertica treats these values as NULL.
COPY syntax
When creating an external table from partitioned data, you must do all of the following:
-
In the column definition in the external table, if you are using strong schema matching (the default), list the partition columns last and in order.
-
In the path, use wildcards to include all of the levels of directories and files.
-
In the ORC or PARQUET statement, specify the partition columns in the hive_partition_cols parameter. (The argument name is the same even if you didn't use Hive to do the partitioning; it refers to Hive-style partitions.) When using strong schema matching, you must list the names in order.
The following example creates an external table using the partitioned data shown previously. The table includes four columns. Two columns, "id" and "name", are in the data files. The other two, "created" and "region", are partitioned.
=> CREATE EXTERNAL TABLE records (id int, name varchar(50), created date, region varchar(50))
AS COPY FROM 'webhdfs:///path/*/*/*'
ORC(hive_partition_cols='created,region');
The path includes one wildcard (*) for each level of directory partitioning and then one more for the files. The number of wildcards must always be one more than the number of partitioned columns.
You do not need to include all of the partitioned columns in hive_partition_cols if those columns are not relevant for your queries. However, the partition columns must be the last columns in the table definition. For example, you can define the following table for the partitioned data shown previously:
=> CREATE EXTERNAL TABLE records (id int, name varchar(50), created date, region varchar(50))
AS COPY FROM 'webhdfs:///path/*/*/*' ORC(hive_partition_cols='region');
Values in the "created" column are all null because no data appears in the files for that column and hive_partition_cols does not include it.
However, the following example produces an error.
=> CREATE EXTERNAL TABLE records (id int, name varchar(50), created date, region varchar(50))
AS COPY FROM 'webhdfs:///path/*/*/*' ORC(hive_partition_cols='created');
In this example, the table definition includes the "region" column after the "created" column, and "region" is not included in hive_partition_cols. Because this column is not listed as a partition column, Vertica interprets it as a data column and produces an error because the column is not present.
If Vertica cannot convert a partition value to the declared type for that column, it sets the value to NULL. The following example incorrectly declares region to be an integer rather than a varchar.
=> CREATE EXTERNAL TABLE records (id int, name varchar(50), created date, region int)
AS COPY FROM 'webhdfs:///path/*/*/*' ORC(hive_partition_cols='region');
Vertica cannot coerce a directory named "region=northeast" into an integer value, so it sets that column value to NULL for all rows it reads from this directory. If you declare the column with a NOT NULL constraint, Vertica rejects the row. If the number of rejected rows exceeds REJECTMAX, Vertica reports an error.
Note
If you change how files are partitioned on disk, you must re-create your external tables.
Queries
When executing queries with predicates, Vertica skips subdirectories that do not satisfy the predicate. This process is called partition pruning and it can significantly improve query performance. See Improving query performance for more information about partition pruning and other techniques for optimizing queries.
The following example reads only the partitions for the specified region, for all dates. Although the data is also partitioned by date, the query does not restrict the date.
=> SELECT * FROM t WHERE region='northeast';
To verify that Vertica is pruning partitions, look in the explain plan for a message similar to the following:
files with unmatched Hive partition have been pruned
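For example, assuming the table and predicate shown above, you could inspect the plan with EXPLAIN and look for the pruning message:
=> EXPLAIN SELECT * FROM t WHERE region='northeast';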
15.3.2 - Improving query performance
When working with external tables in the Parquet and ORC columnar formats, Vertica tries to improve performance in the following ways:
-
By pushing query execution closer to the data so less has to be read and transmitted. Vertica uses the following specific techniques: predicate pushdown, column selection, and partition pruning.
-
By taking advantage of data locality in the query plan.
-
By analyzing the row count to get the best join orders in the query plan.
Together, these optimizations can significantly reduce the amount of data that Vertica must read to execute a query.
Tuning ORC stripes and Parquet rowgroups
Vertica can read ORC and Parquet files generated by any Hive version. However, newer Hive versions store more metadata in these files, which both Hive and Vertica use to prune values and read only the required data. ORC and Parquet are fully forward- and backward-compatible, so to get the best performance, write data with the most recent Hive version available, and at least Hive 0.14.
The ORC format splits a table into groups of rows called stripes and stores column-level metadata in each stripe. The Parquet format splits a table into groups of rows called rowgroups and stores column-level metadata in each rowgroup. Each stripe/rowgroup's metadata is used during predicate evaluation to determine whether the values from this stripe need to be read. Large stripes usually yield better performance, so set the stripe size to at least 256M.
Hive writes ORC stripes and Parquet rowgroups to HDFS, which stores data in HDFS blocks distributed among multiple physical data nodes. Accessing an HDFS block requires opening a separate connection to the corresponding data node. It is advantageous to ensure that an ORC stripe or Parquet rowgroup does not span more than one HDFS block. To do so, set the HDFS block size to be larger than the stripe/rowgroup size. Setting HDFS block size to 512M is usually sufficient.
Hive provides three compression options: None, Snappy, and Zlib. Use Snappy or Zlib compression to reduce storage and I/O consumption. Usually, Snappy is less CPU-intensive but can yield lower compression ratios compared to Zlib.
Storing data in sorted order can improve data access and predicate evaluation performance. Sort table columns based on the likelihood of their occurrence in query predicates; columns that most frequently occur in comparison or range predicates should be sorted first.
Partitioning tables is a very useful technique for data organization. Similarly to sorting tables by columns, partitioning can improve data access and predicate evaluation performance. Vertica supports Hive-style partitions and partition pruning.
The following Hive statement creates an ORC table with stripe size 256M and Zlib compression:
hive> CREATE TABLE customer_visits (
customer_id bigint,
visit_num int,
page_view_dt date)
STORED AS ORC tblproperties("orc.compress"="ZLIB",
"orc.stripe.size"="268435456");
The following statement creates a Parquet table with a 256M rowgroup size and Zlib compression:
hive> CREATE TABLE customer_visits (
customer_id bigint,
visit_num int,
page_view_dt date)
STORED AS PARQUET tblproperties("parquet.compression"="ZLIB",
"parquet.stripe.size"="268435456");
Predicate pushdown and column selection
Predicate pushdown moves parts of the query execution closer to the data, reducing the amount of data that must be read from disk or across the network. ORC files have three levels of indexing: file statistics, stripe statistics, and row group indexes. Predicates are applied only to the first two levels. Parquet files have two levels of statistics: rowgroup statistics and page statistics. Predicates are only applied to the first level.
Predicate pushdown is automatically applied for files written with Hive version 0.14 and later. ORC files written with earlier versions of Hive might not contain the required statistics. When executing a query against a file that lacks these statistics, Vertica logs an EXTERNAL_PREDICATE_PUSHDOWN_NOT_SUPPORTED event in the QUERY_EVENTS system table. If you are seeing performance problems with your queries, check this table for these events.
Another query performance optimization technique used by Vertica is column selection. Vertica reads from ORC or Parquet files only the columns specified in the query statement. For example, the following statement reads only the customer_id and visit_num columns from the corresponding ORC files:
=> CREATE EXTERNAL TABLE customer_visits (
customer_id bigint,
visit_num int,
page_view_dt date)
AS COPY FROM '...' ORC;
=> SELECT customer_id from customer_visits
WHERE visit_num > 10;
Data locality
In a cluster where Vertica nodes are co-located on HDFS nodes, a query can take advantage of data locality. When the data that a query needs resides on an HDFS node that also hosts a database node, the query planner uses the co-located database node to read that data locally, instead of making a network call.
You can see how much data is being read locally by inspecting the query plan. The label for LoadStep(s) in the plan contains a statement of the form: "X% of ORC/Parquet data matched with co-located Vertica nodes". To increase the volume of local reads, consider adding more database nodes. HDFS data, by its nature, can't be moved to specific nodes, but if you run more database nodes you increase the likelihood that a database node is local to one of the copies of the data.
Creating sorted files in Hive
Unlike Vertica, Hive does not store table columns in separate files and does not create multiple projections per table with different sort orders. For efficient data access and predicate pushdown, sort Hive table columns based on the likelihood of their occurrence in query predicates. Columns that most frequently occur in comparison or range predicates should be sorted first.
Data can be inserted into Hive tables in a sorted order by using the ORDER BY or SORT BY keywords. For example, to insert data into the ORC table "customer_visit" from another table "visits" with the same columns, use these keywords with the INSERT INTO command:
hive> INSERT INTO TABLE customer_visits
SELECT * from visits
ORDER BY page_view_dt;
hive> INSERT INTO TABLE customer_visits
SELECT * from visits
SORT BY page_view_dt;
The difference between the two keywords is that ORDER BY guarantees global ordering on the entire table by using a single MapReduce reducer to populate the table, while SORT BY uses multiple reducers, which can leave individual ORC or Parquet files sorted by the specified column(s) without the table being globally sorted. Because ORDER BY funnels all rows through a single reducer, it can significantly increase the time required to load the data.
You can combine clustering and sorting to sort a table globally. The following table definition adds a hint that data is inserted into this table bucketed and sorted by page_view_dt:
hive> CREATE TABLE customer_visits_bucketed (
customer_id bigint,
visit_num int,
page_view_dt date)
CLUSTERED BY (page_view_dt)
SORTED BY (page_view_dt) INTO 10 BUCKETS
STORED AS ORC;
When inserting data into the table, you must explicitly specify the clustering and sort columns, as in the following example:
hive> INSERT INTO TABLE customer_visits_bucketed
SELECT * from visits
DISTRIBUTE BY page_view_dt
SORT BY page_view_dt;
The following statement is equivalent:
hive> INSERT INTO TABLE customer_visits_bucketed
SELECT * from visits
CLUSTER BY page_view_dt;
Both of the above commands insert data into the customer_visits_bucketed table, globally sorted on the page_view_dt column.
Partitioning Hive tables
Table partitioning in Hive is an effective technique for data separation and organization, as well as for reducing storage requirements. To partition a table in Hive, include it in the PARTITIONED BY clause:
hive> CREATE TABLE customer_visits (
customer_id bigint,
visit_num int)
PARTITIONED BY (page_view_dt date)
STORED AS ORC;
Hive does not materialize partition column(s). Instead, it creates subdirectories of the following form:
path_to_table/partition_column_name=value/
When the table is queried, Hive parses the subdirectories' names to materialize the values in the partition columns. The value materialization in Hive is a plain conversion from a string to the appropriate data type.
Inserting data into a partitioned table requires specifying the value(s) of the partition column(s). The following example creates two partition subdirectories, "customer_visits/page_view_dt=2016-02-01" and "customer_visits/page_view_dt=2016-02-02":
hive> INSERT INTO TABLE customer_visits
PARTITION (page_view_dt='2016-02-01')
SELECT customer_id, visit_num from visits
WHERE page_view_dt='2016-02-01'
ORDER BY page_view_dt;
hive> INSERT INTO TABLE customer_visits
PARTITION (page_view_dt='2016-02-02')
SELECT customer_id, visit_num from visits
WHERE page_view_dt='2016-02-02'
ORDER BY page_view_dt;
Each directory contains ORC files with two columns, customer_id and visit_num.
Accessing partitioned data from Vertica
Vertica recognizes and supports Hive-style partitions. You can read partition values and data using the HCatalog Connector or the COPY statement.
If you use the HCatalog Connector, you create an HCatalog schema in Vertica that mirrors the schema in Hive. If you use the COPY statement, you define an external table that specifies the partition columns, as in the following example:
=> CREATE EXTERNAL TABLE customer_visits (customer_id int, visit_num int,
page_view_dtm date)
AS COPY FROM 'hdfs://host:port/path/customer_visits/*/*' ORC
(hive_partition_cols='page_view_dtm');
The following statement reads all ORC files stored in all sub-directories including the partition values:
=> SELECT customer_id, visit_num, page_view_dtm FROM customer_visits;
When executing queries with predicates on partition columns, Vertica uses the subdirectory names to skip files that do not satisfy the predicate. This process is called partition pruning.
You can also define a separate external table for each subdirectory, as in the following example:
=> CREATE EXTERNAL TABLE customer_visits_20160201 (customer_id int,
visit_num int, page_view_dtm date)
AS COPY FROM
'hdfs://host:port/path/customer_visits/page_view_dt=2016-02-01/*' ORC;
Example: a partitioned, sorted ORC table
Suppose you have data stored in CSV files containing three columns: customer_id, visit_num, and page_view_dt:
1,123,2016-01-01
33,1,2016-02-01
2,57,2016-01-03
...
The goal is to create the following Hive table:
hive> CREATE TABLE customer_visits (
customer_id bigint,
visit_num int)
PARTITIONED BY (page_view_dt date)
STORED AS ORC;
To achieve this, perform the following steps:
-
Copy or move the CSV files to HDFS.
-
Define a textfile Hive table and copy the CSV files into it:
hive> CREATE TABLE visits (
customer_id bigint,
visit_num int,
page_view_dt date)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
hive> LOAD DATA INPATH path_to_csv_files INTO TABLE visits;
-
For each unique value of page_view_dt, insert the data into the target table, specifying that value as the partition:
hive> INSERT INTO TABLE customer_visits
PARTITION (page_view_dt='2016-01-01')
SELECT customer_id, visit_num FROM visits
WHERE page_view_dt='2016-01-01'
ORDER BY page_view_dt;
...
This operation inserts data from visits.customer_id into customer_visits.customer_id, and from visits.visit_num into customer_visits.visit_num. These two columns are stored in generated ORC files. Simultaneously, values from visits.page_view_dt are used to create partitions for the partition column customer_visits.page_view_dt, which is not stored in the ORC files.
Data modification in Hive
Hive is well-suited for reading large amounts of write-once data. Its optimal usage is loading data in bulk into tables and never modifying the data. In particular, for data stored in the ORC and Parquet formats, this usage pattern produces large, globally (or nearly globally) sorted files.
Periodic addition of data to tables (known as “trickle load”) is likely to produce many small files. The disadvantage of this is that Vertica has to access many more files during query planning and execution. These extra accesses can result in longer query-processing times, with the major performance degradation coming from the increased number of file seeks on HDFS.
Hive can also modify underlying ORC or Parquet files without user involvement. If enough records in a Hive table are modified or deleted, for example, Hive deletes existing files and replaces them with newly-created ones. Hive can also be configured to automatically merge many small files into a few larger files.
When new tables are created, or existing tables are modified in Hive, you must manually synchronize Vertica to keep it up to date. The following statement synchronizes the local Vertica schema "hcat_local" with the HCatalog schema "hcat" after a change in Hive:
=> SELECT sync_with_hcatalog_schema('hcat_local', 'hcat');
Schema evolution in Hive
Hive supports two kinds of schema evolution:
- New columns can be added to existing tables in Hive. Vertica automatically handles this kind of schema evolution. The old records display NULLs for the newer columns.
- The type of a column for a table can be modified in Hive. Vertica does not support this kind of schema evolution.
The following example demonstrates schema evolution through new columns. In this example, hcat.parquet.txt is a file with the following values:
-1|0.65|0.65|6|'b'
hive> create table hcat.parquet_tmp (a int, b float, c double, d int, e varchar(4))
row format delimited fields terminated by '|' lines terminated by '\n';
hive> load data local inpath 'hcat.parquet.txt' overwrite into table
hcat.parquet_tmp;
hive> create table hcat.parquet_evolve (a int) partitioned by (f int) stored as
parquet;
hive> insert into table hcat.parquet_evolve partition (f=1) select a from
hcat.parquet_tmp;
hive> alter table hcat.parquet_evolve add columns (b float);
hive> insert into table hcat.parquet_evolve partition (f=2) select a, b from
hcat.parquet_tmp;
hive> alter table hcat.parquet_evolve add columns (c double);
hive> insert into table hcat.parquet_evolve partition (f=3) select a, b, c from
hcat.parquet_tmp;
hive> alter table hcat.parquet_evolve add columns (d int);
hive> insert into table hcat.parquet_evolve partition (f=4) select a, b, c, d from
hcat.parquet_tmp;
hive> alter table hcat.parquet_evolve add columns (e varchar(4));
hive> insert into table hcat.parquet_evolve partition (f=5) select a, b, c, d, e
from hcat.parquet_tmp;
hive> insert into table hcat.parquet_evolve partition (f=6) select a, b, c, d, e
from hcat.parquet_tmp;
=> SELECT * from hcat_local.parquet_evolve;
a | b | c | d | e | f
----+-------------------+------+---+---+---
-1 | | | | | 1
-1 | 0.649999976158142 | | | | 2
-1 | 0.649999976158142 | 0.65 | | | 3
-1 | 0.649999976158142 | 0.65 | 6 | | 4
-1 | 0.649999976158142 | 0.65 | 6 | b | 5
-1 | 0.649999976158142 | 0.65 | 6 | b | 6
(6 rows)
15.4 - Monitoring external tables
Vertica records information about external tables in system tables. You can use these tables to track your external data and queries against it.
The TABLES system table contains data about all tables, both native and external. The TABLE_DEFINITION column is specific to external tables. You can query this column to see all external data sources currently in use, as in the following example:
=> SELECT table_name, create_time, table_definition FROM tables WHERE table_definition != '';
table_name | create_time | table_definition
---------------+-------------------------------+-----------------------------------------------------------
customers_orc | 2018-03-21 11:07:30.159442-04 | COPY from '/home/dbadmin/sample_orc_files/0*' ORC
miscprod | 2018-06-26 17:40:04.012121-04 | copy from '/home/dbadmin/data/prod.csv'
students | 2018-06-26 17:46:50.695024-04 | copy from '/home/dbadmin/students.csv'
numbers | 2018-06-26 17:53:52.407441-04 | copy from '/home/dbadmin/tt.dat'
catalog | 2018-06-26 18:12:28.598519-04 | copy from '/home/dbadmin/data/prod.csv' delimiter ','
inventory | 2018-06-26 18:13:06.951802-04 | copy from '/home/dbadmin/data/stores.csv' delimiter ','
test | 2018-06-27 16:31:39.170866-04 | copy from '/home/dbadmin/data/stores.csv' delimiter ','
(7 rows)
The EXTERNAL_TABLE_DETAILS table provides more details, including file sizes. Vertica computes the values in this table at query time, which is potentially expensive, so consider restricting the query by schema or table.
=> SELECT table_name, source_format, total_file_size_bytes FROM external_table_details;
table_name | source_format | total_file_size_bytes
---------------+---------------+-----------------------
customers_orc | ORC | 619080883
miscprod | DELIMITED | 254
students | DELIMITED | 763
numbers | DELIMITED | 30
catalog | DELIMITED | 254
inventory | DELIMITED | 74
test | DELIMITED | 74
(7 rows)
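For example, to limit the computation to a single table, you might restrict the query as follows; the schema name here is illustrative:
=> SELECT table_name, source_format, total_file_size_bytes
   FROM external_table_details
   WHERE schema_name = 'public' AND table_name = 'customers_orc';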
If the size of an external table changes significantly over time, you should rerun ANALYZE_EXTERNAL_ROW_COUNT() to gather updated statistics. See Improving Query Performance for External Tables.
The LOAD_SOURCES table shows information for loads currently in progress. This table does not record information about loads of ORC or Parquet data.
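As a minimal check while such a load is in progress (the available columns vary by version), you can query the table directly:
=> SELECT * FROM load_sources;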
15.5 - Troubleshooting external tables
You might encounter the following issues when creating or querying external tables. For general data-load troubleshooting, see Troubleshooting data loads.
File not found or permission denied
If a query against an external table produces a file or permission error, ensure that the user executing the query has the necessary permissions in both Vertica and the file system. See the permissions section in Creating external tables.
Error 7226: cannot find partition column
When querying ORC or Parquet data, you might see an error message stating that a partition column is missing:
ERROR 7226: Cannot find partition column [region] in parquet source
[/data/table_int/int_original/000000_0]
This error can occur if you partition your ORC or Parquet data (see Using partition columns). If you create an external table and then change the partition structure, for example by renaming a column, you must then re-create the external table. If you see this error, update your table to match the partitioning on disk.
Error 6766: is a directory
When querying data you might see an error message stating that an input file is a directory:
ERROR 6766: Error reading from orc parser input stream
[/tmp/orc_glob/more_nations]: Is a directory
This error occurs when the glob in the table's COPY FROM clause matches an empty directory. It applies only to files in the Linux file system; empty directories in HDFS are ignored.
To correct the error, make the glob more specific. Instead of *, for example, use *.orc.
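For example, suppose the table was defined with a broad glob like the following; the path and column definitions are illustrative:
=> CREATE EXTERNAL TABLE nations (n_id INT, n_name VARCHAR(25))
   AS COPY FROM '/tmp/orc_glob/*' ORC;
Redefining the table with the glob '/tmp/orc_glob/*.orc' restricts the match to ORC files and avoids matching subdirectories such as more_nations.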