This is the multi-page printable view of this section.
Click here to print.
Return to the regular view of this page.
COPY
COPY; Load data;.
COPY bulk-loads data into a Vertica database. By default, COPY automatically commits itself and any current transaction except when loading temporary tables. If COPY is terminated or interrupted Vertica rolls it back.
COPY reads data as UTF-8 encoding.
For information on loading one or more files or pipes on a cluster host or on a client system, see COPY LOCAL.
Syntax
COPY [ /*+ LABEL (label-string)*/ ] [[database.]schema-name.]target-table
[ ( { column-as-expression | column }
[ DELIMITER [ AS ] 'char' ]
[ ENCLOSED [ BY ] 'char' ]
[ ENFORCELENGTH ]
[ ESCAPE [ AS ] 'char' | NO ESCAPE ]
[ FILLER datatype]
[ FORMAT 'format' ]
[ NULL [ AS ] 'string' ]
[ TRIM 'byte' ]
[,...] ) ]
[ COLUMN OPTION (column
[ DELIMITER [ AS ] 'char' ]
[ ENCLOSED [ BY ] 'char' ]
[ ENFORCELENGTH ]
[ ESCAPE [ AS ] 'char' | NO ESCAPE ]
[ FORMAT 'format' ]
[ NULL [ AS ] 'string' ]
[ TRIM 'byte' ]
[,...] ) ]
FROM {
[ LOCAL ] STDIN [ compression ]
| { 'path-to-data'
[ ON { nodename | (nodeset) | ANY NODE | EACH NODE } ] [ compression ] }[,...]
[ PARTITION COLUMNS column[,...] ]
| LOCAL 'path-to-data' [ compression ] [,...]
| VERTICA source-database.[source-schema.]source-table[( source-column[,...] ) ]
}
[ NATIVE
| FIXEDWIDTH COLSIZES {( integer )[,...]}
| NATIVE VARCHAR
| ORC
| PARQUET
]
| [ WITH ] UDL-clause[...]
}
[ ABORT ON ERROR ]
[ DELIMITER [ AS ] 'char' ]
[ ENCLOSED [ BY ] 'char'
[ ENFORCELENGTH ]
[ ERROR TOLERANCE ]
[ ESCAPE [ AS ] 'char' | NO ESCAPE ]
[ EXCEPTIONS 'path' [ ON nodename] [,...]
[ NULL [ AS ] 'string' ]
[ RECORD TERMINATOR 'string' ]
[ REJECTED DATA {'path' [ ON nodename] [,...] | AS TABLE reject-table} ]
[ REJECTMAX integer ]
[ SKIP integer ]
[ SKIP BYTES integer ]
[ STREAM NAME 'streamName']
[ TRAILING NULLCOLS ]
[ TRIM 'byte' ]
[ [ WITH ] PARSER parser ([ arg=value[,...] ]) ] ]
[ NO COMMIT ]
Parameters
See Parameters.
Restrictions
See Restrictions.
Privileges
Superusers have full COPY privileges. The following requirements apply to non-superusers:
-
INSERT privilege on table
-
USAGE privilege on schema
-
USER-accessible storage location
-
Applicable READ or WRITE privileges granted to the storage location where files are read or written
COPY can specify a path to store rejected data and exceptions. If the path resolves to a storage location, the following privileges apply to non-superusers:
1 - Examples
For additional COPY examples, see the reference pages for specific parsers, including: DELIMITED (Parser), ORC (Parser), PARQUET (Parser), FJSONPARSER (Parser), and FAVROPARSER (Parser).
For additional COPY examples, see the reference pages for specific parsers, including: DELIMITED, ORC, PARQUET, FJSONPARSER, and FAVROPARSER.
Specifying string options
Use COPY with FORMAT, DELIMITER, NULL, and ENCLOSED BY options:
=> COPY public.customer_dimension (customer_since FORMAT 'YYYY')
FROM STDIN
DELIMITER ','
NULL AS 'null'
ENCLOSED BY '"';
Use COPY with DELIMITER and NULL options. This example sets and references a vsql
variable for the input file:
=> \set input_file ../myCopyFromLocal/large_table.gzip
=> COPY store.store_dimension
FROM :input_file
DELIMITER '|'
NULL ''
RECORD TERMINATOR E'\f';
Including multiple source files
Create a table and then copy multiple source files to it:
=> CREATE TABLE sampletab (a int);
CREATE TABLE
=> COPY sampletab FROM '/home/dbadmin/one.dat', 'home/dbadmin/two.dat';
Rows Loaded
-------------
2
(1 row)
Use wildcards to indicate a group of files:
=> COPY myTable FROM 'webhdfs:///mydirectory/ofmanyfiles/*.dat';
Wildcards can include regular expressions:
=> COPY myTable FROM 'webhdfs:///mydirectory/*_[0-9]';
Specify multiple paths in a single COPY statement:
=> COPY myTable FROM 'webhdfs:///data/sales/01/*.dat', 'webhdfs:///data/sales/02/*.dat',
'webhdfs:///data/sales/historical.dat';
Distributing a load
Load data that is shared across all nodes. Vertica distributes the load across all nodes, if possible:
=> COPY sampletab FROM '/data/file.dat' ON ANY NODE;
Load data from two files. Because the first load file does not specify nodes (or ON ANY NODE), the initiator performs the load. Loading the second file is distributed across all nodes:
=> COPY sampletab FROM '/data/file1.dat', '/data/file2.dat' ON ANY NODE;
Specify different nodes for each load file:
=> COPY sampletab FROM '/data/file1.dat' ON (v_vmart_node0001, v_vmart_node0002),
'/data/file2.dat' ON (v_vmart_node0003, v_vmart_node0004);
Loading data from shared storage
To load data from shared storage, use URLs in the corresponding schemes:
-
HDFS: [[s]web]hdfs://[
nameservice
]/
path
-
S3: s3://
bucket
/
path
-
Google Cloud: gs://
bucket
/
path
-
Azure: azb://
account
/
container
/
path
Note
Loads from HDFS, S3, GCS, and Azure default to ON ANY NODE; you do not need to specify it.
Load a file stored in HDFS using the default name node or name service:
=> COPY t FROM 'webhdfs:///opt/data/file1.dat';
Load data from a particular HDFS name service (testNS). You specify a name service if your database is configured to read from more than one HDFS cluster:
=> COPY t FROM 'webhdfs://testNS/opt/data/file2.csv';
Load data from an S3 bucket:
=> COPY t FROM 's3://AWS_DataLake/*' ORC;
Partitioned data
Data files can be partitioned using the directory structure, such as:
path/created=2016-11-01/region=northeast/*
path/created=2016-11-01/region=central/*
path/created=2016-11-01/region=southeast/*
path/created=2016-11-01/...
path/created=2016-11-02/region=northeast/*
path/created=2016-11-02/region=central/*
path/created=2016-11-02/region=southeast/*
path/created=2016-11-02/...
path/created=2016-11-03/...
path/...
Load partition columns using the PARTITION COLUMNS option:
=> CREATE EXTERNAL TABLE records (id int, name varchar(50), created date, region varchar(50))
AS COPY FROM 'webhdfs:///path/*/*/*'
PARTITION COLUMNS created, region;
Using filler columns
In the following example, the table has columns for first name, last name, and full name, but the data being loaded contains columns for first, middle, and last names. The COPY statement reads all of the source data but only loads the source columns for first and last names. It constructs the data for the full name by concatenating each of the source data columns, including the middle name. The middle name is read as a FILLER column so it can be used in the concatenation, but is ignored otherwise. (There is no table column for middle name.)
=> CREATE TABLE names(first VARCHAR(20), last VARCHAR(20), full VARCHAR(60));
CREATE TABLE
=> COPY names(first,
middle FILLER VARCHAR(20),
last,
full AS first||' '||middle||' '||last)
FROM STDIN;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> Marc|Gregory|Smith
>> Sue|Lucia|Temp
>> Jon|Pete|Hamilton
>> \.
=> SELECT * from names;
first | last | full
-------+----------+--------------------
Jon | Hamilton | Jon Pete Hamilton
Marc | Smith | Marc Gregory Smith
Sue | Temp | Sue Lucia Temp
(3 rows)
Loading data into a flex table
Create a Flex table and copy JSON data into it using FJSONPARSER:
=> CREATE FLEX TABLE darkdata();
CREATE TABLE
=> COPY tweets FROM '/myTest/Flexible/DATA/tweets_12.json' PARSER FJSONPARSER();
Rows Loaded
-------------
12
(1 row)
Using named pipes
COPY supports named pipes that follow the same naming conventions as file names on the given file system. Permissions are open
, write
, and close
.
Create named pipe, pipe1
, and set two vsql
variables:
=> \! mkfifo pipe1
=> \set dir `pwd`/
=> \set file '''':dir'pipe1'''
Copy an uncompressed file from the named pipe:
=> \! cat pf1.dat > pipe1 &
=> COPY large_tbl FROM :file delimiter '|';
=> SELECT * FROM large_tbl;
=> COMMIT;
Loading compressed data
Copy a GZIP file from a named pipe and uncompress it:
=> \! gzip pf1.dat
=> \! cat pf1.dat.gz > pipe1 &
=> COPY large_tbl FROM :file ON site01 GZIP delimiter '|';
=> SELECT * FROM large_tbl;
=> COMMIT;
=> \!gunzip pf1.dat.gz
2 - Parameters
COPY parameters and their descriptions are divided into the following sections:.
COPY parameters and their descriptions are divided into the following sections:
Target options
The following options apply to the target tables and their columns:
-
LABEL
Assigns a label to a statement to identify it for profiling and debugging.
[
database
.]
schema
Database and schema. The default schema is public
. If you specify a database, it must be the current database.
COPY ignores schema-name
when used in CREATE EXTERNAL TABLE or CREATE FLEX EXTERNAL TABLE statements.
target-table
- The target columnar or flexible table for loading new data. Vertica loads the data into all projections that include columns from the schema table.
column-as-expression
- An expression used to compute values for the target column, which must not be of a complex type. For example:
=> COPY t(year AS TO_CHAR(k, 'YYYY')) FROM 'myfile.dat'
Use this option to transform data when it is loaded into the target database.
For details, see Transforming data during loads.
column
- Restricts the load to one or more specified columns in the table. If you omit specifying columns, COPY loads all columns by default.
Table columns that you omit from the column list are assigned their DEFAULT or SET USING values, if any; otherwise, COPY inserts NULL.
If you leave the column
parameter blank to load all columns in the table, you can use the optional parameter COLUMN OPTION to specify parsing options for specific columns.
The data file must contain the same number of columns as the COPY command's column list.
COLUMN OPTION
- Specifies load metadata for one or more columns declared in the table column list. For example, you can specify that a column has its own DELIMITER, ENCLOSED BY, or NULL AS expression, and so on. You do not have to specify every column name explicitly in the COLUMN OPTION list, but each column you specify must correspond to a column in the table column list.
Column options
Depending on how they are specified, the following COPY options can qualify specific columns or all columns. Some parser-specific options can also apply to either specific columns or all columns. See Global and column-specific options For details about these two modes.
ENFORCELENGTH
- If specified, COPY rejects data rows of type CHAR, VARCHAR, BINARY, and VARBINARY, or elements of those types in collections, if they are larger than the declared size.
By default, COPY truncates offending rows of these data types and elements of these types in collections, but does not reject the rows. For more details, see Handling Messy Data.
If a collection does not fit with all of its elements, COPY rejects the row without truncating. It does not reduce the number of elements. This can happen if each element is individually within limits but the number of elements causes the collection to exceed the maximum size for the column.
FILLER
datatype
- Reads but does not copy the data of an input column. Use filler columns to ignore input columns that do not have columns in the table. You can also use filler columns to transform data (see Examples and Transforming data during loads). Filler columns cannot be of complex types.
FORMAT '
format
'
- Input format, one of the following:
See Binary (native) data to learn more about these formats.
When loading date/time columns, using FORMAT significantly improves load performance. COPY supports the same formats as the TO_DATE function. See Template patterns for date/time formatting.
If you specify invalid format strings, the COPY operation returns an error.
NULL [AS]
- The string representing a null value. The default is an empty string (
''
). You can specify a null value as any ASCII value in the range E'\000'
to E'\177'
inclusive. You cannot use the same character for both the DELIMITER and NULL options. For details, see Delimited data.
The following options are available for specifying source data:
LOCAL
- Loads data files (up to 65,535 files) on a client system, rather than on a cluster host. LOCAL can qualify the STDIN and [path-to-data](#pathToData) parameters. For details, see COPY LOCAL.
Restrictions: Invalid for CREATE EXTERNAL TABLE AS COPY
STDIN
- Reads from the client a standard input instead of a file. STDIN takes one input source only. To load multiple input sources, use [path-to-data](#pathToData).
User must have INSERT privileges on the table and USAGE privileges on its schema.
Restrictions: Invalid for CREATE EXTERNAL TABLE AS COPY
-
path-to-data
- Specifies the absolute path of the file (or files) containing the data, which can be from multiple input sources.
-
If the file is stored in HDFS, path-to-data
is a URI in the webhdfs
scheme, typically [[s]web]hdfs://[
nameservice
]/
path
. See HDFS file system.
-
If the file is stored in an S3 bucket, path-to-data
is a URI in the format s3://
bucket
/
path
. See S3 object store.
-
If the file is stored in Google Cloud Storage, path-to-data
is a URI in the format gs://
bucket
/
path
. See Google Cloud Storage (GCS) object store.
-
If the file is stored in Azure Blob Storage, path-to-data
is a URI in the format azb://
account
/
container
/
path
. See Azure Blob Storage object store.
-
If the file is on the local Linux file system or an NFS mount, path-to-data
is a local absolute file path.
path-to-data
can optionally contain wildcards to match more than one file. The file or files must be accessible to the local client or the host on which the COPY statement runs. COPY skips empty files in the file list. A file list that includes directories causes the query to fail. See Specifying where to load data from. The supported patterns for wildcards are specified in the Linux Manual Page for Glob (7), and for ADO.net platforms, through the .NET Directory.getFiles method.
You can use variables to construct the pathname as described in Using load scripts.
If path-to-data
resolves to a storage location on a local file system, and the user invoking COPY is not a superuser, the following requirements apply:
Further, if a user has privileges but is not a superuser, and invokes COPY from that storage location, Vertica ensures that symbolic links do not result in unauthorized access.
PARTITION COLUMNS
column
[,...]
- Columns whose values are specified in the directory structure and not in the data itself. This option is valid if the path contains directory names of the form
colname
=
value
, such as:
/data/created=2016-01-01/*.dat
/data/created=2016-01-02/*.dat
/data/created=2016-01-03/*.dat
The value is parsed from the value
part of the directory name. If it cannot be coerced to the column data type, it is rejected. If the value is missing, COPY loads it as null.
The following example loads these dates into a table column, along with other columns in the data files:
=> CREATE EXTERNAL TABLE records
(id int, name varchar(50), created date)
AS COPY FROM 'webhdfs:///path/*/*'
PARTITION COLUMNS created;
For more information, see Partitioned file paths.
ON
nodename
- Specifies the node on which the data to copy resides and the node that should parse the load file. If you omit
nodename
, the location of the input file defaults to the initiator node. Use nodename
to copy and parse a load file from a node other than the COPY initiator node.
Note
nodename
is invalid with STDIN and LOCAL.
ON (
nodeset
)
- Specifies a set of nodes on which to perform the load. The same data must be available for load on all named nodes.
nodeset
is a comma-separated list of node names in parentheses. For example:
=> COPY t FROM 'file1.txt' ON (v_vmart_node0001, v_vmart_node0002);
Vertica apportions the load among all of the specified nodes. If you also specify ERROR TOLERANCE or REJECTMAX, Vertica instead chooses a single node on which to perform the load.
If the data is available on all nodes, you usually use ON ANY NODE, which is the default for loads from HDFS and cloud object stores. However, you can use ON nodeset
to do manual load-balancing among concurrent loads.
-
ON ANY NODE
- Specifies that the data to load is available on all nodes, so COPY opens the path and parses it from any node in the cluster. For an Eon Mode database, COPY uses nodes within the same subcluster as the initiator.
Caution
The data must be the same on all nodes. If the data differs on two nodes, an incorrect or incomplete result is returned, with no error or warning.
Vertica attempts to apportion the load among several nodes if a file is large enough to benefit from apportioning. It chooses a single node if ERROR TOLERANCE or REJECTMAX is specified.
You can use a wildcard or glob (such as *.dat
) to load multiple input files, combined with the ON ANY NODE clause. If you use a glob, COPY distributes the list of files to all cluster nodes and spreads the workload.
ON ANY NODE is invalid with STDIN and LOCAL. STDIN can only use the client host, and LOCAL indicates a client node.
ON ANY NODE is the default for loads from all paths other than Linux (HDFS and cloud object stores).
ON EACH NODE
- Loads data from the specified path on each node. Use this option when the path exists on all nodes but the data files it contains are different on each node. If the path is not valid on all nodes, COPY loads the valid paths and produces a warning. If the path is a shared location, COPY loads it only once as for ON ANY NODE.
-
compression
- The input compression type, one of the following:
-
UNCOMPRESSED (default)
-
BZIP
-
GZIP
-
LZO
-
ZSTD
Input files can be of any format. If you use wildcards, all qualifying input files must be in the same format. To load different file formats, specify the format types specifically.
The following requirements and restrictions apply:
-
When using concatenated BZIP or GZIP files, verify that all source files terminate with a record terminator before concatenating them.
-
Concatenated BZIP and GZIP files are not supported for NATIVE (binary) and NATIVE VARCHAR formats.
-
LZO files are assumed to be compressed with lzop
. Vertica supports the following lzop arguments:
-
BZIP, GZIP, ZSTD, and LZO compression cannot be used with ORC format.
VERTICA
- See COPY FROM VERTICA.
[WITH]
UDL-clause
[...]
- Specifies one or more user-defined load functions—one source, and optionally one or more filters and one parser, as follows:
SOURCE source( [arg=value[,...] ]
[ FILTER filter( [arg=value[,...] ] ) ]...
[ PARSER parser( [arg=value[,...] ] ) ]
To use a flex table parser for column tables, use the PARSER parameter followed by a flex table parser argument. For supported flex table parsers, see Bulk loading data into flex tables.
Handling options
The following options control how COPY handles different contingencies:
ABORT ON ERROR
- Specifies that COPY stops if any row is rejected. The statement is rolled back and no data is loaded.
COLSIZES (
integer
[,...])
- Specifies column widths when loading fixed-width data. COPY requires that you specify COLSIZES when using the FIXEDWIDTH parser. COLSIZES and the list of integers must correspond to the columns listed in the table column list. For details, see Fixed-width format data.
ERROR TOLERANCE
- Specifies that COPY treats each source during execution independently when loading data. The statement is not rolled back if a single source is invalid. The invalid source is skipped and the load continues.
Using this parameter disables apportioned load.
Restrictions: Invalid for ORC or Parquet data
-
EXCEPTIONS
- Specifies the file name or absolute path of the file in which to write exceptions, as follows:
EXCEPTIONS 'path' [ ON nodename[,...]]
Exceptions describe why each rejected row was rejected. Each exception describes the corresponding record in the file specified by the REJECTED DATA option.
Files are written on the node or nodes executing the load. If the file already exists, it is overwritten.
To collect all exceptions in one place, use the REJECTED DATA AS TABLE clause and exceptions are automatically listed in the table's rejected_reason
column.
The ON nodename
clause moves existing exceptions files on nodename
to the indicated path
on the same node. For details, see Saving load exceptions (EXCEPTIONS).
If you use this parameter with COPY...ON ANY NODE, you must still specify the individual nodes for the exception files, as in the following example:
EXCEPTIONS '/home/ex01.txt' on v_db_node0001,'/home/ex02.txt'
on v_db_node0002,'/home/ex03.txt' on v_db_node0003
If path
resolves to a storage location, the following privileges apply to non-superusers:
-
The storage location must be created with the USER option (see CREATE LOCATION).
-
The user must have READ access to the storage location where the files exist, as described in GRANT (storage location).
-
REJECTED DATA
- Specifies where to write each row that failed to load. If this parameter is specified, records that failed due to parsing errors are always written. Records that failed due to an error during a transformation are written only if configuration parameter CopyFaultTolerantExpressions is set.
The syntax for this parameter is:
REJECTED DATA
{ 'path' [ ON nodename ] [,...] | AS TABLE reject-table }
Vertica can write rejected data to the specified path or to a table:
-
'
path
' [ON
nodename
]
: Copies the rejected row data to the specified path on the node executing the load. If qualified by ON nodename
, Vertica moves existing rejected data files on nodename
to path
on the same node.
The value of path
can be a directory or a file prefix. If there are multiple load sources, path
is always treated as a directory. If there are not multiple load sources but path
ends with '/
', or if a directory of that name already exists, it is also treated as a directory. Otherwise, path
is treated as a file prefix.
Files are written on the node or nodes executing the load. If the file already exists, it is overwritten.
When this parameter is used with LOCAL, the output is written to the client.
Note
Do not qualify path
with ON ANY NODE. To collect all rejected data in one place regardless of how the load is distributed, use a table.
-
AS TABLE reject-table
: Saves rejected rows to reject-table
.
Note
REJECTED DATA AS TABLE is incompatible with
EXCEPTIONS.
For details about both options, see Handling messy data.
REJECTMAX
integer
- The maximum number of logical records that can be rejected before a load fails. For details, see Handling messy data.
REJECTMAX disables apportioned load.
SKIP
integer
- The number of records to skip in a load file. For example, you can use the SKIP option to omit table header information.
Restrictions: Invalid for ORC or Parquet data
STREAM NAME
- Supplies a COPY load stream identifier. Using a stream name helps to quickly identify a particular load. The STREAM NAME value that you supply in the load statement appears in the STREAM_NAME column of system tables LOAD_STREAMS and LOAD_SOURCES.
A valid stream name can contain any combination of alphanumeric or special characters up to 128 bytes in length.
For example:
=> COPY mytable FROM myfile
DELIMITER '|' STREAM NAME 'My stream name';
-
WITH parser
- Specifies the parser to use when bulk loading columnar tables, one of the following:
By default, COPY uses the DELIMITER parser for UTF-8 format, delimited text input data. You do not specify the DELIMITER parser directly; absence of a specific parser indicates the default.
To use a flex table parser for column tables, use the PARSER parameter followed by a flex table parser argument. For supported flex table parsers, see Bulk loading data into flex tables.
When loading into flex tables, you must use a compatible parser. For supported flex table parsers, see Bulk loading data into flex tables.
COPY LOCAL does not support the NATIVE, NATIVE VARCHAR, ORC, and PARQUET parsers.
For parser support for complex data types, see the documentation of the specific parser.
For parser details, see Data formats in Data load.
-
NO COMMIT
- Prevents the COPY statement from committing its transaction automatically when it finishes copying data. This option must be the last COPY statement parameter.
For details, see Using transactions to stage a load.
Restrictions: Invalid for ORC or Parquet data, ignored by CREATE EXTERNAL TABLE AS COPY
Parser-specific options
The following options apply only when using specific parsers.
DELIMITED parser
DELIMITER
Indicates the single ASCII character used to separate columns within each record of a file. You can use any ASCII value in the range E'\000' to E'\177', inclusive. You cannot use the same character for both the DELIMITER and NULL parameters. For more information, see Delimited data.
Default: Vertical bar ('|').
ENCLOSED [BY]
Sets the quote character within which to enclose data, allowing delimiter characters to be embedded in string values. You can choose any ASCII value in the range E'\001' to E'\177' inclusive (any ASCII character except NULL: E'\000'). By default, ENCLOSED BY has no value, meaning data is not enclosed by any sort of quote character.
ESCAPE [AS]
Sets the escape character. Once set, the character following the escape character is interpreted literally, rather than as a special character. You can define an escape character using any ASCII value in the range E'\001' to E'\177', inclusive (any ASCII character except NULL: E'\000').
The COPY statement does not interpret the data it reads in as String literals. It also does not follow the same escape rules as other SQL statements (including the COPY parameters). When reading data, COPY interprets only the characters defined by these options as special values:
-
ESCAPE [AS]
-
DELIMITER
-
ENCLOSED [BY]
-
RECORD TERMINATOR
-
All COLLECTION options
Default: Backslash ('\').
NO ESCAPE
Eliminates escape-character handling. Use this option if you do not need any escape character and you want to prevent characters in your data from being interpreted as escape sequences.
RECORD TERMINATOR
- Specifies the literal character string indicating the end of a data file record. For more information about using this parameter, see Delimited data.
TRAILING NULLCOLS
- Specifies that if Vertica encounters a record with insufficient data to match the columns in the table column list, COPY inserts the missing columns with NULL values. For other information and examples, see Fixed-width format data.
COLLECTIONDELIMITER
For columns of collection types, indicates the single ASCII character used to separate elements within each collection. You can use any ASCII value in the range E'\000' to E'\177', inclusive. No COLLECTION option may have the same value as any other COLLECTION option. For more information, see Delimited data.
Default: Comma (',').
COLLECTIONOPEN
, COLLECTIONCLOSE
For columns of collection types, these options indicate the characters that mark the beginning and end of the collection. It is an error to use these characters elsewhere within the list of elements without escaping them. No COLLECTION option may have the same value as any other COLLECTION option.
Default: Square brackets ('[' and ']').
COLLECTIONNULLELEMENT
The string representing a null element value in a collection. You can specify a null value as any ASCII value in the range E'\001' to E'\177' inclusive (any ASCII value except NULL: E'\000'). No COLLECTION option may have the same value as any other COLLECTION option. For more information, see Delimited data.
Default: 'null'
COLLECTIONENCLOSE
For columns of collection types, sets the quote character within which to enclose individual elements, allowing delimiter characters to be embedded in string values. You can choose any ASCII value in the range E'\001' to E'\177' inclusive (any ASCII character except NULL: E'\000').
No COLLECTION option may have the same value as any other COLLECTION option.
Default: double quote ('"')
FIXEDWIDTH parser
SKIP BYTES
integer
- The total number of bytes in a record to skip.
TRIM
- Trims the number of bytes you specify from a column. This option is only available when loading fixed-width data. You can set TRIM at the table level for a column, or as part of the COLUMN OPTION parameter.
3 - Parsers
Vertica supports several parsers to load different types of data.
Vertica supports several parsers to load different types of data. Some parsers are for use only with flex tables, as noted.
3.1 - DELIMITED
Use the DELIMITED parser, which is the default, to load delimited text data using COPY.
Use the DELIMITED parser, which is the default, to load delimited text data using COPY. You can specify the delimiter, escape characters, how to handle null values, and other parameters.
The DELIMITED parser supports both apportioned load and cooperative parse.
COPY options
The following options are specific to this parser. See Parameters for other applicable options.
DELIMITER
Indicates the single ASCII character used to separate columns within each record of a file. You can use any ASCII value in the range E'\000' to E'\177', inclusive. You cannot use the same character for both the DELIMITER and NULL parameters. For more information, see Delimited data.
Default: Vertical bar ('|').
ENCLOSED [BY]
Sets the quote character within which to enclose data, allowing delimiter characters to be embedded in string values. You can choose any ASCII value in the range E'\001' to E'\177' inclusive (any ASCII character except NULL: E'\000'). By default, ENCLOSED BY has no value, meaning data is not enclosed by any sort of quote character.
ESCAPE [AS]
Sets the escape character. Once set, the character following the escape character is interpreted literally, rather than as a special character. You can define an escape character using any ASCII value in the range E'\001' to E'\177', inclusive (any ASCII character except NULL: E'\000').
The COPY statement does not interpret the data it reads in as String literals. It also does not follow the same escape rules as other SQL statements (including the COPY parameters). When reading data, COPY interprets only the characters defined by these options as special values:
-
ESCAPE [AS]
-
DELIMITER
-
ENCLOSED [BY]
-
RECORD TERMINATOR
-
All COLLECTION options
Default: Backslash ('\').
NO ESCAPE
Eliminates escape-character handling. Use this option if you do not need any escape character and you want to prevent characters in your data from being interpreted as escape sequences.
RECORD TERMINATOR
- Specifies the literal character string indicating the end of a data file record. For more information about using this parameter, see Delimited data.
TRAILING NULLCOLS
- Specifies that if Vertica encounters a record with insufficient data to match the columns in the table column list, COPY inserts the missing columns with NULL values. For other information and examples, see Fixed-width format data.
COLLECTIONDELIMITER
For columns of collection types, indicates the single ASCII character used to separate elements within each collection. You can use any ASCII value in the range E'\000' to E'\177', inclusive. No COLLECTION option may have the same value as any other COLLECTION option. For more information, see Delimited data.
Default: Comma (',').
COLLECTIONOPEN
, COLLECTIONCLOSE
For columns of collection types, these options indicate the characters that mark the beginning and end of the collection. It is an error to use these characters elsewhere within the list of elements without escaping them. No COLLECTION option may have the same value as any other COLLECTION option.
Default: Square brackets ('[' and ']').
COLLECTIONNULLELEMENT
The string representing a null element value in a collection. You can specify a null value as any ASCII value in the range E'\001' to E'\177' inclusive (any ASCII value except NULL: E'\000'). No COLLECTION option may have the same value as any other COLLECTION option. For more information, see Delimited data.
Default: 'null'
COLLECTIONENCLOSE
For columns of collection types, sets the quote character within which to enclose individual elements, allowing delimiter characters to be embedded in string values. You can choose any ASCII value in the range E'\001' to E'\177' inclusive (any ASCII character except NULL: E'\000').
No COLLECTION option may have the same value as any other COLLECTION option.
Default: double quote ('"')
Data types
The DELIMITED parser supports reading one-dimensional collections (arrays or sets) of scalar types.
If the total size of an array exceeds the size defined by the target table, the parser rejects the row.
Examples
The following example shows the default behavior, in which the delimiter character is '|'
=> CREATE TABLE employees (id INT, name VARCHAR(50), department VARCHAR(50));
CREATE TABLE
=> COPY employees FROM STDIN;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> 42|Sheldon Cooper|Physics
>> 17|Howard Wolowitz|Astronomy
>> \.
=> SELECT * FROM employees;
id | name | department
----+-----------------+--------------
17 | Howard Wolowitz | Astrophysics
42 | Sheldon Cooper | Physics
(2 rows)
The following example shows loading array values with the default options.
=> CREATE TABLE researchers (id INT, name VARCHAR, grants ARRAY[VARCHAR], values ARRAY[INT]);
CREATE TABLE
=> COPY researchers FROM STDIN;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> 42|Sheldon Cooper|[US-7376,DARPA-1567]|[65000,135000]
>> 17|Howard Wolowitz|[NASA-1683,NASA-7867,SPX-76]|[16700,85000,45000]
>> \.
=> SELECT * FROM researchers;
id | name | grants | values
----+-----------------+------------------------------------+---------------------
17 | Howard Wolowitz | ["NASA-1683","NASA-7867","SPX-76"] | [16700,85000,45000]
42 | Sheldon Cooper | ["US-7376","DARPA-1567"] | [65000,135000]
(2 rows)
In the following example, collections are enclosed in braces and delimited by periods, and the arrays contain null values.
=> COPY researchers FROM STDIN COLLECTIONOPEN '{' COLLECTIONCLOSE '}' COLLECTIONDELIMITER '.';
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> 19|Leonard|{"us-1672".null."darpa-1963"}|{16200.null.16700}
>> \.
=> SELECT * FROM researchers;
id | name | grants | values
----+-----------------+------------------------------------+---------------------
17 | Howard Wolowitz | ["NASA-1683","NASA-7867","SPX-76"] | [16700,85000,45000]
42 | Sheldon Cooper | ["US-7376","DARPA-1567"] | [65000,135000]
19 | Leonard | ["us-1672",null,"darpa-1963"] | [16200,null,16700]
(3 rows)
3.2 - FAVROPARSER
Parses data from an Avro file.
Parses data from an Avro file. The following requirements apply:
-
Avro files must be encoded in the Avro binary serialization encoding format, described in the Apache Avro standard. The parser also supports Snappy and deflate compression.
-
FAVROPARSER does not support Avro files with separate schema files. The Avro file must include the schema.
You can load complex types in the Avro source (arrays, structs, or combinations) with strong typing or as flexible complex types. A flexible complex type is loaded into a VMap column, as in flex tables. To load complex types as VMap columns, specify a column type of LONG VARBINARY. To preserve the indexing in complex types, set flatten_maps
to false.
This parser can notify you if it finds keys in the data that are not part of the table definition. See Unmatched Keys.
When loading into a flex table, Vertica loads all data into the __raw__
(VMap) column, including complex types found in the data.
This parser does not support apportioned load or cooperative parse.
Syntax
FAVROPARSER ( [parameter=value[,...]] )
Parameters
flatten_maps
- Boolean, whether to flatten all Avro maps. Key names are concatenated with nested levels. This value is recursive and affects all data in the load.
This parameter applies only to flex tables or VMap columns and is ignored when loading strongly-typed complex types.
Default: true
flatten_arrays
- Boolean, whether to flatten all Avro arrays. Key names are concatenated with nested levels. This value is recursive and affects all data in the load.
This parameter applies only to flex tables or VMap columns and is ignored when loading strongly-typed complex types.
Default: false
flatten_records
- Boolean, whether to flatten all Avro records. Key names are concatenated with nested levels. This value is recursive and affects all data in the load.
This parameter applies only to flex tables or VMap columns and is ignored when loading strongly-typed complex types.
Default: true
reject_on_materialized_type_error
Boolean, whether to reject a data row that contains a materialized column value that cannot be coerced into a compatible data type. If the value is false and the type cannot be coerced, the parser sets the value in that column to NULL.
If the column is an array and the data to be loaded is too large, then false sets the column value to NULL and true rejects the row.
If the column is a strongly-typed complex type, as opposed to a flexible complex type, then a type mismatch anywhere in the complex type causes the entire column to be treated as a mismatch. The parser does not partially load complex types.
Default: false
suppress_warnings
- String, which warnings to suppress:
-
unmatched_key
(see Unmatched Keys)
-
true
or t
(suppress all warnings)
-
false
or f
(do not suppress warnings)
Default: false
Primitive data types
FAVROPARSER supports the following primitive data types, including as element types and field values in complex types.
Note
Vertica does not have an explicit 4-byte (32-bit integer) or smaller types. Instead, Vertica encoding and compression automatically eliminate the storage overhead of values that require less than 64 bits.
Avro logical types
FAVROPARSER supports the following Avro logical types. The target column must use a Vertica data type that supports the logical type. When you attempt to load data using an invalid logical type, the logical type is ignored and the underlying Avro type is used.
AVRO Logical Type |
Base Avro Type |
Supported Vertica Data Types |
decimal
0 < precision ≤ 1024
0 ≤ scale ≤ precision
|
bytes or fixed |
NUMERIC, Character
Vertica rejects the value if:
-
The Avro precision setting is greater than the precision setting for the target column.
-
For fixed types, the precision value is greater than what is allowed by the size attribute.
If the data type for the target column uses the default precision setting, the precision setting in the Avro schema overrides the default.
|
date |
integer |
DATE, Character |
time-micros |
long |
TIME/TIMETZ, Character
The time logical type does not provide a time zone value. For target columns that use the TIMETZ data type, Vertica uses UTC as the default.
|
time-millis |
int |
timestamp-micros |
long |
TIMESTAMP/TIMESTAMPTZ, TIME/TIMETZ
For timestamp-millis only, the timezone is included and is represented as an offset to UTC. Additionally, the millisecond values are right-extended with padded zeros.
|
timestamp-millis |
long |
duration |
fixed |
INTERVAL, Character |
Avro complex data types
The Avro format supports several complex data types. When loading into strongly-typed columns, you can use the ROW and ARRAY types to represent them. For example, Avro Record and Enums are structs (ROWs); see the Avro specification.
You can use ARRAY[ROW] to match an Avro map. You must name the ROW fields key
and value
. These are the names that the Avro format uses for those fields in the data, and the parser relies on field names to match data to table columns.
If the total size of an array exceeds the size defined by the target table, the parser sets the value to null.
When loading into flex tables or using flexible complex types, this parser handles Avro complex types as follows:
Record
The name of each field is used as a virtual column name. If flatten_records
is true and several nesting levels are present, Vertica concatenates the record names to create the key name.
Map
The value of each map key is used as a virtual column name. If flatten_maps
is true and several nesting levels are present, Vertica concatenates the key names to create the key name.
Enum
Vertica treats Avro Enums like records, with the name of the Enum as the key and the value as the value.
Array
Vertica treats Avro Arrays as key/value pairs. By default, the index of each element is the key. In the following example, product_detail
is a Record with a field, product_category
, that is an Array:
=> CREATE FLEX TABLE products;
CREATE TABLE
=> COPY products FROM :datafile WITH PARSER FAVROPARSER();
Rows Loaded
-------------
2
(1 row)
=> SELECT MAPTOSTRING(__raw__) FROM products ORDER BY __identity__;
maptostring
--------------------------------------------------------------------------------
{
"__name__": "Order",
"customer_id": "111222",
"order_details": {
"0.__name__": "OrderDetail",
"0.product_detail.__name__": "Product",
"0.product_detail.price": "46.21",
"0.product_detail.product_category": {
"0": "electronics",
"1": "printers",
"2": "computers"
},
"0.product_detail.product_description": "hp printer X11ew description :\
P",
"0.product_detail.product_hash": "\u0000\u0001\u0002\u0003\u0004",
"0.product_detail.product_id": "999012",
"0.product_detail.product_map.one": "1.1",
"0.product_detail.product_map.two": "1.1",
"0.product_detail.product_name": "hp printer X11ew",
"0.product_detail.product_status": "ONLY_FEW_LEFT",
"0.quantity": "3",
"0.total": "354.34"
},
"order_id": "2389646",
"total": "132.43"
}
...
If flatten_arrays
is true and several nesting levels are present, Vertica concatenates the indices to create the key name.
=> COPY products FROM :datafile WITH PARSER FAVROPARSER(flatten_arrays=true);
Rows Loaded
-------------
2
(1 row)
=> SELECT MAPTOSTRING(__raw__) FROM products ORDER BY __identity__;
maptostring
--------------------------------------------------------------------------------
{
"__name__": "Order",
"customer_id": "111222",
"order_details.0.__name__": "OrderDetail",
"order_details.0.product_detail.__name__": "Product",
"order_details.0.product_detail.price": "46.21",
"order_details.0.product_detail.product_category.0": "electronics",
"order_details.0.product_detail.product_category.1": "printers",
"order_details.0.product_detail.product_category.2": "computers",
"order_details.0.product_detail.product_description": "hp printer X11ew des\
cription :P",
"order_details.0.product_detail.product_hash": "\u0000\u0001\u0002\u0003\u0\
004",
"order_details.0.product_detail.product_id": "999012",
"order_details.0.product_detail.product_map.one": "1.1",
"order_details.0.product_detail.product_map.two": "1.1",
"order_details.0.product_detail.product_name": "hp printer X11ew",
"order_details.0.product_detail.product_status": "ONLY_FEW_LEFT",
"order_details.0.quantity": "3",
"order_details.0.total": "354.34",
"order_id": "2389646",
"total": "132.43"
}
...
Union
Vertica treats Avro Unions as arrays.
Unmatched keys
Data being loaded can contain keys that are not part of the table definition. If you are loading into a flex table (or a flexible complex type column), no data is lost. For a table with strongly-defined columns, however, new keys cannot be loaded because the table does not have a place to put them.
This parser emits warnings if it finds new keys and if both of the following are true:
New keys are logged in the UDX_EVENTS system table. If a new key is a complex type with nested keys, only the top-level key is logged. When you see a warning about unmatched keys, you can query this table and then use ALTER TABLE to modify your table definition for future loads.
Querying an external table loads data and thus can trigger these warnings. To prevent them, set the suppress_warnings
parameter to 'unmatched_keys' or 'true':
=> CREATE EXTERNAL TABLE restaurants(
name VARCHAR(50),
menu ARRAY[ROW(item VARCHAR(50), price NUMERIC(8,2)),100])
AS COPY FROM '/data/rest.json'
PARSER FAVROPARSER(suppress_warnings='unmatched_key');
Examples
This example shows how to create and load a flex table with Avro data using favroparser
. After loading the data, you can query virtual columns:
=> CREATE FLEX TABLE avro_basic();
CREATE TABLE
=> COPY avro_basic FROM '/home/dbadmin/data/weather.avro' PARSER FAVROPARSER();
Rows Loaded
-------------
5
(1 row)
=> SELECT station, temp, time FROM avro_basic;
station | temp | time
---------+------+---------------
mohali | 0 | -619524000000
lucknow | 22 | -619506000000
norwich | -11 | -619484400000
ams | 111 | -655531200000
baddi | 78 | -655509600000
(5 rows)
For more information, see Avro data.
3.3 - FCEFPARSER
Parses ArcSight Common Event Format (CEF) log files.
Parses ArcSight Common Event Format (CEF) log files. This parser loads values directly into any table column with a column name that matches a source data key. The parser stores the data loaded into a flex table in a single VMap.
This parser is for use in Flex tables only. All flex parsers store the data as a single VMap in the LONG VARBINAR_raw__
column. If a data row is too large to fit in the column, it is rejected. Vertica supports null values for loading data with NULL-specified columns.
Syntax
FCEFPARSER ( [parameter-name='value'[,...]] )
Parameters
delimiter
- Single-character delimiter.
Default: ' '
record_terminator
- Single-character record terminator.
**Default ****value: **newline
trim
- Boolean, specifies whether to trim white space from header names and key values.
Default: true
reject_on_unescaped_delimiter
- Boolean, specifies whether to reject rows containing unescaped delimiters. The CEF standard does not permit them.
Default: false
Examples
The following example illustrates creating a sample flex table for CEF data, with two real columns, eventId
and priority
.
-
Create a flex table cefdata
:
=> create flex table cefdata();
CREATE TABLE
-
Load some basic CEF data, using the flex parser fcefparser
:
=> copy cefdata from stdin parser fcefparser();
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> CEF:0|ArcSight|ArcSight|2.4.1|machine:20|New alert|High|
>> \.
-
Use the maptostring()
function to view the contents of your cefdata
flex table:
=> select maptostring(__raw__) from cefdata;
maptostring
-------------------------------------------------------------
{
"deviceproduct" : "ArcSight",
"devicevendor" : "ArcSight",
"deviceversion" : "2.4.1",
"name" : "New alert",
"severity" : "High",
"signatureid" : "machine:20",
"version" : "0"
}
(1 row)
-
Select some virtual columns from the cefdata
flex table:
= select deviceproduct, severity, deviceversion from cefdata;
deviceproduct | severity | deviceversion
---------------+----------+---------------
ArcSight | High | 2.4.1
(1 row)
For more information, see Common event format (CEF) data
See also
3.4 - FCSVPARSER
Parses CSV format (comma-separated values) data.
Parses CSV format (comma-separated values) data. Use this parser to load CSV data into columnar, flex, and hybrid tables. All data must be encoded in Unicode UTF-8 format. The fcsvparser
parser supports the RFC 4180 standard for CSV data, and other options, to accommodate variations in CSV file format definitions. Invalid records are rejected. For more information about data formats, see Handling Non-UTF-8 input.
This parser is for use in Flex tables only. All flex parsers store the data as a single VMap in the LONG VARBINAR_raw__
column. If a data row is too large to fit in the column, it is rejected. Vertica supports null values for loading data with NULL-specified columns.
Syntax
FCSVPARSER ( [parameter='value'[,...]] )
Parameters
type
- The default parameter values for the parser, one of the following strings:
You do not have to use the type parameter when loading data that conforms to the RFC 4180 standard (such as MS Excel files). See Loading CSV data for the RFC4180
default parameters, and other options you can specify for traditional CSV files.
Default: RFC4180
delimiter
- A single-character value used to separate fields in the CSV data.
Default: ,
(for rfc4180
and traditional
)
escape
- A single-character value used as an escape character to interpret the next character in the data literally.
Default:
-
rfc4180
: "
-
traditional
: \
enclosed_by
- A single-character value. Use
enclosed_by
to include a value that is identical to the delimiter, but should be interpreted literally. For example, if the data delimiter is a comma (,
), and you want to use a comma within the data ("my name is jane, and his is jim"
).
Default: "
record_terminator
- A single-character value used to specify the end of a record.
Default:
-
rfc4180
: \n
-
traditional
: \r\n
header
- Boolean, specifies whether to use the first row of data as a header column. When
header=true
(default), and no header exists, fcsvparser uses a default column heading. The default header consists of ucol
n
, where n is the column offset number, starting with 0
for the first column. You can specify custom column heading names using the header_names
parameter, described next.
If you specify header=false
, the fcsvparser
parses the first row of input as data, rather than as column headers.
Default: true
header_names
- A list of column header names, delimited by the character defined by the parser's delimiter parameter. Use this parameter to specify header names in a CSV file without a header row, or to override the column names present in the CSV source. To override one or more existing column names, specify the header names to use. This parameter overrides any header row in the data.
trim
- Boolean, specifies whether to trim white space from header names and key values.
Default: true
omit_empty_keys
- Boolean, specifies how the parser handles header keys without values. If true, keys with an empty value in the
header
row are not loaded.
Default: false
reject_on_duplicate
- Boolean, specifies whether to ignore duplicate records (false), or to reject duplicates (true). In either case, the load continues.
Default: false
reject_on_empty_key
- Boolean, specifies whether to reject any row containing a key without a value.
Default: false
reject_on_materialized_type_error
- Boolean, specifies whether to reject any materialized column value that the parser cannot coerce into a compatible data type. See Loading CSV data.
Default: false
Examples
This example shows how you can use fcsvparser
to load a flex table, build a view, and then query that view.
-
Create a flex table for CSV data:
=> CREATE FLEX TABLE rfc();
CREATE TABLE
-
Use fcsvparser
to load the data from STDIN. Specify that no header exists, and enter some data as shown:
=> COPY rfc FROM stdin PARSER fcsvparser(header='false');
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> 10,10,20
>> 10,"10",30
>> 10,"20""5",90
>> \.
-
Run the compute_flextable_keys_and_build_view
function, and query the rfc_view
. Notice that the default enclosed_by
character permits an escape character ("
) within a field ("20""5"
). Thus, the resulting value was parsed correctly. Since no header existed in the input data, the function added ucol
n
for each column:
=> SELECT compute_flextable_keys_and_build_view('rfc');
compute_flextable_keys_and_build_view
--------------------------------------------------------------------------------------------
Please see public.rfc_keys for updated keys
The view public.rfc_view is ready for querying
(1 row)
=> SELECT * FROM rfc_view;
ucol0 | ucol1 | ucol2
-------+-------+-------
10 | 10 | 20
10 | 10 | 30
10 | 20"5 | 90
(3 rows)
For more information and examples using other parameters of this parser, see Loading CSV data.
See also
3.5 - FDELIMITEDPAIRPARSER
Parses delimited data files.
Parses delimited data files. This parser provides a subset of the functionality in the parser fdelimitedparser
. Use the fdelimitedpairparser
when the data you are loading specifies pairs of column names with data in each row.
This parser is for use in Flex tables only. All flex parsers store the data as a single VMap in the LONG VARBINAR_raw__
column. If a data row is too large to fit in the column, it is rejected. Vertica supports null values for loading data with NULL-specified columns.
Syntax
FDELIMITEDPAIRPARSER ( [parameter-name='value'[,...]] )
Parameters
delimiter
- Specifies a single-character delimiter.
Default: ' '
record_terminator
- Specifies a single-character record terminator.
Default: newline
trim
- Boolean specifies whether to trim white space from header names and key values.
Default: true
Examples
The following example illustrates creating a sample flex table for simple delimited data, with two real columns, eventId
and priority
.
-
Create a table:
=> create flex table CEFData(eventId int default(eventId::int), priority int default(priority::int) );
CREATE TABLE
-
Load a sample delimited OpenText ArcSight log file into the CEFData
table, using the fcefparser
:
=> copy CEFData from '/home/release/kmm/flextables/sampleArcSight.txt' parser fdelimitedpairparser();
Rows Loaded | 200
-
After loading the sample data file, use maptostring()
to display the virtual columns in the __raw__
column of CEFData
:
=> select maptostring(__raw__) from CEFData limit 1; maptostring
-----------------------------------------------------------
"agentassetid" : "4-WwHuD0BABCCQDVAeX21vg==",
"agentzone" : "3083",
"agt" : "265723237",
"ahost" : "svsvm0176",
"aid" : "3tGoHuD0BABCCMDVAeX21vg==",
"art" : "1099267576901",
"assetcriticality" : "0",
"at" : "snort_db",
"atz" : "America/Los_Angeles",
"av" : "5.3.0.19524.0",
"cat" : "attempted-recon",
"categorybehavior" : "/Communicate/Query",
"categorydevicegroup" : "/IDS/Network",
"categoryobject" : "/Host",
"categoryoutcome" : "/Attempt",
"categorysignificance" : "/Recon",
"categorytechnique" : "/Scan",
"categorytupledescription" : "An IDS observed a scan of a host.",
"cnt" : "1",
"cs2" : "3",
"destinationgeocountrycode" : "US",
"destinationgeolocationinfo" : "Richardson",
"destinationgeopostalcode" : "75082",
"destinationgeoregioncode" : "TX",
"destinationzone" : "3133",
"device product" : "Snort",
"device vendor" : "Snort",
"device version" : "1.8",
"deviceseverity" : "2",
"dhost" : "198.198.121.200",
"dlat" : "329913940429",
"dlong" : "-966644973754",
"dst" : "3334896072",
"dtz" : "America/Los_Angeles",
"dvchost" : "unknown:eth1",
"end" : "1364676323451",
"eventid" : "1219383333",
"fdevice product" : "Snort",
"fdevice vendor" : "Snort",
"fdevice version" : "1.8",
"fdtz" : "America/Los_Angeles",
"fdvchost" : "unknown:eth1",
"lblstring2label" : "sig_rev",
"locality" : "0",
"modelconfidence" : "0",
"mrt" : "1364675789222",
"name" : "ICMP PING NMAP",
"oagentassetid" : "4-WwHuD0BABCCQDVAeX21vg==",
"oagentzone" : "3083",
"oagt" : "265723237",
"oahost" : "svsvm0176",
"oaid" : "3tGoHuD0BABCCMDVAeX21vg==",
"oat" : "snort_db",
"oatz" : "America/Los_Angeles",
"oav" : "5.3.0.19524.0",
"originator" : "0",
"priority" : "8",
"proto" : "ICMP",
"relevance" : "10",
"rt" : "1099267573000",
"severity" : "8",
"shost" : "198.198.104.10",
"signature id" : "[1:469]",
"slat" : "329913940429",
"slong" : "-966644973754",
"sourcegeocountrycode" : "US",
"sourcegeolocationinfo" : "Richardson",
"sourcegeopostalcode" : "75082",
"sourcegeoregioncode" : "TX",
"sourcezone" : "3133",
"src" : "3334891530",
"start" : "1364676323451",
"type" : "0"
}
(1 row)
-
Select the eventID
and priority
real columns, along with two virtual columns, atz
and destinationgeoregioncode
:
=> select eventID, priority, atz, destinationgeoregioncode from CEFData limit 10;
eventID | priority | atz | destinationgeoregioncode
------------+----------+---------------------+--------------------------
1218325417 | 5 | America/Los_Angeles |
1219383333 | 8 | America/Los_Angeles | TX
1219533691 | 9 | America/Los_Angeles | TX
1220034458 | 5 | America/Los_Angeles | TX
1220034578 | 9 | America/Los_Angeles |
1220067119 | 5 | America/Los_Angeles | TX
1220106960 | 5 | America/Los_Angeles | TX
1220142122 | 5 | America/Los_Angeles | TX
1220312009 | 5 | America/Los_Angeles | TX
1220321355 | 5 | America/Los_Angeles | CA
(10 rows)
See also
3.6 - FDELIMITEDPARSER
Parses data using a delimiter character to separate values.
Parses data using a delimiter character to separate values. The fdelimitedparser
loads delimited data, storing it in a single-value VMap.
This parser is for use in Flex tables only. All flex parsers store the data as a single VMap in the LONG VARBINAR_raw__
column. If a data row is too large to fit in the column, it is rejected. Vertica supports null values for loading data with NULL-specified columns.
Note
By default, fdelimitedparser
treats empty fields as NULL
, rather than as an empty string (''
). This behavior makes casting easier. Casting a NULL to an integer (NULL::int
) is valid, while casting an empty string to an integer (''::int
) is not. If required, use the treat_empty_val_as_null
parameter to change the default behavior of fdelimitedparser
.
Syntax
FDLIMITEDPARSER ( [parameter-name='value'[,...]] )
Parameters
delimiter
- Single character delimiter.
Default:|
record_terminator
- Single-character record terminator.
Default:\n
trim
- Boolean, specifies whether to trim white space from header names and key values.
Default: true
header
- Boolean, specifies that a header column exists. The parser uses
col###
for the column names if you use this parameter but no header exists.
Default:true
omit_empty_keys
- Boolean, specifies how the parser handles header keys without values. If
omit_empty_keys=true
, keys with an empty value in the header
row are not loaded.
Default: false
reject_on_duplicate
- Boolean, specifies whether to ignore duplicate records (
false
), or to reject duplicates (true
). In either case, the load continues.
Default:false
reject_on_empty_key
- Boolean, specifies whether to reject any row containing a key without a value.
Default:false
reject_on_materialized_type_error
- Boolean, specifies whether to reject any row value for a materialized column that the parser cannot coerce into a compatible data type. See Using flex table parsers.
Default:false
treat_empty_val_as_null
- Boolean, specifies that empty fields become
NULLs
, rather than empty strings (''
).
Default: true
Examples
-
Create a flex table for delimited data:
t=> CREATE FLEX TABLE delim_flex ();
CREATE TABLE
-
Use the fdelimitedparser
to load some delimited data from STDIN
, specifying a comma (,
) column delimiter:
=> COPY delim_flex FROM STDIN parser fdelimitedparser (delimiter=',');
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> deviceproduct, severity, deviceversion
>> ArcSight, High, 2.4.1
>> \.
You can now query virtual columns in the delim_flex
flex table:
=> SELECT deviceproduct, severity, deviceversion from delim_flex;
deviceproduct | severity | deviceversion
---------------+----------+---------------
ArcSight | High | 2.4.1
(1 row)
See also
3.7 - FJSONPARSER
Parses and loads a JSON file.
Parses and loads a JSON file. This file can contain either repeated JSON data objects (including nested maps), or an outer list of JSON elements.
When loading into a flex or hybrid table, the parser stores the JSON data in a single-value VMap. When loading into a hybrid or columnar table, the parser loads data directly into any table column with a column name that matches a key in the JSON source data.
You can load complex types in the JSON source (arrays, structs, or combinations) with strong typing or as flexible complex types. A flexible complex type is loaded into a VMap column, as in flex tables. To load complex types as VMap columns, specify a column type of LONG VARBINARY. To preserve the indexing in complex types, set flatten_maps
to false.
This parser can notify you if it finds keys in the data that are not part of the table definition. See Unmatched Keys.
FJSONPARSER supports cooperative parse only if record_terminator
is specified. It does not support apportioned load.
Syntax
FJSONPARSER ( [parameter=value[,...]] )
Parameters
flatten_maps
- Boolean, whether to flatten sub-maps within the JSON data, separating map levels with a period (
.
). This value affects all data in the load, including nested maps.
This parameter applies only to flex tables or VMap columns and is ignored when loading strongly-typed complex types.
Default: true
flatten_arrays
- Boolean, whether to convert lists to sub-maps with integer keys. When lists are flattened, key names are concatenated as for maps. Lists are not flattened by default. This value affects all data in the load, including nested lists.
This parameter applies only to flex tables or VMap columns and is ignored when loading strongly-typed complex types.
Default: false
reject_on_duplicate
- Boolean, whether to ignore duplicate records (false), or to reject duplicates (true). In either case, the load continues.
Default: false
reject_on_empty_key
- Boolean, whether to reject any row containing a field key without a value.
Default: false
omit_empty_keys
- Boolean, whether to omit any field key from the data that does not have a value. Other fields in the same record are loaded.
Default: false
record_terminator
- When set, any invalid JSON records are skipped and parsing continues with the next record. Records must be terminated uniformly. For example, if your input file has JSON records terminated by newline characters, set this parameter to
E'\n')
. If any invalid JSON records exist, parsing continues after the next record_terminator
.
Even if the data does not contain invalid records, specifying an explicit record terminator can improve load performance by allowing cooperative parse and apportioned load to operate more efficiently.
When you omit this parameter, parsing ends at the first invalid JSON record.
reject_on_materialized_type_error
Boolean, whether to reject a data row that contains a materialized column value that cannot be coerced into a compatible data type. If the value is false and the type cannot be coerced, the parser sets the value in that column to NULL.
If the column is an array and the data to be loaded is too large, then false sets the column value to NULL and true rejects the row.
If the column is a strongly-typed complex type, as opposed to a flexible complex type, then a type mismatch anywhere in the complex type causes the entire column to be treated as a mismatch. The parser does not partially load complex types.
Default: false
start_point
- String, the name of a key in the JSON load data at which to begin parsing. The parser ignores all data before the
start_point
value. The value is loaded for each object in the file. The parser processes data after the first instance, and up to the second, ignoring any remaining data.
start_point_occurrence
- Integer, the nth occurrence of the value you specify with
start_point
. Use in conjunction with start_point
when the data has multiple start values and you know the occurrence at which to begin parsing.
Default: 1
suppress_nonalphanumeric_key_chars
- Boolean, whether to suppress non-alphanumeric characters in JSON key values. The parser replaces these characters with an underscore (
_
) when this parameter is true.
Default: false
key_separator
- Character for the parser to use when concatenating key names.
Default: period (.
)
suppress_warnings
- String, which warnings to suppress:
-
unmatched_key
(see Unmatched Keys)
-
true
or t
(suppress all warnings)
-
false
or f
(do not suppress warnings)
Default: false
Data types
If the total size of an array exceeds the size defined by the target table, the parser sets the value to null.
Unmatched keys
Data being loaded can contain keys that are not part of the table definition. If you are loading into a flex table (or a flexible complex type column), no data is lost. For a table with strongly-defined columns, however, new keys cannot be loaded because the table does not have a place to put them.
This parser emits warnings if it finds new keys and if both of the following are true:
New keys are logged in the UDX_EVENTS system table. If a new key is a complex type with nested keys, only the top-level key is logged. When you see a warning about unmatched keys, you can query this table and then use ALTER TABLE to modify your table definition for future loads.
Querying an external table loads data and thus can trigger these warnings. To prevent them, set the suppress_warnings
parameter to 'unmatched_keys' or 'true':
=> CREATE EXTERNAL TABLE restaurants(
name VARCHAR(50),
menu ARRAY[ROW(item VARCHAR(50), price NUMERIC(8,2)),100])
AS COPY FROM '/data/rest.json'
PARSER FJSONPARSER(suppress_warnings='unmatched_key');
Examples
The following example loads JSON data from STDIN using the default parameters:
=> CREATE TABLE people(age INT, name VARCHAR);
CREATE TABLE
=> COPY people FROM STDIN PARSER FJSONPARSER();
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> {"age": 5, "name": "Tim"}
>> {"age": 3}
>> {"name": "Fred"}
>> {"name": "Bob", "age": 10}
>> \.
=> SELECT * FROM people;
age | name
-----+------
| Fred
10 | Bob
5 | Tim
3 |
(4 rows)
The following example uses the reject_on_duplicate
parameter to reject duplicate values:
=> CREATE FLEX TABLE json_dupes();
CREATE TABLE
=> COPY json_dupes FROM stdin PARSER fjsonparser(reject_on_duplicate=true)
exceptions '/home/dbadmin/load_errors/json_e.out'
rejected data '/home/dbadmin/load_errors/json_r.out';
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> {"a":"1","a":"2","b":"3"}
>> \.
=> \!cat /home/dbadmin/load_errors/json_e.out
COPY: Input record 1 has been rejected (Rejected by user-defined parser).
Please see /home/dbadmin/load_errors/json_r.out, record 1 for the rejected record.
COPY: Loaded 0 rows, rejected 1 rows.
The following example loads array data:
$ cat addrs.json
{"number": 301, "street": "Grant", "attributes": [1, 2, 3, 4]}
=> CREATE EXTERNAL TABLE customers(number INT, street VARCHAR, attributes ARRAY[INT])
AS COPY FROM 'addrs.json' PARSER fjsonparser();
=> SELECT number, street, attributes FROM customers;
num | street| attributes
-----+-----------+---------------
301 | Grant | [1,2,3,4]
(1 row)
The following example loads a flexible complex type, rejecting rows that have empty keys within the nested records. Notice that while the data has two restaurants, one has a key name that is an empty string. This one is rejected:
$ cat rest1.json
{
"name" : "Bob's pizzeria",
"cuisine" : "Italian",
"location_city" : ["Cambridge", "Pittsburgh"],
"menu" : [{"item" : "cheese pizza", "" : "$8.25"},
{"item" : "spinach pizza", "price" : "$10.50"}]
}
{
"name" : "Bakersfield Tacos",
"cuisine" : "Mexican",
"location_city" : ["Pittsburgh"],
"menu" : [{"item" : "veggie taco", "price" : "$9.95"},
{"item" : "steak taco", "price" : "$10.95"}]
}
=> CREATE TABLE rest (name VARCHAR, cuisine VARCHAR, location_city LONG VARBINARY, menu LONG VARBINARY);
=> COPY rest FROM '/data/rest1.json'
PARSER fjsonparser(flatten_maps=false, reject_on_empty_key=true);
Rows Loaded
------------
1
(1 row)
=> SELECT maptostring(location_city), maptostring(menu) FROM rest;
maptostring | maptostring
---------------------------+-------------------------------------------------------
{
"0": "Pittsburgh"
} | {
"0": {
"item": "veggie taco",
"price": "$9.95"
},
"1": {
"item": "steak taco",
"price": "$10.95"
}
}
(1 row)
To instead load partial data, use omit_empty_keys
to bypass the missing keys while loading everything else:
=> COPY rest FROM '/data/rest1.json'
PARSER fjsonparser(flatten_maps=false, omit_empty_keys=true);
Rows Loaded
-------------
2
(1 row)
=> SELECT maptostring(location_city), maptostring(menu) from rest;
maptostring | maptostring
-------------------------------------------------+---------------------------------
{
"0": "Pittsburgh"
} | {
"0": {
"item": "veggie taco",
"price": "$9.95"
},
"1": {
"item": "steak taco",
"price": "$10.95"
}
}
{
"0": "Cambridge",
"1": "Pittsburgh"
} | {
"0": {
"item": "cheese pizza"
},
"1": {
"item": "spinach pizza",
"price": "$10.50"
}
}
(2 rows)
To instead load this data with strong typing, define the complex types in the table:
=> CREATE EXTERNAL TABLE restaurants
(name VARCHAR, cuisine VARCHAR,
location_city ARRAY[VARCHAR(80)],
menu ARRAY[ ROW(item VARCHAR(80), price FLOAT) ]
)
AS COPY FROM '/data/rest.json' PARSER fjsonparser();
=> SELECT * FROM restaurants;
name | cuisine | location_city | \
menu
-------------------+---------+----------------------------+--------------------\
--------------------------------------------------------
Bob's pizzeria | Italian | ["Cambridge","Pittsburgh"] | [{"item":"cheese pi\
zza","price":0.0},{"item":"spinach pizza","price":0.0}]
Bakersfield Tacos | Mexican | ["Pittsburgh"] | [{"item":"veggie ta\
co","price":0.0},{"item":"steak taco","price":0.0}]
(2 rows)
In the following example, the data contains two new fields. One is a top-level field (a new column), and the other is a new field on an existing struct. The new fields are recorded in the UDX_EVENTS system table:
=> COPY rest FROM '/data/rest2.json' PARSER FJSONPARSER();
WARNING 10596: Warning in UDx call in user-defined object [FJSONParser], code: 0, message:
Data source contained keys which did not match table schema
HINT: SELECT key, sum(num_instances) FROM v_monitor.udx_events WHERE event_type = 'UNMATCHED_KEY' GROUP BY key
Rows Loaded
-------------
2
(1 row)
=> SELECT key, SUM(num_instances) FROM v_monitor.UDX_EVENTS
WHERE event_type = 'UNMATCHED_KEY' GROUP BY key;
key | SUM
------------------------+-----
chain | 1
menu.elements.calories | 7
(2 rows)
For other examples, see JSON data.
3.8 - FREGEXPARSER
Parses a regular expression, matching columns to the contents of the named regular expression groups.
Parses a regular expression, matching columns to the contents of the named regular expression groups.
This parser is for use in Flex tables only. All flex parsers store the data as a single VMap in the LONG VARBINAR_raw__
column. If a data row is too large to fit in the column, it is rejected. Vertica supports null values for loading data with NULL-specified columns.
Syntax
FREGEXPARSER ( pattern=[parameter-name='value'[,...]] )
Parameters
pattern
- Specifies the regular expression of data to match.
Default: Empty string (""
)
use_jit
- Boolean, specifies whether to use just-in-time compiling when parsing the regular expression.
Default: false
record_terminator
- Specifies the character used to separate input records.
Default: \n
logline_column
- A string that captures the destination column containing the full string that the regular expression matched.
Default: Empty string (""
)
Example
These examples use the following regular expression, which searches for information that includes the timestamp
, date
, thread_name
, and thread_id
strings.
Caution
For display purposes, this sample regular expression adds new line characters to split long lines of text. To use this expression in a query, first copy and edit the example to remove any new line characters.
This example expression loads any thread_id
hex value, regardless of whether it has a 0x
prefix, (<thread_id>(?:0x)?[0-9a-f]+)
.
'^(?<time>\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d\.\d+)
(?<thread_name>[A-Za-z ]+):(?<thread_id>(?:0x)?[0-9a-f]+)
-?(?<transaction_id>[0-9a-f])?(?:[(?<component>\w+)]
\<(?<level>\w+)\> )?(?:<(?<elevel>\w+)> @[?(?<enode>\w+)]?: )
?(?<text>.*)'
-
Create a flex table (vlog
) to contain the results of a Vertica log file. For this example, we made a copy of a log file in the directory /home/dbadmin/data/vertica.log
:
=> CREATE FLEX TABLE vlog1();
CREATE TABLE
-
Use the fregexparser
with the sample regular expression to load data from the log file. Be sure to remove any line characters before using this expression shown here:
=> COPY vlog1 FROM '/home/dbadmin/tempdat/KMvertica.log'
PARSER FREGEXPARSER(pattern='^(?<time>\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d\.\d+)
(?<thread_name>[A-Za-z ]+):(?<thread_id>(?:0x)?[0-9a-f]+)
-?(?<transaction_id>[0-9a-f])?(?:[(?<component>\w+)]
\<(?<level>\w+)\> )?(?:<(?<elevel>\w+)> @[?(?<enode>\w+)]?: )
?(?<text>.*)'
);
Rows Loaded
-------------
31049
(1 row)
-
After successfully loading data, use the MAPTOSTRING() function with the table's __raw__
column. The four rows (limt 4
) that the query returns are regular expression results of the KMvertica.log
file, parsed with fregexparser
. The output shows thread_id
values with a preceding 0x
or without:
=> SELECT maptostring(__raw__) FROM vlog1 LIMIT 4;
maptostring
-------------------------------------------------------------------------------------
{
"text" : " [Init] <INFO> Log /home/dbadmin/VMart/v_vmart_node0001_catalog/vertica.log
opened; #2",
"thread_id" : "0x7f2157e287c0",
"thread_name" : "Main",
"time" : "2017-03-21 23:30:01.704"
}
{
"text" : " [Init] <INFO> Processing command line: /opt/vertica/bin/vertica -D
/home/dbadmin/VMart/v_vmart_node0001_catalog -C VMart -n v_vmart_node0001 -h
10.20.100.247 -p 5433 -P 4803 -Y ipv4",
"thread_id" : "0x7f2157e287c0",
"thread_name" : "Main",
"time" : "2017-03-21 23:30:01.704"
}
{
"text" : " [Init] <INFO> Starting up Vertica Analytic Database v8.1.1-20170321",
"thread_id" : "7f2157e287c0",
"thread_name" : "Main",
"time" : "2017-03-21 23:30:01.704"
}
{
"text" : " [Init] <INFO> Compiler Version: 4.8.2 20140120 (Red Hat 4.8.2-15)",
"thread_id" : "7f2157e287c0",
"thread_name" : "Main",
"time" : "2017-03-21 23:30:01.704"
}
(4 rows)
See also
3.9 - ORC
Use the ORC clause with the COPY FROM statement to load data in the ORC format.
Use the ORC clause with the COPY statement to load data in the ORC format. When loading data into Vertica, you can read all primitive types, UUIDs, and complex types.
When loading ORC data, you must account for all columns in the data; you cannot select only some columns.
If the table definition includes columns of primitive types and those columns are not in the data, the parser fills those columns with NULL. If the table definition includes columns of complex types, those columns must be present in the data.
This parser does not support apportioned load or cooperative parse.
Syntax
ORC ( [ parameter=value[,...] ] )
Parameters
All parameters are optional.
hive_partition_cols
- Comma-separated list of columns that are partition columns in the data.
Deprecated
Instead, use COPY PARTITION COLUMNS. See
Partitioned file paths. If you use both this parameter and PARTITION COLUMNS, COPY ignores the parameter.
allow_no_match
- Whether to accept a path containing a glob with no matching files and report zero rows in query results. If this parameter is not set, Vertica returns an error if the path in the FROM clause does not match at least one file.
Data types
This parser can read all primitive types, UUIDs, and complex types.
If the total size of an array exceeds the size defined by the target table, the parser rejects the row.
Examples
The ORC clause does not use the PARSER option:
=> CREATE EXTERNAL TABLE orders (...)
AS COPY FROM 's3://DataLake/orders.orc' ORC;
You can read a map column as an array of rows, as in the following example:
=> CREATE EXTERNAL TABLE orders
(orderkey INT,
custkey INT,
prods ARRAY[ROW(key VARCHAR(10), value DECIMAL(12,2))],
orderdate DATE
) AS COPY FROM '...' ORC;
3.10 - PARQUET
Use the PARQUET parser with the COPY FROM statement to load data in the Parquet format.
Use the PARQUET
parser with the COPY statement to load data in the Parquet format. When loading data into Vertica you can read all primitive types, UUIDs, and complex types.
By default, the Parquet parser uses strong schema matching, meaning that columns in the data must exactly match the columns in the table using the data. You can optionally use Loose Schema Matching.
When loading Parquet data, Vertica caches the Parquet metadata to improve efficiency. This cache uses local TEMP storage and is not used if TEMP is remote. See the ParquetMetadataCacheSizeMB configuration parameter to change the size of the cache.
This parser does not support apportioned load or cooperative parse.
Syntax
PARQUET ( [ parameter=value[,...] ] )
Parameters
All parameters are optional.
hive_partition_cols
- Comma-separated list of columns that are partition columns in the data.
Deprecated
Instead, use COPY PARTITION COLUMNS. See
Partitioned file paths. If you use both this parameter and PARTITION COLUMNS, COPY ignores the parameter.
allow_no_match
- Boolean. Whether to accept a path containing a glob with no matching files and report zero rows in query results. If this parameter is not set, Vertica returns an error if the path in the FROM clause does not match at least one file.
allow_long_varbinary_match_complex_type
- Boolean. Whether to enable flexible column types (see Flexible complex types). If true, the Parquet parser allows a complex type in the data to match a table column defined as LONG VARBINARY. If false, the Parquet parser requires strong typing of complex types. With the parameter set you can still use strong typing. Set this parameter to false if you want use of flexible columns to be treated as an error.
do_soft_schema_match_by_name
- Boolean. Whether to enable loose schema matching (true) instead of the strict matching based on column order in the table definition and parquet file (false, default). See Loose Schema Matching for more information.
reject_on_materialized_type_error
- Boolean, applies only if
do_soft_schema_match_by_name
is true. Specifies what to do when loose schema matching is being used and a value cannot be coerced from the data to the target column type. A value of true (default) means to reject the row; a value of false means to use NULL for the value or, for strings that are too long, truncate. See the table of type coercions for coercible type mappings.
Loose schema matching
By default, the Parquet parser uses strong schema matching. This means that all columns in the Parquet data must be loaded, in the same order as in the data. However, there are times when you only want to pull certain columns, or you want to be able to accommodate future changes in the Parquet schema.
Use the do_soft_schema_match_by_name
parameter to enable soft schema matching. This setting has the following effects:
-
Columns in the data are matched to columns in the table by their names. Names must exactly match but are case-insensitive.
-
Columns that exist in the Parquet data but are not part of the table definition are ignored.
-
Columns that exist in the table definition but not the Parquet data are filled with NULL. The parser logs an UNMATCHED_TABLE_COLUMNS_PARQUETPARSER event in QUERY_EVENTS.
-
If the same case-insensitive column name occurs more than once in the Parquet data, the parser uses the last one. (This situation can arise when using data written by tools that are case-sensitive.)
-
Column types do not need to exactly match, so long as the data type in the Parquet file can be coerced to the type used by the table. If a type cannot be coerced, the parser logs a TYPE_MISMATCH_COLUMNS_PARQUETPARSER event in QUERY_EVENTS. If reject_on_materialized_type_error
is true then the parser rejects the row. If it is false, the parser uses NULL or, for string values that are too long, truncates the value.
-
Columns using complex types (other than one-dimensional arrays of primitive types) can be defined but not queried.
Data types
The Parquet parser maps Parquet data types to Vertica data types as follows.
Parquet Logical Type |
Vertica Data Type |
StringLogicalType |
VARCHAR |
MapLogicalType |
ARRAY[ROW] |
ListLogicalType |
ARRAY/SET |
IntLogicalType |
INT/NUMERIC |
DecimalLogicalType |
NUMERIC |
DateLogicalType |
DATE |
TimeLogicalType |
TIME |
TimestampLogicalType |
TIMESTAMP |
UUIDLogicalType |
UUID |
If the total size of an array exceeds the size defined by the target table, the parser rejects the row.
The following logical types are not supported:
- EnumLogicalType
- IntervalLogicalType
- JSONLogicalType
- BSONLogicalType
- UnknownLogicalType
The Parquet parser supports the following mappings of physical types:
Parquet Physical Type |
Vertica Data Type |
BOOLEAN |
BOOLEAN |
INT32/INT64 |
INT |
INT96 |
Supported only for TIMESTAMP |
FLOAT |
DOUBLE |
DOUBLE |
DOUBLE |
BYTE_ARRAY |
VARBINARY |
FIXED_LEN_BYTE_ARRAY |
BINARY |
The following mappings are supported with type coercion and loose schema matching.
Parquet Physical Type |
Coercible to Vertica Data Type |
BOOLEAN |
BOOLEAN |
INT32, INT64, BOOLEAN |
INT |
FLOAT, DOUBLE |
DOUBLE |
INT32, INT96 |
DATE |
INT64, INT96 |
TIMESTAMP, TIMESTAMPTZ |
INT64
If precision > 0: INT32, BYTE_ARRAY, FIXED_LEN_BYTE_ARRAY
|
Numeric |
BYTE_ARRAY |
CHAR, VARCHAR, LONG VARCHAR, BINARY, VARBINARY, LONG VARBINARY |
FIXED_LEN_BYTE_ARRAY |
UUID |
Vertica supports only 3-level-encoded arrays, not 2-level-encoded.
Examples
The PARQUET clause does not use the PARSER option:
=> COPY sales FROM 's3://DataLake/sales.parquet' PARQUET;
In the following example, the data directory contains no files:
=> CREATE EXTERNAL TABLE customers (...)
AS COPY FROM 'webhdfs:///data/*.parquet' PARQUET;
=> SELECT COUNT(*) FROM customers;
ERROR 7869: No files match when expanding glob: [webhdfs:///data/*.parquet]
To read zero rows instead of producing an error, use the allow_no_match
parameter:
=> CREATE EXTERNAL TABLE customers (...)
AS COPY FROM 'webhdfs:///data/*.parquet'
PARQUET(allow_no_match='true');
=> SELECT COUNT(*) FROM customers;
count
-------
0
(1 row)
To allow reading a complex type (menu, in this example) as a flexible column type, use the allow_long_varbinary_match_complex_type
parameter:
=> CREATE EXTERNAL TABLE restaurants
(name VARCHAR, cuisine VARCHAR, location_city ARRAY[VARCHAR], menu LONG VARBINARY)
AS COPY FROM '/data/rest*.parquet'
PARQUET(allow_long_varbinary_match_complex_type='True');
To read only some columns from the restaurant data, use loose schema matching:
=> CREATE EXTERNAL TABLE restaurants(name VARCHAR, cuisine VARCHAR)
AS COPY FROM '/data/rest*.parquet'
PARQUET(allow_long_varbinary_match_complex_type='True',
do_soft_schema_match_by_name='True');
=> SELECT * from restaurant;
name | cuisine
-------------------+----------
Bob's pizzeria | Italian
Bakersfield Tacos | Mexican
(2 rows)
4 - Restrictions
COPY has the following restrictions:.
COPY has the following restrictions:
Invalid data
COPY considers the following data invalid:
-
Missing columns (an input line has fewer columns than the recipient table).
-
Extra columns (an input line has more columns than the recipient table).
-
Empty columns for an INTEGER or DATE/TIME data type. If a column is empty for either of these types, COPY
does not use the default value that was defined by CREATE TABLE. However, if you do not supply a column option as part of the COPY
statement, the default value is used.
-
Incorrect representation of a data type. For example, trying to load a non-numeric value into an INTEGER column is invalid.
Constraint violations
If primary key, unique key, or check constraints are enabled for automatic enforcement in the target table, Vertica enforces those constraints when you load new data. If a violation occurs, Vertica rolls back the operation and returns an error.
Empty line handling
When COPY
encounters an empty line while loading data, the line is neither inserted nor rejected, but COPY
increments the line record number. Consider this behavior when evaluating rejected records. If you return a list of rejected records and COPY
encountered an empty row while loading data, the position of rejected records is not incremented by one, as demonstrated in the following example.
The example first loads values into a table that defines the first column as INT. Note the errors on rows 3, 4, and 8:
=> \! cat -n /home/dbadmin/test.txt
1 1|A|2
2 2|B|4
3 A|D|7
4 A|E|7
5
6
7 6|A|3
8 B|A|3
The empty rows (5 and 6) shift the reporting of the error on row 8:
=> SELECT row_number, rejected_data, rejected_reason FROM test_bad;
row_number | rejected_data | rejected_reason
------------+---------------+----------------------------------------------
3 | A|D|7 | Invalid integer format 'A' for column 1 (c1)
4 | A|E|7 | Invalid integer format 'A' for column 1 (c1)
6 | B|A|3 | Invalid integer format 'B' for column 1 (c1)
(3 rows)
Compressed file errors
When loading compressed files, COPY
might abort and report an error, if the file seems to be corrupted. For example, this behavior can occur if reading the header block fails.
Disk quota
Tables and schemas can have disk quotas. If a load would violate either quota, the operation fails. For more information, see Disk quotas.