This is the multi-page printable view of this section.
Click here to print.
Return to the regular view of this page.
COPY
COPY; Load data;.
COPY bulk-loads data into a Vertica database. By default, COPY automatically commits itself and any current transaction except when loading temporary tables. If COPY is terminated or interrupted Vertica rolls it back.
COPY reads data as UTF-8 encoding.
For information on loading one or more files or pipes on a cluster host or on a client system, see COPY LOCAL.
Syntax
COPY [ /*+ LABEL (label-string)*/ ] [[database.]schema-name.]target-table
[ ( { column-as-expression | column }
[ DELIMITER [ AS ] 'char' ]
[ ENCLOSED [ BY ] 'char' ]
[ ENFORCELENGTH ]
[ ESCAPE [ AS ] 'char' | NO ESCAPE ]
[ FILLER datatype]
[ FORMAT 'format' ]
[ NULL [ AS ] 'string' ]
[ TRIM 'byte' ]
[,...] ) ]
[ COLUMN OPTION ( column
[ DELIMITER [ AS ] 'char' ]
[ ENCLOSED [ BY ] 'char' ]
[ ENFORCELENGTH ]
[ ESCAPE [ AS ] 'char' | NO ESCAPE ]
[ FORMAT 'format' ]
[ NULL [ AS ] 'string' ]
[ TRIM 'byte' ]
[,...] ) ]
FROM {
[ LOCAL ] STDIN [ input-format ]
| 'path-to-data' [ ON { nodename | (nodeset) | ANY NODE } ] [ compression ] }[,...]
| LOCAL '`*`path-to-data`*`' [ compression ] [,...]
| VERTICA source-database.[source-schema.]source-table[( source-column[,...] ) ]
}
| [ WITH ] UDL-clause[...]
}
[ ABORT ON ERROR ]
[ DELIMITER [ AS ] 'char' ]
[ ENCLOSED [ BY ] 'char'
[ ENFORCELENGTH ]
[ ERROR TOLERANCE ]
[ ESCAPE [ AS ] 'char' | NO ESCAPE ]
[ EXCEPTIONS 'path' [ ON nodename ] [,...]
[ NULL [ AS ] 'string' ]
[ RECORD TERMINATOR 'string' ]
[ REJECTED DATA {'path' [ ON nodename ] [,...] | AS TABLE reject-table} ]
[ REJECTMAX integer ]
[ SKIP integer ]
[ SKIP BYTES integer ]
[ STREAM NAME 'streamName']
[ TRAILING NULLCOLS ]
[ TRIM 'byte' ]
[ [ WITH ] PARSER parser ([ arg=value[,...] ]) ] ]
[ NO COMMIT ]
Parameters
See Parameters.
Restrictions
See Restrictions.
Privileges
Superusers have full COPY privileges. The following requirements apply to non-superusers:
-
INSERT privilege on table
-
USAGE privilege on schema
-
USER-accessible storage location
-
Applicable READ or WRITE privileges granted to the storage location where files are read or written
COPY can specify a path to store rejected data and exceptions. If the path resolves to a storage location, the following privileges apply to non-superusers:
1 - DELIMITED (parser)
Use the DELIMITED parser, which is the default, to load delimited text data using COPY.
Use the DELIMITED parser, which is the default, to load delimited text data using COPY. You can specify the delimiter, escape characters, how to handle null values, and other parameters.
The DELIMITED parser supports reading one-dimensional collections (arrays or sets) of scalar types.
COPY options
The following options are specific to this parser. See Parameters for other applicable options.
DELIMITER
Indicates the single ASCII character used to separate columns within each record of a file. You can use any ASCII value in the range E'\000' to E'\177', inclusive. You cannot use the same character for both the DELIMITER and NULL parameters. For more information, see Delimited data.
Default: Vertical bar ('|').
ENCLOSED [BY]
Sets the quote character within which to enclose data, allowing delimiter characters to be embedded in string values. You can choose any ASCII value in the range E'\001' to E'\177' inclusive (any ASCII character except NULL: E'\000'). By default, ENCLOSED BY has no value, meaning data is not enclosed by any sort of quote character.
ESCAPE [AS]
Sets the escape character. Once set, the character following the escape character is interpreted literally, rather than as a special character. You can define an escape character using any ASCII value in the range E'\001' to E'\177', inclusive (any ASCII character except NULL: E'\000').
The COPY statement does not interpret the data it reads in as String literals. It also does not follow the same escape rules as other SQL statements (including the COPY parameters). When reading data, COPY interprets only the characters defined by these options as special values:
-
ESCAPE [AS]
-
DELIMITER
-
ENCLOSED [BY]
-
RECORD TERMINATOR
-
All COLLECTION options
Default: Backslash ('\').
NO ESCAPE
Eliminates escape-character handling. Use this option if you do not need any escape character and you want to prevent characters in your data from being interpreted as escape sequences.
RECORD TERMINATOR
- Specifies the literal character string indicating the end of a data file record. For more information about using this parameter, see Delimited data.
TRAILING NULLCOLS
- Specifies that if Vertica encounters a record with insufficient data to match the columns in the table column list, COPY inserts the missing columns with NULL values. For other information and examples, see Fixed-width format data.
COLLECTIONDELIMITER
For columns of collection types, indicates the single ASCII character used to separate elements within each collection. You can use any ASCII value in the range E'\000' to E'\177', inclusive. No COLLECTION option may have the same value as any other COLLECTION option. For more information, see Delimited data.
Default: Comma (',').
COLLECTIONOPEN
, COLLECTIONCLOSE
For columns of collection types, these options indicate the characters that mark the beginning and end of the collection. It is an error to use these characters elsewhere within the list of elements without escaping them. No COLLECTION option may have the same value as any other COLLECTION option.
Default: Square brackets ('[' and ']').
COLLECTIONNULLELEMENT
The string representing a null element value in a collection. You can specify a null value as any ASCII value in the range E'\001' to E'\177' inclusive (any ASCII value except NULL: E'\000'). No COLLECTION option may have the same value as any other COLLECTION option. For more information, see Delimited data.
Default: 'null'
COLLECTIONENCLOSE
For columns of collection types, sets the quote character within which to enclose individual elements, allowing delimiter characters to be embedded in string values. You can choose any ASCII value in the range E'\001' to E'\177' inclusive (any ASCII character except NULL: E'\000').
No COLLECTION option may have the same value as any other COLLECTION option.
Default: double quote ('"')
Examples
The following example shows the default behavior, in which the delimiter character is '|'
=> CREATE TABLE employees (id INT, name VARCHAR(50), department VARCHAR(50));
CREATE TABLE
=> COPY employees FROM STDIN;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> 42|Sheldon Cooper|Physics
>> 17|Howard Wolowitz|Astronomy
>> \.
=> SELECT * FROM employees;
id | name | department
----+-----------------+--------------
17 | Howard Wolowitz | Astrophysics
42 | Sheldon Cooper | Physics
(2 rows)
The following example shows loading array values with the default options.
=> CREATE TABLE researchers (id INT, name VARCHAR, grants ARRAY[VARCHAR], values ARRAY[INT]);
CREATE TABLE
=> COPY researchers FROM STDIN;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> 42|Sheldon Cooper|[US-7376,DARPA-1567]|[65000,135000]
>> 17|Howard Wolowitz|[NASA-1683,NASA-7867,SPX-76]|[16700,85000,45000]
>> \.
=> SELECT * FROM researchers;
id | name | grants | values
----+-----------------+------------------------------------+---------------------
17 | Howard Wolowitz | ["NASA-1683","NASA-7867","SPX-76"] | [16700,85000,45000]
42 | Sheldon Cooper | ["US-7376","DARPA-1567"] | [65000,135000]
(2 rows)
In the following example, collections are enclosed in braces and delimited by periods, and the arrays contain null values.
=> COPY researchers FROM STDIN COLLECTIONOPEN '{' COLLECTIONCLOSE '}' COLLECTIONDELIMITER '.';
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> 19|Leonard|{"us-1672".null."darpa-1963"}|{16200.null.16700}
>> \.
=> SELECT * FROM researchers;
id | name | grants | values
----+-----------------+------------------------------------+---------------------
17 | Howard Wolowitz | ["NASA-1683","NASA-7867","SPX-76"] | [16700,85000,45000]
42 | Sheldon Cooper | ["US-7376","DARPA-1567"] | [65000,135000]
19 | Leonard | ["us-1672",null,"darpa-1963"] | [16200,null,16700]
(3 rows)
2 - Examples
For additional COPY examples, see the reference pages for specific parsers, including: DELIMITED (Parser), ORC (Parser), PARQUET (Parser), FJSONPARSER (Parser), and FAVROPARSER (Parser).
For additional COPY examples, see the reference pages for specific parsers, including: DELIMITED (parser), ORC (parser), PARQUET (parser), FJSONPARSER (parser), and FAVROPARSER (parser).
Specifying string options
Use COPY with FORMAT, DELIMITER, NULL, and ENCLOSED BY options:
=> COPY public.customer_dimension (customer_since FORMAT 'YYYY')
FROM STDIN
DELIMITER ','
NULL AS 'null'
ENCLOSED BY '"';
Use COPY with DELIMITER and NULL options. This example sets and references a vsql
variable for the input file:
=> \set input_file ../myCopyFromLocal/large_table.gzip
=> COPY store.store_dimension
FROM :input_file
DELIMITER '|'
NULL ''
RECORD TERMINATOR E'\f';
Including multiple source files
Create a table and then copy multiple source files to it:
=> CREATE TABLE sampletab (a int);
CREATE TABLE
=> COPY sampletab FROM '/home/dbadmin/one.dat', 'home/dbadmin/two.dat';
Rows Loaded
-------------
2
(1 row)
Use wildcards to indicate a group of files:
=> COPY myTable FROM 'webhdfs:///mydirectory/ofmanyfiles/*.dat';
Wildcards can include regular expressions:
=> COPY myTable FROM 'webhdfs:///mydirectory/*_[0-9]';
Specify multiple paths in a single COPY statement:
=> COPY myTable FROM 'webhdfs:///data/sales/01/*.dat', 'webhdfs:///data/sales/02/*.dat',
'webhdfs:///data/sales/historical.dat';
Distributing a load
Load data that is shared across all nodes. Vertica distributes the load across all nodes, if possible:
=> COPY sampletab FROM '/data/file.dat' ON ANY NODE;
Load data from two files. Because the first load file does not specify nodes (or ON ANY NODE), the initiator performs the load. Loading the second file is distributed across all nodes:
=> COPY sampletab FROM '/data/file1.dat', '/data/file2.dat' ON ANY NODE;
Specify different nodes for each load file:
=> COPY sampletab FROM '/data/file1.dat' ON (v_vmart_node0001, v_vmart_node0002),
'/data/file2.dat' ON (v_vmart_node0003, v_vmart_node0004);
Loading data from shared storage
To load data from shared storage, use URLs in the corresponding schemes:
-
HDFS: [[s]web]hdfs://[
nameservice
]/
path
-
S3: s3://
bucket
/
path
-
Google Cloud: gs://
bucket
/
path
-
Azure: azb://
account
/
container
/
path
Note
Loads from HDFS, S3, GCS, and Azure default to ON ANY NODE; you do not need to specify it.
Load a file stored in HDFS using the default name node or name service:
=> COPY t FROM 'webhdfs:///opt/data/file1.dat';
Load data from a particular HDFS name service (testNS). You specify a name service if your database is configured to read from more than one HDFS cluster:
=> COPY t FROM 'webhdfs://testNS/opt/data/file2.csv';
Load data from an S3 bucket:
=> COPY t FROM 's3://AWS_DataLake/*' ORC;
Load data in the ORC format from HDFS:
=> COPY t FROM 'webhdfs:///opt/data/sales.orc' ORC;
Load Parquet data from an S3 bucket:
=> COPY t FROM 's3://AWS_DataLake/sales.parquet' PARQUET;
Using filler columns
In the following example, the table has columns for first name, last name, and full name, but the data being loaded contains columns for first, middle, and last names. The COPY statement reads all of the source data but only loads the source columns for first and last names. It constructs the data for the full name by concatenating each of the source data columns, including the middle name. The middle name is read as a FILLER column so it can be used in the concatenation, but is ignored otherwise. (There is no table column for middle name.)
=> CREATE TABLE names(first VARCHAR(20), last VARCHAR(20), full VARCHAR(60));
CREATE TABLE
=> COPY names(first,
middle FILLER VARCHAR(20),
last,
full AS first||' '||middle||' '||last)
FROM STDIN;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> Marc|Gregory|Smith
>> Sue|Lucia|Temp
>> Jon|Pete|Hamilton
>> \.
=> SELECT * from names;
first | last | full
-------+----------+--------------------
Jon | Hamilton | Jon Pete Hamilton
Marc | Smith | Marc Gregory Smith
Sue | Temp | Sue Lucia Temp
(3 rows)
Loading data into a flex table
Create a Flex table and copy JSON data into it using FJSONPARSER:
=> CREATE FLEX TABLE darkdata();
CREATE TABLE
=> COPY tweets FROM '/myTest/Flexible/DATA/tweets_12.json' PARSER FJSONPARSER();
Rows Loaded
-------------
12
(1 row)
Using named pipes
COPY supports named pipes that follow the same naming conventions as file names on the given file system. Permissions are open
, write
, and close
.
Create named pipe, pipe1
, and set two vsql
variables:
=> \! mkfifo pipe1
=> \set dir `pwd`/
=> \set file '''':dir'pipe1'''
Copy an uncompressed file from the named pipe:
=> \! cat pf1.dat > pipe1 &
=> COPY large_tbl FROM :file delimiter '|';
=> SELECT * FROM large_tbl;
=> COMMIT;
Loading compressed data
Copy a GZIP file from a named pipe and uncompress it:
=> \! gzip pf1.dat
=> \! cat pf1.dat.gz > pipe1 &
=> COPY large_tbl FROM :file ON site01 GZIP delimiter '|';
=> SELECT * FROM large_tbl;
=> COMMIT;
=> \!gunzip pf1.dat.gz
3 - FAVROPARSER (parser)
Parses data from an Avro file.
Parses data from an Avro file. The input file must use binary serialization encoding. Use this parser to load data into columnar, flex, and hybrid tables.
You can load complex types in the Avro source (arrays, structs, or combinations) with strong typing or as flexible complex types. A flexible complex type is loaded into a VMap column, as in flex tables. To load complex types as VMap columns, specify a column type of LONG VARBINARY. To preserve the indexing in complex types, set flatten_maps
to false.
When loading into a flex table, Vertica loads all data into the __raw__ (VMap) column, including complex types found in the data.
Note
FAVROPARSER does not support Avro files with separate schema files. The Avro file must have its related schema in the file you are loading.
Syntax
FAVROPARSER ( [parameter=value[,...]] )
Parameters
flatten_maps
- Boolean, whether to flatten all Avro maps. Key names are concatenated with nested levels. This value is recursive and affects all data in the load.
This parameter applies only to flex tables or VMap columns and is ignored when loading strongly-typed complex types.
Default: true
flatten_arrays
- Boolean, whether to flatten all Avro arrays. Key names are concatenated with nested levels. This value is recursive and affects all data in the load.
This parameter applies only to flex tables or VMap columns and is ignored when loading strongly-typed complex types.
Default: false
flatten_records
- Boolean, whether to flatten all Avro records. Key names are concatenated with nested levels. This value is recursive and affects all data in the load.
This parameter applies only to flex tables or VMap columns and is ignored when loading strongly-typed complex types.
Default: true
reject_on_materialized_type_error
Boolean, whether to reject a data row that contains a materialized column value that cannot be coerced into a compatible data type. If the value is false and the type cannot be coerced, the parser sets the value in that column to null.
If the column is a strongly-typed complex type, as opposed to a flexible complex type, then a type mismatch anywhere in the complex type causes the entire column to be treated as a mismatch. The parser does not partially load complex types.
Default: false
Primitive data types
FAVROPARSER supports the following primitive data types, including as element types and field values in complex types.
Note
Vertica does not have an explicit 4-byte (32-bit integer) or smaller types. Instead, Vertica encoding and compression automatically eliminate the storage overhead of values that require less than 64 bits.
Avro logical types
FAVROPARSER supports the following Avro logical types. The target column must use a Vertica data type that supports the logical type. When you attempt to load data using an invalid logical type, the logical type is ignored and the underlying Avro type is used.
AVRO Logical Type |
Base Avro Type |
Supported Vertica Data Types |
decimal
0 < precision ≤ 1024
0 ≤ scale ≤ precision
|
bytes or fixed |
NUMERIC, Character
Vertica rejects the value if:
-
The Avro precision setting is greater than the precision setting for the target column.
-
For fixed types, the precision value is greater than what is allowed by the size attribute.
If the data type for the target column uses the default precision setting, the precision setting in the Avro schema overrides the default.
|
date |
integer |
DATE, Character |
time-micros |
long |
TIME/TIMETZ, Character
The time logical type does not provide a time zone value. For target columns that use the TIMETZ data type, Vertica uses UTC as the default.
|
time-millis |
int |
timestamp-micros |
long |
TIMESTAMP/TIMESTAMPTZ, TIME/TIMETZ
For timestamp-millis only, the timezone is included and is represented as an offset to UTC. Additionally, the millisecond values are right-extended with padded zeros.
|
timestamp-millis |
long |
duration |
fixed |
INTERVAL, Character |
Avro complex data types
The Avro format supports several complex data types. When loading into strongly-typed columns, you can use the ROW and ARRAY types to represent them. For example, Avro Record and Enums are structs (ROWs); see the Avro specification.
You can use ARRAY[ROW] to match an Avro map. You must name the ROW fields key
and value
. These are the names that the Avro format uses for those fields in the data, and the parser relies on field names to match data to table columns.
When loading into flex tables or using flexible complex types, this parser handles Avro complex types as follows:
Record
The name of each field is used as a virtual column name. If flatten_records
is true and several nesting levels are present, Vertica concatenates the record names to create the key name.
Map
The value of each map key is used as a virtual column name. If flatten_maps
is true and several nesting levels are present, Vertica concatenates the key names to create the key name.
Enum
Vertica treats Avro Enums like records, with the name of the Enum as the key and the value as the value.
Array
Vertica treats Avro Arrays as key/value pairs. By default, the index of each element is the key. In the following example, product_detail
is a Record with a field, product_category
, that is an Array:
=> CREATE FLEX TABLE products;
CREATE TABLE
=> COPY products FROM :datafile WITH PARSER FAVROPARSER();
Rows Loaded
-------------
2
(1 row)
=> SELECT MAPTOSTRING(__raw__) FROM products ORDER BY __identity__;
maptostring
--------------------------------------------------------------------------------
{
"__name__": "Order",
"customer_id": "111222",
"order_details": {
"0.__name__": "OrderDetail",
"0.product_detail.__name__": "Product",
"0.product_detail.price": "46.21",
"0.product_detail.product_category": {
"0": "electronics",
"1": "printers",
"2": "computers"
},
"0.product_detail.product_description": "hp printer X11ew description :\
P",
"0.product_detail.product_hash": "\u0000\u0001\u0002\u0003\u0004",
"0.product_detail.product_id": "999012",
"0.product_detail.product_map.one": "1.1",
"0.product_detail.product_map.two": "1.1",
"0.product_detail.product_name": "hp printer X11ew",
"0.product_detail.product_status": "ONLY_FEW_LEFT",
"0.quantity": "3",
"0.total": "354.34"
},
"order_id": "2389646",
"total": "132.43"
}
...
If flatten_arrays
is true and several nesting levels are present, Vertica concatenates the indices to create the key name.
=> COPY products FROM :datafile WITH PARSER FAVROPARSER(flatten_arrays=true);
Rows Loaded
-------------
2
(1 row)
=> SELECT MAPTOSTRING(__raw__) FROM products ORDER BY __identity__;
maptostring
--------------------------------------------------------------------------------
{
"__name__": "Order",
"customer_id": "111222",
"order_details.0.__name__": "OrderDetail",
"order_details.0.product_detail.__name__": "Product",
"order_details.0.product_detail.price": "46.21",
"order_details.0.product_detail.product_category.0": "electronics",
"order_details.0.product_detail.product_category.1": "printers",
"order_details.0.product_detail.product_category.2": "computers",
"order_details.0.product_detail.product_description": "hp printer X11ew des\
cription :P",
"order_details.0.product_detail.product_hash": "\u0000\u0001\u0002\u0003\u0\
004",
"order_details.0.product_detail.product_id": "999012",
"order_details.0.product_detail.product_map.one": "1.1",
"order_details.0.product_detail.product_map.two": "1.1",
"order_details.0.product_detail.product_name": "hp printer X11ew",
"order_details.0.product_detail.product_status": "ONLY_FEW_LEFT",
"order_details.0.quantity": "3",
"order_details.0.total": "354.34",
"order_id": "2389646",
"total": "132.43"
}
...
Union
Vertica treats Avro Unions as arrays.
Examples
This example shows how to create and load a flex table with Avro data using favroparser
. After loading the data, you can query virtual columns:
=> CREATE FLEX TABLE avro_basic();
CREATE TABLE
=> COPY avro_basic FROM '/home/dbadmin/data/weather.avro' PARSER FAVROPARSER();
Rows Loaded
-------------
5
(1 row)
=> SELECT station, temp, time FROM avro_basic;
station | temp | time
---------+------+---------------
mohali | 0 | -619524000000
lucknow | 22 | -619506000000
norwich | -11 | -619484400000
ams | 111 | -655531200000
baddi | 78 | -655509600000
(5 rows)
For more information, see Avro data.
4 - FJSONPARSER (parser)
Parses and loads a JSON file.
Parses and loads a JSON file. This file can contain either repeated JSON data objects (including nested maps), or an outer list of JSON elements.
When loading into a flex or hybrid table, the parser stores the JSON data in a single-value VMap. When loading into a hybrid or columnar table, the parser loads data directly into any table column with a column name that matches a key in the JSON source data.
You can load complex types in the JSON source (arrays, structs, or combinations) with strong typing or as flexible complex types. A flexible complex type is loaded into a VMap column, as in flex tables. To load complex types as VMap columns, specify a column type of LONG VARBINARY. To preserve the indexing in complex types, set flatten_maps
to false.
Syntax
FJSONPARSER ( [parameter=value[,...]] )
Parameters
flatten_maps
- Boolean, whether to flatten sub-maps within the JSON data, separating map levels with a period (
.
). This value affects all data in the load, including nested maps.
This parameter applies only to flex tables or VMap columns and is ignored when loading strongly-typed complex types.
Default: true
flatten_arrays
- Boolean, whether to convert lists to sub-maps with integer keys. When lists are flattened, key names are concatenated as for maps. Lists are not flattened by default. This value affects all data in the load, including nested lists.
This parameter applies only to flex tables or VMap columns and is ignored when loading strongly-typed complex types.
Default: false
reject_on_duplicate
- Boolean, whether to ignore duplicate records (false), or to reject duplicates (true). In either case, the load continues.
Default: false
reject_on_empty_key
- Boolean, whether to reject any row containing a field key without a value.
Default: false
omit_empty_keys
- Boolean, whether to omit any field key from the data that does not have a value. Other fields in the same record are loaded.
Default: false
record_terminator
- When set, any invalid JSON records are skipped and parsing continues with the next record. Records must be terminated uniformly. For example, if your input file has JSON records terminated by newline characters, set this parameter to
E'\n')
. If any invalid JSON records exist, parsing continues after the next record_terminator
.
Even if the data does not contain invalid records, specifying an explicit record terminator can improve load performance by allowing cooperative parse and apportioned load to operate more efficiently.
When you omit this parameter, parsing ends at the first invalid JSON record.
reject_on_materialized_type_error
Boolean, whether to reject a data row that contains a materialized column value that cannot be coerced into a compatible data type. If the value is false and the type cannot be coerced, the parser sets the value in that column to null.
If the column is a strongly-typed complex type, as opposed to a flexible complex type, then a type mismatch anywhere in the complex type causes the entire column to be treated as a mismatch. The parser does not partially load complex types.
Default: false
start_point
- String, the name of a key in the JSON load data at which to begin parsing. The parser ignores all data before the
start_point
value. The value is loaded for each object in the file. The parser processes data after the first instance, and up to the second, ignoring any remaining data.
start_point_occurrence
- Integer, the nth occurrence of the value you specify with
start_point
. Use in conjunction with start_point
when the data has multiple start values and you know the occurrence at which to begin parsing.
Default: 1
suppress_nonalphanumeric_key_chars
- Boolean, whether to suppress non-alphanumeric characters in JSON key values. The parser replaces these characters with an underscore (
_
) when this parameter is true.
Default: false
key_separator
- Character for the parser to use when concatenating key names.
Default: period (.
)
Examples
The following example loads JSON data from STDIN using the default parameters:
=> CREATE TABLE people(age INT, name VARCHAR);
CREATE TABLE
=> COPY people FROM STDIN PARSER FJSONPARSER();
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> {"age": 5, "name": "Tim"}
>> {"age": 3}
>> {"name": "Fred"}
>> {"name": "Bob", "age": 10}
>> \.
=> SELECT * FROM people;
age | name
-----+------
| Fred
10 | Bob
5 | Tim
3 |
(4 rows)
The following example uses the reject_on_duplicate
parameter to reject duplicate values:
=> CREATE FLEX TABLE json_dupes();
CREATE TABLE
=> COPY json_dupes FROM stdin PARSER fjsonparser(reject_on_duplicate=true)
exceptions '/home/dbadmin/load_errors/json_e.out'
rejected data '/home/dbadmin/load_errors/json_r.out';
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> {"a":"1","a":"2","b":"3"}
>> \.
=> \!cat /home/dbadmin/load_errors/json_e.out
COPY: Input record 1 has been rejected (Rejected by user-defined parser).
Please see /home/dbadmin/load_errors/json_r.out, record 1 for the rejected record.
COPY: Loaded 0 rows, rejected 1 rows.
The following example loads array data:
$ cat addrs.json
{"number": 301, "street": "Grant", "attributes": [1, 2, 3, 4]}
=> CREATE EXTERNAL TABLE customers(number INT, street VARCHAR, attributes ARRAY[INT])
AS COPY FROM 'addrs.json' PARSER fjsonparser();
=> SELECT number, street, attributes FROM customers;
num | street| attributes
-----+-----------+---------------
301 | Grant | [1,2,3,4]
(1 row)
The following example loads a flexible complex type, rejecting rows that have empty keys within the nested records. Notice that while the data has two restaurants, one has a key name that is an empty string. This one is rejected:
$ cat rest1.json
{
"name" : "Bob's pizzeria",
"cuisine" : "Italian",
"location_city" : ["Cambridge", "Pittsburgh"],
"menu" : [{"item" : "cheese pizza", "" : "$8.25"},
{"item" : "spinach pizza", "price" : "$10.50"}]
}
{
"name" : "Bakersfield Tacos",
"cuisine" : "Mexican",
"location_city" : ["Pittsburgh"],
"menu" : [{"item" : "veggie taco", "price" : "$9.95"},
{"item" : "steak taco", "price" : "$10.95"}]
}
=> CREATE TABLE rest (name VARCHAR, cuisine VARCHAR, location_city LONG VARBINARY, menu LONG VARBINARY);
=> COPY rest FROM '/data/rest1.json'
PARSER fjsonparser(flatten_maps=false, reject_on_empty_key=true);
Rows Loaded
------------
1
(1 row)
=> SELECT maptostring(location_city), maptostring(menu) FROM rest;
maptostring | maptostring
---------------------------+-------------------------------------------------------
{
"0": "Pittsburgh"
} | {
"0": {
"item": "veggie taco",
"price": "$9.95"
},
"1": {
"item": "steak taco",
"price": "$10.95"
}
}
(1 row)
To instead load partial data, use omit_empty_keys
to bypass the missing keys while loading everything else:
=> COPY rest FROM '/data/rest1.json'
PARSER fjsonparser(flatten_maps=false, omit_empty_keys=true);
Rows Loaded
-------------
2
(1 row)
=> SELECT maptostring(location_city), maptostring(menu) from rest;
maptostring | maptostring
-------------------------------------------------+---------------------------------
{
"0": "Pittsburgh"
} | {
"0": {
"item": "veggie taco",
"price": "$9.95"
},
"1": {
"item": "steak taco",
"price": "$10.95"
}
}
{
"0": "Cambridge",
"1": "Pittsburgh"
} | {
"0": {
"item": "cheese pizza"
},
"1": {
"item": "spinach pizza",
"price": "$10.50"
}
}
(2 rows)
To instead load this data with strong typing, define the complex types in the table:
=> CREATE EXTERNAL TABLE restaurants
(name VARCHAR, cuisine VARCHAR,
location_city ARRAY[VARCHAR(80)],
menu ARRAY[ ROW(item VARCHAR(80), price FLOAT) ]
)
AS COPY FROM '/data/rest.json' PARSER fjsonparser();
=> SELECT * FROM restaurants;
name | cuisine | location_city | \
menu
-------------------+---------+----------------------------+--------------------\
--------------------------------------------------------
Bob's pizzeria | Italian | ["Cambridge","Pittsburgh"] | [{"item":"cheese pi\
zza","price":0.0},{"item":"spinach pizza","price":0.0}]
Bakersfield Tacos | Mexican | ["Pittsburgh"] | [{"item":"veggie ta\
co","price":0.0},{"item":"steak taco","price":0.0}]
(2 rows)
For other examples, see JSON data.
5 - ORC (parser)
Use the ORC clause with the COPY FROM statement to load data in the ORC format.
Use the ORC clause with the COPY statement to load data in the ORC format. When loading data into Vertica, you can read all primitive types, UUIDs, and complex types.
When loading ORC data, you must account for all columns in the data; you cannot select only some columns.
If the table definition includes columns of primitive types and those columns are not in the data, the parser fills those columns with NULL. If the table definition includes columns of complex types, those columns must be present in the data.
The ORC clause can be used alone or with optional parameters.
Syntax
ORC ( [parameter=value[,...]] )
Parameters
hive_partition_cols
- Comma-separated list of columns that are partition columns in the data. See Using partition columns.
allow_no_match
- Whether to accept a path containing a glob with no matching files and report zero rows in query results. If this parameter is not set, Vertica returns an error if the path in the FROM clause does not match at least one file.
Examples
Use the ORC clause without parameters if your data is not partitioned:
=> CREATE EXTERNAL TABLE orders
(orderkey INT,
custkey INT,
prodkey ARRAY[VARCHAR(10)],
orderprices ARRAY[DECIMAL(12,2)],
orderdate DATE
) AS COPY FROM 's3://AWS_DataLake/orders.orc' ORC;
In the following example, the "id" and "name" columns are included in the data and the "created" and "region" columns are partition columns. Partition columns, if not also data columns, must be listed last when defining columns:
=> CREATE EXTERNAL TABLE stores
(id INT,
name VARCHAR(50),
created DATE,
region VARCHAR(50))
AS COPY FROM 'webhdfs:///path/*/*/*'
ORC(hive_partition_cols='created,region');
You can read a map column as an array of rows, as in the following example:
=> CREATE EXTERNAL TABLE orders
(orderkey INT,
custkey INT,
prods ARRAY[ROW(key VARCHAR(10), value DECIMAL(12,2))],
orderdate DATE
) AS COPY FROM '...' ORC;
6 - Parameters
COPY parameters and their descriptions are divided into the following sections:.
COPY parameters and their descriptions are divided into the following sections:
Target options
The following options apply to the target tables and their columns:
-
LABEL
Assigns a label to a statement to identify it for profiling and debugging.
[
database
.]
schema
Database and schema. The default schema is public
. If you specify a database, it must be the current database.
COPY ignores schema-name
when used in CREATE EXTERNAL TABLE or CREATE FLEX EXTERNAL TABLE statements.
*
target-table*
- The target columnar or flexible table for loading new data. Vertica loads the data into all projections that include columns from the schema table.
*
column-as-expression*
- An expression used to compute values for the target column, which must not be of a complex type. For example:
=> COPY t(year AS TO_CHAR(k, 'YYYY')) FROM 'myfile.dat'
Use this option to transform data when it is loaded into the target database.
For details, see Transforming data during loads.
*
column*
- Restricts the load to one or more specified columns in the table. If you omit specifying columns, COPY loads all columns by default.
Table columns that you omit from the column list are assigned their DEFAULT or SET USING values, if any; otherwise, COPY inserts NULL.
If you leave the column
parameter blank to load all columns in the table, you can use the optional parameter COLUMN OPTION to specify parsing options for specific columns.
The data file must contain the same number of columns as the COPY command's column list.
COLUMN OPTION
- Specifies load metadata for one or more columns declared in the table column list. For example, you can specify that a column has its own DELIMITER, ENCLOSED BY, or NULL AS expression, and so on. You do not have to specify every column name explicitly in the COLUMN OPTION list, but each column you specify must correspond to a column in the table column list.
Column options
Depending on how they are specified, the following COPY options can qualify specific columns or all columns. Some parser-specific options can also apply to either specific columns or all columns. See Global and column-specific options For details about these two modes.
ENFORCELENGTH
- If specified, COPY rejects data rows of type CHAR, VARCHAR, BINARY, and VARBINARY, or elements of those types in collections, if they are larger than the declared size.
By default, COPY truncates offending rows of these data types and elements of these types in collections, but does not reject the rows. For more details, see Handling Messy Data.
If a collection does not fit with all of its elements, COPY rejects the row without truncating. It does not reduce the number of elements. This can happen if each element is individually within limits but the number of elements causes the collection to exceed the maximum size for the column.
FILLER
datatype
- Reads but does not copy the data of an input column. Use filler columns to ignore input columns that do not have columns in the table. You can also use filler columns to transform data (see Examples and Transforming data during loads). Filler columns cannot be of complex types.
FORMAT '
format
'
- Input format, one of the following:
See Binary (native) data to learn more about these formats.
When loading date/time columns, using FORMAT significantly improves load performance. COPY supports the same formats as the TO_DATE function.
See the following topics for additional information:
If you specify invalid format strings, the COPY operation returns an error.
NULL [AS]
- The string representing a null value. The default is an empty string (
''
). You can specify a null value as any ASCII value in the range E'\000'
to E'\177'
inclusive. You cannot use the same character for both the DELIMITER and NULL options. For details, see Delimited data.
The following options are available for specifying source data:
LOCAL
- Loads a data file on a client system, rather than on a cluster host. LOCAL can qualify the STDIN and [path-to-data](#pathToData) parameters. For details, see COPY LOCAL.
Restrictions: Invalid for CREATE EXTERNAL TABLE AS COPY
STDIN
- Reads from the client a standard input instead of a file. STDIN takes one input source only. To load multiple input sources, use [path-to-data](#pathToData).
User must have INSERT privileges on the table and USAGE privileges on its schema.
Restrictions: Invalid for CREATE EXTERNAL TABLE AS COPY
path-to-data
- Specifies the absolute path of the file (or files) containing the data, which can be from multiple input sources.
-
If the file is stored in HDFS, path-to-data
is a URI in the webhdfs
scheme, typically [[s]web]hdfs://[
nameservice
]/
path
. See HDFS file system.
-
If the file is stored in an S3 bucket, path-to-data
is a URI in the format s3://
bucket
/
path
. See S3 object store.
-
If the file is stored in Google Cloud Storage, path-to-data
is a URI in the format gs://
bucket
/
path
. See Google Cloud Storage (GCS) object store.
-
If the file is stored in Azure Blob Storage, path-to-data
is a URI in the format azb://
account
/
container
/
path
. See Azure Blob Storage object store.
-
If the file is on the local Linux file system or an NFS mount, path-to-data
is a local absolute file path.
path-to-data
can optionally contain wildcards to match more than one file. The file or files must be accessible to the local client or the host on which the COPY statement runs. COPY skips empty files in the file list. A file list that includes directories causes the query to fail. See Specifying where to load data from. The supported patterns for wildcards are specified in the Linux Manual Page for Glob (7), and for ADO.net platforms, through the .NET Directory.getFiles method.
You can use variables to construct the pathname as described in Using load scripts.
If path-to-data
resolves to a storage location on a local file system, and the user invoking COPY is not a superuser, the following requirements apply:
Further, if a user has privileges but is not a superuser, and invokes COPY from that storage location, Vertica ensures that symbolic links do not result in unauthorized access.
ON
nodename
- Specifies the node on which the data to copy resides and the node that should parse the load file. If you omit
nodename
, the location of the input file defaults to the initiator node. Use nodename
to copy and parse a load file from a node other than the COPY initiator node.
Note
nodename
is invalid with STDIN and LOCAL.
ON (
nodeset
)
- Specifies a set of nodes on which to perform the load. The same data must be available for load on all named nodes.
nodeset
is a comma-separated list of node names in parentheses. For example:
=> COPY t FROM 'file1.txt' ON (v_vmart_node0001, v_vmart_node0002);
Vertica apportions the load among all of the specified nodes. If you also specify ERROR TOLERANCE or REJECTMAX, Vertica instead chooses a single node on which to perform the load.
If the data is available on all nodes, you usually use ON ANY NODE, which is the default for loads from HDFS and cloud object stores. However, you can use ON nodeset
to do manual load-balancing among concurrent loads.
ON ANY NODE
- Specifies that the source file to load is available on all nodes, so COPY opens the file and parses it from any node in the cluster. For an Eon Mode database, COPY uses nodes within the same subcluster as the initiator.
Caution
The file must be the same on all nodes. If the file differs on two nodes, an incorrect or incomplete result is returned, with no error or warning.
Vertica attempts to apportion the load among several nodes if the file is large enough to benefit from apportioning. It chooses a single node if ERROR TOLERANCE or REJECTMAX is specified.
You can use a wildcard or glob (such as *.dat
) to load multiple input files, combined with the ON ANY NODE clause. If you use a glob, COPY distributes the list of files to all cluster nodes and spreads the workload.
ON ANY NODE is invalid with STDIN and LOCAL. STDIN can only use the client host, and LOCAL indicates a client node.
ON ANY NODE is the default for loads from all paths other than Linux (HDFS and cloud object stores).
- compression
- The input compression type, one of the following:
-
UNCOMPRESSED (default)
-
BZIP
-
GZIP
-
LZO
-
ZSTD
Input files can be of any format. If you use wildcards, all qualifying input files must be in the same format. To load different file formats, specify the format types specifically.
The following requirements and restrictions apply:
-
When using concatenated BZIP or GZIP files, verify that all source files terminate with a record terminator before concatenating them.
-
Concatenated BZIP and GZIP files are not supported for NATIVE (binary) and NATIVE VARCHAR formats.
-
LZO files are assumed to be compressed with lzop
. Vertica supports the following lzop arguments:
-
BZIP, GZIP, ZSTD, and LZO compression cannot be used with ORC format.
VERTICA
- See COPY FROM VERTICA.
[WITH]
UDL-clause
[...]
- Specifies one or more user-defined load functions—one source, and optionally one or more filters and one parser, as follows:
SOURCE source( [arg=value[,...] ]
[ FILTER filter( [arg=value[,...] ] ) ]...
[ PARSER parser( [arg=value[,...] ] ) ]
To use a flex table parser for column tables, use the PARSER parameter followed by a flex table parser argument. For supported flex table parsers, see Bulk loading data into flex tables.
Handling options
The following options control how COPY handles different contingencies:
ABORT ON ERROR
- Specifies that COPY stops if any row is rejected. The statement is rolled back and no data is loaded.
COLSIZES (
integer
[,...])
- Specifies column widths when loading fixed-width data. COPY requires that you specify COLSIZES when using the FIXEDWIDTH parser. COLSIZES and the list of integers must correspond to the columns listed in the table column list. For details, see Fixed-width format data.
ERROR TOLERANCE
- Specifies that COPY treats each source during execution independently when loading data. The statement is not rolled back if a single source is invalid. The invalid source is skipped and the load continues.
Using this parameter disables apportioned load.
Restrictions: Invalid for ORC or Parquet data
EXCEPTIONS
- Specifies the file name or absolute path of the file in which to write exceptions, as follows:
EXCEPTIONS 'path' [ ON nodename[,...]]
Exceptions describe why each rejected row was rejected. Each exception describes the corresponding record in the file specified by the REJECTED DATA option.
Files are written on the node or nodes executing the load. If the file already exists, it is overwritten.
To collect all exceptions in one place, use the REJECTED DATA AS TABLE clause and exceptions are automatically listed in the table's rejected_reason
column.
The ON nodename
clause moves existing exceptions files on nodename
to the indicated path
on the same node. For details, see Saving load exceptions (EXCEPTIONS).
If you use this parameter with COPY...ON ANY NODE, you must still specify the individual nodes for the exception files, as in the following example:
EXCEPTIONS '/home/ex01.txt' on v_db_node0001,'/home/ex02.txt'
on v_db_node0002,'/home/ex03.txt' on v_db_node0003
If path
resolves to a storage location, the following privileges apply to non-superusers:
-
The storage location must be created with the USER option (see CREATE LOCATION).
-
The user must have READ access to the storage location where the files exist, as described in GRANT (storage location).
REJECTED DATA
- Specifies where to write each row that failed to load. If this parameter is specified, records that failed due to parsing errors are always written. Records that failed due to an error during a transformation are written only if configuration parameter CopyFaultTolerantExpressions is set.
The syntax for this parameter is:
REJECTED DATA
{ 'path' [ ON nodename ] [,...] | AS TABLE reject-table }
Vertica can write rejected data to the specified path or to a table:
-
'
path
' [ON
nodename
]
: Copies the rejected row data to the specified path on the node executing the load. If qualified by ON nodename
, Vertica moves existing rejected data files on nodename
to path
on the same node.
The value of path
can be a directory or a file prefix. If there are multiple load sources, path
is always treated as a directory. If there are not multiple load sources but path
ends with '/
', or if a directory of that name already exists, it is also treated as a directory. Otherwise, path
is treated as a file prefix.
Files are written on the node or nodes executing the load. If the file already exists, it is overwritten.
When this parameter is used with LOCAL, the output is written to the client.
Note
Do not qualify path
with ON ANY NODE. To collect all rejected data in one place regardless of how the load is distributed, use a table.
-
AS TABLE reject-table
: Saves rejected rows to reject-table
.
Note
REJECTED DATA AS TABLE is incompatible with
EXCEPTIONS.
For details about both options, see Handling messy data.
REJECTMAX
integer
- The maximum number of logical records that can be rejected before a load fails. For details, see Handling messy data.
REJECTMAX disables apportioned load.
SKIP
integer
- The number of records to skip in a load file. For example, you can use the SKIP option to omit table header information.
Restrictions: Invalid for ORC or Parquet data
STREAM NAME
- Supplies a COPY load stream identifier. Using a stream name helps to quickly identify a particular load. The STREAM NAME value that you supply in the load statement appears in the STREAM_NAME column of system tables LOAD_STREAMS and LOAD_SOURCES.
A valid stream name can contain any combination of alphanumeric or special characters up to 128 bytes in length.
For example:
=> COPY mytable FROM myfile
DELIMITER '|' STREAM NAME 'My stream name';
WITH
parser
- Specifies the parser to use when bulk loading columnar tables, one of the following:
By default, COPY uses the DELIMITER parser for UTF-8 format, delimited text input data. You do not specify the DELIMITER parser directly; absence of a specific parser indicates the default.
To use a flex table parser for column tables, use the PARSER parameter followed by a flex table parser argument. For supported flex table parsers, see Bulk loading data into flex tables.
When loading into flex tables, you must use a compatible parser. For supported flex table parsers, see Bulk loading data into flex tables.
COPY LOCAL does not support the NATIVE, NATIVE VARCHAR, ORC, and PARQUET parsers.
For parser support for complex data types, see the documentation of the specific parser.
For parser details, see Data formats in Data load.
NO COMMIT
- Prevents the COPY statement from committing its transaction automatically when it finishes copying data. This option must be the last COPY statement parameter.
For details, see Using transactions to stage a load.
Restrictions: Invalid for ORC or Parquet data, ignored by CREATE EXTERNAL TABLE AS COPY
Parser-specific options
The following options apply only when using specific parsers.
DELIMITED parser
DELIMITER
Indicates the single ASCII character used to separate columns within each record of a file. You can use any ASCII value in the range E'\000' to E'\177', inclusive. You cannot use the same character for both the DELIMITER and NULL parameters. For more information, see Delimited data.
Default: Vertical bar ('|').
ENCLOSED [BY]
Sets the quote character within which to enclose data, allowing delimiter characters to be embedded in string values. You can choose any ASCII value in the range E'\001' to E'\177' inclusive (any ASCII character except NULL: E'\000'). By default, ENCLOSED BY has no value, meaning data is not enclosed by any sort of quote character.
ESCAPE [AS]
Sets the escape character. Once set, the character following the escape character is interpreted literally, rather than as a special character. You can define an escape character using any ASCII value in the range E'\001' to E'\177', inclusive (any ASCII character except NULL: E'\000').
The COPY statement does not interpret the data it reads in as String literals. It also does not follow the same escape rules as other SQL statements (including the COPY parameters). When reading data, COPY interprets only the characters defined by these options as special values:
-
ESCAPE [AS]
-
DELIMITER
-
ENCLOSED [BY]
-
RECORD TERMINATOR
-
All COLLECTION options
Default: Backslash ('\').
NO ESCAPE
Eliminates escape-character handling. Use this option if you do not need any escape character and you want to prevent characters in your data from being interpreted as escape sequences.
RECORD TERMINATOR
- Specifies the literal character string indicating the end of a data file record. For more information about using this parameter, see Delimited data.
TRAILING NULLCOLS
- Specifies that if Vertica encounters a record with insufficient data to match the columns in the table column list, COPY inserts the missing columns with NULL values. For other information and examples, see Fixed-width format data.
COLLECTIONDELIMITER
For columns of collection types, indicates the single ASCII character used to separate elements within each collection. You can use any ASCII value in the range E'\000' to E'\177', inclusive. No COLLECTION option may have the same value as any other COLLECTION option. For more information, see Delimited data.
Default: Comma (',').
COLLECTIONOPEN
, COLLECTIONCLOSE
For columns of collection types, these options indicate the characters that mark the beginning and end of the collection. It is an error to use these characters elsewhere within the list of elements without escaping them. No COLLECTION option may have the same value as any other COLLECTION option.
Default: Square brackets ('[' and ']').
COLLECTIONNULLELEMENT
The string representing a null element value in a collection. You can specify a null value as any ASCII value in the range E'\001' to E'\177' inclusive (any ASCII value except NULL: E'\000'). No COLLECTION option may have the same value as any other COLLECTION option. For more information, see Delimited data.
Default: 'null'
COLLECTIONENCLOSE
For columns of collection types, sets the quote character within which to enclose individual elements, allowing delimiter characters to be embedded in string values. You can choose any ASCII value in the range E'\001' to E'\177' inclusive (any ASCII character except NULL: E'\000').
No COLLECTION option may have the same value as any other COLLECTION option.
Default: double quote ('"')
FIXEDWIDTH parser
SKIP BYTES
integer
- The total number of bytes in a record to skip.
TRIM
- Trims the number of bytes you specify from a column. This option is only available when loading fixed-width data. You can set TRIM at the table level for a column, or as part of the COLUMN OPTION parameter.
7 - PARQUET (parser)
Use the PARQUET parser with the COPY FROM statement to load data in the Parquet format.
Use the PARQUET
parser with the COPY statement to load data in the Parquet format. When loading data into Vertica you can read all primitive types, UUIDs, and complex types.
By default, the Parquet parser uses strong schema matching, meaning that columns in the data must exactly match the columns in the table using the data. You can optionally use Loose Schema Matching.
When loading Parquet data, Vertica caches the Parquet metadata to improve efficiency. This cache uses local TEMP storage and is not used if TEMP is remote. See the ParquetMetadataCacheSizeMB configuration parameter to change the size of the cache.
Syntax
PARQUET ( [parameter='value'[,...]] )
Parameters
All parameters are optional.
hive_partition_cols
- Comma-separated list of columns that are partition columns in the data. See Using partition columns.
allow_no_match
- Boolean. Whether to accept a path containing a glob with no matching files and report zero rows in query results. If this parameter is not set, Vertica returns an error if the path in the FROM clause does not match at least one file.
allow_long_varbinary_match_complex_type
- Boolean. Whether to enable flexible column types (see Flexible complex types). If true, the Parquet parser allows a complex type in the data to match a table column defined as LONG VARBINARY. If false, the Parquet parser requires strong typing of complex types. With the parameter set you can still use strong typing. Set this parameter to false if you want use of flexible columns to be treated as an error.
do_soft_schema_match_by_name
- Boolean. Whether to enable loose schema matching (true) instead of the strict matching based on column order in the table definition and parquet file (false, default). See Loose Schema Matching for more information.
reject_on_materialized_type_error
- Boolean, applies only if
do_soft_schema_match_by_name
is true. Specifies what to do when loose schema matching is being used and a value cannot be coerced from the data to the target column type. A value of true (default) means to reject the row; a value of false means to use NULL for the value or, for strings that are too long, truncate. See the table of type coercions for coercible type mappings.
Loose schema matching
By default, the Parquet parser uses strong schema matching. This means that all columns in the Parquet data must be loaded, in the same order as in the data. However, there are times when you only want to pull certain columns, or you want to be able to accommodate future changes in the Parquet schema.
Use the do_soft_schema_match_by_name
parameter to enable soft schema matching. This setting has the following effects:
-
Columns in the data are matched to columns in the table by their names. Names must exactly match but are case-insensitive.
-
Columns that exist in the Parquet data but are not part of the table definition are ignored.
-
Columns that exist in the table definition but not the Parquet data are filled with NULL. The parser logs an UNMATCHED_TABLE_COLUMNS_PARQUETPARSER event in QUERY_EVENTS.
-
If the same case-insensitive column name occurs more than once in the Parquet data, the parser uses the last one. (This situation can arise when using data written by tools that are case-sensitive.)
-
Hive partition columns do not need to appear last in the table definition like they do with strong schema matching. If a partition column specified in the table definition does not exist in the data path, the parser uses a value of NULL and logs a MISSING_HIVE_PARTITION event in the QUERY_EVENTS table.
-
Column types do not need to exactly match, so long as the data type in the Parquet file can be coerced to the type used by the table. If a type cannot be coerced, the parser logs a TYPE_MISMATCH_COLUMNS_PARQUETPARSER event in QUERY_EVENTS. If reject_on_materialized_type_error
is true then the parser rejects the row. If it is false, the parser uses NULL or, for string values that are too long, truncates the value.
-
Columns using complex types (other than one-dimensional arrays of primitive types) can be defined but not queried.
Data types
The Parquet parser maps Parquet data types to Vertica data types as follows.
Parquet Logical Type |
Vertica Data Type |
StringLogicalType |
VARCHAR |
MapLogicalType |
ARRAY[ROW] |
ListLogicalType |
ARRAY/SET |
IntLogicalType |
INT/NUMERIC |
DecimalLogicalType |
NUMERIC |
DateLogicalType |
DATE |
TimeLogicalType |
TIME |
TimestampLogicalType |
TIMESTAMP |
UUIDLogicalType |
UUID |
The following logical types are not supported:
- EnumLogicalType
- IntervalLogicalType
- JSONLogicalType
- BSONLogicalType
- UnknownLogicalType
The Parquet parser supports the following mappings of physical types:
Parquet Physical Type |
Vertica Data Type |
BOOLEAN |
BOOLEAN |
INT32/INT64 |
INT |
INT96 |
Supported only for TIMESTAMP |
FLOAT |
DOUBLE |
DOUBLE |
DOUBLE |
BYTE_ARRAY |
VARBINARY |
FIXED_LEN_BYTE_ARRAY |
BINARY |
The following mappings are supported with type coercion and loose schema matching.
Parquet Physical Type |
Coercible to Vertica Data Type |
BOOLEAN |
BOOLEAN |
INT32, INT64, BOOLEAN |
INT |
FLOAT, DOUBLE |
DOUBLE |
INT32, INT96 |
DATE |
INT64, INT96 |
TIMESTAMP, TIMESTAMPTZ |
INT64
If precision > 0: INT32, BYTE_ARRAY, FIXED_LEN_BYTE_ARRAY
|
Numeric |
BYTE_ARRAY |
CHAR, VARCHAR, LONG VARCHAR, BINARY, VARBINARY, LONG VARBINARY |
FIXED_LEN_BYTE_ARRAY |
UUID |
Vertica supports only 3-level-encoded arrays, not 2-level-encoded.
Examples
Use the PARQUET
clause without parameters if your data is not partitioned.
=> COPY t FROM 's3://AWS_DataLake/sales.parquet' PARQUET;
In the following example, the "id" and "name" columns are included in the data and the "created" and "region" columns are partition columns. Partition columns, if not also data columns, must be listed last when defining columns with strong schema matching (the default).
=> CREATE EXTERNAL TABLE t (id int, name varchar(50), created date, region varchar(50))
AS COPY FROM 'webhdfs:///path/*/*/*'
PARQUET(hive_partition_cols='created,region');
In the following example, the data directory contains no files.
=> CREATE EXTERNAL TABLE customers (...)
AS COPY FROM 'webhdfs:///data/*.parquet' PARQUET;
=> SELECT COUNT(*) FROM customers;
ERROR 7869: No files match when expanding glob: [webhdfs:///data/*.parquet]
To read zero rows instead of producing an error, use the allow_no_match parameter with PARQUET:
=> CREATE EXTERNAL TABLE customers (...)
AS COPY FROM 'webhdfs:///data/*.parquet' PARQUET(allow_no_match='true');
=> SELECT COUNT(*) FROM customers;
count
-------
0
(1 row)
To allow reading a complex type (menu, in this example) as a flexible column type, use the allow_long_varbinary_match_complex_type
parameter:
=> CREATE EXTERNAL TABLE restaurants
(name VARCHAR, cuisine VARCHAR, location_city ARRAY[VARCHAR], menu LONG VARBINARY)
AS COPY FROM '/data/rest*.parquet'
PARQUET(allow_long_varbinary_match_complex_type='True');
To read only some columns from the restaurant data, use loose schema matching:
=> CREATE EXTERNAL TABLE restaurants(name VARCHAR, cuisine VARCHAR)
AS COPY FROM '/data/rest*.parquet'
PARQUET(allow_long_varbinary_match_complex_type='True',
do_soft_schema_match_by_name='True');
=> SELECT * from restaurant;
name | cuisine
-------------------+----------
Bob's pizzeria | Italian
Bakersfield Tacos | Mexican
(2 rows)
8 - Restrictions
COPY has the following restrictions:.
COPY has the following restrictions:
Invalid data
COPY considers the following data invalid:
-
Missing columns (an input line has fewer columns than the recipient table).
-
Extra columns (an input line has more columns than the recipient table).
-
Empty columns for an INTEGER or DATE/TIME data type. If a column is empty for either of these types, COPY
does not use the default value that was defined by CREATE TABLE. However, if you do not supply a column option as part of the COPY
statement, the default value is used.
-
Incorrect representation of a data type. For example, trying to load a non-numeric value into an INTEGER column is invalid.
Constraint violations
If primary key, unique key, or check constraints are enabled for automatic enforcement in the target table, Vertica enforces those constraints when you load new data. If a violation occurs, Vertica rolls back the operation and returns an error.
Empty line handling
When COPY
encounters an empty line while loading data, the line is neither inserted nor rejected, but COPY
increments the line record number. Consider this behavior when evaluating rejected records. If you return a list of rejected records and COPY
encountered an empty row while loading data, the position of rejected records is not incremented by one, as demonstrated in the following example.
The example first loads values into a table that defines the first column as INT. Note the errors on rows 3, 4, and 8:
=> \! cat -n /home/dbadmin/test.txt
1 1|A|2
2 2|B|4
3 A|D|7
4 A|E|7
5
6
7 6|A|3
8 B|A|3
The empty rows (5 and 6) shift the reporting of the error on row 8:
=> SELECT row_number, rejected_data, rejected_reason FROM test_bad;
row_number | rejected_data | rejected_reason
------------+---------------+----------------------------------------------
3 | A|D|7 | Invalid integer format 'A' for column 1 (c1)
4 | A|E|7 | Invalid integer format 'A' for column 1 (c1)
6 | B|A|3 | Invalid integer format 'B' for column 1 (c1)
(3 rows)
Compressed file errors
When loading compressed files, COPY
might abort and report an error, if the file seems to be corrupted. For example, this behavior can occur if reading the header block fails.