Vertica supports several parsers to load different types of data. Some parsers are for use only with flex tables, as noted.
This is the multi-page printable view of this section. Click here to print.
Parsers
- 1: DELIMITED
- 2: FAVROPARSER
- 3: FCEFPARSER
- 4: FCSVPARSER
- 5: FDELIMITEDPAIRPARSER
- 6: FDELIMITEDPARSER
- 7: FJSONPARSER
- 8: FREGEXPARSER
- 9: ORC
- 10: PARQUET
1 - DELIMITED
Use the DELIMITED parser, which is the default, to load delimited text data using COPY. You can specify the delimiter, escape characters, how to handle null values, and other parameters.
The DELIMITED parser supports both apportioned load and cooperative parse.
COPY options
The following options are specific to this parser. See Parameters for other applicable options.
DELIMITER
Indicates the single ASCII character used to separate columns within each record of a file. You can use any ASCII value in the range E'\000' to E'\177', inclusive. You cannot use the same character for both the DELIMITER and NULL parameters. For more information, see Delimited data.
Default: Vertical bar ('|').
ENCLOSED [BY]
Sets the quote character within which to enclose data, allowing delimiter characters to be embedded in string values. You can choose any ASCII value in the range E'\001' to E'\177' inclusive (any ASCII character except NULL: E'\000'). By default, ENCLOSED BY has no value, meaning data is not enclosed by any sort of quote character.
ESCAPE [AS]
Sets the escape character. Once set, the character following the escape character is interpreted literally, rather than as a special character. You can define an escape character using any ASCII value in the range E'\001' to E'\177', inclusive (any ASCII character except NULL: E'\000').
The COPY statement does not interpret the data it reads in as String literals. It also does not follow the same escape rules as other SQL statements (including the COPY parameters). When reading data, COPY interprets only the characters defined by these options as special values:
-
ESCAPE [AS]
-
DELIMITER
-
ENCLOSED [BY]
-
RECORD TERMINATOR
-
All COLLECTION options
Default: Backslash ('\').
-
NO ESCAPE
Eliminates escape-character handling. Use this option if you do not need any escape character and you want to prevent characters in your data from being interpreted as escape sequences.
RECORD TERMINATOR
- Specifies the literal character string indicating the end of a data file record. For more information about using this parameter, see Delimited data.
TRAILING NULLCOLS
- Specifies that if Vertica encounters a record with insufficient data to match the columns in the table column list, COPY inserts the missing columns with NULL values. For other information and examples, see Fixed-width format data.
COLLECTIONDELIMITER
For columns of collection types, indicates the single ASCII character used to separate elements within each collection. You can use any ASCII value in the range E'\000' to E'\177', inclusive. No COLLECTION option may have the same value as any other COLLECTION option. For more information, see Delimited data.
Default: Comma (',').
COLLECTIONOPEN
,COLLECTIONCLOSE
For columns of collection types, these options indicate the characters that mark the beginning and end of the collection. It is an error to use these characters elsewhere within the list of elements without escaping them. No COLLECTION option may have the same value as any other COLLECTION option.
Default: Square brackets ('[' and ']').
COLLECTIONNULLELEMENT
The string representing a null element value in a collection. You can specify a null value as any ASCII value in the range E'\001' to E'\177' inclusive (any ASCII value except NULL: E'\000'). No COLLECTION option may have the same value as any other COLLECTION option. For more information, see Delimited data.
Default: 'null'
COLLECTIONENCLOSE
For columns of collection types, sets the quote character within which to enclose individual elements, allowing delimiter characters to be embedded in string values. You can choose any ASCII value in the range E'\001' to E'\177' inclusive (any ASCII character except NULL: E'\000').
No COLLECTION option may have the same value as any other COLLECTION option.
Default: double quote ('"')
Data types
The DELIMITED parser supports reading one-dimensional collections (arrays or sets) of scalar types.
If the total size of an array exceeds the size defined by the target table, the parser rejects the row.
Examples
The following example shows the default behavior, in which the delimiter character is '|'
=> CREATE TABLE employees (id INT, name VARCHAR(50), department VARCHAR(50));
CREATE TABLE
=> COPY employees FROM STDIN;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> 42|Sheldon Cooper|Physics
>> 17|Howard Wolowitz|Astronomy
>> \.
=> SELECT * FROM employees;
id | name | department
----+-----------------+--------------
17 | Howard Wolowitz | Astrophysics
42 | Sheldon Cooper | Physics
(2 rows)
The following example shows loading array values with the default options.
=> CREATE TABLE researchers (id INT, name VARCHAR, grants ARRAY[VARCHAR], values ARRAY[INT]);
CREATE TABLE
=> COPY researchers FROM STDIN;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> 42|Sheldon Cooper|[US-7376,DARPA-1567]|[65000,135000]
>> 17|Howard Wolowitz|[NASA-1683,NASA-7867,SPX-76]|[16700,85000,45000]
>> \.
=> SELECT * FROM researchers;
id | name | grants | values
----+-----------------+------------------------------------+---------------------
17 | Howard Wolowitz | ["NASA-1683","NASA-7867","SPX-76"] | [16700,85000,45000]
42 | Sheldon Cooper | ["US-7376","DARPA-1567"] | [65000,135000]
(2 rows)
In the following example, collections are enclosed in braces and delimited by periods, and the arrays contain null values.
=> COPY researchers FROM STDIN COLLECTIONOPEN '{' COLLECTIONCLOSE '}' COLLECTIONDELIMITER '.';
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> 19|Leonard|{"us-1672".null."darpa-1963"}|{16200.null.16700}
>> \.
=> SELECT * FROM researchers;
id | name | grants | values
----+-----------------+------------------------------------+---------------------
17 | Howard Wolowitz | ["NASA-1683","NASA-7867","SPX-76"] | [16700,85000,45000]
42 | Sheldon Cooper | ["US-7376","DARPA-1567"] | [65000,135000]
19 | Leonard | ["us-1672",null,"darpa-1963"] | [16200,null,16700]
(3 rows)
2 - FAVROPARSER
Parses data from an Avro file. The following requirements apply:
-
Avro files must be encoded in the Avro binary serialization encoding format, described in the Apache Avro standard. The parser also supports Snappy and deflate compression.
-
FAVROPARSER does not support Avro files with separate schema files. The Avro file must include the schema.
You can load complex types in the Avro source (arrays, structs, or combinations) with strong typing or as flexible complex types. A flexible complex type is loaded into a VMap column, as in flex tables. To load complex types as VMap columns, specify a column type of LONG VARBINARY. To preserve the indexing in complex types, set flatten_maps
to false.
This parser can notify you if it finds keys in the data that are not part of the table definition. See Unmatched Keys.
When loading into a flex table, Vertica loads all data into the __raw__
(VMap) column, including complex types found in the data.
This parser does not support apportioned load or cooperative parse.
Syntax
FAVROPARSER ( [parameter=value[,...]] )
Parameters
flatten_maps
- Boolean, whether to flatten all Avro maps. Key names are concatenated with nested levels. This value is recursive and affects all data in the load.
This parameter applies only to flex tables or VMap columns and is ignored when loading strongly-typed complex types.
Default: true
flatten_arrays
- Boolean, whether to flatten all Avro arrays. Key names are concatenated with nested levels. This value is recursive and affects all data in the load.
This parameter applies only to flex tables or VMap columns and is ignored when loading strongly-typed complex types.
Default: false
flatten_records
- Boolean, whether to flatten all Avro records. Key names are concatenated with nested levels. This value is recursive and affects all data in the load.
This parameter applies only to flex tables or VMap columns and is ignored when loading strongly-typed complex types.
Default: true
reject_on_materialized_type_error
Boolean, whether to reject a data row that contains a materialized column value that cannot be coerced into a compatible data type. If the value is false and the type cannot be coerced, the parser sets the value in that column to NULL.
If the column is an array and the data to be loaded is too large, then false sets the column value to NULL and true rejects the row.
If the column is a strongly-typed complex type, as opposed to a flexible complex type, then a type mismatch anywhere in the complex type causes the entire column to be treated as a mismatch. The parser does not partially load complex types.
Default: false
suppress_warnings
- String, which warnings to suppress:
-
unmatched_key
(see Unmatched Keys) -
true
ort
(suppress all warnings) -
false
orf
(do not suppress warnings)
Default: false
-
Primitive data types
FAVROPARSER supports the following primitive data types, including as element types and field values in complex types.
AVRO Data Type | Vertica Data Type | Value |
---|---|---|
NULL | NULL value | No value |
boolean | Boolean data type | A binary value |
int | INTEGER | 32-bit signed integer |
long | INTEGER | 64-bit signed integer |
float |
Synonymous with 64-bit IEEE FLOAT |
Single precision (32-bit) IEEE 754 floating-point number |
double | DOUBLE PRECISION (FLOAT) | Double precision (64-bit) IEEE 754 floating-point number |
bytes | VARBINARY | Sequence of 8-bit unsigned bytes |
string | VARCHAR | Unicode character sequence |
Note
Vertica does not have an explicit 4-byte (32-bit integer) or smaller types. Instead, Vertica encoding and compression automatically eliminate the storage overhead of values that require less than 64 bits.Avro logical types
FAVROPARSER supports the following Avro logical types. The target column must use a Vertica data type that supports the logical type. When you attempt to load data using an invalid logical type, the logical type is ignored and the underlying Avro type is used.
AVRO Logical Type | Base Avro Type | Supported Vertica Data Types |
---|---|---|
0 < 0 ≤ |
bytes or fixed |
Vertica rejects the value if:
If the data type for the target column uses the default precision setting, the precision setting in the Avro schema overrides the default. |
date |
integer |
DATE, Character |
time-micros |
long |
The time logical type does not provide a time zone value. For target columns that use the TIMETZ data type, Vertica uses UTC as the default. |
time-millis |
int | |
timestamp-micros |
long |
TIMESTAMP/TIMESTAMPTZ, TIME/TIMETZ For |
timestamp-millis |
long | |
duration |
fixed |
INTERVAL, Character |
Avro complex data types
The Avro format supports several complex data types. When loading into strongly-typed columns, you can use the ROW and ARRAY types to represent them. For example, Avro Record and Enums are structs (ROWs); see the Avro specification.
You can use ARRAY[ROW] to match an Avro map. You must name the ROW fields key
and value
. These are the names that the Avro format uses for those fields in the data, and the parser relies on field names to match data to table columns.
If the total size of an array exceeds the size defined by the target table, the parser sets the value to null.
When loading into flex tables or using flexible complex types, this parser handles Avro complex types as follows:
Record
The name of each field is used as a virtual column name. If flatten_records
is true and several nesting levels are present, Vertica concatenates the record names to create the key name.
Map
The value of each map key is used as a virtual column name. If flatten_maps
is true and several nesting levels are present, Vertica concatenates the key names to create the key name.
Enum
Vertica treats Avro Enums like records, with the name of the Enum as the key and the value as the value.
Array
Vertica treats Avro Arrays as key/value pairs. By default, the index of each element is the key. In the following example, product_detail
is a Record with a field, product_category
, that is an Array:
=> CREATE FLEX TABLE products;
CREATE TABLE
=> COPY products FROM :datafile WITH PARSER FAVROPARSER();
Rows Loaded
-------------
2
(1 row)
=> SELECT MAPTOSTRING(__raw__) FROM products ORDER BY __identity__;
maptostring
--------------------------------------------------------------------------------
{
"__name__": "Order",
"customer_id": "111222",
"order_details": {
"0.__name__": "OrderDetail",
"0.product_detail.__name__": "Product",
"0.product_detail.price": "46.21",
"0.product_detail.product_category": {
"0": "electronics",
"1": "printers",
"2": "computers"
},
"0.product_detail.product_description": "hp printer X11ew description :\
P",
"0.product_detail.product_hash": "\u0000\u0001\u0002\u0003\u0004",
"0.product_detail.product_id": "999012",
"0.product_detail.product_map.one": "1.1",
"0.product_detail.product_map.two": "1.1",
"0.product_detail.product_name": "hp printer X11ew",
"0.product_detail.product_status": "ONLY_FEW_LEFT",
"0.quantity": "3",
"0.total": "354.34"
},
"order_id": "2389646",
"total": "132.43"
}
...
If flatten_arrays
is true and several nesting levels are present, Vertica concatenates the indices to create the key name.
=> COPY products FROM :datafile WITH PARSER FAVROPARSER(flatten_arrays=true);
Rows Loaded
-------------
2
(1 row)
=> SELECT MAPTOSTRING(__raw__) FROM products ORDER BY __identity__;
maptostring
--------------------------------------------------------------------------------
{
"__name__": "Order",
"customer_id": "111222",
"order_details.0.__name__": "OrderDetail",
"order_details.0.product_detail.__name__": "Product",
"order_details.0.product_detail.price": "46.21",
"order_details.0.product_detail.product_category.0": "electronics",
"order_details.0.product_detail.product_category.1": "printers",
"order_details.0.product_detail.product_category.2": "computers",
"order_details.0.product_detail.product_description": "hp printer X11ew des\
cription :P",
"order_details.0.product_detail.product_hash": "\u0000\u0001\u0002\u0003\u0\
004",
"order_details.0.product_detail.product_id": "999012",
"order_details.0.product_detail.product_map.one": "1.1",
"order_details.0.product_detail.product_map.two": "1.1",
"order_details.0.product_detail.product_name": "hp printer X11ew",
"order_details.0.product_detail.product_status": "ONLY_FEW_LEFT",
"order_details.0.quantity": "3",
"order_details.0.total": "354.34",
"order_id": "2389646",
"total": "132.43"
}
...
Union
Vertica treats Avro Unions as arrays.
Unmatched keys
Data being loaded can contain keys that are not part of the table definition. If you are loading into a flex table (or a flexible complex type column), no data is lost. For a table with strongly-defined columns, however, new keys cannot be loaded because the table does not have a place to put them.
This parser emits warnings if it finds new keys and if both of the following are true:
-
The target table is not a flex table.
-
The new key is not nested within a flexible complex type column.
New keys are logged in the UDX_EVENTS system table. If a new key is a complex type with nested keys, only the top-level key is logged. When you see a warning about unmatched keys, you can query this table and then use ALTER TABLE to modify your table definition for future loads.
Querying an external table loads data and thus can trigger these warnings. To prevent them, set the suppress_warnings
parameter to 'unmatched_keys' or 'true':
=> CREATE EXTERNAL TABLE restaurants(
name VARCHAR(50),
menu ARRAY[ROW(item VARCHAR(50), price NUMERIC(8,2)),100])
AS COPY FROM '/data/rest.json'
PARSER FAVROPARSER(suppress_warnings='unmatched_key');
Examples
This example shows how to create and load a flex table with Avro data using favroparser
. After loading the data, you can query virtual columns:
=> CREATE FLEX TABLE avro_basic();
CREATE TABLE
=> COPY avro_basic FROM '/home/dbadmin/data/weather.avro' PARSER FAVROPARSER();
Rows Loaded
-------------
5
(1 row)
=> SELECT station, temp, time FROM avro_basic;
station | temp | time
---------+------+---------------
mohali | 0 | -619524000000
lucknow | 22 | -619506000000
norwich | -11 | -619484400000
ams | 111 | -655531200000
baddi | 78 | -655509600000
(5 rows)
For more information, see Avro data.
3 - FCEFPARSER
Parses ArcSight Common Event Format (CEF) log files. This parser loads values directly into any table column with a column name that matches a source data key. The parser stores the data loaded into a flex table in a single VMap.
This parser is for use in Flex tables only. All flex parsers store the data as a single VMap in the LONG VARBINAR_raw__
column. If a data row is too large to fit in the column, it is rejected. Vertica supports null values for loading data with NULL-specified columns.
Syntax
FCEFPARSER ( [parameter-name='value'[,...]] )
Parameters
delimiter
- Single-character delimiter.
Default:
' '
record_terminator
- Single-character record terminator.
**Default ****value: **
newline
trim
- Boolean, specifies whether to trim white space from header names and key values.
Default:
true
reject_on_unescaped_delimiter
- Boolean, specifies whether to reject rows containing unescaped delimiters. The CEF standard does not permit them.
Default:
false
Examples
The following example illustrates creating a sample flex table for CEF data, with two real columns, eventId
and priority
.
-
Create a flex table
cefdata
:=> create flex table cefdata(); CREATE TABLE
-
Load some basic CEF data, using the flex parser
fcefparser
:=> copy cefdata from stdin parser fcefparser(); Enter data to be copied followed by a newline. End with a backslash and a period on a line by itself. >> CEF:0|ArcSight|ArcSight|2.4.1|machine:20|New alert|High| >> \.
-
Use the
maptostring()
function to view the contents of yourcefdata
flex table:=> select maptostring(__raw__) from cefdata; maptostring ------------------------------------------------------------- { "deviceproduct" : "ArcSight", "devicevendor" : "ArcSight", "deviceversion" : "2.4.1", "name" : "New alert", "severity" : "High", "signatureid" : "machine:20", "version" : "0" } (1 row)
-
Select some virtual columns from the
cefdata
flex table:= select deviceproduct, severity, deviceversion from cefdata; deviceproduct | severity | deviceversion ---------------+----------+--------------- ArcSight | High | 2.4.1 (1 row)
For more information, see Common event format (CEF) data
See also
4 - FCSVPARSER
Parses CSV format (comma-separated values) data. Use this parser to load CSV data into columnar, flex, and hybrid tables. All data must be encoded in Unicode UTF-8 format. The fcsvparser
parser supports the RFC 4180 standard for CSV data, and other options, to accommodate variations in CSV file format definitions. Invalid records are rejected. For more information about data formats, see Handling Non-UTF-8 input.
This parser is for use in Flex tables only. All flex parsers store the data as a single VMap in the LONG VARBINAR_raw__
column. If a data row is too large to fit in the column, it is rejected. Vertica supports null values for loading data with NULL-specified columns.
Syntax
FCSVPARSER ( [parameter='value'[,...]] )
Parameters
type
- The default parameter value for the parser, one of the following strings:
-
rfc4180
-
traditional
You do not have to use the type parameter when loading data that conforms to the RFC 4180 standard (such as MS Excel files). See Loading CSV data for the
RFC4180
default parameters, and other options you can specify for traditional CSV files.Default:
RFC4180
-
delimiter
- A single-character value used to separate fields in the CSV data.
Default:
,
(forrfc4180
andtraditional
) escape
- A single-character value used as an escape character to interpret the next character in the data literally.
Default:
-
rfc4180
:"
-
traditional
:\
-
enclosed_by
- A single-character value. Use
enclosed_by
to include a value that is identical to the delimiter, but should be interpreted literally. For example, if the data delimiter is a comma (,
), and you want to use a comma within the data ("my name is jane, and his is jim"
).Default:
"
record_terminator
- A single-character value used to specify the end of a record.
Default:
-
rfc4180
:\n
-
traditional
:\r\n
-
trailing_nullcols
- Boolean, specifies that if Vertica encounters a record with insufficient data to match the columns in the table column list, COPY inserts the missing columns with NULL values.
Default: false
header
- Boolean, specifies whether to use the first row of data as a header column. When
header=true
(default), and no header exists, fcsvparser uses a default column heading. The default header consists ofucol
n
, where n is the column offset number, starting with0
for the first column. You can specify custom column heading names using theheader_names
parameter, described next.If you specify
header=false
, thefcsvparser
parses the first row of input as data, rather than as column headers.Default: true
header_names
- A list of column header names, delimited by the character defined by the parser's delimiter parameter. Use this parameter to specify header names in a CSV file without a header row, or to override the column names present in the CSV source. To override one or more existing column names, specify the header names to use. This parameter overrides any header row in the data.
trim
- Boolean, specifies whether to trim white space from header names and key values.
Default: true
omit_empty_keys
- Boolean, specifies how the parser handles header keys without values. If true, keys with an empty value in the
header
row are not loaded.Default: false
reject_on_duplicate
- Boolean, specifies whether to ignore duplicate records (false), or to reject duplicates (true). In either case, the load continues.
Default: false
reject_on_empty_key
- Boolean, specifies whether to reject any row containing a key without a value.
Default: false
reject_on_materialized_type_error
- Boolean, specifies whether to reject any materialized column value that the parser cannot coerce into a compatible data type. See Loading CSV data.
Default: false
Examples
The following example loads data that complies with RFC 4180:
=> CREATE TABLE sample(a INT, b VARCHAR, c INT);
CREATE TABLE
=> COPY sample FROM stdin PARSER FCSVPARSER(header='false');
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> 10,10,20
>> 10,"10",30
>> 10,"20""5",90
>> \.
=> SELECT * FROM sample;
a | b | c
----+------+----
10 | 10 | 20
10 | 10 | 30
10 | 20"5 | 90
(3 rows)
Note the value 20"5
, which came from the input "20""5"
. Per RFC 4180, to include a double-quote character in a string, you must double-quote the entire string and then escape the double-quote with another double-quote.
For more information and examples using other parameters of this parser, see Loading CSV data.
See also
5 - FDELIMITEDPAIRPARSER
Parses delimited data files. This parser provides a subset of the functionality in the parser fdelimitedparser
. Use the fdelimitedpairparser
when the data you are loading specifies pairs of column names with data in each row.
This parser is for use in Flex tables only. All flex parsers store the data as a single VMap in the LONG VARBINAR_raw__
column. If a data row is too large to fit in the column, it is rejected. Vertica supports null values for loading data with NULL-specified columns.
Syntax
FDELIMITEDPAIRPARSER ( [parameter-name='value'[,...]] )
Parameters
delimiter
- Specifies a single-character delimiter.
Default:
' '
record_terminator
- Specifies a single-character record terminator.
Default: newline
trim
- Boolean specifies whether to trim white space from header names and key values.
Default: true
Examples
The following example illustrates creating a sample flex table for simple delimited data, with two real columns, eventId
and priority
.
-
Create a table:
=> create flex table CEFData(eventId int default(eventId::int), priority int default(priority::int) ); CREATE TABLE
-
Load a sample delimited OpenText ArcSight log file into the
CEFData
table, using thefcefparser
:=> copy CEFData from '/home/release/kmm/flextables/sampleArcSight.txt' parser fdelimitedpairparser(); Rows Loaded | 200
-
After loading the sample data file, use
maptostring()
to display the virtual columns in the__raw__
column ofCEFData
:=> select maptostring(__raw__) from CEFData limit 1; maptostring ----------------------------------------------------------- "agentassetid" : "4-WwHuD0BABCCQDVAeX21vg==", "agentzone" : "3083", "agt" : "265723237", "ahost" : "svsvm0176", "aid" : "3tGoHuD0BABCCMDVAeX21vg==", "art" : "1099267576901", "assetcriticality" : "0", "at" : "snort_db", "atz" : "America/Los_Angeles", "av" : "5.3.0.19524.0", "cat" : "attempted-recon", "categorybehavior" : "/Communicate/Query", "categorydevicegroup" : "/IDS/Network", "categoryobject" : "/Host", "categoryoutcome" : "/Attempt", "categorysignificance" : "/Recon", "categorytechnique" : "/Scan", "categorytupledescription" : "An IDS observed a scan of a host.", "cnt" : "1", "cs2" : "3", "destinationgeocountrycode" : "US", "destinationgeolocationinfo" : "Richardson", "destinationgeopostalcode" : "75082", "destinationgeoregioncode" : "TX", "destinationzone" : "3133", "device product" : "Snort", "device vendor" : "Snort", "device version" : "1.8", "deviceseverity" : "2", "dhost" : "198.198.121.200", "dlat" : "329913940429", "dlong" : "-966644973754", "dst" : "3334896072", "dtz" : "America/Los_Angeles", "dvchost" : "unknown:eth1", "end" : "1364676323451", "eventid" : "1219383333", "fdevice product" : "Snort", "fdevice vendor" : "Snort", "fdevice version" : "1.8", "fdtz" : "America/Los_Angeles", "fdvchost" : "unknown:eth1", "lblstring2label" : "sig_rev", "locality" : "0", "modelconfidence" : "0", "mrt" : "1364675789222", "name" : "ICMP PING NMAP", "oagentassetid" : "4-WwHuD0BABCCQDVAeX21vg==", "oagentzone" : "3083", "oagt" : "265723237", "oahost" : "svsvm0176", "oaid" : "3tGoHuD0BABCCMDVAeX21vg==", "oat" : "snort_db", "oatz" : "America/Los_Angeles", "oav" : "5.3.0.19524.0", "originator" : "0", "priority" : "8", "proto" : "ICMP", "relevance" : "10", "rt" : "1099267573000", "severity" : "8", "shost" : "198.198.104.10", "signature id" : "[1:469]", "slat" : "329913940429", "slong" : "-966644973754", "sourcegeocountrycode" : "US", "sourcegeolocationinfo" : "Richardson", "sourcegeopostalcode" : "75082", "sourcegeoregioncode" : "TX", "sourcezone" : "3133", "src" : "3334891530", "start" : "1364676323451", "type" : "0" } (1 row)
-
Select the
eventID
andpriority
real columns, along with two virtual columns,atz
anddestinationgeoregioncode
:=> select eventID, priority, atz, destinationgeoregioncode from CEFData limit 10; eventID | priority | atz | destinationgeoregioncode ------------+----------+---------------------+-------------------------- 1218325417 | 5 | America/Los_Angeles | 1219383333 | 8 | America/Los_Angeles | TX 1219533691 | 9 | America/Los_Angeles | TX 1220034458 | 5 | America/Los_Angeles | TX 1220034578 | 9 | America/Los_Angeles | 1220067119 | 5 | America/Los_Angeles | TX 1220106960 | 5 | America/Los_Angeles | TX 1220142122 | 5 | America/Los_Angeles | TX 1220312009 | 5 | America/Los_Angeles | TX 1220321355 | 5 | America/Los_Angeles | CA (10 rows)
See also
6 - FDELIMITEDPARSER
Parses data using a delimiter character to separate values. The fdelimitedparser
loads delimited data, storing it in a single-value VMap.
This parser is for use in Flex tables only. All flex parsers store the data as a single VMap in the LONG VARBINAR_raw__
column. If a data row is too large to fit in the column, it is rejected. Vertica supports null values for loading data with NULL-specified columns.
Note
By default,fdelimitedparser
treats empty fields as NULL
, rather than as an empty string (''
). This behavior makes casting easier. Casting a NULL to an integer (NULL::int
) is valid, while casting an empty string to an integer (''::int
) is not. If required, use the treat_empty_val_as_null
parameter to change the default behavior of fdelimitedparser
.
Syntax
FDLIMITEDPARSER ( [parameter-name='value'[,...]] )
Parameters
delimiter
- Single character delimiter.
Default:
|
record_terminator
- Single-character record terminator.
Default:
\n
trim
- Boolean, specifies whether to trim white space from header names and key values.
Default:
true
header
- Boolean, specifies that a header column exists. The parser uses
col###
for the column names if you use this parameter but no header exists.Default:
true
omit_empty_keys
- Boolean, specifies how the parser handles header keys without values. If
omit_empty_keys=true
, keys with an empty value in theheader
row are not loaded.Default:
false
reject_on_duplicate
- Boolean, specifies whether to ignore duplicate records (
false
), or to reject duplicates (true
). In either case, the load continues.Default:
false
reject_on_empty_key
- Boolean, specifies whether to reject any row containing a key without a value.
Default:
false
reject_on_materialized_type_error
- Boolean, specifies whether to reject any row value for a materialized column that the parser cannot coerce into a compatible data type. See Using flex table parsers.
Default:
false
treat_empty_val_as_null
- Boolean, specifies that empty fields become
NULLs
, rather than empty strings (''
).Default:
true
Examples
-
Create a flex table for delimited data:
t=> CREATE FLEX TABLE delim_flex (); CREATE TABLE
-
Use the
fdelimitedparser
to load some delimited data fromSTDIN
, specifying a comma (,
) column delimiter:=> COPY delim_flex FROM STDIN parser fdelimitedparser (delimiter=','); Enter data to be copied followed by a newline. End with a backslash and a period on a line by itself. >> deviceproduct, severity, deviceversion >> ArcSight, High, 2.4.1 >> \.
You can now query virtual columns in the delim_flex
flex table:
=> SELECT deviceproduct, severity, deviceversion from delim_flex;
deviceproduct | severity | deviceversion
---------------+----------+---------------
ArcSight | High | 2.4.1
(1 row)
See also
7 - FJSONPARSER
Parses and loads a JSON file. This file can contain either repeated JSON data objects (including nested maps), or an outer list of JSON elements.
When loading into a flex or hybrid table, the parser stores the JSON data in a single-value VMap. When loading into a hybrid or columnar table, the parser loads data directly into any table column with a column name that matches a key in the JSON source data.
You can load complex types in the JSON source (arrays, structs, or combinations) with strong typing or as flexible complex types. A flexible complex type is loaded into a VMap column, as in flex tables. To load complex types as VMap columns, specify a column type of LONG VARBINARY. To preserve the indexing in complex types, set flatten_maps
to false.
This parser can notify you if it finds keys in the data that are not part of the table definition. See Unmatched Keys.
FJSONPARSER supports cooperative parse only if record_terminator
is specified. It does not support apportioned load.
Syntax
FJSONPARSER ( [parameter=value[,...]] )
Parameters
flatten_maps
- Boolean, whether to flatten sub-maps within the JSON data, separating map levels with a period (
.
). This value affects all data in the load, including nested maps.This parameter applies only to flex tables or VMap columns and is ignored when loading strongly-typed complex types.
Default: true
flatten_arrays
- Boolean, whether to convert lists to sub-maps with integer keys. When lists are flattened, key names are concatenated as for maps. Lists are not flattened by default. This value affects all data in the load, including nested lists.
This parameter applies only to flex tables or VMap columns and is ignored when loading strongly-typed complex types.
Default: false
reject_on_duplicate
- Boolean, whether to ignore duplicate records (false), or to reject duplicates (true). In either case, the load continues.
Default: false
reject_on_empty_key
- Boolean, whether to reject any row containing a field key without a value.
Default: false
omit_empty_keys
- Boolean, whether to omit any field key from the data that does not have a value. Other fields in the same record are loaded.
Default: false
record_terminator
- When set, any invalid JSON records are skipped and parsing continues with the next record. Records must be terminated uniformly. For example, if your input file has JSON records terminated by newline characters, set this parameter to
E'\n')
. If any invalid JSON records exist, parsing continues after the nextrecord_terminator
.Even if the data does not contain invalid records, specifying an explicit record terminator can improve load performance by allowing cooperative parse and apportioned load to operate more efficiently.
When you omit this parameter, parsing ends at the first invalid JSON record.
reject_on_materialized_type_error
Boolean, whether to reject a data row that contains a materialized column value that cannot be coerced into a compatible data type. If the value is false and the type cannot be coerced, the parser sets the value in that column to NULL.
If the column is an array and the data to be loaded is too large, then false sets the column value to NULL and true rejects the row.
If the column is a strongly-typed complex type, as opposed to a flexible complex type, then a type mismatch anywhere in the complex type causes the entire column to be treated as a mismatch. The parser does not partially load complex types.
Default: false
start_point
- String, the name of a key in the JSON load data at which to begin parsing. The parser ignores all data before the
start_point
value. The value is loaded for each object in the file. The parser processes data after the first instance, and up to the second, ignoring any remaining data. start_point_occurrence
- Integer, the nth occurrence of the value you specify with
start_point
. Use in conjunction withstart_point
when the data has multiple start values and you know the occurrence at which to begin parsing.Default: 1
suppress_nonalphanumeric_key_chars
- Boolean, whether to suppress non-alphanumeric characters in JSON key values. The parser replaces these characters with an underscore (
_
) when this parameter is true.Default: false
key_separator
- Character for the parser to use when concatenating key names.
Default: period (
.
) suppress_warnings
- String, which warnings to suppress:
-
unmatched_key
(see Unmatched Keys) -
true
ort
(suppress all warnings) -
false
orf
(do not suppress warnings)
Default: false
-
Data types
If the total size of an array exceeds the size defined by the target table, the parser sets the value to null.
Unmatched keys
Data being loaded can contain keys that are not part of the table definition. If you are loading into a flex table (or a flexible complex type column), no data is lost. For a table with strongly-defined columns, however, new keys cannot be loaded because the table does not have a place to put them.
This parser emits warnings if it finds new keys and if both of the following are true:
-
The target table is not a flex table.
-
The new key is not nested within a flexible complex type column.
New keys are logged in the UDX_EVENTS system table. If a new key is a complex type with nested keys, only the top-level key is logged. When you see a warning about unmatched keys, you can query this table and then use ALTER TABLE to modify your table definition for future loads.
Querying an external table loads data and thus can trigger these warnings. To prevent them, set the suppress_warnings
parameter to 'unmatched_keys' or 'true':
=> CREATE EXTERNAL TABLE restaurants(
name VARCHAR(50),
menu ARRAY[ROW(item VARCHAR(50), price NUMERIC(8,2)),100])
AS COPY FROM '/data/rest.json'
PARSER FJSONPARSER(suppress_warnings='unmatched_key');
Examples
The following example loads JSON data from STDIN using the default parameters:
=> CREATE TABLE people(age INT, name VARCHAR);
CREATE TABLE
=> COPY people FROM STDIN PARSER FJSONPARSER();
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> {"age": 5, "name": "Tim"}
>> {"age": 3}
>> {"name": "Fred"}
>> {"name": "Bob", "age": 10}
>> \.
=> SELECT * FROM people;
age | name
-----+------
| Fred
10 | Bob
5 | Tim
3 |
(4 rows)
The following example uses the reject_on_duplicate
parameter to reject duplicate values:
=> CREATE FLEX TABLE json_dupes();
CREATE TABLE
=> COPY json_dupes FROM stdin PARSER fjsonparser(reject_on_duplicate=true)
exceptions '/home/dbadmin/load_errors/json_e.out'
rejected data '/home/dbadmin/load_errors/json_r.out';
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> {"a":"1","a":"2","b":"3"}
>> \.
=> \!cat /home/dbadmin/load_errors/json_e.out
COPY: Input record 1 has been rejected (Rejected by user-defined parser).
Please see /home/dbadmin/load_errors/json_r.out, record 1 for the rejected record.
COPY: Loaded 0 rows, rejected 1 rows.
The following example loads array data:
$ cat addrs.json
{"number": 301, "street": "Grant", "attributes": [1, 2, 3, 4]}
=> CREATE EXTERNAL TABLE customers(number INT, street VARCHAR, attributes ARRAY[INT])
AS COPY FROM 'addrs.json' PARSER fjsonparser();
=> SELECT number, street, attributes FROM customers;
num | street| attributes
-----+-----------+---------------
301 | Grant | [1,2,3,4]
(1 row)
The following example loads a flexible complex type, rejecting rows that have empty keys within the nested records. Notice that while the data has two restaurants, one has a key name that is an empty string. This one is rejected:
$ cat rest1.json
{
"name" : "Bob's pizzeria",
"cuisine" : "Italian",
"location_city" : ["Cambridge", "Pittsburgh"],
"menu" : [{"item" : "cheese pizza", "" : "$8.25"},
{"item" : "spinach pizza", "price" : "$10.50"}]
}
{
"name" : "Bakersfield Tacos",
"cuisine" : "Mexican",
"location_city" : ["Pittsburgh"],
"menu" : [{"item" : "veggie taco", "price" : "$9.95"},
{"item" : "steak taco", "price" : "$10.95"}]
}
=> CREATE TABLE rest (name VARCHAR, cuisine VARCHAR, location_city LONG VARBINARY, menu LONG VARBINARY);
=> COPY rest FROM '/data/rest1.json'
PARSER fjsonparser(flatten_maps=false, reject_on_empty_key=true);
Rows Loaded
------------
1
(1 row)
=> SELECT maptostring(location_city), maptostring(menu) FROM rest;
maptostring | maptostring
---------------------------+-------------------------------------------------------
{
"0": "Pittsburgh"
} | {
"0": {
"item": "veggie taco",
"price": "$9.95"
},
"1": {
"item": "steak taco",
"price": "$10.95"
}
}
(1 row)
To instead load partial data, use omit_empty_keys
to bypass the missing keys while loading everything else:
=> COPY rest FROM '/data/rest1.json'
PARSER fjsonparser(flatten_maps=false, omit_empty_keys=true);
Rows Loaded
-------------
2
(1 row)
=> SELECT maptostring(location_city), maptostring(menu) from rest;
maptostring | maptostring
-------------------------------------------------+---------------------------------
{
"0": "Pittsburgh"
} | {
"0": {
"item": "veggie taco",
"price": "$9.95"
},
"1": {
"item": "steak taco",
"price": "$10.95"
}
}
{
"0": "Cambridge",
"1": "Pittsburgh"
} | {
"0": {
"item": "cheese pizza"
},
"1": {
"item": "spinach pizza",
"price": "$10.50"
}
}
(2 rows)
To instead load this data with strong typing, define the complex types in the table:
=> CREATE EXTERNAL TABLE restaurants
(name VARCHAR, cuisine VARCHAR,
location_city ARRAY[VARCHAR(80)],
menu ARRAY[ ROW(item VARCHAR(80), price FLOAT) ]
)
AS COPY FROM '/data/rest.json' PARSER fjsonparser();
=> SELECT * FROM restaurants;
name | cuisine | location_city | \
menu
-------------------+---------+----------------------------+--------------------\
--------------------------------------------------------
Bob's pizzeria | Italian | ["Cambridge","Pittsburgh"] | [{"item":"cheese pi\
zza","price":0.0},{"item":"spinach pizza","price":0.0}]
Bakersfield Tacos | Mexican | ["Pittsburgh"] | [{"item":"veggie ta\
co","price":0.0},{"item":"steak taco","price":0.0}]
(2 rows)
In the following example, the data contains two new fields. One is a top-level field (a new column), and the other is a new field on an existing struct. The new fields are recorded in the UDX_EVENTS system table:
=> COPY rest FROM '/data/rest2.json' PARSER FJSONPARSER();
WARNING 10596: Warning in UDx call in user-defined object [FJSONParser], code: 0, message:
Data source contained keys which did not match table schema
HINT: SELECT key, sum(num_instances) FROM v_monitor.udx_events WHERE event_type = 'UNMATCHED_KEY' GROUP BY key
Rows Loaded
-------------
2
(1 row)
=> SELECT key, SUM(num_instances) FROM v_monitor.UDX_EVENTS
WHERE event_type = 'UNMATCHED_KEY' GROUP BY key;
key | SUM
------------------------+-----
chain | 1
menu.elements.calories | 7
(2 rows)
For other examples, see JSON data.
8 - FREGEXPARSER
Parses a regular expression, matching columns to the contents of the named regular expression groups.
This parser is for use in Flex tables only. All flex parsers store the data as a single VMap in the LONG VARBINAR_raw__
column. If a data row is too large to fit in the column, it is rejected. Vertica supports null values for loading data with NULL-specified columns.
Syntax
FREGEXPARSER ( pattern=[parameter-name='value'[,...]] )
Parameters
pattern
- Specifies the regular expression of data to match.
Default: Empty string (
""
) use_jit
- Boolean, specifies whether to use just-in-time compiling when parsing the regular expression.
Default: false
record_terminator
- Specifies the character used to separate input records.
Default:
\n
logline_column
- A string that captures the destination column containing the full string that the regular expression matched.
Default: Empty string (
""
)
Example
These examples use the following regular expression, which searches for information that includes the timestamp
, date
, thread_name
, and thread_id
strings.
Caution
For display purposes, this sample regular expression adds new line characters to split long lines of text. To use this expression in a query, first copy and edit the example to remove any new line characters.This example expression loads any thread_id
hex value, regardless of whether it has a 0x
prefix, (<thread_id>(?:0x)?[0-9a-f]+)
.
'^(?<time>\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d\.\d+)
(?<thread_name>[A-Za-z ]+):(?<thread_id>(?:0x)?[0-9a-f]+)
-?(?<transaction_id>[0-9a-f])?(?:[(?<component>\w+)]
\<(?<level>\w+)\> )?(?:<(?<elevel>\w+)> @[?(?<enode>\w+)]?: )
?(?<text>.*)'
-
Create a flex table (
vlog
) to contain the results of a Vertica log file. For this example, we made a copy of a log file in the directory/home/dbadmin/data/vertica.log
:=> CREATE FLEX TABLE vlog1(); CREATE TABLE
-
Use the
fregexparser
with the sample regular expression to load data from the log file. Be sure to remove any line characters before using this expression shown here:=> COPY vlog1 FROM '/home/dbadmin/tempdat/KMvertica.log' PARSER FREGEXPARSER(pattern=
'^(?<time>\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d\.\d+) (?<thread_name>[A-Za-z ]+):(?<thread_id>(?:0x)?[0-9a-f]+) -?(?<transaction_id>[0-9a-f])?(?:[(?<component>\w+)] \<(?<level>\w+)\> )?(?:<(?<elevel>\w+)> @[?(?<enode>\w+)]?: ) ?(?<text>.*)'
-
After successfully loading data, use the MAPTOSTRING() function with the table's
__raw__
column. The four rows (limt 4
) that the query returns are regular expression results of theKMvertica.log
file, parsed withfregexparser
. The output showsthread_id
values with a preceding0x
or without:=> SELECT maptostring(__raw__) FROM vlog1 LIMIT 4; maptostring ------------------------------------------------------------------------------------- { "text" : " [Init] <INFO> Log /home/dbadmin/VMart/v_vmart_node0001_catalog/vertica.log opened; #2", "thread_id" : "0x7f2157e287c0", "thread_name" : "Main", "time" : "2017-03-21 23:30:01.704" } { "text" : " [Init] <INFO> Processing command line: /opt/vertica/bin/vertica -D /home/dbadmin/VMart/v_vmart_node0001_catalog -C VMart -n v_vmart_node0001 -h 10.20.100.247 -p 5433 -P 4803 -Y ipv4", "thread_id" : "0x7f2157e287c0", "thread_name" : "Main", "time" : "2017-03-21 23:30:01.704" } { "text" : " [Init] <INFO> Starting up Vertica Analytic Database v8.1.1-20170321", "thread_id" : "7f2157e287c0", "thread_name" : "Main", "time" : "2017-03-21 23:30:01.704" } { "text" : " [Init] <INFO> Compiler Version: 4.8.2 20140120 (Red Hat 4.8.2-15)", "thread_id" : "7f2157e287c0", "thread_name" : "Main", "time" : "2017-03-21 23:30:01.704" } (4 rows)
See also
9 - ORC
Use the ORC clause with the COPY statement to load data in the ORC format. When loading data into Vertica, you can read all primitive types, UUIDs, and complex types.
By default, the ORC parser uses strong schema matching, meaning that columns in the data must exactly match the columns in the table using the data. You can optionally use Loose (soft) Schema Matching.
If the table definition includes columns of primitive types and those columns are not in the data, the parser fills those columns with NULL. If the table definition includes columns of complex types, those columns must be present in the data.
This parser does not support apportioned load or cooperative parse. However, it enables parallel loading by dividing loads into individual sources, such as row groups or stripes, which are processed concurrently across multiple Vertica nodes using multi-threading. The number of threads is automatically determined based on the available memory in the resource pool.
Syntax
ORC ( [ parameter=value[,...] ] )
Parameters
All parameters are optional.
hive_partition_cols
- Comma-separated list of columns that are partition columns in the data.
Deprecated
Instead, use COPY PARTITION COLUMNS. See Partitioned data. If you use both this parameter and PARTITION COLUMNS, COPY ignores the parameter. allow_no_match
- Whether to accept a path containing a glob with no matching files and report zero rows in query results. If this parameter is not set, Vertica returns an error if the path in the FROM clause does not match at least one file.
do_soft_schema_match_by_name
- Whether to enable loose schema matching (true) instead of the strict matching based on column order in the table definition and ORC file (false, default). See Loose Schema Matching for more information.
Default: false.
reject_on_materialized_type_error
- Boolean, applies only if
do_soft_schema_match_by_name
is true. Specifies what to do when loose schema matching is being used and a value cannot be coerced from the data to the target column type. A value of true means to reject the row; a value of false means to use NULL for the value or, for strings that are too long, truncate. See the table of type coercions for coercible type mappings.Default: true.
Loose schema matching
By default, the ORC parser uses strong schema matching. This means that all columns in the ORC data must be loaded, and they must be loaded in the same order as in the data. However, there are times when you only want to pull certain columns, or you want to be able to accommodate future changes in the ORC schema. This is called loose (or soft) schema matching.
Use the do_soft_schema_match_by_name
parameter to enable loose schema matching. This setting has the following effects:
-
Columns in the data are matched to columns in the table by their names. Names must exactly match but are case-insensitive.
-
Columns that exist in the ORC data but are not part of the table definition are ignored.
-
Columns that exist in the table definition but not the ORC data are filled with NULL.
-
If the same case-insensitive column name occurs more than once in the ORC data, the parser uses the last one. (This situation can arise when using data written by tools that are case-sensitive.)
-
Column types do not need to exactly match, so long as the data type in the ORC file can be coerced to the type used by the table.
While the Parquet parser applies loose schema matching to both column and field names, the ORC parser applies it only to column names.
Data types
This parser can read all primitive types, UUIDs, and complex types.
If the total size of an array exceeds the size defined by the target table, the parser rejects the row.
The following mappings are supported with type coercion and loose schema matching:
ORC Physical Type | Coercible to Vertica Data Type |
---|---|
BOOLEAN | BOOLEAN |
BYTE, SHORT, INT, LONG | INT |
FLOAT, DOUBLE | FLOAT |
DECIMAL | NUMERIC |
DATE | DATE |
TIMESTAMP | TIMESTAMP, TIMESTAMPTZ |
STRING, CHAR, VARCHAR | CHAR, VARCHAR, LONG VARCHAR |
BINARY | BINARY, VARBINARY, LONG VARBINARY |
Examples
The ORC clause does not use the PARSER option:
=> CREATE EXTERNAL TABLE orders (...)
AS COPY FROM 's3://DataLake/orders.orc' ORC;
You can read a map column as an array of rows, as in the following example:
=> CREATE EXTERNAL TABLE orders
(orderkey INT,
custkey INT,
prods ARRAY[ROW(key VARCHAR(10), value DECIMAL(12,2))],
orderdate DATE
) AS COPY FROM '...' ORC;
10 - PARQUET
Use the PARQUET
parser with the COPY statement to load data in the Parquet format. When loading data into Vertica you can read all primitive types, UUIDs, and complex types.
By default, the Parquet parser uses strong schema matching, meaning that columns in the data must exactly match the columns in the table using the data. You can optionally use Loose Schema Matching.
When loading Parquet data, Vertica caches the Parquet metadata to improve efficiency. This cache uses local TEMP storage and is not used if TEMP is remote. See the ParquetMetadataCacheSizeMB configuration parameter to change the size of the cache.
This parser does not support apportioned load or cooperative parse. However, it enables parallel loading by dividing loads into individual sources, such as row groups or stripes, which are processed concurrently across multiple Vertica nodes using multi-threading. The number of threads is automatically determined based on the available memory in the resource pool.
Syntax
PARQUET ( [ parameter=value[,...] ] )
Parameters
All parameters are optional.
hive_partition_cols
- Comma-separated list of columns that are partition columns in the data.
Deprecated
Instead, use COPY PARTITION COLUMNS. See Partitioned data. If you use both this parameter and PARTITION COLUMNS, COPY ignores the parameter. allow_no_match
- Boolean. Whether to accept a path containing a glob with no matching files and report zero rows in query results. If this parameter is not set, Vertica returns an error if the path in the FROM clause does not match at least one file.
allow_long_varbinary_match_complex_type
- Boolean. Whether to enable flexible column types (see Flexible complex types). If true, the Parquet parser allows a complex type in the data to match a table column defined as LONG VARBINARY. If false, the Parquet parser requires strong typing of complex types. With the parameter set you can still use strong typing. Set this parameter to false if you want use of flexible columns to be treated as an error.
do_soft_schema_match_by_name
- Whether to enable loose schema matching (true) instead of the strict matching based on column order in the table definition and parquet file (false, default). See Loose Schema Matching for more information.
Default: false.
reject_on_materialized_type_error
- Boolean, applies only if
do_soft_schema_match_by_name
is true. Specifies what to do when loose schema matching is being used and a value cannot be coerced from the data to the target column type. A value of true means to reject the row; a value of false means to use NULL for the value or, for strings that are too long, truncate. See the table of type coercions for coercible type mappings.Default: true.
Loose schema matching
By default, the Parquet parser uses strong schema matching. This means that all columns and struct fields in the Parquet data must be loaded, and they must be loaded in the same order as in the data. However, there are times when you only want to pull certain columns or fields, or you want to be able to accommodate future changes in the Parquet schema. This is called loose (or soft) schema matching.
Use the do_soft_schema_match_by_name
parameter to enable loose schema matching. This setting has the following effects:
-
Columns or struct fields in the data are matched to columns or fields in the table by their names. Names must exactly match but are case-insensitive.
-
Columns or fields that exist in the Parquet data but are not part of the table definition are ignored.
-
Columns or fields that exist in the table definition but not the Parquet data are filled with NULL. The parser logs an UNMATCHED_TABLE_COLUMNS_PARQUETPARSER event in QUERY_EVENTS.
-
If the same case-insensitive column name occurs more than once in the Parquet data, the parser uses the last one. (This situation can arise when using data written by tools that are case-sensitive.)
-
Column and field types do not need to exactly match, so long as the data type in the Parquet file can be coerced to the type used by the table. If a type cannot be coerced, the parser logs a TYPE_MISMATCH_COLUMNS_PARQUETPARSER event in QUERY_EVENTS. If
reject_on_materialized_type_error
is true then the parser rejects the row. If it is false, the parser uses NULL or, for string values that are too long, truncates the value.
Data types
The Parquet parser maps Parquet data types to Vertica data types as follows.
Parquet Logical Type | Vertica Data Type |
---|---|
StringLogicalType | VARCHAR |
MapLogicalType | ARRAY[ROW] |
ListLogicalType | ARRAY/SET |
IntLogicalType | INT/NUMERIC |
DecimalLogicalType | NUMERIC |
DateLogicalType | DATE |
TimeLogicalType | TIME |
TimestampLogicalType | TIMESTAMP |
UUIDLogicalType | UUID |
Decimal precision must be <= 153.
If the total size of an array exceeds the size defined by the target table, the parser rejects the row.
The following logical types are not supported:
- EnumLogicalType
- IntervalLogicalType (Intervals exported using EXPORT TO PARQUET do not use this logical type)
- JSONLogicalType
- BSONLogicalType
- UnknownLogicalType
The Parquet parser supports the following mappings of physical types:
Parquet Physical Type | Vertica Data Type |
---|---|
BOOLEAN | BOOLEAN |
INT32/INT64 | INT |
INT64 (a number of microseconds) | INTERVAL |
INT96 | Supported only for TIMESTAMP |
FLOAT | DOUBLE |
DOUBLE | DOUBLE |
BYTE_ARRAY | VARBINARY |
FIXED_LEN_BYTE_ARRAY | BINARY |
The following mappings are supported with type coercion and loose schema matching.
Parquet Physical Type | Coercible to Vertica Data Type |
---|---|
BOOLEAN | BOOLEAN |
INT32, INT64, BOOLEAN | INT |
FLOAT, DOUBLE | DOUBLE |
INT32, INT96 | DATE |
INT64, INT96 | TIMESTAMP, TIMESTAMPTZ |
INT64 If precision > 0: INT32, BYTE_ARRAY, FIXED_LEN_BYTE_ARRAY |
Numeric |
BYTE_ARRAY | CHAR, VARCHAR, LONG VARCHAR, BINARY, VARBINARY, LONG VARBINARY |
FIXED_LEN_BYTE_ARRAY | UUID |
Vertica supports only 3-level-encoded arrays, not 2-level-encoded.
Examples
The PARQUET clause does not use the PARSER option:
=> COPY sales FROM 's3://DataLake/sales.parquet' PARQUET;
In the following example, the data directory contains no files:
=> CREATE EXTERNAL TABLE customers (...)
AS COPY FROM 'webhdfs:///data/*.parquet' PARQUET;
=> SELECT COUNT(*) FROM customers;
ERROR 7869: No files match when expanding glob: [webhdfs:///data/*.parquet]
To read zero rows instead of producing an error, use the allow_no_match
parameter:
=> CREATE EXTERNAL TABLE customers (...)
AS COPY FROM 'webhdfs:///data/*.parquet'
PARQUET(allow_no_match='true');
=> SELECT COUNT(*) FROM customers;
count
-------
0
(1 row)
To allow reading a complex type (menu, in this example) as a flexible column type, use the allow_long_varbinary_match_complex_type
parameter:
=> CREATE EXTERNAL TABLE restaurants
(name VARCHAR, cuisine VARCHAR, location_city ARRAY[VARCHAR], menu LONG VARBINARY)
AS COPY FROM '/data/rest*.parquet'
PARQUET(allow_long_varbinary_match_complex_type='True');
To read only some columns from the restaurant data, use loose schema matching:
=> CREATE EXTERNAL TABLE restaurants(name VARCHAR, cuisine VARCHAR)
AS COPY FROM '/data/rest*.parquet'
PARQUET(allow_long_varbinary_match_complex_type='True',
do_soft_schema_match_by_name='True');
=> SELECT * from restaurant;
name | cuisine
-------------------+----------
Bob's pizzeria | Italian
Bakersfield Tacos | Mexican
(2 rows)