PARQUET
Use the PARQUET
parser with the COPY statement to load data in the Parquet format. When loading data into Vertica you can read all primitive types, UUIDs, and complex types.
By default, the Parquet parser uses strong schema matching, meaning that columns in the data must exactly match the columns in the table using the data. You can optionally use Loose Schema Matching.
When loading Parquet data, Vertica caches the Parquet metadata to improve efficiency. This cache uses local TEMP storage and is not used if TEMP is remote. See the ParquetMetadataCacheSizeMB configuration parameter to change the size of the cache.
This parser does not support apportioned load or cooperative parse. However, it enables parallel loading by dividing loads into individual sources, such as row groups or stripes, which are processed concurrently across multiple Vertica nodes using multi-threading. The number of threads is automatically determined based on the available memory in the resource pool.
Syntax
PARQUET ( [ parameter=value[,...] ] )
Parameters
All parameters are optional.
hive_partition_cols
- Comma-separated list of columns that are partition columns in the data.
Deprecated
Instead, use COPY PARTITION COLUMNS. See Partitioned data. If you use both this parameter and PARTITION COLUMNS, COPY ignores the parameter. allow_no_match
- Boolean. Whether to accept a path containing a glob with no matching files and report zero rows in query results. If this parameter is not set, Vertica returns an error if the path in the FROM clause does not match at least one file.
allow_long_varbinary_match_complex_type
- Boolean. Whether to enable flexible column types (see Flexible complex types). If true, the Parquet parser allows a complex type in the data to match a table column defined as LONG VARBINARY. If false, the Parquet parser requires strong typing of complex types. With the parameter set you can still use strong typing. Set this parameter to false if you want use of flexible columns to be treated as an error.
do_soft_schema_match_by_name
- Whether to enable loose schema matching (true) instead of the strict matching based on column order in the table definition and parquet file (false, default). See Loose Schema Matching for more information.
Default: false.
reject_on_materialized_type_error
- Boolean, applies only if
do_soft_schema_match_by_name
is true. Specifies what to do when loose schema matching is being used and a value cannot be coerced from the data to the target column type. A value of true means to reject the row; a value of false means to use NULL for the value or, for strings that are too long, truncate. See the table of type coercions for coercible type mappings.Default: true.
Loose schema matching
By default, the Parquet parser uses strong schema matching. This means that all columns and struct fields in the Parquet data must be loaded, and they must be loaded in the same order as in the data. However, there are times when you only want to pull certain columns or fields, or you want to be able to accommodate future changes in the Parquet schema. This is called loose (or soft) schema matching.
Use the do_soft_schema_match_by_name
parameter to enable loose schema matching. This setting has the following effects:
-
Columns or struct fields in the data are matched to columns or fields in the table by their names. Names must exactly match but are case-insensitive.
-
Columns or fields that exist in the Parquet data but are not part of the table definition are ignored.
-
Columns or fields that exist in the table definition but not the Parquet data are filled with NULL. The parser logs an UNMATCHED_TABLE_COLUMNS_PARQUETPARSER event in QUERY_EVENTS.
-
If the same case-insensitive column name occurs more than once in the Parquet data, the parser uses the last one. (This situation can arise when using data written by tools that are case-sensitive.)
-
Column and field types do not need to exactly match, so long as the data type in the Parquet file can be coerced to the type used by the table. If a type cannot be coerced, the parser logs a TYPE_MISMATCH_COLUMNS_PARQUETPARSER event in QUERY_EVENTS. If
reject_on_materialized_type_error
is true then the parser rejects the row. If it is false, the parser uses NULL or, for string values that are too long, truncates the value.
Data types
The Parquet parser maps Parquet data types to Vertica data types as follows.
Parquet Logical Type | Vertica Data Type |
---|---|
StringLogicalType | VARCHAR |
MapLogicalType | ARRAY[ROW] |
ListLogicalType | ARRAY/SET |
IntLogicalType | INT/NUMERIC |
DecimalLogicalType | NUMERIC |
DateLogicalType | DATE |
TimeLogicalType | TIME |
TimestampLogicalType | TIMESTAMP |
UUIDLogicalType | UUID |
Decimal precision must be <= 153.
If the total size of an array exceeds the size defined by the target table, the parser rejects the row.
The following logical types are not supported:
- EnumLogicalType
- IntervalLogicalType (Intervals exported using EXPORT TO PARQUET do not use this logical type)
- JSONLogicalType
- BSONLogicalType
- UnknownLogicalType
The Parquet parser supports the following mappings of physical types:
Parquet Physical Type | Vertica Data Type |
---|---|
BOOLEAN | BOOLEAN |
INT32/INT64 | INT |
INT64 (a number of microseconds) | INTERVAL |
INT96 | Supported only for TIMESTAMP |
FLOAT | DOUBLE |
DOUBLE | DOUBLE |
BYTE_ARRAY | VARBINARY |
FIXED_LEN_BYTE_ARRAY | BINARY |
The following mappings are supported with type coercion and loose schema matching.
Parquet Physical Type | Coercible to Vertica Data Type |
---|---|
BOOLEAN | BOOLEAN |
INT32, INT64, BOOLEAN | INT |
FLOAT, DOUBLE | DOUBLE |
INT32, INT96 | DATE |
INT64, INT96 | TIMESTAMP, TIMESTAMPTZ |
INT64 If precision > 0: INT32, BYTE_ARRAY, FIXED_LEN_BYTE_ARRAY |
Numeric |
BYTE_ARRAY | CHAR, VARCHAR, LONG VARCHAR, BINARY, VARBINARY, LONG VARBINARY |
FIXED_LEN_BYTE_ARRAY | UUID |
Vertica supports only 3-level-encoded arrays, not 2-level-encoded.
Examples
The PARQUET clause does not use the PARSER option:
=> COPY sales FROM 's3://DataLake/sales.parquet' PARQUET;
In the following example, the data directory contains no files:
=> CREATE EXTERNAL TABLE customers (...)
AS COPY FROM 'webhdfs:///data/*.parquet' PARQUET;
=> SELECT COUNT(*) FROM customers;
ERROR 7869: No files match when expanding glob: [webhdfs:///data/*.parquet]
To read zero rows instead of producing an error, use the allow_no_match
parameter:
=> CREATE EXTERNAL TABLE customers (...)
AS COPY FROM 'webhdfs:///data/*.parquet'
PARQUET(allow_no_match='true');
=> SELECT COUNT(*) FROM customers;
count
-------
0
(1 row)
To allow reading a complex type (menu, in this example) as a flexible column type, use the allow_long_varbinary_match_complex_type
parameter:
=> CREATE EXTERNAL TABLE restaurants
(name VARCHAR, cuisine VARCHAR, location_city ARRAY[VARCHAR], menu LONG VARBINARY)
AS COPY FROM '/data/rest*.parquet'
PARQUET(allow_long_varbinary_match_complex_type='True');
To read only some columns from the restaurant data, use loose schema matching:
=> CREATE EXTERNAL TABLE restaurants(name VARCHAR, cuisine VARCHAR)
AS COPY FROM '/data/rest*.parquet'
PARQUET(allow_long_varbinary_match_complex_type='True',
do_soft_schema_match_by_name='True');
=> SELECT * from restaurant;
name | cuisine
-------------------+----------
Bob's pizzeria | Italian
Bakersfield Tacos | Mexican
(2 rows)