Parquet data

Parquet is a column-oriented file format.

Parquet is a column-oriented file format. Vertica has a parser that can take advantage of the columnar layout.

In the COPY statement, specify the parser as follows. Note that this is not the usual PARSER parser-name syntax; COPY supports Parquet directly:


=> COPY tableName FROM path PARQUET[(...)];

The parser takes several optional parameters; see the PARQUET reference page.

Be aware that if you load from multiple files in the same COPY statement, and any of them is aborted, the entire load aborts. This behavior differs from that for delimited files, where the COPY statement loads what it can and ignores the rest.

Schema matching

By default, the Parquet parser uses strong schema matching. This means that the load must consume all columns in the Parquet data in the order they occur in the data. You can, instead, use loose schema matching, which allows you to select the columns and struct fields you want and ignore the rest. Loose schema matching depends on the names of the columns and fields in the data rather than their order, so the names in your table must match those in the data. Types must match or be coercible. For more information on how to use loose schema matching, see Loose Schema Matching on the PARQUET reference page.

Use the do_soft_schema_match_by_name parameter to specify loose schema matching. In the following example, the Parquet data contains more columns than those used in the table.

=> CREATE EXTERNAL TABLE restaurants(name VARCHAR, cuisine VARCHAR)
    AS COPY FROM '/data/rest*.parquet'
    PARQUET(do_soft_schema_match_by_name='True');

=> SELECT * from restaurant;
       name        | cuisine
-------------------+----------
 Bob's pizzeria    | Italian
 Bakersfield Tacos | Mexican
(2 rows)

Metadata caching

Parquet files include metadata that Vertica uses when loading data. To avoid repeatedly fetching this data, particularly from remote sources or where API calls incur financial costs, Vertica caches this metadata on each participating node during the planning phase for use during the execution phase.

Vertica uses TEMP storage for the cache, and only if TEMP storage is on the local file system.

You can limit the size of the cache by setting the ParquetMetadataCacheSizeMB configuration parameter. The default is 4GB.

Supported data types

Vertica can natively read columns of all Hive primitive data types. For a complete list, see HIVE Data Types (specifically the Numeric, Date/Time, String, and Misc lists). Vertica can also load UUIDs and complex types (arrays and structs in any combination).

The data types you specify for COPY or CREATE EXTERNAL TABLE AS COPY must exactly match the types in the data, though Vertica permits implicit casting among compatible numeric types.

For the Parquet format only, you can use flexible complex types instead of fully specifying the schema for complex types. See Flexible complex types.

Timestamps and time zones

To correctly report timestamps, Vertica must know what time zone the data was written in. Hive does not record the writer time zone. Vertica assumes timestamp values were written in the local time zone and reports a warning at query time.

Hive provides an option, when writing Parquet files, to record timestamps in the local time zone. If you are using Parquet files that record times in this way, set the UseLocalTzForParquetTimestampConversion configuration parameter to 0 to disable the conversion done by Vertica. (See General parameters.)