ORC
Use the ORC clause with the COPY statement to load data in the ORC format. When loading data into Vertica, you can read all primitive types, UUIDs, and complex types.
By default, the ORC parser uses strong schema matching, meaning that columns in the data must exactly match the columns in the table using the data. You can optionally use Loose (soft) Schema Matching.
If the table definition includes columns of primitive types and those columns are not in the data, the parser fills those columns with NULL. If the table definition includes columns of complex types, those columns must be present in the data.
This parser does not support apportioned load or cooperative parse. However, it enables parallel loading by dividing loads into individual sources, such as row groups or stripes, which are processed concurrently across multiple Vertica nodes using multi-threading. The number of threads is automatically determined based on the available memory in the resource pool.
Syntax
ORC ( [ parameter=value[,...] ] )
Parameters
All parameters are optional.
hive_partition_cols
- Comma-separated list of columns that are partition columns in the data.
Deprecated
Instead, use COPY PARTITION COLUMNS. See Partitioned data. If you use both this parameter and PARTITION COLUMNS, COPY ignores the parameter. allow_no_match
- Whether to accept a path containing a glob with no matching files and report zero rows in query results. If this parameter is not set, Vertica returns an error if the path in the FROM clause does not match at least one file.
do_soft_schema_match_by_name
- Whether to enable loose schema matching (true) instead of the strict matching based on column order in the table definition and ORC file (false, default). See Loose Schema Matching for more information.
Default: false.
reject_on_materialized_type_error
- Boolean, applies only if
do_soft_schema_match_by_name
is true. Specifies what to do when loose schema matching is being used and a value cannot be coerced from the data to the target column type. A value of true means to reject the row; a value of false means to use NULL for the value or, for strings that are too long, truncate. See the table of type coercions for coercible type mappings.Default: true.
Loose schema matching
By default, the ORC parser uses strong schema matching. This means that all columns in the ORC data must be loaded, and they must be loaded in the same order as in the data. However, there are times when you only want to pull certain columns, or you want to be able to accommodate future changes in the ORC schema. This is called loose (or soft) schema matching.
Use the do_soft_schema_match_by_name
parameter to enable loose schema matching. This setting has the following effects:
-
Columns in the data are matched to columns in the table by their names. Names must exactly match but are case-insensitive.
-
Columns that exist in the ORC data but are not part of the table definition are ignored.
-
Columns that exist in the table definition but not the ORC data are filled with NULL.
-
If the same case-insensitive column name occurs more than once in the ORC data, the parser uses the last one. (This situation can arise when using data written by tools that are case-sensitive.)
-
Column types do not need to exactly match, so long as the data type in the ORC file can be coerced to the type used by the table.
While the Parquet parser applies loose schema matching to both column and field names, the ORC parser applies it only to column names.
Data types
This parser can read all primitive types, UUIDs, and complex types.
If the total size of an array exceeds the size defined by the target table, the parser rejects the row.
The following mappings are supported with type coercion and loose schema matching:
ORC Physical Type | Coercible to Vertica Data Type |
---|---|
BOOLEAN | BOOLEAN |
BYTE, SHORT, INT, LONG | INT |
FLOAT, DOUBLE | FLOAT |
DECIMAL | NUMERIC |
DATE | DATE |
TIMESTAMP | TIMESTAMP, TIMESTAMPTZ |
STRING, CHAR, VARCHAR | CHAR, VARCHAR, LONG VARCHAR |
BINARY | BINARY, VARBINARY, LONG VARBINARY |
Examples
The ORC clause does not use the PARSER option:
=> CREATE EXTERNAL TABLE orders (...)
AS COPY FROM 's3://DataLake/orders.orc' ORC;
You can read a map column as an array of rows, as in the following example:
=> CREATE EXTERNAL TABLE orders
(orderkey INT,
custkey INT,
prods ARRAY[ROW(key VARCHAR(10), value DECIMAL(12,2))],
orderdate DATE
) AS COPY FROM '...' ORC;