INFER_TABLE_DDL

Inspects a file in Parquet, ORC, or Avro format and returns a CREATE TABLE or CREATE EXTERNAL TABLE statement based on its contents.

Inspects a file in Parquet, ORC, or Avro format and returns a CREATE TABLE or CREATE EXTERNAL TABLE statement based on its contents. If you use a glob to specify more than one file, the function inspects only one.

The returned statement might be incomplete if the file contains ambiguous or unknown data types. It could also contain more columns or columns with longer names than what Vertica supports; this function does not enforce Vertica system limits. Always inspect the output and address any issues before using it to create a table.

This function supports partition columns, inferred from the input path, for the ORC and Parquet formats only. Because partitioning is done through the directory structure, there is insufficient information to infer the type of partition columns. This function shows these columns with a data type of UNKNOWN and emits a warning. Partition columns apply only to external tables.

The function handles most data types, including complex types. If an input type is not supported in Vertica, the function emits a warning.

For VARCHAR and VARBINARY columns, this function does not specify a length. The Vertica default length for these types is 80 bytes. If the data values are longer, using the returned table definition unmodified could cause data to be truncated. Always review VARCHAR and VARBINARY columns to determine if you need to specify a length. This function emits a warning if the input file contains columns of these types:

WARNING 9311: This generated statement contains one or more varchar/varbinary columns which default to length 80

This is a meta-function. You must call meta-functions in a top-level SELECT statement.

Behavior type

Volatile

Syntax

INFER_TABLE_DDL( path USING PARAMETERS param=value[,...] )

Arguments

path
Path to a file or glob. Any path that is valid for COPY and uses a file format supported by this function is valid. If a glob specifies more than one file, this function reads a single file.

Parameters

format
Input format (string), one of 'Parquet', 'ORC', or 'Avro'. This parameter is required.
table_name
The name of the table to create. This parameter is required.

Do not include a schema name as part of the table name; use the table_schema parameter.

table_schema
The schema in which to create the table. If omitted, the function does not include a schema in the output.
table_type
The type of table to create, either 'native' (the default) or 'external'.
with_copy_statement
For native tables, whether to include a COPY statement in addition to the CREATE TABLE statement. The default is false.

Privileges

Non-superuser: READ privileges on the USER-accessible storage location.

Examples

In the following example, the input path contains data for a table with two integer columns. The external table definition can be fully inferred, and you can use the returned SQL statement as-is. The function reads one file from the input path:

=> SELECT INFER_TABLE_DDL('/data/orders/*.orc'
        USING PARAMETERS format = 'orc', table_name = 'orders', table_type = 'external');

                INFER_TABLE_DDL
-----------------------------------------------------------------------------------
create external table "orders" (
  "id" int,
  "quantity" int
) as copy from '/data/orders/*.orc' orc;
(1 row)

To create a table in a schema, use the table_schema parameter. Do not add it to the table name; the function treats it as a name with a period in it, not a schema.

The following example shows output with complex types. You can use the definition as-is or modify the VARCHAR sizes:

=> SELECT INFER_TABLE_DDL('/data/people/*.parquet'
        USING PARAMETERS format = 'parquet', table_name = 'employees');
WARNING 9311:  This generated statement contains one or more varchar/varbinary columns which default to length 80
                    INFER_TABLE_DDL
-------------------------------------------------------------------------
 create table "employees"(
  "employeeID" int,
  "personal" Row(
    "name" varchar,
    "address" Row(
      "street" varchar,
      "city" varchar,
      "zipcode" int
    ),
    "taxID" int
  ),
  "department" varchar
 );
(1 row)

In the following example, the input file contains a map in the "prods" column. You can read a map as an array of rows:

=> SELECT INFER_TABLE_DDL('/data/orders.parquet'
    USING PARAMETERS format='parquet', table_name='orders');
WARNING 9311:  This generated statement contains one or more varchar/varbinary columns which default to length 80
                INFER_TABLE_DDL
------------------------------------------------------------------------
 create table "orders"(
  "orderkey" int,
  "custkey" int,
  "prods" Array[Row(
    "key" varchar,
    "value" numeric(12,2)
  )],
  "orderdate" date
 );
(1 row)

The following example returns the definition of a native table and the COPY statement:

=> SELECT INFER_TABLE_DDL('/data/orders/*.orc'
        USING PARAMETERS format = 'orc', table_name = 'orders',
                         table_type = 'native', with_copy_statement = true);

                INFER_TABLE_DDL
--------------------------------------------------------------------------------------------------
create table "orders" (
  "id" int,
  "quantity" int
);
copy "orders" from '/data/orders/*.orc' orc;

In the following example, the data contains one materialized column and two partition columns. The date and region columns are in the data in addition to being partition columns, and so the table definition shows them twice. Partition columns are always of unknown type:

=> SELECT INFER_TABLE_DDL('/data/sales/*/*/*
        USING PARAMETERS format = 'parquet', table_name = 'sales', table_type = 'external');
WARNING 9262: This generated statement is incomplete because of one or more unknown column types.
Fix these data types before creating the table
                INFER_TABLE_DDL
------------------------------------------------------------------------
 create external table "sales"(
  "tx_id" int,
  "date" UNKNOWN,
  "region" UNKNOWN
) as copy from '/data/sales/*/*/*' parquet(hive_partition_cols='date,region');
(1 row)