Schema inference

If your data is in Parquet, ORC, JSON, or Avro format, you can use a Vertica function to inspect a sample file and automatically generate a "first draft" of a table definition. You can then refine this definition, such as by specifying VARCHAR lengths, NUMERIC precision, and ARRAY bounds. If the data format supports it, you can ignore columns in the data that you do not need for your table.

When you have an initial table definition, you can load more data. New data might have different properties, so it's important to adjust your table definition in light of more data. Test the table definition with enough data to feel confident before putting it into production.

You might find it easier to do initial exploration with external tables, whose definitions are more easily modified.
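
For example, a minimal sketch of an external table over the same sample file used later in this section (the column list here is illustrative, not a complete definition):

=> CREATE EXTERNAL TABLE products_ext(
      id INT,
      name VARCHAR(32)
   ) AS COPY FROM '/data/products/products.parquet' PARQUET;

Because an external table reads its data in place, you can drop and recreate it repeatedly without reloading anything.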

Create first draft of table

The INFER_TABLE_DDL function inspects a data file and returns a CREATE TABLE or CREATE EXTERNAL TABLE statement based on its contents. The column definitions in the returned statement can be incomplete, and sometimes the function cannot determine a column's data type. The function also uses the column names as they appear in the data, even if they violate Vertica restrictions. For all of these reasons, you must review and adjust the proposed definition.

In the following example, the input path contains data for a product catalog. Note the warnings about data types:

=> SELECT INFER_TABLE_DDL('/data/products/products.parquet'
    USING PARAMETERS format = 'parquet', table_name = 'products');

WARNING 9311:  This generated statement contains one or more varchar/varbinary columns which default to length 80

     INFER_TABLE_DDL
------------------------------------------------------------------
 create table "products"(
  "id" int,
  "msrp" numeric(6,2),
  "name" varchar,
  "description" varchar,
  "vendors" Array[Row(
    "name" varchar,
    "price" numeric(6,2)
  )]
);
(1 row)

The table definition contains several VARCHAR values with unspecified lengths. The Vertica default length for VARCHAR is 80, and for a native table, Vertica allocates memory based on the maximum length. If a value is unlikely to be that large, such as for a name, you can specify a smaller size to improve query performance. Conversely, if a value could be much longer, such as the description field here, you can specify a larger size so that Vertica does not truncate the value.
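
One way to choose a length, assuming you have already loaded a sample into a scratch table, is to measure the longest value and add headroom. Note that VARCHAR lengths in Vertica are in bytes, so multi-byte UTF-8 characters count more than once:

=> SELECT MAX(OCTET_LENGTH(description)) FROM products;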

For numeric values, INFER_TABLE_DDL suggests precision and scale based on the data in the sample file. As with VARCHAR, values that are too high cause more memory to be allocated. In addition, computations could produce unexpected results by using the extra precision instead of rounding.
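
For reference, NUMERIC(6,2) reserves two digits of scale and four integer digits, so 9999.99 is the largest value it can represent; anything larger is out of range:

=> SELECT 9999.99::NUMERIC(6,2);  -- largest value that fits in NUMERIC(6,2)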

Arrays, by default, are unbounded, and Vertica allocates memory to accommodate the maximum binary size. To reduce this footprint, you can specify a maximum number of elements for an array or its maximum binary size. For more information about array bounds, see Limits on Element Count and Collection Size on the ARRAY reference page.
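
For example, a sketch of a bounded array column (the table name and the bound of 10 are arbitrary choices for illustration):

=> CREATE TABLE bounded_example(
      tags ARRAY[VARCHAR(32), 10]  -- at most 10 elements per array
   );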

In this example, you might make the following changes:

  • Shorten the product name and vendor name fields, for example VARCHAR(32).

  • Lengthen the description field, for example VARCHAR(1000).

  • Accept the suggested numeric precision and scale.

  • Specify a maximum number of elements for the vendors array.

With these changes, the table definition becomes:

=> CREATE TABLE products(
      id INT,
      msrp NUMERIC(6,2),
      name VARCHAR(32),
      description VARCHAR(1000),
      vendors ARRAY[ROW(name VARCHAR(32), price NUMERIC(6,2)), 16]
      );

As a sanity check, start by loading the data file you used for INFER_TABLE_DDL. All rows should load successfully. The following query output includes added line breaks for readability:

=> COPY products FROM '/data/products/products.parquet' PARQUET;
 Rows Loaded
-------------
           1
(1 row)

=> \x
=> SELECT * FROM products;
-[ RECORD 1 ]---------------------------------------------------------------------------------------------------------------------
id          | 1064
msrp        | 329.99
name        | 4k 48in TV
description | something long goes here
vendors     | [{"name":"Amazon","price":"299.99"},{"name":"Best Buy","price":"289.99"},{"name":"Bob's Hardware","price":"309.99"}]

Test with data

INFER_TABLE_DDL bases its recommendations on a small sample of your data. As you load more data, you might find outliers. Before you use your new table in production, test it with more data.

In the following example, a data file contains values that do not satisfy the table definition:

=> COPY products FROM '/data/products/products2.parquet' PARQUET;

WARNING 9738:  Some rows were rejected by the parquet parser
 Rows Loaded
-------------
           2
(1 row)

=> \x
=> SELECT * FROM products;
-[ RECORD 1 ]---------------------------------------------------------------------------------------------------------------------
id          | 1064
msrp        | 329.99
name        | 4k 48in TV
description | something long goes here
vendors     | [{"name":"Amazon","price":"299.99"},{"name":"Best Buy","price":"289.99"},{"name":"Bob's Hardware","price":"309.99"}]
-[ RECORD 2 ]---------------------------------------------------------------------------------------------------------------------
id          | 1271
msrp        | 8999.99
name        | professional kitchen appliance s
description | product description goes here
vendors     | [{"name":"Amazon","price":"8299.99"},{"name":"Designer Solutions","price":"8999.99"}]

If you compare this query output to the previous one, an immediate difference is the width of the name column. The value in the second record uses the full VARCHAR(32) length and appears to be truncated.

By default, COPY truncates VARCHAR values that are too long instead of rejecting them. The warning from the COPY statement indicates that in addition to the name length, there is another problem that caused data not to be loaded.

COPY can record rejected data in a separate table for later inspection. You can also direct COPY to enforce VARCHAR lengths instead of truncating values:

=> COPY products FROM '/data/products/products2.parquet' PARQUET
ENFORCELENGTH REJECTED DATA AS TABLE products_rejected;
 Rows Loaded
-------------
           1
(1 row)

Note that only one row of data is loaded this time instead of two, because ENFORCELENGTH rejects the too-long string.

You can query the rejections table for more information:

=> \x
=> SELECT rejected_reason, rejected_data FROM products_rejected;
-[ RECORD 1 ]---+--------------------------------------------------------------------
rejected_reason | The 34-byte value is too long for type varchar(32), column 3 (name)
rejected_data   | professional kitchen appliance set
-[ RECORD 2 ]---+--------------------------------------------------------------------
rejected_reason | In column 2: Value (11999.99) exceeds range of type NUMERIC(6,2)
rejected_data   | 11999.99

The first rejection is caused by the long string. The second is caused by a price that exceeds the range of NUMERIC(6,2). You can enlarge both columns using ALTER TABLE:

=> ALTER TABLE products ALTER COLUMN name SET DATA TYPE VARCHAR(50);

=> ALTER TABLE products ALTER COLUMN msrp SET DATA TYPE NUMERIC(8,2);

With these changes, the data file now loads without errors:

=> TRUNCATE TABLE products;

=> DROP TABLE products_rejected;

=> COPY products FROM '/data/products/products2.parquet' PARQUET
ENFORCELENGTH REJECTED DATA AS TABLE products_rejected;

 Rows Loaded
-------------
           3
(1 row)

=> \x
=> SELECT * FROM products;
-[ RECORD 1 ]---------------------------------------------------------------------------------------------------------------------
id          | 1064
msrp        | 329.99
name        | 4k 48in TV
description | something long goes here
vendors     | [{"name":"Amazon","price":"299.99"},{"name":"Best Buy","price":"289.99"},{"name":"Bob's Hardware","price":"309.99"}]
-[ RECORD 2 ]---------------------------------------------------------------------------------------------------------------------
id          | 1183
msrp        | 11999.99
name        | home theatre system
description | product description...
vendors     | [{"name":"Amazon","price":"8949.99"},{"name":"Bob's Hardware","price":"9999.99"}]
-[ RECORD 3 ]---------------------------------------------------------------------------------------------------------------------
id          | 1271
msrp        | 8999.99
name        | professional kitchen appliance set
description | product description goes here
vendors     | [{"name":"Amazon","price":"8299.99"},{"name":"Designer Solutions","price":"8999.99"}]

=> SELECT rejected_reason, rejected_data FROM products_rejected;
(No rows)

Continue to load more data using the ENFORCELENGTH and REJECTED DATA options, adjusting your table definition as needed, until the rate of new problems drops to an acceptable level. Most data sets of any size contain some anomalies that will be rejected. You can use the REJECTMAX option to set a limit on rejections per data load; if the limit is exceeded, COPY aborts the load.
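
For example, a sketch of a load that tolerates a limited number of rejections before aborting (the file name and the limit of 100 are hypothetical):

=> COPY products FROM '/data/products/products3.parquet' PARQUET
ENFORCELENGTH
REJECTED DATA AS TABLE products_rejected REJECTMAX 100;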

Data formats

INFER_TABLE_DDL supports data in Parquet, ORC, Avro, and JSON formats. The first three formats enforce a schema on each file, but JSON data can contain any mix of fields and data types. When inspecting JSON data, INFER_TABLE_DDL can report multiple options. The iterative process of loading samples, examining the results, and adjusting the table definition is the same, but some details are different for JSON.
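
For example, a sketch of the same function call against JSON input (the path is hypothetical):

=> SELECT INFER_TABLE_DDL('/data/products/products.json'
    USING PARAMETERS format = 'json', table_name = 'products');

When the sampled JSON records disagree about fields or types, the function can return more than one candidate definition for you to choose from.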

For data in other formats, including delimited text data, you cannot use INFER_TABLE_DDL. You can examine sample data and create your own initial table definition, and then test iteratively. Alternatively, you can use a flex table to explore the data.
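
For example, a minimal flex-table sketch for delimited data (the path is hypothetical):

=> CREATE FLEX TABLE products_flex();
=> COPY products_flex FROM '/data/products/products.csv' PARSER fcsvparser();
=> SELECT COMPUTE_FLEXTABLE_KEYS_AND_BUILD_VIEW('products_flex');

The generated view exposes the discovered keys as columns, which can help you draft a conventional table definition.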