Schema evolution

When you load data, Vertica must be able to match the columns or fields in the data to the columns defined in the table. Typically, you use strong typing to specify the table columns, and parsers reject data that does not match the table definition. Some parsers cause the entire load to fail if the data does not match, while others load what they can and produce warnings about unhandled data.

For long-lived databases or more volatile domains, new columns and fields can appear in data, disrupting existing ETL pipelines. Some parsers provide ways to handle differences between the data and the table definition.

Parquet and ORC

By default, the Parquet and ORC parsers use strong schema matching. This means that the load must consume all of the columns in the data, in the order in which they appear in the data. Unmatched columns produce load errors:

=> CREATE TABLE orders_summary(orderid INT, accountid INT);

=> COPY orders_summary FROM '/data/orders.parquet' PARQUET;
ERROR 9135:  Attempt to load 2 columns from a parquet source [/data/orders.parquet] that has 3 columns

To pick out only the specific columns you want, use loose schema matching by setting the do_soft_schema_match_by_name parameter in the PARQUET or ORC parser. With loose schema matching, columns in the data are matched to columns in the table by name. Columns in the data that are not part of the table definition are ignored:

=> COPY orders_summary FROM '/data/orders.parquet'
   PARQUET(do_soft_schema_match_by_name='true');
 Rows Loaded
-------------
           3
(1 row)

Alternatively, you can use ALTER TABLE to add new columns to the table definition and continue to use strong schema matching.
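For example, if the third column in orders.parquet holds an order total, you could add a matching column and reload with strong schema matching (a sketch; the column name and type are assumptions, not taken from the example data):

=> ALTER TABLE orders_summary ADD COLUMN ordervalue NUMERIC(12,2);

=> COPY orders_summary FROM '/data/orders.parquet' PARQUET;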

JSON and Avro

By default, the JSON and Avro parsers skip fields in the data that are not part of the table definition and proceed with the rest of the load. This behavior is in contrast to the Parquet and ORC parsers, which by default require that a table consume all of the columns in the data. This means that the JSON and Avro parsers are more flexible in the face of variability in the data, but also that you must monitor your loads more closely for new fields.
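For example, assuming a hypothetical orders.json whose records contain fields that are not in an illustrative table, the default JSON parser skips the unmatched fields and loads the rest:

=> CREATE TABLE orders_json(orderid INT, accountid INT);

-- Fields in the data that do not match the table definition are skipped;
-- the parser warns about them and loads the remaining fields.
=> COPY orders_json FROM '/data/orders.json' PARSER FJSONPARSER();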

When creating tables that will use JSON or Avro data, a good first step is to infer a table definition from a sample data file using INFER_TABLE_DDL. You can use the results to define a table. If later data files add more fields, the parser emits a warning and logs information about the new fields in the UDX_EVENTS system table. You can use this information to decide whether to alter the table definition or ignore the new fields.
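A minimal sketch of that workflow, using an illustrative sample file path; INFER_TABLE_DDL returns a candidate table definition, and UDX_EVENTS can be queried after later loads to review what the parser reported:

-- Infer a candidate definition from a sample file, then use it to create the table.
=> SELECT INFER_TABLE_DDL('/data/orders/sample.json'
       USING PARAMETERS format='json', table_name='orders_json');

-- After later loads, review the events logged by the parser, including new fields.
=> SELECT * FROM UDX_EVENTS;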

If you are only interested in certain fields and don't care about new fields found in data loads, you can use the suppress_warnings parameter on the JSON or Avro parser to ignore them. If you suppress warnings, you can separately check for new fields from time to time with INFER_TABLE_DDL or by loading sample data with warnings enabled and without committing (COPY...NO COMMIT).
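A sketch of that pattern, reusing the illustrative JSON paths from above; routine loads suppress the warnings, and an occasional uncommitted load of a sample file surfaces any new fields:

-- Routine loads: ignore warnings about fields that are not in the table.
=> COPY orders_json FROM '/data/orders.json'
   PARSER FJSONPARSER(suppress_warnings=true);

-- Periodic check: load a sample with warnings enabled but without committing,
-- then roll back so the check leaves no data behind.
=> COPY orders_json FROM '/data/sample.json'
   PARSER FJSONPARSER() NO COMMIT;
=> ROLLBACK;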

For a detailed discussion of this workflow, see Schema changes in Data exploration.