Schema evolution
When you load data, Vertica must be able to match the columns or fields in the data to the columns defined in the table. Typically, you use strong typing to specify the table columns, and parsers reject data that does not match the table definition. Some parsers cause the entire load to fail if the data does not match, and some load what they can and produce warnings about unhandled data.
In long-lived databases, or in more volatile domains, new columns and fields can appear in the data over time, disrupting existing ETL pipelines. Some parsers provide ways to handle differences between the data and the table definition.
Parquet
By default, the Parquet parser uses strong schema matching. This means that the load must consume all columns in the Parquet data, in the order in which they appear in the data. Unmatched columns produce load errors:
=> CREATE TABLE orders_summary(orderid INT, accountid INT);
=> COPY orders_summary FROM '/data/orders.parquet' PARQUET;
ERROR 9135: Attempt to load 2 columns from a parquet source [/data/orders.parquet] that has 3 columns
To load only the specific columns you want, use loose schema matching by setting the do_soft_schema_match_by_name parameter in the PARQUET parser. With loose schema matching, columns in the data are matched to columns in the table by name, and columns in the data that are not part of the table definition are ignored:
=> COPY orders_summary FROM '/data/orders.parquet'
PARQUET(do_soft_schema_match_by_name='true');
Rows Loaded
-------------
3
(1 row)
Alternatively, you can use ALTER TABLE to add new columns to the table definition and continue to use strong schema matching.
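For example, assuming the unmatched third column in orders.parquet is an order total (the column name and type below are hypothetical), you could extend the table definition and retry the load under strong schema matching:
=> ALTER TABLE orders_summary ADD COLUMN ordertotal NUMERIC(12,2);
=> COPY orders_summary FROM '/data/orders.parquet' PARQUET;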
JSON and Avro
By default, the JSON and Avro parsers skip fields in the data that are not part of the table definition and proceed with the rest of the load. This behavior is in contrast to the Parquet and ORC parsers, which by default require that a table consume all of the columns in the data. This means that the JSON and Avro parsers are more flexible in the face of variability in the data, but also that you must monitor your loads more closely for new fields.
When creating tables that will use JSON or Avro data, a good first step is to infer a table definition from a sample data file:
=> SELECT INFER_TABLE_DDL ('/data/rest1.json'
USING PARAMETERS table_name='restaurants', format='json');
WARNING 9311: This generated statement contains one or more varchar/varbinary columns which default to length 80
INFER_TABLE_DDL
---------------------------------------------------------------------------------------------------------------------
Candidate matched 1/1 of total files(s):
create table "restaurants"(
"cuisine" varchar,
"hours" Row(
"close" time,
"open" time
),
"location" Array[Row(
"city" varchar,
"state" varchar
)],
"menu" Array[Row(
"item" varchar,
"price" numeric(precision, scale)
)],
"name" varchar
);
(1 row)
As the warning notes, the generated VARCHAR columns have no explicit lengths and therefore default to 80. INFER_TABLE_DDL also leaves NUMERIC precision and scale as placeholders for you to fill in, and because it only samples the data, it cannot supply array bounds. From this output you could define the following table and load some data:
=> CREATE TABLE restaurants(
name VARCHAR(50),
cuisine VARCHAR(25),
location ARRAY[ROW(city VARCHAR(20), state VARCHAR(2)),50],
hours ROW(open TIME, close TIME),
menu ARRAY[ROW(item VARCHAR(50), price NUMERIC(8,2)),100]);
=> COPY restaurants FROM '/data/rest1.json' PARSER FJSONPARSER();
Rows Loaded
-------------
2
(1 row)
A second data file contains some new fields. The JSON and Avro parsers load the fields they can match and warn about the new data:
=> COPY restaurants FROM '/data/rest2.json' PARSER FJSONPARSER() NO COMMIT;
WARNING 10596: Warning in UDx call in user-defined object [FJSONParser], code: 0, message:
Data source contained keys which did not match table schema
HINT: SELECT key, sum(num_instances) FROM v_monitor.udx_events WHERE event_type = 'UNMATCHED_KEY' GROUP BY key
Rows Loaded
-------------
2
(1 row)
Because this load uses the NO COMMIT option, you can roll it back after inspecting the warnings. Without NO COMMIT, COPY commits the data, skipping the new fields; if you then change your table definition to handle the new fields and repeat the load, the table has two rows for each record, one with the new fields and one without. Use NO COMMIT to experiment, and use ROLLBACK to recover.
New fields are logged in the UDX_EVENTS system table:
--- from the HINT:
=> SELECT key, SUM(num_instances) FROM v_monitor.UDX_EVENTS
WHERE event_type = 'UNMATCHED_KEY' AND transaction_id=CURRENT_TRANS_ID()
GROUP BY key;
key | SUM
------------------------+-----
chain | 1
menu.elements.calories | 7
(2 rows)
=> ROLLBACK;
The second data file has two new fields, one at the top level and one in the existing menu complex type. To find out their data types, you can inspect the file or use INFER_TABLE_DDL again:
=> SELECT INFER_TABLE_DDL ('/data/rest2.json'
USING PARAMETERS table_name='restaurants', format='json');
WARNING 9311: This generated statement contains one or more varchar/varbinary columns which default to length 80
INFER_TABLE_DDL
---------------------------------------------------------------------------------------------------------------------
Candidate matched 1/1 of total files(s):
create table "restaurants"(
"chain" bool,
"cuisine" varchar,
"hours" Row(
"close" time,
"open" time
),
"location" Array[Row(
"city" varchar,
"state" varchar
)],
"menu" Array[Row(
"calories" int,
"item" varchar,
"price" numeric(precision, scale)
)],
"name" varchar
);
(1 row)
You can use ALTER TABLE to add the new column and alter the menu column. With these changes, the new data can be loaded without errors:
=> ALTER TABLE restaurants ADD COLUMN chain BOOLEAN;
=> ALTER TABLE restaurants ALTER COLUMN menu
SET DATA TYPE ARRAY[ROW(item VARCHAR(50), price NUMERIC(8,2), calories INT),100];
--- repeat the load:
=> COPY restaurants FROM '/data/rest2.json' PARSER FJSONPARSER();
Rows Loaded
-------------
2
(1 row)
ALTER TABLE affects future loads. Data already in the table has null values for the added column and field:
=> SELECT EXPLODE(name, chain, menu) OVER() FROM restaurants;
name | chain | position | value
--------------+-------+----------+---------------------------------------------------------
Pizza House | | 0 | {"item":"cheese pizza","price":"7.99","calories":null}
Pizza House | | 1 | {"item":"spinach pizza","price":"8.99","calories":null}
Pizza House | | 2 | {"item":"garlic bread","price":"4.99","calories":null}
Sushi World | | 0 | {"item":"maki platter","price":"21.95","calories":null}
Sushi World | | 1 | {"item":"tuna roll","price":"4.95","calories":null}
Greasy Spoon | f | 0 | {"item":"scrambled eggs","price":"3.99","calories":350}
Greasy Spoon | f | 1 | {"item":"grilled cheese","price":"3.95","calories":500}
Greasy Spoon | f | 2 | {"item":"tuna melt","price":"5.95","calories":600}
Greasy Spoon | f | 3 | {"item":"french fries","price":"1.99","calories":350}
Pasta World | t | 0 | {"item":"cheese pizza","price":"7.99","calories":1200}
Pasta World | t | 1 | {"item":"spinach pizza","price":"8.99","calories":900}
Pasta World | t | 2 | {"item":"garlic bread","price":"4.99","calories":250}
(12 rows)
If you are only interested in certain fields and don't care about new fields found in data loads, you can use the suppress_warnings parameter on the JSON or Avro parser to ignore them. If you suppress warnings, you can still check for new fields from time to time with INFER_TABLE_DDL, or by loading sample data with warnings enabled and NO COMMIT.
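For example, a load like the following ignores unmatched-key warnings. The data file name here is hypothetical, and the value accepted by suppress_warnings can vary by version, so check the parser reference before relying on it:
=> COPY restaurants FROM '/data/rest3.json'
   PARSER FJSONPARSER(suppress_warnings='true');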