Schema changes

After you develop a schema for a table and continue to load data, you might find that some new data diverges from that schema. Sometimes new fields appear, and sometimes fields change data types or stop appearing. Depending on how volatile you expect your data to be, you might decide to load what you can and ignore the rest, or you might decide to update your table definition to handle the changes.

Typically, COPY rejects data that does not match the table definition and loads the rest. You can monitor these rejections and adjust your table definition, as described in Test with Data; an example of capturing rejections follows this list. Parsers for some formats have additional considerations:

  • ORC: if the table definition does not include all columns in the data, the load aborts with an error.

  • Parquet: by default, the table definition must include all columns in the data, as for ORC. Alternatively, you can use loose schema matching to load only the columns that are part of the table definition, ignoring any others. Type mismatches that cannot be coerced are errors.

  • JSON and Avro: the parser loads only the columns that are part of the table definition. If the data contains other columns, the parsers emit warnings. Type mismatches that cannot be coerced are errors.
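One way to monitor rejections is to direct them to a table that you can query after the load. The following is a minimal sketch using the standard REJECTED DATA AS TABLE clause of COPY; the sales table and data file here are hypothetical:

=> COPY sales FROM '/data/sales.json' PARSER FJSONPARSER()
   REJECTED DATA AS TABLE sales_rejects;

=> SELECT rejected_data, rejected_reason FROM sales_rejects;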

Parquet

If you load Parquet data that contains extra columns, by default the load fails with an error:

=> CREATE TABLE orders_summary(orderid INT, accountid INT);

=> COPY orders_summary FROM '/data/orders.parquet' PARQUET;
ERROR 9135:  Attempt to load 2 columns from a parquet source [/data/orders.parquet] that has 3 columns

If you do not know what the extra columns are, you can use inference to inspect a sample file. If the new fields are important, you can use ALTER TABLE to add the new columns to the table definition and then load the data.
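For example, to inspect the sample file from the failed load above, you could call INFER_TABLE_DDL with the Parquet format, as the JSON examples later in this section do for JSON:

=> SELECT INFER_TABLE_DDL ('/data/orders.parquet'
   USING PARAMETERS table_name='orders_summary', format='parquet');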

If the new fields are not important and you want to ignore them, you can use loose schema matching. Loose schema matching means that the parser loads columns whose names match the table definition and ignores any others. Enable this with the do_soft_schema_match_by_name parameter in the PARQUET parser:

=> COPY orders_summary FROM '/data/orders.parquet'
   PARQUET(do_soft_schema_match_by_name='true');
 Rows Loaded
-------------
           3
(1 row)

Loose schema matching allows you to ignore extra data. Missing data and mismatched data types are still errors. If loads fail with these errors, you can use inference to inspect the data and then alter your table definition.
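For example, if inference shows that the file's extra column is a ship date that you want to keep, you could extend the table and reload. This is a sketch assuming a hypothetical shipdate column; with the default position-based matching, the added column must occupy the same position as in the data:

=> ALTER TABLE orders_summary ADD COLUMN shipdate DATE;

=> COPY orders_summary FROM '/data/orders.parquet' PARQUET;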

JSON and Avro

By default, the JSON and Avro parsers skip fields in the data that are not part of the table definition and proceed with the rest of the load. This means that the JSON and Avro parsers are more flexible in the face of variability in the data, but also that you must monitor your loads more closely for new fields.

When creating tables that will use JSON or Avro data, a good first step is to infer a table definition from a sample file:

=> SELECT INFER_TABLE_DDL ('/data/rest1.json'
   USING PARAMETERS table_name='restaurants', format='json');
WARNING 9311:  This generated statement contains one or more varchar/varbinary columns which default to length 80
...

=> CREATE TABLE restaurants(
       name VARCHAR(50),
       cuisine VARCHAR(25),
       location ARRAY[ROW(city VARCHAR(20), state VARCHAR(2)),50],
       hours ROW(open TIME, close TIME),
       menu ARRAY[ROW(item VARCHAR(50), price NUMERIC(8,2)),100]);

=> COPY restaurants FROM '/data/rest1.json' PARSER FJSONPARSER();
 Rows Loaded
-------------
           2
(1 row)

A second data file has some new fields. The parser loads what it can and warns about the new data:

=> COPY restaurants FROM '/data/rest2.json' PARSER FJSONPARSER() NO COMMIT;
WARNING 10596:  Warning in UDx call in user-defined object [FJSONParser], code: 0, message:
Data source contained keys which did not match table schema
HINT:  SELECT key, sum(num_instances) FROM v_monitor.udx_events WHERE event_type = 'UNMATCHED_KEY' GROUP BY key
 Rows Loaded
-------------
           2
(1 row)

Loads that use the NO COMMIT option, as in this example, can recover from errors. Without NO COMMIT, COPY commits the data, skipping the new fields. If you then change your table definition to handle the new fields and repeat the load, the table has two rows for each record: one with the new fields and one without. Use NO COMMIT to experiment, and use ROLLBACK to recover.

New fields are logged in the UDX_EVENTS system table:

--- from the HINT:
=> SELECT key, SUM(num_instances) FROM v_monitor.UDX_EVENTS
   WHERE event_type = 'UNMATCHED_KEY' AND transaction_id=CURRENT_TRANS_ID()
   GROUP BY key;
          key           | SUM
------------------------+-----
 chain                  |   1
 menu.elements.calories |   7
(2 rows)

=> ROLLBACK;

The data file has two new fields, one at the top level and one in the existing menu complex-type column. To find out their data types, you can inspect the file or use INFER_TABLE_DDL:

=> SELECT INFER_TABLE_DDL ('/data/rest2.json'
   USING PARAMETERS table_name='restaurants', format='json');
WARNING 9311:  This generated statement contains one or more varchar/varbinary columns which default to length 80
                               INFER_TABLE_DDL
---------------------------------------------------------------------------------------------------------------------
 Candidate matched 1/1 of total files(s):
create table "restaurants"(
  "chain" bool,
  "cuisine" varchar,
  "hours" Row(
    "close" time,
    "open" time
  ),
  "location" Array[Row(
    "city" varchar,
    "state" varchar
  )],
  "menu" Array[Row(
    "calories" int,
    "item" varchar,
    "price" numeric(precision, scale)
  )],
  "name" varchar
);

(1 row)

You can use ALTER TABLE to add the new column and alter the menu column. With these changes, the data can be loaded without errors:

=> ALTER TABLE restaurants ADD COLUMN chain BOOLEAN;

=> ALTER TABLE restaurants ALTER COLUMN menu
   SET DATA TYPE ARRAY[ROW(item VARCHAR(50), price NUMERIC(8,2), calories INT),100];

--- repeat the load:
=> COPY restaurants FROM '/data/rest2.json' PARSER FJSONPARSER();
 Rows Loaded
-------------
           2
(1 row)

ALTER TABLE affects future loads. Data already in the table has null values for the added column and field:

=> SELECT EXPLODE(name, chain, menu) OVER() FROM restaurants;
     name     | chain | position |                          value
--------------+-------+----------+---------------------------------------------------------
 Pizza House  |       |        0 | {"item":"cheese pizza","price":"7.99","calories":null}
 Pizza House  |       |        1 | {"item":"spinach pizza","price":"8.99","calories":null}
 Pizza House  |       |        2 | {"item":"garlic bread","price":"4.99","calories":null}
 Sushi World  |       |        0 | {"item":"maki platter","price":"21.95","calories":null}
 Sushi World  |       |        1 | {"item":"tuna roll","price":"4.95","calories":null}
 Greasy Spoon | f     |        0 | {"item":"scrambled eggs","price":"3.99","calories":350}
 Greasy Spoon | f     |        1 | {"item":"grilled cheese","price":"3.95","calories":500}
 Greasy Spoon | f     |        2 | {"item":"tuna melt","price":"5.95","calories":600}
 Greasy Spoon | f     |        3 | {"item":"french fries","price":"1.99","calories":350}
 Pasta World  | t     |        0 | {"item":"cheese pizza","price":"7.99","calories":1200}
 Pasta World  | t     |        1 | {"item":"spinach pizza","price":"8.99","calories":900}
 Pasta World  | t     |        2 | {"item":"garlic bread","price":"4.99","calories":250}
(12 rows)

If you are interested only in certain fields and don't care about new fields found in data loads, you can use the suppress_warnings parameter on the JSON or Avro parser to ignore them. If you suppress warnings, you can still check for new fields from time to time with INFER_TABLE_DDL or by loading sample data with warnings enabled and NO COMMIT.
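A minimal sketch of suppressing these warnings, assuming a hypothetical third data file:

=> COPY restaurants FROM '/data/rest3.json'
   PARSER FJSONPARSER(suppress_warnings='true');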