Schema changes
After you develop a schema for a table and continue to load data, you might find that some new data diverges from that schema. Sometimes new fields appear, and sometimes fields change data types or stop appearing. Depending on how volatile you expect your data to be, you might decide to load what you can and ignore the rest, or you might decide to update your table definition to handle the changes.
Typically, COPY rejects data that does not match the definition and loads the rest. You can monitor these rejections and adjust your table definition, as described in Test with Data. Parsers for some formats have additional considerations:
- Parquet and ORC: By default, the table definition must include all columns in the data. Alternatively, you can use loose schema matching to load only the columns that are part of the table definition, ignoring others. Type mismatches that cannot be coerced are errors.
- JSON and Avro: The parsers load only the columns that are part of the table definition. If the data contains other columns, the parsers emit warnings. Type mismatches that cannot be coerced are errors.
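As a minimal sketch of the monitoring pattern described above, you can route rejected rows to a table and inspect them; the sales table, file path, and rejection-table name here are placeholders:

    => COPY sales FROM '/data/sales_new.csv' DELIMITER ','
       REJECTED DATA AS TABLE sales_rejects;
    -- Inspect the rejected rows and the reason each one was rejected:
    => SELECT rejected_reason, rejected_data FROM sales_rejects;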
Parquet and ORC
If you load Parquet or ORC data that contains extra columns, by default the load fails with an error:
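For example, a sketch with placeholder table and file names:

    => COPY sales FROM '/data/sales_new.parquet' PARQUET;
    -- Fails with an error because the file contains a column that the table
    -- definition does not include (the exact message varies by version).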
If you do not know what the extra columns are, you can use inference to inspect a sample file. If the new columns are important, you can use ALTER TABLE to add them to the table definition and then load the data.
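A sketch with placeholder names: INFER_TABLE_DDL reports a definition for the sample file, and ALTER TABLE can then add any new column you decide to keep (the region column and its type are assumptions):

    => SELECT INFER_TABLE_DDL('/data/sales_new.parquet'
              USING PARAMETERS format = 'parquet', table_name = 'sales');
    -- If the extra column matters, add it before reloading:
    => ALTER TABLE sales ADD COLUMN region VARCHAR(50);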
If the new columns are not important and you want to ignore them, you can use loose schema matching. Loose schema matching means that the parser loads columns whose names match the table definition and ignores any others. For the Parquet parser, loose schema matching also applies to struct fields.
Enable loose matching with the do_soft_schema_match_by_name parameter in the PARQUET or ORC parser:
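A minimal sketch, assuming the same placeholder sales table and file:

    => COPY sales FROM '/data/sales_new.parquet'
       PARQUET(do_soft_schema_match_by_name='true');
    -- Columns in the file that are not in the table definition are ignored.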
Loose schema matching allows you to ignore extra data. Missing data and mismatched data types are still errors. If loads fail with these errors, you can use inference to inspect the data and then alter your table definition.
JSON and Avro
By default, the JSON and Avro parsers skip fields in the data that are not part of the table definition and proceed with the rest of the load. This means that the JSON and Avro parsers are more flexible in the face of variability in the data, but also that you must monitor your loads more closely for new fields.
When creating tables that will use JSON or Avro data, a good first step is to infer a table definition from a sample file:
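For example, with a placeholder file path and table name:

    => SELECT INFER_TABLE_DDL('/data/restaurants1.json'
              USING PARAMETERS format = 'json', table_name = 'restaurants');
    -- Review the suggested definition, adjust types if needed, and run the
    -- resulting CREATE TABLE statement.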
A second data file has some new fields. These parsers load what can be loaded and warn about the new data:
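A sketch of such a load, using NO COMMIT so it can be discarded afterward; the restaurants table and file path are placeholders:

    => COPY restaurants FROM '/data/restaurants2.json'
       PARSER FJSONPARSER() NO COMMIT;
    -- The load succeeds, and the parser emits a warning for each field in the
    -- data that is not part of the table definition.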
Loads that use the NO COMMIT option, as in this example, can recover from errors. Without NO COMMIT, COPY loads the data, skipping the new fields. If you change your table definition to handle the new fields and repeat the load, the table has two rows for each record, one with the new fields and one without. Use NO COMMIT to experiment, and use ROLLBACK to recover.
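After inspecting the warnings, discard the experimental load before changing the table definition:

    => ROLLBACK;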
New fields are logged in the UDX_EVENTS system table:
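A hedged sketch: UDX_EVENTS is a system table, but its exact columns vary by version, so check your version's documentation before filtering on specific column names:

    => SELECT * FROM UDX_EVENTS;
    -- Events from the load's transaction identify fields in the data that did
    -- not match the table definition.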
The data file has two new fields, one at the top level and one in the existing menu complex-type column. To find out their data types, you can inspect the file or use INFER_TABLE_DDL:
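For example, with a placeholder path for the newer file:

    => SELECT INFER_TABLE_DDL('/data/restaurants2.json'
              USING PARAMETERS format = 'json', table_name = 'restaurants');
    -- Compare the suggested definition with the current table to see the types
    -- of the new top-level field and the new field inside menu.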
You can use ALTER TABLE to add the new column and alter the menu column. With these changes, the data can be loaded without errors:
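A sketch that assumes the existing menu column is an array of rows with item and price fields, that the new top-level field is a string named hours, and that the new menu field is a boolean named special; adapt the names and types from the inference output, and note that altering a complex-type column in place depends on your Vertica version:

    => ALTER TABLE restaurants ADD COLUMN hours VARCHAR(50);
    => ALTER TABLE restaurants ALTER COLUMN menu
       SET DATA TYPE ARRAY[ROW(item VARCHAR(50), price FLOAT, special BOOLEAN)];
    => COPY restaurants FROM '/data/restaurants2.json' PARSER FJSONPARSER();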
ALTER TABLE affects future loads. Data already in the table has null values for the added column and field:
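For example, with the placeholder column names used above:

    => SELECT name, hours FROM restaurants;
    -- Rows from earlier loads show NULL for hours; rows loaded after the
    -- ALTER TABLE have values.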
If you are only interested in certain fields and don't care about new fields found in data loads, you can use the suppress_warnings parameter on the JSON or Avro parser to ignore them. If you suppress warnings, you can separately check for new fields from time to time with INFER_TABLE_DDL or by loading sample data with warnings enabled and NO COMMIT.
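A minimal sketch with placeholder names; the same parameter applies to the Avro parser (FAVROPARSER):

    => COPY restaurants FROM '/data/restaurants3.json'
       PARSER FJSONPARSER(suppress_warnings='true');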