Transforming data during loads

To promote a consistent database and reduce the need for scripts to transform data at the source, you can transform data with an expression as part of loading.

To promote a consistent database and reduce the need for scripts to transform data at the source, you can transform data with an expression as part of loading. Transforming data while loading lets you compute values to insert into a target column, either from other columns or from values in the data that you load as FILLER columns (see Deriving Table Columns From Data File Columns). You can transform data to be loaded into columns of scalar types and native arrays, but not other complex types.

For example, you might have text data that is not compatible with UTF-8, the encoding that Vertica expects. You can use the MAKEUTF8 function during load to remove or replace non-UTF-8 characters, as illustrated in Converting Data While Loading. Or you might want to extract fields for day, month, and year from a single input date.

When transforming data during a load, you load the data into a column normally, and then use that column in an expression to populate another column. The COPY statement must always contain at least one parsed column, which can be a FILLER column. You can intersperse parsed and computed columns in a COPY statement.

The following example extracts day, month, and year columns from a single input date column:

=> CREATE TABLE purchases
    (id INT, year VARCHAR(10), month VARCHAR(10), day VARCHAR(10), ts TIMESTAMP);

=> COPY purchases (id, year AS TO_CHAR(ts,'YYYY'),
                   month AS TO_CHAR(ts,'MM'), day AS TO_CHAR(ts, 'DD'),
                   ts FORMAT 'YYYY-MM-DD') FROM STDIN;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> 1943|2021-03-29
>> 1256|2021-03-30
>> 1512|2021-03-31
>> \.

=> SELECT * FROM purchases;
  id  | year | month | day |         ts
------+------+-------+-----+---------------------
 1256 | 2021 | 03    | 30  | 2021-03-30 00:00:00
 1512 | 2021 | 03    | 31  | 2021-03-31 00:00:00
 1943 | 2021 | 03    | 29  | 2021-03-29 00:00:00
(3 rows)

The input data has two columns, id and ts (timestamp). The COPY statement specifies the format of the timestamp column using the FORMAT option. The TO_CHAR function uses that format information to extract the year, month, and day columns.

Using expressions in COPY statements

The expression in a COPY statement can be as simple as a single column, or more complex, such as a case statement for multiple columns. An expression can specify multiple columns, and multiple expressions can refer to the same parsed column. You can use expressions for columns of all supported data types.

COPY expressions can use many SQL functions, operators, constants, NULLs, and comments, including these functions:

Requirements and restrictions:

  • COPY expressions cannot use SQL meta-functions, analytic functions, aggregate functions, or computed columns.

  • For computed columns, you must list all parsed columns in the COPY statement expression. Do not specify FORMAT or RAW in the source data for a computed column.

  • The return data type of the expression must be coercible to that of the target column. Parsed column parameters are also coerced to match the expression.

Handling expression errors

Errors in expressions within your COPY statement are SQL errors. As such, they are handled differently from parse errors. When a parse error occurs, COPY rejects the row and adds it to the rejected data file or table. COPY also adds the reason for a rejected row to the exceptions file or the rejected data table. For example, COPY parsing does not implicitly cast data types. If a type mismatch occurs between the data being loaded and a column type (such as attempting to load a text value into a FLOAT column), COPY rejects the row and continues processing.

If an error occurs in an expression in your COPY statement, then by default the entire load fails. For example, if your COPY statement uses a function expression, and a syntax error exists in that expression, the entire load is rolled back. All SQL errors, including ones caused by rolling back the COPY, are stored in the Vertica log file. However, unlike parse rejections and exception messages, SQL expression errors are brief and may require further research.

You can have COPY treat errors in transformation expressions like parse errors. Rejected rows are added to the same file or table, and exceptions are added to the same exceptions file or table. To enable this behavior, set the CopyFaultTolerantExpressions configuration parameter to 1. (See General parameters.)

Loading data with expression rejections is potentially slower than loading with the same number of parse rejections. Enable expression rejections if your data has a few bad rows, to allow the rest of the data to be loaded. If you are concerned about the time it takes to complete a load with many bad rows, use the REJECTMAX parameter to set a limit. If COPY finds more than REJECTMAX bad rows, it aborts and rolls back the load.

See Handling messy data for more information about managing rejected data.

Deriving table columns from data file columns

When loading data, your source data might contain one or more columns that do not exist in the target table. Or, the source and target tables have matched columns, but you want to omit one or more source columns from the target table.

Use the FILLER parameter to identify a column of source data that COPY can ignore or use to compute new values that are loaded into the target table. The following requirements apply:

  • Define the FILLER parameter data type so it is compatible with the source data. For example, be sure to define a VARCHAR in the target table so its length can contain all source data; otherwise, data might be truncated. You can specify multiple filler columns by using the FILLER parameter more than once in the COPY statement.

  • The name of the filler column must not match the name of any column in the target table.

In the following example, the table has columns for first name, last name, and full name, but the data being loaded contains columns for first, middle, and last names. The COPY statement reads all of the source data but only loads the source columns for first and last names. It constructs the data for the full name by concatenating each of the source data columns, including the middle name. The middle name is read as a FILLER column so it can be used in the concatenation, but is ignored otherwise. (There is no table column for middle name.)

=> CREATE TABLE names(first VARCHAR(20), last VARCHAR(20), full VARCHAR(60));
CREATE TABLE
=> COPY names(first,
              middle FILLER VARCHAR(20),
              last,
              full AS first||' '||middle||' '||last)
      FROM STDIN;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> Marc|Gregory|Smith
>> Sue|Lucia|Temp
>> Jon|Pete|Hamilton
>> \.
=> SELECT * from names;
 first |   last   |        full
-------+----------+--------------------
 Jon   | Hamilton | Jon Pete Hamilton
 Marc  | Smith    | Marc Gregory Smith
 Sue   | Temp     | Sue Lucia Temp
(3 rows)