Handling messy data

Loading data with COPY has two main phases, parsing and loading.

Loading data with COPY has two main phases, parsing and loading. During parsing, if COPY encounters errors it rejects the faulty data and continues loading data. Rejected data is created whenever COPY cannot parse a row of data. Following are some parser errors that can cause a rejected row:

  • Unsupported parser options

  • Incorrect data types for the table into which data is being loaded, including incorrect data types for members of collections

  • Malformed context for the parser in use

  • Missing delimiters

Optionally, COPY can reject data and continue loading when transforming data during the load phase. This behavior is controlled by a configuration parameter. By default, COPY aborts a load if it encounters errors during the loading phase.

Several optional parameters let you determine how strictly COPY handles rejections. For example, you can have COPY fail when it rejects a single row, or allow a specific number of rejections before the load fails. This section presents the parameters to determine how COPY handles rejected data.

Save rejected rows (REJECTED DATA and EXCEPTIONS)

The COPY statement automatically saves a copy of each rejected row in a rejected-data file. COPY also saves a corresponding explanation of what caused the rejection in an exceptions file. By default, Vertica saves both files in a database catalog subdirectory, called CopyErrorLogs, as shown in this example:

v_mart_node003_catalog\CopyErrorLogs\trans-STDIN-copy-from-rejected-data.1
v_mart_node003_catalog\CopyErrorLogs\trans-STDIN-copy-from-exceptions.1

You can optionally save COPY rejections and exceptions in one of two other ways:

  • Use the REJECTED DATA reject_path and EXCEPTIONS except_path parameters to save both outputs to locations of your choice. REJECTED DATA records rejected rows, while EXCEPTIONS records a description of why each row was rejected. If a path value is an existing directory or ends in '/', or the load includes multiple sources, files are written in that directory. (COPY creates the directory if it does not exist.) If a path value is a file, COPY uses it as a file prefix if multiple files are written.

  • Use the REJECTED DATA AS TABLE reject_table clause. This option writes both the rejected data and the exception descriptions to the same table. For more information, see Saving rejected data to a table.

If you save rejected data to a table, the table files are stored in the data subdirectory. For example, in a VMart database installation, rejected data table records are stored in the RejectionTableData directory as follows:

=> cd v_mart_node003_data\RejectionTableData\
=> ls
TABLE_REJECTED_RECORDS_"bg"_mytest01.example.-25441:0x6361_45035996273805099_1.1
TABLE_REJECTED_RECORDS_"bg"_mytest01.example.-25441:0x6361_45035996273805113_2.2
.
.
.
TABLE_REJECTED_RECORDS_"delimr"_mytest01.example.-5958:0x3d47_45035996273815749_1.1
TABLE_REJECTED_RECORDS_"delimr"_mytest01.example.-5958:0x3d47_45035996273815749_1.2

COPY LOCAL rejected data

For COPY LOCAL operations, if you use REJECTED DATA or EXCEPTIONS with a file path, the files are written on the client. If you want rejections to be available on all nodes, use REJECTED DATA AS TABLE instead of REJECTED DATA.

Enforce truncating or rejecting rows (ENFORCELENGTH)

When parsing data of type CHAR, VARCHAR, BINARY, or VARBINARY, rows may exceed the target table length. By default, COPY truncates such rows without rejecting them.

Use the ENFORCELENGTH parameter to reject rows that exceed the target table.

For example, loading 'abc' into a table column specified as VARCHAR(2) results in COPY truncating the value to 'ab' and loading it. Loading the same row with the ENFORCELENGTH parameter causes COPY to reject the row.

Specify a maximum number of rejections (REJECTMAX)

The REJECTMAX parameter specifies the maximum number of logical records that can be rejected before a load fails. A rejected row consists of the data that could not be parsed (or optionally transformed) into the corresponding data type during a bulk load. Rejected data does not indicate referential constraints. For information about using constraints, and the option of enforcing constraints during bulk loading, see Constraints.

When the number of rejected records becomes equal to the REJECTMAX value, the load fails. If you do not specify a value for REJECTMAX, or if the value is 0, COPY allows an unlimited number of exceptions to occur.

If you allow COPY to reject rows and proceed when it encounters transformation errors, consider using REJECTMAX to limit the impact. See Handling Transformation Errors.

Handling transformation errors

By default, COPY aborts a load if it encounters errors when performing transformations. This is the default because rejecting transformation errors is potentially more expensive than rejecting parse errors. Sometimes, however, you would prefer to load the data anyway and reject the problematic rows, the way it does for parse errors.

To have COPY treat errors in transformation expressions like parse errors, set the CopyFaultTolerantExpressions configuration parameter to 1. (See General parameters.) Rows that are rejected during transformation, in the expression-evaluation phase of a data load, are written to the same destination as rows rejected during parsing. Use REJECTED DATA or REJECTED DATA AS TABLE to specify the output location.

You might want to enable transformation rejections if your data contains a few bad rows. By enabling these rejections, you can load the majority of your data and proceed. Vertica recommends using REJECTMAX when enabling transformation rejections.

If your data contains many bad values, then the performance for loading the good rows could be worse than with parser errors.

Abort data loads for any error (ABORT ON ERROR)

Using the ABORT ON ERROR argument is the most restrictive way to load data, because no exceptions or rejections are allowed. A COPY operation stops if any row is rejected. No data is loaded and Vertica rolls back the command.

If you use the ABORT ON ERROR as part of a CREATE EXTERNAL TABLE AS COPY FROM statement, the option is used whenever a query references the external table. The offending error is saved in the COPY exceptions or rejected data file.

Understanding row rejections and rollback errors

Depending on the type of error that COPY encounters, Vertica does one of the following:

  • Rejects the offending row and loads other rows into a table

  • Rolls back the entire COPY statement without loading any data

The following table summarizes the reasons for rejected rows or rollbacks.

Rejected Rows Load Rollback

COPY cannot parse rows that contain any of the following:

  • Incompatible data types

  • Missing fields

  • Missing delimiters

COPY rolls back a load if it encounters any of these conditions:

  • Server-side errors, such as lack of memory

  • Primary key or foreign key constraint violations

  • Loading NULL data into a NOT NULL column

  • Transformation errors (by default)

This example illustrates what happens when Vertica cannot coerce a row to the requested data type. For example, in the following COPY statement, "a::INT + b::INT" is a SQL expression in which a and b are derived values:

=> CREATE TABLE t (i INT);
=> COPY t (a FILLER VARCHAR, b FILLER VARCHAR, i AS a::INT + b::INT)
   FROM STDIN;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> cat|dog
>> \.

Vertica cannot parse the row to the requested data type and rejects the row:

ERROR 2827:  Could not convert "cat" from column "*FILLER*".a to an int8

If a resolved to 'cat' and b to 'dog', the next expression 'cat'::INT + 'dog'::INT would return an expression evaluator error:

=> SELECT 'cat'::INT + 'dog'::INT;
ERROR 3681:  Invalid input syntax for integer: "cat"

The following COPY statement would also roll back because Vertica cannot parse the row to the requested data type:

=> COPY t (a FILLER VARCHAR, i AS a::INT) FROM STDIN;

In the following COPY statement, Vertica rejects only the offending row without rolling back the statement. Instead of evaluating the 'cat' row as a VARCHAR type, COPY parses 'cat' directly as an INTEGER.

=> COPY t (a FILLER INT, i AS a) FROM STDIN;

In the following example, transformation errors are rejected instead of aborting the load.

=> ALTER DATABASE DEFAULT SET CopyFaultTolerantExpressions = 1;
ALTER DATABASE

=> CREATE TABLE sales (price INTEGER);
COPY sales FROM STDIN REJECTED DATA AS TABLE sales_rej;
dollars
\.

=> SELECT rejected_data, rejected_reason FROM sales_rej;
 rejected_data |                rejected_reason
---------------+-----------------------------------------------
 dollars       | Invalid integer format 'dollars' for column 1 (price)
(1 row)

See also