This is the multi-page printable view of this section. Click here to print.

Return to the regular view of this page.

Handling messy data

Loading data with COPY has two main phases, parsing and loading.

Loading data with COPY has two main phases, parsing and loading. During parsing, if COPY encounters errors it rejects the faulty data and continues loading data. Rejected data is created whenever COPY cannot parse a row of data. Following are some parser errors that can cause a rejected row:

  • Unsupported parser options

  • Incorrect data types for the table into which data is being loaded, including incorrect data types for members of collections

  • Malformed context for the parser in use

  • Missing delimiters

Optionally, COPY can reject data and continue loading when transforming data during the load phase. This behavior is controlled by a configuration parameter. By default, COPY aborts a load if it encounters errors during the loading phase.

Several optional parameters let you determine how strictly COPY handles rejections. For example, you can have COPY fail when it rejects a single row, or allow a specific number of rejections before the load fails. This section presents the parameters to determine how COPY handles rejected data.

Save rejected rows (REJECTED DATA and EXCEPTIONS)

The COPY statement automatically saves a copy of each rejected row in a rejections file. COPY also saves a corresponding explanation of what caused the rejection in an exceptions file. By default, Vertica saves both files in a database catalog subdirectory, called CopyErrorLogs, as shown in this example:

v_mart_node003_catalog\CopyErrorLogs\trans-STDIN-copy-from-rejected-data.1
v_mart_node003_catalog\CopyErrorLogs\trans-STDIN-copy-from-exceptions.1

You can optionally save COPY rejections and exceptions in one of two other ways:

  • Use the REJECTED DATA and EXCEPTIONS options to save both outputs to locations of your choice. REJECTED DATA records rejected rows, while EXCEPTIONS records a description of why each row was rejected. If a path value is an existing directory or ends in '/', or the load includes multiple sources, files are written in that directory. (COPY creates the directory if it does not exist.) If a path value is a file, COPY uses it as a file prefix if multiple files are written.

  • Use the REJECTED DATA AS TABLE option. This option writes both the rejected data and the exception descriptions to the same table. For more information, see Saving rejected data to a table.

If you save rejected data to a table, the table files are stored in the data subdirectory. For example, in a VMart database installation, rejected data table records are stored in the RejectionTableData directory as follows:

=> cd v_mart_node003_data\RejectionTableData\
=> ls
TABLE_REJECTED_RECORDS_"bg"_mytest01.example.-25441:0x6361_45035996273805099_1.1
TABLE_REJECTED_RECORDS_"bg"_mytest01.example.-25441:0x6361_45035996273805113_2.2
.
.
.
TABLE_REJECTED_RECORDS_"delimr"_mytest01.example.-5958:0x3d47_45035996273815749_1.1
TABLE_REJECTED_RECORDS_"delimr"_mytest01.example.-5958:0x3d47_45035996273815749_1.2

COPY LOCAL rejected data

For COPY LOCAL operations, if you use REJECTED DATA or EXCEPTIONS with a file path, the files are written on the client. If you want rejections to be available on all nodes, use REJECTED DATA AS TABLE instead of REJECTED DATA.

Enforce truncating or rejecting rows (ENFORCELENGTH)

When parsing CHAR, VARCHAR, BINARY, or VARBINARY data, rows may exceed the target table length. By default, COPY truncates such rows without rejecting them. Use the ENFORCELENGTH option to instead reject rows that exceed the target table.

For example, loading 'abc' into a table column specified as VARCHAR(2) results in COPY truncating the value to 'ab' and loading it. Loading the same row with the ENFORCELENGTH option causes COPY to reject the row.

Specify a maximum number of rejections (REJECTMAX)

The REJECTMAX option specifies the maximum number of logical records that can be rejected before a load fails. A rejected row consists of the data that could not be parsed (or optionally transformed) into the corresponding data type during a bulk load. Rejected data does not indicate referential constraints. For information about using constraints, and the option of enforcing constraints during bulk loading, see Constraints.

When the number of rejected records becomes equal to the REJECTMAX value, the load fails. If you do not specify a value for REJECTMAX, or if the value is 0, COPY allows an unlimited number of exceptions to occur.

If you allow COPY to reject rows and proceed when it encounters transformation errors, consider using REJECTMAX to limit the impact. See Handling Transformation Errors.

Handling transformation errors

By default, COPY aborts a load if it encounters errors when performing transformations. This is the default because rejecting transformation errors is potentially more expensive than rejecting parse errors. Sometimes, however, you would prefer to load the data anyway and reject the problematic rows, the way it does for parse errors.

To have COPY treat errors in transformation expressions like parse errors, set the CopyFaultTolerantExpressions configuration parameter to 1. (See General parameters.) Rows that are rejected during transformation, in the expression-evaluation phase of a data load, are written to the same destination as rows rejected during parsing. Use REJECTED DATA or REJECTED DATA AS TABLE to specify the output location.

You might want to enable transformation rejections if your data contains a few bad rows. By enabling these rejections, you can load the majority of your data and proceed. Vertica recommends using REJECTMAX when enabling transformation rejections.

If your data contains many bad values, then the performance for loading the good rows could be worse than with parser errors.

Abort data loads for any error (ABORT ON ERROR)

Using the ABORT ON ERROR option is the most restrictive way to load data, because no exceptions or rejections are allowed. A COPY operation stops if any row is rejected. No data is loaded and Vertica rolls back the command.

If you use ABORT ON ERROR as part of a CREATE EXTERNAL TABLE statement, the option is used whenever a query references the external table. The offending error is saved in the COPY exceptions or rejected data file.

Understanding row rejections and rollback errors

Depending on the type of error that COPY encounters, Vertica does one of the following:

  • Rejects the offending row and loads other rows into a table

  • Rolls back the entire COPY statement without loading any data

COPY cannot parse rows that contain any of the following:

  • Incompatible data types

  • Missing fields

  • Missing delimiters

COPY rolls back a load if it encounters any of these conditions:

  • Server-side errors, such as lack of memory

  • Primary key or foreign key constraint violations

  • Loading NULL data into a NOT NULL column

  • Transformation errors (by default)

This example illustrates what happens when Vertica cannot coerce a row to the requested data type. For example, in the following COPY statement, "a::INT + b::INT" is a SQL expression in which a and b are derived values:

=> CREATE TABLE t (i INT);
=> COPY t (a FILLER VARCHAR, b FILLER VARCHAR, i AS a::INT + b::INT)
   FROM STDIN;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> cat|dog
>> \.

Vertica cannot parse the row to the requested data type and rejects the row:

ERROR 2827:  Could not convert "cat" from column "*FILLER*".a to an int8

If a resolved to 'cat' and b to 'dog', the next expression 'cat'::INT + 'dog'::INT would return an expression evaluator error:

=> SELECT 'cat'::INT + 'dog'::INT;
ERROR 3681:  Invalid input syntax for integer: "cat"

The following COPY statement would also roll back because Vertica cannot parse the row to the requested data type:

=> COPY t (a FILLER VARCHAR, i AS a::INT) FROM STDIN;

In the following COPY statement, Vertica rejects only the offending row without rolling back the statement. Instead of evaluating the 'cat' row as a VARCHAR type, COPY parses 'cat' directly as an INTEGER.

=> COPY t (a FILLER INT, i AS a) FROM STDIN;

In the following example, transformation errors are rejected instead of aborting the load.

=> ALTER DATABASE DEFAULT SET CopyFaultTolerantExpressions = 1;
ALTER DATABASE

=> CREATE TABLE sales (price INTEGER);
CREATE TABLE

=> COPY sales (f FILLER VARCHAR, price AS f::INT)
   FROM STDIN REJECTED DATA AS TABLE sales_rej;

=> COPY sales FROM STDIN REJECTED DATA AS TABLE sales_rej;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>>1
>>2a
>>3
>>\.

=> SELECT * FROM sales;
 price
-------
     1
     3
(2 rows)

=> SELECT rejected_data, rejected_reason FROM sales_rej;
 rejected_data |                rejected_reason
---------------+-----------------------------------------------
 Tuple (2a)    | ERROR 2827:  Could not convert "2a" from column "*FILLER*".f to an int8
(1 row)

See also

1 - Saving load rejections (REJECTED DATA)

COPY load rejections are data rows that did not load due to a parser exception or, optionally, transformation error.

Load rejections are data rows that COPY did not load due to a parser exception or, optionally, transformation error. By default Vertica saves information about rejections in files on database nodes. A better approach is to save rejections to a table.

By default, if you do not specify a rejected data file, COPY saves rejected data files to this location:

catalog_dir/CopyErrorLogs/target_table-source-copy-from-rejected-data.sequence-number
catalog_dir
The database catalog files directory, for example /home/dbadmin/VMart/v_vmart_node0001_catalog.
target_table
The table into which data was loaded.
source
The source of the load data, which can be STDIN or a file name, such as baseball.csv.
copy-from-rejected-data.sequence-number
The default name for a rejected data file, followed by a numeric suffix, indicating the number of files, such as .1, .2, .3. For example, this default file name indicates file 3 after loading from STDIN: fw-STDIN-copy-from-rejected-data.3.

Saving rejected data to the default location, or to a location of your choice, lets you review the file contents, resolve problems, and reload the data from the rejected data files. Saving rejected data to a table lets you query the table to see rejected data rows and the reasons (exceptions) why the rows could not be parsed. Vertica recommends saving rejected data to a table.

Multiple rejected data files

Unless a load is very small (< 10MB), COPY creates more than one file to hold rejected rows. Several factors determine how many files COPY creates for rejected data, including:

  • Number of sources being loaded

  • Total number of rejected rows

  • Size of the source file (or files)

  • Cooperative parsing and number of threads being used

  • UDLs that support apportioned loads

  • For your own parser, the number of objects returned from prepareUDSources()

Naming conventions for rejected files

You can specify one or more files for rejected data using the REJECTED DATA clause. If you do so, and COPY requires multiple files for rejected data, COPY uses the rejected data file names you supply as a prefix and appends numeric suffixes. For example, if you specify REJECTED DATA my_rejects, and the file you are loading is large enough (> 10MB), rejections are written to several files named my_rejects-1, my_rejects-2, and so on.

By default COPY uses cooperative parsing, which means a node uses multiple threads to load portions in parallel. Depending on the file or portion size, each thread generates at least one rejected data file per source file or portion, and returns load results to the initiator node. The file suffix is a thread index when COPY uses multiple threads (.1, .2, .3, and so on).

If you use COPY with a UDL that supports apportioned load, the file suffix is an offset value. UDLs that support apportioned loading render cooperative parsing unnecessary. For apportioned loads, COPY creates at least one rejected file per data portion, and more files depending on the size of the load and number of rejected rows.

For all data loads except COPY LOCAL, COPY behaves as follows:

If no rejected data file is specified:

  • For a single data file or STDIN, COPY stores one or more rejected data files in the default location.

  • For multiple source files, COPY stores all rejected data in separate files in the default directory, using the source file as a filename prefix.

  • Rejected data files are returned to the initiator node.

If a rejected data file is specified:

  • For one data file, COPY interprets the rejected data path as a file, and stores all rejected data at the location. If more than one file is required from parallel processing, COPY appends a numeric suffix. If the path is not a file, COPY returns an error.

  • For multiple source files, COPY interprets the rejected path as a directory. COPY stores all information in separate files, one for each source. If the path is not a directory, COPY returns an error.

    COPY accepts only one path per node. For example, if you specify the rejected data path as my_rejected_data, COPY creates a directory of that name on each node. If you provide more than one rejected data path, COPY returns an error.

  • Rejected data files are not shipped to the initiator node.

Maximum length of file names

Loading multiple input files in one statement requires specifying full path names for each file. Keep in mind that long input file names, combined with rejected data file names, can exceed the operating system's maximum length (typically 255 characters). To work around file names that exceed the maximum length, use a path for the rejected data file that differs from the default path—for example, /tmp/<shorter-file-name>.

2 - Saving rejected data to a table

Use the REJECTED DATA parameter with the AS TABLE clause to specify a table in which to save rejected data.

Use the REJECTED DATA AS TABLE option to collect rejected data, including exceptions, in a table that can be queried. REJECTED DATA AS TABLE is incompatable with rejection or exception paths.

When you use REJECTED DATA AS TABLE, Vertica creates a new table if one does not exist, or appends to an existing table from a previous load. Vertica always creates the table if it does not already exist, even if the load produces no rejections. You cannot create the table directly.

Load rejection tables do not support DML and DDL operations and are not K-safe. You can query and drop them.

To make the data in a rejected table K-safe, you can copy it to another table using CREATE TABLE AS:

=> CREATE TABLE save_rejections AS SELECT * FROM rejected_data_table;

Alternatively, you can copy data to another table using INSERT SELECT.

Location of rejected data table records

When you save rejected records to a table, the data for the table is saved in a database data subdirectory, RejectionTableData. For example, a VMart database uses a path like /home/dbadmin/VMart/v_vmart_node0001_data/RejectionTableData.

If you need to make rejected data available from outside of Vertica, you can export the rejections table to a shared file system.

Vertica suggests that you periodically drop any rejected data tables that you no longer require.

Using COPY NO COMMIT

If the COPY statement includes the NO COMMIT option, and the rejections table does not already exist, Vertica saves the rejected data table as a LOCAL TEMP table.

Rejected-data tables are useful for Extract-Load-Transform workflows, where you will likely use temporary tables more frequently. The rejected-data tables let you quickly load data and identify which records failed to load. If you load data into a temporary table that you created using the ON COMMIT DELETE clause, the COPY operation will not commit.

Querying a rejected data table

After a load, you can query the corresponding rejections table for information about failures. For example, consider the following table definition and COPY statement:

=> CREATE TABLE loader(x INT)
CREATE TABLE
    
=> COPY loader FROM STDIN REJECTED DATA AS TABLE loader_rejects;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> 1
>> 2
>> 3
>> a
>> \.

Three rows were successfully loaded:

=> SELECT * FROM loader;
 x
---
 1
 2
 3
(3 rows)

The loader_rejects table records the failed row. The data that was rejected is in the rejected_data column and the reason for the rejection is in rejected_reason:

=> SELECT * FROM loader_rejects;
-[ RECORD 1 ]-------------+--------------------------------------------
node_name                 | v_vmart_node0001
file_name                 | STDIN
session_id                | v_vmart_node0001.example.-24016:0x3439
transaction_id            | 45035996274080923
statement_id              | 1
batch_number              | 0
row_number                | 4
rejected_data             | a
rejected_data_orig_length | 1
rejected_reason           | Invalid integer format 'a' for column 1 (x)

The rejected data table has the following columns:

Column Data Type Description
node_name VARCHAR The name of the Vertica node on which the input load file was located.
file_name VARCHAR The name of the file being loaded, which applies if you loaded a file (as opposed to using STDIN).
session_id VARCHAR The session ID number in which the COPY statement occurred.
transaction_id INTEGER Identifier for the transaction within the session, if any; otherwise NULL.
statement_id INTEGER

The unique identification number of the statement within the transaction that included the rejected data.

batch_number INTEGER INTERNAL USE. Represents which batch (chunk) the data comes from.
row_number INTEGER

The rejected row number from the input file, or -1 if it could not be determined. The value can be -1 when using cooperative parse.

Each parse operation resets the row number, so in an apportioned load, there can be several entries with the same row number but different rows.

rejected_data LONG VARCHAR The data that was not loaded.
rejected_data_orig_length INTEGER The length of the rejected data.
rejected_reason VARCHAR The error that caused the rejected row. This column returns the same message that exists in a load exceptions file when you do not save to a table.

Correcting rejected data

If a load rejects data, you can save the rejected data in a file, correct the errors, and reload the updated file. First, direct vsql to show only tuples and set the output format:

=> \t
Showing only tuples.
=> \a
Output format is unaligned.

Then query the rejections table and direct output to a file:

=> \o rejected.txt
=> SELECT rejected_data FROM loader_rejects;
=> \o

The saved file has the following output:

=> \! cat rejected.txt
a

You can edit the file to correct the errors and then use COPY to load the updated file.

3 - Saving load exceptions (EXCEPTIONS)

COPY exceptions consist of informational messages describing why a row of data could not be parsed.

COPY exceptions consist of informational messages describing why a row of data could not be parsed. The EXCEPTIONS option lets you specify a file to which COPY writes exceptions. If you omit this option, COPY saves exception files to the following path: catalog-dir/CopyErrorLogs/tablename-sourcefilename-copy-from-exceptions, where:

  • catalog-dir is the directory holding the database catalog files

  • table is the name of the table being loaded into

  • sourcefile is the name of the file being loaded

The file produced by the EXCEPTIONS option indicates the line number and the reason for each exception.

If copying from STDIN, the source file name is STDIN.

You can specify rejected data and exceptions files for individual files in a data load. Separate rejected data and exception file names with commas in the COPY statement.

You must specify a filename in the path to load multiple input files. Keep in mind that long table names combined with long data file names can exceed the operating system's maximum length (typically 255 characters). To work around file names exceeding the maximum length, use a path for the exceptions file that differs from the default path; for example, /tmp/<shorter-file-name>.

If you specify an EXCEPTIONS path:

  • For one data file, the path must be a file, and COPY stores all information in this file.

  • For multiple data files, the path must be a directory. COPY creates one file in this directory for each data file.

  • Exceptions files are not stored on the initiator node.

  • You can specify only one path per node.

If you do not specify the EXCEPTIONS path, COPY stores exception files in the default directory.

4 - COPY rejected data and exception files

When executing a COPY statement, and parallel processing is ON (the default setting), COPY creates separate threads to process load files.

When executing a COPY statement, and parallel processing is ON (the default setting), COPY creates separate threads to process load files. Typically, the number of threads depends on the number of node cores in the system. Each node processes part of the load data. If the load succeeds overall, any parser rejections that occur during load processing are written to that node's specific rejected data and exceptions files. If the load fails, the rejected data file contents can be incomplete, or empty. If you do not specify a file name explicitly, COPY uses a default name and location for rejected data files. See the next topic for specifying your own rejected data and exception files.

Both rejected data and exceptions files are saved and stored on a per-node basis. This example uses multiple files as COPY inputs. Since the statement does not include either the REJECTED DATA or EXCEPTIONS parameters, rejected data and exceptions files are written to the default location, the database catalog subdirectory, CopyErrorLogs, on each node:

\set dir `pwd`/data/ \set remote_dir /vertica/test_dev/tmp_ms/
\set file1 '''':dir'C1_large_tbl.dat'''
\set file2 '''':dir'C2_large_tbl.dat'''
\set file3 '''':remote_dir'C3_large_tbl.dat'''
\set file4 '''':remote_dir'C4_large_tbl.dat'''
=>COPY large_tbl FROM :file1 ON site01,:file2 ON site01,
               :file3 ON site02,
               :file4 ON site02
               DELIMITER '|';

Specifying rejected data and exceptions files

The optional COPY REJECTED DATA and EXCEPTIONS parameters 'path' element lets you specify a non-default path in which to store the files.

If path resolves to a storage location, and the user invoking COPY is not a superuser, these are the required permissions:

Both parameters also have an optional ON nodename clause that uses the specified path:

...[ EXCEPTIONS 'path' [ ON nodename ] [, ...] ]...[ REJECTED DATA 'path' [ ON nodename ] [, ...] ]

While 'path' specifies the location of the rejected data and exceptions files (with their corresponding parameters), the optional ON nodename clause moves any existing rejected data and exception files on the node to the specified path on the same node.

Saving rejected data and exceptions files to a single server

The COPY statement does not have a facility to merge exception and rejected data files after COPY processing is complete. To see the contents of exception and rejected data files requires accessing each node's specific files.

Using VSQL variables for rejected data and exceptions files

This example uses vsql variables to specify the path and file names to use with the exceptions and rejected data parameters (except_s1 and reject_s1). The COPY statement specifies a single input file (large_tbl) on the initiator node:

\set dir `pwd`/data/ \set file1 '''':dir'C1_large_tbl.dat'''
\set except_s1 '''':dir'exceptions'''
\set reject_s1 '''':dir'rejections'''

COPY large_tbl FROM :file1 ON site01 DELIMITER '|'
REJECTED DATA :reject_s1 ON site01
EXCEPTIONS :except_s1 ON site01;

This example uses variables to specify exception and rejected date files (except_s2 and reject_s2) on a remote node. The COPY statement consists of a single input file on a remote node (site02):

\set remote_dir /vertica/test_dev/tmp_ms/\set except_s2 '''':remote_dir'exceptions'''
\set reject_s2 '''':remote_dir'rejections'''

COPY large_tbl FROM :file1 ON site02 DELIMITER '|'
REJECTED DATA :reject_s2 ON site02
EXCEPTIONS :except_s2 ON site02;

This example uses variables to specify that the exception and rejected data files are on a remote node (indicated by :remote_dir). The inputs to the COPY statement consist of multiple data files on two nodes (site01 and site02). The exceptions and rejected data options use the ON nodename clause with the variables to indicate where the files reside (site01 and site02):

\set dir `pwd`/data/ \set remote_dir /vertica/test_dev/tmp_ms/
\set except_s1 '''':dir''''
\set reject_s1 '''':dir''''
\set except_s2 '''':remote_dir''''
\set reject_s2 '''':remote_dir''''
COPY large_tbl FROM :file1 ON site01,
               :file2 ON site01,
               :file3 ON site02,
               :file4 ON site02
               DELIMITER '|'
               REJECTED DATA :reject_s1 ON site01, :reject_s2 ON site02
               EXCEPTIONS :except_s1 ON site01, :except_s2 ON site02;

5 - COPY LOCAL rejection and exception files

Invoking COPY LOCAL (or COPY LOCAL FROM STDIN) does not automatically create rejected data and exceptions files.

Invoking COPY LOCAL (or COPY LOCAL FROM STDIN) does not automatically create rejected data and exceptions files. This behavior differs from using COPY, which saves both files automatically, regardless of whether you use the optional REJECTED DATA and EXCEPTIONS parameters to specify either file explicitly.

Use the REJECTED DATA and EXCEPTIONS parameters with COPY LOCAL and COPY LOCAL FROM STDIN to save the corresponding output files on the client. If you do not use these options, rejected data parsing events (and the exceptions that describe them) are not retained, even if they occur.

You can load multiple input files using COPY LOCAL (or COPY LOCAL FROM STDIN). If you also use the REJECTED DATA and EXCEPTIONS options, the statement writes rejected rows and exceptions and to separate files. The respective files contain all rejected rows and corresponding exceptions, respectively, regardless of how many input files were loaded.

If COPY LOCAL does not reject any rows, it does not create either file.

Specifying rejected data and exceptions files

To save any rejected data and their exceptions to files:

  1. In the COPY LOCAL (and COPY LOCAL FROM STDIN) statement, use the REJECTED DATA 'path' and the EXCEPTIONS 'path' parameters, respectively.

  2. Specify two different file names for the two options. You cannot use one file for both the REJECTED DATA and the EXCEPTIONS.

  3. When you invoke COPY LOCAL or COPY LOCAL FROM STDIN, the files you specify need not pre-exist. If they do, COPY LOCAL must be able to overwrite them.

You can specify the path and file names with vsql variables:

\set rejected ../except_reject/copyLocal.rejected
\set exceptions ../except_reject/copyLocal.exceptions

When you use the COPY LOCAL or COPY LOCAL FROM STDIN statement, specify the variable names for the files with their corresponding parameters:

=> COPY large_tbl FROM LOCAL rejected data :rejected exceptions :exceptions;
=> COPY large_tbl FROM LOCAL STDIN rejected data :rejected exceptions :exceptions;