This is the multi-page printable view of this section. Click here to print.

Return to the regular view of this page.

Handling messy data

Loading data with COPY has two main phases, parsing and loading.

Loading data with COPY has two main phases, parsing and loading. During parsing, if COPY encounters errors it rejects the faulty data and continues loading data. Rejected data is created whenever COPY cannot parse a row of data. Following are some parser errors that can cause a rejected row:

  • Unsupported parser options

  • Incorrect data types for the table into which data is being loaded, including incorrect data types for members of collections

  • Malformed context for the parser in use

  • Missing delimiters

Optionally, COPY can reject data and continue loading when transforming data during the load phase. This behavior is controlled by a configuration parameter. By default, COPY aborts a load if it encounters errors during the loading phase.

Several optional parameters let you determine how strictly COPY handles rejections. For example, you can have COPY fail when it rejects a single row, or allow a specific number of rejections before the load fails. This section presents the parameters to determine how COPY handles rejected data.

Save rejected rows (REJECTED DATA and EXCEPTIONS)

The COPY statement automatically saves a copy of each rejected row in a rejected-data file. COPY also saves a corresponding explanation of what caused the rejection in an exceptions file. By default, Vertica saves both files in a database catalog subdirectory, called CopyErrorLogs, as shown in this example:

v_mart_node003_catalog\CopyErrorLogs\trans-STDIN-copy-from-rejected-data.1
v_mart_node003_catalog\CopyErrorLogs\trans-STDIN-copy-from-exceptions.1

You can optionally save COPY rejections and exceptions in one of two other ways:

  • Use the REJECTED DATA reject_path and EXCEPTIONS except_path parameters to save both outputs to locations of your choice. REJECTED DATA records rejected rows, while EXCEPTIONS records a description of why each row was rejected. If a path value is an existing directory or ends in '/', or the load includes multiple sources, files are written in that directory. (COPY creates the directory if it does not exist.) If a path value is a file, COPY uses it as a file prefix if multiple files are written.

  • Use the REJECTED DATA AS TABLE reject_table clause. This option writes both the rejected data and the exception descriptions to the same table. For more information, see Saving rejected data to a table.

If you save rejected data to a table, the table files are stored in the data subdirectory. For example, in a VMart database installation, rejected data table records are stored in the RejectionTableData directory as follows:

=> cd v_mart_node003_data\RejectionTableData\
=> ls
TABLE_REJECTED_RECORDS_"bg"_mytest01.example.-25441:0x6361_45035996273805099_1.1
TABLE_REJECTED_RECORDS_"bg"_mytest01.example.-25441:0x6361_45035996273805113_2.2
.
.
.
TABLE_REJECTED_RECORDS_"delimr"_mytest01.example.-5958:0x3d47_45035996273815749_1.1
TABLE_REJECTED_RECORDS_"delimr"_mytest01.example.-5958:0x3d47_45035996273815749_1.2

COPY LOCAL rejected data

For COPY LOCAL operations, if you use REJECTED DATA or EXCEPTIONS with a file path, the files are written on the client. If you want rejections to be available on all nodes, use REJECTED DATA AS TABLE instead of REJECTED DATA.

Enforce truncating or rejecting rows (ENFORCELENGTH)

When parsing data of type CHAR, VARCHAR, BINARY, or VARBINARY, rows may exceed the target table length. By default, COPY truncates such rows without rejecting them.

Use the ENFORCELENGTH parameter to reject rows that exceed the target table.

For example, loading 'abc' into a table column specified as VARCHAR(2) results in COPY truncating the value to 'ab' and loading it. Loading the same row with the ENFORCELENGTH parameter causes COPY to reject the row.

Specify a maximum number of rejections (REJECTMAX)

The REJECTMAX parameter specifies the maximum number of logical records that can be rejected before a load fails. A rejected row consists of the data that could not be parsed (or optionally transformed) into the corresponding data type during a bulk load. Rejected data does not indicate referential constraints. For information about using constraints, and the option of enforcing constraints during bulk loading, see Constraints.

When the number of rejected records becomes equal to the REJECTMAX value, the load fails. If you do not specify a value for REJECTMAX, or if the value is 0, COPY allows an unlimited number of exceptions to occur.

If you allow COPY to reject rows and proceed when it encounters transformation errors, consider using REJECTMAX to limit the impact. See Handling Transformation Errors.

Handling transformation errors

By default, COPY aborts a load if it encounters errors when performing transformations. This is the default because rejecting transformation errors is potentially more expensive than rejecting parse errors. Sometimes, however, you would prefer to load the data anyway and reject the problematic rows, the way it does for parse errors.

To have COPY treat errors in transformation expressions like parse errors, set the CopyFaultTolerantExpressions configuration parameter to 1. (See General parameters.) Rows that are rejected during transformation, in the expression-evaluation phase of a data load, are written to the same destination as rows rejected during parsing. Use REJECTED DATA or REJECTED DATA AS TABLE to specify the output location.

You might want to enable transformation rejections if your data contains a few bad rows. By enabling these rejections, you can load the majority of your data and proceed. Vertica recommends using REJECTMAX when enabling transformation rejections.

If your data contains many bad values, then the performance for loading the good rows could be worse than with parser errors.

Abort data loads for any error (ABORT ON ERROR)

Using the ABORT ON ERROR argument is the most restrictive way to load data, because no exceptions or rejections are allowed. A COPY operation stops if any row is rejected. No data is loaded and Vertica rolls back the command.

If you use the ABORT ON ERROR as part of a CREATE EXTERNAL TABLE AS COPY FROM statement, the option is used whenever a query references the external table. The offending error is saved in the COPY exceptions or rejected data file.

Understanding row rejections and rollback errors

Depending on the type of error that COPY encounters, Vertica does one of the following:

  • Rejects the offending row and loads other rows into a table

  • Rolls back the entire COPY statement without loading any data

The following table summarizes the reasons for rejected rows or rollbacks.

Rejected Rows Load Rollback

COPY cannot parse rows that contain any of the following:

  • Incompatible data types

  • Missing fields

  • Missing delimiters

COPY rolls back a load if it encounters any of these conditions:

  • Server-side errors, such as lack of memory

  • Primary key or foreign key constraint violations

  • Loading NULL data into a NOT NULL column

  • Transformation errors (by default)

This example illustrates what happens when Vertica cannot coerce a row to the requested data type. For example, in the following COPY statement, "a::INT + b::INT" is a SQL expression in which a and b are derived values:

=> CREATE TABLE t (i INT);
=> COPY t (a FILLER VARCHAR, b FILLER VARCHAR, i AS a::INT + b::INT)
   FROM STDIN;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> cat|dog
>> \.

Vertica cannot parse the row to the requested data type and rejects the row:

ERROR 2827:  Could not convert "cat" from column "*FILLER*".a to an int8

If a resolved to 'cat' and b to 'dog', the next expression 'cat'::INT + 'dog'::INT would return an expression evaluator error:

=> SELECT 'cat'::INT + 'dog'::INT;
ERROR 3681:  Invalid input syntax for integer: "cat"

The following COPY statement would also roll back because Vertica cannot parse the row to the requested data type:

=> COPY t (a FILLER VARCHAR, i AS a::INT) FROM STDIN;

In the following COPY statement, Vertica rejects only the offending row without rolling back the statement. Instead of evaluating the 'cat' row as a VARCHAR type, COPY parses 'cat' directly as an INTEGER.

=> COPY t (a FILLER INT, i AS a) FROM STDIN;

In the following example, transformation errors are rejected instead of aborting the load.

=> ALTER DATABASE DEFAULT SET CopyFaultTolerantExpressions = 1;
ALTER DATABASE

=> CREATE TABLE sales (price INTEGER);
COPY sales FROM STDIN REJECTED DATA AS TABLE sales_rej;
dollars
\.

=> SELECT rejected_data, rejected_reason FROM sales_rej;
 rejected_data |                rejected_reason
---------------+-----------------------------------------------
 dollars       | Invalid integer format 'dollars' for column 1 (price)
(1 row)

See also

1 - Saving load rejections (REJECTED DATA)

COPY load rejections are data rows that did not load due to a parser exception or, optionally, transformation error.

COPY load rejections are data rows that did not load due to a parser exception or, optionally, transformation error. By default, if you do not specify a rejected data file, COPY saves rejected data files to this location:

catalog_dir/CopyErrorLogs/target_table-source-copy-from-rejected-data.`*`n`*
catalog_dir/

The database catalog files directory, for example:

/home/dbadmin/VMart/v_vmart_node0001_catalog

target_table The table into which data was loaded (target_table).
source The source of the load data, which can be STDIN, or a file name, such as baseball.csv.
copy-from-rejected-data.n

The default name for a rejected data file, followed by n suffix, indicating the number of files, such as .1, .2, .3. For example, this default file name indicates file 3 after loading from STDIN:

fw-STDIN-copy-from-rejected-data.3.

Saving rejected data to the default location, or to a location of your choice, lets you review the file contents, resolve problems, and reload the data from the rejected data files. Saving rejected data to a table, lets you query the table to see rejected data rows and the reasons (exceptions) why the rows could not be parsed. Vertica recommends saving rejected data to a table.

Multiple rejected data files

Unless a load is very small (< 10MB), COPY creates more than one file to hold rejected rows. Several factors determine how many files COPY creates for rejected data. Here are some of the factors:

  • Number of sources being loaded

  • Total number of rejected rows

  • Size of the source file (or files)

  • Cooperative parsing and number of threads being used

  • UDLs that support apportioned loads

  • For your own COPY parser, the number of objects returned from prepareUDSources()

Naming conventions for rejected files

You can specify one or more rejected data files with the files you are loading. Use the REJECTED DATA parameter to specify a file location and name, and separate consecutive rejected data file names with a comma (,). Do not use the ON ANY NODE option because it is applicable only to load files.

If you specify one or more files, and COPY requires multiple files for rejected data, COPY uses the rejected data file names you supply as a prefix, and appends a numeric suffix to each rejected data file. For example, if you specify the name my_rejects for the REJECTED_DATA parameter, and the file you are loading is large enough (> 10MB), several files such as the following will exist:

  • my_rejects-1

  • my_rejects-2

  • my_rejects-3

COPY uses cooperative parsing by default, having the nodes parse a specific part of the file contents. Depending on the file or portion size, each thread generates at least one rejected data file per source file or portion, and returns load results to the initiator node. The file suffix is a thread index when COPY uses multiple threads (.1, .2, .3, and so on).

The maximum number of rejected data files cannot be greater than the number of sources being loaded, per thread to parse any portion. The resource pool determines the maximum number of threads. For cooperative parse, use all available threads.

If you use COPY with a UDL that supports apportioned load, the file suffix is an offset value. UDL's that support apportioned loading render cooperative parsing unnecessary. For apportioned loads, COPY creates at least one rejected file per data portion, and more files depending on the size of the load and number of rejected rows.

For all data loads except COPY LOCAL, COPY behaves as follows:

No rejected data file specified... Rejected data file specified...
For a single data file (pathToData or STDIN), COPY stores one or more rejected data files in the default location. For one data file, COPY interprets the rejected data path as a file, and stores all rejected data at the location. If more than one files is required from parallel processing, COPY appends a numeric suffix. If the path is not a file, COPY returns an error.
For multiple source files, COPY stores all rejected data in separate files in the default directory, using the source file as a file name prefix, as noted.

For multiple source files, COPY interprets the rejected path as a directory. COPY stores all information in separate files, one for each source. If path is not a directory, COPY returns an error.

COPY accepts only one path per node. For example, if you specify the rejected data path as my_rejected_data, COPY creates a directory of that name on each node. If you provide more than one rejected data path, COPY returns an error.

Rejected data files are returned to the initiator node. Rejected data files are not shipped to the initiator node.

Maximum length of file names

Loading multiple input files in one statement requires specifying full path names for each file. Keep in mind that long input file names, combined with rejected data file names, can exceed the operating system's maximum length (typically 255 characters). To work around file names that exceed the maximum length, use a path for the rejected data file that differs from the default path—for example, \tmp\<shorter-file-name>.

2 - Saving rejected data to a table

Use the REJECTED DATA parameter with the AS TABLE clause to specify a table in which to save rejected data.

Use the REJECTED DATA parameter with the AS TABLE clause to specify a table in which to save rejected data. Saving rejected data to a file is mutually exclusive with using the AS TABLE clause.

When you use the AS TABLE clause, Vertica creates a new table if one does not exist, or appends to an existing table. If no parsing rejections occur during a load, the table exists but is empty. The next time you load data, Vertica inserts any rejected rows to the existing table.

The load rejection tables are a special type of table with the following capabilities and limitations:

  • Support SELECT statements

  • Can use DROP TABLE

  • Cannot be created outside of a COPY statement

  • Do not support DML and DDL activities

  • Are not K-safe

To make the data in a rejected table K-safe, you can do one of the following:

  • Write a CREATE TABLE..AS statement, such as this example:

    => CREATE TABLE new_table AS SELECT * FROM rejected_table;
    
  • Create a table to store rejected records, and run INSERT..SELECT operations into the new table

Using COPY NO COMMIT

If the COPY statement includes options NO COMMIT and REJECTED DATA AS TABLE, and the reject-table does not already exist, Vertica Analytic Database saves the rejected data table as a LOCAL TEMP table and returns a message that a LOCAL TEMP table is being created.

Rejected-data tables are useful for Extract-Load-Transform workflows, where you will likely use temporary tables more frequently. The rejected-data tables let you quickly load data and identify which records failed to load. If you load data into a temporary table that you created using the ON COMMIT DELETE clause, the COPY operation will not commit.

Location of rejected data table records

When you save rejected records to a table, using the REJECTED DATA AS TABLE table_name option, the data for the table is saved in a database data subdirectory, RejectionTableData. For example, for a VMart database, table data files reside here:

/home/dbadmin/VMart/v_vmart_node0001_data/RejectionTableData

Rejected data tables include both rejected data and the reason for the rejection (exceptions), along with other data columns, described next. Vertica suggests that you periodically drop any rejected data tables that you no longer require.

Querying a rejected data table

When you specify a rejected data table when loading data with COPY, you can query that table for information about rejected data after the load operation is complete. For example:

  1. Create the loader table:

    => CREATE TABLE loader(a INT)
    CREATE TABLE
    
  2. Use COPY to load values, saving rejected data to a table, loader_rejects:

    => COPY loader FROM STDIN REJECTED DATA AS TABLE loader_rejects;
    Enter data to be copied followed by a newline.
    End with a backslash and a period on a line by itself.
    >> 1
    >> 2
    >> 3
    >> a
    >> \.
    
  3. Query the loader table after loading data:

    => SELECT * FROM loader;
     x
    ---
     1
     2
     3
    (3 rows)
    
  4. Query the loader_rejects table to see its column rows:

    => SELECT * FROM loader_rejects;
    -[ RECORD 1 ]-------------+--------------------------------------------
    node_name                 | v_vmart_node0001
    file_name                 | STDIN
    session_id                | v_vmart_node0001.example.-24016:0x3439
    transaction_id            | 45035996274080923
    statement_id              | 1
    batch_number              | 0
    row_number                | 4
    rejected_data             | a
    rejected_data_orig_length | 1
    rejected_reason           | Invalid integer format 'a' for column 1 (x)
    

The rejected data table has the following columns:

Column Data Type Description
node_name VARCHAR The name of the Vertica node on which the input load file was located.
file_name VARCHAR The name of the file being loaded, which applies if you loaded a file (as opposed to using STDIN).
session_id VARCHAR The session ID number in which the COPY statement occurred.
transaction_id INTEGER Identifier for the transaction within the session, if any; otherwise NULL.
statement_id INTEGER

The unique identification number of the statement within the transaction that included the rejected data.

batch_number INTEGER INTERNAL USE. Represents which batch (chunk) the data comes from.
row_number INTEGER

The rejected row number from the input file, or -1 if it could not be determined. The value can be -1 when using cooperative parse.

Each parse operation resets the row number, so in an apportioned load, there can be several entries with the same row number but different rows.

rejected_data LONG VARCHAR The data that was not loaded.
rejected_data_orig_length INTEGER The length of the rejected data.
rejected_reason VARCHAR The error that caused the rejected row. This column returns the same message that exists in a load exceptions file when you do not save to a table.

Exporting the rejected records table

You can export the contents of the column rejected_data to a file to capture only the data rejected during the first COPY statement. Then, correct the data in the file, save it, and load the updated file.

To export rejected records:

  1. Create a sample table:

    => CREATE TABLE t (i int);
    CREATE TABLE
    
  2. Copy data directly into the table, using a table to store rejected data:

    => COPY t FROM STDIN REJECTED DATA AS TABLE t_rejects;
    Enter data to be copied followed by a newline.
    End with a backslash and a period on a line by itself.
    >> 1
    >> 2
    >> 3
    >> 4
    >> a
    >> b
    >> c
    >> \.
    
  3. Show only tuples and set the output format:

    => \t
    Showing only tuples.
    => \a
    Output format is unaligned.
    
  4. Output to a file:

    => \o rejected.txt
    => select rejected_data from t_rejects;
    => \o
    
  5. Use the catcommand on the saved file:

    
    => \! cat rejected.txt
    a
    b
    c
    

After a file exists, you can fix load errors and use the corrected file as load input to the COPY statement.

3 - Saving load exceptions (EXCEPTIONS)

COPY exceptions consist of informational messages describing why a row of data could not be parsed.

COPY exceptions consist of informational messages describing why a row of data could not be parsed. The EXCEPTIONS option lets you specify a file to which COPY writes exceptions. If you omit this option, COPY saves exception files to the following path: catalog-dir/CopyErrorLogs/tablename-sourcefilename-copy-from-exceptions, where:

  • catalog-dir is the directory holding the database catalog files

  • table is the name of the table being loaded into

  • sourcefile is the name of the file being loaded

The file produced by the EXCEPTIONS option indicates the line number and the reason for each exception.

If copying from STDIN, the source file name is STDIN.

You can specify rejected data and exceptions files for individual files in a data load. Separate rejected data and exception file names with commas in the COPY statement.

You must specify a filename in the path to load multiple input files. Keep in mind that long table names combined with long data file names can exceed the operating system's maximum length (typically 255 characters). To work around file names exceeding the maximum length, use a path for the exceptions file that differs from the default path; for example, /tmp/<shorter-file-name>.

If you specify an EXCEPTIONS path:

  • For one data file, the path must be a file, and COPY stores all information in this file.

  • For multiple data files, the path must be a directory. COPY creates one file in this directory for each data file.

  • Exceptions files are not stored on the initiator node.

  • You can specify only one path per node.

If you do not specify the EXCEPTIONS path, COPY stores exception files in the default directory.

4 - COPY rejected data and exception files

When executing a COPY statement, and parallel processing is ON (the default setting), COPY creates separate threads to process load files.

When executing a COPY statement, and parallel processing is ON (the default setting), COPY creates separate threads to process load files. Typically, the number of threads depends on the number of node cores in the system. Each node processes part of the load data. If the load succeeds overall, any parser rejections that occur during load processing are written to that node's specific rejected data and exceptions files. If the load fails, the rejected data file contents can be incomplete, or empty. If you do not specify a file name explicitly, COPY uses a default name and location for rejected data files. See the next topic for specifying your own rejected data and exception files.

Both rejected data and exceptions files are saved and stored on a per-node basis. This example uses multiple files as COPY inputs. Since the statement does not include either the REJECTED DATA or EXCEPTIONS parameters, rejected data and exceptions files are written to the default location, the database catalog subdirectory, CopyErrorLogs, on each node:

\set dir `pwd`/data/ \set remote_dir /vertica/test_dev/tmp_ms/
\set file1 '''':dir'C1_large_tbl.dat'''
\set file2 '''':dir'C2_large_tbl.dat'''
\set file3 '''':remote_dir'C3_large_tbl.dat'''
\set file4 '''':remote_dir'C4_large_tbl.dat'''
=>COPY large_tbl FROM :file1 ON site01,:file2 ON site01,
               :file3 ON site02,
               :file4 ON site02
               DELIMITER '|';

Specifying rejected data and exceptions files

The optional COPY REJECTED DATA and EXCEPTIONS parameters 'path' element lets you specify a non-default path in which to store the files.

If path resolves to a storage location, and the user invoking COPY is not a superuser, these are the required permissions:

Both parameters also have an optional ON nodename clause that uses the specified path:

...[ EXCEPTIONS 'path' [ ON nodename ] [, ...] ]...[ REJECTED DATA 'path' [ ON nodename ] [, ...] ]

While 'path' specifies the location of the rejected data and exceptions files (with their corresponding parameters), the optional ON nodename clause moves any existing rejected data and exception files on the node to the specified path on the same node.

Saving rejected data and exceptions files to a single server

The COPY statement does not have a facility to merge exception and rejected data files after COPY processing is complete. To see the contents of exception and rejected data files requires accessing each node's specific files.

Using VSQL variables for rejected data and exceptions files

This example uses vsql variables to specify the path and file names to use with the exceptions and rejected data parameters (except_s1 and reject_s1). The COPY statement specifies a single input file (large_tbl) on the initiator node:

\set dir `pwd`/data/ \set file1 '''':dir'C1_large_tbl.dat'''
\set except_s1 '''':dir'exceptions'''
\set reject_s1 '''':dir'rejections'''

COPY large_tbl FROM :file1 ON site01 DELIMITER '|'
REJECTED DATA :reject_s1 ON site01
EXCEPTIONS :except_s1 ON site01;

This example uses variables to specify exception and rejected date files (except_s2 and reject_s2) on a remote node. The COPY statement consists of a single input file on a remote node (site02):

\set remote_dir /vertica/test_dev/tmp_ms/\set except_s2 '''':remote_dir'exceptions'''
\set reject_s2 '''':remote_dir'rejections'''

COPY large_tbl FROM :file1 ON site02 DELIMITER '|'
REJECTED DATA :reject_s2 ON site02
EXCEPTIONS :except_s2 ON site02;

This example uses variables to specify that the exception and rejected data files are on a remote node (indicated by :remote_dir). The inputs to the COPY statement consist of multiple data files on two nodes (site01 and site02). The exceptions and rejected data options use the ON nodename clause with the variables to indicate where the files reside (site01 and site02):

\set dir `pwd`/data/ \set remote_dir /vertica/test_dev/tmp_ms/
\set except_s1 '''':dir''''
\set reject_s1 '''':dir''''
\set except_s2 '''':remote_dir''''
\set reject_s2 '''':remote_dir''''
COPY large_tbl FROM :file1 ON site01,
               :file2 ON site01,
               :file3 ON site02,
               :file4 ON site02
               DELIMITER '|'
               REJECTED DATA :reject_s1 ON site01, :reject_s2 ON site02
               EXCEPTIONS :except_s1 ON site01, :except_s2 ON site02;

5 - COPY LOCAL rejection and exception files

Invoking COPY LOCAL (or COPY LOCAL FROM STDIN) does not automatically create rejected data and exceptions files.

Invoking COPY LOCAL (or COPY LOCAL FROM STDIN) does not automatically create rejected data and exceptions files. This behavior differs from using COPY, which saves both files automatically, regardless of whether you use the optional REJECTED DATA and EXCEPTIONS parameters to specify either file explicitly.

Use the REJECTED DATA and EXCEPTIONS parameters with COPY LOCAL and COPY LOCAL FROM STDIN to save the corresponding output files on the client. If you do not use these options, rejected data parsing events (and the exceptions that describe them) are not retained, even if they occur.

You can load multiple input files using COPY LOCAL (or COPY LOCAL FROM STDIN). If you also use the REJECTED DATA and EXCEPTIONS options, the statement writes rejected rows and exceptions and to separate files. The respective files contain all rejected rows and corresponding exceptions, respectively, regardless of how many input files were loaded.

If COPY LOCAL does not reject any rows, it does not create either file.

Specifying rejected data and exceptions files

To save any rejected data and their exceptions to files:

  1. In the COPY LOCAL (and COPY LOCAL FROM STDIN) statement, use the REJECTED DATA 'path' and the EXCEPTIONS 'path' parameters, respectively.

  2. Specify two different file names for the two options. You cannot use one file for both the REJECTED DATA and the EXCEPTIONS.

  3. When you invoke COPY LOCAL or COPY LOCAL FROM STDIN, the files you specify need not pre-exist. If they do, COPY LOCAL must be able to overwrite them.

You can specify the path and file names with vsql variables:

\set rejected ../except_reject/copyLocal.rejected
\set exceptions ../except_reject/copyLocal.exceptions

When you use the COPY LOCAL or COPY LOCAL FROM STDIN statement, specify the variable names for the files with their corresponding parameters:

=> COPY large_tbl FROM LOCAL rejected data :rejected exceptions :exceptions;
=> COPY large_tbl FROM LOCAL STDIN rejected data :rejected exceptions :exceptions;