Saving load rejections (REJECTED DATA)

COPY load rejections are data rows that did not load due to a parser exception or, optionally, transformation error.

COPY load rejections are data rows that did not load due to a parser exception or, optionally, transformation error. By default, if you do not specify a rejected data file, COPY saves rejected data files to this location:

catalog_dir/CopyErrorLogs/target_table-source-copy-from-rejected-data.`*`n`*
catalog_dir/

The database catalog files directory, for example:

/home/dbadmin/VMart/v_vmart_node0001_catalog

target_table The table into which data was loaded (target_table).
source The source of the load data, which can be STDIN, or a file name, such as baseball.csv.
copy-from-rejected-data.n

The default name for a rejected data file, followed by n suffix, indicating the number of files, such as .1, .2, .3. For example, this default file name indicates file 3 after loading from STDIN:

fw-STDIN-copy-from-rejected-data.3.

Saving rejected data to the default location, or to a location of your choice, lets you review the file contents, resolve problems, and reload the data from the rejected data files. Saving rejected data to a table, lets you query the table to see rejected data rows and the reasons (exceptions) why the rows could not be parsed. Vertica recommends saving rejected data to a table.

Multiple rejected data files

Unless a load is very small (< 10MB), COPY creates more than one file to hold rejected rows. Several factors determine how many files COPY creates for rejected data. Here are some of the factors:

  • Number of sources being loaded

  • Total number of rejected rows

  • Size of the source file (or files)

  • Cooperative parsing and number of threads being used

  • UDLs that support apportioned loads

  • For your own COPY parser, the number of objects returned from prepareUDSources()

Naming conventions for rejected files

You can specify one or more rejected data files with the files you are loading. Use the REJECTED DATA parameter to specify a file location and name, and separate consecutive rejected data file names with a comma (,). Do not use the ON ANY NODE option because it is applicable only to load files.

If you specify one or more files, and COPY requires multiple files for rejected data, COPY uses the rejected data file names you supply as a prefix, and appends a numeric suffix to each rejected data file. For example, if you specify the name my_rejects for the REJECTED_DATA parameter, and the file you are loading is large enough (> 10MB), several files such as the following will exist:

  • my_rejects-1

  • my_rejects-2

  • my_rejects-3

COPY uses cooperative parsing by default, having the nodes parse a specific part of the file contents. Depending on the file or portion size, each thread generates at least one rejected data file per source file or portion, and returns load results to the initiator node. The file suffix is a thread index when COPY uses multiple threads (.1, .2, .3, and so on).

The maximum number of rejected data files cannot be greater than the number of sources being loaded, per thread to parse any portion. The resource pool determines the maximum number of threads. For cooperative parse, use all available threads.

If you use COPY with a UDL that supports apportioned load, the file suffix is an offset value. UDL's that support apportioned loading render cooperative parsing unnecessary. For apportioned loads, COPY creates at least one rejected file per data portion, and more files depending on the size of the load and number of rejected rows.

For all data loads except COPY LOCAL, COPY behaves as follows:

No rejected data file specified... Rejected data file specified...
For a single data file (pathToData or STDIN), COPY stores one or more rejected data files in the default location. For one data file, COPY interprets the rejected data path as a file, and stores all rejected data at the location. If more than one files is required from parallel processing, COPY appends a numeric suffix. If the path is not a file, COPY returns an error.
For multiple source files, COPY stores all rejected data in separate files in the default directory, using the source file as a file name prefix, as noted.

For multiple source files, COPY interprets the rejected path as a directory. COPY stores all information in separate files, one for each source. If path is not a directory, COPY returns an error.

COPY accepts only one path per node. For example, if you specify the rejected data path as my_rejected_data, COPY creates a directory of that name on each node. If you provide more than one rejected data path, COPY returns an error.

Rejected data files are returned to the initiator node. Rejected data files are not shipped to the initiator node.

Maximum length of file names

Loading multiple input files in one statement requires specifying full path names for each file. Keep in mind that long input file names, combined with rejected data file names, can exceed the operating system's maximum length (typically 255 characters). To work around file names that exceed the maximum length, use a path for the rejected data file that differs from the default path—for example, \tmp\<shorter-file-name>.