Saving load rejections (REJECTED DATA)

Load rejections are data rows that COPY did not load due to a parser exception or, optionally, a transformation error. By default, Vertica saves information about rejections in files on database nodes. A better approach is to save rejections to a table.

By default, if you do not specify a rejected data file, COPY saves rejected data files to a default location. The path and file name are built from these components:

  • The database catalog files directory, for example /home/dbadmin/VMart/v_vmart_node0001_catalog.

  • The table into which data was loaded.

  • The source of the load data, which can be STDIN or a file name, such as baseball.csv.

  • The default name for a rejected data file, followed by a numeric suffix that identifies the file, such as .1, .2, .3. For example, this default file name identifies file 3 after loading from STDIN: fw-STDIN-copy-from-rejected-data.3.
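
As a rough illustration of how these components combine, the following Python sketch rebuilds a file name shaped like the example fw-STDIN-copy-from-rejected-data.3. The function name and the exact joining rule are assumptions inferred from that one example, not a documented Vertica interface, and the sketch does not cover the directory portion of the path.

```python
def default_rejected_file_name(table: str, source: str, index: int) -> str:
    """Compose a name shaped like 'fw-STDIN-copy-from-rejected-data.3'.

    Hypothetical helper: the '<table>-<source>-copy-from-rejected-data.<n>'
    pattern is inferred from the single example in the text.
    """
    if index < 1:
        raise ValueError("index must be a positive integer")
    return f"{table}-{source}-copy-from-rejected-data.{index}"


# Table 'fw' loaded from STDIN, third rejected data file.
print(default_rejected_file_name("fw", "STDIN", 3))
```

A helper like this can be handy when a script needs to locate rejected data files after a load, but the actual names should always be checked on the database nodes.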

Saving rejected data to the default location, or to a location of your choice, lets you review the file contents, resolve problems, and reload the data from the rejected data files. Saving rejected data to a table lets you query the table to see rejected data rows and the reasons (exceptions) why the rows could not be parsed. Vertica recommends saving rejected data to a table.

Multiple rejected data files

Unless a load is very small (< 10MB), COPY creates more than one file to hold rejected rows. Several factors determine how many files COPY creates for rejected data, including:

  • Number of sources being loaded

  • Total number of rejected rows

  • Size of the source file (or files)

  • Cooperative parsing and number of threads being used

  • UDLs that support apportioned loads

  • For your own parser, the number of objects returned from prepareUDSources()

Naming conventions for rejected files

You can specify one or more files for rejected data using the REJECTED DATA clause. If you do so, and COPY requires multiple files for rejected data, COPY uses the rejected data file names you supply as a prefix and appends numeric suffixes. For example, if you specify REJECTED DATA my_rejects, and the file you are loading is large enough (> 10MB), rejections are written to several files named my_rejects-1, my_rejects-2, and so on.

By default, COPY uses cooperative parsing, which means that a node uses multiple threads to load portions of a source in parallel. Depending on the file or portion size, each thread generates at least one rejected data file per source file or portion, and returns load results to the initiator node. When COPY uses multiple threads, the file suffix is a thread index (.1, .2, .3, and so on).

If you use COPY with a UDL that supports apportioned load, the file suffix is an offset value. UDLs that support apportioned loading render cooperative parsing unnecessary. For apportioned loads, COPY creates at least one rejected file per data portion, and more files depending on the size of the load and number of rejected rows.

For all data loads except COPY LOCAL, COPY behaves as follows:

If no rejected data file is specified:

  • For a single data file or STDIN, COPY stores one or more rejected data files in the default location.

  • For multiple source files, COPY stores all rejected data in separate files in the default directory, using the source file as a filename prefix.

  • Rejected data files are returned to the initiator node.

If a rejected data file is specified:

  • For one data file, COPY interprets the rejected data path as a file, and stores all rejected data at the location. If more than one file is required from parallel processing, COPY appends a numeric suffix. If the path is not a file, COPY returns an error.

  • For multiple source files, COPY interprets the rejected path as a directory. COPY stores all information in separate files, one for each source. If the path is not a directory, COPY returns an error.

    COPY accepts only one path per node. For example, if you specify the rejected data path as my_rejected_data, COPY creates a directory of that name on each node. If you provide more than one rejected data path, COPY returns an error.

  • Rejected data files are not shipped to the initiator node.
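
The rules above for loads other than COPY LOCAL can be summarized as a small decision function. This Python sketch only models the documented behavior for reasoning and testing purposes. The names RejectedTarget and rejected_target are hypothetical, and nothing here calls Vertica.

```python
from enum import Enum
from typing import Optional


class RejectedTarget(Enum):
    DEFAULT_LOCATION = "default location"  # no rejected data path given
    FILE = "file"                          # path treated as a file (suffixes may be appended)
    DIRECTORY = "directory"                # path treated as a per-node directory


def rejected_target(num_sources: int, rejected_path: Optional[str]) -> RejectedTarget:
    """Model how COPY (non-LOCAL) interprets the rejected data path."""
    if num_sources < 1:
        raise ValueError("a load needs at least one source")
    if rejected_path is None:
        # Files go to the default location and are returned to the initiator node.
        return RejectedTarget.DEFAULT_LOCATION
    if num_sources == 1:
        # One data file: the path must name a file.
        return RejectedTarget.FILE
    # Multiple sources: the path must name a directory, one file per source.
    return RejectedTarget.DIRECTORY


print(rejected_target(1, None).value)
print(rejected_target(1, "my_rejects").value)
print(rejected_target(3, "my_rejected_data").value)
```

The sketch leaves out the error cases described above, such as a path of the wrong type or more than one rejected data path.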

Maximum length of file names

Loading multiple input files in one statement requires specifying the full path name for each file. Long input file names, combined with rejected data file names, can exceed the operating system's maximum file name length (typically 255 characters). To work around this, specify a rejected data path that differs from the default path, for example /tmp/<shorter-file-name>.
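
Before running a load, a script can check whether a candidate rejected data file name is likely to exceed the limit. The sketch below is a minimal check that assumes the 255 limit mentioned above and assumes the limit is counted in encoded bytes of the final path component, as on common Linux filesystems. The function name is hypothetical.

```python
import posixpath

NAME_LIMIT = 255  # assumption: the "typically 255" limit from the text


def exceeds_name_limit(path: str, limit: int = NAME_LIMIT) -> bool:
    """Return True if the last component of `path` is longer than `limit` bytes."""
    name = posixpath.basename(path)
    # Assumption: the limit applies to UTF-8 encoded bytes, not characters.
    return len(name.encode("utf-8")) > limit


long_source = "x" * 240 + ".csv"
candidate = f"/catalog/dir/fw-{long_source}-copy-from-rejected-data.1"
print(exceeds_name_limit(candidate))            # name is well over 255 bytes
print(exceeds_name_limit("/tmp/short_rejects"))
```

If the check returns True for the default name, choosing a shorter rejected data path up front avoids a failure partway through the load.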