Saving load rejections (REJECTED DATA)
Load rejections are data rows that COPY did not load due to a parser exception or, optionally, a transformation error. By default, Vertica saves information about rejections in files on database nodes. A better approach is to save rejections to a table.
If you do not specify a rejected data file, COPY saves rejected data files to this default location:
catalog_dir/CopyErrorLogs/target_table-source-copy-from-rejected-data.sequence-number
catalog_dir
- The database catalog files directory, for example /home/dbadmin/VMart/v_vmart_node0001_catalog.
target_table
- The table into which data was loaded.
source
- The source of the load data, which can be STDIN or a file name, such as baseball.csv.
copy-from-rejected-data.sequence-number
- The default name for a rejected data file, followed by a numeric suffix that identifies the file in the sequence, such as .1, .2, .3. For example, this default file name indicates file 3 after loading from STDIN into a table named fw: fw-STDIN-copy-from-rejected-data.3
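As a rough illustration of how the pieces of the default location fit together, the following Python sketch composes the path from its components. The function name default_rejected_path and its parameters are hypothetical helpers for this example, not part of Vertica. How Vertica derives the source component from a full input path is not covered here, so the sketch simply uses whatever string it is given.

```python
from pathlib import PurePosixPath


def default_rejected_path(catalog_dir: str, target_table: str,
                          source: str, sequence_number: int) -> str:
    """Compose the default rejected-data path from the documented pattern.

    Illustrative only: this mirrors the pattern
    catalog_dir/CopyErrorLogs/target_table-source-copy-from-rejected-data.sequence-number
    and does not call Vertica.
    """
    file_name = f"{target_table}-{source}-copy-from-rejected-data.{sequence_number}"
    return str(PurePosixPath(catalog_dir) / "CopyErrorLogs" / file_name)


# File 3 after loading from STDIN into table fw, as in the example above.
example = default_rejected_path(
    "/home/dbadmin/VMart/v_vmart_node0001_catalog", "fw", "STDIN", 3
)
print(example)
```

The printed path ends in fw-STDIN-copy-from-rejected-data.3, matching the default file name shown above.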
Saving rejected data to the default location, or to a location of your choice, lets you review the file contents, resolve problems, and reload the data from the rejected data files. Saving rejected data to a table lets you query the table to see rejected data rows and the reasons (exceptions) why the rows could not be parsed. Vertica recommends saving rejected data to a table.
Multiple rejected data files
Unless a load is very small (< 10MB), COPY creates more than one file to hold rejected rows. Several factors determine how many files COPY creates for rejected data, including:
- Number of sources being loaded
- Total number of rejected rows
- Size of the source file (or files)
- Cooperative parsing and number of threads being used
- UDLs that support apportioned loads
- For your own parser, the number of objects returned from prepareUDSources()
Naming conventions for rejected files
You can specify one or more files for rejected data using the REJECTED DATA clause. If you do so, and COPY requires multiple files for rejected data, COPY uses the rejected data file names you supply as a prefix and appends numeric suffixes. For example, if you specify REJECTED DATA my_rejects, and the file you are loading is large enough (> 10MB), rejections are written to several files named my_rejects-1, my_rejects-2, and so on.
By default, COPY uses cooperative parsing, which means a node uses multiple threads to load portions in parallel. Depending on the file or portion size, each thread generates at least one rejected data file per source file or portion, and returns load results to the initiator node. The file suffix is a thread index when COPY uses multiple threads (.1, .2, .3, and so on).
If you use COPY with a UDL that supports apportioned load, the file suffix is an offset value. UDLs that support apportioned loading render cooperative parsing unnecessary. For apportioned loads, COPY creates at least one rejected file per data portion, and more files depending on the size of the load and number of rejected rows.
For all data loads except COPY LOCAL, COPY behaves as follows:
If no rejected data file is specified:
- For a single data file or STDIN, COPY stores one or more rejected data files in the default location.
- For multiple source files, COPY stores all rejected data in separate files in the default directory, using the source file as a filename prefix.
- Rejected data files are returned to the initiator node.
If a rejected data file is specified:
- For one data file, COPY interprets the rejected data path as a file, and stores all rejected data at the location. If more than one file is required from parallel processing, COPY appends a numeric suffix. If the path is not a file, COPY returns an error.
- For multiple source files, COPY interprets the rejected path as a directory. COPY stores all information in separate files, one for each source. If the path is not a directory, COPY returns an error. COPY accepts only one path per node. For example, if you specify the rejected data path as my_rejected_data, COPY creates a directory of that name on each node. If you provide more than one rejected data path, COPY returns an error.
- Rejected data files are not shipped to the initiator node.
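The rules for loads other than COPY LOCAL can be condensed into a small decision function. The Python sketch below is only a restatement of those rules for quick reference. The names RejectedPlan and plan_rejected_data are hypothetical, and the function does not inspect the file system or talk to Vertica.

```python
from typing import NamedTuple, Optional


class RejectedPlan(NamedTuple):
    location: str                   # "default" or the path you supplied
    path_treated_as: Optional[str]  # "file", "directory", or None for the default
    returned_to_initiator: bool     # whether files are returned to the initiator node


def plan_rejected_data(num_sources: int, rejected_path: Optional[str]) -> RejectedPlan:
    """Restate the documented behavior for loads other than COPY LOCAL."""
    if num_sources < 1:
        raise ValueError("a load needs at least one source")
    if rejected_path is None:
        # No REJECTED DATA path: files go to the default location and
        # are returned to the initiator node.
        return RejectedPlan("default", None, True)
    # A path was given: one source means the path is a file,
    # several sources mean the path is a directory on each node.
    kind = "file" if num_sources == 1 else "directory"
    return RejectedPlan(rejected_path, kind, False)


print(plan_rejected_data(1, None))
print(plan_rejected_data(3, "my_rejected_data"))
```

The second call corresponds to the my_rejected_data example: with several sources, the path is treated as a directory and the files stay on the nodes that produced them.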
Maximum length of file names
Loading multiple input files in one statement requires specifying full path names for each file. Keep in mind that long input file names, combined with rejected data file names, can exceed the operating system's maximum length (typically 255 characters). To work around file names that exceed the maximum length, use a path for the rejected data file that differs from the default path, for example /tmp/<shorter-file-name>.
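Before running a load, you can check whether a candidate rejected data file name would exceed the limit. The Python sketch below is a minimal check under two assumptions: the limit is the typical 255 quoted above (the real value depends on the file system), and the limit applies to the final component of the path, measured in bytes. The helper name name_too_long and the sample file names are hypothetical.

```python
import os

# Typical limit quoted in the text; an assumption for any particular system.
TYPICAL_NAME_LIMIT = 255


def name_too_long(path: str, limit: int = TYPICAL_NAME_LIMIT) -> bool:
    """Return True if the final component of path is longer than limit bytes."""
    return len(os.path.basename(path).encode("utf-8")) > limit


# A long input file name combined with the default rejected-data suffix.
long_source = "x" * 240 + ".csv"
default_name = f"fw-{long_source}-copy-from-rejected-data.1"

print(name_too_long(default_name))         # True: 273 bytes
print(name_too_long("/tmp/fw_rejects.1"))  # False
```

On POSIX systems, os.pathconf(directory, "PC_NAME_MAX") reports the actual limit for a given directory, which can replace the assumed constant.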