CREATE DATA LOADER
CREATE DATA LOADER creates an automatic data loader that executes a COPY statement when new data files appear in a specified path. The loader records which files have already been successfully loaded and skips them. These events are recorded in the DATA_LOADER_EVENTS system table.
You can execute a data loader in the following ways:
- On demand: Use EXECUTE DATA LOADER to execute the data loader once.
- On a schedule: Use a scheduled stored procedure.
- In response to a notification (AWS SQS only): Define a trigger on the data loader.
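For example, the first two methods might look like the following sketch. All object names are placeholders, and the stored procedure, schedule, and trigger statements are a minimal sketch assuming standard Vertica scheduling syntax:
=> EXECUTE DATA LOADER s.dl1;

=> CREATE PROCEDURE s.run_dl1() AS $$
   BEGIN
       EXECUTE 'EXECUTE DATA LOADER s.dl1';  -- dynamic SQL runs the loader
   END;
   $$;
=> CREATE SCHEDULE s.every_10_minutes USING CRON '*/10 * * * *';
=> CREATE TRIGGER s.dl1_run ON SCHEDULE s.every_10_minutes
   EXECUTE PROCEDURE s.run_dl1() AS DEFINER;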
Executing the loader automatically commits the transaction.
The DATA_LOADERS system table shows all defined loaders. The DATA_LOADER_EVENTS system table records paths that were attempted and their outcomes. To prevent unbounded growth, records in DATA_LOADER_EVENTS are purged after a specified retention interval. If previously-loaded files are still in the source path after this purge, the data loader sees them as new files and loads them again.
Syntax
CREATE [ OR REPLACE ] DATA LOADER [ IF NOT EXISTS ]
[schema.]name
[ RETRY LIMIT { NONE | DEFAULT | limit } ]
[ RETENTION INTERVAL monitoring-retention ]
[ TRIGGERED BY integration-json ]
AS copy-statement
Arguments
OR REPLACE
- If a data loader with the same name in the same schema exists, replace it. The original data loader's monitoring table is dropped.
This option cannot be used with IF NOT EXISTS.
IF NOT EXISTS
- If an object with the same name exists, return without creating the object. If you do not use this directive and the object already exists, Vertica returns with an error message.
The IF NOT EXISTS clause is useful for SQL scripts where you might not know if the object already exists. The ON ERROR STOP directive can be helpful in scripts.
This option cannot be used with OR REPLACE.
schema
- Schema containing the data loader. The default schema is public.
name
- Name of the data loader.
RETRY LIMIT { NONE | DEFAULT | limit }
- Maximum number of times to retry a failing file. Each time the data loader is executed, it attempts to load all files that have not yet been successfully loaded, up to this per-file limit. If set to DEFAULT, at load time the loader uses the value of the DataLoaderDefaultRetryLimit configuration parameter.
Default: DEFAULT
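For example, a superuser can change the database-wide value that DEFAULT resolves to. The parameter name comes from this page; setting it through ALTER DATABASE ... SET PARAMETER is shown as a sketch:
=> ALTER DATABASE DEFAULT SET PARAMETER DataLoaderDefaultRetryLimit = 3;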
RETENTION INTERVAL monitoring-retention
- How long to keep records in the events table. DATA_LOADER_EVENTS records events for all data loaders, but each data loader has its own retention interval.
Default: 14 days
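For example, the following sketch creates a hypothetical loader that keeps its event records for 30 days, assuming a standard interval literal for monitoring-retention:
=> CREATE DATA LOADER s.dl_monthly RETENTION INTERVAL '30 days'
   AS COPY s.local FROM 's3://b/data/*.dat';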
TRIGGERED BY integration-json
- Information about an external input for the data loader, a JSON object with the following keys:
  - provider (required): name of the provider; the only supported value is "AWS-SQS".
  - resourceUrl (required): endpoint for the integration.
  - timeout: number of seconds to wait between loads. The default is 10.
  - minFileNumPerLoad: if the number of files waiting to be loaded reaches this threshold, the load begins immediately instead of waiting for the next scheduled load. The default is 10 files.
  - minTotalSizePerLoad: if the total size of files waiting to be loaded reaches this threshold, the load begins immediately. The value is an integer with a unit of measure, one of K, M, G, or T. The default is 1G.
If a data loader has a trigger, the loader is initially disabled. When you are ready to use it, use ALTER DATA LOADER with the ENABLE option. When the loader is enabled, Vertica executes it automatically when SQS produces notifications for new files. For details, see Triggers.
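For example, with a placeholder loader name:
=> ALTER DATA LOADER s.dl2 ENABLE;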
AS copy-statement
- The COPY statement that the loader executes. The FROM clause typically uses a glob.
Privileges
Non-superuser: CREATE privileges on the schema.
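For example, a superuser can grant this privilege with a standard GRANT statement (user and schema names are placeholders):
=> GRANT CREATE ON SCHEMA s TO load_user;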
Restrictions
- COPY NO COMMIT is not supported.
- Data loaders are executed with COPY ABORT ON ERROR.
- The COPY statement must specify file paths. You cannot use COPY FROM VERTICA.
Triggers (AWS SQS only)
If the data to be loaded is stored on AWS and you have configured an SQS queue, then instead of executing a data loader explicitly using EXECUTE DATA LOADER, you can define the loader to respond to S3 events. When files are added to the AWS bucket, a service running on AWS adds a message to the SQS queue. Vertica reads from this queue, executes the data loader, and, if the load was successful, removes the message from the queue. After the initial setup, this process is automatic.
To configure the data loader to read from SQS, use the TRIGGERED BY option with a JSON object following this template:
{
"provider" : "AWS-SQS",
"resourceUrl" : "https://sqs.region.amazonaws.com/account_number/queue_name"
}
You can set additional values to control timing. See the description of the TRIGGERED BY syntax.
Use the following SQS parameters to authenticate to SQS:
- SQSAuth, to set an ID and key for all queue access.
- SQSQueueCredentials, to set per-queue access. Queue-specific values take precedence over SQSAuth.
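For example, the following sketch sets credentials for all queue access; the 'id:secret' value format is an assumption modeled on other Vertica AWS credential parameters:
=> -- value format 'id:secret' is an assumption here
=> ALTER DATABASE DEFAULT SET PARAMETER SQSAuth = 'id:secret';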
A data loader with an SQS trigger is initially disabled. Use ALTER DATA LOADER with the ENABLE option when you are ready to begin processing events. To pause execution, call ALTER DATA LOADER with the DISABLE option.
You can alter the trigger JSON using ALTER DATA LOADER. You must disable the loader before changing the trigger and then enable it again.
If a data loader has a trigger, you must disable the loader before dropping it.
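For example, to retire a triggered loader (placeholder name):
=> ALTER DATA LOADER s.dl2 DISABLE;
=> DROP DATA LOADER s.dl2;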
File systems
The source path can be any shared file system or object store that all database nodes can access. To use an HDFS or Linux file system safely, you must prevent the loader from reading a partially-written file. One way to achieve this is to execute the loader only when no files are being written to the loader's path. Another is to write new files in a temporary location and move them into the loader's path only when they are complete.
Examples
The following data loader can be executed manually or by using a scheduled stored procedure:
=> CREATE DATA LOADER s.dl1 RETRY LIMIT NONE
AS COPY s.local FROM 's3://b/data/*.dat';
=> SELECT name, schemaname, copystmt, retrylimit FROM data_loaders;
name | schemaname | copystmt | retrylimit
------+------------+---------------------------------------+------------
dl1 | s | COPY s.local FROM 's3://b/data/*.dat' | -1
(1 row)
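You can then run this loader on demand; the output below is illustrative:
=> EXECUTE DATA LOADER s.dl1;
 Rows Loaded
-------------
           3
(1 row)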
The following data loader reads from an SQS queue on AWS and loads new files automatically at least every 10 seconds, or immediately when five files or 100MB of data accumulate:
=> CREATE DATA LOADER s.dl2
TRIGGERED BY '{ "provider" : "AWS-SQS",
"resourceUrl" : "https://sqs.us-east-1.amazonaws.com/accountID/queueName",
"minFileNumPerLoad" : "5",
"minTotalSizePerLoad" : "100M" }'
AS COPY s.local FROM 's3://b/data/*.dat';
See Automatic load for an extended example.