CREATE DATA LOADER
CREATE DATA LOADER creates an automatic data loader that executes a COPY statement when new data files appear in a specified path. The loader records which files have already been successfully loaded and skips them. These events are recorded in the DATA_LOADER_EVENTS system table.
You can execute a data loader in the following ways:
-
On demand: Use EXECUTE DATA LOADER to execute the data loader once.
-
On a schedule: Use a scheduled stored procedure.
-
In response to a notification (AWS SQS only): Define a trigger on the data loader.
Executing the loader automatically commits the transaction.
The DATA_LOADERS system table shows all defined loaders. The DATA_LOADER_EVENTS system table records paths that were attempted and their outcomes. To prevent unbounded growth, records in DATA_LOADER_EVENTS are purged after a specified retention interval. If previously-loaded files are still in the source path after this purge, the data loader sees them as new files and loads them again.
Syntax
CREATE [ OR REPLACE ] DATA LOADER [ IF NOT EXISTS ]
[schema.]name
[ RETRY LIMIT { NONE | DEFAULT | limit } ]
[ RETENTION INTERVAL monitoring-retention ]
[ TRIGGERED BY integration-json ]
AS copy-statement
Arguments
OR REPLACE
- If a data loader with the same name in the same schema exists, replace it. The original data loader's monitoring table is dropped.
This option cannot be used with
IF NOT EXISTS
. IF NOT EXISTS
If an object with the same name exists, return without creating the object. If you do not use this directive and the object already exists, Vertica returns with an error message.
The
IF NOT EXISTS
clause is useful for SQL scripts where you might not know if the object already exists. The ON ERROR STOP directive can be helpful in scripts.This option cannot be used with
OR REPLACE
.schema
- Schema containing the data loader. The default schema is
public
. name
- Name of the data loader.
RETRY LIMIT { NONE | DEFAULT |
limit
}
- Maximum number of times to retry a failing file. Each time the data loader is executed, it attempts to load all files that have not yet been successfully loaded, up to this per-file limit. If set to DEFAULT, at load time the loader uses the value of the DataLoaderDefaultRetryLimit configuration parameter.
Default:
DEFAULT
RETENTION INTERVAL
monitoring-retention
- How long to keep records in the events table. DATA_LOADER_EVENTS records events for all data loaders, but each data loader has its own retention interval.
Default:
14 days
TRIGGERED BY
integration-json
- Information about an external input for the data loader, a JSON object with the following keys:
provider
(required): name of the provider; the only supported value is "AWS-SQS".resourceUrl
(required): endpoint for the integration.timeout
: number of seconds to wait between loads. The default is 10.minFileNumPerLoad
: if the number of files waiting to be loaded reaches this threshold, the load begins immediately instead of waiting for the next scheduled load. The default is 10 files.minTotalSizePerLoad
: if the total size of files waiting to be loaded reaches this threshold, the load begins immediately. The value is an integer with a unit of measure, one ofK
,M
,G
, orT
. The default is1G
.
If a data loader has a trigger, the loader is initially disabled. When you are ready to use it, use ALTER DATA LOADER with the ENABLE option. When the loader is enabled, Vertica executes it automatically when SQS produces notifications for new files. For details, see Triggers.
AS
copy-statement
- The COPY statement that the loader executes. The FROM clause typically uses a glob.
Privileges
Non-superuser: CREATE privileges on the schema.
Note
To use data loader commands with the TRIGGERED BY clause, you need to have READ permissions to the source storage location or (pseudo)superuser privileges.Restrictions
- COPY NO COMMIT is not supported.
- Data loaders are executed with COPY ABORT ON ERROR.
- The COPY statement must specify file paths. You cannot use COPY FROM VERTICA.
Triggers (AWS SQS only)
If the data to be loaded is stored on AWS and you have configured an SQS queue, then instead of executing a data loader explicitly using EXECUTE DATA LOADER, you can define the loader to respond to S3 events. When files are added to the AWS bucket, a service running on AWS adds a message to the SQS queue. Vertica reads from this queue, executes the data loader, and, if the load was successful, removes the message from the queue. After the initial setup, this process is automatic.
To configure the data loader to read from SQS, use the TRIGGERED BY option with a JSON object following this template:
{
"provider" : "AWS-SQS",
"resourceUrl" : "https://sqs.region.amazonaws.com/account_number/queue_name"
}
You can set additional values to control timing. See the description of the TRIGGERED BY
syntax.
Use the following SQS parameters to authenticate to SQS:
SQSAuth
, to set an ID and key for all queue access.SQSQueueCredentials
, to set per-queue access. Queue-specific values take precedence overSQSAuth
.
A data loader with an SQS trigger is initially disabled. Use ALTER DATA LOADER with the ENABLE option when you are ready to begin processing events. To pause execution, call ALTER DATA LOADER with the DISABLE option.
You can alter the trigger JSON using ALTER DATA LOADER. You must disable the loader before changing the trigger and then enable it again.
If a data loader has a trigger, you must disable it before dropping it.
File systems
The source path can be any shared file system or object store that all database nodes can access. To use an HDFS or Linux file system safely, you must prevent the loader from reading a partially-written file. One way to achieve this is to only execute the loader when files are not being written to the loader's path. Another way to achieve this is to write new files in a temporary location and move them to the loader's path only when they are complete.
Examples
The following data loader can be executed manually or by using a scheduled stored procedure:
=> CREATE DATA LOADER s.dl1 RETRY LIMIT NONE
AS COPY s.local FROM 's3://b/data/*.dat';
=> SELECT name, schemaname, copystmt, retrylimit FROM data_loaders;
name | schemaname | copystmt | retrylimit
------+------------+---------------------------------------+------------
dl1 | s | COPY s.local FROM 's3://b/data/*.dat' | -1
(1 row)
The following data loader reads from an SQS queue on AWS and loads new files automatically at least every 10 seconds, or immediately when five files or 100MB of data accumulate:
=> CREATE DATA LOADER s.dl2
TRIGGERED BY '{ "provider" : "AWS-SQS",
"resourceUrl" : "https://sqs.us-east-1.amazonaws.com/accountID/queueName",
"minFileNumPerLoad" : "5",
"minTotalSizePerLoad" : "100M" }'
AS COPY s.local FROM 's3://b/data/*.dat';
See Automatic load for an extended example.