Automatic load
A data loader automatically loads new files from a location so that you do not have to add them to Vertica manually. Automatically loading new data into ROS tables is an alternative to using external tables. External tables load selected data at query time, which can be convenient but also can incur API costs for object stores.
A data loader is tied to a path for data and a target table. When executed, the loader attempts to load files that it has not previously loaded. A loader has a retry limit to prevent malformed files from being tried over and over. Data loaders record monitoring information, including which files have been loaded, in the DATA_LOADER_EVENTS system table.
You can execute a data loader in the following ways:
-
Use EXECUTE DATA LOADER to load new data on demand.
-
Use a scheduled stored procedure to load new data periodically.
-
Use an AWS SQS queue to load data automatically in response to notifications.
Data pipelines
Automatic load enables data pipelines. The usual workflow is:
- Stage files for loading, adding new files over time.
- Run the data loader automatically, either periodically as a scheduled task or (for AWS) using SQS triggers.
- Remove loaded files from the staging path after they have been loaded.
Data loaders do not detect if files have been edited or deleted. When a data loader runs, it attempts to load all files that it has not yet loaded (up to the retry limit).
Records in the monitoring table are eventually purged. If previously-loaded files are still in the source path, future executions of the data loader will load them again. To avoid duplicate loads, you can set the purge interval to a value that is higher than your retention time for source files.
Creating loaders
A data loader connects a table in Vertica to a source path. The loader definition includes a COPY statement. The FROM clause includes the source path with globbing.
Use CREATE DATA LOADER to create the loader:
=> CREATE TABLE sales(...);
=> CREATE DATA LOADER sales_dl AS COPY sales FROM 's3://MyBucket/sales/2024/*.dat';
The source path can be any shared file system or object store that all database nodes can access. To use an HDFS or Linux file system safely, you must prevent the loader from reading a partially-written file. One way to achieve this is to only execute the loader when files are not being written to the loader's path. Another way to achieve this is to write new files in a temporary location and move them to the loader's path only when they are complete.
By default, data loaders are executed when you invoke them explicitly. If the data to be loaded is stored on AWS and you have configured an SQS queue, then you can instead create the loader with an SQS trigger. This causes the loader to run automatically when files are added to the AWS bucket. See Executing loaders using SQS.
Executing loaders directly
EXECUTE DATA LOADER executes the loader:
=> EXECUTE DATA LOADER sales_dl;
Rows Loaded
-------------
5
(1 row)
By default, Vertica retries a failed file load three times. You can change this value using the RETRY LIMIT option.
Vertica groups input files into batches to distribute the load for better performance. You can override this behavior if needed.
EXECUTE DATA LOADER runs the loader once. To run it periodically, create a scheduled task:
=> CREATE SCHEDULE daily_load USING CRON '0 13 * * *';
--- statement must be quoted:
=> CREATE PROCEDURE dl_runner() as $$ begin EXECUTE 'EXECUTE DATA LOADER sales_dl'; end; $$;
=> CREATE TRIGGER t_load ON SCHEDULE daily_load EXECUTE PROCEDURE dl_runner() AS DEFINER;
Executing loaders using SQS
If the files to be loaded are in an AWS S3 bucket, you can use an SQS (Simple Queue Service) queue on AWS to notify Vertica to load new files as they are added to the bucket. You specify the SQS queue as part of the data loader, and Vertica automatically runs EXECUTE DATA LOADER in response to events in the queue.
When files are added to the AWS bucket, the SQS service, running on AWS, adds a message to the queue. Vertica reads from this queue, executes the data loader, and, if the load was successful, removes the message from the queue. After the initial setup, this process is automatic and you do not need to call EXECUTE DATA LOADER directly.
To configure the data loader with SQS, use the TRIGGERED BY option as in the following example:
=> CREATE TABLE alerts(...);
=> CREATE DATA LOADER alerts_dl TRIGGERED BY '{
"provider":"AWS-SQS",
"resourceUrl":"https://queue.amazonaws.com/xxxxxxxxxxx/ExampleQueue"
}'
AS COPY alerts FROM 's3://MyBucket/alerts/*.dat';
Note
To use data loader commands with the TRIGGERED BY clause, you need to have READ permissions to the source storage location or (pseudo)superuser privileges.The provider
and resourceUrl
fields are required. The only supported provider is AWS-SQS. Specify the endpoint for the SQS queue in resourceUrl
. You can specify other optional fields in this JSON; see CREATE DATA LOADER.
Because SQS triggers could produce messages before the data loader is ready, you must explicitly enable the loader to begin processing. This step is not necessary for data loaders that do not use SQS:
=> ALTER DATA LOADER alerts_dl ENABLE;
To load any files that were present in the bucket before the SQS queue was set up, run the loader once manually:
=> EXECUTE DATA LOADER alerts_dl;
For more information about how Vertica interacts with SQS, see SQS integration.
Monitoring loads
The DATA_LOADERS system table shows all defined loaders, and the DATA_LOADER_EVENTS system table records load events for all data loaders. You must own or have any privilege on a data loader to see its events in DATA_LOADER_EVENTS.
Records in the events table are purged periodically. You can set the purge frequency for individual data loaders in CREATE DATA LOADER or change it using ALTER DATA LOADER. The default is 14 days.
The SQS_LISTENER_ERROR_HISTORY table records errors and rejections that occur when reading from SQS, such as malformed SQS messages and authorization problems.