Automatic load
A data loader automatically loads new files from a location so that you do not have to add them to Vertica manually. Automatically loading new data into ROS tables is an alternative to using external tables. External tables load selected data at query time, which can be convenient but can also incur API costs for object stores.
A data loader is tied to a path for data and a target table. When executed, the loader attempts to load files that it has not previously loaded. A loader has a retry limit to prevent malformed files from being tried over and over. Data loaders record monitoring information in the DATA_LOADER_EVENTS system table.
You can execute a data loader in the following ways:
- Use EXECUTE DATA LOADER to load new data on demand.
- Use a scheduled stored procedure to load new data periodically.
- Use an AWS SQS queue to load data automatically in response to notifications.
Data pipelines
Automatic load enables data pipelines. The usual workflow is:
- Stage files for loading, adding new files over time.
- Run the data loader automatically, either periodically as a scheduled task or (for AWS) using SQS triggers.
- Remove loaded files from the staging path.
Data loaders do not detect if files have been edited or deleted. When a data loader runs, it attempts to load all files that it has not yet loaded (up to the retry limit).
Records in the monitoring table are eventually purged. If previously loaded files are still in the source path, future executions of the data loader load them again. To avoid duplicate loads, set the purge interval to a value higher than the retention time for files in the source path.
Creating loaders
A data loader connects a table in Vertica to a source path. The loader definition includes a COPY statement whose FROM clause specifies the source path, which can include globs (wildcards) to match new files.
Use CREATE DATA LOADER to create the loader:
=> CREATE TABLE sales(...);
=> CREATE DATA LOADER sales_dl AS COPY sales FROM 's3://MyBucket/sales/2023/*.dat';
The source path can be any shared file system or object store that all database nodes can access. To use an HDFS or Linux file system safely, you must prevent the loader from reading partially-written files. One way is to execute the loader only when no files are being written to its path. Another is to write new files in a temporary location and move them into the loader's path only when they are complete.
Executing loaders
EXECUTE DATA LOADER executes the loader:
=> EXECUTE DATA LOADER sales_dl;
Rows Loaded
-------------
5
(1 row)
By default, Vertica retries a failed file load three times. You can change this value using the RETRY LIMIT argument.
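For example, to give each file more attempts before the loader abandons it, you could recreate the earlier loader with a higher limit. This is a minimal sketch; it assumes the RETRY LIMIT clause appears between the loader name and the AS clause:
=> CREATE DATA LOADER sales_dl RETRY LIMIT 5
   AS COPY sales FROM 's3://MyBucket/sales/2023/*.dat';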
Vertica groups input files into batches to distribute the load for better performance. You can override this behavior if needed.
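One possible override is a batch-size option on EXECUTE DATA LOADER, sketched below. The BATCH_SIZE keyword and its placement are assumptions, not confirmed syntax; check the reference for your version:
-- hypothetical option; verify against your version's EXECUTE DATA LOADER syntax
=> EXECUTE DATA LOADER sales_dl BATCH_SIZE 10;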
EXECUTE DATA LOADER runs the loader once. Create a scheduled task to run it periodically:
=> CREATE SCHEDULE daily_load USING CRON '0 13 * * *';
-- the embedded statement must be quoted:
=> CREATE PROCEDURE dl_runner() AS $$ BEGIN EXECUTE 'EXECUTE DATA LOADER sales_dl'; END; $$;
=> CREATE TRIGGER t_load ON SCHEDULE daily_load EXECUTE PROCEDURE dl_runner() AS DEFINER;
If the files to be loaded are in an AWS S3 bucket, you can use an SQS (Simple Queue Service) queue on AWS to notify Vertica to load new files as they are added to the bucket. You define a trigger as part of the data loader, and Vertica automatically runs EXECUTE DATA LOADER in response to events in the queue.
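The trigger definition depends on your Vertica version and AWS configuration. The sketch below is hypothetical: the TRIGGERED BY clause and the JSON fields identifying the queue are assumptions, not confirmed syntax:
-- hypothetical trigger clause and fields; consult the CREATE DATA LOADER reference
=> CREATE DATA LOADER sales_sqs_dl
   TRIGGERED BY '{"provider": "AWS", "resourceUrl": "sqs://us-east-1/MyQueue"}'
   AS COPY sales FROM 's3://MyBucket/sales/2023/*.dat';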
Monitoring loads
The DATA_LOADERS system table shows all defined loaders, and the DATA_LOADER_EVENTS system table records load events for all data loaders. To see a loader's events in DATA_LOADER_EVENTS, you must own the loader or have any privilege on it.
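For example, to review loader definitions and their recorded load events (a minimal sketch using only the system tables named above):
=> SELECT * FROM data_loaders;
=> SELECT * FROM data_loader_events;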
Records in the events table are purged periodically. You can set the purge frequency for individual data loaders in CREATE DATA LOADER or change it using ALTER DATA LOADER. The default is 14 days.
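For example, to keep event records for 30 days instead of 14, something like the following might work. This is a sketch only; the RETENTION INTERVAL keyword is an assumption about the option this paragraph describes:
-- keyword assumed, not confirmed; see ALTER DATA LOADER in the SQL reference
=> ALTER DATA LOADER sales_dl RETENTION INTERVAL '30 days';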