Automatic load

A data loader automatically loads new files from a location, so that you do not have to add them to Vertica manually. Automatically loading new data into ROS tables is an alternative to using external tables. External tables load selected data at query time, which can be convenient but can also incur API costs for object stores.

A data loader is tied to a path for data and a target table. When executed, the loader attempts to load files that it has not previously loaded. A loader has a retry limit to prevent malformed files from being tried over and over. Each loader records monitoring information in an associated table.

You execute a data loader explicitly. To run it periodically, you can put the execution in a scheduled stored procedure.
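
For example, assuming a loader named sales_dl (created as shown later in this article), a single explicit run looks like this:

=> EXECUTE DATA LOADER sales_dl;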

Data pipelines

Automatic load enables data pipelines. The usual workflow is:

  1. Stage files for loading, adding new files over time.
  2. Run the data loader periodically as a scheduled task.
  3. Remove loaded files from the staging path.

Data loaders do not detect if files have been edited or deleted. When a data loader runs, it attempts to load all files that it has not yet loaded (up to the retry limit).

Records in the monitoring table are eventually purged. If previously loaded files are still in the source path, future executions of the data loader load them again. To avoid duplicate loads, set the purge interval to a value higher than the retention time of your source files.

Creating and executing loaders

A data loader connects a table in Vertica to a source path. The loader definition includes a COPY statement. The FROM clause includes the source path with globbing.

Use CREATE DATA LOADER to create the loader:

=> CREATE TABLE sales(...);

=> CREATE DATA LOADER sales_dl AS COPY sales FROM 's3://MyBucket/sales/2023/*.dat';

The source path can be any shared file system or object store that all database nodes can access. To use an HDFS or Linux file system safely, you must prevent the loader from reading partially written files. One way is to execute the loader only when no files are being written to its path. Another is to write new files in a temporary location and move them into the loader's path only when they are complete.

By default, Vertica retries a failed file load three times. You can change this value using the RETRY LIMIT argument.
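
For example, to allow five attempts per file, you could add the clause to the earlier definition. A minimal sketch, assuming the RETRY LIMIT clause goes between the loader name and the AS clause:

=> CREATE DATA LOADER sales_dl RETRY LIMIT 5
   AS COPY sales FROM 's3://MyBucket/sales/2023/*.dat';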

EXECUTE DATA LOADER runs the loader once. Create a scheduled task to run it periodically:

=> CREATE SCHEDULE daily_load USING CRON '0 13 * * *';

-- the EXECUTE DATA LOADER statement must be quoted:
=> CREATE PROCEDURE dl_runner() AS $$ BEGIN EXECUTE 'EXECUTE DATA LOADER sales_dl'; END; $$;

=> CREATE TRIGGER t_load ON SCHEDULE daily_load EXECUTE PROCEDURE dl_runner() AS DEFINER;

Monitoring loads

The DATA_LOADERS system table shows all defined loaders.
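
To inspect the loader definitions, query the table directly:

=> SELECT * FROM data_loaders;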

Each data loader has an associated monitoring table that records the paths that were attempted and their outcomes. The table name follows this scheme:

v_data_loader.<schema>_<loader-name>

For example, the monitoring table for the sales_dl loader in the public schema is v_data_loader.public_sales_dl. Access to monitoring tables requires superuser privileges. The table records each file that was attempted and the result:

=> SELECT file_name, load_status, rows_loaded 
   FROM v_data_loader.public_sales_dl ORDER BY time_stamp;
             file_name            | load_status | rows_loaded 
----------------------------------+-------------+-------------
 s3://MyBucket/sales/2023/01.dat  | 1           |          10
 s3://MyBucket/sales/2023/02.dat  | 1           |          10
 s3://MyBucket/sales/2023/03.dat  | 0           |           0
(3 rows)

The monitoring table has the following columns:

  • TIME_STAMP
  • FILE_NAME
  • TRANSACTION_ID
  • LOAD_STATUS: 1 for success, 0 for failure
  • ROWS_LOADED
  • FAILURE_REASON
  • RETRY_ATTEMPT: 0 for the first attempt; otherwise, the number of the current retry (1 for the first retry, 2 for the second, and so on)

Records in the monitoring table are purged periodically. You can set the frequency in CREATE DATA LOADER. The default is 14 days.
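
For example, if your staged source files are retained for 21 days, you could keep monitoring records for 30 days so that already-loaded files are never reloaded. A sketch, assuming the purge frequency is set with a RETENTION INTERVAL clause (the clause name may differ by version; see the CREATE DATA LOADER reference):

=> CREATE DATA LOADER sales_dl RETENTION INTERVAL '30 days'
   AS COPY sales FROM 's3://MyBucket/sales/2023/*.dat';

As noted under Data pipelines, keeping these records longer than your source files remain in the staging path prevents duplicate loads.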