SQS integration
Data loaders automatically load new files from a specified path into Vertica, as explained in Automatic load. If you are loading from an S3 bucket on AWS, you can use the Simple Queue Service (SQS) to simplify load pipelines. This section explains how Vertica integrates with SQS. This information can help you tune your data loaders.
On AWS
On the AWS side, an SQS queue is tied to an S3 bucket. When files are added to the bucket, S3 sends an event notification of type s3:ObjectCreated to SQS. These messages include the paths to the new files, which consumers can use to read from S3. SQS limits the number of files that can be included in a single message.
Messages stay in the SQS queue until removed.
SQS can send other types of event notifications. Vertica consumes only s3:ObjectCreated events.
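To make the message contents concrete, the following sketch shows how a generic consumer might pull file paths out of one S3 event notification body and skip everything except object-creation events. It illustrates the AWS message layout only and is not Vertica's implementation; the function name created_paths and the sample bucket and keys are hypothetical.

```python
import json
from urllib.parse import unquote_plus


def created_paths(message_body: str) -> list[str]:
    """Return s3:// paths for the ObjectCreated records in one notification body."""
    event = json.loads(message_body)
    paths = []
    for record in event.get("Records", []):
        # Inside the message, eventName omits the "s3:" prefix, e.g. "ObjectCreated:Put".
        if not record.get("eventName", "").startswith("ObjectCreated"):
            continue
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded, so decode before building the path.
        key = unquote_plus(record["s3"]["object"]["key"])
        paths.append(f"s3://{bucket}/{key}")
    return paths


# Hypothetical sample body with one creation event and one removal event.
sample = json.dumps({
    "Records": [
        {"eventName": "ObjectCreated:Put",
         "s3": {"bucket": {"name": "mybucket"},
                "object": {"key": "sales/2024+q1.parquet"}}},
        {"eventName": "ObjectRemoved:Delete",
         "s3": {"bucket": {"name": "mybucket"},
                "object": {"key": "sales/old.parquet"}}},
    ]
})

print(created_paths(sample))
```

Only the first record survives the filter, and its key is decoded before the path is assembled.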
In Vertica
When you create a data loader with an SQS trigger, you specify the resource URL for the queue. Vertica continuously polls all SQS queues that have at least one active data loader and collects SQS messages into an internal queue.
Data loaders, in turn, read from this internal queue. This separation allows Vertica to receive messages as quickly as SQS generates them while controlling the frequency of loader executions. By default, data loaders poll the internal queue every ten seconds and load all files that are waiting. You can modify how individual data loaders poll in the following ways:
- You can change the polling frequency (timeout, a number of seconds).
- You can load immediately when the number of files waiting to be loaded reaches a threshold (minFileNumPerLoad).
- You can load immediately when the total size of files waiting to be loaded reaches a threshold (minTotalSizePerLoad).
Adjusting these thresholds allows you to fine-tune data loaders based on expected velocity or file properties.
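The way the three settings interact can be sketched as a small decision function. This is a conceptual model only, not Vertica code: the class LoaderTrigger and its method are hypothetical, the fields mirror timeout, minFileNumPerLoad, and minTotalSizePerLoad, and the size threshold is assumed to be in bytes because this section does not state its unit.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class LoaderTrigger:
    timeout: float = 10.0                           # seconds between polls (documented default)
    min_file_num_per_load: Optional[int] = None     # models minFileNumPerLoad
    min_total_size_per_load: Optional[int] = None   # models minTotalSizePerLoad (bytes assumed)

    def should_load(self, seconds_since_last_load: float,
                    pending_sizes: Sequence[int]) -> bool:
        """Decide whether to run a load, given the sizes of the waiting files."""
        if not pending_sizes:
            return False  # nothing is waiting
        if seconds_since_last_load >= self.timeout:
            return True   # regular poll: load everything that is waiting
        if (self.min_file_num_per_load is not None
                and len(pending_sizes) >= self.min_file_num_per_load):
            return True   # file-count threshold reached before the timeout
        if (self.min_total_size_per_load is not None
                and sum(pending_sizes) >= self.min_total_size_per_load):
            return True   # total-size threshold reached before the timeout
        return False


trigger = LoaderTrigger(timeout=10, min_file_num_per_load=3,
                        min_total_size_per_load=1000)
print(trigger.should_load(2, [100, 100]))        # below both thresholds
print(trigger.should_load(2, [100, 100, 100]))   # file-count threshold met
print(trigger.should_load(2, [600, 500]))        # size threshold met
```

In this model a loader expecting many small files would lean on the file-count threshold, while one expecting a few large files would lean on the size threshold.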
After the data loaders consume messages from the internal queue and successfully load the files, Vertica removes the messages from the SQS queue. Vertica does not remove messages if the files could not be loaded, but does remove malformed messages that cannot be processed, logging these failures in the SQS_LISTENER_ERROR_HISTORY system table.
More than one data loader can read from the same SQS queue. Data loaders might have discrete paths or they might overlap. For example, one loader might read from s3://mybucket/sales/*.parquet while another reads from s3://mybucket/clicks/*; these do not interact. On the other hand, if another loader reads from s3://mybucket/sales/*, then some files will be consumed by more than one loader. Vertica keeps track of these interactions and only removes the message from the queue when all interested data loaders have successfully consumed it.
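The bookkeeping for overlapping loaders can be modeled with a set of interested loaders per file. The sketch below is an illustration under assumptions, not Vertica's internal design: the class MessageTracker and the loader names are hypothetical, and Python's fnmatchcase stands in for Vertica's path matching, whose glob semantics may differ.

```python
from fnmatch import fnmatchcase


class MessageTracker:
    """Track which loaders still need each file before its message can be removed."""

    def __init__(self, loader_globs: dict[str, str]):
        self.loader_globs = loader_globs          # loader name -> path glob
        self.pending: dict[str, set[str]] = {}    # file path -> loaders still waiting

    def receive(self, path: str) -> set[str]:
        """Record a new file and return the loaders whose glob matches it."""
        interested = {name for name, glob in self.loader_globs.items()
                      if fnmatchcase(path, glob)}
        # This section does not say what happens to files no loader matches;
        # the sketch simply does not track them.
        if interested:
            self.pending[path] = set(interested)
        return interested

    def mark_loaded(self, loader: str, path: str) -> bool:
        """Return True once every interested loader has loaded the file."""
        waiting = self.pending.get(path)
        if waiting is None:
            return False
        waiting.discard(loader)
        if waiting:
            return False          # another loader still needs this file
        del self.pending[path]
        return True               # safe to remove the message from the queue


tracker = MessageTracker({
    "sales_parquet": "s3://mybucket/sales/*.parquet",
    "clicks": "s3://mybucket/clicks/*",
    "sales_all": "s3://mybucket/sales/*",
})
path = "s3://mybucket/sales/jan.parquet"
print(sorted(tracker.receive(path)))
print(tracker.mark_loaded("sales_parquet", path))
print(tracker.mark_loaded("sales_all", path))
```

The file matches both sales loaders, so the message becomes removable only after the second of them reports a successful load.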
When a data loader is disabled, if it is the last enabled data loader triggered by an SQS queue, Vertica also stops polling that queue into its internal queue. When a loader for that queue is enabled again, Vertica resumes polling.
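This rule amounts to polling a queue only while at least one of its loaders is enabled. A minimal sketch of that bookkeeping follows; the class QueuePolling, the loader names, and the queue URL are hypothetical placeholders, not Vertica objects or a real queue.

```python
class QueuePolling:
    """Poll a queue only while at least one of its loaders is enabled."""

    def __init__(self) -> None:
        self.enabled_loaders: dict[str, set[str]] = {}  # queue URL -> enabled loaders

    def enable(self, queue_url: str, loader: str) -> None:
        self.enabled_loaders.setdefault(queue_url, set()).add(loader)

    def disable(self, queue_url: str, loader: str) -> None:
        self.enabled_loaders.get(queue_url, set()).discard(loader)

    def is_polled(self, queue_url: str) -> bool:
        return bool(self.enabled_loaders.get(queue_url))


# Placeholder URL for illustration only.
queue = "https://sqs.us-east-1.amazonaws.com/123456789012/example-queue"
state = QueuePolling()
state.enable(queue, "sales_loader")
state.enable(queue, "clicks_loader")
state.disable(queue, "sales_loader")
print(state.is_polled(queue))   # one loader is still enabled
state.disable(queue, "clicks_loader")
print(state.is_polled(queue))   # last loader disabled, polling stops
state.enable(queue, "sales_loader")
print(state.is_polled(queue))   # polling resumes
```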