Consuming data from Kafka
A Kafka consumer subscribes to one or more topics managed by a Kafka cluster. Each topic is a data stream, an unbounded dataset represented as an ordered sequence of messages. Vertica can consume Kafka topics manually or automatically so that you can perform analytics on your streaming data.
Manually consume data
Manually consume data from Kafka with a COPY statement that calls the KafkaSource function and a Kafka parser, as sketched after the following list. Manual loads are helpful when you want to:
- Populate a table one time with the messages currently in Kafka.
- Analyze a specific set of messages. You can choose the subset of data to load from the Kafka stream.
- Explore the data in a Kafka stream before you set up a scheduler to continuously stream the data into Vertica.
- Control the data load in ways not possible with the scheduler. For example, you cannot perform business logic or custom rejection handling during the data load from Kafka because the scheduler does not support additional processing during its transactions. Instead, you can periodically run a transaction that executes a COPY statement to load data from Kafka, and then perform additional processing.
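For instance, the following statement loads all messages currently in one partition of a topic and returns when it reaches the end of the stream. This is a minimal sketch: the table name, topic, and broker address are placeholders, and the messages are assumed to be JSON.

```sql
-- Read partition 0 of the web_hits topic from the earliest offset (-2).
-- stop_on_eof ends the load when the end of the stream is reached,
-- so the COPY statement returns instead of waiting for new messages.
COPY public.web_hits
SOURCE KafkaSource(stream='web_hits|0|-2',
                   brokers='kafka01.example.com:9092',
                   stop_on_eof=TRUE)
PARSER KafkaJSONParser();
COMMIT;
```

The stream parameter has the form 'topic|partition|start_offset' and optionally accepts an end offset, so you can load a specific range of messages instead of the whole partition.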
For a detailed example, see Manually consume data from Kafka.
Automatically consume data
Automatically consume streaming data from Kafka into Vertica with a scheduler, a command-line tool that loads data as it arrives. The scheduler loads data in segments defined by a microbatch, a unit of work that processes the partitions of a single Kafka topic for a specified duration of time. You can manage scheduler configuration and options using the vkconfig tool.
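To show the shape of the workflow, the following vkconfig calls sketch a complete setup: create a scheduler, register a Kafka cluster and source topic, define a Vertica target table and load spec, bundle them into a microbatch, and launch the scheduler. The schema, cluster, topic, and table names are placeholders; adjust the frame duration and broker list for your environment.

```sh
# Create the scheduler and the schema that stores its configuration.
vkconfig scheduler --create --config-schema weblog_sched --frame-duration 00:00:10

# Register the Kafka cluster to read from.
vkconfig cluster --create --config-schema weblog_sched --cluster kafka_weblog \
    --hosts kafka01.example.com:9092

# Identify the source topic and the Vertica target table.
vkconfig source --create --config-schema weblog_sched --cluster kafka_weblog \
    --source web_hits --partitions 1
vkconfig target --create --config-schema weblog_sched --target-schema public \
    --target-table web_hits

# Define how messages are parsed, then tie everything together in a microbatch.
vkconfig load-spec --create --config-schema weblog_sched --load-spec weblog_load \
    --parser KafkaJSONParser
vkconfig microbatch --create --config-schema weblog_sched --microbatch weblog \
    --target-schema public --target-table web_hits --load-spec weblog_load \
    --add-source web_hits --add-source-cluster kafka_weblog

# Start consuming data continuously.
vkconfig launch --config-schema weblog_sched
```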
For details, see Automatically consume data from Kafka with a scheduler.
Monitoring consumption
You must monitor message consumption to ensure that Kafka and Vertica are communicating effectively. You can use native Kafka tools to monitor consumer groups, or you can use the vkconfig tool to view consumption details.
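For example, each scheduler records its activity in its configuration schema. Assuming a scheduler whose configuration schema is named weblog_sched, a query like the following against its stream_microbatch_history table shows which offsets each microbatch consumed and why each batch ended; verify the table and column names against your Vertica version.

```sql
-- Recent microbatch activity: offsets read per partition and end reason.
SELECT microbatch, source_name, source_partition,
       start_offset, end_offset, end_reason, partition_messages, batch_start
FROM weblog_sched.stream_microbatch_history
ORDER BY batch_start DESC
LIMIT 10;
```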
For additional information, see Monitoring message consumption.
Parsing data with Kafka filters
Your data stream might encode data that the Kafka parser functions cannot parse by default. Use Kafka filters to delimit the messages in your stream so that the parser can process them.
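For example, when each Kafka message contains several newline-separated records, the KafkaInsertDelimiters filter can split the stream on that delimiter before the data reaches the parser. A minimal sketch with placeholder table, topic, and broker names:

```sql
-- Split the stream on newlines so each line becomes one record,
-- then parse each line as comma-delimited values.
COPY public.iot_data
SOURCE KafkaSource(stream='iot_csv|0|-2',
                   brokers='kafka01.example.com:9092',
                   stop_on_eof=TRUE)
FILTER KafkaInsertDelimiters(delimiter = E'\n')
DELIMITER ',';
COMMIT;
```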
For details, see Parsing custom formats.