The Vertica Apache Kafka integration includes a scheduler, a mechanism that you can configure to automatically consume data from Kafka and load that data into a Vertica database. The Vertica Kafka Scheduler is the containerized version of that scheduler that runs natively on Kubernetes. Both schedulers have identical functionality and accept the same configuration parameters.
This document provides quickstart instructions about how to create, configure, and launch the Vertica Kafka Scheduler on Kubernetes. It includes minimal details about each command. For in-depth documentation about scheduler behavior and advanced configuration, see Automatically consume data from Kafka with a scheduler.
Prerequisites
- Kubernetes 1.21.1 and higher
- Helm 3.5.0 and higher
- kubectl command-line tool
- Kafka cluster
Add the Helm charts
To simplify deployment, Vertica packages the Kafka Scheduler in a Helm chart. Add the charts to your local Helm repository:
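A minimal sketch of this step. The repository URL and the local alias vertica-charts are assumptions that are not stated in this document, so confirm them against the Vertica documentation for your release. The add_vertica_charts helper name is hypothetical:

```shell
# Assumed values: verify the repository alias and URL for your Vertica release.
CHART_REPO_NAME="vertica-charts"
CHART_REPO_URL="https://vertica.github.io/charts"

# Hypothetical helper: registers the repository, then refreshes the local chart index.
add_vertica_charts() {
  helm repo add "$CHART_REPO_NAME" "$CHART_REPO_URL" &&
    helm repo update
}
```

Call add_vertica_charts once on each workstation that deploys the scheduler.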
Launch a scheduler
When you launch a scheduler, you must update the scheduler configuration, create the scheduler, set up a Vertica database to consume data from the scheduler, and then launch the scheduler.
The Vertica Kafka scheduler has two modes:
- initializer: Configuration mode. Starts a container so that you can exec into it and configure it.
- launcher: Launch mode. Launches the scheduler. Starts a container that calls vkconfig launch automatically. Run this mode after you configure the container in initializer mode.
Use the initializer mode to configure all the scheduler settings. After you configure the scheduler, you upgrade the Helm chart to launch it in launcher mode.
Install the scheduler
Install the scheduler Helm chart to start the scheduler in initializer mode. The following helm install command deploys a scheduler named vkscheduler in the kafka namespace:
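The install step might look like the following sketch. The chart reference vertica-charts/vertica-kafka-scheduler and the install_scheduler helper name are assumptions. The sketch sets launcherEnabled to false explicitly so that the result does not depend on the chart's default mode:

```shell
# Hypothetical helper: installs the scheduler chart as release $1 in namespace $2.
# The chart reference is an assumption; adjust it to the alias you registered.
install_scheduler() {
  # launcherEnabled=false keeps the release in initializer mode.
  helm install "$1" vertica-charts/vertica-kafka-scheduler \
    --namespace "$2" \
    --create-namespace \
    --set launcherEnabled=false
}
```

Calling install_scheduler vkscheduler kafka deploys the release described above. The --create-namespace flag creates the namespace only if it does not already exist.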
The command output provides the kubectl exec command that you can use to access a shell in the initializer pod and configure the scheduler.
Verify that the scheduler's initializer pod is running:
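One way to check the pod, sketched with a hypothetical list_scheduler_pods helper that defaults to the kafka namespace used in this example:

```shell
# Hypothetical helper: lists the pods in namespace $1 (defaults to kafka).
list_scheduler_pods() {
  kubectl get pods --namespace "${1:-kafka}"
}
```

The initializer pod is ready for configuration when its STATUS column reports Running.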
Create the target table
The target table is the Vertica database table that stores the data that the scheduler loads from Kafka. In this example, you create a flex table so that you can load data with an unknown or varying schema:
- Create a flex table to store the data:
- Create a user for the flex table:
- Create a resource pool for the scheduler. Vertica recommends that each scheduler have exclusive use of its own resource pool so that you can fine-tune the scheduler's impact on your Vertica cluster's performance:
For additional details, see Managing scheduler resources and performance.
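The three steps above can be sketched as a SQL file written from the shell. KafkaFlex is the table that this document queries later. KafkaUser, scheduler_pool, the PLANNEDCONCURRENCY value, and the file name are illustrative assumptions:

```shell
# Writes the three statements to a file. KafkaUser and scheduler_pool are
# placeholder names; size the resource pool for your own workload.
cat > create-target.sql <<'SQL'
-- Flex table that stores the data loaded from Kafka
CREATE FLEX TABLE KafkaFlex();
-- User for the flex table
CREATE USER KafkaUser;
-- Resource pool reserved for the scheduler
CREATE RESOURCE POOL scheduler_pool PLANNEDCONCURRENCY 1;
SQL
```

You could then run the file against your database with vsql -f create-target.sql, adding the connection options that your cluster requires.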
Override scheduler configuration
After you install the scheduler, you need to configure it for your environment. The scheduler configuration file is vkconfig.conf, and it is stored in the following location in the initializer pod:
/opt/vertica/packages/kafka/config/vkconfig.conf
By default, vkconfig.conf contains the following values:
config-schema=Scheduler
dbport=5433
enable-ssl=false
username=dbadmin
vkconfig.conf is read-only from within the filesystem, so you must upgrade the Helm chart to override the default settings. The following YAML-formatted file provides a template for scheduler overrides:
This template requires that you update the following values:
- image.tag: Scheduler version. The scheduler version must match the version of the Vertica database that you used to create the target table.
- conf.content.config-schema: Scheduler name. When you launch the scheduler, the Vertica database creates a schema that you can track with data streaming tables.
- conf.content.dbhost: IP address for a host in your Vertica cluster.
For example, the scheduler-overrides.yaml file contains the following values:
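A sketch of such a file, written from the shell. The key names come from the fields and defaults listed in this section. The tag, schema name, and host address are placeholders, and quoting the port and SSL values as strings is an assumption about how the chart renders them:

```shell
# Placeholder values: replace the tag, schema name, and host address with your own.
cat > scheduler-overrides.yaml <<'YAML'
image:
  tag: 24.1.0                # placeholder; must match your Vertica database version
launcherEnabled: false       # stay in initializer mode while you configure
conf:
  content:
    config-schema: KafkaScheduler   # placeholder scheduler name
    username: dbadmin
    dbport: '5433'
    enable-ssl: 'false'
    dbhost: 10.20.30.40             # placeholder Vertica host IP
YAML
```

Keeping the overrides in a file, rather than in a series of --set flags, lets you reapply the same settings on every later upgrade.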
After you define your overrides, use helm upgrade to apply the overrides to the scheduler initializer pod:
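As a sketch, the upgrade could be wrapped as follows. The chart reference and the apply_scheduler_overrides helper name are assumptions:

```shell
# Hypothetical helper: applies an overrides file to release $1 in namespace $2.
# $3 is the overrides file and defaults to scheduler-overrides.yaml.
apply_scheduler_overrides() {
  helm upgrade "$1" vertica-charts/vertica-kafka-scheduler \
    --namespace "$2" \
    --values "${3:-scheduler-overrides.yaml}"
}
```

For the release in this example, the call is apply_scheduler_overrides vkscheduler kafka.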
Configure the scheduler
After you update vkconfig.conf, you need to configure the scheduler itself. A scheduler is a combination of multiple components that you must configure individually with the vkconfig command.
Note
The following steps do not provide detail about each component's configuration. For a detailed example, see Setting up a scheduler. For details about all options for each component, see vkconfig script options.
To configure the scheduler, you must access the scheduler initializer pod to execute the vkconfig commands:
- Access a bash shell in the scheduler initializer pod:
- Define the scheduler. This command identifies the Vertica user, resource pool, and settings such as frame duration:
- Define the target, which is the Vertica database table that the scheduler loads data into:
- Define the load spec. This defines how Vertica parses the data from Kafka:
- Define the cluster. This identifies your Kafka cluster:
- Define the source. The source is the Kafka topic and partitions that you want to load data from:
- Define the microbatch. The microbatch combines the components you created in the previous steps:
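The vkconfig steps above can be sketched as one function to run inside the initializer pod, after you open a shell with the kubectl exec command from the helm install output. Every component name here (KafkaUser, scheduler_pool, KafkaSpec, KafkaCluster, KafkaTopic1, KafkaBatch1), the broker address, the frame duration, and the JSON parser choice is an illustrative assumption:

```shell
CONF=/opt/vertica/packages/kafka/config/vkconfig.conf
KAFKA_HOSTS="kafka-bootstrap.kafka.svc:9092"   # placeholder broker address

# Hypothetical helper: defines each scheduler component in order and stops at
# the first command that fails.
configure_scheduler() {
  vkconfig scheduler --conf "$CONF" --create --operator KafkaUser \
    --resource-pool scheduler_pool --frame-duration 00:00:10 &&
  vkconfig target --conf "$CONF" --add \
    --target-schema public --target-table KafkaFlex &&
  vkconfig load-spec --conf "$CONF" --add \
    --load-spec KafkaSpec --parser kafkajsonparser &&
  vkconfig cluster --conf "$CONF" --add \
    --cluster KafkaCluster --hosts "$KAFKA_HOSTS" &&
  vkconfig source --conf "$CONF" --add \
    --cluster KafkaCluster --source KafkaTopic1 --partitions 1 &&
  vkconfig microbatch --conf "$CONF" --add --microbatch KafkaBatch1 \
    --load-spec KafkaSpec --add-source KafkaTopic1 \
    --add-source-cluster KafkaCluster \
    --target-schema public --target-table KafkaFlex
}
```

The order matters: the microbatch refers to the target, load spec, cluster, and source by name, so those components must exist before the last command runs.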
After you configure the scheduler, exit the pod:
Launch the scheduler
After you configure the scheduler, you must launch it. To launch the scheduler, upgrade the Helm chart to change the launcherEnabled field to true:
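A sketch of the launch step, with an assumed chart reference and a hypothetical launch_scheduler helper. It passes the overrides file again because helm upgrade resets values that are not supplied on the command line unless you add --reuse-values:

```shell
# Hypothetical helper: switches release $1 in namespace $2 to launcher mode.
# $3 is the overrides file and defaults to scheduler-overrides.yaml.
launch_scheduler() {
  # --set takes precedence over --values, so launcherEnabled becomes true
  # even if the overrides file sets it to false.
  helm upgrade "$1" vertica-charts/vertica-kafka-scheduler \
    --namespace "$2" \
    --values "${3:-scheduler-overrides.yaml}" \
    --set launcherEnabled=true
}
```

For the release in this example, the call is launch_scheduler vkscheduler kafka.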
A new pod starts that runs the scheduler in launch mode:
Test your deployment
Now that you have a containerized Kafka cluster and VerticaDB CR running, you can test that the scheduler automatically loads data from the Kafka producer into Vertica:
- Open a shell that is running your Kafka producer and send sample JSON data:
- Open a terminal with access to your Vertica cluster and vsql. Query the KafkaFlex table to confirm that it contains the sample JSON data that you sent through the Kafka producer:
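Both checks can be sketched as follows. The producer script name varies by Kafka distribution, so it is read from a hypothetical KAFKA_PRODUCER variable. The sample messages, helper names, and topic are assumptions, and vsql is assumed to connect with its default options:

```shell
# Producer script name; some distributions omit the .sh suffix.
KAFKA_PRODUCER="${KAFKA_PRODUCER:-kafka-console-producer.sh}"

# Hypothetical helper: sends two JSON messages to topic $2 through broker $1.
send_sample_messages() {
  printf '%s\n' \
    '{"id": 1, "status": "ok"}' \
    '{"id": 2, "status": "ok"}' |
    "$KAFKA_PRODUCER" --bootstrap-server "$1" --topic "$2"
}

# Hypothetical helper: prints the raw contents of the flex table as text.
show_loaded_rows() {
  vsql -c "SELECT maptostring(__raw__) FROM KafkaFlex;"
}
```

Allow at least one frame duration to pass between sending the messages and querying the table, because the scheduler loads data in batches.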
Clean up
To delete the scheduler, run the vkconfig scheduler tool with the --drop option and the scheduler schema. You must access a shell within the scheduler pod to run the commands:
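A sketch of the drop command, to run inside the scheduler pod. The drop_scheduler helper name is hypothetical, and passing the schema through --config-schema in addition to the configuration file is an assumption about how to name the schema explicitly:

```shell
CONF=/opt/vertica/packages/kafka/config/vkconfig.conf

# Hypothetical helper: drops the scheduler whose schema name is $1.
drop_scheduler() {
  vkconfig scheduler --drop --conf "$CONF" --config-schema "$1"
}
```

Pass the same name that you set in conf.content.config-schema.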
You can delete your Kubernetes resources with the helm uninstall command:
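For the release and namespace used in this example, the uninstall step might look like this sketch. The remove_scheduler helper name is hypothetical:

```shell
# Hypothetical helper: removes release $1 and its resources from namespace $2.
remove_scheduler() {
  helm uninstall "$1" --namespace "$2"
}
```

Calling remove_scheduler vkscheduler kafka removes the release that this document installed. It does not delete the kafka namespace or the Vertica objects that you created for the target table.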