Containerized Kafka Scheduler

The Vertica Apache Kafka integration includes a scheduler, a mechanism that you can configure to automatically consume data from Kafka and load that data into a Vertica database. The Vertica Kafka Scheduler is the containerized version of that scheduler that runs natively on Kubernetes. Both schedulers have identical functionality and accept the same configuration parameters.

This document provides quickstart instructions about how to create, configure, and launch the Vertica Kafka Scheduler on Kubernetes. It includes minimal details about each command. For in-depth documentation about scheduler behavior and advanced configuration, see Automatically consume data from Kafka with a scheduler.

Prerequisites

Add the Helm charts

To simplify deployment, Vertica packages the Kafka Scheduler in a Helm chart. Add the charts to your local helm repository:

$ helm repo add vertica-charts https://vertica.github.io/charts
$ helm repo update

Launch a scheduler

To get a scheduler running, you install it in initializer mode, prepare the Vertica database that receives the data, override and configure the scheduler settings, and then launch the scheduler.

The Vertica Kafka scheduler has two modes:

  • initializer: Configuration mode. Starts a container so that you can exec into it and configure it.
  • launcher: Launch mode. Launches the scheduler. Starts a container that calls vkconfig launch automatically. Run this mode after you configure the container in initializer mode.

Use initializer mode to configure all the scheduler settings. After you configure the scheduler, upgrade the Helm chart to launch it in launcher mode.

Install the scheduler
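
The examples in this section deploy the scheduler into the kafka namespace. If that namespace does not exist yet, create it first (or pass --create-namespace to helm install):

$ kubectl create namespace kafka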

Install the scheduler Helm chart to start the scheduler in initializer mode. The following helm install command deploys a scheduler named vkscheduler in the kafka namespace:

$ helm install vkscheduler --namespace kafka vertica-charts/vertica-kafka-scheduler
NAME: vkscheduler
LAST DEPLOYED: Tue Apr  2 11:53:49 2024
NAMESPACE: kafka
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
Vertica's Kafka Scheduler has been deployed.

The initializer pod is running. You can exec into it and run your vkconfig
commands with this command:

kubectl exec -n kafka -it vkscheduler-vertica-kafka-scheduler-initializer -- bash

The command output provides the kubectl exec command that you can use to access a shell in the initializer pod and configure the scheduler.

Verify that the scheduler's initializer pod is running:

$ kubectl get pods --namespace kafka
NAME                                              READY   STATUS    RESTARTS      AGE
...
vkscheduler-vertica-kafka-scheduler-initializer   1/1     Running   1 (12s ago)   77s

Create the target table

The target table is the Vertica database table that stores the data that the scheduler loads from Kafka. In this example, you create a flex table so that you can load data with an unknown or varying schema:

  1. Create a flex table to store the data:
    => CREATE FLEX TABLE KafkaFlex();
    CREATE TABLE
    
  2. Create a user for the flex table:
    => CREATE USER KafkaUser;
    CREATE USER
    
  3. Create a resource pool for the scheduler. Vertica recommends that each scheduler have exclusive use of its own resource pool so that you can fine-tune the scheduler's impact on your Vertica cluster's performance:
    => CREATE RESOURCE POOL scheduler_pool PLANNEDCONCURRENCY 1;
    CREATE RESOURCE POOL
    
    For additional details, see Managing scheduler resources and performance. A sketch of granting the scheduler user access to the table and pool follows this list.
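
The scheduler runs its load operations as the operator user (KafkaUser in this example), so that user needs privileges on the target table and the resource pool. The exact grants depend on your security setup; a minimal sketch:

=> GRANT ALL ON TABLE KafkaFlex TO KafkaUser;
GRANT PRIVILEGE
=> GRANT USAGE ON RESOURCE POOL scheduler_pool TO KafkaUser;
GRANT PRIVILEGE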

Override scheduler configuration

After you install the scheduler, you need to configure it for your environment. The scheduler configuration file is vkconfig.conf, and it is stored in the following location in the initializer pod:

/opt/vertica/packages/kafka/config/vkconfig.conf

By default, vkconfig.conf contains the following values:

config-schema=Scheduler
dbport=5433
enable-ssl=false
username=dbadmin

vkconfig.conf is mounted read-only in the initializer pod, so you must upgrade the Helm chart to override the default settings. The following YAML-formatted file provides a template for scheduler overrides:

image:
  repository: opentext/kafka-scheduler
  pullPolicy: IfNotPresent
  tag: scheduler-version
launcherEnabled: false
replicaCount: 1
initializerEnabled: true
conf:
  generate: true
  content:
    config-schema: scheduler-name
    username: dbadmin
    dbport: "5433"
    enable-ssl: "false"
    dbhost: vertica-db-host-ip
tls:
  enabled: false
serviceAccount:
  create: true

This template requires that you update the following values:

  • image.tag: Scheduler version. The scheduler version must match the version of the Vertica database that you used to create the target table (you can check the server version as shown after this list).
  • conf.content.config-schema: Scheduler name. When you create the scheduler, Vertica creates a schema with this name that contains the scheduler's data streaming tables.
  • conf.content.dbhost: IP address for a host in your Vertica cluster.
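
To determine the Vertica server version, and therefore the image tag to use, you can query it from vsql:

=> SELECT version();

The result contains a string such as Vertica Analytic Database v24.2.0-0 (illustrative); use the matching scheduler tag, as in the example below.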

For example, the scheduler-overrides.yaml file contains the following values:

image:
  repository: opentext/kafka-scheduler
  pullPolicy: IfNotPresent
  tag: 24.2.0
launcherEnabled: false
replicaCount: 1
initializerEnabled: true
conf:
  generate: true
  content:
    config-schema: scheduler-sample
    username: dbadmin
    dbport: "5433"
    enable-ssl: "false"
    dbhost: 10.20.30.40
tls:
  enabled: false
serviceAccount:
  create: true

After you define your overrides, use helm upgrade to apply the overrides to the scheduler initializer pod:

$ helm upgrade vkscheduler --namespace kafka vertica-charts/vertica-kafka-scheduler -f scheduler-overrides.yaml
Release "vkscheduler" has been upgraded. Happy Helming!
NAME: vkscheduler
LAST DEPLOYED: Tue Apr  2 11:54:35 2024
NAMESPACE: kafka
STATUS: deployed
REVISION: 2
TEST SUITE: None
NOTES:
Vertica's Kafka Scheduler has been deployed.

The initializer pod is running. You can exec into it and run your vkconfig
commands with this command:

kubectl exec -n kafka -it vkscheduler-vertica-kafka-scheduler-initializer -- bash
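
To confirm that your overrides took effect, you can inspect the generated ConfigMap. The chart generates the ConfigMap name, so list the ConfigMaps in the namespace first (a sketch; substitute the name you find for configmap-name):

$ kubectl get configmaps --namespace kafka
$ kubectl describe configmap --namespace kafka configmap-name

The Data section of the describe output shows the vkconfig.conf contents with your overridden values.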

Configure the scheduler

After you update vkconfig.conf, you need to configure the scheduler itself. A scheduler is a combination of multiple components that you must configure individually with the vkconfig command.

To configure the scheduler, you must access the scheduler initializer pod to execute the vkconfig commands:

  1. Access a bash shell in the scheduler initializer pod:

    $ kubectl exec -n kafka -it vkscheduler-vertica-kafka-scheduler-initializer -- bash
    
  2. Define the scheduler. This command identifies the Vertica user, resource pool, and settings such as frame duration:

    bash-5.1$ vkconfig scheduler --conf /opt/vertica/packages/kafka/config/vkconfig.conf \
         --frame-duration 00:00:10 \
         --create --operator KafkaUser \
         --eof-timeout-ms 2000 \
         --config-refresh 00:01:00 \
         --new-source-policy START \
         --resource-pool scheduler_pool
    
  3. Define the target, which is the Vertica database table that the scheduler loads data into:

    bash-5.1$ vkconfig target --add --conf /opt/vertica/packages/kafka/config/vkconfig.conf \
         --target-schema public \
         --target-table KafkaFlex
    
  4. Define the load spec, which controls how Vertica parses the data from Kafka:

    bash-5.1$ vkconfig load-spec --add --conf /opt/vertica/packages/kafka/config/vkconfig.conf \
         --load-spec KafkaSpec \
         --parser kafkajsonparser \
         --load-method DIRECT \
         --message-max-bytes 1000000
    
  5. Define the cluster. This identifies your Kafka cluster:

    bash-5.1$ vkconfig cluster --add --conf /opt/vertica/packages/kafka/config/vkconfig.conf \
         --cluster KafkaCluster \
         --hosts kafka01.example.com:9092,kafka03.example.com:9092
    
  6. Define the source. The source is the Kafka topic and partitions that you want to load data from:

    bash-5.1$ vkconfig source --add --conf /opt/vertica/packages/kafka/config/vkconfig.conf \
         --cluster KafkaCluster \
         --source KafkaTopic1 \
         --partitions 1
    
  7. Define the microbatch. The microbatch combines the components you created in the previous steps:

    bash-5.1$ vkconfig microbatch --add --conf /opt/vertica/packages/kafka/config/vkconfig.conf \
         --microbatch KafkaBatch1 \
         --add-source KafkaTopic1 \
         --add-source-cluster KafkaCluster \
         --target-schema public \
         --target-table KafkaFlex \
         --rejection-schema public \
         --rejection-table KafkaFlex_rej \
         --load-spec KafkaSpec
    

After you configure the scheduler, exit the pod:

bash-5.1$ exit
exit
$
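
Optionally, you can confirm the configuration from vsql by querying the data streaming tables in the scheduler's schema. The schema name matches conf.content.config-schema; a sketch for the scheduler-sample schema used in this example (quote the hyphenated name):

=> SELECT * FROM "scheduler-sample".stream_microbatches;
=> SELECT * FROM "scheduler-sample".stream_sources;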

Launch the scheduler

After you configure the scheduler, launch it by upgrading the Helm chart and setting the launcherEnabled field to true:

$ helm upgrade --namespace kafka vkscheduler vertica-charts/vertica-kafka-scheduler \
    --set "launcherEnabled=true"

A new pod starts that runs the scheduler in launch mode:

$ kubectl get pods --namespace kafka
NAME                                                   READY   STATUS    RESTARTS   AGE
vkscheduler-vertica-kafka-scheduler-66d5c49dbf-nc86k   1/1     Running   0          14s
vkscheduler-vertica-kafka-scheduler-initializer        1/1     Running   0          85m
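
To check that the launcher started cleanly, you can view its logs (a sketch, using the launcher pod name from the output above):

$ kubectl logs --namespace kafka vkscheduler-vertica-kafka-scheduler-66d5c49dbf-nc86k --follow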

Test your deployment

Now that you have a containerized Kafka cluster and a VerticaDB custom resource (CR) running, you can test that the scheduler automatically loads data that you produce to Kafka into Vertica.
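
If you do not already have a producer attached to the topic, you can start one. A minimal sketch with the Kafka console producer, using the broker and topic names from the earlier configuration:

$ kafka-console-producer.sh --bootstrap-server kafka01.example.com:9092 --topic KafkaTopic1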

  1. Open a shell that is running your Kafka producer and send sample JSON data:

    >{"a": 1}
    >{"a": 1000}
    
  2. Open a terminal with access to your Vertica cluster and vsql. Query the KafkaFlex table to confirm that it contains the sample JSON data that you sent through the Kafka producer:

    => SELECT compute_flextable_keys_and_build_view('KafkaFlex');
                                     compute_flextable_keys_and_build_view                    
    --------------------------------------------------------------------------------------------------------
     Please see public.KafkaFlex_keys for updated keys
    The view public.KafkaFlex_view is ready for querying
    (1 row)
    
    => SELECT a from KafkaFlex_view;
     a
    -----
     1
     1000
    (2 rows)
    

Clean up

To delete the scheduler, use the vkconfig scheduler tool with the --drop option and the scheduler's schema. You must run the command from a shell within the scheduler's initializer pod:

$ kubectl exec -n kafka -it vkscheduler-vertica-kafka-scheduler-initializer -- bash
bash-5.1$ /opt/vertica/packages/kafka/bin/vkconfig scheduler --drop \
     --conf /opt/vertica/packages/kafka/config/vkconfig.conf \
     --config-schema scheduler-sample

You can delete the remaining Kubernetes resources with the helm uninstall command:

$ helm uninstall vkscheduler -n kafka
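
If you also want to remove the Vertica objects that you created for this example, you can drop them from vsql (a sketch; CASCADE removes dependent objects such as the generated view):

=> DROP TABLE KafkaFlex CASCADE;
=> DROP USER KafkaUser;
=> DROP RESOURCE POOL scheduler_pool;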

Kafka scheduler parameters

The following list describes the available settings for the Vertica Kafka Scheduler:

affinity
Applies affinity rules that constrain the scheduler to specific nodes.
conf.configMapName
Name of the ConfigMap to use and optionally generate. If omitted, the chart picks a suitable default.
conf.content
Set of key-value pairs in the generated ConfigMap. If conf.generate is false, this setting is ignored.
conf.generate
When set to true, the Helm chart controls the creation of the vkconfig.conf ConfigMap.

Default: true

fullNameOverride
Gives the Helm chart full control over the name of the objects that get created. This takes precedence over nameOverride.
initializerEnabled
When set to true, the initializer pod is created. This can be used to run any setup tasks needed.

Default: true

image.pullPolicy
How often Kubernetes pulls the image for an object. For details, see Updating Images in the Kubernetes documentation.

Default: IfNotPresent

image.repository
The image repository and name that contains the Vertica Kafka Scheduler.

Default: opentext/kafka-scheduler

image.tag
Version of the Vertica Kafka Scheduler. This setting must match the version of the Vertica server that the scheduler connects to.

For a list of available tags, see opentext/kafka-scheduler.

Default: Helm chart's appVersion

imagePullSecrets
List of Secrets that contain the required credentials to pull the image.
launcherEnabled
When set to true, the Helm chart creates the launch deployment. Enable this setting after you configure the scheduler options in the container.

Default: true

jvmOpts
Values to assign to the VKCONFIG_JVM_OPTS environment variable in the pods.
nameOverride
Controls the name of the objects that get created. This is combined with the Helm chart release to form the name.
nodeSelector
nodeSelector that controls where the pod is scheduled.
podAnnotations
Annotations that you want to attach to the pods.
podSecurityContext
Security context for the pods.
replicaCount
Number of launch pods that the chart deploys.

Default: 1

resources
Host resources to use for the pod.
securityContext
Security context for the container in the pod.
serviceAccount.annotations
Annotations to attach to the ServiceAccount.
serviceAccount.create
When set to true, a ServiceAccount is created as part of the deployment.

Default: true

serviceAccount.name
Name of the service account. If this parameter is not set and serviceAccount.create is set to true, a name is generated using the fullname template.
timezone
Manages the timezone of the logger. As logging employs log4j, ensure you use a Java-friendly timezone ID. For details, see this Oracle documentation.

Default: UTC

tls.enabled
When set to true, the scheduler is set up for TLS authentication.

Default: false

tls.keyStoreMountPath
Directory where the keystore is mounted in the pod. The full path to the keystore is constructed by combining this parameter with tls.keyStoreSecretKey.
tls.keyStorePassword
Password that protects the keystore. If this setting is omitted, then no password is used.
tls.keyStoreSecretKey
Key within tls.keyStoreSecretName that is used as the keystore file name. This setting and tls.keyStoreMountPath form the full path to the key in the pod.
tls.keyStoreSecretName
Name of an existing Secret that contains the keystore. If this setting is omitted, no keystore information is included.
tls.trustStoreMountPath
Directory where the truststore is mounted in the pod. The full path to the truststore is constructed by combining this parameter with tls.trustStoreSecretKey.
tls.trustStorePassword
Password that protects the truststore. If this setting is omitted, then no password is used.
tls.trustStoreSecretKey
Key within tls.trustStoreSecretName that is used as the truststore file name. This is used with tls.trustStoreMountPath to form the full path to the key in the pod.
tls.trustStoreSecretName
Name of an existing Secret that contains the truststore. If this setting is omitted, then no truststore information is included.
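
For example, a TLS-enabled override might look like the following sketch. The Secret name vertica-kafka-tls, the key names, the mount paths, and the passwords are illustrative assumptions; substitute the values from your own Secrets:

tls:
  enabled: true
  keyStoreSecretName: vertica-kafka-tls
  keyStoreSecretKey: keystore.jks
  keyStoreMountPath: /etc/tls
  keyStorePassword: keystore-password
  trustStoreSecretName: vertica-kafka-tls
  trustStoreSecretKey: truststore.jks
  trustStoreMountPath: /etc/tls
  trustStorePassword: truststore-password

With these values, the keystore is available inside the pod at /etc/tls/keystore.jks and the truststore at /etc/tls/truststore.jks.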