<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>OpenText Analytics Database 26.2.x – Consuming data from Kafka</title>
    <link>/en/kafka-integration/consuming-data-from-kafka/</link>
    <description>Recent content in Consuming data from Kafka on OpenText Analytics Database 26.2.x</description>
    <generator>Hugo -- gohugo.io</generator>
    
	  <atom:link href="/en/kafka-integration/consuming-data-from-kafka/index.xml" rel="self" type="application/rss+xml" />
    
    
      
        
      
    
    
    <item>
      <title>Kafka-Integration: Manually consume data from Kafka</title>
      <link>/en/kafka-integration/consuming-data-from-kafka/manually-consume-data-from-kafka/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      
      <guid>/en/kafka-integration/consuming-data-from-kafka/manually-consume-data-from-kafka/</guid>
      <description>
        
        
        &lt;p&gt;You can manually load streaming data from Kafka into OpenText™ Analytics Database using a &lt;a href=&#34;../../../en/sql-reference/statements/copy/#&#34;&gt;COPY&lt;/a&gt; statement, just as you can load a finite set of data from a file or other source. Unlike a standard data source, Kafka data arrives continuously as a stream of messages that you must parse before loading into the database. Use &lt;a href=&#34;../../../en/kafka-integration/kafka-function-reference/&#34;&gt;Kafka functions&lt;/a&gt; in the COPY statement to prepare the data stream.&lt;/p&gt;
&lt;p&gt;This example incrementally builds a COPY statement that manually loads JSON-encoded data from a Kafka topic named web_hits. The web_hits topic streams server logs of web site requests.&lt;/p&gt;
&lt;p&gt;For information about loading data into the database, see &lt;a href=&#34;../../../en/data-load/#&#34;&gt;Data load&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&#34;creating-the-target-table&#34;&gt;Creating the target table&lt;/h2&gt;
&lt;p&gt;To determine the target table schema, you must identify the message structure. The following is a sample of the web_hits stream:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;{&amp;#34;url&amp;#34;: &amp;#34;list.jsp&amp;#34;, &amp;#34;ip&amp;#34;: &amp;#34;144.177.38.106&amp;#34;, &amp;#34;date&amp;#34;: &amp;#34;2017/05/02 20:56:00&amp;#34;,
&amp;#34;user-agent&amp;#34;: &amp;#34;Mozilla/5.0 (compatible; MSIE 6.0; Windows NT 6.0; Trident/5.1)&amp;#34;}
{&amp;#34;url&amp;#34;: &amp;#34;search/wp-content.html&amp;#34;, &amp;#34;ip&amp;#34;: &amp;#34;215.141.172.28&amp;#34;, &amp;#34;date&amp;#34;: &amp;#34;2017/05/02 20:56:01&amp;#34;,
&amp;#34;user-agent&amp;#34;: &amp;#34;Opera/9.53.(Windows NT 5.2; sl-SI) Presto/2.9.161 Version/10.00&amp;#34;}
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;This topic streams JSON-encoded data. Because JSON data is semi-structured and might contain unpredictable additional fields, store this data stream in a flex table. Flex tables dynamically accept additional fields that appear in the data.&lt;/p&gt;
&lt;p&gt;The following statement creates a flex table named web_table to store the data stream:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;=&amp;gt; CREATE FLEX TABLE web_table();
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;To begin the COPY statement, add the web_table as the target table:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;COPY web_table
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;For more information about flex tables, see &lt;a href=&#34;../../../en/flex-tables/#&#34;&gt;Flex tables&lt;/a&gt;.

&lt;div class=&#34;alert admonition note&#34; role=&#34;alert&#34;&gt;
&lt;h4 class=&#34;admonition-head&#34;&gt;Note&lt;/h4&gt;

If you are copying data containing default values into a flex table, you must identify the default value column as &lt;code&gt;__raw__&lt;/code&gt;. For more information, see &lt;a href=&#34;../../../en/flex-tables/bulk-loading-data-into-flex-tables/#Handling&#34;&gt;Bulk Loading Data into Flex Tables&lt;/a&gt;.

&lt;/div&gt;&lt;/p&gt;
&lt;h2 id=&#34;defining-kafkasource&#34;&gt;Defining KafkaSource&lt;/h2&gt;
&lt;p&gt;The source of your COPY statement is always &lt;a href=&#34;../../../en/kafka-integration/kafka-function-reference/kafkasource/&#34;&gt;KafkaSource&lt;/a&gt;. KafkaSource accepts details about the data stream, Kafka brokers, and additional processing options to continuously load data until an &lt;a href=&#34;#Choosing&#34;&gt;end condition&lt;/a&gt; is met.&lt;/p&gt;
&lt;p&gt;&lt;a name=&#34;Stream&#34;&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3 id=&#34;stream-details&#34;&gt;Stream details&lt;/h3&gt;
&lt;p&gt;The stream parameter defines the data segment that you want to load from one or more topic partitions. Each Kafka topic splits its messages into different partitions to get scalable throughput. Kafka keeps a backlog of messages for each topic according to rules set by the Kafka administrator. You can choose to load some or all of the messages in the backlog, or just load the currently streamed messages.&lt;/p&gt;
&lt;p&gt;For each partition, the stream parameter requires the topic name, topic partition, and the partition offset as a list delimited by a pipe character (|). Optionally, you can provide an end offset as an end condition to stop loading from the data stream (see the example after the list below):&lt;/p&gt;
&lt;p&gt;&#39;stream=&#39;&lt;em&gt;&lt;code&gt;topic_name&lt;/code&gt;&lt;/em&gt;|&lt;em&gt;&lt;code&gt;partition&lt;/code&gt;&lt;/em&gt;|&lt;em&gt;&lt;code&gt;start_offset&lt;/code&gt;&lt;/em&gt;[|&lt;em&gt;&lt;code&gt;end_offset&lt;/code&gt;&lt;/em&gt;]&#39;&lt;/p&gt;
&lt;p&gt;To load the entire backlog from a single partition of the web_hits topic, use the SOURCE keyword to append KafkaSource with the following &lt;code&gt;stream&lt;/code&gt; parameter values:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;COPY ...
SOURCE KafkaSource(stream=&amp;#39;web_hits|0|-2&amp;#39;, ...
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;In the previous example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;web_hits&lt;/code&gt; is the name of the topic to load data from.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;0&lt;/code&gt; is the topic partition to load data from. Topic partitions are 0-indexed, and web_hits contains only one partition.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;-2&lt;/code&gt; loads the entire backlog. This is a special offset value that tells KafkaSource to start loading at the earliest available message offset.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
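&lt;p&gt;For example, a hypothetical &lt;code&gt;stream&lt;/code&gt; value that adds an end offset of 1000 (an assumed value, for illustration only) stops the load once partition 0 reaches that offset:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;COPY ...
SOURCE KafkaSource(stream=&amp;#39;web_hits|0|-2|1000&amp;#39;, ...
&lt;/code&gt;&lt;/pre&gt;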
&lt;h3 id=&#34;loading-multiple-partitions&#34;&gt;Loading multiple partitions&lt;/h3&gt;
&lt;p&gt;This example loads from only one partition, but it is important to understand how to load from multiple partitions in a single COPY statement.&lt;/p&gt;
&lt;p&gt;To load from additional partitions in the same topic, or even additional topics, supply a comma-separated list of topic name, partition number, and offset values delimited by pipe characters. For example, the following &lt;code&gt;stream&lt;/code&gt; argument loads the entire message backlog from partitions 0 through 2 of the web_hits topic:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;KafkaSource(stream=&amp;#39;web_hits|0|-2,web_hits|1|-2,web_hits|2|-2&amp;#39;...
&lt;/code&gt;&lt;/pre&gt;
&lt;div class=&#34;alert admonition note&#34; role=&#34;alert&#34;&gt;
&lt;h4 class=&#34;admonition-head&#34;&gt;Note&lt;/h4&gt;

While you can load messages from different Kafka topics in the same COPY statement, you must ensure the data from the different topics is compatible with the target table&#39;s schema. The schema is less of a concern if you are loading data into a flex table, which can accommodate almost any data you want to load.

&lt;/div&gt;
&lt;p&gt;When you load multiple partitions in the same COPY statement, you can set the &lt;code&gt;executionparallelism&lt;/code&gt; parameter to define the number of threads created for the COPY statement. Ideally, you want to use one thread per partition. You can choose to not specify a value and let the database determine the number of threads based on the number of partitions and the resources available in the resource pool. In this example, there is only one partition, so there&#39;s no need for additional threads to load data.

&lt;div class=&#34;alert admonition note&#34; role=&#34;alert&#34;&gt;
&lt;h4 class=&#34;admonition-head&#34;&gt;Note&lt;/h4&gt;

The EXECUTIONPARALLELISM setting on the assigned resource pool is the upper limit on the number of threads your COPY statement can use. Setting executionparallelism on the KafkaSource function call to a value higher than the resource pool&#39;s EXECUTIONPARALLELISM setting does not increase the number of threads the database uses beyond the resource pool&#39;s limit.

&lt;/div&gt;&lt;/p&gt;
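&lt;p&gt;As a sketch of the multiple-partition case described above, the following fragment (with the topic, brokers, and thread count carried over from the earlier examples as assumptions) requests one thread per partition:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;COPY ...
SOURCE KafkaSource(stream=&amp;#39;web_hits|0|-2,web_hits|1|-2,web_hits|2|-2&amp;#39;,
                   brokers=&amp;#39;kafka01.example.com:9092&amp;#39;,
                   executionparallelism=3, ...
&lt;/code&gt;&lt;/pre&gt;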
&lt;h3 id=&#34;adding-the-kafka-brokers&#34;&gt;Adding the Kafka brokers&lt;/h3&gt;
&lt;p&gt;KafkaSource requires the host names (or IP addresses) and port numbers of the brokers in your Kafka cluster. The database accesses the Kafka brokers to retrieve the Kafka data. In this example, the Kafka cluster has one broker named kafka01.example.com, running on port 9092. Append the brokers parameter and value to the COPY statement:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;COPY ...
SOURCE KafkaSource(stream=&amp;#39;web_hits|0|-2&amp;#39;,
                   brokers=&amp;#39;kafka01.example.com:9092&amp;#39;, ...
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;&lt;a name=&#34;Choosing&#34;&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3 id=&#34;choosing-the-end-condition&#34;&gt;Choosing the end condition&lt;/h3&gt;
&lt;p&gt;Because data continuously arrives from Kafka, manual loads from Kafka require that you define an end condition that indicates when to stop loading data. In addition to the end offset described in &lt;a href=&#34;#Stream&#34;&gt;Stream Details&lt;/a&gt;, you can choose to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Copy as much data as possible for a set duration of time.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Load data until no new data arrives within a timeout period.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Load all available data without waiting for any further data to arrive.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This example runs COPY for 10000 milliseconds (10 seconds) to get a sample of the data. If the COPY statement is able to load the entire backlog of data in under 10 seconds, it spends the remaining time loading streaming data as it arrives. This value is set in the &lt;code&gt;duration&lt;/code&gt; parameter. Append the duration value to complete the KafkaSource definition:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;COPY ...
SOURCE KafkaSource(stream=&amp;#39;web_hits|0|-2&amp;#39;,
                    brokers=&amp;#39;kafka01.example.com:9092&amp;#39;,
                    duration=interval &amp;#39;10000 milliseconds&amp;#39;)
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;If you start a long-duration COPY statement from Kafka and need to stop it, you can call one of the functions that closes its session, such as &lt;a href=&#34;../../../en/sql-reference/functions/management-functions/session-functions/close-all-sessions/#&#34;&gt;CLOSE_ALL_SESSIONS&lt;/a&gt;.

&lt;div class=&#34;admonition important&#34; role=&#34;alert&#34;&gt;
&lt;h4 class=&#34;admonition-head&#34;&gt;Important&lt;/h4&gt;
The duration that you set for your data load is not exact. The duration controls how long the KafkaSource process runs.
&lt;/div&gt;&lt;/p&gt;
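&lt;p&gt;For example, from a second session you could call the function referenced above. Keep in mind that it closes every session on the database, not just the one running the COPY:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;=&amp;gt; SELECT CLOSE_ALL_SESSIONS();
&lt;/code&gt;&lt;/pre&gt;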
&lt;h2 id=&#34;selecting-a-parser&#34;&gt;Selecting a parser&lt;/h2&gt;
&lt;p&gt;Kafka does not enforce message formatting on its data streams. Messages are often in Avro or JSON format, but they could be in any format. Your COPY statement usually uses one of three Kafka-specific parsers:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;../../../en/kafka-integration/kafka-function-reference/kafkaparser/&#34;&gt;KafkaParser&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;../../../en/kafka-integration/kafka-function-reference/kafkajsonparser/&#34;&gt;KafkaJSONParser&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;../../../en/kafka-integration/kafka-function-reference/kafkaavroparser/&#34;&gt;KafkaAvroParser&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The Kafka parsers can recognize record boundaries in streaming data; other parsers (such as the &lt;a href=&#34;../../../en/sql-reference/statements/copy/parsers/&#34;&gt;Flex parsers&lt;/a&gt;) cannot, so they are not directly compatible with the output of KafkaSource. You must alter the KafkaSource output using filters before other parsers can process the data. See &lt;a href=&#34;../../../en/kafka-integration/consuming-data-from-kafka/parsing-custom-formats/#&#34;&gt;Parsing custom formats&lt;/a&gt; for more information.&lt;/p&gt;
&lt;p&gt;In this example, the data in the web_hits topic is encoded in JSON format, so this example uses the KafkaJSONParser. This value is set in the COPY statement&#39;s PARSER clause:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;COPY ...
SOURCE ...
PARSER KafkaJSONParser()
&lt;/code&gt;&lt;/pre&gt;&lt;h2 id=&#34;storing-rejected-data&#34;&gt;Storing rejected data&lt;/h2&gt;
&lt;p&gt;OpenText™ Analytics Database saves raw Kafka messages that the parser cannot parse to a rejects table, along with information about why they were rejected. This table is created by the COPY statement. This example saves rejects to the table named web_hits_rejections. This value is set in the COPY statement&#39;s REJECTED DATA AS TABLE clause:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;COPY ...
SOURCE ...
PARSER ...
REJECTED DATA AS TABLE public.web_hits_rejections;
&lt;/code&gt;&lt;/pre&gt;&lt;h2 id=&#34;loading-the-data-stream-into-opentexttrade-analytics-database&#34;&gt;Loading the data stream into OpenText™ Analytics Database&lt;/h2&gt;
&lt;p&gt;The following steps load JSON data from the web_hits topic for 10 seconds using the COPY statement that was incrementally built in the previous sections:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Execute the COPY statement:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;=&amp;gt; COPY web_table
   SOURCE KafkaSource(stream=&amp;#39;web_hits|0|-2&amp;#39;,
                      brokers=&amp;#39;kafka01.example.com:9092&amp;#39;,
                      duration=interval &amp;#39;10000 milliseconds&amp;#39;)
   PARSER KafkaJSONParser()
   REJECTED DATA AS TABLE public.web_hits_rejections;
 Rows Loaded
-------------
         800
(1 row)
&lt;/code&gt;&lt;/pre&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Compute the flex table keys:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;=&amp;gt; SELECT compute_flextable_keys(&amp;#39;web_table&amp;#39;);
              compute_flextable_keys
--------------------------------------------------
 Please see public.web_table_keys for updated keys
(1 row)
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;For additional details, see &lt;a href=&#34;../../../en/flex-tables/computing-flex-table-keys/#&#34;&gt;Computing flex table keys&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Query web_table_keys to return the keys:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;=&amp;gt; SELECT * FROM web_table_keys;
  key_name  | frequency | data_type_guess
------------+-----------+-----------------
 date       |       800 | Timestamp
 user_agent |       800 | Varchar(294)
 ip         |       800 | Varchar(30)
 url        |       800 | Varchar(88)
(4 rows)
&lt;/code&gt;&lt;/pre&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Query web_table to return the data loaded from the web_hits Kafka topic:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;=&amp;gt; SELECT date, url, ip FROM web_table LIMIT 10;
        date         |                url                 |       ip
---------------------+------------------------------------+-----------------
 2021-10-15 02:33:31 | search/index.htm                   | 192.168.210.61
 2021-10-17 16:58:27 | wp-content/about.html              | 10.59.149.131
 2021-10-05 09:10:06 | wp-content/posts/category/faq.html | 172.19.122.146
 2021-10-01 08:05:39 | blog/wp-content/home.jsp           | 192.168.136.207
 2021-10-10 07:28:39 | main/main.jsp                      | 172.18.192.9
 2021-10-22 12:41:33 | tags/categories/about.html         | 10.120.75.17
 2021-10-17 09:41:09 | explore/posts/main/faq.jsp         | 10.128.39.196
 2021-10-13 06:45:36 | category/list/home.jsp             | 192.168.90.200
 2021-10-27 11:03:50 | category/posts/posts/index.php     | 10.124.166.226
 2021-10-26 01:35:12 | categories/search/category.htm     | 192.168.76.40
(10 rows)
&lt;/code&gt;&lt;/pre&gt;&lt;/li&gt;
&lt;/ol&gt;

      </description>
    </item>
    
    <item>
      <title>Kafka-Integration: Automatically consume data from Kafka with a scheduler</title>
      <link>/en/kafka-integration/consuming-data-from-kafka/automatically-consume-data-from-kafka-with-scheduler/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      
      <guid>/en/kafka-integration/consuming-data-from-kafka/automatically-consume-data-from-kafka-with-scheduler/</guid>
      <description>
        
        
        &lt;p&gt;OpenText™ Analytics Database offers a scheduler that loads streamed messages from one or more Kafka topics. Automatically loading streaming data has a number of advantages over manually using COPY:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;The streamed data automatically appears in your database. The frequency with which new data appears in your database is governed by the scheduler&#39;s frame duration.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The scheduler provides an exactly-once consumption process. The scheduler manages offsets for you so that each message sent by Kafka is consumed exactly once.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;You can configure backup schedulers to provide high-availability. Should the primary scheduler fail for some reason, the backup scheduler automatically takes over loading data.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The scheduler manages resources for the data load. You control its resource usage through the settings on the resource pool you assign to it. When loading manually, you must take into account the resources your load consumes.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;There are a few drawbacks to using a scheduler that might make it unsuitable for your needs. You may find that schedulers do not offer the flexibility you need for your load process. For example, schedulers cannot perform business logic during the load transaction. If you need to perform this sort of processing, you are better off creating your own load process. This process would periodically run COPY statements to load data from Kafka, then perform the business logic processing you need before committing the transaction.&lt;/p&gt;
&lt;p&gt;For information on job scheduler requirements, refer to &lt;a href=&#34;../../../en/supported-platforms/kafka-integrations/#&#34;&gt;Apache Kafka integrations&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&#34;what-the-job-scheduler-does&#34;&gt;What the job scheduler does&lt;/h2&gt;
&lt;p&gt;The scheduler is responsible for scheduling loads of data from Kafka. The scheduler&#39;s basic unit of processing is a frame, which is a period of time. Within each frame, the scheduler assigns a slice of time for each active microbatch to run. Each microbatch is responsible for loading data from a single source. Once the frame ends, the scheduler starts the next frame. The scheduler continues this process until you stop it.&lt;/p&gt;
&lt;h2 id=&#34;the-anatomy-of-a-scheduler&#34;&gt;The anatomy of a scheduler&lt;/h2&gt;
&lt;p&gt;Each scheduler has several groups of settings, each of which controls an aspect of the data load. The command sketch after the following list shows one way these groups map to vkconfig tools. These groups are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;The scheduler itself, which defines the configuration schema, frame duration, and resource pool.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Clusters, which define the hosts in the Kafka cluster that the scheduler contacts to load data. Each scheduler can contain multiple clusters, allowing you to load data from multiple Kafka clusters with a single scheduler.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Sources, which define the Kafka topics and partitions in those topics to read data from.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Targets, which define the tables in the database that will receive the data. These tables can be traditional database tables, or they can be flex tables.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Load specs, which define settings the database uses while loading the data. These settings include the parsers and filters the database needs to use to load the data. For example, if you are reading a Kafka topic that is in Avro format, your load spec needs to specify the Avro parser and schema.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Microbatches, which represent an individual segment of a data load from a Kafka stream. They combine the definitions for your cluster, source, target, and load spec that you create using the other vkconfig tools. The scheduler uses all of the information in the microbatch to execute &lt;a href=&#34;../../../en/sql-reference/statements/copy/#&#34;&gt;COPY&lt;/a&gt; statements using the &lt;a href=&#34;../../../en/kafka-integration/kafka-function-reference/kafkasource/#KafkaSource&#34;&gt;KafkaSource&lt;/a&gt; UDL function to transfer data from Kafka to OpenText™ Analytics Database. The statistics on each microbatch&#39;s load are stored in the &lt;a href=&#34;../../../en/kafka-integration/data-streaming-schema-tables/stream-microbatch-history/#&#34;&gt;stream_microbatch_history&lt;/a&gt; table.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
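&lt;p&gt;The following command sequence is a rough sketch of how these groups map to vkconfig tools, using an assumed configuration schema named weblog_sched, an assumed Kafka cluster kafka01.example.com:9092, and the web_hits topic and web_table target from elsewhere in this guide. It omits many options; see Setting up a scheduler for the complete procedure:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# All names and values below are placeholders; vkconfig is assumed to be
# /opt/vertica/packages/kafka/bin/vkconfig and already on the PATH.
$ vkconfig scheduler --create --config-schema weblog_sched --frame-duration 00:00:10
$ vkconfig cluster --create --config-schema weblog_sched --cluster kafka_weblog \
           --hosts kafka01.example.com:9092
$ vkconfig source --create --config-schema weblog_sched --cluster kafka_weblog \
           --source web_hits --partitions 1
$ vkconfig target --create --config-schema weblog_sched --target-schema public \
           --target-table web_table
$ vkconfig load-spec --create --config-schema weblog_sched --load-spec weblog_load \
           --parser KafkaJSONParser
$ vkconfig microbatch --create --config-schema weblog_sched --microbatch weblog \
           --target-schema public --target-table web_table \
           --add-source web_hits --add-source-cluster kafka_weblog \
           --load-spec weblog_load
&lt;/code&gt;&lt;/pre&gt;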
&lt;p&gt;&lt;a name=&#34;The&#34;&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2 id=&#34;the-vkconfig-script&#34;&gt;The vkconfig script&lt;/h2&gt;
&lt;p&gt;You use a Linux command-line script named vkconfig to create, configure, and run schedulers. This script is installed on your database hosts along with the database server in the following path:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;/opt/vertica/packages/kafka/bin/vkconfig
&lt;/code&gt;&lt;/pre&gt;
&lt;div class=&#34;alert admonition note&#34; role=&#34;alert&#34;&gt;
&lt;h4 class=&#34;admonition-head&#34;&gt;Note&lt;/h4&gt;

&lt;p&gt;You can install and use the vkconfig utility on a non-OpenText™ Analytics Database host. You may want to do this if:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;You do not want the scheduler to use the database host resources.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;You want users who do not have shell accounts on the database hosts to be able to set up and alter schedulers.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The easiest way to install vkconfig on a host is to install the database server RPM. You must use the RPM that matches the version of the database installed on your database cluster. Do not create a database after installing the RPM. The vkconfig utility and its associated files will be in the &lt;code&gt;/opt/vertica/packages/kafka/bin&lt;/code&gt; directory on the host.&lt;/p&gt;


&lt;/div&gt;
&lt;p&gt;The vkconfig script contains multiple tools. The first argument to the vkconfig script is always the tool you want to use. Each tool performs one function, such as changing one group of settings (such as clusters or sources) or starting and stopping the scheduler. For example, to create or configure a scheduler, you use the command:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;$ /opt/vertica/packages/kafka/bin/vkconfig scheduler &lt;span class=&#34;code-variable&#34;&gt;other options&lt;/span&gt;...
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;&lt;a name=&#34;What&#34;&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2 id=&#34;what-happens-when-you-create-a-scheduler&#34;&gt;What happens when you create a scheduler&lt;/h2&gt;
&lt;p&gt;When you create a new scheduler, the vkconfig script takes the following steps:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Creates a new database schema using the name you specified for the scheduler. You use this name to identify the scheduler during configuration.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Creates the tables needed to manage the Kafka data load in the newly-created schema. See &lt;a href=&#34;../../../en/kafka-integration/data-streaming-schema-tables/#&#34;&gt;Data streaming schema tables&lt;/a&gt; for more information.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
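&lt;p&gt;One way to confirm these steps, sketched here against the system catalog, is to list the tables that were created in the scheduler&#39;s schema (the schema name weblog_sched is an assumption):&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;=&amp;gt; SELECT table_name FROM v_catalog.tables WHERE table_schema = &amp;#39;weblog_sched&amp;#39;;
&lt;/code&gt;&lt;/pre&gt;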
&lt;p&gt;&lt;a name=&#34;Validati&#34;&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2 id=&#34;validating-schedulers&#34;&gt;Validating schedulers&lt;/h2&gt;
&lt;p&gt;When you create or configure a scheduler, it validates the following settings:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Confirms that all brokers in the specified cluster exist.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Connects to the specified host or hosts and retrieves the list of all brokers in the Kafka cluster. Getting this list always ensures that the scheduler has an up-to-date list of all the brokers. If the host is part of a cluster that has already been defined, the scheduler cancels the configuration.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Confirms that the specified source exists. If the source no longer exists, the source is disabled.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Retrieves the number of partitions in the source. If the number of partitions retrieved from the source is different than the partitions value saved by the scheduler, the database updates the scheduler with the number of partitions retrieved from the source in the cluster.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;You can disable validation using the &lt;code&gt;--validation-type&lt;/code&gt; option in the vkconfig script&#39;s scheduler tool. See &lt;a href=&#34;../../../en/kafka-integration/vkconfig-script-options/scheduler-tool-options/#&#34;&gt;Scheduler tool options&lt;/a&gt; for more information.&lt;/p&gt;
&lt;h2 id=&#34;synchronizing-schedulers&#34;&gt;Synchronizing schedulers&lt;/h2&gt;
&lt;p&gt;By default, the scheduler automatically synchronizes its configuration and source information with Kafka host clusters. You can configure the synchronization interval using the &lt;code&gt;--config-refresh&lt;/code&gt; scheduler utility option. At each interval, the scheduler:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Checks for updates to the scheduler&#39;s configuration by querying its settings in its database configuration schema.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Performs all of the checks listed in &lt;a href=&#34;#Validati&#34;&gt;Validating Schedulers&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;You can configure synchronization settings using the &lt;code&gt;--auto-sync&lt;/code&gt; option of the vkconfig script&#39;s scheduler tool. See &lt;a href=&#34;../../../en/kafka-integration/vkconfig-script-options/scheduler-tool-options/#&#34;&gt;Scheduler tool options&lt;/a&gt; for details.&lt;/p&gt;
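&lt;p&gt;For example, the following sketch disables automatic synchronization on an existing scheduler and lengthens its refresh interval. The schema name and the HH:MM:SS interval format are assumptions; check Scheduler tool options for the exact value syntax:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;$ vkconfig scheduler --update --config-schema weblog_sched --auto-sync false
$ vkconfig scheduler --update --config-schema weblog_sched --config-refresh 00:10:00
&lt;/code&gt;&lt;/pre&gt;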
&lt;h2 id=&#34;launching-a-scheduler&#34;&gt;Launching a scheduler&lt;/h2&gt;
&lt;p&gt;You use the vkconfig script&#39;s launch tool to launch a scheduler.&lt;/p&gt;
&lt;p&gt;When you launch a scheduler, it collects data from your sources, starting at the specified offset. You can view the &lt;a href=&#34;../../../en/kafka-integration/data-streaming-schema-tables/stream-microbatch-history/#&#34;&gt;stream_microbatch_history&lt;/a&gt; table to see what the scheduler is doing at any given time.&lt;/p&gt;
&lt;p&gt;To learn how to create, configure, and launch a scheduler, see &lt;a href=&#34;../../../en/kafka-integration/consuming-data-from-kafka/automatically-consume-data-from-kafka-with-scheduler/setting-up-scheduler/#&#34;&gt;Setting up a scheduler&lt;/a&gt; in this guide.&lt;/p&gt;
&lt;p&gt;You can also choose to bypass the scheduler. For example, you might want to do a single load with a specific range of offsets. For more information, see &lt;a href=&#34;../../../en/kafka-integration/consuming-data-from-kafka/manually-consume-data-from-kafka/#&#34;&gt;Manually consume data from Kafka&lt;/a&gt; in this guide.&lt;/p&gt;
&lt;p&gt;If the database cluster goes down, the scheduler attempts to reconnect but fails. You must relaunch the scheduler after the cluster restarts.&lt;/p&gt;
&lt;h2 id=&#34;managing-a-running-scheduler&#34;&gt;Managing a running scheduler&lt;/h2&gt;
&lt;p&gt;When you launch a scheduler from the command line, it runs in the foreground until you kill it (or the host shuts down). Usually, you want to run the scheduler as a daemon process that starts when the host operating system starts, or after the database has started.&lt;/p&gt;
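&lt;p&gt;One common approach, sketched below, is to launch the scheduler with nohup so that it keeps running in the background after you log out. The configuration schema name is an assumption, and in production you would typically wrap a command like this in a systemd unit or init script:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;$ nohup /opt/vertica/packages/kafka/bin/vkconfig launch \
        --config-schema weblog_sched &amp;gt;/dev/null 2&amp;gt;&amp;amp;1 &amp;amp;
&lt;/code&gt;&lt;/pre&gt;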
&lt;p&gt;You shut down a running scheduler using the vkconfig script&#39;s shutdown tool. See &lt;a href=&#34;../../../en/kafka-integration/vkconfig-script-options/shutdown-tool-options/#&#34;&gt;Shutdown tool options&lt;/a&gt; for details.&lt;/p&gt;
&lt;p&gt;You can change most of a scheduler&#39;s settings (adding or altering clusters, sources, targets, and microbatches for example) while it is running. The scheduler automatically acts on the configuration updates.&lt;/p&gt;
&lt;h2 id=&#34;launching-multiple-job-schedulers-for-high-availability&#34;&gt;Launching multiple job schedulers for high availability&lt;/h2&gt;
&lt;p&gt;For high availability, you can launch two or more identical schedulers that target the same configuration schema. You differentiate the schedulers using the launch tool&#39;s &lt;code&gt;--instance-name&lt;/code&gt; option (see &lt;a href=&#34;../../../en/kafka-integration/vkconfig-script-options/launch-tool-options/#&#34;&gt;Launch tool options&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;The active scheduler loads data and maintains an &lt;a href=&#34;../../../en/admin/db-locks/lock-modes/&#34;&gt;S lock&lt;/a&gt; on the &lt;a href=&#34;../../../en/kafka-integration/data-streaming-schema-tables/stream-lock/#&#34;&gt;stream_lock&lt;/a&gt; table. The other scheduler remains in standby mode until the active scheduler fails or is disabled. If the active scheduler fails, the backup scheduler immediately obtains the lock on the stream_lock table and takes over loading data from Kafka where the failed scheduler left off.&lt;/p&gt;
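&lt;p&gt;A sketch of launching a primary and a backup scheduler against the same configuration schema, typically from two different hosts, might look like the following. The instance names and schema name are placeholders:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# On the first host:
$ vkconfig launch --config-schema weblog_sched --instance-name primary
# On the second host:
$ vkconfig launch --config-schema weblog_sched --instance-name backup
&lt;/code&gt;&lt;/pre&gt;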
&lt;h2 id=&#34;managing-messages-rejected-during-automatic-loads&#34;&gt;Managing messages rejected during automatic loads&lt;/h2&gt;
&lt;p&gt;The database rejects messages during automatic loads using the parser definition, which is required in the &lt;a href=&#34;../../../en/kafka-integration/consuming-data-from-kafka/automatically-consume-data-from-kafka-with-scheduler/setting-up-scheduler/&#34;&gt;microbatch load spec&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The scheduler automatically creates a rejection table to store rejected messages for each microbatch. To manually specify a rejection table, use the &lt;code&gt;--rejection-schema&lt;/code&gt; and &lt;code&gt;--rejection-table&lt;/code&gt; &lt;a href=&#34;../../../en/kafka-integration/vkconfig-script-options/microbatch-tool-options/&#34;&gt;microbatch utility options&lt;/a&gt; when creating the microbatch. Query the &lt;a href=&#34;../../../en/kafka-integration/data-streaming-schema-tables/stream-microbatches/&#34;&gt;stream_microbatches&lt;/a&gt; table to return the rejection schema and table for a microbatch.&lt;/p&gt;
&lt;p&gt;For additional details about how the database handles rejected data, see &lt;a href=&#34;../../../en/data-load/handling-messy-data/#&#34;&gt;Handling messy data&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&#34;passing-options-to-the-schedulers-jvm&#34;&gt;Passing options to the scheduler&#39;s JVM&lt;/h2&gt;
&lt;p&gt;The scheduler uses a Java Virtual Machine to connect to the database via JDBC. You can pass command-line options to the JVM through a Linux environment variable named VKCONFIG_JVM_OPTS. This option is useful when configuring a scheduler to use TLS/SSL encryption when connecting to the database. See &lt;a href=&#34;../../../en/kafka-integration/tlsssl-encryption-with-kafka/configuring-your-scheduler-tls-connections/#&#34;&gt;Configuring your scheduler for TLS connections&lt;/a&gt; for more information.&lt;/p&gt;
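&lt;p&gt;For example, the following sketch passes a truststore to the scheduler&#39;s JVM before launching it. The paths, password, and schema name are placeholders, and the full set of TLS options is described in Configuring your scheduler for TLS connections:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;$ export VKCONFIG_JVM_OPTS=&amp;#34;-Djavax.net.ssl.trustStore=/path/to/truststore \
  -Djavax.net.ssl.trustStorePassword=truststore_password&amp;#34;
$ vkconfig launch --config-schema weblog_sched
&lt;/code&gt;&lt;/pre&gt;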
&lt;h2 id=&#34;viewing-schedulers-from-the-mc&#34;&gt;Viewing schedulers from the MC&lt;/h2&gt;
&lt;p&gt;You can view the status of Kafka jobs from the MC. For more information, refer to &lt;a href=&#34;../../../en/mc/cloud-platforms/aws-mc/loading-data-from-amazon-s3-using-mc/viewing-load-history/#&#34;&gt;Viewing load history&lt;/a&gt;.&lt;/p&gt;

      </description>
    </item>
    
    <item>
      <title>Kafka-Integration: Monitoring message consumption</title>
      <link>/en/kafka-integration/consuming-data-from-kafka/monitoring-message-consumption/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      
      <guid>/en/kafka-integration/consuming-data-from-kafka/monitoring-message-consumption/</guid>
      <description>
        
        
        &lt;p&gt;You can monitor the progress of your data streaming from Kafka several ways:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Monitoring the consumer groups to which OpenText™ Analytics Database reports its progress. This technique is best if the tools you want to use to monitor your data load work with Kafka.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Using the monitoring APIs built into the vkconfig tool (see the sketch after this list). These APIs report the configuration and consumption of your streaming scheduler in JSON format. They are useful if you are developing your own monitoring scripts or if your monitoring tools can consume status information in JSON format.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
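&lt;p&gt;For example, a sketch of calling one of the vkconfig monitoring tools: the statistics tool reports microbatch history in JSON format. The configuration schema name is a placeholder:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;$ /opt/vertica/packages/kafka/bin/vkconfig statistics --config-schema weblog_sched
&lt;/code&gt;&lt;/pre&gt;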

      </description>
    </item>
    
    <item>
      <title>Kafka-Integration: Parsing custom formats</title>
      <link>/en/kafka-integration/consuming-data-from-kafka/parsing-custom-formats/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      
      <guid>/en/kafka-integration/consuming-data-from-kafka/parsing-custom-formats/</guid>
      <description>
        
        
        &lt;p&gt;To process a Kafka data stream, the parser must identify the boundary between each message. OpenText™ Analytics Database provides Kafka parsers that can identify boundaries for &lt;a href=&#34;../../../en/kafka-integration/kafka-function-reference/kafkaavroparser/&#34;&gt;Avro&lt;/a&gt;, &lt;a href=&#34;../../../en/kafka-integration/kafka-function-reference/kafkajsonparser/&#34;&gt;JSON&lt;/a&gt;, and &lt;a href=&#34;../../../en/kafka-integration/kafka-function-reference/kafkaparser/&#34;&gt;raw data&lt;/a&gt; formats, but your data stream might use a custom format. To parse custom formats, the database provides filters that insert boundary information in the data stream before it reaches the parser.&lt;/p&gt;
&lt;h2 id=&#34;kafka-filters&#34;&gt;Kafka filters&lt;/h2&gt;
&lt;p&gt;The database provides the following filters:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;KafkaInsertDelimiters: Inserts a user-specified delimiter between each message in the data stream. The delimiter can contain any characters and be of any length. This filter uses the following syntax:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;KafkaInsertDelimiters(delimiter = &#39;&lt;/code&gt;&lt;em&gt;&lt;code&gt;delimiter&lt;/code&gt;&lt;/em&gt;&lt;code&gt;&#39;)&lt;/code&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;KafkaInsertLengths: Inserts the message length in bytes at the beginning of the message. The database writes the length as a 4-byte uint32 value in big-endian network byte order. For example, a 100-byte message is preceded by 0x00000064. This filter uses the following syntax:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;KafkaInsertLengths()&lt;/code&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&#34;alert admonition note&#34; role=&#34;alert&#34;&gt;
&lt;h4 class=&#34;admonition-head&#34;&gt;Note&lt;/h4&gt;

Because each Kafka filter requires input from a stream source and outputs a non-stream source, you cannot use both Kafka filters in the same COPY statement to process a Kafka data stream.

&lt;/div&gt;
&lt;p&gt;In addition to one of the Kafka filters, you can include one or more user-defined filters in a single COPY statement. Specify multiple filters as a comma-separated list, and list the database filter first. If you use a non-Kafka parser, you must use at least one filter to prepare the data stream for the parser, or the parser fails and returns an error.&lt;/p&gt;
&lt;h2 id=&#34;examples&#34;&gt;Examples&lt;/h2&gt;
&lt;p&gt;The following COPY statement loads comma-separated values from two partitions in a topic named &lt;code&gt;iot-data&lt;/code&gt;. The load exits after it processes all messages in both partitions. The KafkaInsertDelimiters filter inserts newlines between the Kafka messages to convert them into traditional rows of data. The statement uses the standard COPY parser to delimit CSV values with a comma:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;=&amp;gt; COPY kafka_iot SOURCE KafkaSource(stream=&amp;#39;iot-data|0|-2,iot-data|1|-2&amp;#39;,
                                     brokers=&amp;#39;kafka01:9092&amp;#39;,
                                     stop_on_eof=True)
                  FILTER KafkaInsertDelimiters(delimiter = E&amp;#39;\n&amp;#39;)
                  DELIMITER &amp;#39;,&amp;#39;;
 Rows Loaded
-------------
        3430
(1 row)
&lt;/code&gt;&lt;/pre&gt;
      </description>
    </item>
    
  </channel>
</rss>
