Data pipelines
Data pipeline
A data pipeline, also known as a data loader, is a declarative, automated way to process data continuously from external sources such as Amazon S3 or Kafka. Data pipelines let you define where, when, and how to load data into OTCAD with minimal manual intervention.
Data pipelines UI tour
You can view and manage data pipelines from the Data pipelines page. After logging in to the OTCAD application, you land on the home page. Select the More options button, and then select Data Pipeline. The Data pipelines page provides the following options:

| Option | Description |
|---|---|
| Create a pipeline | Select this option to create a data pipeline. |
| Filter by | Select this option to filter data pipelines by different criteria. |
| Search pipeline name or schema | Search for a data pipeline based on the pipeline name or schema. |
| Actions | Select ⋮ in the Actions column to view pipeline details, edit, clone, pause or resume schedule, or delete a data pipeline. |
This page displays the following information in the Overview card:
- Total number of pipelines - The Total Pipelines card displays the total number of pipelines configured in the system.
- Pipelines that failed execution - Select the Failed Execution list and the period for which you need to view the pipelines that failed to execute.
- Active pipelines - The Active Pipelines card displays the total number of pipelines that are in the Active status.
- From the Duration list, select the duration for which you need to view the data pipelines. You can choose from the following:
  - Last 24 hours
  - Last week
  - Last month
  - Last 6 months
  - Last year
  - Anytime
The Pipelines area displays the following information:
- Pipeline name - The name of the data pipeline.
- Created by - The user ID of the person who created the data pipeline.
- Data source - The source location of the files that contain the data to be loaded.
Note
Data pipelines created from an AWS object store appear with the prefix S3, while data pipelines created from a Kafka data source appear with the prefix Topic.
- Schema - The tables, columns (fields), data types, and relationships among different tables in the database.
- Destination table - The database table where data is written after it has been processed or transformed from a source table or other data source.
- Last run on - The timestamp at which the data pipeline was last run.
- Last run state - Indicates the state of the data pipeline when it was last run. The values are:
- Executed - The data pipelines that are executed successfully.
- Failed - The data pipelines that have failed to execute.
- Partially executed - The data pipelines that are executed partially.
- Scheduled - The data pipelines that are scheduled to run at a specific time.
- Pipeline status - Indicates the present status of the data pipeline. The values are:
- Active - The data pipelines that are active.
- Inactive - The data pipelines that are inactive.
- Paused - The data pipelines that are paused for execution.
- Actions - Options to view pipeline details, edit, clone, pause or resume schedule, and delete a data pipeline.
Create a data pipeline
You can create a data pipeline using either the AWS object store or Kafka data source.
Note
When creating data pipelines, ensure that multiple database objects do not have the same name.
Create a data pipeline using AWS object store
AWS object store, primarily Amazon Simple Storage Service (Amazon S3), is a highly scalable, durable, and cost-effective cloud storage service for unstructured data. It stores data as objects in flat containers called buckets.
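The procedure below asks for an access key ID, a secret access key, a region, and a bucket or folder path. As a rough, hedged sketch of what those values give the pipeline access to, the following Python snippet (using boto3; the bucket name, prefix, and credentials are placeholders, not values from this product) lists the objects such a pipeline would pick up:

```python
# Minimal sketch: confirm that the credentials, region, and bucket/folder path
# you plan to give the pipeline can see the files to be loaded.
# "my-data-bucket" and "incoming/orders/" are placeholder values.
import boto3

s3 = boto3.client(
    "s3",
    aws_access_key_id="AKIA...",          # Access key ID
    aws_secret_access_key="...",          # Secret access key
    region_name="us-east-1",              # Region of the S3 bucket
)

resp = s3.list_objects_v2(Bucket="my-data-bucket", Prefix="incoming/orders/")
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])

# If the bucket uses SSE-KMS (the "Data is encrypted" option below), objects
# are typically written with an encryption key, for example:
# s3.put_object(Bucket="my-data-bucket", Key="incoming/orders/part-0001.csv",
#               Body=b"...", ServerSideEncryption="aws:kms",
#               SSEKMSKeyId="<encryption key ID>")
```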
To create a data pipeline using AWS object store, do the following:
- In the Data pipelines page, select +Create a pipeline.
  The Create a pipeline page is displayed.
- In the Pipeline name field, enter the name of the pipeline.
- Select AWS object store.
- In the Access key ID field, enter your AWS account access key ID.
- In the Secret access key field, enter your AWS account secret access key.
  Note
  Provide valid AWS credentials for the Access key ID and Secret access key. Invalid AWS credentials do not allow you to create a data pipeline.
- From the Region list, select the region (geography) of the S3 bucket where the files are present.
- In the S3 Bucket/File/Folder path field, enter the bucket name or the folder path where the files are present.
- Select the Data is encrypted option to specify the following parameters:
  - Select either AWS Key Management Service Key or Customer managed keys if you wish to encrypt and load data into the S3 bucket.
  - Select the Encryption key ID.
- Select Next.
- In the Retry limit field, specify the number of times the system should attempt to retry a failed file load.
- In the Parameters field, specify the copy parameters. For more information, see Parameters.
- Select Next.
- From the Destination table list, select the destination table to which you need to load the data.
- Select Next.
- Specify the schedule at which the data pipeline needs to run. Do one of the following:
  - Select Schedule.
    - From the date pickers, select the Start date and End date.
    - In the Repeat every field, specify the interval at which the data pipeline needs to run.
    - From the Unit list, select the minute, hour, day, week, or month at which the data pipeline needs to run.
    - Select the On day option and specify the day on which the data pipeline needs to run.
    - Select the On option and specify the exact day and month on which the data pipeline needs to run.
    - Select the Trigger when something is added option to run the data pipeline when a file is added to the S3 bucket.
    - Enter the SQS credentials in the Access key ID, Secret access key, and Resource URL fields (see the sketch after this procedure).
  Or
  - Select Execute once.
- Select Finish.
The data pipeline is created and displayed in the Data pipelines page.
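The Trigger when something is added option relies on Amazon SQS receiving S3 event notifications. As a hedged illustration of what the SQS credentials and Resource URL correspond to (the queue URL, credentials, and bucket layout below are placeholders), a client that polls such a queue looks roughly like this:

```python
# Rough sketch of the S3-to-SQS trigger the pipeline consumes: the bucket
# publishes "object created" events to an SQS queue, and the pipeline polls
# that queue. Queue URL and credentials are placeholders.
import json
import boto3

sqs = boto3.client(
    "sqs",
    aws_access_key_id="AKIA...",          # SQS Access key ID
    aws_secret_access_key="...",          # SQS Secret access key
    region_name="us-east-1",
)
queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/my-ingest-queue"  # Resource URL

resp = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=10, WaitTimeSeconds=20)
for msg in resp.get("Messages", []):
    event = json.loads(msg["Body"])
    for record in event.get("Records", []):
        # Each record names the bucket and object key that was just added.
        print(record["s3"]["bucket"]["name"], record["s3"]["object"]["key"])
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
```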
Create a data pipeline using Kafka data source
Define data pipelines to ingest real-time data through Apache Kafka, an open-source, distributed, real-time streaming platform, by leveraging Kafka topics for efficient streaming and processing. A Kafka topic is a category or feed name for streams of data, similar to a table in a database. Consumers read data from Kafka topics, and topics are organized into partitions for parallel processing. Each message in a topic is a record with a key, value, and timestamp. Kafka topics are logs in which messages are ordered by an offset within each partition.
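For orientation, the following Python sketch (using the confluent-kafka client; the broker address, topic name, and group ID are placeholders) reads records from a topic and prints the partition, offset, key, and value that the concepts above refer to:

```python
# Minimal sketch of reading a Kafka topic: each record arrives from a
# partition with a monotonically increasing offset, plus a key and value.
# Broker address, topic, and group.id are placeholder values.
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "10.20.41.15:9092",  # broker address(es)
    "group.id": "otcad-demo",
    "auto.offset.reset": "earliest",          # start from the first offset
})
consumer.subscribe(["orders"])                # the Kafka topic

try:
    for _ in range(100):
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        print(msg.partition(), msg.offset(), msg.key(), msg.value())
finally:
    consumer.close()
```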
To create a data pipeline using Kafka data source, do the following:
- In the Data pipelines page, select +Create a pipeline. The Create a pipeline page is displayed.
- In the Pipeline name field, enter the name of the pipeline.
- Select Kafka.
  Note
  Create a destination table from the SQL editor before creating the data pipeline.
- In the Bootstrap servers field, enter the initial list of Kafka broker addresses that a Kafka client uses to connect to the Kafka cluster.
  Note
  You can use only those bootstrap servers that are whitelisted by OpenText. For assistance, contact the technical support team.
  Note
  For multi-cluster configurations, provide the addresses as a comma-separated list. For example, 10.20.41.15:9092,10.20.41.16:9092.
- Do one of the following (see the client configuration sketch after this procedure):
  - Select SSL if your Kafka broker is configured with SSL.
    SSL (Secure Sockets Layer) is a protocol that creates an encrypted link between a web server and a web browser to ensure that all data transmitted between them is confidential.
    - In the Client key field, enter the encrypted client key generated for the SSL certificate.
    - In the Password field, enter the password associated with the encrypted client key.
    - In the Client certificate field, enter the certificate used to authenticate the client with the Kafka broker.
    - In the CA certificate field, enter the certificate authority (CA) certificate used to validate the Kafka broker.
  Or
  - Select SASL if your Kafka broker is authenticated with Simple Authentication and Security Layer (SASL).
    SASL is a Kafka framework for authenticating clients and brokers, which can be used with or without TLS/SSL encryption. It allows Kafka to support different authentication mechanisms.
    - In the SASL mechanism list, select one of the following:
      - Plain - A simple username/password authentication mechanism used with TLS encryption to implement secure authentication.
      - SCRAM-SHA-256 - A secure, password-based authentication method that uses a challenge-response protocol and the SHA-256 hashing algorithm to verify user credentials without sending the password in plain text over the network.
      - SCRAM-SHA-512 - A secure authentication mechanism that uses the SHA-512 cryptographic hash function to verify a user's credentials in a challenge-response exchange, which prevents the password from being sent directly over the network.
    - In the Username field, enter a valid username.
    - In the Password field, enter the password for SASL authentication.
- Select Proceed.
- In the Define configuration area, specify the configuration settings for the data source:
  - In the Topic field, enter the Kafka topic. A Kafka topic is a logical grouping of messages, split into multiple partitions for parallelism.
  - In the Partition box, type or select the number of partitions for the Kafka topic. A partition is a sequence of records within a topic, stored on brokers and consumed independently.
  - In the Start offset box, type or select the starting offset for each partition. The offset is a unique identifier for each record within a partition, used to track consumer progress.
  - Select +Add topic to add more topics and partitions.
  - Select one of the available parser options depending on the message type in the topic:
    - AVRO - An Avro schema registry is a crucial component in systems that use Apache Kafka and Avro for data serialization. Its primary purpose is to centralize the management and evolution of Avro schemas, providing a robust mechanism for ensuring data compatibility and governance in streaming data pipelines. Do one of the following:
      - In the URL field, enter the schema registry URL.
      - In the Subject field, enter the subject information from the schema registry.
      - In the Version field, enter the version of the schema in the schema registry.
      Or
      - In the External Schema field, enter the schema of the AVRO message. Ensure that the schema is in JSON format.
    - JSON - A JSON schema registry for Kafka provides a mechanism to define, manage, and enforce the structure of JSON data being produced to and consumed from Kafka topics. This ensures data consistency and compatibility, especially in distributed systems where multiple applications interact with the same data streams.
    - Kafka - A Kafka schema registry is an external service that acts as a central repository for managing and validating schemas for data in a Kafka cluster. It allows producers to register schemas and consumers to retrieve them, ensuring data consistency and compatibility as schemas evolve over time.
- Select Proceed.
- From the Destination table list, select the destination table to which you need to load the data.
  Note
  New data is added to the end of the selected table without changing existing records.
- Select Proceed.
- Specify the schedule at which the data pipeline needs to run. Do one of the following:
  - Select Schedule.
    - From the date pickers, select the Start date and End date.
    - In the Repeat every field, specify the interval at which the data pipeline needs to run.
    - From the Unit list, select the minute, hour, day, week, or month at which the data pipeline needs to run.
    - Select the On day option and specify the day on which the data pipeline needs to run.
    - Select the On option and specify the exact day of the week on which the data pipeline needs to run.
  Or
  - Select Execute once.
- Select Finish.
The data pipeline is created and displayed in the Data pipelines page.
Note
The data pipeline is created successfully only if the SQL query execution is successful. If the SQL query execution fails, the reason for the failure appears in a message. Resolve the SQL query to ensure successful query execution and data pipeline creation.
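The SSL/SASL fields and the schema registry URL, Subject, and Version fields in this procedure map roughly onto standard Kafka client settings. The following Python sketch (using the confluent-kafka package; all addresses, credentials, file paths, and the subject name are placeholders, and the property names are generic librdkafka settings rather than anything OTCAD-specific) shows a SASL-authenticated consumer plus a lookup of a registered Avro schema:

```python
# Hedged sketch: a SASL_SSL-authenticated Kafka consumer and an Avro schema
# lookup by subject. Every address, credential, and path is a placeholder;
# OTCAD's own connection handling may differ.
from confluent_kafka import Consumer
from confluent_kafka.schema_registry import SchemaRegistryClient

consumer = Consumer({
    "bootstrap.servers": "10.20.41.15:9092,10.20.41.16:9092",  # comma-separated brokers
    "group.id": "otcad-pipeline-demo",
    "security.protocol": "SASL_SSL",
    "sasl.mechanism": "SCRAM-SHA-256",       # or "PLAIN" / "SCRAM-SHA-512"
    "sasl.username": "pipeline_user",
    "sasl.password": "secret",
    "ssl.ca.location": "/etc/kafka/ca.pem",  # CA certificate
})
consumer.subscribe(["orders"])

# Schema registry: fetch the latest schema registered under a subject.
registry = SchemaRegistryClient({"url": "https://schema-registry.example.com"})
latest = registry.get_latest_version("orders-value")   # Subject
print(latest.version, latest.schema.schema_str)        # Version and Avro schema (JSON)
```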
Filter a data pipeline
You can filter data pipelines based on certain criteria. Filter data pipelines in one of these ways:
- Schema
  - Select the Filter icon in the Data pipelines page.
  - Expand the Schema list.
  - Select the required schema.
- Data source
  - Select the Filter icon in the Data pipelines page.
  - Expand the Data source list.
  - Select the data source or data sources for which you wish to view the data pipelines.
  Note
  Data pipelines created from an AWS object store appear with the prefix **S3**, while data pipelines created from a Kafka data source appear with the prefix **Topic**.
- Pipeline status
  - Select the Filter icon in the Data pipelines page.
  - Expand the Pipeline status list.
  - Select Active to view the data pipelines that are active. A data pipeline is in the Active status when the schedule end date is a future date and there are one or more ingestions that are yet to complete.
  - Select Inactive to view the data pipelines that are inactive. A data pipeline is in the Inactive status either when the end date is in the past or when there are no ingestions that are yet to complete.
- ID of the person who created the pipeline
  - Select the Filter icon in the Data pipelines page.
  - Expand the Created by list.
  - Select the user ID of the person who created the data pipeline.
Search a data pipeline
All data pipelines are displayed by default. You can search data pipelines using specific search criteria. To search data pipelines, do the following:
- In the Data pipelines page, select +Search pipeline name or schema.
- Enter either the pipeline name or the schema name.
- You can sort the data pipelines by pipeline name or by the date on which they were last run. To sort in ascending or descending order, select the Sort icon in the Pipeline name or Last run on column in the Data pipelines page.
View pipeline details
You can view pipeline details in the Data pipelines page. To view the details of a data pipeline, do the following:
- In the Data pipelines page, hover over the Pipeline name column and select the +View details icon for a data pipeline. You can view the following details of the data pipeline in this page:
  - Instance started at - Displays the timestamp of the data pipeline execution.
  - Transaction ID - Displays the transaction ID of the data pipeline execution.
  - Total file size - Displays the size of the file processed by the data pipeline.
  - File loaded or Rows loaded - Displays the number of files loaded (for the AWS object store) or the total number of rows loaded into the target table (if the data source is Kafka).
  - Status - Indicates the state of the data pipeline when it was last run. The values are:
    - Executed - All rows are successfully loaded into the target table and there are no entries in the rejects table.
    - Failed - For data pipelines that use the AWS object store, no rows are loaded into the target table. For data pipelines that use the Kafka data source, some rows are loaded into the target table and the rejects table contains one or more entries.
    - Partially executed - For data pipelines that use the AWS object store, some rows are loaded into the target table and the rejects table contains one or more entries that failed. For data pipelines that use the Kafka data source, no rows are loaded into the target table and the rejects table contains one or more entries.
  Note
  For data pipelines that use the Kafka data source and are in the Failed state, you can view the rejects table and edit and fix the entries that display an error. The state of such data pipelines then changes to Partially Fixed. However, you cannot edit and fix the entries in the rejects table for data pipelines that are in the Partially Fixed state.
- For data pipelines that use the AWS object store and are in either the Failed or Partially executed status, select the i icon.
  The Files tab appears with the following information:
  - Transaction ID
  - Timestamp of the transaction
  - Batch ID
  - File name
  - File size
  - Status
  - Rows loaded
  - Actions column (⋮) with options to view the details and synchronize the file.
  - Select View details.
    The error log appears with the file name, row number, failure reason, and rejected data. Resolve the error and reload the file.
  - Select Sync file to synchronize the data between the source and destination files. A message "File sync in progress" appears at the top of the page.
- Select Execute pipeline to execute the selected data pipeline. Executing a data pipeline loads all files that have not already been loaded and that have not reached the retry limit (see the sketch below). Executing the data pipeline commits the transaction.
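As a purely illustrative restatement of that selection rule, the following Python sketch shows which files an execution would attempt to load. The type and function names are hypothetical and do not come from OTCAD.

```python
# Hypothetical sketch of the selection rule described above: execution loads
# every file that is not already loaded and has not exhausted its retries.
from dataclasses import dataclass

@dataclass
class FileState:
    name: str
    loaded: bool
    attempts: int

def files_to_load(files: list[FileState], retry_limit: int) -> list[FileState]:
    """Return the files an execution would try to load."""
    return [f for f in files if not f.loaded and f.attempts < retry_limit]

# Example: with a retry limit of 3, only "b.csv" qualifies.
states = [
    FileState("a.csv", loaded=True, attempts=1),
    FileState("b.csv", loaded=False, attempts=2),
    FileState("c.csv", loaded=False, attempts=3),   # reached the retry limit
]
print([f.name for f in files_to_load(states, retry_limit=3)])  # ['b.csv']
```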
Edit a data pipeline using AWS object store
After creating a data pipeline, you can edit the details to suit your requirements. To edit a data pipeline, do the following:
- In the Data pipelines page, hover over the Pipeline name column and select the +Edit pipeline icon for a data pipeline.
- Select AWS object store.
  All the details about the data pipeline are pre-populated, except the Access key ID and Secret access key.
- In the Access key ID field, enter your AWS account access key ID.
- In the Secret access key field, enter your AWS account secret access key.
  Note
  Provide valid AWS credentials for the **Access key ID** and **Secret access key**. Invalid AWS credentials do not allow you to edit a data pipeline.
For more information about editing a data pipeline using AWS object store, see Create a data pipeline using AWS object store.
Edit a data pipeline using Kafka data source
After creating a data pipeline, you can edit the details to suit your requirements. To edit a data pipeline, do the following:
- In the Data pipelines page, hover over the Pipeline name column and select the +Edit pipeline icon for a data pipeline.
- Select Kafka.
  Information is pre-populated in all fields.
- Do one of the following:
  - Select SSL if your Kafka broker is configured with SSL (see the sketch after this procedure).
    - In the Client key field, enter the encrypted client key generated for the SSL certificate.
    - In the Password field, enter the password associated with the encrypted client key.
    - In the Client certificate field, enter the certificate used to authenticate the client with the Kafka broker.
    - In the CA certificate field, enter the certificate authority (CA) certificate used to validate the Kafka broker.
  Or
  - Select SASL if your Kafka broker is authenticated with Simple Authentication and Security Layer (SASL).
    - In the SASL mechanism list, select one of the following:
      - Plain - A simple username/password authentication mechanism used with TLS encryption to implement secure authentication.
      - SCRAM-SHA-256 - A secure, password-based authentication method that uses a challenge-response protocol and the SHA-256 hashing algorithm to verify user credentials without sending the password in plain text over the network.
      - SCRAM-SHA-512 - A secure authentication mechanism that uses the SHA-512 cryptographic hash function to verify a user's credentials in a challenge-response exchange, which prevents the password from being sent directly over the network.
    - In the Username field, enter a valid username.
    - In the Password field, enter the password for SASL authentication.
For more information about editing a data pipeline using Kafka data source, see Create a data pipeline using Kafka data source.
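For reference, the Client key, Password, Client certificate, and CA certificate fields above correspond roughly to the standard TLS settings of a Kafka client. A hedged sketch in Python using the confluent-kafka package follows; all file paths, the broker address, and the group ID are placeholders, and the property names are generic librdkafka settings rather than OTCAD internals.

```python
# Hedged sketch: an SSL (mutual TLS) Kafka consumer configuration.
# Paths, passphrase, and broker address are placeholders.
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "10.20.41.15:9093",
    "group.id": "otcad-edit-demo",
    "security.protocol": "SSL",
    "ssl.key.location": "/etc/kafka/client.key",         # Client key
    "ssl.key.password": "key-passphrase",                # Password for the client key
    "ssl.certificate.location": "/etc/kafka/client.pem", # Client certificate
    "ssl.ca.location": "/etc/kafka/ca.pem",              # CA certificate
})
consumer.subscribe(["orders"])
```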
Clone a data pipeline using AWS object store
You can create a clone or replica of an existing data pipeline. The configurations of the existing data pipeline are copied to the cloned data pipeline. You can edit these configuration settings in the cloned data pipeline.
- In the Data pipelines page, select ⋮ in the Actions column.
- Select Clone. The Create a pipeline page is displayed.
- Select AWS object store.
- In the Pipeline name field, enter the name of the pipeline.
- In the Access key ID field, enter your AWS account access key ID.
- In the Secret access key field, enter your AWS account secret access key.
Information in all other fields is pre-populated. You can edit this information. For more information about cloning a data pipeline using AWS object store, see Create a data pipeline using AWS object store.
Note
Provide valid AWS credentials for the **Access key ID** and **Secret access key**. Invalid AWS credentials do not allow you to clone a data pipeline.
Clone a data pipeline using Kafka data source
You can create a clone or replica of an existing data pipeline. The configurations of the existing data pipeline are copied to the cloned data pipeline. You can edit these configuration settings in the cloned data pipeline.
- In the Data pipelines page, select ⋮ in the Actions column.
- Select Clone. The Create a pipeline page is displayed.
- Select Kafka.
- In the Pipeline name field, enter the name of the pipeline.
- Do one of the following:
  - Select SSL if your Kafka broker is configured with SSL.
    - In the Client key field, enter the encrypted client key generated for the SSL certificate.
    - In the Password field, enter the password associated with the encrypted client key.
    - In the Client certificate field, enter the certificate used to authenticate the client with the Kafka broker.
    - In the CA certificate field, enter the certificate authority (CA) certificate used to validate the Kafka broker.
  Or
  - Select SASL if your Kafka broker is authenticated with Simple Authentication and Security Layer (SASL).
    - In the SASL mechanism list, select one of the following:
      - Plain - A simple username/password authentication mechanism used with TLS encryption to implement secure authentication.
      - SCRAM-SHA-256 - A secure, password-based authentication method that uses a challenge-response protocol and the SHA-256 hashing algorithm to verify user credentials without sending the password in plain text over the network.
      - SCRAM-SHA-512 - A secure authentication mechanism that uses the SHA-512 cryptographic hash function to verify a user's credentials in a challenge-response exchange, which prevents the password from being sent directly over the network.
    - In the Username field, enter a valid username.
    - In the Password field, enter the password for SASL authentication.
Information in all other fields is pre-populated. You can edit this information. For more information about cloning a data pipeline using Kafka data source, see Create a data pipeline using Kafka data source.
Pause and resume schedule
You can pause data ingestion anytime to suit your requirement.
- In the Data pipelines page, select ⋮ in the Actions column.
- Select Pause schedule.
- In the Confirmation dialog, select Pause.
  This pauses all data ingestions until the state is changed. Data that is already ingested is not affected. A message "Data ingestion paused successfully" is displayed.
You can resume data ingestion anytime to suit your requirement.
- In the Data pipelines page, select the data pipeline that is paused and select ⋮ in the Actions column.
- Select Resume schedule.
- In the Confirmation dialog, select Resume.
  This resumes all data ingestions based on the defined schedule. A message "Data ingestion resumed successfully" is displayed.
Delete a data pipeline
You can delete a data pipeline that is no longer in use or required.
- In the Data pipelines page, select a data pipeline and select ⋮ in the Actions column.
- Select Delete.
- In the Confirmation dialog, select Remove.
Note
Deleting a data pipeline removes all data pipeline configurations permanently. However, data that is already ingested is not deleted.