Data Pipelines
A data pipeline, also known as a data loader, is a declarative, automated way to continuously process data from external sources such as Amazon S3. Data pipelines let you define where, when, and how to load data into OTCAD with minimal manual intervention.
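As an illustration only (OTCAD's actual configuration format is not shown in this page), the "where, when, and how" of a pipeline can be pictured as a small declarative record; every field name below mirrors a setting described later in this page:

```python
# Hypothetical, illustrative pipeline definition -- not OTCAD's real format.
# Field names mirror the settings described in this page.
pipeline = {
    "pipeline_name": "daily-orders-load",
    "data_source": "s3://example-bucket/orders/",     # where data comes from
    "destination_table": "orders_staging",            # where data is written
    "retry_limit": 3,                                 # retries per failed file load
    "schedule": {"repeat_every": 1, "unit": "day"},   # when the pipeline runs
}

# "Declarative" means you state the desired configuration and the system
# performs the loads; there is no imperative load loop to write yourself.
assert pipeline["data_source"].startswith("s3://")
```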
Data pipelines UI tour
You can view and manage data pipelines from the OTCAD home page. After logging in to the OTCAD application, select the More options button and then Data Pipeline. The Data pipeline page appears.
This page displays the following information in the Overview card:
- Total number of pipelines - The Total Pipelines card displays the total number of pipelines configured in the system.
- Pipelines that failed execution - Select the Failed Execution drop-down list and choose the period for which you want to view the pipelines that failed to execute.
- Active pipelines - The Active Pipelines card displays the total number of pipelines that are in the Active status.
The Pipelines area displays the following information:
- Pipeline name - The name of the data pipeline.
- Created by - The user ID of the person who created the data pipeline.
- Data source - The source location of the files that contain the data to be loaded.
- Schema - The tables, columns (fields), data types, and relationships among different tables in the database.
- Destination table - The database table where data is written after it has been processed or transformed from a source table or other data source.
- Last run on - The timestamp at which the data pipeline was last run.
- Last run status - Indicates the status of the data pipeline when it was last run.
- Pipeline status - Indicates the present status of the data pipeline.
| Option | Description |
| --- | --- |
| a | Create a pipeline: Select this option to create a data pipeline. |
| b | Filter by: Select this option to filter data pipelines by different criteria. |
| c | Search pipeline name or schema: Search for a data pipeline based on the pipeline name or schema. |
| d | View details: Select this option to view the pipeline details. |
| e | Edit pipeline: Select this option to edit the data pipeline. |
| f | Clone pipeline: Select this option to clone a data pipeline. |
| g | Remove pipeline: Select this option to delete a data pipeline. |
Create a data pipeline
To create a data pipeline, do the following:
- In the Data Pipelines page, select +Create a pipeline. The Create a pipeline page displays.
- In the Pipeline name field, enter the name of the pipeline.
- In the Access key ID field, enter your AWS account access key ID.
- In the Secret access key field, enter your AWS account secret access key.
Note
Provide valid AWS credentials for the Access key ID and Secret access key. Invalid AWS credentials do not allow you to create a data pipeline.
- From the Region drop-down list, select the region (geography) of the S3 bucket where the files are present.
- In the S3 Bucket/File/Folder path field, enter the name or the folder path where the files are present.
- Select the Data is encrypted option to specify the following parameters:
- Select either AWS Key Management Service Key or Customer managed keys if you wish to encrypt and load data into the S3 bucket.
- Select Encryption key ID.
- Select Next.
- In the Retry limit field, specify the number of times the system retries a failed file load.
- In the Parameters field, specify the copy parameters. For more information, see Parameters.
- Select Next.
- From the Destination table drop-down list, select the destination table to which you need to load the data.
- Select Next.
- Specify the schedule at which the data pipeline needs to run. Do one of the following:
- Select Schedule.
- From the date pickers, select the Start date and End date.
- In the Repeat every field, specify the interval at which the data pipeline needs to run.
- From the Unit drop-down list, select minute, hour, day, week, or month as the unit of that interval.
- Select the On day radio button and specify the day on which the data pipeline needs to run.
- Select the On radio button and specify the exact day and month on which the data pipeline needs to run.
- Select the option Trigger when something is added to run the data pipeline when a file is added to the S3 bucket.
- Enter the SQS credentials in the Access key ID, Secret access key, and Resource URL fields.
- Select Execute once.
- Select Finish. The data pipeline is created and displayed in the Data Pipelines page.
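The Repeat every and Unit schedule above amounts to a fixed interval between runs. A minimal sketch, assuming the fixed-width units only (minute, hour, day, week; months need calendar arithmetic and are omitted here):

```python
from datetime import datetime, timedelta

# Interval lengths for the fixed-width units. "month" is intentionally left
# out of this sketch: it would need calendar arithmetic, not a timedelta.
UNIT = {
    "minute": timedelta(minutes=1),
    "hour": timedelta(hours=1),
    "day": timedelta(days=1),
    "week": timedelta(weeks=1),
}

def next_runs(start, repeat_every, unit, end):
    """Yield run times from start to end, stepping repeat_every * unit."""
    step = repeat_every * UNIT[unit]
    t = start
    while t <= end:
        yield t
        t += step

# "Repeat every 2 days" between a start and end date:
runs = list(next_runs(datetime(2024, 5, 1), 2, "day", datetime(2024, 5, 7)))
# Runs fall on 1, 3, 5, and 7 May.
```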
Filter a data pipeline
You can filter data pipelines based on certain criteria. Filter data pipelines in one of these ways:
- The date range in which the data pipelines last ran.
- Select the Filter icon in the Data Pipelines page.
- Expand the Last run range drop-down list.
- In the date picker, select a date range (start date and end date) for which you need to view the data pipelines. For example, select a range from 1 May to 31 May to view data pipelines that last ran in May.
- Schema
- Select the Filter icon in the Data Pipelines page.
- Expand the Schema drop-down list.
- Select the required schema.
- Data source
- Select the Filter icon in the Data Pipelines page.
- Expand the Data source drop-down list.
- Select the data source or data sources for which you wish to view the data pipeline.
- Pipeline status
- Select the Filter icon in the Data Pipelines page.
- Expand the Pipeline status drop-down list.
- Select Active to view the data pipelines that are active. A data pipeline is in the Active status when the schedule end date is in the future and one or more ingestions are yet to complete.
- Select Inactive to view the data pipelines that are inactive. A data pipeline is in the Inactive status either when the end date has passed or when no ingestions remain to complete.
- ID of the person who created the pipeline
- Select the Filter icon in the Data Pipelines page.
- Expand the Created by drop-down list.
- Select the user ID of the person who created the data pipeline.
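Conceptually, the Last run range filter above keeps the pipelines whose last run falls inside the selected window. A sketch with made-up pipeline records (in OTCAD these come from the Pipelines area):

```python
from datetime import date

# Made-up records standing in for the Pipelines area.
pipelines = [
    {"name": "orders-load", "last_run_on": date(2024, 5, 10)},
    {"name": "events-load", "last_run_on": date(2024, 4, 2)},
    {"name": "audit-load",  "last_run_on": date(2024, 5, 31)},
]

def filter_by_last_run(rows, start, end):
    """Keep rows whose last run falls within [start, end], inclusive."""
    return [r for r in rows if start <= r["last_run_on"] <= end]

# A range from 1 May to 31 May keeps orders-load and audit-load.
may = filter_by_last_run(pipelines, date(2024, 5, 1), date(2024, 5, 31))
```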
Search a data pipeline
All data pipelines are displayed by default. You can search data pipelines using specific search criteria. To search data pipelines, do the following:
- In the Data pipelines page, select the Search pipeline name or schema field.
- Enter one of the following search criteria:
- Pipeline name
- Owner of the data pipeline (Created by)
- Schema
- Destination table
- You can sort the data pipelines by pipeline name or by the date on which they last ran. To sort in ascending or descending order, select the Sort icon in the Pipeline name or Last run on column of the Data Pipelines page.
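The sort behaviour described above amounts to ordering the table by one key, ascending or descending; for example, over made-up rows:

```python
from datetime import date

# Made-up rows standing in for the Data Pipelines table.
rows = [
    {"pipeline_name": "b-load", "last_run_on": date(2024, 5, 3)},
    {"pipeline_name": "a-load", "last_run_on": date(2024, 5, 9)},
    {"pipeline_name": "c-load", "last_run_on": date(2024, 5, 1)},
]

# Sort by the Pipeline name column, ascending.
by_name = sorted(rows, key=lambda r: r["pipeline_name"])

# Sort by the Last run on column, descending (most recent first).
by_last_run = sorted(rows, key=lambda r: r["last_run_on"], reverse=True)
```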
View pipeline details
You can view pipeline details in the Data pipelines page. To view the details of a data pipeline, do the following:
- In the Data pipeline page, mouse over the Pipeline name column and select the View details icon for a data pipeline.
You can view the following details of the data pipeline in this page.
- Overview - Displays the owner (user ID) of the data pipeline.
- Configurations - Displays the source and destination paths of the data pipeline. In the Configurations card, click the Edit icon to edit the data source and destination paths of the data pipeline. For more information, see Creating a data pipeline.
- Total instances - Displays the total number of jobs. Select the drop-down list and view information about the instances for the last 24 hours, last week, last month, last 6 months, last one year, or all time.
- To view why a job failed, mouse over an instance with the job status Failed and select the View error logs icon that appears next to the file name. The error log shows the row that failed execution, the reason for the failure, and the rejected data. Use this information to troubleshoot the job and run it again successfully.
- Incident Overview - Displays information about the data pipeline job in a pie chart. Select the drop-down list and view information about the jobs for the last 24 hours, last week, last month, last 6 months, last one year, or all time.
- Select Execute pipeline to execute the selected data pipeline. Executing a data pipeline loads all files that have not already been loaded and that have not reached the retry limit. Executing the data pipeline commits the transaction.
- Select Edit pipeline to edit the data pipeline. All the details about the data pipeline are populated, except the Access key ID and Secret access key. For more information, see Edit a data pipeline.
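The Execute pipeline rule above (load every file that has not already been loaded and has not reached the retry limit) can be sketched as a filter over per-file state; the field names here are illustrative, not OTCAD's internal model:

```python
# Illustrative per-file load state.
files = [
    {"key": "orders/2024-05-01.csv", "loaded": True,  "attempts": 1},
    {"key": "orders/2024-05-02.csv", "loaded": False, "attempts": 1},
    {"key": "orders/2024-05-03.csv", "loaded": False, "attempts": 3},
]

RETRY_LIMIT = 3  # the Retry limit set when the pipeline was created

def files_to_load(rows, retry_limit):
    """Select files that are not yet loaded and still under the retry limit."""
    return [r for r in rows if not r["loaded"] and r["attempts"] < retry_limit]

pending = files_to_load(files, RETRY_LIMIT)
# Only orders/2024-05-02.csv qualifies: 05-01 is already loaded,
# and 05-03 has reached the retry limit.
```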
Edit a data pipeline
After creating a data pipeline, you can edit the details to suit your requirements. To edit a data pipeline, do the following:
- In the Data Pipelines page, mouse over the Pipeline name column and select the Edit pipeline icon for a data pipeline.
- In the Access key ID field, enter your AWS account access key ID.
- In the Secret access key field, enter your AWS account secret access key.
All the details about the data pipeline are populated, except the Access key ID and Secret access key.
Note
Provide valid AWS credentials for the **Access key ID** and **Secret access key**. Invalid AWS credentials do not allow you to edit a data pipeline.
Clone a data pipeline
You can create a clone or replica of an existing data pipeline. The configurations of the existing data pipeline are copied to the cloned data pipeline. You can edit these configuration settings in the cloned data pipeline.
- In the Data Pipelines page, mouse over the Pipeline name column and select the Clone pipeline icon for a data pipeline.
- In the Confirmation dialog, select Confirm. The Create a pipeline page displays.
- In the Pipeline name field, enter the name of the pipeline.
- In the Access key ID field, enter your AWS account access key ID.
- In the Secret access key field, enter your AWS account secret access key.
Note
Provide valid AWS credentials for the **Access key ID** and **Secret access key**. Invalid AWS credentials do not allow you to clone a data pipeline.
Remove a data pipeline
You can delete a data pipeline that is no longer in use or required.
- In the Data Pipelines page, mouse over the Pipeline name column and select the Remove pipeline icon for a data pipeline.
- In the Confirmation dialog, select Remove.