Common use cases
Vertica supports a variety of use cases for loading and reading data. Some of the most common are summarized here, with links for more information. This is not a complete list of capabilities.
The COPY statement is central to loading data. See Introduction to the COPY statement for an overview of its use.
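For example, the following statement loads comma-delimited data from a file into an existing table (the table name and path here are placeholders):

```sql
-- Load comma-separated values into an existing table,
-- treating empty fields as NULL.
COPY sales FROM '/data/sales_2023.csv' DELIMITER ',' NULL '';
```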
Loading data from files
You might have data, perhaps quite a bit of data, that you want to load into Vertica. These files might reside on shared storage, in the cloud, or on local nodes, and might be in a variety of formats.
For information about source locations, see Specifying where to load data from. To handle data in different formats, you specify a parser; for more information about the options, see Data formats.
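For example, a sketch of loading Parquet files from an S3 bucket (the bucket and table names are placeholders) specifies the source with a glob pattern and the format with the PARQUET parser:

```sql
-- Load all Parquet files under an S3 prefix; the PARQUET parser
-- matches Parquet columns to table columns by name.
COPY sales FROM 's3://examplebucket/sales/*.parquet' PARQUET;
```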
You are not limited to loading data "as-is"; you can also transform it during load. See Transforming data during loads.
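For example, this sketch (with placeholder names) uses FILLER columns, which are parsed from the source but not stored, to compute a derived column during the load:

```sql
-- 'first' and 'last' exist only during the load;
-- only 'full_name' is stored in the table.
COPY names (first FILLER VARCHAR(20),
            last  FILLER VARCHAR(20),
            full_name AS first || ' ' || last)
FROM '/data/names.csv' DELIMITER ',';
```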
Automatically loading new files
You can create a data pipeline to automatically load new data. A data loader automatically loads new files from a location, so that you do not have to add them to Vertica manually. Automatically loading new data into ROS tables is an alternative to using external tables. External tables load selected data at query time, which can be convenient but can also incur API costs for object stores.
You execute a data loader explicitly. To run it periodically, you can put the execution in a scheduled stored procedure. See Automatic load for an example.
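As a minimal sketch (with placeholder names), a data loader wraps a COPY statement and tracks which files it has already loaded, so each execution picks up only new files:

```sql
-- The loader records previously loaded files for this path.
CREATE DATA LOADER sales_loader AS
    COPY sales FROM 's3://examplebucket/sales/*.parquet' PARQUET;

-- Load only files that have appeared since the last execution.
EXECUTE DATA LOADER sales_loader;
```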
Loading data from other services
Apache Kafka is a platform for streaming data. Vertica supports streaming data to and from Kafka. See Apache Kafka integration.
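For example, a hedged sketch of a one-shot load from a Kafka topic with the KafkaSource and KafkaJSONParser functions that ship with the integration (the broker, topic, and table names are placeholders):

```sql
-- Read topic 'web_hits', partition 0, from the earliest offset (-2),
-- stopping when the end of the stream is reached.
COPY web_events
SOURCE KafkaSource(stream='web_hits|0|-2',
                   brokers='kafka01.example.com:9092',
                   stop_on_eof=true)
PARSER KafkaJSONParser();
```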
Apache Spark is a cluster-computing framework for distributed data. Vertica supports connecting to Spark to exchange data. See Apache Spark integration.
You can copy data directly from another Vertica cluster, instead of exporting to files and then loading those files. See Database export and import.
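As a sketch (the connection details and table names are placeholders), you connect to the other cluster and export a table to it directly:

```sql
-- Open a connection to the target database, copy a table into it
-- with no intermediate files, then close the connection.
CONNECT TO VERTICA target_db USER dbadmin PASSWORD 'xxxxxx' ON 'examplehost',5433;
EXPORT TO VERTICA target_db.public.sales FROM public.sales;
DISCONNECT target_db;
```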
Read data where it exists (don't import)
Instead of loading data into Vertica, you can read it in place using external tables. External tables can be advantageous in the following cases:
- If you want to explore data, such as in a data lake, before selecting data to load into Vertica.
- If you are one of several consumers sharing the same data, for example in a data lake, then reading it in place eliminates concerns about whether query results are up to date. There's only one copy, so all consumers see the same data.
- If your data changes rapidly but you do not want to stream it into Vertica, you can instead query the latest updates automatically.
- If you have lower-priority data in Vertica that you still want to be able to query.
When you query an external table, Vertica loads the data it needs from the external source. The Parquet and ORC columnar formats are optimized for this kind of load, so using external tables does not necessarily have a large effect on performance compared to loading data into Vertica native tables.
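For example, a minimal external table over Parquet files might look like the following (the path and columns are placeholders); each query reads whatever files are at that location at query time:

```sql
-- No data is loaded here; Vertica reads the files when queried.
CREATE EXTERNAL TABLE sales_ext (id INT, amount FLOAT, sold DATE)
    AS COPY FROM 's3://examplebucket/sales/*.parquet' PARQUET;

SELECT sold, SUM(amount) FROM sales_ext GROUP BY sold;
```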
For more information about using external tables, see Working with external data.
Complex types
Some data formats support complex types such as arrays and structs (sets of property-value pairs). You can use strong typing to define columns using the ARRAY and ROW types (including nesting) in native and external tables. See Complex types. Alternatively, you can define tables using flexible (schemaless) complex types without fully specifying them. You can load flexible complex types in the Parquet, ORC, JSON, and Avro formats; see Flexible complex types.
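For example, a sketch of a strongly typed table with a nested complex column (the schema is a placeholder):

```sql
-- An array of ROW (struct) values in a table column.
CREATE TABLE orders (
    order_id  INT,
    shipments ARRAY[ROW(street VARCHAR(64), city VARCHAR(32), ship_date DATE)]
);
```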
Unknown or evolving schema
Sometimes the schema for the data you want to load is unknown or changes over time, particularly with JSON data, which does not embed a schema in data files. There are two primary ways to explore data with an unknown schema, explained in detail in Data exploration:
- You can inspect sample data files and derive an initial table definition using the INFER_TABLE_DDL function. For tables with many or complex columns, using this function can save time and reduce manual effort.
- You can use flex tables to load schemaless data as blobs and inspect it using flex functions. Flex tables are best used for initial exploration of heterogeneous or poly-structured data, not in production databases, because the query-time extraction affects performance.
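A brief sketch of both approaches, using placeholder paths and names:

```sql
-- Derive a candidate table definition from sample Parquet files.
SELECT INFER_TABLE_DDL('s3://examplebucket/sales/*.parquet'
                       USING PARAMETERS format='parquet', table_name='sales');

-- Or load JSON into a flex table and explore it at query time.
CREATE FLEX TABLE raw_events();
COPY raw_events FROM '/data/events/*.json' PARSER fjsonparser();
```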
If your data changes over time, some parsers emit warnings about mismatches or new columns. You can use ALTER TABLE to modify the table definition as the data evolves. For more information and an example, see Schema evolution.
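As a simple sketch (the column here is a placeholder), if newer data files begin to include an additional field, you might extend the table to match:

```sql
-- Add a column for a field that newer data files now include.
ALTER TABLE sales ADD COLUMN coupon_code VARCHAR(20);
```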
Messy data
Sometimes data is not clean: values might not match the declared data types, required values might be missing, or the parser might not be able to interpret a row for other reasons. You might still want to load and explore this data. Using parameters to the COPY statement, you can specify how error-tolerant to be and where to record information about rejected data. For more information, see Handling messy data.
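For example, the following sketch (with placeholder names) tolerates up to 100 rejected rows and records them, with the reasons for rejection, in a table you can query:

```sql
-- Abort the load only if more than 100 rows are rejected;
-- keep rejected rows queryable in sales_rejects.
COPY sales FROM '/data/messy_sales.csv' DELIMITER ','
    REJECTMAX 100
    REJECTED DATA AS TABLE sales_rejects;
```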