Common use cases

Vertica supports a variety of use cases for reading data. Some of the most common are summarized here with links for more information. This is not a complete list of capabilities.

The COPY statement is central to loading data. See Introduction to the COPY statement for an overview of its use.
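
For example, a minimal load might look like the following sketch; the table name and file path are hypothetical:

    -- Hypothetical example: load comma-delimited data from a local file
    COPY sales FROM '/home/dbadmin/data/sales.csv' DELIMITER ',';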

Loading data from files

You might have data, perhaps quite a bit of data, that you want to load into Vertica. These files might reside on shared storage, in the cloud, or on local nodes, and might be in a variety of formats.

For information about source locations, see Specifying where to load data from. To handle data in different formats you specify a parser; for more information about the options, see Data formats.
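
As an illustration, the following sketch loads from two different source locations using format-specific parsers; the bucket, paths, and table names are hypothetical:

    -- Hypothetical examples: load Parquet data from S3 and ORC data from HDFS
    COPY sales FROM 's3://examplebucket/sales/*.parquet' PARQUET;
    COPY clicks FROM 'hdfs:///data/clicks/*.orc' ORC;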

You are not limited to loading data "as-is"; you can also transform it during load. See Transforming data during loads.
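
For instance, a load can use FILLER columns to consume input values that feed an expression instead of being stored directly. This sketch assumes a hypothetical table and input layout:

    -- Hypothetical example: combine separate date parts from the input into
    -- one stored column; the FILLER values themselves are not stored
    COPY orders (
        order_year  FILLER VARCHAR(4),
        order_month FILLER VARCHAR(2),
        order_date AS (order_year || '-' || order_month || '-01')::DATE
    )
    FROM '/data/orders.csv' DELIMITER ',';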

Loading data from other services

Apache Kafka is a platform for streaming data. Vertica supports streaming data to and from Kafka. See Apache Kafka integration.
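
As a rough sketch, you can also run a Kafka load manually with COPY, using the KafkaSource source function and a Kafka parser; the broker, topic, and table names here are hypothetical:

    -- Hypothetical example: read JSON messages from a Kafka topic for 10 seconds;
    -- 'web_hits|0|-2' means topic web_hits, partition 0, starting at the earliest offset
    COPY web_events
    SOURCE KafkaSource(stream='web_hits|0|-2',
                       brokers='kafka01.example.com:9092',
                       duration=INTERVAL '10 seconds')
    PARSER KafkaJSONParser();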

Apache Spark is a cluster-computing framework for distributed data. Vertica supports moving data to and from Spark. See Apache Spark integration.

You can copy data directly from another Vertica cluster, instead of exporting to files and then loading those files. See Database export and import.
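
A sketch of a direct cluster-to-cluster copy, with placeholder database, host, and credentials:

    -- Hypothetical example: connect to a second Vertica database and export a table to it
    CONNECT TO VERTICA targetdb USER dbadmin PASSWORD 'secret' ON 'target-host.example.com',5433;
    EXPORT TO VERTICA targetdb.public.customer_dim FROM public.customer_dim;
    DISCONNECT targetdb;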

Read data where it exists (don't import)

Instead of loading data into Vertica, you can read it in place using external tables. External tables can be advantageous in the following cases:

  • If you want to explore data, such as in a data lake, before selecting data to load into Vertica.

  • If you are one of several consumers sharing the same data, for example in a data lake, then reading it in place eliminates concerns about whether query results are up to date. There's only one copy, so all consumers see the same data.

  • If your data changes rapidly but you do not want to stream it into Vertica, queries against an external table automatically see the latest updates.

  • If you have lower-priority data in Vertica that you still want to be able to query.

When you query an external table, Vertica loads the data it needs from the external source. The Parquet and ORC columnar formats are optimized for this kind of load, so using external tables does not necessarily have a large effect on performance compared to loading data into Vertica native tables.
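
For example, an external table over Parquet files might be defined as in this sketch; the columns and bucket path are hypothetical:

    -- Hypothetical example: define an external table and query the files in place
    CREATE EXTERNAL TABLE sales_ext (
        sale_id   INT,
        amount    FLOAT,
        sale_date DATE
    ) AS COPY FROM 's3://examplebucket/sales/*.parquet' PARQUET;

    SELECT sale_date, SUM(amount) FROM sales_ext GROUP BY sale_date;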

For more information about using external tables, see Working with external data.

Complex types

Some data formats support complex types such as arrays and structs (sets of property-value pairs). You can use strong typing to define columns using the ARRAY and ROW types (including nesting) in native and external tables. See Complex types. Alternatively, you can define tables using flexible (schemaless) complex types without fully specifying them. You can load flexible complex types in the Parquet, ORC, JSON, and Avro formats; see Flexible complex types.
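
As a sketch of strong typing, a hypothetical table might declare a column holding an array of structs:

    -- Hypothetical example: each order stores an array of ROW (struct) values
    CREATE TABLE orders (
        order_id INT,
        items    ARRAY[ROW(name VARCHAR(50), quantity INT)]
    );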

Unknown schema

Sometimes the schema for the data you wish to load is unknown. Vertica provides two options in this case.

You can inspect sample data files and derive a table definition using the INFER_TABLE_DDL function. This function returns a good starting point for column definitions, including ones that use complex types. You might still need to make adjustments, particularly for partition columns of unknown types or to adjust the sizes of VARCHAR values. For tables with many columns or complex columns, using this function can save time and reduce manual effort.
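
For example, with a hypothetical set of Parquet files:

    -- Hypothetical example: derive a candidate table definition from Parquet files
    SELECT INFER_TABLE_DDL('s3://examplebucket/sales/*.parquet'
                           USING PARAMETERS format = 'parquet', table_name = 'sales');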

Alternatively, you can load schemaless data into a Flex table. All data is loaded into a single binary column (a VMap) and is extracted at query time. This approach is particularly helpful if you need to support heterogeneous data in one table, such as JSON data with different properties. Loading into a Flex table is generally like loading into any other table, but for some formats you need to use a different parser. For more information about Flex tables and Flex-specific parsers, see Flex tables.
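
A minimal sketch of loading heterogeneous JSON into a Flex table; the table name and path are hypothetical:

    -- Hypothetical example: load JSON records with varying properties
    CREATE FLEX TABLE events();
    COPY events FROM '/data/events/*.json' PARSER fjsonparser();
    -- Keys found in the data become virtual columns, extracted at query time
    SELECT event_type, user_name FROM events;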

Messy data

Sometimes data is not clean: values might not match declared data types, required values might be missing, or the parser might be unable to interpret a row for other reasons. You might still want to load and explore this data. Using parameters to the COPY statement, you can specify how error-tolerant to be and where to record information about rejected data. For more information, see Handling messy data.
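
For example, a tolerant load that records rejected rows might look like this sketch, with hypothetical names:

    -- Hypothetical example: allow up to 100 rejected rows and save them to a table
    COPY sales FROM '/data/sales.csv' DELIMITER ','
        REJECTED DATA AS TABLE sales_rejects
        REJECTMAX 100;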