Creating external tables

To create an external table, combine a table definition with a copy statement using the CREATE EXTERNAL TABLE AS COPY statement.

To create an external table, combine a table definition with a copy statement using the CREATE EXTERNAL TABLE AS COPY statement. CREATE EXTERNAL TABLE AS COPY uses a subset of parameters from CREATE TABLE and COPY.

You define your table columns as you would for a Vertica native table using CREATE TABLE. You also specify a COPY FROM clause to describe how to read the data, as you would for loading data. How you specify the FROM path depends on where the file is located and the data format. See Specifying where to load data from and Data formats.

As with native tables, you can use the INFER_TABLE_DDL function to derive column definitions from data files in supported formats.

The following example defines an external table for delimited data stored in HDFS:

=> CREATE EXTERNAL TABLE sales (itemID INT, date DATE, price FLOAT)
    AS COPY FROM 'hdfs:///data/ext1.csv' DELIMITER ',';

The following example uses data in the ORC format that is stored in S3. The data has two partition columns. For more information about partitions, see Partitioned file paths.

=> CREATE EXTERNAL TABLE records (id int, name varchar(50), created date, region varchar(50))
   AS COPY FROM 's3://datalake/sales/*/*/*'
   PARTITION COLUMNS created, region;

The following example shows how you can read from all Parquet files in a local directory, with no partitions and no globs:

=> CREATE EXTERNAL TABLE sales (itemID INT, date DATE, price FLOAT)
    AS COPY FROM '/data/sales/*.parquet' PARQUET;

When you create an external table, data is not added to the database and no projections are created. Instead, Vertica performs a syntactic check of the CREATE EXTERNAL TABLE AS COPY statement and stores the table name and COPY statement definition in the catalog. Each time a SELECT query references an external table, Vertica parses and executes the stored COPY statement to obtain the referenced data. Any problems in the table definition, such as incorrect column types, can be discovered only by querying the table.

Successfully returning data from an external table requires that the COPY definition be correct, and that other dependencies, such as files, nodes, and other resources are accessible and available at query time. If the table definition uses globs (wildcards), and files are added or deleted, the data in the external table can change between queries.

ORC and Parquet formats

When defining an external table for ORC or Parquet data, you must define all of the data columns in the file. If you omit data columns, queries using the table abort with an error.

If you load from multiple ORC or Parquet files in the same COPY statement, and any of them is aborted, the entire load aborts. This behavior differs from that for delimited files, where the COPY statement loads what it can and ignores the rest.

Iceberg format

You can create an external table for data stored by Apache Iceberg using CREATE EXTERNAL TABLE ICEBERG. An Iceberg table consists of data files and metadata describing the schema. Unlike other external tables, an Iceberg external table need not specify column definitions (DDL). The information is read from Iceberg metadata at query time. For certain data types you can adjust column definitions when creating the table, for example to specify VARCHAR sizes.

Special considerations for external tables

If the maximum length of a column is smaller than the actual data, such as a VARCHAR that is too short, Vertica truncates the data and logs the event.

You can see unexpected query results if constraints on columns cause values to be rejected:

  • If you specify a NOT NULL column constraint and the data contains null values, those rows are rejected.

  • If you use ENFORCELENGTH, values that are too long are rejected rather than being truncated.

  • When reading ORC data, if you declare a scalar precision and some data does not fit, that row is rejected. For example, if you specify a column as Decimal(6,5), a value of 123.456 is rejected.

One way to know if column constraints have caused data to be rejected is if COUNT on a column returns a different value than COUNT(*).

The JSON and Avro parsers produce warnings when the data contains columns or fields that are not part of the table definition. External tables load data at query time, so these warnings appear for each query. You can use the parser's suppress_warnings parameter to prevent these warnings in external tables. See the parser reference pages for an example.

When using the COPY parameter ON ANY NODE, confirm that the source file definition is identical on all nodes. Specifying different external files can produce inconsistent results.

You can take advantage of partitioning to limit the amount of data that Vertica reads. For more information about using partitioned data, see Partitioned file paths.

Canceling a CREATE EXTERNAL TABLE AS COPY statement can cause unpredictable results. If you realize after beginning the operation that your table definition is incorrect (for example, you inadvertently specify the wrong external location), wait for the query to complete. When the external table exists, use DROP TABLE to remove its definition.

After you create an external table, analyze its row count to improve query performance. See Improving Query Performance for External Tables.

Required permissions

In addition to having permission in Vertica, users must have read access to the external data.

  • For data on the local disk this access is governed by local file permissions.

  • For data in HDFS, access might be governed by Kerberos authentication. See Accessing kerberized HDFS data.

  • For data on S3, you need access through an AWS IAM role. See S3 object store.

For data in GCS, you must enable S3 compatibility before reading data. See Google Cloud Storage (GCS) object store.

By default, you must also be a database superuser to access external tables through a SELECT statement.

In most cases, to allow users without superuser access to query external tables, an administrator must create a USER storage location and grant those users read access to the location. See CREATE LOCATION and GRANT (storage location). This location must be a parent of the path used in the COPY statement when creating the external table. This requirement does not apply to external tables stored in HDFS. The following example shows granting access to a user named Bob to any external table whose data is located under /tmp (including in subdirectories to any depth):

=> CREATE LOCATION '/tmp' ALL NODES USAGE 'user';
=> GRANT ALL ON LOCATION '/tmp' to Bob;

Organizing external table data

If the data you store in external tables changes regularly (for instance, each month in the case of storing recent historical data), you can use partitioning in combination with wildcards (globs) to make parsing the stored COPY statement definition more dynamic. For instance, if you store monthly data on an NFS mount, you could organize monthly files within a top-level directory for a calendar year, such as:

/year=2018/month=01/

You can then read the year and month values from the directory names in the COPY statement:

=> CREATE EXTERNAL TABLE archive (...) AS COPY FROM '/nfs_name/*/*/*' PARTITION COLUMNS year, month;

Whenever a Vertica query references the external table archive, and Vertica parses the COPY statement, all stored data in the top-level monthly directory is accessible to the query. If the query filters by year or month, such as in a WHERE clause, Vertica skips irrelevant directories when evaluating the glob. See Partitioned file paths for more information.

Validating table definitions

When you create an external table, Vertica validates the syntax of the CREATE EXTERNAL TABLE AS COPY FROM statement. For example, if you omit a required keyword in the statement, creating the external table fails:

=> CREATE EXTERNAL TABLE ext (ts timestamp, d varchar)
    AS COPY '/home/dbadmin/designer.log';
ERROR 2778:  COPY requires a data source; either a FROM clause or a WITH SOURCE for a user-defined source

Checking other components of the COPY definition, such as path statements and node availability, does not occur until a SELECT query references the external table.

To validate an external table definition, run a SELECT query that references the external table. Check that the returned query data is what you expect. If the query does not return data correctly, check the COPY exception and rejected data log files.

Because the COPY definition determines what occurs when you query an external table, COPY statement errors can reveal underlying problems. For more information about COPY exceptions and rejections, see Handling messy data.

Viewing external table definitions

When you create an external table, Vertica stores the COPY definition statement in the table_definition column of the TABLES system table.

To list all tables, use a SELECT * query, as shown:

=> SELECT * FROM TABLES WHERE table_definition <> '';

Use a query such as the following to list the external table definitions:

=> SELECT table_name, table_definition FROM TABLES;
 table_name |                                table_definition
------------+----------------------------------------------------------------------
 t1         | COPY            FROM 'TMPDIR/external_table.dat'  DELIMITER ','
 t1_copy    | COPY            FROM 'TMPDIR/external_table.dat'  DELIMITER ','
 t2         | COPY FROM 'TMPDIR/external_table2.dat' DELIMITER ','
(3 rows)