Working with external data
An alternative to importing data into Vertica is to query it in place. Querying external data instead of importing it can be advantageous in some cases:
- If you want to explore data, such as in a data lake, before selecting data to load into Vertica.
- If you are one of several consumers sharing the same data, for example in a data lake, then reading it in place eliminates concerns about whether query results are up to date. There's only one copy, so all consumers see the same data.
- If your data changes rapidly but you do not want to stream it into Vertica, you can instead query the latest updates automatically.
- If you have a very large volume of data and do not want to increase your license capacity.
- If you have lower-priority data in Vertica that you still want to be able to query.
To query external data, you must describe your data as an external table. Like native tables, external tables have table definitions and can be queried. Unlike native tables, external tables have no catalog and Vertica loads selected data from the external source as needed. For some formats, the query planner can take advantage of partitions and sorting in the data, so querying an external table does not mean you load all of the data at query time. (For more information about native tables, see Working with native tables.)
There is one special type of external data not covered in this section. If you are reading data from Hadoop, and specifically from a Hive data warehouse, then instead of defining your own external tables you can read the schema information from Hive. For more information, see Using the HCatalog Connector.
1 - How external tables differ from native tables
You can use external tables in the same ways you use Vertica native tables. Because the data is external to the database, however, there are some differences in how external tables operate.
Data
The data for an external table can reside anywhere, so long as all database nodes can access it. S3, HDFS, and NFS mount points are common places to find external data. Naturally, querying external data can incur some latency compared to querying locally stored ROS data, but Vertica has optimizations that can reduce the impact. For example, Vertica can take advantage of node and rack locality for HDFS data.
Because the data is external, Vertica loads external data each time you query it. Vertica is optimized to reduce the volume of read data, including predicate pushdown and partition pruning for formats that support partitioning. The ORC and Parquet formats support these optimizations.
Because the data is read at query time, you must ensure that your users have and retain permission to read the data in its original location. Depending on where the data is stored, you might need to take additional steps to manage access, such as creating AWS IAM roles on S3.
Because the data is not stored in Vertica, external tables do not use superprojections and buddy projections.
Resource consumption
External tables add very little to the Vertica catalog, which reduces the resources that queries consume. Because the data is not stored in Vertica, external tables are not affected by the Tuple Mover and do not cause ROS pushback. Vertica uses a small amount of memory when reading external table data, because the table contents are not part of your database and are parsed each time the external table is used.
Backup and restore
Because the data in external tables is managed outside of Vertica, only the external table definitions, not the data files, are included in database backups. Arrange for a separate backup process for your external table data.
DML support
External tables allow you to read external data. They do not allow you to modify it. Some DML operations are therefore not available for external tables, including:
- DELETE FROM
- INSERT INTO
- SELECT...FOR UPDATE
Sequences and IDENTITY columns
The COPY statement definition for external tables can include IDENTITY columns and sequences. Whenever a SELECT statement queries the external table, sequences and IDENTITY columns are re-evaluated. This results in changing the external table column values, even if the underlying external table data remains the same.
2 - Creating external tables
To create an external table, combine a table definition with a copy statement using the CREATE EXTERNAL TABLE AS COPY statement. CREATE EXTERNAL TABLE AS COPY uses a subset of parameters from CREATE TABLE and COPY.
You define your table columns as you would for a Vertica native table using CREATE TABLE. You also specify a COPY FROM clause to describe how to read the data, as you would for loading data. How you specify the FROM path depends on where the file is located and the data format. See Specifying where to load data from and Data formats.
As with native tables, you can use the INFER_TABLE_DDL function to derive column definitions from data files in supported formats.
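For example, a call such as the following sketch (the path and table name are placeholders; see the INFER_TABLE_DDL reference for the full parameter list) asks Vertica to propose a CREATE EXTERNAL TABLE statement for a set of Parquet files:
=> SELECT INFER_TABLE_DDL('/data/sales/*.parquet'
       USING PARAMETERS format = 'parquet', table_name = 'sales', table_type = 'external');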
The following example defines an external table for delimited data stored in HDFS:
=> CREATE EXTERNAL TABLE sales (itemID INT, date DATE, price FLOAT)
AS COPY FROM 'hdfs:///data/ext1.csv' DELIMITER ',';
The following example uses data in the ORC format that is stored in S3. The data has two partition columns. For more information about partitions, see Partitioned file paths.
=> CREATE EXTERNAL TABLE records (id int, name varchar(50), created date, region varchar(50))
AS COPY FROM 's3://datalake/sales/*/*/*'
PARTITION COLUMNS created, region;
The following example shows how you can read from all Parquet files in a local directory, with no partitions and no globs:
=> CREATE EXTERNAL TABLE sales (itemID INT, date DATE, price FLOAT)
AS COPY FROM '/data/sales/*.parquet' PARQUET;
When you create an external table, data is not added to the database and no projections are created. Instead, Vertica performs a syntactic check of the CREATE EXTERNAL TABLE AS COPY statement and stores the table name and COPY statement definition in the catalog. Each time a SELECT query references an external table, Vertica parses and executes the stored COPY statement to obtain the referenced data. Any problems in the table definition, such as incorrect column types, can be discovered only by querying the table.
Successfully returning data from an external table requires that the COPY definition be correct, and that other dependencies, such as files, nodes, and other resources are accessible and available at query time. If the table definition uses globs (wildcards), and files are added or deleted, the data in the external table can change between queries.
When defining an external table for ORC or Parquet data, you must define all of the data columns in the file. If you omit data columns, queries using the table abort with an error.
If you load from multiple ORC or Parquet files in the same COPY statement, and any of them is aborted, the entire load aborts. This behavior differs from that for delimited files, where the COPY statement loads what it can and ignores the rest.
You can create an external table for data stored by Apache Iceberg using CREATE EXTERNAL TABLE ICEBERG. An Iceberg table consists of data files and metadata describing the schema. Unlike other external tables, an Iceberg external table need not specify column definitions (DDL). The information is read from Iceberg metadata at query time. For certain data types you can adjust column definitions when creating the table, for example to specify VARCHAR sizes.
Special considerations for external tables
If the maximum length of a column is smaller than the actual data, such as a VARCHAR that is too short, Vertica truncates the data and logs the event.
You can see unexpected query results if constraints on columns cause values to be rejected:
- If you specify a NOT NULL column constraint and the data contains null values, those rows are rejected.
- If you use ENFORCELENGTH, values that are too long are rejected rather than being truncated.
- When reading ORC data, if you declare a scalar precision and some data does not fit, that row is rejected. For example, if you specify a column as Decimal(6,5), a value of 123.456 is rejected.
One way to know if column constraints have caused data to be rejected is if COUNT on a column returns a different value than COUNT(*).
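For example, for a hypothetical external table ext_orders with a NOT NULL constraint on the price column, comparing the two counts can reveal whether rows were dropped:
=> SELECT COUNT(*) FROM ext_orders;
=> SELECT COUNT(price) FROM ext_orders;
If the two values differ, some rows were rejected because of column constraints.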
The JSON and Avro parsers produce warnings when the data contains columns or fields that are not part of the table definition. External tables load data at query time, so these warnings appear for each query. You can use the parser's suppress_warnings parameter to prevent these warnings in external tables. See the parser reference pages for an example.
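For example, a definition such as the following sketch (the path is a placeholder; check the FJSONPARSER reference for the exact parameter syntax) suppresses those per-query warnings for a JSON source:
=> CREATE EXTERNAL TABLE ext_events (id INT, status VARCHAR)
    AS COPY FROM '/data/events/*.json'
    PARSER FJSONPARSER(suppress_warnings=true);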
When using the COPY parameter ON ANY NODE, confirm that the source file definition is identical on all nodes. Specifying different external files can produce inconsistent results.
You can take advantage of partitioning to limit the amount of data that Vertica reads. For more information about using partitioned data, see Partitioned file paths.
Canceling a CREATE EXTERNAL TABLE AS COPY statement can cause unpredictable results. If you realize after beginning the operation that your table definition is incorrect (for example, you inadvertently specify the wrong external location), wait for the query to complete. When the external table exists, use DROP TABLE to remove its definition.
Tip
When working with a new external data source, consider setting REJECTMAX to 1 to make problems in the data apparent. Testing in this way allows you to discover problems in the data before running production queries against it.
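For example, the following sketch (the path is a placeholder) causes a query to fail as soon as a single row is rejected, exposing data problems immediately:
=> CREATE EXTERNAL TABLE ext_probe (id INT, name VARCHAR)
    AS COPY FROM '/data/new_source/*.csv' DELIMITER ','
    REJECTMAX 1;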
After you create an external table, analyze its row count to improve query performance. See Improving Query Performance for External Tables.
Required permissions
In addition to having permission in Vertica, users must have read access to the external data.
- For data on the local disk this access is governed by local file permissions.
- For data in HDFS, access might be governed by Kerberos authentication. See Accessing kerberized HDFS data.
- For data on S3, you need access through an AWS IAM role. See S3 object store.
- For data in GCS, you must enable S3 compatibility before reading data. See Google Cloud Storage (GCS) object store.
By default, you must also be a database superuser to access external tables through a SELECT statement.
In most cases, to allow users without superuser access to query external tables, an administrator must create a USER storage location and grant those users read access to the location. See CREATE LOCATION and GRANT (storage location). This location must be a parent of the path used in the COPY statement when creating the external table. This requirement does not apply to external tables stored in HDFS. The following example shows granting access to a user named Bob to any external table whose data is located under /tmp (including in subdirectories to any depth):
=> CREATE LOCATION '/tmp' ALL NODES USAGE 'user';
=> GRANT ALL ON LOCATION '/tmp' to Bob;
Organizing external table data
If the data you store in external tables changes regularly (for instance, each month in the case of storing recent historical data), you can use partitioning in combination with wildcards (globs) to make parsing the stored COPY statement definition more dynamic. For instance, if you store monthly data on an NFS mount, you could organize monthly files within a top-level directory for a calendar year, such as:
/year=2018/month=01/
You can then read the year and month values from the directory names in the COPY statement:
=> CREATE EXTERNAL TABLE archive (...) AS COPY FROM '/nfs_name/*/*/*' PARTITION COLUMNS year, month;
Whenever a Vertica query references the external table archive, and Vertica parses the COPY statement, all stored data in the top-level monthly directory is accessible to the query. If the query filters by year or month, such as in a WHERE clause, Vertica skips irrelevant directories when evaluating the glob. See Partitioned file paths for more information.
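For example, assuming year and month are declared as integer columns in the definition of archive, a query such as the following lets Vertica skip the directories for all other years and months:
=> SELECT * FROM archive WHERE year = 2018 AND month = 1;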
Validating table definitions
When you create an external table, Vertica validates the syntax of the CREATE EXTERNAL TABLE AS COPY FROM statement. For example, if you omit a required keyword in the statement, creating the external table fails:
=> CREATE EXTERNAL TABLE ext (ts timestamp, d varchar)
AS COPY '/home/dbadmin/designer.log';
ERROR 2778: COPY requires a data source; either a FROM clause or a WITH SOURCE for a user-defined source
Checking other components of the COPY definition, such as path statements and node availability, does not occur until a SELECT query references the external table.
To validate an external table definition, run a SELECT query that references the external table. Check that the returned query data is what you expect. If the query does not return data correctly, check the COPY exception and rejected data log files.
Because the COPY definition determines what occurs when you query an external table, COPY statement errors can reveal underlying problems. For more information about COPY exceptions and rejections, see Handling messy data.
Viewing external table definitions
When you create an external table, Vertica stores the COPY definition statement in the table_definition column of the TABLES system table.
To list only the external tables, select the rows in which table_definition is not empty, as shown:
=> SELECT * FROM TABLES WHERE table_definition <> '';
Use a query such as the following to list the external table definitions:
=> SELECT table_name, table_definition FROM TABLES;
table_name | table_definition
------------+----------------------------------------------------------------------
t1 | COPY FROM 'TMPDIR/external_table.dat' DELIMITER ','
t1_copy | COPY FROM 'TMPDIR/external_table.dat' DELIMITER ','
t2 | COPY FROM 'TMPDIR/external_table2.dat' DELIMITER ','
(3 rows)
3 - Partitioned file paths
Data files are sometimes partitioned in the file system using the directory structure. Partitioning allows you to move values out of the raw data, where they have to be included for each row, and into the directory structure. Partitioning can improve query performance by allowing entire directories to be skipped. Partitioning can also save disk space if partition values are not repeated in the data.
Partitioning by date is one common use:
/data/created=2016-11-01/*
/data/created=2016-11-02/*
/data/created=2016-11-03/*
/data/...
The files in the globs do not contain a created column because this information is expressed through the file system. Vertica can read the partitioned values (dates, in this example) into a table column (created, in this example).
Data can be partitioned by more than one value:
/data/created=2016-11-01/region=northeast/*
/data/created=2016-11-01/region=central/*
/data/created=2016-11-01/region=southeast/*
/data/created=2016-11-01/...
/data/created=2016-11-02/region=northeast/*
/data/created=2016-11-02/region=central/*
/data/created=2016-11-02/region=southeast/*
/data/created=2016-11-02/...
/data/created=2016-11-03/...
/data/...
If data is partitioned by more than one value, the partitions must appear in the same order in all directory paths in the glob. In this example, creation date comes before region and must do so consistently.
COPY syntax
To read values from partitioned files, use wildcards (globs) in the COPY path and the PARTITION COLUMNS option:
=> CREATE EXTERNAL TABLE records (id int, name varchar(50), created date, region varchar(50))
AS COPY FROM 'webhdfs:///data/*/*/*'
PARTITION COLUMNS created, region;
The path includes one wildcard (*) for each level of directory partitioning and then one more for the files. The number of wildcards must be at least one more than the number of partitioned columns. Data files must include at least one real column; you cannot represent data entirely through directory structure.
The first part of each partition directory name must match the column name in the table definition. COPY parses the string after the = for the values. Empty values, for example a directory named created=, are treated as null values. For backward compatibility, a value of __HIVE_DEFAULT_PARTITION__ also means null.
Values that cannot be coerced to the correct types are rejected in the same way that non-coercible values in data are rejected.
Query execution
When executing queries with predicates, Vertica skips subdirectories that do not satisfy the predicate. This process is called partition pruning and it can significantly improve query performance. The following example reads only the partitions for the specified region, for all dates. Although the data is also partitioned by date, the query does not restrict the date.
=> SELECT * FROM records WHERE region='northeast';
To verify that Vertica is pruning partitions, look in the explain plan for a message similar to the following:
files with unmatched partition have been pruned
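For example, you can generate the plan for the preceding query and check its output for that message:
=> EXPLAIN SELECT * FROM records WHERE region='northeast';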
Partitions on object stores
When reading partitioned data from a file system, COPY can prune unneeded directories at each level, skipping entire directories of files. Object stores (S3, GCS, and Azure) do not have directories, so COPY cannot prune paths in the same way.
By default, when querying data on an object store, COPY first fetches a list of all object names. It then prunes unneeded ones before fetching the objects themselves. This approach is called flat listing. Accessing object stores can have high latency, and flat listing maximizes the number of results returned per request. Flat listing retrieves more object names than are needed but does so in as few calls to the object store as possible. In most cases this strategy is an acceptable tradeoff.
If the number of partitions at each level is large, or if there are many layers of partitioning, fetching all of the object names can have a noticeable performance cost. In this situation, you can set the ObjectStoreGlobStrategy configuration parameter to have COPY use hierarchical listing instead of flat listing. With hierarchical listing, COPY fetches information about objects one level at a time, so that COPY can prune unproductive paths earlier.
Consider data that is partitioned by date and then by region:
/data/created=2016-11-01/region=northeast/*
/data/created=2016-11-01/region=central/*
/data/created=2016-11-01/region=southeast/*
/data/created=2016-11-01/...
/data/created=2016-11-02/region=northeast/*
/data/created=2016-11-02/region=central/*
/data/created=2016-11-02/region=southeast/*
/data/created=2016-11-02/...
/data/created=2016-11-03/...
/data/...
Suppose a query needs only created=2016-11-02 and region=central. With the flat strategy, COPY retrieves the names of all objects in all dates and all regions before pruning most of them. With the hierarchical strategy, COPY first fetches the names of all of the created values, without expanding the paths further. In a second call, it then fetches the names of all of the region values below only that date, ignoring other paths. Finally, it fetches the names of the objects only below that path.
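To switch strategies, set the configuration parameter at the database level. The following sketch assumes the parameter accepts the value 'Hierarchical'; see the configuration parameter reference for the accepted values:
=> ALTER DATABASE DEFAULT SET ObjectStoreGlobStrategy = 'Hierarchical';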
Creating partition structures
To create partitioned file structures, you can use Hive or the file exporters. For information about using Hive, see Hive primer for Vertica integration.
You can create partitions for columns of any simple data type. As a best practice, however, you should avoid partitioning columns with BOOLEAN, FLOAT, and NUMERIC types.
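For example, the Parquet exporter can write a directory structure like the one shown earlier. The following sketch assumes a native table named sales_fact that contains created and region columns:
=> EXPORT TO PARQUET(directory = '/data/sales_export')
   OVER (PARTITION BY created, region)
   AS SELECT * FROM sales_fact;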
4 - Querying external tables
After you create an external table, you can query it as you would query any other table. Suppose you have created the following external tables:
=> CREATE EXTERNAL TABLE catalog (id INT, description VARCHAR, category VARCHAR)
AS COPY FROM 'hdfs:///dat/catalog.csv' DELIMITER ',';
CREATE TABLE
=> CREATE EXTERNAL TABLE inventory(storeID INT, prodID INT, quantity INT)
AS COPY FROM 'hdfs:///dat/inventory.csv' DELIMITER ',';
CREATE TABLE
You can now write queries against these tables, such as the following:
=> SELECT * FROM catalog;
id | description | category
----+----------------------+-------------
10 | 24in monitor | computers
11 | 27in monitor | computers
12 | 24in IPS monitor | computers
20 | 1TB USB drive | computers
21 | 2TB USB drive | computers
22 | 32GB USB thumb drive | computers
30 | 40in LED TV | electronics
31 | 50in LED TV | electronics
32 | 60in plasma TV | electronics
(9 rows)
=> SELECT * FROM inventory;
storeID | prodID | quantity
---------+--------+----------
502 | 10 | 17
502 | 11 | 2
517 | 10 | 1
517 | 12 | 2
517 | 12 | 4
542 | 10 | 3
542 | 11 | 11
542 | 12 | 1
(8 rows)
=> SELECT inventory.storeID,catalog.description,inventory.quantity
FROM inventory JOIN catalog ON inventory.prodID = catalog.id;
storeID | description | quantity
---------+------------------+----------
502 | 24in monitor | 17
517 | 24in monitor | 1
542 | 24in monitor | 3
502 | 27in monitor | 2
542 | 27in monitor | 11
517 | 24in IPS monitor | 2
517 | 24in IPS monitor | 4
542 | 24in IPS monitor | 1
(8 rows)
One important difference between external tables and Vertica native tables is that querying an external table reads the external data every time. (See How external tables differ from native tables.) Specifically, each time a SELECT query references the external table, Vertica parses the COPY statement definition again to access the data. Certain errors in either your table definition or your data do not become apparent until you run a query, so test your external tables before deploying them in a production environment.
Handling errors
Querying external table data with an incorrect COPY FROM statement definition can potentially result in many rejected rows. To limit the number of rejections, Vertica sets the maximum number of retained rejections with the ExternalTablesExceptionsLimit configuration parameter. The default value is 100. Setting ExternalTablesExceptionsLimit to -1 removes the limit, but doing so is not recommended.
If COPY errors reach the maximum number of rejections, the external table query continues, but COPY generates a warning in the vertica.log file and does not report subsequent rejected rows.
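For example, to raise the limit of retained rejections for the database (the ALTER DATABASE form shown here is one way to set configuration parameters; adjust the value to your needs):
=> ALTER DATABASE DEFAULT SET ExternalTablesExceptionsLimit = 500;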
Using the ExternalTablesExceptionsLimit configuration parameter differs from using the COPY statement REJECTMAX parameter to set a low rejection threshold. The REJECTMAX value controls how many rejected rows to permit before causing the load to fail. If COPY encounters a number of rejected rows equal to or greater than REJECTMAX, COPY aborts execution instead of logging a warning in vertica.log.
Improving query performance for external tables
Queries that include joins perform better if the smaller table is the inner one. For native tables, the query optimizer uses cardinality to choose the inner table. For external tables, the query optimizer uses the row count if available.
After you create an external table, use ANALYZE_EXTERNAL_ROW_COUNT to collect this information. Calling this function is potentially expensive because it has to materialize one column of the table to be able to count the rows, so do this analysis when your database is not busy with critical queries. (This is why Vertica does not perform this operation automatically when you create the table.)
The query optimizer uses the results of your most-recent call to this function when planning queries. If the volume of data changes significantly, therefore, you should run it again to provide updated statistics. A difference of a few percent does not matter, but if your data volume grows by 20% or more, you should repeat this operation when able.
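For example, to collect the row count for a hypothetical external table named loader.sales:
=> SELECT ANALYZE_EXTERNAL_ROW_COUNT('loader.sales');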
If your data is partitioned, Vertica automatically prunes partitions that cannot affect query results, causing less data to be loaded.
For ORC and Parquet data written with Hive version 0.14 and later, Vertica automatically uses predicate pushdown to further improve query performance. Predicate pushdown moves parts of the query execution closer to the data, reducing the amount of data that must be read from disk or across the network. ORC files written with earlier versions of Hive might not contain the statistics required to perform this optimization. When executing a query against a file that lacks these statistics, Vertica logs an EXTERNAL_PREDICATE_PUSHDOWN_NOT_SUPPORTED event in the QUERY_EVENTS system table. If you are seeing performance problems with your queries, check this table for these events.
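For example, a query such as the following surfaces those events (the column names shown assume the standard QUERY_EVENTS layout):
=> SELECT event_timestamp, event_description
   FROM query_events
   WHERE event_type = 'EXTERNAL_PREDICATE_PUSHDOWN_NOT_SUPPORTED';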
Using external tables with user-defined load (UDL) functions
You can use external tables in conjunction with UDL functions that you create. For more information about using UDLs, see User Defined Load (UDL).
5 - Monitoring external tables
Vertica records information about external tables in system tables.
Vertica records information about external tables in system tables. You can use these tables to track your external data and queries against it.
The TABLES system table contains data about all tables, both native and external. The TABLE_DEFINITION column is specific to external tables. You can query this column to see all external data sources currently in use, as in the following example:
=> SELECT table_name, create_time, table_definition FROM tables WHERE table_definition != '';
table_name | create_time | table_definition
---------------+-------------------------------+-----------------------------------------------------------
customers_orc | 2018-03-21 11:07:30.159442-04 | COPY from '/home/dbadmin/sample_orc_files/0*' ORC
miscprod | 2018-06-26 17:40:04.012121-04 | copy from '/home/dbadmin/data/prod.csv'
students | 2018-06-26 17:46:50.695024-04 | copy from '/home/dbadmin/students.csv'
numbers | 2018-06-26 17:53:52.407441-04 | copy from '/home/dbadmin/tt.dat'
catalog | 2018-06-26 18:12:28.598519-04 | copy from '/home/dbadmin/data/prod.csv' delimiter ','
inventory | 2018-06-26 18:13:06.951802-04 | copy from '/home/dbadmin/data/stores.csv' delimiter ','
test | 2018-06-27 16:31:39.170866-04 | copy from '/home/dbadmin/data/stores.csv' delimiter ','
(7 rows)
The EXTERNAL_TABLE_DETAILS table provides more details, including file sizes. Vertica computes the values in this table at query time, which is potentially expensive, so consider restricting the query by schema or table.
=> SELECT table_name, source_format, total_file_size_bytes FROM external_table_details;
table_name | source_format | total_file_size_bytes
---------------+---------------+-----------------------
customers_orc | ORC | 619080883
miscprod | DELIMITED | 254
students | DELIMITED | 763
numbers | DELIMITED | 30
catalog | DELIMITED | 254
inventory | DELIMITED | 74
test | DELIMITED | 74
(7 rows)
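To limit the computation, add a predicate on the table name (or schema), as in the following example for one of the tables above:
=> SELECT table_name, source_format, total_file_size_bytes
   FROM external_table_details
   WHERE table_name = 'customers_orc';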
If the size of an external table changes significantly over time, you should rerun ANALYZE_EXTERNAL_ROW_COUNT() to gather updated statistics. See Improving Query Performance for External Tables.
The LOAD_SOURCES table shows information for loads currently in progress. This table does not record information about loads of ORC or Parquet data.
6 - Troubleshooting external tables
You might encounter the following issues when creating or querying external tables. For general data-load troubleshooting, see Troubleshooting data loads.
File not found or permission denied
If a query against an external table produces a file or permission error, ensure that the user executing the query has the necessary permissions in both Vertica and the file system. See the permissions section in Creating external tables.
Error 7226: cannot find partition column
When querying external tables backed by partitioned data, you might see an error message stating that a partition column is missing:
ERROR 7226: Cannot find partition column [region] in parquet source
[/data/table_int/int_original/000000_0]
If you create an external table and then change the partition structure, for example by renaming a column, you must then re-create the external table. If you see this error, update your table to match the partitioning on disk.
For more information about partition structure, see Partitioned file paths.
Error 6766: is a directory
When querying data, you might see an error message stating that an input file is a directory:
ERROR 6766: Error reading from orc parser input stream
[/tmp/orc_glob/more_nations]: Is a directory
This error occurs if the glob in the table's COPY FROM clause matches an empty directory. The error applies only to files in the Linux file system; empty directories in HDFS are ignored.
To correct the error, make the glob more specific. Instead of *, for example, use *.orc.