Querying external tables
After you create an external table, you can query it as you would query any other table. Suppose you have created the following external tables:
=> CREATE EXTERNAL TABLE catalog (id INT, description VARCHAR, category VARCHAR)
AS COPY FROM 'hdfs:///dat/catalog.csv' DELIMITER ',';
CREATE TABLE
=> CREATE EXTERNAL TABLE inventory(storeID INT, prodID INT, quantity INT)
AS COPY FROM 'hdfs:///dat/inventory.csv' DELIMITER ',';
CREATE TABLE
You can now write queries against these tables, such as the following:
=> SELECT * FROM catalog;
id | description | category
----+----------------------+-------------
10 | 24in monitor | computers
11 | 27in monitor | computers
12 | 24in IPS monitor | computers
20 | 1TB USB drive | computers
21 | 2TB USB drive | computers
22 | 32GB USB thumb drive | computers
30 | 40in LED TV | electronics
31 | 50in LED TV | electronics
32 | 60in plasma TV | electronics
(9 rows)
=> SELECT * FROM inventory;
storeID | prodID | quantity
---------+--------+----------
502 | 10 | 17
502 | 11 | 2
517 | 10 | 1
517 | 12 | 2
517 | 12 | 4
542 | 10 | 3
542 | 11 | 11
542 | 12 | 1
(8 rows)
=> SELECT inventory.storeID,catalog.description,inventory.quantity
FROM inventory JOIN catalog ON inventory.prodID = catalog.id;
storeID | description | quantity
---------+------------------+----------
502 | 24in monitor | 17
517 | 24in monitor | 1
542 | 24in monitor | 3
502 | 27in monitor | 2
542 | 27in monitor | 11
517 | 24in IPS monitor | 2
517 | 24in IPS monitor | 4
542 | 24in IPS monitor | 1
(8 rows)
One important difference between external tables and Vertica native tables is that querying an external table reads the external data every time. (See How external tables differ from native tables.) Specifically, each time a select query references the external table, Vertica parses the COPY statement definition again to access the data. Certain errors in either your table definition or your data do not become apparent until you run a query, so test your external tables before deploying them in a production environment.
Handling errors
Querying external table data with an incorrect COPY FROM statement definition can potentially result in many rejected rows. To limit the number of rejections, Vertica sets the maximum number of retained rejections with the ExternalTablesExceptionsLimit
configuration parameter. The default value is 100. Setting the ExternalTablesExceptionsLimit
to –1
removes the limit, but is not recommended.
If COPY errors reach the maximum number of rejections, the external table query continues, but COPY generates a warning in the vertica.log
file and does not report subsequent rejected rows.
Using the ExternalTablesExceptionsLimit
configuration parameter differs from using the COPY statement REJECTMAX
parameter to set a low rejection threshold. The REJECTMAX
value controls how many rejected rows to permit before causing the load to fail. If COPY encounters a number of rejected rows equal to or greater than REJECTMAX
, COPY aborts execution instead of logging a warning in vertica.log
.
Improving query performance for external tables
Queries that include joins perform better if the smaller table is the inner one. For native tables, the query optimizer uses cardinality to choose the inner table. For external tables, the query optimizer uses the row count if available.
After you create an external table, use ANALYZE_EXTERNAL_ROW_COUNT to collect this information. Calling this function is potentially expensive because it has to materialize one column of the table to be able to count the rows, so do this analysis when your database is not busy with critical queries. (This is why Vertica does not perform this operation automatically when you create the table.)
The query optimizer uses the results of your most-recent call to this function when planning queries. If the volume of data changes significantly, therefore, you should run it again to provide updated statistics. A difference of a few percent does not matter, but if your data volume grows by 20% or more, you should repeat this operation when able.
If you have ORC or Parquet data, you can take advantage of optimizations including partition pruning and predicate pushdown. See Improving query performance.
Using external tables with user-defined load (UDL) functions
You can use external tables in conjunction with UDL functions that you create. For more information about using UDLs, see User Defined Load (UDL).