Querying external tables

After you create an external table, you can query it as you would query any other table.

After you create an external table, you can query it as you would query any other table. Suppose you have created the following external tables:

=> CREATE EXTERNAL TABLE catalog (id INT, description VARCHAR, category VARCHAR)
    AS COPY FROM 'hdfs:///dat/catalog.csv' DELIMITER ',';
CREATE TABLE
=> CREATE EXTERNAL TABLE inventory(storeID INT, prodID INT, quantity INT)
    AS COPY FROM 'hdfs:///dat/inventory.csv' DELIMITER ',';
CREATE TABLE

You can now write queries against these tables, such as the following:

=> SELECT * FROM catalog;
 id |     description      |  category
----+----------------------+-------------
 10 | 24in monitor         | computers
 11 | 27in monitor         | computers
 12 | 24in IPS monitor     | computers
 20 | 1TB USB drive        | computers
 21 | 2TB USB drive        | computers
 22 | 32GB USB thumb drive | computers
 30 | 40in LED TV          | electronics
 31 | 50in LED TV          | electronics
 32 | 60in plasma TV       | electronics
(9 rows)

=> SELECT * FROM inventory;
 storeID | prodID | quantity
---------+--------+----------
     502 |     10 |       17
     502 |     11 |        2
     517 |     10 |        1
     517 |     12 |        2
     517 |     12 |        4
     542 |     10 |        3
     542 |     11 |       11
     542 |     12 |        1
(8 rows)

=> SELECT inventory.storeID,catalog.description,inventory.quantity
    FROM inventory JOIN catalog ON inventory.prodID = catalog.id;
 storeID |   description    | quantity
---------+------------------+----------
     502 | 24in monitor     |       17
     517 | 24in monitor     |        1
     542 | 24in monitor     |        3
     502 | 27in monitor     |        2
     542 | 27in monitor     |       11
     517 | 24in IPS monitor |        2
     517 | 24in IPS monitor |        4
     542 | 24in IPS monitor |        1
(8 rows)

One important difference between external tables and Vertica native tables is that querying an external table reads the external data every time. (See How external tables differ from native tables.) Specifically, each time a select query references the external table, Vertica parses the COPY statement definition again to access the data. Certain errors in either your table definition or your data do not become apparent until you run a query, so test your external tables before deploying them in a production environment.

Handling errors

Querying external table data with an incorrect COPY FROM statement definition can potentially result in many rejected rows. To limit the number of rejections, Vertica sets the maximum number of retained rejections with the ExternalTablesExceptionsLimit configuration parameter. The default value is 100. Setting the ExternalTablesExceptionsLimit to –1 removes the limit, but is not recommended.

If COPY errors reach the maximum number of rejections, the external table query continues, but COPY generates a warning in the vertica.log file and does not report subsequent rejected rows.

Using the ExternalTablesExceptionsLimit configuration parameter differs from using the COPY statement REJECTMAX parameter to set a low rejection threshold. The REJECTMAX value controls how many rejected rows to permit before causing the load to fail. If COPY encounters a number of rejected rows equal to or greater than REJECTMAX, COPY aborts execution instead of logging a warning in vertica.log.

Improving query performance for external tables

Queries that include joins perform better if the smaller table is the inner one. For native tables, the query optimizer uses cardinality to choose the inner table. For external tables, the query optimizer uses the row count if available.

After you create an external table, use ANALYZE_EXTERNAL_ROW_COUNT to collect this information. Calling this function is potentially expensive because it has to materialize one column of the table to be able to count the rows, so do this analysis when your database is not busy with critical queries. (This is why Vertica does not perform this operation automatically when you create the table.)

The query optimizer uses the results of your most-recent call to this function when planning queries. If the volume of data changes significantly, therefore, you should run it again to provide updated statistics. A difference of a few percent does not matter, but if your data volume grows by 20% or more, you should repeat this operation when able.

If your data is partitioned, Vertica automatically prunes partitions that cannot affect query results, causing less data to be loaded.

For ORC and Parquet data written with Hive version 0.14 and later, Vertica automatically uses predicate pushdown to further improve query performance. Predicate pushdown moves parts of the query execution closer to the data, reducing the amount of data that must be read from disk or across the network. ORC files written with earlier versions of Hive might not contain the statistics required to perform this optimization. When executing a query against a file that lacks these statistics, Vertica logs an EXTERNAL_PREDICATE_PUSHDOWN_NOT_SUPPORTED event in the QUERY_EVENTS system table. If you are seeing performance problems with your queries, check this table for these events.

Using external tables with user-defined load (UDL) functions

You can use external tables in conjunction with UDL functions that you create. For more information about using UDLs, see User Defined Load (UDL).