Working with external data

An alternative to importing data into Vertica is to query it in place. Querying external data instead of importing it can be advantageous in some cases:

  • If you want to explore data, such as in a data lake, before selecting data to load into Vertica.

  • If you are one of several consumers sharing the same data, for example in a data lake, then reading it in place eliminates concerns about whether query results are up to date. There's only one copy, so all consumers see the same data.

  • If your data changes rapidly but you do not want to stream it into Vertica, you can instead query the latest updates automatically.

  • If you have a very large volume of data and do not want to increase your license capacity.

  • If you have lower-priority data in Vertica that you still want to be able to query.

To query external data, you must describe your data as an external table. Like native tables, external tables have table definitions and can be queried. Unlike native tables, external tables store no data in Vertica; Vertica loads selected data from the external source as needed. For some formats, the query planner can take advantage of partitions and sorting in the data, so querying an external table does not mean you load all of the data at query time. (For more information about native tables, see Working with native tables.)

There is one special type of external data not covered in this section. If you are reading data from Hadoop, and specifically from a Hive data warehouse, then instead of defining your own external tables you can read the schema information from Hive. For more information, see Using the HCatalog Connector.

1 - How external tables differ from native tables

You can use external tables in the same ways you use Vertica native tables. Because the data is external to the database, however, there are some differences in how external tables operate.

Data

The data for an external table can reside anywhere, so long as all database nodes can access it. S3, HDFS, and NFS mount points are common places to find external data. Naturally, querying external data can incur some latency compared to querying locally-stored ROS data, but Vertica has optimizations that can reduce the impact. For example, Vertica can take advantage of node and rack locality for HDFS data.

Because the data is external, Vertica loads external data each time you query it. Vertica applies optimizations to reduce the volume of data read, including predicate pushdown and, for formats that support partitioning, partition pruning. The ORC and Parquet formats support these optimizations.

Because the data is read at query time, you must ensure that your users have and retain permission to read the data in its original location. Depending on where the data is stored, you might need to take additional steps to manage access, such as creating AWS IAM roles on S3.

Because the data is not stored in Vertica, external tables do not use superprojections and buddy projections.

Resource consumption

External tables add very little to the Vertica catalog, which reduces the resources that queries consume. Because the data is not stored in Vertica, external tables are not affected by the Tuple Mover and do not cause ROS pushback. Vertica uses a small amount of memory when reading external table data, because the table contents are not part of your database and are parsed each time the external table is used.

Backup and restore

Because the data in external tables is managed outside of Vertica, only the external table definitions, not the data files, are included in database backups. Arrange for a separate backup process for your external table data.

DML support

External tables allow you to read external data. They do not allow you to modify it. Some DML operations are therefore not available for external tables, including:

  • DELETE FROM

  • INSERT INTO

  • SELECT...FOR UPDATE

Sequences and identity columns

The COPY statement definition for external tables can include identity columns and sequences. Whenever a SELECT statement queries the external table, sequences and identity columns are re-evaluated. As a result, these column values can change from query to query, even if the underlying external data remains the same.

2 - Creating external tables

To create an external table you combine a table definition with a copy statement using the CREATE EXTERNAL TABLE AS COPY statement. CREATE EXTERNAL TABLE AS COPY uses a subset of parameters from CREATE TABLE and COPY.

You define your table columns as you would for a Vertica native table using CREATE TABLE. You also specify a COPY FROM clause to describe how to read the data, as you would for loading data. How you specify the FROM path depends on where the file is located and the data format. See Specifying where to load data from and Data formats.

As with native tables, you can use the INFER_TABLE_DDL function to derive column definitions from data files in the Parquet, ORC, and Avro formats.
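
For example, a call such as the following sketch (the path reuses the Parquet example later in this section) returns a suggested table definition that you can review and adjust before creating the external table:

=> SELECT INFER_TABLE_DDL('/data/sales/*.parquet'
       USING PARAMETERS format = 'parquet', table_name = 'sales');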

When defining an external table for ORC or Parquet data, you must define all of the data columns in the file. You may omit partition columns. If you omit data columns, queries using the table abort with an error. For other data formats, you can select only the data columns of interest.

If you load from multiple ORC or Parquet files in the same COPY statement and any of them cannot be loaded, the entire load aborts. This behavior differs from that for delimited files, where the COPY statement loads what it can and ignores the rest.

When you create an external table, data is not added to the database and no projections are created. Instead, Vertica performs a syntactic check of the CREATE EXTERNAL TABLE AS COPY statement and stores the table name and COPY statement definition in the catalog. Each time a SELECT query references an external table, Vertica parses and executes the stored COPY statement to obtain the referenced data. Any problems in the table definition, such as incorrect column types, can be discovered only by querying the table.

Successfully returning data from an external table requires that the COPY definition be correct, and that other dependencies, such as files, nodes, and other resources are accessible and available at query time. If the table definition uses globs (wildcards), and files are added or deleted, the data in the external table can change between queries.

The following example defines an external table for delimited data stored in HDFS:

=> CREATE EXTERNAL TABLE sales (itemID INT, date DATE, price FLOAT)
    AS COPY FROM 'hdfs:///dat/ext1.csv' DELIMITER ',';

The following example uses data in the ORC format that is stored in S3. The data has two partition columns, represented as directories in the file structure. For more information about partitions, see Using partition columns.

=> CREATE EXTERNAL TABLE transactions (id int, name varchar(50), created date, region varchar(50))
   AS COPY FROM 's3://datalake/sales/*/*/*'
   ORC(hive_partition_cols='created,region');

The following example shows how you can read from all Parquet files in a local directory, with no partitions and no globs:

=> CREATE EXTERNAL TABLE sales (itemID INT, date DATE, price FLOAT)
    AS COPY FROM '/data/sales/*.parquet' PARQUET;

When using the ORC and Parquet formats, Vertica supports some additional options in the COPY statement and data structures for columns. See ORC (parser) and PARQUET (parser).

If ORC or Parquet data is partitioned, Vertica expects Hive-style partitioning. If you see unexpected results when reading data, verify that globs in your file paths correctly align with the partition structure. See Troubleshooting external tables.

Special considerations for external tables

If the maximum length of a column is smaller than the actual data, such as a VARCHAR that is too short, Vertica truncates the data and logs the event.

You can see unexpected query results if constraints on columns cause values to be rejected:

  • If you specify a NOT NULL column constraint and the data contains null values, those rows are rejected.

  • If you use ENFORCELENGTH, values that are too long are rejected rather than being truncated.

  • When reading ORC data, if you declare a scalar precision and some data does not fit, that row is rejected. For example, if you specify a column as Decimal(6,5), a value of 123.456 is rejected.

One way to know if column constraints have caused data to be rejected is if COUNT on a column returns a different value than COUNT(*).
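
For example, with the sales table defined earlier in this section, a comparison such as the following (the price column is chosen arbitrarily) can reveal that values were rejected:

=> SELECT COUNT(*), COUNT(price) FROM sales;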

When using the COPY parameter ON ANY NODE, confirm that the source file definition is identical on all nodes. Specifying different external files can produce inconsistent results.

If your data is in Parquet or ORC format, you can take advantage of partitioning to limit the amount of data that Vertica reads. These formats are special in this respect because they embed metadata in the file headers. For more information about using partitioned data, see Using partition columns.

Canceling a CREATE EXTERNAL TABLE AS COPY statement can cause unpredictable results. If you realize after beginning the operation that your table definition is incorrect (for example, you inadvertently specify the wrong external location), wait for the query to complete. When the external table exists, use DROP TABLE to remove its definition.

After you create an external table, analyze its row count to improve query performance. See Improving Query Performance for External Tables.
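
For example, after creating the sales table shown above, you might collect its row count with ANALYZE_EXTERNAL_ROW_COUNT, the function described in that topic:

=> SELECT ANALYZE_EXTERNAL_ROW_COUNT('sales');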

Required permissions

In addition to having permission in Vertica, users must have read access to the external data.

  • For data on the local disk this access is governed by local file permissions.

  • For data in HDFS, access might be governed by Kerberos authentication. See Accessing kerberized HDFS data.

  • For data on S3, you need access through an AWS IAM role. See S3 object store.

  • For data in GCS, you must enable S3 compatibility before reading data. See Google Cloud Storage (GCS) object store.

By default, you must also be a database superuser to access external tables through a SELECT statement.

In most cases, to allow users without superuser access to query external tables, an administrator must create a USER storage location and grant those users read access to the location. See CREATE LOCATION and GRANT (storage location). This location must be a parent of the path used in the COPY statement when creating the external table. This requirement does not apply to external tables stored in HDFS. The following example shows granting access to a user named Bob to any external table whose data is located under /tmp (including in subdirectories to any depth):

=> CREATE LOCATION '/tmp' ALL NODES USAGE 'user';
=> GRANT ALL ON LOCATION '/tmp' to Bob;

Organizing external table data

If the data you store in external tables changes regularly (for instance, each month in the case of storing recent historical data), your COPY definition statement can use wildcards (globs) to make parsing the stored COPY statement definition more dynamic. For instance, if you store monthly data on an NFS mount, you could organize monthly files within a top-level directory for a calendar year, such as:

/2018/monthly/

In this case, the external table COPY statement includes a wildcard definition such as the following:

=> CREATE EXTERNAL TABLE archive (...) AS COPY FROM '/nfs_name/2018/monthly/*'

Whenever a Vertica query references the external table archive, and Vertica parses the COPY statement, all stored data in the top-level monthly directory is accessible to the query.

Validating table definitions

When you create an external table, Vertica validates the syntax of the CREATE EXTERNAL TABLE AS COPY FROM statement. For example, if you omit a required keyword in the statement, creating the external table fails:

=> CREATE EXTERNAL TABLE ext (ts timestamp, d varchar)
    AS COPY '/home/dbadmin/designer.log';
ERROR 2778:  COPY requires a data source; either a FROM clause or a WITH SOURCE for a user-defined source

Checking other components of the COPY definition, such as path statements and node availability, does not occur until a SELECT query references the external table.

To validate an external table definition, run a SELECT query that references the external table. Check that the returned query data is what you expect. If the query does not return data correctly, check the COPY exception and rejected data log files.

Because the COPY definition determines what occurs when you query an external table, COPY statement errors can reveal underlying problems. For more information about COPY exceptions and rejections, see Handling messy data.

Viewing external table definitions

When you create an external table, Vertica stores the COPY definition statement in the table_definition column of the TABLES system table.

To list only the external tables, filter for a non-empty table_definition, as shown:

=> SELECT * FROM TABLES WHERE table_definition <> '';

Use a query such as the following to list the external table definitions:

=> SELECT table_name, table_definition FROM TABLES;
 table_name |                                table_definition
------------+----------------------------------------------------------------------
 t1         | COPY            FROM 'TMPDIR/external_table.dat'  DELIMITER ','
 t1_copy    | COPY            FROM 'TMPDIR/external_table.dat'  DELIMITER ','
 t2         | COPY FROM 'TMPDIR/external_table2.dat' DELIMITER ','
(3 rows)

3 - Querying external tables

After you create an external table, you can query it as you would query any other table. Suppose you have created the following external tables:

=> CREATE EXTERNAL TABLE catalog (id INT, description VARCHAR, category VARCHAR)
    AS COPY FROM 'hdfs:///dat/catalog.csv' DELIMITER ',';
CREATE TABLE
=> CREATE EXTERNAL TABLE inventory(storeID INT, prodID INT, quantity INT)
    AS COPY FROM 'hdfs:///dat/inventory.csv' DELIMITER ',';
CREATE TABLE

You can now write queries against these tables, such as the following:

=> SELECT * FROM catalog;
 id |     description      |  category
----+----------------------+-------------
 10 | 24in monitor         | computers
 11 | 27in monitor         | computers
 12 | 24in IPS monitor     | computers
 20 | 1TB USB drive        | computers
 21 | 2TB USB drive        | computers
 22 | 32GB USB thumb drive | computers
 30 | 40in LED TV          | electronics
 31 | 50in LED TV          | electronics
 32 | 60in plasma TV       | electronics
(9 rows)

=> SELECT * FROM inventory;
 storeID | prodID | quantity
---------+--------+----------
     502 |     10 |       17
     502 |     11 |        2
     517 |     10 |        1
     517 |     12 |        2
     517 |     12 |        4
     542 |     10 |        3
     542 |     11 |       11
     542 |     12 |        1
(8 rows)

=> SELECT inventory.storeID,catalog.description,inventory.quantity
    FROM inventory JOIN catalog ON inventory.prodID = catalog.id;
 storeID |   description    | quantity
---------+------------------+----------
     502 | 24in monitor     |       17
     517 | 24in monitor     |        1
     542 | 24in monitor     |        3
     502 | 27in monitor     |        2
     542 | 27in monitor     |       11
     517 | 24in IPS monitor |        2
     517 | 24in IPS monitor |        4
     542 | 24in IPS monitor |        1
(8 rows)

One important difference between external tables and Vertica native tables is that querying an external table reads the external data every time. (See How external tables differ from native tables.) Specifically, each time a select query references the external table, Vertica parses the COPY statement definition again to access the data. Certain errors in either your table definition or your data do not become apparent until you run a query, so test your external tables before deploying them in a production environment.

Handling errors

Querying external table data with an incorrect COPY FROM statement definition can potentially result in many rejected rows. To limit the number of rejections, Vertica sets the maximum number of retained rejections with the ExternalTablesExceptionsLimit configuration parameter. The default value is 100. Setting the ExternalTablesExceptionsLimit to -1 removes the limit, but is not recommended.

If COPY errors reach the maximum number of rejections, the external table query continues, but COPY generates a warning in the vertica.log file and does not report subsequent rejected rows.

Using the ExternalTablesExceptionsLimit configuration parameter differs from using the COPY statement REJECTMAX parameter to set a low rejection threshold. The REJECTMAX value controls how many rejected rows to permit before causing the load to fail. If COPY encounters a number of rejected rows equal to or greater than REJECTMAX, COPY aborts execution instead of logging a warning in vertica.log.
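
For example, the following sketch raises the retained-rejection limit for the database and defines an external table whose queries abort once ten rows have been rejected (the sales_strict name and path are illustrative, based on the earlier sales example):

=> ALTER DATABASE DEFAULT SET PARAMETER ExternalTablesExceptionsLimit = 500;

=> CREATE EXTERNAL TABLE sales_strict (itemID INT, date DATE, price FLOAT)
    AS COPY FROM 'hdfs:///dat/ext1.csv' DELIMITER ',' REJECTMAX 10;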

Improving query performance for external tables

Queries that include joins perform better if the smaller table is the inner one. For native tables, the query optimizer uses cardinality to choose the inner table. For external tables, the query optimizer uses the row count if available.

After you create an external table, use ANALYZE_EXTERNAL_ROW_COUNT to collect this information. Calling this function is potentially expensive because it has to materialize one column of the table to be able to count the rows, so do this analysis when your database is not busy with critical queries. (This is why Vertica does not perform this operation automatically when you create the table.)

The query optimizer uses the results of your most-recent call to this function when planning queries. If the volume of data changes significantly, therefore, you should run it again to provide updated statistics. A difference of a few percent does not matter, but if your data volume grows by 20% or more, you should repeat this operation when able.

If you have ORC or Parquet data, you can take advantage of optimizations including partition pruning and predicate pushdown. See Improving query performance.

Using external tables with user-defined load (UDL) functions

You can use external tables in conjunction with UDL functions that you create. For more information about using UDLs, see User Defined Load (UDL).

3.1 - Using partition columns

An ORC or Parquet file contains data columns. To these files you can add partition columns at write time. The data files do not store values for partition columns; instead, when writing the files you divide them into groups (partitions) based on column values. You can use partitioning to improve the performance of queries that restrict results by the partitioned column.

For example, if you have a table with a date column, and you know you will be writing queries restricted to particular dates, you can partition by date. Thus, Vertica can skip reading some files entirely when executing your date-restricted queries. This behavior is called partition pruning.

You can create partitions regardless of where you store the files—in HDFS, in an S3 bucket, on a local file system, or in a shared file system such as NFS.

You can use Hive or EXPORT TO PARQUET to create partitions, or you can create them manually. For information about creating partitions as part of exporting data from Vertica, see Partitioning and sorting data. See Improving Query Performance for information about tuning partitions.

Partition structure

By default, both Hive and Vertica write Hadoop columnar format files that contain the data for all table columns without partitioning. The column data is laid out in stripes, or groups of row data. When Vertica loads this data it reads all of the stripes.

If you partition the data, however, you can avoid writing some of that data into the files and thus reduce the amount to be read. Instead of storing a column's data in the files, you create a directory structure that partitions the data based on the value in a column.

For example, if the data includes a date column, you can write each date as a separate partition. Each partition is a directory with a name of the form "column=value". If you have a date column named "created" that is partitioned by day, you would have the following directory structure:

path/created=2016-11-01/*
path/created=2016-11-02/*
path/created=2016-11-03/*
path/...

As this example shows, the files in each subdirectory contain all columns except the "created" column.

You can partition by more than one column, creating a layered structure. For example, adding another partitioned column, "region", to the preceding example would produce the following directory structure:

path/created=2016-11-01/region=northeast/*
path/created=2016-11-01/region=central/*
path/created=2016-11-01/region=southeast/*
path/created=2016-11-01/...
path/created=2016-11-02/region=northeast/*
path/created=2016-11-02/region=central/*
path/created=2016-11-02/region=southeast/*
path/created=2016-11-02/...
path/created=2016-11-03/...
path/...

With this change, the data files contain all columns except "created" and "region".

You can create partitions for columns of any simple data type. As a best practice, however, you should avoid partitioning columns with BOOLEAN, FLOAT, and NUMERIC types.

Under some circumstances Hive writes a partition with a value of __HIVE_DEFAULT_PARTITION__. Vertica treats these values as NULL.

COPY syntax

When creating an external table from partitioned data, you must do all of the following:

  • In the column definition in the external table, if you are using strong schema matching (the default), list the partition columns last and in order.

  • In the path, use wildcards to include all of the levels of directories and files.

  • In the ORC or PARQUET statement, specify the partition columns in the hive_partition_cols parameter. (The argument name is the same even if you didn't use Hive to do the partitioning; it refers to Hive-style partitions.) When using strong schema matching, you must list the names in order.

The following example creates an external table using the partitioned data shown previously. The table includes four columns. Two columns, "id" and "name", are in the data files. The other two, "created" and "region", are partitioned.

=> CREATE EXTERNAL TABLE records (id int, name varchar(50), created date, region varchar(50))
   AS COPY FROM 'webhdfs:///path/*/*/*'
   ORC(hive_partition_cols='created,region');

The path includes one wildcard (*) for each level of directory partitioning and then one more for the files. The number of wildcards must always be one more than the number of partitioned columns.

You do not need to include all of the partitioned columns in hive_partition_cols if those columns are not relevant for your queries. However, the partition columns must be the last columns in the table definition. For example, you can define the following table for the partitioned data shown previously:

=> CREATE EXTERNAL TABLE records (id int, name varchar(50), created date, region varchar(50))
   AS COPY FROM 'webhdfs:///path/*/*/*' ORC(hive_partition_cols='region');

Values in the "created" column are all null because no data appears in the files for that column and hive_partition_cols does not include it.

However, the following example produces an error.

=> CREATE EXTERNAL TABLE records (id int, name varchar(50), created date, region varchar(50))
   AS COPY FROM 'webhdfs:///path/*/*/*' ORC(hive_partition_cols='created');

In this example, the table definition includes the "region" column after the "created" column, and "region" is not included in hive_partition_cols. Because this column is not listed as a partition column, Vertica interprets it as a data column and produces an error because the column is not present.

If Vertica cannot convert a partition value to the declared type for that column, it sets the value to NULL. The following example incorrectly declares region to be an integer rather than a varchar.

=> CREATE EXTERNAL TABLE records (id int, name varchar(50), created date, region int)
   AS COPY FROM 'webhdfs:///path/*/*/*' ORC(hive_partition_cols='region');

Vertica cannot coerce a directory named "region=northeast" into an integer value, so it sets that column value to NULL for all rows it reads from this directory. If you declare the column as NOT NULL, Vertica rejects the row. If the number of rejected rows exceeds REJECTMAX, Vertica reports an error.
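
If you suspect that this has happened, a quick check is to count the rows with NULL partition values, using the records table from this example:

=> SELECT COUNT(*) FROM records WHERE region IS NULL;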

Queries

When executing queries with predicates, Vertica skips subdirectories that do not satisfy the predicate. This process is called partition pruning and it can significantly improve query performance. See Improving query performance for more information about partition pruning and other techniques for optimizing queries.

The following example reads only the partitions for the specified region, for all dates. Although the data is also partitioned by date, the query does not restrict the date.

=> SELECT * FROM t WHERE region='northeast';

To verify that Vertica is pruning partitions, look in the explain plan for a message similar to the following:

files with unmatched Hive partition have been pruned
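
For example, you can examine the plan for the query shown above:

=> EXPLAIN SELECT * FROM t WHERE region='northeast';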

3.2 - Improving query performance

When working with external tables in the Parquet and ORC columnar formats, Vertica tries to improve performance in the following ways:

  • By pushing query execution closer to the data so less has to be read and transmitted. Vertica uses the following specific techniques: predicate pushdown, column selection, and partition pruning.

  • By taking advantage of data locality in the query plan.

  • By analyzing the row count to get the best join orders in the query plan.

Tuning ORC stripes and Parquet rowgroups

Vertica can read ORC and Parquet files generated by any Hive version. However, newer Hive versions store more metadata in these files, which both Hive and Vertica use to prune values and read only the required data. ORC and Parquet are fully forward- and backward-compatible, so use the most recent Hive version available to store data in these formats; for the best performance, use Hive 0.14 or later.

The ORC format splits a table into groups of rows called stripes and stores column-level metadata in each stripe. The Parquet format splits a table into groups of rows called rowgroups and stores column-level metadata in each rowgroup. Each stripe/rowgroup's metadata is used during predicate evaluation to determine whether the values from this stripe need to be read. Large stripes usually yield better performance, so set the stripe size to at least 256M.

Hive writes ORC stripes and Parquet rowgroups to HDFS, which stores data in HDFS blocks distributed among multiple physical data nodes. Accessing an HDFS block requires opening a separate connection to the corresponding data node. It is advantageous to ensure that an ORC stripe or Parquet rowgroup does not span more than one HDFS block. To do so, set the HDFS block size to be larger than the stripe/rowgroup size. Setting HDFS block size to 512M is usually sufficient.

Hive provides three compression options: None, Snappy, and Zlib. Use Snappy or Zlib compression to reduce storage and I/O consumption. Usually, Snappy is less CPU-intensive but can yield lower compression ratios compared to Zlib.

Storing data in sorted order can improve data access and predicate evaluation performance. Sort table columns based on the likelihood of their occurrence in query predicates; columns that most frequently occur in comparison or range predicates should be sorted first.

Partitioning tables is a very useful technique for data organization. Similarly to sorting tables by columns, partitioning can improve data access and predicate evaluation performance. Vertica supports Hive-style partitions and partition pruning.

The following Hive statement creates an ORC table with stripe size 256M and Zlib compression:

hive> CREATE TABLE customer_visits (
        customer_id bigint,
        visit_num int,
        page_view_dt date)
    STORED AS ORC tblproperties("orc.compress"="ZLIB",
        "orc.stripe.size"="268435456");

The following statement creates a Parquet table with a rowgroup size of 256M and Zlib compression:

hive> CREATE TABLE customer_visits (
        customer_id bigint,
        visit_num int,
        page_view_dt date)
    STORED AS PARQUET tblproperties("parquet.compression"="ZLIB",
        "parquet.stripe.size"="268435456");

Predicate pushdown and column selection

Predicate pushdown moves parts of the query execution closer to the data, reducing the amount of data that must be read from disk or across the network. ORC files have three levels of indexing: file statistics, stripe statistics, and row group indexes. Predicates are applied only to the first two levels. Parquet files have two levels of statistics: rowgroup statistics and page statistics. Predicates are only applied to the first level.

Predicate pushdown is automatically applied for files written with Hive version 0.14 and later. ORC files written with earlier versions of Hive might not contain the required statistics. When executing a query against a file that lacks these statistics, Vertica logs an EXTERNAL_PREDICATE_PUSHDOWN_NOT_SUPPORTED event in the QUERY_EVENTS system table. If you are seeing performance problems with your queries, check this table for these events.
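
For example, a query along the following lines surfaces those events; adjust the selected QUERY_EVENTS columns as needed:

=> SELECT event_type, event_description
     FROM query_events
     WHERE event_type = 'EXTERNAL_PREDICATE_PUSHDOWN_NOT_SUPPORTED';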

Another query performance optimization technique used by Vertica is column selection. Vertica reads from ORC or Parquet files only the columns specified in the query statement. For example, the following statement reads only the customer_id and visit_num columns from the corresponding ORC files:

=> CREATE EXTERNAL TABLE customer_visits (
        customer_id bigint,
        visit_num int,
        page_view_dt date)
    AS COPY FROM '...' ORC;

=> SELECT customer_id from customer_visits
        WHERE visit_num > 10;

Data locality

In a cluster where Vertica nodes are co-located on HDFS nodes, the query can use data locality to improve performance. For Vertica to do so, both of the following conditions must be met:

  • The data is on an HDFS node where a database node is also present.

  • The query is not restricted to specific nodes using ON NODE.

When both these conditions exist, the query planner uses the co-located database node to read that data locally, instead of making a network call.

You can see how much data is being read locally by inspecting the query plan. The label for LoadStep(s) in the plan contains a statement of the form: "X% of ORC/Parquet data matched with co-located Vertica nodes". To increase the volume of local reads, consider adding more database nodes. HDFS data, by its nature, can't be moved to specific nodes, but if you run more database nodes you increase the likelihood that a database node is local to one of the copies of the data.

Creating sorted files in Hive

Unlike Vertica, Hive does not store table columns in separate files and does not create multiple projections per table with different sort orders. For efficient data access and predicate pushdown, sort Hive table columns based on the likelihood of their occurrence in query predicates. Columns that most frequently occur in comparison or range predicates should be sorted first.

Data can be inserted into Hive tables in a sorted order by using the ORDER BY or SORT BY keywords. For example, to insert data into the ORC table "customer_visit" from another table "visits" with the same columns, use these keywords with the INSERT INTO command:

hive> INSERT INTO TABLE customer_visits
        SELECT * from visits
        ORDER BY page_view_dt;
hive> INSERT INTO TABLE customer_visits
        SELECT * from visits
        SORT BY page_view_dt;

The difference between the two keywords is that ORDER BY guarantees global ordering on the entire table by using a single MapReduce reducer to populate the table. SORT BY uses multiple reducers, which can cause ORC or Parquet files to be sorted by the specified column(s) but not be globally sorted. Because ORDER BY uses a single reducer, it can increase the time taken to load the file.

You can combine clustering and sorting to sort a table globally. The following table definition adds a hint that data is inserted into this table bucketed by page_view_dt and sorted by page_view_dt:

hive> CREATE TABLE customer_visits_bucketed (
        customer_id bigint,
        visit_num int,
        page_view_dt date)
    CLUSTERED BY (page_view_dt)
    SORTED BY (page_view_dt) INTO 10 BUCKETS
    STORED AS ORC;

When inserting data into the table, you must explicitly specify the clustering and sort columns, as in the following example:

hive> INSERT INTO TABLE customer_visits_bucketed
    SELECT * from visits
    DISTRIBUTE BY page_view_dt
    SORT BY page_view_dt;

The following statement is equivalent:

hive> INSERT INTO TABLE customer_visits_bucketed
    SELECT * from visits
    CLUSTER BY page_view_dt;

Both of the above commands insert data into the customer_visits_bucketed table, globally sorted on the page_view_dt column.

Partitioning Hive tables

Table partitioning in Hive is an effective technique for data separation and organization, as well as for reducing storage requirements. To partition a table in Hive, include it in the PARTITIONED BY clause:

hive> CREATE TABLE customer_visits (
        customer_id bigint,
        visit_num int)
    PARTITIONED BY (page_view_dt date)
    STORED AS ORC;

Hive does not materialize partition column(s). Instead, it creates subdirectories of the following form:

path_to_table/partition_column_name=value/

When the table is queried, Hive parses the subdirectories' names to materialize the values in the partition columns. The value materialization in Hive is a plain conversion from a string to the appropriate data type.

Inserting data into a partitioned table requires specifying the value(s) of the partition column(s). The following example creates two partition subdirectories, "customer_visits/page_view_dt=2016-02-01" and "customer_visits/page_view_dt=2016-02-02":

hive> INSERT INTO TABLE customer_visits
    PARTITION (page_view_dt='2016-02-01')
    SELECT customer_id, visit_num from visits
    WHERE page_view_dt='2016-02-01'
    ORDER BY page_view_dt;

hive> INSERT INTO TABLE customer_visits
    PARTITION (page_view_dt='2016-02-02')
    SELECT customer_id, visit_num from visits
    WHERE page_view_dt='2016-02-02'
    ORDER BY page_view_dt;

Each directory contains ORC files with two columns, customer_id and visit_num.

Accessing partitioned data from Vertica

Vertica recognizes and supports Hive-style partitions. You can read partition values and data using the HCatalog Connector or the COPY statement.

If you use the HCatalog Connector, you create an HCatalog schema in Vertica that mirrors a schema in Hive (see Using the HCatalog Connector). If you use the COPY statement, define an external table that lists the partition columns in hive_partition_cols, as in the following example:

=> CREATE EXTERNAL TABLE customer_visits (customer_id int, visit_num int,
                    page_view_dtm date)
    AS COPY FROM 'hdfs://host:port/path/customer_visits/*/*' ORC
    (hive_partition_cols='page_view_dtm');

The following statement reads all ORC files stored in all sub-directories including the partition values:

=> SELECT customer_id, visit_num, page_view_dtm FROM customer_visits;

When executing queries with predicates on partition columns, Vertica uses the subdirectory names to skip files that do not satisfy the predicate. This process is called partition pruning.
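
For example, the following query restricts the partition column defined earlier, so Vertica reads only the files under the matching subdirectory:

=> SELECT customer_id, visit_num FROM customer_visits
     WHERE page_view_dtm = '2016-02-01';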

You can also define a separate external table for each subdirectory, as in the following example:

=> CREATE EXTERNAL TABLE customer_visits_20160201 (customer_id int,
            visit_num int, page_view_dtm date)
    AS COPY FROM
    'hdfs://host:port/path/customer_visits/page_view_dt=2016-02-01/*' ORC;

Example: a partitioned, sorted ORC table

Suppose you have data stored in CSV files containing three columns: customer_id, visit_num, and page_view_dt:

1,123,2016-01-01
33,1,2016-02-01
2,57,2016-01-03
...

The goal is to create the following Hive table:

hive> CREATE TABLE customer_visits (
            customer_id bigint,
            visit_num int)
    PARTITIONED BY (page_view_dt date)
    STORED AS ORC;

To achieve this, perform the following steps:

  1. Copy or move the CSV files to HDFS.

  2. Define a textfile Hive table and copy the CSV files into it:

    hive> CREATE TABLE visits (
                customer_id bigint,
                visit_num int,
                page_view_dt date)
        ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
        STORED AS TEXTFILE;
    
    hive> LOAD DATA INPATH path_to_csv_files INTO TABLE visits;
    
  3. For each unique value in page_view_dt, insert the data into the target table, specifying that value as the partition:

    hive> INSERT INTO TABLE customer_visits
            PARTITION (page_view_dt='2016-01-01')
            SELECT customer_id, visit_num FROM visits
            WHERE page_view_dt='2016-01-01'
            ORDER BY page_view_dt;
    ...
    

    This operation inserts data from visits.customer_id into customer_visits.customer_id, and from visits.visit_num into customer_visits.visit_num. These two columns are stored in generated ORC files. Simultaneously, values from visits.page_view_dt are used to create partitions for the partition column customer_visits.page_view_dt, which is not stored in the ORC files.

Data modification in Hive

Hive is well-suited for reading large amounts of write-once data. Its optimal usage is loading data in bulk into tables and never modifying the data. In particular, for data stored in the ORC and Parquet formats, this usage pattern produces large, globally (or nearly globally) sorted files.

Periodic addition of data to tables (known as “trickle load”) is likely to produce many small files. The disadvantage of this is that Vertica has to access many more files during query planning and execution. These extra accesses can result in longer query-processing time. The major performance degradation comes from the increase in the number of file seeks on HDFS.

Hive can also modify underlying ORC or Parquet files without user involvement. If enough records in a Hive table are modified or deleted, for example, Hive deletes existing files and replaces them with newly-created ones. Hive can also be configured to automatically merge many small files into a few larger files.

When new tables are created, or existing tables are modified in Hive, you must manually synchronize Vertica to keep it up to date. The following statement synchronizes the Vertica schema "hcat" after a change in Hive:

=> SELECT sync_with_hcatalog_schema('hcat_local', 'hcat');

Schema evolution in Hive

Hive supports two kinds of schema evolution:

  1. New columns can be added to existing tables in Hive. Vertica automatically handles this kind of schema evolution. The old records display NULLs for the newer columns.
  2. The type of a column for a table can be modified in Hive. Vertica does not support this kind of schema evolution.

The following example demonstrates schema evolution through new columns. In this example, hcat.parquet.txt is a file with the following values:

-1|0.65|0.65|6|'b'

hive> create table hcat.parquet_tmp (a int, b float, c double, d int, e varchar(4))
      row format delimited fields terminated by '|' lines terminated by '\n';

hive> load data local inpath 'hcat.parquet.txt' overwrite into table
      hcat.parquet_tmp;

hive> create table hcat.parquet_evolve (a int) partitioned by (f int) stored as
      parquet;
hive> insert into table hcat.parquet_evolve partition (f=1) select a from
      hcat.parquet_tmp;
hive> alter table hcat.parquet_evolve add columns (b float);
hive> insert into table hcat.parquet_evolve partition (f=2) select a, b from
      hcat.parquet_tmp;
hive> alter table hcat.parquet_evolve add columns (c double);
hive> insert into table hcat.parquet_evolve partition (f=3) select a, b, c from
      hcat.parquet_tmp;
hive> alter table hcat.parquet_evolve add columns (d int);
hive> insert into table hcat.parquet_evolve partition (f=4) select a, b, c, d from
      hcat.parquet_tmp;
hive> alter table hcat.parquet_evolve add columns (e varchar(4));
hive> insert into table hcat.parquet_evolve partition (f=5) select a, b, c, d, e
      from hcat.parquet_tmp;
hive> insert into table hcat.parquet_evolve partition (f=6) select a, b, c, d, e
      from hcat.parquet_tmp;

=> SELECT * from hcat_local.parquet_evolve;

  a |         b         |  c   | d | e | f
----+-------------------+------+---+---+---
 -1 |                   |      |   |   | 1
 -1 | 0.649999976158142 |      |   |   | 2
 -1 | 0.649999976158142 | 0.65 |   |   | 3
 -1 | 0.649999976158142 | 0.65 | 6 |   | 4
 -1 | 0.649999976158142 | 0.65 | 6 | b | 5
 -1 | 0.649999976158142 | 0.65 | 6 | b | 6
(6 rows)

4 - Monitoring external tables

Vertica records information about external tables in system tables. You can use these tables to track your external data and queries against it.

The TABLES system table contains data about all tables, both native and external. The TABLE_DEFINITION column is specific to external tables. You can query this column to see all external data sources currently in use, as in the following example:

=> SELECT table_name, create_time, table_definition FROM tables WHERE table_definition != '';
table_name   |          create_time          |                     table_definition
---------------+-------------------------------+-----------------------------------------------------------
customers_orc | 2018-03-21 11:07:30.159442-04 | COPY from '/home/dbadmin/sample_orc_files/0*' ORC
miscprod      | 2018-06-26 17:40:04.012121-04 | copy from '/home/dbadmin/data/prod.csv'
students      | 2018-06-26 17:46:50.695024-04 | copy from '/home/dbadmin/students.csv'
numbers       | 2018-06-26 17:53:52.407441-04 | copy from '/home/dbadmin/tt.dat'
catalog       | 2018-06-26 18:12:28.598519-04 | copy from '/home/dbadmin/data/prod.csv' delimiter ','
inventory     | 2018-06-26 18:13:06.951802-04 | copy from '/home/dbadmin/data/stores.csv' delimiter ','
test          | 2018-06-27 16:31:39.170866-04 | copy from '/home/dbadmin/data/stores.csv' delimiter ','
(7 rows)

The EXTERNAL_TABLE_DETAILS table provides more details, including file sizes. Vertica computes the values in this table at query time, which is potentially expensive, so consider restricting the query by schema or table.

=> SELECT table_name, source_format, total_file_size_bytes FROM external_table_details;
table_name   | source_format | total_file_size_bytes
---------------+---------------+-----------------------
customers_orc | ORC           |             619080883
miscprod      | DELIMITED     |                   254
students      | DELIMITED     |                   763
numbers       | DELIMITED     |                    30
catalog       | DELIMITED     |                   254
inventory     | DELIMITED     |                    74
test          | DELIMITED     |                    74
(7 rows)
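
To reduce that cost, restrict the query where possible. For example, the following variant, which reuses the columns shown above, limits the computation to a single table:

=> SELECT table_name, source_format, total_file_size_bytes
     FROM external_table_details
     WHERE table_name = 'customers_orc';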

If the size of an external table changes significantly over time, you should rerun ANALYZE_EXTERNAL_ROW_COUNT() to gather updated statistics. See Improving Query Performance for External Tables.

The LOAD_SOURCES table shows information for loads currently in progress. This table does not record information about loads of ORC or Parquet data.

5 - Troubleshooting external tables

You might encounter the following issues when creating or querying external tables. For general data-load troubleshooting, see Troubleshooting data loads.

File not found or permission denied

If a query against an external table produces a file or permission error, ensure that the user executing the query has the necessary permissions in both Vertica and the file system. See the permissions section in Creating external tables.

Error 7226: cannot find partition column

When querying ORC or Parquet data, you might see an error message stating that a partition column is missing:

ERROR 7226: Cannot find partition column [region] in parquet source
    [/data/table_int/int_original/000000_0]

This error can occur if you partition your ORC or Parquet data (see Using partition columns). If you create an external table and then change the partition structure, for example by renaming a column, you must then re-create the external table. If you see this error, update your table to match the partitioning on disk.
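
For example, if the region partition level were removed from the data on disk, you would re-create the table without that partition column, dropping one wildcard from the path (a sketch based on the records example in Using partition columns):

=> DROP TABLE records;
=> CREATE EXTERNAL TABLE records (id int, name varchar(50), created date)
   AS COPY FROM 'webhdfs:///path/*/*' ORC(hive_partition_cols='created');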

Error 6766: is a directory

When querying data you might see an error message stating that an input file is a directory:

ERROR 6766: Error reading from orc parser input stream
[/tmp/orc_glob/more_nations]: Is a directory

This error occurs if the glob in the table's COPY FROM clause matches an empty directory. This error occurs only for files in the Linux file system; empty directories in HDFS are ignored.

To correct the error, make the glob more specific. Instead of *, for example, use *.orc.
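
For example, the following sketch narrows the glob from the error above so that it matches only ORC files; the column definitions here are hypothetical:

=> CREATE EXTERNAL TABLE nations (n_nationkey INT, n_name VARCHAR(25))
   AS COPY FROM '/tmp/orc_glob/*.orc' ORC;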