Partitioned file paths

Data files are sometimes partitioned in the file system using the directory structure.

Data files are sometimes partitioned in the file system using the directory structure. Partitioning allows you to move values out of the raw data, where they have to be included for each row, and into the directory structure. Partitioning can improve query performance by allowing entire directories to be skipped. Partitioning can also save disk space if partition values are not repeated in the data.

Partitioning by date is one common use:

/data/created=2016-11-01/*
/data/created=2016-11-02/*
/data/created=2016-11-03/*
/data/...

The files in the globs do not contain a created column because this information is expressed through the file system. Vertica can read the partitioned values (dates, in this example) into a table column (created, in this example).

Data can be partitioned by more than one value:

/data/created=2016-11-01/region=northeast/*
/data/created=2016-11-01/region=central/*
/data/created=2016-11-01/region=southeast/*
/data/created=2016-11-01/...
/data/created=2016-11-02/region=northeast/*
/data/created=2016-11-02/region=central/*
/data/created=2016-11-02/region=southeast/*
/data/created=2016-11-02/...
/data/created=2016-11-03/...
/data/...

If data is partitioned by more than one value, the partitions must appear in the same order in all directory paths in the glob. In this example, creation date comes before region and must do so consistently.

COPY syntax

To read values from partitioned files, use wildcards (globs) in the COPY path and the PARTITION COLUMNS option:

=> CREATE EXTERNAL TABLE records (id int, name varchar(50), created date, region varchar(50))
   AS COPY FROM 'webhdfs:////data/*/*/*'
   PARTITION COLUMNS created, region;

The path includes one wildcard (*) for each level of directory partitioning and then one more for the files. The number of wildcards must be at least one more than the number of partitioned columns. Data files must include at least one real column; you cannot represent data entirely through directory structure.

The first part of each partition directory name must match the column name in the table definition. COPY parses the string after the = for the values. Empty values, for example a directory named created=, are treated as null values. For backward compatibility, a value of __HIVE_DEFAULT_PARTITION__ also means null.

Values that cannot be coerced to the correct types are rejected in the same way that non-coercible values in data are rejected.

Query execution

When executing queries with predicates, Vertica skips subdirectories that do not satisfy the predicate. This process is called partition pruning and it can significantly improve query performance. The following example reads only the partitions for the specified region, for all dates. Although the data is also partitioned by date, the query does not restrict the date.

=> SELECT * FROM records WHERE region='northeast';

To verify that Vertica is pruning partitions, look in the explain plan for a message similar to the following:

files with unmatched partition have been pruned

Partitions on object stores

When reading partitioned data from a file system, COPY can prune unneeded directories at each level, skipping entire directories of files. Object stores (S3, GCS, and Azure) do not have directories, so COPY cannot prune paths in the same way.

By default, when querying data on an object store, COPY first fetches a list of all object names. It then prunes unneeded ones before fetching the objects themselves. This approach is called flat listing. Accessing object stores can have high latency, and flat listing maximizes the number of results returned per request. Flat listing retrieves more object names than are needed but does so in as few calls to the object store as possible. In most cases this strategy is an acceptable tradeoff.

If the number of partitions at each level is large, or if there are many layers of partitioning, fetching all of the object names can have a noticeable performance cost. In this situation, you can set the ObjectStoreGlobStrategy configuration parameter to have COPY use hierarchical listing instead of flat listing. With hierarchical listing, COPY fetches information about objects one level at a time, so that COPY can prune unproductive paths earlier.

Consider data that is partitioned by date and then by region:

/data/created=2016-11-01/region=northeast/*
/data/created=2016-11-01/region=central/*
/data/created=2016-11-01/region=southeast/*
/data/created=2016-11-01/...
/data/created=2016-11-02/region=northeast/*
/data/created=2016-11-02/region=central/*
/data/created=2016-11-02/region=southeast/*
/data/created=2016-11-02/...
/data/created=2016-11-03/...
/data/...

Suppose a query needs only created=2016-11-02 and region=central. With the flat strategy, COPY retrieves the names of all objects in all dates and all regions before pruning most of them. With the hierarchical strategy, COPY first fetches the names of all of the created values, without expanding the paths further. In a second call, it then fetches the names of all of the region values below only that date, ignoring other paths. Finally, it fetches the names of the objects only below that path.

Creating partition structures

To create partitioned file structures, you can use Hive or the file exporters. For information about using Hive, see Hive primer for Vertica integration.

You can create partitions for columns of any simple data type. As a best practice, however, you should avoid partitioning columns with BOOLEAN, FLOAT, and NUMERIC types.