Partitioned data

Data files are sometimes partitioned in the file system using the directory structure. Partitioning allows you to move values out of the raw data, where they have to be included for each row, and into the directory structure. Partitioning can improve query performance by allowing entire directories to be skipped. Partitioning can also save disk space if partition values are not repeated in the data.

There are two important aspects to working with partitioned data:

  • Telling COPY how to extract column values from paths.
  • Enabling partition pruning, a performance optimization.

Partitions as directory structure

Partitioning is reflected in the directory structure of the data files. In Hive-style partitioning, each directory name contains the column name and the value:

/data/created=2016-11-01/*
/data/created=2016-11-02/*
/data/created=2016-11-03/*
/data/...

The files in the globs do not contain a created column because this information is expressed through the file system. Vertica can read the partitioned values (dates, in this example) into a table column (created, in this example).

Data can be partitioned by more than one value:

/data/created=2016-11-01/region=northeast/*
/data/created=2016-11-01/region=central/*
/data/created=2016-11-01/region=southeast/*
/data/created=2016-11-01/...
/data/created=2016-11-02/region=northeast/*
/data/created=2016-11-02/region=central/*
/data/created=2016-11-02/region=southeast/*
/data/created=2016-11-02/...
/data/created=2016-11-03/...
/data/...

Another way to partition data is to use the values directly, without the column names:

/data/2017/01/*
/data/2017/02/*
/data/2017/03/*
/data/...

In this case, you must use an expression in the COPY statement to tell Vertica how to extract column values from the paths, for example to read the first directory level as a year and the second as a month.

The syntax for loading partitioned data differs for Hive-style and other partitioning schemes. For both schemes, list partition columns in the PARTITION COLUMNS clause. Doing so allows Vertica to improve query performance when predicates involve partition columns by eliminating irrelevant partitions from consideration.

COPY syntax for Hive-style partitions

Consider the following partition layout:

/data/created=2016-11-01/region=northeast/*
/data/created=2016-11-01/region=central/*
/data/created=2016-11-01/...
/data/created=2016-11-02/region=northeast/*
/data/created=2016-11-02/region=central/*
/data/created=2016-11-02/...
/data/...

To load values for the created and region columns, use wildcards (globs) in the COPY path and the PARTITION COLUMNS option:

=> CREATE EXTERNAL TABLE records (id int, name varchar(50), created date, region varchar(50))
   AS COPY FROM 'webhdfs:///data/*/*/*'
   PARTITION COLUMNS created, region;

The path includes one wildcard (*) for each level of directory partitioning and one more for the files. The number of wildcards must be at least one more than the number of partition columns. Data files must include at least one real column; you cannot represent data entirely through the directory structure.

The first part of each partition directory name must match the column name in the table definition. COPY parses the string after the = for the values. Empty values, for example a directory named created=, are treated as null values. For backward compatibility, a value of __HIVE_DEFAULT_PARTITION__ also means null.

Partition values that cannot be coerced to the column's type are rejected, in the same way that non-coercible values in the data itself are rejected.

The PARTITION COLUMNS option lists the partition columns in order. If data is partitioned by more than one value, the partitions must appear in the same order in every directory path in the glob, even though the directory names contain the column names.

COPY syntax for other partition schemes

Consider the following partition layout:

/data/reviews/2023/07/*.json
/data/reviews/2023/08/*.json
/data/reviews/2023/09/*.json
/data/reviews/...

Suppose you want to load this data into the following table:

=> CREATE TABLE reviews(review_id VARCHAR, stars FLOAT, year INT, month INT);

You can use the CURRENT_LOAD_SOURCE function to read values out of the path. The function takes an integer argument, which is the position in a /-delimited path. For example, in the path /data/reviews/2023/09/*.json, CURRENT_LOAD_SOURCE(3) returns 2023.

The following statement loads this data:

=> COPY reviews
        (review_id, stars,
         year AS CURRENT_LOAD_SOURCE(3)::INT,
         month AS CURRENT_LOAD_SOURCE(4)::INT)
   FROM '/data/reviews/*/*/*.json' PARSER FJSONPARSER();

The same approach works for external tables. For external tables, use the PARTITION COLUMNS clause to tell Vertica which columns can benefit from partition pruning, a performance optimization:

=> CREATE EXTERNAL TABLE reviews
     (review_id VARCHAR, stars FLOAT, year INT, month INT)
   AS COPY (review_id, stars,
            year AS CURRENT_LOAD_SOURCE(3)::INT,
            month AS CURRENT_LOAD_SOURCE(4)::INT)
   FROM '/data/reviews/*/*/*.json'
   PARTITION COLUMNS year, month
   PARSER FJSONPARSER();

Query execution

When executing queries with predicates, Vertica skips partitioned subdirectories that do not satisfy the predicate. This process is called partition pruning and it can significantly improve query performance. The following example reads only the partitions for the specified region, for all dates. Although the data is also partitioned by date, the query does not restrict the date.

=> SELECT * FROM records WHERE region='northeast';

To verify that Vertica is pruning partitions, look in the QUERY_EVENTS system table for the PARTITION_PATH_PRUNED event type. Alternatively, before running a query, you can use the vsql \o meta-command with grep to search the explain plan for a message stating that unmatched partitions have been pruned:

=> \o | grep -i pruned
=> EXPLAIN SELECT * FROM sales WHERE year = 2022;
 
3[label = "LoadStep: sales
(1 paths with unmatched partition have been pruned from DELIMITED source list)
(year = 2022)
Unc: Integer(8)
Unc: Integer(8)
Unc: Integer(8)", color = "brown", shape = "box"];
 
=> \o
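
You can also confirm pruning after the fact with the QUERY_EVENTS system table. The following query is a sketch; it assumes the event_type and event_description columns, which are present in current versions of that table:

=> SELECT event_type, event_description
   FROM QUERY_EVENTS
   WHERE event_type = 'PARTITION_PATH_PRUNED';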

Partitions on object stores

When reading partitioned data from a file system, COPY can prune unneeded directories at each level, skipping entire directories of files. Object stores (S3, GCS, and Azure) do not have directories, so COPY cannot prune paths in the same way.

By default, when querying data on an object store, COPY first fetches a list of all object names. It then prunes unneeded ones before fetching the objects themselves. This approach is called flat listing. Accessing object stores can have high latency, and flat listing maximizes the number of results returned per request. Flat listing retrieves more object names than are needed but does so in as few calls to the object store as possible. In most cases this strategy is an acceptable tradeoff.

If the number of partitions at each level is large, or if there are many layers of partitioning, fetching all of the object names can have a noticeable performance cost. In this situation, you can set the ObjectStoreGlobStrategy configuration parameter to have COPY use hierarchical listing instead of flat listing. With hierarchical listing, COPY fetches information about objects one level at a time, so that COPY can prune unproductive paths earlier.
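
For example, a statement along the following lines sets the parameter at the database level; the parameter accepts the values Flat (the default) and Hierarchical, and the exact ALTER DATABASE syntax can vary by version:

=> ALTER DATABASE DEFAULT SET PARAMETER ObjectStoreGlobStrategy = 'Hierarchical';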

Consider data that is partitioned by date and then by region:

/data/created=2016-11-01/region=northeast/*
/data/created=2016-11-01/region=central/*
/data/created=2016-11-01/region=southeast/*
/data/created=2016-11-01/...
/data/created=2016-11-02/region=northeast/*
/data/created=2016-11-02/region=central/*
/data/created=2016-11-02/region=southeast/*
/data/created=2016-11-02/...
/data/created=2016-11-03/...
/data/...

Suppose a query needs only created=2016-11-02 and region=central. With the flat strategy, COPY retrieves the names of all objects in all dates and all regions before pruning most of them. With the hierarchical strategy, COPY first fetches the names of all of the created values, without expanding the paths further. In a second call, it then fetches the names of all of the region values below only that date, ignoring other paths. Finally, it fetches the names of the objects only below that path.

Creating partition structures

To create partitioned file structures, you can use Hive or the file exporters. For information about using Hive, see Hive primer for Vertica integration.

You can create partitions for columns of any simple data type. As a best practice, however, you should avoid partitioning columns with BOOLEAN, FLOAT, and NUMERIC types.
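
As an illustration of using the file exporters, the following sketch writes the records table from the earlier example as Hive-style partitioned Parquet files. The s3://mybucket/data path is a placeholder; the exporter writes the partition column values into the directory names rather than into the data files:

=> EXPORT TO PARQUET(directory = 's3://mybucket/data')
   OVER (PARTITION BY created, region)
   AS SELECT * FROM records;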