File export

You might want to export a table or other query results from Vertica, either to share it with other applications or to move lower-priority data from ROS to less-expensive storage. Vertica supports several EXPORT TO statements for different file formats: EXPORT TO PARQUET, EXPORT TO ORC, EXPORT TO JSON, and EXPORT TO DELIMITED.

You can export data to HDFS, S3, Google Cloud Storage (GCS), Azure Blob Storage, or the Linux file system. Be careful to avoid concurrent exports to the same output destination. Doing so is an error on any file system and can produce incorrect results.

Not all exporters support all complex types. See the Data Types section on individual reference pages.

You can export ROS data or data that is readable through external tables. After exporting ROS data, you can drop affected ROS partitions to reclaim storage space.
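
For example, after exporting a range of data you might reclaim its ROS storage with the DROP_PARTITIONS function; a hypothetical call, assuming public.sales is partitioned by year:

=> SELECT DROP_PARTITIONS('public.sales', '2016', '2016');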

After exporting data, you can define external tables to read that data in Vertica. Parquet and ORC are columnar formats that Vertica can take advantage of in query optimization, so consider using one of those formats for external tables. See Working with external data.
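
For example, a minimal sketch of an external table over exported Parquet files; the table name, column definitions, and path are illustrative:

=> CREATE EXTERNAL TABLE sales_archive (transactionID INT, price FLOAT)
   AS COPY FROM 'webhdfs:///data/sales_data/*.parquet' PARQUET;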

1 - Syntax

EXPORT TO PARQUET, EXPORT TO ORC, EXPORT TO JSON, and EXPORT TO DELIMITED have the same general syntax.

Use the EXPORT statements to export data specified by a SELECT statement, as in the following example:

=> EXPORT TO PARQUET(directory='webhdfs:///data/sales_data')
   AS SELECT * FROM public.sales;
 Rows Exported
---------------
         14336
(1 row)

The directory parameter specifies where to write the files and is required. You must have permission to write to the output directory. You can use the ifDirExists parameter to specify what to do if the directory already exists:

  • fail (default)

  • overwrite: replace the entire directory

  • append: export new files into the existing directory
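
For example, a sketch of a second export that appends new files to the directory created by the first example; the predicate is illustrative:

=> EXPORT TO PARQUET(directory='webhdfs:///data/sales_data', ifDirExists='append')
   AS SELECT * FROM public.sales WHERE date > '2022-01-01';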

You can export to HDFS, S3, GCS, Azure, or the Linux file system. For additional considerations specific to object stores, see Exporting to object stores. If you export to the local Linux file system (not a shared NFS mount) you must take some additional steps; see Exporting to the Linux file system.

You can use EXPORT statements to write queries across multiple tables in Vertica and export the results. With this approach you can take advantage of powerful, fast query execution in Vertica while making the results available to other clients:

=> EXPORT TO ORC(directory='webhdfs:///data/sales_by_region')
   AS SELECT sale.price, sale.date, store.region
   FROM public.sales sale
   JOIN public.vendor store ON sale.distribID = store.ID;
 Rows Exported
---------------
         23301
(1 row)

Restrictions

  • If you perform more than one concurrent export to the same directory, only one succeeds. If you perform concurrent exports with ifDirExists=overwrite, the results will be incorrect.

  • Exports cannot be run in fenced mode. If the ForceUDxFencedMode configuration parameter is set to true to support other UDxs, you can set it to false at the session level before performing the export.
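
For example, a sketch of disabling fenced mode for the current session before exporting, assuming ForceUDxFencedMode is enabled globally and accepts 0 for false:

=> ALTER SESSION SET ForceUDxFencedMode = 0;
=> EXPORT TO PARQUET(directory='webhdfs:///data/sales_data')
   AS SELECT * FROM public.sales;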

Data types

All exporters can export scalar types. All except EXPORT TO DELIMITED also support primitive arrays. The JSON and Parquet exporters also support the ARRAY and ROW types in any combination:

=> SELECT * FROM restaurants;
       name        | cuisine |       location_city        |                                     menu
-------------------+---------+----------------------------+------------------------------------------------------------------------------
 Bob's pizzeria    | Italian | ["Cambridge","Pittsburgh"] | [{"item":"cheese pizza","price":null},{"item":"spinach pizza","price":10.5}]
 Bakersfield Tacos | Mexican | ["Pittsburgh"]             | [{"item":"veggie taco","price":9.95},{"item":"steak taco","price":10.95}]
(2 rows)

=> EXPORT TO JSON (directory='/output/json') AS SELECT * FROM restaurants;
 Rows Exported
---------------
             2
(1 row)

=> \! cat /output/json/*.json
{"name":"Bob's pizzeria","cuisine":"Italian","location_city":["Cambridge","Pittsburgh"],"menu":[{"item":"cheese pizza","price":null},{"item":"spinach pizza","price":10.5}]}
{"name":"Bakersfield Tacos","cuisine":"Mexican","location_city":["Pittsburgh"],"menu":[{"item":"veggie taco","price":9.95},{"item":"steak taco","price":10.95}]}

Query

EXPORT statements rewrite the query you specify, because the export is done by a user-defined transform function (UDTF). Because of this rewrite, there are some restrictions on the query you supply.

The query can contain only a single outer SELECT statement. For example, you cannot use UNION:

=> EXPORT TO PARQUET(directory = '/mnt/shared_nfs/accounts/rm')
   OVER(PARTITION BY hash)
   AS
   SELECT 1 as account_id, '{}' as json, 0 hash
   UNION ALL
   SELECT 2 as account_id, '{}' as json, 1 hash;
ERROR 8975:  Only a single outer SELECT statement is supported
HINT:  Please use a subquery for multiple outer SELECT statements

Instead, rewrite the query to use a subquery:

=> EXPORT TO PARQUET(directory = '/mnt/shared_nfs/accounts/rm')
   OVER(PARTITION BY hash)
   AS
   SELECT
    account_id,
    json
   FROM
   (
     SELECT 1 as account_id, '{}' as json, 0 hash
     UNION ALL
     SELECT 2 as account_id, '{}' as json, 1 hash
   ) a;
 Rows Exported
---------------
             2
(1 row)

To use composite statements such as UNION, INTERSECT, and EXCEPT, rewrite them as subqueries.

When exporting data you can use an OVER() clause to partition and sort the data as described in Partitioning and sorting data. Partitioning and sorting can improve query performance. If you partition the data, you cannot specify schema and table names in the SELECT statement. In this case, specify only the column name.

All exporters have a required directory parameter. All allow you to specify a compression type, and each exporter has format-specific parameters. See the parameter descriptions on the individual reference pages: EXPORT TO PARQUET, EXPORT TO ORC, EXPORT TO JSON, and EXPORT TO DELIMITED.
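
For example, a sketch of exporting compressed files, assuming the exporter's compression parameter and that GZIP is among the supported codecs (see EXPORT TO PARQUET for the exact values):

=> EXPORT TO PARQUET(directory='webhdfs:///data/sales_gzip', compression='gzip')
   AS SELECT * FROM public.sales;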

2 - Partitioning and sorting data

When exporting, you can use the optional OVER clause in the SELECT statement to specify how to partition and/or sort the exported data. Partitioning reduces the sizes of the output data files and can improve performance when Vertica queries external tables containing this data. (See Partitioned data.) If you do not specify how to partition the data, Vertica optimizes the export for maximum parallelism.

To specify partition columns, use PARTITION BY in the OVER clause, as in the following example:

=> EXPORT TO PARQUET(directory = 'webhdfs:///data/export')
  OVER(PARTITION BY date) AS SELECT transactionID, price FROM public.sales;
 Rows Exported
---------------
         28337
(1 row)

You can both partition by a column and include that column in the SELECT clause. Including the column allows you to sort it. Vertica still takes advantage of the partitioning during query execution.

You can sort values within a partition for a further performance improvement. Sort table columns based on the likelihood of their occurrence in query predicates; columns that most frequently occur in comparison or range predicates should be sorted first. You can sort values within each partition using ORDER BY in the OVER clause:

=> EXPORT TO PARQUET(directory = 'webhdfs:///data/export')
  OVER(PARTITION BY date ORDER BY price) AS SELECT date, price FROM public.sales;
 Rows Exported
---------------
         28337
(1 row)

You can use ORDER BY even without partitioning. Storing data in sorted order can improve data access and predicate evaluation performance.
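
For example, a sketch of a sorted, unpartitioned export; the directory is illustrative:

=> EXPORT TO PARQUET(directory = 'webhdfs:///data/export_sorted')
  OVER(ORDER BY price) AS SELECT date, price FROM public.sales;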

Targets in the OVER clause must be column references; they cannot be expressions. For more information about OVER, see SQL analytics.

If you are exporting data to a local file system, you might want to force a single node to write all of the files. To do so, use an empty OVER clause.
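
For example, a sketch of a single-node export to a local path; the directory is illustrative and must not already exist:

=> EXPORT TO PARQUET(directory='/data/sales_export')
   OVER() AS SELECT * FROM public.sales;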

3 - Exporting to object stores

Object-store file systems (S3, Google Cloud Storage, and Azure Blob Storage) have some differences from other file systems that affect data export. You must set some additional configuration parameters for authentication and region, and there are some restrictions on the output.

URI formats and configuration parameters are described on the reference pages for the S3, Google Cloud Storage, and Azure Blob Storage file systems.

Configuration parameters affect all access to the corresponding object stores, both reads and writes. Instead of setting them globally, you can limit the effects of your settings by setting them at the session level before exporting data.
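
For example, a sketch of setting S3 parameters for the current session before exporting; the region, credentials, and bucket are placeholders:

=> ALTER SESSION SET AWSRegion = 'us-west-1';
=> ALTER SESSION SET AWSAuth = 'access-key-id:secret-access-key';
=> EXPORT TO PARQUET(directory='s3://mybucket/sales') AS SELECT * FROM public.sales;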

Overwrites

By default, exports fail if the target directory already exists. You can use the ifDirExists parameter to instead append or overwrite. If you specify overwrite for an export to an object store, the existing directory is deleted recursively at the beginning of the operation and is not restored if the operation fails. Be careful not to export to a directory containing data you want to keep.
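
For example, a sketch of replacing an earlier export in an object store; the bucket and path are placeholders:

=> EXPORT TO PARQUET(directory='s3://mybucket/sales', ifDirExists='overwrite')
   AS SELECT * FROM public.sales;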

Output restrictions

Object-store file systems do not support renaming files in place; they implement a rename as a copy followed by a delete. On other file systems, the exporters support atomicity by writing output into a temporary directory and renaming it when complete. Such an approach is impractical for object stores, so the exporters write directly to the destination path. It is therefore possible to begin reading the exported data before the export has finished, which could lead to errors. Be careful to wait for the export to finish before using the data.

Vertica does not support simultaneous exports to the same path in object stores. The results are undefined.

S3 limits the size of individual objects to 5TB. You might need to divide very large exports.

4 - Exporting to the Linux file system

Exports to the local file system can be to an NFS mount (shared) or to the Linux file system on each node (non-shared). If you export to an NFS mount on the Linux file system, the exporters behave the same way as for any other shared location: all of the exported files are written to the same (shared) destination.

If you export to the local Linux file system, each Vertica node writes its portion of the export to its local (non-shared) file system. Exports to non-shared local file systems have the following restrictions:

  • The output directory must not exist on any node.

  • You must have a USER storage location or superuser privileges.

  • You cannot override the permissions mode of 700 for directories and 600 for files.
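
For example, a sketch of an export in which each node writes its portion of public.sales to /data/sales on its own local file system (the same path used by the external table example below):

=> EXPORT TO PARQUET(directory='/data/sales')
   AS SELECT * FROM public.sales;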

To define an external table to read data exported to a non-shared local file system, use COPY...ON EACH NODE, as in the following example:

=> CREATE EXTERNAL TABLE sales (...)
   AS COPY FROM '/data/sales/*.parquet' ON EACH NODE PARQUET;

The path is the same path that you specified in the export statement and is the same on all nodes.

If you omit the ON clause and just specify a path, COPY only loads the data it finds on the initiator node.

5 - Monitoring exports

You can review information about exports, including numbers of row groups, file sizes, and file names.

The export statements are UDxs. The UDX_EVENTS system table records events logged during UDx execution, including timestamps, node names, and session IDs. This table contains a column (__RAW__), which holds a VMap of whatever additional data an individual UDx logged. The export statements log details about the exported files in this table. While you can work with this table directly and materialize values from the VMap column, you might prefer to define a view to simplify your access.

The following statement defines a view showing only the events from EXPORT TO PARQUET, materializing the VMap values.

=> CREATE VIEW parquet_export_events AS
   SELECT
    report_time,
    node_name,
    session_id,
    user_id,
    user_name,
    transaction_id,
    statement_id,
    request_id,
    udx_name,
    event_type,
    file,
    created,
    closed,
    rows,
    row_groups,
    size_mb,
    row_group_file,
    row_group_n,
    row_group_rows,
    row_group_bytes,
    row_group_expected_size_mb, 
    row_groups_exceeding_target
   FROM
    v_monitor.udx_events
   WHERE
    udx_name ilike 'ParquetExport%';

The exporters report the following UDx-specific columns:

Column Name                 | Data Type   | Description
EVENT_TYPE                  | VARCHAR     | What was written. All exporters report FILE events. In addition, the Parquet exporter reports ROW_GROUP events. Row groups are only logged if debugging is enabled.
FILE                        | VARCHAR     | For EVENT_TYPE of FILE, name of the output file.
CREATED                     | TIMESTAMPTZ | For EVENT_TYPE of FILE, when the file was created.
CLOSED                      | TIMESTAMPTZ | For EVENT_TYPE of FILE, when the file was closed after writing.
ROWS                        | INTEGER     | For EVENT_TYPE of FILE, the total number of rows in the file.
ROW_GROUPS                  | INTEGER     | For EVENT_TYPE of FILE, the number of row groups, for formats that use them (Parquet and ORC).
SIZE_MB                     | FLOAT       | For EVENT_TYPE of FILE, file size.
ROW_GROUP_FILE              | VARCHAR     | For EVENT_TYPE of ROW_GROUP, name of the file containing the row group.
ROW_GROUP_N                 | INTEGER     | For EVENT_TYPE of ROW_GROUP, row group number.
ROW_GROUP_ROWS              | INTEGER     | For EVENT_TYPE of ROW_GROUP, number of rows written in this row group.
ROW_GROUP_BYTES             | INTEGER     | For EVENT_TYPE of ROW_GROUP, number of bytes written in this row group.
ROW_GROUP_EXPECTED_SIZE_MB  | INTEGER     | For Parquet exports and an EVENT_TYPE of FILE, target size of each row group.
ROW_GROUPS_EXCEEDING_TARGET | INTEGER     | For Parquet exports and an EVENT_TYPE of FILE, number of row groups exceeding the target size.

The following example shows the results of a single export.

=> SELECT file,rows,row_groups,size_mb FROM PARQUET_EXPORT_EVENTS;
                                  file                                 | rows  | row_groups | size_mb
-----------------------------------------------------------------------+-------+------------+----------
 /data/outgZxN3irt/450c4213-v_vmart_node0001-139770732459776-0.parquet | 29696 | 1          | 0.667203
 /data/outgZxN3irt/9df1c797-v_vmart_node0001-139770860660480-0.parquet | 29364 | 1          | 0.660922
(2 rows)

In this table, the output directory name (/data/out) is appended with a generated string (gZxN3irt). For exports to HDFS or to local file systems (including NFS), EXPORT TO PARQUET first writes data into a scratch directory and then renames it at the end of the operation. The events are logged during export and so show the temporary name. Some output destinations, such as AWS S3, do not support rename operations, so in those cases this table does not show generated names.

ROW_GROUP events are only recorded when debugging is enabled:

=> SELECT SET_DEBUG_LOG_ALLNODES('UDX', 'ALL');
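
After enabling debug logging and running an export, you can query for the row-group details; a sketch assuming the parquet_export_events view defined above and that these events use the event type ROW_GROUP:

=> SELECT row_group_file, row_group_n, row_group_rows, row_group_bytes
   FROM parquet_export_events
   WHERE event_type = 'ROW_GROUP';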