EXPORT TO PARQUET

Exports a table, columns from a table, or query results to files in the Parquet format.

Exports a table, columns from a table, or query results to files in the Parquet format.

You can use an OVER() clause to partition the data before export. You can partition data instead of or in addition to exporting the column data. Partitioning data can improve query performance by enabling partition pruning. See Partitioned file paths.

There are some limitations on the queries you can use in an export statement. See Query Restrictions.

You can export data stored in Vertica in ROS format and data from external tables.

This statement returns the number of rows written and logs information about exported files in a system table. See Monitoring exports.

During an export to HDFS or an NFS mount point, Vertica writes files to a temporary directory in the same location as the destination and renames the directory when the export is complete. Do not attempt to use the files in the temporary directory. During an export to S3, GCS, or Azure, Vertica writes files directly to the destination path, so you must wait for the export to finish before reading the files. For more information, see Exporting to object stores.

After you export data, you can use the GET_METADATA function to inspect the results.

Syntax


EXPORT [ /*+LABEL (label-string)*/ ] TO PARQUET 
   ( directory='path'[, param=value[,...] ] )
   [ OVER (over-clause ) ] AS SELECT query-expression

Arguments

LABEL
Assigns a label to a statement to identify it for profiling and debugging.
over-clause
Specifies how to partition table data using PARTITION BY. Within partitions you can sort using ORDER BY. See SQL analytics. This clause may contain column references but not expressions.

If you partition data, Vertica creates a partition directory structure, transforming column names to lowercase. See Partitioned file paths for a description of the directory structure. If you use the fileName parameter, you cannot use partitioning.

If you omit this clause, Vertica optimizes for maximum parallelism.

query-expression
Specifies the data to export. See Query Restrictions for important limitations.

Parameters

directory

The destination directory for the output files. The current user must have permission to write it. The destination can be on any of the following file systems:

See also: ifDirExists.

ifDirExists

What to do if directory already exists, one of:

  • fail (default)

  • overwrite: replace the entire directory

  • append: export new files into the existing directory

If you specify overwrite for an export to an object store, the existing directory is deleted recursively at the beginning of the operation and is not restored if the operation fails. Be careful not to export to a directory containing data you want to keep. For an export to a Linux file system or HDFS, the directory is only overwritten if the export succeeds.

Do not do concurrent exports to the same directory. In particular, if you do so with a value of overwrite, all operations appear to succeed, but the results are incorrect.

When using append, be careful to use the same table schema. Otherwise, queries of external tables using this data path could fail.

filename

If specified, all output is written to a single file of this name in the location specified by directory. While the query can be processed by multiple nodes, only a single node generates the output data. The fileSizeMB parameter is ignored, and the query cannot use partitioning in the OVER() clause.

compression
Column compression type, one of:
  • Snappy

  • GZIP

  • Brotli

  • ZSTD

  • Uncompressed

Default: Snappy

rowGroupSizeMB
The uncompressed size of exported row groups, in MB, an integer value between 1 and fileSizeMB , inclusive, or unlimited if fileSizeMB is 0.

The row groups in the exported files are smaller than this value because Parquet files are compressed on write. For best performance when exporting to HDFS, set size to be smaller than the HDFS block size.

Row-group size affects memory consumption during export. An export thread consumes at least double the row-group size. The default value of 512MB is a compromise between writing larger row groups and allowing enough free memory for other Vertica operations. If you perform exports when the database is not otherwise under heavy load, you can improve read performance on the exported data by increasing row-group size on export. However, row groups that span multiple blocks on HDFS decrease read performance by requiring more I/O, so do not set the row-group size to be larger than your HDFS block size.

Default: 512

fileSizeMB

The maximum file size of a single output file. This value is a hint, not a hard limit. A value of 0 specifies no limit. If filename is also specified, fileSizeMB is ignored.

This value affects the size of individual output files, not the total output size. For smaller values, Vertica divides the output into more files; all data is still exported.

Default: 10GB

fileMode

For writes to HDFS only, permission to apply to all exported files. You can specify the value in Unix octal format (such as 665) or user-group-other format—for example, rwxr-xr-x. The value must be formatted as a string even if using the octal format.

Valid octal values range between 0 and 1777, inclusive. See HDFS Permissions in the Apache Hadoop documentation.

When writing files to any destination other than HDFS, this parameter has no effect.

Default: 660, regardless of the value of fs.permissions.umask-mode in hdfs-site.xml.

dirMode

For writes to HDFS only, permission to apply to all exported directories. Values follow the same rules as those for fileMode. Further, you must give the Vertica HDFS user full permission, at least rwx------ or 700.

When writing files to any destination other than HDFS, this parameter has no effect.

Default: 755, regardless of the value of fs.permissions.umask-mode in hdfs-site.xml.

int96AsTimestamp
Boolean, specifies whether to export timestamps as int96 physical type (true) or int64 physical type (false).

Default: true

Privileges

Non-superusers:

  • Source table: SELECT

  • Source table schema: USAGE

  • Destination directory: Write

Query restrictions

You must provide an alias column label for selected column targets that are expressions.

If you partition the output, you cannot specify schema and table names in the SELECT statement. Specify only the column name.

The query can contain only a single outer SELECT statement. For example, you cannot use UNION:

=> EXPORT TO PARQUET(directory = '/mnt/shared_nfs/accounts/rm')
   OVER(PARTITION BY hash)
   AS
   SELECT 1 as account_id, '{}' as json, 0 hash
   UNION ALL
   SELECT 2 as account_id, '{}' as json, 1 hash;
ERROR 8975:  Only a single outer SELECT statement is supported
HINT:  Please use a subquery for multiple outer SELECT statements

Instead, rewrite the query to use a subquery:

=> EXPORT TO PARQUET(directory = '/mnt/shared_nfs/accounts/rm')
   OVER(PARTITION BY hash)
   AS
   SELECT
    account_id,
    json
   FROM
   (
     SELECT 1 as account_id, '{}' as json, 0 hash
     UNION ALL
     SELECT 2 as account_id, '{}' as json, 1 hash
   ) a;
 Rows Exported
---------------
             2
(1 row)

To use composite statements such as UNION, INTERSECT, and EXCEPT, rewrite them as subqueries.

Data types

EXPORT TO PARQUET converts Vertica data types to Hive data types as shown in the following table.

Vertica Data Type Hive Data Type
INTEGER, BIGINT BIGINT
FLOAT, DECIMAL, SMALLINT, TINYINT, CHAR, BOOLEAN Corresponding Hive type
VARCHAR, LONG VARCHAR VARCHAR (max 64KB) or STRING (can be read as either)
BINARY, VARBINARY, LONG VARBINARY BINARY
DATE DATE if supported by your version of Hive, otherwise INT96 (can be read as TIMESTAMP)
TIMESTAMP, TIMESTAMPTZ TIMESTAMP. Vertica does not convert TIMESTAMP values to UTC. To avoid problems arising from time zones, use TIMESTAMPTZ instead of TIMESTAMP.
TIME, TIMEZ, INTERVAL Not supported
ARRAY ARRAY
SET Not supported
ROW STRUCT

Decimal precision must be <= 38.

The exported Hive types might not be identical to the Vertica types. For example, a Vertica INT is exported as a Hive BIGINT. When defining Hive external tables to read exported data, you might have to adjust column definitions.

This operation exports raw Flex columns as binary data.

Output

The export operation always creates (or appends to) an output directory, even if all output is written to a single file or the query produces zero rows.

Output file names follow the pattern: prefix-nodename-threadId[-sequenceNumber].parquet. prefix is typically an 8-character hash, but can be longer if the export appended to an existing directory. A sequence number is added if an export needs to be broken into pieces to satisfy fileSizeMB.

Column names in partition directories are lowercase.

Files exported to a local file system by any Vertica user are owned by the Vertica superuser. Files exported to HDFS or object stores are owned by the Vertica user who exported the data.

Making concurrent exports to the same output destination is an error and can produce incorrect results.

Exports to the local file system can be to an NFS mount (shared) or to the Linux file system on each node (non-shared). For details, see Exporting to the Linux file system. Exports to non-shared local file systems have the following restrictions:

  • The output directory must not exist on any node.

  • You must have a USER storage location or superuser privileges.

  • You cannot override the permissions mode of 700 for directories and 600 for files.

Exports to object-store file systems are not atomic. Be careful to wait for the export to finish before using the data. For details, see Exporting to object stores.

Examples

The following example demonstrates exporting all columns from theT1 table in the public schema, using GZIP compression.

=> EXPORT TO PARQUET(directory='webhdfs:///user1/data', compression='gzip')
  AS SELECT * FROM public.T1;

The following example demonstrates exporting the results of a query using more than one table.

=> EXPORT TO PARQUET(directory='s3://DataLake/sales_by_region')
   AS SELECT sale.price, sale.date, store.region
   FROM public.sales sale
   JOIN public.vendor store ON sale.distribID = store.ID;

The following example demonstrates partitioning and exporting data. EXPORT TO PARQUET first partitions the data on region and then, within each partition, sorts by store.

=> EXPORT TO PARQUET(directory='gs://DataLake/user2/data')
  OVER(PARTITION BY store.region ORDER BY store.ID)
  AS SELECT sale.price, sale.date, store.ID
  FROM public.sales sale
  JOIN public.vendor store ON sale.distribID = store.ID;

The following example uses an alias column label for a selected column target that is an expression.

=> EXPORT TO PARQUET(directory='webhdfs:///user3/data')
  OVER(ORDER BY col1) AS SELECT col1 + col1 AS A, col2
  FROM public.T3;

The following example sets permissions for the output.

=> EXPORT TO PARQUET(directory='webhdfs:///user1/data',
     fileMode='432', dirMode='rwxrw-r-x')
  AS SELECT * FROM public.T1;