Specifying where to load data from
Each COPY statement requires either a FROM clause to indicate the location of the file or files being loaded or a SOURCE clause when using a user-defined source. For more about the SOURCE clause, see Parameters. This section covers use of the FROM clause.
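For example, a load from a file path uses the FROM clause, while a load through a user-defined source names the source function instead. A minimal sketch, assuming a table named sales and a hypothetical user-defined source MyUDSource():
=> COPY sales FROM '/data/sales.dat';
=> COPY sales SOURCE MyUDSource();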
Loading from a specific path
Use the path-to-data argument to indicate the location of one or more files to load. You can load data from the following locations:
- The local file system.
- NFS, through a mount point on the local file system.
- S3, Google Cloud Storage (GCS), and Azure Blob Storage object stores. If a data file you want to read resides in an S3 bucket that uses server-side encryption, you might need to set additional parameters. (An example load from an object store appears after this list.)
- HDFS file system. If a data file you want to read resides on an HDFS cluster that uses Kerberos authentication, Vertica uses the current user's principal, session doAs user, or session delegation token. See Accessing kerberized HDFS data for more information about these options.
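As a sketch of loading from an object store, the following statement reads all files matching a glob under an S3 prefix; the bucket name is hypothetical, and your environment might also require authentication or encryption parameters to be set first:
=> COPY sales FROM 's3://examplebucket/data/sales/*.dat' ON ANY NODE;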
If the path is a URL, you must use URL encoding for the '%' character. Otherwise, URL encoding ('%NN' where NN is a two-digit hexadecimal number) is permitted but not required.
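For example, to load a hypothetical file named sales%2017.dat over webhdfs, encode the '%' character itself as %25:
=> COPY sales FROM 'webhdfs:///data/sales%252017.dat';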
When copying from the local file system, the COPY statement expects to find files in the same location on every node that participates in the query. If you are using NFS, then you can create an NFS mount point on each node. Doing so allows all database nodes to participate in the load for better performance without requiring files to be copied to all nodes.
Treat NFS mount points as local files in paths:
=> COPY sales FROM '/mount/sales.dat' ON ANY NODE;
You can specify more than one path in the same COPY statement, as in the following example.
=> COPY myTable FROM 'webhdfs:///data/sales/01/*.dat', 'webhdfs:///data/sales/02/*.dat',
'webhdfs:///data/sales/historical.dat';
For files in HDFS, you can specify a name service (hadoopNS). In this example, the COPY statement is part of the definition of an external table:
=> CREATE EXTERNAL TABLE users (id INT, name VARCHAR(20))
AS COPY FROM 'webhdfs://hadoopNS/data/users.csv';
If path-to-data resolves to a storage location on a local file system, and the user invoking COPY is not a superuser, these permissions are required:
- The storage location must have been created with the USER option (see CREATE LOCATION).
- The user must already have been granted READ access to the storage location where the file or files exist, as described in GRANT (storage location). Both steps are sketched in the example below.
Vertica prevents symbolic links from allowing unauthorized access.
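A minimal sketch of these two steps; the path and user name are assumptions for illustration only:
=> CREATE LOCATION '/home/dbadmin/userdata' ALL NODES USAGE 'USER';
=> GRANT READ ON LOCATION '/home/dbadmin/userdata' TO load_user;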
Loading with wildcards (glob)
You can invoke COPY for a large number of files in a shared directory with a single statement such as:
=> COPY myTable FROM '/data/manyfiles/*.dat' ON ANY NODE;
The glob (*) must indicate a set of files, not directories. The following statement fails if /data/manyfiles contains any subdirectories:
=> COPY myTable FROM '/data/manyfiles/*' ON ANY NODE;
If /data/manyfiles contains subdirectories that contain files, you can use a glob in the path for the subdirectories:
=> COPY myTable FROM '/data/manyfiles/*/*' ON ANY NODE;
Sometimes a directory structure partitions the data, as in the following example:
path/created=2016-11-01/*
path/created=2016-11-02/*
path/created=2016-11-03/*
path/...
You still use a glob to read the data, but you can also read the partition values themselves (creation date, in this case) as table columns. See Partitioned file paths.
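As a rough sketch of reading those partition values as a column, the following external table definition treats created as a partition column; the table and column names are assumptions, and Partitioned file paths covers the exact syntax:
=> CREATE EXTERNAL TABLE records (id INT, created DATE)
   AS COPY FROM 'webhdfs:///path/*/*' PARTITION COLUMNS created;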
Using a wildcard with the ON ANY NODE clause expands the file list on the initiator node. COPY then distributes the individual files among all nodes, so that the load is spread evenly across the entire cluster. ON ANY NODE is the default for all file systems other than Linux.
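By contrast, you can restrict a load to a single node by naming that node instead of ANY NODE. A minimal sketch, with a hypothetical node name:
=> COPY myTable FROM '/data/manyfiles/*.dat' ON v_vmart_node0001;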
Loading from a Vertica client
Use COPY LOCAL to load files on a client system to the Vertica database. For example, to copy a GZIP file from your local client, use a command such as this:
=> COPY store.store_dimension FROM LOCAL '/usr/files/my_data/input_file' GZIP;
You can use a comma-separated list to load multiple files of the same compression type. COPY LOCAL then concatenates the files into a single file, so you cannot combine files with different compression types in the list. When listing multiple files, be sure to specify the type of every input file, such as BZIP, as shown:
=> COPY simple_table FROM LOCAL 'input_file.bz' BZIP, 'input_file.bz' BZIP;
You can also load data from STDIN on a local client, as follows:
=> COPY simple_table FROM LOCAL STDIN;
Loading from Kafka or Spark
For information about streaming data from Kafka, see Apache Kafka integration.
For information about using Vertica with Spark data, see Apache Spark integration.
Loading data from an IDOL CFS client
The IDOL Connector Framework Server (CFS) VerticaIndexer feature lets CFS clients connect to your Vertica database using ODBC. Once connected, CFS uses COPY...FROM LOCAL statements to load IDOL document metadata into an existing flex table. For more information, see the Using flex tables for IDOL data section in Using Flex Tables.