Distributing a load
Vertica can divide the work of loading data among multiple database nodes, taking advantage of parallelism to speed up the operation. How this is done depends on where the data is and what types of parallelism the parsers support.
Vertica distributes a load most effectively when the data to be loaded is in shared storage that all nodes can access. Sometimes, however, the data is available only on specific nodes, and you must name those nodes in the load statement.
Types of load parallelism
Vertica supports several types of parallelism. All built-in parsers support distributed load, and some parsers support apportioned load and cooperative parse. The reference pages for individual parsers list the kinds of load parallelism they support.
Load parallelism is enabled by default but can be disabled. The configuration parameters named in this section are described in General parameters.
Distributed load
Vertica distributes files in a multi-file load to several nodes to load in parallel, instead of loading all of them on a single node. Vertica automatically distributes a load if the files are accessible to all nodes and you do not restrict participating nodes.
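For example, the following multi-file load from shared storage can be distributed across nodes automatically (the paths are illustrative):
=> COPY sales FROM '/shared/data/sales01.dat', '/shared/data/sales02.dat', '/shared/data/sales03.dat';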
Apportioned load and cooperative parse both require an input that can be divided at record boundaries. The difference is that cooperative parse does a sequential scan to find record boundaries, while apportioned load first jumps (seeks) to a given position and then scans. Some formats, like generic XML, do not support seeking.
Apportioned load
In an apportioned load, Vertica divides a single large file or other single source into segments (portions), which it assigns to several nodes to load in parallel. Apportioned load divides the load at planning time, based on available nodes and cores on each node.
To use apportioned load, you must ensure that the source is reachable by all participating database nodes. You typically use apportioned load with distributed file systems.
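For example, a single large file on HDFS is reachable by every node, so Vertica can apportion it among them (the path is illustrative):
=> COPY sales FROM 'hdfs:///data/sales/large_sales.dat' ON ANY NODE;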
Apportioned load is enabled by default for parsers that support it. To disable it, set the EnableApportionLoad configuration parameter to 0.
Cooperative parse
By default, Vertica parses a data source in a single thread on one database node. If a parser supports cooperative parse, the node instead uses multiple threads to parallelize the parse. Cooperative parse divides a load at execution time, based on how threads are scheduled.
Cooperative parse is enabled by default for parsers that support it. To disable it, set the EnableCooperativeParse configuration parameter to 0.
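If you need to disable both kinds of parallelism, one way (a sketch; verify the configuration-parameter syntax for your version) is the SET_CONFIG_PARAMETER meta-function:
=> SELECT SET_CONFIG_PARAMETER('EnableApportionLoad', 0);
=> SELECT SET_CONFIG_PARAMETER('EnableCooperativeParse', 0);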
Loading on specific nodes
You can indicate which node or nodes should parse an input path by using any of the following:
- A node name: ON node
- A set of nodes: ON nodeset (see Specifying distributed file loads)
- ON ANY NODE (default for HDFS and cloud paths)
- ON EACH NODE
Using the ON ANY NODE clause indicates that the source path is available on all of the nodes. If you specify this clause, COPY parses the files from any node in the cluster. If the path contains a glob, COPY expands the glob on the initiator node. ON ANY NODE is the default for all file systems except Linux.
Using the ON nodeset clause indicates that the source file is on all named nodes. If you specify this clause, COPY opens the file and parses it from any node in the set. Be sure that the source file you specify is available and accessible on each applicable cluster node.
ON node and ON ANY NODE load each file once, choosing one node to perform the load. If, instead, each node has different data at the same path and you want to load all of it, use ON EACH NODE:
=> COPY myTable FROM '/local_export/*.dat' ON EACH NODE;
If the path is not valid on all nodes, COPY loads the valid paths and produces a warning. If the path is a shared location, COPY loads it only once as for ON ANY NODE.
Use ON EACH NODE when you want to load from the same path on every node and the files are different. For example, if you have machine-specific data, such as system event logs, or if an operation wrote data on each node individually, you can use this clause to load all of it.
If the data to be loaded is on a client, use COPY FROM LOCAL instead of specifying nodes. All local files are loaded and parsed serially with each COPY statement, so you cannot perform parallel loads with the LOCAL option.
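For example, the following loads and parses a client-side file serially (the path is illustrative):
=> COPY myTable FROM LOCAL '/home/dbadmin/events.dat';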
Specifying distributed file loads
You can direct individual files in a multi-file load to specific nodes:
=> COPY t FROM '/data/file1.dat' ON v_vmart_node0001, '/data/file2.dat' ON v_vmart_node0002;
You can use globbing (wildcard expansion) to specify a group of files with the ON ANY NODE option:
=> COPY t FROM '/data/*.dat' ON ANY NODE;
You can limit the nodes that participate in an apportioned load. Doing so is useful if you need to balance several concurrent loads. Vertica apportions each load individually; it does not account for other loads that might be in progress on those nodes. You can, therefore, potentially speed up your loads by managing apportioning yourself:
=> COPY t FROM '/data/big1.dat' ON (v_vmart_node0001, v_vmart_node0002, v_vmart_node0003),
'/data/big2.dat' ON (v_vmart_node0004, v_vmart_node0005);
You can specify a compression type for each path. However, because file compression is a filter, you cannot use apportioned load for a compressed file.
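For example, the following names a compression type for each path (the file names are illustrative); because the files are compressed, neither input is apportioned:
=> COPY t FROM '/data/file1.dat.gz' GZIP, '/data/file2.dat.gz' GZIP;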
Specifying distributed loads with sources
You can also apportion loads using COPY WITH SOURCE. You can create sources and parsers with the user-defined load (UDL) API. If both the source and parser support apportioned load, and EnableApportionLoad is set, then Vertica attempts to divide the load among nodes.
The following example shows a load that you could apportion:
=> COPY t WITH SOURCE MySource() PARSER MyParser();
The built-in delimited parser supports apportioning, so you can use it with a user-defined source:
=> COPY t WITH SOURCE MySource();
Number of load streams
Although the number of files you can load is not restricted, the optimal number of load streams depends on several factors, including:
- Number of nodes
- Physical and logical schemas
- Host processors
- Memory
- Disk space
Using too many load streams can deplete or reduce system memory required for optimal query processing. See Best practices for managing workload resources for advice on configuring load streams.
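To check how many load streams are active at a given time, you can query the LOAD_STREAMS system table (a monitoring sketch; the available columns can vary by version):
=> SELECT stream_name, table_name, accepted_row_count FROM load_streams WHERE is_executing;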