Apportioned load
A parser can use more than one database node to load a single input source in parallel. This approach is referred to as apportioned load. Among the parsers built into Vertica, the default (delimited) parser supports apportioned load.
Apportioned load, like cooperative parse, requires an input that can be divided at record boundaries. The difference is that cooperative parse does a sequential scan to find record boundaries, while apportioned load first jumps (seeks) to a given position and then scans. Some formats, like generic XML, do not support seeking.
To use apportioned load, you must ensure that the source is reachable by all participating database nodes. You typically use apportioned load with distributed file systems.
It is possible for a parser to not support apportioned load directly but to have a chunker that supports apportioning.
You can use apportioned load and cooperative parse independently or together. See Combining cooperative parse and apportioned load.
How Vertica apportions a load
If both the parser and its source support apportioning, then you can specify that a single input is to be distributed to multiple database nodes for loading. The SourceFactory
breaks the input into portions and assigns them to execution nodes. Each Portion
consists of an offset into the input and a size. Vertica distributes the portions and their parameters to the execution nodes. A source factory running on each node produces a UDSource
for the given portion.
The UDParser
first determines where to start parsing:
-
If the portion is the first one in the input, the parser advances to the offset and begins parsing.
-
If the portion is not the first, the parser advances to the offset and then scans until it finds the end of a record. Because records can break across portions, parsing begins after the first record-end encountered.
The parser must complete a record, which might require it to read past the end of the portion. The parser is responsible for parsing all records that begin in the assigned portion, regardless of where they end. Most of this work occurs within the process()
method of the parser.
Sometimes, a portion contains nothing to be parsed by its assigned node. For example, suppose you have a record that begins in portion 1, runs through all of portion 2, and ends in portion 3. The parser assigned to portion 1 parses the record, and the parser assigned to portion 3 starts after that record. The parser assigned to portion 2, however, has no record starting within its portion.
If the load also uses Cooperative parse, then after apportioning the load and before parsing, Vertica divides portions into chunks for parallel loading.
Implementing apportioned load
To implement apportioned load, perform the following actions in the source, the parser, and their factories.
In your SourceFactory
subclass:
-
Implement
isSourceApportionable()
and returntrue
. -
Implement
plan()
to determine portion size, designate portions, and assign portions to execution nodes. To assign portions to particular executors, pass the information using the parameter writer on the plan context (PlanContext::getWriter()
). -
Implement
prepareUDSources()
. Vertica calls this method on each execution node with the plan context created by the factory. This method returns theUDSource
instances to be used for this node's assigned portions. -
If sources can take advantage of parallelism, you can implement
getDesiredThreads()
to request a number of threads for each source. See SourceFactory class for more information about this method.
In your UDSource
subclass, implement process()
as you would for any other source, using the assigned portion. You can retrieve this portion with getPortion()
.
In your ParserFactory
subclass:
-
Implement
isParserApportionable()
and returntrue
. -
If your parser uses a
UDChunker
that supports apportioned load, implementisChunkerApportionable()
.
In your UDParser
subclass:
-
Write your
UDParser
subclass to operate on portions rather than whole sources. You can do so by handling the stream statesPORTION_START
andPORTION_END
, or by using theContinuousUDParser
API. Your parser must scan for the beginning of the portion, find the first record boundary after that position, and parse to the end of the last record beginning in that portion. Be aware that this behavior might require that the parser read beyond the end of the portion. -
Handle the special case of a portion containing no record start by returning without writing any output.
In your UDChunker
subclass, implement alignPortion()
. See Aligning Portions.
Example
The SDK provides a C++ example of apportioned load in the ApportionLoadFunctions
directory:
-
FilePortionSource
is a subclass ofUDSource
. -
DelimFilePortionParser
is a subclass ofContinuousUDParser
.
Use these classes together. You could also use FilePortionSource
with the built-in delimited parser.
The following example shows how you can load the libraries and create the functions in the database:
=> CREATE LIBRARY FilePortionSourceLib as '/home/dbadmin/FP.so';
=> CREATE LIBRARY DelimFilePortionParserLib as '/home/dbadmin/Delim.so';
=> CREATE SOURCE FilePortionSource AS
LANGUAGE 'C++' NAME 'FilePortionSourceFactory' LIBRARY FilePortionSourceLib;
=> CREATE PARSER DelimFilePortionParser AS
LANGUAGE 'C++' NAME 'DelimFilePortionParserFactory' LIBRARY DelimFilePortionParserLib;
The following example shows how you can use the source and parser to load data:
=> COPY t WITH SOURCE FilePortionSource(file='g1/*.dat') PARSER DelimFilePortionParser(delimiter = '|',
record_terminator = '~');