C++ SDK Documentation  24.2.0
Vertica::UDFilter Class Referenceabstract

Reads input data from a UDSource or another UDFilter and transforms it. More...

Inheritance diagram for Vertica::UDFilter:
Inheritance graph
Collaboration diagram for Vertica::UDFilter:
Collaboration graph

Public Member Functions

void cancelUDX (ServerInterface &srvInterface)
 
virtual void destroy (ServerInterface &srvInterface)
 
virtual void destroy (ServerInterface &srvInterface, SessionParamWriterMap &udSessionParams)
 
bool isCanceled () const
 
virtual StreamState process (ServerInterface &srvInterface, DataBuffer &input, InputState input_state, DataBuffer &output)=0
 
virtual StreamState processWithMetadata (ServerInterface &srvInterface, DataBuffer &input, LengthBuffer &input_lengths, InputState input_state, DataBuffer &output, LengthBuffer &output_lengths)
 
virtual void setup (ServerInterface &srvInterface)
 
virtual bool useSideChannel ()
 

Protected Member Functions

virtual void cancel (ServerInterface &srvInterface)
 

Detailed Description

Reads input data from a UDSource or another UDFilter and transforms it.

For example, a UDFilter might unzip a file, convert UTF-16 to UTF-8, or remove personally identifying information such as social security numbers. UDFilters can be chained, for example unzipping, converting encodings, and then stripping personal information. The first UDFilter in a chain receives its input from a UDSource, and the output of the last one in the chain is sent to a UDParser.

UDFilter is part of the load pipeline. The load pipeline consists of up to one UDSource, any number of UDFilters, and up to one UDParser.

Note that it is UNSAFE to maintain pointers or references to any of these arguments (or any other argument passed by reference into any other function in this API) beyond the scope of the function call in question. For example, do not store a reference to the server interface or the input block on an instance variable. Vertica may free and replace these objects.

Member Function Documentation

◆ cancel()

virtual void Vertica::UDXObject::cancel ( ServerInterface srvInterface)
inlineprotectedvirtualinherited

Cancel callback to be overridden by the UDX. Called when the query running the UDX has been canceled.

Note
  • This method will be invoked at most once per UDX object. Once a UDX object has been canceled, it will never be un-canceled.
  • This method may be called from a separate thread, concurrently with other methods of this UDX object (but never the constructor or destructor). Implementations must be thread-safe with all methods of this UDX.
  • This method will be invoked for either an explicit user cancel, or in the event of an error during query execution.

Referenced by Vertica::UDXObject::cancelUDX().

◆ cancelUDX()

void Vertica::UDXObject::cancelUDX ( ServerInterface srvInterface)
inlineinherited

Cancel callback invoked when the query running the UDX has been canceled. See cancel().

◆ destroy()

virtual void Vertica::UDFilter::destroy ( ServerInterface srvInterface)
inlinevirtual

UDFilter::destroy()

Will be invoked during query execution, after the last time that process() is called on this UDFilter instance for a particular input file.

May optionally be overridden to perform tear-down/destruction.

See UDFilter::setup() for a note about the restartability of UDFilters.

◆ isCanceled()

bool Vertica::UDXObject::isCanceled ( ) const
inlineinherited
Returns
true iff this UDX has been canceled

◆ process()

virtual StreamState Vertica::UDFilter::process ( ServerInterface srvInterface,
DataBuffer input,
InputState  input_state,
DataBuffer output 
)
pure virtual

UDFilter::process()

Reads data from the input stream and emits data to the output stream. Vertica invokes this method repeatedly during query execution until it returns DONE, or until the query is canceled by the user.

Input: a stream of bytes.
Output: a stream of bytes.

Returns
INPUT_NEEDED
OUTPUT_NEEDED
DONE
KEEP_GOING
See Vertica::StreamState for details about return values.

On each invocation, process() is handed some input data and a buffer to write output data to. It is expected to read and process some amount of the input data, write some amount of output data, and return a value that informs Vertica what needs to happen next.

process() must set input.offset to the number of bytes that were successfully read from the input buffer that do not need to be re-consumed by a subsequent invocation of process(). This might not be larger than input.size (input.size is the size of the buffer). If it is set to 0, this indicates that process() cannot process any part of an input buffer of this size, and requires more data per invocation. For example, a block-based decompression algorithm might return 0 if the input buffer does not contain a complete block.

Note that input might contain null bytes if the source file contains null bytes. Additionally, note that input is NOT automatically null-terminated.

If input_state == END_OF_FILE, then the last byte in input is the last byte in the input stream. Returning INPUT_NEEDED does not result in any new input appearing. In this case, process() should return DONE as soon as this operator finishes producing all of the output that it is going to produce.

process() should set output.offset (an output parameter) to the number of bytes that were written to the output buffer. It is common, though not necessary, for this to be the same as output.size (an input parameter). When process() is called, output.offset is uninitialized. To indicate that the buffer is too small to hold a record, process() should set output.offset to 0 and return OUTPUT_NEEDED. Then, process() is called again with a larger buffer.

In general, process() code should assume that buffers, input, and INPUT_NEEDED start at output.buf[output.offset].

As a performance optimization, upstream operators might start processing emitted data (data between output.buf[0] and output.buf[output.offset]) before OUTPUT_NEEDED is returned. For this reason, output.offset must be strictly increasing.

Referenced by processWithMetadata().

◆ processWithMetadata()

virtual StreamState Vertica::UDFilter::processWithMetadata ( ServerInterface srvInterface,
DataBuffer input,
LengthBuffer input_lengths,
InputState  input_state,
DataBuffer output,
LengthBuffer output_lengths 
)
inlinevirtual

UDFilter::processWithMetadata()

Reads data from the input stream and emits data to the output stream, and reads record length metadata from the input_lengths stream and outputs it to the output_lengths stream in the side channel. To implement processWithMetadata(), you must override useSideChannel() to return true. Vertica invokes this method repeatedly during query execution, until it returns DONE or until the query is canceled by the user.

Input: a stream of data bytes, and a stream of bytes containing message length metadata.
Output: a stream of data bytes, and a stream of bytes containing message length metadata.

Returns
INPUT_NEEDED
OUTPUT_NEEDED
DONE
KEEP_GOING
See Vertica::StreamState for details about return values.

On each invocation, processWithMetadata() is handed some input data and a data buffer to write output data to, and record length metadata and a length buffer to write metadata to. It is expected to read and process some amount of the input data and input metadata, write some amount of output data and output metadata, and return a value that informs Vertica what needs to happen next.

For the DataBuffer, processWithMetadata() must set input.offset to the number of bytes that were successfully read from the input buffer that do not need to be re-consumed by a subsequent invocation of processWithMetadata(). This might not be larger than input.size (input.size is the size of DataBuffer). If it is set to 0, this indicates that processWithMetadata() cannot process any part of an input buffer of this size, and requires more data per invocation. For example, a block-based decompression algorithm might return 0 if the input buffer does not contain a complete block.

For the LengthBuffer, processWithMetadata() must set input_lengths.offset to the number of length values that were successfully read from the input_lengths buffer that do not need to be re-consumed by a subsequent invocation of processWithMetadata(). This cannot be larger than input_lengths.size (input_lengths.size is the number of length values that the LengthBuffer can hold). If it is set to 0, this indicates that processWithMetadata() cannot process any part of an input_lengths buffer of this size, and requires more data per invocation.

Note that input might contain null bytes if the source file contains null bytes. Additionally, note that input is NOT automatically null-terminated.

If input_state == END_OF_FILE, then the last byte in input is the last byte in the input stream. Returning INPUT_NEEDED does not result in any new input appearing. In this case, processWithMetadata() should return DONE as soon as this operator has finished producing all output that it can produce.

For the DataBuffer, processWithMetadata() should set output.offset (an output parameter) to the number of bytes that were written to the output buffer. It is common, though not necessary, for this to be the same as output.size (an input parameter). When processWithMetadata() is called, output.offset is uninitialized. To indicate that the buffer is too small to hold a record, processWithMetadata() should set output.offset and output_lengths.offset to 0 and return OUTPUT_NEEDED. Then, processWithMetadata() is called again with a larger buffer.

For the LengthBuffer, processWithMetadata() should set output_lengths.offset to the number of length values that were written to the output_lengths buffer. If output.offset is set to 0, then output_lengths.offset should also be set to 0.

In general, processWithMetadata() code should assume that data buffers and INPUT_NEEDED start at output.buff[output.offset], and length buffers start at output_lengths.buf[output_lengths.offset].

As a performance optimization, upstream operators might start processing emitted data (data between output.buf[0] and output.buf[output.offset] in the DataBuffer, and between output_lengths.buf[0] and output_lengths.buf[output_lengths.offset] in the LengthBuffer) before OUTPUT_NEEDED is returned. For this reason, output.offset and output_lengths.offset must be strictly increasing.

◆ setup()

virtual void Vertica::UDFilter::setup ( ServerInterface srvInterface)
inlinevirtual

UDFilter::setup()

Will be invoked during query execution, prior to the first time that process() is called on this UDFilter instance for a particular input file.

May optionally be overridden to perform setup/initialzation.

Note that UDFilters MUST BE RESTARTABLE! If loading large numbers of files, a given UDFilter may be re-used for multiple files. Vertica follows the worker-pool design pattern: At the start of COPY execution, several Parsers and several Filters are instantiated per node, by calling the corresponding prepare() method multiple times. Each Filter/Parser pair is then internally assigned to an initial Source (UDSource or internal). At that point, setup() is called; then process() is called until it is finished; then destroy() is called. setup() will be called a second time, then process() until it is finished, then destroy(). This repeats until all sources have been read.

◆ useSideChannel()

virtual bool Vertica::UDFilter::useSideChannel ( )
inlinevirtual

UDFilter::useSideChannel()

Provides access to the side channel containing record length metadata, when the data source has metadata about record boundaries available in a structured format that is separate from the data payload, and it is retained throughout the load stack.

Override and return true to indicate that processWithMetadata() should be called instead of process().

Return false to implement process().

Returns
false by default.