MultiPhaseTransformFunctionFactory class
Multi-phase UDTFs let you break your data processing into multiple steps. Using this feature, your UDTFs can perform processing in a way similar to Hadoop or other MapReduce frameworks. You can use the first phase to break down and gather data, and then use subsequent phases to process the data. For example, the first phase of your UDTF could extract specific types of user interactions from a web server log stored in a column of a table, and subsequent phases could perform analysis on those interactions.
Multi-phase UDTFs also let you decide where processing should occur: locally on each node, or throughout the cluster. If your multi-phase UDTF is like a MapReduce process, you want the first phase of your multi-phase UDTF to process data that is stored locally on the node where the instance of the UDTF is running. This prevents large segments of data from being copied around the Vertica cluster. Depending on the type of processing being performed in later phases, you may choose to have the data segmented and distributed across the Vertica cluster.
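The difference between node-local and cluster-wide processing can be illustrated with a small, self-contained Python model. This is only a conceptual sketch, not Vertica SDK code: the node names, the sample rows, and the `local_phase`, `resegment`, and `count_phase` helpers are all hypothetical, and Vertica itself decides how rows actually move between nodes.

```python
import zlib
from collections import defaultdict

# Hypothetical sample data: rows already stored on each node of a 3-node cluster.
rows_by_node = {
    "node1": ["click /home", "view /cart", "click /cart"],
    "node2": ["view /home", "click /home"],
    "node3": ["click /cart"],
}

def local_phase(rows):
    """First phase: runs on one node and reads only that node's rows."""
    for row in rows:
        action, page = row.split()
        yield (page, action)

def resegment(rows, node_names):
    """Between phases: route each row to a node chosen from its key, so
    that all rows sharing a key end up on the same node."""
    segments = defaultdict(list)
    for key, value in rows:
        # crc32 is used instead of hash() so the routing is repeatable.
        index = zlib.crc32(key.encode("utf-8")) % len(node_names)
        segments[node_names[index]].append((key, value))
    return segments

def count_phase(rows):
    """Later phase: counts rows per key within one node's segment."""
    counts = defaultdict(int)
    for key, _ in rows:
        counts[key] += 1
    return dict(counts)

node_names = sorted(rows_by_node)

# Phase 1 touches only local data; nothing is copied between nodes yet.
intermediate = [
    pair for node in node_names for pair in local_phase(rows_by_node[node])
]

# The intermediate rows are then redistributed across the cluster by key.
segments = resegment(intermediate, node_names)

# Each key lives on exactly one node, so per-node counts can simply be merged.
totals = {}
for node in node_names:
    totals.update(count_phase(segments.get(node, [])))
```

In this model the first phase never looks at another node's rows, while the counting phase relies on the redistribution step to bring every row for a given page onto a single node.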
Each phase of the UDTF is the same as a traditional (single-phase) UDTF: it receives a table as input, and generates a table as output. The schema for each phase's output does not have to match its input, and each phase can output as many or as few rows as it wants.
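As a rough illustration of phases whose output schema and row count differ from their input, the following plain-Python sketch chains two generator functions. It does not use the Vertica SDK; the function names, the log format, and the sample rows are assumptions made up for this example.

```python
from collections import Counter

def extract_phase(rows):
    """Phase 1: input schema (log_line,), output schema (user, action).
    One input row may yield several output rows, or none at all."""
    for (log_line,) in rows:
        user, _, actions = log_line.partition(" ")
        for action in actions.split(","):
            if action:  # a line with no actions produces zero rows
                yield (user, action)

def summarize_phase(rows):
    """Phase 2: input schema (user, action), output schema (action, count)."""
    counts = Counter(action for _user, action in rows)
    for action in sorted(counts):
        yield (action, counts[action])

# Hypothetical input table with a single column holding a log line.
log_table = [("alice click,view",), ("bob",), ("carol click",)]

interactions = list(extract_phase(log_table))
summary = list(summarize_phase(interactions))
```

Here three input rows become three `(user, action)` rows (two from the first line, none from the second, one from the third), which the second phase reduces to two `(action, count)` rows.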
You create a subclass of TransformFunction to define the processing performed by each phase. If you already have a TransformFunction from a single-phase UDTF that performs the processing you want a phase of your multi-phase UDTF to perform, you can easily adapt it to work within the multi-phase UDTF.
What makes a multi-phase UDTF different from a traditional UDTF is the factory class you use. You define a multi-phase UDTF using a subclass of MultiPhaseTransformFunctionFactory, rather than TransformFunctionFactory. This special factory class acts as a container for all of the phases in your multi-phase UDTF. It provides Vertica with the input and output requirements of the entire multi-phase UDTF (through the getPrototype() function), and a list of all the phases in the UDTF.
Within your subclass of the MultiPhaseTransformFunctionFactory class, you define one or more subclasses of TransformFunctionPhase. These classes fill the same role as the TransformFunctionFactory class for each phase in your multi-phase UDTF. They define the input and output of each phase and create instances of their associated TransformFunction classes to perform the processing for each phase of the UDTF. In addition to these subclasses, your MultiPhaseTransformFunctionFactory includes fields that provide a handle to an instance of each of the TransformFunctionPhase subclasses.
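The containment relationship described above (a factory that holds phase classes, each of which creates a TransformFunction) can be sketched in plain Python as follows. The three base classes here are minimal local stand-ins written for this example, not the Vertica SDK classes, and their method signatures are simplified: the real methods also take a server interface argument, and the `process` method, the `run` driver, and the word-count logic are hypothetical.

```python
# Local stand-ins so the sketch runs on its own. They are NOT the Vertica
# SDK classes; they only mirror the roles described in the text.
class TransformFunction:
    def process(self, rows):  # hypothetical, simplified signature
        raise NotImplementedError

class TransformFunctionPhase:
    def createTransformFunction(self):  # simplified: no server interface
        raise NotImplementedError

class MultiPhaseTransformFunctionFactory:
    def getPhases(self):  # simplified: returns the list of phases
        raise NotImplementedError


class Tokenize(TransformFunction):
    """Phase 1 processing: one (word,) row per word in each (text,) row."""
    def process(self, rows):
        return [(word,) for (text,) in rows for word in text.split()]

class CountWords(TransformFunction):
    """Phase 2 processing: one (word, count) row per distinct word."""
    def process(self, rows):
        counts = {}
        for (word,) in rows:
            counts[word] = counts.get(word, 0) + 1
        return sorted(counts.items())


class WordCountFactory(MultiPhaseTransformFunctionFactory):
    # Phase classes are defined inside the factory; each one creates the
    # TransformFunction that does that phase's work.
    class TokenizePhase(TransformFunctionPhase):
        def createTransformFunction(self):
            return Tokenize()

    class CountPhase(TransformFunctionPhase):
        def createTransformFunction(self):
            return CountWords()

    def __init__(self):
        # Fields that hold one instance of each phase class.
        self.tokenize_phase = self.TokenizePhase()
        self.count_phase = self.CountPhase()

    def getPhases(self):
        # In this sketch, the output of one phase feeds the next in list order.
        return [self.tokenize_phase, self.count_phase]


def run(factory, rows):
    """Hypothetical driver standing in for the server: runs phases in order."""
    for phase in factory.getPhases():
        rows = phase.createTransformFunction().process(rows)
    return rows

result = run(WordCountFactory(), [("a b a",), ("b c",)])
```

The factory itself performs no processing in this sketch; it only exposes its phases, and each phase only knows how to build the function that does the work.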
API
In C++, the MultiPhaseTransformFunctionFactory class extends TransformFunctionFactory. The API provides the following additional method for extension by subclasses:
virtual void getPhases(ServerInterface &srvInterface,
                       std::vector<TransformFunctionPhase *> &phases) = 0;
If you use this factory, you must also extend TransformFunctionPhase. See the SDK reference documentation.
In Java, the MultiPhaseTransformFunctionFactory class extends TransformFunctionFactory. The API provides the following additional method for extension by subclasses:
public abstract void getPhases(ServerInterface srvInterface,
                               Vector<TransformFunctionPhase> phases);
If you use this factory, you must also extend TransformFunctionPhase. See the SDK reference documentation.
In Python, the MultiPhaseTransformFunctionFactory class extends TransformFunctionFactory. For each phase, the factory must define a class that extends TransformFunctionPhase.
The factory adds the following method:
def getPhase(cls, srv)
TransformFunctionPhase has the following methods:
def createTransformFunction(cls, srv)
def getReturnType(self, srv_interface, input_types, output_types)