C++ example: string tokenizer
The following example shows a subclass of TransformFunction
named StringTokenizer
. It defines a UDTF that reads a table containing an INTEGER ID column and a VARCHAR column. It breaks the text in the VARCHAR column into tokens (individual words). It returns a table containing each token, the row it occurred in, and its position within the string.
Loading and using the example
The following example shows how to load the function into Vertica. It assumes that the TransformFunctions.so
library that contains the function has been copied to the dbadmin user's home directory on the initiator node.
=> CREATE LIBRARY TransformFunctions AS
'/home/dbadmin/TransformFunctions.so';
CREATE LIBRARY
=> CREATE TRANSFORM FUNCTION tokenize
AS LANGUAGE 'C++' NAME 'TokenFactory' LIBRARY TransformFunctions;
CREATE TRANSFORM FUNCTION
You can then use it from SQL statements, for example:
=> CREATE TABLE T (url varchar(30), description varchar(2000));
CREATE TABLE
=> INSERT INTO T VALUES ('www.amazon.com','Online retail merchant and provider of cloud services');
OUTPUT
--------
1
(1 row)
=> INSERT INTO T VALUES ('www.vertica.com','World''s fastest analytic database');
OUTPUT
--------
1
(1 row)
=> COMMIT;
COMMIT
=> -- Invoke the UDTF
=> SELECT url, tokenize(description) OVER (partition by url) FROM T;
url | words
-----------------+-----------
www.amazon.com | Online
www.amazon.com | retail
www.amazon.com | merchant
www.amazon.com | and
www.amazon.com | provider
www.amazon.com | of
www.amazon.com | cloud
www.amazon.com | services
www.vertica.com | World's
www.vertica.com | fastest
www.vertica.com | analytic
www.vertica.com | database
(12 rows)
Notice that the number of rows and columns in the result table are different than the input table. This is one of the strengths of a UDTF.
TransformFunction implementation
The following code shows the StringTokenizer
class.
class StringTokenizer : public TransformFunction
{
virtual void processPartition(ServerInterface &srvInterface,
PartitionReader &inputReader,
PartitionWriter &outputWriter)
{
try {
if (inputReader.getNumCols() != 1)
vt_report_error(0, "Function only accepts 1 argument, but %zu provided", inputReader.getNumCols());
do {
const VString &sentence = inputReader.getStringRef(0);
// If input string is NULL, then output is NULL as well
if (sentence.isNull())
{
VString &word = outputWriter.getStringRef(0);
word.setNull();
outputWriter.next();
}
else
{
// Otherwise, let's tokenize the string and output the words
std::string tmp = sentence.str();
std::istringstream ss(tmp);
do
{
std::string buffer;
ss >> buffer;
// Copy to output
if (!buffer.empty()) {
VString &word = outputWriter.getStringRef(0);
word.copy(buffer);
outputWriter.next();
}
} while (ss);
}
} while (inputReader.next() && !isCanceled());
} catch(std::exception& e) {
// Standard exception. Quit.
vt_report_error(0, "Exception while processing partition: [%s]", e.what());
}
}
};
The processPartition()
function in this example follows a pattern that you will follow in your own UDTF: it loops over all rows in the table partition that Vertica sends it, processing each row and checking for cancellation before advancing. For UDTFs you do not have to actually process every row. You can exit your function without having read all of the input without any issues. You may choose to do this if your UDTF is performing some sort search or some other operation where it can determine that the rest of the input is unneeded.
In this example, processPartition()
first extracts the VString
containing the text from the PartitionReader
object. The VString
class represents a Vertica string value (VARCHAR or CHAR). If there is input, it then tokenizes it and adds it to the output using the PartitionWriter
object.
Similarly to reading input columns, the PartitionWriter
class has functions for writing each type of data to the output row. In this case, the example calls the PartitionWriter
object's getStringRef()
function to allocate a new VString
object to hold the token to output for the first column, and then copies the token's value into the VString
.
TranformFunctionFactory implementation
The following code shows the factory class.
class TokenFactory : public TransformFunctionFactory
{
// Tell Vertica that we take in a row with 1 string, and return a row with 1 string
virtual void getPrototype(ServerInterface &srvInterface, ColumnTypes &argTypes, ColumnTypes &returnType)
{
argTypes.addVarchar();
returnType.addVarchar();
}
// Tell Vertica what our return string length will be, given the input
// string length
virtual void getReturnType(ServerInterface &srvInterface,
const SizedColumnTypes &inputTypes,
SizedColumnTypes &outputTypes)
{
// Error out if we're called with anything but 1 argument
if (inputTypes.getColumnCount() != 1)
vt_report_error(0, "Function only accepts 1 argument, but %zu provided", inputTypes.getColumnCount());
int input_len = inputTypes.getColumnType(0).getStringLength();
// Our output size will never be more than the input size
outputTypes.addVarchar(input_len, "words");
}
virtual TransformFunction *createTransformFunction(ServerInterface &srvInterface)
{ return vt_createFuncObject<StringTokenizer>(srvInterface.allocator); }
};
In this example:
-
The UDTF takes a VARCHAR column as input. To define the input column,
getPrototype()
callsaddVarchar()
on theColumnTypes
object that represents the input table. -
The UDTF returns a VARCHAR as output. The
getPrototype()
function callsaddVarchar()
to define the output table.
This example must return the maximum length of the VARCHAR output column. It sets the length to the length of the input string. This is a safe value, because the output will never be longer than the input string. It also sets the name of the VARCHAR output column to "words".
Note
You are not required to supply a name for an output column in this function. However, it is a best practice to do so. If you do not name an output column,getReturnType()
sets the column name to "". The SQL statements that call your UDTF must provide aliases for any unnamed columns to access them or else they return an error. From a usability standpoint, it is easier for you to supply the column names here once. The alternative is to force all of the users of your function to supply their own column names for each call to the UDTF.
The implementation of the createTransformFunction()
function in the example is boilerplate code. It just calls the vt_returnFuncObj
macro with the name of the TransformFunction
class associated with this factory class. This macro takes care of instantiating a copy of the TransformFunction
class that Vertica can use to process data.
The RegisterFactory macro
The final step in creating your UDTF is to call the RegisterFactory
macro. This macro ensures that your factory class is instantiated when Vertica loads the shared library containing your UDTF. Having your factory class instantiated is the only way that Vertica can find your UDTF and determine what its inputs and outputs are.
The RegisterFactory
macro just takes the name of your factory class:
RegisterFactory(TokenFactory);