Python example: string tokenizer
The following example shows a transform function that breaks an input string into tokens (based on whitespace). It is similar to the tokenizer examples for C++ and Java.
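The whitespace splitting that the tokenizer relies on can be tried in plain Python, without Vertica. The sketch below is illustrative only: the helper name tokenize_text is hypothetical and is not part of the Vertica SDK. It wraps str.split() with no arguments, which splits on runs of whitespace and discards leading and trailing whitespace.

```python
def tokenize_text(text):
    """Hypothetical helper: split text on runs of whitespace.

    str.split() with no arguments treats any run of spaces, tabs, or
    newlines as one separator and never returns empty strings.
    """
    return text.split()


print(tokenize_text("this is a test of the python udtf"))
# ['this', 'is', 'a', 'test', 'of', 'the', 'python', 'udtf']

print(tokenize_text("  leading,\ttabs and   repeated spaces  "))
# ['leading,', 'tabs', 'and', 'repeated', 'spaces']
```

Because only whitespace acts as a separator, punctuation stays attached to the neighboring token.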
Loading and using the example
Create the library and function:
=> CREATE LIBRARY pyudtf AS '/home/dbadmin/udx/tokenize.py' LANGUAGE 'Python';
CREATE LIBRARY
=> CREATE TRANSFORM FUNCTION tokenize AS NAME 'StringTokenizerFactory' LIBRARY pyudtf;
CREATE TRANSFORM FUNCTION
You can then use the function in SQL statements, for example:
=> CREATE TABLE words (w VARCHAR);
CREATE TABLE
=> COPY words FROM STDIN;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> this is a test of the python udtf
>> \.
=> SELECT tokenize(w) OVER () FROM words;
token
----------
this
is
a
test
of
the
python
udtf
(8 rows)
Setup
All Python UDxs must import the Vertica SDK.
import vertica_sdk
UDTF Python code
The following code defines the tokenizer and its factory.
class StringTokenizer(vertica_sdk.TransformFunction):
    """
    Transform function which tokenizes its inputs.
    For each input string, each of the whitespace-separated tokens of that
    string is produced as output.
    """
    def processPartition(self, server_interface, input, output):
        while True:
            # Emit one output row per whitespace-separated token.
            for token in input.getString(0).split():
                output.setString(0, token)
                output.next()
            # Advance to the next input row; stop when none remain.
            if not input.next():
                break
class StringTokenizerFactory(vertica_sdk.TransformFunctionFactory):
    def getPrototype(self, server_interface, arg_types, return_type):
        arg_types.addVarchar()
        return_type.addVarchar()
    def getReturnType(self, server_interface, arg_types, return_type):
        return_type.addColumn(arg_types.getColumnType(0), "tokens")
    def createTransformFunction(self, server_interface):
        return StringTokenizer()
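The row-processing loop can be exercised outside the database with small stand-ins for the reader and writer. This is a minimal sketch under stated assumptions: FakeReader, FakeWriter, and process_partition are hypothetical names, not part of vertica_sdk, and the stubs model only the three calls the tokenizer uses (getString, setString, next). The sketch assumes the reader starts positioned on the first row and that the partition holds at least one row.

```python
class FakeReader:
    """Hypothetical stand-in for the input reader; starts on the first row."""

    def __init__(self, rows):
        self._rows = rows
        self._pos = 0

    def getString(self, col):
        return self._rows[self._pos][col]

    def next(self):
        # Advance one row; report whether a row is still available.
        self._pos += 1
        return self._pos < len(self._rows)


class FakeWriter:
    """Hypothetical stand-in for the output writer; collects emitted rows."""

    def __init__(self):
        self.rows = []
        self._current = {}

    def setString(self, col, value):
        self._current[col] = value

    def next(self):
        # Commit the current row as a tuple ordered by column index.
        self.rows.append(tuple(self._current[i] for i in sorted(self._current)))
        self._current = {}


def process_partition(reader, writer):
    """Same loop as StringTokenizer.processPartition, minus the SDK types."""
    while True:
        for token in reader.getString(0).split():
            writer.setString(0, token)
            writer.next()
        if not reader.next():
            break


reader = FakeReader([["this is a test"], ["of the python udtf"]])
writer = FakeWriter()
process_partition(reader, writer)
print([row[0] for row in writer.rows])
# ['this', 'is', 'a', 'test', 'of', 'the', 'python', 'udtf']
```

Each input row can produce any number of output rows, which is why the function is registered as a transform function rather than a scalar function.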