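Python example: string tokenizer
The following example shows a transform function that breaks an input string into tokens, splitting on whitespace. It is similar to the tokenizer examples for C++ and Java.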
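The whitespace splitting that this example relies on is Python's built-in str.split() called with no arguments. The short sketch below runs outside Vertica and only illustrates that behavior; the helper name split_tokens is hypothetical and is not part of the Vertica SDK.

```python
def split_tokens(text):
    """Split text on runs of whitespace, as str.split() with no arguments does."""
    # With no separator argument, split() treats any run of spaces, tabs, or
    # newlines as a single delimiter and drops leading and trailing whitespace.
    return text.split()


print(split_tokens("this is a test of the python udtf"))
print(split_tokens("  tabs\tand   extra  spaces "))
print(split_tokens(""))  # an empty string produces no tokens
```

Because split() collapses repeated whitespace, an input row that is empty or contains only whitespace contributes no output rows.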
Loading and using the example
Create the library and function:
=> CREATE LIBRARY pyudtf AS '/home/dbadmin/udx/tokenize.py' LANGUAGE 'Python';
CREATE LIBRARY
=> CREATE TRANSFORM FUNCTION tokenize AS NAME 'StringTokenizerFactory' LIBRARY pyudtf;
CREATE TRANSFORM FUNCTION
You can then use the function in SQL statements, for example:
=> CREATE TABLE words (w VARCHAR);
CREATE TABLE
=> COPY words FROM STDIN;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> this is a test of the python udtf
>> \.
=> SELECT tokenize(w) OVER () FROM words;
  tokens
----------
this
is
a
test
of
the
python
udtf
(8 rows)
Setup
All Python UDxs must import the Vertica SDK.
import vertica_sdk
UDTF Python code
The following code defines the tokenizer and its factory.
class StringTokenizer(vertica_sdk.TransformFunction):
    """
    Transform function which tokenizes its inputs.
    For each input string, each of the whitespace-separated tokens of that
    string is produced as output.
    """
    def processPartition(self, server_interface, input, output):
        # Handle the current row first, then advance; stop when no rows remain.
        while True:
            for token in input.getString(0).split():
                output.setString(0, token)
                output.next()
            if not input.next():
                break

class StringTokenizerFactory(vertica_sdk.TransformFunctionFactory):
    def getPrototype(self, server_interface, arg_types, return_type):
        # One VARCHAR argument in, one VARCHAR column out.
        arg_types.addVarchar()
        return_type.addVarchar()

    def getReturnType(self, server_interface, arg_types, return_type):
        # The output column reuses the input column's type and is named "tokens".
        return_type.addColumn(arg_types.getColumnType(0), "tokens")

    def createTransformFunction(cls, server_interface):
        return StringTokenizer()
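To follow the row loop in processPartition without a running database, it can help to drive the same loop with small stand-in objects. The sketch below is only an illustration: FakeReader, FakeWriter, and process_partition are hypothetical names, not part of vertica_sdk, and the fakes assume that the reader starts positioned on the first row of a non-empty partition and that next() returns False once the rows run out.

```python
class FakeReader:
    """Hypothetical stand-in for the SDK input reader: a single string column."""

    def __init__(self, rows):
        self._rows = list(rows)
        self._pos = 0  # assumed: the reader starts on the first row

    def getString(self, col):
        return self._rows[self._pos]

    def next(self):
        # Advance to the next row and report whether one exists.
        self._pos += 1
        return self._pos < len(self._rows)


class FakeWriter:
    """Hypothetical stand-in for the SDK output writer: collects emitted rows."""

    def __init__(self):
        self.rows = []
        self._current = None

    def setString(self, col, value):
        self._current = value

    def next(self):
        # Commit the current output row and start a new one.
        self.rows.append(self._current)
        self._current = None


def process_partition(reader, writer):
    # Same loop shape as StringTokenizer.processPartition, without the SDK base class.
    while True:
        for token in reader.getString(0).split():
            writer.setString(0, token)
            writer.next()
        if not reader.next():
            break


reader = FakeReader(["this is a test", "of the python udtf"])
writer = FakeWriter()
process_partition(reader, writer)
print(writer.rows)
```

The loop reads the current row before calling next() on the reader, so every row is processed exactly once, and each token becomes its own output row regardless of which input row it came from.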