Requirements for custom stemmers and tokenizers
Sometimes, you may want specific tokenization or stemming behavior that differs from what Vertica provides. In such cases, you can implement your own custom User Defined Extension (UDx) to replace the stemmer or tokenizer. For more information about building custom UDxs, see Developing user-defined extensions (UDxs).
Before implementing a custom stemmer or tokenizer in Vertica, verify that the UDx meets these requirements.
Note
Custom tokenizers can return multi-column text indices.

Vertica stemmer requirements
Comply with these requirements when you create custom stemmers; a sketch of a conforming stemmer follows the lists below:

- Must be a User Defined Scalar Function (UDSF) or a SQL function
- Can be written in C++, Java, or R
- Volatility set to stable or immutable

Supported Data Input Types:

- Varchar
- Long varchar

Supported Data Output Types:

- Varchar
- Long varchar
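
For illustration only, the following minimal C++ sketch shows one way a stemmer UDSF could satisfy these requirements. The class names (MyStemmer, MyStemmerFactory) and the trailing-'s' stemming rule are hypothetical placeholders, and the SDK calls follow the general pattern of the Vertica C++ SDK (Vertica.h); check the SDK headers shipped with your installation for the exact signatures.

```cpp
// Hypothetical stemmer UDSF sketch; verify exact signatures against Vertica.h in your SDK.
#include "Vertica.h"
#include <string>

using namespace Vertica;

class MyStemmer : public ScalarFunction {
public:
    // Called per block of rows: read a VARCHAR, write its "stem" (null handling omitted for brevity).
    virtual void processBlock(ServerInterface &srvInterface,
                              BlockReader &argReader,
                              BlockWriter &resWriter) {
        do {
            std::string word = argReader.getStringRef(0).str();
            // Placeholder rule: strip a trailing 's'. A real stemmer would
            // implement Porter/Snowball or similar logic here.
            if (word.size() > 1 && word.back() == 's') {
                word.pop_back();
            }
            resWriter.getStringRef().copy(word.c_str());
            resWriter.next();
        } while (argReader.next());
    }
};

class MyStemmerFactory : public ScalarFunctionFactory {
public:
    MyStemmerFactory() {
        // Requirement: volatility must be stable or immutable.
        vol = IMMUTABLE;
    }
    virtual ScalarFunction *createScalarFunction(ServerInterface &srvInterface) {
        return vt_createFuncObject<MyStemmer>(srvInterface.allocator);
    }
    virtual void getPrototype(ServerInterface &srvInterface,
                              ColumnTypes &argTypes,
                              ColumnTypes &returnType) {
        // Requirement: VARCHAR (or LONG VARCHAR) in, VARCHAR (or LONG VARCHAR) out.
        argTypes.addVarchar();
        returnType.addVarchar();
    }
    virtual void getReturnType(ServerInterface &srvInterface,
                               const SizedColumnTypes &argTypes,
                               SizedColumnTypes &returnType) {
        // Size the output the same as the input string.
        returnType.addVarchar(argTypes.getColumnType(0).getStringLength());
    }
};

RegisterFactory(MyStemmerFactory);
```

After compiling such a UDx into a shared library, you would typically load it with CREATE LIBRARY and register the function with CREATE FUNCTION before referencing it as the stemmer for a text index.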
Vertica tokenizer requirements
To create a custom tokenizer, follow these requirements; a sketch of a conforming tokenizer follows the lists below:

- Must be a User Defined Transform Function (UDTF)
- Can be written in C++, Java, or R
- Input type must match the type of the input text

Supported Data Input Types:

- Char
- Varchar
- Long varchar
- Varbinary
- Long varbinary

Supported Data Output Types:

- Varchar
- Long varchar
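
Similarly, the following minimal C++ sketch shows one way a tokenizer UDTF could satisfy these requirements, splitting the input text on whitespace and emitting one row per token. The class names (MyTokenizer, MyTokenizerFactory) and the whitespace-splitting rule are hypothetical placeholders; verify the exact SDK signatures against the headers shipped with your Vertica installation.

```cpp
// Hypothetical whitespace tokenizer UDTF sketch; verify exact signatures against your SDK headers.
#include "Vertica.h"
#include <sstream>
#include <string>

using namespace Vertica;

class MyTokenizer : public TransformFunction {
public:
    // Called per partition: emit one output row for every token found in the input text.
    virtual void processPartition(ServerInterface &srvInterface,
                                  PartitionReader &inputReader,
                                  PartitionWriter &outputWriter) {
        do {
            std::string text = inputReader.getStringRef(0).str();
            std::istringstream stream(text);
            std::string token;
            // Placeholder rule: split on whitespace. A real tokenizer might also
            // handle punctuation and casing, or emit additional metadata columns.
            while (stream >> token) {
                outputWriter.getStringRef(0).copy(token.c_str());
                outputWriter.next();
            }
        } while (inputReader.next());
    }
};

class MyTokenizerFactory : public TransformFunctionFactory {
public:
    virtual TransformFunction *createTransformFunction(ServerInterface &srvInterface) {
        return vt_createFuncObject<MyTokenizer>(srvInterface.allocator);
    }
    virtual void getPrototype(ServerInterface &srvInterface,
                              ColumnTypes &argTypes,
                              ColumnTypes &returnType) {
        // Requirement: the input type must match the text column being indexed.
        argTypes.addVarchar();
        returnType.addVarchar();
    }
    virtual void getReturnType(ServerInterface &srvInterface,
                               const SizedColumnTypes &inputTypes,
                               SizedColumnTypes &outputTypes) {
        // One VARCHAR output column holding each token, sized like the input.
        outputTypes.addVarchar(inputTypes.getColumnType(0).getStringLength(), "token");
    }
};

RegisterFactory(MyTokenizerFactory);
```

As with the stemmer, you would typically load the compiled library with CREATE LIBRARY and register the function with CREATE TRANSFORM FUNCTION before associating it with a text index.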