Requirements for custom stemmers and tokenizers

Sometimes, you may want specific tokenization or stemming behavior that differs from what OpenText™ Analytics Database provides. In such cases, you can implement your own custom user-defined extension (UDx) to replace the stemmer or tokenizer. For more information about building custom UDxs, see Developing user-defined extensions (UDxs).

Before you implement a custom stemmer or tokenizer, verify that the UDx meets these requirements.

Note

Custom tokenizers can return multi-column text indices.

Stemmer requirements

Comply with these requirements when you create custom stemmers:

Must be a User Defined Scalar Function (UDSF) or a SQL Function
Can be written in C++, Java, or R
Volatility set to stable or immutable

Supported Data Input Types:

Varchar
Long varchar

Supported Data Output Types:

Varchar
Long varchar
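
The following is a minimal sketch of the kind of per-token logic a custom stemmer might contain, assuming a simple suffix-stripping rule. The class name SuffixStemmerSketch, the suffix list, and the minimum stem length are hypothetical and chosen only for illustration. The UDSF wrapper and factory classes from the SDK are deliberately left out, so this is plain Java rather than a deployable UDx. The point it illustrates is that the function takes one string and returns one string, and depends on nothing but its input, which is what a stable or immutable volatility setting assumes.

```java
import java.util.Locale;

// Hypothetical helper: per-token logic that a custom stemmer UDSF could delegate to.
// SDK-free on purpose; the UDSF wrapper and its factory are not shown.
final class SuffixStemmerSketch {

    // Suffixes are checked in order, so longer ones come first.
    private static final String[] SUFFIXES = {"ing", "ed", "es", "s"};

    // Keep at least this many characters so that short words are left alone.
    private static final int MIN_STEM_LENGTH = 3;

    private SuffixStemmerSketch() {}

    // Pure function of its input: no clock, randomness, or external state.
    static String stem(String token) {
        if (token == null) {
            return null; // pass NULL through unchanged
        }
        String lower = token.toLowerCase(Locale.ROOT);
        for (String suffix : SUFFIXES) {
            int stemLength = lower.length() - suffix.length();
            if (lower.endsWith(suffix) && stemLength >= MIN_STEM_LENGTH) {
                return lower.substring(0, stemLength);
            }
        }
        return lower;
    }

    public static void main(String[] args) {
        for (String word : new String[] {"Walking", "jumped", "boxes", "cats", "is"}) {
            System.out.println(word + " -> " + stem(word));
        }
    }
}
```

In a real stemmer, the surrounding UDSF would read each Varchar or Long varchar input value, call a method like stem, and write the result as a Varchar or Long varchar output value.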

Tokenizer requirements

Comply with these requirements when you create custom tokenizers:

Must be a User Defined Transform Function (UDTF)
Can be written in C++, Java, or R
Input type must match the type of the input text

Supported Data Input Types:

Char
Varchar
Long varchar
Varbinary
Long varbinary

Supported Data Output Types:

Varchar
Long varchar
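
The following is a minimal sketch of the splitting logic a custom tokenizer might contain, assuming that tokens are runs of letters and digits and that everything else is a separator. The class name SimpleTokenizerSketch and the splitting rule are hypothetical. The UDTF wrapper and factory classes from the SDK are left out, and the sketch assumes the input has already been read as a Java String, so binary input types would need decoding first. It illustrates the one-to-many shape of a transform function: one input value produces zero or more output values.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splitting logic that a custom tokenizer UDTF could delegate to.
// SDK-free on purpose; the UDTF wrapper and its factory are not shown.
final class SimpleTokenizerSketch {

    private SimpleTokenizerSketch() {}

    // Returns one list entry per token; a UDTF would emit one output row per entry.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        if (text == null) {
            return tokens; // NULL input produces no tokens
        }
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                // A separator ends the token collected so far.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString()); // flush the trailing token
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String token : tokenize("Hello, world! 42")) {
            System.out.println(token);
        }
    }
}
```

In a real tokenizer, the surrounding UDTF would declare an input type that matches the type of the indexed text and write each token as a Varchar or Long varchar output row.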