Requirements for custom stemmers and tokenizers

Sometimes, you may want specific tokenization or stemming behavior that differs from what OpenText™ Analytics Database provides.

Sometimes, you may want specific tokenization or stemming behavior that differs from what OpenText™ Analytics Database provides. In such cases, you can to implement your own custom User Defined Extensions (UDx) to replace the stemmer or tokenizer. For more information about building custom UDxs see Developing user-defined extensions (UDxs).

Verify that the UDx extension meets these requirements before implementing a custom stemmer or tokenizer.

Stemmer requirements

Comply with these requirements when you create custom stemmers:

  • Must be a User Defined Scalar Function (UDSF) or a SQL Function

  • Can be written in C++, Java, or R

  • Volatility set to stable or immutable

Supported Data Input Types:

  • Varchar

  • Long varchar

Supported Data Output Types:

  • Varchar

  • Long varchar

Tokenizer requirements

To create custom tokenizers, follow these requirements:

  • Must be a User Defined Transform Function (UDTF)

  • Can be written in C++, Java, or R

  • Input type must match the type of the input text

Supported Data Input Types:

  • Char

  • Varchar

  • Long varchar

  • Varbinary

  • Long varbinary

Supported Data Output Types:

  • Varchar

  • Long varchar