Requirements for custom stemmers and tokenizers

Sometimes, you may want specific tokenization or stemming behavior that differs from what Vertica provides. In such cases, you can implement your own custom user-defined extension (UDx) to replace the stemmer or tokenizer. For more information about building custom UDxs, see Developing user-defined extensions (UDxs).

Before implementing a custom stemmer or tokenizer in Vertica, verify that the UDx meets these requirements.

Vertica stemmer requirements

Comply with these requirements when you create custom stemmers:

  • Must be a User Defined Scalar Function (UDSF) or a SQL Function

  • Can be written in C++, Java, or R

  • Volatility must be set to stable or immutable

Supported Data Input Types:

  • Varchar

  • Long varchar

Supported Data Output Types:

  • Varchar

  • Long varchar
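
The following is a minimal sketch of the kind of string-to-string logic a custom stemmer might contain, written in plain C++ without the Vertica SDK. The function name naiveStem and the suffix rules are hypothetical and chosen only for illustration. In an actual UDx, this logic would be wrapped in a scalar function that takes a Varchar or Long varchar argument and returns one of the same types. Because the result depends only on the input, logic like this is consistent with an immutable volatility setting.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical helper (not part of any Vertica API): a naive
// suffix-stripping stemmer. It lowercases ASCII letters and removes
// at most one of a few common English suffixes.
std::string naiveStem(std::string word) {
    // Lowercase in place; the cast to unsigned char keeps std::tolower
    // well defined for bytes outside the 7-bit ASCII range.
    std::transform(word.begin(), word.end(), word.begin(),
                   [](unsigned char c) {
                       return static_cast<char>(std::tolower(c));
                   });

    // Suffixes are tried in this order; the first match wins.
    static const std::string suffixes[] = {"ing", "ed", "s"};
    for (const std::string& suffix : suffixes) {
        // Require at least three characters to remain after stripping,
        // so short words such as "is" or "red" are left unchanged.
        const bool longEnough = word.size() >= suffix.size() + 3;
        if (longEnough &&
            word.compare(word.size() - suffix.size(),
                         suffix.size(), suffix) == 0) {
            word.erase(word.size() - suffix.size());
            break;
        }
    }
    return word;
}
```

With these rules, naiveStem("Jumping") yields "jump" and naiveStem("cats") yields "cat", while "is" and "red" pass through unchanged. A production stemmer would normally use an established algorithm rather than this simplified rule set.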

Vertica tokenizer requirements

To create custom tokenizers, follow these requirements:

  • Must be a User Defined Transform Function (UDTF)

  • Can be written in C++, Java, or R

  • Input type must match the type of the input text

Supported Data Input Types:

  • Char

  • Varchar

  • Long varchar

  • Varbinary

  • Long varbinary

Supported Data Output Types:

  • Varchar

  • Long varchar
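
As a rough illustration of the one-input, many-outputs shape that a transform function has, the sketch below splits a string on whitespace in plain C++ without the Vertica SDK. The function name tokenizeOnWhitespace is hypothetical. In an actual UDx, each element of the returned vector would correspond to one output row of type Varchar or Long varchar, and the declared input type would need to match the type of the text being tokenized.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Hypothetical helper (not part of any Vertica API): splits text into
// tokens separated by runs of whitespace. Empty tokens are never emitted.
std::vector<std::string> tokenizeOnWhitespace(const std::string& text) {
    std::vector<std::string> tokens;
    std::string current;

    // Iterate as unsigned char so std::isspace is well defined for
    // every byte value.
    for (unsigned char c : text) {
        if (std::isspace(c)) {
            // A whitespace byte ends the token in progress, if any.
            if (!current.empty()) {
                tokens.push_back(current);
                current.clear();
            }
        } else {
            current.push_back(static_cast<char>(c));
        }
    }

    // Flush the final token when the text does not end in whitespace.
    if (!current.empty()) {
        tokens.push_back(current);
    }
    return tokens;
}
```

For example, tokenizeOnWhitespace("  full text\tsearch ") returns the three tokens "full", "text", and "search". The sketch treats the input as raw bytes, so it does not handle multi-byte whitespace characters; a real tokenizer would need to account for the encoding of its input.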