OpenText Analytics Database 26.2.x – Stemmers and tokenizers

Admin: Stemmers

Stemmers use the Porter stemming algorithm to reduce words to a common base (root) form, so that words derived from the same root match one another. For example, if you search a text index for the keyword "database", you might also want results that contain the word "databases".

To achieve this type of matching, the database stores words in their stemmed form when using any of the v_txtindex stemmers.
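
As a rough illustration of why storing stemmed tokens makes this matching work, the following Python sketch builds a tiny in-memory index. It is not how the database is implemented: `stem_step_1a` applies only step 1a of the Porter algorithm (plural suffixes) after lowercasing, and `build_index`, `search`, and the sample rows are hypothetical names invented for this example.

```python
from collections import defaultdict


def stem_step_1a(word: str) -> str:
    """Lowercase the word, then apply only step 1a of the Porter algorithm."""
    w = word.lower()
    if w.endswith("sses"):
        return w[:-2]   # caresses -> caress
    if w.endswith("ies"):
        return w[:-2]   # ponies -> poni
    if w.endswith("ss"):
        return w        # caress -> caress
    if w.endswith("s"):
        return w[:-1]   # databases -> database
    return w


def build_index(rows):
    """Map each stemmed token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in rows:
        for token in text.split():
            index[stem_step_1a(token)].add(doc_id)
    return index


def search(index, keyword):
    """Stem the keyword the same way as the stored tokens, then look it up."""
    return sorted(index.get(stem_step_1a(keyword), set()))


# Hypothetical sample data: (document id, text)
rows = [
    (1, "Two databases failed"),
    (2, "The database restarted"),
    (3, "Disk full"),
]
index = build_index(rows)
print(search(index, "database"))  # [1, 2]
```

Because both the stored tokens and the search keyword pass through the same stemmer, "database" and "databases" land on the same index entry. A full Porter stemmer applies further steps, so its stored forms can differ from the ones shown here.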

OpenText™ Analytics Database provides the following stemmers:

Name	Description
v_txtindex.Stemmer(long varchar)	Not sensitive to case; outputs lowercase words. Stems strings from a database table. Alias of StemmerCaseInsensitive.
v_txtindex.StemmerCaseSensitive(long varchar)	Sensitive to case. Stems strings from a database table.
v_txtindex.StemmerCaseInsensitive(long varchar)	Default stemmer used if no stemmer is specified when creating a text index. Not sensitive to case; outputs lowercase words. Stems strings from a database table.
v_txtindex.caseInsensitiveNoStemming(long varchar)	Not sensitive to case; outputs lowercase words. Does not use the Porter Stemming algorithm.

Examples

The following examples show how to use a stemmer when creating a text index.

Create a text index using the StemmerCaseInsensitive stemmer:

=> CREATE TEXT INDEX idx_100 ON top_100 (id, feedback) STEMMER v_txtindex.StemmerCaseInsensitive(long varchar)
                                                              TOKENIZER v_txtindex.StringTokenizer(long varchar);

Create a text index using the StemmerCaseSensitive stemmer:

=> CREATE TEXT INDEX idx_unstruc ON unstruc_data (__identity__, __raw__) STEMMER v_txtindex.StemmerCaseSensitive(long varchar)
                                                                                  TOKENIZER public.FlexTokenizer(long varbinary);

Create a text index without using a stemmer:

=> CREATE TEXT INDEX idx_logs ON sys_logs (id, message) STEMMER NONE TOKENIZER v_txtindex.StringTokenizer(long varchar);

Admin: Tokenizers

A tokenizer does the following:

Receives a stream of characters.
Breaks the stream into individual tokens that usually correspond to individual words.
Returns a stream of tokens.
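
The three steps above can be sketched as a Python generator. This is a conceptual model only, assuming whitespace is the sole delimiter; the function name `tokenize` is hypothetical, and the tokenizers shipped with the database may split on other characters.

```python
from typing import Iterable, Iterator


def tokenize(chars: Iterable[str]) -> Iterator[str]:
    """Consume characters one at a time and yield whitespace-delimited tokens."""
    buffer = []
    for ch in chars:
        if ch.isspace():
            # A delimiter ends the current token, if one is in progress.
            if buffer:
                yield "".join(buffer)
                buffer.clear()
        else:
            buffer.append(ch)
    # Emit the final token when the stream does not end with a delimiter.
    if buffer:
        yield "".join(buffer)


print(list(tokenize("disk  full on node01\n")))
# ['disk', 'full', 'on', 'node01']
```

Writing the tokenizer as a generator mirrors the stream-in, stream-out shape: tokens are produced as soon as a delimiter is seen, without holding the whole input in memory.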

Admin: Configuring a tokenizer

You configure a tokenizer by creating a user-defined transform function (UDTF) from one of the two base UDTFs in the v_txtindex.AdvTxtSearchLib library: one for Log Words and one for Ngrams. You can configure each base function with or without positional relevance.
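
To make the difference between the two base tokenizers more concrete, here is a small Python sketch. It rests on assumptions that the text above does not confirm: that word tokens are split on whitespace, that n-grams are overlapping character windows, and that positional relevance means pairing each token with its zero-based ordinal position. The function names and the `with_position` flag are hypothetical and are not parameters of the library.

```python
def word_tokens(text, with_position=False):
    """Split on whitespace; optionally pair each token with its position."""
    tokens = text.split()
    return list(enumerate(tokens)) if with_position else tokens


def ngram_tokens(text, n=3, with_position=False):
    """Slide a window of n characters over the text, one character at a time."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    return list(enumerate(grams)) if with_position else grams


print(word_tokens("error on node"))
# ['error', 'on', 'node']
print(word_tokens("error on node", with_position=True))
# [(0, 'error'), (1, 'on'), (2, 'node')]
print(ngram_tokens("error"))
# ['err', 'rro', 'ror']
```

Word tokens suit searches for whole terms, while n-grams allow matches on fragments of a term at the cost of a larger index. Keeping positions makes it possible to reason about the order and proximity of tokens.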

Admin: Requirements for custom stemmers and tokenizers

Sometimes you may want tokenization or stemming behavior that differs from what OpenText™ Analytics Database provides. In such cases, you can implement your own custom User Defined Extensions (UDx) to replace the stemmer or tokenizer. For more information about building custom UDxs, see Developing user-defined extensions (UDxs).

Before you use a UDx as a custom stemmer or tokenizer, verify that it meets the following requirements.

Note

Custom tokenizers can return multi-column text indices.

Stemmer requirements

Comply with these requirements when you create custom stemmers:

Must be a User Defined Scalar Function (UDSF) or a SQL Function
Can be written in C++, Java, or R
Volatility set to stable or immutable

Supported Data Input Types:

Varchar
Long varchar

Supported Data Output Types:

Varchar
Long varchar

Tokenizer requirements

To create custom tokenizers, follow these requirements:

Must be a User Defined Transform Function (UDTF)
Can be written in C++, Java, or R
Input type must match the type of the input text

Supported Data Input Types:

Char
Varchar
Long varchar
Varbinary
Long varbinary

Supported Data Output Types:

Varchar
Long varchar
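
The requirements above imply two different function shapes: a stemmer is scalar (one string in, one string out), while a tokenizer is a transform (one input row in, zero or more token rows out). The Python sketch below illustrates only these shapes. Python is used here for illustration and is not one of the languages listed above; all function names are hypothetical, and the order of operations (tokenize, then stem each token) is an assumption.

```python
from typing import Iterator, List


def custom_stemmer(token: str) -> str:
    """Scalar shape: exactly one output string per input string.

    The result depends only on the argument, which is the kind of behavior
    the stable/immutable volatility requirement asks for.
    """
    return token.lower().rstrip(".,;:!?")


def custom_tokenizer(text: str) -> Iterator[str]:
    """Transform shape: zero or more output tokens per input text."""
    for piece in text.replace("-", " ").split():
        yield piece


def index_tokens(text: str) -> List[str]:
    """Assumed pipeline: split the text into tokens, then stem each token."""
    return [custom_stemmer(t) for t in custom_tokenizer(text)]


print(index_tokens("Re-start FAILED, retrying."))
# ['re', 'start', 'failed', 'retrying']
```

A stemmer that read a clock or a random source would return different values for the same token on different calls, so stored tokens and search keywords could stop matching; that is the practical reason for requiring a deterministic function.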