This is the multi-page printable view of this section. Click here to print.

Return to the regular view of this page.

Configuring a tokenizer

You configure a tokenizer by creating a user-defined transform function (UDTF) using one of the two base UDTFs in the v_txtindex.AdvTxtSearchLib library.

1: Tokenizer base configuration
2: RetrieveTokenizerproc_oid
3: Set tokenizer parameters
4: View tokenizer parameters
5: Delete tokenizer config file

You configure a tokenizer by creating a user-defined transform function (UDTF) using one of the two base UDTFs in the v_txtindex.AdvTxtSearchLib library. The library contains two base tokenizers: one for Log Words and one for Ngrams. You can configure each base function with or without positional relevance.

1 - Tokenizer base configuration

You can choose among several different tokenizer base configurations:.

You can choose among two tokenizer base configurations:

Ngram with position: logNgramTokenizerPositionFactory
Ngram without position: logNgramTokenizerFactory

The following example creates an Ngram tokenizer without positional relevance:

=> CREATE TRANSFORM FUNCTION v_txtindex.myNgramTokenizer 
   AS LANGUAGE 'C++' 
   NAME 'logNgramTokenizerFactory' 
   LIBRARY v_txtindex.logSearchLib NOT FENCED;

2 - RetrieveTokenizerproc_oid

After you create the tokenizer, Vertica writes the name and proc_oid to the system table vs_procedures.

After you create the tokenizer, Vertica writes the name and proc_oid to the system table vs_procedures. You must retrieve the tokenizer's proc_oid to perform additional configuration.

Enter the following query, substituting your own tokenizer name:

=> SELECT proc_oid FROM vs_procedures WHERE procedure_name = 'fooTokenizer';

3 - Set tokenizer parameters

Use the tokenizer's proc_oid to configure the tokenizer.

Use the tokenizer's proc_oid to configure the tokenizer. See Configuring a tokenizer for more information about getting the proc_oid of your tokenizer. The following examples show how you can configure each of the tokenizer parameters:

Configure stop words:

=> SELECT v_txtindex.SET_TOKENIZER_PARAMETER('stopwordscaseinsensitive','for,the' USING PARAMETERS proc_oid='45035996274128376');

Configure major separators:

=> SELECT v_txtindex.SET_TOKENIZER_PARAMETER('majorseparators', E'{}()&[]' USING PARAMETERS proc_oid='45035996274128376');

Configure minor separators:

=> SELECT v_txtindex.SET_TOKENIZER_PARAMETER('minorseparators', '-,$' USING PARAMETERS proc_oid='45035996274128376');

Configure minimum length:

=> SELECT v_txtindex.SET_TOKENIZER_PARAMETER('minlength', '1' USING PARAMETERS proc_oid='45035996274128376');

Configure maximum length:

=> SELECT v_txtindex.SET_TOKENIZER_PARAMETER('maxlength', '140' USING PARAMETERS proc_oid='45035996274128376');

Configure ngramssize:

=> SELECT v_txtindex.SET_TOKENIZER_PARAMETER('ngramssize', '2' USING PARAMETERS proc_oid='45035996274128376');

Lock tokenizer parameters

When you finish configuring the tokenizer, set the parameter, used, to True. After changing this setting, you are no longer able to alter the parameters of the tokenizer. At this point, the tokenizer is ready for you to use to create a text index.

Configure the used parameter:

=> SELECT v_txtindex.SET_TOKENIZER_PARAMETER('used', 'True' USING PARAMETERS proc_oid='45035996274128376');

4 - View tokenizer parameters

After creating a custom tokenizer, you can view the tokenizer's parameter settings in either of two ways:.

After creating a custom tokenizer, you can view the tokenizer's parameter settings in either of two ways:

Use the GET_TOKENIZER_PARAMETER — View individual tokenizer parameter settings.
Use the READ_CONFIG_FILE — View all tokenizer parameter settings.

View individual tokenizer parameter settings

If you need to see an individual parameter setting for a tokenizer, you can use GET_TOKENIZER_PARAMETER to see specific tokenizer parameter settings:

=> SELECT v_txtindex.GET_TOKENIZER_PARAMETER('majorseparators' USING PARAMETERS proc_oid='45035996274126984');
 getTokenizerParameter
-----------------------
 {}()&[]
(1 row)

For more information, see GET_TOKENIZER_PARAMETER.

View all tokenizer parameter settings

If you need to see all of the parameters for a tokenizer, you can use READ_CONFIG_FILE to see all of the parameter settings for your tokenizer:

=> SELECT v_txtindex.READ_CONFIG_FILE( USING PARAMETERS proc_oid='45035996274126984') OVER();
               config_key | config_value
--------------------------+---------------
          majorseparators | {}()&[]
                maxlength | 140
                minlength | 1
          minorseparators | -,$
 stopwordscaseinsensitive | for,the
                     type | 1
                     used | true
(7 rows)

If the parameter, used, is set to False, then you can only view the parameters that have been applied to the tokenizer.

Note

Vertica automatically supplies the value for Type, unless you are using an ngram tokenizer, which allows you to set it.

For more information, see READ_CONFIG_FILE.

5 - Delete tokenizer config file

Use the DELETE_TOKENIZER_CONFIG_FILE function to delete a tokenizer configuration file.

Use the DELETE_TOKENIZER_CONFIG_FILE function to delete a tokenizer configuration file. This function does not delete the User- Defined Transform Function (UDTF). It only deletes the configuration file associated with the UDTF.

Delete the tokenizer configuration file when the parameter, used, is set to False:

=> SELECT v_txtindex.DELETE_TOKENIZER_CONFIG_FILE(USING PARAMETERS proc_oid='45035996274127086');

Delete the tokenizer configuration file with the parameter, confirm, set to True. This setting forces the configuration file deletion, even if the parameter, used, is also set to True:

=> SELECT v_txtindex.DELETE_TOKENIZER_CONFIG_FILE(USING PARAMETERS proc_oid='45035996274126984', confirm='true');

For more information, see DELETE_TOKENIZER_CONFIG_FILE.

Configuring a tokenizer

1 - Tokenizer base configuration

2 - RetrieveTokenizerproc_oid

3 - Set tokenizer parameters

Lock tokenizer parameters

See also

4 - View tokenizer parameters

View individual tokenizer parameter settings

View all tokenizer parameter settings

Note

5 - Delete tokenizer config file