This section contains text search functions specific to Vertica.
Text search functions
- 1: DELETE_TOKENIZER_CONFIG_FILE
- 2: GET_TOKENIZER_PARAMETER
- 3: READ_CONFIG_FILE
- 4: SET_TOKENIZER_PARAMETER
1 - DELETE_TOKENIZER_CONFIG_FILE
Deletes a tokenizer configuration file.
Syntax
SELECT v_txtindex.DELETE_TOKENIZER_CONFIG_FILE (USING PARAMETERS proc_oid='proc_oid', confirm={true | false });
Parameters
confirm = [true | false]
- Boolean flag. Indicates that the configuration file should be removed even if the tokenizer is still in use.
  - True: Force deletion of the tokenizer even if the used parameter value is True (see the examples below).
  - False: Delete the tokenizer only if the used parameter value is False.
  Default: False
proc_oid
- A unique identifier assigned to a tokenizer when it is created. Users must query the system table vs_procedures to get the proc_oid for a given tokenizer name. See Configuring a tokenizer for more information.
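For example, a lookup like the following might be used to find the proc_oid. The column names procedure_name and proc_oid and the tokenizer name MyTokenizer are assumptions for illustration only, so verify them against vs_procedures in your database:
=> -- Hypothetical lookup: column and tokenizer names are assumptions, not confirmed by this reference
=> SELECT proc_oid FROM vs_procedures WHERE procedure_name = 'MyTokenizer';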
Examples
The following example shows how you can use DELETE_TOKENIZER_CONFIG_FILE to delete the tokenizer configuration file:
=> SELECT v_txtindex.DELETE_TOKENIZER_CONFIG_FILE (USING PARAMETERS proc_oid='45035996274126984');
DELETE_TOKENIZER_CONFIG_FILE
------------------------------
t
(1 row)
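If the tokenizer is still flagged as used, the configuration file is not removed unless you also pass confirm. The following sketch forces the deletion; the proc_oid is the same hypothetical identifier as above:
=> SELECT v_txtindex.DELETE_TOKENIZER_CONFIG_FILE (USING PARAMETERS proc_oid='45035996274126984', confirm=true);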
2 - GET_TOKENIZER_PARAMETER
Returns the configuration parameter for a given tokenizer.
Syntax
SELECT v_txtindex.GET_TOKENIZER_PARAMETER(parameter_name USING PARAMETERS proc_oid='proc_oid');
Parameters
parameter_name
- Name of the parameter to be returned. One of the following:
  - stopWordsCaseInsensitive
  - minorSeparators
  - majorSeparators
  - minLength
  - maxLength
  - ngramsSize
  - used
proc_oid
- A unique identifier assigned to a tokenizer when it is created. Users must query the system table vs_procedures to get the proc_oid for a given tokenizer name. See Configuring a tokenizer for more information.
Examples
The following examples show how you can use GET_TOKENIZER_PARAMETER.
Return the stop words used in a tokenizer:
=> SELECT v_txtindex.GET_TOKENIZER_PARAMETER('stopwordscaseinsensitive' USING PARAMETERS proc_oid='45035996274126984');
getTokenizerParameter
-----------------------
devil,TODAY,the,fox
(1 row)
Return the major separators used in a tokenizer:
=> SELECT v_txtindex.GET_TOKENIZER_PARAMETER('majorseparators' USING PARAMETERS proc_oid='45035996274126984');
getTokenizerParameter
-----------------------
{}()&[]
(1 row)
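Any of the other listed parameters can be retrieved the same way. For example, this sketch checks whether the tokenizer has been marked as used (again with a hypothetical proc_oid):
=> SELECT v_txtindex.GET_TOKENIZER_PARAMETER('used' USING PARAMETERS proc_oid='45035996274126984');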
3 - READ_CONFIG_FILE
Reads and returns the key-value pairs of all the parameters of a given tokenizer.
You must use the OVER() clause with this function.
Syntax
SELECT v_txtindex.READ_CONFIG_FILE(USING PARAMETERS proc_oid='proc_oid') OVER ()
Parameters
proc_oid
- A unique identifier assigned to a tokenizer when it is created. Users must query the system table vs_procedures to get the proc_oid for a given tokenizer name. See Configuring a tokenizer for more information.
Examples
The following example shows how you can use READ_CONFIG_FILE to return the parameters associated with a tokenizer:
=> SELECT v_txtindex.READ_CONFIG_FILE(USING PARAMETERS proc_oid='45035996274126984') OVER();
config_key | config_value
--------------------------+---------------------
majorseparators | {}()&[]
stopwordscaseinsensitive | devil,TODAY,the,fox
(2 rows)
4 - SET_TOKENIZER_PARAMETER
Configures the tokenizer parameters.
Important
\n, \t, and \r must be entered as Unicode using Vertica notation, for example U&'\000D', or using Vertica escaping notation, for example E'\r'. Otherwise, they are taken literally as two separate characters, for example "\" and "r".
Syntax
SELECT v_txtindex.SET_TOKENIZER_PARAMETER (parameter_name, parameter_value USING PARAMETERS proc_oid='proc_oid')
Parameters
parameter_name
- Name of the parameter to be configured. Use one of the following:
  - stopwordsCaseInsensitive: List of stop words. All the tokens that belong to the list are ignored. Vertica supports separators and stop words up to the first 256 Unicode characters. If you want to define a stop word that contains a comma or a backslash, it must be escaped. For example: "Dear Jack\," or "Dear Jack\\"
    Default: '' (empty list)
  - majorSeparators: List of major separators. Enclose in quotes with no spaces between.
    Default: E' []<>(){}|!;,''"*&?+\r\n\t'
  - minorSeparators: List of minor separators. Enclose in quotes with no spaces between.
    Default: E'/:=@.-$#%\\_'
  - minLength: Minimum length a token can have. Type Integer. Must be greater than 0.
    Default: '2'
  - maxLength: Maximum length a token can have. Type Integer. Cannot be greater than 1024 bytes. For information about increasing the token size, see Text search parameters.
    Default: '128'
  - ngramsSize: Integer value greater than zero. Use only with ngram tokenizers.
    Default: '3'
  - used: Indicates when a tokenizer configuration can no longer be changed. Type Boolean. After you set used to True, any calls to SET_TOKENIZER_PARAMETER fail. You must set used to True before using the configured tokenizer; doing so prevents the configuration from being modified after it has been used to create a text index (see the final example below).
    Default: False
parameter_value
- The value of a configuration parameter.
  If you want to disable minorSeparators or stopWordsCaseInsensitive, set their values to ''.
proc_oid
- A unique identifier assigned to a tokenizer when it is created. Users must query the system table vs_procedures to get the proc_oid for a given tokenizer name. See Configuring a tokenizer for more information.
Examples
The following examples show how you can use SET_TOKENIZER_PARAMETER to configure stop words and separators.
Configure the stop words of a tokenizer:
=> SELECT v_txtindex.SET_TOKENIZER_PARAMETER('stopwordsCaseInsensitive', 'devil,TODAY,the,fox' USING PARAMETERS proc_oid='45035996274126984');
SET_TOKENIZER_PARAMETER
-------------------------
t
(1 row)
Configure the major separators of a tokenizer:
=> SELECT v_txtindex.SET_TOKENIZER_PARAMETER('majorSeparators',E'{}()&[]' USING PARAMETERS proc_oid='45035996274126984');
SET_TOKENIZER_PARAMETER
-------------------------
t
(1 row)
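To include a control character such as a newline among the separators, use the Vertica escape notation described in the Important note above. This is a sketch against the same hypothetical proc_oid:
=> SELECT v_txtindex.SET_TOKENIZER_PARAMETER('majorSeparators', E'{}()&[]\n' USING PARAMETERS proc_oid='45035996274126984');
When the configuration is complete, mark the tokenizer as used so that its configuration can no longer change before you use it to create a text index:
=> SELECT v_txtindex.SET_TOKENIZER_PARAMETER('used', 'True' USING PARAMETERS proc_oid='45035996274126984');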