This section contains text search functions specific to Vertica.
Text search functions
- 1: DELETE_TOKENIZER_CONFIG_FILE
- 2: GET_TOKENIZER_PARAMETER
- 3: READ_CONFIG_FILE
- 4: SET_TOKENIZER_PARAMETER
1 - DELETE_TOKENIZER_CONFIG_FILE
Deletes a tokenizer configuration file.
Syntax
SELECT v_txtindex.DELETE_TOKENIZER_CONFIG_FILE (USING PARAMETERS proc_oid='proc_oid', confirm={true | false });
Parameters
confirm = [true | false]
- Boolean flag. Indicates that the configuration file should be removed even if the tokenizer is still in use.
  - True: Force deletion of the tokenizer even if the used parameter value is True (see the examples below).
  - False: Delete the tokenizer only if the used parameter value is False.
  Default: False
proc_oid
- A unique identifier assigned to a tokenizer when it is created. Users must query the system table vs_procedures to get the proc_oid for a given tokenizer name. See Configuring a tokenizer for more information.
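For example, a lookup like the following might be used to find the proc_oid. The column names procedure_name and proc_oid and the tokenizer name MyTokenizer are assumptions for illustration only, so verify them against vs_procedures in your database:
=> -- Hypothetical lookup: column and tokenizer names are assumptions, not confirmed by this reference
=> SELECT proc_oid FROM vs_procedures WHERE procedure_name = 'MyTokenizer';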
Examples
The following example shows how you can use DELETE_TOKENIZER_CONFIG_FILE to delete the tokenizer configuration file:
=> SELECT v_txtindex.DELETE_TOKENIZER_CONFIG_FILE (USING PARAMETERS proc_oid='45035996274126984');
DELETE_TOKENIZER_CONFIG_FILE
------------------------------
t
(1 row)
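If the tokenizer is still flagged as used, the configuration file is not removed unless you also pass confirm. The following sketch forces the deletion; the proc_oid is the same hypothetical identifier as above:
=> SELECT v_txtindex.DELETE_TOKENIZER_CONFIG_FILE (USING PARAMETERS proc_oid='45035996274126984', confirm=true);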
2 - GET_TOKENIZER_PARAMETER
Returns the configuration parameter for a given tokenizer.
Syntax
SELECT v_txtindex.GET_TOKENIZER_PARAMETER(parameter_name USING PARAMETERS proc_oid='proc_oid');
Parameters
parameter_name
- Name of the parameter to be returned. One of the following:
  - stopWordsCaseInsensitive
  - minorSeparators
  - majorSeparators
  - minLength
  - maxLength
  - ngramsSize
  - used
proc_oid
- A unique identifier assigned to a tokenizer when it is created. Users must query the system table vs_procedures to get the proc_oid for a given tokenizer name. See Configuring a tokenizer for more information.
Examples
The following examples show how you can use GET_TOKENIZER_PARAMETER.
Return the stop words used in a tokenizer:
=> SELECT v_txtindex.GET_TOKENIZER_PARAMETER('stopwordscaseinsensitive' USING PARAMETERS proc_oid='45035996274126984');
getTokenizerParameter
-----------------------
devil,TODAY,the,fox
(1 row)
Return the major separators used in a tokenizer:
=> SELECT v_txtindex.GET_TOKENIZER_PARAMETER('majorseparators' USING PARAMETERS proc_oid='45035996274126984');
getTokenizerParameter
-----------------------
{}()&[]
(1 row)
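Any of the other listed parameters can be retrieved the same way. For example, this sketch checks whether the tokenizer has been marked as used (again with a hypothetical proc_oid):
=> SELECT v_txtindex.GET_TOKENIZER_PARAMETER('used' USING PARAMETERS proc_oid='45035996274126984');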
3 - READ_CONFIG_FILE
Reads and returns the key-value pairs of all the parameters of a given tokenizer.
You must use the OVER() clause with this function.
Syntax
SELECT v_txtindex.READ_CONFIG_FILE(USING PARAMETERS proc_oid='proc_oid') OVER ()
Parameters
proc_oid
- A unique identifier assigned to a tokenizer when it is created. Users must query the system table vs_procedures to get the proc_oid for a given tokenizer name. See Configuring a tokenizer for more information.
Examples
The following example shows how you can use READ_CONFIG_FILE to return the parameters associated with a tokenizer:
=> SELECT v_txtindex.READ_CONFIG_FILE(USING PARAMETERS proc_oid='45035996274126984') OVER();
config_key | config_value
--------------------------+---------------------
majorseparators | {}()&[]
stopwordscaseinsensitive | devil,TODAY,the,fox
(2 rows)
4 - SET_TOKENIZER_PARAMETER
Configures the tokenizer parameters.
Important
\n, \t, and \r must be entered as Unicode using Vertica notation, for example U&'\000D', or using Vertica escaping notation, for example E'\r'. Otherwise, they are taken literally as two separate characters, for example "\" and "r".
Syntax
SELECT v_txtindex.SET_TOKENIZER_PARAMETER (parameter_name, parameter_value USING PARAMETERS proc_oid='proc_oid')
Parameters
parameter_name
- Name of the parameter to be configured. Use one of the following:
  - stopwordsCaseInsensitive: List of stop words. All the tokens that belong to the list are ignored. Vertica supports separators and stop words up to the first 256 Unicode characters. If you want to define a stop word that contains a comma or a backslash, it must be escaped. For example: "Dear Jack\," or "Dear Jack\\"
    Default: '' (empty list)
  - majorSeparators: List of major separators. Enclose in quotes with no spaces between.
    Default: E' []<>(){}|!;,''"*&?+\r\n\t'
  - minorSeparators: List of minor separators. Enclose in quotes with no spaces between.
    Default: E'/:=@.-$#%\\_'
  - minLength: Minimum length a token can have. Type Integer. Must be greater than 0.
    Default: '2'
  - maxLength: Maximum length a token can have. Type Integer. Cannot be greater than 1024 bytes. For information about increasing the token size, see Text search parameters.
    Default: '128'
  - ngramsSize: Integer value greater than zero. Use only with ngram tokenizers.
    Default: '3'
  - used: Indicates when a tokenizer configuration can no longer be changed. Type Boolean. After you set used to True, any calls to SET_TOKENIZER_PARAMETER fail. You must set used to True before using the configured tokenizer; doing so prevents the configuration from being modified after it has been used to create a text index (see the final example below).
    Default: False
parameter_value
- The value of a configuration parameter.
  If you want to disable minorSeparators or stopWordsCaseInsensitive, set their values to ''.
proc_oid
- A unique identifier assigned to a tokenizer when it is created. Users must query the system table vs_procedures to get the proc_oid for a given tokenizer name. See Configuring a tokenizer for more information.
Examples
The following examples show how you can use SET_TOKENIZER_PARAMETER to configure stop words and separators.
Configure the stop words of a tokenizer:
=> SELECT v_txtindex.SET_TOKENIZER_PARAMETER('stopwordsCaseInsensitive', 'devil,TODAY,the,fox' USING PARAMETERS proc_oid='45035996274126984');
SET_TOKENIZER_PARAMETER
-------------------------
t
(1 row)
Configure the major separators of a tokenizer:
=> SELECT v_txtindex.SET_TOKENIZER_PARAMETER('majorSeparators',E'{}()&[]' USING PARAMETERS proc_oid='45035996274126984');
SET_TOKENIZER_PARAMETER
-------------------------
t
(1 row)
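To include a control character such as a newline among the separators, use the Vertica escape notation described in the Important note above. This is a sketch against the same hypothetical proc_oid:
=> SELECT v_txtindex.SET_TOKENIZER_PARAMETER('majorSeparators', E'{}()&[]\n' USING PARAMETERS proc_oid='45035996274126984');
When the configuration is complete, mark the tokenizer as used so that its configuration can no longer change before you use it to create a text index:
=> SELECT v_txtindex.SET_TOKENIZER_PARAMETER('used', 'True' USING PARAMETERS proc_oid='45035996274126984');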