SET_TOKENIZER_PARAMETER
Configures the tokenizer parameters.
Important
\n, \t, and \r must be entered either as Unicode using Vertica notation, for example U&'\000D', or using Vertica escaping notation, for example E'\r'. Otherwise, each is taken literally as two separate characters, for example "\" and "r".
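As a minimal sketch of the note above, the following calls set a separator list that contains a space, a tab, a carriage return, and a line feed, once with escaping notation and once with Unicode notation. The proc_oid value is borrowed from the examples in this topic and stands in for your own tokenizer's identifier. The code points 0009 (tab) and 000A (line feed) are standard Unicode values that this topic does not list.

```sql
-- Escaping notation: E'...' interprets \t, \r, and \n as single control characters.
SELECT v_txtindex.SET_TOKENIZER_PARAMETER('majorSeparators', E' \t\r\n'
       USING PARAMETERS proc_oid='45035996274126984');

-- Unicode notation: U&'...' takes four-digit hexadecimal code points
-- (0009 = tab, 000D = carriage return, 000A = line feed).
SELECT v_txtindex.SET_TOKENIZER_PARAMETER('majorSeparators', U&' \0009\000D\000A'
       USING PARAMETERS proc_oid='45035996274126984');

-- By contrast, a plain literal such as ' \t\r\n' would register the
-- backslash and the letters t, r, and n as separate separator characters.
```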
Syntax
SELECT v_txtindex.SET_TOKENIZER_PARAMETER (parameter_name, parameter_value USING PARAMETERS proc_oid='proc_oid')
Parameters
parameter_name
- Name of the parameter to be configured.
Use one of the following:
- stopwordsCaseInsensitive: List of stop words. All tokens that belong to the list are ignored. Vertica supports separators and stop words up to the first 256 Unicode characters. To define a stop word that contains a comma or a backslash, escape that character with a backslash. For example: "Dear Jack\," "Dear Jack\\"
Default: '' (empty list)
- majorSeparators: List of major separators. Enclose in quotes, with no spaces between the separators.
Default: E' []<>(){}|!;,''"*&?+\r\n\t'
- minorSeparators: List of minor separators. Enclose in quotes, with no spaces between the separators.
Default: E'/:=@.-$#%\\_'
- minLength: Minimum length a token can have. Type Integer. Must be greater than 0.
Default: '2'
- maxLength: Maximum length a token can have. Type Integer. Cannot be greater than 1024 bytes. For information about increasing the token size, see Text search parameters.
Default: '128'
- ngramsSize: Integer value greater than zero. Use only with ngram tokenizers.
Default: '3'
- used: Indicates when a tokenizer configuration cannot be changed. Type Boolean. After you set used to True, any calls to SET_TOKENIZER_PARAMETER fail. You must set used to True before using the configured tokenizer. Doing so prevents the configuration from being modified after it has been used to create a text index.
Default: False
parameter_value
- The value of a configuration parameter.
To disable minorSeparators or stopwordsCaseInsensitive, set their values to ''.
proc_oid
- A unique identifier assigned to a tokenizer when it is created. Users must query the system table vs_procedures to get the proc_oid for a given tokenizer name. See Configuring a tokenizer for more information.
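The lookup described above might look like the following sketch. The tokenizer name fooTokenizer is hypothetical, and the column name procedure_name is an assumption about the layout of vs_procedures. Check the table definition in your installation before relying on it.

```sql
-- Hypothetical tokenizer name; procedure_name is an assumed column name.
SELECT proc_oid
FROM vs_procedures
WHERE procedure_name = 'fooTokenizer';
```

The value returned is what you pass as proc_oid in the USING PARAMETERS clause.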
Examples
The following examples show how you can use SET_TOKENIZER_PARAMETER to configure stop words and separators.
Configure the stop words of a tokenizer:
=> SELECT v_txtindex.SET_TOKENIZER_PARAMETER('stopwordsCaseInsensitive', 'devil,TODAY,the,fox' USING PARAMETERS proc_oid='45035996274126984');
SET_TOKENIZER_PARAMETER
-------------------------
t
(1 row)
Configure the major separators of a tokenizer:
=> SELECT v_txtindex.SET_TOKENIZER_PARAMETER('majorSeparators', E'{}()&[]' USING PARAMETERS proc_oid='45035996274126984');
SET_TOKENIZER_PARAMETER
-------------------------
t
(1 row)
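The remaining parameters follow the same pattern. The sketch below reuses the proc_oid from the examples above and assumes that integer and Boolean values are passed as quoted strings, as the defaults listed under Parameters suggest. It disables minor separators, raises the minimum token length, and then sets used so that the configuration can no longer be changed.

```sql
-- Disable minor separators by setting an empty value.
SELECT v_txtindex.SET_TOKENIZER_PARAMETER('minorSeparators', ''
       USING PARAMETERS proc_oid='45035996274126984');

-- Ignore tokens shorter than three characters (value passed as a string; assumed form).
SELECT v_txtindex.SET_TOKENIZER_PARAMETER('minLength', '3'
       USING PARAMETERS proc_oid='45035996274126984');

-- Lock the configuration. Run this last: later calls for this tokenizer fail.
SELECT v_txtindex.SET_TOKENIZER_PARAMETER('used', 'True'
       USING PARAMETERS proc_oid='45035996274126984');
```

Setting used comes last because every other parameter must be in place before the configuration is locked.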