Vertica provides default stemmers and tokenizers. You can also create your own custom stemmers and tokenizers. The following topics explain the default stemmers and tokenizers, and the requirements for creating custom stemmers and tokenizers in Vertica.
Stemmers and tokenizers
- 1: Vertica stemmers
- 2: Vertica tokenizers
- 2.1: Preconfigured tokenizers
- 2.2: Advanced log tokenizer
- 2.3: Basic log tokenizer
- 2.4: Whitespace log tokenizer
- 2.5: ICU tokenizer
- 3: Configuring a tokenizer
- 3.1: Tokenizer base configuration
- 3.2: Retrieve tokenizer proc_oid
- 3.3: Set tokenizer parameters
- 3.4: View tokenizer parameters
- 3.5: Delete tokenizer config file
- 4: Requirements for custom stemmers and tokenizers
1 - Vertica stemmers
Vertica stemmers use the Porter stemming algorithm to find words derived from the same base/root word. For example, if you perform a search on a text index for the keyword database, you might also want to get results containing the word databases.
To achieve this type of matching, Vertica stores words in their stemmed form when using any of the v_txtindex stemmers.
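As a rough illustration of why stemmed storage makes this kind of matching work, the Python sketch below implements only step 1a of the Porter algorithm (plural suffix handling). It is not Vertica's stemmer, and the function names `porter_step_1a` and `stem_case_insensitive` are hypothetical. The full algorithm applies further steps, so the stems Vertica actually stores may differ from the values shown here.

```python
def porter_step_1a(word: str) -> str:
    """Apply only step 1a of the Porter algorithm (plural suffixes)."""
    if word.endswith("sses"):
        return word[:-2]   # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]   # ponies -> poni
    if word.endswith("ss"):
        return word        # caress -> caress
    if word.endswith("s"):
        return word[:-1]   # databases -> database
    return word


def stem_case_insensitive(word: str) -> str:
    """Lowercase first, then stem, as a case-insensitive stemmer would."""
    return porter_step_1a(word.lower())


# Both forms reduce to the same term, so a search for one can match the other.
print(stem_case_insensitive("Databases"))  # database
print(stem_case_insensitive("database"))   # database
```

Because the index and the search keyword pass through the same function, the comparison happens on the reduced form rather than on the original spelling.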
The Vertica Analytics Platform provides the following stemmers:
Name | Description |
---|---|
v_txtindex.Stemmer(long varchar) | Not sensitive to case; outputs lowercase words. Stems strings from a Vertica table. Alias of StemmerCaseInsensitive. |
v_txtindex.StemmerCaseSensitive(long varchar) | Sensitive to case. Stems strings from a Vertica table. |
v_txtindex.StemmerCaseInsensitive(long varchar) | Default stemmer used if no stemmer is specified when creating a text index. Not sensitive to case; outputs lowercase words. Stems strings from a Vertica table. |
v_txtindex.caseInsensitiveNoStemming(long varchar) | Not sensitive to case; outputs lowercase words. Does not use the Porter Stemming algorithm. |
Examples
The following examples show how to use a stemmer when creating a text index.
Create a text index using the StemmerCaseInsensitive stemmer:
=> CREATE TEXT INDEX idx_100 ON top_100 (id, feedback) STEMMER v_txtindex.StemmerCaseInsensitive(long varchar)
TOKENIZER v_txtindex.StringTokenizer(long varchar);
Create a text index using the StemmerCaseSensitive stemmer:
=> CREATE TEXT INDEX idx_unstruc ON unstruc_data (__identity__, __raw__) STEMMER v_txtindex.StemmerCaseSensitive(long varchar)
TOKENIZER public.FlexTokenizer(long varbinary);
Create a text index without using a stemmer:
=> CREATE TEXT INDEX idx_logs ON sys_logs (id, message) STEMMER NONE TOKENIZER v_txtindex.StringTokenizer(long varchar);
2 - Vertica tokenizers
A tokenizer does the following:
- Receives a stream of characters.
- Breaks the stream into individual tokens that usually correspond to individual words.
- Returns a stream of tokens.
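The three steps above can be sketched as a Python generator. This is a minimal illustration of the stream-in, stream-out idea, not how Vertica implements its tokenizers; the name `whitespace_tokens` is hypothetical.

```python
from typing import Iterable, Iterator


def whitespace_tokens(chars: Iterable[str]) -> Iterator[str]:
    """Consume characters one at a time and yield whitespace-delimited tokens."""
    current = []
    for ch in chars:
        if ch.isspace():
            if current:            # a token just ended
                yield "".join(current)
                current = []
        else:
            current.append(ch)
    if current:                    # flush the final token
        yield "".join(current)


print(list(whitespace_tokens("Built outbound TCP connection")))
# ['Built', 'outbound', 'TCP', 'connection']
```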
2.1 - Preconfigured tokenizers
The Vertica Analytics Platform provides the following preconfigured tokenizers:
Name | Description |
---|---|
public.FlexTokenizer(LONG VARBINARY) | Splits the values in your flex table by white space. |
v_txtindex.StringTokenizer(LONG VARCHAR) | Splits the string into words by splitting on white space. |
v_txtindex.StringTokenizerDelim(string VARCHAR, 'delimiter' CHAR(1)) | Splits a string into tokens using the specified delimiter character. |
v_txtindex.AdvancedLogTokenizer | Uses the default parameters for all tokenizer parameters. For more information, see Advanced log tokenizer. |
v_txtindex.BasicLogTokenizer | Uses the default values for all tokenizer parameters except minorseparators, which is set to an empty list. For more information, see Basic log tokenizer. |
v_txtindex.WhitespaceLogTokenizer | Uses the default values for all tokenizer parameters except majorseparators, which is set to E' \t\n\f\r', and minorseparators, which is set to an empty list. For more information, see Whitespace log tokenizer. |
Vertica also provides the following tokenizer, which is not preconfigured:
Name | Description |
---|---|
v_txtindex.ICUTokenizer | Supports multiple languages. Tokenizes based on the conventions of the language you set in the locale parameter. For more information, see ICU Tokenizer. |
Examples
The following examples show how you can use a preconfigured tokenizer when creating a text index.
Use the StringTokenizer to create an index on the top_100 table:
=> CREATE TEXT INDEX idx_100 ON top_100 (id, feedback)
TOKENIZER v_txtindex.StringTokenizer(long varchar)
STEMMER v_txtindex.StemmerCaseInsensitive(long varchar);
Use the FlexTokenizer to create an index from unstructured data:
=> CREATE TEXT INDEX idx_unstruc ON unstruc_data (__identity__, __raw__)
TOKENIZER public.FlexTokenizer(long varbinary)
STEMMER v_txtindex.StemmerCaseSensitive(long varchar);
Use the StringTokenizerDelim to split a string at the specified delimiter:
=> CREATE TABLE string_table (word VARCHAR(100), delim VARCHAR);
CREATE TABLE
=> COPY string_table FROM STDIN DELIMITER ',';
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>>
>> SingleWord,dd
>> Break On Spaces,' '
>> Break:On:Colons,:
>> \.
=> SELECT * FROM string_table;
word | delim
-----------------+-------
SingleWord | dd
Break On Spaces |
Break:On:Colons | :
(3 rows)
=> SELECT v_txtindex.StringTokenizerDelim(word,delim) OVER () FROM string_table;
words
-----------------
Break
On
Colons
SingleWor
Break
On
Spaces
(7 rows)
=> SELECT v_txtindex.StringTokenizerDelim(word,delim) OVER (PARTITION BY word), word as input FROM string_table;
words | input
-----------------+-----------------
Break | Break:On:Colons
On | Break:On:Colons
Colons | Break:On:Colons
SingleWor | SingleWord
Break | Break On Spaces
On | Break On Spaces
Spaces | Break On Spaces
(7 rows)
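The `SingleWor` rows above suggest that only the first character of a longer delimiter value is used and that empty tokens are discarded. The Python sketch below models that reading. Both behaviors are assumptions inferred from the sample output, not documented guarantees, and `tokenize_delim` is a hypothetical name.

```python
from typing import List


def tokenize_delim(text: str, delimiter: str) -> List[str]:
    """Split text on a single-character delimiter and drop empty tokens."""
    if not delimiter:
        return [text] if text else []
    sep = delimiter[0]  # assumption: only the first character counts, as with CHAR(1)
    return [tok for tok in text.split(sep) if tok]


print(tokenize_delim("Break:On:Colons", ":"))  # ['Break', 'On', 'Colons']
print(tokenize_delim("Break On Spaces", " "))  # ['Break', 'On', 'Spaces']
print(tokenize_delim("SingleWord", "dd"))      # ['SingleWor']
```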
2.2 - Advanced log tokenizer
Returns tokens that can include minor separators. You can use this tokenizer in situations when your tokens are separated by whitespace or various punctuation. The advanced log tokenizer offers more granularity than the basic log tokenizer in defining separators through the addition of minor separators. This approach is frequently appropriate for analyzing log files.
Important
If you create a database with no tables and the k-safety has increased, you must rebalance your data using REBALANCE_CLUSTER before using a Vertica tokenizer.
Parameters
Parameter Name | Parameter Value |
---|---|
stopwordscaseinsensitive | '' |
minorseparators | E'/:=@.-$#%\\_' |
majorseparators | E' []<>(){}|!;,''"*&?+\r\n\t' |
minLength | '2' |
maxLength | '128' |
used | 'True' |
Examples
The following example shows how you can create a text index, from the table foo, using the Advanced Log Tokenizer without a stemmer.
=> CREATE TABLE foo (id INT PRIMARY KEY NOT NULL,text VARCHAR(250));
=> COPY foo FROM STDIN;
End with a backslash and a period on a line by itself.
>> 1|2014-05-10 00:00:05.700433 %ASA-6-302013: Built outbound TCP connection 9986454 for outside:101.123.123.111/443 (101.123.123.111/443)
>> \.
=> CREATE PROJECTION foo_projection AS SELECT * FROM foo ORDER BY id
SEGMENTED BY HASH(id) ALL NODES KSAFE;
=> CREATE TEXT INDEX indexfoo_AdvancedLogTokenizer ON foo (id, text)
TOKENIZER v_txtindex.AdvancedLogTokenizer(LONG VARCHAR) STEMMER NONE;
=> SELECT * FROM indexfoo_AdvancedLogTokenizer;
token | doc_id
-----------------------------+--------
%ASA-6-302013: | 1
00 | 1
00:00:05.700433 | 1
05 | 1
10 | 1
101 | 1
101.123.123.111/443 | 1
111 | 1
123 | 1
2014 | 1
2014-05-10 | 1
302013 | 1
443 | 1
700433 | 1
9986454 | 1
ASA | 1
Built | 1
TCP | 1
connection | 1
for | 1
outbound | 1
outside | 1
outside:101.123.123.111/443 | 1
(23 rows)
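The token list above is consistent with a two-level split: the text is first split on major separators, each resulting token is kept whole, and each is then split again on minor separators. The Python sketch below models that reading using the default parameter values from the table. It is an illustration built from the documented output, not Vertica's implementation, and `log_tokens` is a hypothetical name.

```python
import re
from typing import List

# Default separators from the parameter table, written as Python strings.
MAJOR = " []<>(){}|!;,'\"*&?+\r\n\t"
MINOR = "/:=@.-$#%\\_"


def log_tokens(text: str, major: str = MAJOR, minor: str = MINOR,
               min_length: int = 2, max_length: int = 128) -> List[str]:
    """Return the sorted, de-duplicated tokens for one document."""
    def split_on(chars: str, s: str) -> List[str]:
        if not chars:
            return [s]
        return re.split("[" + re.escape(chars) + "]", s)

    tokens = set()
    for major_token in split_on(major, text):
        tokens.add(major_token)                      # whole token, minor separators kept
        tokens.update(split_on(minor, major_token))  # pieces between minor separators
    return sorted(t for t in tokens if min_length <= len(t) <= max_length)


line = ("2014-05-10 00:00:05.700433 %ASA-6-302013: Built outbound TCP connection "
        "9986454 for outside:101.123.123.111/443 (101.123.123.111/443)")
for token in log_tokens(line):
    print(token)
```

Run on the sample row, the sketch yields the same 23 tokens as the index above. Passing `minor=""` leaves only the 11 whole tokens, which is the difference the basic log tokenizer makes.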
2.3 - Basic log tokenizer
Returns tokens that exclude specified minor separators. You can use this tokenizer in situations when your tokens are separated by whitespace or various punctuation. This approach is frequently appropriate for analyzing log files.
Important
If you create a database with no tables and the k-safety has increased, you must rebalance your data using REBALANCE_CLUSTER before using a Vertica tokenizer.
Parameters
Parameter Name | Parameter Value |
---|---|
stopwordscaseinsensitive | '' |
minorseparators | '' |
majorseparators | E' []<>(){}|!;,''"*&?+\r\n\t' |
minLength | '2' |
maxLength | '128' |
used | 'True' |
Examples
The following example shows how you can create a text index, from the table foo, using the Basic Log Tokenizer without a stemmer.
=> CREATE TABLE foo (id INT PRIMARY KEY NOT NULL,text VARCHAR(250));
=> COPY foo FROM STDIN;
End with a backslash and a period on a line by itself.
>> 1|2014-05-10 00:00:05.700433 %ASA-6-302013: Built outbound TCP connection 9986454 for outside:101.123.123.111/443 (101.123.123.111/443)
>> \.
=> CREATE PROJECTION foo_projection AS SELECT * FROM foo ORDER BY id
SEGMENTED BY HASH(id) ALL NODES KSAFE;
=> CREATE TEXT INDEX indexfoo_BasicLogTokenizer ON foo (id, text)
TOKENIZER v_txtindex.BasicLogTokenizer(LONG VARCHAR) STEMMER NONE;
=> SELECT * FROM indexfoo_BasicLogTokenizer;
token | doc_id
-----------------------------+--------
%ASA-6-302013: | 1
00:00:05.700433 | 1
101.123.123.111/443 | 1
2014-05-10 | 1
9986454 | 1
Built | 1
TCP | 1
connection | 1
for | 1
outbound | 1
outside:101.123.123.111/443 | 1
(11 rows)
2.4 - Whitespace log tokenizer
Returns only tokens surrounded by whitespace. You can use this tokenizer in situations where you want the tokens in your source document to be separated by whitespace characters only. This approach lets you retain the ability to set stop words and token length limits.
Important
If you create a database with no tables and the k-safety has increased, you must rebalance your data using REBALANCE_CLUSTER before using a Vertica tokenizer.
Parameters
Parameter Name | Parameter Value |
---|---|
stopwordscaseinsensitive | '' |
minorseparators | '' |
majorseparators | E' \t\n\f\r' |
minLength | '2' |
maxLength | '128' |
used | 'True' |
Examples
The following example shows how you can create a text index, from the table foo, using the Whitespace Log Tokenizer without a stemmer.
=> CREATE TABLE foo (id INT PRIMARY KEY NOT NULL,text VARCHAR(250));
=> COPY foo FROM STDIN;
End with a backslash and a period on a line by itself.
>> 1|2014-05-10 00:00:05.700433 %ASA-6-302013: Built outbound TCP connection 998 6454 for outside:101.123.123.111/443 (101.123.123.111/443)
>> \.
=> CREATE PROJECTION foo_projection AS SELECT * FROM foo ORDER BY id
SEGMENTED BY HASH(id) ALL NODES KSAFE;
=> CREATE TEXT INDEX indexfoo_WhitespaceLogTokenizer ON foo (id, text)
TOKENIZER v_txtindex.WhitespaceLogTokenizer(LONG VARCHAR) STEMMER NONE;
=> SELECT * FROM indexfoo_WhitespaceLogTokenizer;
token | doc_id
-----------------------------+--------
%ASA-6-302013: | 1
(101.123.123.111/443) | 1
00:00:05.700433 | 1
2014-05-10 | 1
6454 | 1
998 | 1
Built | 1
TCP | 1
connection | 1
for | 1
outbound | 1
outside:101.123.123.111/443 | 1
(12 rows)
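To show how whitespace-only splitting can still combine with stop words and length limits, here is a small Python sketch. It assumes stop words are compared without regard to case, as the parameter name stopwordscaseinsensitive suggests. The function name `whitespace_log_tokens` is hypothetical, and this is not Vertica's implementation.

```python
import re
from typing import Iterable, List


def whitespace_log_tokens(text: str, stop_words: Iterable[str] = (),
                          min_length: int = 2, max_length: int = 128) -> List[str]:
    """Split on the whitespace separators only, then apply stop words and length limits."""
    stops = {w.lower() for w in stop_words}       # assumption: case-insensitive match
    kept = set()
    for token in re.split(r"[ \t\n\f\r]+", text):
        if not token or token.lower() in stops:
            continue
        if min_length <= len(token) <= max_length:
            kept.add(token)
    return sorted(kept)


line = ("2014-05-10 00:00:05.700433 %ASA-6-302013: Built outbound TCP connection "
        "998 6454 for outside:101.123.123.111/443 (101.123.123.111/443)")
print(len(whitespace_log_tokens(line)))                      # 12
print(len(whitespace_log_tokens(line, stop_words=["FOR"])))  # 11
```

With the default settings the sketch returns the same 12 tokens as the index above, including the parenthesized address, because parentheses are not separators for this tokenizer.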
2.5 - ICU tokenizer
Supports multiple languages. You can use this tokenizer to identify word boundaries in languages other than English, including Asian languages that are not separated by whitespace.
The ICU Tokenizer is not preconfigured. You configure the tokenizer by first creating a user-defined transform function (UDTF). Then set the parameter, locale, to identify the language to tokenize.
Important
If you create a database with no tables and the k-safety has increased, you must rebalance your data using REBALANCE_CLUSTER before using a Vertica tokenizer.
Parameters
Parameter Name | Parameter Value |
---|---|
locale | Uses the POSIX naming convention: language[_COUNTRY]. Identify the language using its ISO-639 code, and the country using its ISO-3166 code. For example, the parameter value for simplified Chinese is zh_CN, and the value for Spanish is es_ES. The default value is English if you do not specify a locale. |
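The language[_COUNTRY] form can be checked before it is passed to the tokenizer. The Python sketch below is a strict reading of that convention: a lowercase two- or three-letter language code, optionally followed by an underscore and an uppercase two-letter country code. Whether Vertica accepts other spellings is not covered here, and `parse_locale` is a hypothetical helper.

```python
import re
from typing import NamedTuple, Optional


class Locale(NamedTuple):
    language: str
    country: Optional[str]


# language: 2-3 lowercase letters; country (optional): 2 uppercase letters
_LOCALE_RE = re.compile(r"([a-z]{2,3})(?:_([A-Z]{2}))?")


def parse_locale(value: str) -> Locale:
    """Split a language[_COUNTRY] string into its parts or raise ValueError."""
    match = _LOCALE_RE.fullmatch(value)
    if match is None:
        raise ValueError(f"not in language[_COUNTRY] form: {value!r}")
    return Locale(match.group(1), match.group(2))


print(parse_locale("zh_CN"))  # Locale(language='zh', country='CN')
print(parse_locale("es"))     # Locale(language='es', country=None)
```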
Example
The following example steps show how you can configure the ICU Tokenizer for simplified Chinese, then create a text index from the table foo, which contains Chinese characters.
For more on how to configure tokenizers, see Configuring a tokenizer.
- Create the tokenizer using a UDTF. The example tokenizer is named ICUChineseTokenizer.
VMart=> CREATE OR REPLACE TRANSFORM FUNCTION v_txtindex.ICUChineseTokenizer AS LANGUAGE 'C++' NAME 'ICUTokenizerFactory' LIBRARY v_txtindex.logSearchLib NOT FENCED;
CREATE TRANSFORM FUNCTION
- Get the procedure ID of the tokenizer.
VMart=> SELECT proc_oid from vs_procedures where procedure_name = 'ICUChineseTokenizer';
proc_oid
-------------------
45035996280452894
(1 row)
- Set the parameter, locale, to simplified Chinese. Identify the tokenizer using its procedure ID.
VMart=> SELECT v_txtindex.SET_TOKENIZER_PARAMETER('locale','zh_CN' using parameters proc_oid='45035996280452894');
SET_TOKENIZER_PARAMETER
-------------------------
t
(1 row)
- Lock the tokenizer, again identifying it by the procedure ID retrieved earlier.
VMart=> SELECT v_txtindex.SET_TOKENIZER_PARAMETER('used','true' using parameters proc_oid='45035996280452894');
SET_TOKENIZER_PARAMETER
-------------------------
t
(1 row)
- Create an example table, foo, containing simplified Chinese text to index.
VMart=> CREATE TABLE foo(doc_id integer primary key not null,text varchar(250));
CREATE TABLE
VMart=> INSERT INTO foo values(1, u&'\4E2D\534E\4EBA\6C11\5171\548C\56FD');
OUTPUT
--------
1
- Create an index, index_example, on the table foo. The example creates the index without a stemmer; Vertica stemmers work only on English text. Using a stemmer for English on non-English text can cause incorrect tokenization.
VMart=> CREATE TEXT INDEX index_example ON foo (doc_id, text) TOKENIZER v_txtindex.ICUChineseTokenizer(long varchar) stemmer none;
CREATE INDEX
- View the new index.
VMart=> SELECT * FROM index_example ORDER BY token,doc_id;
token | doc_id
--------+--------
中华 | 1
人民 | 1
共和国 | 1
(3 rows)
3 - Configuring a tokenizer
You configure a tokenizer by creating a user-defined transform function (UDTF) using one of the two base UDTFs in the v_txtindex.AdvTxtSearchLib library. The library contains two base tokenizers: one for Log Words and one for Ngrams. You can configure each base function with or without positional relevance.
3.1 - Tokenizer base configuration
You can choose among several different tokenizer base configurations:
Type | Position | Without Position |
---|---|---|
Ngram | logNgramTokenizerPositionFactory | logNgramTokenizerFactory |
Words | logWordITokenizerPositionFactory | logWordITokenizerFactory |
Create a logWord tokenizer without positional relevance:
=> CREATE TRANSFORM FUNCTION v_txtindex.fooTokenizer AS LANGUAGE 'C++' NAME 'logWordITokenizerFactory' LIBRARY v_txtindex.logSearchLib NOT FENCED;
3.2 - Retrieve tokenizer proc_oid
After you create the tokenizer, Vertica writes the name and proc_oid to the system table vs_procedures. You must retrieve the tokenizer's proc_oid to perform additional configuration.
Enter the following query, substituting your own tokenizer name:
=> SELECT proc_oid FROM vs_procedures WHERE procedure_name = 'fooTokenizer';
3.3 - Set tokenizer parameters
Use the tokenizer's proc_oid to configure the tokenizer. See Configuring a tokenizer for more information about getting the proc_oid of your tokenizer. The following examples show how you can configure each of the tokenizer parameters:
Configure stop words:
=> SELECT v_txtindex.SET_TOKENIZER_PARAMETER('stopwordscaseinsensitive','for,the' USING PARAMETERS proc_oid='45035996274128376');
Configure major separators:
=> SELECT v_txtindex.SET_TOKENIZER_PARAMETER('majorseparators', E'{}()&[]' USING PARAMETERS proc_oid='45035996274128376');
Configure minor separators:
=> SELECT v_txtindex.SET_TOKENIZER_PARAMETER('minorseparators', '-,$' USING PARAMETERS proc_oid='45035996274128376');
Configure minimum length:
=> SELECT v_txtindex.SET_TOKENIZER_PARAMETER('minlength', '1' USING PARAMETERS proc_oid='45035996274128376');
Configure maximum length:
=> SELECT v_txtindex.SET_TOKENIZER_PARAMETER('maxlength', '140' USING PARAMETERS proc_oid='45035996274128376');
Configure ngramssize:
=> SELECT v_txtindex.SET_TOKENIZER_PARAMETER('ngramssize', '2' USING PARAMETERS proc_oid='45035996274128376');
Lock tokenizer parameters
When you finish configuring the tokenizer, set the parameter, used, to True. After changing this setting, you are no longer able to alter the parameters of the tokenizer. At this point, the tokenizer is ready for you to use to create a text index.
Configure the used parameter:
=> SELECT v_txtindex.SET_TOKENIZER_PARAMETER('used', 'True' USING PARAMETERS proc_oid='45035996274128376');
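The locking rule can be pictured with a toy Python model: parameters are writable until used becomes true, and every later change is refused. The class `TokenizerConfig` is hypothetical, and raising an exception is an assumption made for illustration; the way Vertica reports a rejected change is not shown in this section.

```python
class TokenizerConfig:
    """Toy model of a tokenizer configuration that freezes once 'used' is true."""

    def __init__(self) -> None:
        self._params = {"used": "false"}

    def set_parameter(self, name: str, value: str) -> bool:
        if self._params["used"].lower() == "true":
            # Assumption: a locked configuration rejects every further change.
            raise RuntimeError("tokenizer is locked; parameters can no longer change")
        self._params[name.lower()] = value
        return True

    def get_parameter(self, name: str) -> str:
        return self._params[name.lower()]


config = TokenizerConfig()
config.set_parameter("minlength", "1")
config.set_parameter("used", "True")       # lock the configuration
try:
    config.set_parameter("minlength", "3")
except RuntimeError as err:
    print(err)
print(config.get_parameter("minlength"))   # 1
```

The practical consequence is that the order of operations matters: set every separator, length, and stop word parameter first, and set used last.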
See also
SET_TOKENIZER_PARAMETER
3.4 - View tokenizer parameters
After creating a custom tokenizer, you can view the tokenizer's parameter settings in either of two ways:
- GET_TOKENIZER_PARAMETER: View individual tokenizer parameter settings.
- READ_CONFIG_FILE: View all tokenizer parameter settings.
View individual tokenizer parameter settings
To see an individual parameter setting for a tokenizer, use GET_TOKENIZER_PARAMETER:
=> SELECT v_txtindex.GET_TOKENIZER_PARAMETER('majorseparators' USING PARAMETERS proc_oid='45035996274126984');
getTokenizerParameter
-----------------------
{}()&[]
(1 row)
For more information, see GET_TOKENIZER_PARAMETER.
View all tokenizer parameter settings
To see all of the parameter settings for a tokenizer, use READ_CONFIG_FILE:
=> SELECT v_txtindex.READ_CONFIG_FILE( USING PARAMETERS proc_oid='45035996274126984') OVER();
config_key | config_value
--------------------------+---------------
majorseparators | {}()&[]
maxlength | 140
minlength | 1
minorseparators | -,$
stopwordscaseinsensitive | for,the
type | 1
used | true
(7 rows)
If the parameter, used, is set to False, then you can only view the parameters that have been applied to the tokenizer.
Note
Vertica automatically supplies the value for Type, unless you are using an ngram tokenizer, which allows you to set it.
For more information, see READ_CONFIG_FILE.
3.5 - Delete tokenizer config file
Use the DELETE_TOKENIZER_CONFIG_FILE function to delete a tokenizer configuration file. This function does not delete the user-defined transform function (UDTF). It only deletes the configuration file associated with the UDTF.
Delete the tokenizer configuration file when the parameter, used, is set to False:
=> SELECT v_txtindex.DELETE_TOKENIZER_CONFIG_FILE(USING PARAMETERS proc_oid='45035996274127086');
Delete the tokenizer configuration file with the parameter, confirm, set to True. This setting forces the configuration file deletion, even if the parameter, used, is also set to True:
=> SELECT v_txtindex.DELETE_TOKENIZER_CONFIG_FILE(USING PARAMETERS proc_oid='45035996274126984', confirm='true');
For more information, see DELETE_TOKENIZER_CONFIG_FILE.
4 - Requirements for custom stemmers and tokenizers
Sometimes, you may want specific tokenization or stemming behavior that differs from what Vertica provides. In such cases, you can implement your own custom user-defined extension (UDx) to replace the stemmer or tokenizer. For more information about building custom UDxs, see Developing user-defined extensions (UDxs).
Before implementing a custom stemmer or tokenizer in Vertica, verify that the UDx meets these requirements.
Note
Custom tokenizers can return multi-column text indices.
Vertica stemmer requirements
Comply with these requirements when you create custom stemmers:
- Must be a User Defined Scalar Function (UDSF) or a SQL Function
- Can be written in C++, Java, or R
- Volatility set to stable or immutable
Supported Data Input Types:
- Varchar
- Long varchar
Supported Data Output Types:
- Varchar
- Long varchar
Vertica tokenizer requirements
To create custom tokenizers, follow these requirements:
- Must be a User Defined Transform Function (UDTF)
- Can be written in C++, Java, or R
- Input type must match the type of the input text
Supported Data Input Types:
- Char
- Varchar
- Long varchar
- Varbinary
- Long varbinary
Supported Data Output Types:
- Varchar
- Long varchar