Preconfigured tokenizers
The Vertica Analytics Platform provides the following preconfigured tokenizers:
Name | Description |
---|---|
public.FlexTokenizer(LONG VARBINARY) | Splits the values in your flex table by white space. |
v_txtindex.StringTokenizer(LONG VARCHAR) | Splits the string into words by splitting on white space. |
v_txtindex.StringTokenizerDelim(string VARCHAR, delimiter CHAR(1)) | Splits a string into tokens using the specified delimiter character. |
v_txtindex.AdvancedLogTokenizer | Uses the default values for all tokenizer parameters. For more information, see Advanced log tokenizer. |
v_txtindex.BasicLogTokenizer | Uses the default values for all tokenizer parameters except minorseparator, which is set to an empty list. For more information, see Basic log tokenizer. |
v_txtindex.WhitespaceLogTokenizer | Uses the default values for all tokenizer parameters except majorseparators, which is set to E' \t\n\f\r', and minorseparator, which is set to an empty list. For more information, see Whitespace log tokenizer. |
Vertica also provides the following tokenizer, which is not preconfigured:
Name | Description |
---|---|
v_txtindex.ICUTokenizer | Supports multiple languages. Tokenizes based on the conventions of the language you set in the locale parameter. For more information, see ICU Tokenizer. |
Examples
The following examples show how you can use a preconfigured tokenizer when creating a text index.
Use the StringTokenizer to create an index from the top_100 table:
=> CREATE TEXT INDEX idx_100 FROM top_100 ON (id, feedback)
TOKENIZER v_txtindex.StringTokenizer(long varchar)
STEMMER v_txtindex.StemmerCaseInsensitive(long varchar);
Use the FlexTokenizer to create an index from unstructured data:
=> CREATE TEXT INDEX idx_unstruc FROM unstruc_data ON (__identity__, __raw__)
TOKENIZER public.FlexTokenizer(long varbinary)
STEMMER v_txtindex.StemmerCaseSensitive(long varchar);
Use the StringTokenizerDelim to split a string at the specified delimiter:
=> CREATE TABLE string_table (word VARCHAR(100), delim VARCHAR);
CREATE TABLE
=> COPY string_table FROM STDIN DELIMITER ',';
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>>
>> SingleWord,dd
>> Break On Spaces,' '
>> Break:On:Colons,:
>> \.
=> SELECT * FROM string_table;
word | delim
-----------------+-------
SingleWord | dd
Break On Spaces |
Break:On:Colons | :
(3 rows)
=> SELECT v_txtindex.StringTokenizerDelim(word,delim) OVER () FROM string_table;
words
-----------------
Break
On
Colons
SingleWor
Break
On
Spaces
(7 rows)
=> SELECT v_txtindex.StringTokenizerDelim(word,delim) OVER (PARTITION BY word), word as input FROM string_table;
words | input
-----------------+-----------------
Break | Break:On:Colons
On | Break:On:Colons
Colons | Break:On:Colons
SingleWor | SingleWord
Break | Break On Spaces
On | Break On Spaces
Spaces | Break On Spaces
(7 rows)
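The SingleWor token in the output above can look like a typo. It is consistent with the delimiter argument being declared CHAR(1): if only the first character of dd is used, SingleWord is split on d and the empty trailing token is dropped. The following Python sketch models that behavior. It is not Vertica code: the function name tokenize_delim, the truncation to one character, and the dropping of empty tokens are assumptions inferred from the sample output, not documented behavior.

```python
def tokenize_delim(text, delim):
    """Hypothetical model of splitting on a single-character delimiter."""
    # Assumption: the delimiter is CHAR(1), so only its first character counts.
    sep = delim[:1]
    if not sep:
        # No usable delimiter: return the whole string as one token.
        return [text] if text else []
    # Assumption: empty tokens (for example, after a trailing delimiter)
    # are dropped, as the sample output suggests.
    return [tok for tok in text.split(sep) if tok]


# The same three rows as string_table in the example above.
rows = [
    ("SingleWord", "dd"),
    ("Break On Spaces", " "),
    ("Break:On:Colons", ":"),
]

for word, delim in rows:
    print(word, "->", tokenize_delim(word, delim))
```

Under these assumptions the sketch yields the same seven tokens as the query output, which suggests passing a one-character delimiter when you want predictable splits.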