Preconfigured tokenizers

The Vertica Analytics Platform provides the following preconfigured tokenizers:.

The Vertica Analytics Platform provides the following preconfigured tokenizers:

public.FlexTokenizer(LONG VARBINARY)
Splits the values in your flex table by white space.
v_txtindex.StringTokenizer(LONG VARCHAR)
Splits the string into words by splitting on white space.
v_txtindex.StringTokenizerDelim(LONG VARCHAR, CHAR(1))
Splits a string into tokens using the specified delimiter character.

Vertica also provides the following tokenizer, which is not preconfigured:

v_txtindex.ICUTokenizer
Supports multiple languages. Tokenizes based on the conventions of the language you set in the locale parameter. For more information, see ICU Tokenizer.

Examples

The following examples show how you can use a preconfigured tokenizer when creating a text index.

Use the StringTokenizer to create an index from the top_100:

=> CREATE TEXT INDEX idx_100 FROM top_100 on (id, feedback)
                TOKENIZER v_txtindex.StringTokenizer(long varchar)
                 STEMMER v_txtindex.StemmerCaseInsensitive(long varchar);

Use the FlexTokenizer to create an index from unstructured data:

=> CREATE TEXT INDEX idx_unstruc FROM unstruc_data on (__identity__, __raw__)
                                 TOKENIZER public.FlexTokenizer(long varbinary)
                                    STEMMER v_txtindex.StemmerCaseSensitive(long varchar);

Use the StringTokenizerDelim to split a string at the specified delimiter:

=> CREATE TABLE string_table (word VARCHAR(100), delim VARCHAR);
CREATE TABLE
=> COPY string_table FROM STDIN DELIMITER ',';
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>>
>> SingleWord,dd
>> Break On Spaces,' '
>> Break:On:Colons,:
>> \.
=> SELECT * FROM string_table;
            word | delim
-----------------+-------
      SingleWord | dd
 Break On Spaces |
 Break:On:Colons | :
(3 rows)

=> SELECT v_txtindex.StringTokenizerDelim(word,delim) OVER () FROM string_table;
      words
-----------------
 Break
 On
 Colons
 SingleWor
 Break
 On
 Spaces
(7 rows)

=> SELECT v_txtindex.StringTokenizerDelim(word,delim) OVER (PARTITION BY word), word as input FROM string_table;
           words | input
-----------------+-----------------
           Break | Break:On:Colons
              On | Break:On:Colons
          Colons | Break:On:Colons
       SingleWor | SingleWord
           Break | Break On Spaces
              On | Break On Spaces
          Spaces | Break On Spaces
(7 rows)