Preconfigured tokenizers
The Vertica Analytics Platform provides the following preconfigured tokenizers:
| Name | Description | 
|---|---|
| public.FlexTokenizer(LONG VARBINARY) | Splits the values in your flex table by white space. | 
| v_txtindex.StringTokenizer(LONG VARCHAR) | Splits the string into words by splitting on white space. | 
| v_txtindex.StringTokenizerDelim(string VARCHAR, 'delimiter' CHAR(1)) | Splits a string into tokens using the specified delimiter character. | 
| v_txtindex.AdvancedLogTokenizer | Uses the default values for all tokenizer parameters. For more information, see Advanced log tokenizer. | 
| v_txtindex.BasicLogTokenizer | Uses the default values for all tokenizer parameters except minorseparator, which is set to an empty list. For more information, see Basic log tokenizer. | 
| v_txtindex.WhitespaceLogTokenizer | Uses the default values for all tokenizer parameters except majorseparators, which is set to E' \t\n\f\r', and minorseparator, which is set to an empty list. For more information, see Whitespace log tokenizer. | 
Vertica also provides the following tokenizer, which is not preconfigured:
| Name | Description | 
|---|---|
| v_txtindex.ICUTokenizer | Supports multiple languages. Tokenizes based on the conventions of the language you set in the locale parameter. For more information, see ICU Tokenizer. | 
Examples
The following examples show how you can use a preconfigured tokenizer when creating a text index.
Use the StringTokenizer to create an index from the top_100 table:
=> CREATE TEXT INDEX idx_100 ON top_100 (id, feedback)
     TOKENIZER v_txtindex.StringTokenizer(long varchar)
     STEMMER v_txtindex.StemmerCaseInsensitive(long varchar);
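Once idx_100 exists, it can be searched like an ordinary table. The query below is a minimal sketch, not a definitive recipe. It assumes that the index exposes the token and doc_id columns that Vertica text indexes normally contain, and that the stemmer can be called as a scalar function on the search term. The search term 'excellent' is hypothetical.

```sql
-- Assumption: idx_100 has the columns token and doc_id.
-- Assumption: v_txtindex.StemmerCaseInsensitive can be called as a scalar
-- function, so the search term is stemmed the same way the indexed tokens were.
-- 'excellent' is a hypothetical search term.
SELECT id, feedback
FROM top_100
WHERE id IN (
    SELECT doc_id
    FROM idx_100
    WHERE token = v_txtindex.StemmerCaseInsensitive('excellent')
);
```

Passing the search term through the same stemmer used at index creation matters because the index stores stemmed tokens, so an unstemmed term might not match.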
Use the FlexTokenizer to create an index from unstructured data:
=> CREATE TEXT INDEX idx_unstruc ON unstruc_data (__identity__, __raw__)
     TOKENIZER public.FlexTokenizer(long varbinary)
     STEMMER v_txtindex.StemmerCaseSensitive(long varchar);
Use the StringTokenizerDelim to split a string at the specified delimiter:
=> CREATE TABLE string_table (word VARCHAR(100), delim VARCHAR);
CREATE TABLE
=> COPY string_table FROM STDIN DELIMITER ',';
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>>
>> SingleWord,dd
>> Break On Spaces,' '
>> Break:On:Colons,:
>> \.
=> SELECT * FROM string_table;
            word | delim
-----------------+-------
      SingleWord | dd
 Break On Spaces |
 Break:On:Colons | :
(3 rows)
=> SELECT v_txtindex.StringTokenizerDelim(word,delim) OVER () FROM string_table;
      words
-----------------
 Break
 On
 Colons
 SingleWor
 Break
 On
 Spaces
(7 rows)
=> SELECT v_txtindex.StringTokenizerDelim(word,delim) OVER (PARTITION BY word), word as input FROM string_table;
           words | input
-----------------+-----------------
           Break | Break:On:Colons
              On | Break:On:Colons
          Colons | Break:On:Colons
       SingleWor | SingleWord
           Break | Break On Spaces
              On | Break On Spaces
          Spaces | Break On Spaces
(7 rows)
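In the output above, SingleWord yields the token SingleWor. This is consistent with the delimiter argument being a single character: the two-character value dd appears to be treated as d, so the trailing d is consumed as a delimiter. The sketch below passes the delimiter as a constant instead of reading it from the delim column. It assumes that the function accepts a literal for its second argument. The WHERE clause is added only for illustration.

```sql
-- Assumption: the delimiter may be supplied as a single-character constant.
-- Only rows that contain a colon are tokenized here.
SELECT v_txtindex.StringTokenizerDelim(word, ':') OVER (PARTITION BY word),
       word AS input
FROM string_table
WHERE word LIKE '%:%';
```

Using a constant keeps the delimiter explicit and avoids surprises from multi-character or quoted values stored in a column.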