预配置的分词器

Vertica Analytics Platform 提供以下预配置的分词器:

Vertica 还提供了以下未预配置的分词器:

示例

以下示例显示了在创建文本索引时如何使用预配置的分词器。

使用 StringTokenizer 从 top_100 创建索引:

=> CREATE TEXT INDEX idx_100 FROM top_100 on (id, feedback)
                TOKENIZER v_txtindex.StringTokenizer(long varchar)
                 STEMMER v_txtindex.StemmerCaseInsensitive(long varchar);

使用 FlexTokenizer 从非结构化数据创建索引:

=> CREATE TEXT INDEX idx_unstruc FROM unstruc_data on (__identity__, __raw__)
                                 TOKENIZER public.FlexTokenizer(long varbinary)
                                    STEMMER v_txtindex.StemmerCaseSensitive(long varchar);

使用 StringTokenizerDelim 在指定的分隔符处拆分字符串:

=> CREATE TABLE string_table (word VARCHAR(100), delim VARCHAR);
CREATE TABLE
=> COPY string_table FROM STDIN DELIMITER ',';
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>>
>> SingleWord,dd
>> Break On Spaces,' '
>> Break:On:Colons,:
>> \.
=> SELECT * FROM string_table;
            word | delim
-----------------+-------
      SingleWord | dd
 Break On Spaces |
 Break:On:Colons | :
(3 rows)

=> SELECT v_txtindex.StringTokenizerDelim(word,delim) OVER () FROM string_table;
      words
-----------------
 Break
 On
 Colons
 SingleWor
 Break
 On
 Spaces
(7 rows)

=> SELECT v_txtindex.StringTokenizerDelim(word,delim) OVER (PARTITION BY word), word as input FROM string_table;
           words | input
-----------------+-----------------
           Break | Break:On:Colons
              On | Break:On:Colons
          Colons | Break:On:Colons
       SingleWor | SingleWord
           Break | Break On Spaces
              On | Break On Spaces
          Spaces | Break On Spaces
(7 rows)