A tokenizer does the following:
-
Receives a stream of characters.
-
Breaks the stream into individual tokens that usually correspond to individual words.
-
Returns a stream of tokens.
This is the multi-page printable view of this section. Click here to print.
A tokenizer does the following:
Receives a stream of characters.
Breaks the stream into individual tokens that usually correspond to individual words.
Returns a stream of tokens.
The Vertica Analytics Platform provides the following preconfigured tokenizers:
public.FlexTokenizer(LONG VARBINARY)
v_txtindex.StringTokenizer(LONG VARCHAR)
v_txtindex.StringTokenizerDelim(LONG VARCHAR, CHAR(1))
Vertica also provides the following tokenizer, which is not preconfigured:
v_txtindex.ICUTokenizer
The following examples show how you can use a preconfigured tokenizer when creating a text index.
Use the StringTokenizer to create an index from the top_100:
=> CREATE TEXT INDEX idx_100 FROM top_100 on (id, feedback)
TOKENIZER v_txtindex.StringTokenizer(long varchar)
STEMMER v_txtindex.StemmerCaseInsensitive(long varchar);
Use the FlexTokenizer to create an index from unstructured data:
=> CREATE TEXT INDEX idx_unstruc FROM unstruc_data on (__identity__, __raw__)
TOKENIZER public.FlexTokenizer(long varbinary)
STEMMER v_txtindex.StemmerCaseSensitive(long varchar);
Use the StringTokenizerDelim to split a string at the specified delimiter:
=> CREATE TABLE string_table (word VARCHAR(100), delim VARCHAR);
CREATE TABLE
=> COPY string_table FROM STDIN DELIMITER ',';
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>>
>> SingleWord,dd
>> Break On Spaces,' '
>> Break:On:Colons,:
>> \.
=> SELECT * FROM string_table;
word | delim
-----------------+-------
SingleWord | dd
Break On Spaces |
Break:On:Colons | :
(3 rows)
=> SELECT v_txtindex.StringTokenizerDelim(word,delim) OVER () FROM string_table;
words
-----------------
Break
On
Colons
SingleWor
Break
On
Spaces
(7 rows)
=> SELECT v_txtindex.StringTokenizerDelim(word,delim) OVER (PARTITION BY word), word as input FROM string_table;
words | input
-----------------+-----------------
Break | Break:On:Colons
On | Break:On:Colons
Colons | Break:On:Colons
SingleWor | SingleWord
Break | Break On Spaces
On | Break On Spaces
Spaces | Break On Spaces
(7 rows)
Supports multiple languages. You can use this tokenizer to identify word boundaries in languages other than English, including Asian languages that are not separated by whitespace.
The ICU Tokenizer is not pre-configured. You configure the tokenizer by first creating a user-defined transform Function (UDTF). Then set the parameter, locale, to identify the language to tokenizer.
Parameter Name | Parameter Value |
---|---|
locale |
Uses the POSIX naming convention: language[_COUNTRY] Identify the language using its ISO-639 code, and the country using its ISO-3166 code. For example, the parameter value for simplified Chinese is zh_CN, and the value for Spanish is es_ES. The default value is English if you do not specify a locale. |
The following example steps show how you can configure the ICU Tokenizer for simplified Chinese, then create a text index from the table foo, which contains Chinese characters.
For more on how to configure tokenizers, see Configuring a tokenizer.
Create the tokenizer using a UDTF. The example tokenizer is named ICUChineseTokenizer.
VMart=> CREATE OR REPLACE TRANSFORM FUNCTION v_txtindex.ICUChineseTokenizer AS LANGUAGE 'C++' NAME 'ICUTokenizerFactory' LIBRARY v_txtindex.logSearchLib NOT FENCED;
CREATE TRANSFORM FUNCTION
Get the procedure ID of the tokenizer.
VMart=> SELECT proc_oid from vs_procedures where procedure_name = 'ICUChineseTokenizer';
proc_oid
-------------------
45035996280452894
(1 row)
Set the parameter, locale, to simplified Chinese. Identify the tokenizer using its procedure ID.
VMart=> SELECT v_txtindex.SET_TOKENIZER_PARAMETER('locale','zh_CN' using parameters proc_oid='45035996280452894');
SET_TOKENIZER_PARAMETER
-------------------------
t
(1 row)
Lock the tokenizer.
VMart=> SELECT v_txtindex.SET_TOKENIZER_PARAMETER('used','true' using parameters proc_oid='45035996273762696');
SET_TOKENIZER_PARAMETER
-------------------------
t
(1 row)
Create an example table, foo, containing simplified Chinese text to index.
VMart=> CREATE TABLE foo(doc_id integer primary key not null,text varchar(250));
CREATE TABLE
VMart=> INSERT INTO foo values(1, u&'\4E2D\534E\4EBA\6C11\5171\548C\56FD');
OUTPUT
--------
1
Create an index, index_example, on the table foo. The example creates the index without a stemmer; Vertica stemmers work only on English text. Using a stemmer for English on non-English text can cause incorrect tokenization.
VMart=> CREATE TEXT INDEX index_example ON foo (doc_id, text) TOKENIZER v_txtindex.ICUChineseTokenizer(long varchar) stemmer none;
CREATE INDEX
View the new index.
VMart=> SELECT * FROM index_example ORDER BY token,doc_id;
token | doc_id
--------+--------
中华 | 1
人民 | 1
共和国 | 1
(3 rows)