ICU tokenizer

Supports multiple languages.

Supports multiple languages. You can use this tokenizer to identify word boundaries in languages other than English, including Asian languages that are not separated by whitespace.

The ICU Tokenizer is not pre-configured. You configure the tokenizer by first creating a user-defined transform Function (UDTF). Then set the parameter, locale, to identify the language to tokenizer.

Parameters

Parameter Name Parameter Value
locale

Uses the POSIX naming convention: language[_COUNTRY]

Identify the language using its ISO-639 code, and the country using its ISO-3166 code. For example, the parameter value for simplified Chinese is zh_CN, and the value for Spanish is es_ES.

The default value is English if you do not specify a locale.

Example

The following example steps show how you can configure the ICU Tokenizer for simplified Chinese, then create a text index from the table foo, which contains Chinese characters.

For more on how to configure tokenizers, see Configuring a tokenizer.

  1. Create the tokenizer using a UDTF. The example tokenizer is named ICUChineseTokenizer.

    VMart=> CREATE OR REPLACE TRANSFORM FUNCTION v_txtindex.ICUChineseTokenizer AS LANGUAGE 'C++' NAME 'ICUTokenizerFactory' LIBRARY v_txtindex.logSearchLib NOT FENCED;
    CREATE TRANSFORM FUNCTION
    
  2. Get the procedure ID of the tokenizer.

    VMart=> SELECT proc_oid from vs_procedures where procedure_name = 'ICUChineseTokenizer';
         proc_oid
    -------------------
     45035996280452894
    (1 row)
    
  3. Set the parameter, locale, to simplified Chinese. Identify the tokenizer using its procedure ID.

    VMart=> SELECT v_txtindex.SET_TOKENIZER_PARAMETER('locale','zh_CN' using parameters proc_oid='45035996280452894');
     SET_TOKENIZER_PARAMETER
    -------------------------
     t
    (1 row)
    
  4. Lock the tokenizer.

    VMart=> SELECT v_txtindex.SET_TOKENIZER_PARAMETER('used','true' using parameters proc_oid='45035996273762696');
     SET_TOKENIZER_PARAMETER
    -------------------------
     t
    (1 row)
    
  5. Create an example table, foo, containing simplified Chinese text to index.

    VMart=> CREATE TABLE foo(doc_id integer primary key not null,text varchar(250));
    CREATE TABLE
    
    VMart=> INSERT INTO foo values(1, u&'\4E2D\534E\4EBA\6C11\5171\548C\56FD');
     OUTPUT
    --------
          1
    
  6. Create an index, index_example, on the table foo. The example creates the index without a stemmer; Vertica stemmers work only on English text. Using a stemmer for English on non-English text can cause incorrect tokenization.

    VMart=> CREATE TEXT INDEX index_example ON foo (doc_id, text) TOKENIZER v_txtindex.ICUChineseTokenizer(long varchar) stemmer none;
    CREATE INDEX
    
  7. View the new index.

    VMart=> SELECT * FROM index_example ORDER BY token,doc_id;
     token  | doc_id
    --------+--------
     中华    |      1
     人民   |      1
     共和国 |      1
    (3 rows)