<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Vertica Documentation – Vertica tokenizers</title>
    <link>/en/admin/using-text-search/stemmers-and-tokenizers/tokenizers/</link>
    <description>Recent content in Vertica tokenizers on Vertica Documentation</description>
    <generator>Hugo -- gohugo.io</generator>
    
	  <atom:link href="/en/admin/using-text-search/stemmers-and-tokenizers/tokenizers/index.xml" rel="self" type="application/rss+xml" />
    
    
      
        
      
    
    
    <item>
      <title>Admin: Preconfigured tokenizers</title>
      <link>/en/admin/using-text-search/stemmers-and-tokenizers/tokenizers/preconfigured-tokenizers/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      
      <guid>/en/admin/using-text-search/stemmers-and-tokenizers/tokenizers/preconfigured-tokenizers/</guid>
      <description>
        
        
        &lt;p&gt;The Vertica Analytics Platform provides the following preconfigured tokenizers:

&lt;table class=&#34;table table-bordered&#34; &gt;



&lt;tr&gt; 

&lt;th &gt;
Name&lt;/th&gt; 

&lt;th &gt;
Description&lt;/th&gt;&lt;/tr&gt;

&lt;tr&gt; 

&lt;td &gt;
public.FlexTokenizer(LONG VARBINARY)&lt;/td&gt; 

&lt;td &gt;
Splits the values in your flex table by white space.&lt;/td&gt;&lt;/tr&gt;

&lt;tr&gt; 

&lt;td &gt;
v_txtindex.StringTokenizer(LONG VARCHAR)&lt;/td&gt; 

&lt;td &gt;


Splits the string into words on white space.&lt;/td&gt;&lt;/tr&gt;

&lt;tr&gt; 

&lt;td &gt;
v_txtindex.StringTokenizerDelim(&lt;em&gt;&lt;code&gt;string&lt;/code&gt;&lt;/em&gt; LONG VARCHAR, &#39;&lt;em&gt;&lt;code&gt;delimiter&lt;/code&gt;&lt;/em&gt;&#39; CHAR(1))&lt;/td&gt; 

&lt;td &gt;


Splits a string into tokens using the specified delimiter character.&lt;/td&gt;&lt;/tr&gt;

&lt;tr&gt; 

&lt;td &gt;
v_txtindex.AdvancedLogTokenizer&lt;/td&gt; 

&lt;td &gt;
Uses the default values for all tokenizer parameters. For more information, see &lt;a href=&#34;../../../../../en/admin/using-text-search/stemmers-and-tokenizers/tokenizers/advanced-log-tokenizer/&#34;&gt;Advanced log tokenizer&lt;/a&gt;.&lt;/td&gt;&lt;/tr&gt;

&lt;tr&gt; 

&lt;td &gt;
v_txtindex.BasicLogTokenizer&lt;/td&gt; 

&lt;td &gt;
Uses the default values for all tokenizer parameters except minorseparators, which is set to an empty list. For more information, see &lt;a href=&#34;../../../../../en/admin/using-text-search/stemmers-and-tokenizers/tokenizers/basic-log-tokenizer/&#34;&gt;Basic log tokenizer&lt;/a&gt;.&lt;/td&gt;&lt;/tr&gt;

&lt;tr&gt; 

&lt;td &gt;
v_txtindex.WhitespaceLogTokenizer&lt;/td&gt; 

&lt;td &gt;
Uses the default values for all tokenizer parameters except majorseparators, which is set to &lt;code&gt;E&#39; \t\n\f\r&#39;&lt;/code&gt;, and minorseparators, which is set to an empty list. For more information, see &lt;a href=&#34;../../../../../en/admin/using-text-search/stemmers-and-tokenizers/tokenizers/whitespace-log-tokenizer/&#34;&gt;Whitespace log tokenizer&lt;/a&gt;.&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/p&gt;
&lt;p&gt;Vertica also provides the following tokenizer, which is not preconfigured:

&lt;table class=&#34;table table-bordered&#34; &gt;



&lt;tr&gt; 

&lt;th &gt;
Name&lt;/th&gt; 

&lt;th &gt;
Description&lt;/th&gt;&lt;/tr&gt;

&lt;tr&gt; 

&lt;td &gt;
v_txtindex.ICUTokenizer&lt;/td&gt; 

&lt;td &gt;
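Supports multiple languages. Tokenizes based on the conventions of the language you set in the locale parameter. For more information, see &lt;a href=&#34;../../../../../en/admin/using-text-search/stemmers-and-tokenizers/tokenizers/icu-tokenizer/&#34;&gt;ICU tokenizer&lt;/a&gt;.&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;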
&lt;/p&gt;
&lt;h2 id=&#34;examples&#34;&gt;Examples&lt;/h2&gt;
&lt;p&gt;The following examples show how you can use a preconfigured tokenizer when creating a text index.&lt;/p&gt;
&lt;p&gt;Use the StringTokenizer to create an index from the top_100 table:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;=&amp;gt; CREATE TEXT INDEX idx_100 ON top_100 (id, feedback)
                TOKENIZER v_txtindex.StringTokenizer(long varchar)
                 STEMMER v_txtindex.StemmerCaseInsensitive(long varchar);
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Use the FlexTokenizer to create an index from unstructured data:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;=&amp;gt; CREATE TEXT INDEX idx_unstruc ON unstruc_data (__identity__, __raw__)
                                 TOKENIZER public.FlexTokenizer(long varbinary)
                                    STEMMER v_txtindex.StemmerCaseSensitive(long varchar);
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Use the StringTokenizerDelim to split a string at the specified delimiter:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;=&amp;gt; CREATE TABLE string_table (word VARCHAR(100), delim VARCHAR);
CREATE TABLE
=&amp;gt; COPY string_table FROM STDIN DELIMITER &amp;#39;,&amp;#39;;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
&amp;gt;&amp;gt;
&amp;gt;&amp;gt; SingleWord,dd
&amp;gt;&amp;gt; Break On Spaces,&amp;#39; &amp;#39;
&amp;gt;&amp;gt; Break:On:Colons,:
&amp;gt;&amp;gt; \.
=&amp;gt; SELECT * FROM string_table;
            word | delim
-----------------+-------
      SingleWord | dd
 Break On Spaces |
 Break:On:Colons | :
(3 rows)

=&amp;gt; SELECT v_txtindex.StringTokenizerDelim(word,delim) OVER () FROM string_table;
      words
-----------------
 Break
 On
 Colons
 SingleWor
 Break
 On
 Spaces
(7 rows)

=&amp;gt; SELECT v_txtindex.StringTokenizerDelim(word,delim) OVER (PARTITION BY word), word as input FROM string_table;
           words | input
-----------------+-----------------
           Break | Break:On:Colons
              On | Break:On:Colons
          Colons | Break:On:Colons
       SingleWor | SingleWord
           Break | Break On Spaces
              On | Break On Spaces
          Spaces | Break On Spaces
(7 rows)
&lt;/code&gt;&lt;/pre&gt;
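&lt;p&gt;To compare this with the white-space behavior of StringTokenizer, the following minimal sketch reuses the string_table created above. It assumes that StringTokenizer can be called directly as a transform function in the same way as StringTokenizerDelim. The output is omitted because it was not captured from a live system.&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;-- Sketch only: split each value of word on white space.
-- Only &amp;#39;Break On Spaces&amp;#39; contains white space, so it is the only row
-- expected to yield more than one token.
=&amp;gt; SELECT v_txtindex.StringTokenizer(word) OVER () FROM string_table;
&lt;/code&gt;&lt;/pre&gt;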
      </description>
    </item>
    
    <item>
      <title>Admin: Advanced log tokenizer</title>
      <link>/en/admin/using-text-search/stemmers-and-tokenizers/tokenizers/advanced-log-tokenizer/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      
      <guid>/en/admin/using-text-search/stemmers-and-tokenizers/tokenizers/advanced-log-tokenizer/</guid>
      <description>
        
        
        &lt;p&gt;Returns tokens that can include minor separators. Use this tokenizer when your tokens are separated by whitespace or various punctuation. By adding minor separators, the advanced log tokenizer lets you define separators at a finer granularity than the basic log tokenizer. This approach is frequently appropriate for analyzing log files.

&lt;div class=&#34;admonition important&#34; role=&#34;alert&#34;&gt;
&lt;h4 class=&#34;admonition-head&#34;&gt;Important&lt;/h4&gt;
If you create a database with no tables and the k-safety has increased, you must rebalance your data using &lt;a href=&#34;../../../../../en/sql-reference/functions/management-functions/cluster-functions/rebalance-cluster/&#34;&gt;REBALANCE_CLUSTER&lt;/a&gt; before using a Vertica tokenizer.
&lt;/div&gt;&lt;/p&gt;
&lt;h2 id=&#34;parameters&#34;&gt;Parameters&lt;/h2&gt;

&lt;table class=&#34;table table-bordered&#34; &gt;



&lt;tr&gt; 

&lt;th &gt;
Parameter Name&lt;/th&gt; 

&lt;th &gt;
Parameter Value&lt;/th&gt;&lt;/tr&gt;

&lt;tr&gt; 

&lt;td &gt;
stopwordscaseinsensitive&lt;/td&gt; 

&lt;td &gt;
&lt;code&gt;&#39;&#39;&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;

&lt;tr&gt; 

&lt;td &gt;
minorseparators&lt;/td&gt; 

&lt;td &gt;
&lt;code&gt;E&#39;/:=@.-$#%\\_&#39;&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;

&lt;tr&gt; 

&lt;td &gt;
majorseparators&lt;/td&gt; 

&lt;td &gt;
&lt;code&gt;E&#39; []&amp;lt;&amp;gt;(){}|!;,&#39;&#39;&amp;quot;*&amp;amp;?+\r\n\t&#39;&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;

&lt;tr&gt; 

&lt;td &gt;
minLength&lt;/td&gt; 

&lt;td &gt;
&lt;code&gt;&#39;2&#39;&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;

&lt;tr&gt; 

&lt;td &gt;
maxLength&lt;/td&gt; 

&lt;td &gt;
&lt;code&gt;&#39;128&#39;&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;

&lt;tr&gt; 

&lt;td &gt;
used&lt;/td&gt; 

&lt;td &gt;
&lt;code&gt;&#39;True&#39;&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;h2 id=&#34;examples&#34;&gt;Examples&lt;/h2&gt;
&lt;p&gt;The following example shows how you can create a text index on the table foo using the Advanced Log Tokenizer without a stemmer.&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;=&amp;gt; CREATE TABLE foo (id INT PRIMARY KEY NOT NULL,text VARCHAR(250));
=&amp;gt; COPY foo FROM STDIN;
End with a backslash and a period on a line by itself.
&amp;gt;&amp;gt; 1|2014-05-10 00:00:05.700433 %ASA-6-302013: Built outbound TCP connection 9986454 for outside:101.123.123.111/443 (101.123.123.111/443)
&amp;gt;&amp;gt; \.
=&amp;gt; CREATE PROJECTION foo_projection AS SELECT * FROM foo ORDER BY id
                                    SEGMENTED BY HASH(id) ALL NODES KSAFE;
=&amp;gt; CREATE TEXT INDEX indexfoo_AdvancedLogTokenizer ON foo (id, text)
                  TOKENIZER v_txtindex.AdvancedLogTokenizer(LONG VARCHAR) STEMMER NONE;
=&amp;gt; SELECT * FROM indexfoo_AdvancedLogTokenizer;
            token            | doc_id
-----------------------------+--------
 %ASA-6-302013:              |      1
 00                          |      1
 00:00:05.700433             |      1
 05                          |      1
 10                          |      1
 101                         |      1
 101.123.123.111/443         |      1
 111                         |      1
 123                         |      1
 2014                        |      1
 2014-05-10                  |      1
 302013                      |      1
 443                         |      1
 700433                      |      1
 9986454                     |      1
 ASA                         |      1
 Built                       |      1
 TCP                         |      1
 connection                  |      1
 for                         |      1
 outbound                    |      1
 outside                     |      1
 outside:101.123.123.111/443 |      1
(23 rows)
&lt;/code&gt;&lt;/pre&gt;
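&lt;p&gt;Because the index can be queried like a table with &lt;code&gt;token&lt;/code&gt; and &lt;code&gt;doc_id&lt;/code&gt; columns, as the SELECT above shows, one way to use it is to look up source rows by token. The following is a minimal sketch that assumes the table foo and the index from the example above; the search value 443 is chosen only for illustration. It relies on the minor separators having produced 443 as a token of its own.&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;-- Sketch only: return the source rows whose text produced the token &amp;#39;443&amp;#39;.
=&amp;gt; SELECT id, text FROM foo
   WHERE id IN (SELECT doc_id FROM indexfoo_AdvancedLogTokenizer
                WHERE token = &amp;#39;443&amp;#39;);
&lt;/code&gt;&lt;/pre&gt;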
      </description>
    </item>
    
    <item>
      <title>Admin: Basic log tokenizer</title>
      <link>/en/admin/using-text-search/stemmers-and-tokenizers/tokenizers/basic-log-tokenizer/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      
      <guid>/en/admin/using-text-search/stemmers-and-tokenizers/tokenizers/basic-log-tokenizer/</guid>
      <description>
        
        
        &lt;p&gt;Returns tokens that exclude specified minor separators. Use this tokenizer when your tokens are separated by whitespace or various punctuation. This approach is frequently appropriate for analyzing log files.

&lt;div class=&#34;admonition important&#34; role=&#34;alert&#34;&gt;
&lt;h4 class=&#34;admonition-head&#34;&gt;Important&lt;/h4&gt;
If you create a database with no tables and the k-safety has increased, you must rebalance your data using &lt;a href=&#34;../../../../../en/sql-reference/functions/management-functions/cluster-functions/rebalance-cluster/&#34;&gt;REBALANCE_CLUSTER&lt;/a&gt; before using a Vertica tokenizer.
&lt;/div&gt;&lt;/p&gt;
&lt;h2 id=&#34;parameters&#34;&gt;Parameters&lt;/h2&gt;

&lt;table class=&#34;table table-bordered&#34; &gt;



&lt;tr&gt; 

&lt;th &gt;
Parameter Name&lt;/th&gt; 

&lt;th &gt;
Parameter Value&lt;/th&gt;&lt;/tr&gt;

&lt;tr&gt; 

&lt;td &gt;
stopwordscaseinsensitive&lt;/td&gt; 

&lt;td &gt;
&lt;code&gt;&#39;&#39;&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;

&lt;tr&gt; 

&lt;td &gt;
minorseparators&lt;/td&gt; 

&lt;td &gt;
&lt;code&gt;&#39;&#39;&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;

&lt;tr&gt; 

&lt;td &gt;
majorseparators&lt;/td&gt; 

&lt;td &gt;
&lt;code&gt;E&#39; []&amp;lt;&amp;gt;(){}|!;,&#39;&#39;&amp;quot;*&amp;amp;?+\r\n\t&#39;&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;

&lt;tr&gt; 

&lt;td &gt;
minLength&lt;/td&gt; 

&lt;td &gt;
&lt;code&gt;&#39;2&#39;&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;

&lt;tr&gt; 

&lt;td &gt;
maxLength&lt;/td&gt; 

&lt;td &gt;
&lt;code&gt;&#39;128&#39;&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;

&lt;tr&gt; 

&lt;td &gt;
used&lt;/td&gt; 

&lt;td &gt;
&lt;code&gt;&#39;True&#39;&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;h2 id=&#34;examples&#34;&gt;Examples&lt;/h2&gt;
&lt;p&gt;The following example shows how you can create a text index on the table foo using the Basic Log Tokenizer without a stemmer.&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;=&amp;gt; CREATE TABLE foo (id INT PRIMARY KEY NOT NULL,text VARCHAR(250));
=&amp;gt; COPY foo FROM STDIN;
End with a backslash and a period on a line by itself.
&amp;gt;&amp;gt; 1|2014-05-10 00:00:05.700433 %ASA-6-302013: Built outbound TCP connection 9986454 for outside:101.123.123.111/443 (101.123.123.111/443)
&amp;gt;&amp;gt; \.
=&amp;gt; CREATE PROJECTION foo_projection AS SELECT * FROM foo ORDER BY id
                                     SEGMENTED BY HASH(id) ALL NODES KSAFE;
=&amp;gt; CREATE TEXT INDEX indexfoo_BasicLogTokenizer ON foo (id, text)
                 TOKENIZER v_txtindex.BasicLogTokenizer(LONG VARCHAR) STEMMER NONE;
=&amp;gt; SELECT * FROM indexfoo_BasicLogTokenizer;
            token            | doc_id
-----------------------------+--------
 %ASA-6-302013:              |      1
 00:00:05.700433             |      1
 101.123.123.111/443         |      1
 2014-05-10                  |      1
 9986454                     |      1
 Built                       |      1
 TCP                         |      1
 connection                  |      1
 for                         |      1
 outbound                    |      1
 outside:101.123.123.111/443 |      1
(11 rows)
&lt;/code&gt;&lt;/pre&gt;
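&lt;p&gt;With no minor separators, only whole tokens are indexed, which affects how you search. The following minimal sketch assumes the index from the example above; the search values are chosen only for illustration, and the output is omitted because it was not captured from a live system.&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;-- Sketch only: an exact match on &amp;#39;443&amp;#39; is not expected to find anything,
-- because the token list above contains no standalone &amp;#39;443&amp;#39; token.
=&amp;gt; SELECT doc_id FROM indexfoo_BasicLogTokenizer WHERE token = &amp;#39;443&amp;#39;;

-- A pattern match against the whole tokens is one alternative.
=&amp;gt; SELECT doc_id, token FROM indexfoo_BasicLogTokenizer WHERE token LIKE &amp;#39;%/443&amp;#39;;
&lt;/code&gt;&lt;/pre&gt;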
      </description>
    </item>
    
    <item>
      <title>Admin: Whitespace log tokenizer</title>
      <link>/en/admin/using-text-search/stemmers-and-tokenizers/tokenizers/whitespace-log-tokenizer/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      
      <guid>/en/admin/using-text-search/stemmers-and-tokenizers/tokenizers/whitespace-log-tokenizer/</guid>
      <description>
        
        
        &lt;p&gt;Returns only tokens surrounded by whitespace. Use this tokenizer when you want the tokens in your source document to be separated by whitespace characters only. This approach lets you retain the ability to set stop words and token length limits.

&lt;div class=&#34;admonition important&#34; role=&#34;alert&#34;&gt;
&lt;h4 class=&#34;admonition-head&#34;&gt;Important&lt;/h4&gt;
If you create a database with no tables and the k-safety has increased, you must rebalance your data using &lt;a href=&#34;../../../../../en/sql-reference/functions/management-functions/cluster-functions/rebalance-cluster/&#34;&gt;REBALANCE_CLUSTER&lt;/a&gt; before using a Vertica tokenizer.
&lt;/div&gt;&lt;/p&gt;
&lt;h2 id=&#34;parameters&#34;&gt;Parameters&lt;/h2&gt;

&lt;table class=&#34;table table-bordered&#34; &gt;



&lt;tr&gt; 

&lt;th &gt;
Parameter Name&lt;/th&gt; 

&lt;th &gt;
Parameter Value&lt;/th&gt;&lt;/tr&gt;

&lt;tr&gt; 

&lt;td &gt;
stopwordscaseinsensitive&lt;/td&gt; 

&lt;td &gt;
&lt;code&gt;&#39;&#39;&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;

&lt;tr&gt; 

&lt;td &gt;
minorseparators&lt;/td&gt; 

&lt;td &gt;
&lt;code&gt;&#39;&#39;&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;

&lt;tr&gt; 

&lt;td &gt;
majorseparators&lt;/td&gt; 

&lt;td &gt;
&lt;code&gt;E&#39; \t\n\f\r&#39;&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;

&lt;tr&gt; 

&lt;td &gt;
minLength&lt;/td&gt; 

&lt;td &gt;
&lt;code&gt;&#39;2&#39;&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;

&lt;tr&gt; 

&lt;td &gt;
maxLength&lt;/td&gt; 

&lt;td &gt;
&lt;code&gt;&#39;128&#39;&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;

&lt;tr&gt; 

&lt;td &gt;
used&lt;/td&gt; 

&lt;td &gt;
&lt;code&gt;&#39;True&#39;&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;h2 id=&#34;examples&#34;&gt;Examples&lt;/h2&gt;
&lt;p&gt;The following example shows how you can create a text index on the table foo using the Whitespace Log Tokenizer without a stemmer.&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;=&amp;gt; CREATE TABLE foo (id INT PRIMARY KEY NOT NULL,text VARCHAR(250));
=&amp;gt; COPY foo FROM STDIN;
End with a backslash and a period on a line by itself.
&amp;gt;&amp;gt; 1|2014-05-10 00:00:05.700433 %ASA-6-302013: Built outbound TCP connection 998 6454 for outside:101.123.123.111/443 (101.123.123.111/443)
&amp;gt;&amp;gt; \.
=&amp;gt; CREATE PROJECTION foo_projection AS SELECT * FROM foo ORDER BY id
                                     SEGMENTED BY HASH(id) ALL NODES KSAFE;
=&amp;gt; CREATE TEXT INDEX indexfoo_WhitespaceLogTokenizer ON foo (id, text)
                TOKENIZER v_txtindex.WhitespaceLogTokenizer(LONG VARCHAR) STEMMER NONE;
=&amp;gt; SELECT * FROM indexfoo_WhitespaceLogTokenizer;
            token            | doc_id
-----------------------------+--------
 %ASA-6-302013:              |      1
 (101.123.123.111/443)       |      1
 00:00:05.700433             |      1
 2014-05-10                  |      1
 6454                        |      1
 998                         |      1
 Built                       |      1
 TCP                         |      1
 connection                  |      1
 for                         |      1
 outbound                    |      1
 outside:101.123.123.111/443 |      1
(12 rows)
&lt;/code&gt;&lt;/pre&gt;
      </description>
    </item>
    
    <item>
      <title>Admin: ICU tokenizer</title>
      <link>/en/admin/using-text-search/stemmers-and-tokenizers/tokenizers/icu-tokenizer/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      
      <guid>/en/admin/using-text-search/stemmers-and-tokenizers/tokenizers/icu-tokenizer/</guid>
      <description>
        
        
        &lt;p&gt;Supports multiple languages. You can use this tokenizer to identify word boundaries in languages other than English, including Asian languages that are not separated by whitespace.&lt;/p&gt;
&lt;p&gt;The ICU Tokenizer is not preconfigured. You configure the tokenizer by first creating a user-defined transform function (UDTF), then setting the locale parameter to identify the language to tokenize.

&lt;div class=&#34;admonition important&#34; role=&#34;alert&#34;&gt;
&lt;h4 class=&#34;admonition-head&#34;&gt;Important&lt;/h4&gt;
If you create a database with no tables and the k-safety has increased, you must rebalance your data using &lt;a href=&#34;../../../../../en/sql-reference/functions/management-functions/cluster-functions/rebalance-cluster/&#34;&gt;REBALANCE_CLUSTER&lt;/a&gt; before using a Vertica tokenizer.
&lt;/div&gt;&lt;/p&gt;
&lt;h2 id=&#34;parameters&#34;&gt;Parameters&lt;/h2&gt;

&lt;table class=&#34;table table-bordered&#34; &gt;



&lt;tr&gt; 

&lt;th &gt;
Parameter Name&lt;/th&gt; 

&lt;th &gt;
Parameter Value&lt;/th&gt;&lt;/tr&gt;

&lt;tr&gt; 

&lt;td &gt;
locale&lt;/td&gt; 

&lt;td &gt;






&lt;p&gt;Uses the POSIX naming convention: language[_COUNTRY]&lt;/p&gt;
&lt;p&gt;Identify the language using its ISO-639 code, and the country using its ISO-3166 code. For example, the parameter value for simplified Chinese is zh_CN, and the value for Spanish is es_ES.&lt;/p&gt;
&lt;p&gt;The default value is English if you do not specify a locale.&lt;/p&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;h2 id=&#34;example&#34;&gt;Example&lt;/h2&gt;
&lt;p&gt;The following steps show how to configure the ICU Tokenizer for simplified Chinese, then create a text index from the table foo, which contains Chinese characters.&lt;/p&gt;
&lt;p&gt;For more on how to configure tokenizers, see &lt;a href=&#34;../../../../../en/admin/using-text-search/stemmers-and-tokenizers/configuring-tokenizer/&#34;&gt;Configuring a tokenizer&lt;/a&gt;.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Create the tokenizer using a UDTF. The example tokenizer is named ICUChineseTokenizer.&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;VMart=&amp;gt; CREATE OR REPLACE TRANSFORM FUNCTION v_txtindex.ICUChineseTokenizer AS LANGUAGE &amp;#39;C++&amp;#39; NAME &amp;#39;ICUTokenizerFactory&amp;#39; LIBRARY v_txtindex.logSearchLib NOT FENCED;
CREATE TRANSFORM FUNCTION
&lt;/code&gt;&lt;/pre&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Get the procedure ID of the tokenizer.&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;VMart=&amp;gt; SELECT proc_oid from vs_procedures where procedure_name = &amp;#39;ICUChineseTokenizer&amp;#39;;
     proc_oid
-------------------
 45035996280452894
(1 row)
&lt;/code&gt;&lt;/pre&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Set the parameter, locale, to simplified Chinese. Identify the tokenizer using its procedure ID.&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;VMart=&amp;gt; SELECT v_txtindex.SET_TOKENIZER_PARAMETER(&amp;#39;locale&amp;#39;,&amp;#39;zh_CN&amp;#39; using parameters proc_oid=&amp;#39;45035996280452894&amp;#39;);
 SET_TOKENIZER_PARAMETER
-------------------------
 t
(1 row)
&lt;/code&gt;&lt;/pre&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Lock the tokenizer.&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;VMart=&amp;gt; SELECT v_txtindex.SET_TOKENIZER_PARAMETER(&amp;#39;used&amp;#39;,&amp;#39;true&amp;#39; using parameters proc_oid=&amp;#39;45035996280452894&amp;#39;);
 SET_TOKENIZER_PARAMETER
-------------------------
 t
(1 row)
&lt;/code&gt;&lt;/pre&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Create an example table, foo, containing simplified Chinese text to index.&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;VMart=&amp;gt; CREATE TABLE foo(doc_id integer primary key not null,text varchar(250));
CREATE TABLE

VMart=&amp;gt; INSERT INTO foo values(1, u&amp;amp;&amp;#39;\4E2D\534E\4EBA\6C11\5171\548C\56FD&amp;#39;);
 OUTPUT
--------
      1
&lt;/code&gt;&lt;/pre&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Create an index, index_example, on the table foo. The example creates the index without a stemmer because Vertica stemmers work only on English text. Using a stemmer for English on non-English text can cause incorrect tokenization.&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;VMart=&amp;gt; CREATE TEXT INDEX index_example ON foo (doc_id, text) TOKENIZER v_txtindex.ICUChineseTokenizer(long varchar) stemmer none;
CREATE INDEX
&lt;/code&gt;&lt;/pre&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;View the new index.&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;VMart=&amp;gt; SELECT * FROM index_example ORDER BY token,doc_id;
 token  | doc_id
--------+--------
 中华    |      1
 人民   |      1
 共和国 |      1
(3 rows)
&lt;/code&gt;&lt;/pre&gt;&lt;/li&gt;
&lt;/ol&gt;
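&lt;p&gt;To confirm the locale setting, you can read the parameter back. The following is a minimal sketch that assumes the GET_TOKENIZER_PARAMETER function covered in Configuring a tokenizer is available in your version, and that reuses the procedure ID returned in step 2. The output is omitted because it was not captured from a live system.&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;-- Sketch only: read back the locale stored for ICUChineseTokenizer.
VMart=&amp;gt; SELECT v_txtindex.GET_TOKENIZER_PARAMETER(&amp;#39;locale&amp;#39; using parameters proc_oid=&amp;#39;45035996280452894&amp;#39;);
&lt;/code&gt;&lt;/pre&gt;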

      </description>
    </item>
    
  </channel>
</rss>
