<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>OpenText Analytics Database 26.2.x – Tokenizers</title>
    <link>/en/admin/using-text-search/stemmers-and-tokenizers/tokenizers/</link>
    <description>Recent content in Tokenizers on OpenText Analytics Database 26.2.x</description>
    <generator>Hugo -- gohugo.io</generator>
    
	  <atom:link href="/en/admin/using-text-search/stemmers-and-tokenizers/tokenizers/index.xml" rel="self" type="application/rss+xml" />
    
    
      
        
      
    
    
    <item>
      <title>Admin: Preconfigured tokenizers</title>
      <link>/en/admin/using-text-search/stemmers-and-tokenizers/tokenizers/preconfigured-tokenizers/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      
      <guid>/en/admin/using-text-search/stemmers-and-tokenizers/tokenizers/preconfigured-tokenizers/</guid>
      <description>
        
        
        &lt;p&gt;OpenText™ Analytics Database provides the following preconfigured tokenizers:&lt;/p&gt;
&lt;dl&gt;
&lt;dt&gt;&lt;code&gt;public.FlexTokenizer(LONG VARBINARY)&lt;/code&gt;&lt;/dt&gt;
&lt;dd&gt;Splits the values in your flex table by white space.&lt;/dd&gt;
&lt;dt&gt;&lt;code&gt;v_txtindex.StringTokenizer(LONG VARCHAR)&lt;/code&gt;&lt;/dt&gt;
&lt;dd&gt;Splits a string into words on white space.&lt;/dd&gt;
&lt;dt&gt;&lt;code&gt;v_txtindex.StringTokenizerDelim(LONG VARCHAR, CHAR(1))&lt;/code&gt;&lt;/dt&gt;
&lt;dd&gt;Splits a string into tokens using the specified delimiter character.&lt;/dd&gt;
&lt;/dl&gt;
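&lt;p&gt;To illustrate, the following Python sketch approximates the white-space splitting behavior described above. This is an assumption-laden model for intuition only, not the in-database implementation of these tokenizers.&lt;/p&gt;

```python
# Illustrative sketch only: models how a white-space tokenizer such as
# v_txtindex.StringTokenizer breaks text into tokens. The actual
# in-database implementation may differ in edge cases.
def whitespace_tokenize(text):
    """Split text on runs of white space, dropping empty tokens."""
    return text.split()

print(whitespace_tokenize("fast ad hoc   queries"))
# prints ['fast', 'ad', 'hoc', 'queries']
```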
&lt;p&gt;The database also provides the following tokenizer, which is not preconfigured:&lt;/p&gt;
&lt;dl&gt;
&lt;dt&gt;&lt;code&gt;v_txtindex.ICUTokenizer&lt;/code&gt;&lt;/dt&gt;
&lt;dd&gt;Supports multiple languages. Tokenizes based on the conventions of the language you set in the locale parameter. For more information, see ICU Tokenizer.&lt;/dd&gt;
&lt;/dl&gt;
&lt;h2 id=&#34;examples&#34;&gt;Examples&lt;/h2&gt;
&lt;p&gt;The following examples show how you can use a preconfigured tokenizer when creating a text index.&lt;/p&gt;
&lt;p&gt;Use the StringTokenizer to create an index from the table top_100:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;=&amp;gt; CREATE TEXT INDEX idx_100 FROM top_100 on (id, feedback)
                TOKENIZER v_txtindex.StringTokenizer(long varchar)
                 STEMMER v_txtindex.StemmerCaseInsensitive(long varchar);
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Use the FlexTokenizer to create an index from unstructured data:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;=&amp;gt; CREATE TEXT INDEX idx_unstruc FROM unstruc_data on (__identity__, __raw__)
                                 TOKENIZER public.FlexTokenizer(long varbinary)
                                    STEMMER v_txtindex.StemmerCaseSensitive(long varchar);
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Use the StringTokenizerDelim to split a string at the specified delimiter:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;=&amp;gt; CREATE TABLE string_table (word VARCHAR(100), delim VARCHAR);
CREATE TABLE
=&amp;gt; COPY string_table FROM STDIN DELIMITER &amp;#39;,&amp;#39;;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
&amp;gt;&amp;gt;
&amp;gt;&amp;gt; SingleWord,dd
&amp;gt;&amp;gt; Break On Spaces,&amp;#39; &amp;#39;
&amp;gt;&amp;gt; Break:On:Colons,:
&amp;gt;&amp;gt; \.
=&amp;gt; SELECT * FROM string_table;
            word | delim
-----------------+-------
      SingleWord | dd
 Break On Spaces |
 Break:On:Colons | :
(3 rows)

=&amp;gt; SELECT v_txtindex.StringTokenizerDelim(word,delim) OVER () FROM string_table;
      words
-----------------
 Break
 On
 Colons
 SingleWor
 Break
 On
 Spaces
(7 rows)

=&amp;gt; SELECT v_txtindex.StringTokenizerDelim(word,delim) OVER (PARTITION BY word), word as input FROM string_table;
           words | input
-----------------+-----------------
           Break | Break:On:Colons
              On | Break:On:Colons
          Colons | Break:On:Colons
       SingleWor | SingleWord
           Break | Break On Spaces
              On | Break On Spaces
          Spaces | Break On Spaces
(7 rows)
&lt;/code&gt;&lt;/pre&gt;
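&lt;p&gt;As a rough mental model of the delimiter-based splitting shown above, the following Python sketch splits on a single delimiter character and drops empty tokens. This is a hypothetical approximation, assuming the CHAR(1) parameter means only one delimiter character is used; it is not the in-database implementation.&lt;/p&gt;

```python
# Hypothetical sketch: approximates a single-character delimiter
# tokenizer in the spirit of v_txtindex.StringTokenizerDelim.
# Assumption: the delimiter is a single character (the SQL parameter
# is declared CHAR(1)), and empty tokens are dropped.
def delim_tokenize(text, delim):
    tokens = text.split(delim[0])
    return [t for t in tokens if t]

print(delim_tokenize("Break:On:Colons", ":"))
# prints ['Break', 'On', 'Colons']
print(delim_tokenize("Break On Spaces", " "))
# prints ['Break', 'On', 'Spaces']
```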
      </description>
    </item>
    
    <item>
      <title>Admin: ICU tokenizer</title>
      <link>/en/admin/using-text-search/stemmers-and-tokenizers/tokenizers/icu-tokenizer/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      
      <guid>/en/admin/using-text-search/stemmers-and-tokenizers/tokenizers/icu-tokenizer/</guid>
      <description>
        
        
&lt;p&gt;Supports multiple languages. You can use this tokenizer to identify word boundaries in languages other than English, including Asian languages in which words are not separated by white space.&lt;/p&gt;
&lt;p&gt;The ICU tokenizer is not preconfigured. To configure it, first create a user-defined transform function (UDTF), then set the locale parameter to identify the language to tokenize.&lt;/p&gt;

&lt;div class=&#34;admonition important&#34; role=&#34;alert&#34;&gt;
&lt;h4 class=&#34;admonition-head&#34;&gt;Important&lt;/h4&gt;
If you create a database with no tables and the k-safety has increased, you must rebalance your data using &lt;a href=&#34;../../../../../en/sql-reference/functions/management-functions/cluster-functions/rebalance-cluster/#&#34;&gt;REBALANCE_CLUSTER&lt;/a&gt; before using a tokenizer.
&lt;/div&gt;
&lt;h2 id=&#34;parameters&#34;&gt;Parameters&lt;/h2&gt;

&lt;table class=&#34;table table-bordered&#34;&gt;
&lt;tr&gt;
&lt;th&gt;Parameter Name&lt;/th&gt;
&lt;th&gt;Parameter Value&lt;/th&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;locale&lt;/td&gt;
&lt;td&gt;
&lt;p&gt;Uses the POSIX naming convention: language[_COUNTRY].&lt;/p&gt;
&lt;p&gt;Identify the language using its ISO-639 code, and the country using its ISO-3166 code. For example, the parameter value for simplified Chinese is zh_CN, and the value for Spanish is es_ES.&lt;/p&gt;
&lt;p&gt;The default value is English if you do not specify a locale.&lt;/p&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;
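&lt;p&gt;The language[_COUNTRY] convention can be sketched as follows. The helper below is hypothetical, written only to illustrate how a locale value such as zh_CN decomposes into its ISO-639 language code and optional ISO-3166 country code.&lt;/p&gt;

```python
# Hypothetical helper: parses a POSIX-style locale value of the form
# language[_COUNTRY], the convention the ICU tokenizer's locale
# parameter uses. Illustration only, not part of the database API.
def parse_locale(locale):
    language, _, country = locale.partition("_")
    return {"language": language, "country": country or None}

print(parse_locale("zh_CN"))
# prints {'language': 'zh', 'country': 'CN'}
print(parse_locale("es"))
# prints {'language': 'es', 'country': None}
```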

&lt;h2 id=&#34;example&#34;&gt;Example&lt;/h2&gt;
&lt;p&gt;The following example steps show how you can configure the ICU Tokenizer for simplified Chinese, then create a text index from the table foo, which contains Chinese characters.&lt;/p&gt;
&lt;p&gt;For more on how to configure tokenizers, see &lt;a href=&#34;../../../../../en/admin/using-text-search/stemmers-and-tokenizers/configuring-tokenizer/#&#34;&gt;Configuring a tokenizer&lt;/a&gt;.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Create the tokenizer using a UDTF. The example tokenizer is named ICUChineseTokenizer.&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;VMart=&amp;gt; CREATE OR REPLACE TRANSFORM FUNCTION v_txtindex.ICUChineseTokenizer AS LANGUAGE &amp;#39;C++&amp;#39; NAME &amp;#39;ICUTokenizerFactory&amp;#39; LIBRARY v_txtindex.logSearchLib NOT FENCED;
CREATE TRANSFORM FUNCTION
&lt;/code&gt;&lt;/pre&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Get the procedure ID of the tokenizer.&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;VMart=&amp;gt; SELECT proc_oid from vs_procedures where procedure_name = &amp;#39;ICUChineseTokenizer&amp;#39;;
     proc_oid
-------------------
 45035996280452894
(1 row)
&lt;/code&gt;&lt;/pre&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Set the locale parameter to simplified Chinese, identifying the tokenizer by its procedure ID.&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;VMart=&amp;gt; SELECT v_txtindex.SET_TOKENIZER_PARAMETER(&amp;#39;locale&amp;#39;,&amp;#39;zh_CN&amp;#39; using parameters proc_oid=&amp;#39;45035996280452894&amp;#39;);
 SET_TOKENIZER_PARAMETER
-------------------------
 t
(1 row)
&lt;/code&gt;&lt;/pre&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Lock the tokenizer.&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;VMart=&amp;gt; SELECT v_txtindex.SET_TOKENIZER_PARAMETER(&amp;#39;used&amp;#39;,&amp;#39;true&amp;#39; using parameters proc_oid=&amp;#39;45035996280452894&amp;#39;);
 SET_TOKENIZER_PARAMETER
-------------------------
 t
(1 row)
&lt;/code&gt;&lt;/pre&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Create an example table, foo, containing simplified Chinese text to index.&lt;br /&gt;&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;VMart=&amp;gt; CREATE TABLE foo(doc_id integer primary key not null,text varchar(250));
CREATE TABLE

VMart=&amp;gt; INSERT INTO foo values(1, u&amp;amp;&amp;#39;\4E2D\534E\4EBA\6C11\5171\548C\56FD&amp;#39;);
 OUTPUT
--------
      1
&lt;/code&gt;&lt;/pre&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Create an index, index_example, on the table foo. The example creates the index without a stemmer; stemmers work only on English text, and applying an English stemmer to non-English text can cause incorrect tokenization.&lt;br /&gt;&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;VMart=&amp;gt; CREATE TEXT INDEX index_example ON foo (doc_id, text) TOKENIZER v_txtindex.ICUChineseTokenizer(long varchar) stemmer none;
CREATE INDEX
&lt;/code&gt;&lt;/pre&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;View the new index.&lt;br /&gt;&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;VMart=&amp;gt; SELECT * FROM index_example ORDER BY token,doc_id;
 token  | doc_id
--------+--------
 中华    |      1
 人民   |      1
 共和国 |      1
(3 rows)
&lt;/code&gt;&lt;/pre&gt;&lt;/li&gt;
&lt;/ol&gt;

      </description>
    </item>
    
  </channel>
</rss>
