<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>OpenText Analytics Database 26.2.x – Stemmers and tokenizers</title>
    <link>/en/admin/using-text-search/stemmers-and-tokenizers/</link>
    <description>Recent content in Stemmers and tokenizers on OpenText Analytics Database 26.2.x</description>
    <generator>Hugo -- gohugo.io</generator>
    
	  <atom:link href="/en/admin/using-text-search/stemmers-and-tokenizers/index.xml" rel="self" type="application/rss+xml" />
    
    
      
        
      
    
    
    <item>
      <title>Admin: Stemmers</title>
      <link>/en/admin/using-text-search/stemmers-and-tokenizers/stemmers/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      
      <guid>/en/admin/using-text-search/stemmers-and-tokenizers/stemmers/</guid>
      <description>
        
        
        &lt;p&gt;&lt;em&gt;Stemmers&lt;/em&gt; use the Porter stemming algorithm to match words derived from the same root word. For example, if you search a text index for the keyword &lt;em&gt;database&lt;/em&gt;, you might also want results that contain the word &lt;em&gt;databases&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;To make this kind of matching possible, a text index created with any of the v_txtindex stemmers stores words in their stemmed form.&lt;/p&gt;
&lt;p&gt;OpenText™ Analytics Database provides the following stemmers:

&lt;table class=&#34;table table-bordered&#34; &gt;



&lt;tr&gt; 

&lt;th &gt;
Name&lt;/th&gt; 

&lt;th &gt;
Description&lt;/th&gt;&lt;/tr&gt;

&lt;tr&gt; 

&lt;td &gt;
v_txtindex.Stemmer(long varchar)&lt;/td&gt; 

&lt;td &gt;




&lt;p&gt;Not sensitive to case; outputs lowercase words. Stems strings from a database table.&lt;/p&gt;
&lt;p&gt;Alias of StemmerCaseInsensitive.&lt;/p&gt;
&lt;/td&gt;&lt;/tr&gt;

&lt;tr&gt; 

&lt;td &gt;
v_txtindex.StemmerCaseSensitive(long varchar)&lt;/td&gt; 

&lt;td &gt;
Sensitive to case. Stems strings from a database table.&lt;/td&gt;&lt;/tr&gt;

&lt;tr&gt; 

&lt;td &gt;
v_txtindex.StemmerCaseInsensitive(long varchar)&lt;/td&gt; 

&lt;td &gt;




&lt;p&gt;Default stemmer used if no stemmer is specified when creating a text index.&lt;/p&gt;
&lt;p&gt;Not sensitive to case; outputs lowercase words. Stems strings from a database table.&lt;/p&gt;
&lt;/td&gt;&lt;/tr&gt;

&lt;tr&gt; 

&lt;td &gt;
v_txtindex.caseInsensitiveNoStemming(long varchar)&lt;/td&gt; 

&lt;td &gt;
Not sensitive to case; outputs lowercase words. Does not use the Porter stemming algorithm.&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/p&gt;
&lt;h2 id=&#34;examples&#34;&gt;Examples&lt;/h2&gt;
&lt;p&gt;The following examples show how to use a stemmer when creating a text index.&lt;/p&gt;
&lt;p&gt;Create a text index using the StemmerCaseInsensitive stemmer:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;=&amp;gt; CREATE TEXT INDEX idx_100 ON top_100 (id, feedback) STEMMER v_txtindex.StemmerCaseInsensitive(long varchar)
                                                              TOKENIZER v_txtindex.StringTokenizer(long varchar);
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Create a text index using the StemmerCaseSensitive stemmer:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;=&amp;gt; CREATE TEXT INDEX idx_unstruc ON unstruc_data (__identity__, __raw__) STEMMER v_txtindex.StemmerCaseSensitive(long varchar)
                                                                                  TOKENIZER public.FlexTokenizer(long varbinary);
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Create a text index without using a stemmer:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;=&amp;gt; CREATE TEXT INDEX idx_logs ON sys_logs (id, message) STEMMER NONE TOKENIZER v_txtindex.StringTokenizer(long varchar);
&lt;/code&gt;&lt;/pre&gt;
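&lt;p&gt;As a minimal sketch of how a stemmer is used at query time, the following query stems the search keyword with the same stemmer that built &lt;code&gt;idx_100&lt;/code&gt;, so that a search for &lt;em&gt;databases&lt;/em&gt; can also match rows that contain &lt;em&gt;database&lt;/em&gt;. It assumes that the text index exposes &lt;code&gt;token&lt;/code&gt; and &lt;code&gt;doc_id&lt;/code&gt; columns and that &lt;code&gt;doc_id&lt;/code&gt; holds the &lt;code&gt;id&lt;/code&gt; value of the source row; check these names against your own index before relying on them.&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;-- Assumption: idx_100 has columns token and doc_id.
-- Stem the keyword with the stemmer used to build the index, then look up matching rows.
=&amp;gt; SELECT * FROM top_100
   WHERE id IN (SELECT doc_id FROM idx_100
                WHERE token = v_txtindex.StemmerCaseInsensitive(&#39;databases&#39;));
&lt;/code&gt;&lt;/pre&gt;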
      </description>
    </item>
    
    <item>
      <title>Admin: Tokenizers</title>
      <link>/en/admin/using-text-search/stemmers-and-tokenizers/tokenizers/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      
      <guid>/en/admin/using-text-search/stemmers-and-tokenizers/tokenizers/</guid>
      <description>
        
        
        &lt;p&gt;A tokenizer does the following:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Receives a stream of characters.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Breaks the stream into individual tokens that usually correspond to individual words.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Returns a stream of tokens.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;

      </description>
    </item>
    
    <item>
      <title>Admin: Configuring a tokenizer</title>
      <link>/en/admin/using-text-search/stemmers-and-tokenizers/configuring-tokenizer/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      
      <guid>/en/admin/using-text-search/stemmers-and-tokenizers/configuring-tokenizer/</guid>
      <description>
        
        
        &lt;p&gt;You configure a tokenizer by creating a user-defined transform function (UDTF) using one of the two base UDTFs in the &lt;code&gt;v_txtindex.AdvTxtSearchLib&lt;/code&gt; library. The library contains two base tokenizers: one for Log Words and one for Ngrams. You can configure each base function with or without positional relevance.&lt;/p&gt;

      </description>
    </item>
    
    <item>
      <title>Admin: Requirements for custom stemmers and tokenizers</title>
      <link>/en/admin/using-text-search/stemmers-and-tokenizers/requirements-custom-stemmers-and-tokenizers/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      
      <guid>/en/admin/using-text-search/stemmers-and-tokenizers/requirements-custom-stemmers-and-tokenizers/</guid>
      <description>
        
        
        &lt;p&gt;Sometimes, you may want specific tokenization or stemming behavior that differs from what OpenText™ Analytics Database provides. In such cases, you can implement your own custom User Defined Extensions (UDxs) to replace the stemmer or tokenizer. For more information about building custom UDxs, see &lt;a href=&#34;../../../../en/extending/developing-udxs/#&#34;&gt;Developing user-defined extensions (UDxs)&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Before implementing a custom stemmer or tokenizer, verify that your UDx meets the following requirements.

&lt;div class=&#34;alert admonition note&#34; role=&#34;alert&#34;&gt;
&lt;h4 class=&#34;admonition-head&#34;&gt;Note&lt;/h4&gt;

Custom tokenizers can return multi-column text indices.

&lt;/div&gt;&lt;/p&gt;
&lt;h2 id=&#34;stemmer-requirements&#34;&gt;Stemmer requirements&lt;/h2&gt;
&lt;p&gt;Custom stemmers must meet these requirements:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Must be a User Defined Scalar Function (UDSF) or a SQL Function&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Can be written in C++, Java, or R&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Volatility must be set to stable or immutable&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Supported Data Input Types&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Varchar&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Long varchar&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Supported Data Output Types&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Varchar&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Long varchar&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
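&lt;p&gt;The stemmer requirements above can be illustrated with a small SQL function. This is only a sketch under stated assumptions: &lt;code&gt;public.lowercase_stemmer&lt;/code&gt; and &lt;code&gt;idx_custom&lt;/code&gt; are hypothetical names, the function merely lowercases its input instead of performing real stemming, and the volatility of the function is assumed to follow from the &lt;code&gt;LOWER&lt;/code&gt; expression it wraps. The &lt;code&gt;top_100&lt;/code&gt; table is the one used in the stemmer examples. Confirm the accepted argument types against your own installation before using it.&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;-- Hypothetical SQL function used as a stemmer: varchar in, varchar out.
=&amp;gt; CREATE OR REPLACE FUNCTION public.lowercase_stemmer(word varchar) RETURN varchar
   AS BEGIN
       RETURN LOWER(word);
   END;

-- Hypothetical text index that names the custom function as its stemmer.
=&amp;gt; CREATE TEXT INDEX idx_custom ON top_100 (id, feedback) STEMMER public.lowercase_stemmer(varchar)
                                                         TOKENIZER v_txtindex.StringTokenizer(long varchar);
&lt;/code&gt;&lt;/pre&gt;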
&lt;h2 id=&#34;tokenizer-requirements&#34;&gt;Tokenizer requirements&lt;/h2&gt;
&lt;p&gt;Custom tokenizers must meet these requirements:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Must be a User Defined Transform Function (UDTF)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Can be written in C++, Java, or R&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Input type must match the type of the input text&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Supported Data Input Types&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Char&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Varchar&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Long varchar&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Varbinary&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Long varbinary&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Supported Data Output Types&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Varchar&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Long varchar&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;

      </description>
    </item>
    
  </channel>
</rss>
