Using text search

Text search allows you to quickly search the contents of a single CHAR, VARCHAR, LONG VARCHAR, VARBINARY, or LONG VARBINARY field within a table to locate a specific keyword.

You can use this feature on columns whose contents are queried repeatedly. After you create the text index, DML operations on the source table become slightly slower because Vertica synchronizes the text index with the source table: whenever an operation changes the source table, the text index updates in the background. Regular queries on the source table are not affected.

The text index contains all of the words from the source table's text field and any other additional columns you included during index creation. Additional columns are not indexed; their values are just passed through to the text index. The text index is like any other Vertica table, except it is linked to the source table internally.

First, create a text index on the table you plan to search. Then, after you have indexed your table, run a query against the text index for a specific keyword. This query returns a doc_id for each instance of the keyword. Joining these results back to the source table is significantly faster than searching the text field's contents directly in the source table.
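
In outline, the pattern looks like this, using the t_log table and text_index index developed in the sections that follow:

=> CREATE TEXT INDEX text_index ON t_log (id, text);
=> SELECT * FROM t_log
   WHERE id IN (SELECT doc_id FROM text_index
                WHERE token = v_txtindex.StemmerCaseInsensitive('<WARNING>'));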

1 - Creating a text index

In the following example, you perform a text search using a source table called t_log. This source table has two columns:

  • One column containing the table's primary key

  • Another column containing log file information

You must associate a projection with the source table. Use a projection that is sorted by the primary key and either segmented by hash(id) or unsegmented. You can define this projection on the source table, along with any other existing projections.
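
For example, a projection that meets these requirements for the t_log table used below might look like the following sketch; the projection name t_log_proj is illustrative:

=> CREATE PROJECTION t_log_proj AS SELECT * FROM t_log ORDER BY id
                                   SEGMENTED BY HASH(id) ALL NODES KSAFE;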

Create a text index on the table for which you want to perform a text search.

=> CREATE TEXT INDEX text_index ON t_log (id, text);

The text index contains two columns:

  • doc_id uses the unique identifier from the source table.

  • token is populated with text strings from the designated column in the source table. The token column results from tokenizing and stemming the words found in the text column.

If your table is partitioned, then your text index also contains a third column named partition.

=> SELECT * FROM text_index;
          token         | doc_id | partition
------------------------+--------+-----------
<info>                  |      6 |      2014
<warning>               |      2 |      2014
<warning>               |      3 |      2014
<warning>               |      4 |      2014
<warning>               |      5 |      2014
database                |      6 |      2014
execute:                |      6 |      2014
object                  |      4 |      2014
object                  |      5 |      2014
[catalog]               |      4 |      2014
[catalog]               |      5 |      2014

You create a text index on a source table only once. In the future, you do not have to re-create the text index each time the source table is updated or changed.

Your text index stays synchronized to the contents of the source table through any operation that is run on the source table. These operations include, but are not limited to:

  • COPY

  • INSERT

  • UPDATE

  • DELETE

  • DROP PARTITION

  • MOVE_PARTITIONS_TO_TABLE

    When you move or swap partitions in a source table that is indexed, verify that the destination table already exists and is indexed in the same way.
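
For example, the following sketch shows that a row inserted into the source table is tokenized into the text index automatically. The inserted values are illustrative and use the t_log schema from this section:

=> INSERT INTO t_log VALUES (7, '2014-06-04', '<INFO> new log message');
=> COMMIT;
=> SELECT * FROM text_index WHERE doc_id = 7;

The last query returns the tokens produced from the new row.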

2 - Creating a text index on a flex table

In the following example, you create a text index on a flex table. The example assumes that you have created a flex table called mountains. See Getting started in Using Flex Tables to create the flex table used in this example.

Before you can create a text index on your flex table, add a primary key constraint to the flex table.

=> ALTER TABLE mountains ADD PRIMARY KEY (__identity__);

Create a text index on the table for which you want to perform a text search. Tokenize the __raw__ column with the FlexTokenizer and specify the data type as LONG VARBINARY. It is important to use the FlexTokenizer when creating text indexes on flex tables because the data type of the __raw__ column is not supported by the default StringTokenizer.

=> CREATE TEXT INDEX flex_text_index ON mountains(__identity__, __raw__) TOKENIZER public.FlexTokenizer(long varbinary);

The text index contains two columns:

  • doc_id uses the unique identifier from the source table.

  • token is populated with text strings from the designated column in the source table. The token column results from tokenizing and stemming the words found in the text column.

If your table is partitioned, then your text index also contains a third column named partition.

=> SELECT * FROM flex_text_index;
    token    | doc_id
-------------+--------
 50.6        |      5
 Mt          |      5
 Washington  |      5
 mountain    |      5
 12.2        |      3
 15.4        |      2
 17000       |      3
 29029       |      2
 Denali      |      3
 Helen       |      2
 Mt          |      2
 St          |      2
 mountain    |      3
 volcano     |      2
 29029       |      1
 34.1        |      1
 Everest     |      1
 mountain    |      1
 14000       |      4
 Kilimanjaro |      4
 mountain    |      4
(21 rows)

As with any text index, you create it only once; you do not need to re-create it when the source table changes. The index stays synchronized with the flex table through the same operations listed in Creating a text index.

3 - Searching a text index

After you create a text index, write a query to run against the index to search for a specific keyword.

In the following example, you use a WHERE clause to search for the keyword <WARNING> in the text index. The WHERE clause should use the same stemmer you used to create the text index. When you use the STEMMER keyword, it stems the keyword to match the keywords in your text index. If you did not use the STEMMER keyword, then the default stemmer is v_txtindex.StemmerCaseInsensitive. If you used STEMMER NONE, then omit the stemmer from the WHERE clause.

=> SELECT * FROM text_index WHERE token = v_txtindex.StemmerCaseInsensitive('<WARNING>');
  token    | doc_id
-----------+--------
<warning>  |     2
<warning>  |     3
<warning>  |     4
<warning>  |     5
(4 rows)

Next, write a query to display the full contents of the source table that match the keyword you searched for in the text index.

=> SELECT * FROM t_log WHERE id IN (SELECT doc_id FROM text_index WHERE token = v_txtindex.StemmerCaseInsensitive('<WARNING>'));
id |    date    |                                     text
---+------------+-----------------------------------------------------------------------------------------------
4  | 2014-06-04 | 11:00:49.568 unknown:0x7f9207607700 [Catalog] <WARNING> validateDependencies: Object 45035968
5  | 2014-06-04 | 11:00:49.568 unknown:0x7f9207607700 [Catalog] <WARNING> validateDependencies: Object 45030
2  | 2013-06-04 | 11:00:49.568 unknown:0x7f9207607700 [Catalog] <WARNING> validateDependencies: Object 4503
3  | 2013-06-04 | 11:00:49.568 unknown:0x7f9207607700 [Catalog] <WARNING> validateDependencies: Object 45066
(4 rows)

Use the doc_id to find the exact location of the keyword in the source table. The doc_id matches the unique identifier from the source table. This matching allows you to quickly find the instance of the keyword in your table.

Performing a case-sensitive and case-insensitive text search query

Your text index is optimized to match all instances of words, depending upon your stemmer. By default, the case-insensitive stemmer is applied to all text indexes that do not specify a stemmer. Therefore, if the queries you plan to write against your text index are case sensitive, Vertica recommends using a case-sensitive stemmer to build your text index.
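
For example, you might build a case-sensitive index as follows; the index name text_index_cs and the assumption that t_log_case_sensitive has the same schema as t_log are illustrative:

=> CREATE TEXT INDEX text_index_cs ON t_log_case_sensitive (id, text)
       STEMMER v_txtindex.StemmerCaseSensitive(long varchar)
       TOKENIZER v_txtindex.StringTokenizer(long varchar);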

The following examples show queries that match case-sensitive and case-insensitive words that you can use when performing a text search.

This query finds matches regardless of case in a case-insensitive text index:

=> SELECT * FROM t_log WHERE id IN (SELECT doc_id FROM text_index WHERE token = v_txtindex.StemmerCaseInsensitive('warning'));

This query finds exact-case matches in a case-sensitive text index:

=> SELECT * FROM t_log_case_sensitive WHERE id IN (SELECT doc_id FROM text_index_cs WHERE token = v_txtindex.StemmerCaseSensitive('Warning'));

Including and excluding keywords in a text search query

Your text index also allows you to perform more detailed queries to find multiple keywords or omit results with other keywords. The following example shows a more detailed query that you can use when performing a text search.

In this example, t_log is the source table and text_index is the text index. The query finds records that:

  • Contain the word '<WARNING>'

  • Contain either the word 'validate' or the word '[Log]'

  • Do not contain the word 'validateDependencies'

=> SELECT * FROM t_log WHERE (
   id IN (SELECT doc_id FROM text_index WHERE token = v_txtindex.StemmerCaseSensitive('<WARNING>'))
      AND (id IN (SELECT doc_id FROM text_index WHERE token = v_txtindex.StemmerCaseSensitive('validate'))
      OR id IN (SELECT doc_id FROM text_index WHERE token = v_txtindex.StemmerCaseSensitive('[Log]')))
      AND NOT (id IN (SELECT doc_id FROM text_index WHERE token = v_txtindex.StemmerCaseSensitive('validateDependencies'))));

This query returns the following results:

id  |   date     |                               text
----+------------+------------------------------------------------------------------------------------------------
11  | 2014-05-04 | 11:00:49.568 unknown:0x7f9207607702 [Log] <WARNING> validate: Object 4503 via fld num_all_roles
13  | 2014-05-04 | 11:00:49.568 unknown:0x7f9207607706 [Log] <WARNING> validate: Object 45035 refers to root_i3
14  | 2014-05-04 | 11:00:49.568 unknown:0x7f9207607708 [Log] <WARNING> validate: Object 4503 refers to int_2
17  | 2014-05-04 | 11:00:49.568 unknown:0x7f9207607700 [Txn] <WARNING> Begin validate Txn: fff0ed17 catalog editor
(4 rows)

4 - Dropping a text index

Dropping a text index removes the specified text index from the database.

You can drop a text index when:

  • It is no longer queried frequently.

  • An administrative task needs to be performed on the source table and requires the text index to be dropped.

Dropping the text index does not drop the source table associated with the text index. However, if you drop the source table associated with a text index, then that text index is also dropped. Vertica considers the text index a dependent object.

The following example illustrates how to drop a text index named text_index:

=> DROP TEXT INDEX text_index;
DROP INDEX

5 - Stemmers and tokenizers

Vertica provides default stemmers and tokenizers. You can also create your own custom stemmers and tokenizers. The following topics explain the default stemmers and tokenizers, and the requirements for creating custom stemmers and tokenizers in Vertica.

5.1 - Vertica stemmers

Vertica stemmers use the Porter stemming algorithm to find words derived from the same base/root word. For example, if you perform a search on a text index for the keyword database, you might also want to get results containing the word databases.

To achieve this type of matching, Vertica stores words in their stemmed form when using any of the v_txtindex stemmers.
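
To see the stored form that a stemmer produces, you can call it directly; a minimal sketch:

=> SELECT v_txtindex.StemmerCaseInsensitive('databases');

Under the Porter algorithm, this returns databas, the same stemmed form produced for database, which is why a search for either word matches both.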

The Vertica Analytics Platform provides the following stemmers:

Name Description
v_txtindex.Stemmer(long varchar) Not sensitive to case; outputs lowercase words. Stems strings from a Vertica table. Alias of StemmerCaseInsensitive.
v_txtindex.StemmerCaseSensitive(long varchar) Sensitive to case. Stems strings from a Vertica table.
v_txtindex.StemmerCaseInsensitive(long varchar) Default stemmer used if no stemmer is specified when creating a text index. Not sensitive to case; outputs lowercase words. Stems strings from a Vertica table.
v_txtindex.caseInsensitiveNoStemming(long varchar) Not sensitive to case; outputs lowercase words. Does not use the Porter Stemming algorithm.

Examples

The following examples show how to use a stemmer when creating a text index.

Create a text index using the StemmerCaseInsensitive stemmer:

=> CREATE TEXT INDEX idx_100 ON top_100 (id, feedback) STEMMER v_txtindex.StemmerCaseInsensitive(long varchar)
                                                              TOKENIZER v_txtindex.StringTokenizer(long varchar);

Create a text index using the StemmerCaseSensitive stemmer:

=> CREATE TEXT INDEX idx_unstruc ON unstruc_data (__identity__, __raw__) STEMMER v_txtindex.StemmerCaseSensitive(long varchar)
                                                                                  TOKENIZER public.FlexTokenizer(long varbinary);

Create a text index without using a stemmer:

=> CREATE TEXT INDEX idx_logs ON sys_logs (id, message) STEMMER NONE TOKENIZER v_txtindex.StringTokenizer(long varchar);

5.2 - Vertica tokenizers

A tokenizer does the following:

  • Receives a stream of characters.

  • Breaks the stream into individual tokens that usually correspond to individual words.

  • Returns a stream of tokens.

5.2.1 - Preconfigured tokenizers

The Vertica Analytics Platform provides the following preconfigured tokenizers:

Name Description
public.FlexTokenizer(LONG VARBINARY) Splits the values in your flex table by white space.
v_txtindex.StringTokenizer(LONG VARCHAR) Splits the string into words by splitting on white space.
v_txtindex.StringTokenizerDelim(string LONG VARCHAR, 'delimiter' CHAR(1)) Splits a string into tokens using the specified delimiter character.
v_txtindex.AdvancedLogTokenizer Uses the default parameters for all tokenizer parameters. For more information, see Advanced log tokenizer.
v_txtindex.BasicLogTokenizer Uses the default values for all tokenizer parameters except minorseparator, which is set to an empty list. For more information, see Basic log tokenizer.
v_txtindex.WhitespaceLogTokenizer Uses default values for tokenizer parameters, except for majorseparators, which uses E' \t\n\f\r'; and minorseparator, which uses an empty list. For more information, see Whitespace log tokenizer.

Vertica also provides the following tokenizer, which is not preconfigured:

Name Description
v_txtindex.ICUTokenizer Supports multiple languages. Tokenizes based on the conventions of the language you set in the locale parameter. For more information, see ICU Tokenizer.

Examples

The following examples show how you can use a preconfigured tokenizer when creating a text index.

Use the StringTokenizer to create an index from the top_100:

=> CREATE TEXT INDEX idx_100 ON top_100 (id, feedback)
                TOKENIZER v_txtindex.StringTokenizer(long varchar)
                 STEMMER v_txtindex.StemmerCaseInsensitive(long varchar);

Use the FlexTokenizer to create an index from unstructured data:

=> CREATE TEXT INDEX idx_unstruc ON unstruc_data (__identity__, __raw__)
                                 TOKENIZER public.FlexTokenizer(long varbinary)
                                    STEMMER v_txtindex.StemmerCaseSensitive(long varchar);

Use the StringTokenizerDelim to split a string at the specified delimiter:

=> CREATE TABLE string_table (word VARCHAR(100), delim VARCHAR);
CREATE TABLE
=> COPY string_table FROM STDIN DELIMITER ',';
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>>
>> SingleWord,dd
>> Break On Spaces,' '
>> Break:On:Colons,:
>> \.
=> SELECT * FROM string_table;
            word | delim
-----------------+-------
      SingleWord | dd
 Break On Spaces |
 Break:On:Colons | :
(3 rows)

=> SELECT v_txtindex.StringTokenizerDelim(word,delim) OVER () FROM string_table;
      words
-----------------
 Break
 On
 Colons
 SingleWor
 Break
 On
 Spaces
(7 rows)

=> SELECT v_txtindex.StringTokenizerDelim(word,delim) OVER (PARTITION BY word), word as input FROM string_table;
           words | input
-----------------+-----------------
           Break | Break:On:Colons
              On | Break:On:Colons
          Colons | Break:On:Colons
       SingleWor | SingleWord
           Break | Break On Spaces
              On | Break On Spaces
          Spaces | Break On Spaces
(7 rows)

5.2.2 - Advanced log tokenizer

Returns tokens that can include minor separators. You can use this tokenizer in situations when your tokens are separated by whitespace or various punctuation. The advanced log tokenizer offers more granularity than the basic log tokenizer in defining separators through the addition of minor separators. This approach is frequently appropriate for analyzing log files.

Parameters

Parameter Name Parameter Value
stopwordscaseinsensitive ''
minorseparators E'/:=@.-$#%\\_'
majorseparators E' []<>(){}|!;,''"*&?+\r\n\t'
minLength '2'
maxLength '128'
used 'True'

Examples

The following example shows how you can create a text index, from the table foo, using the Advanced Log Tokenizer without a stemmer.

=> CREATE TABLE foo (id INT PRIMARY KEY NOT NULL,text VARCHAR(250));
=> COPY foo FROM STDIN;
End with a backslash and a period on a line by itself.
>> 1|2014-05-10 00:00:05.700433 %ASA-6-302013: Built outbound TCP connection 9986454 for outside:101.123.123.111/443 (101.123.123.111/443)
>> \.
=> CREATE PROJECTION foo_projection AS SELECT * FROM foo ORDER BY id
                                    SEGMENTED BY HASH(id) ALL NODES KSAFE;
=> CREATE TEXT INDEX indexfoo_AdvancedLogTokenizer ON foo (id, text)
                  TOKENIZER v_txtindex.AdvancedLogTokenizer(LONG VARCHAR) STEMMER NONE;
=> SELECT * FROM indexfoo_AdvancedLogTokenizer;
            token            | doc_id
-----------------------------+--------
 %ASA-6-302013:              |      1
 00                          |      1
 00:00:05.700433             |      1
 05                          |      1
 10                          |      1
 101                         |      1
 101.123.123.111/443         |      1
 111                         |      1
 123                         |      1
 2014                        |      1
 2014-05-10                  |      1
 302013                      |      1
 443                         |      1
 700433                      |      1
 9986454                     |      1
 ASA                         |      1
 Built                       |      1
 TCP                         |      1
 connection                  |      1
 for                         |      1
 outbound                    |      1
 outside                     |      1
 outside:101.123.123.111/443 |      1
(23 rows)

5.2.3 - Basic log tokenizer

Returns tokens that exclude specified minor separators. You can use this tokenizer in situations when your tokens are separated by whitespace or various punctuation. This approach is frequently appropriate for analyzing log files.

Parameters

Parameter Name Parameter Value
stopwordscaseinsensitive ''
minorseparators ''
majorseparators E' []<>(){}|!;,''"*&?+\r\n\t'
minLength '2'
maxLength '128'
used 'True'

Examples

The following example shows how you can create a text index, from the table foo, using the Basic Log Tokenizer without a stemmer.

=> CREATE TABLE foo (id INT PRIMARY KEY NOT NULL,text VARCHAR(250));
=> COPY foo FROM STDIN;
End with a backslash and a period on a line by itself.
>> 1|2014-05-10 00:00:05.700433 %ASA-6-302013: Built outbound TCP connection 9986454 for outside:101.123.123.111/443 (101.123.123.111/443)
>> \.
=> CREATE PROJECTION foo_projection AS SELECT * FROM foo ORDER BY id
                                     SEGMENTED BY HASH(id) ALL NODES KSAFE;
=> CREATE TEXT INDEX indexfoo_BasicLogTokenizer ON foo (id, text)
                 TOKENIZER v_txtindex.BasicLogTokenizer(LONG VARCHAR) STEMMER NONE;
=> SELECT * FROM indexfoo_BasicLogTokenizer;
            token            | doc_id
-----------------------------+--------
 %ASA-6-302013:              |      1
 00:00:05.700433             |      1
 101.123.123.111/443         |      1
 2014-05-10                  |      1
 9986454                     |      1
 Built                       |      1
 TCP                         |      1
 connection                  |      1
 for                         |      1
 outbound                    |      1
 outside:101.123.123.111/443 |      1
(11 rows)

5.2.4 - Whitespace log tokenizer

Returns only tokens surrounded by whitespace. You can use this tokenizer in situations where you want the tokens in your source document to be separated by whitespace characters only. This approach lets you retain the ability to set stop words and token length limits.

Parameters

Parameter Name Parameter Value
stopwordscaseinsensitive ''
minorseparators ''
majorseparators E' \t\n\f\r'
minLength '2'
maxLength '128'
used 'True'

Examples

The following example shows how you can create a text index, from the table foo, using the Whitespace Log Tokenizer without a stemmer.

=> CREATE TABLE foo (id INT PRIMARY KEY NOT NULL,text VARCHAR(250));
=> COPY foo FROM STDIN;
End with a backslash and a period on a line by itself.
>> 1|2014-05-10 00:00:05.700433 %ASA-6-302013: Built outbound TCP connection 998 6454 for outside:101.123.123.111/443 (101.123.123.111/443)
>> \.
=> CREATE PROJECTION foo_projection AS SELECT * FROM foo ORDER BY id
                                     SEGMENTED BY HASH(id) ALL NODES KSAFE;
=> CREATE TEXT INDEX indexfoo_WhitespaceLogTokenizer ON foo (id, text)
                TOKENIZER v_txtindex.WhitespaceLogTokenizer(LONG VARCHAR) STEMMER NONE;
=> SELECT * FROM indexfoo_WhitespaceLogTokenizer;
            token            | doc_id
-----------------------------+--------
 %ASA-6-302013:              |      1
 (101.123.123.111/443)       |      1
 00:00:05.700433             |      1
 2014-05-10                  |      1
 6454                        |      1
 998                         |      1
 Built                       |      1
 TCP                         |      1
 connection                  |      1
 for                         |      1
 outbound                    |      1
 outside:101.123.123.111/443 |      1
(12 rows)

5.2.5 - ICU tokenizer

Supports multiple languages. You can use this tokenizer to identify word boundaries in languages other than English, including Asian languages that are not separated by whitespace.

The ICU Tokenizer is not preconfigured. You configure the tokenizer by first creating a user-defined transform function (UDTF). Then set the locale parameter to identify the language to tokenize.

Parameters

Parameter Name Parameter Value
locale

Uses the POSIX naming convention: language[_COUNTRY]

Identify the language using its ISO-639 code, and the country using its ISO-3166 code. For example, the parameter value for simplified Chinese is zh_CN, and the value for Spanish is es_ES.

The default value is English if you do not specify a locale.

Example

The following example steps show how you can configure the ICU Tokenizer for simplified Chinese, then create a text index from the table foo, which contains Chinese characters.

For more on how to configure tokenizers, see Configuring a tokenizer.

  1. Create the tokenizer using a UDTF. The example tokenizer is named ICUChineseTokenizer.

    VMart=> CREATE OR REPLACE TRANSFORM FUNCTION v_txtindex.ICUChineseTokenizer AS LANGUAGE 'C++' NAME 'ICUTokenizerFactory' LIBRARY v_txtindex.logSearchLib NOT FENCED;
    CREATE TRANSFORM FUNCTION
    
  2. Get the procedure ID of the tokenizer.

    VMart=> SELECT proc_oid from vs_procedures where procedure_name = 'ICUChineseTokenizer';
         proc_oid
    -------------------
     45035996280452894
    (1 row)
    
  3. Set the parameter, locale, to simplified Chinese. Identify the tokenizer using its procedure ID.

    VMart=> SELECT v_txtindex.SET_TOKENIZER_PARAMETER('locale','zh_CN' using parameters proc_oid='45035996280452894');
     SET_TOKENIZER_PARAMETER
    -------------------------
     t
    (1 row)
    
  4. Lock the tokenizer.

    VMart=> SELECT v_txtindex.SET_TOKENIZER_PARAMETER('used','true' using parameters proc_oid='45035996280452894');
     SET_TOKENIZER_PARAMETER
    -------------------------
     t
    (1 row)
    
  5. Create an example table, foo, containing simplified Chinese text to index.

    VMart=> CREATE TABLE foo(doc_id integer primary key not null,text varchar(250));
    CREATE TABLE
    
    VMart=> INSERT INTO foo values(1, u&'\4E2D\534E\4EBA\6C11\5171\548C\56FD');
     OUTPUT
    --------
          1
    
  6. Create an index, index_example, on the table foo. The example creates the index without a stemmer; Vertica stemmers work only on English text. Using a stemmer for English on non-English text can cause incorrect tokenization.

    VMart=> CREATE TEXT INDEX index_example ON foo (doc_id, text) TOKENIZER v_txtindex.ICUChineseTokenizer(long varchar) stemmer none;
    CREATE INDEX
    
  7. View the new index.

    VMart=> SELECT * FROM index_example ORDER BY token,doc_id;
     token  | doc_id
    --------+--------
     中华    |      1
     人民   |      1
     共和国 |      1
    (3 rows)
    

5.3 - Configuring a tokenizer

You configure a tokenizer by creating a user-defined transform function (UDTF) using one of the two base UDTFs in the v_txtindex.AdvTxtSearchLib library. The library contains two base tokenizers: one for Log Words and one for Ngrams. You can configure each base function with or without positional relevance.

5.3.1 - Tokenizer base configuration

You can choose among several different tokenizer base configurations:

Type With Position Without Position
Ngram logNgramTokenizerPositionFactory logNgramTokenizerFactory
Words logWordITokenizerPositionFactory logWordITokenizerFactory

Create a logWord tokenizer without positional relevance:

=> CREATE TRANSFORM FUNCTION v_txtindex.fooTokenizer AS LANGUAGE 'C++' NAME 'logWordITokenizerFactory' LIBRARY v_txtindex.logSearchLib NOT FENCED;
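
A sketch for the Ngram case follows the same pattern; the function name fooNgramTokenizer is illustrative:

=> CREATE TRANSFORM FUNCTION v_txtindex.fooNgramTokenizer AS LANGUAGE 'C++' NAME 'logNgramTokenizerFactory' LIBRARY v_txtindex.logSearchLib NOT FENCED;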

5.3.2 - Retrieve the tokenizer proc_oid

After you create the tokenizer, Vertica writes the name and proc_oid to the system table vs_procedures. You must retrieve the tokenizer's proc_oid to perform additional configuration.

Enter the following query, substituting your own tokenizer name:

=> SELECT proc_oid FROM vs_procedures WHERE procedure_name = 'fooTokenizer';

5.3.3 - Set tokenizer parameters

Use the tokenizer's proc_oid to configure the tokenizer. See Retrieve the tokenizer proc_oid for more information about getting the proc_oid of your tokenizer. The following examples show how you can configure each of the tokenizer parameters:

Configure stop words:

=> SELECT v_txtindex.SET_TOKENIZER_PARAMETER('stopwordscaseinsensitive','for,the' USING PARAMETERS proc_oid='45035996274128376');

Configure major separators:

=> SELECT v_txtindex.SET_TOKENIZER_PARAMETER('majorseparators', E'{}()&[]' USING PARAMETERS proc_oid='45035996274128376');

Configure minor separators:

=> SELECT v_txtindex.SET_TOKENIZER_PARAMETER('minorseparators', '-,$' USING PARAMETERS proc_oid='45035996274128376');

Configure minimum length:

=> SELECT v_txtindex.SET_TOKENIZER_PARAMETER('minlength', '1' USING PARAMETERS proc_oid='45035996274128376');

Configure maximum length:

=> SELECT v_txtindex.SET_TOKENIZER_PARAMETER('maxlength', '140' USING PARAMETERS proc_oid='45035996274128376');

Configure ngramssize:

=> SELECT v_txtindex.SET_TOKENIZER_PARAMETER('ngramssize', '2' USING PARAMETERS proc_oid='45035996274128376');

Lock tokenizer parameters

When you finish configuring the tokenizer, set the parameter, used, to True. After changing this setting, you are no longer able to alter the parameters of the tokenizer. At this point, the tokenizer is ready for you to use to create a text index.

Configure the used parameter:

=> SELECT v_txtindex.SET_TOKENIZER_PARAMETER('used', 'True' USING PARAMETERS proc_oid='45035996274128376');

See also

SET_TOKENIZER_PARAMETER

5.3.4 - View tokenizer parameters

After creating a custom tokenizer, you can view the tokenizer's parameter settings in either of two ways:

View individual tokenizer parameter settings

If you need to see an individual parameter setting for a tokenizer, you can use GET_TOKENIZER_PARAMETER to see specific tokenizer parameter settings:

=> SELECT v_txtindex.GET_TOKENIZER_PARAMETER('majorseparators' USING PARAMETERS proc_oid='45035996274126984');
 getTokenizerParameter
-----------------------
 {}()&[]
(1 row)

For more information, see GET_TOKENIZER_PARAMETER.

View all tokenizer parameter settings

If you need to see all of the parameters for a tokenizer, you can use READ_CONFIG_FILE to see all of the parameter settings for your tokenizer:

=> SELECT v_txtindex.READ_CONFIG_FILE( USING PARAMETERS proc_oid='45035996274126984') OVER();
               config_key | config_value
--------------------------+---------------
          majorseparators | {}()&[]
                maxlength | 140
                minlength | 1
          minorseparators | -,$
 stopwordscaseinsensitive | for,the
                     type | 1
                     used | true
(7 rows)

If the parameter, used, is set to False, then you can only view the parameters that have been applied to the tokenizer.

For more information, see READ_CONFIG_FILE.

5.3.5 - Delete tokenizer config file

Use the DELETE_TOKENIZER_CONFIG_FILE function to delete a tokenizer configuration file. This function does not delete the user-defined transform function (UDTF); it only deletes the configuration file associated with the UDTF.

Delete the tokenizer configuration file when the parameter, used, is set to False:

=> SELECT v_txtindex.DELETE_TOKENIZER_CONFIG_FILE(USING PARAMETERS proc_oid='45035996274127086');

Delete the tokenizer configuration file with the parameter, confirm, set to True. This setting forces the configuration file deletion, even if the parameter, used, is also set to True:

=> SELECT v_txtindex.DELETE_TOKENIZER_CONFIG_FILE(USING PARAMETERS proc_oid='45035996274126984', confirm='true');

For more information, see DELETE_TOKENIZER_CONFIG_FILE.

5.4 - Requirements for custom stemmers and tokenizers

Sometimes, you may want specific tokenization or stemming behavior that differs from what Vertica provides. In such cases, you can implement your own custom user-defined extensions (UDxs) to replace the stemmer or tokenizer. For more information about building custom UDxs, see Developing user-defined extensions (UDxs).

Before implementing a custom stemmer or tokenizer in Vertica, verify that the UDx meets these requirements.

Vertica stemmer requirements

Comply with these requirements when you create custom stemmers; a sketch that satisfies them follows these lists:

  • Must be a User Defined Scalar Function (UDSF) or a SQL Function

  • Can be written in C++, Java, or R

  • Volatility set to stable or immutable

Supported Data Input Types:

  • Varchar

  • Long varchar

Supported Data Output Types:

  • Varchar

  • Long varchar
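
The following sketch shows a SQL function that satisfies these requirements; it only lowercases rather than stems, and the names trivial_stemmer and idx_trivial are illustrative:

=> CREATE OR REPLACE FUNCTION trivial_stemmer(w VARCHAR) RETURN VARCHAR
   AS BEGIN
      RETURN LOWER(w);
   END;
=> CREATE TEXT INDEX idx_trivial ON t_log (id, text)
       STEMMER public.trivial_stemmer(varchar)
       TOKENIZER v_txtindex.StringTokenizer(long varchar);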

Vertica tokenizer requirements

To create custom tokenizers, follow these requirements:

  • Must be a User Defined Transform Function (UDTF)

  • Can be written in C++, Java, or R

  • Input type must match the type of the input text

Supported Data Input Types:

  • Char

  • Varchar

  • Long varchar

  • Varbinary

  • Long varbinary

Supported Data Output Types:

  • Varchar

  • Long varchar