OpenText Analytics Database 26.2.x – Using text search

Admin: Creating a text index

Mon, 01 Jan 0001 00:00:00 +0000

In the following example, you perform a text search using a source table called t_log. This source table has two columns:

One column containing the table's primary key
Another column containing log file information

You must associate a projection with the source table. Use a projection that is sorted by the primary key and either segmented by hash(id) or unsegmented. You can define this projection on the source table, along with any other existing projections.

Use CREATE TEXT INDEX on the table for which you want to perform a text search.

=> CREATE TEXT INDEX text_index ON t_log (id, text);

The text index contains two columns:

doc_id uses the unique identifier from the source table.
token is populated with text strings from the designated column from the source table. The word column results from tokenizing and stemming the words found in the text column.

If your table is partitioned then your text index also contains a third column named partition.

=> SELECT * FROM text_index;
          token         | doc_id | partition
------------------------+--------+-----------
<info>                  |      6 |      2014
<warning>               |      2 |      2014
<warning>               |      3 |      2014
<warning>               |      4 |      2014
<warning>               |      5 |      2014
database                |      6 |      2014
execute:                |      6 |      2014
object                  |      4 |      2014
object                  |      5 |      2014
[catalog]               |      4 |      2014
[catalog]               |      5 |      2014

You create a text index on a source table only once. In the future, you do not have to re-create the text index each time the source table is updated or changed.

The index contains all existing data and stays synchronized to the contents of the source table through any subsequent DML operations. These operations include, but are not limited to:

COPY
INSERT
UPDATE
DELETE
DROP PARTITION
MOVE_PARTITIONS_TO_TABLE

When you move or swap partitions in a source table that is indexed, verify that the destination table already exists and is indexed in the same way.

Admin: Creating a text index on a flex table

Mon, 01 Jan 0001 00:00:00 +0000

In the following example, you create a text index on a flex table. The example assumes that you have created a flex table called mountains. See Getting started in Using Flex Tables to create the flex table used in this example.

Before you can create a text index on your flex table, add a primary key constraint to the flex table.

=> ALTER TABLE mountains ADD PRIMARY KEY (__identity__);

Use CREATE TEXT INDEX on the table for which you want to perform a text search. Tokenize the __raw__column with the FlexTokenizer and specify the data type as LONG VARBINARY. It is important to use the FlexTokenizer when creating text indices on flex tables because the data type of the __raw__ column differs from the default StringTokenizer.

=> CREATE TEXT INDEX flex_text_index ON mountains(__identity__, __raw__) TOKENIZER public.FlexTokenizer(long varbinary);

The text index contains two columns:

doc_id uses the unique identifier from the source table.
token is populated with text strings from the designated column from the source table. The word column results from tokenizing and stemming the words found in the text column.

If your table is partitioned then your text index also contains a third column named partition.

=> SELECT * FROM flex_text_index;
    token    | doc_id
-------------+--------
 50.6        |      5
 Mt          |      5
 Washington  |      5
 mountain    |      5
 12.2        |      3
 15.4        |      2
 17000       |      3
 29029       |      2
 Denali      |      3
 Helen       |      2
 Mt          |      2
 St          |      2
 mountain    |      3
 volcano     |      2
 29029       |      1
 34.1        |      1
 Everest     |      1
 mountain    |      1
 14000       |      4
 Kilimanjaro |      4
 mountain    |      4
(21 rows)

You create a text index on a source table only once. In the future, you do not have to re-create the text index each time the source table is updated or changed.

The index contains all existing data and stays synchronized to the contents of the source table through any subsequent DML operations. These operations include, but are not limited to:

COPY
INSERT
UPDATE
DELETE
DROP PARTITION
MOVE_PARTITIONS_TO_TABLE

When you move or swap partitions in a source table that is indexed, verify that the destination table already exists and is indexed in the same way.

Admin: Searching a text index

Mon, 01 Jan 0001 00:00:00 +0000

After you create a text index, write a query to run against the index to search for a specific keyword.

In the following example, you use a WHERE clause to search for the keyword <WARNING> in the text index. The WHERE clause should use the stemmer you used to create the text index. When you use the STEMMER keyword, it stems the keyword to match the keywords in your text index. If you did not use the STEMMER keyword, then the default stemmer is v_txtindex.StemmerCaseInsensitive. If you used STEMMER NONE, then you can omit STEMMER keyword from the WHERE clause.

=> SELECT * FROM text_index WHERE token = v_txtindex.StemmerCaseInsensitive('<WARNING>');
  token    | doc_id
-----------+--------
<warning>  |     2
<warning>  |     3
<warning>  |     4
<warning>  |     5
(4 rows)

Next, write a query to display the full contents of the source table that match the keyword you searched for in the text index.

=> SELECT * FROM t_log WHERE id IN (SELECT doc_id FROM text_index WHERE token = v_txtindex.StemmerCaseInsensitive('<WARNING>'));
id |    date    |                                     text
---+------------+-----------------------------------------------------------------------------------------------
4  | 2014-06-04 | 11:00:49.568 unknown:0x7f9207607700 [Catalog] <WARNING> validateDependencies: Object 45035968
5  | 2014-06-04 | 11:00:49.568 unknown:0x7f9207607700 [Catalog] <WARNING> validateDependencies: Object 45030
2  | 2013-06-04 | 11:00:49.568 unknown:0x7f9207607700 [Catalog] <WARNING> validateDependencies: Object 4503
3  | 2013-06-04 | 11:00:49.568 unknown:0x7f9207607700 [Catalog] <WARNING> validateDependencies: Object 45066
(4 rows)

Use the doc_id to find the exact location of the keyword in the source table.The doc_id matches the unique identifier from the source table. This matching allows you to quickly find the instance of the keyword in your table.

Performing a case-sensitive and case-insensitive text search query

Your text index is optimized to match all instances of words depending upon your stemmer. By default, the case insensitive stemmer is applied to all text indices that do not specify a stemmer. Therefore, it is recommended that you use a case sensitive stemmer to build your text index if the queries you plan to write against your text index are case sensitive.

The following examples show queries that match case-sensitive and case-insensitive words that you can use when performing a text search.

This query finds case-insensitive records in a case insensitive text index:

=> SELECT * FROM t_log WHERE id IN (SELECT doc_id FROM text_index WHERE token = v_txtindex.StemmerCaseInsensitive('warning'));

This query finds case-sensitive records in a case sensitive text index:

=> SELECT * FROM t_log_case_sensitive WHERE id IN (SELECT doc_id FROM text_index WHERE token = v_txtindex.StemmerCaseSensitive('Warning'));

Including and excluding keywords in a text search query

Your text index also allows you to perform more detailed queries to find multiple keywords or omit results with other keywords. The following example shows a more detailed query that you can use when performing a text search.

In this example, t_log is the source table, and text_index is the text index. The query finds records that either contain:

Both the words '<WARNING>' and 'validate'
Only the word '[Log]' and does not contain 'validateDependencies'

SELECT * FROM t_log where (
   id IN (SELECT doc_id FROM text_index WHERE token = v_txtindex.StemmerCaseSensitive('<WARNING>'))
      AND (   id IN (SELECT doc_id FROM text_index WHERE token = v_txtindex.StemmerCaseSensitive('validate')
      OR id IN (SELECT doc_id FROM text_index WHERE token = v_txtindex.StemmerCaseSensitive('[Log]')))
      AND NOT (id IN (SELECT doc_id FROM text_index WHERE token = v_txtindex.StemmerCaseSensitive('validateDependencies'))));

This query returns the following results:

id  |   date     |                               text
----+------------+------------------------------------------------------------------------------------------------
11  | 2014-05-04 | 11:00:49.568 unknown:0x7f9207607702 [Log] <WARNING> validate: Object 4503 via fld num_all_roles
13  | 2014-05-04 | 11:00:49.568 unknown:0x7f9207607706 [Log] <WARNING> validate: Object 45035 refers to root_i3
14  | 2014-05-04 | 11:00:49.568 unknown:0x7f9207607708 [Log] <WARNING> validate: Object 4503 refers to int_2
17  | 2014-05-04 | 11:00:49.568 unknown:0x7f9207607700 [Txn] <WARNING> Begin validate Txn: fff0ed17 catalog editor
(4 rows)

Admin: Dropping a text index

Mon, 01 Jan 0001 00:00:00 +0000

Dropping a text index removes the specified text index from the database.

You can drop a text index when:

It is no longer queried frequently.
An administrative task needs to be performed on the source table and requires the text index to be dropped.

Dropping the text index does not drop the source table associated with the text index. However, if you drop the source table associated with a text index, that text index is also dropped. The database considers the text index a dependent object.

The following example illustrates how to drop a text index named text_index:

=> DROP TEXT INDEX text_index;
DROP INDEX

Admin: Stemmers and tokenizers

Mon, 01 Jan 0001 00:00:00 +0000

OpenText™ Analytics Database provides default stemmers and tokenizers. You can also create your own custom stemmers and tokenizers. The following topics explain the default stemmers and tokenizers, and the requirements for creating custom stemmers and tokenizers.