This is the multi-page printable view of this section.
Click here to print.
Return to the regular view of this page.
Using text search
Text search allows you to quickly search the contents of a single CHAR, VARCHAR, LONG VARCHAR, VARBINARY, or LONG VARBINARY field within a table to locate a specific keyword.
Text search allows you to quickly search the contents of a single CHAR, VARCHAR, LONG VARCHAR, VARBINARY, or LONG VARBINARY field within a table to locate a specific keyword.
You can use this feature on columns that are queried repeatedly regarding their contents. After you create the text index, DML operations become slightly slower on the source table. This performance change results from syncing the text index and source table. Any time an operation is performed on the source table, the text index updates in the background. Regular queries on the source table are not affected.
The text index contains all of the words from the source table's text field and any other additional columns you included during index creation. Additional columns are not indexed—their values are just passed through to the text index. The text index is like any other Vertica table , except it is linked to the source table internally.
First, create a text index on the table you plan to search. Then, after you have indexed your table, run a query against the text index for a specific keyword. This query returns a doc_id for each instance of the keyword. After querying the text index, joining the text index back to the source table should give a significant performance improvement over directly querying the source table about the contents of its text field.
Important
Do not alter the contents or definitions of the text index. If you alter the contents or definitions of the text index, the results do not appropriately match the source table.
1 - Creating a text index
In the following example, you perform a text search using a source table called t_log.
In the following example, you perform a text search using a source table called t_log. This source table has two columns:
You must associate a projection with the source table. Use a projection that is sorted by the primary key and either segmented by hash(id) or unsegmented. You can define this projection on the source table, along with any other existing projections.
Create a text index on the table for which you want to perform a text search.
=> CREATE TEXT INDEX text_index ON t_log (id, text);
The text index contains two columns:
-
doc_id uses the unique identifier from the source table.
-
token is populated with text strings from the designated column from the source table. The word column results from tokenizing and stemming the words found in the text column.
If your table is partitioned then your text index also contains a third column named partition.
=> SELECT * FROM text_index;
token | doc_id | partition
------------------------+--------+-----------
<info> | 6 | 2014
<warning> | 2 | 2014
<warning> | 3 | 2014
<warning> | 4 | 2014
<warning> | 5 | 2014
database | 6 | 2014
execute: | 6 | 2014
object | 4 | 2014
object | 5 | 2014
[catalog] | 4 | 2014
[catalog] | 5 | 2014
You create a text index on a source table only once. In the future, you do not have to re-create the text index each time the source table is updated or changed.
Your text index stays synchronized to the contents of the source table through any operation that is run on the source table. These operations include, but are not limited to:
2 - Creating a text index on a flex table
In the following example, you create a text index on a flex table.
In the following example, you create a text index on a flex table. The example assumes that you have created a flex table called mountains. See Getting started in Using Flex Tables to create the flex table used in this example.
Before you can create a text index on your flex table, add a primary key constraint to the flex table.
=> ALTER TABLE mountains ADD PRIMARY KEY (__identity__);
Create a text index on the table for which you want to perform a text search. Tokenize the __raw__column with the FlexTokenizer and specify the data type as LONG VARBINARY. It is important to use the FlexTokenizer when creating text indices on flex tables because the data type of the __raw__ column differs from the default StringTokenizer.
=> CREATE TEXT INDEX flex_text_index ON mountains(__identity__, __raw__) TOKENIZER public.FlexTokenizer(long varbinary);
The text index contains two columns:
-
doc_id uses the unique identifier from the source table.
-
token is populated with text strings from the designated column from the source table. The word column results from tokenizing and stemming the words found in the text column.
If your table is partitioned then your text index also contains a third column named partition.
=> SELECT * FROM flex_text_index;
token | doc_id
-------------+--------
50.6 | 5
Mt | 5
Washington | 5
mountain | 5
12.2 | 3
15.4 | 2
17000 | 3
29029 | 2
Denali | 3
Helen | 2
Mt | 2
St | 2
mountain | 3
volcano | 2
29029 | 1
34.1 | 1
Everest | 1
mountain | 1
14000 | 4
Kilimanjaro | 4
mountain | 4
(21 rows)
You create a text index on a source table only once. In the future, you do not have to re-create the text index each time the source table is updated or changed.
Your text index stays synchronized to the contents of the source table through any operation that is run on the source table. These operations include, but are not limited to:
3 - Searching a text index
After you create a text index, write a query to run against the index to search for a specific keyword.
After you create a text index, write a query to run against the index to search for a specific keyword.
In the following example, you use a WHERE clause to search for the keyword <WARNING> in the text index. The WHERE clause should use the stemmer you used to create the text index. When you use the STEMMER keyword, it stems the keyword to match the keywords in your text index. If you did not use the STEMMER keyword, then the default stemmer is v_txtindex.StemmerCaseInsensitive. If you used STEMMER NONE, then you can omit STEMMER keyword from the WHERE clause.
=> SELECT * FROM text_index WHERE token = v_txtindex.StemmerCaseInsensitive('<WARNING>');
token | doc_id
-----------+--------
<warning> | 2
<warning> | 3
<warning> | 4
<warning> | 5
(4 rows)
Next, write a query to display the full contents of the source table that match the keyword you searched for in the text index.
=> SELECT * FROM t_log WHERE id IN (SELECT doc_id FROM text_index WHERE token = v_txtindex.StemmerCaseInsensitive('<WARNING>'));
id | date | text
---+------------+-----------------------------------------------------------------------------------------------
4 | 2014-06-04 | 11:00:49.568 unknown:0x7f9207607700 [Catalog] <WARNING> validateDependencies: Object 45035968
5 | 2014-06-04 | 11:00:49.568 unknown:0x7f9207607700 [Catalog] <WARNING> validateDependencies: Object 45030
2 | 2013-06-04 | 11:00:49.568 unknown:0x7f9207607700 [Catalog] <WARNING> validateDependencies: Object 4503
3 | 2013-06-04 | 11:00:49.568 unknown:0x7f9207607700 [Catalog] <WARNING> validateDependencies: Object 45066
(4 rows)
Use the doc_id to find the exact location of the keyword in the source table.The doc_id matches the unique identifier from the source table. This matching allows you to quickly find the instance of the keyword in your table.
Performing a case-sensitive and case-insensitive text search query
Your text index is optimized to match all instances of words depending upon your stemmer. By default, the case insensitive stemmer is applied to all text indices that do not specify a stemmer. Therefore, if the queries you plan to write against your text index are case sensitive, then Vertica recommends you use a case sensitive stemmer to build your text index.
The following examples show queries that match case-sensitive and case-insensitive words that you can use when performing a text search.
This query finds case-insensitive records in a case insensitive text index:
=> SELECT * FROM t_log WHERE id IN (SELECT doc_id FROM text_index WHERE token = v_txtindex.StemmerCaseInsensitive('warning'));
This query finds case-sensitive records in a case sensitive text index:
=> SELECT * FROM t_log_case_sensitive WHERE id IN (SELECT doc_id FROM text_index WHERE token = v_txtindex.StemmerCaseSensitive('Warning'));
Including and excluding keywords in a text search query
Your text index also allows you to perform more detailed queries to find multiple keywords or omit results with other keywords. The following example shows a more detailed query that you can use when performing a text search.
In this example, t_log is the source table, and text_index is the text index. The query finds records that either contain:
SELECT * FROM t_log where (
id IN (SELECT doc_id FROM text_index WHERE token = v_txtindex.StemmerCaseSensitive('<WARNING>'))
AND ( id IN (SELECT doc_id FROM text_index WHERE token = v_txtindex.StemmerCaseSensitive('validate')
OR id IN (SELECT doc_id FROM text_index WHERE token = v_txtindex.StemmerCaseSensitive('[Log]')))
AND NOT (id IN (SELECT doc_id FROM text_index WHERE token = v_txtindex.StemmerCaseSensitive('validateDependencies'))));
This query returns the following results:
id | date | text
----+------------+------------------------------------------------------------------------------------------------
11 | 2014-05-04 | 11:00:49.568 unknown:0x7f9207607702 [Log] <WARNING> validate: Object 4503 via fld num_all_roles
13 | 2014-05-04 | 11:00:49.568 unknown:0x7f9207607706 [Log] <WARNING> validate: Object 45035 refers to root_i3
14 | 2014-05-04 | 11:00:49.568 unknown:0x7f9207607708 [Log] <WARNING> validate: Object 4503 refers to int_2
17 | 2014-05-04 | 11:00:49.568 unknown:0x7f9207607700 [Txn] <WARNING> Begin validate Txn: fff0ed17 catalog editor
(4 rows)
4 - Dropping a text index
Dropping a text index removes the specified text index from the database.
Dropping a text index removes the specified text index from the database.
You can drop a text index when:
Dropping the text index does not drop the source table associated with the text index. However, if you drop the source table associated with a text index, then that text index is also dropped. Vertica considers the text index a dependent object.
The following example illustrates how to drop a text index named text_index:
=> DROP TEXT INDEX text_index;
DROP INDEX
5 - Stemmers and tokenizers
Vertica provides default stemmers and tokenizers.
Vertica provides default stemmers and tokenizers. You can also create your own custom stemmers and tokenizers. The following topics explain the default stemmers and tokenizers, and the requirements for creating custom stemmers and tokenizers in Vertica.
5.1 - Vertica stemmers
Vertica stemmers use the Porter stemming algorithm to find words derived from the same base/root word.
Vertica stemmers use the Porter stemming algorithm to find words derived from the same base/root word. For example, if you perform a search on a text index for the keyword database, you might also want to get results containing the word databases.
To achieve this type of matching, Vertica stores words in their stemmed form when using any of the v_txtindex stemmers.
The Vertica Analytics Platform provides the following stemmers:
Name |
Description |
v_txtindex.Stemmer(long varchar) |
Not sensitive to case; outputs lowercase words. Stems strings from a Vertica table.
Alias of StemmerCaseInsensitive.
|
v_txtindex.StemmerCaseSensitive(long varchar) |
Sensitive to case. Stems strings from a Vertica table. |
v_txtindex.StemmerCaseInsensitive(long varchar) |
Default stemmer used if no stemmer is specified when creating a text index.
Not sensitive to case; outputs lowercase words. Stems strings from a Vertica table.
|
v_txtindex.caseInsensitiveNoStemming(long varchar) |
Not sensitive to case; outputs lowercase words. Does not use the Porter Stemming algorithm. |
Examples
The following examples show how to use a stemmer when creating a text index.
Create a text index using the StemmerCaseInsensitive stemmer:
=> CREATE TEXT INDEX idx_100 ON top_100 (id, feedback) STEMMER v_txtindex.StemmerCaseInsensitive(long varchar)
TOKENIZER v_txtindex.StringTokenizer(long varchar);
Create a text index using the StemmerCaseSensitive stemmer:
=> CREATE TEXT INDEX idx_unstruc ON unstruc_data (__identity__, __raw__) STEMMER v_txtindex.StemmerCaseSensitive(long varchar)
TOKENIZER public.FlexTokenizer(long varbinary);
Create a text index without using a stemmer:
=> CREATE TEXT INDEX idx_logs FROM sys_logs ON (id, message) STEMMER NONE TOKENIZER v_txtindex.StringTokenizer(long varchar);
5.2 - Vertica tokenizers
A tokenizer does the following:.
A tokenizer does the following:
-
Receives a stream of characters.
-
Breaks the stream into individual tokens that usually correspond to individual words.
-
Returns a stream of tokens.
5.2.1 - Preconfigured tokenizers
The Vertica Analytics Platform provides the following preconfigured tokenizers:.
The Vertica Analytics Platform provides the following preconfigured tokenizers:
public.FlexTokenizer(LONG VARBINARY)
- Splits the values in your flex table by white space.
v_txtindex.StringTokenizer(LONG VARCHAR)
- Splits the string into words by splitting on white space.
v_txtindex.StringTokenizerDelim(LONG VARCHAR, CHAR(1))
- Splits a string into tokens using the specified delimiter character.
Vertica also provides the following tokenizer, which is not preconfigured:
v_txtindex.ICUTokenizer
- Supports multiple languages. Tokenizes based on the conventions of the language you set in the locale parameter. For more information, see ICU Tokenizer.
Examples
The following examples show how you can use a preconfigured tokenizer when creating a text index.
Use the StringTokenizer to create an index from the top_100:
=> CREATE TEXT INDEX idx_100 FROM top_100 on (id, feedback)
TOKENIZER v_txtindex.StringTokenizer(long varchar)
STEMMER v_txtindex.StemmerCaseInsensitive(long varchar);
Use the FlexTokenizer to create an index from unstructured data:
=> CREATE TEXT INDEX idx_unstruc FROM unstruc_data on (__identity__, __raw__)
TOKENIZER public.FlexTokenizer(long varbinary)
STEMMER v_txtindex.StemmerCaseSensitive(long varchar);
Use the StringTokenizerDelim to split a string at the specified delimiter:
=> CREATE TABLE string_table (word VARCHAR(100), delim VARCHAR);
CREATE TABLE
=> COPY string_table FROM STDIN DELIMITER ',';
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>>
>> SingleWord,dd
>> Break On Spaces,' '
>> Break:On:Colons,:
>> \.
=> SELECT * FROM string_table;
word | delim
-----------------+-------
SingleWord | dd
Break On Spaces |
Break:On:Colons | :
(3 rows)
=> SELECT v_txtindex.StringTokenizerDelim(word,delim) OVER () FROM string_table;
words
-----------------
Break
On
Colons
SingleWor
Break
On
Spaces
(7 rows)
=> SELECT v_txtindex.StringTokenizerDelim(word,delim) OVER (PARTITION BY word), word as input FROM string_table;
words | input
-----------------+-----------------
Break | Break:On:Colons
On | Break:On:Colons
Colons | Break:On:Colons
SingleWor | SingleWord
Break | Break On Spaces
On | Break On Spaces
Spaces | Break On Spaces
(7 rows)
5.2.2 - ICU tokenizer
Supports multiple languages.
Supports multiple languages. You can use this tokenizer to identify word boundaries in languages other than English, including Asian languages that are not separated by whitespace.
The ICU Tokenizer is not pre-configured. You configure the tokenizer by first creating a user-defined transform Function (UDTF). Then set the parameter, locale, to identify the language to tokenizer.
Important
If you create a database with no tables and the k-safety has increased, you must rebalance your data using
REBALANCE_CLUSTER before using a Vertica tokenizer.
Parameters
Parameter Name |
Parameter Value |
locale |
Uses the POSIX naming convention: language[_COUNTRY]
Identify the language using its ISO-639 code, and the country using its ISO-3166 code. For example, the parameter value for simplified Chinese is zh_CN, and the value for Spanish is es_ES.
The default value is English if you do not specify a locale.
|
Example
The following example steps show how you can configure the ICU Tokenizer for simplified Chinese, then create a text index from the table foo, which contains Chinese characters.
For more on how to configure tokenizers, see Configuring a tokenizer.
-
Create the tokenizer using a UDTF. The example tokenizer is named ICUChineseTokenizer.
VMart=> CREATE OR REPLACE TRANSFORM FUNCTION v_txtindex.ICUChineseTokenizer AS LANGUAGE 'C++' NAME 'ICUTokenizerFactory' LIBRARY v_txtindex.logSearchLib NOT FENCED;
CREATE TRANSFORM FUNCTION
-
Get the procedure ID of the tokenizer.
VMart=> SELECT proc_oid from vs_procedures where procedure_name = 'ICUChineseTokenizer';
proc_oid
-------------------
45035996280452894
(1 row)
-
Set the parameter, locale, to simplified Chinese. Identify the tokenizer using its procedure ID.
VMart=> SELECT v_txtindex.SET_TOKENIZER_PARAMETER('locale','zh_CN' using parameters proc_oid='45035996280452894');
SET_TOKENIZER_PARAMETER
-------------------------
t
(1 row)
-
Lock the tokenizer.
VMart=> SELECT v_txtindex.SET_TOKENIZER_PARAMETER('used','true' using parameters proc_oid='45035996273762696');
SET_TOKENIZER_PARAMETER
-------------------------
t
(1 row)
-
Create an example table, foo, containing simplified Chinese text to index.
VMart=> CREATE TABLE foo(doc_id integer primary key not null,text varchar(250));
CREATE TABLE
VMart=> INSERT INTO foo values(1, u&'\4E2D\534E\4EBA\6C11\5171\548C\56FD');
OUTPUT
--------
1
-
Create an index, index_example, on the table foo. The example creates the index without a stemmer; Vertica stemmers work only on English text. Using a stemmer for English on non-English text can cause incorrect tokenization.
VMart=> CREATE TEXT INDEX index_example ON foo (doc_id, text) TOKENIZER v_txtindex.ICUChineseTokenizer(long varchar) stemmer none;
CREATE INDEX
-
View the new index.
VMart=> SELECT * FROM index_example ORDER BY token,doc_id;
token | doc_id
--------+--------
中华 | 1
人民 | 1
共和国 | 1
(3 rows)
5.3 - Configuring a tokenizer
You configure a tokenizer by creating a user-defined transform function (UDTF) using one of the two base UDTFs in the v_txtindex.AdvTxtSearchLib library.
You configure a tokenizer by creating a user-defined transform function (UDTF) using one of the two base UDTFs in the v_txtindex.AdvTxtSearchLib
library. The library contains two base tokenizers: one for Log Words and one for Ngrams. You can configure each base function with or without positional relevance.
5.3.1 - Tokenizer base configuration
You can choose among several different tokenizer base configurations:.
You can choose among two tokenizer base configurations:
- Ngram with position:
logNgramTokenizerPositionFactory
- Ngram without position:
logNgramTokenizerFactory
The following example creates an Ngram tokenizer without positional relevance:
=> CREATE TRANSFORM FUNCTION v_txtindex.myNgramTokenizer
AS LANGUAGE 'C++'
NAME 'logNgramTokenizerFactory'
LIBRARY v_txtindex.logSearchLib NOT FENCED;
5.3.2 - RetrieveTokenizerproc_oid
After you create the tokenizer, Vertica writes the name and proc_oid to the system table vs_procedures.
After you create the tokenizer, Vertica writes the name and proc_oid to the system table vs_procedures. You must retrieve the tokenizer's proc_oid to perform additional configuration.
Enter the following query, substituting your own tokenizer name:
=> SELECT proc_oid FROM vs_procedures WHERE procedure_name = 'fooTokenizer';
5.3.3 - Set tokenizer parameters
Use the tokenizer's proc_oid to configure the tokenizer.
Use the tokenizer's proc_oid to configure the tokenizer. See Configuring a tokenizer for more information about getting the proc_oid of your tokenizer. The following examples show how you can configure each of the tokenizer parameters:
Configure stop words:
=> SELECT v_txtindex.SET_TOKENIZER_PARAMETER('stopwordscaseinsensitive','for,the' USING PARAMETERS proc_oid='45035996274128376');
Configure major separators:
=> SELECT v_txtindex.SET_TOKENIZER_PARAMETER('majorseparators', E'{}()&[]' USING PARAMETERS proc_oid='45035996274128376');
Configure minor separators:
=> SELECT v_txtindex.SET_TOKENIZER_PARAMETER('minorseparators', '-,$' USING PARAMETERS proc_oid='45035996274128376');
Configure minimum length:
=> SELECT v_txtindex.SET_TOKENIZER_PARAMETER('minlength', '1' USING PARAMETERS proc_oid='45035996274128376');
Configure maximum length:
=> SELECT v_txtindex.SET_TOKENIZER_PARAMETER('maxlength', '140' USING PARAMETERS proc_oid='45035996274128376');
Configure ngramssize:
=> SELECT v_txtindex.SET_TOKENIZER_PARAMETER('ngramssize', '2' USING PARAMETERS proc_oid='45035996274128376');
Lock tokenizer parameters
When you finish configuring the tokenizer, set the parameter, used, to True
. After changing this setting, you are no longer able to alter the parameters of the tokenizer. At this point, the tokenizer is ready for you to use to create a text index.
Configure the used parameter:
=> SELECT v_txtindex.SET_TOKENIZER_PARAMETER('used', 'True' USING PARAMETERS proc_oid='45035996274128376');
See also
SET_TOKENIZER_PARAMETER
5.3.4 - View tokenizer parameters
After creating a custom tokenizer, you can view the tokenizer's parameter settings in either of two ways:.
After creating a custom tokenizer, you can view the tokenizer's parameter settings in either of two ways:
View individual tokenizer parameter settings
If you need to see an individual parameter setting for a tokenizer, you can use GET_TOKENIZER_PARAMETER to see specific tokenizer parameter settings:
=> SELECT v_txtindex.GET_TOKENIZER_PARAMETER('majorseparators' USING PARAMETERS proc_oid='45035996274126984');
getTokenizerParameter
-----------------------
{}()&[]
(1 row)
For more information, see GET_TOKENIZER_PARAMETER.
View all tokenizer parameter settings
If you need to see all of the parameters for a tokenizer, you can use READ_CONFIG_FILE to see all of the parameter settings for your tokenizer:
=> SELECT v_txtindex.READ_CONFIG_FILE( USING PARAMETERS proc_oid='45035996274126984') OVER();
config_key | config_value
--------------------------+---------------
majorseparators | {}()&[]
maxlength | 140
minlength | 1
minorseparators | -,$
stopwordscaseinsensitive | for,the
type | 1
used | true
(7 rows)
If the parameter, used, is set to False
, then you can only view the parameters that have been applied to the tokenizer.
Note
Vertica automatically supplies the value for Type, unless you are using an ngram tokenizer, which allows you to set it.
For more information, see READ_CONFIG_FILE.
5.3.5 - Delete tokenizer config file
Use the DELETE_TOKENIZER_CONFIG_FILE function to delete a tokenizer configuration file.
Use the DELETE_TOKENIZER_CONFIG_FILE function to delete a tokenizer configuration file. This function does not delete the User- Defined Transform Function (UDTF). It only deletes the configuration file associated with the UDTF.
Delete the tokenizer configuration file when the parameter, used, is set to False
:
=> SELECT v_txtindex.DELETE_TOKENIZER_CONFIG_FILE(USING PARAMETERS proc_oid='45035996274127086');
Delete the tokenizer configuration file with the parameter, confirm, set to True
. This setting forces the configuration file deletion, even if the parameter, used, is also set to True
:
=> SELECT v_txtindex.DELETE_TOKENIZER_CONFIG_FILE(USING PARAMETERS proc_oid='45035996274126984', confirm='true');
For more information, see DELETE_TOKENIZER_CONFIG_FILE.
5.4 - Requirements for custom stemmers and tokenizers
Sometimes, you may want specific tokenization or stemming behavior that differs from what Vertica provides.
Sometimes, you may want specific tokenization or stemming behavior that differs from what Vertica provides. In such cases, you can to implement your own custom User Defined Extensions (UDx) to replace the stemmer or tokenizer. For more information about building custom UDxs see Developing user-defined extensions (UDxs).
Before implementing a custom stemmer or tokenizer in Vertica verify that the UDx extension meets these requirements.
Note
Custom tokenizers can return multi-column text indices.
Vertica stemmer requirements
Comply with these requirements when you create custom stemmers:
-
Must be a User Defined Scalar Function (UDSF) or a SQL Function
-
Can be written in C++, Java, or R
-
Volatility set to stable or immutable
Supported Data Input Types:
Supported Data Output Types:
Vertica tokenizer requirements
To create custom tokenizers, follow these requirements:
-
Must be a User Defined Transform Function (UDTF)
-
Can be written in C++, Java, or R
-
Input type must match the type of the input text
Supported Data Input Types:
-
Char
-
Varchar
-
Long varchar
-
Varbinary
-
Long varbinary
Supported Data Output Types: