This is the multi-page printable view of this section. Click here to print.

Return to the regular view of this page.

Match and search functions

This section contains functions for text search and regular expressions, and functions used in the MATCH clause.

1: MATCH clause functions

1.1: EVENT_NAME
1.2: MATCH_ID
1.3: PATTERN_ID

2: Regular expression functions

2.1: MATCH_COLUMNS
2.2: REGEXP_COUNT
2.3: REGEXP_ILIKE
2.4: REGEXP_INSTR
2.5: REGEXP_LIKE
2.6: REGEXP_NOT_ILIKE
2.7: REGEXP_NOT_LIKE
2.8: REGEXP_REPLACE
2.9: REGEXP_SUBSTR

3: Text search functions

3.1: DELETE_TOKENIZER_CONFIG_FILE
3.2: GET_TOKENIZER_PARAMETER
3.3: READ_CONFIG_FILE
3.4: SET_TOKENIZER_PARAMETER

This section contains functions for text search and regular expressions, and functions used in the MATCH clause.

1 - MATCH clause functions

Used with the MATCH clause, the functions in this section return additional data about the patterns found or returned.

Used with the MATCH clause, the functions in this section return additional data about the patterns found or returned. For example, you can use these functions to return values representing the name of the event or pattern that matched the input row, the sequential number of the match, or a partition-wide unique identifier for the instance of the pattern that matched.

Pattern matching is particularly useful for clickstream analysis where you might want to identify users' actions based on their Web browsing behavior (page clicks). A typical online clickstream funnel is:

Company home page -> product home page -> search -> results -> purchase online

Using the above clickstream funnel, you can search for a match on the user's sequence of web clicks and identify that the user:

Landed on the company home page.
Navigated to the product page.
Ran a search.
Clicked a link from the search results.
Made a purchase.

For examples that use this clickstream model, see Event series pattern matching.

Note

GROUP BY and PARTITION BY expressions do not support window functions.

1.1 - EVENT_NAME

Returns a VARCHAR value representing the name of the event that matched the row.

Syntax

EVENT_NAME()

Notes

Pattern matching functions must be used in MATCH clause syntax; for example, if you call EVENT_NAME() on its own, Vertica returns the following error message:

=> SELECT event_name();
ERROR:  query with pattern matching function event_name must include a MATCH clause

Examples

Note

This example uses the schema defined in Event series pattern matching.

The following statement analyzes users' browsing history on website2.com and identifies patterns where the user landed on website2.com from another Web site (Entry) and browsed to any number of other pages (Onsite) before making a purchase (Purchase). The query also outputs the values for EVENT_NAME(), which is the name of the event that matched the row.

SELECT uid,
       sid,
       ts,
       refurl,
       pageurl,
       action,
       event_name()
FROM clickstream_log
MATCH
  (PARTITION BY uid, sid ORDER BY ts
   DEFINE
     Entry    AS RefURL  NOT ILIKE '%website2.com%' AND PageURL ILIKE '%website2.com%',
     Onsite   AS PageURL ILIKE     '%website2.com%' AND Action='V',
     Purchase AS PageURL ILIKE     '%website2.com%' AND Action = 'P'
   PATTERN
     P AS (Entry Onsite* Purchase)
   ROWS MATCH FIRST EVENT);
 uid | sid |    ts    |        refurl        |       pageurl        | action | event_name
-----+-----+----------+----------------------+----------------------+--------+------------
   1 | 100 | 12:00:00 | website1.com         | website2.com/home    | V      | Entry
   1 | 100 | 12:01:00 | website2.com/home    | website2.com/floby   | V      | Onsite
   1 | 100 | 12:02:00 | website2.com/floby   | website2.com/shamwow | V      | Onsite
   1 | 100 | 12:03:00 | website2.com/shamwow | website2.com/buy     | P      | Purchase
   2 | 100 | 12:10:00 | website1.com         | website2.com/home    | V      | Entry
   2 | 100 | 12:11:00 | website2.com/home    | website2.com/forks   | V      | Onsite
   2 | 100 | 12:13:00 | website2.com/forks   | website2.com/buy     | P      | Purchase
(7 rows)

1.2 - MATCH_ID

Returns a successful pattern match as an INTEGER value.

Returns a successful pattern match as an INTEGER value. The returned value is the ordinal position of a match within a partition.

Syntax

MATCH_ID()

Notes

Pattern matching functions must be used in MATCH clause syntax; for example, if you call MATCH_ID() on its own, Vertica returns the following error message:

=> SELECT match_id();
ERROR:  query with pattern matching function match_id must include a MATCH clause

Examples

Note

This example uses the schema defined in Event series pattern matching.

The following statement analyzes users' browsing history on a site called website2.com and identifies patterns where the user reached website2.com from another Web site (Entry in the MATCH clause) and browsed to any number of other pages (Onsite) before making a purchase (Purchase). The query also outputs values for the MATCH_ID(), which represents a sequential number of the match.

SELECT uid,
       sid,
       ts,
       refurl,
       pageurl,
       action,
       match_id()
FROM clickstream_log
MATCH
  (PARTITION BY uid, sid ORDER BY ts
   DEFINE
     Entry    AS RefURL  NOT ILIKE '%website2.com%' AND PageURL ILIKE '%website2.com%',
     Onsite   AS PageURL ILIKE     '%website2.com%' AND Action='V',
     Purchase AS PageURL ILIKE     '%website2.com%' AND Action = 'P'
   PATTERN
     P AS (Entry Onsite* Purchase)
   ROWS MATCH FIRST EVENT);

uid | sid |    ts    |        refurl        |       pageurl        | action | match_id
----+-----+----------+----------------------+----------------------+--------+------------
1   | 100 | 12:00:00 | website1.com         | website2.com/home    | V      |          1
1   | 100 | 12:01:00 | website2.com/home    | website2.com/floby   | V      |          2
1   | 100 | 12:02:00 | website2.com/floby   | website2.com/shamwow | V      |          3
1   | 100 | 12:03:00 | website2.com/shamwow | website2.com/buy     | P      |          4
2   | 100 | 12:10:00 | website1.com         | website2.com/home    | V      |          1
2   | 100 | 12:11:00 | website2.com/home    | website2.com/forks   | V      |          2
2   | 100 | 12:13:00 | website2.com/forks   | website2.com/buy     | P      |          3
(7 rows)

1.3 - PATTERN_ID

Returns an integer value that is a partition-wide unique identifier for the instance of the pattern that matched.

Syntax

PATTERN_ID()

Notes

Pattern matching functions must be used in MATCH clause syntax; for example, if call PATTERN_ID() on its own, Vertica returns the following error message:

=> SELECT pattern_id();
ERROR:  query with pattern matching function pattern_id must include a MATCH clause

Examples

Note

This example uses the schema defined in Event series pattern matching.

The following statement analyzes users' browsing history on website2.com and identifies patterns where the user landed on website2.com from another Web site (Entry) and browsed to any number of other pages (Onsite) before making a purchase (Purchase). The query also outputs values for PATTERN_ID(), which represents the partition-wide identifier for the instance of the pattern that matched.

SELECT uid,
       sid,
       ts,
       refurl,
       pageurl,
       action,
       pattern_id()
FROM clickstream_log
MATCH
  (PARTITION BY uid, sid ORDER BY ts
   DEFINE
     Entry    AS RefURL  NOT ILIKE '%website2.com%' AND PageURL ILIKE '%website2.com%',
     Onsite   AS PageURL ILIKE     '%website2.com%' AND Action='V',
     Purchase AS PageURL ILIKE     '%website2.com%' AND Action = 'P'
   PATTERN
     P AS (Entry Onsite* Purchase)
   ROWS MATCH FIRST EVENT);


uid | sid |    ts    |        refurl        |       pageurl        | action | pattern_id
----+-----+----------+----------------------+----------------------+--------+------------
1   | 100 | 12:00:00 | website1.com         | website2.com/home    | V      |          1
1   | 100 | 12:01:00 | website2.com/home    | website2.com/floby   | V      |          1
1   | 100 | 12:02:00 | website2.com/floby   | website2.com/shamwow | V      |          1
1   | 100 | 12:03:00 | website2.com/shamwow | website2.com/buy     | P      |          1
2   | 100 | 12:10:00 | website1.com         | website2.com/home    | V      |          1
2   | 100 | 12:11:00 | website2.com/home    | website2.com/forks   | V      |          1
2   | 100 | 12:13:00 | website2.com/forks   | website2.com/buy     | P      |          1
(7 rows)

2 - Regular expression functions

A regular expression lets you perform pattern matching on strings of characters.

A regular expression lets you perform pattern matching on strings of characters. The regular expression syntax allows you to precisely define the pattern used to match strings, giving you much greater control than wildcard matching used in the LIKE predicate. The Vertica regular expression functions let you perform tasks such as determining if a string value matches a pattern, extracting a portion of a string that matches a pattern, or counting the number of times a pattern occurs within a string.

Vertica uses the Perl Compatible Regular Expression (PCRE) library to evaluate regular expressions. As its name implies, PCRE's regular expression syntax is compatible with the syntax used by the Perl 5 programming language. You can read PCRE's documentation about its library. However, if you are unfamiliar with using regular expressions, the Perl Regular Expressions Documentation is a good introduction.

Note

The regular expression functions only operate on valid UTF-8 strings. If you try using a regular expression function on a string that is not valid UTF-8, the query fails with an error. To prevent an error from occurring, use the ISUTF8 function as an initial clause to ensure the strings you pass to the regular expression functions are valid UTF-8 strings. Alternatively, or you can use the 'b' argument to treat the strings as binary octets, rather than UTF-8 encoded strings.

2.1 - MATCH_COLUMNS

Specified as an element in a SELECT list, returns all columns in queried tables that match the specified pattern.

Specified as an element in a SELECT list, returns all columns in queried tables that match the specified pattern. For example:

=> SELECT MATCH_COLUMNS ('%order%') FROM store.store_orders_fact LIMIT 3;
 order_number | date_ordered | quantity_ordered | total_order_cost | reorder_level
--------------+--------------+------------------+------------------+---------------
       191119 | 2003-03-09   |               15 |             4021 |            23
        89985 | 2003-05-04   |               19 |             2692 |            23
       246962 | 2007-06-01   |               77 |             4419 |            42
(3 rows)

Syntax

MATCH_COLUMNS ('pattern')

Arguments

pattern

The pattern to match against all column names in the queried tables, where pattern typically contains one or both of the following wildcard characters:

_ (underscore): Match any single character.
% (percent sign): Match any string of zero or more characters.

The pattern can also include backslash (\) characters to escape reserved characters that are embedded in column names: _(underscore), % (percent sign), and backlash (\) itself.

Privileges

None

DDL usage

You can use MATCH_COLUMNS to define database objects—for example, specify it in CREATE PROJECTION to identify projection columns, or in CREATE TABLE...AS to identify columns in the new table. In all cases, Vertica expands the MATCH_COLUMNS output before it stores the object DDL. Subsequent changes to the original source table have no effect on the derived object definitions.

Restrictions

In general, MATCH_COLUMNS is specified as an element in a SELECT list. For example, CREATE PROJECTION can call MATCH_COLUMNS to specify the columns to include in a projection. However, attempts to specify columns in the projection's segmentation clause return with an error:

=> CREATE PROJECTION p_store_orders AS SELECT
    MATCH_COLUMNS('%product%'),
    MATCH_COLUMNS('%store%'),
    order_number FROM store.store_orders_fact SEGMENTED BY MATCH_COLUMNS('products%') ALL NODES;
ERROR 0:  MATCH_COLUMNS() function can only be specified as an element in a SELECT list
=> CREATE PROJECTION p_store_orders AS SELECT
    MATCH_COLUMNS('%product%'),
    MATCH_COLUMNS('%store%'),
    order_number FROM store.store_orders_fact;
WARNING 4468:  Projection <store.p_store_orders_b0> is not available for query processing. Execute the select start_refresh() function to copy data into this projection.
          The projection must have a sufficient number of buddy projections and all nodes must be up before starting a refresh
WARNING 4468:  Projection <store.p_store_orders_b1> is not available for query processing. Execute the select start_refresh() function to copy data into this projection.
          The projection must have a sufficient number of buddy projections and all nodes must be up before starting a refresh
CREATE PROJECTION

If you call MATCH_COLUMNS from a function that supports a fixed number of arguments, Vertica returns an error. For example, the UPPER function supports only one argument; so calling MATCH_COLUMNS from UPPER as follows returns an error:

=> SELECT MATCH_COLUMNS('emp%') FROM employee_dimension LIMIT 1;
-[ RECORD 1 ]-----------+---------------------------------
employee_key            | 1
employee_gender         | Male
employee_first_name     | Craig
employee_middle_initial | F
employee_last_name      | Robinson
employee_age            | 22
employee_street_address | 5 Bakers St
employee_city           | Thousand Oaks
employee_state          | CA
employee_region         | West

=> SELECT UPPER (MATCH_COLUMNS('emp%')) FROM employee_dimension;
ERROR 10465:  MATCH_COLUMNS() function can only be specified as an element in a SELECT list

In contrast, the HASH function accepts an unlimited number of arguments, so calling MATCH_COLUMNS as an argument succeeds:

=> select HASH(MATCH_COLUMNS('emp%')) FROM employee_dimension LIMIT 10;
        HASH
---------------------
 2047284364908178817
 1421997332260827278
 7981613309330877388
  792898558199431621
 5275639269069980417
 7892790768178152349
  184601038712735208
 3020263228621856381
 7056305566297085916
 3328422577712931057
(10 rows)

Other constraints

The following usages of MATCH_COLUMNS are invalid and return with an error:

Including MATCH_COLUMNS in the non-recursive (base) term query of a RECURSIVE WITH clause

Concatenating the results of MATCH_COLUMNS calls:

=> SELECT MATCH_COLUMNS ('%store%')||MATCH_COLUMNS('%store%') FROM store.store_orders_fact;
ERROR 0:  MATCH_COLUMNS() function can only be specified as an element in a SELECT list

Setting an alias on MATCH_COLUMNS

Examples

The following CREATE PROJECTION statement uses MATCH_COLUMNS to specify table columns in the new projection:

=> CREATE PROJECTION p_store_orders AS SELECT
    MATCH_COLUMNS('%product%'),
    MATCH_COLUMNS('%store%'),
    order_number FROM store.store_orders_fact;
WARNING 4468:  Projection <store.p_store_orders_b0> is not available for query processing. Execute the select start_refresh() function to copy data into this projection.
          The projection must have a sufficient number of buddy projections and all nodes must be up before starting a refresh
WARNING 4468:  Projection <store.p_store_orders_b1> is not available for query processing. Execute the select start_refresh() function to copy data into this projection.
          The projection must have a sufficient number of buddy projections and all nodes must be up before starting a refresh
CREATE PROJECTION

=> SELECT export_objects('', 'store.p_store_orders_b0');
...

CREATE PROJECTION store.p_store_orders_b0 /*+basename(p_store_orders)*/
(
 product_key,
 product_version,
 store_key,
 order_number
)
AS
 SELECT store_orders_fact.product_key,
        store_orders_fact.product_version,
        store_orders_fact.store_key,
        store_orders_fact.order_number
 FROM store.store_orders_fact
 ORDER BY store_orders_fact.product_key,
          store_orders_fact.product_version,
          store_orders_fact.store_key,
          store_orders_fact.order_number
SEGMENTED BY hash(store_orders_fact.product_key, store_orders_fact.product_version, store_orders_fact.store_key, store_orders_fact.order_number) ALL NODES OFFSET 0;

SELECT MARK_DESIGN_KSAFE(1);

(1 row)

As shown in the EXPORT_OBJECTS output, Vertica stores the result sets of the two MATCH_COLUMNS calls in the new projection's DDL. Later changes in the anchor table DDL have no effect on this projection.

2.2 - REGEXP_COUNT

Returns the number times a regular expression matches a string.

This function operates on UTF-8 strings using the default locale, even if the locale is set otherwise.

Important

If you port a regular expression query from an Oracle database, remember that Oracle considers a zero-length string to be equivalent to NULL, while Vertica does not.

Syntax

REGEXP_COUNT ( string-expession, pattern [, position [, regexp-modifier ]... ] )

Parameters

string-expression

The VARCHAR or LONG VARCHAR expression to evaluate for matches with the regular expression specified in pattern. If string-expression is in the __raw__ column of a flex or columnar table, cast the string to a LONG VARCHAR before searching for pattern.

pattern

The regular expression to match against string-expression. The regular expression must conform with Perl regular expression syntax.

position

The number of characters from the start of the string where the function should start searching for matches. By default, the function begins searching for a match at the first (leftmost) character. Setting this parameter to a value greater than 1 begins searching for a match at the *n*th character you specify.

Default: 1

regexp-modifier

One or more single-character flags that modify how the regular expression pattern is matched to string-expression:

b: Treat strings as binary octets, rather than UTF-8 characters.
c (default): Force the match to be case sensitive.
i: Force the match to be case insensitive.
m: Treat the string to match as multiple lines. Using this modifier, the start of line (^) and end of line ($) regular expression operators match line breaks (\n) within the string. Without the m modifier, the start and end of line operators match only the start and end of the string.
n: Match the regular expression operator (.) to a newline (\n). By default, the . operator matches any character except a newline.
x: Add comments to regular expressions. The x modifier causes the function to ignore all un-escaped space characters and comments in the regular expression. Comments start with hash (#) and end with a newline (\n). All spaces in the regular expression to be matched in strings must be escaped with a backslash (\).

Examples

Count the number of occurrences of the substring an in the specified string (a man, a plan, a canal: Panama):

=> SELECT REGEXP_COUNT('a man, a plan, a canal: Panama', 'an');
 REGEXP_COUNT
--------------
            4
(1 row)

Find the number of occurrences of the substring an, starting with the fifth character.

=> SELECT REGEXP_COUNT('a man, a plan, a canal: Panama', 'an',5);
 REGEXP_COUNT
--------------
            3
(1 row)

Find the number of occurrences of a substring containing a lower-case character followed by an:

=> SELECT REGEXP_COUNT('a man, a plan, a canal: Panama', '[a-z]an');
 REGEXP_COUNT
--------------
            3
(1 row

REGEXP_COUNT specifies the i modifier, so it ignores case:


=> SELECT REGEXP_COUNT('a man, a plan, a canal: Panama', '[a-z]an', 1, 'i');

 REGEXP_COUNT
--------------
            4

2.3 - REGEXP_ILIKE

Returns true if the string contains a match for the regular expression.

Returns true if the string contains a match for the regular expression. REGEXP_ILIKE is similar to the LIKE, except that it uses a case insensitive regular expression, rather than simple wildcard character matching.

This function operates on UTF-8 strings using the default locale, even if the locale is set otherwise.

Important

If you port a regular expression query from an Oracle database, remember that Oracle considers a zero-length string to be equivalent to NULL, while Vertica does not.

Syntax

REGEXP_ILIKE ( string-expression, pattern )

Parameters

string-expression``: The VARCHAR or LONG VARCHAR expression to evaluate for matches with the regular expression specified in pattern. If string-expression is in the __raw__ column of a flex or columnar table, cast the string to a LONG VARCHAR before searching for pattern.
pattern``: The regular expression to match against string-expression. The regular expression must conform with Perl regular expression syntax.

Examples

This example creates a table containing several strings to demonstrate regular expressions.

Create table longvc with a single, long varchar column body, and insert data with distinct characters:

=> CREATE table longvc(body long varchar (1048576));
CREATE TABLE

=> insert into longvc values ('На берегу пустынных волн');
=> insert into longvc values ('Voin syödä lasia, se ei vahingoita minua');
=> insert into longvc values ('私はガラスを食べられます。それは私を傷つけません。');
=> insert into longvc values ('Je peux manger du verre, ça ne me fait pas mal.');
=> insert into longvc values ('zésbaésbaa');
=> insert into longvc values ('Out of the frying pan, he landed immediately in the fire');

=> SELECT * FROM longvc;
                      body
------------------------------------------------
 На берегу пустынных волн
 Voin syödä lasia, se ei vahingoita minua
 私はガラスを食べられます。それは私を傷つけません。
 Je peux manger du verre, ça ne me fait pas mal.
 zésbaésbaa
 Out of the frying pan, he landed immediately in the fire
(6 rows)

Pattern match table rows containing the character ç:

=> SELECT * FROM longvc where regexp_ilike(body, 'ç');
                      body
-------------------------------------------------
 Je peux manger du verre, ça ne me fait pas mal.
(1 row)

Select all rows that contain the characters A/a:

=> SELECT * FROM longvc where regexp_ilike(body, 'A');
                      body
-------------------------------------------------
 Je peux manger du verre, ça ne me fait pas mal.
 Voin syödä lasia, se ei vahingoita minua
 zésbaésbaa
(3 rows)

Select all rows that contain the characters O/o:

=> SELECT * FROM longvc where regexp_ilike(body, 'O');
                           body
----------------------------------------------------------
 Voin syödä lasia, se ei vahingoita minua
 Out of the frying pan, he landed immediately in the fire
(2 rows)

2.4 - REGEXP_INSTR

Returns the starting or ending position in a string where a regular expression matches.

Returns the starting or ending position in a string where a regular expression matches. REGEXP_INSTR returns 0 if no match for the regular expression is found in the string.

This function operates on UTF-8 strings using the default locale, even if the locale is set otherwise.

Important

If you port a regular expression query from an Oracle database, remember that Oracle considers a zero-length string to be equivalent to NULL, while Vertica does not.

Syntax

REGEXP_INSTR ( string-expression, pattern 
   [, position [, occurrence [, return-position [, regexp-modifier ]... [, captured-subexp ]]]] )

Parameters

string-expression

pattern

The regular expression to match against string-expression. The regular expression must conform with Perl regular expression syntax.

position

Default: 1

occurrence

Controls which occurrence of a pattern match in the string to return. By default, the function returns the position of the first matching substring. Use this parameter to find the position of subsequent matching substrings. For example, setting this parameter to 3 returns the position of the third substring that matches the pattern.

Default: 1

return-position

Sets the position within the string to return. Using the default position (0), the function returns the string position of the first character of the substring that matches the pattern. If you set return-position to 1, the function returns the position of the first character after the end of the matching substring.

Default: 0

regexp-modifier

One or more single-character flags that modify how the regular expression pattern is matched to string-expression:

b: Treat strings as binary octets, rather than UTF-8 characters.
c (default): Force the match to be case sensitive.
i: Force the match to be case insensitive.
m: Treat the string to match as multiple lines. Using this modifier, the start of line (^) and end of line ($) regular expression operators match line breaks (\n) within the string. Without the m modifier, the start and end of line operators match only the start and end of the string.
n: Match the regular expression operator (.) to a newline (\n). By default, the . operator matches any character except a newline.
x: Add comments to regular expressions. The x modifier causes the function to ignore all un-escaped space characters and comments in the regular expression. Comments start with hash (#) and end with a newline (\n). All spaces in the regular expression to be matched in strings must be escaped with a backslash (\).

captured-subexp

The captured subexpression whose position to return. By default, the function returns the position of the first character in string that matches the regular expression. If you set this value from 1 – 9, the function returns the subexpression captured by the corresponding set of parentheses in the regular expression. For example, setting this value to 3 returns the substring captured by the third set of parentheses in the regular expression.

Default: 0

Note

The subexpressions are numbered left to right, based on the appearance of opening parenthesis, so nested regular expressions . For example, in the regular expression \s*(\w+\s+(\w+)), subexpression 1 is the one that captures everything but any leading whitespaces.

Examples

Find the first occurrence of a sequence of letters starting with the letter e and ending with the letter y in the specified string (easy come, easy go).

=> SELECT REGEXP_INSTR('easy come, easy go','e\w*y');
 REGEXP_INSTR
--------------
            1
(1 row)

Starting at the second character (2), find the first sequence of letters starting with the letter e and ending with the letter y:

=> SELECT REGEXP_INSTR('easy come, easy go','e\w*y',2);
 REGEXP_INSTR
--------------
           12
(1 row)

Starting at the first character (1), find the second sequence of letters starting with the letter e and ending with the letter y:

=> SELECT REGEXP_INSTR('easy come, easy go','e\w*y',1,2);
 REGEXP_INSTR
--------------
           12
(1 row)

Find the position of the first character after the first whitespace:

=> SELECT REGEXP_INSTR('easy come, easy go','\s',1,1,1);
 REGEXP_INSTR
--------------
            6
(1 row)

Find the position of the start of the third word in a string by capturing each word as a subexpression, and returning the third subexpression's start position.

=> SELECT REGEXP_INSTR('one two three','(\w+)\s+(\w+)\s+(\w+)', 1,1,0,'',3);
REGEXP_INSTR
--------------
            9
(1 row)

2.5 - REGEXP_LIKE

Returns true if the string matches the regular expression.

Returns true if the string matches the regular expression. REGEXP_LIKE is similar to the LIKE, except that it uses regular expressions rather than simple wildcard character matching.

This function operates on UTF-8 strings using the default locale, even if the locale is set otherwise.

Important

If you port a regular expression query from an Oracle database, remember that Oracle considers a zero-length string to be equivalent to NULL, while Vertica does not.

Syntax

REGEXP_LIKE ( string-expression, pattern [, regexp-modifier ]... )

Parameters

string-expression

pattern

The regular expression to match against string-expression. The regular expression must conform with Perl regular expression syntax.

regexp-modifier

One or more single-character flags that modify how the regular expression pattern is matched to string-expression:

b: Treat strings as binary octets, rather than UTF-8 characters.
c (default): Force the match to be case sensitive.
i: Force the match to be case insensitive.
m: Treat the string to match as multiple lines. Using this modifier, the start of line (^) and end of line ($) regular expression operators match line breaks (\n) within the string. Without the m modifier, the start and end of line operators match only the start and end of the string.
n: Match the regular expression operator (.) to a newline (\n). By default, the . operator matches any character except a newline.
x: Add comments to regular expressions. The x modifier causes the function to ignore all un-escaped space characters and comments in the regular expression. Comments start with hash (#) and end with a newline (\n). All spaces in the regular expression to be matched in strings must be escaped with a backslash (\).

Examples

Create a table that contains several strings:

=> CREATE TABLE t (v VARCHAR);
CREATE TABLE
=> CREATE PROJECTION t1 AS SELECT * FROM t;
CREATE PROJECTION
=> COPY t FROM stdin;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> aaa
>> Aaa
>> abc
>> abc1
>> 123
>> \.
=> SELECT * FROM t;
   v
-------
 aaa
 Aaa
 abc
 abc1
 123
(5 rows)

Select all records from table t that contain the letter a:

=> SELECT v FROM t WHERE REGEXP_LIKE(v,'a');
  v
------
 Aaa
 aaa
 abc
 abc1
(4 rows)

Select all rows from table t that start with the letter a:

=> SELECT v FROM t WHERE REGEXP_LIKE(v,'^a');
  v
------
 aaa
 abc
 abc1
(3 rows)

Select all rows that contain the substring aa:

=> SELECT v FROM t WHERE REGEXP_LIKE(v,'aa');
  v
-----
 Aaa
 aaa
(2 rows)

Select all rows that contain a digit.

=> SELECT v FROM t WHERE REGEXP_LIKE(v,'\d');
  v
------
 123
 abc1
(2 rows)

Select all rows that contain the substring aaa.

=> SELECT v FROM t WHERE REGEXP_LIKE(v,'aaa');
  v
-----
 aaa
(1 row)

Select all rows that contain the substring aaa using case-insensitive matching.

=> SELECT v FROM t WHERE REGEXP_LIKE(v,'aaa', 'i');
  v
-----
 Aaa
 aaa
(2 rows)

Select rows that contain the substring a b c.

=> SELECT v FROM t WHERE REGEXP_LIKE(v,'a b c');
 v
---
(0 rows)

Select rows that contain the substring a b c, ignoring space within the regular expression.

=> SELECT v FROM t WHERE REGEXP_LIKE(v,'a b c','x');
  v
------
 abc
 abc1
(2 rows)

Add multi-line rows to table t:

=> COPY t FROM stdin RECORD TERMINATOR '!';
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> Record 1 line 1
>> Record 1 line 2
>> Record 1 line 3!
>> Record 2 line 1
>> Record 2 line 2
>> Record 2 line 3!
>> \.

Select rows from table t that start with the substring Record and end with the substring line 2.

=> SELECT v from t WHERE REGEXP_LIKE(v,'^Record.*line 2$');
 v
---
(0 rows)

Select rows that start with the substring Record and end with the substring line 2, treating multiple lines as separate strings.

=> SELECT v from t WHERE REGEXP_LIKE(v,'^Record.*line 2$','m');
                        v
--------------------------------------------------

Record 2 line 1
Record 2 line 2
Record 2 line 3
Record 1 line 1
Record 1 line 2
Record 1 line 3
(2 rows)

2.6 - REGEXP_NOT_ILIKE

Returns true if the string does not match the case-insensitive regular expression.

This function operates on UTF-8 strings using the default locale, even if the locale is set otherwise.

Important

If you port a regular expression query from an Oracle database, remember that Oracle considers a zero-length string to be equivalent to NULL, while Vertica does not.

Syntax

REGEXP_NOT_ILIKE ( string-expression, pattern )

Parameters

string-expression``: The VARCHAR or LONG VARCHAR expression to evaluate for matches with the regular expression specified in pattern. If string-expression is in the __raw__ column of a flex or columnar table, cast the string to a LONG VARCHAR before searching for pattern.
pattern``: The regular expression to match against string-expression. The regular expression must conform with Perl regular expression syntax.

Examples

Create a table (longvc) with a single, long varchar column (body). Then, insert data with some distinct characters, and query the table contents:

=> CREATE table longvc(body long varchar (1048576));
CREATE TABLE

=> insert into longvc values ('На берегу пустынных волн');
=> insert into longvc values ('Voin syödä lasia, se ei vahingoita minua');
=> insert into longvc values ('私はガラスを食べられます。それは私を傷つけません。');
=> insert into longvc values ('Je peux manger du verre, ça ne me fait pas mal.');
=> insert into longvc values ('zésbaésbaa');

=> SELECT * FROM longvc;
                      body
------------------------------------------------
 На берегу пустынных волн
 Voin syödä lasia, se ei vahingoita minua
 私はガラスを食べられます。それは私を傷つけません。
 Je peux manger du verre, ça ne me fait pas mal.
 zésbaésbaa
(5 rows)

Find all rows that do not contain the character ç:


=> SELECT * FROM longvc where regexp_not_ilike(body, 'ç');
                        body
----------------------------------------------------
 Voin syödä lasia, se ei vahingoita minua
 zésbaésbaa
 На берегу пустынных волн
 私はガラスを食べられます。それは私を傷つけません。
(4 rows)

Find all rows that do not contain the substring a:

=> SELECT * FROM longvc where regexp_not_ilike(body, 'a');
                        body
----------------------------------------------------
 На берегу пустынных волн
 私はガラスを食べられます。それは私を傷つけません。
(2 rows)

2.7 - REGEXP_NOT_LIKE

Returns true if the string does not contain a match for the regular expression.

Returns true if the string does not contain a match for the regular expression. REGEXP_NOT_LIKE is a case sensitive regular expression.

This function operates on UTF-8 strings using the default locale, even if the locale is set otherwise.

Important

If you port a regular expression query from an Oracle database, remember that Oracle considers a zero-length string to be equivalent to NULL, while Vertica does not.

Syntax

REGEXP_NOT_LIKE ( string-expression, pattern )

Parameters

string-expression``: The VARCHAR or LONG VARCHAR expression to evaluate for matches with the regular expression specified in pattern. If string-expression is in the __raw__ column of a flex or columnar table, cast the string to a LONG VARCHAR before searching for pattern.
pattern``: The regular expression to match against string-expression. The regular expression must conform with Perl regular expression syntax.

Examples

Create a table (longvc) with the LONG VARCHAR column body. Then, insert data with some distinct characters and query the table contents:

=> CREATE table longvc(body long varchar (1048576));
CREATE TABLE

=> insert into longvc values ('На берегу пустынных волн');
=> insert into longvc values ('Voin syödä lasia, se ei vahingoita minua');
=> insert into longvc values ('私はガラスを食べられます。それは私を傷つけません。');
=> insert into longvc values ('Je peux manger du verre, ça ne me fait pas mal.');
=> insert into longvc values ('zésbaésbaa');

=> SELECT * FROM longvc;
                      body
------------------------------------------------
 На берегу пустынных волн
 Voin syödä lasia, se ei vahingoita minua
 私はガラスを食べられます。それは私を傷つけません。
 Je peux manger du verre, ça ne me fait pas mal.
 zésbaésbaa
(5 rows)

Use REGEXP_NOT_LIKE to return rows that do not contain the character ç:

=> SELECT * FROM longvc where regexp_not_like(body, 'ç');
                        body
----------------------------------------------------
 Voin syödä lasia, se ei vahingoita minua
 zésbaésbaa
 На берегу пустынных волн
 私はガラスを食べられます。それは私を傷つけません。
(4 rows)

Return all rows that do not contain the characters *ö and *ä:

=> SELECT * FROM longvc where regexp_not_like(body, '.*ö.*ä');
                        body
----------------------------------------------------
 Je peux manger du verre, ça ne me fait pas mal.
 zésbaésbaa
 На берегу пустынных волн
 私はガラスを食べられます。それは私を傷つけません。
(4 rows)

Pattern match all rows that do not contain the characters z and *ésbaa:

=> SELECT * FROM longvc where regexp_not_like(body, 'z.*ésbaa');
                        body
----------------------------------------------------
 Je peux manger du verre, ça ne me fait pas mal.
 Voin syödä lasia, se ei vahingoita minua
 zésbaésbaa
 На берегу пустынных волн
 私はガラスを食べられます。それは私を傷つけません。
(5 rows)

2.8 - REGEXP_REPLACE

Replaces all occurrences of a substring that match a regular expression with another substring.

Replaces all occurrences of a substring that match a regular expression with another substring. REGEXP_REPLACE is similar to the REPLACE function, except it uses a regular expression to select the substring to be replaced.

This function operates on UTF-8 strings using the default locale, even if the locale is set otherwise.

Important

If you port a regular expression query from an Oracle database, remember that Oracle considers a zero-length string to be equivalent to NULL, while Vertica does not.

Syntax

REGEXP_REPLACE ( string-expression, target 
   [, replacement [, position [, occurrence[...] [, regexp-modifier]]]] )

Parameters

string-expression

pattern

The regular expression to match against string-expression. The regular expression must conform with Perl regular expression syntax.

replacement

The string to replace matched substrings. If you do not supply a replacement, the function deletes matched substrings. The replacement string can contain backreferences for substrings captured by the regular expression. The first captured substring is inserted into the replacement string using \1, the second \2, and so on.

position

Default: 1

occurrence

Controls which occurrence of a pattern match in the string to replace. By default, the function replaces all matching substrings. For example, setting this parameter to 3 replaces the third matching instance.

Default: 1

regexp-modifier

One or more single-character flags that modify how the regular expression pattern is matched to string-expression:

b: Treat strings as binary octets, rather than UTF-8 characters.
c (default): Force the match to be case sensitive.
i: Force the match to be case insensitive.
m: Treat the string to match as multiple lines. Using this modifier, the start of line (^) and end of line ($) regular expression operators match line breaks (\n) within the string. Without the m modifier, the start and end of line operators match only the start and end of the string.
n: Match the regular expression operator (.) to a newline (\n). By default, the . operator matches any character except a newline.
x: Add comments to regular expressions. The x modifier causes the function to ignore all un-escaped space characters and comments in the regular expression. Comments start with hash (#) and end with a newline (\n). All spaces in the regular expression to be matched in strings must be escaped with a backslash (\).

How Oracle handles subexpressions

Unlike Oracle, Vertica can handle an unlimited number of captured subexpressions, while Oracle is limited to nine.

In Vertica, you can use \10 in the replacement pattern to access the substring captured by the tenth set of parentheses in the regular expression. In Oracle, \10 is treated as the substring captured by the first set of parentheses, followed by a zero. To force this Oracle behavior in Vertica, use the \g back reference and enclose the number of the captured subexpression in curly braces. For example, \g{1}0 is the substring captured by the first set of parentheses followed by a zero.

You can also name captured subexpressions to make your regular expressions less ambiguous. See the PCRE documentation for details.

Examples

Find groups of word characters—letters, numbers and underscore—that end with thy in the string healthy, wealthy, and wise, and replace them with nothing.

=> SELECT REGEXP_REPLACE('healthy, wealthy, and wise','\w+thy');
 REGEXP_REPLACE
----------------
 , , and wise
(1 row)

Find groups of word characters ending with thy and replace with the string something.

=> SELECT REGEXP_REPLACE('healthy, wealthy, and wise','\w+thy', 'something');
         REGEXP_REPLACE
--------------------------------
 something, something, and wise
(1 row)

Find groups of word characters ending with thy and replace with the string something starting at the third character in the string.

=> SELECT REGEXP_REPLACE('healthy, wealthy, and wise','\w+thy', 'something', 3);
          REGEXP_REPLACE
----------------------------------
 hesomething, something, and wise
(1 row)

Replace the second group of word characters ending with thy with the string something.

=> SELECT REGEXP_REPLACE('healthy, wealthy, and wise','\w+thy', 'something', 1, 2);
        REGEXP_REPLACE
------------------------------
 healthy, something, and wise
(1 row)

Find groups of word characters ending with thy capturing the letters before the thy, and replace with the captured letters plus the letters ish.

=> SELECT REGEXP_REPLACE('healthy, wealthy, and wise','(\w+)thy', '\1ish');
       REGEXP_REPLACE
----------------------------
 healish, wealish, and wise
(1 row)

Create a table to demonstrate replacing strings in a query.

=> CREATE TABLE customers (name varchar(50), phone varchar(11));
CREATE TABLE
=> CREATE PROJECTION customers1 AS SELECT * FROM customers;
CREATE PROJECTION
=> COPY customers FROM stdin;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> Able, Adam|17815551234
>> Baker,Bob|18005551111
>> Chu,Cindy|16175559876
>> Dodd,Dinara|15083452121
>> \.

Query the customers, using REGEXP_REPLACE to format phone numbers.

=> SELECT name, REGEXP_REPLACE(phone, '(\d)(\d{3})(\d{3})(\d{4})',
'\1-(\2) \3-\4') as phone FROM customers;
    name     |      phone
-------------+------------------
 Able, Adam  | 1-(781) 555-1234
 Baker,Bob   | 1-(800) 555-1111
 Chu,Cindy   | 1-(617) 555-9876
 Dodd,Dinara | 1-(508) 345-2121
(4 rows)

2.9 - REGEXP_SUBSTR

Returns the substring that matches a regular expression within a string.

Returns the substring that matches a regular expression within a string. If no matches are found, REGEXP_SUBSTR returns NULL. This is different from an empty string, which the function can return if the regular expression matches a zero-length string.

This function operates on UTF-8 strings using the default locale, even if the locale is set otherwise.

Important

If you port a regular expression query from an Oracle database, remember that Oracle considers a zero-length string to be equivalent to NULL, while Vertica does not.

Syntax

REGEXP_SUBSTR ( string-expression, pattern 
   [, position [,  occurrence [, regexp-modifier [, captured-subexp ]]... ]] )

Parameters

string-expression

pattern

The regular expression to match against string-expression. The regular expression must conform with Perl regular expression syntax.

position

Default: 1

occurrence

Controls which occurrence of a pattern match in the string to return. By default, the function returns the first matching substring. For example, setting this parameter to 3 returns the third matching instance.

Default: 1

regexp-modifier

One or more single-character flags that modify how the regular expression pattern is matched to string-expression:

b: Treat strings as binary octets, rather than UTF-8 characters.
c (default): Force the match to be case sensitive.
i: Force the match to be case insensitive.
m: Treat the string to match as multiple lines. Using this modifier, the start of line (^) and end of line ($) regular expression operators match line breaks (\n) within the string. Without the m modifier, the start and end of line operators match only the start and end of the string.
n: Match the regular expression operator (.) to a newline (\n). By default, the . operator matches any character except a newline.
x: Add comments to regular expressions. The x modifier causes the function to ignore all un-escaped space characters and comments in the regular expression. Comments start with hash (#) and end with a newline (\n). All spaces in the regular expression to be matched in strings must be escaped with a backslash (\).

captured-subexp

The group to return. By default, the function returns all matching groups. For example, setting this value to 3 returns the substring captured by the third set of parentheses in the regular expression.

Default: 0

Note

Examples

Select the first substring of letters that end with thy.

=> SELECT REGEXP_SUBSTR('healthy, wealthy, and wise','\w+thy');
 REGEXP_SUBSTR
---------------
 healthy
(1 row)

Select the first substring of letters that ends with thy starting at the second character in the string.

=> SELECT REGEXP_SUBSTR('healthy, wealthy, and wise','\w+thy',2);
 REGEXP_SUBSTR
---------------
 ealthy
(1 row)

Select the second substring of letters that ends with thy.

=> SELECT REGEXP_SUBSTR('healthy, wealthy, and wise','\w+thy',1,2);
 REGEXP_SUBSTR
---------------
 wealthy
(1 row)

Return the contents of the third captured subexpression, which captures the third word in the string.

=> SELECT REGEXP_SUBSTR('one two three', '(\w+)\s+(\w+)\s+(\w+)', 1, 1, '', 3);
 REGEXP_SUBSTR
---------------
 three
(1 row)

3 - Text search functions

This section contains text search functions specific to Vertica.

3.1 - DELETE_TOKENIZER_CONFIG_FILE

Deletes a tokenizer configuration file.

Syntax

SELECT v_txtindex.DELETE_TOKENIZER_CONFIG_FILE (USING PARAMETERS proc_oid='proc_oid', confirm={true | false });

Parameters

confirm = [true | false]

Boolean flag. Indicates that the configuration file should be removed even if the tokenizer is still in use.

True — Force deletion of the tokenizer when the used parameter value is True.

False — Delete tokenizer if the used parameter value is False.

Default:False

proc_oid

A unique identifier assigned to a tokenizer when it is created. Users must query the system table vs_procedures to get the proc_oid for a given tokenizer name. See Configuring a tokenizer for more information.

Examples

The following example shows how you can use DELETE_TOKENIZER_CONFIG_FILE to delete the tokenizer configuration file:

=> SELECT v_txtindex.DELETE_TOKENIZER_CONFIG_FILE (USING PARAMETERS proc_oid='45035996274126984');
 DELETE_TOKENIZER_CONFIG_FILE
------------------------------
 t
(1 row)

3.2 - GET_TOKENIZER_PARAMETER

Returns the configuration parameter for a given tokenizer.

Syntax

SELECT v_txtindex.GET_TOKENIZER_PARAMETER(parameter_name USING PARAMETERS proc_oid='proc_oid');

Parameters

parameter_name

Name of the parameter to be returned.

One of the following:

stopWordsCaseInsensitive
minorSeparators
majorSeparators
minLength
maxLength
ngramsSize
used

proc_oid

Examples

The following examples show how you can use GET_TOKENIZER_PARAMETER.

Return the stop words used in a tokenizer:

=> SELECT v_txtindex.GET_TOKENIZER_PARAMETER('stopwordscaseinsensitive' USING PARAMETERS proc_oid='45035996274126984');
 getTokenizerParameter
-----------------------
 devil,TODAY,the,fox
(1 row)

Return the major separators used in a tokenizer:

=> SELECT v_txtindex.GET_TOKENIZER_PARAMETER('majorseparators' USING PARAMETERS proc_oid='45035996274126984');
 getTokenizerParameter
-----------------------
 {}()&[]
(1 row)

3.3 - READ_CONFIG_FILE

Reads and returns the key-value pairs of all the parameters of a given tokenizer.

You must use the OVER() clause with this function.

Syntax

SELECT v_txtindex.READ_CONFIG_FILE(USING PARAMETERS proc_oid='proc_oid') OVER ()

Parameters

proc_oid: A unique identifier assigned to a tokenizer when it is created. Users must query the system table vs_procedures to get the proc_oid for a given tokenizer name. See Configuring a tokenizer for more information.

Examples

The following example shows how you can use READ_CONFIG_FILE to return the parameters associated with a tokenizer:

=> SELECT v_txtindex.READ_CONFIG_FILE(USING PARAMETERS proc_oid='45035996274126984') OVER();
                config_key | config_value
 --------------------------+---------------------
  majorseparators          | {}()&[]
  stopwordscaseinsensitive | devil,TODAY,the,fox
(2 rows)

3.4 - SET_TOKENIZER_PARAMETER

Configures the tokenizer parameters.

Important

\n, \t,\r must be entered as Unicode using Vertica notation, U&’\000D’, or using Vertica escaping notation, E’\r’. Otherwise, they are taken literally as two separate characters. For example, "\" & "r".

Syntax

SELECT v_txtindex.SET_TOKENIZER_PARAMETER (parameter_name, parameter_value USING PARAMETERS proc_oid='proc_oid')

Parameters

parameter_name

Name of the parameter to be configured.

Use one of the following:

stopwordsCaseInsensitive: List of stop words. All the tokens that belong to the list are ignored. Vertica supports separators and stop words up to the first 256 Unicode characters.

If you want to define a stop word that contains a comma or a backslash, then it needs to be escaped.
For example: "Dear Jack\," "Dear Jack\\"

Default: '' (empty list)
majorSeparators:List of major separators. Enclose in quotes with no spaces between.

Default: E' []<>(){}|!;,''"*&?+\r\n\t'
minorSeparators: List of minor separators. Enclose in quotes with no spaces between.

Default: E'/:=@.-$#%\\_'
minLength — Minimum length a token can have, type Integer. Must be greater than 0.

Default: '2'
maxLength: Maximum length a token can be. Type Integer. Cannot be greater than 1024 bytes. For information about increasing the token size, see Text search parameters.

Default: '128'
ngramsSize: Integer value greater than zero. Use only with ngram tokenizers.

Default: '3'
used: Indicates when a tokenizer configuration cannot be changed. Type Boolean. After you set used to True, any calls to setTokenizerParameter fail.

You must set the parameter used to True before using the configured tokenizer. Doing so prevents the configuration from being modified after being used to create a text index.

Default: False

parameter_value

The value of a configuration parameter.

If you want to disable minorSeperators or stopWordsCaseInsensitive, then set their values to ''.

proc_oid

Examples

The following examples show how you can use SET_TOKENIZER_PARAMETER to configure stop words and separators.

Configure the stop words of a tokenizer:

=> SELECT v_txtindex.SET_TOKENIZER_PARAMETER('stopwordsCaseInsensitive', 'devil,TODAY,the,fox' USING PARAMETERS proc_oid='45035996274126984');
 SET_TOKENIZER_PARAMETER
-------------------------
 t
(1 row)

Configure the major separators of a tokenizer:

=> SELECT v_txtindex.SET_TOKENIZER_PARAMETER('majorSeparators',E'{}()&[]' USING PARAMETERS proc_oid='45035996274126984');
 SET_TOKENIZER_PARAMETER
-------------------------
 t
(1 row)

Match and search functions

1 - MATCH clause functions

Note

1.1 - EVENT_NAME

Syntax

Notes

Examples

Note

See also

1.2 - MATCH_ID

Syntax

Notes

Examples

Note

See also

1.3 - PATTERN_ID

Syntax

Notes

Examples

Note

See also

2 - Regular expression functions

Note

2.1 - MATCH_COLUMNS

Syntax

Arguments

Privileges

DDL usage

Restrictions

Other constraints

Examples

2.2 - REGEXP_COUNT

Important

Syntax

Parameters

Examples

2.3 - REGEXP_ILIKE

Important

Syntax

Parameters

Examples

2.4 - REGEXP_INSTR

Important

Syntax

Parameters

Note

Examples

2.5 - REGEXP_LIKE

Important

Syntax

Parameters

Examples

2.6 - REGEXP_NOT_ILIKE

Important

Syntax

Parameters

Examples

2.7 - REGEXP_NOT_LIKE

Important

Syntax

Parameters

Examples

2.8 - REGEXP_REPLACE

Important

Syntax

Parameters

How Oracle handles subexpressions

Examples

2.9 - REGEXP_SUBSTR

Important

Syntax

Parameters

Note

Examples

3 - Text search functions

3.1 - DELETE_TOKENIZER_CONFIG_FILE

Syntax

Parameters

Examples

3.2 - GET_TOKENIZER_PARAMETER