This section contains functions for text search and regular expressions, and functions used in the MATCH clause.
This is the multi-page printable view of this section. Click here to print.
Match and search functions
- 1: MATCH clause functions
- 1.1: EVENT_NAME
- 1.2: MATCH_ID
- 1.3: PATTERN_ID
- 2: Regular expression functions
- 2.1: MATCH_COLUMNS
- 2.2: REGEXP_COUNT
- 2.3: REGEXP_ILIKE
- 2.4: REGEXP_INSTR
- 2.5: REGEXP_LIKE
- 2.6: REGEXP_NOT_ILIKE
- 2.7: REGEXP_NOT_LIKE
- 2.8: REGEXP_REPLACE
- 2.9: REGEXP_SUBSTR
- 3: Text search functions
- 3.1: DELETE_TOKENIZER_CONFIG_FILE
- 3.2: GET_TOKENIZER_PARAMETER
- 3.3: READ_CONFIG_FILE
- 3.4: SET_TOKENIZER_PARAMETER
1 - MATCH clause functions
Used with the MATCH clause, the functions in this section return additional data about the patterns found or returned. For example, you can use these functions to return values representing the name of the event or pattern that matched the input row, the sequential number of the match, or a partition-wide unique identifier for the instance of the pattern that matched.
Pattern matching is particularly useful for clickstream analysis where you might want to identify users' actions based on their Web browsing behavior (page clicks). A typical online clickstream funnel is:
Company home page -> product home page -> search -> results -> purchase online
Using the above clickstream funnel, you can search for a match on the user's sequence of web clicks and identify that the user:
-
Landed on the company home page.
-
Navigated to the product page.
-
Ran a search.
-
Clicked a link from the search results.
-
Made a purchase.
For examples that use this clickstream model, see Event series pattern matching.
Note
GROUP BY and PARTITION BY expressions do not support window functions.1.1 - EVENT_NAME
Returns a VARCHAR value representing the name of the event that matched the row.
Syntax
EVENT_NAME()
Notes
Pattern matching functions must be used in MATCH clause syntax; for example, if you call EVENT_NAME() on its own, Vertica returns the following error message:
=> SELECT event_name();
ERROR: query with pattern matching function event_name must include a MATCH clause
Examples
Note
This example uses the schema defined in Event series pattern matching.The following statement analyzes users' browsing history on website2.com
and identifies patterns where the user landed on website2.com
from another Web site (Entry) and browsed to any number of other pages (Onsite) before making a purchase (Purchase). The query also outputs the values for EVENT_NAME(), which is the name of the event that matched the row.
SELECT uid,
sid,
ts,
refurl,
pageurl,
action,
event_name()
FROM clickstream_log
MATCH
(PARTITION BY uid, sid ORDER BY ts
DEFINE
Entry AS RefURL NOT ILIKE '%website2.com%' AND PageURL ILIKE '%website2.com%',
Onsite AS PageURL ILIKE '%website2.com%' AND Action='V',
Purchase AS PageURL ILIKE '%website2.com%' AND Action = 'P'
PATTERN
P AS (Entry Onsite* Purchase)
ROWS MATCH FIRST EVENT);
uid | sid | ts | refurl | pageurl | action | event_name
-----+-----+----------+----------------------+----------------------+--------+------------
1 | 100 | 12:00:00 | website1.com | website2.com/home | V | Entry
1 | 100 | 12:01:00 | website2.com/home | website2.com/floby | V | Onsite
1 | 100 | 12:02:00 | website2.com/floby | website2.com/shamwow | V | Onsite
1 | 100 | 12:03:00 | website2.com/shamwow | website2.com/buy | P | Purchase
2 | 100 | 12:10:00 | website1.com | website2.com/home | V | Entry
2 | 100 | 12:11:00 | website2.com/home | website2.com/forks | V | Onsite
2 | 100 | 12:13:00 | website2.com/forks | website2.com/buy | P | Purchase
(7 rows)
See also
1.2 - MATCH_ID
Returns a successful pattern match as an INTEGER value. The returned value is the ordinal position of a match within a partition.
Syntax
MATCH_ID()
Notes
Pattern matching functions must be used in MATCH clause syntax; for example, if you call MATCH_ID() on its own, Vertica returns the following error message:
=> SELECT match_id();
ERROR: query with pattern matching function match_id must include a MATCH clause
Examples
Note
This example uses the schema defined in Event series pattern matching.The following statement analyzes users' browsing history on a site called website2.com
and identifies patterns where the user reached website2.com
from another Web site (Entry
in the MATCH
clause) and browsed to any number of other pages (Onsite
) before making a purchase (Purchase). The query also outputs values for the MATCH_ID(), which represents a sequential number of the match.
SELECT uid,
sid,
ts,
refurl,
pageurl,
action,
match_id()
FROM clickstream_log
MATCH
(PARTITION BY uid, sid ORDER BY ts
DEFINE
Entry AS RefURL NOT ILIKE '%website2.com%' AND PageURL ILIKE '%website2.com%',
Onsite AS PageURL ILIKE '%website2.com%' AND Action='V',
Purchase AS PageURL ILIKE '%website2.com%' AND Action = 'P'
PATTERN
P AS (Entry Onsite* Purchase)
ROWS MATCH FIRST EVENT);
uid | sid | ts | refurl | pageurl | action | match_id
----+-----+----------+----------------------+----------------------+--------+------------
1 | 100 | 12:00:00 | website1.com | website2.com/home | V | 1
1 | 100 | 12:01:00 | website2.com/home | website2.com/floby | V | 2
1 | 100 | 12:02:00 | website2.com/floby | website2.com/shamwow | V | 3
1 | 100 | 12:03:00 | website2.com/shamwow | website2.com/buy | P | 4
2 | 100 | 12:10:00 | website1.com | website2.com/home | V | 1
2 | 100 | 12:11:00 | website2.com/home | website2.com/forks | V | 2
2 | 100 | 12:13:00 | website2.com/forks | website2.com/buy | P | 3
(7 rows)
See also
1.3 - PATTERN_ID
Returns an integer value that is a partition-wide unique identifier for the instance of the pattern that matched.
Syntax
PATTERN_ID()
Notes
Pattern matching functions must be used in MATCH clause syntax; for example, if call PATTERN_ID() on its own, Vertica returns the following error message:
=> SELECT pattern_id();
ERROR: query with pattern matching function pattern_id must include a MATCH clause
Examples
Note
This example uses the schema defined in Event series pattern matching.The following statement analyzes users' browsing history on website2.com and identifies patterns where the user landed on website2.com from another Web site (Entry) and browsed to any number of other pages (Onsite) before making a purchase (Purchase). The query also outputs values for PATTERN_ID(), which represents the partition-wide identifier for the instance of the pattern that matched.
SELECT uid,
sid,
ts,
refurl,
pageurl,
action,
pattern_id()
FROM clickstream_log
MATCH
(PARTITION BY uid, sid ORDER BY ts
DEFINE
Entry AS RefURL NOT ILIKE '%website2.com%' AND PageURL ILIKE '%website2.com%',
Onsite AS PageURL ILIKE '%website2.com%' AND Action='V',
Purchase AS PageURL ILIKE '%website2.com%' AND Action = 'P'
PATTERN
P AS (Entry Onsite* Purchase)
ROWS MATCH FIRST EVENT);
uid | sid | ts | refurl | pageurl | action | pattern_id
----+-----+----------+----------------------+----------------------+--------+------------
1 | 100 | 12:00:00 | website1.com | website2.com/home | V | 1
1 | 100 | 12:01:00 | website2.com/home | website2.com/floby | V | 1
1 | 100 | 12:02:00 | website2.com/floby | website2.com/shamwow | V | 1
1 | 100 | 12:03:00 | website2.com/shamwow | website2.com/buy | P | 1
2 | 100 | 12:10:00 | website1.com | website2.com/home | V | 1
2 | 100 | 12:11:00 | website2.com/home | website2.com/forks | V | 1
2 | 100 | 12:13:00 | website2.com/forks | website2.com/buy | P | 1
(7 rows)
See also
2 - Regular expression functions
A regular expression lets you perform pattern matching on strings of characters. The regular expression syntax allows you to precisely define the pattern used to match strings, giving you much greater control than wildcard matching used in the LIKE predicate. The Vertica regular expression functions let you perform tasks such as determining if a string value matches a pattern, extracting a portion of a string that matches a pattern, or counting the number of times a pattern occurs within a string.
Vertica uses the Perl Compatible Regular Expression (PCRE) library to evaluate regular expressions. As its name implies, PCRE's regular expression syntax is compatible with the syntax used by the Perl 5 programming language. You can read PCRE's documentation about its library. However, if you are unfamiliar with using regular expressions, the Perl Regular Expressions Documentation is a good introduction.
Note
The regular expression functions only operate on valid UTF-8 strings. If you try using a regular expression function on a string that is not valid UTF-8, the query fails with an error. To prevent an error from occurring, use the ISUTF8 function as an initial clause to ensure the strings you pass to the regular expression functions are valid UTF-8 strings. Alternatively, or you can use the 'b' argument to treat the strings as binary octets, rather than UTF-8 encoded strings.2.1 - MATCH_COLUMNS
Specified as an element in a SELECT list, returns all columns in queried tables that match the specified pattern. For example:
=> SELECT MATCH_COLUMNS ('%order%') FROM store.store_orders_fact LIMIT 3;
order_number | date_ordered | quantity_ordered | total_order_cost | reorder_level
--------------+--------------+------------------+------------------+---------------
191119 | 2003-03-09 | 15 | 4021 | 23
89985 | 2003-05-04 | 19 | 2692 | 23
246962 | 2007-06-01 | 77 | 4419 | 42
(3 rows)
Syntax
MATCH_COLUMNS ('pattern')
Arguments
pattern
- The pattern to match against all column names in the queried tables, where
pattern
typically contains one or both of the following wildcard characters:-
_
(underscore): Match any single character. -
%
(percent sign): Match any string of zero or more characters.
The pattern can also include backslash (
\
) characters to escape reserved characters that are embedded in column names:_
(underscore),%
(percent sign), and backlash (\
) itself. -
Privileges
None
DDL usage
You can use MATCH_COLUMNS to define database objects—for example, specify it in CREATE PROJECTION to identify projection columns, or in CREATE TABLE...AS to identify columns in the new table. In all cases, Vertica expands the MATCH_COLUMNS output before it stores the object DDL. Subsequent changes to the original source table have no effect on the derived object definitions.
Restrictions
In general, MATCH_COLUMNS is specified as an element in a SELECT list. For example, CREATE PROJECTION can call MATCH_COLUMNS to specify the columns to include in a projection. However, attempts to specify columns in the projection's segmentation clause return with an error:
=> CREATE PROJECTION p_store_orders AS SELECT
MATCH_COLUMNS('%product%'),
MATCH_COLUMNS('%store%'),
order_number FROM store.store_orders_fact SEGMENTED BY MATCH_COLUMNS('products%') ALL NODES;
ERROR 0: MATCH_COLUMNS() function can only be specified as an element in a SELECT list
=> CREATE PROJECTION p_store_orders AS SELECT
MATCH_COLUMNS('%product%'),
MATCH_COLUMNS('%store%'),
order_number FROM store.store_orders_fact;
WARNING 4468: Projection <store.p_store_orders_b0> is not available for query processing. Execute the select start_refresh() function to copy data into this projection.
The projection must have a sufficient number of buddy projections and all nodes must be up before starting a refresh
WARNING 4468: Projection <store.p_store_orders_b1> is not available for query processing. Execute the select start_refresh() function to copy data into this projection.
The projection must have a sufficient number of buddy projections and all nodes must be up before starting a refresh
CREATE PROJECTION
If you call MATCH_COLUMNS from a function that supports a fixed number of arguments, Vertica returns an error. For example, the UPPER function supports only one argument; so calling MATCH_COLUMNS from UPPER as follows returns an error:
=> SELECT MATCH_COLUMNS('emp%') FROM employee_dimension LIMIT 1;
-[ RECORD 1 ]-----------+---------------------------------
employee_key | 1
employee_gender | Male
employee_first_name | Craig
employee_middle_initial | F
employee_last_name | Robinson
employee_age | 22
employee_street_address | 5 Bakers St
employee_city | Thousand Oaks
employee_state | CA
employee_region | West
=> SELECT UPPER (MATCH_COLUMNS('emp%')) FROM employee_dimension;
ERROR 10465: MATCH_COLUMNS() function can only be specified as an element in a SELECT list
In contrast, the HASH function accepts an unlimited number of arguments, so calling MATCH_COLUMNS as an argument succeeds:
=> select HASH(MATCH_COLUMNS('emp%')) FROM employee_dimension LIMIT 10;
HASH
---------------------
2047284364908178817
1421997332260827278
7981613309330877388
792898558199431621
5275639269069980417
7892790768178152349
184601038712735208
3020263228621856381
7056305566297085916
3328422577712931057
(10 rows)
Other constraints
The following usages of MATCH_COLUMNS are invalid and return with an error:
-
Including MATCH_COLUMNS in the non-recursive (base) term query of a RECURSIVE WITH clause
-
Concatenating the results of MATCH_COLUMNS calls:
=> SELECT MATCH_COLUMNS ('%store%')||MATCH_COLUMNS('%store%') FROM store.store_orders_fact; ERROR 0: MATCH_COLUMNS() function can only be specified as an element in a SELECT list
-
Setting an alias on MATCH_COLUMNS
Examples
The following CREATE PROJECTION statement uses MATCH_COLUMNS to specify table columns in the new projection:
=> CREATE PROJECTION p_store_orders AS SELECT
MATCH_COLUMNS('%product%'),
MATCH_COLUMNS('%store%'),
order_number FROM store.store_orders_fact;
WARNING 4468: Projection <store.p_store_orders_b0> is not available for query processing. Execute the select start_refresh() function to copy data into this projection.
The projection must have a sufficient number of buddy projections and all nodes must be up before starting a refresh
WARNING 4468: Projection <store.p_store_orders_b1> is not available for query processing. Execute the select start_refresh() function to copy data into this projection.
The projection must have a sufficient number of buddy projections and all nodes must be up before starting a refresh
CREATE PROJECTION
=> SELECT export_objects('', 'store.p_store_orders_b0');
...
CREATE PROJECTION store.p_store_orders_b0 /*+basename(p_store_orders)*/
(
product_key,
product_version,
store_key,
order_number
)
AS
SELECT store_orders_fact.product_key,
store_orders_fact.product_version,
store_orders_fact.store_key,
store_orders_fact.order_number
FROM store.store_orders_fact
ORDER BY store_orders_fact.product_key,
store_orders_fact.product_version,
store_orders_fact.store_key,
store_orders_fact.order_number
SEGMENTED BY hash(store_orders_fact.product_key, store_orders_fact.product_version, store_orders_fact.store_key, store_orders_fact.order_number) ALL NODES OFFSET 0;
SELECT MARK_DESIGN_KSAFE(1);
(1 row)
As shown in the EXPORT_OBJECTS output, Vertica stores the result sets of the two MATCH_COLUMNS calls in the new projection's DDL. Later changes in the anchor table DDL have no effect on this projection.
2.2 - REGEXP_COUNT
Returns the number times a regular expression matches a string.
This function operates on UTF-8 strings using the default locale, even if the locale is set otherwise.
Important
If you port a regular expression query from an Oracle database, remember that Oracle considers a zero-length string to be equivalent to NULL, while Vertica does not.Syntax
REGEXP_COUNT ( string-expession, pattern [, position [, regexp-modifier ]... ] )
Parameters
string-expression
The
VARCHAR
orLONG VARCHAR
expression to evaluate for matches with the regular expression specified inpattern
. Ifstring-expression
is in the__raw__
column of a flex or columnar table, cast the string to aLONG VARCHAR
before searching forpattern
.pattern
The regular expression to match against
string-expression
. The regular expression must conform with Perl regular expression syntax.position
- The number of characters from the start of the string where the function should start searching for matches. By default, the function begins searching for a match at the first (leftmost) character. Setting this parameter to a value greater than 1 begins searching for a match at the *
n
*th character you specify.Default: 1
regexp-modifier
One or more single-character flags that modify how the regular expression
pattern
is matched tostring-expression
:-
b
: Treat strings as binary octets, rather than UTF-8 characters. -
c
(default): Force the match to be case sensitive. -
i
: Force the match to be case insensitive. -
m
: Treat the string to match as multiple lines. Using this modifier, the start of line (^
) and end of line ($)
regular expression operators match line breaks (\n
) within the string. Without them
modifier, the start and end of line operators match only the start and end of the string. -
n
: Match the regular expression operator (.
) to a newline (\n
). By default, the.
operator matches any character except a newline. -
x
: Add comments to regular expressions. Thex
modifier causes the function to ignore all un-escaped space characters and comments in the regular expression. Comments start with hash (#
) and end with a newline (\n
). All spaces in the regular expression to be matched in strings must be escaped with a backslash (\
).
-
Examples
Count the number of occurrences of the substring an
in the specified string (a man, a plan, a canal: Panama
):
=> SELECT REGEXP_COUNT('a man, a plan, a canal: Panama', 'an');
REGEXP_COUNT
--------------
4
(1 row)
Find the number of occurrences of the substring an
, starting with the fifth character.
=> SELECT REGEXP_COUNT('a man, a plan, a canal: Panama', 'an',5);
REGEXP_COUNT
--------------
3
(1 row)
Find the number of occurrences of a substring containing a lower-case character followed by an
:
=> SELECT REGEXP_COUNT('a man, a plan, a canal: Panama', '[a-z]an');
REGEXP_COUNT
--------------
3
(1 row
REGEXP_COUNT specifies the i
modifier, so it ignores case:
=> SELECT REGEXP_COUNT('a man, a plan, a canal: Panama', '[a-z]an', 1, 'i');
REGEXP_COUNT
--------------
4
2.3 - REGEXP_ILIKE
Returns true if the string contains a match for the regular expression. REGEXP_ILIKE is similar to the LIKE, except that it uses a case insensitive regular expression, rather than simple wildcard character matching.
This function operates on UTF-8 strings using the default locale, even if the locale is set otherwise.
Important
If you port a regular expression query from an Oracle database, remember that Oracle considers a zero-length string to be equivalent to NULL, while Vertica does not.Syntax
REGEXP_ILIKE ( string-expression, pattern )
Parameters
string-expression
``- The
VARCHAR
orLONG VARCHAR
expression to evaluate for matches with the regular expression specified inpattern
. Ifstring-expression
is in the__raw__
column of a flex or columnar table, cast the string to aLONG VARCHAR
before searching forpattern
. pattern
``- The regular expression to match against
string-expression
. The regular expression must conform with Perl regular expression syntax.
Examples
This example creates a table containing several strings to demonstrate regular expressions.
-
Create table
longvc
with a single, long varchar columnbody
, and insert data with distinct characters:=> CREATE table longvc(body long varchar (1048576)); CREATE TABLE => insert into longvc values ('На берегу пустынных волн'); => insert into longvc values ('Voin syödä lasia, se ei vahingoita minua'); => insert into longvc values ('私はガラスを食べられます。それは私を傷つけません。'); => insert into longvc values ('Je peux manger du verre, ça ne me fait pas mal.'); => insert into longvc values ('zésbaésbaa'); => insert into longvc values ('Out of the frying pan, he landed immediately in the fire'); => SELECT * FROM longvc; body ------------------------------------------------ На берегу пустынных волн Voin syödä lasia, se ei vahingoita minua 私はガラスを食べられます。それは私を傷つけません。 Je peux manger du verre, ça ne me fait pas mal. zésbaésbaa Out of the frying pan, he landed immediately in the fire (6 rows)
-
Pattern match table rows containing the character
ç
:=> SELECT * FROM longvc where regexp_ilike(body, 'ç'); body ------------------------------------------------- Je peux manger du verre, ça ne me fait pas mal. (1 row)
-
Select all rows that contain the characters
A
/a
:=> SELECT * FROM longvc where regexp_ilike(body, 'A'); body ------------------------------------------------- Je peux manger du verre, ça ne me fait pas mal. Voin syödä lasia, se ei vahingoita minua zésbaésbaa (3 rows)
-
Select all rows that contain the characters
O
/o
:=> SELECT * FROM longvc where regexp_ilike(body, 'O'); body ---------------------------------------------------------- Voin syödä lasia, se ei vahingoita minua Out of the frying pan, he landed immediately in the fire (2 rows)
2.4 - REGEXP_INSTR
Returns the starting or ending position in a string where a regular expression matches. REGEXP_INSTR returns 0 if no match for the regular expression is found in the string.
This function operates on UTF-8 strings using the default locale, even if the locale is set otherwise.
Important
If you port a regular expression query from an Oracle database, remember that Oracle considers a zero-length string to be equivalent to NULL, while Vertica does not.Syntax
REGEXP_INSTR ( string-expression, pattern
[, position [, occurrence [, return-position [, regexp-modifier ]... [, captured-subexp ]]]] )
Parameters
string-expression
The
VARCHAR
orLONG VARCHAR
expression to evaluate for matches with the regular expression specified inpattern
. Ifstring-expression
is in the__raw__
column of a flex or columnar table, cast the string to aLONG VARCHAR
before searching forpattern
.pattern
The regular expression to match against
string-expression
. The regular expression must conform with Perl regular expression syntax.position
- The number of characters from the start of the string where the function should start searching for matches. By default, the function begins searching for a match at the first (leftmost) character. Setting this parameter to a value greater than 1 begins searching for a match at the *
n
*th character you specify.Default: 1
occurrence
- Controls which occurrence of a pattern match in the string to return. By default, the function returns the position of the first matching substring. Use this parameter to find the position of subsequent matching substrings. For example, setting this parameter to 3 returns the position of the third substring that matches the pattern.
Default: 1
return-position
- Sets the position within the string to return. Using the default position (0), the function returns the string position of the first character of the substring that matches the pattern. If you set
return-position
to 1, the function returns the position of the first character after the end of the matching substring.Default: 0
regexp-modifier
One or more single-character flags that modify how the regular expression
pattern
is matched tostring-expression
:-
b
: Treat strings as binary octets, rather than UTF-8 characters. -
c
(default): Force the match to be case sensitive. -
i
: Force the match to be case insensitive. -
m
: Treat the string to match as multiple lines. Using this modifier, the start of line (^
) and end of line ($)
regular expression operators match line breaks (\n
) within the string. Without them
modifier, the start and end of line operators match only the start and end of the string. -
n
: Match the regular expression operator (.
) to a newline (\n
). By default, the.
operator matches any character except a newline. -
x
: Add comments to regular expressions. Thex
modifier causes the function to ignore all un-escaped space characters and comments in the regular expression. Comments start with hash (#
) and end with a newline (\n
). All spaces in the regular expression to be matched in strings must be escaped with a backslash (\
).
-
captured-subexp
- The captured subexpression whose position to return. By default, the function returns the position of the first character in
string
that matches the regular expression. If you set this value from 1 – 9, the function returns the subexpression captured by the corresponding set of parentheses in the regular expression. For example, setting this value to 3 returns the substring captured by the third set of parentheses in the regular expression.Default: 0
Note
The subexpressions are numbered left to right, based on the appearance of opening parenthesis, so nested regular expressions . For example, in the regular expression\s*(\w+\s+(\w+))
, subexpression 1 is the one that captures everything but any leading whitespaces.
Examples
Find the first occurrence of a sequence of letters starting with the letter e
and ending with the letter y
in the specified string (easy come, easy go
).
=> SELECT REGEXP_INSTR('easy come, easy go','e\w*y');
REGEXP_INSTR
--------------
1
(1 row)
Starting at the second character (2
), find the first sequence of letters starting with the letter e
and ending with the letter y
:
=> SELECT REGEXP_INSTR('easy come, easy go','e\w*y',2);
REGEXP_INSTR
--------------
12
(1 row)
Starting at the first character (1
), find the second sequence of letters starting with the letter e
and ending with the letter y
:
=> SELECT REGEXP_INSTR('easy come, easy go','e\w*y',1,2);
REGEXP_INSTR
--------------
12
(1 row)
Find the position of the first character after the first whitespace:
=> SELECT REGEXP_INSTR('easy come, easy go','\s',1,1,1);
REGEXP_INSTR
--------------
6
(1 row)
Find the position of the start of the third word in a string by capturing each word as a subexpression, and returning the third subexpression's start position.
=> SELECT REGEXP_INSTR('one two three','(\w+)\s+(\w+)\s+(\w+)', 1,1,0,'',3);
REGEXP_INSTR
--------------
9
(1 row)
2.5 - REGEXP_LIKE
Returns true if the string matches the regular expression. REGEXP_LIKE is similar to the LIKE, except that it uses regular expressions rather than simple wildcard character matching.
This function operates on UTF-8 strings using the default locale, even if the locale is set otherwise.
Important
If you port a regular expression query from an Oracle database, remember that Oracle considers a zero-length string to be equivalent to NULL, while Vertica does not.Syntax
REGEXP_LIKE ( string-expression, pattern [, regexp-modifier ]... )
Parameters
string-expression
The
VARCHAR
orLONG VARCHAR
expression to evaluate for matches with the regular expression specified inpattern
. Ifstring-expression
is in the__raw__
column of a flex or columnar table, cast the string to aLONG VARCHAR
before searching forpattern
.pattern
The regular expression to match against
string-expression
. The regular expression must conform with Perl regular expression syntax.regexp-modifier
One or more single-character flags that modify how the regular expression
pattern
is matched tostring-expression
:-
b
: Treat strings as binary octets, rather than UTF-8 characters. -
c
(default): Force the match to be case sensitive. -
i
: Force the match to be case insensitive. -
m
: Treat the string to match as multiple lines. Using this modifier, the start of line (^
) and end of line ($)
regular expression operators match line breaks (\n
) within the string. Without them
modifier, the start and end of line operators match only the start and end of the string. -
n
: Match the regular expression operator (.
) to a newline (\n
). By default, the.
operator matches any character except a newline. -
x
: Add comments to regular expressions. Thex
modifier causes the function to ignore all un-escaped space characters and comments in the regular expression. Comments start with hash (#
) and end with a newline (\n
). All spaces in the regular expression to be matched in strings must be escaped with a backslash (\
).
-
Examples
Create a table that contains several strings:
=> CREATE TABLE t (v VARCHAR);
CREATE TABLE
=> CREATE PROJECTION t1 AS SELECT * FROM t;
CREATE PROJECTION
=> COPY t FROM stdin;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> aaa
>> Aaa
>> abc
>> abc1
>> 123
>> \.
=> SELECT * FROM t;
v
-------
aaa
Aaa
abc
abc1
123
(5 rows)
Select all records from table t
that contain the letter a
:
=> SELECT v FROM t WHERE REGEXP_LIKE(v,'a');
v
------
Aaa
aaa
abc
abc1
(4 rows)
Select all rows from table t
that start with the letter a
:
=> SELECT v FROM t WHERE REGEXP_LIKE(v,'^a');
v
------
aaa
abc
abc1
(3 rows)
Select all rows that contain the substring aa
:
=> SELECT v FROM t WHERE REGEXP_LIKE(v,'aa');
v
-----
Aaa
aaa
(2 rows)
Select all rows that contain a digit.
=> SELECT v FROM t WHERE REGEXP_LIKE(v,'\d');
v
------
123
abc1
(2 rows)
Select all rows that contain the substring aaa
.
=> SELECT v FROM t WHERE REGEXP_LIKE(v,'aaa');
v
-----
aaa
(1 row)
Select all rows that contain the substring aaa
using case-insensitive matching.
=> SELECT v FROM t WHERE REGEXP_LIKE(v,'aaa', 'i');
v
-----
Aaa
aaa
(2 rows)
Select rows that contain the substring a b c
.
=> SELECT v FROM t WHERE REGEXP_LIKE(v,'a b c');
v
---
(0 rows)
Select rows that contain the substring a b c
, ignoring space within the regular expression.
=> SELECT v FROM t WHERE REGEXP_LIKE(v,'a b c','x');
v
------
abc
abc1
(2 rows)
Add multi-line rows to table t
:
=> COPY t FROM stdin RECORD TERMINATOR '!';
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> Record 1 line 1
>> Record 1 line 2
>> Record 1 line 3!
>> Record 2 line 1
>> Record 2 line 2
>> Record 2 line 3!
>> \.
Select rows from table t
that start with the substring Record
and end with the substring line 2
.
=> SELECT v from t WHERE REGEXP_LIKE(v,'^Record.*line 2$');
v
---
(0 rows)
Select rows that start with the substring Record
and end with the substring line 2
, treating multiple lines as separate strings.
=> SELECT v from t WHERE REGEXP_LIKE(v,'^Record.*line 2$','m');
v
--------------------------------------------------
Record 2 line 1
Record 2 line 2
Record 2 line 3
Record 1 line 1
Record 1 line 2
Record 1 line 3
(2 rows)
2.6 - REGEXP_NOT_ILIKE
Returns true if the string does not match the case-insensitive regular expression.
This function operates on UTF-8 strings using the default locale, even if the locale is set otherwise.
Important
If you port a regular expression query from an Oracle database, remember that Oracle considers a zero-length string to be equivalent to NULL, while Vertica does not.Syntax
REGEXP_NOT_ILIKE ( string-expression, pattern )
Parameters
string-expression
``- The
VARCHAR
orLONG VARCHAR
expression to evaluate for matches with the regular expression specified inpattern
. Ifstring-expression
is in the__raw__
column of a flex or columnar table, cast the string to aLONG VARCHAR
before searching forpattern
. pattern
``- The regular expression to match against
string-expression
. The regular expression must conform with Perl regular expression syntax.
Examples
-
Create a table (
longvc
) with a single, long varchar column (body
). Then, insert data with some distinct characters, and query the table contents:=> CREATE table longvc(body long varchar (1048576)); CREATE TABLE => insert into longvc values ('На берегу пустынных волн'); => insert into longvc values ('Voin syödä lasia, se ei vahingoita minua'); => insert into longvc values ('私はガラスを食べられます。それは私を傷つけません。'); => insert into longvc values ('Je peux manger du verre, ça ne me fait pas mal.'); => insert into longvc values ('zésbaésbaa'); => SELECT * FROM longvc; body ------------------------------------------------ На берегу пустынных волн Voin syödä lasia, se ei vahingoita minua 私はガラスを食べられます。それは私を傷つけません。 Je peux manger du verre, ça ne me fait pas mal. zésbaésbaa (5 rows)
-
Find all rows that do not contain the character
ç
:=> SELECT * FROM longvc where regexp_not_ilike(body, 'ç'); body ---------------------------------------------------- Voin syödä lasia, se ei vahingoita minua zésbaésbaa На берегу пустынных волн 私はガラスを食べられます。それは私を傷つけません。 (4 rows)
-
Find all rows that do not contain the substring
a
:=> SELECT * FROM longvc where regexp_not_ilike(body, 'a'); body ---------------------------------------------------- На берегу пустынных волн 私はガラスを食べられます。それは私を傷つけません。 (2 rows)
2.7 - REGEXP_NOT_LIKE
Returns true if the string does not contain a match for the regular expression. REGEXP_NOT_LIKE is a case sensitive regular expression.
This function operates on UTF-8 strings using the default locale, even if the locale is set otherwise.
Important
If you port a regular expression query from an Oracle database, remember that Oracle considers a zero-length string to be equivalent to NULL, while Vertica does not.Syntax
REGEXP_NOT_LIKE ( string-expression, pattern )
Parameters
string-expression
``- The
VARCHAR
orLONG VARCHAR
expression to evaluate for matches with the regular expression specified inpattern
. Ifstring-expression
is in the__raw__
column of a flex or columnar table, cast the string to aLONG VARCHAR
before searching forpattern
. pattern
``- The regular expression to match against
string-expression
. The regular expression must conform with Perl regular expression syntax.
Examples
-
Create a table (
longvc
) with the LONG VARCHAR columnbody
. Then, insert data with some distinct characters and query the table contents:=> CREATE table longvc(body long varchar (1048576)); CREATE TABLE => insert into longvc values ('На берегу пустынных волн'); => insert into longvc values ('Voin syödä lasia, se ei vahingoita minua'); => insert into longvc values ('私はガラスを食べられます。それは私を傷つけません。'); => insert into longvc values ('Je peux manger du verre, ça ne me fait pas mal.'); => insert into longvc values ('zésbaésbaa'); => SELECT * FROM longvc; body ------------------------------------------------ На берегу пустынных волн Voin syödä lasia, se ei vahingoita minua 私はガラスを食べられます。それは私を傷つけません。 Je peux manger du verre, ça ne me fait pas mal. zésbaésbaa (5 rows)
-
Use
REGEXP_NOT_LIKE
to return rows that do not contain the characterç
:=> SELECT * FROM longvc where regexp_not_like(body, 'ç'); body ---------------------------------------------------- Voin syödä lasia, se ei vahingoita minua zésbaésbaa На берегу пустынных волн 私はガラスを食べられます。それは私を傷つけません。 (4 rows)
-
Return all rows that do not contain the characters
*ö
and*ä
:=> SELECT * FROM longvc where regexp_not_like(body, '.*ö.*ä'); body ---------------------------------------------------- Je peux manger du verre, ça ne me fait pas mal. zésbaésbaa На берегу пустынных волн 私はガラスを食べられます。それは私を傷つけません。 (4 rows)
-
Pattern match all rows that do not contain the characters
z
and*ésbaa
:=> SELECT * FROM longvc where regexp_not_like(body, 'z.*ésbaa'); body ---------------------------------------------------- Je peux manger du verre, ça ne me fait pas mal. Voin syödä lasia, se ei vahingoita minua zésbaésbaa На берегу пустынных волн 私はガラスを食べられます。それは私を傷つけません。 (5 rows)
2.8 - REGEXP_REPLACE
Replaces all occurrences of a substring that match a regular expression with another substring. REGEXP_REPLACE is similar to the REPLACE function, except it uses a regular expression to select the substring to be replaced.
This function operates on UTF-8 strings using the default locale, even if the locale is set otherwise.
Important
If you port a regular expression query from an Oracle database, remember that Oracle considers a zero-length string to be equivalent to NULL, while Vertica does not.Syntax
REGEXP_REPLACE ( string-expression, target
[, replacement [, position [, occurrence[...] [, regexp-modifier]]]] )
Parameters
string-expression
The
VARCHAR
orLONG VARCHAR
expression to evaluate for matches with the regular expression specified inpattern
. Ifstring-expression
is in the__raw__
column of a flex or columnar table, cast the string to aLONG VARCHAR
before searching forpattern
.pattern
The regular expression to match against
string-expression
. The regular expression must conform with Perl regular expression syntax.replacement
- The string to replace matched substrings. If you do not supply a
replacement
, the function deletes matched substrings. The replacement string can contain backreferences for substrings captured by the regular expression. The first captured substring is inserted into the replacement string using\1
, the second\2
, and so on. position
- The number of characters from the start of the string where the function should start searching for matches. By default, the function begins searching for a match at the first (leftmost) character. Setting this parameter to a value greater than 1 begins searching for a match at the
n
-th character you specify.Default: 1
occurrence
- Controls which occurrence of a pattern match in the string to replace. By default, the function replaces all matching substrings. For example, setting this parameter to 3 replaces the third matching instance.
Default: 1
regexp-modifier
One or more single-character flags that modify how the regular expression
pattern
is matched tostring-expression
:-
b
: Treat strings as binary octets, rather than UTF-8 characters. -
c
(default): Force the match to be case sensitive. -
i
: Force the match to be case insensitive. -
m
: Treat the string to match as multiple lines. Using this modifier, the start of line (^
) and end of line ($)
regular expression operators match line breaks (\n
) within the string. Without them
modifier, the start and end of line operators match only the start and end of the string. -
n
: Match the regular expression operator (.
) to a newline (\n
). By default, the.
operator matches any character except a newline. -
x
: Add comments to regular expressions. Thex
modifier causes the function to ignore all un-escaped space characters and comments in the regular expression. Comments start with hash (#
) and end with a newline (\n
). All spaces in the regular expression to be matched in strings must be escaped with a backslash (\
).
-
How Oracle handles subexpressions
Unlike Oracle, Vertica can handle an unlimited number of captured subexpressions, while Oracle is limited to nine.
In Vertica, you can use \10
in the replacement pattern to access the substring captured by the tenth set of parentheses in the regular expression. In Oracle, \10
is treated as the substring captured by the first set of parentheses, followed by a zero. To force this Oracle behavior in Vertica, use the \g
back reference and enclose the number of the captured subexpression in curly braces. For example, \g{1}0
is the substring captured by the first set of parentheses followed by a zero.
You can also name captured subexpressions to make your regular expressions less ambiguous. See the PCRE documentation for details.
Examples
Find groups of word characters—letters, numbers and underscore—that end with thy
in the string healthy, wealthy, and wise
, and replace them with nothing.
=> SELECT REGEXP_REPLACE('healthy, wealthy, and wise','\w+thy');
REGEXP_REPLACE
----------------
, , and wise
(1 row)
Find groups of word characters ending with thy
and replace with the string something
.
=> SELECT REGEXP_REPLACE('healthy, wealthy, and wise','\w+thy', 'something');
REGEXP_REPLACE
--------------------------------
something, something, and wise
(1 row)
Find groups of word characters ending with thy
and replace with the string something
starting at the third character in the string.
=> SELECT REGEXP_REPLACE('healthy, wealthy, and wise','\w+thy', 'something', 3);
REGEXP_REPLACE
----------------------------------
hesomething, something, and wise
(1 row)
Replace the second group of word characters ending with thy
with the string something
.
=> SELECT REGEXP_REPLACE('healthy, wealthy, and wise','\w+thy', 'something', 1, 2);
REGEXP_REPLACE
------------------------------
healthy, something, and wise
(1 row)
Find groups of word characters ending with thy
capturing the letters before the thy
, and replace with the captured letters plus the letters ish
.
=> SELECT REGEXP_REPLACE('healthy, wealthy, and wise','(\w+)thy', '\1ish');
REGEXP_REPLACE
----------------------------
healish, wealish, and wise
(1 row)
Create a table to demonstrate replacing strings in a query.
=> CREATE TABLE customers (name varchar(50), phone varchar(11));
CREATE TABLE
=> CREATE PROJECTION customers1 AS SELECT * FROM customers;
CREATE PROJECTION
=> COPY customers FROM stdin;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> Able, Adam|17815551234
>> Baker,Bob|18005551111
>> Chu,Cindy|16175559876
>> Dodd,Dinara|15083452121
>> \.
Query the customers, using REGEXP_REPLACE to format phone numbers.
=> SELECT name, REGEXP_REPLACE(phone, '(\d)(\d{3})(\d{3})(\d{4})',
'\1-(\2) \3-\4') as phone FROM customers;
name | phone
-------------+------------------
Able, Adam | 1-(781) 555-1234
Baker,Bob | 1-(800) 555-1111
Chu,Cindy | 1-(617) 555-9876
Dodd,Dinara | 1-(508) 345-2121
(4 rows)
2.9 - REGEXP_SUBSTR
Returns the substring that matches a regular expression within a string. If no matches are found, REGEXP_SUBSTR returns NULL. This is different from an empty string, which the function can return if the regular expression matches a zero-length string.
This function operates on UTF-8 strings using the default locale, even if the locale is set otherwise.
Important
If you port a regular expression query from an Oracle database, remember that Oracle considers a zero-length string to be equivalent to NULL, while Vertica does not.Syntax
REGEXP_SUBSTR ( string-expression, pattern
[, position [, occurrence [, regexp-modifier [, captured-subexp ]]... ]] )
Parameters
string-expression
The
VARCHAR
orLONG VARCHAR
expression to evaluate for matches with the regular expression specified inpattern
. Ifstring-expression
is in the__raw__
column of a flex or columnar table, cast the string to aLONG VARCHAR
before searching forpattern
.pattern
The regular expression to match against
string-expression
. The regular expression must conform with Perl regular expression syntax.position
- The number of characters from the start of the string where the function should start searching for matches. By default, the function begins searching for a match at the first (leftmost) character. Setting this parameter to a value greater than 1 begins searching for a match at the
n
-th character you specify.Default: 1
occurrence
- Controls which occurrence of a pattern match in the string to return. By default, the function returns the first matching substring. For example, setting this parameter to 3 returns the third matching instance.
Default: 1
regexp-modifier
One or more single-character flags that modify how the regular expression
pattern
is matched tostring-expression
:-
b
: Treat strings as binary octets, rather than UTF-8 characters. -
c
(default): Force the match to be case sensitive. -
i
: Force the match to be case insensitive. -
m
: Treat the string to match as multiple lines. Using this modifier, the start of line (^
) and end of line ($)
regular expression operators match line breaks (\n
) within the string. Without them
modifier, the start and end of line operators match only the start and end of the string. -
n
: Match the regular expression operator (.
) to a newline (\n
). By default, the.
operator matches any character except a newline. -
x
: Add comments to regular expressions. Thex
modifier causes the function to ignore all un-escaped space characters and comments in the regular expression. Comments start with hash (#
) and end with a newline (\n
). All spaces in the regular expression to be matched in strings must be escaped with a backslash (\
).
-
captured-subexp
- The group to return. By default, the function returns all matching groups. For example, setting this value to 3 returns the substring captured by the third set of parentheses in the regular expression.
Default: 0
Note
The subexpressions are numbered left to right, based on the appearance of opening parenthesis, so nested regular expressions . For example, in the regular expression\s*(\w+\s+(\w+))
, subexpression 1 is the one that captures everything but any leading whitespaces.
Examples
Select the first substring of letters that end with thy
.
=> SELECT REGEXP_SUBSTR('healthy, wealthy, and wise','\w+thy');
REGEXP_SUBSTR
---------------
healthy
(1 row)
Select the first substring of letters that ends with thy
starting at the second character in the string.
=> SELECT REGEXP_SUBSTR('healthy, wealthy, and wise','\w+thy',2);
REGEXP_SUBSTR
---------------
ealthy
(1 row)
Select the second substring of letters that ends with thy
.
=> SELECT REGEXP_SUBSTR('healthy, wealthy, and wise','\w+thy',1,2);
REGEXP_SUBSTR
---------------
wealthy
(1 row)
Return the contents of the third captured subexpression, which captures the third word in the string.
=> SELECT REGEXP_SUBSTR('one two three', '(\w+)\s+(\w+)\s+(\w+)', 1, 1, '', 3);
REGEXP_SUBSTR
---------------
three
(1 row)
3 - Text search functions
This section contains text search functions specific to Vertica.
3.1 - DELETE_TOKENIZER_CONFIG_FILE
Deletes a tokenizer configuration file.
Syntax
SELECT v_txtindex.DELETE_TOKENIZER_CONFIG_FILE (USING PARAMETERS proc_oid='proc_oid', confirm={true | false });
Parameters
confirm = [true | false]
- Boolean flag. Indicates that the configuration file should be removed even if the tokenizer is still in use.
True
— Force deletion of the tokenizer when the used parameter value is True.False
— Delete tokenizer if the used parameter value is False.Default:
False
proc_oid
- A unique identifier assigned to a tokenizer when it is created. Users must query the system table vs_procedures to get the proc_oid for a given tokenizer name. See Configuring a tokenizer for more information.
Examples
The following example shows how you can use DELETE_TOKENIZER_CONFIG_FILE to delete the tokenizer configuration file:
=> SELECT v_txtindex.DELETE_TOKENIZER_CONFIG_FILE (USING PARAMETERS proc_oid='45035996274126984');
DELETE_TOKENIZER_CONFIG_FILE
------------------------------
t
(1 row)
3.2 - GET_TOKENIZER_PARAMETER
Returns the configuration parameter for a given tokenizer.
Syntax
SELECT v_txtindex.GET_TOKENIZER_PARAMETER(parameter_name USING PARAMETERS proc_oid='proc_oid');
Parameters
parameter_name
- Name of the parameter to be returned.
One of the following:
-
stopWordsCaseInsensitive
-
minorSeparators
-
majorSeparators
-
minLength
-
maxLength
-
ngramsSize
-
used
-
proc_oid
- A unique identifier assigned to a tokenizer when it is created. Users must query the system table vs_procedures to get the proc_oid for a given tokenizer name. See Configuring a tokenizer for more information.
Examples
The following examples show how you can use GET_TOKENIZER_PARAMETER.
Return the stop words used in a tokenizer:
=> SELECT v_txtindex.GET_TOKENIZER_PARAMETER('stopwordscaseinsensitive' USING PARAMETERS proc_oid='45035996274126984');
getTokenizerParameter
-----------------------
devil,TODAY,the,fox
(1 row)
Return the major separators used in a tokenizer:
=> SELECT v_txtindex.GET_TOKENIZER_PARAMETER('majorseparators' USING PARAMETERS proc_oid='45035996274126984');
getTokenizerParameter
-----------------------
{}()&[]
(1 row)
3.3 - READ_CONFIG_FILE
Reads and returns the key-value pairs of all the parameters of a given tokenizer.
You must use the OVER() clause with this function.
Syntax
SELECT v_txtindex.READ_CONFIG_FILE(USING PARAMETERS proc_oid='proc_oid') OVER ()
Parameters
proc_oid
- A unique identifier assigned to a tokenizer when it is created. Users must query the system table vs_procedures to get the proc_oid for a given tokenizer name. See Configuring a tokenizer for more information.
Examples
The following example shows how you can use READ_CONFIG_FILE to return the parameters associated with a tokenizer:
=> SELECT v_txtindex.READ_CONFIG_FILE(USING PARAMETERS proc_oid='45035996274126984') OVER();
config_key | config_value
--------------------------+---------------------
majorseparators | {}()&[]
stopwordscaseinsensitive | devil,TODAY,the,fox
(2 rows)
3.4 - SET_TOKENIZER_PARAMETER
Configures the tokenizer parameters.
Important
\n, \t,\r
must be entered as Unicode using Vertica notation, U&’\000D’
, or using Vertica escaping notation, E’\r’
. Otherwise, they are taken literally as two separate characters. For example, "\" & "r"
.
Syntax
SELECT v_txtindex.SET_TOKENIZER_PARAMETER (parameter_name, parameter_value USING PARAMETERS proc_oid='proc_oid')
Parameters
parameter_name
- Name of the parameter to be configured.
Use one of the following:
-
stopwordsCaseInsensitive
: List of stop words. All the tokens that belong to the list are ignored. Vertica supports separators and stop words up to the first 256 Unicode characters.If you want to define a stop word that contains a comma or a backslash, then it needs to be escaped.
For example:"Dear Jack\," "Dear Jack\\"
Default:
''
(empty list) -
majorSeparators
:List of major separators. Enclose in quotes with no spaces between.Default:
E' []<>(){}|!;,''"*&?+\r\n\t'
-
minorSeparators
: List of minor separators. Enclose in quotes with no spaces between.Default:
E'/:=@.-$#%\\_'
-
minLength
— Minimum length a token can have, type Integer. Must be greater than 0.Default:
'2'
-
maxLength
: Maximum length a token can be. Type Integer. Cannot be greater than 1024 bytes. For information about increasing the token size, see Text search parameters.Default:
'128'
-
ngramsSize
: Integer value greater than zero. Use only with ngram tokenizers.Default:
'3'
-
used
: Indicates when a tokenizer configuration cannot be changed. Type Boolean. After you set used toTrue
, any calls to setTokenizerParameter fail.You must set the parameter
used
toTrue
before using the configured tokenizer. Doing so prevents the configuration from being modified after being used to create a text index.Default:
False
-
parameter_value
- The value of a configuration parameter.
If you want to disable minorSeperators or stopWordsCaseInsensitive, then set their values to
''
. proc_oid
- A unique identifier assigned to a tokenizer when it is created. Users must query the system table vs_procedures to get the proc_oid for a given tokenizer name. See Configuring a tokenizer for more information.
Examples
The following examples show how you can use SET_TOKENIZER_PARAMETER to configure stop words and separators.
Configure the stop words of a tokenizer:
=> SELECT v_txtindex.SET_TOKENIZER_PARAMETER('stopwordsCaseInsensitive', 'devil,TODAY,the,fox' USING PARAMETERS proc_oid='45035996274126984');
SET_TOKENIZER_PARAMETER
-------------------------
t
(1 row)
Configure the major separators of a tokenizer:
=> SELECT v_txtindex.SET_TOKENIZER_PARAMETER('majorSeparators',E'{}()&[]' USING PARAMETERS proc_oid='45035996274126984');
SET_TOKENIZER_PARAMETER
-------------------------
t
(1 row)