Arrays and sets (collections)

Tables can include collections (arrays or sets).

Tables can include collections (arrays or sets). An array is an ordered collection of elements that allows duplicate values, and a set is an unordered collection of unique values.

Consider an orders table with columns for product keys, customer keys, order prices, and order date, with some containing arrays. A basic query in Vertica results in the following:

=> SELECT * from orders LIMIT 5;
 orderkey | custkey |        prodkey         |         orderprices         | orderdate
----------+---------+------------------------+-----------------------------+------------
    19626 |      91 | ["P1262","P68","P101"] | ["192.59","49.99","137.49"] | 2021-03-14
    25646 |     716 | ["P997","P31","P101"]  | ["91.39","29.99","147.49"]  | 2021-03-14
    25647 |     716 | ["P12"]                | ["8.99"]                    | 2021-03-14
    19743 |     161 | ["P68","P101"]         | ["49.99","137.49"]          | 2021-03-15
    19888 |     241 | ["P1262","P101"]       | ["197.59","142.49"]         | 2021-03-15
(5 rows)

As shown in this example, array values are returned in JSON format.

Set values are also returned in JSON array format:

=> SELECT custkey,email_addrs FROM customers LIMIT 4;
 custkey |                           email_addrs
---------+------------------------------------------------------------------------
 342176  | ["joe.smith@example.com"]
 342799  | ["bob@example,com","robert.jones@example.com"]
 342845  | ["br92@cs.example.edu"]
 342321  | ["789123@example-isp.com","sjohnson@eng.example.com","sara@johnson.example.name"]

You can access elements of nested arrays (multi-dimensional arrays) with multiple indexes, as shown in the following examples:

=> SELECT host, pingtimes FROM network_tests;
 host |                 pingtimes
------+-------------------------------------------------------
 eng1 | [[24.24,25.27,27.16,24.97], [23.97,25.01,28.12,29.50]]
 eng2 | [[27.12,27.91,28.11,26.95], [29.01,28.99,30.11,31.56]]
 qa1  | [[23.15,25.11,24.63,23.91], [22.85,22.86,23.91,31.52]]
(3 rows)

=> SELECT pingtimes[0] FROM network_tests;
      pingtimes
-------------------------
[24.24,25.27,27.16,24.97]
[27.12,27.91,28.11,26.95]
[23.15,25.11,24.63,23.91]
(3 rows)

=> SELECT pingtimes[0][0] FROM network_tests;
 pingtimes
-----------
24.24
27.12
23.15
(3 rows)

Vertica supports several functions to manipulate arrays and sets.

Consider the same orders table which includes an array of product keys for all items purchased in a single order. You can use the APPLY_COUNT_ELEMENTS function to find out how many items each order contains. The function identifies the number of non-null elements in the prodkey array:

=> SELECT apply_count_elements(prodkey) FROM orders LIMIT 5;
apply_count_elements
--------------------
3
2
2
3
1
(5 rows)

Vertica also supports aggregate functions for collection elements. Now, consider a column in the same table which includes an array of prices for each item purchased in a single order. You can use the APPLY_SUM function to find the total amount spent for each order:

=> SELECT apply_sum(orderprices) from orders LIMIT 5;
apply_sum
-----------
380.07
187.48
340.08
268.87
  8.99
(5 rows)

Most of the array functions operate only on one-dimensional arrays. To use them with multi-dimensional arrays, first dereference one dimension:

=> SELECT apply_max(pingtimes[0]) FROM network_tests;
 apply_max
-----------
 27.16
 28.11
 25.11
(3 rows)

See Collection functions for a comprehensive list of functions.

You can include both column names and literal values in queries. The following example returns the product keys for orders where the number of items in each order is greater than three:

=> SELECT prodkey FROM orders WHERE apply_count_elements(prodkey)>2;
      prodkey
------------------------
 ["P1262","P68","P101"]
 ["P997","P31","P101"]
(2 rows)

Consider a more complex query that returns the customer key, name, email, order key, and product key by joining two tables, cust and orders, for orders that satisfy the condition where the total is greater than 150:

=> SELECT custkey, cust_custname, cust_email, orderkey, prodkey, orderprices from orders
 JOIN cust ON custkey = cust_custkey
 WHERE apply_sum(orderprices)>150 ;
custkey|  cust_custname   |        cust_email         |   orderkey   |                  prodkey                  |        orderprices
-------+------------------+---------------------------+--------------+--------------------------------========---+---------------------------
342799 | "Ananya Patel"   | "ananyapatel98@gmail.com" | "113-341987" | ["MG-7190","VA-4028","EH-1247","MS-7018"] | [60.00,67.00,22.00,14.99]
342845 | "Molly Benton"   | "molly_benton@gmail.com"  | "111-952000" | ["ID-2586","IC-9010","MH-2401","JC-1905"] | [22.00,35.00,90.00,12.00]
342989 | "Natasha Abbasi" | "natsabbasi@live.com"     | "111-685238" | ["HP-4024"]                               | [650.00]
342176 | "Jose Martinez"  | "jmartinez@hotmail.com"   | "113-672238" | ["HP-4768","IC-9010"]                     | [899.00,60.00]
342845 | "Molly Benton"   | "molly_benton@gmail.com"  | "113-864153" | ["AE-7064","VA-4028","GW-1808"]           | [72.00,99.00,185.00]
(5 rows)

Arrays with complex elements

Arrays can contain arrays and structs in any combination, as in the following example:

=> CREATE TABLE orders(
  orderid INT,
  accountid INT,
  shipments ARRAY[
    ROW(
      shipid INT,
      address ROW(
        street VARCHAR,
        city VARCHAR,
        zip INT
        ),
      shipdate DATE
    )
  ]
 );

Some orders consist of more than one shipment. Line breaks have been inserted into the following output for legibility:

=> SELECT * FROM orders;
 orderid | accountid |                          shipments
---------+-----------+---------------------------------------------------------------------------------------------------------------
   99123 |        17 | [{"shipid":1,"address":{"street":"911 San Marcos St","city":"Austin","zip":73344},"shipdate":"2020-11-05"},
            {"shipid":2,"address":{"street":"100 main St Apt 4B","city":"Pasadena","zip":91001},"shipdate":"2020-11-06"}]
   99149 |       139 | [{"shipid":3,"address":{"street":"100 main St Apt 4B","city":"Pasadena","zip":91001},"shipdate":"2020-11-06"}]
   99162 |       139 | [{"shipid":4,"address":{"street":"100 main St Apt 4B","city":"Pasadena","zip":91001},"shipdate":"2020-11-04"},
            {"shipid":5,"address":{"street":"100 Main St Apt 4A","city":"Pasadena","zip":91001},"shipdate":"2020-11-11"}]
(3 rows)

You can use array indexing and struct field selection together in queries:

=> SELECT orderid, shipments[0].shipdate AS ship1, shipments[1].shipdate AS ship2 FROM orders;
 orderid |   ship1    |   ship2
---------+------------+------------
   99123 | 2020-11-05 | 2020-11-06
   99149 | 2020-11-06 |
   99162 | 2020-11-04 | 2020-11-11
(3 rows)

This example selects specific array indices. To access all entries, use EXPLODE.

Some data formats have a map type, which is a set of key/value pairs. Vertica does not directly support querying maps, but you can define a map column as an array of structs and query that. In the following example, the "prods" column in the data is a map:

=> CREATE EXTERNAL TABLE orders
 (orderkey INT,
  custkey INT,
  prods ARRAY[ROW(key VARCHAR(10), value DECIMAL(12,2))],
  orderdate DATE
 ) AS COPY FROM '...' PARQUET;

=> SELECT orderkey, prods FROM orders;
 orderkey |                                              prods
----------+--------------------------------------------------------------------------------------------------
    19626 | [{"key":"P68","value":"49.99"},{"key":"P1262","value":"192.59"},{"key":"P101","value":"137.49"}]
    25646 | [{"key":"P997","value":"91.39"},{"key":"P101","value":"147.49"},{"key":"P31","value":"29.99"}]
    25647 | [{"key":"P12","value":"8.99"}]
    19743 | [{"key":"P68","value":"49.99"},{"key":"P101","value":"137.49"}]
    19888 | [{"key":"P1262","value":"197.59"},{"key":"P101","value":"142.49"}]
(5 rows)

You cannot use mixed columns in CREATE TABLE AS SELECT (CTAS) or in views. This restriction applies for the entire column or for field selection within it.

Ordering and grouping

You can use Comparison operators with collections. Null collections are ordered last. Otherwise, collections are compared element by element until there is a mismatch, and then they are ordered based on the non-matching elements. If all elements are equal up to the length of the shorter one, then the shorter one is ordered first.

You can use collections in the ORDER BY and GROUP BY clauses of queries. The following example shows ordering query results by an array column:

=> CREATE TABLE employees (id INT, department VARCHAR(50), grants ARRAY[VARCHAR], grant_values ARRAY[INT]);

=> COPY employees FROM STDIN;
42|Physics|[US-7376,DARPA-1567]|[65000,135000]
36|Physics|[US-7376,DARPA-1567]|[10000,25000]
33|Physics|[US-7376]|[30000]
36|Astronomy|[US-7376,DARPA-1567]|[5000,4000]
\.

=> SELECT * FROM employees ORDER BY grant_values;
 id | department |          grants          |  grant_values
----+------------+--------------------------+----------------
 36 | Astronomy  | ["US-7376","DARPA-1567"] | [5000,4000]
 36 | Physics    | ["US-7376","DARPA-1567"] | [10000,25000]
 33 | Physics    | ["US-7376"]              | [30000]
 42 | Physics    | ["US-7376","DARPA-1567"] | [65000,135000]
(4 rows)

The following example queries the same table using GROUP BY:

=> CREATE TABLE employees (id INT, department VARCHAR(50), grants ARRAY[VARCHAR], grant_values ARRAY[INT]);

=> COPY employees FROM STDIN;
42|Physics|[US-7376,DARPA-1567]|[65000,135000]
36|Physics|[US-7376,DARPA-1567]|[10000,25000]
33|Physics|[US-7376]|[30000]
36|Astronomy|[US-7376,DARPA-1567]|[5000,4000]
\.

=> SELECT department, grants, SUM(apply_sum(grant_values)) FROM employees GROUP BY grants, department;
 department |          grants          |  SUM
------------+--------------------------+--------
 Physics    | ["US-7376","DARPA-1567"] | 235000
 Astronomy  | ["US-7376","DARPA-1567"] |   9000
 Physics    | ["US-7376"]              |  30000
(3 rows)

See the "Functions and Operators" section on the ARRAY reference page for information on how Vertica orders collections. (The same information is also on the SET reference page.)

Null-handling

Null semantics for collections are consistent with normal columns in most regards. See NULL sort order for more information on null-handling.

The null-safe equality operator (<=>) behaves differently from equality (=) when the collection is null rather than empty. Comparing a collection to NULL strictly is undefined.

=> SELECT ARRAY[1,3] = NULL;
?column?
----------

(1 row)

=> SELECT ARRAY[1,3] <=> NULL;
 ?column?
----------
 f
(1 row)

In the following example, the grants column in the table is null for employee 99.

=> SELECT grants = NULL FROM employees WHERE id=99;
 ?column?
----------

(1 row)

=> SELECT grants <=> NULL FROM employees WHERE id=99;
 ?column?
----------
 t
(1 row)

Empty collections are not null and behave as expected.

=> SELECT ARRAY[]::ARRAY[INT] = ARRAY[]::ARRAY[INT];
 ?column?
----------
 t
(1 row)

Collections are compared element by element. If a comparison depends on a null element, the result is unknown (null), not false. For example, ARRAY[1,2,null]=ARRAY[1,2,null] and ARRAY[1,2,null]=ARRAY[1,2,3] both return null, but ARRAY[1,2,null]=ARRAY[1,4,null] returns false because the second elements do not match.

Out-of-bound indexes into collections return NULL.

=> SELECT prodkey[2] from orders LIMIT 4;
prodkey
---------

"EH-1247"
"MH-2401"

(4 rows)

The results of the query return NULL for two out of four rows, the first and the fourth, since the specified index is greater than the size of those arrays.

Casting

When the data type of an expression value is unambiguous, it is implicitly coerced to match the expected data type. However, there can be ambiguity about the data type of an expression. For example, a date can be interpreted as either a string or a timestamp. Write an explicit cast to avoid the default:

=> SELECT apply_count_elements(ARRAY['2019-01-20','2019-02-12','2019-03-23']::ARRAY[TIMESTAMP]);
apply_count_elements
--------------------
 3
(1 row)

You can cast arrays or sets of one scalar type to arrays or sets of other (compatible) types, following the same rules as for casting scalar values. Casting a collection casts each element of that collection. Casting an array to a set also removes any duplicates.

You can cast arrays (but not sets) with elements that are arrays or structs (or combinations):

=> SELECT shipments::ARRAY[ROW(id INT,addr ROW(VARCHAR,VARCHAR,INT),shipped DATE)]
FROM orders;
                  shipments
---------------------------------------------------------------------------
 [{"id":1,"addr":{"street":"911 San Marcos St","city":"Austin","zip":73344},"shipped":"2020-11-05"},
  {"id":2,"addr":{"street":"100 main St Apt 4B","city":"Pasadena","zip":91001},"shipped":"2020-11-06"}]
 [{"id":3,"addr":{"street":"100 main St Apt 4B","city":"Pasadena","zip":91001},"shipped":"2020-11-06"}]
 [{"id":4,"addr":{"street":"100 main St Apt 4B","city":"Pasadena","zip":91001},"shipped":"2020-11-04"},
  {"id":5,"addr":{"street":"100 Main St Apt 4A","city":"Pasadena","zip":91001},"shipped":"2020-11-11"}]
(3 rows)

You can change the bound of an array or set by casting. When casting to a bounded native array, inputs that are too long are truncated. When casting to a multi-dimensional array, if the new bounds are too small for the data the cast fails:

=> SELECT ARRAY[1,2,3]::ARRAY[VARCHAR,2];
   array
-----------
 ["1","2"]
(1 row)

=> SELECT ARRAY[ARRAY[1,2,3],ARRAY[4,5,6]]::ARRAY[ARRAY[VARCHAR,2],2];
ERROR 9227:  Output array isn't big enough
DETAIL:  Type limit is 4 elements, but value has 6 elements

If you cast to a bounded multi-dimensional array, you must specify the bounds at all levels.

An array or set with a single null element must be explicitly cast because no type can be inferred.

See Data type coercion for more information on casting for data types.

Exploding array columns

You can simplify queries on elements stored in arrays with EXPLODE, a function that takes array columns from a table and expands them. For each exploded array, the results include two columns, one for the array element index, and one for the value at that position. If the function explodes a single array, these columns are named position and value by default. If the function explodes two or more arrays, the columns for each array are named pos_column-name and val_column-name.

The function explodes the first N array columns in the parameter list, defaulting to one, and passes all other columns through.

The following examples illustrate using EXPLODE() with the OVER(PARTITION BEST) clause.

Consider an orders table with columns for order keys, customer keys, product keys, order prices, and email addresses, with some containing arrays. A basic query in Vertica results in the following:

=> SELECT orderkey, custkey, prodkey, orderprices, email_addrs FROM orders LIMIT 5;
  orderkey  | custkey |                    prodkey                    |            orderprices            |                                                  email_addrs
------------+---------+-----------------------------------------------+-----------------------------------+----------------------------------------------------------------------------------------------------------------
 113-341987 |  342799 | ["MG-7190 ","VA-4028 ","EH-1247 ","MS-7018 "] | ["60.00","67.00","22.00","14.99"] | ["bob@example,com","robert.jones@example.com"]
 111-952000 |  342845 | ["ID-2586 ","IC-9010 ","MH-2401 ","JC-1905 "] | ["22.00","35.00",null,"12.00"]    | ["br92@cs.example.edu"]
 111-345634 |  342536 | ["RS-0731 ","SJ-2021 "]                       | ["50.00",null]                    | [null]
 113-965086 |  342176 | ["GW-1808 "]                                  | ["108.00"]                        | ["joe.smith@example.com"]
 111-335121 |  342321 | ["TF-3556 "]                                  | ["50.00"]                         | ["789123@example-isp.com","alexjohnson@example.com","monica@eng.example.com","sara@johnson.example.name",null]
(5 rows)

This example expands the orderprices column for a specified customer, in ascending order. The custkey and email_addrs columns are repeated for each array element.

=> SELECT EXPLODE(orderprices, custkey, email_addrs) OVER(PARTITION BEST) AS (position, orderprices, custkey, email_addrs)
   FROM orders WHERE custkey='342845' ORDER BY orderprices;
 position | orderprices | custkey |         email_addrs
----------+-------------+---------+------------------------------
        2 |             |  342845 | ["br92@cs.example.edu",null]
        3 |       12.00 |  342845 | ["br92@cs.example.edu",null]
        0 |       22.00 |  342845 | ["br92@cs.example.edu",null]
        1 |       35.00 |  342845 | ["br92@cs.example.edu",null]
(4 rows)

When you explode a column that contains null values, the null values are displayed as empty.

You can explode more than one column by specifying the explode_count parameter.

=> SELECT EXPLODE(orderkey, prodkey, orderprices USING PARAMETERS explode_count=2)
OVER(PARTITION BEST)
AS (orderkey,pk_idx,pk_val,ord_idx,ord_val)
FROM orders
WHERE orderkey='113-341987';
  orderkey  | pk_idx |  pk_val  | ord_idx | ord_val
------------+--------+----------+---------+---------
 113-341987 |      0 | MG-7190  |       0 |   60.00
 113-341987 |      0 | MG-7190  |       1 |   67.00
 113-341987 |      0 | MG-7190  |       2 |   22.00
 113-341987 |      0 | MG-7190  |       3 |   14.99
 113-341987 |      1 | VA-4028  |       0 |   60.00
 113-341987 |      1 | VA-4028  |       1 |   67.00
 113-341987 |      1 | VA-4028  |       2 |   22.00
 113-341987 |      1 | VA-4028  |       3 |   14.99
 113-341987 |      2 | EH-1247  |       0 |   60.00
 113-341987 |      2 | EH-1247  |       1 |   67.00
 113-341987 |      2 | EH-1247  |       2 |   22.00
 113-341987 |      2 | EH-1247  |       3 |   14.99
 113-341987 |      3 | MS-7018  |       0 |   60.00
 113-341987 |      3 | MS-7018  |       1 |   67.00
 113-341987 |      3 | MS-7018  |       2 |   22.00
 113-341987 |      3 | MS-7018  |       3 |   14.99
(16 rows)

The following example uses a multi-dimensional array:

=> SELECT name, pingtimes FROM network_tests;
 name |                       pingtimes
------+-------------------------------------------------------
 eng1 | [[24.24,25.27,27.16,24.97],[23.97,25.01,28.12,29.5]]
 eng2 | [[27.12,27.91,28.11,26.95],[29.01,28.99,30.11,31.56]]
 qa1  | [[23.15,25.11,24.63,23.91],[22.85,22.86,23.91,31.52]]
(3 rows)

=> SELECT EXPLODE(name, pingtimes USING PARAMETERS explode_count=1) OVER()
FROM network_tests;
 name | position |           value
------+----------+---------------------------
 eng1 |        0 | [24.24,25.27,27.16,24.97]
 eng1 |        1 | [23.97,25.01,28.12,29.5]
 eng2 |        0 | [27.12,27.91,28.11,26.95]
 eng2 |        1 | [29.01,28.99,30.11,31.56]
 qa1  |        0 | [23.15,25.11,24.63,23.91]
 qa1  |        1 | [22.85,22.86,23.91,31.52]
(6 rows)

See ARRAY and SET for more information on the implementation of these data types in Vertica.

Imploding and filtering arrays

The IMPLODE function is the inverse of EXPLODE: it takes a column and produces an array containing the column's values. Combined with GROUP BY, it can be used to reverse an explode operation.

You can use EXPLODE and IMPLODE together to filter array values. For example, in a set of orders where prices are an array value, you might want to query only the orders with prices below a certain threshold. Consider the following table:

=> SELECT * FROM orders;

 key |      prices
-----+-------------------
 567 | [27.99,18.99]
 789 | [108.0]
 345 | [14.99,35.99]
 123 | [60.0,67.0,14.99]
(4 rows)

You can use EXPLODE to expand the arrays. For clarity, this example creates a new table to hold the results. More typically, you would use EXPLODE and IMPLODE in subqueries instead of creating intermediate tables.

=> CREATE TABLE exploded AS
SELECT EXPLODE(prices,key) OVER (PARTITION BEST)
AS (position, itemprice, itemkey) FROM orders;

=> SELECT * FROM exploded;

 position | itemprice | itemkey
----------+-----------+---------
        0 |       108 |     789
        1 |     35.99 |     345
        0 |     14.99 |     345
        0 |     27.99 |     567
        0 |        60 |     123
        1 |     18.99 |     567
        1 |        67 |     123
        2 |     14.99 |     123
(8 rows)

You can now filter the exploded prices:

=> CREATE TABLE filtered AS
    SELECT position, itemprice, itemkey FROM orders WHERE itemprice < 50.00;

=> SELECT * FROM filtered;

 position | itemprice | itemkey
----------+-----------+---------
        0 |     14.99 |     345
        0 |     27.99 |     567
        1 |     18.99 |     567
        1 |     35.99 |     345
        2 |     14.99 |     123
(5 rows)

Finally, you can use IMPLODE to reconstruct the arrays:

=> SELECT itemkey AS key, IMPLODE(itemprice) AS prices
    FROM filtered GROUP BY itemkey ORDER BY itemkey;
 key |      prices
-----+-------------------
 123 | ["14.99"]
 345 | ["35.99","14.99"]
 567 | ["27.99","18.99"]
(3 rows)

You can modify this query by including a WITHIN GROUP ORDER BY clause, which specifies how to sort array elements within each group:

=> SELECT itemkey AS key, IMPLODE(itemprice) WITHIN GROUP (ORDER BY itemprice) AS prices
    FROM filtered GROUP BY itemkey ORDER BY itemkey;
 key |      prices
-----+-------------------
 123 | ["14.99"]
 345 | ["14.99","35.99"]
 567 | ["18.99","27.99"]
(3 rows)

If IMPLODE were to return an array too large for the column, the function returns an error. To avoid this, you can set the allow_truncate parameter to omit some elements from the results. Truncation never applies to individual elements; for example, the function does not shorten strings.