数组和集(集合)

表可以包含集合(数组或集)。数组是允许重复值的元素的有序集合,而集是唯一值的无序集合。

考虑一个订单表,其中包含产品密钥、客户密钥、订单价格和订单日期列,以及一些包含数组。Vertica 中的基本查询结果如下:

=> SELECT * from orders LIMIT 5;
 orderkey | custkey |        prodkey         |         orderprices         | orderdate
----------+---------+------------------------+-----------------------------+------------
    19626 |      91 | ["P1262","P68","P101"] | ["192.59","49.99","137.49"] | 2021-03-14
    25646 |     716 | ["P997","P31","P101"]  | ["91.39","29.99","147.49"]  | 2021-03-14
    25647 |     716 | ["P12"]                | ["8.99"]                    | 2021-03-14
    19743 |     161 | ["P68","P101"]         | ["49.99","137.49"]          | 2021-03-15
    19888 |     241 | ["P1262","P101"]       | ["197.59","142.49"]         | 2021-03-15
(5 rows)

如本例所示,数组值以 JSON 格式返回。

集的值也以 JSON 数组格式返回:

=> SELECT custkey,email_addrs FROM customers LIMIT 4;
 custkey |                           email_addrs
---------+------------------------------------------------------------------------
 342176  | ["joe.smith@example.com"]
 342799  | ["bob@example,com","robert.jones@example.com"]
 342845  | ["br92@cs.example.edu"]
 342321  | ["789123@example-isp.com","sjohnson@eng.example.com","sara@johnson.example.name"]

您可以访问具有多个索引的嵌套数组(多维数组)的元素,如以下示例所示:

=> SELECT host, pingtimes FROM network_tests;
 host |                 pingtimes
------+-------------------------------------------------------
 eng1 | [[24.24,25.27,27.16,24.97], [23.97,25.01,28.12,29.50]]
 eng2 | [[27.12,27.91,28.11,26.95], [29.01,28.99,30.11,31.56]]
 qa1  | [[23.15,25.11,24.63,23.91], [22.85,22.86,23.91,31.52]]
(3 rows)

=> SELECT pingtimes[0] FROM network_tests;
      pingtimes
-------------------------
[24.24,25.27,27.16,24.97]
[27.12,27.91,28.11,26.95]
[23.15,25.11,24.63,23.91]
(3 rows)

=> SELECT pingtimes[0][0] FROM network_tests;
 pingtimes
-----------
24.24
27.12
23.15
(3 rows)

Vertica 支持多个函数来操作数组和集。

考虑同一订单表,其中包含在单个订单中购买的所有商品的产品密钥数组。您可以使用 APPLY_COUNT_ELEMENTS 函数来找出每个订单包含的商品数量。该函数标识 prodkey 数组中非 null 元素的数量:

=> SELECT apply_count_elements(prodkey) FROM orders LIMIT 5;
apply_count_elements
--------------------
3
2
2
3
1
(5 rows)

Vertica 还支持集合元素的聚合函数。现在,考虑同一表中的一列,其中包含在单个订单中购买的每件商品的价格数组。您可以使用 APPLY_SUM 函数来查找每个订单的总花费:

=> SELECT apply_sum(orderprices) from orders LIMIT 5;
apply_sum
-----------
380.07
187.48
340.08
268.87
  8.99
(5 rows)

大部分数组函数只对一维数组进行操作。要将它们与多维数组一起使用,首先要取消引用一维:

=> SELECT apply_max(pingtimes[0]) FROM network_tests;
 apply_max
-----------
 27.16
 28.11
 25.11
(3 rows)

有关完整的函数列表,请参阅集合函数

您可以在查询中同时包含列名和字面量值。以下示例返回每个订单中商品数量大于三个的订单的产品密钥:

=> SELECT prodkey FROM orders WHERE apply_count_elements(prodkey)>2;
      prodkey
------------------------
 ["P1262","P68","P101"]
 ["P997","P31","P101"]
(2 rows)

考虑一个较复杂的查询,该查询通过联接两个表 custorders 来返回客户密钥、姓名、电子邮件、订单密钥和产品密钥,以获取满足总和大于 150 这一条件的订单:

=> SELECT custkey, cust_custname, cust_email, orderkey, prodkey, orderprices from orders
 JOIN cust ON custkey = cust_custkey
 WHERE apply_sum(orderprices)>150 ;
custkey|  cust_custname   |        cust_email         |   orderkey   |                  prodkey                  |        orderprices
-------+------------------+---------------------------+--------------+--------------------------------========---+---------------------------
342799 | "Ananya Patel"   | "ananyapatel98@gmail.com" | "113-341987" | ["MG-7190","VA-4028","EH-1247","MS-7018"] | [60.00,67.00,22.00,14.99]
342845 | "Molly Benton"   | "molly_benton@gmail.com"  | "111-952000" | ["ID-2586","IC-9010","MH-2401","JC-1905"] | [22.00,35.00,90.00,12.00]
342989 | "Natasha Abbasi" | "natsabbasi@live.com"     | "111-685238" | ["HP-4024"]                               | [650.00]
342176 | "Jose Martinez"  | "jmartinez@hotmail.com"   | "113-672238" | ["HP-4768","IC-9010"]                     | [899.00,60.00]
342845 | "Molly Benton"   | "molly_benton@gmail.com"  | "113-864153" | ["AE-7064","VA-4028","GW-1808"]           | [72.00,99.00,185.00]
(5 rows)

具有复杂元素的数组

数组可以包含任意组合的数组和结构,如下例所示:

=> CREATE TABLE orders(
  orderid INT,
  accountid INT,
  shipments ARRAY[
    ROW(
      shipid INT,
      address ROW(
        street VARCHAR,
        city VARCHAR,
        zip INT
        ),
      shipdate DATE
    )
  ]
 );

一些订单包含多批货物。为了便于阅读,已在以下输出中插入换行符:

=> SELECT * FROM orders;
 orderid | accountid |                          shipments
---------+-----------+---------------------------------------------------------------------------------------------------------------
   99123 |        17 | [{"shipid":1,"address":{"street":"911 San Marcos St","city":"Austin","zip":73344},"shipdate":"2020-11-05"},
            {"shipid":2,"address":{"street":"100 main St Apt 4B","city":"Pasadena","zip":91001},"shipdate":"2020-11-06"}]
   99149 |       139 | [{"shipid":3,"address":{"street":"100 main St Apt 4B","city":"Pasadena","zip":91001},"shipdate":"2020-11-06"}]
   99162 |       139 | [{"shipid":4,"address":{"street":"100 main St Apt 4B","city":"Pasadena","zip":91001},"shipdate":"2020-11-04"},
            {"shipid":5,"address":{"street":"100 Main St Apt 4A","city":"Pasadena","zip":91001},"shipdate":"2020-11-11"}]
(3 rows)

您可以在查询中同时使用数组索引和结构字段选择:

=> SELECT orderid, shipments[0].shipdate AS ship1, shipments[1].shipdate AS ship2 FROM orders;
 orderid |   ship1    |   ship2
---------+------------+------------
   99123 | 2020-11-05 | 2020-11-06
   99149 | 2020-11-06 |
   99162 | 2020-11-04 | 2020-11-11
(3 rows)

此示例选择特定数组索引。要访问所有条目,请使用 EXPLODE

某些数据格式具有映射类型,它是一组键/值对。Vertica 不直接支持查询映射,但您可以将映射列定义为结构数组并对其进行查询。在以下示例中,数据中的 "prods" 列是一个映射:

=> CREATE EXTERNAL TABLE orders
 (orderkey INT,
  custkey INT,
  prods ARRAY[ROW(key VARCHAR(10), value DECIMAL(12,2))],
  orderdate DATE
 ) AS COPY FROM '...' PARQUET;

=> SELECT orderkey, prods FROM orders;
 orderkey |                                              prods
----------+--------------------------------------------------------------------------------------------------
    19626 | [{"key":"P68","value":"49.99"},{"key":"P1262","value":"192.59"},{"key":"P101","value":"137.49"}]
    25646 | [{"key":"P997","value":"91.39"},{"key":"P101","value":"147.49"},{"key":"P31","value":"29.99"}]
    25647 | [{"key":"P12","value":"8.99"}]
    19743 | [{"key":"P68","value":"49.99"},{"key":"P101","value":"137.49"}]
    19888 | [{"key":"P1262","value":"197.59"},{"key":"P101","value":"142.49"}]
(5 rows)

您不能在 CREATE TABLE AS SELECT (CTAS) 或视图中使用混合列。此限制适用于整个列或其中的字段选择。

排序和分组

您可以将 比较运算符 用于集合。null 集合排在最后。否则,对集合中的元素逐个比较,直到出现不匹配,然后根据不匹配的元素对集合进行排序。如果所有元素的长度都依次等于较短元素的长度,则首先对较短的元素进行排序。

您可以在查询的 ORDER BYGROUP BY 子句中使用集合。以下示例演示如何按数组列对查询结果进行排序:

=> CREATE TABLE employees (id INT, department VARCHAR(50), grants ARRAY[VARCHAR], grant_values ARRAY[INT]);

=> COPY employees FROM STDIN;
42|Physics|[US-7376,DARPA-1567]|[65000,135000]
36|Physics|[US-7376,DARPA-1567]|[10000,25000]
33|Physics|[US-7376]|[30000]
36|Astronomy|[US-7376,DARPA-1567]|[5000,4000]
\.

=> SELECT * FROM employees ORDER BY grant_values;
 id | department |          grants          |  grant_values
----+------------+--------------------------+----------------
 36 | Astronomy  | ["US-7376","DARPA-1567"] | [5000,4000]
 36 | Physics    | ["US-7376","DARPA-1567"] | [10000,25000]
 33 | Physics    | ["US-7376"]              | [30000]
 42 | Physics    | ["US-7376","DARPA-1567"] | [65000,135000]
(4 rows)

以下示例使用 GROUP BY 查询同一个表:

=> CREATE TABLE employees (id INT, department VARCHAR(50), grants ARRAY[VARCHAR], grant_values ARRAY[INT]);

=> COPY employees FROM STDIN;
42|Physics|[US-7376,DARPA-1567]|[65000,135000]
36|Physics|[US-7376,DARPA-1567]|[10000,25000]
33|Physics|[US-7376]|[30000]
36|Astronomy|[US-7376,DARPA-1567]|[5000,4000]
\.

=> SELECT department, grants, SUM(apply_sum(grant_values)) FROM employees GROUP BY grants, department;
 department |          grants          |  SUM
------------+--------------------------+--------
 Physics    | ["US-7376","DARPA-1567"] | 235000
 Astronomy  | ["US-7376","DARPA-1567"] |   9000
 Physics    | ["US-7376"]              |  30000
(3 rows)

有关 Vertica 如何对集合进行排序的信息,请参阅 ARRAY 参考页面上的“函数和运算符”部分。(SET 参考页面上也提供了同样的信息。)

NULL 处理

集合的空语义在大多数方面与普通列一致。有关空处理的更多信息,请参阅 NULL 排序顺序

当集合为 null 而不是空时,空安全相等运算符 (<=>) 的行为与相等 (=) 不同。将集合严格地与 NULL 进行比较是未定义的。

=> SELECT ARRAY[1,3] = NULL;
?column?
----------

(1 row)

=> SELECT ARRAY[1,3] <=> NULL;
 ?column?
----------
 f
(1 row)

在以下示例中,表中的授予列对于员工 99 为 null。

=> SELECT grants = NULL FROM employees WHERE id=99;
 ?column?
----------

(1 row)

=> SELECT grants <=> NULL FROM employees WHERE id=99;
 ?column?
----------
 t
(1 row)

空集合不为 null 并且按预期运行。

=> SELECT ARRAY[]::ARRAY[INT] = ARRAY[]::ARRAY[INT];
 ?column?
----------
 t
(1 row)

集合逐个元素进行比较。如果比较依赖于 null 元素,则结果是未知的 (null),而不是 false。例如,ARRAY[1,2,null]=ARRAY[1,2,null]ARRAY[1,2,null]=ARRAY[1,2,3] 都返回 null,但 ARRAY[1,2,null]=ARRAY[1,4,null] 因为第二个元素不匹配而返回 false。

集合中的越界索引返回 NULL。

=> SELECT prodkey[2] from orders LIMIT 4;
prodkey
---------

"EH-1247"
"MH-2401"

(4 rows)

由于指定的索引大于这些数组的大小,因此四行中的两行(第一行和第四行)的查询结果返回 NULL。

强制转换

当表达式值的数据类型明确时,会隐式地强制转换,以与预期数据类型相匹配。但是,表达式的数据类型可能不明确。例如,日期可能解释为字符串或时间戳。编写显式强制转换以避免使用默认值:

=> SELECT apply_count_elements(ARRAY['2019-01-20','2019-02-12','2019-03-23']::ARRAY[TIMESTAMP]);
apply_count_elements
--------------------
 3
(1 row)

您可以按照与强制转换标量值相同的规则,将一种标量类型的数组或集强制转换为其他(兼容)类型的数组或集。强制转换一个集合会强制转换该集合的每个元素。将数组强制转换为集也会移除任何重复项。

您可以使用作为数组或结构(或二者的组合)的元素来强制转换数组(而不是集):

=> SELECT shipments::ARRAY[ROW(id INT,addr ROW(VARCHAR,VARCHAR,INT),shipped DATE)]
FROM orders;
                  shipments
---------------------------------------------------------------------------
 [{"id":1,"addr":{"street":"911 San Marcos St","city":"Austin","zip":73344},"shipped":"2020-11-05"},
  {"id":2,"addr":{"street":"100 main St Apt 4B","city":"Pasadena","zip":91001},"shipped":"2020-11-06"}]
 [{"id":3,"addr":{"street":"100 main St Apt 4B","city":"Pasadena","zip":91001},"shipped":"2020-11-06"}]
 [{"id":4,"addr":{"street":"100 main St Apt 4B","city":"Pasadena","zip":91001},"shipped":"2020-11-04"},
  {"id":5,"addr":{"street":"100 Main St Apt 4A","city":"Pasadena","zip":91001},"shipped":"2020-11-11"}]
(3 rows)

您可以通过强制转换来更改数组或集的边界。当强制转换为有界原生数组时,会截断太长的输入。强制转换为多维数组时,如果新边界对于数据而言太小,则强制转换失败:

=> SELECT ARRAY[1,2,3]::ARRAY[VARCHAR,2];
   array
-----------
 ["1","2"]
(1 row)

=> SELECT ARRAY[ARRAY[1,2,3],ARRAY[4,5,6]]::ARRAY[ARRAY[VARCHAR,2],2];
ERROR 9227:  Output array isn't big enough
DETAIL:  Type limit is 4 elements, but value has 6 elements

如果强制转换为有界多维数组,则必须指定所有级别的边界。

必须显式强制转换具有单个 null 元素的数组或集,因为无法推断出任何类型。

有关数据类型强制转换的详细信息,请参阅数据类型强制转换

分割数组列

您可以使用 EXPLODE 简化对存储在数组中的元素的查询,该函数从表中获取数组列并展开它们。对于每个分割的数组,结果包括两列,一列用于数组元素索引,另一列用于该位置的值。如果该函数分割单个数组,这些列默认命名为 positionvalue。如果该函数分割两个或更多数组,则每个数组的列命名为 pos_column-nameval_column-name

该函数分割参数列表中的前 N 个数组列(N 默认值为 1),并传递所有其他列。

以下示例说明了将 EXPLODE()OVER(PARTITION BEST) 子句一起使用。

考虑一个订单表,其中包含订单键、客户键、产品键、订单价格和电子邮件地址的列,其中一些包含数组。Vertica 中的基本查询结果如下:

=> SELECT orderkey, custkey, prodkey, orderprices, email_addrs FROM orders LIMIT 5;
  orderkey  | custkey |                    prodkey                    |            orderprices            |                                                  email_addrs
------------+---------+-----------------------------------------------+-----------------------------------+----------------------------------------------------------------------------------------------------------------
 113-341987 |  342799 | ["MG-7190 ","VA-4028 ","EH-1247 ","MS-7018 "] | ["60.00","67.00","22.00","14.99"] | ["bob@example,com","robert.jones@example.com"]
 111-952000 |  342845 | ["ID-2586 ","IC-9010 ","MH-2401 ","JC-1905 "] | ["22.00","35.00",null,"12.00"]    | ["br92@cs.example.edu"]
 111-345634 |  342536 | ["RS-0731 ","SJ-2021 "]                       | ["50.00",null]                    | [null]
 113-965086 |  342176 | ["GW-1808 "]                                  | ["108.00"]                        | ["joe.smith@example.com"]
 111-335121 |  342321 | ["TF-3556 "]                                  | ["50.00"]                         | ["789123@example-isp.com","alexjohnson@example.com","monica@eng.example.com","sara@johnson.example.name",null]
(5 rows)

此示例按升序扩展指定客户的 orderprices 列。custkeyemail_addrs 列对每个数组元素重复。

=> SELECT EXPLODE(orderprices, custkey, email_addrs) OVER(PARTITION BEST) AS (position, orderprices, custkey, email_addrs)
   FROM orders WHERE custkey='342845' ORDER BY orderprices;
 position | orderprices | custkey |         email_addrs
----------+-------------+---------+------------------------------
        2 |             |  342845 | ["br92@cs.example.edu",null]
        3 |       12.00 |  342845 | ["br92@cs.example.edu",null]
        0 |       22.00 |  342845 | ["br92@cs.example.edu",null]
        1 |       35.00 |  342845 | ["br92@cs.example.edu",null]
(4 rows)

展开包含空值的列时,null 值显示为空。

您可以通过指定 explode_count 参数来展开多个列。

=> SELECT EXPLODE(orderkey, prodkey, orderprices USING PARAMETERS explode_count=2)
OVER(PARTITION BEST)
AS (orderkey,pk_idx,pk_val,ord_idx,ord_val)
FROM orders
WHERE orderkey='113-341987';
  orderkey  | pk_idx |  pk_val  | ord_idx | ord_val
------------+--------+----------+---------+---------
 113-341987 |      0 | MG-7190  |       0 |   60.00
 113-341987 |      0 | MG-7190  |       1 |   67.00
 113-341987 |      0 | MG-7190  |       2 |   22.00
 113-341987 |      0 | MG-7190  |       3 |   14.99
 113-341987 |      1 | VA-4028  |       0 |   60.00
 113-341987 |      1 | VA-4028  |       1 |   67.00
 113-341987 |      1 | VA-4028  |       2 |   22.00
 113-341987 |      1 | VA-4028  |       3 |   14.99
 113-341987 |      2 | EH-1247  |       0 |   60.00
 113-341987 |      2 | EH-1247  |       1 |   67.00
 113-341987 |      2 | EH-1247  |       2 |   22.00
 113-341987 |      2 | EH-1247  |       3 |   14.99
 113-341987 |      3 | MS-7018  |       0 |   60.00
 113-341987 |      3 | MS-7018  |       1 |   67.00
 113-341987 |      3 | MS-7018  |       2 |   22.00
 113-341987 |      3 | MS-7018  |       3 |   14.99
(16 rows)

以下示例使用多维数组:

=> SELECT name, pingtimes FROM network_tests;
 name |                       pingtimes
------+-------------------------------------------------------
 eng1 | [[24.24,25.27,27.16,24.97],[23.97,25.01,28.12,29.5]]
 eng2 | [[27.12,27.91,28.11,26.95],[29.01,28.99,30.11,31.56]]
 qa1  | [[23.15,25.11,24.63,23.91],[22.85,22.86,23.91,31.52]]
(3 rows)

=> SELECT EXPLODE(name, pingtimes USING PARAMETERS explode_count=1) OVER()
FROM network_tests;
 name | position |           value
------+----------+---------------------------
 eng1 |        0 | [24.24,25.27,27.16,24.97]
 eng1 |        1 | [23.97,25.01,28.12,29.5]
 eng2 |        0 | [27.12,27.91,28.11,26.95]
 eng2 |        1 | [29.01,28.99,30.11,31.56]
 qa1  |        0 | [23.15,25.11,24.63,23.91]
 qa1  |        1 | [22.85,22.86,23.91,31.52]
(6 rows)

有关在 Vertica 中实施这些数据类型的详细信息,请参阅 ARRAYSET

压缩和筛选数组

IMPLODE 函数是 EXPLODE 的反函数:它接受一列,并生成一个包含该列值的数组。与 GROUP BY 结合使用,可用于反转分割操作。

您可以同时使用 EXPLODEIMPLODE 来筛选数组值。例如,在一组价格为数组值的订单中,您可能只想查询价格低于某个阈值的订单。请考虑下表:

=> SELECT * FROM orders;

 key |      prices
-----+-------------------
 567 | [27.99,18.99]
 789 | [108.0]
 345 | [14.99,35.99]
 123 | [60.0,67.0,14.99]
(4 rows)

您可以使用 EXPLODE 来展开数组。为清楚起见,此示例创建一个新表来保存结果。更典型的做法是,您在子查询中使用 EXPLODEIMPLODE 而不是创建中间表。

=> CREATE TABLE exploded AS
SELECT EXPLODE(prices,key) OVER (PARTITION BEST)
AS (position, itemprice, itemkey) FROM orders;

=> SELECT * FROM exploded;

 position | itemprice | itemkey
----------+-----------+---------
        0 |       108 |     789
        1 |     35.99 |     345
        0 |     14.99 |     345
        0 |     27.99 |     567
        0 |        60 |     123
        1 |     18.99 |     567
        1 |        67 |     123
        2 |     14.99 |     123
(8 rows)

您现在可以筛选已分割的价格:

=> CREATE TABLE filtered AS
    SELECT position, itemprice, itemkey FROM orders WHERE itemprice < 50.00;

=> SELECT * FROM filtered;

 position | itemprice | itemkey
----------+-----------+---------
        0 |     14.99 |     345
        0 |     27.99 |     567
        1 |     18.99 |     567
        1 |     35.99 |     345
        2 |     14.99 |     123
(5 rows)

最后,您可以使用 IMPLODE 来重构数组:

=> SELECT itemkey AS key, IMPLODE(itemprice) AS prices
    FROM filtered GROUP BY itemkey ORDER BY itemkey;
 key |      prices
-----+-------------------
 123 | ["14.99"]
 345 | ["35.99","14.99"]
 567 | ["27.99","18.99"]
(3 rows)

您可以通过包括 WITHIN GROUP ORDER BY 子句来修改此查询,该子句指定如何对每个组中的数组元素进行排序:

=> SELECT itemkey AS key, IMPLODE(itemprice) WITHIN GROUP (ORDER BY itemprice) AS prices
    FROM filtered GROUP BY itemkey ORDER BY itemkey;
 key |      prices
-----+-------------------
 123 | ["14.99"]
 345 | ["14.99","35.99"]
 567 | ["18.99","27.99"]
(3 rows)

如果 IMPLODE 要返回的数组对于该列来说太大,则该函数将返回一个错误。为避免这种情况,您可以将 allow_truncate 参数设置为在结果中省略一些元素。截断永远不会应用于单个元素;例如,该函数不会缩短字符串。