数组和集(集合)
表可以包含集合(数组或集)。数组是允许重复值的元素的有序集合,而集是唯一值的无序集合。
考虑一个订单表,其中包含产品密钥、客户密钥、订单价格和订单日期列,以及一些包含数组。Vertica 中的基本查询结果如下:
=> SELECT * from orders LIMIT 5;
orderkey | custkey | prodkey | orderprices | orderdate
----------+---------+------------------------+-----------------------------+------------
19626 | 91 | ["P1262","P68","P101"] | ["192.59","49.99","137.49"] | 2021-03-14
25646 | 716 | ["P997","P31","P101"] | ["91.39","29.99","147.49"] | 2021-03-14
25647 | 716 | ["P12"] | ["8.99"] | 2021-03-14
19743 | 161 | ["P68","P101"] | ["49.99","137.49"] | 2021-03-15
19888 | 241 | ["P1262","P101"] | ["197.59","142.49"] | 2021-03-15
(5 rows)
如本例所示,数组值以 JSON 格式返回。
集的值也以 JSON 数组格式返回:
=> SELECT custkey,email_addrs FROM customers LIMIT 4;
custkey | email_addrs
---------+------------------------------------------------------------------------
342176 | ["joe.smith@example.com"]
342799 | ["bob@example,com","robert.jones@example.com"]
342845 | ["br92@cs.example.edu"]
342321 | ["789123@example-isp.com","sjohnson@eng.example.com","sara@johnson.example.name"]
您可以访问具有多个索引的嵌套数组(多维数组)的元素,如以下示例所示:
=> SELECT host, pingtimes FROM network_tests;
host | pingtimes
------+-------------------------------------------------------
eng1 | [[24.24,25.27,27.16,24.97], [23.97,25.01,28.12,29.50]]
eng2 | [[27.12,27.91,28.11,26.95], [29.01,28.99,30.11,31.56]]
qa1 | [[23.15,25.11,24.63,23.91], [22.85,22.86,23.91,31.52]]
(3 rows)
=> SELECT pingtimes[0] FROM network_tests;
pingtimes
-------------------------
[24.24,25.27,27.16,24.97]
[27.12,27.91,28.11,26.95]
[23.15,25.11,24.63,23.91]
(3 rows)
=> SELECT pingtimes[0][0] FROM network_tests;
pingtimes
-----------
24.24
27.12
23.15
(3 rows)
Vertica 支持多个函数来操作数组和集。
考虑同一订单表,其中包含在单个订单中购买的所有商品的产品密钥数组。您可以使用 APPLY_COUNT_ELEMENTS 函数来找出每个订单包含的商品数量。该函数标识 prodkey
数组中非 null 元素的数量:
=> SELECT apply_count_elements(prodkey) FROM orders LIMIT 5;
apply_count_elements
--------------------
3
2
2
3
1
(5 rows)
Vertica 还支持集合元素的聚合函数。现在,考虑同一表中的一列,其中包含在单个订单中购买的每件商品的价格数组。您可以使用 APPLY_SUM 函数来查找每个订单的总花费:
=> SELECT apply_sum(orderprices) from orders LIMIT 5;
apply_sum
-----------
380.07
187.48
340.08
268.87
8.99
(5 rows)
大部分数组函数只对一维数组进行操作。要将它们与多维数组一起使用,首先要取消引用一维:
=> SELECT apply_max(pingtimes[0]) FROM network_tests;
apply_max
-----------
27.16
28.11
25.11
(3 rows)
有关完整的函数列表,请参阅集合函数。
您可以在查询中同时包含列名和字面量值。以下示例返回每个订单中商品数量大于三个的订单的产品密钥:
=> SELECT prodkey FROM orders WHERE apply_count_elements(prodkey)>2;
prodkey
------------------------
["P1262","P68","P101"]
["P997","P31","P101"]
(2 rows)
考虑一个较复杂的查询,该查询通过联接两个表 cust
和 orders
来返回客户密钥、姓名、电子邮件、订单密钥和产品密钥,以获取满足总和大于 150 这一条件的订单:
=> SELECT custkey, cust_custname, cust_email, orderkey, prodkey, orderprices from orders
JOIN cust ON custkey = cust_custkey
WHERE apply_sum(orderprices)>150 ;
custkey| cust_custname | cust_email | orderkey | prodkey | orderprices
-------+------------------+---------------------------+--------------+--------------------------------========---+---------------------------
342799 | "Ananya Patel" | "ananyapatel98@gmail.com" | "113-341987" | ["MG-7190","VA-4028","EH-1247","MS-7018"] | [60.00,67.00,22.00,14.99]
342845 | "Molly Benton" | "molly_benton@gmail.com" | "111-952000" | ["ID-2586","IC-9010","MH-2401","JC-1905"] | [22.00,35.00,90.00,12.00]
342989 | "Natasha Abbasi" | "natsabbasi@live.com" | "111-685238" | ["HP-4024"] | [650.00]
342176 | "Jose Martinez" | "jmartinez@hotmail.com" | "113-672238" | ["HP-4768","IC-9010"] | [899.00,60.00]
342845 | "Molly Benton" | "molly_benton@gmail.com" | "113-864153" | ["AE-7064","VA-4028","GW-1808"] | [72.00,99.00,185.00]
(5 rows)
具有复杂元素的数组
数组可以包含任意组合的数组和结构,如下例所示:
=> CREATE TABLE orders(
orderid INT,
accountid INT,
shipments ARRAY[
ROW(
shipid INT,
address ROW(
street VARCHAR,
city VARCHAR,
zip INT
),
shipdate DATE
)
]
);
一些订单包含多批货物。为了便于阅读,已在以下输出中插入换行符:
=> SELECT * FROM orders;
orderid | accountid | shipments
---------+-----------+---------------------------------------------------------------------------------------------------------------
99123 | 17 | [{"shipid":1,"address":{"street":"911 San Marcos St","city":"Austin","zip":73344},"shipdate":"2020-11-05"},
{"shipid":2,"address":{"street":"100 main St Apt 4B","city":"Pasadena","zip":91001},"shipdate":"2020-11-06"}]
99149 | 139 | [{"shipid":3,"address":{"street":"100 main St Apt 4B","city":"Pasadena","zip":91001},"shipdate":"2020-11-06"}]
99162 | 139 | [{"shipid":4,"address":{"street":"100 main St Apt 4B","city":"Pasadena","zip":91001},"shipdate":"2020-11-04"},
{"shipid":5,"address":{"street":"100 Main St Apt 4A","city":"Pasadena","zip":91001},"shipdate":"2020-11-11"}]
(3 rows)
您可以在查询中同时使用数组索引和结构字段选择:
=> SELECT orderid, shipments[0].shipdate AS ship1, shipments[1].shipdate AS ship2 FROM orders;
orderid | ship1 | ship2
---------+------------+------------
99123 | 2020-11-05 | 2020-11-06
99149 | 2020-11-06 |
99162 | 2020-11-04 | 2020-11-11
(3 rows)
此示例选择特定数组索引。要访问所有条目,请使用 EXPLODE。
某些数据格式具有映射类型,它是一组键/值对。Vertica 不直接支持查询映射,但您可以将映射列定义为结构数组并对其进行查询。在以下示例中,数据中的 "prods" 列是一个映射:
=> CREATE EXTERNAL TABLE orders
(orderkey INT,
custkey INT,
prods ARRAY[ROW(key VARCHAR(10), value DECIMAL(12,2))],
orderdate DATE
) AS COPY FROM '...' PARQUET;
=> SELECT orderkey, prods FROM orders;
orderkey | prods
----------+--------------------------------------------------------------------------------------------------
19626 | [{"key":"P68","value":"49.99"},{"key":"P1262","value":"192.59"},{"key":"P101","value":"137.49"}]
25646 | [{"key":"P997","value":"91.39"},{"key":"P101","value":"147.49"},{"key":"P31","value":"29.99"}]
25647 | [{"key":"P12","value":"8.99"}]
19743 | [{"key":"P68","value":"49.99"},{"key":"P101","value":"137.49"}]
19888 | [{"key":"P1262","value":"197.59"},{"key":"P101","value":"142.49"}]
(5 rows)
您不能在 CREATE TABLE AS SELECT (CTAS) 或视图中使用混合列。此限制适用于整个列或其中的字段选择。
排序和分组
您可以将 比较运算符 用于集合。null 集合排在最后。否则,对集合中的元素逐个比较,直到出现不匹配,然后根据不匹配的元素对集合进行排序。如果所有元素的长度都依次等于较短元素的长度,则首先对较短的元素进行排序。
您可以在查询的 ORDER BY 和 GROUP BY 子句中使用集合。以下示例演示如何按数组列对查询结果进行排序:
=> CREATE TABLE employees (id INT, department VARCHAR(50), grants ARRAY[VARCHAR], grant_values ARRAY[INT]);
=> COPY employees FROM STDIN;
42|Physics|[US-7376,DARPA-1567]|[65000,135000]
36|Physics|[US-7376,DARPA-1567]|[10000,25000]
33|Physics|[US-7376]|[30000]
36|Astronomy|[US-7376,DARPA-1567]|[5000,4000]
\.
=> SELECT * FROM employees ORDER BY grant_values;
id | department | grants | grant_values
----+------------+--------------------------+----------------
36 | Astronomy | ["US-7376","DARPA-1567"] | [5000,4000]
36 | Physics | ["US-7376","DARPA-1567"] | [10000,25000]
33 | Physics | ["US-7376"] | [30000]
42 | Physics | ["US-7376","DARPA-1567"] | [65000,135000]
(4 rows)
以下示例使用 GROUP BY 查询同一个表:
=> CREATE TABLE employees (id INT, department VARCHAR(50), grants ARRAY[VARCHAR], grant_values ARRAY[INT]);
=> COPY employees FROM STDIN;
42|Physics|[US-7376,DARPA-1567]|[65000,135000]
36|Physics|[US-7376,DARPA-1567]|[10000,25000]
33|Physics|[US-7376]|[30000]
36|Astronomy|[US-7376,DARPA-1567]|[5000,4000]
\.
=> SELECT department, grants, SUM(apply_sum(grant_values)) FROM employees GROUP BY grants, department;
department | grants | SUM
------------+--------------------------+--------
Physics | ["US-7376","DARPA-1567"] | 235000
Astronomy | ["US-7376","DARPA-1567"] | 9000
Physics | ["US-7376"] | 30000
(3 rows)
有关 Vertica 如何对集合进行排序的信息,请参阅 ARRAY 参考页面上的“函数和运算符”部分。(SET 参考页面上也提供了同样的信息。)
NULL 处理
集合的空语义在大多数方面与普通列一致。有关空处理的更多信息,请参阅 NULL 排序顺序。
当集合为 null 而不是空时,空安全相等运算符 (<=>) 的行为与相等 (=) 不同。将集合严格地与 NULL 进行比较是未定义的。
=> SELECT ARRAY[1,3] = NULL;
?column?
----------
(1 row)
=> SELECT ARRAY[1,3] <=> NULL;
?column?
----------
f
(1 row)
在以下示例中,表中的授予列对于员工 99 为 null。
=> SELECT grants = NULL FROM employees WHERE id=99;
?column?
----------
(1 row)
=> SELECT grants <=> NULL FROM employees WHERE id=99;
?column?
----------
t
(1 row)
空集合不为 null 并且按预期运行。
=> SELECT ARRAY[]::ARRAY[INT] = ARRAY[]::ARRAY[INT];
?column?
----------
t
(1 row)
集合逐个元素进行比较。如果比较依赖于 null 元素,则结果是未知的 (null),而不是 false。例如,ARRAY[1,2,null]=ARRAY[1,2,null]
和 ARRAY[1,2,null]=ARRAY[1,2,3]
都返回 null,但 ARRAY[1,2,null]=ARRAY[1,4,null]
因为第二个元素不匹配而返回 false。
集合中的越界索引返回 NULL。
=> SELECT prodkey[2] from orders LIMIT 4;
prodkey
---------
"EH-1247"
"MH-2401"
(4 rows)
由于指定的索引大于这些数组的大小,因此四行中的两行(第一行和第四行)的查询结果返回 NULL。
强制转换
当表达式值的数据类型明确时,会隐式地强制转换,以与预期数据类型相匹配。但是,表达式的数据类型可能不明确。例如,日期可能解释为字符串或时间戳。编写显式强制转换以避免使用默认值:
=> SELECT apply_count_elements(ARRAY['2019-01-20','2019-02-12','2019-03-23']::ARRAY[TIMESTAMP]);
apply_count_elements
--------------------
3
(1 row)
您可以按照与强制转换标量值相同的规则,将一种标量类型的数组或集强制转换为其他(兼容)类型的数组或集。强制转换一个集合会强制转换该集合的每个元素。将数组强制转换为集也会移除任何重复项。
您可以使用作为数组或结构(或二者的组合)的元素来强制转换数组(而不是集):
=> SELECT shipments::ARRAY[ROW(id INT,addr ROW(VARCHAR,VARCHAR,INT),shipped DATE)]
FROM orders;
shipments
---------------------------------------------------------------------------
[{"id":1,"addr":{"street":"911 San Marcos St","city":"Austin","zip":73344},"shipped":"2020-11-05"},
{"id":2,"addr":{"street":"100 main St Apt 4B","city":"Pasadena","zip":91001},"shipped":"2020-11-06"}]
[{"id":3,"addr":{"street":"100 main St Apt 4B","city":"Pasadena","zip":91001},"shipped":"2020-11-06"}]
[{"id":4,"addr":{"street":"100 main St Apt 4B","city":"Pasadena","zip":91001},"shipped":"2020-11-04"},
{"id":5,"addr":{"street":"100 Main St Apt 4A","city":"Pasadena","zip":91001},"shipped":"2020-11-11"}]
(3 rows)
您可以通过强制转换来更改数组或集的边界。当强制转换为有界原生数组时,会截断太长的输入。强制转换为多维数组时,如果新边界对于数据而言太小,则强制转换失败:
=> SELECT ARRAY[1,2,3]::ARRAY[VARCHAR,2];
array
-----------
["1","2"]
(1 row)
=> SELECT ARRAY[ARRAY[1,2,3],ARRAY[4,5,6]]::ARRAY[ARRAY[VARCHAR,2],2];
ERROR 9227: Output array isn't big enough
DETAIL: Type limit is 4 elements, but value has 6 elements
如果强制转换为有界多维数组,则必须指定所有级别的边界。
必须显式强制转换具有单个 null 元素的数组或集,因为无法推断出任何类型。
有关数据类型强制转换的详细信息,请参阅数据类型强制转换。
分割数组列
您可以使用 EXPLODE 简化对存储在数组中的元素的查询,该函数从表中获取数组列并展开它们。对于每个分割的数组,结果包括两列,一列用于数组元素索引,另一列用于该位置的值。如果该函数分割单个数组,这些列默认命名为 position
和 value
。如果该函数分割两个或更多数组,则每个数组的列命名为
pos_column-name
和
val_column-name
。
该函数分割参数列表中的前 N 个数组列(N 默认值为 1),并传递所有其他列。
以下示例说明了将 EXPLODE()
与 OVER(PARTITION BEST)
子句一起使用。
考虑一个订单表,其中包含订单键、客户键、产品键、订单价格和电子邮件地址的列,其中一些包含数组。Vertica 中的基本查询结果如下:
=> SELECT orderkey, custkey, prodkey, orderprices, email_addrs FROM orders LIMIT 5;
orderkey | custkey | prodkey | orderprices | email_addrs
------------+---------+-----------------------------------------------+-----------------------------------+----------------------------------------------------------------------------------------------------------------
113-341987 | 342799 | ["MG-7190 ","VA-4028 ","EH-1247 ","MS-7018 "] | ["60.00","67.00","22.00","14.99"] | ["bob@example,com","robert.jones@example.com"]
111-952000 | 342845 | ["ID-2586 ","IC-9010 ","MH-2401 ","JC-1905 "] | ["22.00","35.00",null,"12.00"] | ["br92@cs.example.edu"]
111-345634 | 342536 | ["RS-0731 ","SJ-2021 "] | ["50.00",null] | [null]
113-965086 | 342176 | ["GW-1808 "] | ["108.00"] | ["joe.smith@example.com"]
111-335121 | 342321 | ["TF-3556 "] | ["50.00"] | ["789123@example-isp.com","alexjohnson@example.com","monica@eng.example.com","sara@johnson.example.name",null]
(5 rows)
此示例按升序扩展指定客户的 orderprices
列。custkey
和 email_addrs
列对每个数组元素重复。
=> SELECT EXPLODE(orderprices, custkey, email_addrs) OVER(PARTITION BEST) AS (position, orderprices, custkey, email_addrs)
FROM orders WHERE custkey='342845' ORDER BY orderprices;
position | orderprices | custkey | email_addrs
----------+-------------+---------+------------------------------
2 | | 342845 | ["br92@cs.example.edu",null]
3 | 12.00 | 342845 | ["br92@cs.example.edu",null]
0 | 22.00 | 342845 | ["br92@cs.example.edu",null]
1 | 35.00 | 342845 | ["br92@cs.example.edu",null]
(4 rows)
展开包含空值的列时,null 值显示为空。
您可以通过指定 explode_count
参数来展开多个列。
=> SELECT EXPLODE(orderkey, prodkey, orderprices USING PARAMETERS explode_count=2)
OVER(PARTITION BEST)
AS (orderkey,pk_idx,pk_val,ord_idx,ord_val)
FROM orders
WHERE orderkey='113-341987';
orderkey | pk_idx | pk_val | ord_idx | ord_val
------------+--------+----------+---------+---------
113-341987 | 0 | MG-7190 | 0 | 60.00
113-341987 | 0 | MG-7190 | 1 | 67.00
113-341987 | 0 | MG-7190 | 2 | 22.00
113-341987 | 0 | MG-7190 | 3 | 14.99
113-341987 | 1 | VA-4028 | 0 | 60.00
113-341987 | 1 | VA-4028 | 1 | 67.00
113-341987 | 1 | VA-4028 | 2 | 22.00
113-341987 | 1 | VA-4028 | 3 | 14.99
113-341987 | 2 | EH-1247 | 0 | 60.00
113-341987 | 2 | EH-1247 | 1 | 67.00
113-341987 | 2 | EH-1247 | 2 | 22.00
113-341987 | 2 | EH-1247 | 3 | 14.99
113-341987 | 3 | MS-7018 | 0 | 60.00
113-341987 | 3 | MS-7018 | 1 | 67.00
113-341987 | 3 | MS-7018 | 2 | 22.00
113-341987 | 3 | MS-7018 | 3 | 14.99
(16 rows)
以下示例使用多维数组:
=> SELECT name, pingtimes FROM network_tests;
name | pingtimes
------+-------------------------------------------------------
eng1 | [[24.24,25.27,27.16,24.97],[23.97,25.01,28.12,29.5]]
eng2 | [[27.12,27.91,28.11,26.95],[29.01,28.99,30.11,31.56]]
qa1 | [[23.15,25.11,24.63,23.91],[22.85,22.86,23.91,31.52]]
(3 rows)
=> SELECT EXPLODE(name, pingtimes USING PARAMETERS explode_count=1) OVER()
FROM network_tests;
name | position | value
------+----------+---------------------------
eng1 | 0 | [24.24,25.27,27.16,24.97]
eng1 | 1 | [23.97,25.01,28.12,29.5]
eng2 | 0 | [27.12,27.91,28.11,26.95]
eng2 | 1 | [29.01,28.99,30.11,31.56]
qa1 | 0 | [23.15,25.11,24.63,23.91]
qa1 | 1 | [22.85,22.86,23.91,31.52]
(6 rows)
有关在 Vertica 中实施这些数据类型的详细信息,请参阅 ARRAY 和 SET。
压缩和筛选数组
IMPLODE 函数是 EXPLODE 的反函数:它接受一列,并生成一个包含该列值的数组。与 GROUP BY 结合使用,可用于反转分割操作。
您可以同时使用 EXPLODE 和 IMPLODE 来筛选数组值。例如,在一组价格为数组值的订单中,您可能只想查询价格低于某个阈值的订单。请考虑下表:
=> SELECT * FROM orders;
key | prices
-----+-------------------
567 | [27.99,18.99]
789 | [108.0]
345 | [14.99,35.99]
123 | [60.0,67.0,14.99]
(4 rows)
您可以使用 EXPLODE 来展开数组。为清楚起见,此示例创建一个新表来保存结果。更典型的做法是,您在子查询中使用 EXPLODE 和 IMPLODE 而不是创建中间表。
=> CREATE TABLE exploded AS
SELECT EXPLODE(prices,key) OVER (PARTITION BEST)
AS (position, itemprice, itemkey) FROM orders;
=> SELECT * FROM exploded;
position | itemprice | itemkey
----------+-----------+---------
0 | 108 | 789
1 | 35.99 | 345
0 | 14.99 | 345
0 | 27.99 | 567
0 | 60 | 123
1 | 18.99 | 567
1 | 67 | 123
2 | 14.99 | 123
(8 rows)
您现在可以筛选已分割的价格:
=> CREATE TABLE filtered AS
SELECT position, itemprice, itemkey FROM orders WHERE itemprice < 50.00;
=> SELECT * FROM filtered;
position | itemprice | itemkey
----------+-----------+---------
0 | 14.99 | 345
0 | 27.99 | 567
1 | 18.99 | 567
1 | 35.99 | 345
2 | 14.99 | 123
(5 rows)
最后,您可以使用 IMPLODE 来重构数组:
=> SELECT itemkey AS key, IMPLODE(itemprice) AS prices
FROM filtered GROUP BY itemkey ORDER BY itemkey;
key | prices
-----+-------------------
123 | ["14.99"]
345 | ["35.99","14.99"]
567 | ["27.99","18.99"]
(3 rows)
您可以通过包括 WITHIN GROUP ORDER BY 子句来修改此查询,该子句指定如何对每个组中的数组元素进行排序:
=> SELECT itemkey AS key, IMPLODE(itemprice) WITHIN GROUP (ORDER BY itemprice) AS prices
FROM filtered GROUP BY itemkey ORDER BY itemkey;
key | prices
-----+-------------------
123 | ["14.99"]
345 | ["14.99","35.99"]
567 | ["18.99","27.99"]
(3 rows)
如果 IMPLODE 要返回的数组对于该列来说太大,则该函数将返回一个错误。为避免这种情况,您可以将 allow_truncate 参数设置为在结果中省略一些元素。截断永远不会应用于单个元素;例如,该函数不会缩短字符串。