近似计数区分函数
聚合函数 COUNT(DISTINCT) 计算数据集中非重复值的确切数量。COUNT(DISTINCT) 在使用 GROUPBY PIPELINED 算法执行计算时表现良好。
当满足以下条件时,聚合 COUNT 操作在数据集上表现良好:
-
目标表投影之一包含 ORDER BY 子句,便于对聚合进行排序。
-
非重复值的数量非常小。
-
需要使用哈希聚合来执行查询。
或者,当满足以下条件时,考虑使用 APPROXIMATE_COUNT_DISTINCT 函数而不是 COUNT(DISTINCT):
-
您有一个大型数据集,而且您不需要获得非重复值的确切数量。
-
COUNT(DISTINCT) 对于给定数据集的性能是不足的。
-
您在同一个查询中计算几个非重复值计数。
-
COUNT(DISTINCT) 的计划使用哈希聚合。
APPROXIMATE_COUNT_DISTINCT 返回的预期值等于 COUNT(DISTINCT),误差呈对数正态分布,标准差为 s。您可以通过设置函数的可选误差容限实参来控制标准偏差(默认情况下为 1.25%)。
其他 APPROXIMATE_COUNT_DISTINCT 函数
下面是 Vertica 支持您一起使用的其他两个函数(而不是 APPROXIMATE_COUNT_DISTINCT): APPROXIMATE_COUNT_DISTINCT_SYNOPSIS 和 APPROXIMATE_COUNT_DISTINCT_OF_SYNOPSIS。当满足以下条件时使用这些函数:
-
您有一个大型数据集,而且您不需要获得非重复值的确切数量。
-
COUNT(DISTINCT) 对于给定数据集的性能是不足的。
-
您希望预先计算非重复值计数,并在之后使用不同的方法合并它们。
将这两个函数一起使用,如下所示:
-
向 APPROXIMATE_COUNT_DISTINCT_SYNOPSIS 传递数据集以及正态分布的置信区间。此函数会返回称为“二进制概要对象”的数据子集。
-
将概要传递给 APPROXIMATE_COUNT_DISTINCT_OF_SYNOPSIS 函数,然后对概要计算非重复值近似计数。
您还可以使用 APPROXIMATE_COUNT_DISTINCT_SYNOPSIS_MERGE,它将多个概要合并为一个概要。使用此功能,您可以通过合并一个或多个涵盖最近、较短时间段的概要来不断更新“主”概要。
示例
以下示例显示如何使用 APPROXIMATE_COUNT_DISTINCT 函数来保持给定时间段内单击给定网页的用户的大致计数。
-
创建
pviews
表来存储有关网站访问的数据(访问时间、访问的网页和访问者):=> CREATE TABLE pviews( visit_time TIMESTAMP NOT NULL, page_id INTEGER NOT NULL, user_id INTEGER NOT NULL) ORDER BY page_id, visit_time SEGMENTED BY HASH(user_id) ALL NODES KSAFE PARTITION BY visit_time::DATE GROUP BY CALENDAR_HIERARCHY_DAY(visit_time::DATE, 2, 2);
pviews
通过散列user_id
数据进行分段,因此给定用户的所有访问都存储在同一节点的同一段上。这可以防止稍后执行 COUNT (DISTINCT user_id) 时数据低效跨节点传输。该表还在访问时使用分层分区来优化 ROS 存储。这样做可以提高按时间筛选数据时的性能。
-
将数据加载到
pviews
中:=> INSERT INTO pviews VALUES ('2022-02-01 10:00:02',1002,1), ('2022-02-01 10:00:03',1002,2), ('2022-02-01 10:00:04',1002,1), ('2022-02-01 10:00:05',1002,3), ('2022-02-01 10:00:01',1000,1), ('2022-02-01 10:00:06',1002,1), ('2022-02-01 10:00:07',1002,3), ('2022-02-01 10:00:08',1002,1), ('2022-02-01 10:00:09',1002,3), ('2022-02-01 10:00:12',1002,2), ('2022-02-02 10:00:01',1000,1), ('2022-02-02 10:00:02',1002,4), ('2022-02-02 10:00:03',1002,2), ('2022-02-02 10:00:04',1002,1), ('2022-02-02 10:00:05',1002,3), ('2022-02-02 10:00:06',1002,4), ('2022-02-02 10:00:07',1002,3), ('2022-02-02 10:00:08',1002,4), ('2022-02-02 10:00:09',1002,3), ('2022-02-02 10:00:12',1002,2), ('2022-03-02 10:00:01',1000,1), ('2022-03-02 10:00:02',1002,1), ('2022-03-02 10:00:03',1002,2), ('2022-03-02 10:00:04',1002,1), ('2022-03-02 10:00:05',1002,3), ('2022-03-02 10:00:06',1002,4), ('2022-03-02 10:00:07',1002,3), ('2022-03-02 10:00:08',1002,6), ('2022-03-02 10:00:09',1002,5), ('2022-03-02 10:00:12',1002,2), ('2022-03-02 11:00:01',1000,5), ('2022-03-02 11:00:02',1002,6), ('2022-03-02 11:00:03',1002,7), ('2022-03-02 11:00:04',1002,4), ('2022-03-02 11:00:05',1002,1), ('2022-03-02 11:00:06',1002,6), ('2022-03-02 11:00:07',1002,8), ('2022-03-02 11:00:08',1002,6), ('2022-03-02 11:00:09',1002,7), ('2022-03-02 11:00:12',1002,1), ('2022-03-03 10:00:01',1000,1), ('2022-03-03 10:00:02',1002,2), ('2022-03-03 10:00:03',1002,4), ('2022-03-03 10:00:04',1002,1), ('2022-03-03 10:00:05',1002,2), ('2022-03-03 10:00:06',1002,6), ('2022-03-03 10:00:07',1002,9), ('2022-03-03 10:00:08',1002,10), ('2022-03-03 10:00:09',1002,7), ('2022-03-03 10:00:12',1002,1); OUTPUT -------- 50 (1 row) => COMMIT; COMMIT
-
通过使用 CREATE TABLE...AS SELECT 查询
pviews
来创建pview_summary
表。此表的每一行汇总了从pviews
中选择的给定日期的数据:-
partial_visit_count
在pviews
中存储具有该日期的行数(网站访问)。 -
daily_users_acdp
使用 APPROXIMATE_COUNT_DISTINCT_SYNOPSIS 构建一个概要,该概要近似计算在该日期访问该网站的不同用户 (user_id
) 的数量。
=> CREATE TABLE pview_summary AS SELECT visit_time::DATE "date", COUNT(*) partial_visit_count, APPROXIMATE_COUNT_DISTINCT_SYNOPSIS(user_id) AS daily_users_acdp FROM pviews GROUP BY 1; CREATE TABLE => ALTER TABLE pview_summary ALTER COLUMN "date" SET NOT NULL;
-
-
更新
pview_summary
表,使其像pviews
一样进行分区。REORGANIZE 关键字强制表数据的立即重新分区:=> ALTER TABLE pview_summary PARTITION BY "date" GROUP BY CALENDAR_HIERARCHY_DAY("date", 2, 2) REORGANIZE; vsql:/home/ale/acd_ex4.sql:93: NOTICE 8364: The new partitioning scheme will produce partitions in 2 physical storage containers per projection vsql:/home/ale/acd_ex4.sql:93: NOTICE 4785: Started background repartition table task ALTER TABLE
-
使用 CREATE TABLE...LIKE 创建两个 ETL 表
pviews_etl
和pview_summary_etl
,它们分别具有与pviews
和pview_summary
相同的 DDL。这些表用于处理传入数据:=> CREATE TABLE pviews_etl LIKE pviews INCLUDING PROJECTIONS; CREATE TABLE => CREATE TABLE pview_summary_etl LIKE pview_summary INCLUDING PROJECTIONS; CREATE TABLE
-
将新数据加载到
pviews_etl
:=> INSERT INTO pviews_etl VALUES ('2022-03-03 11:00:01',1000,8), ('2022-03-03 11:00:02',1002,9), ('2022-03-03 11:00:03',1002,1), ('2022-03-03 11:00:04',1002,11), ('2022-03-03 11:00:05',1002,10), ('2022-03-03 11:00:06',1002,12), ('2022-03-03 11:00:07',1002,3), ('2022-03-03 11:00:08',1002,10), ('2022-03-03 11:00:09',1002,1), ('2022-03-03 11:00:12',1002,1); OUTPUT -------- 10 (1 row) => COMMIT; COMMIT
-
在
pview_summary_etl
中汇总新数据:=> INSERT INTO pview_summary_etl SELECT visit_time::DATE visit_date, COUNT(*) partial_visit_count, APPROXIMATE_COUNT_DISTINCT_SYNOPSIS(user_id) AS daily_users_acdp FROM pviews_etl GROUP BY visit_date; OUTPUT -------- 1 (1 row)
-
使用 COPY_PARTITIONS_TO_TABLE 将
pviews_etl
数据附加到pviews
:=> SELECT COPY_PARTITIONS_TO_TABLE('pviews_etl', '01-01-0000'::DATE, '01-01-9999'::DATE, 'pviews'); COPY_PARTITIONS_TO_TABLE ---------------------------------------------------- 1 distinct partition values copied at epoch 1403. (1 row) => SELECT COPY_PARTITIONS_TO_TABLE('pview_summary_etl', '01-01-0000'::DATE, '01-01-9999'::DATE, 'pview_summary'); COPY_PARTITIONS_TO_TABLE ---------------------------------------------------- 1 distinct partition values copied at epoch 1404. (1 row)
-
为所有数据(包括刚刚从
pviews_etl
复制的分区)创建视图和不同(近似)视图:=> SELECT "date" visit_date, SUM(partial_visit_count) visit_count, APPROXIMATE_COUNT_DISTINCT_OF_SYNOPSIS(daily_users_acdp) AS daily_users_acd FROM pview_summary GROUP BY visit_date ORDER BY visit_date; visit_date | visit_count | daily_users_acd ------------+-------------+----------------- 2022-02-01 | 10 | 3 2022-02-02 | 10 | 4 2022-03-02 | 20 | 8 2022-03-03 | 20 | 11 (4 rows)
-
按月创建视图和不同(近似)视图:
=> SELECT DATE_TRUNC('MONTH', "date")::DATE "month", SUM(partial_visit_count) visit_count, APPROXIMATE_COUNT_DISTINCT_OF_SYNOPSIS(daily_users_acdp) AS monthly_users_acd FROM pview_summary GROUP BY month ORDER BY month; month | visit_count | monthly_users_acd ------------+-------------+------------------- 2022-02-01 | 20 | 4 2022-03-01 | 40 | 12 (2 rows)
-
将每日概要合并到每月概要:
=> CREATE TABLE pview_monthly_summary AS SELECT DATE_TRUNC('MONTH', "date")::DATE "month", SUM(partial_visit_count) partial_visit_count, APPROXIMATE_COUNT_DISTINCT_SYNOPSIS_MERGE(daily_users_acdp) AS monthly_users_acdp FROM pview_summary GROUP BY month ORDER BY month; CREATE TABLE
-
按月创建根据合并的概要生成的视图和不同的视图:
=> SELECT month, SUM(partial_visit_count) monthly_visit_count, APPROXIMATE_COUNT_DISTINCT_OF_SYNOPSIS(monthly_users_acdp) AS monthly_users_acd FROM pview_monthly_summary GROUP BY month ORDER BY month; month | monthly_visit_count | monthly_users_acd ------------+---------------------+------------------- 2019-02-01 | 20 | 4 2019-03-01 | 40 | 12 (2 rows)
-
您可以使用每月摘要生成年度摘要。当需要处理大量数据时,这种方法可能比使用每日摘要快:
=> SELECT DATE_TRUNC('YEAR', "month")::DATE "year", SUM(partial_visit_count) yearly_visit_count, APPROXIMATE_COUNT_DISTINCT_OF_SYNOPSIS(monthly_users_acdp) AS yearly_users_acd FROM pview_monthly_summary GROUP BY year ORDER BY year; year | yearly_visit_count | yearly_users_acd ------------+--------------------+------------------ 2022-01-01 | 60 | 12 (1 row)
-
删除 ETL 表:
=> DROP TABLE IF EXISTS pviews_etl, pview_summary_etl; DROP TABLE