近似计数区分函数

聚合函数 COUNT(DISTINCT) 计算数据集中非重复值的确切数量。COUNT(DISTINCT) 在使用 GROUPBY PIPELINED 算法执行计算时表现良好。

当满足以下条件时,聚合 COUNT 操作在数据集上表现良好:

  • 目标表投影之一包含 ORDER BY 子句,便于对聚合进行排序。

  • 非重复值的数量非常小。

  • 需要使用哈希聚合来执行查询。

或者,当满足以下条件时,考虑使用 APPROXIMATE_COUNT_DISTINCT 函数而不是 COUNT(DISTINCT):

  • 您有一个大型数据集,而且您不需要获得非重复值的确切数量。

  • COUNT(DISTINCT) 对于给定数据集的性能是不足的。

  • 您在同一个查询中计算几个非重复值计数。

  • COUNT(DISTINCT) 的计划使用哈希聚合。

APPROXIMATE_COUNT_DISTINCT 返回的预期值等于 COUNT(DISTINCT),误差呈对数正态分布,标准差为 s。您可以通过设置函数的可选误差容限实参来控制标准偏差(默认情况下为 1.25%)。

其他 APPROXIMATE_COUNT_DISTINCT 函数

下面是 Vertica 支持您一起使用的其他两个函数(而不是 APPROXIMATE_COUNT_DISTINCT): APPROXIMATE_COUNT_DISTINCT_SYNOPSISAPPROXIMATE_COUNT_DISTINCT_OF_SYNOPSIS。当满足以下条件时使用这些函数:

  • 您有一个大型数据集,而且您不需要获得非重复值的确切数量。

  • COUNT(DISTINCT) 对于给定数据集的性能是不足的。

  • 您希望预先计算非重复值计数,并在之后使用不同的方法合并它们。

将这两个函数一起使用,如下所示:

  1. 向 APPROXIMATE_COUNT_DISTINCT_SYNOPSIS 传递数据集以及正态分布的置信区间。此函数会返回称为“二进制概要对象”的数据子集。

  2. 将概要传递给 APPROXIMATE_COUNT_DISTINCT_OF_SYNOPSIS 函数,然后对概要计算非重复值近似计数。

您还可以使用 APPROXIMATE_COUNT_DISTINCT_SYNOPSIS_MERGE,它将多个概要合并为一个概要。使用此功能,您可以通过合并一个或多个涵盖最近、较短时间段的概要来不断更新“主”概要。

示例

以下示例显示如何使用 APPROXIMATE_COUNT_DISTINCT 函数来保持给定时间段内单击给定网页的用户的大致计数。

  1. 创建 pviews 表来存储有关网站访问的数据(访问时间、访问的网页和访问者):

    
    => CREATE TABLE pviews(
    visit_time TIMESTAMP NOT NULL,
    page_id INTEGER NOT NULL,
    user_id INTEGER NOT NULL)
    ORDER BY page_id, visit_time
    SEGMENTED BY HASH(user_id) ALL NODES KSAFE
    PARTITION BY visit_time::DATE GROUP BY CALENDAR_HIERARCHY_DAY(visit_time::DATE, 2, 2);
    

    pviews 通过散列 user_id 数据进行分段,因此给定用户的所有访问都存储在同一节点的同一段上。这可以防止稍后执行 COUNT (DISTINCT user_id) 时数据低效跨节点传输。

    该表还在访问时使用分层分区来优化 ROS 存储。这样做可以提高按时间筛选数据时的性能。

  2. 将数据加载到 pviews 中:

    => INSERT INTO pviews VALUES
         ('2022-02-01 10:00:02',1002,1),
         ('2022-02-01 10:00:03',1002,2),
         ('2022-02-01 10:00:04',1002,1),
         ('2022-02-01 10:00:05',1002,3),
         ('2022-02-01 10:00:01',1000,1),
         ('2022-02-01 10:00:06',1002,1),
         ('2022-02-01 10:00:07',1002,3),
         ('2022-02-01 10:00:08',1002,1),
         ('2022-02-01 10:00:09',1002,3),
         ('2022-02-01 10:00:12',1002,2),
         ('2022-02-02 10:00:01',1000,1),
         ('2022-02-02 10:00:02',1002,4),
         ('2022-02-02 10:00:03',1002,2),
         ('2022-02-02 10:00:04',1002,1),
         ('2022-02-02 10:00:05',1002,3),
         ('2022-02-02 10:00:06',1002,4),
         ('2022-02-02 10:00:07',1002,3),
         ('2022-02-02 10:00:08',1002,4),
         ('2022-02-02 10:00:09',1002,3),
         ('2022-02-02 10:00:12',1002,2),
         ('2022-03-02 10:00:01',1000,1),
         ('2022-03-02 10:00:02',1002,1),
         ('2022-03-02 10:00:03',1002,2),
         ('2022-03-02 10:00:04',1002,1),
         ('2022-03-02 10:00:05',1002,3),
         ('2022-03-02 10:00:06',1002,4),
         ('2022-03-02 10:00:07',1002,3),
         ('2022-03-02 10:00:08',1002,6),
         ('2022-03-02 10:00:09',1002,5),
         ('2022-03-02 10:00:12',1002,2),
         ('2022-03-02 11:00:01',1000,5),
         ('2022-03-02 11:00:02',1002,6),
         ('2022-03-02 11:00:03',1002,7),
         ('2022-03-02 11:00:04',1002,4),
         ('2022-03-02 11:00:05',1002,1),
         ('2022-03-02 11:00:06',1002,6),
         ('2022-03-02 11:00:07',1002,8),
         ('2022-03-02 11:00:08',1002,6),
         ('2022-03-02 11:00:09',1002,7),
         ('2022-03-02 11:00:12',1002,1),
         ('2022-03-03 10:00:01',1000,1),
         ('2022-03-03 10:00:02',1002,2),
         ('2022-03-03 10:00:03',1002,4),
         ('2022-03-03 10:00:04',1002,1),
         ('2022-03-03 10:00:05',1002,2),
         ('2022-03-03 10:00:06',1002,6),
         ('2022-03-03 10:00:07',1002,9),
         ('2022-03-03 10:00:08',1002,10),
         ('2022-03-03 10:00:09',1002,7),
         ('2022-03-03 10:00:12',1002,1);
     OUTPUT
    --------
         50
    (1 row)
    
    => COMMIT;
    COMMIT
    
  3. 通过使用 CREATE TABLE...AS SELECT 查询 pviews 来创建 pview_summary 表。此表的每一行汇总了从 pviews 中选择的给定日期的数据:

    • partial_visit_countpviews 中存储具有该日期的行数(网站访问)。

    • daily_users_acdp 使用 APPROXIMATE_COUNT_DISTINCT_SYNOPSIS 构建一个概要,该概要近似计算在该日期访问该网站的不同用户 (user_id) 的数量。

    
    => CREATE TABLE pview_summary AS SELECT
          visit_time::DATE "date",
          COUNT(*) partial_visit_count,
          APPROXIMATE_COUNT_DISTINCT_SYNOPSIS(user_id) AS daily_users_acdp
       FROM pviews GROUP BY 1;
    CREATE TABLE
    => ALTER TABLE pview_summary ALTER COLUMN "date" SET NOT NULL;
    
  4. 更新 pview_summary 表,使其像 pviews 一样进行分区。REORGANIZE 关键字强制表数据的立即重新分区:

    => ALTER TABLE pview_summary
         PARTITION BY "date"
         GROUP BY CALENDAR_HIERARCHY_DAY("date", 2, 2) REORGANIZE;
    vsql:/home/ale/acd_ex4.sql:93: NOTICE 8364:  The new partitioning scheme will produce partitions in 2 physical storage containers per projection
    vsql:/home/ale/acd_ex4.sql:93: NOTICE 4785:  Started background repartition table task
    ALTER TABLE
    
  5. 使用 CREATE TABLE...LIKE 创建两个 ETL 表 pviews_etlpview_summary_etl,它们分别具有与 pviewspview_summary 相同的 DDL。这些表用于处理传入数据:

    => CREATE TABLE pviews_etl LIKE pviews INCLUDING PROJECTIONS;
    CREATE TABLE
    => CREATE TABLE pview_summary_etl LIKE pview_summary INCLUDING PROJECTIONS;
    CREATE TABLE
    
  6. 将新数据加载到 pviews_etl

    => INSERT INTO pviews_etl VALUES
         ('2022-03-03 11:00:01',1000,8),
         ('2022-03-03 11:00:02',1002,9),
         ('2022-03-03 11:00:03',1002,1),
         ('2022-03-03 11:00:04',1002,11),
         ('2022-03-03 11:00:05',1002,10),
         ('2022-03-03 11:00:06',1002,12),
         ('2022-03-03 11:00:07',1002,3),
         ('2022-03-03 11:00:08',1002,10),
         ('2022-03-03 11:00:09',1002,1),
         ('2022-03-03 11:00:12',1002,1);
     OUTPUT
    --------
         10
    (1 row)
    
    => COMMIT;
    COMMIT
    
  7. pview_summary_etl 中汇总新数据:

    => INSERT INTO pview_summary_etl SELECT
          visit_time::DATE visit_date,
          COUNT(*) partial_visit_count,
          APPROXIMATE_COUNT_DISTINCT_SYNOPSIS(user_id) AS daily_users_acdp
        FROM pviews_etl GROUP BY visit_date;
     OUTPUT
    --------
          1
    (1 row)
    
  8. 使用 COPY_PARTITIONS_TO_TABLEpviews_etl 数据附加到 pviews

    => SELECT COPY_PARTITIONS_TO_TABLE('pviews_etl', '01-01-0000'::DATE, '01-01-9999'::DATE, 'pviews');
                  COPY_PARTITIONS_TO_TABLE
    ----------------------------------------------------
     1 distinct partition values copied at epoch 1403.
    
    (1 row)
    
    => SELECT COPY_PARTITIONS_TO_TABLE('pview_summary_etl', '01-01-0000'::DATE, '01-01-9999'::DATE, 'pview_summary');
                  COPY_PARTITIONS_TO_TABLE
    ----------------------------------------------------
     1 distinct partition values copied at epoch 1404.
    
    (1 row)
    
  9. 为所有数据(包括刚刚从 pviews_etl 复制的分区)创建视图和不同(近似)视图:

    => SELECT
         "date" visit_date,
         SUM(partial_visit_count) visit_count,
         APPROXIMATE_COUNT_DISTINCT_OF_SYNOPSIS(daily_users_acdp) AS daily_users_acd
       FROM pview_summary GROUP BY visit_date ORDER BY visit_date;
     visit_date | visit_count | daily_users_acd
    ------------+-------------+-----------------
     2022-02-01 |          10 |               3
     2022-02-02 |          10 |               4
     2022-03-02 |          20 |               8
     2022-03-03 |          20 |              11
    (4 rows)
    
  10. 按月创建视图和不同(近似)视图:

    => SELECT
         DATE_TRUNC('MONTH', "date")::DATE "month",
         SUM(partial_visit_count) visit_count,
         APPROXIMATE_COUNT_DISTINCT_OF_SYNOPSIS(daily_users_acdp) AS monthly_users_acd
       FROM pview_summary GROUP BY month ORDER BY month;
       month    | visit_count | monthly_users_acd
    ------------+-------------+-------------------
     2022-02-01 |          20 |                 4
     2022-03-01 |          40 |                12
    (2 rows)
    
  11. 将每日概要合并到每月概要:

    => CREATE TABLE pview_monthly_summary AS SELECT
         DATE_TRUNC('MONTH', "date")::DATE "month",
         SUM(partial_visit_count) partial_visit_count,
         APPROXIMATE_COUNT_DISTINCT_SYNOPSIS_MERGE(daily_users_acdp) AS monthly_users_acdp
       FROM pview_summary GROUP BY month ORDER BY month;
    CREATE TABLE
    
  12. 按月创建根据合并的概要生成的视图和不同的视图:

    => SELECT
         month,
         SUM(partial_visit_count) monthly_visit_count,
         APPROXIMATE_COUNT_DISTINCT_OF_SYNOPSIS(monthly_users_acdp) AS monthly_users_acd
       FROM pview_monthly_summary GROUP BY month ORDER BY month;
       month    | monthly_visit_count | monthly_users_acd
    ------------+---------------------+-------------------
     2019-02-01 |                  20 |                 4
     2019-03-01 |                  40 |                12
    (2 rows)
    
  13. 您可以使用每月摘要生成年度摘要。当需要处理大量数据时,这种方法可能比使用每日摘要快:

    => SELECT
         DATE_TRUNC('YEAR', "month")::DATE "year",
         SUM(partial_visit_count) yearly_visit_count,
         APPROXIMATE_COUNT_DISTINCT_OF_SYNOPSIS(monthly_users_acdp) AS yearly_users_acd
       FROM pview_monthly_summary GROUP BY year ORDER BY year;
        year    | yearly_visit_count | yearly_users_acd
    ------------+--------------------+------------------
     2022-01-01 |                 60 |               12
    (1 row)
    
  14. 删除 ETL 表:

    => DROP TABLE IF EXISTS pviews_etl, pview_summary_etl;
    DROP TABLE
    

另请参阅