The aggregate function COUNT(DISTINCT) computes the exact number of distinct values in a data set.
The expected value that APPROXIMATE_COUNT_DISTINCT returns is equal to COUNT(DISTINCT), with an error that is lognormally distributed with standard deviation s. You can control the standard deviation by setting the function's optional error tolerance argument—by default, 1.25 percent.
The following example shows how to use APPROXIMATE_COUNT_DISTINCT functions to keep an approximate running count of users who click on a given web page within a given time span.
-
Create the pviews
table to store data about website visits—time of visit, web page visited, and visitor:
=> CREATE TABLE pviews(
visit_time TIMESTAMP NOT NULL,
page_id INTEGER NOT NULL,
user_id INTEGER NOT NULL)
ORDER BY page_id, visit_time
SEGMENTED BY HASH(user_id) ALL NODES KSAFE
PARTITION BY visit_time::DATE GROUP BY CALENDAR_HIERARCHY_DAY(visit_time::DATE, 2, 2);
pviews
is segmented by hashing user_id
data, so all visits by a given user are stored on the same segment, on the same node. This prevents inefficient cross-node transfer of data, when later we do a COUNT (DISTINCT user_id).
The table also uses hierarchical partitioning on time of visit to optimize the ROS storage. Doing so improves performance when filtering data by time.
-
Load data into pviews
:
=> INSERT INTO pviews VALUES
('2022-02-01 10:00:02',1002,1),
('2022-02-01 10:00:03',1002,2),
('2022-02-01 10:00:04',1002,1),
('2022-02-01 10:00:05',1002,3),
('2022-02-01 10:00:01',1000,1),
('2022-02-01 10:00:06',1002,1),
('2022-02-01 10:00:07',1002,3),
('2022-02-01 10:00:08',1002,1),
('2022-02-01 10:00:09',1002,3),
('2022-02-01 10:00:12',1002,2),
('2022-02-02 10:00:01',1000,1),
('2022-02-02 10:00:02',1002,4),
('2022-02-02 10:00:03',1002,2),
('2022-02-02 10:00:04',1002,1),
('2022-02-02 10:00:05',1002,3),
('2022-02-02 10:00:06',1002,4),
('2022-02-02 10:00:07',1002,3),
('2022-02-02 10:00:08',1002,4),
('2022-02-02 10:00:09',1002,3),
('2022-02-02 10:00:12',1002,2),
('2022-03-02 10:00:01',1000,1),
('2022-03-02 10:00:02',1002,1),
('2022-03-02 10:00:03',1002,2),
('2022-03-02 10:00:04',1002,1),
('2022-03-02 10:00:05',1002,3),
('2022-03-02 10:00:06',1002,4),
('2022-03-02 10:00:07',1002,3),
('2022-03-02 10:00:08',1002,6),
('2022-03-02 10:00:09',1002,5),
('2022-03-02 10:00:12',1002,2),
('2022-03-02 11:00:01',1000,5),
('2022-03-02 11:00:02',1002,6),
('2022-03-02 11:00:03',1002,7),
('2022-03-02 11:00:04',1002,4),
('2022-03-02 11:00:05',1002,1),
('2022-03-02 11:00:06',1002,6),
('2022-03-02 11:00:07',1002,8),
('2022-03-02 11:00:08',1002,6),
('2022-03-02 11:00:09',1002,7),
('2022-03-02 11:00:12',1002,1),
('2022-03-03 10:00:01',1000,1),
('2022-03-03 10:00:02',1002,2),
('2022-03-03 10:00:03',1002,4),
('2022-03-03 10:00:04',1002,1),
('2022-03-03 10:00:05',1002,2),
('2022-03-03 10:00:06',1002,6),
('2022-03-03 10:00:07',1002,9),
('2022-03-03 10:00:08',1002,10),
('2022-03-03 10:00:09',1002,7),
('2022-03-03 10:00:12',1002,1);
OUTPUT
--------
50
(1 row)
=> COMMIT;
COMMIT
-
Create the pview_summary
table by querying pviews
with CREATE TABLE...AS SELECT. Each row of this table summarizes data selected from pviews
for a given date:
-
partial_visit_count
stores the number of rows (website visits) in pviews
with that date.
-
daily_users_acdp
uses APPROXIMATE_COUNT_DISTINCT_SYNOPSIS to construct a synopsis that approximates the number of distinct users (user_id
) who visited that website on that date.
=> CREATE TABLE pview_summary AS SELECT
visit_time::DATE "date",
COUNT(*) partial_visit_count,
APPROXIMATE_COUNT_DISTINCT_SYNOPSIS(user_id) AS daily_users_acdp
FROM pviews GROUP BY 1;
CREATE TABLE
=> ALTER TABLE pview_summary ALTER COLUMN "date" SET NOT NULL;
-
Update the pview_summary
table so it is partitioned like pviews
. The REORGANIZE keyword forces immediate repartitioning of the table data:
=> ALTER TABLE pview_summary
PARTITION BY "date"
GROUP BY CALENDAR_HIERARCHY_DAY("date", 2, 2) REORGANIZE;
vsql:/home/ale/acd_ex4.sql:93: NOTICE 8364: The new partitioning scheme will produce partitions in 2 physical storage containers per projection
vsql:/home/ale/acd_ex4.sql:93: NOTICE 4785: Started background repartition table task
ALTER TABLE
-
Use CREATE TABLE..LIKE to create two ETL tables, pviews_etl
and pview_summary_etl
with the same DDL as pviews
and pview_summary
, respectively. These tables serve to process incoming data:
=> CREATE TABLE pviews_etl LIKE pviews INCLUDING PROJECTIONS;
CREATE TABLE
=> CREATE TABLE pview_summary_etl LIKE pview_summary INCLUDING PROJECTIONS;
CREATE TABLE
-
Load new data into pviews_etl
:
=> INSERT INTO pviews_etl VALUES
('2022-03-03 11:00:01',1000,8),
('2022-03-03 11:00:02',1002,9),
('2022-03-03 11:00:03',1002,1),
('2022-03-03 11:00:04',1002,11),
('2022-03-03 11:00:05',1002,10),
('2022-03-03 11:00:06',1002,12),
('2022-03-03 11:00:07',1002,3),
('2022-03-03 11:00:08',1002,10),
('2022-03-03 11:00:09',1002,1),
('2022-03-03 11:00:12',1002,1);
OUTPUT
--------
10
(1 row)
=> COMMIT;
COMMIT
-
Summarize the new data in pview_summary_etl
:
=> INSERT INTO pview_summary_etl SELECT
visit_time::DATE visit_date,
COUNT(*) partial_visit_count,
APPROXIMATE_COUNT_DISTINCT_SYNOPSIS(user_id) AS daily_users_acdp
FROM pviews_etl GROUP BY visit_date;
OUTPUT
--------
1
(1 row)
-
Append the pviews_etl
data to pviews
with COPY_PARTITIONS_TO_TABLE:
=> SELECT COPY_PARTITIONS_TO_TABLE('pviews_etl', '01-01-0000'::DATE, '01-01-9999'::DATE, 'pviews');
COPY_PARTITIONS_TO_TABLE
----------------------------------------------------
1 distinct partition values copied at epoch 1403.
(1 row)
=> SELECT COPY_PARTITIONS_TO_TABLE('pview_summary_etl', '01-01-0000'::DATE, '01-01-9999'::DATE, 'pview_summary');
COPY_PARTITIONS_TO_TABLE
----------------------------------------------------
1 distinct partition values copied at epoch 1404.
(1 row)
-
Create views and distinct (approximate) views by day for all data, including the partition that was just copied from pviews_etl
:
=> SELECT
"date" visit_date,
SUM(partial_visit_count) visit_count,
APPROXIMATE_COUNT_DISTINCT_OF_SYNOPSIS(daily_users_acdp) AS daily_users_acd
FROM pview_summary GROUP BY visit_date ORDER BY visit_date;
visit_date | visit_count | daily_users_acd
------------+-------------+-----------------
2022-02-01 | 10 | 3
2022-02-02 | 10 | 4
2022-03-02 | 20 | 8
2022-03-03 | 20 | 11
(4 rows)
-
Create views and distinct (approximate) views by month:
=> SELECT
DATE_TRUNC('MONTH', "date")::DATE "month",
SUM(partial_visit_count) visit_count,
APPROXIMATE_COUNT_DISTINCT_OF_SYNOPSIS(daily_users_acdp) AS monthly_users_acd
FROM pview_summary GROUP BY month ORDER BY month;
month | visit_count | monthly_users_acd
------------+-------------+-------------------
2022-02-01 | 20 | 4
2022-03-01 | 40 | 12
(2 rows)
-
Merge daily synopses into monthly synopses:
=> CREATE TABLE pview_monthly_summary AS SELECT
DATE_TRUNC('MONTH', "date")::DATE "month",
SUM(partial_visit_count) partial_visit_count,
APPROXIMATE_COUNT_DISTINCT_SYNOPSIS_MERGE(daily_users_acdp) AS monthly_users_acdp
FROM pview_summary GROUP BY month ORDER BY month;
CREATE TABLE
-
Create views and distinct views by month, generated from the merged synopses:
=> SELECT
month,
SUM(partial_visit_count) monthly_visit_count,
APPROXIMATE_COUNT_DISTINCT_OF_SYNOPSIS(monthly_users_acdp) AS monthly_users_acd
FROM pview_monthly_summary GROUP BY month ORDER BY month;
month | monthly_visit_count | monthly_users_acd
------------+---------------------+-------------------
2019-02-01 | 20 | 4
2019-03-01 | 40 | 12
(2 rows)
-
You can use the monthly summary to produce a yearly summary. This approach is likely to be faster than using a daily summary if a lot of data needs to be processed:
=> SELECT
DATE_TRUNC('YEAR', "month")::DATE "year",
SUM(partial_visit_count) yearly_visit_count,
APPROXIMATE_COUNT_DISTINCT_OF_SYNOPSIS(monthly_users_acdp) AS yearly_users_acd
FROM pview_monthly_summary GROUP BY year ORDER BY year;
year | yearly_visit_count | yearly_users_acd
------------+--------------------+------------------
2022-01-01 | 60 | 12
(1 row)
-
Drop the ETL tables:
=> DROP TABLE IF EXISTS pviews_etl, pview_summary_etl;
DROP TABLE