APPROXIMATE_COUNT_DISTINCT
Returns the number of distinct non-NULL values in a data set.
Behavior type
ImmutableSyntax
APPROXIMATE_COUNT_DISTINCT ( expression[, error-tolerance ] )
Parameters
expression
- Value to be evaluated using any data type that supports equality comparison.
error-tolerance
Numeric value that represents the desired percentage of error tolerance, distributed around the value returned by this function. The smaller the error tolerance, the closer the approximation.
You can set
error-tolerance
to a minimum value of 0.88. Vertica imposes no maximum restriction, but any value greater than 5 is implemented with 5% error tolerance.If you omit this argument, Vertica uses an error tolerance of 1.25(%).
Restrictions
APPROXIMATE_COUNT_DISTINCT and DISTINCT aggregates cannot be in the same query block.
Error tolerance
APPROXIMATE_COUNT_DISTINCT(
x
,
error-tolerance
)
returns a value equal to COUNT(DISTINCT
x
)
, with an error that is lognormally distributed with standard deviation.
Parameter error-tolerance
is optional. Supply this argument to specify the desired standard deviation. error-tolerance
is defined as 2.17 standard deviations, which corresponds to a 97 percent confidence interval:
standard-deviation = error-tolerance / 2.17
For example:
-
error-tolerance
= 1
Default setting, corresponds to a standard deviation
97 percent of the time, APPROXIMATE_COUNT_DISTINCT(
x,5
) returns a value between:-
COUNT(DISTINCT
x
) * 0.99
-
COUNT(DISTINCT
x
) * 1.01
-
-
error-tolerance
= 5
97 percent of the time,
APPROXIMATE_COUNT_DISTINCT(
x
)
returns a value between:-
COUNT(DISTINCT
x
) * 0.95
-
COUNT(DISTINCT
x
) * 1.05
-
A 99 percent confidence interval corresponds to 2.58
standard deviations. To set error-tolerance
confidence level corresponding to 99 (instead of a 97) percent , multiply error-tolerance
by 2.17 / 2.58 = 0.841
.
For example, if you specify error-tolerance
as 5 * 0.841 = 4.2
, APPROXIMATE_COUNT_DISTINCT(
x,4.2
)
returns values 99 percent of the time between:
-
COUNT (DISTINCT
x
) * 0.95
-
COUNT (DISTINCT
x
) * 1.05
Examples
Count the total number of distinct values in column product_key
from table store.store_sales_fact
:
=> SELECT COUNT(DISTINCT product_key) FROM store.store_sales_fact;
COUNT
-------
19982
(1 row)
Count the approximate number of distinct values in product_key
with various error tolerances. The smaller the error tolerance, the closer the approximation:
=> SELECT APPROXIMATE_COUNT_DISTINCT(product_key,5) AS five_pct_accuracy,
APPROXIMATE_COUNT_DISTINCT(product_key,1) AS one_pct_accuracy,
APPROXIMATE_COUNT_DISTINCT(product_key,.88) AS point_eighteight_pct_accuracy
FROM store.store_sales_fact;
five_pct_accuracy | one_pct_accuracy | point_eighteight_pct_accuracy
-------------------+------------------+-------------------------------
19431 | 19921 | 19921
(1 row)