APPROXIMATE_COUNT_DISTINCT

Returns the number of distinct non-NULL values in a data set.

Returns the number of distinct non-NULL values in a data set.

Behavior type

Immutable

Syntax

APPROXIMATE_COUNT_DISTINCT ( expression[, error-tolerance ] )

Parameters

expression
Value to be evaluated using any data type that supports equality comparison.
error-tolerance

Numeric value that represents the desired percentage of error tolerance, distributed around the value returned by this function. The smaller the error tolerance, the closer the approximation.

You can set error-tolerance to a minimum value of 0.88. Vertica imposes no maximum restriction, but any value greater than 5 is implemented with 5% error tolerance.

If you omit this argument, Vertica uses an error tolerance of 1.25(%).

Restrictions

APPROXIMATE_COUNT_DISTINCT and DISTINCT aggregates cannot be in the same query block.

Error tolerance

APPROXIMATE_COUNT_DISTINCT(x, error-tolerance) returns a value equal to COUNT(DISTINCT x), with an error that is lognormally distributed with standard deviation.

Parameter error-tolerance is optional. Supply this argument to specify the desired standard deviation. error-tolerance is defined as 2.17 standard deviations, which corresponds to a 97 percent confidence interval:

standard-deviation = error-tolerance / 2.17

For example:

  • error-tolerance = 1

    Default setting, corresponds to a standard deviation

    97 percent of the time, APPROXIMATE_COUNT_DISTINCT(x,5) returns a value between:

    • COUNT(DISTINCT x) * 0.99

    • COUNT(DISTINCT x) * 1.01

  • error-tolerance = 5

    97 percent of the time, APPROXIMATE_COUNT_DISTINCT(x) returns a value between:

    • COUNT(DISTINCT x) * 0.95

    • COUNT(DISTINCT x) * 1.05

A 99 percent confidence interval corresponds to 2.58 standard deviations. To set error-tolerance confidence level corresponding to 99 (instead of a 97) percent , multiply error-tolerance by 2.17 / 2.58 = 0.841.

For example, if you specify *error-tolerance* as 5 * 0.841 = 4.2, APPROXIMATE_COUNT_DISTINCT(x,4.2) returns values 99 percent of the time between:

  • COUNT (DISTINCT x) * 0.95

  • COUNT (DISTINCT x) * 1.05

Examples

Count the total number of distinct values in column product_key from table store.store_sales_fact:

=> SELECT COUNT(DISTINCT product_key) FROM store.store_sales_fact;
 COUNT
-------
 19982
(1 row)

Count the approximate number of distinct values in product_key with various error tolerances. The smaller the error tolerance, the closer the approximation:


=> SELECT APPROXIMATE_COUNT_DISTINCT(product_key,5) AS five_pct_accuracy,
   APPROXIMATE_COUNT_DISTINCT(product_key,1) AS one_pct_accuracy,
   APPROXIMATE_COUNT_DISTINCT(product_key,.88) AS point_eighteight_pct_accuracy
   FROM store.store_sales_fact;

 five_pct_accuracy | one_pct_accuracy | point_eighteight_pct_accuracy
-------------------+------------------+-------------------------------
             19431 |            19921 |                         19921
(1 row)

See also

Approximate count distinct functions