This is the multi-page printable view of this section. Click here to print.

Return to the regular view of this page.

DISTINCT in a SELECT query list

This section describes how to optimize queries that have the DISTINCT keyword in their SELECT list.

This section describes how to optimize queries that have the DISTINCT keyword in their SELECT list. The techniques for optimizing DISTINCT queries are similar to the techniques for optimizing GROUP BY queries because when processing queries that use DISTINCT, the Vertica optimizer rewrites the query as a GROUP BY query.

The following sections below this page describe specific situations:

Examples in these sections use the following table:

=> CREATE TABLE table1 (
    a INT,
    b INT,
    c INT
);

1 - Query has no aggregates in SELECT list

If your query has no aggregates in the SELECT list, internally, Vertica treats the query as if it uses GROUP BY instead.

If your query has no aggregates in the SELECT list, internally, Vertica treats the query as if it uses GROUP BY instead.

For example, you can rewrite the following query:

SELECT DISTINCT a, b, c FROM table1;

as:

SELECT a, b, c FROM table1 GROUP BY a, b, c;

For fastest execution, apply the optimization techniques for GROUP BY queries described in GROUP BY queries.

2 - COUNT (DISTINCT) and other DISTINCT aggregates

Computing a DISTINCT aggregate generally requires more work than other aggregates.

Computing a DISTINCT aggregate generally requires more work than other aggregates. Also, a query that uses a single DISTINCT aggregate consumes fewer resources than a query with multiple DISTINCT aggregates.

Examples

The following query returns the number of distinct values in a column:

=> SELECT COUNT (DISTINCT date_key) FROM date_dimension;

 COUNT
-------
  1826
(1 row)

This example returns the number of distinct return values from an expression:

=> SELECT COUNT (DISTINCT date_key + product_key) FROM inventory_fact;

 COUNT
-------
 21560
(1 row)

You can create an equivalent query using the LIMIT keyword to restrict the number of rows returned:

=> SELECT COUNT(date_key + product_key) FROM inventory_fact GROUP BY date_key LIMIT 10;

 COUNT
-------
   173
    31
   321
   113
   286
    84
   244
   238
   145
   202
(10 rows)

The following query uses GROUP BY to count distinct values within groups:

=> SELECT product_key, COUNT (DISTINCT date_key) FROM inventory_fact
   GROUP BY product_key LIMIT 10;

 product_key | count
-------------+-------
           1 |    12
           2 |    18
           3 |    13
           4 |    17
           5 |    11
           6 |    14
           7 |    13
           8 |    17
           9 |    15
          10 |    12
(10 rows)

The following query returns the number of distinct products and the total inventory within each date key:

=> SELECT date_key, COUNT (DISTINCT product_key), SUM(qty_in_stock) FROM inventory_fact
   GROUP BY date_key LIMIT 10;

 date_key | count |  sum
----------+-------+--------
        1 |   173 |  88953
        2 |    31 |  16315
        3 |   318 | 156003
        4 |   113 |  53341
        5 |   285 | 148380
        6 |    84 |  42421
        7 |   241 | 119315
        8 |   238 | 122380
        9 |   142 |  70151
       10 |   202 |  95274
(10 rows)

This query selects each distinct product_key value and then counts the number of distinct date_key values for all records with the specific product_key value. It also counts the number of distinct warehouse_key values in all records with the specific product_key value:

=> SELECT product_key, COUNT (DISTINCT date_key), COUNT (DISTINCT warehouse_key) FROM inventory_fact
   GROUP BY product_key LIMIT 15;

 product_key | count | count
-------------+-------+-------
           1 |    12 |    12
           2 |    18 |    18
           3 |    13 |    12
           4 |    17 |    18
           5 |    11 |     9
           6 |    14 |    13
           7 |    13 |    13
           8 |    17 |    15
           9 |    15 |    14
          10 |    12 |    12
          11 |    11 |    11
          12 |    13 |    12
          13 |     9 |     7
          14 |    13 |    13
          15 |    18 |    17
(15 rows)

This query selects each distinct product_key value, counts the number of distinct date_key and warehouse_key values for all records with the specific product_key value, and then sums all qty_in_stock values in records with the specific product_key value. It then returns the number of product_version values in records with the specific product_key value:

=> SELECT product_key, COUNT (DISTINCT date_key),
      COUNT (DISTINCT warehouse_key),
      SUM (qty_in_stock),
      COUNT (product_version)
      FROM inventory_fact GROUP BY product_key LIMIT 15;

 product_key | count | count |  sum  | count
-------------+-------+-------+-------+-------
           1 |    12 |    12 |  5530 |    12
           2 |    18 |    18 |  9605 |    18
           3 |    13 |    12 |  8404 |    13
           4 |    17 |    18 | 10006 |    18
           5 |    11 |     9 |  4794 |    11
           6 |    14 |    13 |  7359 |    14
           7 |    13 |    13 |  7828 |    13
           8 |    17 |    15 |  9074 |    17
           9 |    15 |    14 |  7032 |    15
          10 |    12 |    12 |  5359 |    12
          11 |    11 |    11 |  6049 |    11
          12 |    13 |    12 |  6075 |    13
          13 |     9 |     7 |  3470 |     9
          14 |    13 |    13 |  5125 |    13
          15 |    18 |    17 |  9277 |    18
(15 rows)

3 - Approximate count distinct functions

The aggregate function COUNT(DISTINCT) computes the exact number of distinct values in a data set.

The aggregate function COUNT(DISTINCT) computes the exact number of distinct values in a data set. COUNT(DISTINCT) performs well when it executes with the GROUPBY PIPELINED algorithm.

An aggregate COUNT operation performs well on a data set when the following conditions are true:

  • One of the target table's projections has an ORDER BY clause that facilitates sorted aggregation.

  • The number of distinct values is fairly small.

  • Hashed aggregation is required to execute the query.

Alternatively, consider using the APPROXIMATE_COUNT_DISTINCT function instead of COUNT(DISTINCT) when the following conditions are true:

  • You have a large data set and you do not require an exact count of distinct values.

  • The performance of COUNT(DISTINCT) on a given data set is insufficient.

  • You calculate several distinct counts in the same query.

  • The plan for COUNT(DISTINCT) uses hashed aggregation.

The expected value that APPROXIMATE_COUNT_DISTINCT returns is equal to COUNT(DISTINCT), with an error that is lognormally distributed with standard deviation s. You can control the standard deviation by setting the function's optional error tolerance argument—by default, 1.25 percent.

Other APPROXIMATE_COUNT_DISTINCT functions

Vertica supports two other functions that you can use together, instead of APPROXIMATE_COUNT_DISTINCT: APPROXIMATE_COUNT_DISTINCT_SYNOPSIS and APPROXIMATE_COUNT_DISTINCT_OF_SYNOPSIS. Use these functions when the following conditions are true:

  • You have a large data set and you don't require an exact count of distinct values.

  • The performance of COUNT(DISTINCT) on a given data set is insufficient.

  • You want to pre-compute the distinct counts and later combine them in different ways.

Use the two functions together as follows:

  1. Pass APPROXIMATE_COUNT_DISTINCT_SYNOPSIS the data set and a normally distributed confidence interval. The function returns a subset of the data, as a binary synopsis object*.*

  2. Pass the synopsis to the APPROXIMATE_COUNT_DISTINCT_OF_SYNOPSIS function, which then performs an approximate count distinct on the synopsis.

You also use APPROXIMATE_COUNT_DISTINCT_SYNOPSIS_MERGE, which merges multiple synopses into one synopsis. With this function, you can continually update a "master" synopsis by merging in one or more synopses that cover more recent, shorter periods of time.

Example

The following example shows how to use APPROXIMATE_COUNT_DISTINCT functions to keep an approximate running count of users who click on a given web page within a given time span.

  1. Create the pviews table to store data about website visits—time of visit, web page visited, and visitor:

    
    => CREATE TABLE pviews(
    visit_time TIMESTAMP NOT NULL,
    page_id INTEGER NOT NULL,
    user_id INTEGER NOT NULL)
    ORDER BY page_id, visit_time
    SEGMENTED BY HASH(user_id) ALL NODES KSAFE
    PARTITION BY visit_time::DATE GROUP BY CALENDAR_HIERARCHY_DAY(visit_time::DATE, 2, 2);
    

    pviews is segmented by hashing user_id data, so all visits by a given user are stored on the same segment, on the same node. This prevents inefficient cross-node transfer of data, when later we do a COUNT (DISTINCT user_id).

    The table also uses hierarchical partitioning on time of visit to optimize the ROS storage. Doing so improves performance when filtering data by time.

  2. Load data into pviews:

    => INSERT INTO pviews VALUES
         ('2022-02-01 10:00:02',1002,1),
         ('2022-02-01 10:00:03',1002,2),
         ('2022-02-01 10:00:04',1002,1),
         ('2022-02-01 10:00:05',1002,3),
         ('2022-02-01 10:00:01',1000,1),
         ('2022-02-01 10:00:06',1002,1),
         ('2022-02-01 10:00:07',1002,3),
         ('2022-02-01 10:00:08',1002,1),
         ('2022-02-01 10:00:09',1002,3),
         ('2022-02-01 10:00:12',1002,2),
         ('2022-02-02 10:00:01',1000,1),
         ('2022-02-02 10:00:02',1002,4),
         ('2022-02-02 10:00:03',1002,2),
         ('2022-02-02 10:00:04',1002,1),
         ('2022-02-02 10:00:05',1002,3),
         ('2022-02-02 10:00:06',1002,4),
         ('2022-02-02 10:00:07',1002,3),
         ('2022-02-02 10:00:08',1002,4),
         ('2022-02-02 10:00:09',1002,3),
         ('2022-02-02 10:00:12',1002,2),
         ('2022-03-02 10:00:01',1000,1),
         ('2022-03-02 10:00:02',1002,1),
         ('2022-03-02 10:00:03',1002,2),
         ('2022-03-02 10:00:04',1002,1),
         ('2022-03-02 10:00:05',1002,3),
         ('2022-03-02 10:00:06',1002,4),
         ('2022-03-02 10:00:07',1002,3),
         ('2022-03-02 10:00:08',1002,6),
         ('2022-03-02 10:00:09',1002,5),
         ('2022-03-02 10:00:12',1002,2),
         ('2022-03-02 11:00:01',1000,5),
         ('2022-03-02 11:00:02',1002,6),
         ('2022-03-02 11:00:03',1002,7),
         ('2022-03-02 11:00:04',1002,4),
         ('2022-03-02 11:00:05',1002,1),
         ('2022-03-02 11:00:06',1002,6),
         ('2022-03-02 11:00:07',1002,8),
         ('2022-03-02 11:00:08',1002,6),
         ('2022-03-02 11:00:09',1002,7),
         ('2022-03-02 11:00:12',1002,1),
         ('2022-03-03 10:00:01',1000,1),
         ('2022-03-03 10:00:02',1002,2),
         ('2022-03-03 10:00:03',1002,4),
         ('2022-03-03 10:00:04',1002,1),
         ('2022-03-03 10:00:05',1002,2),
         ('2022-03-03 10:00:06',1002,6),
         ('2022-03-03 10:00:07',1002,9),
         ('2022-03-03 10:00:08',1002,10),
         ('2022-03-03 10:00:09',1002,7),
         ('2022-03-03 10:00:12',1002,1);
     OUTPUT
    --------
         50
    (1 row)
    
    => COMMIT;
    COMMIT
    
  3. Create the pview_summary table by querying pviews with CREATE TABLE...AS SELECT. Each row of this table summarizes data selected from pviews for a given date:

    • partial_visit_count stores the number of rows (website visits) in pviews with that date.

    • daily_users_acdp uses APPROXIMATE_COUNT_DISTINCT_SYNOPSIS to construct a synopsis that approximates the number of distinct users (user_id) who visited that website on that date.

    
    => CREATE TABLE pview_summary AS SELECT
          visit_time::DATE "date",
          COUNT(*) partial_visit_count,
          APPROXIMATE_COUNT_DISTINCT_SYNOPSIS(user_id) AS daily_users_acdp
       FROM pviews GROUP BY 1;
    CREATE TABLE
    => ALTER TABLE pview_summary ALTER COLUMN "date" SET NOT NULL;
    
  4. Update the pview_summary table so it is partitioned like pviews. The REORGANIZE keyword forces immediate repartitioning of the table data:

    => ALTER TABLE pview_summary
         PARTITION BY "date"
         GROUP BY CALENDAR_HIERARCHY_DAY("date", 2, 2) REORGANIZE;
    vsql:/home/ale/acd_ex4.sql:93: NOTICE 8364:  The new partitioning scheme will produce partitions in 2 physical storage containers per projection
    vsql:/home/ale/acd_ex4.sql:93: NOTICE 4785:  Started background repartition table task
    ALTER TABLE
    
  5. Use CREATE TABLE..LIKE to create two ETL tables, pviews_etl and pview_summary_etl with the same DDL as pviews and pview_summary, respectively. These tables serve to process incoming data:

    => CREATE TABLE pviews_etl LIKE pviews INCLUDING PROJECTIONS;
    CREATE TABLE
    => CREATE TABLE pview_summary_etl LIKE pview_summary INCLUDING PROJECTIONS;
    CREATE TABLE
    
  6. Load new data into pviews_etl:

    => INSERT INTO pviews_etl VALUES
         ('2022-03-03 11:00:01',1000,8),
         ('2022-03-03 11:00:02',1002,9),
         ('2022-03-03 11:00:03',1002,1),
         ('2022-03-03 11:00:04',1002,11),
         ('2022-03-03 11:00:05',1002,10),
         ('2022-03-03 11:00:06',1002,12),
         ('2022-03-03 11:00:07',1002,3),
         ('2022-03-03 11:00:08',1002,10),
         ('2022-03-03 11:00:09',1002,1),
         ('2022-03-03 11:00:12',1002,1);
     OUTPUT
    --------
         10
    (1 row)
    
    => COMMIT;
    COMMIT
    
  7. Summarize the new data in pview_summary_etl:

    => INSERT INTO pview_summary_etl SELECT
          visit_time::DATE visit_date,
          COUNT(*) partial_visit_count,
          APPROXIMATE_COUNT_DISTINCT_SYNOPSIS(user_id) AS daily_users_acdp
        FROM pviews_etl GROUP BY visit_date;
     OUTPUT
    --------
          1
    (1 row)
    
  8. Append the pviews_etl data to pviews with COPY_PARTITIONS_TO_TABLE:

    => SELECT COPY_PARTITIONS_TO_TABLE('pviews_etl', '01-01-0000'::DATE, '01-01-9999'::DATE, 'pviews');
                  COPY_PARTITIONS_TO_TABLE
    ----------------------------------------------------
     1 distinct partition values copied at epoch 1403.
    
    (1 row)
    
    => SELECT COPY_PARTITIONS_TO_TABLE('pview_summary_etl', '01-01-0000'::DATE, '01-01-9999'::DATE, 'pview_summary');
                  COPY_PARTITIONS_TO_TABLE
    ----------------------------------------------------
     1 distinct partition values copied at epoch 1404.
    
    (1 row)
    
  9. Create views and distinct (approximate) views by day for all data, including the partition that was just copied from pviews_etl :

    => SELECT
         "date" visit_date,
         SUM(partial_visit_count) visit_count,
         APPROXIMATE_COUNT_DISTINCT_OF_SYNOPSIS(daily_users_acdp) AS daily_users_acd
       FROM pview_summary GROUP BY visit_date ORDER BY visit_date;
     visit_date | visit_count | daily_users_acd
    ------------+-------------+-----------------
     2022-02-01 |          10 |               3
     2022-02-02 |          10 |               4
     2022-03-02 |          20 |               8
     2022-03-03 |          20 |              11
    (4 rows)
    
  10. Create views and distinct (approximate) views by month:

    => SELECT
         DATE_TRUNC('MONTH', "date")::DATE "month",
         SUM(partial_visit_count) visit_count,
         APPROXIMATE_COUNT_DISTINCT_OF_SYNOPSIS(daily_users_acdp) AS monthly_users_acd
       FROM pview_summary GROUP BY month ORDER BY month;
       month    | visit_count | monthly_users_acd
    ------------+-------------+-------------------
     2022-02-01 |          20 |                 4
     2022-03-01 |          40 |                12
    (2 rows)
    
  11. Merge daily synopses into monthly synopses:

    => CREATE TABLE pview_monthly_summary AS SELECT
         DATE_TRUNC('MONTH', "date")::DATE "month",
         SUM(partial_visit_count) partial_visit_count,
         APPROXIMATE_COUNT_DISTINCT_SYNOPSIS_MERGE(daily_users_acdp) AS monthly_users_acdp
       FROM pview_summary GROUP BY month ORDER BY month;
    CREATE TABLE
    
  12. Create views and distinct views by month, generated from the merged synopses:

    => SELECT
         month,
         SUM(partial_visit_count) monthly_visit_count,
         APPROXIMATE_COUNT_DISTINCT_OF_SYNOPSIS(monthly_users_acdp) AS monthly_users_acd
       FROM pview_monthly_summary GROUP BY month ORDER BY month;
       month    | monthly_visit_count | monthly_users_acd
    ------------+---------------------+-------------------
     2019-02-01 |                  20 |                 4
     2019-03-01 |                  40 |                12
    (2 rows)
    
  13. You can use the monthly summary to produce a yearly summary. This approach is likely to be faster than using a daily summary if a lot of data needs to be processed:

    => SELECT
         DATE_TRUNC('YEAR', "month")::DATE "year",
         SUM(partial_visit_count) yearly_visit_count,
         APPROXIMATE_COUNT_DISTINCT_OF_SYNOPSIS(monthly_users_acdp) AS yearly_users_acd
       FROM pview_monthly_summary GROUP BY year ORDER BY year;
        year    | yearly_visit_count | yearly_users_acd
    ------------+--------------------+------------------
     2022-01-01 |                 60 |               12
    (1 row)
    
  14. Drop the ETL tables:

    => DROP TABLE IF EXISTS pviews_etl, pview_summary_etl;
    DROP TABLE
    

See also

4 - Single DISTINCT aggregates

Vertica computes a DISTINCT aggregate by first removing all duplicate values of the aggregate's argument to find the distinct values.

Vertica computes a DISTINCT aggregate by first removing all duplicate values of the aggregate's argument to find the distinct values. Then it computes the aggregate.

For example, you can rewrite the following query:

SELECT a, b, COUNT(DISTINCT c) AS dcnt FROM table1 GROUP BY a, b;

as:

SELECT a, b, COUNT(dcnt) FROM
  (SELECT a, b, c AS dcnt FROM table1 GROUP BY a, b, c)
GROUP BY a, b;

For fastest execution, apply the optimization techniques for GROUP BY queries.

5 - Multiple DISTINCT aggregates

If your query has multiple DISTINCT aggregates, there is no straightforward SQL rewrite that can compute them.

If your query has multiple DISTINCT aggregates, there is no straightforward SQL rewrite that can compute them. The following query cannot easily be rewritten for improved performance:

SELECT a, COUNT(DISTINCT b), COUNT(DISTINCT c) AS dcnt FROM table1 GROUP BY a;

For a query with multiple DISTINCT aggregates, there is no projection design that can avoid using GROUPBY HASH and resegmenting the data. To improve performance of this query, make sure that it has large amounts of memory available. For more information about memory allocation for queries, see Resource manager.