OpenText Analytics Database 26.2.x – Data aggregation

Data-Analysis: Single-level aggregation

Mon, 01 Jan 0001 00:00:00 +0000

The simplest GROUP BY queries aggregate data at a single level. For example, a table might contain the following information about family expenses:

Category
Amount spent on that category during the year
Year

Table data might look like this:

=> SELECT * FROM expenses ORDER BY Category;
 Year |  Category  | Amount
------+------------+--------
2005  | Books      |  39.98
2007  | Books      |  29.99
2008  | Books      |  29.99
2006  | Electrical | 109.99
2005  | Electrical | 109.99
2007  | Electrical | 229.98

You can use aggregate functions to get the total expenses per category or per year:

=> SELECT SUM(Amount), Category FROM expenses GROUP BY Category;
 SUM     | Category
---------+------------
  99.96  | Books
 449.96  | Electrical
=> SELECT SUM(Amount), Year FROM expenses GROUP BY Year;
 SUM    | Year
--------+------
 149.97 | 2005
 109.99 | 2006
  29.99 | 2008
 259.97 | 2007

Data-Analysis: Multi-level aggregation

Mon, 01 Jan 0001 00:00:00 +0000

Over time, tables that are updated frequently can contain large amounts of data. Using the simple table shown earlier, suppose you want a multilevel query, like the number of expenses per category per year.

The following query uses the ROLLUP aggregation with the SUM function to calculate the total expenses by category and the overall expenses total. The NULL fields indicate subtotal values in the aggregation.

When only the Year column is NULL, the subtotal is for all the Category values.
When both the Year and Category columns are NULL, the subtotal is for all Amount values for both columns.

Using the ORDER BY clause orders the results by expense category, the year the expenses took place, and the GROUP BY level that the GROUPING_ID function creates:

=> SELECT Category, Year, SUM(Amount) FROM expenses
   GROUP BY ROLLUP(Category, Year) ORDER BY Category, Year, GROUPING_ID();
 Category   | Year |  SUM
------------+------+--------
 Books      | 2005 |  39.98
 Books      | 2007 |  29.99
 Books      | 2008 |  29.99
 Books      |      |  99.96
 Electrical | 2005 | 109.99
 Electrical | 2006 | 109.99
 Electrical | 2007 | 229.98
 Electrical |      | 449.96
            |      | 549.92

Similarly, the following query calculates the total sales by year and the overall sales total and then uses the ORDER BY clause to sort the results:

=> SELECT Category, Year, SUM(Amount) FROM expenses
   GROUP BY ROLLUP(Year, Category) ORDER BY 2, 1, GROUPING_ID();
  Category  | Year |  SUM
------------+------+--------
 Books      | 2005 |  39.98
 Electrical | 2005 | 109.99
            | 2005 | 149.97
 Electrical | 2006 | 109.99
            | 2006 | 109.99
 Books      | 2007 |  29.99
 Electrical | 2007 | 229.98
            | 2007 | 259.97
 Books      | 2008 |  29.99
            | 2008 |  29.99
            |      | 549.92
(11 rows)

You can use the CUBE aggregate to perform all possible groupings of the category and year expenses. The following query returns all possible groupings, ordered by grouping:

=> SELECT Category, Year, SUM(Amount) FROM expenses
   GROUP BY CUBE(Category, Year) ORDER BY 1, 2, GROUPING_ID();
 Category   | Year |  SUM
------------+------+--------
 Books      | 2005 |  39.98
 Books      | 2007 |  29.99
 Books      | 2008 |  29.99
 Books      |      |  99.96
 Electrical | 2005 | 109.99
 Electrical | 2006 | 109.99
 Electrical | 2007 | 229.98
 Electrical |      | 449.96
            | 2005 | 149.97
            | 2006 | 109.99
            | 2007 | 259.97
            | 2008 |  29.99
            |      | 549.92

The results include subtotals for each category and each year and a total ($549.92) for all transactions, regardless of year or category.

ROLLUP, CUBE, and GROUPING SETS generate NULL values in grouping columns to identify subtotals. If table data includes NULL values, differentiating these from NULL values in subtotals can sometimes be challenging.

In the preceding output, the NULL values in the Year column indicate that the row was grouped on the Category column, rather than on both columns. In this case, ROLLUP added the NULL value to indicate the subtotal row.

Data-Analysis: Aggregates and functions for multilevel grouping

Mon, 01 Jan 0001 00:00:00 +0000

OpenText™ Analytics Database provides several aggregates and functions that group the results of a GROUP BY query at multiple levels.

Aggregates for multilevel grouping

Use the following aggregates for multilevel grouping:

ROLLUP automatically performs subtotal aggregations. ROLLUP performs one or more aggregations across multiple dimensions, at different levels.
CUBE performs the aggregation for all permutations of the CUBE expression that you specify.
GROUPING SETS let you specify which groupings of aggregations you need.

You can use CUBE or ROLLUP expressions inside GROUPING SETS expressions. Otherwise, you cannot nest multilevel aggregate expressions.

Grouping functions

You use one of the following three grouping functions with ROLLUP, CUBE, and GROUPING SETS:

GROUP_ID returns one or more numbers, starting with zero (0), to uniquely identify duplicate sets.
GROUPING_ID produces a unique ID for each grouping combination.
GROUPING identifies for each grouping combination whether a column is a part of this grouping. This function also differentiates NULL values in the data from NULL grouping subtotals.

These functions are typically used with multilevel aggregates.

Data-Analysis: Aggregate expressions for GROUP BY

Mon, 01 Jan 0001 00:00:00 +0000

You can include CUBE and ROLLUP aggregates within a GROUPING SETS aggregate. Be aware that the CUBE and ROLLUP aggregates can result in a large amount of output. However, you can avoid large outputs by using GROUPING SETS to return only specified results.

...GROUP BY a,b,c,d,ROLLUP(a,b)...
...GROUP BY a,b,c,d,CUBE((a,b),c,d)...

You cannot include any aggregates in a CUBE or ROLLUP aggregate expression.

You can append multiple GROUPING SETS, CUBE, or ROLLUP aggregates in the same query.

...GROUP BY a,b,c,d,CUBE(a,b),ROLLUP (c,d)...
...GROUP BY a,b,c,d,GROUPING SETS ((a,d),(b,c),CUBE(a,b));...
...GROUP BY a,b,c,d,GROUPING SETS ((a,d),(b,c),(a,b),(a),(b),())...

Data-Analysis: Pre-aggregating data in projections

Mon, 01 Jan 0001 00:00:00 +0000

Queries that use aggregate functions such as SUM and COUNT can perform more efficiently when they use projections that already contain the aggregated data. This improved efficiency is especially true for queries on large quantities of data.

For example, a power grid company reads 30 million smart meters that provide data at five-minute intervals. The company records each reading in a database table. Over a given year, three trillion records are added to this table.

The power grid company can analyze these records with queries that include aggregate functions to perform the following tasks:

Establish usage patterns.
Detect fraud.
Measure correlation to external events such as weather patterns or pricing changes.

To optimize query response time, you can create an aggregate projection, which stores the data is stored after it is aggregated.

Aggregate projections

Vertica provides several types of projections for storing data that is returned from aggregate functions or expressions:

Live aggregate projection: Projection that contains columns with values that are aggregated from columns in its anchor table. You can also define live aggregate projections that include user-defined transform functions.
Top-K projection: Type of live aggregate projection that returns the top k rows from a partition of selected rows. Create a Top-K projection that satisfies the criteria for a Top-K query.
Projection that pre-aggregates UDTF results: Live aggregate projection that invokes user-defined transform functions (UDTFs). To minimize overhead when you query those projections of this type, Vertica processes the UDTF functions in the background and stores their results on disk.
Projection that contains expressions: Projection with columns whose values are calculated from anchor table columns.

Recommended use

Aggregate projections are most useful for queries against large sets of data.
For optimal query performance, the size of LAP projections should be a small subset of the anchor table—ideally, between 1 and 10 percent of the anchor table, or smaller, if possible.

Restrictions

MERGE operations must be optimized if they are performed on target tables that have live aggregate projections.
You cannot update or delete data in temporary tables with live aggregate projections.

Requirements

In the event of manual recovery from an unclean database shutdown, live aggregate projections might require some time to refresh.