All analytic functions in this section with an aggregate counterpart are appended with [Analytics] in the heading to avoid confusion between the two function types.
All analytic functions in this section with an aggregate counterpart are appended with [Analytics] in the heading to avoid confusion between the two function types.
Vertica analytics are SQL functions based on the ANSI 99 standard. These functions handle complex analysis and reporting tasks—for example:
Rank the longest-standing customers in a particular state.
Calculate the moving average of retail volume over a specified time.
Find the highest score among all students in the same grade.
Compare the current sales bonus that salespersons received against their previous bonus.
Analytic functions return aggregate results but they do not group the result set. They return the group value multiple times, once per record. You can sort group values, or partitions, using a window ORDER BY clause, but the order affects only the function result set, not the entire query result set.
Specifies how to partition, sort, and window frame function input with respect to the current row. The input data is the result set that the query returns after it evaluates FROM, WHERE, GROUP BY, and HAVING clauses.
An empty OVER clause provides the best performance for single threaded queries on a single node.
Optionally specifies how to sort rows that are supplied to the analytic function. If the OVER clause also includes a partition clause, rows are sorted within each partition.
Only valid for some analytic functions, specifies as input a set of rows relative to the row that is currently being evaluated by the analytic function. After the function processes that row and its window, Vertica advances the current row and adjusts the window boundaries accordingly.
named-window
The name of a window that you define in the same query with a window name clause. This definition encapsulates window partitioning and sorting. Named windows are useful when the query invokes multiple analytic functions with similar OVER clauses.
A window name clause cannot specify a window frame clause. However, you can qualify the named window in an OVER clause with a window frame clause.
Requirements
The following requirements apply to analytic functions:
All require an OVER clause. Each function has its own OVER clause requirements. For example, you can supply an empty OVER clause for some analytic aggregate functions such as
SUM. For other functions, window frame and order clauses might be required, or might be invalid.
Analytic functions can be invoked only in a query's SELECT and ORDER BY clauses.
Analytic functions cannot be nested. For example, the following query is not allowed:
=> SELECT MEDIAN(RANK() OVER(ORDER BY sal) OVER()).
WHERE, GROUP BY and HAVING operators are technically not part of the analytic function. However, they determine input to that function.
This function is patterned after the mathematical function , which returns the value of x that maximizes.
This function is patterned after the mathematical function argmax(f(x)), which returns the value of x that maximizes f(x). Similarly, ARGMAX takes two arguments target and arg, where both are columns or column expressions in the queried dataset. ARGMAX finds the row with the largest non-null value in target and returns the value of arg in that row. If multiple rows contain the largest target value, ARGMAX returns arg from the first row that it finds.
PARTITION BYexpression: Groups (partitions) input rows according to the values in expression, which resolves to one or more columns in the queried dataset. If you omit this clause, ARGMAX processes all input rows as a single partition.
window-order-clause: Specifies how to sort input rows. If the OVER clause also includes a partition clause, rows are sorted separately within each partition.
Important
To ensure consistent results when multiple rows contain the largest target value, include a window order clause that sorts on arg.
Create and populate table service_info, which contains information on various services, their respective development groups, and their userbase. A NULL in the users column indicates that the service has not been released, and so it cannot have users.
ARGMAX returns the value in the product_name column that maximizes the value in the users column. In this case, ARGMAX returns totp, which indicates that the totp service has the largest user base:
The next query partitions the data on dev_group to identify the most popular service created by each development group. ARGMAX returns NULL if the partition's users column contains only NULL values and breaks ties using the first value in product_name from the top of the partition.
This function is patterned after the mathematical function , which returns the value of x that minimizes.
This function is patterned after the mathematical function argmin(f(x)), which returns the value of x that minimizes f(x). Similarly, ARGMIN takes two arguments target and arg, where both are columns or column expressions in the queried dataset. ARGMIN finds the row with the smallest non-null value in target and returns the value of arg in that row. If multiple rows contain the smallest target value, ARGMIN returns arg from the first row that it finds.
PARTITION BYexpression: Groups (partitions) input rows according to the values in expression, which resolves to one or more columns in the queried dataset. If you omit this clause, ARGMIN processes all input rows as a single partition.
window-order-clause: Specifies how to sort input rows. If the OVER clause also includes a partition clause, rows are sorted separately within each partition.
Important
To ensure consistent results when multiple rows contain the smallest target value, include a window order clause that sorts on arg.
Create and populate table service_info, which contains information on various services, their respective development groups, and their userbase. A NULL in the users column indicates that the service has not been released, and so it cannot have users.
ARGMIN returns the value in the product_name column that minimizes the value in the users column. In this case, ARGMIN returns totp, which indicates that the totp service has the smallest user base:
The next query partitions the data on dev_group to identify the least popular service created by each development group. ARGMIN returns NULL if the partition's users column contains only NULL values and breaks ties using the first value in product_name from the top of the partition.
=> SELECT dev_group, product_name, users, ARGMIN(users, product_name) OVER (PARTITION BY dev_group ORDER BY product_name ASC) FROM service_info;
dev_group | product_name | users | ARGMIN
-----------+------------------+---------+------------
iris | chat | 48193 | video call
iris | forum | 48193 | video call
iris | video call | 10203 | video call
orchid | cloud | 990322 | cloud
aspen | trading | 3000 | trading
daffodil | clip share | 3000 | clip share
daffodil | streaming | 44123 | clip share
rose | crypto | |
hydrangea | e2e sms | | totp
hydrangea | password manager | | totp
hydrangea | totp | 1837363 | totp
(11 rows)
By default, Vertica allows silent numeric overflow when you call this function on numeric data types. For more information on this behavior and how to change it, seeNumeric data type overflow with SUM, SUM_FLOAT, and AVG.
Examples
The following query finds the sales for that calendar month and returns a running/cumulative average (sometimes called a moving average) using the default window of RANGE UNBOUNDED PRECEDING AND CURRENT ROW:
=> SELECT calendar_month_number_in_year Mo, SUM(product_price) Sales,
AVG(SUM(product_price)) OVER (ORDER BY calendar_month_number_in_year)::INTEGER Average
FROM product_dimension pd, date_dimension dm, inventory_fact if
WHERE dm.date_key = if.date_key AND pd.product_key = if.product_key GROUP BY Mo;
Mo | Sales | Average
----+----------+----------
1 | 23869547 | 23869547
2 | 19604661 | 21737104
3 | 22877913 | 22117374
4 | 22901263 | 22313346
5 | 23670676 | 22584812
6 | 22507600 | 22571943
7 | 21514089 | 22420821
8 | 24860684 | 22725804
9 | 21687795 | 22610470
10 | 23648921 | 22714315
11 | 21115910 | 22569005
12 | 24708317 | 22747281
(12 rows)
To return a moving average that is not a running (cumulative) average, the window can specify ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING:
=> SELECT calendar_month_number_in_year Mo, SUM(product_price) Sales,
AVG(SUM(product_price)) OVER (ORDER BY calendar_month_number_in_year
ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING)::INTEGER Average
FROM product_dimension pd, date_dimension dm, inventory_fact if
WHERE dm.date_key = if.date_key AND pd.product_key = if.product_key GROUP BY Mo;
Mo | Sales | Average
----+----------+----------
1 | 23869547 | 22117374
2 | 19604661 | 22313346
3 | 22877913 | 22584812
4 | 22901263 | 22312423
5 | 23670676 | 22694308
6 | 22507600 | 23090862
7 | 21514089 | 22848169
8 | 24860684 | 22843818
9 | 21687795 | 22565480
10 | 23648921 | 23204325
11 | 21115910 | 22790236
12 | 24708317 | 23157716
(12 rows)
The following example illustrates how you can use the BOOL_AND, BOOL_OR, and BOOL_XOR analytic functions. The sample table, employee, includes a column for type of employee and years paid.
Insert sample data into the table to show years paid. In more than one case, an employee could be paid more than once within one year.
=> INSERT INTO employee
SELECT 'contractor1', '2014'
UNION ALL
SELECT 'contractor2', '2015'
UNION ALL
SELECT 'contractor3', '2014'
UNION ALL
SELECT 'contractor1', '2014'
UNION ALL
SELECT 'contractor2', '2014'
UNION ALL
SELECT 'contractor3', '2015'
UNION ALL
SELECT 'contractor4', '2014'
UNION ALL
SELECT 'contractor4', '2014'
UNION ALL
SELECT 'contractor5', '2015'
UNION ALL
SELECT 'contractor5', '2016';
OUTPUT
--------
10
(1 row)
Query the table. The result shows employees that were paid twice in 2014 (BOOL_AND), once or twice in 2014 (BOOL_OR), and specifically not more than once in 2014 (BOOL_XOR).
=> SELECT DISTINCT emptype,
BOOL_AND(yearspaid='2014') OVER (PARTITION BY emptype) AS paidtwicein2014,
BOOL_OR(yearspaid='2014') OVER (PARTITION BY emptype) AS paidonceortwicein2014,
BOOL_XOR(yearspaid='2014') OVER (PARTITION BY emptype) AS paidjustoncein2014
FROM employee;
emptype | paidtwicein2014 | paidonceortwicein2014 | paidjustoncein2014
-------------+-----------------+-----------------------+--------------------
contractor1 | t | t | f
contractor2 | f | t | t
contractor3 | f | t | t
contractor4 | t | t | f
contractor5 | f | f | f
(5 rows)
The following example illustrates how you can use the BOOL_AND, BOOL_OR, and BOOL_XOR analytic functions. The sample table, employee, includes a column for type of employee and years paid.
Insert sample data into the table to show years paid. In more than one case, an employee could be paid more than once within one year.
=> INSERT INTO employee
SELECT 'contractor1', '2014'
UNION ALL
SELECT 'contractor2', '2015'
UNION ALL
SELECT 'contractor3', '2014'
UNION ALL
SELECT 'contractor1', '2014'
UNION ALL
SELECT 'contractor2', '2014'
UNION ALL
SELECT 'contractor3', '2015'
UNION ALL
SELECT 'contractor4', '2014'
UNION ALL
SELECT 'contractor4', '2014'
UNION ALL
SELECT 'contractor5', '2015'
UNION ALL
SELECT 'contractor5', '2016';
OUTPUT
--------
10
(1 row)
Query the table. The result shows employees that were paid twice in 2014 (BOOL_AND), once or twice in 2014 (BOOL_OR), and specifically not more than once in 2014 (BOOL_XOR).
=> SELECT DISTINCT emptype,
BOOL_AND(yearspaid='2014') OVER (PARTITION BY emptype) AS paidtwicein2014,
BOOL_OR(yearspaid='2014') OVER (PARTITION BY emptype) AS paidonceortwicein2014,
BOOL_XOR(yearspaid='2014') OVER (PARTITION BY emptype) AS paidjustoncein2014
FROM employee;
emptype | paidtwicein2014 | paidonceortwicein2014 | paidjustoncein2014
-------------+-----------------+-----------------------+--------------------
contractor1 | t | t | f
contractor2 | f | t | t
contractor3 | f | t | t
contractor4 | t | t | f
contractor5 | f | f | f
(5 rows)
The following example illustrates how you can use the BOOL_AND, BOOL_OR, and BOOL_XOR analytic functions. The sample table, employee, includes a column for type of employee and years paid.
Insert sample data into the table to show years paid. In more than one case, an employee could be paid more than once within one year.
=> INSERT INTO employee
SELECT 'contractor1', '2014'
UNION ALL
SELECT 'contractor2', '2015'
UNION ALL
SELECT 'contractor3', '2014'
UNION ALL
SELECT 'contractor1', '2014'
UNION ALL
SELECT 'contractor2', '2014'
UNION ALL
SELECT 'contractor3', '2015'
UNION ALL
SELECT 'contractor4', '2014'
UNION ALL
SELECT 'contractor4', '2014'
UNION ALL
SELECT 'contractor5', '2015'
UNION ALL
SELECT 'contractor5', '2016';
OUTPUT
--------
10
(1 row)
Query the table. The result shows employees that were paid twice in 2014 (BOOL_AND), once or twice in 2014 (BOOL_OR), and specifically not more than once in 2014 (BOOL_XOR).
=> SELECT DISTINCT emptype,
BOOL_AND(yearspaid='2014') OVER (PARTITION BY emptype) AS paidtwicein2014,
BOOL_OR(yearspaid='2014') OVER (PARTITION BY emptype) AS paidonceortwicein2014,
BOOL_XOR(yearspaid='2014') OVER (PARTITION BY emptype) AS paidjustoncein2014
FROM employee;
emptype | paidtwicein2014 | paidonceortwicein2014 | paidjustoncein2014
-------------+-----------------+-----------------------+--------------------
contractor1 | t | t | f
contractor2 | f | t | t
contractor3 | f | t | t
contractor4 | t | t | f
contractor5 | f | f | f
(5 rows)
Assigns an event window number to each row, starting from 0, and increments by 1 when the result of evaluating the argument expression on the current row differs from that on the previous row.
Assigns an event window number to each row, starting from 0, and increments by 1 when the result of evaluating the argument expression on the current row differs from that on the previous row.
The analytic window-order-clause is required but the window-partition-clause is optional.
Examples
=> SELECT CONDITIONAL_CHANGE_EVENT(bid)
OVER (PARTITION BY symbol ORDER BY ts) AS cce
FROM TickStore;
The system returns an error when no ORDER BY clause is present:
=> SELECT CONDITIONAL_CHANGE_EVENT(bid)
OVER (PARTITION BY symbol) AS cce
FROM TickStore;
ERROR: conditional_change_event must contain an
ORDER BY clause within its analytic clause
Assigns an event window number to each row, starting from 0, and increments the number by 1 when the result of the boolean argument expression evaluates true.
Assigns an event window number to each row, starting from 0, and increments the number by 1 when the result of the boolean argument expression evaluates true. For example, given a sequence of values for column a, as follows:
The analytic window-order-clause is required but the window-partition-clause is optional.
Examples
> SELECT CONDITIONAL_TRUE_EVENT(bid > 10.6)
OVER(PARTITION BY bid ORDER BY ts) AS cte
FROM Tickstore;
The system returns an error if the ORDER BY clause is omitted:
> SELECT CONDITIONAL_TRUE_EVENT(bid > 10.6)
OVER(PARTITION BY bid) AS cte
FROM Tickstore;
ERROR: conditional_true_event must contain an ORDER BY
clause within its analytic clause
Using the schema defined in Window framing, the following COUNT function omits window order and window frame clauses; otherwise Vertica would treat it as a window aggregate. Think of the window of reporting aggregates as UNBOUNDED PRECEDING and UNBOUNDED FOLLOWING.
Using the VMart schema, the following query finds the number of employees who make less than or equivalent to the hourly rate of the current employee. The query returns a running/cumulative average (sometimes called a moving average) using the default window of RANGE UNBOUNDED PRECEDING AND CURRENT ROW:
=> SELECT employee_last_name AS "last_name", hourly_rate, COUNT(*)
OVER (ORDER BY hourly_rate) AS moving_count from employee_dimension;
last_name | hourly_rate | moving_count
------------+-------------+--------------
Gauthier | 6 | 4
Taylor | 6 | 4
Jefferson | 6 | 4
Nielson | 6 | 4
McNulty | 6.01 | 11
Robinson | 6.01 | 11
Dobisz | 6.01 | 11
Williams | 6.01 | 11
Kramer | 6.01 | 11
Miller | 6.01 | 11
Wilson | 6.01 | 11
Vogel | 6.02 | 14
Moore | 6.02 | 14
Vogel | 6.02 | 14
Carcetti | 6.03 | 19
...
To return a moving average that is not also a running (cumulative) average, the window should specify ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING:
=> SELECT employee_last_name AS "last_name", hourly_rate, COUNT(*)
OVER (ORDER BY hourly_rate ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING)
AS moving_count from employee_dimension;
Calculates the cumulative distribution, or relative rank, of the current row with regard to other rows in the same partition within a .
Calculates the cumulative distribution, or relative rank, of the current row with regard to other rows in the same partition within a window.
CUME_DIST() returns a number greater then 0 and less then or equal to 1, where the number represents the relative position of the specified row within a group of n rows. For a row x (assuming ASC ordering), the CUME_DIST of x is the number of rows with values lower than or equal to the value of x, divided by the number of rows in the partition. For example, in a group of three rows, the cumulative distribution values returned would be 1/3, 2/3, and 3/3.
Note
Because the result for a given row depends on the number of rows preceding that row in the same partition, you should always specify a window-order-clause when you call this function.
The following example returns the cumulative distribution of sales for different transaction types within each month of the first quarter.
=> SELECT calendar_month_name AS month, tender_type, SUM(sales_quantity),
CUME_DIST()
OVER (PARTITION BY calendar_month_name ORDER BY SUM(sales_quantity)) AS
CUME_DIST
FROM store.store_sales_fact JOIN date_dimension
USING(date_key) WHERE calendar_month_name IN ('January','February','March')
AND tender_type NOT LIKE 'Other'
GROUP BY calendar_month_name, tender_type;
month | tender_type | SUM | CUME_DIST
----------+-------------+--------+-----------
March | Credit | 469858 | 0.25
March | Cash | 470449 | 0.5
March | Check | 473033 | 0.75
March | Debit | 475103 | 1
January | Cash | 441730 | 0.25
January | Debit | 443922 | 0.5
January | Check | 446297 | 0.75
January | Credit | 450994 | 1
February | Check | 425665 | 0.25
February | Debit | 426726 | 0.5
February | Credit | 430010 | 0.75
February | Cash | 430767 | 1
(12 rows)
Within each window partition, ranks all rows in the query results set according to the order specified by the window's ORDER BY clause.
Within each window partition, ranks all rows in the query results set according to the order specified by the window's ORDER BY clause. A DENSE_RANK function returns a sequence of ranking numbers without any gaps.
DENSE_RANK executes as follows:
Sorts partition rows as specified by the ORDER BY clause.
Compares the ORDER BY values of the preceding row and current row and ranks the current row as follows:
If ORDER BY values are the same, the current row gets the same ranking as the preceding row.
Note
Null values are considered equal. For detailed information on how null values are sorted, see NULL sort order.
If the ORDER BY values are different, DENSE_RANK increments or decrements the current row's ranking by 1, depending whether sort order is ascending or descending.
DENSE_RANK always changes the ranking by 1, so no gaps appear in the ranking sequence. The largest rank value is the number of unique ORDER BY values returned by the query.
Calculates the exponential moving average (EMA) of expression E with smoothing factor X.
Calculates the exponential moving average (EMA) of expression E with smoothing factor X. An EMA differs from a simple moving average in that it provides a more stable picture of changes to data over time.
The EMA is calculated by adding the previous EMA value to the current data point scaled by the smoothing factor, as in the following formula:
EMA=EMA0+ (X* (E-EMA0))
where:
E is the current data point
EMA0 is the previous row's EMA value.
X is the smoothing factor.
This function also works at the row level. For example, EMA assumes the data in a given column is sampled at uniform intervals. If the users' data points are sampled at non-uniform intervals, they should run the time series gap filling and interpolation (GFI) operations before EMA()
The following example uses time series gap filling and interpolation (GFI) first in a subquery, and then performs an EXPONENTIAL_MOVING_AVERAGE operation on the subquery result.
Create a simple four-column table:
=> CREATE TABLE ticker(
time TIMESTAMP,
symbol VARCHAR(8),
bid1 FLOAT,
bid2 FLOAT );
Insert some data, including nulls, so GFI can do its interpolation and gap filling:
During gap filling and interpolation, Vertica takes the closest non null value on either side of the time slice and uses that value. For example, if you use a linear interpolation scheme and you do not specify IGNORE NULLS, and your data has one real value and one null, the result is null. If the value on either side is null, the result is null. See When Time Series Data Contains Nulls for details.
Query the table that you just created to you can see the output:
The following query processes the first and last values that belong to each 2-second time slice in table trades' column a. The query then calculates the exponential moving average of expression fv and lv with a smoothing factor of 50%:
=> SELECT symbol, slice_time, fv, lv,
EXPONENTIAL_MOVING_AVERAGE(fv, 0.5)
OVER (PARTITION BY symbol ORDER BY slice_time) AS ema_first,
EXPONENTIAL_MOVING_AVERAGE(lv, 0.5)
OVER (PARTITION BY symbol ORDER BY slice_time) AS ema_last
FROM (
SELECT symbol, slice_time,
TS_FIRST_VALUE(bid1 IGNORE NULLS) as fv,
TS_LAST_VALUE(bid2 IGNORE NULLS) AS lv
FROM ticker TIMESERIES slice_time AS '2 seconds'
OVER (PARTITION BY symbol ORDER BY time) ) AS sq;
symbol | slice_time | fv | lv | ema_first | ema_last
--------+---------------------+-------+-------+-----------+----------
ABC | 2009-07-12 03:00:00 | 60.45 | 65.12 | 60.45 | 65.12
ABC | 2009-07-12 03:00:02 | 57.78 | 65.12 | 59.115 | 65.12
ABC | 2009-07-12 03:00:04 | 67.88 | 65.12 | 63.4975 | 65.12
XYZ | 2009-07-12 03:00:00 | 47.55 | 46.78 | 47.55 | 46.78
XYZ | 2009-07-12 03:00:02 | 71.56 | 70.21 | 59.555 | 58.495
XYZ | 2009-07-12 03:00:04 | 45.55 | 58.65 | 52.5525 | 58.5725
(6 rows)
Lets you select the first value of a table or partition (determined by the window-order-clause) without having to use a self join.
Lets you select the first value of a table or partition (determined by the window-order-clause) without having to use a self join. This function is useful when you want to use the first value as a baseline in calculations.
Use FIRST_VALUE() with the window-order-clause to produce deterministic results. If no window is specified for the current row, the default window is UNBOUNDED PRECEDING AND CURRENT ROW.
Expression to evaluate—or example, a constant, column, nonanalytic function, function expression, or expressions involving any of these.
IGNORE NULLS
Specifies to return the first non-null value in the set, or NULL if all values are NULL. If you omit this option and the first value in the set is null, the function returns NULL.
The following query asks for the first value in the partitioned day of week, and illustrates the potential nondeterministic nature of FIRST_VALUE():
=> SELECT calendar_year, date_key, day_of_week, full_date_description,
FIRST_VALUE(full_date_description)
OVER(PARTITION BY calendar_month_number_in_year ORDER BY day_of_week)
AS "first_value"
FROM date_dimension
WHERE calendar_year=2003 AND calendar_month_number_in_year=1;
The first value returned is January 31, 2003; however, the next time the same query is run, the first value might be January 24 or January 3, or the 10th or 17th. This is because the analytic ORDER BY column day_of_week returns rows that contain ties (multiple Fridays). These repeated values make the ORDER BY evaluation result nondeterministic, because rows that contain ties can be ordered in any way, and any one of those rows qualifies as being the first value of day_of_week.
calendar_year | date_key | day_of_week | full_date_description | first_value
--------------+----------+-------------+-----------------------+------------------
2003 | 31 | Friday | January 31, 2003 | January 31, 2003
2003 | 24 | Friday | January 24, 2003 | January 31, 2003
2003 | 3 | Friday | January 3, 2003 | January 31, 2003
2003 | 10 | Friday | January 10, 2003 | January 31, 2003
2003 | 17 | Friday | January 17, 2003 | January 31, 2003
2003 | 6 | Monday | January 6, 2003 | January 31, 2003
2003 | 27 | Monday | January 27, 2003 | January 31, 2003
2003 | 13 | Monday | January 13, 2003 | January 31, 2003
2003 | 20 | Monday | January 20, 2003 | January 31, 2003
2003 | 11 | Saturday | January 11, 2003 | January 31, 2003
2003 | 18 | Saturday | January 18, 2003 | January 31, 2003
2003 | 25 | Saturday | January 25, 2003 | January 31, 2003
2003 | 4 | Saturday | January 4, 2003 | January 31, 2003
2003 | 12 | Sunday | January 12, 2003 | January 31, 2003
2003 | 26 | Sunday | January 26, 2003 | January 31, 2003
2003 | 5 | Sunday | January 5, 2003 | January 31, 2003
2003 | 19 | Sunday | January 19, 2003 | January 31, 2003
2003 | 23 | Thursday | January 23, 2003 | January 31, 2003
2003 | 2 | Thursday | January 2, 2003 | January 31, 2003
2003 | 9 | Thursday | January 9, 2003 | January 31, 2003
2003 | 16 | Thursday | January 16, 2003 | January 31, 2003
2003 | 30 | Thursday | January 30, 2003 | January 31, 2003
2003 | 21 | Tuesday | January 21, 2003 | January 31, 2003
2003 | 14 | Tuesday | January 14, 2003 | January 31, 2003
2003 | 7 | Tuesday | January 7, 2003 | January 31, 2003
2003 | 28 | Tuesday | January 28, 2003 | January 31, 2003
2003 | 22 | Wednesday | January 22, 2003 | January 31, 2003
2003 | 29 | Wednesday | January 29, 2003 | January 31, 2003
2003 | 15 | Wednesday | January 15, 2003 | January 31, 2003
2003 | 1 | Wednesday | January 1, 2003 | January 31, 2003
2003 | 8 | Wednesday | January 8, 2003 | January 31, 2003
(31 rows)
Note
The day_of_week results are returned in alphabetical order because of lexical rules. The fact that each day does not appear ordered by the 7-day week cycle (for example, starting with Sunday followed by Monday, Tuesday, and so on) has no affect on results.
To return deterministic results, modify the query so that it performs its analytic ORDER BY operations on a unique field, such as date_key:
=> SELECT calendar_year, date_key, day_of_week, full_date_description,
FIRST_VALUE(full_date_description) OVER
(PARTITION BY calendar_month_number_in_year ORDER BY date_key) AS "first_value"
FROM date_dimension WHERE calendar_year=2003;
FIRST_VALUE() returns a first value of January 1 for the January partition and the first value of February 1 for the February partition. Also, the full_date_description column contains no ties:
calendar_year | date_key | day_of_week | full_date_description | first_value
---------------+----------+-------------+-----------------------+------------
2003 | 1 | Wednesday | January 1, 2003 | January 1, 2003
2003 | 2 | Thursday | January 2, 2003 | January 1, 2003
2003 | 3 | Friday | January 3, 2003 | January 1, 2003
2003 | 4 | Saturday | January 4, 2003 | January 1, 2003
2003 | 5 | Sunday | January 5, 2003 | January 1, 2003
2003 | 6 | Monday | January 6, 2003 | January 1, 2003
2003 | 7 | Tuesday | January 7, 2003 | January 1, 2003
2003 | 8 | Wednesday | January 8, 2003 | January 1, 2003
2003 | 9 | Thursday | January 9, 2003 | January 1, 2003
2003 | 10 | Friday | January 10, 2003 | January 1, 2003
2003 | 11 | Saturday | January 11, 2003 | January 1, 2003
2003 | 12 | Sunday | January 12, 2003 | January 1, 2003
2003 | 13 | Monday | January 13, 2003 | January 1, 2003
2003 | 14 | Tuesday | January 14, 2003 | January 1, 2003
2003 | 15 | Wednesday | January 15, 2003 | January 1, 2003
2003 | 16 | Thursday | January 16, 2003 | January 1, 2003
2003 | 17 | Friday | January 17, 2003 | January 1, 2003
2003 | 18 | Saturday | January 18, 2003 | January 1, 2003
2003 | 19 | Sunday | January 19, 2003 | January 1, 2003
2003 | 20 | Monday | January 20, 2003 | January 1, 2003
2003 | 21 | Tuesday | January 21, 2003 | January 1, 2003
2003 | 22 | Wednesday | January 22, 2003 | January 1, 2003
2003 | 23 | Thursday | January 23, 2003 | January 1, 2003
2003 | 24 | Friday | January 24, 2003 | January 1, 2003
2003 | 25 | Saturday | January 25, 2003 | January 1, 2003
2003 | 26 | Sunday | January 26, 2003 | January 1, 2003
2003 | 27 | Monday | January 27, 2003 | January 1, 2003
2003 | 28 | Tuesday | January 28, 2003 | January 1, 2003
2003 | 29 | Wednesday | January 29, 2003 | January 1, 2003
2003 | 30 | Thursday | January 30, 2003 | January 1, 2003
2003 | 31 | Friday | January 31, 2003 | January 1, 2003
2003 | 32 | Saturday | February 1, 2003 | February 1, 2003
2003 | 33 | Sunday | February 2, 2003 | February 1,2003
...
(365 rows)
Returns the value of the input expression at the given offset before the current row within a.
Returns the value of the input expression at the given offset before the current row within a window. This function lets you access more than one row in a table at the same time. This is useful for comparing values when the relative positions of rows can be reliably known. It also lets you avoid the more costly self join, which enhances query processing speed.
For information on getting the rows that follow, see LEAD.
The expression to evaluate—for example, a constant, column, non-analytic function, function expression, or expressions involving any of these.
offset
Indicates how great is the lag. The default value is 1 (the previous row). This parameter must evaluate to a constant positive integer.
default
The value returned if offset falls outside the bounds of the table or partition. This value must be a constant value or an expression that can be evaluated to a constant; its data type is coercible to that of the first argument.
Examples
This example sums the current balance by date in a table and also sums the previous balance from the last day. Given the inputs that follow, the data satisfies the following conditions:
For each some_id, there is exactly 1 row for each date represented by month_date.
For each some_id, the set of dates is consecutive; that is, if there is a row for February 24 and a row for February 26, there would also be a row for February 25.
Each some_id has the same set of dates.
=> CREATE TABLE balances (
month_date DATE,
current_bal INT,
some_id INT);
=> INSERT INTO balances values ('2009-02-24', 10, 1);
=> INSERT INTO balances values ('2009-02-25', 10, 1);
=> INSERT INTO balances values ('2009-02-26', 10, 1);
=> INSERT INTO balances values ('2009-02-24', 20, 2);
=> INSERT INTO balances values ('2009-02-25', 20, 2);
=> INSERT INTO balances values ('2009-02-26', 20, 2);
=> INSERT INTO balances values ('2009-02-24', 30, 3);
=> INSERT INTO balances values ('2009-02-25', 20, 3);
=> INSERT INTO balances values ('2009-02-26', 30, 3);
Now run LAG to sum the current balance for each date and sum the previous balance from the last day:
=> SELECT month_date,
SUM(current_bal) as current_bal_sum,
SUM(previous_bal) as previous_bal_sum FROM
(SELECT month_date, current_bal,
LAG(current_bal, 1, 0) OVER
(PARTITION BY some_id ORDER BY month_date)
AS previous_bal FROM balances) AS subQ
GROUP BY month_date ORDER BY month_date;
month_date | current_bal_sum | previous_bal_sum
------------+-----------------+------------------
2009-02-24 | 60 | 0
2009-02-25 | 50 | 60
2009-02-26 | 60 | 50
(3 rows)
Using the same example data, the following query would not be allowed because LAG is nested inside an aggregate function:
=> SELECT month_date,
SUM(current_bal) as current_bal_sum,
SUM(LAG(current_bal, 1, 0) OVER
(PARTITION BY some_id ORDER BY month_date)) AS previous_bal_sum
FROM some_table GROUP BY month_date ORDER BY month_date;
The following example uses the VMart database. LAG first returns the annual income from the previous row, and then it calculates the difference between the income in the current row from the income in the previous row:
=> SELECT occupation, customer_key, customer_name, annual_income,
LAG(annual_income, 1, 0) OVER (PARTITION BY occupation
ORDER BY annual_income) AS prev_income, annual_income -
LAG(annual_income, 1, 0) OVER (PARTITION BY occupation
ORDER BY annual_income) AS difference
FROM customer_dimension ORDER BY occupation, customer_key LIMIT 20;
occupation | customer_key | customer_name | annual_income | prev_income | difference
------------+--------------+----------------------+---------------+-------------+------------
Accountant | 15 | Midori V. Peterson | 692610 | 692535 | 75
Accountant | 43 | Midori S. Rodriguez | 282359 | 280976 | 1383
Accountant | 93 | Robert P. Campbell | 471722 | 471355 | 367
Accountant | 102 | Sam T. McNulty | 901636 | 901561 | 75
Accountant | 134 | Martha B. Overstreet | 705146 | 704335 | 811
Accountant | 165 | James C. Kramer | 376841 | 376474 | 367
Accountant | 225 | Ben W. Farmer | 70574 | 70449 | 125
Accountant | 270 | Jessica S. Lang | 684204 | 682274 | 1930
Accountant | 273 | Mark X. Lampert | 723294 | 722737 | 557
Accountant | 295 | Sharon K. Gauthier | 29033 | 28412 | 621
Accountant | 338 | Anna S. Jackson | 816858 | 815557 | 1301
Accountant | 377 | William I. Jones | 915149 | 914872 | 277
Accountant | 438 | Joanna A. McCabe | 147396 | 144482 | 2914
Accountant | 452 | Kim P. Brown | 126023 | 124797 | 1226
Accountant | 467 | Meghan K. Carcetti | 810528 | 810284 | 244
Accountant | 478 | Tanya E. Greenwood | 639649 | 639029 | 620
Accountant | 511 | Midori P. Vogel | 187246 | 185539 | 1707
Accountant | 525 | Alexander K. Moore | 677433 | 677050 | 383
Accountant | 550 | Sam P. Reyes | 735691 | 735355 | 336
Accountant | 577 | Robert U. Vu | 616101 | 615439 | 662
(20 rows)
The next example uses LEAD and LAG to return the third row after the salary in the current row and fifth salary before the salary in the current row:
Lets you select the last value of a table or partition (determined by the window-order-clause) without having to use a self join.
Lets you select the last value of a table or partition (determined by the window-order-clause) without having to use a self join. LAST_VALUE takes the last record from the partition after the window order clause. The function then computes the expression against the last record, and returns the results. This function is useful when you want to use the last value as a baseline in calculations.
Use LAST_VALUE() with the window-order-clause to produce deterministic results. If no window is specified for the current row, the default window is UNBOUNDED PRECEDING AND CURRENT ROW.
Tip
Due to default window semantics, LAST_VALUE does not always return the last value of a partition. If you omit Window frame clause from the analytic clause, LAST_VALUE operates on this default window. Although results can seem non-intuitive by not returning the bottom of the current partition, it returns the bottom of the window, which continues to change along with the current input row being processed. If you want to return the last value of a partition, use UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING. See examples below.
Expression to evaluate—for example, a constant, column, nonanalytic function, function expression, or expressions involving any of these.
IGNORE NULLS
Specifies to return the last non-null value in the set, or NULL if all values are NULL. If you omit this option and the last value in the set is null, the function returns NULL.
Using the schema defined in Window framing in Analyzing Data, the following query does not show the highest salary value by department; instead it shows the highest salary value by department by salary.
If you include the window frame clause ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING, LAST_VALUE() returns the highest salary by department, an accurate representation of the information:
Returns values from the row after the current row within a , letting you access more than one row in a table at the same time.
Returns values from the row after the current row within a window, letting you access more than one row in a table at the same time. This is useful for comparing values when the relative positions of rows can be reliably known. It also lets you avoid the more costly self join, which enhances query processing speed.
The expression to evaluate—for example, a constant, column, non-analytic function, function expression, or expressions involving any of these.
offset
Is an optional parameter that defaults to 1 (the next row). This parameter must evaluate to a constant positive integer.
default
The value returned if offset falls outside the bounds of the table or partition. This value must be a constant value or an expression that can be evaluated to a constant; its data type is coercible to that of the first argument.
Examples
LEAD finds the hire date of the employee hired just after the current row:
=> SELECT employee_region, hire_date, employee_key, employee_last_name,
LEAD(hire_date, 1) OVER (PARTITION BY employee_region ORDER BY hire_date) AS "next_hired"
FROM employee_dimension ORDER BY employee_region, hire_date, employee_key;
employee_region | hire_date | employee_key | employee_last_name | next_hired
-------------------+------------+--------------+--------------------+------------
East | 1956-04-08 | 9218 | Harris | 1957-02-06
East | 1957-02-06 | 7799 | Stein | 1957-05-25
East | 1957-05-25 | 3687 | Farmer | 1957-06-26
East | 1957-06-26 | 9474 | Bauer | 1957-08-18
East | 1957-08-18 | 570 | Jefferson | 1957-08-24
East | 1957-08-24 | 4363 | Wilson | 1958-02-17
East | 1958-02-17 | 6457 | McCabe | 1958-06-26
East | 1958-06-26 | 6196 | Li | 1958-07-16
East | 1958-07-16 | 7749 | Harris | 1958-09-18
East | 1958-09-18 | 9678 | Sanchez | 1958-11-10
(10 rows)
The next example uses LEAD and LAG to return the third row after the salary in the current row and fifth salary before the salary in the current row.
The next example returns, for each assistant director in the employees table, the hire date of the director hired just after the director on the current row. For example, Jackson was hired on 2016-12-28, and the next director hired was Bauer:
=> SELECT employee_last_name, hire_date,
LEAD(hire_date, 1) OVER (ORDER BY hire_date DESC) as "NextHired"
FROM employee_dimension WHERE job_title = 'Assistant Director';
employee_last_name | hire_date | NextHired
--------------------+------------+------------
Jackson | 2016-12-28 | 2016-12-26
Bauer | 2016-12-26 | 2016-12-11
Miller | 2016-12-11 | 2016-12-07
Fortin | 2016-12-07 | 2016-11-27
Harris | 2016-11-27 | 2016-11-15
Goldberg | 2016-11-15 |
(5 rows)
Returns the maximum value of an expression within a.
Returns the maximum value of an expression within a window. The return value has the same type as the expression data type.
The analytic functions MIN() and MAX() can operate with Boolean values. The MAX() function acts upon a Boolean data type or a value that can be implicitly converted to a Boolean value. If at least one input value is true, MAX() returns t (true). Otherwise, it returns f (false). In the same scenario, the MIN() function returns t (true) if all input values are true. Otherwise, it returns f.
The following query computes the deviation between the employees' annual salary and the maximum annual salary in Massachusetts:
=> SELECT employee_state, annual_salary,
MAX(annual_salary)
OVER(PARTITION BY employee_state ORDER BY employee_key) max,
annual_salary- MAX(annual_salary)
OVER(PARTITION BY employee_state ORDER BY employee_key) diff
FROM employee_dimension
WHERE employee_state = 'MA';
employee_state | annual_salary | max | diff
----------------+---------------+--------+---------
MA | 1918 | 995533 | -993615
MA | 2058 | 995533 | -993475
MA | 2586 | 995533 | -992947
MA | 2500 | 995533 | -993033
MA | 1318 | 995533 | -994215
MA | 2072 | 995533 | -993461
MA | 2656 | 995533 | -992877
MA | 2148 | 995533 | -993385
MA | 2366 | 995533 | -993167
MA | 2664 | 995533 | -992869
(10 rows)
The following example shows you the difference between the MIN and MAX analytic functions when you use them with a Boolean value. The sample creates a table with two columns, adds two rows of data, and shows sample output for MIN and MAX.
CREATE TABLE min_max_functions (emp VARCHAR, torf BOOL);
INSERT INTO min_max_functions VALUES ('emp1', 1);
INSERT INTO min_max_functions VALUES ('emp1', 0);
SELECT DISTINCT emp,
min(torf) OVER (PARTITION BY emp) AS worksasbooleanand,
Max(torf) OVER (PARTITION BY emp) AS worksasbooleanor
FROM min_max_functions;
emp | worksasbooleanand | worksasbooleanor
------+-------------------+------------------
emp1 | f | t
(1 row)
For each row, returns the median value of a value set within each partition.
For each row, returns the median value of a value set within each partition. MEDIAN determines the argument with the highest numeric precedence, implicitly converts the remaining arguments to that data type, and returns that data type.
Any NUMERIC data type or any non-numeric data type that can be implicitly converted to a numeric data type. The function returns the middle value or an interpolated value that would be the middle value once the values are sorted. Null values are ignored in the calculation.
OVER()
If the OVER clause specifies window-partition-clause, MEDIAN groups input rows according to one or more columns or expressions. If this clause is omitted, no grouping occurs and MEDIAN processes all input rows as a single partition.
Returns the minimum value of an expression within a.
Returns the minimum value of an expression within a window. The return value has the same type as the expression data type.
The analytic functions MIN() and MAX() can operate with Boolean values. The MAX() function acts upon a Boolean data type or a value that can be implicitly converted to a Boolean value. If at least one input value is true, MAX() returns t (true). Otherwise, it returns f (false). In the same scenario, the MIN() function returns t (true) if all input values are true. Otherwise, it returns f.
The following example shows how you can query to determine the deviation between the employees' annual salary and the minimum annual salary in Massachusetts:
=> SELECT employee_state, annual_salary,
MIN(annual_salary)
OVER(PARTITION BY employee_state ORDER BY employee_key) min,
annual_salary- MIN(annual_salary)
OVER(PARTITION BY employee_state ORDER BY employee_key) diff
FROM employee_dimension
WHERE employee_state = 'MA';
employee_state | annual_salary | min | diff
----------------+---------------+------+------
MA | 1918 | 1204 | 714
MA | 2058 | 1204 | 854
MA | 2586 | 1204 | 1382
MA | 2500 | 1204 | 1296
MA | 1318 | 1204 | 114
MA | 2072 | 1204 | 868
MA | 2656 | 1204 | 1452
MA | 2148 | 1204 | 944
MA | 2366 | 1204 | 1162
MA | 2664 | 1204 | 1460
(10 rows)
The following example shows you the difference between the MIN and MAX analytic functions when you use them with a Boolean value. The sample creates a table with two columns, adds two rows of data, and shows sample output for MIN and MAX.
CREATE TABLE min_max_functions (emp VARCHAR, torf BOOL);
INSERT INTO min_max_functions VALUES ('emp1', 1);
INSERT INTO min_max_functions VALUES ('emp1', 0);
SELECT DISTINCT emp,
min(torf) OVER (PARTITION BY emp) AS worksasbooleanand,
Max(torf) OVER (PARTITION BY emp) AS worksasbooleanor
FROM min_max_functions;
emp | worksasbooleanand | worksasbooleanor
------+-------------------+------------------
emp1 | f | t
(1 row)
Returns the value evaluated at the row that is the nth row of the window (counting from 1).
Returns the value evaluated at the row that is the *n*th row of the window (counting from 1). If the specified row does not exist, NTH_VALUE returns NULL.
Expression to evaluate. The expression can be a constant, column name, nonanalytic function, function expression, or expressions that include any of these.
row-number
Specifies the row to evaluate, where row-number evaluates to an integer ≥ 1.
IGNORE NULLS
Specifies to return the first non-NULL value in the set, or NULL if all values are NULL.
In the following example, for each tuple (current row) in table t1, the window frame clause defines the window as follows:
ORDER BY b ROWS BETWEEN 3 PRECEDING AND CURRENT ROW
For each window, n for *n*th value is a+1. a is the value of column a in the tuple.
NTH_VALUE returns the result of the expression b+1, where b is the value of column b in the *n*th row, which is the a+1 row within the window.
=> SELECT * FROM t1 ORDER BY a;
a | b
---+----
1 | 10
2 | 20
2 | 21
3 | 30
4 | 40
5 | 50
6 | 60
(7 rows)
=> SELECT NTH_VALUE(b+1, a+1) OVER
(ORDER BY b ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) FROM t1;
?column?
----------
22
31
(7 rows)
21 - NTILE [analytic]
Equally divides an ordered data set (partition) into a number of subsets within a , where the subsets are numbered 1 through the value in parameter constant‑value.
Equally divides an ordered data set (partition) into a {value} number of subsets within a window, where the subsets are numbered 1 through the value in parameter constant-value. For example, if constant-value= 4 and the partition contains 20 rows, NTILE divides the partition rows into four equal subsets of five rows. NTILE assigns each row to a subset by giving row a number from 1 to 4. The rows in the first subset are assigned 1, the next five are assigned 2, and so on.
If the number of partition rows is not evenly divisible by the number of subsets, the rows are distributed so no subset is more than one row larger than any other subset, and the lowest subsets have extra rows. For example, if constant-value= 4 and the number of rows = 21, the first subset has six rows, the second subset has five rows, and so on.
If the number of subsets is greater than the number of rows, then a number of subsets equal to the number of rows is filled, and the remaining subsets are empty.
The following query assigns each month's sales total into one of four subsets:
=> SELECT calendar_month_name AS MONTH, SUM(sales_quantity),
NTILE(4) OVER (ORDER BY SUM(sales_quantity)) AS NTILE
FROM store.store_sales_fact JOIN date_dimension
USING(date_key)
GROUP BY calendar_month_name
ORDER BY NTILE;
MONTH | SUM | NTILE
-----------+---------+-------
November | 2040726 | 1
June | 2088528 | 1
February | 2134708 | 1
April | 2181767 | 2
January | 2229220 | 2
October | 2316363 | 2
September | 2323914 | 3
March | 2354409 | 3
August | 2387017 | 3
July | 2417239 | 4
May | 2492182 | 4
December | 2531842 | 4
(12 rows)
Calculates the relative rank of a row for a given row in a group within a by dividing that row’s rank less 1 by the number of rows in the partition, also less 1.
Calculates the relative rank of a row for a given row in a group within a window by dividing that row’s rank less 1 by the number of rows in the partition, also less 1. PERCENT_RANK always returns values from 0 to 1 inclusive. The first row in any set has a PERCENT_RANK of 0. The return value is NUMBER.
( rank - 1 ) / ( [ rows ] - 1 )
In the preceding formula, rank is the rank position of a row in the group and rows is the total number of rows in the partition defined by the OVER() clause.
The following example finds the percent rank of gross profit for different states within each month of the first quarter:
=> SELECT calendar_month_name AS MONTH, store_state,
SUM(gross_profit_dollar_amount),
PERCENT_RANK() OVER (PARTITION BY calendar_month_name
ORDER BY SUM(gross_profit_dollar_amount)) AS PERCENT_RANK
FROM store.store_sales_fact JOIN date_dimension
USING(date_key)
JOIN store.store_dimension
USING (store_key)
WHERE calendar_month_name IN ('January','February','March')
AND store_state IN ('OR','IA','DC','NV','WI')
GROUP BY calendar_month_name, store_state
ORDER BY calendar_month_name, PERCENT_RANK;
MONTH | store_state | SUM | PERCENT_RANK
----------+-------------+--------+--------------
February | IA | 418490 | 0
February | OR | 460588 | 0.25
February | DC | 616553 | 0.5
February | WI | 619204 | 0.75
February | NV | 838039 | 1
January | OR | 446528 | 0
January | IA | 474501 | 0.25
January | DC | 628496 | 0.5
January | WI | 679382 | 0.75
January | NV | 871824 | 1
March | IA | 460282 | 0
March | OR | 481935 | 0.25
March | DC | 716063 | 0.5
March | WI | 771575 | 0.75
March | NV | 970878 | 1
(15 rows)
The following example calculates, for each employee, the percent rank of the employee's salary by their job title:
=> SELECT job_title, employee_last_name, annual_salary,
PERCENT_RANK()
OVER (PARTITION BY job_title ORDER BY annual_salary DESC) AS percent_rank
FROM employee_dimension
ORDER BY percent_rank, annual_salary;
job_title | employee_last_name | annual_salary | percent_rank
--------------------+--------------------+---------------+---------------------
Cashier | Fortin | 3196 | 0
Delivery Person | Garnett | 3196 | 0
Cashier | Vogel | 3196 | 0
Customer Service | Sanchez | 3198 | 0
Shelf Stocker | Jones | 3198 | 0
Custodian | Li | 3198 | 0
Customer Service | Kramer | 3198 | 0
Greeter | McNulty | 3198 | 0
Greeter | Greenwood | 3198 | 0
Shift Manager | Miller | 99817 | 0
Advertising | Vu | 99853 | 0
Branch Manager | Jackson | 99858 | 0
Marketing | Taylor | 99928 | 0
Assistant Director | King | 99973 | 0
Sales | Kramer | 99973 | 0
Head of PR | Goldberg | 199067 | 0
Regional Manager | Gauthier | 199744 | 0
Director of HR | Moore | 199896 | 0
Head of Marketing | Overstreet | 199955 | 0
VP of Advertising | Meyer | 199975 | 0
VP of Sales | Sanchez | 199992 | 0
Founder | Gauthier | 927335 | 0
CEO | Taylor | 953373 | 0
Investor | Garnett | 963104 | 0
Co-Founder | Vu | 977716 | 0
CFO | Vogel | 983634 | 0
President | Sanchez | 992363 | 0
Delivery Person | Li | 3194 | 0.00114155251141553
Delivery Person | Robinson | 3194 | 0.00114155251141553
Custodian | McCabe | 3192 | 0.00126582278481013
Shelf Stocker | Moore | 3196 | 0.00128040973111396
Branch Manager | Moore | 99716 | 0.00186567164179104
...
An inverse distribution function where, for each row, PERCENTILE_CONT returns the value that would fall into the specified percentile among a set of values in each partition within a.
An inverse distribution function where, for each row, PERCENTILE_CONT returns the value that would fall into the specified percentile among a set of values in each partition within a window. For example, if the argument to the function is 0.5, the result of the function is the median of the data set (50th percentile). PERCENTILE_CONT assumes a continuous distribution data model. NULL values are ignored.
PERCENTILE_CONT computes the percentile by first computing the row number where the percentile row would exist. For example:
If row-number is a whole number (within an error of 0.00001), the percentile is the value of row row-number.
Otherwise, Vertica interpolates the percentile value between the value of the CEILING(row-number) row and the value of the FLOOR(row-number) row. In other words, the percentile is calculated as follows:
( CEILING( row-number) - row-number ) * ( value of FLOOR(row-number) row )
+ ( row-number - FLOOR(row-number) ) * ( value of CEILING(row-number) row)
Note
If the percentile value is 0.5, PERCENTILE_CONT returns the same result set as the function MEDIAN.
PERCENTILE_CONT ( percentile ) WITHIN GROUP ( ORDER BY expression [ ASC | DESC ] ) OVER ( [ window-partition-clause ] )
Parameters
percentile
Percentile value, a FLOAT constant that ranges from 0 to 1 (inclusive).
WITHIN GROUP (ORDER BYexpression)
Specifies how to sort data within each group. ORDER BY takes only one column/expression that must be INTEGER, FLOAT, INTERVAL, or NUMERIC data type. NULL values are discarded.
The WITHIN GROUP(ORDER BY) clause does not guarantee the order of the SQL result. To order the final result , use the SQL ORDER BY clause set.
ASC | DESC
Specifies the ordering sequence as ascending (default) or descending.
Specifying ASC or DESC in the WITHIN GROUP clause affects results as long as the percentile is not 0.5.
This query computes the median annual income per group for the first 300 customers in Wisconsin and the District of Columbia.
=> SELECT customer_state, customer_key, annual_income, PERCENTILE_CONT(0.5) WITHIN GROUP(ORDER BY annual_income)
OVER (PARTITION BY customer_state) AS PERCENTILE_CONT
FROM customer_dimension WHERE customer_state IN ('DC','WI') AND customer_key < 300
ORDER BY customer_state, customer_key;
customer_state | customer_key | annual_income | PERCENTILE_CONT
----------------+--------------+---------------+-----------------
DC | 52 | 168312 | 483266.5
DC | 118 | 798221 | 483266.5
WI | 62 | 283043 | 377691
WI | 139 | 472339 | 377691
(4 rows)
This query computes the median annual income per group for all customers in Wisconsin and the District of Columbia.
=> SELECT customer_state, customer_key, annual_income, PERCENTILE_CONT(0.5) WITHIN GROUP(ORDER BY annual_income)
OVER (PARTITION BY customer_state) AS PERCENTILE_CONT
FROM customer_dimension WHERE customer_state IN ('DC','WI') ORDER BY customer_state, customer_key;
customer_state | customer_key | annual_income | PERCENTILE_CONT
----------------+--------------+---------------+-----------------
DC | 52 | 168312 | 483266.5
DC | 118 | 798221 | 483266.5
DC | 622 | 220782 | 555088
DC | 951 | 178453 | 555088
DC | 972 | 961582 | 555088
DC | 1286 | 760445 | 555088
DC | 1434 | 44836 | 555088
...
WI | 62 | 283043 | 377691
WI | 139 | 472339 | 377691
WI | 359 | 42242 | 517717
WI | 364 | 867543 | 517717
WI | 403 | 509031 | 517717
WI | 455 | 32000 | 517717
WI | 485 | 373129 | 517717
...
(1353 rows)
An inverse distribution function where, for each row, PERCENTILE_DISC returns the value that would fall into the specified percentile among a set of values in each partition within a.
An inverse distribution function where, for each row, PERCENTILE_DISC returns the value that would fall into the specified percentile among a set of values in each partition within a window. PERCENTILE_DISC() assumes a discrete distribution data model. NULL values are ignored.
PERCENTILE_DISC examines the cumulative distribution values in each group until it finds one that is greater than or equal to the specified percentile. Vertica computes the percentile where, for each row, PERCENTILE_DISC outputs the first value of the WITHIN GROUP(ORDER BY) column whose CUME_DIST (cumulative distribution) value is >= the argument FLOAT value—for example, 0.4:
PERCENTILE_DIST(0.4) WITHIN GROUP (ORDER BY salary) OVER(PARTITION BY deptno)...
Given the following query:
SELECT CUME_DIST() OVER(ORDER BY salary) FROM table-name;
The smallest CUME_DIST value that is greater than 0.4 is also the PERCENTILE_DISC.
PERCENTILE_DISC ( percentile ) WITHIN GROUP (
ORDER BY expression [ ASC | DESC ] ) OVER (
[ window-partition-clause ] )
Parameters
percentile
Percentile value, a FLOAT constant that ranges from 0 to 1 (inclusive).
WITHIN GROUP(ORDER BYexpression)
Specifies how to sort data within each group. ORDER BY takes only one column/expression that must be INTEGER, FLOAT, INTERVAL, or NUMERIC data type. NULL values are discarded.
The WITHIN GROUP(ORDER BY) clause does not guarantee the order of the SQL result. To order the final result , use the SQL
ORDER BY clause set.
ASC | DESC
Specifies the ordering sequence as ascending (default) or descending.
This query computes the 20th percentile annual income by group for first 300 customers in Wisconsin and the District of Columbia.
=> SELECT customer_state, customer_key, annual_income,
PERCENTILE_DISC(.2) WITHIN GROUP(ORDER BY annual_income)
OVER (PARTITION BY customer_state) AS PERCENTILE_DISC
FROM customer_dimension
WHERE customer_state IN ('DC','WI')
AND customer_key < 300
ORDER BY customer_state, customer_key;
customer_state | customer_key | annual_income | PERCENTILE_DISC
----------------+--------------+---------------+-----------------
DC | 104 | 658383 | 417092
DC | 168 | 417092 | 417092
DC | 245 | 670205 | 417092
WI | 106 | 227279 | 227279
WI | 127 | 703889 | 227279
WI | 209 | 458607 | 227279
(6 rows)
Within each window partition, ranks all rows in the query results set according to the order specified by the window's ORDER BY clause.
Within each window partition, ranks all rows in the query results set according to the order specified by the window's ORDER BY clause.
RANK executes as follows:
Sorts partition rows as specified by the ORDER BY clause.
Compares the ORDER BY values of the preceding row and current row and ranks the current row as follows:
If ORDER BY values are the same, the current row gets the same ranking as the preceding row.
Note
Null values are considered equal. For detailed information on how null values are sorted, see NULL sort order.
If the ORDER BY values are different, DENSE_RANK increments or decrements the current row's ranking by 1, plus the number of consecutive duplicate values in the rows that precede it.
The largest rank value is the equal to the total number of rows returned by the query.
RANK can leave gaps in the ranking sequence, while
DENSE_RANK does not.
Examples
The following query ranks by state all company customers that have been customers since 2007. In rows where the customer_since dates are the same, RANK assigns the rows equal ranking. When the customer_since date changes, RANK skips one or more rankings—for example, within CA, from 12 to 14, and from 17 to 19.
=> SELECT customer_state, customer_name, customer_since,
RANK() OVER (PARTITION BY customer_state ORDER BY customer_since) AS rank
FROM customer_dimension WHERE customer_type='Company' AND customer_since > '01/01/2007'
ORDER BY customer_state;
customer_state | customer_name | customer_since | rank
----------------+---------------+----------------+------
AZ | Foodshop | 2007-01-20 | 1
AZ | Goldstar | 2007-08-11 | 2
CA | Metahope | 2007-01-05 | 1
CA | Foodgen | 2007-02-05 | 2
CA | Infohope | 2007-02-09 | 3
CA | Foodcom | 2007-02-19 | 4
CA | Amerihope | 2007-02-22 | 5
CA | Infostar | 2007-03-05 | 6
CA | Intracare | 2007-03-14 | 7
CA | Infocare | 2007-04-07 | 8
...
CO | Goldtech | 2007-02-19 | 1
CT | Foodmedia | 2007-02-11 | 1
CT | Metatech | 2007-02-20 | 2
CT | Infocorp | 2007-04-10 | 3
...
Assigns a sequence of unique numbers to each row in a partition, starting with 1.
Assigns a sequence of unique numbers to each row in a window partition, starting with 1. ROW_NUMBER and RANK are generally interchangeable, with the following differences:
ROW_NUMBER assigns a unique ordinal number to each row in the ordered set, starting with 1.
ROW_NUMBER() is a Vertica extension, while RANK conforms to the SQL-99 standard.
The following ROW_NUMBER query partitions customers in the VMart table customer_dimension by customer_region. Within each partition, the function ranks those customers in order of seniority, as specified by its window order clause:
=> SELECT * FROM
(SELECT ROW_NUMBER() OVER (PARTITION BY customer_region ORDER BY customer_since) AS most_senior,
customer_region, customer_name, customer_since FROM public.customer_dimension WHERE customer_type = 'Individual') sq
WHERE most_senior <= 5;
most_senior | customer_region | customer_name | customer_since
-------------+-----------------+----------------------+----------------
1 | West | Jack Y. Perkins | 1965-01-01
2 | West | Linda Q. Winkler | 1965-01-02
3 | West | Marcus K. Li | 1965-01-03
4 | West | Carla R. Jones | 1965-01-07
5 | West | Seth P. Young | 1965-01-09
1 | East | Kim O. Vu | 1965-01-01
2 | East | Alexandra L. Weaver | 1965-01-02
3 | East | Steve L. Webber | 1965-01-04
4 | East | Thom Y. Li | 1965-01-05
5 | East | Martha B. Farmer | 1965-01-07
1 | SouthWest | Martha V. Gauthier | 1965-01-01
2 | SouthWest | Jessica U. Goldberg | 1965-01-07
3 | SouthWest | Robert O. Stein | 1965-01-07
4 | SouthWest | Emily I. McCabe | 1965-01-18
5 | SouthWest | Jack E. Miller | 1965-01-25
1 | NorthWest | Julie O. Greenwood | 1965-01-08
2 | NorthWest | Amy X. McNulty | 1965-01-25
3 | NorthWest | Kevin S. Carcetti | 1965-02-09
4 | NorthWest | Sam K. Carcetti | 1965-03-16
5 | NorthWest | Alexandra X. Winkler | 1965-04-05
1 | MidWest | Michael Y. Meyer | 1965-01-01
2 | MidWest | Joanna W. Bauer | 1965-01-06
3 | MidWest | Amy E. Harris | 1965-01-08
4 | MidWest | Julie W. McCabe | 1965-01-09
5 | MidWest | William . Peterson | 1965-01-09
1 | South | Dean . Martin | 1965-01-01
2 | South | Ruth U. Williams | 1965-01-02
3 | South | Steve Y. Farmer | 1965-01-03
4 | South | Mark V. King | 1965-01-08
5 | South | Lucas Y. Young | 1965-01-10
(30 rows)
Computes the statistical sample standard deviation of the current row with respect to the group within a.
Computes the statistical sample standard deviation of the current row with respect to the group within a window. STDDEV_SAMP returns the same value as the square root of the variance defined for the
VAR_SAMP function:
Any NUMERIC data type or any non-numeric data type that can be implicitly converted to a numeric data type. The function returns the same data type as the numeric data type of the argument.
Computes the statistical population standard deviation and returns the square root of the population variance within a.
Computes the statistical population standard deviation and returns the square root of the population variance within a window. The STDDEV_POP() return value is the same as the square root of the VAR_POP() function:
Any NUMERIC data type or any non-numeric data type that can be implicitly converted to a numeric data type. The function returns the same data type as the numeric data type of the argument.
Computes the statistical sample standard deviation of the current row with respect to the group within a.
Computes the statistical sample standard deviation of the current row with respect to the group within a window. STDDEV_SAM's return value is the same as the square root of the variance defined for the VAR_SAMP function:
Any NUMERIC data type or any non-numeric data type that can be implicitly converted to a numeric data type. The function returns the same data type as the numeric data type of the argument..
Computes the sum of an expression over a group of rows within a.
Computes the sum of an expression over a group of rows within a window. It returns a DOUBLE PRECISION value for a floating-point expression. Otherwise, the return value is the same as the expression data type.
Any NUMERIC data type or any non-numeric data type that can be implicitly converted to a numeric data type. The function returns the same data type as the numeric data type of the argument.
If you encounter data overflow when using SUM, use
SUM_FLOAT which converts data to a floating point.
By default, Vertica allows silent numeric overflow when you call this function on numeric data types. For more information on this behavior and how to change it, seeNumeric data type overflow with SUM, SUM_FLOAT, and AVG.
Examples
The following query returns the cumulative sum all of the returns made to stores in January:
=> SELECT calendar_month_name AS month, transaction_type, sales_quantity,
SUM(sales_quantity)
OVER (PARTITION BY calendar_month_name ORDER BY date_dimension.date_key) AS SUM
FROM store.store_sales_fact JOIN date_dimension
USING(date_key) WHERE calendar_month_name IN ('January')
AND transaction_type= 'return';
month | transaction_type | sales_quantity | SUM
---------+------------------+----------------+------
January | return | 7 | 651
January | return | 3 | 651
January | return | 7 | 651
January | return | 7 | 651
January | return | 7 | 651
January | return | 3 | 651
January | return | 7 | 651
January | return | 5 | 651
January | return | 1 | 651
January | return | 6 | 651
January | return | 6 | 651
January | return | 3 | 651
January | return | 9 | 651
January | return | 7 | 651
January | return | 6 | 651
January | return | 8 | 651
January | return | 7 | 651
January | return | 2 | 651
January | return | 4 | 651
January | return | 5 | 651
January | return | 7 | 651
January | return | 8 | 651
January | return | 4 | 651
January | return | 10 | 651
January | return | 6 | 651
...
Returns the statistical population variance of a non-null set of numbers (nulls are ignored) in a group within a.
Returns the statistical population variance of a non-null set of numbers (nulls are ignored) in a group within a window. Results are calculated by the sum of squares of the difference of expression from the mean of expression, divided by the number of rows remaining:
Any NUMERIC data type or any non-numeric data type that can be implicitly converted to a numeric data type. The function returns the same data type as the numeric data type of the argument
Returns the sample variance of a non-NULL set of numbers (NULL values in the set are ignored) for each row of the group within a.
Returns the sample variance of a non-NULL set of numbers (NULL values in the set are ignored) for each row of the group within a window. Results are calculated as follows:
Any NUMERIC data type or any non-numeric data type that can be implicitly converted to a numeric data type. The function returns the same data type as the numeric data type of the argument
Returns the sample variance of a non-NULL set of numbers (NULL values in the set are ignored) for each row of the group within a.
Returns the sample variance of a non-NULL set of numbers (NULL values in the set are ignored) for each row of the group within a window. Results are calculated as follows:
Any NUMERIC data type or any non-numeric data type that can be implicitly converted to a numeric data type. The function returns the same data type as the numeric data type of the argument.
Specifies a window frame, which comprises a set of rows relative to the row that is currently being evaluated by the analytic function.
Specifies a window frame, which comprises a set of rows relative to the row that is currently being evaluated by the analytic function. After the function processes that row and its window, Vertica advances the current row and adjusts the window boundaries accordingly. If the OVER clause also specifies a partition, Vertica also checks that window boundaries do not cross partition boundaries. This process repeats until the function evaluates the last row of the last partition.
Syntax
{ ROWS | RANGE } { BETWEEN start-point AND end-point } | start-point
Specifies whether Vertica determines window frame dimensions as physical or logical offsets from the current row. See ROWS versus RANGE below for details.
BETWEENstart-pointANDend-point
Specifies the window's first and last rows, where start-point and end-point can be one of the following (discussed in detail below):
UNBOUNDED {PRECEDING | FOLLOWING}
CURRENT ROW
constant-value{PRECEDING | FOLLOWING}
start-point must resolve to a row or value that is less than or equal to end-point.
UNBOUNDED PRECEDING
Specifies that the window frame extends to the current partition's first row.
start-point
If ROWS or RANGE specifies only a start point, Vertica uses the current row as the end point and creates the window frame accordingly. In this case, start-point must resolve to a row that is less than or equal to the current row.
UNBOUNDED FOLLOWING
Specifies that the window frame extends to the current partition's last row.
CURRENT ROW
Specifies the current row or value as the window's start or end point.
constant-value{PRECEDING | FOLLOWING}
Specifies a constant value or expression that evaluates to a constant value. The value specifies a physical or logical offset from the current row, depending on whether you specify ROWS or RANGE.
Other dependencies also pertain, depending whether you specify ROWS and RANGE. See ROWS versus RANGE below for details.
Requirements
In order to specify a window frame, the OVER must also specify a window order (ORDER BY) clause. If the OVER clause omits specifying a window frame, the function creates a default window that extends from the current row to the first row in the current partition. This is equivalent to the following clause:
RANGE UNBOUNDED PRECEDING AND CURRENT ROW
ROWS versus RANGE
The window frame's offset from the current row can be physical or logical:
ROWS specifies the window's start-point and end-point as a number of rows relative to the current row. If start-point and end-point are expressed as constant values, the value must evaluate to a positive integer.
RANGE specifies the window as a logical offset such as time. The range value must match the window order (ORDER BY) clause data type: NUMERIC, DATE/TIME, FLOAT or INTEGER.
Use of ROWS or RANGE imposes specific requirements on setting the window's start and end points as constant values:
Setting constant values for ROWS The constant must evaluate to a positive INTEGER.
Setting constant values for RANGE The following requirements apply:
The constant must evaluate to a positive numeric value or INTERVAL literal.
If the constant evaluates to a NUMERIC value, the ORDER BY column type must be a NUMERIC data type.
If the constant evaluates to an INTERVAL DAY TO SECOND subtype, the ORDER BY column type must be one of the following: TIMESTAMP, TIME, DATE, or INTERVAL DAY TO SECOND.
If the constant evaluates to an INTERVAL YEAR TO MONTH, the ORDER BY column type must be one of the following: TIMESTAMP, DATE, or INTERVAL YEAR TO MONTH.
Defines a named window that specifies window partition and order clauses for an analytic function.
Defines a named window that specifies window partition and order clauses for an analytic function. This window is specified in the function's OVER clause. Named windows can be useful when you write queries that invoke multiple analytic functions with similar OVER clauses—for example, they use the same partition (PARTITION BY) clauses.
Each WINDOW clause within the same query must have a unique name.
A WINDOW clause can reference another window that is already named. For example, the following query names window w1 before w2. Thus, the WINDOW clause that defines w2 can reference w1:
=> SELECT RANK() OVER(w1 ORDER BY sal DESC), RANK() OVER w2
FROM EMP WINDOW w1 AS (PARTITION BY deptno), w2 AS (w1 ORDER BY sal);
Specifies how to sort rows that are supplied to the analytic function.
Specifies how to sort rows that are supplied to the analytic function. If the OVER clause also includes a window partition clause, rows are sorted within each partition.
The window order clause only specifies order within a window result set. The query can have its own ORDER BY clause outside the OVER clause. This has precedence over the window order clause and orders the final result set.
An window order clause also creates a default window frame if none is explicitly specified.
Syntax
ORDER BY { expression [ ASC | DESC [ NULLS { FIRST | LAST | AUTO } ] ]
}[,...]
Parameters
expression
A column, constant, or arbitrary expression formed on columns on which to sort input rows.
ASC | DESC
Specifies the ordering sequence as ascending (default) or descending.
NULLS {FIRST | LAST | AUTO}
Specifies whether to position null values first or last. Default positioning depends on whether the sort order is ascending or descending:
Ascending default: NULLS LAST
Descending default: NULLS FIRST
If you specify NULLS AUTO, Vertica chooses the positioning that is most efficient for this query, either NULLS FIRST or NULLS LAST.
If you omit all sort qualifiers, Vertica uses ASC NULLS LAST.
When specified, a window partition clause divides the rows of the function input based on user-provided expressions.
When specified, a window partition clause divides the rows of the function input based on user-provided expressions. If no expression is provided, the partition clause can improve query performance by using parallelism.
Window partitioning is similar to the GROUP BY clause except that it returns only one result row per input row. If you omit specifying a window partition clause, all input rows are treated as a single partition.
When used with analytic functions, results are computed per partition and start over again (reset) at the beginning of each subsequent partition.
Syntax
{ PARTITION BY expression[,...] | PARTITION BEST | PARTITION NODES }
Parameters
PARTITION BYexpression
Expression on which to sort the partition, where expression can be a column, constant, or an arbitrary expression formed on columns. Use PARTITION BY for analytic functions with specific partitioning requirements.
PARTITION BEST
Use parallelism to improve performance for multi-threaded queries across multiple nodes.
OVER(PARTITION BEST) provides the best performance on multi-threaded queries across multiple nodes.
The following considerations apply to using PARTITION BEST:
Use PARTITION BEST for analytic functions that have no partitioning requirements and are thread safe—for example, a one-to-many transform.
Do not use PARTITION BEST on user-defined transform functions (UDTFs) that are not thread-safe. Doing so can produce an error or incorrect results. If a UDTF is not thread safe, use PARTITION NODES .
PARTITION NODES
Use parallelism to improve performance for single-threaded queries across multiple nodes.
OVER(PARTITION NODES) provides the best performance on single-threaded queries across multiple nodes.