Improving query performance (C++ only)
When evaluating a query, the Vertica optimizer might sort its input to improve performance. If a function already returns sorted data, this means the optimizer is doing extra work. A transform function written in C++ can declare how the data it returns is sorted, and the optimizer can take advantage of that information.
A transform function does the actual sorting in the function's processPartition()
method. To take advantage of this optimization, sorts must be ascending. You need not sort all columns, but you must return the sorted column or columns first.
You can declare how the function sorts its output in the factory's getReturnType()
method.
Caution
If the sorting declared in the factory does not match the sorting provided by the function, query results can be incorrect.The PolyTopKPerPartition
example sorts input columns and returns a given number of rows:
=> SELECT polykSort(14, a, b, c) OVER (ORDER BY a, b, c)
AS (sort1,sort2,sort3) FROM observations ORDER BY 1,2,3;
sort1 | sort2 | sort3
-------+-------+-------
1 | 1 | 1
1 | 1 | 2
1 | 1 | 3
1 | 2 | 1
1 | 2 | 2
1 | 3 | 1
1 | 3 | 2
1 | 3 | 3
1 | 3 | 4
2 | 1 | 1
2 | 1 | 2
2 | 2 | 3
2 | 2 | 34
2 | 3 | 5
(14 rows)
The factory declares this sorting in getReturnType()
by setting the isSortedBy
property on each column. Each SizedColumnType
has an associated Properties
object where this value can be set:
virtual void getReturnType(ServerInterface &srvInterface, const SizedColumnTypes &inputTypes, SizedColumnTypes &outputTypes)
{
vector<size_t> argCols; // Argument column indexes.
inputTypes.getArgumentColumns(argCols);
size_t colIdx = 0;
for (vector<size_t>::iterator i = argCols.begin() + 1; i < argCols.end(); i++)
{
SizedColumnTypes::Properties props;
props.isSortedBy = true;
std::stringstream cname;
cname << "col" << colIdx++;
outputTypes.addArg(inputTypes.getColumnType(*i), cname.str(), props);
}
}
You can see the effects of this optimization by reviewing the EXPLAIN plans for queries with and without this setting. The following output shows the query plan for polyk
, the unsorted version. Note the cost for sorting:
=> EXPLAN SELECT polyk(14, a, b, c) OVER (ORDER BY a, b, c)
FROM observations ORDER BY 1,2,3;
Access Path:
+-SORT [Cost: 2K, Rows: 10K] (PATH ID: 1)
| Order: col0 ASC, col1 ASC, col2 ASC
| +---> ANALYTICAL [Cost: 2K, Rows: 10K] (PATH ID: 2)
| | Analytic Group
| | Functions: polyk()
| | Group Sort: observations.a ASC NULLS LAST, observations.b ASC NULLS LAST, observations.c ASC NULLS LAST
| | +---> STORAGE ACCESS for observations [Cost: 2K, Rows: 10K]
(PATH ID: 3)
| | | Projection: public.observations_super
| | | Materialize: observations.a, observations.b, observations.c
The query plan for the sorted version omits this step (and cost) and starts with the analytical step (the second step in the previous plan):
=> EXPLAN SELECT polykSort(14, a, b, c) OVER (ORDER BY a, b, c)
FROM observations ORDER BY 1,2,3;
Access Path:
+-ANALYTICAL [Cost: 2K, Rows: 10K] (PATH ID: 2)
| Analytic Group
| Functions: polykSort()
| Group Sort: observations.a ASC NULLS LAST, observations.b ASC NULLS LAST, observations.c ASC NULLS LAST
| +---> STORAGE ACCESS for observations [Cost: 2K, Rows: 10K] (PATH ID: 3)
| | Projection: public.observations_super
| | Materialize: observations.a, observations.b, observations.c