Partitioning options for UDTFs
Depending on the application, a UDTF might require the input data to be partitioned in a specific way. For example, a UDTF that processes a web server log file to count the number of hits referred by each partner web site needs to have its input partitioned by a referrer column. However, in other cases—such as a string tokenizer—the sort order of the data does not matter. Vertica provides partition options for both of these types of UDTFs.
Data sort required
In cases where a specific sort order is required, the window partitioning clause in the query that calls the UDTF should use a PARTITION BY
clause. Each node in the cluster partitions the data it stores, sends some of these partitions off to other nodes, and then consolidates the partitions it receives from other nodes and runs an instance of the UDTF to process them.
For example, the following UDTF partitions the input data by store ID and then computes the count of each distinct array element in each partition:
=> SELECT * FROM orders;
storeID | productIDs
---------+-----------------------
1 | [101,102,103]
1 | [102,104]
1 | [101,102,102,201,203]
2 | [101,202,203,202,203]
2 | [203]
2 | []
(6 rows)
=> SELECT storeID, CountElements(productIDs) OVER (PARTITION BY storeID) FROM orders;
storeID | element_count
--------+---------------------------
1 | {"element":101,"count":2}
1 | {"element":102,"count":4}
1 | {"element":103,"count":1}
1 | {"element":104,"count":1}
1 | {"element":201,"count":1}
1 | {"element":202,"count":1}
2 | {"element":101,"count":1}
2 | {"element":202,"count":2}
2 | {"element":203,"count":3}
(9 rows)
No sort needed
Some UDTFs, such as Explode, do not need to partition input data in a particular way. In these cases, you can specify that each UDTF instance process only the data that is stored locally by the node on which it is running. By eliminating the overhead of partitioning data and the cost of sort and merge operations, processing can be much more efficient.
You can use the following window partition options for UDTFs that do not require a specific data partitioning:
-
PARTITION ROW
: For single-phase UDTFs where each partition is one input row, allows users to write SELECT queries that include any expression. The UDTF calls theprocessPartition()
method once per input row. UDTFs of this type, often called one-to-many transforms, can be explicitly marked as such with the exploder class property in the TransformFunctionFactory class. This class property helps Vertica optimize query plans and removes the need for an OVER clause. See One to Many UDTFs for details on how to set this class property for UDTFs written in C++ and Python. -
PARTITION BEST
: For thread-safe UDTFs only, optimizes performance through multi-threaded queries across multiple nodes. The UDTF calls theprocessPartition()
method once per thread per node. -
PARTITION NODES
: Optimizes performance of single-threaded queries across multiple nodes. The UDTF calls theprocessPartition()
method once per node.
For more information about these partition options, see Window partitioning.
One-to-many UDTFs
To mark a UDTF as one-to-many, you must set the isExploder
class property to True within the getTransformFunctionProperties()
method. Whether a UDTF is marked as one-to-many can be determined by the transform function's arguments and parameters, for example:
void getFunctionProperties(ServerInterface &srvInterface,
const SizedColumnTypes &argTypes,
Properties &properties) override
{
if (argTypes.getColumnCount() > 1) {
properties.isExploder = false;
}
else {
properties.isExploder = true;
}
}
To mark a UDTF as one-to-many, you must set the is_exploder
class property to True within the getTransformFunctionProperties()
method. Whether a UDTF is marked as one-to-many can be determined by the transform function's arguments and parameters, for example:
def getFunctionProperties(cls, server_interface, arg_types):
props = vertica_sdk.TransformFunctionFactory.Properties()
if arg_types.getColumnCount() != 1:
props.is_exploder = False
else:
props.is_exploder = True
return props
If the exploder class property is set to True, the OVER clause is by default OVER(PARTITION ROW). This allows users to call the UDTF without specifying an OVER clause:
=> SELECT * FROM reviews;
id | sentence
----+--------------------------------------
1 | Customer service was slow
2 | Product is exactly what I needed
3 | Price is a bit high
4 | Highly recommended
(4 rows)
=> SELECT tokenize(sentence) FROM reviews;
tokens
-------------
Customer
service
was
slow
Product
...
bit
high
Highly
recommended
(17 rows)
Note
If a user writes an OVER clause with PARTITION BY for a function marked as one-to-many, the function replaces the OVER clause with OVER(PARTITION ROW) and emits a notice to the user.One-to-many UDTFs also support any expression in the SELECT clause, unlike UDTFs that use either the PARTITION BEST or the PARTITION NODES clause:
=> SELECT id, tokenize(sentence) FROM reviews;
id | tokens
----+-------------
1 | Customer
1 | service
1 | was
1 | respond
2 | Product
...
3 | high
4 | Highly
4 | recommended
(17 rows)
For an in-depth example detailing a one-to-many UDTF, see Python example: explode.