Python example: count elements
The following example details a UDTF that takes a partition of arrays, computes the count of each distinct array element in the partition, and outputs each element and its count as a row value. You can call the function on tables that contain multiple partitions of arrays.
The complete source code is in /opt/vertica/sdk/examples/python/TransformFunctions.py
.
Loading and using the example
Load the library and create the transform function as follows:
=> CREATE OR REPLACE LIBRARY TransformFunctions AS '/home/dbadmin/examples/python/TransformFunctions.py' LANGUAGE 'Python';
=> CREATE TRANSFORM FUNCTION CountElements AS LANGUAGE 'Python' NAME 'countElementsUDTFactory' LIBRARY TransformFunctions;
You can create some data and then call the function on it, for example:
=> CREATE TABLE orders (storeID int, productIDs array[int]);
CREATE TABLE
=> INSERT INTO orders VALUES
(1, array[101, 102, 103]),
(1, array[102, 104]),
(1, array[101, 102, 102, 201, 203]),
(2, array[101, 202, 203, 202, 203]),
(2, array[203]),
(2, array[]);
OUTPUT
--------
6
(1 row)
=> COMMIT;
COMMIT
=> SELECT storeID, CountElements(productIDs) OVER (PARTITION BY storeID) FROM orders;
storeID | element_count
--------+---------------------------
1 | {"element":101,"count":2}
1 | {"element":102,"count":4}
1 | {"element":103,"count":1}
1 | {"element":104,"count":1}
1 | {"element":201,"count":1}
1 | {"element":202,"count":1}
2 | {"element":101,"count":1}
2 | {"element":202,"count":2}
2 | {"element":203,"count":3}
(9 rows)
Setup
All Python UDxs must import the Vertica SDK library:
import vertica_sdk
Factory implementation
The getPrototype()
method declares that the inputs and outputs can be of any type, which means that type enforcement must be done elsewhere:
def getPrototype(self, srv_interface, arg_types, return_type):
arg_types.addAny()
return_type.addAny()
getReturnType()
validates that the only argument to the function is an array and that the return type is a row with 'element' and 'count' fields:
def getReturnType(self, srv_interface, arg_types, return_type):
if arg_types.getColumnCount() != 1:
srv_interface.reportError(1, 'countElements UDT should take exactly one argument')
if not arg_types.getColumnType(0).isArrayType():
srv_interface.reportError(2, 'Argument to countElements UDT should be an ARRAY')
retRowFields = vertica_sdk.SizedColumnTypes.makeEmpty()
retRowFields.addColumn(arg_types.getColumnType(0).getElementType(), 'element')
retRowFields.addInt('count')
return_type.addRowType(retRowFields, 'element_count')
The countElementsUDTFactory
class also contains a createTransformFunction()
method that instantiates and returns the transform function.
Function implementation
The processBlock()
method is called with a BlockReader
and a BlockWriter
, named arg_reader
and res_writer
respectively. The function loops through all the input arrays in a partition and uses a dictionary to collect the frequency of each element. To access elements of each input array, the method instantiates an ArrayReader
. After collecting the element counts, the function writes each element and its count to a row. This process is repeated for each partition.
def processPartition(self, srv_interface, arg_reader, res_writer):
elemCounts = dict()
# Collect element counts for entire partition
while (True):
if not arg_reader.isNull(0):
arr = arg_reader.getArray(0)
for elem in arr:
elemCounts[elem] = elemCounts.setdefault(elem, 0) + 1
if not arg_reader.next():
break
# Write at least one value for each partition
if len(elemCounts) == 0:
elemCounts[None] = 0
# Write out element counts as (element, count) pairs
for pair in elemCounts.items():
res_writer.setRow(0, pair)
res_writer.next()