This is the multi-page printable view of this section. Click here to print.

Return to the regular view of this page.

Analytic functions (UDAnFs)

User-defined analytic functions (UDAnFs) are used for analytics.

User-defined analytic functions (UDAnFs) are used for analytics. See SQL analytics for an overview of Vertica's built-in analytics. Like user-defined scalar functions (UDSFs), UDAnFs must output a single value for each row of data read and can have no more than 9800 arguments.

Unlike UDSFs, the UDAnF's input reader and output reader can be advanced independently. This feature lets you create analytic functions where the output value is calculated over multiple rows of data. By advancing the reader and writer independently, you can create functions similar to the built-in analytic functions such as LAG, which uses data from prior rows to output a value for the current row.

1 - AnalyticFunction class

The AnalyticFunction class performs the analytic processing.

The AnalyticFunction class performs the analytic processing. Your subclass must define the processPartition() method to perform the operation. It may define methods to set up and tear down the function.

Performing the operation

The processPartition() method reads a partition of data, performs some sort of processing, and outputs a single value for each input row.

Vertica calls processPartition() once for each partition of data. It supplies the partition using an AnalyticPartitionReader object from which you read its input data. In addition, there is a unique method on this object named isNewOrderByKey(), which returns a Boolean value indicating whether your function has seen a row with the same ORDER BY key (or keys). This method is very useful for analytic functions (such as the example RANK function) which need to handle rows with identical ORDER BY keys differently than rows with different ORDER BY keys.

Once your method has finished processing the row of data, you advance it to the next row of input by calling next() on AnalyticPartitionReader.

Your method writes its output value using an AnalyticPartitionWriter object that Vertica supplies as a parameter to processPartition(). This object has data-type-specific methods to write the output value (such as setInt()). After setting the output value, call next() on AnalyticPartitionWriter to advance to the next row in the output.

Setting up and tearing down

The AnalyticFunction class defines two additional methods that you can optionally implement to allocate and free resources: setup() and destroy(). You should use these methods to allocate and deallocate resources that you do not allocate through the UDx API (see Allocating resources for UDxs for details).

API

The AnalyticFunction API provides the following methods for extension by subclasses:

virtual void setup(ServerInterface &srvInterface,
        const SizedColumnTypes &argTypes);

virtual void processPartition (ServerInterface &srvInterface,
        AnalyticPartitionReader &input_reader,
        AnalyticPartitionWriter &output_writer)=0;

virtual void cancel(ServerInterface &srvInterface);

virtual void destroy(ServerInterface &srvInterface, const SizedColumnTypes &argTypes);

The AnalyticFunction API provides the following methods for extension by subclasses:

public void setup(ServerInterface srvInterface, SizedColumnTypes argTypes);

public abstract void processPartition (ServerInterface srvInterface,
        AnalyticPartitionReader input_reader, AnalyticPartitionWriter output_writer)
        throws UdfException, DestroyInvocation;

protected void cancel(ServerInterface srvInterface);

public void destroy(ServerInterface srvInterface, SizedColumnTypes argTypes);

2 - AnalyticFunctionFactory class

The AnalyticFunctionFactory class tells Vertica metadata about your UDAnF: its number of parameters and their data types, as well as the data type of its return value.

The AnalyticFunctionFactory class tells Vertica metadata about your UDAnF: its number of parameters and their data types, as well as the data type of its return value. It also instantiates a subclass of AnalyticFunction.

Your AnalyticFunctionFactory subclass must implement the following methods:

  • getPrototype() describes the input parameters and output value of your function. You set these values by calling functions on two ColumnTypes objects that are passed to your method.

  • createAnalyticFunction() supplies an instance of your AnalyticFunction that Vertica can call to process a UDAnF function call.

  • getReturnType() provides details about your function's output. This method is where you set the width of the output value if your function returns a variable-width value (such as VARCHAR) or the precision of the output value if it has a settable precision (such as TIMESTAMP).

API

The AnalyticFunctionFactory API provides the following methods for extension by subclasses:

virtual AnalyticFunction * createAnalyticFunction (ServerInterface &srvInterface)=0;

virtual void getPrototype(ServerInterface &srvInterface,
        ColumnTypes &argTypes, ColumnTypes &returnType)=0;

virtual void getReturnType(ServerInterface &srvInterface,
        const SizedColumnTypes &argTypes, SizedColumnTypes &returnType)=0;

virtual void getParameterType(ServerInterface &srvInterface,
        SizedColumnTypes &parameterTypes);

The AnalyticFunctionFactory API provides the following methods for extension by subclasses:

public abstract AnalyticFunction createAnalyticFunction (ServerInterface srvInterface);

public abstract void getPrototype(ServerInterface srvInterface, ColumnTypes argTypes, ColumnTypes returnType);

public abstract void getReturnType(ServerInterface srvInterface, SizedColumnTypes argTypes,
        SizedColumnTypes returnType) throws UdfException;

public void getParameterType(ServerInterface srvInterface, SizedColumnTypes parameterTypes);

3 - C++ example: rank

The Rank analytic function ranks rows based on how they are ordered.

The Rank analytic function ranks rows based on how they are ordered. A Java version of this UDx is included in /opt/vertica/sdk/examples.

Loading and using the example

The following example shows how to load the function into Vertica. It assumes that the AnalyticFunctions.so library that contains the function has been copied to the dbadmin user's home directory on the initiator node.

=> CREATE LIBRARY AnalyticFunctions AS '/home/dbadmin/AnalyticFunctions.so';
CREATE LIBRARY
=> CREATE ANALYTIC FUNCTION an_rank AS LANGUAGE 'C++'
   NAME 'RankFactory' LIBRARY AnalyticFunctions;
CREATE ANALYTIC FUNCTION

An example of running this rank function, named an_rank, is:

=> SELECT * FROM hits;
      site       |    date    | num_hits
-----------------+------------+----------
 www.example.com | 2012-01-02 |       97
 www.vertica.com | 2012-01-01 |   343435
 www.example.com | 2012-01-01 |      123
 www.example.com | 2012-01-04 |      112
 www.vertica.com | 2012-01-02 |   503695
 www.vertica.com | 2012-01-03 |   490387
 www.example.com | 2012-01-03 |      123
(7 rows)
=> SELECT site,date,num_hits,an_rank()
   OVER (PARTITION BY site ORDER BY num_hits DESC)
   AS an_rank FROM hits;
      site       |    date    | num_hits | an_rank
-----------------+------------+----------+---------
 www.example.com | 2012-01-03 |      123 |       1
 www.example.com | 2012-01-01 |      123 |       1
 www.example.com | 2012-01-04 |      112 |       3
 www.example.com | 2012-01-02 |       97 |       4
 www.vertica.com | 2012-01-02 |   503695 |       1
 www.vertica.com | 2012-01-03 |   490387 |       2
 www.vertica.com | 2012-01-01 |   343435 |       3
(7 rows)

As with the built-in RANK analytic function, rows that have the same value for the ORDER BY column (num_hits in this example) have the same rank, but the rank continues to increase, so that the next row that has a different ORDER BY key gets a rank value based on the number of rows that preceded it.

AnalyticFunction implementation

The following code defines an AnalyticFunction subclass named Rank. It is based on example code distributed in the examples directory of the SDK.

/**
 * User-defined analytic function: Rank - works mostly the same as SQL-99 rank
 * with the ability to define as many order by columns as desired
 *
 */
class Rank : public AnalyticFunction
{
    virtual void processPartition(ServerInterface &srvInterface,
                                  AnalyticPartitionReader &inputReader,
                                  AnalyticPartitionWriter &outputWriter)
    {
        // Always use a top-level try-catch block to prevent exceptions from
        // leaking back to Vertica or the fenced-mode side process.
        try {
            rank = 1; // The rank to assign a row
            rowCount = 0; // Number of rows processed so far
            do {
                rowCount++;
                // Do we have a new order by row?
                if (inputReader.isNewOrderByKey()) {
                    // Yes, so set rank to the total number of rows that have been
                    // processed. Otherwise, the rank remains the same value as
                    // the previous iteration.
                    rank = rowCount;
                }
                // Write the rank
                outputWriter.setInt(0, rank);
                // Move to the next row of the output
                outputWriter.next();
            } while (inputReader.next()); // Loop until no more input
        } catch(exception& e) {
            // Standard exception. Quit.
            vt_report_error(0, "Exception while processing partition: %s", e.what());
        }
    }
private:
    vint rank, rowCount;
};

In this example, the processPartition() method does not actually read any of the data from the input row; it just advances through the rows. It does not need to read data; it just counts the rows that have been read and determine whether those rows have the same ORDER BY key as the previous row. If the current row is a new ORDER BY key, then the rank is set to the total number of rows that have been processed. If the current row has the same ORDER BY value as the previous row, then the rank remains the same.

Note that the function has a top-level try-catch block. All of your UDx functions should always have one to prevent stray exceptions from being passed back to Vertica (if you run the function unfenced) or the side process.

AnalyticFunctionFactory implementation

The following code defines the AnalyticFunctionFactory that corresponds with the Rank analytic function.

class RankFactory : public AnalyticFunctionFactory
{
    virtual void getPrototype(ServerInterface &srvInterface,
                                ColumnTypes &argTypes, ColumnTypes &returnType)
    {
        returnType.addInt();
    }
    virtual void getReturnType(ServerInterface &srvInterface,
                               const SizedColumnTypes &inputTypes,
                               SizedColumnTypes &outputTypes)
    {
        outputTypes.addInt();
    }
    virtual AnalyticFunction *createAnalyticFunction(ServerInterface
                                                        &srvInterface)
    { return vt_createFuncObj(srvInterface.allocator, Rank); }
};

The first method defined by the RankFactory subclass, getPrototype(), sets the data type of the return value. Because the Rank UDAnF does not read input, it does not define any arguments by calling methods on the ColumnTypes object passed in the argTypes parameter.

The next method is getReturnType(). If your function returns a data type that needs to define a width or precision, your implementation of the getReturnType() method calls a method on the SizedColumnType object passed in as a parameter to tell Vertica the width or precision. Rank returns a fixed-width data type (an INTEGER) so it does not need to set the precision or width of its output; it just calls addInt() to report its output data type.

Finally, RankFactory defines the createAnalyticFunction() method that returns an instance of the AnalyticFunction class that Vertica can call. This code is mostly boilerplate. All you need to do is add the name of your analytic function class in the call to vt_createFuncObj(), which takes care of allocating the object for you.