Creating a polymorphic UDx

Polymorphic UDxs accept any number and type of arguments that the user supplies. Transform functions (UDTFs) and analytic functions (UDAnFs) can define their return types at runtime, usually based on the input arguments. For example, a UDTF that adds two numbers could return an integer or a float, depending on the input types.

Vertica does not check the number or types of arguments that the user passes to the UDx; it passes the UDx all of the arguments that the user supplies. Your polymorphic UDx's main processing function (for example, processBlock() in user-defined scalar functions) must examine the number and types of the arguments it receives and determine whether it can handle them. UDxs support up to 9800 arguments.

Polymorphic UDxs are more flexible than using multiple factory classes for your function (see Overloading your UDx). They also allow you to write more concise code, instead of writing versions for each data type. The tradeoff is that your polymorphic function needs to perform more work to determine whether it can process its arguments.
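The extra work mentioned above is mostly argument checking. The following is a minimal sketch of such a check, written without the Vertica SDK so that it compiles on its own. ArgKind and checkArguments() are hypothetical names used only for illustration, and the rule enforced (at least two arguments, all integers) is an assumed example requirement. In a real UDx you would read the types from the SDK's type metadata and report failures with the SDK's error mechanism.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for the type information a UDx receives at runtime.
enum class ArgKind { Int, Float, Varchar };

// Returns an empty string if the arguments are acceptable, otherwise an
// error message. Assumed rule: at least two arguments, all of them integers.
std::string checkArguments(const std::vector<ArgKind> &argKinds)
{
    if (argKinds.size() < 2) {
        return std::string("Must supply at least 2 integer arguments");
    }
    for (std::size_t i = 0; i < argKinds.size(); ++i) {
        if (argKinds[i] != ArgKind::Int) {
            // Report the 1-based position of the first offending argument.
            return "Argument " + std::to_string(i + 1) + " must be an integer";
        }
    }
    return std::string();
}
```

Running a check like this once, before the row loop, keeps the per-row work free of type tests.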

Your polymorphic UDx declares that it accepts any number of arguments in its factory's getPrototype() function by calling the addAny() function on the ColumnTypes object that defines its arguments, as follows:

    // C++ example
    void getPrototype(ServerInterface &srvInterface,
                      ColumnTypes &argTypes,
                      ColumnTypes &returnType)
    {
        argTypes.addAny(); // Must be only argument type.
        returnType.addInt(); // or whatever the function returns
    }

This "any parameter" argument type is the only one that your function can declare. You cannot define required arguments and then call addAny() to declare the rest of the signature as optional. If your function has requirements for the arguments it accepts, its main processing function (for example, processBlock()) must enforce them.

The getPrototype() example shown previously accepts any type and declares that it returns an integer. The following example shows a version of the method that defers resolving the return type until runtime. You can only use the "any" return type for transform and analytic functions.

    void getPrototype(ServerInterface &srvInterface,
                      ColumnTypes &argTypes,
                      ColumnTypes &returnType)
    {
        argTypes.addAny();
        returnType.addAny(); // type determined at runtime
    }

If you use polymorphic return types, you must also define getReturnType() in your factory. This function is called at runtime to determine the actual return type. See C++ example: PolyNthValue for an example.
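To illustrate what resolving a return type at runtime involves, here is a minimal sketch of the decision for the "add two numbers" case mentioned earlier, written without the SDK. ColKind and resolveSumType() are hypothetical names, and the rule (the result is a float if any input is a float, otherwise an integer) is an assumption made for the example. A real getReturnType() would read the input types from its SizedColumnTypes argument and add the chosen type to the output types.

```cpp
#include <vector>

// Hypothetical stand-in for a column's data type.
enum class ColKind { Int, Float };

// Chooses the output type from the input types:
// Float if any input is Float, otherwise Int.
ColKind resolveSumType(const std::vector<ColKind> &inputKinds)
{
    for (ColKind kind : inputKinds) {
        if (kind == ColKind::Float) {
            return ColKind::Float;
        }
    }
    return ColKind::Int;
}
```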

Polymorphic UDxs and schema search paths

If a user does not supply a schema name as part of a UDx call, Vertica searches each schema in the schema search path for a function whose name and signature match the function call. See Setting search paths for more information about schema search paths.

Because polymorphic UDxs do not have specific signatures associated with them, Vertica initially skips them when searching for a function to handle the function call. If none of the schemas in the search path contain a UDx whose name and signature match the function call, Vertica searches the schema search path again for a polymorphic UDx whose name matches the function name in the function call.

This behavior gives precedence to a UDx whose signature exactly matches the function call. It allows you to create a "catch-all" polymorphic UDx that Vertica calls only when none of the non-polymorphic UDxs with the same name have matching signatures.

This behavior may cause confusion if your users expect the first polymorphic function in the schema search path to handle a function call. To avoid confusion, you should:

  • Avoid using the same name for different UDxs. You should always uniquely name UDxs unless you intend to create an overloaded UDx with multiple signatures.

  • When you cannot avoid having UDxs with the same name in different schemas, always supply the schema name as part of the function call. Using the schema name prevents ambiguity and ensures that Vertica uses the correct UDx to process your function calls.
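The two-pass lookup described in this section can be modeled with a short, self-contained sketch. This is not Vertica's implementation. FunctionEntry, Schema, and resolveCall() are hypothetical names, and signatures are represented as plain lists of type names for simplicity.

```cpp
#include <string>
#include <vector>

// Hypothetical model of a function registered in a schema.
struct FunctionEntry {
    std::string name;
    bool polymorphic;                   // true if declared with addAny()
    std::vector<std::string> argTypes;  // ignored for polymorphic entries
};

struct Schema {
    std::string name;
    std::vector<FunctionEntry> functions;
};

// Returns "schema.function" for the function that would handle the call,
// or an empty string if nothing in the search path matches.
std::string resolveCall(const std::vector<Schema> &searchPath,
                        const std::string &fnName,
                        const std::vector<std::string> &callArgTypes)
{
    // Pass 1: look for an exact name-and-signature match,
    // skipping polymorphic functions.
    for (const Schema &schema : searchPath) {
        for (const FunctionEntry &fn : schema.functions) {
            if (!fn.polymorphic && fn.name == fnName &&
                fn.argTypes == callArgTypes) {
                return schema.name + "." + fn.name;
            }
        }
    }
    // Pass 2: fall back to the first polymorphic function
    // whose name matches.
    for (const Schema &schema : searchPath) {
        for (const FunctionEntry &fn : schema.functions) {
            if (fn.polymorphic && fn.name == fnName) {
                return schema.name + "." + fn.name;
            }
        }
    }
    return std::string();
}
```

In this model, a polymorphic function in the first schema of the search path still loses to an exact-signature match in a later schema, which is the situation the recommendations above are meant to avoid.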

1 - C++ example: PolyNthValue

The PolyNthValue example is an analytic function that returns the value in the Nth row in each partition in its input. This function is a generalization of FIRST_VALUE [analytic] and LAST_VALUE [analytic].

The values can be of any primitive data type.

For the complete source code, see PolymorphicNthValue.cpp in the examples (in /opt/vertica/sdk/examples/AnalyticFunctions/).

Loading and using the example

Load the library and create the function as follows:

=> CREATE LIBRARY AnalyticFunctions AS '/home/dbadmin/AnalyticFns.so';
CREATE LIBRARY

=> CREATE ANALYTIC FUNCTION poly_nth_value AS LANGUAGE 'C++'
   NAME 'PolyNthValueFactory' LIBRARY AnalyticFunctions;
CREATE ANALYTIC FUNCTION

Consider a table of scores for different test groups:

=> SELECT cohort, score FROM trials;
 cohort | score
--------+-------
   1    | 9
   1    | 8
   1    | 7
   3    | 3
   3    | 2
   3    | 1
   2    | 4
   2    | 5
   2    | 6
(9 rows)

Call the function in a query that uses an OVER clause to partition the data. This example returns the score in the second row of each cohort's partition, which for this data is the second-highest score in each cohort:

=> SELECT cohort, score, poly_nth_value(score USING PARAMETERS n=2) OVER (PARTITION BY cohort) AS nth_value
FROM trials;
 cohort | score | nth_value
--------+-------+-----------
   1    | 9     |         8
   1    | 8     |         8
   1    | 7     |         8
   3    | 3     |         2
   3    | 2     |         2
   3    | 1     |         2
   2    | 4     |         5
   2    | 5     |         5
   2    | 6     |         5
(9 rows)

Factory implementation

The factory declares that the class is polymorphic, and then sets the return type based on the input type. Two factory methods specify the argument and return types.

Use the getPrototype() method to declare that the analytic function takes and returns any type:

    void getPrototype(ServerInterface &srvInterface, ColumnTypes &argTypes, ColumnTypes &returnType)
    {
        // This function supports any argument data type
        argTypes.addAny();

        // Output data type will be the same as the argument data type
        // We will specify that in getReturnType()
        returnType.addAny();
    }

The getReturnType() method is called at runtime. This is where you set the return type based on the input type:

    void getReturnType(ServerInterface &srvInterface, const SizedColumnTypes &inputTypes,
                       SizedColumnTypes &outputTypes)
    {
        // This function accepts only one argument
        // Complain if we find a different number
        std::vector<size_t> argCols;
        inputTypes.getArgumentColumns(argCols); // get argument column indices

        if (argCols.size() != 1)
        {
            vt_report_error(0, "Only one argument is expected but %s provided",
                            argCols.size()? std::to_string(argCols.size()).c_str() : "none");
        }

        // Define output type the same as argument type
        outputTypes.addArg(inputTypes.getColumnType(argCols[0]), inputTypes.getColumnName(argCols[0]));
    }

Function implementation

The analytic function itself is type-agnostic:


    void processPartition(ServerInterface &srvInterface, AnalyticPartitionReader &inputReader,
                          AnalyticPartitionWriter &outputWriter)
    {
        try {
            const SizedColumnTypes &inTypes = inputReader.getTypeMetaData();
            std::vector<size_t> argCols; // Argument column indexes.
            inTypes.getArgumentColumns(argCols);

            vint currentRow = 1;
            bool nthRowExists = false;

            // Find the value of the n-th row
            do {
                if (currentRow == this->n) {
                    nthRowExists = true;
                    break;
                } else {
                    currentRow++;
                }
            } while (inputReader.next());

            if (nthRowExists) {
                do {
                    // Return n-th value
                    outputWriter.copyFromInput(0 /*dest column*/, inputReader,
                                               argCols[0] /*source column*/);
                } while (outputWriter.next());
            } else {
                // The partition has less than n rows
                // Return NULL value
                do {
                    outputWriter.setNull(0);
                } while (outputWriter.next());
            }
        } catch(std::exception& e) {
            // Standard exception. Quit.
            vt_report_error(0, "Exception while processing partition: [%s]", e.what());
        }
    }
};
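The row-selection logic of processPartition() can be examined in isolation from the SDK. The following sketch is a simplified model under stated assumptions: Cell and nthValueForPartition() are hypothetical names, the partition is an in-memory vector rather than an AnalyticPartitionReader, and a flag stands in for SQL NULL. As in the example, every output row receives the value of the n-th input row (counting from 1), or NULL if the partition has fewer than n rows.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a nullable output value.
template <typename T>
struct Cell {
    bool isNull;
    T value;
};

// Produces one output cell per input row. Each cell holds the value of
// the n-th row of the partition (1-based), or NULL if there is no such row.
template <typename T>
std::vector<Cell<T> > nthValueForPartition(const std::vector<T> &partition,
                                           std::size_t n)
{
    Cell<T> result;
    result.isNull = (n == 0 || n > partition.size());
    result.value = result.isNull ? T() : partition[n - 1];

    // Every row in the partition gets the same result.
    return std::vector<Cell<T> >(partition.size(), result);
}
```

Because the function is a template, the same code handles any value type, which mirrors how the SDK example copies the input column to the output without inspecting its type.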

2 - Java example: AddAnyInts

The following example shows an implementation of a Java ScalarFunction that adds together two or more integers.

For the complete source code, see AddAnyIntsInfo.java in the examples (in /opt/vertica/sdk/examples/JavaUDx/ScalarFunctions).

Loading and using the example

Load the library and create the function as follows:

=> CREATE LIBRARY JavaScalarFunctions AS '/home/dbadmin/JavaScalarLib.jar' LANGUAGE 'JAVA';
CREATE LIBRARY

=> CREATE FUNCTION addAnyInts AS LANGUAGE 'Java' NAME 'com.vertica.JavaLibs.AddAnyIntsInfo'
   LIBRARY JavaScalarFunctions;
CREATE FUNCTION

Call the function with two or more integer arguments:

=> SELECT addAnyInts(1,2);
 addAnyInts
------------
          3
(1 row)

=> SELECT addAnyInts(1,2,3,40,50,60,70,80,900);
 addAnyInts
------------
       1206
(1 row)

Calling the function with too few arguments, or with non-integer arguments, produces errors that are generated from the processBlock() method. It is up to your UDx to ensure that the user supplies the correct number and types of arguments to your function and exit with an error if it cannot process them.

Function implementation

Most of the work in the example is done by the processBlock() method. It performs two checks on the arguments that have been passed in through the BlockReader object:

  • There are at least two arguments.

  • The data types of all arguments are integers.

It is up to your polymorphic UDx to determine that all of the input passed to it is valid.

Once the processBlock() method validates its arguments, it loops over them, adding them together. The following excerpt shows the summation loop:

        @Override
        public void processBlock(ServerInterface srvInterface,
                                 BlockReader arg_reader,
                                 BlockWriter res_writer)
                    throws UdfException, DestroyInvocation
        {
            SizedColumnTypes inTypes = arg_reader.getTypeMetaData();
            ArrayList<Integer> argCols = new ArrayList<Integer>(); // Argument column indexes.
            inTypes.getArgumentColumns(argCols);
            // While we have inputs to process
            do {
                long sum = 0;
                for (int i = 0; i < argCols.size(); ++i) {
                    long a = arg_reader.getLong(i);
                    sum += a;
                }
                res_writer.setLong(sum);
                res_writer.next();
            } while (arg_reader.next());
        }
    }

Factory implementation

The factory's getPrototype() function declares that the function accepts any number and type of arguments and returns an integer.

    @Override
    public void getPrototype(ServerInterface srvInterface,
                             ColumnTypes argTypes,
                             ColumnTypes returnType)
    {
        argTypes.addAny();
        returnType.addInt();
    }

3 - R example: kmeansPoly

The following example shows an implementation of a Transform Function (UDTF) that performs kmeans clustering on one or more input columns.

kmeansPoly <- function(v.data.frame,v.param.list) {
  # Computes clusters using the kmeans algorithm.
  #
  # Input: A dataframe and a list of parameters.
  # Output: A dataframe with one column that tells the cluster to which each data
  #         point belongs.
  # Args:
  #  v.data.frame: The data from Vertica cast as an R data frame.
  #  v.param.list: List of function parameters.
  #
  # Returns:
  #  The cluster associated with each data point.

  # Ensure k is not null.
  if(!is.null(v.param.list[['k']])) {
     number_of_clusters <- as.numeric(v.param.list[['k']])
  } else {
    stop("k cannot be NULL! Please use a valid value.")
  }
  # Run the kmeans algorithm.
  kmeans_clusters <- kmeans(v.data.frame, number_of_clusters)
  final.output <- data.frame(kmeans_clusters$cluster)
  return(final.output)
}

kmeansFactoryPoly <- function() {
  # This function tells Vertica the name of the R function,
  # and the polymorphic parameters.
  list(name=kmeansPoly, udxtype=c("transform"), intype=c("any"),
       outtype=c("int"), parametertypecallback=kmeansParameters)
}

kmeansParameters <- function() {
  # Callback function for the parameter types.
  function.parameters <- data.frame(datatype=rep(NA, 1), length=rep(NA,1),
                                    scale=rep(NA,1), name=rep(NA,1))
  function.parameters[1,1] = "int"
  function.parameters[1,4] = "k"
  return(function.parameters)
}

The polymorphic R function declares that it accepts any number of arguments in its factory function by specifying "any" as the value of the intype parameter and, optionally, the outtype parameter. If you specify "any" for intype or outtype, it is the only type that your function can declare for that parameter. You cannot define required arguments and then use "any" to declare the rest of the signature as optional. If your function has requirements for the arguments it accepts, your process function must enforce them.

The outtypecallback method receives the types and sizes of the arguments that the function was called with, and is expected to return the types and sizes of the values that the function returns. You can also use outtypecallback to reject unsupported argument types or an unsupported number of arguments. For example, the function might accept only integers, and no more than 10 of them.

You assign a SQL name to your polymorphic UDx using the same statement you use to assign one to a non-polymorphic UDx. The following statements show how you load and call the polymorphic function from the example.

=> CREATE LIBRARY rlib2 AS '/home/dbadmin/R_UDx/poly_kmeans.R' LANGUAGE 'R';
CREATE LIBRARY
=> CREATE TRANSFORM FUNCTION kmeansPoly AS LANGUAGE 'R' name 'kmeansFactoryPoly' LIBRARY rlib2;
CREATE FUNCTION
=> SELECT spec, kmeansPoly(sl,sw,pl,pw USING PARAMETERS k = 3)
    OVER(PARTITION BY spec) AS Clusters
      FROM iris;
      spec       | Clusters
-----------------+----------
 Iris-setosa     |        1
 Iris-setosa     |        1
 Iris-setosa     |        1
 Iris-setosa     |        1
.
.
.
(150 rows)