This is the multi-page printable view of this section. Click here to print.

Return to the regular view of this page.

Scalar functions (UDSFs)

A user-defined scalar function (UDSF) returns a single value for each row of data it reads.

1: ScalarFunction class
2: ScalarFunctionFactory class
3: Setting null input and volatility behavior
4: Improving query performance (C++ only)
5: C++ example: Add2Ints
6: Python example: currency_convert
7: Python example: validate_url
8: Python example: matrix multiplication
9: R example: SalesTaxCalculator
10: R example: kmeans
11: C++ example: using complex types
12: C++ example: returning multiple values
13: C++ example: calling a UDSF from a check constraint

A user-defined scalar function (UDSF) returns a single value for each row of data it reads. You can use a UDSF anywhere you can use a built-in Vertica function. You usually develop a UDSF to perform data manipulations that are too complex or too slow to perform using SQL statements and functions. UDSFs also let you use analytic functions provided by third-party libraries within Vertica while still maintaining high performance.

A UDSF returns a single column. You can automatically return multiple values in a ROW. A ROW is a group of property-value pairs. In the following example, div_with_rem is a UDSF that performs a division operation, returning the quotient and remainder as integers:

=> SELECT div_with_rem(18,5);
        div_with_rem
------------------------------
 {"quotient":3,"remainder":3}
(1 row)

A ROW returned from a UDSF cannot be used as an argument to COUNT.

Alternatively, you can construct a complex return value yourself, as described in Complex Types as Arguments.

Your UDSF must return a value for every input row (unless it generates an error; see Handling errors for details). Failure to return a value for an input row results in incorrect results and potentially destabilizes the Vertica server if not run in Fenced and unfenced modes.

A UDSF can have up to 9800 arguments.

1 - ScalarFunction class

The ScalarFunction class is the heart of a UDSF.

The ScalarFunction class is the heart of a UDSF. Your subclass must define the processBlock() method to perform the scalar operation. It may define methods to set up and tear down the function.

For scalar functions written in C++, you can provide information that can help with query optimization. See Improving query performance (C++ only).

Performing the operation

The processBlock() method carries out all of the processing that you want your UDSF to perform. When a user calls your function in a SQL statement, Vertica bundles together the data from the function parameters and passes it to processBlock() .

The input and output of the processBlock() method are supplied by objects of the BlockReader and BlockWriter classes. They define methods that you use to read the input data and write the output data for your UDSF.

The majority of the work in developing a UDSF is writing processBlock(). This is where all of the processing in your function occurs. Your UDSF should follow this basic pattern:

Read in a set of arguments from the BlockReader object using data-type-specific methods.
Process the data in some manner.
Output the resulting value using one of the BlockWriter class's data-type-specific methods.
Advance to the next row of output and input by calling BlockWriter.next() and BlockReader.next().

This process continues until there are no more rows of data to be read (BlockReader.next() returns false).

You must make sure that processBlock() reads all of the rows in its input and outputs a single value for each row. Failure to do so can corrupt the data structures that Vertica reads to get the output of your UDSF. The only exception to this rule is if your processBlock() function reports an error back to Vertica (see Handling errors). In that case, Vertica does not attempt to read the incomplete result set generated by the UDSF.

Setting up and tearing down

The ScalarFunction class defines two additional methods that you can optionally implement to allocate and free resources: setup() and destroy(). You should use these methods to allocate and deallocate resources that you do not allocate through the UDx API (see Allocating resources for UDxs for details).

Notes

While the name you choose for your ScalarFunction subclass does not have to match the name of the SQL function you will later assign to it, Vertica considers making the names the same a best practice.
Do not assume that your function will be called from the same thread that instantiated it.
The same instance of your ScalarFunction subclass can be called on to process multiple blocks of data.
The rows of input sent to processBlock() are not guaranteed to be any particular order.
Writing too many output rows can cause Vertica to emit an out-of-bounds error.

API

The ScalarFunction API provides the following methods for extension by subclasses:

virtual void setup(ServerInterface &srvInterface,
        const SizedColumnTypes &argTypes);

virtual void processBlock(ServerInterface &srvInterface,
        BlockReader &arg_reader, BlockWriter &res_writer)=0;

virtual void getOutputRange (ServerInterface &srvInterface,
        ValueRangeReader &inRange, ValueRangeWriter &outRange)

virtual void cancel(ServerInterface &srvInterface);

virtual void destroy(ServerInterface &srvInterface, const SizedColumnTypes &argTypes);

The ScalarFunction API provides the following methods for extension by subclasses:

public void setup(ServerInterface srvInterface, SizedColumnTypes argTypes);

public abstract void processBlock(ServerInterface srvInterface, BlockReader arg_reader,
        BlockWriter res_writer) throws UdfException, DestroyInvocation;

protected void cancel(ServerInterface srvInterface);

public void destroy(ServerInterface srvInterface, SizedColumnTypes argTypes);

The ScalarFunction API provides the following methods for extension by subclasses:

def setup(self, server_interface, col_types)

def processBlock(self, server_interface, block_reader, block_writer)

def destroy(self, server_interface, col_types)

Implement the Main function API to define a scalar function:

FunctionName <- function(input.data.frame, parameters.data.frame) {
  # Computations

  # The function must return a data frame.
  return(output.data.frame)
}

2 - ScalarFunctionFactory class

The ScalarFunctionFactory class tells Vertica metadata about your UDSF: its number of parameters and their data types, as well as the data type of its return value.

The ScalarFunctionFactory class tells Vertica metadata about your UDSF: its number of parameters and their data types, as well as the data type of its return value. It also instantiates a subclass of ScalarFunction.

Methods

You must implement the following methods in your ScalarFunctionFactory subclass:

createScalarFunction() instantiates a ScalarFunction subclass. If writing in C++, you can call the vt_createFuncObj macro with the name of the ScalarFunction subclass. This macro takes care of allocating and instantiating the class for you.
getPrototype() tells Vertica about the parameters and return type(s) for your UDSF. In addition to a ServerInterface object, this method gets two ColumnTypes objects. All you need to do in this function is to call class functions on these two objects to build the list of parameters and the return value type(s). If you return more than one value, the results are packaged into a ROW type.

After defining your factory class, you need to call the RegisterFactory macro. This macro instantiates a member of your factory class, so Vertica can interact with it and extract the metadata it contains about your UDSF.

Declaring return values

If your function returns a sized column (a return data type whose length can vary, such as a VARCHAR), a value that requires precision, or more than one value, you must implement getReturnType(). This method is called by Vertica to find the length or precision of the data being returned in each row of the results. The return value of this method depends on the data type your processBlock() method returns:

CHAR, (LONG) VARCHAR, BINARY, and (LONG) VARBINARY return the maximum length.
NUMERIC types specify the precision and scale.
TIME and TIMESTAMP values (with or without timezone) specify precision.
INTERVAL YEAR TO MONTH specifies range.
INTERVAL DAY TO SECOND specifies precision and range.
ARRAY types specify the maximum number of elements.

If your UDSF does not return one of these data types and returns a single value, it does not need a getReturnType() method.

The input to the getReturnType() method is a SizedColumnTypes object that contains the input argument types along with their lengths. This object will be passed to an instance of your processBlock() function. Your implementation of getReturnType() must extract the data types and lengths from this input and determine the length or precision of the output rows. It then saves this information in another instance of the SizedColumnTypes class.

API

The ScalarFunctionFactory API provides the following methods for extension by subclasses:

virtual ScalarFunction * createScalarFunction(ServerInterface &srvInterface)=0;

virtual void getPrototype(ServerInterface &srvInterface,
        ColumnTypes &argTypes, ColumnTypes &returnType)=0;

virtual void getReturnType(ServerInterface &srvInterface,
        const SizedColumnTypes &argTypes, SizedColumnTypes &returnType);

virtual void getParameterType(ServerInterface &srvInterface,
        SizedColumnTypes &parameterTypes);

The ScalarFunctionFactory API provides the following methods for extension by subclasses:

public abstract ScalarFunction createScalarFunction(ServerInterface srvInterface);

public abstract void getPrototype(ServerInterface srvInterface, ColumnTypes argTypes, ColumnTypes returnType);

public void getReturnType(ServerInterface srvInterface, SizedColumnTypes argTypes,
        SizedColumnTypes returnType) throws UdfException;

public void getParameterType(ServerInterface srvInterface, SizedColumnTypes parameterTypes);

The ScalarFunctionFactory API provides the following methods for extension by subclasses:

def createScalarFunction(self, srv)

def getPrototype(self, srv_interface, arg_types, return_type)

def getReturnType(self, srv_interface, arg_types, return_type)

Implement the Factory function API to define a scalar function factory:

FunctionNameFactory <- function() {
  list(name    = FunctionName,
       udxtype = c("scalar"),
       intype  = c("int"),
       outtype = c("int"))
}

3 - Setting null input and volatility behavior

Normally, Vertica calls your UDSF for every row of data in the query.

Normally, Vertica calls your UDSF for every row of data in the query. There are some cases where Vertica can avoid executing your UDSF. You can tell Vertica when it can skip calling your function and just supply a return value itself by changing your function's volatility and strictness settings.

Your function's volatility indicates whether it always returns the same output value when passed the same arguments. Depending on its behavior, Vertica can cache the arguments and the return value. If the user calls the UDSF with the same set of arguments, Vertica returns the cached value instead of calling your UDSF.
Your function's strictness indicates how it reacts to NULL arguments. If it always returns NULL when any argument is NULL, Vertica can just return NULL without having to call the function. This optimization also saves you work, because you do not need to test for and handle null arguments in your UDSF code.

You indicate the volatility and null handling of your function by setting the vol and strict fields in your ScalarFunctionFactory class's constructor.

Volatility settings

To indicate your function's volatility, set the vol field to one of the following values:

Value	Description
`VOLATILE`	Repeated calls to the function with the same arguments always result in different values. Vertica always calls volatile functions for each invocation.
`IMMUTABLE`	Calls to the function with the same arguments always results in the same return value.
`STABLE`	Repeated calls to the function with the same arguments within the same statement returns the same output. For example, a function that returns the current user name is stable because the user cannot change within a statement. The user name could change between statements.
`DEFAULT_VOLATILITY`	The default volatility. This is the same as VOLATILE.

The following example shows a version of the Add2ints example factory class that makes the function immutable.

class Add2intsImmutableFactory : public Vertica::ScalarFunctionFactory
{
    virtual Vertica::ScalarFunction *createScalarFunction(Vertica::ServerInterface &srvInterface)
    { return vt_createFuncObj(srvInterface.allocator, Add2ints); }
    virtual void getPrototype(Vertica::ServerInterface &srvInterface,
                              Vertica::ColumnTypes &argTypes,
                              Vertica::ColumnTypes &returnType)
    {
        argTypes.addInt();
        argTypes.addInt();
        returnType.addInt();
    }

public:
    Add2intsImmutableFactory() {vol = IMMUTABLE;}
};
RegisterFactory(Add2intsImmutableFactory);

The following example demonstrates setting the Add2IntsFactory's vol field to IMMUTABLE to tell Vertica it can cache the arguments and return value.

public class Add2IntsFactory extends ScalarFunctionFactory {

    @Override
    public void getPrototype(ServerInterface srvInterface, ColumnTypes argTypes, ColumnTypes returnType){
        argTypes.addInt();
        argTypes.addInt();
        returnType.addInt();
    }

    @Override
    public ScalarFunction createScalarFunction(ServerInterface srvInterface){
        return new Add2Ints();
    }

    // Class constructor
    public Add2IntsFactory() {
        // Tell Vertica that the same set of arguments will always result in the
        // same return value.
        vol = volatility.IMMUTABLE;
    }
}

Null input behavior

To indicate how your function reacts to NULL input, set the strictness field to one of the following values.

Value	Description
CALLED_ON_NULL_INPUT	The function must be called, even if one or more arguments are NULL.
RETURN_NULL_ON_NULL_INPUT	The function always returns a NULL value if any of its arguments are NULL.
STRICT	A synonym for RETURN_NULL_ON_NULL_INPUT
DEFAULT_STRICTNESS	The default strictness setting. This is the same as CALLED_ON_NULL_INPUT.

Example

The following C++ example demonstrates setting the null behavior of Add2ints so Vertica does not call the function with NULL values.

class Add2intsNullOnNullInputFactory : public Vertica::ScalarFunctionFactory
{
    virtual Vertica::ScalarFunction *createScalarFunction(Vertica::ServerInterface &srvInterface)
    { return vt_createFuncObj(srvInterface.allocator, Add2ints); }
    virtual void getPrototype(Vertica::ServerInterface &srvInterface,
                              Vertica::ColumnTypes &argTypes,
                              Vertica::ColumnTypes &returnType)
    {
        argTypes.addInt();
        argTypes.addInt();
        returnType.addInt();
    }

public:
    Add2intsNullOnNullInputFactory() {strict = RETURN_NULL_ON_NULL_INPUT;}
};
RegisterFactory(Add2intsNullOnNullInputFactory);

4 - Improving query performance (C++ only)

When evaluating a query, Vertica can take advantage of available information about the ranges of values.

When evaluating a query, Vertica can take advantage of available information about the ranges of values. For example, if data is partitioned and a query restricts output by the partitioned value, Vertica can ignore partitions that cannot possibly contain data that would satisfy the query. Similarly, for a scalar function, Vertica can skip processing rows in the data where the value returned from the function cannot possibly affect the results.

Consider a table with millions of rows of data on customer orders and a scalar function that computes the total price paid for everything in an order. A query uses a WHERE clause to restrict results to orders above a given value. A scalar function is called on a block of data; if no rows within that block could produce the target value, skipping the processing of the block could improve query performance.

A scalar function written in C++ can implement the getOutputRange method. Before calling processBlock, Vertica calls getOutputRange to determine the minimum and maximum return values from this block given the input ranges. It then decides whether to call processBlock to perform the computations.

The Add2Ints example implements this function. The minimum output value is the sum of the smallest values of each of the two inputs, and the maximum output is the sum of the largest values of each of the inputs. This function does not consider individual rows. Consider the following inputs:

   a  |  b
------+------
  21  | 92
 500  | 19
 111  | 11

The smallest values of the two inputs are 21 and 11, so the function reports 32 as the low end of the output range. The largest input values are 500 and 92, so it reports 592 as the high end of the output range. 592 is larger than the value returned for any of the input rows and 32 is smaller than any row's return value.

The purpose of getOutputRange is to quickly eliminate calls where outputs would definitely be out of range. For example, if the query included "WHERE Add2Ints(a,b) > 600", this block of data could be skipped. There can still be cases where, after calling getOutputRange, processBlock returns no results. If the query included "WHERE Add2Ints(a,b) > 500", getOutputRange would not eliminate this block of data.

Add2Ints implements getOutputRange as follows:

    /*
     * This method computes the output range for this scalar function from
     *   the ranges of its inputs in a single invocation.
     *
     * The input ranges are retrieved via inRange
     * The output range is returned via outRange
     */
    virtual void getOutputRange(Vertica::ServerInterface &srvInterface,
                                Vertica::ValueRangeReader &inRange,
                                Vertica::ValueRangeWriter &outRange)
    {
        if (inRange.hasBounds(0) && inRange.hasBounds(1)) {
            // Input ranges have bounds defined
            if (inRange.isNull(0) || inRange.isNull(1)) {
                // At least one range has only NULL values.
                // Output range can only have NULL values.
                outRange.setNull();
                outRange.setHasBounds();
                return;
            } else {
                // Compute output range
                const vint& a1LoBound = inRange.getIntRefLo(0);
                const vint& a2LoBound = inRange.getIntRefLo(1);
                outRange.setIntLo(a1LoBound + a2LoBound);

                const vint& a1UpBound = inRange.getIntRefUp(0);
                const vint& a2UpBound = inRange.getIntRefUp(1);
                outRange.setIntUp(a1UpBound + a2UpBound);
            }
        } else {
            // Input ranges are unbounded. No output range can be defined
            return;
        }

        if (!inRange.canHaveNulls(0) && !inRange.canHaveNulls(1)) {
            // There cannot be NULL values in the output range
            outRange.setCanHaveNulls(false);
        }

        // Let Vertica know that the output range is bounded
        outRange.setHasBounds();
    }

If getOutputRange produces an error, Vertica issues a warning and does not call the method again for the current query.

5 - C++ example: Add2Ints

The following example shows a basic subclass of ScalarFunction called Add2ints.

The following example shows a basic subclass of ScalarFunction called Add2ints. As the name implies, it adds two integers together, returning a single integer result.

For the complete source code, see /opt/vertica/sdk/examples/ScalarFunctions/Add2Ints.cpp. Java and Python versions of this UDx are included in /opt/vertica/sdk/examples.

Loading and using the example

Use CREATE LIBRARY to load the library containing the function, and then use CREATE FUNCTION (scalar) to declare the function as in the following example:

=> CREATE LIBRARY ScalarFunctions AS '/home/dbadmin/examples/ScalarFunctions.so';

=> CREATE FUNCTION add2ints AS LANGUAGE 'C++' NAME 'Add2IntsFactory' LIBRARY ScalarFunctions;

The following example shows how to use this function:

=> SELECT Add2Ints(27,15);
 Add2ints
----------
       42
(1 row)
=> SELECT * FROM MyTable;
  a  | b
-----+----
   7 |  0
  12 |  2
  12 |  6
  18 |  9
   1 |  1
  58 |  4
 450 | 15
(7 rows)
=> SELECT * FROM MyTable WHERE Add2ints(a, b) > 20;
  a  | b
-----+----
  18 |  9
  58 |  4
 450 | 15
(3 rows)

Function implementation

A scalar function does its computation in the processBlock method:

class Add2Ints : public ScalarFunction
 {
    public:
   /*
     * This method processes a block of rows in a single invocation.
     *
     * The inputs are retrieved via argReader
     * The outputs are returned via resWriter
     */
    virtual void processBlock(ServerInterface &srvInterface,
                              BlockReader &argReader,
                              BlockWriter &resWriter)
    {
        try {
            // While we have inputs to process
            do {
                if (argReader.isNull(0) || argReader.isNull(1)) {
                    resWriter.setNull();
                } else {
                    const vint a = argReader.getIntRef(0);
                    const vint b = argReader.getIntRef(1);
                    resWriter.setInt(a+b);
                }
                resWriter.next();
            } while (argReader.next());
        } catch(std::exception& e) {
            // Standard exception. Quit.
            vt_report_error(0, "Exception while processing block: [%s]", e.what());
        }
    }

  // ...
};

Implementing getOutputRange, which is optional, allows your function to skip rows where the result would not be within a target range. For example, if a WHERE clause restricts the query results to those in a certain range, calling the function for cases that could not possibly be in that range is unnecessary.

    /*
     * This method computes the output range for this scalar function from
     *   the ranges of its inputs in a single invocation.
     *
     * The input ranges are retrieved via inRange
     * The output range is returned via outRange
     */
    virtual void getOutputRange(Vertica::ServerInterface &srvInterface,
                                Vertica::ValueRangeReader &inRange,
                                Vertica::ValueRangeWriter &outRange)
    {
        if (inRange.hasBounds(0) && inRange.hasBounds(1)) {
            // Input ranges have bounds defined
            if (inRange.isNull(0) || inRange.isNull(1)) {
                // At least one range has only NULL values.
                // Output range can only have NULL values.
                outRange.setNull();
                outRange.setHasBounds();
                return;
            } else {
                // Compute output range
                const vint& a1LoBound = inRange.getIntRefLo(0);
                const vint& a2LoBound = inRange.getIntRefLo(1);
                outRange.setIntLo(a1LoBound + a2LoBound);

                const vint& a1UpBound = inRange.getIntRefUp(0);
                const vint& a2UpBound = inRange.getIntRefUp(1);
                outRange.setIntUp(a1UpBound + a2UpBound);
            }
        } else {
            // Input ranges are unbounded. No output range can be defined
            return;
        }

        if (!inRange.canHaveNulls(0) && !inRange.canHaveNulls(1)) {
            // There cannot be NULL values in the output range
            outRange.setCanHaveNulls(false);
        }

        // Let Vertica know that the output range is bounded
        outRange.setHasBounds();
    }

Factory implementation

The factory instantiates a member of the class (createScalarFunction), and also describes the function's inputs and outputs (getPrototype):

class Add2IntsFactory : public ScalarFunctionFactory
{
    // return an instance of Add2Ints to perform the actual addition.
    virtual ScalarFunction *createScalarFunction(ServerInterface &interface)
    { return vt_createFuncObject<Add2Ints>(interface.allocator); }

    // This function returns the description of the input and outputs of the
    // Add2Ints class's processBlock function.  It stores this information in
    // two ColumnTypes objects, one for the input parameters, and one for
    // the return value.
    virtual void getPrototype(ServerInterface &interface,
                              ColumnTypes &argTypes,
                              ColumnTypes &returnType)
    {
        argTypes.addInt();
        argTypes.addInt();

        // Note that ScalarFunctions *always* return a single value.
        returnType.addInt();
    }
};

The RegisterFactory macro

Use the RegisterFactory macro to register a UDx. This macro instantiates the factory class and makes the metadata it contains available for Vertica to access. To call this macro, pass it the name of your factory class:

RegisterFactory(Add2IntsFactory);

6 - Python example: currency_convert

The currency_convert scalar function reads two values from a table, a currency and a value.

The currency_convert scalar function reads two values from a table, a currency and a value. It then converts the item's value to USD, returning a single float result.

You can find more UDx examples in the Vertica Github repository, https://github.com/vertica/UDx-Examples.

UDSF Python code

import vertica_sdk
import decimal

rates2USD = {'USD': 1.000,
             'EUR': 0.89977,
             'GBP': 0.68452,
             'INR': 67.0345,
             'AUD': 1.39187,
             'CAD': 1.30335,
             'ZAR': 15.7181,
             'XXX': -1.0000}

class currency_convert(vertica_sdk.ScalarFunction):
    """Converts a money column to another currency

    Returns a value in USD.

    """
    def __init__(self):
        pass

    def setup(self, server_interface, col_types):
        pass

    def processBlock(self, server_interface, block_reader, block_writer):
        while(True):
            currency = block_reader.getString(0)
            try:
                rate = decimal.Decimal(rates2USD[currency])

            except KeyError:
                server_interface.log("ERROR: {} not in dictionary.".format(currency))
                # Scalar functions always need a value to move forward to the
                # next input row. Therefore, we need to assign it a value to
                # move beyond the error.
                currency = 'XXX'
                rate = decimal.Decimal(rates2USD[currency])

            starting_value = block_reader.getNumeric(1)
            converted_value = decimal.Decimal(starting_value  / rate)
            block_writer.setNumeric(converted_value)
            block_writer.next()
            if not block_reader.next():
                break

    def destroy(self, server_interface, col_types):
        pass

class currency_convert_factory(vertica_sdk.ScalarFunctionFactory):

    def createScalarFunction(self, srv):
        return currency_convert()

    def getPrototype(self, srv_interface, arg_types, return_type):
        arg_types.addVarchar()
        arg_types.addNumeric()
        return_type.addNumeric()

    def getReturnType(self, srv_interface, arg_types, return_type):
        return_type.addNumeric(9,4)

Load the function and library

Create the library and the function.

=> CREATE LIBRARY pylib AS '/home/dbadmin/python_udx/currency_convert/currency_convert.py' LANGUAGE 'Python';
CREATE LIBRARY
=> CREATE FUNCTION currency_convert AS LANGUAGE 'Python' NAME 'currency_convert_factory' LIBRARY pylib fenced;
CREATE FUNCTION

Querying data with the function

The following query shows how you can run a query with the UDSF.

=> SELECT product, currency_convert(currency, value) AS cost_in_usd
    FROM items;
   product    | cost_in_usd
--------------+-------------
 Shoes        |    133.4008
 Soccer Ball  |    110.2817
 Coffee       |     13.5190
 Surfboard    |    176.2593
 Hockey Stick |     76.7177
 Car          |  17000.0000
 Software     |     10.4424
 Hamburger    |      7.5000
 Fish         |    130.4272
 Cattle       |    269.2367
(10 rows)

7 - Python example: validate_url

The validate_url scalar function reads a string from a table, a URL.

The validate_url scalar function reads a string from a table, a URL. It then validates if the URL is responsive, returning a status code or a string indicating the attempt failed.

You can find more UDx examples in the Vertica Github repository, https://github.com/vertica/UDx-Examples.

UDSF Python code

import vertica_sdk
import urllib.request
import time

class validate_url(vertica_sdk.ScalarFunction):
    """Validates HTTP requests.

    Returns the status code of a webpage. Pages that cannot be accessed return
    "Failed to load page."

    """

    def __init__(self):
        pass

    def setup(self, server_interface, col_types):
        pass

    def processBlock(self, server_interface, arg_reader, res_writer):
        # Writes a string to the UDx log file.
        server_interface.log("Validating webpage accessibility - UDx")

        while(True):
            url = arg_reader.getString(0)
            try:
                status = urllib.request.urlopen(url).getcode()
                # Avoid overwhelming web servers -- be nice.
                time.sleep(2)
            except (ValueError, urllib.error.HTTPError, urllib.error.URLError):
                status = 'Failed to load page'
            res_writer.setString(str(status))
            res_writer.next()
            if not arg_reader.next():
                # Stop processing when there are no more input rows.
                break

    def destroy(self, server_interface, col_types):
        pass

class validate_url_factory(vertica_sdk.ScalarFunctionFactory):

    def createScalarFunction(self, srv):
        return validate_url()

    def getPrototype(self, srv_interface, arg_types, return_type):
        arg_types.addVarchar()
        return_type.addChar()

    def getReturnType(self, srv_interface, arg_types, return_type):
        return_type.addChar(20)

Load the function and library

Create the library and the function.

=> CREATE OR REPLACE LIBRARY pylib AS 'webpage_tester/validate_url.py' LANGUAGE 'Python';
=> CREATE OR REPLACE FUNCTION validate_url AS LANGUAGE 'Python' NAME 'validate_url_factory' LIBRARY pylib fenced;

Querying data with the function

The following query shows how you can run a query with the UDSF.

=> SELECT url, validate_url(url) AS url_status FROM webpages;
                     url                       |      url_status
-----------------------------------------------+----------------------
 http://www.vertica.com/documentation/vertica/ | 200
 http://www.google.com/                        | 200
 http://www.mass.gov.com/                      | Failed to load page
 http://www.espn.com                           | 200
 http://blah.blah.blah.blah                    | Failed to load page
 http://www.vertica.com/                       | 200
(6 rows)

8 - Python example: matrix multiplication

Python UDxs can accept and return complex types.

Python UDxs can accept and return complex types. The MatrixMultiply class multiplies input matrices and returns the resulting matrix product. These matrices are represented as two-dimensional arrays. In order to perform the matrix multiplication operation, the number of columns in the first input matrix must equal the number of rows in the second input matrix.

The complete source code is in /opt/vertica/sdk/examples/python/ScalarFunctions.py.

Loading and using the example

Load the library and create the function as follows:

=> CREATE OR REPLACE LIBRARY ScalarFunctions AS '/home/dbadmin/examples/python/ScalarFunctions.py' LANGUAGE 'Python';

=> CREATE FUNCTION MatrixMultiply AS LANGUAGE 'Python' NAME 'matrix_multiply_factory' LIBRARY ScalarFunctions;

You can create input matrices and then call the function, for example:


=> CREATE TABLE mn (id INTEGER, data ARRAY[ARRAY[INTEGER, 3], 2]);
CREATE TABLE

=> CREATE TABLE np (id INTEGER, data ARRAY[ARRAY[INTEGER, 2], 3]);
CREATE TABLE

=> COPY mn FROM STDIN PARSER fjsonparser();
{"id": 1, "data": [[1, 2, 3], [4, 5, 6]] }
{"id": 2, "data": [[7, 8, 9], [10, 11, 12]] }
\.

=> COPY np FROM STDIN PARSER fjsonparser();
{"id": 1, "data": [[0, 0], [0, 0], [0, 0]] }
{"id": 2, "data": [[1, 1], [1, 1], [1, 1]] }
{"id": 3, "data": [[2, 0], [0, 2], [2, 0]] }
\.

=> SELECT mn.id, np.id, MatrixMultiply(mn.data, np.data) FROM mn CROSS JOIN np ORDER BY 1, 2;
id | id |   MatrixMultiply
---+----+-------------------
1  |  1 | [[0,0],[0,0]]
1  |  2 | [[6,6],[15,15]]
1  |  3 | [[8,4],[20,10]]
2  |  1 | [[0,0],[0,0]]
2  |  2 | [[24,24],[33,33]]
2  |  3 | [[32,16],[44,22]]
(6 rows)

Setup

All Python UDxs must import the Vertica SDK library:

import vertica_sdk

Factory implementation

The getPrototype() method declares that the function arguments and return type must all be two-dimensional arrays, represented as arrays of integer arrays:

def getPrototype(self, srv_interface, arg_types, return_type):
    array1dtype = vertica_sdk.ColumnTypes.makeArrayType(vertica_sdk.ColumnTypes.makeInt())
    arg_types.addArrayType(array1dtype)
    arg_types.addArrayType(array1dtype)
    return_type.addArrayType(array1dtype)

getReturnType() validates that the product matrix has the same number of rows as the first input matrix and the same number of columns as the second input matrix:

def getReturnType(self, srv_interface, arg_types, return_type):
    (_, a1type) = arg_types[0]
    (_, a2type) = arg_types[1]
    m = a1type.getArrayBound()
    p = a2type.getElementType().getArrayBound()
    return_type.addArrayType(vertica_sdk.SizedColumnTypes.makeArrayType(vertica_sdk.SizedColumnTypes.makeInt(), p), m)

Function implementation

The processBlock() method is called with a BlockReader and a BlockWriter, named arg_reader and res_writer respectively. To access elements of the input arrays, the method uses ArrayReader instances. The arrays are nested, so an ArrayReader must be instantiated for both the outer and inner arrays. List comprehension simplifies the process of reading the input arrays into lists. The method performs the computation and then uses an ArrayWriter instance to construct the product matrix.

def processBlock(self, server_interface, arg_reader, res_writer):
    while True:
        lmat = [[cell.getInt(0) for cell in row.getArrayReader(0)] for row in arg_reader.getArrayReader(0)]
        rmat = [[cell.getInt(0) for cell in row.getArrayReader(0)] for row in arg_reader.getArrayReader(1)]
        omat = [[0 for c in range(len(rmat[0]))] for r in range(len(lmat))]

        for i in range(len(lmat)):
            for j in range(len(rmat[0])):
                for k in range(len(rmat)):
                    omat[i][j] += lmat[i][k] * rmat[k][j]

        res_writer.setArray(omat)
        res_writer.next()

        if not arg_reader.next():
            break

9 - R example: SalesTaxCalculator

The SalesTaxCalculator scalar function reads a float and a varchar from a table, an item's price and the state abbreviation.

The SalesTaxCalculator scalar function reads a float and a varchar from a table, an item's price and the state abbreviation. It then uses the state abbreviation to find the sales tax rate from a list and calculates the item's price including the state's sales tax, returning the total cost of the item.

You can find more UDx examples in the Vertica Github repository, https://github.com/vertica/UDx-Examples.

Load the function and library

Create the library and the function.

=> CREATE OR REPLACE LIBRARY rLib AS 'sales_tax_calculator.R' LANGUAGE 'R';
CREATE LIBRARY
=> CREATE OR REPLACE FUNCTION SalesTaxCalculator AS LANGUAGE 'R' NAME 'SalesTaxCalculatorFactory' LIBRARY rLib FENCED;
CREATE FUNCTION

Querying data with the function

The following query shows how you can run a query with the UDSF.

=> SELECT item, state_abbreviation,
          price, SalesTaxCalculator(price, state_abbreviation) AS Price_With_Sales_Tax
    FROM inventory;
    item     | state_abbreviation | price | Price_With_Sales_Tax
-------------+--------------------+-------+---------------------
 Scarf       | AZ                 |  6.88 |             7.53016
 Software    | MA                 | 88.31 |           96.655295
 Soccer Ball | MS                 | 12.55 |           13.735975
 Beads       | LA                 |  0.99 |            1.083555
 Baseball    | TN                 | 42.42 |            46.42869
 Cheese      | WI                 | 20.77 |           22.732765
 Coffee Mug  | MA                 |  8.99 |            9.839555
 Shoes       | TN                 | 23.99 |           26.257055
(8 rows)

UDSF R code

SalesTaxCalculator <- function(input.data.frame) {
  # Not a complete list of states in the USA, but enough to get the idea.
  state.sales.tax <- list(ma = 0.0625,
                          az = 0.087,
                          la = 0.0891,
                          tn = 0.0945,
                          wi = 0.0543,
                          ms = 0.0707)
  for ( state_abbreviation in input.data.frame[, 2] ) {
    # Ensure state abbreviations are lowercase.
    lower_state <- tolower(state_abbreviation)
    # Check if the state is in our state.sales.tax list.
    if (is.null(state.sales.tax[[lower_state]])) {
      stop("State is not in our small sample!")
    } else {
      sales.tax.rate <- state.sales.tax[[lower_state]]
      item.price <- input.data.frame[, 1]
      # Calculate the price including sales tax.
      price.with.sales.tax <- (item.price) + (item.price * sales.tax.rate)
    }
  }
  return(price.with.sales.tax)
}

SalesTaxCalculatorFactory <- function() {
  list(name    = SalesTaxCalculator,
       udxtype = c("scalar"),
       intype  = c("float", "varchar"),
       outtype = c("float"))
}

10 - R example: kmeans

The KMeans_User scalar function reads any number of columns from a table, the observations.

The KMeans_User scalar function reads any number of columns from a table, the observations. It then uses the observations and the two parameters when applying the kmeans clustering algorithm to the data, returning an integer value associated with the cluster of the row.

You can find more UDx examples in the Vertica Github repository, https://github.com/vertica/UDx-Examples.

Load the function and library

Create the library and the function:

=> CREATE OR REPLACE LIBRARY rLib AS 'kmeans.R' LANGUAGE 'R';
CREATE LIBRARY
=> CREATE OR REPLACE FUNCTION KMeans_User AS LANGUAGE 'R' NAME 'KMeans_UserFactory' LIBRARY rLib FENCED;
CREATE FUNCTION

Querying data with the function

The following query shows how you can run a query with the UDSF:

=> SELECT spec,
          KMeans_User(sl, sw, pl, pw USING PARAMETERS clusters = 3, nstart = 20)
    FROM iris;
      spec       | KMeans_User
-----------------+-------------
 Iris-setosa     |           2
 Iris-setosa     |           2
 Iris-setosa     |           2
 Iris-setosa     |           2
 Iris-setosa     |           2
 Iris-setosa     |           2
 Iris-setosa     |           2
 Iris-setosa     |           2
 Iris-setosa     |           2
 Iris-setosa     |           2
 Iris-setosa     |           2
.
.
.
(150 rows)

UDSF R code

KMeans_User <- function(input.data.frame, parameters.data.frame) {
  # Take the clusters and nstart parameters passed by the user and assign them
  # to variables in the function.
  if ( is.null(parameters.data.frame[['clusters']]) ) {
    stop("NULL value for clusters! clusters cannot be NULL.")
  } else {
    clusters.value <- parameters.data.frame[['clusters']]
  }
  if ( is.null(parameters.data.frame[['nstart']]) ) {
    stop("NULL value for nstart! nstart cannot be NULL.")
  } else {
    nstart.value <- parameters.data.frame[['nstart']]
  }
  # Apply the algorithm to the data.
  kmeans.clusters <- kmeans(input.data.frame[, 1:length(input.data.frame)],
                           clusters.value, nstart = nstart.value)
  final.output <- data.frame(kmeans.clusters$cluster)
  return(final.output)
}

KMeans_UserFactory <- function() {
  list(name    = KMeans_User,
       udxtype = c("scalar"),
       # Since this is a polymorphic function the intype must be any
       intype  = c("any"),
       outtype = c("int"),
       parametertypecallback=KMeansParameters)
}

KMeansParameters <- function() {
  parameters <- list(datatype = c("int", "int"),
                     length   = c("NA", "NA"),
                     scale    = c("NA", "NA"),
                     name     = c("clusters", "nstart"))
  return(parameters)
}

11 - C++ example: using complex types

UDxs can accept and return complex types.

UDxs can accept and return complex types. The ArraySlice example takes an array and two indices as inputs and returns an array containing only the values in that range. Because array elements can be of any type, the function is polymorphic.

The complete source code is in /opt/vertica/sdk/examples/ScalarFunctions/ArraySlice.cpp.

Loading and using the example

Load the library and create the function as follows:

=> CREATE OR REPLACE LIBRARY ScalarFunctions AS '/home/dbadmin/examplesUDSF.so';

=> CREATE FUNCTION ArraySlice AS
LANGUAGE 'C++' NAME 'ArraySliceFactory' LIBRARY ScalarFunctions;

Create some data and call the function on it as follows:

=> CREATE TABLE arrays (id INTEGER, aa ARRAY[INTEGER]);
COPY arrays FROM STDIN;
1|[]
2|[1,2,3]
3|[5,4,3,2,1]
\.

=> CREATE TABLE slices (b INTEGER, e INTEGER);
COPY slices FROM STDIN;
0|2
1|3
2|4
\.

=> SELECT id, b, e, ArraySlice(aa, b, e) AS slice FROM arrays, slices;
 id | b | e | slice
----+---+---+-------
  1 | 0 | 2 | []
  1 | 1 | 3 | []
  1 | 2 | 4 | []
  2 | 0 | 2 | [1,2]
  2 | 1 | 3 | [2,3]
  2 | 2 | 4 | [3]
  3 | 0 | 2 | [5,4]
  3 | 1 | 3 | [4,3]
  3 | 2 | 4 | [3,2]
(9 rows)

Factory implementation

Because the function is polymorphic, getPrototype() declares that the inputs and outputs can be of any type, and type enforcement must be done elsewhere:

void getPrototype(ServerInterface &srvInterface,
                              ColumnTypes &argTypes,
                              ColumnTypes &returnType) override
{
    /*
     * This is a polymorphic function that accepts any array
     * and returns an array of the same type
     */
    argTypes.addAny();
    returnType.addAny();
}

The factory validates input types and determines the return type in getReturnType():

void getReturnType(ServerInterface &srvInterface,
                   const SizedColumnTypes &argTypes,
                   SizedColumnTypes &returnType) override
{
    /*
     * Three arguments: (array, slicebegin, sliceend)
     * Validate manually since the prototype accepts any arguments.
     */
    if (argTypes.size() != 3) {
        vt_report_error(0, "Three arguments (array, slicebegin, sliceend) expected");
    } else if (!argTypes[0].getType().isArrayType()) {
        vt_report_error(1, "Argument 1 is not an array");
    } else if (!argTypes[1].getType().isInt()) {
        vt_report_error(2, "Argument 2 (slicebegin) is not an integer)");
    } else if (!argTypes[2].getType().isInt()) {
        vt_report_error(3, "Argument 3 (sliceend) is not an integer)");
    }

    /* return type is the same as the array arg type, copy it over */
    returnType.push_back(argTypes[0]);
}

Function implementation

The processBlock() method is called with a BlockReader and a BlockWriter. The first argument is an array. To access elements of the array, the method uses an ArrayReader. Similarly, it uses an ArrayWriter to construct the output.

void processBlock(ServerInterface &srvInterface,
        BlockReader &argReader,
        BlockWriter &resWriter) override
{
    do {
        if (argReader.isNull(0) || argReader.isNull(1) || argReader.isNull(2)) {
            resWriter.setNull();
        } else {
            Array::ArrayReader argArray  = argReader.getArrayRef(0);
            const vint slicebegin = argReader.getIntRef(1);
            const vint sliceend   = argReader.getIntRef(2);

            Array::ArrayWriter outArray = resWriter.getArrayRef(0);
            if (slicebegin < sliceend) {
                for (int i = 0; i < slicebegin && argArray->hasData(); i++) {
                    argArray->next();
                }
                for (int i = slicebegin; i < sliceend && argArray->hasData(); i++) {
                    outArray->copyFromInput(*argArray);
                    outArray->next();
                    argArray->next();
                }
            }
            outArray.commit();  /* finalize the written array elements */
        }
        resWriter.next();
    } while (argReader.next());
}

12 - C++ example: returning multiple values

When writing a UDSF, you can specify more than one return value.

When writing a UDSF, you can specify more than one return value. If you specify multiple values, Vertica packages them into a single ROW as a return value. You can query fields in the ROW or the entire ROW.

The following example implements a function named div (division) that returns two integers, the quotient and the remainder.

This example shows one way to return a ROW from a UDSF. Returning multiple values and letting Vertica build the ROW is convenient when inputs and outputs are all of primitive types. You can also work directly with the complex types, as described in Complex Types as Arguments and illustrated in C++ example: using complex types.

Loading and using the example

Load the library and create the function as follows:

=> CREATE OR REPLACE LIBRARY ScalarFunctions AS '/home/dbadmin/examplesUDSF.so';

=> CREATE FUNCTION div AS
LANGUAGE 'C++' NAME 'DivFactory' LIBRARY ScalarFunctions;

Create some data and call the function on it as follows:

=> CREATE TABLE D (a INTEGER, b INTEGER);
COPY D FROM STDIN DELIMITER ',';
10,0
10,1
10,2
10,3
10,4
10,5
\.

=> SELECT a, b, Div(a, b), (Div(a, b)).quotient, (Div(a, b)).remainder FROM D;
 a  | b |                  Div               | quotient | remainder
----+---+------------------------------------+----------+-----------
 10 | 0 | {"quotient":null,"remainder":null} |          |
 10 | 1 | {"quotient":10,"remainder":0}      |       10 |         0
 10 | 2 | {"quotient":5,"remainder":0}       |        5 |         0
 10 | 3 | {"quotient":3,"remainder":1}       |        3 |         1
 10 | 4 | {"quotient":2,"remainder":2}       |        2 |         2
 10 | 5 | {"quotient":2,"remainder":0}       |        2 |         0
(6 rows)

Factory implementation

The factory declares the two return values in getPrototype() and in getReturnType(). The factory is otherwise unremarkable.

    void getPrototype(ServerInterface &interface,
            ColumnTypes &argTypes,
            ColumnTypes &returnType) override
    {
        argTypes.addInt();
        argTypes.addInt();
        returnType.addInt(); /* quotient  */
        returnType.addInt(); /* remainder */
    }

    void getReturnType(ServerInterface &srvInterface,
            const SizedColumnTypes &argTypes,
            SizedColumnTypes &returnType) override
    {
        returnType.addInt("quotient");
        returnType.addInt("remainder");
    }

Function implementation

The function writes two output values in processBlock(). The number of values here must match the factory declarations.

class Div : public ScalarFunction {
    void processBlock(Vertica::ServerInterface &srvInterface,
            Vertica::BlockReader &argReader,
            Vertica::BlockWriter &resWriter) override
    {
        do {
            if (argReader.isNull(0) || argReader.isNull(1) || (argReader.getIntRef(1) == 0)) {
                resWriter.setNull(0);
                resWriter.setNull(1);
            } else {
                const vint dividend = argReader.getIntRef(0);
                const vint divisor  = argReader.getIntRef(1);
                resWriter.setInt(0, dividend / divisor);
                resWriter.setInt(1, dividend % divisor);
            }
            resWriter.next();
        } while (argReader.next());
    }
};

13 - C++ example: calling a UDSF from a check constraint

This example shows you the C++ code needed to create a UDSF that can be called by a check constraint.

This example shows you the C++ code needed to create a UDSF that can be called by a check constraint. The name of the sample function is LargestSquareBelow. The sample function determines the largest number whose square is less than the number in the subject column. For example, if the number in the column is 1000, the largest number whose square is less than 1000 is 31 (961).

Important

A UDSF used within a check constraint must be immutable, and the constraint must handle null values properly. Otherwise, the check constraint might not work as you intended. In addition, Vertica evaluates the predicate of an enabled check constraint on every row that is loaded or updated, so consider performance in writing your function.

For information on check constraints, see Check constraints.

Loading and using the example

The following example shows how you can create and load a library named MySqLib, using CREATE LIBRARY. Adjust the library path in this example to the absolute path and file name for the location where you saved the shared object LargestSquareBelow.

Create the library:

=> CREATE OR REPLACE LIBRARY MySqLib AS '/home/dbadmin/LargestSquareBelow.so';

After you create and load the library, add the function to the catalog using the CREATE FUNCTION (scalar) statement:

=> CREATE OR REPLACE FUNCTION largestSqBelow AS LANGUAGE 'C++' NAME 'LargestSquareBelowInfo' LIBRARY MySqLib;

Next, include the UDSF in a check constraint:

=> CREATE TABLE squaretest(
   ceiling INTEGER UNIQUE,
   CONSTRAINT chk_sq CHECK (largestSqBelow(ceiling) < ceiling*ceiling)
);

Add data to the table, squaretest:

=> COPY squaretest FROM stdin DELIMITER ','NULL'null';
-1
null
0
1
1000
1000000
1000001
\.

Your output should be similar to the following sample, based upon the data you use:

=> SELECT ceiling, largestSqBelow(ceiling)
FROM squaretest ORDER BY ceiling;

ceiling  | largestSqBelow
---------+----------------
         |
      -1 |
       0 |
       1 |              0
    1000 |             31
 1000000 |            999
 1000001 |           1000
(7 rows)

ScalarFunction implementation

This ScalarFunction implementation does the processing work for a UDSF that determines the largest number whose square is less than the number input.

#include "Vertica.h"
/*
 * ScalarFunction implementation for a UDSF that
 * determines the largest number whose square is less than
 * the number input.
*/
class LargestSquareBelow : public Vertica::ScalarFunction
{
public:
 /*
  * This function does all of the actual processing for the UDSF.
  * The inputs are retrieved via arg_reader
  * The outputs are returned via arg_writer
  *
 */
    virtual void processBlock(Vertica::ServerInterface &srvInterface,
                              Vertica::BlockReader &arg_reader,
                              Vertica::BlockWriter &res_writer)
    {
        if (arg_reader.getNumCols() != 1)
            vt_report_error(0, "Function only accept 1 argument, but %zu provided", arg_reader.getNumCols());
// While we have input to process
        do {
            // Read the input parameter by calling the
            // BlockReader.getIntRef class function
            const Vertica::vint a = arg_reader.getIntRef(0);
            Vertica::vint res;
            //Determine the largest square below the number
            if ((a != Vertica::vint_null) && (a > 0))
            {
                res = (Vertica::vint)sqrt(a - 1);
            }
            else
                res = Vertica::vint_null;
            //Call BlockWriter.setInt to store the output value,
            //which is the largest square
            res_writer.setInt(res);
            //Write the row and advance to the next output row
            res_writer.next();
            //Continue looping until there are no more input rows
        } while (arg_reader.next());
    }
};

ScalarFunctionFactory implementation

This ScalarFunctionFactory implementation does the work of handling input and output, and marks the function as immutable (a requirement if you plan to use the UDSF within a check constraint).

class LargestSquareBelowInfo : public Vertica::ScalarFunctionFactory
{
    //return an instance of LargestSquareBelow to perform the computation.
    virtual Vertica::ScalarFunction *createScalarFunction(Vertica::ServerInterface &srvInterface)
    //Call the vt_createFuncObj to create the new LargestSquareBelow class instance.
    { return Vertica::vt_createFuncObject<LargestSquareBelow>(srvInterface.allocator); }

    /*
     * This function returns the description of the input and outputs of the
     * LargestSquareBelow class's processBlock function.  It stores this information in
     * two ColumnTypes objects, one for the input parameter, and one for
     * the return value.
    */
    virtual void getPrototype(Vertica::ServerInterface &srvInterface,
                              Vertica::ColumnTypes &argTypes,
                              Vertica::ColumnTypes &returnType)
    {
        // Takes one int as input, so adds int to the argTypes object
        argTypes.addInt();
        // Returns a single int, so add a single int to the returnType object.
        // ScalarFunctions always return a single value.
        returnType.addInt();
    }
public:
    // the function cannot be called within a check constraint unless the UDx author
    // certifies that the function is immutable:
    LargestSquareBelowInfo() { vol = Vertica::IMMUTABLE; }
};

The RegisterFactory macro

Use the RegisterFactory macro to register a ScalarFunctionFactory subclass. This macro instantiates the factory class and makes the metadata it contains available for Vertica to access. To call this macro, pass it the name of your factory class.


RegisterFactory(LargestSquareBelowInfo);