This is the multi-page printable view of this section.
Click here to print.
Return to the regular view of this page.
Scalar functions (UDSFs)
A user-defined scalar function (UDSF) returns a single value for each row of data it reads.
A user-defined scalar function (UDSF) returns a single value for each row of data it reads. You can use a UDSF anywhere you can use a built-in Vertica function. You usually develop a UDSF to perform data manipulations that are too complex or too slow to perform using SQL statements and functions. UDSFs also let you use analytic functions provided by third-party libraries within Vertica while still maintaining high performance.
A UDSF returns a single column. You can automatically return multiple values in a ROW. A ROW is a group of property-value pairs. In the following example, div_with_rem is a UDSF that performs a division operation, returning the quotient and remainder as integers:
=> SELECT div_with_rem(18,5);
div_with_rem
------------------------------
{"quotient":3,"remainder":3}
(1 row)
A ROW returned from a UDSF cannot be used as an argument to COUNT.
Alternatively, you can construct a complex return value yourself, as described in Complex Types as Arguments.
Your UDSF must return a value for every input row (unless it generates an error; see Handling errors for details). Failure to return a value for an input row results in incorrect results and potentially destabilizes the Vertica server if not run in Fenced and unfenced modes.
A UDSF can have up to 9800 arguments.
1 - ScalarFunction class
The ScalarFunction class is the heart of a UDSF.
The ScalarFunction
class is the heart of a UDSF. Your subclass must define the processBlock()
method to perform the scalar operation. It may define methods to set up and tear down the function.
For scalar functions written in C++, you can provide information that can help with query optimization. See Improving query performance (C++ only).
The processBlock()
method carries out all of the processing that you want your UDSF to perform. When a user calls your function in a SQL statement, Vertica bundles together the data from the function parameters and passes it to processBlock()
.
The input and output of the processBlock()
method are supplied by objects of the BlockReader
and BlockWriter
classes. They define methods that you use to read the input data and write the output data for your UDSF.
The majority of the work in developing a UDSF is writing processBlock()
. This is where all of the processing in your function occurs. Your UDSF should follow this basic pattern:
-
Read in a set of arguments from the BlockReader
object using data-type-specific methods.
-
Process the data in some manner.
-
Output the resulting value using one of the BlockWriter
class's data-type-specific methods.
-
Advance to the next row of output and input by calling BlockWriter.next()
and BlockReader.next()
.
This process continues until there are no more rows of data to be read (BlockReader.next()
returns false).
You must make sure that processBlock()
reads all of the rows in its input and outputs a single value for each row. Failure to do so can corrupt the data structures that Vertica reads to get the output of your UDSF. The only exception to this rule is if your processBlock()
function reports an error back to Vertica (see Handling errors). In that case, Vertica does not attempt to read the incomplete result set generated by the UDSF.
Setting up and tearing down
The ScalarFunction
class defines two additional methods that you can optionally implement to allocate and free resources: setup()
and destroy()
. You should use these methods to allocate and deallocate resources that you do not allocate through the UDx API (see Allocating resources for UDxs for details).
Notes
-
While the name you choose for your ScalarFunction
subclass does not have to match the name of the SQL function you will later assign to it, Vertica considers making the names the same a best practice.
-
Do not assume that your function will be called from the same thread that instantiated it.
-
The same instance of your ScalarFunction
subclass can be called on to process multiple blocks of data.
-
The rows of input sent to processBlock()
are not guaranteed to be any particular order.
-
Writing too many output rows can cause Vertica to emit an out-of-bounds error.
API
The ScalarFunction API provides the following methods for extension by subclasses:
The ScalarFunction API provides the following methods for extension by subclasses:
The ScalarFunction API provides the following methods for extension by subclasses:
Implement the Main function API to define a scalar function:
2 - ScalarFunctionFactory class
The ScalarFunctionFactory class tells Vertica metadata about your UDSF: its number of parameters and their data types, as well as the data type of its return value.
The ScalarFunctionFactory
class tells Vertica metadata about your UDSF: its number of parameters and their data types, as well as the data type of its return value. It also instantiates a subclass of ScalarFunction
.
Methods
You must implement the following methods in your ScalarFunctionFactory
subclass:
-
createScalarFunction()
instantiates a ScalarFunction
subclass. If writing in C++, you can call the vt_createFuncObj
macro with the name of the ScalarFunction
subclass. This macro takes care of allocating and instantiating the class for you.
-
getPrototype()
tells Vertica about the parameters and return type(s) for your UDSF. In addition to a ServerInterface
object, this method gets two ColumnTypes
objects. All you need to do in this function is to call class functions on these two objects to build the list of parameters and the return value type(s). If you return more than one value, the results are packaged into a ROW type.
After defining your factory class, you need to call the RegisterFactory
macro. This macro instantiates a member of your factory class, so Vertica can interact with it and extract the metadata it contains about your UDSF.
Declaring return values
If your function returns a sized column (a return data type whose length can vary, such as a VARCHAR), a value that requires precision, or more than one value, you must implement getReturnType()
. This method is called by Vertica to find the length or precision of the data being returned in each row of the results. The return value of this method depends on the data type your processBlock()
method returns:
-
CHAR, (LONG) VARCHAR, BINARY, and (LONG) VARBINARY return the maximum length.
-
NUMERIC types specify the precision and scale.
-
TIME and TIMESTAMP values (with or without timezone) specify precision.
-
INTERVAL YEAR TO MONTH specifies range.
-
INTERVAL DAY TO SECOND specifies precision and range.
If your UDSF does not return one of these data types and returns a single value, it does not need a getReturnType()
method.
The input to the getReturnType()
method is a SizedColumnTypes
object that contains the input argument types along with their lengths. This object will be passed to an instance of your processBlock()
function. Your implementation of getReturnType()
must extract the data types and lengths from this input and determine the length or precision of the output rows. It then saves this information in another instance of the SizedColumnTypes
class.
API
The ScalarFunctionFactory API provides the following methods for extension by subclasses:
The ScalarFunctionFactory API provides the following methods for extension by subclasses:
The ScalarFunctionFactory API provides the following methods for extension by subclasses:
Implement the Factory function API to define a scalar function factory:
3 - Setting null input and volatility behavior
Normally, Vertica calls your UDSF for every row of data in the query.
Normally, Vertica calls your UDSF for every row of data in the query. There are some cases where Vertica can avoid executing your UDSF. You can tell Vertica when it can skip calling your function and just supply a return value itself by changing your function's volatility and strictness settings.
-
Your function's volatility indicates whether it always returns the same output value when passed the same arguments. Depending on its behavior, Vertica can cache the arguments and the return value. If the user calls the UDSF with the same set of arguments, Vertica returns the cached value instead of calling your UDSF.
-
Your function's strictness indicates how it reacts to NULL arguments. If it always returns NULL when any argument is NULL, Vertica can just return NULL without having to call the function. This optimization also saves you work, because you do not need to test for and handle null arguments in your UDSF code.
You indicate the volatility and null handling of your function by setting the vol
and strict
fields in your ScalarFunctionFactory
class's constructor.
Volatility settings
To indicate your function's volatility, set the vol
field to one of the following values:
Value |
Description |
VOLATILE |
Repeated calls to the function with the same arguments always result in different values. Vertica always calls volatile functions for each invocation. |
IMMUTABLE |
Calls to the function with the same arguments always results in the same return value. |
STABLE |
Repeated calls to the function with the same arguments within the same statement returns the same output. For example, a function that returns the current user name is stable because the user cannot change within a statement. The user name could change between statements. |
DEFAULT_VOLATILITY |
The default volatility. This is the same as VOLATILE. |
Example
The following example shows a version of the Add2ints
example factory class that makes the function immutable.
The following example demonstrates setting the Add2IntsFactory
's vol
field to IMMUTABLE to tell Vertica it can cache the arguments and return value.
public class Add2IntsFactory extends ScalarFunctionFactory {
@Override
public void getPrototype(ServerInterface srvInterface, ColumnTypes argTypes, ColumnTypes returnType){
argTypes.addInt();
argTypes.addInt();
returnType.addInt();
}
@Override
public ScalarFunction createScalarFunction(ServerInterface srvInterface){
return new Add2Ints();
}
// Class constructor
public Add2IntsFactory() {
// Tell Vertica that the same set of arguments will always result in the
// same return value.
vol = volatility.IMMUTABLE;
}
}
To indicate how your function reacts to NULL input, set the strictness
field to one of the following values.
Value |
Description |
CALLED_ON_NULL_INPUT |
The function must be called, even if one or more arguments are NULL. |
RETURN_NULL_ON_NULL_INPUT |
The function always returns a NULL value if any of its arguments are NULL. |
STRICT |
A synonym for RETURN_NULL_ON_NULL_INPUT |
DEFAULT_STRICTNESS |
The default strictness setting. This is the same as CALLED_ON_NULL_INPUT. |
Example
The following C++ example demonstrates setting the null behavior of Add2ints so Vertica does not call the function with NULL values.
4 - Improving query performance (C++ only)
When evaluating a query, Vertica can take advantage of available information about the ranges of values.
When evaluating a query, Vertica can take advantage of available information about the ranges of values. For example, if data is partitioned and a query restricts output by the partitioned value, Vertica can ignore partitions that cannot possibly contain data that would satisfy the query. Similarly, for a scalar function, Vertica can skip processing rows in the data where the value returned from the function cannot possibly affect the results.
Consider a table with millions of rows of data on customer orders and a scalar function that computes the total price paid for everything in an order. A query uses a WHERE clause to restrict results to orders above a given value. A scalar function is called on a block of data; if no rows within that block could produce the target value, skipping the processing of the block could improve query performance.
A scalar function written in C++ can implement the getOutputRange
method. Before calling processBlock
, Vertica calls getOutputRange
to determine the minimum and maximum return values from this block given the input ranges. It then decides whether to call processBlock
to perform the computations.
The Add2Ints example implements this function. The minimum output value is the sum of the smallest values of each of the two inputs, and the maximum output is the sum of the largest values of each of the inputs. This function does not consider individual rows. Consider the following inputs:
a | b
------+------
21 | 92
500 | 19
111 | 11
The smallest values of the two inputs are 21 and 11, so the function reports 32 as the low end of the output range. The largest input values are 500 and 92, so it reports 592 as the high end of the output range. 592 is larger than the value returned for any of the input rows and 32 is smaller than any row's return value.
The purpose of getOutputRange
is to quickly eliminate calls where outputs would definitely be out of range. For example, if the query included "WHERE Add2Ints(a,b) > 600", this block of data could be skipped. There can still be cases where, after calling getOutputRange
, processBlock
returns no results. If the query included "WHERE Add2Ints(a,b) > 500", getOutputRange
would not eliminate this block of data.
Add2Ints implements getOutputRange
as follows:
If getOutputRange
produces an error, Vertica issues a warning and does not call the method again for the current query.
5 - C++ example: Add2Ints
The following example shows a basic subclass of ScalarFunction called Add2ints.
The following example shows a basic subclass of ScalarFunction
called Add2ints
. As the name implies, it adds two integers together, returning a single integer result.
For the complete source code, see /opt/vertica/sdk/examples/ScalarFunctions/Add2Ints.cpp
. Java and Python versions of this UDx are included in /opt/vertica/sdk/examples
.
Loading and using the example
Use CREATE LIBRARY to load the library containing the function, and then use CREATE FUNCTION (scalar) to declare the function as in the following example:
=> CREATE LIBRARY ScalarFunctions AS '/home/dbadmin/examples/ScalarFunctions.so';
=> CREATE FUNCTION add2ints AS LANGUAGE 'C++' NAME 'Add2IntsFactory' LIBRARY ScalarFunctions;
The following example shows how to use this function:
=> SELECT Add2Ints(27,15);
Add2ints
----------
42
(1 row)
=> SELECT * FROM MyTable;
a | b
-----+----
7 | 0
12 | 2
12 | 6
18 | 9
1 | 1
58 | 4
450 | 15
(7 rows)
=> SELECT * FROM MyTable WHERE Add2ints(a, b) > 20;
a | b
-----+----
18 | 9
58 | 4
450 | 15
(3 rows)
Function implementation
A scalar function does its computation in the processBlock
method:
Implementing getOutputRange
, which is optional, allows your function to skip rows where the result would not be within a target range. For example, if a WHERE clause restricts the query results to those in a certain range, calling the function for cases that could not possibly be in that range is unnecessary.
Factory implementation
The factory instantiates a member of the class (createScalarFunction
), and also describes the function's inputs and outputs (getPrototype
):
The RegisterFactory macro
Use the RegisterFactory
macro to register a UDx. This macro instantiates the factory class and makes the metadata it contains available for Vertica to access. To call this macro, pass it the name of your factory class:
RegisterFactory(Add2IntsFactory);
6 - Python example: currency_convert
The currency_convert scalar function reads two values from a table, a currency and a value.
The currency_convert
scalar function reads two values from a table, a currency and a value. It then converts the item's value to USD, returning a single float result.
You can find more UDx examples in the Vertica Github repository, https://github.com/vertica/UDx-Examples.
UDSF Python code
Load the function and library
Create the library and the function.
=> CREATE LIBRARY pylib AS '/home/dbadmin/python_udx/currency_convert/currency_convert.py' LANGUAGE 'Python';
CREATE LIBRARY
=> CREATE FUNCTION currency_convert AS LANGUAGE 'Python' NAME 'currency_convert_factory' LIBRARY pylib fenced;
CREATE FUNCTION
Querying data with the function
The following query shows how you can run a query with the UDSF.
=> SELECT product, currency_convert(currency, value) AS cost_in_usd
FROM items;
product | cost_in_usd
--------------+-------------
Shoes | 133.4008
Soccer Ball | 110.2817
Coffee | 13.5190
Surfboard | 176.2593
Hockey Stick | 76.7177
Car | 17000.0000
Software | 10.4424
Hamburger | 7.5000
Fish | 130.4272
Cattle | 269.2367
(10 rows)
7 - Python example: validate_url
The validate_url scalar function reads a string from a table, a URL.
The validate_url
scalar function reads a string from a table, a URL. It then validates if the URL is responsive, returning a status code or a string indicating the attempt failed.
You can find more UDx examples in the Vertica Github repository, https://github.com/vertica/UDx-Examples.
UDSF Python code
Load the function and library
Create the library and the function.
=> CREATE OR REPLACE LIBRARY pylib AS 'webpage_tester/validate_url.py' LANGUAGE 'Python';
=> CREATE OR REPLACE FUNCTION validate_url AS LANGUAGE 'Python' NAME 'validate_url_factory' LIBRARY pylib fenced;
Querying data with the function
The following query shows how you can run a query with the UDSF.
=> SELECT url, validate_url(url) AS url_status FROM webpages;
url | url_status
-----------------------------------------------+----------------------
http://www.vertica.com/documentation/vertica/ | 200
http://www.google.com/ | 200
http://www.mass.gov.com/ | Failed to load page
http://www.espn.com | 200
http://blah.blah.blah.blah | Failed to load page
http://www.vertica.com/ | 200
(6 rows)
8 - R example: SalesTaxCalculator
The SalesTaxCalculator scalar function reads a float and a varchar from a table, an item's price and the state abbreviation.
The SalesTaxCalculator
scalar function reads a float and a varchar from a table, an item's price and the state abbreviation. It then uses the state abbreviation to find the sales tax rate from a list and calculates the item's price including the state's sales tax, returning the total cost of the item.
You can find more UDx examples in the Vertica Github repository, https://github.com/vertica/UDx-Examples.
Load the function and library
Create the library and the function.
=> CREATE OR REPLACE LIBRARY rLib AS 'sales_tax_calculator.R' LANGUAGE 'R';
CREATE LIBRARY
=> CREATE OR REPLACE FUNCTION SalesTaxCalculator AS LANGUAGE 'R' NAME 'SalesTaxCalculatorFactory' LIBRARY rLib FENCED;
CREATE FUNCTION
Querying data with the function
The following query shows how you can run a query with the UDSF.
=> SELECT item, state_abbreviation,
price, SalesTaxCalculator(price, state_abbreviation) AS Price_With_Sales_Tax
FROM inventory;
item | state_abbreviation | price | Price_With_Sales_Tax
-------------+--------------------+-------+---------------------
Scarf | AZ | 6.88 | 7.53016
Software | MA | 88.31 | 96.655295
Soccer Ball | MS | 12.55 | 13.735975
Beads | LA | 0.99 | 1.083555
Baseball | TN | 42.42 | 46.42869
Cheese | WI | 20.77 | 22.732765
Coffee Mug | MA | 8.99 | 9.839555
Shoes | TN | 23.99 | 26.257055
(8 rows)
UDSF R code
9 - R example: kmeans
The KMeans_User scalar function reads any number of columns from a table, the observations.
The KMeans_User
scalar function reads any number of columns from a table, the observations. It then uses the observations and the two parameters when applying the kmeans clustering algorithm to the data, returning an integer value associated with the cluster of the row.
You can find more UDx examples in the Vertica Github repository, https://github.com/vertica/UDx-Examples.
Load the function and library
Create the library and the function:
=> CREATE OR REPLACE LIBRARY rLib AS 'kmeans.R' LANGUAGE 'R';
CREATE LIBRARY
=> CREATE OR REPLACE FUNCTION KMeans_User AS LANGUAGE 'R' NAME 'KMeans_UserFactory' LIBRARY rLib FENCED;
CREATE FUNCTION
Querying data with the function
The following query shows how you can run a query with the UDSF:
=> SELECT spec,
KMeans_User(sl, sw, pl, pw USING PARAMETERS clusters = 3, nstart = 20)
FROM iris;
spec | KMeans_User
-----------------+-------------
Iris-setosa | 2
Iris-setosa | 2
Iris-setosa | 2
Iris-setosa | 2
Iris-setosa | 2
Iris-setosa | 2
Iris-setosa | 2
Iris-setosa | 2
Iris-setosa | 2
Iris-setosa | 2
Iris-setosa | 2
.
.
.
(150 rows)
UDSF R code
10 - C++ example: using complex types
UDxs can accept and return complex types.
UDxs can accept and return complex types. The ArraySlice
example takes an array and two indices as inputs and returns an array containing only the values in that range. Because array elements can be of any type, the function is polymorphic.
The complete source code is in /opt/vertica/sdk/examples/ScalarFunctions/ArraySlice.cpp
.
Loading and using the example
Load the library and create the function as follows:
=> CREATE OR REPLACE LIBRARY ScalarFunctions AS '/home/dbadmin/examplesUDSF.so';
=> CREATE FUNCTION ArraySlice AS
LANGUAGE 'C++' NAME 'ArraySliceFactory' LIBRARY ScalarFunctions;
Create some data and call the function on it as follows:
=> CREATE TABLE arrays (id INTEGER, aa ARRAY[INTEGER]);
COPY arrays FROM STDIN;
1|[]
2|[1,2,3]
3|[5,4,3,2,1]
\.
=> CREATE TABLE slices (b INTEGER, e INTEGER);
COPY slices FROM STDIN;
0|2
1|3
2|4
\.
=> SELECT id, b, e, ArraySlice(aa, b, e) AS slice FROM arrays, slices;
id | b | e | slice
----+---+---+-------
1 | 0 | 2 | []
1 | 1 | 3 | []
1 | 2 | 4 | []
2 | 0 | 2 | [1,2]
2 | 1 | 3 | [2,3]
2 | 2 | 4 | [3]
3 | 0 | 2 | [5,4]
3 | 1 | 3 | [4,3]
3 | 2 | 4 | [3,2]
(9 rows)
Factory implementation
Because the function is polymorphic, getPrototype()
declares that the inputs and outputs can be of any type, and type enforcement must be done elsewhere:
The factory validates input types and determines the return type in getReturnType()
:
Function implementation
The processBlock()
method is called with a BlockReader
and a BlockWriter
. The first argument is an array. To access elements of the array, the method uses an ArrayReader
. Similarly, it uses an ArrayWriter
to construct the output.
11 - C++ example: returning multiple values
When writing a UDSF, you can specify more than one return value.
When writing a UDSF, you can specify more than one return value. If you specify multiple values, Vertica packages them into a single ROW as a return value. You can query fields in the ROW or the entire ROW.
The following example implements a function named div (division) that returns two integers, the quotient and the remainder.
This example shows one way to return a ROW from a UDSF. Returning multiple values and letting Vertica build the ROW is convenient when inputs and outputs are all of primitive types. You can also work directly with the complex types, as described in Complex Types as Arguments and illustrated in C++ example: using complex types.
Loading and using the example
Load the library and create the function as follows:
=> CREATE OR REPLACE LIBRARY ScalarFunctions AS '/home/dbadmin/examplesUDSF.so';
=> CREATE FUNCTION div AS
LANGUAGE 'C++' NAME 'DivFactory' LIBRARY ScalarFunctions;
Create some data and call the function on it as follows:
=> CREATE TABLE D (a INTEGER, b INTEGER);
COPY D FROM STDIN DELIMITER ',';
10,0
10,1
10,2
10,3
10,4
10,5
\.
=> SELECT a, b, Div(a, b), (Div(a, b)).quotient, (Div(a, b)).remainder FROM D;
a | b | Div | quotient | remainder
----+---+------------------------------------+----------+-----------
10 | 0 | {"quotient":null,"remainder":null} | |
10 | 1 | {"quotient":10,"remainder":0} | 10 | 0
10 | 2 | {"quotient":5,"remainder":0} | 5 | 0
10 | 3 | {"quotient":3,"remainder":1} | 3 | 1
10 | 4 | {"quotient":2,"remainder":2} | 2 | 2
10 | 5 | {"quotient":2,"remainder":0} | 2 | 0
(6 rows)
Factory implementation
The factory declares the two return values in getPrototype()
and in getReturnType()
. The factory is otherwise unremarkable.
Function implementation
The function writes two output values in processBlock()
. The number of values here must match the factory declarations.
12 - C++ example: calling a UDSF from a check constraint
This example shows you the C++ code needed to create a UDSF that can be called by a check constraint.
This example shows you the C++ code needed to create a UDSF that can be called by a check constraint. The name of the sample function is LargestSquareBelow
. The sample function determines the largest number whose square is less than the number in the subject column. For example, if the number in the column is 1000, the largest number whose square is less than 1000 is 31 (961).
Important
A UDSF used within a check constraint must be immutable, and the constraint must handle null values properly. Otherwise, the check constraint might not work as you intended. In addition, Vertica evaluates the predicate of an enabled check constraint on every row that is loaded or updated, so consider performance in writing your function.
For information on check constraints, see Check constraints.
Loading and using the example
The following example shows how you can create and load a library named MySqLib, using CREATE LIBRARY. Adjust the library path in this example to the absolute path and file name for the location where you saved the shared object LargestSquareBelow
.
Create the library:
=> CREATE OR REPLACE LIBRARY MySqLib AS '/home/dbadmin/LargestSquareBelow.so';
- After you create and load the library, add the function to the catalog using the CREATE FUNCTION (scalar) statement:
=> CREATE OR REPLACE FUNCTION largestSqBelow AS LANGUAGE 'C++' NAME 'LargestSquareBelowInfo' LIBRARY MySqLib;
- Next, include the UDSF in a check constraint:
=> CREATE TABLE squaretest(
ceiling INTEGER UNIQUE,
CONSTRAINT chk_sq CHECK (largestSqBelow(ceiling) < ceiling*ceiling)
);
- Add data to the table,
squaretest
:
=> COPY squaretest FROM stdin DELIMITER ','NULL'null';
-1
null
0
1
1000
1000000
1000001
\.
Your output should be similar to the following sample, based upon the data you use:
SELECT ceiling, largestSqBelow(ceiling)
FROM squaretest ORDER BY ceiling;
ceiling | largestSqBelow
---------+----------------
|
-1 |
0 |
1 | 0
1000 | 31
1000000 | 999
1000001 | 1000
(7 rows)
ScalarFunction implementation
This ScalarFunction
implementation does the processing work for a UDSF that determines the largest number whose square is less than the number input.
ScalarFunctionFactory implementation
This ScalarFunctionFactory
implementation does the work of handling input and output, and marks the function as immutable (a requirement if you plan to use the UDSF within a check constraint).
The RegisterFactory macro
Use the RegisterFactory
macro to register a ScalarFunctionFactory
subclass. This macro instantiates the factory class and makes the metadata it contains available for Vertica to access. To call this macro, pass it the name of your factory class.
RegisterFactory(LargestSquareBelowInfo);