A user-defined scalar function (UDSF) returns a single value for each row of data it reads. You can use a UDSF anywhere you can use a built-in Vertica function. You usually develop a UDSF to perform data manipulations that are too complex or too slow to perform using SQL statements and functions. UDSFs also let you use analytic functions provided by third-party libraries within Vertica while still maintaining high performance.
A UDSF returns a single column. You can automatically return multiple values in a ROW. A ROW is a group of property-value pairs. In the following example, div_with_rem is a UDSF that performs a division operation, returning the quotient and remainder as integers:
A ROW returned from a UDSF cannot be used as an argument to COUNT.
Alternatively, you can construct a complex return value yourself, as described in Complex Types as Arguments.
Your UDSF must return a value for every input row (unless it generates an error; see Handling errors for details). Failure to return a value for an input row results in incorrect results and potentially destabilizes the Vertica server if not run in Fenced and unfenced modes.
A UDSF can have up to 9800 arguments.
1 - ScalarFunction class
The ScalarFunction class is the heart of a UDSF.
The ScalarFunction class is the heart of a UDSF. Your subclass must define the processBlock() method to perform the scalar operation. It may define methods to set up and tear down the function.
The processBlock() method carries out all of the processing that you want your UDSF to perform. When a user calls your function in a SQL statement, Vertica bundles together the data from the function parameters and passes it to processBlock() .
The input and output of the processBlock() method are supplied by objects of the BlockReader and BlockWriter classes. They define methods that you use to read the input data and write the output data for your UDSF.
The majority of the work in developing a UDSF is writing processBlock(). This is where all of the processing in your function occurs. Your UDSF should follow this basic pattern:
Read in a set of arguments from the BlockReader object using data-type-specific methods.
Process the data in some manner.
Output the resulting value using one of the BlockWriter class's data-type-specific methods.
Advance to the next row of output and input by calling BlockWriter.next() and BlockReader.next().
This process continues until there are no more rows of data to be read (BlockReader.next() returns false).
You must make sure that processBlock() reads all of the rows in its input and outputs a single value for each row. Failure to do so can corrupt the data structures that Vertica reads to get the output of your UDSF. The only exception to this rule is if your processBlock() function reports an error back to Vertica (see Handling errors). In that case, Vertica does not attempt to read the incomplete result set generated by the UDSF.
Setting up and tearing down
The ScalarFunction class defines two additional methods that you can optionally implement to allocate and free resources: setup() and destroy(). You should use these methods to allocate and deallocate resources that you do not allocate through the UDx API (see Allocating resources for UDxs for details).
Notes
While the name you choose for your ScalarFunction subclass does not have to match the name of the SQL function you will later assign to it, Vertica considers making the names the same a best practice.
Do not assume that your function will be called from the same thread that instantiated it.
The same instance of your ScalarFunction subclass can be called on to process multiple blocks of data.
The rows of input sent to processBlock() are not guaranteed to be any particular order.
Writing too many output rows can cause Vertica to emit an out-of-bounds error.
Implement the Main function API to define a scalar function:
FunctionName<-function(input.data.frame,parameters.data.frame){# Computations# The function must return a data frame.return(output.data.frame)}
2 - ScalarFunctionFactory class
The ScalarFunctionFactory class tells Vertica metadata about your UDSF: its number of parameters and their data types, as well as the data type of its return value.
The ScalarFunctionFactory class tells Vertica metadata about your UDSF: its number of parameters and their data types, as well as the data type of its return value. It also instantiates a subclass of ScalarFunction.
Methods
You must implement the following methods in your ScalarFunctionFactory subclass:
createScalarFunction() instantiates a ScalarFunction subclass. If writing in C++, you can call the vt_createFuncObj macro with the name of the ScalarFunction subclass. This macro takes care of allocating and instantiating the class for you.
getPrototype() tells Vertica about the parameters and return type(s) for your UDSF. In addition to a ServerInterface object, this method gets two ColumnTypes objects. All you need to do in this function is to call class functions on these two objects to build the list of parameters and the return value type(s). If you return more than one value, the results are packaged into a ROW type.
After defining your factory class, you need to call the RegisterFactory macro. This macro instantiates a member of your factory class, so Vertica can interact with it and extract the metadata it contains about your UDSF.
Declaring return values
If your function returns a sized column (a return data type whose length can vary, such as a VARCHAR), a value that requires precision, or more than one value, you must implement getReturnType(). This method is called by Vertica to find the length or precision of the data being returned in each row of the results. The return value of this method depends on the data type your processBlock() method returns:
CHAR, (LONG) VARCHAR, BINARY, and (LONG) VARBINARY return the maximum length.
NUMERIC types specify the precision and scale.
TIME and TIMESTAMP values (with or without timezone) specify precision.
INTERVAL YEAR TO MONTH specifies range.
INTERVAL DAY TO SECOND specifies precision and range.
ARRAY types specify the maximum number of elements.
If your UDSF does not return one of these data types and returns a single value, it does not need a getReturnType() method.
The input to the getReturnType() method is a SizedColumnTypes object that contains the input argument types along with their lengths. This object will be passed to an instance of your processBlock() function. Your implementation of getReturnType() must extract the data types and lengths from this input and determine the length or precision of the output rows. It then saves this information in another instance of the SizedColumnTypes class.
Normally, Vertica calls your UDSF for every row of data in the query.
Normally, Vertica calls your UDSF for every row of data in the query. There are some cases where Vertica can avoid executing your UDSF. You can tell Vertica when it can skip calling your function and just supply a return value itself by changing your function's volatility and strictness settings.
Your function's volatility indicates whether it always returns the same output value when passed the same arguments. Depending on its behavior, Vertica can cache the arguments and the return value. If the user calls the UDSF with the same set of arguments, Vertica returns the cached value instead of calling your UDSF.
Your function's strictness indicates how it reacts to NULL arguments. If it always returns NULL when any argument is NULL, Vertica can just return NULL without having to call the function. This optimization also saves you work, because you do not need to test for and handle null arguments in your UDSF code.
You indicate the volatility and null handling of your function by setting the vol and strict fields in your ScalarFunctionFactory class's constructor.
Volatility settings
To indicate your function's volatility, set the vol field to one of the following values:
Value
Description
VOLATILE
Repeated calls to the function with the same arguments always result in different values. Vertica always calls volatile functions for each invocation.
IMMUTABLE
Calls to the function with the same arguments always results in the same return value.
STABLE
Repeated calls to the function with the same arguments within the same statement returns the same output. For example, a function that returns the current user name is stable because the user cannot change within a statement. The user name could change between statements.
DEFAULT_VOLATILITY
The default volatility. This is the same as VOLATILE.
The following example demonstrates setting the Add2IntsFactory's vol field to IMMUTABLE to tell Vertica it can cache the arguments and return value.
publicclassAdd2IntsFactoryextendsScalarFunctionFactory{@OverridepublicvoidgetPrototype(ServerInterfacesrvInterface,ColumnTypesargTypes,ColumnTypesreturnType){argTypes.addInt();argTypes.addInt();returnType.addInt();}@OverridepublicScalarFunctioncreateScalarFunction(ServerInterfacesrvInterface){returnnewAdd2Ints();}// Class constructorpublicAdd2IntsFactory(){// Tell Vertica that the same set of arguments will always result in the// same return value.vol=volatility.IMMUTABLE;}}
Null input behavior
To indicate how your function reacts to NULL input, set the strictness field to one of the following values.
Value
Description
CALLED_ON_NULL_INPUT
The function must be called, even if one or more arguments are NULL.
RETURN_NULL_ON_NULL_INPUT
The function always returns a NULL value if any of its arguments are NULL.
STRICT
A synonym for RETURN_NULL_ON_NULL_INPUT
DEFAULT_STRICTNESS
The default strictness setting. This is the same as CALLED_ON_NULL_INPUT.
Example
The following C++ example demonstrates setting the null behavior of Add2ints so Vertica does not call the function with NULL values.
When evaluating a query, Vertica can take advantage of available information about the ranges of values.
When evaluating a query, Vertica can take advantage of available information about the ranges of values. For example, if data is partitioned and a query restricts output by the partitioned value, Vertica can ignore partitions that cannot possibly contain data that would satisfy the query. Similarly, for a scalar function, Vertica can skip processing rows in the data where the value returned from the function cannot possibly affect the results.
Consider a table with millions of rows of data on customer orders and a scalar function that computes the total price paid for everything in an order. A query uses a WHERE clause to restrict results to orders above a given value. A scalar function is called on a block of data; if no rows within that block could produce the target value, skipping the processing of the block could improve query performance.
A scalar function written in C++ can implement the getOutputRange method. Before calling processBlock, Vertica calls getOutputRange to determine the minimum and maximum return values from this block given the input ranges. It then decides whether to call processBlock to perform the computations.
The Add2Ints example implements this function. The minimum output value is the sum of the smallest values of each of the two inputs, and the maximum output is the sum of the largest values of each of the inputs. This function does not consider individual rows. Consider the following inputs:
a | b
------+------
21 | 92
500 | 19
111 | 11
The smallest values of the two inputs are 21 and 11, so the function reports 32 as the low end of the output range. The largest input values are 500 and 92, so it reports 592 as the high end of the output range. 592 is larger than the value returned for any of the input rows and 32 is smaller than any row's return value.
The purpose of getOutputRange is to quickly eliminate calls where outputs would definitely be out of range. For example, if the query included "WHERE Add2Ints(a,b) > 600", this block of data could be skipped. There can still be cases where, after calling getOutputRange, processBlock returns no results. If the query included "WHERE Add2Ints(a,b) > 500", getOutputRange would not eliminate this block of data.
Add2Ints implements getOutputRange as follows:
/*
* This method computes the output range for this scalar function from
* the ranges of its inputs in a single invocation.
*
* The input ranges are retrieved via inRange
* The output range is returned via outRange
*/virtualvoidgetOutputRange(Vertica::ServerInterface&srvInterface,Vertica::ValueRangeReader&inRange,Vertica::ValueRangeWriter&outRange){if(inRange.hasBounds(0)&&inRange.hasBounds(1)){// Input ranges have bounds definedif(inRange.isNull(0)||inRange.isNull(1)){// At least one range has only NULL values.// Output range can only have NULL values.outRange.setNull();outRange.setHasBounds();return;}else{// Compute output rangeconstvint&a1LoBound=inRange.getIntRefLo(0);constvint&a2LoBound=inRange.getIntRefLo(1);outRange.setIntLo(a1LoBound+a2LoBound);constvint&a1UpBound=inRange.getIntRefUp(0);constvint&a2UpBound=inRange.getIntRefUp(1);outRange.setIntUp(a1UpBound+a2UpBound);}}else{// Input ranges are unbounded. No output range can be definedreturn;}if(!inRange.canHaveNulls(0)&&!inRange.canHaveNulls(1)){// There cannot be NULL values in the output rangeoutRange.setCanHaveNulls(false);}// Let Vertica know that the output range is boundedoutRange.setHasBounds();}
If getOutputRange produces an error, Vertica issues a warning and does not call the method again for the current query.
5 - C++ example: Add2Ints
The following example shows a basic subclass of ScalarFunction called Add2ints.
The following example shows a basic subclass of ScalarFunction called Add2ints. As the name implies, it adds two integers together, returning a single integer result.
For the complete source code, see /opt/vertica/sdk/examples/ScalarFunctions/Add2Ints.cpp. Java and Python versions of this UDx are included in /opt/vertica/sdk/examples.
Loading and using the example
Use CREATE LIBRARY to load the library containing the function, and then use CREATE FUNCTION (scalar) to declare the function as in the following example:
=> CREATE LIBRARY ScalarFunctions AS '/home/dbadmin/examples/ScalarFunctions.so';
=> CREATE FUNCTION add2ints AS LANGUAGE 'C++' NAME 'Add2IntsFactory' LIBRARY ScalarFunctions;
The following example shows how to use this function:
=> SELECT Add2Ints(27,15);
Add2ints
----------
42
(1 row)
=> SELECT * FROM MyTable;
a | b
-----+----
7 | 0
12 | 2
12 | 6
18 | 9
1 | 1
58 | 4
450 | 15
(7 rows)
=> SELECT * FROM MyTable WHERE Add2ints(a, b) > 20;
a | b
-----+----
18 | 9
58 | 4
450 | 15
(3 rows)
Function implementation
A scalar function does its computation in the processBlock method:
classAdd2Ints:publicScalarFunction{public:/*
* This method processes a block of rows in a single invocation.
*
* The inputs are retrieved via argReader
* The outputs are returned via resWriter
*/virtualvoidprocessBlock(ServerInterface&srvInterface,BlockReader&argReader,BlockWriter&resWriter){try{// While we have inputs to processdo{if(argReader.isNull(0)||argReader.isNull(1)){resWriter.setNull();}else{constvinta=argReader.getIntRef(0);constvintb=argReader.getIntRef(1);resWriter.setInt(a+b);}resWriter.next();}while(argReader.next());}catch(std::exception&e){// Standard exception. Quit.vt_report_error(0,"Exception while processing block: [%s]",e.what());}}// ...};
Implementing getOutputRange, which is optional, allows your function to skip rows where the result would not be within a target range. For example, if a WHERE clause restricts the query results to those in a certain range, calling the function for cases that could not possibly be in that range is unnecessary.
/*
* This method computes the output range for this scalar function from
* the ranges of its inputs in a single invocation.
*
* The input ranges are retrieved via inRange
* The output range is returned via outRange
*/virtualvoidgetOutputRange(Vertica::ServerInterface&srvInterface,Vertica::ValueRangeReader&inRange,Vertica::ValueRangeWriter&outRange){if(inRange.hasBounds(0)&&inRange.hasBounds(1)){// Input ranges have bounds definedif(inRange.isNull(0)||inRange.isNull(1)){// At least one range has only NULL values.// Output range can only have NULL values.outRange.setNull();outRange.setHasBounds();return;}else{// Compute output rangeconstvint&a1LoBound=inRange.getIntRefLo(0);constvint&a2LoBound=inRange.getIntRefLo(1);outRange.setIntLo(a1LoBound+a2LoBound);constvint&a1UpBound=inRange.getIntRefUp(0);constvint&a2UpBound=inRange.getIntRefUp(1);outRange.setIntUp(a1UpBound+a2UpBound);}}else{// Input ranges are unbounded. No output range can be definedreturn;}if(!inRange.canHaveNulls(0)&&!inRange.canHaveNulls(1)){// There cannot be NULL values in the output rangeoutRange.setCanHaveNulls(false);}// Let Vertica know that the output range is boundedoutRange.setHasBounds();}
Factory implementation
The factory instantiates a member of the class (createScalarFunction), and also describes the function's inputs and outputs (getPrototype):
classAdd2IntsFactory:publicScalarFunctionFactory{// return an instance of Add2Ints to perform the actual addition.virtualScalarFunction*createScalarFunction(ServerInterface&interface){returnvt_createFuncObject<Add2Ints>(interface.allocator);}// This function returns the description of the input and outputs of the// Add2Ints class's processBlock function. It stores this information in// two ColumnTypes objects, one for the input parameters, and one for// the return value.virtualvoidgetPrototype(ServerInterface&interface,ColumnTypes&argTypes,ColumnTypes&returnType){argTypes.addInt();argTypes.addInt();// Note that ScalarFunctions *always* return a single value.returnType.addInt();}};
The RegisterFactory macro
Use the RegisterFactory macro to register a UDx. This macro instantiates the factory class and makes the metadata it contains available for Vertica to access. To call this macro, pass it the name of your factory class:
RegisterFactory(Add2IntsFactory);
6 - Python example: currency_convert
The currency_convert scalar function reads two values from a table, a currency and a value.
The currency_convert scalar function reads two values from a table, a currency and a value. It then converts the item's value to USD, returning a single float result.
importvertica_sdkimportdecimalrates2USD={'USD':1.000,'EUR':0.89977,'GBP':0.68452,'INR':67.0345,'AUD':1.39187,'CAD':1.30335,'ZAR':15.7181,'XXX':-1.0000}classcurrency_convert(vertica_sdk.ScalarFunction):"""Converts a money column to another currency
Returns a value in USD.
"""def__init__(self):passdefsetup(self,server_interface,col_types):passdefprocessBlock(self,server_interface,block_reader,block_writer):while(True):currency=block_reader.getString(0)try:rate=decimal.Decimal(rates2USD[currency])exceptKeyError:server_interface.log("ERROR: {} not in dictionary.".format(currency))# Scalar functions always need a value to move forward to the# next input row. Therefore, we need to assign it a value to# move beyond the error.currency='XXX'rate=decimal.Decimal(rates2USD[currency])starting_value=block_reader.getNumeric(1)converted_value=decimal.Decimal(starting_value/rate)block_writer.setNumeric(converted_value)block_writer.next()ifnotblock_reader.next():breakdefdestroy(self,server_interface,col_types):passclasscurrency_convert_factory(vertica_sdk.ScalarFunctionFactory):defcreateScalarFunction(self,srv):returncurrency_convert()defgetPrototype(self,srv_interface,arg_types,return_type):arg_types.addVarchar()arg_types.addNumeric()return_type.addNumeric()defgetReturnType(self,srv_interface,arg_types,return_type):return_type.addNumeric(9,4)
Load the function and library
Create the library and the function.
=> CREATE LIBRARY pylib AS '/home/dbadmin/python_udx/currency_convert/currency_convert.py' LANGUAGE 'Python';
CREATE LIBRARY
=> CREATE FUNCTION currency_convert AS LANGUAGE 'Python' NAME 'currency_convert_factory' LIBRARY pylib fenced;
CREATE FUNCTION
Querying data with the function
The following query shows how you can run a query with the UDSF.
=> SELECT product, currency_convert(currency, value) AS cost_in_usd
FROM items;
product | cost_in_usd
--------------+-------------
Shoes | 133.4008
Soccer Ball | 110.2817
Coffee | 13.5190
Surfboard | 176.2593
Hockey Stick | 76.7177
Car | 17000.0000
Software | 10.4424
Hamburger | 7.5000
Fish | 130.4272
Cattle | 269.2367
(10 rows)
7 - Python example: validate_url
The validate_url scalar function reads a string from a table, a URL.
The validate_url scalar function reads a string from a table, a URL. It then validates if the URL is responsive, returning a status code or a string indicating the attempt failed.
importvertica_sdkimporturllib.requestimporttimeclassvalidate_url(vertica_sdk.ScalarFunction):"""Validates HTTP requests.
Returns the status code of a webpage. Pages that cannot be accessed return
"Failed to load page."
"""def__init__(self):passdefsetup(self,server_interface,col_types):passdefprocessBlock(self,server_interface,arg_reader,res_writer):# Writes a string to the UDx log file.server_interface.log("Validating webpage accessibility - UDx")while(True):url=arg_reader.getString(0)try:status=urllib.request.urlopen(url).getcode()# Avoid overwhelming web servers -- be nice.time.sleep(2)except(ValueError,urllib.error.HTTPError,urllib.error.URLError):status='Failed to load page'res_writer.setString(str(status))res_writer.next()ifnotarg_reader.next():# Stop processing when there are no more input rows.breakdefdestroy(self,server_interface,col_types):passclassvalidate_url_factory(vertica_sdk.ScalarFunctionFactory):defcreateScalarFunction(self,srv):returnvalidate_url()defgetPrototype(self,srv_interface,arg_types,return_type):arg_types.addVarchar()return_type.addChar()defgetReturnType(self,srv_interface,arg_types,return_type):return_type.addChar(20)
Load the function and library
Create the library and the function.
=> CREATE OR REPLACE LIBRARY pylib AS 'webpage_tester/validate_url.py' LANGUAGE 'Python';
=> CREATE OR REPLACE FUNCTION validate_url AS LANGUAGE 'Python' NAME 'validate_url_factory' LIBRARY pylib fenced;
Querying data with the function
The following query shows how you can run a query with the UDSF.
=> SELECT url, validate_url(url) AS url_status FROM webpages;
url | url_status
-----------------------------------------------+----------------------
http://www.vertica.com/documentation/vertica/ | 200
http://www.google.com/ | 200
http://www.mass.gov.com/ | Failed to load page
http://www.espn.com | 200
http://blah.blah.blah.blah | Failed to load page
http://www.vertica.com/ | 200
(6 rows)
8 - Python example: matrix multiplication
Python UDxs can accept and return complex types.
Python UDxs can accept and return complex types. The MatrixMultiply class multiplies input matrices and returns the resulting matrix product. These matrices are represented as two-dimensional arrays. In order to perform the matrix multiplication operation, the number of columns in the first input matrix must equal the number of rows in the second input matrix.
The complete source code is in /opt/vertica/sdk/examples/python/ScalarFunctions.py.
Loading and using the example
Load the library and create the function as follows:
=> CREATE OR REPLACE LIBRARY ScalarFunctions AS '/home/dbadmin/examples/python/ScalarFunctions.py' LANGUAGE 'Python';
=> CREATE FUNCTION MatrixMultiply AS LANGUAGE 'Python' NAME 'matrix_multiply_factory' LIBRARY ScalarFunctions;
You can create input matrices and then call the function, for example:
All Python UDxs must import the Vertica SDK library:
importvertica_sdk
Factory implementation
The getPrototype() method declares that the function arguments and return type must all be two-dimensional arrays, represented as arrays of integer arrays:
getReturnType() validates that the product matrix has the same number of rows as the first input matrix and the same number of columns as the second input matrix:
The processBlock() method is called with a BlockReader and a BlockWriter, named arg_reader and res_writer respectively. To access elements of the input arrays, the method uses ArrayReader instances. The arrays are nested, so an ArrayReader must be instantiated for both the outer and inner arrays. List comprehension simplifies the process of reading the input arrays into lists. The method performs the computation and then uses an ArrayWriter instance to construct the product matrix.
The SalesTaxCalculator scalar function reads a float and a varchar from a table, an item's price and the state abbreviation.
The SalesTaxCalculator scalar function reads a float and a varchar from a table, an item's price and the state abbreviation. It then uses the state abbreviation to find the sales tax rate from a list and calculates the item's price including the state's sales tax, returning the total cost of the item.
=> CREATE OR REPLACE LIBRARY rLib AS 'sales_tax_calculator.R' LANGUAGE 'R';
CREATE LIBRARY
=> CREATE OR REPLACE FUNCTION SalesTaxCalculator AS LANGUAGE 'R' NAME 'SalesTaxCalculatorFactory' LIBRARY rLib FENCED;
CREATE FUNCTION
Querying data with the function
The following query shows how you can run a query with the UDSF.
=> SELECT item, state_abbreviation,
price, SalesTaxCalculator(price, state_abbreviation) AS Price_With_Sales_Tax
FROM inventory;
item | state_abbreviation | price | Price_With_Sales_Tax
-------------+--------------------+-------+---------------------
Scarf | AZ | 6.88 | 7.53016
Software | MA | 88.31 | 96.655295
Soccer Ball | MS | 12.55 | 13.735975
Beads | LA | 0.99 | 1.083555
Baseball | TN | 42.42 | 46.42869
Cheese | WI | 20.77 | 22.732765
Coffee Mug | MA | 8.99 | 9.839555
Shoes | TN | 23.99 | 26.257055
(8 rows)
UDSF R code
SalesTaxCalculator<-function(input.data.frame){# Not a complete list of states in the USA, but enough to get the idea.state.sales.tax<-list(ma=0.0625,az=0.087,la=0.0891,tn=0.0945,wi=0.0543,ms=0.0707)for(state_abbreviationininput.data.frame[,2]){# Ensure state abbreviations are lowercase.lower_state<-tolower(state_abbreviation)# Check if the state is in our state.sales.tax list.if(is.null(state.sales.tax[[lower_state]])){stop("State is not in our small sample!")}else{sales.tax.rate<-state.sales.tax[[lower_state]]item.price<-input.data.frame[,1]# Calculate the price including sales tax.price.with.sales.tax<-(item.price)+(item.price*sales.tax.rate)}}return(price.with.sales.tax)}SalesTaxCalculatorFactory<-function(){list(name=SalesTaxCalculator,udxtype=c("scalar"),intype=c("float","varchar"),outtype=c("float"))}
10 - R example: kmeans
The KMeans_User scalar function reads any number of columns from a table, the observations.
The KMeans_User scalar function reads any number of columns from a table, the observations. It then uses the observations and the two parameters when applying the kmeans clustering algorithm to the data, returning an integer value associated with the cluster of the row.
=> CREATE OR REPLACE LIBRARY rLib AS 'kmeans.R' LANGUAGE 'R';
CREATE LIBRARY
=> CREATE OR REPLACE FUNCTION KMeans_User AS LANGUAGE 'R' NAME 'KMeans_UserFactory' LIBRARY rLib FENCED;
CREATE FUNCTION
Querying data with the function
The following query shows how you can run a query with the UDSF:
KMeans_User<-function(input.data.frame,parameters.data.frame){# Take the clusters and nstart parameters passed by the user and assign them# to variables in the function.if(is.null(parameters.data.frame[['clusters']])){stop("NULL value for clusters! clusters cannot be NULL.")}else{clusters.value<-parameters.data.frame[['clusters']]}if(is.null(parameters.data.frame[['nstart']])){stop("NULL value for nstart! nstart cannot be NULL.")}else{nstart.value<-parameters.data.frame[['nstart']]}# Apply the algorithm to the data.kmeans.clusters<-kmeans(input.data.frame[,1:length(input.data.frame)],clusters.value,nstart=nstart.value)final.output<-data.frame(kmeans.clusters$cluster)return(final.output)}KMeans_UserFactory<-function(){list(name=KMeans_User,udxtype=c("scalar"),# Since this is a polymorphic function the intype must be anyintype=c("any"),outtype=c("int"),parametertypecallback=KMeansParameters)}KMeansParameters<-function(){parameters<-list(datatype=c("int","int"),length=c("NA","NA"),scale=c("NA","NA"),name=c("clusters","nstart"))return(parameters)}
11 - C++ example: using complex types
UDxs can accept and return complex types.
UDxs can accept and return complex types. The ArraySlice example takes an array and two indices as inputs and returns an array containing only the values in that range. Because array elements can be of any type, the function is polymorphic.
The complete source code is in /opt/vertica/sdk/examples/ScalarFunctions/ArraySlice.cpp.
Loading and using the example
Load the library and create the function as follows:
=> CREATE OR REPLACE LIBRARY ScalarFunctions AS '/home/dbadmin/examplesUDSF.so';
=> CREATE FUNCTION ArraySlice AS
LANGUAGE 'C++' NAME 'ArraySliceFactory' LIBRARY ScalarFunctions;
Create some data and call the function on it as follows:
Because the function is polymorphic, getPrototype() declares that the inputs and outputs can be of any type, and type enforcement must be done elsewhere:
voidgetPrototype(ServerInterface&srvInterface,ColumnTypes&argTypes,ColumnTypes&returnType)override{/*
* This is a polymorphic function that accepts any array
* and returns an array of the same type
*/argTypes.addAny();returnType.addAny();}
The factory validates input types and determines the return type in getReturnType():
voidgetReturnType(ServerInterface&srvInterface,constSizedColumnTypes&argTypes,SizedColumnTypes&returnType)override{/*
* Three arguments: (array, slicebegin, sliceend)
* Validate manually since the prototype accepts any arguments.
*/if(argTypes.size()!=3){vt_report_error(0,"Three arguments (array, slicebegin, sliceend) expected");}elseif(!argTypes[0].getType().isArrayType()){vt_report_error(1,"Argument 1 is not an array");}elseif(!argTypes[1].getType().isInt()){vt_report_error(2,"Argument 2 (slicebegin) is not an integer)");}elseif(!argTypes[2].getType().isInt()){vt_report_error(3,"Argument 3 (sliceend) is not an integer)");}/* return type is the same as the array arg type, copy it over */returnType.push_back(argTypes[0]);}
Function implementation
The processBlock() method is called with a BlockReader and a BlockWriter. The first argument is an array. To access elements of the array, the method uses an ArrayReader. Similarly, it uses an ArrayWriter to construct the output.
voidprocessBlock(ServerInterface&srvInterface,BlockReader&argReader,BlockWriter&resWriter)override{do{if(argReader.isNull(0)||argReader.isNull(1)||argReader.isNull(2)){resWriter.setNull();}else{Array::ArrayReaderargArray=argReader.getArrayRef(0);constvintslicebegin=argReader.getIntRef(1);constvintsliceend=argReader.getIntRef(2);Array::ArrayWriteroutArray=resWriter.getArrayRef(0);if(slicebegin<sliceend){for(inti=0;i<slicebegin&&argArray->hasData();i++){argArray->next();}for(inti=slicebegin;i<sliceend&&argArray->hasData();i++){outArray->copyFromInput(*argArray);outArray->next();argArray->next();}}outArray.commit();/* finalize the written array elements */}resWriter.next();}while(argReader.next());}
12 - C++ example: returning multiple values
When writing a UDSF, you can specify more than one return value.
When writing a UDSF, you can specify more than one return value. If you specify multiple values, Vertica packages them into a single ROW as a return value. You can query fields in the ROW or the entire ROW.
The following example implements a function named div (division) that returns two integers, the quotient and the remainder.
This example shows one way to return a ROW from a UDSF. Returning multiple values and letting Vertica build the ROW is convenient when inputs and outputs are all of primitive types. You can also work directly with the complex types, as described in Complex Types as Arguments and illustrated in C++ example: using complex types.
Loading and using the example
Load the library and create the function as follows:
=> CREATE OR REPLACE LIBRARY ScalarFunctions AS '/home/dbadmin/examplesUDSF.so';
=> CREATE FUNCTION div AS
LANGUAGE 'C++' NAME 'DivFactory' LIBRARY ScalarFunctions;
Create some data and call the function on it as follows:
=> CREATE TABLE D (a INTEGER, b INTEGER);
COPY D FROM STDIN DELIMITER ',';
10,0
10,1
10,2
10,3
10,4
10,5
\.
=> SELECT a, b, Div(a, b), (Div(a, b)).quotient, (Div(a, b)).remainder FROM D;
a | b | Div | quotient | remainder
----+---+------------------------------------+----------+-----------
10 | 0 | {"quotient":null,"remainder":null} | |
10 | 1 | {"quotient":10,"remainder":0} | 10 | 0
10 | 2 | {"quotient":5,"remainder":0} | 5 | 0
10 | 3 | {"quotient":3,"remainder":1} | 3 | 1
10 | 4 | {"quotient":2,"remainder":2} | 2 | 2
10 | 5 | {"quotient":2,"remainder":0} | 2 | 0
(6 rows)
Factory implementation
The factory declares the two return values in getPrototype() and in getReturnType(). The factory is otherwise unremarkable.
13 - C++ example: calling a UDSF from a check constraint
This example shows you the C++ code needed to create a UDSF that can be called by a check constraint.
This example shows you the C++ code needed to create a UDSF that can be called by a check constraint. The name of the sample function is LargestSquareBelow. The sample function determines the largest number whose square is less than the number in the subject column. For example, if the number in the column is 1000, the largest number whose square is less than 1000 is 31 (961).
Important
A UDSF used within a check constraint must be immutable, and the constraint must handle null values properly. Otherwise, the check constraint might not work as you intended. In addition, Vertica evaluates the predicate of an enabled check constraint on every row that is loaded or updated, so consider performance in writing your function.
The following example shows how you can create and load a library named MySqLib, using CREATE LIBRARY. Adjust the library path in this example to the absolute path and file name for the location where you saved the shared object LargestSquareBelow.
Create the library:
=> CREATE OR REPLACE LIBRARY MySqLib AS '/home/dbadmin/LargestSquareBelow.so';
After you create and load the library, add the function to the catalog using the CREATE FUNCTION (scalar) statement:
=> CREATE OR REPLACE FUNCTION largestSqBelow AS LANGUAGE 'C++' NAME 'LargestSquareBelowInfo' LIBRARY MySqLib;
This ScalarFunction implementation does the processing work for a UDSF that determines the largest number whose square is less than the number input.
#include"Vertica.h"/*
* ScalarFunction implementation for a UDSF that
* determines the largest number whose square is less than
* the number input.
*/classLargestSquareBelow:publicVertica::ScalarFunction{public:/*
* This function does all of the actual processing for the UDSF.
* The inputs are retrieved via arg_reader
* The outputs are returned via arg_writer
*
*/virtualvoidprocessBlock(Vertica::ServerInterface&srvInterface,Vertica::BlockReader&arg_reader,Vertica::BlockWriter&res_writer){if(arg_reader.getNumCols()!=1)vt_report_error(0,"Function only accept 1 argument, but %zu provided",arg_reader.getNumCols());// While we have input to processdo{// Read the input parameter by calling the// BlockReader.getIntRef class functionconstVertica::vinta=arg_reader.getIntRef(0);Vertica::vintres;//Determine the largest square below the numberif((a!=Vertica::vint_null)&&(a>0)){res=(Vertica::vint)sqrt(a-1);}elseres=Vertica::vint_null;//Call BlockWriter.setInt to store the output value,//which is the largest squareres_writer.setInt(res);//Write the row and advance to the next output rowres_writer.next();//Continue looping until there are no more input rows}while(arg_reader.next());}};
ScalarFunctionFactory implementation
This ScalarFunctionFactory implementation does the work of handling input and output, and marks the function as immutable (a requirement if you plan to use the UDSF within a check constraint).
classLargestSquareBelowInfo:publicVertica::ScalarFunctionFactory{//return an instance of LargestSquareBelow to perform the computation.virtualVertica::ScalarFunction*createScalarFunction(Vertica::ServerInterface&srvInterface)//Call the vt_createFuncObj to create the new LargestSquareBelow class instance.{returnVertica::vt_createFuncObject<LargestSquareBelow>(srvInterface.allocator);}/*
* This function returns the description of the input and outputs of the
* LargestSquareBelow class's processBlock function. It stores this information in
* two ColumnTypes objects, one for the input parameter, and one for
* the return value.
*/virtualvoidgetPrototype(Vertica::ServerInterface&srvInterface,Vertica::ColumnTypes&argTypes,Vertica::ColumnTypes&returnType){// Takes one int as input, so adds int to the argTypes objectargTypes.addInt();// Returns a single int, so add a single int to the returnType object.// ScalarFunctions always return a single value.returnType.addInt();}public:// the function cannot be called within a check constraint unless the UDx author// certifies that the function is immutable:LargestSquareBelowInfo(){vol=Vertica::IMMUTABLE;}};
The RegisterFactory macro
Use the RegisterFactory macro to register a ScalarFunctionFactory subclass. This macro instantiates the factory class and makes the metadata it contains available for Vertica to access. To call this macro, pass it the name of your factory class.