Extending Vertica
You can extend Vertica to perform new operations or handle new types of data. There are several types of extensions:

- External procedures: external scripts or programs that are installed on a host in your database cluster.
- User-defined SQL functions: frequently used SQL expressions, which help you simplify and standardize your SQL scripts.
- Stored procedures: SQL procedures that are stored in the database (as opposed to external procedures). Stored procedures can communicate and interact with your database directly to perform maintenance, execute queries, and update tables.
- User-defined extensions (UDxs): functions or data-load steps written in the C++, Python, Java, or R programming languages. They are useful when the type of data processing you want to perform is difficult or slow using SQL. User-defined extensions explains how to use them, and Developing user-defined extensions (UDxs) explains how to create them.
1 - External procedures
Enterprise Mode only
An external procedure is a script or executable program on a host in your database cluster that you can call from within Vertica. External procedures cannot communicate back to Vertica.
To implement an external procedure:

- Create an external procedure executable file. See Requirements for external procedures.
- Enable the set-user-ID (SUID), user execute, and group execute attributes for the file. Either the file must be readable by the dbadmin, or the file owner's password must be given with the Administration Tools install_procedure command.
- Install the external procedure executable file.
- Create the external procedure in Vertica.
After a procedure is created in Vertica, you can execute or drop it, but you cannot alter it.
1.1 - Requirements for external procedures
Enterprise Mode only
External procedures have requirements regarding their attributes, where you store them, and how you handle their output. You should also be cognizant of their resource usage.
Procedure file attributes
The procedure file cannot be owned by root. It must have the set-user-ID (SUID), user execute, and group execute attributes set. If it is not readable by the Linux database administrator user, then the owner's password will have to be specified when installing the procedure.
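For example, assuming a script at /home/dbadmin/helloplanet.sh owned by the dbadmin user (the path is illustrative), you might set these attributes as follows:

$ chown dbadmin /home/dbadmin/helloplanet.sh
$ chmod 4750 /home/dbadmin/helloplanet.sh    # 4 = SUID bit, 7 = user rwx, 5 = group r-x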
Handling procedure output
Vertica does not provide a facility for handling procedure output. Therefore, you must make your own arrangements for handling procedure output, which should include writing error, logging, and program information directly to files that you manage.
Handling resource usage
The Vertica resource manager is unaware of resources used by external procedures. Additionally, Vertica is intended to be the only major process running on your system. If your external procedure is resource intensive, it could affect the performance and stability of Vertica. Consider the types of external procedures you create and when you run them. For example, you might run a resource-intensive procedure during off hours.
Sample procedure file
#!/bin/bash
echo "hello planet argument: $1" >> /tmp/myprocedure.log
1.2 - Installing external procedure executable files
Enterprise Mode only
To install an external procedure, use the Administration Tools through either the menu or the command line.
- Run the Administration Tools:
  $ /opt/vertica/bin/adminTools
- On the AdminTools Main Menu, click Configuration Menu, and then click OK.
- On the Configuration Menu, click Install External Procedure and then click OK.
- Select the database on which you want to install the external procedure.
- Either select the file to install or manually type the complete file path, and then click OK.
- If you are not the superuser, you are prompted to enter your password and click OK.
  The Administration Tools automatically create the database-name/procedures directory on each node in the database and install the external procedure in these directories for you.
- Click OK in the dialog that indicates that the installation was successful.
Command line
If you use the command line, be sure to specify the full path to the procedure file and the password of the Linux user who owns the procedure file. For example:
$ admintools -t install_procedure -d vmartdb -f /scratch/helloworld.sh -p ownerpassword
Installing external procedure...
External procedure installed
After you have installed an external procedure, you need to make Vertica aware of it. To do so, use the CREATE PROCEDURE statement, but review Creating external procedures first.
1.3 - Creating external procedures
Enterprise Mode only
After you install an external procedure, you must make Vertica aware of it with CREATE PROCEDURE (external).
Only superusers can create an external procedure, and by default, only they have execute privileges. However, superusers can grant users and roles EXECUTE privilege on the external procedure.
After you create a procedure, its metadata is stored in system table USER_PROCEDURES. Users can see only those procedures that they have been granted the privilege to execute.
Example
The following example creates a procedure named helloplanet for the external procedure file helloplanet.sh. This file accepts one VARCHAR argument. The sample code is provided in Requirements for external procedures.
=> CREATE PROCEDURE helloplanet(arg1 VARCHAR) AS 'helloplanet.sh' LANGUAGE 'external'
USER 'dbadmin';
The next example creates a procedure named proctest for the script copy_vertica_database.sh. This script copies a database from one cluster to another; it is included in the server RPM located in the directory /opt/vertica/scripts.
=> CREATE PROCEDURE proctest(shosts VARCHAR, thosts VARCHAR, dbdir VARCHAR)
AS 'copy_vertica_database.sh' LANGUAGE 'external' USER 'dbadmin';
Overloading external procedures
You can create multiple external procedures with the same name if they have different signatures—that is, if they accept different sets of arguments. For example, you can overload the helloplanet external procedure to also accept an integer value:
=> CREATE PROCEDURE helloplanet(arg1 INT) AS 'helloplanet.sh' LANGUAGE 'external'
USER 'dbadmin';
After executing this statement, the database catalog stores two external procedures named helloplanet—one that accepts a VARCHAR argument and one that accepts an integer. When you call the external procedure, Vertica evaluates the arguments in the procedure call to determine which procedure to call.
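For example, with both versions installed, each call resolves to the signature that matches its argument type:

=> SELECT helloplanet('earthlings'); -- runs the VARCHAR version
=> SELECT helloplanet(1);            -- runs the INT version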
1.4 - Executing external procedures
Enterprise Mode only
After you define a procedure using the CREATE PROCEDURE (external) statement, you can use it as a meta command in a SELECT statement. Vertica does not support using procedures in more complex statements or in expressions.
The following example runs a procedure named helloplanet:
=> SELECT helloplanet('earthlings');
helloplanet
-------------
0
(1 row)
The following example runs a procedure named proctest. This procedure references the copy_vertica_database.sh script that copies a database from one cluster to another. It is installed by the server RPM in the /opt/vertica/scripts directory.
=> SELECT proctest(
'-s qa01',
'-t rbench1',
'-D /scratch_b/qa/PROC_TEST' );
Note
External procedures have no direct access to database data. Use ODBC or JDBC for this purpose.
Procedures are executed on the initiating node. Vertica runs the procedure by forking and executing the program. Each procedure argument is passed to the executable file as a string. The parent fork process waits until the child process ends.
If the child process exits with status 0, Vertica reports that the operation took place by returning one row, as shown in the helloplanet example. If the child process exits with any other status, Vertica reports an error like the following:
ERROR 7112: Procedure reported: Procedure execution error: exit status = code
To stop execution, cancel the process by sending a cancel command (for example, CTRL+C) through the client. If the procedure program exits with an error, an error message with the exit status is returned.
Permissions
To execute an external procedure, the user needs:

- EXECUTE privilege on the procedure
- USAGE privilege on the schema that contains the procedure
1.5 - Dropping external procedures
Enterprise Mode only
Only a superuser can drop an external procedure. To drop the definition for an external procedure from Vertica, use the DROP PROCEDURE (external) statement. Only the reference to the procedure is removed; the external file remains in the <database>/procedures directory on each node in the database.
Note
The definition Vertica uses for a procedure cannot be altered; it can only be dropped.
Example
=> DROP PROCEDURE helloplanet(arg1 varchar);
2 - User-defined SQL functions
User-defined SQL functions let you define and store commonly used SQL expressions as functions. They are useful for executing complex queries and combining Vertica built-in functions; you simply call the function by the name you assigned in your query.
A user-defined SQL function can be used anywhere in a query where an ordinary SQL expression can be used, except in a table partition clause or the projection segmentation clause.
For syntax and parameters for the commands and system table discussed in this section, see the following topics:

- CREATE FUNCTION (SQL)
- ALTER FUNCTION (scalar)
- DROP FUNCTION
- USER_FUNCTIONS
2.1 - Creating user-defined SQL functions
A user-defined SQL function can be used anywhere in a query where an ordinary SQL expression can be used, except in the table partition clause or the projection segmentation clause.
To create a SQL function, the user must have CREATE privileges on the schema. To use a SQL function, the user must have USAGE privileges on the schema and EXECUTE privileges on the defined function.
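For example, a sketch granting those privileges on the public schema to a hypothetical user Fred:

=> GRANT CREATE ON SCHEMA public TO Fred; -- lets Fred create functions in public
=> GRANT USAGE ON SCHEMA public TO Fred;  -- lets Fred reference objects in public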
The following statement creates a SQL function called myzeroifnull that accepts an INTEGER argument and returns an INTEGER result.
=> CREATE FUNCTION myzeroifnull(x INT) RETURN INT
AS BEGIN
RETURN (CASE WHEN (x IS NOT NULL) THEN x ELSE 0 END);
END;
You can use the new SQL function (myzeroifnull) anywhere you use an ordinary SQL expression. For example, create a simple table:
=> CREATE TABLE tabwnulls(col1 INT);
=> INSERT INTO tabwnulls VALUES(1);
=> INSERT INTO tabwnulls VALUES(NULL);
=> INSERT INTO tabwnulls VALUES(0);
=> SELECT * FROM tabwnulls;
 col1
------
    1

    0
(3 rows)
Use the myzeroifnull function in a SELECT statement, where the function calls col1 from table tabwnulls:
=> SELECT myzeroifnull(col1) FROM tabwnulls;
myzeroifnull
--------------
1
0
0
(3 rows)
Use the myzeroifnull function in the GROUP BY clause:
=> SELECT COUNT(*) FROM tabwnulls GROUP BY myzeroifnull(col1);
count
-------
2
1
(2 rows)
If you want to change a user-defined SQL function's body, use the CREATE OR REPLACE syntax. The following command modifies the CASE expression:
=> CREATE OR REPLACE FUNCTION myzeroifnull(x INT) RETURN INT
AS BEGIN
RETURN (CASE WHEN (x IS NULL) THEN 0 ELSE x END);
END;
To see how this information is stored in the Vertica catalog, see Viewing Information About SQL Functions.
2.2 - Altering and dropping user-defined SQL functions
Vertica allows multiple functions to share the same name with different argument types. Therefore, if you try to alter or drop a SQL function without specifying the argument data type, the system returns an error message to prevent you from dropping the wrong function:
=> DROP FUNCTION myzeroifnull();
ROLLBACK: Function with specified name and parameters does not exist: myzeroifnull
Note
Only a superuser or owner can alter or drop a SQL Function.
Altering a user-defined SQL function
The ALTER FUNCTION (scalar) command lets you assign a new name to a user-defined function, as well as move it to a different schema.
In the previous topic, you created a SQL function called myzeroifnull. The following command renames the myzeroifnull function to zerowhennull:
=> ALTER FUNCTION myzeroifnull(x INT) RENAME TO zerowhennull;
ALTER FUNCTION
This next command moves the renamed function into a new schema called macros:
=> ALTER FUNCTION zerowhennull(x INT) SET SCHEMA macros;
ALTER FUNCTION
Dropping a SQL function
The DROP FUNCTION command drops a SQL function from the Vertica catalog.
As with ALTER FUNCTION, you must specify the argument data type, or the system returns the following error message:
=> DROP FUNCTION zerowhennull();
ROLLBACK: Function with specified name and parameters does not exist: zerowhennull
Specify the argument type:
=> DROP FUNCTION macros.zerowhennull(x INT);
DROP FUNCTION
Vertica does not check for dependencies, so if you drop a SQL function that other objects reference (such as views or other SQL functions), Vertica returns an error only when those objects are used, not when the function is dropped.
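For example, in this sketch (the view name is illustrative), the DROP succeeds, and the error surfaces only when the dependent view is queried:

=> CREATE VIEW v_no_nulls AS SELECT myzeroifnull(col1) FROM tabwnulls;
=> DROP FUNCTION myzeroifnull(x INT);
=> SELECT * FROM v_no_nulls; -- fails here, because the view references the dropped function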
Tip
To view a list of all user-defined SQL functions on which you have EXECUTE privileges (which also returns their argument types), query the V_CATALOG.USER_FUNCTIONS system table.
2.3 - Managing access to SQL functions
Before a user can execute a user-defined SQL function, he or she must have USAGE privileges on the schema and EXECUTE privileges on the defined function. Only a superuser or the function's owner can grant or revoke EXECUTE privileges on a function.
To grant EXECUTE privileges to user Fred on the myzeroifnull function:
=> GRANT EXECUTE ON FUNCTION myzeroifnull (x INT) TO Fred;
To revoke EXECUTE privileges from user Fred on the myzeroifnull function:
=> REVOKE EXECUTE ON FUNCTION myzeroifnull (x INT) FROM Fred;
2.4 - Viewing information about user-defined SQL functions
You can access information about user-defined SQL functions on which you have EXECUTE privileges. This information is available in the system table USER_FUNCTIONS and from the vsql meta-command \df.
To view all user-defined SQL functions on which you have EXECUTE privileges, query USER_FUNCTIONS:
=> SELECT * FROM USER_FUNCTIONS;
-[ RECORD 1 ]----------+---------------------------------------------------
schema_name | public
function_name | myzeroifnull
function_return_type | Integer
function_argument_type | x Integer
function_definition | RETURN CASE WHEN (x IS NOT NULL) THEN x ELSE 0 END
volatility | immutable
is_strict | f
If you want to change the body of a user-defined SQL function, use the CREATE OR REPLACE syntax. The following command modifies the CASE expression:
=> CREATE OR REPLACE FUNCTION myzeroifnull(x INT) RETURN INT
AS BEGIN
RETURN (CASE WHEN (x IS NULL) THEN 0 ELSE x END);
END;
Now when you query USER_FUNCTIONS, you can see the changes in the function_definition column:
=> SELECT * FROM USER_FUNCTIONS;
-[ RECORD 1 ]----------+---------------------------------------------------
schema_name | public
function_name | myzeroifnull
function_return_type | Integer
function_argument_type | x Integer
function_definition | RETURN CASE WHEN (x IS NULL) THEN 0 ELSE x END
volatility | immutable
is_strict | f
If you use CREATE OR REPLACE syntax to change only the argument name or argument type (or both), the system maintains both versions of the function. For example, the following command tells the function to accept and return a NUMERIC data type instead of an integer for the myzeroifnull function:
=> CREATE OR REPLACE FUNCTION myzeroifnull(z NUMERIC) RETURN NUMERIC
AS BEGIN
RETURN (CASE WHEN (z IS NULL) THEN 0 ELSE z END);
END;
Now query the USER_FUNCTIONS table, and you can see the second instance of myzeroifnull in Record 2, as well as the changes in the function_return_type, function_argument_type, and function_definition columns.
Note
Record 1 still holds the definition for the integer version of the myzeroifnull function:
=> SELECT * FROM USER_FUNCTIONS;
-[ RECORD 1 ]----------+------------------------------------------------------------
schema_name | public
function_name | myzeroifnull
function_return_type | Integer
function_argument_type | x Integer
function_definition | RETURN CASE WHEN (x IS NULL) THEN 0 ELSE x END
volatility | immutable
is_strict | f
-[ RECORD 2 ]----------+------------------------------------------------------------
schema_name | public
function_name | myzeroifnull
function_return_type | Numeric
function_argument_type | z Numeric
function_definition | RETURN (CASE WHEN (z IS NULL) THEN (0) ELSE z END)::numeric
volatility | immutable
is_strict | f
Because Vertica allows functions to share the same name with different argument types, you must specify the argument type when you alter or drop a function. If you do not, the system returns an error message:
=> DROP FUNCTION myzeroifnull();
ROLLBACK: Function with specified name and parameters does not exist: myzeroifnull
2.5 - Migrating built-in SQL functions
If you have built-in SQL functions from another RDBMS that do not map to a Vertica-supported function, you can migrate them into your Vertica database by using a user-defined SQL function.
The example scripts below show how to create user-defined functions for the following DB2 built-in functions:
- UCASE()
- LCASE()
- LOCATE()
- POSSTR()
UCASE()
This script creates a user-defined SQL function for the UCASE() function:
=> CREATE OR REPLACE FUNCTION UCASE (x VARCHAR)
RETURN VARCHAR
AS BEGIN
RETURN UPPER(x);
END;
LCASE()
This script creates a user-defined SQL function for the LCASE() function:
=> CREATE OR REPLACE FUNCTION LCASE (x VARCHAR)
RETURN VARCHAR
AS BEGIN
RETURN LOWER(x);
END;
LOCATE()
This script creates a user-defined SQL function for the LOCATE() function:
=> CREATE OR REPLACE FUNCTION LOCATE(a VARCHAR, b VARCHAR)
RETURN INT
AS BEGIN
RETURN POSITION(a IN b);
END;
POSSTR()
This script creates a user-defined SQL function for the POSSTR() function:
=> CREATE OR REPLACE FUNCTION POSSTR(a VARCHAR, b VARCHAR)
RETURN INT
AS BEGIN
RETURN POSITION(b IN a);
END;
3 - Stored procedures
You can condense complex database tasks and routines into stored procedures. Unlike external procedures, stored procedures live inside your database and can be executed from within it; this lets them communicate and interact with your database directly to perform maintenance, execute queries, and update tables.
Best practices
Many other databases are optimized for online transaction processing (OLTP), which focuses on frequent transactions. In contrast, Vertica is optimized for online analytical processing (OLAP), which instead focuses on storing and analyzing large amounts of data and delivering the fastest responses to the most complex queries on that data.
This architecture difference means that the recommended use cases and best practices for stored procedures in Vertica differ slightly from stored procedures in other databases.
While stored procedures in OLTP-oriented databases are often used to perform small transactions, stored procedures in OLAP-oriented databases like Vertica should instead be used to enhance analytical workloads. Vertica can handle isolated transactions, but frequent small transactions can potentially hinder performance.
Some recommended use cases for stored procedures in Vertica include information lifecycle management (ILM) activities such as extract, transform, and load (ETL), and data preparation for tasks like machine learning. For example:
- Swapping partitions according to age
- Exporting data at end-of-life and dropping the partitions
- Saving inputs, outputs, and metadata from a machine learning model—who ran the model, the version of the model, how many times the model was run, and who received the results
Stored procedures in Vertica can also operate on objects that require higher privileges than those of the caller. An optional parameter allows procedures to run using the privileges of the definer, allowing callers to perform sensitive operations in a controlled way.
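For example, the following sketch (the procedure, table, and user names are illustrative) runs with the privileges of its definer, so a caller who has been granted EXECUTE can purge rows from a table it cannot otherwise modify:

=> CREATE PROCEDURE purge_audit_staging() LANGUAGE 'PL/vSQL'
SECURITY DEFINER AS $$
BEGIN
    -- executes with the definer's privileges, not the caller's
    PERFORM DELETE FROM audit.staging WHERE created_at < NOW() - INTERVAL '90 days';
END;
$$;
=> GRANT EXECUTE ON PROCEDURE purge_audit_staging() TO analyst;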
Viewing stored procedures
To view existing stored procedures, see USER_PROCEDURES.
=> SELECT * FROM USER_PROCEDURES;
procedure_name | owner | language | security | procedure_arguments | schema_name
----------------+---------+----------+----------+---------------------+-------------
raiseXY | dbadmin | PL/vSQL | INVOKER | x int, y varchar | public
raiseXY | dbadmin | PL/vSQL | INVOKER | x int, y int | public
(2 rows)
To view the source code for stored procedures, export them with EXPORT_OBJECTS.
To export a particular implementation, specify either the types or both the names and types of its formal parameters. The following example specifies the types:
=> SELECT EXPORT_OBJECTS('','raiseXY(int, int)');
EXPORT_OBJECTS
----------------------
CREATE PROCEDURE public.raiseXY(x int, y int)
LANGUAGE 'PL/vSQL'
SECURITY INVOKER
AS '
BEGIN
RAISE NOTICE ''x = %'', x;
RAISE NOTICE ''y = %'', y;
-- some processing statements
END
';
SELECT MARK_DESIGN_KSAFE(0);
(1 row)
To export all implementations of the overloaded stored procedure raiseXY, export its parent schema:
=> SELECT EXPORT_OBJECTS('','public');
EXPORT_OBJECTS
----------------------
...
CREATE PROCEDURE public.raiseXY(x int, y varchar)
LANGUAGE 'PL/vSQL'
SECURITY INVOKER
AS '
BEGIN
RAISE NOTICE ''x = %'', x;
RAISE NOTICE ''y = %'', y;
-- some processing statements
END
';
CREATE PROCEDURE public.raiseXY(x int, y int)
LANGUAGE 'PL/vSQL'
SECURITY INVOKER
AS '
BEGIN
RAISE NOTICE ''x = %'', x;
RAISE NOTICE ''y = %'', y;
-- some processing statements
END
';
SELECT MARK_DESIGN_KSAFE(0);
(1 row)
Known issues and workarounds
- You cannot use PERFORM CREATE FUNCTION to create a SQL macro.

Workaround

Use EXECUTE to create SQL macros inside a stored procedure:

CREATE PROCEDURE procedure_name()
LANGUAGE PLvSQL AS $$
BEGIN
    EXECUTE 'macro';
END;
$$;

where macro is the creation statement for a SQL macro. For example, this procedure creates the argmax macro:
=> CREATE PROCEDURE make_argmax() LANGUAGE PLvSQL AS $$
BEGIN
EXECUTE
'CREATE FUNCTION
argmax(x int) RETURN int AS
BEGIN
RETURN (CASE WHEN (x IS NOT NULL) THEN x ELSE 0 END);
END';
END;
$$;
- Non-error exceptions in embedded SQL statements are not reported.
- DECIMAL, NUMERIC, NUMBER, MONEY, and UUID data types cannot yet be used for arguments.
- Cursors should capture the variable context at declaration time, but they currently capture the variable context at open time.
- DML queries on tables with key constraints cannot yet return a value.

Workaround

Rather than:
DO $$
DECLARE
y int;
BEGIN
y := UPDATE tbl WHERE col1 = 3 SET col2 = 4;
END;
$$
Check the result of the DML query with SELECT:
DO $$
DECLARE
y int;
BEGIN
y := SELECT COUNT(*) FROM tbl WHERE col1 = 3;
PERFORM UPDATE tbl SET col2 = 4 WHERE col1 = 3;
END;
$$;
3.1 - PL/vSQL
PL/vSQL is a powerful and expressive procedural language for creating reusable procedures, manipulating data, and simplifying otherwise complex database routines.
Vertica PL/vSQL is largely compatible with PostgreSQL PL/pgSQL, with minor semantic differences. For details on migrating your PostgreSQL PL/pgSQL stored procedures to Vertica, see the PL/pgSQL to PL/vSQL migration guide.
For real-world, practical examples of PL/vSQL usage, see Stored procedures: use cases and examples.
3.1.1 - Supported types
Vertica PL/vSQL supports non-complex data types. The following types are supported as variables only and not as arguments:
- DECIMAL
- NUMERIC
- NUMBER
- MONEY
- UUID
3.1.2 - Scope and structure
PL/vSQL uses block scope, where a block has the following structure:
[ <<label>> ]
[ DECLARE
declarations ]
BEGIN
statements
...
END [ label ];
Declarations
Variable declarations in the DECLARE block are structured as:
variable_name [ CONSTANT ] data_type [ NOT NULL ] [:= { expression | statement } ];
- variable_name: The name of the variable to declare.
- CONSTANT: Defines the variable as a constant (immutable). You can only set a constant variable's value during initialization.
- data_type: The variable data type. PL/vSQL supports non-complex data types (see Supported types). You can optionally reference a particular column's data type:
  variable_name table_name.column_name%TYPE;
- NOT NULL: Specifies that the variable cannot hold a NULL value. If declared with NOT NULL, the variable must be initialized (otherwise, Vertica throws ERRCODE_SYNTAX_ERROR) and cannot be assigned NULL (otherwise, Vertica throws ERRCODE_WRONG_OBJECT_TYPE).
- := expression: Initializes the variable with expression or statement. If the variable is declared with NOT NULL, expression is required. Default (uninitialized): NULL.

Variable declarations in a given block execute sequentially, so older declarations can be referenced by newer ones. For example:

DECLARE
    x int := 3;
    y int := x;
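A minimal sketch combining these options (the table t1 and its integer column a are assumed to exist):

DO $$
DECLARE
    max_tries CONSTANT int := 3;  -- immutable after initialization
    attempt int NOT NULL := 1;    -- must be initialized and can never be NULL
    sample_value t1.a%TYPE;       -- takes its data type from column a of table t1
BEGIN
    RAISE INFO 'max_tries = %, attempt = %', max_tries, attempt;
END;
$$;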
Aliases
Aliases are alternate names for the same variable. An alias of a variable is not a copy; changes made through either name affect the same underlying variable.

new_name ALIAS FOR variable;

In the following example, the identifier y is an alias for the variable x, so changes to y are reflected in x:
DO $$
DECLARE
x int := 3;
y ALIAS FOR x;
BEGIN
y := 5; -- since y refers to x, x = 5
RAISE INFO 'x = %, y = %', x, y;
END;
$$;
INFO 2005: x = 5, y = 5
BEGIN and nested blocks
BEGIN contains statements. A statement is defined as a line or block of PL/vSQL.

Variables declared in inner blocks shadow those declared in outer blocks. To unambiguously specify a variable in a particular block, you can name the block with a label (case-insensitive), and then reference the variable declared in that block with:

label.variable_name

For example, specifying the variable x from inside the inner block implicitly refers to inner_block.x rather than outer_block.x because of shadowing:
<<outer_block>>
DECLARE
x int;
BEGIN
<<inner_block>>
DECLARE
x int;
BEGIN
x := 1000; -- implicitly specifies x in inner_block because of shadowing
OUTER_BLOCK.x := 0; -- specifies x in outer_block; labels are case-insensitive
END inner_block;
END outer_block;
NULL statement
The NULL statement does nothing. This can be useful as a placeholder statement or a way to show that a code block is intentionally empty. For example:
DO $$
BEGIN
NULL;
END;
$$
Comments

Comments have the following syntax. You cannot nest comments.
-- single-line comment
/* multi-line
comment
*/
Nested stored procedures
Stored procedures that call other stored procedures, also called nested stored procedures, can be useful for simplifying complex functions and reusing code.
You can enable nested stored procedures by setting the EnableNestedStoredProcedures configuration parameter (disabled by default):
--Enable nested calls
=> ALTER DATABASE DEFAULT SET EnableNestedStoredProcedures = 1;
--Disable nested calls
=> ALTER DATABASE DEFAULT SET EnableNestedStoredProcedures = 0;
In the following example, proc2() calls proc1() to insert values into a table:
CREATE PROCEDURE proc1() AS
$$
BEGIN PERFORM INSERT INTO t_int VALUES(2023);
END;
$$;
CREATE PROCEDURE proc2() AS $$
BEGIN
PERFORM CREATE TABLE IF NOT EXISTS t_int(x int);
PERFORM CALL proc1();
END;
$$;
You can also use this feature to call meta-functions:
CREATE PROCEDURE RUN_ANALYZE_STATS() AS
$$
BEGIN PERFORM SELECT analyze_statistics('');
END;
$$;
CALL run_analyze_stats();
Depth limits
Stored procedures can only be nested up to a depth of 50. If a stored procedure exceeds the call depth, the entire operation is terminated and rolled back.
The stored procedure recursive_proc() calls itself to insert sequential values into a table, but it has no condition to stop before the depth limit. Calling the procedure causes a rollback, and no changes are made to the table:
=> CREATE TABLE numbers (n INT);
=> SELECT * FROM numbers;
n
---
(0 rows)
=> CREATE OR REPLACE PROCEDURE recursive_proc(x int) AS
$$
BEGIN
PERFORM INSERT INTO numbers VALUES(x + 1);
PERFORM CALL recursive_proc(x + 1);
END;
$$;
=> CALL recursive_proc(0);
ERROR 0: Nested stored procedure call exceeds call depth limit
CONTEXT: PL/vSQL procedure recursive_proc line 4 at static SQL
PL/vSQL procedure recursive_proc line 4 at static SQL
PL/vSQL procedure recursive_proc line 4 at static SQL
...
=> SELECT * FROM numbers;
n
---
(0 rows)
3.1.3 - Embedded SQL
You can embed and execute SQL statements and expressions from within stored procedures.
Assignment
To save the value of an expression or returned value, you can assign it to a variable:
variable_name := expression;
variable_name := statement;
For example, this procedure assigns 3 to i and 'message' to v:
=> CREATE PROCEDURE performless_assignment() LANGUAGE PLvSQL AS $$
DECLARE
i int;
v varchar;
BEGIN
i := SELECT 3;
v := 'message';
END;
$$;
This type of assignment will fail if the query returns no rows or more than one row. For returns of multiple rows, use LIMIT or truncating assignment:
=> SELECT * FROM t1;
b
---
t
f
f
(3 rows)
=> CREATE PROCEDURE more_than_one_row() LANGUAGE PLvSQL as $$
DECLARE
x boolean;
BEGIN
x := SELECT * FROM t1;
END;
$$;
CREATE PROCEDURE
=> CALL more_than_one_row();
ERROR 10332: Query returned multiple rows where 1 was expected
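As an alternative to truncating assignment, you can constrain the query itself so it returns at most one row; a sketch using LIMIT:

=> DO $$
DECLARE
    x boolean;
BEGIN
    x := SELECT b FROM t1 ORDER BY b DESC LIMIT 1; -- at most one row, so the assignment succeeds
END;
$$;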
Truncating assignment
Truncating assignment stores in a variable the first row returned by a query. Row order is nondeterministic unless you specify an ORDER BY clause:
variable_name <- expression;
variable_name <- statement;
The following procedure takes the first row of the results returned by the specified query and assigns it to x:
=> CREATE PROCEDURE truncating_assignment() LANGUAGE PLvSQL AS $$
DECLARE
x boolean;
BEGIN
x <- SELECT * FROM t1 ORDER BY b DESC; -- x is now assigned the first row returned by the SELECT query
END;
$$;
PERFORM

The PERFORM keyword runs a SQL statement or expression and discards the returned result:
PERFORM statement;
PERFORM expression;
For example, this procedure inserts a value into a table. INSERT returns the number of rows inserted, so you must pair it with PERFORM.
=> DO $$
BEGIN
PERFORM INSERT INTO coordinates VALUES(1,2,3);
END;
$$;
Note
If a SQL statement has no return value or you don't assign the return value to a variable, you must use PERFORM.
EXECUTE
EXECUTE allows you to dynamically construct a SQL query during execution:
EXECUTE command_expression [ USING expression [, ... ] ];
command_expression is a SQL expression that can reference PL/vSQL variables and evaluates to a string literal. The string literal is executed as a SQL statement, and $1, $2, ... are substituted with the corresponding expressions.

Constructing your query with PL/vSQL variables can be dangerous and expose your system to SQL injection, so wrap them with QUOTE_IDENT, QUOTE_LITERAL, and QUOTE_NULLABLE.
The following procedure constructs a query with a WHERE clause:
DO $$
BEGIN
EXECUTE 'SELECT * FROM t1 WHERE x = $1' USING 10; -- becomes WHERE x = 10
END;
$$;
The following procedure creates a user with a password from the username and password arguments. Because the constructed CREATE USER statement uses variables, use the functions QUOTE_IDENT and QUOTE_LITERAL, concatenating them with ||.
=> CREATE PROCEDURE create_user(username varchar, password varchar) LANGUAGE PLvSQL AS $$
BEGIN
EXECUTE 'CREATE USER ' || QUOTE_IDENT(username) || ' IDENTIFIED BY ' || QUOTE_LITERAL(password);
END;
$$;
EXECUTE is a SQL statement, so you can assign it to a variable or pair it with PERFORM:
variable_name:= EXECUTE command_expression;
PERFORM EXECUTE command_expression;
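For example, this sketch (reusing table t1 from the earlier examples) assigns the result of a dynamically constructed query to a variable:

=> DO $$
DECLARE
    row_total int;
BEGIN
    row_total := EXECUTE 'SELECT COUNT(*) FROM t1';
    RAISE INFO 'row_total = %', row_total;
END;
$$;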
FOUND (special variable)
The special boolean variable FOUND is initialized to false and set to true or false based on whether:

- A statement (but not an expression) returns results with a non-zero number of rows, or
- A FOR loop iterates at least once

You can use FOUND to distinguish between a NULL and a 0-row return.

Special variables exist between the scope of a procedure's arguments and the outermost block of its definition.
The following procedure demonstrates how FOUND changes. Before the SELECT statement, FOUND is false; after the SELECT statement, FOUND is true.
=> DO $$
BEGIN
RAISE NOTICE 'Before SELECT, FOUND = %', FOUND;
PERFORM SELECT 1; -- SELECT returns 1
RAISE NOTICE 'After SELECT, FOUND = %', FOUND;
END;
$$;
NOTICE 2005: Before SELECT, FOUND = f
NOTICE 2005: After SELECT, FOUND = t
Similarly, UPDATE, DELETE, and INSERT return the number of rows affected. In the next example, UPDATE doesn't change any rows, but it still returns the number of rows affected (0), so FOUND is set to true:
=> SELECT * FROM t1;
a | b
-----+-----
100 | abc
(1 row)
DO $$
BEGIN
PERFORM UPDATE t1 SET a=200 WHERE b='efg'; -- no rows affected since b doesn't contain 'efg'
RAISE INFO 'FOUND = %', FOUND;
END;
$$;
INFO 2005: FOUND = t
FOUND starts as false and is set to true if the loop iterates at least once:
=> DO $$
BEGIN
RAISE NOTICE 'FOUND = %', FOUND;
FOR i IN RANGE 1..1 LOOP -- RANGE is inclusive, so iterates once
RAISE NOTICE 'i = %', i;
END LOOP;
RAISE NOTICE 'FOUND = %', FOUND;
END;
$$;
NOTICE 2005: FOUND = f
NOTICE 2005: FOUND = t
DO $$
BEGIN
RAISE NOTICE 'FOUND = %', FOUND;
FOR i IN RANGE 1..0 LOOP
RAISE NOTICE 'i = %', i;
END LOOP;
RAISE NOTICE 'FOUND = %', FOUND;
END;
$$;
NOTICE 2005: FOUND = f
NOTICE 2005: FOUND = f
3.1.4 - Control flow
Control flow constructs give you control over how many times and under what conditions a block of statements should run.
Conditionals
IF/ELSIF/ELSE
IF/ELSIF/ELSE statements let you perform different actions based on a specified condition.
IF condition_1 THEN
statement_1;
[ ELSIF condition_2 THEN
statement_2 ]
...
[ ELSE
statement_n; ]
END IF;
Vertica successively evaluates each condition as a boolean until it finds one that's true, then executes the block of statements and exits the IF statement. If no conditions are true, it executes the ELSE block, if one exists.
IF i = 3 THEN...
ELSIF 0 THEN...
ELSIF true THEN...
ELSIF x <= 4 OR x >= 10 THEN...
ELSIF y = 'this' AND z = 'THAT' THEN...
For example, this procedure demonstrates a simple IF...ELSE branch. Because b is declared to be true, Vertica executes the first branch.
=> DO LANGUAGE PLvSQL $$
DECLARE
b bool := true;
BEGIN
IF b THEN
RAISE NOTICE 'true branch';
ELSE
RAISE NOTICE 'false branch';
END IF;
END;
$$;
NOTICE 2005: true branch
CASE
CASE expressions are often more readable than IF...ELSE chains. After executing a CASE expression's branch, control jumps to the statement after the enclosing END CASE.
PL/vSQL CASE expressions are more flexible and powerful than SQL case expressions, but the latter are more efficient; you should favor SQL case expressions when possible.
CASE [ search_expression ]
WHEN expression_1 [, expression_2, ...] THEN
when_statements
[ ... ]
[ ELSE
else_statements ]
END CASE;
search_expression is evaluated once and then compared with each expression_n from top to bottom. If search_expression and a given expression_n are equal, Vertica executes the WHEN block for expression_n and exits the CASE block. If no matching expression is found, the ELSE branch is executed, if one exists.

CASE expressions must have either a matching case or an ELSE branch; otherwise, Vertica throws a CASE_NOT_FOUND error.

If you omit search_expression, its value defaults to true.
For example, this procedure plays the game FizzBuzz: it prints Fizz if the argument is divisible by 3, Buzz if the argument is divisible by 5, and FizzBuzz if the argument is divisible by both 3 and 5.
=> CREATE PROCEDURE fizzbuzz(IN x int) LANGUAGE PLvSQL AS $$
DECLARE
fizz int := x % 3;
buzz int := x % 5;
BEGIN
CASE fizz
WHEN 0 THEN -- if fizz = 0, execute WHEN block
CASE buzz
WHEN 0 THEN -- if buzz = 0, execute WHEN block
RAISE INFO 'FizzBuzz';
ELSE -- if buzz != 0, execute WHEN block
RAISE INFO 'Fizz';
END CASE;
ELSE -- if fizz != 0, execute ELSE block
CASE buzz
WHEN 0 THEN
RAISE INFO 'Buzz';
ELSE
RAISE INFO '';
END CASE;
END CASE;
END;
$$;
=> CALL fizzbuzz(3);
INFO 2005: Fizz
=> CALL fizzbuzz(5);
INFO 2005: Buzz
=> CALL fizzbuzz(15);
INFO 2005: FizzBuzz
Loops
Loops repeatedly execute a block of code until a given condition is satisfied.
WHILE
A WHILE loop checks a given condition; if the condition is true, it executes the loop body, then checks the condition again. If the loop body executes and the condition is still true, it executes again; when the condition is false, control jumps past the end of the loop.
[ <<label>> ]
WHILE condition LOOP
statements;
END LOOP;
For example, this procedure computes the factorial of the argument:
=> CREATE PROCEDURE factorialSP(input int) LANGUAGE PLvSQL AS $$
DECLARE
i int := 1;
output int := 1;
BEGIN
WHILE i <= input loop
output := output * i;
i := i + 1;
END LOOP;
RAISE INFO '%! = %', input, output;
END;
$$;
=> CALL factorialSP(5);
INFO 2005: 5! = 120
LOOP
This type of loop is equivalent to WHILE true and only terminates if it encounters a RETURN or EXIT statement, or if an exception is thrown.
[ <<label>> ]
LOOP
statements;
END LOOP;
For example, this procedure prints the integers from counter up to upper_bound, inclusive:
DO $$
DECLARE
counter int := 1;
upper_bound int := 3;
BEGIN
LOOP
RAISE INFO '%', counter;
IF counter >= upper_bound THEN
RETURN;
END IF;
counter := counter + 1;
END LOOP;
END;
$$;
INFO 2005: 1
INFO 2005: 2
INFO 2005: 3
FOR
FOR loops iterate over a collection, which can be an integral range, query, or cursor.
If a FOR loop iterates at least once, the special FOUND variable is set to true after the loop ends. Otherwise, FOUND is set to false.
The FOUND variable can be useful for distinguishing between a NULL and 0-row return, or creating an IF branch if a LOOP didn't run.
FOR (RANGE)
A FOR (RANGE) loop iterates over a range of integers specified by the expressions left and right.
[ <<label>> ]
FOR loop_counter IN RANGE [ REVERSE ] left..right [ BY step ] LOOP
statements
END LOOP [ label ];
The loop_counter iterates from left to right (inclusive), incrementing by step at the end of each iteration. The REVERSE option instead iterates from right to left (inclusive), decrementing by step.
For example, here is a standard ascending FOR loop with step = 1:
=> DO $$
BEGIN
FOR i IN RANGE 1..4 LOOP -- loop_counter i does not have to be declared
RAISE NOTICE 'i = %', i;
END LOOP;
RAISE NOTICE 'after loop: i = %', i; -- fails
END;
$$;
NOTICE 2005: i = 1
NOTICE 2005: i = 2
NOTICE 2005: i = 3
NOTICE 2005: i = 4
ERROR 2624: Column "i" does not exist -- loop_counter i is only available inside the FOR loop
Here, the loop_counter i starts at 4 and decrements by 2 at the end of each iteration:
=> DO $$
BEGIN
FOR i IN RANGE REVERSE 4..0 BY 2 LOOP
RAISE NOTICE 'i = %', i;
END LOOP;
END;
$$;
NOTICE 2005: i = 4
NOTICE 2005: i = 2
NOTICE 2005: i = 0
FOR (query)
A FOR (QUERY) loop iterates over the results of a query.
[ <<label>> ]
FOR target IN QUERY statement LOOP
statements
END LOOP [ label ];
You can include an ORDER BY clause in the query to make the ordering deterministic.
Unlike FOR (RANGE) loops, you must declare the target variables. The values of these variables persist after the loop ends.

For example, suppose you have the table tuples:
=> SELECT * FROM tuples ORDER BY x ASC;
x | y | z
---+---+---
1 | 2 | 3
4 | 5 | 6
7 | 8 | 9
(3 rows)
This procedure retrieves the tuples in each row, stores them in the variables a, b, and c, and prints them after each iteration:
=> DO $$
DECLARE
a int; -- target variables must be declared
b int;
c int;
i int := 1;
BEGIN
FOR a,b,c IN QUERY SELECT * FROM tuples ORDER BY x ASC LOOP
RAISE NOTICE 'iteration %: a = %, b = %, c = %', i,a,b,c;
i := i + 1;
END LOOP;
RAISE NOTICE 'after loop: a = %, b = %, c = %', a,b,c;
END;
$$;
NOTICE 2005: iteration 1: a = 1, b = 2, c = 3
NOTICE 2005: iteration 2: a = 4, b = 5, c = 6
NOTICE 2005: iteration 3: a = 7, b = 8, c = 9
NOTICE 2005: after loop: a = 7, b = 8, c = 9
You can also use a query constructed dynamically with EXECUTE:
[ <<label>> ]
FOR target IN EXECUTE 'statement' [ USING expression [, ... ] ] LOOP
statements
END LOOP [ label ];
The following procedure uses EXECUTE to construct a FOR (QUERY) loop and stores the results of the SELECT statement in the variables x and y. The result set of a statement like this has only one row, so the loop only iterates once.
=> SELECT 'first string', 'second string';
?column? | ?column?
--------------+---------------
first string | second string
(1 row)
=> DO $$
DECLARE
x varchar; -- target variables must be declared
y varchar;
BEGIN
-- substitute the placeholders $1 and $2 with the strings
FOR x, y IN EXECUTE 'SELECT $1, $2' USING 'first string', 'second string' LOOP
RAISE NOTICE '%', x;
RAISE NOTICE '%', y;
END LOOP;
END;
$$;
NOTICE 2005: first string
NOTICE 2005: second string
FOR (cursor)
A FOR (CURSOR) loop iterates over a bound, unopened cursor, executing a set of statements for each iteration.
[ <<label>> ]
FOR loop_variable [, ...] IN CURSOR bound_unopened_cursor [ ( [ arg_name := ] arg_value [, ...] ) ] LOOP
statements
END LOOP [ label ];
This type of FOR loop opens the cursor at the start of the loop and closes it at the end.
For example, this procedure creates a cursor c. The procedure passes 6 as an argument to the cursor, so the cursor only retrieves rows where the y-coordinate is 6, storing the coordinates in the variables x_, y_, and z_, and printing them at the end of each iteration:
=> SELECT * FROM coordinates;
x | y | z
----+---+----
14 | 6 | 19
1 | 6 | 2
10 | 6 | 39
10 | 2 | 1
7 | 1 | 10
67 | 1 | 77
(6 rows)
DO $$
DECLARE
c CURSOR (key int) FOR SELECT * FROM coordinates WHERE y=key;
x_ int;
y_ int;
z_ int;
BEGIN
FOR x_,y_,z_ IN CURSOR c(6) LOOP
RAISE NOTICE 'cursor returned %,%,% FOUND=%', x_,y_,z_,FOUND;
END LOOP;
RAISE NOTICE 'after loop: %,%,% FOUND=%', x_,y_,z_,FOUND;
END;
$$;
NOTICE 2005: cursor returned 14,6,19 FOUND=f -- FOUND is only set after the loop ends
NOTICE 2005: cursor returned 1,6,2 FOUND=f
NOTICE 2005: after loop: 10,6,39 FOUND=t -- x_, y_, and z_ retain their values, FOUND is now true because the FOR loop iterated at least once
Manipulating loops
RETURN
You can exit the entire procedure (and therefore the loop) with RETURN. RETURN is an optional statement and can be added to signal to readers the end of a procedure.
RETURN;
EXIT
Similar to a break or labeled break in other programming languages, EXIT statements let you exit a loop early, optionally specifying:

- loop_label: the name of the loop to exit from
- condition: if the condition is true, execute the EXIT statement
EXIT [ loop_label ] [ WHEN condition ];
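For example, this sketch exits a labeled LOOP as soon as the counter passes a threshold:

=> DO $$
DECLARE
    i int := 0;
BEGIN
    <<count_loop>>
    LOOP
        i := i + 1;
        EXIT count_loop WHEN i > 3; -- leave the loop once i reaches 4
    END LOOP;
    RAISE INFO 'i = %', i; -- prints i = 4
END;
$$;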
CONTINUE
CONTINUE skips to the next iteration of the loop without executing statements that follow the CONTINUE itself. You can specify a particular loop with loop_label:
CONTINUE [loop_label] [ WHEN condition ];
For example, this procedure doesn't print during its first two iterations because the CONTINUE statement executes and moves on to the next iteration of the loop before control reaches the RAISE NOTICE statement:
=> DO $$
BEGIN
FOR i IN RANGE 1..5 LOOP
IF i < 3 THEN
CONTINUE;
END IF;
RAISE NOTICE 'i = %', i;
END LOOP;
END;
$$;
NOTICE 2005: i = 3
NOTICE 2005: i = 4
NOTICE 2005: i = 5
3.1.5 - Errors and diagnostics
ASSERT
ASSERT is a debugging feature that checks whether a condition is true. If the condition is false, ASSERT raises an ASSERT_FAILURE exception with an optional error message.

To escape a ' (single quote) character, use ''. Similarly, to escape a " (double quote) character, use "".
ASSERT condition [ , message ];
For example, this procedure checks the number of rows in the products table and uses ASSERT to check that the table is populated. If the table is empty, Vertica raises an error:
=> CREATE TABLE products(id UUID, name VARCHAR, price MONEY);
CREATE TABLE
=> SELECT * FROM products;
id | name | price
----+------+-------
(0 rows)
DO $$
DECLARE
prod_count INT;
BEGIN
prod_count := SELECT count(*) FROM products;
ASSERT prod_count > 0, 'products table is empty';
END;
$$;
ERROR 2005: products table is empty
To stop Vertica from checking ASSERT statements, you can set the boolean session-level parameter PLpgSQLCheckAsserts.
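For example, a sketch disabling ASSERT checking for the current session (assuming 0 disables the check):

=> ALTER SESSION SET PLpgSQLCheckAsserts = 0;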
RAISE
RAISE throws an error or prints a user-specified message, using one of the following forms:
RAISE [ level ] 'format' [, arg_expression [, ... ]] [ USING option = expression [, ... ] ];
RAISE [ level ] condition_name [ USING option = expression [, ... ] ];
RAISE [ level ] SQLSTATE 'sql-state' [ USING option = expression [, ... ] ];
RAISE [ level ] USING option = expression [, ... ];
level: VARCHAR, one of the following:

- LOG: Sends the message to vertica.log
- INFO: Prints an INFO message in vsql
- NOTICE: Prints a NOTICE in vsql
- WARNING: Prints a WARNING in vsql
- EXCEPTION: Throws a catchable exception

Default: EXCEPTION

format: VARCHAR, a string literal error message where each percent character (%) is substituted with the corresponding arg_expression. %% escapes the substitution and results in a single % in plaintext. If the number of % characters doesn't equal the number of arguments, Vertica throws an error. To escape a ' (single quote) character, use ''. Similarly, to escape a " (double quote) character, use "".

arg_expression: An expression that substitutes for a percent character (%) in the format string.

option = expression: option must be one of the following, paired with an expression that elaborates on the option:

- MESSAGE: An error message. Default: the ERRCODE associated with the exception.
- DETAIL: Details about the error.
- HINT: A hint message.
- ERRCODE: The error code to report, either a condition name specified in the description column of the SQL state list (with optional ERRCODE_ prefix), or a code that satisfies the SQLSTATE format: a 5-character sequence of numbers and capital letters (not necessarily on the SQL state list). Default: ERRCODE_RAISE_EXCEPTION (V0002).
- COLUMN: A column name relevant to the error.
- CONSTRAINT: A constraint relevant to the error.
- DATATYPE: A data type relevant to the error.
- TABLE: A table name relevant to the error.
- SCHEMA: A schema name relevant to the error.
This procedure demonstrates various RAISE levels:
=> DO $$
DECLARE
logfile varchar := 'vertica.log';
BEGIN
RAISE LOG 'this message was sent to %', logfile;
RAISE INFO 'info';
RAISE NOTICE 'notice';
RAISE WARNING 'warning';
RAISE EXCEPTION 'exception';
RAISE NOTICE 'exception changes control flow; this is not printed';
END;
$$;
INFO 2005: info
NOTICE 2005: notice
WARNING 2005: warning
ERROR 2005: exception
$ grep 'this message was sent to vertica.log' v_vmart_node0001_catalog/vertica.log
<LOG> @v_vmart_node0001: V0002/2005: this message was sent to vertica.log
Exceptions
EXCEPTION blocks let you catch and handle exceptions that might be thrown from statements:
[ <<label>> ]
[ DECLARE
declarations ]
BEGIN
statements
EXCEPTION
WHEN exception_condition [ OR exception_condition ... ] THEN
handler_statements
[ WHEN exception_condition [ OR exception_condition ... ] THEN
handler_statements
... ]
END [ label ];
exception_condition has one of the following forms:
WHEN errcode_division_by_zero THEN ...
WHEN division_by_zero THEN ...
WHEN SQLSTATE '22012' THEN ...
WHEN OTHERS THEN ...
OTHERS is a special condition that catches all exceptions except QUERY_CANCELLED, ASSERT_FAILURE, and FEATURE_NOT_SUPPORTED.
When an exception is thrown, Vertica checks the list of exceptions for a matching exception_condition from top to bottom. If it finds a match, it executes the handler_statements and then leaves the exception block's scope.
If Vertica can't find a match, it propagates the exception up to the next enclosing block. You can do this manually within an exception handler with RAISE:
RAISE;
For example, the following procedure divides 3 by 0 in the inner_block, an illegal operation that throws the exception division_by_zero with SQL state 22012. Vertica checks the inner EXCEPTION block for a matching condition:

- The first condition checks for SQL state 42501, so Vertica moves to the next condition.
- WHEN OTHERS THEN catches all exceptions, so Vertica executes that block.
- The bare RAISE then propagates the exception to the outer_block.
- The outer EXCEPTION block successfully catches the exception and prints a message.
=> DO $$
<<outer_block>>
BEGIN
<<inner_block>>
DECLARE
x int;
BEGIN
x := 3 / 0; -- throws exception division_by_zero, SQLSTATE 22012
EXCEPTION -- this block is checked first for matching exceptions
WHEN SQLSTATE '42501' THEN
RAISE NOTICE 'caught insufficient_privilege exception';
WHEN OTHERS THEN -- catches all exceptions
RAISE; -- manually propagate the exception to the next enclosing block
END inner_block;
EXCEPTION -- exception is propagated to this block
WHEN division_by_zero THEN
RAISE NOTICE 'caught division_by_zero exception';
END outer_block;
$$;
NOTICE 2005: caught division_by_zero exception
SQLSTATE and SQLERRM variables
When handling an exception, you can use the following variables to retrieve error information:

- SQLSTATE: the SQL state code associated with the error
- SQLERRM: the text of the error message

For details, see SQL state list.
This procedure catches the exception thrown by attempting to assign NULL to a NOT NULL variable and prints the SQL state and error message:
DO $$
DECLARE
i int NOT NULL := 1;
BEGIN
i := NULL; -- illegal, i was declared with NOT NULL
EXCEPTION
WHEN OTHERS THEN
RAISE WARNING 'SQL State: %', SQLSTATE;
RAISE WARNING 'Error message: %', SQLERRM;
END;
$$;
WARNING 2005: SQL State: 42809
WARNING 2005: Error message: Cannot assign null into NOT NULL variable
You can retrieve information about exceptions inside exception handlers with GET STACKED DIAGNOSTICS:
GET STACKED DIAGNOSTICS variable_name { = | := } item [, ... ];
Where item can be any of the following:

- RETURNED_SQLSTATE: SQLSTATE error code of the exception
- COLUMN_NAME: Name of the column related to the exception
- CONSTRAINT_NAME: Name of the constraint related to the exception
- DATATYPE_NAME: Name of the data type related to the exception
- MESSAGE_TEXT: Text of the exception's primary message
- TABLE_NAME: Name of the table related to the exception
- SCHEMA_NAME: Name of the schema related to the exception
- DETAIL_TEXT: Text of the exception's detail message, if any
- HINT_TEXT: Text of the exception's hint message, if any
- EXCEPTION_CONTEXT: Description of the call stack at the time of the exception
For example, this procedure has an EXCEPTION block that catches the division_by_zero error and prints the SQL state, error message, and exception context:
=> DO $$
DECLARE
message_1 varchar;
message_2 varchar;
message_3 varchar;
x int;
BEGIN
x := 5 / 0;
EXCEPTION
WHEN OTHERS THEN -- OTHERS catches all exceptions
GET STACKED DIAGNOSTICS message_1 = RETURNED_SQLSTATE,
message_2 = MESSAGE_TEXT,
message_3 = EXCEPTION_CONTEXT;
RAISE INFO 'SQLSTATE: %', message_1;
RAISE INFO 'MESSAGE: %', message_2;
RAISE INFO 'EXCEPTION_CONTEXT: %', message_3;
END;
$$;
INFO 2005: SQLSTATE: 22012
INFO 2005: MESSAGE: Division by zero
INFO 2005: EXCEPTION_CONTEXT: PL/vSQL procedure inline_code_block line 8 at static SQL
3.1.6 - Cursors
A cursor is a reference to the result set of a query and allows you to view the results one row at a time. Cursors remember their positions in result sets, which can be one of the following:
- a result row
- before the first row
- after the last row
You can also iterate over unopened, bound cursors with a FOR loop. See Control Flow for more information.
Declaring cursors
Bound cursors
To bind a cursor to a statement on declaration, use the FOR keyword:
cursor_name CURSOR [ ( arg_name arg_type [, ...] ) ] FOR statement;
The arguments to a cursor give you more control over which rows to process. For example, suppose you have the following table:
=> SELECT * FROM coordinates_xy;
x | y
---+----
1 | 2
9 | 5
7 | 13
...
(100000 rows)
If you're only interested in the rows where y is 6, you might declare the following cursor and then provide the argument 6 when you OPEN the cursor:
c CURSOR (key int) FOR SELECT * FROM coordinates_xy WHERE y=key;
Unbound cursors
To declare a cursor without binding it to a particular query, use the refcursor type:
cursor_name refcursor;
You can bind an unbound cursor at any time with OPEN.
For example, to declare the cursor my_unbound_cursor:
my_unbound_cursor refcursor;
Opening and closing cursors
OPEN
Opening a cursor executes the query with the given arguments, and puts the cursor before the first row of the result set. The ordering of query results (and therefore, the start of the result set) is non-deterministic, unless you specify an ORDER BY clause.
OPEN a bound cursor
To open a cursor that was bound during declaration:
OPEN bound_cursor [ ( [ arg_name := ] arg_value [, ...] ) ];
For example, given the following declaration:
c CURSOR (key int) FOR SELECT * FROM t1 WHERE y=key;
You can open the cursor with one of the following:
OPEN c(5);
OPEN c(key := 5);
CLOSE
Open cursors are automatically closed when the cursor leaves scope, but you can close the cursor preemptively with CLOSE. Closed cursors can be reopened later, which re-executes the query and prepares a new result set.
CLOSE cursor;
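For example, this sketch (using the coordinates cursor pattern shown later in this topic) closes a cursor early and reopens it with a different argument, which re-executes the query:

DO $$
DECLARE
    c CURSOR (key int) FOR SELECT * FROM coordinates WHERE y=key;
BEGIN
    OPEN c(1);  -- result set contains rows where y=1
    CLOSE c;    -- close before the cursor leaves scope
    OPEN c(2);  -- reopening re-executes the query with the new argument
    CLOSE c;
END;
$$;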
OPEN an unbound cursor
To bind an unbound cursor and then open it:
OPEN unbound_cursor FOR statement;
You can also use EXECUTE because it's a statement:
OPEN unbound_cursor FOR EXECUTE statement_string [ USING expression [, ... ] ];
For example, to bind the cursor c to a query on the table product_data:
OPEN c FOR SELECT * FROM product_data;
FETCH rows
FETCH statements:

- Retrieve the row that the specified cursor currently points to and store it in some variable.
- Advance the cursor to the next position.
variable [, ...] := FETCH opened_cursor;
The retrieved value is stored in variable. Rows typically have more than one value, so you can use one variable for each.

If FETCH successfully retrieves a value, the special variable FOUND is set to true. Otherwise, if you call FETCH when the cursor is past the final row of the result set, it returns NULL and the special variable FOUND is set to false.
The following procedure creates a cursor c, binding it to a SELECT query on the coordinates table. The procedure passes the argument 1 to the cursor, so the cursor only retrieves rows where the y-coordinate is 1, storing the coordinates in the variables x_, y_, and z_.

Only two rows have a y-coordinate of 1, so after using FETCH twice, the third FETCH starts to return NULL values and FOUND is set to false:
=> SELECT * FROM coordinates;
x | y | z
----+---+----
14 | 6 | 19
1 | 6 | 2
10 | 6 | 39
10 | 2 | 1
7 | 1 | 10
67 | 1 | 77
(6 rows)
DO $$
DECLARE
c CURSOR (key int) FOR SELECT * FROM coordinates WHERE y=key;
x_ int;
y_ int;
z_ int;
BEGIN
OPEN c(1); -- only retrieve rows where y=1
x_,y_,z_ := FETCH c;
RAISE NOTICE 'cursor returned %, %, %, FOUND=%',x_, y_, z_, FOUND;
x_,y_,z_ := FETCH c; -- fetches the last set of results and moves to the end of the result set
RAISE NOTICE 'cursor returned %, %, %, FOUND=%',x_, y_, z_, FOUND;
x_,y_,z_ := FETCH c; -- cursor has advanced past the final row
RAISE NOTICE 'cursor returned %, %, %, FOUND=%',x_, y_, z_, FOUND;
END;
$$;
NOTICE 2005: cursor returned 7, 1, 10, FOUND=t
NOTICE 2005: cursor returned 67, 1, 77, FOUND=t
NOTICE 2005: cursor returned <NULL>, <NULL>, <NULL>, FOUND=f
MOVE cursors
MOVE advances an open cursor to the next position without retrieving the row. The special FOUND variable is set to true if the cursor's position (before MOVE) was not past the final row—that is, if calling FETCH instead of MOVE would have retrieved the row.
MOVE bound_cursor;
For example, this cursor only retrieves rows where the y-coordinate is 2. The result set is only one row, so using MOVE twice advances past the first (and last) row, setting FOUND to false:
=> SELECT * FROM coordinates WHERE y=2;
x | y | z
----+---+---
10 | 2 | 1
(1 row)
DO $$
DECLARE
c CURSOR (key int) FOR SELECT * FROM coordinates WHERE y=key;
BEGIN
OPEN c(2); -- only retrieve rows where y=2, cursor starts before the first row
MOVE c; -- cursor advances to the first (and last) row
RAISE NOTICE 'FOUND=%', FOUND; -- FOUND is true because the cursor points to a row in the result set
MOVE c; -- cursor advances past the final row
RAISE NOTICE 'FOUND=%', FOUND; -- FOUND is false because the cursor is past the final row
END;
$$;
NOTICE 2005: FOUND=t
NOTICE 2005: FOUND=f
3.1.7 - PL/pgSQL to PL/vSQL migration guide
While Vertica PL/vSQL is largely compatible with PostgreSQL PL/pgSQL, there are some easily-resolved semantic and SQL-level differences when migrating from PostgreSQL PL/pgSQL.
While Vertica PL/vSQL is largely compatible with PostgreSQL PL/pgSQL, there are some easily-resolved semantic and SQL-level differences when migrating from PostgreSQL PL/pgSQL.
Language-level differences
PL/vSQL differs from PL/pgSQL in the following ways:
-
You must use the PERFORM statement for SQL statements that return no value (see the sketch after this list).
-
UPDATE/DELETE WHERE CURRENT OF is not currently supported.
-
FOR loops have additional keywords:
-
FOR (RANGE) loops: RANGE keyword
-
FOR (QUERY) loops: QUERY keyword
-
FOR (CURSOR) loops: CURSOR keyword
-
By default, NULL cannot be coerced to FALSE.
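For example, a bare SQL statement that is legal in PL/pgSQL must be wrapped in PERFORM in PL/vSQL when you discard its result. A minimal sketch, assuming a table t1(i INT) exists:
DO $$
BEGIN
    -- PL/pgSQL would accept: INSERT INTO t1 VALUES (1);
    PERFORM INSERT INTO t1 VALUES (1); -- PL/vSQL requires PERFORM here
END;
$$;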
Coercing NULL to FALSE
NULL is not coercible to FALSE by default, and expressions that expect a BOOLEAN value throw an exception when given a NULL:
=> DO $$
BEGIN
IF NULL THEN -- BOOLEAN value expected for IF
END IF;
END;
$$;
ERROR 10268: Query returned null where a value was expected
To enable NULL-to-FALSE coercion, set the configuration parameter PLvSQLCoerceNull:
=> ALTER DATABASE DEFAULT SET PLvSQLCoerceNull = 1;
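With PLvSQLCoerceNull enabled, a NULL condition is treated as FALSE instead of raising an error, and the IF branch is simply skipped. A minimal sketch of the expected behavior:
=> DO $$
BEGIN
    IF NULL THEN -- treated as FALSE when PLvSQLCoerceNull = 1
        RAISE NOTICE 'not reached';
    END IF;
END;
$$;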
Planned features
Support for the following features is planned for a future release:
- Full transaction semantics: Currently, a stored procedure's changes are committed only after it completes successfully, and you cannot manually ROLLBACK. You can, however, use PERFORM COMMIT to commit changes during execution, and changes made by nested stored procedures are automatically rolled back if they reach the depth limit.
- OUT/INOUT parameter modes
- FOREACH (ARRAY) loops
- Using the following types as arguments:
- DECIMAL
- NUMERIC
- NUMBER
- MONEY
- UUID
- Non-forward moving cursors
- CONTEXT/EXCEPTION_CONTEXT for diagnostics
- The special variable ROW_COUNT: To work around this, you can rely on INSERT, UPDATE, and DELETE to return the number of rows affected:
=> CREATE TABLE t1(i int);
CREATE TABLE
=> DO $$
DECLARE
x int;
BEGIN
x := INSERT INTO t1 VALUES (200);
RAISE INFO 'rows inserted: %', x;
END;
$$;
INFO 2005: rows inserted: 1
SQL-level differences
Vertica differs from PostgreSQL in the following ways:
-
Some data types are different sizes—for example, the standard INTEGER type in Vertica is 8 bytes, but 4 bytes in PostgreSQL.
-
In Vertica, INSERT, UPDATE, and DELETE return the number of rows affected.
-
Certain SQLSTATE codes are different, which affects exception handling.
3.2 - Parameter modes
Stored procedures support IN parameters.
Stored procedures support IN parameters. OUT and INOUT parameters are currently not supported.
If unspecified, a parameter's mode defaults to IN.
IN
IN parameters specify the name and type of an argument. These parameters determine a procedure's signature. When an overloaded procedure is called, Vertica runs the procedure whose signature matches the types of the arguments passed in the invocation.
For example, the caller of this procedure must pass in an INT and a VARCHAR value. Both x and y are IN parameters:
=> CREATE PROCEDURE raiseXY(IN x INT, y VARCHAR) LANGUAGE PLvSQL AS $$
BEGIN
RAISE NOTICE 'x = %', x;
RAISE NOTICE 'y = %', y;
-- some processing statements
END
$$;
CALL raiseXY(3, 'some string');
NOTICE 2005: x = 3
NOTICE 2005: y = some string
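Because the signature determines which procedure runs, you could add a second raiseXY definition that differs only in the type of y. This overload sketch is hypothetical; only the (INT, VARCHAR) version is defined above:
=> CREATE PROCEDURE raiseXY(IN x INT, y INT) LANGUAGE PLvSQL AS $$
BEGIN
    RAISE NOTICE 'both arguments are integers: % and %', x, y;
END;
$$;
=> CALL raiseXY(3, 4);             -- resolves to the (INT, INT) overload
=> CALL raiseXY(3, 'some string'); -- resolves to the (INT, VARCHAR) overload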
For more information on RAISE NOTICE, see Errors and diagnostics.
3.3 - Executing stored procedures
If you have EXECUTE privileges on a stored procedure, you can execute it with a CALL statement that specifies the procedure and its IN arguments.
If you have EXECUTE privileges on a stored procedure, you can execute it with a CALL statement that specifies the procedure and its IN arguments.
Syntax
CALL stored_procedure_name( [ arguments ] );
For example, the stored procedure raiseXY() is defined as:
=> CREATE PROCEDURE raiseXY(IN x INT, y VARCHAR) LANGUAGE PLvSQL AS $$
BEGIN
RAISE NOTICE 'x = %', x;
RAISE NOTICE 'y = %', y;
-- some processing statements
END
$$;
CALL raiseXY(3, 'some string');
NOTICE 2005: x = 3
NOTICE 2005: y = some string
For more information on RAISE NOTICE, see Errors and diagnostics.
You can execute an anonymous (unnamed) procedure with DO. This requires no privileges:
=> DO $$
BEGIN
RAISE NOTICE '% ran an anonymous procedure', current_user();
END;
$$;
NOTICE 2005: Bob ran an anonymous procedure
Transaction semantics
Changes made by stored procedures are automatically committed after successful execution and are rolled back otherwise.
You can also manually commit changes in the middle of a stored procedure. If execution fails after the commit, the committed changes persist even after the automatic rollback.
In this example, manualcommit() inserts two values into a table and then commits. The third insert attempts to insert a CHAR into an INT column, which causes the stored procedure to fail and triggers an automatic rollback. This rollback does not affect the first two inserts because they were manually committed:
=> CREATE TABLE numbers (n INT);
=> CREATE PROCEDURE manualcommit() AS
$$
BEGIN
PERFORM INSERT INTO numbers VALUES(1);
PERFORM INSERT INTO numbers VALUES(2);
PERFORM COMMIT;
PERFORM INSERT INTO numbers VALUES('a');
END;
$$;
=> CALL manualcommit();
ERROR 3681: Invalid input syntax for integer: "a"
CONTEXT: PL/vSQL procedure manualcommit line 6 at static SQL
=> SELECT * FROM numbers;
n
---
1
2
(2 rows)
Session semantics
Operations that modify the session persist after execution of the stored procedure, including ALTER SESSION and SET statements.
Limiting runtime
You can set the maximum runtime of a procedure with the session parameter RUNTIMECAP.
This example sets the runtime of all stored procedures to one second for the duration of the session and then runs an anonymous procedure with an infinite loop. Vertica terminates the procedure after it runs for more than one second:
=> SET SESSION RUNTIMECAP '1 SECOND';
=> DO $$
BEGIN
LOOP
END LOOP;
END;
$$;
ERROR 0: Query exceeded maximum runtime
HINT: Change the maximum runtime using SET SESSION RUNTIMECAP
Execution security and privileges
By default, stored procedures execute with the privileges of the caller (invoker), so callers must have the necessary privileges on the catalog objects accessed by the stored procedure. You can allow callers to execute the procedure with the privileges, default roles, user parameters, and user attributes (RESOURCE_POOL, MEMORY_CAP_KB, TEMP_SPACE_CAP_KB, RUNTIMECAP) of the definer by specifying DEFINER for the SECURITY option.
For example, the following procedure inserts a value into table s1.t1. If the definer has the required privileges (USAGE on the schema and INSERT on the table), those requirements are waived for callers.
=> CREATE PROCEDURE insert_into_s1_t1(IN x int, IN y int)
LANGUAGE PLvSQL
SECURITY DEFINER AS $$
BEGIN
PERFORM INSERT INTO s1.t1 VALUES(x,y);
END;
$$;
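Assuming the definer has granted EXECUTE on the procedure, a caller who holds no privileges on s1.t1 can then insert into it; a hypothetical invocation:
=> CALL insert_into_s1_t1(1, 2);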
A procedure with SECURITY DEFINER effectively executes the procedure as that user, so changes to the database appear to be performed by the procedure's definer rather than its caller.
Caution
Improper use of SECURITY DEFINER can lead to the confused deputy problem and introduce vulnerabilities, such as SQL injection, into your system.
Execution privileges for nested stored procedures
A stored procedure cannot call stored procedures that require additional privileges. For example, if a stored procedure executes with privileges A, B, and C, it cannot call a stored procedure that requires privileges C, D, and E.
For details on nested stored procedures, see Scope and structure.
Examples
In this example, the following table:
records(i INT, updated_date TIMESTAMP DEFAULT sysdate, updated_by VARCHAR(128) DEFAULT current_user())
contains this content:
=> SELECT * FROM records;
i | updated_date | updated_by
---+----------------------------+------------
1 | 2021-08-27 15:54:05.709044 | Bob
2 | 2021-08-27 15:54:07.051154 | Bob
3 | 2021-08-27 15:54:08.301704 | Bob
(3 rows)
Bob creates a procedure to update the table, specifying the SECURITY DEFINER option.
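The definition of update_records() is not shown in this example. A minimal sketch consistent with the call update_records(99,1) below might look like the following; the parameter names and UPDATE statement are assumptions for illustration:
=> CREATE PROCEDURE update_records(IN val INT, IN id INT)
LANGUAGE PLvSQL
SECURITY DEFINER AS $$
BEGIN
    -- Runs with the definer's (Bob's) privileges, so the update
    -- appears to be performed by Bob
    PERFORM UPDATE records SET i = val, updated_date = sysdate, updated_by = current_user() WHERE i = id;
END;
$$;
Bob grants EXECUTE on the procedure to Alice. Alice can now use the procedure to update the table without any additional privileges: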
=> GRANT EXECUTE ON PROCEDURE update_records(int,int) to Alice;
GRANT PRIVILEGE
=> \c - Alice
You are now connected as user "Alice".
=> CALL update_records(99,1);
update_records
---------------
0
(1 row)
Because calls to update_records() effectively run the procedure as Bob, Bob is listed as the updater of the table rather than Alice:
=> SELECT * FROM records;
i | updated_date | updated_by
----+----------------------------+------------
99 | 2021-08-27 15:55:42.936404 | Bob
2 | 2021-08-27 15:54:07.051154 | Bob
3 | 2021-08-27 15:54:08.301704 | Bob
(3 rows)
3.3.1 - Triggers
You can automate the execution of stored procedures with triggers.
You can automate the execution of stored procedures with triggers. A trigger listens to database events and executes its associated stored procedure when the events occur. You can use triggers with CREATE SCHEDULE to implement Scheduled execution.
Individual triggers can be enabled and disabled with ENABLE_TRIGGER, and can be manually executed with EXECUTE_TRIGGER.
3.3.1.1 - Scheduled execution
Stored procedures can be scheduled to execute automatically with the privileges of the trigger definer.
Stored procedures can be scheduled to execute automatically with the privileges of the trigger definer. You can use this to automate various tasks, like logging database activity, revoking privileges, or creating roles.
Enabling and disabling scheduling
Scheduling can be toggled at the database level with the EnableStoredProcedureScheduler configuration parameter:
-- Enable scheduler
=> SELECT SET_CONFIG_PARAMETER('EnableStoredProcedureScheduler', 1);
-- Disable scheduler
=> SELECT SET_CONFIG_PARAMETER('EnableStoredProcedureScheduler', 0);
You can toggle an individual schedule with ENABLE_TRIGGER, or disable it by dropping the schedule's associated trigger.
Scheduling a stored procedure
The general workflow for implementing scheduled execution for a single stored procedure is as follows:
-
Create a stored procedure.
-
Create a schedule. A schedule can either use a list of timestamps for one-off triggers or a cron expression for recurring events.
-
Create a trigger, associating it with the stored procedure and schedule.
-
(Optional) Manually execute the trigger to test it.
One-off triggers
One-off triggers run a finite number of times.
The following example creates a trigger that revokes privileges on the customer_dimension table from the user Bob after 24 hours:
-
Create a stored procedure to revoke privileges from Bob:
=> CREATE OR REPLACE PROCEDURE revoke_all_on_table(table_name VARCHAR, user_name VARCHAR)
LANGUAGE PLvSQL
AS $$
BEGIN
EXECUTE 'REVOKE ALL ON ' || QUOTE_IDENT(table_name) || ' FROM ' || QUOTE_IDENT(user_name);
END;
$$;
-
Create a schedule with a timestamp for 24 hours later:
=> CREATE SCHEDULE 24_hours_later USING DATETIMES('2022-12-16 12:00:00');
-
Create a trigger with the stored procedure and schedule:
=> CREATE TRIGGER revoke_trigger ON SCHEDULE 24_hours_later EXECUTE PROCEDURE revoke_all_on_table('customer_dimension', 'Bob') AS DEFINER;
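(Optional) To test the trigger without waiting for the schedule, execute it manually with EXECUTE_TRIGGER, mentioned earlier. The single-argument form shown here is an assumption for illustration:
=> SELECT EXECUTE_TRIGGER('revoke_trigger');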
Recurring triggers
Recurring triggers run at a recurring date or time.
The following example creates a weekly trigger that logs the number of users in the database to the USER_COUNT table:
=> SELECT * FROM USER_COUNT;
total | timestamp
---------+-------------------------------
293 | 2022-12-04 00:00:00.346664-00
302 | 2022-12-11 00:00:00.782242-00
301 | 2022-12-18 00:00:00.144633-00
301 | 2022-12-25 00:00:00.548832-00
(4 rows)
-
Create the table to log the user counts:
=> CREATE TABLE USER_COUNT(total INT, timestamp TIMESTAMPTZ);
-
Create the stored procedure to log to the table:
=> CREATE OR REPLACE PROCEDURE log_user_count()
LANGUAGE PLvSQL
AS $$
DECLARE
num_users int := SELECT count (user_id) FROM users;
timestamp datetime := SELECT NOW();
BEGIN
PERFORM INSERT INTO USER_COUNT VALUES(num_users, timestamp);
END;
$$;
-
Create the schedule for 12:00 AM on Sunday:
=> CREATE SCHEDULE weekly_sunday USING CRON '0 0 * * 0';
-
Create the trigger with the stored procedure and schedule:
=> CREATE TRIGGER user_log_trigger ON SCHEDULE weekly_sunday EXECUTE PROCEDURE log_user_count() AS DEFINER;
Viewing upcoming schedules
Schedules are managed and coordinated by the Active Scheduler Node (ASN). If the ASN goes down, a different node is automatically designated as the new ASN. To view scheduled tasks, query SCHEDULER_TIME_TABLE on the ASN.
-
Determine the ASN with ACTIVE_SCHEDULER_NODE:
=> SELECT active_scheduler_node();
active_scheduler_node
-----------------------
initiator
(1 row)
-
On the ASN, query SCHEDULER_TIME_TABLE:
=> SELECT * FROM scheduler_time_table;
schedule_name | attached_trigger | scheduled_execution_time
-----------------+------------------+--------------------------
daily_1am_gmt | log_user_actions | 2022-12-15 01:00:00-00
24_hours_later | revoke_trigger | 2022-12-16 12:00:00-00
3.4 - Altering stored procedures
You can alter a stored procedure and retain its grants with ALTER PROCEDURE.
You can alter a stored procedure and retain its grants with ALTER PROCEDURE.
Examples
The examples below use the following procedure:
=> CREATE PROCEDURE echo_integer(IN x int) LANGUAGE PLvSQL AS $$
BEGIN
RAISE INFO 'x is %', x;
END;
$$;
By default, stored procedures execute with the privileges of the caller (invoker). To instead execute the procedure with the privileges, default roles, user parameters, and user attributes of its definer:
=> ALTER PROCEDURE echo_integer(int) SECURITY DEFINER;
To change a procedure's source code:
=> ALTER PROCEDURE echo_integer(int) SOURCE TO $$
BEGIN
RAISE INFO 'the integer is: %', x;
END;
$$;
To change a procedure's owner (definer):
=> ALTER PROCEDURE echo_integer(int) OWNER TO u1;
To change a procedure's schema:
=> ALTER PROCEDURE echo_integer(int) SET SCHEMA s1;
To rename a procedure:
=> ALTER PROCEDURE echo_integer(int) RENAME TO echo_int;
3.5 - Stored procedures: use cases and examples
Stored procedures in Vertica are best suited for complex, analytical workflows rather than small, transaction-heavy ones.
Stored procedures in Vertica are best suited for complex, analytical workflows rather than small, transaction-heavy ones. Some recommended use cases include information lifecycle management (ILM) activities like extract, transform, and load (ETL), and data preparation for more complex analytical tasks like machine learning. For example:
-
Swapping partitions according to age
-
Exporting data at end-of-life and dropping the partitions
-
Saving inputs, outputs, and metadata from a machine learning model (e.g. who ran the model, the version of the model, how many times the model was run, who received the results, etc.) for auditing purposes
Searching for a value
The find_my_value() procedure searches for a user-specified value in any table column in a given schema and stores the locations of instances of the value in a user-specified table:
=> CREATE PROCEDURE find_my_value(p_table_schema VARCHAR(128), p_search_value VARCHAR(1000), p_results_schema VARCHAR(128), p_results_table VARCHAR(128)) AS $$
DECLARE
sql_cmd VARCHAR(65000);
sql_cmd_result VARCHAR(65000);
results VARCHAR(65000);
BEGIN
IF p_table_schema IS NULL OR p_table_schema = '' OR
p_search_value IS NULL OR p_search_value = '' OR
p_results_schema IS NULL OR p_results_schema = '' OR
p_results_table IS NULL OR p_results_table = '' THEN
RAISE EXCEPTION 'Please provide a schema to search, a search value, a results table schema, and a results table name.';
RETURN;
END IF;
sql_cmd := 'CREATE TABLE IF NOT EXISTS ' || QUOTE_IDENT(p_results_schema) || '.' || QUOTE_IDENT(p_results_table) ||
'(found_timestamp TIMESTAMP, found_value VARCHAR(1000), table_name VARCHAR(128), column_name VARCHAR(128));';
sql_cmd_result := EXECUTE 'SELECT LISTAGG(c USING PARAMETERS max_length=1000000, separator='' '')
FROM (SELECT ''
(SELECT '''''' || NOW() || ''''''::TIMESTAMP , ''''' || QUOTE_IDENT(p_search_value) || ''''','''''' || table_name || '''''', '''''' || column_name || ''''''
FROM '' || table_schema || ''.'' || table_name || ''
WHERE '' || column_name || ''::'' ||
CASE
WHEN data_type_id IN (17, 115, 116, 117) THEN data_type
ELSE ''VARCHAR('' || LENGTH(''' || QUOTE_IDENT(p_search_value)|| ''') || '')'' END || '' = ''''' || QUOTE_IDENT(p_search_value) || ''''''' || DECODE(LEAD(column_name) OVER(ORDER BY table_schema, table_name, ordinal_position), NULL, '' LIMIT 1);'', '' LIMIT 1)
UNION ALL '') c
FROM (SELECT table_schema, table_name, column_name, ordinal_position, data_type_id, data_type
FROM columns WHERE NOT is_system_table AND table_schema ILIKE ''' || QUOTE_IDENT(p_table_schema) || ''' AND data_type_id < 1000
ORDER BY table_schema, table_name, ordinal_position) foo) foo;';
results := EXECUTE 'INSERT INTO ' || QUOTE_IDENT(p_results_schema) || '.' || QUOTE_IDENT(p_results_table) || ' ' || sql_cmd_result;
RAISE INFO 'Matches Found: %', results;
END;
$$;
For example, to search the public schema for instances of the string 'dog' and then store the results in public.table_list:
=> CALL find_my_value('public', 'dog', 'public', 'table_list');
find_my_value
---------------
0
(1 row)
=> SELECT * FROM public.table_list;
found_timestamp | found_value | table_name | column_name
----------------------------+-------------+---------------+-------------
2021-08-25 22:13:20.147889 | dog | another_table | b
2021-08-25 22:13:20.147889 | dog | some_table | c
(2 rows)
Optimizing tables
You can automate loading data from Parquet files and optimizing your queries with the create_optimized_table() procedure. This procedure:
-
Creates an external table whose structure is built from Parquet files using the Vertica INFER_TABLE_DDL function.
-
Creates a native Vertica table modeled on the external table, resizing all VARCHAR columns to the maximum length of the data to be loaded.
-
Creates a super projection using the optional segmentation/order by columns passed in as a parameter.
-
Adds an optional primary key to the native table passed in as a parameter.
-
Loads a sample data set (1 million rows) from the external table into the native table.
-
Drops the external table.
-
Runs the ANALYZE_STATISTICS function on the native table.
-
Runs the DESIGNER_DESIGN_PROJECTION_ENCODINGS function to get a properly encoded super projection for the native table.
-
Truncates the now-optimized native table (we will load the entire data set in a separate script/stored procedure).
=> CREATE OR REPLACE PROCEDURE create_optimized_table(p_file_path VARCHAR(1000), p_table_schema VARCHAR(128), p_table_name VARCHAR(128), p_seg_columns VARCHAR(1000), p_pk_columns VARCHAR(1000)) LANGUAGE PLvSQL AS $$
DECLARE
command_sql VARCHAR(1000);
seg_columns VARCHAR(1000);
BEGIN
-- First 3 parms are required.
-- Segmented and PK columns names, if present, must be Unquoted Identifiers
IF p_file_path IS NULL OR p_file_path = '' THEN
RAISE EXCEPTION 'Please provide a file path.';
ELSEIF p_table_schema IS NULL OR p_table_schema = '' THEN
RAISE EXCEPTION 'Please provide a table schema.';
ELSEIF p_table_name IS NULL OR p_table_name = '' THEN
RAISE EXCEPTION 'Please provide a table name.';
END IF;
-- Pass optional segmented columns parameter as null or empty string if not used
IF p_seg_columns IS NULL OR p_seg_columns = '' THEN
seg_columns := '';
ELSE
seg_columns := 'ORDER BY ' || p_seg_columns || ' SEGMENTED BY HASH(' || p_seg_columns || ') ALL NODES';
END IF;
-- Add '_external' to end of p_table_name for the external table and drop it if it already exists
EXECUTE 'DROP TABLE IF EXISTS ' || QUOTE_IDENT(p_table_schema) || '.' || QUOTE_IDENT(p_table_name) || '_external CASCADE;';
-- Execute INFER_TABLE_DDL to generate CREATE EXTERNAL TABLE from the Parquet files
command_sql := EXECUTE 'SELECT infer_table_ddl(' || QUOTE_LITERAL(p_file_path) || ' USING PARAMETERS format = ''parquet'', table_schema = ''' || QUOTE_IDENT(p_table_schema) || ''', table_name = ''' || QUOTE_IDENT(p_table_name) || '_external'', table_type = ''external'');';
-- Run the CREATE EXTERNAL TABLE DDL
EXECUTE command_sql;
-- Generate the Internal/ROS Table DDL and generate column lengths based on maximum column lengths found in external table
command_sql := EXECUTE 'SELECT LISTAGG(y USING PARAMETERS separator='' '')
FROM ((SELECT 0 x, ''SELECT ''''CREATE TABLE ' || QUOTE_IDENT(p_table_schema) || '.' || QUOTE_IDENT(p_table_name) || '('' y
UNION ALL SELECT ordinal_position, column_name || '' '' ||
CASE WHEN data_type LIKE ''varchar%''
THEN ''varchar('''' || (SELECT MAX(LENGTH('' || column_name || ''))
FROM '' || table_schema || ''.'' || table_name || '') || '''')'' ELSE data_type END || NVL2(LEAD('' || column_name || '', 1) OVER (ORDER BY ordinal_position), '','', '')'')
FROM columns WHERE table_schema = ''' || QUOTE_IDENT(p_table_schema) || ''' AND table_name = ''' || QUOTE_IDENT(p_table_name) || '_external''
UNION ALL SELECT 10000, ''' || seg_columns || ''' UNION ALL SELECT 10001, '';'''''') ORDER BY x) foo WHERE y <> '''';';
command_sql := EXECUTE command_sql;
EXECUTE command_sql;
-- Alter the Internal/ROS Table if primary key columns were passed as a parameter
IF p_pk_columns IS NOT NULL AND p_pk_columns <> '' THEN
EXECUTE 'ALTER TABLE ' || QUOTE_IDENT(p_table_schema) || '.' || QUOTE_IDENT(p_table_name) || ' ADD CONSTRAINT ' || QUOTE_IDENT(p_table_name) || '_pk PRIMARY KEY (' || p_pk_columns || ') ENABLED;';
END IF;
-- Insert 1M rows into the Internal/ROS Table, analyze stats, and generate encodings
EXECUTE 'INSERT INTO ' || QUOTE_IDENT(p_table_schema) || '.' || QUOTE_IDENT(p_table_name) || ' SELECT * FROM ' || QUOTE_IDENT(p_table_schema) || '.' || QUOTE_IDENT(p_table_name) || '_external LIMIT 1000000;';
EXECUTE 'SELECT analyze_statistics(''' || QUOTE_IDENT(p_table_schema) || '.' || QUOTE_IDENT(p_table_name) || ''');';
EXECUTE 'SELECT designer_design_projection_encodings(''' || QUOTE_IDENT(p_table_schema) || '.' || QUOTE_IDENT(p_table_name) || ''', ''/tmp/toss.sql'', TRUE, TRUE);';
-- Truncate the Internal/ROS Table and you are now ready to load all rows
-- Drop the external table
EXECUTE 'TRUNCATE TABLE ' || QUOTE_IDENT(p_table_schema) || '.' || QUOTE_IDENT(p_table_name) || ';';
EXECUTE 'DROP TABLE IF EXISTS ' || QUOTE_IDENT(p_table_schema) || '.' || QUOTE_IDENT(p_table_name) || '_external CASCADE;';
END;
$$;
=> CALL create_optimized_table('/home/dbadmin/parquet_example/*','public','parquet_table','c1,c2','c1');
create_optimized_table
------------------------
0
(1 row)
=> SELECT export_objects('', 'public.parquet_table');
export_objects
------------------------------------------
CREATE TABLE public.parquet_table
(
c1 int NOT NULL,
c2 varchar(36),
c3 date,
CONSTRAINT parquet_table_pk PRIMARY KEY (c1) ENABLED
);
CREATE PROJECTION public.parquet_table_super /*+createtype(D)*/
(
c1 ENCODING COMMONDELTA_COMP,
c2 ENCODING ZSTD_FAST_COMP,
c3 ENCODING COMMONDELTA_COMP
)
AS
SELECT parquet_table.c1,
parquet_table.c2,
parquet_table.c3
FROM public.parquet_table
ORDER BY parquet_table.c1,
parquet_table.c2
SEGMENTED BY hash(parquet_table.c1, parquet_table.c2) ALL NODES OFFSET 0;
SELECT MARK_DESIGN_KSAFE(0);
(1 row)
Pivoting tables dynamically
The stored procedure unpivot() takes a source table and a target table as input. It unpivots the source table and writes the result to the target table.
This example uses the following table:
=> SELECT * FROM make_the_columns_into_rows;
c1 | c2 | c3 | c4 | c5 | c6
-----+-----+--------------------------------------+----------------------------+----------+----
123 | ABC | cf470c5b-50e3-492a-8483-b9e4f20d195a | 2021-08-24 18:49:40.835802 | 1.72964 | t
567 | EFG | 25ea7636-d924-4b4f-81b5-1e1c884b06e3 | 2021-08-04 18:49:40.835802 | 41.46100 | f
890 | XYZ | f588935a-35a4-4275-9e7f-ebb3986390e3 | 2021-08-29 19:53:39.465778 | 8.58207 | t
(3 rows)
This table contains the following columns:
=> \d make_the_columns_into_rows
List of Fields by Tables
Schema | Table | Column | Type | Size | Default | Not Null | Primary Key | Foreign Key
--------+----------------------------+--------+---------------+------+---------+----------+-------------+-------------
public | make_the_columns_into_rows | c1 | int | 8 | | f | f |
public | make_the_columns_into_rows | c2 | varchar(80) | 80 | | f | f |
public | make_the_columns_into_rows | c3 | uuid | 16 | | f | f |
public | make_the_columns_into_rows | c4 | timestamp | 8 | | f | f |
public | make_the_columns_into_rows | c5 | numeric(10,5) | 8 | | f | f |
public | make_the_columns_into_rows | c6 | boolean | 1 | | f | f |
(6 rows)
The target table has columns from the source table pivoted into rows as key/value pairs. It also has a ROWID column to tie the key/value pairs back to their original row from the source table:
=> CREATE PROCEDURE unpivot(p_source_table_schema VARCHAR(128), p_source_table_name VARCHAR(128), p_target_table_schema VARCHAR(128), p_target_table_name VARCHAR(128)) AS $$
DECLARE
explode_command VARCHAR(10000);
BEGIN
explode_command := EXECUTE 'SELECT ''explode(string_to_array(''''['''' || '' || LISTAGG(''NVL('' || column_name || ''::VARCHAR, '''''''')'' USING PARAMETERS separator='' || '''','''' || '') || '' || '''']'''')) OVER (PARTITION BY rn)'' explode_command FROM (SELECT table_schema, table_name, column_name, ordinal_position FROM columns ORDER BY table_schema, table_name, ordinal_position LIMIT 10000000) foo WHERE table_schema = ''' || QUOTE_IDENT(p_source_table_schema) || ''' AND table_name = ''' || QUOTE_IDENT(p_source_table_name) || ''';';
EXECUTE 'CREATE TABLE ' || QUOTE_IDENT(p_target_table_schema) || '.' || QUOTE_IDENT(p_target_table_name) || '
AS SELECT rn rowid, column_name key, value FROM (SELECT (ordinal_position - 1) op, column_name
FROM columns WHERE table_schema = ''' || QUOTE_IDENT(p_source_table_schema) || '''
AND table_name = ''' || QUOTE_IDENT(p_source_table_name) || ''') a
JOIN (SELECT rn, ' || explode_command || '
FROM (SELECT ROW_NUMBER() OVER() rn, *
FROM ' || QUOTE_IDENT(p_source_table_schema) || '.' || QUOTE_IDENT(p_source_table_name) || ') foo) b ON b.position = a.op';
END;
$$;
Call the procedure:
=> CALL unpivot('public', 'make_the_columns_into_rows', 'public', 'columns_into_rows');
unpivot
---------
0
(1 row)
=> SELECT * FROM columns_into_rows ORDER BY rowid, key;
rowid | key | value
-------+-----+--------------------------------------
1 | c1 | 123
1 | c2 | ABC
1 | c3 | cf470c5b-50e3-492a-8483-b9e4f20d195a
1 | c4 | 2021-08-24 18:49:40.835802
1 | c5 | 1.72964
1 | c6 | t
2 | c1 | 890
2 | c2 | XYZ
2 | c3 | f588935a-35a4-4275-9e7f-ebb3986390e3
2 | c4 | 2021-08-29 19:53:39.465778
2 | c5 | 8.58207
2 | c6 | t
3 | c1 | 567
3 | c2 | EFG
3 | c3 | 25ea7636-d924-4b4f-81b5-1e1c884b06e3
3 | c4 | 2021-08-04 18:49:40.835802
3 | c5 | 41.46100
3 | c6 | f
(18 rows)
The unpivot() procedure can handle new columns in the source table as well.
Add a new column z to the source table, and then unpivot the table with the same procedure:
=> ALTER TABLE make_the_columns_into_rows ADD COLUMN z VARCHAR;
ALTER TABLE
=> UPDATE make_the_columns_into_rows SET z = 'ZZZ' WHERE c1 IN (123, 890);
OUTPUT
--------
2
(1 row)
=> CALL unpivot('public', 'make_the_columns_into_rows', 'public', 'columns_into_rows');
unpivot
---------
0
(1 row)
=> SELECT * FROM columns_into_rows;
rowid | key | value
-------+-----+--------------------------------------
1 | c1 | 567
1 | c2 | EFG
1 | c3 | 25ea7636-d924-4b4f-81b5-1e1c884b06e3
1 | c4 | 2021-08-04 18:49:40.835802
1 | c5 | 41.46100
1 | c6 | f
1 | z | -- new column
2 | c1 | 123
2 | c2 | ABC
2 | c3 | cf470c5b-50e3-492a-8483-b9e4f20d195a
2 | c4 | 2021-08-24 18:49:40.835802
2 | c5 | 1.72964
2 | c6 | t
2 | z | ZZZ -- new column
3 | c1 | 890
3 | c2 | XYZ
3 | c3 | f588935a-35a4-4275-9e7f-ebb3986390e3
3 | c4 | 2021-08-29 19:53:39.465778
3 | c5 | 8.58207
3 | c6 | t
3 | z | ZZZ -- new column
(21 rows)
Machine learning: optimizing AUC estimation
The ROC function can approximate the AUC (area under the curve), the accuracy of which depends on the num_bins parameter; greater values of num_bins give you more precise approximations, but may impact performance.
You can use the stored procedure accurate_auc() to approximate the AUC; it automatically determines the optimal num_bins value for a given epsilon (error term):
=> CREATE PROCEDURE accurate_auc(relation VARCHAR, observation_col VARCHAR, probability_col VARCHAR, epsilon FLOAT) AS $$
DECLARE
auc_value FLOAT;
previous_auc FLOAT;
nbins INT;
BEGIN
IF epsilon > 0.25 THEN
RAISE EXCEPTION 'epsilon must not be bigger than 0.25';
END IF;
IF epsilon < 1e-12 THEN
RAISE EXCEPTION 'epsilon must be bigger than 1e-12';
END IF;
auc_value := 0.5;
previous_auc := 0; -- epsilon and auc should always be less than 1
nbins := 100;
WHILE abs(auc_value - previous_auc) > epsilon and nbins < 1000000 LOOP
RAISE INFO 'auc_value: %', auc_value;
RAISE INFO 'previous_auc: %', previous_auc;
RAISE INFO 'nbins: %', nbins;
previous_auc := auc_value;
auc_value := EXECUTE 'SELECT auc FROM (select roc(' || QUOTE_IDENT(observation_col) || ',' || QUOTE_IDENT(probability_col) || ' USING parameters num_bins=$1, auc=true) over() FROM ' || QUOTE_IDENT(relation) || ') subq WHERE auc IS NOT NULL' USING nbins;
nbins := nbins * 2;
END LOOP;
RAISE INFO 'Result_auc_value: %', auc_value;
END;
$$;
For example, given the following data in test_data.csv:
1,0,0.186
1,1,0.993
1,1,0.9
1,1,0.839
1,0,0.367
1,0,0.362
0,1,0.6
1,1,0.726
...
(see test_data.csv for the complete set of data)
You can load the data into table categorical_test_data as follows:
=> \set datafile '\'/data/test_data.csv\''
=> CREATE TABLE categorical_test_data(obs INT, pred INT, prob FLOAT);
CREATE TABLE
=> COPY categorical_test_data FROM :datafile DELIMITER ',';
Call accurate_auc(). For this example, the approximated AUC will be within an epsilon of 0.01:
=> CALL accurate_auc('categorical_test_data', 'obs', 'prob', 0.01);
INFO 2005: auc_value: 0.5
INFO 2005: previous_auc: 0
INFO 2005: nbins: 100
INFO 2005: auc_value: 0.749597423510467
INFO 2005: previous_auc: 0.5
INFO 2005: nbins: 200
INFO 2005: Result_auc_value: 0.750402576489533
test_data.csv
1,0,0.186
1,1,0.993
1,1,0.9
1,1,0.839
1,0,0.367
1,0,0.362
0,1,0.6
1,1,0.726
0,0,0.087
0,0,0.004
0,1,0.562
1,0,0.477
0,0,0.258
1,0,0.143
0,0,0.403
1,1,0.978
1,1,0.58
1,1,0.51
0,0,0.424
0,1,0.546
0,1,0.639
0,1,0.676
0,1,0.639
1,1,0.757
1,1,0.883
1,0,0.301
1,1,0.846
1,0,0.129
1,1,0.76
1,0,0.351
1,1,0.803
1,1,0.527
1,1,0.836
1,0,0.417
1,1,0.656
1,1,0.977
1,1,0.815
1,1,0.869
0,0,0.474
0,0,0.346
1,0,0.188
0,1,0.805
1,1,0.872
1,0,0.466
1,1,0.72
0,0,0.163
0,0,0.085
0,0,0.124
1,1,0.876
0,0,0.451
0,0,0.185
1,1,0.937
1,1,0.615
0,0,0.312
1,1,0.924
1,1,0.638
1,1,0.891
0,1,0.621
1,0,0.421
0,0,0.254
0,0,0.225
1,1,0.577
0,1,0.579
0,1,0.628
0,1,0.855
1,1,0.955
0,0,0.331
1,0,0.298
0,0,0.047
0,0,0.173
1,1,0.96
0,0,0.481
0,0,0.39
0,0,0.088
1,0,0.417
0,0,0.12
1,1,0.871
0,1,0.522
0,0,0.312
1,1,0.695
0,0,0.155
0,0,0.352
1,1,0.561
0,0,0.076
0,1,0.923
1,0,0.169
0,0,0.032
1,1,0.63
0,0,0.126
0,0,0.15
1,0,0.348
0,0,0.188
0,1,0.755
1,1,0.813
0,0,0.418
1,0,0.161
1,0,0.316
0,1,0.558
1,1,0.641
1,0,0.305
4 - User-defined extensions
A user-defined extension (UDx) is a component that expands Vertica functionality—for example, new types of data analysis and the ability to parse and load new types of data.
A user-defined extension (UDx) is a component that expands Vertica functionality—for example, new types of data analysis and the ability to parse and load new types of data.
This section provides an overview of how to install and use a UDx. If you are using a UDx developed by a third party, consult its documentation for detailed installation and usage instructions.
4.1 - Loading UDxs
User-defined extensions (UDxs) are contained in libraries.
User-defined extensions (UDxs) are contained in libraries. A library can contain multiple UDxs. To add UDxs to Vertica, you must:
-
Deploy the library (once per library).
-
Create each UDx (once per UDx).
If you are using UDxs written in Java, you must also set up a Java runtime environment. See Installing Java on Vertica hosts.
Deploying libraries
To deploy a library to your Vertica database:
-
Copy the UDx shared library file (.so), Python file, Java JAR file, or R functions file that contains your function to a node on your Vertica cluster. You do not need to copy it to every node.
-
Connect to the node where you copied the library (for example, using vsql).
-
Add your library to the database catalog using the CREATE LIBRARY statement.
=> CREATE LIBRARY libname AS '/path_to_lib/filename'
LANGUAGE 'language';
libname is the name you want to use to reference the library. path_to_lib/filename is the fully-qualified path to the library or JAR file you copied to the host. language is the implementation language.
For example, if you created a JAR file named TokenizeStringLib.jar and copied it to the dbadmin account's home directory, you would use this command to load the library:
=> CREATE LIBRARY tokenizelib AS '/home/dbadmin/TokenizeStringLib.jar'
LANGUAGE 'Java';
You can load any number of libraries into Vertica.
Privileges
Superusers can create, modify, and drop any library. Users with the UDXDEVELOPER role or explicit grants can also act on libraries.
Creating UDx functions
After the library is loaded, define individual UDxs using SQL statements such as CREATE FUNCTION and CREATE SOURCE. These statements assign SQL function names to the extension classes in the library and add the UDx to the database catalog, where it remains available after a database restart.
The statement you use depends on the type of UDx you are declaring. For example, you use CREATE FUNCTION for scalar functions, CREATE TRANSFORM FUNCTION for transform functions, and CREATE SOURCE for user-defined load sources.
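For example, the following statement (with illustrative names) declares a scalar function from a factory class in a previously loaded library:
=> CREATE FUNCTION add2ints AS LANGUAGE 'C++' NAME 'Add2IntsFactory' LIBRARY ScalarFunctions;
CREATE FUNCTION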
If a UDx of the given name already exists, you can replace it or instruct Vertica to not replace it. To replace it, use the OR REPLACE syntax, as in the following example:
=> CREATE OR REPLACE TRANSFORM FUNCTION tokenize
AS LANGUAGE 'C++' NAME 'TokenFactory' LIBRARY TransformFunctions;
CREATE TRANSFORM FUNCTION
You might want to replace an existing function to change between fenced and unfenced modes.
Alternatively, you can use IF NOT EXISTS to prevent the function from being created again if it already exists. You might want to use this in upgrade or test scripts that require, and therefore load, UDxs. By using IF NOT EXISTS, you preserve the original definition including fenced status. The following example shows this syntax:
--- original creation:
=> CREATE TRANSFORM FUNCTION tokenize
AS LANGUAGE 'C++' NAME 'TokenFactory' LIBRARY TransformFunctions NOT FENCED;
CREATE TRANSFORM FUNCTION
--- function is not replaced (and is still unfenced):
=> CREATE TRANSFORM FUNCTION IF NOT EXISTS tokenize
AS LANGUAGE 'C++' NAME 'TokenFactory' LIBRARY TransformFunctions FENCED;
CREATE TRANSFORM FUNCTION
After you add the UDx to the database, you can use your extension within SQL statements. The database superuser can grant access privileges to the UDx for users. See GRANT (user defined extension) for details.
When you call a UDx, Vertica creates an instance of the UDx class on each node in the cluster and provides it with the data it needs to process.
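For example, once a scalar UDx named add2ints has been created as sketched above, you can call it like a built-in function; a hypothetical invocation:
=> SELECT add2ints(27, 15);
 add2ints
----------
       42
(1 row)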
4.2 - Installing Java on Vertica hosts
If you are using UDxs written in Java, follow the instructions in this section.
If you are using UDxs written in Java, follow the instructions in this section.
You must install a Java Virtual Machine (JVM) on every host in your cluster in order for Vertica to be able to execute your Java UDxs.
Installing Java on your Vertica cluster is a two-step process:
-
Install a Java runtime on all of the hosts in your cluster.
-
Set the JavaBinaryForUDx configuration parameter to tell Vertica the location of the Java executable.
Installing a Java runtime
For Java-based features, Vertica requires a 64-bit Java 6 (Java version 1.6) or later Java runtime. Vertica supports runtimes from either Oracle or OpenJDK. You can choose to install either the Java Runtime Environment (JRE) or Java Development Kit (JDK), since the JDK also includes the JRE.
Many Linux distributions include a package for the OpenJDK runtime. See your Linux distribution's documentation for information about installing and configuring OpenJDK.
To install the Oracle Java runtime, see the Java Standard Edition (SE) Download Page. You usually run the installation package as root in order to install it. See the download page for instructions.
Once you have installed a JVM on each host, ensure that the java command is in the search path and calls the correct JVM by running the command:
$ java -version
This command should print something similar to:
java version "1.8.0_102"
Java(TM) SE Runtime Environment (build 1.8.0_102-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.102-b14, mixed mode)
Note
Any previously installed Java VM on your hosts may interfere with a newly installed Java runtime. See your Linux distribution's documentation for instructions on configuring which JVM is the default. Unless absolutely required, you should uninstall any incompatible version of Java before installing the Java 6 or Java 7 runtime.
Setting the JavaBinaryForUDx configuration parameter
The JavaBinaryForUDx configuration parameter tells Vertica where to look for the JRE to execute Java UDxs. After you have installed the JRE on all of the nodes in your cluster, set this parameter to the absolute path of the Java executable. You can use the symbolic link that some Java installers create (for example, /usr/bin/java). If the Java executable is in your shell search path, you can get its path by running the following command from the Linux command line shell:
$ which java
/usr/bin/java
If the java command is not in the shell search path, use the path to the Java executable in the directory where you installed the JRE. Suppose you installed the JRE in /usr/java/default (which is where the installation package supplied by Oracle installs the Java 1.6 JRE). In this case, the Java executable is /usr/java/default/bin/java.
You set the configuration parameter by executing the following statement as a database superuser:
=> ALTER DATABASE DEFAULT SET PARAMETER JavaBinaryForUDx = '/usr/bin/java';
See ALTER DATABASE for more information on setting configuration parameters.
To view the current setting of the configuration parameter, query the CONFIGURATION_PARAMETERS system table:
=> \x
Expanded display is on.
=> SELECT * FROM CONFIGURATION_PARAMETERS WHERE parameter_name = 'JavaBinaryForUDx';
-[ RECORD 1 ]-----------------+----------------------------------------------------------
node_name | ALL
parameter_name | JavaBinaryForUDx
current_value | /usr/bin/java
default_value |
change_under_support_guidance | f
change_requires_restart | f
description | Path to the java binary for executing UDx written in Java
Once you have set the configuration parameter, Vertica can find the Java executable on each node in your cluster.
Note
Since the location of the Java executable is set by a single configuration parameter for the entire cluster, you must ensure that the Java executable is installed in the same path on all of the hosts in the cluster.
4.3 - UDx restrictions
Some UDx types have special considerations or restrictions.
Some UDx types have special considerations or restrictions.
UDxs written in Java and R do not support complex types.
Aggregate functions
You cannot use the DISTINCT clause in queries with more than one aggregate function, and you cannot provide inputs or return values containing complex types.
Analytic functions
UDAnFs do not support framing windows using ROWS.
Only UDAnFs written in C++ can use complex types.
As with Vertica's built-in analytic functions, UDAnFs cannot be used with MATCH clause functions.
Scalar functions
If the result of applying a UDSF is an invalid record, COPY aborts the load even if CopyFaultTolerantExpressions is set to true.
A ROW returned from a UDSF cannot be used as an argument to COUNT.
Transform functions
A query that includes a UDTF cannot:
-
Include statements other than the SELECT statement that calls the UDTF and a PARTITION BY expression unless the UDTF is marked as a one-to-many UDTF
-
Call an analytic function
-
Call another UDTF
-
Include one of the following clauses:
Load functions
Installing an untrusted UDL function can compromise the security of the server. UDxs can contain arbitrary code. In particular, user-defined source functions can read data from any arbitrary location. It is up to the developer of the function to enforce proper security limitations. Superusers must not grant access to UDxs to untrusted users.
You cannot ALTER UDL functions.
UDFilter and UDSource functions do not support complex types.
4.4 - Fenced and unfenced modes
User-defined extensions (UDxs) written in the C++ programming language have the option of running in fenced or unfenced mode.
User-defined extensions (UDxs) written in the C++ programming language have the option of running in fenced or unfenced mode. Fenced mode runs the UDx code outside of the main Vertica process in a separate zygote process. UDxs that use unfenced mode run directly within the Vertica process.
Fenced mode
You can run most C++ UDxs in fenced mode. Fenced mode uses a separate zygote process, so fenced UDx crashes do not impact the core Vertica process. There is a small performance impact when running UDx code in fenced mode. On average, using fenced mode adds about 10% more time to execution compared to unfenced mode.
Fenced mode is currently available for all C++ UDxs with the exception of user-defined aggregates. All UDxs developed in the Python, R, and Java programming languages must run in fenced mode, since the Python, R, and Java runtimes cannot run directly within the Vertica process.
Using fenced mode does not affect the development of your UDx. Fenced mode is enabled by default for UDxs that support it. Optionally, you can issue the CREATE FUNCTION command with the NOT FENCED modifier to disable fenced mode for the function. You can also enable or disable fenced mode on any C++ UDx that supports fenced mode by using the ALTER FUNCTION command.
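For example, the following sketch (with illustrative names) creates a C++ function unfenced and later switches it to fenced mode with ALTER FUNCTION:
=> CREATE FUNCTION add2ints AS LANGUAGE 'C++' NAME 'Add2IntsFactory' LIBRARY ScalarFunctions NOT FENCED;
=> ALTER FUNCTION add2ints(int, int) SET FENCED true;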
Unfenced mode
Unfenced UDxs run within Vertica, so they have little overhead, and can perform almost as fast as Vertica's own built-in functions. However, because they run within Vertica directly, any bugs in their code (memory leaks, for example) can destabilize the main Vertica process and bring one or more database nodes down.
Important
Always carefully review code downloaded from third-party sources and test UDxs in a test or staging environment before deciding to run them in unfenced mode in production.
About the zygote process
The Vertica zygote process starts when Vertica starts. Each node has a single zygote process. Side processes are created "on demand". The zygote listens for requests and spawns a UDx side session that runs the UDx in fenced mode when a UDx is called by the user.
About fenced mode logging
UDx code that runs in fenced mode is logged in UDxZygote.log, which is stored in the UDxLogs directory in the Vertica catalog directory. Log entries for the side process are denoted by the UDx language (for example, C++), node, zygote process ID, and UDxSideProcess ID.
For example, the following query returns the current fenced processes:
=> SELECT * FROM UDX_FENCED_PROCESSES;
node_name | process_type | session_id | pid | port | status
------------------+------------------+----------------------------------+-------+-------+--------
v_vmart_node0001 | UDxZygoteProcess | | 27468 | 51900 | UP
v_vmart_node0001 | UDxSideProcess | localhost.localdoma-27465:0x800b | 5677 | 44123 | UP
Below is the corresponding log file for the fenced processes returned in the previous query:
2016-05-16 11:24:43.990 [C++-localhost.localdoma-27465:0x800b-5677] 0x2b3ff17e7fd0 UDx side process started
11:24:43.996 [C++-localhost.localdoma-27465:0x800b-5677] 0x2b3ff17e7fd0 Finished setting up signal handlers.
11:24:43.996 [C++-localhost.localdoma-27465:0x800b-5677] 0x2b3ff17e7fd0 My port: 44123
11:24:43.996 [C++-localhost.localdoma-27465:0x800b-5677] 0x2b3ff17e7fd0 My address: 0.0.0.0
11:24:43.996 [C++-localhost.localdoma-27465:0x800b-5677] 0x2b3ff17e7fd0 Vertica port: 51900
11:24:43.996 [C++-localhost.localdoma-27465:0x800b-5677] 0x2b3ff17e7fd0 Vertica address: 127.0.0.1
11:25:19.749 [C++-localhost.localdoma-27465:0x800b-5677] 0x41837940 Setting memory resource limit to -1
11:30:11.523 [C++-localhost.localdoma-27465:0x800b-5677] 0x41837940 Exiting UDx side process
The last line indicates that the side process was killed. In this case it was killed when the user session (vsql) closed.
About fenced mode configuration parameters
Fenced mode supports the following configuration parameters:
-
FencedUDxMemoryLimitMB: The maximum memory size, in MB, to use for fenced-mode processes. The default is -1 (no limit). The side process is killed if this limit is exceeded.
-
ForceUDxFencedMode: When set to 1, forces all UDxs that support fenced mode to run in fenced mode, even if their definition specifies NOT FENCED. The default is 0 (disabled).
-
UDxFencedBlockTimeout: The maximum time, in seconds, that the Vertica server waits for a UDx to return before aborting with ERROR 3399. The default is 60.
4.5 - Updating UDx libraries
There are two cases where you need to update libraries that you have already deployed.
There are two cases where you need to update libraries that you have already deployed:
-
When you have upgraded Vertica to a new version that contains changes to the SDK API. For your libraries to work with the new server version, you need to recompile them with new version of the SDK. See UDx library compatibility with new server versions for more information.
-
When you have made changes to your UDxs and you want to deploy these changes. Before updating your UDx library, you need to determine if you have changed the signature of any of the functions contained in the library. If you have, you need to drop the functions from the Vertica catalog before you update the library.
4.5.1 - UDx library compatibility with new server versions
The Vertica SDK defines an application programming interface (API) that UDxs use to interact with the database.
The Vertica SDK defines an application programming interface (API) that UDxs use to interact with the database. When developers compile their UDx code, it is linked to the SDK code to form a library. This library is only compatible with Vertica servers that support the version of the SDK API used to compile the code. The library and servers that share the same API version are compatible on a binary level (referred to as "binary compatible").
The Vertica server returns an error message if you attempt to load a library that is not binary compatible with it. Similarly, if you upgrade your Vertica server to a version that supports a new SDK API, any existing UDx that relies on newly-incompatible libraries returns an error message when you call it:
ERROR 2858: Could not find function definition
HINT:
This usually happens due to missing or corrupt libraries, libraries built
with the wrong SDK version, or due to a concurrent session dropping the library
or function. Try recreating the library and function
To resolve this issue, you must install UDx libraries that have been recompiled with the correct version of the SDK.
New versions of the Vertica server do not always change the SDK API version. The SDK API version changes whenever OpenText changes the components that make up the SDK. If the SDK API does not change in a new version of the server, then the old libraries remain compatible with the new server.
The SDK API almost always changes in Vertica releases (major, minor, service pack) as OpenText expands the SDK's features. Vertica will never change the API in a hotfix patch.
These policies mean that you must update your UDx libraries whenever you upgrade to a release that changes the SDK API, which includes major, minor, and service pack releases. For example, if you upgrade from version 10.0 to 10.1, you must update your UDx libraries.
Note
A UDx written in a scripting language has no compiled binary, and so does not need to maintain binary compatibility from one version to another. UDxs written in scripting languages only become incompatible if the APIs used in the SDK actually change. For example, if the number of arguments to an API call changes, a UDx has to be changed to use the new number of arguments.
Pre-upgrade steps
Before upgrading your Vertica server, consider whether you have any UDx libraries that may be incompatible with the new version. Consult the release notes of the new server version to determine whether the SDK API has changed between the version of Vertica server you currently have installed and the new version. As mentioned previously, only upgrades from a previous major version or from the initial release of a major version to a service pack release can cause your currently-loaded UDx libraries to become incompatible with the server.
Any UDx libraries that are incompatible with the new version of the Vertica server must be recompiled. If you got the UDx library from a third party, you need to see if a new version has been released. If so, deploy the new version after you have upgraded the server (see Deploying a new version of your UDx library).
If you developed the UDx yourself (or if you have the source code) you must:
-
Recompile your UDx library using the new version of the Vertica SDK. See Compiling your C++ library or Compiling and packaging a Java library for more information.
-
Deploy the new version of your library. See Deploying a new version of your UDx library.
4.5.2 - Determining if a UDx signature has changed
You need to be careful when making changes to UDx libraries that contain functions you have already deployed in your Vertica database.
You need to be careful when making changes to UDx libraries that contain functions you have already deployed in your Vertica database. When you deploy a new version of your UDx library, Vertica does not verify that the signatures of the functions defined in the library match the signatures of the functions already defined in the Vertica catalog. If you change the signature of a UDx in the library and then update the library in the Vertica database, calls to the altered UDx will produce errors.
Making any of the following changes to a UDx alters its signature:
-
Changing the number of arguments or the data type of any argument accepted by your function (not including polymorphic functions).
-
Changing the number or data types of any return values or output columns.
-
Changing the name of the factory class that Vertica uses to create an instance of your function code.
-
Changing the null handling or volatility behavior of your function.
-
Removing the function's factory class from the library completely.
The following changes do not alter the signature of your function, and do not require you to drop the function before updating the library:
-
Changing the number or type of arguments handled by a polymorphic function. Vertica does not process the arguments the user passes to a polymorphic function.
-
Changing the name, data type, or number of parameters accepted by your function. The parameters your function accepts are not determined by the function signature. Instead, Vertica passes all of the parameters the user included in the function call, and your function processes them at runtime. See UDx parameters for more information about parameters.
-
Changing any of the internal processing performed by your function.
-
Adding new UDxs to the library.
After you drop any functions whose signatures have changed, you load the new library file, then re-create your altered functions. If you have not made any changes to the signature of your UDxs, you can just update the library file in your Vertica database without having to drop or alter your function definitions. As long as the UDx definitions in the Vertica catalog match the signatures of the functions in your library, function calls will work transparently after you have updated the library. See Deploying a new version of your UDx library.
4.5.3 - Deploying a new version of your UDx library
You need to deploy a new version of your UDx library if you have recompiled it for a new server version or changed the UDxs it contains.
You need to deploy a new version of your UDx library if you have recompiled it to work with a new version of the Vertica server, or if you have changed the UDxs that it contains. The process of deploying a new version of your library is similar to deploying it initially.
-
If you are deploying a UDx library developed in C++ or Java, you must compile it with the current version of the Vertica SDK.
-
Copy your UDx's library file (a .so file for libraries developed in C++, a .py file for libraries developed in Python, or a .jar file for libraries developed in Java) or R source file to a host in your Vertica database.
-
Connect to the host using vsql.
-
If you have changed the signature of any of the UDxs in the shared library, you must drop them using DROP statements such as DROP FUNCTION or DROP SOURCE. If you are unsure whether any of the signatures of your functions have changed, see Determining if a UDx signature has changed.
Note
If all of the UDx signatures in your library have changed, you may find it more convenient to drop the library using the DROP LIBRARY statement with the CASCADE option, which drops the library and all of the functions and loaders that reference it. Dropping the library can save you the time it would take to drop each UDx individually. You can then reload the library and recreate all of the extensions using the same process you used to deploy the library in the first place. See CREATE LIBRARY.
-
Use the ALTER LIBRARY statement to update the UDx library definition with the file you copied in step 1. For example, if you want to update the library named ScalarFunctions with a file named ScalarFunctions-2.0.so in the dbadmin user's home directory, you could use the command:
=> ALTER LIBRARY ScalarFunctions AS '/home/dbadmin/ScalarFunctions-2.0.so';
After you have updated the UDx library definition to use the new version of your shared library, the UDxs that are defined using classes in your UDx library begin using the new shared library file without any further changes.
-
If you had to drop any functions in step 4, recreate them using the new signature defined by the factory classes in your library. See CREATE FUNCTION statements.
4.6 - Listing the UDxs contained in a library
Once a library has been loaded using the CREATE LIBRARY statement, you can find the UDxs and UDLs it contains by querying the USER_LIBRARY_MANIFEST system table.
Once a library has been loaded using the CREATE LIBRARY statement, you can find the UDxs and UDLs it contains by querying the USER_LIBRARY_MANIFEST system table:
=> CREATE LIBRARY ScalarFunctions AS '/home/dbadmin/ScalarFunctions.so';
CREATE LIBRARY
=> \x
Expanded display is on.
=> SELECT * FROM USER_LIBRARY_MANIFEST WHERE lib_name = 'ScalarFunctions';
-[ RECORD 1 ]-------------------
schema_name | public
lib_name | ScalarFunctions
lib_oid | 45035996273792402
obj_name | RemoveSpaceFactory
obj_type | Scalar Function
arg_types | Varchar
return_type | Varchar
-[ RECORD 2 ]-------------------
schema_name | public
lib_name | ScalarFunctions
lib_oid | 45035996273792402
obj_name | Div2intsInfo
obj_type | Scalar Function
arg_types | Integer, Integer
return_type | Integer
-[ RECORD 3 ]-------------------
schema_name | public
lib_name | ScalarFunctions
lib_oid | 45035996273792402
obj_name | Add2intsInfo
obj_type | Scalar Function
arg_types | Integer, Integer
return_type | Integer
The obj_name column lists the factory classes contained in the library. These are the names you use to define UDxs and UDLs in the database catalog using statements such as CREATE FUNCTION and CREATE SOURCE.
4.7 - Using wildcards in your UDx
Vertica supports wildcard * characters in the place of column names in user-defined functions.
You can use wildcards when:
-
Your query contains a table in the FROM clause
-
You are using a Vertica-supported development language
-
Your UDx is running in fenced or unfenced mode
Supported SQL statements
The following SQL statements can accept wildcards:
-
DELETE
-
INSERT
-
SELECT
-
UPDATE
Unsupported configurations
The following situations do not support wildcards:
-
You cannot pass a wildcard in the OVER clause of a query
-
You cannot use a wildcard with a DROP statement
-
You cannot use wildcards with any other arguments
Examples
These examples show wildcards and user-defined functions in a range of data manipulation operations.
DELETE statements:
=> DELETE FROM tablename WHERE udf(tablename.*) = 5;
INSERT statements:
=> INSERT INTO table1 SELECT udf(*) FROM table2;
SELECT statements:
=> SELECT udf(*) FROM tablename;
=> SELECT udf(tablename.*) FROM tablename;
=> SELECT udf(f.*) FROM table f;
=> SELECT udf(*) FROM table1,table2;
=> SELECT udf1( udf2(*) ) FROM table1,table2;
=> SELECT udf( db.schema.table.*) FROM tablename;
=> SELECT udf(sub.*) FROM (select col1, col2 FROM table) sub;
=> SELECT x FROM tablename WHERE udf(*) = y;
=> WITH sub as (SELECT * FROM tablename) select x, udf(*) FROM sub;
=> SELECT udf( * using parameters x=1) FROM tablename;
=> SELECT udf(table1.*, table2.col2) FROM table1,table2;
UPDATE statements:
=> UPDATE tablename set col1 = 4 FROM tablename WHERE udf(*) = 3;
5 - Developing user-defined extensions (UDxs)
User-defined extensions (UDxs) are functions contained in external libraries that are developed in C++, Python, Java, or R using the Vertica SDK. The external libraries are defined in the Vertica catalog using the CREATE LIBRARY statement. They are best suited for analytic operations that are difficult to perform in SQL, or that need to be performed frequently enough that their speed is a major concern.
The primary strengths of UDxs are:
-
They can be used anywhere an internal function can be used.
-
They take full advantage of Vertica's distributed computing features. The extensions usually execute in parallel on each node in the cluster.
-
They are distributed to all nodes by Vertica. You only need to copy the library to the initiator node.
-
All of the complicated aspects of developing a distributed piece of analytic code are handled for you by Vertica. Your main programming task is to read in data, process it, and then write it out using the Vertica SDK APIs.
There are a few things to keep in mind about developing UDxs:
-
UDxs can be developed in the programming languages C++, Python, Java, and R. (Not all UDx types support all languages.)
-
UDxs written in Java always run in fenced mode, because the Java Virtual Machine that executes Java programs cannot run directly within the Vertica process.
-
UDxs written in Python and R always run in fenced mode.
-
UDxs developed in C++ have the option of running in unfenced mode, which means they load and run directly in the Vertica database process. This option provides the lowest overhead and highest speed. However, any bugs in the UDx's code can cause database instability. You must thoroughly test any UDxs you intend to run in unfenced mode before deploying them in a live environment. Consider whether the performance boost of running a C++ UDx unfenced is worth the potential database instability that a buggy UDx can cause.
-
Because a UDx runs on the Vertica cluster, it can take processor time and memory away from the database processes. A UDx that consumes large amounts of computing resources can negatively impact database performance.
Types of UDxs
Vertica supports five types of user-defined extensions:
-
User-defined scalar functions (UDSFs) take in a single row of data and return a single value. These functions can be used anywhere a native function can be used, except CREATE TABLE BY PARTITION and SEGMENTED BY expressions. UDSFs can be developed in C++, Python, Java, and R.
-
User-defined aggregate functions (UDAFs) allow you to create custom aggregate functions specific to your needs. They read one column of data and return one output column. UDAFs can be developed in C++.
-
User-defined analytic functions (UDAnFs) are similar to UDSFs, in that they read a row of data and return a single row. However, the function can read input rows independently of outputting rows, so that the output values can be calculated over several input rows. The function can be used with the query's OVER() clause to partition rows. UDAnFs can be developed in C++ and Java.
-
User-defined transform functions (UDTFs) operate on table partitions (as specified by the query's OVER() clause) and return zero or more rows of data. The data they return can be an entirely new table, unrelated to the schema of the input table, with its own ordering and segmentation expressions. They can only be used in the SELECT list of a query. UDTFs can be developed in C++, Python, Java, and R.
To optimize query performance, you can use live aggregate projections to pre-aggregate the data that a UDTF returns. For more information, see Pre-aggregating UDTF results.
-
User-defined load (UDL) extensions allow you to create custom sources, filters, and parsers to load data. These extensions can be used in COPY statements. UDLs can be developed in C++, Java, and Python.
While each UDx type has a unique base class, developing them is similar in many ways. Different UDx types can also share the same library.
Structure
Each UDx type consists of two primary classes. The main class does the actual work (a transformation, an aggregation, and so on). The class usually has at least three methods: one to set up, one to tear down (release reserved resources), and one to do the work. Sometimes additional methods are defined.
The main processing method receives an instance of the ServerInterface class as an argument. This object is used by the underlying Vertica SDK code to make calls back into the Vertica process, for example to allocate memory. You can use this class to write to the server log during UDx execution.
The second class is a singleton factory. It defines one method that produces instances of the first class, and might define other methods to manage parameters.
When implementing a UDx you must subclass both classes.
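The following minimal sketch illustrates this two-class structure for a scalar function. The class names (PassThrough, PassThroughFactory) are hypothetical; the SDK classes, methods, and macros shown (ScalarFunction, ScalarFunctionFactory, processBlock(), createScalarFunction(), getPrototype(), vt_createFuncObj, and RegisterFactory) are the same ones used in the complete examples later in this section.
#include "Vertica.h"
using namespace Vertica;

// The main class does the actual work. processBlock() reads each input
// row, processes it, and writes an output value.
class PassThrough : public ScalarFunction
{
public:
    virtual void processBlock(ServerInterface &srvInterface,
                              BlockReader &argReader,
                              BlockWriter &resWriter)
    {
        do {
            // Copy the integer input straight through to the output.
            resWriter.setInt(argReader.getIntRef(0));
            resWriter.next();
        } while (argReader.next());
    }
};

// The singleton factory produces instances of the main class and
// describes the function's argument and return types to Vertica.
class PassThroughFactory : public ScalarFunctionFactory
{
    virtual ScalarFunction *createScalarFunction(ServerInterface &srvInterface)
    {
        return vt_createFuncObj(srvInterface.allocator, PassThrough);
    }
    virtual void getPrototype(ServerInterface &srvInterface,
                              ColumnTypes &argTypes, ColumnTypes &returnType)
    {
        argTypes.addInt();
        returnType.addInt();
    }
};

RegisterFactory(PassThroughFactory);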
Conventions
The C++, Python, and Java APIs are nearly identical. Where possible, this documentation describes these interfaces without respect to language. Documentation specific to C++, Python, or Java is covered in language-specific sections.
Because some documentation is language-independent, it is not always possible to use ideal, language-based terminology. This documentation uses the term "method" to refer to a Java method or a C++ member function.
See also
Loading UDxs
5.1 - Developing with the Vertica SDK
Before you can write a user-defined extension you must set up a development environment. After you do so, a good test is to download, build, and run the published examples.
In addition to covering how to set up your environment, this section covers general information about working with the Vertica SDK, including language-specific considerations.
5.1.1 - Setting up a development environment
Before you start developing your UDx, you need to configure your development and test environments. Development and test environments must use the same operating system and Vertica version as the production environment.
For additional language-specific requirements, see the language-specific SDK sections later in this document.
Development environment options
The language that you use to develop your UDx determines the setup options and requirements for your development environment. C++ developers can use the C++ UDx container, and all developers can use a non-production Vertica environment.
C++ UDx container
C++ developers can develop with the C++ UDx container. The UDx-container GitHub repository provides the tools to build a container that packages the binaries, libraries, and compilers required to develop C++ Vertica extensions. For requirement, build, and test details, including the available build options, see the repository README.
Non-production Vertica environments
You can use a node in a non-production Vertica database or another machine that runs the same operating system and Vertica version as your production environment. For specific requirements and dependencies, refer to Operating System Requirements and Language Requirements.
Test environment options
To test your UDx, you need access to a non-production Vertica database. You have the following options:
- Install a single-node Vertica database on your development machine.
- Download and build a containerized test environment.
Containerized test environments
Vertica provides containerized options that simplify your test environment setup.
Operating system requirements
Develop your UDx code on the same Linux platform that you use for your production Vertica database cluster. CentOS-, Debian-, and SUSE-based operating systems each require that you download additional packages.
CentOS-based operating systems
Installations on the following CentOS-based operating systems require the devtoolset-7 package:
-
CentOS
-
Red Hat Enterprise Linux
-
Oracle Enterprise Linux
Consult the documentation for your operating system for the specific installation command.
Debian- and SUSE-based operating systems
Installations on the following operating systems require the GCC package, version 7 or later:
-
Debian
-
Ubuntu
-
SUSE
-
OpenSUSE
Note
Vertica has not tested UDx builds that use GCC versions later than GCC 8.
Consult the documentation for your operating system for the specific installation command.
5.1.2 - Downloading and running UDx example code
You can download all of the examples shown in this documentation, and many more, from the Vertica GitHub repository. This repository includes examples of all types of UDxs.
You can download the examples in either of two ways:
-
Download the ZIP file. Extract the contents of the file into a directory.
-
Clone the repository. Using a terminal window, run the following command:
$ git clone https://github.com/vertica/UDx-Examples.git
The repository includes a makefile that you can use to compile the C++ and Java examples. It also includes .sql files that load and use the examples. See the README file for instructions on compiling and running the examples. To compile the examples you will need g++ or a JDK and make. See Setting up a development environment for related information.
Running the examples not only helps you understand how a UDx works, but also helps you ensure your development environment is properly set up to compile UDx libraries.
5.1.3 - C++ SDK
The Vertica SDK supports writing both fenced and unfenced UDxs in C++ 11. You can download, compile, and run the examples; see Downloading and running UDx example code. Running the examples is a good way to verify that your development environment has all needed libraries.
If you do not have access to a Vertica test environment, you can install Vertica on your development machine and run a single node. Each time you rebuild your UDx library, you need to re-install it into Vertica.
This section covers C++-specific topics that apply to all UDx types. For information that applies to all languages, see Arguments and return values, UDx parameters, Errors, warnings, and logging, Handling cancel requests and the sections for specific UDx types. For full API documentation, see the C++ SDK Documentation.
5.1.3.1 - Setting up the C++ SDK
The Vertica C++ Software Development Kit (SDK) is distributed as part of the server installation. It contains the source and header files you need to create your UDx library. For examples that you can compile and run, see Downloading and running UDx example code.
Requirements
At a minimum, install the following on your development machine:
-
devtoolset-7 package (CentOS) or GCC package (Debian), including GCC version 7 or later and an up-to-date libstdc++ package.
Note
Vertica has not tested UDx builds that use GCC versions later than GCC 8.
-
g++ and its associated toolchain, such as ld. Some Linux distributions package g++ separately from GCC.
-
A copy of the Vertica SDK.
Note
The Vertica binaries are compiled using the default version of g++ installed on the supported Linux platforms.
You must compile with a -std flag value of c++11 or later.
The following optional software packages can simplify development:
-
make, or some other build-management tool.
-
gdb, or some other debugger.
-
Valgrind, or similar tools that detect memory leaks.
If you want to use any third-party libraries, such as statistical analysis libraries, you need to install them on your development machine. If you do not statically link these libraries into your UDx library, you must install them on every node in the cluster. See Compiling your C++ library for details.
SDK files
The SDK files are located in the sdk subdirectory under the root Vertica server directory (usually /opt/vertica/sdk). This directory contains a subdirectory, include, which contains the headers and source files needed to compile UDx libraries.
There are two files in the include directory you need when compiling your UDx:
-
Vertica.h is the main header file for the SDK. Your UDx code needs to include this file in order to find the SDK's definitions.
-
Vertica.cpp contains support code that needs to be compiled into the UDx library.
Much of the Vertica SDK API is defined in the VerticaUDx.h header file (which is included by the Vertica.h file). If you're curious, you might want to review the contents of this file in addition to reading the API documentation.
Finding the current SDK version
You must develop your UDx using the same SDK version as the database in which you plan to use it. To display the SDK version currently installed on your system, run the following command in vsql:
=> SELECT sdk_version();
Running the examples
You can download the examples from the GitHub repository (see Downloading and running UDx example code). Compiling and running the examples helps you to ensure that your development environment is properly set up.
To compile all of the examples, including the Java examples, issue the following command in the Java-and-C++ directory under the examples directory:
$ make
Note
To compile the examples, you must have a g++ development environment installed. To install a g++ development environment on Red Hat systems, run yum install gcc gcc-c++ make.
5.1.3.2 - Compiling your C++ library
GNU g++ is the only supported compiler for compiling UDx libraries. Always compile your UDx code on the same version of Linux that you use on your Vertica cluster.
When compiling your library, you must always:
-
Compile with a -std flag value of c++11 or later.
-
Pass the -shared and -fPIC flags to the linker. The simplest method is to just pass these flags to g++ when you compile and link your library.
-
Use the -Wno-unused-value flag to suppress warnings when macro arguments are not used. If you do not use this flag, you may get "left-hand operand of comma has no effect" warnings.
-
Compile sdk/include/Vertica.cpp and link it into your library. This file contains support routines that help your UDx communicate with Vertica. The easiest way to do this is to include it in the g++ command to compile your library. Vertica supplies this file as C++ source rather than a library to limit library compatibility issues.
-
Add the Vertica SDK include directory to the include search path using the g++ -I flag.
The SDK examples include a working makefile. See Downloading and running UDx example code.
Example of compiling a UDx
The following command compiles a UDx contained in a single source file named MyUDx.cpp into a shared library named MyUDx.so:
g++ -I /opt/vertica/sdk/include -Wall -shared -Wno-unused-value \
-fPIC -o MyUDx.so MyUDx.cpp /opt/vertica/sdk/include/Vertica.cpp
Important
Vertica only supports UDx development on 64-bit architectures.
After you debug your UDx, you are ready to deploy it. Recompile your UDx using the -O3 flag to enable compiler optimization.
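For example, an optimized build of the MyUDx example above uses the same command with -O3 added:
g++ -O3 -I /opt/vertica/sdk/include -Wall -shared -Wno-unused-value \
    -fPIC -o MyUDx.so MyUDx.cpp /opt/vertica/sdk/include/Vertica.cpp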
You can add additional source files to your library by adding them to the command line. You can also compile them separately and then link them together.
Tip
The examples subdirectory in the Vertica SDK directory contains a makefile that you can use as a starting point for your own UDx project.
Handling external libraries
You must link your UDx library to any supporting libraries that your UDx code relies on. These libraries might be ones you developed or others provided by third parties. You have two options for linking:
-
Statically link the support libraries into your UDx. The benefit of this method is that your UDx library does not rely on external files. Having a single UDx library file simplifies deployment because you just transfer a single file to your Vertica cluster. This method's main drawback is that it increases the size of your UDx library file. An example command follows this list.
-
Dynamically link the library to your UDx. You must sometimes use dynamic linking if a third-party library does not allow static linking. In this case, you must copy the libraries to your Vertica cluster in addition to your UDx library file.
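For example, to statically link a hypothetical third-party archive named libstats.a into the UDx, list it in the compile command (for this to work in a shared object, the archive must have been built as position-independent code):
g++ -I /opt/vertica/sdk/include -Wall -shared -Wno-unused-value \
    -fPIC -o MyUDx.so MyUDx.cpp /opt/vertica/sdk/include/Vertica.cpp \
    /path/to/libstats.a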
5.1.3.3 - Adding metadata to C++ libraries
You can add metadata, such as author name, the version of the library, a description of your library, and so on to your library. This metadata lets you track the version of your function that is deployed on a Vertica Analytic Database cluster and lets third-party users of your function know who created the function. Your library's metadata appears in the USER_LIBRARIES system table after your library has been loaded into the Vertica Analytic Database catalog.
You declare the metadata for your library by calling the RegisterLibrary() function in one of the source files for your UDx. If there is more than one call to this function in your library's source files, the last one processed when Vertica Analytic Database loads the library determines the library's metadata.
The RegisterLibrary() function takes eight string parameters:
RegisterLibrary(author,
library_build_tag,
library_version,
library_sdk_version,
source_url,
description,
licenses_required,
signature);
-
author contains whatever name you want associated with the creation of the library (your own name or your company's name, for example).
-
library_build_tag is a string you want to use to represent the specific build of the library (for example, the SVN revision number or a timestamp of when the library was compiled). This is useful for tracking instances of your library as you are developing them.
-
library_version is the version of your library. You can use whatever numbering or naming scheme you want.
-
library_sdk_version is the version of the Vertica Analytic Database SDK Library for which you've compiled the library.
Note
This field isn't used to determine whether a library is compatible with a version of the Vertica Analytic Database server. The version of the Vertica Analytic Database SDK you use to compile your library is embedded in the library when you compile it. It is this information that the Vertica Analytic Database server uses to determine if your library is compatible with it.
-
source_url is a URL where users of your function can find more information about it. This can be your company's website, the GitHub page hosting your library's source code, or whatever site you like.
-
description is a concise description of your library.
-
licenses_required is a placeholder for licensing information. You must pass an empty string for this value.
-
signature is a placeholder for a signature that will authenticate your library. You must pass an empty string for this value.
For example, the following code demonstrates adding metadata to the Add2Ints example (see C++ example: Add2Ints).
// Register the factory with Vertica
RegisterFactory(Add2IntsFactory);
// Register the library's metadata.
RegisterLibrary("Whizzo Analytics Ltd.",
"1234",
"2.0",
"7.0.0",
"http://www.example.com/add2ints",
"Add 2 Integer Library",
"",
"");
Loading the library and querying the USER_LIBRARIES system table shows the metadata supplied in the call to RegisterLibrary():
=> CREATE LIBRARY add2intslib AS '/home/dbadmin/add2ints.so';
CREATE LIBRARY
=> \x
Expanded display is on.
=> SELECT * FROM USER_LIBRARIES WHERE lib_name = 'add2intslib';
-[ RECORD 1 ]-----+----------------------------------------
schema_name | public
lib_name | add2intslib
lib_oid | 45035996273869808
author | Whizzo Analytics Ltd.
owner_id | 45035996273704962
lib_file_name | public_add2intslib_45035996273869808.so
md5_sum | 732c9e145d447c8ac6e7304313d3b8a0
sdk_version | v7.0.0-20131105
revision | 125200
lib_build_tag | 1234
lib_version | 2.0
lib_sdk_version | 7.0.0
source_url | http://www.example.com/add2ints
description | Add 2 Integer Library
licenses_required |
signature |
5.1.3.4 - C++ SDK data types
The Vertica SDK has typedefs and classes for representing Vertica data types within your UDx code. Using these typedefs ensures data type compatibility between the data your UDx processes and generates and the Vertica database. The following table describes some of the typedefs available. Consult the C++ SDK Documentation for a complete list, as well as lists of helper functions to convert and manipulate these data types.
For information about SDK support for complex data types, see Complex Types as Arguments and Return Values.
| Type Definition | Description |
| --- | --- |
| Interval | A Vertica interval. |
| IntervalYM | A Vertica year-to-month interval. |
| Timestamp | A Vertica timestamp. |
| vint | A standard Vertica 64-bit integer. |
| vbool | A Boolean value in Vertica. |
| vbool_null | A null value for the Boolean data type. |
| vfloat | A Vertica floating point value. |
| VString | String data types (such as varchar and char). Note: Do not use a VString object to hold an intermediate result; use a std::string or char[] instead (a sketch follows the notes below). |
| VNumeric | Fixed-point data types from Vertica. |
| VUuid | A Vertica universally unique identifier. |
Notes
-
When making some Vertica SDK API calls (such as VerticaType::getNumericLength()) on objects, make sure they have the correct data type. To minimize overhead and improve performance, most of the APIs do not check the data types of the objects on which they are called. Calling a function on an incorrect data type can result in an error.
-
You cannot create instances of VString or VNumeric yourself. You can manipulate the values of existing objects of these classes that Vertica passes to your UDx, and extract values from them. However, only Vertica can instantiate these classes.
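For example, following the note about VString in the table above, a processBlock() fragment might copy the input into a std::string for intermediate work, then copy the final value back into the output VString. This is a minimal sketch; it assumes <algorithm> is included and that argReader and resWriter are the usual processBlock() arguments:
// Read the input string into a std::string for intermediate work.
const VString &input = argReader.getStringRef(0);
std::string work = input.str();
// Remove all spaces as an example transformation.
work.erase(std::remove(work.begin(), work.end(), ' '), work.end());
// Copy the final value into the output VString.
resWriter.getStringRef().copy(work);
resWriter.next();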
5.1.3.5 - Resource use for C++ UDxs
Your UDxs consume at least a small amount of memory by instantiating classes and creating local variables. This basic memory usage by UDxs is small enough that you do not need to be concerned about it.
If your UDx needs to allocate more than one or two megabytes of memory for data structures, or requires access to additional resources such as files, you must inform Vertica about its resource use. Vertica can then ensure that the resources your UDx requires are available before running a query that uses it. Even moderate memory use (10MB per invocation of a UDx, for example) can become an issue if there are many simultaneous queries that call it.
Note
If your UDx allocates its own memory, you must make absolutely sure it properly frees it. Failing to free even a single byte of allocated memory can have significant consequences at scale. Instead of having your code allocate its own memory, you should use the C++ vt_alloc macro, which uses Vertica's own memory manager to allocate and track memory. This memory is guaranteed to be properly disposed of when your UDx completes execution. See Allocating resources for UDxs for more information.
5.1.3.5.1 - Allocating resources for UDxs
You have two options for allocating memory and file handles for your user-defined extensions (UDxs):
-
Use Vertica SDK macros to allocate resources. This is the best method, since it uses Vertica's own resource manager, and guarantees that resources used by your UDx are reclaimed. See Allocating resources with the SDK macros.
-
While not the recommended option, you can allocate resources in your UDxs yourself using standard C++ methods (instantiating objects using new, allocating memory blocks using malloc(), etc.). You must manually free these resources before your UDx exits.
Note
You must be extremely careful if you choose to allocate your own resources in your UDx. Failing to free resources properly will have significant negative impact, especially if your UDx is running in unfenced mode.
Whichever method you choose, you usually allocate resources in a function named setup() in your UDx class. This function is called after your UDx function object is instantiated, but before Vertica calls it to process data.
If you allocate memory on your own in the setup() function, you must free it in a corresponding function named destroy(). This function is called after your UDx has performed all of its processing. This function is also called if your UDx returns an error (see Handling errors).
Note
Always use the setup() and destroy() functions to allocate and free resources instead of your own constructors and destructors. The memory for your UDx object is allocated from one of Vertica's own memory pools. Vertica always calls your UDx's destroy() function before it deallocates the object's memory. There is no guarantee that your UDx's destructor will be called before the object is deallocated. Using the destroy() function ensures that your UDx has a chance to free its allocated resources before it is destroyed.
The following code fragment demonstrates allocating and freeing memory using a setup() and destroy() function.
class MemoryAllocationExample : public ScalarFunction
{
public:
uint64* myarray;
// Called before running the UDF to allocate memory used throughout
// the entire UDF processing.
virtual void setup(ServerInterface &srvInterface, const SizedColumnTypes
&argTypes)
{
try
{
// Allocate an array. This memory is directly allocated, rather than
// letting Vertica do it. Remember to properly calculate the amount
// of memory you need based on the data type you are allocating.
// This example divides 500MB by 8, since that's the number of
// bytes in a 64-bit unsigned integer.
myarray = new uint64[1024 * 1024 * 500 / 8];
}
catch (std::bad_alloc &ba)
{
// Always check for exceptions caused by failed memory
// allocations.
vt_report_error(1, "Couldn't allocate memory :[%s]", ba.what());
}
}
// Called after the UDF has processed all of its information. Use to free
// any allocated resources.
virtual void destroy(ServerInterface &srvInterface, const SizedColumnTypes
&argTypes)
{
// srvInterface.log("RowNumber processed %d records", *count_ptr);
try
{
// Properly dispose of the allocated memory.
delete[] myarray;
}
catch (std::bad_alloc &ba)
{
// Always check for exceptions caused by failed memory
// allocations.
vt_report_error(1, "Couldn't free memory :[%s]", ba.what());
}
    }
};
5.1.3.5.2 - Allocating resources with the SDK macros
The Vertica SDK provides three macros to allocate memory:
-
vt_alloc allocates a block of memory to fit a specific data type (vint, struct, etc.).
-
vt_allocArray allocates a block of memory to hold an array of a specific data type.
-
vt_allocSize allocates an arbitrarily-sized block of memory.
All of these macros allocate their memory from memory pools managed by Vertica. The main benefit of allowing Vertica to manage your UDx's memory is that the memory is automatically reclaimed after your UDx has finished. This ensures that there are no memory leaks in your UDx.
Because Vertica frees this memory automatically, do not attempt to free any of the memory you allocate through any of these macros. Attempting to free this memory results in run-time errors.
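As an illustration, a setup() method might allocate its scratch array through the SDK instead of with new. This is a sketch only: it assumes a vt_allocArray(allocator, type, count) form with the allocator taken from the ServerInterface, as in the vt_createFuncObj calls shown elsewhere in this section; check Vertica.h in your SDK version for the exact macro signatures.
virtual void setup(ServerInterface &srvInterface, const SizedColumnTypes &argTypes)
{
    // Allocate 1024 vints from a Vertica-managed memory pool.
    // Vertica reclaims this memory when the UDx finishes, so there is
    // no matching free or delete, and no destroy() cleanup is needed.
    myarray = vt_allocArray(srvInterface.allocator, vint, 1024);
}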
5.1.3.5.3 - Informing Vertica of resource requirements
When you run your UDx in fenced mode, Vertica monitors its use of memory and file handles. If your UDx uses more than a few megabytes of memory or any file handles, it should tell Vertica about its resource requirements. Knowing the resource requirements of your UDx allows Vertica to determine whether it can run the UDx immediately or needs to queue the request until enough resources become available to run it.
Determining how much memory your UDx requires can be difficult in some cases. For example, if your UDx extracts unique data elements from a data set, there is potentially no bound on the number of data items. In this case, a useful technique is to run your UDx in a test environment and monitor its memory use on a node as it handles several differently-sized queries, then extrapolate its memory use based on the worst-case scenario it may face in your production environment. In all cases, it's usually a good idea to add a safety margin to the amount of memory you tell Vertica your UDx uses.
Note
The information on your UDx's resource needs that you pass to Vertica is used when planning the query execution. There is no way to change the amount of resources your UDx requests from Vertica while the UDx is actually running.
Your UDx informs Vertica of its resource needs by implementing the getPerInstanceResources() function in its factory class (see Vertica::UDXFactory::getPerInstanceResources() in the SDK documentation). If your UDx's factory class implements this function, Vertica calls it to determine the resources your UDx requires.
The getPerInstanceResources() function receives an instance of the Vertica::VResources struct. This struct contains fields that set the amount of memory and the number of file handles your UDx needs. By default, the Vertica server allocates zero bytes of memory and 100 file handles for each instance of your UDx.
Your implementation of the getPerInstanceResources() function sets the fields in the VResources struct based on the maximum resources your UDx may consume for each instance of the UDx function. So, if your UDx's processBlock() function creates a data structure that uses at most 100MB of memory, your UDx must set the VResources.scratchMemory field to at least 104857600 (the number of bytes in 100MB). Leave yourself a safety margin by increasing the number beyond what your UDx should normally consume. In this example, allocating 115000000 bytes (just under 110MB) is a good idea.
The following ScalarFunctionFactory class demonstrates implementing getPerInstanceResources() to inform Vertica about the memory requirements of the MemoryAllocationExample class shown in Allocating resources for UDxs. It tells Vertica that the UDSF requires 510MB of memory (which is a bit more than the UDSF actually allocates, to be on the safe side).
class MemoryAllocationExampleFactory : public ScalarFunctionFactory
{
virtual Vertica::ScalarFunction *createScalarFunction(Vertica::ServerInterface
&srvInterface)
{
return vt_createFuncObj(srvInterface.allocator, MemoryAllocationExample);
}
virtual void getPrototype(Vertica::ServerInterface &srvInterface,
Vertica::ColumnTypes &argTypes,
Vertica::ColumnTypes &returnType)
{
argTypes.addInt();
argTypes.addInt();
returnType.addInt();
}
// Tells Vertica the amount of resources that this UDF uses.
virtual void getPerInstanceResources(ServerInterface &srvInterface,
VResources &res)
{
res.scratchMemory += 1024LL * 1024 * 510; // request 510MB of memory
}
};
5.1.3.5.4 - Setting memory limits for fenced-mode UDxs
Vertica calls a fenced-mode UDx's implementation of Vertica::UDXFactory::getPerInstanceResources() to determine if there are enough free resources to run the query containing the UDx (see Informing Vertica of resource requirements). Because these values are estimates rather than measurements of actual memory use, they can be inaccurate. Once started by Vertica, a UDx could allocate far more memory or file handles than it reported it needs.
The FencedUDxMemoryLimitMB configuration parameter lets you set an absolute memory limit for UDxs. Any attempt by a UDx to allocate more memory than this limit results in a bad_alloc exception. For an example of setting FencedUDxMemoryLimitMB, see How resource limits are enforced.
5.1.3.5.5 - How resource limits are enforced
Before running a query, Vertica determines how much memory it requires to run. If the query contains a fenced-mode UDx which implements the getPerInstanceResources() function in its factory class, Vertica calls it to determine the amount of memory the UDx needs and adds this to the total required for the query. Based on these requirements, Vertica decides how to handle the query:
-
If the total amount of memory required (including the amount that the UDxs report that they need) is larger than the session's MEMORYCAP or resource pool's MAXMEMORYSIZE setting, Vertica rejects the query. For more information about resource pools, see Resource pool architecture.
-
If the amount of memory is below the limit set by the session and resource pool limits, but there is currently not enough free memory to run the query, Vertica queues it until enough resources become available.
-
If there are enough free resources to run the query, Vertica executes it.
Note
Vertica has no way to determine the amount of resources a UDx requires other than the values it reports using the getPerInstanceResources() function. A UDx could use more resources than it claims, which could cause performance issues for other queries that are denied resources. You can set an absolute limit on the amount of memory UDxs can allocate. See Setting memory limits for fenced-mode UDxs for more information.
If the process executing your UDx attempts to allocate more memory than the limit set by the FencedUDxMemoryLimitMB configuration parameter, it receives a bad_alloc exception. For more information about FencedUDxMemoryLimitMB, see Setting memory limits for fenced-mode UDxs.
Below is the output of loading a UDSF that consumes 500MB of memory, then changing the memory settings to cause out-of-memory errors. The MemoryAllocationExample UDSF in the following example is just the Add2Ints UDSF example altered as shown in Allocating resources for UDxs and Informing Vertica of resource requirements to allocate 500MB of RAM.
=> CREATE LIBRARY mylib AS '/home/dbadmin/MemoryAllocationExample.so';
CREATE LIBRARY
=> CREATE FUNCTION usemem AS NAME 'MemoryAllocationExampleFactory' LIBRARY mylib
-> FENCED;
CREATE FUNCTION
=> SELECT usemem(1,2);
usemem
--------
3
(1 row)
The following statements demonstrate setting the session's MEMORYCAP to lower than the amount of memory that the UDSF reports it uses. This causes Vertica to return an error before it executes the UDSF.
=> SET SESSION MEMORYCAP '100M';
SET
=> SELECT usemem(1,2);
ERROR 3596: Insufficient resources to execute plan on pool sysquery
[Request exceeds session memory cap: 520328KB > 102400KB]
=> SET SESSION MEMORYCAP = default;
SET
The resource pool can also prevent a UDx from running if it requires more memory than is available in the pool. The following statements demonstrate the effect of creating and using a resource pool that has too little memory for the UDSF to run. Similar to the session's MEMORYCAP limit, the pool's MAXMEMORYSIZE setting prevents Vertica from executing the query containing the UDSF.
=> CREATE RESOURCE POOL small MEMORYSIZE '100M' MAXMEMORYSIZE '100M';
CREATE RESOURCE POOL
=> SET SESSION RESOURCE POOL small;
SET
=> CREATE TABLE ExampleTable(a int, b int);
CREATE TABLE
=> INSERT /*+direct*/ INTO ExampleTable VALUES (1,2);
OUTPUT
--------
1
(1 row)
=> SELECT usemem(a, b) FROM ExampleTable;
ERROR 3596: Insufficient resources to execute plan on pool small
[Request Too Large:Memory(KB) Exceeded: Requested = 523136, Free = 102400 (Limit = 102400, Used = 0)]
=> DROP RESOURCE POOL small; --Dropping the pool resets the session's pool
DROP RESOURCE POOL
Finally, setting the FencedUDxMemoryLimitMB configuration parameter to lower than the UDx actually allocates results in the UDx throwing an exception. This is a different case than either of the previous two examples, since the query actually executes. The UDx's code needs to catch and handle the exception. In this example, it uses the vt_report_error macro to report the error back to Vertica and exit.
=> ALTER DATABASE DEFAULT SET FencedUDxMemoryLimitMB = 300;
=> SELECT usemem(1,2);
ERROR 3412: Failure in UDx RPC call InvokeSetup(): Error calling setup() in
User Defined Object [usemem] at [MemoryAllocationExample.cpp:32], error code:
1, message: Couldn't allocate memory :[std::bad_alloc]
=> ALTER DATABASE DEFAULT SET FencedUDxMemoryLimitMB = -1;
=> SELECT usemem(1,2);
usemem
--------
3
(1 row)
5.1.4 - Java SDK
The Vertica SDK supports writing Java UDxs of all types except aggregate functions. All Java UDxs are fenced.
You can download, compile, and run the examples; see Downloading and running UDx example code. Running the examples is a good way to verify that your development environment has all needed libraries.
If you do not have access to a Vertica test environment, you can install Vertica on your development machine and run a single node. Each time you rebuild your UDx library, you need to re-install it into Vertica.
This section covers Java-specific topics that apply to all UDx types. For information that applies to all languages, see Arguments and return values, UDx parameters, Errors, warnings, and logging, Handling cancel requests and the sections for specific UDx types. For full API documentation, see the Java SDK Documentation.
5.1.4.1 - Setting up the Java SDK
The Vertica Java Software Development Kit (SDK) is distributed as part of the server installation. It contains the source and JAR files you need to create your UDx library. For examples that you can compile and run, see Downloading and running UDx example code.
Requirements
At a minimum, install a Java Development Kit (JDK) on your development machine.
Optionally, you can simplify development with a build-management tool, such as make.
SDK files
To use the SDK you need two files from the Java support package:
-
/opt/vertica/bin/VerticaSDK.jar contains the Vertica Java SDK and other supporting files.
-
/opt/vertica/sdk/BuildInfo.java contains version information about the SDK. You must compile this file and include it within your Java UDx JAR files.
If you are not doing your development on a database node, you can copy these two files from one of the database nodes to your development system.
The BuildInfo.java and VerticaSDK.jar files that you use to compile your UDx must be from the same SDK version. Both files must also match the version of the SDK files on your Vertica hosts. Versioning is only an issue if you are not compiling your UDxs on a Vertica host. If you are compiling on a separate development system, always refresh your copies of these two files and recompile your UDxs just before deploying them.
Finding the current SDK version
You must develop your UDx using the same SDK version as the database in which you plan to use it. To display the SDK version currently installed on your system, run the following command in vsql:
=> SELECT sdk_version();
Compiling BuildInfo.java
You need to compile the BuildInfo.java file into a class file, so you can include it in your Java UDx JAR library. If you are using a Vertica node as a development system, you can either:
-
Copy the BuildInfo.java file to another location on your host.
-
If you have root privileges, compile the BuildInfo.java file in place. (Only the root user has privileges to write files to the /opt/vertica/sdk directory.)
Compile the file using the following command. Replace path with the path to the file and output-directory with the directory where you will compile your UDxs.
$ javac -classpath /opt/vertica/bin/VerticaSDK.jar \
/path/BuildInfo.java -d output-directory
If you use an IDE such as Eclipse, you can include the BuildInfo.java file in your project instead of compiling it separately. You must also add the VerticaSDK.jar file to the project's build path. See your IDE's documentation for details on how to include files and libraries in your projects.
Running the examples
You can download the examples from the GitHub repository (see Downloading and running UDx example code). Compiling and running the examples helps you to ensure that your development environment is properly set up.
If you have not already done so, set the JAVA_HOME environment variable to your JDK (not JRE) directory.
To compile all of the examples, including the Java examples, issue the following command in the Java-and-C++ directory under the examples directory:
$ make
To compile only the Java examples, issue the following command in the Java-and-C++ directory under the examples directory:
$ make JavaFunctions
5.1.4.2 - Compiling and packaging a Java library
Before you can use your Java UDx, you need to compile it and package it into a JAR file.
The SDK examples include a working makefile. See Downloading and running UDx example code.
Compile your Java UDx
You must include the SDK JAR file in the CLASSPATH when you compile your Java UDx source files so the Java compiler can resolve the Vertica API calls. If you are using the command-line Java compiler on a host in your database cluster, enter this command:
$ javac -classpath /opt/vertica/bin/VerticaSDK.jar factorySource.java \
[functionSource.java...] -d output-directory
If all of your source files are in the same directory, you can use *.java on the command line instead of listing the files individually.
If you are using an IDE, verify that a copy of the VerticaSDK.jar file is in the build path.
UDx class file organization
After you compile your UDx, you must package its class files and the BuildInfo.class file into a JAR file.
Note
You can package as many UDxs as you want into the same JAR file. Bundling your UDxs together saves you from having to load multiple libraries.
To use the jar command packaged as part of the JDK, you must organize your UDx class files into a directory structure matching your class package structure. For example, suppose your UDx's factory class has a fully-qualified name of com.mycompany.udfs.Add2ints. In this case, your class files must be in the directory hierarchy com/mycompany/udfs relative to your project's base directory. In addition, you must have a copy of the BuildInfo.class file in the path com/vertica/sdk so that it can be included in the JAR file. This class must appear in your JAR file to indicate the SDK version that was used to compile your Java UDx.
The JAR file for the Add2ints UDSF example has the following directory structure after compilation:
com/vertica/sdk/BuildInfo.class
com/mycompany/example/Add2intsFactory.class
com/mycompany/example/Add2intsFactory$Add2ints.class
Package your UDx into a JAR file
To create a JAR file from the command line:
-
Change to the root directory of your project.
-
Use the jar command to package the BuildInfo.class file and all of the classes in your UDx:
# jar -cvf libname.jar com/vertica/sdk/BuildInfo.class \
packagePath/*.class
When you type this command, libname is the filename you have chosen for your JAR file (choose whatever name you like), and packagePath is the path to the directory containing your UDx's class files.
-
For example, to package the files from the Add2ints example, you use the command:
# jar -cvf Add2intsLib.jar com/vertica/sdk/BuildInfo.class \
com/mycompany/example/*.class
-
More simply, if you compiled BuildInfo.class and your class files into the same root directory, you can use the following command:
# jar -cvf Add2intsLib.jar .
You must include all of the class files that make up your UDx in your JAR file. Your UDx always consists of at least two classes (the factory class and the function class). Even if you defined your function class as an inner class of your factory class, Java generates a separate class file for the inner class.
After you package your UDx into a JAR file, you are ready to deploy it to your Vertica database.
5.1.4.3 - Handling Java UDx dependencies
If your Java UDx relies on one or more external libraries, you can handle the dependencies in one of three ways:
-
Bundle the JAR files into your UDx JAR file using a tool such as One-JAR or Eclipse Runnable JAR Export Wizard.
-
Unpack the JAR file and then repack its contents in your UDx's JAR file.
-
Copy the libraries to your Vertica cluster in addition to your UDx library. Then, use the DEPENDS keyword of the CREATE LIBRARY statement to tell Vertica that the UDx library depends on the external libraries. This keyword acts as a library-specific CLASSPATH setting. Vertica distributes the support libraries to all of the nodes in the cluster and sets the class path for the UDx so it can find them.
If your UDx depends on native libraries (SO files), use the DEPENDS keyword to specify their path. When you call System.loadLibrary in your UDx (which you must do before using a native library), this function uses the DEPENDS path to find them. You do not need to also set the LD_LIBRARY_PATH environment variable.
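For example, a UDx that calls into a native library might load it from a static initializer so that it is loaded exactly once per JVM. The class, method, and library names here are hypothetical:
// Hypothetical wrapper for a native library deployed with DEPENDS.
public class NativeHash {
    static {
        // Loads libmynative.so; Vertica locates it using the DEPENDS
        // path from CREATE LIBRARY, so LD_LIBRARY_PATH is not needed.
        System.loadLibrary("mynative");
    }
    // Implemented in libmynative.so (hypothetical native method).
    public static native long hash(byte[] data);
}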
External library example
The following example demonstrates using an external library with a Java UDx.
The following sample code defines a simple class, named VowelRemover. It contains a single method, named removevowels, that removes all of the vowels (the letters a, e, i, o, u, and y) from a string.
package com.mycompany.libs;
public class VowelRemover {
public String removevowels(String input) {
return input.replaceAll("(?i)[aeiouy]", "");
}
};
You can compile this class and package it into a JAR file with the following commands:
$ javac -g com/mycompany/libs/VowelRemover.java
$ jar cf mycompanylibs.jar com/mycompany/libs/VowelRemover.class
The following code defines a Java UDSF, named DeleteVowels, that uses the library defined in the preceding example code. DeleteVowels accepts a single VARCHAR as input, and returns a VARCHAR.
package com.mycompany.udx;
// Import the support class created earlier
import com.mycompany.libs.VowelRemover;
// Import the Vertica SDK
import com.vertica.sdk.*;
public class DeleteVowelsFactory extends ScalarFunctionFactory {
@Override
public ScalarFunction createScalarFunction(ServerInterface arg0) {
return new DeleteVowels();
}
@Override
public void getPrototype(ServerInterface arg0, ColumnTypes argTypes,
ColumnTypes returnTypes) {
// Accept a single string and return a single string.
argTypes.addVarchar();
returnTypes.addVarchar();
}
@Override
public void getReturnType(ServerInterface srvInterface,
SizedColumnTypes argTypes,
SizedColumnTypes returnType){
returnType.addVarchar(
// Output will be no larger than the input.
argTypes.getColumnType(0).getStringLength(), "RemovedVowels");
}
public class DeleteVowels extends ScalarFunction
{
@Override
public void processBlock(ServerInterface arg0, BlockReader argReader,
BlockWriter resWriter) throws UdfException, DestroyInvocation {
// Create an instance of the VowelRemover object defined in
// the library.
VowelRemover remover = new VowelRemover();
do {
String instr = argReader.getString(0);
// Call the removevowels method defined in the library.
resWriter.setString(remover.removevowels(instr));
resWriter.next();
} while (argReader.next());
}
}
}
Use the following commands to build the example UDSF and package it into a JAR:
-
The first javac command compiles the SDK's BuildInfo class. Vertica requires all UDx libraries to contain this class. The javac command's -d option outputs the class file in the directory structure of your UDSF's source.
-
The second javac command compiles the UDSF class. It adds the previously-created mycompanylibs.jar file to the class path so the compiler can find the VowelRemover class.
-
The jar command packages the BuildInfo and the classes for the UDx library together.
$ javac -g -cp /opt/vertica/bin/VerticaSDK.jar\
/opt/vertica/sdk/com/vertica/sdk/BuildInfo.java -d .
$ javac -g -cp mycompanylibs.jar:/opt/vertica/bin/VerticaSDK.jar\
com/mycompany/udx/DeleteVowelsFactory.java
$ jar cf DeleteVowelsLib.jar com/mycompany/udx/*.class \
com/vertica/sdk/*.class
To install the UDx library, you must copy both of the JAR files to a node in the Vertica cluster. Then, connect to the node to execute the CREATE LIBRARY statement.
The following example demonstrates how to load the UDx library after you copy the JAR files to the home directory of the dbadmin user. The DEPENDS keyword tells Vertica that the UDx library depends on the mycompanylibs.jar file.
=> CREATE LIBRARY DeleteVowelsLib AS
'/home/dbadmin/DeleteVowelsLib.jar' DEPENDS '/home/dbadmin/mycompanylibs.jar'
LANGUAGE 'JAVA';
CREATE LIBRARY
=> CREATE FUNCTION deleteVowels AS language 'java' NAME
'com.mycompany.udx.DeleteVowelsFactory' LIBRARY DeleteVowelsLib;
CREATE FUNCTION
=> SELECT deleteVowels('I hate vowels!');
deleteVowels
--------------
ht vwls!
(1 row)
5.1.4.4 - Java and Vertica data types
The Vertica Java SDK converts Vertica's native data types into the appropriate Java data type. The following table lists the Vertica data types and their corresponding Java data types.
| Vertica Data Type | Java Data Type |
| --- | --- |
| INTEGER | long |
| FLOAT | double |
| NUMERIC | com.vertica.sdk.VNumeric |
| DATE | java.sql.Date |
| CHAR, VARCHAR, LONG VARCHAR | com.vertica.sdk.VString |
| BINARY, VARBINARY, LONG VARBINARY | com.vertica.sdk.VString |
| TIMESTAMP | java.sql.Timestamp |
Note
Some Vertica data types are not supported.
Setting BINARY, VARBINARY, and LONG VARBINARY values
The Vertica BINARY, VARBINARY, and LONG VARBINARY data types are represented by the Java UDx SDK's VString class. You can also set the value of a column with one of these data types from a ByteBuffer object (or a byte array wrapped in a ByteBuffer) using the PartitionWriter.setStringBytes() method. See the Java API UDx entry for PartitionWriter.setStringBytes() for more information.
Timestamps and time zones
When the SDK converts a Vertica timestamp into a Java timestamp, it uses the time zone of the JVM. If the JVM is running in a different time zone than the one used by Vertica, the results can be confusing.
Vertica stores timestamps in the database in UTC. (If a database time zone is set, the conversion is done at query time.) To prevent errors from the JVM time zone, add the following code to the processing method of your UDx:
TimeZone.setDefault(TimeZone.getTimeZone("UTC"));
Strings
The Java SDK contains a class named StringUtils that assists you when manipulating string data. One of its more useful features is its getStringBytes() method. This method extracts bytes from a String in a way that prevents the creation of invalid strings. If you attempt to extract a substring that would split part of a multi-byte UTF-8 character, getStringBytes() truncates it to the nearest whole character.
5.1.4.5 - Handling NULL values
Your UDxs must be prepared to handle NULL values. These values usually must be handled separately from regular values.
Reading NULL values
Your UDx reads data from instances of the BlockReader or PartitionReader classes. If the value of a column is NULL, the methods you use to get data (such as getLong) return a Java null reference. If you attempt to use the value without checking for NULL, the Java runtime throws a null pointer exception.
You can test for NULL values before reading columns by using the data-type-specific methods (such as isLongNull, isDoubleNull, and isBooleanNull). For example, to test whether the INTEGER first column of your UDx's input is NULL, you would use the statement:
// See if the Long value in column 0 is a NULL
if (inputReader.isLongNull(0)) {
// value is null
. . .
Writing NULL values
You output NULL values using type-specific methods on the BlockWriter and PartitionWriter classes (such as setLongNull and setStringNull). These methods take the column number to receive the NULL value. In addition, the PartitionWriter class has data-type-specific set-value methods (such as setLongValue and setStringValue). If you pass these methods a value, they set the output column to that value. If you pass them a Java null reference, they set the output column to NULL.
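The following fragment sketches both directions using the methods described above, inside a hypothetical transform function's processPartition() loop (inputReader and outputWriter stand for the usual PartitionReader and PartitionWriter arguments):
do {
    // getLong returns a Java null reference when the SQL value is NULL.
    Long value = inputReader.getLong(0);
    if (value == null) {
        outputWriter.setLongNull(0);             // emit SQL NULL
    } else {
        outputWriter.setLongValue(0, value * 2); // emit a value
    }
    outputWriter.next();
} while (inputReader.next());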
5.1.4.6 - Adding metadata to Java UDx libraries
You can add metadata, such as author name, the version of the library, a description of your library, and so on to your library. This metadata lets you track the version of your function that is deployed on a Vertica Analytic Database cluster and lets third-party users of your function know who created the function. Your library's metadata appears in the USER_LIBRARIES system table after your library has been loaded into the Vertica Analytic Database catalog.
To add metadata to your Java UDx library, you create a subclass of the UDXLibrary class that contains your library's metadata. You then include this class within your JAR file. When you load your class into the Vertica Analytic Database catalog using the CREATE LIBRARY statement, Vertica looks for a subclass of UDXLibrary for the library's metadata.
In your subclass of UDXLibrary, you need to implement eight getters that return String values containing the library's metadata. The getters in this class are:
-
getAuthor() returns the name you want associated with the creation of the library (your own name or your company's name, for example).
-
getLibraryBuildTag() returns whatever String you want to use to represent the specific build of the library (for example, the SVN revision number or a timestamp of when the library was compiled). This is useful for tracking instances of your library as you are developing them.
-
getLibraryVersion() returns the version of your library. You can use whatever numbering or naming scheme you want.
-
getLibrarySDKVersion() returns the version of the Vertica Analytic Database SDK Library for which you've compiled the library.
Note
This field isn't used to determine whether a library is compatible with a version of the Vertica Analytic Database server. The version of the Vertica Analytic Database SDK you use to compile your library is embedded in the library when you compile it. It is this information that the Vertica Analytic Database server uses to determine if your library is compatible with it.
-
getSourceUrl() returns a URL where users of your function can find more information about it. This can be your company's website, the GitHub page hosting your library's source code, or whatever site you like.
-
getDescription() returns a concise description of your library.
-
getLicensesRequired() is a placeholder for licensing information. You must return an empty string from this method.
-
getSignature() is a placeholder for a signature that will authenticate your library. You must return an empty string from this method.
For example, the following code demonstrates creating a UDXLibrary subclass to be included in the Add2Ints UDSF example JAR file (see /opt/vertica/sdk/examples/JavaUDx/ScalarFunctions
on any Vertica node).
// Import the UDXLibrary class to hold the metadata
import com.vertica.sdk.UDXLibrary;
public class Add2IntsLibrary extends UDXLibrary
{
// Return values for the metadata about this library.
@Override public String getAuthor() {return "Whizzo Analytics Ltd.";}
@Override public String getLibraryBuildTag() {return "1234";}
@Override public String getLibraryVersion() {return "1.0";}
@Override public String getLibrarySDKVersion() {return "7.0.0";}
@Override public String getSourceUrl() {
return "http://example.com/add2ints";
}
@Override public String getDescription() {
return "My Awesome Add 2 Ints Library";
}
@Override public String getLicensesRequired() {return "";}
@Override public String getSignature() {return "";}
}
When the library containing the Add2IntsLibrary class is loaded, the metadata appears in the USER_LIBRARIES system table:
=> CREATE LIBRARY JavaAdd2IntsLib AS :libfile LANGUAGE 'JAVA';
CREATE LIBRARY
=> CREATE FUNCTION JavaAdd2Ints as LANGUAGE 'JAVA' name 'com.mycompany.example.Add2IntsFactory' library JavaAdd2IntsLib;
CREATE FUNCTION
=> \x
Expanded display is on.
=> SELECT * FROM USER_LIBRARIES WHERE lib_name = 'JavaAdd2IntsLib';
-[ RECORD 1 ]-----+---------------------------------------------
schema_name | public
lib_name | JavaAdd2IntsLib
lib_oid | 45035996273869844
author | Whizzo Analytics Ltd.
owner_id | 45035996273704962
lib_file_name | public_JavaAdd2IntsLib_45035996273869844.jar
md5_sum | f3bfc76791daee95e4e2c0f8a8d2737f
sdk_version | v7.0.0-20131105
revision | 125200
lib_build_tag | 1234
lib_version | 1.0
lib_sdk_version | 7.0.0
source_url | http://example.com/add2ints
description | My Awesome Add 2 Ints Library
licenses_required |
signature |
5.1.4.7 - Java UDx resource management
Java Virtual Machines (JVMs) allocate a set amount of memory when they start.
Java Virtual Machines (JVMs) allocate a set amount of memory when they start. This set memory allocation complicates memory management for Java UDxs, because memory cannot be dynamically allocated and freed by the UDx as it is processing data. This differs from C++ UDxs, which can dynamically allocate resources.
To control the amount of memory consumed by Java UDxs, Vertica has a memory pool named jvm that it uses to allocate memory for JVMs. If this memory pool is exhausted, queries that call Java UDxs block until enough memory in the pool becomes free to start a new JVM.
By default, the jvm pool has:
-
no memory of its own assigned to it, so it borrows memory from the GENERAL pool.
-
its MAXMEMORYSIZE set to either 10% of system memory or 2GB, whichever is smaller.
-
its PLANNEDCONCURRENCY set to AUTO, so that it inherits the GENERAL pool's PLANNEDCONCURRENCY setting.
You can view the current settings for the jvm pool by querying the RESOURCE_POOLS table:
=> SELECT MAXMEMORYSIZE,PLANNEDCONCURRENCY FROM V_CATALOG.RESOURCE_POOLS WHERE NAME = 'jvm';
MAXMEMORYSIZE | PLANNEDCONCURRENCY
---------------+--------------------
10% | AUTO
When a SQL statement calls a Java UDx, Vertica checks if the jvm memory pool has enough memory to start a new JVM instance to execute the function call. Vertica starts each new JVM with its heap memory size set to approximately the jvm pool's MAXMEMORYSIZE parameter divided by its PLANNEDCONCURRENCY parameter. If the memory pool does not contain enough memory, the query blocks until another JVM exits and returns its memory to the pool.
If your Java UDx attempts to consume more memory than has been allocated to the JVM's heap size, it exits with a memory error. You can attempt to resolve this issue by:
-
increasing the jvm pool's MAXMEMORYSIZE parameter.
-
decreasing the jvm pool's PLANNEDCONCURRENCY parameter.
-
changing your Java UDx's code to consume less memory.
Adjusting the jvm pool
When adjusting the jvm pool to your needs, you must consider two factors: the amount of memory your Java UDxs need, and the number of concurrent sessions that run Java UDxs.
You can learn the amount of memory your Java UDx needs using several methods. For example, your code can use Java's Runtime
class to get an estimate of the total memory it has allocated and then log the value using ServerInterface.log()
. (An instance of this class is passed to your UDx.) If you have multiple Java UDxs in your database, set the jvm pool memory size based on the UDx that uses the most memory.
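For example, a minimal sketch of this logging technique, assuming a ServerInterface reference named srvInterface such as the one passed to your UDx's processing method:
// Estimate the heap currently in use by this JVM and write it to the
// Vertica log to help you size the jvm pool's MAXMEMORYSIZE.
Runtime rt = Runtime.getRuntime();
long usedKB = (rt.totalMemory() - rt.freeMemory()) / 1024;
srvInterface.log("Approximate JVM heap in use: %d KB", usedKB);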
The number of concurrent sessions that need to run Java UDxs may not be the same as the global PLANNEDCONCURRENCY setting. For example, you may have just a single user who runs a Java UDx, which means you can lower the jvm pool's PLANNEDCONCURRENCY setting to 1.
When you have an estimate for the amount of RAM and the number of concurrent user sessions that need to run Java UDxs, you can adjust the jvm pool to an appropriate size. Set the pool's MAXMEMORYSIZE to the maximum amount of RAM needed by the most demanding Java UDx multiplied by the number of concurrent user sessions that need to run Java UDxs. Set the pool's PLANNEDCONCURRENCY to the number of simultaneous user sessions that need to run Java UDxs.
For example, suppose your Java UDx requires up to 4GB of memory to run and you expect up to two user sessions to use Java UDxs. You would use the following command to adjust the jvm pool:
=> ALTER RESOURCE POOL jvm MAXMEMORYSIZE '8G' PLANNEDCONCURRENCY 2;
The MAXMEMORYSIZE is set to 8GB, which is the 4GB maximum memory used by the Java UDx multiplied by the 2 concurrent user sessions.
Note
The PLANNEDCONCURRENCY value is not the number of calls to Java UDx that you expect to happen simultaneously. Instead, it is the number of concurrently open user sessions that call Java UDxs at any time during the session. See below for more information.
See Managing workloads for more information on tuning the jvm and other resource pools.
Freeing JVM memory
The first time users call a Java UDx during their session, Vertica allocates memory from the jvm pool and starts a new JVM. This JVM remains running for as long as the user's session is open so it can process other Java UDx calls. Keeping the JVM running lowers the overhead of executing multiple Java UDxs by the same session. If the JVM did not remain open, each call to a Java UDx would require additional time for Vertica to allocate resources and start a new JVM. However, having the JVM remain open means that the JVM's memory remains allocated for the life of the session whether or not it will be used again.
If the jvm memory pool is depleted, queries containing Java UDxs either block until memory becomes available or eventually fail due to a lack of resources. If you find queries blocking or failing for this reason, you can allocate more memory to the jvm pool and increase its PLANNEDCONCURRENCY. Another option is to ask users to call the RELEASE_JVM_MEMORY function when they no longer need to run Java UDxs. This function closes any JVM belonging to the user's session and returns its allocated memory to the jvm memory pool.
The following example demonstrates querying V_MONITOR.SESSIONS to find the memory allocated to JVMs by all sessions. It also demonstrates how the memory is allocated by a call to a Java UDx, and then freed by calling RELEASE_JVM_MEMORY.
=> SELECT USER_NAME,EXTERNAL_MEMORY_KB FROM V_MONITOR.SESSIONS;
user_name | external_memory_kb
-----------+---------------
dbadmin | 0
(1 row)
=> -- Call a Java UDx
=> SELECT add2ints(123,456);
add2ints
----------
579
(1 row)
=> -- JVM is now running and memory is allocated to it.
=> SELECT USER_NAME,EXTERNAL_MEMORY_KB FROM V_MONITOR.SESSIONS;
USER_NAME | EXTERNAL_MEMORY_KB
-----------+---------------
dbadmin | 79705
(1 row)
=> -- Shut down the JVM and deallocate memory
=> SELECT RELEASE_JVM_MEMORY();
RELEASE_JVM_MEMORY
-----------------------------------------
Java process killed and memory released
(1 row)
=> SELECT USER_NAME,EXTERNAL_MEMORY_KB FROM V_MONITOR.SESSIONS;
USER_NAME | EXTERNAL_MEMORY_KB
-----------+---------------
dbadmin | 0
(1 row)
In rare cases, you may need to close all JVMs. For example, you may need to free memory for an important query, or several instances of a Java UDx may be taking too long to complete. You can use the RELEASE_ALL_JVM_MEMORY function to close all of the JVMs in all user sessions:
=> SELECT USER_NAME,EXTERNAL_MEMORY_KB FROM V_MONITOR.SESSIONS;
USER_NAME | EXTERNAL_MEMORY_KB
-------------+---------------
ExampleUser | 79705
dbadmin | 79705
(2 rows)
=> SELECT RELEASE_ALL_JVM_MEMORY();
RELEASE_ALL_JVM_MEMORY
-----------------------------------------------------------------------------
Close all JVM sessions command sent. Check v_monitor.sessions for progress.
(1 row)
=> SELECT USER_NAME,EXTERNAL_MEMORY_KB FROM V_MONITOR.SESSIONS;
USER_NAME | EXTERNAL_MEMORY_KB
-----------+---------------
dbadmin | 0
(1 row)
Caution
This function terminates all JVMs, including ones that are currently executing Java UDXs. This will cause any query that is currently executing a Java UDx to return an error.
Notes
-
The jvm resource pool is used only to allocate memory for the Java UDx function calls in a statement. The rest of the resources required by the SQL statement come from other memory pools.
-
The first time a Java UDx is called, Vertica starts a JVM to execute some Java methods to get metadata about the UDx during the query planning phase. The memory for this JVM is also taken from the jvm memory pool.
5.1.5 - Python SDK
The Vertica SDK supports writing UDxs of some types in Python 3.
The Vertica SDK supports writing UDxs of some types in Python 3.
The Python SDK does not require any additional system configuration or header files. This low overhead allows you to develop and deploy new capabilities to your Vertica cluster in a short amount of time.
The following workflow is typical for the Python SDK.
Because Python has an interpreter, you do not have to compile your program before loading the UDx in Vertica. However, you should expect to do some debugging of your code after you create your function and begin testing it in Vertica.
When Vertica calls your UDx, it starts a side process that manages the interaction between the server and the Python interpreter.
This section covers Python-specific topics that apply to all UDx types. For information that applies to all languages, see Arguments and return values, UDx parameters, Errors, warnings, and logging, Handling cancel requests and the sections for specific UDx types. For full API documentation, see the Python SDK.
Important
Your UDx must be able to run with the version of Python bundled with Vertica. You can find this with /opt/vertica/sbin/python3 --version
. You cannot change the version used by the Vertica Python interpreter.
5.1.5.1 - Setting up a Python development environment
To avoid problems when loading and executing your UDxs, develop your UDxs using the same version of Python that Vertica uses.
To avoid problems when loading and executing your UDxs, develop your UDxs using the same version of Python that Vertica uses. To do this without changing your environment for projects that might require other Python versions, you can use a Python virtual environment (venv). You can install libraries that your UDx depends on into your venv
and use that path when you create your UDx library with CREATE LIBRARY.
Setting up venv
Set up venv
using the Python version bundled with Vertica. If you have direct access to a database node, you can use that Python binary directly to create your venv
:
$ /opt/vertica/sbin/python3 -m venv /path/to/new/environment
The result is a directory with a default environment, including a site-packages
directory:
$ ls venv/lib/
python3.9
$ ls venv/lib/python3.9/
site-packages
If your UDx depends on libraries that are not packaged with Vertica, install them into this directory:
$ source venv/bin/activate
(venv) $ pip install numpy
...
The lib/python3.9/site-packages
directory now contains the installed library. The change affects only your virtual environment.
UDx imports
Your UDx code must import, in addition to any libraries you add, the vertica_sdk
library:
# always required:
import vertica_sdk
# other libs:
import numpy as np
# ...
The vertica_sdk
library is included as a part of the Vertica server. You do not need to add it to site-packages
or declare it as a dependency.
Deployment
For libraries you add, you must declare dependencies when using CREATE LIBRARY. This declaration allows Vertica to find the libraries and distribute them to all database nodes. You can supply a path instead of enumerating the libraries:
=> CREATE OR REPLACE LIBRARY pylib AS
'/path/to/udx/add2ints.py'
DEPENDS '/path/to/new/environment/lib/python3.9/site-packages/*'
LANGUAGE 'Python';
=> CREATE OR REPLACE FUNCTION add2ints AS LANGUAGE 'Python'
NAME 'add2ints_factory' LIBRARY pylib;
CREATE LIBRARY copies the UDx and the contents of the DEPENDS path and stores them with the database. Vertica then distributes copies to all database nodes.
5.1.5.2 - Python and Vertica data types
The Vertica Python SDK converts native Vertica data types into the appropriate Python data types.
The Vertica Python SDK converts native Vertica data types into the appropriate Python data types. The following table describes some of the data type conversions. Consult the Python SDK for a complete list, as well as lists of helper functions to convert and manipulate these data types.
For information about SDK support for complex data types, see Complex Types as Arguments and Return Values.
Vertica Data Type | Python Data Type
------------------|------------------
INTEGER | int
FLOAT | float
NUMERIC | decimal.Decimal
DATE | datetime.date
CHAR, VARCHAR, LONG VARCHAR | string (UTF-8 encoded)
BINARY, VARBINARY, LONG VARBINARY | binary
TIMESTAMP | datetime.datetime
TIME | datetime.time
ARRAY | list (nested ARRAY types are also converted into lists)
ROW | collections.OrderedDict (nested ROW types are also converted into OrderedDicts)
Note
Some Vertica data types are not supported in Python. For a list of all Vertica data types, see
Data types.
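For example, the following minimal scalar function sketch relies on these conversions: it reads a Vertica NUMERIC argument as a Python decimal.Decimal and writes a FLOAT result as a Python float. The class and function names are illustrative:
import vertica_sdk

class NumericToFloat(vertica_sdk.ScalarFunction):
    # A NUMERIC argument arrives as decimal.Decimal; a FLOAT output is a float.
    def processBlock(self, server_interface, arg_reader, res_writer):
        while True:
            num = arg_reader.getNumeric(0)    # decimal.Decimal
            res_writer.setFloat(float(num))   # write as Vertica FLOAT
            res_writer.next()
            if not arg_reader.next():
                break

class NumericToFloatFactory(vertica_sdk.ScalarFunctionFactory):
    def createScalarFunction(self, srv):
        return NumericToFloat()
    def getPrototype(self, srv_interface, arg_types, return_type):
        arg_types.addNumeric()
        return_type.addFloat()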
5.1.6 - R SDK
The Vertica R SDK extends the capabilities of the Vertica Analytic Database so you can leverage additional R libraries.
The Vertica R SDK extends the capabilities of the Vertica Analytic Database so you can leverage additional R libraries. Before you can begin developing User Defined Extensions (UDxs) in R, you must install the R Language Pack for Vertica on each of the nodes in your cluster. The R SDK supports scalar and transform functions in fenced mode. Other UDx types are not supported.
The following workflow is typical for the R SDK.
You can find detailed documentation of all of the classes in the Vertica R SDK.
5.1.6.1 - Installing/upgrading the R language pack for Vertica
To create R UDxs in Vertica, install the R Language Pack package that matches your server version.
To create R UDxs in Vertica, install the R Language Pack package that matches your server version. The R Language Pack includes the R runtime and associated libraries for interfacing with Vertica. You must use this version of the R runtime; you cannot upgrade it.
You must install the R Language Pack on each node in the cluster. The Vertica R Language Pack must be the only R Language Pack installed on the node.
Vertica R language pack prerequisites
The R Language Pack package requires a number of packages for installation and execution. The names of these dependencies vary among Linux distributions. For Vertica-supported Linux platforms the packages are:
-
RHEL/CentOS: libgfortran, xz-libs, libgomp
-
SUSE Linux Enterprise Server: libgfortran3, liblzma5, libgomp1
-
Debian/Ubuntu: libgfortran3, liblzma5, libgomp1
Vertica requires a version of the libgfortran4 library later than 7.1 to create R extensions. The libgfortran library is included by default with the devtool and gcc packages.
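For example, on RHEL/CentOS you might install the prerequisites through the system package manager before installing the R Language Pack (the command is illustrative; use the package names listed above for your distribution):
$ sudo yum install libgfortran xz-libs libgomp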
Installing the Vertica R language pack
If you use your operating system's package manager, rather than the rpm or dpkg command, for installation, you do not need to manually install the R Language Pack prerequisites. The native package managers for each supported Linux version are yum (RHEL/CentOS and Amazon Linux), zypper (SUSE Linux Enterprise Server), and apt (Debian/Ubuntu). To download and install the R Language Pack:
-
Download the R language package by browsing to the Vertica website.
-
On the Support tab, select Customer Downloads.
-
When prompted, log in using your Micro Focus credentials.
-
Locate and select the vertica-R-lang_<version>.rpm or vertica-R-lang_<version>.deb file for your server version. The R language package version must match your server version to three decimal points.
-
Install the package as root or using sudo:
-
RHEL/CentOS
$ yum install vertica-R-lang-<version>.rpm
-
SUSE Linux Enterprise Server
$ zypper install vertica-R-lang-<version>.rpm
-
Debian
$ apt-get install ./vertica-R-lang_<version>.deb
-
Amazon Linux 2.0
$ yum install vertica-R-lang-<version>.AMZN.rpm
The installer puts the R binary in /opt/vertica/R
.
Upgrading the Vertica R language pack
When upgrading, some R packages you have manually installed may not work and may have to be reinstalled. If you do not update your package(s), then R returns an error if the package cannot be used. Instructions for upgrading these packages are below.
Note
The R packages provided in the R Language Pack are automatically upgraded and do not need to be reinstalled.
-
You must uninstall the R Language package before upgrading Vertica. Any additional R packages you manually installed remain in /opt/vertica/R
and are not removed when you uninstall the package.
-
Upgrade your server package as detailed in Upgrading Vertica to a New Version.
-
After the server package has been updated, install the new R Language package on each host.
If you have installed additional R packages, on each node:
-
As root run /opt/vertica/R/bin/R
and issue the command:
> update.packages(checkBuilt=TRUE)
-
Select a CRAN mirror from the list displayed.
-
You are prompted to update each package that has an update available for it. You must update any manually installed packages that are not compatible with the current version of R in the R Language Pack.
Do NOT update the packages that are provided in the R Language Pack; as the note above explains, they are upgraded automatically and do not need to be reinstalled.
The packages you selected to be updated are installed. Quit R with the command:
> quit()
Vertica UDx functions written in R do not need to be compiled and you do not need to reload your Vertica-R libraries and functions after an upgrade.
5.1.6.2 - R packages
The Vertica R Language Pack includes the following R packages in addition to the default packages bundled with R.
The Vertica R Language Pack includes the following R packages in addition to the default packages bundled with R:
-
Rcpp
-
RInside
-
lpSolve
-
lpSolveAPI
You can install additional R packages not included in the Vertica R Language Pack by using one of two methods. You must install the same packages on all nodes.
Installing R packages
You can install additional R packages by using one of the two following methods.
Using the install.packages() R command:
$ sudo /opt/vertica/R/bin/R
> install.packages("Zelig");
Using CMD INSTALL:
/opt/vertica/R/bin/R CMD INSTALL <path-to-package-tgz>
The installed packages are located in /opt/vertica/R/library.
5.1.6.3 - R and Vertica data types
The following data types are supported when passing data to/from an R UDx.
The following data types are supported when passing data to/from an R UDx:
Vertica Data Type | R Data Type
------------------|-------------
BOOLEAN | logical
DATE, DATETIME, SMALLDATETIME, TIME, TIMESTAMP, TIMESTAMPTZ, TIMETZ | numeric
DOUBLE PRECISION, FLOAT, REAL | numeric
BIGINT, DECIMAL, INT, NUMERIC, NUMBER, MONEY | numeric
BINARY, VARBINARY | character
CHAR, VARCHAR | character
NULL values in Vertica are translated to R NA values when sent to the R function. R NA values are translated into Vertica null values when returned from the R function to Vertica.
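For example, a minimal sketch of an R function body that preserves NULLs (the function name is illustrative; Vertica passes the input to your R function as a data frame):
# Vertica NULLs arrive as NA; returning NA emits a SQL NULL back to Vertica.
incrementOrNull <- function(input.data.frame) {
  x <- input.data.frame[, 1]
  ifelse(is.na(x), NA, x + 1)
}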
Important
When specifying LONG VARCHAR or LONG VARBINARY data types, include the space between the two words. For example, datatype = c("long varchar")
.
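For example, a factory sketch that declares LONG VARCHAR input and output, following the factory list structure shown in Setting null input and volatility behavior for R functions (the names are illustrative):
truncateTextFactory <- function() {
  list(name = truncateText,
       udxtype = c("scalar"),
       intype = c("long varchar"),   # note the space in "long varchar"
       outtype = c("long varchar"))
}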
5.1.6.4 - Adding metadata to R libraries
The following example shows how to add metadata to an R UDx.
You can add metadata, such as author name, the version of the library, a description of your library, and so on to your library. This metadata lets you track the version of your function that is deployed on a Vertica Analytic Database cluster and lets third-party users of your function know who created the function. Your library's metadata appears in the USER_LIBRARIES system table after your library has been loaded into the Vertica Analytic Database catalog.
You declare the metadata for your library by calling the RegisterLibrary()
function in one of the source files for your UDx. If the source files contain more than one RegisterLibrary() call, the call interpreted last as Vertica Analytic Database loads the library determines the library's metadata.
The RegisterLibrary()
function takes eight string parameters:
RegisterLibrary(author,
library_build_tag,
library_version,
library_sdk_version,
source_url,
description,
licenses_required,
signature);
-
author
contains whatever name you want associated with the creation of the library (your own name or your company's name for example).
-
library_build_tag
is a string you want to use to represent the specific build of the library (for example, the SVN revision number or a timestamp of when the library was compiled). This is useful for tracking instances of your library as you are developing them.
-
library_version
is the version of your library. You can use whatever numbering or naming scheme you want.
-
library_sdk_version
is the version of the Vertica Analytic Database SDK Library for which you've compiled the library.
Note
This field isn't used to determine whether a library is compatible with a version of the Vertica Analytic Database server. The version of the Vertica Analytic Database SDK you use to compile your library is embedded in the library when you compile it. It is this information that Vertica Analytic Database server uses to determine if your library is compatible with it.
-
source_url
is a URL where users of your function can find more information about it. This can be your company's website, the GitHub page hosting your library's source code, or whatever site you like.
-
description
is a concise description of your library.
-
licenses_required
is a placeholder for licensing information. You must pass an empty string for this value.
-
signature
is a placeholder for a signature that will authenticate your library. You must pass an empty string for this value.
The following example shows how to add metadata to an R UDx.
RegisterLibrary("Speedy Analytics Ltd.",
"1234",
"1.0",
"8.1.0",
"http://www.example.com/sales_tax_calculator.R",
"Sales Tax R Library",
"",
"")
Loading the library and querying the USER_LIBRARIES system table shows the metadata supplied in the call to RegisterLibrary
:
=> CREATE LIBRARY rLib AS '/home/dbadmin/sales_tax_calculator.R' LANGUAGE 'R';
CREATE LIBRARY
=> SELECT * FROM USER_LIBRARIES WHERE lib_name = 'rLib';
-[ RECORD 1 ]-----+---------------------------------------------------------
schema_name | public
lib_name | rLib
lib_oid | 45035996273708350
author | Speedy Analytics Ltd.
owner_id | 45035996273704962
lib_file_name | rLib_02552872a35d9352b4907d3fcd03cf9700a0000000000d3e.R
md5_sum | 30da555537c4d93c352775e4f31332d2
sdk_version |
revision |
lib_build_tag | 1234
lib_version | 1.0
lib_sdk_version | 8.1.0
source_url | http://www.example.com/sales_tax_calculator.R
description | Sales Tax R Library
licenses_required |
signature |
dependencies |
is_valid | t
sal_storage_id | 02552872a35d9352b4907d3fcd03cf9700a0000000000d3e
5.1.6.5 - Setting null input and volatility behavior for R functions
Vertica supports defining volatility and null-input settings for UDxs written in R.
Vertica supports defining volatility and null-input settings for UDxs written in R. Both settings aid in the performance of your R function.
Volatility settings
Volatility settings describe the behavior of the function to the Vertica optimizer. For example, if you have identical rows of input data and you know the UDx is immutable, then you can define the UDx as IMMUTABLE. This tells the Vertica optimizer that it can return a cached value for subsequent identical rows on which the function is called rather than having the function run on each identical row.
To indicate your UDx's volatility, set the volatility parameter of your R factory function to one of the following values:
Value | Description
------|------------
VOLATILE | Repeated calls to the function with the same arguments always result in different values. Vertica always calls volatile functions for each invocation.
IMMUTABLE | Calls to the function with the same arguments always result in the same return value.
STABLE | Repeated calls to the function with the same arguments within the same statement return the same output. For example, a function that returns the current user name is stable because the user cannot change within a statement. The user name could change between statements.
DEFAULT_VOLATILITY | The default volatility. This is the same as VOLATILE.
If you do not define a volatility, then the function is considered to be VOLATILE.
The following example sets the volatility to STABLE in the multiplyTwoIntsFactory function:
multiplyTwoIntsFactory <- function() {
list(name = multiplyTwoInts,
udxtype = c("scalar"),
intype = c("float","float"),
outtype = c("float"),
volatility = c("stable"),
parametertypecallback = multiplyTwoIntsParameters)
}
Null input behavior
Null input settings determine how to respond to rows that have null input. For example, you can choose to return NULL if any inputs are NULL, rather than calling the function and having the function deal with a NULL input.
To indicate how your UDx reacts to NULL input, set the strictness parameter of your R factory function to one of the following values:
Value | Description
------|------------
CALLED_ON_NULL_INPUT | The function must be called, even if one or more arguments are NULL.
RETURN_NULL_ON_NULL_INPUT | The function always returns a NULL value if any of its arguments are NULL.
STRICT | A synonym for RETURN_NULL_ON_NULL_INPUT.
DEFAULT_STRICTNESS | The default strictness setting. This is the same as CALLED_ON_NULL_INPUT.
If you do not define a null input behavior, then the function is called on every row of data regardless of the presence of NULL values.
The following example sets the NULL input behavior to STRICT in the multiplyTwoIntsFactory function:
multiplyTwoIntsFactory <- function() {
list(name = multiplyTwoInts,
udxtype = c("scalar"),
intype = c("float","float"),
outtype = c("float"),
strictness = c("strict"),
parametertypecallback = multiplyTwoIntsParameters)
}
5.1.7 - Debugging tips
The following tips can help you debug your UDx before deploying it in a production environment.
The following tips can help you debug your UDx before deploying it in a production environment.
Use a single node for initial debugging
You can attach to the Vertica process using a debugger such as gdb to debug your UDx code. Doing this in a multi-node environment, however, is very difficult. Therefore, consider setting up a single-node Vertica test environment to initially debug your UDx.
Use logging
Each UDx has an associated ServerInterface
instance. The ServerInterface
provides functions to write to the Vertica log and, in the C++ API only, a system table. See Logging for more information.
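For example, a one-line sketch of writing a log message from a C++ UDx processing method, where srvInterface is the ServerInterface reference Vertica passes to your UDx and rowsProcessed is a hypothetical counter:
// Messages written with ServerInterface::log() appear in the Vertica log,
// tagged with the UDx's name, which makes them easy to find while debugging.
srvInterface.log("processBlock: processed %zu rows", rowsProcessed);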
5.2 - Arguments and return values
For all UDx types except load (UDL), the factory class declares the arguments and return type of the associated function.
For all UDx types except load (UDL), the factory class declares the arguments and return type of the associated function. Factories have two methods for this purpose:
-
getPrototype()
(required): declares input and output types
-
getReturnType()
(sometimes required): declares the return types, including length and precision, when applicable
The getPrototype()
method receives two ColumnTypes
parameters, one for input and one for output. The factory in C++ example: string tokenizer takes a single input string and returns a string:
virtual void getPrototype(ServerInterface &srvInterface,
ColumnTypes &argTypes, ColumnTypes &returnType)
{
argTypes.addVarchar();
returnType.addVarchar();
}
The ColumnTypes
class provides "add" methods for each supported type, like addVarchar()
. This class supports complex types with the addArrayType()
and addRowType()
methods; see Complex Types as Arguments. If your function is polymorphic, you can instead call addAny()
. You are then responsible for validating your inputs and outputs. For more information about implementing polymorphic UDxs, see Creating a polymorphic UDx.
The getReturnType()
method computes a maximum length for the returned value. If your UDx returns a sized column (a return data type whose length can vary, such as a VARCHAR), a value that requires precision, or more than one value, implement this factory method. (Some UDx types require you to implement it.)
The input is a SizedColumnTypes
containing the input argument types along with their lengths. Depending on the input types, add one of the following to the output types:
-
CHAR, (LONG) VARCHAR, BINARY, and (LONG) VARBINARY: return the maximum length.
-
NUMERIC types: specify the precision and scale.
-
TIME and TIMESTAMP values (with or without timezone): specify precision.
-
INTERVAL YEAR TO MONTH: specify range.
-
INTERVAL DAY TO SECOND: specify precision and range.
-
ARRAY: specify the maximum number of array elements.
In the case of the string tokenizer, the output is a VARCHAR and the function determines its maximum length:
// Tell Vertica what our return string length will be, given the input
// string length
virtual void getReturnType(ServerInterface &srvInterface,
const SizedColumnTypes &inputTypes,
SizedColumnTypes &outputTypes)
{
// Error out if we're called with anything but 1 argument
if (inputTypes.getColumnCount() != 1)
vt_report_error(0, "Function only accepts 1 argument, but %zu provided", inputTypes.getColumnCount());
int input_len = inputTypes.getColumnType(0).getStringLength();
// Our output size will never be more than the input size
outputTypes.addVarchar(input_len, "words");
}
Complex types as arguments and return values
The ColumnTypes
class supports ARRAY and ROW types. Arrays have elements and rows have fields, both of which have types that you need to describe. To work with complex types, you build ColumnTypes
objects for the array or row and then add them to the ColumnTypes
objects representing the function inputs and outputs.
In the following example, the input to a transform function is an array of orders, which are rows, and the output is the individual rows with their positions in the array. An order consists of a shipping address (VARCHAR) and an array of product IDs (INT).
The factory's getPrototype()
method first creates ColumnTypes
for the array and row elements and then calls addArrayType()
and addRowType()
using them:
void getPrototype(ServerInterface &srv,
ColumnTypes &argTypes,
ColumnTypes &retTypes)
{
// item ID (int), to be used in an array
ColumnTypes itemIdProto;
itemIdProto.addInt();
// row: order = address (varchar) + array of previously-created item IDs
ColumnTypes orderProto;
orderProto.addVarchar(); /* address */
orderProto.addArrayType(itemIdProto); /* array of item ID */
/* argument (input) is array of orders */
argTypes.addArrayType(orderProto);
/* return values: index in the array, order */
retTypes.addInt(); /* index of element */
retTypes.addRowType(orderProto); /* element return type */
}
The arguments include a sized type (the VARCHAR). The getReturnType()
method uses a similar approach, using the Fields
class to build the two fields in the order.
void getReturnType(ServerInterface &srv,
const SizedColumnTypes &argTypes,
SizedColumnTypes &retTypes)
{
Fields itemIdElementFields;
itemIdElementFields.addInt("item_id");
Fields orderFields;
orderFields.addVarchar(32, "address");
orderFields.addArrayType(itemIdElementFields[0], "item_id");
// optional third arg: max length, default unbounded
/* declare return type */
retTypes.addInt("index");
static_cast<Fields &>(retTypes).addRowType(orderFields, "element");
/* NOTE: presumably we have verified that the arguments match the prototype, so really we could just do this: */
retTypes.addInt("index");
retTypes.addArg(argTypes.getColumnType(0).getElementType(), "element");
}
To access complex types in the UDx processing method, use the ArrayReader
, ArrayWriter
, StructReader
, and StructWriter
classes.
See C++ example: using complex types for a polymorphic function that uses arrays.
The factory's getPrototype() method first uses makeType() and addType() methods to create and construct ColumnTypes for the row and its elements. The method then calls addType() methods to add these constructed ColumnTypes to the arg_types and return_type objects:
def getPrototype(self, srv_interface, arg_types, return_type):
# item ID (int), to be used in an array
itemIdProto = vertica_sdk.ColumnTypes.makeInt()
# row (order): address (varchar) + array of previously-created item IDs
orderProtoFields = vertica_sdk.ColumnTypes.makeEmpty()
orderProtoFields.addVarchar() # address
orderProtoFields.addArrayType(itemIdProto) # array of item ID
orderProto = vertica_sdk.ColumnTypes.makeRowType(orderProtoFields)
# argument (input): array of orders
arg_types.addArrayType(orderProto)
# return values: index in the array, order
return_type.addInt(); # index of element
return_type.addRowType(orderProto); # element return type
The factory's getReturnType()
method creates SizedColumnTypes
with the makeInt()
and makeEmpty()
methods and then builds two row fields with the addVarchar()
and addArrayType()
methods. Note that the addArrayType()
method specifies the maximum number of array elements as 1024. getReturnType()
then adds these constructed SizedColumnTypes
to the object representing the return type.
def getReturnType(self, srv_interface, arg_types, return_type):
itemIdElementField = vertica_sdk.SizedColumnTypes.makeInt("item_id")
orderFields = vertica_sdk.SizedColumnTypes.makeEmpty()
orderFields.addVarchar(32, "address")
orderFields.addArrayType(itemIdElementField, 1024, "item_ids")
# declare return type
return_type.addInt("index")
return_type.addRowType(orderFields, "element")
'''
NOTE: presumably we have verified that the arguments match the prototype, so really we could just do this:
return_type.addInt("index")
return_type.addArrayType(arg_types.getColumnType(0).getElementType(), "element")
'''
To access complex types in the UDx processing method, use the ArrayReader
, ArrayWriter
, RowReader
, and RowWriter
classes. For details, see Python SDK.
See Python example: matrix multiplication for a scalar function that uses complex types.
Handling different numbers and types of arguments
You can create UDxs that handle multiple signatures, or even accept all arguments supplied to them by the user, using either overloading or polymorphism.
You can overload your UDx by assigning the same SQL function name to multiple factory classes, each of which defines a unique function signature. When a user uses the function name in a query, Vertica tries to match the signature of the function call to the signatures declared by the factory's getPrototype()
method. This is the best technique to use if your UDx needs to accept a few different signatures (for example, accepting two required and one optional argument).
Alternatively, you can write a polymorphic function, writing one factory method instead of several and declaring that it accepts any number and type of arguments. When a user uses the function name in a query, Vertica calls your function regardless of the signature. In exchange for this flexibility, your UDx's main "process" method has to determine whether it can accept the arguments and emit errors if not.
All UDx types can use polymorphic inputs. Transform functions and analytic functions can also use polymorphic outputs. This means that getPrototype()
can declare a return type of "any" and set the actual return type at runtime. For example, a function that returns the largest value in an input would return the same type as the input type.
5.2.1 - Overloading your UDx
You may want your UDx to accept several different signatures (sets of arguments).
You may want your UDx to accept several different signatures (sets of arguments). For example, you might want your UDx to accept:
-
One or more optional arguments.
-
One or more arguments that can be one of several data types.
-
Completely distinct signatures (either all INTEGER or all VARCHAR, for example).
You can create a function with this behavior by creating several factory classes, each of which accepts a different signature (the number and data types of arguments). You can then associate a single SQL function name with all of them. You can use the same SQL function name to refer to multiple factory classes as long as the signature defined by each factory is unique. When a user calls your UDx, Vertica matches the number and types of arguments supplied by the user to the arguments accepted by each of your function's factory classes. If one matches, Vertica uses it to instantiate a function class to process the data.
Multiple factory classes can instantiate the same function class, so you can re-use one function class that is able to process multiple sets of arguments and then create factory classes for each of the function signatures. You can also create multiple function classes if you want.
See the C++ example: overloading your UDx and Java example: overloading your UDx examples.
5.2.1.1 - C++ example: overloading your UDx
The following example code demonstrates creating a user-defined scalar function (UDSF) that adds two or three integers together.
The following example code demonstrates creating a user-defined scalar function (UDSF) that adds two or three integers together. The Add2or3ints
class is prepared to handle two or three arguments. The processBlock()
function checks the number of arguments that have been passed to it and adds them together. It also exits with an error message if it has been called with fewer than 2 or more than 3 arguments. In theory, this should never happen, since Vertica only calls the UDSF if the user's function call matches a signature on one of the factory classes you create for your function. In practice, it is a good idea to perform this sanity checking, in case your (or someone else's) factory class inaccurately reports a set of arguments your function class cannot handle.
#include "Vertica.h"
using namespace Vertica;
using namespace std;
// a ScalarFunction that accepts two or three
// integers and adds them together.
class Add2or3ints : public Vertica::ScalarFunction
{
public:
virtual void processBlock(Vertica::ServerInterface &srvInterface,
Vertica::BlockReader &arg_reader,
Vertica::BlockWriter &res_writer)
{
const size_t numCols = arg_reader.getNumCols();
// Ensure that only two or three parameters are passed in
if ( numCols < 2 || numCols > 3)
vt_report_error(0, "Function only accept 2 or 3 arguments, "
"but %zu provided", arg_reader.getNumCols());
// Add two integers together
do {
const vint a = arg_reader.getIntRef(0);
const vint b = arg_reader.getIntRef(1);
vint c = 0;
// Check for third argument, add it in if it exists.
if (numCols == 3)
c = arg_reader.getIntRef(2);
res_writer.setInt(a+b+c);
res_writer.next();
} while (arg_reader.next());
}
};
// This factory accepts function calls with two integer arguments.
class Add2intsFactory : public Vertica::ScalarFunctionFactory
{
virtual Vertica::ScalarFunction *createScalarFunction(Vertica::ServerInterface
&srvInterface)
{ return vt_createFuncObj(srvInterface.allocator, Add2or3ints); }
virtual void getPrototype(Vertica::ServerInterface &srvInterface,
Vertica::ColumnTypes &argTypes,
Vertica::ColumnTypes &returnType)
{ // Accept 2 integer values
argTypes.addInt();
argTypes.addInt();
returnType.addInt();
}
};
RegisterFactory(Add2intsFactory);
// This factory defines a function that accepts 3 ints.
class Add3intsFactory : public Vertica::ScalarFunctionFactory
{
virtual Vertica::ScalarFunction *createScalarFunction(Vertica::ServerInterface
&srvInterface)
{ return vt_createFuncObj(srvInterface.allocator, Add2or3ints); }
virtual void getPrototype(Vertica::ServerInterface &srvInterface,
Vertica::ColumnTypes &argTypes,
Vertica::ColumnTypes &returnType)
{ // accept 3 integer values
argTypes.addInt();
argTypes.addInt();
argTypes.addInt();
returnType.addInt();
}
};
RegisterFactory(Add3intsFactory);
The example has two ScalarFunctionFactory
classes, one for each signature that the function accepts (two integers and three integers). There is nothing unusual about these factory classes, except that their implementation of ScalarFunctionFactory::createScalarFunction()
both create Add2or3ints
objects.
The final step is to bind the same SQL function name to both factory classes. You can assign multiple factories to the same SQL function, as long as the signatures defined by each factory's getPrototype()
implementation are different.
=> CREATE LIBRARY add2or3IntsLib AS '/home/dbadmin/Add2or3Ints.so';
CREATE LIBRARY
=> CREATE FUNCTION add2or3Ints as NAME 'Add2intsFactory' LIBRARY add2or3IntsLib FENCED;
CREATE FUNCTION
=> CREATE FUNCTION add2or3Ints as NAME 'Add3intsFactory' LIBRARY add2or3IntsLib FENCED;
CREATE FUNCTION
=> SELECT add2or3Ints(1,2);
add2or3Ints
-------------
3
(1 row)
=> SELECT add2or3Ints(1,2,4);
add2or3Ints
-------------
7
(1 row)
=> SELECT add2or3Ints(1,2,3,4); -- Will generate an error
ERROR 3467: Function add2or3Ints(int, int, int, int) does not exist, or
permission is denied for add2or3Ints(int, int, int, int)
HINT: No function matches the given name and argument types. You may
need to add explicit type casts
The error message in response to the final call to the add2or3Ints function was generated by Vertica, since it could not find a factory class associated with add2or3Ints that accepted four integer arguments. To expand add2or3Ints further, you could create another factory class that accepted this signature, and either change the Add2or3ints ScalarFunction class or create a totally different class to handle adding more integers together. However, adding more classes to accept each variation in the arguments quickly becomes overwhelming. In that case, you should consider creating a polymorphic UDx.
5.2.1.2 - Java example: overloading your UDx
The following example code demonstrates creating a user-defined scalar function (UDSF) that adds two or three integers together.
The following example code demonstrates creating a user-defined scalar function (UDSF) that adds two or three integers together. The Add2or3ints class is prepared to handle two or three arguments: it checks the number of arguments that have been passed to it and adds them together. The processBlock()
method reports an error if it has been called with fewer than 2 or more than 3 arguments. In theory, this should never happen, since Vertica only calls the UDSF if the user's function call matches a signature on one of the factory classes you create for your function. In practice, it is a good idea to perform this sanity checking, in case your (or someone else's) factory class reports that your function class accepts a set of arguments that it actually does not.
// You need to specify the full package when creating functions based on
// the classes in your library.
package com.mycompany.multiparamexample;
// Import the entire Vertica SDK
import com.vertica.sdk.*;
// This ScalarFunction accepts two or three integer arguments. It tests
// the number of input columns to determine whether to read two or three
// arguments as input.
public class Add2or3ints extends ScalarFunction
{
@Override
public void processBlock(ServerInterface srvInterface,
BlockReader argReader,
BlockWriter resWriter)
throws UdfException, DestroyInvocation
{
// See how many arguments were passed in
int numCols = argReader.getNumCols();
// Return an error if fewer than two or more than three arguments
// were given. This error only occurs if a Factory class that
// accepts the wrong number of arguments instantiates this
// class.
if (numCols < 2 || numCols > 3) {
throw new UdfException(0,
"Must supply 2 or 3 integer arguments");
}
// Process all of the rows of input.
do {
// Get the first two integer arguments from the BlockReader
long a = argReader.getLong(0);
long b = argReader.getLong(1);
// Assume no third argument.
long c = 0;
// Get third argument value if it exists
if (numCols == 3) {
c = argReader.getLong(2);
}
// Process the arguments and come up with a result. For this
// example, just add the three arguments together.
long result = a+b+c;
// Write the integer output value.
resWriter.setLong(result);
// Advance the output BlockWriter to the next row.
resWriter.next();
// Continue processing input rows until there are no more.
} while (argReader.next());
}
}
The main difference between the Add2ints
class and the Add2or3ints
class is the inclusion of a section that gets the number of arguments by calling BlockReader.getNumCols()
. This class also tests the number of columns it received from Vertica to ensure it is in the range it is prepared to handle. This test will only fail if you create a ScalarFunctionFactory
whose getPrototype()
method defines a signature that accepts fewer than two or more than three arguments. This is not really necessary in this simple example, but for a more complicated class it is a good idea to test the number of columns and data types that Vertica passed your function class.
Within the do
loop, Add2or3ints
uses a default value of zero if Vertica sent it two input columns. Otherwise, it retrieves the third value and adds that to the other two. Your own class needs to use default values for missing input columns or alter its processing in some other way to handle the variable columns.
You must define your function class in its own source file, rather than as an inner class of one of your factory classes, since Java does not allow the instantiation of an inner class from outside its containing class. Your function class has to be available for instantiation by multiple factory classes.
Once you have created a function class or classes, you create a factory class for each signature you want your function class to handle. These factory classes can call individual function classes, or they can all call the same class that is prepared to accept multiple sets of arguments.
The following example's createScalarFunction()
method instantiates a member of the Add2or3ints
class.
// You will need to specify the full package when creating functions based on
// the classes in your library.
package com.mycompany.multiparamexample;
// Import the entire Vertica SDK
import com.vertica.sdk.*;
public class Add2intsFactory extends ScalarFunctionFactory
{
@Override
public void getPrototype(ServerInterface srvInterface,
ColumnTypes argTypes,
ColumnTypes returnType)
{
// Accept two integers as input
argTypes.addInt();
argTypes.addInt();
// writes one integer as output
returnType.addInt();
}
@Override
public ScalarFunction createScalarFunction(ServerInterface srvInterface)
{
// Instantiate the class that can handle either 2 or 3 integers.
return new Add2or3ints();
}
}
The following ScalarFunctionFactory
subclass accepts three integers as input. It, too, instantiates a member of the Add2or3ints
class to process the function call:
// You will need to specify the full package when creating functions based on
// the classes in your library.
package com.mycompany.multiparamexample;
// Import the entire Vertica SDK
import com.vertica.sdk.*;
public class Add3intsFactory extends ScalarFunctionFactory
{
@Override
public void getPrototype(ServerInterface srvInterface,
ColumnTypes argTypes,
ColumnTypes returnType)
{
// Accepts three integers as input
argTypes.addInt();
argTypes.addInt();
argTypes.addInt();
// Returns a single integer
returnType.addInt();
}
@Override
public ScalarFunction createScalarFunction(ServerInterface srvInterface)
{
// Instantiates the Add2or3ints ScalarFunction class, which is able to
// handle either 2 or 3 integers as arguments.
return new Add2or3ints();
}
}
The factory classes and the function class or classes they call must be packaged into the same JAR file (see Compiling and packaging a Java library for details). If a host in the database cluster has the JDK installed on it, you could use the following commands to compile and package the example:
$ cd pathToJavaProject
$ javac -classpath /opt/vertica/bin/VerticaSDK.jar \
> com/mycompany/multiparamexample/*.java
$ jar -cvf Add2or3intslib.jar com/vertica/sdk/BuildInfo.class \
> com/mycompany/multiparamexample/*.class
added manifest
adding: com/vertica/sdk/BuildInfo.class(in = 1202) (out= 689)(deflated 42%)
adding: com/mycompany/multiparamexample/Add2intsFactory.class(in = 677) (out= 366)(deflated 45%)
adding: com/mycompany/multiparamexample/Add2or3ints.class(in = 919) (out= 601)(deflated 34%)
adding: com/mycompany/multiparamexample/Add3intsFactory.class(in = 685) (out= 369)(deflated 46%)
Once you have packaged your overloaded UDx, you deploy it the same way as you do a regular UDx, except you use multiple CREATE FUNCTION statements to define the function, once for each factory class.
=> CREATE LIBRARY add2or3intslib as '/home/dbadmin/Add2or3intslib.jar'
-> language 'Java';
CREATE LIBRARY
=> CREATE FUNCTION add2or3ints as LANGUAGE 'Java' NAME 'com.mycompany.multiparamexample.Add2intsFactory' LIBRARY add2or3intslib;
CREATE FUNCTION
=> CREATE FUNCTION add2or3ints as LANGUAGE 'Java' NAME 'com.mycompany.multiparamexample.Add3intsFactory' LIBRARY add2or3intslib;
CREATE FUNCTION
You call the overloaded function the same way you call any other function.
=> SELECT add2or3ints(2,3);
add2or3ints
-------------
5
(1 row)
=> SELECT add2or3ints(2,3,4);
add2or3ints
-------------
9
(1 row)
=> SELECT add2or3ints(2,3,4,5);
ERROR 3457: Function add2or3ints(int, int, int, int) does not exist, or permission is denied for add2or3ints(int, int, int, int)
HINT: No function matches the given name and argument types. You may need to add explicit type casts
The last error was generated by Vertica, not the UDx code. It returns an error if it cannot find a factory class whose signature matches the function call's signature.
Creating an overloaded UDx is useful if you want your function to accept a limited set of potential arguments. If you want to create a more flexible function, you can create a polymorphic function.
5.2.2 - Creating a polymorphic UDx
Polymorphic UDxs accept any number and type of argument that the user supplies.
Polymorphic UDxs accept any number and type of argument that the user supplies. Transform functions (UDTFs), analytic functions (UDAnFs), and aggregate functions (UDAFs) can define their output return types at runtime, usually based on the input arguments. For example, a UDTF that adds two numbers could return an integer or a float, depending on the input types.
Vertica does not check the number or types of arguments that the user passes to the UDx; it just passes the UDx all of the arguments supplied by the user. It is up to your polymorphic UDx's main processing function (for example, processBlock()
in user-defined scalar functions) to examine the number and types of arguments it received and determine if it can handle them. UDxs support up to 9800 arguments.
Polymorphic UDxs are more flexible than using multiple factory classes for your function (see Overloading your UDx). They also allow you to write more concise code, instead of writing versions for each data type. The tradeoff is that your polymorphic function needs to perform more work to determine whether it can process its arguments.
Your polymorphic UDx declares that it accepts any number of arguments in its factory's getPrototype()
function by calling the addAny()
function on the ColumnTypes
object that defines its arguments, as follows:
// C++ example
void getPrototype(ServerInterface &srvInterface,
ColumnTypes &argTypes,
ColumnTypes &returnType)
{
argTypes.addAny(); // Must be only argument type.
returnType.addInt(); // or whatever the function returns
}
This "any parameter" argument type is the only one that your function can declare. You cannot define required arguments and then call addAny()
to declare the rest of the signature as optional. If your function has requirements for the arguments it accepts, your process()
function must enforce them.
The getPrototype()
example shown previously accepts any type and declares that it returns an integer. The following example shows a version of the method that defers resolving the return type until runtime. You can only use the "any" return type for transform and analytic functions.
void getPrototype(ServerInterface &srvInterface,
ColumnTypes &argTypes,
ColumnTypes &returnType)
{
argTypes.addAny();
returnType.addAny(); // type determined at runtime
}
If you use polymorphic return types, you must also define getReturnType()
in your factory. This function is called at runtime to determine the actual return type. See C++ example: PolyNthValue for an example.
Polymorphic UDxs and schema search paths
If a user does not supply a schema name as part of a UDx call, Vertica searches each schema in the schema search path for a function whose name and signature match the function call. See Setting search paths for more information about schema search paths.
Because polymorphic UDxs do not have specific signatures associated with them, Vertica initially skips them when searching for a function to handle the function call. If none of the schemas in the search path contain a UDx whose name and signature match the function call, Vertica searches the schema search path again for a polymorphic UDx whose name matches the function name in the function call.
This behavior gives precedence to a UDx whose signature exactly matches the function call. It allows you to create a "catch-all" polymorphic UDx that Vertica calls only when none of the non-polymorphic UDxs with the same name have matching signatures.
This behavior may cause confusion if your users expect the first polymorphic function in the schema search path to handle a function call. To avoid confusion, you should:
-
Avoid using the same name for different UDxs. You should always uniquely name UDxs unless you intend to create an overloaded UDx with multiple signatures.
-
When you cannot avoid having UDxs with the same name in different schemas, always supply the schema name as part of the function call, as shown in the example after this list. Using the schema name prevents ambiguity and ensures that Vertica uses the correct UDx to process your function calls.
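For example, if schemas s1 and s2 both contained a UDx named myfunc (names illustrative), qualifying the call removes any ambiguity:
=> SELECT s1.myfunc(1, 2); -- always uses the UDx defined in schema s1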
5.2.2.1 - C++ example: PolyNthValue
The PolyNthValue example is an analytic function that returns the value in the Nth row in each partition in its input.
The PolyNthValue example is an analytic function that returns the value in the Nth row in each partition in its input. This function is a generalization of FIRST_VALUE [analytic] and LAST_VALUE [analytic].
The values can be of any primitive data type.
For the complete source code, see PolymorphicNthValue.cpp
in the examples (in /opt/vertica/sdk/examples/AnalyticFunctions/
).
Loading and using the example
Load the library and create the function as follows:
=> CREATE LIBRARY AnalyticFunctions AS '/home/dbadmin/AnalyticFns.so';
CREATE LIBRARY
=> CREATE ANALYTIC FUNCTION poly_nth_value AS LANGUAGE 'C++'
NAME 'PolyNthValueFactory' LIBRARY AnalyticFunctions;
CREATE ANALYTIC FUNCTION
Consider a table of scores for different test groups:
=> SELECT cohort, score FROM trials;
cohort | score
--------+-------
1 | 9
1 | 8
1 | 7
3 | 3
3 | 2
3 | 1
2 | 4
2 | 5
2 | 6
(9 rows)
Call the function in a query that uses an OVER clause to partition the data. This example returns the second-highest score in each cohort:
=> SELECT cohort, score, poly_nth_value(score USING PARAMETERS n=2) OVER (PARTITION BY cohort) AS nth_value
FROM trials;
cohort | score | nth_value
--------+-------+-----------
1 | 9 | 8
1 | 8 | 8
1 | 7 | 8
3 | 3 | 2
3 | 2 | 2
3 | 1 | 2
2 | 4 | 5
2 | 5 | 5
2 | 6 | 5
(9 rows)
Factory implementation
The factory declares that the class is polymorphic, and then sets the return type based on the input type. Two factory methods specify the argument and return types.
Use the getPrototype()
method to declare that the analytic function takes and returns any type:
void getPrototype(ServerInterface &srvInterface, ColumnTypes &argTypes, ColumnTypes &returnType)
{
// This function supports any argument data type
argTypes.addAny();
// Output data type will be the same as the argument data type
// We will specify that in getReturnType()
returnType.addAny();
}
The getReturnType() method is called at runtime. This is where you set the return type based on the input type:
void getReturnType(ServerInterface &srvInterface, const SizedColumnTypes &inputTypes,
SizedColumnTypes &outputTypes)
{
// This function accepts only one argument
// Complain if we find a different number
std::vector<size_t> argCols;
inputTypes.getArgumentColumns(argCols); // get argument column indices
if (argCols.size() != 1)
{
vt_report_error(0, "Only one argument is expected but %s provided",
argCols.size()? std::to_string(argCols.size()).c_str() : "none");
}
// Define output type the same as argument type
outputTypes.addArg(inputTypes.getColumnType(argCols[0]), inputTypes.getColumnName(argCols[0]));
}
Function implementation
The analytic function itself is type-agnostic:
void processPartition(ServerInterface &srvInterface, AnalyticPartitionReader &inputReader,
AnalyticPartitionWriter &outputWriter)
{
try {
const SizedColumnTypes &inTypes = inputReader.getTypeMetaData();
std::vector<size_t> argCols; // Argument column indexes.
inTypes.getArgumentColumns(argCols);
vint currentRow = 1;
bool nthRowExists = false;
// Find the value of the n-th row
do {
if (currentRow == this->n) {
nthRowExists = true;
break;
} else {
currentRow++;
}
} while (inputReader.next());
if (nthRowExists) {
do {
// Return n-th value
outputWriter.copyFromInput(0 /*dest column*/, inputReader,
argCols[0] /*source column*/);
} while (outputWriter.next());
} else {
// The partition has less than n rows
// Return NULL value
do {
outputWriter.setNull(0);
} while (outputWriter.next());
}
} catch(std::exception& e) {
// Standard exception. Quit.
vt_report_error(0, "Exception while processing partition: [%s]", e.what());
}
}
};
5.2.2.2 - Java example: AddAnyInts
The following example shows an implementation of a Java ScalarFunction that adds together two or more integers.
The following example shows an implementation of a Java ScalarFunction
that adds together two or more integers.
For the complete source code, see AddAnyIntsInfo.java in the examples (in /opt/vertica/sdk/examples/JavaUDx/ScalarFunctions).
Loading and using the example
Load the library and create the function as follows:
=> CREATE LIBRARY JavaScalarFunctions AS '/home/dbadmin/JavaScalarLib.jar' LANGUAGE 'JAVA';
CREATE LIBRARY
=> CREATE FUNCTION addAnyInts AS LANGUAGE 'Java' NAME 'com.vertica.JavaLibs.AddAnyIntsInfo'
LIBRARY JavaScalarFunctions;
CREATE FUNCTION
Call the function with two or more integer arguments:
=> SELECT addAnyInts(1,2);
addAnyInts
------------
3
(1 row)
=> SELECT addAnyInts(1,2,3,40,50,60,70,80,900);
addAnyInts
------------
1206
(1 row)
Calling the function with too few arguments, or with non-integer arguments, produces errors that are generated from the processBlock() method. It is up to your UDx to ensure that the user supplies the correct number and types of arguments to your function, and to exit with an error if it cannot process them.
Function implementation
Most of the work in the example is done by the processBlock() method. It performs two checks on the arguments that have been passed in through the BlockReader object: that at least two arguments were supplied, and that all of the arguments are integers. (These checks appear in the full source file, not in the excerpt below.) It is up to your polymorphic UDx to determine that all of the input passed to it is valid.
Once the processBlock()
method validates its arguments, it loops over them, adding them together.
@Override
public void processBlock(ServerInterface srvInterface,
BlockReader arg_reader,
BlockWriter res_writer)
throws UdfException, DestroyInvocation
{
SizedColumnTypes inTypes = arg_reader.getTypeMetaData();
ArrayList<Integer> argCols = new ArrayList<Integer>(); // Argument column indexes.
inTypes.getArgumentColumns(argCols);
// While we have inputs to process
do {
long sum = 0;
for (int i = 0; i < argCols.size(); ++i){
long a = arg_reader.getLong(argCols.get(i)); // read each argument column
sum += a;
}
res_writer.setLong(sum);
res_writer.next();
} while (arg_reader.next());
}
}
Factory implementation
The factory declares the number and type of arguments in the getPrototype()
function.
@Override
public void getPrototype(ServerInterface srvInterface,
ColumnTypes argTypes,
ColumnTypes returnType)
{
argTypes.addAny();
returnType.addInt();
}
5.2.2.3 - R example: kmeansPoly
The following example shows an implementation of a Transform Function (UDTF) that performs kmeans clustering on one or more input columns.
The following example shows an implementation of a Transform Function (UDTF) that performs kmeans clustering on one or more input columns.
kmeansPoly <- function(v.data.frame,v.param.list) {
# Computes clusters using the kmeans algorithm.
#
# Input: A dataframe and a list of parameters.
# Output: A dataframe with one column that tells the cluster to which each data
# point belongs.
# Args:
# v.data.frame: The data from Vertica cast as an R data frame.
# v.param.list: List of function parameters.
#
# Returns:
# The cluster associated with each data point.
# Ensure k is not null.
if(!is.null(v.param.list[['k']])) {
number_of_clusters <- as.numeric(v.param.list[['k']])
} else {
stop("k cannot be NULL! Please use a valid value.")
}
# Run the kmeans algorithm.
kmeans_clusters <- kmeans(v.data.frame, number_of_clusters)
final.output <- data.frame(kmeans_clusters$cluster)
return(final.output)
}
kmeansFactoryPoly <- function() {
# This function tells Vertica the name of the R function,
# and the polymorphic parameters.
list(name=kmeansPoly, udxtype=c("transform"), intype=c("any"),
outtype=c("int"), parametertypecallback=kmeansParameters)
}
kmeansParameters <- function() {
# Callback function for the parameter types.
function.parameters <- data.frame(datatype=rep(NA, 1), length=rep(NA,1),
scale=rep(NA,1), name=rep(NA,1))
function.parameters[1,1] = "int"
function.parameters[1,4] = "k"
return(function.parameters)
}
The polymorphic R function declares that it accepts any number of arguments in its factory function by specifying "any" as the argument to the intype parameter and, optionally, the outtype parameter. If you specify "any" for intype or outtype, it must be the only type that your function declares for that parameter. You cannot define required arguments and then use "any" to declare the rest of the signature as optional. If your function has requirements for the arguments it accepts, your process function must enforce them.
The outtypecallback method receives the argument types and sizes with which the function was called, and must indicate the types and sizes that the function returns. It can also check for unsupported types or an unsupported number of arguments. For example, the function might require only integers, with no more than 10 of them.
You assign a SQL name to your polymorphic UDx using the same statement you use to assign one to a non-polymorphic UDx. The following statements show how you load and call the polymorphic function from the example.
=> CREATE LIBRARY rlib2 AS '/home/dbadmin/R_UDx/poly_kmeans.R' LANGUAGE 'R';
CREATE LIBRARY
=> CREATE TRANSFORM FUNCTION kmeansPoly AS LANGUAGE 'R' name 'kmeansFactoryPoly' LIBRARY rlib2;
CREATE FUNCTION
=> SELECT spec, kmeansPoly(sl,sw,pl,pw USING PARAMETERS k = 3)
OVER(PARTITION BY spec) AS Clusters
FROM iris;
spec | Clusters
-----------------+----------
Iris-setosa | 1
Iris-setosa | 1
Iris-setosa | 1
Iris-setosa | 1
.
.
.
(150 rows)
5.3 - UDx parameters
Parameters let you define arguments for your UDxs that remain constant across all of the rows processed by the SQL statement that calls your UDx.
Parameters let you define arguments for your UDxs that remain constant across all of the rows processed by the SQL statement that calls your UDx. Typically, your UDxs accept arguments that come from columns in a SQL statement. For example, in the following SQL statement, the arguments a and b to the add2ints
UDSF change value for each row processed by the SELECT statement:
=> SELECT a, b, add2ints(a,b) AS 'sum' FROM example;
a | b | sum
---+----+-----
1 | 2 | 3
3 | 4 | 7
5 | 6 | 11
7 | 8 | 15
9 | 10 | 19
(5 rows)
Parameters remain constant for all the rows your UDx processes. You can also make parameters optional, so that if the user does not supply a value, your UDx uses a default. For example, the following example demonstrates calling a UDSF named add2intsWithConstant that has a single parameter named constant, whose value is added to the arguments supplied in each row of input:
=> SELECT a, b, add2intsWithConstant(a, b USING PARAMETERS constant=42)
AS 'a+b+42' from example;
a | b | a+b+42
---+----+--------
1 | 2 | 45
3 | 4 | 49
5 | 6 | 53
7 | 8 | 57
9 | 10 | 61
(5 rows)
Note
When calling a UDx with parameters, there is no comma between the last argument and the USING PARAMETERS clause.
The topics in this section explain how to develop UDxs that accept parameters.
5.3.1 - Defining UDx parameters
You define the parameters that your UDx accepts in its factory class (ScalarFunctionFactory, AggregateFunctionFactory, and so on) by implementing getParameterType().
You define the parameters that your UDx accepts in its factory class (ScalarFunctionFactory
, AggregateFunctionFactory
, and so on) by implementing getParameterType()
. This method is similar to getReturnType()
: you call data-type-specific methods on a SizedColumnTypes
object that is passed in as a parameter. Each function call sets the name, data type, and width or precision (if the data type requires it) of the parameter.
Note
Parameter names in the __param-name__ format are reserved for internal use.
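For example, the following minimal sketch of a getParameterType() implementation declares one integer parameter and one one-character VARCHAR parameter. The names n and symbol are illustrative (they match the RemoveSymbol example used later in this section), and the assumption that the addVarchar() width argument precedes the name follows the convention of the other sized-type methods:

virtual void getParameterType(ServerInterface &srvInterface,
                              SizedColumnTypes &parameterTypes)
{
    // An integer parameter needs only a name and type.
    parameterTypes.addInt("n");
    // A VARCHAR parameter also needs a maximum width.
    parameterTypes.addVarchar(1, "symbol");
}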
Setting parameter properties (C++ only)
When you add parameters to the getParameterType()
function using the C++ API, you can also set properties for each parameter. For example, you can define a parameter as being required by the UDx. Doing so lets the Vertica server know that every UDx invocation must provide the specified parameter, or the query fails.
By passing a SizedColumnTypes::Properties object, you can define the following four parameter properties:

| Parameter | Type | Description |
| --- | --- | --- |
| visible | BOOLEAN | If set to TRUE, the parameter appears in the USER_FUNCTION_PARAMETERS table. You may want to set this to FALSE to declare a parameter for internal use only. |
| required | BOOLEAN | If set to TRUE, the parameter is required when invoking the UDx. Invoking the UDx without supplying the parameter results in an error, and the UDx does not run. |
| canBeNull | BOOLEAN | If set to TRUE, the parameter can have a NULL value. If set to FALSE, make sure that the supplied parameter does not contain a NULL value when invoking the UDx; otherwise, an error results, and the UDx does not run. |
| comment | VARCHAR(128) | A comment to describe the parameter. If you exceed the 128-character limit, Vertica generates an error when you run the CREATE FUNCTION command. Additionally, if you replace the existing function definition in the comment parameter, make sure that the new definition does not exceed 128 characters; otherwise, you delete all existing entries in the USER_FUNCTION_PARAMETERS table related to the UDx. |
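For example, a getParameterType() implementation that marks a parameter as required and non-nullable might look like the following minimal sketch. It assumes that the four properties in the table above are public members of SizedColumnTypes::Properties and that the data-type methods such as addVarchar() accept an optional Properties argument; the parameter name and comment are illustrative:

virtual void getParameterType(ServerInterface &srvInterface,
                              SizedColumnTypes &parameterTypes)
{
    SizedColumnTypes::Properties props;
    props.visible = true;     // list in USER_FUNCTION_PARAMETERS
    props.required = true;    // every invocation must supply this parameter
    props.canBeNull = false;  // reject NULL values for the parameter
    props.comment = "Symbol to remove from the input string";
    parameterTypes.addVarchar(1, "symbol", props);
}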
Setting parameter properties (R only)
When using parameters in your R UDx, you must specify a field in the factory function called parametertypecallback
. This field points to the callback function that defines the parameters expected by the function. The callback function defines a four-column data frame with the following properties:
| Parameter | Type | Description |
| --- | --- | --- |
| datatype | VARCHAR(128) | The data type of the parameter. |
| length | INTEGER | The dimension of the parameter. |
| scale | INTEGER | The proportional dimensions of the parameter. |
| name | VARCHAR(128) | The name of the parameter. |
If any of the columns are left blank (or the parametertypecallback
function is omitted), then Vertica uses default values.
For more information, see Parametertypecallback function.
Redacting UDx parameters
If a parameter name meets any of the following criteria, its value is automatically redacted from logs and system tables like QUERY_REQUESTS:
5.3.2 - Getting parameter values in UDxs
Your UDx uses the parameter values it declared in its factory class (see Defining UDx parameters) in its function class's processing method (for example, processBlock() or processPartition()).
Your UDx uses the parameter values it declared in its factory class (see Defining UDx parameters) in its function class's processing method (for example, processBlock()
or processPartition()
). It gets its parameter values from a ParamReader
object, which is available from the ServerInterface
object that is passed to your processing method. Reading parameters from this object is similar to reading argument values from BlockReader
or PartitionReader
objects: you call a data-type-specific function with the name of the parameter to retrieve its value. For example, in C++:
// Get the parameter reader from the ServerInterface to see if there are supplied parameters.
ParamReader paramReader = srvInterface.getParamReader();
// Get the value of an int parameter named constant.
const vint constant = paramReader.getIntRef("constant");
Note
String data values do not have any of their escape characters processed before they are passed to your function. Therefore, your function may need to process the escape sequences itself if it needs to operate on unescaped character values.
Using parameters in the factory class
In addition to using parameters in your UDx function class, you can also access the parameters in the factory class. You may want to access the parameters to let the user control the input or output values of your function in some way. For example, your UDx can have a parameter that lets the user choose to have your UDx return a single- or double-precision value. The process of accessing parameters in the factory class is the same as in the function class: get a ParamReader object from the ServerInterface's getParamReader() method, then read the parameter values.
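For example, the following minimal sketch reads a hypothetical max_len parameter in the factory's getReturnType() method to size the function's VARCHAR return type; the parameter name and default width are illustrative:

virtual void getReturnType(ServerInterface &srvInterface,
                           const SizedColumnTypes &inputTypes,
                           SizedColumnTypes &outputTypes)
{
    // Default output width, used when the parameter is not supplied.
    vint maxLen = 80;
    ParamReader paramReader = srvInterface.getParamReader();
    if (paramReader.containsParameter("max_len")) {
        maxLen = paramReader.getIntRef("max_len");
    }
    outputTypes.addVarchar(maxLen, "result");
}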
Testing whether the user supplied parameter values
Unlike its handling of arguments, Vertica does not immediately return an error if a user's function call does not include a value for a parameter defined by your UDx's factory class. This means that your function can attempt to read a parameter value that the user did not supply. If it does so, by default Vertica returns a non-existent parameter warning to the user, and the query containing the function call continues.
If you want your parameter to be optional, you can test whether the user supplied a value for the parameter before attempting to access its value. Your function determines if a value exists for a particular parameter by calling the ParamReader
's containsParameter()
method with the parameter's name. If this call returns true, your function can safely retrieve the value. If this call returns false, your UDx can use a default value or change its processing in some other way to compensate for not having the parameter value. As long as your UDx does not try to access the non-existent parameter value, Vertica does not generate an error or warning about missing parameters.
Note
If the user passes your UDx a parameter that it has not defined, by default Vertica issues a warning that the parameter is not used. It still executes the SQL statement, ignoring the parameter. You can change this behavior by altering the
StrictUDxParameterChecking
configuration parameter.
See C++ example: defining parameters for an example.
5.3.3 - Calling UDxs with parameters
You pass parameters to a UDx by adding a USING PARAMETERS clause in the function call after the last argument.
You pass parameters to a UDx by adding a USING PARAMETERS clause in the function call after the last argument.
-
Do not insert a comma between the last argument and the USING PARAMETERS clause.
-
After the USING PARAMETERS clause, add one or more parameter definitions, in the following form:
<parameter name> = <parameter value>
-
Separate parameter definitions by commas.
Parameter values can be a constant expression (for example 1234 + SQRT(5678)
). You cannot use volatile functions (such as RANDOM) in the expression, because they do not return a constant value. If you do supply a volatile expression as a parameter value, by default, Vertica returns an incorrect parameter type warning. Vertica then tries to run the UDx without the parameter value. If the UDx requires the parameter, it returns its own error, which cancels the query.
Calling a UDx with a single parameter
The following example demonstrates how you can call the Add2intsWithConstant UDSF example shown in C++ example: defining parameters:
=> SELECT a, b, Add2intsWithConstant(a, b USING PARAMETERS constant=42) AS 'a+b+42' from example;
a | b | a+b+42
---+----+--------
1 | 2 | 45
3 | 4 | 49
5 | 6 | 53
7 | 8 | 57
9 | 10 | 61
(5 rows)
To remove the first instance of the number 3, you can call the RemoveSymbol UDSF example:
=> SELECT '3re3mo3ve3sy3mb3ol' original_string, RemoveSymbol('3re3mo3ve3sy3mb3ol' USING PARAMETERS symbol='3');
original_string | RemoveSymbol
--------------------+-------------------
3re3mo3ve3sy3mb3ol | re3mo3ve3sy3mb3ol
(1 row)
Calling a UDx with multiple parameters
The following example shows how you can call a version of the tokenize UDTF. This UDTF includes parameters to set the minimum allowed word length and to force the words to be output in uppercase. Separate multiple parameters with commas.
=> SELECT url, tokenize(description USING PARAMETERS minLength=4, uppercase=true) OVER (partition by url) FROM T;
url | words
-----------------+-----------
www.amazon.com | ONLINE
www.amazon.com | RETAIL
www.amazon.com | MERCHANT
www.amazon.com | PROVIDER
www.amazon.com | CLOUD
www.amazon.com | SERVICES
www.dell.com | LEADING
www.dell.com | PROVIDER
www.dell.com | COMPUTER
www.dell.com | HARDWARE
www.vertica.com | WORLD'S
www.vertica.com | FASTEST
www.vertica.com | ANALYTIC
www.vertica.com | DATABASE
(16 rows)
The following example calls the RemoveSymbol UDSF. By changing the value of the optional parameter, n
, you can remove all instances of the number 3:
=> SELECT '3re3mo3ve3sy3mb3ol' original_string, RemoveSymbol('3re3mo3ve3sy3mb3ol' USING PARAMETERS symbol='3', n=6);
original_string | RemoveSymbol
--------------------+--------------
3re3mo3ve3sy3mb3ol | removesymbol
(1 row)
Calling a UDx with optional or incorrect parameters
You can optionally add the Add2intsWithConstant UDSF's constant parameter. Calling this function without the parameter does not return an error or warning:
=> SELECT a,b,Add2intsWithConstant(a, b) AS 'sum' FROM example;
a | b | sum
---+----+-----
1 | 2 | 3
3 | 4 | 7
5 | 6 | 11
7 | 8 | 15
9 | 10 | 19
(5 rows)
Although calling a UDx with incorrect parameters generates a warning, by default, the query still runs. For further information on setting the behavior of your UDx when you supply incorrect parameters, see Specifying the behavior of passing unregistered parameters.
=> SELECT a, b, add2intsWithConstant(a, b USING PARAMETERS wrongparam=42) AS 'result' from example;
WARNING 4332: Parameter wrongparam was not registered by the function and cannot
be coerced to a definite data type
a | b | result
---+----+--------
1 | 2 | 3
3 | 4 | 7
5 | 6 | 11
7 | 8 | 15
9 | 10 | 19
(5 rows)
5.3.4 - Specifying the behavior of passing unregistered parameters
By default, Vertica issues a warning message when you pass a UDx an unregistered parameter.
By default, Vertica issues a warning message when you pass a UDx an unregistered parameter. An unregistered parameter is one that you did not declare in the getParameterType()
method.
You can control the behavior of your UDx when you pass it an unregistered parameter by altering the StrictUDxParameterChecking
configuration parameter.
Unregistered parameter behavior settings
You can specify the behavior of your UDx in response to one or more unregistered parameters. To do so, set the StrictUDxParameterChecking
configuration parameter to one of the following values:
-
0: Allows unregistered parameters to be accessible to the UDx. The ParamReader
class's getType()
method determines the data type of the unregistered parameter. Vertica does not display any warning or error message.
-
1 (default): Ignores the unregistered parameter and allows the function to run. Vertica displays a warning message.
-
2: Returns an error and does not allow the function to run.
Examples
The following examples demonstrate the behavior you can specify using different values with the StrictUDxParameterChecking
parameter.
View the current value of StrictUDxParameterChecking
To view the current value of the StrictUDxParameterChecking
configuration parameter, run the following query:
=> \x
Expanded display is on.
=> SELECT * FROM configuration_parameters WHERE parameter_name = 'StrictUDxParameterChecking';
-[ RECORD 1 ]-----------------+------------------------------------------------------------------
node_name | ALL
parameter_name | StrictUDxParameterChecking
current_value | 1
restart_value | 1
database_value | 1
default_value | 1
current_level | DATABASE
restart_level | DATABASE
is_mismatch | f
groups |
allowed_levels | DATABASE
superuser_only | f
change_under_support_guidance | f
change_requires_restart | f
description | Sets the behavior to deal with undeclared UDx function parameters
Change the value of StrictUDxParameterChecking
You can change the value of the StrictUDxParameterChecking
configuration parameter at the database, node, or session level. For example, you can change the value to '0' to specify that unregistered parameters can pass to the UDx without displaying a warning or error message:
=> ALTER DATABASE DEFAULT SET StrictUDxParameterChecking = 0;
ALTER DATABASE
Invalid parameter behavior with RemoveSymbol
The following example demonstrates how to call the RemoveSymbol UDSF example. The RemoveSymbol UDSF has a required parameter, symbol
, and an optional parameter, n
. In this case, you do not use the optional parameter.
If you pass both symbol
and an additional parameter called wrongParam
, which is not declared in the UDx, the behavior of the UDx changes corresponding to the value of StrictUDxParameterChecking
.
When you set StrictUDxParameterChecking
to '0', the UDx runs normally without a warning. Additionally, wrongParam
becomes accessible to the UDx through the ParamReader
object of the ServerInterface
object:
=> ALTER DATABASE DEFAULT SET StrictUDxParameterChecking = 0;
ALTER DATABASE
=> SELECT '3re3mo3ve3sy3mb3ol' original_string, RemoveSymbol('3re3mo3ve3sy3mb3ol' USING PARAMETERS symbol='3', wrongParam='x');
original_string | RemoveSymbol
--------------------+-------------------
3re3mo3ve3sy3mb3ol | re3mo3ve3sy3mb3ol
(1 row)
When you set StrictUDxParameterChecking
to '1', the UDx ignores wrongParam
and runs normally. However, it also issues a warning message:
=> ALTER DATABASE DEFAULT SET StrictUDxParameterChecking = 1;
ALTER DATABASE
=> SELECT '3re3mo3ve3sy3mb3ol' original_string, RemoveSymbol('3re3mo3ve3sy3mb3ol' USING PARAMETERS symbol='3', wrongParam='x');
WARNING 4320: Parameter wrongParam was not registered by the function and cannot be coerced to a definite data type
original_string | RemoveSymbol
--------------------+-------------------
3re3mo3ve3sy3mb3ol | re3mo3ve3sy3mb3ol
(1 row)
When you set StrictUDxParameterChecking
to '2', the UDx encounters an error when it tries to call wrongParam
and does not run. Instead, it generates an error message:
=> ALTER DATABASE DEFAULT SET StrictUDxParameterChecking = 2;
ALTER DATABASE
=> SELECT '3re3mo3ve3sy3mb3ol' original_string, RemoveSymbol('3re3mo3ve3sy3mb3ol' USING PARAMETERS symbol='3', wrongParam='x');
ERROR 0: Parameter wrongParam was not registered by the function
5.3.5 - User-defined session parameters
User-defined session parameters allow you to write more generalized parameters than what Vertica provides.
User-defined session parameters allow you to write more generalized parameters than what Vertica provides. You can configure user-defined session parameters in these ways:
-
From within the UDx itself, by reading and writing session parameters through the SDK.
-
From the client, at the session level, using ALTER SESSION SET UDPARAMETER.
A user-defined session parameter can be passed into any type of UDx supported by Vertica. You can also set parameters for your UDx at the session level. By specifying a user-defined session parameter, you can have the state of a parameter saved continuously. Vertica saves the state of the parameter even when the UDx is invoked multiple times during a single session.
The RowCount example uses a user-defined session parameter. This parameter counts the total number of rows processed by the UDx each time it runs. RowCount then displays the aggregate number of rows processed for all executions. See C++ example: using session parameters and Java example: using session parameters for implementations.
Viewing the user-defined session parameter
Enter the following command to see the value of all session parameters:
=> SHOW SESSION UDPARAMETER all;
schema | library | key | value
--------+---------+-----+-------
(0 rows)
No value has been set, so the table is empty. Now, execute the UDx:
=> SELECT RowCount(5,5);
RowCount
----------
10
(1 row)
Again, enter the command to see the value of the session parameter:
=> SHOW SESSION UDPARAMETER all;
schema | library | key | value
--------+-----------+----------+-------
public | UDSession | rowcount | 1
(1 row)
The library column shows the name of the library containing the UDx. This is the name set with CREATE LIBRARY. Because the UDx has processed one row, the value of the rowcount session parameter is now 1. Running the UDx two more times should increment the value twice.
=> SELECT RowCount(10,10);
RowCount
----------
20
(1 row)
=> SELECT RowCount(15,15);
RowCount
----------
30
(1 row)
You have now executed the UDx three times, obtaining the sum of 5 + 5, 10 + 10, and 15 + 15. Now, check the value of rowcount.
=> SHOW SESSION UDPARAMETER all;
schema | library | key | value
--------+-----------+----------+-------
public | UDSession | rowcount | 3
(1 row)
Altering the user-defined session parameter
You can also manually alter the value of rowcount. To do so, enter the following command:
=> ALTER SESSION SET UDPARAMETER FOR UDSession rowcount = 25;
ALTER SESSION
Check the value of RowCount:
=> SHOW SESSION UDPARAMETER all;
schema | library | key | value
--------+-----------+----------+-------
public | UDSession | rowcount | 25
(1 row)
Clearing the user-defined session parameter
From the client:
To clear the current value of rowcount, enter the following command:
=> ALTER SESSION CLEAR UDPARAMETER FOR UDSession rowcount;
ALTER SESSION
Verify that rowcount has been cleared:
=> SHOW SESSION UDPARAMETER all;
schema | library | key | value
--------+---------+-----+-------
(0 rows)
Through the UDx in C++:
You can set the session parameter to clear through the UDx itself. For example, to clear rowcount when its value reaches 10 or greater, do the following:
-
Remove the following line from the destroy()
method in the RowCount class:
udParams.getUDSessionParamWriter("library").getStringRef("rowCount").copy(i_as_string);
-
Replace the removed line from the destroy()
method with the following code:
if (rowCount < 10)
{
udParams.getUDSessionParamWriter("library").getStringRef("rowCount").copy(i_as_string);
}
else
{
udParams.getUDSessionParamWriter("library").clearParameter("rowCount");
}
-
To see the UDx clear the session parameter, set rowcount to a value of 9:
=> ALTER SESSION SET UDPARAMETER FOR UDSession rowcount = 9;
ALTER SESSION
-
Check the value of rowcount:
=> SHOW SESSION UDPARAMETER all;
schema | library | key | value
--------+-----------+----------+-------
public | UDSession | rowcount | 9
(1 row)
-
Invoke RowCount so that its value becomes 10:
=> SELECT RowCount(15,15);
RowCount
----------
30
(1 row)
-
Check the value of rowcount again. Because the value has reached 10 (the threshold specified in the UDx), rowcount is cleared:
=> SHOW SESSION UDPARAMETER all;
schema | library | key | value
--------+---------+-----+-------
(0 rows)
As expected, rowcount is cleared.
Through the UDx in Java:
-
Remove the following lines from the destroy()
method in the RowCount class:
udParams.getUDSessionParamWriter("library").setString("rowCount", Integer.toString(rowCount));
srvInterface.log("RowNumber processed %d records", count);
-
Replace the removed lines from the destroy()
method with the following code:
if (rowCount < 10)
{
udParams.getUDSessionParamWriter("library").setString("rowCount", Integer.toString(rowCount));
srvInterface.log("RowNumber processed %d records", count);
}
else
{
udParams.getUDSessionParamWriter("library").clearParameter("rowCount");
}
-
To see the UDx clear the session parameter, set rowcount to a value of 9:
=> ALTER SESSION SET UDPARAMETER FOR UDSession rowcount = 9;
ALTER SESSION
-
Check the value of rowcount:
=> SHOW SESSION UDPARAMETER all;
schema | library | key | value
--------+-----------+----------+-------
public | UDSession | rowcount | 9
(1 row)
-
Invoke RowCount so that its value becomes 10:
=> SELECT RowCount(15,15);
RowCount
----------
30
(1 row)
-
Check the value of rowcount. Because the value has reached 10 (the threshold specified in the UDx), rowcount is cleared:
=> SHOW SESSION UDPARAMETER all;
schema | library | key | value
--------+---------+-----+-------
(0 rows)
As expected, rowcount is cleared.
Read-only and hidden session parameters
If you don't want a parameter to be set anywhere except in the UDx, you can make it read-only. If, additionally, you don't want a parameter to be visible in the client, you can make it hidden.
To make a parameter read-only, meaning that it cannot be set in the client, but can be viewed, add a single underscore before the parameter's name. For example, to make rowCount read-only, change all instances in the UDx of "rowCount" to "_rowCount".
To make a parameter hidden, meaning that it can be neither viewed nor set in the client, add two underscores before the parameter's name. For example, to make rowCount hidden, change all instances in the UDx of "rowCount" to "__rowCount".
Redacted parameters
If a parameter name meets any of the following criteria, its value is automatically redacted from logs and system tables like QUERY_REQUESTS:
See also
Kafka user-defined session parameters
5.3.6 - C++ example: defining parameters
The following code fragment demonstrates adding a single parameter to the C++ add2ints UDSF example.
The following code fragment demonstrates adding a single parameter to the C++ add2ints UDSF example. The getParameterType()
function defines a single integer parameter that is named constant
.
class Add2intsWithConstantFactory : public ScalarFunctionFactory
{
// Return an instance of Add2intsWithConstant to perform the addition.
virtual ScalarFunction *createScalarFunction(ServerInterface &interface)
{
// Call vt_createFuncObj to create the new Add2intsWithConstant class instance.
return vt_createFuncObj(interface.allocator, Add2intsWithConstant);
}
}
// Report the argument and return types to Vertica.
virtual void getPrototype(ServerInterface &interface,
ColumnTypes &argTypes,
ColumnTypes &returnType)
{
// Takes two ints as inputs, so add ints to the argTypes object.
argTypes.addInt();
argTypes.addInt();
// Returns a single int.
returnType.addInt();
}
// Defines the parameters for this UDSF. Works similarly to defining arguments and return types.
virtual void getParameterType(ServerInterface &srvInterface,
SizedColumnTypes ¶meterTypes)
{
// One int parameter named constant.
parameterTypes.addInt("constant");
}
};
RegisterFactory(Add2intsWithConstantFactory);
See the Vertica SDK entry for SizedColumnTypes
for a full list of the data-type-specific functions you can call to define parameters.
The following code fragment demonstrates using the parameter value. The Add2intsWithConstant
class defines a function that adds two integer values. If the user supplies it, the function also adds the value of the optional integer parameter named constant.
/**
* A UDSF that adds two numbers together with a constant value.
*
*/
class Add2intsWithConstant : public ScalarFunction
{
public:
// Processes a block of data sent by Vertica.
virtual void processBlock(ServerInterface &srvInterface,
BlockReader &arg_reader,
BlockWriter &res_writer)
{
try
{
// The default value for the constant parameter is 0.
vint constant = 0;
// Get the parameter reader from the ServerInterface to see if there are supplied parameters.
ParamReader paramReader = srvInterface.getParamReader();
// See if the user supplied the constant parameter.
if (paramReader.containsParameter("constant"))
// There is a parameter, so get its value.
constant = paramReader.getIntRef("constant");
// While we have input to process:
do
{
// Read the two integer input parameters by calling the BlockReader.getIntRef class function.
const vint a = arg_reader.getIntRef(0);
const vint b = arg_reader.getIntRef(1);
// Add arguments plus constant.
res_writer.setInt(a+b+constant);
// Finish writing the row, and advance to the next output row.
res_writer.next();
// Continue looping until there are no more input rows.
}
while (arg_reader.next());
}
catch (exception& e)
{
// Standard exception. Quit.
vt_report_error(0, "Exception while processing partition: %s",
e.what());
}
}
};
5.3.7 - C++ example: using session parameters
The RowCount example uses a user-defined session parameter, also called RowCount.
The RowCount example uses a user-defined session parameter, also called RowCount. This parameter counts the total number of rows processed by the UDx each time it runs. RowCount then displays the aggregate number of rows processed for all executions.
#include <string>
#include <sstream>
#include <iostream>
#include "Vertica.h"
#include "VerticaUDx.h"
using namespace Vertica;
class RowCount : public Vertica::ScalarFunction
{
private:
int rowCount;
int count;
public:
virtual void setup(Vertica::ServerInterface &srvInterface, const Vertica::SizedColumnTypes &argTypes) {
ParamReader pSessionParams = srvInterface.getUDSessionParamReader("library");
std::string rCount = pSessionParams.containsParameter("rowCount")?
pSessionParams.getStringRef("rowCount").str(): "0";
rowCount=atoi(rCount.c_str());
}
virtual void processBlock(Vertica::ServerInterface &srvInterface, Vertica::BlockReader &arg_reader, Vertica::BlockWriter &res_writer) {
count = 0;
if(arg_reader.getNumCols() != 2)
vt_report_error(0, "Function only accepts two arguments, but %zu provided", arg_reader.getNumCols());
do {
const Vertica::vint a = arg_reader.getIntRef(0);
const Vertica::vint b = arg_reader.getIntRef(1);
res_writer.setInt(a+b);
count++;
res_writer.next();
} while (arg_reader.next());
srvInterface.log("count %d", count);
}
virtual void destroy(ServerInterface &srvInterface, const SizedColumnTypes &argTypes, SessionParamWriterMap &udParams) {
rowCount = rowCount + count;
std::ostringstream s;
s << rowCount;
const std::string i_as_string(s.str());
udParams.getUDSessionParamWriter("library").getStringRef("rowCount").copy(i_as_string);
}
};
class RowCountsInfo : public Vertica::ScalarFunctionFactory {
virtual Vertica::ScalarFunction *createScalarFunction(Vertica::ServerInterface &srvInterface)
{ return Vertica::vt_createFuncObject<RowCount>(srvInterface.allocator);
}
virtual void getPrototype(Vertica::ServerInterface &srvInterface, Vertica::ColumnTypes &argTypes, Vertica::ColumnTypes &returnType)
{
argTypes.addInt();
argTypes.addInt();
returnType.addInt();
}
};
RegisterFactory(RowCountsInfo);
5.3.8 - Java example: defining parameters
The following code fragment demonstrates adding a single parameter to the Java add2ints UDSF example.
The following code fragment demonstrates adding a single parameter to the Java add2ints UDSF example. The getParameterType()
method defines a single integer parameter that is named constant.
package com.mycompany.example;
import com.vertica.sdk.*;
public class Add2intsWithConstantFactory extends ScalarFunctionFactory
{
@Override
public void getPrototype(ServerInterface srvInterface,
ColumnTypes argTypes,
ColumnTypes returnType)
{
argTypes.addInt();
argTypes.addInt();
returnType.addInt();
}
@Override
public void getReturnType(ServerInterface srvInterface,
SizedColumnTypes argTypes,
SizedColumnTypes returnType)
{
returnType.addInt("sum");
}
// Defines the parameters for this UDSF. Works similarly to defining
// arguments and return types.
public void getParameterType(ServerInterface srvInterface,
SizedColumnTypes parameterTypes)
{
// One INTEGER parameter named constant
parameterTypes.addInt("constant");
}
@Override
public ScalarFunction createScalarFunction(ServerInterface srvInterface)
{
return new Add2intsWithConstant();
}
}
See the Vertica Java SDK entry for SizedColumnTypes
for a full list of the data-type-specific methods you can call to define parameters.
5.3.9 - Java example: using session parameters
The RowCount example uses a user-defined session parameter, also called RowCount.
The RowCount example uses a user-defined session parameter, also called RowCount. This parameter counts the total number of rows processed by the UDx each time it runs. RowCount then displays the aggregate number of rows processed for all executions.
package com.mycompany.example;
import com.vertica.sdk.*;
public class RowCountFactory extends ScalarFunctionFactory {
@Override
public void getPrototype(ServerInterface srvInterface, ColumnTypes argTypes, ColumnTypes returnType)
{
argTypes.addInt();
argTypes.addInt();
returnType.addInt();
}
public class RowCount extends ScalarFunction {
private Integer count;
private Integer rowCount;
// In the setup method, look for the rowCount session parameter; if it
// doesn't exist, the count starts at zero. Look in the "library" namespace,
// the default for UDx session parameters.
@Override
public void setup(ServerInterface srvInterface, SizedColumnTypes argTypes) {
count = new Integer(0);
ParamReader pSessionParams = srvInterface.getUDSessionParamReader("library");
String rCount = pSessionParams.containsParameter("rowCount")?
pSessionParams.getString("rowCount"): "0";
rowCount = Integer.parseInt(rCount);
}
@Override
public void processBlock(ServerInterface srvInterface, BlockReader arg_reader, BlockWriter res_writer)
throws UdfException, DestroyInvocation {
do {
++count;
long a = arg_reader.getLong(0);
long b = arg_reader.getLong(1);
res_writer.setLong(a+b);
res_writer.next();
} while (arg_reader.next());
}
@Override
public void destroy(ServerInterface srvInterface, SizedColumnTypes argTypes, SessionParamWriterMap udParams){
rowCount = rowCount+count;
udParams.getUDSessionParamWriter("library").setString("rowCount", Integer.toString(rowCount));
srvInterface.log("RowNumber processed %d records", count);
}
}
@Override
public ScalarFunction createScalarFunction(ServerInterface srvInterface){
return new RowCount();
}
}
5.4 - Errors, warnings, and logging
The SDK provides several ways for a UDx to report errors, warnings, and other messages.
The SDK provides several ways for a UDx to report errors, warnings, and other messages. For a UDx written in C++ or Python, use the messaging APIs described in Sending messages. UDxs in all languages can halt execution with an error, as explained in Handling errors.
UDxs can also write messages to the Vertica log, and UDxs written in C++ can write messages to a system table.
5.4.1 - Sending messages
A UDx can handle a problem by reporting an error and terminating execution, but in some cases you might want to send a warning and proceed.
A UDx can handle a problem by reporting an error and terminating execution, but in some cases you might want to send a warning and proceed. For example, a UDx might ignore or use a default for an unexpected input and report that it did so. The C++ and Python messaging APIs support reporting messages at different severity levels.
A UDx has access to a ServerInterface instance. This class has the following methods for reporting messages, in order of severity:
-
reportError
-
reportWarning
-
reportNotice
-
reportInfo
Each method produces messages with the following components:
-
ID code: an identification code, any integer. This code does not interact with Vertica error codes.
-
Message string: a succinct description of the issue.
-
Optional details string: provides more contextual information.
-
Optional hint string: provides other guidance.
Duplicate messages are condensed into a single report if they have the same code and message string, even if the details and hint strings differ.
Constructing messages
The UDx should report errors immediately, usually during a process call. For all other message types, record information during processing and call the reporting methods from the UDx's destroy
method. The other reporting methods do not produce output if called during processing.
The process of constructing messages is language-specific.
C++
Each ServerInterface
reporting method takes a ClientMessage
argument. The ClientMessage
class has the following methods to set the code and message, detail, and hint:
-
makeMessage:
sets the ID code and message string.
-
setDetail:
sets the optional detail string.
-
setHint:
sets the optional hint string.
These method calls can be chained to simplify creating and passing the message.
All strings support printf
-style arguments and formatting.
In the following example, a function records issues in processBlock
and reports them in destroy
:
class PositiveIdentity : public Vertica::ScalarFunction
{
public:
using ScalarFunction::destroy;
bool hitNotice = false;
virtual void processBlock(Vertica::ServerInterface &srvInterface,
Vertica::BlockReader &arg_reader,
Vertica::BlockWriter &res_writer)
{
do {
const Vertica::vint a = arg_reader.getIntRef(0);
if (a < 0 && a != vint_null) {
hitNotice = true;
res_writer.setInt(vint_null);
} else {
res_writer.setInt(a);
}
res_writer.next();
} while (arg_reader.next());
}
virtual void destroy(ServerInterface &srvInterface,
const SizedColumnTypes &argTypes) override
{
if (hitNotice) {
ClientMessage msg = ClientMessage::makeMessage(100, "Passed negative argument")
.setDetail("Value set to null");
srvInterface.reportNotice(msg);
}
}
};
Python
Each ServerInterface
reporting method has the following positional and keyword arguments:
-
idCode
: integer ID code, positional argument.
-
message
: message text, positional argument.
-
hint
: optional hint text, keyword argument.
-
detail
: optional detail text, keyword argument.
All arguments support str.format()
and f-string
formatting.
In the following example, a function records issues in processBlock
and reports them in destroy
:
class PositiveIdentity(vertica_sdk.ScalarFunction):
def __init__(self):
self.hitNotice = False
def processBlock(self, server_interface, arg_reader, res_writer):
while True:
arg = arg_reader.getInt(0)
if arg is not None and arg < 0:
self.hitNotice = True
res_writer.setNull()
else:
res_writer.setInt(arg)
res_writer.next()
if not arg_reader.next():
break
def destroy(self, srv, argType):
if self.hitNotice:
srv.reportNotice(100, "Passed negative argument", detail="Value set to null")
return
API
Before calling a C++ ServerInterface reporting method, construct and populate a message with the ClientMessage class.
The C++ ServerInterface API provides the following methods for reporting messages:
// ClientMessage methods
template<typename... Argtypes>
static ClientMessage makeMessage(int errorcode, const char *fmt, Argtypes&&... args);
template <typename... Argtypes>
ClientMessage & setDetail(const char *fmt, Argtypes&&... args);
template <typename... Argtypes>
ClientMessage & setHint(const char *fmt, Argtypes&&... args);
// ServerInterface reporting methods
virtual void reportError(ClientMessage msg);
virtual void reportInfo(ClientMessage msg);
virtual void reportNotice(ClientMessage msg);
virtual void reportWarning(ClientMessage msg);
The Python ServerInterface API provides the following methods for reporting messages:
def reportError(self, code, text, hint='', detail=''):
def reportInfo(self, code, text, hint='', detail=''):
def reportNotice(self, code, text, hint='', detail=''):
def reportWarning(self, code, text, hint='', detail=''):
5.4.2 - Handling errors
If your UDx encounters an unrecoverable error, it should report the error and terminate.
If your UDx encounters an unrecoverable error, it should report the error and terminate. How you do this depends on the language:
-
C++: Consider using the API described in Sending messages, which is more expressive than the error-handling described in this topic. Alternatively, you can use the vt_report_error macro to report an error and exit. The macro takes two parameters: an error number and an error message string. Both the error number and message appear in the error that Vertica reports to the user. The error number is not defined by Vertica; you can use whatever value you wish.
-
Java: Instantiate and throw a UdfException
, which takes a numeric code and a message string to report to the user.
-
Python: Consider using the API described in Sending messages, which is more expressive than the error-handling described in this topic. Alternatively, raise an exception built into the Python language; the SDK does not include a UDx-specific exception.
-
R: Use stop
to halt execution with a message.
An exception or halt causes the transaction containing the function call to be rolled back.
The following examples demonstrate error-handling:
The following function divides two integers. To prevent division by zero, it tests the second parameter and fails if it is zero:
class Div2ints : public ScalarFunction
{
public:
virtual void processBlock(ServerInterface &srvInterface,
BlockReader &arg_reader,
BlockWriter &res_writer)
{
// While we have inputs to process
do
{
const vint a = arg_reader.getIntRef(0);
const vint b = arg_reader.getIntRef(1);
if (b == 0)
{
vt_report_error(1,"Attempted divide by zero");
}
res_writer.setInt(a/b);
res_writer.next();
}
while (arg_reader.next());
}
};
Loading and invoking the function demonstrates how the error appears to the user. Fenced and unfenced modes use different error numbers.
=> CREATE LIBRARY Div2IntsLib AS '/home/dbadmin/Div2ints.so';
CREATE LIBRARY
=> CREATE FUNCTION div2ints AS LANGUAGE 'C++' NAME 'Div2intsInfo' LIBRARY Div2IntsLib;
CREATE FUNCTION
=> SELECT div2ints(25, 5);
div2ints
----------
5
(1 row)
=> SELECT * FROM MyTable;
a | b
----+---
12 | 6
7 | 0
12 | 2
18 | 9
(4 rows)
=> SELECT * FROM MyTable WHERE div2ints(a, b) > 2;
ERROR 3399: Error in calling processBlock() for User Defined Scalar Function
div2ints at Div2ints.cpp:21, error code: 1, message: Attempted divide by zero
In the following example, if either of the arguments is NULL, the processBlock()
method throws an exception:
@Override
public void processBlock(ServerInterface srvInterface,
BlockReader argReader,
BlockWriter resWriter)
throws UdfException, DestroyInvocation
{
do {
// Test for NULL value. Throw exception if one occurs.
if (argReader.isLongNull(0) || argReader.isLongNull(1) ) {
// No nulls allowed. Throw exception
throw new UdfException(1234, "Cannot add a NULL value");
}
When your UDx throws an exception, the side process running your UDx reports the error back to Vertica and exits. Vertica displays the error message contained in the exception and a stack trace to the user:
=> SELECT add2ints(2, NULL);
ERROR 3399: Failure in UDx RPC call InvokeProcessBlock(): Error in User Defined Object [add2ints], error code: 1234
com.vertica.sdk.UdfException: Cannot add a NULL value
at com.example.Add2intsFactory$Add2ints.processBlock(Add2intsFactory.java:37)
at com.vertica.udxfence.UDxExecContext.processBlock(UDxExecContext.java:700)
at com.vertica.udxfence.UDxExecContext.run(UDxExecContext.java:173)
at java.lang.Thread.run(Thread.java:662)
In this example, if one of the arguments is less than 100, then the Python UDx throws an error:
while(True):
# Example of error checking best practices.
product_id = block_reader.getInt(2)
if product_id < 100:
raise ValueError("Invalid Product ID")
An error generates a message like the following:
=> SELECT add2ints(prod_cost, sale_price, product_id) FROM bunch_of_numbers;
ERROR 3399: Failure in UDx RPC call InvokeProcessBlock(): Error calling processBlock() in User Defined Object [add2ints]
at [/udx/PythonInterface.cpp:168], error code: 0,
message: Error [/udx/PythonInterface.cpp:385] function ['call_method']
(Python error type [<class 'ValueError'>])
Traceback (most recent call last):
File "/home/dbadmin/py_db/v_py_db_node0001_catalog/Libraries/02fc4af0ace6f91eefa74baecf3ef76000a0000000004fc4/pylib_02fc4af0ace6f91eefa74baecf3ef76000a0000000004fc4.py",
line 13, in processBlock
raise ValueError("Invalid Product ID")
ValueError: Invalid Product ID
In this example, if the third column of the data frame does not match the specified Product ID, then the R UDx throws an error:
Calculate_Cost_w_Tax <- function(input.data.frame) {
# Must match the Product ID 11444
if ( !all(input.data.frame[, 3] == 11444) ) {
stop("Invalid Product ID!")
} else {
cost_w_tax <- data.frame(input.data.frame[, 1] * input.data.frame[, 2])
}
return(cost_w_tax)
}
Calculate_Cost_w_TaxFactory <- function() {
list(name=Calculate_Cost_w_Tax,
udxtype=c("scalar"),
intype=c("float","float", "float"),
outtype=c("float"))
}
An error generates a message like the following:
=> SELECT Calculate_Cost_w_Tax(item_price, tax_rate, prod_id) FROM Inventory_Sales_Data;
vsql:sql_test_multiply.sql:21: ERROR 3399: Failure in UDx RPC call InvokeProcessBlock():
Error calling processBlock() in User Defined Object [mul] at
[/udx/RInterface.cpp:1308],
error code: 0, message: Exception in processBlock :Invalid Product ID!
To report additional diagnostic information about the error, you can write messages to a log file before throwing the exception (see Logging).
Your UDx must not consume exceptions that it did not throw. Intercepting server exceptions can lead to database instability.
5.4.3 - Logging
Each UDx written in C++, Java, or Python has an associated instance of ServerInterface.
Each UDx written in C++, Java, or Python has an associated instance of ServerInterface
. The ServerInterface
class provides a function to write to the Vertica log, and the C++ implementation also provides a function to log events in a system table.
Writing messages to the Vertica log
You can write to log files using the ServerInterface.log()
function. The function acts similarly to printf()
, taking a formatted string and an optional set of values and writing the string to the log file. Where the message is written depends on whether your function runs in fenced mode or unfenced mode:
-
Functions running in unfenced mode write their messages into the vertica.log file in the catalog directory.
-
Functions running in fenced mode write their messages into a log file named UDxLogs/UDxFencedProcesses.log
in the catalog directory.
To help identify your function's output, Vertica adds the SQL function name bound to your UDx to the log message.
The following example logs a UDx's input values:
virtual void processBlock(ServerInterface &srvInterface,
BlockReader &argReader,
BlockWriter &resWriter)
{
try {
// While we have inputs to process
do {
if (argReader.isNull(0) || argReader.isNull(1)) {
resWriter.setNull();
} else {
const vint a = argReader.getIntRef(0);
const vint b = argReader.getIntRef(1);
srvInterface.log("got a: %d and b: %d", (int) a, (int) b);
resWriter.setInt(a+b);
}
resWriter.next();
} while (argReader.next());
} catch(std::exception& e) {
// Standard exception. Quit.
vt_report_error(0, "Exception while processing block: [%s]", e.what());
}
}
@Override
public void processBlock(ServerInterface srvInterface,
BlockReader argReader,
BlockWriter resWriter)
throws UdfException, DestroyInvocation
{
do {
// Get the two integer arguments from the BlockReader
long a = argReader.getLong(0);
long b = argReader.getLong(1);
// Log the input values
srvInterface.log("Got values a=%d and b=%d", a, b);
long result = a+b;
resWriter.setLong(result);
resWriter.next();
} while (argReader.next());
}
}
def processBlock(self, server_interface, block_reader, block_writer):
    server_interface.log("Python UDx - Adding 2 ints!")
    while(True):
        first_int = block_reader.getInt(0)
        second_int = block_reader.getInt(1)
        block_writer.setInt(first_int + second_int)
        server_interface.log("Values: first_int is {} second_int is {}".format(first_int, second_int))
        block_writer.next()
        if not block_reader.next():
            break
The log()
function generates entries in the log file like the following:
$ tail /home/dbadmin/py_db/v_py_db_node0001_catalog/UDxLogs/UDxFencedProcesses.log
07:52:12.862 [Python-v_py_db_node0001-7524:0x206c-40575] 0x7f70eee2f780 PythonExecContext::processBlock
07:52:12.862 [Python-v_py_db_node0001-7524:0x206c-40575] 0x7f70eee2f780 [UserMessage] add2ints - Python UDx - Adding 2 ints!
07:52:12.862 [Python-v_py_db_node0001-7524:0x206c-40575] 0x7f70eee2f780 [UserMessage] add2ints - Values: first_int is 100 second_int is 100
For details on viewing the Vertica log files, see Monitoring log files.
Writing messages to the UDX_EVENTS table (C++ only)
In the C++ API, you can write messages to the UDX_EVENTS system table instead of or in addition to writing to the log. Writing to a system table allows you to collect events from all nodes in one place.
You can write to this table using the ServerInterface.logEvent()
function. The function takes one argument, a map. The map is written into the __RAW__ column of the table as a Flex VMap. The following example shows how the Parquet exporter creates and logs this map.
// Log exported parquet file details to v_monitor.udx_events
std::map<std::string, std::string> details;
details["file"] = escapedPath;
details["created"] = create_timestamp_;
details["closed"] = close_timestamp_;
details["rows"] = std::to_string(num_rows_in_file);
details["row_groups"] = std::to_string(num_row_groups_in_file);
details["size_mb"] = std::to_string((double)outputStream->Tell()/(1024*1024));
srvInterface.logEvent(details);
You can select individual fields from the VMap as in the following example.
=> SELECT __RAW__['file'] FROM UDX_EVENTS;
__RAW__
-----------------------------------------------------------------------------
/tmp/export_tmpzLkrKq3a/450c4213-v_vmart_node0001-139770732459776-0.parquet
/tmp/export_tmpzLkrKq3a/9df1c797-v_vmart_node0001-139770860660480-0.parquet
(2 rows)
Alternatively, you can define a view to make it easier to query fields directly, as columns. See Monitoring exports for an example.
5.5 - Handling cancel requests
Users of your UDx might cancel the operation while it is running.
Users of your UDx might cancel the operation while it is running. How Vertica handles the cancellation of the query and your UDx depends on whether your UDx is running in fenced or unfenced mode:
-
If your UDx is running in unfenced mode, Vertica either stops the function when it requests a new block of input or output, or waits until your function completes running and discards the results.
-
If your UDx is running in fenced mode, Vertica kills the zygote process that is running your function if it continues processing past a timeout.
In addition, you can implement the cancel()
method in any UDx to perform any necessary additional work. Vertica calls your function when a query is canceled. This cancellation can occur at any time during your UDx's lifetime, from setup()
through destroy()
.
You can check for cancellation before starting an expensive operation by calling isCanceled()
.
5.5.1 - Implementing the cancel callback
Your UDx can implement a cancel() callback function.
Your UDx can implement a cancel()
callback function. Vertica calls this function if the query that invoked the UDx has been canceled.
You usually implement this function to perform an orderly shutdown of any additional processing that your UDx spawned. For example, you can have your cancel()
function shut down threads that your UDx has spawned or signal a third-party library that it needs to stop processing and exit. Your cancel()
function should leave your UDx's function class ready to be destroyed, because Vertica calls the UDx's destroy()
function after the cancel()
function has exited.
A UDx's default cancel()
behavior is to do nothing.
The contract for cancel()
is:
-
Vertica will call cancel()
at most once per UDx instance.
-
Vertica can call cancel()
concurrently with any other method of the UDx object except the constructor and destructor.
-
Vertica can call cancel()
from another thread, so implementations should be thread-safe.
-
Vertica will call cancel()
for either an explicit user cancellation or an error in the query.
-
Vertica does not guarantee that cancel()
will run to completion. Long-running cancellations might be aborted.
The call to cancel()
is not synchronized in any way with your UDx's other functions. If you need your processing function to exit before your cancel()
function performs some action (killing threads, for example), you must have the two functions synchronize their actions.
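For example, one common way to coordinate the two is to share an atomic flag between cancel() and the processing method. The following is a minimal sketch, not from the SDK examples; the member name and the loop it refers to are hypothetical:
#include <atomic>

// Member of a hypothetical UDx class, shared between threads:
std::atomic<bool> stopRequested{false};

// cancel() can run on a different thread, so it touches only
// thread-safe state.
virtual void cancel(ServerInterface &srvInterface)
{
    stopRequested = true;
}

// The processing method checks the flag between units of work:
//     if (stopRequested.load()) return;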
Vertica always calls destroy()
if it called setup()
. Cancellation does not prevent destruction.
See C++ example: cancelable UDSource for an example that implements cancel()
.
5.5.2 - Checking for cancellation during execution
You can call the isCanceled() method to check for user cancellation.
You can call the isCanceled()
method to check for user cancellation. Typically you check for cancellation from the method that does the main processing in your UDx before beginning expensive operations. If isCanceled()
returns true, the query has been canceled and your method should exit immediately to prevent it from wasting CPU time. If your UDx is not running in fenced mode, Vertica cannot halt your function and has to wait for it to finish. If it is running in fenced mode, Vertica eventually kills the side process running it.
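For a scalar function, for example, processBlock() might test isCanceled() at the top of its row loop before starting expensive work. The following is a minimal sketch rather than a complete UDSF:
virtual void processBlock(ServerInterface &srvInterface,
                          BlockReader &argReader, BlockWriter &resWriter)
{
    do {
        if (isCanceled()) {
            return;  // the query was canceled; stop wasting CPU time
        }
        // ... expensive per-row computation and output writing here ...
        resWriter.next();
    } while (argReader.next());
}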
See C++ example: cancelable UDSource for an example that uses isCanceled()
.
5.5.3 - C++ example: cancelable UDSource
The FifoSource example, found in filelib.cpp in the SDK examples, demonstrates use of cancel() and isCanceled().
The FifoSource
example, found in filelib.cpp
in the SDK examples, demonstrates use of cancel()
and isCanceled()
. This source reads from a named pipe. Unlike reads from files, reads from pipes can block. Therefore, we need to be able to cancel a load from this source.
To manage cancellation, the UDx uses a pipe, a data channel used for inter-process communication. A process can write data to the write end of the pipe, and it remains available until another process reads it from the read end of the pipe. This example doesn't pass data through this pipe; rather, it uses the pipe to manage cancellation, as explained further below. In addition to the pipe's two file descriptors (one for each end), the UDx creates a file descriptor for the file to read from. The setup()
function creates the pipe and then opens the file.
virtual void setup(ServerInterface &srvInterface) {
// cancelPipe is a pipe used only for checking cancellation
if (pipe(cancelPipe)) {
vt_report_error(0, "Error opening control structure");
}
// handle to the named pipe from which we read data
namedPipeFd = open(filename.c_str(), O_RDONLY | O_NONBLOCK);
if (namedPipeFd < 0) {
vt_report_error(0, "Error opening fifo [%s]", filename.c_str());
}
}
We now have three file descriptors: namedPipeFd
, cancelPipe[PIPE_READ]
, and cancelPipe[PIPE_WRITE]
. Each of these must eventually be closed.
This UDx uses the poll()
system call to wait either for data to arrive from the named pipe (namedPipeFd
) or for a cancellation (cancelPipe[PIPE_READ]
). The process()
function polls, checks for results, checks for cancellation, writes output if needed, and returns.
virtual StreamState process(ServerInterface &srvInterface, DataBuffer &output) {
struct pollfd pollfds[2] = {
{ namedPipeFd, POLLIN, 0 },
{ cancelPipe[PIPE_READ], POLLIN, 0 }
};
if (poll(pollfds, 2, -1) < 0) {
vt_report_error(1, "Error reading [%s]", filename.c_str());
}
if (pollfds[1].revents & (POLLIN | POLLHUP)) {
/* This can only happen after cancel() has been called */
VIAssert(isCanceled());
return DONE;
}
VIAssert(pollfds[PIPE_READ].revents & (POLLIN | POLLHUP));
const ssize_t amount = read(namedPipeFd, output.buf + output.offset, output.size - output.offset);
if (amount < 0) {
vt_report_error(1, "Error reading from fifo [%s]", filename.c_str());
}
if (amount == 0 || isCanceled()) {
return DONE;
} else {
output.offset += amount;
return OUTPUT_NEEDED;
}
}
If the query is canceled, the cancel()
function closes the write end of the pipe. The next time process()
polls for input, it finds no input on the read end of the pipe and exits. Otherwise, it continues. The function also calls isCanceled()
to check for cancellation before returning OUTPUT_NEEDED
, the signal that it has filled its buffer and is waiting for it to be processed downstream.
The cancel()
function does only the work needed to interrupt a call to process()
. Cleanup that is always needed, not just for cancellation, is instead done in destroy()
or the destructor. The cancel()
function closes the write end of the pipe. (The helper function will be shown later.)
virtual void cancel(ServerInterface &srvInterface) {
closeIfNeeded(cancelPipe[PIPE_WRITE]);
}
It is not safe to close the named pipe in cancel()
, because closing it could create a race condition if another operation (such as another query) were to reuse the file descriptor number for a new descriptor before the UDx finishes. Instead, we close it, along with the read end of the pipe, in destroy()
.
virtual void destroy(ServerInterface &srvInterface) {
closeIfNeeded(namedPipeFd);
closeIfNeeded(cancelPipe[PIPE_READ]);
}
It is not safe to close the write end of the pipe in destroy()
, because cancel()
closes it and can be called concurrently with destroy()
. Therefore, we close it in the destructor.
~FifoSource() {
closeIfNeeded(cancelPipe[PIPE_WRITE]);
}
The UDx uses a helper function, closeIfNeeded()
, to make sure each file descriptor is closed exactly once.
void closeIfNeeded(int &fd) {
if (fd >= 0) {
close(fd);
fd = -1;
}
}
5.6 - Aggregate functions (UDAFs)
Aggregate functions perform an operation on a set of values and return one value.
Aggregate functions perform an operation on a set of values and return one value. Vertica provides standard built-in aggregate functions such as AVG, MAX, and MIN. User-defined aggregate functions (UDAFs) provide similar functionality:
-
Support a single input column (or set) of values and provide a single output column.
-
Support RLE decompression. RLE input is decompressed before it is sent to a UDAF.
-
Support use with GROUP BY and HAVING clauses. Only columns appearing in the GROUP BY clause can be selected.
Restrictions
The following restrictions apply to UDAFs:
5.6.1 - AggregateFunction class
The AggregateFunction class performs the aggregation.
The AggregateFunction
class performs the aggregation. It computes values on each database node where relevant data is stored and then combines the results from the nodes. You must implement the following methods:
-
initAggregate()
- Initializes the class, defines variables, and sets the starting value for the variables. This function must be idempotent.
-
aggregate()
- The main aggregation operation, executed on each node.
-
combine()
- If multiple invocations of aggregate()
are needed, Vertica calls combine()
to combine all the sub-aggregations into a final aggregation. Although this method might not be called, you must define it.
-
terminate()
- Terminates the function and returns the result as a column.
Important
The aggregate()
function might not operate on the complete input set all at once. For this reason, initAggregate()
must be idempotent.
The AggregateFunction
class also provides optional methods that you can implement to allocate and free resources: setup()
and destroy()
. You should use these methods to allocate and deallocate resources that you do not allocate through the UDAF API (see Allocating resources for UDxs for details).
API
Aggregate functions are supported for C++ only.
The AggregateFunction API provides the following methods for extension by subclasses:
virtual void setup(ServerInterface &srvInterface,
const SizedColumnTypes &argTypes);
virtual void initAggregate(ServerInterface &srvInterface, IntermediateAggs &aggs)=0;
void aggregate(ServerInterface &srvInterface, BlockReader &arg_reader,
IntermediateAggs &aggs);
virtual void combine(ServerInterface &srvInterface, IntermediateAggs &aggs_output,
MultipleIntermediateAggs &aggs_other)=0;
virtual void terminate(ServerInterface &srvInterface, BlockWriter &res_writer,
IntermediateAggs &aggs);
virtual void cancel(ServerInterface &srvInterface);
virtual void destroy(ServerInterface &srvInterface, const SizedColumnTypes &argTypes);
5.6.2 - AggregateFunctionFactory class
The AggregateFunctionFactory class specifies metadata information such as the argument and return types of your aggregate function.
The AggregateFunctionFactory
class specifies metadata information such as the argument and return types of your aggregate function. It also instantiates your AggregateFunction
subclass. Your subclass must implement the following methods:
-
getPrototype()
- Defines the number of parameters and data types accepted by the function. There is a single parameter for aggregate functions.
-
getIntermediateTypes()
- Defines the intermediate variable(s) used by the function. These variables are used when combining the results of aggregate()
calls.
-
getReturnType()
- Defines the type of the output column.
Your function may also implement getParameterType()
, which defines the names and types of parameters that this function uses.
Vertica uses this data when you call the CREATE AGGREGATE FUNCTION SQL statement to add the function to the database catalog.
API
Aggregate functions are supported for C++ only.
The AggregateFunctionFactory API provides the following methods for extension by subclasses:
virtual AggregateFunction *
createAggregateFunction(ServerInterface &srvInterface)=0;
virtual void getPrototype(ServerInterface &srvInterface,
ColumnTypes &argTypes, ColumnTypes &returnType)=0;
virtual void getIntermediateTypes(ServerInterface &srvInterface,
const SizedColumnTypes &inputTypes, SizedColumnTypes &intermediateTypeMetaData)=0;
virtual void getReturnType(ServerInterface &srvInterface,
const SizedColumnTypes &argTypes, SizedColumnTypes &returnType)=0;
virtual void getParameterType(ServerInterface &srvInterface,
SizedColumnTypes &parameterTypes);
5.6.3 - UDAF performance in statements containing a GROUP BY clause
You may see slower-than-expected performance from your UDAF if the SQL statement calling it also contains a GROUP BY clause.
You may see slower-than-expected performance from your UDAF if the SQL statement calling it also contains a GROUP BY clause. For example:
=> SELECT a, MYUDAF(b) FROM sampletable GROUP BY a;
In statements like this one, Vertica does not consolidate row data together before calling your UDAF's aggregate()
method. Instead, it calls aggregate()
once for each row of data. Usually, the overhead of having Vertica consolidate the row data is greater than the overhead of calling aggregate()
for each row of data. However, if your UDAF's aggregate()
method has significant overhead, then you might notice an impact on your UDAF's performance.
For example, suppose aggregate()
allocates memory. When called in a statement with a GROUP BY clause, it performs this memory allocation for each row of data. Because memory allocation is a relatively expensive process, this allocation can impact the overall performance of your UDAF and the query.
There are two ways you can address UDAF performance in a statement containing a GROUP BY clause:
-
Reduce the overhead of each call to aggregate()
. If possible, move any allocation or other setup operations to the UDAF's setup()
function.
-
Declare a special parameter that tells Vertica to group row data together when calling a UDAF. This technique is explained below.
Using the _minimizeCallCount parameter
Your UDAF can tell Vertica to always batch row data together to reduce the number of calls to its aggregate()
method. To trigger this behavior, your UDAF must declare an integer parameter named _minimizeCallCount
. You do not need to set a value for this parameter in your SQL statement. The fact that your UDAF declares this parameter triggers Vertica to group row data together when calling aggregate()
.
You declare the _minimizeCallCount
parameter the same way you declare other UDx parameters. See UDx parameters for more information.
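For example, a C++ UDAF factory might declare the parameter in its getParameterType() method, as in the following minimal sketch; only the parameter's name and integer type matter here:
virtual void getParameterType(ServerInterface &srvInterface,
                              SizedColumnTypes &parameterTypes)
{
    // Declaring this integer parameter signals Vertica to batch row data
    // together before calling aggregate(). No value needs to be set.
    parameterTypes.addInt("_minimizeCallCount");
}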
Important
Always test the performance of your UDAF before and after implementing the _minimizeCallCount
parameter to ensure that it improves performance. You might find that the overhead of having Vertica group row data for your UDAF is greater than the cost of the repeated calls to aggregate()
.
5.6.4 - C++ example: average
The Average aggregate function created in this example computes the average of values in a column.
The Average
aggregate function created in this example computes the average of values in a column.
You can find the source code used in this example on the Vertica GitHub page.
Loading the example
Use CREATE LIBRARY and CREATE AGGREGATE FUNCTION to declare the function:
=> CREATE LIBRARY AggregateFunctions AS
'/opt/vertica/sdk/examples/build/AggregateFunctions.so';
CREATE LIBRARY
=> CREATE aggregate function ag_avg AS LANGUAGE 'C++'
name 'AverageFactory' library AggregateFunctions;
CREATE AGGREGATE FUNCTION
Using the example
Use the function as part of a SELECT statement:
=> SELECT * FROM average;
id | count
----+---------
A | 8
B | 3
C | 6
D | 2
E | 9
F | 7
G | 5
H | 4
I | 1
(9 rows)
=> SELECT ag_avg(count) FROM average;
ag_avg
--------
5
(1 row)
AggregateFunction implementation
This example adds the input argument values in the aggregate()
method and keeps a counter of the number of values added. The server runs aggregate()
on every node and different data chunks, and combines all the individually added values and counters in the combine()
method. Finally, the average value is computed in the terminate()
method by dividing the total sum by the total number of values processed.
For this discussion, assume the following environment:
-
A three-node Vertica cluster
-
A table column that contains nine values that are evenly distributed across the nodes. Schematically, the nodes look like the following figure:
The function uses two variables: sum, which holds the sum of the values, and count, which holds the number of values.
First, initAggregate()
initializes the variables and sets their values to zero.
virtual void initAggregate(ServerInterface &srvInterface,
IntermediateAggs &aggs)
{
try {
VNumeric &sum = aggs.getNumericRef(0);
sum.setZero();
vint &count = aggs.getIntRef(1);
count = 0;
}
catch(std::exception &e) {
vt_report_error(0, "Exception while initializing intermediate aggregates: [%s]", e.what());
}
}
The aggregate()
function reads the block of data on each node and calculates partial aggregates.
void aggregate(ServerInterface &srvInterface,
BlockReader &argReader,
IntermediateAggs &aggs)
{
try {
VNumeric &sum = aggs.getNumericRef(0);
vint &count = aggs.getIntRef(1);
do {
const VNumeric &input = argReader.getNumericRef(0);
if (!input.isNull()) {
sum.accumulate(&input);
count++;
}
} while (argReader.next());
} catch(std::exception &e) {
vt_report_error(0, "Exception while processing aggregate: [%s]", e.what());
}
}
Each completed instance of the aggregate()
function returns multiple partial aggregates for sum and count. The following figure illustrates this process using the aggregate()
function:
The combine()
function puts together the partial aggregates calculated by each instance of the average function.
virtual void combine(ServerInterface &srvInterface,
IntermediateAggs &aggs,
MultipleIntermediateAggs &aggsOther)
{
try {
VNumeric &mySum = aggs.getNumericRef(0);
vint &myCount = aggs.getIntRef(1);
// Combine all the other intermediate aggregates
do {
const VNumeric &otherSum = aggsOther.getNumericRef(0);
const vint &otherCount = aggsOther.getIntRef(1);
// Do the actual accumulation
mySum.accumulate(&otherSum);
myCount += otherCount;
} while (aggsOther.next());
} catch(std::exception &e) {
// Standard exception. Quit.
vt_report_error(0, "Exception while combining intermediate aggregates: [%s]", e.what());
}
}
The following figure shows how each partial aggregate is combined:
After all input has been evaluated by the aggregate()
function Vertica calls the terminate()
function. It returns the average to the caller.
virtual void terminate(ServerInterface &srvInterface,
BlockWriter &resWriter,
IntermediateAggs &aggs)
{
try {
const int32 MAX_INT_PRECISION = 20;
const int32 prec = Basics::getNumericWordCount(MAX_INT_PRECISION);
uint64 words[prec];
VNumeric count(words,prec,0/*scale*/);
count.copy(aggs.getIntRef(1));
VNumeric &out = resWriter.getNumericRef();
if (count.isZero()) {
    out.setNull();
} else {
    const VNumeric &sum = aggs.getNumericRef(0);
    out.div(&sum, &count);
}
} catch(std::exception &e) {
    // Standard exception. Quit.
    vt_report_error(0, "Exception while computing aggregate output: [%s]", e.what());
}
}
The following figure shows the implementation of the terminate()
function:
AggregateFunctionFactory implementation
The getPrototype()
function allows you to define the variables that are sent to your aggregate function and returned to Vertica after your aggregate function runs. The following example accepts and returns a numeric value:
virtual void getPrototype(ServerInterface &srvInterface,
ColumnTypes &argTypes,
ColumnTypes &returnType)
{
argTypes.addNumeric();
returnType.addNumeric();
}
The getIntermediateTypes()
function defines any intermediate variables that you use in your aggregate function. Intermediate variables are values used to pass data among multiple invocations of an aggregate function; they are used to combine results until a final result can be computed. In this example, there are two intermediate values: sum (numeric) and count (int).
virtual void getIntermediateTypes(ServerInterface &srvInterface,
const SizedColumnTypes &inputTypes,
SizedColumnTypes &intermediateTypeMetaData)
{
const VerticaType &inType = inputTypes.getColumnType(0);
intermediateTypeMetaData.addNumeric(interPrec, inType.getNumericScale());
intermediateTypeMetaData.addInt();
}
The getReturnType()
function defines the output data type:
virtual void getReturnType(ServerInterface &srvInterface,
const SizedColumnTypes &inputTypes,
SizedColumnTypes &outputTypes)
{
const VerticaType &inType = inputTypes.getColumnType(0);
outputTypes.addNumeric(inType.getNumericPrecision(),
inType.getNumericScale());
}
5.7 - Analytic functions (UDAnFs)
User-defined analytic functions (UDAnFs) are used for analytics.
User-defined analytic functions (UDAnFs) are used for analytics. See SQL analytics for an overview of Vertica's built-in analytics. Like user-defined scalar functions (UDSFs), UDAnFs must output a single value for each row of data read and can have no more than 9800 arguments.
Unlike UDSFs, the UDAnF's input reader and output reader can be advanced independently. This feature lets you create analytic functions where the output value is calculated over multiple rows of data. By advancing the reader and writer independently, you can create functions similar to the built-in analytic functions such as LAG, which uses data from prior rows to output a value for the current row.
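For example, a LAG-like UDAnF could remember the previous row's value and emit it for the current row. The following is a minimal sketch, not part of the SDK examples; it assumes a single integer input column and ignores window-frame subtleties:
virtual void processPartition(ServerInterface &srvInterface,
                              AnalyticPartitionReader &inputReader,
                              AnalyticPartitionWriter &outputWriter)
{
    try {
        vint prev = vint_null;                // NULL for the first row
        do {
            outputWriter.setInt(0, prev);     // output lags input by one row
            outputWriter.next();
            prev = inputReader.getIntRef(0);  // remember this row's value
        } while (inputReader.next());
    } catch (std::exception &e) {
        vt_report_error(0, "Exception while processing partition: [%s]", e.what());
    }
}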
5.7.1 - AnalyticFunction class
The AnalyticFunction class performs the analytic processing.
The AnalyticFunction
class performs the analytic processing. Your subclass must define the processPartition()
method to perform the operation. It may define methods to set up and tear down the function.
The processPartition()
method reads a partition of data, performs some sort of processing, and outputs a single value for each input row.
Vertica calls processPartition()
once for each partition of data. It supplies the partition using an AnalyticPartitionReader
object from which you read its input data. In addition, there is a unique method on this object named isNewOrderByKey()
, which returns a Boolean value indicating whether the current row's ORDER BY key (or keys) differs from the previous row's. This method is very useful for analytic functions (such as the example RANK function) which need to handle rows with identical ORDER BY keys differently than rows with different ORDER BY keys.
Note
You can specify multiple ORDER BY columns in the SQL query you use to call your UDAnF. The isNewOrderByKey
method returns true if any of the ORDER BY keys are different than the previous row.
Once your method has finished processing the row of data, you advance it to the next row of input by calling next()
on AnalyticPartitionReader
.
Your method writes its output value using an AnalyticPartitionWriter
object that Vertica supplies as a parameter to processPartition()
. This object has data-type-specific methods to write the output value (such as setInt()
). After setting the output value, call next()
on AnalyticPartitionWriter
to advance to the next row in the output.
Note
You must be sure that your function produces a row of output for each row of input in the partition. You must also not output more rows than are in the partition; otherwise, the zygote side process (if running in fenced mode) or Vertica itself could generate an out-of-bounds error.
Setting up and tearing down
The AnalyticFunction
class defines two additional methods that you can optionally implement to allocate and free resources: setup()
and destroy()
. You should use these methods to allocate and deallocate resources that you do not allocate through the UDx API (see Allocating resources for UDxs for details).
API
The AnalyticFunction API (C++) provides the following methods for extension by subclasses:
virtual void setup(ServerInterface &srvInterface,
const SizedColumnTypes &argTypes);
virtual void processPartition (ServerInterface &srvInterface,
AnalyticPartitionReader &input_reader,
AnalyticPartitionWriter &output_writer)=0;
virtual void cancel(ServerInterface &srvInterface);
virtual void destroy(ServerInterface &srvInterface, const SizedColumnTypes &argTypes);
The AnalyticFunction API (Java) provides the following methods for extension by subclasses:
public void setup(ServerInterface srvInterface, SizedColumnTypes argTypes);
public abstract void processPartition (ServerInterface srvInterface,
AnalyticPartitionReader input_reader, AnalyticPartitionWriter output_writer)
throws UdfException, DestroyInvocation;
protected void cancel(ServerInterface srvInterface);
public void destroy(ServerInterface srvInterface, SizedColumnTypes argTypes);
5.7.2 - AnalyticFunctionFactory class
The AnalyticFunctionFactory class tells Vertica metadata about your UDAnF: its number of parameters and their data types, as well as the data type of its return value.
The AnalyticFunctionFactory
class tells Vertica metadata about your UDAnF: its number of parameters and their data types, as well as the data type of its return value. It also instantiates a subclass of AnalyticFunction
.
Your AnalyticFunctionFactory
subclass must implement the following methods:
-
getPrototype()
describes the input parameters and output value of your function. You set these values by calling functions on two ColumnTypes
objects that are passed to your method.
-
createAnalyticFunction()
supplies an instance of your AnalyticFunction
that Vertica can call to process a UDAnF function call.
-
getReturnType()
provides details about your function's output. This method is where you set the width of the output value if your function returns a variable-width value (such as VARCHAR) or the precision of the output value if it has a settable precision (such as TIMESTAMP).
API
The AnalyticFunctionFactory API (C++) provides the following methods for extension by subclasses:
virtual AnalyticFunction * createAnalyticFunction (ServerInterface &srvInterface)=0;
virtual void getPrototype(ServerInterface &srvInterface,
ColumnTypes &argTypes, ColumnTypes &returnType)=0;
virtual void getReturnType(ServerInterface &srvInterface,
const SizedColumnTypes &argTypes, SizedColumnTypes &returnType)=0;
virtual void getParameterType(ServerInterface &srvInterface,
SizedColumnTypes &parameterTypes);
The AnalyticFunctionFactory API (Java) provides the following methods for extension by subclasses:
public abstract AnalyticFunction createAnalyticFunction (ServerInterface srvInterface);
public abstract void getPrototype(ServerInterface srvInterface, ColumnTypes argTypes, ColumnTypes returnType);
public abstract void getReturnType(ServerInterface srvInterface, SizedColumnTypes argTypes,
SizedColumnTypes returnType) throws UdfException;
public void getParameterType(ServerInterface srvInterface, SizedColumnTypes parameterTypes);
5.7.3 - C++ example: rank
The Rank analytic function ranks rows based on how they are ordered.
The Rank
analytic function ranks rows based on how they are ordered. A Java version of this UDx is included in /opt/vertica/sdk/examples
.
Loading and using the example
The following example shows how to load the function into Vertica. It assumes that the AnalyticFunctions.so
library that contains the function has been copied to the dbadmin user's home directory on the initiator node.
=> CREATE LIBRARY AnalyticFunctions AS '/home/dbadmin/AnalyticFunctions.so';
CREATE LIBRARY
=> CREATE ANALYTIC FUNCTION an_rank AS LANGUAGE 'C++'
NAME 'RankFactory' LIBRARY AnalyticFunctions;
CREATE ANALYTIC FUNCTION
An example of running this rank function, named an_rank
, is:
=> SELECT * FROM hits;
site | date | num_hits
-----------------+------------+----------
www.example.com | 2012-01-02 | 97
www.vertica.com | 2012-01-01 | 343435
www.example.com | 2012-01-01 | 123
www.example.com | 2012-01-04 | 112
www.vertica.com | 2012-01-02 | 503695
www.vertica.com | 2012-01-03 | 490387
www.example.com | 2012-01-03 | 123
(7 rows)
=> SELECT site,date,num_hits,an_rank()
OVER (PARTITION BY site ORDER BY num_hits DESC)
AS an_rank FROM hits;
site | date | num_hits | an_rank
-----------------+------------+----------+---------
www.example.com | 2012-01-03 | 123 | 1
www.example.com | 2012-01-01 | 123 | 1
www.example.com | 2012-01-04 | 112 | 3
www.example.com | 2012-01-02 | 97 | 4
www.vertica.com | 2012-01-02 | 503695 | 1
www.vertica.com | 2012-01-03 | 490387 | 2
www.vertica.com | 2012-01-01 | 343435 | 3
(7 rows)
As with the built-in RANK analytic function, rows that have the same value for the ORDER BY column (num_hits in this example) have the same rank, but the rank continues to increase, so that the next row that has a different ORDER BY key gets a rank value based on the number of rows that preceded it.
AnalyticFunction implementation
The following code defines an AnalyticFunction
subclass named Rank
. It is based on example code distributed in the examples directory of the SDK.
/**
* User-defined analytic function: Rank - works mostly the same as SQL-99 rank
* with the ability to define as many order by columns as desired
*
*/
class Rank : public AnalyticFunction
{
virtual void processPartition(ServerInterface &srvInterface,
AnalyticPartitionReader &inputReader,
AnalyticPartitionWriter &outputWriter)
{
// Always use a top-level try-catch block to prevent exceptions from
// leaking back to Vertica or the fenced-mode side process.
try {
rank = 1; // The rank to assign a row
rowCount = 0; // Number of rows processed so far
do {
rowCount++;
// Do we have a new order by row?
if (inputReader.isNewOrderByKey()) {
// Yes, so set rank to the total number of rows that have been
// processed. Otherwise, the rank remains the same value as
// the previous iteration.
rank = rowCount;
}
// Write the rank
outputWriter.setInt(0, rank);
// Move to the next row of the output
outputWriter.next();
} while (inputReader.next()); // Loop until no more input
} catch(exception& e) {
// Standard exception. Quit.
vt_report_error(0, "Exception while processing partition: %s", e.what());
}
}
private:
vint rank, rowCount;
};
In this example, the processPartition()
method does not actually read any data from the input rows; it just advances through them. It counts the rows that have been read and determines whether those rows have the same ORDER BY key as the previous row. If the current row begins a new ORDER BY key, the rank is set to the total number of rows that have been processed. If the current row has the same ORDER BY value as the previous row, the rank remains the same.
Note that the function has a top-level try-catch block. All of your UDx functions should always have one to prevent stray exceptions from being passed back to Vertica (if you run the function unfenced) or the side process.
AnalyticFunctionFactory implementation
The following code defines the AnalyticFunctionFactory
that corresponds with the Rank
analytic function.
class RankFactory : public AnalyticFunctionFactory
{
virtual void getPrototype(ServerInterface &srvInterface,
ColumnTypes &argTypes, ColumnTypes &returnType)
{
returnType.addInt();
}
virtual void getReturnType(ServerInterface &srvInterface,
const SizedColumnTypes &inputTypes,
SizedColumnTypes &outputTypes)
{
outputTypes.addInt();
}
virtual AnalyticFunction *createAnalyticFunction(ServerInterface
&srvInterface)
{ return vt_createFuncObj(srvInterface.allocator, Rank); }
};
The first method defined by the RankFactory
subclass, getPrototype()
, sets the data type of the return value. Because the Rank UDAnF does not read input, it does not define any arguments by calling methods on the ColumnTypes
object passed in the argTypes
parameter.
The next method is getReturnType()
. If your function returns a data type that needs to define a width or precision, your implementation of the getReturnType()
method calls a method on the SizedColumnType
object passed in as a parameter to tell Vertica the width or precision. Rank
returns a fixed-width data type (an INTEGER) so it does not need to set the precision or width of its output; it just calls addInt()
to report its output data type.
Finally, RankFactory
defines the createAnalyticFunction()
method that returns an instance of the AnalyticFunction
class that Vertica can call. This code is mostly boilerplate. All you need to do is add the name of your analytic function class in the call to vt_createFuncObj()
, which takes care of allocating the object for you.
5.8 - Scalar functions (UDSFs)
A user-defined scalar function (UDSF) returns a single value for each row of data it reads.
A user-defined scalar function (UDSF) returns a single value for each row of data it reads. You can use a UDSF anywhere you can use a built-in Vertica function. You usually develop a UDSF to perform data manipulations that are too complex or too slow to perform using SQL statements and functions. UDSFs also let you use analytic functions provided by third-party libraries within Vertica while still maintaining high performance.
A UDSF returns a single column. You can automatically return multiple values in a ROW. A ROW is a group of property-value pairs. In the following example, div_with_rem is a UDSF that performs a division operation, returning the quotient and remainder as integers:
=> SELECT div_with_rem(18,5);
div_with_rem
------------------------------
{"quotient":3,"remainder":3}
(1 row)
A ROW returned from a UDSF cannot be used as an argument to COUNT.
Alternatively, you can construct a complex return value yourself, as described in Complex Types as Arguments.
Your UDSF must return a value for every input row (unless it generates an error; see Handling errors for details). Failure to return a value for an input row results in incorrect results and potentially destabilizes the Vertica server if it is not run in fenced mode.
A UDSF can have up to 9800 arguments.
5.8.1 - ScalarFunction class
The ScalarFunction class is the heart of a UDSF.
The ScalarFunction
class is the heart of a UDSF. Your subclass must define the processBlock()
method to perform the scalar operation. It may define methods to set up and tear down the function.
For scalar functions written in C++, you can provide information that can help with query optimization. See Improving query performance (C++ only).
The processBlock()
method carries out all of the processing that you want your UDSF to perform. When a user calls your function in a SQL statement, Vertica bundles together the data from the function parameters and passes it to processBlock()
.
The input and output of the processBlock()
method are supplied by objects of the BlockReader
and BlockWriter
classes. They define methods that you use to read the input data and write the output data for your UDSF.
The majority of the work in developing a UDSF is writing processBlock()
. This is where all of the processing in your function occurs. Your UDSF should follow this basic pattern:
-
Read in a set of arguments from the BlockReader
object using data-type-specific methods.
-
Process the data in some manner.
-
Output the resulting value using one of the BlockWriter
class's data-type-specific methods.
-
Advance to the next row of output and input by calling BlockWriter.next()
and BlockReader.next()
.
This process continues until there are no more rows of data to be read (BlockReader.next()
returns false).
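The following minimal sketch shows the shape of this loop for a hypothetical UDSF that doubles its single integer argument (see C++ example: Add2Ints for a complete implementation):
virtual void processBlock(ServerInterface &srvInterface,
                          BlockReader &argReader, BlockWriter &resWriter)
{
    do {
        const vint in = argReader.getIntRef(0);  // 1. read the arguments
        resWriter.setInt(in * 2);                // 2-3. process and write
        resWriter.next();                        // 4. advance the output...
    } while (argReader.next());                  //    ...and the input
}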
You must make sure that processBlock()
reads all of the rows in its input and outputs a single value for each row. Failure to do so can corrupt the data structures that Vertica reads to get the output of your UDSF. The only exception to this rule is if your processBlock()
function reports an error back to Vertica (see Handling errors). In that case, Vertica does not attempt to read the incomplete result set generated by the UDSF.
Setting up and tearing down
The ScalarFunction
class defines two additional methods that you can optionally implement to allocate and free resources: setup()
and destroy()
. You should use these methods to allocate and deallocate resources that you do not allocate through the UDx API (see Allocating resources for UDxs for details).
Notes
-
While the name you choose for your ScalarFunction
subclass does not have to match the name of the SQL function you will later assign to it, Vertica considers making the names the same a best practice.
-
Do not assume that your function will be called from the same thread that instantiated it.
-
The same instance of your ScalarFunction
subclass can be called on to process multiple blocks of data.
-
The rows of input sent to processBlock()
are not guaranteed to be in any particular order.
-
Writing too many output rows can cause Vertica to emit an out-of-bounds error.
API
The ScalarFunction API (C++) provides the following methods for extension by subclasses:
virtual void setup(ServerInterface &srvInterface,
const SizedColumnTypes &argTypes);
virtual void processBlock(ServerInterface &srvInterface,
BlockReader &arg_reader, BlockWriter &res_writer)=0;
virtual void getOutputRange (ServerInterface &srvInterface,
ValueRangeReader &inRange, ValueRangeWriter &outRange)
virtual void cancel(ServerInterface &srvInterface);
virtual void destroy(ServerInterface &srvInterface, const SizedColumnTypes &argTypes);
The ScalarFunction API (Java) provides the following methods for extension by subclasses:
public void setup(ServerInterface srvInterface, SizedColumnTypes argTypes);
public abstract void processBlock(ServerInterface srvInterface, BlockReader arg_reader,
BlockWriter res_writer) throws UdfException, DestroyInvocation;
protected void cancel(ServerInterface srvInterface);
public void destroy(ServerInterface srvInterface, SizedColumnTypes argTypes);
The ScalarFunction API (Python) provides the following methods for extension by subclasses:
def setup(self, server_interface, col_types)
def processBlock(self, server_interface, block_reader, block_writer)
def destroy(self, server_interface, col_types)
In R, implement the Main function API to define a scalar function:
FunctionName <- function(input.data.frame, parameters.data.frame) {
# Computations
# The function must return a data frame.
return(output.data.frame)
}
5.8.2 - ScalarFunctionFactory class
The ScalarFunctionFactory class tells Vertica metadata about your UDSF: its number of parameters and their data types, as well as the data type of its return value.
The ScalarFunctionFactory
class tells Vertica metadata about your UDSF: its number of parameters and their data types, as well as the data type of its return value. It also instantiates a subclass of ScalarFunction
.
Methods
You must implement the following methods in your ScalarFunctionFactory
subclass:
-
createScalarFunction()
instantiates a ScalarFunction
subclass. If writing in C++, you can call the vt_createFuncObj
macro with the name of the ScalarFunction
subclass. This macro takes care of allocating and instantiating the class for you.
-
getPrototype()
tells Vertica about the parameters and return type(s) for your UDSF. In addition to a ServerInterface
object, this method gets two ColumnTypes
objects. All you need to do in this function is to call class functions on these two objects to build the list of parameters and the return value type(s). If you return more than one value, the results are packaged into a ROW type.
After defining your factory class, you need to call the RegisterFactory
macro. This macro instantiates a member of your factory class, so Vertica can interact with it and extract the metadata it contains about your UDSF.
Declaring return values
If your function returns a sized column (a return data type whose length can vary, such as a VARCHAR), a value that requires precision, or more than one value, you must implement getReturnType()
. This method is called by Vertica to find the length or precision of the data being returned in each row of the results. The return value of this method depends on the data type your processBlock()
method returns:
-
CHAR, (LONG) VARCHAR, BINARY, and (LONG) VARBINARY return the maximum length.
-
NUMERIC types specify the precision and scale.
-
TIME and TIMESTAMP values (with or without timezone) specify precision.
-
INTERVAL YEAR TO MONTH specifies range.
-
INTERVAL DAY TO SECOND specifies precision and range.
-
ARRAY types specify the maximum number of elements.
If your UDSF does not return one of these data types and returns a single value, it does not need a getReturnType()
method.
The input to the getReturnType()
method is a SizedColumnTypes
object that contains the input argument types along with their lengths. This object will be passed to an instance of your processBlock()
function. Your implementation of getReturnType()
must extract the data types and lengths from this input and determine the length or precision of the output rows. It then saves this information in another instance of the SizedColumnTypes
class.
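For example, a hypothetical UDSF that returns a truncated copy of its VARCHAR argument could size its output from the input type, as in this minimal sketch:
virtual void getReturnType(ServerInterface &srvInterface,
                           const SizedColumnTypes &argTypes,
                           SizedColumnTypes &returnType)
{
    // The output can never be longer than the input string, so reuse
    // the input column's maximum length.
    const VerticaType &inType = argTypes.getColumnType(0);
    returnType.addVarchar(inType.getStringLength());
}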
API
The ScalarFunctionFactory API (C++) provides the following methods for extension by subclasses:
virtual ScalarFunction * createScalarFunction(ServerInterface &srvInterface)=0;
virtual void getPrototype(ServerInterface &srvInterface,
ColumnTypes &argTypes, ColumnTypes &returnType)=0;
virtual void getReturnType(ServerInterface &srvInterface,
const SizedColumnTypes &argTypes, SizedColumnTypes &returnType);
virtual void getParameterType(ServerInterface &srvInterface,
SizedColumnTypes &parameterTypes);
The ScalarFunctionFactory API (Java) provides the following methods for extension by subclasses:
public abstract ScalarFunction createScalarFunction(ServerInterface srvInterface);
public abstract void getPrototype(ServerInterface srvInterface, ColumnTypes argTypes, ColumnTypes returnType);
public void getReturnType(ServerInterface srvInterface, SizedColumnTypes argTypes,
SizedColumnTypes returnType) throws UdfException;
public void getParameterType(ServerInterface srvInterface, SizedColumnTypes parameterTypes);
The ScalarFunctionFactory API (Python) provides the following methods for extension by subclasses:
def createScalarFunction(self, srv)
def getPrototype(self, srv_interface, arg_types, return_type)
def getReturnType(self, srv_interface, arg_types, return_type)
In R, implement the Factory function API to define a scalar function factory:
FunctionNameFactory <- function() {
list(name = FunctionName,
udxtype = c("scalar"),
intype = c("int"),
outtype = c("int"))
}
5.8.3 - Setting null input and volatility behavior
Normally, Vertica calls your UDSF for every row of data in the query.
Normally, Vertica calls your UDSF for every row of data in the query. There are some cases where Vertica can avoid executing your UDSF. You can tell Vertica when it can skip calling your function and just supply a return value itself by changing your function's volatility and strictness settings.
-
Your function's volatility indicates whether it always returns the same output value when passed the same arguments. Depending on its behavior, Vertica can cache the arguments and the return value. If the user calls the UDSF with the same set of arguments, Vertica returns the cached value instead of calling your UDSF.
-
Your function's strictness indicates how it reacts to NULL arguments. If it always returns NULL when any argument is NULL, Vertica can just return NULL without having to call the function. This optimization also saves you work, because you do not need to test for and handle null arguments in your UDSF code.
You indicate the volatility and null handling of your function by setting the vol
and strict
fields in your ScalarFunctionFactory
class's constructor.
Volatility settings
To indicate your function's volatility, set the vol
field to one of the following values:
Value | Description
VOLATILE | Repeated calls to the function with the same arguments always result in different values. Vertica always calls volatile functions for each invocation.
IMMUTABLE | Calls to the function with the same arguments always result in the same return value.
STABLE | Repeated calls to the function with the same arguments within the same statement return the same output. For example, a function that returns the current user name is stable because the user cannot change within a statement. The user name could change between statements.
DEFAULT_VOLATILITY | The default volatility. This is the same as VOLATILE.
Example
The following example shows a version of the Add2ints
example factory class that makes the function immutable.
class Add2intsImmutableFactory : public Vertica::ScalarFunctionFactory
{
virtual Vertica::ScalarFunction *createScalarFunction(Vertica::ServerInterface &srvInterface)
{ return vt_createFuncObj(srvInterface.allocator, Add2ints); }
virtual void getPrototype(Vertica::ServerInterface &srvInterface,
Vertica::ColumnTypes &argTypes,
Vertica::ColumnTypes &returnType)
{
argTypes.addInt();
argTypes.addInt();
returnType.addInt();
}
public:
Add2intsImmutableFactory() {vol = IMMUTABLE;}
};
RegisterFactory(Add2intsImmutableFactory);
The following example demonstrates setting the Add2IntsFactory
's vol
field to IMMUTABLE to tell Vertica it can cache the arguments and return value.
public class Add2IntsFactory extends ScalarFunctionFactory {
@Override
public void getPrototype(ServerInterface srvInterface, ColumnTypes argTypes, ColumnTypes returnType){
argTypes.addInt();
argTypes.addInt();
returnType.addInt();
}
@Override
public ScalarFunction createScalarFunction(ServerInterface srvInterface){
return new Add2Ints();
}
// Class constructor
public Add2IntsFactory() {
// Tell Vertica that the same set of arguments will always result in the
// same return value.
vol = volatility.IMMUTABLE;
}
}
Null input behavior
To indicate how your function reacts to NULL input, set the strict
field to one of the following values.
Value | Description
CALLED_ON_NULL_INPUT | The function must be called, even if one or more arguments are NULL.
RETURN_NULL_ON_NULL_INPUT | The function always returns a NULL value if any of its arguments are NULL.
STRICT | A synonym for RETURN_NULL_ON_NULL_INPUT.
DEFAULT_STRICTNESS | The default strictness setting. This is the same as CALLED_ON_NULL_INPUT.
Example
The following C++ example demonstrates setting the null behavior of Add2ints so Vertica does not call the function with NULL values.
class Add2intsNullOnNullInputFactory : public Vertica::ScalarFunctionFactory
{
virtual Vertica::ScalarFunction *createScalarFunction(Vertica::ServerInterface &srvInterface)
{ return vt_createFuncObj(srvInterface.allocator, Add2ints); }
virtual void getPrototype(Vertica::ServerInterface &srvInterface,
Vertica::ColumnTypes &argTypes,
Vertica::ColumnTypes &returnType)
{
argTypes.addInt();
argTypes.addInt();
returnType.addInt();
}
public:
Add2intsNullOnNullInputFactory() {strict = RETURN_NULL_ON_NULL_INPUT;}
};
RegisterFactory(Add2intsNullOnNullInputFactory);
5.8.4 - Improving query performance (C++ only)
When evaluating a query, Vertica can take advantage of available information about the ranges of values.
When evaluating a query, Vertica can take advantage of available information about the ranges of values. For example, if data is partitioned and a query restricts output by the partitioned value, Vertica can ignore partitions that cannot possibly contain data that would satisfy the query. Similarly, for a scalar function, Vertica can skip processing rows in the data where the value returned from the function cannot possibly affect the results.
Consider a table with millions of rows of data on customer orders and a scalar function that computes the total price paid for everything in an order. A query uses a WHERE clause to restrict results to orders above a given value. A scalar function is called on a block of data; if no rows within that block could produce the target value, skipping the processing of the block could improve query performance.
A scalar function written in C++ can implement the getOutputRange
method. Before calling processBlock
, Vertica calls getOutputRange
to determine the minimum and maximum return values from this block given the input ranges. It then decides whether to call processBlock
to perform the computations.
The Add2Ints example implements this function. The minimum output value is the sum of the smallest values of each of the two inputs, and the maximum output is the sum of the largest values of each of the inputs. This function does not consider individual rows. Consider the following inputs:
a | b
------+------
21 | 92
500 | 19
111 | 11
The smallest values of the two inputs are 21 and 11, so the function reports 32 as the low end of the output range. The largest input values are 500 and 92, so it reports 592 as the high end of the output range. 592 is larger than the value returned for any of the input rows and 32 is smaller than any row's return value.
The purpose of getOutputRange
is to quickly eliminate calls where outputs would definitely be out of range. For example, if the query included "WHERE Add2Ints(a,b) > 600", this block of data could be skipped. There can still be cases where, after calling getOutputRange
, processBlock
returns no results. If the query included "WHERE Add2Ints(a,b) > 500", getOutputRange
would not eliminate this block of data.
Add2Ints implements getOutputRange
as follows:
/*
* This method computes the output range for this scalar function from
* the ranges of its inputs in a single invocation.
*
* The input ranges are retrieved via inRange
* The output range is returned via outRange
*/
virtual void getOutputRange(Vertica::ServerInterface &srvInterface,
Vertica::ValueRangeReader &inRange,
Vertica::ValueRangeWriter &outRange)
{
if (inRange.hasBounds(0) && inRange.hasBounds(1)) {
// Input ranges have bounds defined
if (inRange.isNull(0) || inRange.isNull(1)) {
// At least one range has only NULL values.
// Output range can only have NULL values.
outRange.setNull();
outRange.setHasBounds();
return;
} else {
// Compute output range
const vint& a1LoBound = inRange.getIntRefLo(0);
const vint& a2LoBound = inRange.getIntRefLo(1);
outRange.setIntLo(a1LoBound + a2LoBound);
const vint& a1UpBound = inRange.getIntRefUp(0);
const vint& a2UpBound = inRange.getIntRefUp(1);
outRange.setIntUp(a1UpBound + a2UpBound);
}
} else {
// Input ranges are unbounded. No output range can be defined
return;
}
if (!inRange.canHaveNulls(0) && !inRange.canHaveNulls(1)) {
// There cannot be NULL values in the output range
outRange.setCanHaveNulls(false);
}
// Let Vertica know that the output range is bounded
outRange.setHasBounds();
}
If getOutputRange
produces an error, Vertica issues a warning and does not call the method again for the current query.
5.8.5 - C++ example: Add2Ints
The following example shows a basic subclass of ScalarFunction called Add2ints.
The following example shows a basic subclass of ScalarFunction
called Add2ints
. As the name implies, it adds two integers together, returning a single integer result.
For the complete source code, see /opt/vertica/sdk/examples/ScalarFunctions/Add2Ints.cpp
. Java and Python versions of this UDx are included in /opt/vertica/sdk/examples
.
Loading and using the example
Use CREATE LIBRARY to load the library containing the function, and then use CREATE FUNCTION (scalar) to declare the function as in the following example:
=> CREATE LIBRARY ScalarFunctions AS '/home/dbadmin/examples/ScalarFunctions.so';
=> CREATE FUNCTION add2ints AS LANGUAGE 'C++' NAME 'Add2IntsFactory' LIBRARY ScalarFunctions;
The following example shows how to use this function:
=> SELECT Add2Ints(27,15);
Add2ints
----------
42
(1 row)
=> SELECT * FROM MyTable;
a | b
-----+----
7 | 0
12 | 2
12 | 6
18 | 9
1 | 1
58 | 4
450 | 15
(7 rows)
=> SELECT * FROM MyTable WHERE Add2ints(a, b) > 20;
a | b
-----+----
18 | 9
58 | 4
450 | 15
(3 rows)
Function implementation
A scalar function does its computation in the processBlock
method:
class Add2Ints : public ScalarFunction
{
public:
/*
* This method processes a block of rows in a single invocation.
*
* The inputs are retrieved via argReader
* The outputs are returned via resWriter
*/
virtual void processBlock(ServerInterface &srvInterface,
BlockReader &argReader,
BlockWriter &resWriter)
{
try {
// While we have inputs to process
do {
if (argReader.isNull(0) || argReader.isNull(1)) {
resWriter.setNull();
} else {
const vint a = argReader.getIntRef(0);
const vint b = argReader.getIntRef(1);
resWriter.setInt(a+b);
}
resWriter.next();
} while (argReader.next());
} catch(std::exception& e) {
// Standard exception. Quit.
vt_report_error(0, "Exception while processing block: [%s]", e.what());
}
}
// ...
};
Implementing getOutputRange
, which is optional, allows your function to skip rows where the result would not be within a target range. For example, if a WHERE clause restricts the query results to those in a certain range, calling the function for cases that could not possibly be in that range is unnecessary.
/*
* This method computes the output range for this scalar function from
* the ranges of its inputs in a single invocation.
*
* The input ranges are retrieved via inRange
* The output range is returned via outRange
*/
virtual void getOutputRange(Vertica::ServerInterface &srvInterface,
Vertica::ValueRangeReader &inRange,
Vertica::ValueRangeWriter &outRange)
{
if (inRange.hasBounds(0) && inRange.hasBounds(1)) {
// Input ranges have bounds defined
if (inRange.isNull(0) || inRange.isNull(1)) {
// At least one range has only NULL values.
// Output range can only have NULL values.
outRange.setNull();
outRange.setHasBounds();
return;
} else {
// Compute output range
const vint& a1LoBound = inRange.getIntRefLo(0);
const vint& a2LoBound = inRange.getIntRefLo(1);
outRange.setIntLo(a1LoBound + a2LoBound);
const vint& a1UpBound = inRange.getIntRefUp(0);
const vint& a2UpBound = inRange.getIntRefUp(1);
outRange.setIntUp(a1UpBound + a2UpBound);
}
} else {
// Input ranges are unbounded. No output range can be defined
return;
}
if (!inRange.canHaveNulls(0) && !inRange.canHaveNulls(1)) {
// There cannot be NULL values in the output range
outRange.setCanHaveNulls(false);
}
// Let Vertica know that the output range is bounded
outRange.setHasBounds();
}
Factory implementation
The factory instantiates a member of the class (createScalarFunction
), and also describes the function's inputs and outputs (getPrototype
):
class Add2IntsFactory : public ScalarFunctionFactory
{
// return an instance of Add2Ints to perform the actual addition.
virtual ScalarFunction *createScalarFunction(ServerInterface &interface)
{ return vt_createFuncObject<Add2Ints>(interface.allocator); }
// This function returns the description of the input and outputs of the
// Add2Ints class's processBlock function. It stores this information in
// two ColumnTypes objects, one for the input parameters, and one for
// the return value.
virtual void getPrototype(ServerInterface &interface,
ColumnTypes &argTypes,
ColumnTypes &returnType)
{
argTypes.addInt();
argTypes.addInt();
// Note that ScalarFunctions *always* return a single value.
returnType.addInt();
}
};
The RegisterFactory macro
Use the RegisterFactory
macro to register a UDx. This macro instantiates the factory class and makes the metadata it contains available for Vertica to access. To call this macro, pass it the name of your factory class:
RegisterFactory(Add2IntsFactory);
5.8.6 - Python example: currency_convert
The currency_convert scalar function reads two values from a table, a currency and a value.
The currency_convert
scalar function reads two values from a table, a currency and a value. It then converts the item's value to USD, returning a single float result.
You can find more UDx examples in the Vertica Github repository, https://github.com/vertica/UDx-Examples.
UDSF Python code
import vertica_sdk
import decimal
rates2USD = {'USD': 1.000,
'EUR': 0.89977,
'GBP': 0.68452,
'INR': 67.0345,
'AUD': 1.39187,
'CAD': 1.30335,
'ZAR': 15.7181,
'XXX': -1.0000}
class currency_convert(vertica_sdk.ScalarFunction):
"""Converts a money column to another currency
Returns a value in USD.
"""
def __init__(self):
pass
def setup(self, server_interface, col_types):
pass
def processBlock(self, server_interface, block_reader, block_writer):
while(True):
currency = block_reader.getString(0)
try:
rate = decimal.Decimal(rates2USD[currency])
except KeyError:
server_interface.log("ERROR: {} not in dictionary.".format(currency))
# Scalar functions always need a value to move forward to the
# next input row. Therefore, we need to assign it a value to
# move beyond the error.
currency = 'XXX'
rate = decimal.Decimal(rates2USD[currency])
starting_value = block_reader.getNumeric(1)
converted_value = decimal.Decimal(starting_value / rate)
block_writer.setNumeric(converted_value)
block_writer.next()
if not block_reader.next():
break
def destroy(self, server_interface, col_types):
pass
class currency_convert_factory(vertica_sdk.ScalarFunctionFactory):
def createScalarFunction(self, srv):
return currency_convert()
def getPrototype(self, srv_interface, arg_types, return_type):
arg_types.addVarchar()
arg_types.addNumeric()
return_type.addNumeric()
def getReturnType(self, srv_interface, arg_types, return_type):
return_type.addNumeric(9,4)
Load the function and library
Create the library and the function.
=> CREATE LIBRARY pylib AS '/home/dbadmin/python_udx/currency_convert/currency_convert.py' LANGUAGE 'Python';
CREATE LIBRARY
=> CREATE FUNCTION currency_convert AS LANGUAGE 'Python' NAME 'currency_convert_factory' LIBRARY pylib fenced;
CREATE FUNCTION
Querying data with the function
The following query shows how you can run a query with the UDSF.
=> SELECT product, currency_convert(currency, value) AS cost_in_usd
FROM items;
product | cost_in_usd
--------------+-------------
Shoes | 133.4008
Soccer Ball | 110.2817
Coffee | 13.5190
Surfboard | 176.2593
Hockey Stick | 76.7177
Car | 17000.0000
Software | 10.4424
Hamburger | 7.5000
Fish | 130.4272
Cattle | 269.2367
(10 rows)
5.8.7 - Python example: validate_url
The validate_url scalar function reads a string from a table, a URL.
The validate_url
scalar function reads a URL string from a table. It then checks whether the URL is responsive, returning either a status code or a string indicating that the attempt failed.
You can find more UDx examples in the Vertica Github repository, https://github.com/vertica/UDx-Examples.
UDSF Python code
import vertica_sdk
import urllib.request
import time
class validate_url(vertica_sdk.ScalarFunction):
"""Validates HTTP requests.
Returns the status code of a webpage. Pages that cannot be accessed return
"Failed to load page."
"""
def __init__(self):
pass
def setup(self, server_interface, col_types):
pass
def processBlock(self, server_interface, arg_reader, res_writer):
# Writes a string to the UDx log file.
server_interface.log("Validating webpage accessibility - UDx")
while(True):
url = arg_reader.getString(0)
try:
status = urllib.request.urlopen(url).getcode()
# Avoid overwhelming web servers -- be nice.
time.sleep(2)
except (ValueError, urllib.error.HTTPError, urllib.error.URLError):
status = 'Failed to load page'
res_writer.setString(str(status))
res_writer.next()
if not arg_reader.next():
# Stop processing when there are no more input rows.
break
def destroy(self, server_interface, col_types):
pass
class validate_url_factory(vertica_sdk.ScalarFunctionFactory):
def createScalarFunction(self, srv):
return validate_url()
def getPrototype(self, srv_interface, arg_types, return_type):
arg_types.addVarchar()
return_type.addChar()
def getReturnType(self, srv_interface, arg_types, return_type):
return_type.addChar(20)
Load the function and library
Create the library and the function.
=> CREATE OR REPLACE LIBRARY pylib AS 'webpage_tester/validate_url.py' LANGUAGE 'Python';
=> CREATE OR REPLACE FUNCTION validate_url AS LANGUAGE 'Python' NAME 'validate_url_factory' LIBRARY pylib fenced;
Querying data with the function
The following query shows how you can run a query with the UDSF.
=> SELECT url, validate_url(url) AS url_status FROM webpages;
url | url_status
-----------------------------------------------+----------------------
http://www.vertica.com/documentation/vertica/ | 200
http://www.google.com/ | 200
http://www.mass.gov.com/ | Failed to load page
http://www.espn.com | 200
http://blah.blah.blah.blah | Failed to load page
http://www.vertica.com/ | 200
(6 rows)
5.8.8 - Python example: matrix multiplication
Python UDxs can accept and return complex types.
Python UDxs can accept and return complex types. The MatrixMultiply
class multiplies input matrices and returns the resulting matrix product. These matrices are represented as two-dimensional arrays. In order to perform the matrix multiplication operation, the number of columns in the first input matrix must equal the number of rows in the second input matrix.
The complete source code is in /opt/vertica/sdk/examples/python/ScalarFunctions.py.
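As a quick check of the dimension rule, the following is a minimal pure-Python sketch, independent of the Vertica SDK; the matmul helper is hypothetical and exists only for illustration. Its printed result matches the mn.id=1, np.id=3 row of the query output below.
def matmul(lmat, rmat):
    # A matrix with len(lmat[0]) columns can only multiply a matrix with
    # that many rows; the product has len(lmat) rows and len(rmat[0]) columns.
    assert len(lmat[0]) == len(rmat), "columns of A must equal rows of B"
    return [[sum(lmat[i][k] * rmat[k][j] for k in range(len(rmat)))
             for j in range(len(rmat[0]))]
            for i in range(len(lmat))]

print(matmul([[1, 2, 3], [4, 5, 6]], [[2, 0], [0, 2], [2, 0]]))  # [[8, 4], [20, 10]]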
Loading and using the example
Load the library and create the function as follows:
=> CREATE OR REPLACE LIBRARY ScalarFunctions AS '/home/dbadmin/examples/python/ScalarFunctions.py' LANGUAGE 'Python';
=> CREATE FUNCTION MatrixMultiply AS LANGUAGE 'Python' NAME 'matrix_multiply_factory' LIBRARY ScalarFunctions;
You can create input matrices and then call the function, for example:
=> CREATE TABLE mn (id INTEGER, data ARRAY[ARRAY[INTEGER, 3], 2]);
CREATE TABLE
=> CREATE TABLE np (id INTEGER, data ARRAY[ARRAY[INTEGER, 2], 3]);
CREATE TABLE
=> COPY mn FROM STDIN PARSER fjsonparser();
{"id": 1, "data": [[1, 2, 3], [4, 5, 6]] }
{"id": 2, "data": [[7, 8, 9], [10, 11, 12]] }
\.
=> COPY np FROM STDIN PARSER fjsonparser();
{"id": 1, "data": [[0, 0], [0, 0], [0, 0]] }
{"id": 2, "data": [[1, 1], [1, 1], [1, 1]] }
{"id": 3, "data": [[2, 0], [0, 2], [2, 0]] }
\.
=> SELECT mn.id, np.id, MatrixMultiply(mn.data, np.data) FROM mn CROSS JOIN np ORDER BY 1, 2;
id | id | MatrixMultiply
---+----+-------------------
1 | 1 | [[0,0],[0,0]]
1 | 2 | [[6,6],[15,15]]
1 | 3 | [[8,4],[20,10]]
2 | 1 | [[0,0],[0,0]]
2 | 2 | [[24,24],[33,33]]
2 | 3 | [[32,16],[44,22]]
(6 rows)
Setup
All Python UDxs must import the Vertica SDK library:
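import vertica_sdk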
Factory implementation
The getPrototype()
method declares that the function arguments and return type must all be two-dimensional arrays, represented as arrays of integer arrays:
def getPrototype(self, srv_interface, arg_types, return_type):
array1dtype = vertica_sdk.ColumnTypes.makeArrayType(vertica_sdk.ColumnTypes.makeInt())
arg_types.addArrayType(array1dtype)
arg_types.addArrayType(array1dtype)
return_type.addArrayType(array1dtype)
getReturnType()
validates that the product matrix has the same number of rows as the first input matrix and the same number of columns as the second input matrix:
def getReturnType(self, srv_interface, arg_types, return_type):
(_, a1type) = arg_types[0]
(_, a2type) = arg_types[1]
m = a1type.getArrayBound()
p = a2type.getElementType().getArrayBound()
return_type.addArrayType(vertica_sdk.SizedColumnTypes.makeArrayType(vertica_sdk.SizedColumnTypes.makeInt(), p), m)
Function implementation
The processBlock()
method is called with a BlockReader
and a BlockWriter
, named arg_reader
and res_writer
respectively. To access elements of the input arrays, the method uses ArrayReader
instances. The arrays are nested, so an ArrayReader
must be instantiated for both the outer and inner arrays. List comprehension simplifies the process of reading the input arrays into lists. The method performs the computation and then uses an ArrayWriter
instance to construct the product matrix.
def processBlock(self, server_interface, arg_reader, res_writer):
while True:
lmat = [[cell.getInt(0) for cell in row.getArrayReader(0)] for row in arg_reader.getArrayReader(0)]
rmat = [[cell.getInt(0) for cell in row.getArrayReader(0)] for row in arg_reader.getArrayReader(1)]
omat = [[0 for c in range(len(rmat[0]))] for r in range(len(lmat))]
for i in range(len(lmat)):
for j in range(len(rmat[0])):
for k in range(len(rmat)):
omat[i][j] += lmat[i][k] * rmat[k][j]
res_writer.setArray(omat)
res_writer.next()
if not arg_reader.next():
break
5.8.9 - R example: SalesTaxCalculator
The SalesTaxCalculator scalar function reads a float and a varchar from a table: an item's price and a state abbreviation.
The SalesTaxCalculator
scalar function reads a float and a varchar from a table: an item's price and a state abbreviation. It uses the state abbreviation to look up the sales tax rate from a list, then calculates and returns the item's total cost including that state's sales tax.
You can find more UDx examples in the Vertica GitHub repository, https://github.com/vertica/UDx-Examples.
Load the function and library
Create the library and the function.
=> CREATE OR REPLACE LIBRARY rLib AS 'sales_tax_calculator.R' LANGUAGE 'R';
CREATE LIBRARY
=> CREATE OR REPLACE FUNCTION SalesTaxCalculator AS LANGUAGE 'R' NAME 'SalesTaxCalculatorFactory' LIBRARY rLib FENCED;
CREATE FUNCTION
Querying data with the function
The following query shows how you can run a query with the UDSF.
=> SELECT item, state_abbreviation,
price, SalesTaxCalculator(price, state_abbreviation) AS Price_With_Sales_Tax
FROM inventory;
item | state_abbreviation | price | Price_With_Sales_Tax
-------------+--------------------+-------+---------------------
Scarf | AZ | 6.88 | 7.53016
Software | MA | 88.31 | 96.655295
Soccer Ball | MS | 12.55 | 13.735975
Beads | LA | 0.99 | 1.083555
Baseball | TN | 42.42 | 46.42869
Cheese | WI | 20.77 | 22.732765
Coffee Mug | MA | 8.99 | 9.839555
Shoes | TN | 23.99 | 26.257055
(8 rows)
UDSF R code
SalesTaxCalculator <- function(input.data.frame) {
  # Not a complete list of states in the USA, but enough to get the idea.
  state.sales.tax <- list(ma = 0.0625,
                          az = 0.087,
                          la = 0.0891,
                          tn = 0.0945,
                          wi = 0.0543,
                          ms = 0.0707)
  prices.with.sales.tax <- numeric(nrow(input.data.frame))
  for ( row in seq_len(nrow(input.data.frame)) ) {
    # Ensure state abbreviations are lowercase.
    lower_state <- tolower(input.data.frame[row, 2])
    # Check if the state is in our state.sales.tax list.
    if (is.null(state.sales.tax[[lower_state]])) {
      stop("State is not in our small sample!")
    } else {
      sales.tax.rate <- state.sales.tax[[lower_state]]
      item.price <- input.data.frame[row, 1]
      # Calculate the price including sales tax, row by row so that each
      # row uses its own state's rate.
      prices.with.sales.tax[row] <- item.price + (item.price * sales.tax.rate)
    }
  }
  return(prices.with.sales.tax)
}
SalesTaxCalculatorFactory <- function() {
list(name = SalesTaxCalculator,
udxtype = c("scalar"),
intype = c("float", "varchar"),
outtype = c("float"))
}
5.8.10 - R example: kmeans
The KMeans_User scalar function reads any number of columns from a table as observations.
The KMeans_User
scalar function reads any number of columns from a table as observations. It then applies the kmeans clustering algorithm to those observations, using the two user-supplied parameters, and returns an integer identifying the cluster assigned to each row.
You can find more UDx examples in the Vertica GitHub repository, https://github.com/vertica/UDx-Examples.
Load the function and library
Create the library and the function:
=> CREATE OR REPLACE LIBRARY rLib AS 'kmeans.R' LANGUAGE 'R';
CREATE LIBRARY
=> CREATE OR REPLACE FUNCTION KMeans_User AS LANGUAGE 'R' NAME 'KMeans_UserFactory' LIBRARY rLib FENCED;
CREATE FUNCTION
Querying data with the function
The following query shows how you can run a query with the UDSF:
=> SELECT spec,
KMeans_User(sl, sw, pl, pw USING PARAMETERS clusters = 3, nstart = 20)
FROM iris;
spec | KMeans_User
-----------------+-------------
Iris-setosa | 2
Iris-setosa | 2
Iris-setosa | 2
Iris-setosa | 2
Iris-setosa | 2
Iris-setosa | 2
Iris-setosa | 2
Iris-setosa | 2
Iris-setosa | 2
Iris-setosa | 2
Iris-setosa | 2
.
.
.
(150 rows)
UDSF R code
KMeans_User <- function(input.data.frame, parameters.data.frame) {
# Take the clusters and nstart parameters passed by the user and assign them
# to variables in the function.
if ( is.null(parameters.data.frame[['clusters']]) ) {
stop("NULL value for clusters! clusters cannot be NULL.")
} else {
clusters.value <- parameters.data.frame[['clusters']]
}
if ( is.null(parameters.data.frame[['nstart']]) ) {
stop("NULL value for nstart! nstart cannot be NULL.")
} else {
nstart.value <- parameters.data.frame[['nstart']]
}
# Apply the algorithm to the data.
kmeans.clusters <- kmeans(input.data.frame[, 1:length(input.data.frame)],
clusters.value, nstart = nstart.value)
final.output <- data.frame(kmeans.clusters$cluster)
return(final.output)
}
KMeans_UserFactory <- function() {
list(name = KMeans_User,
udxtype = c("scalar"),
# Since this is a polymorphic function, the intype must be any
intype = c("any"),
outtype = c("int"),
parametertypecallback=KMeansParameters)
}
KMeansParameters <- function() {
parameters <- list(datatype = c("int", "int"),
length = c("NA", "NA"),
scale = c("NA", "NA"),
name = c("clusters", "nstart"))
return(parameters)
}
5.8.11 - C++ example: using complex types
UDxs can accept and return complex types.
UDxs can accept and return complex types. The ArraySlice
example takes an array and two indices as inputs and returns an array containing only the values in that range. Because array elements can be of any type, the function is polymorphic.
The complete source code is in /opt/vertica/sdk/examples/ScalarFunctions/ArraySlice.cpp.
Loading and using the example
Load the library and create the function as follows:
=> CREATE OR REPLACE LIBRARY ScalarFunctions AS '/home/dbadmin/examplesUDSF.so';
=> CREATE FUNCTION ArraySlice AS
LANGUAGE 'C++' NAME 'ArraySliceFactory' LIBRARY ScalarFunctions;
Create some data and call the function on it as follows:
=> CREATE TABLE arrays (id INTEGER, aa ARRAY[INTEGER]);
COPY arrays FROM STDIN;
1|[]
2|[1,2,3]
3|[5,4,3,2,1]
\.
=> CREATE TABLE slices (b INTEGER, e INTEGER);
COPY slices FROM STDIN;
0|2
1|3
2|4
\.
=> SELECT id, b, e, ArraySlice(aa, b, e) AS slice FROM arrays, slices;
id | b | e | slice
----+---+---+-------
1 | 0 | 2 | []
1 | 1 | 3 | []
1 | 2 | 4 | []
2 | 0 | 2 | [1,2]
2 | 1 | 3 | [2,3]
2 | 2 | 4 | [3]
3 | 0 | 2 | [5,4]
3 | 1 | 3 | [4,3]
3 | 2 | 4 | [3,2]
(9 rows)
Factory implementation
Because the function is polymorphic, getPrototype()
declares that the inputs and outputs can be of any type, and type enforcement must be done elsewhere:
void getPrototype(ServerInterface &srvInterface,
ColumnTypes &argTypes,
ColumnTypes &returnType) override
{
/*
* This is a polymorphic function that accepts any array
* and returns an array of the same type
*/
argTypes.addAny();
returnType.addAny();
}
The factory validates input types and determines the return type in getReturnType():
void getReturnType(ServerInterface &srvInterface,
const SizedColumnTypes &argTypes,
SizedColumnTypes &returnType) override
{
/*
* Three arguments: (array, slicebegin, sliceend)
* Validate manually since the prototype accepts any arguments.
*/
if (argTypes.size() != 3) {
vt_report_error(0, "Three arguments (array, slicebegin, sliceend) expected");
} else if (!argTypes[0].getType().isArrayType()) {
vt_report_error(1, "Argument 1 is not an array");
} else if (!argTypes[1].getType().isInt()) {
vt_report_error(2, "Argument 2 (slicebegin) is not an integer");
} else if (!argTypes[2].getType().isInt()) {
vt_report_error(3, "Argument 3 (sliceend) is not an integer");
}
/* return type is the same as the array arg type, copy it over */
returnType.push_back(argTypes[0]);
}
Function implementation
The processBlock()
method is called with a BlockReader
and a BlockWriter
. The first argument is an array. To access elements of the array, the method uses an ArrayReader
. Similarly, it uses an ArrayWriter
to construct the output.
void processBlock(ServerInterface &srvInterface,
BlockReader &argReader,
BlockWriter &resWriter) override
{
do {
if (argReader.isNull(0) || argReader.isNull(1) || argReader.isNull(2)) {
resWriter.setNull();
} else {
Array::ArrayReader argArray = argReader.getArrayRef(0);
const vint slicebegin = argReader.getIntRef(1);
const vint sliceend = argReader.getIntRef(2);
Array::ArrayWriter outArray = resWriter.getArrayRef(0);
if (slicebegin < sliceend) {
for (int i = 0; i < slicebegin && argArray->hasData(); i++) {
argArray->next();
}
for (int i = slicebegin; i < sliceend && argArray->hasData(); i++) {
outArray->copyFromInput(*argArray);
outArray->next();
argArray->next();
}
}
outArray.commit(); /* finalize the written array elements */
}
resWriter.next();
} while (argReader.next());
}
5.8.12 - C++ example: returning multiple values
When writing a UDSF, you can specify more than one return value.
When writing a UDSF, you can specify more than one return value. If you specify multiple values, Vertica packages them into a single ROW as a return value. You can query fields in the ROW or the entire ROW.
The following example implements a function named div (division) that returns two integers, the quotient and the remainder.
This example shows one way to return a ROW from a UDSF. Returning multiple values and letting Vertica build the ROW is convenient when inputs and outputs are all of primitive types. You can also work directly with the complex types, as described in Complex Types as Arguments and illustrated in C++ example: using complex types.
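For reference, Python's built-in divmod computes the same quotient/remainder pair that this UDSF packages into a ROW:
q, r = divmod(10, 3)
print(q, r)  # prints "3 1", matching the {"quotient":3,"remainder":1} row below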
Loading and using the example
Load the library and create the function as follows:
=> CREATE OR REPLACE LIBRARY ScalarFunctions AS '/home/dbadmin/examplesUDSF.so';
=> CREATE FUNCTION div AS
LANGUAGE 'C++' NAME 'DivFactory' LIBRARY ScalarFunctions;
Create some data and call the function on it as follows:
=> CREATE TABLE D (a INTEGER, b INTEGER);
COPY D FROM STDIN DELIMITER ',';
10,0
10,1
10,2
10,3
10,4
10,5
\.
=> SELECT a, b, Div(a, b), (Div(a, b)).quotient, (Div(a, b)).remainder FROM D;
a | b | Div | quotient | remainder
----+---+------------------------------------+----------+-----------
10 | 0 | {"quotient":null,"remainder":null} | |
10 | 1 | {"quotient":10,"remainder":0} | 10 | 0
10 | 2 | {"quotient":5,"remainder":0} | 5 | 0
10 | 3 | {"quotient":3,"remainder":1} | 3 | 1
10 | 4 | {"quotient":2,"remainder":2} | 2 | 2
10 | 5 | {"quotient":2,"remainder":0} | 2 | 0
(6 rows)
Factory implementation
The factory declares the two return values in getPrototype()
and in getReturnType()
. The factory is otherwise unremarkable.
void getPrototype(ServerInterface &interface,
ColumnTypes &argTypes,
ColumnTypes &returnType) override
{
argTypes.addInt();
argTypes.addInt();
returnType.addInt(); /* quotient */
returnType.addInt(); /* remainder */
}
void getReturnType(ServerInterface &srvInterface,
const SizedColumnTypes &argTypes,
SizedColumnTypes &returnType) override
{
returnType.addInt("quotient");
returnType.addInt("remainder");
}
Function implementation
The function writes two output values in processBlock()
. The number of values here must match the factory declarations.
class Div : public ScalarFunction {
void processBlock(Vertica::ServerInterface &srvInterface,
Vertica::BlockReader &argReader,
Vertica::BlockWriter &resWriter) override
{
do {
if (argReader.isNull(0) || argReader.isNull(1) || (argReader.getIntRef(1) == 0)) {
resWriter.setNull(0);
resWriter.setNull(1);
} else {
const vint dividend = argReader.getIntRef(0);
const vint divisor = argReader.getIntRef(1);
resWriter.setInt(0, dividend / divisor);
resWriter.setInt(1, dividend % divisor);
}
resWriter.next();
} while (argReader.next());
}
};
5.8.13 - C++ example: calling a UDSF from a check constraint
This example shows you the C++ code needed to create a UDSF that can be called by a check constraint.
This example shows you the C++ code needed to create a UDSF that can be called by a check constraint. The name of the sample function is LargestSquareBelow
. The sample function determines the largest number whose square is less than the number in the subject column. For example, if the number in the column is 1000, the largest number whose square is less than 1000 is 31 (961).
Important
A UDSF used within a check constraint must be immutable, and the constraint must handle null values properly. Otherwise, the check constraint might not work as you intended. In addition, Vertica evaluates the predicate of an enabled check constraint on every row that is loaded or updated, so consider performance in writing your function.
For information on check constraints, see Check constraints.
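The underlying computation is straightforward. The following is a minimal pure-Python sketch of the same logic; the largest_square_below helper is hypothetical and only mirrors the C++ implementation shown later in this section.
import math

def largest_square_below(n):
    # Largest integer whose square is strictly less than n; None for NULL or n <= 0.
    if n is None or n <= 0:
        return None
    return int(math.sqrt(n - 1))

assert largest_square_below(1000) == 31        # 31 * 31 = 961 < 1000
assert largest_square_below(1000001) == 1000   # 1000 * 1000 = 1000000 < 1000001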
Loading and using the example
The following example shows how you can create and load a library named MySqLib, using CREATE LIBRARY. Adjust the library path in this example to the absolute path and file name for the location where you saved the shared object LargestSquareBelow
.
Create the library:
=> CREATE OR REPLACE LIBRARY MySqLib AS '/home/dbadmin/LargestSquareBelow.so';
After you create and load the library, add the function to the catalog using the CREATE FUNCTION (scalar) statement:
=> CREATE OR REPLACE FUNCTION largestSqBelow AS LANGUAGE 'C++' NAME 'LargestSquareBelowInfo' LIBRARY MySqLib;
Next, include the UDSF in a check constraint:
=> CREATE TABLE squaretest(
ceiling INTEGER UNIQUE,
CONSTRAINT chk_sq CHECK (largestSqBelow(ceiling) < ceiling*ceiling)
);
Add data to the table, squaretest
:
=> COPY squaretest FROM stdin DELIMITER ',' NULL 'null';
-1
null
0
1
1000
1000000
1000001
\.
Your output should be similar to the following sample, based upon the data you use:
=> SELECT ceiling, largestSqBelow(ceiling)
FROM squaretest ORDER BY ceiling;
ceiling | largestSqBelow
---------+----------------
|
-1 |
0 |
1 | 0
1000 | 31
1000000 | 999
1000001 | 1000
(7 rows)
ScalarFunction implementation
This ScalarFunction
implementation does the processing work for a UDSF that determines the largest number whose square is less than the number input.
#include "Vertica.h"
/*
* ScalarFunction implementation for a UDSF that
* determines the largest number whose square is less than
* the number input.
*/
class LargestSquareBelow : public Vertica::ScalarFunction
{
public:
/*
* This function does all of the actual processing for the UDSF.
* The inputs are retrieved via arg_reader
* The outputs are returned via res_writer
*
*/
virtual void processBlock(Vertica::ServerInterface &srvInterface,
Vertica::BlockReader &arg_reader,
Vertica::BlockWriter &res_writer)
{
if (arg_reader.getNumCols() != 1)
vt_report_error(0, "Function only accepts 1 argument, but %zu provided", arg_reader.getNumCols());
// While we have input to process
do {
// Read the input parameter by calling the
// BlockReader.getIntRef class function
const Vertica::vint a = arg_reader.getIntRef(0);
Vertica::vint res;
//Determine the largest square below the number
if ((a != Vertica::vint_null) && (a > 0))
{
res = (Vertica::vint)sqrt(a - 1);
}
else
res = Vertica::vint_null;
//Call BlockWriter.setInt to store the output value,
//which is the largest square
res_writer.setInt(res);
//Write the row and advance to the next output row
res_writer.next();
//Continue looping until there are no more input rows
} while (arg_reader.next());
}
};
ScalarFunctionFactory implementation
This ScalarFunctionFactory
implementation does the work of handling input and output, and marks the function as immutable (a requirement if you plan to use the UDSF within a check constraint).
class LargestSquareBelowInfo : public Vertica::ScalarFunctionFactory
{
//return an instance of LargestSquareBelow to perform the computation.
virtual Vertica::ScalarFunction *createScalarFunction(Vertica::ServerInterface &srvInterface)
//Call vt_createFuncObject to create the new LargestSquareBelow class instance.
{ return Vertica::vt_createFuncObject<LargestSquareBelow>(srvInterface.allocator); }
/*
* This function returns the description of the input and outputs of the
* LargestSquareBelow class's processBlock function. It stores this information in
* two ColumnTypes objects, one for the input parameter, and one for
* the return value.
*/
virtual void getPrototype(Vertica::ServerInterface &srvInterface,
Vertica::ColumnTypes &argTypes,
Vertica::ColumnTypes &returnType)
{
// Takes one int as input, so adds int to the argTypes object
argTypes.addInt();
// Returns a single int, so add a single int to the returnType object.
// ScalarFunctions always return a single value.
returnType.addInt();
}
public:
// the function cannot be called within a check constraint unless the UDx author
// certifies that the function is immutable:
LargestSquareBelowInfo() { vol = Vertica::IMMUTABLE; }
};
The RegisterFactory macro
Use the RegisterFactory
macro to register a ScalarFunctionFactory
subclass. This macro instantiates the factory class and makes the metadata it contains available for Vertica to access. To call this macro, pass it the name of your factory class.
RegisterFactory(LargestSquareBelowInfo);
5.9 - Transform functions (UDTFs)
A user-defined transform function (UDTF) lets you transform a table of data into another table.
A user-defined transform function (UDTF) lets you transform a table of data into another table. It reads one or more arguments (treated as a row of data), and returns zero or more rows of data consisting of one or more columns. A UDTF can produce any number of rows as output. However, each row it outputs must be complete. Advancing to the next row without having added a value for each column produces incorrect results.
The schema of the output table does not need to correspond to the schema of the input table—they can be totally different. The UDTF can return any number of output rows for each row of input.
Unless a UDTF is marked as one-to-many in its factory function, it can only be used in a SELECT list that contains the UDTF call and a required OVER clause. A multi-phase UDTF can make use of partition columns (PARTITION BY), but other UDTFs cannot.
When used with GROUP BY and ORDER BY in a statement, UDTFs run after the GROUP BY but before the final ORDER BY. The ORDER BY clause may contain only columns or expressions that are in a window partition clause (see Window partitioning).
UDTFs can take up to 9800 parameters (input columns). Attempts to pass more parameters to a UDTF return an error.
5.9.1 - TransformFunction class
The TransformFunction class is where you perform the data processing, transforming input rows into output rows.
The TransformFunction
class is where you perform the data processing, transforming input rows into output rows. Your subclass must define the processPartition()
method. It may define methods to set up and tear down the function.
The processPartition()
method carries out all of the processing that you want your UDTF to perform. When a user calls your function in a SQL statement, Vertica bundles together the data from the function parameters and passes it to processPartition()
.
The input and output of the processPartition()
method are supplied by objects of the PartitionReader
and PartitionWriter
classes. They define methods that you use to read the input data and write the output data for your UDTF.
A UDTF does not necessarily operate on a single row the way a UDSF does. A UDTF can read any number of rows and write output at any time.
Consider the following guidelines when implementing processPartition() (a minimal sketch follows this list):
-
Extract the input parameters by calling data-type-specific functions in the PartitionReader
object to extract each input parameter. Each of these functions takes a single parameter: the column number in the input row that you want to read. Your function might need to handle NULL values.
-
When writing output, your UDTF must supply values for all of the output columns you defined in your factory. Similarly to reading input columns, the PartitionWriter
object has functions for writing each type of data to the output row.
-
Use PartitionReader.next()
to determine if there is more input to process, and exit when the input is exhausted.
-
In some cases, you might want to determine the number and types of parameters using PartitionReader
's getNumCols()
and getTypeMetaData()
functions, instead of just hard-coding the data types of the columns in the input row. This is useful if you want your TransformFunction
to be able to process input tables with different schemas. You can then use different TransformFunctionFactory
classes to define multiple function signatures that call the same TransformFunction
class. See Overloading your UDx for more information.
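Putting these guidelines together, the following is a minimal Python sketch of the processPartition() loop. The Repeater class and its emit-each-row-twice behavior are hypothetical, shown only to illustrate the read/write/advance pattern:
import vertica_sdk

class Repeater(vertica_sdk.TransformFunction):
    # Hypothetical transform: emits each input string twice.
    def processPartition(self, server_interface, input_reader, output_writer):
        while True:
            value = input_reader.getString(0)      # read column 0 of the current row
            for _ in range(2):
                output_writer.setString(0, value)  # supply a value for every output column
                output_writer.next()               # advance to the next output row
            if not input_reader.next():            # stop when the input is exhausted
                break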
Setting up and tearing down
The TransformFunction
class defines two additional methods that you can optionally implement to allocate and free resources: setup()
and destroy()
. You should use these methods to allocate and deallocate resources that you do not allocate through the UDx API (see Allocating resources for UDxs for details).
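For example, a minimal sketch pairing the two methods; the class name and the lookup-table resource are hypothetical:
import vertica_sdk

class LookupTransform(vertica_sdk.TransformFunction):
    # processPartition() is omitted here for brevity.
    def setup(self, server_interface, col_types):
        # Acquire resources once, before any rows are processed.
        self.severity = {"ERROR": 1, "WARN": 2}

    def destroy(self, server_interface, col_types):
        # Release anything acquired in setup().
        self.severity = None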
API
The TransformFunction API provides the following methods for extension by subclasses:
virtual void setup(ServerInterface &srvInterface,
const SizedColumnTypes &argTypes);
virtual void processPartition(ServerInterface &srvInterface,
PartitionReader &input_reader, PartitionWriter &output_writer)=0;
virtual void cancel(ServerInterface &srvInterface);
virtual void destroy(ServerInterface &srvInterface,
const SizedColumnTypes &argTypes);
The PartitionReader
and PartitionWriter
classes provide getters and setters for column values, along with next()
to iterate through partitions. See the API reference documentation for details.
The TransformFunction API provides the following methods for extension by subclasses:
public void setup(ServerInterface srvInterface, SizedColumnTypes argTypes);
public abstract void processPartition(ServerInterface srvInterface,
PartitionReader input_reader, PartitionWriter output_writer)
throws UdfException, DestroyInvocation;
protected void cancel(ServerInterface srvInterface);
public void destroy(ServerInterface srvInterface, SizedColumnTypes argTypes);
The PartitionReader
and PartitionWriter
classes provide getters and setters for column values, along with next()
to iterate through partitions. See the API reference documentation for details.
The TransformFunction API provides the following methods for extension by subclasses:
def setup(self, server_interface, col_types)
def processPartition(self, server_interface, partition_reader, partition_writer)
def destroy(self, server_interface, col_types)
The PartitionReader
and PartitionWriter
classes provide getters and setters for column values, along with next()
to iterate through partitions. See the API reference documentation for details.
Implement the Main function API to define a transform function:
FunctionName <- function(input.data.frame, parameters.data.frame) {
# Computations
# The function must return a data frame.
return(output.data.frame)
}
5.9.2 - TransformFunctionFactory class
The TransformFunctionFactory class tells Vertica metadata about your UDTF: its number of parameters and their data types, as well as function properties and the data type of the return value.
The TransformFunctionFactory
class tells Vertica metadata about your UDTF: its number of parameters and their data types, as well as function properties and the data type of the return value. It also instantiates a subclass of TransformFunction
.
You must implement the following methods in your TransformFunctionFactory (a minimal sketch follows this list):
-
getPrototype()
returns two ColumnTypes
objects that describe the columns your UDTF takes as input and returns as output.
-
getReturnType()
tells Vertica details about the output values: the width of variable-sized data types (such as VARCHAR) and the precision of data types that have settable precision (such as TIMESTAMP). You can also set the names of the output columns using this function. While this method is optional for UDxs that return single values, you must implement it for UDTFs.
-
createTransformFunction()
instantiates your TransformFunction
subclass.
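The following is a minimal Python factory sketch implementing these three methods; it continues the hypothetical Repeater transform sketched in the TransformFunction class section:
import vertica_sdk

class RepeaterFactory(vertica_sdk.TransformFunctionFactory):
    def getPrototype(self, srv_interface, arg_types, return_type):
        arg_types.addVarchar()       # one varchar input column
        return_type.addVarchar()     # one varchar output column

    def getReturnType(self, srv_interface, arg_types, return_type):
        # Size the output column from the input column and name it.
        return_type.addColumn(arg_types.getColumnType(0), "repeated")

    def createTransformFunction(self, srv):
        return Repeater()            # the TransformFunction subclass sketched earlier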
For UDTFs written in C++ and Python, you can implement the getTransformFunctionProperties()
method to set transform function class properties, including:
isExploder
: By default False, indicates whether a single-phase UDTF performs a transform from one input row to a result set of N rows, often called a one-to-many transform. If set to True, each partition to the UDTF must consist of exactly one input row. When a UDTF is labeled as one-to-many, Vertica is able to optimize query plans and users can write SELECT queries that include any expression and do not require an OVER clause. For more information about UDTF partitioning options and instructions on how to set this class property, see Partitioning options for UDTFs. See Python example: explode for an in-depth example detailing a one-to-many UDTF.
For transform functions written in C++, you can provide information that can help with query optimization. See Improving query performance (C++ only).
API
The TransformFunctionFactory API provides the following methods for extension by subclasses:
virtual TransformFunction *
createTransformFunction (ServerInterface &srvInterface)=0;
virtual void getPrototype(ServerInterface &srvInterface,
ColumnTypes &argTypes, ColumnTypes &returnType)=0;
virtual void getReturnType(ServerInterface &srvInterface,
const SizedColumnTypes &argTypes,
SizedColumnTypes &returnType)=0;
virtual void getParameterType(ServerInterface &srvInterface,
SizedColumnTypes ¶meterTypes);
virtual void getTransformFunctionProperties(ServerInterface &srvInterface,
const SizedColumnTypes &argTypes,
Properties &properties);
The TransformFunctionFactory API provides the following methods for extension by subclasses:
public abstract TransformFunction createTransformFunction(ServerInterface srvInterface);
public abstract void getPrototype(ServerInterface srvInterface, ColumnTypes argTypes, ColumnTypes returnType);
public abstract void getReturnType(ServerInterface srvInterface, SizedColumnTypes argTypes,
SizedColumnTypes returnType) throws UdfException;
public void getParameterType(ServerInterface srvInterface, SizedColumnTypes parameterTypes);
The TransformFunctionFactory API provides the following methods for extension by subclasses:
def createTransformFunction(self, srv)
def getPrototype(self, srv_interface, arg_types, return_type)
def getReturnType(self, srv_interface, arg_types, return_type)
def getParameterType(self, server_interface, parameterTypes)
def getTransformFunctionProperties(self, server_interface, arg_types)
Implement the Factory function API to define a transform function factory:
FunctionNameFactory <- function() {
list(name = FunctionName,
udxtype = c("scalar"),
intype = c("int"),
outtype = c("int"))
}
5.9.3 - MultiPhaseTransformFunctionFactory class
Multi-phase UDTFs let you break your data processing into multiple steps.
Multi-phase UDTFs let you break your data processing into multiple steps. Using this feature, your UDTFs can perform processing in a way similar to Hadoop or other MapReduce frameworks. You can use the first phase to break down and gather data, and then use subsequent phases to process the data. For example, the first phase of your UDTF could extract specific types of user interactions from a web server log stored in the column of a table, and subsequent phases could perform analysis on those interactions.
Multi-phase UDTFs also let you decide where processing should occur: locally on each node, or throughout the cluster. If your multi-phase UDTF is like a MapReduce process, you want the first phase of your multi-phase UDTF to process data that is stored locally on the node where the instance of the UDTF is running. This prevents large segments of data from being copied around the Vertica cluster. Depending on the type of processing being performed in later phases, you may choose to have the data segmented and distributed across the Vertica cluster.
Each phase of the UDTF is the same as a traditional (single-phase) UDTF: it receives a table as input, and generates a table as output. The schema for each phase's output does not have to match its input, and each phase can output as many or as few rows as it wants.
You create a subclass of TransformFunction
to define the processing performed by each stage. If you already have a TransformFunction
from a single-phase UDTF that performs the processing you want a phase of your multi-phase UDTF to perform, you can easily adapt it to work within the multi-phase UDTF.
What makes a multi-phase UDTF different from a traditional UDTF is the factory class you use. You define a multi-phase UDTF using a subclass of MultiPhaseTransformFunctionFactory
, rather than the TransformFunctionFactory
. This special factory class acts as a container for all of the phases in your multi-step UDTF. It provides Vertica with the input and output requirements of the entire multi-phase UDTF (through the getPrototype()
function), and a list of all the phases in the UDTF.
Within your subclass of the MultiPhaseTransformFunctionFactory
class, you define one or more subclasses of TransformFunctionPhase
. These classes fill the same role as the TransformFunctionFactory
class for each phase in your multi-phase UDTF. They define the input and output of each phase and create instances of their associated TransformFunction
classes to perform the processing for each phase of the UDTF. In addition to these subclasses, your MultiPhaseTransformFunctionFactory
includes fields that provide a handle to an instance of each of the TransformFunctionPhase
subclasses.
Note
The MultiPhaseTransformFunctionFactory and TransformFunctionPhase classes do not support the isExploder class property.
API
The MultiPhaseTransformFunctionFactory
class extends TransformFunctionFactory.
The API provides the following additional methods for extension by subclasses:
virtual void getPhases(ServerInterface &srvInterface,
std::vector< TransformFunctionPhase * > &phases)=0;
If using this factory, you must also extend TransformFunctionPhase. See the SDK reference documentation.
The MultiPhaseTransformFunctionFactory class extends TransformFunctionFactory
. The API provides the following methods for extension by subclasses:
public abstract void getPhases(ServerInterface srvInterface,
Vector< TransformFunctionPhase > phases);
If using this factory, you must also extend TransformFunctionPhase. See the SDK reference documentation.
The MultiPhaseTransformFunctionFactory class extends TransformFunctionFactory. For each phase, the factory must define a class that extends TransformFunctionPhase.
The factory adds the following method:
def getPhases(self, srv)
TransformFunctionPhase
has the following methods:
def createTransformFunction(cls, srv)
def getReturnType(self, srv_interface, input_types, output_types)
5.9.4 - Improving query performance (C++ only)
When evaluating a query, the Vertica optimizer might sort its input to improve performance.
When evaluating a query, the Vertica optimizer might sort its input to improve performance. If a function already returns sorted data, this means the optimizer is doing extra work. A transform function written in C++ can declare how the data it returns is sorted, and the optimizer can take advantage of that information.
A transform function does the actual sorting in the function's processPartition()
method. To take advantage of this optimization, sorts must be ascending. You need not sort all columns, but you must return the sorted column or columns first.
You can declare how the function sorts its output in the factory's getReturnType()
method.
Caution
If the sorting declared in the factory does not match the sorting provided by the function, query results can be incorrect.
The PolyTopKPerPartition
example sorts input columns and returns a given number of rows:
=> SELECT polykSort(14, a, b, c) OVER (ORDER BY a, b, c)
AS (sort1,sort2,sort3) FROM observations ORDER BY 1,2,3;
sort1 | sort2 | sort3
-------+-------+-------
1 | 1 | 1
1 | 1 | 2
1 | 1 | 3
1 | 2 | 1
1 | 2 | 2
1 | 3 | 1
1 | 3 | 2
1 | 3 | 3
1 | 3 | 4
2 | 1 | 1
2 | 1 | 2
2 | 2 | 3
2 | 2 | 34
2 | 3 | 5
(14 rows)
The factory declares this sorting in getReturnType()
by setting the isSortedBy
property on each column. Each SizedColumnType
has an associated Properties
object where this value can be set:
virtual void getReturnType(ServerInterface &srvInterface, const SizedColumnTypes &inputTypes, SizedColumnTypes &outputTypes)
{
vector<size_t> argCols; // Argument column indexes.
inputTypes.getArgumentColumns(argCols);
size_t colIdx = 0;
for (vector<size_t>::iterator i = argCols.begin() + 1; i < argCols.end(); i++)
{
SizedColumnTypes::Properties props;
props.isSortedBy = true;
std::stringstream cname;
cname << "col" << colIdx++;
outputTypes.addArg(inputTypes.getColumnType(*i), cname.str(), props);
}
}
You can see the effects of this optimization by reviewing the EXPLAIN plans for queries with and without this setting. The following output shows the query plan for polyk
, the unsorted version. Note the cost for sorting:
=> EXPLAIN SELECT polyk(14, a, b, c) OVER (ORDER BY a, b, c)
FROM observations ORDER BY 1,2,3;
Access Path:
+-SORT [Cost: 2K, Rows: 10K] (PATH ID: 1)
| Order: col0 ASC, col1 ASC, col2 ASC
| +---> ANALYTICAL [Cost: 2K, Rows: 10K] (PATH ID: 2)
| | Analytic Group
| | Functions: polyk()
| | Group Sort: observations.a ASC NULLS LAST, observations.b ASC NULLS LAST, observations.c ASC NULLS LAST
| | +---> STORAGE ACCESS for observations [Cost: 2K, Rows: 10K]
(PATH ID: 3)
| | | Projection: public.observations_super
| | | Materialize: observations.a, observations.b, observations.c
The query plan for the sorted version omits this step (and cost) and starts with the analytical step (the second step in the previous plan):
=> EXPLAIN SELECT polykSort(14, a, b, c) OVER (ORDER BY a, b, c)
FROM observations ORDER BY 1,2,3;
Access Path:
+-ANALYTICAL [Cost: 2K, Rows: 10K] (PATH ID: 2)
| Analytic Group
| Functions: polykSort()
| Group Sort: observations.a ASC NULLS LAST, observations.b ASC NULLS LAST, observations.c ASC NULLS LAST
| +---> STORAGE ACCESS for observations [Cost: 2K, Rows: 10K] (PATH ID: 3)
| | Projection: public.observations_super
| | Materialize: observations.a, observations.b, observations.c
5.9.5 - Partitioning options for UDTFs
Depending on the application, a UDTF might require the input data to be partitioned in a specific way.
Depending on the application, a UDTF might require the input data to be partitioned in a specific way. For example, a UDTF that processes a web server log file to count the number of hits referred by each partner web site needs to have its input partitioned by a referrer column. However, in other cases—such as a string tokenizer—the sort order of the data does not matter. Vertica provides partition options for both of these types of UDTFs.
Data sort required
In cases where a specific sort order is required, the window partitioning clause in the query that calls the UDTF should use a PARTITION BY
clause. Each node in the cluster partitions the data it stores, sends some of these partitions off to other nodes, and then consolidates the partitions it receives from other nodes and runs an instance of the UDTF to process them.
For example, the following UDTF partitions the input data by store ID and then computes the count of each distinct array element in each partition:
=> SELECT * FROM orders;
storeID | productIDs
---------+-----------------------
1 | [101,102,103]
1 | [102,104]
1 | [101,102,102,201,203]
2 | [101,202,203,202,203]
2 | [203]
2 | []
(6 rows)
=> SELECT storeID, CountElements(productIDs) OVER (PARTITION BY storeID) FROM orders;
storeID | element_count
--------+---------------------------
1 | {"element":101,"count":2}
1 | {"element":102,"count":4}
1 | {"element":103,"count":1}
1 | {"element":104,"count":1}
1 | {"element":201,"count":1}
1 | {"element":202,"count":1}
2 | {"element":101,"count":1}
2 | {"element":202,"count":2}
2 | {"element":203,"count":3}
(9 rows)
No sort needed
Some UDTFs, such as Explode, do not need to partition input data in a particular way. In these cases, you can specify that each UDTF instance process only the data that is stored locally by the node on which it is running. By eliminating the overhead of partitioning data and the cost of sort and merge operations, processing can be much more efficient.
You can use the following window partition options for UDTFs that do not require a specific data partitioning:
-
PARTITION ROW
: For single-phase UDTFs where each partition is one input row, allows users to write SELECT queries that include any expression. The UDTF calls the processPartition()
method once per input row. UDTFs of this type, often called one-to-many transforms, can be explicitly marked as such with the exploder class property in the TransformFunctionFactory class. This class property helps Vertica optimize query plans and removes the need for an OVER clause. See One to Many UDTFs for details on how to set this class property for UDTFs written in C++ and Python.
-
PARTITION BEST
: For thread-safe UDTFs only, optimizes performance through multi-threaded queries across multiple nodes. The UDTF calls the processPartition()
method once per thread per node.
-
PARTITION NODES
: Optimizes performance of single-threaded queries across multiple nodes. The UDTF calls the processPartition()
method once per node.
For more information about these partition options, see Window partitioning.
One-to-many UDTFs
To mark a UDTF as one-to-many, you must set the isExploder
class property to True within the getTransformFunctionProperties()
method. You can determine whether to mark a UDTF as one-to-many from the transform function's arguments and parameters, for example:
void getFunctionProperties(ServerInterface &srvInterface,
const SizedColumnTypes &argTypes,
Properties &properties) override
{
if (argTypes.getColumnCount() > 1) {
properties.isExploder = false;
}
else {
properties.isExploder = true;
}
}
To mark a UDTF as one-to-many, you must set the is_exploder
class property to True within the getTransformFunctionProperties()
method. You can determine whether to mark a UDTF as one-to-many from the transform function's arguments and parameters, for example:
def getFunctionProperties(cls, server_interface, arg_types):
props = vertica_sdk.TransformFunctionFactory.Properties()
if arg_types.getColumnCount() != 1:
props.is_exploder = False
else:
props.is_exploder = True
return props
If the exploder class property is set to True, the OVER clause is by default OVER(PARTITION ROW). This allows users to call the UDTF without specifying an OVER clause:
=> SELECT * FROM reviews;
id | sentence
----+--------------------------------------
1 | Customer service was slow
2 | Product is exactly what I needed
3 | Price is a bit high
4 | Highly recommended
(4 rows)
=> SELECT tokenize(sentence) FROM reviews;
tokens
-------------
Customer
service
was
slow
Product
...
bit
high
Highly
recommended
(17 rows)
Note
If a user writes an OVER clause with PARTITION BY for a function marked as one-to-many, the function replaces the OVER clause with OVER(PARTITION ROW) and emits a notice to the user.
One-to-many UDTFs also support any expression in the SELECT clause, unlike UDTFs that use either the PARTITION BEST or the PARTITION NODES clause:
=> SELECT id, tokenize(sentence) FROM reviews;
id | tokens
----+-------------
1 | Customer
1 | service
1 | was
1 | slow
2 | Product
...
3 | high
4 | Highly
4 | recommended
(17 rows)
For an in-depth example detailing a one-to-many UDTF, see Python example: explode.
See also
5.9.6 - C++ example: string tokenizer
The following example shows a subclass of TransformFunction named StringTokenizer.
The following example shows a subclass of TransformFunction
named StringTokenizer
. It defines a UDTF that reads a table containing an INTEGER ID column and a VARCHAR column. It breaks the text in the VARCHAR column into tokens (individual words). It returns a table containing each token, the row it occurred in, and its position within the string.
Loading and using the example
The following example shows how to load the function into Vertica. It assumes that the TransformFunctions.so
library that contains the function has been copied to the dbadmin user's home directory on the initiator node.
=> CREATE LIBRARY TransformFunctions AS
'/home/dbadmin/TransformFunctions.so';
CREATE LIBRARY
=> CREATE TRANSFORM FUNCTION tokenize
AS LANGUAGE 'C++' NAME 'TokenFactory' LIBRARY TransformFunctions;
CREATE TRANSFORM FUNCTION
You can then use it from SQL statements, for example:
=> CREATE TABLE T (url varchar(30), description varchar(2000));
CREATE TABLE
=> INSERT INTO T VALUES ('www.amazon.com','Online retail merchant and provider of cloud services');
OUTPUT
--------
1
(1 row)
=> INSERT INTO T VALUES ('www.vertica.com','World''s fastest analytic database');
OUTPUT
--------
1
(1 row)
=> COMMIT;
COMMIT
=> -- Invoke the UDTF
=> SELECT url, tokenize(description) OVER (partition by url) FROM T;
url | words
-----------------+-----------
www.amazon.com | Online
www.amazon.com | retail
www.amazon.com | merchant
www.amazon.com | and
www.amazon.com | provider
www.amazon.com | of
www.amazon.com | cloud
www.amazon.com | services
www.vertica.com | World's
www.vertica.com | fastest
www.vertica.com | analytic
www.vertica.com | database
(12 rows)
Notice that the number of rows and columns in the result table differ from those in the input table. This is one of the strengths of a UDTF.
The following code shows the StringTokenizer
class.
class StringTokenizer : public TransformFunction
{
virtual void processPartition(ServerInterface &srvInterface,
PartitionReader &inputReader,
PartitionWriter &outputWriter)
{
try {
if (inputReader.getNumCols() != 1)
vt_report_error(0, "Function only accepts 1 argument, but %zu provided", inputReader.getNumCols());
do {
const VString &sentence = inputReader.getStringRef(0);
// If input string is NULL, then output is NULL as well
if (sentence.isNull())
{
VString &word = outputWriter.getStringRef(0);
word.setNull();
outputWriter.next();
}
else
{
// Otherwise, let's tokenize the string and output the words
std::string tmp = sentence.str();
std::istringstream ss(tmp);
do
{
std::string buffer;
ss >> buffer;
// Copy to output
if (!buffer.empty()) {
VString &word = outputWriter.getStringRef(0);
word.copy(buffer);
outputWriter.next();
}
} while (ss);
}
} while (inputReader.next() && !isCanceled());
} catch(std::exception& e) {
// Standard exception. Quit.
vt_report_error(0, "Exception while processing partition: [%s]", e.what());
}
}
};
The processPartition()
function in this example follows a pattern that you will follow in your own UDTF: it loops over all rows in the table partition that Vertica sends it, processing each row and checking for cancellation before advancing. You do not have to process every row: you can exit the function before reading all of the input without any issues. You might choose to do this if your UDTF performs a search or some other operation where it can determine that the rest of the input is unneeded.
In this example, processPartition()
first extracts the VString
containing the text from the PartitionReader
object. The VString
class represents a Vertica string value (VARCHAR or CHAR). If there is input, it then tokenizes it and adds it to the output using the PartitionWriter
object.
Similarly to reading input columns, the PartitionWriter
class has functions for writing each type of data to the output row. In this case, the example calls the PartitionWriter
object's getStringRef()
function to allocate a new VString
object to hold the token to output for the first column, and then copies the token's value into the VString
.
The following code shows the factory class.
class TokenFactory : public TransformFunctionFactory
{
// Tell Vertica that we take in a row with 1 string, and return a row with 1 string
virtual void getPrototype(ServerInterface &srvInterface, ColumnTypes &argTypes, ColumnTypes &returnType)
{
argTypes.addVarchar();
returnType.addVarchar();
}
// Tell Vertica what our return string length will be, given the input
// string length
virtual void getReturnType(ServerInterface &srvInterface,
const SizedColumnTypes &inputTypes,
SizedColumnTypes &outputTypes)
{
// Error out if we're called with anything but 1 argument
if (inputTypes.getColumnCount() != 1)
vt_report_error(0, "Function only accepts 1 argument, but %zu provided", inputTypes.getColumnCount());
int input_len = inputTypes.getColumnType(0).getStringLength();
// Our output size will never be more than the input size
outputTypes.addVarchar(input_len, "words");
}
virtual TransformFunction *createTransformFunction(ServerInterface &srvInterface)
{ return vt_createFuncObject<StringTokenizer>(srvInterface.allocator); }
};
In this example:
-
The UDTF takes a VARCHAR column as input. To define the input column, getPrototype()
calls addVarchar()
on the ColumnTypes
object that represents the input table.
-
The UDTF returns a VARCHAR as output. The getPrototype()
function calls addVarchar()
to define the output table.
This example must return the maximum length of the VARCHAR output column. It sets the length to the length of the input string. This is a safe value, because the output will never be longer than the input string. It also sets the name of the VARCHAR output column to "words".
Note
You are not required to supply a name for an output column in this function. However, it is a best practice to do so. If you do not name an output column, getReturnType()
sets the column name to "". The SQL statements that call your UDTF must provide aliases for any unnamed columns to access them or else they return an error. From a usability standpoint, it is easier for you to supply the column names here once. The alternative is to force all of the users of your function to supply their own column names for each call to the UDTF.
The implementation of the createTransformFunction()
function in the example is boilerplate code. It just calls the vt_createFuncObject
template function with the name of the TransformFunction
class associated with this factory class. This function takes care of instantiating a copy of the TransformFunction
class that Vertica can use to process data.
The RegisterFactory macro
The final step in creating your UDTF is to call the RegisterFactory
macro. This macro ensures that your factory class is instantiated when Vertica loads the shared library containing your UDTF. Having your factory class instantiated is the only way that Vertica can find your UDTF and determine what its inputs and outputs are.
The RegisterFactory
macro just takes the name of your factory class:
RegisterFactory(TokenFactory);
5.9.7 - Python example: string tokenizer
The following example shows a transform function that breaks an input string into tokens (based on whitespace).
The following example shows a transform function that breaks an input string into tokens (based on whitespace). It is similar to the tokenizer examples for C++ and Java.
Loading and using the example
Create the library and function:
=> CREATE LIBRARY pyudtf AS '/home/dbadmin/udx/tokenize.py' LANGUAGE 'Python';
CREATE LIBRARY
=> CREATE TRANSFORM FUNCTION tokenize AS NAME 'StringTokenizerFactory' LIBRARY pyudtf;
CREATE TRANSFORM FUNCTION
You can then use the function in SQL statements, for example:
=> CREATE TABLE words (w VARCHAR);
CREATE TABLE
=> COPY words FROM STDIN;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> this is a test of the python udtf
>> \.
=> SELECT tokenize(w) OVER () FROM words;
token
----------
this
is
a
test
of
the
python
udtf
(8 rows)
Setup
All Python UDxs must import the Vertica SDK:
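import vertica_sdk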
UDTF Python code
The following code defines the tokenizer and its factory.
class StringTokenizer(vertica_sdk.TransformFunction):
"""
Transform function which tokenizes its inputs.
For each input string, each of the whitespace-separated tokens of that
string is produced as output.
"""
def processPartition(self, server_interface, input, output):
while True:
for token in input.getString(0).split():
output.setString(0, token)
output.next()
if not input.next():
break
class StringTokenizerFactory(vertica_sdk.TransformFunctionFactory):
def getPrototype(self, server_interface, arg_types, return_type):
arg_types.addVarchar()
return_type.addVarchar()
def getReturnType(self, server_interface, arg_types, return_type):
return_type.addColumn(arg_types.getColumnType(0), "tokens")
def createTransformFunction(cls, server_interface):
return StringTokenizer()
5.9.8 - R example: log tokenizer
The LogTokenizer transform function reads log messages (varchar values) from a table.
The LogTokenizer
transform function reads log messages (varchar values) from a table. It then tokenizes each log message, returning the individual tokens.
You can find more UDx examples in the Vertica GitHub repository, https://github.com/vertica/UDx-Examples.
Load the function and library
Create the library and the function.
=> CREATE OR REPLACE LIBRARY rLib AS 'log_tokenizer.R' LANGUAGE 'R';
CREATE LIBRARY
=> CREATE OR REPLACE TRANSFORM FUNCTION LogTokenizer AS LANGUAGE 'R' NAME 'LogTokenizerFactory' LIBRARY rLib FENCED;
CREATE FUNCTION
Querying data with the function
The following query shows how you can run a query with the UDTF.
=> SELECT machine,
LogTokenizer(error_log USING PARAMETERS spliton = ' ') OVER(PARTITION BY machine)
FROM error_logs;
machine | Token
---------+---------
node001 | ERROR
node001 | 345
node001 | -
node001 | Broken
node001 | pipe
node001 | WARN
node001 | -
node001 | Nearly
node001 | filled
node001 | disk
node002 | ERROR
node002 | 111
node002 | -
node002 | Flooded
node002 | roads
node003 | ERROR
node003 | 222
node003 | -
node003 | Plain
node003 | old
node003 | broken
(21 rows)
UDTF R code
LogTokenizer <- function(input.data.frame, parameters.data.frame) {
# Take the spliton parameter passed by the user and assign it to a variable
# in the function so we can use that as our tokenizer.
if ( is.null(parameters.data.frame[['spliton']]) ) {
stop("NULL value for spliton! Token cannot be NULL.")
} else {
split.on <- as.character(parameters.data.frame[['spliton']])
}
# Tokenize the string.
tokens <- vector(length=0)
for ( string in input.data.frame[, 1] ) {
tokenized.string <- strsplit(string, split.on)
for ( token in tokenized.string ) {
tokens <- append(tokens, token)
}
}
final.output <- data.frame(tokens)
return(final.output)
}
LogTokenizerFactory <- function() {
list(name = LogTokenizer,
udxtype = c("transform"),
intype = c("varchar"),
outtype = c("varchar"),
outtypecallback=LogTokenizerReturn,
parametertypecallback=LogTokenizerParameters)
}
LogTokenizerParameters <- function() {
parameters <- list(datatype = c("varchar"),
length = c("NA"),
scale = c("NA"),
name = c("spliton"))
return(parameters)
}
LogTokenizerReturn <- function(arg.data.frame, parm.data.frame) {
output.return.type <- data.frame(datatype = rep(NA,1),
length = rep(NA,1),
scale = rep(NA,1),
name = rep(NA,1))
output.return.type$datatype <- c("varchar")
output.return.type$name <- c("Token")
return(output.return.type)
}
5.9.9 - C++ example: multi-phase indexer
The following code fragment is from the InvertedIndex UDTF example distributed with the Vertica SDK.
The following code fragment is from the InvertedIndex UDTF example distributed with the Vertica SDK. It demonstrates subclassing the MultiPhaseTransformFunctionFactory
including two TransformFunctionPhase
subclasses that define the two phases in this UDTF.
class InvertedIndexFactory : public MultiPhaseTransformFunctionFactory
{
public:
/**
* Extracts terms from documents.
*/
class ForwardIndexPhase : public TransformFunctionPhase
{
virtual void getReturnType(ServerInterface &srvInterface,
const SizedColumnTypes &inputTypes,
SizedColumnTypes &outputTypes)
{
// Sanity checks on input we've been given.
// Expected input: (doc_id INTEGER, text VARCHAR)
vector<size_t> argCols;
inputTypes.getArgumentColumns(argCols);
if (argCols.size() < 2 ||
!inputTypes.getColumnType(argCols.at(0)).isInt() ||
!inputTypes.getColumnType(argCols.at(1)).isVarchar())
vt_report_error(0, "Function only accepts two arguments "
"(INTEGER, VARCHAR)");
// Output of this phase is:
// (term_freq INTEGER) OVER(PBY term VARCHAR OBY doc_id INTEGER)
// Number of times term appears within a document.
outputTypes.addInt("term_freq");
// Add analytic clause columns: (PARTITION BY term ORDER BY doc_id).
// The length of any term is at most the size of the entire document.
outputTypes.addVarcharPartitionColumn(
inputTypes.getColumnType(argCols.at(1)).getStringLength(),
"term");
// Add order column on the basis of the document id's data type.
outputTypes.addOrderColumn(inputTypes.getColumnType(argCols.at(0)),
"doc_id");
}
virtual TransformFunction *createTransformFunction(ServerInterface
&srvInterface)
{ return vt_createFuncObj(srvInterface.allocator, ForwardIndexBuilder); }
};
/**
* Constructs terms' posting lists.
*/
class InvertedIndexPhase : public TransformFunctionPhase
{
virtual void getReturnType(ServerInterface &srvInterface,
const SizedColumnTypes &inputTypes,
SizedColumnTypes &outputTypes)
{
// Sanity checks on input we've been given.
// Expected input:
// (term_freq INTEGER) OVER(PBY term VARCHAR OBY doc_id INTEGER)
vector<size_t> argCols;
inputTypes.getArgumentColumns(argCols);
vector<size_t> pByCols;
inputTypes.getPartitionByColumns(pByCols);
vector<size_t> oByCols;
inputTypes.getOrderByColumns(oByCols);
if (argCols.size() != 1 || pByCols.size() != 1 || oByCols.size() != 1 ||
!inputTypes.getColumnType(argCols.at(0)).isInt() ||
!inputTypes.getColumnType(pByCols.at(0)).isVarchar() ||
!inputTypes.getColumnType(oByCols.at(0)).isInt())
vt_report_error(0, "Function expects an argument (INTEGER) with "
"analytic clause OVER(PBY VARCHAR OBY INTEGER)");
// Output of this phase is:
// (term VARCHAR, doc_id INTEGER, term_freq INTEGER, corp_freq INTEGER).
outputTypes.addVarchar(inputTypes.getColumnType(
pByCols.at(0)).getStringLength(),"term");
outputTypes.addInt("doc_id");
// Number of times term appears within the document.
outputTypes.addInt("term_freq");
// Number of documents where the term appears in.
outputTypes.addInt("corp_freq");
}
virtual TransformFunction *createTransformFunction(ServerInterface
&srvInterface)
{ return vt_createFuncObj(srvInterface.allocator, InvertedIndexBuilder); }
};
ForwardIndexPhase fwardIdxPh;
InvertedIndexPhase invIdxPh;
virtual void getPhases(ServerInterface &srvInterface,
std::vector<TransformFunctionPhase *> &phases)
{
fwardIdxPh.setPrepass(); // Process documents wherever they're originally stored.
phases.push_back(&fwardIdxPh);
phases.push_back(&invIdxPh);
}
virtual void getPrototype(ServerInterface &srvInterface,
ColumnTypes &argTypes,
ColumnTypes &returnType)
{
// Expected input: (doc_id INTEGER, text VARCHAR).
argTypes.addInt();
argTypes.addVarchar();
// Output is: (term VARCHAR, doc_id INTEGER, term_freq INTEGER, corp_freq INTEGER)
returnType.addVarchar();
returnType.addInt();
returnType.addInt();
returnType.addInt();
}
};
RegisterFactory(InvertedIndexFactory);
Most of the code in this example is similar to the code in a TransformFunctionFactory class:
-
Both TransformFunctionPhase subclasses implement the getReturnType() function, which describes the output of each stage. This is similar to the getReturnType() function of the TransformFunctionFactory class. However, this function also lets you control how the data is partitioned and ordered between the phases of your multi-phase UDTF.
The first phase calls SizedColumnTypes::addVarcharPartitionColumn() (rather than just addVarcharColumn()) to set the phase's output table to be partitioned by the column containing the extracted words. It also calls SizedColumnTypes::addOrderColumn() to order the output table by the document ID column. It calls this function instead of one of the data-type-specific functions (such as addIntOrderColumn()) so it can pass the data type of the original column through to the output column.
Note
Any PARTITION BY or ORDER BY columns set by the final phase of the UDTF in its getReturnType() function are ignored. The final phase's output is returned to the initiator node rather than being partitioned, reordered, and sent on to another phase.
-
The MultiPhaseTransformFunctionFactory class implements the getPrototype() function, which defines the schemas for the input and output of the multi-phase UDTF. This function is the same as the TransformFunctionFactory::getPrototype() function.
The unique function implemented by the MultiPhaseTransformFunctionFactory class is getPhases(). This function defines the order in which the phases are executed. The fields that represent the phases are pushed into the phases vector in the order in which they should execute.
The MultiPhaseTransformFunctionFactory::getPhases() function is also where you flag the first phase of the UDTF as operating on data stored locally on the node (called a "pre-pass" phase) rather than on data partitioned across all nodes. Using this option increases the efficiency of your multi-phase UDTF by avoiding moving significant amounts of data around the Vertica cluster.
Note
Only the first phase of your UDTF can be a pre-pass phase. You cannot have multiple pre-pass phases, and no later phase can be a pre-pass phase.
To mark the first phase as pre-pass, call the TransformFunctionPhase::setPrepass() function of the first phase's TransformFunctionPhase instance from within the getPhases() function.
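The Python API uses the same mechanism: you call setPrepass() on the first phase's object from within getPhases(). The following is a minimal sketch, assuming the vertica_sdk multi-phase factory API used in the Python example later in this document; the two stub phase classes here are illustrative only.
import vertica_sdk

class MyFactory(vertica_sdk.MultiPhaseTransformFunctionFactory):
    # Illustrative stubs; a real factory implements getReturnType() and
    # createTransformFunction() in each TransformFunctionPhase subclass.
    class LocalPhase(vertica_sdk.TransformFunctionPhase):
        pass

    class GlobalPhase(vertica_sdk.TransformFunctionPhase):
        pass

    def getPhases(self, server_interface):
        self.local_phase = MyFactory.LocalPhase()
        self.local_phase.setPrepass()  # mark the first phase as pre-pass
        # Phases execute in list order: local pre-pass, then global aggregation.
        return [self.local_phase, MyFactory.GlobalPhase()]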
Notes
-
You need to ensure that the output schema of each phase matches the input schema expected by the next phase. In the example code, each TransformFunctionPhase::getReturnType() implementation performs a sanity check on its input and output schemas. Your TransformFunction subclasses can also perform these checks in their processPartition() function.
-
There is no built-in limit on the number of phases that your multi-phase UDTF can have. However, more phases use more resources. When running in fenced mode, Vertica may terminate UDTFs that use too much memory. See Resource use for C++ UDxs.
5.9.10 - Python example: multi-phase calculation
The following example shows a multi-phase transform function that computes the average value of a column of numbers in an input table.
The following example shows a multi-phase transform function that computes the average value of a column of numbers in an input table. It first defines two transform functions, and then defines a factory that creates the phases using them.
See AvgMultiPhaseUDT.py in the examples distribution for the complete code.
Loading and using the example
Create the library and function:
=> CREATE LIBRARY pylib_avg AS '/home/dbadmin/udx/AvgMultiPhaseUDT.py' LANGUAGE 'Python';
CREATE LIBRARY
=> CREATE TRANSFORM FUNCTION myAvg AS NAME 'MyAvgFactory' LIBRARY pylib_avg;
CREATE TRANSFORM FUNCTION
You can then use the function in SELECT statements. Because myAvg is a transform function, the call must include an OVER clause:
=> CREATE TABLE IF NOT EXISTS numbers(num FLOAT);
CREATE TABLE
=> COPY numbers FROM STDIN delimiter ',';
1
2
3
4
\.
=> SELECT myAvg(num) OVER() FROM numbers;
average | ignored_rows | total_rows
---------+--------------+------------
2.5 | 0 | 4
(1 row)
Setup
All Python UDxs must import the Vertica SDK. This example also imports another library.
import vertica_sdk
import math
A multi-phase transform function must define two or more TransformFunction subclasses to be used in the phases. This example uses two classes: LocalCalculation, which does calculations on local partitions, and GlobalCalculation, which aggregates the results of all LocalCalculation instances to calculate the final result.
In both functions, the calculation is done in the processPartition() function:
class LocalCalculation(vertica_sdk.TransformFunction):
    """
    This class is the first phase and calculates the local values for sum, ignored_rows, and total_rows.
    """

    def setup(self, server_interface, col_types):
        server_interface.log("Setup: Phase0")
        self.local_sum = 0.0
        self.ignored_rows = 0
        self.total_rows = 0

    def processPartition(self, server_interface, input, output):
        server_interface.log("Process Partition: Phase0")

        while True:
            self.total_rows += 1

            if input.isNull(0) or math.isinf(input.getFloat(0)) or math.isnan(input.getFloat(0)):
                # NULL, Inf, and NaN values are counted but otherwise ignored.
                self.ignored_rows += 1
            else:
                self.local_sum += input.getFloat(0)

            if not input.next():
                break

        # Emit one row of local results: (local_sum FLOAT, ignored_rows INT, total_rows INT).
        output.setFloat(0, self.local_sum)
        output.setInt(1, self.ignored_rows)
        output.setInt(2, self.total_rows)
        output.next()
class GlobalCalculation(vertica_sdk.TransformFunction):
    """
    This class is the second phase and aggregates the values for sum, ignored_rows, and total_rows.
    """

    def setup(self, server_interface, col_types):
        server_interface.log("Setup: Phase1")
        self.global_sum = 0.0
        self.ignored_rows = 0
        self.total_rows = 0

    def processPartition(self, server_interface, input, output):
        server_interface.log("Process Partition: Phase1")

        # Sum the partial results produced by the LocalCalculation instances.
        while True:
            self.global_sum += input.getFloat(0)
            self.ignored_rows += input.getInt(1)
            self.total_rows += input.getInt(2)
            if not input.next():
                break

        # The average is computed over valid (non-ignored) rows only.
        average = self.global_sum / (self.total_rows - self.ignored_rows)

        output.setFloat(0, average)
        output.setInt(1, self.ignored_rows)
        output.setInt(2, self.total_rows)
        output.next()
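As a quick sanity check of the arithmetic, you can reproduce the result from the sample session above outside Vertica. This standalone snippet (not part of the shipped example) applies the same final-phase formula to the four values loaded earlier:
# Standalone check of the final-phase formula, using the sample data above.
values = [1.0, 2.0, 3.0, 4.0]   # rows loaded into the numbers table
global_sum = sum(values)        # 10.0
total_rows = len(values)        # 4
ignored_rows = 0                # the sample contains no NULL, Inf, or NaN values
average = global_sum / (total_rows - ignored_rows)
print(average)                  # 2.5, matching the SELECT myAvg(num) OVER() output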
Multi-phase factory
A MultiPhaseTransformFunctionFactory ties together the individual functions as phases. The factory defines a TransformFunctionPhase for each function. Each phase defines createTransformFunction(), which calls the constructor for the corresponding TransformFunction, and getReturnType().
The first phase, LocalPhase, follows.
class MyAvgFactory(vertica_sdk.MultiPhaseTransformFunctionFactory):
    """ Factory class """

    class LocalPhase(vertica_sdk.TransformFunctionPhase):
        """ Phase 1 """
        def getReturnType(self, server_interface, input_types, output_types):
            # Sanity check: exactly one FLOAT argument.
            number_of_cols = input_types.getColumnCount()
            if (number_of_cols != 1 or not input_types.getColumnType(0).isFloat()):
                raise ValueError("Function only accepts one argument (FLOAT)")
            output_type