Developing user-defined extensions (UDxs)
User-defined extensions (UDxs) are functions contained in external libraries that are developed in C++, Python, Java, or R using the Vertica SDK. The external libraries are defined in the Vertica catalog using the CREATE LIBRARY statement. UDxs are best suited for analytic operations that are difficult to perform in SQL, or that need to be performed frequently enough that their speed is a major concern.
The primary strengths of UDxs are:
-
They can be used anywhere an internal function can be used.
-
They take full advantage of Vertica's distributed computing features. The extensions usually execute in parallel on each node in the cluster.
-
They are distributed to all nodes by Vertica. You only need to copy the library to the initiator node.
-
All of the complicated aspects of developing a distributed piece of analytic code are handled for you by Vertica. Your main programming task is to read in data, process it, and then write it out using the Vertica SDK APIs.
There are a few things to keep in mind about developing UDxs:
-
UDxs can be developed in the programming languages C++, Python, Java, and R. (Not all UDx types support all languages.)
-
UDxs written in Java always run in fenced mode, because the Java Virtual Machine that executes Java programs cannot run directly within the Vertica process.
-
UDxs written in Python and R always run in fenced mode.
-
UDxs developed in C++ have the option of running in unfenced mode, which means they load and run directly in the Vertica database process. This option provides the lowest overhead and highest speed. However, any bugs in the UDx's code can cause database instability. You must thoroughly test any UDxs you intend to run in unfenced mode before deploying them in a live environment. Consider whether the performance boost of running a C++ UDx unfenced is worth the potential database instability that a buggy UDx can cause.
-
Because a UDx runs on the Vertica cluster, it can take processor time and memory away from the database processes. A UDx that consumes large amounts of computing resources can negatively impact database performance.
Types of UDxs
Vertica supports five types of user-defined extensions:
-
User-defined scalar functions (UDSFs) take in a single row of data and return a single value. These functions can be used anywhere a native function can be used, except CREATE TABLE BY PARTITION and SEGMENTED BY expressions. UDSFs can be developed in C++, Python, Java, and R.
-
User-defined aggregate functions (UDAFs) allow you to create custom aggregate functions specific to your needs. They read one column of data and return one output column. UDAFs can be developed in C++.
-
User-defined analytic functions (UDAnFs) are similar to UDSFs, in that they read a row of data and return a single row. However, the function can read input rows independently of outputting rows, so that the output values can be calculated over several input rows. The function can be used with the query's OVER() clause to partition rows. UDAnFs can be developed in C++ and Java.
-
User-defined transform functions (UDTFs) operate on table partitions (as specified by the query's OVER() clause) and return zero or more rows of data. The data they return can be an entirely new table, unrelated to the schema of the input table, with its own ordering and segmentation expressions. They can only be used in the SELECT list of a query. UDTFs can be developed in C++, Python, Java, and R.
To optimize query performance, you can use live aggregate projections to pre-aggregate the data that a UDTF returns. For more information, see Pre-aggregating UDTF results.
-
User-defined load (UDL) functions allow you to create custom sources, filters, and parsers to load data. These extensions can be used in COPY statements. UDLs can be developed in C++, Java, and Python.
While each UDx type has a unique base class, developing them is similar in many ways. Different UDx types can also share the same library.
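The input and output shapes of the scalar, aggregate, and transform types described above can be contrasted with a short plain-Python sketch. It does not use the Vertica SDK: the function names (scalar_add, aggregate_sum, transform_tokenize) and the representation of rows as tuples are hypothetical stand-ins chosen only for illustration.

```python
from typing import Iterable, Iterator, List, Tuple

# Hypothetical stand-ins, NOT Vertica SDK calls: a row is a plain tuple and
# a partition is a list of rows.


def scalar_add(row: Tuple[int, int]) -> int:
    """Scalar shape: one input row in, exactly one value out."""
    a, b = row
    return a + b


def aggregate_sum(column: Iterable[int]) -> int:
    """Aggregate shape: one input column in, one output value per group."""
    total = 0
    for value in column:
        total += value
    return total


def transform_tokenize(partition: List[Tuple[str]]) -> Iterator[Tuple[str, int]]:
    """Transform shape: one partition in, zero or more rows out.

    The output schema (word, length) is unrelated to the input schema (text).
    """
    for (text,) in partition:
        for word in text.split():
            yield (word, len(word))


rows = [(1, 2), (3, 4)]
scalar_out = [scalar_add(r) for r in rows]         # one value per input row
aggregate_out = aggregate_sum(a for a, _ in rows)  # one value for the group
# The second input row is empty, so it contributes zero output rows.
transform_out = list(transform_tokenize([("load data",), ("",)]))
```

In this sketch the scalar function is applied once per row, the aggregate consumes a whole column, and the transform emits a different number of rows than it reads.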
Structure
Each UDx type consists of two primary classes. The main class does the actual work (a transformation, an aggregation, and so on). The class usually has at least three methods: one to set up, one to tear down (release reserved resources), and one to do the work. Sometimes additional methods are defined.
The main processing method receives an instance of the ServerInterface class as an argument. This object is used by the underlying Vertica SDK code to make calls back into the Vertica process, for example to allocate memory. You can use this class to write to the server log during UDx execution.
The second class is a singleton factory. It defines one method that produces instances of the first class, and might define other methods to manage parameters.
When implementing a UDx you must subclass both classes.
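The two-class structure described above can be sketched in plain Python. This is a minimal illustration, not the Vertica SDK: StubServerInterface, AddTwoInts, AddTwoIntsFactory, and all of their method names and signatures are assumptions made for this example. A real UDx would instead subclass the SDK's base classes, and the singleton behavior of the factory is not modeled here.

```python
from typing import Iterable, List, Tuple

# Plain-Python stand-ins for illustration only. None of these classes come
# from the Vertica SDK; their names and signatures are assumptions.


class StubServerInterface:
    """Stands in for the ServerInterface object passed to the UDx."""

    def __init__(self) -> None:
        self.messages: List[str] = []

    def log(self, message: str) -> None:
        # A real server interface would write to the server log;
        # this stub just collects the messages in memory.
        self.messages.append(message)


class AddTwoInts:
    """Main class: does the actual work."""

    def setup(self, server: StubServerInterface) -> None:
        # Set up: initialize state needed while processing.
        self.rows_seen = 0

    def process(self, server: StubServerInterface,
                rows: Iterable[Tuple[int, int]]) -> List[int]:
        # Do the work: read rows, process them, write results out.
        server.log("adding two ints")
        results = []
        for a, b in rows:
            results.append(a + b)
            self.rows_seen += 1
        return results

    def destroy(self, server: StubServerInterface) -> None:
        # Tear down: release anything reserved in setup().
        server.log(f"processed {self.rows_seen} rows")
        self.rows_seen = 0


class AddTwoIntsFactory:
    """Factory class: produces instances of the main class."""

    def create(self, server: StubServerInterface) -> AddTwoInts:
        return AddTwoInts()


# Driver that plays the role of the database calling into the UDx.
server = StubServerInterface()
factory = AddTwoIntsFactory()
func = factory.create(server)
func.setup(server)
output = func.process(server, [(1, 2), (10, 20)])
func.destroy(server)
```

The driver at the bottom mirrors the lifecycle in the text: the factory creates the function object, which is set up, asked to process its input, and then torn down, with the server interface object passed to each call.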
Conventions
The C++, Python, and Java APIs are nearly identical. Where possible, this documentation describes these interfaces without respect to language. Documentation specific to C++, Python, or Java is covered in language-specific sections.
Because some documentation is language-independent, it is not always possible to use ideal, language-based terminology. This documentation uses the term "method" to refer to a Java or Python method or a C++ member function.