R example: kmeans

The KMeans_User scalar function reads any number of columns from a table, the observations.

The KMeans_User scalar function reads any number of columns from a table, the observations. It then uses the observations and the two parameters when applying the kmeans clustering algorithm to the data, returning an integer value associated with the cluster of the row.

You can find more UDx examples in the Vertica Github repository, https://github.com/vertica/UDx-Examples.

Load the function and library

Create the library and the function:

=> CREATE OR REPLACE LIBRARY rLib AS 'kmeans.R' LANGUAGE 'R';
CREATE LIBRARY
=> CREATE OR REPLACE FUNCTION KMeans_User AS LANGUAGE 'R' NAME 'KMeans_UserFactory' LIBRARY rLib FENCED;
CREATE FUNCTION

Querying data with the function

The following query shows how you can run a query with the UDSF:

=> SELECT spec,
          KMeans_User(sl, sw, pl, pw USING PARAMETERS clusters = 3, nstart = 20)
    FROM iris;
      spec       | KMeans_User
-----------------+-------------
 Iris-setosa     |           2
 Iris-setosa     |           2
 Iris-setosa     |           2
 Iris-setosa     |           2
 Iris-setosa     |           2
 Iris-setosa     |           2
 Iris-setosa     |           2
 Iris-setosa     |           2
 Iris-setosa     |           2
 Iris-setosa     |           2
 Iris-setosa     |           2
.
.
.
(150 rows)

UDSF R code


KMeans_User <- function(input.data.frame, parameters.data.frame) {
  # Take the clusters and nstart parameters passed by the user and assign them
  # to variables in the function.
  if ( is.null(parameters.data.frame[['clusters']]) ) {
    stop("NULL value for clusters! clusters cannot be NULL.")
  } else {
    clusters.value <- parameters.data.frame[['clusters']]
  }
  if ( is.null(parameters.data.frame[['nstart']]) ) {
    stop("NULL value for nstart! nstart cannot be NULL.")
  } else {
    nstart.value <- parameters.data.frame[['nstart']]
  }
  # Apply the algorithm to the data.
  kmeans.clusters <- kmeans(input.data.frame[, 1:length(input.data.frame)],
                           clusters.value, nstart = nstart.value)
  final.output <- data.frame(kmeans.clusters$cluster)
  return(final.output)
}

KMeans_UserFactory <- function() {
  list(name    = KMeans_User,
       udxtype = c("scalar"),
       # Since this is a polymorphic function the intype must be any
       intype  = c("any"),
       outtype = c("int"),
       parametertypecallback=KMeansParameters)
}

KMeansParameters <- function() {
  parameters <- list(datatype = c("int", "int"),
                     length   = c("NA", "NA"),
                     scale    = c("NA", "NA"),
                     name     = c("clusters", "nstart"))
  return(parameters)
}