KMEANS

Executes the k-means algorithm on an input relation.

Executes the k-means algorithm on an input relation. The result is a model with a list of cluster centers.

You can export the resulting k-means model in VERTICA_MODELS or PMML format to apply it on data outside Vertica. You can also train a k-means model elsewhere, then import it to Vertica in PMML format to predict on data in Vertica.

This is a meta-function. You must call meta-functions in a top-level SELECT statement.

Behavior type

Volatile

Syntax

KMEANS ( 'model-name', 'input-relation', 'input-columns', 'num-clusters'
        [ USING PARAMETERS
           [exclude_columns = 'excluded-columns']
           [, max_iterations = max-iterations]
           [, epsilon = epsilon-value]
           [, { init_method = 'init-method' } | { initial_centers_table = 'init-table' } ]
           [, output_view = 'output-view']
           [, key_columns = 'key-columns'] ] )

Arguments

model-name
Identifies the model to create, where model-name conforms to conventions described in Identifiers. It must also be unique among all names of sequences, tables, projections, views, and models within the same schema.
input-relation
The table or view that contains the input data for k-means. If the input relation is defined in Hive, use SYNC_WITH_HCATALOG_SCHEMA to sync the hcatalog schema, and then run the machine learning function.
input-columns
Comma-separated list of columns to use from the input relation, or asterisk (*) to select all columns. Input columns must be of data type numeric.
num-clusters
The number of clusters to create, an integer ≤ 10,000. This argument represents the k in k-means.

Parameters

exclude_columns

Comma-separated list of column names from input-columns to exclude from processing.

max_iterations
The maximum number of iterations the algorithm performs. If you set this value to a number lower than the number of iterations needed for convergence, the algorithm may not converge.

Default: 10

epsilon
Determines whether the algorithm has converged. The algorithm is considered converged after no center has moved more than a distance of
'epsilon' from the previous iteration.

Default: 1e-4

init_method
The method used to find the initial cluster centers, one of the following:
  • random

  • kmeanspp (default): kmeans++ algorithm

    This value can be memory intensive for high k. If the function returns an error that not enough memory is available, decrease the value of k or use the random method.

initial_centers_table
The table with the initial cluster centers to use. Supply this value if you know the initial centers to use and do not want Vertica to find the initial cluster centers for you.
output_view
The name of the view where you save the assignments of each point to its cluster. You must have CREATE privileges on the schema where the view is saved.
key_columns
Comma-separated list of column names from input-columns that will appear as the columns of output_view. These columns should be picked such that their contents identify each input data point. This parameter is only used if output_view is specified. Columns listed in input-columns that are only meant to be used as key_columns and not for training should be listed in exclude_columns.

Model attributes

centers
A list that contains the center of each cluster.
metrics
A string summary of several metrics related to the quality of the clustering.

Examples

The following example creates k-means model myKmeansModel and applies it to input table iris1. The call to APPLY_KMEANS mixes column names and constants. When a constant is passed in place of a column name, the constant is substituted for the value of the column in all rows:

=> SELECT KMEANS('myKmeansModel', 'iris1', '*', 5
USING PARAMETERS max_iterations=20, output_view='myKmeansView', key_columns='id', exclude_columns='Species, id');
           KMEANS
----------------------------
 Finished in 12 iterations

(1 row)
=> SELECT id, APPLY_KMEANS(Sepal_Length, 2.2, 1.3, Petal_Width
USING PARAMETERS model_name='myKmeansModel', match_by_pos='true') FROM iris2;
 id  | APPLY_KMEANS
-----+--------------
   5 |            1
  10 |            1
  14 |            1
  15 |            1
  21 |            1
  22 |            1
  24 |            1
  25 |            1
  32 |            1
  33 |            1
  34 |            1
  35 |            1
  38 |            1
  39 |            1
  42 |            1
...
 (60 rows)

See also