KMEANS
Executes the k-means algorithm on an input relation. The result is a model with a list of cluster centers.
You can export the resulting k-means model in VERTICA_MODELS or PMML format to apply it on data outside Vertica. You can also train a k-means model elsewhere, then import it to Vertica in PMML format to predict on data in Vertica.
This is a meta-function. You must call meta-functions in a top-level SELECT statement.
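The export and import round trip described above might look like the following sketch, which uses the EXPORT_MODELS and IMPORT_MODELS meta-functions. The directory /tmp/kmeans_models, the model name myKmeansModel, and the schema other_schema are placeholders chosen for illustration, and the per-model subdirectory used in the import path is an assumption about the export layout; verify the paths on your own system.

```sql
-- Export the trained model in PMML format.
-- '/tmp/kmeans_models' and 'myKmeansModel' are illustrative placeholders.
SELECT EXPORT_MODELS('/tmp/kmeans_models', 'myKmeansModel'
                     USING PARAMETERS category='PMML');

-- Import a PMML model (for example, one trained outside Vertica).
-- Assumption: the model lives in its own subdirectory under the export path.
-- 'other_schema' is a hypothetical schema that must already exist.
SELECT IMPORT_MODELS('/tmp/kmeans_models/myKmeansModel'
                     USING PARAMETERS new_schema='other_schema', category='PMML');
```

Importing into a different schema avoids a name collision with the original model, because model names must be unique within a schema.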
Behavior type
Volatile
Syntax
KMEANS ( 'modelname', 'inputrelation', 'inputcolumns', 'numclusters'
[ USING PARAMETERS
[exclude_columns = 'excludedcolumns']
[, max_iterations = maxiterations]
[, epsilon = epsilonvalue]
[, { init_method = 'initmethod' } | { initial_centers_table = 'inittable' } ]
[, output_view = 'outputview']
[, key_columns = 'keycolumns'] ] )
Arguments
modelname
 Identifies the model to create, where modelname conforms to conventions described in Identifiers. It must also be unique among all names of sequences, tables, projections, views, and models within the same schema.
inputrelation
 The table or view that contains the input data for k-means. If the input relation is defined in Hive, use SYNC_WITH_HCATALOG_SCHEMA to sync the hcatalog schema, and then run the machine learning function.
inputcolumns
 Comma-separated list of columns to use from the input relation, or asterisk (*) to select all columns. Input columns must be of data type numeric.
numclusters
 The number of clusters to create, an integer ≤ 10,000. This argument represents the k in k-means.
Parameters
Important
Parameters init_method and initial_centers_table are mutually exclusive. If you set both, the function returns an error.
exclude_columns
 Comma-separated list of column names from inputcolumns to exclude from processing.
max_iterations
 The maximum number of iterations the algorithm performs. If you set this value to a number lower than the number of iterations needed for convergence, the algorithm may not converge.
 Default: 10
epsilon
 Determines whether the algorithm has converged. The algorithm is considered converged after no center has moved more than a distance of 'epsilon' from the previous iteration.
 Default: 1e-4
init_method
 The method used to find the initial cluster centers, one of the following:

random

kmeanspp (default): k-means++ algorithm
 This value can be memory intensive for high k. If the function returns an error that not enough memory is available, decrease the value of k or use the random method.

initial_centers_table
 The table with the initial cluster centers to use. Supply this value if you know the initial centers to use and do not want Vertica to find the initial cluster centers for you.
output_view
 The name of the view where you save the assignments of each point to its cluster. You must have CREATE privileges on the schema where the view is saved.
key_columns
 Comma-separated list of column names from inputcolumns that will appear as the columns of output_view. These columns should be picked such that their contents identify each input data point. This parameter is only used if output_view is specified. Columns listed in inputcolumns that are only meant to be used as key_columns and not for training should be listed in exclude_columns.
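To illustrate the initial_centers_table parameter, the sketch below supplies three starting centers by hand. The table name iris_initial_centers, the model name myKmeansModelFixedInit, and the center values are invented for illustration. The column names Sepal_Width and Petal_Length, and the expectation that the centers table has one row per cluster with the same columns as the training columns, are assumptions rather than details stated on this page.

```sql
-- Hypothetical table holding three starting centers.
-- Assumption: one row per cluster, same column names as the training columns.
CREATE TABLE iris_initial_centers (
    Sepal_Length FLOAT,
    Sepal_Width  FLOAT,
    Petal_Length FLOAT,
    Petal_Width  FLOAT
);
INSERT INTO iris_initial_centers VALUES (5.0, 3.4, 1.5, 0.2);
INSERT INTO iris_initial_centers VALUES (5.9, 2.8, 4.3, 1.3);
INSERT INTO iris_initial_centers VALUES (6.6, 3.0, 5.6, 2.0);
COMMIT;

-- Train with the supplied centers. init_method is omitted because it is
-- mutually exclusive with initial_centers_table. The cluster count (3)
-- matches the number of rows in the centers table.
SELECT KMEANS('myKmeansModelFixedInit', 'iris1', '*', 3
              USING PARAMETERS initial_centers_table='iris_initial_centers',
                               exclude_columns='Species, id',
                               max_iterations=20,
                               epsilon=1e-4);
```

Supplying centers this way makes training repeatable across runs, which the random and kmeanspp methods do not guarantee.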
Model attributes
centers
 A list that contains the center of each cluster.
metrics
 A string summary of several metrics related to the quality of the clustering.
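One way to inspect these attributes is with the GET_MODEL_ATTRIBUTE and GET_MODEL_SUMMARY functions, as in the sketch below. The model name myKmeansModel is taken from the example on this page; the exact shape of the output is not shown here because it depends on the trained model.

```sql
-- Return the cluster centers as a result set.
SELECT GET_MODEL_ATTRIBUTE(USING PARAMETERS model_name='myKmeansModel',
                                            attr_name='centers');

-- Return the string summary of clustering-quality metrics.
SELECT GET_MODEL_ATTRIBUTE(USING PARAMETERS model_name='myKmeansModel',
                                            attr_name='metrics');

-- Return an overall text summary of the model.
SELECT GET_MODEL_SUMMARY(USING PARAMETERS model_name='myKmeansModel');
```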
Examples
The following example creates k-means model myKmeansModel and applies it to input table iris1. The call to APPLY_KMEANS mixes column names and constants. When a constant is passed in place of a column name, the constant is substituted for the value of the column in all rows:
=> SELECT KMEANS('myKmeansModel', 'iris1', '*', 5
USING PARAMETERS max_iterations=20, output_view='myKmeansView', key_columns='id', exclude_columns='Species, id');
          KMEANS
---------------------------
 Finished in 12 iterations
(1 row)
=> SELECT id, APPLY_KMEANS(Sepal_Length, 2.2, 1.3, Petal_Width
USING PARAMETERS model_name='myKmeansModel', match_by_pos='true') FROM iris2;
 id | APPLY_KMEANS
----+--------------
  5 |            1
 10 |            1
 14 |            1
 15 |            1
 21 |            1
 22 |            1
 24 |            1
 25 |            1
 32 |            1
 33 |            1
 34 |            1
 35 |            1
 38 |            1
 39 |            1
 42 |            1
...
(60 rows)
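Because the KMEANS call above sets output_view, the training assignments can also be read back from myKmeansView. The sketch below counts the points in each cluster. It assumes the view exposes the key column id alongside a cluster assignment column named cluster_id; that column name is an assumption, so check the view definition before relying on it.

```sql
-- Count how many training points were assigned to each cluster.
-- Assumption: the view contains the key column (id) and a cluster_id column.
SELECT cluster_id, COUNT(*) AS points
FROM myKmeansView
GROUP BY cluster_id
ORDER BY cluster_id;
```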