KMEANS
Executes the k-means algorithm on an input relation. The result is a model with a list of cluster centers.
You can export the resulting k-means model in VERTICA_MODELS or PMML format to apply it on data outside Vertica. You can also train a k-means model elsewhere, then import it to Vertica in PMML format to predict on data in Vertica.
This is a meta-function. You must call meta-functions in a top-level SELECT statement.
Behavior type
VolatileSyntax
KMEANS ( 'model-name', 'input-relation', 'input-columns', 'num-clusters'
[ USING PARAMETERS
[exclude_columns = 'excluded-columns']
[, max_iterations = max-iterations]
[, epsilon = epsilon-value]
[, { init_method = 'init-method' } | { initial_centers_table = 'init-table' } ]
[, output_view = 'output-view']
[, key_columns = 'key-columns'] ] )
Arguments
model-name
- Identifies the model to create, where
model-name
conforms to conventions described in Identifiers. It must also be unique among all names of sequences, tables, projections, views, and models within the same schema. input-relation
- The table or view that contains the input data for k-means. If the input relation is defined in Hive, use
SYNC_WITH_HCATALOG_SCHEMA
to sync thehcatalog
schema, and then run the machine learning function. input-columns
- Comma-separated list of columns to use from the input relation, or asterisk (*) to select all columns. Input columns must be of data type numeric.
num-clusters
- The number of clusters to create, an integer ≤ 10,000. This argument represents the
k
in k-means.
Parameters
Important
Parametersinit_method
and initial_centers_table
are mutually exclusive. If you set both, the function returns an error.
exclude_columns
Comma-separated list of column names from
input-columns
to exclude from processing.max_iterations
- The maximum number of iterations the algorithm performs. If you set this value to a number lower than the number of iterations needed for convergence, the algorithm may not converge.
Default: 10
epsilon
- Determines whether the algorithm has converged. The algorithm is considered converged after no center has moved more than a distance of
'epsilon' from the previous iteration.Default: 1e-4
init_method
- The method used to find the initial cluster centers, one of the following:
-
random
-
kmeanspp
(default): kmeans++ algorithmThis value can be memory intensive for high k. If the function returns an error that not enough memory is available, decrease the value of k or use the
random
method.
-
initial_centers_table
- The table with the initial cluster centers to use. Supply this value if you know the initial centers to use and do not want Vertica to find the initial cluster centers for you.
output_view
- The name of the view where you save the assignments of each point to its cluster. You must have CREATE privileges on the schema where the view is saved.
key_columns
- Comma-separated list of column names from
input-columns
that will appear as the columns ofoutput_view
. These columns should be picked such that their contents identify each input data point. This parameter is only used ifoutput_view
is specified. Columns listed ininput-columns
that are only meant to be used askey_columns
and not for training should be listed inexclude_columns
.
Model attributes
centers
- A list that contains the center of each cluster.
metrics
- A string summary of several metrics related to the quality of the clustering.
Examples
The following example creates k-means model myKmeansModel
and applies it to input table iris1
. The call to APPLY_KMEANS
mixes column names and constants. When a constant is passed in place of a column name, the constant is substituted for the value of the column in all rows:
=> SELECT KMEANS('myKmeansModel', 'iris1', '*', 5
USING PARAMETERS max_iterations=20, output_view='myKmeansView', key_columns='id', exclude_columns='Species, id');
KMEANS
----------------------------
Finished in 12 iterations
(1 row)
=> SELECT id, APPLY_KMEANS(Sepal_Length, 2.2, 1.3, Petal_Width
USING PARAMETERS model_name='myKmeansModel', match_by_pos='true') FROM iris2;
id | APPLY_KMEANS
-----+--------------
5 | 1
10 | 1
14 | 1
15 | 1
21 | 1
22 | 1
24 | 1
25 | 1
32 | 1
33 | 1
34 | 1
35 | 1
38 | 1
39 | 1
42 | 1
...
(60 rows)