Kmeans
You can use the kmeans clustering algorithm to cluster data points into k different groups based on similarities between the data points.
kmeans partitions n observations into k clusters. Through this partitioning, kmeans assigns each observation to the cluster with the nearest mean, or cluster center.
For a complete example of how to use kmeans on a table in Vertica, see Clustering data using kmeans.
Clustering data using kmeans
This kmeans example uses two small data sets: agar_dish_1 and agar_dish_2. Using the numeric data in the agar_dish_1 data set, you can cluster the data into k clusters. Then, using the created kmeans model, you can run APPLY_KMEANS on agar_dish_2 and assign its data points to the clusters created in your original model.
Before you begin the example, load the Machine Learning sample data.
Clustering training data into k clusters

Create the kmeans model, named agar_dish_kmeans, using the agar_dish_1 table data.
=> SELECT KMEANS('agar_dish_kmeans', 'agar_dish_1', '*', 5
                 USING PARAMETERS exclude_columns='id', max_iterations=20, output_view='agar_1_view',
                 key_columns='id');
           KMEANS
---------------------------
 Finished in 7 iterations

(1 row)
The example creates a model named agar_dish_kmeans and a view named agar_1_view that contains the results of the model. You might get different results when you run the clustering algorithm, because KMEANS picks initial centers randomly by default.

View the output of agar_1_view.
=> SELECT * FROM agar_1_view;
 id  | cluster_id
-----+------------
   2 |          4
   5 |          4
   7 |          4
   9 |          4
  13 |          4
.
.
.
(375 rows)

Because you specified the number of clusters as 5, verify that the function created five clusters. Count the number of data points within each cluster.
=> SELECT cluster_id, COUNT(cluster_id) as Total_count
   FROM agar_1_view
   GROUP BY cluster_id;
 cluster_id | Total_count
------------+-------------
          0 |          76
          2 |          80
          1 |          74
          3 |          73
          4 |          72
(5 rows)
From the output, you can see that five clusters were created: 0, 1, 2, 3, and 4.
You have now successfully clustered the data from agar_dish_1.csv into five distinct clusters.
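Because KMEANS chooses its starting centers randomly by default, repeating the training step can yield different cluster assignments and iteration counts. The following is a minimal sketch of rerunning the training step with the same arguments. It assumes that the model and the output view from the first run must be dropped first because their names are already in use; the DROP statements are an addition for illustration and are not part of the original example.

```sql
-- Assumption: the earlier run already created agar_dish_kmeans and agar_1_view,
-- so both are dropped before training again under the same names.
=> DROP MODEL IF EXISTS agar_dish_kmeans;
=> DROP VIEW IF EXISTS agar_1_view;

-- Train again with the same arguments as the original call.
=> SELECT KMEANS('agar_dish_kmeans', 'agar_dish_1', '*', 5
                 USING PARAMETERS exclude_columns='id', max_iterations=20, output_view='agar_1_view',
                 key_columns='id');

-- Recount the points per cluster to compare with the earlier run.
=> SELECT cluster_id, COUNT(cluster_id) AS Total_count
   FROM agar_1_view
   GROUP BY cluster_id
   ORDER BY cluster_id;
```

Cluster IDs are only labels, so the same group of points may receive a different ID from one run to the next.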
Summarizing your model
View the summary output of agar_dish_kmeans using the GET_MODEL_SUMMARY function.
=> SELECT GET_MODEL_SUMMARY(USING PARAMETERS model_name='agar_dish_kmeans');

=======
centers
=======
   x    |    y
--------+--------
0.49708 | 0.51116
7.48119 | 7.52577
1.56238 | 1.50561
3.50616 | 3.55703
5.52057 | 5.49197
=======
metrics
=======
Evaluation metrics:
     Total Sum of Squares: 6008.4619
     Within-Cluster Sum of Squares:
         Cluster 0: 12.083548
         Cluster 1: 12.389038
         Cluster 2: 12.639238
         Cluster 3: 11.210146
         Cluster 4: 12.994356
     Total Within-Cluster Sum of Squares: 61.316326
     Between-Cluster Sum of Squares: 5947.1456
     Between-Cluster SS / Total SS: 98.98%
Number of iterations performed: 2
Converged: True
Call:
kmeans('public.agar_dish_kmeans', 'agar_dish_1', '*', 5
USING PARAMETERS exclude_columns='id', max_iterations=20, epsilon=0.0001, init_method='kmeanspp',
distance_method='euclidean', output_view='agar_view_1', key_columns='id')
(1 row)
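The metrics in the summary are related by simple arithmetic: the per-cluster values add up to the total within-cluster sum of squares, and the between-cluster value is the total sum of squares minus that total. The query below is a small sketch that rechecks these relationships using numbers copied from the summary output above; the column aliases are illustrative and not part of the original example.

```sql
-- All literals are copied from the GET_MODEL_SUMMARY output; nothing is read from the model.
=> SELECT 12.083548 + 12.389038 + 12.639238 + 11.210146 + 12.994356 AS total_within_ss,
          6008.4619 - 61.316326                                     AS between_ss,
          ROUND(100 * 5947.1456 / 6008.4619, 2)                     AS between_over_total_pct;
```

The five per-cluster values sum to 61.316326, which matches the reported total, and the ratio rounds to the reported 98.98%. A ratio close to 100% suggests that the clusters are compact relative to the distance between them.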
Clustering data using a kmeans model
Using agar_dish_kmeans, the kmeans model you just created, you can assign the points in agar_dish_2 to cluster centers.
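Assigning a point to a cluster center means finding the center at the smallest distance from that point. As a rough illustration only, and not a description of how APPLY_KMEANS is implemented, the query below performs that assignment by hand using Euclidean distance. It assumes that agar_dish_2 has the columns id, x, and y, that the centers are the ones printed in the model summary, and that they are listed in cluster order 0 through 4. The names centers, cx, cy, ranked, and rn are hypothetical.

```sql
=> WITH centers AS (
       -- Centers copied from the GET_MODEL_SUMMARY output; the cluster order is assumed.
       SELECT 0 AS cluster_id, 0.49708 AS cx, 0.51116 AS cy
       UNION ALL SELECT 1, 7.48119, 7.52577
       UNION ALL SELECT 2, 1.56238, 1.50561
       UNION ALL SELECT 3, 3.50616, 3.55703
       UNION ALL SELECT 4, 5.52057, 5.49197
   ),
   ranked AS (
       -- Rank every center for each point by squared Euclidean distance.
       -- Squaring preserves the ordering, so no square root is needed.
       SELECT d.id,
              c.cluster_id,
              ROW_NUMBER() OVER (
                  PARTITION BY d.id
                  ORDER BY (d.x - c.cx) * (d.x - c.cx) + (d.y - c.cy) * (d.y - c.cy)
              ) AS rn
       FROM agar_dish_2 d
       CROSS JOIN centers c
   )
   -- Keep only the nearest center for each point.
   SELECT id, cluster_id
   FROM ranked
   WHERE rn = 1
   ORDER BY id;
```

In practice, APPLY_KMEANS reads the centers from the stored model, so there is no need to copy them by hand.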
Create a table named kmeans_results, using the agar_dish_2 table as your input table and the agar_dish_kmeans model for the cluster centers. Add only the relevant feature columns to the arguments in the APPLY_KMEANS function.
=> CREATE TABLE kmeans_results AS
       (SELECT id,
               APPLY_KMEANS(x, y
                            USING PARAMETERS
                            model_name='agar_dish_kmeans') AS cluster_id
        FROM agar_dish_2);
The kmeans_results table shows that the agar_dish_kmeans model correctly clustered the agar_dish_2 data.
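One way to inspect the assignments is to count the rows in kmeans_results for each cluster, mirroring the check made on agar_1_view earlier. This is a minimal sketch that assumes the CREATE TABLE statement above has already run; total_count is an illustrative alias.

```sql
-- Count how many agar_dish_2 points were assigned to each cluster.
=> SELECT cluster_id, COUNT(*) AS total_count
   FROM kmeans_results
   GROUP BY cluster_id
   ORDER BY cluster_id;
```

If every point was assigned, the counts add up to the number of rows in agar_dish_2.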