This is the multi-page printable view of this section.
Click here to print.
Return to the regular view of this page.
K-means
You can use the k-means clustering algorithm to cluster data points into k different groups based on similarities between the data points.
You can use the k-means clustering algorithm to cluster data points into k different groups based on similarities between the data points.
k-means partitions n observations into k clusters. Through this partitioning, k-means assigns each observation to the cluster with the nearest mean, or cluster center.
For a complete example of how to use k-means on a table in Vertica, see Clustering data using k-means .
1 - Clustering data using k-means
This k-means example uses two small data sets: agar_dish_1 and agar_dish_2.
This k-means example uses two small data sets: agar_dish_1
and agar_dish_2
. Using the numeric data in the agar_dish_1
data set, you can cluster the data into k clusters. Then, using the created k-means model, you can run APPLY_KMEANS on agar_dish_2
and assign them to the clusters created in your original model.
Before you begin the example,
load the Machine Learning sample data.
Clustering training data into k clusters
-
Create the k-means model, named agar_dish_kmeans using the agar_dish_1
table data.
=> SELECT KMEANS('agar_dish_kmeans', 'agar_dish_1', '*', 5
USING PARAMETERS exclude_columns ='id', max_iterations=20, output_view='agar_1_view',
key_columns='id');
KMEANS
---------------------------
Finished in 7 iterations
(1 row)
The example creates a model named agar_dish_kmeans
and a view containing the results of the model named agar_1_view
. You might get different results when you run the clustering algorithm. This is because KMEANS randomly picks initial centers by default.
-
View the output of agar_1_view
.
=> SELECT * FROM agar_1_view;
id | cluster_id
-----+------------
2 | 4
5 | 4
7 | 4
9 | 4
13 | 4
.
.
.
(375 rows)
-
Because you specified the number of clusters as 5, verify that the function created five clusters. Count the number of data points within each cluster.
=> SELECT cluster_id, COUNT(cluster_id) as Total_count
FROM agar_1_view
GROUP BY cluster_id;
cluster_id | Total_count
------------+-------------
0 | 76
2 | 80
1 | 74
3 | 73
4 | 72
(5 rows)
From the output, you can see that five clusters were created: 0
, 1
, 2
, 3
, and 4
.
You have now successfully clustered the data from agar_dish_1.csv
into five distinct clusters.
Summarizing your model
View the summary output of agar_dish_means using the GET_MODEL_SUMMARY function.
=> SELECT GET_MODEL_SUMMARY(USING PARAMETERS model_name='agar_dish_kmeans');
----------------------------------------------------------------------------------
=======
centers
=======
x | y
--------+--------
0.49708 | 0.51116
-7.48119|-7.52577
-1.56238|-1.50561
-3.50616|-3.55703
-5.52057|-5.49197
=======
metrics
=======
Evaluation metrics:
Total Sum of Squares: 6008.4619
Within-Cluster Sum of Squares:
Cluster 0: 12.083548
Cluster 1: 12.389038
Cluster 2: 12.639238
Cluster 3: 11.210146
Cluster 4: 12.994356
Total Within-Cluster Sum of Squares: 61.316326
Between-Cluster Sum of Squares: 5947.1456
Between-Cluster SS / Total SS: 98.98%
Number of iterations performed: 2
Converged: True
Call:
kmeans('public.agar_dish_kmeans', 'agar_dish_1', '*', 5
USING PARAMETERS exclude_columns='id', max_iterations=20, epsilon=0.0001, init_method='kmeanspp',
distance_method='euclidean', output_view='agar_view_1', key_columns='id')
(1 row)
Clustering data using a k-means model
Using agar_dish_kmeans
, the k-means model you just created, you can assign the points in agar_dish_2
to cluster centers.
Create a table named kmeans_results
, using the agar_dish_2
table as your input table and the agar_dish_kmeans
model for your initial cluster centers.
Add only the relevant feature columns to the arguments in the APPLY_KMEANS
function.
=> CREATE TABLE kmeans_results AS
(SELECT id,
APPLY_KMEANS(x, y
USING PARAMETERS
model_name='agar_dish_kmeans') AS cluster_id
FROM agar_dish_2);
The kmeans_results
table shows that the agar_dish_kmeans
model correctly clustered the agar_dish_2
data.
See also