Clustering data using k-means
This k-means example uses two small data sets named agar_dish_1 and agar_dish_2. Using the numeric data in the agar_dish_1 data set, you can cluster the data into k clusters. Then, using the created k-means model, you can run APPLY_KMEANS on agar_dish_2 and assign its data points to the clusters created in your original model.
Clustering training data into k clusters
- Create the k-means model, named agar_dish_kmeans, using the agar_dish_1 table data.
=> SELECT KMEANS('agar_dish_kmeans', 'agar_dish_1', '*', 5
                 USING PARAMETERS exclude_columns='id', max_iterations=20, output_view='agar_1_view',
                 key_columns='id');
           KMEANS
---------------------------
 Finished in 7 iterations
(1 row)
The example creates a model named agar_dish_kmeans and a view named agar_1_view that contains the results of the model. You might get different results when you run the clustering algorithm, because KMEANS randomly picks initial centers by default.
- View the output of agar_1_view.
=> SELECT * FROM agar_1_view;
 id  | cluster_id
-----+------------
   2 |          4
   5 |          4
   7 |          4
   9 |          4
  13 |          4
.
.
.
(375 rows)
- Because you specified the number of clusters as 5, verify that the function created five clusters. Count the number of data points within each cluster.
=> SELECT cluster_id, COUNT(cluster_id) as Total_count
   FROM agar_1_view
   GROUP BY cluster_id;
 cluster_id | Total_count
------------+-------------
          0 |          76
          2 |          80
          1 |          74
          3 |          73
          4 |          72
(5 rows)
From the output, you can see that five clusters were created: 0, 1, 2, 3, and 4.
You have now successfully clustered the data from agar_dish_1.csv into five distinct clusters.
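The note above about different results between runs can be illustrated outside the database. The following is a minimal sketch of plain k-means (Lloyd's algorithm) with randomly chosen initial centers, written in standard-library Python. It is not Vertica's implementation: the function name, the seed parameter, and the six sample points are all made up for illustration.

```python
import random


def kmeans(points, k, max_iterations=20, seed=None):
    """Plain Lloyd's algorithm with randomly chosen initial centers (illustrative only)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # random initial centers
    labels = [0] * len(points)
    for iteration in range(1, max_iterations + 1):
        # Assignment step: each point goes to its nearest center (squared Euclidean distance).
        labels = [
            min(range(k),
                key=lambda c: (p[0] - centers[c][0]) ** 2 + (p[1] - centers[c][1]) ** 2)
            for p in points
        ]
        # Update step: move each center to the mean of the points assigned to it.
        new_centers = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centers.append((sum(x for x, _ in members) / len(members),
                                    sum(y for _, y in members) / len(members)))
            else:
                new_centers.append(centers[c])  # an empty cluster keeps its old center
        if new_centers == centers:  # centers stopped moving: converged
            break
        centers = new_centers
    return centers, labels, iteration


# Hypothetical sample points, not rows from agar_dish_1.
points = [(0.4, 0.6), (0.5, 0.5), (0.6, 0.4), (-7.5, -7.4), (-7.4, -7.6), (-7.6, -7.5)]

for seed in (1, 2):
    centers, labels, iterations = kmeans(points, 2, seed=seed)
    print(f"seed={seed}: labels={labels}, iterations={iterations}")
```

Because the starting centers depend on the seed, the numbering of the clusters and the number of iterations can change from run to run even when the grouping itself is similar.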
Summarizing your model
View the summary output of agar_dish_kmeans using the GET_MODEL_SUMMARY function.
=> SELECT GET_MODEL_SUMMARY(USING PARAMETERS model_name='agar_dish_kmeans');
----------------------------------------------------------------------------------
=======
centers
=======
x | y
--------+--------
0.49708 | 0.51116
-7.48119|-7.52577
-1.56238|-1.50561
-3.50616|-3.55703
-5.52057|-5.49197
=======
metrics
=======
Evaluation metrics:
Total Sum of Squares: 6008.4619
Within-Cluster Sum of Squares:
Cluster 0: 12.083548
Cluster 1: 12.389038
Cluster 2: 12.639238
Cluster 3: 11.210146
Cluster 4: 12.994356
Total Within-Cluster Sum of Squares: 61.316326
Between-Cluster Sum of Squares: 5947.1456
Between-Cluster SS / Total SS: 98.98%
Number of iterations performed: 2
Converged: True
Call:
kmeans('public.agar_dish_kmeans', 'agar_dish_1', '*', 5
USING PARAMETERS exclude_columns='id', max_iterations=20, epsilon=0.0001, init_method='kmeanspp',
distance_method='euclidean', output_view='agar_view_1', key_columns='id')
(1 row)
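The evaluation metrics in the summary are related by simple arithmetic, which can be checked by hand. The sketch below, in plain Python, assumes the usual k-means decomposition (between-cluster sum of squares equals total sum of squares minus total within-cluster sum of squares); the summary does not state how the function computes these values, but the reported numbers agree with that relation.

```python
# Values copied from the GET_MODEL_SUMMARY output above.
total_ss = 6008.4619
within_ss = [12.083548, 12.389038, 12.639238, 11.210146, 12.994356]

# Assumed relations: total within = sum of per-cluster values,
# between = total - total within.
total_within_ss = sum(within_ss)
between_ss = total_ss - total_within_ss
ratio = between_ss / total_ss

print(f"Total Within-Cluster Sum of Squares: {total_within_ss:.6f}")
print(f"Between-Cluster Sum of Squares: {between_ss:.4f}")
print(f"Between-Cluster SS / Total SS: {ratio:.2%}")
```

A ratio close to 100% means that most of the variance in the data is explained by the separation between clusters rather than by spread within them.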
Clustering data using a k-means model
Using agar_dish_kmeans, the k-means model you just created, you can assign the points in agar_dish_2 to cluster centers.
Create a table named kmeans_results, using the agar_dish_2 table as your input table and the agar_dish_kmeans model for the cluster centers.
Add only the relevant feature columns to the arguments in the APPLY_KMEANS function.
=> CREATE TABLE kmeans_results AS
(SELECT id,
APPLY_KMEANS(x, y
USING PARAMETERS
model_name='agar_dish_kmeans') AS cluster_id
FROM agar_dish_2);
The kmeans_results table shows that the agar_dish_kmeans model correctly clustered the agar_dish_2 data.
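Conceptually, applying a k-means model means assigning each new point to its nearest cluster center. The following plain-Python sketch shows that idea using the centers from the model summary and the Euclidean distance named in the summary's call. It is an illustration, not the APPLY_KMEANS implementation: the helper name is hypothetical, the sample points are made up, and treating the row order of the centers as cluster IDs 0 through 4 is an assumption.

```python
import math

# Centers copied from the model summary.
# Assumption: row order corresponds to cluster_id 0..4.
centers = [
    (0.49708, 0.51116),
    (-7.48119, -7.52577),
    (-1.56238, -1.50561),
    (-3.50616, -3.55703),
    (-5.52057, -5.49197),
]


def nearest_center(x, y):
    """Return the index of the center closest to (x, y) by Euclidean distance."""
    return min(range(len(centers)), key=lambda i: math.dist((x, y), centers[i]))


# Made-up sample points, not rows from agar_dish_2.
for x, y in [(0.4, 0.6), (-7.0, -7.9), (-3.4, -3.6)]:
    print((x, y), "->", nearest_center(x, y))
```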