K-means

You can use the clustering algorithm, k-means clustering, to cluster data points into k different groups based on similarities between the data points.

1: Clustering data using k-means

You can use the clustering algorithm, k-means clustering, to cluster data points into k different groups based on similarities between the data points.

The purpose of k-means is to partition n observations into k clusters. Through this partitioning, k-means assigns each observation to the cluster with the nearest mean. That nearest mean is also known as the cluster center.

For a complete example of how to use k-means on a table in Vertica, see Clustering data using k-means .

1 - Clustering data using k-means

This k-means example uses two small data sets named agar_dish_1 and agar_dish_2.

This k-means example uses two small data sets named agar_dish_1 and agar_dish_2. Using the numeric data in the agar_dish_1 data set, you can cluster the data into k clusters. Then, using the created k-means model, you can run APPLY_KMEANS on agar_dish_2 and assign them to the clusters created in your original model.

Before you begin the example, load the Machine Learning sample data.

Clustering training data into k clusters

Create the k-means model, named agar_dish_kmeans using the agar_dish_1 table data.
```
=> SELECT KMEANS('agar_dish_kmeans', 'agar_dish_1', '*', 5
                  USING PARAMETERS exclude_columns ='id', max_iterations=20, output_view='agar_1_view',
                  key_columns='id');
           KMEANS
---------------------------
 Finished in 7 iterations

(1 row)
```
The example creates a model named agar_dish_kmeans and a view containing the results of the model named agar_1_view. You might get different results when you run the clustering algorithm. This is because KMEANS randomly picks initial centers by default.

View the output of agar_1_view.

=> SELECT * FROM agar_1_view;
 id  | cluster_id
-----+------------
   2 |          4
   5 |          4
   7 |          4
   9 |          4
  13 |          4
.
.
.
(375 rows)

Because you specified the number of clusters as 5, verify that the function created five clusters. Count the number of data points within each cluster.
```
=> SELECT cluster_id, COUNT(cluster_id) as Total_count
   FROM agar_1_view
   GROUP BY cluster_id;
 cluster_id | Total_count
------------+-------------
          0 |          76
          2 |          80
          1 |          74
          3 |          73
          4 |          72
(5 rows)
```
From the output, you can see that five clusters were created: 0, 1, 2, 3, and 4.

You have now successfully clustered the data from agar_dish_1.csv into five distinct clusters.

Summarizing your model

View the summary output of agar_dish_means using the GET_MODEL_SUMMARY function.

=> SELECT GET_MODEL_SUMMARY(USING PARAMETERS model_name='agar_dish_kmeans');
----------------------------------------------------------------------------------
=======
centers
=======
x       |   y
--------+--------
0.49708 | 0.51116
-7.48119|-7.52577
-1.56238|-1.50561
-3.50616|-3.55703
-5.52057|-5.49197

=======
metrics
=======
Evaluation metrics:
  Total Sum of Squares: 6008.4619
  Within-Cluster Sum of Squares:
      Cluster 0: 12.083548
      Cluster 1: 12.389038
      Cluster 2: 12.639238
      Cluster 3: 11.210146
      Cluster 4: 12.994356
  Total Within-Cluster Sum of Squares: 61.316326
  Between-Cluster Sum of Squares: 5947.1456
  Between-Cluster SS / Total SS: 98.98%
Number of iterations performed: 2
Converged: True
Call:
kmeans('public.agar_dish_kmeans', 'agar_dish_1', '*', 5
USING PARAMETERS exclude_columns='id', max_iterations=20, epsilon=0.0001, init_method='kmeanspp',
distance_method='euclidean', output_view='agar_view_1', key_columns='id')
(1 row)

Clustering data using a k-means model

Using agar_dish_kmeans, the k-means model you just created, you can assign the points in agar_dish_2 to cluster centers.

Create a table named kmeans_results, using the agar_dish_2 table as your input table and the agar_dish_kmeans model for your initial cluster centers.

Add only the relevant feature columns to the arguments in the APPLY_KMEANS function.

=> CREATE TABLE kmeans_results AS
        (SELECT id,
                APPLY_KMEANS(x, y
                             USING PARAMETERS
                                              model_name='agar_dish_kmeans') AS cluster_id
         FROM agar_dish_2);

The kmeans_results table shows that the agar_dish_kmeans model correctly clustered the agar_dish_2 data.

K-means

1 - Clustering data using k-means

Clustering training data into k clusters

Summarizing your model

Clustering data using a k-means model

See also