K-means

You can use the k-means clustering algorithm to cluster data points into k different groups based on similarities between the data points.

The purpose of k-means is to partition n observations into k clusters. Through this partitioning, k-means assigns each observation to the cluster with the nearest mean. That nearest mean is also known as the cluster center.
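Stated as an optimization problem, the partitioning described above is commonly written as minimizing the within-cluster sum of squared distances. The formulation below is a sketch that assumes squared Euclidean distance; the symbols S_j (the set of observations in cluster j) and \mu_j (the mean of cluster j) are introduced here for illustration only.

```latex
\[
\underset{S_1,\dots,S_k}{\arg\min} \; \sum_{j=1}^{k} \sum_{x \in S_j} \lVert x - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|S_j|} \sum_{x \in S_j} x
\]
```

Each observation x contributes the squared distance to the mean of its own cluster, so assigning every observation to its nearest mean lowers this sum for a fixed set of centers.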

For a complete example of how to use k-means on a table in Vertica, see Clustering data using k-means.

1 - Clustering data using k-means

This k-means example uses two small data sets named agar_dish_1 and agar_dish_2. Using the numeric data in the agar_dish_1 data set, you can cluster the data into k clusters. Then, using the resulting k-means model, you can run APPLY_KMEANS on agar_dish_2 to assign its data points to the clusters created in your original model.

Before you begin the example, load the Machine Learning sample data.

Clustering training data into k clusters

  1. Create the k-means model, named agar_dish_kmeans, using the agar_dish_1 table data.

    => SELECT KMEANS('agar_dish_kmeans', 'agar_dish_1', '*', 5
                      USING PARAMETERS exclude_columns ='id', max_iterations=20, output_view='agar_1_view',
                      key_columns='id');
               KMEANS
    ---------------------------
     Finished in 7 iterations
    
    (1 row)
    

    The example creates a model named agar_dish_kmeans and a view named agar_1_view that contains the results of the model. You might get different results when you run the clustering algorithm because KMEANS picks its initial centers randomly by default.

  2. View the output of agar_1_view.

    => SELECT * FROM agar_1_view;
     id  | cluster_id
    -----+------------
       2 |          4
       5 |          4
       7 |          4
       9 |          4
      13 |          4
    .
    .
    .
    (375 rows)
    
  3. Because you specified the number of clusters as 5, verify that the function created five clusters. Count the number of data points within each cluster.

    => SELECT cluster_id, COUNT(cluster_id) as Total_count
       FROM agar_1_view
       GROUP BY cluster_id;
     cluster_id | Total_count
    ------------+-------------
              0 |          76
              2 |          80
              1 |          74
              3 |          73
              4 |          72
    (5 rows)
    

    From the output, you can see that five clusters were created: 0, 1, 2, 3, and 4. The five counts sum to 375, which matches the number of rows in agar_1_view.

    You have now successfully clustered the data from agar_dish_1.csv into five distinct clusters.

Summarizing your model

View the summary output of agar_dish_kmeans using the GET_MODEL_SUMMARY function.

=> SELECT GET_MODEL_SUMMARY(USING PARAMETERS model_name='agar_dish_kmeans');
----------------------------------------------------------------------------------
=======
centers
=======
x       |   y
--------+--------
0.49708 | 0.51116
-7.48119|-7.52577
-1.56238|-1.50561
-3.50616|-3.55703
-5.52057|-5.49197

=======
metrics
=======
Evaluation metrics:
  Total Sum of Squares: 6008.4619
  Within-Cluster Sum of Squares:
      Cluster 0: 12.083548
      Cluster 1: 12.389038
      Cluster 2: 12.639238
      Cluster 3: 11.210146
      Cluster 4: 12.994356
  Total Within-Cluster Sum of Squares: 61.316326
  Between-Cluster Sum of Squares: 5947.1456
  Between-Cluster SS / Total SS: 98.98%
Number of iterations performed: 2
Converged: True
Call:
kmeans('public.agar_dish_kmeans', 'agar_dish_1', '*', 5
USING PARAMETERS exclude_columns='id', max_iterations=20, epsilon=0.0001, init_method='kmeanspp',
distance_method='euclidean', output_view='agar_1_view', key_columns='id')
(1 row)
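The evaluation metrics in this summary are related by simple arithmetic, which can be checked by hand. The worked example below uses only the values printed above; the abbreviations WCSS (within-cluster sum of squares), BCSS (between-cluster sum of squares), and TSS (total sum of squares) are shorthand introduced here.

```latex
\begin{align*}
\text{Total WCSS} &= 12.083548 + 12.389038 + 12.639238 + 11.210146 + 12.994356 = 61.316326 \\
\text{BCSS} &= \text{TSS} - \text{Total WCSS} = 6008.4619 - 61.316326 \approx 5947.1456 \\
\frac{\text{BCSS}}{\text{TSS}} &= \frac{5947.1456}{6008.4619} \approx 0.9898 = 98.98\%
\end{align*}
```

A ratio close to 100% indicates that most of the variance in the data lies between the clusters rather than within them.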

Clustering data using a k-means model

Using agar_dish_kmeans, the k-means model you just created, you can assign the points in agar_dish_2 to cluster centers.

Create a table named kmeans_results, using the agar_dish_2 table as your input table and the agar_dish_kmeans model to supply the cluster centers.

Add only the relevant feature columns to the arguments in the APPLY_KMEANS function.

=> CREATE TABLE kmeans_results AS
        (SELECT id,
                APPLY_KMEANS(x, y
                             USING PARAMETERS
                                              model_name='agar_dish_kmeans') AS cluster_id
         FROM agar_dish_2);
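To inspect the assignments, a per-cluster count like the one run earlier against agar_1_view can be run against kmeans_results. This is a minimal sketch: the alias total_count is chosen here for illustration, and no output is shown because the counts depend on the model you trained.

```sql
-- Count how many agar_dish_2 points were assigned to each cluster.
SELECT cluster_id, COUNT(*) AS total_count
FROM kmeans_results
GROUP BY cluster_id
ORDER BY cluster_id;
```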

The kmeans_results table shows that the agar_dish_kmeans model correctly clustered the agar_dish_2 data.
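For intuition about the assignment step, the query below sketches a manual nearest-center assignment in plain SQL. It is an illustration, not a description of how APPLY_KMEANS is implemented. It assumes that agar_dish_2 has numeric columns x and y, that the centers are the ones printed in the model summary above, and that they map to cluster IDs 0 through 4 in the order shown. The names centers, ranked, cx, cy, and rn are hypothetical.

```sql
-- Hypothetical manual equivalent of the assignment step, for illustration only.
-- Assumption: the centers below are copied from the model summary and are
-- numbered 0-4 in the order they are listed there.
WITH centers AS (
    SELECT 0 AS cluster_id, 0.49708 AS cx, 0.51116 AS cy
    UNION ALL SELECT 1, -7.48119, -7.52577
    UNION ALL SELECT 2, -1.56238, -1.50561
    UNION ALL SELECT 3, -3.50616, -3.55703
    UNION ALL SELECT 4, -5.52057, -5.49197
),
ranked AS (
    -- Pair every point with every center and rank the centers
    -- by squared Euclidean distance to the point.
    SELECT d.id,
           c.cluster_id,
           ROW_NUMBER() OVER (
               PARTITION BY d.id
               ORDER BY (d.x - c.cx) * (d.x - c.cx) + (d.y - c.cy) * (d.y - c.cy)
           ) AS rn
    FROM agar_dish_2 d
    CROSS JOIN centers c
)
-- Keep only the nearest center for each point.
SELECT id, cluster_id
FROM ranked
WHERE rn = 1;
```

Because your trained model may have different centers, compare this query against kmeans_results only after replacing the literal values with the centers from your own model summary.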
