BISECTING_KMEANS

Executes the bisecting k‑means algorithm on an input relation. The result is a trained model with a hierarchy of cluster centers that covers a range of k values, each of which can be used for prediction.

This is a meta-function. You must call meta-functions in a top-level SELECT statement.

Behavior type

Volatile

Syntax

BISECTING_KMEANS('model-name', 'input-relation', 'input-columns', 'num-clusters'
           [ USING PARAMETERS
                 [exclude_columns = 'exclude-columns']
                 [, bisection_iterations = bisection-iterations]
                 [, split_method = 'split-method']
                 [, min_divisible_cluster_size = min-cluster-size]
                 [, kmeans_max_iterations = kmeans-max-iterations]
                 [, kmeans_epsilon = kmeans-epsilon]
                 [, kmeans_center_init_method = 'kmeans-init-method']
                 [, distance_method = 'distance-method']
                 [, output_view = 'output-view']
                 [, key_columns = 'key-columns'] ] )

Arguments

model‑name
Identifies the model to create, where model‑name conforms to conventions described in Identifiers. It must also be unique among all names of sequences, tables, projections, views, and models within the same schema.
input‑relation
Table or view that contains the input data for k‑means. If the input relation is defined in Hive, use SYNC_WITH_HCATALOG_SCHEMA to sync the hcatalog schema, and then run the machine learning function.
input‑columns
Comma-separated list of columns to use from the input relation, or asterisk (*) to select all columns. Input columns must be of data type numeric.
num‑clusters
Number of clusters to create, an integer ≤ 10,000. This argument represents the k in k‑means.

Parameters

exclude_columns
Comma-separated list of column names from input‑columns to exclude from processing.

bisection_iterations
Integer between 1 and 1MM inclusive, specifies the number of iterations the bisecting k‑means algorithm performs for each bisection step. This corresponds to how many times a standalone k‑means algorithm runs in each bisection step.

A setting greater than 1 allows the algorithm to run k‑means several times and choose the best run within each bisection step. If you use kmeanspp, the value of bisection_iterations is always 1: kmeanspp is more costly to run but also better than the alternatives, so it does not require multiple runs.

Default: 1
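
To illustrate what a setting greater than 1 means, the following is a minimal Python sketch, not the database's implementation. It runs a plain 2‑means several times from random starting points and keeps one run. The selection criterion (lowest within-cluster sum of squares) is an assumption, since the text above does not say how the best run is chosen. The names two_means and best_bisection are hypothetical.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(points):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(col) / len(points) for col in zip(*points))


def two_means(points, rng, max_iterations=10):
    """One plain 2-means run started from two randomly sampled points."""
    centers = rng.sample(points, 2)
    for _ in range(max_iterations):
        halves = ([], [])
        for p in points:
            nearer = 0 if sq_dist(p, centers[0]) <= sq_dist(p, centers[1]) else 1
            halves[nearer].append(p)
        if not halves[0] or not halves[1]:
            break  # degenerate split; keep the previous centers
        centers = [centroid(halves[0]), centroid(halves[1])]
    within_ss = sum(min(sq_dist(p, c) for c in centers) for p in points)
    return centers, within_ss


def best_bisection(points, bisection_iterations=1, seed=0):
    """Run 2-means bisection_iterations times and keep one run.

    Assumption: "best" means lowest within-cluster sum of squares.
    """
    rng = random.Random(seed)
    runs = [two_means(points, rng) for _ in range(bisection_iterations)]
    return min(runs, key=lambda run: run[1])
```

With the same seed, a larger bisection_iterations can never give a worse result in this sketch, because the first run is identical and later runs can only replace it with a better one.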

split_method
The method used to choose a cluster to bisect/split, one of:
  • size: Choose the largest cluster to bisect.

  • sum_squares: Choose the cluster with the largest within-cluster sum of squares to bisect.

Default: sum_squares

min_divisible_cluster_size
Integer ≥ 2, specifies the minimum number of points a cluster must contain to be divisible.

Default: 2
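
The interaction between split_method and min_divisible_cluster_size can be sketched in Python as follows. This is an illustration under stated assumptions rather than the database's code. It assumes that clusters with fewer points than min_divisible_cluster_size are never candidates for bisection, and that no cluster is chosen when none qualifies. The function name choose_cluster_to_bisect is hypothetical.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def within_ss(cluster):
    """Within-cluster sum of squares around the cluster mean."""
    center = tuple(sum(col) / len(cluster) for col in zip(*cluster))
    return sum(sq_dist(p, center) for p in cluster)


def choose_cluster_to_bisect(clusters, split_method="sum_squares",
                             min_divisible_cluster_size=2):
    """Return the index of the cluster to split next, or None.

    Assumption: clusters below min_divisible_cluster_size are skipped.
    """
    candidates = [i for i, c in enumerate(clusters)
                  if len(c) >= min_divisible_cluster_size]
    if not candidates:
        return None
    if split_method == "size":
        return max(candidates, key=lambda i: len(clusters[i]))
    if split_method == "sum_squares":
        return max(candidates, key=lambda i: within_ss(clusters[i]))
    raise ValueError("split_method must be 'size' or 'sum_squares'")
```

For example, given a tight cluster of four points and a widely spread cluster of two points, size selects the first and sum_squares selects the second.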

kmeans_max_iterations
Integer between 1 and 1MM inclusive, specifies the maximum number of iterations the k‑means algorithm performs. If you set this value to a number lower than the number of iterations needed for convergence, the algorithm might not converge.

Default: 10

kmeans_epsilon
Determines whether the k‑means algorithm has converged. The algorithm is considered converged when no center has moved more than a distance of epsilon from its position in the previous iteration.

Default: 1e-4
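
A minimal Python sketch of how kmeans_max_iterations and kmeans_epsilon could work together as stopping rules. It is an illustration only, assuming Euclidean distance and a plain k‑means update. The names has_converged and run_kmeans are hypothetical.

```python
import math


def has_converged(old_centers, new_centers, epsilon=1e-4):
    """True when no center moved more than epsilon since the last iteration."""
    return all(math.dist(old, new) <= epsilon
               for old, new in zip(old_centers, new_centers))


def run_kmeans(points, centers, kmeans_max_iterations=10, kmeans_epsilon=1e-4):
    """Plain k-means loop; returns (centers, iterations_used, converged)."""
    for iteration in range(1, kmeans_max_iterations + 1):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: math.dist(p, centers[i]))
            groups[nearest].append(p)
        # An empty group keeps its previous center.
        new_centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else c
            for g, c in zip(groups, centers)
        ]
        converged = has_converged(centers, new_centers, kmeans_epsilon)
        centers = new_centers
        if converged:
            return centers, iteration, True
    return centers, kmeans_max_iterations, False
```

If kmeans_max_iterations is reached before the centers stop moving, the loop returns the last centers with converged set to False, which corresponds to the non-convergence case described for kmeans_max_iterations.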

kmeans_center_init_method
The method used to find the initial cluster centers in k‑means, one of:
  • kmeanspp (default): kmeans++ algorithm

  • pseudo: Uses the "pseudo center" approach used by Spark, which bisects a given center without iterating over points
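
For readers unfamiliar with kmeans++, the following Python sketch shows the standard seeding rule: the first center is drawn uniformly, and each later center is drawn with probability proportional to its squared distance from the nearest center chosen so far. This is the textbook procedure, not necessarily the exact variant the database implements. The function name kmeanspp_seeds is hypothetical, and the sketch assumes the input contains at least k distinct points.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeanspp_seeds(points, k, seed=0):
    """Pick k initial centers from points using k-means++ seeding.

    Assumption: points holds at least k distinct points.
    """
    rng = random.Random(seed)
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Weight each point by its squared distance to the nearest center;
        # points already chosen get weight 0 and cannot be drawn again.
        weights = [min(sq_dist(p, c) for c in centers) for p in points]
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers
```

In a bisection step only two centers are needed, so k would be 2. Because far-apart points are favored, the seeds tend to start in different regions of the cluster.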

distance_method
The measure for distance between two data points. Only Euclidean distance is supported at this time.

Default: euclidean

output_view
Name of the view where you save the assignment of each point to its cluster. You must have CREATE privileges on the view schema.
key_columns
Comma-separated list of column names that identify the output rows. Columns must be in the input‑columns argument list. To exclude these and other input columns from being used by the algorithm, list them in parameter exclude_columns.

Model attributes

centers
A list of the centers of the K clusters (the K centroids).
hierarchy
The hierarchy of K clusters, including:
  • ParentCluster: Parent cluster centroid of each centroid, that is, the centroid of the cluster from which a cluster is obtained by bisection.

  • LeftChildCluster: Left child cluster centroid of each centroid, that is, the centroid of the first sub-cluster obtained by bisecting a cluster.

  • RightChildCluster: Right child cluster centroid of each centroid, that is, the centroid of the second sub-cluster obtained by bisecting a cluster.

  • BisectionLevel: Specifies which bisection step a cluster is obtained from.

  • WithinSS: Within-cluster sum of squares for the current cluster.

  • TotalWithinSS: Total within-cluster sum of squares of leaf clusters thus far obtained.

metrics
Several metrics related to the quality of the clustering, including:
  • Total sum of squares

  • Total within-cluster sum of squares

  • Between-cluster sum of squares

  • Between-cluster sum of squares / Total sum of squares

  • Sum of squares for cluster x, center_id y[...]
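
The relationship between these metrics can be sketched in Python using the standard definitions. This is an illustration, assuming each cluster center is the mean of its points, in which case the between-cluster sum of squares equals the total sum of squares minus the total within-cluster sum of squares. The function name clustering_metrics and the dictionary keys are hypothetical, not the labels the model reports.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(points):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(col) / len(points) for col in zip(*points))


def clustering_metrics(clusters):
    """Sum-of-squares metrics for a list of non-empty clusters."""
    all_points = [p for cluster in clusters for p in cluster]
    grand_mean = centroid(all_points)
    total_ss = sum(sq_dist(p, grand_mean) for p in all_points)
    per_cluster_ss = [
        sum(sq_dist(p, centroid(cluster)) for p in cluster)
        for cluster in clusters
    ]
    total_within_ss = sum(per_cluster_ss)
    between_ss = total_ss - total_within_ss
    return {
        "total_ss": total_ss,
        "total_within_ss": total_within_ss,
        "between_ss": between_ss,
        # Guard against a data set where every point is identical.
        "between_over_total": between_ss / total_ss if total_ss else 0.0,
        "per_cluster_ss": per_cluster_ss,
    }
```

A ratio of between-cluster to total sum of squares close to 1 indicates that most of the variance in the data is explained by the separation between clusters.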

Examples

SELECT BISECTING_KMEANS('myModel', 'iris1', '*', '5'
       USING PARAMETERS exclude_columns = 'Species,id', split_method = 'sum_squares', output_view = 'myBKmeansView');

See also