BISECTING_KMEANS

Executes the bisecting k-means algorithm on an input relation. The result is a trained model that stores a hierarchy of cluster centers covering a range of k values, any of which can be used for prediction.

This is a meta-function. You must call meta-functions in a top-level SELECT statement.

Behavior type

Volatile

Syntax

BISECTING_KMEANS('model-name', 'input-relation', 'input-columns', 'num-clusters'
           [ USING PARAMETERS
                 [exclude_columns = 'exclude-columns']
                 [, bisection_iterations = bisection-iterations]
                 [, split_method = 'split-method']
                 [, min_divisible_cluster_size = min-cluster-size]
                 [, kmeans_max_iterations = kmeans-max-iterations]
                 [, kmeans_epsilon = kmeans-epsilon]
                 [, kmeans_center_init_method = 'kmeans-init-method']
                 [, distance_method = 'distance-method']
                 [, output_view = 'output-view']
                 [, key_columns = 'key-columns'] ] )

Arguments

model-name: Identifies the model to create, where model-name conforms to conventions described in Identifiers. It must also be unique among all names of sequences, tables, projections, views, and models within the same schema.
input-relation: Table or view that contains the input data for k-means. If the input relation is defined in Hive, use SYNC_WITH_HCATALOG_SCHEMA to sync the hcatalog schema, and then run the machine learning function.
input-columns: Comma-separated list of columns to use from the input relation, or asterisk (*) to select all columns. Input columns must be of data type numeric.
num-clusters: Number of clusters to create, an integer ≤ 10,000. This argument represents the k in k-means.

Parameters

exclude_columns

Comma-separated list of column names from input-columns to exclude from processing.

bisection_iterations

Integer between 1 and 1MM inclusive, specifies the number of iterations the bisecting k-means algorithm performs for each bisection step. This corresponds to how many times a standalone k-means algorithm runs in each bisection step.

A setting greater than 1 lets the algorithm perform several k-means runs within each bisection step and keep the best one. If you use kmeanspp, the value of bisection_iterations is always 1: kmeanspp is more costly to run but also better than the alternatives, so it does not require multiple runs.

Default: 1
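
As an illustration of this parameter, the following sketch requests five k-means runs per bisection step. It reuses the iris1 table and excluded columns from the Examples section; the model name myModelPseudo is hypothetical. Because bisection_iterations is always 1 with kmeanspp, the sketch also sets kmeans_center_init_method to pseudo.

```sql
-- Sketch only: myModelPseudo is a hypothetical model name;
-- iris1 and its Species/id columns come from the Examples section.
SELECT BISECTING_KMEANS('myModelPseudo', 'iris1', '*', '5'
       USING PARAMETERS exclude_columns = 'Species,id',
                        kmeans_center_init_method = 'pseudo',  -- kmeanspp would force a single run
                        bisection_iterations = 5);             -- five k-means runs per bisection step
```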

split_method

The method used to choose a cluster to bisect/split, one of:

size: Choose the largest cluster to bisect.
sum_squares: Choose the cluster with the largest within-cluster sum of squares to bisect.

Default: sum_squares

min_divisible_cluster_size

Integer ≥ 2, specifies the minimum number of points a cluster must contain to be eligible for bisection.

Default: 2
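
A minimal sketch of how split_method and min_divisible_cluster_size might be combined, assuming the iris1 table from the Examples section. The model name myModelBySize and the threshold of 10 points are illustrative choices, not recommendations.

```sql
-- Sketch only: myModelBySize and the value 10 are hypothetical.
SELECT BISECTING_KMEANS('myModelBySize', 'iris1', '*', '5'
       USING PARAMETERS exclude_columns = 'Species,id',
                        split_method = 'size',              -- bisect the largest cluster first
                        min_divisible_cluster_size = 10);   -- clusters with fewer than 10 points are not split
```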

kmeans_max_iterations

Integer between 1 and 1MM inclusive, specifies the maximum number of iterations the k-means algorithm performs. If you set this value to a number lower than the number of iterations needed for convergence, the algorithm might not converge.

Default: 10

kmeans_epsilon

Determines whether the k-means algorithm has converged. The algorithm is considered converged when no center has moved more than a distance of epsilon from its position in the previous iteration.

Default: 1e-4
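
The two k-means stopping controls can be set together, as in the sketch below. It assumes the iris1 table from the Examples section and a hypothetical model name; kmeans_epsilon is simply restated at its documented default, while kmeans_max_iterations is raised above its default of 10 to give each k-means run more room to converge.

```sql
-- Sketch only: myModelConverge is a hypothetical model name.
SELECT BISECTING_KMEANS('myModelConverge', 'iris1', '*', '5'
       USING PARAMETERS exclude_columns = 'Species,id',
                        kmeans_max_iterations = 50,   -- upper bound on iterations per k-means run
                        kmeans_epsilon = 1e-4);       -- documented default, stated explicitly
```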

kmeans_center_init_method

The method used to find the initial cluster centers in k-means, one of:

kmeanspp (default): kmeans++ algorithm
pseudo: Uses the "pseudo center" approach used by Spark, which bisects a given center without iterating over points.

distance_method

The measure for distance between two data points. Only Euclidean distance is supported at this time.

Default: euclidean

output_view

Name of the view where you save the assignment of each point to its cluster. You must have CREATE privileges on the view schema.

key_columns

Comma-separated list of column names that identify the output rows. Columns must be in the input-columns argument list. To exclude these and other input columns from being used by the algorithm, list them in parameter exclude_columns.
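
The following sketch shows one way output_view and key_columns might be used together. It assumes that the id column of iris1 (the table from the Examples section) uniquely identifies each row; the model and view names are hypothetical. The id column appears in both key_columns and exclude_columns so that it labels output rows without influencing the clustering.

```sql
-- Sketch only: myModelKeyed and myKeyedView are hypothetical names.
SELECT BISECTING_KMEANS('myModelKeyed', 'iris1', '*', '5'
       USING PARAMETERS exclude_columns = 'Species,id',
                        key_columns = 'id',
                        output_view = 'myKeyedView');

-- Inspect a few cluster assignments saved in the view.
SELECT * FROM myKeyedView LIMIT 5;
```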

Model attributes

centers

A list of the centers (centroids) of the K clusters.

hierarchy

The hierarchy of K clusters, including:

ParentCluster: Parent cluster centroid of each centroid, that is, the centroid of the cluster from which a cluster is obtained by bisection.
LeftChildCluster: Left child cluster centroid of each centroid, that is, the centroid of the first sub-cluster obtained by bisecting a cluster.
RightChildCluster: Right child cluster centroid of each centroid, that is, the centroid of the second sub-cluster obtained by bisecting a cluster.
BisectionLevel: Specifies which bisection step a cluster is obtained from.
WithinSS: Within-cluster sum of squares for the current cluster.
TotalWithinSS: Total within-cluster sum of squares of the leaf clusters obtained thus far.

metrics

Several metrics related to the quality of the clustering, including:

Total sum of squares
Total within-cluster sum of squares
Between-cluster sum of squares
Between-cluster sum of squares / Total sum of squares
Sum of squares for cluster x, center_id y[...]
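
These attributes can be inspected after training. The sketch below assumes the general-purpose model functions GET_MODEL_SUMMARY and GET_MODEL_ATTRIBUTE, which are not described in this section, and the model myModel created in the Examples section. The attr_name values are taken from the attribute names listed above.

```sql
-- Sketch only: assumes myModel was already trained as in the Examples section.
-- Overview of the whole model.
SELECT GET_MODEL_SUMMARY(USING PARAMETERS model_name = 'myModel');

-- Individual attributes, one query per attribute.
SELECT GET_MODEL_ATTRIBUTE(USING PARAMETERS model_name = 'myModel', attr_name = 'centers');
SELECT GET_MODEL_ATTRIBUTE(USING PARAMETERS model_name = 'myModel', attr_name = 'hierarchy');
SELECT GET_MODEL_ATTRIBUTE(USING PARAMETERS model_name = 'myModel', attr_name = 'metrics');
```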

Examples

SELECT BISECTING_KMEANS('myModel', 'iris1', '*', '5'
       USING PARAMETERS exclude_columns = 'Species,id',
                        split_method = 'sum_squares',
                        output_view = 'myBKmeansView');
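
Once trained, the model can be used for prediction. The sketch below assumes the companion function APPLY_BISECTING_KMEANS, which is not described in this section, and guesses that iris1 has four numeric measurement columns with the names shown. Those column names are assumptions; substitute the actual columns the model was trained on, in the same order.

```sql
-- Sketch only: the four measurement column names are assumed, not taken from this page.
SELECT id,
       APPLY_BISECTING_KMEANS(Sepal_Length, Sepal_Width, Petal_Length, Petal_Width
                              USING PARAMETERS model_name = 'myModel') AS cluster_id
FROM iris1
LIMIT 10;
```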