BISECTING_KMEANS
Executes the bisecting k-means algorithm on an input relation. The result is a trained model with a hierarchy of cluster centers, with a range of k values, each of which can be used for prediction.
This is a meta-function. You must call meta-functions in a top-level SELECT statement.
Behavior type
Volatile
Syntax
BISECTING_KMEANS('modelname', 'inputrelation', 'inputcolumns', 'numclusters'
[ USING PARAMETERS
[exclude_columns = 'excludecolumns']
[, bisection_iterations = bisectioniterations]
[, split_method = 'splitmethod']
[, min_divisible_cluster_size = minclustersize]
[, kmeans_max_iterations = kmeansmaxiterations]
[, kmeans_epsilon = kmeansepsilon]
[, kmeans_center_init_method = 'kmeansinitmethod']
[, distance_method = 'distancemethod']
[, output_view = 'outputview']
[, key_columns = 'keycolumns'] ] )
Arguments
modelname
 Identifies the model to create, where modelname conforms to conventions described in Identifiers. It must also be unique among all names of sequences, tables, projections, views, and models within the same schema.
inputrelation
 Table or view that contains the input data for k-means. If the input relation is defined in Hive, use SYNC_WITH_HCATALOG_SCHEMA to sync the hcatalog schema, and then run the machine learning function.
inputcolumns
 Comma-separated list of columns to use from the input relation, or asterisk (*) to select all columns. Input columns must be of data type numeric.
numclusters
 Number of clusters to create, an integer ≤ 10,000. This argument represents the k in k-means.
Parameters
exclude_columns
 Comma-separated list of column names from inputcolumns to exclude from processing.
bisection_iterations
 Integer between 1 and 1MM inclusive, specifies the number of iterations the bisecting k-means algorithm performs for each bisection step. This corresponds to how many times a standalone k-means algorithm runs in each bisection step.
A setting greater than 1 lets the algorithm run k-means several times within each bisection step and choose the best run. If you use kmeanspp, the value of bisection_iterations is always 1, because kmeanspp is more costly to run but also better than the alternatives, so it does not require multiple runs.
Default: 1
split_method
 The method used to choose a cluster to bisect/split, one of:

size: Choose the largest cluster to bisect.

sum_squares: Choose the cluster with the largest within-cluster sum of squares to bisect.

Default: sum_squares

min_divisible_cluster_size
 Integer ≥ 2, specifies the minimum number of points a cluster must contain to be divisible.
Default: 2
kmeans_max_iterations
 Integer between 1 and 1MM inclusive, specifies the maximum number of iterations the k-means algorithm performs. If you set this value to a number lower than the number of iterations needed for convergence, the algorithm might not converge.
Default: 10
kmeans_epsilon
 Integer between 1 and 1MM inclusive, determines whether the k-means algorithm has converged. The algorithm is considered converged after no center has moved more than a distance of epsilon from the previous iteration.
Default: 1e-4
kmeans_center_init_method
 The method used to find the initial cluster centers in k-means, one of:

kmeanspp (default): k-means++ algorithm

pseudo: Uses the "pseudo center" approach used by Spark, which bisects the given center without iterating over points

distance_method
 The measure for distance between two data points. Only Euclidean distance is supported at this time.
Default: euclidean
output_view
 Name of the view where you save the assignment of each point to its cluster. You must have CREATE privileges on the view schema.
key_columns
 Comma-separated list of column names that identify the output rows. Columns must be in the inputcolumns argument list. To exclude these and other input columns from being used by the algorithm, list them in parameter exclude_columns.
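As a minimal sketch of how several of these parameters combine, the call below trains a model with the pseudo initialization method, so that a bisection_iterations value above 1 has an effect, and uses key_columns to label the rows of the output view. The table iris1 and the columns Species and id are taken from the example on this page; the model name myPseudoModel and the view name myPseudoView are hypothetical.

```sql
-- Hypothetical model and view names; iris1, Species, and id are assumed
-- to be the same table and columns used in the Examples section.
SELECT BISECTING_KMEANS('myPseudoModel', 'iris1', '*', '5'
    USING PARAMETERS exclude_columns = 'Species,id',          -- keep non-feature columns out of the algorithm
                     kmeans_center_init_method = 'pseudo',    -- not kmeanspp, so multiple runs are allowed
                     bisection_iterations = 5,                -- k-means runs per bisection step
                     key_columns = 'id',                      -- identifies rows in the output view
                     output_view = 'myPseudoView');
```

Because id appears in both key_columns and exclude_columns, it identifies each output row without influencing the clustering.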
Model attributes
centers
 A list of centers of the K centroids.
hierarchy
 The hierarchy of K clusters, including:

ParentCluster: Parent cluster centroid of each centroid; that is, the centroid of the cluster from which a cluster is obtained by bisection.

LeftChildCluster: Left child cluster centroid of each centroid; that is, the centroid of the first subcluster obtained by bisecting a cluster.

RightChildCluster: Right child cluster centroid of each centroid; that is, the centroid of the second subcluster obtained by bisecting a cluster.

BisectionLevel: Specifies which bisection step a cluster is obtained from.

WithinSS: Within-cluster sum of squares for the current cluster.

TotalWithinSS: Total within-cluster sum of squares of leaf clusters thus far obtained.

metrics
 Several metrics related to the quality of the clustering, including

Total sum of squares

Total within-cluster sum of squares

Between-cluster sum of squares

Between-cluster sum of squares / Total sum of squares

Sum of squares for cluster x, center_id y [...]

Examples
SELECT BISECTING_KMEANS('myModel', 'iris1', '*', '5'
USING PARAMETERS exclude_columns = 'Species,id', split_method = 'sum_squares', output_view = 'myBKmeansView');
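After training, the model attributes listed under Model attributes can be inspected. The sketch below assumes the general-purpose model-management functions GET_MODEL_SUMMARY and GET_MODEL_ATTRIBUTE, which this page does not itself describe, and reuses the model and view names from the example above.

```sql
-- Overall summary of the trained model (assumes GET_MODEL_SUMMARY is available).
SELECT GET_MODEL_SUMMARY(USING PARAMETERS model_name = 'myModel');

-- Individual attributes, using the attribute names listed under Model attributes.
SELECT GET_MODEL_ATTRIBUTE(USING PARAMETERS model_name = 'myModel', attr_name = 'centers');
SELECT GET_MODEL_ATTRIBUTE(USING PARAMETERS model_name = 'myModel', attr_name = 'hierarchy');
SELECT GET_MODEL_ATTRIBUTE(USING PARAMETERS model_name = 'myModel', attr_name = 'metrics');

-- Cluster assignments saved by the output_view parameter.
SELECT * FROM myBKmeansView LIMIT 10;
```

The hierarchy attribute is the most useful of the three when deciding which smaller value of k to use for prediction, because it shows the within-cluster sum of squares at each bisection level.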