BISECTING_KMEANS
Executes the bisecting k-means algorithm on an input relation. The result is a trained model that stores a hierarchy of cluster centers covering a range of k values, each of which can be used for prediction.
This is a meta-function. You must call meta-functions in a top-level SELECT statement.
Behavior type
Volatile
Syntax
BISECTING_KMEANS('model-name', 'input-relation', 'input-columns', 'num-clusters'
[ USING PARAMETERS
[exclude_columns = 'exclude-columns']
[, bisection_iterations = bisection-iterations]
[, split_method = 'split-method']
[, min_divisible_cluster_size = min-cluster-size]
[, kmeans_max_iterations = kmeans-max-iterations]
[, kmeans_epsilon = kmeans-epsilon]
[, kmeans_center_init_method = 'kmeans-init-method']
[, distance_method = 'distance-method']
[, output_view = 'output-view']
[, key_columns = 'key-columns'] ] )
Arguments
model-name
- Identifies the model to create, where model-name conforms to conventions described in Identifiers. It must also be unique among all names of sequences, tables, projections, views, and models within the same schema.
input-relation
- Table or view that contains the input data for k-means. If the input relation is defined in Hive, use SYNC_WITH_HCATALOG_SCHEMA to sync the hcatalog schema, and then run the machine learning function.
input-columns
- Comma-separated list of columns to use from the input relation, or asterisk (*) to select all columns. Input columns must be of data type numeric.
num-clusters
- Number of clusters to create, an integer ≤ 10,000. This argument represents the k in k-means.
Parameters
exclude_columns
- Comma-separated list of column names from input-columns to exclude from processing.
bisection_iterations
- Integer between 1 and 1MM inclusive, specifies the number of iterations the bisecting k-means algorithm performs for each bisection step. This corresponds to how many times a standalone k-means algorithm runs in each bisection step.
A setting greater than 1 allows the algorithm to run k-means several times and choose the best run within each bisection step. If you use kmeanspp, the value of bisection_iterations is always 1, because kmeanspp is more costly to run but also better than the alternatives, so it does not require multiple runs.
Default: 1
split_method
- The method used to choose a cluster to bisect/split, one of:
- size: Choose the largest cluster to bisect.
- sum_squares: Choose the cluster with the largest within-cluster sum of squares to bisect.
Default: sum_squares
min_divisible_cluster_size
- Integer ≥ 2, specifies the minimum number of points a cluster must contain to be divisible.
Default: 2
kmeans_max_iterations
- Integer between 1 and 1MM inclusive, specifies the maximum number of iterations the k-means algorithm performs. If you set this value to a number lower than the number of iterations needed for convergence, the algorithm might not converge.
Default: 10
kmeans_epsilon
- Numeric threshold that determines whether the k-means algorithm has converged. The algorithm is considered converged after no center has moved more than a distance of epsilon from the previous iteration.
Default: 1e-4
kmeans_center_init_method
- The method used to find the initial cluster centers in k-means, one of:
- kmeanspp (default): kmeans++ algorithm
- pseudo: Uses the "pseudo center" approach used by Spark, which bisects a given center without iterating over points
distance_method
- The measure for distance between two data points. Only Euclidean distance is supported at this time.
Default: euclidean
output_view
- Name of the view where you save the assignment of each point to its cluster. You must have CREATE privileges on the view schema.
key_columns
- Comma-separated list of column names that identify the output rows. Columns must be in the input-columns argument list. To exclude these and other input columns from being used by the algorithm, list them in the exclude_columns parameter.
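The split_method, min_divisible_cluster_size, and kmeans_epsilon descriptions above can be read as a selection rule and a convergence rule. The Python sketch below illustrates both under stated assumptions; it is not the database's implementation. The helper names (within_ss, choose_cluster_to_bisect, has_converged) and the list-of-tuples data layout are hypothetical and exist only for this example.

```python
import math


def within_ss(points):
    """Sum of squared Euclidean distances from each point to the centroid."""
    dims = len(points[0])
    centroid = [sum(p[d] for p in points) / len(points) for d in range(dims)]
    return sum((p[d] - centroid[d]) ** 2 for p in points for d in range(dims))


def choose_cluster_to_bisect(clusters, split_method="sum_squares",
                             min_divisible_cluster_size=2):
    """Return the index of the cluster to split next, or None if none qualifies.

    clusters is a list of clusters, each a non-empty list of numeric tuples
    (an assumed layout for this sketch).
    """
    # Only clusters with enough points are candidates for bisection.
    candidates = [i for i, c in enumerate(clusters)
                  if len(c) >= min_divisible_cluster_size]
    if not candidates:
        return None
    if split_method == "size":
        return max(candidates, key=lambda i: len(clusters[i]))
    if split_method == "sum_squares":
        return max(candidates, key=lambda i: within_ss(clusters[i]))
    raise ValueError(f"unknown split_method: {split_method!r}")


def has_converged(old_centers, new_centers, kmeans_epsilon=1e-4):
    """True when no center moved more than kmeans_epsilon since the last pass."""
    return all(math.dist(old, new) <= kmeans_epsilon
               for old, new in zip(old_centers, new_centers))


clusters = [
    [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1)],  # many points, tightly packed
    [(5.0, 5.0), (9.0, 9.0)],                          # few points, widely spread
]
print(choose_cluster_to_bisect(clusters, "size"))         # prints 0
print(choose_cluster_to_bisect(clusters, "sum_squares"))  # prints 1
```

In this toy data the two methods disagree: size picks the cluster with four points, while sum_squares picks the two-point cluster because its points lie far from their centroid.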
Model attributes
centers
- A list of the K cluster centers (centroids).
hierarchy
- The hierarchy of K clusters, including:
- ParentCluster: Parent cluster centroid of each centroid, that is, the centroid of the cluster from which a cluster is obtained by bisection.
- LeftChildCluster: Left child cluster centroid of each centroid, that is, the centroid of the first sub-cluster obtained by bisecting a cluster.
- RightChildCluster: Right child cluster centroid of each centroid, that is, the centroid of the second sub-cluster obtained by bisecting a cluster.
- BisectionLevel: Specifies which bisection step a cluster is obtained from.
- WithinSS: Within-cluster sum of squares for the current cluster.
- TotalWithinSS: Total within-cluster sum of squares of the leaf clusters obtained thus far.
metrics
- Several metrics related to the quality of the clustering, including:
- Total sum of squares
- Total within-cluster sum of squares
- Between-cluster sum of squares
- Between-cluster sum of squares / Total sum of squares
- Sum of squares for cluster x, center_id y [...]
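The sum-of-squares metrics listed above relate to each other in a simple way. The Python sketch below computes them with the standard k-means definitions; it assumes the model's metrics follow those definitions, which this page does not state explicitly. The function names and the dictionary keys are hypothetical.

```python
def centroid(points):
    """Component-wise mean of a non-empty list of numeric tuples."""
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dims))


def sum_squares(points, center):
    """Sum of squared Euclidean distances from each point to center."""
    return sum((p[d] - center[d]) ** 2
               for p in points for d in range(len(center)))


def clustering_metrics(clusters):
    """Compute the sum-of-squares metrics for a list of clusters."""
    all_points = [p for c in clusters for p in c]
    # Total SS: spread of all points around the global centroid.
    total_ss = sum_squares(all_points, centroid(all_points))
    # Within SS: spread of each cluster around its own centroid.
    per_cluster_ss = [sum_squares(c, centroid(c)) for c in clusters]
    total_within_ss = sum(per_cluster_ss)
    # Between SS: the part of the total spread explained by the clustering.
    between_ss = total_ss - total_within_ss
    return {
        "total_ss": total_ss,
        "total_within_ss": total_within_ss,
        "between_ss": between_ss,
        "between_over_total": between_ss / total_ss,
        "per_cluster_ss": per_cluster_ss,
    }


clusters = [
    [(0.0, 0.0), (2.0, 0.0)],
    [(10.0, 0.0), (12.0, 0.0)],
]
metrics = clustering_metrics(clusters)
print(metrics["total_ss"])         # prints 104.0
print(metrics["total_within_ss"])  # prints 4.0
print(metrics["between_ss"])       # prints 100.0
```

A ratio of between-cluster to total sum of squares close to 1 (here 100 / 104) indicates that most of the variation in the data is explained by the cluster assignment.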
Examples
SELECT BISECTING_KMEANS('myModel', 'iris1', '*', '5'
    USING PARAMETERS exclude_columns = 'Species,id', split_method = 'sum_squares', output_view = 'myBKmeansView');
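The statement above trains a model with 5 clusters, and the function description says the resulting hierarchy covers a range of k values. The Python sketch below shows one way such a hierarchy can serve any smaller k. It assumes that the clustering for k consists of the leaves left after the first k - 1 bisections, which this page does not confirm. The cluster IDs, the bisections list, and leaf_clusters are hypothetical and are not taken from real model output.

```python
# Each entry records one bisection step as
# (parent_id, left_child_id, right_child_id).
# Cluster 0 is the root that holds all points (assumed layout).
bisections = [
    (0, 1, 2),  # step 1: the root splits into clusters 1 and 2
    (2, 3, 4),  # step 2: cluster 2 splits into clusters 3 and 4
    (1, 5, 6),  # step 3: cluster 1 splits into clusters 5 and 6
    (4, 7, 8),  # step 4: cluster 4 splits into clusters 7 and 8
]


def leaf_clusters(bisections, k):
    """Return the sorted IDs of the k leaf clusters after k - 1 bisections."""
    max_k = len(bisections) + 1
    if not 1 <= k <= max_k:
        raise ValueError(f"k must be between 1 and {max_k}")
    leaves = [0]
    for parent, left, right in bisections[:k - 1]:
        # A bisected cluster stops being a leaf; its two children replace it.
        leaves.remove(parent)
        leaves.extend([left, right])
    return sorted(leaves)


for k in range(1, 6):
    print(k, leaf_clusters(bisections, k))
# prints:
# 1 [0]
# 2 [1, 2]
# 3 [1, 3, 4]
# 4 [3, 4, 5, 6]
# 5 [3, 5, 6, 7, 8]
```

Under this reading, a model trained with K clusters records 2K - 1 centroids in total (9 for K = 5), and a smaller k reuses the earlier bisections without retraining.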