BISECTING_KMEANS
Executes the bisecting k‑means algorithm on an input relation. The result is a trained model with a hierarchy of cluster centers, with a range of k values, each of which can be used for prediction.
This is a meta-function. You must call meta-functions in a top-level SELECT statement.
Behavior type
Volatile
Syntax
BISECTING_KMEANS('model-name', 'input-relation', 'input-columns', 'num-clusters'
[ USING PARAMETERS
[exclude_columns = 'exclude-columns']
[, bisection_iterations = bisection-iterations]
[, split_method = 'split-method']
[, min_divisible_cluster_size = min-cluster-size]
[, kmeans_max_iterations = kmeans-max-iterations]
[, kmeans_epsilon = kmeans-epsilon]
[, kmeans_center_init_method = 'kmeans-init-method']
[, distance_method = 'distance-method']
[, output_view = 'output-view']
[, key_columns = 'key-columns'] ] )
Arguments
model‑name: Identifies the model to create, where model‑name conforms to conventions described in Identifiers. It must also be unique among all names of sequences, tables, projections, views, and models within the same schema.
input‑relation: Table or view that contains the input data for k‑means. If the input relation is defined in Hive, use SYNC_WITH_HCATALOG_SCHEMA to sync the hcatalog schema, and then run the machine learning function.
input‑columns: Comma-separated list of columns to use from the input relation, or asterisk (*) to select all columns. Input columns must be of data type numeric.
num‑clusters: Number of clusters to create, an integer ≤ 10,000. This argument represents the k in k‑means.
Parameters
exclude_columns: Comma-separated list of column names from input‑columns to exclude from processing.
bisection_iterations: Integer between 1 and 1MM inclusive, specifies the number of iterations the bisecting k‑means algorithm performs for each bisection step. This corresponds to how many times a standalone k‑means algorithm runs in each bisection step.
A setting greater than 1 lets the algorithm run k‑means several times and choose the best run within each bisection step. If you use kmeanspp, the value of bisection_iterations is always 1, because kmeanspp is more costly to run but also better than the alternatives, so it does not require multiple runs.
Default: 1
split_method: The method used to choose a cluster to bisect/split, one of:
- size: Choose the largest cluster to bisect.
- sum_squares: Choose the cluster with the largest within-cluster sum of squares to bisect.
Default: sum_squares
min_divisible_cluster_size: Integer ≥ 2, specifies the minimum number of points that a cluster must contain to be divisible.
Default: 2
kmeans_max_iterations: Integer between 1 and 1MM inclusive, specifies the maximum number of iterations the k‑means algorithm performs. If you set this value lower than the number of iterations needed for convergence, the algorithm might stop before it converges.
Default: 10
kmeans_epsilon: Numeric threshold that determines whether the k‑means algorithm has converged. The algorithm is considered converged after no center has moved more than a distance of epsilon from the previous iteration.
Default: 1e-4
kmeans_center_init_method: The method used to find the initial cluster centers in k‑means, one of:
- kmeanspp (default): kmeans++ algorithm
- pseudo: Uses the "pseudo center" approach used by Spark, which bisects a given center without iterating over points
distance_method: The measure for distance between two data points. Only Euclidean distance is supported at this time.
Default: euclidean
output_view: Name of the view where you save the assignment of each point to its cluster. You must have CREATE privileges on the view schema.
key_columns: Comma-separated list of column names that identify the output rows. Columns must be in the input‑columns argument list. To exclude these and other input columns from being used by the algorithm, list them in parameter exclude_columns.
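As an illustration of how these parameters combine, the following sketch trains a model with the pseudo initialization method, several k‑means runs per bisection step, and an output view keyed by an identifier column. The model name bk_iris_pseudo and the view name bk_iris_assignments are hypothetical, and the sketch assumes that table iris1 contains an id column and a Species column, as the call in the Examples section suggests. The parameter values are arbitrary and chosen only to show the syntax.

```sql
-- Hypothetical sketch: train with non-default parameters.
-- 'id' identifies the output rows (key_columns) and, together with
-- 'Species', is excluded from the columns used for clustering.
SELECT BISECTING_KMEANS('bk_iris_pseudo', 'iris1', '*', '5'
    USING PARAMETERS exclude_columns = 'Species,id',
                     key_columns = 'id',
                     kmeans_center_init_method = 'pseudo',
                     bisection_iterations = 5,
                     split_method = 'size',
                     min_divisible_cluster_size = 10,
                     kmeans_max_iterations = 20,
                     output_view = 'bk_iris_assignments');

-- Inspect the saved assignment of each point to its cluster.
SELECT * FROM bk_iris_assignments LIMIT 10;
```

Because kmeans_center_init_method is set to pseudo here, a bisection_iterations value greater than 1 is meaningful. With kmeanspp the value would always be 1.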
Model attributes
centers: A list of centers of the K centroids.
hierarchy: The hierarchy of K clusters, including:
- ParentCluster: Parent cluster centroid of each centroid, that is, the centroid of the cluster from which a cluster is obtained by bisection.
- LeftChildCluster: Left child cluster centroid of each centroid, that is, the centroid of the first sub-cluster obtained by bisecting a cluster.
- RightChildCluster: Right child cluster centroid of each centroid, that is, the centroid of the second sub-cluster obtained by bisecting a cluster.
- BisectionLevel: Specifies which bisection step a cluster is obtained from.
- WithinSS: Within-cluster sum of squares for the current cluster.
- TotalWithinSS: Total within-cluster sum of squares of leaf clusters thus far obtained.
metrics: Several metrics related to the quality of the clustering, including:
- Total sum of squares
- Total within-cluster sum of squares
- Between-cluster sum of squares
- Between-cluster sum of squares / Total sum of squares
- Sum of squares for cluster x, center_id y[...]
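One way to read these attributes back from a trained model is the GET_MODEL_ATTRIBUTE function. The sketch below assumes a model named myModel already exists, such as the one created by the call in the Examples section. The attribute names are taken from the list above.

```sql
-- List the attributes that the model stores.
SELECT GET_MODEL_ATTRIBUTE(USING PARAMETERS model_name = 'myModel');

-- Retrieve individual attributes by name.
SELECT GET_MODEL_ATTRIBUTE(USING PARAMETERS model_name = 'myModel', attr_name = 'centers');
SELECT GET_MODEL_ATTRIBUTE(USING PARAMETERS model_name = 'myModel', attr_name = 'hierarchy');
SELECT GET_MODEL_ATTRIBUTE(USING PARAMETERS model_name = 'myModel', attr_name = 'metrics');
```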
Examples
SELECT BISECTING_KMEANS('myModel', 'iris1', '*', '5'
USING PARAMETERS exclude_columns = 'Species,id', split_method = 'sum_squares', output_view = 'myBKmeansView');
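After training, the model and its output view can be inspected, and the model can be applied to rows for prediction. The following is a sketch under stated assumptions: the column names Sepal_Length, Sepal_Width, Petal_Length, and Petal_Width are hypothetical stand-ins for the numeric columns of iris1, which this topic does not list, and the cluster_id alias is chosen here for readability.

```sql
-- Summarize the trained model.
SELECT GET_MODEL_SUMMARY(USING PARAMETERS model_name = 'myModel');

-- Look at the cluster assignments saved by output_view.
SELECT * FROM myBKmeansView LIMIT 10;

-- Assign rows to clusters with the trained model.
-- Column names are assumed; pass the same numeric columns, in the same
-- order, that were used for training.
SELECT id,
       APPLY_BISECTING_KMEANS(Sepal_Length, Sepal_Width, Petal_Length, Petal_Width
           USING PARAMETERS model_name = 'myModel') AS cluster_id
FROM iris1;
```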