KPROTOTYPES

Executes the k-prototypes algorithm on an input relation. The result is a model with a list of cluster centers.

This is a meta-function. You must call meta-functions in a top-level SELECT statement.

Syntax

SELECT KPROTOTYPES ('model-name', 'input-relation', 'input-columns', num-clusters
                [USING PARAMETERS [exclude_columns = 'exclude-columns']
                [, max_iterations = 'max-iterations']
                [, epsilon = epsilon]
                [, {[init_method = 'init-method'] } | { initial_centers_table = 'init-table' } ]
                [, gamma = 'gamma']
                [, output_view = 'output-view']
                [, key_columns = 'key-columns']]);

Behavior type

Volatile

Arguments

model-name
Name of the model resulting from the training.
input-relation
Name of the table or view containing the training samples.
input-columns
String containing a comma-separated list of columns to use from the input-relation, or asterisk (*) to select all columns.
num-clusters
Integer ≤ 10,000 representing the number of clusters to create. This argument represents the k in k-prototypes.

Parameters

exclude_columns
String containing a comma-separated list of column names from input-columns to exclude from processing.

Default: (empty)

max_iterations
Integer ≤ 1M representing the maximum number of iterations the algorithm performs.

Default: Integer ≤ 1M

epsilon
Float that determines whether the algorithm has converged.

Default: 1e-4

init_method
String specifying the method used to find the initial k-prototypes cluster centers.

Default: "random"

initial_centers_table
Name of the table containing the initial cluster centers to use. Specify this parameter as an alternative to init_method.
gamma
Float between 0 and 10000 specifying the weighting factor for categorical columns. It determines the relative importance of numerical and categorical attributes.

Default: Inferred from data.

output_view
Name of the view in which to save the assignment of each point to its cluster.
key_columns
Comma-separated list of column names that identify the output rows. These columns must also appear in the input-columns argument list.
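
To illustrate what the gamma parameter controls, the following Python sketch computes the textbook k-prototypes dissimilarity: the squared Euclidean distance over numeric attributes plus gamma times the number of mismatched categorical attributes. This is an assumption based on the standard formulation of the algorithm, not a description of the function's internal implementation, which may normalize or weight values differently. The helper names mixed_dissimilarity and nearest_prototype and the sample prototypes are hypothetical.

```python
from typing import List, Sequence, Tuple

Prototype = Tuple[Sequence[float], Sequence[str]]


def mixed_dissimilarity(
    num_a: Sequence[float],
    cat_a: Sequence[str],
    num_b: Sequence[float],
    cat_b: Sequence[str],
    gamma: float,
) -> float:
    """Textbook k-prototypes dissimilarity (assumed, not the database's internals):
    squared Euclidean distance on numeric attributes plus
    gamma times the count of categorical mismatches."""
    numeric_part = sum((x - y) ** 2 for x, y in zip(num_a, num_b))
    categorical_part = sum(1 for x, y in zip(cat_a, cat_b) if x != y)
    return numeric_part + gamma * categorical_part


def nearest_prototype(
    num: Sequence[float],
    cat: Sequence[str],
    prototypes: List[Prototype],
    gamma: float,
) -> int:
    """Return the index of the prototype with the smallest dissimilarity."""
    distances = [
        mixed_dissimilarity(num, cat, p_num, p_cat, gamma)
        for p_num, p_cat in prototypes
    ]
    return distances.index(min(distances))


# Hypothetical prototypes: (numeric values, categorical values).
prototypes = [([20.0], ["Canada"]), ([30.0], ["US"])]
point_num, point_cat = [24.0], ["US"]

# Small gamma: the numeric gap dominates (17 vs. 36), so prototype 0 wins.
low = nearest_prototype(point_num, point_cat, prototypes, gamma=1.0)
# Large gamma: the country mismatch dominates (116 vs. 36), so prototype 1 wins.
high = nearest_prototype(point_num, point_cat, prototypes, gamma=100.0)
print(low, high)  # prints "0 1"
```

Under this formulation, a gamma near 0 makes the clustering depend almost entirely on the numeric columns, while a large gamma makes matching categories the deciding factor.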

Examples

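The following example trains the k-prototypes model small_model on the input table small_test_mixed, taking the initial cluster centers from the table small_test_mixed_centers, and then applies the model to the same table with APPLY_KPROTOTYPES: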

=> SELECT KPROTOTYPES('small_model', 'small_test_mixed', 'x0, country', 3 USING PARAMETERS initial_centers_table='small_test_mixed_centers', key_columns='pid');
      KPROTOTYPES
---------------------------
Finished in 2 iterations

(1 row)

=> SELECT country, x0, APPLY_KPROTOTYPES(country, x0
USING PARAMETERS model_name='small_model')
FROM small_test_mixed;
  country   | x0  | apply_kprototypes
------------+-----+-------------------
 'China'    |  20 |                 0
 'US'       |  85 |                 2
 'Russia'   |  80 |                 1
 'Brazil'   |  78 |                 1
 'US'       |  23 |                 0
 'US'       |  50 |                 0
 'Canada'   |  24 |                 0
 'Canada'   |  18 |                 0
 'Russia'   |  90 |                 2
 'Russia'   |  98 |                 2
 'Brazil'   |  89 |                 2
...
(45 rows)

See also