KPROTOTYPES

Executes the k-prototypes algorithm on an input relation.

Executes the k-prototypes algorithm on an input relation. The result is a model with a list of cluster centers.

This is a meta-function. You must call meta-functions in a top-level SELECT statement.

Syntax

SELECT KPROTOTYPES ('`*`model-name`*`', '`*`input-relation`*`', '`*`input-columns`*`', `*`num-clusters`*`
                [USING PARAMETERS [exclude_columns = '`*`exclude-columns`*`']
                [, max_iterations = '`*`max-iterations`*`']
                [, epsilon = `*`epsilon`*`]
                [, {[init_method = '`*`init-method`*`'] } | { initial_centers_table = '`*`init-table`*`' } ]
                [, gamma = '`*`gamma`*`']
                [, output_view = '`*`output-view`*`']
                [, key_columns = '`*`key-columns`*`']]);

Behavior type

Volatile

Arguments

model-name: Name of the model resulting from the training.
input-relation: Name of the table or view containing the training samples.
input-columns: String containing a comma-separated list of columns to use from the input-relation, or asterisk (*) to select all columns.
num-clusters: Integer ≤ 10,000 representing the number of clusters to create. This argument represents the k in k-prototypes.

Parameters

exclude-columns: String containing a comma-separated list of column names from input-columns to exclude from processing.
Default: (empty)
max_iterations: Integer ≤ 1M representing the maximum number of iterations the algorithm performs.
Default: Integer ≤ 1M
epsilon: Integer which determines whether the algorithm has converged.
Default: 1e-4
init_method: String specifying the method used to find the initial k-prototypes cluster centers.
Default: "random"
initial_centers_table: The table with the initial cluster centers to use.
gamma: Float between 0 and 10000 specifying the weighing factor for categorical columns. It can determine relative importance of numerical and categorical attributes
Default: Inferred from data.
output_view: The name of the view where you save the assignments of each point to its cluster
key_columns: Comma-separated list of column names that identify the output rows. Columns must be in the input-columns argument list

Examples

The following example creates k-prototypes model small_model and applies it to input table small_test_mixed:

=> SELECT KPROTOTYPES('small_model_initcenters', 'small_test_mixed', 'x0, country', 3 USING PARAMETERS initial_centers_table='small_test_mixed_centers', key_columns='pid');
      KPROTOTYPES
---------------------------
Finished in 2 iterations

(1 row)

=> SELECT country, x0, APPLY_KPROTOTYPES(country, x0
USING PARAMETERS model_name='small_model')
FROM small_test_mixed;
  country   | x0  | apply_kprototypes
------------+-----+-------------------
 'China'    |  20 |                 0
 'US'       |  85 |                 2
 'Russia'   |  80 |                 1
 'Brazil'   |  78 |                 1
 'US'       |  23 |                 0
 'US'       |  50 |                 0
 'Canada'   |  24 |                 0
 'Canada'   |  18 |                 0
 'Russia'   |  90 |                 2
 'Russia'   |  98 |                 2
 'Brazil'   |  89 |                 2
...
(45 rows)