XGB_CLASSIFIER

Trains an XGBoost model for classification on an input relation.

Trains an XGBoost model for classification on an input relation.

This is a meta-function. You must call meta-functions in a top-level SELECT statement.

Behavior type

Volatile

Syntax

XGB_CLASSIFIER ('model-name', 'input-relation', 'response-column', 'predictor-columns'
        [ USING PARAMETERS param=value[,...] ] )

Arguments

model-name

Name of the model (case-insensitive).

input-relation
The table or view that contains the training samples. If the input relation is defined in Hive, use SYNC_WITH_HCATALOG_SCHEMA to sync the hcatalog schema, and then run the machine learning function.
response-column
An input column of type CHAR or VARCHAR that represents the dependent variable or outcome.
predictor-columns
Comma-separated list of columns to use from the input relation, or asterisk (*) to select all columns. Columns must be of data types CHAR, VARCHAR, BOOL, INT, or FLOAT.

Columns of type CHAR, VARCHAR, and BOOL are treated as categorical features; all others are treated as numeric features.

Vertica XGBoost and Random Forest algorithms offer native support for categorical columns (BOOL/VARCHAR). Simply pass the categorical columns as predictors to the models and the algorithm will automatically treat the columns as categorical and will not attempt to split them into bins in the same manner as numerical columns; Vertica treats these columns as true categorical values and does not simply cast them to continuous values under-the-hood.

Parameters

exclude_columns

Comma-separated list of column names from input-columns to exclude from processing.

max_ntree
Integer in the range [1,1000] that sets the maximum number of trees to create.

Default: 10

max_depth
Integer in the range [1,40] that specifies the maximum depth of each tree.

Default: 6

objective
The objective/loss function used to iteratively improve the model. 'crossentropy' is the only option.

Default: 'crossentropy'

split_proposal_method
The splitting strategy for the feature columns. 'global' is the only option. This method calculates the split for each feature column only at the beginning of the algorithm. The feature columns are split into the number of bins specified by nbins.

Default: 'global'

learning_rate
Float in the range (0,1] that specifies the weight for each tree's prediction. Setting this parameter can reduce each tree's impact and thereby prevent earlier trees from monopolizing improvements at the expense of contributions from later trees.

Default: 0.3

min_split_loss
Float in the range [0,1000] that specifies the minimum amount of improvement each split must achieve on the model's objective function value to avoid being pruned.

If set to 0 or omitted, no minimum is set. In this case, trees are pruned according to positive or negative objective function values.

Default: 0.0 (disable)

weight_reg
Float in the range [0,1000] that specifies the regularization term applied to the weights of classification tree leaves. The higher the setting, the sparser or smoother the weights are, which can help prevent over-fitting.

Default: 1.0

nbins
Integer in the range (1,1000] that specifies the number of bins to use for finding splits in each column. More bins leads to longer runtime but more fine-grained and possibly better splits.

Default: 32

sampling_size
Float in the range (0,1] that specifies the fraction of rows to use in each training iteration.

A value of 1 indicates that all rows are used.

Default: 1.0

col_sample_by_tree
Float in the range (0,1] that specifies the fraction of columns (features), chosen at random, to use when building each tree.

A value of 1 indicates that all columns are used.

col_sample_by parameters "stack" on top of each other if several are specified. That is, given a set of 24 columns, for col_sample_by_tree=0.5 andcol_sample_by_node=0.5,col_sample_by_tree samples 12 columns, reducing the available, unsampled column pool to 12. col_sample_by_node then samples half of the remaining pool, so each node samples 6 columns.

This algorithm will always sample at least one column.

Default: 1

col_sample_by_node
Float in the range (0,1] that specifies the fraction of columns (features), chosen at random, to use when evaluating each split.

A value of 1 indicates that all columns are used.

col_sample_by parameters "stack" on top of each other if several are specified. That is, given a set of 24 columns, for col_sample_by_tree=0.5 andcol_sample_by_node=0.5,col_sample_by_tree samples 12 columns, reducing the available, unsampled column pool to 12. col_sample_by_node then samples half of the remaining pool, so each node samples 6 columns.

This algorithm will always sample at least one column.

Default: 1

Examples

See XGBoost for classification.