XGB_CLASSIFIER
Trains an XGBoost model for classification on an input relation.
This is a meta-function. You must call meta-functions in a top-level SELECT statement.
Behavior type
VolatileSyntax
XGB_CLASSIFIER ('model-name', 'input-relation', 'response-column', 'predictor-columns'
[ USING PARAMETERS param=value[,...] ] )
Arguments
model-name
Name of the model (case-insensitive).
input-relation
- The table or view that contains the training samples. If the input relation is defined in Hive, use
SYNC_WITH_HCATALOG_SCHEMA
to sync thehcatalog
schema, and then run the machine learning function. response-column
- An input column of type CHAR or VARCHAR that represents the dependent variable or outcome.
predictor-columns
- Comma-separated list of columns to use from the input relation, or asterisk (*) to select all columns. Columns must be of data types CHAR, VARCHAR, BOOL, INT, or FLOAT.
Columns of type CHAR, VARCHAR, and BOOL are treated as categorical features; all others are treated as numeric features.
Vertica XGBoost and Random Forest algorithms offer native support for categorical columns (BOOL/VARCHAR). Simply pass the categorical columns as predictors to the models and the algorithm will automatically treat the columns as categorical and will not attempt to split them into bins in the same manner as numerical columns; Vertica treats these columns as true categorical values and does not simply cast them to continuous values under-the-hood.
Parameters
exclude_columns
Comma-separated list of column names from
input-columns
to exclude from processing.max_ntree
- Integer in the range [1,1000] that sets the maximum number of trees to create.
Default: 10
max_depth
- Integer in the range [1,40] that specifies the maximum depth of each tree.
Default: 6
objective
- The objective/loss function used to iteratively improve the model. 'crossentropy' is the only option.
Default: 'crossentropy'
split_proposal_method
- The splitting strategy for the feature columns. 'global' is the only option. This method calculates the split for each feature column only at the beginning of the algorithm. The feature columns are split into the number of bins specified by
nbins
.Default: 'global'
learning_rate
- Float in the range (0,1] that specifies the weight for each tree's prediction. Setting this parameter can reduce each tree's impact and thereby prevent earlier trees from monopolizing improvements at the expense of contributions from later trees.
Default: 0.3
min_split_loss
- Float in the range [0,1000] that specifies the minimum amount of improvement each split must achieve on the model's objective function value to avoid being pruned.
If set to 0 or omitted, no minimum is set. In this case, trees are pruned according to positive or negative objective function values.
Default: 0.0 (disable)
weight_reg
- Float in the range [0,1000] that specifies the regularization term applied to the weights of classification tree leaves. The higher the setting, the sparser or smoother the weights are, which can help prevent over-fitting.
Default: 1.0
nbins
- Integer in the range (1,1000] that specifies the number of bins to use for finding splits in each column. More bins leads to longer runtime but more fine-grained and possibly better splits.
Default: 32
sampling_size
- Float in the range (0,1] that specifies the fraction of rows to use in each training iteration.
A value of 1 indicates that all rows are used.
Default: 1.0
col_sample_by_tree
- Float in the range (0,1] that specifies the fraction of columns (features), chosen at random, to use when building each tree.
A value of 1 indicates that all columns are used.
col_sample_by
parameters "stack" on top of each other if several are specified. That is, given a set of 24 columns, forcol_sample_by_tree=0.5
andcol_sample_by_node=0.5
,col_sample_by_tree
samples 12 columns, reducing the available, unsampled column pool to 12.col_sample_by_node
then samples half of the remaining pool, so each node samples 6 columns.This algorithm will always sample at least one column.
Default: 1
col_sample_by_node
- Float in the range (0,1] that specifies the fraction of columns (features), chosen at random, to use when evaluating each split.
A value of 1 indicates that all columns are used.
col_sample_by
parameters "stack" on top of each other if several are specified. That is, given a set of 24 columns, forcol_sample_by_tree=0.5
andcol_sample_by_node=0.5
,col_sample_by_tree
samples 12 columns, reducing the available, unsampled column pool to 12.col_sample_by_node
then samples half of the remaining pool, so each node samples 6 columns.This algorithm will always sample at least one column.
Default: 1