# XGB_CLASSIFIER

Trains an XGBoost model for classification on an input relation.

This is a meta-function. You must call meta-functions in a top-level SELECT statement.

## Behavior type

Volatile

## Syntax

```
XGB_CLASSIFIER ('model-name', 'input-relation', 'response-column', 'predictor-columns'
[ USING PARAMETERS
[ exclude_columns = 'excluded-columns' ]
[, max_ntree = max-trees ]
[, max_depth = max-depth ]
[, objective = 'optimization-strategy' ]
[, learning_rate = learning-rate ]
[, min_split_loss = minimum ]
[, weight_reg = regularization ]
[, nbins = num-bins ]
[, sampling_size = fraction-of-rows ]
[, col_sample_by_tree = sample-ratio-per-tree ]
[, col_sample_by_node = sample-ratio-per-node ]
] )
```
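As a minimal sketch of the syntax above, the following statement trains a model from a top-level SELECT. The model name `xgb_iris`, the table `iris`, and its column names are hypothetical and stand in for your own schema; the parameter values are illustrative, not recommendations.

```sql
-- Hypothetical names: model 'xgb_iris', table 'iris' with a VARCHAR column
-- Species and numeric columns Sepal_Length, Sepal_Width, Petal_Length, Petal_Width.
SELECT XGB_CLASSIFIER ('xgb_iris', 'iris', 'Species',
                       'Sepal_Length, Sepal_Width, Petal_Length, Petal_Width'
    USING PARAMETERS max_ntree = 10,       -- at most 10 trees
                     max_depth = 5,        -- each tree at most 5 levels deep
                     learning_rate = 0.3,  -- weight of each tree's prediction
                     weight_reg = 1.0);    -- regularization on leaf weights
```

Any parameter left out of the `USING PARAMETERS` clause takes the default listed under Parameters.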

## Arguments

`model-name`

Name of the model (case-insensitive).

`input-relation`

The table or view that contains the training samples. If the input relation is defined in Hive, use `SYNC_WITH_HCATALOG_SCHEMA` to sync the `hcatalog` schema, and then run the machine learning function.

`response-column`

An input column of type CHAR or VARCHAR that represents the dependent variable or outcome.

`predictor-columns`

Comma-separated list of columns to use from the input relation, or asterisk (*) to select all columns. Columns must be of data types CHAR, VARCHAR, BOOL, INT, or FLOAT.

Columns of type CHAR, VARCHAR, and BOOL are treated as categorical features; all others are treated as numeric features.

Vertica XGBoost and Random Forest algorithms offer native support for categorical columns (BOOL/VARCHAR). Pass the categorical columns as predictors to the model; the algorithm automatically treats them as categorical and does not attempt to split them into bins in the same manner as numerical columns. Vertica treats these columns as true categorical values and does not simply cast them to continuous values under the hood.
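To illustrate how categorical and numeric predictors can be mixed without any manual encoding, here is a sketch built on a hypothetical `customers` table. The table, its columns, and the model name `xgb_churn` are assumptions made for this example only.

```sql
-- Hypothetical table, for illustration only.
CREATE TABLE customers (
    customer_id   INT,
    plan_type     VARCHAR(20),   -- categorical predictor
    is_active     BOOLEAN,       -- categorical predictor
    age           INT,           -- numeric predictor
    monthly_spend FLOAT,         -- numeric predictor
    churned       VARCHAR(3)     -- response column, for example 'yes' / 'no'
);

-- plan_type and is_active are passed as-is, with no one-hot or integer
-- encoding step; per the text above they are handled as categorical features.
SELECT XGB_CLASSIFIER ('xgb_churn', 'customers', 'churned',
                       'plan_type, is_active, age, monthly_spend');
```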

## Parameters

`exclude_columns`

Comma-separated list of column names from `input-columns` to exclude from processing.
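A common way to use `exclude_columns` is together with the asterisk form of `predictor-columns`. The sketch below assumes a hypothetical table `customers` whose columns include an identifier `customer_id` and a VARCHAR response column `churned`; excluding the response column and the identifier here is this example's choice, not a requirement stated on this page.

```sql
-- Hypothetical names: model 'xgb_churn_all', table 'customers'.
-- '*' selects every column; exclude_columns removes the row identifier
-- and the response column from the set of predictors.
SELECT XGB_CLASSIFIER ('xgb_churn_all', 'customers', 'churned', '*'
    USING PARAMETERS exclude_columns = 'customer_id, churned');
```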

`max_ntree`

Integer in the range [1,1000] that sets the maximum number of trees to create.

**Default:** 10

`max_depth`

Integer in the range [1,20] that specifies the maximum depth of each tree.

**Default:** 6

`objective`

The objective/loss function used to iteratively improve the model. Currently, `'crossentropy'` is the only option.

**Default:** `'crossentropy'`

`learning_rate`

Float in the range (0,1] that specifies the weight for each tree's prediction. Setting this parameter can reduce each tree's impact and thereby prevent earlier trees from monopolizing improvements at the expense of contributions from later trees.

**Default:** 0.3

`min_split_loss`

Float in the range [0,1000] that specifies the minimum amount of improvement each split must achieve on the model's objective function value to avoid being pruned.

If set to 0 or omitted, no minimum is set. In this case, trees are pruned according to positive or negative objective function values.

**Default:** 0.0 (disable)

`weight_reg`

Float in the range [0,1000] that specifies the regularization term applied to the weights of classification tree leaves. The higher the setting, the sparser or smoother the weights are, which can help prevent over-fitting.

**Default:** 1.0

`nbins`

Integer in the range (1,1000] that specifies the number of bins to use for finding splits in each column. More bins lead to longer runtime but more fine-grained and possibly better splits.

**Default:** 32

`sampling_size`

Float in the range (0,1] that specifies the fraction of rows to use in each training iteration.

A value of 1 indicates that all rows are used.

**Default:** 1.0

`col_sample_by_tree`

Float in the range (0,1] that specifies the fraction of columns (features), chosen at random, to use when building each tree.

A value of 1 indicates that all columns are used.

`col_sample_by` parameters "stack" on top of each other if several are specified. That is, given a set of 24 columns, for `col_sample_by_tree=0.5` and `col_sample_by_node=0.5`, `col_sample_by_tree` samples 12 columns, reducing the available, unsampled column pool to 12. `col_sample_by_node` then samples half of the remaining pool, so each node samples 6 columns.

This algorithm will always sample at least one column.

**Default:** 1

`col_sample_by_node`

Float in the range (0,1] that specifies the fraction of columns (features), chosen at random, to use when evaluating each split.

A value of 1 indicates that all columns are used.

`col_sample_by` parameters "stack" on top of each other if several are specified. That is, given a set of 24 columns, for `col_sample_by_tree=0.5` and `col_sample_by_node=0.5`, `col_sample_by_tree` samples 12 columns, reducing the available, unsampled column pool to 12. `col_sample_by_node` then samples half of the remaining pool, so each node samples 6 columns.

This algorithm will always sample at least one column.

**Default:** 1
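The stacking of the `col_sample_by` parameters can be checked with plain arithmetic. The query below reproduces the 24-column example; rounding down and clamping to a minimum of 1 are assumptions of this sketch, since this page states only the worked example and the one-column minimum, not the exact rounding rule. The second statement shows how the two parameters might be passed together; the model name `xgb_wide`, the table `wide_table`, and the column `label` are hypothetical.

```sql
-- Columns available per tree, then per node, for 24 predictors and both
-- sampling ratios set to 0.5. Rounding behavior is assumed, not documented here.
SELECT GREATEST(1, FLOOR(24 * 0.5)) AS columns_per_tree,
       GREATEST(1, FLOOR(GREATEST(1, FLOOR(24 * 0.5)) * 0.5)) AS columns_per_node;

-- Hypothetical training call that applies both sampling ratios.
SELECT XGB_CLASSIFIER ('xgb_wide', 'wide_table', 'label', '*'
    USING PARAMETERS exclude_columns = 'label',
                     col_sample_by_tree = 0.5,
                     col_sample_by_node = 0.5);
```

For these inputs the first query yields 12 columns per tree and 6 columns per node, matching the example in the parameter descriptions.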