RF_CLASSIFIER
Trains a random forest model for classification on an input relation.
This is a meta-function. You must call meta-functions in a top-level SELECT statement.
Behavior type
Volatile
Syntax
RF_CLASSIFIER ( 'model-name', input-relation, 'response-column', 'predictor-columns'
[ USING PARAMETERS
[exclude_columns = 'excluded-columns']
[, ntree = num-trees]
[, mtry = num-features]
[, sampling_size = sampling-size]
[, max_depth = depth]
[, max_breadth = breadth]
[, min_leaf_size = leaf-size]
[, min_info_gain = threshold]
[, nbins = num-bins] ] )
Arguments
model-name
- Identifies the model stored as a result of the training, where model-name conforms to conventions described in Identifiers. It must also be unique among all names of sequences, tables, projections, views, and models within the same schema.
input-relation
- The table or view that contains the training samples. If the input relation is defined in Hive, use SYNC_WITH_HCATALOG_SCHEMA to sync the hcatalog schema, and then run the machine learning function.
response-column
- An input column of type numeric, CHAR/VARCHAR, or BOOLEAN that represents the dependent variable.
Note
Vertica automatically casts numeric response column values to VARCHAR.
predictor-columns
Comma-separated list of columns in the input relation that represent independent variables for the model, or asterisk (*) to select all columns. If you select all columns, the argument list for parameter exclude_columns must include response-column, and any columns that are invalid as predictor columns.
All predictor columns must be of type numeric, CHAR/VARCHAR, or BOOLEAN; otherwise the model is invalid.
Vertica XGBoost and Random Forest algorithms offer native support for categorical columns (BOOL/VARCHAR). Pass the categorical columns as predictors to the model; the algorithm automatically treats them as categorical and does not split them into bins the way it does numerical columns. Vertica treats these columns as true categorical values and does not cast them to continuous values under the hood.
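As an illustration of the asterisk form described above, the following sketch trains on every column of the iris table except those named in exclude_columns. It assumes the table has a row-identifier column named id; that column and the model name myRFModelAll are hypothetical and chosen only for the example.

```sql
-- Sketch: select all columns as predictors with '*'.
-- exclude_columns must then list the response column (Species) and any
-- column that is not a valid predictor. The id column is an assumption
-- about the table, not something stated in this reference.
SELECT RF_CLASSIFIER ('myRFModelAll', 'iris', 'Species', '*'
    USING PARAMETERS exclude_columns='Species, id');
```

Listing the predictors explicitly, as in the Examples section, avoids the need for exclude_columns altogether.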
Parameters
exclude_columns
Comma-separated list of column names from input-columns to exclude from processing.
ntree
Integer in the range [1, 1000], the number of trees in the forest.
Default: 20
mtry
- Integer in the range [1, number-predictors], the number of randomly chosen features from which to pick the best feature to split on a given tree node.
Default: Square root of the total number of predictors
sampling_size
Float in the range (0.0, 1.0], the portion of the input data set that is randomly picked for training each tree.
Default: 0.632
max_depth
Integer in the range [1, 100], the maximum depth for growing each tree. For example, a max_depth of 0 represents a tree with only a root node, and a max_depth of 2 represents a tree with four leaf nodes.
Default: 5
max_breadth
Integer in the range [1, 1e9], the maximum number of leaf nodes a tree can have.
Default: 32
min_leaf_size
Integer in the range [1, 1e6], the minimum number of samples each branch must have after splitting a node. A split that results in fewer remaining samples in its left or right branch is discarded, and the node is treated as a leaf node.
Default: 1
min_info_gain
Float in the range [0.0, 1.0), the minimum threshold for including a split. A split with information gain less than this threshold is discarded.
Default: 0.0
nbins
Integer in the range [2, 1000], the number of bins to use for discretizing continuous features.
Default: 32
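The parameters above can be combined in a single call. The sketch below sets several of them at once; every value lies inside its documented range, but the specific numbers and the model name myRFModelTuned are illustrative assumptions, not tuning recommendations.

```sql
-- Sketch: override several training parameters in one call.
-- With four predictors, mtry must fall in [1, 4]; 3 is used here only
-- to show a non-default value (the default would be sqrt(4) = 2).
SELECT RF_CLASSIFIER ('myRFModelTuned', 'iris', 'Species',
    'Sepal_Length, Sepal_Width, Petal_Length, Petal_Width'
    USING PARAMETERS ntree=50,
                     mtry=3,
                     sampling_size=0.5,
                     max_depth=8,
                     max_breadth=64,
                     min_leaf_size=5,
                     min_info_gain=0.01,
                     nbins=64);
```

Parameters that are omitted keep the defaults listed above.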
Model attributes
data
- Data for the function, including:
  - predictorNames: The names of the predictors in the same order they were specified for training the model.
  - predictorTypes: The types of the predictors in the same order as their names in predictorNames.
ntree
- Number of trees in the model.
skippedRows
- Number of rows in input_relation that were skipped because they contained an invalid value.
processedRows
- Total number of rows in input_relation minus skippedRows.
callStr
- Value of all input arguments that were specified at the time the function was called.
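One way to inspect these attributes after training is through Vertica's model-inspection functions. The sketch below assumes a model named myRFModel already exists. The attribute names that GET_MODEL_ATTRIBUTE reports can differ from the labels listed above, so the second query lists them rather than assuming a particular name.

```sql
-- Sketch: print a readable summary of the trained model.
SELECT GET_MODEL_SUMMARY (USING PARAMETERS model_name='myRFModel');

-- Sketch: list the attributes stored with the model. Calling the function
-- without attr_name returns the available attribute names, which can then
-- be passed back through the attr_name parameter.
SELECT GET_MODEL_ATTRIBUTE (USING PARAMETERS model_name='myRFModel');
```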
Examples
=> SELECT RF_CLASSIFIER ('myRFModel', 'iris', 'Species', 'Sepal_Length, Sepal_Width,
Petal_Length, Petal_Width' USING PARAMETERS ntree=100, sampling_size=0.3);
RF_CLASSIFIER
--------------------------------------------------
Finished training
(1 row)
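Once training finishes, the model can be applied with PREDICT_RF_CLASSIFIER. The following is a minimal sketch that scores the same iris table used for training, which is an assumption made for brevity; in practice a separate test table would normally be used.

```sql
-- Sketch: compare the stored label with the model's prediction.
-- The predictor columns are passed in the same order used for training.
SELECT Species AS actual,
       PREDICT_RF_CLASSIFIER (Sepal_Length, Sepal_Width, Petal_Length, Petal_Width
           USING PARAMETERS model_name='myRFModel') AS predicted
FROM iris
LIMIT 10;
```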