RF_REGRESSOR

Trains a random forest model for regression on an input relation.

This is a meta-function. You must call meta-functions in a top-level SELECT statement.

Behavior type

Syntax

RF_REGRESSOR ( 'model-name', input-relation, 'response-column', 'predictor-columns'
        [ USING PARAMETERS
              [exclude_columns = 'excluded-columns']
              [, ntree = num-trees]
              [, mtry = num-features]
              [, sampling_size = sampling-size]
              [, max_depth = depth]
              [, max_breadth = breadth]
              [, min_leaf_size = leaf-size]
              [, min_info_gain = threshold]
              [, nbins = num-bins] ] )

Arguments

model-name

The model that is stored as a result of training, where model-name conforms to conventions described in Identifiers. It must also be unique among all names of sequences, tables, projections, views, and models within the same schema.

input-relation

The table or view that contains the training samples. If the input relation is defined in Hive, use SYNC_WITH_HCATALOG_SCHEMA to sync the hcatalog schema, and then run the machine learning function.

response-column

A numeric input column that represents the dependent variable.

predictor-columns

Comma-separated list of columns in the input relation that represent independent variables for the model, or asterisk (*) to select all columns. If you select all columns, the argument list for parameter exclude_columns must include response-column, and any columns that are invalid as predictor columns.

All predictor columns must be of type numeric, CHAR/VARCHAR, or BOOLEAN; otherwise the model is invalid.

Vertica XGBoost and Random Forest algorithms offer native support for categorical columns (BOOL/VARCHAR). Simply pass the categorical columns as predictors to the models and the algorithm will automatically treat the columns as categorical and will not attempt to split them into bins in the same manner as numerical columns; Vertica treats these columns as true categorical values and does not simply cast them to continuous values under-the-hood.

Parameters

exclude_columns

Comma-separated list of columns from predictor-columns to exclude from processing.

ntree

Integer in the range [1, 1000], the number of trees in the forest.

Default: 20

mtry

Integer in the range [1, number-predictors], the number of features to consider at the split of a tree node.

Default: One-third the total number of predictors

sampling_size

Float in the range (0.0, 1.0], the portion of the input data set that is randomly picked for training each tree.

Default: 0.632

max_depth

Integer in the range [1, 100], the maximum depth for growing each tree. For example, a max_depth of 0 represents a tree with only a root node, and a max_depth of 2 represents a tree with four leaf nodes.

Default: 5

max_breadth

Integer in the range [1, 1e9], the maximum number of leaf nodes a tree can have.

Default: 32

min_leaf_size

Integer in the range [1, 1e6], the minimum number of samples each branch must have after splitting a node. A split that results in fewer remaining samples in its left or right branch is be discarded, and the node is treated as a leaf node.

The default value of this parameter differs from that of analogous parameters in libraries like sklearn and will therefore yield a model with predicted values that differ from the original response values.

Default: 5

min_info_gain

Float in the range [0.0, 1.0), the minimum threshold for including a split. A split with information gain less than this threshold is discarded.

Default: 0.0

nbins

Integer in the range [2, 1000], the number of bins to use for discretizing continuous features.

Default: 32

Model attributes

data

Data for the function, including:

predictorNames: The name of the predictors in the same order they were specified for training the model.
predictorTypes: The type of the predictors in the same order as their names in predictorNames.

ntree

Number of trees in the model.

skippedRows

Number of rows in input_relation that were skipped because they contained an invalid value.

processedRows

Total number of rows in input_relation minus skippedRows.

callStr

Value of all input arguments that were specified at the time the function was called.

Examples

=> SELECT RF_REGRESSOR ('myRFRegressorModel', 'mtcars', 'carb', 'mpg, cyl, hp, drat, wt' USING PARAMETERS
ntree=100, sampling_size=0.3);
RF_REGRESSOR
--------------
Finished
(1 row)