Trains a random forest model for regression on an input relation.

Trains a random forest model for regression on an input relation.

This is a meta-function. You must call meta-functions in a top-level SELECT statement.

Behavior type



RF_REGRESSOR ( 'model-name', input-relation, 'response-column', 'predictor-columns'
              [exclude_columns = 'excluded-columns']
              [, ntree = num-trees]
              [, mtry = num-features]
              [, sampling_size = sampling-size]
              [, max_depth = depth]
              [, max_breadth = breadth]
              [, min_leaf_size = leaf-size]
              [, min_info_gain = threshold]
              [, nbins = num-bins] ] )


The model that is stored as a result of training, where model-name conforms to conventions described in Identifiers. It must also be unique among all names of sequences, tables, projections, views, and models within the same schema.
The table or view that contains the training samples. If the input relation is defined in Hive, use SYNC_WITH_HCATALOG_SCHEMA to sync the hcatalog schema, and then run the machine learning function.
A numeric input column that represents the dependent variable.

Comma-separated list of columns in the input relation that represent independent variables for the model, or asterisk (*) to select all columns. If you select all columns, the argument list for parameter exclude_columns must include response-column, and any columns that are invalid as predictor columns.

All predictor columns must be of type numeric, CHAR/VARCHAR, or BOOLEAN; otherwise the model is invalid.

Vertica XGBoost and Random Forest algorithms offer native support for categorical columns (BOOL/VARCHAR). Simply pass the categorical columns as predictors to the models and the algorithm will automatically treat the columns as categorical and will not attempt to split them into bins in the same manner as numerical columns; Vertica treats these columns as true categorical values and does not simply cast them to continuous values under-the-hood.


Comma-separated list of columns from predictor-columns to exclude from processing.

Integer in the range [1, 1000], the number of trees in the forest.

Default: 20

Integer in the range [1, number-predictors], the number of features to consider at the split of a tree node.

Default: One-third the total number of predictors


Float in the range (0.0, 1.0], the portion of the input data set that is randomly picked for training each tree.

Default: 0.632


Integer in the range [1, 100], the maximum depth for growing each tree. For example, a max_depth of 0 represents a tree with only a root node, and a max_depth of 2 represents a tree with four leaf nodes.

Default: 5


Integer in the range [1, 1e9], the maximum number of leaf nodes a tree can have.

Default: 32

Integer in the range [1, 1e6], the minimum number of samples each branch must have after splitting a node. A split that results in fewer remaining samples in its left or right branch is be discarded, and the node is treated as a leaf node.

The default value of this parameter differs from that of analogous parameters in libraries like sklearn and will therefore yield a model with predicted values that differ from the original response values.

Default: 5


Float in the range [0.0, 1.0), the minimum threshold for including a split. A split with information gain less than this threshold is discarded.

Default: 0.0


Integer in the range [2, 1000], the number of bins to use for discretizing continuous features.

Default: 32

Model attributes

Data for the function, including:
  • predictorNames: The name of the predictors in the same order they were specified for training the model.

  • predictorTypes: The type of the predictors in the same order as their names in predictorNames.

Number of trees in the model.
Number of rows in input_relation that were skipped because they contained an invalid value.
Total number of rows in input_relation minus skippedRows.
Value of all input arguments that were specified at the time the function was called.


=> SELECT RF_REGRESSOR ('myRFRegressorModel', 'mtcars', 'carb', 'mpg, cyl, hp, drat, wt' USING PARAMETERS
ntree=100, sampling_size=0.3);
(1 row)

See also