RF_REGRESSOR
Trains a random forest model for regression on an input relation.
This is a meta-function. You must call meta-functions in a top-level SELECT statement.
Behavior type
VolatileSyntax
RF_REGRESSOR ( 'model-name', input-relation, 'response-column', 'predictor-columns'
[ USING PARAMETERS
[exclude_columns = 'excluded-columns']
[, ntree = num-trees]
[, mtry = num-features]
[, sampling_size = sampling-size]
[, max_depth = depth]
[, max_breadth = breadth]
[, min_leaf_size = leaf-size]
[, min_info_gain = threshold]
[, nbins = num-bins] ] )
Arguments
model-name
- The model that is stored as a result of training, where
model-name
conforms to conventions described in Identifiers. It must also be unique among all names of sequences, tables, projections, views, and models within the same schema. input-relation
- The table or view that contains the training samples. If the input relation is defined in Hive, use
SYNC_WITH_HCATALOG_SCHEMA
to sync thehcatalog
schema, and then run the machine learning function. response-column
- A numeric input column that represents the dependent variable.
predictor-columns
Comma-separated list of columns in the input relation that represent independent variables for the model, or asterisk (*) to select all columns. If you select all columns, the argument list for parameter
exclude_columns
must includeresponse-column
, and any columns that are invalid as predictor columns.All predictor columns must be of type numeric, CHAR/VARCHAR, or BOOLEAN; otherwise the model is invalid.
Vertica XGBoost and Random Forest algorithms offer native support for categorical columns (BOOL/VARCHAR). Simply pass the categorical columns as predictors to the models and the algorithm will automatically treat the columns as categorical and will not attempt to split them into bins in the same manner as numerical columns; Vertica treats these columns as true categorical values and does not simply cast them to continuous values under-the-hood.
Parameters
exclude_columns
- Comma-separated list of columns from
predictor-columns
to exclude from processing. ntree
Integer in the range [1, 1000], the number of trees in the forest.
Default: 20
mtry
- Integer in the range [1,
number-predictors
], the number of features to consider at the split of a tree node.Default: One-third the total number of predictors
sampling_size
Float in the range (0.0, 1.0], the portion of the input data set that is randomly picked for training each tree.
Default: 0.632
max_depth
Integer in the range [1, 100], the maximum depth for growing each tree. For example, a
max_depth
of 0 represents a tree with only a root node, and amax_depth
of 2 represents a tree with four leaf nodes.Default: 5
max_breadth
Integer in the range [1, 1e9], the maximum number of leaf nodes a tree can have.
Default: 32
min_leaf_size
- Integer in the range [1, 1e6], the minimum number of samples each branch must have after splitting a node. A split that results in fewer remaining samples in its left or right branch is be discarded, and the node is treated as a leaf node.
The default value of this parameter differs from that of analogous parameters in libraries like sklearn and will therefore yield a model with predicted values that differ from the original response values.
Default: 5
min_info_gain
Float in the range [0.0, 1.0), the minimum threshold for including a split. A split with information gain less than this threshold is discarded.
Default: 0.0
nbins
Integer in the range [2, 1000], the number of bins to use for discretizing continuous features.
Default: 32
Model attributes
data
- Data for the function, including:
-
predictorNames
: The name of the predictors in the same order they were specified for training the model. -
predictorTypes
: The type of the predictors in the same order as their names in predictorNames.
-
ntree
- Number of trees in the model.
skippedRows
- Number of rows in
input_relation
that were skipped because they contained an invalid value. processedRows
- Total number of rows in
input_relation
minusskippedRows
. callStr
- Value of all input arguments that were specified at the time the function was called.
Examples
=> SELECT RF_REGRESSOR ('myRFRegressorModel', 'mtcars', 'carb', 'mpg, cyl, hp, drat, wt' USING PARAMETERS
ntree=100, sampling_size=0.3);
RF_REGRESSOR
--------------
Finished
(1 row)