RF_REGRESSOR
Trains a random forest model for regression on an input relation.
This is a meta-function. You must call meta-functions in a top-level SELECT statement.
Behavior type
Volatile
Syntax
RF_REGRESSOR ( 'model-name', input-relation, 'response-column', 'predictor-columns'
        [ USING PARAMETERS
              [exclude_columns = 'excluded-columns']
              [, ntree = num-trees]
              [, mtry = num-features]
              [, sampling_size = sampling-size]
              [, max_depth = depth]
              [, max_breadth = breadth]
              [, min_leaf_size = leaf-size]
              [, min_info_gain = threshold]
              [, nbins = num-bins] ] )
Arguments
- model-name
- The model that is stored as a result of training, where model-name conforms to conventions described in Identifiers. It must also be unique among all names of sequences, tables, projections, views, and models within the same schema.
- input-relation
- The table or view that contains the training samples. If the input relation is defined in Hive, use SYNC_WITH_HCATALOG_SCHEMA to sync the hcatalog schema, and then run the machine learning function.
- response-column
- A numeric input column that represents the dependent variable.
- predictor-columns
- Comma-separated list of columns in the input relation that represent independent variables for the model, or asterisk (*) to select all columns. If you select all columns, the argument list for parameter exclude_columns must include response-column, and any columns that are invalid as predictor columns.
All predictor columns must be of type numeric, CHAR/VARCHAR, or BOOLEAN; otherwise the model is invalid.
Vertica XGBoost and Random Forest algorithms offer native support for categorical columns (BOOL/VARCHAR). Pass the categorical columns as predictors to the model; the algorithm automatically treats them as categorical and does not split them into bins the way it does numerical columns. Vertica treats these columns as true categorical values and does not cast them to continuous values under the hood.
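As a sketch of the asterisk form described above, the following call selects every column as a predictor and then excludes the response column along with columns that should not serve as predictors. The model name myRFRegressorModelAll and the excluded columns car_model and tf are assumptions about how the mtcars sample table is laid out, not something this reference states; adjust them to match your table.

```sql
-- Train on all columns of mtcars except those listed in exclude_columns.
-- 'carb' is the response column, so it must be excluded from the predictors.
-- 'car_model' and 'tf' are assumed column names (an identifier column and a
-- train/test flag) that would not be meaningful predictors.
SELECT RF_REGRESSOR ('myRFRegressorModelAll', 'mtcars', 'carb', '*'
       USING PARAMETERS exclude_columns='carb, car_model, tf');
```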
Parameters
- exclude_columns
- Comma-separated list of columns from predictor-columns to exclude from processing.
- ntree
- Integer in the range [1, 1000], the number of trees in the forest. Default: 20
- mtry
- Integer in the range [1, number-predictors], the number of features to consider at the split of a tree node. Default: One-third the total number of predictors
- sampling_size
- Float in the range (0.0, 1.0], the portion of the input data set that is randomly picked for training each tree. Default: 0.632
- max_depth
- Integer in the range [1, 100], the maximum depth for growing each tree. For example, a max_depth of 0 represents a tree with only a root node, and a max_depth of 2 represents a tree with four leaf nodes. Default: 5
- max_breadth
- Integer in the range [1, 1e9], the maximum number of leaf nodes a tree can have. Default: 32
- min_leaf_size
- Integer in the range [1, 1e6], the minimum number of samples each branch must have after splitting a node. A split that results in fewer remaining samples in its left or right branch is discarded, and the node is treated as a leaf node.
The default value of this parameter differs from that of analogous parameters in libraries like sklearn and will therefore yield a model with predicted values that differ from the original response values. Default: 5
- min_info_gain
- Float in the range [0.0, 1.0), the minimum threshold for including a split. A split with information gain less than this threshold is discarded. Default: 0.0
- nbins
- Integer in the range [2, 1000], the number of bins to use for discretizing continuous features. Default: 32
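The parameters above can be combined in a single USING PARAMETERS clause. The following is a minimal sketch rather than a recommended configuration: the model name myRFRegressorModelTuned and every value shown are hypothetical, chosen only so that each falls inside its documented range.

```sql
-- Hypothetical settings for illustration; each value lies within the range
-- documented for its parameter.
SELECT RF_REGRESSOR ('myRFRegressorModelTuned', 'mtcars', 'carb', 'mpg, cyl, hp, drat, wt'
       USING PARAMETERS
           ntree=50,            -- number of trees in the forest
           mtry=2,              -- features considered at each split (five predictors here)
           sampling_size=0.5,   -- portion of the input rows sampled for each tree
           max_depth=4,         -- maximum depth of each tree
           max_breadth=16,      -- maximum number of leaf nodes per tree
           min_leaf_size=2,     -- minimum samples in each branch after a split
           min_info_gain=0.01,  -- splits with less information gain are discarded
           nbins=16);           -- bins used to discretize continuous features
```

Parameters that are omitted keep the defaults listed above.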
Model attributes
- data
- Data for the function, including:
  - predictorNames: The names of the predictors, in the same order they were specified for training the model.
  - predictorTypes: The types of the predictors, in the same order as their names in predictorNames.
- ntree
- Number of trees in the model.
- skippedRows
- Number of rows in input_relation that were skipped because they contained an invalid value.
- processedRows
- Total number of rows in input_relation minus skippedRows.
- callStr
- Value of all input arguments that were specified at the time the function was called.
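One way to inspect these attributes after training is through Vertica's GET_MODEL_SUMMARY and GET_MODEL_ATTRIBUTE functions. The sketch below assumes the model myRFRegressorModel has already been trained; the attribute names the functions report may be spelled differently from the labels listed above, so check the listing before requesting a specific attribute.

```sql
-- Print a summary of the trained model, including its call string.
SELECT GET_MODEL_SUMMARY (USING PARAMETERS model_name='myRFRegressorModel');

-- List the attributes stored with the model. To retrieve one of them, pass
-- its name from this listing as the attr_name parameter.
SELECT GET_MODEL_ATTRIBUTE (USING PARAMETERS model_name='myRFRegressorModel');
```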
Examples
=> SELECT RF_REGRESSOR ('myRFRegressorModel', 'mtcars', 'carb', 'mpg, cyl, hp, drat, wt' USING PARAMETERS
ntree=100, sampling_size=0.3);
RF_REGRESSOR
--------------
Finished
(1 row)
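Once training finishes, the model can be applied with PREDICT_RF_REGRESSOR. The following sketch scores the training table itself, which is only useful as a sanity check; the alias predicted_carb is an illustrative name, and in practice you would typically score a held-out table with the same predictor columns.

```sql
-- Apply the trained model, passing the same predictor columns, in the same
-- order, that were used for training.
SELECT carb,
       PREDICT_RF_REGRESSOR (mpg, cyl, hp, drat, wt
                             USING PARAMETERS model_name='myRFRegressorModel') AS predicted_carb
FROM mtcars;
```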