IFOREST
Trains and returns an isolation forest (iForest) model. After you train the model, you can use the APPLY_IFOREST function to predict outliers in an input relation.
For more information about how the iForest algorithm works, see Isolation Forest.
This is a meta-function. You must call meta-functions in a top-level SELECT statement.
Behavior type
VolatileSyntax
IFOREST( 'model-name', 'input-relation', 'input-columns' [ USING PARAMETERS param=value[,...] ] )
Arguments
model-name
- Identifies the model to create, where
model-name
conforms to conventions described in Identifiers. It must also be unique among all names of sequences, tables, projections, views, and models within the same schema. input-relation
- The table or view that contains the input data for IFOREST.
input-columns
- Comma-separated list of columns to use from the input relation, or asterisk (*) to select all columns. Columns must be of types CHAR, VARCHAR, BOOL, INT, or FLOAT.
Columns of types CHAR, VARCHAR, and BOOL are treated as categorical features; all others are treated as numeric features.
Parameters
exclude_columns
- Comma-separated list of column names from
input-columns
to exclude from processing.Default: Empty string ('')
ntree
- Integer in the range [1, 1000], specifies the number of trees in the forest.
Default: 100
sampling_size
- Float in the range (0.0, 1.0], specifies the portion of the input data set that is randomly picked, without replacement, for training each tree.
Default: 0.632
col_sample_by_tree
- Float in the range (0.0, 1.0], specifies the fraction of columns that are randomly picked for training each tree.
Default: 1.0
max_depth
- Integer in the range [1, 100], specifies the maximum depth for growing each tree.
Default: 10
nbins
- Integer in the range [2, 1000], specifies the number of bins used to discretize continuous features.
Default: 32
Model Attributes
details
- Details about the function's predictor columns, including:
-
predictor
: Names of the predictors in the same order specified when training the model. -
type
: Types of the predictors in the same order as their names inpredictor
.
-
tree_count
- Number of trees in the model.
rejected_row_count
- Number of rows in
input-relation
that were skipped because they contained an invalid value. accepted_row_count
- Total number of rows in
input-relation
minusrejected_row_count
. call_string
- Value of all input arguments that were specified at the time the function was called.
Privileges
Non-superusers:
-
CREATE privileges on the schema where the model is created
-
SELECT privileges on the input relation
Examples
In the following example, the input data to the function contains columns of type INT, VARCHAR, and FLOAT:
=> SELECT IFOREST('baseball_anomalies','baseball','team, hr, hits, avg, salary' USING PARAMETERS ntree=75, sampling_size=0.7,
max_depth=15);
IFOREST
----------
Finished
(1 row)
You can verify that all the input columns were read in correctly by calling GET_MODEL_SUMMARY and checking the details section:
=> SELECT GET_MODEL_SUMMARY(USING PARAMETERS model_name='baseball_anomalies');
GET_MODEL_SUMMARY
-------------------------------------------------------------------------------------------------------------------------------------
===========
call_string
===========
SELECT iforest('public.baseball_anomalies', 'baseball', 'team, hr, hits, avg, salary' USING PARAMETERS exclude_columns='', ntree=75,
sampling_size=0.7, col_sample_by_tree=1, max_depth=15, nbins=32);
=======
details
=======
predictor| type
---------+----------------
team |char or varchar
hr | int
hits | int
avg |float or numeric
salary |float or numeric
===============
Additional Info
===============
Name |Value
------------------+-----
tree_count | 75
rejected_row_count| 0
accepted_row_count|1000
(1 row)