IFOREST
Trains and returns an isolation forest (iForest) model. After you train the model, you can use the APPLY_IFOREST function to predict outliers in an input relation.
For more information about how the iForest algorithm works, see Isolation Forest.
This is a meta-function. You must call meta-functions in a top-level SELECT statement.
Behavior type
VolatileSyntax
IFOREST( 'model‑name', 'input‑relation', 'input‑columns' [ USING PARAMETERS param=value[,...] ] )
Arguments
model-name- Identifies the model to create, where
model‑nameconforms to conventions described in Identifiers. It must also be unique among all names of sequences, tables, projections, views, and models within the same schema. input‑relation- The table or view that contains the input data for IFOREST.
input‑columns- Comma-separated list of columns to use from the input relation, or asterisk (*) to select all columns. Columns must be of types CHAR, VARCHAR, BOOL, INT, or FLOAT.
Columns of types CHAR, VARCHAR, and BOOL are treated as categorical features; all others are treated as numeric features.
Parameters
exclude_columns- Comma-separated list of column names from
input‑columnsto exclude from processing.Default: Empty string ('')
ntree- Integer in the range [1, 1000], specifies the number of trees in the forest.
Default: 100
sampling_size- Float in the range (0.0, 1.0], specifies the portion of the input data set that is randomly picked, without replacement, for training each tree.
Default: 0.632
col_sample_by_tree- Float in the range (0.0, 1.0], specifies the fraction of columns that are randomly picked for training each tree.
Default: 1.0
max_depth- Integer in the range [1, 100], specifies the maximum depth for growing each tree.
Default: 10
nbins- Integer in the range [2, 1000], specifies the number of bins used to discretize continuous features.
Default: 32
Model Attributes
details- Details about the function's predictor columns, including:
-
predictor: Names of the predictors in the same order specified when training the model. -
type: Types of the predictors in the same order as their names inpredictor.
-
tree_count- Number of trees in the model.
rejected_row_count- Number of rows in
input-relationthat were skipped because they contained an invalid value. accepted_row_count- Total number of rows in
input-relationminusrejected_row_count. call_string- Value of all input arguments that were specified at the time the function was called.
Privileges
Non-superusers:
-
CREATE privileges on the schema where the model is created
-
SELECT privileges on the input relation
Examples
In the following example, the input data to the function contains columns of type INT, VARCHAR, and FLOAT:
=> SELECT IFOREST('baseball_anomalies','baseball','team, hr, hits, avg, salary' USING PARAMETERS ntree=75, sampling_size=0.7,
max_depth=15);
IFOREST
----------
Finished
(1 row)
You can verify that all the input columns were read in correctly by calling GET_MODEL_SUMMARY and checking the details section:
=> SELECT GET_MODEL_SUMMARY(USING PARAMETERS model_name='baseball_anomalies');
GET_MODEL_SUMMARY
-------------------------------------------------------------------------------------------------------------------------------------
===========
call_string
===========
SELECT iforest('public.baseball_anomalies', 'baseball', 'team, hr, hits, avg, salary' USING PARAMETERS exclude_columns='', ntree=75,
sampling_size=0.7, col_sample_by_tree=1, max_depth=15, nbins=32);
=======
details
=======
predictor| type
---------+----------------
team |char or varchar
hr | int
hits | int
avg |float or numeric
salary |float or numeric
===============
Additional Info
===============
Name |Value
------------------+-----
tree_count | 75
rejected_row_count| 0
accepted_row_count|1000
(1 row)