IFOREST

Trains and returns an isolation forest (iForest) model.

Trains and returns an isolation forest (iForest) model. After you train the model, you can use the APPLY_IFOREST function to predict outliers in an input relation.

For more information about how the iForest algorithm works, see Isolation Forest.

This is a meta-function. You must call meta-functions in a top-level SELECT statement.

Behavior type

Volatile

Syntax

IFOREST( 'model-name', 'input-relation', 'input-columns' [ USING PARAMETERS param=value[,...] ] )

Arguments

model-name: Identifies the model to create, where model-name conforms to conventions described in Identifiers. It must also be unique among all names of sequences, tables, projections, views, and models within the same schema.
input-relation: The table or view that contains the input data for IFOREST.
input-columns: Comma-separated list of columns to use from the input relation, or asterisk (*) to select all columns. Columns must be of types CHAR, VARCHAR, BOOL, INT, or FLOAT.
Columns of types CHAR, VARCHAR, and BOOL are treated as categorical features; all others are treated as numeric features.

Parameters

exclude_columns: Comma-separated list of column names from input-columns to exclude from processing.
Default: Empty string ('')
ntree: Integer in the range [1, 1000], specifies the number of trees in the forest.
Default: 100
sampling_size: Float in the range (0.0, 1.0], specifies the portion of the input data set that is randomly picked, without replacement, for training each tree.
Default: 0.632
col_sample_by_tree: Float in the range (0.0, 1.0], specifies the fraction of columns that are randomly picked for training each tree.
Default: 1.0
max_depth: Integer in the range [1, 100], specifies the maximum depth for growing each tree.
Default: 10
nbins: Integer in the range [2, 1000], specifies the number of bins used to discretize continuous features.
Default: 32

Model Attributes

details

Details about the function's predictor columns, including:

predictor: Names of the predictors in the same order specified when training the model.
type: Types of the predictors in the same order as their names in predictor.

tree_count

Number of trees in the model.

rejected_row_count

Number of rows in input-relation that were skipped because they contained an invalid value.

accepted_row_count

Total number of rows in input-relation minus rejected_row_count.

call_string

Value of all input arguments that were specified at the time the function was called.

Privileges

Non-superusers:

CREATE privileges on the schema where the model is created
SELECT privileges on the input relation

Examples

In the following example, the input data to the function contains columns of type INT, VARCHAR, and FLOAT:

=> SELECT IFOREST('baseball_anomalies','baseball','team, hr, hits, avg, salary' USING PARAMETERS ntree=75, sampling_size=0.7,
max_depth=15);
IFOREST
----------
Finished
(1 row)

You can verify that all the input columns were read in correctly by calling GET_MODEL_SUMMARY and checking the details section:

=> SELECT GET_MODEL_SUMMARY(USING PARAMETERS model_name='baseball_anomalies');
GET_MODEL_SUMMARY
-------------------------------------------------------------------------------------------------------------------------------------

===========
call_string
===========
SELECT iforest('public.baseball_anomalies', 'baseball', 'team, hr, hits, avg, salary' USING PARAMETERS exclude_columns='', ntree=75,
sampling_size=0.7, col_sample_by_tree=1, max_depth=15, nbins=32);

=======
details
=======
predictor|      type
---------+----------------
  team   |char or varchar
   hr    |      int
  hits   |      int
   avg   |float or numeric
 salary  |float or numeric


===============
Additional Info
===============
       Name       |Value
------------------+-----
    tree_count    | 75
rejected_row_count|  0
accepted_row_count|1000

(1 row)