IFOREST

Trains and returns an isolation forest (iForest) model.

Trains and returns an isolation forest (iForest) model. After you train the model, you can use the APPLY_IFOREST function to predict outliers in an input relation.

For more information about how the iForest algorithm works, see Isolation Forest.

This is a meta-function. You must call meta-functions in a top-level SELECT statement.

Behavior type

Volatile

Syntax

IFOREST( 'model-name', 'input-relation', 'input-columns' [ USING PARAMETERS param=value[,...] ] )

Arguments

model-name
Identifies the model to create, where model-name conforms to conventions described in Identifiers. It must also be unique among all names of sequences, tables, projections, views, and models within the same schema.
input-relation
The table or view that contains the input data for IFOREST.
input-columns
Comma-separated list of columns to use from the input relation, or asterisk (*) to select all columns. Columns must be of types CHAR, VARCHAR, BOOL, INT, or FLOAT.

Columns of types CHAR, VARCHAR, and BOOL are treated as categorical features; all others are treated as numeric features.

Parameters

exclude_columns
Comma-separated list of column names from input-columns to exclude from processing.

Default: Empty string ('')

ntree
Integer in the range [1, 1000], specifies the number of trees in the forest.

Default: 100

sampling_size
Float in the range (0.0, 1.0], specifies the portion of the input data set that is randomly picked, without replacement, for training each tree.

Default: 0.632

col_sample_by_tree
Float in the range (0.0, 1.0], specifies the fraction of columns that are randomly picked for training each tree.

Default: 1.0

max_depth
Integer in the range [1, 100], specifies the maximum depth for growing each tree.

Default: 10

nbins
Integer in the range [2, 1000], specifies the number of bins used to discretize continuous features.

Default: 32

Model Attributes

details
Details about the function's predictor columns, including:
  • predictor: Names of the predictors in the same order specified when training the model.

  • type: Types of the predictors in the same order as their names in predictor.

tree_count
Number of trees in the model.
rejected_row_count
Number of rows in input-relation that were skipped because they contained an invalid value.
accepted_row_count
Total number of rows in input-relation minus rejected_row_count.
call_string
Value of all input arguments that were specified at the time the function was called.

Privileges

Non-superusers:

  • CREATE privileges on the schema where the model is created

  • SELECT privileges on the input relation

Examples

In the following example, the input data to the function contains columns of type INT, VARCHAR, and FLOAT:

=> SELECT IFOREST('baseball_anomalies','baseball','team, hr, hits, avg, salary' USING PARAMETERS ntree=75, sampling_size=0.7,
max_depth=15);
IFOREST
----------
Finished
(1 row)

You can verify that all the input columns were read in correctly by calling GET_MODEL_SUMMARY and checking the details section:

=> SELECT GET_MODEL_SUMMARY(USING PARAMETERS model_name='baseball_anomalies');
GET_MODEL_SUMMARY
-------------------------------------------------------------------------------------------------------------------------------------

===========
call_string
===========
SELECT iforest('public.baseball_anomalies', 'baseball', 'team, hr, hits, avg, salary' USING PARAMETERS exclude_columns='', ntree=75,
sampling_size=0.7, col_sample_by_tree=1, max_depth=15, nbins=32);

=======
details
=======
predictor|      type
---------+----------------
  team   |char or varchar
   hr    |      int
  hits   |      int
   avg   |float or numeric
 salary  |float or numeric


===============
Additional Info
===============
       Name       |Value
------------------+-----
    tree_count    | 75
rejected_row_count|  0
accepted_row_count|1000

(1 row)

See also