APPLY_IFOREST

Applies an isolation forest (iForest) model to an input relation.

Applies an isolation forest (iForest) model to an input relation. For each input row, the function returns an output row with two fields:

  • anomaly_score: A float value that represents the average path length across all trees in the model normalized by the training sample size.
  • is_anomaly: A Boolean value that indicates whether the input row is an anomaly. This value is true when anomaly_score is equal to or larger than a given threshold; otherwise, it's false.

Syntax

APPLY_IFOREST( input-columns USING PARAMETERS param=value[,...] )

Arguments

input-columns
Comma-separated list of columns to use from the input relation, or asterisk (*) to select all columns. Column types must match the types of the predictors in model_name.

Parameters

model_name

Name of the model (case-insensitive).

threshold
Optional. Float in the range (0.0, 1.0), specifies the threshold that determines if a data point is an anomaly. If the anomaly_score for a data point is equal to or larger than the value of threshold, the data point is marked as an outlier.

Alternatively, you can specify a contamination value that sets a threshold where the percentage of training data points labeled as outliers is approximately equal to the value of contamination. You cannot set both contamination and threshold in the same function call.

Default: 0.7

match_by_pos
Optional. Boolean value that specifies how input columns are matched to model columns:
  • false: Match by name.

  • true: Match by the position of columns in the input columns list.

Default: false

contamination
Optional. Float in the range (0.0, 1.0), the approximate ratio of data points in the training data that are labeled as outliers. The function calculates a threshold based on this contamination value. If you do not set this parameter, the function marks outliers using the specified or default threshold value.

You cannot set both contamination and threshold in the same function call.

Privileges

Non-superusers:

  • USAGE privileges on the model

  • SELECT privileges on the input relation

Examples

The following example demonstrates how different threshold values can affect outlier detection on an input relation:

=> SELECT * FROM (SELECT first_name, last_name, APPLY_IFOREST(team, hr, hits, avg, salary USING PARAMETERS model_name='baseball_anomalies',
   threshold=0.75) AS predictions FROM baseball) AS outliers WHERE predictions.is_anomaly IS true;
 first_name | last_name |                      predictions
------------+-----------+-------------------------------------------------------
 Jacqueline | Richards  | {"anomaly_score":0.777757463074347,"is_anomaly":true}
(1 row)

=> SELECT * FROM (SELECT first_name, last_name, APPLY_IFOREST(team, hr, hits, avg, salary USING PARAMETERS model_name='baseball_anomalies',
   threshold=0.55) AS predictions FROM baseball) AS outliers WHERE predictions.is_anomaly IS true;
 first_name | last_name |                      predictions
------------+-----------+--------------------------------------------------------
 Jacqueline | Richards  | {"anomaly_score":0.777757463074347,"is_anomaly":true}
 Debra      | Hall      | {"anomaly_score":0.5714649698133808,"is_anomaly":true}
 Gerald     | Fuller    | {"anomaly_score":0.5980549926114661,"is_anomaly":true}
(3 rows)

You can also use different contamination values to alter the outlier threshold:

=> SELECT * FROM (SELECT first_name, last_name, APPLY_IFOREST(team, hr, hits, avg, salary USING PARAMETERS model_name='baseball_anomalies',
   contamination = 0.1) AS predictions FROM baseball) AS outliers WHERE predictions.is_anomaly IS true;
 first_name | last_name |                      predictions
------------+-----------+--------------------------------------------------------
 Marie      | Fields    | {"anomaly_score":0.5307715717521868,"is_anomaly":true}
 Jacqueline | Richards  | {"anomaly_score":0.777757463074347,"is_anomaly":true}
 Debra      | Hall      | {"anomaly_score":0.5714649698133808,"is_anomaly":true}
 Gerald     | Fuller    | {"anomaly_score":0.5980549926114661,"is_anomaly":true}
(4 rows)

=> SELECT * FROM (SELECT first_name, last_name, APPLY_IFOREST(team, hr, hits, avg, salary USING PARAMETERS model_name='baseball_anomalies',
   contamination = 0.01) AS predictions FROM baseball) AS outliers WHERE predictions.is_anomaly IS true;
 first_name | last_name |                      predictions
------------+-----------+--------------------------------------------------------
 Jacqueline | Richards  | {"anomaly_score":0.777757463074347,"is_anomaly":true}
 Debra      | Hall      | {"anomaly_score":0.5714649698133808,"is_anomaly":true}
 Gerald     | Fuller    | {"anomaly_score":0.5980549926114661,"is_anomaly":true}
(3 rows)

See also