Detect outliers
Outliers are data points that greatly differ from other data points in a dataset. You can use outlier detection for applications such as fraud detection and system health monitoring, or you can detect outliers to then remove them from your data. If you leave outliers in your data when training a machine learning model, the resulting model is at risk of bias and skewed predictions. Vertica supports two methods for detecting outliers: the DETECT_OUTLIERS function and the IFOREST algorithm.
Note
The examples below use the baseball dataset from the machine learning example data. If you haven't already, download the machine learning example data.
Isolation forest
Isolation forest (iForest) is an unsupervised algorithm that operates on the assumption that outliers are few and different. This assumption makes outliers susceptible to a separation mechanism called isolation. Instead of comparing data instances to a constructed normal distribution of each data feature, isolation focuses on outliers themselves.
To isolate outliers directly, iForest builds binary tree structures named isolation trees (iTrees) to model the feature space. These iTrees randomly and recursively split the feature space so that each node of the tree represents a feature subspace. For instance, the first split divides the whole feature space into two subspaces, which are represented by the two child nodes of the root node. A data instance is considered isolated when it is the only member of a feature subspace. Because outliers are assumed to be few and different, outliers are likely to be isolated sooner than normal data instances.
To improve the robustness of the algorithm, iForest builds an ensemble of iTrees, each of which separates the feature space differently. The algorithm calculates the average path length needed to isolate a data instance across all iTrees. This average path length helps determine the anomaly_score for each data instance in a dataset. Data instances with an anomaly_score above a given threshold are considered outliers.
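Vertica does not document the exact scoring formula, but in the standard isolation forest algorithm the anomaly score of a data instance x is derived from its average path length E(h(x)) across the iTrees, normalized by c(n), the average path length of an unsuccessful search in a binary tree built from n instances:
anomaly_score(x, n) = 2^( -E(h(x)) / c(n) ),  where c(n) = 2*H(n-1) - 2*(n-1)/n and H(i) ≈ ln(i) + 0.5772
Scores close to 1 indicate likely outliers, while scores well below 0.5 indicate normal data instances.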
You do not need a large dataset to train an iForest, and even a small sample should suffice to train an accurate model. The data can have columns of types CHAR, VARCHAR, BOOL, INT, or FLOAT.
After you have trained an iForest model, you can use the APPLY_IFOREST function to detect outliers in any new data added to the dataset.
The following example demonstrates how to train an iForest model and detect outliers on the baseball dataset.
To build and train an iForest model, call IFOREST:
=> SELECT IFOREST('baseball_outliers','baseball','hr, hits, salary' USING PARAMETERS max_depth=30, nbins=100);
IFOREST
----------
Finished
(1 row)
You can view a summary of the trained model using GET_MODEL_SUMMARY:
=> SELECT GET_MODEL_SUMMARY(USING PARAMETERS model_name='baseball_outliers');
GET_MODEL_SUMMARY
---------------------------------------------------------------------------------------------------------------------------------------------
===========
call_string
===========
SELECT iforest('public.baseball_outliers', 'baseball', 'hr, hits, salary' USING PARAMETERS exclude_columns='', ntree=100, sampling_size=0.632,
col_sample_by_tree=1, max_depth=30, nbins=100);
=======
details
=======
predictor| type
---------+----------------
hr | int
hits | int
salary |float or numeric
===============
Additional Info
===============
Name |Value
------------------+-----
tree_count | 100
rejected_row_count| 0
accepted_row_count|1000
(1 row)
You can apply the trained iForest model to the baseball dataset with APPLY_IFOREST. To view only the data instances that are identified as outliers, you can run the following query:
=> SELECT * FROM (SELECT first_name, last_name, APPLY_IFOREST(hr, hits, salary USING PARAMETERS model_name='baseball_outliers', threshold=0.6)
AS predictions FROM baseball) AS outliers WHERE predictions.is_anomaly IS true;
first_name | last_name | predictions
------------+-----------+--------------------------------------------------------
Jacqueline | Richards | {"anomaly_score":0.8572338674053986,"is_anomaly":true}
Debra | Hall | {"anomaly_score":0.6007846156043213,"is_anomaly":true}
Gerald | Fuller | {"anomaly_score":0.6813650107767862,"is_anomaly":true}
(3 rows)
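The next example applies a different model, baseball_anomalies, which operates on the team, hr, hits, avg, and salary columns of the same dataset. If you want to follow along, you can train a comparable model first; the parameter values below are illustrative and may not reproduce the exact scores shown in the output:
=> SELECT IFOREST('baseball_anomalies', 'baseball', 'team, hr, hits, avg, salary' USING PARAMETERS max_depth=30, nbins=100);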
Instead of specifying a threshold value for APPLY_IFOREST, you can set the contamination parameter. This parameter sets a threshold so that the ratio of training data points labeled as outliers is approximately equal to the value of contamination:
=> SELECT * FROM (SELECT first_name, last_name, APPLY_IFOREST(team, hr, hits, avg, salary USING PARAMETERS model_name='baseball_anomalies',
contamination = 0.1) AS predictions FROM baseball) AS outliers WHERE predictions.is_anomaly IS true;
first_name | last_name | predictions
------------+-----------+--------------------------------------------------------
Marie | Fields | {"anomaly_score":0.5307715717521868,"is_anomaly":true}
Jacqueline | Richards | {"anomaly_score":0.777757463074347,"is_anomaly":true}
Debra | Hall | {"anomaly_score":0.5714649698133808,"is_anomaly":true}
Gerald | Fuller | {"anomaly_score":0.5980549926114661,"is_anomaly":true}
(4 rows)
DETECT_OUTLIERS
The DETECT_OUTLIERS function assumes a normal distribution for each data dimension, and then identifies data instances that differ strongly from the normal profile of any dimension. The function uses the robust z-score detection method to normalize each input column. If a data instance contains a normalized value greater than a specified threshold, it is identified as an outlier. The function outputs a table that contains all the outliers.
The function accepts data with only numeric input columns, treats each column independently, and assumes a Gaussian distribution on each column. If you want to detect outliers in new data added to the dataset, you must rerun DETECT_OUTLIERS.
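To see how a robust z-score is derived for a single column, you can compute it manually. The following sketch uses the common definition of the robust z-score, |x - median| / (1.4826 * MAD), where MAD is the median absolute deviation of the column; the internals of DETECT_OUTLIERS might differ in detail:
=> SELECT id, hr,
          ABS(hr - hr_median) / (1.4826 * hr_mad) AS hr_robust_zscore
   FROM (SELECT id, hr, hr_median,
                MEDIAN(ABS(hr - hr_median)) OVER () AS hr_mad   -- median absolute deviation
         FROM (SELECT id, hr, MEDIAN(hr) OVER () AS hr_median FROM baseball) t1
        ) t2
   WHERE ABS(hr - hr_median) / (1.4826 * hr_mad) > 3.0;
Rows returned by this query should roughly correspond to the hr-based outliers that DETECT_OUTLIERS reports with outlier_threshold=3.0.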
The following example demonstrates how you can detect the outliers in the baseball dataset based on the hr, hits, and salary columns. The DETECT_OUTLIERS function creates a table containing the outliers with the input and key columns:
=> SELECT DETECT_OUTLIERS('baseball_hr_hits_salary_outliers', 'baseball', 'hr, hits, salary', 'robust_zscore'
USING PARAMETERS outlier_threshold=3.0);
DETECT_OUTLIERS
--------------------------
Detected 5 outliers
(1 row)
To view the outliers, query the output table containing the outliers:
=> SELECT * FROM baseball_hr_hits_salary_outliers;
id | first_name | last_name | dob | team | hr | hits | avg | salary
----+------------+-----------+------------+-----------+---------+---------+-------+----------------------
73 | Marie | Fields | 1985-11-23 | Mauv | 8888 | 34 | 0.283 | 9.99999999341471e+16
89 | Jacqueline | Richards | 1975-10-06 | Pink | 273333 | 4490260 | 0.324 | 4.4444444444828e+17
87 | Jose | Stephens | 1991-07-20 | Green | 80 | 64253 | 0.69 | 16032567.12
222 | Gerald | Fuller | 1991-02-13 | Goldenrod | 3200000 | 216 | 0.299 | 37008899.76
147 | Debra | Hall | 1980-12-31 | Maroon | 1100037 | 230 | 0.431 | 9000101403
(5 rows)
You can create a view omitting the outliers from the table:
=> CREATE VIEW clean_baseball AS
SELECT * FROM baseball WHERE id NOT IN (SELECT id FROM baseball_hr_hits_salary_outliers);
CREATE VIEW
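You can then query the view, or use it as the input relation for training machine learning models, just as you would the original baseball table. For example, a simple check of the remaining row count:
=> SELECT COUNT(*) FROM clean_baseball;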