DETECT_OUTLIERS

Returns the outliers in a data set based on the outlier threshold.

Returns the outliers in a data set based on the outlier threshold. The output is a table that contains the outliers. DETECT_OUTLIERS uses the detection method robust_szcore to normalize each input column. The function then identifies as outliers all rows that contain a normalized value greater than the default or specified threshold.

This is a meta-function. You must call meta-functions in a top-level SELECT statement.

Behavior type

Volatile

Syntax

DETECT_OUTLIERS ( 'output-table', 'input-relation','input-columns', 'detection-method'
        [ USING PARAMETERS
              [outlier_threshold = threshold]
              [, exclude_columns = 'excluded-columns']
              [, partition_columns = 'partition-columns'] ] )

Arguments

output-table
The name of the table where Vertica saves rows that are outliers along the chosen input_columns. All columns are present in this table.
input-relation
The table or view that contains outlier data. If the input relation is defined in Hive, use SYNC_WITH_HCATALOG_SCHEMA to sync the hcatalog schema, and then run the machine learning function.
input-columns
Comma-separated list of columns to use from the input relation, or asterisk (*) to select all columns. Input columns must be of type numeric.
detection-method
The outlier detection method to use, set to robust_zscore.

Parameters

outlier_threshold
The minimum normalized value in a row that is used to identify that row as an outlier.

Default: 3.0

exclude_columns

Comma-separated list of column names from input-columns to exclude from processing.

partition_columns
Comma-separated list of column names from the input table or view that defines the partitions. DETECT_OUTLIERS detects outliers among each partition separately.

Default: empty list

Privileges

Non-superusers:

  • SELECT privileges on the input relation

  • CREATE privileges on the output table

Examples

The following example shows how to use DETECT_OUTLIERS:

=> CREATE TABLE baseball_roster (id identity, last_name varchar(30), hr int, avg float);
CREATE TABLE

=> COPY baseball_roster FROM STDIN;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> Polo|7|.233
>> Gloss|45|.170
>> Gus|12|.345
>> Gee|1|.125
>> Laus|3|.095
>> Hilltop|16|.222
>> Wicker|78|.333
>> Scooter|0|.121
>> Hank|999999|.8888
>> Popup|35|.378
>> \.


=> SELECT * FROM baseball_roster;
 id | last_name |   hr   |  avg
----+-----------+--------+--------
  3 | Gus       |     12 |  0.345
  4 | Gee       |      1 |  0.125
  6 | Hilltop   |     16 |  0.222
 10 | Popup     |     35 |  0.378
  1 | Polo      |      7 |  0.233
  7 | Wicker    |     78 |  0.333
  9 | Hank      | 999999 | 0.8888
  2 | Gloss     |     45 |   0.17
  5 | Laus      |      3 |  0.095
  8 | Scooter   |      0 |  0.121
(10 rows)

=> SELECT DETECT_OUTLIERS('baseball_outliers', 'baseball_roster', 'id, hr, avg', 'robust_zscore' USING PARAMETERS
outlier_threshold=3.0);

     DETECT_OUTLIERS
--------------------------
 Detected 2 outliers

(1 row)

=> SELECT * FROM baseball_outliers;
 id | last_name | hr         | avg
----+-----------+------------+-------------
  7 | Wicker    |         78 |       0.333
  9 | Hank      |     999999 |      0.8888
(2 rows)