DETECT_OUTLIERS
Returns the outliers in a data set based on the outlier threshold.
Returns the outliers in a data set based on the outlier threshold. The output is a table that contains the outliers. DETECT_OUTLIERS
uses the detection method robust_szcore
to normalize each input column. The function then identifies as outliers all rows that contain a normalized value greater than the default or specified threshold.
This is a meta-function. You must call meta-functions in a top-level SELECT statement.
Behavior type
VolatileSyntax
DETECT_OUTLIERS ( 'output-table', 'input-relation','input-columns', 'detection-method'
[ USING PARAMETERS
[outlier_threshold = threshold]
[, exclude_columns = 'excluded-columns']
[, partition_columns = 'partition-columns'] ] )
Arguments
output-table
- The name of the table where Vertica saves rows that are outliers along the chosen
input_columns
. All columns are present in this table. input-relation
- The table or view that contains outlier data. If the input relation is defined in Hive, use
SYNC_WITH_HCATALOG_SCHEMA
to sync thehcatalog
schema, and then run the machine learning function. input-columns
- Comma-separated list of columns to use from the input relation, or asterisk (*) to select all columns. Input columns must be of type numeric.
detection-method
- The outlier detection method to use, set to
robust_zscore
.
Parameters
outlier_threshold
- The minimum normalized value in a row that is used to identify that row as an outlier.
Default: 3.0
exclude_columns
Comma-separated list of column names from
input-columns
to exclude from processing.partition_columns
- Comma-separated list of column names from the input table or view that defines the partitions.
DETECT_OUTLIERS
detects outliers among each partition separately.Default: empty list
Privileges
Non-superusers:
-
SELECT privileges on the input relation
-
CREATE privileges on the output table
Examples
The following example shows how to use DETECT_OUTLIERS
:
=> CREATE TABLE baseball_roster (id identity, last_name varchar(30), hr int, avg float);
CREATE TABLE
=> COPY baseball_roster FROM STDIN;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> Polo|7|.233
>> Gloss|45|.170
>> Gus|12|.345
>> Gee|1|.125
>> Laus|3|.095
>> Hilltop|16|.222
>> Wicker|78|.333
>> Scooter|0|.121
>> Hank|999999|.8888
>> Popup|35|.378
>> \.
=> SELECT * FROM baseball_roster;
id | last_name | hr | avg
----+-----------+--------+--------
3 | Gus | 12 | 0.345
4 | Gee | 1 | 0.125
6 | Hilltop | 16 | 0.222
10 | Popup | 35 | 0.378
1 | Polo | 7 | 0.233
7 | Wicker | 78 | 0.333
9 | Hank | 999999 | 0.8888
2 | Gloss | 45 | 0.17
5 | Laus | 3 | 0.095
8 | Scooter | 0 | 0.121
(10 rows)
=> SELECT DETECT_OUTLIERS('baseball_outliers', 'baseball_roster', 'id, hr, avg', 'robust_zscore' USING PARAMETERS
outlier_threshold=3.0);
DETECT_OUTLIERS
--------------------------
Detected 2 outliers
(1 row)
=> SELECT * FROM baseball_outliers;
id | last_name | hr | avg
----+-----------+------------+-------------
7 | Wicker | 78 | 0.333
9 | Hank | 999999 | 0.8888
(2 rows)