Detecting outliers
Before you perform an in-depth analysis of your data, you should first remove the outliers from the data. Outliers are data points that greatly differ from other similar data points. If you leave outliers in your data, you risk misclassifying data, introducing bias, or incorrect calculations.
This example uses the baseball data set found on the Vertica GitHub page.
Before you begin the example, load the Machine Learning sample data.-
Detect the outliers based on the hr, hits, and salary columns. The DETECT_OUTLIERS function creates a table containing the outliers with the input and key columns. Before you use the DETECT_OUTLIERS function, make sure that you are a superuser or have CREATE privileges for the schema and SELECT privileges for the table.
=> SELECT DETECT_OUTLIERS('baseball_hr_hits_salary_outliers', 'baseball', 'hr, hits, salary', 'robust_zscore' USING PARAMETERS outlier_threshold=3.0); DETECT_OUTLIERS -------------------------- Detected 5 outliers (1 row)
-
Query the output table containing the outliers.
=> SELECT * FROM baseball_hr_hits_salary_outliers; id | first_name | last_name | dob | team | hr | hits | avg | salary ----+------------+-----------+------------+-----------+---------+---------+-------+---------------------- 73 | Marie | Fields | 1985-11-23 | Mauv | 8888 | 34 | 0.283 | 9.99999999341471e+16 89 | Jacqueline | Richards | 1975-10-06 | Pink | 273333 | 4490260 | 0.324 | 4.4444444444828e+17 87 | Jose | Stephens | 1991-07-20 | Green | 80 | 64253 | 0.69 | 16032567.12 222 | Gerald | Fuller | 1991-02-13 | Goldenrod | 3200000 | 216 | 0.299 | 37008899.76 147 | Debra | Hall | 1980-12-31 | Maroon | 1100037 | 230 | 0.431 | 9000101403 (5 rows)
-
Create a view omitting the outliers from the table.
=> CREATE VIEW clean_baseball AS SELECT * FROM baseball WHERE id NOT IN (SELECT id FROM baseball_hr_hits_salary_outliers); CREATE VIEW
-
Perform your analysis using the view that omits the outliers.