Data preparation
Before you can analyze your data, you must prepare it. You can do the following data preparation tasks in Vertica:
- 1: Balancing imbalanced data
- 2: Detecting outliers
- 3: Encoding categorical columns
- 4: Imputing missing values
- 5: Normalizing data
- 6: PCA (principal component analysis)
- 6.1: Dimension reduction using PCA
- 7: Sampling data
- 8: SVD (singular value decomposition)
- 8.1: Computing SVD
1 - Balancing imbalanced data
Imbalanced data occurs when there is an uneven distribution of classes in the data. Building a predictive model on an imbalanced data set can produce a model that appears to yield high accuracy but generalizes poorly to new data from the minority class. To prevent creating models with false levels of accuracy, you should rebalance your imbalanced data before creating a predictive model.
Before you begin the example, load the Machine Learning sample data.
Imbalanced data is common in financial transaction data, where most transactions are not fraudulent and a small number are, as shown in the following example.
- View the distribution of the classes:

=> SELECT fraud, COUNT(fraud) FROM transaction_data GROUP BY fraud;
 fraud | COUNT
-------+-------
 TRUE  |    19
 FALSE |   981
(2 rows)
- Use the BALANCE function to create a more balanced data set:

=> SELECT BALANCE('balance_fin_data', 'transaction_data', 'fraud', 'under_sampling'
          USING PARAMETERS sampling_ratio = 0.2);
         BALANCE
--------------------------
 Finished in 1 iteration
(1 row)
- View the new distribution of the classes:

=> SELECT fraud, COUNT(fraud) FROM balance_fin_data GROUP BY fraud;
 fraud | COUNT
-------+-------
 t     |    19
 f     |   236
(2 rows)
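To verify the class proportions in the balanced view, you can compute each class's share of the rows. This is a quick sanity check in plain SQL; the column aliases are illustrative:

=> SELECT fraud, COUNT(*) AS n,
          COUNT(*)::FLOAT / SUM(COUNT(*)) OVER () AS fraction
   FROM balance_fin_data
   GROUP BY fraud;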
2 - Detecting outliers
Before you perform an in-depth analysis of your data, you should first remove the outliers from the data. Outliers are data points that greatly differ from other similar data points. If you leave outliers in your data, you risk misclassifying data, introducing bias, or producing incorrect calculations.
This example uses the baseball data set found on the Vertica GitHub page.
Before you begin the example, load the Machine Learning sample data.
- Detect the outliers based on the hr, hits, and salary columns. The DETECT_OUTLIERS function creates a table containing the outliers with the input and key columns. Before you use the DETECT_OUTLIERS function, make sure that you are a superuser or have CREATE privileges for the schema and SELECT privileges for the table. (A manual cross-check of the robust z-score calculation appears after these steps.)

=> SELECT DETECT_OUTLIERS('baseball_hr_hits_salary_outliers', 'baseball', 'hr, hits, salary',
          'robust_zscore' USING PARAMETERS outlier_threshold=3.0);
     DETECT_OUTLIERS
--------------------------
 Detected 5 outliers
(1 row)
- Query the output table containing the outliers:

=> SELECT * FROM baseball_hr_hits_salary_outliers;
 id  | first_name | last_name |    dob     |   team    |   hr    |  hits   |  avg  |        salary
-----+------------+-----------+------------+-----------+---------+---------+-------+----------------------
  73 | Marie      | Fields    | 1985-11-23 | Mauv      |    8888 |      34 | 0.283 | 9.99999999341471e+16
  89 | Jacqueline | Richards  | 1975-10-06 | Pink      |  273333 | 4490260 | 0.324 | 4.4444444444828e+17
  87 | Jose       | Stephens  | 1991-07-20 | Green     |      80 |   64253 | 0.69  | 16032567.12
 222 | Gerald     | Fuller    | 1991-02-13 | Goldenrod | 3200000 |     216 | 0.299 | 37008899.76
 147 | Debra      | Hall      | 1980-12-31 | Maroon    | 1100037 |     230 | 0.431 | 9000101403
(5 rows)
- Create a view omitting the outliers from the table:

=> CREATE VIEW clean_baseball AS SELECT * FROM baseball
   WHERE id NOT IN (SELECT id FROM baseball_hr_hits_salary_outliers);
CREATE VIEW
- Perform your analysis using the view that omits the outliers.
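With the robust_zscore method, a row is flagged when |x - median| / (1.4826 * MAD) exceeds outlier_threshold in any input column, where MAD is the median absolute deviation; the 1.4826 factor is the usual consistency constant for normally distributed data. As a rough cross-check for a single column, the following sketch recomputes the statistic for hr with the MEDIAN analytic function. It illustrates the formula, not necessarily the exact internal implementation, and it assumes the MAD of hr is nonzero:

=> WITH med AS (
       SELECT DISTINCT MEDIAN(hr) OVER () AS hr_median FROM baseball
   ),
   mad AS (
       -- median absolute deviation around the median
       SELECT DISTINCT MEDIAN(ABS(b.hr - m.hr_median)) OVER () AS hr_mad, m.hr_median
       FROM baseball b CROSS JOIN med m
   )
   SELECT b.id, b.hr
   FROM baseball b CROSS JOIN mad
   WHERE ABS(b.hr - mad.hr_median) / (1.4826 * mad.hr_mad) > 3.0;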
3 - Encoding categorical columns
Many machine learning algorithms cannot work with categorical data. To accommodate such algorithms, categorical data must be converted to numerical data before training. Directly mapping the categorical values to indices is not enough. For example, if your categorical feature has three distinct values "red", "green", and "blue", replacing them with 1, 2, and 3 may have a negative impact on the training process, because algorithms usually rely on some kind of numerical distance between values to discriminate between them. In this case, the Euclidean distance from 1 to 3 is twice the distance from 1 to 2, so the training process treats "red" as much farther from "blue" than from "green". One hot encoding avoids this problem by mapping each categorical value to a binary vector: for example, "red" maps to [1,0,0], "green" to [0,1,0], and "blue" to [0,0,1]. The pair-wise distances between the three categories are then all the same, and you can use machine learning algorithms that require numerical input.
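For intuition, this is what one hot encoding produces for the three-color example, written as plain SQL. The colors table and its color column are hypothetical and used only for illustration:

=> SELECT color,
          CASE WHEN color = 'red'   THEN 1 ELSE 0 END AS color_red,
          CASE WHEN color = 'green' THEN 1 ELSE 0 END AS color_green,
          CASE WHEN color = 'blue'  THEN 1 ELSE 0 END AS color_blue
   FROM colors;

The ONE_HOT_ENCODER_FIT and APPLY_ONE_HOT_ENCODER functions used in the example below automate this mapping and store the category-to-index assignments in a model, so that training and testing data are encoded consistently.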
The following example shows how you can apply one hot encoding to the Titanic data set. If you would like to read more about this data set, see the Kaggle site.
Suppose you want to use a logistic regression classifier to predict which passengers survived the sinking of the Titanic. You cannot use categorical features for logistic regression without one hot encoding. This data set has two categorical features that you can use. The "sex" feature can be either male or female. The "embarkation_point" feature can be one of the following:
- S for Southampton
- Q for Queenstown
- C for Cherbourg
- Run the ONE_HOT_ENCODER_FIT function on the training data:
=> SELECT ONE_HOT_ENCODER_FIT('titanic_encoder', 'titanic_training', 'sex, embarkation_point');
ONE_HOT_ENCODER_FIT
---------------------
Success
(1 row)
- View a summary of the titanic_encoder model:
=> SELECT GET_MODEL_SUMMARY(USING PARAMETERS model_name='titanic_encoder');
GET_MODEL_SUMMARY
------------------------------------------------------------------------------------------
===========
call_string
===========
SELECT one_hot_encoder_fit('public.titanic_encoder','titanic_training','sex, embarkation_point'
USING PARAMETERS exclude_columns='', output_view='', extra_levels='{}');
==================
varchar_categories
==================
category_name |category_level|category_level_index
-----------------+--------------+--------------------
embarkation_point| C | 0
embarkation_point| Q | 1
embarkation_point| S | 2
embarkation_point| | 3
sex | female | 0
sex | male | 1
(1 row)
- Run the GET_MODEL_ATTRIBUTE function. This function returns the categorical levels in their native data types, so they can be compared easily with the original table:
=> SELECT * FROM (SELECT GET_MODEL_ATTRIBUTE(USING PARAMETERS model_name='titanic_encoder',
attr_name='varchar_categories')) AS attrs INNER JOIN (SELECT passenger_id, name, sex, age,
embarkation_point FROM titanic_training) AS original_data ON attrs.category_level
ILIKE original_data.embarkation_point ORDER BY original_data.passenger_id LIMIT 10;
   category_name   | category_level | category_level_index | passenger_id |                        name                         |  sex   | age | embarkation_point
-------------------+----------------+----------------------+--------------+-----------------------------------------------------+--------+-----+-------------------
 embarkation_point | S              |                    2 |            1 | Braund, Mr. Owen Harris                             | male   |  22 | S
 embarkation_point | C              |                    0 |            2 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female |  38 | C
 embarkation_point | S              |                    2 |            3 | Heikkinen, Miss. Laina                              | female |  26 | S
 embarkation_point | S              |                    2 |            4 | Futrelle, Mrs. Jacques Heath (Lily May Peel)        | female |  35 | S
 embarkation_point | S              |                    2 |            5 | Allen, Mr. William Henry                            | male   |  35 | S
 embarkation_point | Q              |                    1 |            6 | Moran, Mr. James                                    | male   |     | Q
 embarkation_point | S              |                    2 |            7 | McCarthy, Mr. Timothy J                             | male   |  54 | S
 embarkation_point | S              |                    2 |            8 | Palsson, Master. Gosta Leonard                      | male   |   2 | S
 embarkation_point | S              |                    2 |            9 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)   | female |  27 | S
 embarkation_point | C              |                    0 |           10 | Nasser, Mrs. Nicholas (Adele Achem)                 | female |  14 | C
(10 rows)
- Run the APPLY_ONE_HOT_ENCODER function on both the training and testing data (a note on how the encoded columns are named follows these steps):
=> CREATE VIEW titanic_training_encoded AS SELECT passenger_id, survived, pclass, sex_1, age,
sibling_and_spouse_count, parent_and_child_count, fare, embarkation_point_1, embarkation_point_2
FROM (SELECT APPLY_ONE_HOT_ENCODER(* USING PARAMETERS model_name='titanic_encoder')
FROM titanic_training) AS sq;
CREATE VIEW
=> CREATE VIEW titanic_testing_encoded AS SELECT passenger_id, name, pclass, sex_1, age,
sibling_and_spouse_count, parent_and_child_count, fare, embarkation_point_1, embarkation_point_2
FROM (SELECT APPLY_ONE_HOT_ENCODER(* USING PARAMETERS model_name='titanic_encoder')
FROM titanic_testing) AS sq;
CREATE VIEW
- Then, train a logistic regression classifier on the training data, and execute the model on the testing data:
=> SELECT LOGISTIC_REG('titanic_log_reg', 'titanic_training_encoded', 'survived', '*'
USING PARAMETERS exclude_columns='passenger_id, survived');
LOGISTIC_REG
---------------------------
Finished in 5 iterations
(1 row)
=> SELECT passenger_id, name, PREDICT_LOGISTIC_REG(pclass, sex_1, age, sibling_and_spouse_count,
parent_and_child_count, fare, embarkation_point_1, embarkation_point_2 USING PARAMETERS
model_name='titanic_log_reg') FROM titanic_testing_encoded ORDER BY passenger_id LIMIT 10;
passenger_id | name | PREDICT_LOGISTIC_REG
-------------+----------------------------------------------+----------------------
893 | Wilkes, Mrs. James (Ellen Needs) | 0
894 | Myles, Mr. Thomas Francis | 0
895 | Wirz, Mr. Albert | 0
896 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | 1
897 | Svensson, Mr. Johan Cervin | 0
898 | Connolly, Miss. Kate | 1
899 | Caldwell, Mr. Albert Francis | 0
900 | Abrahim, Mrs. Joseph (Sophie Halaut Easu) | 1
901 | Davies, Mr. John Samuel | 0
902 | Ilieff, Mr. Ylio |
(10 rows)
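In the encoded views above, each indicator column name combines the source column with a category level index from the model summary: sex_1 corresponds to male (index 1), and embarkation_point_1 and embarkation_point_2 correspond to Q (index 1) and S (index 2). Omitting the index-0 level of each feature, as these views do, is a common way to avoid perfectly collinear predictors in regression.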
4 - Imputing missing values
You can use the IMPUTE function to replace missing data with the most frequent value or with the average value in the same column. Using the function, you can specify either the mean or mode method. The following examples show how you can use the IMPUTE function on the small_input_impute table.
First, query the table so you can see the missing values:
=> SELECT * FROM small_input_impute;
pid | pclass | gender | x1 | x2 | x3 | x4 | x5 | x6
----+--------+--------+-----------+-----------+-----------+----+----+----
5 | 0 | 1 | -2.590837 | -2.892819 | -2.70296 | 2 | t | C
7 | 1 | 1 | 3.829239 | 3.08765 | Infinity | | f | C
13 | 0 | 0 | -9.060605 | -9.390844 | -9.559848 | 6 | t | C
15 | 0 | 1 | -2.590837 | -2.892819 | -2.70296 | 2 | f | A
16 | 0 | 1 | -2.264599 | -2.615146 | -2.10729 | 11 | f | A
19 | 1 | 1 | | 3.841606 | 3.754375 | 20 | t |
1 | 0 | 0 | -9.445818 | -9.740541 | -9.786974 | 3 | t | A
1 | 0 | 0 | -9.445818 | -9.740541 | -9.786974 | 3 | t | A
2 | 0 | 0 | -9.618292 | -9.308881 | -9.562255 | 4 | t | A
3 | 0 | 0 | -9.060605 | -9.390844 | -9.559848 | 6 | t | B
4 | 0 | 0 | -2.264599 | -2.615146 | -2.10729 | 15 | t | B
6 | 0 | 1 | -2.264599 | -2.615146 | -2.10729 | 11 | t | C
8 | 1 | 1 | 3.273592 | | 3.477332 | 18 | f | B
10 | 1 | 1 | | 3.841606 | 3.754375 | 20 | t | A
18 | 1 | 1 | 3.273592 | | 3.477332 | 18 | t | B
20 | 1 | 1 | | 3.841606 | 3.754375 | 20 | | C
9 | 1 | 1 | | 3.841606 | 3.754375 | 20 | f | B
11 | 0 | 0 | -9.445818 | -9.740541 | -9.786974 | 3 | t | B
12 | 0 | 0 | -9.618292 | -9.308881 | -9.562255 | 4 | t | C
14 | 0 | 0 | -2.264599 | -2.615146 | -2.10729 | 15 | f | A
17 | 1 | 1 | 3.829239 | 3.08765 | Infinity | | f | B
(21 rows)
Specify the mean method
Execute the IMPUTE function, specifying the mean method:
=> SELECT IMPUTE('output_view','small_input_impute', 'pid, x1,x2,x3,x4','mean'
USING PARAMETERS exclude_columns='pid');
IMPUTE
--------------------------
Finished in 1 iteration
(1 row)
View output_view to see the imputed values:
=> SELECT * FROM output_view;
pid | pclass | gender | x1 | x2 | x3 | x4 | x5 | x6
----+--------+--------+-------------------+-------------------+-------------------+----+----+----
5 | 0 | 1 | -2.590837 | -2.892819 | -2.70296 | 2 | t | C
7 | 1 | 1 | 3.829239 | 3.08765 | -3.12989705263158 | 11 | f | C
13 | 0 | 0 | -9.060605 | -9.390844 | -9.559848 | 6 | t | C
15 | 0 | 1 | -2.590837 | -2.892819 | -2.70296 | 2 | f | A
16 | 0 | 1 | -2.264599 | -2.615146 | -2.10729 | 11 | f | A
19 | 1 | 1 | -3.86645035294118 | 3.841606 | 3.754375 | 20 | t |
9 | 1 | 1 | -3.86645035294118 | 3.841606 | 3.754375 | 20 | f | B
11 | 0 | 0 | -9.445818 | -9.740541 | -9.786974 | 3 | t | B
12 | 0 | 0 | -9.618292 | -9.308881 | -9.562255 | 4 | t | C
14 | 0 | 0 | -2.264599 | -2.615146 | -2.10729 | 15 | f | A
17 | 1 | 1 | 3.829239 | 3.08765 | -3.12989705263158 | 11 | f | B
1 | 0 | 0 | -9.445818 | -9.740541 | -9.786974 | 3 | t | A
1 | 0 | 0 | -9.445818 | -9.740541 | -9.786974 | 3 | t | A
2 | 0 | 0 | -9.618292 | -9.308881 | -9.562255 | 4 | t | A
3 | 0 | 0 | -9.060605 | -9.390844 | -9.559848 | 6 | t | B
4 | 0 | 0 | -2.264599 | -2.615146 | -2.10729 | 15 | t | B
6 | 0 | 1 | -2.264599 | -2.615146 | -2.10729 | 11 | t | C
8 | 1 | 1 | 3.273592 | -3.22766163157895 | 3.477332 | 18 | f | B
10 | 1 | 1 | -3.86645035294118 | 3.841606 | 3.754375 | 20 | t | A
18 | 1 | 1 | 3.273592 | -3.22766163157895 | 3.477332 | 18 | t | B
20 | 1 | 1 | -3.86645035294118 | 3.841606 | 3.754375 | 20 | | C
(21 rows)
You can also execute the IMPUTE function, specifying the mean method and using the partition_columns parameter. This parameter works similarly to the GROUP BY clause:
=> SELECT IMPUTE('output_view_group','small_input_impute', 'pid, x1,x2,x3,x4','mean'
USING PARAMETERS exclude_columns='pid', partition_columns='pclass,gender');
impute
--------------------------
Finished in 1 iteration
(1 row)
View output_view_group to see the imputed values:
=> SELECT * FROM output_view_group;
pid | pclass | gender | x1 | x2 | x3 | x4 | x5 | x6
----+--------+--------+-----------+------------------+------------------+----+----+----
5 | 0 | 1 | -2.590837 | -2.892819 | -2.70296 | 2 | t | C
7 | 1 | 1 | 3.829239 | 3.08765 | 3.66202733333333 | 19 | f | C
13 | 0 | 0 | -9.060605 | -9.390844 | -9.559848 | 6 | t | C
15 | 0 | 1 | -2.590837 | -2.892819 | -2.70296 | 2 | f | A
16 | 0 | 1 | -2.264599 | -2.615146 | -2.10729 | 11 | f | A
19 | 1 | 1 | 3.5514155 | 3.841606 | 3.754375 | 20 | t |
1 | 0 | 0 | -9.445818 | -9.740541 | -9.786974 | 3 | t | A
1 | 0 | 0 | -9.445818 | -9.740541 | -9.786974 | 3 | t | A
2 | 0 | 0 | -9.618292 | -9.308881 | -9.562255 | 4 | t | A
3 | 0 | 0 | -9.060605 | -9.390844 | -9.559848 | 6 | t | B
4 | 0 | 0 | -2.264599 | -2.615146 | -2.10729 | 15 | t | B
6 | 0 | 1 | -2.264599 | -2.615146 | -2.10729 | 11 | t | C
8 | 1 | 1 | 3.273592 | 3.59028733333333 | 3.477332 | 18 | f | B
10 | 1 | 1 | 3.5514155 | 3.841606 | 3.754375 | 20 | t | A
18 | 1 | 1 | 3.273592 | 3.59028733333333 | 3.477332 | 18 | t | B
20 | 1 | 1 | 3.5514155 | 3.841606 | 3.754375 | 20 | | C
9 | 1 | 1 | 3.5514155 | 3.841606 | 3.754375 | 20 | f | B
11 | 0 | 0 | -9.445818 | -9.740541 | -9.786974 | 3 | t | B
12 | 0 | 0 | -9.618292 | -9.308881 | -9.562255 | 4 | t | C
14 | 0 | 0 | -2.264599 | -2.615146 | -2.10729 | 15 | f | A
17 | 1 | 1 | 3.829239 | 3.08765 | 3.66202733333333 | 19 | f | B
(21 rows)
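Conceptually, the partitioned mean imputation replaces each NULL with the average of its pclass/gender group. A sketch of the same idea using only standard SQL window functions, for the x1 column (AVG ignores NULLs when computing the group average):

=> SELECT pid, pclass, gender,
          COALESCE(x1, AVG(x1) OVER (PARTITION BY pclass, gender)) AS x1_imputed
   FROM small_input_impute;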
Specify the mode method
Execute the IMPUTE function, specifying the mode method:
=> SELECT impute('output_view_mode','small_input_impute', 'pid, x5,x6','mode'
USING PARAMETERS exclude_columns='pid');
impute
--------------------------
Finished in 1 iteration
(1 row)
View output_view_mode to see the imputed values:
=> SELECT * FROM output_view_mode;
pid | pclass | gender | x1 | x2 | x3 | x4 | x5 | x6
----+--------+--------+-----------+-----------+-----------+----+----+----
5 | 0 | 1 | -2.590837 | -2.892819 | -2.70296 | 2 | t | C
7 | 1 | 1 | 3.829239 | 3.08765 | Infinity | | f | C
13 | 0 | 0 | -9.060605 | -9.390844 | -9.559848 | 6 | t | C
15 | 0 | 1 | -2.590837 | -2.892819 | -2.70296 | 2 | f | A
16 | 0 | 1 | -2.264599 | -2.615146 | -2.10729 | 11 | f | A
19 | 1 | 1 | | 3.841606 | 3.754375 | 20 | t | B
1 | 0 | 0 | -9.445818 | -9.740541 | -9.786974 | 3 | t | A
1 | 0 | 0 | -9.445818 | -9.740541 | -9.786974 | 3 | t | A
2 | 0 | 0 | -9.618292 | -9.308881 | -9.562255 | 4 | t | A
3 | 0 | 0 | -9.060605 | -9.390844 | -9.559848 | 6 | t | B
4 | 0 | 0 | -2.264599 | -2.615146 | -2.10729 | 15 | t | B
6 | 0 | 1 | -2.264599 | -2.615146 | -2.10729 | 11 | t | C
8 | 1 | 1 | 3.273592 | | 3.477332 | 18 | f | B
10 | 1 | 1 | | 3.841606 | 3.754375 | 20 | t | A
18 | 1 | 1 | 3.273592 | | 3.477332 | 18 | t | B
20 | 1 | 1 | | 3.841606 | 3.754375 | 20 | t | C
9 | 1 | 1 | | 3.841606 | 3.754375 | 20 | f | B
11 | 0 | 0 | -9.445818 | -9.740541 | -9.786974 | 3 | t | B
12 | 0 | 0 | -9.618292 | -9.308881 | -9.562255 | 4 | t | C
14 | 0 | 0 | -2.264599 | -2.615146 | -2.10729 | 15 | f | A
17 | 1 | 1 | 3.829239 | 3.08765 | Infinity | | f | B
(21 rows)
You can also execute the IMPUTE function, specifying the mode method and using the partition_columns parameter. This parameter works similarly to the GROUP BY clause:
=> SELECT impute('output_view_mode_group','small_input_impute', 'pid, x5,x6','mode'
USING PARAMETERS exclude_columns='pid',partition_columns='pclass,gender');
impute
--------------------------
Finished in 1 iteration
(1 row)
View output_view_mode_group to see the imputed values:
=> SELECT * FROM output_view_mode_group;
pid | pclass | gender | x1 | x2 | x3 | x4 | x5 | x6
----+--------+--------+-----------+-----------+-----------+----+----+----
1 | 0 | 0 | -9.445818 | -9.740541 | -9.786974 | 3 | t | A
1 | 0 | 0 | -9.445818 | -9.740541 | -9.786974 | 3 | t | A
2 | 0 | 0 | -9.618292 | -9.308881 | -9.562255 | 4 | t | A
3 | 0 | 0 | -9.060605 | -9.390844 | -9.559848 | 6 | t | B
4 | 0 | 0 | -2.264599 | -2.615146 | -2.10729 | 15 | t | B
13 | 0 | 0 | -9.060605 | -9.390844 | -9.559848 | 6 | t | C
11 | 0 | 0 | -9.445818 | -9.740541 | -9.786974 | 3 | t | B
12 | 0 | 0 | -9.618292 | -9.308881 | -9.562255 | 4 | t | C
14 | 0 | 0 | -2.264599 | -2.615146 | -2.10729 | 15 | f | A
5 | 0 | 1 | -2.590837 | -2.892819 | -2.70296 | 2 | t | C
15 | 0 | 1 | -2.590837 | -2.892819 | -2.70296 | 2 | f | A
16 | 0 | 1 | -2.264599 | -2.615146 | -2.10729 | 11 | f | A
6 | 0 | 1 | -2.264599 | -2.615146 | -2.10729 | 11 | t | C
7 | 1 | 1 | 3.829239 | 3.08765 | Infinity | | f | C
19 | 1 | 1 | | 3.841606 | 3.754375 | 20 | t | B
9 | 1 | 1 | | 3.841606 | 3.754375 | 20 | f | B
17 | 1 | 1 | 3.829239 | 3.08765 | Infinity | | f | B
8 | 1 | 1 | 3.273592 | | 3.477332 | 18 | f | B
10 | 1 | 1 | | 3.841606 | 3.754375 | 20 | t | A
18 | 1 | 1 | 3.273592 | | 3.477332 | 18 | t | B
20 | 1 | 1 | | 3.841606 | 3.754375 | 20 | f | C
(21 rows)
See also
- IMPUTE
5 - Normalizing data
The purpose of normalization is, primarily, to scale numeric data from different columns down to an equivalent scale. For example, suppose you execute the LINEAR_REG function on a data set with two feature columns, current_salary and years_worked. The output value you are trying to predict is a worker's future salary. The values in the current_salary column are likely to have a far wider range, and much larger values, than the values in the years_worked column. Therefore, the values in the current_salary column can overshadow the values in the years_worked column, thus skewing your model.
Vertica offers the following normalization methods:
- MinMax. Using the MinMax normalization method, you can normalize the values in both of these columns to fall within a distribution of values between 0 and 1. Doing so allows you to compare values on very different scales to one another by reducing the dominance of one column over the other.
- Z-score. Using the Z-score normalization method, you can normalize the values in both of these columns to be the number of standard deviations an observation is from the mean of each column. This allows you to compare your data to a normally distributed random variable.
- Robust Z-score. Using the Robust Z-score normalization method, you can lessen the influence of outliers on Z-score calculations. Robust Z-score normalization uses the median instead of the mean used in Z-score, which removes some of the influence of outliers in the data.
Normalizing data results in the creation of a view where the normalized data is saved. The output_view option of the NORMALIZE function determines the name of the view.
Normalizing salary data using MinMax
The following example shows how you can normalize the salary_data table using the MinMax normalization method.
=> SELECT NORMALIZE('normalized_salary_data', 'salary_data', 'current_salary, years_worked', 'minmax');
NORMALIZE
--------------------------
Finished in 1 iteration
(1 row)
=> SELECT * FROM normalized_salary_data;
employee_id | first_name | last_name | years_worked | current_salary
------------+-------------+------------+----------------------+----------------------
189 | Shawn | Moore | 0.350000000000000000 | 0.437246565765357217
518 | Earl | Shaw | 0.100000000000000000 | 0.978867411144492943
1126 | Susan | Alexander | 0.250000000000000000 | 0.909048995710749580
1157 | Jack | Stone | 0.100000000000000000 | 0.601863084103319918
1277 | Scott | Wagner | 0.050000000000000000 | 0.455949209228501786
3188 | Shirley | Flores | 0.400000000000000000 | 0.538816771536005140
3196 | Andrew | Holmes | 0.900000000000000000 | 0.183954046444834949
3430 | Philip | Little | 0.100000000000000000 | 0.735279557092379495
3522 | Jerry | Ross | 0.800000000000000000 | 0.671828883472214349
3892 | Barbara | Flores | 0.350000000000000000 | 0.092901007123556866
.
.
.
(1000 rows)
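MinMax maps each value x in a column to (x - min) / (max - min), so the smallest current_salary becomes 0 and the largest becomes 1. As a cross-check, you can reproduce the normalized values for one column with the analytic MIN and MAX functions; the output alias below is illustrative:

=> SELECT employee_id,
          (current_salary - MIN(current_salary) OVER ())
        / (MAX(current_salary) OVER () - MIN(current_salary) OVER ()) AS current_salary_minmax
   FROM salary_data;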
Normalizing salary data using z-score
The following example shows how you can normalize the salary_data table using the Z-score normalization method.
=> SELECT NORMALIZE('normalized_z_salary_data', 'salary_data', 'current_salary, years_worked',
'zscore');
NORMALIZE
--------------------------
Finished in 1 iteration
(1 row)
=> SELECT * FROM normalized_z_salary_data;
employee_id | first_name | last_name | years_worked | current_salary
------------+-------------+------------+---------------------+----------------------
189 | Shawn | Moore | -0.524447274157005 | -0.221041249770669
518 | Earl | Shaw | -1.35743214416495 | 1.66054215981221
1126 | Susan | Alexander | -0.857641222160185 | 1.41799393943946
1157 | Jack | Stone | -1.35743214416495 | 0.350834283622416
1277 | Scott | Wagner | -1.52402911816654 | -0.156068522159045
3188 | Shirley | Flores | -0.357850300155415 | 0.131812255991634
3196 | Andrew | Holmes | 1.30811943986048 | -1.10097599783475
3430 | Philip | Little | -1.35743214416495 | 0.814321286168547
3522 | Jerry | Ross | 0.974925491857304 | 0.593894513770248
3892 | Barbara | Flores | -0.524447274157005 | -1.41729301118583
.
.
.
(1000 rows)
Normalizing salary data using robust z-score
The following example shows how you can normalize the salary_data table using the robust Z-score normalization method.
=> SELECT NORMALIZE('normalized_robustz_salary_data', 'salary_data', 'current_salary, years_worked', 'robust_zscore');
NORMALIZE
--------------------------
Finished in 1 iteration
(1 row)
=> SELECT * FROM normalized_robustz_salary_data;
employee_id | first_name | last_name | years_worked | current_salary
------------+-------------+------------+--------------------+-----------------------
189 | Shawn | Moore | -0.404694455685957 | -0.158933849655499140
518 | Earl | Shaw | -1.079185215162552 | 1.317126172796275889
1126 | Susan | Alexander | -0.674490759476595 | 1.126852528914384584
1157 | Jack | Stone | -1.079185215162552 | 0.289689691751547422
1277 | Scott | Wagner | -1.214083367057871 | -0.107964200747705902
3188 | Shirley | Flores | -0.269796303790638 | 0.117871818902746738
3196 | Andrew | Holmes | 1.079185215162552 | -0.849222942006447161
3430 | Philip | Little | -1.079185215162552 | 0.653284859470426481
3522 | Jerry | Ross | 0.809388911371914 | 0.480364995828913355
3892 | Barbara | Flores | -0.404694455685957 | -1.097366550974798397
3939 | Anna | Walker | -0.944287063267233 | 0.414956177842775781
4165 | Martha | Reyes | 0.269796303790638 | 0.773947701782753329
4335 | Phillip | Wright | -1.214083367057871 | 1.218843012657445647
4534 | Roger | Harris | 1.079185215162552 | 1.155185021164402608
4806 | John | Robinson | 0.809388911371914 | -0.494320112876813908
4881 | Kelly | Welch | 0.134898151895319 | -0.540778808820045933
4889 | Jennifer | Arnold | 1.214083367057871 | -0.299762093576526566
5067 | Martha | Parker | 0.000000000000000 | 0.719991348857328239
5523 | John | Martin | -0.269796303790638 | -0.411248545269163826
6004 | Nicole | Sullivan | 0.269796303790638 | 1.065141044522487821
6013 | Harry | Woods | -0.944287063267233 | 1.005664438654129376
6240 | Norma | Martinez | 1.214083367057871 | 0.762412844887071691
.
.
.
(1000 rows)
6 - PCA (principal component analysis)
Principal Component Analysis (PCA) is a technique that reduces the dimensionality of data while retaining the variation present in the data. In essence, a new coordinate system is constructed so that data variation is strongest along the first axis, less strong along the second axis, and so on. Then, the data points are transformed into this new coordinate system. The directions of the axes are called principal components.
If the input data is a table with p columns, there can be at most p principal components. However, the data variation along the direction of some k-th principal component often becomes almost negligible, which allows us to keep only the first k components. As a result, the new coordinate system has fewer axes, and the transformed data table has only k columns instead of p. It is important to remember that the k output columns are not simply a subset of the p input columns. Instead, each of the k output columns is a combination of all p input columns.
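In other words, if an input row is (x1, ..., xp), its i-th transformed coordinate is its projection onto the i-th principal component, computed on centered data (each xj has its column mean subtracted): coli = vi1*x1 + vi2*x2 + ... + vip*xp, where (vi1, ..., vip) is the unit vector that points along the i-th principal component.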
You can use the following functions to train and apply the PCA model:
- PCA
- APPLY_PCA
- APPLY_INVERSE_PCA
For a complete example, see Dimension reduction using PCA.
6.1 - Dimension reduction using PCA
This PCA example uses a data set named world, which contains a large number of columns. The example shows how you can apply PCA to all columns in the data set (except HDI and country) and reduce them into two dimensions.
Before you begin the example, load the Machine Learning sample data.
- Create the PCA model, named pcamodel.

=> SELECT PCA ('pcamodel', 'world','country,HDI,em1970,em1971,em1972,em1973,em1974,em1975,em1976,em1977,
em1978,em1979,em1980,em1981,em1982,em1983,em1984,em1985,em1986,em1987,em1988,em1989,em1990,em1991,em1992,
em1993,em1994,em1995,em1996,em1997,em1998,em1999,em2000,em2001,em2002,em2003,em2004,em2005,em2006,em2007,
em2008,em2009,em2010,gdp1970,gdp1971,gdp1972,gdp1973,gdp1974,gdp1975,gdp1976,gdp1977,gdp1978,gdp1979,gdp1980,
gdp1981,gdp1982,gdp1983,gdp1984,gdp1985,gdp1986,gdp1987,gdp1988,gdp1989,gdp1990,gdp1991,gdp1992,gdp1993,
gdp1994,gdp1995,gdp1996,gdp1997,gdp1998,gdp1999,gdp2000,gdp2001,gdp2002,gdp2003,gdp2004,gdp2005,gdp2006,
gdp2007,gdp2008,gdp2009,gdp2010'
USING PARAMETERS exclude_columns='HDI, country');
PCA
---------------------------------------------------------------
Finished in 1 iterations.  Accepted Rows: 96  Rejected Rows: 0
(1 row)
- View the summary output of pcamodel.

=> SELECT GET_MODEL_SUMMARY(USING PARAMETERS model_name='pcamodel');
GET_MODEL_SUMMARY
--------------------------------------------------------------------------------
- Next, apply PCA to a few selected columns. The HDI and country columns are excluded from the transformation and passed through as key columns:

=> SELECT APPLY_PCA (HDI,country,em1970,em2010,gdp1970,gdp2010 USING PARAMETERS model_name='pcamodel',
   exclude_columns='HDI,country', key_columns='HDI,country',cutoff=.3) OVER () FROM world;
  HDI  |       country       |       col1
-------+---------------------+-------------------
 0.886 | Belgium             | -36288.1191849017
 0.699 | Belize              | -81740.32711562
 0.427 | Benin               | -122666.882708325
 0.805 | Chile               | -161356.484748602
 0.687 | China               | -202634.254216416
 0.744 | Costa Rica          | -242043.080125449
 0.4   | Cote d'Ivoire       | -283330.394428932
 0.776 | Cuba                | -322625.857541772
 0.895 | Denmark             | -356086.311721071
 0.644 | Egypt               | -403634.743992772
.
.
.
(96 rows)
- Then, optionally apply the inverse function to transform the data back to its original state. This example shows an abbreviated output, only for the first record. There are 96 records in total.

=> SELECT APPLY_INVERSE_PCA (HDI,country,em1970,em2010,gdp1970,gdp2010 USING PARAMETERS model_name='pcamodel',
   exclude_columns='HDI,country', key_columns='HDI,country') OVER () FROM world LIMIT 1;
-[ RECORD 1 ]--------------
HDI     | 0.886
country | Belgium
em1970  | 3.74891915022521
em1971  | 26.091852917619
em1972  | 22.0262860721982
em1973  | 24.8214492074202
em1974  | 20.9486650320945
em1975  | 29.5717692117088
em1976  | 17.4373459783249
em1977  | 33.1895610966146
em1978  | 15.6251407781098
em1979  | 14.9560299812815
em1980  | 18.0870223053504
em1981  | -6.23151505146251
em1982  | -7.12300504708672
em1983  | -7.52627957856581
em1984  | -7.17428622245234
em1985  | -9.04899186621455
em1986  | -10.5098581697156
em1987  | -7.97146984849547
em1988  | -8.85458031319287
em1989  | -8.78422101747477
em1990  | -9.61931854722004
em1991  | -11.6411235452067
em1992  | -12.8882752879355
em1993  | -15.0647523842803
em1994  | -14.3266175918398
em1995  | -9.07603254825782
em1996  | -9.32002671928241
em1997  | -10.0209028262361
em1998  | -6.70882735196004
em1999  | -7.32575918131333
em2000  | -10.3113551933996
em2001  | -11.0162573094354
em2002  | -10.886264397431
em2003  | -8.96078372850612
em2004  | -11.5157129257881
em2005  | -12.5048269019293
em2006  | -12.2345161132594
em2007  | -8.92504587601715
em2008  | -12.1136551375247
em2009  | -10.1144380511421
em2010  | -7.72468307053519
gdp1970 | 10502.1047183969
gdp1971 | 9259.97560190599
gdp1972 | 6593.98178532712
gdp1973 | 5325.33813328068
gdp1974 | -899.029529832931
gdp1975 | -3184.93671107899
gdp1976 | -4517.68204331439
gdp1977 | -3322.9509067019
gdp1978 | -33.8221923368737
gdp1979 | 2882.50573071066
gdp1980 | 3638.74436577365
gdp1981 | 2211.77365027338
gdp1982 | 5811.44631880621
gdp1983 | 7365.75180165581
gdp1984 | 10465.1797058904
gdp1985 | 12312.7219748196
gdp1986 | 12309.0418293413
gdp1987 | 13695.5173269466
gdp1988 | 12531.9995299889
gdp1989 | 13009.2244205049
gdp1990 | 10697.6839797576
gdp1991 | 6835.94651304181
gdp1992 | 4275.67753277099
gdp1993 | 3382.29408813394
gdp1994 | 3703.65406726311
gdp1995 | 4238.17659535371
gdp1996 | 4692.48744219914
gdp1997 | 4539.23538342266
gdp1998 | 5886.78983381162
gdp1999 | 7527.72448728762
gdp2000 | 7646.05563584361
gdp2001 | 9053.22077886667
gdp2002 | 9914.82548013531
gdp2003 | 9201.64413455221
gdp2004 | 9234.70123279344
gdp2005 | 9565.5457350936
gdp2006 | 9569.86316415438
gdp2007 | 9104.60260145907
gdp2008 | 8182.8163827425
gdp2009 | 6279.93197775805
gdp2010 | 4274.40397281553
7 - Sampling data
The goal of data sampling is to take a smaller, more manageable sample of a much larger data set. With a sample data set, you can produce predictive models or use it to help you tune your database. The following example shows how you can use the TABLESAMPLE clause to create a sample of your data.
Sampling data from a table
Before you begin the example, load the Machine Learning sample data.
Using the baseball table, create a new table named baseball_sample containing a 25% sample of baseball. Remember, TABLESAMPLE does not guarantee that the exact percentage of records defined in the clause is returned.
=> CREATE TABLE baseball_sample AS SELECT * FROM baseball TABLESAMPLE(25);
CREATE TABLE
=> SELECT * FROM baseball_sample;
id | first_name | last_name | dob | team | hr | hits | avg | salary
-----+------------+------------+------------+------------+-----+-------+-------+-------------
4 | Amanda | Turner | 1997-12-22 | Maroon | 58 | 177 | 0.187 | 8047721
20 | Jesse | Cooper | 1983-04-13 | Yellow | 97 | 39 | 0.523 | 4252837
22 | Randy | Peterson | 1980-05-28 | Orange | 14 | 16 | 0.141 | 11827728.1
24 | Carol | Harris | 1991-04-02 | Fuscia | 96 | 12 | 0.456 | 40572253.6
32 | Rose | Morrison | 1977-07-26 | Goldenrod | 27 | 153 | 0.442 | 14510752.49
50 | Helen | Medina | 1987-12-26 | Maroon | 12 | 150 | 0.54 | 32169267.91
70 | Richard | Gilbert | 1983-07-13 | Khaki | 1 | 250 | 0.213 | 40518422.76
81 | Angela | Cole | 1991-08-16 | Violet | 87 | 136 | 0.706 | 42875181.51
82 | Elizabeth | Foster | 1994-04-30 | Indigo | 46 | 163 | 0.481 | 33896975.53
98 | Philip | Gardner | 1992-05-06 | Puce | 39 | 239 | 0.697 | 20967480.67
102 | Ernest | Freeman | 1983-10-05 | Turquoise | 46 | 77 | 0.564 | 21444463.92
.
.
.
(227 rows)
With your sample you can create a predictive model, or tune your database.
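If you need an exact number of rows rather than an approximate percentage, one common alternative is to order by RANDOM() and apply a LIMIT. This scans and sorts the entire table, so it is slower than TABLESAMPLE; the table name below is illustrative:

=> CREATE TABLE baseball_sample_exact AS
       SELECT * FROM baseball ORDER BY RANDOM() LIMIT 250;
CREATE TABLE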
See also
- FROM clause (for more information about the TABLESAMPLE clause)
8 - SVD (singular value decomposition)
Singular Value Decomposition (SVD) is a matrix decomposition method that allows you to approximate a matrix X with dimensions n-by-p as a product of three matrices: X(n-by-p) = U(n-by-k).S(k-by-k).VT(k-by-p), where k is an integer from 1 to p, and S is a diagonal matrix. The diagonal of S holds non-negative values, called singular values, sorted from the largest, at the top left, to the smallest, at the bottom right. All other elements of S are zero.
In practice, the matrix V(p-by-k), which is the transpose of VT, is usually preferred.
If k (an input parameter to the decomposition algorithm, also known as the number of components to keep in the output) is equal to p, the decomposition is exact. If k is less than p, the decomposition becomes an approximation.
An application of SVD is lossy data compression. For example, storing X requires n.p elements, while storing the three matrices U, S, and VT requires only n.k + k + k.p elements. If n=1000, p=10, and k=2, storing X would require 10,000 elements, while storing the approximation would require 2,000+2+20 = 2,022 elements. A smaller value of k increases the savings in storage space, while a larger value of k gives a more accurate approximation.
Depending on your data, the singular values may decrease rapidly, which allows you to choose a value of k that is much smaller than the value of p.
Another common application of SVD is to perform principal component analysis.
You can use the following functions to train and apply the SVD model:
- SVD
- APPLY_SVD
- APPLY_INVERSE_SVD
For a complete example, see Computing SVD.
8.1 - Computing SVD
This SVD example uses a small data set named small_svd, whose table is a matrix of numbers. The example computes the SVD of this matrix using the SVD function, which produces one vector and two matrices: the vector contains the singular values, the matrix U contains the left singular vectors, and the matrix V contains the right singular vectors.
Before you begin the example, load the Machine Learning sample data.
- Create the SVD model, named svdmodel.

=> SELECT SVD ('svdmodel', 'small_svd', 'x1,x2,x3,x4');
SVD
--------------------------------------------------------------
Finished in 1 iterations.  Accepted Rows: 8  Rejected Rows: 0
(1 row)
- View the summary output of svdmodel (a note on reading the singular values follows these steps).

=> SELECT GET_MODEL_SUMMARY(USING PARAMETERS model_name='svdmodel');
GET_MODEL_SUMMARY
----------------------------------------------------------------------------------------------------------
=======
columns
=======
index|name
-----+----
  1  | x1
  2  | x2
  3  | x3
  4  | x4

===============
singular_values
===============
index| value  |explained_variance|accumulated_explained_variance
-----+--------+------------------+------------------------------
  1  |22.58748|      0.95542     |            0.95542
  2  | 3.79176|      0.02692     |            0.98234
  3  | 2.55864|      0.01226     |            0.99460
  4  | 1.69756|      0.00540     |            1.00000

======================
right_singular_vectors
======================
index|vector1 |vector2 |vector3 |vector4
-----+--------+--------+--------+--------
  1  | 0.58736| 0.08033| 0.74288|-0.31094
  2  | 0.26661| 0.78275|-0.06148| 0.55896
  3  | 0.71779|-0.13672|-0.64563|-0.22193
  4  | 0.26211|-0.60179| 0.16587| 0.73596

========
counters
========
   counter_name   |counter_value
------------------+-------------
accepted_row_count|      8
rejected_row_count|      0
iteration_count   |      1

===========
call_string
===========
SELECT SVD('public.svdmodel', 'small_svd', 'x1,x2,x3,x4');
(1 row)
- Create a new table, named Umat, to obtain the values for U:

=> CREATE TABLE Umat AS SELECT APPLY_SVD(id, x1, x2, x3, x4 USING PARAMETERS model_name='svdmodel',
   exclude_columns='id', key_columns='id') OVER() FROM small_svd;
CREATE TABLE
- View the results in the Umat table. This table expresses the matrix in the new coordinate system:

=> SELECT * FROM Umat ORDER BY id;
 id  |        col1        |        col2        |        col3         |        col4
-----+--------------------+--------------------+---------------------+--------------------
   1 | -0.494871802886819 | -0.161721379259287 | 0.0712816417153664  | -0.473145877877408
   2 | -0.17652411036246  | 0.0753183783382909 | -0.678196192333598  | 0.0567124770173372
   3 | -0.150974762654569 | -0.589561842046029 | 0.00392654610109522 | 0.360011163271921
   4 | -0.44849499240202  | 0.347260956311326  | 0.186958376368345   | 0.378561270493651
   5 | -0.494871802886819 | -0.161721379259287 | 0.0712816417153664  | -0.473145877877408
   6 | -0.17652411036246  | 0.0753183783382909 | -0.678196192333598  | 0.0567124770173372
   7 | -0.150974762654569 | -0.589561842046029 | 0.00392654610109522 | 0.360011163271921
   8 | -0.44849499240202  | 0.347260956311326  | 0.186958376368345   | 0.378561270493651
(8 rows)
- Then, optionally transform the data back by converting it from Umat to Xmat. First, create the Xmat table by applying the APPLY_INVERSE_SVD function to the Umat table:

=> CREATE TABLE Xmat AS SELECT APPLY_INVERSE_SVD(* USING PARAMETERS model_name='svdmodel',
   exclude_columns='id', key_columns='id') OVER() FROM Umat;
CREATE TABLE
- Then view the data in the Xmat table that was created:

=> SELECT id, x1::NUMERIC(5,1), x2::NUMERIC(5,1), x3::NUMERIC(5,1), x4::NUMERIC(5,1) FROM Xmat ORDER BY id;
 id | x1  | x2  | x3  | x4
----+-----+-----+-----+-----
  1 | 7.0 | 3.0 | 8.0 | 2.0
  2 | 1.0 | 1.0 | 4.0 | 1.0
  3 | 2.0 | 3.0 | 2.0 | 0.0
  4 | 6.0 | 2.0 | 7.0 | 4.0
  5 | 7.0 | 3.0 | 8.0 | 2.0
  6 | 1.0 | 1.0 | 4.0 | 1.0
  7 | 2.0 | 3.0 | 2.0 | 0.0
  8 | 6.0 | 2.0 | 7.0 | 4.0
(8 rows)
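A note on the model summary above: the singular_values section shows that the first two components already account for about 98% of the accumulated explained variance, so keeping k=2 components would be a reasonable trade-off between size and accuracy for this data. The section names in the summary suggest they are also exposed as model attributes, mirroring the varchar_categories example earlier in this section; if so, you can retrieve the singular values alone with:

=> SELECT GET_MODEL_ATTRIBUTE(USING PARAMETERS model_name='svdmodel', attr_name='singular_values');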