Data preparation
Before you can analyze your data, you must prepare it. You can do the following data preparation tasks in Vertica:
- 1: Balancing imbalanced data
- 2: Detecting outliers
- 3: Encoding categorical columns
- 4: Imputing missing values
- 5: Normalizing data
- 6: PCA (principal component analysis)
- 6.1: Dimension reduction using PCA
- 7: Sampling data
- 8: SVD (singular value decomposition)
- 8.1: Computing SVD
1 - Balancing imbalanced data
Imbalanced data occurs when there is an uneven distribution of classes in the data. Building a predictive model on an imbalanced data set can produce a model that appears to yield high accuracy but generalizes poorly to new data from the minority class. To prevent creating models with false levels of accuracy, you should rebalance your imbalanced data before creating a predictive model.
Before you begin the example, load the Machine Learning sample data.
Imbalanced data is common in financial transaction data, where most transactions are not fraudulent and a small number are, as shown in the following example.
- View the distribution of the classes:

=> SELECT fraud, COUNT(fraud) FROM transaction_data GROUP BY fraud;
 fraud | COUNT
-------+-------
 TRUE  |    19
 FALSE |   981
(2 rows)
- Use the BALANCE function to create a more balanced data set:

=> SELECT BALANCE('balance_fin_data', 'transaction_data', 'fraud', 'under_sampling'
          USING PARAMETERS sampling_ratio = 0.2);
         BALANCE
--------------------------
 Finished in 1 iteration
(1 row)
- View the new distribution of the classes:

=> SELECT fraud, COUNT(fraud) FROM balance_fin_data GROUP BY fraud;
 fraud | COUNT
-------+-------
 t     |    19
 f     |   236
(2 rows)
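To verify the class proportions in the balanced view, you can compute each class's share of the rows. This is a quick sanity check in plain SQL; the column aliases are illustrative:

=> SELECT fraud, COUNT(*) AS n,
          COUNT(*)::FLOAT / SUM(COUNT(*)) OVER () AS fraction
   FROM balance_fin_data
   GROUP BY fraud;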
2 - Detecting outliers
Before you perform an in-depth analysis of your data, you should first remove the outliers from the data. Outliers are data points that greatly differ from other similar data points. If you leave outliers in your data, you risk misclassifying data, introducing bias, or producing incorrect calculations.
This example uses the baseball data set found on the Vertica GitHub page.
Before you begin the example, load the Machine Learning sample data.
- Detect the outliers based on the hr, hits, and salary columns. The DETECT_OUTLIERS function creates a table containing the outliers with the input and key columns. Before you use the DETECT_OUTLIERS function, make sure that you are a superuser or have CREATE privileges for the schema and SELECT privileges for the table. (A manual cross-check of the robust z-score calculation appears after these steps.)

=> SELECT DETECT_OUTLIERS('baseball_hr_hits_salary_outliers', 'baseball', 'hr, hits, salary',
          'robust_zscore' USING PARAMETERS outlier_threshold=3.0);
     DETECT_OUTLIERS
--------------------------
 Detected 5 outliers
(1 row)
- Query the output table containing the outliers:

=> SELECT * FROM baseball_hr_hits_salary_outliers;
 id  | first_name | last_name |    dob     |   team    |   hr    |  hits   |  avg  |        salary
-----+------------+-----------+------------+-----------+---------+---------+-------+----------------------
  73 | Marie      | Fields    | 1985-11-23 | Mauv      |    8888 |      34 | 0.283 | 9.99999999341471e+16
  89 | Jacqueline | Richards  | 1975-10-06 | Pink      |  273333 | 4490260 | 0.324 | 4.4444444444828e+17
  87 | Jose       | Stephens  | 1991-07-20 | Green     |      80 |   64253 | 0.69  | 16032567.12
 222 | Gerald     | Fuller    | 1991-02-13 | Goldenrod | 3200000 |     216 | 0.299 | 37008899.76
 147 | Debra      | Hall      | 1980-12-31 | Maroon    | 1100037 |     230 | 0.431 | 9000101403
(5 rows)
- Create a view omitting the outliers from the table:

=> CREATE VIEW clean_baseball AS SELECT * FROM baseball
   WHERE id NOT IN (SELECT id FROM baseball_hr_hits_salary_outliers);
CREATE VIEW
- Perform your analysis using the view that omits the outliers.
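With the robust_zscore method, a row is flagged when |x - median| / (1.4826 * MAD) exceeds outlier_threshold in any input column, where MAD is the median absolute deviation; the 1.4826 factor is the usual consistency constant for normally distributed data. As a rough cross-check for a single column, the following sketch recomputes the statistic for hr with the MEDIAN analytic function. It illustrates the formula, not necessarily the exact internal implementation, and it assumes the MAD of hr is nonzero:

=> WITH med AS (
       SELECT DISTINCT MEDIAN(hr) OVER () AS hr_median FROM baseball
   ),
   mad AS (
       -- median absolute deviation around the median
       SELECT DISTINCT MEDIAN(ABS(b.hr - m.hr_median)) OVER () AS hr_mad, m.hr_median
       FROM baseball b CROSS JOIN med m
   )
   SELECT b.id, b.hr
   FROM baseball b CROSS JOIN mad
   WHERE ABS(b.hr - mad.hr_median) / (1.4826 * mad.hr_mad) > 3.0;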
3 - Encoding categorical columns
Many machine learning algorithms cannot work with categorical data. To accommodate such algorithms, categorical data must be converted to numerical data before training. Directly mapping the categorical values to indices is not enough. For example, if your categorical feature has three distinct values "red", "green", and "blue", replacing them with 1, 2, and 3 may have a negative impact on the training process, because algorithms usually rely on some kind of numerical distance between values to discriminate between them. In this case, the Euclidean distance from 1 to 3 is twice the distance from 1 to 2, so the training process treats "red" as much farther from "blue" than from "green". One hot encoding avoids this problem by mapping each categorical value to a binary vector: for example, "red" maps to [1,0,0], "green" to [0,1,0], and "blue" to [0,0,1]. The pair-wise distances between the three categories are then all the same, and you can use machine learning algorithms that require numerical input.
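For intuition, this is what one hot encoding produces for the three-color example, written as plain SQL. The colors table and its color column are hypothetical and used only for illustration:

=> SELECT color,
          CASE WHEN color = 'red'   THEN 1 ELSE 0 END AS color_red,
          CASE WHEN color = 'green' THEN 1 ELSE 0 END AS color_green,
          CASE WHEN color = 'blue'  THEN 1 ELSE 0 END AS color_blue
   FROM colors;

The ONE_HOT_ENCODER_FIT and APPLY_ONE_HOT_ENCODER functions used in the example below automate this mapping and store the category-to-index assignments in a model, so that training and testing data are encoded consistently.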
The following example shows how you can apply one hot encoding to the Titanic data set. If you would like to read more about this data set, see the Kaggle site.
Suppose you want to use a logistic regression classifier to predict which passengers survived the sinking of the Titanic. You cannot use categorical features for logistic regression without one hot encoding. This data set has two categorical features that you can use. The "sex" feature can be either male or female. The "embarkation_point" feature can be one of the following:
- S for Southampton
- Q for Queenstown
- C for Cherbourg
- Run the ONE_HOT_ENCODER_FIT function on the training data:
=> SELECT ONE_HOT_ENCODER_FIT('titanic_encoder', 'titanic_training', 'sex, embarkation_point');
ONE_HOT_ENCODER_FIT
---------------------
Success
(1 row)
- View a summary of the titanic_encoder model:
=> SELECT GET_MODEL_SUMMARY(USING PARAMETERS model_name='titanic_encoder');
GET_MODEL_SUMMARY
------------------------------------------------------------------------------------------
===========
call_string
===========
SELECT one_hot_encoder_fit('public.titanic_encoder','titanic_training','sex, embarkation_point'
USING PARAMETERS exclude_columns='', output_view='', extra_levels='{}');
==================
varchar_categories
==================
category_name |category_level|category_level_index
-----------------+--------------+--------------------
embarkation_point| C | 0
embarkation_point| Q | 1
embarkation_point| S | 2
embarkation_point| | 3
sex | female | 0
sex | male | 1
(1 row)
- Run the GET_MODEL_ATTRIBUTE function. This function returns the categorical levels in their native data types, so they can be compared easily with the original table:
=> SELECT * FROM (SELECT GET_MODEL_ATTRIBUTE(USING PARAMETERS model_name='titanic_encoder',
attr_name='varchar_categories')) AS attrs INNER JOIN (SELECT passenger_id, name, sex, age,
embarkation_point FROM titanic_training) AS original_data ON attrs.category_level
ILIKE original_data.embarkation_point ORDER BY original_data.passenger_id LIMIT 10;
   category_name   | category_level | category_level_index | passenger_id |                        name                         |  sex   | age | embarkation_point
-------------------+----------------+----------------------+--------------+-----------------------------------------------------+--------+-----+-------------------
 embarkation_point | S              |                    2 |            1 | Braund, Mr. Owen Harris                             | male   |  22 | S
 embarkation_point | C              |                    0 |            2 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female |  38 | C
 embarkation_point | S              |                    2 |            3 | Heikkinen, Miss. Laina                              | female |  26 | S
 embarkation_point | S              |                    2 |            4 | Futrelle, Mrs. Jacques Heath (Lily May Peel)        | female |  35 | S
 embarkation_point | S              |                    2 |            5 | Allen, Mr. William Henry                            | male   |  35 | S
 embarkation_point | Q              |                    1 |            6 | Moran, Mr. James                                    | male   |     | Q
 embarkation_point | S              |                    2 |            7 | McCarthy, Mr. Timothy J                             | male   |  54 | S
 embarkation_point | S              |                    2 |            8 | Palsson, Master. Gosta Leonard                      | male   |   2 | S
 embarkation_point | S              |                    2 |            9 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)   | female |  27 | S
 embarkation_point | C              |                    0 |           10 | Nasser, Mrs. Nicholas (Adele Achem)                 | female |  14 | C
(10 rows)
- Run the APPLY_ONE_HOT_ENCODER function on both the training and testing data (a note on how the encoded columns are named follows these steps):
=> CREATE VIEW titanic_training_encoded AS SELECT passenger_id, survived, pclass, sex_1, age,
sibling_and_spouse_count, parent_and_child_count, fare, embarkation_point_1, embarkation_point_2
FROM (SELECT APPLY_ONE_HOT_ENCODER(* USING PARAMETERS model_name='titanic_encoder')
FROM titanic_training) AS sq;
CREATE VIEW
=> CREATE VIEW titanic_testing_encoded AS SELECT passenger_id, name, pclass, sex_1, age,
sibling_and_spouse_count, parent_and_child_count, fare, embarkation_point_1, embarkation_point_2
FROM (SELECT APPLY_ONE_HOT_ENCODER(* USING PARAMETERS model_name='titanic_encoder')
FROM titanic_testing) AS sq;
CREATE VIEW
- Then, train a logistic regression classifier on the training data, and execute the model on the testing data:
=> SELECT LOGISTIC_REG('titanic_log_reg', 'titanic_training_encoded', 'survived', '*'
USING PARAMETERS exclude_columns='passenger_id, survived');
LOGISTIC_REG
---------------------------
Finished in 5 iterations
(1 row)
=> SELECT passenger_id, name, PREDICT_LOGISTIC_REG(pclass, sex_1, age, sibling_and_spouse_count,
parent_and_child_count, fare, embarkation_point_1, embarkation_point_2 USING PARAMETERS
model_name='titanic_log_reg') FROM titanic_testing_encoded ORDER BY passenger_id LIMIT 10;
passenger_id | name | PREDICT_LOGISTIC_REG
-------------+----------------------------------------------+----------------------
893 | Wilkes, Mrs. James (Ellen Needs) | 0
894 | Myles, Mr. Thomas Francis | 0
895 | Wirz, Mr. Albert | 0
896 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | 1
897 | Svensson, Mr. Johan Cervin | 0
898 | Connolly, Miss. Kate | 1
899 | Caldwell, Mr. Albert Francis | 0
900 | Abrahim, Mrs. Joseph (Sophie Halaut Easu) | 1
901 | Davies, Mr. John Samuel | 0
902 | Ilieff, Mr. Ylio |
(10 rows)
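In the encoded views above, each indicator column name combines the source column with a category level index from the model summary: sex_1 corresponds to male (index 1), and embarkation_point_1 and embarkation_point_2 correspond to Q (index 1) and S (index 2). Omitting the index-0 level of each feature, as these views do, is a common way to avoid perfectly collinear predictors in regression.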
4 - Imputing missing values
You can use the IMPUTE function to replace missing data with the most frequent value or with the average value in the same column. Using the function, you can specify either the mean or mode method. The following examples show how you can use the IMPUTE function on the small_input_impute table.
First, query the table so you can see the missing values:
=> SELECT * FROM small_input_impute;
pid | pclass | gender | x1 | x2 | x3 | x4 | x5 | x6
----+--------+--------+-----------+-----------+-----------+----+----+----
5 | 0 | 1 | -2.590837 | -2.892819 | -2.70296 | 2 | t | C
7 | 1 | 1 | 3.829239 | 3.08765 | Infinity | | f | C
13 | 0 | 0 | -9.060605 | -9.390844 | -9.559848 | 6 | t | C
15 | 0 | 1 | -2.590837 | -2.892819 | -2.70296 | 2 | f | A
16 | 0 | 1 | -2.264599 | -2.615146 | -2.10729 | 11 | f | A
19 | 1 | 1 | | 3.841606 | 3.754375 | 20 | t |
1 | 0 | 0 | -9.445818 | -9.740541 | -9.786974 | 3 | t | A
1 | 0 | 0 | -9.445818 | -9.740541 | -9.786974 | 3 | t | A
2 | 0 | 0 | -9.618292 | -9.308881 | -9.562255 | 4 | t | A
3 | 0 | 0 | -9.060605 | -9.390844 | -9.559848 | 6 | t | B
4 | 0 | 0 | -2.264599 | -2.615146 | -2.10729 | 15 | t | B
6 | 0 | 1 | -2.264599 | -2.615146 | -2.10729 | 11 | t | C
8 | 1 | 1 | 3.273592 | | 3.477332 | 18 | f | B
10 | 1 | 1 | | 3.841606 | 3.754375 | 20 | t | A
18 | 1 | 1 | 3.273592 | | 3.477332 | 18 | t | B
20 | 1 | 1 | | 3.841606 | 3.754375 | 20 | | C
9 | 1 | 1 | | 3.841606 | 3.754375 | 20 | f | B
11 | 0 | 0 | -9.445818 | -9.740541 | -9.786974 | 3 | t | B
12 | 0 | 0 | -9.618292 | -9.308881 | -9.562255 | 4 | t | C
14 | 0 | 0 | -2.264599 | -2.615146 | -2.10729 | 15 | f | A
17 | 1 | 1 | 3.829239 | 3.08765 | Infinity | | f | B
(21 rows)
Specify the mean method
Execute the IMPUTE function, specifying the mean method:
=> SELECT IMPUTE('output_view','small_input_impute', 'pid, x1,x2,x3,x4','mean'
USING PARAMETERS exclude_columns='pid');
IMPUTE
--------------------------
Finished in 1 iteration
(1 row)
View output_view to see the imputed values:
=> SELECT * FROM output_view;
pid | pclass | gender | x1 | x2 | x3 | x4 | x5 | x6
----+--------+--------+-------------------+-------------------+-------------------+----+----+----
5 | 0 | 1 | -2.590837 | -2.892819 | -2.70296 | 2 | t | C
7 | 1 | 1 | 3.829239 | 3.08765 | -3.12989705263158 | 11 | f | C
13 | 0 | 0 | -9.060605 | -9.390844 | -9.559848 | 6 | t | C
15 | 0 | 1 | -2.590837 | -2.892819 | -2.70296 | 2 | f | A
16 | 0 | 1 | -2.264599 | -2.615146 | -2.10729 | 11 | f | A
19 | 1 | 1 | -3.86645035294118 | 3.841606 | 3.754375 | 20 | t |
9 | 1 | 1 | -3.86645035294118 | 3.841606 | 3.754375 | 20 | f | B
11 | 0 | 0 | -9.445818 | -9.740541 | -9.786974 | 3 | t | B
12 | 0 | 0 | -9.618292 | -9.308881 | -9.562255 | 4 | t | C
14 | 0 | 0 | -2.264599 | -2.615146 | -2.10729 | 15 | f | A
17 | 1 | 1 | 3.829239 | 3.08765 | -3.12989705263158 | 11 | f | B
1 | 0 | 0 | -9.445818 | -9.740541 | -9.786974 | 3 | t | A
1 | 0 | 0 | -9.445818 | -9.740541 | -9.786974 | 3 | t | A
2 | 0 | 0 | -9.618292 | -9.308881 | -9.562255 | 4 | t | A
3 | 0 | 0 | -9.060605 | -9.390844 | -9.559848 | 6 | t | B
4 | 0 | 0 | -2.264599 | -2.615146 | -2.10729 | 15 | t | B
6 | 0 | 1 | -2.264599 | -2.615146 | -2.10729 | 11 | t | C
8 | 1 | 1 | 3.273592 | -3.22766163157895 | 3.477332 | 18 | f | B
10 | 1 | 1 | -3.86645035294118 | 3.841606 | 3.754375 | 20 | t | A
18 | 1 | 1 | 3.273592 | -3.22766163157895 | 3.477332 | 18 | t | B
20 | 1 | 1 | -3.86645035294118 | 3.841606 | 3.754375 | 20 | | C
(21 rows)
You can also execute the IMPUTE function, specifying the mean method and using the partition_columns parameter. This parameter works similarly to the GROUP BY clause:
=> SELECT IMPUTE('output_view_group','small_input_impute', 'pid, x1,x2,x3,x4','mean'
USING PARAMETERS exclude_columns='pid', partition_columns='pclass,gender');
impute
--------------------------
Finished in 1 iteration
(1 row)
View output_view_group to see the imputed values:
=> SELECT * FROM output_view_group;
pid | pclass | gender | x1 | x2 | x3 | x4 | x5 | x6
----+--------+--------+-----------+------------------+------------------+----+----+----
5 | 0 | 1 | -2.590837 | -2.892819 | -2.70296 | 2 | t | C
7 | 1 | 1 | 3.829239 | 3.08765 | 3.66202733333333 | 19 | f | C
13 | 0 | 0 | -9.060605 | -9.390844 | -9.559848 | 6 | t | C
15 | 0 | 1 | -2.590837 | -2.892819 | -2.70296 | 2 | f | A
16 | 0 | 1 | -2.264599 | -2.615146 | -2.10729 | 11 | f | A
19 | 1 | 1 | 3.5514155 | 3.841606 | 3.754375 | 20 | t |
1 | 0 | 0 | -9.445818 | -9.740541 | -9.786974 | 3 | t | A
1 | 0 | 0 | -9.445818 | -9.740541 | -9.786974 | 3 | t | A
2 | 0 | 0 | -9.618292 | -9.308881 | -9.562255 | 4 | t | A
3 | 0 | 0 | -9.060605 | -9.390844 | -9.559848 | 6 | t | B
4 | 0 | 0 | -2.264599 | -2.615146 | -2.10729 | 15 | t | B
6 | 0 | 1 | -2.264599 | -2.615146 | -2.10729 | 11 | t | C
8 | 1 | 1 | 3.273592 | 3.59028733333333 | 3.477332 | 18 | f | B
10 | 1 | 1 | 3.5514155 | 3.841606 | 3.754375 | 20 | t | A
18 | 1 | 1 | 3.273592 | 3.59028733333333 | 3.477332 | 18 | t | B
20 | 1 | 1 | 3.5514155 | 3.841606 | 3.754375 | 20 | | C
9 | 1 | 1 | 3.5514155 | 3.841606 | 3.754375 | 20 | f | B
11 | 0 | 0 | -9.445818 | -9.740541 | -9.786974 | 3 | t | B
12 | 0 | 0 | -9.618292 | -9.308881 | -9.562255 | 4 | t | C
14 | 0 | 0 | -2.264599 | -2.615146 | -2.10729 | 15 | f | A
17 | 1 | 1 | 3.829239 | 3.08765 | 3.66202733333333 | 19 | f | B
(21 rows)
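Conceptually, the partitioned mean imputation replaces each NULL with the average of its pclass/gender group. A sketch of the same idea using only standard SQL window functions, for the x1 column (AVG ignores NULLs when computing the group average):

=> SELECT pid, pclass, gender,
          COALESCE(x1, AVG(x1) OVER (PARTITION BY pclass, gender)) AS x1_imputed
   FROM small_input_impute;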
Specify the mode method
Execute the IMPUTE function, specifying the mode method:
=> SELECT impute('output_view_mode','small_input_impute', 'pid, x5,x6','mode'
USING PARAMETERS exclude_columns='pid');
impute
--------------------------
Finished in 1 iteration
(1 row)
View output_view_mode to see the imputed values:
=> SELECT * FROM output_view_mode;
pid | pclass | gender | x1 | x2 | x3 | x4 | x5 | x6
----+--------+--------+-----------+-----------+-----------+----+----+----
5 | 0 | 1 | -2.590837 | -2.892819 | -2.70296 | 2 | t | C
7 | 1 | 1 | 3.829239 | 3.08765 | Infinity | | f | C
13 | 0 | 0 | -9.060605 | -9.390844 | -9.559848 | 6 | t | C
15 | 0 | 1 | -2.590837 | -2.892819 | -2.70296 | 2 | f | A
16 | 0 | 1 | -2.264599 | -2.615146 | -2.10729 | 11 | f | A
19 | 1 | 1 | | 3.841606 | 3.754375 | 20 | t | B
1 | 0 | 0 | -9.445818 | -9.740541 | -9.786974 | 3 | t | A
1 | 0 | 0 | -9.445818 | -9.740541 | -9.786974 | 3 | t | A
2 | 0 | 0 | -9.618292 | -9.308881 | -9.562255 | 4 | t | A
3 | 0 | 0 | -9.060605 | -9.390844 | -9.559848 | 6 | t | B
4 | 0 | 0 | -2.264599 | -2.615146 | -2.10729 | 15 | t | B
6 | 0 | 1 | -2.264599 | -2.615146 | -2.10729 | 11 | t | C
8 | 1 | 1 | 3.273592 | | 3.477332 | 18 | f | B
10 | 1 | 1 | | 3.841606 | 3.754375 | 20 | t | A
18 | 1 | 1 | 3.273592 | | 3.477332 | 18 | t | B
20 | 1 | 1 | | 3.841606 | 3.754375 | 20 | t | C
9 | 1 | 1 | | 3.841606 | 3.754375 | 20 | f | B
11 | 0 | 0 | -9.445818 | -9.740541 | -9.786974 | 3 | t | B
12 | 0 | 0 | -9.618292 | -9.308881 | -9.562255 | 4 | t | C
14 | 0 | 0 | -2.264599 | -2.615146 | -2.10729 | 15 | f | A
17 | 1 | 1 | 3.829239 | 3.08765 | Infinity | | f | B
(21 rows)
You can also execute the IMPUTE function, specifying the mode method and using the partition_columns parameter. This parameter works similarly to the GROUP BY clause:
=> SELECT impute('output_view_mode_group','small_input_impute', 'pid, x5,x6','mode'
USING PARAMETERS exclude_columns='pid',partition_columns='pclass,gender');
impute
--------------------------
Finished in 1 iteration
(1 row)
View output_view_mode_group to see the imputed values:
=> SELECT * FROM output_view_mode_group;
pid | pclass | gender | x1 | x2 | x3 | x4 | x5 | x6
----+--------+--------+-----------+-----------+-----------+----+----+----
1 | 0 | 0 | -9.445818 | -9.740541 | -9.786974 | 3 | t | A
1 | 0 | 0 | -9.445818 | -9.740541 | -9.786974 | 3 | t | A
2 | 0 | 0 | -9.618292 | -9.308881 | -9.562255 | 4 | t | A
3 | 0 | 0 | -9.060605 | -9.390844 | -9.559848 | 6 | t | B
4 | 0 | 0 | -2.264599 | -2.615146 | -2.10729 | 15 | t | B
13 | 0 | 0 | -9.060605 | -9.390844 | -9.559848 | 6 | t | C
11 | 0 | 0 | -9.445818 | -9.740541 | -9.786974 | 3 | t | B
12 | 0 | 0 | -9.618292 | -9.308881 | -9.562255 | 4 | t | C
14 | 0 | 0 | -2.264599 | -2.615146 | -2.10729 | 15 | f | A
5 | 0 | 1 | -2.590837 | -2.892819 | -2.70296 | 2 | t | C
15 | 0 | 1 | -2.590837 | -2.892819 | -2.70296 | 2 | f | A
16 | 0 | 1 | -2.264599 | -2.615146 | -2.10729 | 11 | f | A
6 | 0 | 1 | -2.264599 | -2.615146 | -2.10729 | 11 | t | C
7 | 1 | 1 | 3.829239 | 3.08765 | Infinity | | f | C
19 | 1 | 1 | | 3.841606 | 3.754375 | 20 | t | B
9 | 1 | 1 | | 3.841606 | 3.754375 | 20 | f | B
17 | 1 | 1 | 3.829239 | 3.08765 | Infinity | | f | B
8 | 1 | 1 | 3.273592 | | 3.477332 | 18 | f | B
10 | 1 | 1 | | 3.841606 | 3.754375 | 20 | t | A
18 | 1 | 1 | 3.273592 | | 3.477332 | 18 | t | B
20 | 1 | 1 | | 3.841606 | 3.754375 | 20 | f | C
(21 rows)
See also
- IMPUTE
5 - Normalizing data
The purpose of normalization is, primarily, to scale numeric data from different columns down to an equivalent scale. For example, suppose you execute the LINEAR_REG function on a data set with two feature columns, current_salary and years_worked. The output value you are trying to predict is a worker's future salary. The values in the current_salary column are likely to have a far wider range, and much larger values, than the values in the years_worked column. Therefore, the values in the current_salary column can overshadow the values in the years_worked column, thus skewing your model.
Vertica offers the following normalization methods:
- MinMax. Using the MinMax normalization method, you can normalize the values in both of these columns to fall within a distribution of values between 0 and 1. Doing so allows you to compare values on very different scales to one another by reducing the dominance of one column over the other.
- Z-score. Using the Z-score normalization method, you can normalize the values in both of these columns to be the number of standard deviations an observation is from the mean of each column. This allows you to compare your data to a normally distributed random variable.
- Robust Z-score. Using the Robust Z-score normalization method, you can lessen the influence of outliers on Z-score calculations. Robust Z-score normalization uses the median instead of the mean used in Z-score, which removes some of the influence of outliers in the data.
Normalizing data results in the creation of a view where the normalized data is saved. The output_view option of the NORMALIZE function determines the name of the view.
Normalizing salary data using MinMax
The following example shows how you can normalize the salary_data table using the MinMax normalization method.
=> SELECT NORMALIZE('normalized_salary_data', 'salary_data', 'current_salary, years_worked', 'minmax');
NORMALIZE
--------------------------
Finished in 1 iteration
(1 row)
=> SELECT * FROM normalized_salary_data;
employee_id | first_name | last_name | years_worked | current_salary
------------+-------------+------------+----------------------+----------------------
189 | Shawn | Moore | 0.350000000000000000 | 0.437246565765357217
518 | Earl | Shaw | 0.100000000000000000 | 0.978867411144492943
1126 | Susan | Alexander | 0.250000000000000000 | 0.909048995710749580
1157 | Jack | Stone | 0.100000000000000000 | 0.601863084103319918
1277 | Scott | Wagner | 0.050000000000000000 | 0.455949209228501786
3188 | Shirley | Flores | 0.400000000000000000 | 0.538816771536005140
3196 | Andrew | Holmes | 0.900000000000000000 | 0.183954046444834949
3430 | Philip | Little | 0.100000000000000000 | 0.735279557092379495
3522 | Jerry | Ross | 0.800000000000000000 | 0.671828883472214349
3892 | Barbara | Flores | 0.350000000000000000 | 0.092901007123556866
.
.
.
(1000 rows)
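MinMax maps each value x in a column to (x - min) / (max - min), so the smallest current_salary becomes 0 and the largest becomes 1. As a cross-check, you can reproduce the normalized values for one column with the analytic MIN and MAX functions; the output alias below is illustrative:

=> SELECT employee_id,
          (current_salary - MIN(current_salary) OVER ())
        / (MAX(current_salary) OVER () - MIN(current_salary) OVER ()) AS current_salary_minmax
   FROM salary_data;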
Normalizing salary data using z-score
The following example shows how you can normalize the salary_data table using the Z-score normalization method.
=> SELECT NORMALIZE('normalized_z_salary_data', 'salary_data', 'current_salary, years_worked',
'zscore');
NORMALIZE
--------------------------
Finished in 1 iteration
(1 row)
=> SELECT * FROM normalized_z_salary_data;
employee_id | first_name | last_name | years_worked | current_salary
------------+-------------+------------+---------------------+----------------------
189 | Shawn | Moore | -0.524447274157005 | -0.221041249770669
518 | Earl | Shaw | -1.35743214416495 | 1.66054215981221
1126 | Susan | Alexander | -0.857641222160185 | 1.41799393943946
1157 | Jack | Stone | -1.35743214416495 | 0.350834283622416
1277 | Scott | Wagner | -1.52402911816654 | -0.156068522159045
3188 | Shirley | Flores | -0.357850300155415 | 0.131812255991634
3196 | Andrew | Holmes | 1.30811943986048 | -1.10097599783475
3430 | Philip | Little | -1.35743214416495 | 0.814321286168547
3522 | Jerry | Ross | 0.974925491857304 | 0.593894513770248
3892 | Barbara | Flores | -0.524447274157005 | -1.41729301118583
.
.
.
(1000 rows)
Normalizing salary data using robust z-score
The following example shows how you can normalize the salary_data table using the robust Z-score normalization method.
=> SELECT NORMALIZE('normalized_robustz_salary_data', 'salary_data', 'current_salary, years_worked', 'robust_zscore');
NORMALIZE
--------------------------
Finished in 1 iteration
(1 row)
=> SELECT * FROM normalized_robustz_salary_data;
employee_id | first_name | last_name | years_worked | current_salary
------------+-------------+------------+--------------------+-----------------------
189 | Shawn | Moore | -0.404694455685957 | -0.158933849655499140
518 | Earl | Shaw | -1.079185215162552 | 1.317126172796275889
1126 | Susan | Alexander | -0.674490759476595 | 1.126852528914384584
1157 | Jack | Stone | -1.079185215162552 | 0.289689691751547422
1277 | Scott | Wagner | -1.214083367057871 | -0.107964200747705902
3188 | Shirley | Flores | -0.269796303790638 | 0.117871818902746738
3196 | Andrew | Holmes | 1.079185215162552 | -0.849222942006447161
3430 | Philip | Little | -1.079185215162552 | 0.653284859470426481
3522 | Jerry | Ross | 0.809388911371914 | 0.480364995828913355
3892 | Barbara | Flores | -0.404694455685957 | -1.097366550974798397
3939 | Anna | Walker | -0.944287063267233 | 0.414956177842775781
4165 | Martha | Reyes | 0.269796303790638 | 0.773947701782753329
4335 | Phillip | Wright | -1.214083367057871 | 1.218843012657445647
4534 | Roger | Harris | 1.079185215162552 | 1.155185021164402608
4806 | John | Robinson | 0.809388911371914 | -0.494320112876813908
4881 | Kelly | Welch | 0.134898151895319 | -0.540778808820045933
4889 | Jennifer | Arnold | 1.214083367057871 | -0.299762093576526566
5067 | Martha | Parker | 0.000000000000000 | 0.719991348857328239
5523 | John | Martin | -0.269796303790638 | -0.411248545269163826
6004 | Nicole | Sullivan | 0.269796303790638 | 1.065141044522487821
6013 | Harry | Woods | -0.944287063267233 | 1.005664438654129376
6240 | Norma | Martinez | 1.214083367057871 | 0.762412844887071691
.
.
.
(1000 rows)
6 - PCA (principal component analysis)
Principal Component Analysis (PCA) is a technique that reduces the dimensionality of data while retaining the variation present in the data. In essence, a new coordinate system is constructed so that data variation is strongest along the first axis, less strong along the second axis, and so on. Then, the data points are transformed into this new coordinate system. The directions of the axes are called principal components.
If the input data is a table with p columns, there can be at most p principal components. However, the data variation along the direction of some k-th principal component often becomes almost negligible, which allows us to keep only the first k components. As a result, the new coordinate system has fewer axes, and the transformed data table has only k columns instead of p. It is important to remember that the k output columns are not simply a subset of the p input columns. Instead, each of the k output columns is a combination of all p input columns.
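In other words, if an input row is (x1, ..., xp), its i-th transformed coordinate is its projection onto the i-th principal component, computed on centered data (each xj has its column mean subtracted): coli = vi1*x1 + vi2*x2 + ... + vip*xp, where (vi1, ..., vip) is the unit vector that points along the i-th principal component.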
You can use the following functions to train and apply the PCA model:
- PCA
- APPLY_PCA
- APPLY_INVERSE_PCA
For a complete example, see Dimension reduction using PCA.
6.1 - Dimension reduction using PCA
This PCA example uses a data set named world, which contains a large number of columns. The example shows how you can apply PCA to all columns in the data set (except HDI and country) and reduce them into two dimensions.
Before you begin the example, load the Machine Learning sample data.
- Create the PCA model, named pcamodel.

=> SELECT PCA ('pcamodel', 'world','country,HDI,em1970,em1971,em1972,em1973,em1974,em1975,em1976,em1977,
em1978,em1979,em1980,em1981,em1982,em1983,em1984,em1985,em1986,em1987,em1988,em1989,em1990,em1991,em1992,
em1993,em1994,em1995,em1996,em1997,em1998,em1999,em2000,em2001,em2002,em2003,em2004,em2005,em2006,em2007,
em2008,em2009,em2010,gdp1970,gdp1971,gdp1972,gdp1973,gdp1974,gdp1975,gdp1976,gdp1977,gdp1978,gdp1979,gdp1980,
gdp1981,gdp1982,gdp1983,gdp1984,gdp1985,gdp1986,gdp1987,gdp1988,gdp1989,gdp1990,gdp1991,gdp1992,gdp1993,
gdp1994,gdp1995,gdp1996,gdp1997,gdp1998,gdp1999,gdp2000,gdp2001,gdp2002,gdp2003,gdp2004,gdp2005,gdp2006,
gdp2007,gdp2008,gdp2009,gdp2010'
USING PARAMETERS exclude_columns='HDI, country');
PCA
---------------------------------------------------------------
Finished in 1 iterations.  Accepted Rows: 96  Rejected Rows: 0
(1 row)
- View the summary output of pcamodel.

=> SELECT GET_MODEL_SUMMARY(USING PARAMETERS model_name='pcamodel');
GET_MODEL_SUMMARY
--------------------------------------------------------------------------------
- Next, apply PCA to a few selected columns. The HDI and country columns are excluded from the transformation and passed through as key columns:

=> SELECT APPLY_PCA (HDI,country,em1970,em2010,gdp1970,gdp2010 USING PARAMETERS model_name='pcamodel',
   exclude_columns='HDI,country', key_columns='HDI,country',cutoff=.3) OVER () FROM world;
  HDI  |       country       |       col1
-------+---------------------+-------------------
 0.886 | Belgium             | -36288.1191849017
 0.699 | Belize              | -81740.32711562
 0.427 | Benin               | -122666.882708325
 0.805 | Chile               | -161356.484748602
 0.687 | China               | -202634.254216416
 0.744 | Costa Rica          | -242043.080125449
 0.4   | Cote d'Ivoire       | -283330.394428932
 0.776 | Cuba                | -322625.857541772
 0.895 | Denmark             | -356086.311721071
 0.644 | Egypt               | -403634.743992772
.
.
.
(96 rows)
- Then, optionally apply the inverse function to transform the data back to its original state. This example shows an abbreviated output, only for the first record. There are 96 records in total.

=> SELECT APPLY_INVERSE_PCA (HDI,country,em1970,em2010,gdp1970,gdp2010 USING PARAMETERS model_name='pcamodel',
   exclude_columns='HDI,country', key_columns='HDI,country') OVER () FROM world LIMIT 1;
-[ RECORD 1 ]--------------
HDI     | 0.886
country | Belgium
em1970  | 3.74891915022521
em1971  | 26.091852917619
em1972  | 22.0262860721982
em1973  | 24.8214492074202
em1974  | 20.9486650320945
em1975  | 29.5717692117088
em1976  | 17.4373459783249
em1977  | 33.1895610966146
em1978  | 15.6251407781098
em1979  | 14.9560299812815
em1980  | 18.0870223053504
em1981  | -6.23151505146251
em1982  | -7.12300504708672
em1983  | -7.52627957856581
em1984  | -7.17428622245234
em1985  | -9.04899186621455
em1986  | -10.5098581697156
em1987  | -7.97146984849547
em1988  | -8.85458031319287
em1989  | -8.78422101747477
em1990  | -9.61931854722004
em1991  | -11.6411235452067
em1992  | -12.8882752879355
em1993  | -15.0647523842803
em1994  | -14.3266175918398
em1995  | -9.07603254825782
em1996  | -9.32002671928241
em1997  | -10.0209028262361
em1998  | -6.70882735196004
em1999  | -7.32575918131333
em2000  | -10.3113551933996
em2001  | -11.0162573094354
em2002  | -10.886264397431
em2003  | -8.96078372850612
em2004  | -11.5157129257881
em2005  | -12.5048269019293
em2006  | -12.2345161132594
em2007  | -8.92504587601715
em2008  | -12.1136551375247
em2009  | -10.1144380511421
em2010  | -7.72468307053519
gdp1970 | 10502.1047183969
gdp1971 | 9259.97560190599
gdp1972 | 6593.98178532712
gdp1973 | 5325.33813328068
gdp1974 | -899.029529832931
gdp1975 | -3184.93671107899
gdp1976 | -4517.68204331439
gdp1977 | -3322.9509067019
gdp1978 | -33.8221923368737
gdp1979 | 2882.50573071066
gdp1980 | 3638.74436577365
gdp1981 | 2211.77365027338
gdp1982 | 5811.44631880621
gdp1983 | 7365.75180165581
gdp1984 | 10465.1797058904
gdp1985 | 12312.7219748196
gdp1986 | 12309.0418293413
gdp1987 | 13695.5173269466
gdp1988 | 12531.9995299889
gdp1989 | 13009.2244205049
gdp1990 | 10697.6839797576
gdp1991 | 6835.94651304181
gdp1992 | 4275.67753277099
gdp1993 | 3382.29408813394
gdp1994 | 3703.65406726311
gdp1995 | 4238.17659535371
gdp1996 | 4692.48744219914
gdp1997 | 4539.23538342266
gdp1998 | 5886.78983381162
gdp1999 | 7527.72448728762
gdp2000 | 7646.05563584361
gdp2001 | 9053.22077886667
gdp2002 | 9914.82548013531
gdp2003 | 9201.64413455221
gdp2004 | 9234.70123279344
gdp2005 | 9565.5457350936
gdp2006 | 9569.86316415438
gdp2007 | 9104.60260145907
gdp2008 | 8182.8163827425
gdp2009 | 6279.93197775805
gdp2010 | 4274.40397281553
7 - Sampling data
The goal of data sampling is to take a smaller, more manageable sample of a much larger data set. With a sample data set, you can produce predictive models or use it to help you tune your database. The following example shows how you can use the TABLESAMPLE clause to create a sample of your data.
Sampling data from a table
Before you begin the example, load the Machine Learning sample data.
Using the baseball table, create a new table named baseball_sample containing a 25% sample of baseball. Remember, TABLESAMPLE does not guarantee that the exact percentage of records defined in the clause is returned.
=> CREATE TABLE baseball_sample AS SELECT * FROM baseball TABLESAMPLE(25);
CREATE TABLE
=> SELECT * FROM baseball_sample;
id | first_name | last_name | dob | team | hr | hits | avg | salary
-----+------------+------------+------------+------------+-----+-------+-------+-------------
4 | Amanda | Turner | 1997-12-22 | Maroon | 58 | 177 | 0.187 | 8047721
20 | Jesse | Cooper | 1983-04-13 | Yellow | 97 | 39 | 0.523 | 4252837
22 | Randy | Peterson | 1980-05-28 | Orange | 14 | 16 | 0.141 | 11827728.1
24 | Carol | Harris | 1991-04-02 | Fuscia | 96 | 12 | 0.456 | 40572253.6
32 | Rose | Morrison | 1977-07-26 | Goldenrod | 27 | 153 | 0.442 | 14510752.49
50 | Helen | Medina | 1987-12-26 | Maroon | 12 | 150 | 0.54 | 32169267.91
70 | Richard | Gilbert | 1983-07-13 | Khaki | 1 | 250 | 0.213 | 40518422.76
81 | Angela | Cole | 1991-08-16 | Violet | 87 | 136 | 0.706 | 42875181.51
82 | Elizabeth | Foster | 1994-04-30 | Indigo | 46 | 163 | 0.481 | 33896975.53
98 | Philip | Gardner | 1992-05-06 | Puce | 39 | 239 | 0.697 | 20967480.67
102 | Ernest | Freeman | 1983-10-05 | Turquoise | 46 | 77 | 0.564 | 21444463.92
.
.
.
(227 rows)
With your sample you can create a predictive model, or tune your database.
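If you need an exact number of rows rather than an approximate percentage, one common alternative is to order by RANDOM() and apply a LIMIT. This scans and sorts the entire table, so it is slower than TABLESAMPLE; the table name below is illustrative:

=> CREATE TABLE baseball_sample_exact AS
       SELECT * FROM baseball ORDER BY RANDOM() LIMIT 250;
CREATE TABLE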
See also
- FROM clause (for more information about the TABLESAMPLE clause)
8 - SVD (singular value decomposition)
Singular Value Decomposition (SVD) is a matrix decomposition method that allows you to approximate a matrix X with dimensions n-by-p as a product of three matrices: X(n-by-p) = U(n-by-k).S(k-by-k).VT(k-by-p), where k is an integer from 1 to p, and S is a diagonal matrix. The diagonal of S holds non-negative values, called singular values, sorted from the largest, at the top left, to the smallest, at the bottom right. All other elements of S are zero.
In practice, the matrix V(p-by-k), which is the transpose of VT, is usually preferred.
If k (an input parameter to the decomposition algorithm, also known as the number of components to keep in the output) is equal to p, the decomposition is exact. If k is less than p, the decomposition becomes an approximation.
An application of SVD is lossy data compression. For example, storing X requires n.p elements, while storing the three matrices U, S, and VT requires only n.k + k + k.p elements. If n=1000, p=10, and k=2, storing X would require 10,000 elements, while storing the approximation would require 2,000+2+20 = 2,022 elements. A smaller value of k increases the savings in storage space, while a larger value of k gives a more accurate approximation.
Depending on your data, the singular values may decrease rapidly, which allows you to choose a value of k that is much smaller than the value of p.
Another common application of SVD is to perform principal component analysis.
You can use the following functions to train and apply the SVD model:
- SVD
- APPLY_SVD
- APPLY_INVERSE_SVD
For a complete example, see Computing SVD.
8.1 - Computing SVD
This SVD example uses a small data set named small_svd, whose table is a matrix of numbers. The example computes the SVD of this matrix using the SVD function, which produces one vector and two matrices: the vector contains the singular values, the matrix U contains the left singular vectors, and the matrix V contains the right singular vectors.
Before you begin the example, load the Machine Learning sample data.
- Create the SVD model, named svdmodel.

=> SELECT SVD ('svdmodel', 'small_svd', 'x1,x2,x3,x4');
SVD
--------------------------------------------------------------
Finished in 1 iterations.  Accepted Rows: 8  Rejected Rows: 0
(1 row)
- View the summary output of svdmodel (a note on reading the singular values follows these steps).

=> SELECT GET_MODEL_SUMMARY(USING PARAMETERS model_name='svdmodel');
GET_MODEL_SUMMARY
----------------------------------------------------------------------------------------------------------
=======
columns
=======
index|name
-----+----
  1  | x1
  2  | x2
  3  | x3
  4  | x4

===============
singular_values
===============
index| value  |explained_variance|accumulated_explained_variance
-----+--------+------------------+------------------------------
  1  |22.58748|      0.95542     |            0.95542
  2  | 3.79176|      0.02692     |            0.98234
  3  | 2.55864|      0.01226     |            0.99460
  4  | 1.69756|      0.00540     |            1.00000

======================
right_singular_vectors
======================
index|vector1 |vector2 |vector3 |vector4
-----+--------+--------+--------+--------
  1  | 0.58736| 0.08033| 0.74288|-0.31094
  2  | 0.26661| 0.78275|-0.06148| 0.55896
  3  | 0.71779|-0.13672|-0.64563|-0.22193
  4  | 0.26211|-0.60179| 0.16587| 0.73596

========
counters
========
   counter_name   |counter_value
------------------+-------------
accepted_row_count|      8
rejected_row_count|      0
iteration_count   |      1

===========
call_string
===========
SELECT SVD('public.svdmodel', 'small_svd', 'x1,x2,x3,x4');
(1 row)
- Create a new table, named Umat, to obtain the values for U:

=> CREATE TABLE Umat AS SELECT APPLY_SVD(id, x1, x2, x3, x4 USING PARAMETERS model_name='svdmodel',
   exclude_columns='id', key_columns='id') OVER() FROM small_svd;
CREATE TABLE
- View the results in the Umat table. This table expresses the matrix in the new coordinate system:

=> SELECT * FROM Umat ORDER BY id;
 id  |        col1        |        col2        |        col3         |        col4
-----+--------------------+--------------------+---------------------+--------------------
   1 | -0.494871802886819 | -0.161721379259287 | 0.0712816417153664  | -0.473145877877408
   2 | -0.17652411036246  | 0.0753183783382909 | -0.678196192333598  | 0.0567124770173372
   3 | -0.150974762654569 | -0.589561842046029 | 0.00392654610109522 | 0.360011163271921
   4 | -0.44849499240202  | 0.347260956311326  | 0.186958376368345   | 0.378561270493651
   5 | -0.494871802886819 | -0.161721379259287 | 0.0712816417153664  | -0.473145877877408
   6 | -0.17652411036246  | 0.0753183783382909 | -0.678196192333598  | 0.0567124770173372
   7 | -0.150974762654569 | -0.589561842046029 | 0.00392654610109522 | 0.360011163271921
   8 | -0.44849499240202  | 0.347260956311326  | 0.186958376368345   | 0.378561270493651
(8 rows)
- Then, optionally transform the data back by converting it from Umat to Xmat. First, create the Xmat table by applying the APPLY_INVERSE_SVD function to the Umat table:

=> CREATE TABLE Xmat AS SELECT APPLY_INVERSE_SVD(* USING PARAMETERS model_name='svdmodel',
   exclude_columns='id', key_columns='id') OVER() FROM Umat;
CREATE TABLE
- Then view the data in the Xmat table that was created:

=> SELECT id, x1::NUMERIC(5,1), x2::NUMERIC(5,1), x3::NUMERIC(5,1), x4::NUMERIC(5,1) FROM Xmat ORDER BY id;
 id | x1  | x2  | x3  | x4
----+-----+-----+-----+-----
  1 | 7.0 | 3.0 | 8.0 | 2.0
  2 | 1.0 | 1.0 | 4.0 | 1.0
  3 | 2.0 | 3.0 | 2.0 | 0.0
  4 | 6.0 | 2.0 | 7.0 | 4.0
  5 | 7.0 | 3.0 | 8.0 | 2.0
  6 | 1.0 | 1.0 | 4.0 | 1.0
  7 | 2.0 | 3.0 | 2.0 | 0.0
  8 | 6.0 | 2.0 | 7.0 | 4.0
(8 rows)
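A note on the model summary above: the singular_values section shows that the first two components already account for about 98% of the accumulated explained variance, so keeping k=2 components would be a reasonable trade-off between size and accuracy for this data. The section names in the summary suggest they are also exposed as model attributes, mirroring the varchar_categories example earlier in this section; if so, you can retrieve the singular values alone with:

=> SELECT GET_MODEL_ATTRIBUTE(USING PARAMETERS model_name='svdmodel', attr_name='singular_values');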