Imputing missing values
You can use the IMPUTE function to replace missing values in a column with either the mean (average) or the mode (most frequent value) of that column. You choose the method when you call the function.
The following examples show both methods, using the small_input_impute table.
First, query the table so you can see the missing values:
=> SELECT * FROM small_input_impute;
pid | pclass | gender | x1 | x2 | x3 | x4 | x5 | x6
----+--------+--------+-----------+-----------+-----------+----+----+----
5 | 0 | 1 | -2.590837 | -2.892819 | -2.70296 | 2 | t | C
7 | 1 | 1 | 3.829239 | 3.08765 | Infinity | | f | C
13 | 0 | 0 | -9.060605 | -9.390844 | -9.559848 | 6 | t | C
15 | 0 | 1 | -2.590837 | -2.892819 | -2.70296 | 2 | f | A
16 | 0 | 1 | -2.264599 | -2.615146 | -2.10729 | 11 | f | A
19 | 1 | 1 | | 3.841606 | 3.754375 | 20 | t |
1 | 0 | 0 | -9.445818 | -9.740541 | -9.786974 | 3 | t | A
1 | 0 | 0 | -9.445818 | -9.740541 | -9.786974 | 3 | t | A
2 | 0 | 0 | -9.618292 | -9.308881 | -9.562255 | 4 | t | A
3 | 0 | 0 | -9.060605 | -9.390844 | -9.559848 | 6 | t | B
4 | 0 | 0 | -2.264599 | -2.615146 | -2.10729 | 15 | t | B
6 | 0 | 1 | -2.264599 | -2.615146 | -2.10729 | 11 | t | C
8 | 1 | 1 | 3.273592 | | 3.477332 | 18 | f | B
10 | 1 | 1 | | 3.841606 | 3.754375 | 20 | t | A
18 | 1 | 1 | 3.273592 | | 3.477332 | 18 | t | B
20 | 1 | 1 | | 3.841606 | 3.754375 | 20 | | C
9 | 1 | 1 | | 3.841606 | 3.754375 | 20 | f | B
11 | 0 | 0 | -9.445818 | -9.740541 | -9.786974 | 3 | t | B
12 | 0 | 0 | -9.618292 | -9.308881 | -9.562255 | 4 | t | C
14 | 0 | 0 | -2.264599 | -2.615146 | -2.10729 | 15 | f | A
17 | 1 | 1 | 3.829239 | 3.08765 | Infinity | | f | B
(21 rows)
Specify the mean method
Execute the IMPUTE function, specifying the mean method:
=> SELECT IMPUTE('output_view','small_input_impute', 'pid, x1,x2,x3,x4','mean'
USING PARAMETERS exclude_columns='pid');
IMPUTE
--------------------------
Finished in 1 iteration
(1 row)
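Query output_view to see the imputed values. In the output, the empty cells in x1, x2, and x4 and the Infinity values in x3 have been replaced: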
=> SELECT * FROM output_view;
pid | pclass | gender | x1 | x2 | x3 | x4 | x5 | x6
----+--------+--------+-------------------+-------------------+-------------------+----+----+----
5 | 0 | 1 | -2.590837 | -2.892819 | -2.70296 | 2 | t | C
7 | 1 | 1 | 3.829239 | 3.08765 | -3.12989705263158 | 11 | f | C
13 | 0 | 0 | -9.060605 | -9.390844 | -9.559848 | 6 | t | C
15 | 0 | 1 | -2.590837 | -2.892819 | -2.70296 | 2 | f | A
16 | 0 | 1 | -2.264599 | -2.615146 | -2.10729 | 11 | f | A
19 | 1 | 1 | -3.86645035294118 | 3.841606 | 3.754375 | 20 | t |
9 | 1 | 1 | -3.86645035294118 | 3.841606 | 3.754375 | 20 | f | B
11 | 0 | 0 | -9.445818 | -9.740541 | -9.786974 | 3 | t | B
12 | 0 | 0 | -9.618292 | -9.308881 | -9.562255 | 4 | t | C
14 | 0 | 0 | -2.264599 | -2.615146 | -2.10729 | 15 | f | A
17 | 1 | 1 | 3.829239 | 3.08765 | -3.12989705263158 | 11 | f | B
1 | 0 | 0 | -9.445818 | -9.740541 | -9.786974 | 3 | t | A
1 | 0 | 0 | -9.445818 | -9.740541 | -9.786974 | 3 | t | A
2 | 0 | 0 | -9.618292 | -9.308881 | -9.562255 | 4 | t | A
3 | 0 | 0 | -9.060605 | -9.390844 | -9.559848 | 6 | t | B
4 | 0 | 0 | -2.264599 | -2.615146 | -2.10729 | 15 | t | B
6 | 0 | 1 | -2.264599 | -2.615146 | -2.10729 | 11 | t | C
8 | 1 | 1 | 3.273592 | -3.22766163157895 | 3.477332 | 18 | f | B
10 | 1 | 1 | -3.86645035294118 | 3.841606 | 3.754375 | 20 | t | A
18 | 1 | 1 | 3.273592 | -3.22766163157895 | 3.477332 | 18 | t | B
20 | 1 | 1 | -3.86645035294118 | 3.841606 | 3.754375 | 20 | | C
(21 rows)
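To make the arithmetic behind the mean method concrete, the following Python sketch recomputes the imputed x1 value outside the database. It is an illustration only, not a description of how IMPUTE is implemented. The helper names is_missing and impute_mean are hypothetical, and treating Infinity as missing is an assumption based on the x3 column in the output above.

```python
import math

def is_missing(v):
    # Assumption: None (NULL), NaN, and +/-Infinity all count as missing,
    # as suggested by the x3 column in the output above.
    return v is None or not math.isfinite(v)

def impute_mean(values):
    """Return a copy of values with missing entries replaced by the mean of the observed ones."""
    observed = [v for v in values if not is_missing(v)]
    if not observed:
        return list(values)  # nothing to average; leave the column unchanged
    mean = sum(observed) / len(observed)
    return [mean if is_missing(v) else v for v in values]

# x1 column of small_input_impute, in the row order of the first query
x1 = [-2.590837, 3.829239, -9.060605, -2.590837, -2.264599, None,
      -9.445818, -9.445818, -9.618292, -9.060605, -2.264599, -2.264599,
      3.273592, None, 3.273592, None, None, -9.445818, -9.618292,
      -2.264599, 3.829239]

filled = impute_mean(x1)
print(filled[5])  # imputed x1 for pid 19
```

The 17 observed x1 values sum to -65.729656, so the printed value agrees with the -3.86645035294118 shown in output_view, up to floating-point rounding.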
You can also execute the IMPUTE function with the mean method and the partition_columns parameter. This parameter works similarly to the GROUP BY clause: the mean is computed separately for each combination of values in the partition columns.
=> SELECT IMPUTE('output_view_group','small_input_impute', 'pid, x1,x2,x3,x4','mean'
USING PARAMETERS exclude_columns='pid', partition_columns='pclass,gender');
impute
--------------------------
Finished in 1 iteration
(1 row)
View output_view_group to see the imputed values:
=> SELECT * FROM output_view_group;
pid | pclass | gender | x1 | x2 | x3 | x4 | x5 | x6
----+--------+--------+-----------+------------------+------------------+----+----+----
5 | 0 | 1 | -2.590837 | -2.892819 | -2.70296 | 2 | t | C
7 | 1 | 1 | 3.829239 | 3.08765 | 3.66202733333333 | 19 | f | C
13 | 0 | 0 | -9.060605 | -9.390844 | -9.559848 | 6 | t | C
15 | 0 | 1 | -2.590837 | -2.892819 | -2.70296 | 2 | f | A
16 | 0 | 1 | -2.264599 | -2.615146 | -2.10729 | 11 | f | A
19 | 1 | 1 | 3.5514155 | 3.841606 | 3.754375 | 20 | t |
1 | 0 | 0 | -9.445818 | -9.740541 | -9.786974 | 3 | t | A
1 | 0 | 0 | -9.445818 | -9.740541 | -9.786974 | 3 | t | A
2 | 0 | 0 | -9.618292 | -9.308881 | -9.562255 | 4 | t | A
3 | 0 | 0 | -9.060605 | -9.390844 | -9.559848 | 6 | t | B
4 | 0 | 0 | -2.264599 | -2.615146 | -2.10729 | 15 | t | B
6 | 0 | 1 | -2.264599 | -2.615146 | -2.10729 | 11 | t | C
8 | 1 | 1 | 3.273592 | 3.59028733333333 | 3.477332 | 18 | f | B
10 | 1 | 1 | 3.5514155 | 3.841606 | 3.754375 | 20 | t | A
18 | 1 | 1 | 3.273592 | 3.59028733333333 | 3.477332 | 18 | t | B
20 | 1 | 1 | 3.5514155 | 3.841606 | 3.754375 | 20 | | C
9 | 1 | 1 | 3.5514155 | 3.841606 | 3.754375 | 20 | f | B
11 | 0 | 0 | -9.445818 | -9.740541 | -9.786974 | 3 | t | B
12 | 0 | 0 | -9.618292 | -9.308881 | -9.562255 | 4 | t | C
14 | 0 | 0 | -2.264599 | -2.615146 | -2.10729 | 15 | f | A
17 | 1 | 1 | 3.829239 | 3.08765 | 3.66202733333333 | 19 | f | B
(21 rows)
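The effect of partition_columns on the mean method can be sketched in Python as a per-group average. This is a minimal illustration under stated assumptions, not the function's actual implementation: impute_mean_by_group is a hypothetical helper, and the data is a subset of small_input_impute that contains every observed x1 value for the pclass=1, gender=1 partition.

```python
from collections import defaultdict

def impute_mean_by_group(rows, column, partition_columns):
    """Fill None in `column` with the mean of observed values in the same partition."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        if row[column] is not None:
            key = tuple(row[c] for c in partition_columns)
            sums[key] += row[column]
            counts[key] += 1

    filled = []
    for row in rows:
        new_row = dict(row)
        key = tuple(row[c] for c in partition_columns)
        # Leave the value unchanged if its partition has no observed values
        if new_row[column] is None and counts.get(key, 0) > 0:
            new_row[column] = sums[key] / counts[key]
        filled.append(new_row)
    return filled

# Subset of small_input_impute: (pid, pclass, gender, x1)
sample = [
    (5, 0, 1, -2.590837), (15, 0, 1, -2.590837),
    (7, 1, 1, 3.829239), (8, 1, 1, 3.273592),
    (17, 1, 1, 3.829239), (18, 1, 1, 3.273592),
    (19, 1, 1, None), (10, 1, 1, None),
]
rows = [dict(zip(("pid", "pclass", "gender", "x1"), r)) for r in sample]

result = impute_mean_by_group(rows, "x1", ["pclass", "gender"])
print(result[6]["x1"])  # imputed x1 for pid 19
```

Only the four observed x1 values in the pclass=1, gender=1 partition contribute to the fill value for pid 19, which is why output_view_group shows 3.5514155 there instead of the table-wide mean.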
Specify the mode method
Execute the IMPUTE function, specifying the mode method:
=> SELECT impute('output_view_mode','small_input_impute', 'pid, x5,x6','mode'
USING PARAMETERS exclude_columns='pid');
impute
--------------------------
Finished in 1 iteration
(1 row)
View output_view_mode to see the imputed values:
=> SELECT * FROM output_view_mode;
pid | pclass | gender | x1 | x2 | x3 | x4 | x5 | x6
----+--------+--------+-----------+-----------+-----------+----+----+----
5 | 0 | 1 | -2.590837 | -2.892819 | -2.70296 | 2 | t | C
7 | 1 | 1 | 3.829239 | 3.08765 | Infinity | | f | C
13 | 0 | 0 | -9.060605 | -9.390844 | -9.559848 | 6 | t | C
15 | 0 | 1 | -2.590837 | -2.892819 | -2.70296 | 2 | f | A
16 | 0 | 1 | -2.264599 | -2.615146 | -2.10729 | 11 | f | A
19 | 1 | 1 | | 3.841606 | 3.754375 | 20 | t | B
1 | 0 | 0 | -9.445818 | -9.740541 | -9.786974 | 3 | t | A
1 | 0 | 0 | -9.445818 | -9.740541 | -9.786974 | 3 | t | A
2 | 0 | 0 | -9.618292 | -9.308881 | -9.562255 | 4 | t | A
3 | 0 | 0 | -9.060605 | -9.390844 | -9.559848 | 6 | t | B
4 | 0 | 0 | -2.264599 | -2.615146 | -2.10729 | 15 | t | B
6 | 0 | 1 | -2.264599 | -2.615146 | -2.10729 | 11 | t | C
8 | 1 | 1 | 3.273592 | | 3.477332 | 18 | f | B
10 | 1 | 1 | | 3.841606 | 3.754375 | 20 | t | A
18 | 1 | 1 | 3.273592 | | 3.477332 | 18 | t | B
20 | 1 | 1 | | 3.841606 | 3.754375 | 20 | t | C
9 | 1 | 1 | | 3.841606 | 3.754375 | 20 | f | B
11 | 0 | 0 | -9.445818 | -9.740541 | -9.786974 | 3 | t | B
12 | 0 | 0 | -9.618292 | -9.308881 | -9.562255 | 4 | t | C
14 | 0 | 0 | -2.264599 | -2.615146 | -2.10729 | 15 | f | A
17 | 1 | 1 | 3.829239 | 3.08765 | Infinity | | f | B
(21 rows)
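As a rough illustration of the mode method, the Python sketch below fills the missing x5 value with the most frequent observed value. The helper impute_mode is hypothetical, and its tie-breaking (first value encountered wins) is an assumption of this sketch, not documented IMPUTE behavior.

```python
from collections import Counter

def impute_mode(values):
    """Return a copy of values with None replaced by the most frequent observed value."""
    observed = [v for v in values if v is not None]
    if not observed:
        return list(values)  # no observed values, so there is no mode to use
    # most_common(1) returns the first-encountered value when counts are tied
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is None else v for v in values]

# x5 column of small_input_impute, in the row order of the first query
x5 = ["t", "f", "t", "f", "f", "t", "t", "t", "t", "t", "t",
      "t", "f", "t", "t", None, "f", "t", "t", "f", "f"]

filled = impute_mode(x5)
print(filled[15])  # imputed x5 for pid 20
```

In x5, t occurs 13 times and f 7 times, so pid 20 receives t, as in output_view_mode. In x6, A and B each occur 7 times; the output above shows B for pid 19, so a sketch like this one may resolve such ties differently from IMPUTE.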
You can also execute the IMPUTE function with the mode method and the partition_columns parameter. This parameter works similarly to the GROUP BY clause: the mode is determined separately for each combination of values in the partition columns.
=> SELECT impute('output_view_mode_group','small_input_impute', 'pid, x5,x6','mode'
USING PARAMETERS exclude_columns='pid',partition_columns='pclass,gender');
impute
--------------------------
Finished in 1 iteration
(1 row)
View output_view_mode_group to see the imputed values:
=> SELECT * FROM output_view_mode_group;
pid | pclass | gender | x1 | x2 | x3 | x4 | x5 | x6
----+--------+--------+-----------+-----------+-----------+----+----+----
1 | 0 | 0 | -9.445818 | -9.740541 | -9.786974 | 3 | t | A
1 | 0 | 0 | -9.445818 | -9.740541 | -9.786974 | 3 | t | A
2 | 0 | 0 | -9.618292 | -9.308881 | -9.562255 | 4 | t | A
3 | 0 | 0 | -9.060605 | -9.390844 | -9.559848 | 6 | t | B
4 | 0 | 0 | -2.264599 | -2.615146 | -2.10729 | 15 | t | B
13 | 0 | 0 | -9.060605 | -9.390844 | -9.559848 | 6 | t | C
11 | 0 | 0 | -9.445818 | -9.740541 | -9.786974 | 3 | t | B
12 | 0 | 0 | -9.618292 | -9.308881 | -9.562255 | 4 | t | C
14 | 0 | 0 | -2.264599 | -2.615146 | -2.10729 | 15 | f | A
5 | 0 | 1 | -2.590837 | -2.892819 | -2.70296 | 2 | t | C
15 | 0 | 1 | -2.590837 | -2.892819 | -2.70296 | 2 | f | A
16 | 0 | 1 | -2.264599 | -2.615146 | -2.10729 | 11 | f | A
6 | 0 | 1 | -2.264599 | -2.615146 | -2.10729 | 11 | t | C
7 | 1 | 1 | 3.829239 | 3.08765 | Infinity | | f | C
19 | 1 | 1 | | 3.841606 | 3.754375 | 20 | t | B
9 | 1 | 1 | | 3.841606 | 3.754375 | 20 | f | B
17 | 1 | 1 | 3.829239 | 3.08765 | Infinity | | f | B
8 | 1 | 1 | 3.273592 | | 3.477332 | 18 | f | B
10 | 1 | 1 | | 3.841606 | 3.754375 | 20 | t | A
18 | 1 | 1 | 3.273592 | | 3.477332 | 18 | t | B
20 | 1 | 1 | | 3.841606 | 3.754375 | 20 | f | C
(21 rows)
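The partitioned mode can be sketched the same way, with one frequency count per partition. The helper impute_mode_by_group is hypothetical, and the data is a subset of small_input_impute: all eight rows of the pclass=1, gender=1 partition plus four rows from pclass=0, gender=0.

```python
from collections import Counter, defaultdict

def impute_mode_by_group(rows, column, partition_columns):
    """Fill None in `column` with the most frequent observed value in the same partition."""
    counters = defaultdict(Counter)
    for row in rows:
        if row[column] is not None:
            key = tuple(row[c] for c in partition_columns)
            counters[key][row[column]] += 1

    filled = []
    for row in rows:
        new_row = dict(row)
        key = tuple(row[c] for c in partition_columns)
        # Leave the value unchanged if its partition has no observed values
        if new_row[column] is None and key in counters:
            new_row[column] = counters[key].most_common(1)[0][0]
        filled.append(new_row)
    return filled

# Subset of small_input_impute: (pid, pclass, gender, x5)
sample = [
    (7, 1, 1, "f"), (19, 1, 1, "t"), (8, 1, 1, "f"), (10, 1, 1, "t"),
    (18, 1, 1, "t"), (20, 1, 1, None), (9, 1, 1, "f"), (17, 1, 1, "f"),
    (1, 0, 0, "t"), (2, 0, 0, "t"), (3, 0, 0, "t"), (4, 0, 0, "t"),
]
rows = [dict(zip(("pid", "pclass", "gender", "x5"), r)) for r in sample]

result = impute_mode_by_group(rows, "x5", ["pclass", "gender"])
print(result[5]["x5"])  # imputed x5 for pid 20
```

Within the pclass=1, gender=1 partition, f occurs four times and t three times, so pid 20 receives f, as in output_view_mode_group. Without partitioning, the most frequent x5 value is t, which is the value output_view_mode shows for the same row.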