PCA (principal component analysis)

Principal Component Analysis (PCA) is a technique that reduces the dimensionality of data while retaining the variation present in the data.

1: Dimension reduction using PCA

Principal Component Analysis (PCA) is a technique that reduces the dimensionality of data while retaining the variation present in the data. In essence, a new coordinate system is constructed so that data variation is strongest along the first axis, less strong along the second axis, and so on. Then, the data points are transformed into this new coordinate system. The directions of the axes are called principal components.

If the input data is a table with p columns, there could be maximum p principal components. However, it's usually the case that the data variation along the direction of some k-th principal component becomes almost negligible, which allows us to keep only the first k components. As a result, the new coordinate system has fewer axes. Hence, the transformed data table has only k columns instead of p. It is important to remember that the k output columns are not simply a subset of p input columns. Instead, each of the k output columns is a combination of all p input columns.

You can use the following functions to train and apply the PCA model:

For a complete example, see Dimension reduction using PCA.

1 - Dimension reduction using PCA

This PCA example uses a data set with a large number of columns named world.

This PCA example uses a data set with a large number of columns named world. The example shows how you can apply PCA to all columns in the data set (except HDI) and reduce them into two dimensions.

Before you begin the example, load the Machine Learning sample data.

Create the PCA model, named pcamodel.

=> SELECT PCA ('pcamodel', 'world','country,HDI,em1970,em1971,em1972,em1973,em1974,em1975,em1976,em1977,
em1978,em1979,em1980,em1981,em1982,em1983,em1984 ,em1985,em1986,em1987,em1988,em1989,em1990,em1991,em1992,
em1993,em1994,em1995,em1996,em1997,em1998,em1999,em2000,em2001,em2002,em2003,em2004,em2005,em2006,em2007,
em2008,em2009,em2010,gdp1970,gdp1971,gdp1972,gdp1973,gdp1974,gdp1975,gdp1976,gdp1977,gdp1978,gdp1979,gdp1980,
gdp1981,gdp1982,gdp1983,gdp1984,gdp1985,gdp1986,gdp1987,gdp1988,gdp1989,gdp1990,gdp1991,gdp1992,gdp1993,
gdp1994,gdp1995,gdp1996,gdp1997,gdp1998,gdp1999,gdp2000,gdp2001,gdp2002,gdp2003,gdp2004,gdp2005,gdp2006,
gdp2007,gdp2008,gdp2009,gdp2010' USING PARAMETERS exclude_columns='HDI, country');

PCA
---------------------------------------------------------------
Finished in 1 iterations.
Accepted Rows: 96  Rejected Rows: 0
(1 row)

View the summary output of pcamodel.


=> SELECT GET_MODEL_SUMMARY(USING PARAMETERS model_name='pcamodel');
GET_MODEL_SUMMARY
--------------------------------------------------------------------------------

Next, apply PCA to a select few columns, with the exception of HDI and country.

=> SELECT APPLY_PCA (HDI,country,em1970,em2010,gdp1970,gdp2010 USING PARAMETERS model_name='pcamodel',
exclude_columns='HDI,country', key_columns='HDI,country',cutoff=.3) OVER () FROM world;
HDI   |       country       |       col1
------+---------------------+-------------------
0.886 | Belgium             | -36288.1191849017
0.699 | Belize              |   -81740.32711562
0.427 | Benin               | -122666.882708325
0.805 | Chile               | -161356.484748602
0.687 | China               | -202634.254216416
0.744 | Costa Rica          | -242043.080125449
0.4   | Cote d'Ivoire       | -283330.394428932
0.776 | Cuba                | -322625.857541772
0.895 | Denmark             | -356086.311721071
0.644 | Egypt               | -403634.743992772
.
.
.
(96 rows)

Then, optionally apply the inverse function to transform the data back to its original state. This example shows an abbreviated output, only for the first record. There are 96 records in total.

=> SELECT APPLY_INVERSE_PCA (HDI,country,em1970,em2010,gdp1970,gdp2010 USING PARAMETERS model_name='pcamodel',
exclude_columns='HDI,country', key_columns='HDI,country') OVER () FROM world limit 1;
-[ RECORD 1 ]--------------
HDI     | 0.886
country | Belgium
em1970  | 3.74891915022521
em1971  | 26.091852917619
em1972  | 22.0262860721982
em1973  | 24.8214492074202
em1974  | 20.9486650320945
em1975  | 29.5717692117088
em1976  | 17.4373459783249
em1977  | 33.1895610966146
em1978  | 15.6251407781098
em1979  | 14.9560299812815
em1980  | 18.0870223053504
em1981  | -6.23151505146251
em1982  | -7.12300504708672
em1983  | -7.52627957856581
em1984  | -7.17428622245234
em1985  | -9.04899186621455
em1986  | -10.5098581697156
em1987  | -7.97146984849547
em1988  | -8.85458031319287
em1989  | -8.78422101747477
em1990  | -9.61931854722004
em1991  | -11.6411235452067
em1992  | -12.8882752879355
em1993  | -15.0647523842803
em1994  | -14.3266175918398
em1995  | -9.07603254825782
em1996  | -9.32002671928241
em1997  | -10.0209028262361
em1998  | -6.70882735196004
em1999  | -7.32575918131333
em2000  | -10.3113551933996
em2001  | -11.0162573094354
em2002  | -10.886264397431
em2003  | -8.96078372850612
em2004  | -11.5157129257881
em2005  | -12.5048269019293
em2006  | -12.2345161132594
em2007  | -8.92504587601715
em2008  | -12.1136551375247
em2009  | -10.1144380511421
em2010  | -7.72468307053519
gdp1970 | 10502.1047183969
gdp1971 | 9259.97560190599
gdp1972 | 6593.98178532712
gdp1973 | 5325.33813328068
gdp1974 | -899.029529832931
gdp1975 | -3184.93671107899
gdp1976 | -4517.68204331439
gdp1977 | -3322.9509067019
gdp1978 | -33.8221923368737
gdp1979 | 2882.50573071066
gdp1980 | 3638.74436577365
gdp1981 | 2211.77365027338
gdp1982 | 5811.44631880621
gdp1983 | 7365.75180165581
gdp1984 | 10465.1797058904
gdp1985 | 12312.7219748196
gdp1986 | 12309.0418293413
gdp1987 | 13695.5173269466
gdp1988 | 12531.9995299889
gdp1989 | 13009.2244205049
gdp1990 | 10697.6839797576
gdp1991 | 6835.94651304181
gdp1992 | 4275.67753277099
gdp1993 | 3382.29408813394
gdp1994 | 3703.65406726311
gdp1995 | 4238.17659535371
gdp1996 | 4692.48744219914
gdp1997 | 4539.23538342266
gdp1998 | 5886.78983381162
gdp1999 | 7527.72448728762
gdp2000 | 7646.05563584361
gdp2001 | 9053.22077886667
gdp2002 | 9914.82548013531
gdp2003 | 9201.64413455221
gdp2004 | 9234.70123279344
gdp2005 | 9565.5457350936
gdp2006 | 9569.86316415438
gdp2007 | 9104.60260145907
gdp2008 | 8182.8163827425
gdp2009 | 6279.93197775805
gdp2010 | 4274.40397281553

PCA (principal component analysis)

1 - Dimension reduction using PCA

See also