Dimension reduction using PCA
This PCA example uses a data set with a large number of columns named world.
This PCA example uses a data set with a large number of columns named world. The example shows how you can apply PCA to all columns in the data set (except HDI) and reduce them into two dimensions.
Before you begin the example, load the Machine Learning sample data.-
Create the PCA model, named
pcamodel
.=> SELECT PCA ('pcamodel', 'world','country,HDI,em1970,em1971,em1972,em1973,em1974,em1975,em1976,em1977, em1978,em1979,em1980,em1981,em1982,em1983,em1984 ,em1985,em1986,em1987,em1988,em1989,em1990,em1991,em1992, em1993,em1994,em1995,em1996,em1997,em1998,em1999,em2000,em2001,em2002,em2003,em2004,em2005,em2006,em2007, em2008,em2009,em2010,gdp1970,gdp1971,gdp1972,gdp1973,gdp1974,gdp1975,gdp1976,gdp1977,gdp1978,gdp1979,gdp1980, gdp1981,gdp1982,gdp1983,gdp1984,gdp1985,gdp1986,gdp1987,gdp1988,gdp1989,gdp1990,gdp1991,gdp1992,gdp1993, gdp1994,gdp1995,gdp1996,gdp1997,gdp1998,gdp1999,gdp2000,gdp2001,gdp2002,gdp2003,gdp2004,gdp2005,gdp2006, gdp2007,gdp2008,gdp2009,gdp2010' USING PARAMETERS exclude_columns='HDI, country'); PCA --------------------------------------------------------------- Finished in 1 iterations. Accepted Rows: 96 Rejected Rows: 0 (1 row)
-
View the summary output of
pcamodel
.=> SELECT GET_MODEL_SUMMARY(USING PARAMETERS model_name='pcamodel'); GET_MODEL_SUMMARY --------------------------------------------------------------------------------
-
Next, apply PCA to a select few columns, with the exception of HDI and country.
=> SELECT APPLY_PCA (HDI,country,em1970,em2010,gdp1970,gdp2010 USING PARAMETERS model_name='pcamodel', exclude_columns='HDI,country', key_columns='HDI,country',cutoff=.3) OVER () FROM world; HDI | country | col1 ------+---------------------+------------------- 0.886 | Belgium | -36288.1191849017 0.699 | Belize | -81740.32711562 0.427 | Benin | -122666.882708325 0.805 | Chile | -161356.484748602 0.687 | China | -202634.254216416 0.744 | Costa Rica | -242043.080125449 0.4 | Cote d'Ivoire | -283330.394428932 0.776 | Cuba | -322625.857541772 0.895 | Denmark | -356086.311721071 0.644 | Egypt | -403634.743992772 . . . (96 rows)
-
Then, optionally apply the inverse function to transform the data back to its original state. This example shows an abbreviated output, only for the first record. There are 96 records in total.
=> SELECT APPLY_INVERSE_PCA (HDI,country,em1970,em2010,gdp1970,gdp2010 USING PARAMETERS model_name='pcamodel', exclude_columns='HDI,country', key_columns='HDI,country') OVER () FROM world limit 1; -[ RECORD 1 ]-------------- HDI | 0.886 country | Belgium em1970 | 3.74891915022521 em1971 | 26.091852917619 em1972 | 22.0262860721982 em1973 | 24.8214492074202 em1974 | 20.9486650320945 em1975 | 29.5717692117088 em1976 | 17.4373459783249 em1977 | 33.1895610966146 em1978 | 15.6251407781098 em1979 | 14.9560299812815 em1980 | 18.0870223053504 em1981 | -6.23151505146251 em1982 | -7.12300504708672 em1983 | -7.52627957856581 em1984 | -7.17428622245234 em1985 | -9.04899186621455 em1986 | -10.5098581697156 em1987 | -7.97146984849547 em1988 | -8.85458031319287 em1989 | -8.78422101747477 em1990 | -9.61931854722004 em1991 | -11.6411235452067 em1992 | -12.8882752879355 em1993 | -15.0647523842803 em1994 | -14.3266175918398 em1995 | -9.07603254825782 em1996 | -9.32002671928241 em1997 | -10.0209028262361 em1998 | -6.70882735196004 em1999 | -7.32575918131333 em2000 | -10.3113551933996 em2001 | -11.0162573094354 em2002 | -10.886264397431 em2003 | -8.96078372850612 em2004 | -11.5157129257881 em2005 | -12.5048269019293 em2006 | -12.2345161132594 em2007 | -8.92504587601715 em2008 | -12.1136551375247 em2009 | -10.1144380511421 em2010 | -7.72468307053519 gdp1970 | 10502.1047183969 gdp1971 | 9259.97560190599 gdp1972 | 6593.98178532712 gdp1973 | 5325.33813328068 gdp1974 | -899.029529832931 gdp1975 | -3184.93671107899 gdp1976 | -4517.68204331439 gdp1977 | -3322.9509067019 gdp1978 | -33.8221923368737 gdp1979 | 2882.50573071066 gdp1980 | 3638.74436577365 gdp1981 | 2211.77365027338 gdp1982 | 5811.44631880621 gdp1983 | 7365.75180165581 gdp1984 | 10465.1797058904 gdp1985 | 12312.7219748196 gdp1986 | 12309.0418293413 gdp1987 | 13695.5173269466 gdp1988 | 12531.9995299889 gdp1989 | 13009.2244205049 gdp1990 | 10697.6839797576 gdp1991 | 6835.94651304181 gdp1992 | 4275.67753277099 gdp1993 | 3382.29408813394 gdp1994 | 3703.65406726311 gdp1995 | 4238.17659535371 gdp1996 | 4692.48744219914 gdp1997 | 4539.23538342266 gdp1998 | 5886.78983381162 gdp1999 | 7527.72448728762 gdp2000 | 7646.05563584361 gdp2001 | 9053.22077886667 gdp2002 | 9914.82548013531 gdp2003 | 9201.64413455221 gdp2004 | 9234.70123279344 gdp2005 | 9565.5457350936 gdp2006 | 9569.86316415438 gdp2007 | 9104.60260145907 gdp2008 | 8182.8163827425 gdp2009 | 6279.93197775805 gdp2010 | 4274.40397281553