Dimension reduction using PCA

This PCA example uses a data set with a large number of columns named world.

This PCA example uses a data set with a large number of columns named world. The example shows how you can apply PCA to all columns in the data set (except HDI) and reduce them into two dimensions.

Before you begin the example, load the Machine Learning sample data.
  1. Create the PCA model, named pcamodel.

    => SELECT PCA ('pcamodel', 'world','country,HDI,em1970,em1971,em1972,em1973,em1974,em1975,em1976,em1977,
    em1978,em1979,em1980,em1981,em1982,em1983,em1984 ,em1985,em1986,em1987,em1988,em1989,em1990,em1991,em1992,
    em1993,em1994,em1995,em1996,em1997,em1998,em1999,em2000,em2001,em2002,em2003,em2004,em2005,em2006,em2007,
    em2008,em2009,em2010,gdp1970,gdp1971,gdp1972,gdp1973,gdp1974,gdp1975,gdp1976,gdp1977,gdp1978,gdp1979,gdp1980,
    gdp1981,gdp1982,gdp1983,gdp1984,gdp1985,gdp1986,gdp1987,gdp1988,gdp1989,gdp1990,gdp1991,gdp1992,gdp1993,
    gdp1994,gdp1995,gdp1996,gdp1997,gdp1998,gdp1999,gdp2000,gdp2001,gdp2002,gdp2003,gdp2004,gdp2005,gdp2006,
    gdp2007,gdp2008,gdp2009,gdp2010' USING PARAMETERS exclude_columns='HDI, country');
    
    PCA
    ---------------------------------------------------------------
    Finished in 1 iterations.
    Accepted Rows: 96  Rejected Rows: 0
    (1 row)
    
  2. View the summary output of pcamodel.

    
    => SELECT GET_MODEL_SUMMARY(USING PARAMETERS model_name='pcamodel');
    GET_MODEL_SUMMARY
    --------------------------------------------------------------------------------
    
  3. Next, apply PCA to a select few columns, with the exception of HDI and country.

    => SELECT APPLY_PCA (HDI,country,em1970,em2010,gdp1970,gdp2010 USING PARAMETERS model_name='pcamodel',
    exclude_columns='HDI,country', key_columns='HDI,country',cutoff=.3) OVER () FROM world;
    HDI   |       country       |       col1
    ------+---------------------+-------------------
    0.886 | Belgium             | -36288.1191849017
    0.699 | Belize              |   -81740.32711562
    0.427 | Benin               | -122666.882708325
    0.805 | Chile               | -161356.484748602
    0.687 | China               | -202634.254216416
    0.744 | Costa Rica          | -242043.080125449
    0.4   | Cote d'Ivoire       | -283330.394428932
    0.776 | Cuba                | -322625.857541772
    0.895 | Denmark             | -356086.311721071
    0.644 | Egypt               | -403634.743992772
    .
    .
    .
    (96 rows)
    
  4. Then, optionally apply the inverse function to transform the data back to its original state. This example shows an abbreviated output, only for the first record. There are 96 records in total.

    => SELECT APPLY_INVERSE_PCA (HDI,country,em1970,em2010,gdp1970,gdp2010 USING PARAMETERS model_name='pcamodel',
    exclude_columns='HDI,country', key_columns='HDI,country') OVER () FROM world limit 1;
    -[ RECORD 1 ]--------------
    HDI     | 0.886
    country | Belgium
    em1970  | 3.74891915022521
    em1971  | 26.091852917619
    em1972  | 22.0262860721982
    em1973  | 24.8214492074202
    em1974  | 20.9486650320945
    em1975  | 29.5717692117088
    em1976  | 17.4373459783249
    em1977  | 33.1895610966146
    em1978  | 15.6251407781098
    em1979  | 14.9560299812815
    em1980  | 18.0870223053504
    em1981  | -6.23151505146251
    em1982  | -7.12300504708672
    em1983  | -7.52627957856581
    em1984  | -7.17428622245234
    em1985  | -9.04899186621455
    em1986  | -10.5098581697156
    em1987  | -7.97146984849547
    em1988  | -8.85458031319287
    em1989  | -8.78422101747477
    em1990  | -9.61931854722004
    em1991  | -11.6411235452067
    em1992  | -12.8882752879355
    em1993  | -15.0647523842803
    em1994  | -14.3266175918398
    em1995  | -9.07603254825782
    em1996  | -9.32002671928241
    em1997  | -10.0209028262361
    em1998  | -6.70882735196004
    em1999  | -7.32575918131333
    em2000  | -10.3113551933996
    em2001  | -11.0162573094354
    em2002  | -10.886264397431
    em2003  | -8.96078372850612
    em2004  | -11.5157129257881
    em2005  | -12.5048269019293
    em2006  | -12.2345161132594
    em2007  | -8.92504587601715
    em2008  | -12.1136551375247
    em2009  | -10.1144380511421
    em2010  | -7.72468307053519
    gdp1970 | 10502.1047183969
    gdp1971 | 9259.97560190599
    gdp1972 | 6593.98178532712
    gdp1973 | 5325.33813328068
    gdp1974 | -899.029529832931
    gdp1975 | -3184.93671107899
    gdp1976 | -4517.68204331439
    gdp1977 | -3322.9509067019
    gdp1978 | -33.8221923368737
    gdp1979 | 2882.50573071066
    gdp1980 | 3638.74436577365
    gdp1981 | 2211.77365027338
    gdp1982 | 5811.44631880621
    gdp1983 | 7365.75180165581
    gdp1984 | 10465.1797058904
    gdp1985 | 12312.7219748196
    gdp1986 | 12309.0418293413
    gdp1987 | 13695.5173269466
    gdp1988 | 12531.9995299889
    gdp1989 | 13009.2244205049
    gdp1990 | 10697.6839797576
    gdp1991 | 6835.94651304181
    gdp1992 | 4275.67753277099
    gdp1993 | 3382.29408813394
    gdp1994 | 3703.65406726311
    gdp1995 | 4238.17659535371
    gdp1996 | 4692.48744219914
    gdp1997 | 4539.23538342266
    gdp1998 | 5886.78983381162
    gdp1999 | 7527.72448728762
    gdp2000 | 7646.05563584361
    gdp2001 | 9053.22077886667
    gdp2002 | 9914.82548013531
    gdp2003 | 9201.64413455221
    gdp2004 | 9234.70123279344
    gdp2005 | 9565.5457350936
    gdp2006 | 9569.86316415438
    gdp2007 | 9104.60260145907
    gdp2008 | 8182.8163827425
    gdp2009 | 6279.93197775805
    gdp2010 | 4274.40397281553
    

See also