PLS regression

The following example trains a PLS regression model on a Monarch butterfly population dataset, and then makes predictions with the model.

Combining aspects of PCA (principal component analysis) and linear regression, the PLS regression algorithm extracts a set of latent components that explain as much covariance as possible between the predictor and response variables, and then performs a regression that predicts response values using the extracted components.

This technique is particularly useful when the number of predictor variables is greater than the number of observations or the predictor variables are highly collinear. If either of these conditions is true of the input relation, ordinary linear regression fails to converge to an accurate model.

Use the following functions to train and make predictions with PLS regression models:

  • PLS_REG: Creates and trains a PLS regression model

  • PREDICT_PLS_REG: Applies a trained PLS model to an input relation and returns predicted values

The PLS_REG function supports PLS regression with only one response column, often referred to as PLS1. PLS regression with multiple response columns, known as PLS2, is not currently supported.

Example

This example uses a Monarch butterfly population dataset, which includes columns such as:

  • Log_N (dependent variable): natural log of the western monarch population in the overwintering habitat for the respective year
  • Log_PrevN: natural log of the western monarch population in the overwintering habitat for the previous year
  • Coast_Dev: estimated proportion of developed lands in the overwintering habitat in coastal California
  • Br_Temp: average monthly maximum temperature in degrees Celsius from June to August in the breeding habitat
  • Gly_Ag: summed amount of glyphosate, in pounds, applied for agricultural purposes in California
  • Coast_T: minimum monthly temperature in degrees Celsius averaged across the overwintering habitat in coastal California from December to February

As reported in Crone et al. (2019), the predictor variables in this dataset are highly collinear. Unlike ordinary linear regression techniques, which cannot disambiguate linearly dependent variables, the PLS regression algorithm is designed to handle collinearity.

After you have downloaded the Monarch data locally, you can load the data into Vertica with the following statements:

=> CREATE TABLE monarch_data (Year INTEGER, Log_N FLOAT, Log_PrevN FLOAT, CoastDev FLOAT, Br_PDSI FLOAT, Br_Temp FLOAT,
  Gly_Ag FLOAT, Gly_NonAg FLOAT, NN_Ag FLOAT, NN_NonAg FLOAT, Coast_P FLOAT, Coast_T FLOAT);

=> COPY monarch_data FROM LOCAL 'path-to-data' DELIMITER ',';

You can then split the data into a training and test set:

=> CREATE TABLE monarch_train AS (SELECT * FROM monarch_data WHERE Year < 2010);

=> CREATE TABLE monarch_test AS (SELECT Log_PrevN, CoastDev, Br_PDSI, Br_Temp, Gly_Ag, Gly_NonAg, NN_Ag, NN_NonAg, Coast_P, Coast_T FROM monarch_data WHERE Year >= 2010);

To train a PLS regression model on the training data, use the PLS_REG function. In this example, two models are trained, one with the default num_components of 2 and the other with num_components set to 3:

=> SELECT PLS_REG ('monarch_pls', 'monarch_train', 'Log_N', '*' USING PARAMETERS exclude_columns='Log_N, Year');
                           PLS_REG
-------------------------------------------------------------
 Number of components 2.
Accepted Rows: 28  Rejected Rows: 0
(1 row)

=> SELECT PLS_REG ('monarch_pls_3', 'monarch_train', 'Log_N', '*' USING PARAMETERS exclude_columns='Log_N, Year',
     num_components=3);
                           PLS_REG
-------------------------------------------------------------
 Number of components 3.
Accepted Rows: 28  Rejected Rows: 0
(1 row)

You can use the GET_MODEL_SUMMARY function to view a summary of the model, including coefficient and parameter values:

=> SELECT GET_MODEL_SUMMARY(USING PARAMETERS model_name='monarch_pls');
                                  GET_MODEL_SUMMARY
----------------------------------------------------------------------------------------------

=======
details
=======
predictor|coefficient_0
---------+-------------
Intercept|   5.27029
log_prevn|   0.76654
coastdev |  -0.34445
 br_pdsi |   0.25796
 br_temp |   0.32698
 gly_ag  |  -0.31284
gly_nonag|  -0.32573
  nn_ag  |   0.13260
nn_nonag |  -0.17085
 coast_p |  -0.05202
 coast_t |   0.42183


=========
responses
=========
index|name
-----+-----
  0  |log_n


===========
call_string
===========
SELECT PLS_REG('public.monarch_pls', 'monarch_train', 'log_n', '*' USING PARAMETERS exclude_columns='Log_N, Year', num_components=2, scale=true);

===============
Additional Info
===============
       Name       |Value
------------------+-----
    is_scaled     |  1
   n_components   |  2
rejected_row_count|  0
accepted_row_count| 28

(1 row)

After you train the PLS models, use the PREDICT_PLS_REG function to make predictions on an input relation, in this case the monarch_test data:

--2 component model
=> SELECT PREDICT_PLS_REG (* USING PARAMETERS model_name='monarch_pls') FROM monarch_test;
 PREDICT_PLS_REG
------------------
 2.88462577469318
 2.86535009598611
 2.84138719904564
  2.7222022770597
 3.96163608455087
 3.30690898656628
 2.99904802221049
 (7 rows)

--3 component model
=> SELECT PREDICT_PLS_REG (* USING PARAMETERS model_name='monarch_pls_3') FROM monarch_test;
 PREDICT_PLS_REG
------------------
 3.02572904832937
 3.68777887527724
 3.21578610703037
 3.51114625472752
 4.81912351259015
 4.18201014233219
 3.69428768763682
(7 rows)

Query the monarch_data table to view the actual measured monarch population from the years for which values were predicted:

=> SELECT Year, Log_N FROM monarch_data WHERE YEAR >= 2010;
 Year |  Log_N
------+----------
 2010 |    3.721
 2011 |    4.231
 2012 |    3.742
 2013 |    3.515
 2014 |    3.807
 2015 |    4.032
 2016 | 3.528373
(7 rows)

(7 rows)

To compute the mean squared error (MSE) of the models' predictions, use the MSE function:

--2 component model
=> SELECT MSE(obs, prediction) OVER()
    FROM (SELECT Log_N AS obs,
      PREDICT_PLS_REG (Log_PrevN, CoastDev, Br_PDSI, Br_Temp, Gly_Ag, Gly_NonAg, NN_Ag, NN_NonAg, Coast_P, Coast_T USING PARAMETERS model_name='monarch_pls') AS prediction
    FROM monarch_test WHERE Year >= 2010) AS prediction_output;
        mse        |                 Comments
-------------------+-------------------------------------------
 0.678821911958195 | Of 7 rows, 7 were used and 0 were ignored
(1 row)

--3 component model
=> SELECT MSE(obs, prediction) OVER()
      FROM (SELECT Log_N AS obs,
        PREDICT_PLS_REG (Log_PrevN, CoastDev, Br_PDSI, Br_Temp, Gly_Ag, Gly_NonAg, NN_Ag, NN_NonAg, Coast_P, Coast_T USING PARAMETERS model_name='monarch_pls2') AS prediction
      FROM monarch_test WHERE Year >= 2010) AS prediction_output;
        mse        |                 Comments
-------------------+-------------------------------------------
 0.368195839329685 | Of 7 rows, 7 were used and 0 were ignored
(1 row)

Comparing the MSE of the models' predictions, the PLS model trained with 3 components performs better than the model with only 2 components.

See also