This is the multi-page printable view of this section. Click here to print.

Return to the regular view of this page.

Regression algorithms

Regression is an important and popular machine learning tool that makes predictions from data by learning the relationship between some features of the data and an observed value response.

Regression is an important and popular machine learning tool that makes predictions from data by learning the relationship between some features of the data and an observed value response. Regression is used to make predictions about profits, sales, temperature, stocks, and more. For example, you could use regression to predict the price of a house based on the location, the square footage, the size of the lot, and so on. In this example, the house's value is the response, and the other factors, such as location, are the features.

The optimal set of coefficients found for the regression's equation is known as the model. The relationship between the outcome and the features is summarized in the model, which can then be applied to different data sets, where the outcome value is unknown.

1 - Autoregression

See the autoregressive model example under time series models.

See the autoregressive model example under time series models.

2 - Linear regression

Using linear regression, you can model the linear relationship between independent variables, or features, and a dependent variable, or outcome.

Using linear regression, you can model the linear relationship between independent variables, or features, and a dependent variable, or outcome. You can build linear regression models to:

  • Fit a predictive model to a training data set of independent variables and some dependent variable. Doing so allows you to use feature variable values to make predictions on outcomes. For example, you can predict the amount of rain that will fall on a particular day of the year.

  • Determine the strength of the relationship between an independent variable and some outcome variable. For example, suppose you want to determine the importance of various weather variables on the outcome of how much rain will fall. You can build a linear regression model based on observations of weather patterns and rainfall to find the answer.

Unlike Logistic regression, which you use to determine a binary classification outcome, linear regression is primarily used to predict continuous numerical outcomes in linear relationships.

You can use the following functions to build a linear regression model, view the model, and use the model to make predictions on a set of test data:

For a complete example of how to use linear regression on a table in Vertica, see Building a linear regression model.

2.1 - Building a linear regression model

This linear regression example uses a small data set named faithful.

This linear regression example uses a small data set named faithful. The data set contains the intervals between eruptions and the duration of eruptions for the Old Faithful geyser in Yellowstone National Park. The duration of each eruption can be between 1.5 and 5 minutes. The length of intervals between eruptions and of each eruption varies. However, you can estimate the time of the next eruption based on the duration of the previous eruption. The example shows how you can build a model to predict the value of eruptions, given the value of the waiting feature.

Before you begin the example, load the Machine Learning sample data.
  1. Create the linear regression model, named linear_reg_faithful, using the faithful_training training data:

    => SELECT LINEAR_REG('linear_reg_faithful', 'faithful_training', 'eruptions', 'waiting'
       USING PARAMETERS optimizer='BFGS');
            LINEAR_REG
    ---------------------------
     Finished in 6 iterations
    
    (1 row)
    
  2. View the summary output of linear_reg_faithful:

    => SELECT GET_MODEL_SUMMARY(USING PARAMETERS model_name='linear_reg_faithful');
    --------------------------------------------------------------------------------
    =======
    details
    =======
    predictor|coefficient|std_err |t_value |p_value
    ---------+-----------+--------+--------+--------
    Intercept| -2.06795  | 0.21063|-9.81782| 0.00000
    waiting  |  0.07876  | 0.00292|26.96925| 0.00000
    
    ==============
    regularization
    ==============
    type| lambda
    ----+--------
    none| 1.00000
    
    ===========
    call_string
    ===========
    linear_reg('public.linear_reg_faithful', 'faithful_training', '"eruptions"', 'waiting'
    USING PARAMETERS optimizer='bfgs', epsilon=1e-06, max_iterations=100,
    regularization='none', lambda=1)
    
    ===============
    Additional Info
    ===============
    Name              |Value
    ------------------+-----
    iteration_count   |  3
    rejected_row_count|  0
    accepted_row_count| 162
    (1 row)
    
  3. Create a table that contains the response values from running the PREDICT_LINEAR_REG function on your test data. Name this table pred_faithful_results. View the results in the pred_faithful_results table:

    => CREATE TABLE pred_faithful_results AS
       (SELECT id, eruptions, PREDICT_LINEAR_REG(waiting USING PARAMETERS model_name='linear_reg_faithful')
       AS pred FROM faithful_testing);
    CREATE TABLE
    
    => SELECT * FROM pred_faithful_results ORDER BY id;
     id  | eruptions |       pred
    -----+-----------+------------------
       4 |     2.283 |  2.8151271587036
       5 |     4.533 | 4.62659045686076
       8 |       3.6 | 4.62659045686076
       9 |      1.95 | 1.94877514654148
      11 |     1.833 | 2.18505296804024
      12 |     3.917 | 4.54783118302784
      14 |      1.75 |  1.6337380512098
      20 |      4.25 | 4.15403481386324
      22 |      1.75 |  1.6337380512098
    .
    .
    .
    (110 rows)
    

Calculating the mean squared error (MSE)

You can calculate how well your model fits the data using the MSE function. MSE returns the average of the squared differences between actual value and predicted values.

=> SELECT MSE (eruptions::float, pred::float) OVER() FROM
   (SELECT eruptions, pred FROM pred_faithful_results) AS prediction_output;
        mse        |                   Comments
-------------------+-----------------------------------------------
 0.252925741352641 | Of 110 rows, 110 were used and 0 were ignored
(1 row)

See also

3 - PLS regression

The following example trains a PLS regression model on a Monarch butterfly population dataset, and then makes predictions with the model.

Combining aspects of PCA (principal component analysis) and linear regression, the PLS regression algorithm extracts a set of latent components that explain as much covariance as possible between the predictor and response variables, and then performs a regression that predicts response values using the extracted components.

This technique is particularly useful when the number of predictor variables is greater than the number of observations or the predictor variables are highly collinear. If either of these conditions is true of the input relation, ordinary linear regression fails to converge to an accurate model.

Use the following functions to train and make predictions with PLS regression models:

  • PLS_REG: Creates and trains a PLS regression model

  • PREDICT_PLS_REG: Applies a trained PLS model to an input relation and returns predicted values

The PLS_REG function supports PLS regression with only one response column, often referred to as PLS1. PLS regression with multiple response columns, known as PLS2, is not currently supported.

Example

This example uses a Monarch butterfly population dataset, which includes columns such as:

  • Log_N (dependent variable): natural log of the western monarch population in the overwintering habitat for the respective year
  • Log_PrevN: natural log of the western monarch population in the overwintering habitat for the previous year
  • Coast_Dev: estimated proportion of developed lands in the overwintering habitat in coastal California
  • Br_Temp: average monthly maximum temperature in degrees Celsius from June to August in the breeding habitat
  • Gly_Ag: summed amount of glyphosate, in pounds, applied for agricultural purposes in California
  • Coast_T: minimum monthly temperature in degrees Celsius averaged across the overwintering habitat in coastal California from December to February

As reported in Crone et al. (2019), the predictor variables in this dataset are highly collinear. Unlike ordinary linear regression techniques, which cannot disambiguate linearly dependent variables, the PLS regression algorithm is designed to handle collinearity.

After you have downloaded the Monarch data locally, you can load the data into Vertica with the following statements:

=> CREATE TABLE monarch_data (Year INTEGER, Log_N FLOAT, Log_PrevN FLOAT, CoastDev FLOAT, Br_PDSI FLOAT, Br_Temp FLOAT,
  Gly_Ag FLOAT, Gly_NonAg FLOAT, NN_Ag FLOAT, NN_NonAg FLOAT, Coast_P FLOAT, Coast_T FLOAT);

=> COPY monarch_data FROM LOCAL 'path-to-data' DELIMITER ',';

You can then split the data into a training and test set:

=> CREATE TABLE monarch_train AS (SELECT * FROM monarch_data WHERE Year < 2010);

=> CREATE TABLE monarch_test AS (SELECT Log_PrevN, CoastDev, Br_PDSI, Br_Temp, Gly_Ag, Gly_NonAg, NN_Ag, NN_NonAg, Coast_P, Coast_T FROM monarch_data WHERE Year >= 2010);

To train a PLS regression model on the training data, use the PLS_REG function. In this example, two models are trained, one with the default num_components of 2 and the other with num_components set to 3:

=> SELECT PLS_REG ('monarch_pls', 'monarch_train', 'Log_N', '*' USING PARAMETERS exclude_columns='Log_N, Year');
                           PLS_REG
-------------------------------------------------------------
 Number of components 2.
Accepted Rows: 28  Rejected Rows: 0
(1 row)

=> SELECT PLS_REG ('monarch_pls_3', 'monarch_train', 'Log_N', '*' USING PARAMETERS exclude_columns='Log_N, Year',
     num_components=3);
                           PLS_REG
-------------------------------------------------------------
 Number of components 3.
Accepted Rows: 28  Rejected Rows: 0
(1 row)

You can use the GET_MODEL_SUMMARY function to view a summary of the model, including coefficient and parameter values:

=> SELECT GET_MODEL_SUMMARY(USING PARAMETERS model_name='monarch_pls');
                                  GET_MODEL_SUMMARY
----------------------------------------------------------------------------------------------

=======
details
=======
predictor|coefficient_0
---------+-------------
Intercept|   5.27029
log_prevn|   0.76654
coastdev |  -0.34445
 br_pdsi |   0.25796
 br_temp |   0.32698
 gly_ag  |  -0.31284
gly_nonag|  -0.32573
  nn_ag  |   0.13260
nn_nonag |  -0.17085
 coast_p |  -0.05202
 coast_t |   0.42183


=========
responses
=========
index|name
-----+-----
  0  |log_n


===========
call_string
===========
SELECT PLS_REG('public.monarch_pls', 'monarch_train', 'log_n', '*' USING PARAMETERS exclude_columns='Log_N, Year', num_components=2, scale=true);

===============
Additional Info
===============
       Name       |Value
------------------+-----
    is_scaled     |  1
   n_components   |  2
rejected_row_count|  0
accepted_row_count| 28

(1 row)

After you train the PLS models, use the PREDICT_PLS_REG function to make predictions on an input relation, in this case the monarch_test data:

--2 component model
=> SELECT PREDICT_PLS_REG (* USING PARAMETERS model_name='monarch_pls') FROM monarch_test;
 PREDICT_PLS_REG
------------------
 2.88462577469318
 2.86535009598611
 2.84138719904564
  2.7222022770597
 3.96163608455087
 3.30690898656628
 2.99904802221049
 (7 rows)

--3 component model
=> SELECT PREDICT_PLS_REG (* USING PARAMETERS model_name='monarch_pls_3') FROM monarch_test;
 PREDICT_PLS_REG
------------------
 3.02572904832937
 3.68777887527724
 3.21578610703037
 3.51114625472752
 4.81912351259015
 4.18201014233219
 3.69428768763682
(7 rows)

Query the monarch_data table to view the actual measured monarch population from the years for which values were predicted:

=> SELECT Year, Log_N FROM monarch_data WHERE YEAR >= 2010;
 Year |  Log_N
------+----------
 2010 |    3.721
 2011 |    4.231
 2012 |    3.742
 2013 |    3.515
 2014 |    3.807
 2015 |    4.032
 2016 | 3.528373
(7 rows)

(7 rows)

To compute the mean squared error (MSE) of the models' predictions, use the MSE function:

--2 component model
=> SELECT MSE(obs, prediction) OVER()
    FROM (SELECT Log_N AS obs,
      PREDICT_PLS_REG (Log_PrevN, CoastDev, Br_PDSI, Br_Temp, Gly_Ag, Gly_NonAg, NN_Ag, NN_NonAg, Coast_P, Coast_T USING PARAMETERS model_name='monarch_pls') AS prediction
    FROM monarch_test WHERE Year >= 2010) AS prediction_output;
        mse        |                 Comments
-------------------+-------------------------------------------
 0.678821911958195 | Of 7 rows, 7 were used and 0 were ignored
(1 row)

--3 component model
=> SELECT MSE(obs, prediction) OVER()
      FROM (SELECT Log_N AS obs,
        PREDICT_PLS_REG (Log_PrevN, CoastDev, Br_PDSI, Br_Temp, Gly_Ag, Gly_NonAg, NN_Ag, NN_NonAg, Coast_P, Coast_T USING PARAMETERS model_name='monarch_pls2') AS prediction
      FROM monarch_test WHERE Year >= 2010) AS prediction_output;
        mse        |                 Comments
-------------------+-------------------------------------------
 0.368195839329685 | Of 7 rows, 7 were used and 0 were ignored
(1 row)

Comparing the MSE of the models' predictions, the PLS model trained with 3 components performs better than the model with only 2 components.

See also

4 - Poisson regression

Using Poisson regression, you can model count data.

Using Poisson regression, you can model count data. Poisson regression offers an alternative to linear regression or logistic regression and is useful when the target variable describes event frequency (event count in a fixed interval of time). Linear regression is preferable if you aim to predict continuous numerical outcomes in linear relationships, while logistic regression is used for predicting a binary classification.

You can use the following functions to build a Poisson regression model, view the model, and use the model to make predictions on a set of test data:

5 - Random forest for regression

The Random Forest for regression algorithm creates an ensemble model of regression trees.

The Random Forest for regression algorithm creates an ensemble model of regression trees. Each tree is trained on a randomly selected subset of the training data. The algorithm predicts the value that is the mean prediction of the individual trees.

You can use the following functions to train the Random Forest model, and use the model to make predictions on a set of test data:

For a complete example of how to use the Random Forest for regression algorithm in Vertica, see Building a random forest regression model.

5.1 - Building a random forest regression model

This example uses the "mtcars" dataset to create a random forest model to predict the value of carb (the number of carburetors).

This example uses the "mtcars" dataset to create a random forest model to predict the value of carb (the number of carburetors).

Before you begin the example, load the Machine Learning sample data.
  1. Use RF_REGRESSOR to create the random forest model myRFRegressorModel using the mtcars training data. View the summary output of the model with GET_MODEL_SUMMARY:

    => SELECT RF_REGRESSOR ('myRFRegressorModel', 'mtcars', 'carb', 'mpg, cyl, hp, drat, wt' USING PARAMETERS
    ntree=100, sampling_size=0.3);
    RF_REGRESSOR
    --------------
    Finished
    (1 row)
    
    
    => SELECT GET_MODEL_SUMMARY(USING PARAMETERS model_name='myRFRegressorModel');
    --------------------------------------------------------------------------------
    ===========
    call_string
    ===========
    SELECT rf_regressor('public.myRFRegressorModel', 'mtcars', '"carb"', 'mpg, cyl, hp, drat, wt'
    USING PARAMETERS exclude_columns='', ntree=100, mtry=1, sampling_size=0.3, max_depth=5, max_breadth=32,
    min_leaf_size=5, min_info_gain=0, nbins=32);
    
    =======
    details
    =======
    predictor|type
    ---------+-----
    mpg      |float
    cyl      | int
    hp       | int
    drat     |float
    wt       |float
    ===============
    Additional Info
    ===============
    Name              |Value
    ------------------+-----
    tree_count        | 100
    rejected_row_count|  0
    accepted_row_count| 32
    (1 row)
    
  2. Use PREDICT_RF_REGRESSOR to predict the number of carburetors:

    => SELECT PREDICT_RF_REGRESSOR (mpg,cyl,hp,drat,wt
    USING PARAMETERS model_name='myRFRegressorModel') FROM mtcars;
    PREDICT_RF_REGRESSOR
    ----------------------
    2.94774203574204
    2.6954087024087
    2.6954087024087
    2.89906346431346
    2.97688489288489
    2.97688489288489
    2.7086587024087
    2.92078965478965
    2.97688489288489
    2.7086587024087
    2.95621822621823
    2.82255155955156
    2.7086587024087
    2.7086587024087
    2.85650394050394
    2.85650394050394
    2.97688489288489
    2.95621822621823
    2.6954087024087
    2.6954087024087
    2.84493251193251
    2.97688489288489
    2.97688489288489
    2.8856467976468
    2.6954087024087
    2.92078965478965
    2.97688489288489
    2.97688489288489
    2.7934087024087
    2.7934087024087
    2.7086587024087
    2.72469441669442
    (32 rows)
    

6 - SVM (support vector machine) for regression

Support Vector Machine (SVM) for regression predicts continuous ordered variables based on the training data.

Support Vector Machine (SVM) for regression predicts continuous ordered variables based on the training data.

Unlike Logistic regression, which you use to determine a binary classification outcome, SVM for regression is primarily used to predict continuous numerical outcomes.

You can use the following functions to build an SVM for regression model, view the model, and use the model to make predictions on a set of test data:

For a complete example of how to use the SVM algorithm in Vertica, see Building an SVM for regression model.

6.1 - Building an SVM for regression model

This SVM for regression example uses a small data set named faithful, based on the Old Faithful geyser in Yellowstone National Park.

This SVM for regression example uses a small data set named faithful, based on the Old Faithful geyser in Yellowstone National Park. The data set contains values about the waiting time between eruptions and the duration of eruptions of the geyser. The example shows how you can build a model to predict the value of eruptions, given the value of the waiting feature.

Before you begin the example, load the Machine Learning sample data.
  1. Create the SVM model, named svm_faithful, using the faithful_training training data:

    => SELECT SVM_REGRESSOR('svm_faithful', 'faithful_training', 'eruptions', 'waiting'
                          USING PARAMETERS error_tolerance=0.1, max_iterations=100);
            SVM_REGRESSOR
    ---------------------------
     Finished in 5 iterations
    Accepted Rows: 162   Rejected Rows: 0
    (1 row)
    
  2. View the summary output of svm_faithful:

    => SELECT GET_MODEL_SUMMARY(USING PARAMETERS model_name='svm_faithful');
    
    ------------------------------------------------------------------
    =======
    details
    =======
    
    ===========================
    Predictors and Coefficients
    ===========================
             |Coefficients
    ---------+------------
    Intercept|  -1.59007
    waiting  |   0.07217
    ===========
    call_string
    ===========
    Call string:
    SELECT svm_regressor('public.svm_faithful', 'faithful_training', '"eruptions"',
    'waiting'USING PARAMETERS error_tolerance = 0.1, C=1, max_iterations=100,
    epsilon=0.001);
    
    ===============
    Additional Info
    ===============
    Name              |Value
    ------------------+-----
    accepted_row_count| 162
    rejected_row_count|  0
    iteration_count  |  5
    (1 row)
    
  3. Create a new table that contains the response values from running the PREDICT_SVM_REGRESSOR function on your test data. Name this table pred_faithful_results. View the results in the pred_faithful_results table:

    => CREATE TABLE pred_faithful AS
           (SELECT id, eruptions, PREDICT_SVM_REGRESSOR(waiting USING PARAMETERS model_name='svm_faithful')
            AS pred FROM faithful_testing);
    CREATE TABLE
    => SELECT * FROM pred_faithful ORDER BY id;
     id  | eruptions |       pred
    -----+-----------+------------------
       4 |     2.283 | 2.88444568755189
       5 |     4.533 | 4.54434581879796
       8 |       3.6 | 4.54434581879796
       9 |      1.95 | 2.09058040739072
      11 |     1.833 | 2.30708912016195
      12 |     3.917 | 4.47217624787422
      14 |      1.75 | 1.80190212369576
      20 |      4.25 | 4.11132839325551
      22 |      1.75 | 1.80190212369576
    .
    .
    .
    (110 rows)
    

Calculating the mean squared error (MSE)

You can calculate how well your model fits the data is by using the MSE function. MSE returns the average of the squared differences between actual value and predicted values.

=> SELECT MSE(obs::float, prediction::float) OVER()
   FROM (SELECT eruptions AS obs, pred AS prediction
         FROM pred_faithful) AS prediction_output;
        mse        |                   Comments
-------------------+-----------------------------------------------
 0.254499811834235 | Of 110 rows, 110 were used and 0 were ignored
(1 row)

See also

7 - XGBoost for regression

The following XGBoost functions create and perform predictions with a regression model:.

XGBoost (eXtreme Gradient Boosting) is a popular supervised-learning algorithm used for regression and classification on large datasets. It uses sequentially-built shallow decision trees to provide accurate results and a highly-scalable training method that avoids overfitting.

The following XGBoost functions create and perform predictions with a regression model:

Example

This example uses a small data set named "mtcars", which contains design and performance data for 32 automobiles from 1973-1974, and creates an XGBoost regression model to predict the value of the variable carb (the number of carburetors).

Before you begin the example, load the Machine Learning sample data.
  1. Use XGB_REGRESSOR to create the XGBoost regression model xgb_cars from the mtcars dataset:

    => SELECT XGB_REGRESSOR ('xgb_cars', 'mtcars', 'carb', 'mpg, cyl, hp, drat, wt'
        USING PARAMETERS learning_rate=0.5);
     XGB_REGRESSOR
    ---------------
     Finished
    (1 row)
    

    You can then view a summary of the model with GET_MODEL_SUMMARY:

    
    => SELECT GET_MODEL_SUMMARY(USING PARAMETERS model_name='xgb_cars');
                      GET_MODEL_SUMMARY
    ------------------------------------------------------
    ===========
    call_string
    ===========
    xgb_regressor('public.xgb_cars', 'mtcars', '"carb"', 'mpg, cyl, hp, drat, wt'
    USING PARAMETERS exclude_columns='', max_ntree=10, max_depth=5, nbins=32, objective=squarederror,
    split_proposal_method=global, epsilon=0.001, learning_rate=0.5, min_split_loss=0, weight_reg=0, sampling_size=1)
    
    =======
    details
    =======
    predictor|      type
    ---------+----------------
       mpg   |float or numeric
       cyl   |      int
       hp    |      int
      drat   |float or numeric
       wt    |float or numeric
    
    ===============
    Additional Info
    ===============
           Name       |Value
    ------------------+-----
        tree_count    | 10
    rejected_row_count|  0
    accepted_row_count| 32
    
    (1 row)
    
  2. Use PREDICT_XGB_REGRESSOR to predict the number of carburetors:

    => SELECT carb, PREDICT_XGB_REGRESSOR (mpg,cyl,hp,drat,wt USING PARAMETERS model_name='xgb_cars') FROM mtcars;
     carb | PREDICT_XGB_REGRESSOR
    ------+-----------------------
        4 |      4.00335213618023
        2 |       2.0038188946536
        6 |      5.98866003194438
        1 |      1.01774386191546
        2 |       1.9959801016274
        2 |       2.0038188946536
        4 |      3.99545403625739
        8 |      7.99211056556231
        2 |      1.99291901733151
        3 |       2.9975688946536
        3 |       2.9975688946536
        1 |      1.00320357711227
        2 |       2.0038188946536
        4 |      3.99545403625739
        4 |      4.00124134679445
        1 |      1.00759516721382
        4 |      3.99700517763435
        4 |      3.99580193056138
        4 |      4.00009088187525
        3 |       2.9975688946536
        2 |      1.98625064560888
        1 |      1.00355294416998
        2 |      2.00666247039502
        1 |      1.01682931210169
        4 |      4.00124134679445
        1 |      1.01007809485918
        2 |      1.98438405824605
        4 |      3.99580193056138
        2 |      1.99291901733151
        4 |      4.00009088187525
        2 |       2.0038188946536
        1 |      1.00759516721382
    (32 rows)