This is the multi-page printable view of this section. Click here to print.

Return to the regular view of this page.

Linear regression

Using linear regression, you can model the linear relationship between independent variables, or features, and a dependent variable, or outcome.

Using linear regression, you can model the linear relationship between independent variables, or features, and a dependent variable, or outcome. You can build linear regression models to:

  • Fit a predictive model to a training data set of independent variables and some dependent variable. Doing so allows you to use feature variable values to make predictions on outcomes. For example, you can predict the amount of rain that will fall on a particular day of the year.

  • Determine the strength of the relationship between an independent variable and some outcome variable. For example, suppose you want to determine the importance of various weather variables on the outcome of how much rain will fall. You can build a linear regression model based on observations of weather patterns and rainfall to find the answer.

Unlike Logistic regression, which you use to determine a binary classification outcome, linear regression is primarily used to predict continuous numerical outcomes in linear relationships.

You can use the following functions to build a linear regression model, view the model, and use the model to make predictions on a set of test data:

For a complete example of how to use linear regression on a table in Vertica, see Building a linear regression model.

1 - Building a linear regression model

This linear regression example uses a small data set named faithful.

This linear regression example uses a small data set named faithful. The data set contains the intervals between eruptions and the duration of eruptions for the Old Faithful geyser in Yellowstone National Park. The duration of each eruption can be between 1.5 and 5 minutes. The length of intervals between eruptions and of each eruption varies. However, you can estimate the time of the next eruption based on the duration of the previous eruption. The example shows how you can build a model to predict the value of eruptions, given the value of the waiting feature.

Before you begin the example, load the Machine Learning sample data.
  1. Create the linear regression model, named linear_reg_faithful, using the faithful_training training data:

    => SELECT LINEAR_REG('linear_reg_faithful', 'faithful_training', 'eruptions', 'waiting'
       USING PARAMETERS optimizer='BFGS');
            LINEAR_REG
    ---------------------------
     Finished in 6 iterations
    
    (1 row)
    
  2. View the summary output of linear_reg_faithful:

    => SELECT GET_MODEL_SUMMARY(USING PARAMETERS model_name='linear_reg_faithful');
    --------------------------------------------------------------------------------
    =======
    details
    =======
    predictor|coefficient|std_err |t_value |p_value
    ---------+-----------+--------+--------+--------
    Intercept| -2.06795  | 0.21063|-9.81782| 0.00000
    waiting  |  0.07876  | 0.00292|26.96925| 0.00000
    
    ==============
    regularization
    ==============
    type| lambda
    ----+--------
    none| 1.00000
    
    ===========
    call_string
    ===========
    linear_reg('public.linear_reg_faithful', 'faithful_training', '"eruptions"', 'waiting'
    USING PARAMETERS optimizer='bfgs', epsilon=1e-06, max_iterations=100,
    regularization='none', lambda=1)
    
    ===============
    Additional Info
    ===============
    Name              |Value
    ------------------+-----
    iteration_count   |  3
    rejected_row_count|  0
    accepted_row_count| 162
    (1 row)
    
  3. Create a table that contains the response values from running the PREDICT_LINEAR_REG function on your test data. Name this table pred_faithful_results. View the results in the pred_faithful_results table:

    => CREATE TABLE pred_faithful_results AS
       (SELECT id, eruptions, PREDICT_LINEAR_REG(waiting USING PARAMETERS model_name='linear_reg_faithful')
       AS pred FROM faithful_testing);
    CREATE TABLE
    
    => SELECT * FROM pred_faithful_results ORDER BY id;
     id  | eruptions |       pred
    -----+-----------+------------------
       4 |     2.283 |  2.8151271587036
       5 |     4.533 | 4.62659045686076
       8 |       3.6 | 4.62659045686076
       9 |      1.95 | 1.94877514654148
      11 |     1.833 | 2.18505296804024
      12 |     3.917 | 4.54783118302784
      14 |      1.75 |  1.6337380512098
      20 |      4.25 | 4.15403481386324
      22 |      1.75 |  1.6337380512098
    .
    .
    .
    (110 rows)
    

Calculating the mean squared error (MSE)

You can calculate how well your model fits the data using the MSE function. MSE returns the average of the squared differences between actual value and predicted values.

=> SELECT MSE (eruptions::float, pred::float) OVER() FROM
   (SELECT eruptions, pred FROM pred_faithful_results) AS prediction_output;
        mse        |                   Comments
-------------------+-----------------------------------------------
 0.252925741352641 | Of 110 rows, 110 were used and 0 were ignored
(1 row)

See also