This is the multi-page printable view of this section. Click here to print.

Return to the regular view of this page.

Logistic regression

Using logistic regression, you can model the relationship between independent variables, or features, and some dependent variable, or outcome.

Using logistic regression, you can model the relationship between independent variables, or features, and some dependent variable, or outcome. The outcome of logistic regression is always a binary value.

You can build logistic regression models to:

  • Fit a predictive model to a training data set of independent variables and some binary dependent variable. Doing so allows you to make predictions on outcomes, such as whether a piece of email is spam mail or not.

  • Determine the strength of the relationship between an independent variable and some binary outcome variable. For example, suppose you want to determine whether an email is spam or not. You can build a logistic regression model, based on observations of the properties of email messages. Then, you can determine the importance of various properties of an email message on that outcome.

You can use the following functions to build a logistic regression model, view the model, and use the model to make predictions on a set of test data:

For a complete programming example of how to use logistic regression on a table in Vertica, see Building a logistic regression model.

1 - Building a logistic regression model

This logistic regression example uses a small data set named mtcars.

This logistic regression example uses a small data set named mtcars. The example shows how to build a model that predicts the value of am, which indicates whether the car has an automatic or a manual transmission. It uses the given values of all the other features in the data set.

In this example, roughly 60% of the data is used as training data to create a model. The remaining 40% is used as testing data against which you can test your logistic regression model.

Before you begin the example, load the Machine Learning sample data.
  1. Create the logistic regression model, named logistic_reg_mtcars, using the mtcars_train training data.

    => SELECT LOGISTIC_REG('logistic_reg_mtcars', 'mtcars_train', 'am', 'cyl, wt'
       USING PARAMETERS exclude_columns='hp');
            LOGISTIC_REG
    ----------------------------
     Finished in 15 iterations
    
    (1 row)
    
  2. View the summary output of logistic_reg_mtcars.

    
    => SELECT GET_MODEL_SUMMARY(USING PARAMETERS model_name='logistic_reg_mtcars');
    --------------------------------------------------------------------------------
    =======
    details
    =======
    predictor|coefficient|  std_err  |z_value |p_value
    ---------+-----------+-----------+--------+--------
    Intercept| 262.39898 |44745.77338| 0.00586| 0.99532
    cyl      | 16.75892  |5987.23236 | 0.00280| 0.99777
    wt       |-119.92116 |17237.03154|-0.00696| 0.99445
    
    ==============
    regularization
    ==============
    type| lambda
    ----+--------
    none| 1.00000
    
    ===========
    call_string
    ===========
    logistic_reg('public.logistic_reg_mtcars', 'mtcars_train', '"am"', 'cyl, wt'
    USING PARAMETERS exclude_columns='hp', optimizer='newton', epsilon=1e-06,
    max_iterations=100, regularization='none', lambda=1)
    
    
    ===============
    Additional Info
    ===============
    Name              |Value
    ------------------+-----
    iteration_count   | 20
    rejected_row_count|  0
    accepted_row_count| 20
    (1 row)
    
  3. Create a table named mtcars_predict_results. Populate this table with the prediction outputs you obtain from running the PREDICT_LOGISTIC_REG function on your test data. View the results in the mtcars_predict_results table.

    => CREATE TABLE mtcars_predict_results AS
       (SELECT car_model, am, PREDICT_LOGISTIC_REG(cyl, wt
       USING PARAMETERS model_name='logistic_reg_mtcars')
       AS Prediction FROM mtcars_test);
    CREATE TABLE
    
    
    => SELECT * FROM mtcars_predict_results;
       car_model    | am | Prediction
    ----------------+----+------------
     AMC Javelin    |  0 |          0
     Hornet 4 Drive |  0 |          0
     Maserati Bora  |  1 |          0
     Merc 280       |  0 |          0
     Merc 450SL     |  0 |          0
     Toyota Corona  |  0 |          1
     Volvo 142E     |  1 |          1
     Camaro Z28     |  0 |          0
     Datsun 710     |  1 |          1
     Honda Civic    |  1 |          1
     Porsche 914-2  |  1 |          1
     Valiant        |  0 |          0
    (12 rows)
    
  4. Evaluate the accuracy of the PREDICT_LOGISTIC_REG function, using the CONFUSION_MATRIX evaluation function.

    => SELECT CONFUSION_MATRIX(obs::int, pred::int USING PARAMETERS num_classes=2) OVER()
       FROM (SELECT am AS obs, Prediction AS pred FROM mtcars_predict_results) AS prediction_output;
     class | 0 | 1 |                   comment
    -------+---+---+---------------------------------------------
         0 | 6 | 1 |
         1 | 1 | 4 | Of 12 rows, 12 were used and 0 were ignored
    (2 rows)
    

    In this case, PREDICT_LOGISTIC_REG correctly predicted that four out of five cars with a value of 1 in the am column have a value of 1. Out of the seven cars which had a value of 0 in the am column, six were correctly predicted to have the value 0. One car was incorrectly classified as having the value 1.

See also