Classification algorithms

Classification is an important and popular machine learning tool that assigns items in a data set to different categories. Classification is used for predicting risk over time, detecting fraud, categorizing text, and more. Classification functions begin with a data set where the different categories are known. For example, suppose you want to classify students based on how likely they are to get into graduate school. In addition to factors like admission exam scores and grades, you could also track work experience.

Binary classification means the outcome, in this case admission, has only two possible values: admit or do not admit. Multiclass outcomes have more than two values: for example, a low, medium, or high chance of admission. During the training process, classification algorithms find the relationship between the outcome and the features. This relationship is summarized in the model, which can then be applied to different data sets, where the categories are unknown.
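
For example, the following sketch shows this train-then-predict workflow in Vertica SQL for the graduate-school scenario above. The applicants_train and applicants_new tables and their columns are hypothetical, introduced only for illustration; the LOGISTIC_REG and PREDICT_LOGISTIC_REG functions are described in the next section.

=> SELECT LOGISTIC_REG('admission_model', 'applicants_train', 'admitted',
                       'exam_score, gpa, work_experience');  -- rows whose category is known

=> SELECT applicant_id,
          PREDICT_LOGISTIC_REG(exam_score, gpa, work_experience
                               USING PARAMETERS model_name='admission_model')
          AS predicted_admit
   FROM applicants_new;  -- rows whose category is unknown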

1 - Logistic regression

Using logistic regression, you can model the relationship between independent variables, or features, and some dependent variable, or outcome. The outcome of logistic regression is always a binary value.
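
Concretely, logistic regression models the probability of the positive outcome with the logistic (sigmoid) function. A standard formulation, given here as general background rather than anything Vertica-specific:

    P(y = 1 \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}}

The model predicts the positive class when this probability exceeds 0.5; the coefficients \beta_i are what training estimates.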

You can build logistic regression models to:

  • Fit a predictive model to a training data set of independent variables and some binary dependent variable. Doing so allows you to make predictions on outcomes, such as whether a piece of email is spam mail or not.

  • Determine the strength of the relationship between an independent variable and some binary outcome variable. For example, suppose you want to determine whether an email is spam or not. You can build a logistic regression model, based on observations of the properties of email messages. Then, you can determine the importance of various properties of an email message on that outcome.

You can use the following functions to build a logistic regression model, view the model, and use the model to make predictions on a set of test data:

  • LOGISTIC_REG
  • GET_MODEL_SUMMARY
  • PREDICT_LOGISTIC_REG

For a complete programming example of how to use logistic regression on a table in Vertica, see Building a logistic regression model.

1.1 - Building a logistic regression model

This logistic regression example uses a small data set named mtcars. The example shows how to build a model that predicts the value of am, which indicates whether the car has an automatic or a manual transmission, from the given values of other features in the data set.

In this example, roughly 60% of the data is used as training data to create a model. The remaining 40% is used as testing data, against which you can evaluate your logistic regression model.
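
The sample-data script creates the mtcars_train and mtcars_test tables for you. If you want to produce a similar random split yourself, the following is one possible sketch; the mtcars_rand, my_mtcars_train, and my_mtcars_test names are assumptions for illustration:

=> CREATE TABLE mtcars_rand AS SELECT *, RANDOM() AS rnd FROM mtcars;  -- fix each row's random draw
=> CREATE TABLE my_mtcars_train AS SELECT * FROM mtcars_rand WHERE rnd < 0.6;
=> CREATE TABLE my_mtcars_test  AS SELECT * FROM mtcars_rand WHERE rnd >= 0.6;

Materializing the random draw first keeps the two subsets disjoint and exhaustive; calling RANDOM() independently in two separate queries would not.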

Before you begin the example, load the Machine Learning sample data.
  1. Create the logistic regression model, named logistic_reg_mtcars, using the mtcars_train training data.

    => SELECT LOGISTIC_REG('logistic_reg_mtcars', 'mtcars_train', 'am', 'cyl, wt'
       USING PARAMETERS exclude_columns='hp');
            LOGISTIC_REG
    ----------------------------
     Finished in 20 iterations
    
    (1 row)
    
  2. View the summary output of logistic_reg_mtcars.

    
    => SELECT GET_MODEL_SUMMARY(USING PARAMETERS model_name='logistic_reg_mtcars');
    --------------------------------------------------------------------------------
    =======
    details
    =======
    predictor|coefficient|  std_err  |z_value |p_value
    ---------+-----------+-----------+--------+--------
    Intercept| 262.39898 |44745.77338| 0.00586| 0.99532
    cyl      | 16.75892  |5987.23236 | 0.00280| 0.99777
    wt       |-119.92116 |17237.03154|-0.00696| 0.99445
    
    ==============
    regularization
    ==============
    type| lambda
    ----+--------
    none| 1.00000
    
    ===========
    call_string
    ===========
    logistic_reg('public.logistic_reg_mtcars', 'mtcars_train', '"am"', 'cyl, wt'
    USING PARAMETERS exclude_columns='hp', optimizer='newton', epsilon=1e-06,
    max_iterations=100, regularization='none', lambda=1)
    
    
    ===============
    Additional Info
    ===============
    Name              |Value
    ------------------+-----
    iteration_count   | 20
    rejected_row_count|  0
    accepted_row_count| 20
    (1 row)
    
  3. Create a table named mtcars_predict_results. Populate this table with the prediction outputs you obtain from running the PREDICT_LOGISTIC_REG function on your test data. View the results in the mtcars_predict_results table.

    => CREATE TABLE mtcars_predict_results AS
       (SELECT car_model, am, PREDICT_LOGISTIC_REG(cyl, wt
       USING PARAMETERS model_name='logistic_reg_mtcars')
       AS Prediction FROM mtcars_test);
    CREATE TABLE
    
    
    => SELECT * FROM mtcars_predict_results;
       car_model    | am | Prediction
    ----------------+----+------------
     AMC Javelin    |  0 |          0
     Hornet 4 Drive |  0 |          0
     Maserati Bora  |  1 |          0
     Merc 280       |  0 |          0
     Merc 450SL     |  0 |          0
     Toyota Corona  |  0 |          1
     Volvo 142E     |  1 |          1
     Camaro Z28     |  0 |          0
     Datsun 710     |  1 |          1
     Honda Civic    |  1 |          1
     Porsche 914-2  |  1 |          1
     Valiant        |  0 |          0
    (12 rows)
    
  4. Evaluate the accuracy of the PREDICT_LOGISTIC_REG function, using the CONFUSION_MATRIX evaluation function.

    => SELECT CONFUSION_MATRIX(obs::int, pred::int USING PARAMETERS num_classes=2) OVER()
       FROM (SELECT am AS obs, Prediction AS pred FROM mtcars_predict_results) AS prediction_output;
     class | 0 | 1 |                   comment
    -------+---+---+---------------------------------------------
         0 | 6 | 1 |
         1 | 1 | 4 | Of 12 rows, 12 were used and 0 were ignored
    (2 rows)
    

    In this case, PREDICT_LOGISTIC_REG correctly predicted the value 1 for four of the five cars with a value of 1 in the am column; one was incorrectly classified as 0. Of the seven cars with a value of 0 in the am column, six were correctly predicted to have the value 0, and one was incorrectly classified as 1.
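
    You can also compute the overall accuracy directly from the prediction table with plain SQL; a minimal sketch using the columns created in step 3:

    => SELECT SUM(CASE WHEN am = Prediction THEN 1 ELSE 0 END) / COUNT(*)::FLOAT AS accuracy
       FROM mtcars_predict_results;

    For the confusion matrix above, this returns 10 correct predictions out of 12 rows, or approximately 0.83.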

2 - Naive Bayes

You can use the Naive Bayes algorithm to classify your data when features can be assumed independent. The algorithm uses independent features to calculate the probability of a specific class. For example, you might want to predict the probability that an email is spam. In that case, you would use a corpus of words associated with spam to calculate the probability the email's content is spam.
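
Under the independence assumption, the class probability factorizes into per-feature terms. A standard statement of the naive Bayes rule, given here as general background rather than anything Vertica-specific:

    P(C \mid x_1, \ldots, x_n) \propto P(C) \prod_{i=1}^{n} P(x_i \mid C)

The predicted class is the one that maximizes this product of the class prior and the per-feature likelihoods.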

You can use the following functions to build a Naive Bayes model, view the model, and use the model to make predictions on a set of test data:

  • NAIVE_BAYES
  • GET_MODEL_SUMMARY
  • PREDICT_NAIVE_BAYES
  • PREDICT_NAIVE_BAYES_CLASSES

For a complete example of how to use the Naive Bayes algorithm in Vertica, see Classifying data using naive Bayes.

2.1 - Classifying data using naive Bayes

This Naive Bayes example uses the HouseVotes84 data set to show you how to build a model. With this model, you can predict the party with which a member of the United States Congress is affiliated, based on their voting record. To aid in classification, the data has been cleaned and missing votes have been replaced. The cleaned data replaces each missing vote with the majority vote of the voter's party. For example, suppose a Democrat had a missing value for vote1 and the majority of Democrats voted in favor. This example replaces all missing Democrats' votes for vote1 with a vote in favor.

In this example, approximately 75% of the cleaned HouseVotes84 data is randomly selected and copied to a training table. The remaining cleaned HouseVotes84 data is used as a testing table.

Before you begin the example, load the Machine Learning sample data.

You must also load the naive_bayes_data_preparation.sql script:

$ /opt/vertica/bin/vsql -d <name of your database> -f naive_bayes_data_preparation.sql
  1. Create the Naive Bayes model, named naive_house84_model, using the house84_train training data.

    => SELECT NAIVE_BAYES('naive_house84_model', 'house84_train', 'party',
                          '*' USING PARAMETERS exclude_columns='party, id');
                      NAIVE_BAYES
    ------------------------------------------------
     Finished. Accepted Rows: 315  Rejected Rows: 0
    (1 row)
    
  2. Create a new table, named predicted_party_naive. Populate this table with the prediction outputs you obtain from the PREDICT_NAIVE_BAYES function on your test data.

    => CREATE TABLE predicted_party_naive
         AS SELECT party,
              PREDICT_NAIVE_BAYES (vote1, vote2, vote3, vote4, vote5,
                                   vote6, vote7, vote8, vote9, vote10,
                                   vote11, vote12, vote13, vote14,
                                   vote15, vote16
                                     USING PARAMETERS model_name = 'naive_house84_model',
                                                      type = 'response') AS Predicted_Party
           FROM house84_test;
    CREATE TABLE
    
  3. Calculate the accuracy of the model's predictions.

    
    => SELECT  (Predictions.Num_Correct_Predictions / Count.Total_Count) AS Percent_Accuracy
        FROM (  SELECT COUNT(Predicted_Party) AS Num_Correct_Predictions
            FROM predicted_party_naive
            WHERE party = Predicted_Party
             ) AS Predictions,
             (  SELECT COUNT(party) AS Total_Count
                   FROM predicted_party_naive
                ) AS Count;
       Percent_Accuracy
    ----------------------
     0.933333333333333333
    (1 row)
    

The model correctly predicted the party of the members of Congress based on their voting patterns with 93% accuracy.
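
Beyond overall accuracy, you can break the predictions down per class with the CONFUSION_MATRIX function shown in the logistic regression example. A minimal sketch, assuming you encode democrat as 0 and republican as 1:

=> SELECT CONFUSION_MATRIX(obs::int, pred::int USING PARAMETERS num_classes=2) OVER()
   FROM (SELECT CASE WHEN party = 'republican' THEN 1 ELSE 0 END AS obs,
                CASE WHEN Predicted_Party = 'republican' THEN 1 ELSE 0 END AS pred
         FROM predicted_party_naive) AS prediction_output;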

Viewing the probability of each class

You can also use PREDICT_NAIVE_BAYES_CLASSES to view the probability of each class.

=> SELECT PREDICT_NAIVE_BAYES_CLASSES (id, vote1, vote2, vote3, vote4, vote5,
                                       vote6, vote7, vote8, vote9, vote10,
                                       vote11, vote12, vote13, vote14,
                                       vote15, vote16
                                       USING PARAMETERS model_name = 'naive_house84_model',
                                                        key_columns = 'id', exclude_columns = 'id',
                                                        classes = 'democrat, republican')
        OVER() FROM house84_test;
 id  | Predicted  |    Probability    |       democrat       |      republican
-----+------------+-------------------+----------------------+----------------------
 368 | democrat   |                 1 |                    1 |                    0
 372 | democrat   |                 1 |                    1 |                    0
 374 | democrat   |                 1 |                    1 |                    0
 378 | republican | 0.999999962214987 | 3.77850125111219e-08 |    0.999999962214987
 384 | democrat   |                 1 |                    1 |                    0
 387 | democrat   |                 1 |                    1 |                    0
 406 | republican | 0.999999945980143 | 5.40198564592332e-08 |    0.999999945980143
 419 | democrat   |                 1 |                    1 |                    0
 421 | republican | 0.922808855631005 |   0.0771911443689949 |    0.922808855631005
.
.
.
(109 rows)

3 - Random forest for classification

The Random Forest algorithm creates an ensemble model of decision trees. Each tree is trained on a randomly selected subset of the training data.
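
The size of the ensemble and of each tree's training sample are controlled by parameters of the training function. For example, a sketch with hypothetical table and column names; the ntree and sampling_size parameters also appear in the example that follows:

=> SELECT RF_CLASSIFIER('rf_sketch', 'my_training_table', 'label_col', '*'
                        USING PARAMETERS exclude_columns='label_col',
                                         ntree=50, sampling_size=0.632);

With these settings, each of the 50 trees trains on a random sample of roughly 63% of the rows.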

You can use the following functions to train the Random Forest model, and use the model to make predictions on a set of test data:

  • RF_CLASSIFIER
  • GET_MODEL_SUMMARY
  • PREDICT_RF_CLASSIFIER
  • PREDICT_RF_CLASSIFIER_CLASSES

For a complete example of how to use the Random Forest algorithm in Vertica, see Classifying data using random forest.

3.1 - Classifying data using random forest

This random forest example uses a data set named iris. The data set contains four variables that measure various parts of the iris flower; the example uses them to predict the flower's species.

Before you begin the example, make sure that you have followed the steps in Download the machine learning example data.

  1. Use RF_CLASSIFIER to create the random forest model, named rf_iris, using the iris data. View the summary output of the model with GET_MODEL_SUMMARY:

    => SELECT RF_CLASSIFIER ('rf_iris', 'iris', 'Species', 'Sepal_Length, Sepal_Width, Petal_Length, Petal_Width'
    USING PARAMETERS ntree=100, sampling_size=0.5);
    
            RF_CLASSIFIER
    ----------------------------
    Finished training
    
    (1 row)
    
    
    => SELECT GET_MODEL_SUMMARY(USING PARAMETERS model_name='rf_iris');
    ------------------------------------------------------------------------
    ===========
    call_string
    ===========
    SELECT rf_classifier('public.rf_iris', 'iris', '"species"', 'Sepal_Length, Sepal_Width, Petal_Length,
    Petal_Width' USING PARAMETERS exclude_columns='', ntree=100, mtry=2, sampling_size=0.5, max_depth=5,
    max_breadth=32, min_leaf_size=1, min_info_gain=0, nbins=32);
    
    =======
    details
    =======
    predictor   |type
    ------------+-----
    sepal_length|float
    sepal_width |float
    petal_length|float
    petal_width |float
    
    ===============
    Additional Info
    ===============
    Name              |Value
    ------------------+-----
    tree_count        | 100
    rejected_row_count|  0
    accepted_row_count| 150
    (1 row)
    
  2. Apply the classifier to the test data with PREDICT_RF_CLASSIFIER:

    => SELECT PREDICT_RF_CLASSIFIER (Sepal_Length, Sepal_Width, Petal_Length, Petal_Width
                                      USING PARAMETERS model_name='rf_iris') FROM iris1;
    
    PREDICT_RF_CLASSIFIER
    -----------------------
    setosa
    setosa
    setosa
    .
    .
    .
    versicolor
    versicolor
    versicolor
    .
    .
    .
    virginica
    virginica
    virginica
    .
    .
    .
    (90 rows)
    
  3. Use PREDICT_RF_CLASSIFIER_CLASSES to view the probability of each class:

    => SELECT PREDICT_RF_CLASSIFIER_CLASSES(Sepal_Length, Sepal_Width, Petal_Length, Petal_Width
                                   USING PARAMETERS model_name='rf_iris') OVER () FROM iris1;
    predicted  |    probability
    -----------+-------------------
    setosa     |                 1
    setosa     |                 1
    setosa     |                 1
    setosa     |                 1
    setosa     |                 1
    setosa     |                 1
    setosa     |                 1
    setosa     |                 1
    setosa     |                 1
    setosa     |                 1
    setosa     |              0.99
    .
    .
    .
    (90 rows)
    

4 - SVM (support vector machine) for classification

Support Vector Machine (SVM) is a classification algorithm that assigns data to one of two categories based on the training data. Vertica implements linear SVM, which is highly scalable.
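
In the linear case, the model is a separating hyperplane. A standard formulation, given here as general background rather than anything Vertica-specific:

    f(x) = \operatorname{sign}(w^{\top} x + b), \qquad
    \min_{w,\,b} \; \frac{1}{2} \lVert w \rVert^{2} + C \sum_{i} \max\bigl(0,\; 1 - y_i (w^{\top} x_i + b)\bigr)

Here C trades margin width against training error; it corresponds to the C parameter visible in the SVM_CLASSIFIER call_string in the example below.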

You can use the following functions to train the SVM model, and use the model to make predictions on a set of test data:

  • SVM_CLASSIFIER
  • GET_MODEL_SUMMARY
  • PREDICT_SVM_CLASSIFIER

You can also use evaluation functions, such as CONFUSION_MATRIX, to gain further insights into the model's performance.

For a complete example of how to use the SVM algorithm in Vertica, see Classifying data using SVM (support vector machine).

The implementation of the SVM algorithm in Vertica is based on the paper Distributed Newton Methods for Regularized Logistic Regression.

4.1 - Classifying data using SVM (support vector machine)

This SVM example uses a small data set named mtcars. The example shows how you can use the SVM_CLASSIFIER function to train a model that predicts the value of am (the transmission type, where 0 = automatic and 1 = manual), and then use the PREDICT_SVM_CLASSIFIER function to make predictions with that model.

Before you begin the example, load the Machine Learning sample data.
  1. Create the SVM model, named svm_class, using the mtcars_train training data.

    
    => SELECT SVM_CLASSIFIER('svm_class', 'mtcars_train', 'am', 'cyl, mpg, wt, hp, gear'
                              USING PARAMETERS exclude_columns='gear');
    SVM_CLASSIFIER
    ----------------------------------------------------------------
    Finished in 12 iterations.
    Accepted Rows: 20  Rejected Rows: 0
    (1 row)
    
  2. View the summary output of svm_class.

    
    => SELECT GET_MODEL_SUMMARY(USING PARAMETERS model_name='svm_class');
    ------------------------------------------------------------------------
    
    =======
    details
    =======
    predictor|coefficient
    ---------+-----------
    Intercept| -0.02006
    cyl      |  0.15367
    mpg      |  0.15698
    wt       | -1.78157
    hp       |  0.00957
    
    ===========
    call_string
    ===========
    SELECT svm_classifier('public.svm_class', 'mtcars_train', '"am"', 'cyl, mpg, wt, hp, gear'
    USING PARAMETERS exclude_columns='gear', C=1, max_iterations=100, epsilon=0.001);
    
    ===============
    Additional Info
    ===============
    Name              |Value
    ------------------+-----
    accepted_row_count| 20
    rejected_row_count|  0
    iteration_count   | 12
    (1 row)
    
  3. Create a new table, named svm_mtcars_predict. Populate this table with the prediction outputs you obtain from running the PREDICT_SVM_CLASSIFIER function on your test data.

    => CREATE TABLE svm_mtcars_predict AS
       (SELECT car_model, am, PREDICT_SVM_CLASSIFIER(cyl, mpg, wt, hp
                                                USING PARAMETERS model_name='svm_class')
                                                AS Prediction FROM mtcars_test);
    CREATE TABLE
    
  4. View the results in the svm_mtcars_predict table.

    => SELECT * FROM svm_mtcars_predict;
    car_model     | am | Prediction
    --------------+----+------------
    Toyota Corona |  0 |          1
    Camaro Z28    |  0 |          0
    Datsun 710    |  1 |          1
    Valiant       |  0 |          0
    Volvo 142E    |  1 |          1
    AMC Javelin   |  0 |          0
    Honda Civic   |  1 |          1
    Hornet 4 Drive|  0 |          0
    Maserati Bora |  1 |          1
    Merc 280      |  0 |          0
    Merc 450SL    |  0 |          0
    Porsche 914-2 |  1 |          1
    (12 rows)
    
  5. Evaluate the accuracy of the PREDICT_SVM_CLASSIFIER function, using the CONFUSION_MATRIX evaluation function.

    => SELECT CONFUSION_MATRIX(obs::int, pred::int USING PARAMETERS num_classes=2) OVER()
            FROM (SELECT am AS obs, Prediction AS pred FROM svm_mtcars_predict) AS prediction_output;
     class | 0 | 1 |                   comment
    -------+---+---+---------------------------------------------
         0 | 6 | 1 |
         1 | 0 | 5 | Of 12 rows, 12 were used and 0 were ignored
    (2 rows)
    

    In this case, PREDICT_SVM_CLASSIFIER correctly predicted the value 1 for all five cars with a value of 1 in the am column; none were misclassified as 0. Of the seven cars with a value of 0 in the am column, six were correctly predicted to have the value 0, and one was incorrectly classified as 1.

5 - XGBoost for classification

XGBoost (eXtreme Gradient Boosting) is a popular supervised-learning algorithm used for regression and classification on large datasets. It uses sequentially built shallow decision trees to provide accurate results and a highly scalable training method that avoids overfitting.
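
Conceptually, the ensemble's prediction is a sum of trees, and each new tree is fit to reduce the remaining error of the trees before it. A standard formulation, given here as general background rather than anything Vertica-specific:

    \hat{y}_i = \sum_{t=1}^{T} f_t(x_i), \qquad
    \mathcal{L} = \sum_{i} l(y_i, \hat{y}_i) + \sum_{t=1}^{T} \Omega(f_t)

Each f_t is a shallow decision tree and \Omega penalizes tree complexity; parameters such as learning_rate and weight_reg in the example below scale each tree's contribution and strengthen the regularization term.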

The following XGBoost functions create and perform predictions with a classification model:

  • XGB_CLASSIFIER
  • PREDICT_XGB_CLASSIFIER
  • PREDICT_XGB_CLASSIFIER_CLASSES

Example

This example uses the "iris" dataset, which contains measurements for various parts of a flower that can be used to predict its species, and creates an XGBoost classifier model to classify the species of each flower.

Before you begin the example, load the Machine Learning sample data.
  1. Use XGB_CLASSIFIER to create the XGBoost classifier model xgb_iris using the iris dataset:

    => SELECT XGB_CLASSIFIER ('xgb_iris', 'iris', 'Species', 'Sepal_Length, Sepal_Width, Petal_Length, Petal_Width'
        USING PARAMETERS max_ntree=10, max_depth=5, weight_reg=0.1, learning_rate=1);
     XGB_CLASSIFIER
    ----------------
     Finished
    (1 row)
    

    You can then view a summary of the model with GET_MODEL_SUMMARY:

    
    => SELECT GET_MODEL_SUMMARY(USING PARAMETERS model_name='xgb_iris');
                     GET_MODEL_SUMMARY
    ------------------------------------------------------
    ===========
    call_string
    ===========
    xgb_classifier('public.xgb_iris', 'iris', '"species"', 'Sepal_Length, Sepal_Width, Petal_Length, Petal_Width'
    USING PARAMETERS exclude_columns='', max_ntree=10, max_depth=5, nbins=32, objective=crossentropy,
    split_proposal_method=global, epsilon=0.001, learning_rate=1, min_split_loss=0, weight_reg=0.1, sampling_size=1)
    
    =======
    details
    =======
     predictor  |      type
    ------------+----------------
    sepal_length|float or numeric
    sepal_width |float or numeric
    petal_length|float or numeric
    petal_width |float or numeric
    
    
    ===============
    Additional Info
    ===============
           Name       |Value
    ------------------+-----
        tree_count    |  10
    rejected_row_count|  0
    accepted_row_count| 150
    
    (1 row)
    
  2. Use PREDICT_XGB_CLASSIFIER to apply the classifier to the test data:

    => SELECT PREDICT_XGB_CLASSIFIER (Sepal_Length, Sepal_Width, Petal_Length, Petal_Width
        USING PARAMETERS model_name='xgb_iris') FROM iris1;
     PREDICT_XGB_CLASSIFIER
    ------------------------
     setosa
     setosa
     setosa
     .
     .
     .
     versicolor
     versicolor
     versicolor
     .
     .
     .
     virginica
     virginica
     virginica
     .
     .
     .
    
    (90 rows)
    
  3. Use PREDICT_XGB_CLASSIFIER_CLASSES to view the probability of each class:

    => SELECT PREDICT_XGB_CLASSIFIER_CLASSES(Sepal_Length, Sepal_Width, Petal_Length, Petal_Width
        USING PARAMETERS model_name='xgb_iris') OVER (PARTITION BEST) FROM iris1;
     predicted  |    probability
    ------------+-------------------
     setosa     |   0.9999650465368
     setosa     |   0.9999650465368
     setosa     |   0.9999650465368
     setosa     |   0.9999650465368
     setosa     | 0.999911552783011
     setosa     |   0.9999650465368
     setosa     |   0.9999650465368
     setosa     |   0.9999650465368
     setosa     |   0.9999650465368
     setosa     |   0.9999650465368
     setosa     |   0.9999650465368
     setosa     |   0.9999650465368
     versicolor |  0.99991871763563
     .
     .
     .
    (90 rows)