This is the multipage printable view of this section.
Click here to print.
Return to the regular view of this page.
Classification algorithms
Classification is an important and popular machine learning tool that assigns items in a data set to different categories.
Classification is an important and popular machine learning tool that assigns items in a data set to different categories. Classification is used to predict risk over time, in fraud detection, text categorization, and more. Classification functions begin with a data set where the different categories are known. For example, suppose you want to classify students based on how likely they are to get into graduate school. In addition to factors like admission score exams and grades, you could also track work experience.
Binary classification means the outcome, in this case, admission, only has two possible values: admit or do not admit. Multiclass outcomes have more than two values. For example, low, medium, or high chance of admission. During the training process, classification algorithms find the relationship between the outcome and the features. This relationship is summarized in the model, which can then be applied to different data sets, where the categories are unknown.
1  Logistic regression
Using logistic regression, you can model the relationship between independent variables, or features, and some dependent variable, or outcome.
Using logistic regression, you can model the relationship between independent variables, or features, and some dependent variable, or outcome. The outcome of logistic regression is always a binary value.
You can build logistic regression models to:

Fit a predictive model to a training data set of independent variables and some binary dependent variable. Doing so allows you to make predictions on outcomes, such as whether a piece of email is spam mail or not.

Determine the strength of the relationship between an independent variable and some binary outcome variable. For example, suppose you want to determine whether an email is spam or not. You can build a logistic regression model, based on observations of the properties of email messages. Then, you can determine the importance of various properties of an email message on that outcome.
You can use the following functions to build a logistic regression model, view the model, and use the model to make predictions on a set of test data:
For a complete programming example of how to use logistic regression on a table in Vertica, see Building a logistic regression model.
1.1  Building a logistic regression model
This logistic regression example uses a small data set named mtcars.
This logistic regression example uses a small data set named mtcars. The example shows how to build a model that predicts the value of am
, which indicates whether the car has an automatic or a manual transmission. It uses the given values of all the other features in the data set.
In this example, roughly 60% of the data is used as training data to create a model. The remaining 40% is used as testing data against which you can test your logistic regression model.
Before you begin the example,
load the Machine Learning sample data.

Create the logistic regression model, named logistic_reg_mtcars
, using the mtcars_train
training data.
=> SELECT LOGISTIC_REG('logistic_reg_mtcars', 'mtcars_train', 'am', 'cyl, wt'
USING PARAMETERS exclude_columns='hp');
LOGISTIC_REG

Finished in 15 iterations
(1 row)

View the summary output of logistic_reg_mtcars
.
=> SELECT GET_MODEL_SUMMARY(USING PARAMETERS model_name='logistic_reg_mtcars');

=======
details
=======
predictorcoefficient std_err z_value p_value
++++
Intercept 262.39898 44745.77338 0.00586 0.99532
cyl  16.75892 5987.23236  0.00280 0.99777
wt 119.92116 17237.031540.00696 0.99445
==============
regularization
==============
type lambda
+
none 1.00000
===========
call_string
===========
logistic_reg('public.logistic_reg_mtcars', 'mtcars_train', '"am"', 'cyl, wt'
USING PARAMETERS exclude_columns='hp', optimizer='newton', epsilon=1e06,
max_iterations=100, regularization='none', lambda=1)
===============
Additional Info
===============
Name Value
+
iteration_count  20
rejected_row_count 0
accepted_row_count 20
(1 row)

Create a table named mtcars_predict_results
. Populate this table with the prediction outputs you obtain from running the PREDICT_LOGISTIC_REG
function on your test data. View the results in the mtcars_predict_results
table.
=> CREATE TABLE mtcars_predict_results AS
(SELECT car_model, am, PREDICT_LOGISTIC_REG(cyl, wt
USING PARAMETERS model_name='logistic_reg_mtcars')
AS Prediction FROM mtcars_test);
CREATE TABLE
=> SELECT * FROM mtcars_predict_results;
car_model  am  Prediction
++
AMC Javelin  0  0
Hornet 4 Drive  0  0
Maserati Bora  1  0
Merc 280  0  0
Merc 450SL  0  0
Toyota Corona  0  1
Volvo 142E  1  1
Camaro Z28  0  0
Datsun 710  1  1
Honda Civic  1  1
Porsche 9142  1  1
Valiant  0  0
(12 rows)

Evaluate the accuracy of the PREDICT_LOGISTIC_REG
function, using the CONFUSION_MATRIX evaluation function.
=> SELECT CONFUSION_MATRIX(obs::int, pred::int USING PARAMETERS num_classes=2) OVER()
FROM (SELECT am AS obs, Prediction AS pred FROM mtcars_predict_results) AS prediction_output;
class  0  1  comment
+++
0  6  1 
1  1  4  Of 12 rows, 12 were used and 0 were ignored
(2 rows)
In this case, PREDICT_LOGISTIC_REG
correctly predicted that four out of five cars with a value of 1
in the am
column have a value of 1
. Out of the seven cars which had a value of 0
in the am
column, six were correctly predicted to have the value 0
. One car was incorrectly classified as having the value 1
.
See also
2  Naive bayes
You can use the Naive Bayes algorithm to classify your data when features can be assumed independent.
You can use the Naive Bayes algorithm to classify your data when features can be assumed independent. The algorithm uses independent features to calculate the probability of a specific class. For example, you might want to predict the probability that an email is spam. In that case, you would use a corpus of words associated with spam to calculate the probability the email's content is spam.
You can use the following functions to build a Naive Bayes model, view the model, and use the model to make predictions on a set of test data:
For a complete example of how to use the Naive Bayes algorithm in Vertica, see Classifying data using naive bayes.
2.1  Classifying data using naive bayes
This Naive Bayes example uses the HouseVotes84 data set to show you how to build a model.
This Naive Bayes example uses the HouseVotes84 data set to show you how to build a model. With this model, you can predict which party the member of the United States Congress is affiliated based on their voting record. To aid in classifying the data it has been cleaned, and any missed votes have been replaced. The cleaned data replaces missed votes with the voter's party majority vote. For example, suppose a member of the Democrats had a missing value for vote1 and majority of the Democrats voted in favor. This example replaces all missing Democrats' votes for vote1 with a vote in favor.
In this example, approximately 75% of the cleaned HouseVotes84 data is randomly selected and copied to a training table. The remaining cleaned HouseVotes84 data is used as a testing table.
Before you begin the example,
load the Machine Learning sample data.
You must also load the naive_bayes_data_prepration.sql
script:
$ /opt/vertica/bin/vsql d <name of your database> f naive_bayes_data_preparation.sql

Create the Naive Bayes model, named naive_house84_model
, using the house84_train
training data.
=> SELECT NAIVE_BAYES('naive_house84_model', 'house84_train', 'party',
'*' USING PARAMETERS exclude_columns='party, id');
NAIVE_BAYES

Finished. Accepted Rows: 315 Rejected Rows: 0
(1 row)

Create a new table, named predicted_party_naive
. Populate this table with the prediction outputs you obtain from the PREDICT_NAIVE_BAYES function on your test data.
=> CREATE TABLE predicted_party_naive
AS SELECT party,
PREDICT_NAIVE_BAYES (vote1, vote2, vote3, vote4, vote5,
vote6, vote7, vote8, vote9, vote10,
vote11, vote12, vote13, vote14,
vote15, vote16
USING PARAMETERS model_name = 'naive_house84_model',
type = 'response') AS Predicted_Party
FROM house84_test;
CREATE TABLE

Calculate the accuracy of the model's predictions.
=> SELECT (Predictions.Num_Correct_Predictions / Count.Total_Count) AS Percent_Accuracy
FROM ( SELECT COUNT(Predicted_Party) AS Num_Correct_Predictions
FROM predicted_party_naive
WHERE party = Predicted_Party
) AS Predictions,
( SELECT COUNT(party) AS Total_Count
FROM predicted_party_naive
) AS Count;
Percent_Accuracy

0.933333333333333333
(1 row)
The model correctly predicted the party of the members of Congress based on their voting patterns with 93% accuracy.
Viewing the probability of each class
You can also view the probability of each class. Use PREDICT_NAIVE_BAYES_CLASSES to see the probability of each class.
=> SELECT PREDICT_NAIVE_BAYES_CLASSES (id, vote1, vote2, vote3, vote4, vote5,
vote6, vote7, vote8, vote9, vote10,
vote11, vote12, vote13, vote14,
vote15, vote16
USING PARAMETERS model_name = 'naive_house84_model',
key_columns = 'id', exclude_columns = 'id',
classes = 'democrat, republican')
OVER() FROM house84_test;
id  Predicted  Probability  democrat  republican
++++
368  democrat  1  1  0
372  democrat  1  1  0
374  democrat  1  1  0
378  republican  0.999999962214987  3.77850125111219e08  0.999999962214987
384  democrat  1  1  0
387  democrat  1  1  0
406  republican  0.999999945980143  5.40198564592332e08  0.999999945980143
419  democrat  1  1  0
421  republican  0.922808855631005  0.0771911443689949  0.922808855631005
.
.
.
(109 rows)
See also
3  Random forest for classification
The Random Forest algorithm creates an ensemble model of decision trees.
The Random Forest algorithm creates an ensemble model of decision trees. Each tree is trained on a randomly selected subset of the training data.
You can use the following functions to train the Random Forest model, and use the model to make predictions on a set of test data:
For a complete example of how to use the Random Forest algorithm in Vertica, see Classifying data using random forest.
3.1  Classifying data using random forest
This random forest example uses a data set named iris.
This random forest example uses a data set named iris. The example contains four variables that measure various parts of the iris flower to predict its species.
Before you begin the example, make sure that you have followed the steps in Download the machine learning example data.

Use
RF_CLASSIFIER
to create the random forest model, named rf_iris
, using the iris
data. View the summary output of the model with
GET_MODEL_SUMMARY
:
=> SELECT RF_CLASSIFIER ('rf_iris', 'iris', 'Species', 'Sepal_Length, Sepal_Width, Petal_Length, Petal_Width'
USING PARAMETERS ntree=100, sampling_size=0.5);
RF_CLASSIFIER

Finished training
(1 row)
=> SELECT GET_MODEL_SUMMARY(USING PARAMETERS model_name='rf_iris');

===========
call_string
===========
SELECT rf_classifier('public.rf_iris', 'iris', '"species"', 'Sepal_Length, Sepal_Width, Petal_Length,
Petal_Width' USING PARAMETERS exclude_columns='', ntree=100, mtry=2, sampling_size=0.5, max_depth=5,
max_breadth=32, min_leaf_size=1, min_info_gain=0, nbins=32);
=======
details
=======
predictor type
+
sepal_lengthfloat
sepal_width float
petal_lengthfloat
petal_width float
===============
Additional Info
===============
Name Value
+
tree_count  100
rejected_row_count 0
accepted_row_count 150
(1 row)

Apply the classifier to the test data with
PREDICT_RF_CLASSIFIER
:
=> SELECT PREDICT_RF_CLASSIFIER (Sepal_Length, Sepal_Width, Petal_Length, Petal_Width
USING PARAMETERS model_name='rf_iris') FROM iris1;
PREDICT_RF_CLASSIFIER

setosa
setosa
setosa
.
.
.
versicolor
versicolor
versicolor
.
.
.
virginica
virginica
virginica
.
.
.
(90 rows)

Use
PREDICT_RF_CLASSIFIER_CLASSES
to view the probability of each class:
=> SELECT PREDICT_RF_CLASSIFIER_CLASSES(Sepal_Length, Sepal_Width, Petal_Length, Petal_Width
USING PARAMETERS model_name='rf_iris') OVER () FROM iris1;
predicted  probability
+
setosa  1
setosa  1
setosa  1
setosa  1
setosa  1
setosa  1
setosa  1
setosa  1
setosa  1
setosa  1
setosa  0.99
.
.
.
(90 rows)
4  SVM (support vector machine) for classification
Support Vector Machine (SVM) is a classification algorithm that assigns data to one category or the other based on the training data.
Support Vector Machine (SVM) is a classification algorithm that assigns data to one category or the other based on the training data. This algorithm implements linear SVM, which is highly scalable.
You can use the following functions to train the SVM model, and use the model to make predictions on a set of test data:
You can also use the following evaluation functions to gain further insights:
For a complete example of how to use the SVM algorithm in Vertica, see Classifying data using SVM (support vector machine).
The implementation of the SVM algorithm in Vertica is based on the paper Distributed Newton Methods for Regularized Logistic Regression.
4.1  Classifying data using SVM (support vector machine)
This SVM example uses a small data set named mtcars.
This SVM example uses a small data set named mtcars. The example shows how you can use the SVM_CLASSIFIER function to train the model to predict the value of am
(the transmission type, where 0 = automatic and 1 = manual) using the PREDICT_SVM_CLASSIFIER function.
Before you begin the example,
load the Machine Learning sample data.

Create the SVM model, named svm_class
, using the mtcars_train
training data.
=> SELECT SVM_CLASSIFIER('svm_class', 'mtcars_train', 'am', 'cyl, mpg, wt, hp, gear'
USING PARAMETERS exclude_columns='gear');
SVM_CLASSIFIER

Finished in 12 iterations.
Accepted Rows: 20 Rejected Rows: 0
(1 row)

View the summary output of
`svm_class`
.
=> SELECT GET_MODEL_SUMMARY(USING PARAMETERS model_name='svm_class');

=======
details
=======
predictorcoefficient
+
Intercept 0.02006
cyl  0.15367
mpg  0.15698
wt  1.78157
hp  0.00957
===========
call_string
===========
SELECT svm_classifier('public.svm_class', 'mtcars_train', '"am"', 'cyl, mpg, wt, hp, gear'
USING PARAMETERS exclude_columns='gear', C=1, max_iterations=100, epsilon=0.001);
===============
Additional Info
===============
Name Value
+
accepted_row_count 20
rejected_row_count 0
iteration_count  12
(1 row)

Create a new table, named svm_mtcars_predict
. Populate this table with the prediction outputs you obtain from running the PREDICT_SVM_CLASSIFIER
function on your test data.
=> CREATE TABLE svm_mtcars_predict AS
(SELECT car_model, am, PREDICT_SVM_CLASSIFIER(cyl, mpg, wt, hp
USING PARAMETERS model_name='svm_class')
AS Prediction FROM mtcars_test);
CREATE TABLE

View the results in the svm_mtcars_predict
table.
=> SELECT * FROM svm_mtcars_predict;
car_model  am  Prediction
 ++
Toyota Corona  0  1
Camaro Z28  0  0
Datsun 710  1  1
Valiant  0  0
Volvo 142E  1  1
AMC Javelin  0  0
Honda Civic  1  1
Hornet 4 Drive 0  0
Maserati Bora  1  1
Merc 280  0  0
Merc 450SL  0  0
Porsche 9142  1  1
(12 rows)

Evaluate the accuracy of the PREDICT_SVM_CLASSIFIER
function, using the CONFUSION_MATRIX evaluation function.
=> SELECT CONFUSION_MATRIX(obs::int, pred::int USING PARAMETERS num_classes=2) OVER()
FROM (SELECT am AS obs, Prediction AS pred FROM svm_mtcars_predict) AS prediction_output;
class  0  1  comment
+++
0  6  1 
1  0  5  Of 12 rows, 12 were used and 0 were ignored
(2 rows)
In this case, PREDICT_SVM_CLASSIFIER
correctly predicted that the cars with a value of 1
in the am
column have a value of 1
. No cars were incorrectly classified. Out of the seven cars which had a value of 0
in the am
column, six were correctly predicted to have the value 0
. One car was incorrectly classified as having the value 1
.
See also
5  XGBoost for classification
The following XGBoost functions create and perform predictions with a classification model:.
XGBoost (eXtreme Gradient Boosting) is a popular supervisedlearning algorithm used for regression and classification on large datasets. It uses sequentiallybuilt shallow decision trees to provide accurate results and a highlyscalable training method that avoids overfitting.
The following XGBoost functions create and perform predictions with a classification model:
Example
This example uses the "iris" dataset, which contains measurements for various parts of a flower, and can be used to predict its species and creates an XGBoost classifier model to classify the species of each flower.
Before you begin the example,
load the Machine Learning sample data.

Use
XGB_CLASSIFIER
to create the XGBoost classifier model xgb_iris
using the iris
dataset:
=> SELECT XGB_CLASSIFIER ('xgb_iris', 'iris', 'Species', 'Sepal_Length, Sepal_Width, Petal_Length, Petal_Width'
USING PARAMETERS max_ntree=10, max_depth=5, weight_reg=0.1, learning_rate=1);
XGB_CLASSIFIER

Finished
(1 row)
You can then view a summary of the model with
GET_MODEL_SUMMARY
:
=> SELECT GET_MODEL_SUMMARY(USING PARAMETERS model_name='xgb_iris');
GET_MODEL_SUMMARY

===========
call_string
===========
xgb_classifier('public.xgb_iris', 'iris', '"species"', 'Sepal_Length, Sepal_Width, Petal_Length, Petal_Width'
USING PARAMETERS exclude_columns='', max_ntree=10, max_depth=5, nbins=32, objective=crossentropy,
split_proposal_method=global, epsilon=0.001, learning_rate=1, min_split_loss=0, weight_reg=0.1, sampling_size=1)
=======
details
=======
predictor  type
+
sepal_lengthfloat or numeric
sepal_width float or numeric
petal_lengthfloat or numeric
petal_width float or numeric
===============
Additional Info
===============
Name Value
+
tree_count  10
rejected_row_count 0
accepted_row_count 150
(1 row)

Use
PREDICT_XGB_CLASSIFIER
to apply the classifier to the test data:
=> SELECT PREDICT_XGB_CLASSIFIER (Sepal_Length, Sepal_Width, Petal_Length, Petal_Width
USING PARAMETERS model_name='xgb_iris') FROM iris1;
PREDICT_XGB_CLASSIFIER

setosa
setosa
setosa
.
.
.
versicolor
versicolor
versicolor
.
.
.
virginica
virginica
virginica
.
.
.
(90 rows)

Use
PREDICT_XGB_CLASSIFIER_CLASSES
to view the probability of each class:
=> SELECT PREDICT_XGB_CLASSIFIER_CLASSES(Sepal_Length, Sepal_Width, Petal_Length, Petal_Width
USING PARAMETERS model_name='xgb_iris') OVER (PARTITION BEST) FROM iris1;
predicted  probability
+
setosa  0.9999650465368
setosa  0.9999650465368
setosa  0.9999650465368
setosa  0.9999650465368
setosa  0.999911552783011
setosa  0.9999650465368
setosa  0.9999650465368
setosa  0.9999650465368
setosa  0.9999650465368
setosa  0.9999650465368
setosa  0.9999650465368
setosa  0.9999650465368
versicolor  0.99991871763563
.
.
.
(90 rows)