<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>OpenText Analytics Database 26.2.x – Classification algorithms</title>
    <link>/en/data-analysis/ml-predictive-analytics/classification-algorithms/</link>
    <description>Recent content in Classification algorithms on OpenText Analytics Database 26.2.x</description>
    <generator>Hugo -- gohugo.io</generator>
    
	  <atom:link href="/en/data-analysis/ml-predictive-analytics/classification-algorithms/index.xml" rel="self" type="application/rss+xml" />
    
    
      
        
      
    
    
    <item>
      <title>Data-Analysis: Logistic regression</title>
      <link>/en/data-analysis/ml-predictive-analytics/classification-algorithms/logistic-regression/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      
      <guid>/en/data-analysis/ml-predictive-analytics/classification-algorithms/logistic-regression/</guid>
      <description>
        
        
        &lt;p&gt;Using logistic regression, you can model the relationship between independent variables, or features, and a dependent variable, or outcome. In logistic regression, the outcome is always a binary value.&lt;/p&gt;
&lt;p&gt;You can build logistic regression models to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Fit a predictive model to a training data set of independent variables and some binary dependent variable. Doing so allows you to make predictions on outcomes, such as whether an email message is spam.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Determine the strength of the relationship between an independent variable and some binary outcome variable. For example, suppose you want to determine whether an email message is spam. You can build a logistic regression model based on observed properties of email messages, and then measure how much each property contributes to that outcome.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;You can use the following functions to build a logistic regression model, view the model, and use the model to make predictions on a set of test data:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href=&#34;../../../../en/sql-reference/functions/ml-functions/ml-algorithms/logistic-reg/#&#34;&gt;LOGISTIC_REG&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href=&#34;../../../../en/sql-reference/functions/ml-functions/transformation-functions/predict-logistic-reg/#&#34;&gt;PREDICT_LOGISTIC_REG&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href=&#34;../../../../en/sql-reference/functions/ml-functions/model-management/get-model-summary/#&#34;&gt;GET_MODEL_SUMMARY&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For a complete programming example of how to use logistic regression on a database table, see &lt;a href=&#34;../../../../en/data-analysis/ml-predictive-analytics/classification-algorithms/logistic-regression/building-logistic-regression-model/#&#34;&gt;Building a logistic regression model&lt;/a&gt;.&lt;/p&gt;
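&lt;p&gt;As a minimal sketch of the call pattern, the following assumes a hypothetical training table &lt;code&gt;email_training&lt;/code&gt; with numeric feature columns and a binary response column &lt;code&gt;is_spam&lt;/code&gt;, plus a matching test table &lt;code&gt;email_test&lt;/code&gt;; see the function reference pages above for the full parameter lists:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;--Train a model on the hypothetical email_training table:
=&amp;gt; SELECT LOGISTIC_REG (&amp;#39;logistic_spam&amp;#39;, &amp;#39;email_training&amp;#39;, &amp;#39;is_spam&amp;#39;,
    &amp;#39;link_count, caps_ratio&amp;#39; USING PARAMETERS max_iterations=100);

--Apply the model to the hypothetical email_test table:
=&amp;gt; SELECT PREDICT_LOGISTIC_REG (link_count, caps_ratio
    USING PARAMETERS model_name=&amp;#39;logistic_spam&amp;#39;, type=&amp;#39;response&amp;#39;) FROM email_test;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Setting &lt;code&gt;type=&amp;#39;probability&amp;#39;&lt;/code&gt; instead of &lt;code&gt;type=&amp;#39;response&amp;#39;&lt;/code&gt; returns the predicted probability rather than the predicted class.&lt;/p&gt;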

      </description>
    </item>
    
    <item>
      <title>Data-Analysis: Naive bayes</title>
      <link>/en/data-analysis/ml-predictive-analytics/classification-algorithms/naive-bayes/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      
      <guid>/en/data-analysis/ml-predictive-analytics/classification-algorithms/naive-bayes/</guid>
      <description>
        
        
        &lt;p&gt;You can use the Naive Bayes algorithm to classify your data when the features can be assumed to be independent of one another. The algorithm uses these independent features to calculate the probability of a specific class. For example, to predict the probability that an email message is spam, you would use a corpus of words associated with spam to calculate the probability that the email&#39;s content is spam.&lt;/p&gt;
&lt;p&gt;You can use the following functions to build a Naive Bayes model, view the model, and use the model to make predictions on a set of test data:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href=&#34;../../../../en/sql-reference/functions/ml-functions/ml-algorithms/naive-bayes/#&#34;&gt;NAIVE_BAYES&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href=&#34;../../../../en/sql-reference/functions/ml-functions/transformation-functions/predict-naive-bayes/#&#34;&gt;PREDICT_NAIVE_BAYES&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href=&#34;../../../../en/sql-reference/functions/ml-functions/transformation-functions/predict-naive-bayes-classes/#&#34;&gt;PREDICT_NAIVE_BAYES_CLASSES&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href=&#34;../../../../en/sql-reference/functions/ml-functions/model-management/get-model-summary/#&#34;&gt;GET_MODEL_SUMMARY&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For a complete example of how to use the Naive Bayes algorithm, see &lt;a href=&#34;../../../../en/data-analysis/ml-predictive-analytics/classification-algorithms/naive-bayes/classifying-data-using-naive-bayes/#&#34;&gt;Classifying data using naive bayes&lt;/a&gt;.&lt;/p&gt;
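&lt;p&gt;As a minimal sketch, the following assumes a hypothetical training table &lt;code&gt;email_training&lt;/code&gt; with word-count feature columns and a categorical response column &lt;code&gt;label&lt;/code&gt;, plus a matching test table &lt;code&gt;email_test&lt;/code&gt;; see the function reference pages above for the full parameter lists:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;--Train a model on the hypothetical email_training table:
=&amp;gt; SELECT NAIVE_BAYES (&amp;#39;nb_spam&amp;#39;, &amp;#39;email_training&amp;#39;, &amp;#39;label&amp;#39;,
    &amp;#39;word_count_free, word_count_offer&amp;#39; USING PARAMETERS alpha=1.0);

--Apply the model to the hypothetical email_test table:
=&amp;gt; SELECT PREDICT_NAIVE_BAYES (word_count_free, word_count_offer
    USING PARAMETERS model_name=&amp;#39;nb_spam&amp;#39;, type=&amp;#39;response&amp;#39;) FROM email_test;
&lt;/code&gt;&lt;/pre&gt;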

      </description>
    </item>
    
    <item>
      <title>Data-Analysis: Random forest for classification</title>
      <link>/en/data-analysis/ml-predictive-analytics/classification-algorithms/random-forest-classification/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      
      <guid>/en/data-analysis/ml-predictive-analytics/classification-algorithms/random-forest-classification/</guid>
      <description>
        
        
        &lt;p&gt;The Random Forest algorithm creates an ensemble model of decision trees. Each tree is trained on a randomly selected subset of the training data.&lt;/p&gt;
&lt;p&gt;You can use the following functions to train a Random Forest model and use it to make predictions on a set of test data:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;

&lt;code&gt;&lt;a href=&#34;../../../../en/sql-reference/functions/ml-functions/ml-algorithms/rf-classifier/#&#34;&gt;RF_CLASSIFIER&lt;/a&gt;&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;

&lt;code&gt;&lt;a href=&#34;../../../../en/sql-reference/functions/ml-functions/transformation-functions/predict-rf-classifier/#&#34;&gt;PREDICT_RF_CLASSIFIER&lt;/a&gt;&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;

&lt;code&gt;&lt;a href=&#34;../../../../en/sql-reference/functions/ml-functions/transformation-functions/predict-rf-classifier-classes/#&#34;&gt;PREDICT_RF_CLASSIFIER_CLASSES&lt;/a&gt;&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;

&lt;code&gt;&lt;a href=&#34;../../../../en/sql-reference/functions/ml-functions/model-management/get-model-summary/#&#34;&gt;GET_MODEL_SUMMARY&lt;/a&gt;&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For a complete example of how to use the Random Forest algorithm, see &lt;a href=&#34;../../../../en/data-analysis/ml-predictive-analytics/classification-algorithms/random-forest-classification/classifying-data-using-random-forest/#&#34;&gt;Classifying data using random forest&lt;/a&gt;.&lt;/p&gt;
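&lt;p&gt;As a minimal sketch of the call pattern, the following assumes the &lt;code&gt;iris&lt;/code&gt; and &lt;code&gt;iris1&lt;/code&gt; tables from the Machine Learning sample data; the parameter values shown are illustrative, and the full parameter lists are on the function reference pages above:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;--Train an ensemble of 100 trees on the iris table:
=&amp;gt; SELECT RF_CLASSIFIER (&amp;#39;rf_iris&amp;#39;, &amp;#39;iris&amp;#39;, &amp;#39;Species&amp;#39;,
    &amp;#39;Sepal_Length, Sepal_Width, Petal_Length, Petal_Width&amp;#39;
    USING PARAMETERS ntree=100, max_depth=5);

--Apply the model to the iris1 test table:
=&amp;gt; SELECT PREDICT_RF_CLASSIFIER (Sepal_Length, Sepal_Width, Petal_Length, Petal_Width
    USING PARAMETERS model_name=&amp;#39;rf_iris&amp;#39;) FROM iris1;
&lt;/code&gt;&lt;/pre&gt;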

      </description>
    </item>
    
    <item>
      <title>Data-Analysis: SVM (support vector machine) for classification</title>
      <link>/en/data-analysis/ml-predictive-analytics/classification-algorithms/svm-support-vector-machine-classification/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      
      <guid>/en/data-analysis/ml-predictive-analytics/classification-algorithms/svm-support-vector-machine-classification/</guid>
      <description>
        
        
        &lt;p&gt;Support Vector Machine (SVM) is a classification algorithm that assigns each data point to one of two categories, based on the training data. This implementation is linear SVM, which is highly scalable.&lt;/p&gt;
&lt;p&gt;You can use the following functions to train an SVM model and use it to make predictions on a set of test data:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href=&#34;../../../../en/sql-reference/functions/ml-functions/ml-algorithms/svm-classifier/#&#34;&gt;SVM_CLASSIFIER&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href=&#34;../../../../en/sql-reference/functions/ml-functions/transformation-functions/predict-svm-classifier/#&#34;&gt;PREDICT_SVM_CLASSIFIER&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href=&#34;../../../../en/sql-reference/functions/ml-functions/model-management/get-model-summary/#&#34;&gt;GET_MODEL_SUMMARY&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;You can also use the following evaluation functions to gain further insights:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href=&#34;../../../../en/sql-reference/functions/ml-functions/model-evaluation/confusion-matrix/#&#34;&gt;CONFUSION_MATRIX&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href=&#34;../../../../en/sql-reference/functions/ml-functions/data-preparation/detect-outliers/#&#34;&gt;DETECT_OUTLIERS&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href=&#34;../../../../en/sql-reference/functions/ml-functions/model-evaluation/error-rate/#&#34;&gt;ERROR_RATE&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href=&#34;../../../../en/sql-reference/functions/ml-functions/model-evaluation/roc/#&#34;&gt;ROC&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For a complete example of how to use the SVM algorithm, see &lt;a href=&#34;../../../../en/data-analysis/ml-predictive-analytics/classification-algorithms/svm-support-vector-machine-classification/classifying-data-using-svm-support-vector-machine/#&#34;&gt;Classifying data using SVM (support vector machine)&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The implementation of the SVM algorithm is based on the paper &lt;a href=&#34;https://link.springer.com/chapter/10.1007/978-3-319-18032-8_54&#34;&gt;Distributed Newton Methods for Regularized Logistic Regression&lt;/a&gt;.&lt;/p&gt;
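&lt;p&gt;As a minimal sketch, the following assumes a hypothetical training table &lt;code&gt;purchase_training&lt;/code&gt; with numeric feature columns and a binary (0/1) response column &lt;code&gt;will_buy&lt;/code&gt;, plus a matching test table &lt;code&gt;purchase_test&lt;/code&gt;; see the function reference pages above for the full parameter lists:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;--Train a model on the hypothetical purchase_training table:
=&amp;gt; SELECT SVM_CLASSIFIER (&amp;#39;svm_buyers&amp;#39;, &amp;#39;purchase_training&amp;#39;, &amp;#39;will_buy&amp;#39;,
    &amp;#39;visits, spend&amp;#39; USING PARAMETERS C=1.0);

--Apply the model to the hypothetical purchase_test table:
=&amp;gt; SELECT PREDICT_SVM_CLASSIFIER (visits, spend
    USING PARAMETERS model_name=&amp;#39;svm_buyers&amp;#39;) FROM purchase_test;

--Evaluate the predictions with CONFUSION_MATRIX:
=&amp;gt; SELECT CONFUSION_MATRIX (obs::int, pred::int USING PARAMETERS num_classes=2) OVER ()
    FROM (SELECT will_buy AS obs, PREDICT_SVM_CLASSIFIER (visits, spend
        USING PARAMETERS model_name=&amp;#39;svm_buyers&amp;#39;) AS pred FROM purchase_test) AS t;
&lt;/code&gt;&lt;/pre&gt;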

      </description>
    </item>
    
    <item>
      <title>Data-Analysis: XGBoost for classification</title>
      <link>/en/data-analysis/ml-predictive-analytics/classification-algorithms/xgboost-classification/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      
      <guid>/en/data-analysis/ml-predictive-analytics/classification-algorithms/xgboost-classification/</guid>
      <description>
        
        
        &lt;p&gt;&lt;a href=&#34;https://machinelearningmastery.com/gentle-introduction-xgboost-applied-machine-learning/&#34;&gt;XGBoost&lt;/a&gt; (eXtreme Gradient Boosting) is a popular supervised-learning algorithm used for regression and classification on large datasets. It uses sequentially built shallow decision trees to provide accurate results and a highly scalable training method that avoids overfitting.&lt;/p&gt;

&lt;p&gt;The following XGBoost functions create a classification model and perform predictions with it:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;

&lt;code&gt;&lt;a href=&#34;../../../../en/sql-reference/functions/ml-functions/ml-algorithms/xgb-classifier/#&#34;&gt;XGB_CLASSIFIER&lt;/a&gt;&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;

&lt;code&gt;&lt;a href=&#34;../../../../en/sql-reference/functions/ml-functions/transformation-functions/predict-xgb-classifier/#&#34;&gt;PREDICT_XGB_CLASSIFIER&lt;/a&gt;&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;

&lt;code&gt;&lt;a href=&#34;../../../../en/sql-reference/functions/ml-functions/transformation-functions/predict-xgb-classifier-classes/#&#34;&gt;PREDICT_XGB_CLASSIFIER_CLASSES&lt;/a&gt;&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;example&#34;&gt;Example&lt;/h2&gt;
&lt;p&gt;This example uses the &amp;quot;iris&amp;quot; dataset, which contains measurements for various parts of a flower that can be used to predict its species, and creates an XGBoost classifier model to classify the species of each flower.&lt;/p&gt;
&lt;p&gt;Before you begin the example, &lt;a href=&#34;../../../../en/data-analysis/ml-predictive-analytics/download-ml-example-data/&#34;&gt;load the Machine Learning sample data&lt;/a&gt;.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Use 
&lt;code&gt;&lt;a href=&#34;../../../../en/sql-reference/functions/ml-functions/ml-algorithms/xgb-classifier/#&#34;&gt;XGB_CLASSIFIER&lt;/a&gt;&lt;/code&gt; to create the XGBoost classifier model &lt;code&gt;xgb_iris&lt;/code&gt; using the &lt;code&gt;iris&lt;/code&gt; dataset:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;=&amp;gt; SELECT XGB_CLASSIFIER (&amp;#39;xgb_iris&amp;#39;, &amp;#39;iris&amp;#39;, &amp;#39;Species&amp;#39;, &amp;#39;Sepal_Length, Sepal_Width, Petal_Length, Petal_Width&amp;#39;
    USING PARAMETERS max_ntree=10, max_depth=5, weight_reg=0.1, learning_rate=1);
 XGB_CLASSIFIER
----------------
 Finished
(1 row)
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;You can then view a summary of the model with 
&lt;code&gt;&lt;a href=&#34;../../../../en/sql-reference/functions/ml-functions/model-management/get-model-summary/#&#34;&gt;GET_MODEL_SUMMARY&lt;/a&gt;&lt;/code&gt;:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;=&amp;gt; SELECT GET_MODEL_SUMMARY(USING PARAMETERS model_name=&amp;#39;xgb_iris&amp;#39;);
                                                                                                                                                                       GET_MODEL_SUMMARY
------------------------------------------------------
===========
call_string
===========
xgb_classifier(&amp;#39;public.xgb_iris&amp;#39;, &amp;#39;iris&amp;#39;, &amp;#39;&amp;#34;species&amp;#34;&amp;#39;, &amp;#39;Sepal_Length, Sepal_Width, Petal_Length, Petal_Width&amp;#39;
USING PARAMETERS exclude_columns=&amp;#39;&amp;#39;, max_ntree=10, max_depth=5, nbins=32, objective=crossentropy,
split_proposal_method=global, epsilon=0.001, learning_rate=1, min_split_loss=0, weight_reg=0.1, sampling_size=1)

=======
details
=======
 predictor  |      type
------------+----------------
sepal_length|float or numeric
sepal_width |float or numeric
petal_length|float or numeric
petal_width |float or numeric


===============
Additional Info
===============
       Name       |Value
------------------+-----
    tree_count    |  10
rejected_row_count|  0
accepted_row_count| 150

(1 row)
&lt;/code&gt;&lt;/pre&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Use 
&lt;code&gt;&lt;a href=&#34;../../../../en/sql-reference/functions/ml-functions/transformation-functions/predict-xgb-classifier/#&#34;&gt;PREDICT_XGB_CLASSIFIER&lt;/a&gt;&lt;/code&gt; to apply the classifier to the test data:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;=&amp;gt; SELECT PREDICT_XGB_CLASSIFIER (Sepal_Length, Sepal_Width, Petal_Length, Petal_Width
    USING PARAMETERS model_name=&amp;#39;xgb_iris&amp;#39;) FROM iris1;
 PREDICT_XGB_CLASSIFIER
------------------------
 setosa
 setosa
 setosa
 .
 .
 .
 versicolor
 versicolor
 versicolor
 .
 .
 .
 virginica
 virginica
 virginica
 .
 .
 .

(90 rows)
&lt;/code&gt;&lt;/pre&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Use 
&lt;code&gt;&lt;a href=&#34;../../../../en/sql-reference/functions/ml-functions/transformation-functions/predict-xgb-classifier-classes/#&#34;&gt;PREDICT_XGB_CLASSIFIER_CLASSES&lt;/a&gt;&lt;/code&gt; to view the probability of each class:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;=&amp;gt; SELECT PREDICT_XGB_CLASSIFIER_CLASSES(Sepal_Length, Sepal_Width, Petal_Length, Petal_Width
    USING PARAMETERS model_name=&amp;#39;xgb_iris&amp;#39;) OVER (PARTITION BEST) FROM iris1;
  predicted  |    probability
------------+-------------------
 setosa     |   0.9999650465368
 setosa     |   0.9999650465368
 setosa     |   0.9999650465368
 setosa     |   0.9999650465368
 setosa     | 0.999911552783011
 setosa     |   0.9999650465368
 setosa     |   0.9999650465368
 setosa     |   0.9999650465368
 setosa     |   0.9999650465368
 setosa     |   0.9999650465368
 setosa     |   0.9999650465368
 setosa     |   0.9999650465368
 versicolor |  0.99991871763563
 .
 .
 .
(90 rows)
&lt;/code&gt;&lt;/pre&gt;&lt;/li&gt;
&lt;/ol&gt;

      </description>
    </item>
    
  </channel>
</rss>
