Regression is an important and popular machine learning tool that makes predictions from data by learning the relationship between some features of the data and an observed value response.
Regression is an important and popular machine learning tool that makes predictions from data by learning the relationship between some features of the data and an observed value response. Regression is used to make predictions about profits, sales, temperature, stocks, and more. For example, you could use regression to predict the price of a house based on the location, the square footage, the size of the lot, and so on. In this example, the house's value is the response, and the other factors, such as location, are the features.
The optimal set of coefficients found for the regression's equation is known as the model. The relationship between the outcome and the features is summarized in the model, which can then be applied to different data sets, where the outcome value is unknown.
1 - Autoregression
See the autoregressive model example under time series models.
Using linear regression, you can model the linear relationship between independent variables, or features, and a dependent variable, or outcome.
Using linear regression, you can model the linear relationship between independent variables, or features, and a dependent variable, or outcome. You can build linear regression models to:
Fit a predictive model to a training data set of independent variables and some dependent variable. Doing so allows you to use feature variable values to make predictions on outcomes. For example, you can predict the amount of rain that will fall on a particular day of the year.
Determine the strength of the relationship between an independent variable and some outcome variable. For example, suppose you want to determine the importance of various weather variables on the outcome of how much rain will fall. You can build a linear regression model based on observations of weather patterns and rainfall to find the answer.
Unlike Logistic regression, which you use to determine a binary classification outcome, linear regression is primarily used to predict continuous numerical outcomes in linear relationships.
You can use the following functions to build a linear regression model, view the model, and use the model to make predictions on a set of test data:
This linear regression example uses a small data set named faithful.
This linear regression example uses a small data set named faithful. The data set contains the intervals between eruptions and the duration of eruptions for the Old Faithful geyser in Yellowstone National Park. The duration of each eruption can be between 1.5 and 5 minutes. The length of intervals between eruptions and of each eruption varies. However, you can estimate the time of the next eruption based on the duration of the previous eruption. The example shows how you can build a model to predict the value of eruptions, given the value of the waiting feature.
Create a table that contains the response values from running the PREDICT_LINEAR_REG function on your test data. Name this table pred_faithful_results. View the results in the pred_faithful_results table:
=> CREATE TABLE pred_faithful_results AS
(SELECT id, eruptions, PREDICT_LINEAR_REG(waiting USING PARAMETERS model_name='linear_reg_faithful')
AS pred FROM faithful_testing);
CREATE TABLE
=> SELECT * FROM pred_faithful_results ORDER BY id;
id | eruptions | pred
-----+-----------+------------------
4 | 2.283 | 2.8151271587036
5 | 4.533 | 4.62659045686076
8 | 3.6 | 4.62659045686076
9 | 1.95 | 1.94877514654148
11 | 1.833 | 2.18505296804024
12 | 3.917 | 4.54783118302784
14 | 1.75 | 1.6337380512098
20 | 4.25 | 4.15403481386324
22 | 1.75 | 1.6337380512098
.
.
.
(110 rows)
Calculating the mean squared error (MSE)
You can calculate how well your model fits the data using the MSE function. MSE returns the average of the squared differences between actual value and predicted values.
=> SELECT MSE (eruptions::float, pred::float) OVER() FROM
(SELECT eruptions, pred FROM pred_faithful_results) AS prediction_output;
mse | Comments
-------------------+-----------------------------------------------
0.252925741352641 | Of 110 rows, 110 were used and 0 were ignored
(1 row)
Using Poisson regression, you can model count data.
Using Poisson regression, you can model count data. Poisson regression offers an alternative to linear regression or logistic regression and is useful when the target variable describes event frequency (event count in a fixed interval of time). Linear regression is preferable if you aim to predict continuous numerical outcomes in linear relationships, while logistic regression is used for predicting a binary classification.
You can use the following functions to build a Poisson regression model, view the model, and use the model to make predictions on a set of test data:
The Random Forest for regression algorithm creates an ensemble model of regression trees.
The Random Forest for regression algorithm creates an ensemble model of regression trees. Each tree is trained on a randomly selected subset of the training data. The algorithm predicts the value that is the mean prediction of the individual trees.
You can use the following functions to train the Random Forest model, and use the model to make predictions on a set of test data:
Use
RF_REGRESSOR to create the random forest model myRFRegressorModel using the mtcars training data. View the summary output of the model with
GET_MODEL_SUMMARY:
Support Vector Machine (SVM) for regression predicts continuous ordered variables based on the training data.
Support Vector Machine (SVM) for regression predicts continuous ordered variables based on the training data.
Unlike Logistic regression, which you use to determine a binary classification outcome, SVM for regression is primarily used to predict continuous numerical outcomes.
You can use the following functions to build an SVM for regression model, view the model, and use the model to make predictions on a set of test data:
This SVM for regression example uses a small data set named faithful, based on the Old Faithful geyser in Yellowstone National Park.
This SVM for regression example uses a small data set named faithful, based on the Old Faithful geyser in Yellowstone National Park. The data set contains values about the waiting time between eruptions and the duration of eruptions of the geyser. The example shows how you can build a model to predict the value of eruptions, given the value of the waiting feature.
Create a new table that contains the response values from running the PREDICT_SVM_REGRESSOR function on your test data. Name this table pred_faithful_results. View the results in the pred_faithful_results table:
=> CREATE TABLE pred_faithful AS
(SELECT id, eruptions, PREDICT_SVM_REGRESSOR(waiting USING PARAMETERS model_name='svm_faithful')
AS pred FROM faithful_testing);
CREATE TABLE
=> SELECT * FROM pred_faithful ORDER BY id;
id | eruptions | pred
-----+-----------+------------------
4 | 2.283 | 2.88444568755189
5 | 4.533 | 4.54434581879796
8 | 3.6 | 4.54434581879796
9 | 1.95 | 2.09058040739072
11 | 1.833 | 2.30708912016195
12 | 3.917 | 4.47217624787422
14 | 1.75 | 1.80190212369576
20 | 4.25 | 4.11132839325551
22 | 1.75 | 1.80190212369576
.
.
.
(110 rows)
Calculating the mean squared error (MSE)
You can calculate how well your model fits the data is by using the MSE function. MSE returns the average of the squared differences between actual value and predicted values.
=> SELECT MSE(obs::float, prediction::float) OVER()
FROM (SELECT eruptions AS obs, pred AS prediction
FROM pred_faithful) AS prediction_output;
mse | Comments
-------------------+-----------------------------------------------
0.254499811834235 | Of 110 rows, 110 were used and 0 were ignored
(1 row)
The following XGBoost functions create and perform predictions with a regression model:.
XGBoost (eXtreme Gradient Boosting) is a popular supervised-learning algorithm used for regression and classification on large datasets. It uses sequentially-built shallow decision trees to provide accurate results and a highly-scalable training method that avoids overfitting.
The following XGBoost functions create and perform predictions with a regression model:
This example uses a small data set named "mtcars", which contains design and performance data for 32 automobiles from 1973-1974, and creates an XGBoost regression model to predict the value of the variable carb (the number of carburetors).