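The linear regression example uses a small data set named faithful. The data set contains the intervals between eruptions and the duration of eruptions for the Old Faithful geyser in Yellowstone National Park. The duration of each eruption is between 1.5 and 5 minutes. Both the interval between eruptions and the duration of each eruption vary. However, you can estimate the time of the next eruption based on the duration of the previous eruption. The example shows how you can build a model to predict the value of eruptions, given the value of the waiting feature.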

Before you begin the example, load the Machine Learning sample data.

Create the linear regression model, named linear_reg_faithful, using the faithful_training training data.

=> SELECT LINEAR_REG('linear_reg_faithful', 'faithful_training', 'eruptions', 'waiting'
   USING PARAMETERS optimizer='BFGS');
        LINEAR_REG
---------------------------
 Finished in 6 iterations

(1 row)

View the summary output of linear_reg_faithful:

=> SELECT GET_MODEL_SUMMARY(USING PARAMETERS model_name='linear_reg_faithful');
--------------------------------------------------------------------------------
=======
details
=======
predictor|coefficient|std_err |t_value |p_value
---------+-----------+--------+--------+--------
Intercept| -2.06795  | 0.21063|-9.81782| 0.00000
waiting  |  0.07876  | 0.00292|26.96925| 0.00000

==============
regularization
==============
type| lambda
----+--------
none| 1.00000

===========
call_string
===========
linear_reg('public.linear_reg_faithful', 'faithful_training', '"eruptions"', 'waiting'
USING PARAMETERS optimizer='bfgs', epsilon=1e-06, max_iterations=100,
regularization='none', lambda=1)

===============
Additional Info
===============
Name              |Value
------------------+-----
iteration_count   |  3
rejected_row_count|  0
accepted_row_count| 162
(1 row)
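
The coefficients in the summary define the fitted line eruptions = -2.06795 + 0.07876 × waiting. As a minimal sketch in Python (outside Vertica), the following applies those coefficients by hand. The function name predict_eruptions and the 62-minute waiting time are illustrative assumptions, not values taken from the sample tables.

```python
# Coefficients copied from the GET_MODEL_SUMMARY output of linear_reg_faithful.
INTERCEPT = -2.06795
WAITING_COEF = 0.07876


def predict_eruptions(waiting_minutes: float) -> float:
    """Return the predicted eruption duration (minutes) for a waiting time."""
    return INTERCEPT + WAITING_COEF * waiting_minutes


# Hypothetical input: a 62-minute wait between eruptions.
print(round(predict_eruptions(62), 5))
```

The result is roughly 2.8 minutes. Because the summary rounds the coefficients to five decimal places, a hand calculation like this can differ from PREDICT_LINEAR_REG in the later digits.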

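Create a table that contains the response values from running the PREDICT_LINEAR_REG function on your test data. Name this table pred_faithful_results. View the results in the pred_faithful_results table: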

=> CREATE TABLE pred_faithful_results AS
   (SELECT id, eruptions, PREDICT_LINEAR_REG(waiting USING PARAMETERS model_name='linear_reg_faithful')
   AS pred FROM faithful_testing);
CREATE TABLE

=> SELECT * FROM pred_faithful_results ORDER BY id;
 id  | eruptions |       pred
-----+-----------+------------------
   4 |     2.283 |  2.8151271587036
   5 |     4.533 | 4.62659045686076
   8 |       3.6 | 4.62659045686076
   9 |      1.95 | 1.94877514654148
  11 |     1.833 | 2.18505296804024
  12 |     3.917 | 4.54783118302784
  14 |      1.75 |  1.6337380512098
  20 |      4.25 | 4.15403481386324
  22 |      1.75 |  1.6337380512098
.
.
.
(110 rows)

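Calculating the mean squared error (MSE)

You can calculate how well your model fits the data by using the MSE function. MSE returns the average of the squared differences between actual and predicted values.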

=> SELECT MSE (eruptions::float, pred::float) OVER() FROM
   (SELECT eruptions, pred FROM pred_faithful_results) AS prediction_output;
        mse        |                   Comments
-------------------+-----------------------------------------------
 0.252925741352641 | Of 110 rows, 110 were used and 0 were ignored
(1 row)
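
To make the definition concrete, here is a minimal Python sketch of the same calculation. It is not how Vertica implements MSE. The helper name mse is an assumption, and the inputs are only the first three rows displayed from pred_faithful_results (predictions rounded), so the result will not equal the 110-row value reported by the query.

```python
def mse(observed, predicted):
    """Mean of squared differences between paired observed and predicted values."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)


# Rows with id 4, 5, and 8 from the displayed output; predictions rounded.
observed = [2.283, 4.533, 3.6]
predicted = [2.81513, 4.62659, 4.62659]
print(mse(observed, predicted))
```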

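3 - Random forest for regression

The random forest for regression algorithm creates an ensemble model of regression trees. Each tree is trained on a randomly selected subset of the training data. The algorithm predicts the value that is the mean prediction of the individual trees.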
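
The averaging step can be sketched in a few lines of Python. This is an illustration only: the three "trees" are hand-written rules standing in for trained regression trees, and the names forest_predict and toy_trees are hypothetical.

```python
from statistics import mean


def forest_predict(trees, row):
    """Average the predictions of the individual regression trees for one row."""
    return mean(tree(row) for tree in trees)


# Hypothetical stand-ins for trained trees: each maps a row to a number.
toy_trees = [
    lambda row: 2.0 if row["hp"] < 150 else 4.0,
    lambda row: 2.5 if row["wt"] < 3.0 else 3.5,
    lambda row: 3.0,
]

print(forest_predict(toy_trees, {"hp": 110, "wt": 2.62}))
```

In a real forest each tree is fit to a different random sample of the rows, so the trees disagree and averaging them reduces the variance of the prediction.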

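You can use the following functions to train the random forest model, and use the model to make predictions on a set of test data:

RF_REGRESSOR
PREDICT_RF_REGRESSOR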

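For a complete example of how to use the random forest for regression algorithm in Vertica, see Building a random forest regression model.

3.1 - Building a random forest regression model

This example uses the "mtcars" dataset to create a random forest model to predict the value of carb (the number of carburetors).

Before you begin the example, load the Machine Learning sample data.

Use RF_REGRESSOR to create the random forest model myRFRegressorModel using the mtcars training data. View the summary output of the model with GET_MODEL_SUMMARY: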

=> SELECT RF_REGRESSOR ('myRFRegressorModel', 'mtcars', 'carb', 'mpg, cyl, hp, drat, wt' USING PARAMETERS
ntree=100, sampling_size=0.3);
RF_REGRESSOR
--------------
Finished
(1 row)


=> SELECT GET_MODEL_SUMMARY(USING PARAMETERS model_name='myRFRegressorModel');
--------------------------------------------------------------------------------
===========
call_string
===========
SELECT rf_regressor('public.myRFRegressorModel', 'mtcars', '"carb"', 'mpg, cyl, hp, drat, wt'
USING PARAMETERS exclude_columns='', ntree=100, mtry=1, sampling_size=0.3, max_depth=5, max_breadth=32,
min_leaf_size=5, min_info_gain=0, nbins=32);


=======
details
=======
predictor|type
---------+-----
mpg      |float
cyl      | int
hp       | int
drat     |float
wt       |float
===============
Additional Info
===============
Name              |Value
------------------+-----
tree_count        | 100
rejected_row_count|  0
accepted_row_count| 32
(1 row)

Use PREDICT_RF_REGRESSOR to predict the number of carburetors:

=> SELECT PREDICT_RF_REGRESSOR (mpg,cyl,hp,drat,wt
USING PARAMETERS model_name='myRFRegressorModel') FROM mtcars;
PREDICT_RF_REGRESSOR
----------------------
2.94774203574204
2.6954087024087
2.6954087024087
2.89906346431346
2.97688489288489
2.97688489288489
2.7086587024087
2.92078965478965
2.97688489288489
2.7086587024087
2.95621822621823
2.82255155955156
2.7086587024087
2.7086587024087
2.85650394050394
2.85650394050394
2.97688489288489
2.95621822621823
2.6954087024087
2.6954087024087
2.84493251193251
2.97688489288489
2.97688489288489
2.8856467976468
2.6954087024087
2.92078965478965
2.97688489288489
2.97688489288489
2.7934087024087
2.7934087024087
2.7086587024087
2.72469441669442
(32 rows)

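4 - SVM (support vector machine) for regression

Support vector machine (SVM) for regression predicts continuous ordered variables based on the training data.

Unlike logistic regression, which you use to determine a binary classification outcome, SVM for regression is primarily used to predict continuous numerical outcomes.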

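You can use the following functions to build an SVM for regression model, view the model, and use the model to make predictions on a set of test data:

SVM_REGRESSOR
GET_MODEL_SUMMARY
PREDICT_SVM_REGRESSOR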

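For a complete example of how to use the SVM algorithm in Vertica, see Building an SVM for regression model.

4.1 - Building an SVM for regression model

This SVM for regression example uses a small data set named faithful, based on the Old Faithful geyser in Yellowstone National Park. The data set contains values about the waiting time between eruptions and the duration of eruptions of the geyser. The example shows how you can build a model to predict the value of eruptions, given the value of the waiting feature.

Before you begin the example, load the Machine Learning sample data.

Create the SVM model, named svm_faithful, using the faithful_training training data.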

=> SELECT SVM_REGRESSOR('svm_faithful', 'faithful_training', 'eruptions', 'waiting'
                      USING PARAMETERS error_tolerance=0.1, max_iterations=100);
        SVM_REGRESSOR
---------------------------
 Finished in 5 iterations

Accepted Rows: 162   Rejected Rows: 0
(1 row)
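
The call above sets error_tolerance=0.1. Assuming this parameter plays the role of ε in the standard ε-insensitive loss of support vector regression (an assumption; this section does not spell out the loss function), errors smaller than the tolerance cost nothing and larger errors are penalized in proportion to the excess. A minimal Python sketch, with a hypothetical function name and made-up values:

```python
def epsilon_insensitive_loss(observed: float, predicted: float,
                             tolerance: float = 0.1) -> float:
    """Zero inside the tolerance band, linear in the excess error outside it."""
    return max(0.0, abs(observed - predicted) - tolerance)


# Made-up values for illustration, not taken from the faithful data.
print(epsilon_insensitive_loss(2.0, 2.05))  # error 0.05 is inside the band
print(epsilon_insensitive_loss(2.0, 2.5))   # error 0.5 exceeds the band by 0.4
```

Under this loss, widening the tolerance lets more training points sit inside the band without affecting the fit, which typically yields a simpler model at the cost of precision.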

View the summary output of svm_faithful:

=> SELECT GET_MODEL_SUMMARY(USING PARAMETERS model_name='svm_faithful');

------------------------------------------------------------------
=======
details
=======

===========================
Predictors and Coefficients
===========================
         |Coefficients
---------+------------
Intercept|  -1.59007
waiting  |   0.07217
===========
call_string
===========
Call string:
SELECT svm_regressor('public.svm_faithful', 'faithful_training', '"eruptions"',
'waiting'USING PARAMETERS error_tolerance = 0.1, C=1, max_iterations=100,
epsilon=0.001);

===============
Additional Info
===============
Name              |Value
------------------+-----
accepted_row_count| 162
rejected_row_count|  0
iteration_count   |  5
(1 row)

Create a new table that contains the response values from running the PREDICT_SVM_REGRESSOR function on your test data. Name this table pred_faithful. View the results in the pred_faithful table:

=> CREATE TABLE pred_faithful AS
       (SELECT id, eruptions, PREDICT_SVM_REGRESSOR(waiting USING PARAMETERS model_name='svm_faithful')
        AS pred FROM faithful_testing);
CREATE TABLE

=> SELECT * FROM pred_faithful ORDER BY id;
 id  | eruptions |       pred
-----+-----------+------------------
   4 |     2.283 | 2.88444568755189
   5 |     4.533 | 4.54434581879796
   8 |       3.6 | 4.54434581879796
   9 |      1.95 | 2.09058040739072
  11 |     1.833 | 2.30708912016195
  12 |     3.917 | 4.47217624787422
  14 |      1.75 | 1.80190212369576
  20 |      4.25 | 4.11132839325551
  22 |      1.75 | 1.80190212369576
.
.
.
(110 rows)

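Calculating the mean squared error (MSE)

You can calculate how well your model fits the data by using the MSE function. MSE returns the average of the squared differences between actual and predicted values.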

=> SELECT MSE(obs::float, prediction::float) OVER()
   FROM (SELECT eruptions AS obs, pred AS prediction
         FROM pred_faithful) AS prediction_output;
        mse        |                   Comments
-------------------+-----------------------------------------------
 0.254499811834235 | Of 110 rows, 110 were used and 0 were ignored
(1 row)

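5 - XGBoost for regression

XGBoost (eXtreme Gradient Boosting) is a popular supervised-learning algorithm used for regression and classification on large datasets. It uses sequentially built shallow decision trees to provide accurate results and a highly scalable training method that avoids overfitting.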
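
The idea of building learners sequentially can be sketched in Python. This toy is not XGBoost: a constant predictor (the mean residual) stands in for each shallow tree so that the additive update and the effect of the learning rate are easy to follow. The function name and the target values are hypothetical.

```python
def boost_constant_learners(targets, learning_rate=0.5, rounds=10):
    """Additive boosting where each round's learner predicts the mean residual.

    Returns the shared prediction after each round.
    """
    prediction = 0.0
    history = []
    for _ in range(rounds):
        residuals = [t - prediction for t in targets]
        step = sum(residuals) / len(residuals)  # output of this round's weak learner
        prediction += learning_rate * step      # shrink the learner's contribution
        history.append(prediction)
    return history


# Hypothetical targets with mean 3.75.
history = boost_constant_learners([1.0, 2.0, 4.0, 8.0])
print(history[0], history[-1])
```

With a learning rate of 0.5, each round closes half of the remaining gap to the target mean, so the prediction approaches 3.75 without jumping there in one step. Real boosted trees split on the predictors, so each round can correct different rows by different amounts.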

The following XGBoost functions create and perform predictions with a regression model:

XGB_REGRESSOR
PREDICT_XGB_REGRESSOR

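Example

This example uses a small data set named "mtcars", which contains design and performance data for 32 automobiles from 1973-1974, and creates an XGBoost regression model to predict the value of the variable carb (the number of carburetors).

Before you begin the example, load the Machine Learning sample data.

Use XGB_REGRESSOR to create the XGBoost regression model xgb_cars from the mtcars dataset.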

=> SELECT XGB_REGRESSOR ('xgb_cars', 'mtcars', 'carb', 'mpg, cyl, hp, drat, wt'
    USING PARAMETERS learning_rate=0.5);
 XGB_REGRESSOR
---------------
 Finished
(1 row)

You can then view a summary of the model with GET_MODEL_SUMMARY:


=> SELECT GET_MODEL_SUMMARY(USING PARAMETERS model_name='xgb_cars');
                  GET_MODEL_SUMMARY
------------------------------------------------------
===========
call_string
===========
xgb_regressor('public.xgb_cars', 'mtcars', '"carb"', 'mpg, cyl, hp, drat, wt'
USING PARAMETERS exclude_columns='', max_ntree=10, max_depth=5, nbins=32, objective=squarederror,
split_proposal_method=global, epsilon=0.001, learning_rate=0.5, min_split_loss=0, weight_reg=0, sampling_size=1)

=======
details
=======
predictor|      type
---------+----------------
   mpg   |float or numeric
   cyl   |      int
   hp    |      int
  drat   |float or numeric
   wt    |float or numeric

===============
Additional Info
===============
       Name       |Value
------------------+-----
    tree_count    | 10
rejected_row_count|  0
accepted_row_count| 32

(1 row)

Use PREDICT_XGB_REGRESSOR to predict the number of carburetors:

=> SELECT carb, PREDICT_XGB_REGRESSOR (mpg,cyl,hp,drat,wt USING PARAMETERS model_name='xgb_cars') FROM mtcars;
 carb | PREDICT_XGB_REGRESSOR
------+-----------------------
    4 |      4.00335213618023
    2 |       2.0038188946536
    6 |      5.98866003194438
    1 |      1.01774386191546
    2 |       1.9959801016274
    2 |       2.0038188946536
    4 |      3.99545403625739
    8 |      7.99211056556231
    2 |      1.99291901733151
    3 |       2.9975688946536
    3 |       2.9975688946536
    1 |      1.00320357711227
    2 |       2.0038188946536
    4 |      3.99545403625739
    4 |      4.00124134679445
    1 |      1.00759516721382
    4 |      3.99700517763435
    4 |      3.99580193056138
    4 |      4.00009088187525
    3 |       2.9975688946536
    2 |      1.98625064560888
    1 |      1.00355294416998
    2 |      2.00666247039502
    1 |      1.01682931210169
    4 |      4.00124134679445
    1 |      1.01007809485918
    2 |      1.98438405824605
    4 |      3.99580193056138
    2 |      1.99291901733151
    4 |      4.00009088187525
    2 |       2.0038188946536
    1 |      1.00759516721382
(32 rows)