分类列编码

许多机器学习算法无法处理分类数据。为了适应这种算法,必须在训练之前将分类数据转换为数值数据。将分类值直接映射到索引是不够的。例如,如果您的分类特征具有“红色”、“绿色”和“蓝色”三个不同的值,将它们替换为 1、2 和 3 可能会对训练过程产生负面影响,因为算法通常依赖于值之间的某种数字距离进行区分。在这种情况下,从 1 到 3 的欧几里得距离是从 1 到 2 的距离的两倍,这意味着训练过程会认为“红色”与“蓝色”不同,而与“绿色”更相似。或者,独热编码将每个分类值映射到一个二进制向量以避免这个问题。例如,“红色”可以映射到 [1,0,0],“绿色”可以映射到 [0,1,0],“蓝色”可以映射到 [0,0,1]。现在,这三个类别之间的成对距离都是相同的。独热编码允许您将分类变量转换为二进制值,以便您可以使用不同的机器学习算法来计算您的数据。

以下示例演示如何将独热编码应用于 Titanic 数据集。如果您想了解有关此数据集的详细信息,请参阅 Kaggle 站点

假设您想使用逻辑回归分类器来预测哪些乘客在泰坦尼克号沉没中幸存下来。如果没有一种热编码,则无法将分类特征用于逻辑回归。该数据集有两个可以使用的分类特征。"sex"(性别)特征可以是男性也可以是女性。"embarkation_point"(上船地点)可以是以下几项之一:

  • S 代表南安普顿

  • Q 代表皇后镇

  • C 代表瑟堡

开始示例之前,请加载机器学习示例数据
  1. 对训练数据运行 ONE_HOT_ENCODER_FIT 函数:
=> SELECT ONE_HOT_ENCODER_FIT('titanic_encoder', 'titanic_training', 'sex, embarkation_point');
ONE_HOT_ENCODER_FIT
---------------------
Success

(1 row)

  1. 查看 titanic_encoder 模型摘要:
=> SELECT GET_MODEL_SUMMARY(USING PARAMETERS model_name='titanic_encoder');
GET_MODEL_SUMMARY
------------------------------------------------------------------------------------------
===========
call_string
===========
SELECT one_hot_encoder_fit('public.titanic_encoder','titanic_training','sex, embarkation_point'
USING PARAMETERS exclude_columns='', output_view='', extra_levels='{}');
==================
varchar_categories
==================
category_name    |category_level|category_level_index
-----------------+--------------+--------------------
embarkation_point|      C       |         0
embarkation_point|      Q       |         1
embarkation_point|      S       |         2
embarkation_point|              |         3
sex              |    female    |         0
sex              |     male     |         1
(1 row)
  1. 运行 GET_MODEL_ATTRIBUTE 函数。此函数返回其原生数据类型中的分类级别,因此可以轻松地将它们与原始表进行比较:
=> SELECT * FROM (SELECT GET_MODEL_ATTRIBUTE(USING PARAMETERS model_name='titanic_encoder',
attr_name='varchar_categories')) AS attrs INNER JOIN (SELECT passenger_id, name, sex, age,
embarkation_point FROM titanic_training) AS original_data ON attrs.category_level
ILIKE original_data.embarkation_point ORDER BY original_data.passenger_id LIMIT 10;
category_name     | category_level | category_level_index | passenger_id |name
|  sex   | age | embarkation_point
------------------+----------------+----------------------+--------------+-----------------------------
-----------------------+--------+-----+-------------------
embarkation_point | S              |                    2 |            1 | Braund, Mr. Owen Harris
| male   |  22 | S
embarkation_point | C              |                    0 |            2 | Cumings, Mrs. John Bradley
(Florence Briggs Thayer | female |  38 | C
embarkation_point | S              |                    2 |            3 | Heikkinen, Miss. Laina
| female |  26 | S
embarkation_point | S              |                    2 |            4 | Futrelle, Mrs. Jacques Heath
(Lily May Peel)       | female |  35 | S
embarkation_point | S              |                    2 |            5 | Allen, Mr. William Henry
| male   |  35 | S
embarkation_point | Q              |                    1 |            6 | Moran, Mr. James
| male   |     | Q
embarkation_point | S              |                    2 |            7 | McCarthy, Mr. Timothy J
| male   |  54 | S
embarkation_point | S              |                    2 |            8 | Palsson, Master. Gosta Leonard
| male   |   2 | S
embarkation_point | S              |                    2 |            9 | Johnson, Mrs. Oscar W
(Elisabeth Vilhelmina Berg)  | female |  27 | S
embarkation_point | C              |                    0 |           10 | Nasser, Mrs. Nicholas
(Adele Achem)                | female |  14 | C
(10 rows)
  1. 对训练和测试数据运行 APPLY_ONE_HOT_ENCODER 函数:
=> CREATE VIEW titanic_training_encoded AS SELECT passenger_id, survived, pclass, sex_1, age,
sibling_and_spouse_count, parent_and_child_count, fare, embarkation_point_1, embarkation_point_2
FROM (SELECT APPLY_ONE_HOT_ENCODER(* USING PARAMETERS model_name='titanic_encoder')
FROM titanic_training) AS sq;

CREATE VIEW

=> CREATE VIEW titanic_testing_encoded AS SELECT passenger_id, name, pclass, sex_1, age,
sibling_and_spouse_count, parent_and_child_count, fare, embarkation_point_1, embarkation_point_2
FROM (SELECT APPLY_ONE_HOT_ENCODER(* USING PARAMETERS model_name='titanic_encoder')
FROM titanic_testing) AS sq;
CREATE VIEW
  1. 然后,对训练数据训练一个逻辑回归分类器,并对测试数据执行该模型:
=> SELECT LOGISTIC_REG('titanic_log_reg', 'titanic_training_encoded', 'survived', '*'
USING PARAMETERS exclude_columns='passenger_id, survived');
LOGISTIC_REG
---------------------------
Finished in 5 iterations
(1 row)

=> SELECT passenger_id, name, PREDICT_LOGISTIC_REG(pclass, sex_1, age, sibling_and_spouse_count,
parent_and_child_count, fare, embarkation_point_1, embarkation_point_2 USING PARAMETERS
model_name='titanic_log_reg') FROM titanic_testing_encoded ORDER BY passenger_id LIMIT 10;
passenger_id |                     name                     | PREDICT_LOGISTIC_REG
-------------+----------------------------------------------+----------------------
893          | Wilkes, Mrs. James (Ellen Needs)             |                    0
894          | Myles, Mr. Thomas Francis                    |                    0
895          | Wirz, Mr. Albert                             |                    0
896          | Hirvonen, Mrs. Alexander (Helga E Lindqvist) |                    1
897          | Svensson, Mr. Johan Cervin                   |                    0
898          | Connolly, Miss. Kate                         |                    1
899          | Caldwell, Mr. Albert Francis                 |                    0
900          | Abrahim, Mrs. Joseph (Sophie Halaut Easu)    |                    1
901          | Davies, Mr. John Samuel                      |                    0
902          | Ilieff, Mr. Ylio                             |
(10 rows)