CROSS_VALIDATE
Performs k-fold cross validation on a learning algorithm using an input relation, and grid search for hyper parameters. The output is an average performance indicator of the selected algorithm. This function supports SVM classification, naive bayes, and logistic regression.
This is a meta-function. You must call meta-functions in a top-level SELECT statement.
Behavior type
VolatileSyntax
CROSS_VALIDATE ( 'algorithm', 'input-relation', 'response-column', 'predictor-columns'
[ USING PARAMETERS
[exclude_columns = 'excluded-columns']
[, cv_model_name = 'model']
[, cv_metrics = 'metrics']
[, cv_fold_count = num-folds]
[, cv_hyperparams = 'hyperparams']
[, cv_prediction_cutoff = prediction-cutoff] ] )
Arguments
algorithm
- Name of the algorithm training function, one of the following:
input-relation
- The table or view that contains data used for training and testing. If the input relation is defined in Hive, use
SYNC_WITH_HCATALOG_SCHEMA
to sync thehcatalog
schema, and then run the machine learning function. response-column
- Name of the input column that contains the response.
predictor-columns
Comma-separated list of columns in the input relation that represent independent variables for the model, or asterisk (*) to select all columns. If you select all columns, the argument list for parameter
exclude_columns
must includeresponse-column
, and any columns that are invalid as predictor columns.
Parameters
exclude_columns
- Comma-separated list of columns from
predictor-columns
to exclude from processing. cv_model_name
- The name of a model that lets you retrieve results of the cross validation process. If you omit this parameter, results are displayed but not saved. If you set this parameter to a model name, you can retrieve the results with summary functions
GET_MODEL_ATTRIBUTE
andGET_MODEL_SUMMARY
cv_metrics
- The metrics used to assess the algorithm, specified either as a comma-separated list of metric names or in a JSON array. In both cases, you specify one or more of the following metric names:
-
accuracy
(default) -
error_rate
-
TP
: True positive, the number of cases of class 1 predicted as class 1 -
FP
: False positive, the number of cases of class 0 predicted as class 1 -
TN
: True negative, the number of cases of class 0 predicted as class 0 -
FN
: False negative, the number of cases of class 1 predicted as class 0 -
TPR
orrecall
: True positive rate, the correct predictions among class 1 -
FPR
: False positive rate, the wrong predictions among class 0 -
TNR
: True negative rate, the correct predictions among class 0 -
FNR
: False negative rate, the wrong predictions among class 1 -
PPV
orprecision
: The positive predictive value, the correct predictions among cases predicted as class 1 -
NPV
: Negative predictive value, the correct predictions among cases predicted as class 0 -
MSE
: Mean squared error -
MAE
: Mean absolute error -
rsquared
: coefficient of determination -
explained_variance
-
fscore
(1 + beta˄2) * precison * recall / (beta˄2 * precision + recall)
beta equals 1 by default
-
auc_roc
: AUC of ROC using the specified number of bins, by default 100 -
auc_prc
: AUC of PRC using the specified number of bins, by default 100 -
counts
: Shortcut that resolves to four other metrics:TP
,FP
,TN
, andFN
-
count
: Valid only in JSON syntax, counts the number of cases labeled by one class (case-class-label
) but predicted as another class (predicted-class-label
):cv_metrics='[{"count":[case-class-label, predicted-class-label]}]'
-
cv_fold_count
- The number of folds to split the data.
Default: 5
cv_hyperparams
- A JSON string that describes the combination of parameters for use in grid search of hyper parameters. The JSON string contains pairs of the hyper parameter name. The value of each hyper parameter can be specified as an array or sequence. For example:
{"param1":[value1,value2,...], "param2":{"first":first_value, "step":step_size, "count":number_of_values} }
Hyper parameter names and string values should be quoted using the JSON standard. These parameters are passed to the training function.
cv_prediction_cutoff
- The cutoff threshold that is passed to the prediction stage of logistic regression, a FLOAT between 0 and 1, exclusive
Default: 0.5
Model attributes
call_string
- The value of all input arguments that were specified at the time
CROSS_VALIDATE
was called. run_average
- The average across all folds of all metrics specified in parameter
cv_metrics
, if specified; otherwise, average accuracy. fold_info
- The number of rows in each fold:
-
fold_id
: The index of the fold. -
row_count
: The number of rows held out for testing in the fold.
-
counters
- All counters for the function, including:
-
accepted_row_count
: The total number of rows in theinput_relation
, minus the number of rejected rows. -
rejected_row_count
: The number of rows of theinput_relation
that were skipped because they contained an invalid value. -
feature_count
: The number of features input to the machine learning model.
-
run_details
- Information about each run, where a run means training a single model, and then testing that model on the one held-out fold:
-
fold_id
: The index of the fold held out for testing. -
iteration_count
: The number of iterations used in model training on non-held-out folds. -
accuracy
: All metrics specified in parametercv_metrics
, or accuracy ifcv_metrics
is not provided. -
error_rate
: All metrics specified in parametercv_metrics
, or accuracy if the parameter is omitted.
-
Privileges
Non-superusers:
-
SELECT privileges on the input relation
-
CREATE and USAGE privileges on the default schema where machine learning algorithms generate models. If
cv_model_name
is provided, the cross validation results are saved as a model in the same schema.
Specifying metrics in JSON
Parameter cv_metrics
can specify metrics as an array of JSON objects, where each object specifies a metric name . For example, the following expression sets cv_metrics
to two metrics specified as JSON objects, accuracy
and error_rate
:
cv_metrics='["accuracy", "error_rate"]'
In the next example, cv_metrics
is set to two metrics, accuracy
and TPR
(true positive rate). Here, the TPR
metric is specified as a JSON object that takes an array of two class label arguments, 2 and 3:
cv_metrics='[ "accuracy", {"TPR":[2,3] } ]'
Metrics specified as JSON objects can accept parameters. In the following example, the fscore
metric specifies parameter beta
, which is set to 0.5:
cv_metrics='[ {"fscore":{"beta":0.5} } ]'
Parameter support can be especially useful for certain metrics. For example, metrics auc_roc
and auc_prc
build a curve, and then compute the area under that curve. For ROC
, the curve is formed by plotting metrics TPR
against FPR
; for PRC
, PPV
(precision
) against TPR
(recall
). The accuracy of such curves can be increased by setting parameter num_bins
to a value greater than the default value of 100. For example, the following expression computes AUC for an ROC curve built with 1000 bins:
cv_metrics='[{"auc_roc":{"num_bins":1000}}]'
Using metrics with Multi-class classifier functions
All supported metrics are defined for binary classifier functions
LOGISTIC_REG
and
SVM_CLASSIFIER
. For multi-class classifier functions such as
NAIVE_BAYES
, these metrics can be calculated for each one-versus-the-rest binary classifier. Use arguments to request the metrics for each classifier. For example, if training data has integer class labels, you can set cv_metrics
with the precision
(PPV
) metric as follows:
cv_metrics='[{"precision":[0,4]}]'
This setting specifies to return two columns with precision computed for two classifiers:
-
Column 1: classifies 0 versus not 0
-
Collumn 2: classifies 4 versus not 4
If you omit class label arguments, the class with index 1 is used. Instead of computing metrics for individual one-versus-the-rest
classifiers, the average is computed in one of the following styles: macro
, micro
, or weighted
(default). For example, the following cv_metrics
setting returns the average weighted by class sizes:
cv_metrics='[{"precision":{"avg":"weighted"}}]'
AUC-type metrics can be similarly defined for multi-class classifiers. For example, the following cv_metrics
setting computes the area under the ROC curve for each one-versus-the-rest
classifier, and then returns the average weighted by class sizes.
cv_metrics='[{"auc_roc":{"avg":"weighted", "num_bins":1000}}]'
Examples
=> SELECT CROSS_VALIDATE('svm_classifier', 'mtcars', 'am', 'mpg'
USING PARAMETERS cv_fold_count= 6,
cv_hyperparams='{"C":[1,5]}',
cv_model_name='cv_svm',
cv_metrics='accuracy, error_rate');
CROSS_VALIDATE
----------------------------
Finished
===========
run_average
===========
C |accuracy |error_rate
---+--------------+----------
1 | 0.75556 | 0.24444
5 | 0.78333 | 0.21667
(1 row)