CROSS_VALIDATE

Performs k-fold cross validation on a learning algorithm using an input relation, and grid search for hyper parameters.

Performs k-fold cross validation on a learning algorithm using an input relation, and grid search for hyper parameters. The output is an average performance indicator of the selected algorithm. This function supports SVM classification, naive bayes, and logistic regression.

This is a meta-function. You must call meta-functions in a top-level SELECT statement.

Behavior type

Volatile

Syntax

CROSS_VALIDATE ( 'algorithm', 'input-relation', 'response-column', 'predictor-columns'
        [ USING PARAMETERS
              [exclude_columns = 'excluded-columns']
           [, cv_model_name = 'model']
           [, cv_metrics = 'metrics']
           [, cv_fold_count = num-folds]
           [, cv_hyperparams = 'hyperparams']
           [, cv_prediction_cutoff = prediction-cutoff] ] )

Arguments

algorithm
Name of the algorithm training function, one of the following:
input-relation
The table or view that contains data used for training and testing. If the input relation is defined in Hive, use SYNC_WITH_HCATALOG_SCHEMA to sync the hcatalog schema, and then run the machine learning function.
response-column
Name of the input column that contains the response.
predictor-columns

Comma-separated list of columns in the input relation that represent independent variables for the model, or asterisk (*) to select all columns. If you select all columns, the argument list for parameter exclude_columns must include response-column, and any columns that are invalid as predictor columns.

Parameters

exclude_columns
Comma-separated list of columns from predictor-columns to exclude from processing.
cv_model_name
The name of a model that lets you retrieve results of the cross validation process. If you omit this parameter, results are displayed but not saved. If you set this parameter to a model name, you can retrieve the results with summary functions GET_MODEL_ATTRIBUTE and GET_MODEL_SUMMARY
cv_metrics
The metrics used to assess the algorithm, specified either as a comma-separated list of metric names or in a JSON array. In both cases, you specify one or more of the following metric names:
  • accuracy (default)

  • error_rate

  • TP: True positive, the number of cases of class 1 predicted as class 1

  • FP: False positive, the number of cases of class 0 predicted as class 1

  • TN: True negative, the number of cases of class 0 predicted as class 0

  • FN: False negative, the number of cases of class 1 predicted as class 0

  • TPR or recall: True positive rate, the correct predictions among class 1

  • FPR: False positive rate, the wrong predictions among class 0

  • TNR: True negative rate, the correct predictions among class 0

  • FNR: False negative rate, the wrong predictions among class 1

  • PPV or precision: The positive predictive value, the correct predictions among cases predicted as class 1

  • NPV: Negative predictive value, the correct predictions among cases predicted as class 0

  • MSE: Mean squared error

  • MAE: Mean absolute error

  • rsquared: coefficient of determination

  • explained_variance

  • fscore

    (1 + beta˄2) * precison * recall / (beta˄2 * precision + recall)
    

    beta equals 1 by default

  • auc_roc: AUC of ROC using the specified number of bins, by default 100

  • auc_prc: AUC of PRC using the specified number of bins, by default 100

  • counts: Shortcut that resolves to four other metrics: TP, FP, TN, and FN

  • count: Valid only in JSON syntax, counts the number of cases labeled by one class (case-class-label) but predicted as another class (predicted-class-label):

    cv_metrics='[{"count":[case-class-label, predicted-class-label]}]'
    
cv_fold_count
The number of folds to split the data.

Default: 5

cv_hyperparams
A JSON string that describes the combination of parameters for use in grid search of hyper parameters. The JSON string contains pairs of the hyper parameter name. The value of each hyper parameter can be specified as an array or sequence. For example:
{"param1":[value1,value2,...], "param2":{"first":first_value, "step":step_size, "count":number_of_values} }

Hyper parameter names and string values should be quoted using the JSON standard. These parameters are passed to the training function.

cv_prediction_cutoff
The cutoff threshold that is passed to the prediction stage of logistic regression, a FLOAT between 0 and 1, exclusive

Default: 0.5

Model attributes

call_string
The value of all input arguments that were specified at the time CROSS_VALIDATE was called.
run_average
The average across all folds of all metrics specified in parameter cv_metrics, if specified; otherwise, average accuracy.
fold_info
The number of rows in each fold:
  • fold_id: The index of the fold.

  • row_count: The number of rows held out for testing in the fold.

counters
All counters for the function, including:
  • accepted_row_count: The total number of rows in the input_relation, minus the number of rejected rows.

  • rejected_row_count: The number of rows of the input_relation that were skipped because they contained an invalid value.

  • feature_count: The number of features input to the machine learning model.

run_details
Information about each run, where a run means training a single model, and then testing that model on the one held-out fold:
  • fold_id: The index of the fold held out for testing.

  • iteration_count: The number of iterations used in model training on non-held-out folds.

  • accuracy: All metrics specified in parameter cv_metrics, or accuracy if cv_metrics is not provided.

  • error_rate: All metrics specified in parameter cv_metrics, or accuracy if the parameter is omitted.

Privileges

Non-superusers:

  • SELECT privileges on the input relation

  • CREATE and USAGE privileges on the default schema where machine learning algorithms generate models. If cv_model_name is provided, the cross validation results are saved as a model in the same schema.

Specifying metrics in JSON

Parameter cv_metrics can specify metrics as an array of JSON objects, where each object specifies a metric name . For example, the following expression sets cv_metrics to two metrics specified as JSON objects, accuracy and error_rate:

cv_metrics='["accuracy", "error_rate"]'

In the next example, cv_metrics is set to two metrics, accuracy and TPR (true positive rate). Here, the TPR metric is specified as a JSON object that takes an array of two class label arguments, 2 and 3:

cv_metrics='[ "accuracy", {"TPR":[2,3] } ]'

Metrics specified as JSON objects can accept parameters. In the following example, the fscore metric specifies parameter beta, which is set to 0.5:

cv_metrics='[ {"fscore":{"beta":0.5} } ]'

Parameter support can be especially useful for certain metrics. For example, metrics auc_roc and auc_prc build a curve, and then compute the area under that curve. For ROC, the curve is formed by plotting metrics TPR against FPR; for PRC, PPV (precision) against TPR (recall). The accuracy of such curves can be increased by setting parameter num_bins to a value greater than the default value of 100. For example, the following expression computes AUC for an ROC curve built with 1000 bins:

cv_metrics='[{"auc_roc":{"num_bins":1000}}]'

Using metrics with Multi-class classifier functions

All supported metrics are defined for binary classifier functions LOGISTIC_REG and SVM_CLASSIFIER. For multi-class classifier functions such as NAIVE_BAYES, these metrics can be calculated for each one-versus-the-rest binary classifier. Use arguments to request the metrics for each classifier. For example, if training data has integer class labels, you can set cv_metrics with the precision (PPV) metric as follows:

cv_metrics='[{"precision":[0,4]}]'

This setting specifies to return two columns with precision computed for two classifiers:

  • Column 1: classifies 0 versus not 0

  • Collumn 2: classifies 4 versus not 4

If you omit class label arguments, the class with index 1 is used. Instead of computing metrics for individual one-versus-the-rest classifiers, the average is computed in one of the following styles: macro, micro, or weighted (default). For example, the following cv_metrics setting returns the average weighted by class sizes:

cv_metrics='[{"precision":{"avg":"weighted"}}]'

AUC-type metrics can be similarly defined for multi-class classifiers. For example, the following cv_metrics setting computes the area under the ROC curve for each one-versus-the-rest classifier, and then returns the average weighted by class sizes.

cv_metrics='[{"auc_roc":{"avg":"weighted", "num_bins":1000}}]'

Examples

=> SELECT CROSS_VALIDATE('svm_classifier', 'mtcars', 'am', 'mpg'
      USING PARAMETERS cv_fold_count= 6,
                       cv_hyperparams='{"C":[1,5]}',
                       cv_model_name='cv_svm',
                       cv_metrics='accuracy, error_rate');
         CROSS_VALIDATE
----------------------------
 Finished

===========
run_average
===========
C  |accuracy      |error_rate
---+--------------+----------
1 | 0.75556       |  0.24444
5 | 0.78333       |  0.21667
(1 row)