LINEAR_REG

Executes linear regression on an input relation, and returns a linear regression model.

You can export the resulting linear regression model in VERTICA_MODELS or PMML format to apply it to data outside Vertica. You can also train a linear regression model elsewhere, then import it into Vertica in PMML format to predict on data in Vertica.
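The export and import paths described above can be sketched with the EXPORT_MODELS and IMPORT_MODELS functions. This is a minimal illustration, not a complete workflow: the directory paths and the external model name are hypothetical, and myLinearRegModel is assumed to be the model trained in the Examples section.

```sql
-- Export the trained model in PMML format.
-- '/home/dbadmin/exported_models' is a hypothetical output directory.
SELECT EXPORT_MODELS('/home/dbadmin/exported_models', 'myLinearRegModel'
                     USING PARAMETERS category='PMML');

-- Import a PMML model that was trained outside Vertica.
-- The source path and model name are hypothetical.
SELECT IMPORT_MODELS('/home/dbadmin/external_models/externalLinearRegModel'
                     USING PARAMETERS category='PMML');
```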

This is a meta-function. You must call meta-functions in a top-level SELECT statement.

Behavior type

Volatile

Syntax

LINEAR_REG ( 'model-name', 'input-relation', 'response-column', 'predictor-columns'
        [ USING PARAMETERS
              [exclude_columns = 'excluded-columns']
              [, optimizer = 'optimizer-method']
              [, regularization = 'regularization-method']
              [, epsilon = epsilon-value]
              [, max_iterations = iterations]
              [, lambda = lambda-value]
              [, alpha = alpha-value] ] )

Arguments

model-name

Identifies the model to create, where model-name conforms to conventions described in Identifiers. It must also be unique among all names of sequences, tables, projections, views, and models within the same schema.

input-relation

The table or view that contains the training data for building the model. If the input relation is defined in Hive, use SYNC_WITH_HCATALOG_SCHEMA to sync the hcatalog schema, and then run the machine learning function.

response-column

Name of the input column that represents the dependent variable or outcome. All values in this column must be numeric; otherwise the model is invalid.

predictor-columns

Comma-separated list of columns in the input relation that represent independent variables for the model, or asterisk (*) to select all columns. If you select all columns, the exclude_columns parameter must list response-column and any columns that are invalid as predictor columns.

All predictor columns must be of type numeric or BOOLEAN; otherwise the model is invalid.

Note

All BOOLEAN predictor values are converted to FLOAT values before training: 0 for false, 1 for true. No type checking occurs during prediction, so you can use a BOOLEAN predictor column in training, and during prediction provide a FLOAT column of the same name. In this case, all FLOAT values must be either 0 or 1.
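As a sketch of the asterisk form described above, the call below selects all columns as predictors and then excludes the response column along with a non-numeric column. The relation mtcars_train and its columns mpg and car_model are hypothetical names used only for illustration.

```sql
-- Select all columns as predictors, then exclude the response column (mpg)
-- and a column that is invalid as a predictor (car_model, assumed VARCHAR).
-- The table and column names are hypothetical.
SELECT LINEAR_REG('mtcars_lr', 'mtcars_train', 'mpg', '*'
                  USING PARAMETERS exclude_columns='mpg, car_model');
```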

Parameters

exclude_columns

Comma-separated list of columns from predictor-columns to exclude from processing.

optimizer

The optimizer method used to train the model, one of the following:

Newton
BFGS
CGD

Important
If you select CGD, regularization-method must be set to L1 or ENet; otherwise the function returns an error.

Default: CGD if regularization-method is set to L1 or ENet, otherwise Newton.

regularization

Specifies the method of regularization, one of the following:

None (default)
L1
L2
ENet

epsilon

FLOAT in the range (0.0, 1.0), the error value at which to stop training. Training stops if either the difference between the actual and predicted values is less than or equal to epsilon or if the number of iterations exceeds max_iterations.

Default: 1e-6

max_iterations

INTEGER in the range (0, 1000000), the maximum number of training iterations. Training stops if either the number of iterations exceeds max_iterations or if the difference between the actual and predicted values is less than or equal to epsilon.

Default: 100

lambda

Numeric value ≥ 0, specifies the value of the regularization parameter. Higher values enforce stronger regularization.

Default: 1

alpha

FLOAT, specifies the value of the ENet regularization parameter, which defines how much L1 versus L2 regularization to apply. A value of 1 is equivalent to L1 and a value of 0 is equivalent to L2.

Value range: [0,1]

Default: 0.5
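To show how the parameters above combine, here is a minimal sketch of an elastic net call. It reuses the faithful relation from the Examples section; the model name faithful_enet is hypothetical, and the values for lambda, alpha, epsilon, and max_iterations are simply the documented defaults written out explicitly.

```sql
-- Elastic net regularization requires the CGD optimizer (or leave optimizer
-- unset, in which case CGD is the default for L1 and ENet).
-- 'faithful_enet' is a hypothetical model name.
SELECT LINEAR_REG('faithful_enet', 'faithful', 'eruptions', 'waiting'
                  USING PARAMETERS optimizer='CGD',
                                   regularization='ENet',
                                   lambda=1,
                                   alpha=0.5,
                                   epsilon=1e-6,
                                   max_iterations=100);
```

Setting alpha closer to 1 weights the penalty toward L1, and setting it closer to 0 weights it toward L2.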

Model attributes

data

The data for the function, including:

coeffNames: Names of the coefficients, starting with the intercept, followed by the names of the predictors in the order specified in the call.
coeff: Vector of estimated coefficients, with the same order as coeffNames
stdErr: Vector of the standard error of the coefficients, with the same order as coeffNames
zValue (for logistic regression): Vector of z-values of the coefficients, in the same order as coeffNames
tValue (for linear regression): Vector of t-values of the coefficients, in the same order as coeffNames
pValue: Vector of p-values of the coefficients, in the same order as coeffNames

regularization

Type of regularization used when training the model.

lambda

Regularization parameter. Higher values enforce stronger regularization. This value must be nonnegative.

alpha

Elastic net mixture parameter.

iterations

Number of iterations actually performed before training converged or reached max_iterations.

skippedRows

Number of rows of the input relation that were skipped because they contained an invalid value.

processedRows

Total number of input relation rows minus skippedRows.

callStr

Value of all input arguments specified when the function was called.
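One way to inspect these attributes after training is with GET_MODEL_SUMMARY and GET_MODEL_ATTRIBUTE. The sketch below assumes the model myLinearRegModel from the Examples section already exists. The attribute names reported by your Vertica version may differ from the labels listed above, so the second call lists them rather than assuming a name.

```sql
-- Print a readable summary of the model, including coefficients.
SELECT GET_MODEL_SUMMARY(USING PARAMETERS model_name='myLinearRegModel');

-- List the attributes available for this model. To retrieve one of them,
-- add attr_name='<attribute>' using a name returned by this call.
SELECT GET_MODEL_ATTRIBUTE(USING PARAMETERS model_name='myLinearRegModel');
```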

Examples

=> SELECT LINEAR_REG('myLinearRegModel', 'faithful', 'eruptions', 'waiting'
                      USING PARAMETERS optimizer='BFGS');
         LINEAR_REG
----------------------------
 Finished in 10 iterations

(1 row)
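After training, the model can be applied with PREDICT_LINEAR_REG. The following is a minimal sketch that scores the same faithful relation used for training; in practice you would typically predict on a separate test relation.

```sql
-- Apply the trained model to the 'waiting' predictor column and
-- show the prediction next to the observed response.
SELECT waiting,
       eruptions,
       PREDICT_LINEAR_REG(waiting
                          USING PARAMETERS model_name='myLinearRegModel') AS predicted_eruptions
FROM faithful
ORDER BY waiting;
```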