LINEAR_REG
Executes linear regression on an input relation, and returns a linear regression model.
You can export the resulting linear regression model in VERTICA_MODELS or PMML format to apply it to data outside Vertica. You can also train a linear regression model elsewhere, then import it to Vertica in PMML format to use it on data inside Vertica.
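As a rough sketch of the export and import round trip described above, the statements below assume the model name used in the example on this page (myLinearRegModel) and a hypothetical directory, /home/dbadmin/models. The paths are illustrative only; check the EXPORT_MODELS and IMPORT_MODELS reference pages for the options available in your version.

```sql
-- Export an existing model in PMML format.
-- '/home/dbadmin/models' is a hypothetical output directory.
SELECT EXPORT_MODELS('/home/dbadmin/models', 'myLinearRegModel'
                     USING PARAMETERS category='PMML');

-- Import a PMML model that was trained elsewhere.
-- The source path is hypothetical. Model names must be unique within a
-- schema, so the imported model's name must not already be in use.
SELECT IMPORT_MODELS('/home/dbadmin/models/myLinearRegModel'
                     USING PARAMETERS category='PMML');
```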
This is a meta-function. You must call meta-functions in a top-level SELECT statement.
Behavior type
Volatile
Syntax
LINEAR_REG ( 'modelname', 'inputrelation', 'responsecolumn', 'predictorcolumns'
[ USING PARAMETERS
[exclude_columns = 'excludedcolumns']
[, optimizer = 'optimizermethod']
[, regularization = 'regularizationmethod']
[, epsilon = epsilonvalue]
[, max_iterations = iterations]
[, lambda = lambdavalue]
[, alpha = alphavalue]
[, fit_intercept = booleanvalue] ] )
Arguments
modelname
 Identifies the model to create, where modelname conforms to conventions described in Identifiers. It must also be unique among all names of sequences, tables, projections, views, and models within the same schema.
inputrelation
 Table or view that contains the training data for building the model. If the input relation is defined in Hive, use SYNC_WITH_HCATALOG_SCHEMA to sync the hcatalog schema, and then run the machine learning function.
responsecolumn
 Name of the input column that represents the dependent variable or outcome. All values in this column must be numeric; otherwise the model is invalid.
predictorcolumns
Comma-separated list of columns in the input relation that represent independent variables for the model, or asterisk (*) to select all columns. If you select all columns, the argument list for parameter exclude_columns must include responsecolumn, and any columns that are invalid as predictor columns.
All predictor columns must be of type numeric or BOOLEAN; otherwise the model is invalid.
Note
All BOOLEAN predictor values are converted to FLOAT values before training: 0 for false, 1 for true. No type checking occurs during prediction, so you can use a BOOLEAN predictor column in training, and during prediction provide a FLOAT column of the same name. In this case, all FLOAT values must be either 0 or 1.
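To illustrate the asterisk form of predictorcolumns, here is a minimal sketch. The table house_sales, its columns price and listing_id, and the model name housePriceModel are hypothetical names invented for this example.

```sql
-- Hypothetical table and column names, for illustration only.
-- '*' selects every column, so the response column (price) and a
-- column that is not a valid predictor (listing_id) are excluded explicitly.
SELECT LINEAR_REG('housePriceModel', 'house_sales', 'price', '*'
                  USING PARAMETERS exclude_columns='price, listing_id');
```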
Parameters
exclude_columns
 Comma-separated list of columns from predictorcolumns to exclude from processing.
optimizer
 Optimizer method used to train the model, one of the following:
Newton
BFGS
CGD
Important
If you select CGD, regularizationmethod must be set to L1 or ENet; otherwise the function returns an error.
Default: CGD if regularizationmethod is set to L1 or ENet, otherwise Newton.
regularization
 Method of regularization, one of the following:

None (default)
L1
L2
ENet
epsilon
FLOAT in the range (0.0, 1.0), the error value at which to stop training. Training stops if either the difference between the actual and predicted values is less than or equal to epsilon or the number of iterations exceeds max_iterations.
Default: 1e-6
max_iterations
INTEGER in the range (0, 1000000), the maximum number of training iterations. Training stops if either the number of iterations exceeds max_iterations or the difference between the actual and predicted values is less than or equal to epsilon.
Default: 100
lambda
 Integer ≥ 0, specifies the value of the regularization parameter.
Default: 1
alpha
 Number in the range [0, 1], specifies the value of the ENet regularization parameter, which defines how much L1 versus L2 regularization to provide. A value of 1 is equivalent to L1 and a value of 0 is equivalent to L2.
Default: 0.5
fit_intercept
 Boolean, specifies whether the model includes an intercept. If set to false, no intercept is used in training the model. Note that setting fit_intercept to false does not work well with the BFGS optimizer.
Default: True
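The sketch below shows how the regularization parameters might be combined in one call. It reuses the faithful table from the example on this page; the model name myEnetModel is hypothetical, and the numeric values are simply the documented defaults written out explicitly.

```sql
-- Elastic net regularization. CGD is already the default optimizer when
-- regularization is L1 or ENet; it is spelled out here for clarity.
-- 'myEnetModel' is a hypothetical model name.
SELECT LINEAR_REG('myEnetModel', 'faithful', 'eruptions', 'waiting'
                  USING PARAMETERS regularization='ENet', optimizer='CGD',
                                   lambda=1, alpha=0.5,
                                   epsilon=1e-6, max_iterations=100);
```

Setting alpha closer to 1 shifts the penalty toward L1, and closer to 0 shifts it toward L2, as described for the alpha parameter.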
Model attributes
data
 The data for the function, including:

coeffNames: Names of the coefficients, starting with the intercept, followed by the names of the predictors in the order specified in the call.
coeff: Vector of estimated coefficients, in the same order as coeffNames
stdErr: Vector of the standard errors of the coefficients, in the same order as coeffNames
zValue (for logistic regression): Vector of z-values of the coefficients, in the same order as coeffNames
tValue (for linear regression): Vector of t-values of the coefficients, in the same order as coeffNames
pValue: Vector of p-values of the coefficients, in the same order as coeffNames

regularization
 Type of regularization to use when training the model.
lambda
 Regularization parameter. Higher values enforce stronger regularization. This value must be nonnegative.
alpha
 Elastic net mixture parameter.
iterations
 Number of iterations that actually occur for convergence before exceeding max_iterations.
skippedRows
 Number of rows of the input relation that were skipped because they contained an invalid value.
processedRows
 Total number of input relation rows minus skippedRows.
callStr
 Value of all input arguments specified when the function was called.
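One way to inspect these attributes after training is sketched below, assuming the model myLinearRegModel from the example on this page already exists. The attribute names that Vertica reports can differ between versions, so the first query lists them instead of assuming a particular name.

```sql
-- List the attributes stored with the model.
SELECT GET_MODEL_ATTRIBUTE(USING PARAMETERS model_name='myLinearRegModel');

-- Print a formatted summary of the model, including its coefficients.
SELECT GET_MODEL_SUMMARY(USING PARAMETERS model_name='myLinearRegModel');
```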
Examples
=> SELECT LINEAR_REG('myLinearRegModel', 'faithful', 'eruptions', 'waiting'
                      USING PARAMETERS optimizer='BFGS', fit_intercept=true);
        LINEAR_REG
---------------------------
 Finished in 10 iterations
(1 row)
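As a possible follow-up to the example above, the trained model can be applied with PREDICT_LINEAR_REG. This is a sketch only: the alias predicted_eruptions is a hypothetical name, and no output is shown because it depends on the data.

```sql
-- Apply the trained model to the predictor column.
-- 'predicted_eruptions' is a hypothetical output alias.
SELECT waiting, eruptions,
       PREDICT_LINEAR_REG(waiting
                          USING PARAMETERS model_name='myLinearRegModel')
         AS predicted_eruptions
FROM faithful
LIMIT 5;
```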