This is the multi-page printable view of this section.
Click here to print.
Return to the regular view of this page.
Machine learning algorithms
Vertica supports a full range of machine learning functions that train a model on a set of data, and return a model that can be saved for later execution.
Vertica supports a full range of machine learning functions that train a model on a set of data, and return a model that can be saved for later execution.
These functions require the following privileges for non-superusers:
Note
Machine learning algorithms contain a subset of four classification functions:
1 - AUTOREGRESSOR
Creates an autoregressive (AR) model from a stationary time series with consistent timesteps that can then be used for prediction via PREDICT_AR.
Creates an autoregressive (AR) model from a stationary time series with consistent timesteps that can then be used for prediction via PREDICT_AUTOREGRESSOR.
Autoregressive models predict future values of a time series based on the preceding values. More specifically, the user-specified lag determines how many previous timesteps it takes into account during computation, and predicted values are linear combinations of the values at each lag.
Since its input data must be sorted by timestamp, this algorithm is single-threaded.
This is a meta-function. You must call meta-functions in a top-level SELECT statement.
Behavior type
Volatile
Syntax
AUTOREGRESSOR ('model-name', 'input-relation', 'data-column', 'timestamp-column'
[ USING PARAMETERS
[ p = lags ]
[, missing = "imputation-method" ]
[, regularization = "regularization-method" ]
[, lambda = regularization-value ]
[, compute_mse = boolean ]
] )
Arguments
*
model-name*
- Identifies the model to create, where
model-name
conforms to conventions described in Identifiers. It must also be unique among all names of sequences, tables, projections, views, and models within the same schema.
*
input-relation*
- The table or view containing the
timestamp-column
.
This algorithm expects a stationary time series as input; using a time series with a mean that shifts over time may lead to weaker results.
*
data-column*
- An input column of type NUMERIC that contains the dependent variables or outcomes.
*
timestamp-column*
- One INTEGER, FLOAT, or TIMESTAMP column that represents the timestamp variable. Timesteps must be consistent.
Parameters
p
- INTEGER in the range [1, 1999], the number of lags to consider in the computation. Larger values for
p
weaken the correlation.
Default: 3
missing
- One of the following methods for handling missing values:
-
drop: Missing values are ignored.
-
error: Missing values raise an error.
-
zero: Missing values are replaced with 0.
-
linear_interpolation: Missing values are replaced by linearly-interpolated values based on the nearest valid entries before and after the missing value. This means that in cases where the first or last values in a dataset are missing, they will simply be dropped.
Default: linear_interpolation
regularization
- One of the following regularization methods used when fitting the data:
Default: None
lambda
- FLOAT in the range [0, 100000], the regularization value, lambda.
Default: 1.0
compute_mse
- BOOLEAN, whether to calculate and output the mean squared error (MSE).
Default: False
Examples
See Autoregressive model example.
See also
2 - BISECTING_KMEANS
Executes the bisecting k‑means algorithm on an input relation.
Executes the bisecting k-means algorithm on an input relation. The result is a trained model with a hierarchy of cluster centers, with a range of k values, each of which can be used for prediction.
This is a meta-function. You must call meta-functions in a top-level SELECT statement.
Behavior type
Volatile
Syntax
BISECTING_KMEANS('model-name', 'input-relation', 'input-columns', 'num-clusters'
[ USING PARAMETERS
[exclude_columns = 'exclude-columns']
[, bisection_iterations = bisection-iterations]
[, split_method = 'split-method']
[, min_divisible_cluster_size = min-cluster-size]
[, kmeans_max_iterations = kmeans-max-iterations]
[, kmeans_epsilon = kmeans-epsilon]
[, kmeans_center_init_method = 'kmeans-init-method']
[, distance_method = 'distance-method']
[, output_view = 'output-view']
[, key_columns = 'key-columns'] ] )
Arguments
model-name
- Identifies the model to create, where
model-name
conforms to conventions described in Identifiers. It must also be unique among all names of sequences, tables, projections, views, and models within the same schema.
input-relation
- Table or view that contains the input data for k-means. If the input relation is defined in Hive, use
SYNC_WITH_HCATALOG_SCHEMA
to sync the hcatalog
schema, and then run the machine learning function.
input-columns
- Comma-separated list of columns to use from the input relation, or asterisk (*) to select all columns. Input columns must be of data type numeric.
num-clusters
- Number of clusters to create, an integer ≤ 10,000. This argument represents the
k
in k-means.
Parameters
exclude_columns
Comma-separated list of column names from input-columns
to exclude from processing.
bisection_iterations
- Integer between 1 - 1MM inclusive, specifies number of iterations the bisecting k-means algorithm performs for each bisection step. This corresponds to how many times a standalone k-means algorithm runs in each bisection step.
A setting >1 allows the algorithm to run and choose the best k-means run within each bisection step. If you use kmeanspp, the value of bisection_iterations
is always 1, because kmeanspp is more costly to run but also better than the alternatives, so it does not require multiple runs.
Default: 1
split_method
- The method used to choose a cluster to bisect/split, one of:
Default: sum_squares
min_divisible_cluster_size
- Integer ≥ 2, specifies minimum number of points of a divisible cluster.
Default: 2
kmeans_max_iterations
- Integer between 1 and 1MM inclusive, specifies the maximum number of iterations the k-means algorithm performs. If you set this value to a number lower than the number of iterations needed for convergence, the algorithm might not converge.
Default: 10
kmeans_epsilon
- Integer between 1 and 1MM inclusive, determines whether the k-means algorithm has converged. The algorithm is considered converged after no center has moved more than a distance of
epsilon
from the previous iteration.
Default: 1e-4
kmeans_center_init_method
- The method used to find the initial cluster centers in k-means, one of:
-
kmeanspp
(default): kmeans++ algorithm
-
pseudo
: Uses "pseudo center" approach used by Spark, bisects given center without iterating over points
distance_method
- The measure for distance between two data points. Only Euclidean distance is supported at this time.
Default: euclidean
output_view
- Name of the view where you save the assignment of each point to its cluster. You must have CREATE privileges on the view schema.
key_columns
- Comma-separated list of column names that identify the output rows. Columns must be in the
input-columns
argument list. To exclude these and other input columns from being used by the algorithm, list them in parameter exclude_columns
.
Model attributes
centers
- A list of centers of the K centroids.
hierarchy
- The hierarchy of K clusters, including:
-
ParentCluster: Parent cluster centroid of each centroid—that is, the centroid of the cluster from which a cluster is obtained by bisection.
-
LeftChildCluster: Left child cluster centroid of each centroid—that is, the centroid of the first sub-cluster obtained by bisecting a cluster.
-
RightChildCluster: the right child cluster centroid of each centroid—that is, the centroid of the second sub-cluster obtained by bisecting a cluster.
-
BisectionLevel: Specifies which bisection step a cluster is obtained from.
-
WithinSS: Within-cluster sum of squares for the current cluster
-
TotalWithinSS: Total within-cluster sum of squares of leaf clusters thus far obtained.
metrics
- Several metrics related to the quality of the clustering, including
-
Total sum of squares
-
Total within-cluster sum of squares
-
Between-cluster sum of squares
-
Between-cluster sum of squares / Total sum of squares
-
Sum of squares for cluster x
, center_id y
[...]
Examples
SELECT BISECTING_KMEANS('myModel', 'iris1', '*', '5'
USING PARAMETERS exclude_columns = 'Species,id', split_method ='sum_squares', output_view = 'myBKmeansView');
See also
3 - KMEANS
Executes the k-means algorithm on an input relation.
Executes the k-means algorithm on an input relation. The result is a model with a list of cluster centers.
You can export the resulting k-means model in VERTICA_MODELS or PMML format to apply it on data outside Vertica. You can also train a k-means model elsewhere, then import it to Vertica in PMML format to predict on data in Vertica.
This is a meta-function. You must call meta-functions in a top-level SELECT statement.
Behavior type
Volatile
Syntax
KMEANS ( 'model-name', 'input-relation', 'input-columns', 'num-clusters'
[ USING PARAMETERS
[exclude_columns = 'excluded-columns']
[, max_iterations = max-iterations]
[, epsilon = epsilon-value]
[, { init_method = 'init-method' } | { initial_centers_table = 'init-table' } ]
[, output_view = 'output-view']
[, key_columns = 'key-columns'] ] )
Arguments
model-name
- Identifies the model to create, where
model-name
conforms to conventions described in Identifiers. It must also be unique among all names of sequences, tables, projections, views, and models within the same schema.
input-relation
- The table or view that contains the input data for k-means. If the input relation is defined in Hive, use
SYNC_WITH_HCATALOG_SCHEMA
to sync the hcatalog
schema, and then run the machine learning function.
input-columns
- Comma-separated list of columns to use from the input relation, or asterisk (*) to select all columns. Input columns must be of data type numeric.
num-clusters
- The number of clusters to create, an integer ≤ 10,000. This argument represents the
k
in k-means.
Parameters
Important
Parameters init_method
and initial_centers_table
are mutually exclusive. If you set both, the function returns an error.
exclude_columns
Comma-separated list of column names from input-columns
to exclude from processing.
max_iterations
- The maximum number of iterations the algorithm performs. If you set this value to a number lower than the number of iterations needed for convergence, the algorithm may not converge.
Default: 10
epsilon
- Determines whether the algorithm has converged. The algorithm is considered converged after no center has moved more than a distance of
'epsilon' from the previous iteration.
Default: 1e-4
init_method
- The method used to find the initial cluster centers, one of the following:
-
random
-
kmeanspp
(default): kmeans++ algorithm
This value can be memory intensive for high k. If the function returns an error that not enough memory is available, decrease the value of k or use the random
method.
initial_centers_table
- The table with the initial cluster centers to use. Supply this value if you know the initial centers to use and do not want Vertica to find the initial cluster centers for you.
output_view
- The name of the view where you save the assignments of each point to its cluster. You must have CREATE privileges on the schema where the view is saved.
key_columns
- Comma-separated list of column names from
input-columns
that will appear as the columns of output_view
. These columns should be picked such that their contents identity each input data point. This parameter is only used if output_view
is specified. Columns listed in input-columns
that are only meant to be used as key_columns
and not for training should be listed in exclude_columns
.
Model attributes
centers
- A list that contains the center of each cluster.
metrics
- A string summary of several metrics related to the quality of the clustering.
Examples
The following example creates k-means model myKmeansModel
and applies it to input table iris1
. The call to APPLY_KMEANS
mixes column names and constants. When a constant is passed in place of a column name, the constant is substituted for the value of the column in all rows:
=> SELECT KMEANS('myKmeansModel', 'iris1', '*', 5
USING PARAMETERS max_iterations=20, output_view='myKmeansView', key_columns='id', exclude_columns='Species, id');
KMEANS
----------------------------
Finished in 12 iterations
(1 row)
=> SELECT id, APPLY_KMEANS(Sepal_Length, 2.2, 1.3, Petal_Width
USING PARAMETERS model_name='myKmeansModel', match_by_pos='true') FROM iris2;
id | APPLY_KMEANS
-----+--------------
5 | 1
10 | 1
14 | 1
15 | 1
21 | 1
22 | 1
24 | 1
25 | 1
32 | 1
33 | 1
34 | 1
35 | 1
38 | 1
39 | 1
42 | 1
...
(60 rows)
See also
4 - LINEAR_REG
Executes linear regression on an input relation, and returns a linear regression model.
Executes linear regression on an input relation, and returns a linear regression model.
You can export the resulting linear regression model in VERTICA_MODELS or PMML format to apply it on data outside Vertica. You can also train a linear regression model elsewhere, then import it to Vertica in PMML format to predict on data in Vertica.
This is a meta-function. You must call meta-functions in a top-level SELECT statement.
Behavior type
Volatile
Syntax
LINEAR_REG ( 'model-name', 'input-relation', 'response-column', 'predictor-columns'
[ USING PARAMETERS
[exclude_columns = 'excluded-columns']
[, optimizer = 'optimizer-method']
[, regularization = 'regularization-method']
[, epsilon = epsilon-value]
[, max_iterations = iterations]
[, lambda = lamda-value]
[, alpha = alpha-value] ] )
Arguments
model-name
- Identifies the model to create, where
model-name
conforms to conventions described in Identifiers. It must also be unique among all names of sequences, tables, projections, views, and models within the same schema.
input-relation
- The table or view that contains the training data for building the model. If the input relation is defined in Hive, use
SYNC_WITH_HCATALOG_SCHEMA
to sync the hcatalog
schema, and then run the machine learning function.
response-column
- Name of the input column that represents the dependent variable or outcome. All values in this column must be numeric, otherwise the model is invalid.
predictor-columns
Comma-separated list of columns in the input relation that represent independent variables for the model, or asterisk (*) to select all columns. If you select all columns, the argument list for parameter exclude_columns
must include response-column
, and any columns that are invalid as predictor columns.
All predictor columns must be of type numeric or BOOLEAN; otherwise the model is invalid.
Note
All BOOLEAN predictor values are converted to FLOAT values before training: 0 for false, 1 for true. No type checking occurs during prediction, so you can use a BOOLEAN predictor column in training, and during prediction provide a FLOAT column of the same name. In this case, all FLOAT values must be either 0 or 1.
Parameters
exclude_columns
- Comma-separated list of columns from
predictor-columns
to exclude from processing.
optimizer
- The optimizer method used to train the model, one of the following:
Default: CGD
if regularization-method
is set to L1
or ENet
, otherwise Newton
.
regularization
- Specifies the method of regularization, one of the following:
-
None
(default)
-
L1
-
L2
-
ENet
epsilon
FLOAT in the range (0.0, 1.0), the error value at which to stop training. Training stops if either the difference between the actual and predicted values is less than or equal to epsilon
or if the number of iterations exceeds max_iterations
.
Default: 1e-6
max_iterations
INTEGER in the range (0, 1000000), the maximum number of training iterations. Training stops if either the number of iterations exceeds max_iterations
or if the difference between the actual and predicted values is less than or equal to epsilon
.
Default: 100
lambda
- Integer ≥ 0, specifies the value of the
regularization
parameter.
Default: 1
alpha
- Integer ≥ 0, specifies the value of the ENET
regularization
parameter, which defines how much L1 versus L2 regularization to provide. A value of 1 is equivalent to L1 and a value of 0 is equivalent to L2.
Value range: [0,1]
Default: 0.5
Model attributes
data
- The data for the function, including:
-
coeffNames
: Name of the coefficients. This starts with intercept and then follows with the names of the predictors in the same order specified in the call.
-
coeff
: Vector of estimated coefficients, with the same order as coeffNames
-
stdErr
: Vector of the standard error of the coefficients, with the same order as coeffNames
-
zValue
(for logistic regression): Vector of z-values of the coefficients, in the same order as coeffNames
-
tValue
(for linear regression): Vector of t-values of the coefficients, in the same order as coeffNames
-
pValue
: Vector of p-values of the coefficients, in the same order as coeffNames
regularization
- Type of regularization to use when training the model.
lambda
- Regularization parameter. Higher values enforce stronger regularization. This value must be nonnegative.
alpha
- Elastic net mixture parameter.
iterations
- Number of iterations that actually occur for the convergence before exceeding
max_iterations
.
skippedRows
- Number of rows of the input relation that were skipped because they contained an invalid value.
processedRows
- Total number of input relation rows minus
skippedRows
.
callStr
- Value of all input arguments specified when the function was called.
Examples
=> SELECT LINEAR_REG('myLinearRegModel', 'faithful', 'eruptions', 'waiting'
USING PARAMETERS optimizer='BFGS');
LINEAR_REG
----------------------------
Finished in 10 iterations
(1 row)
See also
5 - LOGISTIC_REG
Executes logistic regression on an input relation.
Executes logistic regression on an input relation. The result is a logistic regression model.
You can export the resulting logistic regression model in VERTICA_MODELS or PMML format to apply it on data outside Vertica. You can also train a logistic regression model elsewhere, then import it to Vertica in PMML format to predict on data in Vertica.
This is a meta-function. You must call meta-functions in a top-level SELECT statement.
Behavior type
Volatile
Syntax
LOGISTIC_REG ( 'model-name', 'input-relation', 'response-column', 'predictor-columns'
[ USING PARAMETERS [exclude_columns = 'excluded-columns']
[, optimizer = 'optimizer-method']
[, regularization = 'regularization-method']
[, epsilon = epsilon-value]
[, max_iterations = iterations]
[, lambda = lamda-value]
[, alpha = alpha-value] ] )
Arguments
model-name
- Identifies the model to create, where
model-name
conforms to conventions described in Identifiers. It must also be unique among all names of sequences, tables, projections, views, and models within the same schema.
input-relation
- The table or view that contains the training data for building the model. If the input relation is defined in Hive, use
SYNC_WITH_HCATALOG_SCHEMA
to sync the hcatalog
schema, and then run the machine learning function.
response-column
- The input column that represents the dependent variable or outcome. The column value must be 0 or 1, and of type numeric or BOOLEAN. The function automatically skips all other values.
predictor-columns
Comma-separated list of columns in the input relation that represent independent variables for the model, or asterisk (*) to select all columns. If you select all columns, the argument list for parameter exclude_columns
must include response-column
, and any columns that are invalid as predictor columns.
All predictor columns must be of type numeric or BOOLEAN; otherwise the model is invalid.
Note
All BOOLEAN predictor values are converted to FLOAT values before training: 0 for false, 1 for true. No type checking occurs during prediction, so you can use a BOOLEAN predictor column in training, and during prediction provide a FLOAT column of the same name. In this case, all FLOAT values must be either 0 or 1.
Parameters
exclude_columns
- Comma-separated list of columns from
predictor-columns
to exclude from processing.
optimizer
- The optimizer method used to train the model, one of the following:
Default: CGD
if regularization-method
is set to L1
or ENet
, otherwise Newton
.
regularization
- Specifies the method of regularization, one of the following:
-
None
(default)
-
L1
-
L2
-
ENet
epsilon
FLOAT in the range (0.0, 1.0), the error value at which to stop training. Training stops if either the difference between the actual and predicted values is less than or equal to epsilon
or if the number of iterations exceeds max_iterations
.
Default: 1e-6
max_iterations
INTEGER in the range (0, 1000000), the maximum number of training iterations. Training stops if either the number of iterations exceeds max_iterations
or if the difference between the actual and predicted values is less than or equal to epsilon
.
Default: 100
lambda
- Integer ≥ 0, specifies the value of the ENET
regularization
parameter, which defines how much L1 versus L2 regularization to provide. A value of 1 is equivalent to L1 and a value of 0 is equivalent to L2.
Default: 1
alpha
- ENet mixture parameter that specifies how much regularization to provide.
Value range: [0,1]
Default: 0.5
Model attributes
data
- The data for the function, including:
-
coeffNames
: Name of the coefficients. This starts with intercept and then follows with the names of the predictors in the same order specified in the call.
-
coeff
: Vector of estimated coefficients, with the same order as coeffNames
-
stdErr
: Vector of the standard error of the coefficients, with the same order as coeffNames
-
zValue
(for logistic regression): Vector of z-values of the coefficients, in the same order as coeffNames
-
tValue
(for linear regression): Vector of t-values of the coefficients, in the same order as coeffNames
-
pValue
: Vector of p-values of the coefficients, in the same order as coeffNames
regularization
- Type of regularization to use when training the model.
lambda
- Regularization parameter. Higher values enforce stronger regularization. This value must be nonnegative.
alpha
- Elastic net mixture parameter.
iterations
- Number of iterations that actually occur for the convergence before exceeding
max_iterations
.
skippedRows
- Number of rows of the input relation that were skipped because they contained an invalid value.
processedRows
- Total number of input relation rows minus
skippedRows
.
callStr
- Value of all input arguments specified when the function was called.
Privileges
Superuser, or SELECT privileges on the input relation
Examples
=> SELECT LOGISTIC_REG('myLogisticRegModel', 'mtcars', 'am',
'mpg, cyl, disp, hp, drat, wt, qsec, vs, gear, carb'
USING PARAMETERS exclude_columns='hp', optimizer='BFGS');
LOGISTIC_REG
----------------------------
Finished in 20 iterations
(1 row)
See also
6 - MOVING_AVERAGE
Creates a moving-average (MA) model from a stationary time series with consistent timesteps that can then be used for prediction via PREDICT_MOVING_AVERAGE.
Creates a moving-average (MA) model from a stationary time series with consistent timesteps that can then be used for prediction via PREDICT_MOVING_AVERAGE.
Moving average models use the errors of previous predictions to make future predictions. More specifically, the user-specified lag determines how many previous predictions and errors it takes into account during computation.
Since its input data must be sorted by timestamp, this algorithm is single-threaded.
This is a meta-function. You must call meta-functions in a top-level SELECT statement.
Behavior type
Volatile
Syntax
MOVING_AVERAGE ('model-name', 'input-relation', 'data-column', 'timestamp-column'
[ USING PARAMETERS
[ q = lags ]
[, missing = "imputation-method" ]
[, regularization = "regularization-method" ]
[, lambda = regularization-value ]
[, compute_mse = boolean ]
] )
Arguments
model-name
- Identifies the model to create, where
model-name
conforms to conventions described in Identifiers. It must also be unique among all names of sequences, tables, projections, views, and models within the same schema.
input-relation
- The table or view containing the
timestamp-column
.
This algorithm expects a stationary time series as input; using a time series with a mean that shifts over time may lead to weaker results.
data-column
- An input column of type NUMERIC that contains the dependent variables or outcomes.
timestamp-column
- One INTEGER, FLOAT, or TIMESTAMP column that represent the timestamp variable. Timesteps must be consistent.
Parameters
q
- INTEGER in the range [1, 67), the number of lags to consider in the computation.
Default: 1
missing
- One of the following methods for handling missing values:
-
drop: Missing values are ignored.
-
error: Missing values raise an error.
-
zero: Missing values are replaced with 0.
-
linear_interpolation: Missing values are replaced by linearly interpolated values based on the nearest valid entries before and after the missing value. This means that in cases where the first or last values in a dataset are missing, they will simply be dropped.
Default: linear_interpolation
regularization
- One of the following regularization methods used when fitting the data:
Default: None
lambda
- FLOAT in the range [0, 100000], the regularization value, lambda.
Default: 1.0
compute_mse
- BOOLEAN, whether to calculate and output the mean squared error (MSE).
This parameter only accepts "true" or "false" rather than the standard literal equivalents for BOOLEANs like 1 or 0.
Default: False
Examples
See Moving-average model example.
See also
7 - NAIVE_BAYES
Executes the Naive Bayes algorithm on an input relation and returns a Naive Bayes model.
Executes the Naive Bayes algorithm on an input relation and returns a Naive Bayes model.
Columns are treated according to data type:
-
FLOAT: Values are assumed to follow some Gaussian distribution.
-
INTEGER: Values are assumed to belong to one multinomial distribution.
-
CHAR/VARCHAR: Values are assumed to follow some categorical distribution. The string values stored in these columns must not be greater than 128 characters.
-
BOOLEAN: Values are treated as categorical with two values.
This is a meta-function. You must call meta-functions in a top-level SELECT statement.
Behavior type
Volatile
Syntax
NAIVE_BAYES ( 'model-name', 'input-relation', 'response-column', 'predictor-columns'
[ USING PARAMETERS [exclude_columns = 'excluded-columns'] [, alpha = alpha-value] ] )
Arguments
model-name
- Identifies the model to create, where
model-name
conforms to conventions described in Identifiers. It must also be unique among all names of sequences, tables, projections, views, and models within the same schema.
input-relation
- The table or view that contains the training data for building the model. If the input relation is defined in Hive, use
SYNC_WITH_HCATALOG_SCHEMA
to sync the hcatalog
schema, and then run the machine learning function.
response-column
- Name of the input column that represents the dependent variable, or outcome. This column must contain discrete labels that represent different class labels.
The response column must be of type numeric, CHAR/VARCHAR, or BOOLEAN; otherwise the model is invalid.
Note
Vertica automatically casts
numeric response column values to VARCHAR.
predictor-columns
Comma-separated list of columns in the input relation that represent independent variables for the model, or asterisk (*) to select all columns. If you select all columns, the argument list for parameter exclude_columns
must include response-column
, and any columns that are invalid as predictor columns.
All predictor columns must be of type numeric, CHAR/VARCHAR, or BOOLEAN; otherwise the model is invalid. BOOLEAN column values are converted to FLOAT values before training: 0 for false, 1 for true.
Parameters
exclude_columns
- Comma-separated list of columns from
predictor-columns
to exclude from processing.
alpha
- Float, specifies use of Laplace smoothing if the event model is categorical, multinomial, or Bernoulli.
Default: 1.0
Model attributes
colsInfo
- The information from the response and predictor columns used in training:
-
index: The index (starting at 0) of the column as provided in training. Index 0 is used for the response column.
-
name: The column name.
-
type: The label used for response with a value of Gaussian, Multinominal, Categorical, or Bernoulli.
alpha
- The smooth parameter value.
prior
- The percentage of each class among all training samples:
nRowsTotal
- The number of samples accepted for training from the data set.
nRowsRejected
- The number of samples rejected for training.
callStr
- The SQL statement used to replicate the training.
Gaussian
- The Gaussian model conditioned on the class indicated by the class_name:
-
index: The index of the predictor column.
-
mu: The mean value of the model.
-
sigmaSq: The squared standard deviation of the model.
Multinominal
- The Multinomial model conditioned on the class indicated by the class_name:
Bernoulli
- The Bernoulli model conditioned on the class indicated by the class_name:
Categorical
- The Gaussian model conditioned on the class indicated by the class_name:
Privileges
Superuser, or SELECT privileges on the input relation.
Examples
=> SELECT NAIVE_BAYES('naive_house84_model', 'house84_train', 'party', '*'
USING PARAMETERS exclude_columns='party, id');
NAIVE_BAYES
--------------------------------------------------
Finished. Accepted Rows: 324 Rejected Rows: 0
(1 row)
See also
8 - RF_CLASSIFIER
Trains a random forest model for classification on an input relation.
Trains a random forest model for classification on an input relation.
This is a meta-function. You must call meta-functions in a top-level SELECT statement.
Behavior type
Volatile
Syntax
RF_CLASSIFIER ( 'model-name', input-relation, 'response-column', 'predictor-columns'
[ USING PARAMETERS
[exclude_columns = 'excluded-columns']
[, ntree = num-trees]
[, mtry = num-features]
[, sampling_size = sampling-size]
[, max_depth = depth]
[, max_breadth = breadth]
[, min_leaf_size = leaf-size]
[, min_info_gain = threshold]
[, nbins = num-bins] ] )
Arguments
model-name
- Identifies the model stored as a result of the training, where
model-name
conforms to conventions described in Identifiers. It must also be unique among all names of sequences, tables, projections, views, and models within the same schema.
input-relation
- The table or view that contains the training samples. If the input relation is defined in Hive, use
SYNC_WITH_HCATALOG_SCHEMA
to sync the hcatalog
schema, and then run the machine learning function.
response-column
- An input column of type numeric, CHAR/VARCHAR, or BOOLEAN that represents the dependent variable.
Note
Vertica automatically casts
numeric response column values to VARCHAR.
predictor-columns
Comma-separated list of columns in the input relation that represent independent variables for the model, or asterisk (*) to select all columns. If you select all columns, the argument list for parameter exclude_columns
must include response-column
, and any columns that are invalid as predictor columns.
All predictor columns must be of type numeric, CHAR/VARCHAR, or BOOLEAN; otherwise the model is invalid.
Vertica XGBoost and Random Forest algorithms offer native support for categorical columns (BOOL/VARCHAR). Simply pass the categorical columns as predictors to the models and the algorithm will automatically treat the columns as categorical and will not attempt to split them into bins in the same manner as numerical columns; Vertica treats these columns as true categorical values and does not simply cast them to continuous values under-the-hood.
Parameters
exclude_columns
Comma-separated list of column names from input-columns
to exclude from processing.
ntree
Integer in the range [1, 1000], the number of trees in the forest.
Default: 20
mtry
- Integer in the range [1,
number-predictors
], the number of randomly chosen features from which to pick the best feature to split on a given tree node.
Default: Square root of the total number of predictors
sampling_size
Float in the range (0.0, 1.0], the portion of the input data set that is randomly picked for training each tree.
Default: 0.632
max_depth
Integer in the range [1, 100], the maximum depth for growing each tree. For example, a max_depth
of 0 represents a tree with only a root node, and a max_depth
of 2 represents a tree with four leaf nodes.
Default: 5
max_breadth
Integer in the range [1, 1e9], the maximum number of leaf nodes a tree can have.
Default: 32
min_leaf_size
Integer in the range [1, 1e6], the minimum number of samples each branch must have after splitting a node. A split that results in fewer remaining samples in its left or right branch is be discarded, and the node is treated as a leaf node.
Default: 1
min_info_gain
Float in the range [0.0, 1.0), the minimum threshold for including a split. A split with information gain less than this threshold is discarded.
Default: 0.0
nbins
Integer in the range [2, 1000], the number of bins to use for discretizing continuous features.
Default: 32
Model attributes
data
- Data for the function, including:
ntree
- Number of trees in the model.
skippedRows
- Number of rows in
input_relation
that were skipped because they contained an invalid value.
processedRows
- Total number of rows in
input_relation
minus skippedRows
.
callStr
- Value of all input arguments that were specified at the time the function was called.
Examples
=> SELECT RF_CLASSIFIER ('myRFModel', 'iris', 'Species', 'Sepal_Length, Sepal_Width,
Petal_Length, Petal_Width' USING PARAMETERS ntree=100, sampling_size=0.3);
RF_CLASSIFIER
--------------------------------------------------
Finished training
(1 row)
See also
9 - RF_REGRESSOR
Trains a random forest model for regression on an input relation.
Trains a random forest model for regression on an input relation.
This is a meta-function. You must call meta-functions in a top-level SELECT statement.
Behavior type
Volatile
Syntax
RF_REGRESSOR ( 'model-name', input-relation, 'response-column', 'predictor-columns'
[ USING PARAMETERS
[exclude_columns = 'excluded-columns']
[, ntree = num-trees]
[, mtry = num-features]
[, sampling_size = sampling-size]
[, max_depth = depth]
[, max_breadth = breadth]
[, min_leaf_size = leaf-size]
[, min_info_gain = threshold]
[, nbins = num-bins] ] )
Arguments
model-name
- The model that is stored as a result of training, where
model-name
conforms to conventions described in Identifiers. It must also be unique among all names of sequences, tables, projections, views, and models within the same schema.
input-relation
- The table or view that contains the training samples. If the input relation is defined in Hive, use
SYNC_WITH_HCATALOG_SCHEMA
to sync the hcatalog
schema, and then run the machine learning function.
response-column
- A numeric input column that represents the dependent variable.
predictor-columns
Comma-separated list of columns in the input relation that represent independent variables for the model, or asterisk (*) to select all columns. If you select all columns, the argument list for parameter exclude_columns
must include response-column
, and any columns that are invalid as predictor columns.
All predictor columns must be of type numeric, CHAR/VARCHAR, or BOOLEAN; otherwise the model is invalid.
Vertica XGBoost and Random Forest algorithms offer native support for categorical columns (BOOL/VARCHAR). Simply pass the categorical columns as predictors to the models and the algorithm will automatically treat the columns as categorical and will not attempt to split them into bins in the same manner as numerical columns; Vertica treats these columns as true categorical values and does not simply cast them to continuous values under-the-hood.
Parameters
exclude_columns
- Comma-separated list of columns from
predictor-columns
to exclude from processing.
ntree
Integer in the range [1, 1000], the number of trees in the forest.
Default: 20
mtry
- Integer in the range [1,
number-predictors
], the number of features to consider at the split of a tree node.
Default: One-third the total number of predictors
sampling_size
Float in the range (0.0, 1.0], the portion of the input data set that is randomly picked for training each tree.
Default: 0.632
max_depth
Integer in the range [1, 100], the maximum depth for growing each tree. For example, a max_depth
of 0 represents a tree with only a root node, and a max_depth
of 2 represents a tree with four leaf nodes.
Default: 5
max_breadth
Integer in the range [1, 1e9], the maximum number of leaf nodes a tree can have.
Default: 32
min_leaf_size
- Integer in the range [1, 1e6], the minimum number of samples each branch must have after splitting a node. A split that results in fewer remaining samples in its left or right branch is be discarded, and the node is treated as a leaf node.
The default value of this parameter differs from that of analogous parameters in libraries like sklearn and will therefore yield a model with predicted values that differ from the original response values.
Default: 5
min_info_gain
Float in the range [0.0, 1.0), the minimum threshold for including a split. A split with information gain less than this threshold is discarded.
Default: 0.0
nbins
Integer in the range [2, 1000], the number of bins to use for discretizing continuous features.
Default: 32
Model attributes
data
- Data for the function, including:
ntree
- Number of trees in the model.
skippedRows
- Number of rows in
input_relation
that were skipped because they contained an invalid value.
processedRows
- Total number of rows in
input_relation
minus skippedRows
.
callStr
- Value of all input arguments that were specified at the time the function was called.
Examples
=> SELECT RF_REGRESSOR ('myRFRegressorModel', 'mtcars', 'carb', 'mpg, cyl, hp, drat, wt' USING PARAMETERS
ntree=100, sampling_size=0.3);
RF_REGRESSOR
--------------
Finished
(1 row)
See also
10 - SVM_CLASSIFIER
Trains the SVM model on an input relation.
Trains the SVM model on an input relation.
This is a meta-function. You must call meta-functions in a top-level SELECT statement.
Behavior type
Volatile
Syntax
SVM_CLASSIFIER ( 'model-name', input-relation, 'response-column', 'predictor-columns'
[ USING PARAMETERS
[exclude_columns = 'excluded-columns']
[, C = 'cost']
[, epsilon = 'epsilon-value']
[, max_iterations = 'max-iterations']
[, class_weights = 'weight']
[, intercept_mode = 'intercept-mode']
[, intercept_scaling = 'scale'] ] )
Arguments
model-name
- Identifies the model to create, where
model-name
conforms to conventions described in Identifiers. It must also be unique among all names of sequences, tables, projections, views, and models within the same schema.
input-relation
- The table or view that contains the training data. If the input relation is defined in Hive, use
SYNC_WITH_HCATALOG_SCHEMA
to sync the hcatalog
schema, and then run the machine learning function.
response-column
- The input column that represents the dependent variable or outcome. The column value must be 0 or 1, and of type numeric or BOOLEAN, otherwise the function returns with an error.
predictor-columns
Comma-separated list of columns in the input relation that represent independent variables for the model, or asterisk (*) to select all columns. If you select all columns, the argument list for parameter exclude_columns
must include response-column
, and any columns that are invalid as predictor columns.
All predictor columns must be of type numeric or BOOLEAN; otherwise the model is invalid.
Note
All BOOLEAN predictor values are converted to FLOAT values before training: 0 for false, 1 for true. No type checking occurs during prediction, so you can use a BOOLEAN predictor column in training, and during prediction provide a FLOAT column of the same name. In this case, all FLOAT values must be either 0 or 1.
Parameters
exclude_columns
- Comma-separated list of columns from
predictor-columns
to exclude from processing.
C
- Weight for misclassification cost. The algorithm minimizes the regularization cost and the misclassification cost.
Default: 1.0
epsilon
- Used to control accuracy.
Default: 1e-3
max_iterations
- Maximum number of iterations that the algorithm performs.
Default: 100
class_weights
- Specifies how to determine weights of the two classes, one of the following:
-
None
(default): No weights are used
-
value0
, value1
: Two comma-delimited strings that specify two positive FLOAT values, where value0
assigns a weight to class 0, and value1
assigns a weight to class 1.
-
auto
: Weights each class according to the number of samples.
intercept_mode
- Specifies how to treat the intercept, one of the following:
intercept_scaling
- Float value that serves as the value of a dummy feature whose coefficient Vertica uses to calculate the model intercept. Because the dummy feature is not in the training data, its values are set to a constant, by default 1.
Model attributes
coeff
- Coefficients in the model:
nAccepted
- Number of samples accepted for training from the data set
nRejected
- Number of samples rejected when training
nIteration
- Number of iterations used in training
callStr
- SQL statement used to replicate the training
Examples
The following example uses SVM_CLASSIFIER
on the mtcars
table:
=> SELECT SVM_CLASSIFIER(
'mySvmClassModel', 'mtcars', 'am', 'mpg,cyl,disp,hp,drat,wt,qsec,vs,gear,carb'
USING PARAMETERS exclude_columns = 'hp,drat');
SVM_CLASSIFIER
----------------------------------------------------------------
Finished in 15 iterations.
Accepted Rows: 32 Rejected Rows: 0
(1 row)
See also
11 - SVM_REGRESSOR
Trains the SVM model on an input relation.
Trains the SVM model on an input relation.
This is a meta-function. You must call meta-functions in a top-level SELECT statement.
Behavior type
Volatile
Syntax
SVM_REGRESSOR ( 'model-name', input-relation, 'response-column', 'predictor-columns'
[ USING PARAMETERS
[exclude_columns = 'excluded-columns']
[, error_tolerance = error-tolerance]
[, C = cost]
[, epsilon = epsilon-value]
[, max_iterations = max-iterations]
[, intercept_mode = 'mode']
[, intercept_scaling = 'scale'] ] )
Arguments
model-name
- Identifies the model to create, where
model-name
conforms to conventions described in Identifiers. It must also be unique among all names of sequences, tables, projections, views, and models within the same schema.
input-relation
- The table or view that contains the training data. If the input relation is defined in Hive, use
SYNC_WITH_HCATALOG_SCHEMA
to sync the hcatalog
schema, and then run the machine learning function.
response-column
- An input column that represents the dependent variable or outcome. The column must be a numeric data type.
predictor-columns
Comma-separated list of columns in the input relation that represent independent variables for the model, or asterisk (*) to select all columns. If you select all columns, the argument list for parameter exclude_columns
must include response-column
, and any columns that are invalid as predictor columns.
All predictor columns must be of type numeric or BOOLEAN; otherwise the model is invalid.
Note
All BOOLEAN predictor values are converted to FLOAT values before training: 0 for false, 1 for true. No type checking occurs during prediction, so you can use a BOOLEAN predictor column in training, and during prediction provide a FLOAT column of the same name. In this case, all FLOAT values must be either 0 or 1.
Parameters
exclude_columns
- Comma-separated list of columns from
predictor-columns
to exclude from processing.
error_tolerance
- Defines the acceptable error margin. Any data points outside this region add a penalty to the cost function.
Default: 0.1
C
- The weight for misclassification cost. The algorithm minimizes the regularization cost and the misclassification cost.
Default: 1.0
epsilon
- Used to control accuracy.
Default: 1e-3
max_iterations
- The maximum number of iterations that the algorithm performs.
Default: 100
intercept_mode
- A string that specifies how to treat the intercept, one of the following
intercept_scaling
- A FLOAT value, serves as the value of a dummy feature whose coefficient Vertica uses to calculate the model intercept. Because the dummy feature is not in the training data, its values are set to a constant, by default set to 1.
Model attributes
coeff
- Coefficients in the model:
nAccepted
- Number of samples accepted for training from the data set
nRejected
- Number of samples rejected when training
nIteration
- Number of iterations used in training
callStr
- SQL statement used to replicate the training
Examples
=> SELECT SVM_REGRESSOR('mySvmRegModel', 'faithful', 'eruptions', 'waiting'
USING PARAMETERS error_tolerance=0.1, max_iterations=100);
SVM_REGRESSOR
----------------------------------------------------------------
Finished in 5 iterations.
Accepted Rows: 272 Rejected Rows: 0
(1 row)
See also
12 - XGB_CLASSIFIER
Trains an XGBoost model for classification on an input relation.
Trains an XGBoost model for classification on an input relation.
This is a meta-function. You must call meta-functions in a top-level SELECT statement.
Behavior type
Volatile
Syntax
XGB_CLASSIFIER ('model-name', 'input-relation', 'response-column', 'predictor-columns'
[ USING PARAMETERS
[ exclude_columns = 'excluded-columns' ]
[, max_ntree = max-trees ]
[, max_depth = max-depth ]
[, objective = 'optimization-strategy' ]
[, learning_rate = learning-rate ]
[, min_split_loss = minimum ]
[, weight_reg = regularization ]
[, nbins = num-bins ]
[, sampling_size = fraction-of-rows ]
[, col_sample_by_tree = sample-ratio-per-tree ]
[, col_sample_by_node = sample-ratio-per-node ]
] )
Arguments
model-name
Name of the model (case-insensitive).
input-relation
- The table or view that contains the training samples. If the input relation is defined in Hive, use
SYNC_WITH_HCATALOG_SCHEMA
to sync the hcatalog
schema, and then run the machine learning function.
response-column
- An input column of type CHAR or VARCHAR that represents the dependent variable or outcome.
predictor-columns
- Comma-separated list of columns to use from the input relation, or asterisk (*) to select all columns. Columns must be of data types CHAR, VARCHAR, BOOL, INT, or FLOAT.
Columns of type CHAR, VARCHAR, and BOOL are treated as categorical features; all others are treated as numeric features.
Vertica XGBoost and Random Forest algorithms offer native support for categorical columns (BOOL/VARCHAR). Simply pass the categorical columns as predictors to the models and the algorithm will automatically treat the columns as categorical and will not attempt to split them into bins in the same manner as numerical columns; Vertica treats these columns as true categorical values and does not simply cast them to continuous values under-the-hood.
Parameters
exclude_columns
Comma-separated list of column names from input-columns
to exclude from processing.
max_ntree
- Integer in the range [1,1000] that sets the maximum number of trees to create.
Default: 10
max_depth
- Integer in the range [1,20] that specifies the maximum depth of each tree.
Default: 6
objective
- The objective/loss function used to iteratively improve the model. Currently, '
crossentropy
' is the only option.
Default: 'crossentropy
'
learning_rate
- Float in the range (0,1] that specifies the weight for each tree's prediction. Setting this parameter can reduce each tree's impact and thereby prevent earlier trees from monopolizing improvements at the expense of contributions from later trees.
Default: 0.3
min_split_loss
- Float in the range [0,1000] that specifies the minimum amount of improvement each split must achieve on the model's objective function value to avoid being pruned.
If set to 0 or omitted, no minimum is set. In this case, trees are pruned according to positive or negative objective function values.
Default: 0.0 (disable)
weight_reg
- Float in the range [0,1000] that specifies the regularization term applied to the weights of classification tree leaves. The higher the setting, the sparser or smoother the weights are, which can help prevent over-fitting.
Default: 1.0
nbins
- Integer in the range (1,1000] that specifies the number of bins to use for finding splits in each column. More bins leads to longer runtime but more fine-grained and possibly better splits.
Default: 32
sampling_size
- Float in the range (0,1] that specifies the fraction of rows to use in each training iteration.
A value of 1 indicates that all rows are used.
Default: 1.0
col_sample_by_tree
- Float in the range (0,1] that specifies the fraction of columns (features), chosen at random, to use when building each tree.
A value of 1 indicates that all columns are used.
col_sample_by
parameters "stack" on top of each other if several are specified. That is, given a set of 24 columns, for col_sample_by_tree=0.5
andcol_sample_by_node=0.5
,col_sample_by_tree
samples 12 columns, reducing the available, unsampled column pool to 12. col_sample_by_node
then samples half of the remaining pool, so each node samples 6 columns.
This algorithm will always sample at least one column.
Default: 1
col_sample_by_node
- Float in the range (0,1] that specifies the fraction of columns (features), chosen at random, to use when evaluating each split.
A value of 1 indicates that all columns are used.
col_sample_by
parameters "stack" on top of each other if several are specified. That is, given a set of 24 columns, for col_sample_by_tree=0.5
andcol_sample_by_node=0.5
,col_sample_by_tree
samples 12 columns, reducing the available, unsampled column pool to 12. col_sample_by_node
then samples half of the remaining pool, so each node samples 6 columns.
This algorithm will always sample at least one column.
Default: 1
Examples
See XGBoost for classification.
13 - XGB_REGRESSOR
Trains an XGBoost model for regression on an input relation.
Trains an XGBoost model for regression on an input relation.
This is a meta-function. You must call meta-functions in a top-level SELECT statement.
Behavior type
Volatile
Syntax
XGB_REGRESSOR ('model-name', 'input-relation', 'response-column', 'predictor-columns'
[ USING PARAMETERS
[ exclude_columns = 'excluded-columns' ]
[, max_ntree = max-trees ]
[, max_depth = max-depth ]
[, objective = 'optimization-strategy' ]
[, learning_rate = learning-rate ]
[, min_split_loss = minimum ]
[, weight_reg = regularization ]
[, nbins = num-bins ]
[, sampling_size = fraction-of-rows ]
[, col_sample_by_tree = sample-ratio-per-tree ]
[, col_sample_by_node = sample-ratio-per-node ]
] )
Arguments
model-name
Name of the model (case-insensitive).
input-relation
- The table or view that contains the training samples. If the input relation is defined in Hive, use
SYNC_WITH_HCATALOG_SCHEMA
to sync the hcatalog
schema, and then run the machine learning function.
response-column
- An input column of type INTEGER or FLOAT that represents the dependent variable or outcome.
predictor-columns
- Comma-separated list of columns to use from the input relation, or asterisk (*) to select all columns. Columns must be of data types CHAR, VARCHAR, BOOL, INT, or FLOAT.
Columns of type CHAR, VARCHAR, and BOOL are treated as categorical features; all others are treated as numeric features.
Vertica XGBoost and Random Forest algorithms offer native support for categorical columns (BOOL/VARCHAR). Simply pass the categorical columns as predictors to the models and the algorithm will automatically treat the columns as categorical and will not attempt to split them into bins in the same manner as numerical columns; Vertica treats these columns as true categorical values and does not simply cast them to continuous values under-the-hood.
Parameters
exclude_columns
Comma-separated list of column names from input-columns
to exclude from processing.
max_ntree
- Integer in the range [1,1000] that sets the maximum number of trees to create.
Default: 10
max_depth
- Integer in the range [1,20] that specifies the maximum depth of each tree.
Default: 6
objective
- The objective/loss function used to iteratively improve the model. Currently, '
squarederror
' is the only option.
Default: 'squarederror
'
learning_rate
- Float in the range (0,1] that specifies the weight for each tree's prediction. Setting this parameter can reduce each tree's impact and thereby prevent earlier trees from monopolizing improvements at the expense of contributions from later trees.
Default: 0.3
min_split_loss
- Float in the range [0,1000] that specifies the minimum amount of improvement each split must achieve on the model's objective function value to avoid being pruned.
If set to 0 or omitted, no minimum is set. In this case, trees are pruned according to positive or negative objective function values.
Default: 0.0 (disable)
weight_reg
- Float in the range [0,1000] that specifies the regularization term applied to the weights of classification tree leaves. The higher the setting, the sparser or smoother the weights are, which can help prevent over-fitting.
Default: 1.0
nbins
- Integer in the range (1,1000] that specifies the number of bins to use for finding splits in each column. More bins leads to longer runtime but more fine-grained and possibly better splits.
Default: 32
sampling_size
- Float in the range (0,1] that specifies the fraction of rows to use in each training iteration.
A value of 1 indicates that all rows are used.
Default: 1.0
col_sample_by_tree
- Float in the range (0,1] that specifies the fraction of columns (features), chosen at random, to use when building each tree.
A value of 1 indicates that all columns are used.
col_sample_by
parameters "stack" on top of each other if several are specified. That is, given a set of 24 columns, for col_sample_by_tree=0.5
andcol_sample_by_node=0.5
,col_sample_by_tree
samples 12 columns, reducing the available, unsampled column pool to 12. col_sample_by_node
then samples half of the remaining pool, so each node samples 6 columns.
This algorithm will always sample at least one column.
Default: 1
col_sample_by_node
- Float in the range (0,1] that specifies the fraction of columns (features), chosen at random, to use when evaluating each split.
A value of 1 indicates that all columns are used.
col_sample_by
parameters "stack" on top of each other if several are specified. That is, given a set of 24 columns, for col_sample_by_tree=0.5
andcol_sample_by_node=0.5
,col_sample_by_tree
samples 12 columns, reducing the available, unsampled column pool to 12. col_sample_by_node
then samples half of the remaining pool, so each node samples 6 columns.
This algorithm will always sample at least one column.
Default: 1
Examples
See XGBoost for regression.