Machine learning for predictive analytics
Vertica provides a number of machine learning functions for performing in-database analysis. These functions perform data preparation, model training, model management, and predictive tasks. Vertica supports the following in-database machine learning algorithms:
 Regression algorithms: Linear regression, random forest, SVM, XGBoost
 Classification algorithms: Logistic regression, naive Bayes, random forest, SVM, XGBoost
 Clustering algorithms: K-means, bisecting k-means
 Time series forecasting: Autoregression, moving-average, ARIMA
For a scikit-like machine learning library that integrates directly with the data in your Vertica database, see VerticaPy.
For more information about specific machine learning functions, see Machine learning functions.
1  Download the machine learning example data
You need several data sets to run the machine learning examples. You can download these data sets from the Vertica GitHub repository.
Important
The GitHub examples are based on the latest Vertica version. If you note differences, please upgrade to the latest version.
You can download the example data in either of two ways:

Download the ZIP file. Extract the contents of the file into a directory.

Clone the Vertica Machine Learning GitHub repository. Using a terminal window, run the following command:
$ git clone https://github.com/vertica/Machine-Learning-Examples
Loading the example data
You can load the example data in either of the following ways. Note that models are not automatically dropped; you must either rerun the load_ml_data.sql script to drop the models or drop them manually.

Copying and pasting the DDL and DML operations in load_ml_data.sql into a vsql prompt or another Vertica client.

Running the following command from a terminal window within the data folder of the Machine-Learning-Examples directory:
$ /opt/vertica/bin/vsql -d <name of your database> -f load_ml_data.sql
You must also load the naive_bayes_data_preparation.sql script in the Machine-Learning-Examples directory:
$ /opt/vertica/bin/vsql -d <name of your database> -f ./naive_bayes/naive_bayes_data_preparation.sql
Example data descriptions
The repository contains the following data sets.
agar_dish 
Synthetic data set meant to represent clustering of bacteria on an agar dish. Contains the following columns: id, x-coordinate, and y-coordinate. 
agar_dish_2 
125 rows sampled randomly from the original 500 rows of the agar_dish data set. 
agar_dish_1 
375 rows sampled randomly from the original 500 rows of the agar_dish data set. 
baseball 
Contains statistics from a fictional baseball league. The statistics included are: first name, last name, date of birth, team name, homeruns, hits, batting average, and salary. 
daily-min-temperatures 
Contains data on the daily minimum temperature in Melbourne, Australia from 1981 through 1990. 
dem_votes 
Contains data on the number of yes and no votes by Democrat members of U.S. Congress for each of the 16 votes in the house84 data set. The table must be populated by running the naive_bayes_data_preparation.sql script. Contains the following columns: vote, yes, no. 
faithful 
Wait times between eruptions and the duration of the eruption for the Old Faithful geyser in Yellowstone National Park, Wyoming, USA.
Reference
Härdle, W. (1991) Smoothing Techniques with Implementation in S. New York: Springer.
Azzalini, A. and Bowman, A. W. (1990). A look at some data on the Old Faithful geyser. Applied Statistics 39, 357–365.

faithful_testing 
Roughly 60% of the original 272 rows of the faithful data set. 
faithful_training 
Roughly 40% of the original 272 rows of the faithful data set. 
house84 
The house84 data set includes votes for each of the U.S. House of Representatives Congress members on 16 votes. Contains the following columns: id, party, vote1, vote2, vote3, vote4, vote5, vote6, vote7, vote8, vote9, vote10, vote11, vote12, vote13, vote14, vote15, vote16.
Reference
Congressional Quarterly Almanac, 98th Congress, 2nd session 1984, Volume XL: Congressional Quarterly Inc. Washington, D.C., 1985.

iris 
The iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.
Reference
Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) The New S Language. Wadsworth & Brooks/Cole.

iris1 
90 rows sampled randomly from the original 150 rows in the iris data set. 
iris2 
60 rows sampled randomly from the original 150 rows in the iris data set. 
mtcars 
The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).
Reference
Henderson and Velleman (1981), Building multiple regression models interactively. Biometrics, 37, 391–411.

rep_votes 
Contains data on the number of yes and no votes by Republican members of U.S. Congress for each of the 16 votes in the house84 data set. The table must be populated by running the naive_bayes_data_preparation.sql script. Contains the following columns: vote, yes, no. 
salary_data 
Contains fictional employee data. The data included are: employee id, first name, last name, years worked, and current salary. 
transaction_data 
Contains fictional credit card transactions with a BOOLEAN column indicating whether there was fraud associated with the transaction. The data included are: first name, last name, store, cost, and fraud. 
titanic_testing 
Contains passenger information from the Titanic ship including sex, age, passenger class, and whether or not they survived. 
titanic_training 
Contains passenger information from the Titanic ship including sex, age, passenger class, and whether or not they survived. 
world 
Contains country-specific information about human development using HDI, GDP, and CO2 emissions. 
2  Data preparation
Before you can analyze your data, you must prepare it. You can do the following data preparation tasks in Vertica:
2.1  Balancing imbalanced data
Imbalanced data occurs when classes are unevenly distributed in the data. A predictive model built on an imbalanced data set can appear to yield high accuracy while generalizing poorly to new data in the minority class. To prevent creating models with false levels of accuracy, you should rebalance your imbalanced data before creating a predictive model.
Before you begin the example, load the Machine Learning sample data.
Imbalanced data is common in financial transaction data, where the majority of transactions are legitimate and only a small number are fraudulent, as in the following example.

View the distribution of the classes.
=> SELECT fraud, COUNT(fraud) FROM transaction_data GROUP BY fraud;
 fraud | COUNT
-------+-------
 TRUE  |    19
 FALSE |   981
(2 rows)

Use the BALANCE function to create a more balanced data set.
=> SELECT BALANCE('balance_fin_data', 'transaction_data', 'fraud', 'under_sampling'
USING PARAMETERS sampling_ratio = 0.2);
         BALANCE
--------------------------
 Finished in 1 iteration
(1 row)

View the new distribution of the classes.
=> SELECT fraud, COUNT(fraud) FROM balance_fin_data GROUP BY fraud;
 fraud | COUNT
-------+-------
 t     |    19
 f     |   236
(2 rows)
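Under-sampling keeps every minority-class row and discards majority-class rows until the classes are closer to balanced. The idea can be sketched in plain Python; the helper function and its ratio semantics are illustrative only, not Vertica's exact BALANCE implementation:

```python
import random

def undersample(rows, label_key, minority_share, seed=0):
    """Keep all minority-class rows; sample the majority class down so that
    the minority class makes up roughly `minority_share` of the result.
    (Illustrative sketch, not Vertica's BALANCE semantics.)"""
    minority = [r for r in rows if r[label_key]]
    majority = [r for r in rows if not r[label_key]]
    # Solve len(minority) / total = minority_share for the majority count.
    target_majority = int(len(minority) * (1 - minority_share) / minority_share)
    random.seed(seed)
    kept = random.sample(majority, min(target_majority, len(majority)))
    return minority + kept

# 19 fraudulent and 981 legitimate transactions, as in the example above.
data = [{"fraud": True}] * 19 + [{"fraud": False}] * 981
balanced = undersample(data, "fraud", 0.2)  # minority is now ~20% of rows
```

All 19 fraud rows survive, while the legitimate rows are sampled down, so a classifier trained on the result can no longer reach high accuracy by always predicting "not fraud".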
2.2  Detect outliers
Outliers are data points that greatly differ from other data points in a dataset. You can use outlier detection for applications such as fraud detection and system health monitoring, or you can detect outliers to then remove them from your data. If you leave outliers in your data when training a machine learning model, your resultant model is at risk for bias and skewed predictions. Vertica supports two methods for detecting outliers: the DETECT_OUTLIERS function and the IFOREST algorithm.
Isolation forest
Isolation forest (iForest) is an unsupervised algorithm that operates on the assumption that outliers are few and different. This assumption makes outliers susceptible to a separation mechanism called isolation. Instead of comparing data instances to a constructed normal distribution of each data feature, isolation focuses on outliers themselves.
To isolate outliers directly, iForest builds binary tree structures named isolation trees (iTrees) to model the feature space. These iTrees randomly and recursively split the feature space so that each node of the tree represents a feature subspace. For instance, the first split divides the whole feature space into two subspaces, which are represented by the two child nodes of the root node. A data instance is considered isolated when it is the only member of a feature subspace. Because outliers are assumed to be few and different, outliers are likely to be isolated sooner than normal data instances.
To improve the robustness of the algorithm, iForest builds an ensemble of iTrees, each of which separates the feature space differently. The algorithm calculates the average path length needed to isolate a data instance across all iTrees. This average path length determines the anomaly_score for each data instance in a dataset. The data instances with an anomaly_score above a given threshold are considered outliers.
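The scoring described above follows the isolation forest formulation of Liu et al.: the average path length E[h(x)] is normalized by c(n), the expected path length of an unsuccessful search in a binary search tree built over n points, and converted to a score in (0, 1]. A short sketch of that math (Vertica's internal implementation details may differ):

```python
import math

EULER_GAMMA = 0.5772156649

def c(n):
    """Expected path length of an unsuccessful BST search over n points,
    used to normalize path lengths (from the isolation forest paper)."""
    if n <= 1:
        return 0.0
    harmonic = math.log(n - 1) + EULER_GAMMA  # approximation of H(n-1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_score(avg_path_length, n):
    """Score in (0, 1]: values near 1 mean the point isolates quickly,
    which marks it as a likely outlier."""
    return 2.0 ** (-avg_path_length / c(n))

# A point isolated after ~3 splits in a 256-point sample scores high;
# a point needing ~12 splits scores below 0.5 (unremarkable).
print(anomaly_score(3, 256))   # high -> likely outlier
print(anomaly_score(12, 256))  # moderate -> likely normal
```

This is why the APPLY_IFOREST examples below treat rows with scores above a threshold such as 0.6 as anomalies.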
You do not need a large dataset to train an iForest, and even a small sample should suffice to train an accurate model. The data can have columns of types CHAR, VARCHAR, BOOL, INT, or FLOAT.
After you have trained an iForest model, you can use the APPLY_IFOREST function to detect outliers in any new data added to the dataset.
The following example demonstrates how to train an iForest model and detect outliers on the baseball dataset.
To build and train an iForest model, call IFOREST:
=> SELECT IFOREST('baseball_outliers','baseball','hr, hits, salary' USING PARAMETERS max_depth=30, nbins=100);
 IFOREST
----------
 Finished
(1 row)
You can view a summary of the trained model using GET_MODEL_SUMMARY:
=> SELECT GET_MODEL_SUMMARY(USING PARAMETERS model_name='baseball_outliers');
                                                 GET_MODEL_SUMMARY
---------------------------------------------------------------------------------------------------------------------
===========
call_string
===========
SELECT iforest('public.baseball_outliers', 'baseball', 'hr, hits, salary' USING PARAMETERS exclude_columns='', ntree=100, sampling_size=0.632,
col_sample_by_tree=1, max_depth=30, nbins=100);

=======
details
=======
predictor |       type
----------+------------------
    hr    |       int
   hits   |       int
  salary  | float or numeric

===============
Additional Info
===============
        Name        | Value
--------------------+-------
     tree_count     |   100
 rejected_row_count |     0
 accepted_row_count |  1000
(1 row)
You can apply the trained iForest model to the baseball dataset with APPLY_IFOREST. To view only the data instances that are identified as outliers, you can run the following query:
=> SELECT * FROM (SELECT first_name, last_name, APPLY_IFOREST(hr, hits, salary USING PARAMETERS model_name='baseball_outliers', threshold=0.6)
AS predictions FROM baseball) AS outliers WHERE predictions.is_anomaly IS true;
 first_name | last_name |                      predictions
------------+-----------+--------------------------------------------------------
 Jacqueline | Richards  | {"anomaly_score":0.8572338674053986,"is_anomaly":true}
 Debra      | Hall      | {"anomaly_score":0.6007846156043213,"is_anomaly":true}
 Gerald     | Fuller    | {"anomaly_score":0.6813650107767862,"is_anomaly":true}
(3 rows)
Instead of specifying a threshold value for APPLY_IFOREST, you can set the contamination parameter. This parameter sets a threshold so that the ratio of training data points labeled as outliers is approximately equal to the value of contamination:
=> SELECT * FROM (SELECT first_name, last_name, APPLY_IFOREST(team, hr, hits, avg, salary USING PARAMETERS model_name='baseball_anomalies',
contamination = 0.1) AS predictions FROM baseball) AS outliers WHERE predictions.is_anomaly IS true;
 first_name | last_name |                      predictions
------------+-----------+--------------------------------------------------------
 Marie      | Fields    | {"anomaly_score":0.5307715717521868,"is_anomaly":true}
 Jacqueline | Richards  | {"anomaly_score":0.777757463074347,"is_anomaly":true}
 Debra      | Hall      | {"anomaly_score":0.5714649698133808,"is_anomaly":true}
 Gerald     | Fuller    | {"anomaly_score":0.5980549926114661,"is_anomaly":true}
(4 rows)
DETECT_OUTLIERS
The DETECT_OUTLIERS function assumes a normal distribution for each data dimension, and then identifies data instances that differ strongly from the normal profile of any dimension. The function uses the robust z-score detection method to normalize each input column. If a data instance contains a normalized value greater than a specified threshold, it is identified as an outlier. The function outputs a table that contains all the outliers.
The function accepts data with only numeric input columns, treats each column independently, and assumes a Gaussian distribution on each column. If you want to detect outliers in new data added to the dataset, you must rerun DETECT_OUTLIERS.
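The robust z-score itself is straightforward to compute: replace the mean with the median and the standard deviation with a scaled median absolute deviation (MAD). A minimal sketch in Python; the 1.4826 scaling constant is the conventional consistency factor that makes the MAD comparable to a standard deviation for normal data, and the sample values are made up:

```python
import statistics

def robust_zscores(values):
    """Robust z-score: distance from the median, scaled by 1.4826 * MAD.
    (Sketch of the robust_zscore method; not Vertica internals.)"""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    return [(v - med) / (1.4826 * mad) for v in values]

# One absurd salary among otherwise similar values (hypothetical data).
salaries = [50_000, 52_000, 49_500, 51_000, 50_500, 9_000_000]
scores = robust_zscores(salaries)
# Flag rows whose normalized value exceeds a threshold, as with
# outlier_threshold=3.0 in the example below.
outliers = [v for v, z in zip(salaries, scores) if abs(z) > 3.0]
```

Because the median and MAD are barely affected by the extreme value, the outlier stands out sharply; a plain z-score would be inflated by the outlier itself.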
The following example demonstrates how you can detect the outliers in the baseball dataset based on the hr, hits, and salary columns. The DETECT_OUTLIERS function creates a table containing the outliers with the input and key columns:
=> SELECT DETECT_OUTLIERS('baseball_hr_hits_salary_outliers', 'baseball', 'hr, hits, salary', 'robust_zscore'
USING PARAMETERS outlier_threshold=3.0);
   DETECT_OUTLIERS
----------------------
 Detected 5 outliers
(1 row)
To view the outliers, query the output table containing the outliers:
=> SELECT * FROM baseball_hr_hits_salary_outliers;
 id  | first_name | last_name |    dob     |   team    |   hr    |  hits   |  avg  |        salary
-----+------------+-----------+------------+-----------+---------+---------+-------+----------------------
  73 | Marie      | Fields    | 1985-11-23 | Mauv      |    8888 |      34 | 0.283 | 9.99999999341471e+16
  89 | Jacqueline | Richards  | 1975-10-06 | Pink      |  273333 | 4490260 | 0.324 | 4.4444444444828e+17
  87 | Jose       | Stephens  | 1991-07-20 | Green     |      80 |   64253 |  0.69 | 16032567.12
 222 | Gerald     | Fuller    | 1991-02-13 | Goldenrod | 3200000 |     216 | 0.299 | 37008899.76
 147 | Debra      | Hall      | 1980-12-31 | Maroon    | 1100037 |     230 | 0.431 | 9000101403
(5 rows)
You can create a view omitting the outliers from the table:
=> CREATE VIEW clean_baseball AS
SELECT * FROM baseball WHERE id NOT IN (SELECT id FROM baseball_hr_hits_salary_outliers);
CREATE VIEW
2.3  Encoding categorical columns
Many machine learning algorithms cannot work with categorical data. To accommodate such algorithms, categorical data must be converted to numerical data before training. Directly mapping the categorical values to indices is not enough. For example, if your categorical feature has three distinct values "red", "green", and "blue", replacing them with 1, 2, and 3 may have a negative impact on the training process because algorithms usually rely on some kind of numerical distance between values to discriminate between them. In this case, the Euclidean distance from 1 to 3 is twice the distance from 1 to 2, so the training process treats "red" as much more different from "blue" than from "green", even though the three categories are equally distinct.

One hot encoding avoids this problem by mapping each categorical value to a binary vector. For example, "red" can be mapped to [1,0,0], "green" to [0,1,0], and "blue" to [0,0,1]. Now the pairwise distances between the three categories are all the same. One hot encoding lets you convert categorical variables to binary values so that you can use different machine learning algorithms to evaluate your data.
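The distance argument is easy to verify numerically. A quick Python illustration (not Vertica code):

```python
import math

# Naive integer encoding: distances between categories are unequal.
int_code = {"red": 1, "green": 2, "blue": 3}

# One hot encoding: each category becomes a binary indicator vector.
one_hot = {"red": (1, 0, 0), "green": (0, 1, 0), "blue": (0, 0, 1)}

def dist(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Integer codes: "red" looks twice as far from "blue" as from "green".
print(abs(int_code["red"] - int_code["blue"]))   # 2
print(abs(int_code["red"] - int_code["green"]))  # 1

# One-hot vectors: every pair of categories is sqrt(2) apart.
print(dist(one_hot["red"], one_hot["green"]))
print(dist(one_hot["red"], one_hot["blue"]))
```

With one-hot vectors, no category is numerically "closer" to another, which is exactly the property a distance-based learner needs.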
The following example shows how you can apply one hot encoding to the Titanic data set. If you would like to read more about this data set, see the Kaggle site.
Suppose you want to use a logistic regression classifier to predict which passengers survived the sinking of the Titanic. You cannot use categorical features for logistic regression without one hot encoding. This data set has two categorical features that you can use. The "sex" feature can be either male or female. The "embarkation_point" feature can be one of the following:

S for Southampton

Q for Queenstown

C for Cherbourg
Before you begin the example, load the Machine Learning sample data.
1. Run the ONE_HOT_ENCODER_FIT function on the training data:
=> SELECT ONE_HOT_ENCODER_FIT('titanic_encoder', 'titanic_training', 'sex, embarkation_point');
ONE_HOT_ENCODER_FIT

Success
(1 row)
2. View a summary of the titanic_encoder model:
=> SELECT GET_MODEL_SUMMARY(USING PARAMETERS model_name='titanic_encoder');
                                      GET_MODEL_SUMMARY
------------------------------------------------------------------------------------------
===========
call_string
===========
SELECT one_hot_encoder_fit('public.titanic_encoder','titanic_training','sex, embarkation_point'
USING PARAMETERS exclude_columns='', output_view='', extra_levels='{}');

==================
varchar_categories
==================
   category_name   | category_level | category_level_index
-------------------+----------------+----------------------
 embarkation_point | C              |                    0
 embarkation_point | Q              |                    1
 embarkation_point | S              |                    2
 embarkation_point |                |                    3
 sex               | female         |                    0
 sex               | male           |                    1
(1 row)
3. Run the GET_MODEL_ATTRIBUTE function. This function returns the categorical levels in their native data types, so they can be compared easily with the original table:
=> SELECT * FROM (SELECT GET_MODEL_ATTRIBUTE(USING PARAMETERS model_name='titanic_encoder',
attr_name='varchar_categories')) AS attrs INNER JOIN (SELECT passenger_id, name, sex, age,
embarkation_point FROM titanic_training) AS original_data ON attrs.category_level
ILIKE original_data.embarkation_point ORDER BY original_data.passenger_id LIMIT 10;
   category_name   | category_level | category_level_index | passenger_id |                        name                         |  sex   | age | embarkation_point
-------------------+----------------+----------------------+--------------+-----------------------------------------------------+--------+-----+-------------------
 embarkation_point | S              |                    2 |            1 | Braund, Mr. Owen Harris                             | male   |  22 | S
 embarkation_point | C              |                    0 |            2 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female |  38 | C
 embarkation_point | S              |                    2 |            3 | Heikkinen, Miss. Laina                              | female |  26 | S
 embarkation_point | S              |                    2 |            4 | Futrelle, Mrs. Jacques Heath (Lily May Peel)        | female |  35 | S
 embarkation_point | S              |                    2 |            5 | Allen, Mr. William Henry                            | male   |  35 | S
 embarkation_point | Q              |                    1 |            6 | Moran, Mr. James                                    | male   |     | Q
 embarkation_point | S              |                    2 |            7 | McCarthy, Mr. Timothy J                             | male   |  54 | S
 embarkation_point | S              |                    2 |            8 | Palsson, Master. Gosta Leonard                      | male   |   2 | S
 embarkation_point | S              |                    2 |            9 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)   | female |  27 | S
 embarkation_point | C              |                    0 |           10 | Nasser, Mrs. Nicholas (Adele Achem)                 | female |  14 | C
(10 rows)
4. Run the APPLY_ONE_HOT_ENCODER function on both the training and testing data:
=> CREATE VIEW titanic_training_encoded AS SELECT passenger_id, survived, pclass, sex_1, age,
sibling_and_spouse_count, parent_and_child_count, fare, embarkation_point_1, embarkation_point_2
FROM (SELECT APPLY_ONE_HOT_ENCODER(* USING PARAMETERS model_name='titanic_encoder')
FROM titanic_training) AS sq;
CREATE VIEW
=> CREATE VIEW titanic_testing_encoded AS SELECT passenger_id, name, pclass, sex_1, age,
sibling_and_spouse_count, parent_and_child_count, fare, embarkation_point_1, embarkation_point_2
FROM (SELECT APPLY_ONE_HOT_ENCODER(* USING PARAMETERS model_name='titanic_encoder')
FROM titanic_testing) AS sq;
CREATE VIEW
5. Train a logistic regression classifier on the training data, and execute the model on the testing data:
=> SELECT LOGISTIC_REG('titanic_log_reg', 'titanic_training_encoded', 'survived', '*'
USING PARAMETERS exclude_columns='passenger_id, survived');
       LOGISTIC_REG
---------------------------
 Finished in 5 iterations
(1 row)
=> SELECT passenger_id, name, PREDICT_LOGISTIC_REG(pclass, sex_1, age, sibling_and_spouse_count,
parent_and_child_count, fare, embarkation_point_1, embarkation_point_2 USING PARAMETERS
model_name='titanic_log_reg') FROM titanic_testing_encoded ORDER BY passenger_id LIMIT 10;
 passenger_id |                     name                     | PREDICT_LOGISTIC_REG
--------------+----------------------------------------------+----------------------
          893 | Wilkes, Mrs. James (Ellen Needs)             |                    0
          894 | Myles, Mr. Thomas Francis                    |                    0
          895 | Wirz, Mr. Albert                             |                    0
          896 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) |                    1
          897 | Svensson, Mr. Johan Cervin                   |                    0
          898 | Connolly, Miss. Kate                         |                    1
          899 | Caldwell, Mr. Albert Francis                 |                    0
          900 | Abrahim, Mrs. Joseph (Sophie Halaut Easu)    |                    1
          901 | Davies, Mr. John Samuel                      |                    0
          902 | Ilieff, Mr. Ylio                             |
(10 rows)
2.4  Imputing missing values
You can use the IMPUTE function to replace missing data with the most frequent value or with the average value in the same column. Using the function, you can specify either the mean or mode method. The following examples show how you can use the IMPUTE function on the small_input_impute table.
Before you begin the example, load the Machine Learning sample data.
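Conceptually, IMPUTE computes a single fill value per column (and, with partition_columns, per group) and substitutes it for the missing entries. A small sketch of the mean and mode logic in Python; the helper function and sample rows are illustrative, not Vertica internals:

```python
from statistics import mean, mode

def impute(rows, column, method):
    """Fill None values in `column` with the column mean or mode,
    computed over the non-missing values (sketch of IMPUTE's logic)."""
    observed = [r[column] for r in rows if r[column] is not None]
    fill = mean(observed) if method == "mean" else mode(observed)
    return [{**r, column: fill if r[column] is None else r[column]}
            for r in rows]

# Numeric column: missing values become the mean of the observed values.
rows = [{"x1": 2.5}, {"x1": None}, {"x1": 3.5}, {"x1": None}]
imputed = impute(rows, "x1", "mean")        # None values become 3.0

# Categorical column: missing values become the most frequent value.
flags = [{"x5": "t"}, {"x5": None}, {"x5": "t"}, {"x5": "f"}]
imputed_mode = impute(flags, "x5", "mode")  # None becomes "t"
```

With partition_columns, the same fill values would be computed per group rather than over the whole column, much like a GROUP BY.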
First, query the table so you can see the missing values:
=> SELECT * FROM small_input_impute;
 pid | pclass | gender |    x1    |    x2    |    x3    | x4 | x5 | x6
-----+--------+--------+----------+----------+----------+----+----+----
   5 |      0 |      1 | 2.590837 | 2.892819 |  2.70296 |  2 | t  | C
   7 |      1 |      1 | 3.829239 |  3.08765 | Infinity |    | f  | C
  13 |      0 |      0 | 9.060605 | 9.390844 | 9.559848 |  6 | t  | C
  15 |      0 |      1 | 2.590837 | 2.892819 |  2.70296 |  2 | f  | A
  16 |      0 |      1 | 2.264599 | 2.615146 |  2.10729 | 11 | f  | A
  19 |      1 |      1 |          | 3.841606 | 3.754375 | 20 | t  |
   1 |      0 |      0 | 9.445818 | 9.740541 | 9.786974 |  3 | t  | A
   1 |      0 |      0 | 9.445818 | 9.740541 | 9.786974 |  3 | t  | A
   2 |      0 |      0 | 9.618292 | 9.308881 | 9.562255 |  4 | t  | A
   3 |      0 |      0 | 9.060605 | 9.390844 | 9.559848 |  6 | t  | B
   4 |      0 |      0 | 2.264599 | 2.615146 |  2.10729 | 15 | t  | B
   6 |      0 |      1 | 2.264599 | 2.615146 |  2.10729 | 11 | t  | C
   8 |      1 |      1 | 3.273592 |          | 3.477332 | 18 | f  | B
  10 |      1 |      1 |          | 3.841606 | 3.754375 | 20 | t  | A
  18 |      1 |      1 | 3.273592 |          | 3.477332 | 18 | t  | B
  20 |      1 |      1 |          | 3.841606 | 3.754375 | 20 |    | C
   9 |      1 |      1 |          | 3.841606 | 3.754375 | 20 | f  | B
  11 |      0 |      0 | 9.445818 | 9.740541 | 9.786974 |  3 | t  | B
  12 |      0 |      0 | 9.618292 | 9.308881 | 9.562255 |  4 | t  | C
  14 |      0 |      0 | 2.264599 | 2.615146 |  2.10729 | 15 | f  | A
  17 |      1 |      1 | 3.829239 |  3.08765 | Infinity |    | f  | B
(21 rows)
Specify the mean method
Execute the IMPUTE function, specifying the mean method:
=> SELECT IMPUTE('output_view','small_input_impute', 'pid, x1,x2,x3,x4','mean'
USING PARAMETERS exclude_columns='pid');
          IMPUTE
--------------------------
 Finished in 1 iteration
(1 row)
View output_view to see the imputed values:
=> SELECT * FROM output_view;
 pid | pclass | gender |        x1        |        x2        |        x3        | x4 | x5 | x6
-----+--------+--------+------------------+------------------+------------------+----+----+----
   5 |      0 |      1 | 2.590837         | 2.892819         | 2.70296          |  2 | t  | C
   7 |      1 |      1 | 3.829239         | 3.08765          | 3.12989705263158 | 11 | f  | C
  13 |      0 |      0 | 9.060605         | 9.390844         | 9.559848         |  6 | t  | C
  15 |      0 |      1 | 2.590837         | 2.892819         | 2.70296          |  2 | f  | A
  16 |      0 |      1 | 2.264599         | 2.615146         | 2.10729          | 11 | f  | A
  19 |      1 |      1 | 3.86645035294118 | 3.841606         | 3.754375         | 20 | t  |
   9 |      1 |      1 | 3.86645035294118 | 3.841606         | 3.754375         | 20 | f  | B
  11 |      0 |      0 | 9.445818         | 9.740541         | 9.786974         |  3 | t  | B
  12 |      0 |      0 | 9.618292         | 9.308881         | 9.562255         |  4 | t  | C
  14 |      0 |      0 | 2.264599         | 2.615146         | 2.10729          | 15 | f  | A
  17 |      1 |      1 | 3.829239         | 3.08765          | 3.12989705263158 | 11 | f  | B
   1 |      0 |      0 | 9.445818         | 9.740541         | 9.786974         |  3 | t  | A
   1 |      0 |      0 | 9.445818         | 9.740541         | 9.786974         |  3 | t  | A
   2 |      0 |      0 | 9.618292         | 9.308881         | 9.562255         |  4 | t  | A
   3 |      0 |      0 | 9.060605         | 9.390844         | 9.559848         |  6 | t  | B
   4 |      0 |      0 | 2.264599         | 2.615146         | 2.10729          | 15 | t  | B
   6 |      0 |      1 | 2.264599         | 2.615146         | 2.10729          | 11 | t  | C
   8 |      1 |      1 | 3.273592         | 3.22766163157895 | 3.477332         | 18 | f  | B
  10 |      1 |      1 | 3.86645035294118 | 3.841606         | 3.754375         | 20 | t  | A
  18 |      1 |      1 | 3.273592         | 3.22766163157895 | 3.477332         | 18 | t  | B
  20 |      1 |      1 | 3.86645035294118 | 3.841606         | 3.754375         | 20 |    | C
(21 rows)
You can also execute the IMPUTE function, specifying the mean method and using the partition_columns parameter. This parameter works similarly to the GROUP BY clause:
=> SELECT IMPUTE('output_view_group','small_input_impute', 'pid, x1,x2,x3,x4','mean'
USING PARAMETERS exclude_columns='pid', partition_columns='pclass,gender');
          impute
--------------------------
 Finished in 1 iteration
(1 row)
View output_view_group to see the imputed values:
=> SELECT * FROM output_view_group;
 pid | pclass | gender |    x1     |        x2        |        x3        | x4 | x5 | x6
-----+--------+--------+-----------+------------------+------------------+----+----+----
   5 |      0 |      1 | 2.590837  | 2.892819         | 2.70296          |  2 | t  | C
   7 |      1 |      1 | 3.829239  | 3.08765          | 3.66202733333333 | 19 | f  | C
  13 |      0 |      0 | 9.060605  | 9.390844         | 9.559848         |  6 | t  | C
  15 |      0 |      1 | 2.590837  | 2.892819         | 2.70296          |  2 | f  | A
  16 |      0 |      1 | 2.264599  | 2.615146         | 2.10729          | 11 | f  | A
  19 |      1 |      1 | 3.5514155 | 3.841606         | 3.754375         | 20 | t  |
   1 |      0 |      0 | 9.445818  | 9.740541         | 9.786974         |  3 | t  | A
   1 |      0 |      0 | 9.445818  | 9.740541         | 9.786974         |  3 | t  | A
   2 |      0 |      0 | 9.618292  | 9.308881         | 9.562255         |  4 | t  | A
   3 |      0 |      0 | 9.060605  | 9.390844         | 9.559848         |  6 | t  | B
   4 |      0 |      0 | 2.264599  | 2.615146         | 2.10729          | 15 | t  | B
   6 |      0 |      1 | 2.264599  | 2.615146         | 2.10729          | 11 | t  | C
   8 |      1 |      1 | 3.273592  | 3.59028733333333 | 3.477332         | 18 | f  | B
  10 |      1 |      1 | 3.5514155 | 3.841606         | 3.754375         | 20 | t  | A
  18 |      1 |      1 | 3.273592  | 3.59028733333333 | 3.477332         | 18 | t  | B
  20 |      1 |      1 | 3.5514155 | 3.841606         | 3.754375         | 20 |    | C
   9 |      1 |      1 | 3.5514155 | 3.841606         | 3.754375         | 20 | f  | B
  11 |      0 |      0 | 9.445818  | 9.740541         | 9.786974         |  3 | t  | B
  12 |      0 |      0 | 9.618292  | 9.308881         | 9.562255         |  4 | t  | C
  14 |      0 |      0 | 2.264599  | 2.615146         | 2.10729          | 15 | f  | A
  17 |      1 |      1 | 3.829239  | 3.08765          | 3.66202733333333 | 19 | f  | B
(21 rows)
Specify the mode method
Execute the IMPUTE function, specifying the mode method:
=> SELECT impute('output_view_mode','small_input_impute', 'pid, x5,x6','mode'
USING PARAMETERS exclude_columns='pid');
          impute
--------------------------
 Finished in 1 iteration
(1 row)
View output_view_mode to see the imputed values:
=> SELECT * FROM output_view_mode;
 pid | pclass | gender |    x1    |    x2    |    x3    | x4 | x5 | x6
-----+--------+--------+----------+----------+----------+----+----+----
   5 |      0 |      1 | 2.590837 | 2.892819 |  2.70296 |  2 | t  | C
   7 |      1 |      1 | 3.829239 |  3.08765 | Infinity |    | f  | C
  13 |      0 |      0 | 9.060605 | 9.390844 | 9.559848 |  6 | t  | C
  15 |      0 |      1 | 2.590837 | 2.892819 |  2.70296 |  2 | f  | A
  16 |      0 |      1 | 2.264599 | 2.615146 |  2.10729 | 11 | f  | A
  19 |      1 |      1 |          | 3.841606 | 3.754375 | 20 | t  | B
   1 |      0 |      0 | 9.445818 | 9.740541 | 9.786974 |  3 | t  | A
   1 |      0 |      0 | 9.445818 | 9.740541 | 9.786974 |  3 | t  | A
   2 |      0 |      0 | 9.618292 | 9.308881 | 9.562255 |  4 | t  | A
   3 |      0 |      0 | 9.060605 | 9.390844 | 9.559848 |  6 | t  | B
   4 |      0 |      0 | 2.264599 | 2.615146 |  2.10729 | 15 | t  | B
   6 |      0 |      1 | 2.264599 | 2.615146 |  2.10729 | 11 | t  | C
   8 |      1 |      1 | 3.273592 |          | 3.477332 | 18 | f  | B
  10 |      1 |      1 |          | 3.841606 | 3.754375 | 20 | t  | A
  18 |      1 |      1 | 3.273592 |          | 3.477332 | 18 | t  | B
  20 |      1 |      1 |          | 3.841606 | 3.754375 | 20 | t  | C
   9 |      1 |      1 |          | 3.841606 | 3.754375 | 20 | f  | B
  11 |      0 |      0 | 9.445818 | 9.740541 | 9.786974 |  3 | t  | B
  12 |      0 |      0 | 9.618292 | 9.308881 | 9.562255 |  4 | t  | C
  14 |      0 |      0 | 2.264599 | 2.615146 |  2.10729 | 15 | f  | A
  17 |      1 |      1 | 3.829239 |  3.08765 | Infinity |    | f  | B
(21 rows)
You can also execute the IMPUTE function, specifying the mode method and using the partition_columns parameter. This parameter works similarly to the GROUP BY clause:
=> SELECT impute('output_view_mode_group','small_input_impute', 'pid, x5,x6','mode'
USING PARAMETERS exclude_columns='pid',partition_columns='pclass,gender');
          impute
--------------------------
 Finished in 1 iteration
(1 row)
View output_view_mode_group to see the imputed values:
=> SELECT * FROM output_view_mode_group;
 pid | pclass | gender |    x1    |    x2    |    x3    | x4 | x5 | x6
-----+--------+--------+----------+----------+----------+----+----+----
   1 |      0 |      0 | 9.445818 | 9.740541 | 9.786974 |  3 | t  | A
   1 |      0 |      0 | 9.445818 | 9.740541 | 9.786974 |  3 | t  | A
   2 |      0 |      0 | 9.618292 | 9.308881 | 9.562255 |  4 | t  | A
   3 |      0 |      0 | 9.060605 | 9.390844 | 9.559848 |  6 | t  | B
   4 |      0 |      0 | 2.264599 | 2.615146 |  2.10729 | 15 | t  | B
  13 |      0 |      0 | 9.060605 | 9.390844 | 9.559848 |  6 | t  | C
  11 |      0 |      0 | 9.445818 | 9.740541 | 9.786974 |  3 | t  | B
  12 |      0 |      0 | 9.618292 | 9.308881 | 9.562255 |  4 | t  | C
  14 |      0 |      0 | 2.264599 | 2.615146 |  2.10729 | 15 | f  | A
   5 |      0 |      1 | 2.590837 | 2.892819 |  2.70296 |  2 | t  | C
  15 |      0 |      1 | 2.590837 | 2.892819 |  2.70296 |  2 | f  | A
  16 |      0 |      1 | 2.264599 | 2.615146 |  2.10729 | 11 | f  | A
   6 |      0 |      1 | 2.264599 | 2.615146 |  2.10729 | 11 | t  | C
   7 |      1 |      1 | 3.829239 |  3.08765 | Infinity |    | f  | C
  19 |      1 |      1 |          | 3.841606 | 3.754375 | 20 | t  | B
   9 |      1 |      1 |          | 3.841606 | 3.754375 | 20 | f  | B
  17 |      1 |      1 | 3.829239 |  3.08765 | Infinity |    | f  | B
   8 |      1 |      1 | 3.273592 |          | 3.477332 | 18 | f  | B
  10 |      1 |      1 |          | 3.841606 | 3.754375 | 20 | t  | A
  18 |      1 |      1 | 3.273592 |          | 3.477332 | 18 | t  | B
  20 |      1 |      1 |          | 3.841606 | 3.754375 | 20 | f  | C
(21 rows)
See also
IMPUTE
2.5  Normalizing data
The purpose of normalization is, primarily, to scale numeric data from different columns down to an equivalent scale. For example, suppose you execute the LINEAR_REG function on a data set with two feature columns, current_salary and years_worked. The output value you are trying to predict is a worker's future salary. The values in the current_salary column are likely to have a far wider range, and much larger values, than the values in the years_worked column. Consequently, the values in the current_salary column can overshadow the values in the years_worked column, skewing your model.
Vertica offers the following normalization methods:

MinMax
Using the MinMax normalization method, you can normalize the values in both of these columns to be within a distribution of values between 0 and 1. Doing so allows you to compare values on very different scales to one another by reducing the dominance of one column over the other.

Z-score
Using the Z-score normalization method, you can normalize the values in both of these columns to be the number of standard deviations an observation is from the mean of each column. This allows you to compare your data to a normally distributed random variable.

Robust Z-score
Using the robust Z-score normalization method, you can lessen the influence of outliers on Z-score calculations. Robust Z-score normalization uses the median instead of the mean used in Z-score calculations, which helps remove some of the influence of outliers in the data.
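Each of the three methods reduces to a one-line formula per column. A sketch in Python; whether Vertica uses the population or sample standard deviation, and the 1.4826 MAD scaling constant, are assumptions here:

```python
import statistics

def minmax(values):
    """Rescale values linearly into [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def zscore(values):
    """Standard deviations from the column mean."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)  # population std dev (assumption)
    return [(v - mu) / sigma for v in values]

def robust_zscore(values):
    """Like zscore, but median-based so outliers have less influence."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    return [(v - med) / (1.4826 * mad) for v in values]

years_worked = [2, 7, 5, 18, 1]     # hypothetical sample column
print(minmax(years_worked))          # all values fall in [0, 1]
```

After either transformation, years_worked and current_salary live on comparable scales, so neither column dominates the regression.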
Normalizing data creates a view where the normalized data is saved. The output_view option of the NORMALIZE function determines the name of the view.
Normalizing salary data using MinMax
The following example shows how you can normalize the salary_data table using the MinMax normalization method.
Before you begin the example, load the Machine Learning sample data.
=> SELECT NORMALIZE('normalized_salary_data', 'salary_data', 'current_salary, years_worked', 'minmax');
        NORMALIZE
--------------------------
 Finished in 1 iteration
(1 row)
=> SELECT * FROM normalized_salary_data;
 employee_id | first_name | last_name |     years_worked     |    current_salary
-------------+------------+-----------+----------------------+----------------------
         189 | Shawn      | Moore     | 0.350000000000000000 | 0.437246565765357217
         518 | Earl       | Shaw      | 0.100000000000000000 | 0.978867411144492943
        1126 | Susan      | Alexander | 0.250000000000000000 | 0.909048995710749580
        1157 | Jack       | Stone     | 0.100000000000000000 | 0.601863084103319918
        1277 | Scott      | Wagner    | 0.050000000000000000 | 0.455949209228501786
        3188 | Shirley    | Flores    | 0.400000000000000000 | 0.538816771536005140
        3196 | Andrew     | Holmes    | 0.900000000000000000 | 0.183954046444834949
        3430 | Philip     | Little    | 0.100000000000000000 | 0.735279557092379495
        3522 | Jerry      | Ross      | 0.800000000000000000 | 0.671828883472214349
        3892 | Barbara    | Flores    | 0.350000000000000000 | 0.092901007123556866
 .
 .
 .
(1000 rows)
Normalizing salary data using Z-score
The following example shows how you can normalize the salary_data table using the Z-score normalization method.
Before you begin the example, load the Machine Learning sample data.
=> SELECT NORMALIZE('normalized_z_salary_data', 'salary_data', 'current_salary, years_worked',
'zscore');
        NORMALIZE
--------------------------
 Finished in 1 iteration
(1 row)
=> SELECT * FROM normalized_z_salary_data;
 employee_id | first_name | last_name |   years_worked    |  current_salary
-------------+------------+-----------+-------------------+-------------------
         189 | Shawn      | Moore     | 0.524447274157005 | 0.221041249770669
         518 | Earl       | Shaw      | 1.35743214416495  | 1.66054215981221
        1126 | Susan      | Alexander | 0.857641222160185 | 1.41799393943946
        1157 | Jack       | Stone     | 1.35743214416495  | 0.350834283622416
        1277 | Scott      | Wagner    | 1.52402911816654  | 0.156068522159045
        3188 | Shirley    | Flores    | 0.357850300155415 | 0.131812255991634
        3196 | Andrew     | Holmes    | 1.30811943986048  | 1.10097599783475
        3430 | Philip     | Little    | 1.35743214416495  | 0.814321286168547
        3522 | Jerry      | Ross      | 0.974925491857304 | 0.593894513770248
        3892 | Barbara    | Flores    | 0.524447274157005 | 1.41729301118583
 .
 .
 .
(1000 rows)
Normalizing salary data using robust Z-score
The following example shows how you can normalize the salary_data table using the robust Z-score normalization method.
Before you begin the example, load the Machine Learning sample data.
=> SELECT NORMALIZE('normalized_robustz_salary_data', 'salary_data', 'current_salary, years_worked', 'robust_zscore');
       NORMALIZE
--------------------------
Finished in 1 iteration
(1 row)
=> SELECT * FROM normalized_robustz_salary_data;
 employee_id | first_name | last_name |     years_worked      |    current_salary
-------------+------------+-----------+-----------------------+-----------------------
         189 | Shawn      | Moore     |    -0.404694455685957 | -0.158933849655499140
         518 | Earl       | Shaw      |    -1.079185215162552 |  1.317126172796275889
        1126 | Susan      | Alexander |    -0.674490759476595 |  1.126852528914384584
        1157 | Jack       | Stone     |    -1.079185215162552 |  0.289689691751547422
        1277 | Scott      | Wagner    |    -1.214083367057871 | -0.107964200747705902
        3188 | Shirley    | Flores    |    -0.269796303790638 |  0.117871818902746738
        3196 | Andrew     | Holmes    |     1.079185215162552 | -0.849222942006447161
        3430 | Philip     | Little    |    -1.079185215162552 |  0.653284859470426481
        3522 | Jerry      | Ross      |     0.809388911371914 |  0.480364995828913355
        3892 | Barbara    | Flores    |    -0.404694455685957 | -1.097366550974798397
        3939 | Anna       | Walker    |     0.944287063267233 |  0.414956177842775781
        4165 | Martha     | Reyes     |     0.269796303790638 |  0.773947701782753329
        4335 | Phillip    | Wright    |     1.214083367057871 |  1.218843012657445647
        4534 | Roger      | Harris    |     1.079185215162552 |  1.155185021164402608
        4806 | John       | Robinson  |     0.809388911371914 |  0.494320112876813908
        4881 | Kelly      | Welch     |     0.134898151895319 |  0.540778808820045933
        4889 | Jennifer   | Arnold    |     1.214083367057871 |  0.299762093576526566
        5067 | Martha     | Parker    |     0.000000000000000 |  0.719991348857328239
        5523 | John       | Martin    |     0.269796303790638 |  0.411248545269163826
        6004 | Nicole     | Sullivan  |     0.269796303790638 |  1.065141044522487821
        6013 | Harry      | Woods     |     0.944287063267233 |  1.005664438654129376
        6240 | Norma      | Martinez  |     1.214083367057871 |  0.762412844887071691
.
.
.
(1000 rows)
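The robust variant replaces the mean with the median and the standard deviation with the median absolute deviation (MAD). A sketch of the common formulation, (x − median) / (1.4826 × MAD), follows; the 1.4826 consistency constant is an assumption about Vertica's exact scaling, so treat this as an illustration of the idea rather than a drop-in replica of NORMALIZE:

```python
def robust_zscore(values):
    """Robust z-score: (x - median) / (1.4826 * MAD)."""
    def median(xs):
        s, n = sorted(xs), len(xs)
        return s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2
    med = median(values)
    # MAD: the median of the absolute deviations from the median.
    mad = median([abs(v - med) for v in values])
    # 1.4826 makes MAD a consistent estimator of the standard
    # deviation for normally distributed data.
    return [(v - med) / (1.4826 * mad) for v in values]

print(robust_zscore([1, 2, 3, 4, 5])[2])  # 0.0 (the median itself)
```

Because the median and MAD ignore extreme values, this method is much less sensitive to outliers (for example, a single enormous salary) than the plain Z-score.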
See also
2.6  PCA (principal component analysis)
Principal Component Analysis (PCA) is a technique that reduces the dimensionality of data while retaining the variation present in the data.
Principal Component Analysis (PCA) is a technique that reduces the dimensionality of data while retaining the variation present in the data. In essence, a new coordinate system is constructed so that data variation is strongest along the first axis, less strong along the second axis, and so on. Then, the data points are transformed into this new coordinate system. The directions of the axes are called principal components.
If the input data is a table with p columns, there can be at most p principal components. However, it is usually the case that the data variation along the direction of some k-th principal component becomes almost negligible, which allows us to keep only the first k components. As a result, the new coordinate system has fewer axes, and the transformed data table has only k columns instead of p. It is important to remember that the k output columns are not simply a subset of the p input columns. Instead, each of the k output columns is a combination of all p input columns.
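The construction described above can be sketched in a few lines of NumPy: center the columns, diagonalize the covariance matrix, and project onto the top-k eigenvectors (the principal components). This is a conceptual sketch with made-up data, not Vertica's PCA implementation:

```python
import numpy as np

def pca_fit_apply(X, k):
    """Project the rows of X onto the top-k principal components."""
    Xc = X - X.mean(axis=0)                # center each column
    cov = np.cov(Xc, rowvar=False)         # p-by-p covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]      # strongest variation first
    components = eigvecs[:, order[:k]]     # p-by-k axis directions
    return Xc @ components                 # n-by-k transformed data

# Toy two-column data set; reduce it to one dimension.
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2],
              [3.1, 3.0], [2.3, 2.7], [2.0, 1.6], [1.0, 1.1]])
scores = pca_fit_apply(X, 1)
print(scores.shape)  # (8, 1)
```

Each output column mixes all input columns through the component directions, which is why the k output columns are combinations of, not a subset of, the p inputs.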
You can use the following functions to train and apply the PCA model:
For a complete example, see Dimension reduction using PCA.
2.6.1  Dimension reduction using PCA
This PCA example uses a data set with a large number of columns named world.
This PCA example uses a data set with a large number of columns named world. The example shows how you can apply PCA to all columns in the data set (except HDI) and reduce them into two dimensions.
Before you begin the example,
load the Machine Learning sample data.

Create the PCA model, named pcamodel
.
=> SELECT PCA ('pcamodel', 'world','country,HDI,em1970,em1971,em1972,em1973,em1974,em1975,em1976,em1977,
em1978,em1979,em1980,em1981,em1982,em1983,em1984,em1985,em1986,em1987,em1988,em1989,em1990,em1991,em1992,
em1993,em1994,em1995,em1996,em1997,em1998,em1999,em2000,em2001,em2002,em2003,em2004,em2005,em2006,em2007,
em2008,em2009,em2010,gdp1970,gdp1971,gdp1972,gdp1973,gdp1974,gdp1975,gdp1976,gdp1977,gdp1978,gdp1979,gdp1980,
gdp1981,gdp1982,gdp1983,gdp1984,gdp1985,gdp1986,gdp1987,gdp1988,gdp1989,gdp1990,gdp1991,gdp1992,gdp1993,
gdp1994,gdp1995,gdp1996,gdp1997,gdp1998,gdp1999,gdp2000,gdp2001,gdp2002,gdp2003,gdp2004,gdp2005,gdp2006,
gdp2007,gdp2008,gdp2009,gdp2010' USING PARAMETERS exclude_columns='HDI, country');
PCA
---------------------------------------------------------------
Finished in 1 iterations.
Accepted Rows: 96  Rejected Rows: 0
(1 row)

View the summary output of pcamodel
.
=> SELECT GET_MODEL_SUMMARY(USING PARAMETERS model_name='pcamodel');
GET_MODEL_SUMMARY


Next, apply PCA to a few selected columns. The HDI and country columns are excluded from the transformation and passed through as key columns.
=> SELECT APPLY_PCA (HDI,country,em1970,em2010,gdp1970,gdp2010 USING PARAMETERS model_name='pcamodel',
exclude_columns='HDI,country', key_columns='HDI,country',cutoff=.3) OVER () FROM world;
  HDI  |    country    |       col1
-------+---------------+------------------
 0.886 | Belgium       |  36288.1191849017
 0.699 | Belize        |    81740.32711562
 0.427 | Benin         |  122666.882708325
 0.805 | Chile         |  161356.484748602
 0.687 | China         |  202634.254216416
 0.744 | Costa Rica    |  242043.080125449
 0.4   | Cote d'Ivoire |  283330.394428932
 0.776 | Cuba          |  322625.857541772
 0.895 | Denmark       |  356086.311721071
 0.644 | Egypt         |  403634.743992772
.
.
.
(96 rows)

Then, optionally apply the inverse function to transform the data back to its original state. This example shows an abbreviated output, only for the first record. There are 96 records in total.
=> SELECT APPLY_INVERSE_PCA (HDI,country,em1970,em2010,gdp1970,gdp2010 USING PARAMETERS model_name='pcamodel',
exclude_columns='HDI,country', key_columns='HDI,country') OVER () FROM world limit 1;
[ RECORD 1 ]
HDI     | 0.886
country | Belgium
em1970  | 3.74891915022521
em1971  | 26.091852917619
em1972  | 22.0262860721982
em1973  | 24.8214492074202
em1974  | 20.9486650320945
em1975  | 29.5717692117088
em1976  | 17.4373459783249
em1977  | 33.1895610966146
em1978  | 15.6251407781098
em1979  | 14.9560299812815
em1980  | 18.0870223053504
em1981  | 6.23151505146251
em1982  | 7.12300504708672
em1983  | 7.52627957856581
em1984  | 7.17428622245234
em1985  | 9.04899186621455
em1986  | 10.5098581697156
em1987  | 7.97146984849547
em1988  | 8.85458031319287
em1989  | 8.78422101747477
em1990  | 9.61931854722004
em1991  | 11.6411235452067
em1992  | 12.8882752879355
em1993  | 15.0647523842803
em1994  | 14.3266175918398
em1995  | 9.07603254825782
em1996  | 9.32002671928241
em1997  | 10.0209028262361
em1998  | 6.70882735196004
em1999  | 7.32575918131333
em2000  | 10.3113551933996
em2001  | 11.0162573094354
em2002  | 10.886264397431
em2003  | 8.96078372850612
em2004  | 11.5157129257881
em2005  | 12.5048269019293
em2006  | 12.2345161132594
em2007  | 8.92504587601715
em2008  | 12.1136551375247
em2009  | 10.1144380511421
em2010  | 7.72468307053519
gdp1970 | 10502.1047183969
gdp1971 | 9259.97560190599
gdp1972 | 6593.98178532712
gdp1973 | 5325.33813328068
gdp1974 | 899.029529832931
gdp1975 | 3184.93671107899
gdp1976 | 4517.68204331439
gdp1977 | 3322.9509067019
gdp1978 | 33.8221923368737
gdp1979 | 2882.50573071066
gdp1980 | 3638.74436577365
gdp1981 | 2211.77365027338
gdp1982 | 5811.44631880621
gdp1983 | 7365.75180165581
gdp1984 | 10465.1797058904
gdp1985 | 12312.7219748196
gdp1986 | 12309.0418293413
gdp1987 | 13695.5173269466
gdp1988 | 12531.9995299889
gdp1989 | 13009.2244205049
gdp1990 | 10697.6839797576
gdp1991 | 6835.94651304181
gdp1992 | 4275.67753277099
gdp1993 | 3382.29408813394
gdp1994 | 3703.65406726311
gdp1995 | 4238.17659535371
gdp1996 | 4692.48744219914
gdp1997 | 4539.23538342266
gdp1998 | 5886.78983381162
gdp1999 | 7527.72448728762
gdp2000 | 7646.05563584361
gdp2001 | 9053.22077886667
gdp2002 | 9914.82548013531
gdp2003 | 9201.64413455221
gdp2004 | 9234.70123279344
gdp2005 | 9565.5457350936
gdp2006 | 9569.86316415438
gdp2007 | 9104.60260145907
gdp2008 | 8182.8163827425
gdp2009 | 6279.93197775805
gdp2010 | 4274.40397281553
See also
2.7  Sampling data
The goal of data sampling is to take a smaller, more manageable sample of a much larger data set.
The goal of data sampling is to take a smaller, more manageable sample of a much larger data set. With a sample data set, you can produce predictive models or use it to help you tune your database. The following example shows how you can use the TABLESAMPLE
clause to create a sample of your data.
Sampling data from a table
Before you begin the example,
load the Machine Learning sample data.
Using the baseball
table, create a new table named baseball_sample
containing a 25% sample of baseball
. Remember, TABLESAMPLE does not guarantee that the exact percentage of records defined in the clause is returned.
=> CREATE TABLE baseball_sample AS SELECT * FROM baseball TABLESAMPLE(25);
CREATE TABLE
=> SELECT * FROM baseball_sample;
 id  | first_name | last_name |    dob     |   team    | hr | hits |  avg  |   salary
-----+------------+-----------+------------+-----------+----+------+-------+-------------
   4 | Amanda     | Turner    | 1997-12-22 | Maroon    | 58 |  177 | 0.187 |     8047721
  20 | Jesse      | Cooper    | 1983-04-13 | Yellow    | 97 |   39 | 0.523 |     4252837
  22 | Randy      | Peterson  | 1980-05-28 | Orange    | 14 |   16 | 0.141 |  11827728.1
  24 | Carol      | Harris    | 1991-04-02 | Fuscia    | 96 |   12 | 0.456 |  40572253.6
  32 | Rose       | Morrison  | 1977-07-26 | Goldenrod | 27 |  153 | 0.442 | 14510752.49
  50 | Helen      | Medina    | 1987-12-26 | Maroon    | 12 |  150 | 0.54  | 32169267.91
  70 | Richard    | Gilbert   | 1983-07-13 | Khaki     |  1 |  250 | 0.213 | 40518422.76
  81 | Angela     | Cole      | 1991-08-16 | Violet    | 87 |  136 | 0.706 | 42875181.51
  82 | Elizabeth  | Foster    | 1994-04-30 | Indigo    | 46 |  163 | 0.481 | 33896975.53
  98 | Philip     | Gardner   | 1992-05-06 | Puce      | 39 |  239 | 0.697 | 20967480.67
 102 | Ernest     | Freeman   | 1983-10-05 | Turquoise | 46 |   77 | 0.564 | 21444463.92
.
.
.
(227 rows)
With your sample, you can create a predictive model or tune your database.
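Conceptually, this kind of sampling keeps each row independently with the given probability, which is why the example returns 227 rows rather than exactly 250. A sketch of that behavior (an illustration of the idea, not Vertica's TABLESAMPLE implementation):

```python
import random

def tablesample(rows, percent, seed=None):
    """Keep each row independently with probability percent/100.
    The result size is only approximately percent of the input."""
    rng = random.Random(seed)
    return [r for r in rows if rng.random() < percent / 100.0]

sample = tablesample(range(1000), 25, seed=42)
print(len(sample))  # roughly 250; the exact count varies with the seed
```

Because each row is an independent draw, repeated runs over the same table produce samples of slightly different sizes.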
See also
 FROM clause (for more information about the
TABLESAMPLE
clause)
2.8  SVD (singular value decomposition)
Singular Value Decomposition (SVD) is a matrix decomposition method that allows you to approximate matrix X with dimensions n-by-p as a product of 3 matrices: X(n-by-p) = U(n-by-k).S(k-by-k).VT(k-by-p) where k is an integer from 1 to p, and S is a diagonal matrix.
Singular Value Decomposition (SVD) is a matrix decomposition method that allows you to approximate matrix X with dimensions n-by-p as a product of 3 matrices: X(n-by-p) = U(n-by-k).S(k-by-k).VT(k-by-p) where k is an integer from 1 to p, and S is a diagonal matrix. Its diagonal has non-negative values, called singular values, sorted from the largest, at the top left, to the smallest, at the bottom right. All other elements of S are zero.
In practice, the matrix V(p-by-k), which is the transposed version of VT, is generally preferred.
If k (an input parameter to the decomposition algorithm, also known as the number of components to keep in the output) is equal to p, the decomposition is exact. If k is less than p, the decomposition becomes an approximation.
An application of SVD is lossy data compression. For example, storing X requires n.p elements, while storing the three matrices U, S, and VT requires storing only n.k + k + k.p elements. If n=1000, p=10, and k=2, storing X would require 10,000 elements while storing the approximation would require 2,000+2+20 = 2,022 elements. A smaller value of k increases the savings in storage space, while a larger value of k gives a more accurate approximation.
Depending on your data, the singular values may decrease rapidly, which allows you to choose a value of k that is much smaller than the value of p.
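The storage arithmetic above can be checked with a short NumPy sketch of a truncated SVD (random data here is purely illustrative):

```python
import numpy as np

n, p, k = 1000, 10, 2
rng = np.random.default_rng(0)
X = rng.random((n, p))

# Truncated SVD: keep only the first k singular triplets.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
X_approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

full_storage = n * p            # 10,000 elements for X
compressed = n * k + k + k * p  # 2,000 + 2 + 20 = 2,022 elements
print(full_storage, compressed)  # 10000 2022
```

NumPy returns the singular values in `s` already sorted from largest to smallest, which mirrors the ordering of S described above; the quality of `X_approx` depends on how quickly those values decay.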
Another common application of SVD is to perform principal component analysis.
You can use the following functions to train and apply the SVD model:
For a complete example, see Computing SVD.
2.8.1  Computing SVD
This SVD example uses a small data set named small_svd.
This SVD example uses a small data set named small_svd. The example shows how to compute the SVD of this data set, which is a matrix of numbers. Computing the SVD produces one vector and two matrices: the vector contains the singular values, the matrix U contains the left singular vectors, and the matrix V contains the right singular vectors.
Before you begin the example,
load the Machine Learning sample data.

Create the SVD model, named svdmodel
.
=> SELECT SVD ('svdmodel', 'small_svd', 'x1,x2,x3,x4');
SVD
--------------------------------------------------------------
Finished in 1 iterations.
Accepted Rows: 8  Rejected Rows: 0
(1 row)

View the summary output of svdmodel
.
=> SELECT GET_MODEL_SUMMARY(USING PARAMETERS model_name='svdmodel');
GET_MODEL_SUMMARY
--------------------------------------------------------------
=======
columns
=======
index | name
------+------
    1 | x1
    2 | x2
    3 | x3
    4 | x4
===============
singular_values
===============
index |  value   | explained_variance | accumulated_explained_variance
------+----------+--------------------+-------------------------------
    1 | 22.58748 |            0.95542 |                        0.95542
    2 |  3.79176 |            0.02692 |                        0.98234
    3 |  2.55864 |            0.01226 |                        0.99460
    4 |  1.69756 |            0.00540 |                        1.00000
======================
right_singular_vectors
======================
index | vector1  | vector2  | vector3  | vector4
------+----------+----------+----------+----------
    1 |  0.58736 |  0.08033 |  0.74288 | -0.31094
    2 |  0.26661 |  0.78275 | -0.06148 |  0.55896
    3 |  0.71779 | -0.13672 | -0.64563 | -0.22193
    4 |  0.26211 | -0.60179 |  0.16587 |  0.73596
========
counters
========
   counter_name    | counter_value
-------------------+---------------
accepted_row_count |             8
rejected_row_count |             0
iteration_count    |             1
===========
call_string
===========
SELECT SVD('public.svdmodel', 'small_svd', 'x1,x2,x3,x4');
(1 row)

Create a new table, named Umat
to obtain the values for U.
=> CREATE TABLE Umat AS SELECT APPLY_SVD(id, x1, x2, x3, x4 USING PARAMETERS model_name='svdmodel',
exclude_columns='id', key_columns='id') OVER() FROM small_svd;
CREATE TABLE

View the results in the Umat
table. This table contains the data transformed into the new coordinate system.
=> SELECT * FROM Umat ORDER BY id;
 id |        col1        |        col2         |         col3         |        col4
----+--------------------+---------------------+----------------------+---------------------
  1 |  0.494871802886819 |   0.161721379259287 |   0.0712816417153664 |  -0.473145877877408
  2 |   0.17652411036246 | -0.0753183783382909 |   -0.678196192333598 |  0.0567124770173372
  3 |  0.150974762654569 |   0.589561842046029 |  0.00392654610109522 |   0.360011163271921
  4 |   0.44849499240202 |  -0.347260956311326 |    0.186958376368345 |   0.378561270493651
  5 |  0.494871802886819 |   0.161721379259287 |   0.0712816417153664 |  -0.473145877877408
  6 |   0.17652411036246 | -0.0753183783382909 |   -0.678196192333598 |  0.0567124770173372
  7 |  0.150974762654569 |   0.589561842046029 |  0.00392654610109522 |   0.360011163271921
  8 |   0.44849499240202 |  -0.347260956311326 |    0.186958376368345 |   0.378561270493651
(8 rows)

Then, we can optionally transform the data back by converting it from Umat to Xmat. First, we must create the Xmat table and then apply the APPLY_INVERSE_SVD function to the table:
=> CREATE TABLE Xmat AS SELECT APPLY_INVERSE_SVD(* USING PARAMETERS model_name='svdmodel',
exclude_columns='id', key_columns='id') OVER() FROM Umat;
CREATE TABLE

Then view the data from the Xmat table that was created:
=> SELECT id, x1::NUMERIC(5,1), x2::NUMERIC(5,1), x3::NUMERIC(5,1), x4::NUMERIC(5,1) FROM Xmat
ORDER BY id;
 id | x1  | x2  | x3  | x4
----+-----+-----+-----+-----
  1 | 7.0 | 3.0 | 8.0 | 2.0
  2 | 1.0 | 1.0 | 4.0 | 1.0
  3 | 2.0 | 3.0 | 2.0 | 0.0
  4 | 6.0 | 2.0 | 7.0 | 4.0
  5 | 7.0 | 3.0 | 8.0 | 2.0
  6 | 1.0 | 1.0 | 4.0 | 1.0
  7 | 2.0 | 3.0 | 2.0 | 0.0
  8 | 6.0 | 2.0 | 7.0 | 4.0
(8 rows)
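The same round trip can be sketched with NumPy, using the 8-by-4 matrix recovered in the Xmat table above: the factorization plays the role of SVD plus APPLY_SVD, and multiplying the factors back together plays the role of APPLY_INVERSE_SVD. Note that the sign conventions of the singular vectors may differ from Vertica's, but the reconstruction is the same either way:

```python
import numpy as np

# The small_svd matrix, as shown in the Xmat table (rows 5-8 repeat 1-4).
X = np.array([[7.0, 3.0, 8.0, 2.0],
              [1.0, 1.0, 4.0, 1.0],
              [2.0, 3.0, 2.0, 0.0],
              [6.0, 2.0, 7.0, 4.0]] * 2)

U, s, Vt = np.linalg.svd(X, full_matrices=False)  # factorize: X = U.S.VT
X_back = U @ np.diag(s) @ Vt                      # inverse transform
print(np.allclose(X, X_back))                     # True
```

Because all four components are kept here, the inverse transform reproduces the original values exactly, just as the Xmat output above matches the original small_svd data.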
See also
3  Regression algorithms
Regression is an important and popular machine learning tool that makes predictions from data by learning the relationship between some features of the data and an observed value, the response.
Regression is an important and popular machine learning tool that makes predictions from data by learning the relationship between some features of the data and an observed value, the response. Regression is used to make predictions about profits, sales, temperature, stocks, and more. For example, you could use regression to predict the price of a house based on the location, the square footage, the size of the lot, and so on. In this example, the house's value is the response, and the other factors, such as location, are the features.
The optimal set of coefficients found for the regression's equation is known as the model. The relationship between the outcome and the features is summarized in the model, which can then be applied to different data sets, where the outcome value is unknown.
3.1  Autoregression
See the autoregressive model example under time series models.
See the autoregressive model example under time series models.
3.2  Linear regression
Using linear regression, you can model the linear relationship between independent variables, or features, and a dependent variable, or outcome.
Using linear regression, you can model the linear relationship between independent variables, or features, and a dependent variable, or outcome. You can build linear regression models to:

Fit a predictive model to a training data set of independent variables and some dependent variable. Doing so allows you to use feature variable values to make predictions on outcomes. For example, you can predict the amount of rain that will fall on a particular day of the year.

Determine the strength of the relationship between an independent variable and some outcome variable. For example, suppose you want to determine the importance of various weather variables on the outcome of how much rain will fall. You can build a linear regression model based on observations of weather patterns and rainfall to find the answer.
Unlike Logistic regression, which you use to determine a binary classification outcome, linear regression is primarily used to predict continuous numerical outcomes in linear relationships.
You can use the following functions to build a linear regression model, view the model, and use the model to make predictions on a set of test data:
For a complete example of how to use linear regression on a table in Vertica, see Building a linear regression model.
3.2.1  Building a linear regression model
This linear regression example uses a small data set named faithful.
This linear regression example uses a small data set named faithful. The data set contains the intervals between eruptions and the duration of eruptions for the Old Faithful geyser in Yellowstone National Park. The duration of each eruption can be between 1.5 and 5 minutes. The length of intervals between eruptions and of each eruption varies. However, you can estimate the time of the next eruption based on the duration of the previous eruption. The example shows how you can build a model to predict the value of eruptions
, given the value of the waiting
feature.
Before you begin the example,
load the Machine Learning sample data.

Create the linear regression model, named linear_reg_faithful
, using the faithful_training
training data:
=> SELECT LINEAR_REG('linear_reg_faithful', 'faithful_training', 'eruptions', 'waiting'
USING PARAMETERS optimizer='BFGS');
LINEAR_REG
---------------------------
Finished in 6 iterations
(1 row)

View the summary output of linear_reg_faithful
:
=> SELECT GET_MODEL_SUMMARY(USING PARAMETERS model_name='linear_reg_faithful');

=======
details
=======
predictor | coefficient | std_err | t_value  | p_value
----------+-------------+---------+----------+---------
Intercept |    -2.06795 | 0.21063 | -9.81782 | 0.00000
waiting   |     0.07876 | 0.00292 | 26.96925 | 0.00000
==============
regularization
==============
type | lambda
-----+---------
none | 1.00000
===========
call_string
===========
linear_reg('public.linear_reg_faithful', 'faithful_training', '"eruptions"', 'waiting'
USING PARAMETERS optimizer='bfgs', epsilon=1e-06, max_iterations=100,
regularization='none', lambda=1)
===============
Additional Info
===============
       Name        | Value
-------------------+-------
iteration_count    |     3
rejected_row_count |     0
accepted_row_count |   162
(1 row)

Create a table that contains the response values from running the PREDICT_LINEAR_REG
function on your test data. Name this table pred_faithful_results
. View the results in the pred_faithful_results
table:
=> CREATE TABLE pred_faithful_results AS
(SELECT id, eruptions, PREDICT_LINEAR_REG(waiting USING PARAMETERS model_name='linear_reg_faithful')
AS pred FROM faithful_testing);
CREATE TABLE
=> SELECT * FROM pred_faithful_results ORDER BY id;
 id | eruptions |       pred
----+-----------+------------------
  4 |     2.283 |  2.8151271587036
  5 |     4.533 | 4.62659045686076
  8 |       3.6 | 4.62659045686076
  9 |      1.95 | 1.94877514654148
 11 |     1.833 | 2.18505296804024
 12 |     3.917 | 4.54783118302784
 14 |      1.75 |  1.6337380512098
 20 |      4.25 | 4.15403481386324
 22 |      1.75 |  1.6337380512098
.
.
.
(110 rows)
Calculating the mean squared error (MSE)
You can calculate how well your model fits the data using the MSE function. MSE returns the average of the squared differences between the actual and predicted values.
=> SELECT MSE (eruptions::float, pred::float) OVER() FROM
(SELECT eruptions, pred FROM pred_faithful_results) AS prediction_output;
        mse        |                    Comments
-------------------+-----------------------------------------------
 0.252925741352641 | Of 110 rows, 110 were used and 0 were ignored
(1 row)
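The prediction itself is just intercept + coefficient × waiting, and MSE averages the squared residuals. A sketch in Python, assuming the fitted values from the summary above (intercept −2.06795, slope 0.07876, which reproduce the predictions shown) and assuming id 4 in the faithful data has waiting=62, a value not shown in the output:

```python
def predict_linear(waiting, intercept=-2.06795, slope=0.07876):
    """Apply the fitted line, like PREDICT_LINEAR_REG."""
    return intercept + slope * waiting

def mse(actual, predicted):
    """Mean of the squared differences between actual and predicted."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

# waiting=62 gives ~2.815, matching id 4 in pred_faithful_results.
print(round(predict_linear(62), 4))  # 2.8152
```

Running mse over all 110 test pairs would reproduce the 0.2529 figure reported by the MSE function.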
See also
3.3  Poisson regression
Using Poisson regression, you can model count data.
Using Poisson regression, you can model count data. Poisson regression offers an alternative to linear regression or logistic regression and is useful when the target variable describes event frequency (event count in a fixed interval of time). Linear regression is preferable if you aim to predict continuous numerical outcomes in linear relationships, while logistic regression is used for predicting a binary classification.
You can use the following functions to build a Poisson regression model, view the model, and use the model to make predictions on a set of test data:
3.4  Random forest for regression
The Random Forest for regression algorithm creates an ensemble model of regression trees.
The Random Forest for regression algorithm creates an ensemble model of regression trees. Each tree is trained on a randomly selected subset of the training data. The algorithm predicts the value that is the mean prediction of the individual trees.
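The two ideas in that paragraph — train each tree on a random subset, then average the trees' predictions — can be sketched as follows. The "tree" here is a deliberately trivial stand-in (it predicts the mean target of its subset), and the parameter names ntree and sampling_size are borrowed from RF_REGRESSOR for illustration only:

```python
import random

def train_forest(train_fn, rows, ntree=100, sampling_size=0.3, seed=0):
    """Train each tree on a random subset of the training rows."""
    rng = random.Random(seed)
    k = max(1, int(sampling_size * len(rows)))
    return [train_fn(rng.sample(rows, k)) for _ in range(ntree)]

def forest_predict(trees, x):
    """The forest's prediction is the mean of the trees' predictions."""
    preds = [t(x) for t in trees]
    return sum(preds) / len(preds)

def mean_tree(subset):
    """Toy 'tree': always predict the mean target of its subset."""
    mean_y = sum(y for _, y in subset) / len(subset)
    return lambda x: mean_y

rows = [(i, 2.0 * i) for i in range(10)]        # targets 0..18, mean 9
trees = train_forest(mean_tree, rows, ntree=50)
print(forest_predict(trees, 5))  # close to 9.0
```

Averaging many trees trained on different subsets reduces the variance of any single tree, which is the point of the ensemble.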
You can use the following functions to train the Random Forest model, and use the model to make predictions on a set of test data:
For a complete example of how to use the Random Forest for regression algorithm in Vertica, see Building a random forest regression model.
3.4.1  Building a random forest regression model
This example uses the "mtcars" dataset to create a random forest model to predict the value of carb (the number of carburetors).
This example uses the "mtcars" dataset to create a random forest model to predict the value of carb
(the number of carburetors).
Before you begin the example,
load the Machine Learning sample data.

Use
RF_REGRESSOR
to create the random forest model myRFRegressorModel
using the mtcars
training data. View the summary output of the model with
GET_MODEL_SUMMARY
:
=> SELECT RF_REGRESSOR ('myRFRegressorModel', 'mtcars', 'carb', 'mpg, cyl, hp, drat, wt' USING PARAMETERS
ntree=100, sampling_size=0.3);
RF_REGRESSOR
----------------
Finished
(1 row)
=> SELECT GET_MODEL_SUMMARY(USING PARAMETERS model_name='myRFRegressorModel');

===========
call_string
===========
SELECT rf_regressor('public.myRFRegressorModel', 'mtcars', '"carb"', 'mpg, cyl, hp, drat, wt'
USING PARAMETERS exclude_columns='', ntree=100, mtry=1, sampling_size=0.3, max_depth=5, max_breadth=32,
min_leaf_size=5, min_info_gain=0, nbins=32);
=======
details
=======
predictor | type
----------+-------
mpg       | float
cyl       | int
hp        | int
drat      | float
wt        | float
===============
Additional Info
===============
       Name        | Value
-------------------+-------
tree_count         |   100
rejected_row_count |     0
accepted_row_count |    32
(1 row)

Use
PREDICT_RF_REGRESSOR
to predict the number of carburetors:
=> SELECT PREDICT_RF_REGRESSOR (mpg,cyl,hp,drat,wt
USING PARAMETERS model_name='myRFRegressorModel') FROM mtcars;
PREDICT_RF_REGRESSOR
----------------------
2.94774203574204
2.6954087024087
2.6954087024087
2.89906346431346
2.97688489288489
2.97688489288489
2.7086587024087
2.92078965478965
2.97688489288489
2.7086587024087
2.95621822621823
2.82255155955156
2.7086587024087
2.7086587024087
2.85650394050394
2.85650394050394
2.97688489288489
2.95621822621823
2.6954087024087
2.6954087024087
2.84493251193251
2.97688489288489
2.97688489288489
2.8856467976468
2.6954087024087
2.92078965478965
2.97688489288489
2.97688489288489
2.7934087024087
2.7934087024087
2.7086587024087
2.72469441669442
(32 rows)
3.5  SVM (support vector machine) for regression
Support Vector Machine (SVM) for regression predicts continuous ordered variables based on the training data.
Support Vector Machine (SVM) for regression predicts continuous ordered variables based on the training data.
Unlike Logistic regression, which you use to determine a binary classification outcome, SVM for regression is primarily used to predict continuous numerical outcomes.
You can use the following functions to build an SVM for regression model, view the model, and use the model to make predictions on a set of test data:
For a complete example of how to use the SVM algorithm in Vertica, see Building an SVM for regression model.
3.5.1  Building an SVM for regression model
This SVM for regression example uses a small data set named faithful, based on the Old Faithful geyser in Yellowstone National Park.
This SVM for regression example uses a small data set named faithful, based on the Old Faithful geyser in Yellowstone National Park. The data set contains values about the waiting time between eruptions and the duration of eruptions of the geyser. The example shows how you can build a model to predict the value of eruptions
, given the value of the waiting
feature.
Before you begin the example,
load the Machine Learning sample data.

Create the SVM model, named svm_faithful
, using the faithful_training
training data:
=> SELECT SVM_REGRESSOR('svm_faithful', 'faithful_training', 'eruptions', 'waiting'
USING PARAMETERS error_tolerance=0.1, max_iterations=100);
SVM_REGRESSOR
----------------------------
Finished in 5 iterations
Accepted Rows: 162  Rejected Rows: 0
(1 row)

View the summary output of svm_faithful
:
=> SELECT GET_MODEL_SUMMARY(USING PARAMETERS model_name='svm_faithful');

=======
details
=======
===========================
Predictors and Coefficients
===========================
          | Coefficients
----------+--------------
Intercept |     -1.59007
waiting   |      0.07217
===========
call_string
===========
Call string:
SELECT svm_regressor('public.svm_faithful', 'faithful_training', '"eruptions"',
'waiting' USING PARAMETERS error_tolerance = 0.1, C=1, max_iterations=100,
epsilon=0.001);
===============
Additional Info
===============
       Name        | Value
-------------------+-------
accepted_row_count |   162
rejected_row_count |     0
iteration_count    |     5
(1 row)

Create a new table that contains the response values from running the PREDICT_SVM_REGRESSOR
function on your test data. Name this table pred_faithful
. View the results in the pred_faithful
table:
=> CREATE TABLE pred_faithful AS
(SELECT id, eruptions, PREDICT_SVM_REGRESSOR(waiting USING PARAMETERS model_name='svm_faithful')
AS pred FROM faithful_testing);
CREATE TABLE
=> SELECT * FROM pred_faithful ORDER BY id;
 id | eruptions |       pred
----+-----------+------------------
  4 |     2.283 | 2.88444568755189
  5 |     4.533 | 4.54434581879796
  8 |       3.6 | 4.54434581879796
  9 |      1.95 | 2.09058040739072
 11 |     1.833 | 2.30708912016195
 12 |     3.917 | 4.47217624787422
 14 |      1.75 | 1.80190212369576
 20 |      4.25 | 4.11132839325551
 22 |      1.75 | 1.80190212369576
.
.
.
(110 rows)
Calculating the mean squared error (MSE)
You can calculate how well your model fits the data by using the MSE function. MSE returns the average of the squared differences between the actual and predicted values.
=> SELECT MSE(obs::float, prediction::float) OVER()
FROM (SELECT eruptions AS obs, pred AS prediction
FROM pred_faithful) AS prediction_output;
        mse        |                    Comments
-------------------+-----------------------------------------------
 0.254499811834235 | Of 110 rows, 110 were used and 0 were ignored
(1 row)
See also
3.6  XGBoost for regression
The following XGBoost functions create and perform predictions with a regression model.
XGBoost (eXtreme Gradient Boosting) is a popular supervised-learning algorithm used for regression and classification on large datasets. It uses sequentially-built shallow decision trees to provide accurate results and a highly scalable training method that avoids overfitting.
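The "sequentially-built shallow trees" idea is gradient boosting: each round fits a weak learner to the current residuals, and its contribution is scaled by the learning rate. The following is a bare-bones sketch of that loop using depth-1 decision stumps on a single feature; it illustrates the principle only and omits everything that makes XGBoost itself fast and regularized:

```python
def fit_stump(xs, residuals):
    """Best single split on one feature, minimizing squared error."""
    best = None
    for split in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= split]
        right = [r for x, r in zip(xs, residuals) if x > split]
        if not left or not right:
            continue
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        err = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, split, lmean, rmean)
    _, split, lmean, rmean = best
    return lambda x: lmean if x <= split else rmean

def boost(xs, ys, rounds=10, learning_rate=0.5):
    """Gradient boosting for squared error with stump weak learners."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, preds)]  # what's left to fit
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)

xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.0, 2.0, 2.0, 4.0, 4.0]
model = boost(xs, ys)
print(model(5))  # approaches 4.0 as rounds increase
```

A smaller learning_rate makes each tree's correction more conservative, which is why learning_rate interacts with max_ntree in the XGB_REGRESSOR parameters below.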
The following XGBoost functions create and perform predictions with a regression model:
Example
This example uses a small data set named "mtcars", which contains design and performance data for 32 automobiles from 1973-1974, and creates an XGBoost regression model to predict the value of the variable carb
(the number of carburetors).
Before you begin the example,
load the Machine Learning sample data.

Use
XGB_REGRESSOR
to create the XGBoost regression model xgb_cars
from the mtcars
dataset:
=> SELECT XGB_REGRESSOR ('xgb_cars', 'mtcars', 'carb', 'mpg, cyl, hp, drat, wt'
USING PARAMETERS learning_rate=0.5);
XGB_REGRESSOR
----------------
Finished
(1 row)
You can then view a summary of the model with
GET_MODEL_SUMMARY
:
=> SELECT GET_MODEL_SUMMARY(USING PARAMETERS model_name='xgb_cars');
GET_MODEL_SUMMARY
------------------------------------------------------
===========
call_string
===========
xgb_regressor('public.xgb_cars', 'mtcars', '"carb"', 'mpg, cyl, hp, drat, wt'
USING PARAMETERS exclude_columns='', max_ntree=10, max_depth=5, nbins=32, objective=squarederror,
split_proposal_method=global, epsilon=0.001, learning_rate=0.5, min_split_loss=0, weight_reg=0, sampling_size=1)
=======
details
=======
predictor |       type
----------+------------------
mpg       | float or numeric
cyl       | int
hp        | int
drat      | float or numeric
wt        | float or numeric
===============
Additional Info
===============
       Name        | Value
-------------------+-------
tree_count         |    10
rejected_row_count |     0
accepted_row_count |    32
(1 row)

Use
PREDICT_XGB_REGRESSOR
to predict the number of carburetors:
=> SELECT carb, PREDICT_XGB_REGRESSOR (mpg,cyl,hp,drat,wt USING PARAMETERS model_name='xgb_cars') FROM mtcars;
 carb | PREDICT_XGB_REGRESSOR
------+------------------------
    4 |       4.00335213618023
    2 |        2.0038188946536
    6 |       5.98866003194438
    1 |       1.01774386191546
    2 |        1.9959801016274
    2 |        2.0038188946536
    4 |       3.99545403625739
    8 |       7.99211056556231
    2 |       1.99291901733151
    3 |        2.9975688946536
    3 |        2.9975688946536
    1 |       1.00320357711227
    2 |        2.0038188946536
    4 |       3.99545403625739
    4 |       4.00124134679445
    1 |       1.00759516721382
    4 |       3.99700517763435
    4 |       3.99580193056138
    4 |       4.00009088187525
    3 |        2.9975688946536
    2 |       1.98625064560888
    1 |       1.00355294416998
    2 |       2.00666247039502
    1 |       1.01682931210169
    4 |       4.00124134679445
    1 |       1.01007809485918
    2 |       1.98438405824605
    4 |       3.99580193056138
    2 |       1.99291901733151
    4 |       4.00009088187525
    2 |        2.0038188946536
    1 |       1.00759516721382
(32 rows)
4  Classification algorithms
Classification is an important and popular machine learning tool that assigns items in a data set to different categories.
Classification is an important and popular machine learning tool that assigns items in a data set to different categories. Classification is used to predict risk over time, in fraud detection, text categorization, and more. Classification functions begin with a data set where the different categories are known. For example, suppose you want to classify students based on how likely they are to get into graduate school. In addition to factors like admission exam scores and grades, you could also track work experience.
Binary classification means the outcome, in this case, admission, only has two possible values: admit or do not admit. Multiclass outcomes have more than two values. For example, low, medium, or high chance of admission. During the training process, classification algorithms find the relationship between the outcome and the features. This relationship is summarized in the model, which can then be applied to different data sets, where the categories are unknown.
4.1  Logistic regression
Using logistic regression, you can model the relationship between independent variables, or features, and some dependent variable, or outcome.
Using logistic regression, you can model the relationship between independent variables, or features, and some dependent variable, or outcome. The outcome of logistic regression is always a binary value.
You can build logistic regression models to:

Fit a predictive model to a training data set of independent variables and some binary dependent variable. Doing so allows you to make predictions on outcomes, such as whether a piece of email is spam mail or not.

Determine the strength of the relationship between an independent variable and some binary outcome variable. For example, suppose you want to determine whether an email is spam or not. You can build a logistic regression model, based on observations of the properties of email messages. Then, you can determine the importance of various properties of an email message on that outcome.
You can use the following functions to build a logistic regression model, view the model, and use the model to make predictions on a set of test data:
For a complete programming example of how to use logistic regression on a table in Vertica, see Building a logistic regression model.
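The model behind both uses is the same: a linear combination of the features passed through the sigmoid function, giving a probability that is thresholded at 0.5 to produce the binary label. A minimal sketch, with hypothetical coefficients chosen purely for illustration (they are not fitted values from any example in this section):

```python
import math

def predict_logistic(features, coefficients, intercept):
    """Return (probability of outcome 1, predicted binary label)."""
    z = intercept + sum(c * f for c, f in zip(coefficients, features))
    p = 1.0 / (1.0 + math.exp(-z))  # sigmoid maps z to (0, 1)
    return p, int(p > 0.5)

# Hypothetical two-feature model, e.g. (cyl, wt)-like inputs.
p, label = predict_logistic([4, 2.2], [0.5, -1.2], 0.3)
print(label)  # 0, because p falls below 0.5
```

Training searches for the coefficients that make these probabilities match the observed binary outcomes, which is what LOGISTIC_REG does in the example that follows.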
4.1.1  Building a logistic regression model
This logistic regression example uses a small data set named mtcars.
This logistic regression example uses a small data set named mtcars. The example shows how to build a model that predicts the value of am, which indicates whether the car has an automatic or a manual transmission, based on the given values of all the other features in the data set.
In this example, roughly 60% of the data is used as training data to create a model. The remaining 40% is used as test data against which you can evaluate your logistic regression model.
Before you begin the example,
load the Machine Learning sample data.

Create the logistic regression model, named logistic_reg_mtcars, using the mtcars_train training data.
=> SELECT LOGISTIC_REG('logistic_reg_mtcars', 'mtcars_train', 'am', 'cyl, wt'
USING PARAMETERS exclude_columns='hp');
LOGISTIC_REG

Finished in 15 iterations
(1 row)

View the summary output of logistic_reg_mtcars
.
=> SELECT GET_MODEL_SUMMARY(USING PARAMETERS model_name='logistic_reg_mtcars');

=======
details
=======
predictor | coefficient |   std_err   | z_value | p_value
----------+-------------+-------------+---------+---------
Intercept |   262.39898 | 44745.77338 | 0.00586 | 0.99532
cyl       |   -16.75892 |  5987.23236 |-0.00280 | 0.99777
wt        |  -119.92116 | 17237.03154 |-0.00696 | 0.99445
==============
regularization
==============
type | lambda
-----+---------
none | 1.00000
===========
call_string
===========
logistic_reg('public.logistic_reg_mtcars', 'mtcars_train', '"am"', 'cyl, wt'
USING PARAMETERS exclude_columns='hp', optimizer='newton', epsilon=1e-06,
max_iterations=100, regularization='none', lambda=1)
===============
Additional Info
===============
       Name        | Value
-------------------+-------
  iteration_count  |    20
rejected_row_count |     0
accepted_row_count |    20
(1 row)

Create a table named mtcars_predict_results. Populate this table with the prediction outputs you obtain from running the PREDICT_LOGISTIC_REG function on your test data. View the results in the mtcars_predict_results table.
=> CREATE TABLE mtcars_predict_results AS
(SELECT car_model, am, PREDICT_LOGISTIC_REG(cyl, wt
USING PARAMETERS model_name='logistic_reg_mtcars')
AS Prediction FROM mtcars_test);
CREATE TABLE
=> SELECT * FROM mtcars_predict_results;
   car_model    | am | Prediction
----------------+----+------------
 AMC Javelin    |  0 |          0
 Hornet 4 Drive |  0 |          0
 Maserati Bora  |  1 |          0
 Merc 280       |  0 |          0
 Merc 450SL     |  0 |          0
 Toyota Corona  |  0 |          1
 Volvo 142E     |  1 |          1
 Camaro Z28     |  0 |          0
 Datsun 710     |  1 |          1
 Honda Civic    |  1 |          1
 Porsche 914-2  |  1 |          1
 Valiant        |  0 |          0
(12 rows)

Evaluate the accuracy of the PREDICT_LOGISTIC_REG function, using the CONFUSION_MATRIX evaluation function.
=> SELECT CONFUSION_MATRIX(obs::int, pred::int USING PARAMETERS num_classes=2) OVER()
FROM (SELECT am AS obs, Prediction AS pred FROM mtcars_predict_results) AS prediction_output;
class | 0 | 1 |                  comment
------+---+---+----------------------------------------------
    0 | 6 | 1 |
    1 | 1 | 4 | Of 12 rows, 12 were used and 0 were ignored
(2 rows)
In this case, PREDICT_LOGISTIC_REG correctly predicted that four out of the five cars with a value of 1 in the am column have a value of 1. Out of the seven cars which had a value of 0 in the am column, six were correctly predicted to have the value 0. One car was incorrectly classified as having the value 1.
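You can turn the counts in such a confusion matrix into the standard evaluation metrics by hand. A short Python sketch, using the counts from the matrix above:

```python
# Counts from a 2x2 confusion matrix: rows are observed, columns are predicted.
tn, fp = 6, 1    # observed 0: predicted 0, predicted 1
fn, tp = 1, 4    # observed 1: predicted 0, predicted 1

accuracy  = (tp + tn) / (tp + tn + fp + fn)   # fraction of all predictions that are correct
precision = tp / (tp + fp)                    # of the rows predicted 1, how many really are 1
recall    = tp / (tp + fn)                    # of the rows that really are 1, how many were found
```

Vertica also provides these as evaluation functions, so the manual arithmetic is only needed to sanity-check results.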
See also
4.2  Naive bayes
You can use the Naive Bayes algorithm to classify your data when features can be assumed independent.
You can use the Naive Bayes algorithm to classify your data when features can be assumed independent. The algorithm uses independent features to calculate the probability of a specific class. For example, you might want to predict the probability that an email is spam. In that case, you would use a corpus of words associated with spam to calculate the probability the email's content is spam.
You can use the following functions to build a Naive Bayes model, view the model, and use the model to make predictions on a set of test data:
For a complete example of how to use the Naive Bayes algorithm in Vertica, see Classifying data using naive bayes.
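Under the independence assumption, a class score is just the class prior multiplied by the per-feature conditional probabilities, and the highest-scoring class wins. A toy Python sketch of this idea (the word probabilities and priors below are invented for illustration):

```python
# P(word | class) tables for a toy spam filter; values are invented.
p_word = {
    "spam":     {"prize": 0.8, "meeting": 0.1},
    "not_spam": {"prize": 0.1, "meeting": 0.7},
}
prior = {"spam": 0.4, "not_spam": 0.6}

def classify(words):
    scores = {}
    for cls in prior:
        score = prior[cls]
        for w in words:
            score *= p_word[cls].get(w, 1e-6)   # multiply independent likelihoods
        scores[cls] = score
    return max(scores, key=scores.get)          # class with the highest score

result = classify(["prize"])
```

In practice, real implementations work with log-probabilities to avoid underflow when many features are multiplied together.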
4.2.1  Classifying data using naive bayes
This Naive Bayes example uses the HouseVotes84 data set to show you how to build a model.
This Naive Bayes example uses the HouseVotes84 data set to show you how to build a model. With this model, you can predict the party affiliation of members of the United States Congress based on their voting records. To aid in classification, the data has been cleaned and any missed votes have been replaced. The cleaned data replaces each missed vote with the voter's party majority vote. For example, suppose a Democrat had a missing value for vote1 and the majority of Democrats voted in favor. This example replaces all missing Democrats' votes for vote1 with a vote in favor.
In this example, approximately 75% of the cleaned HouseVotes84 data is randomly selected and copied to a training table. The remaining cleaned HouseVotes84 data is used as a testing table.
Before you begin the example,
load the Machine Learning sample data.
You must also load the naive_bayes_data_preparation.sql script:
$ /opt/vertica/bin/vsql -d <name of your database> -f naive_bayes_data_preparation.sql

Create the Naive Bayes model, named naive_house84_model, using the house84_train training data.
=> SELECT NAIVE_BAYES('naive_house84_model', 'house84_train', 'party',
'*' USING PARAMETERS exclude_columns='party, id');
NAIVE_BAYES

Finished. Accepted Rows: 315 Rejected Rows: 0
(1 row)

Create a new table, named predicted_party_naive. Populate this table with the prediction outputs you obtain from the PREDICT_NAIVE_BAYES function on your test data.
=> CREATE TABLE predicted_party_naive
AS SELECT party,
PREDICT_NAIVE_BAYES (vote1, vote2, vote3, vote4, vote5,
vote6, vote7, vote8, vote9, vote10,
vote11, vote12, vote13, vote14,
vote15, vote16
USING PARAMETERS model_name = 'naive_house84_model',
type = 'response') AS Predicted_Party
FROM house84_test;
CREATE TABLE

Calculate the accuracy of the model's predictions.
=> SELECT (Predictions.Num_Correct_Predictions / Count.Total_Count) AS Percent_Accuracy
FROM ( SELECT COUNT(Predicted_Party) AS Num_Correct_Predictions
FROM predicted_party_naive
WHERE party = Predicted_Party
) AS Predictions,
( SELECT COUNT(party) AS Total_Count
FROM predicted_party_naive
) AS Count;
Percent_Accuracy

0.933333333333333333
(1 row)
The model correctly predicted the party of the members of Congress based on their voting patterns with 93% accuracy.
Viewing the probability of each class
Use PREDICT_NAIVE_BAYES_CLASSES to view the probability of each class.
=> SELECT PREDICT_NAIVE_BAYES_CLASSES (id, vote1, vote2, vote3, vote4, vote5,
vote6, vote7, vote8, vote9, vote10,
vote11, vote12, vote13, vote14,
vote15, vote16
USING PARAMETERS model_name = 'naive_house84_model',
key_columns = 'id', exclude_columns = 'id',
classes = 'democrat, republican')
OVER() FROM house84_test;
 id | Predicted  |    Probability    |       democrat       |     republican
----+------------+-------------------+----------------------+-------------------
368 | democrat   |                 1 |                    1 |                 0
372 | democrat   |                 1 |                    1 |                 0
374 | democrat   |                 1 |                    1 |                 0
378 | republican | 0.999999962214987 | 3.77850125111219e-08 | 0.999999962214987
384 | democrat   |                 1 |                    1 |                 0
387 | democrat   |                 1 |                    1 |                 0
406 | republican | 0.999999945980143 | 5.40198564592332e-08 | 0.999999945980143
419 | democrat   |                 1 |                    1 |                 0
421 | republican | 0.922808855631005 |   0.0771911443689949 | 0.922808855631005
...
(109 rows)
See also
4.3  Random forest for classification
The Random Forest algorithm creates an ensemble model of decision trees.
The Random Forest algorithm creates an ensemble model of decision trees. Each tree is trained on a randomly selected subset of the training data.
You can use the following functions to train the Random Forest model, and use the model to make predictions on a set of test data:
For a complete example of how to use the Random Forest algorithm in Vertica, see Classifying data using random forest.
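At prediction time, the ensemble combines its trees by majority vote. The voting step alone can be sketched in Python (the stand-in "trees" here are trivial placeholders, not real decision trees):

```python
from collections import Counter

def forest_predict(trees, row):
    """Majority vote: each 'tree' is any callable mapping a row to a class label."""
    votes = Counter(tree(row) for tree in trees)
    return votes.most_common(1)[0][0]   # label with the most votes

# Three stand-in trees that disagree on the same row.
trees = [lambda row: "setosa", lambda row: "setosa", lambda row: "versicolor"]
prediction = forest_predict(trees, row=None)
```

The fraction of trees voting for the winning class is also what the class-probability functions report for each prediction.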
4.3.1  Classifying data using random forest
This random forest example uses a data set named iris.
This random forest example uses a data set named iris. The data set contains four variables that measure various parts of the iris flower, which are used to predict its species.
Before you begin the example, make sure that you have followed the steps in Download the machine learning example data.

Use RF_CLASSIFIER to create the random forest model, named rf_iris, using the iris data. View the summary output of the model with GET_MODEL_SUMMARY:
=> SELECT RF_CLASSIFIER ('rf_iris', 'iris', 'Species', 'Sepal_Length, Sepal_Width, Petal_Length, Petal_Width'
USING PARAMETERS ntree=100, sampling_size=0.5);
RF_CLASSIFIER

Finished training
(1 row)
=> SELECT GET_MODEL_SUMMARY(USING PARAMETERS model_name='rf_iris');

===========
call_string
===========
SELECT rf_classifier('public.rf_iris', 'iris', '"species"', 'Sepal_Length, Sepal_Width, Petal_Length,
Petal_Width' USING PARAMETERS exclude_columns='', ntree=100, mtry=2, sampling_size=0.5, max_depth=5,
max_breadth=32, min_leaf_size=1, min_info_gain=0, nbins=32);
=======
details
=======
  predictor  | type
-------------+-------
sepal_length | float
sepal_width  | float
petal_length | float
petal_width  | float
===============
Additional Info
===============
       Name        | Value
-------------------+-------
    tree_count     |   100
rejected_row_count |     0
accepted_row_count |   150
(1 row)

Apply the classifier to the test data with PREDICT_RF_CLASSIFIER:
=> SELECT PREDICT_RF_CLASSIFIER (Sepal_Length, Sepal_Width, Petal_Length, Petal_Width
USING PARAMETERS model_name='rf_iris') FROM iris1;
PREDICT_RF_CLASSIFIER

setosa
setosa
setosa
.
.
.
versicolor
versicolor
versicolor
.
.
.
virginica
virginica
virginica
.
.
.
(90 rows)

Use PREDICT_RF_CLASSIFIER_CLASSES to view the probability of each class:
=> SELECT PREDICT_RF_CLASSIFIER_CLASSES(Sepal_Length, Sepal_Width, Petal_Length, Petal_Width
USING PARAMETERS model_name='rf_iris') OVER () FROM iris1;
predicted | probability
----------+-------------
setosa    |           1
setosa    |           1
setosa    |           1
setosa    |           1
setosa    |           1
setosa    |           1
setosa    |           1
setosa    |           1
setosa    |           1
setosa    |           1
setosa    |        0.99
.
.
.
(90 rows)
4.4  SVM (support vector machine) for classification
Support Vector Machine (SVM) is a classification algorithm that assigns data to one category or the other based on the training data.
Support Vector Machine (SVM) is a classification algorithm that assigns data to one category or the other based on the training data. Vertica implements linear SVM, which is highly scalable.
You can use the following functions to train the SVM model, and use the model to make predictions on a set of test data:
You can also use the following evaluation functions to gain further insights:
For a complete example of how to use the SVM algorithm in Vertica, see Classifying data using SVM (support vector machine).
The implementation of the SVM algorithm in Vertica is based on the paper Distributed Newton Methods for Regularized Logistic Regression.
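A trained linear SVM reduces to a weight vector and an intercept; classification is simply which side of the separating hyperplane the linear score falls on. A Python sketch (the weights and feature values below are invented for illustration):

```python
def svm_predict(weights, intercept, features):
    """Classify by the sign of the linear score w . x + b."""
    score = intercept + sum(w * x for w, x in zip(weights, features))
    return 1 if score > 0 else 0

# Hypothetical weights for (cyl, mpg, wt); a heavier car lands on the 0 side.
light_car = svm_predict(weights=[0.15, 0.16, -1.78], intercept=-0.02,
                        features=[4, 30.0, 1.6])
heavy_car = svm_predict(weights=[0.15, 0.16, -1.78], intercept=-0.02,
                        features=[8, 14.0, 3.8])
```

Training chooses the weights that maximize the margin between the two classes; only that optimization, not the prediction step above, requires the heavy computation.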
4.4.1  Classifying data using SVM (support vector machine)
This SVM example uses a small data set named mtcars.
This SVM example uses a small data set named mtcars. The example shows how to train a model with the SVM_CLASSIFIER function to predict the value of am (the transmission type, where 0 = automatic and 1 = manual), and then make predictions on test data with the PREDICT_SVM_CLASSIFIER function.
Before you begin the example,
load the Machine Learning sample data.

Create the SVM model, named svm_class, using the mtcars_train training data.
=> SELECT SVM_CLASSIFIER('svm_class', 'mtcars_train', 'am', 'cyl, mpg, wt, hp, gear'
USING PARAMETERS exclude_columns='gear');
SVM_CLASSIFIER

Finished in 12 iterations.
Accepted Rows: 20 Rejected Rows: 0
(1 row)

View the summary output of svm_class.
=> SELECT GET_MODEL_SUMMARY(USING PARAMETERS model_name='svm_class');

=======
details
=======
predictor | coefficient
----------+-------------
Intercept |     0.02006
cyl       |     0.15367
mpg       |     0.15698
wt        |    -1.78157
hp        |     0.00957
===========
call_string
===========
SELECT svm_classifier('public.svm_class', 'mtcars_train', '"am"', 'cyl, mpg, wt, hp, gear'
USING PARAMETERS exclude_columns='gear', C=1, max_iterations=100, epsilon=0.001);
===============
Additional Info
===============
       Name        | Value
-------------------+-------
accepted_row_count |    20
rejected_row_count |     0
  iteration_count  |    12
(1 row)

Create a new table, named svm_mtcars_predict. Populate this table with the prediction outputs you obtain from running the PREDICT_SVM_CLASSIFIER function on your test data.
=> CREATE TABLE svm_mtcars_predict AS
(SELECT car_model, am, PREDICT_SVM_CLASSIFIER(cyl, mpg, wt, hp
USING PARAMETERS model_name='svm_class')
AS Prediction FROM mtcars_test);
CREATE TABLE

View the results in the svm_mtcars_predict table.
=> SELECT * FROM svm_mtcars_predict;
   car_model    | am | Prediction
----------------+----+------------
 Toyota Corona  |  0 |          1
 Camaro Z28     |  0 |          0
 Datsun 710     |  1 |          1
 Valiant        |  0 |          0
 Volvo 142E     |  1 |          1
 AMC Javelin    |  0 |          0
 Honda Civic    |  1 |          1
 Hornet 4 Drive |  0 |          0
 Maserati Bora  |  1 |          1
 Merc 280       |  0 |          0
 Merc 450SL     |  0 |          0
 Porsche 914-2  |  1 |          1
(12 rows)

Evaluate the accuracy of the PREDICT_SVM_CLASSIFIER function, using the CONFUSION_MATRIX evaluation function.
=> SELECT CONFUSION_MATRIX(obs::int, pred::int USING PARAMETERS num_classes=2) OVER()
FROM (SELECT am AS obs, Prediction AS pred FROM svm_mtcars_predict) AS prediction_output;
class | 0 | 1 |                  comment
------+---+---+----------------------------------------------
    0 | 6 | 1 |
    1 | 0 | 5 | Of 12 rows, 12 were used and 0 were ignored
(2 rows)
In this case, PREDICT_SVM_CLASSIFIER correctly predicted that all five cars with a value of 1 in the am column have a value of 1; none of those cars were incorrectly classified. Out of the seven cars which had a value of 0 in the am column, six were correctly predicted to have the value 0. One car was incorrectly classified as having the value 1.
See also
4.5  XGBoost for classification
The following XGBoost functions create and perform predictions with a classification model.
XGBoost (eXtreme Gradient Boosting) is a popular supervised-learning algorithm used for regression and classification on large datasets. It uses sequentially-built shallow decision trees to provide accurate results and a highly-scalable training method that avoids overfitting.
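The "sequentially-built" part means each new tree is fit to the residual errors of the ensemble so far, and the final prediction is the learning-rate-scaled sum of all trees. A one-dimensional regression sketch of this boosting loop, using single-split stumps as the weak learners (this illustrates the general gradient-boosting idea, not Vertica's exact implementation):

```python
def fit_stump(xs, residuals):
    """Fit the best single-split 'tree': mean residual on each side of a threshold."""
    best = None
    for t in xs:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - (lm if x <= t else rm)) ** 2 for x, r in zip(xs, residuals))
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def boost(xs, ys, rounds=10, learning_rate=0.5):
    """Repeatedly fit a stump to the current residuals and add it to the ensemble."""
    pred = [0.0] * len(xs)
    trees = []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, pred)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        pred = [p + learning_rate * tree(x) for p, x in zip(pred, xs)]
    return lambda x: sum(learning_rate * t(x) for t in trees)

model = boost([1, 2, 3, 4], [1.0, 1.0, 3.0, 3.0])
```

Each round shrinks the remaining error, so the ensemble's predictions converge toward the training targets without any single tree being deep.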
The following XGBoost functions create and perform predictions with a classification model:
Example
This example uses the "iris" dataset, which contains measurements for various parts of a flower and can be used to predict its species. The example creates an XGBoost classifier model to classify the species of each flower.
Before you begin the example,
load the Machine Learning sample data.

Use XGB_CLASSIFIER to create the XGBoost classifier model xgb_iris using the iris dataset:
=> SELECT XGB_CLASSIFIER ('xgb_iris', 'iris', 'Species', 'Sepal_Length, Sepal_Width, Petal_Length, Petal_Width'
USING PARAMETERS max_ntree=10, max_depth=5, weight_reg=0.1, learning_rate=1);
XGB_CLASSIFIER

Finished
(1 row)
You can then view a summary of the model with
GET_MODEL_SUMMARY
:
=> SELECT GET_MODEL_SUMMARY(USING PARAMETERS model_name='xgb_iris');
GET_MODEL_SUMMARY

===========
call_string
===========
xgb_classifier('public.xgb_iris', 'iris', '"species"', 'Sepal_Length, Sepal_Width, Petal_Length, Petal_Width'
USING PARAMETERS exclude_columns='', max_ntree=10, max_depth=5, nbins=32, objective=crossentropy,
split_proposal_method=global, epsilon=0.001, learning_rate=1, min_split_loss=0, weight_reg=0.1, sampling_size=1)
=======
details
=======
  predictor  |       type
-------------+------------------
sepal_length | float or numeric
sepal_width  | float or numeric
petal_length | float or numeric
petal_width  | float or numeric
===============
Additional Info
===============
       Name        | Value
-------------------+-------
    tree_count     |    10
rejected_row_count |     0
accepted_row_count |   150
(1 row)

Use PREDICT_XGB_CLASSIFIER to apply the classifier to the test data:
=> SELECT PREDICT_XGB_CLASSIFIER (Sepal_Length, Sepal_Width, Petal_Length, Petal_Width
USING PARAMETERS model_name='xgb_iris') FROM iris1;
PREDICT_XGB_CLASSIFIER

setosa
setosa
setosa
.
.
.
versicolor
versicolor
versicolor
.
.
.
virginica
virginica
virginica
.
.
.
(90 rows)

Use PREDICT_XGB_CLASSIFIER_CLASSES to view the probability of each class:
=> SELECT PREDICT_XGB_CLASSIFIER_CLASSES(Sepal_Length, Sepal_Width, Petal_Length, Petal_Width
USING PARAMETERS model_name='xgb_iris') OVER (PARTITION BEST) FROM iris1;
predicted  |    probability
-----------+-------------------
setosa     |    0.9999650465368
setosa     |    0.9999650465368
setosa     |    0.9999650465368
setosa     |    0.9999650465368
setosa     |  0.999911552783011
setosa     |    0.9999650465368
setosa     |    0.9999650465368
setosa     |    0.9999650465368
setosa     |    0.9999650465368
setosa     |    0.9999650465368
setosa     |    0.9999650465368
setosa     |    0.9999650465368
versicolor |   0.99991871763563
.
.
.
(90 rows)
5  Clustering algorithms
Clustering is an important and popular machine learning tool used to find clusters of items in a data set that are similar to one another.
Clustering is an important and popular machine learning tool used to find clusters of items in a data set that are similar to one another. The goal of clustering is to group highly similar items together. Similar to classification, clustering segments the data; however, in clustering, the categorical groups are not predefined.
Clustering data into related groupings has many useful applications. If you already know how many clusters your data contains, the k-means algorithm may be sufficient to train your model and use that model to predict cluster membership for new data points.
However, in the more common case, you do not know before analyzing the data how many clusters it contains. In these cases, the bisecting k-means algorithm is much more effective at finding the correct clusters in your data.
Both k-means and bisecting k-means predict the clusters for a given data set. A model trained using either algorithm can then be used to predict the cluster to which new data points are assigned.
Clustering can be used to find anomalies in data and find natural groups of data. For example, you can use clustering to analyze a geographical region and determine which areas of that region are most likely to be hit by an earthquake. For a complete example, see Earthquake Cluster Analysis Using the K-Means Approach.
In Vertica, clustering is computed based on Euclidean distance. Through this computation, data points are assigned to the cluster with the nearest center.
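That assignment step (compute the Euclidean distance to each center, pick the nearest) can be sketched in Python; the centers below are invented to roughly resemble the 2-D agar_dish example:

```python
import math

def assign_cluster(point, centers):
    """Index of the nearest center by Euclidean distance."""
    def dist(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    return min(range(len(centers)), key=lambda i: dist(point, centers[i]))

# Five invented 2-D centers.
centers = [(0.5, 0.5), (1.5, 1.5), (3.5, 3.5), (5.5, 5.5), (7.5, 7.5)]
cluster = assign_cluster((3.2, 3.7), centers)
```

Training alternates this assignment step with recomputing each center as the mean of its assigned points, until the centers stop moving.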
5.1  K-means
You can use the k-means clustering algorithm to cluster data points into k different groups based on similarities between the data points.
You can use the k-means clustering algorithm to cluster data points into k different groups based on similarities between the data points.
k-means partitions n observations into k clusters. Through this partitioning, k-means assigns each observation to the cluster with the nearest mean, or cluster center.
For a complete example of how to use k-means on a table in Vertica, see Clustering data using k-means.
5.1.1  Clustering data using k-means
This k-means example uses two small data sets: agar_dish_1 and agar_dish_2.
This k-means example uses two small data sets: agar_dish_1 and agar_dish_2. Using the numeric data in the agar_dish_1 data set, you can cluster the data into k clusters. Then, using the created k-means model, you can run APPLY_KMEANS on agar_dish_2 and assign its data points to the clusters created in your original model.
Before you begin the example,
load the Machine Learning sample data.
Clustering training data into k clusters

Create the k-means model, named agar_dish_kmeans, using the agar_dish_1 table data.
=> SELECT KMEANS('agar_dish_kmeans', 'agar_dish_1', '*', 5
USING PARAMETERS exclude_columns ='id', max_iterations=20, output_view='agar_1_view',
key_columns='id');
KMEANS

Finished in 7 iterations
(1 row)
The example creates a model named agar_dish_kmeans and a view named agar_1_view containing the results of the model. You might get different results when you run the clustering algorithm. This is because KMEANS randomly picks initial centers by default.

View the output of agar_1_view.
=> SELECT * FROM agar_1_view;
 id | cluster_id
----+------------
  2 |          4
  5 |          4
  7 |          4
  9 |          4
 13 |          4
.
.
.
(375 rows)

Because you specified the number of clusters as 5, verify that the function created five clusters. Count the number of data points within each cluster.
=> SELECT cluster_id, COUNT(cluster_id) as Total_count
FROM agar_1_view
GROUP BY cluster_id;
cluster_id | Total_count
-----------+-------------
         0 |          76
         2 |          80
         1 |          74
         3 |          73
         4 |          72
(5 rows)
From the output, you can see that five clusters were created: 0, 1, 2, 3, and 4.
You have now successfully clustered the data from agar_dish_1.csv into five distinct clusters.
Summarizing your model
View the summary output of agar_dish_kmeans using the GET_MODEL_SUMMARY function.
=> SELECT GET_MODEL_SUMMARY(USING PARAMETERS model_name='agar_dish_kmeans');

=======
centers
=======
   x    |    y
--------+---------
0.49708 | 0.51116
7.48119 | 7.52577
1.56238 | 1.50561
3.50616 | 3.55703
5.52057 | 5.49197
=======
metrics
=======
Evaluation metrics:
Total Sum of Squares: 6008.4619
Within-Cluster Sum of Squares:
Cluster 0: 12.083548
Cluster 1: 12.389038
Cluster 2: 12.639238
Cluster 3: 11.210146
Cluster 4: 12.994356
Total Within-Cluster Sum of Squares: 61.316326
Between-Cluster Sum of Squares: 5947.1456
Between-Cluster SS / Total SS: 98.98%
Number of iterations performed: 2
Converged: True
Call:
kmeans('public.agar_dish_kmeans', 'agar_dish_1', '*', 5
USING PARAMETERS exclude_columns='id', max_iterations=20, epsilon=0.0001, init_method='kmeanspp',
distance_method='euclidean', output_view='agar_1_view', key_columns='id')
(1 row)
Clustering data using a k-means model
Using agar_dish_kmeans, the k-means model you just created, you can assign the points in agar_dish_2 to cluster centers.
Create a table named kmeans_results, using the agar_dish_2 table as your input table and the agar_dish_kmeans model for your initial cluster centers. Add only the relevant feature columns to the arguments in the APPLY_KMEANS function.
=> CREATE TABLE kmeans_results AS
(SELECT id,
APPLY_KMEANS(x, y
USING PARAMETERS
model_name='agar_dish_kmeans') AS cluster_id
FROM agar_dish_2);
The kmeans_results table shows that the agar_dish_kmeans model correctly clustered the agar_dish_2 data.
See also
5.2  K-prototypes
You can use the k-prototypes clustering algorithm to cluster mixed data into different groups based on similarities between the data points.
You can use the k-prototypes clustering algorithm to cluster mixed data into different groups based on similarities between the data points. The k-prototypes algorithm extends the functionality of k-means clustering, which is limited to numerical data, by combining it with k-modes clustering, a clustering algorithm for categorical data.
See the syntax for k-prototypes here.
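In sketch form, the dissimilarity that k-prototypes minimizes combines a squared Euclidean term over the numeric features with a weighted count of mismatches over the categorical ones. A Python illustration (the gamma weight and feature values below are invented):

```python
def kproto_distance(num_a, num_b, cat_a, cat_b, gamma=1.0):
    """Mixed dissimilarity: squared Euclidean part plus weighted category mismatches."""
    numeric = sum((x - y) ** 2 for x, y in zip(num_a, num_b))        # numeric features
    categorical = sum(1 for x, y in zip(cat_a, cat_b) if x != y)     # mismatch count
    return numeric + gamma * categorical

d = kproto_distance([1.0, 2.0], [1.0, 3.0], ["red", "small"], ["blue", "small"])
```

The gamma weight controls how much a categorical mismatch "costs" relative to numeric distance; cluster "prototypes" pair a numeric mean with a categorical mode.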
5.3  Bisecting k-means
The bisecting k-means clustering algorithm combines k-means clustering with divisive hierarchy clustering.
The bisecting k-means clustering algorithm combines k-means clustering with divisive hierarchy clustering. With bisecting k-means, you get not only the clusters but also the hierarchical structure of the clusters of data points.
This hierarchy is more informative than the unstructured set of flat clusters returned by k-means. The hierarchy shows how the clustering results would look at every step of the process of bisecting clusters to find new clusters. The hierarchy of clusters makes it easier to decide the number of clusters in the data set.
Given a hierarchy of k clusters produced by bisecting k-means, you can easily calculate any prediction of the form: assume the data contain only k' clusters, where k' is a number smaller than or equal to the k used to train the model.
For a complete example of how to use bisecting k-means to analyze a table in Vertica, see Clustering data hierarchically using bisecting k-means.
5.3.1  Clustering data hierarchically using bisecting k-means
This bisecting k-means example uses two small data sets named agar_dish_training and agar_dish_testing.
This bisecting k-means example uses two small data sets named agar_dish_training and agar_dish_testing. Using the numeric data in the agar_dish_training data set, you can cluster the data into k clusters. Then, using the resulting bisecting k-means model, you can run APPLY_BISECTING_KMEANS on agar_dish_testing and assign the data to the clusters created in your trained model. Unlike regular k-means (also provided in Vertica), bisecting k-means allows you to predict with any number of clusters less than or equal to k. So if you train the model with k=5 but later decide to predict with k=2, you do not have to retrain the model; just run APPLY_BISECTING_KMEANS with k=2.
Before you begin the example, load the Machine Learning sample data. For this example, we load agar_dish_training.csv and agar_dish_testing.csv.
Clustering training data into k clusters to train the model

Create the bisecting k-means model, named agar_dish_bkmeans, using the agar_dish_training table data.
=> SELECT BISECTING_KMEANS('agar_dish_bkmeans', 'agar_dish_training', '*', 5 USING PARAMETERS exclude_columns='id', key_columns='id', output_view='agar_1_view');
BISECTING_KMEANS

Finished.
(1 row)
This example creates a model named agar_dish_bkmeans and a view containing the results of the model named agar_1_view. You might get slightly different results when you run the clustering algorithm. This is because BISECTING_KMEANS uses random numbers to generate the best clusters.

View the output of agar_1_view.
=> SELECT * FROM agar_1_view;
 id | cluster_id
----+------------
  2 |          4
  5 |          4
  7 |          4
  9 |          4
...
Here we can see the id of each point in the agar_dish_training table and which cluster it has been assigned to.

Because we specified the number of clusters as 5, verify that the function created five clusters by counting the number of data points within each cluster.
=> SELECT cluster_id, COUNT(cluster_id) as Total_count FROM agar_1_view GROUP BY cluster_id;
cluster_id | Total_count
-----------+-------------
         5 |          76
         7 |          73
         8 |          74
         4 |          72
         6 |          80
(5 rows)
You may wonder why the cluster_ids do not start at 0 or 1. The reason is that the bisecting k-means algorithm generates many more clusters than k-means, and then outputs the ones that are needed for the designated value of k. We will see later why this is useful.
You have now successfully clustered the data from agar_dish_training.csv
into five distinct clusters.
Summarizing your model
View the summary output of agar_dish_bkmeans using the GET_MODEL_SUMMARY function.
```
=> SELECT GET_MODEL_SUMMARY(USING PARAMETERS model_name='agar_dish_bkmeans');
======
BKTree
======
center_id |    x    |    y    |  withinss  | totWithinss | bisection_level | cluster_size | parent | left_child | right_child
----------+---------+---------+------------+-------------+-----------------+--------------+--------+------------+-------------
        0 | 3.59450 | 3.59371 | 6008.46192 |  6008.46192 |               0 |          375 |        |          1 |           2
        1 | 6.47574 | 6.48280 |  336.41161 |  1561.29110 |               1 |          156 |      0 |          5 |           6
        2 | 1.54210 | 1.53574 | 1224.87949 |  1561.29110 |               1 |          219 |      0 |          3 |           4
        3 | 2.54088 | 2.53830 |  317.34228 |   665.83744 |               2 |          147 |      2 |          7 |           8
        4 | 0.49708 | 0.51116 |   12.08355 |   665.83744 |               2 |           72 |      2 |            |
        5 | 7.48119 | 7.52577 |   12.38904 |   354.80922 |               3 |           76 |      1 |            |
        6 | 5.52057 | 5.49197 |   12.99436 |   354.80922 |               3 |           80 |      1 |            |
        7 | 1.56238 | 1.50561 |   12.63924 |    61.31633 |               4 |           73 |      3 |            |
        8 | 3.50616 | 3.55703 |   11.21015 |    61.31633 |               4 |           74 |      3 |            |
=======
Metrics
=======
                       Measure                        |   Value
------------------------------------------------------+------------
Total sum of squares                                  | 6008.46192
Total within-cluster sum of squares                   |   61.31633
Between-cluster sum of squares                        | 5947.14559
Between-cluster sum of squares / Total sum of squares |   98.97950
Sum of squares for cluster 1, center_id 5             |   12.38904
Sum of squares for cluster 2, center_id 6             |   12.99436
Sum of squares for cluster 3, center_id 7             |   12.63924
Sum of squares for cluster 4, center_id 8             |   11.21015
Sum of squares for cluster 5, center_id 4             |   12.08355
===========
call_string
===========
bisecting_kmeans('agar_dish_bkmeans', 'agar_dish_training', '*', 5
USING PARAMETERS exclude_columns='id', bisection_iterations=1, split_method='SUM_SQUARES', min_divisible_cluster_size=2, distance_method='euclidean', kmeans_center_init_method='kmeanspp', kmeans_epsilon=0.0001, kmeans_max_iterations=10, output_view='agar_1_view', key_columns='id')
===============
Additional Info
===============
         Name         | Value
----------------------+-------
   num_of_clusters    |     5
dimensions_of_dataset |     2
num_of_clusters_found |     5
  height_of_BKTree    |     4
(1 row)
```
Here we can see the details of all the intermediate clusters created by bisecting k-means during training, some metrics for evaluating the quality of the clustering (the lower the sum of squares, the better), the specific parameters with which the algorithm was trained, and some general information about the trained model.
Clustering testing data using a bisecting k-means model
Using agar_dish_bkmeans, the bisecting k-means model you just created, you can assign the points in agar_dish_testing to cluster centers.

Create a table named bkmeans_results, using the agar_dish_testing table as your input table and the agar_dish_bkmeans model for your cluster centers. Add only the relevant feature columns to the arguments in the APPLY_BISECTING_KMEANS function.
=> CREATE TABLE bkmeans_results_k5 AS
(SELECT id,
APPLY_BISECTING_KMEANS(x, y
USING PARAMETERS
model_name='agar_dish_bkmeans', number_clusters=5) AS cluster_id
FROM agar_dish_testing);
=> SELECT cluster_id, COUNT(cluster_id) as Total_count FROM bkmeans_results_k5 GROUP BY cluster_id;
cluster_id | Total_count
-----------+-------------
         5 |          24
         4 |          28
         6 |          20
         8 |          26
         7 |          27
(5 rows)
The bkmeans_results_k5 table shows that the agar_dish_bkmeans model correctly clustered the agar_dish_testing data.

The real advantage of using bisecting k-means is that the model it creates can cluster data into any number of clusters less than or equal to the k with which it was trained. Now you could cluster the above testing data into 3 clusters instead of 5, without retraining the model:
=> CREATE TABLE bkmeans_results_k3 AS
(SELECT id,
APPLY_BISECTING_KMEANS(x, y
USING PARAMETERS
model_name='agar_dish_bkmeans', number_clusters=3) AS cluster_id
FROM agar_dish_testing);
=> SELECT cluster_id, COUNT(cluster_id) as Total_count FROM bkmeans_results_k3 GROUP BY cluster_id;
cluster_id | Total_count
-----------+-------------
         4 |          28
         3 |          53
         1 |          44
(3 rows)
Prediction using the trained bisecting k-means model
To cluster data using a trained model, the bisecting k-means algorithm starts by comparing the incoming data point with the child cluster centers of the root cluster node. The algorithm finds which of those centers the data point is closest to. Then the data point is compared with the child cluster centers of the closest child of the root. The prediction process continues to iterate until it reaches a leaf cluster node. Finally, the point is assigned to the closest leaf cluster. The following picture gives a simple illustration of the training process and prediction process of the bisecting k-means algorithm. An advantage of using bisecting k-means is that you can predict using any value of k from 2 to the largest k value the model was trained on.
The model in the picture below was trained on k=5. The middle picture shows using the model to predict with k=5; in other words, match the incoming data point to the center with the closest value, in the level of the hierarchy where there are 5 leaf clusters. The picture on the right shows using the model to predict as if k=2; in other words, first compare the incoming data point to the leaf clusters at the level where there were only two clusters, then match the data point to the closer of those two cluster centers. This approach is faster than predicting with k-means.
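The traversal described above can be sketched as a walk down a binary tree of centers, comparing the point with the two children at each node. A Python illustration (the tree structure and center coordinates below are invented):

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def predict_bisecting(node, point):
    """node = (center, left_child, right_child); leaves have None children."""
    center, left, right = node
    if left is None:                     # reached a leaf cluster
        return center
    # Descend into whichever child's center is closer to the point.
    child = left if dist(point, left[0]) <= dist(point, right[0]) else right
    return predict_bisecting(child, point)

# A tiny invented hierarchy: the root splits into (leaf_a, leaf_b) vs leaf_c.
leaf_a = ((1.0, 1.0), None, None)
leaf_b = ((4.0, 4.0), None, None)
leaf_c = ((8.0, 8.0), None, None)
root = ((4.0, 4.0), ((2.5, 2.5), leaf_a, leaf_b), leaf_c)
assigned = predict_bisecting(root, (3.6, 3.9))
```

Predicting with a smaller k amounts to stopping the descent at the tree level that has k clusters instead of walking all the way to the deepest leaves.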
See also
6  Time series forecasting
Time series models are trained on stationary time series (that is, time series where the mean doesn't change over time) of stochastic processes with consistent time steps.
Time series models are trained on stationary time series (that is, time series where the mean doesn't change over time) of stochastic processes with consistent time steps. These algorithms forecast future values by taking into account the influence of values at some number of preceding timesteps (lags).
Examples of applicable datasets include those for temperature, stock prices, earthquakes, product sales, etc.
To normalize datasets with inconsistent timesteps, see Gap filling and interpolation (GFI).
6.1  ARIMA model example
Autoregressive integrated moving average (ARIMA) models combine the abilities of AUTOREGRESSOR and MOVING_AVERAGE models by making future predictions based on both preceding time series values and errors of previous predictions.
Autoregressive integrated moving average (ARIMA) models combine the abilities of AUTOREGRESSOR and MOVING_AVERAGE models by making future predictions based on both preceding time series values and errors of previous predictions. At model training time, you specify the number of preceding values and previous prediction errors that the model will use to calculate predictions.
You can use the following functions to train and make predictions with ARIMA models:
These functions require time series data with consistent timesteps. To normalize a time series with inconsistent timesteps, see Gap filling and interpolation (GFI).
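As a sketch of how the two components combine, the following shows a textbook ARMA-style one-step prediction: the p most recent mean-centered values weighted by the phi coefficients, plus the q most recent prediction errors weighted by the theta coefficients. The coefficients and data here are invented, and Vertica's internal implementation details may differ:

```python
def arma_predict(history, errors, phi, theta, mean):
    """One-step ARMA prediction: the p most recent (mean-centered) values
    weighted by phi, plus the q most recent prediction errors weighted by
    theta, added back onto the series mean."""
    p, q = len(phi), len(theta)
    ar = sum(phi[i] * (history[-1 - i] - mean) for i in range(p))
    ma = sum(theta[j] * errors[-1 - j] for j in range(q))
    return mean + ar + ma

# Toy coefficients and data (illustrative only, not a trained model's values)
pred = arma_predict(history=[12.0, 13.5, 13.0], errors=[0.2, -0.1, 0.05],
                    phi=[0.6, 0.3, -0.1], theta=[0.1, 0.05, 0.02], mean=11.2)
print(round(pred, 3))  # 12.894
```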
The following example trains an ARIMA model on a daily temperature dataset and then makes temperature predictions with both possible prediction methods: using the in-sample data and applying the model to an input relation.
Load the training data
Before you begin the example, load the Machine Learning sample data.
This example uses the daily-min-temperatures dataset, which contains data on the daily minimum temperature in Melbourne, Australia from 1981 through 1990. The data is available in the temp_data table:
=> SELECT * FROM temp_data;
        time         | Temperature
---------------------+-------------
 1981-01-01 00:00:00 |        20.7
 1981-01-02 00:00:00 |        17.9
 1981-01-03 00:00:00 |        18.8
 1981-01-04 00:00:00 |        14.6
 1981-01-05 00:00:00 |        15.8
 ...
 1990-12-27 00:00:00 |          14
 1990-12-28 00:00:00 |        13.6
 1990-12-29 00:00:00 |        13.5
 1990-12-30 00:00:00 |        15.7
 1990-12-31 00:00:00 |          13
(3650 rows)

=> SELECT COUNT(*) FROM temp_data;
 COUNT
-------
 3650
(1 row)
Train the ARIMA model
After you load the data, you can use the ARIMA function to create and train an ARIMA model. For this example, the model is trained with lags of p=3 and q=3, taking the value and prediction error of three previous time steps into account for each prediction:
=> SELECT ARIMA('arima_temp', 'temp_data', 'temperature', 'time' USING PARAMETERS p=3, q=3);
ARIMA

Finished in 20 iterations.
3650 elements accepted, 0 elements rejected.
(1 row)
You can view a summary of the model with the GET_MODEL_SUMMARY function:
=> SELECT GET_MODEL_SUMMARY(USING PARAMETERS model_name='arima_temp');
GET_MODEL_SUMMARY

============
coefficients
============
parameter | value
----------+----------
phi_1     | 0.64189
phi_2     | 0.46667
phi_3     | -0.11777
theta_1   | -0.05109
theta_2   | -0.58699
theta_3   | -0.15882
==============
regularization
==============
none
===============
timeseries_name
===============
temperature
==============
timestamp_name
==============
time
==============
missing_method
==============
linear_interpolation
===========
call_string
===========
ARIMA('public.arima_temp', 'temp_data', 'temperature', 'time' USING PARAMETERS p=3, d=0, q=3, missing='linear_interpolation', init_method='Zero', epsilon=1e-06, max_iterations=100);
===============
Additional Info
===============
Name               | Value
-------------------+----------
p                  | 3
q                  | 3
d                  | 0
mean               | 11.17775
lambda             | 1.00000
mean_squared_error | 5.80490
rejected_row_count | 0
accepted_row_count | 3650
(1 row)
Make predictions
After you train the ARIMA model, you can call the PREDICT_ARIMA function to predict future temperatures. This function supports making predictions using the in-sample data that the model was trained on or applying the model to an input relation.
Using in-sample data
The following example makes predictions using the in-sample data that the model was trained on. The model begins prediction at the end of the temp_data table and returns predicted values for ten timesteps:
=> SELECT PREDICT_ARIMA(USING PARAMETERS model_name='arima_temp', start=0, npredictions=10) OVER();
prediction

12.9745063293842
13.4389080858551
13.3955791360528
13.3551146487462
13.3149336514747
13.2750516811057
13.2354710353376
13.1961939790513
13.1572226788109
13.1185592045127
(10 rows)
For both prediction methods, if you want the function to return the standard error of each prediction, you can set output_standard_errors to true:
=> SELECT PREDICT_ARIMA(USING PARAMETERS model_name='arima_temp', start=0, npredictions=10, output_standard_errors=true) OVER();
    prediction    |     std_err
------------------+------------------
 12.9745063293842 | 1.00621890780865
 13.4389080858551 | 1.45340836833232
 13.3955791360528 | 1.61041524562932
 13.3551146487462 | 1.76368421116143
 13.3149336514747 | 1.91223938476627
 13.2750516811057 | 2.05618464609977
 13.2354710353376 | 2.19561771498385
 13.1961939790513 | 2.33063553781651
 13.1572226788109 | 2.46133422924445
 13.1185592045127 | 2.58780904243988
(10 rows)
You can also apply the model to an input relation, in this case the temp_data training set:
=> SELECT PREDICT_ARIMA(temperature USING PARAMETERS model_name='arima_temp', start=3651, npredictions=10, output_standard_errors=true) OVER(ORDER BY time) FROM temp_data;
    prediction    |     std_err
------------------+------------------
 12.9745063293842 | 1.00621890780865
 13.4389080858551 | 1.45340836833232
 13.3955791360528 | 1.61041524562932
 13.3551146487462 | 1.76368421116143
 13.3149336514747 | 1.91223938476627
 13.2750516811057 | 2.05618464609977
 13.2354710353376 | 2.19561771498385
 13.1961939790513 | 2.33063553781651
 13.1572226788109 | 2.46133422924445
 13.1185592045127 | 2.58780904243988
(10 rows)
Because the same data and relative start index were provided to both prediction methods, the model predictions for each method are identical.
When applying a model to an input relation, you can set add_mean to false so that the function returns the predicted difference from the mean instead of the sum of the model mean and the predicted difference:
=> SELECT PREDICT_ARIMA(temperature USING PARAMETERS model_name='arima_temp', start=3680, npredictions=10, add_mean=false) OVER(ORDER BY time) FROM temp_data;
prediction

1.2026877112171
1.17114068517961
1.13992534953432
1.10904183333367
1.0784901998692
1.04827044781798
1.01838251238116
0.98882626641461
0.959601521551628
0.93070802931751
(10 rows)
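The relationship is simple arithmetic: adding the model mean (11.17775, as reported by GET_MODEL_SUMMARY) back to a predicted difference yields the corresponding full prediction. For example, using the first row above:

```python
model_mean = 11.17775             # "mean" from GET_MODEL_SUMMARY
diff_from_mean = 1.2026877112171  # first prediction with add_mean=false

# The full prediction is the model mean plus the predicted difference
full_prediction = model_mean + diff_from_mean
print(round(full_prediction, 6))  # 12.380438
```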
See also
6.2  Autoregressive model example
Autoregressive models predict future values of a time series based on its preceding values. More specifically, the user-specified lag determines how many previous timesteps the model takes into account during computation, and predicted values are linear combinations of the values at each lag.
Use the following functions when training and predicting with autoregressive models. Note that these functions require datasets with consistent timesteps.
To normalize datasets with inconsistent timesteps, see Gap filling and interpolation (GFI).
Example

Load the datasets from the Machine-Learning-Examples repository.
This example uses the daily-min-temperatures dataset, which contains data on the daily minimum temperature in Melbourne, Australia from 1981 through 1990:
=> SELECT * FROM temp_data;
        time         | Temperature
---------------------+-------------
 1981-01-01 00:00:00 |        20.7
 1981-01-02 00:00:00 |        17.9
 1981-01-03 00:00:00 |        18.8
 1981-01-04 00:00:00 |        14.6
 1981-01-05 00:00:00 |        15.8
 ...
 1990-12-27 00:00:00 |          14
 1990-12-28 00:00:00 |        13.6
 1990-12-29 00:00:00 |        13.5
 1990-12-30 00:00:00 |        15.7
 1990-12-31 00:00:00 |          13
(3650 rows)

Use AUTOREGRESSOR to create the autoregressive model AR_temperature from the temp_data dataset. In this case, the model is trained with a lag of p=3, taking the previous three entries into account for each estimation:
=> SELECT AUTOREGRESSOR('AR_temperature', 'temp_data', 'Temperature', 'time' USING PARAMETERS p=3);
AUTOREGRESSOR

Finished. 3650 elements accepted, 0 elements rejected.
(1 row)
You can view a summary of the model with GET_MODEL_SUMMARY:
=> SELECT GET_MODEL_SUMMARY(USING PARAMETERS model_name='AR_temperature');
GET_MODEL_SUMMARY

============
coefficients
============
parameter | value
----------+----------
alpha     | 1.88817
phi_(t1)  | 0.70004
phi_(t2)  | -0.05940
phi_(t3)  | 0.19018
==================
mean_squared_error
==================
not evaluated
===========
call_string
===========
autoregressor('public.AR_temperature', 'temp_data', 'temperature', 'time'
USING PARAMETERS p=3, missing=linear_interpolation, regularization='none', lambda=1, compute_mse=false);
===============
Additional Info
===============
Name               | Value
-------------------+-------
lag_order          | 3
rejected_row_count | 0
accepted_row_count | 3650
(1 row)
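You can reproduce the model's first prediction by hand from these coefficients: a lag-3 autoregressive prediction is alpha plus the three most recent values, each weighted by its phi coefficient (note that phi_(t2) is negative). Using the last three temperatures in the dataset:

```python
# Coefficients reported by GET_MODEL_SUMMARY above
alpha = 1.88817
phi = [0.70004, -0.05940, 0.19018]  # weights for lags t-1, t-2, t-3

# The three most recent observations in temp_data, oldest first:
# 1990-12-29 (13.5), 1990-12-30 (15.7), 1990-12-31 (13)
recent = [13.5, 15.7, 13.0]

# AR(3) one-step prediction: alpha + sum over lags of phi_i * y_(t-i)
pred = alpha + sum(p * y for p, y in zip(phi, reversed(recent)))
print(round(pred, 5))  # 12.62354, matching the first PREDICT_AUTOREGRESSOR row
```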

Use PREDICT_AUTOREGRESSOR to predict future temperatures. The following query starts the prediction at the end of the dataset and returns 10 predictions.
=> SELECT PREDICT_AUTOREGRESSOR(Temperature USING PARAMETERS model_name='AR_temperature', npredictions=10) OVER(ORDER BY time) FROM temp_data;
prediction

12.6235419917807
12.9387860506032
12.6683380680058
12.3886937385419
12.2689506237424
12.1503023330142
12.0211734746741
11.9150531529328
11.825870404008
11.7451846722395
(10 rows)
6.3  Moving-average model example
Moving-average models use the errors of previous predictions to make future predictions. More specifically, the user-specified lag determines how many previous predictions and errors the model takes into account during computation.
Use the following functions when training and predicting with moving-average models. Note that these functions require datasets with consistent timesteps.
To normalize datasets with inconsistent timesteps, see Gap filling and interpolation (GFI).
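As a sketch of the idea, a textbook moving-average prediction is the series mean plus the q most recent prediction errors, each weighted by a coefficient. The coefficients and errors below are invented for illustration, and Vertica's internal implementation details may differ:

```python
def ma_predict(errors, theta, mean):
    """One-step moving-average prediction: the model mean plus the q most
    recent prediction errors (newest last), weighted by theta."""
    return mean + sum(t * e for t, e in zip(theta, reversed(errors)))

# Illustrative values only -- not the trained model's coefficients
pred = ma_predict(errors=[0.4, -0.2, 0.1], theta=[0.9, 0.1, -0.05], mean=11.2)
print(round(pred, 2))  # 11.25
```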
Example

Load the datasets from the Machine-Learning-Examples repository.
This example uses the daily-min-temperatures dataset, which contains data on the daily minimum temperature in Melbourne, Australia from 1981 through 1990:
=> SELECT * FROM temp_data;
        time         | Temperature
---------------------+-------------
 1981-01-01 00:00:00 |        20.7
 1981-01-02 00:00:00 |        17.9
 1981-01-03 00:00:00 |        18.8
 1981-01-04 00:00:00 |        14.6
 1981-01-05 00:00:00 |        15.8
 ...
 1990-12-27 00:00:00 |          14
 1990-12-28 00:00:00 |        13.6
 1990-12-29 00:00:00 |        13.5
 1990-12-30 00:00:00 |        15.7
 1990-12-31 00:00:00 |          13
(3650 rows)

Use MOVING_AVERAGE to create the moving-average model MA_temperature from the temp_data dataset. In this case, the model is trained with a lag of q=3, taking the errors of the three previous predictions into account for each estimation:
=> SELECT MOVING_AVERAGE('MA_temperature', 'temp_data', 'temperature', 'time' USING PARAMETERS q=3, missing='linear_interpolation', regularization='none', lambda=1);
MOVING_AVERAGE

Finished. 3650 elements accepted, 0 elements rejected.
(1 row)
You can view a summary of the model with GET_MODEL_SUMMARY:
=> SELECT GET_MODEL_SUMMARY(USING PARAMETERS model_name='MA_temperature');
GET_MODEL_SUMMARY

============
coefficients
============
parameter | value
----------+----------
phi_(t0)  | -0.90051
phi_(t1)  | -0.10621
phi_(t2)  | 0.07173
===============
timeseries_name
===============
temperature
==============
timestamp_name
==============
time
===========
call_string
===========
moving_average('public.MA_temperature', 'temp_data', 'temperature', 'time'
USING PARAMETERS q=3, missing=linear_interpolation, regularization='none', lambda=1);
===============
Additional Info
===============
Name               | Value
-------------------+----------
mean               | 11.17780
lag_order          | 3
lambda             | 1.00000
rejected_row_count | 0
accepted_row_count | 3650
(1 row)

Use PREDICT_MOVING_AVERAGE to predict future temperatures. The following query starts the prediction at the end of the dataset and returns 10 predictions.
=> SELECT PREDICT_MOVING_AVERAGE(Temperature USING PARAMETERS model_name='MA_temperature', npredictions=10) OVER(ORDER BY time) FROM temp_data;
prediction

13.1324365636272
12.8071086272833
12.7218966671721
12.6011086656032
12.506624729879
12.4148247026733
12.3307873804812
12.2521385975133
12.1789741993396
12.1107640076638
(10 rows)
7  Model management
Vertica provides a number of tools to manage existing models.
Vertica provides a number of tools to manage existing models. You can view model summaries and attributes, alter model characteristics like name and privileges, drop models, and version models.
7.1  Model versioning
Model versioning provides an infrastructure to track and manage the status of registered models in a database.
Model versioning provides an infrastructure to track and manage the status of registered models in a database. The versioning infrastructure supports a collaborative environment where multiple users can submit candidate models for individual applications, which are identified by their registered_name. Models in Vertica are by default unregistered, but any user with sufficient privileges can register a model to an application and add it to the versioning environment.
Register models
When a candidate model is ready to be submitted to an application, the model owner or any user with sufficient privileges, including any user with the DBADMIN or MLSUPERVISOR role, can register the model using the REGISTER_MODEL function. The registered model is assigned an initial status of 'under_review' and is visible in the REGISTERED_MODELS system table to users with USAGE privileges.
All registered models under a given registered_name are considered different versions of the same model, regardless of trainer, algorithm type, or production status. If a model is the first to be registered to a given registered_name, the model is assigned a registered_version of one. Otherwise, newly registered models are assigned an incremented registered_version of n + 1, where n is the number of models already registered to the given registered_name. Each registered model can be uniquely identified by the combination of registered_name and registered_version.
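The numbering rule can be sketched as a per-name counter. This is an illustrative model of the behavior only, not Vertica's implementation; the ModelRegistry class here is hypothetical:

```python
from collections import defaultdict

class ModelRegistry:
    """Toy sketch of the version-numbering rule: the first model registered
    under a registered_name gets version 1; later ones get n + 1."""
    def __init__(self):
        self.versions = defaultdict(int)  # registered_name -> count so far
        self.models = {}                  # (name, version) -> model

    def register(self, registered_name, model):
        self.versions[registered_name] += 1
        version = self.versions[registered_name]
        self.models[(registered_name, version)] = model
        return version

reg = ModelRegistry()
print(reg.register('linear_reg_app', 'native_linear_reg'))  # 1
print(reg.register('linear_reg_app', 'linear_reg_spark1'))  # 2
```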
Change model status
After a model is registered, the model owner is automatically changed to superuser and the previous owner is given USAGE privileges. The MLSUPERVISOR role has full privileges over all registered models, as does dbadmin. Only users with these two roles can call the CHANGE_MODEL_STATUS function to change the status of registered models. There are six possible statuses for registered models:

under_review: Status assigned to newly registered models.
staging: Model is targeted for A/B testing against the model currently in production.
production: Model is in production for its specified application. Only one model can be in production for a given registered_name at one time.
archived: Status of models that were previously in production. Archived models can be returned to production at any time.
declined: Model is no longer in consideration for production.
unregistered: Model is removed from the versioning environment. The model does not appear in the REGISTERED_MODELS system table.
The following diagram depicts the valid status transitions:
Note
If you change the status of a model to 'production' and there is already a model in production under the given registered_name, the status of the new model is set to 'production' and that of the old model to 'archived'.
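This swap can be sketched as follows. The set_production helper is hypothetical and only models the behavior described in the note:

```python
def set_production(statuses, registered_name, version):
    """Move (registered_name, version) to 'production', archiving whichever
    model is currently in production under the same registered_name."""
    for (name, ver), status in statuses.items():
        if name == registered_name and status == 'production':
            statuses[(name, ver)] = 'archived'
    statuses[(registered_name, version)] = 'production'

statuses = {('linear_reg_app', 1): 'production',
            ('linear_reg_app', 3): 'staging'}
set_production(statuses, 'linear_reg_app', 3)
print(statuses)  # version 1 archived, version 3 in production
```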
You can view the status history of registered models with the MODEL_STATUS_HISTORY system table, which includes models that have been unregistered or dropped. Only superusers or users to whom they have granted sufficient privileges can query the table. Vertica recommends granting access on the table to the MLSUPERVISOR role.
Managing registered models
Only users with the MLSUPERVISOR role, or those granted equivalent privileges, can drop or alter registered models. As with unregistered models, you can drop and alter registered models with the DROP MODEL and ALTER MODEL commands. Dropped models no longer appear in the REGISTERED_MODELS system table.
To make predictions with a registered model, you must use the [schema_name.]model_name and the predict function for the appropriate model type. You can find all necessary information in the REGISTERED_MODELS system table.
Model registration does not affect the process of exporting models. However, model registration information, such as registered_version, is not captured in exported models. All imported models are initially unregistered, but each category of imported model (native Vertica, PMML, and TensorFlow) is compatible with model versioning.
Examples
The following example demonstrates a possible model versioning workflow. This workflow begins with registering multiple models to an application, continues with an MLSUPERVISOR managing and changing the status of those models, and concludes with one of the models in production.
First, the models must be registered to an application. In this case, the models native_linear_reg and linear_reg_spark1 are registered to the linear_reg_app application:
=> SELECT REGISTER_MODEL('native_linear_reg', 'linear_reg_app');
REGISTER_MODEL

Model [native_linear_reg] is registered as [linear_reg_app], version [2]
(1 row)
=> SELECT REGISTER_MODEL('linear_reg_spark1', 'linear_reg_app');
REGISTER_MODEL

Model [linear_reg_spark1] is registered as [linear_reg_app], version [3]
(1 row)
You can query the REGISTERED_MODELS system table to view details about the newly registered models, such as the version number and model status:
=> SELECT * FROM REGISTERED_MODELS;
 registered_name  | registered_version |    status    |        registered_time        |     model_id      | schema_name |    model_name     |      model_type       |    category
------------------+--------------------+--------------+-------------------------------+-------------------+-------------+-------------------+-----------------------+----------------
 linear_reg_app   |                  3 | UNDER_REVIEW | 2023-01-26 09:52:00.082166-04 | 45035996273714020 | public      | linear_reg_spark1 | PMML_REGRESSION_MODEL | PMML
 linear_reg_app   |                  2 | UNDER_REVIEW | 2023-01-26 09:51:04.553102-05 | 45035996273850350 | public      | native_linear_reg | LINEAR_REGRESSION     | VERTICA_MODELS
 linear_reg_app   |                  1 | PRODUCTION   | 2023-01-24 05:29:25.990626-02 | 45035996273853740 | public      | linear_reg_newton | LINEAR_REGRESSION     | VERTICA_MODELS
 logistic_reg_app |                  1 | PRODUCTION   | 2023-01-23 08:49:25.990626-02 | 45035996273853740 | public      | log_reg_cgd       | LOGISTIC_REGRESSION   | VERTICA_MODELS
(4 rows)
If you query the MODELS system table, you can see that the model owner has automatically been changed to superuser, dbadmin in this case:
=> SELECT model_name, owner_name FROM MODELS WHERE model_name IN ('native_linear_reg', 'linear_reg_spark1');
    model_name     | owner_name
-------------------+------------
 native_linear_reg | dbadmin
 linear_reg_spark1 | dbadmin
(2 rows)
As a user with the MLSUPERVISOR role enabled, you can then change the status of both models to 'staging' with the CHANGE_MODEL_STATUS function, which accepts a registered_name and registered_version:
=> SELECT CHANGE_MODEL_STATUS('linear_reg_app', 2, 'staging');
CHANGE_MODEL_STATUS

The status of model [linear_reg_app] - version [2] is changed to [staging]
(1 row)
=> SELECT CHANGE_MODEL_STATUS('linear_reg_app', 3, 'staging');
CHANGE_MODEL_STATUS

The status of model [linear_reg_app] - version [3] is changed to [staging]
(1 row)
After comparing the evaluation metrics of the two staged models against the model currently in production, you can put the better performing model into production. In this case, the linear_reg_spark1 model is moved into production and the native_linear_reg model is declined:
=> SELECT CHANGE_MODEL_STATUS('linear_reg_app', 2, 'declined');
CHANGE_MODEL_STATUS

The status of model [linear_reg_app] - version [2] is changed to [declined]
(1 row)
=> SELECT CHANGE_MODEL_STATUS('linear_reg_app', 3, 'production');
CHANGE_MODEL_STATUS

The status of model [linear_reg_app] - version [3] is changed to [production]
(1 row)
You can then query the REGISTERED_MODELS system table to confirm that the linear_reg_spark1 model is now in 'production', the native_linear_reg model has been set to 'declined', and the previously in-production linear_reg_newton model has been automatically moved to 'archived':
=> SELECT * FROM REGISTERED_MODELS;
 registered_name  | registered_version |   status   |        registered_time        |     model_id      | schema_name |    model_name     |      model_type       |    category
------------------+--------------------+------------+-------------------------------+-------------------+-------------+-------------------+-----------------------+----------------
 linear_reg_app   |                  3 | PRODUCTION | 2023-01-26 09:52:00.082166-04 | 45035996273714020 | public      | linear_reg_spark1 | PMML_REGRESSION_MODEL | PMML
 linear_reg_app   |                  2 | DECLINED   | 2023-01-26 09:51:04.553102-05 | 45035996273850350 | public      | native_linear_reg | LINEAR_REGRESSION     | VERTICA_MODELS
 linear_reg_app   |                  1 | ARCHIVED   | 2023-01-24 05:29:25.990626-02 | 45035996273853740 | public      | linear_reg_newton | LINEAR_REGRESSION     | VERTICA_MODELS
 logistic_reg_app |                  1 | PRODUCTION | 2023-01-23 08:49:25.990626-02 | 45035996273853740 | public      | log_reg_cgd       | LOGISTIC_REGRESSION   | VERTICA_MODELS
(4 rows)
To remove the declined native_linear_reg model from the versioning environment, you can set the status to 'unregistered':
=> SELECT CHANGE_MODEL_STATUS('linear_reg_app', 2, 'unregistered');
CHANGE_MODEL_STATUS

The status of model [linear_reg_app] - version [2] is changed to [unregistered]
(1 row)
=> SELECT * FROM REGISTERED_MODELS;
 registered_name  | registered_version |   status   |        registered_time        |     model_id      | schema_name |    model_name     |      model_type       |    category
------------------+--------------------+------------+-------------------------------+-------------------+-------------+-------------------+-----------------------+----------------
 linear_reg_app   |                  3 | PRODUCTION | 2023-01-26 09:52:00.082166-04 | 45035996273714020 | public      | linear_reg_spark1 | PMML_REGRESSION_MODEL | PMML
 linear_reg_app   |                  1 | ARCHIVED   | 2023-01-24 05:29:25.990626-02 | 45035996273853740 | public      | linear_reg_newton | LINEAR_REGRESSION     | VERTICA_MODELS
 logistic_reg_app |                  1 | PRODUCTION | 2023-01-23 08:49:25.990626-02 | 45035996273853740 | public      | log_reg_cgd       | LOGISTIC_REGRESSION   | VERTICA_MODELS
(3 rows)
You can see that the native_linear_reg model no longer appears in the REGISTERED_MODELS system table. However, you can still query the MODEL_STATUS_HISTORY system table to view the status history of native_linear_reg:
=> SELECT * FROM MODEL_STATUS_HISTORY WHERE model_id=45035996273850350;
 registered_name | registered_version |  new_status  |  old_status  |      status_change_time       |    operator_id    | operator_name |     model_id      | schema_name |    model_name
-----------------+--------------------+--------------+--------------+-------------------------------+-------------------+---------------+-------------------+-------------+-------------------
 linear_reg_app  |                  2 | UNDER_REVIEW | UNREGISTERED | 2023-01-26 09:51:04.553102-05 | 45035996273964824 | user1         | 45035996273850350 | public      | native_linear_reg
 linear_reg_app  |                  2 | STAGING      | UNDER_REVIEW | 2023-01-29 11:33:02.052464-05 | 45035996273704962 | supervisor1   | 45035996273850350 | public      | native_linear_reg
 linear_reg_app  |                  2 | DECLINED     | STAGING      | 2023-01-30 04:12:30.481136-05 | 45035996273704962 | supervisor1   | 45035996273850350 | public      | native_linear_reg
 linear_reg_app  |                  2 | UNREGISTERED | DECLINED     | 2023-02-02 03:25:32.332132-05 | 45035996273704962 | supervisor1   | 45035996273850350 | public      | native_linear_reg
(4 rows)
See also
7.2  Altering models
You can modify a model using ALTER MODEL, in response to your model's needs.
You can modify a model using ALTER MODEL in response to your model's needs. You can alter a model by renaming it, changing its owner, or changing its schema.
You can drop or alter any model that you create.
7.2.1  Changing model ownership
As a superuser or model owner, you can reassign model ownership with ALTER MODEL.
As a superuser or model owner, you can reassign model ownership with ALTER MODEL as follows:

ALTER MODEL model-name OWNER TO owner-name
Changing model ownership is useful when a model owner leaves or changes responsibilities. Because you can change the owner, the models do not need to be rewritten.
Example
The following example shows how you can use ALTER MODEL to change the model owner:

Find the model you want to alter. As the dbadmin, you own the model.
=> SELECT * FROM V_CATALOG.MODELS WHERE model_name='mykmeansmodel';
-[ RECORD 1 ]-+------------------------------
model_id      | 45035996273816618
model_name    | mykmeansmodel
schema_id     | 45035996273704978
schema_name   | public
owner_id      | 45035996273704962
owner_name    | dbadmin
category      | VERTICA_MODELS
model_type    | kmeans
is_complete   | t
create_time   | 2017-03-02 11:16:04.990626-05
size          | 964

Change the model owner from dbadmin to user1.
=> ALTER MODEL mykmeansmodel OWNER TO user1;
ALTER MODEL

Review V_CATALOG.MODELS to verify that the owner was changed.
=> SELECT * FROM V_CATALOG.MODELS WHERE model_name='mykmeansmodel';
-[ RECORD 1 ]-+------------------------------
model_id      | 45035996273816618
model_name    | mykmeansmodel
schema_id     | 45035996273704978
schema_name   | public
owner_id      | 45035996273704962
owner_name    | user1
category      | VERTICA_MODELS
model_type    | kmeans
is_complete   | t
create_time   | 2017-03-02 11:16:04.990626-05
size          | 964
7.2.2  Moving models to another schema
You can move a model from one schema to another with ALTER MODEL.
You can move a model from one schema to another with ALTER MODEL. You can move the model as a superuser or as a user with USAGE privileges on the current schema and CREATE privileges on the destination schema.
Example
The following example shows how you can use ALTER MODEL to change the model schema:

Find the model you want to alter.
=> SELECT * FROM V_CATALOG.MODELS WHERE model_name='mykmeansmodel';
-[ RECORD 1 ]-+------------------------------
model_id      | 45035996273816618
model_name    | mykmeansmodel
schema_id     | 45035996273704978
schema_name   | public
owner_id      | 45035996273704962
owner_name    | dbadmin
category      | VERTICA_MODELS
model_type    | kmeans
is_complete   | t
create_time   | 2017-03-02 11:16:04.990626-05
size          | 964

Change the model schema.
=> ALTER MODEL mykmeansmodel SET SCHEMA test;
ALTER MODEL

Review V_CATALOG.MODELS to verify that the schema was changed.
=> SELECT * FROM V_CATALOG.MODELS WHERE model_name='mykmeansmodel';
-[ RECORD 1 ]-+------------------------------
model_id      | 45035996273816618
model_name    | mykmeansmodel
schema_id     | 45035996273704978
schema_name   | test
owner_id      | 45035996273704962
owner_name    | dbadmin
category      | VERTICA_MODELS
model_type    | kmeans
is_complete   | t
create_time   | 2017-03-02 11:16:04.990626-05
size          | 964
7.2.3  Renaming a model
ALTER MODEL lets you rename models.
ALTER MODEL lets you rename models. For example:

Find the model you want to alter.
=> SELECT * FROM V_CATALOG.MODELS WHERE model_name='mymodel';
-[ RECORD 1 ]-+------------------------------
model_id      | 45035996273816618
model_name    | mymodel
schema_id     | 45035996273704978
schema_name   | public
owner_id      | 45035996273704962
owner_name    | dbadmin
category      | VERTICA_MODELS
model_type    | kmeans
is_complete   | t
create_time   | 2017-03-02 11:16:04.990626-05
size          | 964

Rename the model.
=> ALTER MODEL mymodel RENAME TO mykmeansmodel;
ALTER MODEL

Review V_CATALOG.MODELS to verify that the model name was changed.
=> SELECT * FROM V_CATALOG.MODELS WHERE model_name='mykmeansmodel';
-[ RECORD 1 ]-+------------------------------
model_id      | 45035996273816618
model_name    | mykmeansmodel
schema_id     | 45035996273704978
schema_name   | public
owner_id      | 45035996273704962
owner_name    | dbadmin
category      | VERTICA_MODELS
model_type    | kmeans
is_complete   | t
create_time   | 2017-03-02 11:16:04.990626-05
size          | 964
7.3  Dropping models
DROP MODEL removes one or more models from the database.
DROP MODEL removes one or more models from the database. For example:

Find the model you want to drop.
=> SELECT * FROM V_CATALOG.MODELS WHERE model_name='mySvmClassModel';
-[ RECORD 1 ]-+------------------------------
model_id      | 45035996273765414
model_name    | mySvmClassModel
schema_id     | 45035996273704978
schema_name   | public
owner_id      | 45035996273704962
owner_name    | dbadmin
category      | VERTICA_MODELS
model_type    | SVM_CLASSIFIER
is_complete   | t
create_time   | 2017-02-14 10:30:44.903946-05
size          | 525

Drop the model.
=> DROP MODEL mySvmClassModel;
DROP MODEL

Review V_CATALOG.MODELS to verify that the model was dropped.
=> SELECT * FROM V_CATALOG.MODELS WHERE model_name='mySvmClassModel';
(0 rows)
7.4  Managing model security
You can manage the security privileges on your models by using the GRANT and REVOKE statements.
You can manage the security privileges on your models by using the GRANT and REVOKE statements. The following examples show how you can change privileges on user1 and user2 using the faithful table and the linearReg model.

In the following example, the dbadmin grants the SELECT privilege to user1:
=> GRANT SELECT ON TABLE faithful TO user1;
GRANT PRIVILEGE

Then, the dbadmin grants the CREATE privilege on the public schema to user1:
=> GRANT CREATE ON SCHEMA public TO user1;
GRANT PRIVILEGE

Connect to the database as user1:
=> \c - user1

As user1, build the linearReg model:
=> SELECT LINEAR_REG('linearReg', 'faithful', 'waiting', 'eruptions');
LINEAR_REG

Finished in 1 iterations
(1 row)
As user1, grant USAGE privileges to user2:
=> GRANT USAGE ON MODEL linearReg TO user2;
GRANT PRIVILEGE
Connect to the database as user2:
=> \c - user2
To confirm privileges were granted to user2, run the GET_MODEL_SUMMARY function. A user with the USAGE privilege on a model can run GET_MODEL_SUMMARY on that model:
=> SELECT GET_MODEL_SUMMARY(USING PARAMETERS model_name='linearReg');
=======
details
=======
predictor | coefficient | std_err | t_value  | p_value
----------+-------------+---------+----------+---------
Intercept | 33.47440    | 1.15487 | 28.98533 | 0.00000
eruptions | 10.72964    | 0.31475 | 34.08903 | 0.00000
==============
regularization
==============
type | lambda
-----+--------
none | 1.00000
===========
call_string
===========
linear_reg('public.linearReg', 'faithful', '"waiting"', 'eruptions'
USING PARAMETERS optimizer='newton', epsilon=1e-06, max_iterations=100, regularization='none', lambda=1)
===============
Additional Info
===============
Name               | Value
-------------------+-------
iteration_count    | 1
rejected_row_count | 0
accepted_row_count | 272
(1 row)
Connect to the database as user1:
=> \c - user1
Then, you can use the REVOKE statement to revoke privileges from user2:
=> REVOKE USAGE ON MODEL linearReg FROM user2;
REVOKE PRIVILEGE
To confirm the privileges were revoked, connect as user2 and run the GET_MODEL_SUMMARY function:
=> \c - user2
=> SELECT GET_MODEL_SUMMARY('linearReg');
ERROR 7523: Problem in get_model_summary.
Detail: Permission denied for model linearReg
See also
7.5  Viewing model attributes
The following topics explain the model attributes for the Vertica machine learning algorithms.
The following topics explain the model attributes for the Vertica machine learning algorithms. These attributes describe the internal structure of a particular model:
7.6  Summarizing models

Find the model you want to summarize.
=> SELECT * FROM v_catalog.models WHERE model_name='svm_class';
     model_id      | model_name |     schema_id     | schema_name |     owner_id      | owner_name |    category    |   model_type   | is_complete |          create_time          | size
-------------------+------------+-------------------+-------------+-------------------+------------+----------------+----------------+-------------+-------------------------------+------
 45035996273715226 | svm_class  | 45035996273704980 | public      | 45035996273704962 | dbadmin    | VERTICA_MODELS | SVM_CLASSIFIER | t           | 2017-08-28 09:49:00.082166-04 | 1427
(1 row)

View the model summary.
=> SELECT GET_MODEL_SUMMARY(USING PARAMETERS model_name='svm_class');

=======
details
=======
predictor | coefficient
----------+-------------
Intercept | -0.02006
cyl       | 0.15367
mpg       | 0.15698
wt        | 1.78157
hp        | 0.00957
===========
call_string
===========
SELECT svm_classifier('public.svm_class', 'mtcars_train', '"am"', 'cyl, mpg, wt, hp, gear'
USING PARAMETERS exclude_columns='gear', C=1, max_iterations=100, epsilon=0.001);
===============
Additional Info
===============
Name               | Value
-------------------+-------
accepted_row_count | 20
rejected_row_count | 0
iteration_count    | 12
(1 row)
See also
7.7  Viewing models
Vertica stores the models you create in the V_CATALOG.MODELS system table.
Vertica stores the models you create in the V_CATALOG.MODELS system table.
You can query V_CATALOG.MODELS to view information about the models you have created:
=> SELECT * FROM V_CATALOG.MODELS;
-[ RECORD 1 ]-+------------------------------
model_id      | 45035996273765414
model_name    | mySvmClassModel
schema_id     | 45035996273704978
schema_name   | public
owner_id      | 45035996273704962
owner_name    | dbadmin
category      | VERTICA_MODELS
model_type    | SVM_CLASSIFIER
is_complete   | t
create_time   | 2017-02-14 10:30:44.903946-05
size          | 525
-[ RECORD 2 ]-+------------------------------
model_id      | 45035996273711466
model_name    | mtcars_normfit
schema_id     | 45035996273704978
schema_name   | public
owner_id      | 45035996273704962
owner_name    | dbadmin
category      | VERTICA_MODELS
model_type    | SVM_CLASSIFIER
is_complete   | t
create_time   | 2017-02-06 15:03:05.651941-05
size          | 288
See also
8  Using external models with Vertica
To give you the utmost in machine learning flexibility and scalability, Vertica supports importing, exporting, and predicting with PMML and TensorFlow models.
To give you the utmost in machine learning flexibility and scalability, Vertica supports importing, exporting, and predicting with PMML and TensorFlow models.
The machine learning configuration parameter MaxModelSizeKB sets the maximum size of a model that can be imported into Vertica.
Support for PMML models
Vertica supports the import and export of machine learning models in Predictive Model Markup Language (PMML) format. Support for this platformindependent model format allows you to use models trained on other platforms to predict on data stored in your Vertica database. You can also use Vertica as your model repository. Vertica supports PMML version 4.4.1.
With the PREDICT_PMML function, you can use a PMML model archived in Vertica to run prediction on data stored in the Vertica database. For more information, see Using PMML models.
For details on the PMML models, tags, and attributes that Vertica supports, see PMML features and attributes.
Support for TensorFlow models
Vertica now supports importing trained TensorFlow models, and using those models to do prediction in Vertica on data stored in the Vertica database. Vertica supports TensorFlow models trained in TensorFlow version 1.15.
The PREDICT_TENSORFLOW and PREDICT_TENSORFLOW_SCALAR functions let you predict on data in Vertica with TensorFlow models.
For additional information, see TensorFlow models.
Additional external model support
The following functions support both PMML and TensorFlow models:
8.1  TensorFlow models
TensorFlow is a framework for creating neural networks.
TensorFlow is a framework for creating neural networks. It implements basic linear algebra and multivariable calculus operations in a scalable fashion, and allows users to easily chain these operations into a computation graph.
Vertica supports importing, exporting, and making predictions with TensorFlow 1.x and 2.x models trained outside of Vertica.
Indatabase TensorFlow integration with Vertica offers several advantages:

Your models live inside your database, so you never have to move your data to make predictions.

The volume of data you can handle is limited only by the size of your Vertica database, which makes Vertica particularly wellsuited for machine learning on Big Data.

Vertica offers indatabase model management, so you can store as many models as you want.

Imported models are portable and can be exported for use elsewhere.
When you run a TensorFlow model to predict on data in the database, Vertica calls a TensorFlow process to run the model. This allows Vertica to support any model you can create and train using TensorFlow. Vertica provides the inputs (your data in the Vertica database) and stores the outputs.
8.1.1  TensorFlow integration and directory structure
This page covers importing TensorFlow models into Vertica, making predictions on data in the Vertica database, and exporting the model for use on another Vertica cluster or a third-party platform.
For a starttofinish example through each operation, see TensorFlow example.
Vertica supports models created with either TensorFlow 1.x or 2.x, but 2.x is strongly recommended.
To use TensorFlow with Vertica, install the TFIntegration UDX package on any node. You only need to do this once:
$ /opt/vertica/bin/admintools -t install_package -d database_name -p 'password' --package TFIntegration
Directory and file structure for TensorFlow models
Before importing your models, you should have a separate directory for each model that contains a frozen graph of the model (a .pb file) and a tf_model_desc.json file. Note that Vertica uses the directory name as the model name when you import it:
For example, a tf_models directory that contains two models, tf_mnist_estimator and tf_mnist_keras, has the following layout:
tf_models/
├── tf_mnist_estimator
│   ├── mnist_estimator.pb
│   └── tf_model_desc.json
└── tf_mnist_keras
    ├── mnist_keras.pb
    └── tf_model_desc.json
You can generate both of these files for a given TensorFlow 2 (TF2) model with the freeze_tf2_model.py script included in the MachineLearningExamples GitHub repository and in the /opt/vertica/packages/TFIntegration/examples directory in the Vertica database. The script accepts three arguments:
model-path: path to a saved TF2 model directory.

folder-name (optional): name of the folder to which the frozen model is saved; by default, frozen_tfmodel.

column-type (optional): integer, either 0 or 1, that signifies whether the input and output columns for the model are primitive or complex types. Use a value of 0 (default) for primitive types, or 1 for complex.
For example, the following call outputs the frozen tf_autoencoder model, which accepts complex input/output columns, into the frozen_autoencoder folder:
$ python3 ./freeze_tf2_model.py path/to/tf_autoencoder frozen_autoencoder 1
tf_model_desc.json
The tf_model_desc.json file forms the bridge between TensorFlow and Vertica. It describes the structure of the model so that Vertica can correctly match up its inputs and outputs to input/output tables.
Notice that the freeze_tf2_model.py script automatically generates this file for your TensorFlow 2 model, and this generated file can often be used as-is. For more complex models or use cases, you might have to edit this file. For a detailed breakdown of each field, see tf_model_desc.json overview.
Importing TensorFlow models into Vertica
To import TensorFlow models, use IMPORT_MODELS with the category 'TENSORFLOW'.
Import a single model. Keep in mind that the Vertica database uses the directory name as the model name:
select IMPORT_MODELS ( '/path/tf_models/tf_mnist_keras' USING PARAMETERS category='TENSORFLOW');
 import_models
---------------
 Success
(1 row)
Import all models in the directory (where each model has its own directory) with a wildcard (*):
select IMPORT_MODELS ('/path/tf_models/*' USING PARAMETERS category='TENSORFLOW');
 import_models
---------------
 Success
(1 row)
Make predictions with an imported TensorFlow model
After importing your TensorFlow model, you can use the model to predict on data in a Vertica table. Vertica provides two functions for making predictions with imported TensorFlow models: PREDICT_TENSORFLOW and PREDICT_TENSORFLOW_SCALAR.
The function you choose depends on whether you specified a column-type of 0 or 1 when calling the freeze_tf2_model.py script. If column-type was 0, meaning the model accepts primitive input and output types, use PREDICT_TENSORFLOW to make predictions; otherwise, use PREDICT_TENSORFLOW_SCALAR, as your model should accept complex input and output types.
Using PREDICT_TENSORFLOW
The PREDICT_TENSORFLOW function is different from the other predict functions in that it does not accept any parameters that affect the input columns, such as "exclude_columns" or "id_column"; rather, the function always predicts on all the input columns provided. However, it does accept a num_passthru_cols parameter, which allows the user to "skip" some number of input columns, as shown below.
The OVER(PARTITION BEST) clause tells Vertica to parallelize the operation across multiple nodes. See Window partition clause for details:
=> select PREDICT_TENSORFLOW (*
USING PARAMETERS model_name='tf_mnist_keras', num_passthru_cols=1)
OVER(PARTITION BEST) FROM tf_mnist_test_images;
-- example output, the skipped columns are displayed as the first columns of the output
 ID | col0 | col1 | col2 | col3 | col4 | col5 | col6 | col7 | col8 | col9
----+------+------+------+------+------+------+------+------+------+------
  1 |    0 |    0 |    1 |    0 |    0 |    0 |    0 |    0 |    0 |    0
  3 |    1 |    0 |    0 |    0 |    0 |    0 |    0 |    0 |    0 |    0
  6 |    0 |    0 |    0 |    0 |    1 |    0 |    0 |    0 |    0 |    0
...
Using PREDICT_TENSORFLOW_SCALAR
The PREDICT_TENSORFLOW_SCALAR function accepts one input column of type ROW, where each field corresponds to an input tensor. It returns one output column of type ROW, where each field corresponds to an output tensor. This complex type support can simplify the process for making predictions on data with many input features.
For instance, the MNIST handwritten digit classification dataset contains 784 input features for each input row, one feature for each pixel in the images of handwritten digits. The PREDICT_TENSORFLOW function requires that each of these input features be contained in a separate input column. By encapsulating these features into a single ARRAY, the PREDICT_TENSORFLOW_SCALAR function needs only a single input column of type ROW, where the pixel values are the array elements for an input field:
-- Each array for the "image" field has 784 elements.
=> SELECT * FROM mnist_train;
 id |                   inputs
----+--------------------------------------------
  1 | {"image":[0, 0, 0,..., 244, 222, 210,...]}
  2 | {"image":[0, 0, 0,..., 185, 84, 223,...]}
  3 | {"image":[0, 0, 0,..., 133, 254, 78,...]}
...
In this case, the function output consists of a single operation with one tensor. The value of this field is an array of ten elements, which are all zero except for the element whose index is the predicted digit:
=> SELECT id, PREDICT_TENSORFLOW_SCALAR(inputs USING PARAMETERS model_name='tf_mnist_ct') FROM mnist_test;
 id |                      PREDICT_TENSORFLOW_SCALAR
----+---------------------------------------------------------------------
  1 | {"prediction:0":["0", "0", "0", "0", "1", "0", "0", "0", "0", "0"]}
  2 | {"prediction:0":["0", "1", "0", "0", "0", "0", "0", "0", "0", "0"]}
  3 | {"prediction:0":["0", "0", "0", "0", "0", "0", "0", "1", "0", "0"]}
...
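The one-hot prediction rows above can be decoded client-side by taking the index of the largest element. A minimal sketch (a hypothetical helper for illustration, not part of the Vertica API; it assumes the prediction array has been fetched as a list of numbers):

```python
def decode_prediction(row):
    """Return the predicted digit: the index of the largest element
    in a one-hot (or probability) prediction array."""
    return row.index(max(row))

# The first output row above, [0,0,0,0,1,0,0,0,0,0], decodes to digit 4
print(decode_prediction([0, 0, 0, 0, 1, 0, 0, 0, 0, 0]))
```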
Exporting TensorFlow models
Vertica exports the model as a frozen graph, which can then be reimported at any time. Keep in mind that models that are saved as a frozen graph cannot be trained further.
Use EXPORT_MODELS to export TensorFlow models. For example, to export the tf_mnist_keras model to the /path/to/export/to directory:
=> SELECT EXPORT_MODELS ('/path/to/export/to', 'tf_mnist_keras');
 export_models
---------------
 Success
(1 row)
When you export a TensorFlow model, the Vertica database creates and uses the specified directory to store files that describe the model:
$ ls tf_mnist_keras/
crc.json metadata.json mnist_keras.pb model.json tf_model_desc.json
The .pb and tf_model_desc.json files describe the model, and the rest are organizational files created by the Vertica database.
File Name          | Purpose
-------------------+--------------------------------------------------------------------------------------------
model_name.pb      | Frozen graph of the model.
tf_model_desc.json | Describes the model.
crc.json           | Keeps track of files in this directory and their sizes. It is used for importing models.
metadata.json      | Contains the Vertica version, model type, and other information.
model.json         | More verbose version of tf_model_desc.json.
See also
8.1.2  TensorFlow example
Vertica uses the TFIntegration UDX package to integrate with TensorFlow. You can train your models outside of your Vertica database, then import them into Vertica and make predictions on your data.
TensorFlow scripts and datasets are included in the GitHub repository under MachineLearningExamples/TensorFlow.
The example below creates a Keras (a TensorFlow API) neural network model trained on the MNIST handwritten digit classification dataset, the layers of which are shown below.
The data is fed through each layer from top to bottom, and each layer modifies the input before returning a score. In this example, the data passed in is a set of images of handwritten Arabic numerals and the output would be the probability of the input image being a particular digit:
inputs = keras.Input(shape=(28, 28, 1), name="image")
x = layers.Conv2D(32, 5, activation="relu")(inputs)
x = layers.MaxPooling2D(2)(x)
x = layers.Conv2D(64, 5, activation="relu")(x)
x = layers.MaxPooling2D(2)(x)
x = layers.Flatten()(x)
x = layers.Dense(10, activation='softmax', name='OUTPUT')(x)
tfmodel = keras.Model(inputs, x)
For more information on how TensorFlow interacts with your Vertica database and how to import more complex models, see TensorFlow integration and directory structure.
Prepare a TensorFlow model for Vertica
The following procedures take place outside of Vertica.
Train and save a TensorFlow model

Install TensorFlow 2.

Train your model. For this particular example, you can run train_simple_model.py to train and save the model. Otherwise, and more generally, you can manually train your model in Python and then save it:
mymodel.save('my_saved_model_dir')

Run the freeze_tf2_model.py script included in the MachineLearningExamples repository or in /opt/vertica/packages/TFIntegration/examples, specifying your model, an output directory (optional; defaults to frozen_tfmodel), and the input and output column type (0 for primitive, 1 for complex).
This script transforms your saved model into the Vertica-compatible frozen graph format and creates the tf_model_desc.json file, which describes how Vertica should translate its tables to TensorFlow tensors:
$ ./freeze_tf2_model.py path/to/tf/model frozen_model_dir 0
Import TensorFlow models and make predictions in Vertica

If you haven't already, as dbadmin, install the TFIntegration UDX package on any node. You only need to do this once.
$ /opt/vertica/bin/admintools -t install_package -d database_name -p 'password' --package TFIntegration

Copy the frozen model directory to any node in your Vertica cluster and import the model:
=> SELECT IMPORT_MODELS('path/to/frozen_model_dir' USING PARAMETERS category='TENSORFLOW');

Import the dataset you want to make a prediction on. For this example:

Copy the MachineLearningExamples/TensorFlow/data directory to any node on your Vertica cluster.

From that data directory, run the SQL script load_tf_data.sql to load the MNIST dataset:
$ vsql -f load_tf_data.sql

Make a prediction with your model on your dataset with PREDICT_TENSORFLOW. In this example, the model is used to classify the images of handwritten numbers in the MNIST dataset:
=> SELECT PREDICT_TENSORFLOW (*
USING PARAMETERS model_name='tf_mnist_keras', num_passthru_cols=1)
OVER(PARTITION BEST) FROM tf_mnist_test_images;
-- example output, the skipped columns are displayed as the first columns of the output
 ID | col0 | col1 | col2 | col3 | col4 | col5 | col6 | col7 | col8 | col9
----+------+------+------+------+------+------+------+------+------+------
  1 |    0 |    0 |    1 |    0 |    0 |    0 |    0 |    0 |    0 |    0
  3 |    1 |    0 |    0 |    0 |    0 |    0 |    0 |    0 |    0 |    0
  6 |    0 |    0 |    0 |    0 |    1 |    0 |    0 |    0 |    0 |    0
...
Note
Because a column-type of 0 was used when calling the freeze_tf2_model.py script, the PREDICT_TENSORFLOW function must be used to make predictions. If column-type was 1, meaning the model accepts complex input and output types, predictions must be made with the PREDICT_TENSORFLOW_SCALAR function.

To export the model, use EXPORT_MODELS:
=> SELECT EXPORT_MODELS('/path/to/export/to', 'tf_mnist_keras');
 EXPORT_MODELS
---------------
 Success
(1 row)
Note
While you cannot continue training a TensorFlow model after you export it from Vertica, you can use it to predict on data in another Vertica cluster, or outside Vertica on another platform.
TensorFlow 1 (deprecated)

Install TensorFlow 1.15 with Python 3.7 or below.

Run train_save_model.py in MachineLearningExamples/TensorFlow/tf1 to train and save the model in TensorFlow and frozen graph formats.
Note
Vertica does not supply a script to save and freeze TensorFlow 1 models that use complex column types.
See also
8.1.3  tf_model_desc.json overview
Before importing your externally trained TensorFlow models, you must:

save the model in frozen graph (.pb) format

create tf_model_desc.json, which describes to your Vertica database how to map its inputs and outputs to input/output tables
Conveniently, the script freeze_tf2_model.py included in the TensorFlow directory of the MachineLearningExamples repository (and in /opt/vertica/packages/TFIntegration/examples) will do both of these automatically. In most cases, the generated tf_model_desc.json can be used as-is, but for more complex datasets and use cases, you might need to edit it.
The contents of the tf_model_desc.json file depend on whether you provide a column-type of 0 or 1 when calling the freeze_tf2_model.py script. If column-type is 0, the imported model accepts primitive input and output columns. If it is 1, the model accepts complex input and output columns.
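The relationship between the column type and the generated file's top-level fields can be summarized in a short sketch (a hypothetical helper for illustration, not part of the Vertica tooling; the field names are taken from the examples below):

```python
def expected_keys(column_type):
    """Top-level tf_model_desc.json fields for each column-type value.

    0 -> primitive input/output columns (input_desc / output_desc);
    1 -> complex input/output columns (input_tensors / output_tensors,
         plus an explicit "column_type": "complex" field).
    """
    if column_type == 0:
        return {"frozen_graph", "input_desc", "output_desc"}
    return {"frozen_graph", "column_type", "input_tensors", "output_tensors"}

print(sorted(expected_keys(0)))
```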
Models that accept primitive types
The following tf_model_desc.json is generated from the MNIST handwriting dataset used by the TensorFlow example:
{
    "frozen_graph": "mnist_keras.pb",
    "input_desc": [
        {
            "op_name": "image_input",
            "tensor_map": [
                { "idx": 0, "dim": [1, 28, 28, 1], "col_start": 0 }
            ]
        }
    ],
    "output_desc": [
        {
            "op_name": "OUTPUT/Softmax",
            "tensor_map": [
                { "idx": 0, "dim": [1, 10], "col_start": 0 }
            ]
        }
    ]
}
This file describes the structure of the model's inputs and outputs. It must contain a frozen_graph field that matches the filename of the .pb model, an input_desc field, and an output_desc field.
input_desc and output_desc: the descriptions of the input and output nodes in the TensorFlow graph. Each of these includes the following fields:

op_name: the name of the operation node, which is set when creating and training the model. You can typically retrieve the names of these parameters from tfmodel.inputs and tfmodel.outputs. For example:
$ print({t.name:t for t in tfmodel.inputs})
{'image_input:0': <tf.Tensor 'image_input:0' shape=(?, 28, 28, 1) dtype=float32>}
$ print({t.name:t for t in tfmodel.outputs})
{'OUTPUT/Softmax:0': <tf.Tensor 'OUTPUT/Softmax:0' shape=(?, 10) dtype=float32>}
In this case, the respective values for op_name would be image_input and OUTPUT/Softmax.
For a more detailed example of this process, review the code for freeze_tf2_model.py.

tensor_map: how to map the tensor to Vertica columns, which can be specified with the following:

idx: the index of the output tensor under the given operation (0 for the first output, 1 for the second output, and so on).

dim: the vector holding the dimensions of the tensor; it provides the number of columns.

col_start (used only if col_idx is not specified): the starting column index. When used with dim, it specifies a range of Vertica column indices starting at col_start and ending at col_start + flattened_tensor_dimension. Vertica starts at the column specified by the index col_start and gets the next flattened_tensor_dimension columns.

col_idx: the indices of the Vertica columns corresponding to the flattened tensors. This allows you to explicitly specify indices of the Vertica columns that couldn't otherwise be specified as a simple range with col_start and dim (for example, 1, 3, 5, 7).

data_type (not shown): the data type of the input or output, one of the following:
TF_FLOAT (default)
TF_DOUBLE
TF_INT8
TF_INT16
TF_INT32
TF_INT64
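The col_start/col_idx rules above can be made concrete with a small sketch (a hypothetical helper, not part of Vertica's tooling) that computes which Vertica column indices a tensor_map entry consumes:

```python
from math import prod

def vertica_columns(entry):
    """Return the Vertica column indices consumed by a tensor_map entry.

    col_idx, when present, lists the columns explicitly; otherwise the
    entry covers the contiguous range starting at col_start whose length
    is the flattened tensor dimension (the product of dim).
    """
    if "col_idx" in entry:
        return list(entry["col_idx"])
    flattened = prod(entry["dim"]) if entry["dim"] else 1  # [] = scalar, one column
    return list(range(entry["col_start"], entry["col_start"] + flattened))

# The MNIST input tensor above (dim [1, 28, 28, 1]) spans 784 columns
print(len(vertica_columns({"idx": 0, "dim": [1, 28, 28, 1], "col_start": 0})))
```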
Below is a more complex example that includes multiple inputs and outputs:
{
    "input_desc": [
        {
            "op_name": "input1",
            "tensor_map": [
                { "idx": 0, "dim": [4], "col_idx": [0, 1, 2, 3] },
                { "idx": 1, "dim": [2, 2], "col_start": 4 }
            ]
        },
        {
            "op_name": "input2",
            "tensor_map": [
                { "idx": 0, "dim": [], "col_idx": [8] },
                { "idx": 1, "dim": [2], "col_start": 9 }
            ]
        }
    ],
    "output_desc": [
        {
            "op_name": "output",
            "tensor_map": [
                { "idx": 0, "dim": [2], "col_start": 0 }
            ]
        }
    ]
}
Models that accept complex types
The following tf_model_desc.json is generated from a model that inputs and outputs complex type columns:
{
    "column_type": "complex",
    "frozen_graph": "frozen_graph.pb",
    "input_tensors": [
        { "name": "x:0", "data_type": "int32", "dims": [1, 1] },
        { "name": "x_1:0", "data_type": "int32", "dims": [1, 2] }
    ],
    "output_tensors": [
        { "name": "Identity:0", "data_type": "float32", "dims": [1, 1] },
        { "name": "Identity_1:0", "data_type": "float32", "dims": [1, 2] }
    ]
}
As with models that accept primitive types, this file describes the structure of the model's inputs and outputs and contains a frozen_graph field that matches the filename of the .pb model. However, instead of an input_desc field and an output_desc field, models with complex types have an input_tensors field and an output_tensors field, as well as a column_type field.

column_type: specifies that the model accepts input and output columns of complex types. When imported into Vertica, the model must make predictions using the PREDICT_TENSORFLOW_SCALAR function.

input_tensors and output_tensors: the descriptions of the input and output tensors in the TensorFlow graph. Each of these fields includes the following subfields:

name: the name of the tensor for which information is listed. The name is in the format operation:tensor-number, where operation is the operation that contains the tensor and tensor-number is the index of the tensor under the given operation.

data_type: the data type of the elements in the input or output tensor, one of the following:
TF_FLOAT (default)
TF_DOUBLE
TF_INT8
TF_INT16
TF_INT32
TF_INT64

dims: the dimensions of the tensor. Each input/output tensor is contained in a 1D ARRAY in the input/output ROW column.
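The operation:tensor-number naming convention described above can be parsed mechanically. A sketch (a hypothetical helper, not part of Vertica):

```python
def split_tensor_name(name):
    """Split a TensorFlow tensor name like 'OUTPUT/Softmax:0' into its
    operation name and the tensor's index under that operation."""
    op, _, idx = name.rpartition(":")
    return op, int(idx)

# 'Identity_1:0' is the first tensor under the Identity_1 operation
print(split_tensor_name("Identity_1:0"))
```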
See also
8.2  Using PMML models
Vertica can import, export, and make predictions with PMML models of version 4.4 and below.
8.2.1  Exporting Vertica models in PMML format
You can take advantage of the built-in distributed algorithms in Vertica to train machine learning models. There might be cases in which you want to use these models for prediction outside Vertica, for example on an edge node. You can export certain Vertica models in PMML format and use them for prediction using a library or platform that supports reading and evaluating PMML models. Vertica supports the export of the following Vertica model types into PMML format: KMEANS, LINEAR_REGRESSION, LOGISTIC_REGRESSION, RF_CLASSIFIER, RF_REGRESSOR, XGB_CLASSIFIER, and XGB_REGRESSOR.
Here is an example for training a model in Vertica and then exporting it in PMML format. The following diagram shows the workflow of the example. We use vsql to run this example.
Let's assume that you want to train a logistic regression model on the data in a relation named 'patients' in order to predict whether patients suffer a second heart attack, given their treatment and trait anxiety.
After training, the model appears in a system table named V_CATALOG.MODELS, which lists the archived ML models in Vertica.
=> -- Training a logistic regression model on a training_data
=> SELECT logistic_reg('myModel', 'patients', 'second_attack', 'treatment, trait_anxiety');
       logistic_reg
---------------------------
 Finished in 5 iterations
(1 row)
=> -- Looking at the models table
=> SELECT model_name, schema_name, category, model_type, create_time, size FROM models;
 model_name | schema_name |    category    |     model_type      |          create_time          | size
------------+-------------+----------------+---------------------+-------------------------------+------
 myModel    | public      | VERTICA_MODELS | LOGISTIC_REGRESSION | 2020-07-28 00:05:18.441958-04 | 1845
(1 row)
You can look at the summary of the model using the GET_MODEL_SUMMARY function.
=> -- Looking at the summary of the model
=> \t
Showing only tuples.
=> SELECT get_model_summary(USING PARAMETERS model_name='myModel');
=======
details
=======
   predictor   |coefficient|std_err|z_value |p_value
---------------+-----------+-------+--------+--------
   Intercept   | -6.36347  |3.21390|-1.97998|0.04771
   treatment   | -1.02411  |1.17108|-0.87450|0.38185
 trait_anxiety |  0.11904  |0.05498| 2.16527|0.03037

==============
regularization
==============
type| lambda
----+--------
none| 1.00000

===========
call_string
===========
logistic_reg('public.myModel', 'patients', '"second_attack"', 'treatment, trait_anxiety'
USING PARAMETERS optimizer='newton', epsilon=1e-06, max_iterations=100, regularization='none', lambda=1, alpha=0.5)

===============
Additional Info
===============
       Name       |Value
------------------+-----
 iteration_count  |    5
rejected_row_count|    0
accepted_row_count|   20
You can also retrieve the model's attributes using the GET_MODEL_ATTRIBUTE function.
=> \t
Tuples only is off.
=> -- The list of the attributes of the model
=> SELECT get_model_attribute(USING PARAMETERS model_name='myModel');
     attr_name      |                    attr_fields                    | #_of_rows
--------------------+---------------------------------------------------+-----------
 details            | predictor, coefficient, std_err, z_value, p_value |         3
 regularization     | type, lambda                                      |         1
 iteration_count    | iteration_count                                   |         1
 rejected_row_count | rejected_row_count                                |         1
 accepted_row_count | accepted_row_count                                |         1
 call_string        | call_string                                       |         1
(6 rows)
=> -- Returning the coefficients of the model in a tabular format
=> SELECT get_model_attribute(USING PARAMETERS model_name='myModel', attr_name='details');
   predictor   |    coefficient    |      std_err       |      z_value       |      p_value
---------------+-------------------+--------------------+--------------------+--------------------
 Intercept     | -6.36346994178182 | 3.21390452471434   | -1.97998101463435  | 0.0477056620380991
 treatment     | -1.02410605239327 | 1.1710801464903    | -0.874496980810833 | 0.381847663704613
 trait_anxiety | 0.119044916668605 | 0.0549791755747139 | 2.16527285875412   | 0.0303667955962211
(3 rows)
You can use the EXPORT_MODELS function in a simple statement to export the model in PMML format, as shown below.
=> -- Exporting the model as PMML
=> SELECT export_models('/data/username/temp', 'myModel' USING PARAMETERS category='PMML');
 export_models
---------------
 Success
(1 row)
See also
8.2.2  Importing and predicting with PMML models
As a Vertica user, you can train ML models in other platforms, convert them to standard PMML format, and then import them into Vertica for in-database prediction on data stored in Vertica relations.
Here is an example of how to import a PMML model trained in Spark. The following diagram shows the workflow of the example.
You can use the IMPORT_MODELS function in a simple statement to import the PMML model. The imported model then appears in a system table named V_CATALOG.MODELS which lists the archived ML models in Vertica.
=> -- importing the PMML model trained and generated in Spark
=> SELECT import_models('/data/username/temp/spark_logistic_reg' USING PARAMETERS category='PMML');
 import_models
---------------
 Success
(1 row)
=> -- Looking at the models table
=> SELECT model_name, schema_name, category, model_type, create_time, size FROM models;
     model_name     | schema_name | category |      model_type       |          create_time          | size
--------------------+-------------+----------+-----------------------+-------------------------------+------
 spark_logistic_reg | public      | PMML     | PMML_REGRESSION_MODEL | 2020-07-28 00:12:29.389709-04 | 5831
(1 row)
You can look at the summary of the model using the GET_MODEL_SUMMARY function.
=> \t
Showing only tuples.
=> SELECT get_model_summary(USING PARAMETERS model_name='spark_logistic_reg');
=============
function_name
=============
classification

===========
data_fields
===========
 name  |dataType|  optype
-------+--------+------------
field_0| double |continuous
field_1| double |continuous
field_2| double |continuous
field_3| double |continuous
field_4| double |continuous
field_5| double |continuous
field_6| double |continuous
field_7| double |continuous
field_8| double |continuous
target | string |categorical

==========
predictors
==========
 name  |exponent|coefficient
-------+--------+-----------
field_0|   1    | 0.23318
field_1|   1    | 0.73623
field_2|   1    | 0.29964
field_3|   1    | 0.12809
field_4|   1    | 0.66857
field_5|   1    | 0.51675
field_6|   1    | 0.41026
field_7|   1    | 0.30829
field_8|   1    | 0.17788

===============
Additional Info
===============
     Name     | Value
--------------+--------
is_supervised |   1
  intercept   |1.20173
You can also retrieve the model's attributes using the GET_MODEL_ATTRIBUTE function.
=> \t
Tuples only is off.
=> -- The list of the attributes of the PMML model
=> SELECT get_model_attribute(USING PARAMETERS model_name='spark_logistic_reg');
   attr_name   |         attr_fields         | #_of_rows
---------------+-----------------------------+-----------
 is_supervised | is_supervised               |         1
 function_name | function_name               |         1
 data_fields   | name, dataType, optype      |        10
 intercept     | intercept                   |         1
 predictors    | name, exponent, coefficient |         9
(5 rows)
=> -- The coefficients of the PMML model
=> SELECT get_model_attribute(USING PARAMETERS model_name='spark_logistic_reg', attr_name='predictors');
  name   | exponent |    coefficient
---------+----------+--------------------
 field_0 |        1 | 0.2331769167607
 field_1 |        1 | 0.736227459496199
 field_2 |        1 | 0.29963728232024
 field_3 |        1 | 0.128085369856188
 field_4 |        1 | 0.668573096260048
 field_5 |        1 | 0.516750679584637
 field_6 |        1 | 0.41025989394959
 field_7 |        1 | 0.308289533913736
 field_8 |        1 | 0.177878773139411
(9 rows)
You can then use the PREDICT_PMML function to apply the imported model on a relation for indatabase prediction. The internal parameters of the model can be matched to the column names of the input relation by their names or their listed position. Direct input values can also be fed to the function as displayed below.
=> -- Using the imported PMML model for scoring direct input values
=> SELECT predict_pmml(1.5,0.5,2,1,0.75,4.2,3.1,0.9,1.1
username(> USING PARAMETERS model_name='spark_logistic_reg', match_by_pos=true);
 predict_pmml
--------------
 1
(1 row)
=> -- Using the imported PMML model for scoring samples in a table
=> SELECT predict_pmml(* USING PARAMETERS model_name='spark_logistic_reg') AS prediction
-> FROM test_data;
 prediction
------------
 1
 0
(2 rows)
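The two matching modes described above (by listed position with match_by_pos=true, or by column name) can be sketched in a few lines. This is a hypothetical illustration of the matching logic, not Vertica's implementation:

```python
def map_inputs(values, field_names, match_by_pos):
    """Pair input values with the model's PMML field names.

    With match_by_pos=True, values are matched to fields in listed order
    (like the direct-value call above); otherwise values must be a dict
    keyed by field name (like matching input columns by name).
    """
    if match_by_pos:
        return dict(zip(field_names, values))
    return {name: values[name] for name in field_names}

print(map_inputs([1.5, 0.5], ["field_0", "field_1"], match_by_pos=True))
```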
See also
8.2.3  PMML features and attributes
Using External Models With Vertica gives an overview of the features Vertica supports for working with external models.
To be compatible with Vertica, PMML models must conform to the following:
The following table lists supported PMML tags and attributes.