This is the multi-page printable view of this section. Click here to print.

Return to the regular view of this page.

Transformation functions

The machine learning API includes a set of UDx functions that transform the columns of each input row to one or more corresponding output columns.

1: APPLY_BISECTING_KMEANS
2: APPLY_IFOREST
3: APPLY_INVERSE_PCA
4: APPLY_INVERSE_SVD
5: APPLY_KMEANS
6: APPLY_KPROTOTYPES
7: APPLY_NORMALIZE
8: APPLY_ONE_HOT_ENCODER
9: APPLY_PCA
10: APPLY_SVD
11: PREDICT_ARIMA
12: PREDICT_AUTOREGRESSOR
13: PREDICT_LINEAR_REG
14: PREDICT_LOGISTIC_REG
15: PREDICT_MOVING_AVERAGE
16: PREDICT_NAIVE_BAYES
17: PREDICT_NAIVE_BAYES_CLASSES
18: PREDICT_PLS_REG
19: PREDICT_PMML
20: PREDICT_POISSON_REG
21: PREDICT_RF_CLASSIFIER
22: PREDICT_RF_CLASSIFIER_CLASSES
23: PREDICT_RF_REGRESSOR
24: PREDICT_SVM_CLASSIFIER
25: PREDICT_SVM_REGRESSOR
26: PREDICT_TENSORFLOW
27: PREDICT_TENSORFLOW_SCALAR
28: PREDICT_XGB_CLASSIFIER
29: PREDICT_XGB_CLASSIFIER_CLASSES
30: PREDICT_XGB_REGRESSOR
31: REVERSE_NORMALIZE

The machine learning API includes a set of UDx functions that transform the columns of each input row to one or more corresponding output columns. These transformations follow rules that are defined in models that were created earlier. For example, APPLY_SVD uses an SVD model to transform input data.

Unless otherwise indicated, these functions require the following privileges for non-superusers:

USAGE privileges on the model
SELECT privileges on the input relation

In general, given an invalid input row, the return value for these functions is NULL.

1 - APPLY_BISECTING_KMEANS

Applies a trained bisecting k-means model to an input relation, and assigns each new data point to the closest matching cluster in the trained model.

Note

If the input relation is defined in Hive, use SYNC_WITH_HCATALOG_SCHEMA to sync the hcatalog schema, and then run the machine learning function.

Syntax

SELECT APPLY_BISECTING_KMEANS( 'input-columns'
        USING PARAMETERS model_name = 'model-name'
            [, num_clusters = 'num-clusters']
            [, match_by_pos = match-by-position] ] )

Arguments

input-columns: Comma-separated list of columns to use from the input relation, or asterisk (*) to select all columns. Input columns must be of data type numeric.

Parameters

model_name

Name of the model (case-insensitive).

num_clusters

Integer between 1 and k inclusive, where k is the number of centers in the model, specifies the number of clusters to use for prediction.

Default: Value that the model specifies for k

match_by_pos

Boolean value that specifies how input columns are matched to model features:

true: Match by the position of columns in the input columns list.
false (default): Match by name.

Privileges

Non-superusers: model owner, or USAGE privileges on the model

2 - APPLY_IFOREST

Applies an isolation forest (iForest) model to an input relation.

Applies an isolation forest (iForest) model to an input relation. For each input row, the function returns an output row with two fields:

anomaly_score: A float value that represents the average path length across all trees in the model normalized by the training sample size.
is_anomaly: A Boolean value that indicates whether the input row is an anomaly. This value is true when anomaly_score is equal to or larger than a given threshold; otherwise, it's false.

Syntax

APPLY_IFOREST( input-columns USING PARAMETERS param=value[,...] )

Arguments

input-columns: Comma-separated list of columns to use from the input relation, or asterisk (*) to select all columns. Column types must match the types of the predictors in model_name.

Parameters

model_name

Name of the model (case-insensitive).

threshold

Optional. Float in the range (0.0, 1.0), specifies the threshold that determines if a data point is an anomaly. If the anomaly_score for a data point is equal to or larger than the value of threshold, the data point is marked as an outlier.

Alternatively, you can specify a contamination value that sets a threshold where the percentage of training data points labeled as outliers is approximately equal to the value of contamination. You cannot set both contamination and threshold in the same function call.

Default: 0.7

match_by_pos

Optional. Boolean value that specifies how input columns are matched to model columns:

false: Match by name.
true: Match by the position of columns in the input columns list.

Default: false

contamination

Optional. Float in the range (0.0, 1.0), the approximate ratio of data points in the training data that are labeled as outliers. The function calculates a threshold based on this contamination value. If you do not set this parameter, the function marks outliers using the specified or default threshold value.

You cannot set both contamination and threshold in the same function call.

Privileges

Non-superusers:

USAGE privileges on the model
SELECT privileges on the input relation

Examples

The following example demonstrates how different threshold values can affect outlier detection on an input relation:

=> SELECT * FROM (SELECT first_name, last_name, APPLY_IFOREST(team, hr, hits, avg, salary USING PARAMETERS model_name='baseball_anomalies',
   threshold=0.75) AS predictions FROM baseball) AS outliers WHERE predictions.is_anomaly IS true;
 first_name | last_name |                      predictions
------------+-----------+-------------------------------------------------------
 Jacqueline | Richards  | {"anomaly_score":0.777757463074347,"is_anomaly":true}
(1 row)

=> SELECT * FROM (SELECT first_name, last_name, APPLY_IFOREST(team, hr, hits, avg, salary USING PARAMETERS model_name='baseball_anomalies',
   threshold=0.55) AS predictions FROM baseball) AS outliers WHERE predictions.is_anomaly IS true;
 first_name | last_name |                      predictions
------------+-----------+--------------------------------------------------------
 Jacqueline | Richards  | {"anomaly_score":0.777757463074347,"is_anomaly":true}
 Debra      | Hall      | {"anomaly_score":0.5714649698133808,"is_anomaly":true}
 Gerald     | Fuller    | {"anomaly_score":0.5980549926114661,"is_anomaly":true}
(3 rows)

You can also use different contamination values to alter the outlier threshold:

=> SELECT * FROM (SELECT first_name, last_name, APPLY_IFOREST(team, hr, hits, avg, salary USING PARAMETERS model_name='baseball_anomalies',
   contamination = 0.1) AS predictions FROM baseball) AS outliers WHERE predictions.is_anomaly IS true;
 first_name | last_name |                      predictions
------------+-----------+--------------------------------------------------------
 Marie      | Fields    | {"anomaly_score":0.5307715717521868,"is_anomaly":true}
 Jacqueline | Richards  | {"anomaly_score":0.777757463074347,"is_anomaly":true}
 Debra      | Hall      | {"anomaly_score":0.5714649698133808,"is_anomaly":true}
 Gerald     | Fuller    | {"anomaly_score":0.5980549926114661,"is_anomaly":true}
(4 rows)

=> SELECT * FROM (SELECT first_name, last_name, APPLY_IFOREST(team, hr, hits, avg, salary USING PARAMETERS model_name='baseball_anomalies',
   contamination = 0.01) AS predictions FROM baseball) AS outliers WHERE predictions.is_anomaly IS true;
 first_name | last_name |                      predictions
------------+-----------+--------------------------------------------------------
 Jacqueline | Richards  | {"anomaly_score":0.777757463074347,"is_anomaly":true}
 Debra      | Hall      | {"anomaly_score":0.5714649698133808,"is_anomaly":true}
 Gerald     | Fuller    | {"anomaly_score":0.5980549926114661,"is_anomaly":true}
(3 rows)

3 - APPLY_INVERSE_PCA

Inverts the APPLY_PCA-generated transform back to the original coordinate system.

Syntax

APPLY_INVERSE_PCA ( input-columns
        USING PARAMETERS model_name = 'model-name'
            [, exclude_columns = 'excluded-columns']
            [, key_columns = 'key-columns'] )

Arguments

input-columns

Comma-separated list of columns to use from the input relation, or asterisk (*) to select all columns. The following requirements apply:

All columns must be a numeric data type.
Enclose the column name in double quotes if it contains special characters.

Parameters

model_name: Name of the model (case-insensitive).
exclude_columns: Comma-separated list of column names from input-columns to exclude from processing.
key_columns: Comma-separated list of column names from input-columns that identify its data rows. These columns are included in the output table.

Examples

The following example shows how to use the APPLY_INVERSE_PCA function. It shows the output for the first record.


=> SELECT PCA ('pcamodel', 'world','country,HDI,em1970,em1971,em1972,em1973,em1974,em1975,em1976,em1977,
em1978,em1979,em1980,em1981,em1982,em1983,em1984 ,em1985,em1986,em1987,em1988,em1989,em1990,em1991,em1992,
em1993,em1994,em1995,em1996,em1997,em1998,em1999,em2000,em2001,em2002,em2003,em2004,em2005,em2006,em2007,
em2008,em2009,em2010,gdp1970,gdp1971,gdp1972,gdp1973,gdp1974,gdp1975,gdp1976,gdp1977,gdp1978,gdp1979,gdp1980,
gdp1981,gdp1982,gdp1983,gdp1984,gdp1985,gdp1986,gdp1987,gdp1988,gdp1989,gdp1990,gdp1991,gdp1992,gdp1993,
gdp1994,gdp1995,gdp1996,gdp1997,gdp1998,gdp1999,gdp2000,gdp2001,gdp2002,gdp2003,gdp2004,gdp2005,gdp2006,
gdp2007,gdp2008,gdp2009,gdp2010' USING PARAMETERS exclude_columns='HDI,country');
PCA
---------------------------------------------------------------
Finished in 1 iterations.
Accepted Rows: 96  Rejected Rows: 0
(1 row)
=> CREATE TABLE worldPCA AS SELECT
APPLY_PCA (HDI,country,em1970,em1971,em1972,em1973,em1974,em1975,em1976,em1977,em1978,em1979,
em1980,em1981,em1982,em1983,em1984 ,em1985,em1986,em1987,em1988,em1989,em1990,em1991,em1992,em1993,em1994,
em1995,em1996,em1997,em1998,em1999,em2000,em2001,em2002,em2003,em2004,em2005,em2006,em2007,em2008,em2009,
em2010,gdp1970,gdp1971,gdp1972,gdp1973,gdp1974,gdp1975,gdp1976,gdp1977,gdp1978,gdp1979,gdp1980,gdp1981,gdp1982,
gdp1983,gdp1984,gdp1985,gdp1986,gdp1987,gdp1988,gdp1989,gdp1990,gdp1991,gdp1992,gdp1993,gdp1994,gdp1995,
gdp1996,gdp1997,gdp1998,gdp1999,gdp2000,gdp2001,gdp2002,gdp2003,gdp2004,gdp2005,gdp2006,gdp2007,gdp2008,
gdp2009,gdp2010 USING PARAMETERS model_name='pcamodel', exclude_columns='HDI, country', key_columns='HDI,
country',cutoff=.3)OVER () FROM world;
CREATE TABLE

=> SELECT * FROM worldPCA;
HDI   |       country       |       col1
------+---------------------+-------------------
0.886 | Belgium             |  79002.2946705704
0.699 | Belize              | -25631.6670012556
0.427 | Benin               | -40373.4104598122
0.805 | Chile               | -16805.7940082156
0.687 | China               | -37279.2893141103
0.744 | Costa Rica          | -19505.5631231635
0.4   | Cote d'Ivoire       | -38058.2060339272
0.776 | Cuba                | -23724.5779612041
0.895 | Denmark             |  117325.594028813
0.644 | Egypt               | -34609.9941604549
...
(96 rows)

=> SELECT APPLY_INVERSE_PCA (HDI, country, col1
    USING PARAMETERS model_name = 'pcamodel', exclude_columns='HDI,country',
    key_columns = 'HDI, country') OVER () FROM worldPCA;
HDI  |       country       |      em1970       |      em1971       |      em1972      |      em1973      |
      em1974      |      em1975       |      em1976|      em1977      |      em1978       |      em1979
   |      em1980       |      em1981      |      em1982       |      em1983       |      em1984       |em1985
|      em1986       |      em1987       |      em1988       |      em1989      |      em1990      |      em1991
|      em1992       |      em1993|      em1994      |      em1995       |      em1996       |      em1997
    |      em1998       |      em1999       |      em2000       |      em2001       |em2002       |
em2003      |      em2004       |      em2005      |      em2006       |      em2007       |      em2008
|      em2009      |      em2010       |     gdp1970      |     gdp1971      |     gdp1972      |     gdp1973
|     gdp1974      |     gdp1975      |     gdp1976      |     gdp1977      |gdp1978      |     gdp1979
 |     gdp1980      |     gdp1981      |     gdp1982      |     gdp1983      |     gdp1984      |     gdp1985
      |     gdp1986|    gdp1987      |     gdp1988      |     gdp1989      |     gdp1990      |     gdp1991
     |     gdp1992      |     gdp1993      |     gdp1994      |     gdp1995      |     gdp1996      |
gdp1997      |     gdp1998      |     gdp1999      |     gdp2000      |     gdp2001      |     gdp2002
|     gdp2003      |gdp2004      |     gdp2005      |     gdp2006      |     gdp2007      |     gdp2008
  |     gdp2009      |     gdp2010
-------+---------------------+-------------------+-------------------+------------------+------------------
+------------------+-------------------+------------------+------------------+-------------------+---------
----------+-------------------+------------------+-------------------+-------------------+-----------------
--+------------------+-------------------+-------------------+-------------------+------------------+-------
-----------+------------------+-------------------+-------------------+------------------+------------------
-+-------------------+------------------+-------------------+-------------------+-------------------+-------
------------+--------------------+------------------+-------------------+------------------+----------------
---+-------------------+-------------------+------------------+-------------------+------------------+------
------------+------------------+------------------+------------------+------------------+------------------+
------------------+------------------+------------------+------------------+------------------+-------------
-----+------------------+------------------+------------------+------------------+------------------+-------
-----------+------------------+------------------+------------------+------------------+------------------+-
-----------------+------------------+------------------+------------------+------------------+--------------
----+------------------+------------------+------------------+------------------+------------------+--------
----------+------------------+------------------+------------------+------------------+------------------
0.886 | Belgium             |  18585.6613572407 | -16145.6374560074 |  26938.956253415 | 8094.30475779595 |
 12073.5461203817 | -11069.0567600181 | 19133.8584911727|   5500.312894949 | -4227.94863799987 |  6265.77925410752
|  -10884.749295608 | 30929.4669575201 | -7831.49439429977 |  3235.81760508742 | -22765.9285442662 | 27200
.6767714485 | -10554.9550160917 |   1169.4144482273 | -16783.7961289161 | 27932.2660829329 | 17227.9083196848
| 13956.0524012749 | -40175.6286481088 | -10889.4785920499 | 22703.6576872859 | -14635.5832197402 |
2857.12270512168 | 20473.5044214494 | -52199.4895696423 | -11038.7346460738 |  18466.7298633088 | -17410.4225137703 |
-3475.63826305462 | 29305.6753822341 |   1242.5724942049 | 17491.0096310849 | -12609.9984515902 | -17909.3603476248
|  6276.58431412381 | 21851.9475485178 | -2614.33738160397 | 3777.74134131349 | 4522.08854282736 | 4251.90446379366
| 4512.15101396876 | 4265.49424538129 | 5190.06845330997 | 4543.80444817989 | 5639.81122679089 | 4420.44705213467
|  5658.8820279283 | 5172.69025294376 | 5019.63640408663 | 5938.84979495903 | 4976.57073629812 | 4710.49525137591
| 6523.65700286465 | 5067.82520773578 | 6789.13070219317 | 5525.94643553563 | 6894.68336419297 | 5961.58442474331
| 5661.21093840818 | 7721.56088518218 |  5959.7301109143 | 6453.43604137202 | 6739.39384033096 | 7517.97645468455
| 6907.49136910647 | 7049.03921764209 | 7726.49091035527 | 8552.65909911844 | 7963.94487647115 | 7187.45827585515
| 7994.02955410523 | 9532.89844418041 | 7962.25713582666 | 7846.68238907624 | 10230.9878908643 | 8642.76044946519
| 8886.79860331866 |  8718.3731386891
...
(96 rows)

4 - APPLY_INVERSE_SVD

Transforms the data back to the original domain.

Transforms the data back to the original domain. This essentially computes the approximated version of the original data by multiplying three matrices: matrix U (input to this function), matrices S and V (stored in the model).

Syntax

APPLY_INVERSE_SVD ( 'input-columns'
        USING PARAMETERS model_name = 'model-name'
            [, exclude_columns = 'excluded-columns']
            [, key_columns = 'key-columns'] )

Arguments

input-columns

Comma-separated list of columns to use from the input relation, or asterisk (*) to select all columns. The following requirements apply:

All columns must be a numeric data type.
Enclose the column name in double quotes if it contains special characters.

Parameters

model_name: Name of the model (case-insensitive).
exclude_columns: Comma-separated list of column names from input-columns to exclude from processing.
key_columns: Comma-separated list of column names from input-columns that identify its data rows. These columns are included in the output table.

Examples

=> SELECT SVD ('svdmodel', 'small_svd', 'x1,x2,x3,x4');
SVD
--------------------------------------------------------------
Finished in 1 iterations.
Accepted Rows: 8  Rejected Rows: 0
(1 row)

=> CREATE TABLE transform_svd AS SELECT
     APPLY_SVD (id, x1, x2, x3, x4 USING PARAMETERS model_name='svdmodel', exclude_columns='id', key_columns='id')
     OVER () FROM small_svd;
CREATE TABLE

=> SELECT * FROM transform_svd;
id  |       col1        |        col2         |        col3         |        col4
----+-------------------+---------------------+---------------------+--------------------
4   |  0.44849499240202 |  -0.347260956311326 |   0.186958376368345 |  0.378561270493651
6   |  0.17652411036246 | -0.0753183783382909 |  -0.678196192333598 | 0.0567124770173372
1   | 0.494871802886819 |   0.161721379259287 |  0.0712816417153664 | -0.473145877877408
2   |  0.17652411036246 | -0.0753183783382909 |  -0.678196192333598 | 0.0567124770173372
3   | 0.150974762654569 |   0.589561842046029 | 0.00392654610109522 |  0.360011163271921
5   | 0.494871802886819 |   0.161721379259287 |  0.0712816417153664 | -0.473145877877408
8   |  0.44849499240202 |  -0.347260956311326 |   0.186958376368345 |  0.378561270493651
7   | 0.150974762654569 |   0.589561842046029 | 0.00392654610109522 |  0.360011163271921
(8 rows)

=> SELECT APPLY_INVERSE_SVD (* USING PARAMETERS model_name='svdmodel', exclude_columns='id',
key_columns='id') OVER () FROM transform_svd;
id  |        x1        |        x2        |        x3        |        x4
----+------------------+------------------+------------------+------------------
4 | 91.4056627665577 | 44.7629617207482 | 83.1704961993117 | 38.9274292265543
6 | 20.6468626294368 | 9.30974906868751 | 8.71006863405534 |  6.5855928603967
7 | 31.2494347777156 | 20.6336519003026 | 27.5668287751507 | 5.84427645886865
1 |  107.93376580719 | 51.6980548011917 | 97.9665796560552 | 40.4918236881051
2 | 20.6468626294368 | 9.30974906868751 | 8.71006863405534 |  6.5855928603967
3 | 31.2494347777156 | 20.6336519003026 | 27.5668287751507 | 5.84427645886865
5 |  107.93376580719 | 51.6980548011917 | 97.9665796560552 | 40.4918236881051
8 | 91.4056627665577 | 44.7629617207482 | 83.1704961993117 | 38.9274292265543
(8 rows)

5 - APPLY_KMEANS

Assigns each row of an input relation to a cluster center from an existing k-means model.

Syntax

APPLY_KMEANS ( input-columns
       USING PARAMETERS model_name = 'model-name' [, match_by_pos = match-by-position] )

Arguments

input-columns: Comma-separated list of columns to use from the input relation, or asterisk (*) to select all columns.

Parameters

model_name

Name of the model (case-insensitive).

match_by_pos

Boolean value that specifies how input columns are matched to model features:

true: Match by the position of columns in the input columns list.
false (default): Match by name.

Privileges

Non-superusers: model owner, or USAGE privileges on the model

Examples

The following example creates k-means model myKmeansModel and applies it to input table iris1. The call to APPLY_KMEANS mixes column names and constants. When a constant is passed in place of a column name, the constant is substituted for the value of the column in all rows:

=> SELECT KMEANS('myKmeansModel', 'iris1', '*', 5
USING PARAMETERS max_iterations=20, output_view='myKmeansView', key_columns='id', exclude_columns='Species, id');
           KMEANS
----------------------------
 Finished in 12 iterations

(1 row)
=> SELECT id, APPLY_KMEANS(Sepal_Length, 2.2, 1.3, Petal_Width
USING PARAMETERS model_name='myKmeansModel', match_by_pos='true') FROM iris2;
 id  | APPLY_KMEANS
-----+--------------
   5 |            1
  10 |            1
  14 |            1
  15 |            1
  21 |            1
  22 |            1
  24 |            1
  25 |            1
  32 |            1
  33 |            1
  34 |            1
  35 |            1
  38 |            1
  39 |            1
  42 |            1
...
 (60 rows)

6 - APPLY_KPROTOTYPES

Assigns each row of an input relation to a cluster center from an existing k-prototypes model.

Syntax

APPLY_KPROTOTYPES ( input-columns
       USING PARAMETERS model_name = 'model-name' [, match_by_pos = match-by-position] )

Arguments

input-columns: Comma-separated list of columns to use from the input relation, or asterisk (*) to select all columns.

Parameters

model_name

Name of the model (case-insensitive).

match_by_pos

Boolean value that specifies how input columns are matched to model features:

true: Match by the position of columns in the input columns list.
false (default): Match by name.

Privileges

Non-superusers: model owner, or USAGE privileges on the model

Examples

The following example creates k-prototypes model small_model and applies it to input table small_test_mixed:

=> SELECT KPROTOTYPES('small_model_initcenters', 'small_test_mixed', 'x0, country', 3 USING PARAMETERS initial_centers_table='small_test_mixed_centers', key_columns='pid');
      KPROTOTYPES
---------------------------
Finished in 2 iterations

(1 row)

=> SELECT country, x0, APPLY_KPROTOTYPES(country, x0
USING PARAMETERS model_name='small_model')
FROM small_test_mixed;
  country   | x0  | apply_kprototypes
------------+-----+-------------------
 'China'    |  20 |                 0
 'US'       |  85 |                 2
 'Russia'   |  80 |                 1
 'Brazil'   |  78 |                 1
 'US'       |  23 |                 0
 'US'       |  50 |                 0
 'Canada'   |  24 |                 0
 'Canada'   |  18 |                 0
 'Russia'   |  90 |                 2
 'Russia'   |  98 |                 2
 'Brazil'   |  89 |                 2
...
(45 rows)

7 - APPLY_NORMALIZE

A UDTF function that applies the normalization parameters saved in a model to a set of specified input columns.

A UDTF function that applies the normalization parameters saved in a model to a set of specified input columns. If any column specified in the function is not in the model, its data passes through unchanged to APPLY_NORMALIZE.

Note

Note: If a column contains only one distinct value, APPLY_NORMALIZE returns NaN for values in that column.

Syntax

APPLY_NORMALIZE ( input-columns USING PARAMETERS model_name = 'model-name');

Arguments

input-columns: Comma-separated list of columns to use from the input relation, or asterisk (*) to select all columns. If you supply an asterisk, APPLY_NORMALIZE normalizes all columns in the model.

Parameters

model_name: Name of the model (case-insensitive).

Examples

The following example creates a model with NORMALIZE_FIT using the wt and hp columns in table mtcars , and then uses this model in successive calls to APPLY_NORMALIZE and REVERSE_NORMALIZE.

=> SELECT NORMALIZE_FIT('mtcars_normfit', 'mtcars', 'wt,hp', 'minmax');
NORMALIZE_FIT
---------------
Success
(1 row)

The following call to APPLY_NORMALIZE specifies the hp and cyl columns in table mtcars, where hp is in the normalization model and cyl is not in the normalization model:

=> CREATE TABLE mtcars_normalized AS SELECT APPLY_NORMALIZE (hp, cyl USING PARAMETERS model_name = 'mtcars_normfit') FROM mtcars;
CREATE TABLE
=> SELECT * FROM mtcars_normalized;
          hp        | cyl
--------------------+-----
  0.434628975265018 | 8
  0.681978798586572 | 8
  0.434628975265018 | 6
                  1 | 8
  0.540636042402827 | 8
                  0 | 4
  0.681978798586572 | 8
 0.0459363957597173 | 4
  0.434628975265018 | 8
  0.204946996466431 | 6
  0.250883392226148 | 6
  0.049469964664311 | 4
  0.204946996466431 | 6
  0.201413427561837 | 4
  0.204946996466431 | 6
  0.250883392226148 | 6
  0.049469964664311 | 4
  0.215547703180212 | 4
 0.0353356890459364 | 4
  0.187279151943463 | 6
  0.452296819787986 | 8
  0.628975265017668 | 8
  0.346289752650177 | 8
  0.137809187279152 | 4
  0.749116607773852 | 8
  0.144876325088339 | 4
  0.151943462897526 | 4
  0.452296819787986 | 8
  0.452296819787986 | 8
  0.575971731448763 | 8
  0.159010600706714 | 4
  0.346289752650177 | 8
(32 rows)

=> SELECT REVERSE_NORMALIZE (hp, cyl USING PARAMETERS model_name='mtcars_normfit') FROM mtcars_normalized;
  hp | cyl
-----+-----
 175 | 8
 245 | 8
 175 | 6
 335 | 8
 205 | 8
  52 | 4
 245 | 8
  65 | 4
 175 | 8
 110 | 6
 123 | 6
  66 | 4
 110 | 6
 109 | 4
 110 | 6
 123 | 6
  66 | 4
 113 | 4
  62 | 4
 105 | 6
 180 | 8
 230 | 8
 150 | 8
  91 | 4
 264 | 8
  93 | 4
  95 | 4
 180 | 8
 180 | 8
 215 | 8
  97 | 4
 150 | 8
(32 rows)

The following call to REVERSE_NORMALIZE also specifies the hp and cyl columns in table mtcars, where hp is in normalization model mtcars_normfit, and cyl is not in the normalization model.

=> SELECT REVERSE_NORMALIZE (hp, cyl USING PARAMETERS model_name='mtcars_normfit') FROM mtcars_normalized;
       hp        | cyl
-----------------+-----
205.000005722046 |   8
150.000000357628 |   8
150.000000357628 |   8
93.0000016987324 |   4
 174.99999666214 |   8
94.9999992102385 |   4
214.999997496605 |   8
97.0000009387732 |   4
245.000006556511 |   8
 174.99999666214 |   6
             335 |   8
245.000006556511 |   8
62.0000002086163 |   4
 174.99999666214 |   8
230.000002026558 |   8
              52 |   4
263.999997675419 |   8
109.999999523163 |   6
123.000002324581 |   6
64.9999996386468 |   4
66.0000005029142 |   4
112.999997898936 |   4
109.999999523163 |   6
180.000000983477 |   8
180.000000983477 |   8
108.999998658895 |   4
109.999999523163 |   6
104.999999418855 |   6
123.000002324581 |   6
180.000000983477 |   8
66.0000005029142 |   4
90.9999999701977 |   4
(32 rows)

8 - APPLY_ONE_HOT_ENCODER

A user-defined transform function (UDTF) that loads the one hot encoder model and writes out a table that contains the encoded columns.

Syntax

APPLY_ONE_HOT_ENCODER( input-columns
        USING PARAMETERS model_name = 'model-name'
            [, drop_first = 'is-first']
            [, ignore_null = 'ignore']
            [, separator = 'separator-character']
            [, column_naming = 'name-output']
            [, null_column_name = 'null-column-name'] )

Arguments

input-columns: Comma-separated list of columns to use from the input relation, or asterisk (*) to select all columns.

Parameters

model_name

Name of the model (case-insensitive).

, stores the categories and their corresponding levels.

drop_first

Boolean value, one of the following:

true (default): Treat the first level of the categorical variable as the reference level.
false: Every level of the categorical variable has a corresponding column in the output view

ignore_null

Boolean value, one of the following:

true (default): Null values set all corresponding one-hot binary columns to null.
false: Null values in input-columns are treated as a categorical level

separator

The character that separates the input variable name and the indicator variable level in the output table.To avoid using any separator, set this parameter to null value.

Default: Underscore (_)

column_naming

Appends categorical levels to column names according to the specified method:

indices (default): Uses integer indices to represent categorical levels.
values/values_relaxed: Both methods use categorical level names. If duplicate column names occur, the function attempts to disambiguate them by appending _n, where n is a zero-based integer index (_0, _1,...).

If the function cannot produce unique column names , it handles this according to the chosen method:
- values returns an error.
- values_relaxed reverts to using indices.

Important

The following column naming rules apply if column_naming is set to values or values_relaxed:

Input column names with more than 128 characters are truncated.
Column names can contain special characters.
If parameter ignore_null is set to true, APPLY_ONE_HOT_ENCODER constructs the column name from the value set in parameter null_column_name. If this parameter is omitted, the string null is used.

null_column_name

The string used in naming the indicator column for null values, used only if ignore_null is set to false and column_naming is set to values or values_relaxed.

Default:null

Note

Note: If an input row contains a level not stored in the model, the output row columns corresponding to that categorical level are returned as null values.

Examples

=> SELECT APPLY_ONE_HOT_ENCODER(cyl USING PARAMETERS model_name='one_hot_encoder_model',
drop_first='true', ignore_null='false') FROM mtcars;
cyl | cyl_1 | cyl_2
----+-------+-------
8   |     0 |     1
4   |     0 |     0
4   |     0 |     0
8   |     0 |     1
8   |     0 |     1
8   |     0 |     1
4   |     0 |     0
8   |     0 |     1
8   |     0 |     1
4   |     0 |     0
8   |     0 |     1
6   |     1 |     0
4   |     0 |     0
4   |     0 |     0
6   |     1 |     0
6   |     1 |     0
8   |     0 |     1
8   |     0 |     1
4   |     0 |     0
4   |     0 |     0
6   |     1 |     0
8   |     0 |     1
8   |     0 |     1
6   |     1 |     0
4   |     0 |     0
8   |     0 |     1
8   |     0 |     1
8   |     0 |     1
6   |     1 |     0
6   |     1 |     0
4   |     0 |     0
4   |     0 |     0
(32 rows)

9 - APPLY_PCA

Transforms the data using a PCA model.

Transforms the data using a PCA model. This returns new coordinates of each data point.

Syntax

APPLY_PCA ( input-columns
        USING PARAMETERS model_name = 'model-name'
            [, num_components = num-components]
            [, cutoff = cutoff-value]
            [, match_by_pos = match-by-position]
            [, exclude_columns = 'excluded-columns']
            [, key_columns = 'key-columns'] )

Arguments

input-columns

Comma-separated list of columns that contain the data matrix, or asterisk (*) to select all columns. The following requirements apply:

All columns must be a numeric data type.
Enclose the column name in double quotes if it contains special characters.

Parameters

model_name

Name of the model (case-insensitive).

num_components

The number of components to keep in the model. This is the number of output columns that will be generated. If you omit this parameter and the cutoff parameter, all model components are kept.

cutoff

Set to 1, specifies the minimum accumulated explained variance. Components are taken until the accumulated explained variance reaches this value.

match_by_pos

Boolean value that specifies how input columns are matched to model features:

true: Match by the position of columns in the input columns list.
false (default): Match by name.

exclude_columns

Comma-separated list of column names from input-columns to exclude from processing.

key_columns

Comma-separated list of column names from input-columns that identify its data rows. These columns are included in the output table.

Examples


=> SELECT PCA ('pcamodel', 'world','country,HDI,em1970,em1971,em1972,em1973,em1974,em1975,em1976,em1977,
em1978,em1979,em1980,em1981,em1982,em1983,em1984 ,em1985,em1986,em1987,em1988,em1989,em1990,em1991,em1992,
em1993,em1994,em1995,em1996,em1997,em1998,em1999,em2000,em2001,em2002,em2003,em2004,em2005,em2006,em2007,
em2008,em2009,em2010,gdp1970,gdp1971,gdp1972,gdp1973,gdp1974,gdp1975,gdp1976,gdp1977,gdp1978,gdp1979,gdp1980,
gdp1981,gdp1982,gdp1983,gdp1984,gdp1985,gdp1986,gdp1987,gdp1988,gdp1989,gdp1990,gdp1991,gdp1992,gdp1993,
gdp1994,gdp1995,gdp1996,gdp1997,gdp1998,gdp1999,gdp2000,gdp2001,gdp2002,gdp2003,gdp2004,gdp2005,gdp2006,
gdp2007,gdp2008,gdp2009,gdp2010' USING PARAMETERS exclude_columns='HDI,country');
PCA
---------------------------------------------------------------
Finished in 1 iterations.
Accepted Rows: 96  Rejected Rows: 0
(1 row)
=> CREATE TABLE worldPCA AS SELECT
APPLY_PCA (HDI,country,em1970,em1971,em1972,em1973,em1974,em1975,em1976,em1977,em1978,em1979,
em1980,em1981,em1982,em1983,em1984 ,em1985,em1986,em1987,em1988,em1989,em1990,em1991,em1992,em1993,em1994,
em1995,em1996,em1997,em1998,em1999,em2000,em2001,em2002,em2003,em2004,em2005,em2006,em2007,em2008,em2009,
em2010,gdp1970,gdp1971,gdp1972,gdp1973,gdp1974,gdp1975,gdp1976,gdp1977,gdp1978,gdp1979,gdp1980,gdp1981,gdp1982,
gdp1983,gdp1984,gdp1985,gdp1986,gdp1987,gdp1988,gdp1989,gdp1990,gdp1991,gdp1992,gdp1993,gdp1994,gdp1995,
gdp1996,gdp1997,gdp1998,gdp1999,gdp2000,gdp2001,gdp2002,gdp2003,gdp2004,gdp2005,gdp2006,gdp2007,gdp2008,
gdp2009,gdp2010 USING PARAMETERS model_name='pcamodel', exclude_columns='HDI, country', key_columns='HDI,
country',cutoff=.3)OVER () FROM world;
CREATE TABLE

=> SELECT * FROM worldPCA;
HDI   |       country       |       col1
------+---------------------+-------------------
0.886 | Belgium             |  79002.2946705704
0.699 | Belize              | -25631.6670012556
0.427 | Benin               | -40373.4104598122
0.805 | Chile               | -16805.7940082156
0.687 | China               | -37279.2893141103
0.744 | Costa Rica          | -19505.5631231635
0.4   | Cote d'Ivoire       | -38058.2060339272
0.776 | Cuba                | -23724.5779612041
0.895 | Denmark             |  117325.594028813
0.644 | Egypt               | -34609.9941604549
...
(96 rows)

=> SELECT APPLY_INVERSE_PCA (HDI, country, col1
    USING PARAMETERS model_name = 'pcamodel', exclude_columns='HDI,country',
    key_columns = 'HDI, country') OVER () FROM worldPCA;
HDI  |       country       |      em1970       |      em1971       |      em1972      |      em1973      |
      em1974      |      em1975       |      em1976|      em1977      |      em1978       |      em1979
   |      em1980       |      em1981      |      em1982       |      em1983       |      em1984       |em1985
|      em1986       |      em1987       |      em1988       |      em1989      |      em1990      |      em1991
|      em1992       |      em1993|      em1994      |      em1995       |      em1996       |      em1997
    |      em1998       |      em1999       |      em2000       |      em2001       |em2002       |
em2003      |      em2004       |      em2005      |      em2006       |      em2007       |      em2008
|      em2009      |      em2010       |     gdp1970      |     gdp1971      |     gdp1972      |     gdp1973
|     gdp1974      |     gdp1975      |     gdp1976      |     gdp1977      |gdp1978      |     gdp1979
 |     gdp1980      |     gdp1981      |     gdp1982      |     gdp1983      |     gdp1984      |     gdp1985
      |     gdp1986|    gdp1987      |     gdp1988      |     gdp1989      |     gdp1990      |     gdp1991
     |     gdp1992      |     gdp1993      |     gdp1994      |     gdp1995      |     gdp1996      |
gdp1997      |     gdp1998      |     gdp1999      |     gdp2000      |     gdp2001      |     gdp2002
|     gdp2003      |gdp2004      |     gdp2005      |     gdp2006      |     gdp2007      |     gdp2008
  |     gdp2009      |     gdp2010
-------+---------------------+-------------------+-------------------+------------------+------------------
+------------------+-------------------+------------------+------------------+-------------------+---------
----------+-------------------+------------------+-------------------+-------------------+-----------------
--+------------------+-------------------+-------------------+-------------------+------------------+-------
-----------+------------------+-------------------+-------------------+------------------+------------------
-+-------------------+------------------+-------------------+-------------------+-------------------+-------
------------+--------------------+------------------+-------------------+------------------+----------------
---+-------------------+-------------------+------------------+-------------------+------------------+------
------------+------------------+------------------+------------------+------------------+------------------+
------------------+------------------+------------------+------------------+------------------+-------------
-----+------------------+------------------+------------------+------------------+------------------+-------
-----------+------------------+------------------+------------------+------------------+------------------+-
-----------------+------------------+------------------+------------------+------------------+--------------
----+------------------+------------------+------------------+------------------+------------------+--------
----------+------------------+------------------+------------------+------------------+------------------
0.886 | Belgium             |  18585.6613572407 | -16145.6374560074 |  26938.956253415 | 8094.30475779595 |
 12073.5461203817 | -11069.0567600181 | 19133.8584911727|   5500.312894949 | -4227.94863799987 |  6265.77925410752
|  -10884.749295608 | 30929.4669575201 | -7831.49439429977 |  3235.81760508742 | -22765.9285442662 | 27200
.6767714485 | -10554.9550160917 |   1169.4144482273 | -16783.7961289161 | 27932.2660829329 | 17227.9083196848
| 13956.0524012749 | -40175.6286481088 | -10889.4785920499 | 22703.6576872859 | -14635.5832197402 |
2857.12270512168 | 20473.5044214494 | -52199.4895696423 | -11038.7346460738 |  18466.7298633088 | -17410.4225137703 |
-3475.63826305462 | 29305.6753822341 |   1242.5724942049 | 17491.0096310849 | -12609.9984515902 | -17909.3603476248
|  6276.58431412381 | 21851.9475485178 | -2614.33738160397 | 3777.74134131349 | 4522.08854282736 | 4251.90446379366
| 4512.15101396876 | 4265.49424538129 | 5190.06845330997 | 4543.80444817989 | 5639.81122679089 | 4420.44705213467
|  5658.8820279283 | 5172.69025294376 | 5019.63640408663 | 5938.84979495903 | 4976.57073629812 | 4710.49525137591
| 6523.65700286465 | 5067.82520773578 | 6789.13070219317 | 5525.94643553563 | 6894.68336419297 | 5961.58442474331
| 5661.21093840818 | 7721.56088518218 |  5959.7301109143 | 6453.43604137202 | 6739.39384033096 | 7517.97645468455
| 6907.49136910647 | 7049.03921764209 | 7726.49091035527 | 8552.65909911844 | 7963.94487647115 | 7187.45827585515
| 7994.02955410523 | 9532.89844418041 | 7962.25713582666 | 7846.68238907624 | 10230.9878908643 | 8642.76044946519
| 8886.79860331866 |  8718.3731386891
...
(96 rows)

10 - APPLY_SVD

Transforms the data using an SVD model.

Transforms the data using an SVD model. This computes the matrix U of the SVD decomposition.

Syntax

APPLY_SVD ( input-columns
        USING PARAMETERS model_name = 'model-name'
            [, num_components = num-components]
            [, cutoff = cutoff-value]
            [, match_by_pos = match-by-position]
            [, exclude_columns = 'excluded-columns']
            [, key_columns = 'key-columns'] )

Arguments

input-columns

Comma-separated list of columns that contain the data matrix, or asterisk (*) to select all columns. The following requirements apply:

All columns must be a numeric data type.
Enclose the column name in double quotes if it contains special characters.

Parameters

model_name

Name of the model (case-insensitive).

num_components

The number of components to keep in the model. This is the number of output columns that will be generated. If neither this parameter nor the cutoff parameter is provided, all components from the model are kept.

cutoff

Set to 1, specifies the minimum accumulated explained variance. Components are taken until the accumulated explained variance reaches this value. If you omit this parameter and the num_components parameter, all model components are kept.

match_by_pos

Boolean value that specifies how input columns are matched to model columns:

false (default): Match by name.
true: Match by the position of columns in the input columns list.

exclude_columns

Comma-separated list of column names from input-columns to exclude from processing.

key_columns

Comma-separated list of column names from input-columns that identify its data rows. These columns are included in the output table.

Examples

=> SELECT SVD ('svdmodel', 'small_svd', 'x1,x2,x3,x4');
SVD
--------------------------------------------------------------
Finished in 1 iterations.
Accepted Rows: 8  Rejected Rows: 0
(1 row)

=> CREATE TABLE transform_svd AS SELECT
     APPLY_SVD (id, x1, x2, x3, x4 USING PARAMETERS model_name='svdmodel', exclude_columns='id', key_columns='id')
     OVER () FROM small_svd;
CREATE TABLE

=> SELECT * FROM transform_svd;
id  |       col1        |        col2         |        col3         |        col4
----+-------------------+---------------------+---------------------+--------------------
4   |  0.44849499240202 |  -0.347260956311326 |   0.186958376368345 |  0.378561270493651
6   |  0.17652411036246 | -0.0753183783382909 |  -0.678196192333598 | 0.0567124770173372
1   | 0.494871802886819 |   0.161721379259287 |  0.0712816417153664 | -0.473145877877408
2   |  0.17652411036246 | -0.0753183783382909 |  -0.678196192333598 | 0.0567124770173372
3   | 0.150974762654569 |   0.589561842046029 | 0.00392654610109522 |  0.360011163271921
5   | 0.494871802886819 |   0.161721379259287 |  0.0712816417153664 | -0.473145877877408
8   |  0.44849499240202 |  -0.347260956311326 |   0.186958376368345 |  0.378561270493651
7   | 0.150974762654569 |   0.589561842046029 | 0.00392654610109522 |  0.360011163271921
(8 rows)

=> SELECT APPLY_INVERSE_SVD (* USING PARAMETERS model_name='svdmodel', exclude_columns='id',
key_columns='id') OVER () FROM transform_svd;
id  |        x1        |        x2        |        x3        |        x4
----+------------------+------------------+------------------+------------------
4 | 91.4056627665577 | 44.7629617207482 | 83.1704961993117 | 38.9274292265543
6 | 20.6468626294368 | 9.30974906868751 | 8.71006863405534 |  6.5855928603967
7 | 31.2494347777156 | 20.6336519003026 | 27.5668287751507 | 5.84427645886865
1 |  107.93376580719 | 51.6980548011917 | 97.9665796560552 | 40.4918236881051
2 | 20.6468626294368 | 9.30974906868751 | 8.71006863405534 |  6.5855928603967
3 | 31.2494347777156 | 20.6336519003026 | 27.5668287751507 | 5.84427645886865
5 |  107.93376580719 | 51.6980548011917 | 97.9665796560552 | 40.4918236881051
8 | 91.4056627665577 | 44.7629617207482 | 83.1704961993117 | 38.9274292265543
(8 rows)

11 - PREDICT_ARIMA

Applies an autoregressive integrated moving average (ARIMA) model to an input relation or makes predictions using the in-sample data.

Applies an autoregressive integrated moving average (ARIMA) model to an input relation or makes predictions using the in-sample data. ARIMA models make predictions based on preceding time series values and errors of previous predictions. The function, by default, returns the predicted values plus the mean of the model.

Behavior type

Immutable

Syntax

Apply to an input relation:

PREDICT_ARIMA ( timeseries-column
        USING PARAMETERS param=value[,...] )
        OVER (ORDER BY timestamp-column)
        FROM input-relation

Make predictions using the in-sample data:

PREDICT_ARIMA ( USING PARAMETERS model_name = 'ARIMA-model'
        [, start = prediction-start ]
        [, npredictions = num-predictions ]
        [, output_standard_errors = boolean ] )
        OVER ()

Arguments

timeseries-column: Name of a NUMERIC column in input-relation used to make predictions.
timestamp-column: Name of an INTEGER, FLOAT, or TIMESTAMP column in input-relation that represents the timestamp variable. The timestep between consecutive entries should be consistent throughout the timestamp-column.
input-relation: Input relation containing timeseries-column and timestamp-column.

Parameters

model_name

Name of a trained ARIMA model.

start

The behavior of the start parameter and its range of accepted values depends on whether you provide a timeseries-column:

No provided timeseries-column: start must be an integer ≥0, where zero indicates to start prediction at the end of the in-sample data. If start is a positive value, the function predicts the values between the end of the in-sample data and the start index, and then uses the predicted values as time series inputs for the subsequent npredictions.
timeseries-column provided: start must be an integer ≥1 and identifies the index (row) of the timeseries-column at which to begin prediction. If the start index is greater than the number of rows, N, in the input data, the function predicts the values between N and start and uses the predicted values as time series inputs for the subsequent npredictions.

Default:

No provided timeseries-column: prediction begins from the end of the in-sample data.
timeseries-column provided: prediction begins from the end of the provided input data.

npredictions

Integer ≥1, the number of predicted timesteps.

Default: 10

missing

Methods for handling missing values, one of the following strings:

'drop': Missing values are ignored.
'error': Missing values raise an error.
'zero': Missing values are replaced with 0.
'linear_interpolation': Missing values are replaced by linearly-interpolated values based on the nearest valid entries before and after the missing value. If all values before or after a missing value in the prediction range are missing or invalid, interpolation is impossible and the function errors.

Default: Method used when training the model

add_mean

Boolean, whether to add the model mean to the predicted value.

Default: True

output_standard_errors

Boolean, whether to return estimates of the standard error of each prediction.

Default: False

Examples

The following example makes predictions using the in-sample data that the arima_temp model was trained on:

=> SELECT PREDICT_ARIMA(USING PARAMETERS model_name='arima_temp', npredictions=10) OVER();
 index |    prediction
-------+------------------
| 12.9794640462952
| 13.3759980774506
| 13.4596213753292
| 13.4670492239575
| 13.4559956810351
| 13.4405315951159
|  13.424086943584
| 13.4074973032696
| 13.3909657020137
|  13.374540947803
(10 rows)

You can also apply the model to an input relation:

=> SELECT PREDICT_ARIMA(temperature USING PARAMETERS model_name='arima_temp', start=100, npredictions=10) OVER(ORDER BY time) FROM temp_data;
 index |    prediction
-------+------------------
| 15.0373821404594
| 13.4707358943239
| 10.5714574755414
| 13.1957213344543
| 13.5606204019976
| 13.1604413418938
| 13.3998222399722
| 12.6110939669533
| 12.9015211253485
| 13.2382768006631
(10 rows)

For an in-depth example that trains and makes predictions with an ARIMA model, see ARIMA model example.

12 - PREDICT_AUTOREGRESSOR

Applies an autoregressor (AR) model to an input relation.

Applies an autoregressor (AR) or vector autoregression (VAR) model to an input relation. The function returns predictions for each value column specified during model creation.

AR and VAR models use previous values to make predictions. During model training, the user specifies the number of lagged timesteps taken into account during computation. The model predicts future values as a linear combination of the timeseries values at each lag.

Syntax

PREDICT_AUTOREGRESSOR ( timeseries-columns
        USING PARAMETERS model_name = 'model-name' [, param=value[,...] ] )
        OVER (ORDER BY timestamp-column)
        FROM input-relation

Note

The following argument, as written, is required and cannot be omitted nor substituted with another type of clause.

OVER (ORDER BY timestamp-column)

Arguments

timeseries-columns: The timeseries columns used to make predictions. The number of timeseries-columns must be the same as the number of value columns provided during model training.
For each prediction, the model only considers the previous P values of each column, where P is the lag set during model creation.
timestamp-column: The timestamp column, with consistent timesteps, used to make the prediction.
input-relation: The input relation containing the timeseries-columns and timestamp-column.
The input-relation cannot have missing values in any of the P rows preceding start, where P is the lag set during model creation. To handle missing values, see IMPUTE or Linear interpolation.

Parameters

model_name

Name of the model (case-insensitive).

start

INTEGER >p or ≤0, the index (row) of the input-relation at which to start the prediction. If omitted, the prediction starts at the end of the input-relation.

If the start index is greater than the number of rows N in timeseries-columns, then the values between N and start are predicted and used for the prediction.

If negative, the start index is identified by counting backwards from the end of the input-relation.

For an input-relation of N rows, negative values have a lower limit of either -1000 or -(N-p), whichever is greater.

Default: the end of input-relation

npredictions

INTEGER ≥1, the number of predicted timesteps.

Default: 10

missing

One of the following methods for handling missing values:

drop: Missing values are ignored.
error: Missing values raise an error.
zero: Missing values are replaced with 0.
linear_interpolation: Missing values are replaced by linearly-interpolated values based on the nearest valid entries before and after the missing value. If all values before or after a missing value in the prediction range are missing or invalid, interpolation is impossible and the function errors. VAR models do not support linear interpolation.

Default: Method used when training the model

Examples

The following example makes predictions using an AR model for 10 timesteps after the end of the input relation:

=> SELECT PREDICT_AUTOREGRESSOR(Temperature USING PARAMETERS model_name='AR_temperature', npredictions=10) 
  OVER(ORDER BY time) FROM temp_data;
 index |    prediction
-------+------------------
     1 | 12.6235419917807
     2 | 12.9387860506032
     3 | 12.6683380680058
     4 | 12.3886937385419
     5 | 12.2689506237424
     6 | 12.1503023330142
     7 | 12.0211734746741
     8 | 11.9150531529328
     9 | 11.825870404008
    10 | 11.7451846722395
(10 rows)

The following example makes predictions using a VAR model for 10 timesteps after the end of the input relation:

=> SELECT PREDICT_AUTOREGRESSOR(temp_location1, temp_location2 USING PARAMETERS 
  model_name='VAR_temperature', npredictions=10) OVER(ORDER BY time) FROM temp_data_VAR;  
 index |  temp_location1  |  temp_location2
-------+------------------+------------------
     1 | 11.7583950082813 | 12.7193948948294
     2 | 11.8492948294829 | 12.2400294852222
     3 | 11.9917382847772 | 13.8385038582000
     4 | 11.2302988673747 | 13.1174827497563
     5 | 11.8920481717273 | 13.1948776593788
     6 | 12.1737583757385 | 12.7362846366622
     7 | 12.0397364321183 | 12.9274628462844
     8 | 12.0395726450372 | 12.1749275028444
     9 | 11.8249947849488 | 12.3274926927433
    10 | 11.1129497288422 | 12.1749274927493
(10 rows)

See Autoregressive model example and VAR model example for extended examples.

13 - PREDICT_LINEAR_REG

Applies a linear regression model on an input relation and returns the predicted value as a FLOAT.

Syntax

PREDICT_LINEAR_REG ( input-columns
        USING PARAMETERS model_name = 'model-name' [, match_by_pos = match-by-position] )

Arguments

input-columns: Comma-separated list of columns to use from the input relation, or asterisk (*) to select all columns.

Parameters

model_name

Name of the model (case-insensitive).

match_by_pos

Boolean value that specifies how input columns are matched to model features:

true: Match by the position of columns in the input columns list.
false (default): Match by name.

Examples

=> SELECT PREDICT_LINEAR_REG(waiting USING PARAMETERS model_name='myLinearRegModel')FROM
faithful ORDER BY id;


 PREDICT_LINEAR_REG
--------------------
   4.15403481386324
   2.18505296804024
   3.76023844469864
    2.8151271587036
   4.62659045686076
   2.26381224187316
   4.86286827835952
   4.62659045686076
   1.94877514654148
   4.62659045686076
   2.18505296804024
...
 (272 rows)

The following example shows how to use the PREDICT_LINEAR_REG function on an input table, using the match_by_pos parameter. Note that you can replace the column argument with a constant that does not match an input column:

=> SELECT PREDICT_LINEAR_REG(55 USING PARAMETERS model_name='linear_reg_faithful',
                     match_by_pos='true')FROM faithful ORDER BY id;
 PREDICT_LINEAR_REG
--------------------
2.28552115094171
2.28552115094171
2.28552115094171
2.28552115094171
2.28552115094171
2.28552115094171
2.28552115094171
...
 (272 rows)

14 - PREDICT_LOGISTIC_REG

Applies a logistic regression model on an input relation.

PREDICT_LOGISTIC_REG returns as a FLOAT the predicted class or the probability of the predicted class, depending on how the type parameter is set. You can cast the return value to INTEGER or another numeric type when the return is in the probability of the predicted class.

Syntax

PREDICT_LOGISTIC_REG ( input-columns
        USING PARAMETERS model_name = 'model-name'
            [, type = 'prediction-type']
            [, cutoff = probability-cutoff]
            [, match_by_pos = match-by-position] )

Arguments

input-columns: Comma-separated list of columns to use from the input relation, or asterisk (*) to select all columns.

Parameters

model_name

Name of the model (case-insensitive).

type

Type of prediction for logistic regression, one of the following:

response (default): Predicted values are 0 or 1.
probability: Output is the probability of the predicted category to be 1.

cutoff

Used in conjunction with the type parameter, a FLOAT between 0 and 1, exclusive. When type is set to response, the returned value of prediction is 1 if its corresponding probability is greater than or equal to the value of cutoff; otherwise, it is 0.

Default: 0.5

match_by_pos

Boolean value that specifies how input columns are matched to model features:

true: Match by the position of columns in the input columns list.
false (default): Match by name.

Examples

=> SELECT car_model,
            PREDICT_LOGISTIC_REG(mpg, cyl, disp, drat, wt, qsec, vs, gear, carb
                                 USING PARAMETERS model_name='myLogisticRegModel')
          FROM mtcars;
      car_model      | PREDICT_LOGISTIC_REG
---------------------+----------------------
 Camaro Z28          |                    0
 Fiat 128            |                    1
 Fiat X1-9           |                    1
 Ford Pantera L      |                    1
 Merc 450SE          |                    0
 Merc 450SL          |                    0
 Toyota Corona       |                    0
 AMC Javelin         |                    0
 Cadillac Fleetwood  |                    0
 Datsun 710          |                    1
 Dodge Challenger    |                    0
 Hornet 4 Drive      |                    0
 Lotus Europa        |                    1
 Merc 230            |                    0
 Merc 280            |                    0
 Merc 280C           |                    0
 Merc 450SLC         |                    0
 Pontiac Firebird    |                    0
 Porsche 914-2       |                    1
 Toyota Corolla      |                    1
 Valiant             |                    0
 Chrysler Imperial   |                    0
 Duster 360          |                    0
 Ferrari Dino        |                    1
 Honda Civic         |                    1
 Hornet Sportabout   |                    0
 Lincoln Continental |                    0
 Maserati Bora       |                    1
 Mazda RX4           |                    1
 Mazda RX4 Wag       |                    1
 Merc 240D           |                    0
 Volvo 142E          |                    1
(32 rows)

The following example shows how to use PREDICT_LOGISTIC_REG on an input table, using the match_by_pos parameter. Note that you can replace any of the column inputs with a constant that does not match an input column. In this example, column mpg was replaced with the constant 20:

=> SELECT car_model,
            PREDICT_LOGISTIC_REG(20, cyl, disp, drat, wt, qsec, vs, gear, carb
                                 USING PARAMETERS model_name='myLogisticRegModel', match_by_pos='true')
          FROM mtcars;
      car_model     | PREDICT_LOGISTIC_REG
--------------------+----------------------
AMC Javelin         |                    0
Cadillac Fleetwood  |                    0
Camaro Z28          |                    0
Chrysler Imperial   |                    0
Datsun 710          |                    1
Dodge Challenger    |                    0
Duster 360          |                    0
Ferrari Dino        |                    1
Fiat 128            |                    1
Fiat X1-9           |                    1
Ford Pantera L      |                    1
Honda Civic         |                    1
Hornet 4 Drive      |                    0
Hornet Sportabout   |                    0
Lincoln Continental |                    0
Lotus Europa        |                    1
Maserati Bora       |                    1
Mazda RX4           |                    1
Mazda RX4 Wag       |                    1
Merc 230            |                    0
Merc 240D           |                    0
Merc 280            |                    0
Merc 280C           |                    0
Merc 450SE          |                    0
Merc 450SL          |                    0
Merc 450SLC         |                    0
Pontiac Firebird    |                    0
Porsche 914-2       |                    1
Toyota Corolla      |                    1
Toyota Corona       |                    0
Valiant             |                    0
Volvo 142E          |                    1
(32 rows)

15 - PREDICT_MOVING_AVERAGE

Applies a moving-average (MA) model, created by MOVING_AVERAGE, to an input relation.

Moving average models use the errors of previous predictions to make future predictions. More specifically, the user-specified "lag" determines how many previous predictions and errors it takes into account during computation.

Syntax

PREDICT_MOVING_AVERAGE ( timeseries-column
        USING PARAMETERS
            model_name = 'model-name'
            [, start = starting-index]
            [, npredictions = npredictions]
            [, missing = "imputation-method" ] )
        OVER (ORDER BY timestamp-column)
        FROM input-relation

Note

The following argument, as written, is required and cannot be omitted nor substituted with another type of clause.

OVER (ORDER BY timestamp-column)

Arguments

timeseries-column: The timeseries column used to make the prediction (only the last q values, specified during model creation, are used).
timestamp-column: The timestamp column, with consistent timesteps, used to make the prediction.
input-relation: The input relation containing the timeseries-column and timestamp-column.
Note that input-relation cannot have missing values in any of the q (set during training) rows preceding start. To handle missing values, see IMPUTE or Linear interpolation.

Parameters

model_name

Name of the model (case-insensitive).

start

INTEGER >q or ≤0, the index (row) of the input-relation at which to start the prediction. If omitted, the prediction starts at the end of the input-relation.

If the start index is greater than the number of rows N in timeseries-column, then the values between N and start are predicted and used for the prediction.

If negative, the start index is identified by counting backwards from the end of the input-relation.

For an input-relation of N rows, negative values have a lower limit of either -1000 or -(N-q), whichever is greater.

Default: the end of input-relation

npredictions

INTEGER ≥1, the number of predicted timesteps.

Default: 10

missing

One of the following methods for handling missing values:

drop: Missing values are ignored.
error: Missing values raise an error.
zero: Missing values are replaced with 0.
linear_interpolation: Missing values are replaced by linearly-interpolated values based on the nearest valid entries before and after the missing value. If all values before or after a missing value in the prediction range are missing or invalid, interpolation is impossible and the function errors.

Default: Method used when training the model

Examples

See Moving-average model example.

16 - PREDICT_NAIVE_BAYES

Applies a Naive Bayes model on an input relation.

Depending on how the type parameter is set, PREDICT_NAIVE_BAYES returns a VARCHAR that specifies either the predicted class or probability of the predicted class. If the function returns probability, you can cast the return value to an INTEGER or another numeric data type.

Syntax

PREDICT_NAIVE_BAYES ( input-columns
        USING PARAMETERS model_name = 'model-name'
            [, type = ' return-type ']
            [, class = 'user-input-class']
            [, match_by_pos = match-by-position] )

Arguments

input-columns: Comma-separated list of columns to use from the input relation, or asterisk (*) to select all columns.

Parameters

model_name

Name of the model (case-insensitive).

type

One of the following:

response (default): Returns the class with the highest probability.
probability: Valid only if class parameter is set, returns the probability of belonging to the specified class argument.

class

Required if type parameter is set to probability. If you omit this parameter, PREDICT_NAIVE_BAYES returns the class that it predicts as having the highest probability.

match_by_pos

Boolean value that specifies how input columns are matched to model features:

true: Match by the position of columns in the input columns list.
false (default): Match by name.

Examples

=> SELECT party, PREDICT_NAIVE_BAYES (vote1, vote2, vote3
                                        USING PARAMETERS model_name='naive_house84_model',
                                                         type='response')
                                        AS Predicted_Party
                                        FROM house84_test;
   party    | Predicted_Party
------------+-----------------
 democrat   | democrat
 democrat   | democrat
 democrat   | democrat
 republican | republican
 democrat   | democrat
 democrat   | democrat
 democrat   | democrat
 democrat   | democrat
 democrat   | democrat
 republican | republican
 democrat   | democrat
 democrat   | democrat
 democrat   | democrat
 democrat   | republican
 republican | republican
 democrat   | democrat
 republican | republican
...
(99 rows)

17 - PREDICT_NAIVE_BAYES_CLASSES

Applies a Naive Bayes model on an input relation and returns the probabilities of classes:.

Applies a Naive Bayes model on an input relation and returns the probabilities of classes:

VARCHAR predicted column contains the class label with the highest probability.
Multiple FLOAT columns, where the first probability column contains the probability for the class specified in the predicted column. Other columns contain the probability of belonging to each class specified in the classes parameter.

Syntax

PREDICT_NAIVE_BAYES_CLASSES ( predictor-columns
        USING PARAMETERS model_name = 'model-name'
            [, key_columns = 'key-columns']
            [, exclude_columns = 'excluded-columns]
            [, classes = 'classes']
            [, match_by_pos = match-by-position] )
OVER( [window-partition-clause] )

Arguments

predictor-columns: Comma-separated list of columns to use from the input relation, or asterisk (*) to select all columns.

Parameters

model_name

Name of the model (case-insensitive).

key_columns

Comma-separated list of predictor column names that identify the output rows. To exclude these and other predictor columns from being used for prediction, include them in the argument list for parameter exclude_columns.

exclude_columns

Comma-separated list of columns from predictor-columns to exclude from processing.

classes

Comma-separated list of class labels in the model. The probability of belonging to this given class as predicted by the classifier. The values are case sensitive.

match_by_pos

Boolean value that specifies how predictor columns are matched to model features:

false (default): Match by name.
true: Match by the position of columns in the predictor columns list.

Examples

=> SELECT PREDICT_NAIVE_BAYES_CLASSES (id, vote1, vote2 USING PARAMETERS
model_name='naive_house84_model',key_columns='id',exclude_columns='id',
classes='democrat, republican', match_by_pos='false')
        OVER() FROM house84_test;
 id  | Predicted  |    Probability    |     democrat      |    republican
-----+------------+-------------------+-------------------+-------------------
  21 | democrat   | 0.775473383353576 | 0.775473383353576 | 0.224526616646424
  28 | democrat   | 0.775473383353576 | 0.775473383353576 | 0.224526616646424
  83 | republican | 0.592510497724379 | 0.407489502275621 | 0.592510497724379
 102 | democrat   | 0.779889432167111 | 0.779889432167111 | 0.220110567832889
 107 | republican | 0.598662714551597 | 0.401337285448403 | 0.598662714551597
 125 | republican | 0.598662714551597 | 0.401337285448403 | 0.598662714551597
 132 | republican | 0.592510497724379 | 0.407489502275621 | 0.592510497724379
 136 | republican | 0.592510497724379 | 0.407489502275621 | 0.592510497724379
 155 | republican | 0.598662714551597 | 0.401337285448403 | 0.598662714551597
 174 | republican | 0.592510497724379 | 0.407489502275621 | 0.592510497724379
...
(1 row)

18 - PREDICT_PLS_REG

Applies a PLS regression model on an input relation and returns the predicted values.

Applies a PLS_REG model to an input relation and returns a predicted FLOAT value for each row in the input relation.

Syntax

PREDICT_PLS_REG ( input-columns
        USING PARAMETERS param=value[,...] )

Arguments

input-columns: Comma-separated list of predictor columns to use from the input relation or asterisk (*) to select all columns.

Parameters

model_name

Name of the model (case-insensitive).

match_by_pos

Boolean value that specifies how input columns are matched to model features:

true: Match by the position of columns in the input columns list.
false (default): Match by name.

Examples

The following example uses the monarch_pls model to make predictions on the monarch_test input relation:

=> SELECT PREDICT_PLS_REG (* USING PARAMETERS model_name='monarch_pls') FROM monarch_test;
 PREDICT_PLS_REG
------------------
 2.88462577469318
 2.86535009598611
 2.84138719904564
  2.7222022770597
 3.96163608455087
 3.30690898656628
 2.99904802221049
 (7 rows)

For an in-depth example, see PLS regression.

19 - PREDICT_PMML

Applies an imported PMML model on an input relation.

Applies an imported PMML model on an input relation. The function returns the result that would be expected for the model type encoded in the PMML model.

PREDICT_PMML returns NULL in the following cases:

The predictor is an invalid or NULL value.
The categorical predictor is of an unknown class.

Note

PREDICT_PMML returns values of complex type ROW for models that use the Output tag. Currently, Vertica does not support directly inserting this data into a table.

You can work around this limitation by changing the output to JSON with TO_JSON before inserting it into a table:

=> CREATE TABLE predicted_output AS SELECT TO_JSON(PREDICT_PMML(X1,X2,X3
USING PARAMETERS model_name='pmml_imported_model'))
AS predicted_value
FROM input_table;

Syntax

PREDICT_PMML ( input-columns
        USING PARAMETERS model_name = 'model-name' [, match_by_pos = match-by-position] )

Arguments

input-columns: Comma-separated list of columns to use from the input relation, or asterisk (*) to select all columns.

Parameters

model_name

Name of the model (case-insensitive). For a list of supported PMML model types and tags, see PMML features and attributes.

match_by_pos

Boolean value that specifies how input columns are matched to model features:

true: Match by the position of columns in the input columns list.
false (default): Match by name.

Examples

In this example, the function call uses all the columns from the table as predictors and predicts the value using the 'my_kmeans' model in PMML format:

SELECT PREDICT_PMML(* USING PARAMETERS model_name='my_kmeans') AS predicted_label FROM table;

In this example, the function call takes only columns col1, col2 as predictors, and predicts the value for each row using the 'my_kmeans' model from schema 'my_schema':

SELECT PREDICT_PMML(col1, col2 USING PARAMETERS model_name='my_schema.my_kmeans') AS predicted_label FROM table;

In this example, the function call returns an error as neither schema nor model-name can accept * as a value:

SELECT PREDICT_PMML(* USING PARAMETERS model_name='*.*') AS predicted_label FROM table;
SELECT PREDICT_PMML(* USING PARAMETERS model_name='*') AS predicted_label FROM table;
SELECT PREDICT_PMML(* USING PARAMETERS model_name='models.*') AS predicted_label FROM table;

20 - PREDICT_POISSON_REG

Applies a Poisson regression model on an input relation and returns the predicted value as a FLOAT.

Syntax

PREDICT_POISSON_REG ( input-columns
        USING PARAMETERS model_name = 'model-name' [, match_by_pos = match-by-position] )

Arguments

input-columns: Comma-separated list of columns to use from the input relation, or asterisk (*) to select all columns.

Parameters

model_name

Name of the model (case-insensitive).

match_by_pos

Boolean value that specifies how input columns are matched to model features:

true: Match by the position of columns in the input columns list.
false (default): Match by name.

Examples

=> SELECT PREDICT_POISSON_REG(waiting USING PARAMETERS model_name='MYModel')::numeric(20,10) FROM lin.faithful ORDER BY id;
predict_poisson_reg
---------------------
4.0230080811
2.2284857176
3.5747254723
2.6921731651
4.6357580051
2.2817680621
4.9762900161
4.6357580051
2.0759884314
(9 rows)

21 - PREDICT_RF_CLASSIFIER

Applies a random forest model on an input relation.

Applies a random forest model on an input relation. PREDICT_RF_CLASSIFIER returns a VARCHAR data type that specifies one of the following, as determined by how the type parameter is set:

The predicted class (based on popular votes)
Probability of a class for each input instance.

Note

The predicted class is selected only based on the popular vote of the decision trees in the forest. Therefore, in special cases the calculated probability of the predicted class may not be the highest.

Syntax

PREDICT_RF_CLASSIFIER ( input-columns
        USING PARAMETERS model_name = 'model-name'
            [, type = 'prediction-type']
            [, class = 'user-input-class']
            [, match_by_pos = match-by-position] )

Arguments

input-columns: Comma-separated list of columns to use from the input relation, or asterisk (*) to select all columns.

Parameters

model_name

Name of the model (case-insensitive).

type

Type of prediction to return, one of the following:

response (default): The class with the highest probability among all possible classes.
probability: Valid only if the class parameter is set, returns the probability of the specified class.

class

Class to use when the type parameter is set to probability. If you omit this parameter, the function uses the predicted class—the one with the popular vote. Thus, the predict function returns the probability that the input instance belongs to its predicted class.

match_by_pos

Boolean value that specifies how input columns are matched to model features:

true: Match by the position of columns in the input columns list.
false (default): Match by name.

Examples

=> SELECT PREDICT_RF_CLASSIFIER (Sepal_Length, Sepal_Width, Petal_Length, Petal_Width
       USING PARAMETERS model_name='myRFModel') FROM iris;
PREDICT_RF_CLASSIFIER
-----------------------
setosa
setosa
setosa
...
versicolor
versicolor
versicolor
...
virginica
virginica
virginica
...
(150 rows)

This example shows how you can use the PREDICT_RF_CLASSIFIER function, using the match_by_pos parameter:

=> SELECT PREDICT_RF_CLASSIFIER (Sepal_Length, Sepal_Width, Petal_Length, Petal_Width
                                USING PARAMETERS model_name='myRFModel', match_by_pos='true') FROM iris;
PREDICT_RF_CLASSIFIER
-----------------------
setosa
setosa
setosa
...
versicolor
versicolor
versicolor
...
virginica
virginica
virginica
...
(150 rows)

22 - PREDICT_RF_CLASSIFIER_CLASSES

Applies a random forest model on an input relation and returns the probabilities of classes:.

Applies a random forest model on an input relation and returns the probabilities of classes:

VARCHAR predicted column contains the class label with the highest vote (popular vote).
Multiple FLOAT columns, where the first probability column contains the probability for the class reported in the predicted column. Other columns contain the probability of each class specified in the classes parameter.
Key columns with the same value and data type as matching input columns specified in parameter key_columns.

Note

Selection of the predicted class is based on the popular vote of decision trees in the forest. Thus, in special cases the calculated probability of the predicted class might not be the highest.

Syntax

PREDICT_RF_CLASSIFIER_CLASSES ( predictor-columns
        USING PARAMETERS model_name = 'model-name'
            [, key_columns = 'key-columns']
            [, exclude_columns = 'excluded-columns']
            [, classes = 'classes']
            [, match_by_pos = match-by-position] )
OVER( [window-partition-clause] )

Arguments

predictor-columns: Comma-separated list of columns to use from the input relation, or asterisk (*) to select all columns.

Parameters

model_name

Name of the model (case-insensitive).

key_columns

exclude_columns

Comma-separated list of columns from predictor-columns to exclude from processing.

classes

Comma-separated list of class labels in the model. The probability of belonging to this given class is predicted by the classifier. Values are case sensitive.

match_by_pos

Boolean value that specifies how predictor columns are matched to model features:

false (default): Match by name.
true: Match by the position of columns in the predictor columns list.

Examples

=> SELECT PREDICT_RF_CLASSIFIER_CLASSES(Sepal_Length, Sepal_Width, Petal_Length, Petal_Width
                               USING PARAMETERS model_name='myRFModel') OVER () FROM iris;
predicted  |    probability
-----------+-------------------
setosa     |                 1
setosa     |              0.99
setosa     |                 1
setosa     |                 1
setosa     |                 1
setosa     |              0.97
setosa     |                 1
setosa     |                 1
setosa     |                 1
setosa     |                 1
setosa     |              0.99
...
(150 rows)

This example shows how to use function PREDICT_RF_CLASSIFIER_CLASSES, using the match_by_pos parameter:

=> SELECT PREDICT_RF_CLASSIFIER_CLASSES(Sepal_Length, Sepal_Width, Petal_Length, Petal_Width
                          USING PARAMETERS model_name='myRFModel', match_by_pos='true') OVER () FROM iris;
predicted  |    probability
-----------+-------------------
setosa     |                 1
setosa     |                 1
setosa     |                 1
setosa     |                 1
setosa     |                 1
setosa     |                 1
setosa     |                 1
setosa     |                 1
setosa     |                 1
setosa     |                 1
setosa     |                 1
...
(150 rows)s

23 - PREDICT_RF_REGRESSOR

Applies a random forest model on an input relation, and returns with a FLOAT data type that specifies the predicted value of the random forest model—the average of the prediction of the trees in the forest.

Syntax

PREDICT_RF_REGRESSOR ( input-columns
        USING PARAMETERS model_name = 'model-name' [, match_by_pos = match-by-position] )

Arguments

input-columns: Comma-separated list of columns to use from the input relation, or asterisk (*) to select all columns.

Parameters

model_name

Name of the model (case-insensitive).

match_by_pos

Boolean value that specifies how input columns are matched to model features:

true: Match by the position of columns in the input columns list.
false (default): Match by name.

Examples

=> SELECT PREDICT_RF_REGRESSOR (mpg,cyl,hp,drat,wt
USING PARAMETERS model_name='myRFRegressorModel')FROM mtcars;
PREDICT_RF_REGRESSOR
----------------------
2.94774203574204
2.6954087024087
2.6954087024087
2.89906346431346
2.97688489288489
2.97688489288489
2.7086587024087
2.92078965478965
2.97688489288489
2.7086587024087
2.95621822621823
2.82255155955156
2.7086587024087
2.7086587024087
2.85650394050394
2.85650394050394
2.97688489288489
2.95621822621823
2.6954087024087
2.6954087024087
2.84493251193251
2.97688489288489
2.97688489288489
2.8856467976468
2.6954087024087
2.92078965478965
2.97688489288489
2.97688489288489
2.7934087024087
2.7934087024087
2.7086587024087
2.72469441669442
(32 rows)

24 - PREDICT_SVM_CLASSIFIER

Uses an SVM model to predict class labels for samples in an input relation, and returns the predicted value as a FLOAT data type.

Syntax

PREDICT_SVM_CLASSIFIER (input-columns
     USING PARAMETERS model_name = 'model-name'
            [, match_by_pos = match-by-position]
            [, type = 'return-type']
            [, cutoff = 'cutoff-value'] ] )

Arguments

input-columns: Comma-separated list of columns to use from the input relation, or asterisk (*) to select all columns.

Parameters

model_name

Name of the model (case-insensitive).

match_by_pos

Boolean value that specifies how input columns are matched to model features:

true: Match by the position of columns in the input columns list.
false (default): Match by name.

type

A string that specifies the output to return for each input row, one of the following:

response: Outputs the predicted class of 0 or 1.
probability: Outputs a value in the range (0,1), the prediction score transformed using the logistic function.

cutoff

Valid only if the type parameter is set to probability, a FLOAT value that is compared to the transformed prediction score to determine the predicted class.

Default: 0

Examples


=> SELECT PREDICT_SVM_CLASSIFIER (mpg,cyl,disp,wt,qsec,vs,gear,carb
USING PARAMETERS model_name='mySvmClassModel') FROM mtcars;
PREDICT_SVM_CLASSIFIER
------------------------
0
0
1
0
0
1
1
1
1
0
0
1
0
0
1
0
0
0
0
0
0
1
1
0
0
1
1
1
1
0
0
0
(32 rows)

This example shows how to use PREDICT_SVM_CLASSIFIER on the mtcars table, using the match_by_pos parameter. In this example, column mpg was replaced with the constant 40:


=> SELECT PREDICT_SVM_CLASSIFIER (40,cyl,disp,wt,qsec,vs,gear,carb
USING PARAMETERS model_name='mySvmClassModel', match_by_pos ='true') FROM mtcars;
PREDICT_SVM_CLASSIFIER
------------------------
0
0
0
0
1
0
0
1
1
1
1
1
0
0
0
1
1
1
1
0
0
0
0
0
0
0
0
1
1
0
0
1
(32 rows)

25 - PREDICT_SVM_REGRESSOR

Uses an SVM model to perform regression on samples in an input relation, and returns the predicted value as a FLOAT data type.

Syntax

PREDICT_SVM_REGRESSOR(input-columns
        USING PARAMETERS model_name = 'model-name' [, match_by_pos = match-by-position] )

Arguments

input-columns: Comma-separated list of columns to use from the input relation, or asterisk (*) to select all columns.

Parameters

model_name

Name of the model (case-insensitive).

match_by_pos

Boolean value that specifies how input columns are matched to model features:

true: Match by the position of columns in the input columns list.
false (default): Match by name.

Examples


=> SELECT PREDICT_SVM_REGRESSOR(waiting USING PARAMETERS model_name='mySvmRegModel')
           FROM faithful ORDER BY id;
 PREDICT_SVM_REGRESSOR
--------------------
   4.06488248694445
   2.30392277646291
   3.71269054484815
   2.867429883817
   4.48751281746003
   2.37436116488217
   4.69882798271781
   4.48751281746003
   2.09260761120512
...
 (272 rows)

This example shows how you can use the PREDICT_SVM_REGRESSOR function on the faithful table, using the match_by_pos parameter. In this example, the waiting column was replaced with the constant 40:


=> SELECT PREDICT_SVM_REGRESSOR(40 USING PARAMETERS model_name='mySvmRegModel', match_by_pos='true')
           FROM faithful ORDER BY id;
 PREDICT_SVM_REGRESSOR
--------------------
   1.31778533859324
   1.31778533859324
   1.31778533859324
   1.31778533859324
   1.31778533859324
   1.31778533859324
   1.31778533859324
   1.31778533859324
   1.31778533859324
...
 (272 rows)

26 - PREDICT_TENSORFLOW

Applies a TensorFlow model on an input relation, and returns with the result expected for the encoded model type.

Syntax

PREDICT_TENSORFLOW ( input-columns
        USING PARAMETERS model_name = 'model-name' [, num_passthru_cols = 'n-first-columns-to-ignore'] )
OVER( [window-partition-clause] )

Arguments

input-columns: Comma-separated list of columns to use from the input relation, or asterisk (*) to select all columns.

Parameters

model_name: Name of the model (case-insensitive).
num_passthru_cols: Integer that specifies the number of input columns to skip.

Examples

Use PREDICT_TENSORFLOW with the num_passthru_cols parameter to skip the first two input columns:


=> SELECT PREDICT_TENSORFLOW ( pid,label,x1,x2
        USING PARAMETERS model_name='spiral_demo', num_passthru_cols=2 )
OVER(PARTITION BEST) as predicted_class FROM points;

--example output, the skipped columns are displayed as the first columns of the output
  pid  | label |         col0         |         col1
-------+-------+----------------------+----------------------
     0 |     0 |    0.990638732910156 |  0.00936129689216614
     1 |     0 |    0.999036073684692 | 0.000963933940511197
     2 |     1 |   0.0103802494704723 | 0.989619791507721

27 - PREDICT_TENSORFLOW_SCALAR

Applies a TensorFlow model on an input relation, and returns with the result expected for the encoded model type. This function supports 1D complex types as input and output.

Applies an imported TensorFlow model on an input relation. This function, unlike PREDICT_TENSORFLOW, accepts one input column of type ROW, where each field corresponds to an input tensor, and returns one output column of type ROW, where each field corresponds to an output tensor.

For details about importing TensorFlow models into Vertica, see TensorFlow integration and directory structure.

Syntax

PREDICT_TENSORFLOW_SCALAR ( inputs
        USING PARAMETERS model_name = 'model-name' )

Arguments

inputs: Input column of type ROW with fields of 1D ARRAYs that represent input tensors. These tensors can represent outputs for various input operations.

Parameters

model_name: Name of the model (case-insensitive).

Examples

This function can simplify the process for making predictions on data with many input features.

For instance, the MNIST handwritten digit classification dataset contains 784 input features for each input row, one feature for each pixel in the images of handwritten digits. The PREDICT_TENSORFLOW function requires that each of these input features are contained in a separate input column. By encapsulating these features into a single ARRAY, the PREDICT_TENSORFLOW_SCALAR function only needs a single input column of type ROW, where the pixel values are the array elements for an input field:

--Each array for the "image" field has 784 elements. 
=> SELECT * FROM mnist_train;
id |                   inputs
---+---------------------------------------------
 1 | {"image":[0, 0, 0,..., 244, 222, 210,...]}
 2 | {"image":[0, 0, 0,..., 185, 84, 223,...]}
 3 | {"image":[0, 0, 0,..., 133, 254, 78,...]}
 ...

In this case, the function output consists of a single opeartion with one tensor. The value of this field is an array of ten elements, which are all zero except for the element whose index is the predicted digit:

=> SELECT id, PREDICT_TENSORFLOW_SCALAR(inputs USING PARAMETERS model_name='tf_mnist_ct') FROM mnist_test;
 id |                   PREDICT_TENSORFLOW_SCALAR                      
----+-------------------------------------------------------------------
  1 | {"prediction:0":["0", "0", "0", "0", "1", "0", "0", "0", "0", "0"]} 
  2 | {"prediction:0":["0", "1", "0", "0", "0", "0", "0", "0", "0", "0"]} 
  3 | {"prediction:0":["0", "0", "0", "0", "0", "0", "0", "1", "0", "0"]} 
...

To view the expected input and output tensors for an imported TensorFlow model, call GET_MODEL_SUMMARY:

=> SELECT GET_MODEL_SUMMARY(USING PARAMETERS model_name='tf_mnist_ct');
                             GET_MODEL_SUMMARY
---------------------------------------------------------------------------

=============
input_tensors
=============
  name |type |dimensions
-------+-----+----------
 image |int32|  [-1,784]  


==============
output_tensors
==============
     name     | type |dimensions
--------------+------+----------
 prediction:0 |int32 | [-1,10]  

(1 row)

28 - PREDICT_XGB_CLASSIFIER

Applies an XGBoost classifier model on an input relation.

Applies an XGBoost classifier model on an input relation. PREDICT_XGB_CLASSIFIER returns a VARCHAR data type that specifies one of the following, as determined by how the type parameter is set:

The predicted class (based on probability scores)
Probability of a class for each input instance.

Syntax

PREDICT_XGB_CLASSIFIER ( input-columns
          USING PARAMETERS model_name = 'model-name'
              [, type = 'prediction-type' ]
              [, class = 'user-input-class' ]
              [, match_by_pos = 'match-by-position' ]
              [, probability_normalization = 'prob-normalization' ] )

Arguments

input-columns: Comma-separated list of columns to use from the input relation, or asterisk (*) to select all columns.

Parameters

model_name

Name of the model (case-insensitive).

type

Type of prediction to return, one of the following:

response (default): The class with the highest probability among all possible classes.
probability: Valid only if the class parameter is set, returns for each input instance the probability of the specified class or predicted class.

class

Class to use when the type parameter is set to probability. If you omit this parameter, the function uses the predicted class—the one with the highest probability score. Thus, the predict function returns the probability that the input instance belongs to the specified or predicted class.

match_by_pos

Boolean value that specifies how input columns are matched to model features:

true: Match by the position of columns in the input columns list.
false (default): Match by name.

probability_normalization

The classifier's normalization method, either softmax (multi-class classifier) or logit (binary classifier). If unspecified, the default logit function is used for normalization.

Examples

Use PREDICT_XGB_CLASSIFIER to apply the classifier to the test data:

=> SELECT PREDICT_XGB_CLASSIFIER (Sepal_Length, Sepal_Width, Petal_Length, Petal_Width
    USING PARAMETERS model_name='xgb_iris', probability_normalization='logit') FROM iris1;
PREDICT_XGB_CLASSIFIER
------------------------
setosa
setosa
setosa
.
.
.
versicolor
versicolor
versicolor
.
.
.
virginica
virginica
virginica
.
.
.

(90 rows)

See XGBoost for classification for more examples.

29 - PREDICT_XGB_CLASSIFIER_CLASSES

Applies an XGBoost classifier model on an input relation and returns the probabilities of classes:.

Applies an XGBoost classifier model on an input relation and returns the probabilities of classes:

VARCHAR predicted column contains the class label with the highest probability.
Multiple FLOAT columns, where the first probability column contains the probability for the class reported in the predicted column. Other columns contain the probability of each class specified in the classes parameter.
Key columns with the same value and data type as matching input columns specified in parameter key_columns.

All trees contribute to a predicted probability for each response class, and the highest probability class is chosen.

Syntax

PREDICT_XGB_CLASSIFIER_CLASSES ( predictor-columns)
        USING PARAMETERS model_name = 'model-name'
            [, key_columns = 'key-columns']
            [, exclude_columns = 'excluded-columns']
            [, classes = 'classes']
            [, match_by_pos = match-by-position]
            [, probability_normalization = 'prob-normalization' ] )
OVER( [window-partition-clause] )

Arguments

input-columns: Comma-separated list of columns to use from the input relation, or asterisk (*) to select all columns.

Parameters

model_name

Name of the model (case-insensitive).

key_columns

exclude_columns

Comma-separated list of columns from predictor-columns to exclude from processing.

classes

Comma-separated list of class labels in the model. The probability of belonging to each given class is predicted by the classifier. Values are case sensitive.

match_by_pos

Boolean value that specifies how predictor columns are matched to model features:

false (default): Match by name.
true: Match by the position of columns in the predictor columns list.

probability_normalization

The classifier's normalization method, either softmax (multi-class classifier) or logit (binary classifier). If unspecified, the default logit function is used for normalization.

Examples

After creating an XGBoost classifier model with XGB_CLASSIFIER, you can use PREDICT_XGB_CLASSIFIER_CLASSES to view the probability of each classification. In this example, the XGBoost classifier model "xgb_iris" is used to predict the probability that a given flower belongs to a species of iris:

=> SELECT PREDICT_XGB_CLASSIFIER_CLASSES(Sepal_Length, Sepal_Width, Petal_Length, Petal_Width
    USING PARAMETERS model_name='xgb_iris') OVER (PARTITION BEST) FROM iris1;
  predicted  |    probability
------------+-------------------
 setosa     |   0.9999650465368
 setosa     |   0.9999650465368
 setosa     |   0.9999650465368
 setosa     |   0.9999650465368
 setosa     | 0.999911552783011
 setosa     |   0.9999650465368
 setosa     |   0.9999650465368
 setosa     |   0.9999650465368
 setosa     |   0.9999650465368
 setosa     |   0.9999650465368
 setosa     |   0.9999650465368
 setosa     |   0.9999650465368
 versicolor |  0.99991871763563
 .
 .
 .
(90 rows)

You can also specify additional classes. In this example, PREDICT_XGB_CLASSIFIER_CLASSES makes the same prediction as the previous example, but also returns the probability that a flower belongs to the specified classes "virginica" and "versicolor":

=> SELECT PREDICT_XGB_CLASSIFIER_CLASSES(Sepal_Length, Sepal_Width, Petal_Length, Petal_Width
    USING PARAMETERS model_name='xgb_iris', classes='virginica,versicolor', probability_normalization='logit') OVER (PARTITION BEST) FROM iris1;
 predicted  |    probability    |      virginica       |      versicolor
------------+-------------------+----------------------+----------------------
 setosa     |   0.9999650465368 | 1.16160301545536e-05 | 2.33374330460065e-05
 setosa     |   0.9999650465368 | 1.16160301545536e-05 | 2.33374330460065e-05
 setosa     |   0.9999650465368 | 1.16160301545536e-05 | 2.33374330460065e-05
 .
 .
 .
 versicolor |  0.99991871763563 | 6.45697562080953e-05 |     0.99991871763563
 versicolor | 0.999967282051702 | 1.60052775404199e-05 |    0.999967282051702
 versicolor | 0.999648819964864 |  0.00028366342010669 |    0.999648819964864
 .
 .
 .
 virginica  | 0.999977039257386 |    0.999977039257386 | 1.13305901169304e-05
 virginica  | 0.999977085131063 |    0.999977085131063 | 1.12847163501674e-05
 virginica  | 0.999977039257386 |    0.999977039257386 | 1.13305901169304e-05
(90 rows)

30 - PREDICT_XGB_REGRESSOR

Applies an XGBoost regressor model on an input relation.

Applies an XGBoost regressor model on an input relation. PREDICT_XGB_REGRESSOR returns a FLOAT data type that specifies the predicted value by the XGBoost model: a weighted sum of contributions by each tree in the model.

Syntax

PREDICT_XGB_REGRESSOR ( input-columns
        USING PARAMETERS model_name = 'model-name' [, match_by_pos = match-by-position] )

Arguments

input-columns: Comma-separated list of columns to use from the input relation, or asterisk (*) to select all columns.

Parameters

model_name

Name of the model (case-insensitive).

match_by_pos

Boolean value that specifies how input columns are matched to model features:

true: Match by the position of columns in the input columns list.
false (default): Match by name.

Examples

See XGBoost for regression.

31 - REVERSE_NORMALIZE

Reverses the normalization transformation on normalized data, thereby de-normalizing the normalized data.

Reverses the normalization transformation on normalized data, thereby de-normalizing the normalized data. If you specify a column that is not in the specified model, REVERSE_NORMALIZE returns that column unchanged.

Syntax

REVERSE_NORMALIZE ( input-columns USING PARAMETERS model_name = 'model-name' );

Arguments

input-columns: The columns to use from the input relation, or asterisk (*) to select all columns.

Parameters

model_name: Name of the model (case-insensitive).

Examples

Use REVERSE_NORMALIZE on the hp and cyl columns in table mtcars, where hp is in normalization model mtcars_normfit, and cyl is not in the normalization model.

=> SELECT REVERSE_NORMALIZE (hp, cyl USING PARAMETERS model_name='mtcars_normfit') FROM mtcars;
hp    | cyl
------+-----
42502 |   8
58067 |   8
26371 |   4
42502 |   8
31182 |   6
32031 |   4
26937 |   4
34861 |   6
34861 |   6
50992 |   8
50992 |   8
49577 |   8
25805 |   4
18447 |   4
29767 |   6
65142 |   8
69387 |   8
14768 |   4
49577 |   8
60897 |   8
94857 |   8
31182 |   6
31182 |   6
30899 |   4
69387 |   8
49577 |   6
18730 |   4
18730 |   4
74764 |   8
17598 |   4
50992 |   8
27503 |   4
(32 rows)

Transformation functions

1 - APPLY_BISECTING_KMEANS

Note

Syntax

Arguments

Parameters

Privileges

2 - APPLY_IFOREST

Syntax

Arguments

Parameters

Privileges

Examples

See also

3 - APPLY_INVERSE_PCA

Syntax

Arguments

Parameters

Examples

See also

4 - APPLY_INVERSE_SVD

Syntax

Arguments

Parameters

Examples

See also

5 - APPLY_KMEANS

Syntax

Arguments

Parameters

Privileges

Examples

See also

6 - APPLY_KPROTOTYPES

Syntax

Arguments

Parameters

Privileges

Examples

See also

7 - APPLY_NORMALIZE

Note

Syntax

Arguments

Parameters

Examples

See also

8 - APPLY_ONE_HOT_ENCODER

Syntax

Arguments

Parameters

Important

Note

Examples

See also

9 - APPLY_PCA

Syntax

Arguments

Parameters

Examples

See also

10 - APPLY_SVD

Syntax

Arguments

Parameters

Examples

See also

11 - PREDICT_ARIMA

Behavior type

Syntax

Arguments

Parameters

Examples

See also

12 - PREDICT_AUTOREGRESSOR

Syntax

Note

Arguments

Parameters

Examples