This is the multi-page printable view of this section.
Click here to print.
Return to the regular view of this page.
Transformation functions
The machine learning API includes a set of UDx functions that transform the columns of each input row to one or more corresponding output columns.
The machine learning API includes a set of UDx functions that transform the columns of each input row to one or more corresponding output columns. These transformations follow rules that are defined in models that were created earlier. For example,
APPLY_SVD
uses an SVD model to transform input data.
Unless otherwise indicated, these functions require the following privileges for non-superusers:
In general, given an invalid input row, the return value for these functions is NULL.
1 - APPLY_BISECTING_KMEANS
Applies a trained bisecting k-means model to an input relation, and assigns each new data point to the closest matching cluster in the trained model.
Applies a trained bisecting k-means model to an input relation, and assigns each new data point to the closest matching cluster in the trained model.
Note
If the input relation is defined in Hive, use
SYNC_WITH_HCATALOG_SCHEMA
to sync the
hcatalog
schema, and then run the machine learning function.
Syntax
SELECT APPLY_BISECTING_KMEANS( 'input-columns'
USING PARAMETERS model_name = 'model-name'
[, num_clusters = 'num-clusters']
[, match_by_pos = match-by-position] ] )
Arguments
input-columns
- Comma-separated list of columns to use from the input relation, or asterisk (*) to select all columns. Input columns must be of data type numeric.
Parameters
model_name
Name of the model (case-insensitive).
num_clusters
- Integer between 1 and
k
inclusive, where k
is the number of centers in the model, specifies the number of clusters to use for prediction.
Default: Value that the model specifies for k
match_by_pos
Boolean value that specifies how input columns are matched to model features:
Privileges
Non-superusers: model owner, or USAGE privileges on the model
2 - APPLY_INVERSE_PCA
Inverts the APPLY_PCA-generated transform back to the original coordinate system.
Inverts the APPLY_PCA-generated transform back to the original coordinate system.
Syntax
APPLY_INVERSE_PCA ( input-columns
USING PARAMETERS model_name = 'model-name'
[, exclude_columns = 'excluded-columns']
[, key_columns = 'key-columns'] )
Arguments
input-columns
- Comma-separated list of columns to use from the input relation, or asterisk (*) to select all columns. The following requirements apply:
Parameters
model_name
Name of the model (case-insensitive).
exclude_columns
Comma-separated list of column names from input-columns
to exclude from processing.
key_columns
- Comma-separated list of column names from
input-columns
that identify its data rows. These columns are included in the output table.
Examples
The following example shows how to use the APPLY_INVERSE_PCA function. It shows the output for the first record.
=> SELECT PCA ('pcamodel', 'world','country,HDI,em1970,em1971,em1972,em1973,em1974,em1975,em1976,em1977,
em1978,em1979,em1980,em1981,em1982,em1983,em1984 ,em1985,em1986,em1987,em1988,em1989,em1990,em1991,em1992,
em1993,em1994,em1995,em1996,em1997,em1998,em1999,em2000,em2001,em2002,em2003,em2004,em2005,em2006,em2007,
em2008,em2009,em2010,gdp1970,gdp1971,gdp1972,gdp1973,gdp1974,gdp1975,gdp1976,gdp1977,gdp1978,gdp1979,gdp1980,
gdp1981,gdp1982,gdp1983,gdp1984,gdp1985,gdp1986,gdp1987,gdp1988,gdp1989,gdp1990,gdp1991,gdp1992,gdp1993,
gdp1994,gdp1995,gdp1996,gdp1997,gdp1998,gdp1999,gdp2000,gdp2001,gdp2002,gdp2003,gdp2004,gdp2005,gdp2006,
gdp2007,gdp2008,gdp2009,gdp2010' USING PARAMETERS exclude_columns='HDI,country');
PCA
---------------------------------------------------------------
Finished in 1 iterations.
Accepted Rows: 96 Rejected Rows: 0
(1 row)
=> CREATE TABLE worldPCA AS SELECT
APPLY_PCA (HDI,country,em1970,em1971,em1972,em1973,em1974,em1975,em1976,em1977,em1978,em1979,
em1980,em1981,em1982,em1983,em1984 ,em1985,em1986,em1987,em1988,em1989,em1990,em1991,em1992,em1993,em1994,
em1995,em1996,em1997,em1998,em1999,em2000,em2001,em2002,em2003,em2004,em2005,em2006,em2007,em2008,em2009,
em2010,gdp1970,gdp1971,gdp1972,gdp1973,gdp1974,gdp1975,gdp1976,gdp1977,gdp1978,gdp1979,gdp1980,gdp1981,gdp1982,
gdp1983,gdp1984,gdp1985,gdp1986,gdp1987,gdp1988,gdp1989,gdp1990,gdp1991,gdp1992,gdp1993,gdp1994,gdp1995,
gdp1996,gdp1997,gdp1998,gdp1999,gdp2000,gdp2001,gdp2002,gdp2003,gdp2004,gdp2005,gdp2006,gdp2007,gdp2008,
gdp2009,gdp2010 USING PARAMETERS model_name='pcamodel', exclude_columns='HDI, country', key_columns='HDI,
country',cutoff=.3)OVER () FROM world;
CREATE TABLE
=> SELECT * FROM worldPCA;
HDI | country | col1
------+---------------------+-------------------
0.886 | Belgium | 79002.2946705704
0.699 | Belize | -25631.6670012556
0.427 | Benin | -40373.4104598122
0.805 | Chile | -16805.7940082156
0.687 | China | -37279.2893141103
0.744 | Costa Rica | -19505.5631231635
0.4 | Cote d'Ivoire | -38058.2060339272
0.776 | Cuba | -23724.5779612041
0.895 | Denmark | 117325.594028813
0.644 | Egypt | -34609.9941604549
...
(96 rows)
=> SELECT APPLY_INVERSE_PCA (HDI, country, col1
USING PARAMETERS model_name = 'pcamodel', exclude_columns='HDI,country',
key_columns = 'HDI, country') OVER () FROM worldPCA;
HDI | country | em1970 | em1971 | em1972 | em1973 |
em1974 | em1975 | em1976| em1977 | em1978 | em1979
| em1980 | em1981 | em1982 | em1983 | em1984 |em1985
| em1986 | em1987 | em1988 | em1989 | em1990 | em1991
| em1992 | em1993| em1994 | em1995 | em1996 | em1997
| em1998 | em1999 | em2000 | em2001 |em2002 |
em2003 | em2004 | em2005 | em2006 | em2007 | em2008
| em2009 | em2010 | gdp1970 | gdp1971 | gdp1972 | gdp1973
| gdp1974 | gdp1975 | gdp1976 | gdp1977 |gdp1978 | gdp1979
| gdp1980 | gdp1981 | gdp1982 | gdp1983 | gdp1984 | gdp1985
| gdp1986| gdp1987 | gdp1988 | gdp1989 | gdp1990 | gdp1991
| gdp1992 | gdp1993 | gdp1994 | gdp1995 | gdp1996 |
gdp1997 | gdp1998 | gdp1999 | gdp2000 | gdp2001 | gdp2002
| gdp2003 |gdp2004 | gdp2005 | gdp2006 | gdp2007 | gdp2008
| gdp2009 | gdp2010
-------+---------------------+-------------------+-------------------+------------------+------------------
+------------------+-------------------+------------------+------------------+-------------------+---------
----------+-------------------+------------------+-------------------+-------------------+-----------------
--+------------------+-------------------+-------------------+-------------------+------------------+-------
-----------+------------------+-------------------+-------------------+------------------+------------------
-+-------------------+------------------+-------------------+-------------------+-------------------+-------
------------+--------------------+------------------+-------------------+------------------+----------------
---+-------------------+-------------------+------------------+-------------------+------------------+------
------------+------------------+------------------+------------------+------------------+------------------+
------------------+------------------+------------------+------------------+------------------+-------------
-----+------------------+------------------+------------------+------------------+------------------+-------
-----------+------------------+------------------+------------------+------------------+------------------+-
-----------------+------------------+------------------+------------------+------------------+--------------
----+------------------+------------------+------------------+------------------+------------------+--------
----------+------------------+------------------+------------------+------------------+------------------
0.886 | Belgium | 18585.6613572407 | -16145.6374560074 | 26938.956253415 | 8094.30475779595 |
12073.5461203817 | -11069.0567600181 | 19133.8584911727| 5500.312894949 | -4227.94863799987 | 6265.77925410752
| -10884.749295608 | 30929.4669575201 | -7831.49439429977 | 3235.81760508742 | -22765.9285442662 | 27200
.6767714485 | -10554.9550160917 | 1169.4144482273 | -16783.7961289161 | 27932.2660829329 | 17227.9083196848
| 13956.0524012749 | -40175.6286481088 | -10889.4785920499 | 22703.6576872859 | -14635.5832197402 |
2857.12270512168 | 20473.5044214494 | -52199.4895696423 | -11038.7346460738 | 18466.7298633088 | -17410.4225137703 |
-3475.63826305462 | 29305.6753822341 | 1242.5724942049 | 17491.0096310849 | -12609.9984515902 | -17909.3603476248
| 6276.58431412381 | 21851.9475485178 | -2614.33738160397 | 3777.74134131349 | 4522.08854282736 | 4251.90446379366
| 4512.15101396876 | 4265.49424538129 | 5190.06845330997 | 4543.80444817989 | 5639.81122679089 | 4420.44705213467
| 5658.8820279283 | 5172.69025294376 | 5019.63640408663 | 5938.84979495903 | 4976.57073629812 | 4710.49525137591
| 6523.65700286465 | 5067.82520773578 | 6789.13070219317 | 5525.94643553563 | 6894.68336419297 | 5961.58442474331
| 5661.21093840818 | 7721.56088518218 | 5959.7301109143 | 6453.43604137202 | 6739.39384033096 | 7517.97645468455
| 6907.49136910647 | 7049.03921764209 | 7726.49091035527 | 8552.65909911844 | 7963.94487647115 | 7187.45827585515
| 7994.02955410523 | 9532.89844418041 | 7962.25713582666 | 7846.68238907624 | 10230.9878908643 | 8642.76044946519
| 8886.79860331866 | 8718.3731386891
...
(96 rows)
See also
3 - APPLY_INVERSE_SVD
Transforms the data back to the original domain.
Transforms the data back to the original domain. This essentially computes the approximated version of the original data by multiplying three matrices: matrix U (input to this function), matrices S and V (stored in the model).
Syntax
APPLY_INVERSE_SVD ( 'input-columns'
USING PARAMETERS model_name = 'model-name'
[, match_by_pos = match-by-position]
[, exclude_columns = 'excluded-columns']
[, key_columns = 'key-columns'] )
Arguments
input-columns
- Comma-separated list of columns to use from the input relation, or asterisk (*) to select all columns. The following requirements apply:
Parameters
model_name
Name of the model (case-insensitive).
match_by_pos
Boolean value that specifies how input columns are matched to model features:
exclude_columns
Comma-separated list of column names from input-columns
to exclude from processing.
key_columns
- Comma-separated list of column names from
input-columns
that identify its data rows. These columns are included in the output table.
Examples
=> SELECT SVD ('svdmodel', 'small_svd', 'x1,x2,x3,x4');
SVD
--------------------------------------------------------------
Finished in 1 iterations.
Accepted Rows: 8 Rejected Rows: 0
(1 row)
=> CREATE TABLE transform_svd AS SELECT
APPLY_SVD (id, x1, x2, x3, x4 USING PARAMETERS model_name='svdmodel', exclude_columns='id', key_columns='id')
OVER () FROM small_svd;
CREATE TABLE
=> SELECT * FROM transform_svd;
id | col1 | col2 | col3 | col4
----+-------------------+---------------------+---------------------+--------------------
4 | 0.44849499240202 | -0.347260956311326 | 0.186958376368345 | 0.378561270493651
6 | 0.17652411036246 | -0.0753183783382909 | -0.678196192333598 | 0.0567124770173372
1 | 0.494871802886819 | 0.161721379259287 | 0.0712816417153664 | -0.473145877877408
2 | 0.17652411036246 | -0.0753183783382909 | -0.678196192333598 | 0.0567124770173372
3 | 0.150974762654569 | 0.589561842046029 | 0.00392654610109522 | 0.360011163271921
5 | 0.494871802886819 | 0.161721379259287 | 0.0712816417153664 | -0.473145877877408
8 | 0.44849499240202 | -0.347260956311326 | 0.186958376368345 | 0.378561270493651
7 | 0.150974762654569 | 0.589561842046029 | 0.00392654610109522 | 0.360011163271921
(8 rows)
=> SELECT APPLY_INVERSE_SVD (* USING PARAMETERS model_name='svdmodel', exclude_columns='id',
key_columns='id') OVER () FROM transform_svd;
id | x1 | x2 | x3 | x4
----+------------------+------------------+------------------+------------------
4 | 91.4056627665577 | 44.7629617207482 | 83.1704961993117 | 38.9274292265543
6 | 20.6468626294368 | 9.30974906868751 | 8.71006863405534 | 6.5855928603967
7 | 31.2494347777156 | 20.6336519003026 | 27.5668287751507 | 5.84427645886865
1 | 107.93376580719 | 51.6980548011917 | 97.9665796560552 | 40.4918236881051
2 | 20.6468626294368 | 9.30974906868751 | 8.71006863405534 | 6.5855928603967
3 | 31.2494347777156 | 20.6336519003026 | 27.5668287751507 | 5.84427645886865
5 | 107.93376580719 | 51.6980548011917 | 97.9665796560552 | 40.4918236881051
8 | 91.4056627665577 | 44.7629617207482 | 83.1704961993117 | 38.9274292265543
(8 rows)
See also
4 - APPLY_KMEANS
Assigns each row of an input relation to a cluster center from an existing k-means model.
Assigns each row of an input relation to a cluster center from an existing k-means model.
Syntax
APPLY_KMEANS ( input-columns
USING PARAMETERS model_name = 'model-name' [, match_by_pos = match-by-position] )
Arguments
input-columns
- Comma-separated list of columns to use from the input relation, or asterisk (*) to select all columns.
Parameters
model_name
Name of the model (case-insensitive).
match_by_pos
Boolean value that specifies how input columns are matched to model features:
Privileges
Non-superusers: model owner, or USAGE privileges on the model
Examples
The following example creates k-means model myKmeansModel
and applies it to input table iris1
. The call to APPLY_KMEANS
mixes column names and constants. When a constant is passed in place of a column name, the constant is substituted for the value of the column in all rows:
=> SELECT KMEANS('myKmeansModel', 'iris1', '*', 5
USING PARAMETERS max_iterations=20, output_view='myKmeansView', key_columns='id', exclude_columns='Species, id');
KMEANS
----------------------------
Finished in 12 iterations
(1 row)
=> SELECT id, APPLY_KMEANS(Sepal_Length, 2.2, 1.3, Petal_Width
USING PARAMETERS model_name='myKmeansModel', match_by_pos='true') FROM iris2;
id | APPLY_KMEANS
-----+--------------
5 | 1
10 | 1
14 | 1
15 | 1
21 | 1
22 | 1
24 | 1
25 | 1
32 | 1
33 | 1
34 | 1
35 | 1
38 | 1
39 | 1
42 | 1
...
(60 rows)
See also
5 - APPLY_NORMALIZE
A UDTF function that applies the normalization parameters saved in a model to a set of specified input columns.
A UDTF function that applies the normalization parameters saved in a model to a set of specified input columns. If any column specified in the function is not in the model, its data passes through unchanged to APPLY_NORMALIZE
.
Note
Note: If a column contains only one distinct value, APPLY_NORMALIZE
returns NaN for values in that column.
Syntax
APPLY_NORMALIZE ( input-columns USING PARAMETERS model_name = 'model-name');
Arguments
input-columns
- Comma-separated list of columns to use from the input relation, or asterisk (*) to select all columns. If you supply an asterisk,
APPLY_NORMALIZE
normalizes all columns in the model.
Parameters
model_name
Name of the model (case-insensitive).
Examples
The following example creates a model with NORMALIZE_FIT
using the wt
and hp
columns in table mtcars
, and then uses this model in successive calls to APPLY_NORMALIZE and REVERSE_NORMALIZE.
=> SELECT NORMALIZE_FIT('mtcars_normfit', 'mtcars', 'wt,hp', 'minmax');
NORMALIZE_FIT
---------------
Success
(1 row)
The following call to APPLY_NORMALIZE
specifies the hp
and cyl
columns in table mtcars
, where hp
is in the normalization model and cyl
is not in the normalization model:
=> CREATE TABLE mtcars_normalized AS SELECT APPLY_NORMALIZE (hp, cyl USING PARAMETERS model_name = 'mtcars_normfit') FROM mtcars;
CREATE TABLE
=> SELECT * FROM mtcars_normalized;
hp | cyl
--------------------+-----
0.434628975265018 | 8
0.681978798586572 | 8
0.434628975265018 | 6
1 | 8
0.540636042402827 | 8
0 | 4
0.681978798586572 | 8
0.0459363957597173 | 4
0.434628975265018 | 8
0.204946996466431 | 6
0.250883392226148 | 6
0.049469964664311 | 4
0.204946996466431 | 6
0.201413427561837 | 4
0.204946996466431 | 6
0.250883392226148 | 6
0.049469964664311 | 4
0.215547703180212 | 4
0.0353356890459364 | 4
0.187279151943463 | 6
0.452296819787986 | 8
0.628975265017668 | 8
0.346289752650177 | 8
0.137809187279152 | 4
0.749116607773852 | 8
0.144876325088339 | 4
0.151943462897526 | 4
0.452296819787986 | 8
0.452296819787986 | 8
0.575971731448763 | 8
0.159010600706714 | 4
0.346289752650177 | 8
(32 rows)
=> SELECT REVERSE_NORMALIZE (hp, cyl USING PARAMETERS model_name='mtcars_normfit') FROM mtcars_normalized;
hp | cyl
-----+-----
175 | 8
245 | 8
175 | 6
335 | 8
205 | 8
52 | 4
245 | 8
65 | 4
175 | 8
110 | 6
123 | 6
66 | 4
110 | 6
109 | 4
110 | 6
123 | 6
66 | 4
113 | 4
62 | 4
105 | 6
180 | 8
230 | 8
150 | 8
91 | 4
264 | 8
93 | 4
95 | 4
180 | 8
180 | 8
215 | 8
97 | 4
150 | 8
(32 rows)
The following call to REVERSE_NORMALIZE
also specifies the hp
and cyl
columns in table mtcars
, where hp
is in normalization model mtcars_normfit
, and cyl
is not in the normalization model.
=> SELECT REVERSE_NORMALIZE (hp, cyl USING PARAMETERS model_name='mtcars_normfit') FROM mtcars_normalized;
hp | cyl
-----------------+-----
205.000005722046 | 8
150.000000357628 | 8
150.000000357628 | 8
93.0000016987324 | 4
174.99999666214 | 8
94.9999992102385 | 4
214.999997496605 | 8
97.0000009387732 | 4
245.000006556511 | 8
174.99999666214 | 6
335 | 8
245.000006556511 | 8
62.0000002086163 | 4
174.99999666214 | 8
230.000002026558 | 8
52 | 4
263.999997675419 | 8
109.999999523163 | 6
123.000002324581 | 6
64.9999996386468 | 4
66.0000005029142 | 4
112.999997898936 | 4
109.999999523163 | 6
180.000000983477 | 8
180.000000983477 | 8
108.999998658895 | 4
109.999999523163 | 6
104.999999418855 | 6
123.000002324581 | 6
180.000000983477 | 8
66.0000005029142 | 4
90.9999999701977 | 4
(32 rows)
See also
6 - APPLY_ONE_HOT_ENCODER
A user-defined transform function (UDTF) that loads the one hot encoder model and writes out a table that contains the encoded columns.
A user-defined transform function (UDTF) that loads the one hot encoder model and writes out a table that contains the encoded columns.
Syntax
APPLY_ONE_HOT_ENCODER( input-columns
USING PARAMETERS model_name = 'model-name'
[, drop_first = 'is-first']
[, ignore_null = 'ignore']
[, separator = 'separator-character']
[, column_naming = 'name-output']
[, null_column_name = 'null-column-name'] )
Arguments
input-columns
- Comma-separated list of columns to use from the input relation, or asterisk (*) to select all columns.
Parameters
model_name
Name of the model (case-insensitive).
, stores the categories and their corresponding levels.
drop_first
- Boolean value, one of the following:
ignore_null
- Boolean value, one of the following:
separator
- The character that separates the input variable name and the indicator variable level in the output table.To avoid using any separator, set this parameter to null value.
Default: Underscore (_
)
column_naming
- Appends categorical levels to column names according to the specified method:
-
indices
(default): Uses integer indices to represent categorical levels.
-
values
/values_relaxed
: Both methods use categorical level names. If duplicate column names occur, the function attempts to disambiguate them by appending _
n
, where n
is a zero-based integer index (_0
, _1
,...).
If the function cannot produce unique column names , it handles this according to the chosen method:
Important
The following column naming rules apply if column_naming
is set to values
or values_relaxed
:
-
Input column names with more than 128 characters are truncated.
-
Column names can contain special characters.
-
If parameter ignore_null
is set to true, APPLY_ONE_HOT_ENCODER
constructs the column name from the value set in parameter null_column_name
. If this parameter is omitted, the string null
is used.
null_column_name
- The string used in naming the indicator column for null values, used only if
ignore_null
is set to false
and column_naming
is set to values
or values_relaxed
.
Default: null
Note
Note: If an input row contains a level not stored in the model, the output row columns corresponding to that categorical level are returned as null values.
Examples
=> SELECT APPLY_ONE_HOT_ENCODER(cyl USING PARAMETERS model_name='one_hot_encoder_model',
drop_first='true', ignore_null='false') FROM mtcars;
cyl | cyl_1 | cyl_2
----+-------+-------
8 | 0 | 1
4 | 0 | 0
4 | 0 | 0
8 | 0 | 1
8 | 0 | 1
8 | 0 | 1
4 | 0 | 0
8 | 0 | 1
8 | 0 | 1
4 | 0 | 0
8 | 0 | 1
6 | 1 | 0
4 | 0 | 0
4 | 0 | 0
6 | 1 | 0
6 | 1 | 0
8 | 0 | 1
8 | 0 | 1
4 | 0 | 0
4 | 0 | 0
6 | 1 | 0
8 | 0 | 1
8 | 0 | 1
6 | 1 | 0
4 | 0 | 0
8 | 0 | 1
8 | 0 | 1
8 | 0 | 1
6 | 1 | 0
6 | 1 | 0
4 | 0 | 0
4 | 0 | 0
(32 rows)
See also
7 - APPLY_PCA
Transforms the data using a PCA model.
Transforms the data using a PCA model. This returns new coordinates of each data point.
Syntax
APPLY_PCA ( input-columns
USING PARAMETERS model_name = 'model-name'
[, num_components = num-components]
[, cutoff = cutoff-value]
[, match_by_pos = match-by-position]
[, exclude_columns = 'excluded-columns']
[, key_columns = 'key-columns'] )
Arguments
input-columns
- Comma-separated list of columns that contain the data matrix, or asterisk (*) to select all columns. The following requirements apply:
Parameters
model_name
Name of the model (case-insensitive).
num_components
- The number of components to keep in the model. This is the number of output columns that will be generated. If you omit this parameter and the
cutoff
parameter, all model components are kept.
cutoff
- Set to 1, specifies the minimum accumulated explained variance. Components are taken until the accumulated explained variance reaches this value.
match_by_pos
Boolean value that specifies how input columns are matched to model features:
exclude_columns
Comma-separated list of column names from input-columns
to exclude from processing.
key_columns
- Comma-separated list of column names from
input-columns
that identify its data rows. These columns are included in the output table.
Examples
=> SELECT PCA ('pcamodel', 'world','country,HDI,em1970,em1971,em1972,em1973,em1974,em1975,em1976,em1977,
em1978,em1979,em1980,em1981,em1982,em1983,em1984 ,em1985,em1986,em1987,em1988,em1989,em1990,em1991,em1992,
em1993,em1994,em1995,em1996,em1997,em1998,em1999,em2000,em2001,em2002,em2003,em2004,em2005,em2006,em2007,
em2008,em2009,em2010,gdp1970,gdp1971,gdp1972,gdp1973,gdp1974,gdp1975,gdp1976,gdp1977,gdp1978,gdp1979,gdp1980,
gdp1981,gdp1982,gdp1983,gdp1984,gdp1985,gdp1986,gdp1987,gdp1988,gdp1989,gdp1990,gdp1991,gdp1992,gdp1993,
gdp1994,gdp1995,gdp1996,gdp1997,gdp1998,gdp1999,gdp2000,gdp2001,gdp2002,gdp2003,gdp2004,gdp2005,gdp2006,
gdp2007,gdp2008,gdp2009,gdp2010' USING PARAMETERS exclude_columns='HDI,country');
PCA
---------------------------------------------------------------
Finished in 1 iterations.
Accepted Rows: 96 Rejected Rows: 0
(1 row)
=> CREATE TABLE worldPCA AS SELECT
APPLY_PCA (HDI,country,em1970,em1971,em1972,em1973,em1974,em1975,em1976,em1977,em1978,em1979,
em1980,em1981,em1982,em1983,em1984 ,em1985,em1986,em1987,em1988,em1989,em1990,em1991,em1992,em1993,em1994,
em1995,em1996,em1997,em1998,em1999,em2000,em2001,em2002,em2003,em2004,em2005,em2006,em2007,em2008,em2009,
em2010,gdp1970,gdp1971,gdp1972,gdp1973,gdp1974,gdp1975,gdp1976,gdp1977,gdp1978,gdp1979,gdp1980,gdp1981,gdp1982,
gdp1983,gdp1984,gdp1985,gdp1986,gdp1987,gdp1988,gdp1989,gdp1990,gdp1991,gdp1992,gdp1993,gdp1994,gdp1995,
gdp1996,gdp1997,gdp1998,gdp1999,gdp2000,gdp2001,gdp2002,gdp2003,gdp2004,gdp2005,gdp2006,gdp2007,gdp2008,
gdp2009,gdp2010 USING PARAMETERS model_name='pcamodel', exclude_columns='HDI, country', key_columns='HDI,
country',cutoff=.3)OVER () FROM world;
CREATE TABLE
=> SELECT * FROM worldPCA;
HDI | country | col1
------+---------------------+-------------------
0.886 | Belgium | 79002.2946705704
0.699 | Belize | -25631.6670012556
0.427 | Benin | -40373.4104598122
0.805 | Chile | -16805.7940082156
0.687 | China | -37279.2893141103
0.744 | Costa Rica | -19505.5631231635
0.4 | Cote d'Ivoire | -38058.2060339272
0.776 | Cuba | -23724.5779612041
0.895 | Denmark | 117325.594028813
0.644 | Egypt | -34609.9941604549
...
(96 rows)
=> SELECT APPLY_INVERSE_PCA (HDI, country, col1
USING PARAMETERS model_name = 'pcamodel', exclude_columns='HDI,country',
key_columns = 'HDI, country') OVER () FROM worldPCA;
HDI | country | em1970 | em1971 | em1972 | em1973 |
em1974 | em1975 | em1976| em1977 | em1978 | em1979
| em1980 | em1981 | em1982 | em1983 | em1984 |em1985
| em1986 | em1987 | em1988 | em1989 | em1990 | em1991
| em1992 | em1993| em1994 | em1995 | em1996 | em1997
| em1998 | em1999 | em2000 | em2001 |em2002 |
em2003 | em2004 | em2005 | em2006 | em2007 | em2008
| em2009 | em2010 | gdp1970 | gdp1971 | gdp1972 | gdp1973
| gdp1974 | gdp1975 | gdp1976 | gdp1977 |gdp1978 | gdp1979
| gdp1980 | gdp1981 | gdp1982 | gdp1983 | gdp1984 | gdp1985
| gdp1986| gdp1987 | gdp1988 | gdp1989 | gdp1990 | gdp1991
| gdp1992 | gdp1993 | gdp1994 | gdp1995 | gdp1996 |
gdp1997 | gdp1998 | gdp1999 | gdp2000 | gdp2001 | gdp2002
| gdp2003 |gdp2004 | gdp2005 | gdp2006 | gdp2007 | gdp2008
| gdp2009 | gdp2010
-------+---------------------+-------------------+-------------------+------------------+------------------
+------------------+-------------------+------------------+------------------+-------------------+---------
----------+-------------------+------------------+-------------------+-------------------+-----------------
--+------------------+-------------------+-------------------+-------------------+------------------+-------
-----------+------------------+-------------------+-------------------+------------------+------------------
-+-------------------+------------------+-------------------+-------------------+-------------------+-------
------------+--------------------+------------------+-------------------+------------------+----------------
---+-------------------+-------------------+------------------+-------------------+------------------+------
------------+------------------+------------------+------------------+------------------+------------------+
------------------+------------------+------------------+------------------+------------------+-------------
-----+------------------+------------------+------------------+------------------+------------------+-------
-----------+------------------+------------------+------------------+------------------+------------------+-
-----------------+------------------+------------------+------------------+------------------+--------------
----+------------------+------------------+------------------+------------------+------------------+--------
----------+------------------+------------------+------------------+------------------+------------------
0.886 | Belgium | 18585.6613572407 | -16145.6374560074 | 26938.956253415 | 8094.30475779595 |
12073.5461203817 | -11069.0567600181 | 19133.8584911727| 5500.312894949 | -4227.94863799987 | 6265.77925410752
| -10884.749295608 | 30929.4669575201 | -7831.49439429977 | 3235.81760508742 | -22765.9285442662 | 27200
.6767714485 | -10554.9550160917 | 1169.4144482273 | -16783.7961289161 | 27932.2660829329 | 17227.9083196848
| 13956.0524012749 | -40175.6286481088 | -10889.4785920499 | 22703.6576872859 | -14635.5832197402 |
2857.12270512168 | 20473.5044214494 | -52199.4895696423 | -11038.7346460738 | 18466.7298633088 | -17410.4225137703 |
-3475.63826305462 | 29305.6753822341 | 1242.5724942049 | 17491.0096310849 | -12609.9984515902 | -17909.3603476248
| 6276.58431412381 | 21851.9475485178 | -2614.33738160397 | 3777.74134131349 | 4522.08854282736 | 4251.90446379366
| 4512.15101396876 | 4265.49424538129 | 5190.06845330997 | 4543.80444817989 | 5639.81122679089 | 4420.44705213467
| 5658.8820279283 | 5172.69025294376 | 5019.63640408663 | 5938.84979495903 | 4976.57073629812 | 4710.49525137591
| 6523.65700286465 | 5067.82520773578 | 6789.13070219317 | 5525.94643553563 | 6894.68336419297 | 5961.58442474331
| 5661.21093840818 | 7721.56088518218 | 5959.7301109143 | 6453.43604137202 | 6739.39384033096 | 7517.97645468455
| 6907.49136910647 | 7049.03921764209 | 7726.49091035527 | 8552.65909911844 | 7963.94487647115 | 7187.45827585515
| 7994.02955410523 | 9532.89844418041 | 7962.25713582666 | 7846.68238907624 | 10230.9878908643 | 8642.76044946519
| 8886.79860331866 | 8718.3731386891
...
(96 rows)
See also
8 - APPLY_SVD
Transforms the data using an SVD model.
Transforms the data using an SVD model. This computes the matrix U of the SVD decomposition.
Syntax
APPLY_SVD ( input-columns
USING PARAMETERS model_name = 'model-name'
[, num_components = num-components]
[, cutoff = cutoff-value]
[, match_by_pos = match-by-position]
[, exclude_columns = 'excluded-columns']
[, key_columns = 'key-columns'] )
Arguments
input-columns
- Comma-separated list of columns that contain the data matrix, or asterisk (*) to select all columns. The following requirements apply:
Parameters
model_name
Name of the model (case-insensitive).
num_components
- The number of components to keep in the model. This is the number of output columns that will be generated. If neither this parameter nor the
cutoff
parameter is provided, all components from the model are kept.
cutoff
- Set to 1, specifies the minimum accumulated explained variance. Components are taken until the accumulated explained variance reaches this value. If you omit this parameter and the
num_components
parameter, all model components are kept.
match_by_pos
- Boolean value that specifies how input columns are matched to model columns:
exclude_columns
Comma-separated list of column names from input-columns
to exclude from processing.
key_columns
- Comma-separated list of column names from
input-columns
that identify its data rows. These columns are included in the output table.
Examples
=> SELECT SVD ('svdmodel', 'small_svd', 'x1,x2,x3,x4');
SVD
--------------------------------------------------------------
Finished in 1 iterations.
Accepted Rows: 8 Rejected Rows: 0
(1 row)
=> CREATE TABLE transform_svd AS SELECT
APPLY_SVD (id, x1, x2, x3, x4 USING PARAMETERS model_name='svdmodel', exclude_columns='id', key_columns='id')
OVER () FROM small_svd;
CREATE TABLE
=> SELECT * FROM transform_svd;
id | col1 | col2 | col3 | col4
----+-------------------+---------------------+---------------------+--------------------
4 | 0.44849499240202 | -0.347260956311326 | 0.186958376368345 | 0.378561270493651
6 | 0.17652411036246 | -0.0753183783382909 | -0.678196192333598 | 0.0567124770173372
1 | 0.494871802886819 | 0.161721379259287 | 0.0712816417153664 | -0.473145877877408
2 | 0.17652411036246 | -0.0753183783382909 | -0.678196192333598 | 0.0567124770173372
3 | 0.150974762654569 | 0.589561842046029 | 0.00392654610109522 | 0.360011163271921
5 | 0.494871802886819 | 0.161721379259287 | 0.0712816417153664 | -0.473145877877408
8 | 0.44849499240202 | -0.347260956311326 | 0.186958376368345 | 0.378561270493651
7 | 0.150974762654569 | 0.589561842046029 | 0.00392654610109522 | 0.360011163271921
(8 rows)
=> SELECT APPLY_INVERSE_SVD (* USING PARAMETERS model_name='svdmodel', exclude_columns='id',
key_columns='id') OVER () FROM transform_svd;
id | x1 | x2 | x3 | x4
----+------------------+------------------+------------------+------------------
4 | 91.4056627665577 | 44.7629617207482 | 83.1704961993117 | 38.9274292265543
6 | 20.6468626294368 | 9.30974906868751 | 8.71006863405534 | 6.5855928603967
7 | 31.2494347777156 | 20.6336519003026 | 27.5668287751507 | 5.84427645886865
1 | 107.93376580719 | 51.6980548011917 | 97.9665796560552 | 40.4918236881051
2 | 20.6468626294368 | 9.30974906868751 | 8.71006863405534 | 6.5855928603967
3 | 31.2494347777156 | 20.6336519003026 | 27.5668287751507 | 5.84427645886865
5 | 107.93376580719 | 51.6980548011917 | 97.9665796560552 | 40.4918236881051
8 | 91.4056627665577 | 44.7629617207482 | 83.1704961993117 | 38.9274292265543
(8 rows)
See also
9 - PREDICT_AUTOREGRESSOR
Applies an autoregressor (AR) model to an input relation.
Applies an autoregressor (AR) model to an input relation.
Autoregressive models use previous values to make predictions. More specifically, the user-specified "lag" determines how many previous timesteps it takes into account during computation, and predicted values are linear combinations of those lags.
Syntax
PREDICT_AUTOREGRESSOR ( 'timeseries-column'
USING PARAMETERS
model-name = 'model-name'
[, start = starting-index]
[, npredictions = npredictions]
[, missing = "imputation-method" ] )
OVER (ORDER BY 'timestamp-column')
FROM input-relation
Note
The following argument, as written, is required and cannot be omitted nor substituted with another type of clause.
OVER (ORDER BY 'timestamp-column')
Arguments
timeseries-column
- The timeseries column used to make the prediction (only the last
p
values, specified during model creation, are used).
timestamp-column
- The timestamp column, with consistent timesteps, used to make the prediction.
input-relation
- The input relation containing the
timeseries-column
and timestamp-column
.
Note that input-relation
cannot have missing values in any of the p
(set during training) rows preceding start
. To handle missing values, see IMPUTE or Linear interpolation.
Parameters
model_name
Name of the model (case-insensitive).
start
- INTEGER >p or ≤0, the index (row) of the
input-relation
at which to start the prediction. If omitted, the prediction starts at the end of the input-relation
.
If the start
index is greater than the number of rows N
in timeseries-column
, then the values between N
and start
are predicted and used for the prediction.
If negative, the start
index is identified by counting backwards from the end of the input-relation
.
For an input-relation
of N rows, negative values have a lower limit of either -1000 or -(N-p), whichever is greater.
Default: the end of input-relation
npredictions
- INTEGER ≥1, the number of predicted timesteps.
Default: 10
missing
- One of the following methods for handling missing values:
-
drop: Missing values are ignored.
-
error: Missing values raise an error.
-
zero: Missing values are replaced with 0.
-
linear_interpolation: Missing values are replaced by linearly-interpolated values based on the nearest valid entries before and after the missing value. If all values before or after a missing value in the prediction range are missing or invalid, interpolation is impossible and the function errors.
Default: linear_interpolation
Examples
See Autoregressive model example.
See also
10 - PREDICT_LINEAR_REG
Applies a linear regression model on an input relation and returns the predicted value as a FLOAT.
Applies a linear regression model on an input relation and returns the predicted value as a FLOAT.
Syntax
PREDICT_LINEAR_REG ( input-columns
USING PARAMETERS model_name = 'model-name' [, match_by_pos = match-by-position] )
Arguments
input-columns
- Comma-separated list of columns to use from the input relation, or asterisk (*) to select all columns.
Parameters
model_name
Name of the model (case-insensitive).
match_by_pos
Boolean value that specifies how input columns are matched to model features:
Examples
=> SELECT PREDICT_LINEAR_REG(waiting USING PARAMETERS model_name='myLinearRegModel')FROM
faithful ORDER BY id;
PREDICT_LINEAR_REG
--------------------
4.15403481386324
2.18505296804024
3.76023844469864
2.8151271587036
4.62659045686076
2.26381224187316
4.86286827835952
4.62659045686076
1.94877514654148
4.62659045686076
2.18505296804024
...
(272 rows)
The following example shows how to use the PREDICT_LINEAR_REG function on an input table, using the match_by_pos
parameter. Note that you can replace the column argument with a constant that does not match an input column:
=> SELECT PREDICT_LINEAR_REG(55 USING PARAMETERS model_name='linear_reg_faithful',
match_by_pos='true')FROM faithful ORDER BY id;
PREDICT_LINEAR_REG
--------------------
2.28552115094171
2.28552115094171
2.28552115094171
2.28552115094171
2.28552115094171
2.28552115094171
2.28552115094171
...
(272 rows)
11 - PREDICT_LOGISTIC_REG
Applies a logistic regression model on an input relation.
Applies a logistic regression model on an input relation.
PREDICT_LOGISTIC_REG returns as a FLOAT the predicted class or the probability of the predicted class, depending on how the type
parameter is set. You can cast the return value to INTEGER or another numeric type when the return is in the probability of the predicted class.
Syntax
PREDICT_LOGISTIC_REG ( input-columns
USING PARAMETERS model_name = 'model-name'
[, type = 'prediction-type']
[, cutoff = probability-cutoff]
[, match_by_pos = match-by-position] )
Arguments
input-columns
- Comma-separated list of columns to use from the input relation, or asterisk (*) to select all columns.
Parameters
model_name
Name of the model (case-insensitive).
type
- Type of prediction for logistic regression, one of the following:
cutoff
- Used in conjunction with the
type
parameter, a FLOAT between 0 and 1, exclusive. When type
is set to response
, the returned value of prediction is 1 if its corresponding probability is greater than or equal to the value of cutoff
; otherwise, it is 0.
Default: 0.5
match_by_pos
Boolean value that specifies how input columns are matched to model features:
Examples
=> SELECT car_model,
PREDICT_LOGISTIC_REG(mpg, cyl, disp, drat, wt, qsec, vs, gear, carb
USING PARAMETERS model_name='myLogisticRegModel')
FROM mtcars;
car_model | PREDICT_LOGISTIC_REG
---------------------+----------------------
Camaro Z28 | 0
Fiat 128 | 1
Fiat X1-9 | 1
Ford Pantera L | 1
Merc 450SE | 0
Merc 450SL | 0
Toyota Corona | 0
AMC Javelin | 0
Cadillac Fleetwood | 0
Datsun 710 | 1
Dodge Challenger | 0
Hornet 4 Drive | 0
Lotus Europa | 1
Merc 230 | 0
Merc 280 | 0
Merc 280C | 0
Merc 450SLC | 0
Pontiac Firebird | 0
Porsche 914-2 | 1
Toyota Corolla | 1
Valiant | 0
Chrysler Imperial | 0
Duster 360 | 0
Ferrari Dino | 1
Honda Civic | 1
Hornet Sportabout | 0
Lincoln Continental | 0
Maserati Bora | 1
Mazda RX4 | 1
Mazda RX4 Wag | 1
Merc 240D | 0
Volvo 142E | 1
(32 rows)
The following example shows how to use PREDICT_LOGISTIC_REG
on an input table, using the match_by_pos
parameter. Note that you can replace any of the column inputs with a constant that does not match an input column. In this example, column mpg
was replaced with the constant 20:
=> SELECT car_model,
PREDICT_LOGISTIC_REG(20, cyl, disp, drat, wt, qsec, vs, gear, carb
USING PARAMETERS model_name='myLogisticRegModel', match_by_pos='true')
FROM mtcars;
car_model | PREDICT_LOGISTIC_REG
--------------------+----------------------
AMC Javelin | 0
Cadillac Fleetwood | 0
Camaro Z28 | 0
Chrysler Imperial | 0
Datsun 710 | 1
Dodge Challenger | 0
Duster 360 | 0
Ferrari Dino | 1
Fiat 128 | 1
Fiat X1-9 | 1
Ford Pantera L | 1
Honda Civic | 1
Hornet 4 Drive | 0
Hornet Sportabout | 0
Lincoln Continental | 0
Lotus Europa | 1
Maserati Bora | 1
Mazda RX4 | 1
Mazda RX4 Wag | 1
Merc 230 | 0
Merc 240D | 0
Merc 280 | 0
Merc 280C | 0
Merc 450SE | 0
Merc 450SL | 0
Merc 450SLC | 0
Pontiac Firebird | 0
Porsche 914-2 | 1
Toyota Corolla | 1
Toyota Corona | 0
Valiant | 0
Volvo 142E | 1
(32 rows)
12 - PREDICT_MOVING_AVERAGE
Applies a moving-average (MA) model, created by MOVING_AVERAGE, to an input relation.
Applies a moving-average (MA) model, created by MOVING_AVERAGE, to an input relation.
Moving average models use the errors of previous predictions to make future predictions. More specifically, the user-specified "lag" determines how many previous predictions and errors it takes into account during computation.
Syntax
PREDICT_MOVING_AVERAGE ( 'timeseries-column'
USING PARAMETERS
model_name = 'model-name'
[, start = starting-index]
[, npredictions = npredictions]
[, missing = "imputation-method" ] )
OVER (ORDER BY 'timestamp-column')
FROM input-relation
Note
The following argument, as written, is required and cannot be omitted nor substituted with another type of clause.
OVER (ORDER BY 'timestamp-column')
Arguments
timeseries-column
- The timeseries column used to make the prediction (only the last
q
values, specified during model creation, are used).
timestamp-column
- The timestamp column, with consistent timesteps, used to make the prediction.
input-relation
- The input relation containing the
timeseries-column
and timestamp-column
.
Note that input-relation
cannot have missing values in any of the q
(set during training) rows preceding start
. To handle missing values, see IMPUTE or Linear interpolation.
Parameters
model_name
Name of the model (case-insensitive).
start
- INTEGER >q or ≤0, the index (row) of the
input-relation
at which to start the prediction. If omitted, the prediction starts at the end of the input-relation
.
If the start
index is greater than the number of rows N
in timeseries-column
, then the values between N
and start
are predicted and used for the prediction.
If negative, the start
index is identified by counting backwards from the end of the input-relation
.
For an input-relation
of N rows, negative values have a lower limit of either -1000 or -(N-q), whichever is greater.
Default: the end of input-relation
npredictions
- INTEGER ≥1, the number of predicted timesteps.
Default: 10
missing
- One of the following methods for handling missing values:
-
drop: Missing values are ignored.
-
error: Missing values raise an error.
-
zero: Missing values are replaced with 0.
-
linear_interpolation: Missing values are replaced by linearly-interpolated values based on the nearest valid entries before and after the missing value. If all values before or after a missing value in the prediction range are missing or invalid, interpolation is impossible and the function errors.
Default: linear_interpolation
Examples
See Moving-average model example.
See also
13 - PREDICT_NAIVE_BAYES
Applies a Naive Bayes model on an input relation.
Applies a Naive Bayes model on an input relation.
Depending on how the type
parameter is set, PREDICT_NAIVE_BAYES returns a VARCHAR that specifies either the predicted class or probability of the predicted class. If the function returns probability, you can cast the return value to an INTEGER or another numeric data type.
Syntax
PREDICT_NAIVE_BAYES ( input-columns
USING PARAMETERS model_name = 'model-name'
[, type = ' return-type ']
[, class = 'user-input-class']
[, match_by_pos = match-by-position] )
Arguments
input-columns
- Comma-separated list of columns to use from the input relation, or asterisk (*) to select all columns.
Parameters
model_name
Name of the model (case-insensitive).
type
- One of the following:
-
response
(default): Returns the class with the highest probability.
-
probability
: Valid only if class
parameter is set, returns the probability of belonging to the specified class argument.
class
- Required if
type
parameter is set to probability
. If you omit this parameter, PREDICT_NAIVE_BAYES
returns the class that it predicts as having the highest probability.
match_by_pos
Boolean value that specifies how input columns are matched to model features:
Examples
=> SELECT party, PREDICT_NAIVE_BAYES (vote1, vote2, vote3
USING PARAMETERS model_name='naive_house84_model',
type='response')
AS Predicted_Party
FROM house84_test;
party | Predicted_Party
------------+-----------------
democrat | democrat
democrat | democrat
democrat | democrat
republican | republican
democrat | democrat
democrat | democrat
democrat | democrat
democrat | democrat
democrat | democrat
republican | republican
democrat | democrat
democrat | democrat
democrat | democrat
democrat | republican
republican | republican
democrat | democrat
republican | republican
...
(99 rows)
See also
14 - PREDICT_NAIVE_BAYES_CLASSES
Applies a Naive Bayes model on an input relation and returns the probabilities of classes:.
Applies a Naive Bayes model on an input relation and returns the probabilities of classes:
-
VARCHAR predicted
column contains the class label with the highest probability.
-
Multiple FLOAT columns, where the first probability
column contains the probability for the class specified in the predicted column. Other columns contain the probability of belonging to each class specified in the classes
parameter.
Syntax
PREDICT_NAIVE_BAYES_CLASSES ( predictor-columns
USING PARAMETERS model_name = 'model-name'
[, key_columns = 'key-columns']
[, exclude_columns = 'excluded-columns]
[, classes = 'classes']
[, match_by_pos = match-by-position] )
OVER( [window-partition-clause] )
Arguments
predictor-columns
- Comma-separated list of columns to use from the input relation, or asterisk (*) to select all columns.
Parameters
model_name
Name of the model (case-insensitive).
key_columns
Comma-separated list of predictor column names that identify the output rows. To exclude these and other predictor columns from being used for prediction, include them in the argument list for parameter exclude_columns
.
exclude_columns
- Comma-separated list of columns from
predictor-columns
to exclude from processing.
classes
- Comma-separated list of class labels in the model. The probability of belonging to this given class as predicted by the classifier. The values are case sensitive.
match_by_pos
- Boolean value that specifies how predictor columns are matched to model features:
Examples
=> SELECT PREDICT_NAIVE_BAYES_CLASSES (id, vote1, vote2 USING PARAMETERS
model_name='naive_house84_model',key_columns='id',exclude_columns='id',
classes='democrat, republican', match_by_pos='false')
OVER() FROM house84_test;
id | Predicted | Probability | democrat | republican
-----+------------+-------------------+-------------------+-------------------
21 | democrat | 0.775473383353576 | 0.775473383353576 | 0.224526616646424
28 | democrat | 0.775473383353576 | 0.775473383353576 | 0.224526616646424
83 | republican | 0.592510497724379 | 0.407489502275621 | 0.592510497724379
102 | democrat | 0.779889432167111 | 0.779889432167111 | 0.220110567832889
107 | republican | 0.598662714551597 | 0.401337285448403 | 0.598662714551597
125 | republican | 0.598662714551597 | 0.401337285448403 | 0.598662714551597
132 | republican | 0.592510497724379 | 0.407489502275621 | 0.592510497724379
136 | republican | 0.592510497724379 | 0.407489502275621 | 0.592510497724379
155 | republican | 0.598662714551597 | 0.401337285448403 | 0.598662714551597
174 | republican | 0.592510497724379 | 0.407489502275621 | 0.592510497724379
...
(1 row)
See also
15 - PREDICT_PMML
Applies an imported PMML model on an input relation.
Applies an imported PMML model on an input relation. The function returns the result that would be expected for the model type encoded in the PMML model.
PREDICT_PMML returns NULL in the following cases:
Note
PREDICT_PMML returns values of complex type ROW for models that use the Output
tag. Vertica does not currently support directly inserting this data into a table.
You can work around this limitation by changing the output to JSON with TO_JSON before inserting it into a table:
=> CREATE TABLE predicted_output AS SELECT TO_JSON(PREDICT_PMML(X1,X2,X3
USING PARAMETERS model_name='pmml_imported_model'))
AS predicted_value
FROM input_table;
Syntax
PREDICT_PMML ( input-columns
USING PARAMETERS model_name = 'model-name' [, match_by_pos = match-by-position] )
Arguments
input-columns
- Comma-separated list of columns to use from the input relation, or asterisk (*) to select all columns.
Parameters
Parameter |
Description |
model_name |
Name of the model (case-insensitive). The function supports PMML models that encode the following model types:
-
K-means
-
Linear regression
-
Logistic regression
|
match_by_pos |
Boolean value that specifies how input columns are matched to model features:
|
Examples
In this example, the function call uses all the columns from the table as predictors and predicts the value using the 'my_kmeans
' model in PMML format:
SELECT PREDICT_PMML(* USING PARAMETERS model_name='my_kmeans') AS predicted_label FROM table;
In this example, the function call takes only columns col1, col2
as predictors, and predicts the value for each row using the 'my_kmeans
' model from schema 'my_schema
':
SELECT PREDICT_PMML(col1, col2 USING PARAMETERS model_name='my_schema.my_kmeans') AS predicted_label FROM table;
In this example, the function call returns an error as neither schema
nor model-name
can accept * as a value:
SELECT PREDICT_PMML(* USING PARAMETERS model_name='*.*') AS predicted_label FROM table;
SELECT PREDICT_PMML(* USING PARAMETERS model_name='*') AS predicted_label FROM table;
SELECT PREDICT_PMML(* USING PARAMETERS model_name='models.*') AS predicted_label FROM table;
See also
16 - PREDICT_RF_CLASSIFIER
Applies a random forest model on an input relation.
Applies a random forest model on an input relation. PREDICT_RF_CLASSIFIER returns a VARCHAR data type that specifies one of the following, as determined by how the type
parameter is set:
Note
The predicted class is selected only based on the popular vote of the decision trees in the forest. Therefore, in special cases the calculated probability of the predicted class may not be the highest.
Syntax
PREDICT_RF_CLASSIFIER ( input-columns
USING PARAMETERS model_name = 'model-name'
[, type = 'prediction-type']
[, class = 'user-input-class']
[, match_by_pos = match-by-position] )
Arguments
input-columns
- Comma-separated list of columns to use from the input relation, or asterisk (*) to select all columns.
Parameters
model_name
Name of the model (case-insensitive).
type
- Type of prediction to return, one of the following:
-
response
(default): The class with the highest probability among all possible classes.
-
probability
: Valid only if the class
parameter is set, returns the probability of the specified class.
class
- Class to use when the
type
parameter is set to probability
. If you omit this parameter, the function uses the predicted class—the one with the popular vote. Thus, the predict function returns the probability that the input instance belongs to its predicted class.
match_by_pos
Boolean value that specifies how input columns are matched to model features:
Examples
=> SELECT PREDICT_RF_CLASSIFIER (Sepal_Length, Sepal_Width, Petal_Length, Petal_Width
USING PARAMETERS model_name='myRFModel') FROM iris;
PREDICT_RF_CLASSIFIER
-----------------------
setosa
setosa
setosa
...
versicolor
versicolor
versicolor
...
virginica
virginica
virginica
...
(150 rows)
This example shows how you can use the PREDICT_RF_CLASSIFIER function, using the match_by_pos
parameter:
=> SELECT PREDICT_RF_CLASSIFIER (Sepal_Length, Sepal_Width, Petal_Length, Petal_Width
USING PARAMETERS model_name='myRFModel', match_by_pos='true') FROM iris;
PREDICT_RF_CLASSIFIER
-----------------------
setosa
setosa
setosa
...
versicolor
versicolor
versicolor
...
virginica
virginica
virginica
...
(150 rows)
See also
17 - PREDICT_RF_CLASSIFIER_CLASSES
Applies a random forest model on an input relation and returns the probabilities of classes:.
Applies a random forest model on an input relation and returns the probabilities of classes:
-
VARCHAR predicted
column contains the class label with the highest vote (popular vote).
-
Multiple FLOAT columns, where the first probability
column contains the probability for the class reported in the predicted column. Other columns contain the probability of each class specified in the classes
parameter.
-
Key columns with the same value and data type as matching input columns specified in parameter key_columns
.
Note
Selection of the predicted class is based on the popular vote of decision trees in the forest. Thus, in special cases the calculated probability of the predicted class might not be the highest.
Syntax
PREDICT_RF_CLASSIFIER_CLASSES ( predictor-columns
USING PARAMETERS model_name = 'model-name'
[, key_columns = 'key-columns']
[, exclude_columns = 'excluded-columns']
[, classes = 'classes']
[, match_by_pos = match-by-position] )
OVER( [window-partition-clause] )
Arguments
predictor-columns
- Comma-separated list of columns to use from the input relation, or asterisk (*) to select all columns.
Parameters
model_name
Name of the model (case-insensitive).
key_columns
Comma-separated list of predictor column names that identify the output rows. To exclude these and other predictor columns from being used for prediction, include them in the argument list for parameter exclude_columns
.
exclude_columns
- Comma-separated list of columns from
predictor-columns
to exclude from processing.
classes
- Comma-separated list of class labels in the model. The probability of belonging to this given class is predicted by the classifier. Values are case sensitive.
match_by_pos
- Boolean value that specifies how predictor columns are matched to model features:
Examples
=> SELECT PREDICT_RF_CLASSIFIER_CLASSES(Sepal_Length, Sepal_Width, Petal_Length, Petal_Width
USING PARAMETERS model_name='myRFModel') OVER () FROM iris;
predicted | probability
-----------+-------------------
setosa | 1
setosa | 0.99
setosa | 1
setosa | 1
setosa | 1
setosa | 0.97
setosa | 1
setosa | 1
setosa | 1
setosa | 1
setosa | 0.99
...
(150 rows)
This example shows how to use function PREDICT_RF_CLASSIFIER_CLASSES
, using the match_by_pos
parameter:
=> SELECT PREDICT_RF_CLASSIFIER_CLASSES(Sepal_Length, Sepal_Width, Petal_Length, Petal_Width
USING PARAMETERS model_name='myRFModel', match_by_pos='true') OVER () FROM iris;
predicted | probability
-----------+-------------------
setosa | 1
setosa | 1
setosa | 1
setosa | 1
setosa | 1
setosa | 1
setosa | 1
setosa | 1
setosa | 1
setosa | 1
setosa | 1
...
(150 rows)s
See also
18 - PREDICT_RF_REGRESSOR
Applies a random forest model on an input relation, and returns with a FLOAT data type that specifies the predicted value of the random forest model—the average of the prediction of the trees in the forest.
Applies a random forest model on an input relation, and returns with a FLOAT data type that specifies the predicted value of the random forest model—the average of the prediction of the trees in the forest.
Syntax
PREDICT_RF_REGRESSOR ( input-columns
USING PARAMETERS model_name = 'model-name' [, match_by_pos = match-by-position] )
Arguments
input-columns
- Comma-separated list of columns to use from the input relation, or asterisk (*) to select all columns.
Parameters
model_name
Name of the model (case-insensitive).
match_by_pos
Boolean value that specifies how input columns are matched to model features:
Examples
=> SELECT PREDICT_RF_REGRESSOR (mpg,cyl,hp,drat,wt
USING PARAMETERS model_name='myRFRegressorModel')FROM mtcars;
PREDICT_RF_REGRESSOR
----------------------
2.94774203574204
2.6954087024087
2.6954087024087
2.89906346431346
2.97688489288489
2.97688489288489
2.7086587024087
2.92078965478965
2.97688489288489
2.7086587024087
2.95621822621823
2.82255155955156
2.7086587024087
2.7086587024087
2.85650394050394
2.85650394050394
2.97688489288489
2.95621822621823
2.6954087024087
2.6954087024087
2.84493251193251
2.97688489288489
2.97688489288489
2.8856467976468
2.6954087024087
2.92078965478965
2.97688489288489
2.97688489288489
2.7934087024087
2.7934087024087
2.7086587024087
2.72469441669442
(32 rows)
See also
19 - PREDICT_SVM_CLASSIFIER
Uses an SVM model to predict class labels for samples in an input relation, and returns the predicted value as a FLOAT data type.
Uses an SVM model to predict class labels for samples in an input relation, and returns the predicted value as a FLOAT data type.
Syntax
PREDICT_SVM_CLASSIFIER (input-columns
USING PARAMETERS model_name = 'model-name'
[, match_by_pos = match-by-position]
[, type = 'return-type']
[, cutoff = 'cutoff-value'] ] )
Arguments
input-columns
- Comma-separated list of columns to use from the input relation, or asterisk (*) to select all columns.
Parameters
model_name
Name of the model (case-insensitive).
match_by_pos
Boolean value that specifies how input columns are matched to model features:
type
- A string that specifies the output to return for each input row, one of the following:
-
response
: Outputs the predicted class of 0 or 1.
-
probability
: Outputs a value in the range (0,1), the prediction score transformed using the logistic function.
cutoff
- Valid only if the
type
parameter is set to probability
, a FLOAT value that is compared to the transformed prediction score to determine the predicted class.
Default: 0
Examples
=> SELECT PREDICT_SVM_CLASSIFIER (mpg,cyl,disp,wt,qsec,vs,gear,carb
USING PARAMETERS model_name='mySvmClassModel') FROM mtcars;
PREDICT_SVM_CLASSIFIER
------------------------
0
0
1
0
0
1
1
1
1
0
0
1
0
0
1
0
0
0
0
0
0
1
1
0
0
1
1
1
1
0
0
0
(32 rows)
This example shows how to use PREDICT_SVM_CLASSIFIER
on the mtcars
table, using the match_by_pos
parameter. In this example, column mpg
was replaced with the constant 40:
=> SELECT PREDICT_SVM_CLASSIFIER (40,cyl,disp,wt,qsec,vs,gear,carb
USING PARAMETERS model_name='mySvmClassModel', match_by_pos ='true') FROM mtcars;
PREDICT_SVM_CLASSIFIER
------------------------
0
0
0
0
1
0
0
1
1
1
1
1
0
0
0
1
1
1
1
0
0
0
0
0
0
0
0
1
1
0
0
1
(32 rows)
See also
20 - PREDICT_SVM_REGRESSOR
Uses an SVM model to perform regression on samples in an input relation, and returns the predicted value as a FLOAT data type.
Uses an SVM model to perform regression on samples in an input relation, and returns the predicted value as a FLOAT data type.
Syntax
PREDICT_SVM_REGRESSOR(input-columns
USING PARAMETERS model_name = 'model-name' [, match_by_pos = match-by-position] )
Arguments
input-columns
- Comma-separated list of columns to use from the input relation, or asterisk (*) to select all columns.
Parameters
model_name
Name of the model (case-insensitive).
match_by_pos
Boolean value that specifies how input columns are matched to model features:
Examples
=> SELECT PREDICT_SVM_REGRESSOR(waiting USING PARAMETERS model_name='mySvmRegModel')
FROM faithful ORDER BY id;
PREDICT_SVM_REGRESSOR
--------------------
4.06488248694445
2.30392277646291
3.71269054484815
2.867429883817
4.48751281746003
2.37436116488217
4.69882798271781
4.48751281746003
2.09260761120512
...
(272 rows)
This example shows how you can use the PREDICT_SVM_REGRESSOR function on the faithful table, using the match_by_pos
parameter. In this example, the waiting column was replaced with the constant 40:
=> SELECT PREDICT_SVM_REGRESSOR(40 USING PARAMETERS model_name='mySvmRegModel', match_by_pos='true')
FROM faithful ORDER BY id;
PREDICT_SVM_REGRESSOR
--------------------
1.31778533859324
1.31778533859324
1.31778533859324
1.31778533859324
1.31778533859324
1.31778533859324
1.31778533859324
1.31778533859324
1.31778533859324
...
(272 rows)
See also
21 - PREDICT_TENSORFLOW
Applies a TensorFlow model on an input relation, and returns with the result expected for the encoded model type.
Applies a TensorFlow model on an input relation, and returns with the result expected for the encoded model type.
Syntax
PREDICT_TENSORFLOW ( input-columns
USING PARAMETERS model_name = 'model-name' [, num_passthru_cols = 'n-first-columns-to-ignore'] )
OVER( [window-partition-clause] )
Arguments
input-columns
- Comma-separated list of columns to use from the input relation, or asterisk (*) to select all columns.
Parameters
model_name
Name of the model (case-insensitive).
num_passthru_cols
- Integer that specifies the number of input columns to skip.
Examples
Use PREDICT_TENSORFLOW with the num_passthru_cols
parameter to skip the first two input columns:
SELECT PREDICT_TENSORFLOW ( pid,label,x1,x2
USING PARAMETERS model_name='spiral_demo', num_passthru_cols=2 )
OVER(PARTITION BEST) as predicted_class FROM points;
--example output, the skipped columns are displayed as the first columns of the output
pid | label | col0 | col1
-------+-------+----------------------+----------------------
0 | 0 | 0.990638732910156 | 0.00936129689216614
1 | 0 | 0.999036073684692 | 0.000963933940511197
2 | 1 | 0.0103802494704723 | 0.989619791507721
See also
22 - PREDICT_XGB_CLASSIFIER
Applies an XGBoost classifier model on an input relation.
Applies an XGBoost classifier model on an input relation. PREDICT_XGB_CLASSIFIER
returns a VARCHAR data type that specifies one of the following, as determined by how the type
parameter is set:
Syntax
PREDICT_XGB_CLASSIFIER ( input-columns
USING PARAMETERS model_name = 'model-name'
[, type = 'prediction-type' ]
[, class = 'user-input-class' ]
[, match_by_pos = 'match-by-position' ]
[, probability_normalization = 'prob-normalization' ] )
Arguments
input-columns
- Comma-separated list of columns to use from the input relation, or asterisk (*) to select all columns.
Parameters
model_name
Name of the model (case-insensitive).
type
- Type of prediction to return, one of the following:
-
response
(default): The class with the highest probability among all possible classes.
-
probability
: Valid only if the class
parameter is set, returns for each input instance the probability of the specified class or predicted class.
class
- Class to use when the
type
parameter is set to probability
. If you omit this parameter, the function uses the predicted class—the one with the highest probability score. Thus, the predict function returns the probability that the input instance belongs to the specified or predicted class.
match_by_pos
Boolean value that specifies how input columns are matched to model features:
probability_normalization
The classifier's normalization method, either softmax
(multi-class classifier) or logit
(binary classifier). If unspecified, the default logit
function is used for normalization.
Examples
Use
PREDICT_XGB_CLASSIFIER
to apply the classifier to the test data:
=> SELECT PREDICT_XGB_CLASSIFIER (Sepal_Length, Sepal_Width, Petal_Length, Petal_Width
USING PARAMETERS model_name='xgb_iris', probability_normalization='logit') FROM iris1;
PREDICT_XGB_CLASSIFIER
------------------------
setosa
setosa
setosa
.
.
.
versicolor
versicolor
versicolor
.
.
.
virginica
virginica
virginica
.
.
.
(90 rows)
See XGBoost for classification for more examples.
23 - PREDICT_XGB_CLASSIFIER_CLASSES
Applies an XGBoost classifier model on an input relation and returns the probabilities of classes:.
Applies an XGBoost classifier model on an input relation and returns the probabilities of classes:
-
VARCHAR predicted
column contains the class label with the highest probability.
-
Multiple FLOAT columns, where the first probability
column contains the probability for the class reported in the predicted column. Other columns contain the probability of each class specified in the classes
parameter.
-
Key columns with the same value and data type as matching input columns specified in parameter key_columns
.
All trees contribute to a predicted probability for each response class, and the highest probability class is chosen.
Syntax
PREDICT_XGB_CLASSIFIER_CLASSES ( predictor-columns)
USING PARAMETERS model_name = 'model-name'
[, key_columns = 'key-columns']
[, exclude_columns = 'excluded-columns']
[, classes = 'classes']
[, match_by_pos = match-by-position]
[, probability_normalization = 'prob-normalization' ] )
OVER( [<span class="code-variable"><span class="code-variable"><a href="/en/sql-reference/functions/analytic-functions/window-partition-clause/#">window-partition-clause</a></span></span>] )
Arguments
input-columns
- Comma-separated list of columns to use from the input relation, or asterisk (*) to select all columns.
Parameters
model_name
Name of the model (case-insensitive).
key_columns
Comma-separated list of predictor column names that identify the output rows. To exclude these and other predictor columns from being used for prediction, include them in the argument list for parameter exclude_columns
.
exclude_columns
- Comma-separated list of columns from
predictor-columns
to exclude from processing.
classes
- Comma-separated list of class labels in the model. The probability of belonging to each given class is predicted by the classifier. Values are case sensitive.
match_by_pos
- Boolean value that specifies how predictor columns are matched to model features:
probability_normalization
The classifier's normalization method, either softmax
(multi-class classifier) or logit
(binary classifier). If unspecified, the default logit
function is used for normalization.
Examples
After creating an XGBoost classifier model with
XGB_CLASSIFIER
, you can use PREDICT_XGB_CLASSIFIER_CLASSES
to view the probability of each classification. In this example, the XGBoost classifier model "xgb_iris" is used to predict the probability that a given flower belongs to a species of iris:
=> SELECT PREDICT_XGB_CLASSIFIER_CLASSES(Sepal_Length, Sepal_Width, Petal_Length, Petal_Width
USING PARAMETERS model_name='xgb_iris') OVER (PARTITION BEST) FROM iris1;
predicted | probability
------------+-------------------
setosa | 0.9999650465368
setosa | 0.9999650465368
setosa | 0.9999650465368
setosa | 0.9999650465368
setosa | 0.999911552783011
setosa | 0.9999650465368
setosa | 0.9999650465368
setosa | 0.9999650465368
setosa | 0.9999650465368
setosa | 0.9999650465368
setosa | 0.9999650465368
setosa | 0.9999650465368
versicolor | 0.99991871763563
.
.
.
(90 rows)
You can also specify additional classes. In this example, PREDICT_XGB_CLASSIFIER_CLASSES
makes the same prediction as the previous example, but also returns the probability that a flower belongs to the specified classes
"virginica" and "versicolor":
=> SELECT PREDICT_XGB_CLASSIFIER_CLASSES(Sepal_Length, Sepal_Width, Petal_Length, Petal_Width
USING PARAMETERS model_name='xgb_iris', classes='virginica,versicolor', probability_normalization='logit') OVER (PARTITION BEST) FROM iris1;
predicted | probability | virginica | versicolor
------------+-------------------+----------------------+----------------------
setosa | 0.9999650465368 | 1.16160301545536e-05 | 2.33374330460065e-05
setosa | 0.9999650465368 | 1.16160301545536e-05 | 2.33374330460065e-05
setosa | 0.9999650465368 | 1.16160301545536e-05 | 2.33374330460065e-05
.
.
.
versicolor | 0.99991871763563 | 6.45697562080953e-05 | 0.99991871763563
versicolor | 0.999967282051702 | 1.60052775404199e-05 | 0.999967282051702
versicolor | 0.999648819964864 | 0.00028366342010669 | 0.999648819964864
.
.
.
virginica | 0.999977039257386 | 0.999977039257386 | 1.13305901169304e-05
virginica | 0.999977085131063 | 0.999977085131063 | 1.12847163501674e-05
virginica | 0.999977039257386 | 0.999977039257386 | 1.13305901169304e-05
(90 rows)
24 - PREDICT_XGB_REGRESSOR
Applies an XGBoost regressor model on an input relation.
Applies an XGBoost regressor model on an input relation. PREDICT_XGB_REGRESSOR
returns a FLOAT data type that specifies the predicted value by the XGBoost model: a weighted sum of contributions by each tree in the model.
Syntax
PREDICT_XGB_REGRESSOR ( input-columns
USING PARAMETERS model_name = 'model-name' [, match_by_pos = match-by-position] )
Arguments
input-columns
- Comma-separated list of columns to use from the input relation, or asterisk (*) to select all columns.
Parameters
model_name
Name of the model (case-insensitive).
match_by_pos
Boolean value that specifies how input columns are matched to model features:
Examples
See XGBoost for regression.
25 - REVERSE_NORMALIZE
Reverses the normalization transformation on normalized data, thereby de-normalizing the normalized data.
Reverses the normalization transformation on normalized data, thereby de-normalizing the normalized data. If you specify a column that is not in the specified model, REVERSE_NORMALIZE
returns that column unchanged.
Syntax
REVERSE_NORMALIZE ( input-columns USING PARAMETERS model_name = 'model-name' );
Arguments
input-columns
- The columns to use from the input relation, or asterisk (*) to select all columns.
Parameters
model_name
Name of the model (case-insensitive).
Examples
Use REVERSE_NORMALIZE
on the hp
and cyl
columns in table mtcars
, where hp
is in normalization model mtcars_normfit
, and cyl
is not in the normalization model.
=> SELECT REVERSE_NORMALIZE (hp, cyl USING PARAMETERS model_name='mtcars_normfit') FROM mtcars;
hp | cyl
------+-----
42502 | 8
58067 | 8
26371 | 4
42502 | 8
31182 | 6
32031 | 4
26937 | 4
34861 | 6
34861 | 6
50992 | 8
50992 | 8
49577 | 8
25805 | 4
18447 | 4
29767 | 6
65142 | 8
69387 | 8
14768 | 4
49577 | 8
60897 | 8
94857 | 8
31182 | 6
31182 | 6
30899 | 4
69387 | 8
49577 | 6
18730 | 4
18730 | 4
74764 | 8
17598 | 4
50992 | 8
27503 | 4
(32 rows)
See also