Vertica supports machine learning functions that prepare data as needed before subjecting it to analysis.
Data preparation
- 1: BALANCE
- 2: CORR_MATRIX
- 3: DETECT_OUTLIERS
- 4: IFOREST
- 5: IMPUTE
- 6: NORMALIZE
- 7: NORMALIZE_FIT
- 8: ONE_HOT_ENCODER_FIT
- 9: PCA
- 10: SUMMARIZE_CATCOL
- 11: SUMMARIZE_NUMCOL
- 12: SVD
1 - BALANCE
Returns a view with an equal distribution of the input data based on the response-column.
This is a meta-function. You must call meta-functions in a top-level SELECT statement.
Behavior type
Volatile
Syntax
BALANCE ( 'output-view', 'input-relation', 'response-column', 'balance-method'
[ USING PARAMETERS sampling_ratio=ratio ] )
Arguments
output-view
- The name of the view where Vertica saves the balanced data from the input relation.
Note
Note: The view that results from this function employs a random function. Its content can differ each time it is used in a query. To make the operations on the view predictable, store it in a regular table. input-relation
- The table or view that contains the data the function uses to create a more balanced data set. If the input relation is defined in Hive, use
SYNC_WITH_HCATALOG_SCHEMA
to sync thehcatalog
schema, and then run the machine learning function. response-column
- Name of the input column that represents the dependent variable, of type VARCHAR or INTEGER.
balance-method
- Specifies a method to select data from the minority and majority classes, one of the following.
-
hybrid_sampling
: Performs over-sampling and under-sampling on different classes so each class is equally represented. -
over_sampling
: Over-samples on all classes, with the exception of the most majority class, towards the most majority class's cardinality. -
under_sampling
: Under-samples on all classes, with the exception of the most minority class, towards the most minority class's cardinality. -
weighted_sampling
: An alias ofunder_sampling
.
-
Parameters
ratio
- The desired ratio between the majority class and the minority class. This value has no effect when used with balance method hybrid_sampling. Default: 1.0
Privileges
Non-superusers:
- SELECT privileges on the input relation
- CREATE privileges on the output view schema
Examples
=> CREATE TABLE backyard_bugs (id identity, bug_type int, finder varchar(20));
CREATE TABLE
=> COPY backyard_bugs FROM STDIN;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> 1|Ants
>> 1|Beetles
>> 3|Ladybugs
>> 3|Ants
>> 3|Beetles
>> 3|Caterpillars
>> 2|Ladybugs
>> 3|Ants
>> 3|Beetles
>> 1|Ladybugs
>> 3|Ladybugs
>> \.
=> SELECT bug_type, COUNT(bug_type) FROM backyard_bugs GROUP BY bug_type;
bug_type | COUNT
----------+-------
2 | 1
1 | 3
3 | 7
(3 rows)
=> SELECT BALANCE('backyard_bugs_balanced', 'backyard_bugs', 'bug_type', 'under_sampling');
BALANCE
--------------------------
Finished in 1 iteration
(1 row)
=> SELECT bug_type, COUNT(bug_type) FROM backyard_bugs_balanced GROUP BY bug_type;
bug_type | COUNT
----------+-------
2 | 1
1 | 2
3 | 1
(3 rows)
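To skew rather than fully equalize the class distribution, pass the sampling_ratio parameter. The following is a minimal sketch against the same backyard_bugs table; the output view name and ratio value are illustrative, and the resulting row counts vary because of the random sampling noted above:
=> SELECT BALANCE('backyard_bugs_ratio', 'backyard_bugs', 'bug_type', 'under_sampling'
                  USING PARAMETERS sampling_ratio=2.0);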
2 - CORR_MATRIX
Takes an input relation with numeric columns, and calculates the Pearson Correlation Coefficient
between each pair of its input columns. The function is implemented as a Multi-Phase Transform function.
Syntax
CORR_MATRIX ( input-columns ) OVER()
Arguments
input-columns
- A comma-separated list of the columns in the input table. The input columns can be of any numeric type or BOOL, but they will be converted internally to FLOAT. The number of input columns must be more than 1 and not more than 1600.
Returns
CORR_MATRIX returns the correlation matrix in triplet format. That is, each pair-wise correlation is identified by three returned columns: name of the first variable, name of the second variable, and the correlation value of the pair. The function also returns two extra columns: number_of_ignored_input_rows and number_of_processed_input_rows. The values of the fourth and fifth columns indicate the number of rows from the input that are ignored or used, respectively, to calculate the corresponding correlation value. Any input pair with NULL, Inf, or NaN is ignored.
The correlation matrix is symmetric with a value of 1 on all diagonal elements; therefore, it can return only the value of elements above the diagonals—that is, the upper triangle. Nevertheless, the function returns the entire matrix to simplify any later operations. Then, the number of output rows is:
(#input-columns)^2
The first two output columns are of type VARCHAR(128), the third one is of type FLOAT, and the last two are of type INT.
Notes
- The contents of the OVER clause must be empty.
- The function returns no rows when the input table is empty.
- When any of X_i and Y_i is NULL, Inf, or NaN, the pair will not be included in the calculation of CORR(X, Y). That is, any input pair with NULL, Inf, or NaN is ignored.
- For the pair of (X,X), regardless of the contents of X: CORR(X,X) = 1, number_of_ignored_input_rows = 0, and number_of_processed_input_rows = #input_rows.
- When (N * SUMX2 == SUMX * SUMX) or (N * SUMY2 == SUMY * SUMY), the value of CORR(X, Y) will be NULL. In theory this can happen in the case of a column with constant values; nevertheless, it may not always be observed because of rounding error.
- In the special case where all pair values of (X_i, Y_i) contain NULL, Inf, or NaN, and X != Y: CORR(X, Y) = NULL.
Examples
The following example uses the iris dataset.
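As a minimal sketch, assuming the iris table has numeric columns Sepal_Length, Sepal_Width, Petal_Length, and Petal_Width (column names are illustrative), each returned row names two variables and gives their correlation along with the processed and ignored row counts:
=> SELECT CORR_MATRIX (Sepal_Length, Sepal_Width, Petal_Length, Petal_Width) OVER()
   FROM iris;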
3 - DETECT_OUTLIERS
Returns the outliers in a data set based on the outlier threshold. The output is a table that contains the outliers. DETECT_OUTLIERS uses the detection method robust_zscore to normalize each input column. The function then identifies as outliers all rows that contain a normalized value greater than the default or specified threshold.
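For reference, a robust z-score is commonly computed as |x - median| / (1.4826 * MAD), where MAD is the median absolute deviation of the column; assuming that standard definition, a row is flagged as an outlier when this quantity exceeds outlier_threshold in any of the specified columns.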
This is a meta-function. You must call meta-functions in a top-level SELECT statement.
Behavior type
Volatile
Syntax
DETECT_OUTLIERS ( 'output-table', 'input-relation', 'input-columns', 'detection-method'
[ USING PARAMETERS
[outlier_threshold = threshold]
[, exclude_columns = 'excluded-columns']
[, partition_columns = 'partition-columns'] ] )
Arguments
output-table
- The name of the table where Vertica saves rows that are outliers along the chosen input-columns. All columns are present in this table.
input-relation
- The table or view that contains outlier data. If the input relation is defined in Hive, use SYNC_WITH_HCATALOG_SCHEMA to sync the hcatalog schema, and then run the machine learning function.
input-columns
- Comma-separated list of columns to use from the input relation, or asterisk (*) to select all columns. Input columns must be of type numeric.
detection-method
- The outlier detection method to use, set to robust_zscore.
Parameters
outlier_threshold
- The minimum normalized value in a row that is used to identify that row as an outlier. Default: 3.0
exclude_columns
- Comma-separated list of column names from input-columns to exclude from processing.
partition_columns
- Comma-separated list of column names from the input table or view that defines the partitions. DETECT_OUTLIERS detects outliers among each partition separately. Default: empty list
Privileges
Non-superusers:
- SELECT privileges on the input relation
- CREATE privileges on the output table
Examples
The following example shows how to use DETECT_OUTLIERS:
=> CREATE TABLE baseball_roster (id identity, last_name varchar(30), hr int, avg float);
CREATE TABLE
=> COPY baseball_roster FROM STDIN;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> Polo|7|.233
>> Gloss|45|.170
>> Gus|12|.345
>> Gee|1|.125
>> Laus|3|.095
>> Hilltop|16|.222
>> Wicker|78|.333
>> Scooter|0|.121
>> Hank|999999|.8888
>> Popup|35|.378
>> \.
=> SELECT * FROM baseball_roster;
id | last_name | hr | avg
----+-----------+--------+--------
3 | Gus | 12 | 0.345
4 | Gee | 1 | 0.125
6 | Hilltop | 16 | 0.222
10 | Popup | 35 | 0.378
1 | Polo | 7 | 0.233
7 | Wicker | 78 | 0.333
9 | Hank | 999999 | 0.8888
2 | Gloss | 45 | 0.17
5 | Laus | 3 | 0.095
8 | Scooter | 0 | 0.121
(10 rows)
=> SELECT DETECT_OUTLIERS('baseball_outliers', 'baseball_roster', 'id, hr, avg', 'robust_zscore' USING PARAMETERS
outlier_threshold=3.0);
DETECT_OUTLIERS
--------------------------
Detected 2 outliers
(1 row)
=> SELECT * FROM baseball_outliers;
id | last_name | hr | avg
----+-----------+------------+-------------
7 | Wicker | 78 | 0.333
9 | Hank | 999999 | 0.8888
(2 rows)
4 - IFOREST
Trains and returns an isolation forest (iForest) model. After you train the model, you can use the APPLY_IFOREST function to predict outliers in an input relation.
For more information about how the iForest algorithm works, see Isolation Forest.
This is a meta-function. You must call meta-functions in a top-level SELECT statement.
Behavior type
Volatile
Syntax
IFOREST( 'model-name', 'input-relation', 'input-columns' [ USING PARAMETERS param=value[,...] ] )
Arguments
model-name
- Identifies the model to create, where model-name conforms to conventions described in Identifiers. It must also be unique among all names of sequences, tables, projections, views, and models within the same schema.
input-relation
- The table or view that contains the input data for IFOREST.
input-columns
- Comma-separated list of columns to use from the input relation, or asterisk (*) to select all columns. Columns must be of types CHAR, VARCHAR, BOOL, INT, or FLOAT. Columns of types CHAR, VARCHAR, and BOOL are treated as categorical features; all others are treated as numeric features.
Parameters
exclude_columns
- Comma-separated list of column names from input-columns to exclude from processing. Default: Empty string ('')
ntree
- Integer in the range [1, 1000], specifies the number of trees in the forest. Default: 100
sampling_size
- Float in the range (0.0, 1.0], specifies the portion of the input data set that is randomly picked, without replacement, for training each tree. Default: 0.632
col_sample_by_tree
- Float in the range (0.0, 1.0], specifies the fraction of columns that are randomly picked for training each tree. Default: 1.0
max_depth
- Integer in the range [1, 100], specifies the maximum depth for growing each tree. Default: 10
nbins
- Integer in the range [2, 1000], specifies the number of bins used to discretize continuous features. Default: 32
Model attributes
details
- Details about the function's predictor columns, including:
- predictor: Names of the predictors in the same order specified when training the model.
- type: Types of the predictors in the same order as their names in predictor.
tree_count
- Number of trees in the model.
rejected_row_count
- Number of rows in input-relation that were skipped because they contained an invalid value.
accepted_row_count
- Total number of rows in input-relation minus rejected_row_count.
call_string
- Value of all input arguments that were specified at the time the function was called.
Privileges
Non-superusers:
- CREATE privileges on the schema where the model is created
- SELECT privileges on the input relation
Examples
In the following example, the input data to the function contains columns of type INT, VARCHAR, and FLOAT:
=> SELECT IFOREST('baseball_anomalies','baseball','team, hr, hits, avg, salary' USING PARAMETERS ntree=75, sampling_size=0.7,
max_depth=15);
IFOREST
----------
Finished
(1 row)
You can verify that all the input columns were read in correctly by calling GET_MODEL_SUMMARY and checking the details section:
=> SELECT GET_MODEL_SUMMARY(USING PARAMETERS model_name='baseball_anomalies');
GET_MODEL_SUMMARY
-------------------------------------------------------------------------------------------------------------------------------------
===========
call_string
===========
SELECT iforest('public.baseball_anomalies', 'baseball', 'team, hr, hits, avg, salary' USING PARAMETERS exclude_columns='', ntree=75,
sampling_size=0.7, col_sample_by_tree=1, max_depth=15, nbins=32);
=======
details
=======
predictor| type
---------+----------------
team |char or varchar
hr | int
hits | int
avg |float or numeric
salary |float or numeric
===============
Additional Info
===============
Name |Value
------------------+-----
tree_count | 75
rejected_row_count| 0
accepted_row_count|1000
(1 row)
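After training, you can score rows with APPLY_IFOREST. The following is a minimal sketch against the same baseball table; the threshold value is illustrative, and the output is a row type with fields such as is_anomaly and anomaly_score (output omitted):
=> SELECT team, hr, salary,
          APPLY_IFOREST(team, hr, hits, avg, salary
                        USING PARAMETERS model_name='baseball_anomalies', threshold=0.6) AS prediction
   FROM baseball;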
5 - IMPUTE
Imputes missing values in a data set with either the mean or the mode, based on observed values for a variable in each column. This function supports numeric and categorical data types.
This is a meta-function. You must call meta-functions in a top-level SELECT statement.
Behavior type
Volatile
Syntax
IMPUTE( 'output-view', 'input-relation', 'input-columns', 'method'
[ USING PARAMETERS [exclude_columns = 'excluded-columns'] [, partition_columns = 'partition-columns'] ] )
Arguments
output-view
- Name of the view that shows the input table with imputed values in place of missing values. In this view, rows without missing values are kept intact while the rows with missing values are modified according to the specified method.
input-relation
- The table or view that contains the data for missing-value imputation. If the input relation is defined in Hive, use SYNC_WITH_HCATALOG_SCHEMA to sync the hcatalog schema, and then run the machine learning function.
input-columns
- Comma-separated list of input columns where missing values will be replaced, or asterisk (*) to specify all columns. All columns must be of type numeric or BOOLEAN.
method
- The method to compute the missing value replacements, one of the following:
- mean: The missing values in each column will be replaced by the mean of that column. This method can be used for numeric data only.
- mode: The missing values in each column will be replaced by the most frequent value in that column. This method can be used for categorical data only.
Parameters
exclude_columns
- Comma-separated list of column names from input-columns to exclude from processing.
partition_columns
- Comma-separated list of column names from the input relation that defines the partitions.
Privileges
Non-superusers:
- SELECT privileges on the input relation
- CREATE privileges on the output view schema
Examples
Execute IMPUTE on the small_input_impute table, specifying the mean method:
=> SELECT impute('output_view','small_input_impute', 'pid, x1,x2,x3,x4','mean'
USING PARAMETERS exclude_columns='pid');
impute
--------------------------
Finished in 1 iteration
(1 row)
Execute IMPUTE, specifying the mode method:
=> SELECT impute('output_view3','small_input_impute', 'pid, x5,x6','mode' USING PARAMETERS exclude_columns='pid');
impute
--------------------------
Finished in 1 iteration
(1 row)
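To verify the result, you can compare missing-value counts before and after imputation. A minimal sketch, assuming small_input_impute has a numeric column x1:
=> SELECT COUNT(*) FROM small_input_impute WHERE x1 IS NULL;
=> SELECT COUNT(*) FROM output_view WHERE x1 IS NULL;  -- expect 0 after imputation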
See also
Imputing missing values
6 - NORMALIZE
Runs a normalization algorithm on an input relation. The output is a view with the normalized data.
Note
This function differs from NORMALIZE_FIT, which creates and stores a model rather than creating a view definition. This can lead to different performance characteristics between the two functions.
This is a meta-function. You must call meta-functions in a top-level SELECT statement.
Behavior type
Volatile
Syntax
NORMALIZE ( 'output-view', 'input-relation', 'input-columns', 'normalization-method'
[ USING PARAMETERS exclude_columns = 'excluded-columns' ] )
Arguments
output-view
- The name of the view showing the input relation with normalized data replacing the specified input columns.
input-relation
- The table or view that contains the data to normalize. If the input relation is defined in Hive, use SYNC_WITH_HCATALOG_SCHEMA to sync the hcatalog schema, and then run the machine learning function.
input-columns
- Comma-separated list of numeric input columns that contain the values to normalize, or asterisk (*) to select all columns.
normalization-method
- The normalization method to use, one of the following:
- minmax
- zscore
- robust_zscore
If infinity values appear in the table, the method ignores those values.
Parameters
exclude_columns
- Comma-separated list of column names from input-columns to exclude from processing.
Privileges
Non-superusers:
- SELECT privileges on the input relation
- CREATE privileges on the output view schema
Examples
These examples show how you can use the NORMALIZE function on the wt and hp columns in the mtcars table.
Execute the NORMALIZE function, and specify the minmax method:
=> SELECT NORMALIZE('mtcars_norm', 'mtcars',
'wt, hp', 'minmax');
NORMALIZE
--------------------------
Finished in 1 iteration
(1 row)
Execute the NORMALIZE function, and specify the zscore method:
=> SELECT NORMALIZE('mtcars_normz','mtcars',
'wt, hp', 'zscore');
NORMALIZE
--------------------------
Finished in 1 iteration
(1 row)
Execute the NORMALIZE function, and specify the robust_zscore method:
=> SELECT NORMALIZE('mtcars_normz', 'mtcars',
'wt, hp', 'robust_zscore');
NORMALIZE
--------------------------
Finished in 1 iteration
(1 row)
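For reference, the three methods correspond to the following standard transformations (assumed here from their common definitions; see Normalizing data for the exact formulas Vertica applies):
- minmax: (x - min) / (max - min), which maps each column onto [0, 1]
- zscore: (x - mean) / stddev
- robust_zscore: (x - median) / (1.4826 * MAD), where MAD is the median absolute deviation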
See also
Normalizing data
7 - NORMALIZE_FIT
Note
This function differs from NORMALIZE, which directly outputs a view with normalized results, rather than storing normalization parameters into a model for later operation.
NORMALIZE_FIT computes normalization parameters for each of the specified columns in an input relation. The resulting model stores the normalization parameters. For example, for MinMax normalization, the minimum and maximum value of each column are stored in the model. The generated model serves as input to the functions APPLY_NORMALIZE and REVERSE_NORMALIZE.
This is a meta-function. You must call meta-functions in a top-level SELECT statement.
Behavior type
Volatile
Syntax
NORMALIZE_FIT ( 'model-name', 'input-relation', 'input-columns', 'normalization-method'
[ USING PARAMETERS [exclude_columns = 'excluded-columns'] [, output_view = 'output-view'] ] )
Arguments
model-name
- Identifies the model to create, where model-name conforms to conventions described in Identifiers. It must also be unique among all names of sequences, tables, projections, views, and models within the same schema.
input-relation
- The table or view that contains the data to normalize. If the input relation is defined in Hive, use SYNC_WITH_HCATALOG_SCHEMA to sync the hcatalog schema, and then run the machine learning function.
input-columns
- Comma-separated list of columns to use from the input relation, or asterisk (*) to select all columns. Input columns must be of data type numeric.
normalization-method
- The normalization method to use, one of the following:
- minmax
- zscore
- robust_zscore
If you specify robust_zscore, NORMALIZE_FIT uses the function APPROXIMATE_MEDIAN [aggregate].
All normalization methods ignore infinity, negative infinity, or NULL values in the input relation.
Parameters
exclude_columns
- Comma-separated list of column names from input-columns to exclude from processing.
output_view
- Name of the view that contains all columns from the input relation, with the specified input columns normalized.
Model attributes
data
- Normalization method set to minmax:
- colNames: Model column names
- mins: Minimum value of each column
- maxes: Maximum value of each column
Privileges
Non-superusers:
- CREATE privileges on the schema where the model is created
- SELECT privileges on the input relation
- CREATE privileges on the output view schema
Examples
The following example creates a model with NORMALIZE_FIT using the wt and hp columns in table mtcars, and then uses this model in successive calls to APPLY_NORMALIZE and REVERSE_NORMALIZE.
=> SELECT NORMALIZE_FIT('mtcars_normfit', 'mtcars', 'wt,hp', 'minmax');
NORMALIZE_FIT
---------------
Success
(1 row)
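To inspect the stored parameters, you can read the model's data attribute, which for a minmax model holds the colNames, mins, and maxes listed under Model attributes (a sketch; output omitted):
=> SELECT GET_MODEL_ATTRIBUTE (USING PARAMETERS model_name='mtcars_normfit', attr_name='data');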
The following call to APPLY_NORMALIZE specifies the hp and cyl columns in table mtcars, where hp is in the normalization model and cyl is not in the normalization model:
=> CREATE TABLE mtcars_normalized AS SELECT APPLY_NORMALIZE (hp, cyl USING PARAMETERS model_name = 'mtcars_normfit') FROM mtcars;
CREATE TABLE
=> SELECT * FROM mtcars_normalized;
hp | cyl
--------------------+-----
0.434628975265018 | 8
0.681978798586572 | 8
0.434628975265018 | 6
1 | 8
0.540636042402827 | 8
0 | 4
0.681978798586572 | 8
0.0459363957597173 | 4
0.434628975265018 | 8
0.204946996466431 | 6
0.250883392226148 | 6
0.049469964664311 | 4
0.204946996466431 | 6
0.201413427561837 | 4
0.204946996466431 | 6
0.250883392226148 | 6
0.049469964664311 | 4
0.215547703180212 | 4
0.0353356890459364 | 4
0.187279151943463 | 6
0.452296819787986 | 8
0.628975265017668 | 8
0.346289752650177 | 8
0.137809187279152 | 4
0.749116607773852 | 8
0.144876325088339 | 4
0.151943462897526 | 4
0.452296819787986 | 8
0.452296819787986 | 8
0.575971731448763 | 8
0.159010600706714 | 4
0.346289752650177 | 8
(32 rows)
A call to REVERSE_NORMALIZE then restores the hp values to their original scale:
=> SELECT REVERSE_NORMALIZE (hp, cyl USING PARAMETERS model_name='mtcars_normfit') FROM mtcars_normalized;
hp | cyl
-----+-----
175 | 8
245 | 8
175 | 6
335 | 8
205 | 8
52 | 4
245 | 8
65 | 4
175 | 8
110 | 6
123 | 6
66 | 4
110 | 6
109 | 4
110 | 6
123 | 6
66 | 4
113 | 4
62 | 4
105 | 6
180 | 8
230 | 8
150 | 8
91 | 4
264 | 8
93 | 4
95 | 4
180 | 8
180 | 8
215 | 8
97 | 4
150 | 8
(32 rows)
The following call to REVERSE_NORMALIZE also specifies the hp and cyl columns in table mtcars, where hp is in normalization model mtcars_normfit, and cyl is not in the normalization model.
=> SELECT REVERSE_NORMALIZE (hp, cyl USING PARAMETERS model_name='mtcars_normfit') FROM mtcars_normalized;
hp | cyl
-----------------+-----
205.000005722046 | 8
150.000000357628 | 8
150.000000357628 | 8
93.0000016987324 | 4
174.99999666214 | 8
94.9999992102385 | 4
214.999997496605 | 8
97.0000009387732 | 4
245.000006556511 | 8
174.99999666214 | 6
335 | 8
245.000006556511 | 8
62.0000002086163 | 4
174.99999666214 | 8
230.000002026558 | 8
52 | 4
263.999997675419 | 8
109.999999523163 | 6
123.000002324581 | 6
64.9999996386468 | 4
66.0000005029142 | 4
112.999997898936 | 4
109.999999523163 | 6
180.000000983477 | 8
180.000000983477 | 8
108.999998658895 | 4
109.999999523163 | 6
104.999999418855 | 6
123.000002324581 | 6
180.000000983477 | 8
66.0000005029142 | 4
90.9999999701977 | 4
(32 rows)
See also
Normalizing data
8 - ONE_HOT_ENCODER_FIT
Generates a sorted list of each of the category levels for each feature to be encoded, and stores the model.
This is a meta-function. You must call meta-functions in a top-level SELECT statement.
Behavior type
Volatile
Syntax
ONE_HOT_ENCODER_FIT ( 'model-name', 'input-relation', 'input-columns'
[ USING PARAMETERS
[exclude_columns = 'excluded-columns']
[, output_view = 'output-view']
[, extra_levels = 'category-levels'] ] )
Arguments
model-name
- Identifies the model to create, where model-name conforms to conventions described in Identifiers. It must also be unique among all names of sequences, tables, projections, views, and models within the same schema.
input-relation
- The table or view that contains the data for one hot encoding. If the input relation is defined in Hive, use SYNC_WITH_HCATALOG_SCHEMA to sync the hcatalog schema, and then run the machine learning function.
input-columns
- Comma-separated list of columns to use from the input relation, or asterisk (*) to select all columns. Input columns must be INTEGER, BOOLEAN, VARCHAR, or dates.
Parameters
exclude_columns
- Comma-separated list of column names from input-columns to exclude from processing.
output_view
- The name of the view that stores the input relation and the one hot encodings. Columns are returned in the order they appear in the input relation, with the one-hot encoded columns appended after the original columns.
extra_levels
- Additional levels in each category that are not in the input relation. This parameter should be passed as a string that conforms with the JSON standard, with category names as keys, and lists of extra levels in each category as values. See the sketch after this list.
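For example, a hypothetical call that declares two extra levels for an integer column named bug_type (all names and values here are illustrative):
=> SELECT ONE_HOT_ENCODER_FIT ('ohe_extra_model', 'backyard_bugs', 'bug_type'
       USING PARAMETERS extra_levels='{"bug_type": [4, 5]}');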
Model attributes
call_string
- The value of all input arguments that were specified at the time the function was called.
varchar_categories, integer_categories, boolean_categories, date_categories
- Settings for all categories of the corresponding type:
- category_name: Column name
- category_level: Levels of the category, sorted for each category
- category_level_index: Index of this categorical level in the sorted list of levels for the category.
Privileges
Non-superusers:
- CREATE privileges on the schema where the model is created
- SELECT privileges on the input relation
- CREATE privileges on the output view schema
Examples
=> SELECT ONE_HOT_ENCODER_FIT ('one_hot_encoder_model','mtcars','*'
USING PARAMETERS exclude_columns='mpg,disp,drat,wt,qsec,vs,am');
ONE_HOT_ENCODER_FIT
--------------------
Success
(1 row)
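To produce the encoded columns, pass the model to APPLY_ONE_HOT_ENCODER. A minimal sketch against the same mtcars table; the drop_first parameter shown is optional and illustrative:
=> SELECT APPLY_ONE_HOT_ENCODER (cyl USING PARAMETERS model_name='one_hot_encoder_model',
       drop_first='true') FROM mtcars;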
9 - PCA
Computes principal components from the input table/view. The results are saved in a PCA model. Internally, PCA finds the components by using SVD on the covariance matrix built from the input data. The singular values of this decomposition are also saved as part of the PCA model. The signs of all elements of a principal component could be flipped all together on different runs.
This is a meta-function. You must call meta-functions in a top-level SELECT statement.
Behavior type
Volatile
Syntax
PCA ( 'model-name', 'input-relation', 'input-columns'
[ USING PARAMETERS
[exclude_columns = 'excluded-columns']
[, num_components = num-components]
[, scale = is-scaled]
[, method = 'method'] ] )
Arguments
model-name
- Identifies the model to create, where model-name conforms to conventions described in Identifiers. It must also be unique among all names of sequences, tables, projections, views, and models within the same schema.
input-relation
- The table or view that contains the input data for PCA.
input-columns
- Comma-separated list of columns to use from the input relation, or asterisk (*) to select all columns. All input columns must be a numeric data type.
Parameters
exclude_columns
- Comma-separated list of column names from input-columns to exclude from processing.
num_components
- The number of components to keep in the model. If this value is not provided, all components are kept. The maximum number of components is the number of non-zero singular values returned by the internal call to SVD. This number is less than or equal to min(number of columns, number of rows).
scale
- A Boolean value that specifies whether to standardize the columns during the preparation step:
- True: Use a correlation matrix instead of a covariance matrix.
- False (default)
method
- The method used to calculate PCA, can be set to LAPACK.
Model attributes
columns
- The information about columns from the input relation used for creating the PCA model:
- index
- name
singular_values
- The information about the singular values found, sorted in descending order:
- index
- value
- explained_variance: percentage of the variance in the data that can be attributed to this singular value
- accumulated_explained_variance: percentage of the variance in the data that can be retained if we drop all singular values after this current one
principal_components
- The principal components corresponding to the singular values mentioned above:
- index: indices of the elements in each component
- PC1
- PC2
- ...
counters
- The information collected during training the model, stored as name-value pairs:
- counter_name, one of the following:
- accepted_row_count: number of valid rows in the data
- rejected_row_count: number of invalid rows (having NULL, INF, or NaN) in the data
- iteration_count: number of iterations, always 1 for the current implementation of PCA
- counter_value
call_string
- The function call that created the model.
Privileges
Non-superusers:
- CREATE privileges on the schema where the model is created
- SELECT privileges on the input relation
Examples
=> SELECT PCA ('pcamodel', 'world','country,HDI,em1970,em1971,em1972,em1973,em1974,em1975,em1976,em1977,
em1978,em1979,em1980,em1981,em1982,em1983,em1984 ,em1985,em1986,em1987,em1988,em1989,em1990,em1991,em1992,
em1993,em1994,em1995,em1996,em1997,em1998,em1999,em2000,em2001,em2002,em2003,em2004,em2005,em2006,em2007,
em2008,em2009,em2010,gdp1970,gdp1971,gdp1972,gdp1973,gdp1974,gdp1975,gdp1976,gdp1977,gdp1978,gdp1979,gdp1980,
gdp1981,gdp1982,gdp1983,gdp1984,gdp1985,gdp1986,gdp1987,gdp1988,gdp1989,gdp1990,gdp1991,gdp1992,gdp1993,
gdp1994,gdp1995,gdp1996,gdp1997,gdp1998,gdp1999,gdp2000,gdp2001,gdp2002,gdp2003,gdp2004,gdp2005,gdp2006,
gdp2007,gdp2008,gdp2009,gdp2010' USING PARAMETERS exclude_columns='HDI,country');
PCA
---------------------------------------------------------------
Finished in 1 iterations.
Accepted Rows: 96 Rejected Rows: 0
(1 row)
=> CREATE TABLE worldPCA AS SELECT
APPLY_PCA (HDI,country,em1970,em1971,em1972,em1973,em1974,em1975,em1976,em1977,em1978,em1979,
em1980,em1981,em1982,em1983,em1984 ,em1985,em1986,em1987,em1988,em1989,em1990,em1991,em1992,em1993,em1994,
em1995,em1996,em1997,em1998,em1999,em2000,em2001,em2002,em2003,em2004,em2005,em2006,em2007,em2008,em2009,
em2010,gdp1970,gdp1971,gdp1972,gdp1973,gdp1974,gdp1975,gdp1976,gdp1977,gdp1978,gdp1979,gdp1980,gdp1981,gdp1982,
gdp1983,gdp1984,gdp1985,gdp1986,gdp1987,gdp1988,gdp1989,gdp1990,gdp1991,gdp1992,gdp1993,gdp1994,gdp1995,
gdp1996,gdp1997,gdp1998,gdp1999,gdp2000,gdp2001,gdp2002,gdp2003,gdp2004,gdp2005,gdp2006,gdp2007,gdp2008,
gdp2009,gdp2010 USING PARAMETERS model_name='pcamodel', exclude_columns='HDI, country', key_columns='HDI,
country',cutoff=.3) OVER () FROM world;
CREATE TABLE
=> SELECT * FROM worldPCA;
HDI | country | col1
------+---------------------+-------------------
0.886 | Belgium | 79002.2946705704
0.699 | Belize | -25631.6670012556
0.427 | Benin | -40373.4104598122
0.805 | Chile | -16805.7940082156
0.687 | China | -37279.2893141103
0.744 | Costa Rica | -19505.5631231635
0.4 | Cote d'Ivoire | -38058.2060339272
0.776 | Cuba | -23724.5779612041
0.895 | Denmark | 117325.594028813
0.644 | Egypt | -34609.9941604549
...
(96 rows)
=> SELECT APPLY_INVERSE_PCA (HDI, country, col1
USING PARAMETERS model_name = 'pcamodel', exclude_columns='HDI,country',
key_columns = 'HDI, country') OVER () FROM worldPCA;
(Output not shown: each of the 96 rows contains the key columns HDI and country, followed by the reconstructed em1970 through em2010 and gdp1970 through gdp2010 values.)
(96 rows)
10 - SUMMARIZE_CATCOL
Returns a statistical summary of categorical data input, in three columns:
- CATEGORY: Categorical levels, of the same SQL data type as the summarized column
- COUNT: The number of rows in each category level, of type INTEGER
- PERCENT: Represents category percentage, of type FLOAT
Syntax
SUMMARIZE_CATCOL (target-column
[ USING PARAMETERS TOPK = topk-value [, WITH_TOTALCOUNT = show-total] ] )
OVER()
Arguments
target-column
- The name of the input column to summarize, one of the following data types:
- BOOLEAN
- FLOAT
- INTEGER
- DATE
- CHAR/VARCHAR
Parameters
TOPK
- Integer, specifies how many of the most frequent rows to include in the output.
WITH_TOTALCOUNT
- A Boolean value that specifies whether the table contains a heading row that displays the total number of rows displayed in the target column, and a percent equal to 100. Default: true
Examples
This example shows the categorical summary for the current_salary column in the salary_data table. The output of the query shows the column category, count, and percent. The first column gives the categorical levels, with the same SQL data type as the input column, the second column gives a count of that value, and the third column gives a percentage.
=> SELECT SUMMARIZE_CATCOL (current_salary USING PARAMETERS TOPK = 5) OVER() FROM salary_data;
CATEGORY | COUNT | PERCENT
---------+-------+---------
| 1000 | 100
39004 | 2 | 0.2
35321 | 1 | 0.1
36313 | 1 | 0.1
36538 | 1 | 0.1
36562 | 1 | 0.1
(6 rows)
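To omit the heading row that carries the total count, set WITH_TOTALCOUNT to false; the output then lists only the five most frequent values (a sketch):
=> SELECT SUMMARIZE_CATCOL (current_salary USING PARAMETERS TOPK = 5, WITH_TOTALCOUNT = false) OVER() FROM salary_data;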
11 - SUMMARIZE_NUMCOL
Returns a statistical summary of columns in a Vertica table:
- Count
- Mean
- Standard deviation
- Min/max values
- Approximate percentile
- Median
All summary values are FLOAT data types, except INTEGER for count.
Syntax
SUMMARIZE_NUMCOL (input-columns [ USING PARAMETERS exclude_columns = 'excluded-columns'] ) OVER()
Arguments
input-columns
- Comma-separated list of columns to use from the input relation, or asterisk (*) to select all columns. All columns must be a numeric data type. If you select all columns, SUMMARIZE_NUMCOL summarizes all columns.
Parameters
exclude_columns
- Comma-separated list of column names from input-columns to exclude from processing.
Examples
Show the statistical summary for the age and salary columns in the employee table:
=> SELECT SUMMARIZE_NUMCOL(* USING PARAMETERS exclude_columns='id,name,gender,title') OVER() FROM employee;
COLUMN | COUNT | MEAN | STDDEV | MIN | PERC25 | MEDIAN | PERC75 | MAX
---------------+-------+------------+------------------+---------+---------+---------+-----------+--------
age | 5 | 63.4 | 19.3209730603818 | 44 | 45 | 67 | 71 | 90
salary | 5 | 3456.76 | 1756.78754300285 | 1234.56 | 2345.67 | 3456.78 | 4567.89 | 5678.9
(2 rows)
12 - SVD
Computes singular values (the diagonal of the S matrix) and right singular vectors (the V matrix) of an SVD decomposition of the input relation. The results are saved as an SVD model. The signs of all elements of a singular vector in SVD could be flipped all together on different runs.
This is a meta-function. You must call meta-functions in a top-level SELECT statement.
Behavior type
Volatile
Syntax
SVD ( 'model-name', 'input-relation', 'input-columns'
[ USING PARAMETERS
[exclude_columns = 'excluded-columns']
[, num_components = num-components]
[, method = 'method'] ] )
Arguments
model-name
- Identifies the model to create, where model-name conforms to conventions described in Identifiers. It must also be unique among all names of sequences, tables, projections, views, and models within the same schema.
input-relation
- The table or view that contains the input data for SVD.
input-columns
- Comma-separated list of columns to use from the input relation, or asterisk (*) to select all columns. Input columns must be a numeric data type.
Parameters
exclude_columns
- Comma-separated list of column names from input-columns to exclude from processing.
num_components
- The number of components to keep in the model. The maximum number of components is the number of non-zero singular values computed, which is less than or equal to min(number of columns, number of rows). If you omit this parameter, all components are kept.
method
- The method used to calculate SVD, can be set to LAPACK.
Model attributes
columns
- The information about columns from the input relation used for creating the SVD model:
- index
- name
singular_values
- The information about the singular values found, sorted in descending order:
- index
- value
- explained_variance: percentage of the variance in the data that can be attributed to this singular value
- accumulated_explained_variance: percentage of the variance in the data that can be retained if we drop all singular values after this current one
right_singular_vectors
- The right singular vectors corresponding to the singular values mentioned above:
- index: indices of the elements in each vector
- vector1
- vector2
- ...
counters
- The information collected during training the model, stored as name-value pairs:
- counter_name, one of the following:
- accepted_row_count: number of valid rows in the data
- rejected_row_count: number of invalid rows (having NULL, INF, or NaN) in the data
- iteration_count: number of iterations, always 1 for the current implementation of SVD
- counter_value
call_string
- The function call that created the model.
Privileges
Non-superusers:
- CREATE privileges on the schema where the model is created
- SELECT privileges on the input relation
Examples
=> SELECT SVD ('svdmodel', 'small_svd', 'x1,x2,x3,x4');
SVD
--------------------------------------------------------------
Finished in 1 iterations.
Accepted Rows: 8 Rejected Rows: 0
(1 row)
=> CREATE TABLE transform_svd AS SELECT
APPLY_SVD (id, x1, x2, x3, x4 USING PARAMETERS model_name='svdmodel', exclude_columns='id', key_columns='id')
OVER () FROM small_svd;
CREATE TABLE
=> SELECT * FROM transform_svd;
id | col1 | col2 | col3 | col4
----+-------------------+---------------------+---------------------+--------------------
4 | 0.44849499240202 | -0.347260956311326 | 0.186958376368345 | 0.378561270493651
6 | 0.17652411036246 | -0.0753183783382909 | -0.678196192333598 | 0.0567124770173372
1 | 0.494871802886819 | 0.161721379259287 | 0.0712816417153664 | -0.473145877877408
2 | 0.17652411036246 | -0.0753183783382909 | -0.678196192333598 | 0.0567124770173372
3 | 0.150974762654569 | 0.589561842046029 | 0.00392654610109522 | 0.360011163271921
5 | 0.494871802886819 | 0.161721379259287 | 0.0712816417153664 | -0.473145877877408
8 | 0.44849499240202 | -0.347260956311326 | 0.186958376368345 | 0.378561270493651
7 | 0.150974762654569 | 0.589561842046029 | 0.00392654610109522 | 0.360011163271921
(8 rows)
=> SELECT APPLY_INVERSE_SVD (* USING PARAMETERS model_name='svdmodel', exclude_columns='id',
key_columns='id') OVER () FROM transform_svd;
id | x1 | x2 | x3 | x4
----+------------------+------------------+------------------+------------------
4 | 91.4056627665577 | 44.7629617207482 | 83.1704961993117 | 38.9274292265543
6 | 20.6468626294368 | 9.30974906868751 | 8.71006863405534 | 6.5855928603967
7 | 31.2494347777156 | 20.6336519003026 | 27.5668287751507 | 5.84427645886865
1 | 107.93376580719 | 51.6980548011917 | 97.9665796560552 | 40.4918236881051
2 | 20.6468626294368 | 9.30974906868751 | 8.71006863405534 | 6.5855928603967
3 | 31.2494347777156 | 20.6336519003026 | 27.5668287751507 | 5.84427645886865
5 | 107.93376580719 | 51.6980548011917 | 97.9665796560552 | 40.4918236881051
8 | 91.4056627665577 | 44.7629617207482 | 83.1704961993117 | 38.9274292265543
(8 rows)
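To examine how much variance each component captures, you can read the model's singular_values attribute (a sketch; output omitted):
=> SELECT GET_MODEL_ATTRIBUTE (USING PARAMETERS model_name='svdmodel', attr_name='singular_values');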