Data preparation

Vertica supports machine learning functions that prepare data as needed before subjecting it to analysis.

1 - BALANCE

Returns a view with an equal distribution of the input data based on the response_column.

This is a meta-function. You must call meta-functions in a top-level SELECT statement.

Behavior type

Volatile

Syntax

BALANCE ( 'output-view', 'input-relation', 'response-column', 'balance-method'
       [ USING PARAMETERS sampling_ratio=ratio ] )

Arguments

output-view
The name of the view where Vertica saves the balanced data from the input relation.
input-relation
The table or view that contains the data the function uses to create a more balanced data set. If the input relation is defined in Hive, use SYNC_WITH_HCATALOG_SCHEMA to sync the hcatalog schema, and then run the machine learning function.
response-column
Name of the input column that represents the dependent variable, of type VARCHAR or INTEGER.
balance-method
Specifies a method to select data from the minority and majority classes, one of the following:
  • hybrid_sampling: Performs over-sampling and under-sampling on different classes so each class is equally represented.

  • over_sampling: Over-samples all classes, except the class with the largest cardinality, up to the cardinality of the largest class.

  • under_sampling: Under-samples all classes, except the class with the smallest cardinality, down to the cardinality of the smallest class.

  • weighted_sampling: An alias of under_sampling.

Parameters

ratio
The desired ratio between the majority class and the minority class. This value has no effect when used with balance method hybrid_sampling.

Default: 1.0

Privileges

Non-superusers:

  • SELECT privileges on the input relation

  • CREATE privileges on the output view schema

Examples

=> CREATE TABLE backyard_bugs (id identity, bug_type int, finder varchar(20));
CREATE TABLE

=> COPY backyard_bugs FROM STDIN;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> 1|Ants
>> 1|Beetles
>> 3|Ladybugs
>> 3|Ants
>> 3|Beetles
>> 3|Caterpillars
>> 2|Ladybugs
>> 3|Ants
>> 3|Beetles
>> 1|Ladybugs
>> 3|Ladybugs
>> \.

=> SELECT bug_type, COUNT(bug_type) FROM backyard_bugs GROUP BY bug_type;
 bug_type | COUNT
----------+-------
        2 |     1
        1 |     3
        3 |     7
(3 rows)

=> SELECT BALANCE('backyard_bugs_balanced', 'backyard_bugs', 'bug_type', 'under_sampling');
         BALANCE
--------------------------
 Finished in 1 iteration

(1 row)

=> SELECT bug_type, COUNT(bug_type) FROM backyard_bugs_balanced GROUP BY bug_type;
 bug_type | COUNT
----------+-------
        2 |     1
        1 |     2
        3 |     1
(3 rows)
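
To keep an unequal class ratio instead, pass the sampling_ratio parameter. A sketch, with an illustrative view name, that retains roughly two majority-class rows for each minority-class row:

=> SELECT BALANCE('backyard_bugs_2to1', 'backyard_bugs', 'bug_type', 'under_sampling'
                  USING PARAMETERS sampling_ratio=2.0);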

2 - CHI_SQUARED

Computes the conditional chi-square independence test on two categorical variables to find the likelihood that the two variables are independent. To condition the independence test on another set of variables, you can partition the data on these variables using a PARTITION BY clause.

This function is a multi-phase transform function.

Syntax

CHI_SQUARED( 'x-column', 'y-column' 
    [ USING PARAMETERS param=value[,...] ] )

Arguments

x-column, y-column
Columns in the input relation to be tested for dependency with each other. These columns must contain categorical data in numeric format.

Parameters

x_cardinality
Integer in the range [1, 20], the cardinality of x-column. If the cardinality of x-column is less than the default value of 20, setting this parameter can decrease the amount of memory used by the function.

Default: 20

y_cardinality
Integer in the range [1, 20], the cardinality of y-column. If the cardinality of y-column is less than the default value of 20, setting this parameter can decrease the amount of memory used by the function.

Default: 20

alpha
Float in the range (0.0, 1.0), the significance level. If the returned pvalue is less than this value, the null hypothesis, which assumes the variables are independent, is rejected.

Default: 0.05

Returns

The function returns two values:

  • pvalue (float): the confidence that the two variables are independent. If this value is greater than the alpha parameter value, the null hypothesis is accepted and the variables are considered independent.
  • independent (boolean): true if the variables are independent; otherwise, false.

Privileges

SELECT privileges on the input relation

Examples

The following examples use the titanic dataset from the machine learning example data. If you have not downloaded these datasets, see Download the machine learning example data for instructions.

The titanic_training table contains data related to passengers on the Titanic, including:

  • pclass: the ticket class of the passenger, ranging from 1st class to 3rd class
  • survived: whether the passenger survived, where 1 is yes and 0 is no
  • gender: gender of the passenger
  • sibling_and_spouse_count: number of siblings and spouses aboard the Titanic
  • embarkation_point: port of embarkation

To test whether the survival of a passenger is dependent on their ticket class, run the following chi-square test:

=> SELECT CHI_SQUARED(pclass, survived USING PARAMETERS x_cardinality=3, y_cardinality=2, alpha=0.05) OVER() FROM titanic_training;
 pvalue | independent
--------+-------------
      0 | f
(1 row)

With a returned pvalue of zero, the null hypothesis is rejected and you can conclude that the survived and pclass variables are dependent. To test whether this outcome is conditional on the gender of the passenger, partition by the gender column in the OVER clause:

=> SELECT CHI_SQUARED(pclass, survived USING PARAMETERS x_cardinality=3, y_cardinality=2) OVER(PARTITION BY gender) FROM titanic_training;
 pvalue | independent
--------+-------------
      0 | f
(1 row)

As the pvalue is still zero, it is clear that the dependence of the pclass and survived variables is not conditional on the gender of the passenger.

If one of the categorical columns that you want to test is not a numeric type, use the HASH function to convert it into type INT:

=> SELECT CHI_SQUARED(sibling_and_spouse_count, HASH(embarkation_point) USING PARAMETERS alpha=0.05) OVER() FROM titanic_training;
       pvalue       | independent
--------------------+-------------
 0.0753039994044853 | t
(1 row)

The returned pvalue is greater than alpha, meaning the null hypothesis is accepted and the sibling_and_spouse_count and embarkation_point variables are considered independent.

3 - CORR_MATRIX

Takes an input relation with numeric columns, and calculates the Pearson Correlation Coefficient between each pair of its input columns. The function is implemented as a Multi-Phase Transform function.

Syntax

CORR_MATRIX ( input-columns ) OVER()

Arguments

input-columns
A comma-separated list of the columns in the input table. The input columns can be of any numeric type or BOOL, but they will be converted internally to FLOAT. The number of input columns must be more than 1 and not more than 1600.

Returns

CORR_MATRIX returns the correlation matrix in triplet format. That is, each pair-wise correlation is identified by three returned columns: the name of the first variable, the name of the second variable, and the correlation value of the pair. The function also returns two extra columns, number_of_ignored_input_rows and number_of_processed_input_rows, which report how many input rows were ignored and how many were used, respectively, to calculate the corresponding correlation value. Any input pair with NULL, Inf, or NaN is ignored.

The correlation matrix is symmetric, with a value of 1 on all diagonal elements; therefore, it could return only the elements above the diagonal (the upper triangle). Nevertheless, the function returns the entire matrix to simplify any later operations. The number of output rows is therefore:

(#input-columns)^2

The first two output columns are of type VARCHAR(128), the third one is of type FLOAT, and the last two are of type INT.
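
Because each pair-wise correlation appears twice in the output, you can keep each pair once by filtering on the name columns. A sketch, assuming the iris table used in the example below:

=> SELECT * FROM (SELECT CORR_MATRIX("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width") OVER() FROM iris) t
   WHERE t.variable_name_1 < t.variable_name_2;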

Notes

  • The contents of the OVER clause must be empty.

  • The function returns no rows when the input table is empty.

  • When either X_i or Y_i is NULL, Inf, or NaN, the pair is excluded from the calculation of CORR(X, Y). That is, any input pair with NULL, Inf, or NaN is ignored.

  • For the pair of (X,X), regardless of the contents of X: CORR(X,X) = 1, number_of_ignored_input_rows = 0, and number_of_processed_input_rows = #input_rows.

  • When (N * SUM(X^2) == SUM(X) * SUM(X)) or (N * SUM(Y^2) == SUM(Y) * SUM(Y)), the value of CORR(X, Y) is NULL, because the corresponding variance term in the denominator is zero. In theory this can happen for a column with constant values; nevertheless, it may not always be observed because of rounding error.

  • In the special case where all pair values of (X_i, Y_i) contain NULL, Inf, or NaN, and X != Y: CORR(X, Y) = NULL.

Examples

The following example uses the iris dataset.

=> SELECT CORR_MATRIX("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width") OVER() FROM iris;
variable_name_1 | variable_name_2 | corr_value        | number_of_ignored_input_rows | number_of_processed_input_rows
----------------+-----------------+-------------------+------------------------------+--------------------------------
Sepal.Length    | Sepal.Width     |-0.117569784133002 | 0                            | 150
Sepal.Width     | Sepal.Length    |-0.117569784133002 | 0                            | 150
Sepal.Length    | Petal.Length    |0.871753775886583  | 0                            | 150
Petal.Length    | Sepal.Length    |0.871753775886583  | 0                            | 150
Sepal.Length    | Petal.Width     |0.817941126271577  | 0                            | 150
Petal.Width     | Sepal.Length    |0.817941126271577  | 0                            | 150
Sepal.Width     | Petal.Length    |-0.42844010433054  | 0                            | 150
Petal.Length    | Sepal.Width     |-0.42844010433054  | 0                            | 150
Sepal.Width     | Petal.Width     |-0.366125932536439 | 0                            | 150
Petal.Width     | Sepal.Width     |-0.366125932536439 | 0                            | 150
Petal.Length    | Petal.Width     |0.962865431402796  | 0                            | 150
Petal.Width     | Petal.Length    |0.962865431402796  | 0                            | 150
Sepal.Length    | Sepal.Length    |1                  | 0                            | 150
Sepal.Width     | Sepal.Width     |1                  | 0                            | 150
Petal.Length    | Petal.Length    |1                  | 0                            | 150
Petal.Width     | Petal.Width     |1                  | 0                            | 150
(16 rows)

4 - DETECT_OUTLIERS

Returns the outliers in a data set based on the outlier threshold. The output is a table that contains the outliers. DETECT_OUTLIERS uses the detection method robust_zscore to normalize each input column. The function then identifies as outliers all rows that contain a normalized value greater than the default or specified threshold.

This is a meta-function. You must call meta-functions in a top-level SELECT statement.

Behavior type

Volatile

Syntax

DETECT_OUTLIERS ( 'output-table', 'input-relation','input-columns', 'detection-method'
        [ USING PARAMETERS
              [outlier_threshold = threshold]
              [, exclude_columns = 'excluded-columns']
              [, partition_columns = 'partition-columns'] ] )

Arguments

output-table
The name of the table where Vertica saves rows that are outliers along the chosen input-columns. All columns are present in this table.
input-relation
The table or view that contains outlier data. If the input relation is defined in Hive, use SYNC_WITH_HCATALOG_SCHEMA to sync the hcatalog schema, and then run the machine learning function.
input-columns
Comma-separated list of columns to use from the input relation, or asterisk (*) to select all columns. Input columns must be of type numeric.
detection-method
The outlier detection method to use, set to robust_zscore.

Parameters

outlier_threshold
The minimum normalized value in a row that is used to identify that row as an outlier.

Default: 3.0

exclude_columns

Comma-separated list of column names from input-columns to exclude from processing.

partition_columns
Comma-separated list of column names from the input table or view that defines the partitions. DETECT_OUTLIERS detects outliers among each partition separately.

Default: empty list
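
For example, to detect outliers within each group rather than across the whole table, pass partition_columns. A sketch, assuming a baseball table with a team column, as used in the IFOREST example later in this section:

=> SELECT DETECT_OUTLIERS('hr_outliers_by_team', 'baseball', 'hr, avg', 'robust_zscore'
        USING PARAMETERS outlier_threshold=3.0, partition_columns='team');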

Privileges

Non-superusers:

  • SELECT privileges on the input relation

  • CREATE privileges on the output table

Examples

The following example shows how to use DETECT_OUTLIERS:

=> CREATE TABLE baseball_roster (id identity, last_name varchar(30), hr int, avg float);
CREATE TABLE

=> COPY baseball_roster FROM STDIN;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> Polo|7|.233
>> Gloss|45|.170
>> Gus|12|.345
>> Gee|1|.125
>> Laus|3|.095
>> Hilltop|16|.222
>> Wicker|78|.333
>> Scooter|0|.121
>> Hank|999999|.8888
>> Popup|35|.378
>> \.


=> SELECT * FROM baseball_roster;
 id | last_name |   hr   |  avg
----+-----------+--------+--------
  3 | Gus       |     12 |  0.345
  4 | Gee       |      1 |  0.125
  6 | Hilltop   |     16 |  0.222
 10 | Popup     |     35 |  0.378
  1 | Polo      |      7 |  0.233
  7 | Wicker    |     78 |  0.333
  9 | Hank      | 999999 | 0.8888
  2 | Gloss     |     45 |   0.17
  5 | Laus      |      3 |  0.095
  8 | Scooter   |      0 |  0.121
(10 rows)

=> SELECT DETECT_OUTLIERS('baseball_outliers', 'baseball_roster', 'id, hr, avg', 'robust_zscore' USING PARAMETERS
outlier_threshold=3.0);

     DETECT_OUTLIERS
--------------------------
 Detected 2 outliers

(1 row)

=> SELECT * FROM baseball_outliers;
 id | last_name | hr         | avg
----+-----------+------------+-------------
  7 | Wicker    |         78 |       0.333
  9 | Hank      |     999999 |      0.8888
(2 rows)

5 - IFOREST

Trains and returns an isolation forest (iForest) model. After you train the model, you can use the APPLY_IFOREST function to predict outliers in an input relation.

For more information about how the iForest algorithm works, see Isolation Forest.

This is a meta-function. You must call meta-functions in a top-level SELECT statement.

Behavior type

Volatile

Syntax

IFOREST( 'model-name', 'input-relation', 'input-columns' [ USING PARAMETERS param=value[,...] ] )

Arguments

model-name
Identifies the model to create, where model-name conforms to conventions described in Identifiers. It must also be unique among all names of sequences, tables, projections, views, and models within the same schema.
input-relation
The table or view that contains the input data for IFOREST.
input-columns
Comma-separated list of columns to use from the input relation, or asterisk (*) to select all columns. Columns must be of types CHAR, VARCHAR, BOOL, INT, or FLOAT.

Columns of types CHAR, VARCHAR, and BOOL are treated as categorical features; all others are treated as numeric features.

Parameters

exclude_columns
Comma-separated list of column names from input-columns to exclude from processing.

Default: Empty string ('')

ntree
Integer in the range [1, 1000], specifies the number of trees in the forest.

Default: 100

sampling_size
Float in the range (0.0, 1.0], specifies the portion of the input data set that is randomly picked, without replacement, for training each tree.

Default: 0.632

col_sample_by_tree
Float in the range (0.0, 1.0], specifies the fraction of columns that are randomly picked for training each tree.

Default: 1.0

max_depth
Integer in the range [1, 100], specifies the maximum depth for growing each tree.

Default: 10

nbins
Integer in the range [2, 1000], specifies the number of bins used to discretize continuous features.

Default: 32

Model Attributes

details
Details about the function's predictor columns, including:
  • predictor: Names of the predictors in the same order specified when training the model.

  • type: Types of the predictors in the same order as their names in predictor.

tree_count
Number of trees in the model.
rejected_row_count
Number of rows in input-relation that were skipped because they contained an invalid value.
accepted_row_count
Total number of rows in input-relation minus rejected_row_count.
call_string
Value of all input arguments that were specified at the time the function was called.

Privileges

Non-superusers:

  • CREATE privileges on the schema where the model is created

  • SELECT privileges on the input relation

Examples

In the following example, the input data to the function contains columns of type INT, VARCHAR, and FLOAT:

=> SELECT IFOREST('baseball_anomalies','baseball','team, hr, hits, avg, salary' USING PARAMETERS ntree=75, sampling_size=0.7,
max_depth=15);
IFOREST
----------
Finished
(1 row)

You can verify that all the input columns were read in correctly by calling GET_MODEL_SUMMARY and checking the details section:

=> SELECT GET_MODEL_SUMMARY(USING PARAMETERS model_name='baseball_anomalies');
GET_MODEL_SUMMARY
-------------------------------------------------------------------------------------------------------------------------------------

===========
call_string
===========
SELECT iforest('public.baseball_anomalies', 'baseball', 'team, hr, hits, avg, salary' USING PARAMETERS exclude_columns='', ntree=75,
sampling_size=0.7, col_sample_by_tree=1, max_depth=15, nbins=32);

=======
details
=======
predictor|      type
---------+----------------
  team   |char or varchar
   hr    |      int
  hits   |      int
   avg   |float or numeric
 salary  |float or numeric


===============
Additional Info
===============
       Name       |Value
------------------+-----
    tree_count    | 75
rejected_row_count|  0
accepted_row_count|1000

(1 row)
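
After training, you can apply the model with APPLY_IFOREST to flag anomalous rows. A sketch, using the same baseball table; the 0.6 threshold is illustrative:

=> SELECT * FROM (SELECT team, hr, salary,
       APPLY_IFOREST(team, hr, hits, avg, salary
           USING PARAMETERS model_name='baseball_anomalies', threshold=0.6) AS prediction
   FROM baseball) AS flagged WHERE prediction.is_anomaly IS true;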

6 - IMPUTE

Imputes missing values in a data set with either the mean or the mode, based on observed values for a variable in each column. This function supports numeric and categorical data types.

This is a meta-function. You must call meta-functions in a top-level SELECT statement.

Behavior type

Volatile

Syntax

IMPUTE( 'output-view', 'input-relation', 'input-columns', 'method'
        [ USING PARAMETERS [exclude_columns = 'excluded-columns'] [, partition_columns = 'partition-columns'] ] )

Arguments

output-view
Name of the view that shows the input table with imputed values in place of missing values. In this view, rows without missing values are kept intact while the rows with missing values are modified according to the specified method.
input-relation
The table or view that contains the data for missing-value imputation. If the input relation is defined in Hive, use SYNC_WITH_HCATALOG_SCHEMA to sync the hcatalog schema, and then run the machine learning function.
input-columns
Comma-separated list of input columns where missing values will be replaced, or asterisk (*) to specify all columns. All columns must be of type numeric or BOOLEAN.
method
The method to compute the missing value replacements, one of the following:
  • mean: The missing values in each column will be replaced by the mean of that column. This method can be used for numeric data only.

  • mode: The missing values in each column will be replaced by the most frequent value in that column. This method can be used for categorical data only.

Parameters

exclude_columns

Comma-separated list of column names from input-columns to exclude from processing.

partition_columns
Comma-separated list of column names from the input relation that defines the partitions.
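
For example, to compute replacement values separately within each partition, pass partition_columns. A sketch, assuming the input relation has a pclass column:

=> SELECT IMPUTE('output_view_group', 'small_input_impute', 'pid, x1,x2,x3,x4', 'mean'
        USING PARAMETERS exclude_columns='pid', partition_columns='pclass');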

Privileges

Non-superusers:

  • SELECT privileges on the input relation

  • CREATE privileges on the output view schema

Examples

Execute IMPUTE on the small_input_impute table, specifying the mean method:

=> SELECT impute('output_view','small_input_impute', 'pid, x1,x2,x3,x4','mean'
USING PARAMETERS exclude_columns='pid');
impute
--------------------------
Finished in 1 iteration
(1 row)

Execute IMPUTE, specifying the mode method:

=> SELECT impute('output_view3','small_input_impute', 'pid, x5,x6','mode' USING PARAMETERS exclude_columns='pid');
impute
--------------------------
Finished in 1 iteration
(1 row)

See also

Imputing missing values

7 - NORMALIZE

Runs a normalization algorithm on an input relation. The output is a view with the normalized data.

This is a meta-function. You must call meta-functions in a top-level SELECT statement.

Behavior type

Volatile

Syntax

NORMALIZE ( 'output-view', 'input-relation', 'input-columns', 'normalization-method'
           [ USING PARAMETERS exclude_columns = 'excluded-columns' ] )

Arguments

output-view
The name of the view showing the input relation with normalized data replacing the specified input columns.
input-relation
The table or view that contains the data to normalize. If the input relation is defined in Hive, use SYNC_WITH_HCATALOG_SCHEMA to sync the hcatalog schema, and then run the machine learning function.
input-columns
Comma-separated list of numeric input columns that contain the values to normalize, or asterisk (*) to select all columns.
normalization-method
The normalization method to use, one of the following:
  • minmax

  • zscore

  • robust_zscore

If infinity values appear in the table, the method ignores those values.

Parameters

exclude_columns

Comma-separated list of column names from input-columns to exclude from processing.

Privileges

Non-superusers:

  • SELECT privileges on the input relation

  • CREATE privileges on the output view schema

Examples

These examples show how you can use the NORMALIZE function on the wt and hp columns in the mtcars table.

Execute the NORMALIZE function, and specify the minmax method:

=> SELECT NORMALIZE('mtcars_norm', 'mtcars',
                    'wt, hp', 'minmax');
        NORMALIZE
--------------------------
 Finished in 1 iteration

(1 row)

Execute the NORMALIZE function, and specify the zscore method:

=> SELECT NORMALIZE('mtcars_normz','mtcars',
                    'wt, hp', 'zscore');
        NORMALIZE
--------------------------
 Finished in 1 iteration

(1 row)

Execute the NORMALIZE function, and specify the robust_zscore method:

=> SELECT NORMALIZE('mtcars_normz', 'mtcars',
                    'wt, hp', 'robust_zscore');
        NORMALIZE
--------------------------
 Finished in 1 iteration

(1 row)
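
To spot-check the minmax output, you can confirm that each normalized column now spans the [0,1] interval; a sketch:

=> SELECT MIN(wt), MAX(wt), MIN(hp), MAX(hp) FROM mtcars_norm;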

See also

Normalizing data

8 - NORMALIZE_FIT

Unlike NORMALIZE, which directly outputs a view with normalized results, NORMALIZE_FIT stores normalization parameters in a model for later use.

NORMALIZE_FIT computes normalization parameters for each of the specified columns in an input relation. The resulting model stores the normalization parameters. For example, for MinMax normalization, the minimum and maximum value of each column are stored in the model. The generated model serves as input to functions APPLY_NORMALIZE and REVERSE_NORMALIZE.

This is a meta-function. You must call meta-functions in a top-level SELECT statement.

Behavior type

Volatile

Syntax

NORMALIZE_FIT ( 'model-name', 'input-relation', 'input-columns', 'normalization-method'
        [ USING PARAMETERS  [exclude_columns = 'excluded-columns'] [, output_view = 'output-view'] ] )

Arguments

model-name
Identifies the model to create, where model-name conforms to conventions described in Identifiers. It must also be unique among all names of sequences, tables, projections, views, and models within the same schema.
input-relation
The table or view that contains the data to normalize. If the input relation is defined in Hive, use SYNC_WITH_HCATALOG_SCHEMA to sync the hcatalog schema, and then run the machine learning function.
input-columns
Comma-separated list of columns to use from the input relation, or asterisk (*) to select all columns. Input columns must be of data type numeric.
normalization-method
The normalization method to use, one of the following:
  • minmax

  • zscore

  • robust_zscore

If you specify robust_zscore, NORMALIZE_FIT uses the aggregate function APPROXIMATE_MEDIAN.

All normalization methods ignore infinity, negative infinity, or NULL values in the input relation.

Parameters

exclude_columns

Comma-separated list of column names from input-columns to exclude from processing.

output_view
Name of the view that contains all columns from the input relation, with the specified input columns normalized.

Model attributes

data
Normalization method set to minmax:
  • colNames: Model column names

  • mins: Minimum value of each column

  • maxes: Maximum value of each column
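
You can inspect the stored parameters with GET_MODEL_ATTRIBUTE; a sketch, using the model created in the example below:

=> SELECT GET_MODEL_ATTRIBUTE(USING PARAMETERS model_name='mtcars_normfit', attr_name='data');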

Privileges

Non-superusers:

  • CREATE privileges on the schema where the model is created

  • SELECT privileges on the input relation

  • CREATE privileges on the output view schema

Examples

The following example creates a model with NORMALIZE_FIT using the wt and hp columns in table mtcars, and then uses this model in successive calls to APPLY_NORMALIZE and REVERSE_NORMALIZE.

=> SELECT NORMALIZE_FIT('mtcars_normfit', 'mtcars', 'wt,hp', 'minmax');
NORMALIZE_FIT
---------------
Success
(1 row)

The following call to APPLY_NORMALIZE specifies the hp and cyl columns in table mtcars, where hp is in the normalization model and cyl is not in the normalization model:

=> CREATE TABLE mtcars_normalized AS SELECT APPLY_NORMALIZE (hp, cyl USING PARAMETERS model_name = 'mtcars_normfit') FROM mtcars;
CREATE TABLE
=> SELECT * FROM mtcars_normalized;
          hp        | cyl
--------------------+-----
  0.434628975265018 | 8
  0.681978798586572 | 8
  0.434628975265018 | 6
                  1 | 8
  0.540636042402827 | 8
                  0 | 4
  0.681978798586572 | 8
 0.0459363957597173 | 4
  0.434628975265018 | 8
  0.204946996466431 | 6
  0.250883392226148 | 6
  0.049469964664311 | 4
  0.204946996466431 | 6
  0.201413427561837 | 4
  0.204946996466431 | 6
  0.250883392226148 | 6
  0.049469964664311 | 4
  0.215547703180212 | 4
 0.0353356890459364 | 4
  0.187279151943463 | 6
  0.452296819787986 | 8
  0.628975265017668 | 8
  0.346289752650177 | 8
  0.137809187279152 | 4
  0.749116607773852 | 8
  0.144876325088339 | 4
  0.151943462897526 | 4
  0.452296819787986 | 8
  0.452296819787986 | 8
  0.575971731448763 | 8
  0.159010600706714 | 4
  0.346289752650177 | 8
(32 rows)

To restore the normalized values to their original scale, call REVERSE_NORMALIZE with the same model:

=> SELECT REVERSE_NORMALIZE (hp, cyl USING PARAMETERS model_name='mtcars_normfit') FROM mtcars_normalized;
  hp | cyl
-----+-----
 175 | 8
 245 | 8
 175 | 6
 335 | 8
 205 | 8
  52 | 4
 245 | 8
  65 | 4
 175 | 8
 110 | 6
 123 | 6
  66 | 4
 110 | 6
 109 | 4
 110 | 6
 123 | 6
  66 | 4
 113 | 4
  62 | 4
 105 | 6
 180 | 8
 230 | 8
 150 | 8
  91 | 4
 264 | 8
  93 | 4
  95 | 4
 180 | 8
 180 | 8
 215 | 8
  97 | 4
 150 | 8
(32 rows)

The following call to REVERSE_NORMALIZE also specifies the hp and cyl columns in table mtcars, where hp is in normalization model mtcars_normfit, and cyl is not in the normalization model.

=> SELECT REVERSE_NORMALIZE (hp, cyl USING PARAMETERS model_name='mtcars_normfit') FROM mtcars_normalized;
       hp        | cyl
-----------------+-----
205.000005722046 |   8
150.000000357628 |   8
150.000000357628 |   8
93.0000016987324 |   4
 174.99999666214 |   8
94.9999992102385 |   4
214.999997496605 |   8
97.0000009387732 |   4
245.000006556511 |   8
 174.99999666214 |   6
             335 |   8
245.000006556511 |   8
62.0000002086163 |   4
 174.99999666214 |   8
230.000002026558 |   8
              52 |   4
263.999997675419 |   8
109.999999523163 |   6
123.000002324581 |   6
64.9999996386468 |   4
66.0000005029142 |   4
112.999997898936 |   4
109.999999523163 |   6
180.000000983477 |   8
180.000000983477 |   8
108.999998658895 |   4
109.999999523163 |   6
104.999999418855 |   6
123.000002324581 |   6
180.000000983477 |   8
66.0000005029142 |   4
90.9999999701977 |   4
(32 rows)

See also

Normalizing data

9 - ONE_HOT_ENCODER_FIT

Generates a sorted list of each of the category levels for each feature to be encoded, and stores the model.

This is a meta-function. You must call meta-functions in a top-level SELECT statement.

Behavior type

Volatile

Syntax

ONE_HOT_ENCODER_FIT ( 'model-name', 'input-relation','input-columns'
        [ USING PARAMETERS
              [exclude_columns = 'excluded-columns']
              [, output_view = 'output-view']
              [, extra_levels = 'category-levels'] ] )

Arguments

model-name
Identifies the model to create, where model-name conforms to conventions described in Identifiers. It must also be unique among all names of sequences, tables, projections, views, and models within the same schema.
input-relation
The table or view that contains the data for one hot encoding. If the input relation is defined in Hive, use SYNC_WITH_HCATALOG_SCHEMA to sync the hcatalog schema, and then run the machine learning function.
input-columns
Comma-separated list of columns to use from the input relation, or asterisk (*) to select all columns. Input columns must be INTEGER, BOOLEAN, VARCHAR, or DATE.

Parameters

exclude_columns

Comma-separated list of column names from input-columns to exclude from processing.

output_view
The name of the view that stores the input relation and the one hot encodings. Columns are returned in the order they appear in the input relation, with the one-hot encoded columns appended after the original columns.
extra_levels
Additional levels in each category that are not in the input relation. This parameter should be passed as a string that conforms with the JSON standard, with category names as keys, and lists of extra levels in each category as values.
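
For example, to register a category level that does not appear in the training data (a sketch; the cyl column and the extra level 10 are illustrative):

=> SELECT ONE_HOT_ENCODER_FIT ('one_hot_encoder_cyl', 'mtcars', 'cyl'
        USING PARAMETERS extra_levels='{"cyl": [10]}');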

Model attributes

call_string
The value of all input arguments that were specified at the time the function was called.
varchar_categories, integer_categories, boolean_categories, date_categories
Settings for each type of categorical column:
  • category_name: Column name

  • category_level: Levels of the category, sorted for each category

  • category_level_index: Index of this categorical level in the sorted list of levels for the category

Privileges

Non-superusers:

  • CREATE privileges on the schema where the model is created

  • SELECT privileges on the input relation

  • CREATE privileges on the output view schema

Examples

=> SELECT ONE_HOT_ENCODER_FIT ('one_hot_encoder_model','mtcars','*'
USING PARAMETERS exclude_columns='mpg,disp,drat,wt,qsec,vs,am');
ONE_HOT_ENCODER_FIT
--------------------
Success
(1 row)
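
You can then encode data by passing the stored model to APPLY_ONE_HOT_ENCODER; a sketch using the model trained above:

=> SELECT APPLY_ONE_HOT_ENCODER(cyl USING PARAMETERS model_name='one_hot_encoder_model') FROM mtcars;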

10 - PCA

Computes principal components from the input table/view. The results are saved in a PCA model. Internally, PCA finds the components by using SVD on the covariance matrix built from the input data. The singular values of this decomposition are also saved as part of the PCA model. The signs of all elements of a principal component may be flipped together on different runs.

This is a meta-function. You must call meta-functions in a top-level SELECT statement.

Behavior type

Volatile

Syntax

PCA ( 'model-name', 'input-relation', 'input-columns'
        [ USING PARAMETERS
              [exclude_columns = 'excluded-columns']
              [, num_components = num-components]
              [, scale = is-scaled]
              [, method = 'method'] ] )

Arguments

model-name
Identifies the model to create, where model-name conforms to conventions described in Identifiers. It must also be unique among all names of sequences, tables, projections, views, and models within the same schema.
input-relation
The table or view that contains the input data for PCA.
input-columns
Comma-separated list of columns to use from the input relation, or asterisk (*) to select all columns. All input columns must be a numeric data type.

Parameters

exclude_columns

Comma-separated list of column names from input-columns to exclude from processing.

num_components
The number of components to keep in the model. If this value is not provided, all components are kept. The maximum number of components is the number of non-zero singular values returned by the internal call to SVD. This number is less than or equal to min(number of columns, number of rows).
scale
A Boolean value that specifies whether to standardize the columns during the preparation step:
  • True: Use a correlation matrix instead of a covariance matrix.

  • False (default)

method
The method used to calculate PCA, can be set to LAPACK.

Model attributes

columns
The information about columns from the input relation used for creating the PCA model:
  • index

  • name

singular_values
The information about singular values found. They are sorted in descending order:
  • index

  • value

  • explained_variance: percentage of the variance in the data that can be attributed to this singular value

  • accumulated_explained_variance: percentage of the variance in the data that can be retained if we drop all singular values after the current one

principal_components
The principal components corresponding to the singular values mentioned above:
  • index: indices of the elements in each component

  • PC1

  • PC2

  • ...

counters
The information collected during training the model, stored as name-value pairs:
  • counter_name

    • accepted_row_count: number of valid rows in the data

    • rejected_row_count: number of invalid rows (having NULL, INF or NaN) in the data

    • iteration_count: number of iterations, always 1 for the current implementation of PCA

  • counter_value

call_string
The function call that created the model.
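
To see how much variance each component captures before choosing num_components, you can query the singular_values attribute of a trained model; a sketch, using the pcamodel created in the example below:

=> SELECT GET_MODEL_ATTRIBUTE(USING PARAMETERS model_name='pcamodel', attr_name='singular_values');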

Privileges

Non-superusers:

  • CREATE privileges on the schema where the model is created

  • SELECT privileges on the input relation

Examples


=> SELECT PCA ('pcamodel', 'world','country,HDI,em1970,em1971,em1972,em1973,em1974,em1975,em1976,em1977,
em1978,em1979,em1980,em1981,em1982,em1983,em1984,em1985,em1986,em1987,em1988,em1989,em1990,em1991,em1992,
em1993,em1994,em1995,em1996,em1997,em1998,em1999,em2000,em2001,em2002,em2003,em2004,em2005,em2006,em2007,
em2008,em2009,em2010,gdp1970,gdp1971,gdp1972,gdp1973,gdp1974,gdp1975,gdp1976,gdp1977,gdp1978,gdp1979,gdp1980,
gdp1981,gdp1982,gdp1983,gdp1984,gdp1985,gdp1986,gdp1987,gdp1988,gdp1989,gdp1990,gdp1991,gdp1992,gdp1993,
gdp1994,gdp1995,gdp1996,gdp1997,gdp1998,gdp1999,gdp2000,gdp2001,gdp2002,gdp2003,gdp2004,gdp2005,gdp2006,
gdp2007,gdp2008,gdp2009,gdp2010' USING PARAMETERS exclude_columns='HDI,country');
PCA
---------------------------------------------------------------
Finished in 1 iterations.
Accepted Rows: 96  Rejected Rows: 0
(1 row)
=> CREATE TABLE worldPCA AS SELECT
APPLY_PCA (HDI,country,em1970,em1971,em1972,em1973,em1974,em1975,em1976,em1977,em1978,em1979,
em1980,em1981,em1982,em1983,em1984,em1985,em1986,em1987,em1988,em1989,em1990,em1991,em1992,em1993,em1994,
em1995,em1996,em1997,em1998,em1999,em2000,em2001,em2002,em2003,em2004,em2005,em2006,em2007,em2008,em2009,
em2010,gdp1970,gdp1971,gdp1972,gdp1973,gdp1974,gdp1975,gdp1976,gdp1977,gdp1978,gdp1979,gdp1980,gdp1981,gdp1982,
gdp1983,gdp1984,gdp1985,gdp1986,gdp1987,gdp1988,gdp1989,gdp1990,gdp1991,gdp1992,gdp1993,gdp1994,gdp1995,
gdp1996,gdp1997,gdp1998,gdp1999,gdp2000,gdp2001,gdp2002,gdp2003,gdp2004,gdp2005,gdp2006,gdp2007,gdp2008,
gdp2009,gdp2010 USING PARAMETERS model_name='pcamodel', exclude_columns='HDI, country', key_columns='HDI,
country',cutoff=.3)OVER () FROM world;
CREATE TABLE

=> SELECT * FROM worldPCA;
HDI   |       country       |       col1
------+---------------------+-------------------
0.886 | Belgium             |  79002.2946705704
0.699 | Belize              | -25631.6670012556
0.427 | Benin               | -40373.4104598122
0.805 | Chile               | -16805.7940082156
0.687 | China               | -37279.2893141103
0.744 | Costa Rica          | -19505.5631231635
0.4   | Cote d'Ivoire       | -38058.2060339272
0.776 | Cuba                | -23724.5779612041
0.895 | Denmark             |  117325.594028813
0.644 | Egypt               | -34609.9941604549
...
(96 rows)

=> SELECT APPLY_INVERSE_PCA (HDI, country, col1
    USING PARAMETERS model_name = 'pcamodel', exclude_columns='HDI,country',
    key_columns = 'HDI, country') OVER () FROM worldPCA;
 HDI  |       country       |      em1970       |      em1971       | ... |      gdp2009      |     gdp2010
------+---------------------+-------------------+-------------------+-----+-------------------+-----------------
0.886 | Belgium             |  18585.6613572407 | -16145.6374560074 | ... |  8886.79860331866 |  8718.3731386891
...
(96 rows)

11 - SUMMARIZE_CATCOL

Returns a statistical summary of categorical data input, in three columns:

  • CATEGORY: Categorical levels, of the same SQL data type as the summarized column

  • COUNT: The number of rows in each category level, of type INTEGER

  • PERCENT: Represents category percentage, of type FLOAT

Syntax

SUMMARIZE_CATCOL (target-column
        [ USING PARAMETERS TOPK = topk-value [, WITH_TOTALCOUNT = show-total] ] )
OVER()

Arguments

target-column
The name of the input column to summarize, one of the following data types:
  • BOOLEAN

  • FLOAT

  • INTEGER

  • DATE

  • CHAR/VARCHAR

Parameters

TOPK
Integer, specifies how many of the most frequent rows to include in the output.
WITH_TOTALCOUNT
A Boolean value that specifies whether the output includes a heading row that displays the total number of rows in the target column, with a percent equal to 100.

Default: true

Examples

This example shows the categorical summary for the current_salary column in the salary_data table. The query output contains the CATEGORY, COUNT, and PERCENT columns. The first column gives the categorical levels, with the same SQL data type as the input column, the second column gives a count of that value, and the third column gives a percentage.

=> SELECT SUMMARIZE_CATCOL (current_salary USING PARAMETERS TOPK = 5) OVER() FROM salary_data;
CATEGORY | COUNT | PERCENT
---------+-------+---------
         |  1000 |     100
   39004 |     2 |     0.2
   35321 |     1 |     0.1
   36313 |     1 |     0.1
   36538 |     1 |     0.1
   36562 |     1 |     0.1
(6 rows)
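
To omit the total-count heading row, set WITH_TOTALCOUNT to false; a sketch:

=> SELECT SUMMARIZE_CATCOL (current_salary USING PARAMETERS TOPK = 5, WITH_TOTALCOUNT = false) OVER() FROM salary_data;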

12 - SUMMARIZE_NUMCOL

Returns a statistical summary of columns in a Vertica table:

  • Count

  • Mean

  • Standard deviation

  • Min/max values

  • Approximate percentile

  • Median

All summary values are FLOAT data types, except INTEGER for count.

Syntax

SUMMARIZE_NUMCOL (input-columns [ USING PARAMETERS exclude_columns = 'excluded-columns'] ) OVER()

Arguments

input-columns
Comma-separated list of columns to use from the input relation, or asterisk (*) to select all columns. All columns must be a numeric data type. If you select all columns, use exclude_columns to omit any non-numeric columns, as in the example below.

Parameters

exclude_columns

Comma-separated list of column names from input-columns to exclude from processing.

Examples

Show the statistical summary for the age and salary columns in the employee table:

=> SELECT SUMMARIZE_NUMCOL(* USING PARAMETERS exclude_columns='id,name,gender,title') OVER() FROM employee;
COLUMN         | COUNT |    MEAN    |      STDDEV      |  MIN    | PERC25  | MEDIAN  |  PERC75   |  MAX
---------------+-------+------------+------------------+---------+---------+---------+-----------+--------
age            |     5 |    63.4    | 19.3209730603818 |      44 |      45 |      67 |      71   |     90
salary         |     5 | 3456.76    | 1756.78754300285 | 1234.56 | 2345.67 | 3456.78 | 4567.89   | 5678.9
(2 rows)

13 - SVD

Computes singular values (the diagonal of the S matrix) and right singular vectors (the V matrix) of an SVD decomposition of the input relation. The results are saved as an SVD model. The signs of all elements of a singular vector may be flipped together on different runs.

This is a meta-function. You must call meta-functions in a top-level SELECT statement.

Behavior type

Volatile

Syntax

SVD ( 'model-name', 'input-relation', 'input-columns'
     [ USING PARAMETERS
              [exclude_columns = 'excluded-columns']
              [, num_components = num-components]
              [, method = 'method'] ] )

Arguments

model-name
Identifies the model to create, where model-name conforms to conventions described in Identifiers. It must also be unique among all names of sequences, tables, projections, views, and models within the same schema.
input-relation
The table or view that contains the input data for SVD.
input-columns
Comma-separated list of columns to use from the input relation, or asterisk (*) to select all columns. Input columns must be a numeric data type.

Parameters

exclude_columns

Comma-separated list of column names from input-columns to exclude from processing.

num_components
The number of components to keep in the model. The maximum number of components is the number of non-zero singular values computed, which is less than or equal to min (number of columns, number of rows). If you omit this parameter, all components are kept.
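
For example, to keep only the two largest components, set num_components (a sketch, using the small_svd table from the examples below):

=> SELECT SVD ('svdmodel_k2', 'small_svd', 'x1,x2,x3,x4' USING PARAMETERS num_components=2);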
method
The method used to calculate SVD, can be set to LAPACK.

Model attributes

columns
The information about columns from the input relation used for creating the SVD model:
  • index

  • name

singular_values
The information about singular values found. They are sorted in descending order:
  • index

  • value

  • explained_variance: percentage of the variance in the data that can be attributed to this singular value

  • accumulated_explained_variance: percentage of the variance in the data that can be retained if we drop all singular values after the current one

right_singular_vectors
The right singular vectors corresponding to the singular values mentioned above:
  • index: indices of the elements in each vector

  • vector1

  • vector2

  • ...

counters
The information collected during training the model, stored as name-value pairs:
  • counter_name

    • accepted_row_count: number of valid rows in the data

    • rejected_row_count: number of invalid rows (having NULL, INF or NaN) in the data

    • iteration_count: number of iterations, always 1 for the current implementation of SVD

  • counter_value

call_string
The function call that created the model.

Privileges

Non-superusers:

  • CREATE privileges on the schema where the model is created

  • SELECT privileges on the input relation

Examples

=> SELECT SVD ('svdmodel', 'small_svd', 'x1,x2,x3,x4');
SVD
--------------------------------------------------------------
Finished in 1 iterations.
Accepted Rows: 8  Rejected Rows: 0
(1 row)

=> CREATE TABLE transform_svd AS SELECT
     APPLY_SVD (id, x1, x2, x3, x4 USING PARAMETERS model_name='svdmodel', exclude_columns='id', key_columns='id')
     OVER () FROM small_svd;
CREATE TABLE

=> SELECT * FROM transform_svd;
id  |       col1        |        col2         |        col3         |        col4
----+-------------------+---------------------+---------------------+--------------------
4   |  0.44849499240202 |  -0.347260956311326 |   0.186958376368345 |  0.378561270493651
6   |  0.17652411036246 | -0.0753183783382909 |  -0.678196192333598 | 0.0567124770173372
1   | 0.494871802886819 |   0.161721379259287 |  0.0712816417153664 | -0.473145877877408
2   |  0.17652411036246 | -0.0753183783382909 |  -0.678196192333598 | 0.0567124770173372
3   | 0.150974762654569 |   0.589561842046029 | 0.00392654610109522 |  0.360011163271921
5   | 0.494871802886819 |   0.161721379259287 |  0.0712816417153664 | -0.473145877877408
8   |  0.44849499240202 |  -0.347260956311326 |   0.186958376368345 |  0.378561270493651
7   | 0.150974762654569 |   0.589561842046029 | 0.00392654610109522 |  0.360011163271921
(8 rows)

=> SELECT APPLY_INVERSE_SVD (* USING PARAMETERS model_name='svdmodel', exclude_columns='id',
key_columns='id') OVER () FROM transform_svd;
id  |        x1        |        x2        |        x3        |        x4
----+------------------+------------------+------------------+------------------
4 | 91.4056627665577 | 44.7629617207482 | 83.1704961993117 | 38.9274292265543
6 | 20.6468626294368 | 9.30974906868751 | 8.71006863405534 |  6.5855928603967
7 | 31.2494347777156 | 20.6336519003026 | 27.5668287751507 | 5.84427645886865
1 |  107.93376580719 | 51.6980548011917 | 97.9665796560552 | 40.4918236881051
2 | 20.6468626294368 | 9.30974906868751 | 8.71006863405534 |  6.5855928603967
3 | 31.2494347777156 | 20.6336519003026 | 27.5668287751507 | 5.84427645886865
5 |  107.93376580719 | 51.6980548011917 | 97.9665796560552 | 40.4918236881051
8 | 91.4056627665577 | 44.7629617207482 | 83.1704961993117 | 38.9274292265543
(8 rows)
