Vertica supports machine learning functions that prepare data as needed before subjecting it to analysis.
Data preparation
- 1: BALANCE
- 2: CORR_MATRIX
- 3: DETECT_OUTLIERS
- 4: IFOREST
- 5: IMPUTE
- 6: NORMALIZE
- 7: NORMALIZE_FIT
- 8: ONE_HOT_ENCODER_FIT
- 9: PCA
- 10: SUMMARIZE_CATCOL
- 11: SUMMARIZE_NUMCOL
- 12: SVD
1 - BALANCE
Returns a view with an equal distribution of the input data based on the response-column.
This is a meta-function. You must call meta-functions in a top-level SELECT statement.
Behavior type
Volatile
Syntax
BALANCE ( 'output-view', 'input-relation', 'response-column', 'balance-method'
[ USING PARAMETERS sampling_ratio=ratio ] )
Arguments
output-view
- The name of the view where Vertica saves the balanced data from the input relation.
Note
Note: The view that results from this function employs a random function. Its content can differ each time it is used in a query. To make the operations on the view predictable, store it in a regular table. input-relation
- The table or view that contains the data the function uses to create a more balanced data set. If the input relation is defined in Hive, use
SYNC_WITH_HCATALOG_SCHEMA
to sync thehcatalog
schema, and then run the machine learning function. response-column
- Name of the input column that represents the dependent variable, of type VARCHAR or INTEGER.
balance-method
- Specifies a method to select data from the minority and majority classes, one of the following.
-
hybrid_sampling
: Performs over-sampling and under-sampling on different classes so each class is equally represented. -
over_sampling
: Over-samples on all classes, with the exception of the most majority class, towards the most majority class's cardinality. -
under_sampling
: Under-samples on all classes, with the exception of the most minority class, towards the most minority class's cardinality. -
weighted_sampling
: An alias ofunder_sampling
.
-
Parameters
ratio
- The desired ratio between the majority class and the minority class. This value has no effect when used with balance method hybrid_sampling. Default: 1.0
Privileges
Non-superusers:
- SELECT privileges on the input relation
- CREATE privileges on the output view schema
Examples
=> CREATE TABLE backyard_bugs (id identity, bug_type int, finder varchar(20));
CREATE TABLE
=> COPY backyard_bugs FROM STDIN;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> 1|Ants
>> 1|Beetles
>> 3|Ladybugs
>> 3|Ants
>> 3|Beetles
>> 3|Caterpillars
>> 2|Ladybugs
>> 3|Ants
>> 3|Beetles
>> 1|Ladybugs
>> 3|Ladybugs
>> \.
=> SELECT bug_type, COUNT(bug_type) FROM backyard_bugs GROUP BY bug_type;
bug_type | COUNT
----------+-------
2 | 1
1 | 3
3 | 7
(3 rows)
=> SELECT BALANCE('backyard_bugs_balanced', 'backyard_bugs', 'bug_type', 'under_sampling');
BALANCE
--------------------------
Finished in 1 iteration
(1 row)
=> SELECT bug_type, COUNT(bug_type) FROM backyard_bugs_balanced GROUP BY bug_type;
bug_type | COUNT
----------+-------
2 | 1
1 | 2
3 | 1
(3 rows)
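To skew rather than fully equalize the class distribution, pass the sampling_ratio parameter. The following is a minimal sketch against the same backyard_bugs table; the output view name and ratio value are illustrative, and the resulting row counts vary because of the random sampling noted above:
=> SELECT BALANCE('backyard_bugs_ratio', 'backyard_bugs', 'bug_type', 'under_sampling'
                  USING PARAMETERS sampling_ratio=2.0);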
2 - CORR_MATRIX
Takes an input relation with numeric columns, and calculates the Pearson Correlation Coefficient
between each pair of its input columns. The function is implemented as a Multi-Phase Transform function.
Syntax
CORR_MATRIX ( input-columns ) OVER()
Arguments
input-columns
- A comma-separated list of the columns in the input table. The input columns can be of any numeric type or BOOL, but they will be converted internally to FLOAT. The number of input columns must be more than 1 and not more than 1600.
Returns
CORR_MATRIX returns the correlation matrix in triplet format. That is, each pair-wise correlation is identified by three returned columns: name of the first variable, name of the second variable, and the correlation value of the pair. The function also returns two extra columns: number_of_ignored_input_rows and number_of_processed_input_rows. The values of the fourth and fifth columns indicate the number of rows from the input that are ignored or used, respectively, to calculate the corresponding correlation value. Any input pair with NULL, Inf, or NaN is ignored.
The correlation matrix is symmetric with a value of 1 on all diagonal elements; therefore, it can return only the value of elements above the diagonals—that is, the upper triangle. Nevertheless, the function returns the entire matrix to simplify any later operations. Then, the number of output rows is:
(#input-columns)^2
The first two output columns are of type VARCHAR(128), the third one is of type FLOAT, and the last two are of type INT.
Notes
- The contents of the OVER clause must be empty.
- The function returns no rows when the input table is empty.
- When any of X_i and Y_i is NULL, Inf, or NaN, the pair will not be included in the calculation of CORR(X, Y). That is, any input pair with NULL, Inf, or NaN is ignored.
- For the pair of (X,X), regardless of the contents of X: CORR(X,X) = 1, number_of_ignored_input_rows = 0, and number_of_processed_input_rows = #input_rows.
- When (N * SUMX2 == SUMX * SUMX) or (N * SUMY2 == SUMY * SUMY), the value of CORR(X, Y) will be NULL. In theory this can happen in the case of a column with constant values; nevertheless, it may not always be observed because of rounding error.
- In the special case where all pair values of (X_i, Y_i) contain NULL, Inf, or NaN, and X != Y: CORR(X, Y) = NULL.
Examples
The following example uses the iris dataset.
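As a minimal sketch, assuming the iris table has numeric columns Sepal_Length, Sepal_Width, Petal_Length, and Petal_Width (column names are illustrative), each returned row names two variables and gives their correlation along with the processed and ignored row counts:
=> SELECT CORR_MATRIX (Sepal_Length, Sepal_Width, Petal_Length, Petal_Width) OVER()
   FROM iris;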
3 - DETECT_OUTLIERS
Returns the outliers in a data set based on the outlier threshold. The output is a table that contains the outliers. DETECT_OUTLIERS uses the detection method robust_zscore to normalize each input column. The function then identifies as outliers all rows that contain a normalized value greater than the default or specified threshold.
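For reference, a robust z-score is commonly computed as |x - median| / (1.4826 * MAD), where MAD is the median absolute deviation of the column; assuming that standard definition, a row is flagged as an outlier when this quantity exceeds outlier_threshold in any of the specified columns.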
This is a meta-function. You must call meta-functions in a top-level SELECT statement.
Behavior type
Volatile
Syntax
DETECT_OUTLIERS ( 'output-table', 'input-relation', 'input-columns', 'detection-method'
[ USING PARAMETERS
[outlier_threshold = threshold]
[, exclude_columns = 'excluded-columns']
[, partition_columns = 'partition-columns'] ] )
Arguments
output-table
- The name of the table where Vertica saves rows that are outliers along the chosen input-columns. All columns are present in this table.
input-relation
- The table or view that contains outlier data. If the input relation is defined in Hive, use SYNC_WITH_HCATALOG_SCHEMA to sync the hcatalog schema, and then run the machine learning function.
input-columns
- Comma-separated list of columns to use from the input relation, or asterisk (*) to select all columns. Input columns must be of type numeric.
detection-method
- The outlier detection method to use, set to robust_zscore.
Parameters
outlier_threshold
- The minimum normalized value in a row that is used to identify that row as an outlier. Default: 3.0
exclude_columns
- Comma-separated list of column names from input-columns to exclude from processing.
partition_columns
- Comma-separated list of column names from the input table or view that defines the partitions. DETECT_OUTLIERS detects outliers among each partition separately. Default: empty list
Privileges
Non-superusers:
- SELECT privileges on the input relation
- CREATE privileges on the output table
Examples
The following example shows how to use DETECT_OUTLIERS:
=> CREATE TABLE baseball_roster (id identity, last_name varchar(30), hr int, avg float);
CREATE TABLE
=> COPY baseball_roster FROM STDIN;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> Polo|7|.233
>> Gloss|45|.170
>> Gus|12|.345
>> Gee|1|.125
>> Laus|3|.095
>> Hilltop|16|.222
>> Wicker|78|.333
>> Scooter|0|.121
>> Hank|999999|.8888
>> Popup|35|.378
>> \.
=> SELECT * FROM baseball_roster;
id | last_name | hr | avg
----+-----------+--------+--------
3 | Gus | 12 | 0.345
4 | Gee | 1 | 0.125
6 | Hilltop | 16 | 0.222
10 | Popup | 35 | 0.378
1 | Polo | 7 | 0.233
7 | Wicker | 78 | 0.333
9 | Hank | 999999 | 0.8888
2 | Gloss | 45 | 0.17
5 | Laus | 3 | 0.095
8 | Scooter | 0 | 0.121
(10 rows)
=> SELECT DETECT_OUTLIERS('baseball_outliers', 'baseball_roster', 'id, hr, avg', 'robust_zscore' USING PARAMETERS
outlier_threshold=3.0);
DETECT_OUTLIERS
--------------------------
Detected 2 outliers
(1 row)
=> SELECT * FROM baseball_outliers;
id | last_name | hr | avg
----+-----------+------------+-------------
7 | Wicker | 78 | 0.333
9 | Hank | 999999 | 0.8888
(2 rows)
4 - IFOREST
Trains and returns an isolation forest (iForest) model. After you train the model, you can use the APPLY_IFOREST function to predict outliers in an input relation.
For more information about how the iForest algorithm works, see Isolation Forest.
This is a meta-function. You must call meta-functions in a top-level SELECT statement.
Behavior type
Volatile
Syntax
IFOREST( 'model-name', 'input-relation', 'input-columns' [ USING PARAMETERS param=value[,...] ] )
Arguments
model-name
- Identifies the model to create, where model-name conforms to conventions described in Identifiers. It must also be unique among all names of sequences, tables, projections, views, and models within the same schema.
input-relation
- The table or view that contains the input data for IFOREST.
input-columns
- Comma-separated list of columns to use from the input relation, or asterisk (*) to select all columns. Columns must be of types CHAR, VARCHAR, BOOL, INT, or FLOAT. Columns of types CHAR, VARCHAR, and BOOL are treated as categorical features; all others are treated as numeric features.
Parameters
exclude_columns
- Comma-separated list of column names from input-columns to exclude from processing. Default: Empty string ('')
ntree
- Integer in the range [1, 1000], specifies the number of trees in the forest. Default: 100
sampling_size
- Float in the range (0.0, 1.0], specifies the portion of the input data set that is randomly picked, without replacement, for training each tree. Default: 0.632
col_sample_by_tree
- Float in the range (0.0, 1.0], specifies the fraction of columns that are randomly picked for training each tree. Default: 1.0
max_depth
- Integer in the range [1, 100], specifies the maximum depth for growing each tree. Default: 10
nbins
- Integer in the range [2, 1000], specifies the number of bins used to discretize continuous features. Default: 32
Model attributes
details
- Details about the function's predictor columns, including:
- predictor: Names of the predictors in the same order specified when training the model.
- type: Types of the predictors in the same order as their names in predictor.
tree_count
- Number of trees in the model.
rejected_row_count
- Number of rows in input-relation that were skipped because they contained an invalid value.
accepted_row_count
- Total number of rows in input-relation minus rejected_row_count.
call_string
- Value of all input arguments that were specified at the time the function was called.
Privileges
Non-superusers:
- CREATE privileges on the schema where the model is created
- SELECT privileges on the input relation
Examples
In the following example, the input data to the function contains columns of type INT, VARCHAR, and FLOAT:
=> SELECT IFOREST('baseball_anomalies','baseball','team, hr, hits, avg, salary' USING PARAMETERS ntree=75, sampling_size=0.7,
max_depth=15);
IFOREST
----------
Finished
(1 row)
You can verify that all the input columns were read in correctly by calling GET_MODEL_SUMMARY and checking the details section:
=> SELECT GET_MODEL_SUMMARY(USING PARAMETERS model_name='baseball_anomalies');
GET_MODEL_SUMMARY
-------------------------------------------------------------------------------------------------------------------------------------
===========
call_string
===========
SELECT iforest('public.baseball_anomalies', 'baseball', 'team, hr, hits, avg, salary' USING PARAMETERS exclude_columns='', ntree=75,
sampling_size=0.7, col_sample_by_tree=1, max_depth=15, nbins=32);
=======
details
=======
predictor| type
---------+----------------
team |char or varchar
hr | int
hits | int
avg |float or numeric
salary |float or numeric
===============
Additional Info
===============
Name |Value
------------------+-----
tree_count | 75
rejected_row_count| 0
accepted_row_count|1000
(1 row)
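After training, you can score rows with APPLY_IFOREST. The following is a minimal sketch against the same baseball table; the threshold value is illustrative, and the output is a row type with fields such as is_anomaly and anomaly_score (output omitted):
=> SELECT team, hr, salary,
          APPLY_IFOREST(team, hr, hits, avg, salary
                        USING PARAMETERS model_name='baseball_anomalies', threshold=0.6) AS prediction
   FROM baseball;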
5 - IMPUTE
Imputes missing values in a data set with either the mean or the mode, based on observed values for a variable in each column. This function supports numeric and categorical data types.
This is a meta-function. You must call meta-functions in a top-level SELECT statement.
Behavior type
Volatile
Syntax
IMPUTE( 'output-view', 'input-relation', 'input-columns', 'method'
[ USING PARAMETERS [exclude_columns = 'excluded-columns'] [, partition_columns = 'partition-columns'] ] )
Arguments
output-view
- Name of the view that shows the input table with imputed values in place of missing values. In this view, rows without missing values are kept intact while the rows with missing values are modified according to the specified method.
input-relation
- The table or view that contains the data for missing-value imputation. If the input relation is defined in Hive, use SYNC_WITH_HCATALOG_SCHEMA to sync the hcatalog schema, and then run the machine learning function.
input-columns
- Comma-separated list of input columns where missing values will be replaced, or asterisk (*) to specify all columns. All columns must be of type numeric or BOOLEAN.
method
- The method to compute the missing value replacements, one of the following:
- mean: The missing values in each column will be replaced by the mean of that column. This method can be used for numeric data only.
- mode: The missing values in each column will be replaced by the most frequent value in that column. This method can be used for categorical data only.
Parameters
exclude_columns
- Comma-separated list of column names from input-columns to exclude from processing.
partition_columns
- Comma-separated list of column names from the input relation that defines the partitions.
Privileges
Non-superusers:
- SELECT privileges on the input relation
- CREATE privileges on the output view schema
Examples
Execute IMPUTE on the small_input_impute table, specifying the mean method:
=> SELECT impute('output_view','small_input_impute', 'pid, x1,x2,x3,x4','mean'
USING PARAMETERS exclude_columns='pid');
impute
--------------------------
Finished in 1 iteration
(1 row)
Execute IMPUTE, specifying the mode method:
=> SELECT impute('output_view3','small_input_impute', 'pid, x5,x6','mode' USING PARAMETERS exclude_columns='pid');
impute
--------------------------
Finished in 1 iteration
(1 row)
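To verify the result, you can compare missing-value counts before and after imputation. A minimal sketch, assuming small_input_impute has a numeric column x1:
=> SELECT COUNT(*) FROM small_input_impute WHERE x1 IS NULL;
=> SELECT COUNT(*) FROM output_view WHERE x1 IS NULL;  -- expect 0 after imputation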
See also
Imputing missing values
6 - NORMALIZE
Runs a normalization algorithm on an input relation. The output is a view with the normalized data.
Note
This function differs from NORMALIZE_FIT, which creates and stores a model rather than creating a view definition. This can lead to different performance characteristics between the two functions.
This is a meta-function. You must call meta-functions in a top-level SELECT statement.
Behavior type
Volatile
Syntax
NORMALIZE ( 'output-view', 'input-relation', 'input-columns', 'normalization-method'
[ USING PARAMETERS exclude_columns = 'excluded-columns' ] )
Arguments
output-view
- The name of the view showing the input relation with normalized data replacing the specified input columns.
input-relation
- The table or view that contains the data to normalize. If the input relation is defined in Hive, use SYNC_WITH_HCATALOG_SCHEMA to sync the hcatalog schema, and then run the machine learning function.
input-columns
- Comma-separated list of numeric input columns that contain the values to normalize, or asterisk (*) to select all columns.
normalization-method
- The normalization method to use, one of the following:
- minmax
- zscore
- robust_zscore
If infinity values appear in the table, the method ignores those values.
Parameters
exclude_columns
- Comma-separated list of column names from input-columns to exclude from processing.
Privileges
Non-superusers:
- SELECT privileges on the input relation
- CREATE privileges on the output view schema
Examples
These examples show how you can use the NORMALIZE function on the wt and hp columns in the mtcars table.
Execute the NORMALIZE function, and specify the minmax method:
=> SELECT NORMALIZE('mtcars_norm', 'mtcars',
'wt, hp', 'minmax');
NORMALIZE
--------------------------
Finished in 1 iteration
(1 row)
Execute the NORMALIZE function, and specify the zscore method:
=> SELECT NORMALIZE('mtcars_normz','mtcars',
'wt, hp', 'zscore');
NORMALIZE
--------------------------
Finished in 1 iteration
(1 row)
Execute the NORMALIZE function, and specify the robust_zscore method:
=> SELECT NORMALIZE('mtcars_normz', 'mtcars',
'wt, hp', 'robust_zscore');
NORMALIZE
--------------------------
Finished in 1 iteration
(1 row)
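For reference, the three methods correspond to the following standard transformations (assumed here from their common definitions; see Normalizing data for the exact formulas Vertica applies):
- minmax: (x - min) / (max - min), which maps each column onto [0, 1]
- zscore: (x - mean) / stddev
- robust_zscore: (x - median) / (1.4826 * MAD), where MAD is the median absolute deviation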
See also
Normalizing data
7 - NORMALIZE_FIT
Note
This function differs from NORMALIZE, which directly outputs a view with normalized results, rather than storing normalization parameters into a model for later operation.
NORMALIZE_FIT computes normalization parameters for each of the specified columns in an input relation. The resulting model stores the normalization parameters. For example, for MinMax normalization, the minimum and maximum value of each column are stored in the model. The generated model serves as input to the functions APPLY_NORMALIZE and REVERSE_NORMALIZE.
This is a meta-function. You must call meta-functions in a top-level SELECT statement.
Behavior type
Volatile
Syntax
NORMALIZE_FIT ( 'model-name', 'input-relation', 'input-columns', 'normalization-method'
[ USING PARAMETERS [exclude_columns = 'excluded-columns'] [, output_view = 'output-view'] ] )
Arguments
model-name
- Identifies the model to create, where model-name conforms to conventions described in Identifiers. It must also be unique among all names of sequences, tables, projections, views, and models within the same schema.
input-relation
- The table or view that contains the data to normalize. If the input relation is defined in Hive, use SYNC_WITH_HCATALOG_SCHEMA to sync the hcatalog schema, and then run the machine learning function.
input-columns
- Comma-separated list of columns to use from the input relation, or asterisk (*) to select all columns. Input columns must be of data type numeric.
normalization-method
- The normalization method to use, one of the following:
- minmax
- zscore
- robust_zscore
If you specify robust_zscore, NORMALIZE_FIT uses the function APPROXIMATE_MEDIAN [aggregate].
All normalization methods ignore infinity, negative infinity, or NULL values in the input relation.
Parameters
exclude_columns
- Comma-separated list of column names from input-columns to exclude from processing.
output_view
- Name of the view that contains all columns from the input relation, with the specified input columns normalized.
Model attributes
data
- Normalization method set to minmax:
- colNames: Model column names
- mins: Minimum value of each column
- maxes: Maximum value of each column
Privileges
Non-superusers:
- CREATE privileges on the schema where the model is created
- SELECT privileges on the input relation
- CREATE privileges on the output view schema
Examples
The following example creates a model with NORMALIZE_FIT using the wt and hp columns in table mtcars, and then uses this model in successive calls to APPLY_NORMALIZE and REVERSE_NORMALIZE.
=> SELECT NORMALIZE_FIT('mtcars_normfit', 'mtcars', 'wt,hp', 'minmax');
NORMALIZE_FIT
---------------
Success
(1 row)
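To inspect the stored parameters, you can read the model's data attribute, which for a minmax model holds the colNames, mins, and maxes listed under Model attributes (a sketch; output omitted):
=> SELECT GET_MODEL_ATTRIBUTE (USING PARAMETERS model_name='mtcars_normfit', attr_name='data');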
The following call to APPLY_NORMALIZE specifies the hp and cyl columns in table mtcars, where hp is in the normalization model and cyl is not in the normalization model:
=> CREATE TABLE mtcars_normalized AS SELECT APPLY_NORMALIZE (hp, cyl USING PARAMETERS model_name = 'mtcars_normfit') FROM mtcars;
CREATE TABLE
=> SELECT * FROM mtcars_normalized;
hp | cyl
--------------------+-----
0.434628975265018 | 8
0.681978798586572 | 8
0.434628975265018 | 6
1 | 8
0.540636042402827 | 8
0 | 4
0.681978798586572 | 8
0.0459363957597173 | 4
0.434628975265018 | 8
0.204946996466431 | 6
0.250883392226148 | 6
0.049469964664311 | 4
0.204946996466431 | 6
0.201413427561837 | 4
0.204946996466431 | 6
0.250883392226148 | 6
0.049469964664311 | 4
0.215547703180212 | 4
0.0353356890459364 | 4
0.187279151943463 | 6
0.452296819787986 | 8
0.628975265017668 | 8
0.346289752650177 | 8
0.137809187279152 | 4
0.749116607773852 | 8
0.144876325088339 | 4
0.151943462897526 | 4
0.452296819787986 | 8
0.452296819787986 | 8
0.575971731448763 | 8
0.159010600706714 | 4
0.346289752650177 | 8
(32 rows)
A call to REVERSE_NORMALIZE then restores the hp values to their original scale:
=> SELECT REVERSE_NORMALIZE (hp, cyl USING PARAMETERS model_name='mtcars_normfit') FROM mtcars_normalized;
hp | cyl
-----+-----
175 | 8
245 | 8
175 | 6
335 | 8
205 | 8
52 | 4
245 | 8
65 | 4
175 | 8
110 | 6
123 | 6
66 | 4
110 | 6
109 | 4
110 | 6
123 | 6
66 | 4
113 | 4
62 | 4
105 | 6
180 | 8
230 | 8
150 | 8
91 | 4
264 | 8
93 | 4
95 | 4
180 | 8
180 | 8
215 | 8
97 | 4
150 | 8
(32 rows)
The following call to REVERSE_NORMALIZE also specifies the hp and cyl columns in table mtcars, where hp is in normalization model mtcars_normfit, and cyl is not in the normalization model.
=> SELECT REVERSE_NORMALIZE (hp, cyl USING PARAMETERS model_name='mtcars_normfit') FROM mtcars_normalized;
hp | cyl
-----------------+-----
205.000005722046 | 8
150.000000357628 | 8
150.000000357628 | 8
93.0000016987324 | 4
174.99999666214 | 8
94.9999992102385 | 4
214.999997496605 | 8
97.0000009387732 | 4
245.000006556511 | 8
174.99999666214 | 6
335 | 8
245.000006556511 | 8
62.0000002086163 | 4
174.99999666214 | 8
230.000002026558 | 8
52 | 4
263.999997675419 | 8
109.999999523163 | 6
123.000002324581 | 6
64.9999996386468 | 4
66.0000005029142 | 4
112.999997898936 | 4
109.999999523163 | 6
180.000000983477 | 8
180.000000983477 | 8
108.999998658895 | 4
109.999999523163 | 6
104.999999418855 | 6
123.000002324581 | 6
180.000000983477 | 8
66.0000005029142 | 4
90.9999999701977 | 4
(32 rows)
See also
Normalizing data
8 - ONE_HOT_ENCODER_FIT
Generates a sorted list of each of the category levels for each feature to be encoded, and stores the model.
This is a meta-function. You must call meta-functions in a top-level SELECT statement.
Behavior type
Volatile
Syntax
ONE_HOT_ENCODER_FIT ( 'model-name', 'input-relation', 'input-columns'
[ USING PARAMETERS
[exclude_columns = 'excluded-columns']
[, output_view = 'output-view']
[, extra_levels = 'category-levels'] ] )
Arguments
model-name
- Identifies the model to create, where model-name conforms to conventions described in Identifiers. It must also be unique among all names of sequences, tables, projections, views, and models within the same schema.
input-relation
- The table or view that contains the data for one hot encoding. If the input relation is defined in Hive, use SYNC_WITH_HCATALOG_SCHEMA to sync the hcatalog schema, and then run the machine learning function.
input-columns
- Comma-separated list of columns to use from the input relation, or asterisk (*) to select all columns. Input columns must be INTEGER, BOOLEAN, VARCHAR, or dates.
Parameters
exclude_columns
- Comma-separated list of column names from input-columns to exclude from processing.
output_view
- The name of the view that stores the input relation and the one hot encodings. Columns are returned in the order they appear in the input relation, with the one-hot encoded columns appended after the original columns.
extra_levels
- Additional levels in each category that are not in the input relation. This parameter should be passed as a string that conforms with the JSON standard, with category names as keys, and lists of extra levels in each category as values. See the sketch after this list.
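For example, a hypothetical call that declares two extra levels for an integer column named bug_type (all names and values here are illustrative):
=> SELECT ONE_HOT_ENCODER_FIT ('ohe_extra_model', 'backyard_bugs', 'bug_type'
       USING PARAMETERS extra_levels='{"bug_type": [4, 5]}');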
Model attributes
call_string
- The value of all input arguments that were specified at the time the function was called.
varchar_categories, integer_categories, boolean_categories, date_categories
- Settings for all categories of the corresponding type:
- category_name: Column name
- category_level: Levels of the category, sorted for each category
- category_level_index: Index of this categorical level in the sorted list of levels for the category.
Privileges
Non-superusers:
- CREATE privileges on the schema where the model is created
- SELECT privileges on the input relation
- CREATE privileges on the output view schema
Examples
=> SELECT ONE_HOT_ENCODER_FIT ('one_hot_encoder_model','mtcars','*'
USING PARAMETERS exclude_columns='mpg,disp,drat,wt,qsec,vs,am');
ONE_HOT_ENCODER_FIT
--------------------
Success
(1 row)
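To produce the encoded columns, pass the model to APPLY_ONE_HOT_ENCODER. A minimal sketch against the same mtcars table; the drop_first parameter shown is optional and illustrative:
=> SELECT APPLY_ONE_HOT_ENCODER (cyl USING PARAMETERS model_name='one_hot_encoder_model',
       drop_first='true') FROM mtcars;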
9 - PCA
Computes principal components from the input table/view. The results are saved in a PCA model. Internally, PCA finds the components by using SVD on the covariance matrix built from the input data. The singular values of this decomposition are also saved as part of the PCA model. The signs of all elements of a principal component could be flipped all together on different runs.
This is a meta-function. You must call meta-functions in a top-level SELECT statement.
Behavior type
Volatile
Syntax
PCA ( 'model-name', 'input-relation', 'input-columns'
[ USING PARAMETERS
[exclude_columns = 'excluded-columns']
[, num_components = num-components]
[, scale = is-scaled]
[, method = 'method'] ] )
Arguments
model-name
- Identifies the model to create, where model-name conforms to conventions described in Identifiers. It must also be unique among all names of sequences, tables, projections, views, and models within the same schema.
input-relation
- The table or view that contains the input data for PCA.
input-columns
- Comma-separated list of columns to use from the input relation, or asterisk (*) to select all columns. All input columns must be a numeric data type.
Parameters
exclude_columns
- Comma-separated list of column names from input-columns to exclude from processing.
num_components
- The number of components to keep in the model. If this value is not provided, all components are kept. The maximum number of components is the number of non-zero singular values returned by the internal call to SVD. This number is less than or equal to min(number of columns, number of rows).
scale
- A Boolean value that specifies whether to standardize the columns during the preparation step:
- True: Use a correlation matrix instead of a covariance matrix.
- False (default)
method
- The method used to calculate PCA, can be set to LAPACK.
Model attributes
columns
- The information about columns from the input relation used for creating the PCA model:
- index
- name
singular_values
- The information about the singular values found, sorted in descending order:
- index
- value
- explained_variance: percentage of the variance in the data that can be attributed to this singular value
- accumulated_explained_variance: percentage of the variance in the data that can be retained if we drop all singular values after this current one
principal_components
- The principal components corresponding to the singular values mentioned above:
- index: indices of the elements in each component
- PC1
- PC2
- ...
counters
- The information collected during training the model, stored as name-value pairs:
- counter_name, one of the following:
- accepted_row_count: number of valid rows in the data
- rejected_row_count: number of invalid rows (having NULL, INF, or NaN) in the data
- iteration_count: number of iterations, always 1 for the current implementation of PCA
- counter_value
call_string
- The function call that created the model.
Privileges
Non-superusers:
- CREATE privileges on the schema where the model is created
- SELECT privileges on the input relation
Examples
=> SELECT PCA ('pcamodel', 'world','country,HDI,em1970,em1971,em1972,em1973,em1974,em1975,em1976,em1977,
em1978,em1979,em1980,em1981,em1982,em1983,em1984 ,em1985,em1986,em1987,em1988,em1989,em1990,em1991,em1992,
em1993,em1994,em1995,em1996,em1997,em1998,em1999,em2000,em2001,em2002,em2003,em2004,em2005,em2006,em2007,
em2008,em2009,em2010,gdp1970,gdp1971,gdp1972,gdp1973,gdp1974,gdp1975,gdp1976,gdp1977,gdp1978,gdp1979,gdp1980,
gdp1981,gdp1982,gdp1983,gdp1984,gdp1985,gdp1986,gdp1987,gdp1988,gdp1989,gdp1990,gdp1991,gdp1992,gdp1993,
gdp1994,gdp1995,gdp1996,gdp1997,gdp1998,gdp1999,gdp2000,gdp2001,gdp2002,gdp2003,gdp2004,gdp2005,gdp2006,
gdp2007,gdp2008,gdp2009,gdp2010' USING PARAMETERS exclude_columns='HDI,country');
PCA
---------------------------------------------------------------
Finished in 1 iterations.
Accepted Rows: 96 Rejected Rows: 0
(1 row)
=> CREATE TABLE worldPCA AS SELECT
APPLY_PCA (HDI,country,em1970,em1971,em1972,em1973,em1974,em1975,em1976,em1977,em1978,em1979,
em1980,em1981,em1982,em1983,em1984 ,em1985,em1986,em1987,em1988,em1989,em1990,em1991,em1992,em1993,em1994,
em1995,em1996,em1997,em1998,em1999,em2000,em2001,em2002,em2003,em2004,em2005,em2006,em2007,em2008,em2009,
em2010,gdp1970,gdp1971,gdp1972,gdp1973,gdp1974,gdp1975,gdp1976,gdp1977,gdp1978,gdp1979,gdp1980,gdp1981,gdp1982,
gdp1983,gdp1984,gdp1985,gdp1986,gdp1987,gdp1988,gdp1989,gdp1990,gdp1991,gdp1992,gdp1993,gdp1994,gdp1995,
gdp1996,gdp1997,gdp1998,gdp1999,gdp2000,gdp2001,gdp2002,gdp2003,gdp2004,gdp2005,gdp2006,gdp2007,gdp2008,
gdp2009,gdp2010 USING PARAMETERS model_name='pcamodel', exclude_columns='HDI, country', key_columns='HDI,
country',cutoff=.3) OVER () FROM world;
CREATE TABLE
=> SELECT * FROM worldPCA;
HDI | country | col1
------+---------------------+-------------------
0.886 | Belgium | 79002.2946705704
0.699 | Belize | -25631.6670012556
0.427 | Benin | -40373.4104598122
0.805 | Chile | -16805.7940082156
0.687 | China | -37279.2893141103
0.744 | Costa Rica | -19505.5631231635
0.4 | Cote d'Ivoire | -38058.2060339272
0.776 | Cuba | -23724.5779612041
0.895 | Denmark | 117325.594028813
0.644 | Egypt | -34609.9941604549
...
(96 rows)
=> SELECT APPLY_INVERSE_PCA (HDI, country, col1
USING PARAMETERS model_name = 'pcamodel', exclude_columns='HDI,country',
key_columns = 'HDI, country') OVER () FROM worldPCA;
(Output not shown: each of the 96 rows contains the key columns HDI and country, followed by the reconstructed em1970 through em2010 and gdp1970 through gdp2010 values.)
(96 rows)
10 - SUMMARIZE_CATCOL
Returns a statistical summary of categorical data input, in three columns:
- CATEGORY: Categorical levels, of the same SQL data type as the summarized column
- COUNT: The number of rows in each category level, of type INTEGER
- PERCENT: Represents category percentage, of type FLOAT
Syntax
SUMMARIZE_CATCOL (target-column
[ USING PARAMETERS TOPK = topk-value [, WITH_TOTALCOUNT = show-total] ] )
OVER()
Arguments
target-column
- The name of the input column to summarize, one of the following data types:
- BOOLEAN
- FLOAT
- INTEGER
- DATE
- CHAR/VARCHAR
Parameters
TOPK
- Integer, specifies how many of the most frequent rows to include in the output.
WITH_TOTALCOUNT
- A Boolean value that specifies whether the table contains a heading row that displays the total number of rows displayed in the target column, and a percent equal to 100. Default: true
Examples
This example shows the categorical summary for the current_salary column in the salary_data table. The output of the query shows the column category, count, and percent. The first column gives the categorical levels, with the same SQL data type as the input column, the second column gives a count of that value, and the third column gives a percentage.
=> SELECT SUMMARIZE_CATCOL (current_salary USING PARAMETERS TOPK = 5) OVER() FROM salary_data;
CATEGORY | COUNT | PERCENT
---------+-------+---------
| 1000 | 100
39004 | 2 | 0.2
35321 | 1 | 0.1
36313 | 1 | 0.1
36538 | 1 | 0.1
36562 | 1 | 0.1
(6 rows)
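To omit the heading row that carries the total count, set WITH_TOTALCOUNT to false; the output then lists only the five most frequent values (a sketch):
=> SELECT SUMMARIZE_CATCOL (current_salary USING PARAMETERS TOPK = 5, WITH_TOTALCOUNT = false) OVER() FROM salary_data;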
11 - SUMMARIZE_NUMCOL
Returns a statistical summary of columns in a Vertica table:
- Count
- Mean
- Standard deviation
- Min/max values
- Approximate percentile
- Median
All summary values are FLOAT data types, except INTEGER for count.
Syntax
SUMMARIZE_NUMCOL (input-columns [ USING PARAMETERS exclude_columns = 'excluded-columns'] ) OVER()
Arguments
input-columns
- Comma-separated list of columns to use from the input relation, or asterisk (*) to select all columns. All columns must be a numeric data type. If you select all columns, SUMMARIZE_NUMCOL summarizes all columns.
Parameters
exclude_columns
- Comma-separated list of column names from input-columns to exclude from processing.
Examples
Show the statistical summary for the age and salary columns in the employee table:
=> SELECT SUMMARIZE_NUMCOL(* USING PARAMETERS exclude_columns='id,name,gender,title') OVER() FROM employee;
COLUMN | COUNT | MEAN | STDDEV | MIN | PERC25 | MEDIAN | PERC75 | MAX
---------------+-------+------------+------------------+---------+---------+---------+-----------+--------
age | 5 | 63.4 | 19.3209730603818 | 44 | 45 | 67 | 71 | 90
salary | 5 | 3456.76 | 1756.78754300285 | 1234.56 | 2345.67 | 3456.78 | 4567.89 | 5678.9
(2 rows)
12 - SVD
Computes singular values (the diagonal of the S matrix) and right singular vectors (the V matrix) of an SVD decomposition of the input relation. The results are saved as an SVD model. The signs of all elements of a singular vector in SVD could be flipped all together on different runs.
This is a meta-function. You must call meta-functions in a top-level SELECT statement.
Behavior type
Volatile
Syntax
SVD ( 'model-name', 'input-relation', 'input-columns'
[ USING PARAMETERS
[exclude_columns = 'excluded-columns']
[, num_components = num-components]
[, method = 'method'] ] )
Arguments
model-name
- Identifies the model to create, where model-name conforms to conventions described in Identifiers. It must also be unique among all names of sequences, tables, projections, views, and models within the same schema.
input-relation
- The table or view that contains the input data for SVD.
input-columns
- Comma-separated list of columns to use from the input relation, or asterisk (*) to select all columns. Input columns must be a numeric data type.
Parameters
exclude_columns
- Comma-separated list of column names from input-columns to exclude from processing.
num_components
- The number of components to keep in the model. The maximum number of components is the number of non-zero singular values computed, which is less than or equal to min(number of columns, number of rows). If you omit this parameter, all components are kept.
method
- The method used to calculate SVD, can be set to LAPACK.
Model attributes
columns
- The information about columns from the input relation used for creating the SVD model:
- index
- name
singular_values
- The information about the singular values found, sorted in descending order:
- index
- value
- explained_variance: percentage of the variance in the data that can be attributed to this singular value
- accumulated_explained_variance: percentage of the variance in the data that can be retained if we drop all singular values after this current one
right_singular_vectors
- The right singular vectors corresponding to the singular values mentioned above:
- index: indices of the elements in each vector
- vector1
- vector2
- ...
counters
- The information collected during training the model, stored as name-value pairs:
- counter_name, one of the following:
- accepted_row_count: number of valid rows in the data
- rejected_row_count: number of invalid rows (having NULL, INF, or NaN) in the data
- iteration_count: number of iterations, always 1 for the current implementation of SVD
- counter_value
call_string
- The function call that created the model.
Privileges
Non-superusers:
- CREATE privileges on the schema where the model is created
- SELECT privileges on the input relation
Examples
=> SELECT SVD ('svdmodel', 'small_svd', 'x1,x2,x3,x4');
SVD
--------------------------------------------------------------
Finished in 1 iterations.
Accepted Rows: 8 Rejected Rows: 0
(1 row)
=> CREATE TABLE transform_svd AS SELECT
APPLY_SVD (id, x1, x2, x3, x4 USING PARAMETERS model_name='svdmodel', exclude_columns='id', key_columns='id')
OVER () FROM small_svd;
CREATE TABLE
=> SELECT * FROM transform_svd;
id | col1 | col2 | col3 | col4
----+-------------------+---------------------+---------------------+--------------------
4 | 0.44849499240202 | -0.347260956311326 | 0.186958376368345 | 0.378561270493651
6 | 0.17652411036246 | -0.0753183783382909 | -0.678196192333598 | 0.0567124770173372
1 | 0.494871802886819 | 0.161721379259287 | 0.0712816417153664 | -0.473145877877408
2 | 0.17652411036246 | -0.0753183783382909 | -0.678196192333598 | 0.0567124770173372
3 | 0.150974762654569 | 0.589561842046029 | 0.00392654610109522 | 0.360011163271921
5 | 0.494871802886819 | 0.161721379259287 | 0.0712816417153664 | -0.473145877877408
8 | 0.44849499240202 | -0.347260956311326 | 0.186958376368345 | 0.378561270493651
7 | 0.150974762654569 | 0.589561842046029 | 0.00392654610109522 | 0.360011163271921
(8 rows)
=> SELECT APPLY_INVERSE_SVD (* USING PARAMETERS model_name='svdmodel', exclude_columns='id',
key_columns='id') OVER () FROM transform_svd;
id | x1 | x2 | x3 | x4
----+------------------+------------------+------------------+------------------
4 | 91.4056627665577 | 44.7629617207482 | 83.1704961993117 | 38.9274292265543
6 | 20.6468626294368 | 9.30974906868751 | 8.71006863405534 | 6.5855928603967
7 | 31.2494347777156 | 20.6336519003026 | 27.5668287751507 | 5.84427645886865
1 | 107.93376580719 | 51.6980548011917 | 97.9665796560552 | 40.4918236881051
2 | 20.6468626294368 | 9.30974906868751 | 8.71006863405534 | 6.5855928603967
3 | 31.2494347777156 | 20.6336519003026 | 27.5668287751507 | 5.84427645886865
5 | 107.93376580719 | 51.6980548011917 | 97.9665796560552 | 40.4918236881051
8 | 91.4056627665577 | 44.7629617207482 | 83.1704961993117 | 38.9274292265543
(8 rows)
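To examine how much variance each component captures, you can read the model's singular_values attribute (a sketch; output omitted):
=> SELECT GET_MODEL_ATTRIBUTE (USING PARAMETERS model_name='svdmodel', attr_name='singular_values');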