IMPUTE

Imputes missing values in a data set with either the mean or the mode, based on observed values for a variable in each column.

Imputes missing values in a data set with either the mean or the mode, based on observed values for a variable in each column. This function supports numeric and categorical data types.

This is a meta-function. You must call meta-functions in a top-level SELECT statement.

Behavior type

Volatile

Syntax

IMPUTE( 'output-view', 'input-relation', 'input-columns', 'method'
        [ USING PARAMETERS [exclude_columns = 'excluded-columns'] [, partition_columns = 'partition-columns'] ] )

This is a meta-function. You must call meta-functions in a top-level SELECT statement.

Behavior type

Volatile

Arguments

output-view
Name of the view that shows the input table with imputed values in place of missing values. In this view, rows without missing values are kept intact while the rows with missing values are modified according to the specified method.
input-relation
The table or view that contains the data for missing-value imputation. If the input relation is defined in Hive, use SYNC_WITH_HCATALOG_SCHEMA to sync the hcatalog schema, and then run the machine learning function.
input-columns
Comma-separated list of input columns where missing values will be replaced, or asterisk (*) to specify all columns. All columns must be of type numeric or BOOLEAN.
method
The method to compute the missing value replacements, one of the following:
  • mean: The missing values in each column will be replaced by the mean of that column. This method can be used for numeric data only.

  • mode: The missing values in each column will be replaced by the most frequent value in that column. This method can be used for categorical data only.

Parameters

exclude_columns

Comma-separated list of column names from input-columns to exclude from processing.

partition_columns
Comma-separated list of column names from the input relation that defines the partitions.

Privileges

Non-superusers:

  • SELECT privileges on the input relation

  • CREATE privileges on the output view schema

Examples

Execute IMPUTE on the small_input_impute table, specifying the mean method:

=> SELECT impute('output_view','small_input_impute', 'pid, x1,x2,x3,x4','mean'
USING PARAMETERS exclude_columns='pid');
impute
--------------------------
Finished in 1 iteration
(1 row)

Execute IMPUTE, specifying the mode method:

=> SELECT impute('output_view3','small_input_impute', 'pid, x5,x6','mode' USING PARAMETERS exclude_columns='pid');
impute
--------------------------
Finished in 1 iteration
(1 row)

See also

Imputing missing values