NAIVE_BAYES

Executes the Naive Bayes algorithm on an input relation and returns a Naive Bayes model.

Columns are treated according to data type:

  • FLOAT: Values are assumed to follow a Gaussian distribution.

  • INTEGER: Values are assumed to follow a multinomial distribution.

  • CHAR/VARCHAR: Values are assumed to follow a categorical distribution. String values stored in these columns must not exceed 128 characters.

  • BOOLEAN: Values are treated as categorical with two values.
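
Because the event model follows from the column type, one way to steer how a column is treated is to cast it in a view and train on that view. The sketch below is illustrative only: the table patients, the view patients_nb, and the columns id, outcome, age, and region are hypothetical names, not part of the original example.

```sql
-- Hypothetical table and column names, for illustration only.
-- Casting the INTEGER column age to FLOAT makes it eligible for
-- Gaussian rather than multinomial treatment.
CREATE VIEW patients_nb AS
    SELECT id,
           outcome,
           age::FLOAT AS age,
           region
    FROM patients;
```

The view can then be passed as the input relation in place of the base table.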

This is a meta-function. You must call meta-functions in a top-level SELECT statement.

Behavior type

Volatile

Syntax

NAIVE_BAYES ( 'model-name', 'input-relation', 'response-column', 'predictor-columns'
        [ USING PARAMETERS [exclude_columns = 'excluded-columns'] [, alpha = alpha-value] ] )

Arguments

model-name
Identifies the model to create, where model-name conforms to conventions described in Identifiers. It must also be unique among all names of sequences, tables, projections, views, and models within the same schema.
input-relation
The table or view that contains the training data for building the model. If the input relation is defined in Hive, use SYNC_WITH_HCATALOG_SCHEMA to sync the hcatalog schema, and then run the machine learning function.
response-column
Name of the input column that represents the dependent variable, or outcome. This column must contain discrete values that represent the class labels.

The response column must be of type numeric, CHAR/VARCHAR, or BOOLEAN; otherwise the model is invalid.

predictor-columns

Comma-separated list of columns in the input relation that represent independent variables for the model, or asterisk (*) to select all columns. If you select all columns, the argument list for parameter exclude_columns must include response-column, and any columns that are invalid as predictor columns.

All predictor columns must be of type numeric, CHAR/VARCHAR, or BOOLEAN; otherwise the model is invalid. BOOLEAN column values are converted to FLOAT values before training: 0 for false, 1 for true.

Parameters

exclude_columns
Comma-separated list of columns from predictor-columns to exclude from processing.
alpha
Float, specifies the Laplace smoothing value to use if the event model is categorical, multinomial, or Bernoulli.

Default: 1.0
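
A call that lists predictor columns explicitly and overrides the default smoothing value might look like the following. This is a minimal sketch based on the syntax above; the model name, table name, and column names are hypothetical.

```sql
-- Hypothetical model, table, and column names.
-- Predictors are listed explicitly, so exclude_columns is not needed.
SELECT NAIVE_BAYES('nb_example_model', 'example_train', 'label',
                   'feature_a, feature_b'
                   USING PARAMETERS alpha = 0.5);
```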

Model attributes

colsInfo
The information from the response and predictor columns used in training:
  • index: The index (starting at 0) of the column as provided in training. Index 0 is used for the response column.

  • name: The column name.

  • type: The label used for response with a value of Gaussian, Multinominal, Categorical, or Bernoulli.

alpha
The smoothing parameter value.
prior
The percentage of each class among all training samples:
  • label: The class label.

  • value: The percentage of each class.

nRowsTotal
The number of samples accepted for training from the data set.
nRowsRejected
The number of samples rejected for training.
callStr
The SQL statement used to replicate the training.
Gaussian
The Gaussian model conditioned on the class indicated by the class_name:
  • index: The index of the predictor column.

  • mu: The mean value of the model.

  • sigmaSq: The variance (squared standard deviation) of the model.

Multinominal
The Multinomial model conditioned on the class indicated by the class_name:
  • index: The index of the predictor column.

  • prob: The probability conditioned on the class indicated by the class_name.

Bernoulli
The Bernoulli model conditioned on the class indicated by the class_name:
  • index: The index of the predictor column.

  • probTrue: The probability of having the value TRUE in this predictor column.

Categorical
The Categorical model conditioned on the class indicated by the class_name:
  • category: The value in the predictor column.

  • <class_name>: The probability of having that value conditioned on the class indicated by the class_name.
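
To inspect these attributes after training, the GET_MODEL_ATTRIBUTE and GET_MODEL_SUMMARY functions can be used. The sketch below assumes the model from the Examples section, naive_house84_model, already exists; output is omitted because it depends on the training data.

```sql
-- List the attributes stored in the model.
SELECT GET_MODEL_ATTRIBUTE(USING PARAMETERS model_name = 'naive_house84_model');

-- Return a single attribute, here the class priors.
SELECT GET_MODEL_ATTRIBUTE(USING PARAMETERS model_name = 'naive_house84_model',
                           attr_name = 'prior');

-- Return a text summary of the whole model.
SELECT GET_MODEL_SUMMARY(USING PARAMETERS model_name = 'naive_house84_model');
```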

Privileges

Superuser, or SELECT privileges on the input relation.

Examples

=> SELECT NAIVE_BAYES('naive_house84_model', 'house84_train', 'party', '*'
                      USING PARAMETERS exclude_columns='party, id');
                                  NAIVE_BAYES
--------------------------------------------------
 Finished. Accepted Rows: 324  Rejected Rows: 0
(1 row)
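
Once trained, the model can be applied with PREDICT_NAIVE_BAYES. The following is a sketch only: the table house84_test and the predictor columns vote1, vote2, and vote3 are assumed for illustration and do not appear in the example above. Pass the same predictor columns that were used for training.

```sql
-- house84_test, vote1, vote2, and vote3 are assumed names.
SELECT party,
       PREDICT_NAIVE_BAYES(vote1, vote2, vote3
                           USING PARAMETERS model_name = 'naive_house84_model',
                                            type = 'response') AS predicted_party
FROM house84_test;
```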

See also