CHI_SQUARED

Computes the conditional chi-square independence test on two categorical variables to find the likelihood that the two variables are independent. To condition the independence test on another set of variables, you can partition the data on these variables using a PARTITION BY clause.

This function is a multi-phase transform function.

Syntax

CHI_SQUARED( 'x-column', 'y-column' 
    [ USING PARAMETERS param=value[,...] ] )

Arguments

x-column, y-column
Columns in the input relation to test for dependence on each other. These columns must contain categorical data in numeric format.

Parameters

x_cardinality
Integer in the range [1, 20], the cardinality of x-column. If the cardinality of x-column is less than the default value of 20, setting this parameter can decrease the amount of memory used by the function.

Default: 20

y_cardinality
Integer in the range [1, 20], the cardinality of y-column. If the cardinality of y-column is less than the default value of 20, setting this parameter can decrease the amount of memory used by the function.

Default: 20

alpha
Float in the range (0.0, 1.0), the significance level. If the returned pvalue is less than this value, the null hypothesis, which assumes the variables are independent, is rejected.

Default: 0.05

Returns

The function returns two values:

  • pvalue (float): the probability, assuming the two variables are independent, of observing an association at least as strong as the one in the data. If this value is greater than the alpha parameter value, the null hypothesis is not rejected and the variables are considered independent.
  • independent (boolean): true if the null hypothesis of independence is not rejected at the alpha significance level; otherwise, false.
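
As a rough illustration of how these two return values relate, the following Python sketch runs an unconditional chi-square independence test on a contingency table of counts. It is not Vertica's implementation: the function names chi2_sf and chi_squared_independence and the example counts are hypothetical. The sketch assumes every row and column total is nonzero, and it treats a pvalue equal to alpha as not rejected.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square distribution with df degrees of freedom."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        q, n = math.exp(-half), 2                # closed form for df = 2
    else:
        q, n = math.erfc(math.sqrt(half)), 1     # closed form for df = 1
    # Recurrence: Q(x; n + 2) = Q(x; n) + (x/2)^(n/2) * exp(-x/2) / Gamma(n/2 + 1)
    while n < df:
        q += math.exp((n / 2.0) * math.log(half) - half - math.lgamma(n / 2.0 + 1.0))
        n += 2
    return min(q, 1.0)


def chi_squared_independence(table, alpha=0.05):
    """Return (pvalue, independent) for a table of observed counts.

    table[i][j] is the number of rows with the i-th x category and the
    j-th y category. Assumes all row and column totals are nonzero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    pvalue = chi2_sf(stat, df)
    # The null hypothesis (independence) is rejected only when pvalue < alpha.
    return pvalue, pvalue >= alpha


# Hypothetical 2x2 counts, not taken from the titanic data.
pvalue, independent = chi_squared_independence([[10, 20], [20, 10]], alpha=0.05)
print(pvalue, independent)
```

A smaller pvalue means the observed counts are less plausible under independence, so lowering alpha makes the test slower to reject the null hypothesis.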

Privileges

SELECT privileges on the input relation

Examples

The following examples use the titanic dataset from the machine learning example data. If you have not downloaded these datasets, see Download the machine learning example data for instructions.

The titanic_training table contains data related to passengers on the Titanic, including:

  • pclass: the ticket class of the passenger, ranging from 1st class to 3rd class
  • survived: whether the passenger survived, where 1 is yes and 0 is no
  • gender: gender of the passenger
  • sibling_and_spouse_count: number of siblings and spouses aboard the Titanic
  • embarkation_point: port of embarkation

To test whether the survival of a passenger is dependent on their ticket class, run the following chi-square test:

=> SELECT CHI_SQUARED(pclass, survived USING PARAMETERS x_cardinality=3, y_cardinality=2, alpha=0.05) OVER() FROM titanic_training;
 pvalue | independent
--------+-------------
      0 | f
(1 row)

With a returned pvalue of zero, the null hypothesis is rejected and you can conclude that the survived and pclass variables are dependent. To test whether this outcome is conditional on the gender of the passenger, partition by the gender column in the OVER clause:

=> SELECT CHI_SQUARED(pclass, survived USING PARAMETERS x_cardinality=3, y_cardinality=2) OVER(PARTITION BY gender) FROM titanic_training;
 pvalue | independent
--------+-------------
      0 | f
(1 row)

Because the pvalue is still zero, the pclass and survived variables remain dependent after conditioning on the gender of the passenger.
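
To make the idea of conditioning concrete, the Python sketch below compares a pooled chi-square statistic with a stratified one. It uses one textbook approach, summing the per-stratum statistics and degrees of freedom. This page does not state how CHI_SQUARED combines partitions internally, so treat the approach as an assumption. The function names and counts are hypothetical, and the sketch assumes nonzero row and column totals in every table.

```python
def chi_square_statistic(table):
    """Return (statistic, degrees_of_freedom) for one contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df


def conditional_statistic(strata):
    """Sum the per-stratum statistics and degrees of freedom."""
    stat_sum, df_sum = 0.0, 0
    for table in strata.values():
        stat, df = chi_square_statistic(table)
        stat_sum += stat
        df_sum += df
    return stat_sum, df_sum


# Hypothetical counts: rows are x categories, columns are y categories,
# and each key is one value of the conditioning (partition) variable.
strata = {
    "a": [[10, 40], [20, 80]],
    "b": [[80, 20], [40, 10]],
}
# Element-wise sum of the strata, as if the conditioning variable were ignored.
pooled = [[a + b for a, b in zip(row_a, row_b)]
          for row_a, row_b in zip(strata["a"], strata["b"])]

pooled_stat, pooled_df = chi_square_statistic(pooled)
cond_stat, cond_df = conditional_statistic(strata)
print(pooled_stat, pooled_df)
print(cond_stat, cond_df)
```

In these made-up counts the pooled table shows a strong association while each stratum shows none, the opposite of the titanic result, where the dependence persists after partitioning.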

If one of the categorical columns that you want to test is not a numeric type, use the HASH function to convert it into type INT:

=> SELECT CHI_SQUARED(sibling_and_spouse_count, HASH(embarkation_point) USING PARAMETERS alpha=0.05) OVER() FROM titanic_training;
       pvalue       | independent
--------------------+-------------
 0.0753039994044853 | t
(1 row)

The returned pvalue is greater than alpha, so the null hypothesis is not rejected and the sibling_and_spouse_count and embarkation_point variables are considered independent.