CHI_SQUARED
Computes the conditional chi-square independence test on two categorical variables to find the likelihood that the two variables are independent. To condition the independence test on another set of variables, you can partition the data on these variables using a PARTITION BY clause.
Tip
If a categorical column is not of a numeric data type, you can use the HASH function to convert it into a column of type INT, where each category is mapped to a unique integer. However, note that NULL values are hashed to zero, so they will be included in the test instead of skipped by the function.This function is a multi-phase transform function.
Syntax
CHI_SQUARED( 'x-column', 'y-column'
[ USING PARAMETERS param=value[,...] ] )
Arguments
x-column
,y-column
- Columns in the input relation to be tested for dependency with each other. These columns must contain categorical data in numeric format.
Parameters
x_cardinality
- Integer in the range [1, 20], the cardinality of x-column. If the cardinality of x-column is less than the default value of 20, setting this parameter can decrease the amount memory used by the function.
Default: 20
y_cardinality
- Integer in the range [1, 20], the cardinality of y-column. If the cardinality of y-column is less than the default value of 20, setting this parameter can decrease the amount memory used by the function.
Default: 20
alpha
- Float in the range (0.0, 1.0), the significance level. If the returned
pvalue
is less than this value, the null hypothesis, which assumes the variables are independent, is rejected.Default: 0.05
Returns
The function returns two values:
pvalue
(float): the confidence that the two variables are independent. If this value is greater than thealpha
parameter value, the null hypothesis is accepted and the variables are considered independent.independent
(boolean): true if the variables are independent; otherwise, false.
Privileges
SELECT privileges on the input relation
Examples
The following examples use the titanic
dataset from the machine learning example data. If you have not downloaded these datasets, see Download the machine learning example data for instructions.
The titanic_training
table contains data related to passengers on the Titanic, including:
pclass
: the ticket class of the passenger, ranging from 1st class to 3rd classsurvived
: whether the passenger survived, where 1 is yes and 0 is nogender
: gender of the passengersibling_and_spouse_count
: number of siblings aboard the Titanicembarkation_point
: port of embarkation
To test whether the survival of a passenger is dependent on their ticket class, run the following chi-square test:
=> SELECT CHI_SQUARED(pclass, survived USING PARAMETERS x_cardinality=3, y_cardinality=2, alpha=0.05) OVER() FROM titanic_training;
pvalue | independent
--------+-------------
0 | f
(1 row)
With a returned pvalue
of zero, the null hypothesis is rejected and you can conclude that the survived
and pclass
variables are dependent. To test whether this outcome is conditional on the gender of the passenger, partition by the gender
column in the OVER clause:
=> SELECT CHI_SQUARED(pclass, survived USING PARAMETERS x_cardinality=3, y_cardinality=2) OVER(PARTITION BY gender) FROM titanic;
pvalue | independent
--------+-------------
0 | f
(1 row)
As the pvalue
is still zero, it is clear that the dependence of the pclass
and survived
variables is not conditional on the gender of the passenger.
If one of the categorical columns that you want to test is not a numeric type, use the HASH function to convert it into type INT:
=> SELECT CHI_SQUARED(sibling_and_spouse_count, HASH(embarkation_point) USING PARAMETERS alpha=0.05) OVER() FROM titanic_training;
pvalue | independent
--------------------+-------------
0.0753039994044853 | t
(1 row)
The returned pvalue
is greater than alpha
, meaning the null hypothesis is accepted and the sibling_and_spouse_count
and embarkation_point
are independent.