# CHI_SQUARED

Computes the conditional chi-square independence test on two categorical variables to find the likelihood that the two variables are independent. To condition the independence test on another set of variables, you can partition the data on these variables using a PARTITION BY clause.

#### Tip

If a categorical column is not of a numeric data type, you can use the HASH function to convert it into a column of type INT, where each category is mapped to a unique integer. However, note that NULL values are hashed to zero, so they will be included in the test instead of skipped by the function.This function is a multi-phase transform function.

## Syntax

```
CHI_SQUARED( 'x-column', 'y-column'
[ USING PARAMETERS param=value[,...] ] )
```

## Arguments

,`x-column`

`y-column`

- Columns in the input relation to be tested for dependency with each other. These columns must contain categorical data in numeric format.

## Parameters

`x_cardinality`

- Integer in the range [1, 20], the cardinality of
*x-column*. If the cardinality of*x-column*is less than the default value of 20, setting this parameter can decrease the amount memory used by the function.**Default:**20 `y_cardinality`

- Integer in the range [1, 20], the cardinality of
*y-column*. If the cardinality of*y-column*is less than the default value of 20, setting this parameter can decrease the amount memory used by the function.**Default:**20 `alpha`

- Float in the range (0.0, 1.0), the significance level. If the returned
`pvalue`

is less than this value, the null hypothesis, which assumes the variables are independent, is rejected.**Default:**0.05

## Returns

The function returns two values:

`pvalue`

(float): the confidence that the two variables are independent. If this value is greater than the`alpha`

parameter value, the null hypothesis is accepted and the variables are considered independent.`independent`

(boolean): true if the variables are independent; otherwise, false.

## Privileges

SELECT privileges on the input relation

## Examples

The following examples use the `titanic`

dataset from the machine learning example data. If you have not downloaded these datasets, see Download the machine learning example data for instructions.

The `titanic_training`

table contains data related to passengers on the Titanic, including:

`pclass`

: the ticket class of the passenger, ranging from 1st class to 3rd class`survived`

: whether the passenger survived, where 1 is yes and 0 is no`gender`

: gender of the passenger`sibling_and_spouse_count`

: number of siblings aboard the Titanic`embarkation_point`

: port of embarkation

To test whether the survival of a passenger is dependent on their ticket class, run the following chi-square test:

```
=> SELECT CHI_SQUARED(pclass, survived USING PARAMETERS x_cardinality=3, y_cardinality=2, alpha=0.05) OVER() FROM titanic_training;
pvalue | independent
--------+-------------
0 | f
(1 row)
```

With a returned `pvalue`

of zero, the null hypothesis is rejected and you can conclude that the `survived`

and `pclass`

variables are dependent. To test whether this outcome is conditional on the gender of the passenger, partition by the `gender`

column in the OVER clause:

```
=> SELECT CHI_SQUARED(pclass, survived USING PARAMETERS x_cardinality=3, y_cardinality=2) OVER(PARTITION BY gender) FROM titanic;
pvalue | independent
--------+-------------
0 | f
(1 row)
```

As the `pvalue`

is still zero, it is clear that the dependence of the `pclass`

and `survived`

variables is not conditional on the gender of the passenger.

If one of the categorical columns that you want to test is not a numeric type, use the HASH function to convert it into type INT:

```
=> SELECT CHI_SQUARED(sibling_and_spouse_count, HASH(embarkation_point) USING PARAMETERS alpha=0.05) OVER() FROM titanic_training;
pvalue | independent
--------------------+-------------
0.0753039994044853 | t
(1 row)
```

The returned `pvalue`

is greater than `alpha`

, meaning the null hypothesis is accepted and the `sibling_and_spouse_count`

and `embarkation_point`

are independent.