The Chi-Square Test of Independence tests for independence between two categorical variables. The test has many applications, from survey analysis to feature selection in machine learning. In this tutorial, you’ll learn how to calculate the chi-square test in Python using the SciPy library.
By the end of this tutorial, you’ll have learned the following:
- What the Chi-Square Test of Independence is and how to interpret it
- How to calculate the Chi-Square Test of Independence in Python using SciPy
- How to work better with 2×2 contingency tables using Yates’ Correction
What is the Chi-Square Test of Independence?
The Chi-Square Test of Independence is used to test whether or not there is an association between two categorical variables. This has many different applications, across multiple domains, including:
- Customer Satisfaction: Assessing if satisfaction levels vary significantly among different demographics.
- Disease and Risk Factors: Studying if the occurrence of a disease is independent of certain risk factors like smoking habits, diet, or exercise.
- Voting Patterns: Analyzing if voting patterns are independent of income levels or educational backgrounds.
- Exam Performance: Analyzing if exam scores are independent of study habits or time spent on studying.
Understanding the appropriate circumstances for using this test is crucial. It is applied when investigating associations between categorical variables and when comparing observed data with expected distributions.
The test evaluates if there’s an association between two categorical variables in a dataset. It measures the difference between the expected and observed frequencies in a contingency table to determine if these differences are statistically significant or simply due to chance.
Like many statistical tests, the chi-square test uses two hypotheses:
- The null hypothesis (H0) states that the two variables are independent,
- The alternative hypothesis (H1) states that the two variables are not independent, meaning that they are associated
The test works using the following formula:
X² = Σ (O − E)² / E
Where:
- O is the observed value, and
- E is the expected value
You convert the test statistic (X²) to a p-value using your data’s degrees of freedom. If the p-value is less than your chosen significance level, you reject the null hypothesis. If it’s not, you fail to reject the null hypothesis.
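To make the formula concrete, here’s a minimal sketch that computes the statistic by hand for a small, made-up 2×2 table and converts it to a p-value using SciPy’s chi-square survival function:

```python
# A minimal sketch of the chi-square formula, using a small made-up
# 2x2 table. Expected counts come from the independence assumption:
# expected = (row total * column total) / grand total.
import numpy as np
from scipy.stats import chi2

observed = np.array([[30, 10],
                     [20, 40]])

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()

# X^2 = sum of (O - E)^2 / E over every cell
statistic = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# Convert the statistic to a p-value (right tail of the chi2 distribution)
p_value = chi2.sf(statistic, dof)
print(statistic, dof, p_value)
```

This mirrors what SciPy computes for you (without Yates’ correction, covered later in this tutorial).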
Now that you have a strong understanding of the chi-square test, let’s dive into how to use the SciPy library to calculate it.
How to Calculate the Chi-Square Test of Independence in Python
In Python, you can use the powerful SciPy library to calculate the chi-square test of independence. For this section, we’ll work with a sample scenario of testing whether there is an association between tenure at a company and job satisfaction.
In Python, we can use the Pandas crosstab function to create a contingency table. For our sample data, the cross tab (or contingency table) looks like this:
| Job Satisfaction | Less Than 1 Year | 1-3 Years | 3-5 Years | Over 5 Years |
|---|---|---|---|---|
| Satisfied | 25 | 40 | 30 | 50 |
| Neutral | 15 | 20 | 25 | 30 |
| Dissatisfied | 10 | 15 | 20 | 25 |
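As a quick illustration of the crosstab step, if the raw responses lived in a long-format DataFrame (one row per respondent; the rows below are made up for demonstration), a contingency table like the one above could be built with pd.crosstab:

```python
# Building a contingency table with pd.crosstab from hypothetical
# long-format survey data (one row per respondent).
import pandas as pd

survey = pd.DataFrame({
    'satisfaction': ['Satisfied', 'Neutral', 'Dissatisfied', 'Satisfied'],
    'tenure': ['1-3 Years', 'Over 5 Years', '1-3 Years', 'Less Than 1 Year'],
})

table = pd.crosstab(survey['satisfaction'], survey['tenure'])
print(table)
```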
We can load this into a Python dataset, using the following list of lists:
# Our Sample Dataset
data = [
    [25, 40, 30, 50],
    [15, 20, 25, 30],
    [10, 15, 20, 25]
]
Now that we have loaded our data, we can import the SciPy stats module. We can then use the chi2_contingency function to calculate our chi-square test:
# Running a Chi-Square Test of Independence in Python
import scipy.stats as stats
res = stats.chi2_contingency(data)
print(res)
# Returns:
# Chi2ContingencyResult(statistic=3.0617653191544822, pvalue=0.8010556748879476, dof=6, expected_freq=array([[23.7704918 , 35.6557377 , 35.6557377 , 49.91803279],
# [14.75409836, 22.13114754, 22.13114754, 30.98360656],
# [11.47540984, 17.21311475, 17.21311475, 24.09836066]]))
We can see that by passing our contingency table into the chi2_contingency function, we get back a result object with several attributes, including:
- statistic, the test statistic (X² from our earlier formula)
- pvalue, the p-value associated with our test statistic and degrees of freedom
- dof, the degrees of freedom, which are equal to (# of columns – 1) * (# of rows – 1)
- expected_freq, the expected frequencies
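In recent SciPy versions, which return the Chi2ContingencyResult object shown above, these fields can also be read by attribute name:

```python
# Accessing the result's fields by attribute name
import scipy.stats as stats

data = [
    [25, 40, 30, 50],
    [15, 20, 25, 30],
    [10, 15, 20, 25]
]

res = stats.chi2_contingency(data)
print(res.statistic)  # the X^2 test statistic
print(res.pvalue)     # the p-value for the test
print(res.dof)        # (3 - 1) * (4 - 1) = 6
```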
In many cases, you’ll see this result immediately unpacked, as shown below:
# Unpacking Results
import scipy.stats as stats
statistic, p, dof, expected = stats.chi2_contingency(data)
We can now complete our hypothesis testing by assessing the p-value. Recall that we have the following hypotheses:
- The null hypothesis (H0) states that the two variables are independent,
- The alternative hypothesis (H1) states that the two variables are not independent, meaning that they are associated
Say we are working at a 95% confidence level: we need to assess whether our p-value is less than 0.05. If it is, we reject the null hypothesis. We can do this using an if-else statement:
# Assessing our results
if p < 0.05:
    print('Reject null hypothesis: variables have an association.')
else:
    print('Fail to reject the null hypothesis. The variables are independent.')
# Returns:
# Fail to reject the null hypothesis. The variables are independent.
In our example dataset, we failed to reject the null hypothesis. In this case, we conclude that our variables are independent of one another.
Implementing Yates’ Correction for Chi-Squared Tests
In certain scenarios, especially when dealing with smaller sample sizes or 2×2 contingency tables, Yates’ Continuity Correction offers a refined approach to the Chi-Square Test of Independence by adjusting the calculated statistic.
Yates’ correction aims to mitigate potential inaccuracies that might arise in small samples or 2×2 tables. It adjusts the absolute differences between observed and expected frequencies by a constant (usually 0.5) before squaring them for the chi-square calculation.
The formula for Yates’ correction in a 2×2 contingency table is:
X² = Σ (|O − E| − 0.5)² / E
Where:
- O represents the observed frequency.
- E represents the expected frequency.
In SciPy, you can control the correction with the correction= parameter. The parameter accepts a boolean value and defaults to True. However, the correction is only applied when the degrees of freedom equal 1, meaning it automatically gets applied to a 2×2 contingency table!
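As a quick sketch (with made-up counts), you can compare the corrected and uncorrected statistics on a 2×2 table by toggling the correction parameter:

```python
# Comparing Yates-corrected vs. uncorrected statistics on a 2x2 table
import scipy.stats as stats

table_2x2 = [[12, 8],
             [5, 15]]

# correction=True is the default; it only takes effect when dof == 1
stat_corr, p_corr, dof_corr, _ = stats.chi2_contingency(table_2x2)
stat_raw, p_raw, dof_raw, _ = stats.chi2_contingency(table_2x2, correction=False)

print(stat_corr, stat_raw)  # the corrected statistic is smaller
```

Because the correction shrinks each |O − E| difference by 0.5 before squaring, the corrected statistic is never larger than the uncorrected one, making the test slightly more conservative.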
Conclusion
The Chi-Square Test of Independence serves as a robust statistical method for exploring associations between categorical variables. Its versatility spans various fields, helping in hypothesis testing and understanding relationships between distinct variables.
Throughout this tutorial, you’ve gained comprehensive insights into:
- Understanding the Chi-Square Test: Recognizing its purpose, how it evaluates associations between categorical variables, and its significance in statistical analysis.
- Python Implementation using SciPy: Learning how to apply the Chi-Square Test in Python using the scipy.stats module, leveraging the chi2_contingency function to calculate the chi-square statistic, p-value, degrees of freedom, and expected frequencies.
- Hypothesis Testing: Grasping the fundamental aspects of hypothesis testing with the Chi-Square Test, distinguishing between null and alternative hypotheses, and interpreting results based on p-values and significance levels.
Additionally, we explored Yates’ Continuity Correction, a technique useful in certain scenarios, especially for 2×2 contingency tables or small sample sizes. SciPy’s chi2_contingency applies the correction by default (via the correction parameter) whenever the degrees of freedom equal 1, as with a 2×2 table.
To learn more about the SciPy function, check out the official documentation.