A common problem faced by data journalists is how to determine whether there is a statistical relationship between two categorical variables, such as gender, race, or a candidate's placement in an election. The simplest way to examine the relationship is to arrange the counts for each combination of the two variables in a contingency table, with the rows representing the levels of one variable and the columns representing the levels of the other. The most commonly used statistical test for an association between the row and column variables is the chi-square (χ²) test. The example in the table below illustrates this test using 2016 primary data for the American presidential election.
The chi-square test is based on calculating an expected value for each cell in the table, that is, the count that would be expected if there were no relationship between the variables. In the above example, the expected value for the cell containing the states where Trump finished third on the Republican side and Bernie Sanders won on the Democratic side is found by multiplying the row total for states where Trump placed third (3) by the column total for states where Sanders won (22), and then dividing this product by the total number of observations (51). The formula for the expected value is given by:

expected value = (row total × column total) / total number of observations

For this cell, that is (3 × 22) / 51 ≈ 1.29.
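The same calculation can be sketched in a few lines of Python for every cell at once. This is a minimal illustration rather than the method used for the article: the function name `expected_counts` and the 3x2 table of counts are hypothetical, chosen only to show the arithmetic. The single cell worked through in the text is reproduced from its row total, column total, and number of observations.

```python
def expected_counts(table):
    """Expected cell counts under no association:
    (row total * column total) / grand total for each cell."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


# The one cell worked through in the text:
# row total 3, column total 22, 51 observations.
trump_third_sanders_won = 3 * 22 / 51  # about 1.29

# Hypothetical 3x2 table of counts, for illustration only
# (these are NOT the article's election data).
example = [[12, 8], [5, 15], [3, 7]]
for row in expected_counts(example):
    print([round(value, 2) for value in row])
```

Each expected count depends only on the row and column totals, not on the individual cell counts, which is why the cell in the text can be computed from just three numbers.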
But there is a problem with the chi-square test: it relies on an approximation to the distribution of counts in contingency tables. If more than 20% of the cells in the table have an expected value of less than five (as is the case in the table above), the chi-square approximation cannot be relied on to test the hypothesis of an association between the row variable and the column variable. The major statistical packages will alert the user if this assumption is violated. Violating the assumption causes the reported p-value to be inaccurate and can lead to incorrect conclusions about the presence or absence of an association. Note also that both variables in the table are categorical, which means that they can take only a limited set of values, as with gender, political affiliation, or placement in an election. A continuous variable, by contrast, is one that could take any value on the number line, such as temperature, height, or weight.
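The 20% rule of thumb described above can be checked directly from the expected counts. The sketch below is one way to do it in Python. The function names and both tables are hypothetical and serve only as illustrations; they are not the article's election data.

```python
def expected_counts(table):
    """Expected cell counts under no association."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def share_of_small_cells(table, threshold=5):
    """Fraction of cells whose expected count is below `threshold`."""
    cells = [value for row in expected_counts(table) for value in row]
    return sum(value < threshold for value in cells) / len(cells)


def chi_square_approximation_ok(table, max_share=0.20):
    """True if no more than `max_share` of the cells have small expected counts."""
    return share_of_small_cells(table) <= max_share


# Two hypothetical tables of counts, for illustration only.
well_filled = [[12, 8], [5, 15], [3, 7]]   # 1 of 6 expected counts is below five
sparse_row = [[2, 1], [9, 8], [6, 14]]     # 2 of 6 expected counts are below five

print(chi_square_approximation_ok(well_filled))
print(chi_square_approximation_ok(sparse_row))
```

Note that the check uses the expected counts, not the observed ones: a table can contain a small observed count and still satisfy the rule, and vice versa.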
To overcome this limitation, there is an exact alternative to the chi-square test called Fisher’s exact test. Rather than relying on the chi-square distribution, which only approximates the distribution of the test statistic, Fisher’s exact test is based on the hypergeometric probability distribution, which is the exact distribution of counts in a contingency table when the row and column totals are held fixed.
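To make the role of the hypergeometric distribution concrete, here is a minimal Python sketch of the two-sided test for the simplest case, a table with 2 rows and 2 columns. It is an illustration under stated assumptions, not a replacement for a statistical package: the function names and the example counts are hypothetical, and the sketch does not handle larger tables such as the one in this article. The two-sided p-value is taken here as the total probability of all tables, with the same row and column totals, that are no more likely than the observed one.

```python
from fractions import Fraction
from math import comb


def hypergeom_prob(a, row1, col1, n):
    """Probability of count `a` in the top-left cell of a 2x2 table
    when the row and column totals are held fixed."""
    return Fraction(comb(col1, a) * comb(n - col1, row1 - a), comb(n, row1))


def fisher_exact_2x2(table):
    """Two-sided Fisher's exact test p-value for a 2x2 table of counts."""
    (a, b), (c, d) = table
    row1, col1, n = a + b, a + c, a + b + c + d
    lowest = max(0, row1 + col1 - n)   # smallest feasible top-left count
    highest = min(row1, col1)          # largest feasible top-left count
    p_observed = hypergeom_prob(a, row1, col1, n)
    # Add up every feasible table that is no more likely than the observed one.
    # Exact fractions avoid floating-point ties in the comparison.
    p_value = sum(
        p
        for p in (hypergeom_prob(k, row1, col1, n) for k in range(lowest, highest + 1))
        if p <= p_observed
    )
    return float(p_value)


# Hypothetical 2x2 table of counts, for illustration only.
print(round(fisher_exact_2x2([[8, 2], [1, 5]]), 4))
```

Because every feasible table is enumerated, the p-value involves no approximation, which is why the test remains valid when expected counts are small.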
The commands to conduct Fisher’s exact test and the chi-square test in R can be seen below, using the US primary election table above (yellow for Fisher’s exact test, green for the chi-square test).
As a warning, the p-value should not be used as an indicator of the strength of the association between categorical variables. The test result is either significant or not, which indicates only whether there is evidence of a relationship, not how strong that relationship is. The p-value is also sensitive to sample size: with enough observations, even a weak association can produce a small p-value. The odds ratio is often used to estimate the effect size, but R's fisher.test function computes it only for tables with 2 rows and 2 columns.
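The contrast between effect size and sample size can be seen with a short Python sketch. It computes the simple cross-product odds ratio of a table with 2 rows and 2 columns, which is an assumption made for illustration: R's fisher.test reports a conditional maximum likelihood estimate instead, so its figure will generally differ somewhat. The function name and the counts are hypothetical.

```python
def odds_ratio(table):
    """Sample (cross-product) odds ratio of a 2x2 table: (a * d) / (b * c)."""
    (a, b), (c, d) = table
    if b * c == 0:
        raise ValueError("odds ratio is undefined when an off-diagonal count is zero")
    return (a * d) / (b * c)


# Hypothetical 2x2 table of counts, for illustration only,
# and the same table with every count multiplied by ten.
small = [[8, 2], [1, 5]]
large = [[10 * count for count in row] for row in small]

print(odds_ratio(small), odds_ratio(large))
```

Both tables give an odds ratio of 20, because multiplying every count by the same factor leaves the ratio unchanged. A significance test on the larger table would nevertheless return a far smaller p-value, which is why the p-value alone says little about the strength of an association.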
Fisher’s exact test provides a criterion for deciding whether the differences in observed percentages between two categorical variables in a sample are significant or just due to random noise in the data. In the above example, the 86% of primary states won by Clinton and Trump is significantly different from the 55% of primary states won by Sanders and Trump. Journalists should always be careful about making these judgments by just looking at observed percentages or counts, because such decisions are subjective. Subjective decisions can be further clouded by one’s preconceived notions about the issues related to the data.