I am a statistician by occupation. I tell people it is like "CSI without dead bodies" because analyzing a set of data that has been collected is like doing an autopsy on a deceased person, in the sense that I'm trying to learn what I can from the statistics and information that are available. Except in this case the information does not involve gross things. For me the research process can be humorous or scary, but it is always captivating.

I came across this article on medium.com on how the field of data journalism has evolved over the last ten years. It is an interview with Simon Rogers, a data editor at Google News Lab. He started out at the Guardian newspaper of London in 2009 as the founder of their data blog. He had a new job title, Data Editor. They started out with 47 data sets. That number grew exponentially.

As the amount of available data increased, the techniques of data visualization multiplied as well. The above chart shows how people across the world searched for and were informed about the Paris terrorist attacks in the 24 hours after the attacks. The questions asked in the Paris area were very different from those asked in other parts of the world.

The Data Journalism Awards received 471 entries in 2016. Last year they received over 700 from across the world. I submitted one of the 608 projects from 62 countries for 2019, from this site. The shortlist will be revealed in May.

As this site was created in September 2010 I do not feel that far behind the curve from Simon Rogers and Nate Silver in terms of experience. I am happy to enlighten my corner of the world on the insights that data can provide. Above is a talk by Simon Rogers on Data Journalism.

I've been forced to move back to Pennsylvania so it has been harder to find the time to post to this blog. The eighth anniversary of this blog is coming up and I will be preparing the anniversary post. I have found time to write an article for the new data journalism site Darply on hate group concentration (in groups per million) and Trump's approval rating at the state level.

I was planning a post on a published study that showed how cities and towns with local newspapers have greater government efficiency. Events in Annapolis, MD yesterday seem to have added significance to the post. A gunman who was angry at the Capital Gazette for reporting on his harassment of a woman went into their office and killed five of their staff. The study I am citing was inspired by an episode of John Oliver's show Last Week Tonight from three years ago (which can be seen above) about the decline of newspapers. The researchers found a correlation between the lack of a local print journalism outlet and a 5 to 11% increase in municipal borrowing. This underscores the valuable service that these papers provide. The whole study can be read here.

Newspapers have been in decline for decades as the internet and other media have crowded them out. Yesterday's incident brings an added dimension to the difficulties that they face. Newspapers get complaints about the stories that they run all the time, with the occasional threat. This is the worst attack on a western media outlet since the anthrax attacks in 2001 and Charlie Hebdo in Paris in 2015. Hopefully these attacks will have no effect on the content that these outlets provide. The element of fear in reporting is a hard thing to regulate, however. Independent blogs like mine try to fill the void by providing my own take on the news with my own findings thrown in. But I am one person. I do not have the resources that the newspapers and TV/Radio journalists have or once had since Johannes Gutenberg created the first printing press and Ben Franklin had his print shop. We all keep on keeping on.

My latest post on Data Driven Journalism is up and reprinted here. In my last post,
I reported that Washington, DC had an extremely high rate of 30.83 hate
groups per million residents in 2016 relative to the other 50 states
(the national rate was 2.84 groups per million). DC also had an
exceptionally low percent of the vote for Donald Trump in 2016, at just
4.1%. For these reasons, and other characteristics which make DC
fundamentally different from the other 50 states, I had to exclude it
from a correlational analysis between hate group concentration and
Trump’s percent of the vote. For this post, I will look at other ways
in which DC is an outlier.

According to the most recent Small Area Income and Poverty Estimates (SAIPE) from 2015, DC
ranks third in median household income at $70,848 behind Maryland and
Alaska. Yet, the same SAIPE estimate also ranks DC eighth for the
percent of the population in poverty, at 17.4%. This indicates a large
gap between the rich and poor. The high rate of poverty is reflected in
DC’s low life expectancy at 76.53 years, ranking 43rd compared with the
overall US average of 78.86 years. Similarly, DC’s infant mortality ranked eleventh in the country,
at 7 deaths per 1,000 live births compared to the US rate of 5.9 deaths
per 1,000 live births. Newly released estimates from the Census Bureau for
2015 show DC has the second lowest rate of those without health
insurance at 4.3% behind Massachusetts. These income and health
statistics suggest that DC deviates from the national rates, but not
that it is an extreme outlier – with one exception.

The statistics on crime suggest that DC is an extreme outlier. DC had a violent crime rate of 1,244.4 offenses per 100,000 residents in 2014.
This is almost twice the rate of the next highest, Alaska,
with 635.8 offenses per 100,000 residents, and more than three
times as large as the US rate of 365.5 offenses per 100,000 residents.
In 2014, DC also had a higher murder rate than any state, at 15.9
offenses per 100,000 residents.

Image: Paul Ricci.

Last fall, the FBI’s Uniform Crime Report released the number of hate crime incidents in 2015 for each state.
Adjusting their numbers for population, DC had a higher rate than any
other state, at 96.69 offenses per million residents. Using the FBI rate
method, this rate would be 9.67 reported offenses per 100,000
residents. As the above graph shows, this hate crime rate corresponds
with DC’s high rate of hate groups. However, this relationship does not
hold up when DC is excluded from the analysis, as can be seen in the
graph below. If DC is excluded, there is no statistically significant
relationship between the concentration of hate groups and hate crimes in
any of the other states, with only 2% of the variability accounted for.

Image: Paul Ricci.

Comparison of DC with New York City

So what factors besides poverty could be driving this relationship?
Compared to the other states DC has the highest population density by
far at 11,157.58 persons per square mile. Because Washington, DC is a quasi-city state, it may be appropriate to compare it to the US’s largest city, New York City (NYC).
In 2015, NYC had 8,550,405 inhabitants over a total of 302.64 square
miles (approximately 783.83 km²) giving the city a population density of
28,252.72 persons per square mile. I don’t have hate crime data for
NYC but I can estimate the hate group rate from the hate group map of the Southern Poverty Law Center.
I counted 36 hate groups in the area, which would give NYC a rate of
4.21 groups per million, a number which is considerably below DC’s rate
of 30.83 groups per million. In 2010, 25.5% of NYC’s population
identified as African-American whereas 50.7% of DC’s population did. Of
the 21 total hate groups in DC, six of them are black separatist groups
such as the Nation of Islam (28.6%). Of the 36 hate groups in NYC,
eight are black separatist (22.2%). You can scan the other hate groups
in each city here.
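
The rates quoted in this post are simple population adjustments. As a minimal Python sketch of that arithmetic, using only figures stated above (the function and variable names are my own, chosen for illustration):

```python
def per_million(count, population):
    """Rate per one million residents."""
    return count / population * 1_000_000


def per_million_to_per_100k(rate_per_million):
    """Convert a per-million rate to the per-100,000 scale the FBI uses."""
    return rate_per_million / 10


# Figures quoted in the post
nyc_population = 8_550_405
nyc_area_sq_miles = 302.64
nyc_hate_groups = 36          # counted by hand from the SPLC map
dc_hate_crime_rate = 96.69    # offenses per million residents, 2015

nyc_group_rate = per_million(nyc_hate_groups, nyc_population)
nyc_density = nyc_population / nyc_area_sq_miles
dc_rate_100k = per_million_to_per_100k(dc_hate_crime_rate)

print(f"NYC hate groups per million: {nyc_group_rate:.2f}")
print(f"NYC persons per square mile: {nyc_density:,.1f}")
print(f"DC hate crimes per 100,000:  {dc_rate_100k:.2f}")
```

Keeping the per-million and per-100,000 conversions in named functions makes it harder to mix the two scales when comparing DC with the states.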

Conclusion

One must be careful about drawing grand conclusions from statistics that
compare DC to the rest of the US and DC to NYC. One can look at the
obvious differences DC has with the other states. While it has three
votes in the Electoral College for President, it has no members in
Congress with full voting privileges on laws which may affect its residents.
Further, as John Oliver explains, they have to pay full federal taxes:

We see Washington, DC portrayed in the media all the time but do we
really notice what goes on there outside of the White House, the Capitol
Building, and the various other federal buildings? DC residents have
been campaigning for full statehood for years but it has been stalled in
Congress. This second class citizenship may or may not explain all of
the statistical discrepancies for DC. The issue definitely merits
further study. There could be many other anomalies regarding DC of
which I am not aware.

This week I had an article published in Data Driven Journalism on the use of Fisher's exact test with contingency tables. It is reprinted here.

A common problem faced by data journalists is how to determine if
there is a statistical relationship between two categorical variables
such as gender, race, or the share of the vote for two candidates in an
election. The simplest way to visualize the relationship is to
represent the counts for each combination of two variables in a
contingency table with the rows representing the levels of one variable
and the columns representing the levels of the other variable. The most
commonly used statistical test for an association between the row and
column variables is the chi-square (χ²) test. The example in the table below illustrates this test using 2016 primary data for the American presidential election.

The columns in the above table show the primary states won by Hillary
Clinton and by Bernie Sanders on the Democratic side, and the rows show where Donald Trump
placed in the same primary states on the Republican side. The total
number of states in the table is 51 because the District of Columbia is
included. For example, the column percentages show that Trump won 86% of
the primary states that Clinton won while he won 55% of the states that
Sanders won. Using the chi-square test, journalists can explore
relationships between the states won by Trump, Clinton, and Sanders.

The chi-square test is based on calculating expected values for each
cell in the table. In the above example, we calculate the expected
value (the value that would be expected if there were no
relationship among the variables) for the cell for states where Trump
finished third on the Republican side and Bernie
Sanders won on the Democratic side by multiplying the row total for
where Trump placed third (3) by the column total for states where
Sanders won (22). This product is then divided by the total number of
observations (51). The formula for the expected value is given by:
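
In the Ri, Cj and N notation used later in this post (row total, column total and table total), and assuming the standard expected-count formula is the one intended here, the calculation for this cell is:

$$E_{ij} = \frac{R_i \times C_j}{N} = \frac{3 \times 22}{51} \approx 1.29$$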

That means that for this cell a value of 1.29 would be expected if the
primary states where Trump finished third and Sanders won were
completely independent of each other. The observed value for this cell
is 2, suggesting a higher count than would be expected. Expected values
would be computed for each cell in the table and the difference between
the observed and expected values for each cell is computed, squared,
divided by the expected value, and summed across the cells in the table
according to the formula:
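
Writing a_ij for the observed count and E_ij for the expected count in row i and column j (E_ij is a symbol assumed here for illustration), the standard form of this statistic is:

$$\chi^2 = \sum_{i}\sum_{j} \frac{(a_{ij} - E_{ij})^2}{E_{ij}}$$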

If the value for the chi-square exceeds the chi-square critical value
for the given degrees of freedom (found by multiplying the number of rows
minus one by the number of columns minus one) and significance level, it is
concluded that there is an association between the variables.
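
As a rough sketch of the calculation described above, the Python snippet below computes the expected counts, the chi-square statistic, and the p-value by hand. The cell counts are an assumption: they are reconstructed from the figures quoted in this post (51 states, 22 Sanders wins, 3 third-place finishes for Trump, and the 86% and 55% column percentages), not copied from the original table. The closed form used for the p-value, exp(-x/2), holds only for 2 degrees of freedom.

```python
import math

# Rows: Trump finished 1st, 2nd, 3rd; columns: Clinton won, Sanders won.
# ASSUMPTION: counts reconstructed from the percentages and totals in the text.
observed = [
    [25, 12],
    [3, 8],
    [1, 2],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count for each cell: row total * column total / table total
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Sum of (observed - expected)^2 / expected over every cell
chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
df = (len(row_totals) - 1) * (len(col_totals) - 1)

# For df == 2 only, the chi-square upper-tail probability is exp(-x / 2)
assert df == 2
p_value = math.exp(-chi_square / 2)

print(f"expected count, Trump 3rd / Sanders won: {expected[2][1]:.2f}")
print(f"chi-square = {chi_square:.3f}, df = {df}, p = {p_value:.5f}")
```

For tables with other degrees of freedom the tail probability needs the incomplete gamma function, which is one reason to lean on a statistical package in practice.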

But there is a problem with the chi-square test. The test is only an approximation of the distribution of counts in
contingency tables. If more than 20% of the cells in the table have an
expected value of less than five, the chi-square approximation does not
work to test the hypothesis of an association between the row variable
and the column variable (as is the case in the table above). Both
variables in the table are categorical, which means that
they can only take certain values, as with gender, political
affiliation, or placement in an election. A continuous variable is one
that could take any value on the number line such as temperature,
height, or weight. The major statistical packages will alert the user if
this assumption is violated. Violating the assumption causes the
observed p-value to be incorrect and can lead to incorrect conclusions
being made regarding the presence or absence of an association.

To overcome these limitations, there is an exact alternative to the chi-square test called Fisher’s exact test. Rather than the chi-square distribution, which is an approximation of
the distribution of observed and expected values, Fisher’s exact test is
based on the hypergeometric probability distribution, which is the exact distribution of counts in a contingency table.
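
Reconstructed from the symbol definitions in the next paragraph, so treat the exact layout as an assumption, the probability of observing a particular table given its row and column totals is:

$$p = \frac{\prod_i R_i! \; \prod_j C_j!}{N! \; \prod_{ij} a_{ij}!}$$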

Here the Ri! are the factorials of the row totals (for example, 5! = 5*4*3*2*1), the Ci!
are the factorials of the individual column totals, N! is the factorial
of the table total and the aij! are the factorials of the individual
cell values. The Πij denotes the product over the individual cell
values. Such a formula is even more computationally intensive than the
chi-square test, especially for tables with many rows and columns. This
is why the chi-square test was favored in the past: Fisher’s exact test took too
much memory for computers to run. These days it is less of an issue for
computers to run Fisher’s exact test and it is easy to run in the
major statistical packages, such as R, SAS, SPSS, and Stata.
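
To make the hypergeometric calculation concrete, here is a minimal pure-Python sketch of Fisher’s exact test for a small table. It is not the R command from the original article. It enumerates every table with the same row and column totals, computes each one's hypergeometric probability, and adds up the tables that are no more likely than the observed one (my understanding is that this is the usual definition of the p-value for tables larger than 2x2). The cell counts are reconstructed from figures quoted in the text and should be treated as an assumption, and the function names are illustrative.

```python
from fractions import Fraction
from math import factorial, prod


def table_probability(table):
    """Hypergeometric probability of a table given its margins:
    (prod R_i! * prod C_j!) / (N! * prod a_ij!)."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    numerator = prod(factorial(t) for t in rows + cols)
    denominator = factorial(sum(rows)) * prod(
        factorial(a) for r in table for a in r
    )
    return Fraction(numerator, denominator)


def split(total, caps):
    """Yield every way to spread `total` over cells capped by `caps`."""
    if len(caps) == 1:
        if total <= caps[0]:
            yield [total]
        return
    for first in range(min(total, caps[0]) + 1):
        for rest in split(total - first, caps[1:]):
            yield [first] + rest


def tables_with_margins(rows, cols):
    """Yield every non-negative integer table with the given margins."""
    if len(rows) == 1:
        # The last row is forced: it must use up what is left in each column
        yield [list(cols)]
        return
    for first_row in split(rows[0], cols):
        remaining = [c - a for c, a in zip(cols, first_row)]
        for rest in tables_with_margins(rows[1:], remaining):
            yield [first_row] + rest


def fisher_exact(table):
    """Sum the probabilities of all tables no more likely than `table`."""
    observed_p = table_probability(table)
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    total = Fraction(0)
    for candidate in tables_with_margins(rows, cols):
        p = table_probability(candidate)
        if p <= observed_p:  # exact comparison, thanks to Fraction
            total += p
    return float(total)


# ASSUMPTION: counts reconstructed from the percentages and totals in the text.
# Rows: Trump finished 1st, 2nd, 3rd; columns: Clinton won, Sanders won.
observed = [[25, 12], [3, 8], [1, 2]]
p_value = fisher_exact(observed)
print(f"Fisher's exact test p-value: {p_value:.5f}")
```

Brute-force enumeration like this is fine for a 3x2 table with 51 observations, but the number of candidate tables grows quickly with the table size, which is the computational burden described above.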

The commands to conduct Fisher’s exact and the chi-square test in R can
be seen below, using the US Primary Election table above (yellow for
Fisher’s exact test, green for the chi-square test).

The output for Fisher’s exact test shows that there is a probability
of 0.03653 of observing these table frequencies when there is no
association between the rows and columns. The chi-square test output
shows a p-value of 0.04217 for the same table. If
we were using a p-value of 0.05 as the criterion for significance we
would find a relationship with both tests in this case, though the
p-values differ. In a case where the sample size is even smaller than
in the example given above, this difference in p-values would be even
greater and could lead us to reach the wrong conclusion regarding
whether there is a relationship between the variables or not. States
which Hillary Clinton won in the primary season were more likely to be
won by Donald Trump while states where Bernie Sanders won were more
likely to have Trump finish 2nd or 3rd.

As a warning, the p-value should not be used as an indicator of the
strength of the association between categorical variables. Either the
test is significant or not, which means that either the relationship is
present or not. The p-value is sensitive to sample size. Often the
odds ratio can be used to estimate the effect size but R only computes
it in the fisher.test function for tables with 2 columns and 2 rows.
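
One way to get an effect size for this example is to collapse the table to 2x2 (Trump won versus Trump did not win) and compute the sample odds ratio. The sketch below does that in Python. The collapsed counts are an assumption reconstructed from the figures quoted above, and this simple cross-product ratio is not the same estimator as the conditional maximum likelihood estimate that R's fisher.test reports.

```python
# ASSUMPTION: 2x2 counts collapsed from the reconstructed primary table.
# Rows: Trump won / Trump did not win; columns: Clinton won / Sanders won.
trump_won = [25, 12]
trump_lost = [4, 10]

# Odds that Trump won, within each Democratic winner's states
odds_clinton_states = trump_won[0] / trump_lost[0]
odds_sanders_states = trump_won[1] / trump_lost[1]

# Sample (cross-product) odds ratio
odds_ratio = odds_clinton_states / odds_sanders_states

print(f"odds Trump won in Clinton states: {odds_clinton_states:.2f}")
print(f"odds Trump won in Sanders states: {odds_sanders_states:.2f}")
print(f"odds ratio: {odds_ratio:.2f}")
```

An odds ratio well above 1 here would say that the odds of a Trump win were higher in states Clinton won than in states Sanders won, which is a statement about the size of the association rather than its significance.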

Fisher’s exact test provides a criterion for deciding whether the
differences in observed percentages between two categorical variables in
a sample are significant or just due to random noise in the data. In
the above example, the 86% of Clinton’s primary states that Trump won
is significantly different from the 55% of Sanders’s primary states that
Trump won. Journalists should always be careful about making these
judgments by just looking at observed percentages or counts because of
the subjectivity of such decisions. Subjective decisions can be further
clouded by one’s preconceived notions about the issues related to the
data.

Two movies out this holiday/Oscar season have similar plots but very different outcomes. Truth tells the story of the 60 Minutes II team (with Dan Rather played by Robert Redford) that put together an exposé of George W. Bush's military record in the Texas and Alabama Air National Guards. Strings were pulled to get him out of fighting in Vietnam. There were indications that he did not show up for duty when he was required to.

After the piece aired in the summer of 2004 (during the election), questions were raised about the fonts in one of the documents cited in the report. The claim from the right was that the font could not have appeared in a document typed in the late 60's/early 70's. An investigation ensued and the end result was Dan Rather, his award winning producer Mary Mapes (played by Cate Blanchett), and the rest of the team being fired from CBS. Regardless of the authenticity of the document, the veracity of the rest of the original report was never disproved. Jon Stewart weighed in on it this way at the time.

Like the reactions to the original report, the reviews of the film were lukewarm, with a metascore of 67 out of 100. The review above was one that was not so positive on the film. One of the complaints was that Redford does not look anything like Dan Rather (he didn't look anything like Bob Woodward either in All the President's Men but he still played him and it's now considered a classic). Conservative film critic Michael Medved gave the film 2 stars because he said it was well acted but too biased against the Bush administration.

The film Spotlight has been much better received so far. It has received Golden Globe nominations for best drama and best director (but none for acting) and has a Metacritic score of 93. It is the story of how the Spotlight team (headed by Michael Keaton and including Rachel McAdams, Mark Ruffalo, and another guy) at the Boston Globe is goaded into investigating sexual abuse by priests by the new owner and editor of the paper (Liev Schreiber). Spoiler alert: the team is successful in taking on the Boston Archdiocese, getting Cardinal Bernard Law replaced, and wins a Pulitzer Prize for its efforts in 2003. Michael Medved gave the film 3 stars and also called the film one-sided against the Catholic Church but said it was understandable due to the crimes that were revealed. The Spotlight team did a valuable public service in revealing church abuses but the film does skirt another issue, media consolidation. Newspapers are having their staffs reduced as more and more of our media is controlled by fewer and fewer individuals. This is happening at many newspapers as well as broadcast outlets like CBS. With fewer reporters to do the investigating, it is harder for news media outlets to challenge the powers that be.

Both films show that with the right backing it is possible for investigative reporters to expose the crimes of the powers that be. The outside media owners backed the Spotlight team, as they did Woodward and Bernstein, and they were successful. In the case of Dan Rather and the film Truth, Viacom, the parent company of CBS, sided with the critics and said that if that one document in the report may or may not be real then the whole report must be false. The program 60 Minutes II ended soon thereafter.

There is a third film coming out called Concussion starring Will Smith as a neuropathologist who discovers a brain disease in football players called chronic traumatic encephalopathy or CTE. He encounters resistance from the NFL on his findings but they eventually relent. Will Smith has received a Golden Globe nomination for this role. I haven't seen it yet so I can't comment any more.