
Sunday, April 28, 2019

Happy 10th Birthday for Data Journalism


I came across this article on medium.com on how the field of data journalism has evolved over the last ten years.  It is an interview with Simon Rogers, Data Editor at Google News Lab.  He started out at the Guardian newspaper in London in 2009 as the founder of its data blog, with the then-new job title of Data Editor.  The blog started out with 47 data sets, and that number grew exponentially.

As the amount of available data increased, techniques of data visualization advanced along with it.  The above chart shows how people across the world searched for and were informed about the Paris terrorist attacks in the 24 hours after the attacks.  The questions asked in the Paris area were very different from those asked in other parts of the world.

The Data Journalism Awards received 471 entries in 2016 and over 700 last year from across the world.  For 2019, I submitted one of the 608 entries from 62 countries, drawing on work from this site.  The shortlist will be revealed in May.



As this site was created in September 2010, I do not feel that far behind the curve relative to Simon Rogers and Nate Silver in terms of experience.  I am happy to enlighten my corner of the world about the insights that data can provide.  Above is a talk by Simon Rogers on data journalism.




**Related Posts**


Monday, September 10, 2018

An Update on Data Journalism and Darply

I've been forced to move back to Pennsylvania, so it has been harder to find the time to post to this blog.  The eighth anniversary of this blog is coming up, and I will be preparing the anniversary post.  I have found time to write an article for the new data journalism site Darply on hate group concentration (in groups per million) and Trump's approval rating at the state level.



The full article can be read here.  The website is now crowdfunding to increase its reach; if you care about supporting real news websites, please consider supporting them here.

https://gogetfunding.com/darply-data-driven-news-magazine/

**Related Posts**

Hate Crimes in Pennsylvania Under Reported (and other states as well)


Friday, June 29, 2018

Events in Annapolis Coincide with Posting on Local Papers


I was planning a post on a published study showing that cities and towns with local newspapers have greater government efficiency.  Events in Annapolis, MD yesterday seem to have added significance to the post.  A gunman, angry at the Capital Gazette for reporting on his harassment of a woman, went into their office and killed five of their staff.

The study I am citing was inspired by an episode of John Oliver's show Last Week Tonight from three years ago (which can be seen above) about the decline of newspapers.  The researchers found a correlation between the lack of a local print journalism outlet and a 5 to 11% increase in municipal borrowing.  This underscores the valuable service that these papers provide.  The whole study can be read here.


Newspapers have been in decline for decades as the internet and other media have crowded them out.  Yesterday's incident brings an added dimension to the difficulties that they face.  Newspapers get complaints about the stories that they run all the time, with the occasional threat.  This is the worst attack on a western media outlet since the anthrax attacks in 2001 and the Charlie Hebdo attack in Paris in 2015.  Hopefully these attacks will have no effect on the content that these outlets provide.  The element of fear in reporting is a hard thing to control, however.

Independent blogs like mine try to fill the void; I provide my own take on the news with my own findings thrown in.  But I am one person.  I do not have the resources that newspaper, TV, and radio journalists have, or once had, going back to Johannes Gutenberg's first printing press and Ben Franklin's print shop.  We all keep on keeping on.

**Related Posts**

Amazon, The Washington Post, and New Media



Thursday, May 18, 2017

How is Washington DC an outlier? Let's count the ways. (Repost from Data Driven Journalism)

My latest post on Data Driven Journalism is up and reprinted here.
 
In my last post, I reported that Washington, DC had an extremely high rate of 30.83 hate groups per million residents in 2016 relative to the other 50 states (the national rate was 2.84 groups per million).  DC also had an exceptionally low percent of the vote for Donald Trump in 2016, at just 4.1%.  For these reasons, and other characteristics which make DC fundamentally different from the other 50 states, I had to exclude it from a correlational analysis between hate group concentration and Trump’s percent of the vote.  For this post, I will look at other ways in which DC is an outlier.

According to the most recent Small Area Income and Poverty Estimates (SAIPE) from 2015, DC ranks third in median household income at $70,848, behind Maryland and Alaska. Yet the same SAIPE estimates also rank DC eighth for the percent of the population in poverty, at 17.4%.  This indicates a large gap between the rich and poor.  The high rate of poverty is reflected in DC's low life expectancy of 76.53 years, ranking 43rd compared with the overall US average of 78.86 years. Similarly, DC's infant mortality ranked eleventh in the country, at 7 deaths per 1,000 live births compared to the US rate of 5.9 deaths per 1,000 live births.  Newly released estimates from the Census Bureau for 2015 show DC has the second lowest rate of those without health insurance at 4.3%, behind Massachusetts. These income and health statistics suggest that DC deviates from the national rates, but not that it is an extreme outlier – with one exception.

The statistics on crime suggest that DC is an extreme outlier.  DC had a violent crime rate of 1,244.4 offenses per 100,000 residents in 2014.  This is almost twice that of the next highest state, Alaska, at 635.8 offenses per 100,000 residents, and more than three times the US rate of 365.5 offenses per 100,000 residents.  In 2014 it also had a higher murder rate than any state, at 15.9 offenses per 100,000 residents.



Image: Paul Ricci.
Last fall, the FBI’s Uniform Crime Report released the number of hate crime incidents in 2015 for each state.  Adjusting their numbers for population, DC had a higher rate than any state, at 96.69 offenses per million residents; using the FBI rate convention, this is 9.67 reported offenses per 100,000 residents.  As the above graph shows, this hate crime rate corresponds with DC’s high rate of hate groups.  However, the relationship does not hold up when DC is excluded from the analysis, as can be seen in the graph below.  With DC excluded, there is no statistically significant relationship between the concentration of hate groups and hate crimes across the other states, with only 2% of the variability accounted for.



Image: Paul Ricci.
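For readers who want to reproduce this kind of with-and-without-DC comparison, here is a minimal R sketch.  The data frame, column names, and the three example rows are illustrative placeholders standing in for the full SPLC hate group counts and FBI hate crime counts, not the actual dataset behind the graphs above.

```r
# Hypothetical data: one row per state plus DC, with SPLC hate group counts
# and FBI hate crime incident counts (only three illustrative rows shown;
# the real analysis would include all 50 states).
states <- data.frame(
  state      = c("DC", "Alaska", "Alabama"),
  population = c(681170, 737125, 4858979),
  groups     = c(21, 2, 27),     # illustrative hate group counts
  incidents  = c(66, 10, 15)     # illustrative hate crime incident counts
)

# Convert raw counts to rates per million residents
states$group_rate <- states$groups    / (states$population / 1e6)
states$crime_rate <- states$incidents / (states$population / 1e6)

# Fit the regression with DC included, then with DC excluded
fit_all   <- lm(crime_rate ~ group_rate, data = states)
fit_no_dc <- lm(crime_rate ~ group_rate, data = subset(states, state != "DC"))

# R-squared: the share of variability in hate crime rates accounted for by
# the hate group rate.  With the full 50-state data, the text above reports
# only about 2% once DC is excluded.
summary(fit_all)$r.squared
summary(fit_no_dc)$r.squared
```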
 
Comparison of DC with New York City

So what factors besides poverty could be driving this relationship?  Compared to the states, DC has the highest population density by far, at 11,157.58 persons per square mile.  Because Washington, DC is a quasi-city-state, it may be appropriate to compare it to the US’s largest city, New York City (NYC).  In 2015, NYC had 8,550,405 inhabitants over a total of 302.64 square miles (approximately 783.8 km2), giving the city a population density of 28,252.72 persons per square mile.  I don’t have hate crime data for NYC, but I can estimate the hate group rate from the hate group map of the Southern Poverty Law Center.  I counted 36 hate groups in the area, which would give NYC a rate of 4.21 groups per million – a number considerably below DC’s rate of 30.83 groups per million. In 2010, 25.5% of NYC’s population identified as African-American whereas 50.7% of DC’s population did.  Of the 21 total hate groups in DC, six are black separatist groups such as the Nation of Islam (28.6%).  Of the 36 hate groups in NYC, eight are black separatist (22.2%).  You can scan the other hate groups in each city here.

Looking at other statistics for NYC, the violent crime rate is 596.7 offenses per 100,000 residents and the murder rate sits at 3.9 offenses per 100,000 residents.  These are considerably lower than DC’s rates of 1,244.4 violent offenses per 100,000 residents and 15.9 murders per 100,000 residents.  DC has a higher median household income, at $70,848, than NYC’s $53,373.  Correspondingly, the 20.0% poverty rate for NYC is higher than DC’s 17.4%.

Conclusion
 
One must be careful about drawing grand conclusions from statistics that compare DC to the rest of the US and DC to NYC.  One can, however, look at the obvious differences DC has with the states. While it has three votes in the Electoral College for President, it has no members of Congress with full voting privileges on laws which may affect it. Further, as John Oliver explains, its residents have to pay full federal taxes:



We see Washington, DC portrayed in the media all the time, but do we really notice what goes on there outside of the White House, the Capitol Building, and the various other federal buildings?  DC residents have been campaigning for full statehood for years, but the effort has stalled in Congress.  This second-class citizenship may or may not explain all of the statistical discrepancies for DC.  The issue definitely merits further study.  There could be many other anomalies regarding DC of which I am not aware.

**Related Posts**

Don’t test me: Using Fisher’s exact test to unearth stories about statistical relationships (Repost) 

Concentration of Hate Groups Predict Hate Crimes (if you consider DC) and Trump Vote (if you don't)

SPLC Hate Group Update: Washington, DC has an Increase in Activity

Thursday, January 19, 2017

Don’t test me: Using Fisher’s exact test to unearth stories about statistical relationships (Repost)

This week I had an article published in Data Driven Journalism on the use of Fisher's exact test with contingency tables.  It is reprinted here.

A common problem faced by data journalists is how to determine if there is a statistical relationship between two categorical variables such as gender, race, or which of two candidates won a state in an election.  The simplest way to visualize the relationship is to represent the counts for each combination of the two variables in a contingency table, with the rows representing the levels of one variable and the columns representing the levels of the other variable.  The most commonly used statistical test for an association between the row and column variables is the chi-square (χ²) test.  The example in the table below illustrates this test using 2016 primary data for the American presidential election.
[Table: 2016 primary states won by Clinton and Sanders (columns) versus where Trump finished in those same states (rows), with column percentages.]
The columns in the above table show the primary states won by Hillary Clinton and by Bernie Sanders on the Democratic side, and the rows show where Donald Trump placed in the same primary states on the Republican side.  The total number of states in the table is 51 because the District of Columbia is included.  For example, the column percentages show that Trump won 86% of the primary states that Clinton won, while he won 55% of the states that Sanders won.  Using the chi-square test, journalists can explore relationships between the states won by Trump, Clinton, and Sanders.

The chi-square test is based on calculating expected values for each cell in the table.  In the above example, we calculate the expected value (the value that would be expected if there were no relationship between the variables) for the cell for states where Trump finished third on the Republican side and Bernie Sanders won on the Democratic side by multiplying the row total for where Trump placed third (3) by the column total for states where Sanders won (22).  This product is then divided by the total number of observations (51).  The formula for the expected value is given by:
expected value = (row total × column total) / table total, e.g. (3 × 22) / 51 = 1.29
That means that a value of 1.29 would be expected in this cell if the primary states where Trump finished third and the states Sanders won were completely independent of each other.  The observed value for this cell is 2, suggesting a higher count than would be expected.  Expected values are computed for each cell in the table, and the difference between the observed and expected values for each cell is squared, divided by the expected value, and summed across the cells in the table according to the formula:
χ² = Σ (observed − expected)² / expected
If the value of the chi-square statistic exceeds the chi-square critical value for the given degrees of freedom (the number of rows minus one multiplied by the number of columns minus one) at the chosen p-value, it is concluded that there is an association between the variables.
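As a quick illustration of that decision rule in R, assuming the conventional 0.05 significance level (the variable names are just for this sketch):

```r
# Degrees of freedom for the 3 x 2 table above: (rows - 1) * (columns - 1)
df_table <- (3 - 1) * (2 - 1)          # = 2

# Chi-square critical value at the 0.05 significance level
crit <- qchisq(0.95, df = df_table)    # about 5.99

# A computed chi-square statistic larger than crit would indicate an
# association between the row and column variables.
crit
```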

But there is a problem with the chi-square test.  The test is only an approximation of the distribution of counts in contingency tables. If more than 20% of the cells in the table have an expected value of less than five, the chi-square approximation does not work for testing the hypothesis of an association between the row variable and the column variable (as is the case in the table above).  Both variables in the table are categorical, which means that they can only take a limited set of values, such as gender, political affiliation, or placement in an election. A continuous variable is one that could take any value on the number line, such as temperature, height, or weight. The major statistical packages will alert the user if this assumption is violated.  Violating the assumption causes the observed p-value to be incorrect and can lead to incorrect conclusions regarding the presence or absence of an association.
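Here is a minimal R sketch of that check.  The counts in the matrix are reconstructed from the percentages and totals quoted above (29 Clinton states, 22 Sanders states, Trump finishing third in 3 states), so treat them as approximate rather than as the exact published table:

```r
# Contingency table: Trump's finish (rows) by Democratic primary winner (columns),
# with counts reconstructed from the figures quoted in the text.
primary <- matrix(c(25, 12,    # Trump 1st
                     3,  8,    # Trump 2nd
                     1,  2),   # Trump 3rd
                  nrow = 3, byrow = TRUE,
                  dimnames = list(Trump = c("1st", "2nd", "3rd"),
                                  Dem   = c("Clinton", "Sanders")))

# chisq.test() returns the expected counts; R also prints a warning that
# the chi-square approximation may be incorrect for this table.
expected <- chisq.test(primary)$expected
expected

# Proportion of cells with an expected count below five
mean(expected < 5)
```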

To overcome these limitations, there is an exact alternative to the chi-square test called Fisher’s exact test.  Rather than the chi-square distribution, which is an approximation of the distribution of observed and expected values, Fisher’s exact test is based on the hypergeometric probability distribution, which is the exact distribution of counts in a contingency table.
p = ( Π Ri! × Π Ci! ) / ( N! × Π aij! )
Here the Ri! are the factorials of the row totals (e.g. 5! = 5×4×3×2×1), the Ci! are the factorials of the individual column totals, N! is the factorial of the table total, and the aij! are the factorials of the individual cell values; Π denotes the product over the rows, columns, or cells.  Such a formula is even more computationally intensive than the chi-square test, especially for tables with many rows and columns.  This is why the chi-square test was favored in the past: Fisher’s exact test took too much memory and computation for early computers to run.  These days this is much less of an issue, and Fisher’s exact test is easy to run in the major statistical packages, such as R, SAS, SPSS, and STATA.
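To make the formula concrete, the sketch below computes the probability of one particular table using the same reconstructed counts as above (the test’s p-value then sums these probabilities over all tables at least as extreme as the observed one):

```r
# Hypergeometric point probability of a single table, following the formula above
tab <- matrix(c(25, 12, 3, 8, 1, 2), nrow = 3, byrow = TRUE)

p_table <- prod(factorial(rowSums(tab))) * prod(factorial(colSums(tab))) /
           (factorial(sum(tab)) * prod(factorial(tab)))
p_table
```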

The commands to conduct Fisher’s exact and the chi-square test in R can be seen below, using the US Primary Election table above (yellow for Fisher’s exact test, green for the chi-square test).
[Screenshot: R console with the fisher.test() command and output highlighted in yellow and the chisq.test() command and output highlighted in green.]
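For readers without the screenshot, the commands look roughly like this, again using the reconstructed counts from the sketch above rather than the exact published table:

```r
# Same reconstructed table as above: Trump's finish by Democratic primary winner
primary <- matrix(c(25, 12, 3, 8, 1, 2), nrow = 3, byrow = TRUE,
                  dimnames = list(Trump = c("1st", "2nd", "3rd"),
                                  Dem   = c("Clinton", "Sanders")))

fisher.test(primary)   # exact test; the post reports a p-value of 0.03653
chisq.test(primary)    # chi-square test; the post reports a p-value of 0.04217
```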
The output of Fisher’s exact test shows that there is a probability of 0.03653 of observing these table frequencies when there is no association between the rows and columns.  The chi-square test output gives a p-value of 0.04217 for the same table.  If we were using 0.05 as the p-value criterion for significance, we would find a relationship with both tests in this case, though the p-values differ.  In a case where the sample size is even smaller than in the example given above, this difference in p-values would be even greater and could lead us to reach the wrong conclusion regarding whether there is a relationship between the variables or not.  States which Hillary Clinton won in the primary season were more likely to be won by Donald Trump, while states where Bernie Sanders won were more likely to have Trump finish 2nd or 3rd.

As a warning, the p-value should not be used as an indicator of the strength of the association between categorical variables.  Either the test is significant or it is not, which means that either the relationship is present or it is not.  The p-value is sensitive to sample size.  Often the odds ratio can be used to estimate the effect size, but R only computes it in the fisher.test function for tables with two rows and two columns.

Fisher’s exact test provides a criterion for deciding whether the differences in observed percentages between two categorical variables in a sample are significant or just due to random noise in the data.  In the above example, the 86% of primary states won by both Clinton and Trump is significantly different from the 55% of primary states won by both Sanders and Trump.  Journalists should always be careful about making these judgments by just looking at observed percentages or counts because of the subjectivity of such decisions.  Subjective decisions can be further clouded by one's preconceived notions about the issues related to the data.
**Related posts**


Patriotic Projections and Calculations

 

Statistics and Old Beliefs

 

Probability  

 

Saturday, December 19, 2015

Two Investigative Films: Truth and Spotlight


Two movies out this holiday/Oscar season have similar plots but very different outcomes.  Truth tells the story of the 60 Minutes II team (with Dan Rather played by Robert Redford) that put together an exposé of George W. Bush's military record in the Texas and Alabama Air National Guards.  Strings were pulled to get him out of fighting in Vietnam.  There were indications that he did not show up for duty when he was required to.

After the piece aired in the summer of 2004 (during the election), questions were raised about the fonts in one of the documents cited in the report.  The claim from the right was that the font could not have appeared in a document typed in the late 60s or early 70s.  An investigation ensued, and the end result was Dan Rather, his award-winning producer Mary Mapes (played by Cate Blanchett), and the rest of the team being forced out of CBS.  Regardless of the authenticity of that document, the rest of the original report was never disproved.  Jon Stewart weighed in on it this way at the time.


Like the reactions to the original report, the reviews of the film were lukewarm, with a Metascore of 67 out of 100.  The review above was one that was not so positive on the film.  One of the complaints was that Redford does not look anything like Dan Rather (he didn't look anything like Bob Woodward either in All the President's Men, but he still played him and it's now considered a classic). Conservative film critic Michael Medved gave the film 2 stars; he said it was well acted but too biased against the Bush administration.


The film Spotlight has been much better received so far.  It has earned Golden Globe nominations for best drama and best director (but none for acting) and has a Metacritic score of 93.  It is the story of how the Spotlight team (headed by Michael Keaton and including Rachel McAdams, Mark Ruffalo, and another guy) at the Boston Globe is goaded into investigating sexual abuse by priests by the paper's new editor (Liev Schreiber).  Spoiler alert: the team succeeds in taking on the Boston Archdiocese, Cardinal Bernard Law is replaced, and the paper wins a Pulitzer Prize for the effort in 2003.  Michael Medved gave the film 3 stars and also called it one-sided against the Catholic Church, but said that was understandable given the crimes that were revealed.

The Spotlight team did a valuable public service in revealing church abuses, but the film does skirt another issue: media consolidation.  Newspapers are having their staffs reduced as more and more of our media comes under the control of fewer and fewer individuals.  This is happening at many newspapers as well as broadcast outlets like CBS.  With fewer reporters to do the investigating, it is harder for news media outlets to challenge the powers that be.

Both films show that, with the right backing, it is possible for investigative reporters to expose the crimes of the powers that be.  The papers' owners backed the Spotlight team and Woodward and Bernstein, and they were successful.  In the case of Dan Rather and the film Truth, Viacom, the parent company of CBS, sided with the critics and took the position that if one document in the report might not be authentic, then the whole report must be false.  The program 60 Minutes II ended soon thereafter.

There is a third film coming out called Concussion, starring Will Smith as a neuropathologist who discovers a brain disease in football players called chronic traumatic encephalopathy, or CTE.  He encounters resistance from the NFL to his findings, but they eventually relent.  Will Smith has received a Golden Globe nomination for this role.  I haven't seen it yet, so I can't comment further.

**Related posts**

Concussions

Say it ain't so JoePa!

A Modest Proposal to Curb Cognitive Deficits in the NFL and other High Contact Sports

2012: A 2004 Election Rerun?