Showing posts with label Categorical Data. Show all posts

Thursday, May 23, 2019

Job Mobility for Slavic and Hungarian Steelworkers from 1900-1950

Below is another excerpt from my upcoming book on Johnstown by the numbers.  It is a discussion of the job mobility of East Central European immigrants in the steel mills in Johnstown, PA from 1900 to 1950.
Ewa Morawska (1985), in For Bread with Butter: Life-Worlds of East Central Europeans in Johnstown, Pennsylvania, 1890-1940, thoroughly chronicles the struggles of East Central European immigrants, namely Slavic, Hungarian, and Austrian immigrants, in the city in the early part of the 20th century.
Morawska (1985, p. 100) found that in the steel industry approximately 7% of East Central European immigrants who remained in the city moved up from unskilled or unspecified semiskilled laborers in the mills to semiskilled or skilled workers from 1900 to 1920.  She also looked at first generation immigrants who remained from 1915 to 1930 and at second generation immigrants from 1920 to 1949/50.  These numbers are summarized in the tables below.  There were not enough first generation immigrants to follow from 1900 to 1930.  First generation immigrants tended to move from city to city, especially in their early days in the US.
Table 1a shows job mobility for first generation immigrants from 1900 to 1920 and from 1915 to 1930, and for the second generation from 1920 to 1949/50.  These were immigrants who remained in the city during the periods in which they were tracked in the Census and the city directories.  The numbers on the observed side of the table are the actual shifts of immigrants from unskilled or unspecified semiskilled to semiskilled or skilled, vice versa, and immobile (no change in employment status in the mill over the period).  The Expected with no Discrimination side of the table shows what the numbers would be if the overall mobility rates were the same as they were for western European immigrants or native workers (no discrimination).
Table 1a
Job mobility from unskilled or unspecified semiskilled to skilled or semiskilled steelworkers for 1st & 2nd generation East Central European Immigrants (Morawska, 1985, pp. 100, 164, 166)

|                      | Observed % upward mobility | Observed % downward mobility | Expected % upward mobility (no discrimination) | Expected % downward mobility (no discrimination) |
|----------------------|----|----|----|----|
| 1900-1920 1st gen    |    |    |    |    |
| 1915-1930 1st gen    |    |    |    |    |
| 1920-1949/50 2nd gen |    |    |    |    |

The upward mobility rates for both first generation periods were considerably lower than they were for the second generation and than the numbers we would expect if there were no discrimination.  The downward mobility numbers were higher for second generation mill workers than for first generation workers and than what would be expected if there were no discrimination.  The downward mobility numbers for the first generation were the same for the 1900-1920 and 1915-1930 periods, and both were nearly identical to what would be expected if there were no discrimination.
Morawska, E. (1985).  For Bread with Butter: Life-Worlds of East Central Europeans in Johnstown, Pennsylvania, 1890-1940.  New York: Cambridge University Press.

**Related Posts**

Friday, January 4, 2019

Does playing in the NFL help a head coach? Not in the Playoffs.

Vince Lombardi never played in the NFL
NFL Playoff time is upon us.  While my Steelers won't be there, there will be plenty of action.  How the coaches handle their personnel will go a long way to determining who wins.  Athlon Sports produced a list of the top 25 NFL coaches of all time.  I noticed that some of the top coaches on the list, such as Vince Lombardi and Bill Belichick, never played in the NFL, while others, such as Don Shula, Tom Landry, and Chuck Noll, had.  I thought I would take a closer look at whether playing in the NFL was a predictor of their success.

Earl "Curly" Lambeau played for and coached the Packers at the same time
The Athlon list had 14 of the 25 coaches who had played in the NFL.  This includes Bill Parcells and John Madden, who were drafted but never played a down for their teams.  Four of the early coaches, George Halas, Curly Lambeau, Steve Owen, and Guy Chamberlin, played for and coached their teams at the same time for at least part of their careers.  The coaches who played had a combined record (including playoffs) of 2,656 wins, 1,553 losses, and 110 ties with 34 championships for a winning percentage of 62.8%.  The coaches who did not play had a combined record of 1,563 wins, 914 losses, and 30 ties with 24 championships for a winning percentage of 62.9%.

Coaching and playing for their teams was different in the early days than it is today.  I looked at the wins and losses for coaches whose careers overlapped the Super Bowl era.  Championships won by Vince Lombardi and Paul Brown before the Super Bowl era are included.  This would be a really large list, so it was limited to coaches from this era who were on the list or who had taken their teams to a Super Bowl.  This gives a list of 51 coaches, 28 who had not played and 23 who had.  The ones who had played have a combined record of 3,236 wins, 2,291 losses, and 36 ties with 21 championships for a 58.5% winning percentage.  The ones who had not have a combined record of 3,512 wins, 2,419 losses, and 41 ties with 41 championships for a 59.2% winning percentage.
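The winning percentages above count a tie as half a win.  As a quick sketch in R, using the records quoted above:

```r
# Winning percentage with ties counted as half a win
win_pct <- function(wins, losses, ties) (wins + ties / 2) / (wins + losses + ties)

round(100 * win_pct(2656, 1553, 110), 1)  # coaches who played, all time: 62.8
round(100 * win_pct(3236, 2291, 36), 1)   # coaches who played, Super Bowl era: 58.5
```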

| Coach Played in NFL      | Y (N=23) | N (N=28) |
|--------------------------|----------|----------|
| Regular Season Winning % | 58.7%    | 58.8%    |
| Playoff Winning %        | 54.5%    | 58.3%    |
| Overall Winning %        | 58.5%    | 59.2%    |
| Championships per Coach  | 0.91     | 1.46     |

Breaking down these numbers by playoff and regular season games in the Super Bowl era, we see where not playing in the NFL makes a difference.  In the regular season, coaches who played had a winning percentage of 58.7% while those who did not had 58.8%, virtually no difference.  In the playoffs, however, coaches who played had a winning percentage of 54.5% while those who didn't had 58.3%.  This would explain the difference in championships won by these coaches, with 41 won by the 28 who did not play (1.46 championships per coach) versus 21 by the 23 who did (0.91 per coach).

I can only speculate as to the reasons why elite coaches who did and did not play in the NFL differ on playoff winning % and championships.  It could be that coaches who played can sympathize with what their players are going through come playoff time.  They might not push their players as hard in the playoffs, when the players have a lot of aches and pains.

Another reason could be that the adage "great players do not make great coaches" holds here.  Only a few of the coaches who played could be considered stars on their teams (like Mike Ditka), but they were good enough to make it to the NFL.  You can speculate as to other reasons (e.g., concussions) for this difference.  You can see the full list of coaches in this post here.


The NFL just fired 5 of its 7 African American head coaches.  The Steelers' Mike Tomlin and the Chargers' Anthony Lynn are now the only two left in the league.  In the data set used here there were four African American coaches (7.8% of the total of 51 for the Super Bowl era).  Three of them, Mike Tomlin, Jim Caldwell, and Lovie Smith, did not play in the NFL, and one, Tony Dungy, did.  All four coaches have a combined winning percentage of 59.7% with 2 championships.  They have a winning % of 60.4% in the regular season and 47.8% in the playoffs.  Tomlin and Dungy were listed in the Athlon all-time coaches list (8% of the 25).

According to Dave Zirin at The Nation magazine, the number of African American head coaches has never been higher than 30% of the total head coaches in any one year, while they are 70% of the players.  Would a different pattern emerge if I looked at coaches with this experience in the NBA, MLB, or NHL?

**Related Posts**

The Super Bowl That May be Someday: A Steagles Super Bowl

Thursday, January 19, 2017

Don’t test me: Using Fisher’s exact test to unearth stories about statistical relationships (Repost)

This week I had an article published in Data Driven Journalism on the use of Fisher's exact test with contingency tables.  It is reprinted here.

A common problem faced by data journalists is how to determine if there is a statistical relationship between two categorical variables such as gender, race, or the share of the vote for two candidates in an election.  The simplest way to visualize the relationship is to represent the counts for each combination of two variables in a contingency table, with the rows representing the levels of one variable and the columns representing the levels of the other.  The most commonly used statistical test for an association between the row and column variables is the chi-square (χ²) test.  The example in the table below illustrates this test using 2016 primary data for the American presidential election.
The columns in the above table show the primary states won by Hillary Clinton and by Bernie Sanders on the Democratic side, and the rows show where Donald Trump placed in the same primary states on the Republican side.  The total number of states in the table is 51 because the District of Columbia is included.  For example, the column percents show that Trump won 86% of the primary states that Clinton won, while he won 55% of the states that Sanders won.  Using the chi-square test, journalists can explore relationships between the states won by Trump, Clinton, and Sanders.

The chi-square test is based on calculating expected values for each cell in the table.  In the above example, we calculate the expected value (the value that would be expected if there were no relationship among the variables) for the cell for states where Trump finished third on the Republican side and where Bernie Sanders won on the Democratic side by multiplying the row total for where Trump placed third (3) by the column total for states where Sanders won (22).  This product is then divided by the total number of observations (51).  The formula for the expected value is:

expected = (row total × column total) / N = (3 × 22) / 51 ≈ 1.29
That means that for this cell a value of 1.29 would be expected if the primary states where Trump finished third and Sanders won were completely independent of each other.  The observed value for this cell is 2, suggesting a higher count than would be expected.  Expected values are computed for each cell in the table, and the difference between the observed and expected values for each cell is computed, squared, divided by the expected value, and summed across the cells in the table according to the formula:

χ² = Σ (observed − expected)² / expected
If the value for the chi-square exceeds the chi-square critical value for a given degree of freedom (found by multiplying the number of rows minus one and the number of columns minus one) and p-value, it is concluded that there is an association between the variables.
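These two steps can be sketched in R (the language used for the tests later in this article); the 3, 22, and 51 below are the row total, column total, and table total from the example above:

```r
# Expected count for one cell: (row total * column total) / table total
expected_cell <- function(row_total, col_total, n) row_total * col_total / n

expected_cell(3, 22, 51)  # the cell discussed above, about 1.29

# Chi-square statistic: sum over all cells of (observed - expected)^2 / expected
chi_square_stat <- function(observed) {
  expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
  sum((observed - expected)^2 / expected)
}
```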

But there is a problem with the chi-square test.  The test is only an approximation of the distribution of counts in contingency tables. If more than 20% of the cells in the table have an expected value of less than five, the chi-square approximation does not work to test the hypothesis of an association between the row variable and the column variable (as is the case in the table above).  Both variables in the table are categorical, which means that the values of the variables can only take certain values such as gender, political affiliation, or placement in an election. A continuous variable is one that could take any value on the number line such as temperature, height, or weight. The major statistical packages will alert the user if this assumption is violated.  Violating the assumption causes the observed p-value to be incorrect and can lead to incorrect conclusions being made regarding the presence or absence of an association.

To overcome these limitations, there is an exact alternative to the chi-square test called Fisher’s exact test.  Rather than the chi-square distribution, which is an approximation of the distribution of observed and expected values, Fisher’s exact test is based on the hypergeometric probability distribution, which is the exact distribution of counts in a contingency table.
p = (Π Ri! × Π Cj!) / (N! × Π aij!)

Here the Ri! are the factorials of the row totals (5!=5*4*3*2*1), the Cj! are the factorials of the individual column totals, N! is the factorial of the table total, and the aij! are the factorials of the individual cell values.  The Π denotes the product over the row totals, column totals, or individual cells.  Such a formula is even more computationally intensive than the chi-square test, especially for tables with many rows and columns.  This is why the chi-square test was favored in the past: the exact test took too much memory for computers to run.  These days it is less of an issue for computers to run the Fisher's exact test, and it is easy to run in the major statistical packages, such as R, SAS, SPSS, and STATA.
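As a sketch, the probability of a single table under this formula can be computed in base R with log-factorials (lfactorial) to avoid overflowing the factorials:

```r
# Hypergeometric point probability of one table:
# (product of row-total factorials * product of column-total factorials)
# / (N! * product of cell factorials), computed on the log scale
table_probability <- function(tab) {
  log_num <- sum(lfactorial(rowSums(tab))) + sum(lfactorial(colSums(tab)))
  log_den <- lfactorial(sum(tab)) + sum(lfactorial(tab))
  exp(log_num - log_den)
}
```

Fisher's exact p-value then sums these probabilities over every table that is as extreme as or more extreme than the observed one, which is what R's fisher.test does internally.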

The commands to conduct Fisher's exact test and the chi-square test in R can be seen below, using the US primary election table above.
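A sketch of those commands follows.  The cell counts are reconstructed from the figures quoted above (51 states, Trump winning 86% of Clinton's states and 55% of Sanders's, and an observed count of 2 in the third-place/Sanders cell), so treat the exact counts as an assumption:

```r
# Contingency table reconstructed from the percentages quoted in the article:
# rows are where Trump placed, columns are the Democratic primary winner
primaries <- matrix(c(25, 12,
                       3,  8,
                       1,  2),
                    nrow = 3, byrow = TRUE,
                    dimnames = list(Trump = c("1st", "2nd", "3rd"),
                                    Democrat = c("Clinton", "Sanders")))

fisher.test(primaries)  # Fisher's exact test
chisq.test(primaries)   # chi-square test; warns that expected counts are small
```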
The output for the Fisher's exact test shows that there is a probability of 0.03653 of observing these table frequencies when there is no association between the rows and columns.  The chi-square test output shows a probability of 0.04217 for the same table.  If we were using the 0.05 p-value as the criterion for significance, we would find a relationship with both tests in this case, though the p-values differ.  In a case where the sample size is even smaller than in the example given above, this difference in p-values would be even greater and could lead us to reach the wrong conclusion regarding whether there is a relationship between the variables or not.  States which Hillary Clinton won in the primary season were more likely to be won by Donald Trump, while states where Bernie Sanders won were more likely to have Trump finish 2nd or 3rd.

As a warning, the p-value should not be used as an indicator of the strength of the association between categorical variables.  Either the test is significant or it is not, which means that either the relationship is present or it is not.  The p-value is sensitive to sample size.  Often the odds ratio can be used to estimate the effect size, but R only computes it in the fisher.test function for tables with 2 rows and 2 columns.

Fisher’s exact test provides a criterion for deciding whether the differences in observed percentages between two categorical variables in a sample are significant or just due to random noise in the data.  In the above example, the 86% of Clinton's primary states won by Trump is significantly different from the 55% of Sanders's primary states won by Trump.  Journalists should always be careful about making these judgments by just looking at observed percentages or counts because of the subjectivity of such decisions.  Subjective decisions can be further clouded by one's preconceived notions about the issues related to the data.
**Related posts**

Patriotic Projections and Calculations


Statistics and Old Beliefs