Yesterday, I got a copy of a recent decision in the Eastern District of New York about another fireman discrimination lawsuit, this one about the city of New York, which is rather larger and more significant than the one in New Haven. I read it and learned that (i) Ricci means nothing and (ii) the 4/5 test is dead -- replaced by a simple test for statistical significance. Under the new test, any statistically significant difference between white and minority scores is prima facie evidence of discrimination.
The judge's decision is both insanely important and insane. He is saying that the Equal Employment Opportunity Commission's Four-Fifths Rule isn't tough enough at sniffing out Disparate Impact. Instead, any "statistically significant" difference in passing rates between racial groups should shift the burden of proof in the case to the employer and make the employer meet the strict "business necessity" burden.
And with a big enough sample size, such as the 10,000 or so who take the FDNY entry level employment test, practically any racial difference, no matter how pragmatically insignificant, can be deemed to have attained statistical significance for purposes of bringing the hammer down legally on the employer under Title VII of the 1964 Civil Rights Act.
Judging from media coverage, you might think that racial quotas are an open question in this country. You might even get that impression from reading iSteve some days. But they aren’t. Racial quotas are part of how we live now. This case demonstrates how and why.
Judge Garaufis specifically allowed the Vulcan Society to take over as main plaintiff from the three individuals, and specifically banned the main union, the Uniformed Firefighters Association, from joining the defense. The union was worried that the Bloomberg Administration would not put up a stiff enough fight. I can't find online documents showing Bloomberg's defense, so I have a hard time evaluating how true the union's worry was.
You can learn a lot about a case just by looking at the caption. The plaintiff is the United States of America, joined by a black firefighters group and individual black and Hispanic firefighters. The case number, beginning with 1:07 CV, shows that it was a civil case commenced in 2007. From that and the name of the plaintiff we can deduce that this case was brought by the Bush Justice Department in 2007, while Alberto Gonzales was attorney general and, you will recall, there was great worry that Bush was politicizing the Justice Department. In fact, cases like this get brought in every administration, no matter who is the president and who is the attorney general.
Defendants are the City of New York, the fire department, the Department of Citywide Administrative Services (which developed the tests at issue) and the mayor and the fire commissioner. No unions and no firefighters. No New York equivalent of Frank Ricci.
Checking the bio of the judge, Nicholas Garaufis, I see that he was appointed by Clinton on Sen. Moynihan’s recommendation and unanimously confirmed by the Senate in 2000. He has a typical background for a New York federal judge: associate at a white-shoe firm, Chadbourne & Parke; nine years as counsel to the Queens borough president, Claire Shulman; five years as general counsel to the Federal Aviation Administration during the Clinton administration. If I had to guess I would say he was politically in the left half of the federal judiciary, but nevertheless he is a completely mainstream guy. In his opinion he cites to a 1972 decision by Judge Edward Weinfeld dinging an earlier version of the firefighters’ exam. Weinfeld was one of the most respected federal district court judges in the country.
You can see both the 1999 and 2002 tests in their entirety here.
The news angle: this decision made the front page of the New York Law Journal, which is basically the house organ of the New York bench and bar. It was a subject of a short article in the Times. It didn’t make the Wall Street Journal at all. It will get maybe one one-millionth of the coverage that Ricci got or that Professor Gates’ arrest is getting.
The Times article is innumerate but informative. There are the expected quotes from the intervenors’ lawyers, including a pious statement from Dana Lossia that, “Really what this decision says is, the exams you were using don’t pick the best-qualified people. What they really don’t do is pick the people who would best protect the city.” Not a word from the Department of Justice.
But it’s the stuff from the city that is really important. The city says: the tests are no longer being used, and since the city began administering a new test in Jan. 2007, minorities are now 38% of the candidates on the passing list and 33% of the top 4,000, who are most likely to be offered a job. They are a third of the most recent class of probationary firefighters. So, in practical terms: racial quotas for New York City firefighters have been in place since January 2007; and all this lawsuit is about is back pay and promotions for black and Hispanic applicants before 2007. Back pay and promotions will be addressed in the remedy phase, which is what Judge Garaufis, the lawyers, and the experts will turn to now.
The critical thing legally is that this case was decided on summary judgment: expert opinions and statistics. The judge did not hear any testimony. Under the law, you are only supposed to grant summary judgment when the losing side has failed to show any dispute about any fact that might conceivably affect the outcome. Or, in layman’s terms, there was no need for a trial, because New York City’s position was so weak that it didn’t deserve one.
Here is what Judge Garaufis said about Ricci in his opinion.
“Before proceeding to the legal analysis, I offer a brief word about the Supreme Court’s decision in Ricci v. DeStefano, 129 S. Ct. 2658 (June 29, 2009). I reference Ricci not because the Supreme Court’s ruling controls the outcome in this case; to the contrary, I mention Ricci precisely because it does not. In Ricci, the City of New Haven had set aside the results of a promotional examination, and the Supreme Court confronted the narrow issue of whether New Haven could defend a violation of Title VII’s disparate treatment provision by asserting that its challenged employment action was an attempt to comply with Title VII’s disparate impact provision. The Court held that such a defense is only available when “the employer can demonstrate a strong basis in evidence that, had it not taken the action, it would have been liable under the disparate-impact statute.” Id. at 2664. In contrast, this case presents the entirely separate question of whether Plaintiffs have shown that the City’s use of Exams 7029 and 2043 has actually had a disparate impact upon black and Hispanic applicants for positions as entry-level firefighters. Ricci did not confront this issue.”
The obvious corollary of this analysis is that if the City of New Haven had not set aside the results of the exam and simply allowed itself to be sued, it would have lost and racial quotas would have triumphed. I don’t think other municipalities will miss this.
To the numbers.
Both tests consisted of 85 multiple choice questions. You had to pass the test in order to get to take the physical test, the PPT. The written test and the PPT were then averaged and the applicants were rank-ordered. Judge Garaufis isn’t so simple-minded as to attach copies of the tests to his opinion, but it is my impression that they weren’t very different in content. 
But they were very, very different in passing grade. Test 7029, in effect from 1999 to 2002, required you to get 84.706% right (72/85) to pass. Test 2043, in effect from 2002 to 2007, required you to get only about 70% (60/85) to pass. So, under test 2043, passage rates rocketed up and the racial gaps diminished -- but not enough for the Justice Department or Judge Garaufis.
Test 7029 -- 1999
Whites passed at an 89.9% rate (11,613/12,915). Blacks passed at a 60.3% rate (1,054/1,749). So, blacks passed at 67% of the white rate. Test 7029 failed the 4/5 rule. An expert said there were 33.9 “units of standard deviation” between the white and the black rate, making it a one-in-a-ridiculously-large-number possibility that the disparity in white and black results was the result of chance. The expert report is not online, and I don’t know what a “unit of standard deviation” is. Obviously the black result is not 34 standard deviations below the white result. But perhaps here “units of standard deviation” simply expresses how far the white-black result is from chance, if you assume no difference in ability between the two groups.
Hispanics passed 7029 at a 76.7% rate, meaning their pass rate was 85.3% of the white pass rate. So test 7029 did not flunk the 4/5 rule for Hispanics. But, an expert found that there were 17.4 “units of standard deviation” between the white and Hispanic results, again making it staggeringly unlikely that the difference in scores between the two groups was the result of chance.
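One plausible reading of those “units of standard deviation,” offered here only as a guess since the expert report isn't available, is the ordinary pooled two-proportion z-statistic: the gap in pass rates divided by the standard error you'd expect if both groups really had the same pass rate. The sketch below plugs in the white and black pass counts quoted above for Test 7029. The function name is made up for illustration, and nothing here is confirmed to be the expert's actual method.

```python
import math

def two_proportion_z(pass_a, total_a, pass_b, total_b):
    """Pooled two-proportion z-statistic (an assumed reading of the expert's figure)."""
    rate_a = pass_a / total_a
    rate_b = pass_b / total_b
    # Pass rate if both groups are lumped together (the "no difference" assumption).
    pooled = (pass_a + pass_b) / (total_a + total_b)
    # Standard error of the difference in rates under that assumption.
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    return (rate_a - rate_b) / std_err

# Test 7029 pass counts as quoted in the post: whites first, then blacks.
z_7029 = two_proportion_z(11_613, 12_915, 1_054, 1_749)
print(f"Test 7029 white-black gap: {z_7029:.1f} standard errors")
```

The result lands at roughly 34, close to the expert's 33.9, which suggests (but doesn't prove) that this is the calculation behind the number.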
Test 2043 -- 2002
Whites passed this test at an impressive 97.2% rate (13,495/13,877). Blacks passed it at 85.4% (1,190/1,393). So, this is a spectacularly easy test. I would think you’d have to be damn near illiterate not to pass it. Anyway, the black pass rate was 87.8% of the white pass rate. So, the 4/5 rule is satisfied. But, an expert finds that there are 21.8 “units of standard deviation” between the white and black scores, again making it extraordinarily unlikely that the difference in scores between the groups is the result of chance.
The Hispanic pass rate is 92.8%, or 95.5% of the white rate. The 4/5 rule is easily satisfied. But there are 10.5 “units of standard deviation” between the white and Hispanic results, making it extremely unlikely that the difference between the scores was the result of chance.
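For anyone who wants to check the 4/5 arithmetic, here is a minimal sketch that divides each group's pass rate by the white pass rate and compares the result to 0.8. It uses the rounded percentages quoted above and reads 92.8% as the Hispanic pass rate on Test 2043 (an assumption, but the only reading consistent with the 95.5% ratio). The names are illustrative.

```python
def impact_ratio(group_rate, reference_rate):
    """Group pass rate as a fraction of the reference (here, white) pass rate."""
    return group_rate / reference_rate

# (test, group) -> (group pass rate %, white pass rate %), as quoted in the post.
comparisons = {
    ("7029", "black"): (60.3, 89.9),
    ("7029", "Hispanic"): (76.7, 89.9),
    ("2043", "black"): (85.4, 97.2),
    ("2043", "Hispanic"): (92.8, 97.2),
}

satisfies_rule = {}
for (test, group), (rate, white_rate) in comparisons.items():
    ratio = impact_ratio(rate, white_rate)
    satisfies_rule[(test, group)] = ratio >= 0.8
    verdict = "satisfies" if ratio >= 0.8 else "fails"
    print(f"Test {test}, {group}: ratio {ratio:.3f} {verdict} the 4/5 rule")
```

Only one of the four comparisons, blacks on Test 7029, falls below the four-fifths line.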
The opinion then goes on to the rank ordering issue. There is a gap of more than 600 slots between the average white and the average black applicant on the hiring list under test 7029, and a gap of more than 900 slots between the average white and the average black under test 2043, because of course everyone and his cretinous brother are passing 2043 and getting on the list.
The city argues that the judge should deny summary judgment on the prima facie question of disparate impact where the 4/5 rule is satisfied (that is, test 7029 for Hispanics, and test 2043 for blacks and Hispanics). But the judge says no. Citing to various cases from the Second Circuit (i.e., the geographical branch of the federal appellate courts comprising New York, Connecticut, and Vermont) he says that disparate impact is established any time the racial gap is more than 3 standard deviations -- by which he means, not that the black/Hispanic score is more than 3 standard deviations lower than the white score, but that the gap between the races is great enough that the result is almost certainly not caused by chance.
In other words, statistical significance at the .01 level. Assuming the two races were absolutely equal in test taking ability, what is the chance that blacks would do so much worse on average on this test given to many thousands of people just by random bad luck? The answer is of course one in a bazillion. Therefore, the FDNY got a lot of 'splainin' to do.
Does the judge know what "statistical significance" even means? As Inigo Montoya said in The Princess Bride: "You keep using that word. I do not think it means what you think it means." If you have a huge sample size, you can find "statistical significance" in the technical sense even when there is no practical significance.
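To see how sample size alone drives this, here is a rough sketch with entirely hypothetical numbers: two groups passing at 97% and 96%, a gap nobody would call practically meaningful and one that sails through the 4/5 rule. Holding the rates fixed, the pooled two-proportion z-statistic grows with the square root of the group size, so the same one-point gap eventually crosses the 3-standard-deviation line. The group sizes and function names are assumptions for illustration only.

```python
import math

def z_for_gap(rate_a, rate_b, n_a, n_b):
    """Pooled two-proportion z-statistic for fixed pass rates and group sizes."""
    pooled = (rate_a * n_a + rate_b * n_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (rate_a - rate_b) / std_err

def two_sided_p(z):
    """Two-sided normal p-value for a z-statistic."""
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical pass rates: a one-point gap, impact ratio of about 0.99.
RATE_A, RATE_B = 0.97, 0.96

z_by_size = {}
for n in (500, 5_000, 50_000):  # hypothetical test-takers per group
    z = z_for_gap(RATE_A, RATE_B, n, n)
    z_by_size[n] = z
    print(f"n={n:>6} per group: z={z:.2f}, p={two_sided_p(z):.4f}, "
          f"beyond 3 SD: {abs(z) > 3}")
```

For reference, `two_sided_p(3)` comes out near 0.0027, so a 3-standard-deviation threshold is somewhat stricter than the .01 level, but that doesn't change the point: at a large enough n, the same tiny gap clears it.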
I won’t spend a lot of time discussing the legal tests and discussion of precedent, but will try to focus on the basics. The main conceptual flaw of these exams was that they measured cognitive abilities to the exclusion of non-cognitive abilities, even though non-cognitive qualities are clearly important to being a successful firefighter. (Of course, if non-cognitive qualities are equally distributed among the population, and cognitive abilities are concentrated among whites, then the rank-ordering of a test that considers non-cognitive as well as cognitive abilities won’t be different, racially speaking, than the rank ordering of a test that considers cognitive abilities alone. But the law, in its majestic equality, takes no notice of such trifles.)
The judge complains that the cognitive test didn't test for non-cognitive traits such as Conscientiousness and Cleanliness, but doesn't indicate how the city could have efficiently, effectively, and fairly tested these non-cognitive traits among many thousands of job applicants ... and with no Disparate Impact, either!
Perhaps city officials could have sniffed each job applicant and then graded him on Cleanliness?
Written tests of the boy scout virtues aren't bad, but they have a much weaker chain of evidence supporting them than do cognitive tests. So, all the judge's quibbles about this cognitive test would be an order of magnitude larger for any of the virtue tests.
For example, pencil and paper virtue tests can be outsmarted. "Ooh, how can I look good on this question?" (This isn't such a problem for firemen, except in the case of smart arsonists like John Orr, but it's a big problem testing policemen, where the last thing you want is somebody out at the extreme of the Smart-Bad quadrant.)
The good thing about IQ tests is that they can't really be outsmarted (outside of blatant copying off the Asian guy sitting next to you). If you outsmart an IQ test, well, then, you are smart.
Nor does the judge cite any evidence that there won't be racial gaps in non-cognitive traits like Dependability, Cooperation, Concern for Others, Persistence, and Self Control. Any kind of valid test for those traits will show large (and embarrassing) racial gaps, so the fire department would be right back in court being charged with Disparate Impact. For example, the judge complains that the City should have tested for "mechanical ability" as if adding a test of "mechanical ability" would obviously reduce the black-white gap.
The main structural flaw in the tests was that the questions were written by firefighters, rather than by testing professionals. The judge made great hay of the fact that some of the questions were supposed to measure inductive reasoning, and several of the firefighters involved in the process had no idea what inductive reasoning was.
Judge Garaufis then engages in a lot of quibbling over how the test was made up -- in house, by NYC civil servants interviewing firemen.
Ironically, Garaufis compliments the New Haven Fire Department's test in comparison to FDNY's, even though Justice Ginsburg's losing dissent went on and on about how bad it is.
Clearly, though, a lot of work went into it, and no evidence is presented of racial bias. (Presumably, the city officials went to extra lengths to get minority firemen to contribute to the test design process and to buy off on the results.)
NYC's good-enough-for-government-work approach possibly isn't the ideal way to make up a fireman hiring test. Nor are expensive, custom-made, city-by-city tests like New Haven's. The best way is like in college admissions -- you have a couple of nationally competing companies put out national tests, like the SAT and ACT. But that's not feasible under the current law because everybody has to act like the only reason for disparate impact is because the last 37,434 fireman's tests didn't do it right, but this time, we're gonna all roll up our sleeves and, doggone it, get it right!
But if you know anything about test design (which the judge clearly doesn't), you also know that there are rapidly diminishing returns to test design sophistication. When you are just coming up with a test that eliminates the bottom 30% of white guys who want to be firemen, you know, it's really not that hard to come up with something good enough.
Other issues: the reading level of the test was too high (see p. 75); there was an insufficient showing of a linkage between the qualities measured and the abilities necessary to make a good fireman; and so on.
The government’s expert, Siskin, seems to acknowledge the existence of g, see p. 67 of the opinion. But the judge notes this solely by way of saying that the test was unjustifiably focused on the measurement of cognitive abilities.
There's a hilarious part where the plaintiff's statistical genius expert witness rediscovers the general factor of intelligence, just as Charles Spearman discovered the g factor in 1904. The test design process identified nine cognitive skills useful in firefighting.
This analysis revealed a pattern showing that the “[questions] intended to measure an individual cognitive ability actually tend[ed] to correlate as or more highly with [questions] intended to measure different cognitive abilities . . . .”
Spearman invented factor analysis, which he used to discover the existence of the g Factor. No doubt wholly innocent of this bit of cognitive testing history from more than a century ago, the judge complains:
To further support his conclusion, Dr. Siskin applied a method called “factor analysis,” which is “a statistical methodology that, based on the empirical data, defines an underlying structure which can explain the correlation among the [questions].” (Id.) “For the results of factor analysis to confirm the test plan, the analysis should find that [questions] group together to comprise nine or 10 factors in a manner consistent with the test plan, such that the Deductive Reasoning [questions] group together to form one factor and the [questions] intended to measure Inductive Reasoning group together to form a second factor, and so forth.” Dr. Siskin’s factor analysis showed that the data did not “factor into nine distinct factors or ability domains,” but instead “seem to primarily measure a general cognitive ability (except, perhaps, Memorization), and to a much lesser extent, a second specific cognitive ability (which is different from any defined by the test developers).” According to Dr. Siskin, “[t]his result demonstrates that the purported intent of the test design (to measure and weight nine distinct cognitive ability domains) was not successful.”
I'm shocked, shocked to discover that this test validates the existence of the g Factor and once again fails to prove the validity of Howard Gardner's theory of multiple intelligences.
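For the curious, here is a toy illustration of what "one general factor" looks like in a correlation matrix. This is not Dr. Siskin's analysis and not factor analysis proper, just a simpler first-eigenvalue check on made-up data: nine hypothetical ability scores that all correlate 0.5 with each other. If the nine abilities were truly distinct, each would account for about one-ninth of the variance. Instead, a single dimension soaks up more than half. The correlation value and the function name are assumptions.

```python
def leading_eigenvalue(matrix, iterations=100):
    """Largest eigenvalue of a symmetric matrix, found by power iteration."""
    size = len(matrix)
    vector = [float(i + 1) for i in range(size)]  # deliberately non-uniform start
    for _ in range(iterations):
        product = [sum(matrix[i][j] * vector[j] for j in range(size))
                   for i in range(size)]
        norm = sum(x * x for x in product) ** 0.5
        vector = [x / norm for x in product]
    # Rayleigh quotient of the (unit-length) converged vector.
    product = [sum(matrix[i][j] * vector[j] for j in range(size))
               for i in range(size)]
    return sum(v * p for v, p in zip(vector, product))

# Hypothetical: nine ability scores, every pair correlated at 0.5.
NUM_ABILITIES, PAIR_CORRELATION = 9, 0.5
corr = [[1.0 if i == j else PAIR_CORRELATION for j in range(NUM_ABILITIES)]
        for i in range(NUM_ABILITIES)]

top = leading_eigenvalue(corr)
# Eigenvalues of a correlation matrix sum to its size, so this is a variance share.
share = top / NUM_ABILITIES
print(f"first dimension accounts for about {share:.0%} of total variance")
```

With uniform correlations the leading eigenvalue is 1 + 8 × 0.5 = 5, so the first dimension carries 5/9 of the variance and the other eight split the rest evenly.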
Then the judge says:
This evidence shows that the cognitive abilities intended to be tested on Exams 7029 and 2043 were not the most important cognitive abilities for the job of firefighter.
But, Judge G., you just complained that the test tested g, the general cognitive ability, rather than various hypothesized specific abilities. Tautologically speaking, isn't general cognitive ability the most important cognitive ability?
Interestingly, the use of a cutoff score is very problematic, even when the cutoff is based on the number of firefighters required (I should say, especially when the cutoff is based on the number of firefighters required). See p. 78. I think the obvious takeaway is that the fire department should consider many applicants more than it needs, and hire on a racially balanced basis from that larger pool.
In other words, do what Chicago is doing: hire firefighters largely at random.
We'd be better off with strict racial quotas than with that.