|Rank||State||Whites' relative NAEP improvement from 4th grade to 8th (St.Dev.)|
|6.||Dept. of Defense||.55|
Over the years, I've been frustrated by how everybody uses the absolute test scores of students to evaluate how good a job a school is doing: "You'll get a great education at Harvard because the average SAT score there is 1500!" Yes, but that's what they got in high school before Harvard got its mitts on them. In truth, nobody has much of an idea whether Harvard is doing a better or worse job than, say, Cal State Dominguez Hills at helping its students live up to their individual potential.
Similarly, I often hear people assume that the principal at, say, Beverly Hills H.S. is doing a good job because test scores are high there, while the principal at, say, Compton Dominguez must be doing badly because scores are low. That's quite unfair.
Absolute test scores for public schools are so dominated by demographics that the results are notoriously boring and depressing.
The state of California attempts to deal with this problem by giving two Academic Performance Index scores to each public school, one absolute and one relative to "similar schools."
But I've always wanted to look at how much "value added" schools provide.
Earlier, Audacious Epigone tried to figure out from the federal National Assessment of Educational Progress reading and math test results how much value different state educational systems are adding. He compared, across states, performance by 4th graders in 2003 vs. performance by 8th graders in 2007 on the NAEP.
That's a pretty clean comparison. (For example, if one state has a policy of discouraging Special Ed kids from taking the NAEP and another doesn't, that difference shouldn't affect the relative change over time, the way it distorts the usual absolute comparisons.)
But what if there is a big demographic shift going on, such as in states with a dramatic Hispanic influx? That would distort the numbers.
So now, in the table above, he's looking at just the change in performance from 4th to 8th grade among non-Hispanic white students in order to reduce the impact of demographic change and make for even more of an apples to apples comparison.
(This analysis could also be done for blacks and for Hispanics, but not for all 50 states because of inadequate sample sizes of minorities in, say, Montana or Vermont.)
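For readers who want to check the arithmetic, here is a minimal sketch of how a "relative improvement" figure like the one in the table might be computed. It assumes the figure is a difference of z-scores taken across states (8th grade in 2007 minus 4th grade in 2003), which may not be exactly what Audacious Epigone did. The state names and scale scores are made up for illustration, and `z_scores` and `relative_improvement` are hypothetical helper names.

```python
from statistics import mean, pstdev

def z_scores(scores):
    """Standardize each state's average against the cross-state mean and SD."""
    values = list(scores.values())
    mu = mean(values)
    sigma = pstdev(values)  # population SD across the states in the table
    return {state: (s - mu) / sigma for state, s in scores.items()}

def relative_improvement(grade4, grade8):
    """8th-grade z-score minus 4th-grade z-score for each state in both sets."""
    z4 = z_scores(grade4)
    z8 = z_scores(grade8)
    return {state: z8[state] - z4[state] for state in z4 if state in z8}

# Made-up average scale scores for white students (NOT real NAEP data).
grade4_2003 = {"State A": 225.0, "State B": 230.0, "State C": 235.0}
grade8_2007 = {"State A": 290.0, "State B": 285.0, "State C": 280.0}

gains = relative_improvement(grade4_2003, grade8_2007)
ranking = sorted(gains, key=gains.get, reverse=True)

for rank, state in enumerate(ranking, start=1):
    print(f"{rank}. {state}: {gains[state]:+.2f} St.Dev.")
```

Because both grades are standardized against the other states, a state's figure only says how it moved relative to everybody else, and the figures across all states sum to zero.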
The results are quite striking. In the best state, Montana, white students did almost a standard deviation better as 8th graders in 2007 than they (using the term "they" roughly) did as 4th graders in 2003, relative to the rest of the country. In contrast, in the worst state, Connecticut, white students' change from 4th to 8th grade was one and a third standard deviations worse than the national average, relatively speaking.
That's a difference of more than two standard deviations between #1 and #50. These are such large differences that I'm hesitant to present the numbers, but maybe somebody out there can help us check them out.
Clearly, there is some demographic change from 4th grade in 2003 to 8th grade in 2007 still showing up in the data. Perhaps the top white students in Connecticut (last on the list) are more likely than in typical states to leave the public schools for elite prep schools starting in 7th grade? (Maybe not -- most of the boarding schools in that state famous for boarding schools cover grades 9-12.)
In general, the states at the top of the list tend to be less demographically diverse than those at the bottom, although there are obvious exceptions, such as West Virginia doing quite badly.
Still, the sample sizes are impressively large: 196,000 for public school 4th graders (across all races) and 164,000 for 8th graders, or around 5-6% of all students in those two grades. So the typical state is represented by roughly 2,000 white 4th graders and 2,000 white 8th graders, and there are probably close to 1,000 whites in each grade at minimum for just about every state. (D.C., though, is excluded because there are so few whites in its public school system.) The two superstates, California and Texas, have extra-large samples of at least 10,000 students of all races in each 4th grade sample, so the number of whites there is adequate, yet they differ by about a standard deviation.
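To get a rough feel for how much pure sampling noise samples of this size could produce, here is a back-of-the-envelope sketch. It assumes simple random sampling, independent 4th and 8th grade samples, and scores measured in student-level standard deviations. None of that is guaranteed: sampling by school would make the true errors larger, and the table's "St.Dev." units may be state-level rather than student-level, so these numbers are not directly comparable to the table. The function names are hypothetical.

```python
import math

def se_of_mean(n, sd=1.0):
    """Standard error of a sample mean under simple random sampling."""
    return sd / math.sqrt(n)

def se_of_difference(n1, n2, sd=1.0):
    """Standard error of the difference between two independent sample means."""
    return math.sqrt(se_of_mean(n1, sd) ** 2 + se_of_mean(n2, sd) ** 2)

# Roughly the smallest and the typical per-state white samples mentioned above.
se_small = se_of_difference(1000, 1000)
se_typical = se_of_difference(2000, 2000)

print(f"about 1,000 per grade: {se_small:.3f} student-level SDs")
print(f"about 2,000 per grade: {se_typical:.3f} student-level SDs")
```

Under those assumptions the sampling error on a state's 4th-to-8th change is only a few hundredths of a student-level standard deviation, so if the rankings are mostly noise, it's more likely the other kinds of noise than small samples.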
Some of the results are no doubt methodological noise. Some states might have switched the schools where the test is administered to more upscale ones between 2003 and 2007 to make themselves look better. Or, for example, the NAEPs are administered during a window from January to March, so if a state gave its 4th graders the test in March in 2003 and its 8th graders the test in January in 2007, it would be cheating itself of two months of maturation vs. the national norm.
On the other hand, there would be one obvious way to cheat: give a bad education from K-3 to depress 4th grade scores, then start to do your best to teach kids a lot once they take the NAEP in 4th grade so you can score high on the 8th grade test.
Still, it's unlikely that anybody has tried to game this particular analysis simply because I don't think anybody has ever thought of this analysis before.
Just looking at the table, I don't see any obvious demographic pattern explaining why, for example, Vermont would be in 13th place a standard deviation ahead of New Hampshire in 43rd place. Or why are Maryland's whites (3rd place) two standard deviations ahead of Connecticut's whites (50th place)? Both have affluent, moderately liberal, well-educated white populations. Perhaps we really are approaching the Holy Grail of a measure of educational effectiveness?
Normally, when I look at a table of data, I can figure out what's driving the rankings. Here, though, I can't. That could be good news - I really don't know much of anything about public school quality across the country apart from demographics (other than a vague impression from the media that Texas is better than California), so the fact that the results look pretty random could mean that we are looking at actual differences in public school effectiveness. The bad news is that the results could also look random because they are pretty random due to lots of different kinds of noise.
Any suggestions you might have for torture testing the data would be appreciated.