Note to readers: This is *problems with test score measurement week* at ProfitOfEd.org.

Timothy Bond and Kevin Lang have a new paper with really scary results for folks using test scores for both policy and research. Scary, and maybe very deep.

We mostly treat test scores as *cardinal,* meaning that getting a 100 on a test is in some sense twice as good as getting a 50. In truth, test scores are mostly *ordinal.* 51 points is better than 50 and 100 points is better than 99, but there’s not a meaningful sense in which the difference between 100 and 99 is the same as the difference between 51 and 50. If we “regraded” the tests and rescaled so that 50 points were scored as 50.9999 points, we’d preserve the ordering of scores. And all we really know is that a higher score is better, not “how much” better.

So what? Here’s the gotcha. Bond and Lang show that if you rescale test scores *without changing the ordering* you very substantially change conclusions about the black-white test score gap. For example, a standard calculation shows a black-white test score gap of 0.25 standard deviations in kindergarten, rising to 0.61 standard deviations by third grade. In other words, the gap starts out as significant and then gets a lot worse while kids are in school. But Bond and Lang show one rescaling that would have the gap start at 0.12 and rise to 0.64. On that scale, the trend in the gap is somewhat worse than usually thought. But they show an alternative rescaling that has the gap start at 0.24 and end at 0.06. In other words, the gap almost disappears by third grade.
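To see the mechanics for yourself, here is a small Python sketch. Everything in it is made up for illustration: the ten scores are invented (and deliberately chosen to exaggerate the effect), the two rescalings (cubing and taking logs) are my own picks, and the gap is measured in standard deviations of the combined sample. None of this is Bond and Lang's data or their actual method. It only shows that rescalings which keep every student in the same rank order can still move the measured gap around.

```python
import math
from statistics import mean, pstdev


def standardized_gap(group_a, group_b, rescale):
    """Mean difference (a minus b), in standard deviations of the combined sample."""
    a = [rescale(x) for x in group_a]
    b = [rescale(x) for x in group_b]
    return (mean(a) - mean(b)) / pstdev(a + b)


def ranking(scores, rescale):
    """Student indices ordered from lowest to highest rescaled score."""
    return sorted(range(len(scores)), key=lambda i: rescale(scores[i]))


# Invented scores for two hypothetical groups of five students each.
# Group A contains both the lowest and the highest scorer.
group_a = [10, 60, 65, 70, 100]
group_b = [50, 55, 58, 62, 64]

# Three scales. Each is strictly increasing on positive scores,
# so none of them changes who outscored whom.
rescalings = {
    "raw": lambda x: x,
    "cubed": lambda x: x ** 3,  # stretches differences at the top
    "log": math.log,            # stretches differences at the bottom
}

everyone = group_a + group_b
gaps = {}
for name, rescale in rescalings.items():
    # The rank order of all ten students is identical on every scale.
    assert ranking(everyone, rescale) == ranking(everyone, rescalings["raw"])
    gaps[name] = standardized_gap(group_a, group_b, rescale)
    print(f"{name:>5}: gap = {gaps[name]:+.2f} SD")
```

With these invented numbers, group A comes out modestly ahead on the raw scale, much further ahead once scores are cubed, and actually behind on the log scale, even though no student ever changes places in the ranking.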

The point of this is that there isn’t any way to know which scale is “right.” And it makes a big difference.

This does *not* say that test scores aren’t valuable. Higher is still better. Moreover, sometimes test scores can be tied into cardinal outcomes such as income. But the research does say that standard comparisons of test scores, across time or across groups, may be much harder to interpret than we thought.