Last week the NY Times’ Dana Goldstein and Manny Fernandez reported on a political fight in Texas over the scoring of the STAAR (the State of Texas Assessments of Academic Readiness), the state’s version of the achievement test each state must still administer every year in grades 3-8 and once in high school. The federal Every Student Succeeds Act, passed in 2015 to replace No Child Left Behind, still mandates annual testing, although Congress no longer imposes its own high-stakes punishments for failure.
However, Congress still does require the states to submit plans to the U.S. Department of Education declaring what will be the consequences for low-scoring schools. Goldstein and Fernandez explain that Texas, like many other states, still imposes punishments for the low scorers instead of offering help: “The test, the State of Texas Assessments of Academic Readiness, or STAAR, can have profound consequences not just for students but for schools across the state, hundreds of which have been deemed inadequate and are subject to interventions that critics say are undue.” Schools have to provide help for students who are not on grade level. Also: “Texas grades its districts on an A through F scale, in part based on how many students are meeting or exceeding grade-level standards… Persistently failing schools, and districts with just a single such school, can be shut down or taken over by the state—a threat facing the state’s largest school system, in Houston.”
Decades of research show that, in the aggregate, standardized test scores correlate with family and neighborhood income. In a country where segregation by race and poverty continues to grow, it is now recognized among experts and researchers that rating and ranking schools and districts by their aggregate test scores merely brands the poorest schools as failing. When sanctions are attached, political regimes of test-based accountability merely punish the schools and the teachers and the students in the poorest places.
In an excellent 2017 book, The Testing Charade: Pretending to Make Schools Better, Harvard professor Daniel Koretz explains the correlation of aggregate standardized test scores with family and community economics: “One aspect of the great inequity of the American educational system is that disadvantaged kids tend to be clustered in the same schools. The causes are complex, but the result is simple: some schools have far lower average scores—and, particularly important in this system, more kids who aren’t ‘proficient’—than others. Therefore, if one requires that all students must hit the proficient target by a certain date, these low-scoring schools will face far more demanding targets for gains than other schools do… Unfortunately… it seems that no one asked for evidence that these ambitious targets for gains were realistic. The specific targets were often an automatic consequence of where the proficient standard was placed and the length of time schools were given to bring all students to that standard, which are both arbitrary.” (The Testing Charade, pp. 129-130)
Goldstein and Fernandez report that the political fight in Texas this month is about the test scores in third grade reading: “The 2018 STAAR tests found that 58 percent of Texas third graders are not reading at grade level. On the 2017 National Assessment of Educational Progress, given to a sample of fourth graders across the country, 72 percent of Texas students were not proficient in reading—a fact the state has cited as evidence that tough local standards are warranted.”
Like many other states, Texas blames the public schools. But Goldstein and Fernandez present other factors that ought to be considered here: “More than half of the state’s public school students are Hispanic and nearly 60 percent come from low-income families. About a fifth are still learning English.” The state argues that’s all the more reason to set the passing cut score high and motivate schools to catch kids up quicker.
But educators and parents and some politicians in Texas are pushing back. They contend that the bar is set so high that students who are reading at grade level still score below the cut score for proficiency. There is a lot of discussion of reading passages said to be two grade levels ahead of the students being tested and of something called Lexile measures, which are based on features such as sentence length and word frequency and are used to evaluate the difficulty of the passages on the test.
It would clear up a lot of the trouble if more people read Chapter 8, “Making Up Unrealistic Targets,” in Daniel Koretz’s book. Koretz explains that there is nothing really scientific about where “proficient” cut scores are set: “If one doesn’t look too closely, reporting what percentage of students are ‘proficient’ seems clear enough. Someone somehow determined what level of achievement we should expect at any given grade—that’s what we will call ‘proficient’—and we’re just counting how many kids have reached that point. This seeming simplicity and clarity is why almost all public discussion of test scores is now cast in terms of the percentage reaching either the proficient standard, or occasionally, another cut score… The trust most people have in performance standards is essential, because the entire educational system now revolves around them. The percentage of kids who reach the standard is the key number determining which teachers and schools will be rewarded or punished.” (The Testing Charade, p. 120)
Koretz explains that standardized test cut scores are not set scientifically. There is no scientific or even magical way of deciding exactly which reading passages every third grader must be able to decode and comprehend, and anyway, students in third grade are not consistent. Koretz examines several methods used by panels of judges to set the “proficient” level. He adds that the methods used by different state panels don’t arrive at the same cut scores: “The percentage of kids deemed to be ‘proficient’ sometimes varies dramatically from one method to another.” (The Testing Charade, p. 124)
Goldstein and Fernandez indicate that Texas uses the National Assessment of Educational Progress (NAEP) as its audit test, the benchmark by which it judges whether its own proficiency levels are set accurately. When the scores on the STAAR are compared to the scores on the NAEP, politicians in Texas are really concerned because NAEP shows that 72 percent of Texas fourth graders are not proficient in reading, even worse than the 58 percent of third graders who score below grade level on the STAAR.
But the matter is not as dire as it would appear. The education historian Diane Ravitch served on the National Assessment Governing Board for seven years. Ravitch explains that the cut scores on the NAEP are set artificially high. It is much harder to reach the proficient level than what our common understanding of the term “proficient” would lead us to expect: “‘Proficient’ on NAEP does not indicate ‘average’ performance; it is set very high… There are four levels. At the top is ‘advanced.’ Then comes ‘proficient.’ Then ‘basic.’ And last, ‘below basic.’ Advanced is truly superb performance, which is like getting an A+. Among fourth graders, 8% were advanced readers in 2011; 3% of eighth graders were advanced. In reading, these numbers have changed little in the past twenty years… Proficient is akin to a solid A. In reading, the proportion who were proficient in fourth grade reading rose from 29% in 1992 to 34% in 2011. The proportion proficient in eighth grade also rose from 29% to 34% in those years… Basic is akin to a B or C level performance. Good but not good enough.”
The argument about what different “proficient” levels really mean is old and tired, but we can’t seem to move beyond it. Today we know that the No Child Left Behind Act was aspirational. It was supposed to motivate teachers to work harder to raise scores. Policymakers hoped that if they set the bar really high, teachers would figure out how to get kids over it. It didn’t work. No Child Left Behind said that all children in American public schools would be proficient by 2014 or their school would be labeled failing. Finally, as 2014 loomed, Arne Duncan had to give states waivers to avoid what would have happened if the law had been enforced: All American public schools would have been declared “failing.”
As we continue to haggle about the cut scores by which we judge our children and their schools, however, there is one thing we almost never consider. What if—instead of punishing the schools where scores are lower and instead of making their children drill harder and attend Saturday cram sessions—we were willing to invest more tax dollars in the lowest scoring schools? What if we made classes smaller to make it possible for teachers to work more personally with each student? What if we made sure that the schools in our poorest communities had well stocked libraries with certified librarians and story-hours once or even twice a week?
Koretz comes to this same conclusion, although he explains it more theoretically: “(I)t is clear that the implicit assumption undergirding the reforms is that we can dramatically reduce the variability of achievement… Unfortunately, all evidence indicates that this optimism is unfounded. We can undoubtedly reduce variations in performance appreciably if we summoned the political will and committed the resources to do so—which would require a lot more than simply imposing requirements that educators reach arbitrary targets for test scores.” (The Testing Charade, p. 131)