Politicians Forget that Cut Scores on Standardized Tests Are Not Grounded in Science

Last week the New York Times’ Dana Goldstein and Manny Fernandez reported on a political fight in Texas over the scoring of the STAAR (the State of Texas Assessments of Academic Readiness), the state’s version of the achievement test each state must still administer every year in grades 3-8 and once in high school.  The federal Every Student Succeeds Act, passed in 2015 to replace No Child Left Behind, still mandates annual testing, although Congress no longer imposes its own high-stakes punishments for failure.

However, Congress does still require the states to submit plans to the U.S. Department of Education declaring what the consequences will be for low-scoring schools.  Goldstein and Fernandez explain that Texas, like many other states, still imposes punishments for the low scorers instead of offering help: “The test, the State of Texas Assessments of Academic Readiness, or STAAR, can have profound consequences not just for students but for schools across the state, hundreds of which have been deemed inadequate and are subject to interventions that critics say are undue.”  Schools have to provide help for students who are not on grade level. Also: “Texas grades its districts on an A through F scale, in part based on how many students are meeting or exceeding grade-level standards… Persistently failing schools, and districts with just a single such school, can be shut down or taken over by the state—a threat facing the state’s largest school system, in Houston.”

Decades of research show that, in the aggregate, standardized test scores correlate with family and neighborhood income. In a country where segregation by race and poverty continues to grow, it is now recognized among experts and researchers that rating and ranking schools and districts by their aggregate test scores merely brands the poorest schools as failing. When sanctions are attached, political regimes of test-based accountability merely punish the schools and the teachers and the students in the poorest places.

In an excellent 2017 book, The Testing Charade: Pretending to Make Schools Better, Harvard professor Daniel Koretz explains the correlation of aggregate standardized test scores with family and community economics: “One aspect of the great inequity of the American educational system is that disadvantaged kids tend to be clustered in the same schools. The causes are complex, but the result is simple: some schools have far lower average scores—and, particularly important in this system, more kids who aren’t ‘proficient’—than others. Therefore, if one requires that all students must hit the proficient target by a certain date, these low-scoring schools will face far more demanding targets for gains than other schools do… Unfortunately… it seems that no one asked for evidence that these ambitious targets for gains were realistic. The specific targets were often an automatic consequence of where the proficient standard was placed and the length of time schools were given to bring all students to that standard, which are both arbitrary.”  (The Testing Charade, pp. 129-130)

Goldstein and Fernandez report that the political fight in Texas this month is about the test scores in third grade reading: “The 2018 STAAR tests found that 58 percent of Texas third graders are not reading at grade level. On the 2017 National Assessment of Educational Progress, given to a sample of fourth graders across the country, 72 percent of Texas students were not proficient in reading—a fact the state has cited as evidence that tough local standards are warranted.”

Like many other states, Texas blames the public schools.  But Goldstein and Fernandez present other factors that ought to be considered here: “More than half of the state’s public school students are Hispanic and nearly 60 percent come from low-income families.  About a fifth are still learning English.”  The state argues that’s all the more reason to set the passing cut score high and motivate schools to catch kids up quicker.

But educators and parents and some politicians in Texas are pushing back. They contend that the bar is set so high that students who are reading at grade level still score below the cut score for proficiency.  There is a lot of discussion of reading passages said to be two grade levels ahead of the students being tested and of something called Lexile measures, which involve the number of syllables in a word and are used to evaluate the difficulty of the passages on the test.

It would clear up a lot of the trouble if more people read Chapter 8, “Making Up Unrealistic Targets,” in Daniel Koretz’s book. Koretz explains that there is nothing really scientific about where “proficient” cut scores are set: “If one doesn’t look too closely, reporting what percentage of students are ‘proficient’ seems clear enough. Someone somehow determined what level of achievement we should expect at any given grade—that’s what we will call ‘proficient’—and we’re just counting how many kids have reached that point. This seeming simplicity and clarity is why almost all public discussion of test scores is now cast in terms of the percentage reaching either the proficient standard, or occasionally, another cut score… The trust most people have in performance standards is essential, because the entire educational system now revolves around them. The percentage of kids who reach the standard is the key number determining which teachers and schools will be rewarded or punished.” (The Testing Charade, p. 120)

Koretz explains that standardized test cut scores are not set scientifically. There is no scientific or even magical way of deciding exactly which reading passages every third grader must be able to decode and comprehend, and anyway, students in third grade are not consistent.  Koretz examines several methods used by panels of judges to set the “proficient” level.  He adds that the methods used by different state panels don’t arrive at the same cut scores: “The percentage of kids deemed to be ‘proficient’ sometimes varies dramatically from one method to another.” (The Testing Charade, p. 124)

Goldstein and Fernandez indicate that Texas uses the National Assessment of Educational Progress (NAEP) as an audit test by which to judge how accurately the state sets its own levels of proficiency. When the scores on the STAAR are compared to the scores on the NAEP, politicians in Texas are alarmed because NAEP shows that 72 percent of fourth graders in Texas are not proficient, even worse than the 58 percent of third graders who score below proficient on the STAAR.

But the matter is not as dire as it would appear. The education historian Diane Ravitch served on the National Assessment Governing Board for seven years.  Ravitch explains that the cut scores on the NAEP are set artificially high.  It is much harder to reach the proficient level than what our common understanding of the term “proficient” would lead us to expect: “‘Proficient’ on NAEP does not indicate ‘average’ performance; it is set very high… There are four levels. At the top is ‘advanced.’ Then comes ‘proficient.’ Then ‘basic.’ And last, ‘below basic.’  Advanced is truly superb performance, which is like getting an A+. Among fourth graders, 8% were advanced readers in 2011; 3% of eighth graders were advanced. In reading, these numbers have changed little in the past twenty years…   Proficient is akin to a solid A. In reading, the proportion who were proficient in fourth grade reading rose from 29% in 1992 to 34% in 2011. The proportion proficient in eighth grade also rose from 29% to 34% in those years… Basic is akin to a B or C level performance. Good but not good enough.”

The argument about what different “proficient” levels really mean is old and tired, but we can’t seem to move beyond it. Today we know that the No Child Left Behind Act was aspirational. It was supposed to motivate teachers to work harder to raise scores. Policymakers hoped that if they set the bar really high, teachers would figure out how to get kids over it. It didn’t work.  No Child Left Behind said that all children in American public schools would be proficient by 2014 or their school would be labeled failing. Finally, as 2014 loomed, Secretary of Education Arne Duncan had to give states waivers to avoid what would have happened had the law been enforced: nearly all American public schools would have been declared “failing.”

As we continue to haggle about the cut scores by which we judge our children and their schools, however, there is one thing we almost never consider.  What if—instead of punishing the schools where scores are lower and instead of making their children drill harder and attend Saturday cram sessions—we were willing to invest more tax dollars in the lowest scoring schools?  What if we made classes smaller to make it possible for teachers to work more personally with each student?  What if we made sure that the schools in our poorest communities had well stocked libraries with certified librarians and story-hours once or even twice a week?

Koretz comes to this same conclusion, although he explains it more theoretically: “(I)t is clear that the implicit assumption undergirding the reforms is that we can dramatically reduce the variability of achievement… Unfortunately, all evidence indicates that this optimism is unfounded.  We can undoubtedly reduce variations in performance appreciably if we summoned the political will and committed the resources to do so—which would require a lot more than simply imposing requirements that educators reach arbitrary targets for test scores.” (The Testing Charade, p. 131)

NAEP Scores Flatline, Achievement Gaps Persist. Millions of Children Are Still Left Behind

For almost two decades since the passage of No Child Left Behind, our society has been operating according to an educational policy scheme by which we say we’ve been holding educators accountable. The biennial National Assessment of Educational Progress (NAEP) scores were released this week, however, and while experts are parsing the meaning of a gain or loss of a couple of points at fourth or eighth grade, what is clear is that No Child Left Behind has neither significantly raised student achievement nor closed racial and economic achievement gaps.

For the Washington Post, Moriah Balingit reports: “The gap between high- and low-achieving students widened on a national math and science exam, a disparity that educators say is another sign that schools need to do more to lift the performance of their most challenged students.  Averages for fourth-and eighth-graders on the National Assessment of Educational Progress, also called the Nation’s Report Card, were mostly unchanged between 2015 and 2017.  The exception was eighth-grade reading scores, which rose slightly.  But scores for the bottom 25 percent of students dropped slightly in all but eighth-grade reading.  Scores for the top quartile rose slightly in eighth-grade reading and math.  The slippage among the nation’s lowest-performing students raised concerns among educators and experts….  Peggy G. Carr, associate commissioner for the National Center for Education Statistics, said there were no statistically significant changes when it came to different categories of students.  This means black and Hispanic students continue to trail their white counterparts on the exam. Students from low-income households also performed below the national average, as did special-education students, though they posted significant gains in 2017 compared with two years earlier.”

Unlike state-by-state achievement tests mandated by the 2002 No Child Left Behind and continued under the 2015 Every Student Succeeds Act, the NAEP is given to a representative sampling of students across all the states. Its purpose is to gauge the overall state of public education across the nation, not to compare scores for particular states or schools.  There is no test-prep for the NAEP.

Education Week’s Sarah Sparks summarizes the 2017 results: “Across the board struggling American students are falling behind, while top performers are rising higher.”  This certainly reflects the growing gap noticed by Stanford University sociologist Sean Reardon who, several years ago, used a massive data set to document the consequences of widening economic inequality for children’s outcomes at school. Reardon showed that while in 1970 only 15 percent of families lived in neighborhoods classified as affluent or poor, by 2007, 31 percent of families lived in such neighborhoods. In other words, by 2007 fewer families across America lived in mixed-income communities. Reardon also demonstrated that growing residential inequality has been accompanied by a simultaneous jump in the income-based school achievement gap. The achievement gap between children from families with incomes in the top ten percent and children from families with incomes in the bottom ten percent was 30-40 percent wider among children born in 2001 than among those born in 1975, and twice as large as the black-white achievement gap.

In an Education Week follow-up on the release, also this week, of a special subset of NAEP data comparing the scores of large urban school districts, Sparks declares that over time, “America’s large urban districts have been improving faster than the nation as a whole.” Scores of students in cities of over 250,000 are rising more quickly than the scores of other students, but so slowly that it will take decades for them to catch up if growth continues at the current rate. Basic is the lowest of NAEP’s three achievement levels, while the proficient level is set so that students deemed proficient are achieving somewhat above an average level. Sparks describes the trend of rising scores among urban students: “These gains are a mixed blessing: Urban 4th graders scored on average at the basic level in math and reading. Urban 8th graders scored on average at the basic level in reading and below basic in math. Yet, 27 percent of urban 8th graders scored at or above the proficient level in reading in 2017, up 8 percentage points since 2007.  That’s faster than the 5 percentage-point reading growth for students overall.”

For the Cleveland Plain Dealer, Patrick O’Donnell describes a mixed bag of gains and losses for students in that very poor city: “The major bright spot was in eighth-grade math, where Cleveland had the third-highest increase among cities. That placed Cleveland’s scores ahead of the Baltimore, Detroit, Fresno and Milwaukee districts and in a tie with Shelby County (Memphis), Tenn. The district also mostly held on to a sizeable gain it made in fourth grade reading between 2014 and 2015, the previous NAEP test, falling just a single statistically insignificant point. But fourth grade math and eighth grade reading scores had the worst and third-worst drops out of all tested cities.”

The Detroit Free Press’s Lori Higgins reports discouraging scores in that other very poor Rust Belt city: “In Detroit, students had the worst performance not only among large, urban districts but also compared with all states in fourth- and eighth-grade math, as well as fourth-grade reading.  Detroit shared the bottom spot with Cleveland for eighth-grade reading.”

This year the NAEP was administered online to 80 percent of students, and some have complained that the change may have lowered scores. However, Peggy Carr of the National Center for Education Statistics explained to the Post’s Balingit that scores were formally adjusted to compensate for the online administration of the test and to make the scores comparable with the older paper-and-pencil version: “Research shows digital assessments are tougher for students than paper-and-pencil tests. So, Carr said, her federal center adjusted results so the change in format ‘would not influence the comparisons and trends that we are reporting.’”

The stated purpose of federal policy in education since the passage of No Child Left Behind in 2002 has been to hold schools accountable for raising achievement among the nation’s lowest scoring students and to close achievement gaps. In the meantime, as the teachers in West Virginia, Oklahoma and Kentucky have shown us this month, states have cut funding for education due to the economic recession of 2008 and continued tax slashing across many states.  The Center on Budget and Policy Priorities (CBPP) has documented this trend, with 29 states in 2015 providing less overall funding, adjusted for inflation, than in 2008. In 19 of those states, local school districts also cut funding. Comparing 2018 general fund, per-pupil formula funding in 12 states for which that data is currently available, CBPP reports that Oklahoma, Texas, Kentucky, Alabama, Arizona, West Virginia, Mississippi, Utah, Kansas, Michigan, North Carolina, and Idaho spend considerably less today than they did in 2008.

Nobody traces small changes in NAEP scores to particular causes from school district to school district. Surely, however, three major trends are implicated in the flattening of NAEP scores over time.

  • Our society has not addressed deepening poverty and widening inequality, at a time when growing research demonstrates that family and neighborhood poverty affects children’s achievement at school.
  • Nearly two decades of education policy have focused on punishing public schools (too often the schools in our poorest communities) by closing schools, by firing teachers and principals, by charterizing schools, or by imposing portfolio governance.
  • As school teachers are now exposing, funding in too many places has collapsed below acceptable levels.