We Must Renew Efforts to End High-Stakes “Test and Punish” in U.S. Public Schools

As an opponent of federally mandated high-stakes standardized tests in the public schools, I have been worrying that, after educators were unsuccessful last year in pressing Education Secretary Miguel Cardona to stop the testing for the 2020-2021 school year during the pandemic, many opponents of test-based accountability have pretty much stopped pushing back on the testing.

In a column last week for Education Week, Rick Hess worries that supporters of high-stakes testing are also struggling. Rick Hess is the “public school accountability hawk” scholar-in-residence at the American Enterprise Institute. He writes: “During the pandemic, I’ve talked to a lot of educational leaders and advocates who believe in the importance of testing and school accountability—but feel like they’re swimming upstream in their efforts to maintain support for these issues. I’ve been struck at how tough many of them have found it to navigate the shifting political currents.”

If advocates on both sides of the school accountability debate are worried that COVID has drawn the public’s attention away from the effects of standardized testing on public education, it seems like a good time to renew advocacy for eliminating annual testing as the driving force in our public schools.

Hess’s subject in his recent column is the federally required administration of standardized achievement tests every year for all students in grades 3-8 and once in high school. The policy was put in place in 2002 by No Child Left Behind and continued in 2015 when Congress passed the Every Student Succeeds Act. For two decades, proponents like Hess have described testing’s goal as holding schools accountable by imposing sanctions on the schools unable quickly to raise the aggregate test scores of their student populations.

Hess acknowledges more problems with standardized testing than I would have expected: “I suspect the current struggles are healthy—they’re a reminder of how much the momentum and machinery of the Clinton-Bush-Obama era allowed testing advocates to coast. Backed by federal mandates, huge foundation dollars, and media allies, they talked in sweeping assertions about the importance of testing and accountability. They’d insist that testing was the key to leaving no child behind… That reading and math tests revealed achievement gaps and that this was crucial to closing them. That the right standards would provide a foundation for the right tests, permitting complex teacher and school evaluation systems to drive system improvement… (T)esting has real shortcomings. State tests aren’t designed to improve instruction. The results don’t come back for months, and parents don’t get any actionable feedback from them.”

Despite his complaints about big problems with test-based school accountability, however, Hess continues to believe that advocates must strengthen and improve their advocacy for continuing annual high-stakes testing: “Testing and accountability advocates can no longer count on being carried forward by powerful political patrons or deep-pocketed foundations. And, after multiple years of pandemic waivers, they can no longer count on Washington ordering states to hold the line. This should serve as a call to think anew about how to make the case for testing… It’s an opportunity to revisit how to ensure testing really is serving the needs of students, parents, and educators—and learn how to explain that in a distrustful era.”

The problem with Hess’s argument is that he fails to show that high-stakes testing accomplishes any positive purpose, and he neglects to identify much of the damage thrust upon our schools and our society by “test and punish” school accountability.

Making the strongest case against annual standardized testing is Daniel Koretz, the Harvard University expert on the construction of standardized tests and their uses at school. Koretz’s book, The Testing Charade: Pretending to Make Schools Better, written for a wide audience, is the most important book examining how high-stakes testing has wrecked our public schools. Koretz cites something called Campbell’s Law to explain what No Child Left Behind brought us twenty years ago: “The more any quantitative social indicator is used for social decision making, the more subject it will be to corruption pressures and the more apt it will be to distort and corrupt the social processes it is intended to monitor… Achievement tests may well be valuable indicators of… achievement under conditions of normal teaching aimed at general competence. But when test scores become the goal of the teaching process, they both lose their value as indicators of educational status and distort the educational process in undesirable ways.” (The Testing Charade, pp. 38-39)

Koretz explains what happened to teaching and learning when policymakers attached high stakes to achievement tests that had been designed simply to measure what students are learning. The new purpose was accountability: creating consequences for the schools and the teachers in schools where scores failed to rise quickly. There are a number of ways high-stakes testing narrows the curriculum: “(T)he tested samples of content and skills are not fully representative, either of the goals of schooling broadly or of student achievement more narrowly… (H)igh-stakes testing creates strong incentives to focus on the tested sample rather than the domain it is intended to represent.” (The Testing Charade, pp. 16-19)

Federally mandated high-stakes testing in U.S. public schools focuses only on math and reading: “The often unspoken premise of the reformers was that somehow… other subjects, such as history, civics, art, and music, aspects of math and reading that are hard to measure with standardized tests, and ‘softer’ things such as engaging instruction, love of learning, and ability to work in groups—would somehow take care of itself. It didn’t, and that shouldn’t have surprised anyone. The second reason for the failure is that the system is very high-pressure… Narrowness and high pressure are a very potent combination… A third critical failure of the reforms is that they left almost no room for human judgment. Teachers are not trusted to evaluate students or each other, principals are not trusted to evaluate teachers, and the judgment of professionals from outside the school has only a limited role. What the reformers trust is ‘objective’ standardized measures. This was not accidental.” (The Testing Charade, pp. 32-33)

Koretz explains how schools and school districts discovered ways to inflate their scores through test prep and drilling on the material that predictably appears on the tests year after year. But test prep hasn’t been the only consequence. Sometimes schools held struggling middle school students back a grade to prevent their being tested on the high school test. Sometimes teachers were caught providing students with the answers on the tests and in some places teachers were found to have erased and changed students’ answers on the tests. One instance of outright cheating happened in Washington, D.C. under Michelle Rhee, and in Atlanta, the superintendent and many educators were indicted.

Koretz explains that the high-stakes testing regime was particularly punitive for the schools serving the poorest children: “One aspect of the great inequity of the American educational system is that disadvantaged kids tend to be clustered in the same schools. The causes are complex, but the result is simple: some schools have far lower average scores…. Therefore, if one requires that all students must hit the proficient target by a certain date, these low-scoring schools will face far more demanding targets for gains than other schools do. This was not an accidental byproduct of the notion that ‘all children can learn to a high level.’ It was a deliberate and prominent part of many of the test-based accountability reforms…. Unfortunately… it seems that no one asked for evidence that these ambitious targets for gains were realistic. The specific targets were often an automatic consequence of where the Proficient standard was placed and the length of time schools were given to bring all students to that standard, which are both arbitrary.” (The Testing Charade, pp. 129-130)
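The arithmetic behind Koretz’s point can be made concrete with a minimal sketch. Everything in it is an illustrative assumption rather than data from the book: the function name is hypothetical, the two starting proficiency rates are invented, and the 12-year window is only a rough stand-in for the span between the 2002 law and its 2014 deadline for universal proficiency. The sketch simply divides the distance to the target by the years allowed.

```python
def required_annual_gain(current_pct_proficient, target_pct=100.0, years=12):
    """Percentage points a school must gain per year to hit the target on time.

    Hypothetical helper for illustration only; the 12-year default is an
    assumed window, not a figure taken from The Testing Charade.
    """
    return (target_pct - current_pct_proficient) / years


# Two invented schools: one starting at 30% proficient, one at 80%.
low_scoring_school = required_annual_gain(30.0)
high_scoring_school = required_annual_gain(80.0)

print(f"Low-scoring school must gain {low_scoring_school:.2f} points per year")
print(f"High-scoring school must gain {high_scoring_school:.2f} points per year")
```

Under these assumed numbers, the school that starts far below the Proficient line must improve several times faster than the school that starts near it, even though both face the same deadline. Moving the Proficient cut score or the deadline changes every school’s target automatically, which is why Koretz calls the targets arbitrary.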

What about the effects of high-stakes testing in society beyond the classroom? No Child Left Behind imposed federal punishments by requiring that staffs at low-scoring schools be reconstituted by firing principals and half the staff, or by requiring that schools be charterized, privatized, or shut down. Education Secretary Arne Duncan used Race to the Top to force states to tie teachers’ evaluations to students’ test scores. In 2015, Congress replaced No Child Left Behind with the Every Student Succeeds Act and stopped imposing federally established harsh sanctions, but ESSA continues, in 2022, to require that every year all states submit plans embodying sanctions to hold the lowest-scoring five percent of public schools accountable.

Here are some of the broader effects of ESSA. Today the federal government continues to require states to rank and rate schools based primarily on standardized test scores. The ranking and rating of schools brands low-scoring school districts, usually the districts serving concentrations of poor children, as “failing” and drives middle class flight to wealthier exurbs, thereby accelerating racial and economic segregation. Some states continue to take over low-scoring schools and school districts and turn these districts over to appointed overseers or commissions. School districts continue to shut down low-scoring schools. Many states locate charter schools and grant voucher eligibility in low-scoring school districts. And even though researchers have demonstrated that students’ test scores are an unreliable and invalid way to evaluate teachers, and even though the federal government no longer requires states to use test scores for teacher evaluation, many states haven’t taken the trouble to repeal policies that evaluate teachers by their students’ scores. Many states continue to hold students back in third grade if their reading scores are low, and some states base high school graduation on the state test even for students who have successfully completed all of their required courses.

Rick Hess calls on proponents of high-stakes standardized testing “to think anew about how to make the case for testing.”  I call on opponents of standardized testing to present the reams of academic research documenting the damage wrought by federally mandated, test-based school accountability and to intensify pressure for the elimination of high-stakes testing in U.S. public schools.

Even Though ESSA Dropped the Requirement, 34 States Still Evaluate Schoolteachers by Students’ Test Scores

Chalkbeat’s Matt Barnum reports this week that 9 of the 43 states that adopted the use of students’ standardized test scores to evaluate teachers have stopped using students’ scores for teacher evaluation. This is an important development because a substantial body of research has shown that students’ scores are unreliable as a measure of the quality of a teacher. But too many states are still evaluating their teachers with unreliable algorithms based on students’ test scores.

Barnum reminds us about the history of using students’ standardized test scores to evaluate teachers: “The push to remake teacher evaluations was jump-started by the Obama administration’s Race to the Top competition, which offered a chance at federal dollars to states that enacted favored policies—including linking teacher evaluation to student test scores… Philanthropies—most notably the Bill and Melinda Gates Foundation—provided support for a constellation of groups pushing these ideas.”

Evaluating teachers by their students’ standardized test scores also became a condition for states to qualify for a No Child Left Behind waiver. After it became apparent that No Child Left Behind was going to declare a majority of schools “failures” because they were not going to be able to meet the law’s rigid schedule, the federal government offered in 2011 to relax some of the law’s most punitive consequences by granting states waivers. But to qualify for a waiver, states had to promise to enact some of Arne Duncan’s pet policies, and using students’ standardized test scores to evaluate schoolteachers was one of them. Education Week explained: “In exchange, states had to agree to set standards aimed at preparing students for higher education and the workforce. Waiver states could either choose the Common Core State Standards, or get their higher education institutions to certify that their standards are rigorous enough. They also must put in place assessments aligned to those standards. And they have to institute teacher-evaluation systems that take into account student progress on state standardized tests, as well as single out 15 percent of schools for turnaround efforts or more targeted interventions.” (Emphasis is mine.)

Barnum explains the impact of these federal requirements: “Between 2009 and 2013, the number of states requiring test scores to be used in teacher evaluations spiked from 15 to 41, including Washington, DC.”

But in 2015, Congress replaced No Child Left Behind with a new federal education law, the Every Student Succeeds Act (ESSA). And the new law was partly shaped by a protest against Arne Duncan’s misguided teacher evaluation scheme. Barnum explains: “The backlash culminated with the 2015 passage of the Every Student Succeeds Act, which explicitly bars future secretaries of education from doing what Obama’s Education Secretary Arne Duncan did—trying to influence how teachers are evaluated.”

At the time, the Washington Post’s Lyndsey Layton described how the new ESSA would specifically stop the U.S. Secretary of Education from intervening in the formulation of state laws by limiting “the legal authority of the education secretary, who would be legally barred from influencing state decisions about academic benchmarks, such as the Common Core State Standards, teacher evaluations and other education policies.”

Barnum outlines many of the problems with the schemes states set up to comply with Arne Duncan’s requirement that, to qualify for a Race to the Top grant or an NCLB waiver, states must judge teachers by students’ scores: “States that complied with federal urging to overhaul their evaluation systems struggled with exactly how to measure teachers’ performance. Classroom observations were usually the biggest factor, with tests playing a key role. But since many teachers do not have a standardized test corresponding to their grade and subject, some districts created new tests or had teachers create their own, raising concerns about overtesting. In other instances, teachers were evaluated in part by student performance in subjects they didn’t teach—the situation for half of New York City teachers in 2016. In many states, the new evaluations debuted just as new academic standards and tests were being implemented, frustrating teachers and their unions who felt they were being held accountable for unfamiliar material without adequate training.”

It became popular to use statistical algorithms called Value Added Measures (VAMs) of student learning rather than merely the aggregate benchmark scores of a teacher’s students as the basis of each teacher’s evaluation. However, the American Statistical Association in 2014 and the American Educational Research Association in 2015 released evidence that calculations trying to measure each teacher’s discrete contribution to her students’ learning were statistically flawed. The American Statistical Association warned: “Research on VAMs has been fairly consistent that aspects of educational effectiveness that are measurable and within teacher control represent a small part of the total variation in student test scores or growth; most estimates in the literature attribute between 1% and 14% of the total variability to teachers… The majority of the variation in test scores is attributable to factors outside of the teacher’s control such as student and family background, poverty, curriculum, and unmeasured influences. The VAM scores themselves have large standard errors, even when calculated using several years of data. These large standard errors make rankings unstable, even under the best scenarios for modeling.”

The problem is that a lot of states continue to use students’ standardized test scores to evaluate teachers. Education Week’s Madeline Will explains: “Now 34 states require student-growth measures in teacher evaluations… Ten states and the District of Columbia dropped the requirement, while two states (Alabama and Texas) added a student-growth requirement during the same time period. Among the states that do still require an objective measure of student growth, eight do not currently require that the state standardized test be the source of the data. Instead, districts can use measures like their own assessments, student portfolios, and student learning objectives to determine teachers’ contribution to student growth….”

The 2015 replacement for No Child Left Behind—the Every Student Succeeds Act—ended the federal policy pushing states to judge teachers by their students’ standardized test scores. It is reprehensible that so many states are still holding on to this kind of discredited teacher evaluation scheme.