U.S. Public Education Is Driven by High-Stakes Testing. Are the Proficiency Cut-Scores Legitimate?

Back in 2005, I worked with members of the National Council of Churches Committee on Public Education and Literacy to develop a short resource, Ten Moral Concerns in the No Child Left Behind Act. While closing achievement gaps seemed an important goal, to us it seemed wrong that—according to an unrelenting year-by-year Adequate Yearly Progress schedule—the law blindly held teachers and schools accountable for raising all children’s test performance to the test score targets set by every state. Children come to school with such a wide range of preparation, and achievement gaps are present when children arrive in Kindergarten.  At that time, we expressed our concern this way:

“Till now the No Child Left Behind Act has neither acknowledged where children start the school year nor celebrated their individual accomplishments. A school where the mean eighth grade math score for any one subgroup grows from a third to a sixth grade level has been labeled “in need of improvement” (a label of failure) even though the students have made significant progress. The law has not acknowledged that every child is unique and that Adequate Yearly Progress (AYP) thresholds are merely benchmarks set by human beings. Although the Department of Education now permits states to measure student growth, because the technology for tracking individual learning over time is far more complicated than the law’s authors anticipated, too many children will continue to be labeled failures even though they are making strides, and their schools will continue to be labeled failures unless all sub-groups of children are on track to reach reading and math proficiency by 2014.”

Of course today we know that the No Child Left Behind Act was supposed to motivate teachers to work harder to raise scores. Policymakers hoped that if they set the bar really high, teachers would figure out how to get kids over it. It didn’t work. No Child Left Behind said that all children would be proficient by 2014 or their school would be labeled failing. Finally, as 2014 loomed closer, Education Secretary Arne Duncan had to give states waivers to avoid what would have happened if the law had been enforced: All American public schools would have been declared “failing.”

Despite the failure of No Child Left Behind,  members of the public, the press, and the politicians across the 50 statehouses who implemented the testing requirements of No Child Left Behind continue to accept the validity of high stakes testing. Politicians, the newspaper reporters and editors who report the scores, and the general public trust the supposed experts who set the cut scores.  That is why states still rank and rate public schools by their test scores and legislators pass laws to punish  low-scoring schools and teachers. That is why on Wednesday this blog commented on Ohio’s plan to expand EdChoice vouchers for students in low-scoring schools and add charters in low-scoring school districts. The list of “failing” schools where students will qualify for vouchers will rise next school year in Ohio from 218 to 475. The list of charter school-eligible districts will grow from 38 to 217.

In response to the continuation of test-and-punish, I’ve been quoting Daniel Koretz’s book, The Testing Charade, on the fact that testing cut scores are arbitrary and the punishments unfair: “One aspect of the great inequity of the American educational system is that disadvantaged kids tend to be clustered in the same schools. The causes are complex, but the result is simple: some schools have far lower average scores—and, particularly important in this system, more kids who aren’t ‘proficient’—than others. Therefore, if one requires that all students must hit the proficient target by a certain date, these low-scoring schools will face far more demanding targets for gains than other schools do… Unfortunately… it seems that no one asked for evidence that these ambitious targets for gains were realistic. The specific targets were often an automatic consequence of where the proficient standard was placed and the length of time schools were given to bring all students to that standard, which are both arbitrary.” (The Testing Charade, pp. 129-130)
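To see why the targets follow automatically from the two arbitrary choices Koretz names (where the standard sits and how long schools get), here is a minimal sketch in Python. It assumes a straight-line path to 100 percent proficiency, which is a simplification and not how every state drew its trajectory. The school names, starting percentages, and 12-year window are hypothetical illustrations, not figures from the book.

```python
def required_annual_gain(pct_proficient_now, years_left, target_pct=100.0):
    """Percentage points a school must add each year on a straight-line path.

    Assumption: equal gains every year until the deadline.
    """
    if years_left <= 0:
        raise ValueError("years_left must be positive")
    return (target_pct - pct_proficient_now) / years_left


# Hypothetical schools, not real data.
schools = {"School A": 30.0, "School B": 80.0}
years_left = 12  # hypothetical window before the deadline

for name, pct in schools.items():
    gain = required_annual_gain(pct, years_left)
    print(f"{name}: {pct:.0f}% proficient now, needs +{gain:.2f} points per year")
```

Under these assumptions the lower-scoring school must improve several times faster than the higher-scoring one, every year, simply because of where it started. Moving the cut score or the deadline changes both numbers without any change in what students know.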

As a blogger, I am not an expert on how test score targets—the cut scores—are set, but Daniel Koretz devotes an entire chapter of his book, “Making Up Unrealistic Targets,” to this subject.  Here is how he begins:  “If one doesn’t look too closely, reporting what percentage of students are ‘proficient’ seems clear enough. Someone somehow determined what level of achievement we should expect at any given grade—that’s what we will call ‘proficient’—and we’re just counting how many kids have reached that point. This seeming simplicity and clarity is why almost all public discussion of test scores is now cast in terms of the percentage reaching either the proficient standard, or occasionally, another cut score… The trust most people have in performance standards is essential, because the entire educational system now revolves around them. The percentage of kids who reach the standard is the key number determining which teachers and schools will be rewarded or punished.”  (The Testing Charade, p. 120)

After emphasizing that benchmark scores are not scientifically set and are in fact all arbitrary, Koretz examines some of the methods. The “bookmark” method, he explains, “hinges entirely on people’s guesses about how imaginary students would perform on individual test items… (P)anels of judges are given a written definition of what a standard like “proficient” is supposed to mean.”  Koretz quotes from Nebraska’s definition of reading comprehension: “A student scoring at the Meets the Standards level generally utilizes a variety of reading skills and strategies to comprehend and interpret narrative and informational text at grade level.” After enumerating some of the specific skills and strategies listed in Nebraska, Koretz adds a qualification to the way Nebraska describes its methodology: “A short digression: the emphasized word generally is very important. One of the problems in setting standards is that students are inconsistent in their performance.” (The Testing Charade, pp. 121-122) (Emphasis in the original.)

Koretz continues: “There is another, perhaps even more important, reason why performance standards can’t be trusted: there are many different methods one can use, and there is rarely a really persuasive reason to select one over the other. For example, another common approach, the Angoff method… is like the bookmark in requiring panelists to imagine marginally proficient students, but in this approach they are not given the order of difficulty of the items or a response probability. Instead panelists have to guess the percentage of imaginary marginally proficient students who would correctly answer every item in the test. Other methods entail examining and rating actual student work, rather than guessing the performance of imaginary students on individual items. Yet other methods hinge on predictions of later performance—for example, in college. There are yet others. This wouldn’t matter if these different methods gave you at least roughly similar results, but they often don’t. The percentage of kids deemed to be ‘proficient’ sometimes varies dramatically from one method to another. This inconsistency was copiously documented almost thirty years ago, and the news hasn’t gotten any better.” (The Testing Charade, pp. 123-124)
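A rough sketch may make the arbitrariness concrete. The Python below assumes the commonly described Angoff aggregation: sum each panelist's per-item guesses, then average across panelists. Koretz's passage does not spell out that step, so treat it as an assumption. The two panels, their ratings, and the student scores are all invented for illustration.

```python
from statistics import mean


def angoff_cut_score(ratings):
    """Return a raw-score cut from panelists' guesses.

    ratings[p][i] is panelist p's guess (0 to 1) of the share of marginally
    proficient students who would answer item i correctly.
    Assumption: sum per panelist, then average across panelists.
    """
    n_items = len(ratings[0])
    if any(len(r) != n_items for r in ratings):
        raise ValueError("every panelist must rate every item")
    return mean(sum(r) for r in ratings)


def percent_proficient(scores, cut):
    """Share of students whose raw score meets or exceeds the cut."""
    return 100.0 * sum(s >= cut for s in scores) / len(scores)


# Invented data: two panels guessing about the same five-item test.
panel_a = [[0.9, 0.8, 0.7, 0.6, 0.5], [0.8, 0.8, 0.6, 0.6, 0.4]]
panel_b = [[0.7, 0.6, 0.5, 0.4, 0.3], [0.6, 0.6, 0.5, 0.3, 0.3]]
scores = [1, 2, 2, 3, 3, 3, 4, 4, 5, 5]  # raw scores of ten students

for label, panel in (("Panel A", panel_a), ("Panel B", panel_b)):
    cut = angoff_cut_score(panel)
    share = percent_proficient(scores, cut)
    print(f"{label}: cut = {cut:.2f}, proficient = {share:.0f}%")
```

The same ten students produce very different "percent proficient" figures depending only on which panel's guesses are used, which is the kind of inconsistency Koretz describes across methods.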

Koretz continues his warning: “However, setting the standards themselves is just the beginning. What gives the performance standards real bite is their translation into concrete targets for educators, which depends on more than the rigor of the standard itself. We have to say just who has to reach the threshold. We have to say how quickly performance has to increase—not only overall but for different types of kids and schools. A less obvious but equally important question is how much variation in performance is acceptable… A sensible way to set targets would be to look for evidence suggesting how rapidly teachers can raise achievement by legitimate means—that is, by improving instruction, not by using bad test prep, gaming the system, or simply cheating… However, the targets in our test-based accountability systems have often required unremitting improvements, year after year, many times as large as any large-scale change we have seen.” (The Testing Charade, pp. 125-126)

Koretz concludes: “(I)t is clear that the implicit assumption undergirding the reforms is that we can dramatically reduce the variability of achievement… Unfortunately, all evidence indicates that this optimism is unfounded.  We can undoubtedly reduce variations in performance appreciably if we summoned the political will and committed the resources to do so—which would require a lot more than simply imposing requirements that educators reach arbitrary targets for test scores.” (The Testing Charade, p. 131)

4 thoughts on “U.S. Public Education Is Driven by High-Stakes Testing. Are the Proficiency Cut-Scores Legitimate?”

  1. We don’t know how to argue with numbers. The damage done to learning (and far too many children) by the implementation of test and punish reforms is only recently beginning to be understood. I would like to echo and extend the importance of this post and the efforts of Daniel Koretz. I offer these thoughts as a former assistant commissioner in a state department of education where my work included oversight of the state’s assessment program and the development of the state’s response to NCLB.
    In this morning’s piece you quote Daniel Koretz’s work, including the following: “Unfortunately… it seems that no one asked for evidence that these ambitious targets for gains were realistic.” While it may seem a small distinction, I believe that it would be appropriate to add the word “valid” to Koretz’s choice of “realistic.”
    A number of states employ assessment/statistics experts to help ensure the quality of their assessments and the proper use of the results. But what happens if the work of such groups is compromised by poor choices by the funding agency? Or undue influence by large testing contractors? Or manipulation to advance political agendas?

    While the process used in states to determine scores may be statistically reliable and perhaps useful in comparing scores from school to school, the process breaks down, as Koretz accurately describes, when we assign descriptors of achievement levels such as proficient or non-proficient. Koretz correctly points out that the process used to assign such terms is highly flawed and continues to contribute to the overly simplistic judgments by the press and the public about the quality of our public schools and the challenges facing them.

    Not addressed, but perhaps more troubling than the arbitrary understanding or definition of terms such as proficiency, is the manipulation of these “cut scores” to achieve political goals, i.e., instances where state boards of education, state departments of education or politicians seek to alter cut scores to reach “acceptable” pass/fail rates.

    • Thanks so much for adding your clarifications to the discussion. The process used in Delaware at the outset of NCLB reflects what you described. There is a foundational problem that is too often ignored: standardized tests that are used by most DOEs are designed specifically to generate normal distribution curves so that psychometricians can perform their statistical magic, but these tests are NOT designed to measure mastery of learning objectives. “Mastery” tests, in a world where what is tested is what is taught based upon specific outcomes, generate J-curves, which is what every teacher, and cumulatively, every school should be aiming for. This is NOT the early Twentieth Century, when reformers, enthralled by normal distribution curves, decided that schools should be used to test kids. That century-old and dysfunctional mentality is alive and well in DOEs across the country and is continuing to hurt students and teachers because it is impossible to master anything based on a bell curve. Politicians and media types still do not understand this, and until they do, we will continue, as a profession, to spin our wheels.
