Closing Achievement Gaps Will Require Closing Opportunity Gaps Outside of School

Last week this blog highlighted Advocates for Children of New York’s new report documenting that more than 10 percent of the over one million students in the New York City Public Schools—101,000 students—are homeless. These students are living in shelters, doubled up with friends or relatives, or living in cars and parks. What are the academic challenges for these homeless children and other children living in families with minimum wage employment, unemployment, unstable housing, food insecurity and inadequate medical care?

Although federal law continues to require that states measure the quality of schools and school districts with standardized tests, all sorts of research documents that students’ standardized test scores are indicators of their life circumstances and not a good measure of the quality of their public schools. Students concentrated in poor cities or scattered in impoverished and remote rural areas are more likely to struggle academically no matter the quality of their public school.

Here are just two examples of this research.

In 2017, Katherine Michelmore of Syracuse University and Susan Dynarski of the University of Michigan studied data from Michigan to identify the role of economic disadvantage in achievement gaps as measured by test scores: “We use administrative data from Michigan to develop a… detailed measure of economic disadvantage… Children who spend all of their school years eligible for subsidized meals have the lowest scores, whereas those who are never eligible have the highest. In eighth grade, the score gap between these two groups is nearly a standard deviation.” “Sixty percent of Michigan’s eighth graders were eligible for subsidized lunch at least once during their time in public schools. But just a quarter of these children (14% of all eighth graders) were economically disadvantaged in every year between kindergarten and eighth grade… Ninety percent of the test score gap we observe in eighth grade between the persistently disadvantaged and the never disadvantaged is present by third grade.”

In How Schools Really Matter: Why Our Assumption about Schools and Inequality Is Mostly Wrong, Douglas Downey, a professor of sociology at The Ohio State University, describes academic research showing that evaluating public schools based on standardized test scores is unfair to educators and misleading to the public: “It turns out that gaps in skills between advantaged and disadvantaged children are largely formed prior to kindergarten entry and then do not grow appreciably when children are in school.” (How Schools Really Matter, p. 9) “Much of the ‘action’ of inequality therefore occurs very early in life… In addition to the fact that achievement gaps are primarily formed in early childhood, there is another reason to believe that schools are not as responsible for inequality as many think. It turns out that when children are in school during the nine-month academic year, achievement gaps are rather stable. Indeed, sometimes we even observe that socioeconomic gaps grow more slowly during school periods than during summers.” (How Schools Really Matter, p. 28)

In the context of this research, Downey examines the six indicators the Ohio Department of Education uses to evaluate public schools when it releases annual report cards on school performance. Although the state has ceased branding public schools with “A-F” letter grades, Downey explains that the state of Ohio continues to ignore the role of outside-of-school variables in students’ lives when it blames educators and schools for low aggregate test scores:

“The report card for schools is constructed from six indicators and not a single one of them gauges performance independent of the children’s nonschool environments. First is achievement, which is based on the percentage of students who pass state tests… By far, the biggest determinant of whether a school produces high or low test scores is the income level of the students’ families it serves… Second is the extent to which a district closes achievement gaps among subgroups. But performance on this indicator can also be influenced by factors out of the school’s control… Third, schools are gauged by the degree to which the school improved at-risk K-3 readers… Of course, it is much easier to make progress on this indicator if serving children who go home each evening to reinforce the school goals. Fourth, schools are evaluated on their progress, an indicator based on how much growth students exhibit on math and reading tests. This kind of indicator is better than most at isolating how schools matter, but again, growth is easier in schools where students enjoy home environments that also promote learning… Fifth, the graduation rate constitutes a component of the district’s (rating)… but this is only a measure of school quality if the likelihood of a child’s on-time graduation has nothing to do with the stress they experience at home, the access they have to health care, or the quality of their neighborhood. Finally districts are evaluated on whether their students are prepared for success. This indicator gauges the percentage of students at a school viewed as ready to succeed after high school… and is determined by how well the students performed on the ACT or SAT and whether they earned a 3 or higher on at least one AP exam… These report cards ‘are designed to give parents, communities, educators, and policymakers information about the performance of districts and schools,’ but what they really do is mix important factors outside of school with what is going on inside the schools in unknown ways.” (How Schools Really Matter, pp. 115-116)

What these reports and many others demonstrate is that we cannot expect that no child will be left behind merely because Congress passes a law declaring that schools can make every American child post proficient test scores by 2014. No Child Left Behind’s (and now the Every Student Succeeds Act’s) policies—which have branded schools unable quickly to raise aggregate test scores as “failing schools”—have unfairly targeted school districts located in poor communities. In 2017, the Harvard University testing expert Daniel Koretz published The Testing Charade: Pretending to Make Schools Better, in which he shows that ameliorating opportunity gaps in the lives of children is not something schools can accomplish by themselves.

Koretz explains: “One aspect of the great inequity of the American educational system is that disadvantaged kids tend to be clustered in the same schools. The causes are complex, but the result is simple: some schools have far lower average scores—and, particularly important in this system, more kids who aren’t ‘proficient’—than others. Therefore, if one requires that all students must hit the proficient target by a certain date, these low-scoring schools will face far more demanding targets for gains than other schools do. This was not an accidental byproduct of the notion that ‘all children can learn to a high level.’ It was a deliberate and prominent part of many of the test-based accountability reforms… Unfortunately… it seems that no one asked for evidence that these ambitious targets for gains were realistic. The specific targets were often an automatic consequence of where the Proficient standard was placed and the length of time schools were given to bring all students to that standard, which are both arbitrary.” (pp. 129-130) Koretz continues: “(T)his decision backfired. The result was, in many cases, unrealistic expectations that teachers simply couldn’t meet by any legitimate means.” (p. 134)
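To make the arithmetic behind Koretz’s point concrete, here is a minimal sketch. It assumes a straight-line path to 100 percent proficiency over twelve years (roughly 2002 to 2014); the two schools and their starting rates are hypothetical, and real state trajectories under Adequate Yearly Progress did not all follow a straight line. The function name `required_annual_gain` is my own, invented for illustration.

```python
def required_annual_gain(baseline_pct, years, target_pct=100.0):
    """Percentage points a school must gain each year to reach the target.

    Assumes equal gains every year, which is a simplification.
    """
    if years <= 0:
        raise ValueError("years must be positive")
    return (target_pct - baseline_pct) / years


# Assumption: twelve school years between the law's passage and the 2014 deadline.
YEARS_TO_DEADLINE = 12

# Hypothetical starting proficiency rates, chosen only for illustration.
schools = {
    "School A (hypothetical, high poverty)": 30.0,
    "School B (hypothetical, affluent)": 80.0,
}

for name, baseline in schools.items():
    gain = required_annual_gain(baseline, YEARS_TO_DEADLINE)
    print(f"{name}: {gain:.1f} percentage points per year")
```

Under these assumptions the school that starts at 30 percent must improve three and a half times as fast as the school that starts at 80 percent, every year, simply because of where it began.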

Nobody Should Be Wasting Time Worrying About When to Administer Standardized Tests

Parents, children, teachers, principals, and school superintendents are living through a time of unknowns. COVID-19 is raging across the states with many public schools operating only online. Some public schools, which have been able to open in person or on hybrid schedules, have subsequently been forced to close already reopened buildings or specific classrooms as COVID-19 cases arise and everybody quarantines.

In the midst of a chaotic situation with no good and stable solutions for many public schools, suddenly last week everybody started worrying about what to do about this year’s standardized tests. The Washington Post’s Perry Stein reports that outgoing Secretary of Education Betsy DeVos postponed the winter administration of the National Assessment of Educational Progress, the one test administered across all the states, the test that tracks school achievement over the decades and is not distorted by high stakes consequences.

Representative Bobby Scott (D-VA) and Senator Patty Murray (D-WA), the Democratic leaders of the House and Senate education committees, agreed to delay the NAEP but said the nation needs some kind of measure of learning loss during the pandemic. They released a statement declaring that annual state tests mandated under the Every Student Succeeds Act must surely be administered: “Existing achievement gaps are widening for our most vulnerable students, including students from families with low incomes, students with disabilities, English learners, and students of color. In order for our nation to recover and rebuild from the pandemic, we must first understand the magnitude of learning loss that has impacted students across the country. That cannot happen without assessment data.”

While I frequently agree with Representative Scott and Senator Murray, I think worrying about standardized testing right now ought to be a low priority, and I think the state-by-state achievement tests mandated by the Every Student Succeeds Act are the wrong kind of test. Neither do I believe that the mandated, annual state achievement tests are necessary to help teachers grasp their students’ learning needs during and following the widespread school closures and disruptions in the current school year. Our schoolteachers are well-trained professionals who are prepared to develop their students’ reading comprehension skills, to track problems with computational skills and mathematical conceptualization, and to help support their students emotionally after a period of disruption. The emphasis right now and when children return to classrooms must be supporting teachers facing the complex challenge of serving children who have been out of the classroom for too long. Standardized test scores very often don’t even arrive at schools for months after the tests are administered; they play little role in supporting teachers’ capacity to discern their students’ learning gains or losses.

If we are looking for complex data about the impact of the pandemic on public schools across communities and across states, at some point it will be realistic for the National Center for Education Statistics again to administer the National Assessment of Educational Progress, which is designed as a national audit test to determine learning trends over time.  When it is practical to administer NAEP, certainly that test should happen.

The annual standardized tests, mandated first by No Child Left Behind and, since 2015, by the Every Student Succeeds Act, are designed for an entirely different purpose. And ironically the purpose and use of these tests for holding schools accountable distorts the results as schools struggle to raise scores at any cost in order to avoid the high stakes punishments that Congress attached to these tests or forced the states to attach. What are these high stakes? States still have to submit to the U.S. Department of Education plans for how to turn around their lowest performing schools according to these tests. Some states still evaluate teachers according to their students’ scores. States rate and rank particular schools and school districts according to their aggregate test scores. Many states publish these rankings, which encourages real estate redlining as well as racial and economic segregation across metropolitan areas. Other states place voucher programs or charter schools in school districts where scores are low. Some states take over low scoring schools and school districts and turn them over to appointed commissions that supplant locally elected school boards. Some school districts have claimed to use school closure as a so-called turnaround plan.

In a profound 2017 book, The Testing Charade: Pretending to Make Schools Better, Daniel Koretz, a Harvard University expert on standardized testing, documents research exposing flaws in the entire strategy of No Child Left Behind, which combined standardized testing with high stakes punishments for schools unable quickly to raise students’ test scores. Koretz explains social scientist Don Campbell’s well-known theory describing the universal human response when high stakes are tied to a quantitative social indicator.  In this case, the social indicator is whether or not educators and particular schools can produce higher aggregate student test scores year after year:

“The more any quantitative social indicator is used for social decision making, the more subject it will be to corruption pressures and the more apt it will be to distort and corrupt the social processes it is intended to monitor… Achievement tests may well be valuable indicators of… achievement under conditions of normal teaching aimed at general competence. But when test scores become the goal of the teaching process, they both lose their value as indicators of educational status and distort the educational process in undesirable ways.” (The Testing Charade, pp. 38-39)

Koretz shows that imposing high stakes punishments on schools and educators unable quickly to raise students’ scores inevitably produces reallocation of instruction to what is being tested, causes states eventually to lower standards, and causes some schools quietly to exclude from testing the students likely to fail. Under No Child Left Behind, the high stakes even led to abject cheating—as happened in Atlanta under Superintendent Beverly Hall.

What all this means is that the state achievement tests mandated by No Child Left Behind and the Every Student Succeeds Act—whether administered to students this year or put off until after vaccines are widely available and students return to their classrooms—are not an appropriate tool for measuring the long term impact of the pandemic on students’ lives and learning.

Ideological advocacy for holding public schools accountable drove the passage and implementation of the original No Child Left Behind Act. The idea was that educators can be motivated to work harder through fear if their schools are threatened with punishments.  The idea of attaching high stakes consequences for low test scores remains with us today. Last week Chester E. Finn, Jr., formerly of the Thomas B. Fordham Institute and now affiliated with the Hoover Institution, published a widely read column in the Washington Post.  Twenty years ago, Finn strongly promoted No Child Left Behind’s test-and-punish strategy, and clearly he continues to believe in using high stakes testing as a threat. Here is a paragraph from his recent column that Finn could easily have cut, pasted, and slightly updated from something he wrote back in 2001:

“The results from those state assessments are the main source of information about school performance and about pupil learning in the core subjects of the K-12 curriculum. The results also indicate whether America’s appalling — and persistent — achievement gaps are getting any narrower. These student statewide test results are the foundation of a school-performance measurement structure that the United States has been painstakingly constructing in the decades since being declared “A Nation at Risk” in 1983. The information from the tests is used at every level of the system. It enables parents to see how their children are faring on an “external” metric, beyond the grades conferred by their teachers, and it helps principals assess how their schools are doing. The results also equip superintendents to gauge what must be done to boost district-wide achievement, and they furnish state officials with the information needed to guide their assistance and interventions.”

Today, nearly two decades after the states were mandated to administer annual standardized tests and after No Child Left Behind imposed sanctions on the schools with the lowest scores, we know that the whole scheme failed to support children’s school achievement and failed to close achievement gaps. Some schools were charterized as a punishment; other schools were shut down; principals and teachers were fired. And scores on the national audit test, the National Assessment of Educational Progress (the NAEP), have fallen in some cases and in other cases remained flat.

I believe it is unnecessary—in the midst of a raging pandemic and a Presidential transition—to worry about when the federal government will mandate widespread standardized testing.  The bigger question is whether and how the federal government will manage a plan to get the pandemic under control and provide enough support to help states and school districts get all children and adolescents back in school.

I agree with Diane Ravitch, who explains: “Resumption of standardized testing is completely ridiculous in the midst of a pandemic. The validity of the tests has always been an issue; their validity in the midst of a national crisis will be zero. They will show, even more starkly, that students who are in economically secure families have higher test scores than those who do not. They will show that children in poverty and children with disabilities have suffered disproportionately due to lack of schooling.  We already know that.  Why put pressure on students and teachers to demonstrate what we already know?  At this point, we don’t even know whether all students will have the advantage of in-person instruction by March.  If anything, we need a thorough review of the value, validity, and reliability of annual standardized testing, a practice that is unknown in any high-performing nation in the world.  We are choking on the rotten fumes of No Child Left Behind, Race to the Top, and the Every Student Succeeds Act.”

U.S. Public Education Is Driven by High-Stakes Testing. Are the Proficiency Cut-Scores Legitimate?

Back in 2005, I worked with members of the National Council of Churches Committee on Public Education and Literacy to develop a short resource, Ten Moral Concerns in the No Child Left Behind Act. While closing achievement gaps seemed an important goal, to us it seemed wrong that—according to an unrelenting year-by-year Adequate Yearly Progress schedule—the law blindly held teachers and schools accountable for raising all children’s test performance to the test score targets set by every state. Children come to school with such a wide range of preparation, and achievement gaps are present when children arrive in Kindergarten.  At that time, we expressed our concern this way:

“Till now the No Child Left Behind Act has neither acknowledged where children start the school year nor celebrated their individual accomplishments. A school where the mean eighth grade math score for any one subgroup grows from a third to a sixth grade level has been labeled a “in need of improvement” (a label of failure) even though the students have made significant progress. The law has not acknowledged that every child is unique and that Adequate Yearly Progress (AYP) thresholds are merely benchmarks set by human beings. Although the Department of Education now permits states to measure student growth, because the technology for tracking individual learning over time is far more complicated than the law’s authors anticipated, too many children will continue to be labeled failures even though they are making strides, and their schools will continue to be labeled failures unless all sub-groups of children are on track to reach reading and math proficiency by 2014.”

Of course today we know that the No Child Left Behind Act was supposed to motivate teachers to work harder to raise scores. Policymakers hoped that if they set the bar really high, teachers would figure out how to get kids over it. It didn’t work. No Child Left Behind said that all children would be proficient by 2014 or their school would be labeled failing. Finally, as 2014 loomed, Arne Duncan had to give states waivers to avoid what would have happened had the law been enforced: All American public schools would have been declared “failing.”

Despite the failure of No Child Left Behind,  members of the public, the press, and the politicians across the 50 statehouses who implemented the testing requirements of No Child Left Behind continue to accept the validity of high stakes testing. Politicians, the newspaper reporters and editors who report the scores, and the general public trust the supposed experts who set the cut scores.  That is why states still rank and rate public schools by their test scores and legislators pass laws to punish  low-scoring schools and teachers. That is why on Wednesday this blog commented on Ohio’s plan to expand EdChoice vouchers for students in low-scoring schools and add charters in low-scoring school districts. The list of “failing” schools where students will qualify for vouchers will rise next school year in Ohio from 218 to 475. The list of charter school-eligible districts will grow from 38 to 217.

In response to the continuation of test-and-punish, I’ve been quoting Daniel Koretz’s book, The Testing Charade, about the fact that testing cut scores are arbitrary and punishments unfair: “One aspect of the great inequity of the American educational system is that disadvantaged kids tend to be clustered in the same schools. The causes are complex, but the result is simple: some schools have far lower average scores—and, particularly important in this system, more kids who aren’t ‘proficient’—than others. Therefore, if one requires that all students must hit the proficient target by a certain date, these low-scoring schools will face far more demanding targets for gains than other schools do… Unfortunately… it seems that no one asked for evidence that these ambitious targets for gains were realistic. The specific targets were often an automatic consequence of where the proficient standard was placed and the length of time schools were given to bring all students to that standard, which are both arbitrary.” (The Testing Charade, pp. 129-130)

As a blogger, I am not an expert on how test score targets—the cut scores—are set, but Daniel Koretz devotes an entire chapter of his book, “Making Up Unrealistic Targets,” to this subject.  Here is how he begins:  “If one doesn’t look too closely, reporting what percentage of students are ‘proficient’ seems clear enough. Someone somehow determined what level of achievement we should expect at any given grade—that’s what we will call ‘proficient’—and we’re just counting how many kids have reached that point. This seeming simplicity and clarity is why almost all public discussion of test scores is now cast in terms of the percentage reaching either the proficient standard, or occasionally, another cut score… The trust most people have in performance standards is essential, because the entire educational system now revolves around them. The percentage of kids who reach the standard is the key number determining which teachers and schools will be rewarded or punished.”  (The Testing Charade, p. 120)

After emphasizing that benchmark scores are not scientifically set and are in fact all arbitrary, Koretz examines some of the methods. The “bookmark” method, he explains, “hinges entirely on people’s guesses about how imaginary students would perform on individual test items… (P)anels of judges are given a written definition of what a standard like “proficient” is supposed to mean.”  Koretz quotes from Nebraska’s definition of reading comprehension: “A student scoring at the Meets the Standards level generally utilizes a variety of reading skills and strategies to comprehend and interpret narrative and informational text at grade level.” After enumerating some of the specific skills and strategies listed in Nebraska, Koretz adds a qualification to the way Nebraska describes its methodology: “A short digression: the emphasized word generally is very important. One of the problems in setting standards is that students are inconsistent in their performance.” (The Testing Charade, pp. 121-122) (Emphasis in the original.)

Koretz continues: “There is another, perhaps even more important, reason why performance standards can’t be trusted: there are many different methods one can use, and there is rarely a really persuasive reason to select one over the other. For example, another common approach, the Angoff method… is like the bookmark in requiring panelists to imagine marginally proficient students, but in this approach they are not given the order of difficulty of the items or a response probability. Instead panelists have to guess the percentage of imaginary marginally proficient students who would correctly answer every item in the test. Other methods entail examining and rating actual student work, rather than guessing the performance of imaginary students on individual items.  Yet other methods hinge on predictions of later performance—for example, in college. There are yet others. This wouldn’t matter if these different methods gave you at least roughly similar results, but they often don’t.  The percentage of kids deemed to be ‘proficient’ sometimes varies dramatically from one method to another.  This inconsistency was copiously documented almost thirty years ago, and the news hasn’t gotten any better.” (The Testing Charade, pp.123-124)
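For readers who want to see how much rides on the judges’ guesses, here is a minimal sketch of one common textbook form of the Angoff calculation: each judge’s item-by-item guesses are summed into a cut score, and the judges’ cut scores are averaged. Everything in it is hypothetical: the five-item test, the ten student scores, the two panels, and the function names are invented for illustration and are not drawn from Koretz or from any state’s procedure. It compares two panels using the same method rather than two different methods, but it shows the same sensitivity Koretz describes.

```python
def angoff_cut_score(panel_ratings):
    """Average, across judges, of each judge's summed item ratings.

    panel_ratings holds one list per judge. Each entry is that judge's guess,
    in percent, of how many marginally proficient students would answer the
    item correctly.
    """
    total_percent = sum(sum(ratings) for ratings in panel_ratings)
    return total_percent / (100 * len(panel_ratings))


def percent_proficient(scores, cut_score):
    """Share of students, in percent, scoring at or above the cut score."""
    return 100 * sum(score >= cut_score for score in scores) / len(scores)


# Hypothetical data: a five-item test, ten students, two panels of two judges.
student_scores = [1, 2, 3, 3, 4, 4, 5, 2, 3, 4]
lenient_panel = [[60, 50, 70, 40, 80], [50, 50, 60, 50, 90]]
strict_panel = [[80, 70, 80, 60, 90], [70, 70, 90, 60, 90]]

lenient_cut = angoff_cut_score(lenient_panel)
strict_cut = angoff_cut_score(strict_panel)

print(lenient_cut, percent_proficient(student_scores, lenient_cut))
print(strict_cut, percent_proficient(student_scores, strict_cut))
```

The same ten students look very different depending on which panel’s guesses set the bar, even though nothing about their learning has changed.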

Koretz continues his warning: “However, setting the standards themselves is just the beginning. What gives the performance standards real bite is their translation into concrete targets for educators, which depends on more than the rigor of the standard itself. We have to say just who has to reach the threshold. We have to say how quickly performance has to increase—not only overall but for different types of kids and schools. A less obvious but equally important question is how much variation in performance is acceptable… A sensible way to set targets would be to look for evidence suggesting how rapidly teachers can raise achievement by legitimate means—that is, by improving instruction, not by using bad test prep, gaming the system, or simply cheating… However, the targets in our test-based accountability systems have often required unremitting improvements, year after year, many times as large as any large-scale change we have seen.” (The Testing Charade, pp. 125-126)

Koretz concludes: “(I)t is clear that the implicit assumption undergirding the reforms is that we can dramatically reduce the variability of achievement… Unfortunately, all evidence indicates that this optimism is unfounded.  We can undoubtedly reduce variations in performance appreciably if we summoned the political will and committed the resources to do so—which would require a lot more than simply imposing requirements that educators reach arbitrary targets for test scores.” (The Testing Charade, p. 131)

Rick Hess’s Mistake: Failure of Test-and-Punish Is Not Limited to a Few Districts That Have Disappointed

Frederick M. Hess, the director of education policy studies at the American Enterprise Institute, has always been a corporate education reform kind of guy. That is why Hess’s honest analysis this week of the ultimate fraud of a succession of school district miracles—Washington, D.C.’s test score and graduation rate miracle under Michelle Rhee and those who followed her, Alonzo Crim’s Atlanta in the 1980s, Houston’s Texas Miracle under Rod Paige, Arne Duncan’s Chicago, and Beverly Hall’s Atlanta—is so refreshingly candid.

In all of these cases, as Hess points out, there was “a remarkable dearth of attention paid to ensuring that the metrics (were) actually valid and reliable.”  Second, it was “tempting for civic leaders and national advocates to accept happy success stories at face value—especially when they (were) fronted by a charismatic superintendent.” And finally “reformers and reporters (made) things worse with their lust for ‘celebrity superintendents’ and ‘model systems.’ Their fascination nurtur(ed) an echo chamber in which a handful of leaders (got) exalted, often for too-good-to-be-true results.”

One must give Hess credit for honestly admitting the failure of so much of what his own kind of school reformers have been exalting for the past quarter century—business school accountability for schools, driven by universal standardized testing, and evaluated by two primary outcomes—standardized test scores and graduation rates. But Hess makes a mistake when he attributes the problem to a few “model” school districts that have disappointed.

Hess’s explanation is inadequate.  Inadequate because the system itself—the whole idea of school reform based on high stakes testing—cannot work.  Daniel Koretz, the Harvard specialist on testing, tells us why in a recent book: The Testing Charade: Pretending to Make Schools Better.

Koretz defines the problem with high-stakes-test-based school accountability by exploring a primary principle of social science research. Forty years ago, Don Campbell, “one of the founders of the science of program evaluation,” articulated a core principle now known as “Campbell’s Law”: “The more any quantitative social indicator is used for social decision making, the more subject it will be to corruption pressures and the more apt it will be to distort and corrupt the social processes it is intended to monitor.” (p. 38)

How does Campbell’s Law describe the dilemma Frederick Hess identifies?  Koretz quotes Don Campbell himself describing the distortion that will follow when high stakes consequences are attached to a school district’s capacity to raise its aggregate test scores: “Achievement tests may well be valuable indicators of… achievement under conditions of normal teaching aimed at general competence.  But when test scores become the goal of the teaching process, they both lose their value as indicators of educational status and distort the educational process in undesirable ways.” (p. 39)

In The Testing Charade, Koretz provides extensive evidence about all the ways high stakes tied to test scores have triggered Campbell’s Law—to invalidate the test results themselves and to undermine our education system and the experiences of teachers and students trapped by No Child Left Behind and the Every Student Succeeds Act in a scheme to raise test scores at all costs.

One consequence is score inflation: “All that is required for scores to become inflated is that the sampling used to create a test has to be predictable… For inflation to occur, teachers or students need to capitalize on this predictability, focusing on the specifics of the test at the expense of the larger domain.” (p. 62)  We read about all the ways curriculum designers and teachers are incentivized to focus their classes on the specific elements of any particular academic discipline that have appeared on previous tests.

A second consequence, related to the first, is flat-out test-prep. Test prep narrows what is taught to students to the material that is tested and drills students about using clues in the test itself to come up with the right answers. Koretz identifies three kinds of bad test prep. Reallocation between subjects has been common when schools emphasize No Child Left Behind’s tested subjects—reading and math—and cut back on social studies, the arts, music and recess. Reallocation within subjects is when schools study past years’ versions of the state tests and ask teachers to focus on particular aspects of a subject.  Finally there is coaching. Schools and test-prep companies teach students to respond in a formulaic way to the format of the questions themselves. Koretz explains why all this has implications for educational equity: “Inappropriate test preparation, like score inflation, is more severe in some places than in others. Teachers of high-achieving students have less reason to indulge in bad preparation for high-stakes tests because the majority of their students will score adequately without it—in particular, above the ‘proficient’ cut score that counts for accountability purposes. So one would expect that test preparation would be a more severe problem in schools serving high concentrations of disadvantaged students…. Once again, disadvantaged kids are getting the short end of the stick.” (pp. 116-117)

And a third consequence, demonstrated in every one of Frederick Hess’s examples, is cheating. Koretz examines the biggest cheating scandals, notably Atlanta, Philadelphia, and Washington, DC.  He notes: “Cheating—by teachers and administrators, not by students—is one of the simplest ways to inflate scores, and if you aren’t caught, it’s the most dependable.” Sometimes teachers or administrators erase and change students’ answers; sometimes they provide teachers or students with the test items in advance; other times teachers give students the answers during the test.  And finally, sometimes schools “scrub” off the enrollment rolls the students who are likely to fail.

Koretz presents the questions around cheating by educators as morally fraught. After all, test scores are not simply a proxy for the quality of a school or a school district:  “One aspect of the great inequity of the American educational system is that disadvantaged kids tend to be clustered in the same schools. The causes are complex, but the result is simple: some schools have far lower average scores—and, particularly important in this system, more kids who aren’t ‘proficient’—than others. Therefore, if one requires that all students must hit the proficient target by a certain date, these low-scoring schools will face far more demanding targets for gains than other schools do. This was not an accidental byproduct of the notion that ‘all children can learn to a high level.’ It was a deliberate and prominent part of many of the test-based accountability reforms… Unfortunately… it seems that no one asked for evidence that these ambitious targets for gains were realistic. The specific targets were often an automatic consequence of where the Proficient standard was placed and the length of time schools were given to bring all students to that standard, which are both arbitrary.” (pp. 129-130)

In a system that, by its very structure, is guaranteed to trigger Campbell’s Law, Koretz wonders about the moral implications of cheating: “Just who is responsible?  Is it just the people who actually carry out the fraud or require it?  Or are those who create the pressures to cheat also culpable, even if not criminally?” (p. 91)

Like Frederick Hess, Daniel Koretz recognizes that although outcomes-based, test-and-punish school accountability has been hyped and celebrated, ultimately this kind of school policy has not improved schools as promised.  Koretz digs deeper, however, to expose that the system itself—not merely its abuse by particular educators in particular school districts—is deeply flawed.

Koretz concludes: “It is no exaggeration to say that the costs of test-based accountability have been huge. Instruction has been corrupted on a broad scale. Large amounts of instructional time are now siphoned off into test-prep activities that at best waste time and at worst defraud students and their parents.  Cheating has become widespread. The public has been deceived into thinking that achievement has dramatically improved and that achievement gaps have narrowed. Many students are subjected to severe stress… The primary benefit we received in return for all of this was substantial gains in elementary-school math that don’t persist until graduation… On balance, then, the reforms have been a failure.” (pp. 191-192)

The Problems of Outcomes-Based School Accountability

I am so tired of the narrative of “failing” schools—a story which is always accompanied by the story of “failing” teachers and their “failing” students. I find myself trapped in arguments about this subject in places where I don’t want to be talking about it—with good friends and relatives around dinner tables, at parties, during intermissions at concerts.  And even though I know a lot about the topic, I can never really win the argument, because the people with whom I am discussing it have always read about it in the newspapers where the test score comparisons are published.  This narrative has no reference whatsoever to what is happening in particular classrooms or particular schools or school districts. Many people with strong opinions have not been in a public school for decades.

The real subject here, of course, is what education is.  But the conversation instead is always a comparison of test scores as a proxy for the quality of a community and its schools.  One wants to get at the real meaning and purpose of outcomes-based, test-measured school accountability, but that is hard to do in a casual conversation.  And underneath any conversation about “failing” schools are lots of realities about segregation—by class and also by race.

Research has documented growing economic inequality and segregation by family income. Sean Reardon, a Stanford University sociologist, used a massive data set to document the consequences of widening economic inequality for children’s outcomes at school. Reardon showed that while in 1970, only 15 percent of families lived in neighborhoods classified as affluent or poor, by 2007, 31 percent of families lived in such neighborhoods. By 2007, fewer families across America lived in mixed income communities. Reardon also demonstrated that growing residential inequality has been accompanied by a simultaneous jump in an income-inequality school achievement gap. The achievement gap between the children with income in the top ten percent and the children with income in the bottom ten percent was 30-40 percent wider among children born in 2001 than those born in 1975, and twice as large as the black-white achievement gap.

Then there is segregation by race.  Recently I had occasion to revisit a 2014 article by Richard Rothstein on the long-term effects of racism in our caste society: “Even for low-income families, other groups’ disadvantages—though serious—are not similar to those faced by African Americans. Although the number of high-poverty white communities is growing (many are rural)… poor whites are less likely to live in high poverty neighborhoods than poor blacks.  Nationwide, 7 percent of poor whites live in high-poverty neighborhoods, while 23 percent of poor blacks do so. Patrick Sharkey’s Stuck in Place showed that multigenerational concentrated poverty remains an almost uniquely black phenomenon; white children in poor neighborhoods are likely to live in middle-class neighborhoods as adults, whereas black children in poor neighborhoods are likely to remain in such surroundings as adults.  In other words, poor whites are more likely to be temporarily poor, while poor blacks are more likely to be permanently so…. Certainly, Hispanics suffer discrimination, some of it severe… but the undeniable hardship faced by recent, non-English speaking, unskilled, low-wage immigrants is not equivalent to blacks’ centuries of lower-caste status. The problems are different, and the remedies must also be different….”

Our public schools across America are situated in very different communities—small towns of all sorts, small cities, big cities, poor neighborhoods, rich neighborhoods—schools whose children speak English and other schools where for many children, English is not the primary language. Within all this diversity, however, is the reality of segregation by race, and according to Reardon, growing segregation by family income.  In more and more places across America, children live in pockets of extreme poverty or pockets of extreme affluence.   While teachers can work with all the outside-of-school variables the children bring to their classrooms, including intensifying segregation by income, there is much of the experience of each child that schoolteachers cannot control. Children are neither blank slates nor empty vessels into which knowledge can be poured.

On Sunday morning, the subject of “failing” schools and “failing” teachers and “failing” students arrived on my doorstep in Patrick O’Donnell’s Plain Dealer article about what key Ohio legislators believe is dangerous: that too many students graduated from high school this year because of “soft” alternative pathways to graduation.  These alternative pathways were only for the 2018 school year—because educators successfully lobbied that the new graduation tests were so hard that all sorts of young people would be denied graduation.  O’Donnell tells us the educators’ fears were well grounded: “More than a third of this spring’s high school graduates from some urban areas would never have received their diplomas under Ohio’s new graduation requirements, were it not for some temporary and easier ‘pathways’ added to avert a statewide graduation ‘crisis.’ In Akron and Columbus, new test-based requirements would have prevented more than a third of this year’s graduates from marching at ceremonies in caps and gowns. In Cleveland, the impact of the controversial new standards would have been even stronger. The higher expectations would have wiped out diplomas for nearly half of the seniors who received them. Those students instead graduated using special one-time alternate pathways created just for this year to ease the transition to the new standards.”

This is the “failing” schools narrative at work.  If you can find a way to read this without noticing legislators’ indictment of those “failing” schools in Akron and Columbus and especially in Cleveland, Rep. Andy Brenner, Chair of the Ohio House Education Committee, will correct you: “What’s going on that they’re not able to get kids up to being college and career-ready?”

Contrast the understanding of education by outcomes-based education accountability hawks like Andy Brenner with the understanding of learning depicted in the new documentary film about Fred Rogers, Won’t You Be My Neighbor?  Mr. Rogers—influenced by prominent experts in child development like T. Berry Brazelton and Margaret McFarland—defined education as relating to children, listening to children, and responding to children’s questions and needs and concerns.  For Mr. Rogers, education was not teacher- or school-driven but instead happened in relationship—building a child’s understanding from the foundation within the child. A teacher guides instead of lecturing; a teacher responds instead of driving material into a child’s brain.  A teacher starts where the child is.

Contrast such a developmental understanding of teaching and learning with the model framed by an outcomes-driven reformer intent on pouring in enough testable material to get enough adolescents to pass the tests and produce a career-ready cohort from each high school. The outcomes-based reformer worries about the so-called quality of the diploma; the educator in Mr. Rogers’ mold considers beginning where the child is and helping that child realize her or his promise.

In this year’s very best book on education, Harvard’s Daniel Koretz describes the flaws in outcomes-based school accountability. The title explains the book’s importance for our times: The Testing Charade: Pretending to Make Schools Better.

Koretz is a psychometrician.  While he is neither a child psychologist nor a specialist in child development, Koretz describes the omission of all sorts of essential parts of education, including the kind of teaching Fred Rogers believed was important: “A… critical failure of the reforms is that they left almost no room for human judgment. Teachers are not trusted to evaluate students or each other, principals are not trusted to evaluate teachers, and the judgment of professionals from outside the school has only a limited role. What the reformers trust is ‘objective’ standardized measures…. (T)he focus of reform in the United States has been to rely as much as possible on standardized measures and to minimize human judgment, even though the result was to leave a great deal of what is most important unmeasured—and therefore to give educators no incentive to focus on it.  This is one of the most fundamental flaws of test-based accountability and one of the most significant reasons for its failures.” (The Testing Charade, pp. 34-35)

Koretz explains how outcomes-based education is undermining our very understanding of education—and undermining teaching: “Not only is bad test prep pervasive. It has begun to undermine the very notion of good instruction… One of the rationales given to new teachers for focusing on score gains is that high-stakes tests serve a gatekeeping function, and therefore training kids to do well on tests opens doors for them… Whether raising scores will improve students’ later success… depends on how one raises scores.  Increasing scores by teaching well can increase students’ later success… In the early days of test-based accountability, some observers worried that educators were coming to confuse the test with the curriculum… Some of today’s teacher educators, however, make a virtue of this mistake. They often tell new teachers that tests, rather than standards or a curriculum should define what they teach… Why does this matter so much? To start, it encourages reallocation—that is, focusing instruction on the tested sample rather than the domain or the curriculum that it is supposed to represent… What we want is for students to gain the ability to apply knowledge and skills to problems they actually encounter—not to ensure their proficiency in applying them only to test items that look exactly like the ones they will confront in the main test at the end of the year.”  (The Testing Charade, pp. 112-116)

Finally, Koretz speaks directly to the problem in Ohio, where alternative pathways to high school graduation have been needed to ensure high school graduation for large percentages of students in the state’s poorest cities but where students in affluent suburbs with schools to which the state awards “A+” grades merely sail through the new graduation requirements. Outcomes-based education accountability hawks set benchmarks more easily reached by the privileged, but we blame the schools and teachers in poorer communities—and with high school graduation benchmarks, we penalize the students themselves.

Koretz explains: “One aspect of the great inequity of the American educational system is that disadvantaged kids tend to be clustered in the same schools. The causes are complex, but the result is simple: some schools have far lower average scores…. Therefore, if one requires that all students must hit the proficient target by a certain date, these low-scoring schools will face far more demanding targets for gains than other schools do. This was not an accidental byproduct of the notion that ‘all children can learn to a high level.’ It was a deliberate and prominent part of many of the test-based accountability reforms…. Unfortunately… it seems that no one asked for evidence that these ambitious targets for gains were realistic. The specific targets were often an automatic consequence of where the Proficient standard was placed and the length of time schools were given to bring all students to that standard, which are both arbitrary.” (The Testing Charade, pp. 129-130)

Sometimes I think I ought to carry a copy of Koretz’s book in my purse, though I’d be written off as such a bore if I were to pull it out and read from it when somebody at a party begins bragging about their school—rated “A+” by the state of Ohio—while the school across town gets an “F.”  Everybody ought to take Daniel Koretz’s book to read at the beach this summer.

Sorting Out the Debate About Educational Accountability

The watchword for the last quarter century’s school reform has been accountability: holding schools and school teachers accountable for quickly raising students’ scores on standardized tests. Sanctioning schools and teachers who can’t quickly raise scores was supposed to be an effective strategy for overcoming educational injustice. Test-and-punish has enabled us at least to say we’ve been doing something to hold schools accountable.

The politics of this conversation are pretty confusing—all going back to the federal education law, the 2001 No Child Left Behind Act (NCLB), and the debate about its replacement, the 2015 Every Student Succeeds Act (ESSA).  There was bipartisan agreement in 2001-2002 when NCLB was debated, passed, and signed into law that our society could close racial and economic achievement gaps by testing all students and then demanding that schools quickly raise the scores of underachieving students. In 2015 when Congress debated the law’s reauthorization, accountability-hawk Democrats stood by test-and-punish accountability; many Republicans, led by Senator Lamar Alexander, instead pushed to expand states’ rights by lifting the heavy hand of the federal government and allowing states to design their own plans to improve so-called failing schools. Worrying that removal of universal testing would let schools off the hook, the Civil Rights Community has stood by NCLB’s testing plan. Many have continued to assume that universal testing exposes achievement gaps and that the exposure will motivate politicians and educators to address racial and economic disparities.

Test-and-punish school reform has been at the center of a conversation between Republican Senator Lamar Alexander, the chair of the Senate Health, Education, Labor and Pensions Committee, and Republican Education Secretary Betsy DeVos.  An article by Caitlin Emma published over the weekend by POLITICO examines the history of No Child Left Behind vs. the Every Student Succeeds Act as a background for looking at how policy around school accountability has been evolving in the Trump administration. Emma describes the new ESSA, passed by a Republican Congress in 2015 and designed to return at least some authority for accountability back to the states. But Democrats prodded by Civil Rights leaders and some Republicans have stood by federally imposed accountability: “Critics… worry whether states will adequately track and provide equal opportunities for at-risk kids…. (Even) former Republican Rep. John Kline… an architect of the measure, has said he’s worried states are now getting away with testing plans that violate a key requirement of the law—that states administer the same test to all students annually.  The provision is critical (Kline believes) so that states are forced to report the performance of all students and the results for poor and minority students are not hidden from view, as they were for decades before federal testing requirements were enacted.”

Emma explains: “The Every Student Succeeds Act, which passed in 2015, was widely viewed by Republicans as a corrective to the federal overreach that followed… No Child Left Behind.”  Emma reports that last summer, when Jason Botel, an official in Betsy DeVos’s Department of Education, began reviewing the states’ applications for federal funds under the ESSA, Botel demanded that before he would approve some states’ plans, they must toughen their standards and demand more.  Powerful Republican Senator Lamar Alexander, who had—during the 2015 reauthorization—supported a return of control to the states, formally complained to Betsy DeVos—“furious that a top DeVos aide was circumventing a new law aimed at reducing the federal government’s role in K-12 education. He contended that the agency was out of bounds by challenging state officials, for instance, about whether they were setting sufficiently ambitious goals for their students.”

For many of us who have, for fifteen years, closely followed educational accountability as mandated under No Child Left Behind and the Every Student Succeeds Act, the entire debate seems wrong-headed and bizarre.  I am writing about those of us who care deeply about expanding opportunity for children segregated in schools where poverty is highly concentrated—schools where intense segregation by poverty is overlaid on segregation by ethnicity and race. The schools these children attend have, under federal policy, been derided by accountability hawks as “failing” schools.  Widespread blaming—of schools and school teachers—now dominates discussions of school reform even as sociologists increasingly document that family and neighborhood poverty pose overwhelming challenges for these children and their schools.

Much of the confusion and rancor arises because the public debate about school accountability conflates two very different questions:

  • Should the federal government be involved at all in telling states what to do about education?
  • Is test-and-punish accountability an effective strategy for improving public schools and closing opportunity gaps?

The original federal education law, the 1965 Elementary and Secondary Education Act, addressed the first question as a response to the needs of children in primarily southern states, where schools serving black children had been underfunded and inadequate for generations. There are similar problems of inequity today across cities and forgotten rural areas. Poor children and children of color segregated in particular areas remain underserved. The debate about this first question involves states’ rights vs. what has come to be accepted (by many of us) as the federal government’s responsibility to protect the rights of all children and ensure they are all well served. It is a heated question that remains underneath much of the debate about school reform.

The second question involves the strategy Congress chose for reforming schools in the 2001 No Child Left Behind Act. Congress blamed teachers and schools and devised a law that was supposed to force schools and teachers to work harder and faster to improve test scores in schools where achievement lagged when all children in each state were tested on a single standardized test.  It is becoming clearer all the time that when Congress jumped behind test-and-punish accountability, it chose the wrong strategy.  A long and growing body of research demonstrates that test scores are far more aligned with a school’s aggregate economic level than with the work of the teachers or the curriculum being offered to students. Economists like Bruce Baker at Rutgers University also document enormous opportunity gaps as these same public schools in our nation’s poorest communities receive far less public investment than the schools in wealthy suburbs, schools serving children whose families also invest heavily in enrichments at home.

Here is just some of the prominent research from the past ten years that tries to answer the second question.

In 2010, Anthony Bryk and educational sociologists from the Consortium on Chicago School Research at the University of Chicago described the challenges for a particular subset of schools in Chicago, Illinois that exist in a city where many schools serve low income children. The Consortium focused on 46 schools whose students live in neighborhoods where poverty is extremely concentrated.  These “truly disadvantaged” schools are far poorer than the norm. They serve families and neighborhoods where the median family income is $9,480. They are racially segregated, each serving 99 percent African American children, and they serve on average 96 percent poor children, with virtually no middle class children present. The researchers report that in the truly disadvantaged schools, 25 percent of the children have been substantiated by the Department of Children and Family Services as being abused or neglected, either currently or during some earlier point in their elementary career. “This means that in a typical classroom of 30… a teacher might be expected to engage 7 or 8 such students every year.”  “(T)he job of school improvement appears especially demanding in truly disadvantaged urban communities where collective efficacy and church participation may be relatively low, residents have few social contacts outside their neighborhood, and crime rates are high.  It can be equally demanding in schools with relatively high proportions of students living under exceptional circumstances, where the collective human need can easily overwhelm even the strongest of spirits and the best of intentions. Under these extreme conditions, sustaining the necessary efforts to push a school forward on a positive trajectory of change may prove daunting indeed.” (Organizing Schools for Improvement, pp. 172-187)

Then in 2011, Sean Reardon of Stanford University released a massive data analysis confirming the connection of school achievement gaps to growing economic inequality and residential patterns becoming rapidly more segregated by income. Reardon documented that across America’s metropolitan areas the proportion of families living in either very poor or very affluent neighborhoods increased from 15 percent in 1970 to 33 percent by 2009, and the proportion of families living in middle income neighborhoods declined from 65 percent in 1970 to 42 percent in 2009.  Reardon also demonstrated that along with growing residential inequality is a simultaneous jump in an income-inequality school achievement gap among children and adolescents.  The achievement gap between students with income in the top ten percent and students with income in the bottom ten percent is 30-40 percent wider among children born in 2001 than those born in 1975.

In The Testing Charade, a book published just last month, Daniel Koretz of Harvard University blames test-and-punish accountability for enabling our society to pretend that we have been overcoming educational inequity at the same time we avoid making the public investment necessary even to begin addressing the problem: “One aspect of the great inequity of the American educational system is that disadvantaged kids tend to be clustered in the same schools. The causes are complex, but the result is simple: some schools have far lower average scores…. Therefore, if one requires that all students must hit the proficient target by a certain date, these low-scoring schools will face far more demanding targets for gains than other schools do. This was not an accidental byproduct of the notion that ‘all children can learn to a high level.’ It was a deliberate and prominent part of many of the test-based accountability reforms…. Unfortunately… it seems that no one asked for evidence that these ambitious targets for gains were realistic. The specific targets were often an automatic consequence of where the Proficient standard was placed and the length of time schools were given to bring all students to that standard, which are both arbitrary.” (pp. 129-130)  “If we are going to make real headway, we are going to have to confront the simple fact that many teachers will need substantial supports if they are going to markedly improve the performance of their students… And the range of services needed is broad. One can’t expect students’ performance in schools to be unaffected by inadequate nutrition, insufficient health care, home environments that have prepared them poorly for school, or violence on the way to school.” (p. 201)

The second question involves the overall direction of education policy, and it is important because we desperately need a better strategy. Blaming and punishing the schools with the lowest scores—by closing “failing” schools or privatizing them or firing their teachers and principals—has only further undermined the public schools in the poorest neighborhoods of our big cities without addressing the opportunity gaps the tests identify.

Today’s Republican tax slashing agenda will only further reduce public investment in education.  And we are likely to keep on blaming the victims.

Daniel Koretz: More Detail from “The Testing Charade” on Cheating Scandal in Atlanta

Back in 2015, I watched when part of the trial of the Atlanta school teachers—accused of erasing and correcting their students’ test scores—was televised on C-Span (see here and here). And two weeks ago I read Daniel Koretz’s new book, The Testing Charade, a book about what happens when high stakes punishments are attached to any social indicator. I read Koretz’s book pretty much without emotion or judgment—as an academic exercise to understand his argument against the high stakes that policy makers have used as a threat to drive teachers to work harder and raise test scores faster. I didn’t focus on the sections about the cheating scandals.  After all, I imagined, the scandals have just become a part of history.

Then on Wednesday evening, I watched Lisa Stark’s report for the PBS NewsHour about the 9 Atlanta school teachers and principals who are appealing their criminal convictions to clear their names and avoid stints in prison for participating in what is said to have been a 44-school cheating scandal driven by Superintendent Beverly Hall, who won awards when test scores rose miraculously quickly in Atlanta’s schools. Hall died before her own involvement could be adjudicated.

Daniel Koretz, the Harvard professor whose new book explores the Atlanta cheating scandal (among cheating scandals in Washington, D.C., Pennsylvania and many other places) as among the widespread consequences of our test-and-punish regime of school reform, spoke briefly in Lisa Stark’s report. In his book he attributes the problem to what social scientists call Campbell’s Law. Here is Koretz’s definition: “The more any quantitative social indicator is used for social decision making the more subject it will be to corruption pressures and the more apt it will be to distort and corrupt the social processes it is intended to monitor.” (p. 38)

Koretz explores the issue far more deeply in his new book than he did in Wednesday night’s short clip for the NewsHour. Two years ago I felt that the Atlanta educators’ criminal convictions were unfair. As I watched the PBS report, I recognized the relief I had felt two weeks ago when I read Koretz’s book and found an expert scholar confirming my own sense of injustice in Atlanta. Both feelings sent me back again yesterday to Koretz’s book.  Here is some of what he didn’t have time to say in Wednesday’s report for PBS.

“One aspect of the great inequity of the American educational system is that disadvantaged kids tend to be clustered in the same schools. The causes are complex, but the result is simple: some schools have far lower average scores—and, particularly important in this system, more kids who aren’t ‘proficient’—than others. Therefore, if one requires that all students must hit the proficient target by a certain date, these low-scoring schools will face far more demanding targets for gains than other schools do. This was not an accidental byproduct of the notion that ‘all children can learn to a high level.’ It was a deliberate and prominent part of many of the test-based accountability reforms… Unfortunately… it seems that no one asked for evidence that these ambitious targets for gains were realistic. The specific targets were often an automatic consequence of where the Proficient standard was placed and the length of time schools were given to bring all students to that standard, which are both arbitrary.” (pp. 129-130) Koretz continues: “(T)his decision backfired. The result was, in many cases, unrealistic expectations that teachers simply couldn’t meet by any legitimate means.” (p. 134)

In Atlanta, Koretz describes the situation at Parks Middle School, as it was portrayed by Rachel Aviv in a New Yorker profile of the Atlanta cheating scandal.  Koretz explains: “This is the school where Damany Lewis and Christopher Waller worked. Aviv documented the way in which Waller choreographed an increasingly large and well-organized cheating ring… Why did Lewis and others do this?  At least in Lewis’s case, it was not because he was comfortable cheating. Quite the contrary…  Then why? In a nutshell, because their only other choice was to fail—not when compared with reasonable goals but when held to Hall’s and NCLB’s entirely arbitrary targets. Parks is located in a terribly depressed neighborhood. Half the homes are vacant. Students call the neighborhood ‘Jack City’ because of all the armed robberies. Very few of the students come from homes with two parents. Aviv reported that some students came to school in filthy clothing and that Lewis told students to drop dirty laundry in the back of his truck so that he could wash clothes for them. Some of the parents were dysfunctional because of drug use. During the years leading up to the cheating scandal, Parks had made real progress. A new principal renovated the school and worked on both refocusing students on academics and building a sense of community. Using funds that Hall’s administration had obtained, the school implemented after-school and tutoring programs. However, this simply wasn’t enough, given how fast scores had to rise to meet Hall’s demands. Lewis told Aviv that he had pushed his students harder than they had ever been pushed and that he was ‘not willing to let the state slap them in the face and say they’re failures.’” (pp. 77-78)

Besides leaving 9 Atlanta teachers and principals with criminal convictions, what has been the ultimate outcome of all this test-and-punish for society as a whole including our children? “It’s no exaggeration to say that the costs of test-based accountability have been huge. Instruction has been corrupted on a broad scale. Large amounts of instructional time are now siphoned off into test-prep activities that at best waste time and at worst defraud students and their parents. Cheating has become widespread. The public has been deceived into thinking that achievement has dramatically improved and that achievement gaps have narrowed. Many students are subjected to severe stress… Educators have been evaluated in misleading and in some cases utterly absurd ways. Careers have been disrupted and in some cases ended. Educators including prominent administrators, have been indicted and even imprisoned. The primary benefit we received in return for all of this was substantial gains in elementary-school math that don’t persist until graduation.” (p. 191)

Koretz concludes: “Reformers may take umbrage and say that they certainly didn’t demand that teachers cheat. They didn’t, although in fact many policy makers actively encouraged bad test prep that produced fraudulent gains. What they did demand was unrelenting and often very large gains that many teachers couldn’t produce through better instruction, and they left them with inadequate supports as they struggled to meet these often unrealistic targets. They gave many educators the choice… fail, cut corners, or cheat—and many chose not to fail.” (p. 244)