Ill-conceived ratings systems can wreak havoc on educators’ careers
New York educators are pushing back forcefully against the state’s controversial teacher evaluation system. This spring the teachers’ associations of Rochester and Syracuse filed a lawsuit against the state, arguing that its ratings metrics unfairly penalize teachers of disadvantaged students. Now Sheri G. Lederman, a veteran teacher from Long Island, is challenging her “ineffective” rating as arbitrary and capricious, based on an ill-conceived and misapplied statistical model of teaching quality.
These suits converge on the issue of whether teachers should be judged on the basis of student test scores, and New York state is poised to set a nationwide precedent on the use of value-added testing data in teacher evaluations. While most parents and administrators would agree that educational accountability is essential, thorny questions persist about how the art of teaching should be appraised in a data-driven culture.
Value-added evaluation systems have been celebrated by U.S. Secretary of Education Arne Duncan and his 2010 Race to the Top initiative for their potential to distinguish between highly effective and ineffective teachers. Value-added models (VAMs) draw from students’ prior test scores and their backgrounds, such as race and socioeconomic status, to forecast how well they ought to score on a current year’s standardized exam. If math or English students fail to reach these benchmarks, then their teachers are deemed ineffective. This expectation game, however, springs from Byzantine formulas that have been denounced by the American Statistical Association as wrongly measuring “correlation, not causation.”
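The mechanics described above can be sketched in a few lines. This is a toy illustration only, not the state's actual model: real VAMs are proprietary regression formulas that also adjust for demographics and other covariates, and the prediction rule and numbers below are invented.

```python
# Toy sketch of the value-added idea: predict each student's current
# score from prior performance, then credit (or blame) the teacher for
# the average gap between actual and predicted scores.
# The prediction rule and all numbers here are hypothetical.

def predicted_score(prior_score):
    # Invented expectation: students roughly repeat last year's
    # performance, regressed slightly toward a mean of 70.
    return 0.8 * prior_score + 0.2 * 70

def value_added(students):
    """Average (actual - predicted) gap over a teacher's class."""
    gaps = [actual - predicted_score(prior) for prior, actual in students]
    return sum(gaps) / len(gaps)

# (prior-year score, current-year score) for a small class
classroom = [(60, 68), (75, 72), (90, 88)]
print(round(value_added(classroom), 2))
```

A positive result is read as the teacher "adding value"; a negative one marks the teacher as ineffective, which is exactly the inference the American Statistical Association criticizes as mistaking correlation for causation.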
For example, because tests are given only in certain subjects to certain age groups, 70 percent of educators in Florida last year received VAM rankings based on students or subjects they didn’t even teach. New York’s system classifies a teacher as highly effective, effective, developing or ineffective using a triad of measures: 20 percent based on value-added modeling of students’ state test scores, 20 percent on district-level assessments and 60 percent on an array of other measures, such as classroom observations. Lederman’s value-added classification dropped two rungs in just one year, even though her students’ test scores consistently came in at more than double the state average for meeting standards.
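The three-part weighting above amounts to a simple composite calculation. The sketch below is a hypothetical illustration of that arithmetic; the component scales and rating cutoffs are invented, since New York's actual conversion of component scores into its four rating bands is more involved.

```python
# Illustrative sketch of New York's three-part evaluation weighting as
# described above: 20% state-test VAM, 20% district assessments, 60%
# other measures such as classroom observations.
# Component scores (0-100) and rating cutoffs here are hypothetical.

def composite_score(vam, district, other):
    """Combine three component scores into a weighted composite."""
    return 0.20 * vam + 0.20 * district + 0.60 * other

def rating(score):
    """Map a composite score to a rating band (cutoffs invented)."""
    if score >= 90:
        return "highly effective"
    if score >= 75:
        return "effective"
    if score >= 65:
        return "developing"
    return "ineffective"

# A teacher with strong observation scores can still be dragged down
# a band or two by a single low VAM component.
print(rating(composite_score(vam=40, district=70, other=85)))
```

Because the VAM component enters the composite directly, year-to-year swings in the formula's output propagate straight into the final rating band, which is how a classification can drop two rungs in a single year.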
One explanation for this change is that New York statisticians rejigger the VAM formula each year, effectively moving the goalposts without informing teachers. Furthermore, researchers at the University of Colorado at Boulder found that tweaks to the formula for reading outcomes would alter the effectiveness ratings of more than 50 percent of Los Angeles public school teachers. In New York an ineffective rating cannot be appealed, which explains why the impetus behind Lederman’s suit is not monetary or political; rather, she seeks to have her score clarified and recalculated.
Erroneous evaluations can have real-world ramifications. The public release of teachers’ ratings can damage their professional reputations and create future employment challenges, including denial of tenure or dismissal. Lederman has job security after a 17-year career, but green teachers, who are increasingly the norm in classrooms nationwide, face serious risks. They have no existing file of prior evaluations and, without seniority, are often slotted into classrooms of underperforming students with diverse learning needs.
The danger of VAMs can be seen in the findings of a May 2014 study published by the American Educational Research Association, which reported no consistent correlation between teachers with high-scoring students and teachers who excelled on other metrics of effective schooling. Across six sample states, the report concluded, “The tests used for calculating VAM are not particularly able to detect differences in the content or quality of classroom instruction.”
VAM-style evaluations might work well for internal diagnostics, such as painting broad-brush district comparisons or pinpointing areas for teacher training. Yet the shoddiness of specific VAM forecasts raises serious doubts about their use in determining an individual teacher’s worth. A 2010 report commissioned by the U.S. Department of Education found that the error rate for value-added scores can be as high as 35 percent when only a single year of data is used. A system that could rate 1 in 3 teachers incorrectly essentially plays pin the tail on the donkey with their livelihoods.
One major concern made apparent in the Lederman case is that the metrics of VAMs are opaque. The exact inputs remain unknown, since children enter school with a host of differences in ability, attendance and background. Carol Burris, the principal of South Side High School in Rockville Centre, New York, and a vigorous opponent of high-stakes testing, explained in an interview with The Washington Post that the recipe of the annual value-added growth measure is unclear. This is a problem: A big goal of evaluations should be staff development, but if administrators do not know the metrics, they cannot mentor their teachers.
Jennifer Wallace Jacoby, an assistant professor of education and psychology at Mount Holyoke College who researches socioeconomically diverse groups of children, suggested to us that in many ways the VAM is a measure of convenience: it captures the data points that are easy to compartmentalize (such as standardized test scores) while ignoring the aspects of classroom dynamics that are messier and more difficult to quantify.
Rather than shy away from learning complexity, states could take into account data streams that encompass the full range of a child’s schooling. Vivienne Ming, a visiting scholar at the University of California at Berkeley’s Redwood Center for Theoretical Neuroscience and a co-founder of the educational research firm Soccos, explained in an interview that naturalistic data-collection methods begin with scientific inquiry and an assumption that all data points in a student’s learning environment, not just standardized assessments, are relevant.
In other words, the goal of teacher evaluations should be to provide educators with feedback and actionable guidance rather than to simply grade them on their effectiveness in prepping kids for tests. Soccos, for instance, develops tech tools to assess behavioral data such as motivation, creativity, social intelligence and metacognitive ability, using algorithms based, in part, on University of Pennsylvania professor Angela Duckworth’s measures of student grit and self-regulation and Stanford University professor Carol Dweck’s measures of mindset. These behavioral data can be used to predict or improve quality-of-life outcomes for students beyond Race to the Top’s narrow focus on college readiness, including a student’s social, emotional and physical health.
In an ideal world, all public servants, not just teachers, would be a part of this endeavor. But as Burris emphasizes, quality-of-life outcomes are not the sole responsibility of schools. “We are an important piece of the puzzle, but we are not the whole puzzle. The courts, social services and politicians are equally responsible for the social well-being of children,” she said.
Education is about relationships, not statistics. Reducing Lederman’s outstanding classroom observations and her award-winning doctoral dissertation to a mathematical equation devalues the influence she has had in her years of service.
States must build better evaluation models that include a more thoughtful use of data about the true drivers of contemporary learning. If Lederman is victorious, New York might just become the forge for these new kinds of tools.