Ms Matthews was apprehensive as she opened the envelope containing her evaluation report. She had worked hard over the summer, taking graduate classes to learn some new teaching strategies to help her students improve their music listening and reading skills. She had excitedly incorporated these techniques in her classes.
But the state had just increased the portion of her yearly effectiveness rating based on the math and reading scores of students in her school system from 40% to 50%. Now, her annual evaluation as a music teacher would be determined largely by the scores her students earned on their standardized math and reading tests, not by her ability to help her students learn to sing, play instruments, compose and improvise.
Ms Matthews was worried that those scores might lower her rating from the previous year’s “Highly Effective” to “Minimally Effective,” or, even worse, “Ineffective.” Two consecutive ratings of “Ineffective” could mean the loss of her position in the high-needs school where she worked.
As a former high school music teacher and school administrator who now studies education and music education policy issues, I have seen the serious misuse of data in teacher evaluation. I know Ms Matthews is not the only one to open an evaluation report with trembling hands.
Music teachers across the United States are being evaluated based on test scores in subjects they don’t teach.
Tools of measurement
Teacher evaluation today relies on statistical formulas known as “value-added measures” (VAM). The idea behind VAM is that student test scores can be used to measure not only student learning but also the instruction those students receive from their teachers.
In simple terms, VAM compares the test scores estimated for a hypothetical group of students with the test scores actual students earn. The average of these differences across all students is the school’s VAM score. VAM scores are intended to measure the contribution of a teacher or school to student learning.
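To make that arithmetic concrete, here is a minimal sketch of the simplified description above. It is not the formula any state actually uses: real value-added models rely on far more elaborate statistics. The function name `vam_score`, the choice to subtract the estimate from the actual score, and the four sample scores are all assumptions made purely for illustration.

```python
from statistics import mean


def vam_score(estimated, actual):
    """Average gap between actual and estimated test scores.

    Assumption: the "difference" for each student is taken as
    actual minus estimated, so a positive result means students
    scored above the estimate.
    """
    if len(estimated) != len(actual):
        raise ValueError("each student needs one estimated and one actual score")
    if not estimated:
        raise ValueError("at least one student is required")
    # One difference per student, then the plain average of those differences.
    differences = [a - e for a, e in zip(actual, estimated)]
    return mean(differences)


# Hypothetical scores for four students (invented for this example).
estimated_scores = [70, 65, 80, 75]
actual_scores = [74, 63, 85, 78]

print(vam_score(estimated_scores, actual_scores))  # prints 2.5
```

Nothing in this calculation asks who taught the students or which subject that teacher teaches. A music teacher's name can be attached to the resulting number just as easily as a math teacher's.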
However, much of the current research suggests that these scores are too imprecise and variable to carry much validity as an indication of a teacher’s effectiveness.
Currently, every state in the country requires the use of “student growth measures” in math and reading as a factor in teacher evaluation. As many as 17 states specifically mandate VAM for the evaluation. In these states, as much as 50% of an individual teacher’s rating is now determined by this formula. Many states have been forced into instituting these forms of evaluation due to pressures exerted by the Department of Education in order “to win federal Race to the Top grants or waivers from No Child Left Behind (NCLB).”
However, this is not the purpose for which VAM was originally intended or designed. The initial intent of VAM, according to William Sanders (known as the “father” of VAM), was to help researchers make sense of huge “mountains of data, using mathematics in the same way it was used to understand the growth of crops or the effects of a drug.”
As the American Statistical Association noted in a recent statement:
VAMs typically measure correlation, not causation: Effects – positive or negative – attributed to a teacher may actually be caused by other factors that are not captured in the model.
These problems are only magnified when VAM is used to evaluate music teachers.
Imagine, for a moment, a physician being evaluated on their patients’ illnesses or injuries, not on the treatment delivered. Or consider the logic behind evaluating a steakhouse based on the fish you had at the seafood restaurant across the street last night.
This is what happens when VAM is used to evaluate music teachers.
Numerous music education policy groups have begun to focus their attention on how these practices are impacting school music programs, teachers and students.
These groups have produced policy briefs and position statements suggesting caution as state departments of education consider increasing the portion of teachers’ evaluations that are based on VAM.
They have also suggested new music-specific evaluation tools that provide teachers with high-quality assessment activities that keep the focus on music teaching and learning – not on math and reading test scores.
For instance, the Michigan Arts Education Instruction and Assessment (MAEIA) initiative has developed a database of assessment tools, which are available free of charge to all arts educators.
We measure what we treasure…or do we?
But without more thoughtful evaluation systems, even the best data will not result in authentic assessment practices. An example comes from “stack ranking,” for years the predominant employee evaluation tool in the business world.
At Microsoft, employees were rated “on a score of one to five, with one being the best. Managers were then given a curve to base their rankings on, and forced to give a certain percentage of employees a poor ‘five’ label – even if the managers did not consider the employee to be unsatisfactory at their jobs.”
Often referred to as “rank and yank,” stack ranking did not result in noticeable improvements to employee productivity, and instead contributed to a culture of fear and mistrust at many of the companies in which it was used.
Microsoft, Expedia and Adobe Systems have now abandoned the practice of stack ranking, even as similar systems are gaining traction in US public schools.
We often hear the old adage, “We measure what we treasure,” when discussions turn to issues of educational accountability. This saying is often used to provide justification for narrowing the curriculum to math and reading in the elementary grades, and to STEM subjects in the upper grades.
According to Andy Hargreaves and Henry Braun, policy scholars from Boston College, “Data driven improvement and accountability (DDIA) in the US has focused on what is easily measured rather than on what is educationally valued.”
For most of us, it is precisely those things that we value the most – our families, our students and colleagues, the beauty of a well-turned phrase – that are the most stubbornly resistant to statistical measurement.
It is our duty as policymakers to be sure that the kinds of data we are using to evaluate all teachers are not only valid and reliable but also meaningful and appropriately used.
Music education can be a vital, critical component of each student’s educational journey if we can work together to develop policies that support and encourage comprehensive musical experiences for all of our nation’s children.