Saturday, March 23, 2013

The Ineffective Rating Fetish


Posted by Matthew Di Carlo

In a story for Education Week, the always-reliable Stephen Sawchuk reports on what may be a trend in states’ first results from their new teacher evaluation systems: The ratings are skewed toward the top.

For example, the article notes that, in Michigan, Florida and Georgia, a high proportion of teachers (more than 90 percent) received one of the two top ratings (out of four or five). This has led to some grumbling among advocates and others, who cite similarities between these results and those of the old systems, in which the vast majority of teachers were rated “satisfactory,” and very few were found to be “unsatisfactory.”

Differentiation is very important in teacher evaluations – it’s kind of the whole point. Thus, it’s a problem when ratings are too heavily concentrated toward one end of the distribution. However, as Aaron Pallas points out, these important conversations about evaluation results sometimes seem less focused on good measurement or even the spread of teachers across categories than on the narrower question of how many teachers end up with the lowest rating – i.e., how many teachers will be fired.

The belief that the road to drastic improvement is paved with teacher dismissals has become something of an article of faith in many education policy circles, and it seems to play an inordinately important role in debate and decisions about teacher performance assessment in the U.S.

One of the best examples of this phenomenon can be found in the recent changes to the teacher evaluation system (IMPACT) used by the District of Columbia Public Schools (DCPS).

In 2010-11, 15 percent of teachers received the highest IMPACT rating (“highly effective”), 66 percent were “effective,” 16 percent “minimally effective,” and 3 percent received the lowest rating (“ineffective”).

One might argue that this distribution, unlike those in the states in the Education Week article, is in the ballpark of what you’d expect: Two-thirds of teachers are in the “satisfactory” (i.e., “effective”) category, for which there are no rewards or consequences; 15 percent are “highly effective,” and thus potentially eligible for bonuses; and roughly one in five receive a rating that either results in dismissal (“ineffective”) or gives them one year to improve (“minimally effective”).

To the degree one can draw tentative conclusions from simply eyeballing results, this distribution seems plausible, and I don’t recall anyone arguing otherwise.

DCPS had other thoughts. They recently announced some major changes to the system, which went into effect this year. Several of these adjustments, such as dropping teachers’ lowest observation scores, seem sensible. The most important, however, were two changes to the IMPACT scoring system: (1) Increasing the upper bound of the “ineffective” category (such that more teachers would receive that rating); and (2) essentially “partitioning” the “effective” category into two separate categories, the lower of which (“developing”), if received in three consecutive years, might result in dismissal.


DCPS put forth no specific reason, at least publicly, for expanding the “ineffective” category. They did attempt some explanation of the “partitioning” change, which was that the “effective” category was “too wide a range to be a high standard,” as exemplified by the fact that “effective”-rated teachers’ value-added scores varied widely. This is, at best, painfully thin.

For one thing, it is to be expected that value-added scores will vary considerably among “effective”-rated teachers, since some teachers with lower observations “compensate” with higher value-added scores, and vice-versa. It doesn’t quite make sense to design a system that consists of different types of measures that we know aren’t strongly correlated, and then change the system’s scoring because one of those measures varies within final rating categories. The whole idea of a “multiple measures” system is that one measure should not dominate.

Moreover, creating a new “middle category,” particularly one that might carry high-stakes consequences, is not entirely consistent with the research on both value-added and observations, which suggests that these measures are less well-equipped to differentiate between teachers toward the middle of the distribution.

In explaining the “bigger picture” reasons for the scoring adjustments, DCPS Chancellor Kaya Henderson said the “system needed to be more rigorous if the district is to dramatically accelerate student achievement.” What does the ever-opaque term “rigorous” mean in this context? Given the substance of the changes, it seems that “rigorous” means subjecting a larger proportion of teachers to high stakes decisions, motivated by the (likely misinterpreted) fact that DCPS proficiency rates are not increasing.

(Side note: In my opinion, the premise that teacher evaluations, no matter how well they are designed and implemented, should be expected to “dramatically accelerate” student performance is highly questionable, and, at the very least, should not guide policy decisions.)

To be clear, there are low-performing teachers who, even after being given a chance to improve, should be dismissed. Whether or not IMPACT is a good system in terms of identifying those teachers remains to be seen (despite absurd claims by its more credulous supporters).

In any case, this is still a critical time in the IMPACT implementation process, and policy adjustments that cause further shocks must be approached with extreme caution. I do not doubt that DCPS considered these two scoring changes carefully. Perhaps there’s a good policy justification for them, but I haven’t seen it; as they stand, the changes look like dismissals for their own sake. Risk is once again increasing, while rewards are not. And there is every reason to worry that the changes will hurt the credibility of the system and the district among current (and, perhaps, future) teachers.

Evaluation systems aren’t just about personnel decisions, and they’re certainly not about meeting an arbitrary quota for these actions. They are about identifying strengths and weaknesses and incentivizing voluntary behavioral changes that lead to improvement among current teachers, encouraging retention among high-performers, and making the profession more attractive to sharp people who are thinking about entering the classroom.

Furthermore, all of these outcomes, high- and low-stakes, depend on having well-designed systems that are perceived as legitimate and fair by teachers and administrators (for more on these issues, see this simulation, which is discussed here).

Now, let’s return to the states in the Education Week article. They face a very difficult situation. As with any first implementation of a new system, some need for revision is only to be expected. And, in a few cases, the results are implausible, and substantial adjustments may be required to address this.

Perhaps I will discuss some recommendations in a future post. In the meantime, let’s hope these states and districts balance the need to differentiate against the strengths and limitations of their current measures when deciding how to accomplish that differentiation, and that they preserve the systems’ perceived fairness and legitimacy in the process.

These are extremely important considerations, and failure to pay attention to them might very well threaten any positive impacts of these new systems. I fear that some states and districts are going to spend the next decade or so learning that lesson the hard way.
- Matt Di Carlo
