New Research Bolsters the Case For Using Student Surveys to Evaluate Teachers

Even in the world of education policy, where the search for solutions can be so myopic that fads have their own fads, the practice of evaluating teachers based on student feedback has had a meteoric rise. In early 2012, part II of the Gates Foundation’s Measures of Effective Teaching (MET) study recommended using student surveys to evaluate teachers. Later in the year, a favorable piece in the Atlantic by Amanda Ripley offered the idea to a wider audience. Then two months ago New York released its long-awaited teacher evaluation guidelines, and wouldn’t you know it, the plan called for student surveys to make up a small but legitimate piece of the high-stakes decision-making process.

But some remain skeptical that student judgments accurately reflect teacher performance. Research is mixed on the quality of student metacognition and judgments of learning, and therefore it’s possible student feedback reflects perceived learning that hasn’t actually occurred. There is also a concern that student evaluations will be influenced by characteristics that make teachers popular, but not necessarily effective.

Two new studies should help alleviate these concerns. The first study, which was led by Carnegie Mellon’s Eyal Peer, was an attempt to directly replicate the Dr. Fox Effect, a finding that has often been used to call into question the accuracy of student evaluations. The effect gets its name from a famous 1973 study in which students watched a lecture by a paid actor (“Dr. Fox”) who was introduced as an expert in the field of game theory. Although the lecture was largely nonsense, the actor spoke with great enthusiasm, and observers rated the phony Dr. Fox very highly on a series of follow-up questions (e.g. “Did he put his material across in an interesting way?”). The positive ratings have been cited as evidence that student evaluations are swayed by the presentation of a lesson rather than its content or impact.

The original study has remained controversial because it had a laundry list of red flags. There was no suitable control group. Nearly all the questions were yes/no items where a “yes” meant a positive evaluation of the lecturer. This made the results susceptible to the “acquiescence bias” — the tendency for people to simply answer “yes” to everything. Observers were also unfamiliar with the topic, which meant that they might have been affected by the perception that the lecturer was an expert. However, despite these shortcomings, as well as doubts cast by numerous follow-up studies, the original Dr. Fox study continues to be influential.

Peer’s study sought to replicate the findings using the original lecture video, but with a series of methodological tweaks. In the initial experiment Peer and his partner, Elisha Babad, aimed to combat the acquiescence bias by including six additional conditions in which answers were either on a 1-6 scale, on a -3 to +3 scale, re-phrased so that a positive answer meant a negative evaluation, or some combination of the three. In addition, some of the conditions had explicit instructions about the need to avoid the acquiescence bias. Surprisingly, across all conditions evaluations tended to remain just as positive as they were in both the original 1973 study and the control condition that replicated the original study.

A follow-up experiment investigated the impact of the lecturer’s status by showing one group of participants the lecture without the glowing introduction that painted Dr. Fox as an expert. Once again, both groups gave the lecturer similar positive ratings.

A third and final experiment examined whether the unfamiliarity of the topic might account for the high ratings. Two groups of participants watched the lecture video. One consisted of undergraduates largely unfamiliar with the topic, while the second consisted of graduate students who had studied game theory and decision making, the subject of the lecture. Yet again, both groups gave the lecturer similar positive ratings.

But then something interesting happened:

Before we had the opportunity to debrief them, the students exclaimed spontaneously (and in consensus) that the lecture was quite stupid and that the speaker spoke nonsense. In their comments, the students conveyed to us that they did not feel that they had learned anything from the lecture. However, at the same time, they evaluated the speaker favorably. We then realized that the underlying assumption made by Naftulin et al. (1973) about the connection between favorable ratings and the sense of learning might be false. It might be possible that students would rate a speaker favorably and still not feel that they had actually learned anything.

The researchers then went back to their original data and analyzed answers from a question that wasn’t in the original 1973 study, but that they had had the foresight to include: “Did you learn from the lecture?”

Sure enough, participants gave answers that were significantly more negative than their answers to the other questions. In fact, whereas a majority of participants gave positive responses to the other questions, nearly two-thirds gave negative responses when asked about learning.

The implication is that students are able to say they enjoyed a lecture and appreciated the speaker without being duped into thinking they learned something. In terms of using student feedback to evaluate teachers, this means that a well-constructed survey won’t necessarily be biased by a teacher’s charisma. If the surveys do a good job asking about learning, students will do a good job answering about learning.

The second study examined the relationship between student feedback and science learning in more than 50 German third-grade classrooms. The researchers broke feedback about teachers into three areas deemed essential for learning: classroom management, cognitive activation, and supportive climate. After controlling for teacher popularity, the researchers found that student ratings of classroom management predicted student achievement, and that ratings of cognitive activation and supportive climate predicted student subject-interest. The results provide another piece of evidence for a strong link between student feedback and actual learning. There have also been questions about the efficacy of feedback from younger students, and the findings suggest that among third graders there’s little cause for concern.

It’s worth emphasizing that the findings from the two studies deal only with the short-term relationship between student feedback and learning outcomes. There’s a very legitimate question as to whether student feedback will continue to be an effective metric in the long run (i.e. get ready for lots of Campbell’s Law proclamations). Nevertheless, the two studies should hearten those who have been pushing for a larger student role in evaluating teachers.

Peer, E., & Babad, E. (2013). The Doctor Fox Research (1973) Rerevisited: “Educational Seduction” Ruled Out. Journal of Educational Psychology. DOI: 10.1037/a0033827

Fauth, B., Decristan, J., Rieser, S., Klieme, E., & Buttner, G. (2013). Student ratings of teaching quality in primary school: Dimensions and prediction of student outcomes. Learning and Instruction. DOI: 10.1016/j.learninstruc.2013.07.001

