Where course reviews go wrong

Before she picks her courses for the coming semester, one of the first things College senior Joanna Kass does is open up a browser tab for Penn Course Review. For her, instructor quality ratings are valuable because they help her decide between courses she's interested in taking.

"I think that if a professor gets poor ratings year after year after year, that's an indication that it is not the best class to take," Kass said. "If I honestly can't decide between two classes, then it's a way to decide."

Like Kass, many students and administrators acknowledge the importance of Penn Course Review and the data it provides. Nearly 90 percent of students fill out end-of-semester course evaluations, and numerous apps - like Penn Course Plus, which has been downloaded by nearly 1,600 students - have been built to make it easier for students to use course ratings data.

Data from Penn Course Review is useful because it helps students quantify information about courses they're interested in taking, such as their quality and difficulty, and then decide whether the courses will be worth their time. Administrators also use this data to make promotion decisions, track teaching quality and provide guidance to low-performing teachers.

But while students and administrators tend to rely on course evaluations when making decisions, experts acknowledge that student ratings have flaws that could strongly affect how the data is interpreted, especially if those flaws are not understood or carefully considered. Many researchers who study course review data argue that these biases (although use of the term "bias" is itself debated) - which include anti-STEM sentiment and gender and racial inequities - need to be controlled for, and caution that student ratings shouldn't be the only factor used in teacher evaluations.

* * *

Research on course evaluation data has shown that biases come in all shapes and sizes.

Information specific to particular types of courses, including the size of the class and what subject matter it covers, can alter how students evaluate them. Administrators and researchers widely acknowledge that larger courses receive lower ratings than smaller courses, although the difference is small. This certainly holds true at Penn: A Daily Pennsylvanian analysis of courses offered between the spring semester of 2009 and the spring semester of 2015 found that course quality is highest in small courses (fewer than 25 students), lower in medium-sized courses (25 to 50 students) and lowest in large courses (more than 50 students).

Further, STEM courses generally receive lower course quality ratings than humanities and social science courses, in addition to higher difficulty ratings. Though researchers haven't found a clear cause behind this, a 2009 paper suggested possible reasons, including that science faculty might spend less time focused on teaching than humanities faculty - working instead on grant proposals - or that students didn't have sufficient background knowledge or interest to appreciate hard science courses. Penn Course Review data reveals that the same bias against STEM can be seen at Penn. A DP ranking of departments by average course quality found that 14 of the bottom 20 departments were in STEM fields, while only three of the top 20 departments were in STEM. On the flip side, 17 of the top 20 departments were in the humanities, while only one of the bottom 20 was a humanities department: the writing seminar.

Other known biases in course evaluations concern particular groups of teachers and students, including the debated case of bias against female professors. Numerous research studies, as well as anecdotal experience, have indicated that female professors receive lower ratings than male instructors. According to an online chart of words students use to describe their professors, male instructors are more frequently described as "brilliant" and "knowledgeable"; female professors, on the other hand, are more likely to be called "bossy" and "mean." The most recent academic study of gender bias, published in January of 2016, found evidence of bias against female professors both in French classrooms and in online U.S. courses.

However, the methods and results of studies showing bias against female professors have faced criticism. In a response to the January 2016 publication, researchers from IDEA, a nonprofit that studies higher education, questioned the study's methodology, noting that its results show gender accounts for only 1 percent of the variance in ratings. Steve Benton, a senior researcher at IDEA, explained in an interview that there's no consistent evidence that male professors receive higher scores than female professors, noting the evidence "simply isn't there." Some studies have found no bias against female professors, and one study from 2000 found a slight bias in favor of female professors by female students.

Beyond gender, a professor's ethnicity might also play a role in ratings. Ronald Berk, a professor emeritus at Johns Hopkins University who has written extensively about course review data, suggested there might be an "interplay between gender and ethnicity" that could affect student ratings of courses. One study on how beauty affects instructor ratings found that minority professors, as well as non-native English speakers and women, receive lower ratings. Unfortunately, there has been a dearth of empirical research on how race factors into course evaluations, which one study on the subject notes is a "scholarly gap that is itself troubling."

* * *

Given all these potential sources of bias, does this mean that students and administrators shouldn't trust or use course review data?

Some researchers say yes. The authors of the 2016 study on gender bias write that student evaluations of teachers "should not be relied upon as a measure of teaching effectiveness" due to gender bias. Philip B. Stark, professor of statistics at the University of California at Berkeley and co-author of a critical paper on course evaluations, told Inside Higher Ed that the increasing evidence of bias in student evaluations is reason for universities not to use this data. "[Student evaluations] do not measure what they purport to measure," he said. Others argue that student ratings are relatively useless because students lack the expertise to effectively critique their instructors.

But many experts in the field disagree, noting that student review data is extremely valuable for administrative decisions.

"It's an important voice, just like a patient's voice is important in judging interactions with their physician," argued Benton, who has authored numerous articles on student ratings. Students spend around 40 hours per semester in the classroom with their professor, the thinking goes, so they should have a good idea about how effective their teacher is at teaching.

But Benton and Berk, as well as other researchers, point out that student review data shouldn't be the only data administrators consider. IDEA, the nonprofit where Benton is a researcher, recommends that student evaluations count for only 30 to 50 percent of an instructor's overall evaluation. Berk recommends that administrators take scholarship, peer observations and peer review into consideration when evaluating professors.

Rob Nelson, Penn's executive director for education and academic planning, says he would prefer to conduct peer reviews of professors, but there is no clear way to implement them on a school-wide level. "We're stuck with an imperfect evaluation," he said.

"I would be perfectly fine getting rid of course evaluation data if there were something better to replace it with," Nelson added. "The fact is this is what we have."

[Chart: Preference for Smaller Classes]