Understanding the World: Shoddy Evaluations—A Case Study from University Life in the 1970s
Bias, Poor Methods, and Questionable Results in Student Evaluations of Professors (Note to self: I was young, I needed money, I was an idiot, what else can I say?)
Note: This essay was prepared with the research assistance and ghostwriting of ChatGPT 4.0. No LLM AIs were harmed in the process. I wanted to, though.
Author's Preface
Recruitment for the Teacher Evaluation Handbook
When I was a young man at university in the early 1970s, I was recruited to do yeoman work on a professor-evaluation handbook. There were two people on the team: my "boss" and myself. I don't remember a lot of details about it. The fellow I was working with was a little older than me. I'm not sure about the source of funding, but I imagine it was the student council. Our job was to survey students and produce a handbook grading teachers.
Grunt Work and Questionnaire Development
We proceeded, with me playing second fiddle on the project and doing the grunt work. We prepared a questionnaire; I should say the other guy prepared the questionnaire. I don't remember the details now, but I do remember going to university classrooms and, through some magic, getting students to fill it out. On some random basis? Well, not truly random: probably a self-selected basis, those willing to take the time to fill out the questionnaire. We had no idea how biased that made the sample, but by self-selecting respondents we probably got people with strong opinions, one way or the other. Was it representative? Almost certainly not.
Lack of Research Methods and Validation
I don't know much about the process of preparing the questions, and I'm not sure of the "boss's" background. He might have had a few political science courses, but I think he was an undergrad just as I was. We had little concept of validation; this was prior to any research methods or statistics courses I might have taken. I know the "boss" kept talking about descriptors. “We need lots of descriptors,” which I found a very odd choice of word.
Data Analysis and Handbook Creation
Anyway, we took this questionnaire and, through some miracle I guess, we analyzed the data. I hadn't done any computing work then and hadn't done any statistics, so I'm not sure how we analyzed it (probably he knew, or had a contact who knew, SPSS), but somehow we did some sort of mathematical analysis, frequencies and descriptive stats, and we wrote it up into a book that was printed through some mystery process. Again, the details have of late escaped me. It had a cover of thick yellow paper, bound somehow. I wish I could remember those details, but we produced a publication.
Content of the Handbook: Positive and Negative Opinions
I was given the job of writing substantial sections of it, and I was instructed to echo some positive opinions and include some negative ones — as if that actually revealed anything. It's sort of like looking at Amazon book reviews: you get some positive opinions, you get some negative opinions, and Lord knows what to make of it other than that people differ in their opinions.
Likert Scale and Final Distribution
We prepared this handbook with some funny number crunching, probably percentages on the Likert-scale items, the Likert scale being basically an agree/disagree rating on various questions for various professors. I don't know how good our coverage was, whether we got 100% of the professors or not, but we got a pretty high rate of coverage and a lot of opinions. The questionnaire was a Likert scale with a place for comments. We selected extreme comments (pro and con) more or less at random, crunched the numbers somehow (I don't remember how the calculations were done), put it all into the handbook, and distributed it throughout the campus.
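For the curious, here is a minimal sketch in modern Python of the sort of percentage tally described above. Everything in it, names and numbers alike, is invented for illustration; the original crunching was presumably done by hand or with SPSS.

```python
# Hypothetical example: percentage of responses in each Likert
# category, per professor. 1 = Strongly Disagree ... 5 = Strongly Agree.
import pandas as pd

responses = pd.DataFrame({
    "professor": ["Smith", "Smith", "Smith", "Jones", "Jones", "Jones"],
    "q_clear_lectures": [5, 4, 2, 1, 2, 2],
})

# Percentage breakdown of each response category within each professor
pct = (responses.groupby("professor")["q_clear_lectures"]
       .value_counts(normalize=True)
       .mul(100).round(1))
print(pct)
```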
Reflection: Bogus Results and Subjective Opinions
We ended up with evaluations of professors that I now consider totally bogus. They were based on shoddy methods. It probably damaged a lot of professors, egos and reputations alike. Maybe it gave confidence to some, a lack of confidence to others. But in the end I think it was not just a total waste of time, but quite pernicious. I guess the idea was to give guidance as to which professors were good and which were bad. But that is a highly subjective opinion in any case, on an individual basis. And it's relational: what one person regards as a good professor, another may regard as an abomination. It very much depends not just on the professor, but on the student and the relationship between them.
Personal Reflection on Teacher Opinions
A few years back, I looked at a social media site where people from my old school talked about teachers. I had a teacher in high school I considered the worst of the worst — arrogant, a bully, sadistic, and he probably smelled bad too (could be stretching it on that). Lo and behold, there was at least one woman who said he was the best teacher she'd ever had. And I just thought he was a horror show. Go figure.
Salt is Cheap
Take opinions with a grain of salt. And don't trust research conclusions, particularly of that nature, because the work may be done quite poorly and the results may be quite bogus, even fraudulent. Okay, not necessarily fraudulent, but quite bogus because of, if nothing else, a lack of sophistication and very poor methodology.
My Subsequent Experiences
After a while, I did study research methods and statistics. I educated myself on survey methods and did a little survey work from time to time. I eventually came to regard surveys as problematic, and all such measurement methods as suspect.
A fellow I knew in those days started an influential Canadian polling firm. I have seen his name in the paper a number of times and read his polling results occasionally. He was not an exceptional intellect himself (IMHO), but he still created an influential organization. So I have no trust in his organization or its results. I have also come to realize the huge validity and accuracy problems inherent in any such method of assessment and sampling, regardless of who is using it.
You might have known I would get back to Epistemological considerations eventually, eh? — Ephektikoi - self-professed Guerrilla Epistemologist, Cracker Barrel Philosopher, Cheap Seats Polymath, Smart-ass.
Introduction (by ChatGPT, under my gentle guidance)
Student evaluations of teachers have long been a subject of debate in academic circles. On the surface, they appear to provide useful feedback to both instructors and institutions. However, a deeper dive into the methodologies commonly employed—especially in the 1970s when formal research methods were not as robust as they are today—reveals a complex web of biases, poor sample selection, and misinterpretation of data. In this essay, we examine a personal experience of conducting teacher evaluations in university settings during the early 1970s. Through this lens, we will explore issues such as sample bias, the reliability of self-selected surveys, the limitations of Likert scales, and the importance of validation in evaluation systems. This exploration will be supported by citations from current research on survey design and student evaluation methodologies.
The Problem of Sample Bias in Self-Selected Surveys
One of the fundamental issues with the 1970s university teacher evaluation project was the reliance on a self-selected sample of respondents. Research has shown that self-selection introduces significant bias into survey results (Bethlehem, 2010). Self-selection bias occurs when individuals with strong opinions—either positive or negative—are more likely to participate, skewing the results away from a truly representative sample. This bias was likely at play in the student evaluation project, where students who either particularly liked or disliked a professor were more motivated to fill out the survey.
Bethlehem (2010) elaborates on the dangers of self-selection in survey research, stating that it undermines the reliability of conclusions drawn from such data. Moreover, Wagner (2017) supports this by highlighting that self-selected surveys often over-represent extreme views, which can lead to misleading conclusions about the general sentiment of a population.
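A toy simulation makes the mechanism concrete. In the Python sketch below, the participation model and all numbers are assumptions invented for illustration, not data from any real survey: the probability of responding grows with distance from the neutral midpoint, and the resulting sample over-represents the extremes.

```python
import numpy as np

rng = np.random.default_rng(0)
# True opinions of 10,000 students on a rough 1-5 scale
population = rng.normal(loc=3.0, scale=1.0, size=10_000)

# Assumed participation model: students with extreme opinions are more
# likely to return the questionnaire than those near the midpoint (3.0)
p_respond = 0.1 + 0.8 * np.minimum(np.abs(population - 3.0) / 2.0, 1.0)
respondents = population[rng.random(population.size) < p_respond]

# The sample's spread is inflated relative to the population's,
# because the tails are over-represented
print(f"population: mean {population.mean():.2f}, sd {population.std():.2f}")
print(f"sample:     mean {respondents.mean():.2f}, sd {respondents.std():.2f}")
```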
Validation in Evaluation Systems
A crucial issue in the 1970s project was the lack of validation in both the survey questions and the analysis process. Proper validation ensures that a survey measures what it is intended to measure and can lead to more reliable and accurate conclusions (DeVellis, 2016). In the case of this teacher evaluation project, there was little understanding of how to validate the questionnaire or the metrics used, such as the Likert scale.
Descriptors were mentioned repeatedly by the project's leader, though it was unclear how these were operationalized. Descriptors, while a legitimate concept in certain fields, were likely jargon picked up without a full understanding of their application in this context. Without clear validation or a framework for understanding how descriptors should be used to rate professors, the evaluation relied on the subjective interpretation of students. Validating such a system would require piloting the questions, refining them based on feedback, and ensuring the Likert scale could meaningfully differentiate between levels of student satisfaction or teaching effectiveness (Fowler, 2014).
As pointed out by Joshi et al. (2015), validation is particularly important in the use of Likert scales, which are prone to producing data that oversimplifies complex opinions. For example, without careful wording and scale calibration, a "strongly agree" response could mean very different things to different respondents. In this project, the absence of validation likely contributed to unreliable data, reinforcing the need for careful methodological planning in evaluations.
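One routine internal-consistency check in the scale-development literature (DeVellis, 2016) is Cronbach's alpha. The sketch below computes it from first principles; the scores are invented, and a real validation would use far more respondents.

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for a respondents-by-items matrix of scores."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1).sum()
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances / total_variance)

# Hypothetical: six students answering three related items about one professor
scores = np.array([[5, 4, 5],
                   [4, 4, 4],
                   [2, 1, 2],
                   [3, 3, 4],
                   [5, 5, 4],
                   [1, 2, 1]])
print(f"alpha = {cronbach_alpha(scores):.2f}")  # ~0.95 here: items hang together
```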
The Subjectivity of Teaching Evaluations
A recurring theme in this project was the subjective nature of student evaluations. Research confirms that student evaluations often reflect personal preferences rather than objective measures of teaching effectiveness. For example, Emery, Kramer, and Tian (2003) found that student evaluations are influenced by factors such as the grade received in the course, the difficulty of the material, and personal rapport with the instructor.
In the 1970s project, this subjectivity manifested in highly varied responses to the same professors. What one student considered an engaging teaching style, another might have found confusing or unpleasant. According to Boring, Ottoboni, and Stark (2016), these subjective judgments can produce skewed and unreliable evaluations, leading to incorrect conclusions about an instructor’s overall competence.
The Limitations of the Likert Scale
Another significant issue with the methodology used in this project was the reliance on the Likert scale. While Likert scales are widely used in social science research, they are not without limitations. The Likert scale typically captures attitudes on a spectrum ranging from "strongly agree" to "strongly disagree." However, it can oversimplify complex opinions and reduce them to a binary-like response system, limiting the richness of the data (Joshi, Kale, Chandel, & Pal, 2015).
In the case of the teacher evaluations, the Likert scale likely failed to capture the nuance of students’ experiences. Moreover, students may have interpreted the questions differently, depending on their unique expectations of a good professor. As Boone and Boone (2012) point out, without clear validation and question testing, Likert scales can produce misleading data that fail to represent the true range of opinions on a topic. In this project, the reliance on such a scale without ensuring its appropriateness or validation likely contributed to skewed results, further emphasizing the need for thoughtful survey design.
Summary
The teacher evaluation project from the 1970s, while initiated with good intentions, was flawed in multiple ways: self-selection bias, subjective interpretations, lack of validation in the survey design, and the misuse of descriptors and Likert scales. Without proper validation, the survey questions and metrics used failed to provide reliable or meaningful data, which in turn affected the credibility of the results. Research on modern evaluation systems consistently highlights the importance of carefully designed and validated instruments to ensure that conclusions drawn from surveys and evaluations are accurate and useful. This case study serves as a reminder that without proper attention to methodology, even well-intentioned efforts can lead to faulty conclusions and, potentially, unintended consequences.
References
Bethlehem, J. (2010). Selection bias in web surveys. International Statistical Review, 78(2), 161-188. https://doi.org/10.1111/j.1751-5823.2010.00112.x
This article explains how self-selection bias undermines the validity of web and self-selected surveys.
Boone, H. N., & Boone, D. A. (2012). Analyzing Likert data. Journal of Extension, 50(2), 1-5. https://archives.joe.org/joe/2012april/pdf/JOE_v50_2tt2.pdf
This article discusses the importance of proper data analysis techniques for Likert scales and highlights how improper methods or poor question design can lead to misleading interpretations.
Boring, A., Ottoboni, K., & Stark, P. B. (2016). Student evaluations of teaching (mostly) do not measure teaching effectiveness. ScienceOpen Research. https://doi.org/10.14293/S2199-1006.1.SOR-EDU.AETBZC.v1
This paper argues that student evaluations are poor measures of teaching effectiveness due to their subjective nature and bias.
DeVellis, R. F. (2016). Scale Development: Theory and Applications (4th ed.). SAGE Publications. https://www.amazon.com/Scale-Development-Applications-Applied-Research/dp/150634156X
DeVellis provides a comprehensive overview of scale development and validation, emphasizing the importance of ensuring that survey instruments measure what they are intended to measure.
Emery, C. R., Kramer, T. R., & Tian, R. G. (2003). Return to academic standards: A critique of student evaluations of teaching effectiveness. Quality Assurance in Education, 11(1), 37-46. https://doi.org/10.1108/09684880310462074
This paper discusses the influence of student evaluations on academic standards and the potential issues with using student feedback as a measure of teaching quality.
Fowler, F. J. (2014). Survey Research Methods (5th ed.). SAGE Publications. https://us.sagepub.com/en-us/nam/survey-research-methods/book239718
Fowler’s book is a key resource on survey design, covering methods for creating valid and reliable surveys, including the importance of validation and piloting questions.
Joshi, A., Kale, S., Chandel, S., & Pal, D. K. (2015). Likert scale: Explored and explained. British Journal of Applied Science & Technology, 7(4), 396-403. https://doi.org/10.9734/BJAST/2015/14975
This paper explains the Likert scale and its limitations, focusing on how it can oversimplify complex responses and the need for proper scale calibration.
Wagner, W. E. (2017). Using IBM SPSS Statistics for Research Methods and Social Science Statistics (6th ed.). SAGE Publications. https://us.sagepub.com/en-us/nam/using-ibm-spss-statistics-for-research-methods-and-social-science-statistics/book244009
Wagner’s book provides guidance on analyzing survey data, highlighting common pitfalls in interpretation, including the issue of self-selection bias.
Appendix A: The Meaning of Validation
Definition of Validation
Validation is a process in research and survey design that ensures a tool, such as a questionnaire, measures what it is intended to measure. It involves testing the instrument for accuracy, reliability, and applicability in a given context. A validated tool provides confidence that the results derived from it are meaningful, consistent, and replicable. In practical terms, validation means that if we use the questionnaire repeatedly, the results should reflect a true measurement of the variables under study, not merely noise or bias introduced by poorly designed questions or inappropriate scales.
Types of Validation
Validation can be broken down into several key types:
Content Validation
Content validation ensures that the survey covers all the relevant aspects of the topic being studied. For example, if a questionnaire aims to assess teaching effectiveness, it should include questions that address different dimensions of teaching, such as clarity of instruction, engagement with students, and knowledge of the subject matter. Content validation typically involves consulting subject matter experts to ensure the survey comprehensively addresses the necessary domains.
Construct Validation
Construct validation focuses on whether the questionnaire accurately measures the concept it purports to measure. If a Likert-scale survey is meant to measure "student satisfaction," construct validation would assess whether the questions truly reflect that construct, ensuring no unrelated factors (such as personal bias or external influences) skew the results.
Criterion Validation
Criterion validation compares the results of the questionnaire to an established benchmark or outcome. If another well-validated tool exists to measure the same variable, criterion validation checks whether the new questionnaire produces similar results. This type of validation is often used to ensure new measurement tools are in line with existing standards.
Face Validation
Face validation involves a superficial check to see if the questions on the questionnaire "look" appropriate and relevant to the respondents. Although this is the simplest form of validation, it does not involve rigorous statistical analysis and is primarily subjective.
Validating a Questionnaire Using Likert Scaling
Likert scales are commonly used in questionnaires to gauge opinions, attitudes, or behaviors. Validating a Likert-based questionnaire involves multiple steps, starting with pilot testing. Pilot testing allows researchers to administer the questionnaire to a smaller sample to gather preliminary data and identify any issues in the questions or response scales. During this stage, researchers assess both content and construct validity, ensuring that the questions align with the theoretical framework behind the study.
One common method to validate Likert scales is through factor analysis, which helps identify whether the responses cluster around certain factors (or dimensions) of the concept being measured. For example, in a questionnaire about teaching effectiveness, factor analysis might reveal that responses naturally group around factors such as communication, knowledge, and fairness. This helps refine the instrument to ensure it is measuring what it intends to measure and that the scales are well understood by the respondents.
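As a hedged sketch of what that step might look like with modern tools, the Python below runs scikit-learn's FactorAnalysis on synthetic pilot data built so that two groups of items load on two latent factors; nothing here comes from a real study.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(1)

# Synthetic pilot data: 200 students x 6 Likert items (1-5).
# Items 0-2 are built to tap one factor (say, communication),
# items 3-5 another (say, fairness).
latent = rng.normal(size=(200, 2))
loadings = np.array([[1.0, 0.0], [0.9, 0.1], [0.8, 0.0],
                     [0.0, 1.0], [0.1, 0.9], [0.0, 0.8]])
noise = rng.normal(scale=0.5, size=(200, 6))
items = np.clip(np.rint(3 + latent @ loadings.T + noise), 1, 5)

fa = FactorAnalysis(n_components=2).fit(items)
# Rows are factors, columns are item loadings; the two item clusters
# should load on separate factors, confirming the intended structure
print(np.round(fa.components_, 2))
```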
Practical Problems with Validation
Practical problems often arise when attempting to validate a questionnaire. One key issue is the difficulty of accessing a representative sample for pilot testing. If the sample is not representative of the population being studied, the validation process may yield skewed or biased results, leading to a questionnaire that does not work well in its intended context.
Another problem is the resources required for validation, which can be time-consuming and costly. Gathering enough data to perform meaningful statistical tests, such as factor analysis, requires large sample sizes, which may not always be available. In some cases, the complexity of the statistical methods required for proper validation can exceed the capabilities of the researchers or their tools.
Pragmatic Problems with Validation
On a pragmatic level, one of the biggest challenges is question fatigue among respondents. If a questionnaire is too long or repetitive, respondents may lose interest or rush through their answers, compromising the validity of the results. Similarly, response bias—where participants answer in ways they believe are socially acceptable or what the researcher wants to hear—can distort data and make validation difficult.
Additionally, contextual differences can interfere with the validation process. A questionnaire validated in one cultural or institutional setting may not work as well in another, making it challenging to generalize the results across different populations or settings.
Theoretical Problems with Validation
Theoretically, validation can also pose significant challenges. One of the main theoretical problems is the ambiguity of constructs. In some cases, the concept being measured—such as "student satisfaction" or "teaching effectiveness"—is not clearly defined, making it difficult to create questions that accurately reflect that construct. The fluidity of some constructs can also lead to construct validity issues, where what is being measured may shift subtly depending on the context or the interpretation of the respondents.
Another theoretical concern is the over-reliance on statistical methods for validation. While factor analysis and other statistical tools can provide insight into how well a questionnaire measures a concept, they do not address deeper issues related to the philosophical or psychological understanding of the concept itself. This can result in a questionnaire that is statistically sound but fails to capture the full complexity of the topic under study.
Conclusion
Validation is an essential part of developing a reliable and effective questionnaire. However, it is not without its challenges. Both practical and theoretical issues can complicate the process, from obtaining representative samples to ensuring the clarity of constructs being measured. For Likert-scale questionnaires in particular, validation through pilot testing and factor analysis is crucial but may be limited by the pragmatic realities of conducting research. Despite these difficulties, thorough validation remains the best way to ensure that a questionnaire yields useful, accurate, and meaningful results.
Appendix B: The Likert Scale
Definition of the Likert Scale
The Likert Scale is a widely used psychometric tool designed to measure people's attitudes, opinions, or perceptions. It consists of a series of statements related to a topic, with respondents asked to indicate their level of agreement or disagreement using a fixed range of responses. Commonly, the scale ranges from "Strongly Agree" to "Strongly Disagree," with a neutral option in the middle. Likert scales are ordinal in nature, meaning they show a rank order but do not indicate equal intervals between response options.
Origin of the Likert Scale
The Likert Scale was developed in 1932 by Rensis Likert, a social psychologist who aimed to improve the measurement of attitudes. Likert's approach involved assigning numeric values to each response on a scale to facilitate statistical analysis. By aggregating responses to multiple items related to the same underlying concept, Likert scales allow researchers to generate a composite score reflecting an individual's overall attitude toward a subject.
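A minimal illustration of that composite-score idea, with invented numbers: each respondent's numeric codes across related items are averaged into a single attitude score.

```python
import numpy as np

# Four respondents x five items, coded 1 (Strongly Disagree) .. 5 (Strongly Agree)
answers = np.array([[4, 5, 4, 3, 5],
                    [2, 1, 2, 2, 1],
                    [3, 3, 4, 3, 3],
                    [5, 5, 5, 4, 4]])

# One overall attitude score per respondent
composite = answers.mean(axis=1)
print(composite)  # [4.2 1.6 3.2 4.6]
```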
Where the Likert Scale is Used
The Likert Scale is commonly used in social science research, education, marketing, psychology, and healthcare to assess subjective opinions and attitudes. For example, it might be used to gauge customer satisfaction in business settings, measure patient attitudes toward a treatment in healthcare, or understand public opinion on political issues in surveys. Its versatility and ease of use make it a popular tool in both academic and applied research contexts.
Methods for Administering the Likert Scale
Likert scales are most commonly administered through questionnaires, either on paper or electronically. Each statement in the questionnaire is followed by a set of response options, typically ranging from 5 to 7 points (e.g., Strongly Agree, Agree, Neutral, Disagree, Strongly Disagree). Respondents select the option that best reflects their opinion for each statement.
In some cases, researchers may administer Likert scales verbally, particularly in interviews or focus groups. However, this method is less common due to the potential for interviewer bias and the difficulty in standardizing responses across participants.
Idiosyncratic Meanings Attributed to Likert Scale Questions
An important issue with the Likert scale is that respondents may attribute idiosyncratic meanings to the questions posed. Each individual may interpret the questions differently based on their personal understanding of the issues or their use of language. This can significantly affect how they respond, leading to variability in the results.
For example, one respondent might interpret the phrase "satisfaction with services" in a business survey to mean the speed of service, while another might focus on the friendliness of staff. These differences in interpretation can lead to responses that reflect different underlying attitudes or opinions than the researcher intended to measure.
Additionally, a respondent’s understanding and interpretation of the questions may change over time. If they were to redo the questionnaire at a later date, they might provide different answers due to shifts in their understanding or perspective on the issue. As a result, there is no guarantee that they would assign the same rankings or produce the same data points in subsequent iterations. This variability can introduce noise into the data, making it difficult for researchers to draw consistent conclusions about trends or patterns.
Use of the Likert Scale in Questionnaires
In questionnaires, Likert scales allow researchers to quantify subjective data. By presenting respondents with multiple items related to a concept (e.g., customer satisfaction), the Likert scale generates data that can be analyzed to assess trends, averages, and differences among groups. Likert-scale responses can be analyzed using descriptive statistics, such as means and frequencies, and more advanced techniques like factor analysis to explore underlying patterns or dimensions in the data.
Likert scales are frequently used in surveys that aim to measure attitudes over a spectrum rather than in a binary fashion. They are particularly useful when the goal is to capture the intensity of respondents' feelings about a subject, as the multiple response options allow for more nuanced feedback.
Pragmatic Issues with the Likert Scale
One of the main pragmatic challenges of using Likert scales is response bias. Respondents may tend to select the middle (neutral) option, either out of indecision or to avoid the effort of thinking through the extremes. This is sometimes called central tendency bias. Additionally, respondents might exhibit acquiescence bias, where they disproportionately agree with statements regardless of their true feelings.
Another challenge is that Likert scales are ordinal, meaning they capture the rank order of preferences but not the precise distance between options. For instance, the difference between "Strongly Agree" and "Agree" is not necessarily the same as the difference between "Agree" and "Neutral." This limits the type of statistical analyses that can be performed and requires researchers to carefully choose appropriate methods for analysis.
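Because of this ordinality, rank-based procedures such as the Mann-Whitney U test are often preferred to a t-test when comparing two groups of Likert responses. A brief sketch, with invented ratings for two hypothetical course sections:

```python
from scipy.stats import mannwhitneyu

# Hypothetical overall ratings from two course sections
section_a = [5, 4, 4, 5, 3, 4, 5]
section_b = [3, 2, 3, 4, 2, 3, 3]

# Rank-based comparison; no equal-interval assumption is needed
stat, p = mannwhitneyu(section_a, section_b, alternative="two-sided")
print(f"U = {stat}, p = {p:.3f}")
```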
Theoretical Issues with the Likert Scale
The Likert Scale, while practical, also faces theoretical limitations. One key issue is the assumption of linearity. Many analyses of Likert-scale data treat responses as if they are continuous, assuming that the difference between each response category is uniform. However, in reality, the intervals between categories are not equal, which can lead to inaccurate interpretations of the data.
Another theoretical concern is that Likert scales often collapse complex constructs into simplistic categories. Human attitudes and opinions are often more nuanced than can be captured by a small set of predefined responses. This simplification may lead to a loss of important information, as respondents are forced to fit their feelings into rigid categories.
Likert scales also face challenges in cross-cultural research. Respondents from different cultures may interpret the same response options differently, with some cultures favoring more moderate responses and others more extreme ones. This can create difficulties in comparing results across diverse groups and lead to erroneous conclusions.
Conclusion
The Likert Scale remains a highly valuable tool for researchers aiming to quantify subjective attitudes and opinions. Its simplicity and adaptability have made it ubiquitous in fields ranging from psychology to market research. However, it is not without its flaws. Pragmatic issues such as response biases and ordinal measurement limitations, along with theoretical concerns about the assumptions of linearity and cross-cultural comparability, require careful attention from researchers. Additionally, the idiosyncratic meanings respondents may attribute to questions and the potential for different interpretations over time add further complexity to the analysis. Despite these challenges, when used thoughtfully and with a clear understanding of its limitations, the Likert scale can still provide meaningful insights into human behavior and attitudes.