Permission is granted to distribute this article for nonprofit, educational purposes if it is copied in its entirety and the journal is credited. Please notify the editor if an article is to be used in a newsletter.
Many alternative forms of assessment--portfolios, oral examinations, open-ended questions, essays--rely heavily on multiple raters, or judges. Multiple raters can improve reliability just as multiple test items can improve the reliability of standardized tests. Choosing and training good judges and using various statistical techniques can further improve the reliability and accuracy of instruments that depend on the use of raters.
After identifying several common sources of rating errors, this article examines how the impact of rating errors can be reduced.
UNDERSTANDING RATING ERRORS
There are numerous threats to the validity of scores based on ratings. People being rated may not be performing in their usual manner. The situation or task may not elicit typical behavior. Or the raters may be unintentionally distorting the results. Some of the rater effects that have been identified and studied are:

Halo effect -- allowing an overall impression of the person being rated to color judgments of specific traits or performances (Nisbett and Wilson, 1977).
Stringency or leniency error -- rating consistently more harshly or more generously than other judges.
Central tendency error -- avoiding the extremes of the rating scale, so that nearly everyone receives a middling score.
Selective perception -- attending mainly to the evidence that fits the rater's own background and expectations (Dearborn and Simon, 1958).
MINIMIZING RATING ERRORS THROUGH TRAINING
An established body of literature shows that training can minimize rater effects. In 1975, Latham, Wexley, and Pursell used training to reduce rater effects among employment interviewers. Since then, a variety of training programs have been developed in both interviewing and performance appraisal contexts.
For example, Jaeger and Busch (1984) used a simulation to train judges in a three-stage standard-setting operation. After working through the simulation, the judges clearly understood their rating task.
Pulakos (1986) trained raters in what types of data to focus on, how to interpret the data, and how to use the data in formulating judgments. This training yielded more reliable (higher inter-rater agreement) and accurate (valid) ratings than no training or "incongruent" training (training not tailored to the demands of the rating task).
This literature suggests that rater training programs should:

familiarize judges with the rating task, for example through simulations or supervised practice;
specify what evidence judges should attend to, how to interpret it, and how to combine it into a judgment; and
be tailored to the demands of the particular rating task.
CHOOSING JUDGES
The choice of judges may have a significant influence on scores. Hambleton and Powell (1983) have done an excellent job of identifying many of the issues involved in choosing judges, and their framework offers recommendations on the common questions that arise in selecting them.
USING STATISTICAL TECHNIQUES
The difference between a rater's average and the average of all ratings is called the "rater effect." If the rater effect is zero, no systematic bias exists in the scores. Because of rater errors such as those discussed earlier, the rater effect is rarely zero.
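As a concrete illustration (the digest itself gives no computation, so the raters and scores below are hypothetical), the following Python sketch computes each rater's effect as that rater's mean rating minus the grand mean of all ratings:

    # Hypothetical illustration: the "rater effect" is each rater's mean rating
    # minus the grand mean of all ratings. An effect of zero means no systematic
    # bias; positive values indicate leniency, negative values harshness.

    ratings = {            # rater -> scores given (made-up data)
        "Rater A": [4, 5, 4, 5],
        "Rater B": [3, 3, 2, 3],
        "Rater C": [4, 4, 3, 4],
    }

    all_scores = [s for scores in ratings.values() for s in scores]
    grand_mean = sum(all_scores) / len(all_scores)

    for rater, scores in ratings.items():
        rater_mean = sum(scores) / len(scores)
        effect = rater_mean - grand_mean
        print(f"{rater}: mean = {rater_mean:.2f}, rater effect = {effect:+.2f}")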
If all the judges rate everyone being evaluated, some rater effects may not be a problem: The candidates all realize the same benefit or penalty from the rater's leniency or harshness. The ranks are not biased, and no one receives preferential treatment.
However, an issue arises if different sets of multiple raters are used--a common situation when scoring essays, accrediting institutions, and evaluating teacher performance. Candidates evaluated by different sets of multiple raters may receive biased scores because they drew relatively lenient or relatively harsh judges.
Several statistical approaches can be used to adjust potentially biased ratings given by different sets of multiple raters. Compared with simply averaging each candidate's ratings--in other words, doing nothing--these approaches have been shown to reduce measurement error and increase accuracy. When applied to actual performance data, they typically produce substantial adjustments and change a substantial number of pass/fail decisions.
Three statistical approaches discussed in the literature (see Houston, Raymond, and Svec, 1991) are:

ordinary least squares regression, which estimates and removes each rater's effect;
weighted regression, which also weights raters according to their estimated reliability; and
imputation, which treats the ratings a candidate did not receive as missing data and estimates them from the ratings that were given.
The imputation approach is most appropriate when each rater evaluates only a few candidates. The weighted regression approach is most appropriate when variations are expected in rater reliability.
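As a rough sketch of the simplest possible adjustment (hypothetical data and a deliberately naive method, not the regression or imputation models of Houston, Raymond, and Svec, 1991), one can estimate each rater's effect and subtract it from that rater's scores before averaging each candidate's adjusted ratings:

    # Minimal sketch of a mean-centering adjustment when different candidates
    # are scored by different sets of raters. Each rater's estimated effect
    # (rater mean minus grand mean) is subtracted from that rater's scores,
    # then each candidate's adjusted scores are averaged.
    # Hypothetical data and procedure -- not the published models.

    ratings = [  # (candidate, rater, score) -- made-up data
        ("Ann", "R1", 5), ("Ann", "R2", 4),
        ("Bob", "R3", 3), ("Bob", "R4", 2),
        ("Cal", "R1", 4), ("Cal", "R4", 3),
    ]

    grand_mean = sum(score for _, _, score in ratings) / len(ratings)

    # Estimate each rater's effect from the scores that rater actually gave.
    rater_scores = {}
    for _, rater, score in ratings:
        rater_scores.setdefault(rater, []).append(score)
    rater_effect = {r: sum(s) / len(s) - grand_mean for r, s in rater_scores.items()}

    # Remove each score's rater effect, then average by candidate.
    candidate_scores = {}
    for candidate, rater, score in ratings:
        candidate_scores.setdefault(candidate, []).append(score - rater_effect[rater])

    for candidate, scores in candidate_scores.items():
        print(f"{candidate}: adjusted mean = {sum(scores) / len(scores):.2f}")

Note that with unbalanced designs this naive estimate confounds a rater's leniency with the ability of the particular candidates that rater happened to see; the regression-based approaches are designed to separate the two.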
REFERENCES
Dearborn, D.C., and H.A. Simon. (1958). Selective perception: A note on the departmental identification of executives. Sociometry, June, 140-148.
Hambleton, R.K., and S. Powell. (1983). A framework for viewing the process of standard setting. Evaluation and the Health Professions, 6(1), 3-24.
Houston, W.M., M.R. Raymond, and J.C. Svec. (1991). Adjustments for rater effects. Applied Psychological Measurement, 15(4), 409-421.
Jaeger, R.M., and J.C. Busch. (1984). The effects of a Delphi modification of the Angoff-Jaeger standard setting procedure on standards recommended for the National Teacher Examination. Paper presented at the annual meeting of the National Council on Measurement in Education, New Orleans, LA.
Nisbett, R.E., and T.D. Wilson. (1977). The halo effect: Evidence for the unconscious alteration of judgments. Journal of Personality and Social Psychology, 35, 450-456.
Pulakos, E.D. (1986). The development of training programs to increase accuracy on different rating forms. Organizational Behavior and Human Decision Processes, 38, 76-91.