Tests should provide equal opportunities for all students to demonstrate their abilities and knowledge. Because language influences test responses, any test development effort needs to include a bias review process. The challenge for test developers and reviewers is to ensure, to the extent possible, that language, symbols, words, phrases and content that can be construed as sexist, racist or otherwise potentially offensive, inappropriate or negative are eliminated from test items.
While it may be nearly impossible for tests to incorporate the diversity of background, cultural tradition and viewpoints found in the test-taking population, preparing unbiased tests simply takes effort: careful development and careful review. This publication will help you recognize and reduce bias in test items. Use it as a guide to develop your own item bias review form.
There is an important distinction between stereotyping and bias. Stereotyping is the consistent representation of a given group in a particular light, which may be offensive to members of that group. Stereotyping does not, except in extreme cases, lead to differential performance. Bias, on the other hand, is the presence of some characteristic of an item that results in differential performance for two individuals of the same ability but from different ethnic, sex, cultural, or religious groups. However, both stereotyping and bias are undesirable properties of a test item.
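The definition of bias above (differential performance for individuals of the same ability but from different groups) can be illustrated with a minimal sketch. It assumes that total test score is an acceptable stand-in for ability and that examinees are matched by placing scores into bands; the function name, the band width, the group labels "A" and "B", and all of the data are hypothetical and are not taken from any study cited here.

```python
from collections import defaultdict

def correct_rate_by_band(responses, band_width=10):
    """Return {band: {group: proportion correct}} for a single item.

    responses: iterable of (group, total_score, answered_correctly) tuples.
    Examinees are matched on ability by grouping total scores into bands
    (an assumption of this sketch, not a prescribed method).
    """
    # band -> group -> [number correct, number of examinees]
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for group, total_score, correct in responses:
        band = (total_score // band_width) * band_width
        cell = counts[band][group]
        cell[0] += int(correct)
        cell[1] += 1
    return {
        band: {g: c / n for g, (c, n) in groups.items()}
        for band, groups in sorted(counts.items())
    }

# Hypothetical responses to one item from two groups.
responses = [
    ("A", 42, True), ("A", 45, True), ("A", 48, True), ("A", 41, False),
    ("B", 43, True), ("B", 44, False), ("B", 47, False), ("B", 49, False),
    ("A", 55, True), ("A", 58, True),
    ("B", 52, True), ("B", 57, True),
]

rates = correct_rate_by_band(responses)
for band, by_group in rates.items():
    print(band, by_group)
```

In this made-up data, examinees in the same score band (40 to 49) answer the item correctly at different rates depending on group, which is the pattern the definition describes. A gap like this is only a prompt for closer review of the item, not proof of bias.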
Another undesirable characteristic related to bias is offensiveness, which can obstruct the purpose of a test item. Material that candidates consider offensive may produce negative feelings that may affect their attitudes toward testing, and hence, their test scores.
In any bias investigation, the first step is to identify the subgroups of interest. Bias reviews and studies generally focus on differential performance for sex, ethnic, cultural, and religious groups. For example, the Educational Testing Service (ETS) has identified the following six groups to give special consideration during sensitivity reviews: Asian/Pacific Island Americans, Black Americans, Hispanic Americans, individuals with disabilities, Native Americans/American Indians and women. Some or all of these groups may be relevant to your situation. In the following discussion, the term designated subgroups of interest (DSI) is used to avoid repeating a list of possible subgroups.
In preparing an item bias review form, each question can be asked from two perspectives: Is the item fair? Is the item biased? While the difference may seem trivial, some researchers contend that judges cannot detect bias in an item, but can assess an item's fairness. Perhaps the best approach is to include both types of questions on the review form.
The checklist below offers questions designed to gauge an item's fairness. A checklist to help detect various forms of bias appears later, in the section on different kinds of bias.
Checklist for Fairness
>> Does the item appear to be fair with respect to representation of situations for examinees and free of annoying stereotyping?
>> Does the item give a positive representation of designated subgroups of interest (DSI)?
>> Is there a lack of DSI representation in nonstereotypical settings?
>> Is the test item material balanced in terms of being equally familiar to every DSI?
>> Is there an over- or under-representation of a sex group in either a primary or secondary role?
>> Are members of DSI highly visible and positively portrayed in a wide range of traditional and nontraditional roles?
>> Are positive stereotypes (e.g., a woman as a loving mother) balanced by a sufficient number of nontraditional portrayals?
>> Are DSI represented at least in proportion to their incidence in the general population?
>> Does the item include topics of interest and relevance to DSI?
>> Are DSI referred to in the same way with respect to the use of first names and titles?
>> Is there a balance (across items in the test) of proper names? ethnic groups? activities for all groups (active, passive, neutral)? roles for both sexes (traditional, nontraditional, neutral)? adult role models (worker, parent)? character development (major, minor, neutral)? settings (suburban, urban, rural)?
>> Does the item have contextual justification (for example, the predominance of sickle cell anemia among Black people)?
>> Is there greater opportunity on the part of members of one group to be acquainted with the vocabulary?
>> Is there greater opportunity on the part of members of one group to experience the situation or become acquainted with the process presented by the items?
>> Will the item "turn-off" examinees so that they are unable to do as well as their abilities would indicate?
>> Will all examinees be "free" psychologically and emotionally to respond to an item?
>> Will all examinees have equal opportunity to respond?
>> Are the members of a DSI portrayed as uniformly having certain aptitudes, interests, occupations, or personality traits?
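The checklist questions about balance across items (proper names, roles, settings, and proportional representation) lend themselves to a simple tally. The sketch below assumes that a reviewer has already coded each item by hand; the field names (`setting`, `lead_sex`, `role`) and the four sample items are hypothetical.

```python
from collections import Counter

def tally_attribute(items, attribute):
    """Count how often each value of `attribute` appears across test items."""
    return Counter(item[attribute] for item in items if attribute in item)

def shares(counter):
    """Convert raw counts into proportions of the total."""
    total = sum(counter.values())
    return {value: count / total for value, count in counter.items()}

# Hypothetical reviewer codings for four items.
items = [
    {"id": 1, "setting": "urban", "lead_sex": "female", "role": "nontraditional"},
    {"id": 2, "setting": "suburban", "lead_sex": "male", "role": "traditional"},
    {"id": 3, "setting": "suburban", "lead_sex": "male", "role": "traditional"},
    {"id": 4, "setting": "rural", "lead_sex": "male", "role": "neutral"},
]

setting_counts = tally_attribute(items, "setting")
lead_sex_shares = shares(tally_attribute(items, "lead_sex"))

print(setting_counts)
print(lead_sex_shares)
```

The resulting proportions can then be compared with whatever benchmark the review team chooses, such as incidence in the general population. The tally only summarizes the codings; judging whether a portrayal is positive or stereotypical remains the reviewer's task.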
Different Kinds of Bias
Bias comes in many forms. It can be sex, cultural, ethnic, religious, or class bias. An item may be biased if it contains content or language that is differentially familiar to subgroups of examinees, or if the item structure or format is differentially difficult for subgroups of examinees. An example of content bias against girls would be one in which students are asked to compare the weights of several objects, including a football. Since girls are less likely to have handled a football, they might find the item more difficult than boys, even though they have mastered the concept measured by the item.
An item may be considered language biased if it uses terms that are not commonly used statewide or if it uses terms that have different connotations in different parts of the state. An example of language bias against Black students is found in an item in which students were asked to identify an object that began with the same sound as "hand." While the correct answer was "heart," Black students more often chose "car" because, in Black slang, a car is referred to as a "hog." The Black students had mastered the concept but were selecting the wrong answer because of language differences.
Questions that might be asked to detect content, language, and item structure and format bias are highlighted in the following checklist.
>> Does the item contain content that is different or unfamiliar to different DSI?
>> Will members of DSI get the item correct or incorrect for the wrong reason?
>> Does the content of the item reflect information and/or skills that may not be expected to be within the educational background of all test-taking examinees?
>> Does the item content contain information that could give an advantage to examinees of some DSI?
>> Does the item contain words that have different or unfamiliar meanings for DSI?
>> Is the item free of difficult vocabulary?
>> Is the item free of group specific language, vocabulary, or reference pronouns?
>> Will any of the item distractors be unusually attractive to members of DSI for cultural reasons? (For example, some words may have different meanings in the first language of some of the examinees.)
>> Are there any flaws in the items to which members of DSI are differentially sensitive?
>> Does the item contain any errors or clues that make the various answer choices unequally attractive to members of DSI?
>> Does the explanation concerning the nature of the task required to successfully complete the item tend to differentially confuse members of DSI?
>> Will any of the distractors draw a disproportionate number of members of DSI? Are there flaws in the item that cause one or more of its options to be attractive to members of DSI?
>> Are clues included in the item that would facilitate the performance of one group over another?
>> Should "I don't know" be included as an answer choice to prevent disproportionate amounts of guessing?
>> Will the "correct" or "best" answer change for different DSI?
>> Will the use of a "negative" in the item cause differences in performance?
>> Are there any inadequacies or ambiguities in the test instructions, item stem, keyed response, or distractors?
>> Does the format or structure of the item present greater problems for students from one background than from others?
Stereotyping of Minorities
Stereotyping and inadequate or unfavorable representation of DSI are undesirable properties of tests to which reviewers should be sensitized. Tests should be free of material that may be offensive, demeaning, or emotionally charged. While the presence of such material may not make the item more difficult for the candidate, it may cause him or her to become "turned off," and result in lowered performance. An example of emotionally charged material would be an item dealing with abortion or gun control. An example of offensive material would be an item that implied the inferiority of a certain group, which would be offensive to that group. An example might be the use of intelligence scores from several cultural groups in the stem of a statistics problem. Terms that are generally unacceptable in test items include lower class, housewife, Canuck, Chinaman, colored people, and red man.
Additional terms to avoid include job designations that end in "man." For example, use police officer instead of policeman; firefighter instead of fireman. Other recommendations to eliminate stereotyping:
o Avoid material that is controversial, inflammatory, demeaning or offensive to members of DSI.
o Avoid depicting members of DSI as having stereotypical occupations (e.g., Chinese launderer) or in stereotypical situations (e.g., boys as creative and successful, girls needing help with problems).
Checklist for Stereotyping
>> Does the test item contain material that is controversial or inflammatory for members of DSI? Material that is demeaning or offensive to DSI?
>> Does the test item portray members of DSI in situations that do not involve authority or leadership?
>> Does the test item depict members of either sex as experiencing stereotyped emotions (e.g., boys never cry)?
>> Does the test item depict members of DSI as having stereotyped occupations (e.g., Chinese launderer)?
>> Does the test item depict members of DSI in stereotypical situations? (e.g., boys as creative and successful, girls needing help with problems)
>> Does the test item contain "art bias" (e.g., girls in dresses)?
>> Does the item contain language that could be offensive to a segment of the examinee population?
>> Does the item contain biased language (e.g., disproportionate use of male terms or names, or patronizing expressions like "the little woman" or "the fair sex")?
>> Do the job designations end in "man"? (Use police officer instead of policeman; firefighter instead of fireman.)
>> Have terms such as man, men, mankind been used as collective terms for the human race? Instead, use humanity, people, men and women.
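The terminology questions in this checklist can be partly supported by a simple text scan that runs before human review. The sketch below is only a starting point: the term list is a small sample built from the replacements suggested above, the names `FLAGGED_TERMS` and `flag_terms` are hypothetical, and a match means only that a reviewer should look at the item, since context decides whether a term is actually a problem.

```python
import re

# Small sample list; a real review team would maintain its own.
FLAGGED_TERMS = {
    "policeman": "police officer",
    "fireman": "firefighter",
    "mankind": "humanity",
}

def flag_terms(item_text, flagged=FLAGGED_TERMS):
    """Return (term, suggestion) pairs for flagged whole words in item_text."""
    found = []
    for term, suggestion in flagged.items():
        # \b limits matches to whole words; matching ignores case.
        if re.search(rf"\b{re.escape(term)}\b", item_text, flags=re.IGNORECASE):
            found.append((term, suggestion))
    return found

hits = flag_terms("A fireman and a policeman each climb a ladder.")
for term, suggestion in hits:
    print(f"Found '{term}'; consider '{suggestion}'.")
```

A scan like this catches only exact words on the list. It cannot judge portrayals, roles, or tone, which still require the checklist questions and a human reviewer.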
With the increasing use of performance assessments in education, these checklists can easily be extended to include performance tasks, instructions, and scoring guides.
©1999-2012 Clearinghouse on Assessment and Evaluation. All rights reserved.