ERIC/AE Digest Series EDO-TM-95-6, October 1995
How Difficult Should a Test Be?
Robert B. Frary
Virginia Polytechnic Institute and State University
Traditional grading practices used by many teachers at nearly all levels of education seem geared to
a belief that test scores of around 70% or better represent a "passable" level of achievement.
Sometimes circumstances call for a lower passing level, but such a level is typically adopted only
with misgivings. The instructor would usually prefer that the situation not arise again and that as
few people as possible learn that the grades were "curved."
What is wrong with 70% as a passing score?
The thinking that leads to the idea of a passing score in the neighborhood of 70% may have its
roots in elementary education. For example, consider a spelling test. The teacher selects more or less
randomly from a list of words the students have been assigned. Then a score of 70% is probably a
fairly accurate estimate of the percentage of the words on the list that the student can spell. Similar
arguments can be made for arithmetic computation tests when problems of a specific type have been
formulated in some random manner.
However, this approach to testing frequently breaks down, even at the elementary school level.
Consider a test in history. How does one define the universe of all questions that might appear on a
history test? Moreover, even if such a list were available, random selection from it would probably not
yield a test with satisfactory content from the standpoint of what the teacher wanted to emphasize.
There are test construction methods available for dealing with these problems, but the average teacher
simply sits down and, consulting a list of course objectives, the course or unit outline, and so on, writes
enough test questions to fill the allotted time.
Under these circumstances, there is certainly no assurance that students who should get passing
grades will be able to answer 70% or more correctly. The test may be somewhat harder or easier than
the instructor intended.
Eventually, with experience, most instructors learn to write tests that are about as difficult as they
intend, and for many this outcome means that students who are of marginal ability and who put forth
reasonable effort score about 70% or perhaps as low as 60%. Because this level is essentially arbitrary,
it is reasonable to question whether it is desirable. In fact, we will argue that a 70% passing level is
much too high for a large majority of high school or college-level classroom tests.
There are exceptions, of course. In what follows, we will not be discussing essay examinations or
tests designed to measure degree of achievement in a restricted and highly defined subarea of a course
(criterion-referenced tests).
What we are concerned about are tests covering fairly diverse topics on which scores are
determined by adding up points for each right answer. Almost any multiple-choice midterm or final
examination would be in this category as would most short-answer and problem-solving tests.
What is the optimal average difficulty?
Because such a test may be as hard or as easy as the instructor makes it, it is clear that the percent
right does not estimate some level of achievement directly as do spelling or arithmetic tests. What the
scores do provide is a ranking of class members in terms of their achievement over the content of the
test. Under these circumstances maximum testing effectiveness is gained when the average score is in
the range of 50% to 60%. To see why this is true, consider a question gotten right by 99 of 100
examinees. This outcome provides 99 "bits" of information that could be used to rank the examinees,
namely, that each of the 99 who answered correctly knows more than the one who answered
incorrectly.
If only 90 answer correctly, 900 "bits" are generated. Specifically, the first examinee who
answered correctly knows more than each of the 10 who answered incorrectly, and so on for each of
the 90 who answered correctly, which generates 90 × 10 "bits." Of course, the maximum number of
"bits" is generated when 50 answer correctly and 50 answer incorrectly; 50 × 50 = 2500 "bits."
Hence, questions that only about half can answer correctly are the best kind for generating ranking
information. Obviously, using a lot of them on a test will yield an average score somewhat below 70%.
Of course, it is desirable to ask a few easy questions to encourage students, especially at the beginning
of a test, and to ask some really hard questions to help better students learn their own capabilities and
limitations.
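The counting argument above can be checked with a short sketch. The function name `ranking_bits` and the default class size of 100 are illustrative choices, not part of the original memo. A "bit" here is simply one pair of examinees that a single question separates.

```python
def ranking_bits(num_correct: int, num_examinees: int = 100) -> int:
    """Count the examinee pairs one question separates (right vs. wrong)."""
    num_incorrect = num_examinees - num_correct
    return num_correct * num_incorrect


# Tabulate the three cases discussed in the text.
for num_correct in (99, 90, 50):
    print(num_correct, ranking_bits(num_correct))

# Search every possible outcome for the one that separates the most pairs.
best = max(range(101), key=ranking_bits)
print("maximum at", best, "correct:", ranking_bits(best))
```

The count falls off slowly near the middle and steeply near the extremes, which is why questions answered correctly by nearly everyone contribute so little ranking information.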
What happens psychometrically?
One result of making a test more difficult will be a wider spread of scores. On a 100-question test
with an average score of 80, nearly all scores will be in the range of 60 to 100. If the average score is
55, scores will probably range from about 25 to 90. With scores spread more widely, fewer examinees
earn any given score, and so fewer have scores adjacent to each letter grade boundary. Small errors in
grading or slight variation in the instructor's liberality then affect fewer grades. These problems
cannot be completely avoided, but their effect is minimized, and the result is a fairer test.
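One way to see why a harder test spreads scores more widely is the rough sketch below. It assumes a simple logistic model relating a student's ability to the chance of answering a question correctly. That model, the names `expected_score` and `score_range`, and the evenly spaced hypothetical class are all assumptions made for illustration; none of them comes from the memo.

```python
import math


def expected_score(ability: float, easiness: float, num_items: int = 100) -> float:
    """Expected number right under an assumed logistic model (not from the memo)."""
    return num_items / (1.0 + math.exp(-(ability + easiness)))


# Hypothetical class: abilities evenly spaced from -2 (weakest) to +2 (strongest).
abilities = [-2.0 + 0.1 * i for i in range(41)]


def score_range(easiness: float) -> tuple:
    """Return the lowest and highest expected scores in the hypothetical class."""
    scores = [expected_score(a, easiness) for a in abilities]
    return min(scores), max(scores)


# Easiness values chosen so a student of average ability (0) expects 80 or 55.
easy_low, easy_high = score_range(math.log(0.80 / 0.20))
hard_low, hard_high = score_range(math.log(0.55 / 0.45))

print(f"easier test: {easy_low:.0f} to {easy_high:.0f}")
print(f"harder test: {hard_low:.0f} to {hard_high:.0f}")
```

Under these assumptions the easier test compresses the stronger students against the 100-point ceiling, while the harder test leaves room at both ends. The exact figures depend entirely on the assumed model and should not be read as estimates for a real class.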
Recommendations
Consistent with these observations, we strongly recommend extremely limited use of items
that are answered correctly by more than 90 percent of the examinees. This recommendation can also
be justified from the standpoint of preventing waste -- waste of clerical time, testing time, and supplies.
There is simply no need to write, print and obtain responses to that half of a typical test which nearly
all examinees get right. The instructor can gain about as much ranking information with a test half as
long containing mainly the more difficult questions and even better ranking information from such a
test of the original length. It is only necessary to assume that all functioning, qualified students can
answer a large percentage of the easy questions omitted from the test. A completely negligent student
will score badly enough on the harder test to justify a failing grade.
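The claim about a test half as long can be illustrated with the same pair-counting idea used earlier for single questions. The 40-question test and its item difficulties below are hypothetical, and adding the counts across questions treats each question as an independent source of ranking information, which is a simplification.

```python
def ranking_bits(num_correct: int, num_examinees: int = 100) -> int:
    """Count the examinee pairs one question separates (right vs. wrong)."""
    return num_correct * (num_examinees - num_correct)


# Hypothetical 40-question test: half the questions are answered correctly
# by 95 of 100 examinees, the other half by 55 of 100.
easy_items = [95] * 20
hard_items = [55] * 20

full_test = sum(ranking_bits(c) for c in easy_items + hard_items)
hard_half = sum(ranking_bits(c) for c in hard_items)
hard_full_length = sum(ranking_bits(c) for c in hard_items * 2)

print("original test:", full_test)
print("harder half only:", hard_half)
print("harder test, original length:", hard_full_length)
```

With these made-up difficulties, the 20 harder questions alone retain roughly 84% of the original total, and a 40-question test built only from such questions yields well over one and a half times the original total.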
The instructor who introduces harder tests will have a little adjusting to do with respect to grade
assignment. Obviously, if the average score is around 55%, the lowest passing score will have to be
somewhat below 50%. This outcome will bother some people, though it shouldn't. After all, the easy
questions just weren't asked, and it is reasonable to assume that most students would answer nearly all
of them correctly.
This Digest was adapted with permission from Testing Memo 2: How difficult should a test be, Office
of Measurement and Research Services, Virginia Polytechnic Institute and State University,
Blacksburg, VA 24060
Further Reading
Airasian, P. (1994). Classroom Assessment, 2nd edition. NY: McGraw-Hill.
Brown, F. (1983). Principles of Educational and Psychological Testing, 3rd edition. NY: Holt, Rinehart and Winston. Chapter 11.
Cangelosi, J. (1990). Designing Tests for Evaluating Student Achievement. NY: Addison Wellesley.
Gronlund, N. (1993). How to Make Achievement Tests and Assessments, 5th edition. NY: Allyn and Bacon.
ERIC Clearinghouse on Assessment and Evaluation, 210 O'Boyle Hall,
The Catholic University of America, Washington, DC 20064 * 800 464-3742
This publication was prepared with funding from the Office of
Educational Research and Improvement, U.S. Department of Education,
under contract RR93002002. The opinions expressed in this report do
not necessarily reflect the positions or policies of OERI or the U.S.
Department of Education. Permission is granted to copy and
distribute this ERIC/AE Digest.