How Difficult Should an Online Education Test Be?

Clearinghouse on Assessment and Evaluation

ERIC/AE Digest Series EDO-TM-95-6, October 1995

How Difficult Should a Test Be?

Robert B. Frary
Virginia Polytechnic Institute and State University

Traditional grading practices used by many teachers at nearly all levels of education seem geared to a belief that test scores of around 70% or better represent a "passable" level of achievement. Sometimes circumstances seem to indicate the advisability of a lower level, but typically such a level is adopted only with misgivings. Reflecting these misgivings, the instructor would usually prefer that the situation not arise again and that knowledge of the fact that the grades were "curved" be restricted as much as possible.

What is wrong with 70% as a passing score?

The thinking that leads to the idea of a passing score in the neighborhood of 70% may have its roots in elementary education. For example, consider a spelling test. The teacher selects more or less randomly from a list of words the students have been assigned. Then a score of 70% is probably a fairly accurate estimate of the percentage of the words on the list that the student can spell. Similar arguments can be made for arithmetic computation tests when problems of a specific type have been formulated in some random manner.

However, this approach to testing frequently breaks down, even at the elementary school level. Consider a test in history. How does one define the universe of all questions that might appear on a history test? Moreover, even if such a list were available, random selection from it would probably not yield a test with satisfactory content from the standpoint of what the teacher wanted to emphasize. There are test construction methods available for dealing with these problems, but the average teacher simply sits down and, consulting a list of course objectives, the course or unit outline, and so on, writes enough test questions to fill the allotted time.

Under these circumstances, there is certainly no assurance that students who should get passing grades will be able to answer 70% or more correctly. The test may be somewhat harder or easier than the instructor intended.

Eventually, with experience, most instructors learn to write tests that are about as difficult as they intend, and for many this outcome means that students who are of marginal ability and who put forth reasonable effort score about 70% or perhaps as low as 60%. Because this level is essentially arbitrary, it is reasonable to question whether it is desirable. In fact, we will argue that a 70% passing level is much too high for a large majority of high school or college-level classroom tests.

There are exceptions, of course. In what follows, we will not be discussing essay examinations or tests designed to measure degree of achievement in a restricted and highly defined subarea of a course (criterion referenced tests).

What we are concerned about are tests covering fairly diverse topics on which scores are determined by adding up points for each right answer. Almost any multiple-choice midterm or final examination would be in this category as would most short-answer and problem-solving tests.

What is the optimal average difficulty?

Because such a test may be as hard or as easy as the instructor makes it, it is clear that the percent right does not estimate some level of achievement directly as do spelling or arithmetic tests. What the scores do provide is a ranking of class members in terms of their achievement over the content of the test. Under these circumstances, maximum testing effectiveness is gained when the average score is in the range of 50% to 60%. To see why this is true, consider a question answered correctly by 99 of 100 examinees. This outcome provides 99 "bits" of information that could be used to rank the examinees, namely, that each of the 99 who answered correctly knows more than the one who answered incorrectly.

If only 90 answer correctly, 900 "bits" are generated. Specifically, the first examinee who answered correctly knows more than each of the 10 who answered incorrectly, and so on for each of the 90 who answered correctly, which generates 90 × 10 = 900 "bits." Of course, the maximum number of "bits" is generated when 50 answer correctly and 50 answer incorrectly: 50 × 50 = 2500 "bits."

Hence, questions that only about half can answer correctly are the best kind for generating ranking information. Obviously, using a lot of them on a test will yield an average score somewhat below 70%. Of course, it is desirable to ask a few easy questions to encourage students, especially at the beginning of a test, and to ask some really hard questions to help better students learn their own capabilities and limitations.
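The pair-counting argument above can be checked with a few lines of Python. This is only a sketch of the arithmetic in the text: the function name ranking_bits is our own, and it assumes a class of 100 examinees, as in the example.

```python
def ranking_bits(n_correct: int, n_examinees: int = 100) -> int:
    """Count the (correct, incorrect) pairs of examinees that one question separates."""
    n_incorrect = n_examinees - n_correct
    return n_correct * n_incorrect


# The three cases discussed in the text.
for n_correct in (99, 90, 50):
    print(n_correct, "correct ->", ranking_bits(n_correct), "bits")

# Search every possible number of correct answers for the most informative split.
best_split = max(range(101), key=ranking_bits)
print("most informative split:", best_split, "correct out of 100")
```

The count falls off symmetrically on either side of an even split, so a question answered correctly by 10 examinees separates as many pairs as one answered correctly by 90.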

What happens psychometrically?

One result of making a test more difficult will be a wider spread of scores. On a 100-question test with an average score of 80, nearly all scores will be in the range of 60 to 100. If the average score is 55, scores will probably range from about 25 to 90, so fewer examinees will earn any given score. With fewer students having scores adjacent to each letter grade boundary, small errors in grading or slight variation in the instructor's liberality affect fewer grades. These problems cannot be completely avoided, but their effect is minimized in this way, and the result is a fairer test.

Recommendations

Consistent with these observations, we strongly recommend extremely limited use of items that are answered correctly by more than 90 percent of the examinees. This recommendation can also be justified from the standpoint of preventing waste -- waste of clerical time, testing time, and supplies. There is simply no need to write, print and obtain responses to that half of a typical test which nearly all examinees get right. The instructor can gain about as much ranking information with a test half as long containing mainly the more difficult questions and even better ranking information from such a test of the original length. It is only necessary to assume that all functioning, qualified students can answer a large percentage of the easy questions omitted from the test. A completely negligent student will score badly enough on the harder test to justify a failing grade.
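To make the comparison of test lengths concrete, the sketch below totals the "bits" of ranking information over whole tests. The item mix is hypothetical and chosen only for illustration: a 100-question test given to 100 examinees, with 50 easy questions answered correctly by 95 examinees each and 50 harder questions answered correctly by 50 each. Summing the pair counts question by question is also a simplification, since it treats each question as contributing independently.

```python
def ranking_bits(n_correct: int, n_examinees: int = 100) -> int:
    """Pairs of examinees (one correct, one incorrect) separated by a single question."""
    return n_correct * (n_examinees - n_correct)


def total_bits(correct_counts, n_examinees: int = 100) -> int:
    """Sum the pair counts over every question on a test."""
    return sum(ranking_bits(c, n_examinees) for c in correct_counts)


# Hypothetical item mix (an assumption, not data from the Digest).
easy_items = [95] * 50   # answered correctly by 95 of 100 examinees
hard_items = [50] * 50   # answered correctly by 50 of 100 examinees

full_test = total_bits(easy_items + hard_items)   # typical 100-question test
half_test = total_bits(hard_items)                # hard half only, 50 questions
all_hard = total_bits([50] * 100)                 # 100 questions, all hard

print("typical test:        ", full_test)
print("hard half only:      ", half_test)
print("full length, all hard:", all_hard)
print("share kept by hard half:", round(half_test / full_test, 2))
```

Under these assumed numbers the 50 easy questions contribute 23,750 of the typical test's 148,750 bits, so the half-length test keeps about 84% of the ranking information, and a full-length test of harder questions yields 250,000.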

The instructor who introduces harder tests will have a little adjusting to do with respect to grade assignment. Obviously, if the average score is around 55%, the lowest passing score will have to be somewhat below 50%. This outcome will bother some people, though it shouldn't. After all, the easy questions just weren't asked, and it is reasonable to assume that most students would answer nearly all of them correctly.
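One way to see why a passing score below 50% need not be alarming is to estimate what the same student would have scored had the easy questions been left in. The sketch below is illustrative only: the function name, the equal split of 50 hard and 50 omitted easy questions, and the 95% success rate on the easy questions are all assumptions standing in for "nearly all of them."

```python
def implied_full_score(hard_score_pct: float,
                       easy_rate_pct: float = 95.0,
                       n_hard: int = 50,
                       n_easy: int = 50) -> float:
    """Estimate the percent score on a test that also included the omitted easy questions.

    easy_rate_pct is an assumed success rate on the omitted questions.
    """
    hard_points = hard_score_pct * n_hard / 100
    easy_points = easy_rate_pct * n_easy / 100
    return 100 * (hard_points + easy_points) / (n_hard + n_easy)


# A student scoring 45% on the harder test, under the assumptions above.
print(implied_full_score(45))
```

With these assumptions, 45% on the harder test corresponds to 22.5 + 47.5 = 70 points out of 100 on the longer test, the traditional passing level.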

This Digest was adapted with permission from Testing Memo 2: How difficult should a test be, Office of Measurement and Research Services, Virginia Polytechnic Institute and State University, Blacksburg, VA 24060.

Further Reading

Airasian, P. (1994). Classroom Assessment, 2nd edition. NY: McGraw-Hill.

Brown, F. (1983). Principles of Educational and Psychological Testing, 3rd edition. NY: Holt, Rinehart and Winston. Chapter 11.

Cangelosi, J. (1990). Designing Tests for Evaluating Student Achievement. NY: Addison-Wesley.

Gronlund, N. (1993). How to Make Achievement Tests and Assessments, 5th edition. NY: Allyn and Bacon.

ERIC Clearinghouse on Assessment and Evaluation, 210 O'Boyle Hall,
The Catholic University of America, Washington, DC 20064 * 800 464-3742

This publication was prepared with funding from the Office of Educational Research and Improvement, U.S. Department of Education, under contract RR93002002. The opinions expressed in this report do not necessarily reflect the positions or policies of OERI or the U.S. Department of Education. Permission is granted to copy and distribute this ERIC/AE Digest.
