TESTING MEMO 7: ASSIGNING LETTER GRADES TO TEST SCORES
by Larry J. Weber
Virginia Polytechnic Institute and State University

In TESTING MEMO 6 we recommended recording and averaging T-scores to determine final class rankings. Admittedly, this approach may not be practical in every case, especially if the T-scores would have to be computed manually. At the same time, we discussed a difficulty inherent in averaging number-right or percent-right scores, namely, that differences in score variability from one test to another result in loss of control over the influence of each test on the average score.

At this point, another approach needs to be considered, namely, recording letter grades for each test and assignment and averaging these (A=4, B=3, etc.), possibly after weighting some grades more heavily than others. One variation of this approach is an acceptable substitute for the use of T-scores: applying a single, predetermined distribution of letter grades to each test. For example, one might decide to assign about 10% As, 30% Bs, 45% Cs, 10% Ds, and 5% Fs on every test or assignment. As was the case with the use of T-scores, this practice tends to assure that each test influences the final course grade in the intended manner (a brief sketch of this procedure appears below).

In the space allowed for this memo, it would be impossible to answer all questions readers might have about this method or to qualify its use sufficiently to prevent every conceivable misapplication. However, some problems or questions may be anticipated. First, there is no hard-and-fast rule for predetermining letter grade distributions. Certainly, it is not necessary to have an equal number of Bs and Ds, and no one "has to fail." Distributions may be determined empirically by examining grade distributions for similar courses from past years. However, current circumstances may modify them for a given class. Second, applying this method to small classes must be done with great care. Personal knowledge of the ability and achievement of individual students may warrant overriding the application of a predetermined distribution.

The approach just described is often a good one for larger classes of known composition, but there are obviously situations in which it would not be appropriate. In such cases (when it has been decided to record letter grades), there is really no way to avoid the problem of potential loss of control over the influence of each test on the composite grade. Nevertheless, testing and determination of course grades must occur in these cases. Therefore, in what follows we offer some conventional wisdom applicable to the assignment of letter grades to tests for later averaging.
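As a rough illustration, the predetermined-distribution and letter-grade-averaging ideas described above might be scripted as in the following Python sketch. The percentages are the example given earlier (10/30/45/10/5); the function and variable names are purely illustrative, and ties at a cut point would still call for the instructor's judgment.

    GRADE_POINTS = {"A": 4, "B": 3, "C": 2, "D": 1, "F": 0}

    def assign_by_distribution(scores, dist=(("A", 0.10), ("B", 0.30), ("C", 0.45),
                                             ("D", 0.10), ("F", 0.05))):
        """Rank scores from high to low and award letters in the predetermined
        proportions.  scores maps student -> raw score; returns student -> letter."""
        ranked = sorted(scores, key=scores.get, reverse=True)
        n, grades, start = len(ranked), {}, 0
        for i, (letter, share) in enumerate(dist):
            stop = n if i == len(dist) - 1 else start + round(share * n)
            for student in ranked[start:stop]:
                grades[student] = letter
            start = stop
        return grades

    def course_average(letters, weights=None):
        """Weighted mean of grade points for one student, e.g. ["B", "A", "C"]."""
        weights = weights or [1] * len(letters)
        return sum(GRADE_POINTS[g] * w for g, w in zip(letters, weights)) / sum(weights)

    # One test's raw scores, then one student's letters across three tests,
    # with the second test weighted twice as heavily as the others.
    test1 = {"s01": 38, "s02": 31, "s03": 29, "s04": 27, "s05": 25,
             "s06": 24, "s07": 22, "s08": 20, "s09": 17, "s10": 12}
    print(assign_by_distribution(test1))
    print(course_average(["B", "A", "C"], weights=[1, 2, 1]))   # 3.25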
A sure way to increase student anxiety is to announce that you grade your tests on "the curve," for in the eyes of some students this is tantamount to announcing in advance of the test that a certain percentage of students will fail. Alternatively, students, and even instructors, tend to feel more secure if the criteria for letter-grade assignment are announced in advance. However, despite the greater popularity of the latter, it is difficult to recommend either practice.

The notion of grading on the curve no doubt grew out of the fact that when large numbers of examinees are administered lengthy tests, the frequency distribution of the resulting scores typically tends toward the shape of the normal or "bell-shaped" curve. The mathematical formula for the normal curve is such that the curve is symmetric about the mean and almost the entire area under the curve is contained within three standard deviations above and below the mean, with roughly 68% of this area within one standard deviation above and below the mean. If the total area is translated into the total number of examinees, it is seen that "grading on the curve" seems to suggest that most examinees would be awarded grades of C and relatively few examinees would receive As or Fs. Because of the symmetry of the distribution, it is suggested that the number of As should equal the number of Fs and that the number of Bs should equal the number of Ds. However, the degree to which the frequency distributions of scores on classroom tests approximate the normal distribution is probably not very great except for large classes with lengthy tests of appropriate difficulty (see TESTING MEMO 2 regarding difficulty). But even in these cases, there is no reason to adopt points on the normal distribution, defined a priori by standard deviation units, as the basis for establishing the cutting points between letter grades.

A far more reasonable approach would be to examine the frequency distribution of the scores and then capitalize on naturally occurring gaps in the score distribution by setting the cutting points between letter grades at the midpoints of the gaps (a brief sketch of this procedure appears below). Although this practice may result in awarding a few more letter grades at a particular level than you may have intended, it will help to minimize student quibbling over one or two points which might otherwise make the difference between one letter grade and another. The number of natural groupings may also suggest that there are fewer distinct levels of performance than suggested by the traditional five letter grades. In this case, you may decide to award no Fs or perhaps no As, or, if there is a large gap in between, you may elect not to award any Bs or Ds.

The above suggestions may appear subjective or even capricious, but, unfortunately, there is no objective procedure that can be counted upon to establish letter grade criteria in advance of the test. Ultimately, the assignment of letter grades is a professional judgment that must be rendered on the basis of fallible test scores.

Ironically, students seem to gain a false sense of security if the criteria for letter grades on a test are announced in advance. Typically, this announcement specifies the percent-correct score associated with each letter grade. Unfortunately, this requires knowledge in advance as to how easy or difficult the test will be for a particular group. Unless you maintain an item bank containing information as to how difficult each item was when administered previously, it is often the case that the test turns out to be easier or harder than anticipated, sometimes greatly so. If it turns out to be too easy, you will suffer the guilt of grade inflation. If it turns out to be too difficult, irate students will try to persuade you to change your a priori grading criteria. Consequently, it is not recommended to announce the letter grade criteria until you've had a chance to consider the score distribution.
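When you do examine the distribution, the gap-based approach described above can be sketched roughly as follows (again in illustrative Python; the names are not from any particular scoring package). The cut points it proposes are only candidates, and a one- or two-point break is not much of a gap, so in practice you would keep only the breaks that are genuinely wide; that is also how the number of distinct letter grades may turn out to be fewer than five.

    def gap_cut_points(scores, n_cuts=4):
        """Propose up to n_cuts letter-grade boundaries at the midpoints of the
        widest gaps between adjacent distinct scores."""
        distinct = sorted(set(scores), reverse=True)
        gaps = [(hi - lo, (hi + lo) / 2.0)          # (gap width, midpoint)
                for hi, lo in zip(distinct, distinct[1:])]
        gaps.sort(reverse=True)                     # widest gaps first
        return sorted((mid for _, mid in gaps[:n_cuts]), reverse=True)

    scores = [38, 36, 35, 31, 30, 30, 29, 24, 23, 22, 21, 15, 14]
    print(gap_cut_points(scores))   # [37.0, 33.0, 26.5, 18.0] as candidate boundaries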
One of the most difficult decisions is determining the cutting point between a minimally acceptable score and a failing score. If the test contains a reasonable number of items, you may be able to identify those addressing basic elements of instruction which you believe should be answered correctly even by marginal students. A separate score could be computed based only on these items, and those students who do poorly on these items might be prime candidates for failing grades. If a multiple-choice test has been used, a roughly analogous procedure might be followed whereby you identify, for each question, the single worst answer and compute a worst-answer score for each examinee. This can be accomplished quite easily by providing your measurement service a "worst" choice key as well as the "best" choice key and having the tests processed twice. You might then combine the resulting two scores, perhaps by subtracting the "worst" answer score from the "best" answer score, or you might simply use the "worst" answer scores independently to help you identify prime candidates for failure.

Another strategy that might be invoked to establish a minimally acceptable score for a multiple-choice test is to take advantage of the standard error of measurement, which is routinely provided by test scoring offices. If the test is appropriately difficult, it might be reasonable to set the minimally acceptable score at a point which is significantly higher than the score which would be expected on the basis of random guessing alone. For example, suppose the mean score on a 40-item multiple-choice test composed of four-choice items was 25 correct with a standard error of measurement equal to 3.0. The score expected from guessing alone is 40/4 = 10, so you might want to set the minimum passing score at 16, which is two standard errors above the expected chance-level score. Though this practice minimizes the possibility of someone passing the test who is totally uninformed, you may wish to set the passing score somewhat higher in light of other considerations. This recommendation is based on the assumption that the test was of appropriate difficulty for maximizing the discrimination among scores, with the average score midway between the chance level and a perfect score. (See TESTING MEMO 2.)

Actually, the standard error of measurement is the standard deviation of the scores that an examinee might obtain with repeated testing, under the assumption that this repeated testing had no effect on learning. Therefore, if, for the test described in the preceding paragraph, you believe 20 should be the minimum passing score, a reasonable actual minimum passing score might be 17, one standard error below. This would allow for the fact that someone with an average score of 20 (over hypothetical repeated testings) would score below 17 about 16% of the time. (A brief sketch of this arithmetic appears at the end of this memo.)

For more information, contact Bob Frary:

Robert B. Frary, Director of Measurement and Research Services
Office of Measurement and Research Services
2096 Derring Hall
Virginia Polytechnic Institute and State University
Blacksburg, VA 24060
703/231-5413 (voice)
frary@vtvm1.cc.vt.edu
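As a brief closing illustration, the chance-level and standard-error arithmetic described above can be written out in the same illustrative Python style as the earlier sketches; the numbers are the memo's 40-item, four-choice example, and the function names are not part of any scoring service's software.

    def chance_score(n_items, n_choices):
        """Number correct expected from blind guessing alone."""
        return n_items / n_choices

    def minimum_above_chance(n_items, n_choices, sem, n_sems=2):
        """Lowest score treated as clearly better than guessing."""
        return chance_score(n_items, n_choices) + n_sems * sem

    def relaxed_passing_score(intended_pass, sem, n_sems=1):
        """Drop an intended passing score by n_sems standard errors to allow
        for measurement error in an individual examinee's score."""
        return intended_pass - n_sems * sem

    # 40 four-choice items with a standard error of measurement of 3.0
    print(chance_score(40, 4))               # 10.0 expected by guessing
    print(minimum_above_chance(40, 4, 3.0))  # 16.0, two standard errors above chance
    print(relaxed_passing_score(20, 3.0))    # 17.0, one standard error below an intended 20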