TESTING MEMO 8: RELIABILITY OF TEST SCORES
by Robert B. Frary
Virginia Polytechnic Institute and State University

Reliability is a measure of the stability of test scores. Suppose a test is administered to the same group of examinees on successive days with no intervening instruction in the area tested. If the test scores are highly stable (or reliable), each examinee will get the same or close to the same score on both administrations. The closer the pairs of scores are, the more stable or reliable they are over time.

Of course, the passage of time is not the only reason that test scores differ from one administration to another. Transitory examinee characteristics (e.g., feeling better or worse than usual) have their effect, as do administration conditions (e.g., noise level, temperature, and ventilation). For multiple-choice tests, variation in luck when guessing can yield score differences from one administration to another. Reliability is affected by all such factors, and it is usually not possible to determine the relative contribution of each one separately.

A reliability coefficient is a Pearson product-moment correlation coefficient between two sets of scores as described above. Correlation coefficients range from -1 through 0 to +1, but negative values are not meaningful with respect to reliability. A coefficient of 0 means that there is no relationship between the two sets of scores. A coefficient of 1 would occur if all examinees got the same score on both administrations. A reliability coefficient may be evaluated (roughly) as follows:

.90 or higher - High reliability. Suitable for making a decision about an examinee based on a single test score.

.80 to .89 - Good reliability. Suitable for use in evaluating individual examinees if averaged with a small number of other scores of similar reliability.

.60 to .79 - Low/moderate reliability. Suitable for evaluating individuals only if averaged with several other scores of similar reliability.

.40 to .59 - Doubtful reliability. Should be used only with caution in the evaluation of individual examinees. May be satisfactory for determining average score differences between groups.

It is rarely feasible to administer a test twice to evaluate reliability. Nevertheless, the responses from a single administration of a test can be used to estimate reliability. These estimates are called internal consistency reliability coefficients. Measurement and Research score reports include one of these, the Kuder-Richardson formula 20 (KR20) coefficient. Computation of internal consistency reliability coefficients is based on assuming that the test is not speeded and that its content is homogeneous. These coefficients overestimate reliability if there is not sufficient time for nearly all examinees to finish. They underestimate reliability if the test content is not homogeneous. For example, suppose that a test contains questions covering two distinct areas of a course and that the number of correct answers for a student in one area is not a very good indicator of how well that student will do in the other area. In this case, rather than combine two disparate topics on the same test, it is better to separate the questions into two subtests generating separate scores and reliability coefficients.
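For readers who wish to check these coefficients themselves, the short sketch below (in Python; not part of the memo or of the scoring service's own programs) computes a test-retest reliability coefficient as a Pearson correlation and an internal consistency coefficient using the standard KR20 formula, KR20 = (k/(k-1)) * (1 - sum(pq)/variance of total scores). The data and function names are hypothetical.

    # Minimal sketch with hypothetical data; assumes item responses are
    # scored 1 = correct, 0 = incorrect.
    import numpy as np

    def test_retest_reliability(scores_day1, scores_day2):
        # Pearson product-moment correlation between two administrations.
        return float(np.corrcoef(scores_day1, scores_day2)[0, 1])

    def kr20(item_responses):
        # Kuder-Richardson formula 20 from an (examinees x items) 0/1 matrix.
        x = np.asarray(item_responses, dtype=float)
        k = x.shape[1]                    # number of items
        p = x.mean(axis=0)                # proportion answering each item correctly
        q = 1.0 - p                       # proportion answering each item incorrectly
        total_var = x.sum(axis=1).var()   # variance of examinees' total scores
        return (k / (k - 1.0)) * (1.0 - (p * q).sum() / total_var)

    # Example: 5 examinees, 4 items (far too few for a trustworthy estimate,
    # as noted below).
    responses = np.array([[1, 1, 1, 0],
                          [1, 0, 1, 1],
                          [0, 1, 0, 0],
                          [1, 1, 1, 1],
                          [0, 0, 1, 0]])
    print(round(kr20(responses), 2))

The variance here is the population form (dividing by the number of examinees); a scoring program that uses the sample form will give a slightly different value.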
All reliability estimates are subject to considerable error when there are small numbers of examinees or test items. If there are fewer than, say, 25 examinees or 10 items, the reliability estimate must be "taken with a grain of salt." This phenomenon is especially noticeable when there are several scrambled forms of the test, each administered to a relatively small number of examinees. Then the KR20 coefficients may fluctuate considerably from one form to another. In this case, the instructor may wish to have the responses unscrambled and evaluated as if they came from a single form. Most test scoring/measurement services offices should be prepared to provide this service based on the instructor's unscrambling key.

In many cases a test will contain items for which the correlation between selection of the keyed answer and total score is very weak or negative. This outcome suggests that such items do not relate to the content measured by the other items. If items with this characteristic are dropped from the test, KR20 will almost always improve. This phenomenon is discussed in more detail in TESTING MEMO 5.

For classroom testing, the most common cause of low reliability is test questions that are too easy. When all or nearly all of the questions are answered correctly by more than, say, 80% of the examinees, the resulting scores will be in a narrow range. For example, under such circumstances, most of the scores on a 50-question test will lie between 40 and 50. Then a small score fluctuation due to extraneous circumstances (such as those discussed above) will have a large relative effect on the class standing of the examinee. On the 50-item test just described, the range for As might be 47-50. Then a student with a headache, who would otherwise make an A, may miss one or two extra questions and make a B. At the same time, a B student who has just moderately good luck when guessing may make an A. These kinds of errors give rise to low reliability coefficients. If the test questions are harder, the scores will be more spread out and reliability will be higher. Assuming that the same numbers of each letter grade will be given, small errors in scores are then less likely to result in different grades. TESTING MEMO 2 discusses test difficulty in greater detail.

The standard error of measurement (SEM), described in TESTING MEMO 7, is related to reliability--the lower the reliability, the larger the SEM. This statistic is included routinely on test scoring printouts and reflects the extent to which an individual examinee's scores on many (hypothetical) administrations of a test would fluctuate about that examinee's average score over the many administrations. Of course, it is that average or "true" score that should be used to evaluate the examinee. If the assumptions for KR20 are met, then the odds are about 2 to 1 (or the probability is about 2/3) that the examinee's "true" score is contained within one SEM (above and below) of his or her actual score on the test. Recognition of how far the "true" score of each examinee might be from his or her actual score may suggest some liberality in determining the cut points between letter grades.
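The memo leaves the details of the SEM to TESTING MEMO 7, but the usual classical relation behind the statements above is SEM = SD x sqrt(1 - reliability), where SD is the standard deviation of the observed scores. The sketch below (hypothetical numbers, not taken from any score report) shows how the SEM grows as reliability falls and what "within one SEM about two times in three" looks like numerically.

    # Minimal sketch with hypothetical values; SEM = SD * sqrt(1 - reliability)
    # is the standard classical-test-theory relation referred to above.
    import math

    def standard_error_of_measurement(score_sd, reliability):
        # Lower reliability -> larger SEM, as stated in the memo.
        return score_sd * math.sqrt(1.0 - reliability)

    # Example: a 50-item test whose scores have a standard deviation of 4 points.
    for r in (0.90, 0.75, 0.50):
        sem = standard_error_of_measurement(4.0, r)
        print(f"reliability {r:.2f}: SEM = {sem:.2f} points; "
              f"odds are about 2 to 1 that the 'true' score lies within "
              f"+/- {sem:.2f} of the observed score")

The "about 2 to 1" odds correspond to the roughly 68% of a normal distribution that lies within one standard deviation of its mean.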
Note that in all of the above discussion, only scores were described as being more or less reliable. Tests per se, that is, the instruments themselves, cannot be described in this way. A test may yield highly reliable scores under one set of circumstances and scores of low reliability under another. Factors to be considered in this regard are administration conditions, appropriateness and difficulty of the test for the examinees, and examinee motivation and attitude. When these factors are unfavorable, the scores are likely to be less reliable than otherwise.

For more information, contact Bob Frary at:

Robert B. Frary, Director of Measurement and Research Services
Office of Measurement and Research Services
2096 Derring Hall
Virginia Polytechnic Institute and State University
Blacksburg, VA 24060
703/231-5413 (voice)
frary@vtvm1.cc.vt.edu

###