VPI Occasional Paper: Detection of Answer Copying on Multiple-Choice Tests and Interpretation of _g_ [subscript] 2 Statistics

Robert B. Frary
Virginia Polytechnic Institute and State University

[Note: * will be substituted for _g_ [subscript] 2 throughout the remainder of the text.]

When there is a high degree of correspondence between the responses of two examinees on a multiple-choice test, it is only natural to suspect cheating. However, other possibilities must be considered. If both examinees are excellent students, the correspondence will be unavoidably high because they will both answer a large proportion of items correctly. Therefore, any method of detecting cheating must take into consideration the score levels of the examinees.

The popularity of right and wrong choices must also be taken into consideration. For example, consider a suspected pair who had 20 choices in common on a 30-item test. Fifteen of their common choices were correct answers to the easiest questions, and the five wrong choices in common were the most popular wrong answers to very hard questions. Contrast this pair with another pair who also had 20 choices in common, but for whom 15 of the common choices were wrong answers to relatively easy questions. Obviously, it is very unlikely that by sheer chance two examinees would choose that many wrong answers in common.

As an illustration, consider the probability of throwing 15 or more doubles in 20 throws of a pair of dice. The probability of this event occurring is less than one in 70 million. Of course, at the other end of the continuum, there is some degree of response similarity to be expected: the probability of throwing no doubles in 20 throws of dice is about one in 40. The job for someone interested in detecting cheating is to produce a statistic which reflects the probability that an observed degree of correspondence between right as well as wrong answers was due to chance.
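The dice figures above can be checked directly with the binomial distribution. The following sketch (standard library only; the function name is ours) computes both probabilities:

```python
from math import comb

def binom_tail(n, k_min, p):
    """P(at least k_min successes in n independent trials, each of probability p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))

p_double = 1 / 6  # chance of a double on one throw of a pair of dice

# 15 or more doubles in 20 throws: less than one in 70 million
p_many = binom_tail(20, 15, p_double)

# no doubles at all in 20 throws: about one in 40
p_none = (1 - p_double) ** 20
```

Evaluating these expressions gives roughly 1 in 71 million and 1 in 38, consistent with the figures quoted above.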
Toward this end we have developed a statistic which is sensitive to examinee score level and to the popularity of right and wrong choices. It is similar to the _t_ statistic and is approximately normally distributed for tests of reasonable length and moderately large numbers of examinees. A complete description of this statistic, called the * statistic, may be found in the _Journal of Educational Statistics_ 2 (1977): 235-256.

We have used the * statistic extensively at Virginia Tech. One case started with an instructor's wish to determine the extent to which various security measures were preventing cheating on examinations in a multi-section course. The examination chosen for analysis had 30 four-choice items responded to by 944 examinees in 11 rooms, all of whom took the test simultaneously. There were two forms of the test, designated A and B, the second a scrambled version of the first, and these were passed out alternately in each room. Two distributions of *s were produced for mutually exclusive groups:

Group 1: Pairs of examinees who were in the same room and took the same form of the test. There were 29,617 such pairs, which yielded 59,234 *s, since the computation of a * changes when the potential copier and the person copied from are interchanged.

Group 2: Pairs of examinees who were in different rooms and took different forms of the test. A total of 192,816 such pairs were available, which yielded 385,632 *s.

If it can be assumed that there was no cross-form, cross-room cheating, the Group 2 distribution of *s may be taken as a norm with which to compare distributions arising from other subsamples of examinees on the same test. The highest * for the different-form, different-room group was 4.60, while the distribution for the same-room, same-form group had 95 *s above 4.60. This outcome suggested that there was extensive copying among about 5% of examinees.
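The exact definition of * appears in the article cited above. Purely to illustrate the idea, and not as Frary's actual formula, the sketch below computes a simplified match-count z-score for an ordered pair of examinees. The expected number of matches depends on how popular the source's particular choices were, so the value changes when copier and source are interchanged, as with the * statistic. The function name and the use of a simple group-wide popularity estimate are our assumptions.

```python
import math

def copy_index(copier, source, all_responses):
    """Illustrative (simplified) answer-correspondence z-score -- NOT Frary's g2.

    For each item, the chance of matching the source's choice is estimated as
    that choice's popularity across the whole group.  The observed number of
    matches is then standardized against the total expectation.
    """
    n = len(all_responses)
    expected = variance = 0.0
    for i, choice in enumerate(source):
        # popularity of the source's choice on item i
        p = sum(r[i] == choice for r in all_responses) / n
        expected += p
        variance += p * (1 - p)
    observed = sum(a == b for a, b in zip(copier, source))
    return (observed - expected) / math.sqrt(variance)
```

Even in this simplified form, matching on choices that few others made (typically wrong answers) inflates the index far more than matching on popular correct answers, which is the behavior of the * statistic described later in this paper.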
Further examination of the same-room, same-form *s revealed that, in six of the 11 testing rooms, no *s above 4.60 were observed. These rooms were all relatively small, which suggests that the prevalence of cheating was related to room size. Given these and similar findings over a period of more than 10 years, it appears highly likely that answer copying will occur if a single form of a test is used in a large class with close seating.

On the brighter side, many instructors have been successful in preventing cheating, not always by threat of detection but by more positive measures. Strategies have varied, but some elements common to many have been the following:

1. Let students know in a low-key but sincere manner that cheating is a matter of concern and inform them of the standards they must observe. (Some students have reported that instructors appear unconcerned at reports of cheating or disbelieve such reports without investigating.)

2. Use widely spaced seating or multiple (scrambled) forms of the test (preferably on different colors of paper). In either case, an explanation should be made that these measures are not taken because of general distrust of students, but to protect the majority who would not cheat and to make test taking less stressful. Then students don't have to sit as if wearing blinders to avoid an accidental glance at a neighbor's answer sheet.

3. Be present or have a representative present throughout all examinations. This presence reinforces earlier statements of concern and in no way violates the Honor System. If at all possible, make a seating chart.

While some of the above measures require significant effort by the instructor, it seems unlikely that anything less will alleviate what is a widespread and well-documented problem. A few fortunate classes may be essentially free of cheating even though the opportunity exists; however, even small doubts are better resolved.
Instructors are encouraged to request an analysis of answer correspondence on multiple-choice tests through the Office of Measurement and Research. Should such an analysis reveal evidence of cheating, preventive measures may be taken. If specific pairs of students are identified as having correspondences beyond any reasonable attribution to chance, the instructor may wish to:

1. Advise the suspected students privately of the degree of observed correspondence.

2. Observe suspected pairs on subsequent tests.

3. Collect other evidence, such as that from seating charts or observations of other students in the vicinity of those suspected, and turn this evidence over to the Honor System. Alternatively, student witnesses may wish to undertake this responsibility.

It is indeed unfortunate that there is a need to distribute the foregoing commentary. Even discussing such problems might convey an undesirable preoccupation with the negative side of education, to say nothing of the effect of policing exams, explaining to students the need for multiple forms of tests, etc. More undesirable, however, is the known effect of ignoring the problem or reacting only to the most obvious and flagrant violations. Only a unified effort by the faculty can alleviate this problem.

Interpretation of * Answer Correspondence Statistics

For every pair of examinees who take a multiple-choice test, two * statistics are computed. For examinees A and B, for example, one * statistic reflects the probability that the observed answer correspondence pattern would occur under the hypothesis that A did not copy from B. The other * reflects the same probability under the hypothesis that B did not copy from A. A high (positive) * indicates a low probability that the observed correspondence between answers would occur in the absence of copying. The average * is approximately zero, with negative *s representing correspondences that are less likely than expected.
The probability associated with a given * depends on the number of questions on the test. Obviously, if the test is very short there could be a large number of highly similar answer patterns. Tests as short as ten questions can be analyzed effectively for answer correspondence, but the probabilities associated with higher *s are much lower for tests with 20 or more questions. Table 1 gives the approximate probabilities associated with higher *s for tests of different lengths. Inspection of Table 1 reveals that all the probabilities associated with *s of 4.6 or higher for tests of 20 or more questions are less than 1/10000. Therefore, it can be said that such values (4.6 or higher with 20 or more questions) certainly represent unusual occurrences in the absence of copying.

When probabilities are as low as 1/10000, many people have trouble understanding what they mean. However, it is easy to understand what is meant by saying that, in coin tossing, the probability of getting three heads in a row is about 1/8. This means that if 800 people each toss a coin three times, about 100 of them will toss three heads. To understand what a probability of 1/10000 means, consider the likelihood of being killed in an auto accident within the next year. With 15,000 miles of auto travel, the probability of death is about 1/5000. This means that, among every 5000 people who travel about 15,000 miles per year by auto, we should expect one death in an auto accident. This probability is twice as high as 1/10000. It can be argued that one's personal probability of an auto-related death is lower than 1/5000. Valid reasons for this conclusion might be traveling predominantly in low-traffic areas, always driving carefully, and having exceptionally good reflexes. Nevertheless, the probability of death in an auto accident for almost anyone who drives is somewhat higher than most of the probabilities listed in Table 1.
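Because every ordered pair of examinees yields a *, the per-pair probabilities in Table 1 translate into expected counts of chance occurrences that grow with class size. A small sketch of that arithmetic (the function name is ours; 1/10200 is Table 1's entry for * = 4.6 on a 20-question test):

```python
def expected_chance_highs(class_size, per_pair_probability):
    """Expected number of *s above a threshold when no copying occurs at all.

    Each of the class_size * (class_size - 1) ordered pairs contributes one *,
    and each exceeds the threshold with the given per-pair probability.
    """
    n_statistics = class_size * (class_size - 1)
    return n_statistics * per_pair_probability

# Class of 200 on a 20-question test, threshold * = 4.6 (Table 1: 1/10200):
# 39,800 ordered pairs, about 3.9 chance highs expected
expected = expected_chance_highs(200, 1 / 10200)
```

This is why a fixed * cutoff alone cannot establish copying: even a very small per-pair probability produces a handful of expected chance occurrences in a large class.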
In spite of this macabre fact, most of us drive without worrying very much about being killed. This lack of concern attests to how truly small probabilities of less than 1/10000 are. The probability of being the top winner in a typical state lottery after buying one ticket is greater than most of the probabilities in the lower right quadrant of Table 1.

While the probability of an auto death was used above for illustration, there is an important difference between this occurrence and that of a high *. That difference lies in the assignment of causality: it is almost always possible to attribute a cause to an auto death, e.g., mechanical failure or driver negligence. Extensive copying will almost always cause a high *, but, on rare occasions, high *s occur in the complete absence of copying. We say that these are chance occurrences because we can predict the approximate number that will be observed for a given test length and class size. This is analogous to being able to predict the number of jackpots from a slot machine over a given number of plays. Actually, the slot machine's results are caused in some sense just as surely as auto deaths are, but we do not know the details of this causality, and we refer to chance outcomes because we can estimate the small probability of a jackpot. Similarly, we do not know what causes high *s in the absence of copying, but the probability of their occurrence is known with considerable accuracy.

Because a high * can occur in the absence of copying, it is important to be able to interpret *s in the higher ranges correctly in the context of the Virginia Tech Honor System. To do so, it is first necessary to understand a fundamental fact concerning the operation of any justice system: such systems operate on the basis of probabilities, and there is never complete certainty of guilt or innocence.
The possibility always exists that the guilty may be found not guilty or, of more concern, that the innocent may be found guilty. How high a probability that an innocent defendant will be found guilty can be accommodated within a justice system? One would like to require this probability to be zero, but that would in turn require that all defendants be found not guilty. Perhaps even one incorrect guilty verdict in a thousand cases, or one in ten thousand, seems excessive. Nevertheless, the answer to the preceding question must be reasonable, because the more unlikely it is that an innocent party will be convicted, the more likely it becomes that the guilty will be declared not guilty. Reasonableness has been exceeded when convictions become rare in spite of pretrial investigations that strongly suggest guilt.

One possible approach to setting a probability of innocence below which no reasonable doubt of guilt exists is to consider how many false convictions could be tolerated in a 100-year period. Perhaps one in 100 years is the most that could be justified for maintenance of the system. If there are 50 to 100 cases per year, the resulting probability would be in the range of 1/5000 to 1/10000. Another possible guide to selecting an appropriate probability of convicting the innocent might be to select one low enough that we would not worry very much if it were the probability of our own accidental deaths, say 1/10000.

Given such a level, it would seem simple to identify and convict cheaters: one could simply select a conservatively high * and inflict penalties for all instances exceeding this value. There are two reasons why this approach is not feasible:

1. The two * statistics for a pair of examinees do not indicate who may have copied from whom. Usually the larger statistic will reflect the probability that the higher scorer did not copy from the lower scorer.
This is a natural characteristic of the statistic and has nothing to do with the direction of copying.

2. The number of examinees is related to the number of occurrences of *s above any fixed point. For example, in a class of 200, 39,800 *s will be computed (200 x 199 ordered pairs). If the test has only, say, 20 questions, there will be about four *s above 4.6 in the total absence of copying. In a class of any reasonable size there will be some *s in a moderately high range, say 3.3 to 4.0, again in the complete absence of cheating.

For these reasons, as a matter of policy, the Virginia Tech Honor System does not bring cases to trial solely on the basis of * statistics, however large they may be. Additional evidence is required for trial and conviction. For example, it might be shown that the defendant was seated behind the other party with whom the high * was generated and in a position from which that person's answer sheet could have been seen. In addition or alternatively, there might be witnesses' reports that the defendant appeared to be looking at the answer sheet of the other party. Taken alone, evidence of seat location and apparent copying behavior may not be sufficient to convict a defendant. However, a corresponding high * should influence a panel greatly in the direction of conviction. This is true regardless of the fact that any high * could have occurred in the absence of copying.

An analogy from medicine may make it clearer why this should be the case. Consider the measurement of blood serum cholesterol level. There is a range of observed values that occur in perfectly healthy individuals. The high end of this range has values that are typically observed for persons with existing or incipient circulatory problems. When such a reading occurs, the physician does not immediately conclude that the patient is sick and administer therapy to reduce the cholesterol level. Instead, additional data are gathered.
If all or most signs point to circulatory trouble, treatment is initiated. Considered in this light, even a moderately high * can be meaningful. Just as observing a person appearing to look at a neighbor's answers does not guarantee that copying took place, so a high or moderately high * does not. However, when events such as these occur together, there is much less doubt of guilt.

A few additional points:

1. A * below any set level does not imply innocence. An examinee who copies only a small proportion of answers will probably not generate a very high * with respect to the other party. Alternatively, an examinee may copy a few answers from various other examinees, generating no very high * in the process. Copying extensively from a person who earns a nearly perfect score on a test will also not yield a high *. In this instance, the * will not identify the two sets of responses as unusually similar, since any two highly competent examinees would have a large number of right-answer correspondences.

2. In the computation of a *, values are added up for each test question. When examinee A chooses the same right choice as examinee B, the amount added is usually very small. In fact, if two examinees have only right choices in common, their *s will probably be negative. It is common wrong choices that inflate the *s, especially common wrong choices that were marked by relatively few other examinees. This does not mean that right-answer correspondences have no effect; a cheating index involving only wrong answers would be less sensitive and accurate. This information pertains to the next point.

3. A common defense in cases involving people who sat together and generated high *s is that they studied together. We have investigated this effect by asking students in a large number and wide variety of classes to indicate the extent to which they studied with other class members.
This was accomplished through a question appended to tests in these classes with the following wording:

To what extent did you study for this test with another person or persons in this class?
1) Not at all
2) Less than an hour
3) One to four hours
4) Four to eight hours
5) Over eight hours

In no class was there any meaningful difference between the averages of the *s across the groups reporting different amounts of studying together. Some of the groups reporting the larger amounts of mutual studying were quite small, so that many if not most of the *s for these groups were from pairs who actually studied together.

These results notwithstanding, it is highly implausible that any pair of reasonably competent students would learn large amounts of wrong information jointly. When people study together, they sometimes fail to cover material that will be on a test, but they only rarely share or reinforce each other's misinformation or misconceptions. And even if there is some joint learning of wrong information, it is still beyond plausibility that the instructor would then fortuitously offer a number of wrong choices that matched the misinformation possessed by a specific pair of examinees. Since it is mainly wrong answers that inflate *s (see 2 above), defendants claiming an effect from studying together would have to show that they learned large amounts of wrong information jointly, which the instructor then represented with corresponding wrong choices.

In summary, the following points should be kept in mind when interpreting *s:

1. No evidence proves guilt beyond any shadow of a doubt. There is some doubt about every judicial decision. If the standard for conviction is evidence that leaves no reasonable doubt, someone applying this standard should have some idea of how small the probability of innocence must be in order to conclude that there is no reasonable doubt of guilt. Then the probability associated with a * can be evaluated properly in a particular case.

2.
While a high * does not guarantee that copying occurred, it strongly corroborates other evidence to this effect.

3. The fact that an accused pair studied together is almost never the cause of a high *.

4. Copying a small number of answers will usually not result in a large *, nor will extensive copying from someone with a very high score. Therefore, a low * does not imply innocence.

5. The relative sizes of the two *s for a pair of examinees do not indicate the direction of possible copying. This must be established from witnesses or seating charts.

Table 1
Approximate Probabilities of an Observed * in the Absence of Copying
[Note: k = 000]

                        Number of Test Questions
  *       10        20         30          40          50           100
 4.6   1/1550   1/10200    1/25800     1/45600     1/66600       1/160k
 4.8   1/2050   1/16100    1/45200     1/85100     1/131k        1/355k
 5.0   1/2710   1/25200    1/78800     1/160k      1/259k        1/806k
 5.2   1/3550   1/39200    1/137k      1/301k      1/515k        1/1860k
 5.4   1/4620   1/61k      1/240k      1/568k      1/1030k       1/4360k
 5.6   1/5980   1/94k      1/417k      1/1070k     1/2080k       1/10400k
 5.8   1/7710   1/145k     1/724k      1/2040k     1/4210k       1/25100k
 6.0   1/9880   1/223k     1/1260k     1/3860k     1/8550k       1/61600k
 6.5   1/18k    1/634k     1/4900k     1/19100k    1/50600k      1/629000k
 7.0   1/32k    1/1740k    1/18700k    1/93300k    1/301000k     1/3210000k

For more information, contact:

Robert B. Frary, Director of Measurement and Research Services
2096 Derring Hall
Virginia Polytechnic Institute and State University
Blacksburg, VA 24060
703/231-5413 (voice)
frary@vtvm1.cc.vt.edu

###