VPI Occasional Paper: Detection of Answer Copying on Multiple-Choice Tests and Interpretation of _g_ [subscript] 2 Statistics

Robert B. Frary
Virginia Polytechnic Institute and State University

[Note: * will be substituted for _g_ [subscript] 2 throughout the remainder of the text.]

When there is a high degree of correspondence between the responses of two examinees on a multiple-choice test, it is only natural to suspect cheating. However, other possibilities must be considered. If both examinees are excellent students, the correspondence will be unavoidably high because they will both answer a large proportion of items correctly. Therefore, any method of detecting cheating must take into consideration the score levels of the examinees.

The popularity of right and wrong choices must also be taken into consideration. For example, consider a suspected pair who had 20 choices in common on a 30-item test. Fifteen of their common choices were correct answers to the easiest questions, and the five wrong choices in common were the most popular wrong answers to very hard questions. Contrast this pair with another pair who also had 20 choices in common, but for whom 15 of the common choices were wrong answers to relatively easy questions. Obviously, it is very unlikely that by sheer chance two examinees would choose that many wrong answers in common.

As an illustration, consider the probability of throwing 15 or more doubles in 20 throws of a pair of dice. The probability of this event occurring is less than one in 70 million. Of course, at the other end of the continuum, there is some degree of response similarity to be expected: the probability of throwing no doubles in 20 throws of dice is about one in 40. The job for someone interested in detecting cheating is to produce a statistic which reflects the probability that an observed degree of correspondence between right as well as wrong answers was due to chance.
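The dice figures above can be checked directly with the binomial distribution. The following sketch (standard library only; the function name is ours) computes both probabilities:

```python
from math import comb

def binom_tail(n, k_min, p):
    """P(at least k_min successes in n independent trials, each of probability p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))

p_double = 1 / 6  # chance of a double on one throw of a pair of dice

# 15 or more doubles in 20 throws: less than one in 70 million
p_many = binom_tail(20, 15, p_double)

# no doubles at all in 20 throws: about one in 40
p_none = (1 - p_double) ** 20
```

Evaluating these expressions gives roughly 1 in 71 million and 1 in 38, consistent with the figures quoted above.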
Toward this end we have developed a statistic which is sensitive to examinee score level and to the popularity of right and wrong choices. It is similar to the _t_ statistic and is approximately normally distributed for tests of reasonable length and moderately large numbers of examinees. A complete description of this statistic, called the * statistic, may be found in the _Journal of Educational Statistics_ 2 (1977): 235-256.

We have used the * statistic extensively at Virginia Tech. One case started with an instructor's wish to determine the extent to which various security measures were preventing cheating on examinations in a multi-section course. The examination chosen for analysis had 30 four-choice items responded to by 944 examinees in 11 rooms, all of whom took the test simultaneously. There were two forms of the test, designated A and B, the second a scrambled version of the first, and these were passed out alternately in each room. Two distributions of *s were produced for mutually exclusive groups:

Group 1: Pairs of examinees who were in the same room and took the same form of the test. There were 29,617 such pairs, which yielded 59,234 *s, since the computation of a * changes when the potential copier and the person copied from are interchanged.

Group 2: Pairs of examinees who were in different rooms and took different forms of the test. A total of 192,816 such pairs were available, which yielded 385,632 *s.

If it can be assumed that there was no cross-form, cross-room cheating, the Group 2 distribution of *s may be taken as a norm with which to compare distributions arising from other subsamples of examinees on the same test. The highest * for the different-form, different-room group was 4.60, while the distribution for the same-room, same-form group had 95 *s above 4.60. This outcome suggested that there was extensive copying among about 5% of examinees.
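The exact definition of * appears in the article cited above. Purely to illustrate the idea, and not as Frary's actual formula, the sketch below computes a simplified match-count z-score for an ordered pair of examinees. The expected number of matches depends on how popular the source's particular choices were, so the value changes when copier and source are interchanged, as with the * statistic. The function name and the use of a simple group-wide popularity estimate are our assumptions.

```python
import math

def copy_index(copier, source, all_responses):
    """Illustrative (simplified) answer-correspondence z-score -- NOT Frary's g2.

    For each item, the chance of matching the source's choice is estimated as
    that choice's popularity across the whole group.  The observed number of
    matches is then standardized against the total expectation.
    """
    n = len(all_responses)
    expected = variance = 0.0
    for i, choice in enumerate(source):
        # popularity of the source's choice on item i
        p = sum(r[i] == choice for r in all_responses) / n
        expected += p
        variance += p * (1 - p)
    observed = sum(a == b for a, b in zip(copier, source))
    return (observed - expected) / math.sqrt(variance)
```

Even in this simplified form, matching on choices that few others made (typically wrong answers) inflates the index far more than matching on popular correct answers, which is the behavior of the * statistic described later in this paper.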
Further examination of the same-room, same-form *s revealed that, in six of the 11 testing rooms, no *s above 4.60 were observed. These rooms were all relatively small, which suggests that the prevalence of cheating was related to room size. Given these and similar findings over a period of more than 10 years, it appears highly likely that answer copying will occur if a single form of a test is used in a large class with close seating.

On the brighter side, many instructors have been successful in preventing cheating, not always by threat of detection but by more positive measures. Strategies have varied, but some elements common to many have been the following:

1. Let students know in a low-key but sincere manner that cheating is a matter of concern and inform them of the standards they must observe. (Some students have reported that instructors appear unconcerned at reports of cheating or disbelieve such reports without investigating.)

2. Use widely spaced seating or multiple (scrambled) forms of the test (preferably on different colors of paper). In either case, an explanation should be made that these measures are not taken because of general distrust of students, but to protect the majority who would not cheat and to make test taking less stressful. Then students don't have to sit as if wearing blinders to avoid an accidental glance at a neighbor's answer sheet.

3. Be present or have a representative present throughout all examinations. This presence reinforces earlier statements of concern and in no way violates the Honor System. If at all possible, make a seating chart.

While some of the above measures require significant effort by the instructor, it seems unlikely that anything less will alleviate what is a widespread and well-documented problem. A few fortunate classes may be essentially free of cheating even though the opportunity exists; however, even small doubts are better resolved.
Instructors are encouraged to request an analysis of answer correspondence on multiple-choice tests through the Office of Measurement and Research. Should such an analysis reveal evidence of cheating, preventive measures may be taken. If specific pairs of students are identified as having correspondences beyond any reasonable attribution to chance, the instructor may wish to:

1. Advise the suspected students privately of the degree of observed correspondence.

2. Observe suspected pairs on subsequent tests.

3. Collect other evidence, such as that from seating charts or observations of other students in the vicinity of those suspected, and turn this evidence over to the Honor System. Alternatively, student witnesses may wish to undertake this responsibility.

It is indeed unfortunate that there is a need to distribute the foregoing commentary. Even discussing such problems might convey an undesirable preoccupation with the negative side of education, to say nothing of the effect of policing exams, explaining to students the need for multiple forms of tests, etc. More undesirable, however, is the known effect of ignoring the problem or reacting only to the most obvious and flagrant violations. Only a unified effort by the faculty can alleviate this problem.

Interpretation of * Answer Correspondence Statistics

For every pair of examinees who take a multiple-choice test, two * statistics are computed. For examinees A and B, for example, one * statistic reflects the probability that the observed answer correspondence pattern would occur under the hypothesis that A did not copy from B. The other * reflects the same probability under the hypothesis that B did not copy from A. A high (positive) * indicates a low probability that the observed correspondence between answers would occur in the absence of copying. The average * is approximately zero, with negative *s representing correspondences that are less likely than expected.
The probability associated with a given * depends on the number of questions on the test. Obviously, if the test is very short there could be a large number of highly similar answer patterns. Tests as short as ten questions can be analyzed effectively for answer correspondence, but the probabilities associated with higher *s are much lower for tests with 20 or more questions. Table 1 gives the approximate probabilities associated with higher *s for tests of different lengths. Inspection of Table 1 reveals that all the probabilities associated with *s of 4.6 or higher for tests of 20 or more questions are less than 1/10000. Therefore, it can be said that such values (4.6 or higher with 20 or more questions) certainly represent unusual occurrences in the absence of copying.

When probabilities are as low as 1/10000, many people have trouble understanding what they mean. However, it is easy to understand what is meant by saying that, in coin tossing, the probability of getting three heads in a row is about 1/8. This means that if 800 people each toss a coin three times, about 100 of them will toss three heads. To understand what a probability of 1/10000 means, consider the likelihood of being killed in an auto accident within the next year. With 15,000 miles of auto travel, the probability of death is about 1/5000. This means that, among every 5000 people who travel about 15,000 miles per year by auto, we should expect one death in an auto accident. This probability is twice as high as 1/10000. It can be argued that one's personal probability of an auto-related death is lower than 1/5000. Valid reasons for this conclusion might be traveling predominantly in low-traffic areas, always driving carefully, and having exceptionally good reflexes. Nevertheless, the probability of death in an auto accident for almost anyone who drives is somewhat higher than most of the probabilities listed in Table 1.
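Because every ordered pair of examinees yields a *, the per-pair probabilities in Table 1 translate into expected counts of chance occurrences that grow with class size. A small sketch of that arithmetic (the function name is ours; 1/10200 is Table 1's entry for * = 4.6 on a 20-question test):

```python
def expected_chance_highs(class_size, per_pair_probability):
    """Expected number of *s above a threshold when no copying occurs at all.

    Each of the class_size * (class_size - 1) ordered pairs contributes one *,
    and each exceeds the threshold with the given per-pair probability.
    """
    n_statistics = class_size * (class_size - 1)
    return n_statistics * per_pair_probability

# Class of 200 on a 20-question test, threshold * = 4.6 (Table 1: 1/10200):
# 39,800 ordered pairs, about 3.9 chance highs expected
expected = expected_chance_highs(200, 1 / 10200)
```

This is why a fixed * cutoff alone cannot establish copying: even a very small per-pair probability produces a handful of expected chance occurrences in a large class.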
In spite of this macabre fact, most of us drive without worrying very much about being killed. This lack of concern attests to how truly small probabilities of less than 1/10000 are. The probability of being the top winner in a typical state lottery after buying one ticket is greater than most of the probabilities in the lower right quadrant of Table 1.

While the probability of an auto death was used above for illustration, there is an important difference between this occurrence and that of a high *. That difference lies in the assignment of causality: it is almost always possible to attribute a cause to an auto death, e.g., mechanical failure or driver negligence. Extensive copying will almost always cause a high *, but, on rare occasions, high *s occur in the complete absence of copying. We say that these are chance occurrences because we can predict the approximate number that will be observed for a given test length and class size. This is analogous to being able to predict the number of jackpots from a slot machine over a given number of plays. Actually, the slot machine's results are caused in some sense just as surely as auto deaths are, but we do not know the details of this causality, and we refer to chance outcomes because we can estimate the small probability of a jackpot. Similarly, we do not know what causes high *s in the absence of copying, but the probability of their occurrence is known with considerable accuracy.

Because a high * can occur in the absence of copying, it is important to be able to interpret *s in the higher ranges correctly in the context of the Virginia Tech Honor System. To do so, it is first necessary to understand a fundamental fact concerning the operation of any justice system: such systems operate on the basis of probabilities, and there is never complete certainty of guilt or innocence.
The possibility always exists that the guilty may be found not guilty or, of more concern, that the innocent may be found guilty. How high a probability that an innocent defendant will be found guilty can be accommodated within a justice system? One would like to require this probability to be zero, but that would in turn require that all defendants be found not guilty. Perhaps even one incorrect guilty verdict in a thousand cases, or one in ten thousand, seems excessive. Nevertheless, the answer to the preceding question must be reasonable, because the more unlikely it is that an innocent party will be convicted, the more likely it becomes that the guilty will be declared not guilty. Reasonableness has been exceeded when convictions become rare in spite of pretrial investigations that strongly suggest guilt.

One possible approach to setting a probability of innocence below which no reasonable doubt of guilt exists is to consider how many false convictions could be tolerated in a 100-year period. Perhaps one in 100 years is the most that could be justified for maintenance of the system. If there are 50 to 100 cases per year, the resulting probability would be in the range of 1/5000 to 1/10000. Another possible guide to selecting an appropriate probability of convicting the innocent might be to select one low enough that we would not worry very much if it were the probability of our own accidental deaths, say 1/10000.

Given such a level, it would seem simple to identify and convict cheaters: one could simply select a conservatively high * and inflict penalties for all instances exceeding this value. There are two reasons why this approach is not feasible:

1. The two * statistics for a pair of examinees do not indicate who may have copied from whom. Usually the larger statistic will reflect the probability that the higher scorer did not copy from the lower scorer.
This is a natural characteristic of the statistic and has nothing to do with the direction of copying.

2. The number of examinees is related to the number of occurrences of *s above any fixed point. For example, in a class of 200, 39,800 *s will be computed (200 x 199 ordered pairs). If the test has only, say, 20 questions, there will be about four *s above 4.6 in the total absence of copying. In a class of any reasonable size there will be some *s in a moderately high range, say 3.3 to 4.0, again in the complete absence of cheating.

For these reasons, as a matter of policy, the Virginia Tech Honor System does not bring cases to trial solely on the basis of * statistics, however large they may be. Additional evidence is required for trial and conviction. For example, it might be shown that the defendant was seated behind the other party with whom the high * was generated and in a position from which that person's answer sheet could have been seen. In addition or alternatively, there might be witnesses' reports that the defendant appeared to be looking at the answer sheet of the other party. Taken alone, evidence of seat location and apparent copying behavior may not be sufficient to convict a defendant. However, a corresponding high * should influence a panel greatly in the direction of conviction. This is true regardless of the fact that any high * could have occurred in the absence of copying.

An analogy from medicine may make it clearer why this should be the case. Consider the measurement of blood serum cholesterol level. There is a range of observed values that occur in perfectly healthy individuals. The high end of this range has values that are typically observed for persons with existing or incipient circulatory problems. When such a reading occurs, the physician does not immediately conclude that the patient is sick and administer therapy to reduce the cholesterol level. Instead, additional data are gathered.
If all or most signs point to circulatory trouble, treatment is initiated. Considered in this light, even a moderately high * can be meaningful. Just as observing a person appearing to look at a neighbor's answers does not guarantee that copying took place, so a high or moderately high * does not. However, when events such as these occur together, there is much less doubt of guilt.

A few additional points:

1. A * below any set level does not imply innocence. An examinee who copies only a small proportion of answers will probably not generate a very high * with respect to the other party. Alternatively, an examinee may copy a few answers from various other examinees, generating no very high * in the process. Copying extensively from a person who earns a nearly perfect score on a test will also not yield a high *. In this instance, the * will not identify the two sets of responses as unusually similar, since any two highly competent examinees would have a large number of right-answer correspondences.

2. In the computation of a *, values are added up for each test question. When examinee A chooses the same right choice as examinee B, the amount added is usually very small. In fact, if two examinees have only right choices in common, their *s will probably be negative. It is common wrong choices that inflate the *s, especially common wrong choices that were marked by relatively few other examinees. This does not mean that right-answer correspondences have no effect; a cheating index involving only wrong answers would be less sensitive and accurate. This information pertains to the next point.

3. A common defense in cases involving people who sat together and generated high *s is that they studied together. We have investigated this effect by asking students in a large number and wide variety of classes to indicate the extent to which they studied with other class members.
This was accomplished through a question appended to tests in these classes with the following wording:

To what extent did you study for this test with another person or persons in this class?
1) Not at all
2) Less than an hour
3) One to four hours
4) Four to eight hours
5) Over eight hours

In no class was there any meaningful difference between the averages of the *s across the groups reporting different amounts of studying together. Some of the groups reporting the larger amounts of mutual studying were quite small, so that many if not most of the *s for these groups were from pairs who actually studied together.

These results notwithstanding, it is highly implausible that any pair of reasonably competent students would learn large amounts of wrong information jointly. When people study together, they sometimes fail to cover material that will be on a test, but they only rarely share or reinforce each other's misinformation or misconceptions. And even if there is some joint learning of wrong information, it is still beyond plausibility that the instructor would then fortuitously offer a number of wrong choices that matched the misinformation possessed by a specific pair of examinees. Since it is mainly wrong answers that inflate *s (see 2 above), defendants claiming an effect from studying together would have to show that they learned large amounts of wrong information jointly, which the instructor then represented with corresponding wrong choices.

In summary, the following points should be kept in mind when interpreting *s:

1. No evidence proves guilt beyond any shadow of a doubt. There is some doubt about every judicial decision. If the standard for conviction is evidence that leaves no reasonable doubt, someone applying this standard should have some idea of how small the probability of innocence must be in order to conclude that there is no reasonable doubt of guilt. Then the probability associated with a * can be evaluated properly in a particular case.

2.
While a high * does not guarantee that copying occurred, it strongly corroborates other evidence to this effect.

3. The fact that an accused pair studied together is almost never the cause of a high *.

4. Copying a small number of answers will usually not result in a large *, nor will extensive copying from someone with a very high score. Therefore, a low * does not imply innocence.

5. The relative sizes of the two *s for a pair of examinees do not indicate the direction of possible copying. This must be established from witnesses or seating charts.

Table 1
Approximate Probabilities of an Observed * in the Absence of Copying
[Note: k = 000]

                        Number of Test Questions
  *       10        20         30          40          50           100
 4.6   1/1550   1/10200    1/25800     1/45600     1/66600       1/160k
 4.8   1/2050   1/16100    1/45200     1/85100     1/131k        1/355k
 5.0   1/2710   1/25200    1/78800     1/160k      1/259k        1/806k
 5.2   1/3550   1/39200    1/137k      1/301k      1/515k        1/1860k
 5.4   1/4620   1/61k      1/240k      1/568k      1/1030k       1/4360k
 5.6   1/5980   1/94k      1/417k      1/1070k     1/2080k       1/10400k
 5.8   1/7710   1/145k     1/724k      1/2040k     1/4210k       1/25100k
 6.0   1/9880   1/223k     1/1260k     1/3860k     1/8550k       1/61600k
 6.5   1/18k    1/634k     1/4900k     1/19100k    1/50600k      1/629000k
 7.0   1/32k    1/1740k    1/18700k    1/93300k    1/301000k     1/3210000k

For more information, contact:

Robert B. Frary, Director of Measurement and Research Services
2096 Derring Hall
Virginia Polytechnic Institute and State University
Blacksburg, VA 24060
703/231-5413 (voice)
frary@vtvm1.cc.vt.edu

###