VPI Occasional Paper: Detection of Answer Copying on
Multiple-Choice Tests and Interpretation of g₂ Statistics
by
Robert B. Frary
Virginia Polytechnic Institute and State University
[Note: * will be substituted for g₂ throughout the
remainder of the text.]
When there is a high degree of correspondence between the
responses of two examinees on a multiple-choice test, it is only
natural to suspect cheating. However, other possibilities must be
considered. If both examinees are excellent students, the
correspondence will be unavoidably high because they will both
answer a large proportion of items correctly. Therefore, any
method of detecting cheating must take into consideration the
score levels of the examinees.
The popularity of right and wrong choices must also be taken into
consideration. For example, consider a suspected pair who had 20
choices in common on a 30-item test. Fifteen of their common
choices were correct answers to the easiest questions, and the five
wrong choices in common were the most popular wrong answers to
very hard questions. Contrast this pair with another pair who also
had 20 choices in common. However, for this pair 15 of the common
choices were wrong answers to relatively easy questions. Obviously,
it is very unlikely that by sheer chance two examinees would choose
that many wrong answers in common. As an illustration, consider
the probability of throwing 15 or more doubles in 20 throws of
a pair of dice. The probability of this event occurring is less than one
in 70 million.
Of course at the other end of the continuum, there is some degree
of response similarity to be expected. The probability of throwing
no doubles in 20 throws of dice is about one in 40. The job for
someone interested in detecting cheating is to produce a statistic
which reflects the probability that an observed degree of
correspondence between right as well as wrong answers was due
to chance.
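The dice figures above can be checked with a short binomial calculation. This sketch is illustrative only (it is not part of the * computation itself); each throw is treated as an independent trial with probability 1/6 of a double.

```python
from math import comb

def prob_at_least(k_min: int, n: int, p: float) -> float:
    """Exact upper-tail sum P(X >= k_min) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k_min, n + 1))

# 15 or more doubles in 20 throws of a pair of dice (p = 1/6 per throw)
p_many = prob_at_least(15, 20, 1/6)   # less than 1 in 70 million

# no doubles at all in 20 throws
p_none = (5/6) ** 20                  # about 1 in 40
```

Both results agree with the figures quoted in the text.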
Toward this end we have developed a statistic which is sensitive
to examinee score level and the popularity of right and wrong
choices. It is similar to the _t_ statistic and is approximately
normally distributed for tests of reasonable length and
moderately large numbers of examinees. A complete description
of this statistic, called the * statistic, may be found in the _Journal
of Educational Statistics_ 2 (1977): 235-256.
We have used the * statistic extensively at Virginia Tech. One
case started with an instructor's wish to determine the extent to
which various security measures were preventing cheating on
examinations in a multi-section course. The examination chosen
for analysis had 30 4-choice items responded to by 944 examinees
in 11 rooms, all of whom took the test simultaneously. There were
two forms of the test, designated A and B, the second a scrambled
version of the first, and these were passed out alternately in each
room.
Two distributions of *s were produced for mutually exclusive
groups:
Group 1
Pairs of examinees who were in the same room and took the same
form of the test. There were 29,617 such pairs, which yielded
59,234 *s, since the computation of a * changes when the potential
copier and the person copied from are interchanged.
Group 2
Pairs of examinees who were in different rooms and took different
forms of the test. A total of 192,816 such pairs were available, which
yielded 385,632 *s.
If it can be assumed that there was no cross-form, cross-room cheating,
the Group 2 distribution of *s may be taken as a norm with which to
compare other distributions arising from other subsamples of
examinees on the same test. The highest * for the different form
and room group was 4.60. The distribution of the same room and
form group had 95 *s above 4.60. This outcome suggested that
there was extensive copying among about 5% of examinees. Further
examination of the same-room, same-form *s revealed that, in six of the
11 testing rooms, no *s above 4.60 were observed. These rooms
were all relatively small, which suggests that the prevalence of
cheating was related to room size.
Given these and similar findings over a period of more than 10
years, it appears highly likely that answer copying will occur if
a single form of a test is used in a large class with close seating.
On the brighter side, many instructors have been successful in
preventing cheating, not always by threat of detection but by more
positive measures. Strategies have varied, but some elements
common to many have been the following:
1. Let students know in a low-key but sincere manner that cheating
is a matter of concern and inform them of the standards they must
observe. (Some students have reported that instructors appear
unconcerned at reports of cheating or disbelieve such reports without
investigating.)
2. Use widely spaced seating or multiple (scrambled) forms of the test
(preferably on different colors of paper). In either case, an explanation
should be made that these measures are not taken because of general
distrust of students, but to protect the majority who would not cheat
and to make test taking less stressful. Then students don't have to sit
as if wearing blinders to avoid an accidental glance at a neighbor's
answer sheet.
3. Be present or have a representative present throughout all
examinations. This presence reinforces earlier statements of concern
and in no way violates the Honor System. If at all possible, make a
seating chart.
While some of the above measures require significant effort by
the instructor, it seems unlikely that anything less will alleviate
what is a widespread and well documented problem. A few
fortunate classes may be essentially free of cheating even though
the opportunity exists; however, even small doubts are better
resolved. Instructors are encouraged to request an analysis of
answer correspondence on multiple-choice tests through the
Office of Measurement and Research.
Should such an analysis reveal evidence of cheating, preventive
measures may be taken. If specific pairs of students are identified
as having correspondences beyond any reasonable attribution to
chance the instructor may wish to:
1. Advise the suspected students privately of the degree of observed
correspondence.
2. Observe suspected pairs on subsequent tests.
3. Collect other evidence, such as that from seating charts or
observations of other students in the vicinity of those suspected,
and turn this evidence over to the Honor System. Alternatively,
student witnesses may wish to undertake this responsibility.
It is indeed unfortunate that there is a need to distribute the
foregoing commentary. Even discussing such problems might
convey an undesirable preoccupation with the negative side of
education, to say nothing of the effect of policing exams,
explaining to students the need for multiple forms of tests, etc.
More undesirable, however, is the known effect of ignoring
the problem or reacting only to the most obvious and flagrant
violations. Only a unified effort by the faculty can alleviate
this problem.
Interpretation of * Answer Correspondence Statistics
For every pair of examinees who take a multiple-choice test,
two * statistics are computed. For examinees A and B, for
example, one * statistic reflects the probability that the observed
answer correspondence pattern would occur, under the
hypothesis that A did not copy from B. The other * reflects the
same probability under the hypothesis that B did not copy from
A. A high (positive) * yields a low probability that the observed
correspondence between answers would occur in the absence of
copying. The average * is approximately zero, with negative *s
representing correspondences that are less likely than expected.
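The full * formula is given in the 1977 _Journal of Educational Statistics_ article cited above; its logic, however, can be illustrated with a simplified standardized match index. Everything in the sketch below — the function name, its inputs, and the uniform choice probabilities in the example — is a hypothetical simplification for exposition, not the published computation.

```python
import math

def match_index(copier, source, choice_probs):
    """Standardized count of answer matches under a no-copying hypothesis.

    copier, source -- lists of the options each examinee marked, item by item
    choice_probs   -- for each item, a dict giving the estimated probability
                      that the copier would mark each option independently
                      (in practice these probabilities would reflect option
                      popularity and the copier's score level)
    """
    matches = sum(c == s for c, s in zip(copier, source))
    # expected number of matches, and its variance, if the copier
    # answered every item independently of the source
    expected = sum(p[s] for p, s in zip(choice_probs, source))
    variance = sum(p[s] * (1 - p[s]) for p, s in zip(choice_probs, source))
    return (matches - expected) / math.sqrt(variance)
```

Note the asymmetry: the index conditions on the source's answers, so interchanging the two examinees generally changes the value — which is why two statistics are computed for every pair.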
The probability associated with a given * depends on the number
of questions on the test. Obviously if the test is very short there
could be a large number of highly similar answer patterns. Tests
as short as ten questions can be analyzed effectively for answer
correspondence, but the probabilities associated with higher *s
are much lower for tests with 20 or more questions. Table 1
gives the approximate probabilities associated with higher *s
for tests of different lengths.
Inspection of Table 1 reveals that all the probabilities associated
with *s of 4.6 or higher for tests of 20 or more questions are less
than 1/10000. Therefore, it can be said that such values (4.6 or
higher with 20 or more questions) certainly represent unusual
occurrences in the absence of copying.
When probabilities are as low as 1/10000, many people have
trouble understanding what they mean. However, it is easy to
understand what is meant by saying that, in coin tossing, the
probability of getting three heads in a row is about 1/8. This
means that if 800 people each toss a coin three times, about 100
of them will toss three heads. To understand what a probability
of 1/10000 means, consider the likelihood of being killed in an
auto accident within the next year. With 15,000 miles of auto
travel, the probability of death is about 1/5000. This means that,
among every 5000 people who travel about 15,000 miles per year
by auto, we should expect one death in an auto accident. This
probability is twice as high as 1/10000. It can be argued that one's
personal probability of an auto related death is lower than 1/5000.
Valid reasons for this conclusion might be traveling predominantly
in low-traffic areas, always driving carefully, and having exceptionally
good reflexes. Nevertheless, the probability of death in an auto
accident for almost anyone who drives is somewhat more than
most of the probabilities listed in Table 1. In spite of this macabre
fact, most of us drive without worrying very much about being
killed. This lack of concern attests to the truly small character of
probabilities of less than 1/10000. The probability of being the
top winner in a typical state lottery after buying one ticket is
greater than most of the probabilities in the lower right quadrant
of Table 1.
While the probability of an auto death was used above for
illustration, there is an important difference between this
occurrence and that of a high *. That difference is in assignment
of causality; it is almost always possible to attribute a cause to an
auto death, e.g., mechanical failure or driver negligence. Extensive
copying will almost always cause a high *, but, on rare occasions,
high *s occur in the complete absence of copying. We say that these
are chance occurrences because we can predict the approximate
number that will be observed for a given test length and class size.
This is analogous to being able to predict the number of jackpots
from a slot machine over a given number of plays. Actually, the slot
machine's results are caused in some sense just as much so as are auto
deaths, but we do not know the details of this causality and refer to
chance outcomes because of being able to estimate the small
probability of a jackpot. Similarly, we do not know what causes high
*s in the absence of copying, but the probability of their occurrence
is known with considerable accuracy.
Because a high * can occur in the absence of copying, it is important
to be able to interpret *s in the higher ranges correctly in the context
of the Virginia Tech Honor System. To do so, it is first necessary to
understand a fundamental fact concerning the operation of any justice
system. That fact is that such systems operate on the basis of
probabilities, and that there is never complete certainty of guilt or
innocence. The possibility always exists that the guilty may be found
not guilty or, of more concern, that the innocent may be found guilty.
How high a probability that an innocent defendant will be found guilty
can be accommodated within a justice system? One would like to
require this probability to be zero, but this would in turn require
that all defendants be found not guilty. Perhaps even one incorrect
guilty verdict in a thousand cases or one in ten thousand seems
excessive. Nevertheless, the answer to the preceding question must
be reasonable, because the more unlikely it is that an innocent
party will be convicted, the more likely it becomes that the guilty
will be declared not guilty. Reasonableness has been exceeded
when convictions become rare in spite of pretrial investigations that
strongly suggest guilt.
One possible approach to setting a suitable probability of innocence
below which no reasonable doubt of guilt exists is to consider how
many false convictions could be tolerated in a 100-year period.
Perhaps one in 100 years is the most that could be justified for
maintenance of the system. If there are 50 to 100 cases per year,
the result would be in the range of 1/5000 to 1/10000. Another
possible guide to selecting an appropriate probability of convicting
the innocent might be to select one which is low enough so that
we would not worry very much if it were the probability of our
own accidental deaths, say 1/10000.
Given such a level, it would seem simple to identify and convict
cheaters. One could simply select a conservatively high * and
inflict penalties for all instances exceeding this value. There are
two reasons why this approach is not feasible:
1. The two * statistics for a pair of examinees do not indicate
who may have copied from whom. Usually the larger statistic
will reflect the probability that the higher scorer did not copy
from the lower scorer. This is a natural characteristic of the
statistic and has nothing to do with the direction of copying.
2. The number of examinees is related to the number of
occurrences of *s above any fixed point. For example, in a
class of 200, 39,800 *s will be computed. If the test has
only, say, 20 questions, there will be about four *s above 4.6
in the total absence of copying. In a class of any reasonable
size there will be some *s in a moderately high range, say,
3.3 to 4.0, again, in the complete absence of cheating.
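The arithmetic behind point 2 can be spelled out directly, using the figures given above (200 examinees, a 20-question test, and the Table 1 tail probability of 1/10,200 for *s above 4.6):

```python
n = 200
pairs = n * (n - 1) // 2            # 19,900 unordered pairs of examinees
num_stats = 2 * pairs               # two *s per pair -> 39,800 statistics
p_tail = 1 / 10_200                 # Table 1: P(* > 4.6), 20-question test
expected_high = num_stats * p_tail  # about 4 high *s with no copying at all
```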
For these reasons, as a matter of policy, the Virginia Tech
Honor System does not bring cases to trial solely on the basis
of * statistics, however large they may be. Additional evidence
is required for trial and conviction. For example, it might be
shown that the defendant was seated behind the other party
with whom the high * was generated and in a position from
which that person's answer sheet could have been seen. In
addition or alternatively, there might be witnesses' reports
that the defendant appeared to be looking at the answer sheet
of the other party.
Taken alone, evidence of seat location and apparent copying
behavior may not be sufficient to convict a defendant. However,
a corresponding high * should influence a panel greatly in the
direction of conviction. This is true regardless of the fact that
any high * could have occurred in the absence of copying.
An analogy from medicine may make it more clear why this
should be the case. Consider the measurement of blood serum
cholesterol level. There is a range of observed values that occur
in perfectly healthy individuals. The high end of this range has
values that are typically observed for persons with existing or
incipient circulatory problems. When such a reading occurs, the
physician does not immediately conclude that the patient is sick
and administer therapy to reduce the cholesterol level. Instead,
additional data are gathered. If all or most signs point to
circulatory trouble, treatment is initiated. Considered in this
light, even a moderately high * can be meaningful. Just as
observing a person appearing to look at a neighbor's answers
does not guarantee that copying took place, so a high or
moderately high * does not. However, when events such as
these occur together there is much less doubt of guilt.
A few additional points:
1. A * below any set level does not imply innocence. An
examinee who copies only a small proportion of answers
will probably not generate a very high * with respect to
the other party. Alternatively, an examinee may copy a
few answers from various other examinees generating no
very high * in the process. Copying extensively from a
person who earns a nearly perfect score on a test will also
not yield a high *. In this instance, the * will not identify
the two sets of responses as unusually similar, since any two
highly competent examinees would have a large number
of right-answer correspondences.
2. In the computation of a *, values are added up for each
test question. When examinee A chooses the same right
choice as examinee B, the amount added is usually very small.
In fact, if two examinees have only right choices in common,
their *s will probably be negative. It is common wrong choices
that inflate the *s, especially common wrong choices that were
marked by relatively few other examinees. This does not mean
that right-answer correspondences have no effect. A cheating
index involving only wrong answers would be less sensitive
and accurate. This information pertains to the next point.
3. A common defense in cases involving people who sat
together and generated high *s is that they studied together.
We have investigated this effect by asking students in a large
number and wide variety of classes to indicate the extent to
which they studied with other class members. This was
accomplished through a question appended to tests in these
classes with the following wording:
To what extent did you study for this test with another person
or persons in this class?
1) Not at all
2) Less than an hour
3) One to four hours
4) Four to eight hours
5) Over eight hours
In no class was there any meaningful difference between the
averages of the *s across the groups reporting different
amounts of studying together. Some of the groups reporting
the larger amounts of mutual studying were quite small, so that
many if not most of the *s for these groups were from pairs
who actually studied together. These results notwithstanding,
it is highly implausible that any pair of reasonably competent
students would learn large amounts of wrong information
jointly. When people study together, they sometimes fail to
cover material that will be on a test, but they only rarely share
or reinforce each other's misinformation or misconceptions.
And even if there is some joint learning of wrong information,
it is still beyond plausibility that the instructor would then
fortuitously offer a number of wrong choices that matched the
misinformation possessed by a specific pair of examinees. Since
it is mainly wrong answers that inflate *s (see 2 above),
defendants claiming an effect from studying together would
have to show that they learned large amounts of wrong information
jointly, which the instructor then represented with corresponding
wrong choices.
In summary, the following points should be kept in mind when
interpreting *s:
1. No evidence proves guilt beyond any shadow of a doubt. There
is some doubt about every judicial decision. If the standard for
conviction is evidence that leaves no reasonable doubt, someone
applying this standard should have some idea of how small the
probability of innocence must be in order to conclude that there is
no reasonable doubt of guilt. Then the probability associated with
a * can be evaluated properly in a particular case.
2. While a high * does not guarantee that copying occurred, it
strongly corroborates other evidence to this effect.
3. The fact that an accused pair studied together is almost never
the cause of a high *.
4. Copying a small number of answers will usually not result in a
large *, nor will extensive copying from someone with a very high
score. Therefore, a low * does not imply innocence.
5. The relative sizes of the two *s for a pair of examinees do not
indicate the direction of possible copying. This must be established
from witnesses or seating charts.
Table 1
Approximate Probabilities of an Observed * in the Absence of Copying
(k = thousands; columns give the number of test questions)

 _*_   _10_      _20_       _30_        _40_        _50_        _100_
 4.6   1/1550    1/10200    1/25800     1/45600     1/66600     1/160k
 4.8   1/2050    1/16100    1/45200     1/85100     1/131k      1/355k
 5.0   1/2710    1/25200    1/78800     1/160k      1/259k      1/806k
 5.2   1/3550    1/39200    1/137k      1/301k      1/515k      1/1860k
 5.4   1/4620    1/61k      1/240k      1/568k      1/1030k     1/4360k
 5.6   1/5980    1/94k      1/417k      1/1070k     1/2080k     1/10400k
 5.8   1/7710    1/145k     1/724k      1/2040k     1/4210k     1/25100k
 6.0   1/9880    1/223k     1/1260k     1/3860k     1/8550k     1/61600k
 6.5   1/18k     1/634k     1/4900k     1/19100k    1/50600k    1/629000k
 7.0   1/32k     1/1740k    1/18700k    1/93300k    1/301000k   1/3210000k
For more information, contact
Robert B. Frary, Director of Measurement
and Research Services
2096 Derring Hall
Virginia Polytechnic Institute and State
University
Blacksburg, VA 24060
703/231-5413 (voice)
frary@vtvm1.cc.vt.edu
###