Controversies Regarding the Nature of Score Validity: Still Crazy After All These Years
B. Thomas Gray
Texas A&M University 77843-4225
Validity is a critically important issue with far-reaching implications for testing. The history of conceptualizations of validity over the past 50 years is reviewed, and three important areas of controversy are examined. First, the question of whether the three traditionally recognized types of validity should be integrated into a unitary construct validity is examined. Second, the role of consequences in assessing test validity is discussed. Finally, the concept that validity is a property of test scores and their interpretations, and not of tests themselves, is reviewed.

Paper presented at the annual meeting of the Southwest Educational Research Association, Austin, January 1997.
Controversies Regarding the Nature of Score Validity: Still Crazy After All These Years
The most recent edition of the Standards for Educational and Psychological Testing (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 1985) included the bold statement that "Validity is the most important consideration in test evaluation" (p. 9). It seems likely that the same point will be reiterated, perhaps verbatim, in the forthcoming revised edition of the same work. The importance indicated by such a strong declaration is reinforced by the fact that no new test can be introduced without a manual that includes a section on validity studies, and no text on testing and/or psychometrics is considered complete without at least one chapter addressing the topic of validity.
In 1949, Cronbach stated that the definition of validity as the extent to which a test measures what it purports to measure was commonly accepted, although he preferred a slight modification: "A test is valid to the degree that we know what it measures or predicts" (p. 48). Cureton (1951) provided similar commentary: "The essential question of test validity is how well a test does the job it is employed to do. ... Validity is therefore defined in terms of the correlation between the actual test scores and the 'true' criterion scores" (pp. 621, 623). The enduring definition given by Anastasi (cf. 1954, p. 120; Anastasi & Urbina, 1997, p. 113), that validity concerns "what the test measures and how well it does so," is cited quite widely.
It is interesting to note that Cronbach, one of the most prominent voices in the field of psychometrics and a widely respected authority on the topic of validity, has since the 1949 statement cited above tended to avoid the problem of defining the term (cf. Cronbach, 1988, 1989). In 1971, however, he provided an insightful statement that foreshadowed some of the controversy to come: "Narrowly considered, validation is the process of examining the accuracy of a specific prediction or inference made from a test score" (p. 443).
Exceptions can be found to the apparent conservatism seen in the definitions cited above. Perhaps most notable is Messick (1989a, p. 13), who stated that "Validity is an integrated evaluative judgment of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of inferences and actions based on test scores or other modes of assessment." This reflects much of the debate and controversy to be found in the literature of the past several years, indicative perhaps of greater intellectual movement in the field than would be implied by the previous paragraph.
It is certainly beyond the scope of the present paper to present a comprehensive review of the topic of validity. The purpose instead is to focus on a few of the more obvious points of controversy. Three areas in particular will be addressed: (a) the status of the different types of validity; (b) the issue of what is sometimes referred to as "consequential validity"; and (c) the persistence of illogical statements taking the broad form of "The test is valid."
The above discussion illustrates the considerable differences in the way validity is conceptualized by different authorities. Some have changed their views little over the past 40+ years, while others have been advocating markedly different views, a few for many years. Since the roots of many of the shifts in thinking which are occurring today can be found in earlier works, a brief historical review of the ways in which validity formulations have developed is helpful in understanding both the current controversies and the persistent themes. For further detail, the interested reader is referred particularly to the extensive discussions of the topic which have appeared in the three volumes of Educational Measurement (Cronbach, 1971b; Cureton, 1951; Messick, 1989a) published thus far.
Conceptualizations of Validity: An Historical Sketch
By the early part of the 1950s a plethora of different types of validity (factorial, intrinsic, empirical, logical, and many others) had been named (see Anastasi, 1954). Among those whose contributions continue to be acknowledged are Gulliksen (1950), Guilford (1946), Jenkins (1946), and Rulon (1946). Typical formulations recognized two basic categories, which Cronbach (1949) termed logical and empirical forms of validity. The former was a rather loosely organized, broadly defined set of approaches, including content analyses and examination of operational issues and test-taking processes. Test makers were expected to make a careful study of the test itself, "to determine what test scores mean" (Cronbach, 1949, p. 48). Much of what has since become known as content validity is found within this broad category.
Empirical validity placed emphasis on the use of factor analysis (e.g., Guilford's 1946 factorial validity), and especially on correlation(s) between test scores and a criterion measure (Anastasi, 1950). Cureton (1951) devoted several pages to various issues concerning the criterion, and the influence of this approach is seen in its widespread use even today (Cronbach, 1989), despite some apparent limitations. For example, Cureton's assertion quoted above is easily (although perhaps slightly outlandishly) refuted by noting that a positive correlation could be obtained between children's raw scores on an achievement test and their heights. This is not to say that correlational studies are useless, but rather that their indiscriminate application can sometimes yield uninteresting results.
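The height example can be made concrete with a small numerical sketch. The data below are hypothetical and invented purely for illustration (Python is used only as a convenient calculator, not as anything assumed by the sources reviewed here): because both height and achievement raw scores rise with age in school-aged children, the two correlate strongly even though height is plainly no criterion of achievement.

```python
# Hypothetical sketch: a strong positive correlation between achievement
# raw scores and height demonstrates nothing about validity, because both
# variables simply increase with age in school-aged children.

def pearson_r(x, y):
    """Pearson product-moment correlation for two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx ** 0.5 * vy ** 0.5)

# Six hypothetical children, ages 7 through 12: heights in centimeters
# and achievement-test raw scores, both growing with age.
heights = [122, 128, 133, 139, 145, 151]
scores = [14, 19, 25, 31, 36, 42]

r = pearson_r(heights, scores)
print(round(r, 3))  # strongly positive, yet substantively meaningless
```

The point of the sketch is Cureton's problem in miniature: a high criterion correlation is only as meaningful as the criterion itself.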
Several interesting and important political developments converged to influence events relating to validity conceptualization (Benjamin, 1996; Cronbach, 1988, 1989). In the late 1940s, the academically-oriented APA was attempting to draw the membership of the new Association for Applied Psychology back into its ranks. The two groups combined into what was supposed to be a more broadly-oriented APA, and work was begun on establishing an appropriate code of ethics addressing both scientific and applied concerns. A committee was established in 1950 to develop standards for adequate psychological tests, and its work was published in 1954. At that time four categories of validity were defined: content, predictive, concurrent, and construct.
That basic outline is still in use today (AERA et al., 1985), essentially unchanged, with the exception that in 1966 the revised edition of the Standards combined the predictive and concurrent validity categories into the single grouping called criterion validity. The 1954 standards, which were actually referred to as Technical Recommendations in their first edition, were quickly followed by the publication in 1955 of Cronbach and Meehl's landmark paper, "Construct Validity in Psychological Tests" (see Thompson & Daniel, 1996). The concept of construct validity was elucidated more fully there, including the introduction of the nomological net: the interrelated laws supporting a given construct. Cronbach (1989) later presented this in somewhat less strict terms, acknowledging the impossibility, with most constructs used in the social sciences, of attaining the levels of proof demanded in the harder sciences.
Soon thereafter, Campbell (1957) introduced into the validation process the notion of falsification, and discussed the importance of testing plausible rival hypotheses. This was explained in further detail by Campbell and Fiske (1959) in their important paper (Thompson & Daniel, 1996) introducing the multitrait-multimethod approach and the notions of convergent and divergent (or discriminant) validity. There have been objections to some of the applications of this technique, particularly insofar as it can and sometimes does become a rather rote exercise, which then produces only vapid results. Nonetheless, the multitrait-multimethod approach, like correlational studies, enjoys widespread appeal nearly 40 years after its introduction.
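The convergent/discriminant logic can be sketched numerically. The scores below are hypothetical, and the trait and method labels are illustrative assumptions rather than data from any study; the sketch merely shows the pattern a well-behaved multitrait-multimethod matrix is expected to exhibit.

```python
# Hypothetical multitrait-multimethod sketch: two traits, two methods,
# six examinees. Convergent validity asks that the same trait measured by
# different methods correlate highly; discriminant validity asks that
# different traits correlate lower, even when measured by the same method.

def pearson_r(x, y):
    """Pearson product-moment correlation for two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx ** 0.5 * vy ** 0.5)

t1_m1 = [10, 12, 15, 18, 20, 23]  # trait 1 via method 1 (e.g., self-report)
t1_m2 = [11, 13, 14, 19, 21, 22]  # trait 1 via method 2 (e.g., observer rating)
t2_m1 = [30, 24, 28, 22, 27, 21]  # trait 2 via method 1

convergent = pearson_r(t1_m1, t1_m2)    # same trait, different methods
discriminant = pearson_r(t1_m1, t2_m1)  # different traits, same method

# In a well-behaved matrix the convergent coefficient dominates:
print(round(convergent, 2), round(discriminant, 2))
```

When the convergent coefficients fail to exceed the discriminant ones, the matrix constitutes evidence against the construct interpretation, which is precisely the falsification logic Campbell emphasized.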
Construct Validity as a Unifying Theme
The so-called trinitarian doctrine, which conceptualizes validity in three parts, has been a fundamental part of the Standards since their inception (or, to be picky, at least since 1966). This doctrine is therefore presented as standard fare in most textbooks which cover the topic of validity. Anastasi, for example, has followed the same outline in her widely-used textbook Psychological Testing since 1961, despite her commentary (1986; Anastasi & Urbina, 1997; also see below) that there is considerable overlap between these different categories. The consensus nonetheless is that, however it may (or may not) be divided, the different parts represent lines of evidence pointing toward a single construct.
It has also been fairly well demonstrated that, contrary to prevailing opinion of 40 to 50 years ago, no mode of scientific inquiry is devoid of the influence of values. From this recognition, several authors have argued that one must include consequences of the application of a given test as an aspect of the validity of that application. This is a much more controversial area, for which there is far less consensus. It would seem that, at a minimum, many portions of this argument must be clarified before "consequential validity" is universally accepted as a facet of validity that must always be considered.
Finally, the illogic of the mantra "The test is valid" was discussed. That statements of such form persist despite the strong reasons for not using them is testimony to the inertia that accrues to any long-standing practice. The phenomenon is similar to the persistence of what Cronbach (1989, pp. 162-163) termed "empirical miscellany" and "unfocused empiricism," seen most clearly in the accumulation of various correlation coefficients that serves as the validity argument in many (perhaps most) test manuals.
The controversies that persist are welcomed. Consider, for example, that the basic outline of validity presented by Anastasi did not change for over 30 years (cf. Anastasi, 1961, 1988; Anastasi & Urbina, 1997). This is strongly suggestive of stagnation in thinking, a condition which is only alleviated by the challenge of new ideas. Not all of the new ideas discussed in the works reviewed in the present paper are necessarily useful, but those that are not will presumably not survive the test of time.
It is universally acknowledged that validity is a crucial consideration in evaluating tests and test applications. It is also generally stated that a true validation argument, rather than resulting from a single study, such as might be found in a newly published test manual, is an unending process. Contending with new ideas regarding the nature of validity itself is just a part of this process.
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education (1985). Standards for educational and psychological testing. Washington, DC: Author.
American Psychological Association (1954). Technical recommendations for psychological tests and diagnostic techniques. Psychological Bulletin, 51, 201-238.
American Psychological Association (1966). Standards for educational and psychological tests and manuals. Washington, DC: Author.
Anastasi, A. (1954). Psychological testing. New York: Macmillan.
Anastasi, A. (1961). Psychological testing (2nd ed.). New York: Macmillan.
Anastasi, A. (1976). Psychological testing (4th ed.). New York: Macmillan.
Anastasi, A. (1986). Evolving concepts of test validation. Annual Review of Psychology, 37, 1-15.
Anastasi, A. (1988). Psychological testing (6th ed.). New York: Macmillan.
Anastasi, A., & Urbina, S. (1997). Psychological testing (7th ed.). Upper Saddle River, NJ: Prentice Hall.
Benjamin, L. T. (1996). The founding of the American Psychologist: The professional journal that wasn't. American Psychologist, 51, 8-12.
Brandon, P. R., Lindberg, M. A., & Wang, Z. (1993). Involving program beneficiaries in the early stages of evaluation: Issues of consequential validity and influence. Educational Evaluation and Policy Analysis, 15, 420-428.
Campbell, D. T. (1957). Factors relevant to the validity of experiments in social settings. Psychological Bulletin, 54, 297-312.
Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validity in the multitrait-multimethod matrix. Psychological Bulletin, 56, 81-105.
Cronbach, L. J. (1949). Essentials of psychological testing. New York: Harper & Row.
Cronbach, L. J. (1960). Essentials of psychological testing (2nd ed.). New York: Harper & Row.
Cronbach, L. J. (1971a). Essentials of psychological testing (3rd ed.). New York: Harper & Row.
Cronbach, L. J. (1971b). Test validation. In R. L. Thorndike (Ed.), Educational measurement (2nd ed., pp. 443-507). Washington, DC: American Council on Education.
Cronbach, L. J. (1984). Essentials of psychological testing (4th ed.). New York: Harper & Row.
Cronbach, L. J. (1988). Five perspectives on validity argument. In H. Wainer & H. I. Braun (Eds.), Test validity (pp. 3-17). Hillsdale, NJ: Lawrence Erlbaum.
Cronbach, L. J. (1989). Construct validation after thirty years. In R. L. Linn (Ed.), Intelligence: Measurement theory and public policy (pp. 147-171). Urbana: University of Illinois Press.
Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52, 281-302.
Cureton, E. F. (1951). Validity. In E. F. Lindquist (Ed.), Educational measurement (1st ed., pp. 621-694). Washington, DC: American Council on Education.
Guilford, J. P. (1946). New standards for test evaluation. Educational and Psychological Measurement, 6, 427-439.
Gulliksen, H. (1950). Intrinsic validity. American Psychologist, 5, 511-517.
Guion, R. M. (1980). On trinitarian doctrines of validity. Professional Psychology, 11, 385-398.
Hunter, J. E., & Schmidt, F. L. (1982). Fitting people to jobs: The impact of personnel selection on national productivity. In M. D. Dunnette & E. A. Fleishman (Eds.), Human capability assessment. Hillsdale, NJ: Lawrence Erlbaum.
Jenkins, J. G. (1946). Validity for what? Journal of Consulting Psychology, 10, 93-98.
Kane, M. T. (1992). An argument-based approach to validity. Psychological Bulletin, 112, 527-535.
Lees-Haley, P. R. (1996). Alice in validityland, or the dangerous consequences of consequential validity. American Psychologist, 51, 981-983.
Maguire, T., Hattie, J., & Haig, B. (1994). Alberta Journal of Educational Research, 40, 109-126.
Messick, S. (1965). Personality measurement and the ethics of assessment. American Psychologist, 20, 136-142.
Messick, S. (1975). The standard problem: Meaning and values in measurement and evaluation. American Psychologist, 30, 955-966.
Messick, S. (1980). Test validity and the ethics of assessment. American Psychologist, 35, 1012-1027.
Messick, S. (1989a). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13-103). New York: Macmillan.
Messick, S. (1989b). Meaning and values in test validation: The science and ethics of assessment. Educational Researcher, 18(2), 5-11.
Messick, S. (1994). The interplay of evidence and consequences in the validation of performance assessments. Educational Researcher, 23(2), 13-23.
Messick, S. (1995). Validity of psychological assessment. American Psychologist, 50, 741-749.
Moss, P. A. (1992). Shifting conceptions of validity in educational measurement: Implications for performance assessment. Review of Educational Research, 62, 229-258.
Moss, P. A. (1994). Can there be validity without reliability? Educational Researcher, 23(2), 5-12.
Moss, P. A. (1995). Themes and variations in validity theory. Educational Measurement: Issues and Practice, 14(2), 5-12.
Rogers, W. T. (1996). The treatment of measurement issues in the revised Program Evaluation Standards. Journal of Experimental Education, 63(1), 13-28.
Rulon, P. J. (1946). On the validity of educational tests. Harvard Educational Review, 16, 290-296.
Sackett, P. R., Tenopyr, M. L., Schmitt, N., & Kahn, J. (1985). Commentary on forty questions about validity generalization and meta-analysis. Personnel Psychology, 38, 697-798.
Schmidt, F. L., Pearlman, K., Hunter, J. E., & Hirsh, H. R. (1985). Forty questions about validity generalization and meta-analysis. Personnel Psychology, 38, 697-798.
Shepard, L. A. (1993). Evaluating test validity. In L. Darling-Hammond (Ed.), Review of research in education (Vol. 19, pp. 405-450). Washington, DC: American Educational Research Association.
Thompson, B. (1994a, April). Common methodology mistakes in dissertations, revisited. Paper presented at the annual meeting of the American Educational Research Association, New Orleans. (ERIC Document Reproduction Service No. ED 368 771)
Thompson, B. (1994b). Guidelines for authors. Educational and Psychological Measurement, 54(4), 837-847.
Thompson, B., & Daniel, L. G. (1996). Seminal readings on reliability and validity: A "hit parade" bibliography. Educational and Psychological Measurement, 56, 741-745.
Wiggins, G. (1993). Assessment: Authenticity, context, and validity. Phi Delta Kappan, 75, 200-214.
Zimiles, H. (1996). Rethinking the validity of psychological assessment. American Psychologist, 51, 980-981.