Educational Research On-Line
Test or Instrument Validity
When measuring anything (e.g., intelligence, width, height, self-efficacy, test anxiety, achievement, speed, time, etc.), the most important characteristic to be considered is the validity of the scores obtained from the measuring instrument. In this context, validity means that scores obtained from an instrument (or test) represent what they are intended to represent. For example, suppose I develop an instrument to measure test anxiety. The scale for this instrument will be 1 to 10, with low scores (like 1, 2, or 3) representing low levels of test anxiety, and high scores (8, 9, and 10) representing high levels of test anxiety.
For this test anxiety instrument, it is important that I establish that high scores really do represent high anxiety, and that low scores really do represent low anxiety. There are a number of methods for showing this, but the key to establishing validity is providing this evidence. The more evidence one provides to show that the scores actually represent what they are supposed to represent, the more evidence for the validity of the scores obtained from the instrument.
So, in summary, validity is the degree to which scores from an instrument or measuring device reflect what they are actually designed to represent. If an instrument is designed to measure achievement, for example, then valid scores from that instrument really do represent various levels of achievement.
Establishing Evidence for Empirical Validity
As the discussion above indicates, evidence for validity is provided if one can show that scores from an instrument are measuring what they are supposed to represent. Below are three general approaches to establishing such evidence. Each offers a slightly different approach, yet the first two methods (concurrent and predictive) are actually forms of the third (construct). Construct validity is the most general and most powerful method for showing that scores represent what they are designed to measure.
(a) Concurrent Validity (Criterion Related)
Concurrent validity is the degree to which scores from an instrument are related to scores obtained from other already established instruments measuring the same thing; or how well the test or instrument estimates current performance on some valued measure other than the test itself. In short, concurrent validity is the degree to which scores obtained from an instrument correlate with some external criterion of performance. Scores obtained from other instruments or measures of current performance are referred to as the criterion. Concurrent validity can usually be established by calculating the correlation between scores obtained from the instrument under investigation and scores from the criterion. Such a correlation is referred to as the validity coefficient.
The validity coefficient is a standardized measure that can range from -1 to 1, although validity coefficients are normally positive. The closer the coefficient to 1, the stronger the concurrent validity of the instrument. The closer the coefficient to 0.00, the weaker the concurrent validity of the instrument.
Example of Concurrent Validity
A researcher develops a new measure of intelligence that takes only 20 minutes to administer to a group of people. The researcher wishes to show that this measure of intelligence provides scores that are very similar to other measures of intelligence (doing this shows concurrent validity). The researcher administers his 20-minute test to a group of people, then administers Wechsler's test of intelligence to the same people. In this example, the researcher's test is the new test, and Wechsler's is the old, established test of intelligence. Once scores are obtained from the same people for both tests, the scores are then correlated to determine the extent to which the scores are related. The stronger the correlation, the more evidence for concurrent validity.
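The correlation step in this example can be sketched in a few lines of Python. This is only an illustration: the scores below are invented, and the names pearson_r, new_test, and established_test are hypothetical, not part of any study described here.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products and sums of squared deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one set of scores is constant")
    return sxy / sqrt(sxx * syy)

# Hypothetical scores for the same eight people (invented, not real data).
new_test = [98, 105, 110, 92, 120, 101, 115, 88]           # 20-minute test
established_test = [100, 108, 112, 95, 118, 99, 117, 90]   # established criterion

validity_coefficient = pearson_r(new_test, established_test)
print(f"validity coefficient r = {validity_coefficient:.2f}")
```

The same function would serve for predictive validity; only the criterion changes, from an established test given at the same time to a measure of later performance.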
(b) Predictive Validity (Criterion Related)
Predictive validity is the degree of predictive power of some instrument or test; how well a test or instrument predicts future performance or behavior on some valued measure other than the test itself; how well scores from the test or instrument correspond to scores on some future criterion. By correlating scores obtained from the test or instrument under investigation with scores from some criterion, one can obtain a validity coefficient like that described above with concurrent validity.
In summary, predictive validity is established by showing a strong correlation between an instrument designed to measure some behavior, and the actual future behavior the instrument was designed to measure. If scores from the instrument predict well some future behavior, then one has predictive validity.
Example of Predictive Validity
A researcher designs a test that will be used to determine admission into a special statistics course. The test is designed to show which students will perform best in the course. To show that this instrument actually predicts who will do well, the instrument is administered to a group of 100 randomly selected students. Each of these students then takes and completes the statistics course. Final course grades from each student are then correlated with scores from the instrument. If a strong correlation is found between test scores and final course grades, evidence for predictive validity is obtained. The stronger the correlation between the test scores and final course grades, the better the test can predict who will perform well in the course.
Both predictive and concurrent validity are referred to as criterion-related validity. The reason stems from the use of an external criterion to establish validity. For concurrent validity, the criterion is the already established test that is used to judge the adequacy of the new test; for predictive validity, the criterion is the future behavior.
Note about the Validity Coefficient
With both concurrent and predictive validity, one may obtain a validity coefficient. The validity coefficient is simply the Pearson's correlation between the scores from the instrument and the criterion (another test for concurrent validity, and some future behavior for predictive validity). Because it is a correlation, the validity coefficient can in principle be negative, but like the reliability coefficient it is normally interpreted on a scale from 0.00 to 1.00. The closer to 1.00 the validity coefficient, the better the evidence for criterion validity; the closer the validity coefficient to 0.00, the worse the evidence for criterion validity.
(c) Construct Validity
Construct validity is the degree to which an instrument measures the construct it was designed to measure; how well an instrument can be interpreted as a meaningful measure of some characteristic or quality; how well scores obtained from the instrument correspond with some theory, rationale, or behavior.
Construct validity may be established by several methods, but the most common is through the examination and testing of hypothetical relationships. For example, one may develop an instrument to measure verbal acuity. To begin to establish construct validity for verbal acuity, one must obtain scores from this instrument and then determine whether these scores correlate as one would expect with related phenomena like academic achievement, scores from other measures of verbal ability, etc. Should the verbal acuity scores correspond in an expected (or hypothesized) fashion with these related phenomena, then evidence of construct validity exists for the verbal acuity scores.
Why do hypothesized relationships among scores from an instrument and other variables show construct validity? The logic follows something like this. If an instrument truly measures what it is supposed to measure, then scores from the instrument represent what is measured. For instance, assume an instrument provides valid scores of intelligence. How do we know these scores are a valid representation of intelligence? If these scores measure intelligence, then we should be able to predict how these scores will correlate, or behave, when compared with scores from other variables. We know, for example, that intelligence correlates well with verbal ability. So, if our instrument designed to measure intelligence provides valid scores of intelligence, then those scores should correlate in a positive manner with verbal achievement scores, like verbal scores from the SAT, GRE, or ITBS instruments. If verbal scores and intelligence scores show a positive correlation, then our hypothesis of the positive relationship between intelligence and verbal ability is supported. Support for this hypothesis provides support for the measure of intelligence since we could accurately predict how intelligence scores would behave. In short, anytime one can predict how scores from an instrument will behave relative to other variables--that is, anytime one can make a prediction or hypothesis--and the prediction is supported by the data, we have provided evidence that the given instrument is in fact measuring what it is purported to measure.
Example 1: Providing Evidence of Test Anxiety Scores
As another example, suppose we develop a measure of test anxiety. Let us assume that high scores on this instrument represent high levels of test anxiety, and low scores represent low levels of test anxiety. We now wish to provide construct evidence that our instrument provides valid scores of test anxiety. To show this evidence, we formulate several hypotheses about how the scores from our instrument will behave. We make the following hypotheses:
1. Test anxiety should be negatively correlated with achievement scores. If our instrument measures test anxiety, then we should find that people who score high on test anxiety should tend to score lower on the achievement test, and people who score low on test anxiety should score higher on the achievement test. We expect this negative relationship because research has shown that these two variables are negatively related.
2. Test anxiety should be negatively associated with academic self-efficacy. Academic self-efficacy is the degree to which you feel confident that you can handle difficult academic situations; it is the degree to which you think you can learn and perform well on academic tasks. Research shows that people who have a high level of academic self-efficacy tend to be low on test anxiety and that people low on academic self-efficacy tend to be high on test anxiety. This makes sense: the more confident one is in one's ability to do academic tasks, the less anxiety one should experience on those tasks.
3. Test anxiety should be positively associated with measures of general trait anxiety. Test anxiety is typically viewed as a special case of general anxiety that a person experiences. Some people are more prone to anxiety, in general. The more prone to anxiety an individual is in general, the more prone that individual should be to a specific type of anxiety, like test anxiety. So, we should find that a positive correlation should result from measures of test anxiety and measures of general anxiety.
To provide evidence for scores obtained from an instrument designed to measure test anxiety, suppose a researcher administers the test anxiety instrument to a group of students immediately before they take an important test, then the researcher also administers a measure of general trait anxiety and a measure of academic self-efficacy. Finally, the researcher obtains test scores of achievement from the students' important test. The researcher correlates all of the scores and finds the following correlations (Table 1).
Table 1: Correlations Among Test Anxiety, Achievement, Self-efficacy, and General Anxiety
|Test Anxiety||Achievement Scores||Academic Self-efficacy|
All three hypotheses were supported. There is a negative correlation between test anxiety and achievement, a negative correlation between test anxiety and self-efficacy, and a positive correlation between test anxiety and general anxiety. Given the support for all three hypotheses, it appears that scores from the test anxiety instrument provide a valid measure of test anxiety.
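The logic of Example 1, stating the expected direction of each correlation and then checking the data against it, can be sketched as follows. The scores are invented for illustration and do not reproduce Table 1; the names pearson_r, check_hypotheses, and expected_sign are hypothetical. The sketch checks only the direction of each correlation, not its size or statistical significance.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical scores for six students (invented for illustration only).
test_anxiety = [2, 4, 5, 7, 8, 9]
other_measures = {
    "achievement":   [92, 85, 80, 74, 70, 61],
    "self_efficacy": [6.5, 6.0, 5.1, 4.0, 3.2, 2.5],
    "trait_anxiety": [20, 28, 31, 40, 44, 52],
}

# Hypothesized direction of each correlation with test anxiety.
expected_sign = {"achievement": -1, "self_efficacy": -1, "trait_anxiety": +1}

def check_hypotheses(target, others, expected):
    """Return {measure: (r, supported)}, where supported means the sign matches."""
    results = {}
    for name, values in others.items():
        r = pearson_r(target, values)
        supported = r != 0 and (r > 0) == (expected[name] > 0)
        results[name] = (r, supported)
    return results

results = check_hypotheses(test_anxiety, other_measures, expected_sign)
for name, (r, supported) in results.items():
    print(f"{name:14s} r = {r:+.2f}  hypothesis supported: {supported}")
```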
Example 2: Reading Attitudes Among Young Children
A researcher was interested in measuring reading attitudes among young children (grades 1 through 5). The instrument was developed with the intention that higher scores would indicate more positive attitudes toward reading. In order to provide evidence of construct validity, the researcher made the following hypotheses, among others:
1. Children with more positive attitudes toward reading should have more reading material at their home. The logic here is that if young children are exposed to more books at home, then they should be more inclined to read and more inclined to enjoy reading.
2. Children with more positive attitudes toward reading will be more likely to have a library card. The argument here is that if one enjoys reading, then one will be more likely to visit the public library.
3. Children's attitude toward reading will decline by grade level. Unfortunately, research shows that attitudes toward reading show a steady decline from 1st to about 6th grade, then attitudes maintain a level position.
To test these hypotheses and thereby show evidence for the validity of scores obtained from the reading attitude survey, the researcher administered the survey to a random sample of students from grades 1 through 5. The researcher then asked parents of the surveyed children to estimate how many children's books they have at home, and to indicate whether their child had a library card for the public library. Results of the study are reported below.
The correlation between reading attitudes and the number of children's books at home is .35, which is a statistically significant (p < .05), moderate to weak correlation. This positive correlation was anticipated and shows some evidence for the validity of reading attitude scores. A t-test was used to determine whether a difference exists between reading attitudes of children with and without library cards. Children with library cards had a mean of M = 30.73 (SD = 5.63) and children without a library card had a mean of M = 28.74 (SD = 5.47). The t-test indicated a statistically significant difference in reading attitudes (t = 2.76, df = 156, p < .05), with children with a library card showing a slightly more positive attitude toward reading. This finding also provides evidence for the validity of reading attitude scores. Finally, means of reading attitudes were calculated for each grade level examined. Descriptive statistics are reported in Table 2.
Table 2: Reading Attitude by Grade Level
As Table 2 shows, there is a steady decline in mean reading attitude as children advance through school. This finding supports previous research and also provides further evidence for the validity of scores obtained from the reading attitude survey.
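The library-card comparison in Example 2 rests on an independent-samples t statistic. A minimal sketch of that calculation is given below, assuming the pooled-variance (Student's) form of the test; the text does not say which form the researcher used. The attitude scores are invented and are not the study's data, and pooled_t is a hypothetical helper name.

```python
from math import sqrt
from statistics import mean, variance

def pooled_t(group1, group2):
    """Student's independent-samples t statistic and its degrees of freedom."""
    n1, n2 = len(group1), len(group2)
    df = n1 + n2 - 2
    # Pooled variance: each group's sample variance weighted by its own df.
    pooled_var = ((n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)) / df
    standard_error = sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean(group1) - mean(group2)) / standard_error, df

# Hypothetical reading attitude scores (invented; not the study's data).
with_card = [31, 35, 28, 33, 36, 30, 29, 34]
without_card = [27, 30, 25, 29, 31, 26, 28, 24]

t_stat, df = pooled_t(with_card, without_card)
print(f"t = {t_stat:.2f}, df = {df}")
```

The resulting t would then be compared with the critical value for its degrees of freedom, from a t table or statistical software, to judge whether the difference is statistically significant.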
Establishing Evidence for Logical Validity
The previous section showed how empirical evidence could be used to establish validity. By empirical, what is meant is that data are collected to show that scores obtained from an instrument behave in a predictable manner. This section discusses a different approach: the use of logic to establish evidence of validity. While some strong arguments for the validity of an instrument can be made using logic, logical validity is much weaker than empirical validity, although it is almost always the first approach used to address concerns about the validity of any instrument.
Content validity is the degree to which an instrument adequately addresses a specific content area--it is how well an instrument's tasks or items represent the domain to be assessed. Usually content validity is used for achievement tests, but occasionally researchers will use content validity to demonstrate the adequacy of items included in their instruments.
In education, researchers often use content validity to show that achievement tests adequately cover the domain of interest. In this context, content validity is formally established by the use of item validity and sampling validity. Briefly defined, item validity is the detailed analysis of each and every item included on the achievement test. The goal of item validity is to determine that each item included on the test covers a topic or meets an objective for the class. If the item does not match well an objective or topic covered in the course, the item should be eliminated. For example, it would be poor to include an item that covers calculus on a test of educational research. Sampling validity, briefly explained, occurs after item validity and is used to ensure that all major objectives to be covered on the test are adequately represented. For example, in educational research the first test may cover hypotheses, sampling, variables, and statistics. So for sampling validity to be adequate, I would need a number of items for each of those topics to be sure that I am testing the student's knowledge well for each topic.
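The two checks described above can be pictured as a simple audit of a draft test. The sketch below is only an illustration: the item labels, the topic assigned to each item, and the threshold of two items per objective are all assumptions made for the example, and review_content is a hypothetical helper. In practice these judgments are made by the instructor or by content experts, not by a program.

```python
from collections import Counter

# Objectives for the first test, taken from the example in the text.
objectives = ["hypotheses", "sampling", "variables", "statistics"]

# Hypothetical draft test: each item is tagged with the topic it covers.
items = {
    "Q1": "hypotheses", "Q2": "hypotheses", "Q3": "sampling",
    "Q4": "variables", "Q5": "variables", "Q6": "calculus",
    "Q7": "sampling",
}

def review_content(items, objectives, min_items_per_objective=2):
    """Return (off_target_items, under_sampled_objectives)."""
    # Item validity: every item should match a course objective.
    off_target = sorted(q for q, topic in items.items() if topic not in objectives)
    # Sampling validity: every objective needs enough on-target items.
    counts = Counter(topic for topic in items.values() if topic in objectives)
    under_sampled = [o for o in objectives if counts[o] < min_items_per_objective]
    return off_target, under_sampled

off_target, under_sampled = review_content(items, objectives)
print("Items matching no objective:", off_target)
print("Objectives with too few items:", under_sampled)
```

Here the calculus item would be flagged for removal (item validity), and the statistics objective would be flagged as not yet represented (sampling validity).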
In addition to item and sampling validity, one may occasionally see reference to face validity. Face validity is much less formal than item and sampling validity, and really represents only a cursory review of a test. With face validity, one is usually only interested in determining whether a test appears to be appropriate for a certain group. For example, one may quickly review a test to ensure that it is appropriate for educational research and inappropriate for, say, a French class. This entails a short review of some of the items and format of the test to ensure it is appropriate for the class to which it will be administered.
Content validity may also be used to show that the items on an instrument are appropriate for measuring whatever they were designed to measure. Often content validity is argued when one develops an instrument. The developmental process often includes (a) a thorough review of the literature (and the use of logic), (b) assessment by experts, and (c) a pilot or field test of the instrument. Each of these steps helps to ensure content validity. The literature review will point to important areas that must be included on the instrument for the measurement of whatever is to be measured. Further, logic may dictate that certain areas be covered on the instrument. Once the items have been developed from the literature or logic, the next step is to have experts in the area covered by the instrument examine the items to ensure they are appropriate. This step is similar to the item validity review described above. The next step is to pilot test the instrument with people just like those who will be administered the instrument. Normally researchers find the pilot study, or pilot test of the instrument, to be highly informative. Respondents to the instrument often provide very useful information about what works and what does not work well on the instrument.
Example: Establishing Content Validity for a Measure of Test Anxiety
A researcher is interested in developing a measure of test anxiety. The first step is to determine the type of items that should be included in the measure of test anxiety. A review of the test anxiety literature shows that test anxiety is thought to be composed of two aspects: emotionality and worry. Since both of these components are important, it will be necessary to develop items that cover both in order to logically measure test anxiety. Emotionality represents the physiological and affective arousal one experiences immediately before taking an important test, like getting a sick feeling. Worry is the cognitive concern one experiences about the outcomes of the important test, like negative thoughts or concerns. Now that all the important components of test anxiety are identified, the next step is to develop the items to measure these two components. Below are two items to cover each component, emotionality and worry.
Students will respond to each item on a scale ranging from 1 (not at all true for me) to 7 (very true for me).
The next step is to have experts on test anxiety review this instrument and determine whether each item represents the component of test anxiety it was designed to measure. Let us assume that the two emotionality items were judged to be adequate measures of the emotionality component of test anxiety, and the two worry items were also judged to be adequate measures of the worry component of test anxiety.
The last step in showing content validity is to conduct a pilot study of the instrument. In the pilot study I will administer the instrument to a small number of students, then I will talk to each student and ask for their opinions about the instrument. I will ask if each item was clear and easy to understand, and will ask each student to tell me what they thought each item meant to ensure that each student interpreted each item in the same way.
Using the above steps, we have provided evidence for logical validity--content validity--for this instrument. We have shown that each major component of test anxiety was covered by including items on the instrument for each component, and that these items and components were derived from prior research (literature review and possibly from logic). We have shown that experts in the area agreed the items adequately represented the major components of test anxiety. And we have shown through a pilot study that students found the instrument easy to understand and doable.
Relationship between Reliability and Validity
Measurement specialists like to highlight the relationship between validity and reliability. In summary, it is not possible to have validity without reliability. Recall that validity means scores from an instrument represent what they are supposed to measure, and reliability means that one obtains consistent scores. To see that reliability is a necessary condition for validity (although not a sufficient condition), recall the following weight example:
Suppose, for example, that I have bathroom scales at home. One morning I step on the scales, record my weight, then step off. I immediately repeat this process two more times. My recorded weights are:
148, 286, 205
Are these measurements reliable?
I try weighing myself in the same manner with a second set of scales, and my recorded weights are:
195, 210, 205
Are these measurements reliable? Are they more reliable than the first set of measurements?
I try weighing myself in the same manner with a third set of scales, and my weights are:
204, 204, 203
With this example, let us assume that my valid, real weight is 205 pounds. The three sets of measurements were:
A: 148, 286, 205
B: 195, 210, 205
C: 204, 204, 203
Note that set A is highly variable--it has no reliability. As you can see, it is impossible for something with this little reliability to accurately indicate my weight. As a result, it is impossible for set A to be a valid measurement of my weight. Set B also has variability, but not as much as set A. Again, due to the fluctuation in the scores, set B is not an accurate measurement of my weight, so it too is invalid, although it is closer to reality than set A. Set C is the most reliable, and in this example, the most valid measure of my weight. This example shows that scores that are not reliable cannot be valid due to the wild fluctuations in the scores. Consistency is a necessary condition for validity to be present.
While consistency is necessary for validity, it is not enough. To illustrate, consider the following weights:
D: 175, 175, 175
Set D is perfectly consistent--perfectly reliable, but set D is also invalid. Set D does not accurately reflect my weight, so despite the reliability of the measurement, validity is lacking. This illustrates that reliability is not enough to ensure validity.
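The comparison of sets A through D can be made concrete by computing, for each set, how much the readings vary (a rough index of reliability) and how far they fall from the true weight (a rough index of validity). The choice of the sample standard deviation and the mean absolute error as summaries is an assumption made for this sketch, and summarize is a hypothetical helper name.

```python
from statistics import mean, stdev

TRUE_WEIGHT = 205  # the real weight assumed in the text, in pounds

readings = {
    "A": [148, 286, 205],
    "B": [195, 210, 205],
    "C": [204, 204, 203],
    "D": [175, 175, 175],
}

def summarize(values, true_value):
    """Return (spread, error) for one set of readings."""
    spread = stdev(values)                             # consistency of the readings
    error = mean([abs(v - true_value) for v in values])  # distance from the truth
    return spread, error

summary = {name: summarize(values, TRUE_WEIGHT) for name, values in readings.items()}
for name, (spread, error) in summary.items():
    print(f"Set {name}: spread = {spread:6.2f}, mean error = {error:6.2f}")
```

Set D has no spread at all yet misses the true weight by 30 pounds on every reading, which is the point of the example: consistency alone does not guarantee accuracy.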
So what is the relationship between validity and reliability? Well, validity requires reliability, so one cannot have validity without reliability. The reverse is not true. One may have reliability without validity. The following table illustrates the relationship between these two constructs.
|If We Know:||Then We Know:|
|Test scores are valid.||The test scores must be reliable if they are valid.|
|Test scores are not valid.||We don't know--if the test scores are not valid, we don't know anything about reliability; the scores may or may not be reliable.|
|Test scores are reliable.||We don't know--if the test scores are reliable, we don't know if the scores are valid; the scores may or may not be valid.|
|Test scores are not reliable.||The test scores are not valid--without reliability it is impossible for the scores to be valid.|
The article below provides a strong overview of measurement:
Controversies Regarding the Nature of Score Validity:
Still Crazy After All These Years
B. Thomas Gray
Texas A&M University 77843-4225
Validity is a critically important issue with far-reaching implications for testing. The history of conceptualizations of validity over the past 50 years is reviewed, and three important areas of controversy are examined. First, the question of whether the three traditionally recognized types of validity should be integrated as a unitary entity of construct validity is examined. Second, the issue of the role of consequences in assessing test validity is discussed, and finally, the concept that validity is a property of test scores and their interpretations, and not of tests themselves is reviewed.
Paper presented at the annual meeting of the Southwest Educational Research Association, Austin, January, 1997.
The most recent edition of the Standards for Educational and Psychological Testing (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 1985) included the bold statement that "Validity is the most important consideration in test evaluation" (p. 9). It seems likely that the same point will be reiterated, perhaps verbatim, in the forthcoming revised edition of the same work. The importance indicated by such a strong declaration is reinforced by the fact that no new test can be introduced without a manual that includes a section on validity studies, and no text on testing and/or psychometrics is considered complete without at least one chapter addressing the topic of validity.
In 1949, Cronbach (p. 48) stated that the definition of validity as the extent to which a test measures what it purports to measure was commonly accepted, although he preferred a slight modification: "A test is valid to the degree that we know what it measures or predicts" (p. 48). Cureton (1951) provided similar commentary: "The essential question of test validity is how well a test does the job it is employed to do... Validity is therefore defined in terms of the correlation between the actual test scores and the 'true' criterion scores" (pp. 621, 623). The enduring definition given by Anastasi (cf., 1954, p. 120; Anastasi & Urbina, 1997, p. 113)--"Validity is what the test measures and how well it does so"--is cited quite widely.
It is interesting to note that Cronbach, one of the most prominent voices in the field of psychometrics, and a widely respected authority on the topic of validity, has tended to avoid the problem of defining the term since the 1949 statement cited above (cf., 1988, 1989). In 1971 (p. 443), however, he provided an insightful statement that foreshadowed some of the controversy of the future: "Narrowly considered, validation is the process of examining the accuracy of a specific prediction or inference made from a test score."
Exceptions can be found to the apparent conservatism seen in the definitions cited above. Perhaps most notable is Messick (1989a, p. 13), who stated that "Validity is an integrated evaluative judgment of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of inferences and actions based on test scores or other modes of assessment." This reflects much of the debate and controversy to be found in the literature of the past several years, indicative perhaps of greater intellectual movement in the field than would be implied by the previous paragraph.
It is certainly beyond the scope of the present paper to present a comprehensive review of the topic of validity. The purpose instead is to focus on a few more obvious points of controversy. Three areas in particular will be addressed: (a) the status of the different types of validity; (b) the issue of what is sometimes referred to as "consequential validity"; and (c) the persistence of illogical statements taking the broad form of "The test is valid."
The above discussion illustrates the considerable differences in the way validity is conceptualized by different authorities. Some have changed their views little over the past 40+ years, while others have been advocating markedly different views, a few for many years. Since the roots of many of the shifts in thinking which are occurring today can be found in earlier works, a brief historical review of the ways in which the validity formulations have developed is helpful in understanding both the current controversies and the persistent themes. For further detail, the interested reader is referred particularly to the extensive discussions of the topic which have appeared in the three volumes of Educational Measurement (Cronbach, 1971b; Cureton, 1951; Messick, 1989) published thus far.
Conceptualizations of Validity: An Historical Sketch
By the early part of the 1950's a plethora of different types of validity (factorial, intrinsic, empirical, logical, and many others) had been named (see Anastasi, 1954). Among those whose contributions continue to be acknowledged are Gulliksen (1950), Guilford (1946), Jenkins (1946), and Rulon (1946). Typical formulations recognized two basic categories, which Cronbach (1949) termed logical and empirical forms of validity. The former was a rather loosely organized, broadly defined set of approaches, including content analyses, and examination of operational issues and test taking processes. Test makers were expected to make a careful study of the test itself, to determine what test scores mean (Cronbach, 1949, p. 48). Much of what has since become known as content validity is found within this broad category.
Empirical validity placed emphasis on the use of factor analysis (e.g., Guilford's 1946 factorial validity), and especially on correlation(s) between test scores and a criterion measure (Anastasi, 1950). Cureton (1951) devoted several pages to various issues concerning the criterion, and the influence of this approach is seen in its widespread use even today (Cronbach, 1989), despite some apparent limitations. For example, Cureton's assertion quoted above is easily (although perhaps slightly outlandishly) refuted by noting that a positive correlation could be obtained between children's raw scores on an achievement test and their heights. This is not to say that correlational studies are useless, but rather that their indiscriminate application can sometimes yield uninteresting results.
Several interesting and important political developments converged to influence events relating to validity conceptualization (Benjamin, 1996; Cronbach, 1988, 1989). In the late 1940's, the academically-oriented APA was attempting to draw the membership of the new Association for Applied Psychology back into its ranks. The two groups combined into what was supposed to be a more broadly-oriented APA, and work was begun on establishing an appropriate code of ethics addressing both scientific and applied concerns. A committee was established in 1950 to develop standards for adequate psychological tests, and their work was published in 1954. At that time four categories of validity were defined: content, predictive, concurrent, and construct.
That basic outline is still in use today (AERA et al., 1985), essentially unchanged, with the exception that in 1966, the revised edition of the Standards combined the predictive and concurrent validity categories into the single grouping called criterion validity. The 1954 standards, which were actually referred to as Technical Recommendations in their first edition, were quickly followed up by the publication in 1955 of Cronbach and Meehl's landmark paper, Construct Validity in Psychological Tests (see Thompson & Daniel, 1996). The construct of Construct Validity was elucidated more clearly, including introduction of the concept of the nomological net. The latter is described as the interrelated laws supporting a given construct; Cronbach (1989) later presented this in somewhat less strict terms, acknowledging the impossibility with most constructs used in the social sciences of attaining the levels of proof demanded in the harder sciences.
Soon thereafter, Campbell (1957) introduced into the validation process the notion of falsification, and discussed the importance of testing plausible rival hypotheses. This was explained in further detail by Campbell and Fiske (1959) in their important paper (Thompson & Daniel, 1996) introducing the multitrait-multimethod approach and the notions of convergent and divergent (or discriminant) validity. There have been objections to some of the applications of this technique, particularly insofar as it can and sometimes does become a rather rote exercise, which will therefore produce only vapid results. Nonetheless, the multitrait-multimethod approach, like correlational studies, enjoys widespread appeal nearly 40 years after its introduction.
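The logic of the multitrait-multimethod matrix can be illustrated with a minimal sketch. Everything in it is an assumption made for the example: two hypothetical traits (test anxiety and self-efficacy), two hypothetical methods (self-report and observer rating), and artificial scores built so that each measure is twice its trait factor plus its method factor. It is not Campbell and Fiske's own worked example; it shows only the comparison their approach calls for, namely that correlations between different methods measuring the same trait (convergent) should exceed correlations between different traits (discriminant).

```python
from itertools import combinations, product
from math import sqrt

def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Artificial subjects: every +/-1 combination of two trait factors and two
# method factors, so all four factors are mutually uncorrelated.
# Tuple order: (anxiety, efficacy, self_report, observer)
design = list(product((-1, 1), repeat=4))

# Each hypothetical measure = 2 * trait factor + 1 * method factor.
measures = {
    ("anxiety", "self_report"): [2 * a + s for a, e, s, o in design],
    ("anxiety", "observer"): [2 * a + o for a, e, s, o in design],
    ("efficacy", "self_report"): [2 * e + s for a, e, s, o in design],
    ("efficacy", "observer"): [2 * e + o for a, e, s, o in design],
}

def classify(key1, key2):
    """Label a pair of measures by its position in the matrix."""
    same_trait = key1[0] == key2[0]
    same_method = key1[1] == key2[1]
    if same_trait:
        return "convergent"  # same trait, different method
    if same_method:
        return "heterotrait-monomethod"
    return "heterotrait-heteromethod"

mtmm = {}
for key1, key2 in combinations(measures, 2):
    r = pearson(measures[key1], measures[key2])
    mtmm.setdefault(classify(key1, key2), []).append(r)

# Core comparison: every convergent correlation should exceed every
# correlation between measures of different traits.
discriminant = mtmm["heterotrait-monomethod"] + mtmm["heterotrait-heteromethod"]
passes_check = min(mtmm["convergent"]) > max(discriminant)

for label, values in mtmm.items():
    print(label, [round(r, 2) for r in values])
print("convergent > discriminant:", passes_check)
```

Because the scores are constructed rather than observed, the pattern here is clean by design. With real instruments the same comparison is a matter of judgment across many coefficients, which is one reason a rote application yields little.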
Construct Validity as a Unifying Theme
The so-called trinitarian doctrine, which conceptualizes validity in three parts, has been a fundamental part of the Standards since their inception (or, to be picky, at least since 1966). This doctrine is therefore presented as standard fare in most textbooks which cover the topic of validity. Anastasi, for example, has followed the same outline in her widely used textbook Psychological Testing since 1961, despite her commentary (1986; Anastasi & Urbina, 1997; also see below) that there is considerable overlap between these different categories. Consensus nonetheless is that, however it may (or may not) be divided, the different parts represent lines of evidence pointing toward the single construct.
It has also been fairly well demonstrated that, contrary to prevailing opinion of 40 to 50 years ago, no mode of scientific inquiry is devoid of the influence of values. From this recognition, several authors have argued that one must include consequences of the application of a given test as an aspect of the validity of that application. This is a much more controversial area, for which there is far less consensus. It would seem that, at a minimum, many portions of this argument must be clarified before consequential validity is universally accepted as a facet of validity that must always be considered.
Finally, the illogic of the mantra "The test is valid" was discussed. That statements of such form persist despite the strong reasons for not using them is testimony to the inertia that accrues to any long-standing practice. The phenomenon is similar to the persistence of what Cronbach (1989, pp. 162-163) termed "empirical miscellany" and "unfocused empiricism," seen most clearly in the accumulation of various correlation coefficients that serves as the validity argument in many (perhaps most) test manuals.
The controversies that persist are welcomed. Consider, for example, that the basic outline of validity presented by Anastasi did not change for over 30 years (cf. Anastasi, 1961, 1988; Anastasi & Urbina, 1997). This is strongly suggestive of stagnation in thinking, a condition which is only alleviated by the challenge of new ideas. Not all of the new ideas discussed in the works reviewed in the present paper are necessarily useful. At least most of those that are not useful will not survive the test of time.
It is universally acknowledged that validity is a crucial consideration in evaluating tests and test applications. It is also generally stated that a true validation argument, rather than resulting from a single study, such as might be found in a newly published test manual, is an unending process. Contending with new ideas regarding the nature of validity itself is just a part of this process.
References
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education (1985). Standards for educational and psychological testing. Washington, DC: Author.
American Psychological Association (1954). Technical recommendations for psychological tests and diagnostic techniques. Psychological Bulletin, 51, 201-238.
American Psychological Association (1966). Standards for educational and psychological tests and manuals. Washington, DC: Author.
Anastasi, A. (1954). Psychological testing. New York: Macmillan.
Anastasi, A. (1961). Psychological testing (2nd ed.). New York: Macmillan.
Anastasi, A. (1976). Psychological testing (4th ed.). New York: Macmillan.
Anastasi, A. (1986). Evolving concepts of test validation. Annual Review of Psychology, 37, 1-15.
Anastasi, A. (1988). Psychological testing (6th ed.). New York: Macmillan.
Anastasi, A., & Urbina, S. (1997). Psychological testing (7th ed.). New York: Macmillan.
Benjamin, L. T. (1996). The founding of the American Psychologist: The professional journal that wasn't. American Psychologist, 51, 8-12.
Brandon, P. R., Lindberg, M. A., & Wang, Z. (1993). Involving program beneficiaries in the early stages of evaluation: Issues of consequential validity and influence. Educational Evaluation and Policy Analysis, 15, 420-428.
Campbell, D. T. (1957). Factors relevant to the validity of experiments in social settings. Psychological Bulletin, 54, 297-312.
Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validity in the multitrait-multimethod matrix. Psychological Bulletin, 56, 81-105.
Cronbach, L. J. (1949). Essentials of psychological testing. New York: Harper & Row.
Cronbach, L. J. (1960). Essentials of psychological testing (2nd ed.). New York: Harper & Row.
Cronbach, L. J. (1971a). Essentials of psychological testing (3rd ed.). New York: Harper & Row.
Cronbach, L. J. (1971b). Test validation. In R. L. Thorndike (Ed.), Educational measurement (2nd ed., pp. 443-507). Washington, DC: American Council on Education.
Cronbach, L. J. (1984). Essentials of psychological testing (4th ed.). New York: Harper & Row.
Cronbach, L. J. (1988). Five perspectives on validity argument. In H. Wainer & H. I. Braun (Eds.), Test validity (pp. 3-17). Hillsdale, NJ: Lawrence Erlbaum.
Cronbach, L. J. (1989). Construct validation after thirty years. In R. L. Linn (Ed.), Intelligence: Measurement theory and public policy (pp. 147-171). Urbana: University of Illinois Press.
Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52, 281-302.
Cureton, E. F. (1951). Validity. In E. F. Lindquist (Ed.), Educational measurement (1st ed., pp. 621-694). Washington, DC: American Council on Education.
Guilford, J. P. (1946). New standards for test evaluation. Educational and Psychological Measurement, 6, 427-439.
Gulliksen, H. (1950). Intrinsic validity. American Psychologist, 5, 511-517.
Guion, R. M. (1980). On trinitarian doctrines of validity. Professional Psychology, 11, 385-398.
Hunter, J. E., & Schmidt, F. L. (1982). Fitting people to jobs: The impact of personnel selection on national productivity. In M. D. Dunnette & E. A. Fleishman (Eds.), Human capability assessment. Hillsdale, NJ: Lawrence Erlbaum.
Jenkins, J. G. (1946). Validity for what? Journal of Consulting Psychology, 10, 93-98.
Kane, M. T. (1992). An argument-based approach to validity. Psychological Bulletin, 112, 527-535.
Lees-Haley, P. R. (1996). Alice in validityland, or the dangerous consequences of consequential validity. American Psychologist, 51, 981-983.
Maguire, T., Hattie, J., & Haig, B. (1994). Alberta Journal of Educational Research, 40, 109-126.
Messick, S. (1965). Personality measurement and the ethics of assessment. American Psychologist, 20, 136-142.
Messick, S. (1975). The standard problem: Meaning and values in measurement and evaluation. American Psychologist, 30, 955-966.
Messick, S. (1980). Test validity and the ethics of assessment. American Psychologist, 35, 1012-1027.
Messick, S. (1989a). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13-103). New York: Macmillan.
Messick, S. (1989b). Meaning and values in test validation: The science and ethics of assessment. Educational Researcher, 18(2), 5-11.
Messick, S. (1994). The interplay of evidence and consequences in the validation of performance assessments. Educational Researcher, 23(2), 13-23.
Messick, S. (1995). Validity of psychological assessment. American Psychologist, 50, 741-749.
Moss, P. A. (1992). Shifting conceptions of validity in educational measurement: Implications for performance assessment. Review of Educational Research, 62, 229-258.
Moss, P. A. (1994). Can there be validity without reliability? Educational Researcher, 23(2), 5-12.
Moss, P. A. (1995). Themes and variations in validity theory. Educational Measurement: Issues and Practice, 14(2), 5-12.
Rogers, W. T. (1996). The treatment of measurement issues in the revised Program Evaluation Standards. Journal of Experimental Education, 63(1), 13-28.
Rulon, P. J. (1946). On the validity of educational tests. Harvard Educational Review, 16, 290-296.
Sackett, P. R., Tenopyr, M. L., Schmitt, N., & Kahn, J. (1985). Commentary on forty questions about validity generalization and meta-analysis. Personnel Psychology, 38, 697-798.
Schmidt, F. L., Pearlman, K., Hunter, J. E., & Hirsh, H. R. (1985). Forty questions about validity generalization and meta-analysis. Personnel Psychology, 38, 697-798.
Shepard, L. A. (1993). Evaluating test validity. In L. Darling-Hammond (Ed.), Review of research in education (Vol. 19, pp. 405-450). Washington, DC: American Educational Research Association.
Thompson, B. (1994a, April). Common methodology mistakes in dissertations, revisited. Paper presented at the annual meeting of the American Educational Research Association, New Orleans. (ERIC Document Reproduction Service No. ED 368 771)
Thompson, B. (1994b). Guidelines for authors. Educational and Psychological Measurement, 54(4), 837-847.
Thompson, B., & Daniel, L. G. (1996). Seminal readings on reliability and validity: A "hit parade" bibliography. Educational and Psychological Measurement, 56, 741-745.
Wiggins, G. (1993). Assessment: Authenticity, context, and validity. Phi Delta Kappan, 75, 200-214.
Zimiles, H. (1996). Rethinking the validity of psychological assessment. American Psychologist, 51, 980-981.