EDUR 7130
Educational Research On-Line

Reliability


Assigned Reading

Gall's Text
6th, 7th, and 8th edition: Chapter 7

See supplemental readings.

Reliability

In measurement, reliability refers to the ability to measure something consistently, that is, to obtain consistent scores every time something is measured.

Suppose, for example, that I have bathroom scales at home. One morning I step on the scales, record my weight, then step off. I immediately repeat this process two more times. My recorded weights are:

148, 286, 205

Are these measurements reliable?

I try weighing myself in the same manner with a second set of scales, and my recorded weights are:

195, 210, 205

Are these measurements reliable? Are they more reliable than the first set of measurements?

I try weighing myself in the same manner with a third set of scales, and my weights are:

204, 204, 203

Are these measurements reliable? As you can see, reliability comes in degrees; some measurements are more reliable than others. In this example, the third scale is more reliable than the second, which in turn is more reliable than the first.
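
One rough way to put numbers on "degrees" of consistency is to look at how far the repeated readings from each scale spread out. The short Python sketch below uses the three sets of weights given above; the choice of range and standard deviation as summaries is an illustrative assumption, not a formal reliability coefficient.

```python
from statistics import pstdev

# Recorded weights from the three bathroom scales in the example above.
scales = {
    "first": [148, 286, 205],
    "second": [195, 210, 205],
    "third": [204, 204, 203],
}

def spread(readings):
    """Return (range, standard deviation) for a set of repeated readings."""
    return max(readings) - min(readings), pstdev(readings)

for name, readings in scales.items():
    rng, sd = spread(readings)
    print(f"{name:>6} scale: range = {rng:3d}, SD = {sd:6.2f}")

# Order the scales from most consistent (smallest spread) to least consistent.
ranking = sorted(scales, key=lambda name: spread(scales[name])[1])
print("Most to least consistent:", ranking)
```

The smaller the spread, the more consistent the scale, which matches the ordering described above: third, then second, then first.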

Reliability Coefficients

To assess the degree of reliability, measurement specialists have developed methods to measure the reliability of a given set of scores. Typically the measurement of reliability is reflected in what is called a reliability coefficient. Reliability coefficients range from 0.00 (lowest) to 1.00 (highest). Reliability coefficients of .6 or .7 and above are considered good for classroom tests, and .9 and above is expected for professionally developed instruments. So the closer the reliability coefficient is to 1.00, the more reliable, or consistent, the scores obtained from an instrument.

Types of Reliability

There are several methods for assessing reliability; the most common are presented below.

(a) Test-Retest

Test-retest reliability is established by correlating scores obtained, on two separate occasions, from the same group of people on the same test. The correlation coefficient obtained is referred to as the coefficient of stability. With test-retest reliability, one attempts to determine whether consistent scores are being obtained from the same group of people over time; hence, one wishes to learn whether scores are stable over time.

For example, one administers a test, say Test A, to students on June 1, then re-administers the same test (Test A) to the same students at a later date, say June 15. Scores from the same person are correlated to determine the degree of association between the two sets. Table 1 shows an example.

Table 1: Example of Test-Retest Scores for Reliability

Test Form A

Person       June 1 Administration Scores    June 15 Administration Scores
Bryan        85                              83
Bob          75                              77
Brenda       63                              60
Bertha       59                              57
Bert         91                              89
Brent        35                              40
Bathsheba    55                              60
Beth         95                              99
Bernie       86                              83
Betty        83                              77

The correlation between the sets of scores in Table 1 is r = .98, which indicates a strong association between the scores. Note that people who scored high on the first administration also scored high on the second, and those who scored low on the first administration scored low on the second. This strong relationship between the two sets of scores indicates high reliability.
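
For readers who want to check the coefficient, the following is a minimal Python sketch of the Pearson correlation applied to the two columns of Table 1. The function name `pearson_r` and the variable names are illustrative choices; any statistics package would give the same result.

```python
from math import sqrt

# Scores from Table 1, listed in the same person order for both dates.
june_1 = [85, 75, 63, 59, 91, 35, 55, 95, 86, 83]
june_15 = [83, 77, 60, 57, 89, 40, 60, 99, 83, 77]

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products of deviations from each mean.
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations for each set of scores.
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / sqrt(ss_x * ss_y)

r = pearson_r(june_1, june_15)
print(f"Coefficient of stability: r = {r:.2f}")  # prints r = 0.98
```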

The problem with test-retest reliability is that it is only appropriate for instruments for which individuals are not likely to remember their answers from administration to administration. Remembered answers will likely inflate the reliability estimate artificially. In general, test-retest reliability is not a very useful method for establishing reliability.

(b) Equivalent-forms

Equivalent-forms reliability is established in a manner similar to test-retest. Scores are obtained from the same group of people, but the scores are taken from different forms of a test. The different forms of the test (or instrument) are designed to measure the same thing, the same construct. The forms should be as similar as possible, but use different questions or wording. It is not enough to simply rearrange the item order; rather, new and different items are required between the two forms. Examples of parallel (equivalent) forms that most of you are familiar with include the SAT, the GRE, the Miller Analogies Test (MAT), and others. Should you take one of these standardized tests more than once, it is unlikely that you would take the same form.

To establish equivalent-forms reliability, one administers two forms of an instrument to the same group of people, takes the scores, and correlates them. The higher the correlation coefficient, the higher the equivalent-forms reliability. Table 2 below illustrates this.

Table 2: Example of Equivalent-Forms Reliability

Person       Instrument Form A Scores    Instrument Form B Scores
Bryan        85                          83
Bob          75                          77
Brenda       63                          60
Bertha       59                          57
Bert         91                          89
Brent        35                          40
Bathsheba    55                          60
Beth         95                          99
Bernie       86                          83
Betty        83                          77

The scores are the same as those given in Table 1; the only difference is that these scores come from two different forms of the instrument. The correlation between the sets of scores in Table 2 is r = .98, which indicates a strong association between the scores. Note that people who scored high on Form A also scored high on Form B, and those who scored low on Form A scored low on Form B. This strong relationship between the two sets of scores indicates high reliability.

Equivalent-forms reliability is not a practical method for establishing reliability. One reason is the difficulty of developing forms of an instrument that are truly parallel (equivalent). A second problem is the impracticality of asking study participants to complete two forms of an instrument. In most cases a researcher wishes to use instruments that are as short and to the point as possible, so asking participants to complete more than one form is not often reasonable.

(c) Internal Consistency

This is the preferred method of establishing reliability for most measuring instruments. Internal consistency reliability represents the consistency with which items on an instrument provide similar scores. For example, suppose I develop three items to measure test anxiety. Test anxiety represents one's fear or over-concern for one's performance in a testing situation. In developing the three items, I should concentrate on three items that measure the same construct (test-anxiety), yet provide a slightly different view or angle on it. Below are three items to measure test anxiety that would be administered immediately before one takes a test:

  1. I have an uneasy, upset feeling.
  2. I'm concerned about doing poorly.
  3. I'm thinking of the consequences of failing.

For each item, I would ask the respondent to indicate on a scale how true each statement is at that time. For example:

Table 3: Test Anxiety Items

Instructions: Please indicate, on the scale provided, how true each statement is for you immediately before taking an important test.

                                                   Not True of Me      Very True of Me
1. I have an uneasy, upset feeling.                1    2    3    4    5    6
2. I'm concerned about doing poorly.               1    2    3    4    5    6
3. I'm thinking of the consequences of failing.    1    2    3    4    5    6

If these three items show evidence of internal consistency, then a given person should show similar answers for each item. A person who has a high degree of anxiety for testing situations would probably choose response 6 for item 1, response 6 for item 2, and response 6 for item 3. A person with little to no anxiety might choose response 1 for all three items. Note that both of these people show a high degree of consistency in their responses, and thus internal consistency.

Defined, internal consistency is essentially the degree to which similar responses are provided for items designed to measure the same construct (variable), like test anxiety. Table 4 gives another example of internal consistency. In this example, the survey is designed to measure satisfaction with a course. Let's suppose that a student is very dissatisfied with a course--the student hates the course. In response to item 1, the student is likely to select option 5, "always"; for item 2, the student is likely to select option 5, "not at all." I think you can see the pattern here--the internally consistent pattern of selecting the negative responses.

Is there any item on this survey that is likely not to elicit a consistent pattern or response?

Table 4: Internal Consistency Example "The Course Satisfaction Survey"

1. Do you ever feel like skipping this class?
   1 = never   2 = rarely   3 = sometimes   4 = often   5 = always
2. Do you like this class?
   1 = very much   2 = quite   3 = fairly   4 = not too   5 = not at all
3. Do you like the way this class is taught?
   1 = very much   2 = quite   3 = fairly   4 = not too   5 = not at all
4. Are you glad you chose or were assigned to be in this class?
   1 = very glad   2 = most of the time   3 = sometimes   4 = not too often   5 = not at all
5. How much do you feel you have learned in this class?
   1 = a great deal   2 = quite a bit   3 = a fair amount   4 = not much   5 = nothing
6. Do you always do your best in this class?
   1 = always   2 = most of the time   3 = usually   4 = sometimes   5 = never
7. Do you like your other courses?
   1 = very much   2 = quite a bit   3 = a fair amount   4 = not much   5 = not at all
8. Does the teacher give you help when you need it?
   1 = always   2 = most of the time   3 = usually   4 = sometimes   5 = never
9. Do you find the time you spend in this class to be interesting?
   1 = very much   2 = quite a bit   3 = a fair amount   4 = not much   5 = not at all

Adapted from B. W. Tuckman (1988). Conducting Educational Research (3rd ed.). New York: Harcourt, Brace, Jovanovich, p. 236.

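Two ways of calculating internal consistency are split-half and Cronbach's alpha. For items scored dichotomously (for example, right or wrong), alpha reduces to the Kuder-Richardson formula KR-20, and KR-21 is a simpler approximation of KR-20 that assumes items of equal difficulty. Alpha is the better of the two methods because it can be interpreted as the average of all possible split-half reliabilities. All these measures of internal consistency provide an index that ranges from 0.00 to 1.00, with values closer to 1.00 indicating higher levels of internal consistency. In most research, writers will report Cronbach's alpha like this: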

Cronbach's alpha was calculated for each subscale: test anxiety, α = .76; academic self-efficacy, α = .54; and motivation to learn, α = .83.

Often students will overlook the use of Cronbach's alpha in research reports as a form of reporting internal consistency reliability. For example, consider this scenario:

Jones and Smith (1993) conducted research on academic self-efficacy among adolescents and found that self-efficacy scores, with a reported Cronbach's alpha of .87, correlated strongly with persistence, grades, and self-selected tasks related to academics. Further research reported by Jones and Smith also denotes...

In the above report note that self-efficacy scores have an internal consistency of .87, as indicated by Cronbach's alpha.
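
To make the coefficient less of a black box, here is a minimal Python sketch of how Cronbach's alpha can be computed from item responses. It uses the standard formula, alpha = (k / (k - 1)) * (1 - sum of item variances / variance of total scores), where k is the number of items. The response data are hypothetical, invented only to illustrate the calculation for three test anxiety items rated on a 1 to 6 scale; they do not come from any study cited here.

```python
from statistics import variance

# HYPOTHETICAL data: each row is one respondent's answers (1-6 scale)
# to three test anxiety items. These values are made up for illustration.
responses = [
    [6, 6, 5],
    [1, 2, 1],
    [4, 3, 4],
    [2, 2, 3],
    [5, 5, 6],
]

def cronbach_alpha(rows):
    """Cronbach's alpha for a list of respondents' item scores."""
    k = len(rows[0])                      # number of items
    items = list(zip(*rows))              # regroup scores by item
    sum_item_variances = sum(variance(item) for item in items)
    total_variance = variance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - sum_item_variances / total_variance)

alpha = cronbach_alpha(responses)
print(f"Cronbach's alpha = {alpha:.2f}")
```

Because the hypothetical respondents answer all three items in much the same way, alpha for these data comes out close to 1.00. Responses that varied haphazardly from item to item would pull the coefficient down.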

(d) Scorer/Rater

Scorer/Rater reliability is used to determine the consistency with which one rater assigns scores (intra-judge), or the consistency with which two or more raters assign scores (inter-judge). Intra-judge reliability refers to a single judge assigning scores. Remember that consistency requires multiple scores (at least two) in order to establish reliability, so for intra-judge reliability to be established, a single judge or rater must score something more than once. If asked to judge an art exhibit, a judge must rate the same exhibit more than once to learn whether the judge is reliable in assigning scores. If the judge rates it high once and low the second time, obviously rater reliability is lacking.

For inter-judge reliability, one is concerned with showing that multiple raters have consistency in their scoring of something. As an example, consider the multiple judges used at the Olympics. For the high dive competition, often about seven judges are used. If the seven judges provide scores like 5.4, 5.2, 5.1, 5.5, 5.3, 5.4, and 5.6, then there is some consistency there. If, however, scores are something like 5.4, 4.3, 5.1, 4.9, 5.3, 5.4, and 5.6, then it is clear the judges are not using the same criteria for determining scores, so they lack consistency and, therefore, reliability.
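
The difference between the two panels can be seen by comparing each judge's score with the panel's median. The Python sketch below uses the two sets of scores given above; the summary chosen (range and largest gap from the median) is an illustrative assumption. A formal inter-judge coefficient would require scores from the same judges across several performances, which this single-dive example does not provide.

```python
from statistics import median

# Judges' scores for a single dive, taken from the example above.
consistent_panel = [5.4, 5.2, 5.1, 5.5, 5.3, 5.4, 5.6]
inconsistent_panel = [5.4, 4.3, 5.1, 4.9, 5.3, 5.4, 5.6]

def panel_summary(scores):
    """Return (range, largest gap from the panel median), rounded to 0.1."""
    mid = median(scores)
    score_range = round(max(scores) - min(scores), 1)
    largest_gap = round(max(abs(s - mid) for s in scores), 1)
    return score_range, largest_gap

for label, panel in [("first panel", consistent_panel),
                     ("second panel", inconsistent_panel)]:
    score_range, largest_gap = panel_summary(panel)
    print(f"{label}: range = {score_range}, "
          f"largest gap from median = {largest_gap}")
```

In the first panel no judge is more than 0.3 from the median; in the second, one judge is a full point away, which is the kind of disagreement described above.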


Supplemental Reading

Suzanne Boren and David Moxley also present material on reliability and validity here:

http://www.hmi.missouri.edu/course_materials/Executive_HSM/semesters/s99materials/Hsm450/boren/measurement.htm 


Copyright 2000, Bryan W. Griffin

Last revised on 28 September, 2006 02:40 PM