Identifying Reliability Information - Answers
Study 1: Hoffman and Nadelson
1. What type of reliability evidence was presented?
They report two types of reliability for the Goals Inventory. First, they present test-retest (stability) reliability; note this quotation:
“Test–retest reliability of the instrument revealed r = .73 for the learning goals subscale and r = .76 for the performance goals subscale, …”
They also, in the same sentence, present prior evidence of internal consistency for the two sub-scales that form the Goals Inventory:
“…with Cronbach’s alpha at .85 and .75 for the learning and performance subscales respectively.”
We know these are estimates for internal consistency because Cronbach’s alpha is an index for internal consistency.
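For reference, Cronbach's alpha compares the sum of the item variances to the variance of respondents' total scores. The short Python sketch below illustrates the calculation with made-up Likert responses; the data and function name are hypothetical, and this is not the authors' analysis.

    import numpy as np

    def cronbach_alpha(item_scores):
        """Cronbach's alpha from a respondents-by-items array of scores."""
        x = np.asarray(item_scores, dtype=float)
        k = x.shape[1]                              # number of items
        item_variances = x.var(axis=0, ddof=1)      # variance of each item
        total_variance = x.sum(axis=1).var(ddof=1)  # variance of total scores
        return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

    # Hypothetical responses: 5 people answering a 4-item subscale (1-5 scale)
    scores = [[4, 5, 4, 5],
              [3, 3, 4, 3],
              [5, 5, 5, 4],
              [2, 3, 2, 3],
              [4, 4, 5, 4]]
    print(round(cronbach_alpha(scores), 2))  # internal consistency estimate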
2. Do the reliability coefficients presented show acceptable levels of consistency? How do we know?
Yes. We typically want to see reliability coefficients no lower than .70, and each of the coefficients presented here is greater than that lower limit.
3. In addition to administering questionnaires to collect data, the authors also interviewed participants and collected their responses. The authors coded those responses. Now go to page 252 and read the section entitled “Coding methodology”. Read to the top of page 254; what type of reliability was presented for the coding of interview responses?
In the last paragraph on page 253, the authors write this:
“In Phase II, we created smaller categories from the Phase I data set by cluster coding. We intended to categorize participant language that targeted motivational antecedents, the degree and strength of engagement, evaluation of playing outcomes and experiences, and explanations for gaming preferences. Initially, each researcher coded the data individually.
After individual coding, each researcher exchanged coded data to determine agreement. Codes that were ambiguous, similar in meaning, or not agreed upon between researchers were reviewed to create a single code. For example, one researcher coded the statement, “It’s a way for me to sort of slow down” as “chillin”, a term used by several game players, the other researcher coded this same term as “escapism”. After discussion the code “relaxation” was designated to represent the comment. Inter-rater reliability for cluster coding was .96.”
What are they telling us here about coding reliability?
Note the inter-rater agreement of .96, which is a very high level of agreement.
4. Do the authors explain how the measure of reliability for coding of interview responses was calculated or determined?
No. There are several ways to calculate inter-rater agreement (reliability). One is simply the percentage or proportion of agreement among raters (e.g., our scores agreed 78% of the time, .78). Another is Cohen’s kappa, which adjusts the agreement rate for the agreement expected by chance. Another is to correlate the raters’ rankings of items. Still another is the intra-class correlation coefficient, which is similar to a correlation but calculated differently. Unfortunately, the authors do not specify exactly how they calculated this .96 level of agreement.
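To make two of these options concrete, the sketch below computes simple proportion agreement and Cohen's kappa (which corrects that proportion for chance agreement) for two hypothetical raters coding ten statements. The codes and data are invented for illustration only; we do not know which method the authors actually used.

    import numpy as np
    from sklearn.metrics import cohen_kappa_score

    # Hypothetical codes assigned by two raters to the same ten statements
    rater1 = ["relaxation", "escapism", "challenge", "social", "relaxation",
              "challenge", "social", "escapism", "relaxation", "challenge"]
    rater2 = ["relaxation", "relaxation", "challenge", "social", "relaxation",
              "challenge", "social", "escapism", "escapism", "challenge"]

    # Proportion agreement: share of statements coded identically (8 of 10 here)
    agreement = np.mean(np.array(rater1) == np.array(rater2))
    print(f"Proportion agreement: {agreement:.2f}")

    # Cohen's kappa: agreement corrected for chance agreement
    print(f"Cohen's kappa: {cohen_kappa_score(rater1, rater2):.2f}")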
Study 2: Boeckner et al.
5. What is the purpose of this study?
Page S21:
“The investigators initiated this study to estimate the parallel forms reliability of paper and pencil and Web versions of the 1998 HHHQ and to examine the feasibility of administering it via the Internet to older rural women for dietary assessment.”
6. To assess equivalent forms reliability (they call it parallel forms), what must they do? I am not asking what they did, but from our study chat session you should know the basic steps for establishing equivalent forms reliability. What are the steps the authors should take to establish equivalent forms reliability?
To establish equivalent forms reliability, they must
(a) Have two (or more) forms of an instrument (scale, test, etc.)
(b) Administer both forms to the same group of people at roughly the same time (e.g., within a few hours or days).
(c) Obtain scores from the two forms.
(d) Correlate the paired scores using Pearson’s correlation coefficient.
(e) Show that mean scores from the two forms are similar (see the sketch after this list for how steps (c) through (e) might be carried out).
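As a sketch of steps (c) through (e), the code below correlates hypothetical paired scores from two forms with Pearson's r and compares the two means with a paired t-test (one reasonable choice when the same people complete both forms). The scores and variable names are invented for illustration and are not the authors' data.

    from scipy import stats

    # Hypothetical paired scores for the same ten people on two forms
    form_a = [62, 55, 71, 48, 66, 59, 74, 51, 68, 60]   # e.g., paper-and-pencil version
    form_b = [60, 57, 69, 50, 64, 61, 72, 49, 70, 58]   # e.g., Web version

    # Step (d): correlate the paired scores (Pearson's r)
    r, r_p = stats.pearsonr(form_a, form_b)
    print(f"Equivalent forms reliability estimate: r = {r:.2f}")

    # Step (e): check that mean scores are similar (paired t-test)
    t, t_p = stats.ttest_rel(form_a, form_b)
    print(f"t = {t:.2f}, p = {t_p:.3f}  (a large p suggests similar means)")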
7. Did the authors follow the steps outlined for establishing equivalent forms reliability?
(a) Have two (or more) forms of an instrument (scale, test, etc.)
Yes, they had a paper version and a Web version with identical items (p. S21).
(b) Administer both forms to the same group of people at roughly the same time (e.g., within a few hours or days).
Yes, participants first completed the paper version, then within 1 to 2 weeks completed the Web version (p. S21).
(c) Obtain scores from the two forms.
(d) Correlate the paired scores using Pearson’s correlation coefficient.
(e) Show that mean scores from the two forms are similar.
For (c), (d), and (e), yes, the authors explain (p. S21) that data were collected and analyzed using Pearson’s correlation to estimate parallel form reliability and t-tests to compare mean scores between the two versions.
8. The issue for these authors is whether both versions produced similar responses from participants. Where can we find evidence about the level of agreement among scores from both versions?
Table 1 (page S22) shows the correlations for sub-sections of the HHHQ between scores from the paper and pencil version and the web version. These correlations are estimates of equivalent form reliability for each sub-section of the instrument.
9. Do you see any problematic correlations in this table? Any correlations lower than the lowest value typically accepted for reliability estimates?
Most of the correlations are above the .70 value sought for reliability coefficients, but a few fall below it. The correlation of r = .54 for “Vitamin C, mg” is weak and suggests a problem with equivalent forms reliability for that sub-scale of the instrument. Two other correlations are also slightly below .70.
10. Now look at Table 2; how does this table help us assess equivalent forms reliability?
Table 2 shows t-test results comparing mean scores between the two versions of the HHHQ. None of the t-ratios are large, and none of the p-values (labeled “Sig (2-Tailed)” here) are small enough (less than .05) to warrant rejection of the null hypothesis (the null indicates no difference in group means). Since the null is not rejected for any subscale, it seems both versions of the HHHQ tend to provide similar mean scores despite the weak correlations noted for a few subscales in Table 1.
Brief Note about p-values
P-values are used in hypothesis testing; they represent the probability of obtaining data like those observed in the sample if Ho is true. If the p-value is small, data like those obtained in the study would be very rare if Ho were true, so Ho is rejected. If the p-value is large, say greater than .05, researchers conclude that the data collected are not so different from what the null hypothesis specifies, so the null is not rejected.
If a p-value (Sig (2-Tailed) in their table) is less than .05, the result is statistically significant (Ho is rejected), and one would conclude that the means for that subscale differ by more than one would expect by chance. Normally a t-ratio of about 2 or larger will produce a p-value smaller than .05. If the sample is small, as in this case, a t-ratio of about 2.5 or so is needed to push the p-value below .05.
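As a rough check of these rules of thumb, the two-tailed p-value for a given t-ratio can be computed from the t distribution. The sketch below uses hypothetical degrees of freedom simply to show the pattern: with a larger sample a t-ratio near 2 pushes p below .05, while with a small sample a larger t-ratio is needed.

    from scipy import stats

    def two_tailed_p(t_ratio, df):
        """Two-tailed p-value for a t-ratio with df degrees of freedom."""
        return 2 * stats.t.sf(abs(t_ratio), df)

    # Larger sample (df = 100): a t-ratio of 2.0 is already significant
    print(round(two_tailed_p(2.0, 100), 3))   # just under .05 -> reject Ho

    # Small sample (df = 6): t = 2.0 is not enough ...
    print(round(two_tailed_p(2.0, 6), 3))     # roughly .09 -> do not reject Ho

    # ... but t = 2.5 is
    print(round(two_tailed_p(2.5, 6), 3))     # just under .05 -> reject Ho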