Reliability/Agreement among Coders
Read
Note - ignore material below - not yet incorporated into EDUR 9131
Instructor Note: Other material to consider
- Email explaining percentage agreement, Fleiss kappa, Krippendorff alpha
- http://www.kenbenoit.net/courses/tcd2014qta/readings/Banerjee%20et%20al%201999_Beyond%20kappa.pdf
- Krippendorff alpha details http://repository.upenn.edu/cgi/viewcontent.cgi?article=1286&context=asc_papers
- http://www.agreestat.com/research_papers/wiley_encyclopedia2008_eoct631.pdf
- Fleiss kappa http://en.wikipedia.org/wiki/Fleiss'_kappa
- Fleiss kappa 1971 introduction: http://www.bwgriffin.com/gsu/courses/edur9131/content/Fleiss_kappa_1971.pdf (shows the formula for agreement among two or more raters, as noted in my email link)
- Liao, Hunt, & Chen (2010). Comparison between Inter-rater Reliability and Inter-rater Agreement in Performance Assessment. Illustrates how high reliability does not imply high agreement, and how high agreement does not imply high reliability.
- They claim the data in Table 2 show high agreement but low consistency. They are incorrect; the agreement indices are very low: K alpha = -.29 (ordinal, interval, and ratio), and ICC for agreement = -.38 (single rater) and -5.00 (multiple raters). Moreover, the subject means in Table 2 differ by almost 2 SD, a large ES difference.
- Artstein and Poesio (2008) Inter-Coder Agreement for Computational Linguistics. Excellent detailed source for explaining inter-coder agreement measures including K alpha -- good review source.
(c) Reliability -- inter-rater and intra-rater agreement
- Rating Scales are Categorical/Nominal (angry, fearful, contempt, disgust) or Few Ranked Categories (poor, good, very good)
- Presentation notes: Inter-rater Agreement with Nominal/Categorical Ratings with SPSS commands
- Graham, Milanowski, & Miller (2012). Measuring and Promoting Inter-Rater Agreement of Teacher and Principal Performance Ratings.
- See their Table 1 for a nice illustration of the difference between reliability and agreement. Pearson r vs. percent agreement: r may be 1.00 despite no agreement.
- Percentage Agreement
- Simple count of "agreement codes" / "total codes" forms percentage agreement
- Percentage agreement can be calculated for more than two raters
- Percentage agreement is generally not recommended on its own because it capitalizes on chance agreement and may therefore overstate the level of agreement; still, do report it.
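A minimal Python sketch of the count described above, for illustration only. It assumes each case holds one code per rater and, for three or more raters, averages agreement over all rater pairs; other definitions (for example, counting only cases where every rater agrees) give lower values. The function name and the data are hypothetical.

```python
from itertools import combinations

def percent_agreement(ratings):
    """Mean pairwise agreement: agreeing rater pairs / all rater pairs."""
    agree = 0
    total = 0
    for case in ratings:
        # Compare every pair of raters on this case.
        for a, b in combinations(case, 2):
            total += 1
            agree += (a == b)
    return agree / total

# Hypothetical data: 5 cases (rows) coded by 3 raters (columns).
ratings = [
    ["pass", "pass", "pass"],
    ["pass", "fail", "pass"],
    ["fail", "fail", "fail"],
    ["pass", "pass", "fail"],
    ["fail", "fail", "fail"],
]

print(round(percent_agreement(ratings), 3))              # all three raters
print(percent_agreement([row[:2] for row in ratings]))   # first two raters only
```

With two raters the result is simply the share of cases on which the two codes match.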
- Cohen's Kappa
- Useful for two raters with unordered rating scale or few ordered options
- Discussion: Viera & Garrett 2005; present table for interpreting kappa; also note limitations of kappa
- Example of kappa use: Uiters et al. (2006), see page 4 and table
- SPSS
- Requires symmetrical scores: every code must be used by both raters (e.g., if Rater A uses 1 to 5 and Rater B uses only 1 to 4, no kappa is computed). Update: newer versions of SPSS correct this.
- Likely not useful if the number of categories is large (e.g., many themes when coding responses); works well for overall judgments with a limited set of codes [excellent, pass, fail]
- SPSS notes
- http://www.stattutorials.com/SPSS/TUTORIAL-SPSS-Interrater-Reliability-Kappa.htm
- http://www.sma.org.sg/smj/4412/4412bs1.pdf -- page 617
- (problem, all categories must be present for both raters): http://www.ats.ucla.edu/stat/spss/faq/kappa.htm (can 3+ raters be used?)
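For reference, kappa can also be computed directly from the two raters' codes. The sketch below (Python; function name and data are hypothetical) uses the usual formula kappa = (p_o - p_e) / (1 - p_e), where p_e is built from each rater's own category proportions. Categories are taken from the union of both raters' codes, so a code used by only one rater does not block the calculation.

```python
from collections import Counter

def cohen_kappa(r1, r2):
    """Cohen's kappa for two raters coding the same cases."""
    n = len(r1)
    # Observed agreement: share of cases with identical codes.
    p_o = sum(a == b for a, b in zip(r1, r2)) / n
    c1, c2 = Counter(r1), Counter(r2)
    # Chance agreement from each rater's own marginal proportions.
    # A Counter returns 0 for a code the rater never used.
    p_e = sum((c1[c] / n) * (c2[c] / n) for c in set(c1) | set(c2))
    # Undefined (division by zero) if both raters use one identical code throughout.
    return (p_o - p_e) / (1 - p_e)

# Hypothetical data: 10 cases, codes 1-3.
rater_a = [1, 1, 1, 1, 2, 2, 2, 3, 3, 3]
rater_b = [1, 1, 1, 2, 2, 2, 3, 3, 3, 3]

print(round(cohen_kappa(rater_a, rater_b), 3))  # percent agreement here is .80
```

Here percent agreement is .80 while kappa is about .70, which shows the size of the chance correction for these data.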
- Scott's pi (or Fleiss Kappa for two raters)
- Same formula as Cohen's Kappa but calculates expected (chance) agreement differently, from the raters' pooled category proportions rather than each rater's own; results are often similar
- Excel file to calculate Fleiss Generalized Kappa: http://www.bwgriffin.com/gsu/courses/edur9131/content/fleiss_kappa2.xls (note: spreadsheet not working)
- Original link: http://www.ccitonline.org/jking/homepage/interrater.html (Excel files for Fleiss kappa; large excel may be problematic)
- How to calculate Fleiss kappa in Excel: http://www.real-statistics.com/reliability/fleiss-kappa/
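As an alternative to the spreadsheets, here is a short Python sketch of Fleiss' kappa. It assumes every case is coded by the same number of raters and has no missing codes; the function name and data are hypothetical. With exactly two raters the same calculation gives Scott's pi, because chance agreement is based on the pooled category proportions.

```python
from collections import Counter

def fleiss_kappa(ratings):
    """Fleiss' kappa; ratings is a list of cases, each a sequence of codes."""
    n_cases = len(ratings)
    n_raters = len(ratings[0])
    if any(len(case) != n_raters for case in ratings):
        raise ValueError("every case needs the same number of ratings")
    pooled = Counter()
    p_bar = 0.0
    for case in ratings:
        counts = Counter(case)
        pooled.update(counts)
        # Share of agreeing rater pairs within this case.
        p_bar += (sum(c * c for c in counts.values()) - n_raters) / (
            n_raters * (n_raters - 1)
        )
    p_bar /= n_cases
    # Chance agreement from pooled category proportions.
    p_e = sum((t / (n_cases * n_raters)) ** 2 for t in pooled.values())
    return (p_bar - p_e) / (1 - p_e)

# Hypothetical data: two raters, 10 cases.
rater_a = [1, 1, 1, 1, 2, 2, 2, 3, 3, 3]
rater_b = [1, 1, 1, 2, 2, 2, 3, 3, 3, 3]
two_raters = list(zip(rater_a, rater_b))

# Hypothetical data: three raters, 5 cases.
three_raters = [
    ["pass", "pass", "pass"],
    ["pass", "fail", "pass"],
    ["fail", "fail", "fail"],
    ["pass", "pass", "fail"],
    ["fail", "fail", "fail"],
]

print(round(fleiss_kappa(two_raters), 3))    # equals Scott's pi for two raters
print(round(fleiss_kappa(three_raters), 3))
```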
- Krippendorff's alpha
- Krippendorff argues
- alpha is superior to kappa and pi
- serious work should show alpha > .80
- can handle missing data when there are 3+ raters, unlike kappa and pi
- SPSS Syntax for running Krippendorff alpha
- Hayes' website: http://www.afhayes.com/public/kalpha.sps
- Copied syntax here: SPSS syntax
- Spring 2015 students noted errors with SPSS version 21 and 22; check syntax
- Knut De Swert (2012). Calculating inter-coder reliability in media content analysis using Krippendorff’s Alpha. University of Amsterdam. Explains how to use Hayes' SPSS syntax to run K alpha, and meaning of values of K alpha. Provides examples.
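For readers without SPSS, the following Python sketch computes alpha from the coincidence-matrix definition (alpha = 1 - D_o / D_e). It is not a port of Hayes' macro; the function names and data are hypothetical, None marks a missing rating, and only the nominal and interval distance functions are shown (ordinal and ratio are omitted).

```python
from collections import Counter
from itertools import permutations

def nominal(a, b):
    """Nominal distance: any mismatch counts fully."""
    return 0.0 if a == b else 1.0

def interval(a, b):
    """Interval distance: squared difference between numeric codes."""
    return float(a - b) ** 2

def krippendorff_alpha(units, distance=nominal):
    """units: list of cases, each a list of codes (None = missing)."""
    coincidences = Counter()
    for unit in units:
        values = [v for v in unit if v is not None]
        m = len(values)
        if m < 2:
            continue  # a case coded by fewer than two raters cannot be paired
        # Each ordered pair of ratings from different raters gets weight 1/(m-1).
        for a, b in permutations(values, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)
    marginals = Counter()
    for (a, _), w in coincidences.items():
        marginals[a] += w
    n = sum(marginals.values())  # number of pairable ratings
    d_o = sum(w * distance(a, b) for (a, b), w in coincidences.items()) / n
    d_e = sum(
        marginals[a] * marginals[b] * distance(a, b)
        for a in marginals
        for b in marginals
    ) / (n * (n - 1))
    # Undefined (division by zero) if only one code ever appears.
    return 1.0 - d_o / d_e

# Hypothetical data: 6 cases, 3 raters, some ratings missing.
units = [
    [1, 1, 1],
    [1, 2, None],
    [2, 2, 2],
    [2, 2, None],
    [1, None, None],   # only one rating: dropped from the calculation
    [None, 1, 1],
]

print(round(krippendorff_alpha(units), 3))
```

Passing `distance=interval` applies the same steps to numeric ratings, with larger discrepancies penalized more heavily.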
- Three or More Raters
- All of the agreement indices noted above can be extended to more than two raters
- See Inter-rater Agreement with Nominal/Categorical Ratings for examples and instructions
- On-line Reliability Calculators:
- Geertzen, J. (2012). Inter-Rater Agreement with multiple raters and variables. Retrieved February 20, 2015, from https://mlnl.net/jg/software/ira/
- Deen Freelon (2015) ReCal: reliability calculation for the masses
- Both report (a) mean percentage agreement, (b) mean kappa, (c) Scott pi or Fleiss kappa, (d) Krippendorff alpha
- Material to Add
- Index of concordance = A / (A+D) where A=agreement and D=disagreement
- Replicate Table 3 to show problem with Cohen/Fleiss Kappa: Joyce (2013) Blog Entry: Picking the Best Intercoder Reliability Statistic for Your Digital Activism Content Analysis (PDF version of page)
- Rating Scales are Ordinal with Several Steps, or Interval/Ratio Scale
- Presentation notes (to be revised): Inter-rater Agreement with Ranked/Interval Data (revisions needed: ICC absolute agreement vs. consistency -- absolute agreement means the same scores, consistency means a similar pattern of scores)
- Two Raters
- Agreement vs. Consistency
- Are scores similar between raters: Question of Agreement
- Is the pattern of scores similar between raters: Question of Consistency
- See Inter-rater Agreement with Ranked/Interval Data for illustration of this difference.
- Liao, Hunt, & Chen (2010). Comparison between Inter-rater Reliability and Inter-rater Agreement in Performance Assessment.
- p 613 "Inter-rater agreement and inter-rater reliability are both important for PA. The former shows stability of scores a student receives from different raters, while the latter shows the consistence of scores across different students from different raters."
- Not sure I agree with this statement. Consistency does not show consistency of scores across different cases. Scores from different raters can differ widely yet be perfectly consistent.
- Update: They claim the data in Table 2 show high agreement but low consistency. They are incorrect; the agreement indices are very low: K alpha = -.29 (ordinal, interval, and ratio), and ICC for agreement = -.38 (single rater) and -5.00 (multiple raters). Moreover, the subject means in Table 2 differ by almost 2 SD, a large ES difference.
- Is it possible to have high agreement but low consistency?
- Pearson r and Correlated t-test
- Can be used, but Pearson r is not a measure of agreement
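A small Python illustration of why Pearson r is not an agreement measure. The data are hypothetical: the second rater scores every case exactly two points higher, so the pattern of scores is identical but no pair of scores matches.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two raters' scores."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def exact_agreement(x, y):
    """Share of cases where both raters gave the same score."""
    return sum(a == b for a, b in zip(x, y)) / len(x)

# Hypothetical data: rater B is always 2 points above rater A.
rater_a = [1, 2, 3, 4, 5]
rater_b = [3, 4, 5, 6, 7]

print(pearson_r(rater_a, rater_b))        # perfect consistency
print(exact_agreement(rater_a, rater_b))  # no agreement at all
print(sum(b - a for a, b in zip(rater_a, rater_b)) / len(rater_a))  # mean difference
```

The mean difference is what a correlated (paired) comparison of the raters' means picks up and what r ignores.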
- Cohen's Weighted Kappa for ordinal data
- gamma? others?
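A sketch of weighted kappa (Cohen, 1968) in Python, assuming two raters, a known ordering of the categories, and either linear or quadratic disagreement weights. The function name, the `power` argument, and the data are hypothetical choices for illustration.

```python
def weighted_kappa(r1, r2, levels, power=2):
    """Weighted kappa; levels lists the categories from lowest to highest.

    power=1 gives linear weights, power=2 quadratic weights.
    Requires at least two levels.
    """
    index = {c: i for i, c in enumerate(levels)}
    k, n = len(levels), len(r1)
    observed = [[0] * k for _ in range(k)]
    for a, b in zip(r1, r2):
        observed[index[a]][index[b]] += 1
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    num = den = 0.0
    for i in range(k):
        for j in range(k):
            # Disagreement weight grows with the distance between categories.
            w = (abs(i - j) / (k - 1)) ** power
            num += w * observed[i][j] / n          # observed weighted disagreement
            den += w * row[i] * col[j] / n ** 2    # chance weighted disagreement
    return 1.0 - num / den

# Hypothetical data: 10 cases rated on a three-step ordered scale.
levels = ["poor", "good", "very good"]
rater_a = ["poor"] * 4 + ["good"] * 3 + ["very good"] * 3
rater_b = ["poor"] * 3 + ["good"] * 3 + ["very good"] * 4

print(round(weighted_kappa(rater_a, rater_b, levels, power=1), 3))  # linear
print(round(weighted_kappa(rater_a, rater_b, levels, power=2), 3))  # quadratic
```

Both disagreements in these data are between adjacent categories, so the weighted values are higher than the unweighted kappa for the same table (about .70).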
- Krippendorff's Alpha for ordinal, interval, and ratio data
- Hayes' website: http://www.afhayes.com/public/kalpha.sps
- Copied syntax here: SPSS syntax
- Spring 2015 students noted errors with SPSS version 21 and 22; check syntax
- Intra-class Correlation Coefficient
- Absolute Agreement (scores match) vs. Consistency (patterns match)
- Multiple raters vs. single rater: use the multiple-rater (average measures) ICC for studies that use scores combined across raters; use the single-rater (single measure) ICC for studies that use scores from a single rater
- Sources:
- Hallgren (2012). Computing Inter-Rater Reliability for Observational Data: An Overview and Tutorial. Tutor Quant Methods Psychol.
- David P. Nichols (1998) Choosing an intraclass correlation coefficient (in SPSS). Explains different model options, difference between consistency and absolute agreement, and between single measure vs. average measure.
- Yaffee (1998) Enhancement of Reliability Analysis: Application of Intraclass Correlations with SPSS/Windows v.8. Explains differences in ICC provided by SPSS and gives examples.
- Notes:
- Wuensch (2014). Inter-rater Agreement. SPSS instructions with discussion.
- Krippendorff Alpha for ordinal data provides a much better assessment than nominal measures (kappa, etc.) whenever there is any order to the data. Source: Antoine, Villaneau, Lefeuvre. (2014) Weighted Krippendorff's alpha is a more reliable metrics for multi-coders ordinal annotations: experimental studies on emotion, opinion and coreference annotation.
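To make the ICC options above concrete (absolute agreement vs. consistency, single vs. multiple raters), here is a Python sketch based on the two-way ANOVA mean squares, following the standard McGraw and Wong (1996) formulas. It assumes a complete cases-by-raters table; the function name, dictionary keys, and data are hypothetical, and the output should be checked against SPSS before being relied on.

```python
def icc_two_way(scores):
    """ICCs from a complete table: one row per case, one column per rater."""
    n, k = len(scores), len(scores[0])
    grand = sum(map(sum, scores)) / (n * k)
    row_means = [sum(r) / k for r in scores]
    col_means = [sum(r[j] for r in scores) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between cases
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between raters
    ss_total = sum((x - grand) ** 2 for r in scores for x in r)
    ms_r = ss_rows / (n - 1)
    ms_c = ss_cols / (k - 1)
    ms_e = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return {
        # Consistency ignores differences between rater means.
        "consistency_single": (ms_r - ms_e) / (ms_r + (k - 1) * ms_e),
        "consistency_average": (ms_r - ms_e) / ms_r,
        # Absolute agreement penalizes differences between rater means.
        "agreement_single": (ms_r - ms_e)
        / (ms_r + (k - 1) * ms_e + k * (ms_c - ms_e) / n),
        "agreement_average": (ms_r - ms_e) / (ms_r + (ms_c - ms_e) / n),
    }

# Hypothetical data: rater B is always 2 points above rater A.
scores = [[1, 3], [2, 4], [3, 5], [4, 6], [5, 7]]

for name, value in icc_two_way(scores).items():
    print(name, round(value, 3))
```

For these data the consistency ICCs equal 1 (the patterns match perfectly) while the absolute agreement ICCs are clearly lower (the scores never match).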
- Three or More Raters
- Krippendorff's alpha
- Key benefit is ability to handle missing data from multiple raters
- Handles ordinal, interval, and ratio data
- Intra-class Correlation
- SPSS instructions and discussion: Wuensch (2013) The Intraclass Correlation Coefficient.
- Cronbach's alpha
- Not a measure of agreement; instead, a measure of pattern consistency
- Cronbach's alpha equals the multiple-rater (average measures) ICC under consistency, but not under absolute agreement
- Can be used as aggregate measure of consistency across raters if mean score is to be used
- Source: Hayes & Krippendorff (2007), Answering the Call for a Standard Reliability Measure for Coding Data, explain this on p. 81
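The equivalence noted above can be checked numerically. The Python sketch below treats raters as "items", computes Cronbach's alpha from the usual variance formula, and compares it with the average-measures ICC under consistency and under absolute agreement. Function names and data are hypothetical; the third rater is deliberately more lenient than the other two.

```python
from statistics import variance

def cronbach_alpha(scores):
    """Cronbach's alpha with raters as items (rows = cases, columns = raters)."""
    k = len(scores[0])
    rater_vars = [variance([row[j] for row in scores]) for j in range(k)]
    total_var = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(rater_vars) / total_var)

def icc_average(scores):
    """Average-measures ICC: returns (consistency, absolute agreement)."""
    n, k = len(scores), len(scores[0])
    grand = sum(map(sum, scores)) / (n * k)
    ss_rows = k * sum((sum(r) / k - grand) ** 2 for r in scores)
    ss_cols = n * sum(
        (sum(r[j] for r in scores) / n - grand) ** 2 for j in range(k)
    )
    ss_total = sum((x - grand) ** 2 for r in scores for x in r)
    ms_r = ss_rows / (n - 1)
    ms_c = ss_cols / (k - 1)
    ms_e = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    consistency = (ms_r - ms_e) / ms_r
    absolute = (ms_r - ms_e) / (ms_r + (ms_c - ms_e) / n)
    return consistency, absolute

# Hypothetical data: 5 cases, 3 raters; the third rater scores higher throughout.
scores = [[2, 3, 5], [4, 4, 7], [5, 6, 7], [7, 7, 10], [8, 9, 11]]

alpha = cronbach_alpha(scores)
consistency, absolute = icc_average(scores)
print(round(alpha, 3), round(consistency, 3), round(absolute, 3))
```

Alpha and the consistency ICC coincide, while the absolute agreement ICC is lower because it also counts the lenient rater's higher mean as disagreement.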
- Other Measures of Agreement/Consistency
- Many measures exist. Some discussion can be found here:
- Barnhart et al. (2014). Choice of agreement indices for assessing and improving measurement reproducibility in a core laboratory setting
- Banerjee, Capozzoli, McSweeney, and Sinha (1999). Beyond kappa: A review of interrater agreement measures. Reviews a number of agreement measures.
- Euclidean coefficients -- https://conservancy.umn.edu/bitstream/handle/11299/114459/v15n4p321.pdf?sequence=1
- Loglinear models to examine patterns of agreement.
- Latent Class models
- Bennett's sigma
- Gwet's gamma
- Aickin alpha
- Concordance correlation coefficient (CCC)
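As one concrete example from this list, Lin's concordance correlation coefficient can be sketched in a few lines of Python. The version below assumes two raters and uses the 1/n (population) variances and covariance of Lin's original definition; the function name and data are hypothetical.

```python
def lin_ccc(x, y):
    """Lin's concordance correlation coefficient for two raters."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x) / n
    syy = sum((b - my) ** 2 for b in y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    # The squared mean difference in the denominator penalizes systematic shifts.
    return 2 * sxy / (sxx + syy + (mx - my) ** 2)

# Hypothetical data: rater B is always 2 points above rater A (Pearson r = 1).
rater_a = [1, 2, 3, 4, 5]
rater_b = [3, 4, 5, 6, 7]

print(lin_ccc(rater_a, rater_b))
```

Unlike Pearson r, the CCC drops below 1 when one rater is systematically higher, so it behaves as an agreement index rather than a consistency index.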