Hi Ryan --

When I wrote that page, I developed the mean agreement measure from logic rather than from the literature. Fortunately, my logic seems to agree with folks in the measurement field. For example, see the measure of overall agreement on page 379 of Fleiss:

http://www.wpic.pitt.edu/research/biometrics/Publications/Biometrics%20Archives%20PDF/395-1971%20Fleiss0001.pdf

While the formula looks different from what I show on my page, I believe the result is the same. As you probably know, raw percent agreement may overstate the amount of agreement because some agreement occurs by chance, so a modification to this measure, such as the one Fleiss offers, may be useful.

To demonstrate that the calculations I provide agree with Fleiss' overall agreement measure, you can use this page, which also provides better measures of agreement:

https://mlnl.net/jg/software/ira/

I took my example on page 5 and entered the 3 raters' scores as a text file with the following content:

Reviewer1  Reviewer2  Reviewer3
1  1  1
2  2  2
3  3  3
2  3  3
1  1  1
2  3  1
2  2  1
1  1  1
2  1  1
1  1  1
2  2  2
3  3  3
1  1  1
1  1  2
2  2  2
2  2  2
1  1  1
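As an aside, the mean agreement figure for a table like this is easy to reproduce in code. Here is a minimal Python sketch of mean pairwise percent agreement, using a small hypothetical dataset rather than the full table above (the function name is my own):

```python
from itertools import combinations

def mean_pairwise_agreement(cases):
    """Fraction of rater pairs that agree, pooled over all cases."""
    agree = total = 0
    for ratings in cases:                      # one row of rater scores per case
        for a, b in combinations(ratings, 2):  # every pair of raters
            total += 1
            agree += (a == b)
    return agree / total

# small hypothetical example (not the table above)
cases = [(1, 1, 1), (2, 3, 3), (1, 2, 3), (2, 2, 2)]
print(mean_pairwise_agreement(cases))  # 7 agreeing pairs out of 12
```

Multiplying the result by 100 gives the percent agreement figure.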

The page above provided the following calculations as a result:

Data
3 raters and 18 cases
1 variable with 54 decisions in total
no missing data

1: rra
 Fleiss:          A_obs = 0.741   A_exp = 0.372   Kappa = 0.587
 Krippendorff:    D_obs = 0.259   D_exp = 0.639   Alpha = 0.595
 Pairwise avg.:   % agr = 74.1    Kappa = 0.592

You can see that the Fleiss A_obs of .741 is within rounding error of the 74.07% I show on my page (ignoring the difference between a proportion and a percentage), and the % agreement of 74.1 under the pairwise average is the same.
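For what it's worth, Fleiss' kappa itself is not hard to compute by hand: it is (A_obs - A_exp) / (1 - A_exp), where A_exp comes from the squared marginal proportions of the categories. A rough Python sketch, with a toy dataset and function name of my own:

```python
def fleiss_kappa(counts):
    """counts[i][j] = number of raters assigning case i to category j.
    Every case must have the same total number of raters."""
    n_cases = len(counts)
    n_raters = sum(counts[0])
    # per-case observed agreement, averaged over cases (A_obs)
    a_obs = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_cases
    # chance agreement from the category marginals (A_exp)
    n_categories = len(counts[0])
    p = [sum(row[j] for row in counts) / (n_cases * n_raters)
         for j in range(n_categories)]
    a_exp = sum(pj * pj for pj in p)
    return (a_obs - a_exp) / (1 - a_exp)

# toy example: 2 cases, 3 raters, 2 categories; kappa works out to 0.25
print(fleiss_kappa([[3, 0], [1, 2]]))
```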

Krippendorff's alpha seems to be a popular measure to use for multiple raters.
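In case it's useful, the nominal-data version of Krippendorff's alpha can also be sketched in a few lines. This is my own rough implementation (toy data and function name are mine), not the code behind the page above:

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(units):
    """units: one list of ratings per case (each with at least 2 ratings)."""
    # coincidence matrix: each ordered pair of ratings within a unit
    # contributes 1/(m-1), where m is the number of ratings in that unit
    coincidence = Counter()
    for ratings in units:
        m = len(ratings)
        if m < 2:
            continue  # a single rating gives no pairable information
        for a, b in permutations(ratings, 2):
            coincidence[(a, b)] += 1 / (m - 1)
    totals = Counter()  # marginal totals per category
    for (a, _), w in coincidence.items():
        totals[a] += w
    n = sum(totals.values())
    d_obs = sum(w for (a, b), w in coincidence.items() if a != b)
    d_exp = sum(totals[a] * totals[b]
                for a in totals for b in totals if a != b) / (n - 1)
    return 1 - d_obs / d_exp

# toy example: 3 cases, 2 raters each; alpha comes out to 4/9
print(krippendorff_alpha_nominal([[1, 1], [2, 2], [1, 2]]))
```

Unlike kappa, alpha handles missing ratings naturally, since units are not required to have the same number of ratings.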

Personally, if I were reporting the level of agreement among raters, I would report the mean level of agreement (74.1%), Fleiss' kappa (.587), and Krippendorff's alpha (.595), and let readers decide which they wish to believe.

Good luck
B Griffin