EDUR 7130
Educational Research On-Line

Descriptive and Inferential Statistics


Assigned Readings

Gall's Text
6th edition: Chapter 5; and pages 389-403; 409-432; 456-459
7th edition: Chapter 5; and pages 304-316; 320-339; 360-363 
8th edition: Chapter 5; and pages 315-328; 332-352; 375-377

Statistics and Research

Statistics are used by researchers to describe data and relationships among data. For example, if one asks how a class of students did on a test, it would be inefficient to communicate each and every score. Rather, a more efficient method is to provide the average to give a general indication of how the students performed on the test. Below are various methods of describing data (Descriptive Statistics) and of modeling relationships among variables (Inferential Statistics).

Types of Descriptive Statistics

Descriptive statistics are used to describe data in a concise, understandable way. Descriptive statistics are summary indicators of larger groups of data. The example above illustrates how descriptive statistics may be used to reduce large amounts of information into a few summary indicators--thus reducing class scores to a class average. Two important summary methods for data are measures of central tendency (typical or average scores) and measures of dispersion (variability or spread of scores).

 

1. Measures of Central Tendency

Measures of central tendency are indicators of average or typical scores one might find in a distribution of scores. The three most common measures of central tendency are mode, median, and mean.

(a) Mode (Symbolized as Mo): This is the most frequent score in a distribution. In Table 1 below (see Graphing Data), the mode is 85, and in Table 2, the mode is female. The mode is appropriate for nominal, ordinal, interval, and ratio data.

(b) Median (Symbolized as Md, Mdn, or X50): The score directly in the middle when all scores are placed in rank order; the point at which 50% of the scores are above and 50% are below. For example, of the following scores, 5, 3, 7, 6, 9, 1, 4, the median is:

1, 3, 4, 5, 6, 7, 9

the score that falls in the middle of the distribution, which is five in this example.

If one has an even number of scores, the median is the mean (arithmetic average) of the two middle scores. For these scores, 2, 1, 3, 10, the median is

1, 2, 3, 10

(2+3)/2 = 2.5

so 2.5 is the median--exactly 50% of the scores fall below this and 50% above this score.

The median is a good measure for ordinal data, or for interval/ratio data when the distribution is highly skewed (e.g., income in the U.S. is positively skewed, so the median is typically reported). Skew means that there are a few very high scores or a few very low scores, and these extreme scores often distort the mean.

(c) Mean (Symbolized as M): The mean is what one usually thinks of as the average. The formula is:

Σ Xi / n = mean = M,

where Xi represents the raw scores, n is the sample size (the number of scores), and Σ means to sum the scores (to add all the scores together). In words, the mean is simply the sum of all scores divided by the number of scores. For example, for this set of scores (1, 2, 3, 10) the mean is:

(1+2+3+10) / 4 = 16/4 = 4

This measure is best used for ratio or interval data, but is often okay with ordinal data. It is not appropriate for nominal data since the mean assumes rank and nominal data do not have rank.

For the scores given in the frequency display below (see Table 1: Frequency Distribution for Test Scores), the mean, median, and mode are:

Scores from Table 1: 83 83 84 85 85 85 86 87 88 88

M = 854/10 = 85.40, Mdn = 85 (the mean of the two middle scores, 85 and 85), Mo = 85 (the most frequent score).
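These computations are easy to verify with a short program. Below is a minimal sketch using Python's standard statistics module (the score lists come from the examples above):

import statistics

scores = [83, 83, 84, 85, 85, 85, 86, 87, 88, 88]  # scores from Table 1

print(statistics.mode(scores))    # 85   (Mo: the most frequent score)
print(statistics.median(scores))  # 85   (Mdn: the middle of the ranked scores)
print(statistics.mean(scores))    # 85.4 (M: sum of scores divided by n)

# With an even number of scores, the median is the mean of the two middle scores:
print(statistics.median([1, 2, 3, 10]))  # (2 + 3) / 2 = 2.5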

 

2. Measures of Variability

A measure of variability provides some indication of the dispersion or spread of scores in a distribution. Note that central tendency indicates typical or average scores, and variability indicates spread of scores. For example, consider the following two sets of scores:

Set A: 80, 80, 80, 80, 80

Set B: 60, 70, 80, 90, 100

Both sets have the same mean and median (M = 80, Mdn = 80), yet they have very different spread or dispersion. Set A has no variability at all, while Set B has much variability (no two scores are the same).

There are several measures of variability that help to show differences in variability like that found in Sets A and B above. Several of these measures are provided below.

(a) Range (Symbolized as R): The range is the quickest and easiest measure of dispersion to calculate. The formula is simply the difference between the largest and smallest scores in the distribution of scores, i.e.,

Xmax - Xmin.

For example, with Set B, Xmax = 100, Xmin = 60, so the range is R = 100 - 60 = 40, or a 40 point spread. The range for Set A is R = 80 - 80 = 0, or no spread.

The problem with the range is that it only considers two numbers in the distribution, the highest and lowest score. Does the range adequately address variability for the following two sets?

Set C:  70 75 75 75 75 80

Set D:  70 72 74 76 78 80

Note that for both Sets C and D, the range is R = 80 - 70 =  10, or 10 points, yet the numbers suggest that Set D has more variability because no two numbers are the same while in Set C, there are only three unique numbers, 70, 75, and 80.

What is needed is a measure of variability that takes into account all numbers in the data, not just the two extreme numbers.

(b) Standard Deviation (Symbolized as SD, s): The standard deviation is more complex than the range and it provides a more useful indication of variability in a set of scores. The formula will not be discussed, but you should note that the standard deviation, like the range, cannot be less than zero (i.e., 0.00), and the larger the standard deviation, the greater the variability in a set of scores.

Using Sets C and D, the standard deviations are SD = 3.16 for Set C and SD = 3.74 for Set D. Set D has the larger standard deviation, and this indicates that scores in Set D have more variability than scores in Set C.
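Both the range and the standard deviation can be checked with a short Python sketch (statistics.stdev computes the sample standard deviation used here):

import statistics

set_c = [70, 75, 75, 75, 75, 80]
set_d = [70, 72, 74, 76, 78, 80]

for name, data in [("Set C", set_c), ("Set D", set_d)]:
    r = max(data) - min(data)      # range: Xmax - Xmin
    sd = statistics.stdev(data)    # sample standard deviation
    print(name, "R =", r, "SD =", round(sd, 2))

# Output:
# Set C R = 10 SD = 3.16
# Set D R = 10 SD = 3.74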

As another example, listed below are two sets of scores with the same measures of central tendency, but with different measures of variability.

Table 3: Example 2 for Variability

      Girls   Boys
      83      75
      84      80
      85      85
      85      85
      85      85
      86      90
      87      95

M     85      85
Mdn   85      85
Mo    85      85
SD    1.29    6.45
R     4       20

As this table shows, the measures of central tendency are identical for boys and girls, but the two groups differ dramatically on the measures of variability.

 

3. Graphing Data

Another method for presenting and describing data in an efficient manner is through the use of graphs. Graphs provide pictorial displays that enable one to more readily understand a distribution of scores. A few commonly used graphs are described below.

(a) Frequency Distributions

Frequency distributions are used to indicate how many times a particular value was obtained in a set of scores. For example, consider the following scores obtained on a test:

85, 85, 88, 87, 84, 85, 86, 83, 88, 83.

The frequencies for the above scores are:

Table 1: Frequency Distribution for Test Scores

X (raw score)   F (frequency of score)
88              2
87              1
86              1
85              3
84              1
83              2

As this frequency distribution shows, the most common score was 85, and the least frequent scores were 84, 86, and 87. Frequency distributions will work with data from any type of variable (nominal, ordinal, interval, or ratio). To illustrate, one could make a frequency distribution for the sexes enrolled in a course. Suppose there are 11 women and 5 men; the frequency distribution would be:

Table 2: Frequency Distribution for Sex

Sex Frequency
Female 11
Male 5
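Building a frequency distribution amounts to tallying how often each value occurs, so it is easy to sketch in Python with the built-in Counter (using the data from Tables 1 and 2):

from collections import Counter

test_scores = [85, 85, 88, 87, 84, 85, 86, 83, 88, 83]
sexes = ["Female"] * 11 + ["Male"] * 5

# Counter tallies how many times each value occurs; it works for any
# type of variable (nominal, ordinal, interval, or ratio).
print(Counter(test_scores))  # Counter({85: 3, 88: 2, 83: 2, 87: 1, 84: 1, 86: 1})
print(Counter(sexes))        # Counter({'Female': 11, 'Male': 5})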

(b) Bar Chart and Histogram

Other commonly used graphical tools are bar charts and histograms. Both are similar; the only difference is that a bar chart is used for qualitative data (so the bars do not touch, thus indicating a lack of continuity), while the histogram is used with quantitative data (the bars touch). Examples of both are provided below.
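As a minimal illustration, the sketch below draws both kinds of graphs side by side; it assumes the matplotlib plotting library is installed:

import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2)

# Bar chart: qualitative data, so the bars do not touch
ax1.bar(["Female", "Male"], [11, 5])
ax1.set_title("Bar chart (qualitative)")

# Histogram: quantitative data, so the bars touch
scores = [85, 85, 88, 87, 84, 85, 86, 83, 88, 83]
ax2.hist(scores, bins=6)
ax2.set_title("Histogram (quantitative)")

plt.show()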

 

(c) Time-Series Graph

A time-series graph displays figures over time. Time-series graphs are often used in single subject research. For more examples, see the section Single Subject Research.

4. Measure of Relative Position

Relative position refers to the location in a distribution of a given score relative to other scores. Relative position indicates how well one performed on a test relative to others.

The only measure of relative position discussed here will be Percentile Rank (PR). A percentile rank indicates the proportion or percentage of individuals who scored less than a given score. For example, if you receive a PR of 75, then this means that 75% of those who took the test scored less than you. It does not mean that you got 75% of the items correct on the test. If you had a PR of 4, this means you scored better than 4% of those who took the test.

Note also that some define percentile rank as representing the percentage who scored at or below a given score. Thus, a PR of 75 means that one scored the same as or better than 75% of test takers. 

Both ways of defining percentile rank ([a] scored better than, or [b] scored equal to or better than) are commonly used and found in education.
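The two definitions differ only in whether a tied score counts; a minimal Python sketch makes the contrast concrete (the scores reuse Table 1):

def pr_below(scores, x):
    # PR as the percentage of test takers who scored LESS than x
    return 100 * sum(s < x for s in scores) / len(scores)

def pr_at_or_below(scores, x):
    # PR as the percentage who scored AT OR BELOW x
    return 100 * sum(s <= x for s in scores) / len(scores)

scores = [83, 83, 84, 85, 85, 85, 86, 87, 88, 88]
print(pr_below(scores, 86))        # 60.0 -> scored better than 60% of takers
print(pr_at_or_below(scores, 86))  # 70.0 -> scored equal to or better than 70%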

Hypothesis Testing

A key to quantitative research is the hypothesis test. When trying to determine whether two variables are related (like intelligence and scholastic performance, or sex [female vs. male] and dropout status), statistical testing of hypotheses is the tool most frequently used.

What does it mean to statistically test a hypothesis? With hypothesis testing, data from studies are used to make a judgement about whether enough information exists to support a conclusion that a relationship exists between variables. To make this judgement, two types of hypotheses are considered, the research hypothesis (which in this case will be either directional or non-directional) and the null hypothesis.

To introduce you to the idea of hypothesis testing, consider the following research situation.

Study 1: Boys/Girls and Math

A researcher wishes to know if there is statistical evidence that boys and girls differ on a standardized mathematics test. The research hypothesis is:

Ha: Boys will have higher scores than girls on the mathematics test.

The null hypothesis is

Ho: There will be no difference in mathematics test scores between boys and girls.

In this example, Ha is the research hypothesis (H indicates hypothesis, the letter "a" indicates that this is the alternative or research hypothesis) and Ho is the null hypothesis (H is hypothesis, "o" is the symbol for null).

When making a scientific judgement about this study, hypothesis testing is used. In all types of hypothesis testing, the null hypothesis is always the hypothesis tested. The logic of testing the null hypothesis follows something like this:

The null indicates no difference in math scores between boys and girls in this example. If the null hypothesis can be rejected, then some hypothesis other than the null must be accepted, which will be either the directional or non-directional research hypothesis. Note that only the null indicates no difference; both the directional and non-directional hypotheses indicate some difference between boys and girls on math scores. So by rejecting the null hypothesis, we can establish evidence for a difference between boys and girls.

So in Study 1, we will first collect data, then we will use a statistical procedure (several are discussed below) to test the null hypothesis. If the null hypothesis is rejected, then we can accept the alternative or research hypothesis and state that sex and math scores are related in some way (perhaps that boys score higher than girls). If the null hypothesis is not rejected, then we state that there is not enough evidence in the data collected to say that a difference exists between boys and girls on math test scores. The key to the conclusion we draw is whether the null hypothesis is rejected or not rejected: if rejected, then a relationship or difference exists; if not rejected, then we cannot claim that sex and math scores are related.

Study 2: Intelligence and Math

A researcher wishes to know if there is statistical evidence that one's intelligence is related to one's performance on a mathematics test. The research hypothesis is:

Ha: Those with higher levels of intelligence will score higher on a mathematics test.

The null hypothesis is

Ho: There is no relationship between intelligence and mathematics test scores.

For Study 2 we are interested in knowing whether a relationship exists between two variables, intelligence and math scores. As with Study 1, the question of interest is whether the null hypothesis can be rejected. As before, the null hypothesis is tested. If the null hypothesis is rejected, based upon the evidence provided in the data collected, then we can say that intelligence and math scores are related (perhaps positively related). If the null hypothesis is not rejected, then we say that there is not enough evidence in the data to say that intelligence and math scores are related in any way.

Statistical Significance (Statements of Significance)

As noted above, when performing statistical testing of hypotheses, one always tests the null hypothesis. There are many statistical methods for testing null hypotheses, and a number of them are discussed below (e.g., t-test, Pearson's r, ANOVA). All statistical testing procedures have corresponding null hypotheses. When testing hypotheses with statistical tests like the t-test, ANOVA, and Pearson's r, one must assume that the null hypothesis is true; that is, one must assume that no differences actually exist among the groups examined, or that no relationships exist among the variables.

All of the statistical tests will provide some indication as to whether the evidence provided by the data supports or does not support the null hypothesis. If the evidence indicates that the null hypothesis is supported, then the researcher concludes that no differences between the groups really exist (or that no relationship among the variables exists), and the researcher can therefore state that the statistical results indicate that there were no statistically significant differences among the groups, or that there were no statistically significant relationships among the variables.

If, however, the data do not support the null hypothesis, then the researcher can reject the null hypothesis and accept the research hypothesis. In that case, the researcher can say that a statistically significant difference exists among the groups (which is, of course, what the research hypothesis claims), or that a statistically significant relationship exists among the variables.

Thus, when one claims that results are statistically significant, this does not mean that important results were obtained; rather, it simply means that the data do not support the null hypothesis. Whether important results were obtained in any study cannot be determined by statistical hypothesis testing; importance can only be assessed through one’s interpretation of the results--that is, one must examine the data closely to determine whether something important was found.

So, what does it mean when one hears, say in the news, that researchers found a significant relationship between smoking and lung cancer, or that researchers found a significant relationship between leaving a light on for young children at night and their likelihood of developing near-sightedness? Does the term significant mean they found an important result? No. In this case significant only means that the data collected by the researchers provided enough evidence to reject the null hypothesis; it does not mean that something important was found.

To illustrate the difference, consider a study of computer-assisted instruction in mathematics. A recent study found a significant difference in math achievement scores between two groups: one group used traditional instruction (paper and pencil exercises, etc.), and the other group supplemented this instruction with mathematics software instruction three times a week. One might be tempted to conclude that since a significant difference was found, the computer-assisted instruction makes a big difference. This is not the case. There is a curious characteristic of hypothesis testing related to sample size: the larger the sample one collects, the easier it is to reject the null hypothesis (hence the easier it is to claim a significant effect was found). In this study, the mean math-achievement score for the traditional instruction group was 84.3, and the mean math-achievement score for the computer-assisted group was 85.2, less than a 1 point difference in achievement. Do you consider this to be an important improvement in achievement? When making this judgement, consider also the resources required to use computer-assisted instruction (money, time, space, etc.). I hope this example illustrates that just because someone reports a significant finding in research, this does not necessarily mean something important was found; it only means that the null hypothesis was rejected.

Statistical Significance (Probability Values)

Whenever one performs a statistical test to determine whether to reject or fail to reject the null hypothesis, one always relies on probability. Keep in mind that when performing hypothesis tests, one is working with samples, not populations. Any given sample may or may not represent the population well, so anything found in a sample study may or may not be true for the population.

The reason one works with a sample is to make inferences or generalizations back to the population. The reason one tests hypotheses is to help with the decision making process of generalizing from sample to population. So, for example, if one fails to reject the null hypothesis that intelligence is unrelated to math scores, one would then conclude that intelligence and math scores are unrelated in the population.

With all statistical testing procedures for testing null hypotheses, a few key bits of information will be calculated. One thing calculated is called the inferential test statistic. This will be described for each statistical test discussed below, but usually it will be one of the following: t, F, or χ2. Each of these three symbols represents the inferential test statistic used to judge whether one should reject or not reject the null hypothesis being tested.

Another statistic produced by all hypothesis testing procedures is the p-value, which is usually denoted simply as "p" in reports. The p-value indicates the level of probability associated with any test statistic. The smaller the p-value, the less likely it is that the data support the null hypothesis. So with small p-values, researchers are more likely to conclude that the null hypothesis must be rejected. In many cases researchers will use one of two cut-off levels for rejecting the null hypothesis: a p-value less than .05 or less than .01. These cut-off values are called alpha (α) in statistics, and are sometimes referred to as the significance level.

So, for example, if we test the null hypothesis that intelligence is unrelated to math scores, the p-value obtained (from the computer software used to analyze the data) might be, say, p = .33. Since this value is larger than an alpha of either .05 or .01, we would not reject the null hypothesis and therefore conclude that intelligence is unrelated to math scores. If, however, we obtained a p-value of .04, we would be able to reject the null hypothesis at the alpha level of .05 since .04 is less than .05 and we could then conclude that our data indicate that intelligence is related to math scores.
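The decision rule just described is mechanical enough to write down as a short Python sketch:

def test_null(p_value, alpha=0.05):
    # Reject the null hypothesis when the p-value falls below alpha
    if p_value < alpha:
        return "reject the null hypothesis (statistically significant)"
    return "fail to reject the null hypothesis (not significant)"

print(test_null(0.33))  # fail to reject: .33 is larger than alpha = .05
print(test_null(0.04))  # reject: .04 is less than alpha = .05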

Whenever you read research reports or articles, look for p-values. If presented, they can be found in tables that report statistical information. Unfortunately, some researchers choose not to report exact p-values and instead use an asterisk (*) to indicate that the p-value was less than a given cut-off level like .05 or .01. See the examples below for illustrations.

Types of Inferential Statistics

There are many statistical procedures used to test null hypotheses, and each is best suited for specific research situations and types of data. Below is a table that lists some of the more commonly used statistical procedures. Study this table as you study the various types of inferential statistical procedures.

Table 4: Types of Statistical Procedures and Their Characteristics

(1) Pearson's r (correlation coefficient)
Independent Variable: Quantitative
Dependent Variable: Quantitative
Example Hypothesis: There is a positive relationship between intelligence and mathematics achievement scores.

(2) t-test
Independent Variable: Qualitative
Dependent Variable: Quantitative
Special Feature: IV has only 2 categories.
Example Hypothesis: There will be a difference between boys and girls on mathematics achievement scores.

(3) ANOVA (Analysis of Variance)
Independent Variable: Qualitative
Dependent Variable: Quantitative
Special Feature: IV may have 2 or more categories.
Example Hypothesis: There will be a difference among Blacks, Hispanics, and Whites in mathematics achievement scores.

(4) ANCOVA (Analysis of Covariance)
Independent Variables: 1. Qualitative; 2. Covariate, which may be either qualitative or quantitative, but is usually quantitative
Dependent Variable: Quantitative
Special Features: IV may have 2 or more categories; the covariate is used to make adjustments to DV means.
Example Hypothesis: There will be a difference among Blacks, Hispanics, and Whites in mathematics achievement scores after taking into account levels of motivation.

(5) Chi-Square (χ2)
Independent Variable: Qualitative
Dependent Variable: Qualitative
Example Hypothesis: Males will be more likely to drop out of school than females.

(1) Pearson's r (Correlation Coefficient)

The relationship between two quantitative variables may be measured by Pearson's product moment correlation coefficient, which is symbolized by the letter "r." The correlation coefficient, r, may range from 1.00 (indicating a perfect, positive, linear relationship) to -1.00 (indicating a perfect, negative, linear relationship), and may take any value in between. When r = 0.00, no linear relationship exists between the two variables. The closer the coefficient is to 1.00, the stronger the positive, or direct, relationship; the closer the coefficient is to -1.00, the stronger the negative, or inverse, relationship. A coefficient of 0.00 represents the weakest possible linear relationship, although the variables may still be non-linearly related.

A linear relationship is such that as one variable increases, the other increases (or decreases) in a straight-line fashion. If, however, the relationship were such that as one variable increased the other increased to a point and then began to decrease, a curvilinear or non-linear relationship would exist. These types of relationships are presented below with three scatterplots:

[Three scatterplots, figures (a), (b), and (c), appear here.]

Figure (a) shows a positive relationship, figure (b) shows a negative relationship, and figure (c) depicts a curvilinear or non-linear relationship.

In case you have difficulty understanding a scatterplot, it can be viewed as the point intersections of two variables. Each scatterplot has two axes, the vertical and the horizontal, and each of the two variables is represented on one axis. In the example below, two variables are plotted: intelligence (measured on a 1 to 5 scale) and test scores (also measured on a 1 to 5 scale).

To help illustrate what a scatterplot does, three people are identified on the scatterplot. The dot with an "a" beside it indicates the two scores for person "a." This individual scored a 1 on intelligence and a 1 on test scores; person "b" scored a 5 on intelligence and a 5 on test scores; and person "c" scored a 3 on intelligence and a 4 on test scores. By plotting the combination of scores from two variables in this manner, one can more readily see any type of relationship that may exist between two variables. In this example, there is a positive relationship between intelligence and test scores--as intelligence increases (goes from 1 to 5), so too do test scores.

Figure 1 illustrates the strength of different correlations via scatterplots. For both positive and negative correlations, the tightness of the scatter corresponds to the size of the correlation coefficient r: the closer r is to 1.00 or -1.00, the more compact the scatter about a central line, and the closer r is to 0.00, the more dispersed the scatter.

[Figure 1: scatterplots illustrating correlations of varying strength.]

The null hypothesis in correlational research states that no relationship exists, which means r = 0.00. As stated earlier, the null hypothesis is tested to determine statistical significance. In the case of correlational research, the question of interest is whether there is evidence that a relationship exists between the two variables in the population. Statistical significance testing is used to make this determination.

Significance testing allows one to ask whether the relationship, as measured by r, is likely to be equal to zero. If the correlation coefficient is not likely to be equal to zero, the relationship between two variables is said to be statistically significant. If zero is a strong possibility, then the relationship is not statistically significant. A non-significant relationship means that it is possible that no relationship exists, i.e., r = 0.00 (recall that an r of zero indicates no relationship). Note that determining whether the coefficient is statistically significant or not is a statement of probability or chance—that is, if we say the coefficient is statistically significant, we are stating that the coefficient is probably different from 0.00, but we are not sure.
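In practice the correlation and its p-value come from software. A minimal sketch, assuming the SciPy library is available (the two score lists are invented for illustration):

from scipy.stats import pearsonr

intelligence = [1, 2, 2, 3, 3, 4, 4, 5, 5, 3]
test_scores  = [1, 2, 3, 3, 4, 3, 5, 4, 5, 4]

r, p = pearsonr(intelligence, test_scores)  # correlation and its p-value
print(f"r = {r:.2f}, p = {p:.3f}")

# If p is less than alpha (.05), reject the null hypothesis that the
# population correlation is 0.00 and call the relationship significant.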

Example 1: Reporting Correlations in Research

Davis (1990, p. 89) writes:

"Correlation between the RADS and the Hamilton was .83 (p <  .001), indicating a strong relationship between the two methods of assessing depression."

The correlation in this example is r = .83, which is a strong, positive correlation. The p-value is reported to be less than .001, which is far below the standard cut-off alpha levels of .05 and .01, so the null hypothesis (i.e., no relationship) is rejected in this case.

In another example, Martin (1990, p. 10) writes:

"The perceived relative importance of Behavior 15 (employees verbally complimenting bowlers) was most strongly correlated with bowling average (r = .251, n = 82, p = .011)."

The correlation in this example is r = .251, which is a moderate to weak correlation. The sample size is n = 82, and the p-value is .011, which is less than an alpha of .05, so the null hypothesis of no relationship is rejected.

Example 2: Correlation Matrices

Sometimes researchers are interested in several correlations among a number of variables. The most efficient method for presenting multiple correlations is through a correlation matrix.

Quinn and Griffin (1999) report the following correlation matrix (Table 5) for their study of motivational needs, course satisfaction, and course achievement among undergraduate students in a cooperative learning course.

Table 5. Means, standard deviations, and correlations for
Motivational Needs, Course Satisfaction, and Course Achievement

                            1      2      3      4      5      6
1. Need for Achievement
2. Need for Affiliation  -.36*
3. Need for Autonomy      .26   -.33*
4. Need for Dominance     .14   -.31*   .12
5. Course Satisfaction    .39*   .24   -.05   -.08
6. Course Achievement     .43*   .05    .28*   .06    .41*
M                       11.23  12.42  12.04   9.51  14.43   3.38
SD                       2.57   3.45   2.48   2.58   2.99   0.57

Note: n = 53.

* p < .05

Note that in Table 5 means (M), standard deviations (SD), and sample size (n) are also reported. To show how to read this table, I will highlight a few bits of information. First, the mean for "Need for Dominance" is 9.51 with a SD of 2.58. The correlation between "Need for Dominance" and "Course Satisfaction" is r = -.08 (a weak, non-significant relationship). The correlation between "Need for Autonomy" and "Need for Affiliation" is r = -.33. Note that there is an asterisk next to this correlation (i.e., -.33*). At the bottom of the table, the asterisk is shown to denote correlations that are statistically significant at the .05 alpha level. This means that for correlations marked with an asterisk, the null hypothesis of no relationship is rejected.

(2) t-test

A t-test is used if one wishes to make a comparison between two groups, say an experimental and control group. The t-test allows one to determine whether a statistically significant difference exists between two means—thus, the t-test compares two group means. For example, a t-test would be appropriate to use for testing the following hypothesis:

"It is expected that boys will have higher ITBS mathematics scores than girls."

The inferential statistic calculated in the t-test is called the t-ratio, which is denoted as "t" in most reports and articles. The larger the t-ratio (in absolute value), the more likely we will reject the null hypothesis, because a large t-ratio indicates more evidence in the data that the two groups differ from each other. The formula for the t-ratio will not be discussed here, but you should recognize that "t" is the test statistic used to determine whether the null hypothesis should be rejected.

When reporting t-tests, a common method is to use a table with appropriate statistical information, like the table below.

Example 1: Reporting t-test in Table

Table 5: Results of t-test for mathematics achievement by sex

  Boys Girls
Mean 85.35 78.64
Standard Deviation 6.89 5.99
N 32 36

Note. t = 2.59, df = 66, p = .032

In the table above, the mean math score for boys is 85.35 and for girls it is 78.64, so the boys' mean is higher, as the hypothesis predicted. We also see that the variability in the boys' scores is greater, since the SD is larger for boys. There were more girls in the study than boys (36 girls, 32 boys). The t-ratio in this example is t = 2.59, and its corresponding p-value is p = .032. If the cut-off level for statistical significance testing--alpha--is set at .05, then since the p-value is less than .05 (p = .032), the null hypothesis of no difference between boys and girls is rejected, and we conclude that there is enough evidence in the data to state that boys appear to perform better in math than girls in the population from which these two samples were selected.

Example 2: Reporting t-test in Text

In a study of tutoring, Fuchs, Fuchs, Karns, Hamlett, Dutka, and Katzaroff (1996) wrote (p. 648):

"On the problem sets worked during the tutoring generalization sessions, tutees correctly completed 91% (SD = 12) of the problems they attempted with HA tutors; with AA tutors, they correctly completed 75% (SD = 26), t(19) = 2.78, p < .05, ES = .84."

In this sentence, they report means of 91 and 75 for two groups (tutees under two different types of tutors, HA and AA). The corresponding t-ratio is t = 2.78. They did not report the exact p-value, but they did report that the p-value is less than an alpha of .05, so the null hypothesis is rejected in this case. Since the null is rejected, we can conclude that tutees studying with HA tutors completed more problems correctly (91%) than tutees studying with AA tutors (75%).
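A t-test like those above is straightforward to run in software. A minimal sketch, assuming SciPy is available (the scores are invented for illustration):

from scipy.stats import ttest_ind

boys  = [88, 90, 79, 85, 92, 81, 84, 87]
girls = [78, 82, 75, 80, 77, 84, 79, 76]

t, p = ttest_ind(boys, girls)  # independent-samples t-test on the two means
print(f"t = {t:.2f}, p = {p:.3f}")

# If p is less than alpha (.05), reject the null hypothesis of no
# difference between the boys' and girls' mean scores.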

(3) Analysis of Variance (ANOVA)

ANOVA is similar to the t-test, but it is appropriate when one wishes to compare two or more means--compare two or more groups--to determine whether a statistically significant difference exists among the group means. For example, an ANOVA would be appropriate for the following:

"There will be a difference between grades 1, 2, and 3 scores on the elementary mathematics achievement test."

In this example, one wishes to learn whether math test scores differ between students in grades 1, 2, and 3, so there are three groups to compare.

The null hypothesis tested in this example is that there is no difference in math score means between these three grades. If the null hypothesis is rejected, then one may conclude that there is evidence in the sample to suggest math score means differ among these three groups in the population from which the sample was selected.

Rather than report an r like a correlation coefficient, or a t like the t-test, the ANOVA provides an "F" value for determining whether to reject the null hypothesis. Like all inferential statistical tests, associated with the F-value is a p-value. Below are examples of how ANOVA may be reported in research.

Example 1: ANOVA in Text

Goodenow (1993) reports (p. 85):

"Finally, as a direct assessment of construct validity, it was hypothesized that students rated by their English teachers as having different levels of social standing with peers would also exhibit significantly different levels of self-reported psychological membership. A one-way analysis of variance confirmed this hypothesis: Students rated as having high, medium, or low social standing were different in their PSSM scores (4.23, 3.87, and 3.32, respectively, F[2,451] = 26,.59, p < .001)."

Several things are reported above. First, the three PSSM mean scores for the three groups were M = 4.23 for the high group, M = 3.87 for the medium group, and M = 3.32 for the low group. Next, the author reports an F value of F = 26.59. This test statistic is high--F values can only be positive, and any F value greater than, say, 4.00 is usually an indication that the null hypothesis should be rejected. The p-value for this F-value is not reported, but the author does indicate that the p-value is less than .001, which is certainly smaller than alpha levels of .05 and .01, so the null hypothesis of no difference among these three groups is rejected and we conclude that the mean scores do differ in the population.

Two other things should be noted here. First, the author refers to the ANOVA as a one-way analysis of variance. The term one-way refers to the number of independent variables tested: if it is one-way, then only one independent variable is tested; if it is two-way, then two independent variables are tested, and so on. The terms one-way, two-way, etc. do not refer to the number of categories in the independent variable. Second, the author reports some numbers adjacent to the F-value, i.e., F[2,451]. These two numbers, 2 and 451, are called degrees of freedom, and every statistical test of hypotheses has degrees of freedom. While degrees of freedom play an important role in statistics, they are not important to understand at this point or for general reading of statistical reports, so I will not cover them here. See your text for a more detailed description of degrees of freedom if you wish to know more.
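A one-way ANOVA like the one Goodenow reports can be run with a few lines of code. A minimal sketch, assuming SciPy is available (the three groups of scores are invented):

from scipy.stats import f_oneway

grade1 = [72, 75, 71, 78, 74]
grade2 = [80, 79, 83, 77, 81]
grade3 = [85, 88, 84, 90, 86]

F, p = f_oneway(grade1, grade2, grade3)  # one-way ANOVA across three groups
print(f"F = {F:.2f}, p = {p:.4f}")

# If p is less than alpha (.05), reject the null hypothesis that the
# three group means are equal.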

Example 2: ANOVA in Text and Table

Woznica (1990) reports the results of a one-way ANOVA in both text and table form (p. 711):

"The 'pure restricters' obtained a mean STIC score of 44.00, which was higher than the mean STIC scores of the bulimic anorexic, normal control, and psychiatric control groups. A one-way analysis of variance of group differences on the STIC yielded a significant F ratio of 3.53, which indicated that there was a significant difference between groups on the STIC (p < .01)."

Woznica Table 1: Analysis of variance: Group STIC scores using only "pure restricters" in restricting anorexic group

Source df Sum of Squares Mean of Squares F
Between 3 453.500 151.167 3.531*
Within 35 1498.500 42.814  
Total 38 1952.000    

* p < .05

In this table there are several items that may be unfamiliar. Each column represents standard information reported in an ANOVA table: source, df (degrees of freedom), sums of squares (SS), mean squares (MS), and the F-ratio (F). I won't cover what each of these components represents; see your text for a description of these. Note that the reported F-ratio is F = 3.53 and it is marked with an asterisk. At the bottom of the table the asterisk is shown to indicate that the corresponding p-value is less than .05 (the alpha level). In this case, the F-ratio is large enough to reject the null hypothesis of no difference at the .05 level.

(4) Analysis of Covariance (ANCOVA)

To help explain how ANCOVA is used, consider the following research example. In this study, we are interested in knowing whether some form of cooperative learning produces higher achievement scores than lecture-based instruction. Two groups of students are assigned to one or the other type of instruction. Table 6 provides summary information about the two groups.

Table 6: Summary Statistics for Learning Groups

  Cooperative Learning (n = 9) Lecture (n = 9)
Intelligence Mean 103.78 100.78
Intelligence SD 2.82 2.82
Posttest Mean 80.00 75.00
Posttest SD 2.74 2.74

The table above shows that students using cooperative learning (M = 80) scored higher on a posttest taken at the end of a semester than students in the lecture group (M = 75). Using this information, it appears that cooperative learning produces higher achievement scores. However, at the beginning of the semester, each student was given an intelligence test to determine how equivalent the two groups were in terms of academic ability before starting the class. The intelligence test shows that students assigned to the cooperative learning class had slightly higher scores (M = 103.78) than students assigned to the lecture class (M = 100.78).

Now we have a confounded study. Which factor can account for the difference in achievement scores: the type of instruction received (cooperative learning vs. lecture), or the difference in intelligence (103.78 vs. 100.78)? Schematically, this study can be illustrated as:

Table 7: Outline of Instructional Study

Intelligence Instruction Achievement
103.78 Cooperative Learning 80
100.78 Lecture 75

Since there is a difference between the two groups in terms of instruction and in terms of intelligence, we simply do not know which of the two factors contributed to the 5 point difference in achievement, so we cannot state with certainty that instruction was the cause of this difference.

To help alleviate this problem, we could use ANCOVA. Like the ANOVA, ANCOVA is good for comparing two or more means (groups) to learn whether a statistically significant difference exists among the groups, but ANCOVA also allows the researcher to statistically control for confounding variables (termed covariates in ANCOVA parlance) by statistically adjusting the group means on the dependent variable to take into account initial group differences on some other variable. These statistically adjusted means are then used to compare the groups.

So how does this work; how are the dependent variable means adjusted? The mathematics that drives ANCOVA is too difficult to be presented here, but the logic can be explained. Taking into account the correlations among the independent, dependent, and covariate variables, ANCOVA adjusts the dependent variable means either higher or lower depending upon where the groups started on the covariate. To illustrate, in the example described above, the cooperative learning group started higher in intelligence than the lecture group, so their achievement mean will be adjusted downward to compensate for their "head start" in the class. This is like a handicap in golf. The lecture group, since they started behind in terms of intelligence, will have their achievement mean adjusted upward. The amount of adjustment in both groups varies and depends mathematically on several things, so it is difficult to know in advance just how much of an adjustment will take place. In this example, I used ANCOVA to calculate the adjusted means, which are presented in Table 8 below.

Table 8: Outline of Instructional Study with Adjusted Means

Intelligence Instruction Achievement Adjusted Achievement
103.78 Cooperative Learning 80 78.68
100.78 Lecture 75 76.32

After taking into account the covariate--the level of intelligence--we see that the difference between the two groups on adjusted means is much smaller: 78.68 - 76.32 = 2.36 points rather than 5.00 points. One way of understanding adjusted means is to consider what they theoretically allow one to do in terms of interpretation. The goal in experimental research is to have groups that are as alike, or equivalent, as possible, so any differences that result between the two groups are not due to differences that the groups started with, but due to differences in treatments (in this example, instruction is the treatment). By using ANCOVA to make statistical adjustments, we can state that the adjusted means indicate the level of difference between the two groups had they started at the same intelligence level (say, 102.28, which is the overall mean of intelligence for the two groups).
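The classic adjustment just described can be sketched in a few lines of Python. This is only an illustration of the adjusted-mean formula (adjusted mean = group DV mean minus the pooled within-group slope times the group's covariate head start), not a full ANCOVA with significance tests:

from statistics import mean

def adjusted_means(groups):
    # groups: dict mapping group name -> list of (covariate, outcome) pairs.
    # Pooled within-group slope: sum of within-group cross-products divided
    # by the sum of within-group squared covariate deviations.
    sxy = sxx = 0.0
    for pairs in groups.values():
        mx = mean(x for x, _ in pairs)
        my = mean(y for _, y in pairs)
        sxy += sum((x - mx) * (y - my) for x, y in pairs)
        sxx += sum((x - mx) ** 2 for x, _ in pairs)
    b_w = sxy / sxx
    grand_mx = mean(x for pairs in groups.values() for x, _ in pairs)
    # Adjust each group's outcome mean for its head start on the covariate
    return {name: mean(y for _, y in pairs)
                  - b_w * (mean(x for x, _ in pairs) - grand_mx)
            for name, pairs in groups.items()}

A group that starts above the grand covariate mean has its outcome mean adjusted downward, and a group that starts below has its mean adjusted upward, exactly as in the cooperative learning example above.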

As another example, consider the following experiment:

Table 9: Outline of Instructional Study with Adjusted Means and Motivation
as a Covariate

Motivation Instruction Achievement Adjusted Achievement
4.52 Cooperative Learning 80 75
3.35 Lecture 70 75

In the example of Table 9, motivation to learn is the covariate, and we see that the lecture group has lower motivation to start with, so we would expect their achievement to be lower due to the lower motivation. This study, like the example above, is confounded in that we do not know whether the type of instruction or the level of motivation caused the 10 point difference found in achievement scores. We would anticipate ANCOVA to adjust the lecture group's achievement mean upward, and the cooperative learning group's achievement mean downward. The adjusted means are 75 for both groups, which signifies that once we take into account the level of motivation, there is no difference in achievement performance between the two groups.

With ANCOVA, one may have many covariates to consider. One may include more than one covariate in an effort to control for, or statistically adjust for, many confounded variables.

ANCOVA would be appropriate for the following research question:

"Does a difference in salary exist between males and females at the university once academic rank (professor, associate professor, and assistance professor) and number of publications are taken into account?"

The covariates in the above research question are academic rank and number of publications. The researcher wishes to take these variables into account (i.e., control) before comparing mean salaries between males and females.

Example: ANCOVA in Table Form

Below is a table taken from a study designed to test whether two forms of reciprocal peer tutoring (RPT) impact classroom achievement, academic self-efficacy, and test anxiety. The table provides means (M), adjusted means (Madj.), and standard deviations (SD) for each of the three dependent variables and the three groups (RPT in-class, RPT out-of-class, and the control group). Each mean is adjusted for a pre-measure of the variable (pretest). The pre-measure, or pretest, in each case served as the covariate.

Table 9. ANCOVAs and Summary Statistics for Experiment 3

                                         F
Source         df   Posttest Performance   Test Anxiety   Academic Self-efficacy
RPT (R)         2          0.23                0.84               0.45
Pretests (P)    1          8.89*              30.82*              5.69*
R x P           2          0.39                0.83               0.45
Error          61        (10.92)              (2.56)             (1.06)

RPT In-class        M = 26.65       M = 3.51       M = 5.34
                    Madj. = 26.69   Madj. = 3.75   Madj. = 5.32
                    SD = 2.93       SD = 1.93      SD = 1.04

RPT Out-of-class    M = 27.74       M = 3.75       M = 5.41
                    Madj. = 27.72   Madj. = 3.68   Madj. = 5.39
                    SD = 3.33       SD = 1.92      SD = 1.02

Control             M = 26.95       M = 3.61       M = 5.03
                    Madj. = 26.94   Madj. = 3.45   Madj. = 5.07
                    SD = 4.18       SD = 1.99      SD = 1.14

Note. Values enclosed in parentheses represent mean square errors. M are the unadjusted posttest means, Madj. are the covariate adjusted posttest means, and SD is the standard deviation. RPT In-class represents the scores of students who quizzed each other in class, immediately prior to completing their exams, and RPT Out-of-class refers to the students who quizzed each other out of class, prior to completing their exams.

* p < .05.

 

Pretests/Pre-measures and ANCOVA

In many studies in education it is common for researchers to take pre-measures or pretests of students to measure their initial standings before initiating the experimental treatments. These pre-measures, or pretests, serve as ideal covariates in ANCOVA because they provide useful information about whether groups of students in the experiment started at similar levels of knowledge or aptitude. Should any discrepancies exist among groups on these pre-measures, one may attempt to statistically equate the groups by including the pretests/pre-measures as covariates. Thus, any time a study contains pre-measures for all study participants, look for those to be used as covariates in the analysis of experimental results.

(5) Chi-Square (χ2)

All of the other statistical testing procedures covered require a dependent variable that is quantitative. The chi-square test is useful for learning whether a relationship exists between two qualitative (nominal) variables, such as sex (male, female) and dropping out of school (dropout, stay-in), or race (Black, Hispanic, White) and choice of high school program of study (college track, vocational, general).

Chi-square would be appropriate for the following research question:

"Is political party affiliation (Democratic, Republican) associated with presidential voting choice (vote for either the Democratic, Republican, or other candidate)?"

or

"Is race related to special education classification?"

Example: Chi-Square in Text

Dover and Shore (1991, p. 103) wrote:

"The gifted group responded significantly more often without probes on question 1 (c2(1) = 4.37, p < .05) and question 3 (c2(1) = 5.56, p < .05)."

In this example, the values for the chi-squares were 4.37 and 5.56. The larger the chi-square statistic, the more likely the null hypothesis will be rejected. The number in parentheses following the χ2, e.g., χ2(1), is the degrees of freedom. Note that both p-values are reported to be less than the alpha level of .05.

Margalit, Ankonina, and Avraham (1991, p. 431) wrote:

"No difference were found between groups for age or gender of the handicapped children (Kibbutz: 32 males and 11 females; City: 30 males and 18 females; c2(1, N = 91) = .99, ns)."

In this example, the chi-square statistic is equal to 0.99, not a large value. The "ns" indicates "not significant," although it would have been better to report the actual p-value obtained. The symbol "N" indicates the sample size, which is 91.

Example: Chi-Square in Table

Reynolds, Kunce, and Cope (1991), in their study of driving under the influence of alcohol and personality type, reported the following table (p. 293):

Table 10. Distribution of participants by personality type and offender group.

  Offenders
  First-Time Repeat
Personality Type n % n %
Stability-oriented extravert 68 39.5 34 53.1
Change-oriented extravert 49 28.5 4 6.3
Stability-oriented introvert 34 19.8 23 35.9
Change-oriented introvert 21 12.2 3 4.7
         
Total 172 100.0 64 100.0

Note. χ2(3, N = 236) = 19.91, p < .0001.

The chi-square value is 19.91, which is significant at the .0001 level, so the null hypothesis of no difference in distribution among these groups is rejected. Apparently personality type is associated with repeat offender status.
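A chi-square test of association can be reproduced directly from a table of frequencies. A minimal sketch, assuming SciPy is available, using the counts from Table 10 (it should yield approximately the chi-square reported in the table's note):

from scipy.stats import chi2_contingency

# Rows: personality types; columns: first-time and repeat offenders
observed = [
    [68, 34],  # Stability-oriented extravert
    [49,  4],  # Change-oriented extravert
    [34, 23],  # Stability-oriented introvert
    [21,  3],  # Change-oriented introvert
]

chi2, p, df, expected = chi2_contingency(observed)
print(f"chi-square({df}) = {chi2:.2f}, p = {p:.4f}")

# A p-value below alpha indicates that personality type and offender
# status are related (the null hypothesis of no association is rejected).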


Supplemental Reading

Bruce Thompson discusses statistical significance testing:

http://pareonline.net/getvn.asp?v=4&n=5 

James Hill provides a nice introduction to descriptive statistics:

http://www.mste.uiuc.edu/hill/dstat/dstat.html


References

Davis, N.F. (1990). The Reynolds Adolescent Depression Scale. Measurement and Evaluation in Counseling and Development, 23.

Dover, A., & Shore, B. (1991). Giftedness and flexibility on a mathematical set-breaking task. Gifted Child Quarterly, 35.

Fuchs, L.S., Fuchs, D., Karns, K., Hamlett, C.L., Dutka, S., & Katzaroff, M. (1996). The relation between student ability and the quality and effectiveness of explanations. American Educational Research Journal, 33, 631-664.

Goodenow, C. (1993). The psychological sense of school membership among adolescents: Scale development and educational correlates. Psychology in the Schools, 30.

Martin, C.L. (1990). An empirical investigation of employee behaviors and customer perceptions. Journal of Sport Management, 4.

Margalit, M., Ankonina, D., & Avraham, Y. (1991). Community support in Israeli Kibbutz and city families of children with disabilities: Family climate and parental coherence. Journal of Special Education, 24.

Quinn, G., & Griffin, B. W. (1999). Students' motivational needs and satisfaction in relation to achievement within a cooperative learning setting. Unpublished manuscript.

Reynolds, J.R., Kunce, J.T., & Cope, C.S. (1991). Personality differences of first-time and repeat offenders arrested for driving while intoxicated. Journal of Counseling Psychology, 38.

Woznica, J.G. (1990). Delay of gratification in bulimic and restricting anorexia nervosa patients. Journal of Clinical Psychology, 46.


Copyright 2000, Bryan W. Griffin

Last revised on 08 April, 2018 04:07 AM