EDUR 7130
Educational Research On-Line

Inferential Statistics

Inferential Statistics and Conducting Educational Studies

Researchers in education are in the business of studying learning, motivation, and other factors that influence how well students perform in school activities. Researchers may be interested, for example, in learning whether textbook A helps students achieve at a higher level in mathematics than textbook B, or whether more time spent on homework produces better scores on tests.

To conduct such studies, educational researchers often rely upon samples because working with an entire population is inefficient and too costly. What researchers wish to know is how well two textbooks help the population of students (say all 5th grade students in the USA) learn mathematics. Since researchers work with a sample of students, the goal is to conduct a study that will provide insight into how the population might react to the two textbooks. Thus the goal is to learn from the sample and apply it to the population -- to infer results from sample to population.

If one learns, for example, that textbook B provides a marked improvement in mathematics relative to textbook A for the sample studied, then the results obtained from that sample will be inferred to the population -- if it works this way in the sample, and if the sample is representative of the population, then the results obtained in the sample should hold for the population as well.

The process of making these inferences falls to inferential statistics. To help researchers decide whether the results obtained in a given sample may apply to the population, inferential statistics employ a process known as hypothesis testing.


Hypothesis Testing

A key to quantitative research is the hypothesis test. When trying to determine whether two variables are related (such as intelligence and scholastic performance, or whether dropout status differs by sex [female vs. male]), statistical testing of hypotheses is the tool most frequently used for making inferences from sample to population.

What does it mean to statistically test a hypothesis? With hypothesis testing, data from studies are used to make a judgment about whether enough information exists to support a conclusion that a relationship exists between variables. That is, data are collected from the sample to address the null hypothesis. Researchers examine collected data to determine whether those data either support the null hypothesis (no differences exist in sample data) or refute the null hypothesis (sample data show evidence of group differences, so reject Ho).

To make this judgment, two types of hypotheses are considered, the research hypothesis (which in this case will be either directional or non-directional) and the null hypothesis.

To introduce you to the idea of hypothesis testing, consider the following research situation.


Study 1: Sex (Boys/Girls) and Mathematics Scores

A researcher wishes to know if there is statistical evidence that boys and girls differ on a standardized mathematics test. The research hypothesis is:

Ha: There will be a difference in mathematics test scores between boys and girls.

The null hypothesis is

Ho: There will be no difference in mathematics test scores between boys and girls.

In this example, Ha is the research hypothesis (H indicates hypothesis, the letter "a" indicates that this is the alternative or research hypothesis) and Ho is the null hypothesis (H is hypothesis, "o" is the symbol for null).

When making a scientific judgment about this study, hypothesis testing is used and the results obtained are then inferred to the population. In all types of hypothesis testing, the null hypothesis is always the hypothesis tested. The logic of testing the null hypothesis follows something like this:

The null indicates no difference in math scores between boys and girls in this example. If the null hypothesis can be rejected (because the sample data collected do not support the null hypothesis), then some hypothesis other than the null must be accepted (usually a non-directional form is used as illustrated above). Note that only the null indicates no difference, while the alternative indicates some difference between boys and girls exists on mathematics test scores. So by rejecting the null hypothesis, we can establish evidence for a difference between boys and girls in the sample which then is inferred to the population.

Considering Study 1, one would first collect data, then use a statistical procedure (such as the t-test, which is discussed below) to test the null hypothesis. If the null hypothesis is rejected because the data do not support it, then we can accept the alternative or research hypothesis and state that mathematics scores differ by sex. If the null hypothesis is not rejected, then one states that there is not enough evidence in the data collected to say that a difference exists between boys and girls on math test scores. The key to the conclusion drawn is whether the null hypothesis is rejected or not rejected. If rejected, then a difference is thought to exist in mathematics scores by sex, and if not rejected, then mathematics scores do not appear to differ by sex.


Study 2: Intelligence and Math

A researcher wishes to know if there is statistical evidence that one's intelligence is related to one's performance on a mathematics test. The research hypothesis is:

Ha: Those with higher levels of intelligence will score higher on a mathematics test.

The null hypothesis is

Ho: There is no relationship between intelligence and mathematics test scores.

For Study 2 one is interested in knowing whether a relationship exists between two variables, intelligence and math scores. As with Study 1, the question of interest is whether the null hypothesis can be rejected. As before, the null hypothesis is tested, in this case maybe with Pearson's correlation coefficient. If the null hypothesis is rejected, based upon the evidence provided in the data collected, then one may state that intelligence and math scores are related (perhaps positively related). If the null hypothesis is not rejected, then one may state that there is not enough evidence in the data to say that intelligence and math scores are related.


Statistical Significance (Statements of Significance)

As noted above, when performing statistical testing of hypotheses, one always tests the null hypothesis. There are many statistical methods for testing null hypotheses, and a number of them are discussed below (e.g., t-test, Pearson's r, ANOVA). All statistical testing procedures have corresponding null hypotheses. When testing hypotheses with statistical tests such as the t-test, ANOVA, and Pearson's r, one must assume that the null hypothesis is true; that is, one must assume that no differences actually exist among the groups examined in the population, or that no relationships exist among the variables in the population.

All of the statistical tests will provide some indication as to whether the evidence provided by the data supports or does not support the null hypothesis. If the evidence from the sample indicates that the null hypothesis is supported, then the researcher concludes that no differences between the groups exist (or that no relationship exists among the variables), and the researcher can therefore state that the statistical results indicate that there were no statistically significant differences among the groups, or that there were no statistically significant relationships among the variables.

If, however, the data do not support the null hypothesis, then the researcher can reject the null hypothesis and accept the research hypothesis. If the data do not support the null hypothesis, then the researcher can say that a statistically significant difference exists among the groups (which is, of course, what the research hypothesis claims), or that a statistically significant relationship exists among the variables.

What does statistical significance indicate?

Significance = rejected null hypothesis

Thus, when one claims that results are statistically significant, this does not mean that important results were obtained; rather, it simply means that the data do not support the null hypothesis. Whether important results were obtained in any study cannot be determined by statistical hypothesis testing; importance can only be assessed through one’s interpretation of the results--that is, one must examine the sample data closely to determine whether something important was found.

So, what does it mean when one hears, say in the news, that researchers found a significant relationship between smoking and lung cancer, or that researchers found a significant relationship between leaving a light on for young children at night and their likelihood of developing near-sightedness? Does the term significant mean they found an important result? No. In this case significant only means that the data collected by the researchers provided enough evidence to reject the null hypothesis, but this does not mean that something important was found.

To illustrate the difference, consider a study of computer-assisted instruction in mathematics. A recent study found a significant difference in math achievement scores between two groups: one group used traditional instruction (paper and pencil exercises, etc.), and the other group supplemented this instruction with mathematics software instruction three times a week. One might be tempted to conclude that since a significant difference was found, the computer-assisted instruction makes a big difference. This is not the case. Hypothesis testing has a curious characteristic tied to sample size: the larger the sample one collects, the easier it is to reject the null hypothesis (hence the easier it is to claim a significant effect was found). In this study, the mean math-achievement score for the traditional instruction group was 85.3, and the mean math-achievement score for the computer-assisted group was 85.6, less than 1 point difference in achievement. Do you consider this to be an important improvement in achievement? When making this judgment, consider also the resources required to use computer-assisted instruction (money, time, space, etc.). I hope this example illustrates that just because someone reports a significant finding in research, this does not necessarily mean something important was found; it only means that the null hypothesis was rejected.
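The effect of sample size on statistical significance can be sketched numerically. The example below is only an illustration under stated assumptions: it uses the two means from the text (85.3 and 85.6), but the standard deviation of 10 and the group sizes are hypothetical, since the study's actual values are not given here. It also uses a normal approximation for the p-value rather than the exact t distribution.

```python
from math import sqrt
from statistics import NormalDist


def two_group_p_value(mean_a, mean_b, sd, n_per_group):
    """Approximate two-sided p-value for a difference between two group means.

    Assumes both groups have the same standard deviation and the same size,
    and uses the normal approximation (reasonable for large samples).
    """
    standard_error = sd * sqrt(2 / n_per_group)
    test_statistic = (mean_b - mean_a) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(test_statistic)))
    return test_statistic, p_value


ASSUMED_SD = 10.0  # hypothetical value; not reported in the text

# Same 0.3-point difference, two very different (hypothetical) sample sizes.
stat_small, p_small = two_group_p_value(85.3, 85.6, ASSUMED_SD, n_per_group=50)
stat_large, p_large = two_group_p_value(85.3, 85.6, ASSUMED_SD, n_per_group=50_000)

print(f"n = 50 per group:     statistic = {stat_small:.2f}, p = {p_small:.4f}")
print(f"n = 50,000 per group: statistic = {stat_large:.2f}, p = {p_large:.6f}")
```

Under these assumptions the identical 0.3-point difference is far from significant with 50 students per group, yet clearly significant with 50,000 per group, even though its practical importance has not changed.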


Hypothesis Testing Decisions and Errors

Recall that researchers perform hypothesis testing on sample data, and sample data were selected from a larger population. The purpose of collecting the sample data is to enable the researcher to infer from sample to population. Sometimes inferences from samples to populations may be incorrect. When one performs hypothesis testing, each decision reached may be in error. That is, rejecting Ho may be in error, and failing to reject Ho may be in error.

To help demonstrate this, consider the example study of sex differences in mathematics scores. Suppose that in the population, there are really no differences between boys and girls in terms of mathematics scores. When researchers select a sample, the sample data may reflect this population state, or it may show, through random sampling error, something different. Since reality shows no difference in the population, a decision to reject the null hypothesis in the sample will be an incorrect decision (which is called a Type 1 error), while a decision to fail to reject (retain) the null hypothesis will be the correct decision.

Similarly, assume now that in the population there truly are large differences between boys and girls on mathematics scores. When selecting a random sample, one may obtain a sample that is representative of the population, or perhaps one that is not representative. Either could happen due to the random nature of sample selection. In this scenario, the correct decision is to reject the null hypothesis since in the population there are differences between boys and girls in terms of mathematics performance, and the error results when the null hypothesis is not rejected (called a Type 2 error).

Each of these possibilities is outlined in the table below.


Table 1: Hypothesis Testing Decisions and Errors

Hypothesis Testing Decision in Sample | Population Reality: Ho is true (no difference in mathematics scores between boys and girls) | Population Reality: Ho is false (true difference in mathematics scores between boys and girls)
Reject Ho | Mistake; Type 1 Error (probability of this error is alpha, normally set at .05 or 5%) | Correct Decision
Fail to Reject Ho (retain Ho) | Correct Decision | Mistake; Type 2 Error (the probability of this error is beta)


Four possibilities exist with hypothesis testing:

(a) Type 1 error -- rejecting a true null hypothesis. Researcher determines, based upon sample data, to reject the null hypothesis. This is a mistake because in the population there is really no difference in mathematics scores between boys and girls, so the null in the population is true and therefore the null should not be rejected.

(b) Correct decision by failing to reject Ho -- if in the population the null hypothesis is true (there are no differences in mathematics scores between boys and girls), then the correct decision with sample data is to fail to reject Ho (to retain the null).

(c) Type 2 error -- failing to reject a false null hypothesis. Sample data indicate no differences between boys and girls in mathematics, so the researcher decides not to reject Ho (researcher fails to reject). This is a mistake because in the population there are differences in mathematics performance between boys and girls. Failure to detect these differences in the sample leads to a Type 2 error.

(d) Correct decision by rejecting Ho -- in the population there are differences and the sample data show these differences between boys and girls, so the researcher rejects Ho based upon sample data. This is a correct decision; the researcher has detected, based upon the sample data, differences that exist in the population.
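The four possibilities can be written as a small lookup. This is only an illustrative sketch; the function name `classify_decision` is hypothetical, and in real research the population reality is of course unknown, which is why these errors can never be identified with certainty in a single study.

```python
def classify_decision(null_is_true: bool, null_rejected: bool) -> str:
    """Name the outcome for one combination of population reality and decision."""
    if null_is_true and null_rejected:
        return "Type 1 error"                   # (a) rejected a true Ho
    if null_is_true and not null_rejected:
        return "Correct decision (retain Ho)"   # (b) retained a true Ho
    if not null_is_true and not null_rejected:
        return "Type 2 error"                   # (c) retained a false Ho
    return "Correct decision (reject Ho)"       # (d) rejected a false Ho


# Print all four cells of the decision table.
for null_is_true in (True, False):
    for null_rejected in (True, False):
        reality = "Ho true " if null_is_true else "Ho false"
        decision = "reject Ho        " if null_rejected else "fail to reject Ho"
        print(f"{reality} | {decision} -> {classify_decision(null_is_true, null_rejected)}")
```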


Alpha Level (level of Statistical Significance; Probability Values)

From Table 1 above, note that the probability of committing a Type 1 Error is called Alpha (Greek letter α). Alpha is set by the researcher, but by convention alpha usually is set to .05 or .01. When alpha is .05, a Type 1 Error is expected about once in every 20 tests in which the null hypothesis is actually true. If alpha = .01, then one is likely to commit a Type 1 Error only 1 time out of 100 such hypothesis tests.

Alpha (α) = Probability of committing a Type 1 Error (probability of the researcher concluding, from a sample, that a difference or relationship exists when it does not in the population)
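The meaning of alpha can be checked with a small simulation. The sketch below is a hedged illustration, not a standard procedure from the course: it repeatedly draws two samples from the same hypothetical population (mean 80, SD 6, 50 students per group, all assumed values), so the null hypothesis is true by construction, and counts how often a two-sided test rejects it. The test uses a normal approximation instead of the exact t distribution, so the observed rate will be close to, but not exactly, alpha.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, stdev


def rejects_null(sample_a, sample_b, alpha=0.05):
    """Two-sided test of equal means using a normal approximation."""
    standard_error = sqrt(
        stdev(sample_a) ** 2 / len(sample_a) + stdev(sample_b) ** 2 / len(sample_b)
    )
    z = (mean(sample_a) - mean(sample_b)) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_value < alpha


def type1_error_rate(n_studies=2000, n_per_group=50, alpha=0.05, seed=7130):
    """Share of simulated studies that reject Ho even though Ho is true."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_studies):
        # Both groups come from the SAME population, so any rejection is a Type 1 error.
        boys = [rng.gauss(80, 6) for _ in range(n_per_group)]
        girls = [rng.gauss(80, 6) for _ in range(n_per_group)]
        if rejects_null(boys, girls, alpha):
            rejections += 1
    return rejections / n_studies


rate_05 = type1_error_rate(alpha=0.05)
rate_01 = type1_error_rate(alpha=0.01)
print(f"alpha = .05: proportion of Type 1 errors = {rate_05:.3f}")
print(f"alpha = .01: proportion of Type 1 errors = {rate_01:.3f}")
```

With alpha at .05 the proportion of false rejections should land near 1 in 20, and with alpha at .01 near 1 in 100.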

Whenever one performs a statistical test to determine whether to reject or fail to reject the null hypothesis, one always relies upon probabilities. Keep in mind that when performing hypothesis tests, one is working with samples, not populations. In any given sample, it is possible that the sample does or does not represent well the population. So anything found in a sample study may or may not be true for the population.

The reason one works with a sample is to make inferences or generalizations back to the population. The reason one tests hypotheses is to help with the decision making process of generalizing from sample to population. So, for example, if one fails to reject the null hypothesis that intelligence is unrelated to mathematics scores, one would then conclude that there is not enough evidence to say that intelligence and mathematics scores are related in the population.

With all statistical testing procedures for testing null hypotheses, a few key bits of information will be calculated. One thing calculated is called the inferential test statistic. This will be described for each statistical test discussed below, but usually it will be one of the following: t, F, or χ² (chi-square). Each of these three symbols represents the inferential test statistic used to judge whether one should reject or not reject the null hypothesis being tested.

Another statistic produced by all hypothesis testing procedures is the p-value, which is usually denoted simply as "p" in reports. The p-value is the probability of obtaining a test statistic at least as extreme as the one observed if the null hypothesis were true. The smaller the p-value, the less consistent the data are with the null hypothesis, so with small p-values researchers are more likely to conclude that the null hypothesis must be rejected. In many cases researchers will use one of two cut-off levels for rejecting the null hypothesis: a p-value less than .05 or less than .01. These cut-off values are called alpha (α) in statistics, and are sometimes referred to as the significance level.

So, for example, if we test the null hypothesis that intelligence is unrelated to math scores, the p-value obtained (from the computer software used to analyze the data) might be, say, p = .33. Since this value is larger than an alpha of either .05 or .01, we would not reject the null hypothesis and would therefore conclude that there is not enough evidence to say that intelligence is related to math scores. If, however, we obtained a p-value of .04, we would be able to reject the null hypothesis at the alpha level of .05, since .04 is less than .05, and we could then conclude that our data indicate that intelligence is related to math scores.
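The decision rule in this example amounts to comparing p with alpha. A minimal sketch, using the p-values from the paragraph above (the function name `hypothesis_decision` is hypothetical):

```python
def hypothesis_decision(p_value: float, alpha: float = 0.05) -> str:
    """Return the decision about Ho for a given p-value and alpha level."""
    if p_value < alpha:
        return "reject Ho"
    return "fail to reject Ho"


# p = .33 and p = .04 come from the text; alpha of .05 and .01 are the usual cut-offs.
for p_value, alpha in [(0.33, 0.05), (0.04, 0.05), (0.04, 0.01)]:
    print(f"p = {p_value:.2f}, alpha = {alpha:.2f}: {hypothesis_decision(p_value, alpha)}")
```

Note that the same p-value of .04 leads to rejection at an alpha of .05 but not at the stricter alpha of .01.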

Whenever you read research reports or articles, look for p-values. If presented, they can be found in tables that report statistical information. Unfortunately, some researchers choose not to report p-values and instead use an asterisk (*) to indicate whether the p-value was less than a given cut-off level like .05 or .01. See the examples below for illustrations.


Types of Inferential Statistics

There are many statistical procedures used to test null hypotheses, and each is best suited for specific research situations and types of data. Below is a table that lists some of the more commonly used statistical procedures. Study this table as you study the various types of inferential statistical procedures.


Table 2: Types of Statistical Procedures and Their Characteristics

Statistical Test | Independent Variable | Dependent Variable | Special Feature | Example Hypothesis
(1) Pearson's r (correlation coefficient) | Quantitative | Quantitative |  | There is a positive relationship between intelligence and mathematics achievement scores.
(2) t-test | Qualitative | Quantitative | IV has only 2 categories | There will be a difference between boys and girls on mathematics achievement scores.
(3) ANOVA (Analysis of Variance) | Qualitative | Quantitative | IV may have 2 or more categories. | There will be a difference among Blacks, Hispanics, and Whites in mathematics achievement scores.
(4) ANCOVA (Analysis of Covariance) | 1. Qualitative; 2. Covariate may be either qual. or quan., but usually quantitative | Quantitative | IV may have 2 or more categories. Covariate used to make adjustments to DV means. | There will be a difference among Blacks, Hispanics, and Whites in mathematics achievement scores after taking into account levels of motivation.
(5) Chi-Square Test of Association | Qualitative | Qualitative |  | Males will be more likely to drop out of school than females.


(1) Pearson's r (Correlation Coefficient)

The relationship between two quantitative variables may be measured by Pearson's product moment correlation coefficient, which is symbolized by the letter "r." The correlation coefficient, r, may range from 1.00 (indicating a perfect, positive, linear relationship) to -1.00 (indicating a perfect, negative, linear relationship), and may take any value between the two. When r = 0.00, no linear relationship exists between the two variables. The closer the coefficient is to +1.00, the stronger the positive, or direct, relationship; the closer the coefficient is to -1.00, the stronger the negative, or inverse, relationship. The nearer the coefficient is to 0.00, the weaker the linear relationship between the variables. A correlation coefficient of 0.00 represents the weakest possible linear relationship, although a non-linear relationship may still exist when r = 0.00.

A linear relationship is such that as one variable increases, the other increases (or decreases) in a straight line fashion. If, however, the relationship was such that as one variable increased the other increased, to a point, and then began to decrease after that point, a curvilinear or non-linear relationship would exist. These types of relationships are presented below with three scatterplots:


Figure 1

Figure (a) shows a positive relationship, figure (b) shows a negative relationship, and figure (c) depicts a curvilinear or non-linear relationship.
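The three patterns can also be checked numerically. The sketch below computes Pearson's r from its usual definition (the sum of cross-products of deviations divided by the square root of the product of the two sums of squared deviations). The three small data sets are hypothetical, chosen only so that one is perfectly positive, one perfectly negative, and one curvilinear.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product moment correlation coefficient for paired scores."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cross_products = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross_products / sqrt(ss_x * ss_y)


# Hypothetical scores for five people on variable x and three different y variables.
x = [1, 2, 3, 4, 5]
positive = [2, 4, 6, 8, 10]     # rises in a straight line with x
negative = [10, 8, 6, 4, 2]     # falls in a straight line as x rises
curvilinear = [1, 4, 5, 4, 1]   # rises, peaks, then falls

r_positive = pearson_r(x, positive)
r_negative = pearson_r(x, negative)
r_curvilinear = pearson_r(x, curvilinear)

print(f"positive:    r = {r_positive:.2f}")
print(f"negative:    r = {r_negative:.2f}")
print(f"curvilinear: r = {r_curvilinear:.2f}")
```

The curvilinear data give r = 0.00 even though y clearly depends on x, which is why a scatterplot should be inspected alongside the coefficient.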

In case you have difficulty understanding a scatterplot, it can be viewed as point intersections of two variables. Each scatterplot has two axes, the vertical and horizontal, and each of two variables is represented on one axis. In the example below, two variables are plotted, intelligence (measured on a 1 to 5 scale) and test scores (also measured on a 1 to 5 scale).


Figure 2

To help illustrate what a scatterplot does, three people are identified on the scatterplot. The dot with an "a" beside it indicates the two scores for person "a." This individual scored a 1 on intelligence and a 1 on test scores; person "b" scored a 5 on intelligence and a 5 on test scores; and person "c" scored a 3 on intelligence and a 4 on test scores. By plotting the combination of scores from two variables in this manner, one can more readily see any type of relationship that may exist between two variables. In this example, there is a positive relationship between intelligence and test scores--as intelligence increases (goes from 1 to 5), so too do test scores.

The figure below illustrates the strength of different correlations via scatterplots. For both positive and negative relationships, the tightness of the scatter corresponds to the size of the correlation coefficient r: the closer r is to 1.00 or -1.00, the tighter the scatter, and the closer r is to 0.00, the more dispersed the points.


Figure 3

The null hypothesis in correlational research states that no relationship exists, which means r = 0.00. As stated earlier, the null hypothesis is tested to determine statistical significance. In the case of correlational research, the question of interest is whether there is evidence that a relationship exists between the two variables in the population. Statistical significance testing is used to make this determination.

Significance testing allows one to ask whether the relationship, as measured by r, is likely to be equal to zero. If the correlation coefficient is not likely to be equal to zero, the relationship between two variables is said to be statistically significant. If zero is a strong possibility, then the relationship is not statistically significant. A non-significant relationship means that it is possible that no relationship exists, i.e., r = 0.00 (recall that an r of zero indicates no relationship). Note that determining whether the coefficient is statistically significant or not is a statement of probability or chance—that is, if we say the coefficient is statistically significant, we are stating that the coefficient is probably different from 0.00, but we are not sure.
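One common way to carry out this test converts r into a t-ratio, t = r * sqrt(n - 2) / sqrt(1 - r^2), with n - 2 degrees of freedom. The sketch below applies that conversion to a hypothetical correlation of r = .30 at two hypothetical sample sizes. As a simplifying assumption it approximates the p-value with the normal distribution; statistical software would use the t distribution, so the p-values here are rough, especially for the smaller sample.

```python
from math import sqrt
from statistics import NormalDist


def correlation_test(r, n):
    """Return the t-ratio for testing Ho: r = 0, and an approximate two-sided p-value."""
    t_ratio = r * sqrt(n - 2) / sqrt(1 - r ** 2)
    # Normal approximation to the t distribution (an assumption of this sketch).
    p_value = 2 * (1 - NormalDist().cdf(abs(t_ratio)))
    return t_ratio, p_value


# Hypothetical values: the same correlation observed in a small and a larger sample.
t_small, p_small = correlation_test(r=0.30, n=20)
t_large, p_large = correlation_test(r=0.30, n=100)

print(f"r = .30, n = 20:  t = {t_small:.2f}, approximate p = {p_small:.3f}")
print(f"r = .30, n = 100: t = {t_large:.2f}, approximate p = {p_large:.3f}")
```

The same r of .30 is not statistically significant at the .05 level with 20 people but is with 100, which again shows that significance depends on sample size as well as on the strength of the relationship.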


Example 1: Reporting Correlations in Research

Davis (1990, p. 89) writes:

"Correlation between the RADS and the Hamilton was .83 (p < .001), indicating a strong relationship between the two methods of assessing depression."

The correlation in this example is r = .83, which is a strong, positive correlation. The p-value is reported to be less than .001, which is far below the standard cut-off alpha levels of .05 and .01, so the null hypothesis (i.e., no relationship) is rejected in this case.

In another example, Martin (1990, p. 10) writes:

"The perceived relative importance of Behavior 15 (employees verbally complimenting bowlers) was most strongly correlated with bowling average (r = .251, n = 82, p = .011)."

The correlation in this example is r = .251, which is a moderate to weak correlation. The sample size is n = 82, and the p-value is .011, which is less than an alpha of .05, so the null hypothesis of no relationship is rejected.

Example 2: Correlation Matrices

Sometimes researchers are interested in several correlations among a number of variables. The most efficient method for presenting multiple correlations is through a correlation matrix.

Quinn and Griffin report the following correlation matrix for their study of motivational needs, course satisfaction, and course achievement among undergraduate students in a cooperative learning course.


Table 3. Means, standard deviations, and correlations for Motivational Needs, Course Satisfaction, and Course Achievement

Variables: Need for Achievement, Need for Affiliation, Need for Autonomy, Need for Dominance, Course Satisfaction, Course Achievement

Note: n = 53.

* p < .05

Note that in the table above means (M), standard deviations (SD), and sample size (n) are also reported. To show how to read this table, I will highlight a few bits of information. First, the mean for "Need for Dominance" is 9.51 with a SD of 2.58. The correlation between "Need for Dominance" and "Course Satisfaction" is r = -.08 (a weak and non-significant relationship). The correlation between "Need for Autonomy" and "Need for Affiliation" is r = -.33. Note that there is an asterisk next to this correlation (i.e., -.33*). At the bottom of the table note that the asterisk denotes correlations that are statistically significant at the .05 alpha level. This means that correlations marked with an asterisk are statistically significant, so the null hypothesis of no relationship is rejected for these correlations.


(2) t-test

A t-test is used if one wishes to make a comparison between two groups, say an experimental and control group. The t-test allows one to determine whether a statistically significant difference exists between two means—thus, the t-test compares two group means. For example, a t-test would be appropriate to use for testing the following hypothesis:

"It is expected that boys will have higher ITBS mathematics scores than girls."

The inferential statistic calculated in the t-test is called the t-ratio, which is denoted as "t" in most reports and articles. The larger the t-ratio (in absolute value), the more likely we will reject the null hypothesis, because there is more evidence in the data that the two groups differ from each other. The formula for the t-ratio will not be discussed here, but you should recognize that a "t" is the test statistic used to determine whether the null hypothesis should be rejected.

When reporting t-tests, a common method is to use a table with appropriate statistical information, like the table below.

Example 1: Reporting t-test in Table


Table 4: Results of t-test for mathematics achievement by sex

  Boys Girls
Mean 85.35 78.64
Standard Deviation 6.89 5.99
N 32 36

Note. t = 2.59, df = 67, p = .032

In the table above, the mean math score for boys is 85.35 and for girls it is 78.64, so it is higher for boys as the hypothesis predicted. We also see that the variability in the boys' scores is greater since the SD is larger for boys. There were more girls in the study than boys, with a sample size of 36 for girls. The t-ratio in this example is t = 2.59, and its corresponding p-value is p = .032. If the cut-off level for statistical significance testing--alpha--is set at .05, then since the p-value is less than .05 (p = .032), the null hypothesis of no difference between boys and girls is rejected and we conclude that there is enough evidence in the data to state that boys appear to perform better on math than girls in the population from which these two samples were selected.
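For readers curious how a t-ratio of this kind might arise, the sketch below computes one common version, the pooled-variance t-ratio for two independent groups, with n1 + n2 - 2 degrees of freedom. The scores are hypothetical and are not the data behind Table 4. Because the Python standard library has no t distribution, the sketch compares the t-ratio with 2.228, the standard two-tailed critical value for alpha = .05 with 10 degrees of freedom, instead of computing a p-value.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t_ratio(group_a, group_b):
    """Pooled-variance t-ratio for two independent groups, with its degrees of freedom."""
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    pooled_variance = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / df
    standard_error = sqrt(pooled_variance * (1 / n_a + 1 / n_b))
    return (mean(group_a) - mean(group_b)) / standard_error, df


# Hypothetical mathematics scores for six boys and six girls.
boys = [88, 92, 79, 85, 90, 84]
girls = [78, 82, 75, 80, 77, 81]

t_ratio, df = pooled_t_ratio(boys, girls)
CRITICAL_T = 2.228  # two-tailed critical value, alpha = .05, df = 10

print(f"t = {t_ratio:.2f}, df = {df}")
print("reject Ho" if abs(t_ratio) > CRITICAL_T else "fail to reject Ho")
```

In practice, statistical software reports the exact p-value, so a table of critical values is rarely needed.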


Example 2: Reporting t-test in Text

In a study of tutoring, Fuchs, Fuchs, Karns, Hamlett, Dutka, and Katzaroff (1996) wrote (p. 648):

"On the problem sets worked during the tutoring generalization sessions, tutees correctly completed 91% (SD = 12) of the problems they attempted with HA tutors; with AA tutors, they correctly completed 75% (SD = 26), t(19) = 2.78, p < .05, ES = .84."

In this sentence, they report means of 91 and 75 for two groups (tutees under two different types of tutors, HA and AA). The corresponding t-ratio is t = 2.78. They did not report the exact p-value, but they did report that the p-value is less than an alpha of .05, so the null hypothesis is rejected in this case. Since the null is rejected, we can conclude that people studying under HA tutors will get more problem sets correct (91%) when compared to tutees studying with AA tutors (75%).


(3) Analysis of Variance (ANOVA)

ANOVA is similar to the t-test, but it is appropriate when one wishes to compare two or more means (that is, two or more groups) to determine whether a statistically significant difference exists among the group means. For example, an ANOVA would be appropriate for the following:

"There will be a difference between grades 1, 2, and 3 scores on the elementary mathematics achievement test."

In this example, one wishes to learn whether math test scores differ between students in grades 1, 2, and 3, so there are three groups to compare.

The null hypothesis tested in this example is that there is no difference in math score means between these three grades. If the null hypothesis is rejected, then one may conclude that there is evidence in the sample to suggest math score means differ among these three groups in the population from which the sample was selected.
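To make the comparison of several group means concrete, the sketch below computes the one-way ANOVA test statistic (usually labelled F) by dividing the variability between group means by the variability within groups. The scores for the three grades are hypothetical, and the function name `one_way_anova` is only illustrative.

```python
from statistics import mean


def one_way_anova(*groups):
    """Return the F ratio with its between-groups and within-groups degrees of freedom."""
    all_scores = [score for group in groups for score in group]
    grand_mean = mean(all_scores)
    # Between-groups sum of squares: how far each group mean is from the grand mean.
    ss_between = sum(len(group) * (mean(group) - grand_mean) ** 2 for group in groups)
    # Within-groups sum of squares: how far each score is from its own group mean.
    ss_within = sum((score - mean(group)) ** 2 for group in groups for score in group)
    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return f_ratio, df_between, df_within


# Hypothetical mathematics scores for five students in each grade.
grade_1 = [70, 72, 68, 75, 71]
grade_2 = [78, 80, 77, 82, 79]
grade_3 = [85, 88, 84, 90, 86]

f_ratio, df_between, df_within = one_way_anova(grade_1, grade_2, grade_3)
print(f"F[{df_between},{df_within}] = {f_ratio:.2f}")
```

A large F ratio means the group means differ by much more than would be expected from the spread of scores within each group.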

Rather than report an r like a correlation coefficient, or a t like the t-test, the ANOVA provides an "F" value for determining whether to reject the null hypothesis. Like all inferential statistical tests, associated with the F-value is a p-value. Below are examples of how ANOVA may be reported in research.


Example 1: ANOVA in Text

Goodenow (1993) reports (p. 85):

"Finally, as a direct assessment of construct validity, it was hypothesized that students rated by their English teachers as having different levels of social standing with peers would also exhibit significantly different levels of self-reported psychological membership. A one-way analysis of variance confirmed this hypothesis: Students rated as having high, medium, or low social standing were different in their PSSM scores (4.23, 3.87, and 3.32, respectively, F[2,451] = 26.59, p < .001)."

Several things are reported above. First, the three PSSM mean scores for the three groups (high, medium, and low) were M = 4.23 for the high group, M = 3.87 for the medium group, and M = 3.32 for the low group. Next, the author reports an F value of F = 26.59. This test statistic is high--F values cannot be negative, and any F value greater than, say, 4.00 is usually an indication that the null hypothesis should be rejected. The exact p-value for this F-value is not reported, but the author does indicate that the p-value is less than .001, which is certainly smaller than alpha levels of .05 and .01, so the null hypothesis of no difference among these three groups is rejected and we conclude that the mean scores do differ in the population. Two other things to note here. First, the author refers to the ANOVA as a one-way analysis of variance. The term one-way refers to the number of independent variables tested. If it is one-way, then only one independent variable is tested; if it is two-way, then two independent variables are tested, etc. The terms one-way, two-way, etc. do not refer to the number of categories in the independent variable. Second, the author reports some numbers adjacent to the F-value, i.e., F[2,451]. These two numbers, 2 and 451, are called degrees of freedom, and every statistical test of hypotheses has degrees of freedom. While degrees of freedom play an important role in statistics, they are not important to understand at this point or for general reading of statistical reports, so I will not cover them here. See your text for a more detailed description of degrees of freedom if you wish to know more.


Example 2: ANOVA in Text and Table

Woznica (1990) reports the results of a one-way ANOVA in both text and table form (p. 711):

"The 'pure restricters' obtained a mean STIC score of 44.00, which was higher than the mean STIC scores of the bulimic anorexic, normal control, and psychiatric control groups. A one-way analysis of variance of group differences on the STIC yielded a significant F ratio of 3.53, which indicated that there was a significant difference between groups on the STIC (p < .01)."

Woznica Table 5: Analysis of variance: Group STIC scores using only "pure restricters" in restricting anorexic group

Source     df   Sum of Squares   Mean of Squares   F
Between     3          453.500           151.167   3.531*
Within     35         1498.500            42.814
Total      38         1952.000

* p < .05

In this table there are several things that you probably do not understand. Each column represents standard information reported in an ANOVA table: source, df (degrees of freedom), sum of squares (SS), mean of squares (MS), and the F-ratio (F). I won't cover what each of these components represents; see your text for a description of these. Note that the reported F-ratio is F = 3.53 and it is marked with an asterisk. At the bottom of the table the asterisk is shown to indicate that the corresponding p-value is less than .05 (the alpha level). In this case, the F-ratio is large enough to reject the null hypothesis of no difference at the .05 level.
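For readers who want to see how the columns of such a table fit together, the arithmetic can be sketched in a few lines of Python. This is only a sketch: it assumes the standard one-way ANOVA relationships (a mean square is a sum of squares divided by its degrees of freedom, and F is the between-groups mean square divided by the within-groups mean square) and uses only the numbers printed in Woznica's Table 5. The variable names are illustrative.

```python
# Values copied from Woznica's Table 5.
ss_between, df_between = 453.500, 3
ss_within, df_within = 1498.500, 35

# A mean square is a sum of squares divided by its degrees of freedom.
ms_between = ss_between / df_between
ms_within = ss_within / df_within

# The F-ratio compares between-group variation to within-group variation.
f_ratio = ms_between / ms_within

# In a one-way ANOVA, df_between = groups - 1 and df_total = N - 1.
n_groups = df_between + 1
n_participants = df_between + df_within + 1

print(f"MS between = {ms_between:.3f}")
print(f"MS within  = {ms_within:.3f}")
print(f"F          = {f_ratio:.3f}")
print(f"groups = {n_groups}, participants = {n_participants}")
```

The results agree with the table's mean squares (151.167 and 42.814) and F-ratio (3.531). The four groups implied by df = 3 also match the four groups named in the quoted passage.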


(4) Analysis of Covariance (ANCOVA)

To help explain how ANCOVA is used, consider the following research example. In this study, we are interested in knowing whether some form of cooperative learning produces higher achievement scores than lecture-based instruction. Two groups of students are assigned to one or the other type of instruction. Table 6 provides summary information about the two groups.


Table 6: Summary Statistics for Learning Groups

                    Cooperative Learning (n = 9)   Lecture (n = 9)
Intelligence Mean   103.78                         100.78
Intelligence SD       2.82                           2.82
Posttest Mean        80.00                          75.00
Posttest SD           2.74                           2.74

The table above shows that students using cooperative learning (M = 80) scored higher on a posttest taken at the end of a semester than students in the lecture group (M = 75). Using this information, it appears that cooperative learning produces higher achievement scores. However, at the beginning of the semester, each student was given an intelligence test to determine how equivalent the two groups were in terms of academic ability before starting the class. The intelligence test shows that students assigned to the cooperative learning class had slightly higher scores (M = 103.78) than students assigned to the lecture class (M = 100.78).

Now we have a confounded study. Which factor can account for the difference in achievement scores: the type of instruction students received (cooperative learning vs. lecture), or their differences in intelligence (103.78 vs. 100.78)? Schematically, this study can be illustrated as:

Table 7: Outline of Instructional Study

Intelligence   Instruction            Achievement
103.78         Cooperative Learning   80
100.78         Lecture                75

Since there is a difference between the two groups in terms of instruction and in terms of intelligence, we simply do not know which of the two factors contributed to the 5 point difference in achievement, so we cannot state with certainty that instruction was the cause of this difference.

To help alleviate this problem, we could use ANCOVA. Like the ANOVA, ANCOVA is good for comparing two or more means (groups) to learn whether a statistically significant difference exists among the groups, but ANCOVA also allows the researcher to statistically control for confounding variables (termed covariates in ANCOVA parlance) by statistically adjusting the group means on the dependent variable to take into account initial group differences on some other variable. These statistically adjusted means are then used to compare the groups.

So how does this work; how are the dependent variable means adjusted? The mathematics that drives ANCOVA is too difficult to be presented here, but the logic can be explained. ANCOVA uses the correlations among the independent variable, the dependent variable, and the covariate to adjust the dependent variable means either higher or lower depending upon where the groups started on the covariate. To illustrate, in the example described above, the cooperative learning group started higher in intelligence than the lecture group, so their achievement mean will be adjusted downward to compensate for their "head start" in the class. This is like a handicap in golf. The lecture group, since they started behind in terms of intelligence, will have their achievement mean adjusted upward. The amount of adjustment in both groups varies and depends on several things mathematically, so it is difficult to know in advance just how much of an adjustment will take place. In this example, I used ANCOVA to calculate the adjusted means, which are presented in Table 8 below.

Table 8: Outline of Instructional Study with Adjusted Means

Intelligence   Instruction            Achievement   Adjusted Achievement
103.78         Cooperative Learning   80            78.68
100.78         Lecture                75            76.32

After taking the covariate--the level of intelligence--into account, we see that the difference between the two groups on adjusted means is much smaller: 78.68 - 76.32 = 2.36 points rather than 5.00 points. One way of understanding adjusted means is to consider what they theoretically allow one to do in terms of interpretation. The goal in experimental research is to have groups that are as alike, or equivalent, as possible, so that any differences that result between the two groups are not due to differences that the groups started with, but due to differences in treatments (in this example instruction is the treatment). By using ANCOVA to make statistical adjustments, we can state that the adjusted means indicate the level of difference between the two groups if they had started at the same intelligence level (say, 102.28, which is the overall mean of intelligence for the two groups).
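The adjustment in Table 8 can be sketched with the usual one-covariate ANCOVA formula: adjusted mean = group mean - b x (group covariate mean - overall covariate mean), where b is the pooled within-group slope of achievement on intelligence. The notes do not report b; the value 0.88 used below is an assumption, back-calculated so that the sketch agrees with Table 8. The function and variable names are illustrative.

```python
def adjusted_mean(group_mean, group_cov_mean, grand_cov_mean, slope):
    """Covariate-adjusted mean for one group (single covariate)."""
    return group_mean - slope * (group_cov_mean - grand_cov_mean)

# Group summaries from Table 8: (intelligence mean, achievement mean).
groups = {
    "Cooperative Learning": (103.78, 80.0),
    "Lecture": (100.78, 75.0),
}

# With equal group sizes (n = 9 each), the overall covariate mean is the
# simple average of the two group means.
grand_iq = sum(iq for iq, _ in groups.values()) / len(groups)

# ASSUMPTION: pooled within-group slope, not reported in the source;
# chosen to be consistent with the adjusted means in Table 8.
slope = 0.88

adjusted = {
    name: adjusted_mean(ach, iq, grand_iq, slope)
    for name, (iq, ach) in groups.items()
}

for name, value in adjusted.items():
    print(f"{name}: adjusted achievement = {value:.2f}")
print(f"overall intelligence mean = {grand_iq:.2f}")
```

Because the cooperative learning group sits 1.5 points above the overall intelligence mean, its achievement mean moves down by 0.88 x 1.5 = 1.32 points; the lecture group sits 1.5 points below that mean and moves up by the same amount.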

As another example, consider the following experiment:

Table 9: Outline of Instructional Study with Adjusted Means and Motivation as a Covariate

Motivation   Instruction            Achievement   Adjusted Achievement
4.52         Cooperative Learning   80            75
3.35         Lecture                70            75

In the example of Table 9, motivation to learn is the covariate, and we see that the lecture group has lower motivation to start with, so we would expect their achievement to be lower due to the lower motivation. This study, like the example above, is confounded in that we do not know whether the type of instruction or the level of motivation caused the 10 point difference found with achievement scores. We would anticipate ANCOVA to adjust upward the lecture group's achievement means, and downward the cooperative learning group's achievement means. The adjusted means are 75 for both groups, which signifies that once we take into account the level of motivation, there is no difference in achievement performance between the two groups.

With ANCOVA, one may include more than one covariate in an effort to control for, or statistically adjust for, several confounding variables at once.

ANCOVA would be appropriate for the following research question:

"Does a difference in salary exist between males and females at the university once academic rank (professor, associate professor, and assistant professor) and number of publications are taken into account?"

The covariates in the above research question are academic rank and number of publications. The researcher wishes to take these variables into account (i.e., control) before comparing mean salaries between males and females.


Example: ANCOVA in Table Form

Below is a table taken from a study designed to test whether two forms of reciprocal peer tutoring (RPT) affect classroom achievement, academic self-efficacy, and test anxiety. The table provides means (M), adjusted means (Madj.), and standard deviations (SD) for each of the three dependent variables and the three groups (RPT in-class, RPT out-of-class, and the control group). Each mean is adjusted for a pre-measure of the variable (pretest). The pre-measure, or pretest, in each case served as the covariate.


Table 9. ANCOVAs and Summary Statistics for Experiment 3

Group              Statistic   Classroom Achievement   Test Anxiety   Academic Self-efficacy
RPT In-class       M           26.65                   3.51           5.34
                   Madj.       26.69                   3.75           5.32
                   SD           2.93                   1.93           1.04
RPT Out-of-class   M           27.74                   3.75           5.41
                   Madj.       27.72                   3.68           5.39
                   SD           3.33                   1.92           1.02
Control            M           26.95                   3.61           5.03
                   Madj.       26.94                   3.45           5.07
                   SD           4.18                   1.99           1.14

[ANCOVA source rows in the original table: Pretests (P); R x P. The corresponding F values are not reproduced here.]

Note. Values enclosed in parentheses represent mean square errors. M are the unadjusted posttest means, Madj. are the covariate adjusted posttest means, and SD is the standard deviation. RPT In-class represents the scores of students who quizzed each other in class, immediately prior to completing their exams, and RPT Out-of-class refers to the students who quizzed each other out of class, prior to completing their exams.

* p < .05.


Pretests/Pre-measures and ANCOVA

In many studies in education it is common for researchers to take pre-measures or pretests of students to measure their initial standings before initiating the experimental treatments. These pre-measures, or pretests, serve as ideal covariates in ANCOVA because they provide useful information about whether groups of students in the experiment started at similar levels of knowledge or aptitude. Should any discrepancies exist among groups on these pre-measures, one may attempt to statistically equate the groups by including the pretests/pre-measures as covariates. Thus, any time a study contains pre-measures for all study participants, look for those to be used as covariates in data analysis of experimental results.


(5) Chi-Square Test of Association (χ2)

All of the other statistical testing procedures covered require a dependent variable that is quantitative. The chi-square test is useful for learning whether a relationship exists between two qualitative (nominal) variables. For example, one may be interested in learning which group has the higher dropout rate, males or females. We may find that 20% of males drop out of school and 10% of females leave school before graduation. Here both variables, sex and dropout, are categorical/nominal (sex has two categories, male and female, and dropout has two categories, in school or out). Since both variables are qualitative, chi-square is the statistical procedure of choice.

As another example, consider student race (Black, Hispanic, and White) and high school program of study (college track, vocational track, and general track). As above, one would examine the number of students from each race who opted for each of the tracks. For example, we may find that 40% of Black students opt for the college track, 20% for the vocational track, and 40% for the general track. Similar figures may also be present for Hispanic and White students. Here both variables are qualitative (race is nominal with three categories, and program of study is also nominal with three categories).

The key is determining whether both variables involved are categorical/nominal (and hence qualitative). If yes, then a chi-square analysis can be used.

Chi-square would be appropriate for the following research questions:

"Is political party affiliation (Democratic, Republican) associated with presidential voting choice (vote for either the Democratic, Republican, or other candidate)?"

"Is race related to special education classification?"

Example: Chi-Square in Text

Dover and Shore (1991, p. 103) wrote:

"The gifted group responded significantly more often without probes on question 1 (χ2(1) = 4.37, p < .05) and question 3 (χ2(1) = 5.56, p < .05)."

In this example, the values for the chi-squares were 4.37 and 5.56. The larger the chi-square statistic, the more likely the null hypothesis will be rejected. The number in parentheses following the χ2, e.g., χ2(1), is the degrees of freedom. Note that the p-values are reported to be less than the alpha level of .05.

Margalit, Ankonina, and Avraham (1991, p. 431) wrote:

"No differences were found between groups for age or gender of the handicapped children (Kibbutz: 32 males and 11 females; City: 30 males and 18 females; χ2(1, N = 91) = .99, ns)."

In this example, the chi-square statistic is equal to 0.99, not a large value. The "ns" indicates "not significant," although it would be better for the authors to report the actual p-value obtained. The symbol "N" indicates the sample size, which is 91.
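As a check on how such a statistic arises, the counts quoted by Margalit et al. can be run through the standard chi-square computation. The sketch below is plain Python with illustrative function names. It assumes the authors applied the Yates continuity correction, which the quotation does not state, but which yields a value that rounds to the reported .99 (the uncorrected statistic is about 1.48). For 1 degree of freedom, the p-value can be obtained from the complementary error function.

```python
import math

def chi_square_2x2(table, yates=True):
    """Chi-square statistic for a 2 x 2 table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            diff = abs(observed - expected)
            if yates:
                # Continuity correction: shrink each |O - E| by 0.5.
                diff = max(diff - 0.5, 0.0)
            stat += diff ** 2 / expected
    return stat

def p_value_df1(stat):
    """Upper-tail p-value of a chi-square statistic with 1 df."""
    return math.erfc(math.sqrt(stat / 2.0))

# Rows: Kibbutz, City; columns: males, females (counts from the quotation).
observed = [[32, 11], [30, 18]]

corrected = chi_square_2x2(observed, yates=True)
uncorrected = chi_square_2x2(observed, yates=False)
p = p_value_df1(corrected)

print(f"chi-square (Yates) = {corrected:.2f}")
print(f"chi-square (uncorrected) = {uncorrected:.2f}")
print(f"p = {p:.2f}")
```

The resulting p-value is well above .05, which is consistent with the authors' "ns."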

Example: Chi-Square in Table

Reynolds, Kunce, and Cope (1991), in their study of driving under the influence of alcohol and personality type, reported the following table (p. 293):


Table 10. Distribution of participants by personality type and offender group.

                               First-Time       Repeat
Personality Type               n      %         n     %
Stability-oriented extravert    68    39.5      34    53.1
Change-oriented extravert       49    28.5       4     6.3
Stability-oriented introvert    34    19.8      23    35.9
Change-oriented introvert       21    12.2       3     4.7
Total                          172   100.0      64   100.0

Note. χ2 (3, N = 236) = 19.91, p < .0001.

The chi-square value is 19.91, which is significant at the .0001 level, so the null hypothesis of no difference in distribution among these groups is rejected. Apparently personality type is associated with repeat offender status.
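The statistic in the table note can be recomputed from the counts in Table 10. The sketch below is plain Python with illustrative names. The critical value 7.815 is the standard tabled chi-square value for 3 degrees of freedom at alpha = .05, supplied here as a constant rather than computed.

```python
def chi_square(table):
    """Pearson chi-square statistic and df for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if personality type and offender group
            # were unrelated.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, df

# Counts from Table 10; columns are (first-time, repeat) offenders.
counts = [
    [68, 34],  # Stability-oriented extravert
    [49, 4],   # Change-oriented extravert
    [34, 23],  # Stability-oriented introvert
    [21, 3],   # Change-oriented introvert
]

stat, df = chi_square(counts)

# Tabled critical value for df = 3, alpha = .05.
CRITICAL_DF3_ALPHA_05 = 7.815

print(f"chi-square({df}) = {stat:.2f}")
print("reject null" if stat > CRITICAL_DF3_ALPHA_05 else "retain null")
```

The computed value matches the reported χ2(3, N = 236) = 19.91, and it far exceeds the critical value, in line with the decision to reject the null hypothesis.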


Davis, N.F. (1990). The Reynolds Adolescent Depression Scale. Measurement and Evaluation in Counseling and Development, 23.

Dover, A., & Shore, B. (1991). Giftedness and flexibility on a mathematical set-breaking task. Gifted Child Quarterly, 35.

Fuchs, L.S., Fuchs, D., Karns, K., Hamlett, C.L., Dutka, S., & Katzaroff, M. (1996). The relation between student ability and the quality and effectiveness of explanations. American Educational Research Journal, 33, 631-664.

Goodenow, C. (1993). The psychological sense of school membership among adolescents: Scale development and educational correlates. Psychology in the Schools, 30.

Martin, C.L. (1990). An empirical investigation of employee behaviors and customer perceptions. Journal of Sport Management, 4.

Margalit, M., Ankonina, D., & Avraham, Y. (1991). Community support in Israeli Kibbutz and city families of children with disabilities: Family climate and parental coherence. Journal of Special Education, 24.

Quinn, G., & Griffin, B. W. (1999). Students' motivational needs and satisfaction in relation to achievement within a cooperative learning setting. Unpublished manuscript.

Reynolds, J.R., Kunce, J.T., & Cope, C.S. (1991). Personality differences of first-time and repeat offenders arrested for driving while intoxicated. Journal of Counseling Psychology, 38.

Woznica, J.G. (1990). Delay of gratification in bulimic and restricting anorexia nervosa patients. Journal of Clinical Psychology, 46.