EDUR 7130
Educational Research On-Line

Descriptive and Inferential Statistics


Assigned Readings

Gall's Text
6th edition: Chapter 5; and pages 389-403; 409-432; 456-459
7th edition: Chapter 5; and pages 304-316; 320-339; 360-363 
8th edition: Chapter 5; and pages 315-328; 332-352; 375-377

Statistics and Research

Statistics are used by researchers to describe data and relationships among data. For example, if one asks how a class of students did on a test, it would be inefficient to communicate each and every score. Rather, a more efficient method is to provide the average to give a general indication of how the students performed on the test. Below are various methods of describing data (Descriptive Statistics) and of modeling relationships among variables (Inferential Statistics).

Types of Descriptive Statistics

Descriptive statistics are used to describe data in a concise, understandable way. Descriptive statistics are summary indicators of larger groups of data. The example above illustrates how descriptive statistics may be used to reduce large amounts of information into a few summary indicators--thus reducing class scores to a class average. Two important summary methods for data are measures of central tendency (typical or average scores) and measures of dispersion (variability or spread of scores).

 

1. Measures of Central Tendency

Measures of central tendency are indicators of average or typical scores one might find in a distribution of scores. The three most common measures of central tendency are mode, median, and mean.

(a) Mode (Symbolized as Mo): This is the most frequent score in a distribution. In Table 1 below (see Graphing Data), the mode is 85, and in Table 2, the mode is female. The mode is appropriate for nominal, ordinal, interval, and ratio data.

(b) Median (Symbolized as Md, Mdn, or X50): The score directly in the middle when all scores are placed in rank order; the point at which 50% of the scores are above and 50% are below. For example, of the following scores, 5, 3, 7, 6, 9, 1, 4, the median is:

1, 3, 4, 5, 6, 7, 9

the score that falls in the middle of the distribution, which is five in this example.

If one has an even number of scores, the median is the mean (arithmetic average) of the two middle scores. For these scores, 2, 1, 3, 10, the median is

1, 2, 3, 10

(2+3)/2 = 2.5

so 2.5 is the median--exactly 50% of the scores fall below this and 50% above this score.

The median is a good measure for ordinal data, or for interval/ratio data when the distribution is highly skewed (e.g., income in the U.S. is positively skewed, so the median is typically reported). Skew means that there are a few very high scores or a few very low scores, and these extreme scores often distort the mean.

(c) Mean (Symbolized as M): The mean is what one usually thinks of as the average. The formula is:

Σ Xi / n = mean = M,

where Xi represents the raw scores, n is the sample size (the number of scores), and Σ means to sum the scores (to add all the scores together). In words, the mean is simply the sum of all scores divided by the number of scores. For example, for this set of scores (1, 2, 3, 10) the mean is:

(1+2+3+10) / 4 = 16/4 = 4

This measure is best used for ratio or interval data, but is often okay with ordinal data. It is not appropriate for nominal data since the mean assumes rank and nominal data do not have rank.

For the scores given in the frequency display below (see Table 1: Frequency Distribution for Test Scores), the mean, median, and mode are:

Scores from Table 1: 83 83 84 85 85 85 86 87 88 88

M = 854/10 = 85.40, Mdn = 85 (the mean of the two middle scores, 85 and 85), Mo = 85 (the most frequent score).
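These computations are easy to verify with a short program. Below is a minimal sketch using Python's standard statistics module (the score lists come from the examples above):

import statistics

scores = [83, 83, 84, 85, 85, 85, 86, 87, 88, 88]  # scores from Table 1

print(statistics.mode(scores))    # 85   (Mo: the most frequent score)
print(statistics.median(scores))  # 85   (Mdn: the middle of the ranked scores)
print(statistics.mean(scores))    # 85.4 (M: sum of scores divided by n)

# With an even number of scores, the median is the mean of the two middle scores:
print(statistics.median([1, 2, 3, 10]))  # (2 + 3) / 2 = 2.5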

 

2. Measures of Variability

A measure of variability provides some indication of the dispersion or spread of scores in a distribution. Note that central tendency indicates typical or average scores, and variability indicates spread of scores. For example, consider the following two sets of scores:

Set A: 80, 80, 80, 80, 80

Set B: 60, 70, 80, 90, 100

Both sets have the same mean and median (M = 80, Mdn = 80), yet they have very different spread or dispersion. Set A has no variability at all, while Set B has much variability (no two scores are the same).

There are several measures of variability that help to show differences in variability like that found in Sets A and B above. Several of these measures are provided below.

(a) Range (Symbolized as R): The range is the quickest and easiest measure of dispersion to calculate. The formula is simply the difference between the largest and smallest scores in the distribution of scores, i.e.,

Xmax - Xmin.

For example, with Set B, Xmax = 100, Xmin = 60, so the range is R = 100 - 60 = 40, or a 40 point spread. The range for Set A is R = 80 - 80 = 0, or no spread.

The problem with the range is that it only considers two numbers in the distribution, the highest and lowest score. Does the range adequately address variability for the following two sets?

Set C:  70 75 75 75 75 80

Set D:  70 72 74 76 78 80

Note that for both Sets C and D, the range is R = 80 - 70 =  10, or 10 points, yet the numbers suggest that Set D has more variability because no two numbers are the same while in Set C, there are only three unique numbers, 70, 75, and 80.

What is needed is a measure of variability that takes into account all numbers in the data, not just the two extreme numbers.

(b) Standard Deviation (Symbolized as SD, s): The standard deviation is more complex than the range and it provides a more useful indication of variability in a set of scores. The formula will not be discussed, but you should note that the standard deviation, like the range, cannot be less than zero (i.e., 0.00), and the larger the standard deviation, the greater the variability in a set of scores.

Using Sets C and D, the standard deviations are SD = 3.16 for Set C and SD = 3.74 for Set D. Set D has the larger standard deviation, and this indicates that scores in Set D have more variability than scores in Set C.
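Both the range and the standard deviation can be checked with a short Python sketch (statistics.stdev computes the sample standard deviation used here):

import statistics

set_c = [70, 75, 75, 75, 75, 80]
set_d = [70, 72, 74, 76, 78, 80]

for name, data in [("Set C", set_c), ("Set D", set_d)]:
    r = max(data) - min(data)      # range: Xmax - Xmin
    sd = statistics.stdev(data)    # sample standard deviation
    print(name, "R =", r, "SD =", round(sd, 2))

# Output:
# Set C R = 10 SD = 3.16
# Set D R = 10 SD = 3.74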

As another example, listed below are two sets of scores with the same measures of central tendency, but with different measures of variability.

Table 3: Example 2 for Variability

      Girls   Boys
      83      75
      84      80
      85      85
      85      85
      85      85
      86      90
      87      95

M     85      85
Mdn   85      85
Mo    85      85
SD    1.29    6.45
R     4       20

As this table shows, the measures of central tendency are identical for boys and girls, but the two groups differ dramatically on the measures of variability.

 

3. Graphing Data

Another method for presenting and describing data in an efficient manner is through the use of graphs. Graphs provide pictorial displays that enable one to more readily understand a distribution of scores. A few commonly used graphs are described below.

(a) Frequency Distributions

Frequency distributions are used to indicate how many times a particular value was obtained in a set of scores. For example, consider the following scores obtained on a test:

85, 85, 88, 87, 84, 85, 86, 83, 88, 83.

The frequencies for the above scores are:

Table 1: Frequency Distribution for Test Scores

X (raw score)   F (frequency of score)
88              2
87              1
86              1
85              3
84              1
83              2

As this frequency distribution shows, the most common score was 85, and the least frequent scores were 84, 86, and 87. Frequency distributions will work with data from any type of variable (nominal, ordinal, interval, or ratio). To illustrate, one could make a frequency distribution for the sexes enrolled in a course. Suppose there are 11 women and 5 men; the frequency distribution would be:

Table 2: Frequency Distribution for Sex

Sex Frequency
Female 11
Male 5
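Building a frequency distribution amounts to tallying how often each value occurs, so it is easy to sketch in Python with the built-in Counter (using the data from Tables 1 and 2):

from collections import Counter

test_scores = [85, 85, 88, 87, 84, 85, 86, 83, 88, 83]
sexes = ["Female"] * 11 + ["Male"] * 5

# Counter tallies how many times each value occurs; it works for any
# type of variable (nominal, ordinal, interval, or ratio).
print(Counter(test_scores))  # Counter({85: 3, 88: 2, 83: 2, 87: 1, 84: 1, 86: 1})
print(Counter(sexes))        # Counter({'Female': 11, 'Male': 5})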

(b) Bar Chart and Histogram

Other commonly used graphical tools are bar charts and histograms. Both are similar; the only difference is that a bar chart is used for qualitative data (so the bars do not touch, thus indicating a lack of continuity), while the histogram is used with quantitative data (the bars touch). Examples of both are provided below.
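As a minimal illustration, the sketch below draws both kinds of graphs side by side; it assumes the matplotlib plotting library is installed:

import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2)

# Bar chart: qualitative data, so the bars do not touch
ax1.bar(["Female", "Male"], [11, 5])
ax1.set_title("Bar chart (qualitative)")

# Histogram: quantitative data, so the bars touch
scores = [85, 85, 88, 87, 84, 85, 86, 83, 88, 83]
ax2.hist(scores, bins=6)
ax2.set_title("Histogram (quantitative)")

plt.show()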

 

(c) Time-Series Graph

A time-series graph displays figures over time. Time-series graphs are often used in single subject research. For more examples, see the section Single Subject Research.

4. Measure of Relative Position

Relative position refers to the location in a distribution of a given score relative to other scores. Relative position indicates how well one performed on a test relative to others.

The only measure of relative position discussed here will be Percentile Rank (PR). A percentile rank indicates the proportion or percentage of individuals who scored less than a given score. For example, if you receive a PR of 75, then this means that 75% of those who took the test scored less than you. It does not mean that you got 75% of the items correct on the test. If you had a PR of 4, this means you scored better than 4% of those who took the test.

Note also that some define percentile rank as representing the percentage who scored at or below a given score. Thus, a PR of 75 means that one scored the same as or better than 75% of test takers. 

Both ways of defining percentile rank ([a] scored better than, or [b] scored equal to or better than) are commonly used and found in education.
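The two definitions differ only in whether a tied score counts; a minimal Python sketch makes the contrast concrete (the scores reuse Table 1):

def pr_below(scores, x):
    # PR as the percentage of test takers who scored LESS than x
    return 100 * sum(s < x for s in scores) / len(scores)

def pr_at_or_below(scores, x):
    # PR as the percentage who scored AT OR BELOW x
    return 100 * sum(s <= x for s in scores) / len(scores)

scores = [83, 83, 84, 85, 85, 85, 86, 87, 88, 88]
print(pr_below(scores, 86))        # 60.0 -> scored better than 60% of takers
print(pr_at_or_below(scores, 86))  # 70.0 -> scored equal to or better than 70%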

Hypothesis Testing

A key to quantitative research is the hypothesis test. When trying to determine whether two variables are related (like intelligence and scholastic performance, or sex [female vs. male] and dropout status), statistical testing of hypotheses is the tool most frequently used.

What does it mean to statistically test a hypothesis? With hypothesis testing, data from studies are used to make a judgement about whether enough information exists to support a conclusion that a relationship exists between variables. To make this judgement, two types of hypotheses are considered, the research hypothesis (which in this case will be either directional or non-directional) and the null hypothesis.

To introduce you to the idea of hypothesis testing, consider the following research situation.

Study 1: Boys/Girls and Math

A researcher wishes to know if there is statistical evidence that boys and girls differ on a standardized mathematics test. The research hypothesis is:

Ha: Boys will have higher scores than girls on the mathematics test.

The null hypothesis is

Ho: There will be no difference in mathematics test scores between boys and girls.

In this example, Ha is the research hypothesis (H indicates hypothesis, the letter "a" indicates that this is the alternative or research hypothesis) and Ho is the null hypothesis (H is hypothesis, "o" is the symbol for null).

When making a scientific judgement about this study, hypothesis testing is used. In all types of hypothesis testing, the null hypothesis is always the hypothesis tested. The logic of testing the null hypothesis follows something like this:

The null indicates no difference in math scores between boys and girls in this example. If the null hypothesis can be rejected, then some hypothesis other than the null must be accepted, which will be either the directional or non-directional research hypothesis. Note that only the null indicates no difference; both the directional and non-directional hypotheses indicate some difference between boys and girls on math scores. So by rejecting the null hypothesis, we can establish evidence for a difference between boys and girls.

So in Study 1, we will first collect data, then we will use a statistical procedure (several are discussed below) to test the null hypothesis. If the null hypothesis is rejected, then we can accept the alternative or research hypothesis and state that sex and math scores are related in some way (perhaps that boys score higher than girls). If the null hypothesis is not rejected, then we state that there is not enough evidence in the data collected to say that a difference exists between boys and girls on math test scores. The key to the conclusion we draw is whether the null hypothesis is rejected or not rejected: if rejected, then a relationship or difference exists; if not rejected, then we cannot claim that sex and math scores are related.

Study 2: Intelligence and Math

A researcher wishes to know if there is statistical evidence that one's intelligence is related to one's performance on a mathematics test. The research hypothesis is:

Ha: Those with higher levels of intelligence will score higher on a mathematics test.

The null hypothesis is

Ho: There is no relationship between intelligence and mathematics test scores.

For Study 2 we are interested in knowing whether a relationship exists between two variables, intelligence and math scores. As with Study 1, the question of interest is whether the null hypothesis can be rejected. As before, the null hypothesis is tested. If the null hypothesis is rejected, based upon the evidence provided in the data collected, then we can say that intelligence and math scores are related (perhaps positively related). If the null hypothesis is not rejected, then we say that there is not enough evidence in the data to say that intelligence and math scores are related in any way.

Statistical Significance (Statements of Significance)

As noted above, when performing statistical testing of hypotheses, one always tests the null hypothesis. There are many statistical methods for testing null hypotheses, and a number of them are discussed below (e.g., t-test, Pearson's r, ANOVA). All statistical testing procedures have corresponding null hypotheses. When testing hypotheses with statistical tests like the t-test, ANOVA, and Pearson's r, one must assume that the null hypothesis is true; that is, one must assume that no differences actually exist among the groups examined, or that no relationships exist among the variables.

All of the statistical tests will provide some indication as to whether the evidence provided by the data supports or does not support the null hypothesis. If the evidence indicates that the null hypothesis is supported, then the researcher concludes that no differences between the groups really exist (or that no relationship among the variables exists), and the researcher can therefore state that the statistical results indicate that there were no statistically significant differences among the groups, or that there were no statistically significant relationships among the variables.

If, however, the data do not support the null hypothesis, then the researcher can reject the null hypothesis and accept the research hypothesis. In that case, the researcher can say that a statistically significant difference exists among the groups (which is, of course, what the research hypothesis claims), or that a statistically significant relationship exists among the variables.

Thus, when one claims that results are statistically significant, this does not mean that important results were obtained; rather, it simply means that the data do not support the null hypothesis. Whether important results were obtained in any study cannot be determined by statistical hypothesis testing; importance can only be assessed through one’s interpretation of the results--that is, one must examine the data closely to determine whether something important was found.

So, what does it mean when one hears, say in the news, that researchers found a significant relationship between smoking and lung cancer, or that researchers found a significant relationship between leaving a light on for young children at night and their likelihood of developing near-sightedness? Does the term significant mean they found an important result? No. In this case significant only means that the data collected by the researchers provided enough evidence to reject the null hypothesis; it does not mean that something important was found.

To illustrate the difference, consider a study of computer-assisted instruction in mathematics. A recent study found a significant difference in math achievement scores between two groups: one group used traditional instruction (paper and pencil exercises, etc.), and the other group supplemented this instruction with mathematics software instruction three times a week. One might be tempted to conclude that since a significant difference was found, the computer-assisted instruction makes a big difference. This is not the case. There is a curious characteristic of hypothesis testing related to sample size: the larger the sample one collects, the easier it is to reject the null hypothesis (hence the easier it is to claim a significant effect was found). In this study, the mean math-achievement score for the traditional instruction group was 84.3, and the mean math-achievement score for the computer-assisted group was 85.2, less than a 1 point difference in achievement. Do you consider this to be an important improvement in achievement? When making this judgement, consider also the resources required to use computer-assisted instruction (money, time, space, etc.). I hope this example illustrates that just because someone reports a significant finding in research, this does not necessarily mean something important was found; it only means that the null hypothesis was rejected.

Statistical Significance (Probability Values)

Whenever one performs a statistical test to determine whether to reject or fail to reject the null hypothesis, one always relies on probability. Keep in mind that when performing hypothesis tests, one is working with samples, not populations. Any given sample may or may not represent the population well, so anything found in a sample study may or may not be true for the population.

The reason one works with a sample is to make inferences or generalizations back to the population. The reason one tests hypotheses is to help with the decision making process of generalizing from sample to population. So, for example, if one fails to reject the null hypothesis that intelligence is unrelated to math scores, one would then conclude that intelligence and math scores are unrelated in the population.

With all statistical testing procedures for testing null hypotheses, a few key bits of information will be calculated. One thing calculated is called the inferential test statistic. This will be described for each statistical test discussed below, but usually it will be one of the following: t, F, or χ2. Each of these three symbols represents the inferential test statistic used to judge whether one should reject or not reject the null hypothesis being tested.

Another statistic produced by all hypothesis testing procedures is the p-value, which is usually denoted simply as "p" in reports. The p-value indicates the level of probability associated with any test statistic. The smaller the p-value, the less likely it is that the data support the null hypothesis. So with small p-values, researchers are more likely to conclude that the null hypothesis must be rejected. In many cases researchers will use one of two cut-off levels for rejecting the null hypothesis: a p-value less than .05 or less than .01. These cut-off values are called alpha (α) in statistics, and are sometimes referred to as the significance level.

So, for example, if we test the null hypothesis that intelligence is unrelated to math scores, the p-value obtained (from the computer software used to analyze the data) might be, say, p = .33. Since this value is larger than an alpha of either .05 or .01, we would not reject the null hypothesis and therefore conclude that intelligence is unrelated to math scores. If, however, we obtained a p-value of .04, we would be able to reject the null hypothesis at the alpha level of .05 since .04 is less than .05 and we could then conclude that our data indicate that intelligence is related to math scores.
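The decision rule just described is mechanical enough to write down as a short Python sketch:

def test_null(p_value, alpha=0.05):
    # Reject the null hypothesis when the p-value falls below alpha
    if p_value < alpha:
        return "reject the null hypothesis (statistically significant)"
    return "fail to reject the null hypothesis (not significant)"

print(test_null(0.33))  # fail to reject: .33 is larger than alpha = .05
print(test_null(0.04))  # reject: .04 is less than alpha = .05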

Whenever you read research reports or articles, look for p-values. If presented, they can be found in tables that report statistical information. Unfortunately, some researchers choose not to report exact p-values and instead use an asterisk (*) to indicate that the p-value was less than a given cut-off level like .05 or .01. See the examples below for illustrations.

Types of Inferential Statistics

There are many statistical procedures used to test null hypotheses, and each is best suited for specific research situations and types of data. Below is a table that lists some of the more commonly used statistical procedures. Study this table as you study the various types of inferential statistical procedures.

Table 4: Types of Statistical Procedures and Their Characteristics

(1) Pearson's r (correlation coefficient)
Independent Variable: Quantitative
Dependent Variable: Quantitative
Example Hypothesis: There is a positive relationship between intelligence and mathematics achievement scores.

(2) t-test
Independent Variable: Qualitative
Dependent Variable: Quantitative
Special Feature: IV has only 2 categories.
Example Hypothesis: There will be a difference between boys and girls on mathematics achievement scores.

(3) ANOVA (Analysis of Variance)
Independent Variable: Qualitative
Dependent Variable: Quantitative
Special Feature: IV may have 2 or more categories.
Example Hypothesis: There will be a difference among Blacks, Hispanics, and Whites in mathematics achievement scores.

(4) ANCOVA (Analysis of Covariance)
Independent Variables: 1. Qualitative; 2. Covariate, which may be either qualitative or quantitative, but is usually quantitative
Dependent Variable: Quantitative
Special Features: IV may have 2 or more categories; the covariate is used to make adjustments to DV means.
Example Hypothesis: There will be a difference among Blacks, Hispanics, and Whites in mathematics achievement scores after taking into account levels of motivation.

(5) Chi-Square (χ2)
Independent Variable: Qualitative
Dependent Variable: Qualitative
Example Hypothesis: Males will be more likely to drop out of school than females.

(1) Pearson's r (Correlation Coefficient)

The relationship between two quantitative variables may be measured by Pearson's product moment correlation coefficient, which is symbolized by the letter "r." The correlation coefficient, r, may range from 1.00 (indicating a perfect, positive, linear relationship) to -1.00 (indicating a perfect, negative, linear relationship), and may take any value in between. When r = 0.00, no linear relationship exists between the two variables. The closer the coefficient is to 1.00, the stronger the positive, or direct, relationship; the closer the coefficient is to -1.00, the stronger the negative, or inverse, relationship. A coefficient of 0.00 represents the weakest possible linear relationship, although the variables may still be non-linearly related.

A linear relationship is such that as one variable increases, the other increases (or decreases) in a straight-line fashion. If, however, the relationship were such that as one variable increased the other increased to a point and then began to decrease, a curvilinear or non-linear relationship would exist. These types of relationships are presented below with three scatterplots:

[Three scatterplots, figures (a), (b), and (c), appear here.]

Figure (a) shows a positive relationship, figure (b) shows a negative relationship, and figure (c) depicts a curvilinear or non-linear relationship.

In case you have difficulty understanding a scatterplot, it can be viewed as the point intersections of two variables. Each scatterplot has two axes, the vertical and the horizontal, and each of the two variables is represented on one axis. In the example below, two variables are plotted: intelligence (measured on a 1 to 5 scale) and test scores (also measured on a 1 to 5 scale).

To help illustrate what a scatterplot does, three people are identified on the scatterplot. The dot with an "a" beside it indicates the two scores for person "a." This individual scored a 1 on intelligence and a 1 on test scores; person "b" scored a 5 on intelligence and a 5 on test scores; and person "c" scored a 3 on intelligence and a 4 on test scores. By plotting the combination of scores from two variables in this manner, one can more readily see any type of relationship that may exist between two variables. In this example, there is a positive relationship between intelligence and test scores--as intelligence increases (goes from 1 to 5), so too do test scores.

Figure 1 illustrates the strength of different correlations via scatterplots. For both positive and negative correlations, the tightness of the scatter corresponds to the size of the correlation coefficient r: the closer r is to 1.00 or -1.00, the more compact the scatter about a central line, and the closer r is to 0.00, the more dispersed the scatter.

[Figure 1: scatterplots illustrating correlations of varying strength.]

The null hypothesis in correlational research states that no relationship exists, which means r = 0.00. As stated earlier, the null hypothesis is tested to determine statistical significance. In the case of correlational research, the question of interest is whether there is evidence that a relationship exists between the two variables in the population. Statistical significance testing is used to make this determination.

Significance testing allows one to ask whether the relationship, as measured by r, is likely to be equal to zero. If the correlation coefficient is not likely to be equal to zero, the relationship between two variables is said to be statistically significant. If zero is a strong possibility, then the relationship is not statistically significant. A non-significant relationship means that it is possible that no relationship exists, i.e., r = 0.00 (recall that an r of zero indicates no relationship). Note that determining whether the coefficient is statistically significant or not is a statement of probability or chance—that is, if we say the coefficient is statistically significant, we are stating that the coefficient is probably different from 0.00, but we are not sure.
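In practice the correlation and its p-value come from software. A minimal sketch, assuming the SciPy library is available (the two score lists are invented for illustration):

from scipy.stats import pearsonr

intelligence = [1, 2, 2, 3, 3, 4, 4, 5, 5, 3]
test_scores  = [1, 2, 3, 3, 4, 3, 5, 4, 5, 4]

r, p = pearsonr(intelligence, test_scores)  # correlation and its p-value
print(f"r = {r:.2f}, p = {p:.3f}")

# If p is less than alpha (.05), reject the null hypothesis that the
# population correlation is 0.00 and call the relationship significant.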

Example 1: Reporting Correlations in Research

Davis (1990, p. 89) writes:

"Correlation between the RADS and the Hamilton was .83 (p <  .001), indicating a strong relationship between the two methods of assessing depression."

The correlation in this example is r = .83, which is a strong, positive correlation. The p-value is reported to be less than .001, which is far below the standard cut-off alpha levels of .05 and .01, so the null hypothesis (i.e., no relationship) is rejected in this case.

In another example, Martin (1990, p. 10) writes:

"The perceived relative importance of Behavior 15 (employees verbally complimenting bowlers) was most strongly correlated with bowling average (r = .251, n = 82, p = .011)."

The correlation in this example is r = .251, which is a moderate to weak correlation. The sample size is n = 82, and the p-value is .011, which is less than an alpha of .05, so the null hypothesis of no relationship is rejected.

Example 2: Correlation Matrices

Sometimes researchers are interested in several correlations among a number of variables. The most efficient method for presenting multiple correlations is through a correlation matrix.

Quinn and Griffin (1999) report the following correlation matrix (Table 5) for their study of motivational needs, course satisfaction, and course achievement among undergraduate students in a cooperative learning course.

Table 5. Means, standard deviations, and correlations for
Motivational Needs, Course Satisfaction, and Course Achievement

                            1      2      3      4      5      6
1. Need for Achievement
2. Need for Affiliation  -.36*
3. Need for Autonomy      .26   -.33*
4. Need for Dominance     .14   -.31*   .12
5. Course Satisfaction    .39*   .24   -.05   -.08
6. Course Achievement     .43*   .05    .28*   .06    .41*
M                       11.23  12.42  12.04   9.51  14.43   3.38
SD                       2.57   3.45   2.48   2.58   2.99   0.57

Note: n = 53.

* p < .05

Note that in Table 5 means (M), standard deviations (SD), and sample size (n) are also reported. To show how to read this table, I will highlight a few bits of information. First, the mean for "Need for Dominance" is 9.51 with a SD of 2.58. The correlation between "Need for Dominance" and "Course Satisfaction" is r = -.08 (a weak, non-significant relationship). The correlation between "Need for Autonomy" and "Need for Affiliation" is r = -.33. Note that there is an asterisk next to this correlation (i.e., -.33*). At the bottom of the table, the asterisk is shown to denote correlations that are statistically significant at the .05 alpha level. This means that for correlations marked with an asterisk, the null hypothesis of no relationship is rejected.

(2) t-test

A t-test is used if one wishes to make a comparison between two groups, say an experimental and control group. The t-test allows one to determine whether a statistically significant difference exists between two means—thus, the t-test compares two group means. For example, a t-test would be appropriate to use for testing the following hypothesis:

"It is expected that boys will have higher ITBS mathematics scores than girls."

The inferential statistic calculated in the t-test is called the t-ratio, which is denoted as "t" in most reports and articles. The larger the t-ratio (in absolute value), the more likely we will reject the null hypothesis, because a large t-ratio indicates more evidence in the data that the two groups differ from each other. The formula for the t-ratio will not be discussed here, but you should recognize that "t" is the test statistic used to determine whether the null hypothesis should be rejected.

When reporting t-tests, a common method is to use a table with appropriate statistical information, like the table below.

Example 1: Reporting t-test in Table

Table 5: Results of t-test for mathematics achievement by sex

  Boys Girls
Mean 85.35 78.64
Standard Deviation 6.89 5.99
N 32 36

Note. t = 2.59, df = 66, p = .032

In the table above, the mean math score for boys is 85.35 and for girls it is 78.64, so the boys' mean is higher, as the hypothesis predicted. We also see that the variability in the boys' scores is greater, since the SD is larger for boys. There were more girls in the study than boys (36 girls, 32 boys). The t-ratio in this example is t = 2.59, and its corresponding p-value is p = .032. If the cut-off level for statistical significance testing--alpha--is set at .05, then since the p-value is less than .05 (p = .032), the null hypothesis of no difference between boys and girls is rejected, and we conclude that there is enough evidence in the data to state that boys appear to perform better in math than girls in the population from which these two samples were selected.

Example 2: Reporting t-test in Text

In a study of tutoring, Fuchs, Fuchs, Karns, Hamlett, Dutka, and Katzaroff (1996) wrote (p. 648):

"On the problem sets worked during the tutoring generalization sessions, tutees correctly completed 91% (SD = 12) of the problems they attempted with HA tutors; with AA tutors, they correctly completed 75% (SD = 26), t(19) = 2.78, p < .05, ES = .84."

In this sentence, they report means of 91 and 75 for two groups (tutees under two different types of tutors, HA and AA). The corresponding t-ratio is t = 2.78. They did not report the exact p-value, but they did report that the p-value is less than an alpha of .05, so the null hypothesis is rejected in this case. Since the null is rejected, we can conclude that tutees studying with HA tutors completed more problems correctly (91%) than tutees studying with AA tutors (75%).
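A t-test like those above is straightforward to run in software. A minimal sketch, assuming SciPy is available (the scores are invented for illustration):

from scipy.stats import ttest_ind

boys  = [88, 90, 79, 85, 92, 81, 84, 87]
girls = [78, 82, 75, 80, 77, 84, 79, 76]

t, p = ttest_ind(boys, girls)  # independent-samples t-test on the two means
print(f"t = {t:.2f}, p = {p:.3f}")

# If p is less than alpha (.05), reject the null hypothesis of no
# difference between the boys' and girls' mean scores.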

(3) Analysis of Variance (ANOVA)

ANOVA is similar to the t-test, but it is appropriate when one wishes to compare two or more means--compare two or more groups--to determine whether a statistically significant difference exists among the group means. For example, an ANOVA would be appropriate for the following:

"There will be a difference between grades 1, 2, and 3 scores on the elementary mathematics achievement test."

In this example, one wishes to learn whether math test scores differ between students in grades 1, 2, and 3, so there are three groups to compare.

The null hypothesis tested in this example is that there is no difference in math score means between these three grades. If the null hypothesis is rejected, then one may conclude that there is evidence in the sample to suggest math score means differ among these three groups in the population from which the sample was selected.

Rather than report an r like a correlation coefficient, or a t like the t-test, the ANOVA provides an "F" value for determining whether to reject the null hypothesis. Like all inferential statistical tests, associated with the F-value is a p-value. Below are examples of how ANOVA may be reported in research.

Example 1: ANOVA in Text

Goodenow (1993) reports (p. 85):

"Finally, as a direct assessment of construct validity, it was hypothesized that students rated by their English teachers as having different levels of social standing with peers would also exhibit significantly different levels of self-reported psychological membership. A one-way analysis of variance confirmed this hypothesis: Students rated as having high, medium, or low social standing were different in their PSSM scores (4.23, 3.87, and 3.32, respectively, F[2,451] = 26,.59, p < .001)."

Several things are reported above. First, the three PSSM mean scores for the three groups were M = 4.23 for the high group, M = 3.87 for the medium group, and M = 3.32 for the low group. Next, the author reports an F value of F = 26.59. This test statistic is high--F values can only be positive, and any F value greater than, say, 4.00 is usually an indication that the null hypothesis should be rejected. The p-value for this F-value is not reported, but the author does indicate that the p-value is less than .001, which is certainly smaller than alpha levels of .05 and .01, so the null hypothesis of no difference among these three groups is rejected and we conclude that the mean scores do differ in the population.

Two other things should be noted here. First, the author refers to the ANOVA as a one-way analysis of variance. The term one-way refers to the number of independent variables tested: if it is one-way, then only one independent variable is tested; if it is two-way, then two independent variables are tested, and so on. The terms one-way, two-way, etc. do not refer to the number of categories in the independent variable. Second, the author reports some numbers adjacent to the F-value, i.e., F[2,451]. These two numbers, 2 and 451, are called degrees of freedom, and every statistical test of hypotheses has degrees of freedom. While degrees of freedom play an important role in statistics, they are not important to understand at this point or for general reading of statistical reports, so I will not cover them here. See your text for a more detailed description of degrees of freedom if you wish to know more.
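A one-way ANOVA like the one Goodenow reports can be run with a few lines of code. A minimal sketch, assuming SciPy is available (the three groups of scores are invented):

from scipy.stats import f_oneway

grade1 = [72, 75, 71, 78, 74]
grade2 = [80, 79, 83, 77, 81]
grade3 = [85, 88, 84, 90, 86]

F, p = f_oneway(grade1, grade2, grade3)  # one-way ANOVA across three groups
print(f"F = {F:.2f}, p = {p:.4f}")

# If p is less than alpha (.05), reject the null hypothesis that the
# three group means are equal.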

Example 2: ANOVA in Text and Table

Woznica (1990) reports the results of a one-way ANOVA in both text and table form (p. 711):

"The 'pure restricters' obtained a mean STIC score of 44.00, which was higher than the mean STIC scores of the bulimic anorexic, normal control, and psychiatric control groups. A one-way analysis of variance of group differences on the STIC yielded a significant F ratio of 3.53, which indicated that there was a significant difference between groups on the STIC (p < .01)."

Woznica Table 1: Analysis of variance: Group STIC scores using only "pure restricters" in restricting anorexic group

Source df Sum of Squares Mean of Squares F
Between 3 453.500 151.167 3.531*
Within 35 1498.500 42.814  
Total 38 1952.000    

* p < .05

In this table there are several items that may be unfamiliar. Each column represents standard information reported in an ANOVA table: source, df (degrees of freedom), sums of squares (SS), mean squares (MS), and the F-ratio (F). I won't cover what each of these components represents; see your text for a description of these. Note that the reported F-ratio is F = 3.53 and it is marked with an asterisk. At the bottom of the table the asterisk is shown to indicate that the corresponding p-value is less than .05 (the alpha level). In this case, the F-ratio is large enough to reject the null hypothesis of no difference at the .05 level.

(4) Analysis of Covariance (ANCOVA)

To help explain how ANCOVA is used, consider the following research example. In this study, we are interested in knowing whether some form of cooperative learning produces higher achievement scores than lecture-based instruction. Two groups of students are assigned to one or the other type of instruction. Table 6 provides summary information about the two groups.

Table 6: Summary Statistics for Learning Groups

  Cooperative Learning (n = 9) Lecture (n = 9)
Intelligence Mean 103.78 100.78
Intelligence SD 2.82 2.82
Posttest Mean 80.00 75.00
Posttest SD 2.74 2.74

The table above shows that students using cooperative learning (M = 80) scored higher on a posttest taken at the end of a semester than students in the lecture group (M = 75). Using this information, it appears that cooperative learning produces higher achievement scores. However, at the beginning of the semester, each student was given an intelligence test to determine how equivalent the two groups were in terms of academic ability before starting the class. The intelligence test shows that students assigned to the cooperative learning class had slightly higher scores (M = 103.78) than students assigned to the lecture class (M = 100.78).

Now we have a confounded study. Which factor can account for the difference in achievement scores: the type of instruction received (cooperative learning vs. lecture), or the difference in intelligence (103.78 vs. 100.78)? Schematically, this study can be illustrated as:

Table 7: Outline of Instructional Study

Intelligence Instruction Achievement
103.78 Cooperative Learning 80
100.78 Lecture 75

Since there is a difference between the two groups in terms of instruction and in terms of intelligence, we simply do not know which of the two factors contributed to the 5 point difference in achievement, so we cannot state with certainty that instruction was the cause of this difference.

To help alleviate this problem, we could use ANCOVA. Like the ANOVA, ANCOVA is good for comparing two or more means (groups) to learn whether a statistically significant difference exists among the groups, but ANCOVA also allows the researcher to statistically control for confounding variables (termed covariates in ANCOVA parlance) by statistically adjusting the group means on the dependent variable to take into account initial group differences on some other variable. These statistically adjusted means are then used to compare the groups.

So how does this work; how are the dependent variable means adjusted? The mathematics that drives ANCOVA is too difficult to be presented here, but the logic can be explained. Taking into account the correlations among the independent, dependent, and covariate variables, ANCOVA adjusts the dependent variable means either higher or lower depending upon where the groups started on the covariate. To illustrate, in the example described above, the cooperative learning group started higher in intelligence than the lecture group, so their achievement mean will be adjusted downward to compensate for their "head start" in the class. This is like a handicap in golf. The lecture group, since they started behind in terms of intelligence, will have their achievement mean adjusted upward. The amount of adjustment in both groups varies and depends mathematically on several things, so it is difficult to know in advance just how much of an adjustment will take place. In this example, I used ANCOVA to calculate the adjusted means, which are presented in Table 8 below.

Table 8: Outline of Instructional Study with Adjusted Means

Intelligence Instruction Achievement Adjusted Achievement
103.78 Cooperative Learning 80 78.68
100.78 Lecture 75 76.32

After taking into account the covariate--the level of intelligence--we see that the difference between the two groups on adjusted means is much smaller: 78.68 - 76.32 = 2.36 points rather than 5.00 points. One way of understanding adjusted means is to consider what they theoretically allow one to do in terms of interpretation. The goal in experimental research is to have groups that are as alike, or equivalent, as possible, so any differences that result between the two groups are not due to differences that the groups started with, but due to differences in treatments (in this example, instruction is the treatment). By using ANCOVA to make statistical adjustments, we can state that the adjusted means indicate the level of difference between the two groups had they started at the same intelligence level (say, 102.28, which is the overall mean of intelligence for the two groups).
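The classic adjustment just described can be sketched in a few lines of Python. This is only an illustration of the adjusted-mean formula (adjusted mean = group DV mean minus the pooled within-group slope times the group's covariate head start), not a full ANCOVA with significance tests:

from statistics import mean

def adjusted_means(groups):
    # groups: dict mapping group name -> list of (covariate, outcome) pairs.
    # Pooled within-group slope: sum of within-group cross-products divided
    # by the sum of within-group squared covariate deviations.
    sxy = sxx = 0.0
    for pairs in groups.values():
        mx = mean(x for x, _ in pairs)
        my = mean(y for _, y in pairs)
        sxy += sum((x - mx) * (y - my) for x, y in pairs)
        sxx += sum((x - mx) ** 2 for x, _ in pairs)
    b_w = sxy / sxx
    grand_mx = mean(x for pairs in groups.values() for x, _ in pairs)
    # Adjust each group's outcome mean for its head start on the covariate
    return {name: mean(y for _, y in pairs)
                  - b_w * (mean(x for x, _ in pairs) - grand_mx)
            for name, pairs in groups.items()}

A group that starts above the grand covariate mean has its outcome mean adjusted downward, and a group that starts below has its mean adjusted upward, exactly as in the cooperative learning example above.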

As another example, consider the following experiment:

Table 9: Outline of Instructional Study with Adjusted Means and Motivation
as a Covariate

Motivation Instruction Achievement Adjusted Achievement
4.52 Cooperative Learning 80 75
3.35 Lecture 70 75

In the example of Table 9, motivation to learn is the covariate, and we see that the lecture group has lower motivation to start with, so we would expect their achievement to be lower due to the lower motivation. This study, like the example above, is confounded in that we do not know whether the type of instruction or the level of motivation caused the 10 point difference found in achievement scores. We would anticipate ANCOVA to adjust the lecture group's achievement mean upward, and the cooperative learning group's achievement mean downward. The adjusted means are 75 for both groups, which signifies that once we take into account the level of motivation, there is no difference in achievement performance between the two groups.

With ANCOVA, one may have many covariates to consider. One may include more than one covariate in an effort to control for, or statistically adjust for, many confounded variables.

ANCOVA would be appropriate for the following research question:

"Does a difference in salary exist between males and females at the university once academic rank (professor, associate professor, and assistance professor) and number of publications are taken into account?"

The covariates in the above research question are academic rank and number of publications. The researcher wishes to take these variables into account (i.e., control) before comparing mean salaries between males and females.

Example: ANCOVA in Table Form

Below is a table taken from a study designed to test whether two forms of reciprocal peer tutoring (RPT) impact classroom achievement, academic self-efficacy, and test anxiety. The table provides means (M), adjusted means (Madj.), and standard deviations (SD) for each of the three dependent variables and the three groups (RPT in-class, RPT out-of-class, and the control group). Each mean is adjusted for a pre-measure of the variable (pretest). The pre-measure, or pretest, in each case served as the covariate.

Table 9. ANCOVAs and Summary Statistics for Experiment 3

                                         F
Source         df   Posttest Performance   Test Anxiety   Academic Self-efficacy
RPT (R)         2          0.23                0.84               0.45
Pretests (P)    1          8.89*              30.82*              5.69*
R x P           2          0.39                0.83               0.45
Error          61        (10.92)              (2.56)             (1.06)

RPT In-class        M = 26.65       M = 3.51       M = 5.34
                    Madj. = 26.69   Madj. = 3.75   Madj. = 5.32
                    SD = 2.93       SD = 1.93      SD = 1.04

RPT Out-of-class    M = 27.74       M = 3.75       M = 5.41
                    Madj. = 27.72   Madj. = 3.68   Madj. = 5.39
                    SD = 3.33       SD = 1.92      SD = 1.02

Control             M = 26.95       M = 3.61       M = 5.03
                    Madj. = 26.94   Madj. = 3.45   Madj. = 5.07
                    SD = 4.18       SD = 1.99      SD = 1.14

Note. Values enclosed in parentheses represent mean square errors. M are the unadjusted posttest means, Madj. are the covariate adjusted posttest means, and SD is the standard deviation. RPT In-class represents the scores of students who quizzed each other in class, immediately prior to completing their exams, and RPT Out-of-class refers to the students who quizzed each other out of class, prior to completing their exams.

* p < .05.

 

Pretests/Pre-measures and ANCOVA

In many studies in education it is common for researchers to take pre-measures or pretests of students to measure their initial standings before initiating the experimental treatments. These pre-measures, or pretests, serve as ideal covariates in ANCOVA because they provide useful information about whether groups of students in the experiment started at similar levels of knowledge or aptitude. Should any discrepancies exist among groups on these pre-measures, one may attempt to statistically equate the groups by including the pretests/pre-measures as covariates. Thus, any time a study contains pre-measures for all study participants, look for those to be used as covariates in the analysis of experimental results.

(5) Chi-Square (χ2)

All of the other statistical testing procedures covered require a dependent variable that is quantitative. The chi-square test is useful for learning whether a relationship exists between two qualitative (nominal) variables, such as sex (male, female) and dropping out of school (dropout, stay-in), or race (Black, Hispanic, White) and choice of high school program of study (college track, vocational, general).

Chi-square would be appropriate for the following research question:

"Is political party affiliation (Democratic, Republican) associated with presidential voting choice (vote for either the Democratic, Republican, or other candidate)?"

or

"Is race related to special education classification?"

Example: Chi-Square in Text

Dover and Shore (1991, p. 103) wrote:

"The gifted group responded significantly more often without probes on question 1 (c2(1) = 4.37, p < .05) and question 3 (c2(1) = 5.56, p < .05)."

In this example, the values for the chi-squares were 4.37 and 5.56. The larger the chi-square statistic, the more likely the null hypothesis will be rejected. The number in parentheses following the χ2, e.g., χ2(1), is the degrees of freedom. Note that both p-values are reported to be less than the alpha level of .05.

Margalit, Ankonina, and Avraham (1991, p. 431) wrote:

"No difference were found between groups for age or gender of the handicapped children (Kibbutz: 32 males and 11 females; City: 30 males and 18 females; c2(1, N = 91) = .99, ns)."

In this example, the chi-square statistic is equal to 0.99, not a large value. The "ns" indicates "not significant," although it would have been better to report the actual p-value obtained. The symbol "N" indicates the sample size, which is 91.

Example: Chi-Square in Table

Reynolds, Kunce, and Cope (1991), in their study of driving under the influence of alcohol and personality type, reported the following table (p. 293):

Table 10. Distribution of participants by personality type and offender group.

  Offenders
  First-Time Repeat
Personality Type n % n %
Stability-oriented extravert 68 39.5 34 53.1
Change-oriented extravert 49 28.5 4 6.3
Stability-oriented introvert 34 19.8 23 35.9
Change-oriented introvert 21 12.2 3 4.7
         
Total 172 100.0 64 100.0

Note. χ2(3, N = 236) = 19.91, p < .0001.

The chi-square value is 19.91, which is significant at the .0001 level, so the null hypothesis of no difference in distribution among these groups is rejected. Apparently personality type is associated with repeat offender status.
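A chi-square test of association can be reproduced directly from a table of frequencies. A minimal sketch, assuming SciPy is available, using the counts from Table 10 (it should yield approximately the chi-square reported in the table's note):

from scipy.stats import chi2_contingency

# Rows: personality types; columns: first-time and repeat offenders
observed = [
    [68, 34],  # Stability-oriented extravert
    [49,  4],  # Change-oriented extravert
    [34, 23],  # Stability-oriented introvert
    [21,  3],  # Change-oriented introvert
]

chi2, p, df, expected = chi2_contingency(observed)
print(f"chi-square({df}) = {chi2:.2f}, p = {p:.4f}")

# A p-value below alpha indicates that personality type and offender
# status are related (the null hypothesis of no association is rejected).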


Supplemental Reading

Bruce Thompson discusses statistical significance testing:

http://pareonline.net/getvn.asp?v=4&n=5 

James Hill provides a nice introduction to descriptive statistics:

http://www.mste.uiuc.edu/hill/dstat/dstat.html


References

Davis, N.F. (1990). The Reynolds Adolescent Depression Scale. Measurement and Evaluation in Counseling and Development, 23.

Dover, A., & Shore, B. (1991). Giftedness and flexibility on a mathematical set-breaking task. Gifted Child Quarterly, 35.

Fuchs, L.S., Fuchs, D., Karns, K., Hamlett, C.L., Dutka, S., & Katzaroff, M. (1996). The relation between student ability and the quality and effectiveness of explanations. American Educational Research Journal, 33, 631-664.

Goodenow, C. (1993). The psychological sense of school membership among adolescents: Scale development and educational correlates. Psychology in the Schools, 30.

Martin, C.L. (1990). An empirical investigation of employee behaviors and customer perceptions. Journal of Sport Management, 4.

Margalit, M., Ankonina, D., & Avraham, Y. (1991). Community support in Israeli Kibbutz and city families of children with disabilities: Family climate and parental coherence. Journal of Special Education, 24.

Quinn, G., & Griffin, B. W. (1999). Students' motivational needs and satisfaction in relation to achievement within a cooperative learning setting. Unpublished manuscript.

Reynolds, J.R., Kunce, J.T., & Cope, C.S. (1991). Personality differences of first-time and repeat offenders arrested for driving while intoxicated. Journal of Counseling Psychology, 38.

Woznica, J.G. (1990). Delay of gratification in bulimic and restricting anorexia nervosa patients. Journal of Clinical Psychology, 46.


Copyright 2000, Bryan W. Griffin

Last revised on 08 April, 2018 04:07 AM