EDUR 7130
Educational Research On-Line

Experimental Research: Control, Control Procedures, Experimental Designs, Internal and External Validity


An introduction to experimental research was presented in the discussion of quantitative research, which can be found in Quantitative Research Matrix. The information presented here provides a more in-depth view of experimental research with specific emphasis on control issues, experimental designs, and internal and external validity.

Control

In an experiment the researcher forms or selects the groups under study, manipulates the treatments (the IV) for the groups, attempts to control extraneous or confounding variables (variables outside the study that nevertheless have an effect on the DV), and observes the effects of the IV on the DV across the groups. Two key components of this experimental process are manipulation and control. Manipulation has been defined previously, and simply means that the researcher determines which groups in the experiment will receive which treatments.

For example, in an experiment, the experimental group is the group designated to receive the new or novel treatment, and the control group usually receives the traditional treatment. The researcher will decide which group will be experimental and which will be control, and this is manipulation.

Another important issue in experimental research is control. Control refers to the effort by the researcher to remove the influence of any extraneous, confounding variable on the DV under investigation. To help illustrate the nature of control, consider the following example experiment. We are interested in learning whether students benefit more, in terms of academic achievement, from cooperative learning than from a class that uses lecture exclusively. Table 1 shows the posttest mean scores on an achievement test for the two groups.

Table 1: Example Experimental Results

Experimental Treatments | Mean Achievement Scores
Cooperative Learning    | 85
Lecture                 | 75

From Table 1 we would be tempted to conclude that students exposed to cooperative learning scored, on average, 10 points higher than students exposed to lecture; therefore, cooperative learning does appear to benefit students academically. However, what if the additional information in Table 2 were provided? Table 2 shows that the groups were not equivalent from the outset, since their average intelligence scores were different.

Table 2: Example Experimental Results with Confounding Variable Present

Confounding Variable: Mean Intelligence Scores | Experimental Treatments | Mean Achievement Scores
115 | Cooperative Learning | 85
100 | Lecture              | 75

From Table 2 we see that the two groups differ by 15 points on a measure of intelligence, and we also see that the group exposed to cooperative learning was the group with the higher measured intelligence. Now the question is which variable, the intelligence difference or the experimental treatments, caused the 10 point difference in achievement. One could argue that either the difference in intelligence or the difference in the experimental treatment caused the difference in the achievement scores. As you can see, the effects of the experimental treatments are confounded--confused--with the intelligence variable, and this is why intelligence in this example is called a confounding variable--it confuses the interpretation of the results.

The issue of control hinges on the ability of a researcher to eliminate confounding variables. With control, one wishes to control for, or eliminate, the effects of confounding variables from the experiment. By eliminating the effects of confounding variables, the researcher can then make a definitive statement about the effects of the treatment upon the dependent variable. As the example in Table 2 shows, since intelligence is a confounding variable, it is difficult to determine what effect the experimental treatment had on the dependent variable (achievement). If intelligence could be eliminated as a confounding variable, then the researcher could more readily determine the effects of the experimental treatments.

So, in short, control refers to the ability to eliminate from an experiment the influences upon the dependent variable of confounding variables.

Control Procedures

How does one eliminate the effects of confounding variables from an experiment? Before discussing the logic of several control procedures, consider the following experimental study and the confounding variable sex.

Table 3: Example Experimental Results with Confounding Variable Sex

Confounding Variable: Sex of Students in Groups | Experimental Treatments | Mean Achievement Scores
All Male   | Cooperative Learning | 85
All Female | Lecture              | 75

From Table 3 we see that the two groups differ by 10 points on a measure of achievement. However, the two groups are also very different in terms of sexual composition: the cooperative learning group is all male, and the lecture group is all female. It is therefore possible that the differences observed in achievement could somehow stem from differences in sexual composition. One way to eliminate the confounding variable of sex is to redistribute students between the two groups as follows:

Table 4: Example Experimental Results with Non-Confounding Variable Sex

Non-Confounding Variable: Sex of Students in Groups | Experimental Treatments | Mean Achievement Scores
50% male, 50% female | Cooperative Learning | 85
50% male, 50% female | Lecture              | 75

As Table 4 shows, sex is no longer a confounding variable since both groups are the same, or equal, in terms of sex. To make this even clearer, consider Table 5.

Table 5: Example Experimental Results with Sex of Students Not Confounding Results

Non-Confounding Variable: Sex of Students in Groups | Experimental Treatments | Mean Achievement Scores
All Female | Cooperative Learning | 85
All Female | Lecture              | 75

As Table 5 shows, both groups were composed exclusively of females. Now it is clear that both groups were exactly equal in terms of the sex of student participants. As a result, it is impossible to say that sex of students caused the 10 point difference in achievement, since sex does not vary between the two groups in any way. In other words, sex cannot be a confounding variable here because, as displayed in Table 5, sex is not a variable at all. With control issues, one always wishes to eliminate confounding variables so it becomes easier to identify the reasons groups differ.

Often it is not possible to limit an experiment to only one category of a variable, as was done with sex above. A more practical method is to show that the groups are as equal, or equivalent, as possible on potential confounding variables. As Table 4 shows, sex is as equal as possible for the two groups, with half the students in each group male and half female. When the groups are balanced in this manner, one can say the variable is controlled as a confounding variable, since the groups show similar levels of the variable, on average. So, on average, sex is the same for the two groups in Table 4. How does one make the groups equal, on average?

1. Randomly Formed Groups

This is the process of randomly assigning experimental participants to groups in the experiment. Randomization is based upon the premise that, since subjects were randomly assigned to groups, the chances are that the groups will not differ significantly on any major characteristics. Since group formation is based upon random assignment, the chances are that major differences between subjects will average out across the groups. For example, differences between subjects on IQ will average out between the groups, so the groups will have, on average, similar IQs.

Note that random assignment of people to groups is NOT the same as random sampling (selection). One method is used to select people for a study (random sampling) and the other is used to assign individuals to groups (random assignment). Just because random sampling was used does NOT mean randomly formed groups were used. However, it is possible that through the process of random selection randomly formed groups emerged in the study. One must read the study characteristics carefully to learn whether groups were randomly formed.

Most researchers simply assume that random assignment works to equate groups, but random assignment is not a guarantee that groups will be similar. Random assignment may be performed in two ways. First, the researcher randomly assigns individuals to groups. Second, the researcher randomly samples and as subjects are randomly selected, they are placed in groups. Note, however, that random sampling is not a requirement for true experimental research—only random assignment is needed.

Table 6: Example Experimental Results with Randomly Formed Groups and Intelligence

Mean Scores on Intelligence | Experimental Treatments | Mean Achievement Scores
102.5 | Cooperative Learning | 85
101.8 | Lecture              | 75

As Table 6 shows, the two groups have very slight differences on intelligence. This is often the case with randomly formed groups--variables one wishes to control for may be slightly different, but not different enough to really matter when one interprets the results. In the example given in Table 6, it looks like random assignment of people to groups worked to make the groups relatively equal in terms of intelligence. If, however, we saw means for intelligence of, say, 112.3 and 101.4, then we could be sure that random assignment did not work well.

Usually with random assignment, researchers will not know whether random assignment actually worked, since there will be no measures taken for all the possible confounding variables like intelligence, motivation, self-efficacy, study habits, etc. Often researchers simply rely upon and assume that random assignment works. And in most cases random assignment does work, especially with large groups of people (30+ per group).
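To make the logic concrete, here is a minimal Python sketch of random assignment (the subjects and IQ scores are simulated purely for illustration, not taken from any real study):

import random
import statistics

random.seed(42)  # fixed seed so the illustration is reproducible

# Simulate 60 subjects, each with a hypothetical IQ score
subjects = [{"id": i, "iq": random.gauss(100, 15)} for i in range(60)]

# Randomly assign half to cooperative learning and half to lecture
random.shuffle(subjects)
cooperative = subjects[:30]
lecture = subjects[30:]

# With groups of this size, the means will usually be close, though not identical
print(statistics.mean(s["iq"] for s in cooperative))
print(statistics.mean(s["iq"] for s in lecture))

Running a sketch like this repeatedly shows the point made above: with 30 or more subjects per group the group means on IQ are usually close, but random assignment carries no guarantee for any single experiment.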

2. Using Subjects as Their Own Control

In this control procedure all study participants (or groups of people) are exposed to each and every treatment, one at a time. For example, assume that we have three experimental treatments--three types of instruction in mathematics--labeled A, B, and C. As Table 7 shows, each participant was exposed to each of the three treatments. However, the order of the treatments varied, and after each treatment a test was given to determine achievement.

Table 7: Subjects-as-Their-Own-Control

Subject | Time Period 1   | Time Period 2   | Time Period 3
John    | Treatment A: 80 | Treatment B: 84 | Treatment C: 86
Joe     | Treatment B: 70 | Treatment C: 74 | Treatment A: 87
Jill    | Treatment C: 78 | Treatment B: 80 | Treatment A: 90
Jerry   | Treatment B: 75 | Treatment A: 90 | Treatment C: 92

How does subjects-as-their-own-control (SATOC) work to eliminate confounding effects? SATOC does not control for all types of confounding variables, only those that are stable within an individual over time. For example, we will assume that intelligence does not change greatly over a three-week period. As we can see, John scored an 80 after treatment A, an 84 after B, and an 86 after C. The growth from A to C was only 6 points. This suggests that treatments B and C offer little over treatment A. How do we know that John's intelligence cannot be a confounding variable in these scores? Since all treatments were administered to John, and since John's intelligence is likely to remain unchanged from treatment A to treatment C, it is not possible that intelligence could account for the 6 point difference between treatments A and C.

So, SATOC works in such a way that certain characteristics of each person are held constant, since those characteristics do not change over the course of the experimental treatments. There are some characteristics, however, that do change, such as self-esteem, self-efficacy, and test anxiety, and SATOC cannot be used to control for those.
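Because each person receives every treatment, the order of treatments matters; one common safeguard is to counterbalance the order across participants, as the varied orders in Table 7 suggest. Below is a minimal Python sketch of one simple rotation scheme (the participant names come from Table 7; the rotation logic is an illustrative assumption, not the only way to counterbalance):

from itertools import cycle, islice

treatments = ["A", "B", "C"]
participants = ["John", "Joe", "Jill", "Jerry"]

# Rotate the starting treatment for each participant so that each
# treatment appears in each time period about equally often
for i, person in enumerate(participants):
    order = list(islice(cycle(treatments), i % 3, i % 3 + len(treatments)))
    print(person, "receives treatments in order:", order)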

3. Matching and Subgroup Matching

Matching is a procedure that is sometimes used to equate groups. Literally, individuals from two or more groups are matched according to common characteristics, like race, sex, motivation level, intelligence, etc. For example, suppose we have the two groups listed below, and we wish to match on sex to equate the groups on sexual composition.

Table 8: Two Groups with Matching

Group 1       | Group 2
Beth, female  | Bertha, female
Bill, male    | Betty, female
Bryan, male   | Ben, male
Bob, male     | Brent, male

We can match Beth (female, Group 1) to Bertha (female, Group 2), Bill (male, Group 1) to Ben (male, Group 2), and Bryan (male, Group 1) to Brent (male, Group 2). Note, however, that Bob (Group 1) and Betty (Group 2) have no matches, so they are dropped from the study. This is one of the problems with matching--it becomes difficult to find appropriate matches with others in different groups, especially when more than one variable is required to build a match.

Subgroup matching works similarly to matching, except that individuals within each group are sub-divided first into smaller groups based on common characteristics, like males and females; Blacks, Hispanics, and Whites; etc. Then subgroups are matched across the groups. As with matching, the drawback to this approach is that it becomes difficult to match once more than one or two variables are used for matching purposes.

So matching works well when one wishes to equate groups on one or two variables, but becomes unwieldy once more variables are used for matching.
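As a concrete illustration of the matching logic in Table 8, here is a minimal Python sketch that pairs members of the two groups on sex and drops anyone without a match (the pairing rule is a simple greedy illustration, not a standard routine from any package):

group1 = [("Beth", "female"), ("Bill", "male"), ("Bryan", "male"), ("Bob", "male")]
group2 = [("Bertha", "female"), ("Betty", "female"), ("Ben", "male"), ("Brent", "male")]

pairs = []
available = list(group2)
for name1, sex1 in group1:
    # Find the first not-yet-matched person in Group 2 with the same sex
    match = next(((n2, s2) for n2, s2 in available if s2 == sex1), None)
    if match is not None:
        pairs.append((name1, match[0]))
        available.remove(match)

print(pairs)      # [('Beth', 'Bertha'), ('Bill', 'Ben'), ('Bryan', 'Brent')]
print(available)  # [('Betty', 'female')] -- Betty (and Bob) have no match and are dropped

Note how quickly the retained sample shrinks; with two or more matching variables, the same logic would discard even more participants, which is the drawback described above.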

4. Analysis of Covariance (ANCOVA)

The last procedure commonly used in experimental studies to equate groups is ANCOVA. This statistical procedure for equating groups was discussed earlier in Descriptive and Inferential Statistics in the section entitled "(4) Analysis of Covariance (ANCOVA)."

Briefly reviewed, ANCOVA is a statistical procedure that matches groups by equating the groups on confounding variables or covariates. ANCOVA is a method for analyzing differences between experimental and control groups on the dependent variable after taking into account any initial differences between the groups on pretest measures or on any other measures of relevant independent, confounding variables (called covariates). ANCOVA will adjust the dependent variable mean scores according to differences that exist on the covariates. These adjusted means are then compared among the groups to determine whether statistically significant differences exist among the groups on the dependent variable.
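Here is a minimal sketch of how an ANCOVA might be run in Python with the statsmodels formula API (the data frame, variable names, and scores are hypothetical, invented only to show the general form):

import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data: posttest achievement, treatment group, and an IQ covariate
df = pd.DataFrame({
    "achievement": [85, 88, 82, 75, 78, 72],
    "group":       ["coop", "coop", "coop", "lecture", "lecture", "lecture"],
    "iq":          [115, 118, 112, 100, 103, 97],
})

# ANCOVA form: test the group effect after adjusting for the covariate (iq)
model = smf.ols("achievement ~ group + iq", data=df).fit()
print(model.summary())

The group coefficient in the fitted model estimates the difference between the adjusted group means, i.e., the treatment difference after the covariate has been taken into account.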

Experimental Designs

Experimental designs are formats that experiments follow in order to allow one to judge whether a treatment was effective relative to a control or comparison group. In experimental research, a number of research designs have been devised. The different designs have strengths and weaknesses in terms of their control of threats to experimental validity. By experimental validity, I refer to internal validity and external validity, which are discussed in detail below.

I will not offer detailed presentations of the experimental designs here. Your assigned readings contain relevant discussion of the major designs, and the supplemental reading provided below also examines various designs in detail. Read each to learn more about experimental designs.

To understand the designs, one must understand the symbols used. Table 9 presents each symbol used and its meaning.

Table 9: Symbols in Experimental Designs

Symbol | Meaning
O      | Observation, which usually means the administration of a test or instrument of some type to obtain a measure on some important variable like achievement, intelligence, motivation, etc. Sometimes this symbol is subscripted with numbers like 1, 2, 3, etc., which signify repeated observations--repeated measurement of some variable, like achievement.
X      | X represents the various treatments and controls administered to groups in the experiment.
R      | R indicates that the groups were randomly formed.
Non R  | Non R indicates that the groups were not randomly formed and are therefore intact groups.

1. Pre-experimental Designs

These designs are poor and are not recommended. Note from the discussion of these designs in the supplemental reading and from your text that pre-experimental designs control for few sources of internal invalidity. Generalizations and conclusions drawn from such studies are suspect and should not be trusted.

(a) One-shot case study:

X O

Only one group is used, and only a posttest is used; there is no pretest, thus there is no method of comparison available. One cannot compare gains from an initial starting point due to lack of pretest, and one cannot compare growth relative to a control group since there is no control group.

(b) One-group pretest-posttest design

O X O

Only one group is used, but a pretest and a posttest are used. The pretest allows for comparisons over time relative to a starting point, but this design does not allow for comparisons relative to a control group, so one cannot be sure whether gains from pretest to posttest were caused by the treatment or simply by maturation or some similar effect.

(c) Static-group comparison

    X1 O

    X2 O

Two groups are used, a control and a treatment, but since there is no pretest and no random assignment, one cannot be sure that the groups were equivalent from the outset.

2. True Experimental Designs

These designs represent the strongest research designs possible for assessing the existence of causal relationships. These designs are also the best designs possible for controlling sources of internal invalidity.

All true experimental designs have randomly formed groups, that is, random assignment of subjects to groups (but not necessarily random selection or sampling), and none of the other designs have this characteristic. Note also that all true experimental designs have control groups, which are very important for comparison purposes.

(a) Pretest-Posttest Control Group

R O X1 O

R O X2 O

This design has a pretest, a posttest, and a control group. It is a very strong design; its only weakness is the possible pretest-treatment interaction (discussed below), which is likely to limit generalizability somewhat. It is important that the instruments used in this design have adequate test-retest or equivalent-forms reliability.

(b) Posttest only Control Group

R     X1 O

R     X2 O

This design is the same as the pretest-posttest control group design, except that it lacks the pretest. Since no pretest is present, one must rely upon random assignment to ensure that the groups are equivalent at the outset.

(c) Solomon Four-Group

R O X1 O

R O X2 O

R     X1 O

R     X2 O

This is the best of the three true experimental designs discussed. It has four groups: two with pretests, two without pretests, two control groups, and two treatment groups. This design controls for nearly all sources of experimental invalidity, and it allows the researcher to determine whether the pretest has an effect on the posttest independent of the treatment.
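One common way to analyze the Solomon four-group design is a 2 x 2 factorial analysis of the posttest scores, with pretesting (yes/no) and treatment as the two factors; a significant pretesting-by-treatment interaction signals the pretest-treatment interaction this design is built to detect. A minimal Python sketch with hypothetical scores (only two per group to keep the example short; a real study would need far more):

import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical posttest scores for the four Solomon groups
df = pd.DataFrame({
    "posttest":  [85, 87, 75, 76, 84, 86, 74, 75],
    "pretested": ["yes", "yes", "yes", "yes", "no", "no", "no", "no"],
    "treatment": ["X1", "X1", "X2", "X2", "X1", "X1", "X2", "X2"],
})

# 2 x 2 ANOVA; the pretested:treatment term tests the pretest-treatment interaction
model = smf.ols("posttest ~ pretested * treatment", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))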

3. Quasi-experimental Designs

The primary difference between quasi- and true experimental designs is the lack of random assignment of subjects to groups. Other than the lack of random assignment, most other characteristics are the same, and the designs listed above can be used with quasi-experiments. For example, in quasi-experimental research the pretest-posttest control group design is renamed the nonequivalent control group design, since the groups are intact, not randomly formed, from the outset. Other designs in quasi-experimental studies include time series designs and counterbalanced designs, which are discussed below and in the text.

Nonequivalent Control Group Design

Non R O X1 O

Non R O X2 O

Experimental Validity: Internal Validity and External Validity

If any uncontrolled extraneous variable affects the outcome (performance on the DV) of the experiment, the validity of the experiment is in question. A completely valid experiment is one in which the result obtained is due solely to differences in the experimental (manipulated) IV, and not due to any extraneous, confounding, uncontrolled variables. This type of validity is referred to as internal validity. To be valid, the results must also be generalizable to settings, situations, and populations outside the experimental conditions. This type of validity is referred to as external validity. Both of these constructs are discussed in more detail below.

Internal Validity

For an experiment to be internally valid, one must be sure that the observed differences or variation on the DV are the direct result of manipulation of the experimental IV, and not due to some other variable uncontrolled in the study. Stated differently, internal validity is the degree to which observed differences or variation on the dependent variable are directly related to the independent variable and not to some other (uncontrolled) variable. Control is the key to internal validity. Without control of extraneous, confounding variables, one would not have internal validity.

To help you better understand internal validity, below are three quotations from educational research texts regarding internal validity and control.

1. The internal validity of an experiment is the extent to which extraneous variables have been controlled by the researcher. If extraneous variables are not controlled in the experiment, we cannot know whether observed changes in the experimental group are due to the experimental treatment or to an extraneous variable (Borg, W.R., & Gall, M.D. [1983]. Educational research: An introduction. NY: Longman, p. 634).

2. When a study has internal validity, it means that any relationship observed between two or more variables should be meaningful in its own right, rather than being due to ‘something else.’ [Where something else refers to confounding variables.] Stated differently, internal validity means that observed differences on the dependent variable are directly related to the independent variable, and not due to some other unintended variable (Fraenkel & Wallen [1993]. How to design and evaluate research. p. 222).

3. The central question of internal validity is whether the independent variable is responsible for changes (or variation) in the dependent variable. Researchers must be sure that changes (or variation) in the dependent variable did not happen because of other variables that were not controlled in the experiment. For a research design to be internally valid, it must control factors that might account for changes (or variation) in the dependent variable. This control enables researchers to assume that the independent variable has caused any observed differences (Sowell, E., & Casey, R. [1982]. Research methods in education. Belmont, CA: Wadsworth, p. 88).

In summary, internal validity refers to the ability of the researcher to say that the experimental variable, the treatment, caused the differences observed in the dependent variable and not some other variable. Recall from Table 3 (reproduced below) that a 10 point difference in achievement exists between the cooperative learning group and the lecture group. This experiment, however, lacks internal validity since the differences observed in the dependent variable (achievement) can be attributed to either the experimental manipulation or to the difference in sexual composition of the two classes. To be internally valid, one would have to eliminate the effects of sex from this study by making the groups equal in sexual composition.

Table 3: Example Experimental Results with Confounding Variable Sex

Confounding Variable: Sex of Students in Groups | Experimental Treatments | Mean Achievement Scores
All Male   | Cooperative Learning | 85
All Female | Lecture              | 75

 

External Validity

The more the results obtained in an experiment generalize to other groups or situations outside of the experimental setting, the more externally valid the experimental results. External validity relates to the generalizability of experimental findings to the ‘real’ world. Even if the results of an experiment are an accurate gauge of what happened during that experiment, do they really tell us anything about life in the wilds of society? Thus, external validity focuses on the generalizability of findings from the sample to the target population, as well as to other populations. Stated differently, external validity is the degree to which results are generalizable, or applicable, to groups and environments outside the research setting. For example, if we performed an experiment on cooperative learning and lecture-based instruction among fourth grade students in Statesboro, what are the chances that our results would generalize to fourth grade students in New York, California, or Florida? The better our results hold for students elsewhere, the greater the external validity of our study.

High internal validity implies that the researcher has tight control on the experiment, and that the setting of the experiment may even take on an artificial flavor due to the manipulated and controlled environment. Under such conditions, it is often difficult to generalize back to real life situations. In research with human subjects, both internal and external validity must be well balanced. Note that, generally, the more control a researcher has in the experiment (the more internal validity), the less likely the results will generalize to outside situations (low external validity). One cannot, however, downplay the importance of internal validity (i.e., knowing precisely what caused the observed effects).

Threats to Internal Validity

A threat to the internal validity of an experiment refers to anything that can occur during an experiment that makes it difficult or impossible for the researcher to say that the experimental IV, and not something else, caused the changes on the DV. There is a list of common or standard threats. I will provide a brief description of each, but please read the supplemental material and the text. Here is a link that discusses internal validity and threats to internal validity:

http://www.socialresearchmethods.net/kb/intval.htm 

Threats to internal validity include:

(a) History: The occurrence of events that are not part of the experiment, or that occur outside of the experiment, but which nevertheless impact the DV of interest. For example, in a study of third grade students' attention span, suppose a fire drill occurs during the experiment. This is something the researcher cannot control, but it will surely affect the outcome of the study. This is called a historical threat. History does not refer to past events but rather to things that occur during the experiment.

(b) Maturation: Changes in subjects over time (mental and physical changes) that are not part of the experiment. This is a threat since these changes may have some effect on the variation of the DV. For example, in a study of a special instructional strategy to increase reading vocabulary from first to sixth grade, we would expect reading vocabulary to increase from first to sixth grade naturally, so it will be difficult to determine whether the treatment or simple maturation (growth) caused the change in vocabulary. Only the use of a control group would help to disentangle these effects. Maturation can also be a short-term problem: prolonged testing over a given day could cause study participants to grow mentally tired and therefore perform poorly toward the end of the day. Such reduced performance is not due to a weaker treatment, but due to tiredness, which reflects mental and physical change over the short term. So maturation does not refer strictly to long-term changes but also to short-term changes. The key to assessing maturation effects is to use a study design that includes a control group; a control group allows the researcher to control for and assess natural maturation effects relative to the treatment group.

(c) Testing (also called pretesting): Subjects who take a pretest in an experiment may learn things from the testing experience that will improve their posttest performance on the dependent variable. If the pretest and posttest are identical or similar, subjects may demonstrate improved scores on the posttest due to having taken the pretest. Such improvement confounds the effect of the treatment.

(d) Instrumentation: The use of invalid tests or instruments (for example, two forms of the same test with different difficulty levels--i.e., the forms are not equivalent, not parallel). If the instrument is invalid, then one cannot trust scores obtained from the instrument; therefore, observed variation on the DV cannot be directly attributed to the treatment. Another form of instrumentation threat results when the calibration of an instrument changes over time, or when the way scorers assign scores changes over time.

For example, suppose one is interested in learning whether a particular counselor training activity changes the way counselors interact with students in a small group counseling situation. To assess possible change from before to after training, a group of counselors is watched and rated by one trained observer prior to exposure to the counselor training session. After the training session concludes, which occurred over a period of weeks, a second trained observer watches the counselors again and rates their interactions. If there are differences in the way the two trained raters score what they see, then differences in scores between the pre- and post-observations could result, and these differences may not be the result of the counselor training activity but instead could be due simply to differences in the way the observers score the counselors.

(e) Statistical Regression: The possibility that results of the experiment are due to a tendency for groups, selected on the basis of extreme scores, to regress toward a more average score on subsequent measurements, regardless of the effects of the experimental treatment. A group selected because of unusually low (or high) performance will, on average, score closer to the mean on subsequent testing, regardless of what transpires in the experiment. Essentially, statistical regression could result anytime extreme groups are selected for the experiment, like the top 10% of scorers on a standardized test.
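Regression toward the mean is easy to demonstrate with a small simulation (the numbers below are purely illustrative): give everyone two noisy tests, select the top scorers on the first test, and watch their average fall on the second test with no treatment at all.

import random
import statistics

random.seed(1)  # fixed seed so the illustration is reproducible

# Each person has a stable true ability; each test adds independent random error
true_ability = [random.gauss(100, 10) for _ in range(1000)]
test1 = [t + random.gauss(0, 10) for t in true_ability]
test2 = [t + random.gauss(0, 10) for t in true_ability]

# Select the top 10% of scorers on the first test
cutoff = sorted(test1)[-100]
top = [i for i, score in enumerate(test1) if score >= cutoff]

# The retest mean of this extreme group falls back toward the overall mean
print(statistics.mean(test1[i] for i in top))  # well above 100
print(statistics.mean(test2[i] for i in top))  # noticeably lower, closer to 100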

(f) Differential Selection of Subjects: When intact or already formed groups are used, they may differ on some important characteristic, or the groups may have been selected or formed differently. The threat here is that this initial difference will result in differences on the posttest or outcome, which will confuse the study. In short, anytime already formed groups (groups that are not randomly formed by the researcher) are used, we must expect the groups to differ on important characteristics; likewise, if people are selected in a different manner for different groups, then the groups are likely to differ.

(g) Mortality: Subjects simply drop out of the study. This is a threat since those who drop out may differ in important ways from those who remain, so one will not know whether the remaining subjects still represent the original groups.

Threats to External Validity

A threat to external validity is anything that could limit the researcher's ability to apply the results of a study to other people or settings. There is a list of common or standard threats. I will provide a brief description of each, but please read the supplemental material and the text.

Threats to external validity include:

(a) Pretest-Treatment Interaction: Pretesting may sensitize the experimental subjects to the experimental factor, so the results obtained can be generalized only to other pretested groups. In some studies a pretest may interact with the treatment and affect the results of the dependent measure; thus, one cannot generalize the results to studies in which no pretest was administered. In short, the pretest may work in conjunction with the treatment to make the treatment appear better (or worse) than it would be if the pretest were not given, and this limits generalizability.

(b) Multiple-Treatment Interaction/Interference: Subjects receive more than one treatment and the researcher has no idea which treatment caused the results.

(c) Selection-Treatment Interaction: Generalizability limited due to the interaction between the subjects (selection) and treatment, i.e., the effect of the treatment is unique to the subjects and cannot be generalized to other subjects. The particular characteristics of an experimental group—its intellectual level, its academic orientation, and other features of its particular background—may make it more (or less) responsive than average to the experimental treatment. For example, research shows that more intelligent children tend to benefit less from cooperative learning than do less intelligent children. This illustrates selection-treatment interaction--the treatment (cooperative learning) works differently with different groups.

(d) Specificity of Variables: The more specialized or specific the experiment in terms of the subjects, measuring instruments, outcomes, etc., the less likely it will be generalizable. This simply means that an experiment that is narrowly defined (e.g., 4th grade students at Bulloch elementary school, using only a locally developed achievement test) will be less likely to generalize than a study that incorporates more broadly defined aspects (e.g., 4th, 5th, and 6th grade students from four states, using a variety of achievement tests--locally developed tests and several nationally standardized tests). The broader the characteristics of the things used in the study, the better the generalizability to other settings or people.

(e) Experimenter Effects: Presence or actions of the experimenter or researcher may influence subjects’ reactions or behaviors. The behavior of the subjects may be unintentionally influenced by certain characteristics or behaviors of the experimenter. The expectations of the experimenter may also bias the administration of the treatment and the observation of the subjects’ behavior. In short, the researcher could bias the study in some way.

(f) Reactive Arrangements: Subjects may react according to their knowledge of the experiment. If subjects know they are in an experiment, then they may act differently than had they not known they were part of the experiment (i.e., the Hawthorne effect). The very fact that a person is selected to participate in an experiment often motivates him or her to greater effort, so the results are not applicable to other people exposed to the same experimental factor in a non-experimental setting.


Supplemental Reading #1

Brenda R. Motheral has written an article that covers the basics of quantitative research. In that article, she covers internal and external validity and provides examples. Please have a read of it at:

http://www.jmcp.org/doi/pdf/10.18553/jmcp.1998.4.4.382

also copied here in case the above link fails: http://www.bwgriffin.com/gsu/courses/edur7130/readings/Motheral1998jmcp.pdf

 (Copyright 1998, Academy of Managed Care Pharmacy, Inc. All rights reserved.)

Supplemental Reading #2

 

A Primer on Experimental and Quasi-Experimental Design

Thomas E. Dawson

Texas A&M University, January 1997

Abstract

Counseling psychology is a burgeoning field still in its inchoate stages, attempting to gain/maintain autonomy and respect. As students of a scientist-practitioner counseling psychology program, it behooves us to conduct well thought-out, meaningful research in the name of practicing "good science," as it benefits all counseling psychologists in the name of furthering the field's namesake. Unfortunately, many times the tendency to embark on a research endeavor lacks the necessary foresight in constructing the design. Research designs are pervious to many different types of threats to their internal and external validity. In the traditions of Campbell and Stanley, and Cook and Campbell, this paper will elucidate some of the more common types of research designs, along with the coexistent threats to validity. Further, an example of each type of design has been given from the counseling literature for the interested reader to peruse and help make the concepts concrete.

Paper presented at the annual meeting of the Southwest Educational Research Association, Austin, January, 1997.


A Primer on Experimental and Quasi-Experimental Design

Since its inception as an experimental science, psychology has utilized the scientific method found in the physical sciences. In 1879 Wilhelm Wundt opened the first psychological laboratory in Leipzig, Germany, and with that commencement also came the first arguments about the validity of Wundt's experiments (Benjamin, 1988). Since that time, the scientific method has been applied to various psychological constructs, e.g., behaviorism, learning theory, Gestalt psychology, animal experimentation, cognition, and functionalism (Gannon, 1991).

Counseling Psychology has experienced many "growing pains" in its attempts at being recognized as a separate, viable science (Hiebert, Simpson, & Uhlemann, 1992; see also Wooley, Duffy, & Dawson, 1996, for a preliminary study to support counseling psychology's attempts at autonomy), a recognition which has been linked previously by some to the utilization of the scientific method (Hill & Corbett, 1993; Schultz, 1972). From the inauguration of counseling psychology in 1946 (Whitley, 1984), it was mainly an applied psychology. At that juncture in time, the application of psychology was only beginning to gain respect from the intact group of psychologists who considered themselves "pure"--that is, they engaged in experimental psychology (Benjamin, 1988). In light of this zeitgeist and the identity struggles within the division, it stands to reason that counseling psychology places scientific inquiry, through the rigor of the scientific method, as a core function.

In the past 20 years, there has been growing dissension in the ranks of counseling psychology researchers regarding the way in which research focusing on the philosophy of science and counseling psychology is being conducted. Many believe that researchers are placing too much emphasis on objectives and questions to which research should be directed, with little attention to actual research designs and methods (Ford, 1984; Goldman, 1976; Howard, 1982; Parker, 1993; Serlin, 1987; Serlin & Lapsley, 1985). Others have directly stated that more attention should be placed on the training aspects of research methodology (Gelso, 1979a, 1979b; Goldfried, 1984; Magoon & Holland, 1984), (note though that Birk and Brooks, 1986, report that 81% of 300 counseling psychologists surveyed reported adequate training in research). Indeed, the results of a national survey of counseling psychologists indicate that 49.6% of their time is devoted to the activity of research (Watkins, Lopez, Campbell, & Himmell, 1986), thus further supporting the relevance of the present paper to counseling psychologists.

If this is the case, how is it that so many counseling psychologists and counseling psychology students are producing "bad science"? For example, in a specific instance of reviewing one (psychology-type) department's recent dissertations, Thompson (1994) found numerous blatant mistakes due to methodological errors, and many others have challenged the current state of research (e.g., O'Hear & MacDonald, 1995). These errors are most likely representative and indicative of the more common mistakes found in research presently.

With this backdrop and apparent need for remediation, the present paper presents a "primer on experimental designs," with the specific goal of (review), and the more comprehensive intention that through "better and improved science," counseling psychology will continue to solidify its place as an indisputably separate and viable field. In the traditions of Campbell and Stanley (1963) and Cook and Campbell (1979), a review of experimental and quasi-experimental designs and how threats to validity impact counseling research will be presented, employing examples from the current counseling literature.

The Validity of Experimental Designs

Internal Validity

Internal validity is one important type of research validity. The term "internal validity" refers to the extent to which extraneous variables (error variance) in an experiment are accounted for. It is paramount to the researcher that model specification error variance (as distinct from measurement and sampling error variance) is controlled, because if it is not, the researcher cannot emphatically conclude that the observed outcome is due to the independent variable(s) (Parker, 1993). Campbell and Stanley (1963) stated that "internal validity is the basic minimum without which any experiment is uninterpretable" (p. 5). There are eight major threats to internal validity: (a) history, encompassing the environmental events occurring between the first and second observations in addition to the independent variable(s); (b) maturation, which refers to the processes within the participants (psychological and/or biological) taking place as a function of the passage of time, not attributable to the independent variable(s); (c) testing, which is sensitization to the posttest as a result of having completed the pretest; (d) instrumentation, which refers to deterioration or changes in the accuracy of instruments, devices, or observers used to measure the dependent (outcome) variable; (e) statistical regression, which operates when groups are selected on the basis of their extreme scores, because these anomalous scores tend to regress toward the mean on repeated testing; (f) selection, which refers to the factors involved in placing certain participants in certain groups (e.g., treatment versus control) based on preferences; (g) mortality, which refers to the loss of participants and their data due to various reasons, e.g., death or sickness; and (h) interactions of the previous threats with selection. For example, a selection-maturation interaction results when the experimental groups are maturing at different rates based on the selection of the participants (Campbell & Stanley, 1963). In later writings, Cook and Campbell (1979) identify an additional threat to internal validity: ambiguity about the direction of causal influence, in which all other plausible third-variable explanations have been ruled out of the A-B relationship, but it remains unclear whether A causes B or B causes A.

External Validity

This construct asks the question of generalizability. Which populations, settings, treatment variables and measurement variables can these results be generalized to? Generalizing across persons requires research samples to be representative of the population of interest. Generalizing across times and settings usually necessitates systematically administering the experimental procedure at different times and different settings (Parker, 1993). The inability to obtain samples that are representative of the populations from which they came, especially if studied in various settings, and at different times, results in the inability to generalize beyond the persons, time, and setting of the original study. Tests that do meet the representativeness criteria are, in essence, tests of statistical interaction. For example, if there is an interaction between a therapeutic treatment and ethnicity, then it can not be decisively stated that the treatment holds true across different ethnicities. When effects of differing magnitude exist, the researcher must delineate when and where the effect holds, and when and where it does not (Cook & Campbell, 1979).

The statistical interaction threats to external validity outlined by Cook and Campbell (1979) are as follows: interaction of selection and treatment (as in the previous example dealing with ethnicity), and interaction of setting and treatment (e.g., can a causal relationship obtained on a military installation also be obtained on a university campus?). The last interaction is between history and treatment. In this case, the question involves the period of the past or future to which the obtained results can be generalized. For example, the majority of experiments take place on university campuses, with undergraduate university students as participants. If an experiment were conducted on the day after a football loss to the university's arch rival, then the results may not generalize even to a week after the loss, much less beyond the participants and setting represented in the original study.

Parker (1993) reviewed and synthesized the Campbell and Stanley (1963) and the Cook and Campbell (1979) work and explicated two additional threats to external validity: the interaction of treatments with treatments, which refers to multiple treatments administered to the same participants, e.g., time-series designs wherein the effects may be cumulative; and the interaction of testing with treatment, not to be confused with the internal validity threat of testing, in which the pretest sensitizes the participant to the posttest. In the external validity case, the pretest may increase or decrease the participants' responsiveness or sensitivity to the treatment.

The above description of the most common threats to internal and external validity lays the groundwork for the planning of research projects. With these potential pitfalls in mind, the researcher is now ready to begin to plan which treatment design will be implemented (Lysynchuk, Pressley, d'Ailly, Smith, & Cake, 1989). The following explanation of the different types of treatment designs and the inherent threats to their validity will use an "X" to represent the exposure of a group to an experimental treatment or event. An "O" will signify some type of observation or measurement. The Xs and Os in the same row will refer to the same group, and the order of the characters from left to right will designate the temporal order of the events. "R" will exemplify random assignment, if necessary (Campbell & Stanley, 1963).

Three Pre-Experimental Designs

In a review of the designs of the process and outcome studies published in the Journal of Counseling Psychology (JCP) between the years 1964 and 1968, Kelley, Smits, Leventhal, and Rhodes (1970) found that 54% of the studies utilized a preexperimental design. Preexperimental designs are those in which there is no control group and/or the comparison groups are formed nonrandomly, therefore yielding results which are difficult to interpret (Huck & Cormier, 1996). The three preexperimental designs presented by Campbell and Stanley (1963) are the one-shot case study, the one-group pretest-posttest design, and the static group comparison. We will examine these designs in the order given.

The one-shot case study

Much past research applied a design in which a single group was studied only once after a treatment was applied. These studies are diagrammed as follows:

X O

According to Kelley, Smits, Leventhal, and Rhodes (1970), the preponderance of designs they reviewed in the JCP were one-shot case studies (31%). Campbell and Stanley (1963) refer to these studies as having "...such a total absence of control as to be of almost no scientific value" (p. 5). They go on to state that "securing scientific evidence involves making at least one comparison" (p. 6) and that "It seems well-nigh unethical at least at the present time to allow, as theses or dissertations in education, case studies of this nature" (p. 7). As these studies are practically unused today, we will examine the threats inherent in the one-shot case design below, when they are associated with other more commonly used designs.

The one-group pretest-posttest design

This design is judged to be better than the first design (Campbell & Stanley, 1963) and is a catalyst for understanding how many of the extraneous variables that threaten internal validity play out. The one-group pretest-posttest design can be reviewed by referencing Jemmett and Jemmett (1992), and is diagrammed as follows:

O1 X O2

In this design, history is one of the uncontrolled rival hypotheses, as the changes between O1 and O2 may have been due to events that occurred in addition to the experimenter's X. The longer the time that elapses between the two observations, and the more participants for whom specific events happen collectively, the more plausible history becomes as a rival hypothesis (Campbell & Stanley, 1963).

Other rival hypotheses include the participants maturing (physically or psychologically) between the pre and posttests, or possibly the participants do better on the posttest as a result of taking the pretest (testing). Maybe the measuring instrument changed over the course of the study (instrumentation), or certain participants may have selectively dropped out of the study (mortality/selection). If the participants scored atypically on the pretest, they may have regressed toward the mean naturally (statistical regression), without any influence of X (Huck & Cormier, 1996).

The static-group comparison

The third preexperimental design is the static-group comparison (e.g., Laser, 1984). In this design, a posttest is administered to two groups, one having been administered the X, and the other not (a control group). When diagrammed, this design appears as follows:

X O

O

The basic problem with this design is the unknown status of the two groups prior to the administration of X, since the participants are not randomly assigned to the two groups. If a difference is obtained at posttest, the result may have been the influence of X. Alternatively, though, the difference could reflect an initial difference between the two groups. Since the participants either self-select themselves into either group, or two existing groups are used, this is a selection threat to internal validity. Another threat to internal validity in this design, even if the groups began as equal, is the selective drop-out of participants in one group (mortality) (Campbell & Stanley, 1963; Huck & Cormier, 1996).

Three True Experimental Designs

True experimental designs yield results that are more trustworthy than the preexperimental designs due to the fact that random assignment is utilized, therefore reducing the amount of potential threats to internal validity (Huck & Cormier, 1996).

Pretest-posttest control group design

For an example of this design in practice, see Hains and Szyjakowski (1990) or Kush and Cochran (1993). The design is diagrammed as follows:

R O1 X O2

R O3 O4

Random assignment is employed for both groups, and both are given a pretest. One group is administered the X, and the other is not. A comparison of O2 and O4 should elucidate any effect of the X. The unique strength of this design is the addition of the pretest, though there are controversies surrounding the use of pretests after random assignment (Heppner, Kivlighan, & Wampold, 1992). In this design, many of the internal threats to validity discussed so far are accounted for. The differences attributed to history (and hopefully instrumentation) between O1 and O2 would be similar to the differences between O3 and O4. The threats of maturation and testing should be equally manifested in the experimental and the control group, while statistical regression, mortality, and selection interaction threats are protected against by the random assignment of participants (Campbell & Stanley, 1963), occurring probably equally across the two groups.

Ironically, the major weakness of this design is, in fact, its major strength, but for external validity reasons. The pretest would sensitize both the control group and the experimental group to the posttest in a like manner, therefore presenting no internal threat to validity. However, generalizing the results of a treatment that included a pretest in the design to a different sample without a pretest may yield much different results (Heppner et al., 1992).

The posttest-only control group design

Here, randomization is utilized to ensure the equalization of the two groups, without a pretest, as in the Van Noord and Kagan (1976) study. The design is depicted in this way:

R X O1

R O2

Again, random assignment is employed for both groups. The X is administered to the experimental group, and the second group acts as a control. The internal validity of this design is basically solid. According to Cook and Campbell (1979), the posttest-only control group design is the prototypical experimental design, and most closely exemplifies a condition in which a causal relationship can be discerned between an independent and dependent variable.

The main weakness of this design concerns external validity, i.e., the interaction of selection and treatment (Campbell & Stanley, 1963). Because of random assignment, the selection of subjects is not supposed to present a threat to internal validity. Nonetheless, it is often unknown whether the results of the study would generalize to another population (Heppner et al., 1992). For example, there could potentially be great differences between the results of a course on speed reading taught to a graduate class, versus a speed reading course taught to a high school class. Another problem deals with the absence of a pretest employed to reduce variability in the dependent variable. Random assignment is thought by some to account for this preexisting variability. But according to Huck and Cormier (1996) random assignment is not always random because (a) many researchers have a very loose definition of what randomization is, and (b) true randomization carries with it very stringent criteria and many researchers are unaware of the necessary precision and falsely believe they have true randomization, when they do not. The suggestion given by Huck and Cormier is that researchers explain definitively just how they accomplished randomization.

The Solomon four-group design

When a pretest is desired, but there is concern over the effects of using a pretest, as in the Bateman, Sakano, and Fujita (1992) study, this design is used, notationally described as:

R O1 X O2

R O3 O4

R X O5

R O6

This design is a combination of the pretest-posttest control group design (the first two groups), and the posttest-only control group (the last two groups). The main purpose of this design is to account for potential effects of the pretest on the posttest, and lends some degree of future replicability. For example, the Solomon four-group design accounts for the problem that the pretest-posttest control group design has, by comparing O2 to O5 to account for pretest sensitization, the only difference being that O2 receives a pretest prior to treatment. With regard to generalizability, the researcher can compare O2 to O4 and O5 to O6. If treatment effects are found in both cases, the results will be considered strong, and suggest future replicability, as one replication is confirmed with the data in hand (Heppner et al., 1992). The major drawback of this design is the amount of time, energy, and resources necessary to complete the study.

Three Quasi-Experimental Designs

When a true experimental design is not available to a researcher for various reasons, e.g., in clinical settings where intact groups are already formed, when treatment can not be withheld from a group, or when no appropriate control or comparison groups are available, the researcher can use a quasi-experimental design. As in the case of the true experimental design, quasi-experiments involve the manipulation of one or more independent variables and the measurement of a dependent variable. The major difference between true and quasi-experimental designs is the random assignment of participants (Heppner et al., 1992). Therefore, the internal validity of the quasi-experimental design is higher than that of the pre-experimental design, but lower than the true experimental design (Huck & Cormier, 1996). There are three major categories of quasi-experimental design: the nonequivalent-groups designs, cohort designs, and time-series designs (Cook & Campbell, 1979). An example of each category will be given, though the reader should be aware that there are many variations of each of the following examples.

The nonequivalent-groups design

The nonequivalent-groups design is the most frequently used quasi-experimental design (Heppner et al., 1992; Huck & Cormier, 1996). This design is similar to the pretest-posttest control group experimental design considered earlier. The difference is the nonrandom assignment of subjects to their respective groups in the quasi-experimental design. The design is diagrammed as follows, and can be further perused by referencing Braaten (1989):

Non R O1 X O2

Non R O3 O4

This design is one of the most widely used designs in the social sciences because its results are often interpretable. Cook and Campbell (1979) recommend this design when nothing better is available. The nonequivalent-groups design accounts for many of the threats to internal validity, but four remain uncontrolled. The first uncontrolled threat is selection-maturation. As stated earlier, many researchers falsely believe that the administration of a pretest remedies the nonrandom assignment of participants, and they use ANCOVA to "level" the groups. As Loftin and Madison (1991) have succinctly pointed out, applying an ANCOVA does not always make groups equal. Furthermore, using a no-difference null hypothesis based on pretest scores is faulty logic: a fail-to-reject decision when testing any null hypothesis does not justify believing that the null hypothesis is true (Huck & Cormier, 1996).
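The following Python sketch illustrates the kind of covariance adjustment at issue, using simulated data and the statsmodels formula interface. It is a minimal sketch of an ANCOVA, not a remedy for nonequivalence; the group means, effect size, and variable names are arbitrary inventions.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 50

# Simulate nonequivalent groups: the treated group starts higher on the pretest.
pre_treat = rng.normal(105, 10, n)
pre_ctrl = rng.normal(95, 10, n)
post_treat = pre_treat + 5 + rng.normal(0, 5, n)  # arbitrary true effect of 5
post_ctrl = pre_ctrl + rng.normal(0, 5, n)

df = pd.DataFrame({
    "pre": np.concatenate([pre_treat, pre_ctrl]),
    "post": np.concatenate([post_treat, post_ctrl]),
    "group": ["treat"] * n + ["control"] * n,
})

# ANCOVA: regress the posttest on group membership, adjusting for the pretest.
# The adjustment assumes, among other things, a reliably measured covariate and
# equal regression slopes across groups; nonrandom assignment can violate these
# assumptions, which is the point made by Loftin and Madison (1991).
model = smf.ols("post ~ pre + C(group)", data=df).fit()
print(model.params)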

Other uncontrolled threats to validity include instrumentation, differential statistical regression, and the interaction of selection and history (Cook & Campbell, 1979). These threats were described earlier and thus warrant only a mention here.

Cohort design

The second class of quasi-experimental designs is the cohort designs. Cohort designs are typically stronger than nonequivalent-groups designs because cohorts are more likely to be close to equal at the outset of the experiment (Heppner et al., 1992). An example of a cohort comparison in this context would be TAMU freshmen in 1995 versus TAMU freshmen in 1996. For an example from the counseling literature, see Hogg and Deffenbacher (1988). The basic cohort design is diagrammed as follows:

O1

X O2

In this design, O1 represents a posttest administered to one cohort, while O2 represents a posttest administered to the following cohort after it has received the treatment. Even though the posttests occur at different points in time, they occur at the same point in each cohort's progression (Cook & Campbell, 1979).
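As a hypothetical illustration, the Python sketch below compares posttest means for two successive cohorts; the scores and cohort labels are invented.

# Invented posttest scores for two successive cohorts, each measured at the
# same point in its progression; only the 1996 cohort received treatment X.
cohort_1995 = [72, 75, 70, 74, 73]  # O1
cohort_1996 = [80, 78, 82, 79, 81]  # O2

mean_1995 = sum(cohort_1995) / len(cohort_1995)
mean_1996 = sum(cohort_1996) / len(cohort_1996)

# The difference is interpretable only to the degree that the cohorts are
# "quasi-comparable" at the outset (Cook & Campbell, 1979).
print(f"Estimated treatment effect: {mean_1996 - mean_1995:.1f}")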

The most obvious problems with this design stem from the passage of time between the two cohorts and the nonrandom assignment of participants to the cohorts. Differences between the cohorts that exist before the treatment can be confounding. The specific threats to internal validity include history (the researcher has no control over what events might occur in one cohort versus the other), changes in instrumentation, testing, selection, and many interactions with selection.

The reason cohort designs are useful is the "quasi-comparability" (Cook & Campbell, 1979, p. 127) that can often be assumed between cohorts who receive a treatment and those who do not. Many times these cohort groups are more similar to each other than are experimental groups, especially with regard to certain demographics (Cook & Campbell, 1979).

Time series design

The third class of quasi-experimental designs is the time-series design. These designs are characterized by multiple observations over time (e.g., Kivlighan & Jauquet, 1990) and involve repeated observations of either the same participants, or of similar but different participants, to record differences attributable to some treatment. In the interrupted time-series design (the most basic of this class), a treatment is introduced at some point in the series of observations (Heppner et al., 1992). The design is diagrammed as follows:

O1 O2 O3 O4 X O5 O6 O7 O8

The impetus for this design is to observe, over time, any difference after the treatment is implemented and to discern whether the effect is continuous or discontinuous. A continuous effect is one that remains stable after the initial discontinuity produced by the intervention; a discontinuous effect is one that decays over time. This design also distinguishes effects that are instantaneous from those that are delayed in their manifestation (Cook & Campbell, 1979); that is, with repeated observations after the treatment is implemented, the researcher can ascertain how quickly the effects begin.
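The following Python sketch illustrates this logic with an invented eight-observation series in which the treatment follows the fourth observation; inspecting the post-treatment deviations is one informal way to see whether an effect persists or decays.

# Invented observations: O1-O4 before the treatment, O5-O8 after it.
pre = [20, 21, 19, 20]
post = [27, 26, 27, 25]

pre_mean = sum(pre) / len(pre)
post_mean = sum(post) / len(post)
print(f"Level shift at the intervention: {post_mean - pre_mean:.1f}")

# A continuous effect holds steady after the initial shift; a decaying
# (discontinuous) effect drifts back toward the pre-treatment level.
for i, y in enumerate(post, start=5):
    print(f"O{i}: {y} (deviation from pre-treatment mean: {y - pre_mean:+.1f})")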

The major threat to internal validity in this design is history, i.e., that variables other than the treatment under investigation come into play immediately after its introduction, e.g., a seasonal effect such as the weather in a study of absenteeism from work (Heppner et al., 1992). Instrumentation can also change with the passage of time, as can selection (or attrition); selection is a plausible threat when there is a differential attrition rate after introduction of the treatment (Cook & Campbell, 1979).

Summary and Conclusions

In counseling research, which is often applied, the utility of the findings is greatly reduced if the findings may be attributable to something other than the treatment under observation. We have examined the major threats to validity and their resultant effects in the context of different research designs. Every experiment is imperfect from the standpoint of the final interpretation and of the attempt to "fit" the results into a developing science (Campbell & Stanley, 1963). The preceding discussion of validity and designs was meant to guide researchers toward better designs when developing their studies and to increase awareness of the residual imperfections in a particular design, so that alternative interpretations of the results can be accounted for. Hopefully, the call for "better and improved" science has been supported here, and this reading will entice other students and researchers to contemplate long and hard before settling on a research design.

References

Bateman, T. S., Sakano, T., & Fujita, M. (1992). Roger, me, and my attitude: Film propaganda and cynicism toward corporate leadership. Journal of Applied Psychology, 77, (5), 768-771.

Benjamin, L. (Ed.). (1988). A history of psychology. New York: McGraw-Hill.

Birk, J. M., & Brooks, L. (1986). Required skills and training needs of recent counseling psychology graduates. Journal of Counseling Psychology, 33, (3), 320-325.

Braaten, L. J. (1989). The effects of person-centered group therapy. Person Centered Review, 4, (2), 183-209.

Campbell, D. T., & Stanley, J. C. (1963). Experimental and quasi-experimental designs for research. Boston: Houghton Mifflin.

Cook, T. D., & Campbell, D. T. (1979). Quasi-experimentation: Design & analysis for field settings. Chicago: Rand McNally.

Ford, D. H. (1984). Reexamining guiding assumptions: Theoretical and methodological assumptions. Journal of Counseling Psychology, 31, (5), 461-466.

Gannon, T., & Deely, J. (Eds.). (1991). Shaping psychology: How we got where we're going. In J. Deely & B. Williams (Series Eds.), Sources in semiotics (Vol. 10). Lanham, MD: University Press of America.

Gelso, C. J. (1979a). Research in counseling: Methodological and professional issues. The Counseling Psychologist, 8, (3), 7-36.

Gelso, C. J. (1979b). Research in counseling: Clarifications, elaborations, defenses, and admissions. The Counseling Psychologist, 8, (3), 61-67.

Goldfried, M. R. (1984). Training the clinician as scientist-professional. Professional Psychology: Research and Practice, 15, 477-481.

Goldman, L. A. (1976). A revolution in counseling research. Journal of Counseling Psychology, 23, (6), 543-552.

Hains, A. A., & Szyjakowski, M. (1990). A cognitive stress-reduction intervention program for adolescents. Journal of Counseling Psychology, 37, (1), 80.

Heppner, P. P., Kivlighan, D. M., & Wampold, B. E. (1992). Major research designs. In C. Verduin (Ed.), Research design in counseling (pp. 115-165). Pacific Grove, CA: Brooks/Cole.

Hiebert, B., Simpson, L., & Uhlemann, M. R. (1992). Professional identity and counselor education. Canadian Journal of Counseling, 26, 201-208.

Hill, C. E., & Corbett, M. M. (1993). A perspective on the history of process and outcome research in counseling psychology. Journal of Counseling Psychology, 40, (1), 3-24.

Hogg, J. A., & Deffenbacher, J. L. (1988). A comparison of cognitive and interpersonal-process group therapies in the treatment of depression among college students. Journal of Counseling Psychology, 35, (3), 304-310.

Howard, G. S. (1982). Improving methodology via research on research methods. Journal of Counseling Psychology, 29, (3), 318-326.

Huck, S. W. (1991). True experimental design. Journal of Experimental Education, 59, (2), 193-196.

Huck, S. W., & Cormier, W. H. (1996). Principles of research design. In C. Jennison (Ed.), Reading statistics and research (2nd ed., pp. 578-622). New York: Harper Collins.

Jemmott, L. S., & Jemmott, J. B. (1992). Increasing condom-use intentions among sexually active black adolescent women. Nursing Research, 41, (5), 273.

Kelley, J., Smits, S. J., Leventhal, R., & Rhodes, R. (1970). Critique of the designs of process and outcome research. Journal of Counseling Psychology, 17, (4), 337-341.

Kivlighan, D. M., & Jauquet, C. A. (1990). Quality of group member agendas and group session climate. Small Group Research, 21, (3), 205-219.

Kush, K., & Cochran, L. (1993). Enhancing a sense of agency through career planning. Journal of Counseling Psychology, 40, (4), 434-439.

Laser, E. D. (1984). The relationship between obesity, early recollections, and adult life-style. Individual Psychology: Journal of Adlerian Theory, Research & Practice, 40, (1), 29-35.

Loftin, L. B., & Madison, S. Q. (1991). The extreme dangers of covariance corrections. In B. Thompson (Ed.), Advances in educational research: Substantive findings, methodological developments (Vol. 1, pp. 133-147). Greenwich, CT: JAI Press.

Lysynchuk, L. M., Pressley, M., d'Ailly, H., Smith, M., & Cake, H. (1989). A methodological analysis of experimental studies of comprehension strategy instruction. Reading Research Quarterly, 24, (4), 458-472.

Magoon, T. M., & Holland, J. L. (1984). Research training and supervision. In R. W. Lent & S. D. Brown (Eds.), Handbook of counseling psychology (pp. 682-715). New York: Wiley.

O'Hear, M. F., & MacDonald, R. B. (1995). A critical review of research in developmental education: Part I. Journal of Developmental Education, 19, (2), 2-4, 6.

Parker, R. M. (1993). Threats to the validity of research. Rehabilitation Counseling Bulletin, 36, (3), 131-138.

Schultz, D. P. (1972). A history of modern psychology. San Diego: Academic Press.

Serlin, R. C. (1987). Hypothesis testing, theory building, and the philosophy of science. Journal of Counseling Psychology, 34, (4), 365-371.

Serlin, R. C., & Lapsley, D. K. (1985). Rationality in psychological research: The good-enough principle. American Psychologist, 40, 73-83.

Thompson, B. (1994, April). Common methodology mistakes in dissertations, revisited. Paper presented at the annual meeting of the American Educational Research Association, New Orleans. (ERIC Document Reproduction Service No. ED 368 771)

Van Noord, R. W., & Kagan, N. (1976). Stimulated recall and affect simulation in counseling: Client growth reexamined. Journal of Counseling Psychology, 23, (1), 28-33.

Watkins, C. E., Lopez, F. G., Campbell, V. L., & Himmell, C. D. (1986). Contemporary counseling psychology: Results of a national survey. Journal of Counseling Psychology, 33, (3), 301-309.

Whitley, J. M. (1984). A historical perspective on the development of counseling psychology as a profession. In S. D. Brown & R. W. Lent (Eds.), The handbook of counseling psychology (pp. 3-55). New York: Wiley.

Wooley, K. K., Duffy, M., & Dawson, T. E. (1996, August). Developmental themes in the clinical judgments of counseling psychologists. Poster session presented at the annual meeting of the American Psychological Association, Toronto, Canada.