What is an Extraneous Variable?

In an ideal experiment, one controls all variables except the one that is manipulated, but in reality one can directly control very few variables. In Figure 3.1, for instance, we list four variables besides gender that might account for math scores, and we could easily have expanded the list to include dozens more. To control for the possibility that parents' occupation affects math scores, we would have to compare Abigail, whose parents are doctors, with another child of doctors, not with Fred, whose parents are artists. In the extreme, we would have to compare Abigail with a 67-inch-tall boy who has three siblings, doctors for parents, and is viewed by his teacher as "an angel." And if we could find such an individual, he probably wouldn't have been born in the same town as Abigail, or delivered by the same physician, or fed the same baby food, or dropped on his head on the same dark night that Abigail was. Agreed, this example seems a bit ridiculous, but you can't prove that these factors don't account for Abigail's math scores. So if you take the definition of "extraneous variable" literally (i.e., extraneous variables are other possible causes), then the identity of the physician who delivered Abigail is an extraneous variable, and must be controlled in an experiment.

In practice, extraneous variables are not merely "possible causes"; they are "plausible causes." It is plausible to believe that a teacher's view of a student influences her math scores; it is unlikely that the identity of the physician who delivered the baby who became the student influences her math scores. Thus, we distinguish extraneous variables, which are plausible causes, from noise variables, which are possible causes whose effects we assume to be negligible.

Experiments control extraneous variables directly, but noise variables are controlled indirectly by random sampling. Suppose we are concerned that a student's math scores are affected by how many siblings, s, he or she has. We can control s directly or let random sampling do the job for us. In the first instance, treating s as an extraneous variable, we would compare math scores of girls with s = 0 (no siblings) to scores of boys with s = 0, and girls with s = 1 to boys with s = 1, and so on. Alternatively, we can treat s as a noise variable and simply compare girls' scores with boys' scores. Our sample of girls will contain some with s = 0, some with s = 1, and so on. If we obtained the samples of girls and boys by random sampling, and if s is independent of gender, then the distribution of s is apt to be the same in both samples, and the effect of s on math scores should be the same in each sample. We cannot measure this effect--it might not even exist--but we can believe it is equal in both samples. Thus random sampling controls for the effect of s, and for other noise variables, including those we haven't even thought of.
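The two ways of handling s described above can be sketched in a small simulation. This is only an illustration under stated assumptions: the distribution of s, the size of its effect on scores (10 points per sibling), the amount of other noise, and the absence of any gender effect are all invented here, and the names draw_child and gap_within are hypothetical.

```python
import random
from statistics import mean

def draw_child(rng, gender):
    """Hypothetical child. s is drawn independently of gender (an assumption)."""
    s = rng.choice([0, 1, 1, 2, 2, 3])            # number of siblings
    score = 700 - 10 * s + rng.gauss(0, 30)        # invented effect of s plus other noise
    return {"gender": gender, "s": s, "score": score}

rng = random.Random(0)
girls = [draw_child(rng, "F") for _ in range(5000)]
boys = [draw_child(rng, "M") for _ in range(5000)]

# s as a noise variable: compare whole samples and rely on random sampling
# to give both samples about the same distribution of s.
mean_s_girls = mean(c["s"] for c in girls)
mean_s_boys = mean(c["s"] for c in boys)
overall_gap = mean(c["score"] for c in girls) - mean(c["score"] for c in boys)

# s as an extraneous variable: compare girls and boys within each level of s.
def gap_within(level):
    g = [c["score"] for c in girls if c["s"] == level]
    b = [c["score"] for c in boys if c["s"] == level]
    return mean(g) - mean(b)

stratified_gaps = {level: gap_within(level) for level in range(4)}

print("mean s, girls vs. boys:", round(mean_s_girls, 2), round(mean_s_boys, 2))
print("overall score gap:", round(overall_gap, 2))
print("gaps within each s:", {k: round(v, 2) for k, v in stratified_gaps.items()})
```

Because s is generated independently of gender, the mean of s should be nearly the same in both samples, and the overall comparison should agree with the comparisons made within each level of s.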

The danger is that sampling might not be random. In particular, a noise variable might be correlated with the independent variable, in which case the effects of the two variables are confounded. Suppose gender is the independent variable and s, the number of siblings, is the noise variable. Despite protestations to the contrary, parents want to have at least one boy, so they keep having babies until they get one (this phenomenon is universal, see Beal, 1994). If a family has a girl, they are more likely to have another child than if they have a boy. Consequently, the number of children in a family is not independent of their genders. In a sample of 1000 girls, the number with no siblings is apt to be smaller than in an equally sized sample of boys. Therefore, the frequency distribution of s is not the same for girls and boys, and the effect of s on math scores is not the same in a sample of girls as it is in a sample of boys. Two influences--gender and s--are systematically associated in these samples, and cannot be teased apart to measure their independent effects on math scores. This is a direct consequence of relying on random sampling to control for a noise variable that turns out to be related to an independent variable; had we treated s as an extraneous variable, this confounding would not have occurred. The lesson is that random sampling controls for noise variables that are not associated with independent variables, but if we have any doubt about this condition, we should promote the noise variables to extraneous variables and control them directly. You will see another example of a sampling bias in section 3.3.
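To see how a stopping rule of this kind ties s to gender, consider the following sketch. The probabilities of having another child after a girl (0.8) or after a boy (0.5), the cap of six children, and the function names are assumptions made for illustration; they are not estimates from Beal or any other source.

```python
import random

def make_family(rng, p_more_after_girl=0.8, p_more_after_boy=0.5, max_children=6):
    """Hypothetical family: more likely to have another child after a girl."""
    children = []
    while len(children) < max_children:
        child = rng.choice("FM")                   # each birth is a girl or boy, 50/50
        children.append(child)
        p_more = p_more_after_girl if child == "F" else p_more_after_boy
        if rng.random() >= p_more:                 # the family stops here
            break
    return children

def sibling_counts(families):
    """Collect s (family size minus one) separately for girls and boys."""
    counts = {"F": [], "M": []}
    for family in families:
        s = len(family) - 1
        for child in family:
            counts[child].append(s)
    return counts

def share_with_no_siblings(values):
    return sum(1 for s in values if s == 0) / len(values)

rng = random.Random(1)
families = [make_family(rng) for _ in range(20000)]
counts = sibling_counts(families)

# Draw equally sized random samples of girls and boys, as in the text.
girls_s = rng.sample(counts["F"], 1000)
boys_s = rng.sample(counts["M"], 1000)
girls_only_child = share_with_no_siblings(girls_s)
boys_only_child = share_with_no_siblings(boys_s)

print("share of girls with s = 0:", girls_only_child)
print("share of boys with s = 0:", boys_only_child)
```

Even though each sample is drawn at random, the share of only children should come out smaller among the girls, so the distribution of s differs between the two samples and its effect is confounded with gender.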

We hope, of course, that noise variables have negligible effects on dependent variables, so confounding of the kind we just described doesn't arise. But even when confounding is avoided, the effects of noise variables tend to obscure the effects of independent variables. Why is Abigail's math score 720 when Carole, her best friend, scored 740? If gender were the only factor that affected math scores, then Abigail's and Carole's scores would be equal. They differ because Abigail and Carole are different: one studies harder, the other has wealthier parents; one has two older brothers, the other has none; one was dropped on her head as a child, the other wasn't. The net result of all these differences is that Abigail has a lower math score than Carole, but a higher score than Jill, and so on.

Figure 3.2 Distribution of heights for boys and girls at different ages.

The variance in math scores within the sample of girls is assumed to result from all these noise variables. It follows, therefore, that you can reduce the variance in a sample by partitioning it into two or more samples on the basis of one of these variables--by promoting a noise variable to be an extraneous or independent variable. For example, Figure 3.2 shows the distributions of the heights of boys and girls. In the top two distributions, the age of the children is treated as a noise variable, so, not surprisingly, the distributions have large variances. In fact, the variances are such that height differences between boys and girls are obscured. By promoting age to be an extraneous or independent variable--by controlling for age directly instead of letting random sampling control for its effect--we can reduce variance and see effects due to gender. The bottom two distributions represent boys and girls in the more tightly constrained 9-10 year age bracket. They have smaller variances (compare their horizontal axes to those of the top two graphs) and we can now see that girls are taller than boys.
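The variance reduction described above can be illustrated with simulated heights. The numbers are invented and are not the data behind Figure 3.2: we assume ages uniform between 5 and 15, growth of 6 cm per year, a constant 2 cm advantage for girls, and 4 cm of residual noise. The names draw_child and in_bracket are hypothetical.

```python
import random
from statistics import mean, variance

def draw_child(rng, gender):
    """Hypothetical child: height depends on age, gender, and residual noise."""
    age = rng.uniform(5, 15)
    height = 85 + 6 * age + (2 if gender == "F" else 0) + rng.gauss(0, 4)
    return age, height

def in_bracket(children, low, high):
    """Heights of the children whose age falls in [low, high)."""
    return [h for age, h in children if low <= age < high]

rng = random.Random(2)
girls = [draw_child(rng, "F") for _ in range(4000)]
boys = [draw_child(rng, "M") for _ in range(4000)]

# Age as a noise variable: pool all ages.
girls_all = [h for _, h in girls]
boys_all = [h for _, h in boys]

# Age promoted to a directly controlled variable: keep only the 9-10 bracket.
girls_9_10 = in_bracket(girls, 9, 10)
boys_9_10 = in_bracket(boys, 9, 10)

var_all = variance(girls_all)
var_9_10 = variance(girls_9_10)
gap_all = mean(girls_all) - mean(boys_all)
gap_9_10 = mean(girls_9_10) - mean(boys_9_10)

print("variance of girls' heights, all ages:", round(var_all, 1))
print("variance of girls' heights, ages 9-10:", round(var_9_10, 1))
print("girl-boy gap, all ages:", round(gap_all, 2))
print("girl-boy gap, ages 9-10:", round(gap_9_10, 2))
```

Under these assumptions most of the pooled variance comes from age, so restricting the samples to one age bracket should shrink the variance sharply and leave the gender difference much easier to see against it.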

In sum, experiments test whether factors influence behavior. Both factors and behavior are represented by variables. In manipulation experiments, one sets levels of one or more independent variables, resulting in two or more conditions, and observes the results in the dependent variable. Extraneous variables represent factors that are plausible causes; we control for them directly by reproducing them across conditions. Noise variables represent factors we assume have negligible effects. They are controlled by random sampling, if the sampling is truly random. If noise factors turn out to have large effects, then variance within conditions will be larger than we like, and it can be reduced by treating noise variables as extraneous or independent (i.e., directly controlled) variables.

Experimental Methods for Artificial Intelligence, Paul R. Cohen, 1995
Mon Jul 15 17:05:56 MDT 1996