
Regression Effects

Consider the following hypothetical experiment: Dr. X runs a program on 100 test problems to get a baseline level of performance. Some of the results are shown in the "Pretest Score" row of Table 3.3. The pretest scores are listed in ascending order, so problem number 6 has the lowest score (zero), problem 14 has the next lowest score, and so on. Dr. X then modifies the program to increase its performance. Because he wants a stringent test of these modifications, and he doesn't want the expense of running 100 more trials, Dr. X decides to test the modified program on the ten problems on which it previously performed worst: problems 6, 14, 15, and so on. The performance of the modified program is shown in the "Posttest Score" row of Table 3.3. Dr. X is delighted. You, however, are skeptical. You ask Dr. X a single question, and when he answers in the affirmative, you know he has made two related and inexcusable mistakes. What is the question?

Table 3.3: Pre- and posttest scores illustrating a regression effect.

Problem Number    6   14   15   21   29   36   55   61   63   84
Pretest Score     0    1    1    2    2    2    4    5    5    6
Posttest Score    564181222619

The question is, "Does chance play any role in determining scores?" If it does, then rerunning the original program on problems 6, 14, 15, ... will produce different scores. And not only different scores, but higher scores! Let's start with problem 6, on which Dr. X's program scored zero. If zero is the lowest possible score, then when the original program is run again, it has no chance of attaining a lower score, but it has a chance of attaining a higher score. Thus, on the second attempt, Dr. X's program is expected to attain a score higher than zero on problem 6. Now consider problem 14, on which the program scored one. When the original program is run again, it has some chance of attaining a score lower than one, and some chance of attaining a score higher than one. If the first probability is smaller than the second, then Dr. X's original program is expected to attain a score higher than one on problem 14. And if the original program is expected to achieve higher scores simply by chance when it is run again, how can Dr. X be sure that the higher scores achieved by his modified program are due to the modifications instead of chance? Dr. X's first mistake is testing his program on problems whose scores have "nowhere to go but up."

The second mistake is best understood in terms of a model of scores. If chance plays a role, then each score can be represented as the sum of a true value and a chance value. Assume for a moment that the problems in Dr. X's test set are equally challenging, so his program ought to attain the same true value on each (i.e., the variance in scores is due entirely to the chance component). Then if Dr. X's original program attains a very low score on a problem, it is apt to attain a higher score the next time it encounters that problem. Similarly, if the program first attains a very high score on a problem, then it is apt to attain a lower score on the next encounter. This means that if Dr. X reruns his original program on the ten highest-scoring problems, it will appear to magically become worse, and if he reruns the program on the ten lowest-scoring problems, it will appear to magically improve. But this is no magic: it is a simple statistical artifact called regression toward the mean.
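The model of scores described above can be checked with a small simulation. The following Python sketch is only an illustration under stated assumptions: the chance component is drawn from a normal distribution, scores have no floor, and the function names and parameter values (`run_original`, `regression_demo`, a true value of 5.0, a noise standard deviation of 2.0) are invented for this example and do not come from Dr. X's experiment.

```python
import random
import statistics


def run_original(true_value, noise_sd, rng):
    # One observed score = true value + chance component (assumed Gaussian).
    return true_value + rng.gauss(0.0, noise_sd)


def regression_demo(n_problems=100, k=10, true_value=5.0, noise_sd=2.0, seed=0):
    rng = random.Random(seed)
    # Pretest and retest both use the same, unmodified program.
    pretest = [run_original(true_value, noise_sd, rng) for _ in range(n_problems)]
    retest = [run_original(true_value, noise_sd, rng) for _ in range(n_problems)]

    # Rank problems by pretest score and pick the k extremes at each end.
    ranked = sorted(range(n_problems), key=lambda i: pretest[i])
    lowest, highest = ranked[:k], ranked[-k:]

    def mean_of(scores, indices):
        return statistics.mean([scores[i] for i in indices])

    return {
        "low_pre": mean_of(pretest, lowest),
        "low_post": mean_of(retest, lowest),
        "high_pre": mean_of(pretest, highest),
        "high_post": mean_of(retest, highest),
    }


if __name__ == "__main__":
    for name, value in regression_demo().items():
        print(f"{name}: {value:.2f}")
```

Under these assumptions, the mean retest score of the ten lowest-scoring problems moves up toward the true value and the mean of the ten highest-scoring problems moves down, even though nothing about the program has changed.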

The best way to avoid regression effects is to run the same problems in both pretest and posttest. If the pretest involves 100 problems and you want to run only 10 in a posttest, then you should rank the problems by their pretest scores and select every tenth one, or use a similar scheme that ensures a representative distribution of pretest scores. Be warned, however, that if you use a t test to compare samples of very disparate sizes (e.g., 10 and 100), you should treat marginal (i.e., barely significant) results with caution.
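The "rank and select every tenth" scheme can be sketched in a few lines of Python. This is one possible reading of the scheme, not a prescribed procedure: the function name `select_every_kth`, the dictionary representation of pretest scores, and the `offset` parameter are assumptions made for illustration.

```python
def select_every_kth(pretest_scores, k=10, offset=0):
    """Rank problems by pretest score and keep every k-th one.

    pretest_scores maps a problem identifier to its pretest score.
    offset chooses which rank within each block of k is kept.
    """
    # Sort problem identifiers from lowest to highest pretest score.
    ranked = sorted(pretest_scores, key=pretest_scores.get)
    # Take one problem from each consecutive block of k ranks.
    return ranked[offset::k]


if __name__ == "__main__":
    # Hypothetical pretest scores for 100 problems, numbered 1 to 100.
    scores = {problem: (problem * 37) % 101 for problem in range(1, 101)}
    chosen = select_every_kth(scores)
    print([(problem, scores[problem]) for problem in chosen])
```

Because one problem is drawn from each block of ten ranks, the posttest sample spans the whole range of pretest scores instead of concentrating on the extremes.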



Empirical Methods for Artificial Intelligence, Paul R. Cohen, 1995
Mon Jul 15 17:05:56 MDT 1996