
## How to Detect Ceiling and Floor Effects

If the maximum or minimum value of a dependent variable is known, then one can detect ceiling or floor effects easily. This suggests that dependent variables should not be open-ended; for example, it is easy to see a ceiling effect if y is a percentage score that approaches 100% in both the treatment and control conditions. But the mere fact that y is bounded does not ensure we can detect ceiling and floor effects. For example, the acreage burned by fires in Phoenix is bounded--no fewer than zero acres are ever burned by a fire--so if two versions of the Phoenix planner each lost approximately zero acres when they fought fires, we would recognize a ceiling effect (approached from above). But now imagine running each version of the planner (call them P and P′) on ten fires and calculating the mean acreage lost by each: 50 acres for P and 49.5 acres for P′. Does this result mean P′ is not really better than P, or have we set ten fires so challenging that 49.5 acres is nearly the best possible performance?

To resolve this question--to detect a ceiling effect--it doesn't help to know that zero is the theoretical best bound on lost acreage; we need to know the practical best bound for the ten fires we set. If it is, say, 10 acres, then P′ is no better than P. But if the practical best bound is, say, 47 acres, then the possible superiority of P′ is obscured by a ceiling effect. To tease these interpretations apart, we must estimate the practical best bound. A simple method is illustrated in table 3.1. For each fire, the lesser of the acreages lost by P and P′ is an upper bound on the practical minimum that would be lost by any planner given that fire. For example, P′ lost 15 acres to fire 1, so the practical minimum loss for fire 1 is no more than 15 acres. The average of these minima over all fires, 400/10 = 40, is therefore an overestimate of the practical best performance. If this number were very close to the average acreages lost by P and P′, we could claim a ceiling effect. In fact, it is roughly ten acres better than either planner's mean: on each fire, one planner or the other contained the fire with, on average, ten fewer acres lost than the means of 50 and 49.5. So we simply cannot claim that the average acreages lost by P and P′ are the practical minimum losses for these fires. In other words, there is no ceiling effect and no reason to believe the alleged superiority of P′ is obscured.

Table 3.1: The acreage lost by fires fought by P and P′, and the minimum acreage lost.

| Fire | 1  | 2  | 3  | 4  | 5  | 6  | 7  | 8  | 9  | 10 | Total |
|------|----|----|----|----|----|----|----|----|----|----|-------|
| P    | 55 | 60 | 50 | 35 | 40 | 20 | 90 | 70 | 30 | 50 | 500   |
| P′   | 15 | 50 | 50 | 75 | 65 | 40 | 60 | 65 | 40 | 35 | 495   |
| Min  | 15 | 50 | 50 | 35 | 40 | 20 | 60 | 65 | 30 | 35 | 400   |
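The estimate of the practical best bound can be reproduced directly from the figures in Table 3.1. The following Python sketch (the acreage figures are copied from the table) computes the per-fire minima and the three averages:

```python
# Acres lost by each planner on the ten fires (Table 3.1).
p       = [55, 60, 50, 35, 40, 20, 90, 70, 30, 50]   # planner P
p_prime = [15, 50, 50, 75, 65, 40, 60, 65, 40, 35]   # planner P'

# Per-fire minimum: an upper bound on the practical minimum loss
# that any planner could achieve on that fire.
mins = [min(a, b) for a, b in zip(p, p_prime)]

mean_p       = sum(p) / len(p)             # 50.0
mean_p_prime = sum(p_prime) / len(p_prime) # 49.5
mean_min     = sum(mins) / len(mins)       # 40.0, the estimated practical best bound

print(mean_p, mean_p_prime, mean_min)
```

Because the estimated bound (40) sits well below both planners' means, neither planner is close to the practical ceiling.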

A dramatic example of ceiling effects came to light when Robert Holte analyzed 14 datasets from a corpus that had become a mainstay of machine learning research. The corpus is maintained by the Machine Learning Group at the University of California, Irvine. All 14 sets involved learning classification rules, which map from vectors of features to classes. Each item in a dataset includes a vector and a classification, although features are sometimes missing, and both features and classifications are sometimes incorrect. All datasets were taken from real classification problems, such as classifying mushrooms as poisonous or safe, and classifying chess endgame positions as wins for white or black. Holte also included two other sets, not from the Irvine corpus, in his study.

Table 3.2: Average classification accuracies for two algorithms that learn classification rules, C4 and 1R*.

| Dataset | BC   | CH   | GL   | G2   | HD   | HE   | HO   | HY   |
|---------|------|------|------|------|------|------|------|------|
| C4      | 72   | 99.2 | 63.2 | 74.3 | 73.6 | 81.2 | 83.6 | 99.1 |
| 1R*     | 72.5 | 69.2 | 56.4 | 77   | 78   | 85.1 | 81.6 | 97.2 |
| Max     | 72.5 | 99.2 | 63.2 | 77   | 78   | 85.1 | 83.6 | 99.1 |

| Dataset | IR   | LA   | LY   | MU    | SE   | SO   | VO   | V1   | Mean |
|---------|------|------|------|-------|------|------|------|------|------|
| C4      | 93.8 | 77.2 | 77.5 | 100.0 | 97.7 | 97.5 | 95.6 | 89.4 | 85.9 |
| 1R*     | 95.9 | 87.4 | 77.3 | 98.4  | 95   | 87   | 95.2 | 87.9 | 83.8 |
| Max     | 95.9 | 87.4 | 77.5 | 100.0 | 97.7 | 97.5 | 95.6 | 89.4 | 87.4 |
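The Max row and the three means in Table 3.2 follow mechanically from the C4 and 1R* rows. A short Python sketch (the accuracies are copied from the table) makes the computation explicit:

```python
# Average classification accuracies from Table 3.2, in dataset order.
c4 = [72, 99.2, 63.2, 74.3, 73.6, 81.2, 83.6, 99.1,
      93.8, 77.2, 77.5, 100.0, 97.7, 97.5, 95.6, 89.4]
r1 = [72.5, 69.2, 56.4, 77, 78, 85.1, 81.6, 97.2,
      95.9, 87.4, 77.3, 98.4, 95, 87, 95.2, 87.9]

# Per-dataset maximum: an estimate of the best accuracy practically
# achievable on each dataset.
maxima = [max(a, b) for a, b in zip(c4, r1)]

mean_c4  = sum(c4) / len(c4)          # about 85.9
mean_r1  = sum(r1) / len(r1)          # about 83.8
mean_max = sum(maxima) / len(maxima)  # about 87.4, the estimated practical maximum
```

The gap between each algorithm's mean and the estimated practical maximum is only two or three points, which is the arithmetic behind the suspected ceiling effect discussed below.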


A typical classification learning experiment goes like this: a dataset is divided randomly into a training set (typically two-thirds of the items) and a test set (the remaining third). An algorithm learns classification rules from the training set, and with these rules attempts to classify the items in the test set. The proportion of correctly classified items is recorded, all classification rules are discarded, and the process is repeated. After, say, 25 iterations, the mean of the proportions is calculated. This is the average classification accuracy for the algorithm. (See section 6.10 for details on this and related procedures.) Table 3.2 shows the average classification accuracies computed by Holte for two algorithms on 16 datasets. The first algorithm, C4, was state of the art at the time of Holte's study. The other, 1R*, will be described momentarily. Note that the average classification accuracies for C4 and 1R* are just over two percentage points apart: C4 correctly classified 85.9% of the items in its test sets, whereas 1R*'s figure is 83.8%. Following the logic of the previous section we must ask: is C4 hardly better than 1R*, or are we seeing a ceiling effect? To answer the question we can estimate the practical maximum average classification accuracy as the average of the per-dataset maxima of the classification accuracies in Table 3.2. This estimate, 87.4, is not much larger than 85.9 or 83.8, so it appears that, on average, the difference between an algorithm's performance and an estimate of the best possible performance is only two or three points. A ceiling effect seems to be lurking.
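The experimental procedure just described can be sketched as follows. This is a minimal illustration, not Holte's actual code; the learner object with `fit` and `classify` operations is a hypothetical interface assumed for the sketch:

```python
import random

def average_accuracy(learner, dataset, iterations=25, train_fraction=2/3):
    """Repeatedly split the data, train, test, and average the accuracies.

    `dataset` is a list of (feature_vector, classification) pairs;
    `learner.fit(training)` is assumed to return a rules object whose
    `classify(x)` method predicts a class for item x.
    """
    accuracies = []
    for _ in range(iterations):
        items = dataset[:]
        random.shuffle(items)                   # fresh random partition each time
        cut = int(len(items) * train_fraction)  # e.g., two-thirds for training
        training, test = items[:cut], items[cut:]
        rules = learner.fit(training)           # learn classification rules
        correct = sum(rules.classify(x) == y for x, y in test)
        accuracies.append(correct / len(test))  # proportion classified correctly
        # the rules are discarded; the next iteration learns afresh
    return sum(accuracies) / len(accuracies)
```

The average returned by this procedure is the "average classification accuracy" reported in Table 3.2.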


Empirical Methods for Artificial Intelligence, Paul R. Cohen, 1995