Bias when using feature selection

When using feature selection, you need to be careful about how you evaluate your classifier.

Here's the wrong way of doing it:

from PyML import *
# the wrong way of using feature selection
data = SparseDataSet('')
# distinguish between normal tissue and tissue affected by colon cancer
# data is available from:
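# create an instance of the RFE feature selection method
rfe = featsel.RFE()
# a feature selector's train method selects a subset of features;
# here it runs on the whole dataset, test examples included
rfe.train(data)
# cross-validate an SVM on data whose features were already selected
results1 = SVM().stratifiedCV(data)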

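If you run this you will get a classifier with perfect accuracy. That estimate is misleading: the features were chosen using every example, including the ones that each cross-validation fold later uses for testing, so the test data has already influenced the classifier. Now let's do it the right way: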

# the right way to perform feature selection:
# feature selection is performed as part of training the classifier, so each
# cross-validation fold selects features from its own training examples only
data = SparseDataSet('')
results2 = composite.FeatureSelect(SVM(), featsel.RFE()).stratifiedCV(data)
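
The same effect can be reproduced without PyML or the colon cancer data. The following is a minimal pure-Python sketch, not PyML's implementation: it assumes random noise features with balanced labels, a simple difference-of-class-means filter in place of RFE, and a nearest-centroid classifier in place of the SVM. The helper names (make_noise_data, centroid, select_features, sq_dist, cv_accuracy) are hypothetical and introduced only for illustration.

```python
import random

def make_noise_data(n_samples=40, n_features=2000, seed=0):
    """Pure noise with balanced labels: no feature is truly informative."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)]
         for _ in range(n_samples)]
    y = [i % 2 for i in range(n_samples)]
    return X, y

def centroid(X, y, rows, label, feats):
    """Mean of the given features over the rows that carry this label."""
    members = [i for i in rows if y[i] == label]
    return [sum(X[i][j] for i in members) / len(members) for j in feats]

def select_features(X, y, rows, k):
    """Keep the k features whose class means differ most on the given rows.
    (A stand-in filter for illustration; this is not RFE.)"""
    n_features = len(X[0])
    c0 = centroid(X, y, rows, 0, range(n_features))
    c1 = centroid(X, y, rows, 1, range(n_features))
    ranked = sorted(range(n_features),
                    key=lambda j: abs(c1[j] - c0[j]), reverse=True)
    return ranked[:k]

def sq_dist(x, center, feats):
    """Squared Euclidean distance restricted to the selected features."""
    return sum((x[j] - cj) ** 2 for j, cj in zip(feats, center))

def cv_accuracy(X, y, k=10, n_folds=5, select_inside_fold=True):
    """Stratified cross-validation of a nearest-centroid classifier."""
    rows = list(range(len(X)))
    # The wrong way: select once on all rows, so test labels leak in.
    global_feats = None if select_inside_fold else select_features(X, y, rows, k)
    correct = 0
    for fold in range(n_folds):
        # Rows 2m and 2m+1 have labels 0 and 1, so assigning whole pairs
        # to folds keeps every fold balanced.
        test = [i for i in rows if (i // 2) % n_folds == fold]
        train = [i for i in rows if (i // 2) % n_folds != fold]
        # The right way: select using the training rows of this fold only.
        feats = select_features(X, y, train, k) if select_inside_fold else global_feats
        c0 = centroid(X, y, train, 0, feats)
        c1 = centroid(X, y, train, 1, feats)
        for i in test:
            pred = 1 if sq_dist(X[i], c1, feats) < sq_dist(X[i], c0, feats) else 0
            if pred == y[i]:
                correct += 1
    return float(correct) / len(rows)

X, y = make_noise_data()
biased_acc = cv_accuracy(X, y, select_inside_fold=False)
honest_acc = cv_accuracy(X, y, select_inside_fold=True)
print("selection on all data:  %.2f" % biased_acc)
print("selection inside folds: %.2f" % honest_acc)
```

Because no feature carries real signal here, the estimate with selection inside each fold should hover around chance, while selecting on all the data before cross-validation typically reports a much higher accuracy. That gap is the selection bias.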
