Bias when using feature selection

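When using feature selection, you need to be very careful about how you evaluate your classifier: the features must be selected using only the training data of each evaluation split, never the data the classifier is later tested on.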

Here's the wrong way of doing it:

from PyML import *
 
# the wrong way of using feature selection
 
data = SparseDataSet('colon.data')
# distinguish between normal tissue and tissue affected by colon cancer
# data is available from:
# http://mldata.org/repository/data/viewslug/colon-cancer/
 
# create an instance of the RFE feature selection method
rfe = featsel.RFE()
# a feature selector's train method selects a subset of features
rfe.train(data)
 
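# cross-validation now runs on features that were chosen using the labels
# of ALL examples, including the ones held out for testing in each fold
results1 = SVM().stratifiedCV(data)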

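If you run this, you will get a classifier with perfect accuracy. That estimate is optimistically biased: RFE saw the labels of every example, including those later held out in each cross-validation fold, so information about the test data leaked into the choice of features. Now let's do it the right way: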

# the right way to perform feature selection:
# feature selection is performed as part of training the classifier
data = SparseDataSet('colon.data')
results2 = composite.FeatureSelect(SVM(), featsel.RFE()).stratifiedCV(data)
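
The bias does not depend on PyML or on RFE. As a rough illustration, here is a minimal sketch that uses only the Python standard library and reproduces the effect on pure noise, where no classifier should beat chance. Everything in it is an assumption made for the example and not part of PyML: the helper names (make_noise_data, select_features, nearest_centroid_predict, cv_accuracy) are hypothetical, a simple mean-difference filter stands in for RFE, a nearest-centroid rule stands in for the SVM, and the folds are plain shuffled folds, not stratified ones.

```python
import random

def make_noise_data(n_samples, n_features, rng):
    """Pure-noise features with balanced random labels: there is no real signal."""
    X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(n_samples)]
    y = [i % 2 for i in range(n_samples)]
    rng.shuffle(y)
    return X, y

def select_features(X, y, rows, k):
    """Hypothetical filter: keep the k features whose class means differ most on `rows`."""
    pos = [i for i in rows if y[i] == 1]
    neg = [i for i in rows if y[i] == 0]
    def gap(j):
        mean_pos = sum(X[i][j] for i in pos) / len(pos)
        mean_neg = sum(X[i][j] for i in neg) / len(neg)
        return abs(mean_pos - mean_neg)
    return sorted(range(len(X[0])), key=gap, reverse=True)[:k]

def nearest_centroid_predict(X, y, train, test, feats):
    """Assign each test example to the class with the closer training centroid."""
    centroids = {}
    for label in (0, 1):
        members = [i for i in train if y[i] == label]
        centroids[label] = [sum(X[i][j] for i in members) / len(members) for j in feats]
    preds = []
    for i in test:
        dist = {label: sum((X[i][j] - c) ** 2 for j, c in zip(feats, centroids[label]))
                for label in (0, 1)}
        preds.append(min(dist, key=dist.get))
    return preds

def cv_accuracy(X, y, k, n_folds, rng, select_inside_folds):
    """Cross-validated accuracy, selecting features either once up front or per fold."""
    idx = list(range(len(X)))
    rng.shuffle(idx)
    folds = [idx[f::n_folds] for f in range(n_folds)]
    # The wrong way: one selection that has seen every label, test folds included.
    global_feats = None if select_inside_folds else select_features(X, y, idx, k)
    correct = 0
    for test in folds:
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        # The right way: selection only ever sees the training part of the fold.
        feats = select_features(X, y, train, k) if select_inside_folds else global_feats
        preds = nearest_centroid_predict(X, y, train, test, feats)
        correct += sum(p == y[i] for p, i in zip(preds, test))
    return correct / len(X)

X, y = make_noise_data(n_samples=40, n_features=2000, rng=random.Random(0))
# Same fold seed for both runs, so only the placement of feature selection differs.
biased = cv_accuracy(X, y, k=10, n_folds=5, rng=random.Random(1), select_inside_folds=False)
honest = cv_accuracy(X, y, k=10, n_folds=5, rng=random.Random(1), select_inside_folds=True)
print("selection before cross-validation:", biased)
print("selection inside cross-validation:", honest)
```

Since the labels are random, any accuracy well above chance in the first number comes purely from the leak. The second number plays the role of results2 above: selection is repeated inside every fold, which is what wrapping the selector and the classifier in composite.FeatureSelect achieves in PyML.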

feature_selection_bias.txt · Last modified: 2016/08/09 10:25 (external edit)