This shows you the differences between two versions of the page.
feature_selection_bias [2016/08/09 10:25] |
feature_selection_bias [2016/08/09 10:25] (current) |
||
---|---|---|---|
Line 1: | Line 1: | ||
+ | ====== Bias when using feature selection ====== | ||
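+ | When using feature selection you need to be careful about how you evaluate your classifier: if the features are selected using all of the data before cross-validation is run, the accuracy estimate is biased upward. | ||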
+ | |||
+ | Here's the wrong way of doing it: | ||
+ | |||
+ | <code python> | ||
+ | from PyML import * | ||
+ | |||
+ | # the wrong way of using feature selection | ||
+ | |||
+ | data = SparseDataSet('colon.data') | ||
+ | # distinguish between normal tissue and tissue affected by colon cancer | ||
+ | # data is available from: | ||
+ | # http://mldata.org/repository/data/viewslug/colon-cancer/ | ||
+ | |||
+ | # create an instance of the RFE feature selection method | ||
+ | rfe = featsel.RFE() | ||
+ | # a feature selector's train method selects a subset of features | ||
+ | rfe.train(data) | ||
+ | |||
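+ | # cross-validate on the reduced data; the features were chosen using the | ||
+ | # labels of all examples, including those that end up in each test fold | ||
+ | results1 = SVM().stratifiedCV(data) | ||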
+ | </code> | ||
+ | |||
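+ | If you run this you will get a classifier with perfect accuracy. That estimate is too optimistic: RFE selected the features using the labels of every example, including the ones that later serve as test examples during cross-validation. Now let's do it the right way: | ||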
+ | |||
+ | <code python> | ||
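+ | from PyML import * | ||
+ | |||
+ | # the right way to perform feature selection: | ||
+ | # feature selection is performed as part of training the classifier, | ||
+ | # so it only ever sees the training portion of each cross-validation fold | ||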
+ | data = SparseDataSet('colon.data') | ||
+ | results2 = composite.FeatureSelect(SVM(), featsel.RFE()).stratifiedCV(data) | ||
+ | </code> |
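To see why the first approach is misleading, here is a small self-contained sketch that does not use PyML at all. It is only an illustration under stated assumptions: the data is synthetic noise (so no feature is truly informative), the selector is a simple "largest difference in class means" filter rather than RFE, and the classifier is a nearest-centroid rule rather than an SVM. All function names are made up for this example.

```python
import random

def make_noise_data(n_samples, n_features, rng):
    """Pure-noise features with balanced labels: no feature is truly informative."""
    X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(n_samples)]
    y = [i % 2 for i in range(n_samples)]
    return X, y

def class_means(X, y, rows, features):
    """Per-class mean of each listed feature, computed on the given rows only."""
    means = {}
    for label in (0, 1):
        members = [i for i in rows if y[i] == label]
        means[label] = [sum(X[i][j] for i in members) / len(members) for j in features]
    return means

def select_features(X, y, rows, k):
    """Stand-in selector (not RFE): keep the k features whose class means differ most."""
    all_features = list(range(len(X[0])))
    means = class_means(X, y, rows, all_features)
    gap = [abs(a - b) for a, b in zip(means[0], means[1])]
    return sorted(all_features, key=lambda j: gap[j], reverse=True)[:k]

def predict(X, means, features, i):
    """Stand-in classifier (not an SVM): assign example i to the nearest class centroid."""
    def dist(label):
        return sum((X[i][j] - m) ** 2 for j, m in zip(features, means[label]))
    return 0 if dist(0) <= dist(1) else 1

def cv_accuracy(X, y, k, n_folds, select_inside_folds):
    """Cross-validated accuracy; n_folds should be odd so the interleaved folds stay balanced."""
    n = len(X)
    folds = [list(range(f, n, n_folds)) for f in range(n_folds)]
    # the biased variant selects once, using the labels of every example
    features_all_data = select_features(X, y, list(range(n)), k)
    correct = 0
    for test_rows in folds:
        held_out = set(test_rows)
        train_rows = [i for i in range(n) if i not in held_out]
        if select_inside_folds:
            # the unbiased variant repeats selection on the training rows only
            features = select_features(X, y, train_rows, k)
        else:
            features = features_all_data
        means = class_means(X, y, train_rows, features)
        correct += sum(predict(X, means, features, i) == y[i] for i in test_rows)
    return correct / n

rng = random.Random(0)
X, y = make_noise_data(40, 1000, rng)
biased = cv_accuracy(X, y, k=10, n_folds=5, select_inside_folds=False)
honest = cv_accuracy(X, y, k=10, n_folds=5, select_inside_folds=True)
print("selection on all data:  %.2f" % biased)
print("selection inside folds: %.2f" % honest)
```

Since the labels are unrelated to the features, any accuracy noticeably above chance in the first number comes from the selection step having seen the test examples. Repeating the selection inside each fold, which is what the composite classifier above does, removes that leak.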