Assignment 5: Naive Bayes

Due: November 17th at 6pm

Part 1: A few short questions about naive Bayes

Can you use naive Bayes for data that contains both categorical and real-valued features?
The basic assumption in naive Bayes is that all attributes are independent given the label. How can you model just 2 of $d$ features as dependent?
Given a trained naive Bayes classifier, and without access to the training data, how would you select a subset of features that are most predictive of the class label?
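
For reference when working on these questions, the conditional-independence assumption mentioned above can be written out explicitly. The notation is an assumption introduced here, not taken from the assignment: $x_1, \dots, x_d$ denote the $d$ features and $y$ the class label.

$$
P(y \mid x_1, \dots, x_d) \;\propto\; P(y) \prod_{i=1}^{d} P(x_i \mid y)
$$

Each factor $P(x_i \mid y)$ is modeled separately, so in principle each feature can have its own distribution family.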

Part 2: naive Bayes implementation

Implement a naive Bayes classifier for either categorical or continuous data. Compare its performance to that of an SVM, making sure to perform proper model selection for the classifier parameters using internal cross-validation. Use two datasets from the UCI repository for this task. Several of its datasets contain categorical data, e.g. nursery school application ranking, census income prediction, and splice junction detection. If you implement naive Bayes for categorical data, make sure to include pseudo-counts to avoid overfitting.
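
To make the role of pseudo-counts concrete, here is a minimal sketch of naive Bayes for categorical data in pure Python. It is one possible design, not a reference solution: the class name `CategoricalNaiveBayes`, the parameter `pseudo_count`, and the toy weather-style data are all hypothetical, and the smoothing used is simple additive smoothing, where the pseudo-count is added to every (class, feature value) count. The SVM comparison and cross-validation are not covered here.

```python
import math
from collections import Counter


class CategoricalNaiveBayes:
    """Naive Bayes for categorical features with additive (pseudo-count) smoothing."""

    def __init__(self, pseudo_count=1.0):
        # Hypothetical parameter: the amount added to every count.
        self.pseudo_count = pseudo_count

    def fit(self, X, y):
        n_features = len(X[0])
        self.n_samples = len(y)
        self.class_counts = Counter(y)
        # Distinct values seen per feature; sizes the smoothing denominator.
        self.feature_values = [{row[j] for row in X} for j in range(n_features)]
        # counts[label][j][value] = number of training rows with that label and value.
        self.counts = {
            c: [Counter() for _ in range(n_features)] for c in self.class_counts
        }
        for row, label in zip(X, y):
            for j, value in enumerate(row):
                self.counts[label][j][value] += 1
        return self

    def log_joint(self, row):
        """Return {label: log P(label) + sum_j log P(row[j] | label)}."""
        scores = {}
        for c, n_c in self.class_counts.items():
            score = math.log(n_c / self.n_samples)
            for j, value in enumerate(row):
                k = len(self.feature_values[j])
                # With pseudo_count > 0 the numerator is never zero,
                # even for values never seen together with class c.
                num = self.counts[c][j][value] + self.pseudo_count
                den = n_c + self.pseudo_count * k
                score += math.log(num / den)
            scores[c] = score
        return scores

    def predict(self, X):
        return [max(s, key=s.get) for s in map(self.log_joint, X)]


# Toy data, made up purely for illustration.
X_train = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"), ("rainy", "cool")]
y_train = ["no", "no", "yes", "yes"]

model = CategoricalNaiveBayes(pseudo_count=1.0).fit(X_train, y_train)
predictions = model.predict([("rainy", "cool"), ("sunny", "hot")])
print(predictions)
```

Working in log space avoids numerical underflow when many per-feature probabilities are multiplied. Without the pseudo-count, a feature value never observed with a given class would yield a zero probability and `math.log` would fail.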

Grading

Here is what the grading sheet will look like for this assignment. A few general guidelines for this and future assignments in the course:

Always describe the method you used to produce a given result in enough detail that a reader can reproduce your results from the description alone. You can use a few lines of Python code or pseudo-code. If your code is more than a few lines, you can include it as an appendix to your report. For example, for the first part of the assignment, provide the protocol you use to evaluate classifier accuracy.
You can provide results in the form of tables, figures or text - whatever form is most appropriate for a given problem. There are no rules about how much space each answer should take. BUT we will take off points if we have to wade through a lot of redundant data.
Any machine learning paper includes a discussion of its results. Your assignments carry a similar expectation: reason about your results. For example, for the learning curve problem, what can you say on the basis of the observed learning curve?

Grading sheet for assignment 5

Part 1:  40 points.
(14 points):  1st question
(13 points):  2nd question
(13 points):  3rd question

Part 2:  50 points.
(10 points):  Experimental protocol
(20 points):  Correct classifier implementation
(10 points):  Results for the two classifiers on both datasets
(10 points):  Discussion of the results

Report structure, grammar and spelling:  10 points
( 3 points):  Heading and subheading structure easy to follow and
              clearly divides report into logical sections.
( 4 points):  Code, math, figure captions, and all other aspects of  
              report are well-written and formatted.
( 3 points):  Grammar, spelling, and punctuation.
