Due date: 9/15 at 3:30pm
In this assignment we will work with two datasets from the UCI machine learning repository.
The first dataset is the Heart disease diagnosis dataset. To make it simpler for you to read the dataset, here is a processed version in CSV format that is ready to use. Each row in the file corresponds to a training example. You can ignore the first column; the second column contains the label (+1 or -1), and the rest of the columns contain the feature vector. Note that the first row in the file is a comment and should be ignored (the comments="#" argument below takes care of this). To read the data matrix you can use numpy's genfromtxt function:
In [1]: import numpy as np
In [2]: data = np.genfromtxt("heart.data", delimiter=",", comments="#")
The second dataset is the Gisette handwritten digit recognition dataset. In this case the feature data matrix is provided separately from the labels, and the feature matrix is a delimited file that you can read into Python using the same method.
In [3]: X=np.genfromtxt("gisette_train.data")
Now you will need to read the labels separately.
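Assuming the labels are stored one value per line in their own file, they can be read with the same function. The snippet below uses a StringIO stand-in to simulate such a file; in practice you would pass the actual label filename:

```python
import numpy as np
from io import StringIO

# In practice: y = np.genfromtxt("<label file>")
# Here a StringIO object stands in for a file with one label per line.
demo_file = StringIO("1\n-1\n1\n")
y = np.genfromtxt(demo_file)
print(y)  # [ 1. -1.  1.]
```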
In this section you will compare the performance of several variants of the perceptron algorithm. The baseline method is the perceptron without a bias. Your task is to implement the following versions of the perceptron:
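As a reference point, here is a minimal sketch of the baseline perceptron without a bias term; the function name, epoch limit, and convergence check are illustrative choices of ours, not prescribed by the assignment:

```python
import numpy as np

def perceptron(X, y, max_epochs=100):
    """Baseline perceptron without a bias term (illustrative sketch).
    X: (n, d) feature matrix; y: (n,) labels in {-1, +1}."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(max_epochs):
        mistakes = 0
        for i in range(n):
            if y[i] * np.dot(w, X[i]) <= 0:  # misclassified (or on boundary)
                w = w + y[i] * X[i]          # perceptron update rule
                mistakes += 1
        if mistakes == 0:                    # converged: all examples separated
            break
    return w

# Linearly separable toy data: the label is the sign of the first coordinate.
X = np.array([[2.0, 1.0], [1.0, -1.0], [-2.0, 0.5], [-1.0, -2.0]])
y = np.array([1, 1, -1, -1])
w = perceptron(X, y)
print(np.sign(X @ w))  # matches y on this toy set
```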
Our modified perceptron is a variant in which the example to update with the perceptron rule is chosen according to a given criterion. Here's the pseudo-code:
Compare the accuracy of these four perceptron variants on the two datasets described above. Can you explain the reasoning behind the modified perceptron? Hint: consider the effect of the weight update on examples that are classified correctly. Compute both $E_{in}$ and an estimate of $E_{out}$ using a subset of examples that you set aside as a test set. For each dataset, randomly divide the data into two parts: training data and testing data. For the heart dataset reserve 100 examples for testing; for the Gisette dataset reserve 1500 examples for testing.
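The evaluation protocol above (random split, then E_in on the training part and an E_out estimate on the held-out part) might be sketched as follows. The train_fn/predict_fn parameters are hypothetical placeholders for whichever perceptron variant is being evaluated, and the toy majority-label classifier exists only to make the example runnable:

```python
import numpy as np

def split_and_evaluate(X, y, n_test, train_fn, predict_fn, seed=0):
    """Randomly split off n_test examples, train on the rest, and
    return E_in (training error) and an estimate of E_out (test error)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    test, train = idx[:n_test], idx[n_test:]
    model = train_fn(X[train], y[train])
    E_in = np.mean(predict_fn(model, X[train]) != y[train])
    E_out = np.mean(predict_fn(model, X[test]) != y[test])
    return E_in, E_out

# Runnable demo with a trivial "predict the majority label" classifier.
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.array([1, 1, 1, 1, 1, 1, 1, -1, -1, -1])
train = lambda X, y: 1 if (y == 1).sum() >= (y == -1).sum() else -1
predict = lambda m, X: np.full(len(X), m)
E_in, E_out = split_and_evaluate(X, y, n_test=3, train_fn=train, predict_fn=predict)
print(E_in, E_out)
```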
Whenever we learn a classifier it is useful to know if we have collected a sufficient amount of data for accurate classification. A good way of determining that is to construct a learning curve, which is a plot of classifier accuracy as a function of the number of training examples. Plot a learning curve for the perceptron algorithm (with bias) using the Gisette dataset. The x-axis for the plot (number of training examples) should be on a logarithmic scale - something like 10,20,40,80,200,400,800. Use numbers that are appropriate for the dataset at hand, choosing values that illustrate the variation that you observe. What can you conclude from the learning curve you have constructed?
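One possible sketch of the learning-curve computation. The nearest-centroid classifier below is a stand-in of our own so the demo runs end to end; substitute your perceptron-with-bias, and plot the result with a logarithmic x-axis:

```python
import numpy as np

def learning_curve(X_train, y_train, X_test, y_test, sizes, train_fn, predict_fn):
    """Test-set accuracy as a function of the number of training examples.
    train_fn and predict_fn are placeholders for your own classifier."""
    acc = []
    for m in sizes:
        model = train_fn(X_train[:m], y_train[:m])
        acc.append(np.mean(predict_fn(model, X_test) == y_test))
    return acc

# Stand-in classifier (nearest class centroid) for a runnable demo.
def train_centroid(X, y):
    return X[y == 1].mean(axis=0), X[y == -1].mean(axis=0)

def predict_centroid(model, X):
    mp, mn = model
    dp = ((X - mp) ** 2).sum(axis=1)
    dn = ((X - mn) ** 2).sum(axis=1)
    return np.where(dp < dn, 1, -1)

# Toy data: two well-separated Gaussian classes, interleaved so every
# training prefix contains both labels.
rng = np.random.default_rng(1)
X = np.empty((120, 2))
X[0::2] = rng.normal(2.0, 1.0, (60, 2))
X[1::2] = rng.normal(-2.0, 1.0, (60, 2))
y = np.tile([1, -1], 60)
X_train, y_train, X_test, y_test = X[:100], y[:100], X[100:], y[100:]

sizes = [4, 10, 20, 50, 100]
acc = learning_curve(X_train, y_train, X_test, y_test, sizes,
                     train_centroid, predict_centroid)
# Plot accuracy against sizes on a logarithmic x-axis, e.g. with
# matplotlib: plt.semilogx(sizes, acc, marker="o")
```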
In this section we will explore the effect of normalizing the data, focusing on normalization of features. The simplest form of normalization is to scale each feature to be in the range [-1, 1]. We'll call this scaling.
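A minimal sketch of such scaling, using a helper function of our own invention; note that the column minima and maxima should be computed on the training data only and then reused to scale the test data:

```python
import numpy as np

def scale_features(X):
    """Scale each feature (column) of X to the range [-1, 1].
    Constant columns are mapped to 0 to avoid division by zero."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)
    scaled = 2.0 * (X - lo) / span - 1.0
    return np.where(hi > lo, scaled, 0.0)

X = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])
print(scale_features(X))  # each column now spans [-1, 1]
```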
Here's what you need to do:
Your report needs to be written in LaTeX. Here are some files to help you start playing with LaTeX and writing your report. Download and extract the files from start_latex.tar. You will now have the following files:
The Makefile contains the commands required for generating a PDF file from the LaTeX source; the archive also includes other required files. On a Unix/Linux system that has LaTeX installed you can just run
> make
The file listings-python-options.sty is a LaTeX style file that tweaks the parameters of the listings LaTeX package so that Python code is displayed nicely.
Submit your report via Canvas. Python code can be included in your report if it is succinct (no more than a page or two at most) or submitted separately. The LaTeX sample document shows how to display Python code in a LaTeX document. Also, please check in a text file named README that describes what you found most difficult in completing this assignment (or provide that as a comment on ramct).
Here is what the grade sheet will look like for this assignment. A few general guidelines for this and future assignments in the course:
Grading sheet for assignment 1

Part 1: 45 points.
  (25 points): Correct implementation of the classifiers
  (10 points): Good protocol for evaluating classifier accuracy; results are provided in a clear and concise way
  (10 points): Discussion of the results and the modified perceptron algorithm

Part 2: 20 points.
  (15 points): Learning curves are correctly generated and displayed in a clear and readable way
  ( 5 points): Discussion of the results

Part 3: 20 points.
  ( 5 points): How to perform data scaling
  (10 points): Comparison of normalized/raw data results; discussion of results
  ( 5 points): Comparison of scaling and standardization

Report structure, grammar and spelling: 15 points
  ( 5 points): Heading and subheading structure easy to follow and clearly divides report into logical sections
  ( 5 points): Code, math, figure captions, and all other aspects of report are well-written and formatted
  ( 5 points): Grammar, spelling, and punctuation