Due date: 9/15 at 3:30pm
In this assignment we will work with two datasets from the UCI machine learning repository.
The first dataset is the Heart disease diagnosis dataset. To make it simpler for you to read the dataset, here is a processed version in CSV format that is ready to use. Each row in the file corresponds to a training example. You can ignore the first column; the second column contains the label (+1 or -1), and the rest of the columns contain the feature vector. Note that the first row in the file is a comment and should be ignored (the comments="#" argument below takes care of this). To read the data matrix you can use numpy's genfromtxt function:
In [1]: import numpy as np
In [2]: data = np.genfromtxt("heart.data", delimiter=",", comments="#")
The second dataset is the Gisette handwritten digit recognition dataset. In this case the feature data matrix is provided separately from the labels, and the feature matrix is a delimited file that you can read into Python using the same method.
In [3]: X=np.genfromtxt("gisette_train.data")
Now you will need to read the labels separately.
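Assuming the labels are stored one value per line in their own file, they can be read with the same function. The snippet below uses a StringIO stand-in to simulate such a file; in practice you would pass the actual label filename:

```python
import numpy as np
from io import StringIO

# In practice: y = np.genfromtxt("<label file>")
# Here a StringIO object stands in for a file with one label per line.
demo_file = StringIO("1\n-1\n1\n")
y = np.genfromtxt(demo_file)
print(y)  # [ 1. -1.  1.]
```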
In this section you will compare the performance of several variants of the perceptron algorithm. The baseline method is the perceptron without a bias. Your task is to implement the following versions of the perceptron:
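As a reference point, here is a minimal sketch of the baseline perceptron without a bias term; the function name, epoch limit, and convergence check are illustrative choices of ours, not prescribed by the assignment:

```python
import numpy as np

def perceptron(X, y, max_epochs=100):
    """Baseline perceptron without a bias term (illustrative sketch).
    X: (n, d) feature matrix; y: (n,) labels in {-1, +1}."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(max_epochs):
        mistakes = 0
        for i in range(n):
            if y[i] * np.dot(w, X[i]) <= 0:  # misclassified (or on boundary)
                w = w + y[i] * X[i]          # perceptron update rule
                mistakes += 1
        if mistakes == 0:                    # converged: all examples separated
            break
    return w

# Linearly separable toy data: the label is the sign of the first coordinate.
X = np.array([[2.0, 1.0], [1.0, -1.0], [-2.0, 0.5], [-1.0, -2.0]])
y = np.array([1, 1, -1, -1])
w = perceptron(X, y)
print(np.sign(X @ w))  # matches y on this toy set
```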
Our modified perceptron is a variant in which the example to update with the perceptron rule is chosen according to a given criterion. Here's the pseudo-code:
Compare the accuracy of these four perceptron variants on the two datasets described above. Can you explain the reasoning behind the modified perceptron? Hint: consider the effect of the weight update on examples that are classified correctly. Compute both $E_{in}$ and an estimate of $E_{out}$ using a subset of examples that you set aside as a test set. For each dataset, randomly divide the data into two parts: training data and testing data. For the heart dataset reserve 100 examples for testing; for the Gisette dataset reserve 1500 examples for testing.
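The evaluation protocol above (random split, then E_in on the training part and an E_out estimate on the held-out part) might be sketched as follows. The train_fn/predict_fn parameters are hypothetical placeholders for whichever perceptron variant is being evaluated, and the toy majority-label classifier exists only to make the example runnable:

```python
import numpy as np

def split_and_evaluate(X, y, n_test, train_fn, predict_fn, seed=0):
    """Randomly split off n_test examples, train on the rest, and
    return E_in (training error) and an estimate of E_out (test error)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    test, train = idx[:n_test], idx[n_test:]
    model = train_fn(X[train], y[train])
    E_in = np.mean(predict_fn(model, X[train]) != y[train])
    E_out = np.mean(predict_fn(model, X[test]) != y[test])
    return E_in, E_out

# Runnable demo with a trivial "predict the majority label" classifier.
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.array([1, 1, 1, 1, 1, 1, 1, -1, -1, -1])
train = lambda X, y: 1 if (y == 1).sum() >= (y == -1).sum() else -1
predict = lambda m, X: np.full(len(X), m)
E_in, E_out = split_and_evaluate(X, y, n_test=3, train_fn=train, predict_fn=predict)
print(E_in, E_out)
```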
Whenever we learn a classifier it is useful to know if we have collected a sufficient amount of data for accurate classification. A good way of determining that is to construct a learning curve, which is a plot of classifier accuracy as a function of the number of training examples. Plot a learning curve for the perceptron algorithm (with bias) using the Gisette dataset. The x-axis for the plot (number of training examples) should be on a logarithmic scale - something like 10,20,40,80,200,400,800. Use numbers that are appropriate for the dataset at hand, choosing values that illustrate the variation that you observe. What can you conclude from the learning curve you have constructed?
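One possible sketch of the learning-curve computation. The nearest-centroid classifier below is a stand-in of our own so the demo runs end to end; substitute your perceptron-with-bias, and plot the result with a logarithmic x-axis:

```python
import numpy as np

def learning_curve(X_train, y_train, X_test, y_test, sizes, train_fn, predict_fn):
    """Test-set accuracy as a function of the number of training examples.
    train_fn and predict_fn are placeholders for your own classifier."""
    acc = []
    for m in sizes:
        model = train_fn(X_train[:m], y_train[:m])
        acc.append(np.mean(predict_fn(model, X_test) == y_test))
    return acc

# Stand-in classifier (nearest class centroid) for a runnable demo.
def train_centroid(X, y):
    return X[y == 1].mean(axis=0), X[y == -1].mean(axis=0)

def predict_centroid(model, X):
    mp, mn = model
    dp = ((X - mp) ** 2).sum(axis=1)
    dn = ((X - mn) ** 2).sum(axis=1)
    return np.where(dp < dn, 1, -1)

# Toy data: two well-separated Gaussian classes, interleaved so every
# training prefix contains both labels.
rng = np.random.default_rng(1)
X = np.empty((120, 2))
X[0::2] = rng.normal(2.0, 1.0, (60, 2))
X[1::2] = rng.normal(-2.0, 1.0, (60, 2))
y = np.tile([1, -1], 60)
X_train, y_train, X_test, y_test = X[:100], y[:100], X[100:], y[100:]

sizes = [4, 10, 20, 50, 100]
acc = learning_curve(X_train, y_train, X_test, y_test, sizes,
                     train_centroid, predict_centroid)
# Plot accuracy against sizes on a logarithmic x-axis, e.g. with
# matplotlib: plt.semilogx(sizes, acc, marker="o")
```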
In this section we will explore the effect of normalizing the data, focusing on normalization of features. The simplest form of normalization is to scale each feature to be in the range [-1, 1]. We'll call this scaling.
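A minimal sketch of such scaling, using a helper function of our own invention; note that the column minima and maxima should be computed on the training data only and then reused to scale the test data:

```python
import numpy as np

def scale_features(X):
    """Scale each feature (column) of X to the range [-1, 1].
    Constant columns are mapped to 0 to avoid division by zero."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)
    scaled = 2.0 * (X - lo) / span - 1.0
    return np.where(hi > lo, scaled, 0.0)

X = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])
print(scale_features(X))  # each column now spans [-1, 1]
```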
Here's what you need to do:
Your report needs to be written in LaTeX. Here are some files to help you start playing with LaTeX and writing your report. Download and extract the files from start_latex.tar. You will now have the following files:
The Makefile contains the commands required for generating a PDF file from the LaTeX source; the archive also includes other required files. On a Unix/Linux system that has LaTeX installed you can just run
> make
The file listings-python-options.sty is a LaTeX style file that tweaks the parameters of the listings LaTeX package so that Python code is displayed nicely.
Submit your report via Canvas. Python code can be included in your report if it is succinct (no more than a page or two at most) or submitted separately. The LaTeX sample document shows how to display Python code in a LaTeX document. Also, please check in a text file named README that describes what you found most difficult in completing this assignment (or provide that as a comment on ramct).
Here is what the grade sheet will look like for this assignment. A few general guidelines for this and future assignments in the course:
Grading sheet for assignment 1

Part 1: 45 points.
  (25 points): Correct implementation of the classifiers
  (10 points): Good protocol for evaluating classifier accuracy; results are provided in a clear and concise way
  (10 points): Discussion of the results and the modified perceptron algorithm

Part 2: 20 points.
  (15 points): Learning curves are correctly generated and displayed in a clear and readable way
  ( 5 points): Discussion of the results

Part 3: 20 points.
  ( 5 points): How to perform data scaling
  (10 points): Comparison of normalized/raw data results; discussion of results
  ( 5 points): Comparison of scaling and standardization

Report structure, grammar and spelling: 15 points
  ( 5 points): Heading and subheading structure easy to follow and clearly divides report into logical sections
  ( 5 points): Code, math, figure captions, and all other aspects of report are well-written and formatted
  ( 5 points): Grammar, spelling, and punctuation