Assignment 2

Due date: Friday 9/17 at 11:59pm

Datasets

In this assignment we will use the following datasets:

Heart disease diagnosis dataset. Here's a processed version in csv format that is ready to use.
The Gisette handwritten digit recognition dataset.
The QSAR data for predicting the biochemical activity of a molecule.

Part 1: Variants of the perceptron algorithm

In this assignment you will work with several variants of the perceptron algorithm:

The “vanila” version of the perceptron algorithm, which was introduced in class.
The pocket algorithm as described in the slides or page 80 in the book.
The adatron version of the perceptron described next.

In each case make sure that your implementation of the classifier includes a bias term (in slide set 2 and page 7 in the book you will find guidance on how to add a bias term to an algorithm that is expressed without one).

The adatron

Before we get to the adatron, we will derive an alternative form of the perceptron algorithm — the dual perceptron algorithm. All we need to look at is the weight update rule:

$$\mathbf{w} \rightarrow \mathbf{w} + \eta y_i \mathbf{x}_i.$$

This is performed whenever example $i$ is misclassified by the current weight vector. The thing to notice, is that the weight vector is always a weighted combination of the training examples since it is that way to begin with, and each update maintains that property. So in fact, rather than representing $\mathbf{w}$ explicitly, all we need to do is to keep track of how much each training example is contributing to the value of the weight vector, i.e. we will express it as:

$$\mathbf{w} = \sum_{i=1}^N \alpha_i y_i \mathbf{x}_i,$$

where $\alpha_i$ are positive numbers that describe the magnitude of the contribution $\mathbf{x}_i$ is making to the weight vector, and $N$ is the number of training examples.

Therefore to initialize $\mathbf{w}$ to 0, we simply initialize $\alpha_i = 0$ for $i = 1,\ldots,N$. In terms of the variables $\alpha_i$, the perceptron update rule becomes:

$$\alpha_i = \alpha_i + \eta y_i,$$

and you can always retrieve the weight vector using its expansion in terms of the $\alpha_i$.

Now we're ready for the adatron - the only difference is in the initialization and update equation.

Initialization:

$\alpha_i = 1$ for $i = 1,\ldots,N$

Like in the perceptron we run the algorithm until convergence, or until a fixed number of epochs has passed (an epoch is a loop over all the training data), and an epoch of training consists of the following procedure:

for each training example $i=1,\ldots,N$ perform the following steps:

$\gamma = y_i * \mathbf{w}^{t} \mathbf{x}_i$
$\delta\alpha = \eta * (1 - \gamma)$
if $(\alpha_i + \delta\alpha < 0)$ :
$~~~~\alpha_i = 0$
else :
$~~~~\alpha_i = \alpha_i + \delta\alpha$

The variable $\eta$ plays the role of the learning rate $\eta$ employed in the perceptron algorithm and $\delta \alpha$ is the proposed magnitude of change in $\alpha_i$. We note that the adatron tries to maintain a sparse representation in terms of the training examples by keeping many $\alpha_i$ equal to zero. The adatron converges to a special case of the SVM algorithm that we will learn later in the semester; this algorithm tries to maximize the margin with which each example is classified, which is captured by the variable $\gamma$ in the algorithm (notice that the magnitude of change proposed for each $\alpha_i$ becomes smaller as the margin increases towards 1).

Note: if you observe an overflow issues in running the adatron, add an upper bound on the value of $\alpha_i$.

Here's what you need to do:

Implement the pocket algorithm and the adatron; each classifier should be implemented in a separate Python class (they can all be in the same module), and use the same interface used in the code provided for the perceptron algorithm, i.e. provides the same methods. Make sure each classifier you use (including the original version of the perceptron) implements a bias term.
Compare the performance of these variants of the perceptron on the Gisette and QSAR datasets by computing an estimate of the out of sample error on a sample of the data that you reserve for testing (the test set). In each case reserve about 60% of the data for training, and 40% for testing. To gain more confidence in our error estimates, repeat this experiment using 10 random splits of the data into training/test sets. Report the average error and its standard deviation in a LaTex table. Is there a version of the perceptron that appears to perform better? (In answering this, consider the differences in performance you observe in comparison to the standard deviation).

Part 2: Learning Curves

Whenever we train a classifier it is useful to know if we have collected a sufficient amount of data for accurate classification. A good way of determining that is to construct a learning curve, which is a plot of classifier performance (i.e. its error) as a function of the number of training examples. Plot a learning curve for the perceptron algorithm (with bias) using the Gisette dataset. The x-axis for the plot (number of training examples) should be on a logarithmic scale - something like 10,20,40,80,200,400,800. Use numbers that are appropriate for the dataset at hand, choosing values that illustrate the variation that you observe. What can you conclude from the learning curve you have constructed? Make sure that you use a fixed test set to evaluate performance while varying the size of the training set.

Part 3: Data normalization

In this section we will explore the effect of normalizing the data, focusing on normalization of features. The simplest form of normalization is to scale each feature to be in the range [-1, 1]. We'll call this scaling.

Here's what you need to do:

Explain how to scale the data to be in the range [-1, 1].
Compare the accuracy of the perceptron with bias on the original data and the scaled version of the heart dataset. Does one of them lead to better performance? Explain why you think this happens.
An alternative way of normalizing the data is to standardize it: for each feature subtract the mean and divide by its standard deviation. What can you say about the range of the resulting features in this case?

Submission

Submit your report via Canvas. Python code can be displayed in your report if it is short, and helps understand what you have done. The sample LaTex document provided in assignment 1 shows how to display Python code. Submit the Python code that was used to generate the results as a file called assignment2.py (you can split the code into several .py files; Canvas allows you to submit multiple files). Typing

$ python assignment2.py

should generate all the tables/plots used in your report.

Grading

A few general guidelines for this and future assignments in the course:

Your answers should be concise and to the point.
You need to use LaTex to write the report.
The report is well structured, the writing is clear, with good grammar and correct spelling; good formatting of math, code, figures and captions (every figure and table needs to have a caption that explains what is being shown).
Whenever you use information from the web or published papers, a reference should be provided. Failure to do so is considered plagiarism.

We will take off points if these guidelines are not followed.

Always provide a description of the method you used to produce a given result in sufficient detail such that the reader can reproduce your results on the basis of the description. You can use a few lines of python code or pseudo-code.
You can provide results in the form of tables, figures or text - whatever form is most appropriate for a given problem. There are no rules about how much space each answer should take. BUT we will take off points if we have to wade through a lot of redundant data.
In any machine learning paper there is a discussion of the results. There is a similar expectation from your assignments that you reason about your results. For example, for the learning curve problem, what can you say on the basis of the observed learning curve?

Grading sheet for assignment 1

Part 1:  60 points.
(25 points):  Correct implementation of the classifiers
(10 points):  Good protocol for evaluating classifier accuracy; results are provided in a clear and concise way
(10 points):  Discussion of the results

Part 2:  20 points.
(15 points):  Learning curves are correctly generated and displayed in a clear and readable way
( 5 points):  Discussion of the results

Part 3:  20 points.
( 5 points):  How to perform data scaling
(10 points):  Comparison of normalized/raw data results; discussion of results
( 5 points):  Range of features after standardization