Assignment 1

Due date: 9/4 at 11:59pm

Preliminaries

We need to start with a little bit of notation… In supervised learning we work with a dataset of $N$ labeled examples: $\mathcal{D} = \{ (\mathbf{x}_i, y_i) \}_{i=1}^N$, where $\mathbf{x}_i$ is a $d$-dimensional vector (we always use boldface to denote vectors), and $y_i$ is the label associated with $\mathbf{x}_i$. In a binary classification problem we'll usually use the values $\pm 1$ to denote positive/negative examples.

Part 1: Measuring classifier error

First let's recall that the estimate of a classifier's error is given by:

$$E(h) = \frac{1}{N}\sum_{i=1}^N I(h(\mathbf{x}_i) \neq y_i),$$

where $I(\cdot)$ is the indicator function, and $h$ is the model/hypothesis we are trying to evaluate.

Whenever training a classifier, we like to know how well it's performing. This is done by computing an estimate of the out of sample error: pick an independent test set that was not used during training and compute the error of your classifier on this dataset (the test set). You always want to know that your classifier is learning something, i.e. the error is smaller than what we would expect by chance, i.e. better than a model that simply guesses or picks a fixed answer. Consider the following classifier, which always classifies an example as belonging to the majority class, i.e. the class to which the largest number of training examples belong to.

Answer the following:

Suppose you have data that is very imbalanced, and let's say for concreteness that we're working with a binary classification problem where the number of negative examples is much larger than the number of positive examples. What can you say about the estimated error of the majority classifier? What issue does that raise in your opinion about evaluating classifiers?

To solve the issue with the standard error rate, it has been suggested to assign different costs to different types of errors using a cost matrix $c(y_i, h(\mathbf{x}_i))$, where $y_i$ is the actual class of example $i$, and $h(\mathbf{x}_i)$ is the the predicted class. For a binary classification problem this is a $2 x 2$ matrix, and we'll assume there is no cost associated with a correct classification, which leaves two components to be determined:

$c_r = c(+1, -1)$, which is the reject cost (the cost of a false negative)
$c_a = c(-1, +1)$, which is the accept cost (the cost of a false positive).

Incorporating the cost matrix into computing the error: (clarification 8/27 at 5:19 pm)

The regular error $$E(h) = \frac{1}{N}\sum_{i=1}^N I(h(\mathbf{x}_i) \neq y_i)$$ is now replaced with: $$E_{cost}(h) = \frac{1}{N}\sum_{i=1}^N c(y_i, h(\mathbf{x}_i)) \cdot I(h(\mathbf{x}_i) \neq y_i)$$

With these definitions, answer the following:

How should we choose $c_r$ and $c_a$ such that the majority classifier and the minority classifier both have an error of 0.5? (The minority classifier is analogous to the majority classifier, except that it classifies everything as positive, since we assumed the positive class has fewer representatives). Section 1.4.1 in the book has a brief discussion of error measures.

Part 2: The nearest centroid classifier

The closest centroid classifier classifies a data point $\mathbf{x}$ according to the class of the nearest centroid. More formally, let $C_k$ be the set of examples that belong to class $k$, and let $$\mu_k = \frac{1}{|C_k|} \sum_{i \in C_k} \mathbf{x}_i,$$ where $|C_k|$ is the cardinality of the set $C_k$. The closest centroid classifier predicts the class of a point $\mathbf{x}$ as:

$$h(\mathbf{x}) = \textrm{argmin}_k ||\mathbf{x} - \mu_k||,$$ where $||\mathbf{x}||$ is the Euclidean norm of $\mathbf{x}$.

Show that for a binary classification problem where the number of positive examples equals the number of negative examples the nearest centroid classifier can be expressed as a linear classifier with the weight vector $$\mathbf{w} = \frac{1}{N}\sum_{i=1}^N y_i \mathbf{x}_i.$$ Hint: consider the vector that connects the centroids of the two classes and draw a figure in two dimensions to help you think about the problem. Also note that this form only holds if the two classes have equal number of examples, so we'll assume that is the case.

Part 3: Are my features useful?

In order to obtain an accurate classifier you need good features. But what does that mean? In this task we will explore what that means, and how to visually inspect a dataset to identify useful features.

First we need some data… the UCI machine learning repository contains a large selection of datasets that we can experiment with. Choose a binary classification problem (some examples include the Heart , disease diagnosis dataset the Gisette handwritten digit recognition dataset, and a QSAR dataset ( QSAR refers to models that aim at predicting the biochemical activity of molecules, usually in the context of their usefulness as drug molecules). Most data will come as a data matrix in csv or related formats (here is a processed version of the heart dataset). Each row in the file corresponds to a training example. You can ignore the first column; the second column contains the label (+1 or -1), and the rest of the columns contain the feature vectors. Note that the first row in the file is a comment, and needs to be ignored. To read a data matrix you can use numpy's genfromtxt function. For example, to read the heart dataset you can use the following command:

>>> import numpy as np
>>> data=np.genfromtxt("heart.data", delimiter=",", comments="#")

And note that since the file contains both the labels and data points, they need to be separated out. As an alternative you can use the usecols option of genfromtxt to directly read the columns you are interested in.

For the Gisette dataset the feature matrix is provided separately from the labels. To read the data matrix of the Gisette dataset:

>>> X=np.genfromtxt("gisette_train.data")

The QSAR dataset can be loaded in a similar way.

Our objective in this task is to visualize the usefulness of the features that make up the dataset. For a given feature (a column in the feature matrix), generate two histograms of its values: one for the positive examples, and one for the negative examples. Use matplotlib's hist function to generate the histogram and use the normed=True option to generated a histogram normalized to be a distribution. The matplotlib website has a lot of usage examples. Here are some examples that demonstrate what you can do with its hist function.

What does this kind of visualization tell us about the usefulness of a feature for classifying a dataset? Demonstrate this idea using a dataset from the UCI repository: plot histograms for four features, two of which you think are going to be useful, and two that have a more limited usefulness in your opinion – Explain!! In plotting, create a single plot composed of four subplots, one for each feature. This is a convenient way of grouping together related plots. When choosing which features to display, simply use your judgement as to which ones you should show.

Your Report

Your report needs to be written in LaTex. You can either use LaTex by installing it on your computer, or using one of the (very good) online LaTex solutions ( Overleaf or ShareLatex). Here are some files to help you start playing with LaTex and writing your report. Download and extract the files from start_latex.tar. You will now have the following files:

start.tex
listings-python-options.sty
start.bib
wowTest.py
sinecurve.py
sine.pdf
Makefile

The Makefile contains commands required for generating a pdf file out of the latex source, and other files that are required. On a Unix/Linux that has Latex you can just run

$ make

The file listings-python-options.sty is a latex style file that tweaks the parameters of the listings latex package to makes it such that Python code is displayed nicely.

Submission

Submit your report via Canvas. Python code can be displayed in your report if it is short, and helps understand what you have done. The latex sample document shows how to display Python code in a latex document. Submit the Python code that was used to generate the results as a file called assignment1.py. Typing

$ python assignment1.py

should generate all the tables/plots used in your report.

Grading

Here is what the grade sheet will look like for this assignment. A few general guidelines for this and future assignments in the course:

Your answers should be concise and to the point. We will take off points if that is not the case.
Always provide a description of the method you used to produce a given result in sufficient detail such that the reader can reproduce your results on the basis of the description. You can use a few lines of python code or pseudo-code.

Grading sheet for assignment 1

Part 1:  30 points.

Part 2:  30 points.

Part 3:  30 points.
(15 points):  Histograms of informative/non-informative features.
(15 points):  Discussion of the plots.

Report structure, grammar and spelling:  10 points
( 5 points):  Heading and subheading structure easy to follow and
              clearly divide the report into logical sections.
( 5 points):  Code, math, figure captions, and all other aspects of the 
              report are well-written and formatted.  Correct grammar, spelling, and punctuation.