{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Assignment 2: the perceptron\n", "\n", "Due date: Friday 9/21 at 11:59pm\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Datasets\n", "\n", "In this assignment we will use the following datasets:\n", " * The [Gisette](http://archive.ics.uci.edu/ml/datasets/Gisette) handwritten digit recognition dataset. \n", " * The [QSAR](http://archive.ics.uci.edu/ml/datasets/QSAR+biodegradation) data for predicting the biochemical activity of a molecule.\n", " * The [Heart disease diagnosis](http://archive.ics.uci.edu/ml/datasets/Heart+Disease) dataset.\n", " * For developing your code, you can use one of the scikit-learn datasets, such as the [breast cancer wisconsin dataset](http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html#sklearn.datasets.load_breast_cancer) and the [make_classification](http://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_classification.html#sklearn.datasets.make_classification) toy dataset generator.\n", " \n", "When writing your notebook, you can assume the datasets are in the same directory as the notebook. Please keep the same file names as in the UCI repository.\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 1: Variants of the perceptron algorithm \n", "\n", "In this assignment you will work with several variants of the perceptron algorithm:\n", "\n", " * The \"vanila\" version of the perceptron algorithm, which was introduced in class.\n", " * The pocket algorithm as described in the slides or page 80 in the book.\n", " * The **adatron** version of the perceptron described next.\n", "\n", "In each case make sure that your implementation of the classifier **includes a bias term** (in slide set 2 and page 7 in the book you will find guidance on how to add a bias term to an algorithm that is expressed without one).\n", "\n", "## The adatron \n", "\n", "Before we get to the adatron, we will derive an alternative form of the perceptron algorithm --- the dual perceptron algorithm. All we need to look at is the weight update rule:\n", "\n", "$$\\mathbf{w} \\rightarrow \\mathbf{w} + \\eta y_i \\mathbf{x}_i.$$\n", "\n", "This is performed whenever example $i$ is misclassified by the current weight vector. The thing to notice, is that the weight vector is always a weighted combination of the training examples since it is that way to begin with, and each update maintains that property. So in fact, rather than representing $\\mathbf{w}$ explicitly, all we need to do is to keep track of how much each training example is contributing to the value of the weight vector, i.e. we will express it as:\n", "\n", "$$\\mathbf{w} = \\sum_{i=1}^N \\alpha_i y_i \\mathbf{x}_i,$$\n", "\n", "where $\\alpha_i$ are positive numbers that describe the magnitude of the contribution $\\mathbf{x}_i$ is making to the weight vector, and $N$ is the number of training examples.\n", "\n", "Therefore to initialize $\\mathbf{w}$ to 0, we simply initialize $\\alpha_i = 0$ for $i = 1,\\ldots,N$. 
"\n", "Now we're ready for the adatron: the only differences are in the initialization and the update equation.\n", "\n", "Initialization:\n", "\n", "$\\alpha_i = 1$ for $i = 1,\\ldots,N$\n", "\n", "As in the perceptron, we run the algorithm until convergence or until a fixed number of epochs has passed (an epoch is one loop over all the training data). An epoch of training consists of the following procedure: for each training example $i=1,\\ldots,N$ perform the following steps:\n", "\n", "1. $\\gamma = y_i \\mathbf{w}^T \\mathbf{x}_i$\n", "2. $\\delta\\alpha = \\eta (1 - \\gamma)$\n", "3. if $\\alpha_i + \\delta\\alpha < 0$ set $\\alpha_i = 0$; otherwise set $\\alpha_i = \\alpha_i + \\delta\\alpha$\n", "\n", "The variable $\\eta$ plays the role of the learning rate employed in the perceptron algorithm, and $\\delta\\alpha$ is the proposed change in $\\alpha_i$.\n", "We note that the adatron tries to maintain a **sparse** representation in terms of the training examples by keeping many $\\alpha_i$ equal to zero. The adatron converges to a special case of the SVM algorithm that we will learn later in the semester; this algorithm tries to maximize the margin with which each example is classified, which is captured by the variable $\\gamma$ in the algorithm (notice that the magnitude of the change proposed for each $\\alpha_i$ becomes smaller as $\\gamma$ increases towards 1).\n", "\n", "**Note:** if you observe overflow issues when running the adatron, add an upper bound on the value of $\\alpha_i$.\n", "\n", "Here's what you need to do:\n", "\n", " - Implement the pocket algorithm and the adatron; each classifier should be implemented in a separate Python class and use the same interface as the code provided for the perceptron algorithm, i.e. provide the same methods with the same signatures. Make sure each classifier you use (including the original version of the perceptron) implements a bias term.\n", " - Compare the performance of these variants of the perceptron on the Gisette and QSAR datasets by computing an estimate of the out-of-sample error on a sample of the data that you reserve for testing (the test set). In each case reserve about 60% of the data for training and 40% for testing. To gain more confidence in your error estimates, repeat this experiment using 10 random splits of the data into training/test sets. Report the average error and its standard deviation in a nicely formatted table. Is there a version of the perceptron that appears to perform better? (In answering this, compare the differences in performance you observe to the standard deviations.)\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Your answer here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 2: Learning Curves\n", "\n", "Whenever we train a classifier it is useful to know whether we have collected a sufficient amount of data for accurate classification. A good way of determining that is to construct a **learning curve**, which is a plot of classifier performance (i.e. its error) as a function of the number of training examples. Plot a learning curve for the perceptron algorithm (with bias) using the Gisette dataset. The x-axis of the plot (the number of training examples) should be on a logarithmic scale, using something like 10, 20, 40, 80, 200, 400, 800. Use numbers that are appropriate for the dataset at hand, choosing values that illustrate the variation that you observe. Make sure that you use a fixed test set to evaluate performance while varying the size of the training set. What can you conclude from the learning curve you have constructed?\n",
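"\n", "As a rough illustration, here is one possible skeleton for such a curve (a sketch only, using the breast cancer development dataset and scikit-learn's `Perceptron` as a stand-in for your own classifier; the training-set sizes are placeholders). Adapt it to the Gisette data and to your own implementations:\n", "\n", "```python\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "from sklearn.datasets import load_breast_cancer\n", "from sklearn.linear_model import Perceptron          # stand-in for your own classifier\n", "from sklearn.model_selection import train_test_split\n", "\n", "# fixed train/test split; only the amount of training data used varies\n", "X, y = load_breast_cancer(return_X_y=True)\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.6, random_state=0)\n", "\n", "sizes = [20, 40, 80, 160, 320]                        # roughly log-spaced training-set sizes\n", "errors = []\n", "for n in sizes:\n", "    clf = Perceptron().fit(X_train[:n], y_train[:n])\n", "    errors.append(np.mean(clf.predict(X_test) != y_test))\n", "\n", "plt.semilogx(sizes, errors, marker='o')\n", "plt.xlabel('number of training examples (log scale)')\n", "plt.ylabel('test error')\n", "plt.show()\n", "```"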
 ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Your answer here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 3: Data normalization\n", "\n", "In this section we will explore the effect of normalizing the data, focusing on normalization of features. The simplest form of normalization is to scale each feature to be in the range [-1, 1]. We'll call this **scaling**.\n", "\n", "Here's what you need to do:\n", "\n", " - Explain how to scale the data to be in the range [-1, 1].\n", " - Compare the accuracy of the perceptron with bias on the original and on the scaled version of the heart dataset. Does one of them lead to better performance? Explain why you think this happens.\n", " - An alternative way of normalizing the data is to **standardize** it: for each feature, subtract the mean and divide by the standard deviation. What can you say about the range of the resulting features in this case?\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Your answer here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Your Report\n", "\n", "Answer the questions in the cells reserved for that purpose.\n", "\n", "Mathematical equations should be written as LaTeX equations; the assignment contains multiple examples of both inline formulas (such as the one exemplifying the notation for the norm of a vector, $||\\mathbf{x}||$) and formulas that appear on separate lines, e.g.:\n", "\n", "$$\n", "||\\mathbf{x}|| = \\sqrt{\\mathbf{x}^T \\mathbf{x}}.\n", "$$\n", "\n", "### Submission\n", "\n", "Submit your report as a Jupyter notebook via Canvas. Running the notebook should generate all the plots and results in your report.\n", "\n", "### Grading\n", "\n", "Here is what the grade sheet will look like for this assignment. A few general guidelines for this and future assignments in the course:\n", "\n", " * Your answers should be concise and to the point. We will take off points if that is not the case.\n", " * Always provide a description of the method you used to produce a given result in sufficient detail such that the reader can reproduce your results on the basis of the description. You can use a few lines of Python code or pseudo-code.\n",
"\n", "Grading sheet for the assignment:\n", "\n", "```\n", "Part 1: 60 points.\n", "(30 points): Correct implementation of the classifiers\n", "(15 points): Good protocol for evaluating classifier accuracy; results are provided in a clear and concise way\n", "(15 points): Discussion of the results\n", "\n", "Part 2: 20 points.\n", "(15 points): Learning curves are correctly generated and displayed in a clear and readable way\n", "( 5 points): Discussion of the results\n", "\n", "Part 3: 20 points.\n", "( 5 points): How to perform data scaling\n", "(10 points): Comparison of normalized/raw data results; discussion of results\n", "( 5 points): Range of features after standardization\n", "```\n", "\n", "Grading will be based on the following criteria:\n", "\n", " * Correctness of answers to math problems\n", " * Math is formatted as LaTeX equations\n", " * Correct behavior of the required code\n", " * Easy-to-understand plots\n", " * Overall readability and organization of the notebook\n", " * Effort in making interesting observations where requested\n", " * Conciseness: points may be taken off if the notebook is overly long\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.5" } }, "nbformat": 4, "nbformat_minor": 1 }