This assignment is purely optional!

**Due:** November 27th at 11:59pm

In this assignment you will get your hands dirty with Theano, a framework that has been the basis of a lot of work in deep learning. Writing code in Theano is very different from what we are accustomed to. You had a taste of it in class, where we saw how to program logistic regression. Your task for this assignment is to implement ridge regression (again!) and explore some variants of it.

Recall that ridge regression is the regularized form of linear regression, and is the linear function

$$h(\mathbf{x}) = \mathbf{w}^{T} \mathbf{x} + b$$

that minimizes the cost function

$$ \frac{1}{N}\sum_{i=1}^N (h(\mathbf{x}_i) - y_i)^2 + \lambda ||\mathbf{w}||_2^2.$$

The first term is the average **loss** incurred over the training set, and the second term is the regularization term.
The regularization we have considered thus far uses the squared L2 norm, $||\cdot||_2^2$.
As discussed in class (see the second slide set that discusses SVMs), there are other options, the primary one being the L1 penalty, $||\mathbf{w}||_1 = \sum_{i=1}^d |w_i|$.
L1 regularization often leads to very sparse solutions, i.e. a weight vector in which many coefficients are zero (or very close to zero). However, plain gradient descent is not directly applicable in this case, since the L1 penalty is not differentiable wherever a coefficient is exactly zero.
A simple remedy is to use a smooth approximation of the L1 norm, defined by:
$$\sum_{i=1}^d \sqrt{w_i^2 + \epsilon}.$$
This function converges to the L1 norm when $\epsilon$ goes to 0, and is a useful surrogate which can be used with gradient descent.
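
To make the two penalties concrete, here is a minimal NumPy sketch of the cost function with a pluggable regularizer. It uses NumPy rather than Theano, and the function names (`l2_penalty`, `smooth_l1_penalty`, `cost`) and the default value of `eps` are illustrative assumptions, not part of the assignment's required interface.

```python
import numpy as np

def l2_penalty(w):
    """Squared L2 norm of the weight vector."""
    return np.sum(w ** 2)

def smooth_l1_penalty(w, eps=1e-8):
    """Smooth surrogate for the L1 norm: sum_i sqrt(w_i^2 + eps).
    The default eps is an arbitrary illustrative choice."""
    return np.sum(np.sqrt(w ** 2 + eps))

def cost(w, b, X, y, lam, penalty):
    """Average squared loss plus lam times the chosen penalty on w."""
    residuals = X @ w + b - y          # h(x_i) - y_i for every example
    return np.mean(residuals ** 2) + lam * penalty(w)

# How far is the surrogate from the exact L1 norm as eps shrinks?
w = np.array([3.0, -4.0, 0.0])
exact_l1 = np.sum(np.abs(w))
gaps = [smooth_l1_penalty(w, eps) - exact_l1 for eps in (1e-2, 1e-4, 1e-8)]
```

Each zero coefficient contributes $\sqrt{\epsilon}$ to the surrogate, so the gap to the exact L1 norm shrinks as $\epsilon$ decreases but never reaches zero for a positive $\epsilon$.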

In this assignment we will also explore using a different loss function. As discussed in class, the squared loss $(h(\mathbf{x}) - y)^2$ is sensitive to outliers. The Huber loss is an alternative that combines a quadratic part for smoothness and a linear part for resistance to outliers. We'll consider a simpler loss function, called the Log-Cosh loss:

$$\log \cosh (h(\mathbf{x}) - y).$$

Recall that $\cosh(z) = \frac{\exp(z) + \exp(-z)}{2}$.
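
As a rough illustration of how this loss behaves, the NumPy sketch below evaluates $\log \cosh(z)$ using an algebraically equivalent form that avoids overflow for large residuals. The helper name `logcosh` is an assumption made for this example, and the same rearrangement would still need to be expressed in Theano for the assignment.

```python
import numpy as np

def logcosh(z):
    """Elementwise log(cosh(z)) that does not overflow for large |z|.

    Uses cosh(z) = exp(|z|) * (1 + exp(-2|z|)) / 2, so that
    log(cosh(z)) = |z| + log(1 + exp(-2|z|)) - log(2).
    """
    a = np.abs(z)
    return a + np.log1p(np.exp(-2.0 * a)) - np.log(2.0)

z = np.array([0.0, 0.1, 5.0, 1000.0])   # residuals h(x) - y
values = logcosh(z)
```

For small residuals the loss is approximately $z^2/2$, and for large residuals it approaches $|z| - \log 2$, which is the quadratic-then-linear behavior that makes it comparable to the Huber loss.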

What you need to do for this assignment:

- Choose two regression datasets, at least one of which you haven't used yet.
- Implement a Python class called `RegularizedRegression` that provides regularized linear regression and offers a choice of two loss functions ('square' or 'logcosh') and two regularizers ('L1' or 'L2'). The class should provide both regular gradient descent and stochastic gradient descent for optimizing the weight vector and bias, and should use Theano for its computations.
- Using your code and the two datasets, compare the rate of convergence of stochastic gradient descent and gradient descent by monitoring the cost function as a function of training time (I have shown such plots in the segment that discussed training of neural networks). Use squared loss and L2 norm in this exploration.
- Next, compare the test set error obtained when using the L1 regularizer in comparison to the L2 regularizer, and when comparing the two loss functions. Comment on the sparsity of the solutions that you obtain using the L1 norm.
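
The convergence comparison described above can be sketched as follows. This is a plain NumPy illustration with hand-written gradients for the squared loss and L2 penalty, not the Theano implementation the assignment asks for. The function names, the synthetic data, and the hyperparameter values (`lam`, `lr`, `epochs`) are all assumptions chosen for the example, and the cost is recorded once per epoch rather than against wall-clock time.

```python
import numpy as np

def cost(w, b, X, y, lam):
    r = X @ w + b - y
    return np.mean(r ** 2) + lam * np.dot(w, w)

def gradients(w, b, X, y, lam):
    """Gradients of the cost evaluated on the rows passed in."""
    r = X @ w + b - y
    grad_w = 2.0 * X.T @ r / len(y) + 2.0 * lam * w
    grad_b = 2.0 * np.mean(r)
    return grad_w, grad_b

def train(X, y, lam=0.01, lr=0.01, epochs=50, batch_size=None, seed=0):
    """batch_size=None gives full-batch gradient descent;
    batch_size=1 gives stochastic gradient descent."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    batch_size = n if batch_size is None else batch_size
    history = [cost(w, b, X, y, lam)]      # cost on the full training set
    for _ in range(epochs):
        order = rng.permutation(n)         # reshuffle the examples each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            gw, gb = gradients(w, b, X[idx], y[idx], lam)
            w -= lr * gw
            b -= lr * gb
        history.append(cost(w, b, X, y, lam))
    return w, b, history

# Synthetic data, only to exercise the code.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
true_w = np.array([1.0, -2.0, 0.0, 0.5, 0.0])
y = X @ true_w + 0.3 + 0.1 * rng.normal(size=200)

_, _, gd_history = train(X, y)                  # one update per epoch
_, _, sgd_history = train(X, y, batch_size=1)   # one update per example
```

Plotting `gd_history` and `sgd_history` against the epoch index gives one view of convergence. To plot against training time instead, a timestamp (for example from `time.perf_counter()`) could be stored alongside each cost value.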

To help you with the implementation, here's a Theano symbolic expression that implements the squared loss (assuming the usual `import theano.tensor as T`):

    squared_loss = T.mean(T.sqr(prediction - y))

In your code, follow the standard interface we have used in coding classifiers; the code I have shown for logistic regression gives you much of what you need for the coding part of this assignment.

Submit your report via Canvas. Python code can be displayed in your report if it is short and helps the reader understand what you have done. The sample LaTeX document provided in assignment 1 shows how to display Python code. Submit the Python code that was used to generate the results as a file called `assignment6.py` (you can split the code into several .py files; Canvas allows you to submit multiple files). Typing

    $ python assignment6.py

should generate all the tables/plots used in your report.

A few general guidelines for this and future assignments in the course:

- Your answers should be concise and to the point.
- You need to use LaTeX to write the report.
- The report should be well structured and clearly written, with good grammar and correct spelling, and with good formatting of math, code, figures and captions (every figure and table needs a caption that explains what is being shown).
- Whenever you use information from the web or published papers, a reference should be provided. Failure to do so is considered plagiarism.
- Always describe the method you used to produce a given result in sufficient detail that the reader can reproduce your results on the basis of the description. You can use a few lines of Python code or pseudo-code.
- You can provide results in the form of tables, figures or text - whatever form is most appropriate for a given problem. There are no rules about how much space each answer should take. BUT we will take off points if we have to wade through a lot of redundant data.
- Any machine learning paper includes a discussion of its results. Your assignments are similarly expected to reason about the results you obtain.

We will take off points if these guidelines are not followed.

Grading sheet for assignment 6:

- (50 points): Correct implementation of regularized ridge regression
- (20 points): Exploration of gradient descent vs stochastic gradient descent
- (30 points): Exploration of loss and regularization term