$\newcommand{\xv}{\mathbf{x}} \newcommand{\Xv}{\mathbf{X}} \newcommand{\yv}{\mathbf{y}} \newcommand{\zv}{\mathbf{z}} \newcommand{\av}{\mathbf{a}} \newcommand{\Wv}{\mathbf{W}} \newcommand{\wv}{\mathbf{w}} \newcommand{\tv}{\mathbf{t}} \newcommand{\Tv}{\mathbf{T}} \newcommand{\muv}{\boldsymbol{\mu}} \newcommand{\sigmav}{\boldsymbol{\sigma}} \newcommand{\phiv}{\boldsymbol{\phi}} \newcommand{\Phiv}{\boldsymbol{\Phi}} \newcommand{\Sigmav}{\boldsymbol{\Sigma}} \newcommand{\Lambdav}{\boldsymbol{\Lambda}} \newcommand{\half}{\frac{1}{2}} \newcommand{\argmax}[1]{\underset{#1}{\operatorname{argmax}}} \newcommand{\argmin}[1]{\underset{#1}{\operatorname{argmin}}}$

Assignment 2: Ridge Regression with K-Fold Cross-Validation

Steve Kommrusch

Some subroutines identified in the writeup were provided by Professor Chuck Anderson

Overview

In this assignment we explore Ridge Regression, which adds a $\lambda$-weighted penalty term to the cost function that our model minimizes:

$$ \sum_{i=1}^N (\tv_i - \xv_i^T \wv)^2 + \lambda \sum_{j=1}^D w_j^2$$

Here $N$ is the number of samples and $D$ is the number of input features; the bias weight $w_0$ is excluded from the penalty.

The effect of $\lambda$ is to help prevent overfitting of the model to the training data. We explore good settings for $\lambda$ using k-fold cross-validation. For the data in this assignment, I use historical stock price data for 9 major technology companies to model the price of Microsoft stock 3 months into the future.
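Minimizing this cost has a closed-form solution. With the inputs standardized and a bias column prepended (as the train function below does), the penalized normal equations give

$$\wv = (\Xv^T \Xv + \lambda \Lambdav)^{-1} \Xv^T \Tv$$

where $\Lambdav$ is the identity matrix with its $(0,0)$ entry set to 0 so that the bias weight is not penalized.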

Method

The 'use' and 'rmse' functions for this assignment are the same ones I wrote in Assignment 1; the train and partitionKFolds functions given below are from the lecture notes by Chuck Anderson, and I wrote the multipleLambdas function for this assignment. Ultimately, for each lambda value provided, multipleLambdas tests a linear model predicting $\Tv$ from $\Xv$ on all $nFolds*(nFolds-1)$ possible choices of test and validation folds. The best $\lambda$ for each of those $nFolds*(nFolds-1)$ fold choices is summarized in the output.
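As a quick illustration of that count (not part of the assignment code), the (test, validate) fold pairs for $nFolds = 4$ can be enumerated directly:

In [ ]:
# Enumerate all ordered (test, validate) fold pairs for nFolds = 4:
# each fold serves as the test fold while each remaining fold serves
# as the validation fold, giving 4*3 = 12 combinations.
nFolds = 4
pairs = [(t, v) for t in range(nFolds) for v in range(nFolds) if t != v]
print(len(pairs))   # 12
print(pairs[:4])    # [(0, 1), (0, 2), (0, 3), (1, 0)]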

In [13]:
# Import necessary libraries
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Define functions from HW1
def use(model, X):
    # Prepend a bias column of ones to the standardized inputs,
    # then apply the linear model's weights.
    Xplus1 = np.ones((X.shape[0], X.shape[1]+1))
    Xplus1[:, 1:] = (X - model['means']) / model['stds']
    return np.dot(Xplus1, model['w'])

def rmse(predict, T):
    # Both args have the same shape; returns the scalar RMSE over all entries.
    return np.sqrt(np.mean((predict-T)**2))
In [14]:
rmse(np.array([[1,2],[2,3]]), np.array([[1.1,1.9], [1.8, 3.1]]))
Out[14]:
0.1323
In [11]:
def rmse_myorig(predict, T):
    # Both args have the same shape; returns one RMSE per column (axis=0).
    return np.sqrt(np.mean((predict-T)**2, axis=0))
In [22]:
print(rmse_myorig(np.array([[1,2],[2,3]]), np.array([[1.1,1.9], [1.8, 3.1]])))
print(rmse_myorig(np.array([1,2,3]),np.array([1,3,3])))
np.mean(np.array([[1,2],[2,3]]))
[ 0.1581  0.1   ]
0.57735026919
Out[22]:
2.0000
In [2]:
# Define train with lambda, from the lecture notes
def train(X, T, lamb):
    means = X.mean(0)
    stds = X.std(0)
    n, d = X.shape
    Xs1 = np.insert((X - means)/stds, 0, 1, axis=1)  # standardize, prepend bias column
    lambDiag = np.eye(d+1) * lamb
    lambDiag[0, 0] = 0     # lambda on the diagonal except at 0,0, so the bias is not penalized
    # Solve the penalized normal equations (Xs1'Xs1 + lambda*I) w = Xs1'T
    w = np.linalg.lstsq(Xs1.T @ Xs1 + lambDiag, Xs1.T @ T)[0]
    return {'w': w, 'means': means, 'stds': stds}
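As an illustrative check of train (my addition, not from the lecture notes): with $\lambda = 0$ the penalty vanishes and ridge regression reduces to ordinary least squares, so a noiseless linear target should be fit almost exactly.

In [ ]:
# Sanity check: with lambda = 0, train solves ordinary least squares,
# so a noiseless linear target is recovered up to numerical precision.
Xdemo = np.random.uniform(0, 10, (50, 2))
Tdemo = 2 * Xdemo[:, 0:1] - 3 * Xdemo[:, 1:2] + 5
print(rmse(use(train(Xdemo, Tdemo, 0), Xdemo), Tdemo))  # ~0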
In [3]:
# Define partitionKFolds from lecture
def partitionKFolds(X,T,nFolds,shuffle=False,nPartitions=3):
    '''Usage: for Xtrain,Ttrain,Xval,Tval,Xtest,Ttest in partitionKFolds(X,T,5):'''
    # Randomly arrange row indices
    rowIndices = np.arange(X.shape[0])
    if shuffle:
        np.random.shuffle(rowIndices)
    # Calculate number of samples in each of the nFolds folds
    nSamples = X.shape[0]
    nEach = int(nSamples / nFolds)
    if nEach == 0:
        raise ValueError("partitionKFolds: Number of samples in each fold is 0.")
    # Calculate the starting and stopping row index for each fold.
    # Store in startsStops as list of (start,stop) pairs
    starts = np.arange(0,nEach*nFolds,nEach)
    stops = starts + nEach
    stops[-1] = nSamples
    startsStops = list(zip(starts,stops))
    # Repeat with testFold taking each single fold, one at a time
    for testFold in range(nFolds):
        if nPartitions == 3:
            # Repeat with validateFold taking each single fold, except for the testFold
            for validateFold in range(nFolds):
                if testFold == validateFold:
                    continue
                # trainFolds are all remaining folds, after selecting test and validate folds
                trainFolds = np.setdiff1d(range(nFolds), [testFold,validateFold])
                # Construct Xtrain and Ttrain by collecting rows for all trainFolds
                rows = []
                for tf in trainFolds:
                    a,b = startsStops[tf]                
                    rows += rowIndices[a:b].tolist()
                Xtrain = X[rows,:]
                Ttrain = T[rows,:]
                # Construct Xvalidate and Tvalidate
                a,b = startsStops[validateFold]
                rows = rowIndices[a:b]
                Xvalidate = X[rows,:]
                Tvalidate = T[rows,:]
                # Construct Xtest and Ttest
                a,b = startsStops[testFold]
                rows = rowIndices[a:b]
                Xtest = X[rows,:]
                Ttest = T[rows,:]
                # Return partition matrices, then suspend until called again.
                yield Xtrain,Ttrain,Xvalidate,Tvalidate,Xtest,Ttest,testFold
        else:
            # trainFolds are all remaining folds, after selecting test fold
            trainFolds = np.setdiff1d(range(nFolds), [testFold])
            # Construct Xtrain and Ttrain by collecting rows for all trainFolds
            rows = []
            for tf in trainFolds:
                a,b = startsStops[tf]                
                rows += rowIndices[a:b].tolist()
            Xtrain = X[rows,:]
            Ttrain = T[rows,:]
            # Construct Xtest and Ttest
            a,b = startsStops[testFold]
            rows = rowIndices[a:b]
            Xtest = X[rows,:]
            Ttest = T[rows,:]
            # Return partition matrices, then suspend until called again.
            yield Xtrain,Ttrain,Xtest,Ttest,testFold
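A minimal usage example for the generator (mirroring its docstring), just reporting partition sizes for the first yielded split:

In [ ]:
# With 20 samples and 5 folds of 4 samples each, the 3-partition case
# yields 12 training, 4 validation, and 4 test rows per partition,
# and 5*4 = 20 partitions in total.
Xdemo = np.arange(20).reshape((-1, 1))
Tdemo = Xdemo * 2
for Xtr, Ttr, Xva, Tva, Xte, Tte, fold in partitionKFolds(Xdemo, Tdemo, 5):
    print(fold, Xtr.shape[0], Xva.shape[0], Xte.shape[0])
    break  # just show the first of the 20 partitions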
In [4]:
def multipleLambdas(X, T, nFolds, lambdas=[0]):
    '''Test partitions with multiple lambdas'''
    foldCount = 0
    results = []   # results gathers a row for every fold option with every lambda option
    for Xtrain,Ttrain,Xval,Tval,Xtest,Ttest,_ in partitionKFolds(X,T,nFolds,shuffle=True):
        for ll in lambdas:
            model = train(Xtrain, Ttrain, ll)
            results.append([foldCount, ll,
                            rmse(use(model,Xtrain),Ttrain),
                            rmse(use(model,Xval),Tval),
                            rmse(use(model,Xtest),Ttest)])
        foldCount += 1
    results = np.array(results)
    bestResults = []  # bestResults keeps the best lambda option for each fold
    for fold in range(foldCount):
        err = float('inf')
        for row in results:
            if fold == row[0] and err > row[3]:   # compare on validation RMSE
                err = row[3]
                bestrow = row
        bestResults.append(bestrow)
    return np.array(bestResults)
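The best-row search at the end of multipleLambdas could equivalently be vectorized; a sketch (my addition, assuming the same [fold, lambda, trainRMSE, valRMSE, testRMSE] row layout):

In [ ]:
# Equivalent per-fold selection using argmin on the validation-RMSE
# column, instead of the explicit double loop above.
def bestPerFold(results):
    results = np.asarray(results)
    best = []
    for fold in np.unique(results[:, 0]):
        rows = results[results[:, 0] == fold]
        best.append(rows[np.argmin(rows[:, 3])])
    return np.array(best)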
In [5]:
%precision 4
Out[5]:
'%.4f'

Data

My data comes from https://quantquote.com/historical-stock-data, which includes daily stock market data for all 500 stocks in the S&P 500 from 2004 through 2013. For this assignment, I chose to predict the Microsoft stock price 3 months into the future using the current stock data for Google, Microsoft, AMD, Nvidia, Intel, Apple, Amazon, IBM, and Hewlett-Packard, as well as the prices of those companies 3 months prior. Hence, there are 18 input variables and 1441 samples. There are 'only' 1441 sample points because I only include dates that have data exactly 3 months before and after; if either of those dates falls on a weekend or market holiday, the row is not used. Shown below are the first 3 lines of the 1441-line Stock.txt file.
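The alignment itself was done outside this notebook; a minimal sketch of the idea (hypothetical, not the actual preprocessing code), using an assumed prices dict mapping a datetime.date to the list of 9 closing prices for that day, might look like:

In [ ]:
# Hypothetical sketch: keep a date only if prices exist exactly 3 months
# before and after it, and emit one row of
# [MSFT price in 3 months, 9 current prices, 9 prices 3 months ago].
MSFT = 7   # assumed index of MSFT within each 9-price list

def alignedRows(prices):
    rows = []
    for d in sorted(prices):
        try:
            before = d.replace(year=d.year - (1 if d.month <= 3 else 0),
                               month=(d.month - 4) % 12 + 1)
            after = d.replace(year=d.year + (1 if d.month >= 10 else 0),
                              month=(d.month + 2) % 12 + 1)
        except ValueError:   # e.g. May 31 -> no Feb 31; skip such dates
            continue
        if before in prices and after in prices:
            rows.append([prices[after][MSFT]] + prices[d] + prices[before])
    return rows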

In [6]:
! head -n 3 Stock.txt
21.2959,29.8475,21.39,38.41,167.47,18.0008,82.8616,18.3426,22.3425,6.42584,15.1085,12.25,39.45,109.95,15.9356,73.5309,17.1511,20.5755,4.12013
21.4396,31.1653,21.57,38.75,174.8,18.0187,83.0181,18.5074,22.4436,6.43894,15.5024,11.92,39.02,105,15.7664,73.557,16.9866,20.5831,3.97603
21.3635,32.9257,22.55,39.77,180,18.3318,83.4006,18.0838,22.9491,6.40946,17.4476,11.68,38.33,99.95,15.9535,73.1232,16.783,20.6662,4.15288

The grep below shows some of the structure of the data. The first column is the price of Microsoft stock 3 months into the future; the current price of Microsoft stock is in column 9; and the price 3 months ago is in column 18. As can be seen below, line 1 of the data starts with the target stock value of 21.2959 in column 1. Line 45 of the data is for 3 months later, with 21.2959 in column 9; and line 88 is for 3 months after that, with 21.2959 in column 18 (this alignment is verified programmatically after the data is loaded below). Note that 3 months is about 40-45 rows of data, which is useful to know for viewing later results.

In [7]:
! grep -n 21.2959 Stock.txt
1:21.2959,29.8475,21.39,38.41,167.47,18.0008,82.8616,18.3426,22.3425,6.42584,15.1085,12.25,39.45,109.95,15.9356,73.5309,17.1511,20.5755,4.12013
45:21.906,42.904,16.6,34.28,193.6,18.3038,80.2152,18.3605,21.2959,9.0918,29.8475,21.39,38.41,167.47,18.0008,82.8616,18.3426,22.3425,6.42584
88:22.8893,38.6491,15.93,35.7,255.29,20.3433,66.8402,20.903,21.906,8.80687,42.904,16.6,34.28,193.6,18.3038,80.2152,18.3605,21.2959,9.0918
In [8]:
data = np.loadtxt('Stock.txt', usecols=range(19), delimiter=',')
T=np.array(data[:,0],ndmin=2).T    # The first column is the Target to predict: Microsoft stock price in 3 months
X=np.array(data[:,1:])  # The rest of the data is current or previous stock values
Stocks = ["AAPL","AMD","AMZN","GOOG","HPQ","IBM","INTC","MSFT","NVDA"]
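As a programmatic check of the alignment described above (data rows are 0-indexed, file lines are 1-indexed):

In [ ]:
# The target in row 0 should reappear as the current MSFT price
# (file column 9, data index 8) in row 44, and as the 3-months-ago
# MSFT price (file column 18, data index 17) in row 87.
print(data[0, 0], data[44, 8], data[87, 17])   # all 21.2959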

Below are shown charts for the 9 stocks included in this analysis. Not shown explicitly are the target to predict (MSFT 3 months in the future) and the charts of all 9 stocks 3 months in the past. Data sample 0 (row 1) is associated with 11/23/2004 and data sample 1440 is associated with 5/8/2013. 3 months is about 40-45 sample positions, and a full year is about 170 samples in this data.

In [9]:
plt.figure(figsize=(15,10))
plt.suptitle('Stock prices', fontsize=40)
for c in range(9):  # Just plot the 9 stocks over time, not +/- 3 month values
    plt.subplot(3,3, c+1)
    plt.plot(data[:,c+1],'-')
    plt.ylabel(Stocks[c])
    plt.xlabel("Sample number (approx time in work days)")
    plt.grid(True)

Results

The results of the multipleLambdas call are shown below. Note that this is stock market data, which has an interesting property: 'easily predictable' patterns get priced out of the data, since a stock that is clearly headed up or down will be bought or sold until the 'easy prediction' is no longer accurate. Given this behavior, one would expect Ridge Regression to have little true predictive power for MSFT stock 3 months out, and the rather large variation in the best $\lambda$ across folds is consistent with this (the data in any particular fold seems to track differently than in other folds). However, since MSFT does trend with other technology stocks, there is value in using current technology stock prices to predict MSFT over simply predicting the average stock price with a very high $\lambda$ (a quick check of this claim appears after the results below).

In [20]:
results = multipleLambdas(X,T,4,np.append(np.arange(0,9.1,0.1),(10,20,100,1000,10000)))
results #Each row is: fold number, best lambda, training RMSE, validation RMSE, and test RMSE.
Out[20]:
array([[  0.    ,   2.2   ,   1.9902,   2.0141,   1.9998],
       [  1.    ,   0.    ,   1.9296,   2.1194,   1.9903],
       [  2.    ,   0.    ,   2.0284,   1.9385,   2.0021],
       [  3.    ,   0.    ,   1.9861,   1.9881,   2.0199],
       [  4.    ,   0.7   ,   1.9133,   2.1244,   2.0272],
       [  5.    ,   1.3   ,   2.023 ,   1.9107,   2.0095],
       [  6.    ,   0.    ,   1.9296,   1.9903,   2.1194],
       [  7.    ,   3.9   ,   1.9223,   2.0206,   2.1289],
       [  8.    ,   0.    ,   1.9636,   1.924 ,   2.122 ],
       [  9.    ,   0.    ,   2.0284,   2.0021,   1.9385],
       [ 10.    ,   5.7   ,   2.0356,   2.0019,   1.9154],
       [ 11.    ,   0.3   ,   1.9637,   2.1219,   1.9241]])
In [21]:
plt.hist(results[:,1]);
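As a quick in-sample check of the claim above (my addition, not part of the assignment code), the model's fit can be compared to the constant baseline of always predicting the mean target:

In [ ]:
# Compare ridge-model RMSE against a constant-mean baseline. This is an
# in-sample comparison over all 1441 rows, so it only illustrates fit,
# not true out-of-sample predictive power.
model0 = train(X, T, 0.7)
print("model RMSE:   ", rmse(use(model0, X), T))
print("baseline RMSE:", rmse(np.full(T.shape, T.mean()), T))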

Now, given the analysis above, I will train a model on the first 7+ years of data and then test it on the final year, using $\lambda = 0.7$, which applies some pressure against overfitting to the training data. First, we'll look at the graph of the model over the 7 years it was trained on.

In [25]:
Xtrain=X[0:1270,:];      Ttrain=T[0:1270,:];
Xtest=X[1270:1441,:];    Ttest=T[1270:1441,:]
# Just use one of the outputs for graphing
model=train(Xtrain,Ttrain,0.7)  # Choose 0.7 based on previous results
predict=use(model,Xtrain)
error=rmse(predict,Ttrain)
plt.plot(Ttrain,'-',label="Actual data")
plt.plot(predict,'-',label="Model prediction")
plt.ylabel("Microsoft stock price")
plt.xlabel("Sample number (approx time in workdays")
plt.legend(loc='best')
plt.grid(True);
error
Out[25]:
array([ 2.019])

As seen above, the model does a fair job tracking the actual price, but given that it is only predicting 40-45 samples ahead, some issues can be seen. For investing, ideally the model would sit neither above nor below the actual price, yet from sample 200 to about 250 the model price rises well ahead of the actual, and for samples 250-300, 600-650, and 950-1000 the model price lags the actual price by 20 days or so.
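A rough, illustrative way to quantify that lag (my addition): shift the prediction later by k samples within one of those windows and see where the RMSE bottoms out:

In [ ]:
# Rough lag check for the 250-300 window noted above: if the model lags
# the actual price by about k samples, RMSE should be smallest near that k.
for k in (0, 10, 20, 30):
    print(k, rmse(predict[250 + k:300 + k], Ttrain[250:300]))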

An interesting observation about the model can be seen below. Over the 7+ years of training data, the most significant variable for predicting MSFT stock in 3 months was IBM stock movement: the current IBM price had the strongest positive weight, and the IBM price 3 months prior had the next strongest contribution, with a negative weight. Surprisingly, the current and prior prices of MSFT itself did not weigh very strongly in the predictive model.

In [26]:
colnames=['Bias']; colnames.extend(Stocks); colnames.extend(Stocks)
for i in range(10,19): colnames[i] += " 3 months ago"
for i in range(len(colnames)): print("%33s: %.3f" % (colnames[i],model['w'][i]))
                             Bias: 24.073
                             AAPL: -0.898
                              AMD: -1.444
                             AMZN: 0.313
                             GOOG: 1.271
                              HPQ: 0.244
                              IBM: 2.226
                             INTC: 0.687
                             MSFT: 0.492
                             NVDA: 0.704
                AAPL 3 months ago: 0.455
                 AMD 3 months ago: 1.211
                AMZN 3 months ago: 0.297
                GOOG 3 months ago: -0.054
                 HPQ 3 months ago: -1.384
                 IBM 3 months ago: -1.995
                INTC 3 months ago: -0.270
                MSFT 3 months ago: 0.196
                NVDA 3 months ago: -0.155

Finally, below is a test of the MSFT price prediction for the last year of data. The RMSE for this period is about 2.1, which is about 7% of the stock price. To describe again how this model could be used, suppose you are at the day associated with sample 60 on the timeline below. Using the current prices of the 9 stocks in the model and their prices from 3 months back, the model predicts that MSFT will be at around 30.0 in 3 months (not yet known to you is that the actual price will be about 26.0). On the day of sample 60, you can also look back and see that 3 months ago (around the day of sample 18) your model predicted MSFT would be at around 28, yet the stock is actually at about 30. So on day 60 your model is telling you that the stock will be flat in 3 months (staying at the current level of 30), when in reality the stock will fall over 10%. Of note is that the model did a pretty good job aligning with the later part of the graph, after sample 120.

To really consider using a model like this for investing, one would want to create a buy/sell output and compare returns using the model recommendations to a simple buy-and-hold strategy.
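A minimal sketch of such a comparison (my addition, using a naive rule and ignoring trading costs): hold MSFT for the next day whenever the model's 3-month-ahead prediction is above today's price, otherwise hold cash:

In [ ]:
# Hypothetical buy/sell sketch, not a serious backtest: compare the
# cumulative return of the naive signal against buy-and-hold over the
# test year. Ignores trading costs, slippage, and proper signal timing.
msft = Xtest[:, 7]                                # current MSFT price column of X
signal = (use(model, Xtest)[:, 0] > msft)[:-1]    # today's hold/cash decision
dailyRet = msft[1:] / msft[:-1] - 1               # next-day simple returns
stratRet = np.where(signal, dailyRet, 0.0)
print("strategy return:    ", np.prod(1 + stratRet) - 1)
print("buy-and-hold return:", msft[-1] / msft[0] - 1)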

In [28]:
predict=use(model,Xtest)
error=rmse(predict,Ttest)
plt.plot(Ttest,'-',label="Actual data")
plt.plot(predict,'-',label="Model prediction")
plt.ylabel("Microsoft stock price")
plt.xlabel("Sample number (approx time in workdays")
plt.legend(loc='best')
plt.grid(True);
error
Out[28]:
array([ 2.0994])

Grading

Your notebook will be run and graded automatically. After running all of the above cells in your notebook, run the code in the following cell to demonstrate an example grading session. You should see a perfect score of 80/100 if your functions are defined correctly. The remaining 20% will be based on the instructor's reading of your notebook. We will be looking for how well the method is explained in text with some LaTeX math, and how well the results are summarized.

As before, first download A2grader.tar and extract A2grader.py from it.

In [29]:
%run -i "A2grader.py"
 Testing:
X = np.arange(20).reshape((-1,1))
T = np.abs(X -10) + X
for Xtrain,Ttrain,Xval,Tval,Xtest,Ttest,_ in partitionKFolds(X,T,5,shuffle=False,nPartitions=3)
20/20 points. partitionKFolds produced 20 partitions. Correct.
10/10 points. Final training set contains 12 samples. Correct.
10/10 points. Final validation set contains 4 samples. Correct.
Testing:
X = np.linspace(0,100,1000).reshape((-1,1))
T = X * 0.1
results = multipleLambdas(X,T,4,range(0,10))
20/20 points. All best lambdas are 0.  Correct.
20/20 points. Mean of all train, validation and test errors for best lambda are correct.

notebooks Grade is 80/100
Up to 20 more points will be given based on the qualty of your descriptions of the method and the results.