$\newcommand{\xv}{\mathbf{x}} \newcommand{\Xv}{\mathbf{X}} \newcommand{\yv}{\mathbf{y}} \newcommand{\zv}{\mathbf{z}} \newcommand{\av}{\mathbf{a}} \newcommand{\Wv}{\mathbf{W}} \newcommand{\wv}{\mathbf{w}} \newcommand{\tv}{\mathbf{t}} \newcommand{\Tv}{\mathbf{T}} \newcommand{\muv}{\boldsymbol{\mu}} \newcommand{\sigmav}{\boldsymbol{\sigma}} \newcommand{\phiv}{\boldsymbol{\phi}} \newcommand{\Phiv}{\boldsymbol{\Phi}} \newcommand{\Sigmav}{\boldsymbol{\Sigma}} \newcommand{\Lambdav}{\boldsymbol{\Lambda}} \newcommand{\half}{\frac{1}{2}} \newcommand{\ebx}[1]{e^{\wv_{#1}^T \xv_n}} \newcommand{\grad}{\mathbf{\nabla}} \newcommand{\eby}[1]{e^{y_{n,#1}}} \newcommand{\argmax}[1]{\underset{#1}{\operatorname{argmax}}} \newcommand{\argmin}[1]{\underset{#1}{\operatorname{argmin}}}$

Assignment 4: Classification with LDA and Logistic Regression

Steve Kommrusch

Key neural network subroutines provided by Professor Chuck Anderson

Overview

The goal of this assignment is to compare LDA (linear discriminant analysis) and linear and nonlinear logistic regression applied to the classification of two data sets. Below is a summary of these 3 methods.

LDA

LDA is a generative model for the data: it estimates the probability that a given sample belongs to each class and chooses the most likely class for the sample. We use a Gaussian model for each class and Bayes' rule to build up the equation for the probability that $C=k$ given an input $\xv$:

$$ \begin{align*} p(C=k|\xv) &= \frac{p(\xv|C=k)p(C=k)}{\sum_{k=1}^K p(\xv|C=k)p(C=k)} \end{align*} $$

For linear discriminant analysis (as opposed to quadratic), all classes share the same covariance matrix, taken to be the prior-weighted average of the per-class covariance matrices, which results in the following discriminant function $\delta_k(\xv)$:

$$ \begin{align*} \delta_k(\xv) = \xv^T \Sigmav^{-1} \muv_k - \frac{1}{2}\muv_k^T \Sigmav^{-1} \muv_k + \log P(C=k) \end{align*} $$

This is linear in $\xv$, and hence can be written as:

$$ \begin{align*} \delta_k(\xv) = \xv^T \wv_k + \text{constant}_k \end{align*} $$

The class predicted for the sample is the one with the highest discriminant function.
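For concreteness, here is a minimal NumPy sketch (not the assignment code) that evaluates $\delta_k(\xv)$ for a small hypothetical two-class problem; the covariance, means, and priors below are stand-ins for quantities that would be estimated from data:

import numpy as np

# Hypothetical class statistics for a 2-feature, 2-class problem
Sigma = np.array([[1.0, 0.2], [0.2, 1.5]])            # shared (pooled) covariance
mus = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]    # class means mu_k
priors = [0.5, 0.5]                                   # P(C=k)
SigmaInv = np.linalg.inv(Sigma)

def ldaDiscriminants(x):
    # delta_k(x) = x^T Sigma^-1 mu_k - 0.5 mu_k^T Sigma^-1 mu_k + log P(C=k)
    return [x @ SigmaInv @ mu - 0.5 * mu @ SigmaInv @ mu + np.log(p)
            for mu, p in zip(mus, priors)]

x = np.array([1.5, 0.5])
print(np.argmax(ldaDiscriminants(x)))   # index of the predicted class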

Linear logistic regression

Linear logistic regression is a discriminative model. Instead of creating a generative model for the data in a given class like LDA, we want to learn network weights that maximize $P(C=k\,|\,\xv_n)$ and the combined data likelihood:

$$ \begin{align*} g_k(\xv_n) & = P(C=k\,|\,\xv_n) = \frac{f(\xv_n;\wv_k)}{\sum_{m=1}^{K} f(\xv_n;\wv_m)}\\ f(\xv_n;\wv_k) & = \ebx{k}; \\ L(\wv) & = \prod_{n=1}^N \prod_{k=1}^K p(C=k\,|\, \xv_n) ^{t_{n,k}}\\ & = \prod_{n=1}^N \prod_{k=1}^K g_k(\xv_n)^{t_{n,k}} \end{align*} $$

Here $t_{n,k}$ are class indicator variables taking the value 0 or 1. The gradient of the log likelihood with respect to $\wv_j$ is:

$$ \begin{align*} \grad_{\wv_j} LL(\wv) & = \sum_{n=1}^N \sum_{k=1}^K \frac{t_{n,k}}{g_k(\xv_n)} \grad_{\wv_j} g_k(\xv_n)\\ & = \sum_{n=1}^N \xv_n (t_{n,j} - g_j(\xv_n)) \end{align*} $$

which results in the following gradient-ascent update rule for $\wv_j$, with learning rate $\alpha$:

$$ \begin{align*} \wv_j \leftarrow \wv_j + \alpha \sum_{n=1}^N (t_{n,j} - g_j(\xv_n)) \xv_n \end{align*} $$
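As a minimal sketch of this update (separate from the SCG-based training used later in the assignment), the softmax and gradient-ascent step can be written directly in NumPy; here X is assumed to already include a constant bias column and Tind holds the 0/1 indicator variables:

import numpy as np

def softmax(Y):
    # g_k(x_n); subtracting the row max keeps the exponentials numerically stable
    Y = Y - Y.max(axis=1, keepdims=True)
    expY = np.exp(Y)
    return expY / expY.sum(axis=1, keepdims=True)

def linearLogRegGA(X, Tind, alpha=0.001, nSteps=100):
    # X: (N, D) inputs with bias column, Tind: (N, K) 0/1 class indicators
    W = np.zeros((X.shape[1], Tind.shape[1]))
    for _ in range(nSteps):
        G = softmax(X @ W)               # class probabilities g_k(x_n)
        W += alpha * X.T @ (Tind - G)    # the gradient-ascent update above
    return W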

Nonlinear logistic regression

Nonlinear logistic regression is similar to linear logistic regression except that a multi-layer neural network produces the outputs $y_{n,k}$. Backpropagation and gradient ascent can then be used to converge on good weights for the network, given the gradient of the log likelihood with respect to the weight matrices:

$$ \begin{align*} LL(\Wv) & = \sum_{n=1}^N \sum_{k=1}^K t_{n,k} \log g_{n,k}\;\;\;\;\; \text{ where } g_{n,k} = \frac{\eby{k}}{\sum_{m=1}^K \eby{m}}; \\ \frac{\partial LL(\Wv)}{\partial \Wv_{d,j}} & = \sum_{n=1}^N \sum_{k=1}^K \frac{t_{n,k}}{g_{n,k}} \frac{\partial g_{n,k}}{\partial \Wv_{d,j}}\\ \end{align*} $$
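For intuition about how this gradient reaches the hidden weights, below is a minimal single-hidden-layer sketch using tanh units and plain gradient ascent (bias terms omitted); the assignment itself relies on the provided NeuralNetworkClassifier with scaled conjugate gradient rather than this hand-rolled update:

import numpy as np

def softmax(Y):
    Y = Y - Y.max(axis=1, keepdims=True)
    expY = np.exp(Y)
    return expY / expY.sum(axis=1, keepdims=True)

def oneHiddenLayerStep(X, Tind, V, W, alpha=0.001):
    # One gradient-ascent step on LL(W) for a tanh hidden layer.
    # X: (N, D), Tind: (N, K) indicators, V: (D, H) hidden weights, W: (H, K) output weights
    Z = np.tanh(X @ V)              # hidden-layer activations
    G = softmax(Z @ W)              # g_{n,k}
    dY = Tind - G                   # gradient of LL at the output units
    Wnew = W + alpha * Z.T @ dY     # output-layer update
    dZ = (dY @ W.T) * (1 - Z**2)    # backpropagate through the tanh units
    Vnew = V + alpha * X.T @ dZ     # hidden-layer update
    return Vnew, Wnew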

Required Code

For this assignment, nn2.tar is provided by Professor Chuck Anderson and is used by the subroutines I wrote below. nn2.tar contains the 3 files described below.

  • neuralnetworks.py defines the NeuralNetworkClassifier class, which allows single or multilayer neural networks to be initialized, trained, and used with various methods (specifying 0 hidden units yields linear logistic regression).
  • scaledconjugategradient.py defines scg and related functions for use in the back-propagation phase of neural network training.
  • mlutils.py defines the 'trainValidateTestKFoldsClassification' function which calls my code.

Next, I wrote the following functions that train and evaluate LDA and neural network logistic regression models.

  • model = trainLDA(X,T,parameters)
  • percentCorrect = evaluateLDA(model,X,T)
  • model = trainNN(X,T,parameters)
  • percentCorrect = evaluateNN(model,X,T)

The parameters argument for trainNN is a list containing the hidden layer structure and the number of SCG iterations; the parameters argument for trainLDA is unused. A brief usage sketch follows the function definitions below.

In [1]:
import numpy as np
import mlutils as ml
import neuralnetworks as nn
import qdalda as ql
import sys
%precision 4
import matplotlib.pyplot as plt
%matplotlib inline 
def printResults(label,results):
    print('{:4s} {:>20s}{:>8s}{:>8s}{:>8s}'.format('Algo','Parameters','TrnAcc','ValAcc','TesAcc'))
    print('-------------------------------------------------')
    for row in results:
        # 20 is expected maximum number of characters in printed parameter value list
        print('{:>4s} {:>20s} {:7.2f} {:7.2f} {:7.2f}'.format(label,str(row[0]),*row[1:]))
In [2]:
# X and T are the input and target matrices used by the functions in this cell
def useLDA (model,X):
    Xs = (X-model['means'])/model['stds'];
    K = len(model['unique'])
    discriminant=np.zeros((K,X.shape[0]))
    for ki in range(K):
        mu = model['mus'][ki]
        disc = np.dot(np.dot(Xs,model['covinv']), mu) - 0.5 * np.dot(np.dot(mu.T,model['covinv']), mu) + np.log(model['priors'][ki])
        discriminant[ki]=disc.reshape(-1)
    # Restore actual class value using argmax index from classifier
    return model['unique'][np.argmax(discriminant,axis=0)]

def evaluateLDA (model,X,T):
    # Use model
    predict = useLDA(model,X)
    return np.sum(predict.ravel()==T.ravel()) / float(len(T)) * 100

# trainLDA does not use parameters (needed just for wrapper calls)
def trainLDA(X,T,parms=0):
    means = X.mean(0)
    stds = X.std(0) + 1e-15 # Add in tiny float to avoid divide by zero in next line
    Xs = (X-means)/stds
    Tunique,counts = np.unique(T,return_counts=True)
    priors=counts/np.sum(counts)
    Tind = (T == Tunique).astype(int)
    K = Tind.shape[1]
    cov = np.zeros((Xs.shape[1],Xs.shape[1]))
    mus = []
    for ki in range(K):
        Ttr = (Tind[:,ki]==1).reshape((-1))
        mus.append(np.mean(Xs[Ttr,:],axis=0).reshape((-1,1)))
        cov = np.cov(Xs[Ttr,:].T)*priors[ki] + cov
    return {'mus': mus, 'covinv': np.linalg.pinv(cov), 'priors':priors,
            'means':means, 'stds':stds, 'unique': Tunique}
In [3]:
# X and T are the input and target matrices to train for
# params[0] defines the number of layers and neurons per layer for the network
# params[1] defines the number of training iterations to run
def trainNN(X,T,params):
    nnet = nn.NeuralNetworkClassifier(X.shape[1],params[0],len(np.unique(T)))
    nnet.train(X, T, errorPrecision=1.e-10, weightPrecision=1.e-10, nIterations=params[1])
    return nnet

def evaluateNN(model,X,T):
    predict = model.use(X)
    return np.sum(predict.ravel()==T.ravel()) / float(len(T)) * 100
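Here is a brief usage sketch of the four wrappers defined above, applied to a tiny synthetic two-class problem (the data is hypothetical and only illustrates the calling convention; the real experiments below use the k-fold routine):

Xdemo = np.random.uniform(size=(100, 5))
Tdemo = (Xdemo[:, 0:1] > 0.5).astype(int)       # two classes, 0 and 1

ldaModel = trainLDA(Xdemo, Tdemo)
print(evaluateLDA(ldaModel, Xdemo, Tdemo))      # percent correct on the training data

nnModel = trainNN(Xdemo, Tdemo, [[5], 20])      # 5 hidden units, 20 SCG iterations
print(evaluateNN(nnModel, Xdemo, Tdemo))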

Data and Results

My first data set is the classic MNIST data set for distinguishing 0-9 handwritten numbers. It contains 50,000 28x28 pixel images, 16 samples of which are shown below.

In [4]:
data = np.load('mnist.npz')
X = data['X']
T = data['T'].reshape((-1,1))
In [5]:
X.shape
Out[5]:
(50000, 784)
In [6]:
plt.figure()
for i in range(16):
    plt.subplot(4,4,i+1)
    plt.imshow(-X[i,:].reshape((28,28)),cmap='gray',interpolation='nearest')
    plt.axis('off')
    plt.title(T[i][0])
plt.tight_layout()

LDA

Given this data, below I test LDA classification. Although there are no parameters to explore, the KFolds procedure shows the small spread in the modest accuracy achieved when testing on held-out samples after training on the remaining samples.

In [11]:
resultsLDA = ml.trainValidateTestKFoldsClassification( trainLDA,evaluateLDA, X,T, [None],
                                                       nFolds=5, shuffle=True,verbose=False)
printResults('LDA:',resultsLDA)
Algo           Parameters  TrnAcc  ValAcc  TesAcc
-------------------------------------------------
LDA:                 None   87.36   86.26   87.68
LDA:                 None   87.47   86.24   87.67
LDA:                 None   87.61   86.32   86.49
LDA:                 None   87.72   86.55   86.34
LDA:                 None   87.84   86.88   84.89

Below are the total percentage correct and the confusion matrix for LDA classification when the training set is simply reused as the test set.

In [29]:
model = trainLDA(X,T)
predict = useLDA(model,X)
print("Total percentage accurate: ",evaluateLDA(model,X,T))
ml.confusionMatrix(T,predict,[0,1,2,3,4,5,6,7,8,9]);
Total percentage accurate:  87.048
       0    1    2    3    4    5    6    7    8    9
    ------------------------------------------------------------
 0 | 94.2  0.1  0.4  0.5  0.4  1.8  0.9  0.0  1.5  0.1   (4932 / 4932)
 1 |  0   95.7  0.6  0.4  0.2  0.6  0.1  0.1  2.2  0.2   (5678 / 5678)
 2 |  1.0  3.3 81.5  3.3  2.0  0.4  3.1  0.7  4.3  0.5   (4968 / 4968)
 3 |  0.2  1.6  2.7 84.4  0.4  3.6  0.3  1.7  2.8  2.3   (5101 / 5101)
 4 |  0.1  0.9  0.6  0.0 89.3  0.8  0.6  0.0  0.9  6.7   (4859 / 4859)
 5 |  1.0  1.2  0.4  4.3  1.0 82.9  2.0  0.6  4.3  2.4   (4506 / 4506)
 6 |  1.1  0.8  1.0  0.0  1.4  2.7 91.6  0    1.3  0.1   (4951 / 4951)
 7 |  0.4  2.3  0.7  0.7  2.8  0.2  0.0 83.4  0.4  9.1   (5175 / 5175)
 8 |  0.6  5.8  0.7  3.1  1.3  4.8  0.5  0.2 80.2  2.7   (4842 / 4842)
 9 |  0.5  0.5  0.3  1.5  5.3  0.5  0    4.5  1.1 85.7   (4988 / 4988)

Below is an interesting display of the class mean images $\muv_k$ learned by LDA for each of the 10 classes.

In [35]:
plt.figure()
for i in range(10):
    plt.subplot(2,5,i+1)
    plt.imshow(model['mus'][i].reshape((28,28)),cmap='gray',interpolation='nearest')
    plt.axis('off')
    plt.title(i)
plt.tight_layout()

Linear logistic regression

The neural network classification model implements linear logistic regression when the hidden layer is set to 0 neurons. Below we find that training for 30 iterations is selected as the best parameter setting in each of the 4 folds.

In [16]:
resultsNN = ml.trainValidateTestKFoldsClassification( trainNN,evaluateNN, X,T, 
                                                     [ [[0], 10], [[0], 20], [[0], 30] , [[0], 40]],
                                                     nFolds=4, shuffle=True,verbose=False)
printResults('LLR:',resultsNN)
Algo           Parameters  TrnAcc  ValAcc  TesAcc
-------------------------------------------------
LLR:            [[0], 30]   93.15   91.57   91.99
LLR:            [[0], 30]   93.17   91.63   92.10
LLR:            [[0], 30]   93.31   91.85   91.93
LLR:            [[0], 30]   93.50   91.88   92.00

Below is the confusion matrix that results when the entire set of training data is run through linear logistic regression with 30 iterations. Note that the overall accuracy has increased from 87% with LDA to 93% with linear logistic regression.

In [18]:
model = trainNN(X,T,[ [0], 30])
predict = model.use(X)
print("Total percentage accurate: ",evaluateNN(model,X,T))
ml.confusionMatrix(T,predict,[0,1,2,3,4,5,6,7,8,9]);
Total percentage accurate:  92.902
       0    1    2    3    4    5    6    7    8    9
    ------------------------------------------------------------
 0 | 97.0  0.0  0.3  0.2  0.2  0.7  0.7  0.1  0.5  0.2   (4932 / 4932)
 1 |  0.0 97.4  0.5  0.3  0.1  0.4  0.0  0.3  0.9  0.1   (5678 / 5678)
 2 |  0.6  0.9 91.3  1.4  0.9  0.4  1.0  1.1  1.9  0.3   (4968 / 4968)
 3 |  0.4  0.5  2.0 90.5  0.2  2.9  0.2  0.9  1.6  0.8   (5101 / 5101)
 4 |  0.2  0.5  0.4  0.2 94.0  0.1  0.9  0.6  0.4  2.7   (4859 / 4859)
 5 |  1.0  0.8  0.7  2.6  1.0 89.1  1.3  0.4  2.2  1.0   (4506 / 4506)
 6 |  0.6  0.3  0.5  0    0.8  1.2 96.1  0.1  0.4  0.0   (4951 / 4951)
 7 |  0.2  0.5  1.0  0.4  0.8  0.2  0.1 94.4  0.2  2.4   (5175 / 5175)
 8 |  0.9  2.9  1.1  2.1  0.6  2.4  0.7  0.3 87.5  1.4   (4842 / 4842)
 9 |  0.5  0.4  0.2  1.1  2.4  0.7  0.1  3.3  0.7 90.7   (4988 / 4988)

Nonlinear logistic regression

For nonlinear logistic regression, the time to run KFolds increases as more training iterations are needed. First, I'll test a variety of network sizes and depths with only 40 iterations and 4 folds.

In [20]:
resultsNN = ml.trainValidateTestKFoldsClassification( trainNN,evaluateNN, X,T, 
                                                     [ [[10], 40], [[20], 40], [[10,10], 40] , [[10,10,10], 40]],
                                                     nFolds=4, shuffle=True,verbose=False)
printResults('NLR:',resultsNN)
Algo           Parameters  TrnAcc  ValAcc  TesAcc
-------------------------------------------------
NLR:           [[20], 40]   96.02   92.34   93.61
NLR:           [[20], 40]   93.80   92.95   92.00
NLR:           [[20], 40]   95.17   92.69   93.25
NLR:           [[20], 40]   96.74   93.16   92.95

Based on the success of 20 hidden neurons, I test 60 iterations and 3 folds with a few related network sizes to find an improved network for this classification problem.

In [21]:
resultsNN = ml.trainValidateTestKFoldsClassification( trainNN,evaluateNN, X,T, 
                                                     [ [[20], 60], [[20,10], 60] , [[30], 60]],
                                                     nFolds=3, shuffle=True,verbose=False)
printResults('NLR:',resultsNN)
Algo           Parameters  TrnAcc  ValAcc  TesAcc
-------------------------------------------------
NLR:           [[30], 60]   99.43   93.24   94.30
NLR:           [[30], 60]   99.53   92.58   94.27
NLR:           [[30], 60]   99.54   93.42   93.60

A single hidden layer with 30 neurons is looking pretty good. Now I run a few quick tests on the full training set to see the effect of iteration count on a network with 30 hidden neurons. Since I'm no longer holding out test sets, there is a real risk of overtraining (the test sets in the previous table had much lower accuracy than the training sets), but I'm interested in seeing the confusion matrix at low error rates.

In [23]:
print("Total percentage accurate with  60 iterations: ", evaluateNN(trainNN(X,T,[[30],60]),X,T))
print("Total percentage accurate with 100 iterations: ", evaluateNN(trainNN(X,T,[[30],100]),X,T))
print("Total percentage accurate with 200 iterations: ", evaluateNN(trainNN(X,T,[[30],200]),X,T))
Total percentage accurate with  60 iterations:  97.836
Total percentage accurate with 100 iterations:  99.602
Total percentage accurate with 200 iterations:  99.986

With 200 iterations we have very high accuracy on the training data, so I compute the confusion matrix.

In [24]:
model = trainNN(X,T,[ [30], 200])
predict = model.use(X)
print("Total percentage accurate: ",evaluateNN(model,X,T))
ml.confusionMatrix(T,predict,[0,1,2,3,4,5,6,7,8,9]);
Total percentage accurate:  99.928
       0    1    2    3    4    5    6    7    8    9
    ------------------------------------------------------------
 0 |100.0  0    0    0    0    0    0    0    0.0  0     (4932 / 4932)
 1 |  0  100.0  0    0    0.0  0    0    0    0    0     (5678 / 5678)
 2 |  0.0  0   99.9  0    0.0  0    0.0  0    0.0  0     (4968 / 4968)
 3 |  0    0    0.1 99.8  0.0  0.0  0    0    0.0  0.0   (5101 / 5101)
 4 |  0    0    0.0  0  100.0  0.0  0    0    0    0     (4859 / 4859)
 5 |  0    0    0    0.0  0.0 99.9  0    0    0.0  0     (4506 / 4506)
 6 |  0    0    0    0    0.0  0  100.0  0    0    0     (4951 / 4951)
 7 |  0.0  0.0  0.1  0    0    0    0   99.9  0    0.0   (5175 / 5175)
 8 |  0    0    0    0.0  0.0  0.0  0    0.0 99.9  0     (4842 / 4842)
 9 |  0    0    0    0    0.0  0    0    0    0.0 99.9   (4988 / 4988)

Here are some of the images that are still misclassified:

In [27]:
plt.figure()
n=0;
for i in range(40000):
    if (predict[i]!=T[i] and n<16):
        plt.subplot(4,4,n+1)
        plt.imshow(-X[i,:].reshape((28,28)),cmap='gray',interpolation='nearest')
        plt.axis('off')
        plt.title('mispredict {0}'.format(predict[i][0]))
        n=n+1
plt.tight_layout()

My second data set is based on 4 shapes: circle, square, triangle, and line. They vary in size from 16 to 28 pixels across in a 30x30 grid and appear in several rotations. The goal is to classify the shape; the data is generated below. I thought it would be interesting to test the classification code on data that is easily understood and recognized by a human.

In [36]:
classname=('circle','square','triangle','line')
# generate 200 30x30 pixel examples of each class with sizes (radii) 8-10 and 12-14
X=np.zeros((4*200,30*30))
T=np.zeros((4*200,1))
for n in range(200):  # Generate 200 random circles
    r = np.random.uniform(8,12)
    if (r > 10.0):
        r=r+2.0
    T[n][0] = 0; # Circle
    for i in range(900):
        x = int(i/30)-14.5
        y = (i%30)-14.5
        if (x*x + y*y < r*r):
            X[n][i]=1.0
for n in range(200,400):  # Generate 200 random squares
    r = np.random.uniform(8,12)
    if (r > 10.0):
        r=r+2.0
    T[n][0] = 1; # Square
    orientation = np.random.uniform(0,2)
    for i in range(900):
        x = int(i/30)-14.5
        y = (i%30)-14.5
        if (orientation < 1.0):  # rotate 0 or 45 degrees
            if (x > -r and x < r and y > -r and y < r):
                X[n][i]=1.0
        else:
            if (x-y > -r and x-y < r and x+y > -r and x+y < r):
                X[n][i]=1.0
for n in range(400,600):  # Generate 200 random triangles
    r = np.random.uniform(8,12)
    if (r > 10.0):
        r=r+2.0
    T[n][0] = 2; # Triangle
    orientation = np.random.uniform(0,4)
    for i in range(900):
        x = int(i/30)-14.5
        y = (i%30)-14.5
        if (orientation < 1.0):  # rotate 0, 90, 180, or 270
            if (x > (y-r)/2 and x < (r-y)/2 and y > -r and y < r):
                X[n][i]=1.0
        elif (orientation < 2.0):
            if (x > (-y-r)/2 and x < (r+y)/2 and y > -r and y < r):
                X[n][i]=1.0
        elif (orientation < 3.0):
            if (y > (x-r)/2 and y < (r-x)/2 and x > -r and x < r):
                X[n][i]=1.0
        else:
            if (y > (-x-r)/2 and y < (r+x)/2 and x > -r and x < r):
                X[n][i]=1.0 
for n in range(600,800):  # Generate 200 random lines
    r = np.random.uniform(8,12)
    if (r > 10.0):
        r=r+2.0
    T[n][0] = 3; # Line    
    orientation = np.random.uniform(0,4)
    for i in range(900):
        x = int(i/30)-14.5
        y = (i%30)-14.5
        if (orientation < 1.0):  # rotate 0,45,90,135
            if (x > -r and x < r and y > -3 and y < 3):
                X[n][i]=1.0
        elif (orientation < 2.0):
            if (x+y > -r and x+y < r and x-y > -3 and x-y < 3):
                X[n][i]=1.0
        elif (orientation < 3.0):
            if (x > -3 and x < 3 and y > -r and y < r):
                X[n][i]=1.0             
        else:
            if (x+y > -3 and x+y < 3 and x-y > -r and x-y < r):
                X[n][i]=1.0                  

# generate 100 30x30 pixel tests of each class with sizes 10-12
Xtst=np.zeros((4*100,30*30))
Ttst=np.zeros((4*100,1))
for n in range(100):  # Generate 100 random circles
    r = np.random.uniform(10,12)
    Ttst[n][0] = 0; # Circle
    for i in range(900):
        x = int(i/30)-14.5
        y = (i%30)-14.5
        if (x*x + y*y < r*r):
            Xtst[n][i]=1.0
for n in range(100,200):  # Generate 100 random squares
    r = np.random.uniform(10,12)
    Ttst[n][0] = 1; # Square
    orientation = np.random.uniform(0,2)
    for i in range(900):
        x = int(i/30)-14.5
        y = (i%30)-14.5
        if (orientation < 1.0):  # rotate 0 or 45 degrees
            if (x > -r and x < r and y > -r and y < r):
                Xtst[n][i]=1.0
        else:
            if (x-y > -r and x-y < r and x+y > -r and x+y < r):
                Xtst[n][i]=1.0
for n in range(200,300):  # Generate 100 random triangles
    r = np.random.uniform(10,12)
    Ttst[n][0] = 2; # Triangle
    orientation = np.random.uniform(0,4)
    for i in range(900):
        x = int(i/30)-14.5
        y = (i%30)-14.5
        if (orientation < 1.0):  # rotate 0, 90, 180, or 270
            if (x > (y-r)/2 and x < (r-y)/2 and y > -r and y < r):
                Xtst[n][i]=1.0
        elif (orientation < 2.0):
            if (x > (-y-r)/2 and x < (r+y)/2 and y > -r and y < r):
                Xtst[n][i]=1.0
        elif (orientation < 3.0):
            if (y > (x-r)/2 and y < (r-x)/2 and x > -r and x < r):
                Xtst[n][i]=1.0
        else:
            if (y > (-x-r)/2 and y < (r+x)/2 and x > -r and x < r):
                Xtst[n][i]=1.0 
for n in range(300,400):  # Generate 100 random lines
    r = np.random.uniform(10,12)
    Ttst[n][0] = 3; # Line
    orientation = np.random.uniform(0,4)
    for i in range(900):
        x = int(i/30)-14.5
        y = (i%30)-14.5
        if (orientation < 1.0):  # rotate 0,45,90,135
            if (x > -r and x < r and y > -3 and y < 3):
                Xtst[n][i]=1.0
        elif (orientation < 2.0):
            if (x+y > -r and x+y < r and x-y > -3 and x-y < 3):
                Xtst[n][i]=1.0
        elif (orientation < 3.0):
            if (x > -3 and x < 3 and y > -r and y < r):
                Xtst[n][i]=1.0             
        else:
            if (x+y > -3 and x+y < 3 and x-y > -r and x-y < r):
                Xtst[n][i]=1.0      
In [37]:
plt.figure()
for i in range(32):
    plt.subplot(4,8,i+1)
    plt.imshow(-X[i+int(i/8)*192,:].reshape((30,30)),cmap='gray',interpolation='nearest')
    plt.axis('off')
    plt.title(classname[int(T[i+int(i/8)*192][0])])
plt.tight_layout()

The above figure shows various orientations and sizes for the objects in the training set, which includes 'large' and 'small' examples of each type of image. Note that due to pixel aliasing in the image generation step, some squares rotated by 45 degrees look slightly circular and some circles look a bit bulgy. This effect results in classification confusion, as shown in the following analysis.

Below are samples from the test data set, which consists entirely of 'middle'-sized images. None of these specific sizes were in the training set, so these images test how well the classifier learned the concept of each shape.

In [39]:
plt.figure()
for i in range(32):
    plt.subplot(4,8,i+1)
    plt.imshow(-Xtst[i+int(i/8)*92,:].reshape((30,30)),cmap='gray',interpolation='nearest')
    plt.axis('off')
    plt.title(classname[int(Ttst[i+int(i/8)*92][0])])
plt.tight_layout()

LDA

Below I train an LDA classifier on this recognition problem. It is highly accurate on the training data, but has not 'learned' the concept of the shape for use with the test data.

In [64]:
model = trainLDA(X,T)
predict = useLDA(model,Xtst)
print("Training data percentage accurate: ",evaluateLDA(model,X,T))
print("Test data percentage accurate: ",evaluateLDA(model,Xtst,Ttst))
ml.confusionMatrix(Ttst,predict,(0,1,2,3));
Training data percentage accurate:  99.75
Test data percentage accurate:  66.5
       0    1    2    3
    ------------------------
 0 | 13.0 87.0  0    0     (100 / 100)
 1 | 47.0 53.0  0    0     (100 / 100)
 2 |  0    0  100.0  0     (100 / 100)
 3 |  0    0    0  100.0   (100 / 100)

Below are plots of the $\muv_k$ mean images for the classifier, perhaps hinting at why circles and squares get confused. Because the training set does not include middle-sized images, the classifier may have keyed on specific pixel combinations for discrimination instead of concepts like 'corner' and 'curve'.

In [65]:
plt.figure()
for i in range(4):
    plt.subplot(2,2,i+1)
    plt.imshow(model['mus'][i].reshape((30,30)),cmap='gray',interpolation='nearest')
    plt.axis('off')
    plt.title(classname[i])
plt.tight_layout()

Linear Logistic Regression

Below we see that linear logistic regression easily matches LDA's accuracy on the training data, and with only 20 iterations it learns the training data perfectly. Yet the model is still confused by the test data. Again, I expect the model has learned overly precise pixel positions for classifying images. I wonder whether a convolutional layer could be 'forced' to learn features like corners by placing a thin hidden layer after it, which could help the learning generalize better.

In [50]:
print("Training data percentage accurate with 10 iterations: ",evaluateNN(trainNN(X,T,[[0],10]),X,T))
print("Training data percentage accurate with 20 iterations: ",evaluateNN(trainNN(X,T,[[0],20]),X,T))
print("Test data percentage accurate with 10 iterations: ",evaluateNN(trainNN(X,T,[[0],10]),Xtst,Ttst))
print("Test data percentage accurate with 20 iterations: ",evaluateNN(trainNN(X,T,[[0],10]),Xtst,Ttst))
print("Test data percentage accurate with 100 iterations: ",evaluateNN(trainNN(X,T,[[0],10]),Xtst,Ttst))
Training data percentage accurate with 10 iterations:  99.75
Training data percentage accurate with 20 iterations:  100.0
Test data percentage accurate with 10 iterations:  64.75
Test data percentage accurate with 20 iterations:  64.75
Test data percentage accurate with 100 iterations:  64.75

From the 'large' and 'small' training images, the model learned a classification that significantly confuses circles and squares in the 'medium'-sized test images.

In [51]:
model = trainNN(X,T,[ [0], 10])
predict = model.use(Xtst)
print("Total percentage accurate: ",evaluateNN(model,Xtst,Ttst))
ml.confusionMatrix(Ttst,predict,[0,1,2,3]);
Total percentage accurate:  64.75
       0    1    2    3
    ------------------------
 0 |  6.0 94.0  0    0     (100 / 100)
 1 | 47.0 53.0  0    0     (100 / 100)
 2 |  0    0  100.0  0     (100 / 100)
 3 |  0    0    0  100.0   (100 / 100)

Nonlinear logistic regression

The nonlinear model predicted the test images slightly better than the LDA or linear logistic regression classifiers. I would guess that the additional layer allowed some amount of image feature recognition compared to the linear approaches. Nonetheless, it was so easy for the model to reach 100% training accuracy that there may not have been enough pressure to optimize well for the test data. Below I show the search for a good parameter set, which in the end selects 10 neurons in the hidden layer trained for 100 iterations; I then show test confusion matrices for that network trained for 20 iterations and for the full 100 iterations.

In [47]:
resultsNN = ml.trainValidateTestKFoldsClassification( trainNN,evaluateNN, X,T, 
                                                     [ [[10], 100], [[20], 100], [[10,10], 100] , [[10,10,10], 100]],
                                                     nFolds=5, shuffle=True,verbose=False)
printResults('NLR:',resultsNN)
Algo           Parameters  TrnAcc  ValAcc  TesAcc
-------------------------------------------------
NLR:          [[10], 100]  100.00  100.00  100.00
NLR:          [[10], 100]  100.00   99.69  100.00
NLR:          [[20], 100]  100.00   99.91  100.00
NLR:          [[10], 100]  100.00  100.00  100.00
NLR:          [[10], 100]  100.00  100.00  100.00
In [63]:
model = trainNN(X,T,[ [10], 20])
predict = model.use(Xtst)
print("Test data predicted accuracy: ",evaluateNN(model,Xtst,Ttst))
ml.confusionMatrix(Ttst,predict,[0,1,2,3]);
Test data predicted accuracy:  64.75
       0    1    2    3
    ------------------------
 0 |  6.0 94.0  0    0     (100 / 100)
 1 | 47.0 53.0  0    0     (100 / 100)
 2 |  0    0  100.0  0     (100 / 100)
 3 |  0    0    0  100.0   (100 / 100)
In [60]:
model = trainNN(X,T,[ [10], 100])
predict = model.use(Xtst)
print("Test data predicted accuracy: ",evaluateNN(model,Xtst,Ttst))
ml.confusionMatrix(Ttst,predict,[0,1,2,3]);
Test data predicted accuracy:  69.75
       0    1    2    3
    ------------------------
 0 | 26.0 74.0  0    0     (100 / 100)
 1 | 47.0 53.0  0    0     (100 / 100)
 2 |  0    0  100.0  0     (100 / 100)
 3 |  0    0    0  100.0   (100 / 100)

Like the linear models, the nonlinear network learned the training data perfectly:

In [53]:
ml.confusionMatrix(T,model.use(X),[0,1,2,3]);
       0    1    2    3
    ------------------------
 0 |100.0  0    0    0     (200 / 200)
 1 |  0  100.0  0    0     (200 / 200)
 2 |  0    0  100.0  0     (200 / 200)
 3 |  0    0    0  100.0   (200 / 200)

As with the MNIST data, below I show some of the many circles and squares that were mispredicted. As expected, the 45-degree-rotated squares with edges rounded by pixel aliasing were often confused with circles, which is understandable. The circles that were predicted as squares are not obviously misshapen, so again I expect the problem is that image size is not handled well by this learning system.

This shape size analysis helps explain why image recognition systems often scale the object of interest (like handwritten numbers) to a standard size to help the learning engine train properly.
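As a sketch of that idea (not part of the assignment pipeline), a binary shape image could be cropped to its bounding box and resampled to a fixed size with nearest-neighbor indexing before classification; normalizeShape here is a hypothetical helper:

def normalizeShape(img, outSize=20):
    # img: 2-D binary array containing one shape; returns an outSize x outSize crop
    rows, cols = np.nonzero(img)
    crop = img[rows.min():rows.max()+1, cols.min():cols.max()+1]
    # nearest-neighbor resample of the bounding box to the standard size
    rIdx = (np.arange(outSize) * crop.shape[0] / outSize).astype(int)
    cIdx = (np.arange(outSize) * crop.shape[1] / outSize).astype(int)
    return crop[np.ix_(rIdx, cIdx)]

Applying such a normalization to both the training and test shapes might remove much of the size mismatch that confuses these classifiers.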

In [59]:
plt.figure()
n=0;
for i in range(50):
    if (predict[i]!=Ttst[i] and n<8):
        plt.subplot(4,4,n+1)
        plt.imshow(-Xtst[i,:].reshape((30,30)),cmap='gray',interpolation='nearest')
        plt.axis('off')
        plt.title('predicted square')
        n=n+1
for i in range(100,150):
    if (predict[i]!=Ttst[i] and n<16):
        plt.subplot(4,4,n+1)
        plt.imshow(-Xtst[i,:].reshape((30,30)),cmap='gray',interpolation='nearest')
        plt.axis('off')
        plt.title('predicted circle')
        n=n+1
plt.tight_layout()

Additional Testing

This section includes extra tests to ensure the 4 functions I've written perform as expected.

In [34]:
# Here is the example from the assignment, using our automobile MPG data.  
# We quantize the MPG values into 5 intervals, and classify each sample as 
# being in one of these 5 intervals.

def makeMPGData(filename='auto-mpg.data'):
    def missingIsNan(s):
        return np.nan if s == b'?' else float(s)
    data = np.loadtxt(filename, usecols=range(8), converters={3: missingIsNan})
    print("Read",data.shape[0],"rows and",data.shape[1],"columns from",filename)
    goodRowsMask = np.isnan(data).sum(axis=1) == 0
    data = data[goodRowsMask,:]
    print("After removing rows containing question marks, data has",data.shape[0],"rows and",data.shape[1],"columns.")
    X = data[:,1:]
    T = data[:,0:1]
    Xnames =  ['cylinders','displacement','horsepower','weight','acceleration','year','origin']
    Tname = 'mpg'
    return X,T,Xnames,Tname
def makeMPGClasses(T):
    bounds = np.arange(5,45,10)
    Tclasses = -np.ones(T.shape).astype(int)
    for i,mpg in enumerate(T):
        for k in range(len(bounds)-1):
            if bounds[k] < mpg <= bounds[k+1]:
                Tclasses[i] = bounds[k+1]
        if Tclasses[i] == -1:
            Tclasses[i] = 50  # max mpg is 46.6
    return Tclasses
X,T,Xnames,Tname = makeMPGData('auto-mpg.data')
Tclasses = makeMPGClasses(T)
classes,counts = np.unique(Tclasses,return_counts=True)
def printResults(label,results):
    print('{:4s} {:>20s}{:>8s}{:>8s}{:>8s}'.format('Algo','Parameters','TrnAcc','ValAcc','TesAcc'))
    print('-------------------------------------------------')
    for row in results:
        # 20 is expected maximum number of characters in printed parameter value list
        print('{:>4s} {:>20s} {:7.2f} {:7.2f} {:7.2f}'.format(label,str(row[0]),*row[1:]))
resultsLDA = ml.trainValidateTestKFoldsClassification( trainLDA,evaluateLDA, X,Tclasses, [None],
                                                       nFolds=6, shuffle=False,verbose=False)
printResults('LDA:',resultsLDA)
r1=resultsLDA[1]
r5=resultsLDA[5]
if 220 < sum(r1[1:4]) < 230 and 200 < sum(r5[1:4]) < 210:
    print("---Pass---")
else:
    print("***** FAIL ******")
Read 398 rows and 8 columns from auto-mpg.data
After removing rows containing question marks, data has 392 rows and 8 columns.

Algo           Parameters  TrnAcc  ValAcc  TesAcc
-------------------------------------------------
LDA:                 None   81.05   70.73   62.64
LDA:                 None   76.10   67.13   82.56
LDA:                 None   77.18   67.19   85.19
LDA:                 None   77.72   67.36   82.89
LDA:                 None   79.75   71.09   69.01
LDA:                 None   81.73   75.51   50.00
---Pass---
In [35]:
resultsNN = ml.trainValidateTestKFoldsClassification( trainNN,evaluateNN, X,Tclasses, 
                                                     [ [ [0], 10], [[10], 100] ],
                                                     nFolds=6, shuffle=False,verbose=False)
printResults('NN:',resultsNN)
Algo           Parameters  TrnAcc  ValAcc  TesAcc
-------------------------------------------------
 NN:            [[0], 10]   83.42   74.48   51.65
 NN:            [[0], 10]   80.78   65.90   84.88
 NN:          [[10], 100]   92.31   73.12   76.54
 NN:          [[10], 100]   94.43   72.09   82.89
 NN:            [[0], 10]   81.50   72.81   67.61
 NN:            [[0], 10]   82.96   78.98   48.48
In [36]:
lda = ql.LDA()
lda.train(X,Tclasses)
predictedClasses,_,_ = lda.use(X)
ml.confusionMatrix(Tclasses,predictedClasses,classes); # <- semi-colon prevents printing of returned result
      15   25   35   50
    ------------------------
15 | 91.3  8.7  0    0     (69 / 69)
25 | 11.4 68.9 18.6  1.2   (167 / 167)
35 |  0    8.9 68.3 22.8   (123 / 123)
50 |  0    0   18.2 81.8   (33 / 33)

Grading

Below is the output of the preliminary grading program, which checks that the function interfaces and simple tests of my code are correct.

In [4]:
%run -i A4grader.py
   Testing   model = trainLDA(X,T)
             accuracy = evaluateLDA(model,X,T)

20/20 points. Accuracy is within 10 of correct value 50%

   Testing   model = trainNN(X,T, [[5],100])
             accuracy = evaluateNN(model,X,T)

30/30 points. Accuracy is within 10 of correct value 100%

  Testing
    resultsNN = ml.trainValidateTestKFoldsClassification( trainNN,evaluateNN, X,T, 
                                                          [ [ [0], 5], [ [10], 100] ],
                                                          nFolds=3, shuffle=False,verbose=False)
    bestParms = [row[0] for row in resultsNN]


30/30 points. You correctly find the best parameters to be [[10],100] for each fold.

A4 CODING GRADE is 80/80

A4 WRITING GRADE is ??/20

A4 FINAL GRADE is ??/100

Remember, this python script is just an example of how your code will be graded.
Do not be satisfied with an 80% from running this script.  Write and run additional
tests of your own design.