The Geometry of LDA and PCA Classifiers Illustrated with 3D Examples.
J. Ross Beveridge
Technical Report 01-101
Computer Science Department
Colorado State University
ross@cs.colostate.edu
Last Update, June 15, 2001
Overview
This report will help in developing a geometric interpretation of Fisher Linear Discrimants. The report builds upon an understanding of the connection between Principal Component Analysis and Gaussian Distributions. It contains a running example showing how Fisher Discriminants are computed and what they look like for an illustrative 3 class problem in 3 dimensional space.
Maple 6.0 was used to write this report, and consequently it is both a document and a program. It is available in three forms:
When viewing the Maple or HTML versions, the first "Section" is not intended to read. It is, instead, a collection of helper routines written to service the remainder of the document. The reader will also note that mathematics appears in three forms. In line math, input expressions, and the results of evaluating these expression. In line math appears is what one normally expects for type set mathematics. Inputs to expressions are set aside in blocks and are shown in red. Results of these expression appear below the expression in blue.
Helper Maple Code
Introduction
Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA) play a critical role in many pattern classification tasks. It is helpful to develop a geometric intuition for how each transforms the coordinate reference frame in which data is classified. There are limits to how far this can go, since these techniques are typically applied to problems in higher dimensional spaces. However, conceding this weakness, with some small effort many people can usefully extend a natural 3D intuition to higher dimensions.
There are several motivations for this paper, but one in particular concerns whether LDA basis vectors are orthogonal. It has been reported in the literature that the LDA basis vectors are orthogonal: unfortunately this is false. It is easy show that LDA basis vectors are not orthogonal. As this paper will show, one may conceive of an intermediate space in which LDA basis vectors are indeed orthogonal. Moreover, understanding how this intermediate space is subsequently transformed provides insight into how LDA vectors are configured. The general mathematics of these transformations are reviewed here and each step is illustrated with a 3D running example.
We will start with some very basic properties of Gaussian point clouds in 3D and draw heavily upon the connection between these point clouds and PCA. For our purposes, a Gaussian point cloud is formed by sampling a finite number of observations from a multivariate Gaussian random variable in 3D. Our choice of a Gaussian point cloud establishes a well known and tight connection between PCA and Gaussian distributions. However, PCA classifiers are often successfully applied to non-Gaussian data: the choice should not be seen as limiting. The advantage of fixing our attention on Gaussian points clouds is that it allows us to develop an intuition for the geometry underlying both PCA and LDA spaces.
In particular, there is a tight coupling between scatter matrices and covariance matrices. In moving from PCA to LDA, emphasis shifts from the properties of a single scatter matrix to the properties of two related scatter matrices: the between class and within class scatter matrices. Understanding in geometric terms what is happening when one solves for the linear discriminants becomes more difficult. It is well know that the problem of finding the linear discriminants is equivalent to solving a generalized Eigenproblem. However, quoting [Strang], ''Geometrically, this has a meaning which we do not understand very well.'' In this statement, Strang is addressing all generalized Eigenproblems. Fortunately, both PCA and LDA induce positive definite and symmetric scatter matrices. These restrictions in turn lend themselves to a much clearer geometric interpretation.
Gaussian Random Variables and Principal Components
Axis Aligned Gaussians
Consider the equation that defines the probability density function for a Gaussian random variable in 3D.
=
where
>
eq1 := x = <x,y,z>:
eq2 := mu = <mu[x],mu[y],mu[z]>:
eq3 := Omega = Matrix(3,3, [[sigma[xx], sigma[xy], sigma[xz]],[sigma[yx], sigma[yy], sigma[yz]],[sigma[zx], sigma[zy], sigma[zz]]]):
eq1, eq2, eq3;
In the special case that the cross terms in the covariance matrix are zero, then the pdf reduces to:
>
Vx := <x,y,z>:
Ms := Matrix(3,3, [[sigma[xx], 0, 0],[0, sigma[yy], 0],[0, 0, sigma[zz]]]):
p(x) = (1/((2*Pi)^(3/2) * Determinant(Ms)^(1/2))) * e^((-1/2)*(Transpose(Vx) . MatrixInverse(Ms) . Vx));
Collecting terms:
=
When the cross terms in the covariance matrix are zero, then the pdf is simply the product of the independent probabilities along each of the three axes. This also means, from a practical standpoint, that samples from an axis-aligned 3D Gaussian random variable may be generated by independently calling a 1D Gaussian random number generator three times, once for each of the three independent components.
Example of an Axis Aligned Gaussian
Consider the Gaussian random variable defined by the following mean and standard deviation:
>
MU := Matrix(3,1, [[5],[25],[50]]):
SD := Matrix(3,1, [[1],[3],[10]]):
EA := Matrix(3,1, [[0],[0],[0]]):
mu = MU, sigma = SD;
Note the sigmas along the diagonal of the covariance matrix have been placed in a column vector.
Here is a plot of 100 points sampled from this distribution:
3D Plot of Points
>
P := listClasses(MU,SD,EA):
theta := -45: phi := 45:
plotPoints(P,theta,phi,FRAMED);
Here is an animation of the same set of points.
3D Animated Plot of Points
One sees from these plots that the distribution shown has minimal variation along the x dimension, modest variation along y, and the greatest variation along z.
The Scattter Matrix, Sample Mean and Sample Covariance
To begin to draw the connection between our sampling from a Gaussian random variable and classification using PCA and LDA, let us begin by defining a data matrix A as the collection of points.
>
eq5 := A = Matrix(1,4, [p[1], p[2], `...` , p[n]]):
eq6 := A = Transpose(Matrix(4,3, [[x[1], y[1], z[1]],[x[2], y[2], z[2]],[`...`, `...`, `...`],[x[n], y[n], z[n]]])):
eq5; eq6;
The sample mean and sample covariance for this set of points are:
=
=
=
The matrix S is commonly called the scatter matrix or moment matrix for the set of points. For the specific points in the example above, the sample mean, scatter matrix and covariance matrix are:
>
MUs := dataMatrixMean(P[1]): Ss := scatterMatrix(P[1]): OMEGA := sampleCovariance(P[1]):
mu[s] = roundMatrix(MUs,1), S = roundMatrix(Ss,1), Omega[s] = roundMatrix(OMEGA,1);
The standard deviation along the x, y and z axis are the square roots of the diagonal elements of the sample covariance matrix:
> sigma[xx] = sigDigits(sqrt(OMEGA[1,1]),1), sigma[yy] = sigDigits(sqrt(OMEGA[2,2]),1), sigma[zz] = sigDigits(sqrt(OMEGA[3,3]),1);
These values are close to the original. How close depends in part on how many points are sampled.
Example of a Gaussian Rotated with respect to Principal Axes
Now let us consider rotating the distribution about the x axis by 45 degrees and about the z axis by 45 degrees.
>
MU := Matrix(3,1, [[5],[25],[50]]):
SD := Matrix(3,1, [[1],[3],[10]]):
EA := Matrix(3,1, [[0.7854],[0],[0.7854]]):
mu = MU, sigma = SD, angles = EA;
Again, the sigmas along the diagonal of the covariance matrix have been placed in a column vector.
Here is a plot of 100 points sampled from this distribution:
3D plot of points.
>
P := listClasses(MU,SD,EA):
theta := -45: phi := 45:
plotPoints(P,theta,phi,FRAMED);
Here is an animation of the same set of points.
3D annimated plot of points.
Now we see the point cloud is moving up in a diagonal direction relative to the x, y and z coordinates. As we will see in the next section, we can analyze the data matrix of this new set of points to recover the principal axes of this distribution.
Scatter and Covariance Matrices for Rotated Distribution
In the same fashion as above, we can compute the sample mean vector, the scatter matrix, and the associated sample covariance matrix.
>
MUs := dataMatrixMean(P[1]): Ss := scatterMatrix(P[1]): OMEGA := sampleCovariance(P[1]):
mu[s] = roundMatrix(MUs,1), S = roundMatrix(Ss,1), Omega[s] = roundMatrix(OMEGA,1);
While the sample mean has not been changed, the scatter matrix is very different; the off diagonal elements now have significant value indicating the variance along one axis is correlated with that along another.
Recovering the Rotation and Original Principal Axes
By construction, the covariance and scatter matrices are symmetric positive definite. Thus, they may be diagonalized in the following manner:
=
Here, R is an orthogonal matrix consisting of unit length row (and column) vectors. Thus, it is essentially a rotation matrix. The only reason we need qualify our statement is that it may include reflection about axes as well as rotation. The matrix is a digonal matrix:
> eq10 := Delta = Matrix([[sigma[xx]^2, 0, 0], [0, sigma[yy]^2, 0], [0, 0, sigma[zz]^2]]): eq10;
For reasons that will become apparent shortly, let us write as
=
So now our diagonalization may be written as
=
The actual R and S matrices for our sample covariance matrix are:
>
ROMEGA,SOMEGA := DiagonalizeRS(OMEGA):
R = roundMatrix(ROMEGA,2), S = roundMatrix(SOMEGA,2);
The diagonal terms of the S matrix are in fact the sample standard deviations for the set of points after they have been rotated by R.
> sigma[xx] = sigDigits(SOMEGA[1,1],1), sigma[yy] = sigDigits(SOMEGA[2,2],1), sigma[zz] = sigDigits(SOMEGA[3,3],1);
We can see this visually if we actually rotate and plot the points in the new coordinate system
3D Plot of Points after Rotation to Principal Components
>
Q := homPad(ROMEGA) . P[1]:
theta := -45: phi := 45:
plotPoints([Q],theta,phi,FRAMED);
3D Animated Plot of Points after Rotation to Principal Components
As further confirmation, here is the sample mean, the scatter matrix, and the sample covariance matrix for the new set of points Q.
>
MUs := dataMatrixMean(Q): Ss := scatterMatrix(Q): OMEGA := sampleCovariance(Q):
mu[s] = roundMatrix(MUs,1), S = roundMatrix(Ss,1), Omega[s] = roundMatrix(OMEGA,1);
sigma[xx] = sigDigits(sqrt(OMEGA[1,1]),1), sigma[yy] = sigDigits(sqrt(OMEGA[2,2]),1), sigma[zz] = sigDigits(sqrt(OMEGA[3,3]),1);
The matrix defines the principal components of the point cloud generated by our rotated Gaussian distribution. The row vectors in R are also the Eigenvectors of the covariance or scatter matrix. The covariance and scatter matrices may be used interchangeably to find R since these matrices only differ by a constant scale factor.
Interpreting Variance as Scale
The matrix may be described as a diagonal matrix containing the sample estimates for the standard deviation of the points along each of the principal components. Another way to view this is to observe that scaling the points by the inverse of this matrix will generate a new point set with unit variance along each of the principals axes. To illustrate, let us apply this transformation to the data matrix containing our sample points. The result will be a new set of points .
3D Plot of Points after Rotation and Scaling
>
Q := MatrixInverse(homPad(SOMEGA)) . homPad(ROMEGA) . P[1]:
theta := -45: phi := 45:
plotPoints([Q],theta,phi,FRAMED);
3D Animated Plot of Points after Rotation to Principal Components
Looking at the sample mean, scatter matrix and sample covariance for the resulting set of points Q we will see the sample standard deviations are now all close to one.
>
MUs := dataMatrixMean(Q): Ss := scatterMatrix(Q): OMEGA := sampleCovariance(Q):
mu[s] = roundMatrix(MUs,1), S = roundMatrix(Ss,1), Omega[s] = roundMatrix(OMEGA,1);
sigma[xx] = sigDigits(sqrt(OMEGA[1,1]),1), sigma[yy] = sigDigits(sqrt(OMEGA[2,2]),1), sigma[zz] = sigDigits(sqrt(OMEGA[3,3]),1);
In essence, what we have done with the rotation and scale derived from the original covariance matrix is create a transformation to a new space in which the points have unit variance in all directions.
Principal Components Subpace Maximizes Variance
Often we do not bother to make explicit the criteria that the principal components optimize. However, because it will play a role latter in how the Fisher Linear Discriminants are defined, it is worthwhile to review this basic material. Commonly it is stated that principal components represent axes of maximal variance. To put this more precisely, if one sought a single dimension over which the variance of the data in a data matrix was maximized, it would be the axis that maximized the following function .
For a data matrix of points in 3D space, this product becomes.
=
Looking at the general case where is not diagonal, it is not obvious what unit length basis vector to choose for . However, for the case where is diagonal, the choice is obvious given.
Assume that > > , then the that maximizes is . Now recall our earlier diagonalization of the general case.
=
We observed that is a rotation matrix that shifts us from the original coordinates system, where is not diagonal, to the principal components space, where the covariance matrix is diagonal. So, the axis in our original space that maximizes variance is the backward mapping of into the original space. Putting this more simply,
and even more simply, we see that the first column of is the unit length basis vector that defines the axis of maximum variance. This axis is what we typically call the first prinicipal component of the data matrix.
The extension to axes that collectively maximize variance generalizes the criterion function to
.
Now we are maximizing the determinant of the x matrix formed from product of the x matrix and the covariance matrix . By a generalization of the argument made above, the basis vectors that maximize this determinant are the first columns of .
Fisher Discriminants
Ficher's Linear Discriminants are defined as the basis vectors in a x matrix that maximizes the following function
Where is the between class scatter matrix and is the within class scatter matrix [Duda]. More particularly, the within class scatter matrix is
where
where is the number of elements in class , is the th point in class , and is the mean vector for class .
The between class scatter matrix is
Now, quoting from [Duda], "The problem of finding a rectangular matrix W that minimizes J(.) is tricky." They, and other common references, state without elaboration that the optimal W may be found by solving the generalized eigenvector problem
Specifically, the optimal W consists of the eigenvectors associated with the largest eigenvalues.
The problem with stopping at this point is two fold. First, many of us have no geometric intuition for what it means to solve this general eigenvector problem. Second, while general, this approach may not always be the most efficient/robust means of finding the Fisher Discriminants. This latter observation has been made by [Zhao] and others. They provide a general means of transforming a generalized eigenvector problem to a symmetric eigenvector problem, and this transformation is the basis for developing a geometric intuition for what is actually taking place when computing the Fisher Discriminants.
Here is the transformation with many of the intermediate steps omitted from [Zhao] made explicit.
Transformation to a Standard Eigenvector Problem
Begin with the generalized eigen vector problem:
Now use singular value decomposition to express in diagonal form, and indeed, go one step further and explicitly identify the square root of the diagonal as a scale matrix. So
The original problem may now be written as
.
Take the inverse scale and rotation and right multiply both sides
Define a new matrix that will become the eigenvectors of a symmetric eigenvector problem
and substitute on the right hand side
To simplify the left hand side, introduce a single matrix G that combines the scale and rotation derived from the within class scatter matrix.
and
Expand the left hand side by left multiplying be a sequence or scales and rotations that together are the identity transformation.
Now substitute G and V in where appropriate
.
The final result is a symmetric eigenvector problem in where :
Once we solve for , W is found directly by the equality
We will see below that G is a transformation that takes us into a space where the fisher discriminants are related directly to the eigenvectors of the scatter matrix associated with the transformed points.
How this Transformation Acts Upon Fisher's Criterion
Now we can draw the connection between the general algebraic manipulation above and the Fisher Criterion we wish to optimize.
Recall Fisher's criterion
Consider how much easier this problem would be to solve if the denominator were a constant with respect to our choice of W. We can map to a new space where this is true by exploiting the rotation and scale transformations obtained from diagonalizing . Thus, reiterating the diagonalization used in the previous section.
Applying the transformation to our points before creating the within and between class scatter matrices leads to a new problem where only the numerator varies as a function of W. Showing the algebra of this transformation,
The denominator of this new criterion function collapses to , and this further collapses and the entire denominator becomes 1.
In geometric terms, we have transformed our problem to a space where the covariance of the within class scatter is one in all principal directions. This is exactly the same as we illustrated above when we showed that the R and S derived from the diagonalization of a covariance matrix could be used to remap a data matrix into a space where the points formed a compact ball with variance 1 in all directtions.
One loose end is why maximizing J in the transformed space is equivalent to maximizing it in the original space. The answer lies in observing that the rotations leave the determinant unchanged, and the two scale matrices alter both the numerator and denominator by the same constant factor. Thus, maximizing one ratio is equivalent to maximizing the other. We are now ready to illustrate this method of finding the Fisher Discriminants on a specific 3D example.
Example of a Three Class Problem
Consider a three class recognition problem where samples from each class are drawn from distinct distributions.
Different Covariance Structures
Here are three distinct tri-variate Gaussian random variables and sample points from each. These three classes give rise to two Fisher Linear Discriminants in 3D.
>
MU := Transpose(Matrix(3,3, [[0,0,0],[20,5,30],[50,10,70]])):
SD := Transpose(Matrix(3,3, [[10,3,1],[10,3,1],[10,3,1]])):
EA := Transpose(Matrix(3,3, [[.4,0,0],[0,0.7,0],[0,0,1.2]])):
[mu = MU, sigma = SD, angles = EA];
> P := listClasses(MU,SD,EA);
3D Plot of Points from Three Classes
>
theta := -45: phi := 45:
plotPoints(P,theta,phi,FRAMED);
3D Animated Plot of Points from Three Classes
The sample means and scatter matrices for the three classes are:
>
seq(mu[i] = roundMatrix(dataMatrixMean(P[i]),1),i=1..3);
seq(Omega[i]^2 = roundMatrix(scatterMatrix(P[i]).1/ColumnDimension(P[i]),1),i=1..3);
Illustrating the Fisher Linear Discriminants
The within and between class scatter matrices are shown below. Next, the rotation and scale associated with the within and between class scatter matrices are shown. The rotation and scale for the within class scatter matrix is used to compute the tranformation described above that will cause the variance within classes to be unity in all directions. The Fisher Basis Vectors (Discriminants) are then computed in this space be solving a standard eigenvector problem, and subsequently transformed into the original data space. The resulting discriminants computed in this manner care compared against discriminants computed using the more commonly prescribed manner of solving a generalized eigenvector problem: as expected, they are the same. Finally, the 3D points are shown projected into the 2D space defined by the 2 Fisher discriminants, thus illustrating how these discriminants concentrate samples within classes while at the same time separating the classes.
The Within and Between Class Scatter Matrices
The within class scatter matrix is the sum of the scatter matrices for the three classes. The between class scatter matrix is formed from three points that are the centroids (means) of the three classes.
>
SW,SB := scatterWBClass(P):
S[W] = roundMatrix(SW,1), S[B] = roundMatrix(SB,1);
Computer R and S for the within class and between class matrices
Just for interest, compute the diagonalization of the within class and between class scatter matrices
>
RSW, SSW := DiagonalizeRS(SW):
RSB, SSB := DiagonalizeRS(SB):
R[S[W]] = roundMatrix(RSW,2), S[S[W]] = roundMatrix(SSW,2);
R[S[B]] = roundMatrix(RSB,2), S[S[B]] = roundMatrix(SSB,2);
Test that diagonalization generates proper matrices
Change Coordinates by the G Transformation
As derived above, G is the composition of inverse rotation followed by the inverse scale derived from the within class scatter matrix.
>
GM := Multiply(MatrixInverse(SSW), MatrixInverse(RSW)):
G = roundMatrix(GM,4);
> Q := [seq(Multiply(homPad(GM),P[i]),i=1..3)];
Look at within and between scatter matrices after transformation
Here we confirm with an example that indeed the within class scatter matrix after this change is now the identity matrix. Thus, searching for the basis vectors that maximize Fisher's criterion in this new space amounts to maximizing the numerator for the new between class scatter matrix.
>
SWQ, SBQ := scatterWBClass(Q):
S[W] = roundMatrix(SWQ,5), S[B] = roundMatrix(SBQ,5);
The next step is to diagonalize the between class matrix to obtain the principal components of the between class scatter matrix. The first two of these are the Fisher Linear Discriminants expressed in this transformed space.
>
VV, LambdaV := DiagonalizeRS(SBQ):
WW := RSW . MatrixInverse(SSW). VV:
LambdaW := SSW . LambdaV:
V = roundMatrix(VV,2), Lambda = roundMatrix(LambdaV,2), W = roundMatrix(WW,2);
Confirm which are eigenvectors, rows or columns?
Test if W and Lambda are solution to Generarl Eigenvector Problem
The Fisher Basis Vectors in Transformed and Original Space
The Fisher Basis Vectors in the transformed space are and in the original space they are .
>
FBW := Matrix(3,2):
FBV := Matrix(3,2):
FBW[1..3,1] := Normalize(WW[1..3,1],Euclidean):
FBW[1..3,2] := Normalize(WW[1..3,2],Euclidean):
FBV[1..3,1] := Normalize(VV[1..3,1],Euclidean):
FBV[1..3,2] := Normalize(VV[1..3,2],Euclidean):
FB[v] = roundMatrix(FBV,3), FB[w] = roundMatrix(FBW,3);
The angle between the vectors in the two spaces are
>
unassign('theta'):
an := (180/evalf(Pi)) * arccos(DotProduct(FBW[1..3,1],FBW[1..3,2])):
if (an > 90.0) then an := 180 - an else an := an fi:
theta[v] = sigDigits((180/evalf(Pi)) * arccos(DotProduct(FBV[1..3,1],FBV[1..3,2])),2);
theta[w] = sigDigits(an,2);
As expected, these vectors are orthogonal in the intermediate space, but not in the final space.
Plot Data Matrix with Fisher Basis Vectors
Generate 3D scatter plots with the 2 Fisher Basis Vectors drawn in as line segments.
Fisher Basis Vectors in Transformed Space
>
theta := -90: phi := 0:
plotPointsBasis(Q,basisVectorEndPoints(FBV,LambdaV,Q),theta,phi,FRAMED);
Animation of Fisher Basis Vectors in Transformed Space
Fisher Basis Vectors in Original Space
>
theta := -90: phi := 0:
plotPointsBasis(P,basisVectorEndPoints(FBW,LambdaW,P),theta,phi,FRAMED);
Animation of Fisher Basis Vectors in Original Space
What is the fischerCriterion for the resulting W
> J(W) = sigDigits(fischerCriterion(FBW,SW,SB),4);
Compare to Generalized Eigenvector Method
As discussed above, the standard way recommended in [Duda] for finding the Fisher Basis Vectors is to solve a generalized Eigenvector problem. Here we check to see if the results we obtain through this method are comparable.
>
S[W] = roundMatrix(SW,1), S[B] = roundMatrix(SB,1);
GR,GS := generalizedEigenvectorsRS(SB,SW):
G[R] = roundMatrix(GR,3), G[S] = roundMatrix(GS,3), W = roundMatrix(WW,3);
3D Plot of Fisher Basis Vectors Found using Generalized Eigenvector Method
>
GBV := basisVectorEndPoints(GR,GS . 10,P):
theta := -90: phi := 0:
plotPointsBasis(P, GBV,theta,phi,FRAMED);
3D Animation of Fisher Basis Vectors Found using Generalized Eigenvector Method
Of interest is the angle between the Fisher Basis Vectors computed using the generalized eigenvector method . This should be the same as the angle between the vectors as found using the geometric transformation method. Also of interest is the value of the Fisher Criterion. These should be the same for the generalized eigenvector and geometric transformation methods, denoted as and respectively.
>
unassign('theta','a'):
an := (180/evalf(Pi)) * arccos(DotProduct(GR[1..3,1],GR[1..3,2])):
if (an > 90.0) then an := 180 - an else an := an fi:
theta[w] = sigDigits(an,2);
J(W[b]) = Re(fischerCriterion(GR[1..3,1..2],SW,SB));
J(W[a]) = fischerCriterion(FBW,SW,SB);
View 2D projection using Alternative Bases
Another way to look at the result of finding Fisher Discriminants is to look at the projected points. As expected, there is good separation between the resulting 2D points.
> PP := projectLD(P,FBW):
2D Plot of classes projected onto Fisher Discriminants
> plotPoints2D(PP);
References