CS545
Machine Learning

Spring 2008
Department of Computer Science

Assignment 2: Linear Regression--Abalone Age Prediction

Part 1: Reading the Data

Download the abalone data from the UCI Machine Learning Repository. Download both the data file and the names file. Read in the data using an R expression like
abalone <- read.table("abalone.data",sep=",")
In the abalone.names file you will find the names of the attributes in the data file. With that information, you can assign column names to the data matrix by doing
abaloneNames <- c("Sex","Length","Diameter","Height","Whole weight","Shucked weight","Viscera weight","Shell weight","Rings")
colnames(abalone) <- abaloneNames
Notice that the last column containing the number of rings is our target variable, representing the age of the abalone.

For a quick view of the data and linear relationships among pairs of attributes, learn about and use the pairs function.
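As a minimal sketch of what a call to pairs looks like, the block below builds a tiny data frame from the first five abalone records (the same values listed in Part 2) and plots every pair of columns against each other. The name firstRows is made up for this illustration. With the full data set you would pass the numeric columns of abalone instead, for example pairs(abalone[,-1]).

```r
# First five abalone records, numeric columns only (values from the listing in Part 2).
# The name firstRows is hypothetical, used only for this illustration.
firstRows <- data.frame(
  Length   = c(0.455, 0.350, 0.530, 0.440, 0.330),
  Diameter = c(0.365, 0.265, 0.420, 0.365, 0.255),
  Height   = c(0.095, 0.090, 0.135, 0.125, 0.080),
  Rings    = c(15, 7, 9, 10, 7)
)

# One scatterplot for every pair of columns, arranged in a grid
pairs(firstRows, main = "First five abalone records")
```

Five points are far too few to judge linearity. The point is only to show the call and the layout of the resulting grid.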

Part 2: Data Preparation

Notice the first column of data is character data indicating "M", "F", or "I" for "Male", "Female", or "Infant". Convert this to three columns of binary-valued numbers as follows. A male is represented as (1,0,0), a female by (0,1,0) and an infant by (0,0,1). If you do this correctly, then the first five rows of data should look like
> abalone[1:5,]
  M F I  Length Diameter Height Whole weight Shucked weight Viscera weight
1 1 0 0   0.455    0.365  0.095       0.5140         0.2245         0.1010
2 1 0 0   0.350    0.265  0.090       0.2255         0.0995         0.0485
3 0 1 0   0.530    0.420  0.135       0.6770         0.2565         0.1415
4 1 0 0   0.440    0.365  0.125       0.5160         0.2155         0.1140
5 0 0 1   0.330    0.255  0.080       0.2050         0.0895         0.0395
  Shell weight Rings
1        0.150    15
2        0.070     7
3        0.210     9
4        0.155    10
5        0.055     7
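One possible way to do the conversion is sketched below on a toy data frame holding the Sex, Length, and Rings values of the five rows shown above. The names toy and toyEncoded are assumptions made for this example, and comparing the column against each letter is just one of several approaches that would work.

```r
# Toy data frame with the Sex, Length and Rings values of the five rows above.
# The names toy and toyEncoded are hypothetical.
toy <- data.frame(
  Sex    = c("M", "M", "F", "M", "I"),
  Length = c(0.455, 0.350, 0.530, 0.440, 0.330),
  Rings  = c(15, 7, 9, 10, 7)
)

# Each comparison yields TRUE/FALSE; as.numeric turns that into 1/0.
# The original Sex column is dropped and the indicators take its place.
toyEncoded <- data.frame(
  M = as.numeric(toy$Sex == "M"),
  F = as.numeric(toy$Sex == "F"),
  I = as.numeric(toy$Sex == "I"),
  toy[, -1]
)

print(toyEncoded)
```

Every row should have exactly one 1 among M, F, and I, which is an easy thing to check with rowSums.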

Now to obtain linear regression weights that can be numerically compared with each other, we must first standardize the independent variables, which are all but the number of rings. Standardizing means subtracting the mean of each attribute from its values and dividing by its standard deviation, so that all attributes across samples have zero mean and unit variance. Here is one possible implementation of this.

standardize <- function(X,means=apply(X,2,mean),stdevs=apply(X,2,sd),
                      returnParms=FALSE) {
  ## X is nSamples by nInputComponents
  stdevs[stdevs==0] <- 1
  N <- nrow(X)
  p <- ncol(X)
  X <- (X - matrix(rep(means,N),N,p,byrow=TRUE))/
          matrix(rep(stdevs,N),N,p,byrow=TRUE)
  if (returnParms)
    list(data=X,means=means,stdevs=stdevs)
  else
    X
}
Here is another implementation, which I prefer, that takes advantage of R's lexical scoping rules. It returns a function (see Frames, Environments, and Scope in R and S-PLUS, by John Fox):
makeStandardizeF <- function(X) {
  if (missing(X)) {
    cat("Usage:
         standardize <- makeStandardizeF(X)  ## X is nSamples x nDimensions
         Xs <- standardize(X)
         X2s <- standardize(X2)\n")
    return(invisible())
  }
  ## X is nSamples x nDimensions
  mu <- colMeans(X)
  sigma <- apply(X,2,sd) ## column standard deviations; sd(X) on a matrix does not return per-column values in current versions of R

  function(newX) {
    nr <- nrow(newX)
    nc <- ncol(newX)
    (newX - matrix(mu,nr,nc,byrow=TRUE)) / matrix(sigma,nr,nc,byrow=TRUE)
  }
}

After standardizing the independent variables, check their means and standard deviations.
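Here is a small sketch of such a check on random data. It uses scale(), base R's built-in centering and scaling function, in place of the two implementations above, only so that the block runs on its own. The matrix X here is random stand-in data, not the abalone data.

```r
set.seed(1)

# Random stand-in data: 50 samples, 3 attributes, deliberately not standardized
X <- matrix(rnorm(50 * 3, mean = 5, sd = 3), 50, 3)

# scale() subtracts the column means and divides by the column standard deviations
Xs <- scale(X)

# After standardizing, means should be (numerically) zero and standard deviations one
columnMeans <- colMeans(Xs)
columnSds   <- apply(Xs, 2, sd)

print(round(columnMeans, 10))
print(columnSds)
```

Expect the means to differ from zero only by floating point round-off, so compare against a small tolerance rather than testing for exact equality.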

Part 3: Finally, the regression

Implement the function named llsMake(X,Y,lambda) that calculates the linear least squares solution with the ridge regression penalty parameterized by lambda and simply returns the weights. Implement the function llsUse(weights,X) that returns a prediction of the target variable.
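The following is one possible sketch of these two functions, not the required solution. It rests on two assumptions that the assignment does not spell out: a constant column of ones is added inside both functions to model the bias, and the bias weight is left out of the penalty. Under those assumptions the weights solve (X'X + lambda I) w = X'Y, with the first diagonal entry of the penalty matrix set to zero.

```r
# Sketch only. Assumes a bias column is added here and is NOT penalized.
llsMake <- function(X, Y, lambda) {
  X1 <- cbind(1, as.matrix(X))          # prepend the constant column
  penalty <- lambda * diag(ncol(X1))    # lambda on the diagonal
  penalty[1, 1] <- 0                    # do not shrink the bias weight
  solve(t(X1) %*% X1 + penalty, t(X1) %*% Y)
}

llsUse <- function(weights, X) {
  cbind(1, as.matrix(X)) %*% weights    # same constant column as in llsMake
}

# Small usage example on random stand-in data
set.seed(2)
X <- matrix(rnorm(60), 20, 3)
Y <- 2 + X %*% c(1, -2, 0.5) + rnorm(20, sd = 0.1)

w0  <- llsMake(X, Y, 0)     # lambda = 0 is ordinary least squares
w10 <- llsMake(X, Y, 10)    # larger lambda shrinks the non-bias weights
predictions <- llsUse(w10, X)
```

A handy sanity check is that lambda = 0 should reproduce the coefficients returned by lm on the same data.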

Before applying these functions to your abalone data, randomly partition the data into a training set composed of 50% of the data and a test set composed of the other 50%. Let's refer to the partition fraction as the training fraction, or 0.5 in this case. Call a 20%/80% training/testing partition a 0.2 training set fraction. Now use your llsMake with a variety of lambda values to fit the penalized linear model to the training data and predict the target variable for the training and also for the testing data using two calls to your llsUse function. Keep the results in a matrix with each row being the lambda value, the training RMS error and the testing RMS error, using rbind. After experimenting with lambda to find a range (nonnegative) that shows variations in results, plot the resulting training and testing RMS errors versus the value of lambda.
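The steps in the paragraph above can be sketched as follows on random stand-in data. To keep the block self-contained, the ridge fit is written inline rather than through llsMake and llsUse, with an unpenalized bias column as an assumption. Standardization is skipped because the stand-in inputs are already drawn with zero mean and unit variance. The names rmse and results are choices made for this example.

```r
set.seed(545)

# Random stand-in data, NOT the abalone data
n <- 100
X <- matrix(rnorm(n * 2), n, 2)
Y <- as.vector(3 + X %*% c(1.5, -0.5) + rnorm(n))

# Random partition with a training fraction of 0.5
trainFraction <- 0.5
trainRows <- sample(n, round(trainFraction * n))
Xtrain <- X[trainRows, ];  Ytrain <- Y[trainRows]
Xtest  <- X[-trainRows, ]; Ytest  <- Y[-trainRows]

# Root mean square error between predictions and targets
rmse <- function(predicted, actual) sqrt(mean((predicted - actual)^2))

# One row per lambda: lambda, training RMS error, testing RMS error
results <- NULL
for (lambda in c(0, 1, 10, 100)) {
  A <- cbind(1, Xtrain)                       # bias column (assumed unpenalized)
  penalty <- lambda * diag(ncol(A))
  penalty[1, 1] <- 0
  w <- solve(t(A) %*% A + penalty, t(A) %*% Ytrain)
  results <- rbind(results,
                   c(lambda   = lambda,
                     trainRMS = rmse(A %*% w, Ytrain),
                     testRMS  = rmse(cbind(1, Xtest) %*% w, Ytest)))
}

print(results)
```

Starting from results <- NULL lets rbind build the matrix row by row, and naming the elements of each row gives the matrix its column names.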

Pick a lambda and examine the weights of the ridge regression model. Which are most significant? Try removing two or three of the least significant attributes and see how the RMS errors change. For the next part, do not remove any attributes.

Part 4: More Experiments

Okay, that was just a warm-up. Now you have questions. At least, I hope you have the following questions.
  1. Does the effect of lambda on error change for different partitions of the data into training and testing sets?
  2. What does an error of 2.1 mean? Is it a good prediction? Have we learned a good model?
To answer these questions, modify your code to perform the following steps.
  1. For different training set fractions,
    1. Repeat 200 times
      1. Randomly divide data into training and testing partitions.
      2. Standardize the training input variables.
      3. Standardize the testing input variables using the means and standard deviations from the training set.
      4. For different values of lambda
        1. Fit a linear model to the training data for the given lambda
        2. Use it to predict the number of rings in the training data and calculate the root mean square (RMS) error
        3. Do this again, using the same linear model applied to the testing data.
    2. Calculate the average of the RMS errors over the 200 repetitions for each combination of training set fraction and lambda value.
Executing my R code on a Core2 Intel laptop took no more than 5 minutes.
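The nested steps listed above might be organized roughly like this. It is a sketch on random stand-in data with only 20 repetitions so that it runs in a moment, where the assignment asks for 200. The helper names rmse and ridgeFit, the three-dimensional array errors, and the unpenalized bias column are all assumptions made for this example.

```r
set.seed(1)

# Random stand-in data, NOT the abalone data
n <- 200
X <- matrix(rnorm(n * 3, mean = 4, sd = 2), n, 3)
Y <- as.vector(10 + X %*% c(2, -1, 0.5) + rnorm(n))

rmse <- function(predicted, actual) sqrt(mean((predicted - actual)^2))

# Hypothetical helper: ridge weights with an unpenalized bias column
ridgeFit <- function(X, Y, lambda) {
  X1 <- cbind(1, X)
  P <- lambda * diag(ncol(X1))
  P[1, 1] <- 0
  solve(crossprod(X1) + P, crossprod(X1, Y))
}

fractions <- c(0.2, 0.5, 0.8)
lambdas   <- c(0, 1, 10, 100)
nReps     <- 20    # the assignment asks for 200

# Running sums of RMS errors, indexed by [fraction, lambda, train/test]
errors <- array(0, dim = c(length(fractions), length(lambdas), 2),
                dimnames = list(fraction = as.character(fractions),
                                lambda   = as.character(lambdas),
                                set      = c("train", "test")))

for (fi in seq_along(fractions)) {
  for (r in seq_len(nReps)) {
    trainRows <- sample(n, round(fractions[fi] * n))
    Xtrain <- X[trainRows, , drop = FALSE]
    Xtest  <- X[-trainRows, , drop = FALSE]

    # Means and standard deviations come from the training set only
    mu    <- colMeans(Xtrain)
    sigma <- apply(Xtrain, 2, sd)
    XtrainS <- sweep(sweep(Xtrain, 2, mu), 2, sigma, "/")
    XtestS  <- sweep(sweep(Xtest,  2, mu), 2, sigma, "/")

    for (li in seq_along(lambdas)) {
      w <- ridgeFit(XtrainS, Y[trainRows], lambdas[li])
      errors[fi, li, "train"] <- errors[fi, li, "train"] +
        rmse(cbind(1, XtrainS) %*% w, Y[trainRows])
      errors[fi, li, "test"] <- errors[fi, li, "test"] +
        rmse(cbind(1, XtestS) %*% w, Y[-trainRows])
    }
  }
}

# Average over repetitions
meanErrors <- errors / nReps
print(round(meanErrors, 3))
```

Keeping the sums in one array indexed by fraction, lambda, and train/test makes the later plots a matter of slicing, for example meanErrors[2, , ] for all lambdas at the second training fraction.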

Now for some visualizations to answer some of your questions.

  1. To see if the training set fraction affects the effect of lambda on error, plot the effect in multiple graphs, one for each training set fraction, by building the following figure. Make one figure of multiple graphs, one for each training set fraction, each graph being a plot of the average RMS training error versus lambda values and a plot of the average RMS testing error versus lambda. To enable the comparison across graphs, force each graph to have the same error (y axis) limits. Check out the range() function and ylim parameter of matplot.
  2. Hummm. Interesting, right? That figure shows the whole story, but not very clearly nor concisely. Just how does the training set fraction affect which lambda value is best? Well, let R tell you, by making the following graphs. In one figure, make two graphs. In the first graph, plot the minimum average RMS testing error versus the training set fraction. In the second graph, plot the lambda value that produced the minimum average RMS testing error versus the training set fraction.
  3. So far we have just looked at mean RMS error. Maybe the error is due to a few samples with huge errors and all others have tiny errors. There are various ways of looking at the distribution of the errors (hist() comes to mind) but a common plot in regression studies is to make a scatterplot (points without connecting lines) of predicted values versus actual values. For a choice of training set fraction, such as 0.5, or one giving the best test results, make two such graphs, one for the training set and one for the testing set. If the model is good then all points will be close to a 45 degree line through the plot, which you can draw using abline(0,1,col=3,lwd=3) after each call to plot.
So, there are some ways of looking at the results. Include these graphs and lots of your own observations about them in your report.
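To illustrate the plotting calls mentioned above, here is a sketch that draws one graph per training set fraction with shared y axis limits, followed by a predicted-versus-actual scatterplot with the 45 degree line. All numbers in it are random placeholders generated only to have something to draw. They are not results, and the array layout [fraction, lambda, train/test] is an assumption of this example.

```r
set.seed(3)

lambdas   <- c(0, 1, 10, 100)
fractions <- c(0.2, 0.5, 0.8)

# Random PLACEHOLDER values standing in for average RMS errors,
# indexed by [fraction, lambda, train/test]. These are not results.
meanErrors <- array(runif(3 * 4 * 2, min = 2, max = 3), dim = c(3, 4, 2))

# Same y limits in every graph so the graphs can be compared
yLimits <- range(meanErrors)

par(mfrow = c(1, length(fractions)))
for (fi in seq_along(fractions)) {
  matplot(lambdas, meanErrors[fi, , ], type = "b",
          pch = 1:2, lty = 1:2, col = 1:2, ylim = yLimits,
          xlab = "lambda", ylab = "average RMS error",
          main = paste("training fraction", fractions[fi]))
  legend("topleft", legend = c("train", "test"),
         pch = 1:2, lty = 1:2, col = 1:2)
}

# Predicted versus actual values, again with placeholder numbers
par(mfrow = c(1, 1))
actual    <- sample(1:20, 50, replace = TRUE)
predicted <- actual + rnorm(50)
plot(actual, predicted, xlab = "actual rings", ylab = "predicted rings")
abline(0, 1, col = 3, lwd = 3)   # points on this line are perfect predictions
```

If your lambda values span several orders of magnitude, a logarithmic x axis may be easier to read, though lambda = 0 would then need separate handling.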

IMPORTANT: Add at least one additional question that you come up with, along with the graphs and discussion that show your attempts to answer it.

Your Report

Include a table of contents.

For this assignment, and all coming assignments, unless told otherwise, be creative in the structure of your report. Do not include headings like Part 1 and Part 2. Instead create your own report structure with headings that make sense. Try to tell a story that flows well for a reader who is not familiar with the assignment.

Here is an example LaTeX file to get you started. Download assignment2.tex and the figure files plot1.eps, plot2.eps, plot3.eps. Follow the steps written as comments at the top of the latex file. You should be able to generate assignment2.pdf.

Grading

Here is what the grade sheet will look like for this assignment.
CS545: Assignment 2                     Name: ________________________

Grade: ___ out of 100 points

======================================================================
Correct results with correct R code.  Total of 40 points.

( 2 points):  Reading the data and constructing input and target matrices.
( 2 points):  Correct use of pairs()
( 2 points):  Conversion of sex attribute to three binary-valued attributes.
( 2 points):  Standardization of input attributes.
( 2 points):  Random partitioning of data.
(10 points):  llsMake and llsUse
( 5 points):  Experiments with some variables removed.
(10 points):  Code for performing repetitions over different training
              set partitions and lambda values.
( 5 points):  Correct plotting code

======================================================================
Discussion. Total of 40 points.

( 5 points): Discussion of result of pairs()
( 5 points): Discussion of effects of lambda on training and testing error.
(10 points): Discussion of weight values and effects of removing some attributes.
(10 points): Discussion of effect of training set fraction on the
             effect of lambda on error.
( 5 points): Discussion of how good the linear model is.
( 5 points): Discussion of additional question.

======================================================================
Report structure. Total of 10 points.

( 2 points): Table of contents included
( 3 points): Heading and subheading structure easy to follow and
             clearly divides report into logical sections.
( 5 points): Code, math, figure captions, and all other aspects of  
             report are well-written and formatted.

======================================================================
Grammar and spelling. Total of 10 points.

(5 points): Spelling.  Use a spell checker!
(5 points): Grammar and punctuation.