Machine Learning, Spring 2008, Department of Computer Science
abalone <- read.table("abalone.data",sep=",")
In the abalone.names file you will find the names of the
attributes in the data file. With that information, you can assign
column names to the data matrix by doing
abaloneNames <- c("Sex","Length","Diameter","Height","Whole weight","Shucked weight","Viscera weight","Shell weight","Rings")
colnames(abalone) <- abaloneNames
Notice that the last column containing the number of rings is our
target variable, representing the age of the abalone.
For a quick view of the data and linear relationships among pairs of
attributes, learn about and use the pairs function.
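As a starting point, here is a minimal sketch of a pairs call. It runs on a small synthetic data frame (the name toy and all of its values are made up for illustration) so that it works without the data file; on the real data, something like pairs(abalone[,-1]) would leave out the non-numeric Sex column.

```r
if (!interactive()) pdf(NULL)  # discard plots when run as a script

## Synthetic stand-in for the abalone data (made-up values, not the real file)
set.seed(1)
n <- 50
len <- runif(n, 0.1, 0.8)
toy <- data.frame(Length   = len,
                  Diameter = 0.8 * len + rnorm(n, sd = 0.02),
                  Height   = 0.3 * len + rnorm(n, sd = 0.02),
                  Rings    = round(5 + 10 * len + rnorm(n)))

## Scatterplot matrix: one panel for every pair of columns
pairs(toy, pch = ".", cex = 3)

## Pairwise correlations give a numeric companion to the picture
corToy <- cor(toy)
print(round(corToy, 2))
```

Panels in which the points fall close to a straight line correspond to correlations near 1 or -1 in the printed matrix.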
> abalone[1:5,]
  M F I Length Diameter Height Whole weight Shucked weight Viscera weight
1 1 0 0  0.455    0.365  0.095       0.5140         0.2245         0.1010
2 1 0 0  0.350    0.265  0.090       0.2255         0.0995         0.0485
3 0 1 0  0.530    0.420  0.135       0.6770         0.2565         0.1415
4 1 0 0  0.440    0.365  0.125       0.5160         0.2155         0.1140
5 0 0 1  0.330    0.255  0.080       0.2050         0.0895         0.0395
  Shell weight Rings
1        0.150    15
2        0.070     7
3        0.210     9
4        0.155    10
5        0.055     7
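The output above shows the Sex column replaced by three binary columns M, F and I. One way this might be done is sketched below on a toy data frame whose Sex, Length and Rings values are taken from the five rows shown; the names toy, levs, indicators and toy2 are hypothetical, and other approaches work equally well.

```r
## Toy data frame using the Sex coding of the abalone file (M, F, I)
toy <- data.frame(Sex    = c("M", "M", "F", "M", "I"),
                  Length = c(0.455, 0.350, 0.530, 0.440, 0.330),
                  Rings  = c(15, 7, 9, 10, 7))

## One 0/1 column per level: outer() compares every Sex value with every level
levs <- c("M", "F", "I")
indicators <- outer(as.character(toy$Sex), levs, "==") * 1
colnames(indicators) <- levs

## Replace the Sex column with the three indicator columns
toy2 <- data.frame(indicators, toy[, names(toy) != "Sex"], check.names = FALSE)
print(toy2)
```

Each row has exactly one 1 among the three indicator columns, which is a quick sanity check on the conversion.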
Now, to obtain linear regression weights that can be numerically compared with each other, we must first standardize the independent variables, which are all of the attributes except the number of rings. Standardizing means subtracting from each attribute its mean and dividing the result by its standard deviation, so that every attribute has zero mean and unit variance across the samples. Here is one possible implementation of this.
standardize <- function(X, means=apply(X,2,mean), stdevs=apply(X,2,sd),
                        returnParms=FALSE) {
  ## X is nSamples by nInputComponents
  stdevs[stdevs==0] <- 1
  N <- nrow(X)
  p <- ncol(X)
  X <- (X - matrix(rep(means,N),N,p,byrow=TRUE)) /
    matrix(rep(stdevs,N),N,p,byrow=TRUE)
  if (returnParms)
    list(data=X,means=means,stdevs=stdevs)
  else
    X
}
Here is another implementation, which I prefer, that takes advantage
of R's lexical scoping rules (see Frames, Environments, and Scope in
R and S-PLUS, by John Fox). It returns a function:
makeStandardizeF <- function(X) {
  if (missing(X)) {
    cat("Usage:
standardize <- makeStandardizeF(X) ## X is nSamples x nDimensions
Xs <- standardize(X)
X2s <- standardize(X2)\n")
    return(invisible())
  }
  ## X is nSamples x nDimensions
  mu <- colMeans(X)
  ## Column standard deviations. In current versions of R, sd(X) on a matrix
  ## no longer works column by column, so apply sd to each column explicitly.
  sigma <- apply(X,2,sd)
  ## The returned function remembers mu and sigma from the enclosing call
  function(newX) {
    nr <- nrow(newX)
    nc <- ncol(newX)
    (newX - matrix(mu,nr,nc,byrow=TRUE)) / matrix(sigma,nr,nc,byrow=TRUE)
  }
}
After standardizing the independent variables, check their means and standard deviations.
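A minimal sketch of such a check follows. It uses a small synthetic matrix (made-up values), and sweep() serves only as a compact stand-in for whichever standardize function you use.

```r
## Synthetic inputs with clearly non-zero means and non-unit spreads
set.seed(42)
X <- cbind(a = rnorm(100, mean = 5, sd = 2),
           b = runif(100, 0, 10))

## Stand-in for the standardize functions: centre each column, then scale it
Xs <- sweep(sweep(X, 2, colMeans(X), "-"), 2, apply(X, 2, sd), "/")

## Column means should be 0 and standard deviations 1, up to rounding error
colMeansXs <- colMeans(Xs)
colSdsXs <- apply(Xs, 2, sd)
print(round(colMeansXs, 10))
print(colSdsXs)
```

Means that differ from zero only around the 1e-16 level are floating-point rounding, not a bug.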
Write a function llsMake(X,Y,lambda)
that calculates the linear least squares solution with the ridge
regression penalty parameterized by lambda and simply returns
the weights. Then implement the function llsUse(weights,X),
which returns a prediction of the target variable.
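For reference, the ridge solution is commonly written as below. The notation is an assumption rather than part of the assignment: X is the matrix of standardized inputs with a leading column of ones for the bias, Y is the column of targets, I is the identity matrix and lambda is nonnegative.

```latex
\[
  \mathbf{w} = \left( \mathbf{X}^{T}\mathbf{X} + \lambda \mathbf{I} \right)^{-1} \mathbf{X}^{T}\mathbf{Y},
  \qquad
  \hat{\mathbf{Y}}_{\text{new}} = \mathbf{X}_{\text{new}}\,\mathbf{w}
\]
```

With lambda equal to zero this reduces to ordinary least squares. Many treatments leave the bias weight unpenalized by setting the first diagonal entry of I to zero; which variant to use is a design choice worth stating in your report.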
Before applying these functions to your abalone data, randomly partition
the data into a training set composed of 50% of the data and a test
set composed of the other 50%. Let's refer to the partition fraction
as the training fraction, or 0.5 in this case. Call a 20%/80%
training/testing partition a 0.2 training set fraction. Now use
your llsMake with
a variety of lambda values to fit the penalized linear model to the
training data and predict the target variable for the training and
also for the testing data using two calls to your llsUse
function. Keep the results in a matrix with each row being the lambda
value, the training RMS error and the testing RMS error,
using rbind. After experimenting with lambda to find a
range (nonnegative) that shows variations in results, plot the
resulting training and testing RMS errors versus the value of lambda.
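The bookkeeping described above might be sketched as follows. The data are synthetic (made up), and fitRidge and useRidge are hypothetical stand-ins for your own llsMake and llsUse; they prepend a column of ones and penalize every weight, which is only one possible choice.

```r
set.seed(7)
## Synthetic stand-in data: 200 samples, 3 inputs, one target column
N <- 200
X <- matrix(rnorm(N * 3), N, 3)
Y <- X %*% c(2, -1, 0.5) + 10 + rnorm(N, sd = 0.5)

## Random partition: shuffle the row indices, take the first trainFraction
trainFraction <- 0.5
shuffled <- sample(N)
nTrain <- round(trainFraction * N)
trainRows <- shuffled[1:nTrain]
testRows <- shuffled[(nTrain + 1):N]

rmse <- function(predicted, actual) sqrt(mean((predicted - actual)^2))

## Hypothetical stand-ins for llsMake and llsUse
fitRidge <- function(X, Y, lambda) {
  X1 <- cbind(1, X)
  solve(t(X1) %*% X1 + lambda * diag(ncol(X1)), t(X1) %*% Y)
}
useRidge <- function(w, X) cbind(1, X) %*% w

## One row per lambda: lambda, training RMS error, testing RMS error
results <- NULL
for (lambda in c(0, 1, 10, 100, 1000)) {
  w <- fitRidge(X[trainRows, ], Y[trainRows, , drop = FALSE], lambda)
  results <- rbind(results,
                   c(lambda,
                     rmse(useRidge(w, X[trainRows, ]), Y[trainRows]),
                     rmse(useRidge(w, X[testRows, ]), Y[testRows])))
}
colnames(results) <- c("lambda", "trainRMSE", "testRMSE")
print(results)
```

Because sample() is random, repeating the partition with different seeds gives a feel for how much the errors vary from one split to the next.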
Pick a lambda and examine the weights of the ridge regression model. Which are most significant? Try removing two or three of the least significant attributes and see how the RMS errors change. For the next part, do not remove any attributes.
Now for some visualizations to answer some of your questions.
range() function and ylim parameter
of matplot.
hist() comes to mind) but a common plot
in regression studies is to make a scatterplot (points without
connecting lines) of predicted values
versus actual values. For a choice of training set fraction, such
as 0.5, or one giving the best test results, make two such graphs,
one for the training set and one for the testing set. If the
model is good then all points will be close to a 45 degree line
through the plot, which you can draw
using abline(0,1,col=3,lwd=3) after each call
to plot.
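The plotting calls mentioned above could be combined roughly as follows. All numbers are made up for illustration: results mimics a matrix of lambda, training RMS error and testing RMS error, and predicted stands in for the output of llsUse.

```r
if (!interactive()) pdf(NULL)  # discard plots when run as a script

## Made-up results matrix in the layout lambda, train RMSE, test RMSE
results <- cbind(lambda    = c(0, 1, 10, 100),
                 trainRMSE = c(2.10, 2.11, 2.15, 2.40),
                 testRMSE  = c(2.25, 2.24, 2.27, 2.50))

## Both error curves on one plot. Taking ylim from range() is mainly useful
## for giving several such plots the same vertical axis.
yLimits <- range(results[, c("trainRMSE", "testRMSE")])
matplot(results[, "lambda"], results[, c("trainRMSE", "testRMSE")],
        type = "b", pch = 1:2, lty = 1:2, col = 1:2, ylim = yLimits,
        xlab = "lambda", ylab = "RMS error")
legend("topleft", legend = c("train", "test"), pch = 1:2, lty = 1:2, col = 1:2)

## Predicted versus actual values, with the 45 degree line from the text
set.seed(3)
actual <- sample(3:20, 100, replace = TRUE)
predicted <- actual + rnorm(100, sd = 2)   # stand-in for llsUse output
plot(actual, predicted, xlab = "Actual rings", ylab = "Predicted rings")
abline(0, 1, col = 3, lwd = 3)
```

Points above the line are over-predictions and points below it are under-predictions, so any systematic bend away from the line suggests where a linear model falls short.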
IMPORTANT: Add at least one additional question you come up with and the graphs and discussions that show your attempts to answer the questions.
For this assignment, and all coming assignments, unless told otherwise, be creative in the structure of your report. Do not include headings like Part 1 and Part 2. Instead, create your own report structure with headings that make sense. Try to tell a story that flows well for a reader who is not familiar with the assignment.
Here is an example LaTeX file to get you started. Download assignment2.tex and the figure files plot1.eps, plot2.eps, and plot3.eps. Follow the steps written as comments at the top of the LaTeX file. You should be able to generate assignment2.pdf.
CS545: Assignment 2 Name: ________________________
Grade: ___ out of 100 points
======================================================================
Correct results with correct R code. Total of 40 points.
( 2 points): Reading the data and constructing input and target matrices.
( 2 points): Correct use of pairs()
( 2 points): Conversion of sex attribute to three binary-valued attributes.
( 2 points): Standardization of input attributes.
( 2 points): Random partitioning of data.
(10 points): llsMake and llsUse
( 5 points): Experiments with some variables removed.
(10 points): Code for performing repetitions over different training
set partitions and lambda values.
( 5 points): Correct plotting code
======================================================================
Discussion. Total of 40 points.
( 5 points): Discussion of result of pairs()
( 5 points): Discussion of effects of lambda on training and testing error.
(10 points): Discussion of weight values and effects of removing some attributes.
(10 points): Discussion of effect of training set fraction on the
effect of lambda on error.
( 5 points): Discussion of how good the linear model is.
( 5 points): Discussion of additional question.
======================================================================
Report structure. Total of 10 points.
( 2 points): Table of contents included
( 3 points): Heading and subheading structure easy to follow and
clearly divides report into logical sections.
( 5 points): Code, math, figure captions, and all other aspects of
report are well-written and formatted.
======================================================================
Grammar and spelling. Total of 10 points.
( 5 points): Spelling. Use a spell checker!
( 5 points): Grammar and punctuation.