
CS 540, Spring 2009: Assignment 3
Data Mining

Due Thursday 4/23 at beginning of class
Late period ends 4/26 at noon
KDD Cup (Knowledge Discovery and Data Mining) is an annual competition
started in 1997. Each year, participants vie to see who can produce
the most accurate and efficient solution to one or more data mining
applications.
For your assignment, you will be working on the KDD Cup 2004 data
sets. The purpose of the challenge that year was to explore how
classification needs to be sensitive to different performance
measures. The
challenge is described at kddcup2004, which provides
a description of the task as originally presented to the
participants, software for computing the performance metrics, and two
datasets: one for a physics classification problem and one for a
protein classification problem.
What You Need to Do
You are being given considerable latitude on this assignment. You get
to pick the algorithm you think will work best; it needs to be one of
the algorithms covered in class or the readings or closely related to
them. You cannot use algorithms from CS545. You may also pick the
language for your implementation within some limits.
Restrictions/Guidelines:
- If you wish to use any language other than Java, C, C++ or Lisp, you
need prior approval, as it must run on the Linux machines in the lab
and the instructor must be able to read it.
- You may (and should) choose to implement an existing algorithm: one
of those described in class or one from the literature. In your code,
you must indicate who developed the algorithm and provide a citation
to it.
- Your implementation must read the data in the format described on the
data web pages. The program must take as input, at least, the training
data file, the testing data file and the target metric. (So it should
take at least three arguments.) To simplify running your code, name
your executable kddcup04.
- Your implementation must output a file in the required output format
for the performance measure software, perf (basically one prediction
per line). Use the -files option for perf, because when perf is run it
will use my file for the targets and your file for the predictions.
- The implementation must be your own. No downloading of
existing classification code is allowed for this assignment. Any
copying of others' code will fall under the guidelines of the cheating
policy for the CS dept and will be handled accordingly.
- Your algorithm should take no more than 10 minutes to run on the
Linux boxes in the lab.
- You may use the source code for the perf software if you wish.
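The input and output requirements above can be sketched as follows. This is only a minimal skeleton in Java, not a prescribed design: the class name `Kddcup04`, the helper `writePredictions`, the output file name `predictions.txt`, the assumption of one test example per line, and the constant placeholder prediction are all made up for illustration. Parsing of the data files is left out because their format is defined on the data web pages.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class Kddcup04 {

    // Writes one prediction per line, the shape the handout describes
    // for the predictions file that perf reads.
    static void writePredictions(double[] predictions, Path out) throws IOException {
        try (PrintWriter writer = new PrintWriter(Files.newBufferedWriter(out))) {
            for (double p : predictions) {
                writer.println(p);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 3) {
            System.err.println("usage: kddcup04 <training file> <testing file> <target metric>");
            System.exit(1);
        }
        Path trainFile = Paths.get(args[0]);
        Path testFile = Paths.get(args[1]);
        String metric = args[2];

        // Assumption: one test example per line. Check the data web pages.
        int testCount = Files.readAllLines(testFile).size();

        // Placeholder only: a real solution would train on trainFile,
        // adapt to the target metric, and predict each test example.
        double[] predictions = new double[testCount];

        // Hypothetical output file name, chosen for this sketch.
        writePredictions(predictions, Paths.get("predictions.txt"));
        System.err.println("train=" + trainFile + " metric=" + metric
                + " predictions=" + testCount);
    }
}
```

Because the handout asks for an executable named kddcup04, a Java solution would presumably need a small wrapper script with that name; the README should say how to build and run it.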
Data Sets and Supportive Software
The data sets and the perf software from the challenge are available
in ~cs540/kdd/kddcup04.
How You Will Be Graded
In the KDD competition, the training files were provided at first with
the testing files reserved. Because the full data set is already
available, training and testing sets will be selected from the full
set for the evaluation. You will not be given either in advance of the
evaluation.
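Since the evaluation sets are not available in advance, one way to estimate performance beforehand is to hold out part of the provided data yourself. The sketch below is one possible approach, not the procedure used for grading: the class name `HoldoutSplit`, its fields, and the fixed seed are assumptions for illustration, and each example is treated as an opaque line of text. If the examples in a data set are grouped in some way (check the data description), a split should keep each group intact, which this sketch does not do.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

final class HoldoutSplit {
    final List<String> train;
    final List<String> test;

    private HoldoutSplit(List<String> train, List<String> test) {
        this.train = train;
        this.test = test;
    }

    // Shuffles a copy of the examples with a fixed seed (so the split is
    // repeatable), then holds out the given fraction for testing.
    static HoldoutSplit split(List<String> examples, double testFraction, long seed) {
        List<String> shuffled = new ArrayList<String>(examples);
        Collections.shuffle(shuffled, new Random(seed));
        int testSize = (int) Math.round(shuffled.size() * testFraction);
        List<String> test = new ArrayList<String>(shuffled.subList(0, testSize));
        List<String> train = new ArrayList<String>(shuffled.subList(testSize, shuffled.size()));
        return new HoldoutSplit(train, test);
    }
}
```

Calling `HoldoutSplit.split(lines, 0.2, 42L)` on the lines of a training file would keep roughly 80% of them for training and 20% for a private test set; the held-out predictions can then be scored with perf.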
Your grade will be based primarily on the overall quality of your
solution (code and implementation) and the answers to questions
(below). Ten percent of your grade will be based on your solution's
performance relative to the rest of the class.
What to hand in
- Hardcopy
  - Printed version of your code (including comments: both block and
    in-line)
  - Written answers to the following:
    - Briefly describe the algorithm you implemented. Why did you
      choose this algorithm to implement? What makes it suited to the
      problem?
    - What were the biggest challenges to producing your solution? What
      about the problem was difficult?
    - How did the need to address potentially different performance
      metrics change how you approached the problem?
- Electronic copy
  - A tar/gz/whatever file containing the source code. You should
    submit this via email to howe@cs.colostate.edu by the due date/time
    for the assignment. As with prior assignments, you should include a
    README with compilation and execution instructions.