Assignment 3: CS 540 Artificial Intelligence (Spring 2009)

Data Mining

Due Thursday 4/23 at beginning of class
Late period ends 4/26 at noon

KDD Cup (Knowledge Discovery and Data Mining) is an annual competition started in 1997. Each year, participants vie to see who can produce the most accurate and efficient solution to one or more data mining applications. For your assignment, you will be working on the KDD Cup 2004 data sets. The purpose of the challenge that year was to explore how classification needs to be sensitive to different performance measures. The challenge is described at kddcup2004, which provides a description of the task as originally presented to the participants, software for computing the performance metrics, and two data sets: one for a physics classification problem and one for a protein classification problem.
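To illustrate why the choice of performance measure matters, the following sketch scores two made-up sets of predictions against the same targets. Accuracy (at a 0.5 threshold) and root-mean-squared error are used only as two familiar examples; the metrics that actually apply to each task are the ones defined in the challenge description. The class name `MetricDemo`, the targets and the predictions are all hypothetical.

```java
// Illustrative only: the data below is invented, not taken from the KDD Cup files.
class MetricDemo {

    // Fraction of cases where thresholding the prediction matches the 0/1 target.
    static double accuracy(int[] targets, double[] predictions, double threshold) {
        int correct = 0;
        for (int i = 0; i < targets.length; i++) {
            int predicted = predictions[i] >= threshold ? 1 : 0;
            if (predicted == targets[i]) {
                correct++;
            }
        }
        return (double) correct / targets.length;
    }

    // Root-mean-squared error between the predictions and the 0/1 targets.
    static double rmse(int[] targets, double[] predictions) {
        double sumSquares = 0.0;
        for (int i = 0; i < targets.length; i++) {
            double diff = predictions[i] - targets[i];
            sumSquares += diff * diff;
        }
        return Math.sqrt(sumSquares / targets.length);
    }

    public static void main(String[] args) {
        int[] targets = {1, 1, 0, 0};
        // Model A: always on the right side of 0.5, but barely.
        double[] modelA = {0.51, 0.51, 0.49, 0.49};
        // Model B: very confident, with one mistake.
        double[] modelB = {0.99, 0.99, 0.01, 0.60};

        System.out.println("A accuracy = " + accuracy(targets, modelA, 0.5)
                + ", RMSE = " + rmse(targets, modelA));
        System.out.println("B accuracy = " + accuracy(targets, modelB, 0.5)
                + ", RMSE = " + rmse(targets, modelB));
    }
}
```

On this toy data, model A has the better accuracy (1.0 versus 0.75) while model B has the lower RMSE (about 0.30 versus about 0.49), so which model counts as "best" depends on the measure being used.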

What You Need to Do

You are being given considerable latitude on this assignment. You get to pick the algorithm you think will work best; it needs to be one of the algorithms covered in class or the readings or closely related to them. You cannot use algorithms from CS545. You may also pick the language for your implementation within some limits.

Restrictions/Guidelines:

If you wish to use any language other than Java, C, C++ or Lisp, you need prior approval as it must run on the Linux machines in the lab and the instructor must be able to read it.
You may (and should) choose to implement an existing algorithm: one of those described in class or one from the literature. In your code, you must indicate who developed the algorithm and provide a citation to it.
Your implementation must read the data in the format described on the data web pages. The program must take as input, at a minimum, the training data file, the testing data file and the target metric, so it should accept at least three arguments. To simplify running your code, name your executable kddcup04.
Your implementation must output a file in the output format required by the performance measure software (basically, one prediction per line). You should use the -files option for perf, because perf will be run with my file for the targets and your file for the predictions.
The implementation must be your own. No downloading of existing classification code is allowed for this assignment. Any copying of others' code will fall under the guidelines of the cheating policy for the CS dept and will be handled accordingly.
Your algorithm should take no more than 10 minutes to run on the Linux boxes in the lab.
You may use the source code for the perf software if you wish.
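
As a rough sketch of the input and output requirements in the list above, a Java driver might be organized as follows. Several things here are assumptions rather than part of the assignment: the class name `Kddcup04` (a small wrapper script named kddcup04 could invoke it), the output file name `predictions.txt`, the idea that the testing file holds one case per non-blank line, and the placeholder that predicts 0.0 for every case. The real data format is the one described on the data web pages, and the real predictions come from the algorithm you choose.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical driver class; not a required structure.
class Kddcup04 {

    // Holds the three required command-line arguments.
    static final class Options {
        final String trainFile;
        final String testFile;
        final String metric;

        Options(String trainFile, String testFile, String metric) {
            this.trainFile = trainFile;
            this.testFile = testFile;
            this.metric = metric;
        }
    }

    // Checks that at least three arguments were supplied and packages them.
    static Options parseArgs(String[] args) {
        if (args.length < 3) {
            throw new IllegalArgumentException(
                    "usage: kddcup04 <training-file> <testing-file> <target-metric>");
        }
        return new Options(args[0], args[1], args[2]);
    }

    // Formats predictions as text lines, one prediction per line.
    static List<String> toLines(double[] predictions) {
        List<String> lines = new ArrayList<>();
        for (double p : predictions) {
            lines.add(Double.toString(p));
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        Options opts = parseArgs(args);

        // Assumption: one test case per non-blank line of the testing file.
        int numCases = 0;
        for (String line : Files.readAllLines(Paths.get(opts.testFile), StandardCharsets.UTF_8)) {
            if (!line.trim().isEmpty()) {
                numCases++;
            }
        }

        // Placeholder: a real solution would train on opts.trainFile and
        // tune its predictions for opts.metric. Here every prediction is 0.0.
        double[] predictions = new double[numCases];

        // Assumption: the output file name is not specified by the assignment.
        Files.write(Paths.get("predictions.txt"), toLines(predictions), StandardCharsets.UTF_8);
    }
}
```

Keeping argument parsing and output formatting in small separate methods makes it easy to swap in the actual learner later without touching the file handling.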

Data Sets and Supportive Software

The data sets and the perf software from the challenge are available in ~cs540/kdd/kddcup04.

How You Will Be Graded

In the KDD competition, the training files were provided at first with the testing files reserved. Because the full data set is already available, training and testing sets will be selected from the full set for the evaluation. You will not be given either in advance of the evaluation.
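
Since the evaluation sets will not be available in advance, one way to estimate how a solution might fare is to hold out part of the available data yourself. The sketch below splits case indices at random with a fixed seed; the class name `HoldoutSplit`, the 30% fraction and the seed are illustrative assumptions and say nothing about how the actual evaluation sets will be chosen. If the cases in a data set are grouped in some way (the data pages describe the layout), a split would need to keep each group together, which this sketch does not do.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical helper for a simple random hold-out split over case indices.
class HoldoutSplit {

    // Returns a two-element list: element 0 holds the training indices,
    // element 1 holds the held-out (testing) indices.
    static List<List<Integer>> split(int numCases, double testFraction, long seed) {
        if (numCases < 0 || testFraction < 0.0 || testFraction > 1.0) {
            throw new IllegalArgumentException("invalid numCases or testFraction");
        }
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < numCases; i++) {
            indices.add(i);
        }
        // A fixed seed makes the split repeatable from run to run.
        Collections.shuffle(indices, new Random(seed));

        int testSize = (int) Math.round(numCases * testFraction);
        List<Integer> test = new ArrayList<>(indices.subList(0, testSize));
        List<Integer> train = new ArrayList<>(indices.subList(testSize, numCases));

        List<List<Integer>> result = new ArrayList<>();
        result.add(train);
        result.add(test);
        return result;
    }

    public static void main(String[] args) {
        List<List<Integer>> parts = split(10, 0.3, 42L);
        System.out.println("training cases: " + parts.get(0).size());
        System.out.println("held-out cases: " + parts.get(1).size());
    }
}
```

Predictions for the held-out cases can then be scored against their known targets before the code is submitted.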

Your grade will be based primarily on the overall quality of your solution (code and implementation) and the answers to questions (below). Ten percent of your grade will be based on your solution's performance relative to the rest of the class.

What to hand in

Hardcopy
1. Printed version of your code (including comments: both block and in-line)
2. Written answers to the following:
  1. Briefly describe the algorithm you implemented. Why did you choose this algorithm to implement? What makes it suited to the problem?
  2. What were the biggest challenges to producing your solution? What about the problem was difficult?
  3. How did the need to address potentially different performance metrics change how you approached the problem?
Electronic copy
1. A tar/gz/whatever file containing the source code. You should submit this via email to howe@cs.colostate.edu by the due date/time for the assignment. As with prior assignments, you should include a README with compilation and execution instructions.