CS 540, Spring 2009: Assignment 3
Data Mining

Due Thursday 4/23 at beginning of class
Late period ends 4/26 at noon

KDD Cup (Knowledge Discovery and Data Mining) is an annual competition started in 1997. Each year, participants vie to see who can produce the most accurate and efficient solution to one or more data mining applications. For your assignment, you will be working on the KDD Cup 2004 data sets. The purpose of the challenge that year was to explore how classification needs to be sensitive to different performance measures. The challenge is described at kddcup2004, which provides a description of the task as originally presented to the participants, software for computing the performance metrics, and two data sets: one for a physics classification problem and one for a protein classification problem.
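
To see why the choice of performance measure matters, consider the minimal sketch below. It scores two made-up classifiers on the same four examples using accuracy and cross-entropy. The labels, the probabilities, and the function names are illustrative assumptions, not part of the challenge data or the perf software, and the metrics you are actually evaluated on are the ones defined in the challenge description.

```python
import math

def accuracy(labels, probs, threshold=0.5):
    """Fraction of examples whose thresholded prediction matches the label."""
    correct = sum(1 for y, p in zip(labels, probs)
                  if (p >= threshold) == (y == 1))
    return correct / len(labels)

def cross_entropy(labels, probs, eps=1e-12):
    """Mean negative log-likelihood of the true labels (lower is better)."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
        total -= math.log(p) if y == 1 else math.log(1 - p)
    return total / len(labels)

# Hypothetical data: true labels and predicted P(class = 1).
labels = [1, 1, 0, 0]
probs_a = [0.6, 0.6, 0.4, 0.4]      # always right, but never confident
probs_b = [0.99, 0.99, 0.01, 0.6]   # one mistake, otherwise very confident

for name, probs in (("A", probs_a), ("B", probs_b)):
    print(name, accuracy(labels, probs), round(cross_entropy(labels, probs), 4))
```

Classifier A wins on accuracy while classifier B wins on cross-entropy, so which one is "best" depends on the measure being optimized.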

What You Need to Do

You are being given considerable latitude on this assignment. You may pick the algorithm you think will work best, provided it is one of the algorithms covered in class or in the readings, or is closely related to them. You cannot use algorithms from CS545. You may also pick the language for your implementation, within some limits.

Restrictions/Guidelines:

Data Sets and Supportive Software

The data sets and the perf software from the challenge are available in ~cs540/kdd/kddcup04.

How You Will Be Graded

In the KDD competition, the training files were released first and the testing files were held in reserve. Because the full data set is already available, the training and testing sets for the evaluation will be selected from the full set. You will not be given either in advance of the evaluation.
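
Since the evaluation split is not known in advance, you may want to hold out part of the data yourself while developing. The sketch below shows one simple way to do that with a seeded random shuffle. The function name, the 25% test fraction, and the placeholder rows are assumptions for illustration only; the actual evaluation split may be chosen differently.

```python
import random

def split_train_test(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    rng = random.Random(seed)   # fixed seed makes the split repeatable
    shuffled = list(rows)       # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Placeholder rows standing in for lines read from a data file.
rows = [f"example-{i}" for i in range(20)]
train, test = split_train_test(rows)
print(len(train), len(test))
```

Repeating the split with several different seeds gives a rough sense of how much your solution's measured performance varies with the particular examples held out.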

Your grade will be based primarily on the overall quality of your solution (code and implementation) and the answers to questions (below). Ten percent of your grade will be based on your solution's performance relative to the rest of the class.

What to Hand In