CS 540, Spring 2014: Assignment 5
Data Mining for Building Algorithm Portfolios

Parts 1 & 2: Solver and Characteristic computation due Thursday 4/9/14 at noon MDT
Parts 3 & 4: Learned Model and Questions due Tuesday 4/21/14 at noon MDT

Download Master CSV

Download Master CSV Version 2 Last Updated: April 18th, 14:00 Changes: Fixed percent-from-optimal-costs for grouse and osprey

Download Master CSV Version 3 Last Updated: April 30th, 15:52 Changes: Fixed entries for red and osprey; fixed four entries for red; fixed the entry for "red" on line 65

A current trend among competitive solvers is to construct portfolios. A portfolio selects, from a set of solvers/algorithms, the one best suited to a particular problem instance. At the heart of a portfolio is a strategy for deciding which algorithm to run. Originally, these strategies were designed by hand. More recently, they have been based on learned models that relate problem characteristics to expected solver performance; the solver predicted to have the best performance is often the one selected.
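
To make the selection step concrete, here is a minimal sketch in Python. All names here are hypothetical: "model" stands for any learned predictor of solver quality, and the predict interface is invented for illustration.

def select_solver(model, solvers, characteristics):
    # Predict the quality (percent from optimal, so lower is better)
    # each solver would achieve on this instance, then pick the minimum.
    # 'model.predict' is a hypothetical interface, not a real library call.
    predicted = {s: model.predict(characteristics, s) for s in solvers}
    return min(predicted, key=predicted.get)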

For this assignment, you will develop code to produce a learned model that predicts which of the class's timetabling algorithms will perform best on a particular problem instance. This is step 1 toward constructing a portfolio; step 2 will come in assignment 6.

This assignment will require three programs.

  1. The first is a solver from assignment 2 or assignment 3.
  2. The second is code to compute the problem characteristic required in assignment 4.
  3. The third is code that names a solver for a given problem instance -- a very simple algorithm portfolio.

Part 1: Solvers

Each student will provide the object code for the solver (method) they used in experiment 2 of assignment 4. This could be from either assignment 2 or assignment 3. Note: if you used someone else's solver for experiment 2, you will need to submit one of your own solvers from assignment 2 or 3, which may mean you first need to fix it. In that case, you will also need to submit a csv file as in assignment 4. You cannot submit one of the provided solvers for credit on this part.

The solvers will be the components from which the constructed portfolios select. Each solver will be assigned a name (randomly selected from a list of bird names) via email, to keep its author anonymous. The code will still be run using a script called timetable, which takes the same arguments as in assignments 2 and 3.

In class, a question was asked about whether to include the 0/1 switch for the objective function. It does not really matter; the easiest option is to remove it, since only one objective function is now being used.

When you submit the code, make sure that you include the object code and the script. For this assignment, you may submit object code, or source code whose script handles compilation, as long as it produces output on the capital machines in less than 5 minutes.

If your code from either assignment is running correctly, then this is an easy 10 points!

Part 2: Problem Characteristics

The core of the portfolio will be a model that predicts which solver should find the best solution in five minutes on a problem with particular characteristic values. That means that we need to know the characteristic values for all problems in the training and testing sets.

We already have the characteristic values for the 52 instances from assignment 4. To support the testing phase, you will need to provide code that computes your characteristic on a given problem instance. The code should run on the capital machines, be named compute-characteristic, and take as input a timetabling problem instance (the .crs file and the .stu file). It should output a single value: the value of your characteristic for that instance. We will run your code from the command line, as in the prior assignments, with a single command such as:

$ compute-characteristic instance1.crs instance1.stu
4.3
Output should be sent to standard out. You should create a script that makes this self-contained; the script can compile and run a program written in another language.
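
For concreteness, below is a minimal sketch of such a script in Python. The characteristic computed (average number of courses per student) and the file format are assumptions made only for illustration: the sketch assumes each nonblank line of the .stu file lists the courses taken by one student.

#!/usr/bin/env python
# compute-characteristic (sketch): prints a single characteristic value.
# ASSUMPTION: each nonblank line of the .stu file lists one student's
# courses; the .crs file is not needed for this particular characteristic.
import sys

def main():
    stu_file = sys.argv[2]  # arguments: instance.crs instance.stu
    with open(stu_file) as f:
        enrollments = [len(line.split()) for line in f if line.strip()]
    # Hypothetical characteristic: average courses per student.
    print(float(sum(enrollments)) / len(enrollments))

if __name__ == "__main__":
    main()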

You have the choice of submitting code for the same characteristic as in assignment 4 or a different one if you prefer. If you submit a different one, then you need to provide a new csv file for the instances from assignment 4 and a text description.

Extra credit [5 points] If you have an idea for an additional characteristic, you can submit it by the due date for Part 2 (no late period). You need to submit both the code for computing it and a csv file.

Part 3: The Learned Model

To support learning the model, you will use a dataset that we will provide, merged from the csv files turned in for assignment 4. Instead of including the problem instance name, though, the first 24 fields in each row will be the characteristic values derived by the students. As with the solvers, each characteristic will be designated by the pseudonym assigned in assignment 4. The penultimate field will be the solver name (aka bird name).

The last field will be the percent from best known/optimal for that solver on the problem instance with those characteristic values. To simplify, only the student-costs objective function will be used.

The training data will include all solvers on all 52 problems, as was turned in for assignment 4 (the csv file). A simplified example of the format is:

red, blue, green, solver, quality
4.3, hard, T, robin, 12.65 
62.0, easy, F, robin, 37.13
27.6, hard, F, robin, 16.75
4.3, hard, T, bluejay, 15.75 
62.0, easy, F, bluejay, 42.42
27.6, hard, F, bluejay, 13.75
4.3, hard, T, dove, 8.57 
62.0, easy, F, dove, 35.00
27.6, hard, F, dove, 16.75
This example has 3 characteristic values that have been computed for each problem instance, as in assignment 4. Each solver solved each instance to produce the quality score. The input file names for each instance have been replaced by its characteristic values; no problem instance names will be used.

The characteristics can be numeric, categorical, binary, etc. The characteristic names will be the same as in the voting phase of assignment 4, with other names added for any new characteristics. So at this stage, we do not know exactly how many characteristics will be in the file: one for each student, plus any extras submitted. The solver names will be the ones we assign. The file should have (52 problems * 24 students/solvers) + 1 header line = 1249 lines, in no set order (however we end up putting it together), and the characteristics may not be ordered alphabetically.
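
As a sketch of one way to consume this file, the following Python reads the master CSV and, for each problem instance, records which solver achieved the lowest percent-from-optimal. The file name master.csv is assumed, and since instance names are absent, the tuple of characteristic values is used to identify an instance (all rows for the same instance share the same characteristic values):

# Sketch: derive best-solver labels from the long-format master CSV.
# ASSUMPTIONS: the file is named master.csv; every field before the last
# two is a characteristic; lower quality (percent from optimal) is better.
import csv

best = {}  # characteristic tuple -> (best quality, best solver)
with open("master.csv") as f:
    reader = csv.reader(f, skipinitialspace=True)
    header = next(reader)  # consume the header row
    for row in reader:
        feats = tuple(row[:-2])
        solver, quality = row[-2], float(row[-1])
        if feats not in best or quality < best[feats][0]:
            best[feats] = (quality, solver)

# Each entry is now one training example: characteristics -> best solver.
for feats, (quality, solver) in best.items():
    print(",".join(feats + (solver,)))

Each resulting row pairs an instance's characteristic values with the name of its best solver, which is exactly the target the learned model must predict.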

Your model should be learned using the tools available in Weka. Weka is available on the department machines at: /usr/local/weka. Information about Weka is available at: Weka guide

You will need to translate the model into something that can be run as code (as before, via a script file) on the capital machines (how you do this depends on which Weka model you choose).

The code should be run as a script so that it is language-independent. Note: this code does not take the training file as input; you should already have run the training and produced a model that can run standalone!

Since this sounds rather confusing, consider a simplified example with just 3 characteristics. The input is:

aqua, azure, blue
4.9, hard, T
62.0, hard, F
Note: characteristics can be numeric, categorical, binary, etc. The first line lists the names of the characteristics. The characteristics may not be ordered alphabetically, nor in the same order as in the training data. Lines after the header refer to specific problem instances, but the instance names will not be provided. Also, the testing problems will be previously unseen instances, so their characteristic values may not match those of any instance seen before! The output could be:
bluejay
robin
In this case, the order of the output must match the order of the input; e.g., "bluejay" is the algorithm selected for the problem represented by the first line. The accuracy of the model will be assessed as the number of instances on which the best solver was selected correctly. We will run the solvers to determine which is best for each instance.
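
To illustrate what the standalone script might look like, here is a sketch in Python built around a hypothetical hand-translated decision tree (e.g., transcribed from Weka's J48 output). The characteristic names and thresholds are invented, chosen only so that the sketch reproduces the sample output above, and the sketch assumes the test characteristics arrive on standard input; a real script would encode your own trained model.

#!/usr/bin/env python
# Sketch of a standalone model script: reads the characteristic CSV on
# standard input and prints one solver name per input instance.
# The tree below is hypothetical, not a real trained model.
import csv
import sys

def predict(row):
    if float(row["aqua"]) <= 10.0:
        return "bluejay" if row["blue"] == "T" else "dove"
    return "robin"

def main():
    reader = csv.DictReader(sys.stdin, skipinitialspace=True)
    for row in reader:
        print(predict(row))

if __name__ == "__main__":
    main()

Run on the two-line sample input above, this sketch prints "bluejay" and then "robin", matching the sample output.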

Part 4: Questions

As with the other assignments, you need to describe what you have done by answering the following two questions, with one essay per question of at least half a page.
  1. State which machine learning model you selected as the core of your portfolio. Why did you select it? How did it do on the training data (i.e., how accurate was it)?
  2. Consider the characteristics that were selected by your model. Were any not used at all? Which were most discriminating? Pick the characteristic you think was most predictive and try to explain why it was effective.

General Guidelines:

As with prior assignments, code will be submitted through Checkin to ensure that it runs with the correct parameters.

Rubric