CS 540, Spring 2014: Assignment 5
Data Mining for Building Algorithm Portfolios
Parts 1 & 2: Solver and Characteristic computation due Thursday 4/9/14 at noon MDT
Download Master CSV
Parts 3 & 4: Learned Model and Questions due Tuesday 4/21/14 at noon MDT
Download Master CSV Version 2 Last Updated: April 18th, 14:00 Changes: Fixed percent-from-optimal-costs for grouse and osprey
Download Master CSV Version 3 Last Updated: April 30th, 15:52 Changes: Fixed entries for red and osprey; Fixed four entries of red; fixed entry for "red" in line 65
A current trend in competitive solvers is to construct
portfolios. Portfolios select between a set of solvers/algorithms to
identify the one that is best suited to solve particular problem
instances. At the heart of a portfolio is a strategy for selecting
which algorithm to run. Originally, the programmer designed the
strategy by hand. More recently, strategies are based on learned models that
relate problem characteristics to expected solver performance; the
solver predicted to have the best performance is often the one selected.
For your assignment, you will be developing code to produce a learned
model that predicts which of the class's timetabling algorithms will perform
best on a particular problem instance. This will be step 1 toward
constructing a portfolio; step 2 will be in assignment 6.
This assignment will require three programs.
- The first is a solver
from assignment 2 or assignment 3.
- The second is code to compute the
problem characteristic required in assignment 4.
- The third is code that
names a solver for a given problem instance -- a very simple
portfolio.
Part 1: Solvers
Each student will provide the object code for the solver (method) they
used in experiment 2 in assignment 4. This could be either from
assignment 2 or assignment 3. Note: If you used someone else's
solver for experiment 2, you will need to submit one of your own
solvers from assignment 2 or 3, which may mean you need to fix it. In
this case, you will also need to submit a csv file as in assignment
4. You cannot submit one of the provided solvers for credit for this part.
The solvers will be the components that can be selected in the
portfolios that are constructed. Each solver will be assigned a name
(randomly selected from a list of bird names) via email to keep the
author private. However, the code will still be run using a script
called timetable which takes the same arguments as in
assignments 2 and 3.
In class, a question was asked about whether to include the 0/1 switch for the objective
function. It does not really matter; the easiest option is to remove it, because
only one objective function is now being used.
When you submit the code, make sure that you
include object code and the script. For this assignment, you can
submit object code or source code where the script includes
compilation, as long as it produces output on the capital machines in less
than 5 minutes.
If your code from either assignment is running correctly, then this is an easy 10 points!
Part 2: Problem Characteristics
The core of the portfolio will be a model that predicts
which solver should find the best solution in five minutes on a
problem with particular characteristic values. That means that we need to
know the characteristic values for all problems in the training and testing
sets. We have the characteristic values for the 52 instances from assignment 4. To
support the testing phase, you will need to provide code to compute
your characteristic on a given problem instance. The code should run on the
capital machines, be named compute-characteristic and take as input
a timetabling problem instance (the .crs file and the .stu file). It
should output a single value -- the value of your characteristic for this
instance. So we will run your code on the command line, as in the
prior assignments with a single command as in:
$ compute-characteristic instance1.crs instance1.stu
Output should be sent to standard out. You should create a script that
makes this self-contained; the script can compile and run a program in
any language you choose.
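As an illustration only, here is a minimal sketch of what such a self-contained compute-characteristic program could look like in Python. The characteristic shown (student-to-course ratio) is an invented placeholder, and the assumption that each non-blank line of the .crs and .stu files is one record is not from this assignment; substitute your own characteristic and the real file formats from the earlier assignments.

```python
#!/usr/bin/env python
# Hypothetical compute-characteristic sketch.
# Placeholder characteristic: ratio of students to courses.
# Assumption (not specified here): one record per non-blank line
# in each of the .crs and .stu files.
import sys

def compute_characteristic(crs_path, stu_path):
    with open(crs_path) as f:
        n_courses = sum(1 for line in f if line.strip())
    with open(stu_path) as f:
        n_students = sum(1 for line in f if line.strip())
    # Guard against an empty .crs file.
    return n_students / float(n_courses) if n_courses else 0.0

if __name__ == "__main__" and len(sys.argv) == 3:
    # Single value on standard out, as the assignment requires.
    print(compute_characteristic(sys.argv[1], sys.argv[2]))
```

A one-line wrapper script named compute-characteristic that invokes this file would satisfy the command-line signature shown above.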
You have the choice of submitting code for the same characteristic as in
assignment 4 or a different one if you prefer. If you submit a
different one, then you need to provide a new csv file for the
instances from assignment 4 and a text description.
Extra credit [5 points]: If you have an idea for an additional
characteristic, you can submit it by the due date for Part 2 (no late
period). You need to submit both the code for computing it and a csv
file of its values for the assignment 4 instances.
Part 3: The Learned Model
To support learning the model, you will use a dataset that we will
provide. It will be merged from the csv files that were turned in from
assignment 4. Instead of including the problem instance name though,
the first 24 fields in each row will be the characteristic values derived by
the students. As with the solvers, each characteristic name will be
designated by the pseudonym assigned in assignment 4. The penultimate
field will be the solver name (aka bird name).
The last field
will be the percent from best known/optimal for that solver on the problem
instance with the characteristics. To simplify, only the student costs
objective function will be used.
The training data will include all solvers
on all 52 problems, as was turned in for assignment 4 (the csv file). A simplified
example of the format is:
red, blue, green, solver, quality
4.3, hard, T, robin, 12.65
62.0, easy, F, robin, 37.13
27.6, hard, F, robin, 16.75
4.3, hard, T, bluejay, 15.75
62.0, easy, F, bluejay, 42.42
27.6, hard, F, bluejay, 13.75
4.3, hard, T, dove, 8.57
62.0, easy, F, dove, 35.00
27.6, hard, F, dove, 16.75
This example has 3 characteristic values that have been computed for each
problem instance, as in assignment 4. Each solver solved each instance
to produce the quality score. The input file names for the instance
have been replaced with the characteristic values; no problem instance names
will be used.
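For intuition, the training label a classifier would predict can be derived from such a table by taking, for each set of characteristic values, the solver with the lowest percent-from-best score. A small sketch over the example above (the column names red/blue/green come from the sample; the real file will have the assigned pseudonyms):

```python
# Sketch: from rows in the merged-CSV format, find the best
# (lowest percent-from-best) solver for each problem instance.
import csv, io

SAMPLE = """red,blue,green,solver,quality
4.3,hard,T,robin,12.65
62.0,easy,F,robin,37.13
27.6,hard,F,robin,16.75
4.3,hard,T,bluejay,15.75
62.0,easy,F,bluejay,42.42
27.6,hard,F,bluejay,13.75
4.3,hard,T,dove,8.57
62.0,easy,F,dove,35.00
27.6,hard,F,dove,16.75
"""

def best_solver_per_instance(csv_text):
    best = {}  # characteristic tuple -> (quality, solver)
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = (row["red"], row["blue"], row["green"])
        q = float(row["quality"])
        # Keep the solver with the lowest percent-from-best so far.
        if key not in best or q < best[key][0]:
            best[key] = (q, row["solver"])
    return {k: v[1] for k, v in best.items()}
```

On the sample data this picks dove for the first two instances and bluejay for the third, which is exactly the kind of label a learned model would be trained to predict.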
The characteristics can be numeric, categorical, binary, etc. The characteristic names
will be the same as in the voting phase in assignment 4 with other
names added for new characteristics. So at this stage, we do not know how
many characteristics will be in the file, one for each student plus any
extras submitted. The solver
names will be what we assign. This file should have (52 problems * 24
students/solvers) + 1 header line = 1249
lines. The rows will be in no set order (however we end up putting it
together), and the characteristics may not be ordered alphabetically.
Your model should be learned using the tools available in Weka. Weka is
available on the department machines at: /usr/local/weka. Information
about Weka is available on the Weka website.
You will need to translate the model into something that can be run as
code (as before a script file) on the capital machines (how you do this depends on which Weka
model you chose).
The code should be run as a script so that it is language
independent. Note: this code does not take the training file as
input; you should already have run the training and produced a model
that can be run standalone!
- As input, it should take a csv file formatted as in
the training data except missing the last two fields (solver and
percent from best known).
- As output, it should produce a file with a
single solver name on each line where that is the solver that is
expected to find the best solution for the problem from the
corresponding line in the input.
- It should be called
timetabling-portfolio and take two arguments: the csv
file name containing a line for each instance's characteristics and
the output file name. Note: we will not be expecting this code to
learn the model, so no training data will be provided as input. Your model
should be learned from the single training dataset that we provide.
Since this sounds rather confusing, consider a simplified example with
just 3 characteristics. The input is:
aqua, azure, blue
4.9, hard, T
62.0, hard, F
Note: characteristics can be numeric, categorical, binary, etc. The first
line lists the name of the characteristic. The
characteristics may not be ordered alphabetically or in the same order as in
the input. Lines after the header refer to specific
problem instances, but the instance names will not be provided. Also,
the testing problems will be previously unseen instances, and so their
characteristic values may not match any instance seen before! The output could be:
bluejay
dove
In this case, order of output must match the order of input, e.g.,
"bluejay" is the algorithm selected for the problem represented by the
first line. The
accuracy of the model will be assessed as the number of instances for which the
best solver was selected correctly. We will run all of the solvers to
determine which one is best for each instance.
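To make the expected behavior concrete, here is a minimal sketch of a timetabling-portfolio program in Python. The decision rules and the characteristic names (aqua, azure) are invented placeholders standing in for whatever model you actually learn; only the I/O shape (csv in, one solver name per output line, in input order) follows the assignment.

```python
#!/usr/bin/env python
# Sketch of a timetabling-portfolio program whose learned model has
# been baked in as if/then rules. The rules and characteristic names
# below are hypothetical placeholders, not a real learned model.
import csv, sys

def predict(row):
    # Hypothetical decision rules, e.g. exported from a trained tree.
    if row["azure"] == "hard" and float(row["aqua"]) < 10.0:
        return "bluejay"
    return "dove"

def run(in_path, out_path):
    with open(in_path) as fin, open(out_path, "w") as fout:
        # skipinitialspace tolerates "aqua, azure, blue"-style headers.
        for row in csv.DictReader(fin, skipinitialspace=True):
            # One solver name per line, same order as the input lines.
            fout.write(predict(row) + "\n")

if __name__ == "__main__" and len(sys.argv) == 3:
    run(sys.argv[1], sys.argv[2])
```

Because the lookup is by column name rather than column position, the script keeps working even if the characteristics appear in a different order than in training.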
Part 4: Questions
As with the other assignments, you need to describe what you have done
by answering the following two questions, with one essay per question of
at least half a page.
- State which machine learning model you selected as the core of
your portfolio. Why did you select it? How did it do on the training
data (i.e., how accurate was it)?
- Consider the characteristics that were selected by your model. Were any not used at all? Which
were most discriminating? Pick the characteristic you think was most
predictive and try to explain why it was effective.
As with prior assignments, code will be submitted through Checkin to
ensure that it runs with the correct parameters.
- As with prior assignments, you have considerable
latitude. Because the purpose is to use machine learning as a tool
for improving search, you get
to pick the machine learning model you think will work best from Weka.
- Although you get to pick the model, do think about how to
incorporate it in the code that identifies solvers. You can wrap
Weka in a script (the Weka primer includes some examples of this) or
you can translate it into your favorite programming language (e.g.,
the decision tree rules map pretty nicely to if/thens).
- Your code should take no more than 5 minutes to run on the
capital machines in the lab, because we will be creating a testing
dataset by running your code (solver and characteristic computation) on
another set of problem instances.
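If you choose to wrap Weka itself rather than translate the model, applying a saved model from the command line requires the test data in ARFF form. Below is a rough sketch of the csv-to-ARFF step; the java invocation in the comment is an assumed example of Weka's documented classifier interface, and in practice the nominal value sets and a class attribute must match the training header exactly, so consult the Weka primer before relying on this.

```python
# Sketch: convert the portfolio's csv input to ARFF so a model saved
# from Weka could be applied on the command line, e.g. (paths assumed):
#   java -cp /usr/local/weka/weka.jar weka.classifiers.trees.J48 \
#        -l portfolio.model -T test.arff -p 0
# Caveat: nominal value sets derived from test data alone may not match
# the training header; Weka requires the attribute declarations to agree.
import csv, io

def csv_to_arff(csv_text, relation="timetabling"):
    rows = list(csv.reader(io.StringIO(csv_text), skipinitialspace=True))
    header, data = rows[0], rows[1:]
    lines = ["@relation " + relation]
    for i, name in enumerate(header):
        values = [r[i] for r in data]
        try:
            # Declare numeric if every value parses as a float...
            [float(v) for v in values]
            lines.append("@attribute %s numeric" % name)
        except ValueError:
            # ...otherwise declare it nominal with the observed values.
            lines.append("@attribute %s {%s}"
                         % (name, ",".join(sorted(set(values)))))
    lines.append("@data")
    lines.extend(",".join(r) for r in data)
    return "\n".join(lines)
```
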
- What to hand in
- Part 1: solver code as object code and script, plus a csv file
for your own solver if you needed to replace what was done in assignment 4
- Part 2: characteristic computation code with script, plus a new csv file
if you have changed or added to your characteristic
- Part 3: learned model code with script
- Part 4: Questions in pdf
- Part 1: Solver code [10 points]
- Does it produce a verifiable solution in 5 minutes? [5 points]
- Does it follow
the signature and I/O requirements? [5 points]
- Part 2: Characteristic code [21 points]
- Does it run in required time (e.g., less than 5 minutes)? [7 points]
- Does it follow
the signature and I/O requirements? [7 points]
- Does it either
match assignment 4 or include the new description and csv file? [7 points]
- Part 3: Learned model code [49 points]
- Does it run in required time (e.g., less than 5 minutes)? [9 points]
- Does it follow
the signature and I/O requirements? [10 points]
- How well
does it use the characteristics? [10 points]
- Does it run on the testing data? [10 points]
- How accurate is it on
the training data? [5 points]
- How accurate is it on the
testing data? [5 points]
- Part 4: Questions [20 points] (readability, insight, completeness)