CS475 Fall 2011 - Final Project
Exploring knapsack DP parallelizations
Objectives
The objective of your final project is to explore parallelizations of
the memory efficient dynamic programming knapsack solver. The programs
you have written so far, as well as the initial studies that you have done
have now exposed you to a variety of parallelization options, and the goal
of your project is to systematically explore some of these choices. In
HW3, you developed the sequential divide-and-conquer program to solve the
knapsack, and this should be your starting point. We want to get good
parallelization on up to 16 provcessors, i.e., multiple nodes on carver.
So just an OpenMP parallelization will not be sufficient, but a purely MPI
approach may be possibe.
You also explored a coarse grain task parallelization of the calls to a
main workhorse function, (let's call it find_last_row(i,j,c)),
and also a direct OpenMP parallelization of the function itself. As you
know, the main workhorse find_last_row(i,j,c), takes as input a
range of item-indices (i, j) and a capacity (c) and computes an array (of
length c+1) giving the optimal profit that can be achieved for all
capacities between 0 and c, for the knapsack problem with only objects i
through j. The main goal of the project is to parallelize the complete
knapsack solver, but to focus on parallelization
of find_last_row(i,j,c). You may set aside the exploration of
the coarse-grain parallelization, and focus primarily
on find_last_row(i,j,c) itself.
Teams: The project will be done in teams of two students.
We strongly encourage mixed teams of online and on-campus students. As
you form your teams, send us email (cs475@cs.colostate.edu) to
register your team. If you like we can set up a discussion group for you
in RamCT.
Parallelization Options
The core driver of your project should be the divide-and-conquer program
of HW3. There are two main directions to pursue, and OpenMP version and
an MPI version. Here are some details of the options:
- OpenMP based and tiled parallelization of
the find_last_row function. This should use the wavefront
parallelization described in the last week's lectures. Since the
computation of each row uses only elements from the previous row, there is
an obvious simple parallelization (that you explored in HW3). However, this
does one synchronization per row, and a tiled implementation would reduce
the overhead. However, the knapsack dependences are not from just the
northwest as yo uhad in the Needlemand Wunsch (HW4), but
from j-w[i], which is different in each row (you may want to
allocate buffers of width WMAX, the maximum weight) may need/want to change
the order in which the table is filled -- instead of row-by-row (i.e., with
the item loop outer, capacity loop inner) you may want to explore making the
item loop the inner loop. Regardless of whether or not you do this
reordering, a number of values in each row need to be saved to be used by
the next tile.
- MPI based and tiled parallelization of find_last_row. Each
processor is (conceptually) responsible for a section of the table (it is
responsible for computing this section of the table, not necessarily
storing all of it) of width c/np, and height n. The required communication
is from one processor to the next. In order to ease your programming task,
you may assume that no weight is greater than c/np, so that the data a
processor needs must come from just the previous processor (send-recv to
nearest neighbors will suffice). Note that in recursive calls, c may be
small and this assumption may not hold. So your code should have a graceful
recovery. Here is one way to do a graceful recovery: assign the tile width
to be max(c/np, MAX_WT). Now, there may be some processors who
have no work to do, but that is merely inefficient, not catastrophic.
You could also explore tiling to reduce the communication frequency, and
increasing the granularity. As in the OpenMP case, this will necessitate
will require saving a "fringe" of values.
Your project must explore a minimal set of required options, and the remaining
choices may be pursued for extra credit. The following is a minimal set of
requirements (so you must submit two separate programs).
- One of your implementations must be a tield OpenMP parallelization of
the find_last_row function (so it may only scale up to 8 proessors
on a single node).
- One of the implementations must be an MPI parallelization of
the find_last_row function, and must scale to 16 processors on two
nodes.
- At least one of them must explore a tiling/blocking approach where the
communication granularity is increased.
Many additional options may be pursued and will earn you extra credit. Here
are some suggestions:
- Combinations of the OpenMP/MPI parallelizations not covered in the
minimum required parts above (e.g, if you did a loop interchange for the
OpenMP parallelization of find_last_row, you could do the same
strategy for the MPI one).
- You may choose to explore hybrid MPI-OpenMP parallelization.
- An analysis of the benefits of "early exit" in the recursion like the
problem in the mock midterm. Your analysis can be either a paper and pencil
quantitative analysis, or it may be experimental, i.e., you modify the
sequential code to do early exit, and then measure the gains for some test
cases (this does not have to be run on carver, since you are seeking
algorithmic gains).
- Finally, you may also get extra credit for turning in the best
performance amongst your peers. We will test your codes for this part on
carver on 16-processors (2-nodes). The problem size will be about 5000
items with a capacity of about 2M.
We post additional details about the format of the submission,
preferred/required naming conventions, etc.
Good luck and have fun!