Project Suggestions
This page is still under construction
Remember, these are merely suggestions, and you are more than
welcome to propose something of your own. Most projects are fairly
open-ended, and could therefore suit one or two students. It is not
desirable to have more than one "group" (i.e., one or two students working
together) on the same project: the idea is not to compete but to work
synergistically on different aspects of the problem. So the following list is
up for grabs on a first-come, first-served basis. It would also be
nice if the groups involved students with complementary skills (e.g., one ECE
and one CS student). There are essentially three types of projects.
Applications
- APP: The algebraic path problem (APP) is a generalized
problem that unifies many problems (such as all-pairs shortest paths in a
graph, the transitive closure of a boolean relation, and matrix inversion,
among others) into a single setting. It (i.e., every instance of the problem)
can be solved by a single generic algorithm schema, simply by changing the
nature of some of the operations. Many parallel algorithms/architectures have
been proposed for the APP; the most recent ones may be found in these two
recent papers. Implement some of the proposed algorithms on an FPGA
co-processor (we have access to a few Annapolis boards in the CS department,
but you may use others too), measure the performance, determine the
bottlenecks, and develop solutions to resolve them.
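To make the "generic algorithm schema" concrete, here is a minimal sketch
(not part of the referenced papers, and ignoring the "star" operation on
diagonal elements that the full APP formulation includes): a
Floyd-Warshall-style kernel parameterized by two operations, "plus"
(combine alternative paths) and "times" (extend a path). Swapping the
operations changes which instance of the problem is solved.

```python
def app(matrix, plus, times):
    """Solve an instance of the algebraic path problem on a copy of matrix."""
    n = len(matrix)
    a = [row[:] for row in matrix]  # work on a copy
    for k in range(n):
        for i in range(n):
            for j in range(n):
                # combine the existing path i->j with the path i->k->j
                a[i][j] = plus(a[i][j], times(a[i][k], a[k][j]))
    return a

# All-pairs shortest paths: the (min, +) semiring.
INF = float("inf")
dist = app([[0, 3, INF],
            [INF, 0, 1],
            [2, INF, 0]],
           plus=min, times=lambda x, y: x + y)

# Transitive closure of a boolean relation: the (or, and) semiring.
closure = app([[True, True, False],
               [False, True, True],
               [False, False, True]],
              plus=lambda x, y: x or y,
              times=lambda x, y: x and y)
```

The same triple loop nest solves both problems; only the two operations
differ, which is exactly what makes a single parallel architecture for the
APP attractive.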
- Protein folding and the optimal parenthesization problem:
(Dae-Kyoo Kim and Eunjee Song) The optimal parenthesization problem is
a classic algorithm for which a neat systolic array was proposed more than
twenty years ago by Guibas, Kung and Thompson. The problem has recently
become important due to its immediate application in determining the
secondary structure of proteins. The project would require you to implement
the array on an FPGA co-processor, measure the performance, determine the
bottlenecks, and develop solutions to resolve them. There are many issues to
resolve: the proposed architecture is a triangular array with about 0.5
n^2 processors, and the total computation time is about 1.5n,
leading to a work of 0.75 n^3, but the algorithm has only a ninth as
many operations. Because the "core" of each processor implements just
two add-compare operations, it may be possible to get a fairly large degree of
parallelism (as high as 100 processors?), and so offset the work-inefficiency.
However, an even more interesting challenge is to obtain an efficient
architecture that can also exploit the parallelism. Finally, a potential way
to improve efficiency is judicious partitioning of the computation between
the host and the FPGA.
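For reference, here is a minimal sequential sketch (an illustration, not the
Guibas-Kung-Thompson array itself) of the optimal parenthesization
recurrence, instantiated for the classic matrix-chain product, where the
interval cost is c[i][j] = min over i &lt; k &lt; j of
c[i][k] + c[k][j] + p[i]*p[k]*p[j]:

```python
def matrix_chain_cost(p):
    """Matrix i has dimensions p[i] x p[i+1]; return the minimum number
    of scalar multiplications needed to compute the whole chain."""
    n = len(p) - 1  # number of matrices
    c = [[0] * (n + 1) for _ in range(n + 1)]
    for span in range(2, n + 1):              # interval length j - i
        for i in range(0, n - span + 1):
            j = i + span
            # try every split point k and keep the cheapest
            c[i][j] = min(c[i][k] + c[k][j] + p[i] * p[k] * p[j]
                          for k in range(i + 1, j))
    return c[0][n]

# Three matrices of dimensions 10x30, 30x5, 5x60:
# ((AB)C) costs 10*30*5 + 10*5*60 = 4500, the optimum.
cost = matrix_chain_cost([10, 30, 5, 60])
```

The triangular dependence structure of the table c (entries of span s depend
on all shorter spans) is precisely what the triangular systolic array
exploits.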
- SOR kernels: Successive over-relaxation (SOR) is an
algorithmic technique commonly used in the numerical simulation of many
physical phenomena (e.g., the so-called "heat equation", electromagnetic wave
propagation using Maxwell's equations, etc.). It is an iterative technique
that lends itself to elegant parallelization on a systolic array. However,
many interesting issues arise when actually implementing such an
architecture on an FPGA co-processor: the tradeoff between precision and the
number of iterations (on an FPGA there is flexibility in the arithmetic
operations, from full IEEE floating-point implementations to fixed
precision); the choice among the different possible systolic arrays and the
resulting architectural tradeoffs; and finally, issues related to tiling,
which could be pursued if time permits.
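As a point of reference for the iteration being parallelized, here is a
minimal sketch (an illustration only, not the project's FPGA design) of SOR
applied to the 2-D Laplace ("steady-state heat") equation on a small grid
with fixed boundary values; omega = 1 reduces to plain Gauss-Seidel:

```python
def sor_step(u, omega):
    """One in-place SOR sweep over the interior points of grid u."""
    for i in range(1, len(u) - 1):
        for j in range(1, len(u[0]) - 1):
            # Gauss-Seidel value: average of the four neighbors
            gs = 0.25 * (u[i-1][j] + u[i+1][j] + u[i][j-1] + u[i][j+1])
            # over-relax: move past the Gauss-Seidel value by factor omega
            u[i][j] += omega * (gs - u[i][j])

# 4x4 grid: top edge held at 1.0, all other boundary points at 0.0.
u = [[1.0, 1.0, 1.0, 1.0]] + [[0.0] * 4 for _ in range(3)]
for _ in range(100):
    sor_step(u, omega=1.5)
# The interior converges to the discrete harmonic solution.
```

The precision/iteration tradeoff mentioned above shows up directly here: with
narrower fixed-point arithmetic each sweep is cheaper in hardware, but more
sweeps may be needed to reach a given accuracy.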
Tools
These projects concern the development of
"compilation" tools for mapping Alpha programs to VHDL (in the context of this
class, we will use FPGA-based co-processor boards as our target).
- Interface generation for 1-D and 2-D systolic arrays on FPGAs:
(Monica Chawathe and Charles Ross) When compiling Alpha (or,
equivalently, an affine control loop program) to silicon, it is necessary to
generate three pieces of "code": (i) the hardware description of the parallel
architecture that implements the Alpha program, (ii) code that runs on the
host (whose job is essentially to issue control signals to start and stop the
co-processor board, and to indicate the memory addresses where input and
output data are to be stored), and (iii) code that actually accesses the
memory locations at the appropriate times and transfers the data to/from the
architecture. This project calls for implementing a generator for the third
piece of code. You may (initially) assume that the Alpha program corresponds
to a simple, 1-dimensional systolic array. Depending on time constraints, the
generator could be extended to handle 2- and higher-dimensional arrays. The
key issue here is the fact that memory access is through a single memory
controller, and so these accesses somehow have to be serialized and buffered.
This project will require understanding of the MMAlpha system, and
(eventually) the notions of multi-dimensional time and code generation from
Alpha, topics that will be covered in class.
- Memory-time tradeoffs:
When compiling an Alpha
program, or implementing a system of affine recurrence equations, there are
well-developed methods that enable us to find an optimal schedule (i.e., a
schedule that computes the result as fast as possible) when there are no
constraints on the number of processors and/or on the memory available. There
are also known results on how, once a schedule is given, to choose the memory
allocation optimally (i.e., so as to minimize the total volume of memory
used). It is currently not known how to simultaneously optimize for both
memory and computation time. Your project will explore this problem.
- Code generation from hierarchical Alpha programs:
(Gautam Gupta) The Alpha language, originally designed by Mauras in
1989, was extended by Dupont de Dinechin in 1997 to incorporate hierarchy and
modularity. Subsystems in Alpha are similar to (but subtly different
from) subroutines in conventional languages, and allow the usual benefits of
structured programming and reuse. The expressive power of subsystems has been
very useful in Alpha (e.g., for modularly describing the hardware structure
of systolic arrays, which is used just before VHDL generation in MMAlpha) and
seems useful for expressing "tiled" and "blocked" programs (which we expect to
be crucial in order to address the issues raised by the memory wall).
However, many of the "compilation" steps in MMAlpha, notably the
code-generation and VHDL-generation modules, do not deal satisfactorily with
subsystems. In this project you will extend the MMAlpha code generator to
produce (parallel) code for Alpha programs with subsystems. I believe that a
clean solution to this problem will greatly aid the problem of interface
generation, among others.
- Compiling non-systolic recurrence equations to silicon:
(Lakshmi Renganarayana) Current methods/techniques for deriving systolic
arrays are applicable to a well-defined (proper) subset of Alpha programs.
Only programs in this subset can be "compiled all the way down to silicon".
Other programs must be compiled in the "conventional" way, i.e., to a
(sequential or parallel) instruction-set processor: essentially as a (possibly
parallel) loop program. This project calls for an exploration of whether/how
it is possible to do better by removing the instruction-set processor
architecture. Essentially, we would like to "hardwire" the control of this
parallel program. This is an open research problem, possibly leading to a
thesis.
- Dynamic dependences and non-polyhedral domains: The
polyhedral model (and Alpha and MMAlpha) is based on systems of recurrence
equations where the dependences are affine and the domains are polyhedral.
This leads to very useful closure properties as well as to powerful static
analysis tools. If we relax these constraints, certain features would no
longer function (e.g., code generation, scheduling, hardware generation,
etc.) but some of the fundamental ideas would nevertheless remain valid
(e.g., the notion of program transformations). In this project, we seek to
explore how the parts that do not work can be "fixed", specifically through
the use of asynchronous circuits and explicit control.
Tiling
Tiling is a loop transformation that has a number
of objectives: improving the communication-to-computation ratio in loop
programs on general-purpose parallel machines, optimizing the performance of
codes on (sequential or parallel) machines with a well-defined memory
hierarchy, etc. It is also used in partitioning systolic arrays for
implementation on bounded resources. Thus, many issues that arise in tiling
for general-purpose machines are of interest to us (even though not all of
them specifically involve silicon compilation, the objective of CS670).
Moreover, since the techniques used are very similar, I nevertheless list all
potential tiling projects here.
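To fix ideas, here is a minimal sketch (illustrative only) of one-level
tiling applied to a matrix transpose: the i and j loops are blocked so that
each B x B tile of the source and destination is traversed completely before
moving on, improving locality without changing the result.

```python
def transpose_tiled(a, B):
    """Transpose square matrix a using B x B tiles."""
    n = len(a)
    t = [[0] * n for _ in range(n)]
    for ii in range(0, n, B):                      # loop over tiles
        for jj in range(0, n, B):
            for i in range(ii, min(ii + B, n)):    # loops within one tile
                for j in range(jj, min(jj + B, n)):
                    t[j][i] = a[i][j]
    return t

a = [[5 * i + j for j in range(5)] for i in range(5)]
t = transpose_tiled(a, B=2)  # same result as an untiled transpose
```

Choosing the tile size B (so a tile fits in cache, or in the on-chip memory
of a partitioned systolic array) is exactly the kind of parameter-selection
question the projects below address.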
- Hierarchical tiling: Most previous work on tiling
considered a single level of tiling, i.e., a transformation designed
to optimally exploit a single level of the memory and/or communication
hierarchy. Though there has been some work on hierarchical tiling, notably at
UCSD, a number of questions remain open regarding (i) the choice of the tiling
parameters and (ii) program transformation, once the parameters have been
chosen. This project will explore these issues.
- Custom caches:
You are given a simple (systolizable)
loop program, and a certain silicon area. However, your chip has a specific
bandwidth (which will end up being the bottleneck, given that we have our
backs to the memory wall). So the issue is how best to use the available (sea
of) silicon. Some of it has to be used as memory (a cache) and the rest for
the processors of the systolic array. However, the cache has
bandwidth-latency-size parameters (the larger the cache, the slower it is
in terms of both latency and bandwidth), whereas the systolic array can
hopefully be made extremely fast through aggressive pipelining. The project
requires you to develop an analytical model of the performance of the
combined architecture, and then investigate the problem of optimally choosing
the parameters of the architecture and the cache (specifically tuned to the
program at hand).
- Validation of 2-D, 3-D and n-D tiling:
We have
previously developed methods for optimally choosing the parameters for tiling
a perfectly nested uniform-dependence program on a general-purpose
distributed-memory parallel machine (such as a Cray T3E or an IBM SP3). This
project requires you to first experimentally validate the results, and then
develop extensions to compare the various results and address a few open
questions.
- Algorithmic Tiling:
Transforming a standard "scalar"
algorithm to a "blocked" version, usually on matrix data types, is often a
useful algorithmic technique. Such blocked programs have all the advantages
of tiling even though, as we have experienced with the APP (see the references
above), the two programs are not semantically equivalent (not all intermediate
results are necessarily identical, although the final values are).
Deriving such a blocked program is thus not a simple syntactic program
transformation to be performed by a compiler, but rather a program derivation
involving programmer intuition and a (formal or informal) proof, aided or not
by a mechanical theorem prover. In this project, you will explore such
"algorithmic tiling" and develop libraries of "blocked" versions of common
programs, together with formal proofs of correctness. You may explore the use
of a mechanical theorem prover (such as PVS) to derive your proofs.