Discussion 4: Sorter Architecture and Analysis
This is a paper-and-pencil, analytical assignment. You should work in your
discussion group and reach a consensus and post the results of your consensus.
The due date is midnight, Thursday, March 31.
Recall the architecture of a P-processor sorter, used to sort N
inputs. First consider the case when P=N, and then extend your
analysis to the case for N=kP, where k is some integer (this
case will require k passes through the array). Some questions to think
of:
- How to load and unload the array. Sure, pushing in MAXINT after the
data is good, but after that how do you reuse the architcture? With a global
reset? When can this occur? So, analyze the total execution time for the
architcture to completely finish the k passes (note that there will be
idle time between the passes).
- How many cycles (clock ticks) does it take to get the last data element
out of the array (for one-pass as well as for the multi-pass case)?
- For the one-pass case:
- How many cycles dos it take till the FIRST output appears?
- How many cycles dos it take till the first PROCESSOR is done?
- How many cycles dos it take till the LAST output appears?
- When does the i-th processor START?
- When is the i-th processor DONE?
- For the k-pass case, the essential question that you
MUST figure out is how to "tightly" interleave multiple passes, without a
global reset (hint: use the control sgnal). Once that is done, need to think
of similar questions as above.
- For the k-pass case, derive a formula for the total execution time
to sort the entire, N-element array (express this as a function of only
N and k, and eliminate P). Then, deduce the number of
passes that minimizes this function. Coment on the answer you get. Does it
make sense? Did your intuition tell you that this would be the case? Why?
- Starting from this Alphabets program, derive the
systolic architecture that we studied in class. Your derivation should choose
a schedule function, an allocation function, the resulting CoB, and then the
transformed program which can be interpreted as a description of the
architecture. Can you systematically derive the I/O (i.e., the loading and
unloading of the array) and the control signals needed for it?
Consider an Alphabets program, with two varaibles, X and Y. After choosing a
parallelization (schedule and processor allocation) and the subsequent CoB,
the domains for the variables are:
float X {i,j | 0 <= j; i-j < N; 2j<=i}
float Y {i,j | 0 <= i <= j < N}
What is the loop structure that would be generated when these two domains are
visited in lexicographical order. You don't need to come up with an algorithm
to do this in the general case (that's a pretty hard problem), as long as you
are able to convince yourself and your discussion partners that the code is
correct.
Sanjay Rajopadhye 2007-01-23