One topic that seems to need some reinforcement is the generation of control signals. In the convolution examples done in class (the standard and the bidirectional ones), I stopped short of systematically deriving the control signals. In particular, my equations had "conditionals" such as t>p, t=p, or even t=2p+N. In a systolic array, a processor cannot be expected to actually evaluate such a condition and use the result to control a mux. What you need to do is replace each such conditional by a boolean variable, and write the equations that define the values of these variables. This is exactly like pipelining: a particular "shared" computation is replaced by a new variable, and you make sure that its value is propagated to all the "consumers".
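To make this concrete, here is a minimal sketch of one possible derivation, not necessarily the one we would reach in class. It assumes a linear array in which the host injects each control bit at processor 0 and the bit then moves one processor to the right after a fixed delay. Under that assumption, t=p becomes a variable c with c[t,p] = c[t-1,p-1] and boundary c[t,0] = (t=0); t>p has the same dependence with boundary (t>0); and t=2p+N becomes c[t,p] = c[t-2,p-1] with boundary (t=N). The names `propagate`, `boundary`, `delay`, and `build_signals` are hypothetical, registers are assumed to reset to False, and N is assumed to be non-negative.

```python
def propagate(boundary, num_procs, num_steps, delay=1):
    """Simulate a control bit that moves one processor right every `delay` steps.

    boundary(t) is the bit the host injects at processor 0 at time t
    (an assumption of this sketch). Returns sig[t][p], the bit that
    processor p sees at time t.
    """
    # Registers are assumed to hold False until a real value arrives.
    sig = [[False] * num_procs for _ in range(num_steps)]
    for t in range(num_steps):
        sig[t][0] = boundary(t)
        for p in range(1, num_procs):
            if t - delay >= 0:
                # Each processor only copies its left neighbour's old bit;
                # it never compares t with p itself.
                sig[t][p] = sig[t - delay][p - 1]
    return sig


def build_signals(num_procs, num_steps, n):
    """Build the three control variables discussed above (assumes n >= 0)."""
    return {
        "t == p": propagate(lambda t: t == 0, num_procs, num_steps, delay=1),
        "t > p": propagate(lambda t: t > 0, num_procs, num_steps, delay=1),
        "t == 2p + N": propagate(lambda t: t == n, num_procs, num_steps, delay=2),
    }


if __name__ == "__main__":
    signals = build_signals(num_procs=4, num_steps=12, n=3)
    for name, sig in signals.items():
        print(name)
        for row in sig:
            print("".join("1" if bit else "." for bit in row))
```

Only the host evaluates a condition on t, and only at the boundary; every other processor just latches and forwards a bit, which is the same local propagation pattern used when pipelining a shared data value.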