Can we do better?
How do we parameterize the computation so that we can optimize reuse?
Write an equation for the total amount of global-to-shared memory traffic based on a set of parameters. Initially, make these parameters as general as you can: e.g., do not assume square blocks (allow rectangles), and parameterize the size of the inner loop that computes multiple elements of the C block. Can you think of other parameters? Other strategies?
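One possible starting point, as a rough sketch rather than the intended answer: assume C (M x N) = A (M x K) * B (K x N), each thread block computes a tile_m x tile_n tile of C, and the inner loop walks along K in chunks of tile_k, loading one tile_m x tile_k piece of A and one tile_k x tile_n piece of B into shared memory per step. All names here (M, N, K, tile_m, tile_n, tile_k, global_to_shared_traffic) are hypothetical, and the sketch assumes the tile sizes divide the matrix dimensions evenly.

```python
def global_to_shared_traffic(M, N, K, tile_m, tile_n, tile_k):
    """Count elements copied from global to shared memory for C = A * B.

    Assumed model (not taken from the notes): one thread block per
    tile_m x tile_n tile of C, stepping through K in chunks of tile_k.
    """
    for dim, tile in ((M, tile_m), (N, tile_n), (K, tile_k)):
        if dim % tile != 0:
            raise ValueError("sketch assumes tile sizes divide the matrix dimensions")

    blocks = (M // tile_m) * (N // tile_n)            # number of C tiles
    steps = K // tile_k                               # inner-loop iterations per tile
    per_step = tile_m * tile_k + tile_k * tile_n      # A piece + B piece per iteration
    return blocks * steps * per_step


# Compare a square tiling with a rectangular one on the same problem size.
square = global_to_shared_traffic(1024, 1024, 1024, 16, 16, 16)
rect = global_to_shared_traffic(1024, 1024, 1024, 64, 16, 16)
print(square, rect)
```

Under these assumptions the product simplifies to M * N * K * (1/tile_m + 1/tile_n): tile_k cancels out of the traffic total and only affects how much shared memory each step needs. Whether that holds for your kernel depends on the loading strategy you actually choose.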