## CUDA Exercise 3: Matrix Multiply

This is our first real exercise: let's discuss how to implement a fast matrix multiply in CUDA.
• Form teams of two
• Let's study
• MMGlob and MMShared from the CUDA Programming Guide.
• MMGlob has no timing in it; just run it for fun.
• Run and time MMShared.
• Let's discuss how to improve MMShared. What can we optimize, and how can we parameterize the "optimization / computation space"?
• General parameters
• Minimize overhead (minimal initialization code; sentinels and loop peeling to avoid conditionals in inner loops)
• Unroll loops
• Reuse data, exploiting locality (shared memory, registers, arrays in registers?) (Experiment: What is the performance difference between shared memory and registers?)
• Minimize communication (registers <-> shared <-> global)
• Exploit vectorization (coalescing)
• Avoid bank conflicts
• In particular:
• How can we increase the size of the C block that a 16x16 threadBlock computes?
• Can we compute the total global <-> shared comms volume given a certain C block size?
• Do the A and B blocks fetched into shared memory have to be square?
• Why would we fetch non-square blocks of A and B?
• Matmult specific parameters
• MMShared reuses A and B elements in shared memory. In the provided code each 16x16 threadBlock computes a 16x16 C block in C = AxB, where A is n rows x m columns, B is m rows x p columns, and thus C is n rows x p columns.
• Can we do better?
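As a shared baseline before optimizing, here is a sketch in the style of the guide's MMShared kernel: each 16x16 thread block stages 16x16 tiles of A and B through shared memory and accumulates one C element per thread. The names and the assumption that n, m, p are multiples of the tile size are ours, not the guide's exact code.

```cuda
#define TILE 16

// C = A x B, row-major. A is n x m, B is m x p, C is n x p.
// Assumes n, m, p are multiples of TILE; general sizes need
// boundary checks or padding.
__global__ void mmShared(const float *A, const float *B, float *C,
                         int n, int m, int p)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;  // row of C for this thread
    int col = blockIdx.x * TILE + threadIdx.x;  // column of C

    float acc = 0.0f;
    for (int t = 0; t < m; t += TILE) {
        // Each thread loads one element of the A and B tiles; consecutive
        // threadIdx.x values give coalesced global loads.
        As[threadIdx.y][threadIdx.x] = A[row * m + (t + threadIdx.x)];
        Bs[threadIdx.y][threadIdx.x] = B[(t + threadIdx.y) * p + col];
        __syncthreads();

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * p + col] = acc;  // one global store per C element
}
```

Launched with a dim3(TILE, TILE) block and a dim3(p/TILE, n/TILE) grid, this is the starting point the questions above ask you to improve.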

How do we parameterize the computation so that we can optimize reuse?

Write an equation for the total amount of global-to-shared memory traffic based on a set of parameters. Initially, make these parameters as general as you can, e.g. do not assume any blocks are square, allow rectangles, and parameterize the inner loop size that computes multiple elements of the C block. Can you think of other parameters? Other strategies?
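One way to set up such an equation, assuming a thread block computes a $b_n \times b_p$ block of C by streaming $b_n \times b_m$ tiles of A and $b_m \times b_p$ tiles of B through shared memory (and that $b_n$, $b_m$, $b_p$ divide $n$, $m$, $p$): each C block must see the full $b_n \times m$ strip of A and the full $m \times b_p$ strip of B exactly once, so

$$\text{traffic per C block} = b_n m + m b_p$$

and with $\frac{n}{b_n}\cdot\frac{p}{b_p}$ C blocks in total,

$$T_{\text{global}\to\text{shared}} = \frac{n}{b_n}\cdot\frac{p}{b_p}\,(b_n m + m b_p) = n\,m\,p\left(\frac{1}{b_p} + \frac{1}{b_n}\right).$$

Note that $T$ falls as the C block grows but is independent of the tile depth $b_m$, which instead trades shared-memory footprint against synchronization frequency.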

• How do we block A and B to get optimal coalescing?
• Should blocks of C live in shared memory? If not, where?
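One answer often explored for the last question: keep the C accumulators in registers, so C never passes through shared memory and touches global memory only once per element. A hypothetical sketch (the WORK factor, kernel name, and blocking choices are our inventions, not provided code) in which each 16x16 thread block computes a 64x16 C block, staging only B through shared memory:

```cuda
#define TN 16   // tile width in columns
#define WORK 4  // C elements per thread (hypothetical tuning parameter)

// Each 16x16 thread block computes a (TN*WORK) x TN = 64x16 block of C.
// Every thread keeps WORK accumulators in registers and writes them to
// global memory once at the end. Assumes n is a multiple of TN*WORK and
// m, p are multiples of TN.
__global__ void mmRegTile(const float *A, const float *B, float *C,
                          int n, int m, int p)
{
    __shared__ float Bs[TN][TN];

    int col = blockIdx.x * TN + threadIdx.x;
    int rowBase = blockIdx.y * (TN * WORK) + threadIdx.y * WORK;

    float acc[WORK] = {0.0f};   // C block lives in registers
    for (int t = 0; t < m; t += TN) {
        // Stage a 16x16 B tile; coalesced across threadIdx.x.
        Bs[threadIdx.y][threadIdx.x] = B[(t + threadIdx.y) * p + col];
        __syncthreads();
        for (int k = 0; k < TN; ++k) {
            float b = Bs[k][threadIdx.x];
            // A is read directly from global memory: all threads in a row
            // request the same element (a broadcast served by the cache),
            // not a coalesced load -- compare with staging A in shared memory.
            for (int w = 0; w < WORK; ++w)
                acc[w] += A[(rowBase + w) * m + (t + k)] * b;
        }
        __syncthreads();
    }
    for (int w = 0; w < WORK; ++w)
        C[(rowBase + w) * p + col] = acc[w];
}
```

Timing this against MMShared is one way to probe the shared-memory-versus-registers experiment suggested above.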