Colorado State University CUDA Exercise Three: Matrix Multiply

This is our first real exercise: let's discuss how to implement a fast matrix multiply in CUDA.