Extra Credit (5 points) Assignment 6 - CUDA Chain 2x2 Matrix Multiply

Introduction

In the coalescing MAX exercise, we used the fact that MAX is commutative, so we could reorder the MAX operations. Matrix multiply is NOT commutative, so we must preserve the order of the multiplications, but we can still read global memory in coalesced fashion by staging the data in shared memory. Download the PA6 tar file, which contains:
  • Makefile
  • timer.h
  • timer.cu
  • MM2chain.cu
  • MM2chainKernel.h

*** Do not use wahoo for your CUDA experiments. ***

In this exercise you will coalesce a chain matrix multiply code. You are given host code only. You will write two kernels, one non-coalescing and one coalescing, both for a 1D grid of 200000 1D thread blocks of 32 threads each.

Non-Coalescing Kernel

Each thread of your non-coalescing kernel, called MM2chainKernel00.cu, will read a contiguous chunk of the global input array d_A containing 60 2x2 matrices, i.e. 240 floats. Each matrix is stored as 4 floats in row-major order: M00, M01, M10, M11. The thread chain multiplies these 60 matrices, using registers for intermediate results, and writes one 4-element result matrix into the global result array d_matmults. The host will memcpy this result array back and check its validity.
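The per-thread work described above can be sketched on the host as a small reference routine. This is only an illustration under stated assumptions, not the required kernel: the function name chainMultiplyReference and the constants are hypothetical, and the left-to-right order M0 * M1 * ... * M59 is assumed (the validity check in the provided host code defines the order that actually counts). The local variables r00..r11 play the role of the registers mentioned in the text.

```cpp
#include <cstddef>

// Values taken from the assignment text: 60 matrices per thread, 4 floats each.
constexpr int MATS_PER_THREAD = 60;
constexpr int FLOATS_PER_MAT = 4;

// Hypothetical host-side reference for the work of one thread.
//   a   : input laid out like d_A (row-major 2x2 matrices, back to back)
//   out : output laid out like d_matmults (4 floats per thread)
//   tid : global thread index (blockIdx.x * blockDim.x + threadIdx.x in a kernel)
void chainMultiplyReference(const float* a, float* out, std::size_t tid) {
    // Contiguous chunk of 240 floats owned by this thread.
    const float* chunk = a + tid * MATS_PER_THREAD * FLOATS_PER_MAT;

    // Running product R, kept in local variables (registers in a kernel).
    float r00 = chunk[0], r01 = chunk[1], r10 = chunk[2], r11 = chunk[3];

    for (int m = 1; m < MATS_PER_THREAD; ++m) {
        const float* b = chunk + m * FLOATS_PER_MAT;
        // R = R * B. The order matters: matrix multiply is not commutative.
        float t00 = r00 * b[0] + r01 * b[2];
        float t01 = r00 * b[1] + r01 * b[3];
        float t10 = r10 * b[0] + r11 * b[2];
        float t11 = r10 * b[1] + r11 * b[3];
        r00 = t00; r01 = t01; r10 = t10; r11 = t11;
    }

    // One 4-element result matrix per thread.
    float* res = out + tid * FLOATS_PER_MAT;
    res[0] = r00; res[1] = r01; res[2] = r10; res[3] = r11;
}
```

Calling this routine for every thread index on a copy of the input gives a CPU result that can be compared against d_matmults, allowing for small floating-point differences between host and device arithmetic.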

Coalescing Kernel

Each thread in your coalescing kernel, called MM2chainKernel01.cu, will read the equivalent of 60 2x2 matrices (240 floats) from the global array d_A into shared memory, in interleaved fashion to achieve coalescing, so a thread block collectively reads 60*32 2x2 matrices. Each thread then computes a chain matrix multiply of 60 contiguous 2x2 matrices by reading from shared memory and accumulating results in registers, and writes the result back to the global memory array d_matmults. This array should be identical to the array produced by MM2chainKernel00.cu. Again, the host will memcpy this result array back and check its validity.
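The interleaved read can be illustrated with a host-side simulation of the index arithmetic. This is a sketch of one possible mapping, not the required solution: the names coalescedGlobalIndex, simulateBlockLoad and sharedChunkStart are hypothetical, and it assumes each block owns 32 * 240 = 7680 consecutive floats of d_A (30720 bytes of shared memory). In load step i, thread tx reads the float at offset i*32 + tx within the block's region, so the 32 threads of a block touch 32 consecutive addresses in every step.

```cpp
#include <cstddef>

// Values taken from the assignment text.
constexpr int THREADS_PER_BLOCK = 32;
constexpr int FLOATS_PER_THREAD = 60 * 4;                                // 240
constexpr int FLOATS_PER_BLOCK = THREADS_PER_BLOCK * FLOATS_PER_THREAD;  // 7680

// Global index read by thread tx of block bx in load step `step`.
// Consecutive tx values give consecutive addresses, which is what coalescing needs.
std::size_t coalescedGlobalIndex(std::size_t bx, int tx, int step) {
    return bx * FLOATS_PER_BLOCK
         + static_cast<std::size_t>(step) * THREADS_PER_BLOCK
         + static_cast<std::size_t>(tx);
}

// Host-side simulation of one block's cooperative load into `shared`
// (an array of FLOATS_PER_BLOCK floats). In a kernel, the outer loop is the
// per-thread loop, the inner loop is the 32 threads running in parallel, and
// a __syncthreads() would follow before any thread reads its matrices.
void simulateBlockLoad(const float* a, float* shared, std::size_t bx) {
    for (int step = 0; step < FLOATS_PER_THREAD; ++step) {
        for (int tx = 0; tx < THREADS_PER_BLOCK; ++tx) {
            int s = step * THREADS_PER_BLOCK + tx;  // same offset inside the tile
            shared[s] = a[coalescedGlobalIndex(bx, tx, step)];
        }
    }
}

// After the load, thread tx finds its own 60 contiguous matrices here.
int sharedChunkStart(int tx) {
    return tx * FLOATS_PER_THREAD;
}
```

Because the shared array keeps the same layout as the block's region of d_A, the chain multiply itself is unchanged: each thread starts at sharedChunkStart(tx) and multiplies its 60 matrices in order.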

You need to extend your Makefile to build both kernels.

Report and submission

Compare and report the performance of the two codes. Submit a PA6.tar file containing:
  • report.pdf
  • Makefile
  • timer.h
  • timer.cu
  • MM2chain.cu
  • MM2chainKernel00.cu
  • MM2chainKernel01.cu
  • MM2chainKernel.h
Again, use the "make tar" command from your Makefile.