Extra Credit (5 points) Assignment 6 - CUDA Chain 2x2 Matrix Multiply
Introduction
In the coalescing MAX exercise, we used the fact that MAX is commutative
and associative, so we could reorder the MAX operations freely. Matrix
multiply is associative but NOT commutative, so we must preserve the order
of the multiplications. We can still read global memory in coalesced
fashion by staging the data in shared memory.
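To make the ordering constraint concrete, here is a minimal host-side sketch
in plain C++. The names Mat2, mul2x2, A, B and C are illustrative and are not
part of the PA6 files. The matrices hold small integer values so that every
float product is exact.

```cpp
#include <array>

// A 2x2 matrix in row-major order: {M00, M01, M10, M11}.
using Mat2 = std::array<float, 4>;

// Returns the product a * b (a is the left operand).
Mat2 mul2x2(const Mat2& a, const Mat2& b) {
    return {a[0] * b[0] + a[1] * b[2],   // M00
            a[0] * b[1] + a[1] * b[3],   // M01
            a[2] * b[0] + a[3] * b[2],   // M10
            a[2] * b[1] + a[3] * b[3]};  // M11
}

// Illustrative inputs (hypothetical, chosen so the arithmetic is exact).
const Mat2 A = {1, 1, 0, 1};
const Mat2 B = {1, 0, 1, 1};
const Mat2 C = {2, 0, 0, 3};
```

With these inputs, mul2x2(A, B) is {2, 1, 1, 1} while mul2x2(B, A) is
{1, 1, 1, 2}, so swapping operands changes the result. Regrouping does not:
mul2x2(mul2x2(A, B), C) and mul2x2(A, mul2x2(B, C)) are both {4, 3, 2, 3}.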
Download the following
PA6 tar file, containing:
- Makefile
- timer.h
- timer.cu
- MM2chain.cu
- MM2chainKernel.h
*** Do not use wahoo for your CUDA experiments. ***
In this exercise you will coalesce a chain matrix multiply code.
You will be given host code only. You will write two kernel codes:
one non-coalescing and one coalescing kernel, both for a 1D grid of
200000 1D thread blocks of 32 threads.
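The sizes implied by this launch configuration can be worked out as in the
sketch below. It is plain C++ and only does index arithmetic. The constant
and function names are hypothetical, and the offsets assume that thread g
owns floats g*240 .. g*240+239 of d_A and floats g*4 .. g*4+3 of d_matmults.
The actual layout is defined by the provided host code in MM2chain.cu, so
check it there.

```cpp
#include <cstddef>

// Sizes taken from the assignment text (the host code is authoritative).
constexpr std::size_t kBlocks          = 200000;  // 1D grid
constexpr std::size_t kThreadsPerBlock = 32;      // 1D thread block
constexpr std::size_t kMatsPerThread   = 60;      // chain length per thread
constexpr std::size_t kFloatsPerMat    = 4;       // 2x2, row major

constexpr std::size_t kFloatsPerThread = kMatsPerThread * kFloatsPerMat;       // 240
constexpr std::size_t kFloatsPerBlock  = kFloatsPerThread * kThreadsPerBlock;  // 7680
constexpr std::size_t kTotalThreads    = kBlocks * kThreadsPerBlock;           // 6400000

// Global thread id; in a kernel this is blockIdx.x * blockDim.x + threadIdx.x.
constexpr std::size_t globalThread(std::size_t block, std::size_t thread) {
    return block * kThreadsPerBlock + thread;
}

// ASSUMED layout: first input float owned by this thread in d_A.
constexpr std::size_t inputOffset(std::size_t block, std::size_t thread) {
    return globalThread(block, thread) * kFloatsPerThread;
}

// ASSUMED layout: first result float written by this thread in d_matmults.
constexpr std::size_t outputOffset(std::size_t block, std::size_t thread) {
    return globalThread(block, thread) * kFloatsPerMat;
}
```

Under these assumptions, neighbouring threads in a block start 240 floats
apart in d_A, which is why the straightforward kernel does not coalesce.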
Non Coalescing Kernel
Each thread of your non-coalescing kernel, called MM2chainKernel00.cu, will
read a contiguous chunk of the global input array d_A containing 60 2x2
matrices (each 4 floats in row major order: M00 M01 M10 M11), i.e. 240
floats. It chain multiplies these in order, using registers for
intermediate results, and writes one 4 element result matrix into the
global result array d_matmults. The host will memcpy this result array
back and check its validity.
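The per-thread computation can be sketched on the CPU as follows. This is a
reference for reasoning about the result, not the kernel itself, and the
function name chainMultiply2x2 is hypothetical. It multiplies the matrices
left to right and keeps the running product in four local variables, which
play the role the registers play in the kernel.

```cpp
#include <cstddef>

// Multiplies `count` (>= 1) 2x2 matrices stored back to back in `chunk`
// (row major: M00 M01 M10 M11) in left-to-right order and writes the
// product to out[0..3].
void chainMultiply2x2(const float* chunk, std::size_t count, float* out) {
    // Running product, initialised with the first matrix of the chain.
    float r00 = chunk[0], r01 = chunk[1], r10 = chunk[2], r11 = chunk[3];

    for (std::size_t m = 1; m < count; ++m) {
        const float* b = chunk + 4 * m;  // next matrix, the right operand

        // Compute all four entries before overwriting the running product.
        const float t00 = r00 * b[0] + r01 * b[2];
        const float t01 = r00 * b[1] + r01 * b[3];
        const float t10 = r10 * b[0] + r11 * b[2];
        const float t11 = r10 * b[1] + r11 * b[3];

        r00 = t00; r01 = t01; r10 = t10; r11 = t11;
    }

    out[0] = r00; out[1] = r01; out[2] = r10; out[3] = r11;
}
```

A quick sanity check: a chain of 60 copies of the matrix 1 1 0 1 should
produce 1 60 0 1, since each multiplication adds 1 to the M01 entry.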
Coalescing Kernel
Each thread in your coalescing kernel, called MM2chainKernel01.cu, will
read 60 2x2 matrices from global array d_A into shared memory, in interleaved
fashion to achieve coalescing. So a thread block collectively
reads 60*32 2x2 matrices. Then each thread computes a chain matrix multiply
of 60 contiguous 2x2 matrices by reading from shared memory and keeping
intermediate results in registers, and writes the result back to the
global memory array d_matmults. This array should be identical to the
array produced
by MM2chainKernel00.cu. Again, the host will memcpy this result array back
and check its validity.
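The difference between the two read patterns can be checked with a small
host-side simulation. The sketch below is plain C++ and all names in it are
hypothetical. It assumes one common way to interleave: in load step k,
thread t reads float k*32 + t of the block's 7680-float chunk and stores it
at the same position in shared memory, so that afterwards thread t finds its
60 matrices at positions t*240 .. t*240+239. Other interleavings are
possible.

```cpp
#include <cstddef>
#include <vector>

constexpr std::size_t kThreads         = 32;   // threads per block
constexpr std::size_t kFloatsPerThread = 240;  // 60 matrices * 4 floats
// Per-block staging buffer: 7680 floats, i.e. 30720 bytes of shared memory.
constexpr std::size_t kFloatsPerBlock  = kThreads * kFloatsPerThread;

// Chunked pattern: thread t walks through its own 240 floats.
constexpr std::size_t chunkedIndex(std::size_t t, std::size_t k) {
    return t * kFloatsPerThread + k;
}

// ASSUMED interleaved pattern: in step k the 32 threads read 32 neighbours.
constexpr std::size_t interleavedIndex(std::size_t t, std::size_t k) {
    return k * kThreads + t;
}

// True if, in load step k, threads 0..31 touch 32 consecutive floats.
template <typename IndexFn>
bool stepIsContiguous(IndexFn index, std::size_t k) {
    for (std::size_t t = 1; t < kThreads; ++t) {
        if (index(t, k) != index(t - 1, k) + 1) return false;
    }
    return true;
}

// True if 240 interleaved steps read every float of the chunk exactly once.
bool interleavedCoversBlock() {
    std::vector<int> hits(kFloatsPerBlock, 0);
    for (std::size_t k = 0; k < kFloatsPerThread; ++k) {
        for (std::size_t t = 0; t < kThreads; ++t) {
            ++hits[interleavedIndex(t, k)];
        }
    }
    for (int h : hits) {
        if (h != 1) return false;
    }
    return true;
}
```

For example, stepIsContiguous(interleavedIndex, 0) holds while
stepIsContiguous(chunkedIndex, 0) does not, because chunked neighbours are
240 floats apart. In a kernel, a block-wide synchronization is needed
between filling shared memory and reading other threads' loads from it.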
You need to extend your Makefile so that it builds both kernels.
Report and submission
Compare and report the performance of the two codes. Submit a PA6.tar file
containing:
- report.pdf
- Makefile
- timer.h
- timer.cu
- MM2chain.cu
- MM2chainKernel00.cu
- MM2chainKernel01.cu
- MM2chainKernel.h
Again, use the "make tar" command from your Makefile.