Assignment 5 - CUDA MMScan
Introduction
In this assignment, you will implement MMScan in CUDA, measure its
performance, revisit the effects of coalescing, tune the parameters for the
best performance, and compare with the CPU implementation. You will start
with the same tar file from PA2. You will write three kernels, one for each
of the three phases of the MMScan algorithm. The first and third will be
called with a grid of G blocks, each with SxS threads (these are set as
#define constants; it is expected that N is a multiple of G and B is a
multiple of S).
You will be writing most of the code from scratch, with little scaffolding.
First, in the wrapper file, remove the call to MMScanDNC and retain only the
call to MMScan, which will serve as your CPU sequential baseline.
Next, allocate device memory for the following variables: X_GPU, Y_GPU,
R1_GPU, and R2_GPU. The first two are self-explanatory; the latter two are
the input and output arrays for Phase_2, each holding GxBxB floats.
Then, write code to (i) transfer X to the GPU, (ii) call the three kernels,
Phase_1, Phase_2, and Phase_3, and (iii) transfer the result back into Y (or
as appropriate) on the host. Make sure that you instrument the code to time
each of the five parts of the computation, including the two data transfers.
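The wrapper steps above might be organized as in the following CUDA sketch. This is a hedged sketch, not the required implementation: the kernel bodies are elided, error checking is omitted, the constants chosen for N, B, G, and S are placeholders, and cudaEvent timing is just one way to time the five parts.

```
// Hypothetical host-side wrapper sketch (nvcc). Kernel bodies elided;
// error checking omitted; constants are placeholder values.
#include <cuda_runtime.h>
#include <stdio.h>

#define N 500000
#define B 32
#define G 100
#define S 16

__global__ void Phase_1(const float *X, float *R1) { /* ... */ }
__global__ void Phase_2(const float *R1, float *R2) { /* ... */ }
__global__ void Phase_3(const float *X, const float *R2, float *Y) { /* ... */ }

void run_mmscan(const float *X, float *Y) {
    float *X_GPU, *Y_GPU, *R1_GPU, *R2_GPU;
    size_t xBytes = (size_t)N * B * B * sizeof(float);
    size_t rBytes = (size_t)G * B * B * sizeof(float);
    cudaMalloc(&X_GPU, xBytes);
    cudaMalloc(&Y_GPU, xBytes);
    cudaMalloc(&R1_GPU, rBytes);
    cudaMalloc(&R2_GPU, rBytes);

    // One event pair per timed part: copy-in, three kernels, copy-out.
    cudaEvent_t ev[6];
    for (int i = 0; i < 6; i++) cudaEventCreate(&ev[i]);

    cudaEventRecord(ev[0]);
    cudaMemcpy(X_GPU, X, xBytes, cudaMemcpyHostToDevice);
    cudaEventRecord(ev[1]);
    Phase_1<<<G, dim3(S, S)>>>(X_GPU, R1_GPU);
    cudaEventRecord(ev[2]);
    Phase_2<<<1, dim3(S, S)>>>(R1_GPU, R2_GPU);
    cudaEventRecord(ev[3]);
    Phase_3<<<G, dim3(S, S)>>>(X_GPU, R2_GPU, Y_GPU);
    cudaEventRecord(ev[4]);
    cudaMemcpy(Y, Y_GPU, xBytes, cudaMemcpyDeviceToHost);
    cudaEventRecord(ev[5]);
    cudaEventSynchronize(ev[5]);

    const char *label[5] = {"copy in", "Phase_1", "Phase_2", "Phase_3", "copy out"};
    for (int i = 0; i < 5; i++) {
        float ms;
        cudaEventElapsedTime(&ms, ev[i], ev[i + 1]);
        printf("%s: %.3f ms\n", label[i], ms);
    }
    cudaFree(X_GPU); cudaFree(Y_GPU); cudaFree(R1_GPU); cudaFree(R2_GPU);
}
```

Because kernel launches are asynchronous, recording events on the same stream as the launches (as above) is what makes the per-phase timings meaningful.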
Finally, write the three kernels described below.
Phase 1 Kernel
Each of the G blocks is responsible for reducing (using matrix product as the
operator) a sequence of n=N/G matrices. Block g starts at the n*g-th matrix
in X and writes its BxB result into the g-th position of the result array
R1_GPU. The algorithm should follow the same pattern as you saw in PA4: in an
outer loop with n iterations,
- the SxS threads cooperatively, and in a coalesced manner, read a BxB
matrix from global to shared memory;
- then they compute a BxB matrix product, with each thread responsible for
(B/S)^2 elements. For this, you will need to allocate two BxB arrays in
shared memory, and update the current result using the previous one.
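Functionally, the value Phase_1 must leave in the g-th slot of R1_GPU is just a chained matrix product. The serial C sketch below describes that result (not the SxS-threaded kernel itself); the helper names are illustrative, the left-to-right product order is assumed from PA2, and the scratch buffer assumes B <= 64.

```c
#include <string.h>

/* C[i][j] = sum_k A[i][k] * Bm[k][j]; all matrices BxB, row-major. */
static void matmul(const float *A, const float *Bm, float *C, int B) {
    for (int i = 0; i < B; i++)
        for (int j = 0; j < B; j++) {
            float s = 0.0f;
            for (int k = 0; k < B; k++)
                s += A[i * B + k] * Bm[k * B + j];
            C[i * B + j] = s;
        }
}

/* Serial reference for block g of Phase_1: reduce matrices
 * n*g .. n*g+n-1 of X into the single BxB product R1_g. */
void phase1_block(const float *X, float *R1_g, int n, int g, int B) {
    size_t m = (size_t)B * B;
    float tmp[64 * 64];                     /* assumes B <= 64 */
    memcpy(R1_g, X + (size_t)n * g * m, m * sizeof(float));
    for (int i = 1; i < n; i++) {
        matmul(R1_g, X + ((size_t)n * g + i) * m, tmp, B);
        memcpy(R1_g, tmp, m * sizeof(float));
    }
}
```

In the kernel, the same loop body is distributed over the SxS threads, with each BxB operand staged through shared memory first.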
Phase 2 Kernel
We expect this kernel to be called with a grid of exactly one block of SxS
threads. All the threads execute a loop with G iterations, computing the scan
of R1_GPU and writing the result into R2_GPU. The code is very similar to
that of Phase_1, except that the accumulated answer is copied to global
memory in each iteration.
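The scan Phase_2 computes over the G block products can likewise be described serially (same hedges as before: illustrative names, left-to-right product order, B <= 64 for the scratch buffer):

```c
#include <string.h>

/* C[i][j] = sum_k A[i][k] * Bm[k][j]; all matrices BxB, row-major. */
static void matmul(const float *A, const float *Bm, float *C, int B) {
    for (int i = 0; i < B; i++)
        for (int j = 0; j < B; j++) {
            float s = 0.0f;
            for (int k = 0; k < B; k++)
                s += A[i * B + k] * Bm[k * B + j];
            C[i * B + j] = s;
        }
}

/* Serial reference for Phase_2: an inclusive scan under matrix product,
 * R2[g] = R1[0] * R1[1] * ... * R1[g], for g = 0 .. G-1. */
void phase2_scan(const float *R1, float *R2, int G, int B) {
    size_t m = (size_t)B * B;
    float tmp[64 * 64];                     /* assumes B <= 64 */
    memcpy(R2, R1, m * sizeof(float));
    for (int g = 1; g < G; g++) {
        matmul(R2 + (size_t)(g - 1) * m, R1 + (size_t)g * m, tmp, B);
        memcpy(R2 + (size_t)g * m, tmp, m * sizeof(float));
    }
}
```

Note that, unlike the Phase_1 loop, every iteration writes its accumulated matrix out (here to R2; in the kernel, to global memory).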
Phase 3 Kernel
This kernel is executed by G blocks, where the g-th block computes a scan of
the g-th section of X and writes the result into the appropriate section of
Y, so you may expect its code and logic to be nearly identical to those of
the Phase_2 kernel. The one caveat is that in the first iteration, each block
uses the appropriate element of R2 to start off the scan.
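One reading of "the appropriate element of R2" is R2[g-1], the product of everything before block g's section, with block 0 needing no seed. The serial C sketch below is written under that assumption (illustrative names, left-to-right products, B <= 64), and should be checked against your own derivation:

```c
#include <string.h>

/* C[i][j] = sum_k A[i][k] * Bm[k][j]; all matrices BxB, row-major. */
static void matmul(const float *A, const float *Bm, float *C, int B) {
    for (int i = 0; i < B; i++)
        for (int j = 0; j < B; j++) {
            float s = 0.0f;
            for (int k = 0; k < B; k++)
                s += A[i * B + k] * Bm[k * B + j];
            C[i * B + j] = s;
        }
}

/* Serial reference for block g of Phase_3: a running scan of its section
 * of X, seeded (for g > 0) with R2[g-1]. Every iteration writes its
 * accumulated matrix into the matching slot of Y. */
void phase3_block(const float *X, const float *R2, float *Y,
                  int n, int g, int B) {
    size_t m = (size_t)B * B;
    float acc[64 * 64], tmp[64 * 64];       /* assumes B <= 64 */
    for (int i = 0; i < n; i++) {
        const float *Xi = X + ((size_t)n * g + i) * m;
        if (i == 0 && g == 0) {
            memcpy(acc, Xi, m * sizeof(float));
        } else if (i == 0) {
            matmul(R2 + (size_t)(g - 1) * m, Xi, acc, B);  /* the seed */
        } else {
            matmul(acc, Xi, tmp, B);
            memcpy(acc, tmp, m * sizeof(float));
        }
        memcpy(Y + ((size_t)n * g + i) * m, acc, m * sizeof(float));
    }
}
```

For a quick sanity check with B=1 (so products are scalar), X = [2, 3, 5, 7] with G=2 gives R1 = [6, 35], R2 = [6, 210], and the two Phase_3 blocks produce Y = [2, 6, 30, 210], matching the direct scan.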
Report
- 10 points
- Run your experiments on the planets machines: mercury,
venus, earth, mars, jupiter, saturn, uranus, and neptune
- Report the best execution time you get with N=500,000, B=32 and with N=100,000, B=64
- Experimentally determine the best values of G and S (gridDim.x and blockDim.x) and report these values
- Report how G and S influence performance for both problem sizes
Submission
Submit a PA5.tar file through Checkin.
NOTE: Set your output file name to MMScanCUDA. Use the flags as you did in PA2.