Assignment 5 - CUDA MMScan

Introduction

In this assignment, you will implement MMScan in CUDA, measure its performance, revisit the effects of coalescing, tune the parameters for the best performance, and compare with the CPU implementation. You will start with the same tar file from PA2. You will write three kernels, one for each of the three phases of the MMScan algorithm. The first and third will be called with a grid of G blocks, each with SxS threads (these are set as #define constants; it is expected that N is a multiple of G and B is a multiple of S). You will be writing most of the code from scratch, with little scaffolding.

First, in the wrapper file, get rid of the call to MMScanDNC and retain only the call to MMScan, which will serve as your CPU sequential baseline.

Next, you will allocate device memory for the following variables: X_GPU, Y_GPU, R1_GPU, and R2_GPU. The first two are self-explanatory; the latter two are the input and output arrays for Phase_2, each of size GxBxB floats.
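The allocations can be sketched as follows, assuming the #define constants N, B, and G from this handout and that X and Y on the host each hold N*B*B floats:

```cuda
// Sketch of the device allocations (names per the handout; sizes assume
// row-major storage of N and G matrices of B x B floats, respectively).
float *X_GPU, *Y_GPU, *R1_GPU, *R2_GPU;
size_t xBytes = (size_t)N * B * B * sizeof(float);  // X and Y
size_t rBytes = (size_t)G * B * B * sizeof(float);  // R1 and R2
cudaMalloc(&X_GPU,  xBytes);
cudaMalloc(&Y_GPU,  xBytes);
cudaMalloc(&R1_GPU, rBytes);
cudaMalloc(&R2_GPU, rBytes);
```

Checking the return value of each cudaMalloc (e.g. against cudaSuccess) is good practice, especially at the large problem sizes used in the report.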

Then, write code to (i) transfer X to the GPU, (ii) call the three kernels, Phase_1, Phase_2, and Phase_3, and (iii) transfer the result back into Y (or as appropriate) on the host. Make sure that you instrument the code to time each of the five parts of the computation including the data transfer times.
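One way to structure the transfers, launches, and timing is with cudaEvent timers, as in this host-side sketch (variable names other than the handout's are illustrative):

```cuda
// Time each of the five parts: H->D copy, the three kernels, D->H copy.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
float msIn, ms1, ms2, ms3, msOut;  // elapsed times in milliseconds

cudaEventRecord(start);
cudaMemcpy(X_GPU, X, xBytes, cudaMemcpyHostToDevice);
cudaEventRecord(stop);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&msIn, start, stop);

dim3 threads(S, S);  // each block is S x S threads

cudaEventRecord(start);
Phase_1<<<G, threads>>>(X_GPU, R1_GPU);
cudaEventRecord(stop);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&ms1, start, stop);

cudaEventRecord(start);
Phase_2<<<1, threads>>>(R1_GPU, R2_GPU);
cudaEventRecord(stop);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&ms2, start, stop);

cudaEventRecord(start);
Phase_3<<<G, threads>>>(X_GPU, R2_GPU, Y_GPU);
cudaEventRecord(stop);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&ms3, start, stop);

cudaEventRecord(start);
cudaMemcpy(Y, Y_GPU, xBytes, cudaMemcpyDeviceToHost);
cudaEventRecord(stop);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&msOut, start, stop);
```

Because cudaEventElapsedTime synchronizes on the recorded events, each kernel's time is measured in isolation rather than overlapped with the next launch.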

Finally, write the three kernels described below.

Phase 1 Kernel

Each of the G blocks is responsible for reducing (using matrix product as the operator) a sequence of n=N/G matrices. Block g starts with the n*g-th matrix in X, and writes its BxB result into the g-th position in the result array R1_GPU. The algorithm should follow the same pattern as you saw in PA4: in an outer loop with n iterations,
  • the SxS threads cooperatively read a BxB matrix from global to shared memory in a coalesced manner;
  • then they compute a BxB matrix product, with each thread responsible for (B/S)^2 elements. For this, you will need to allocate two BxB arrays in shared memory, and update the current result using the previous one.
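The loop above can be sketched as below, assuming row-major storage and a left-to-right product (new accumulator = old accumulator times the newly read matrix); the tmp register array lets every thread finish reading the old accumulator before anyone overwrites it:

```cuda
// Sketch of Phase_1: block g reduces matrices n*g .. n*g + n-1 of X
// into position g of R1. Assumes #define constants N, B, G, S with
// N % G == 0 and B % S == 0.
__global__ void Phase_1(const float *X, float *R1) {
    __shared__ float M[B][B];    // matrix staged from global memory
    __shared__ float Acc[B][B];  // running product
    int tx = threadIdx.x, ty = threadIdx.y;
    int g = blockIdx.x, n = N / G;

    // Initialize the accumulator to the identity matrix.
    for (int i = ty; i < B; i += S)
        for (int j = tx; j < B; j += S)
            Acc[i][j] = (i == j) ? 1.0f : 0.0f;
    __syncthreads();

    for (int k = 0; k < n; k++) {
        // Coalesced cooperative read of the (n*g + k)-th matrix:
        // adjacent threads (tx) touch adjacent addresses.
        const float *src = X + (size_t)(n * g + k) * B * B;
        for (int i = ty; i < B; i += S)
            for (int j = tx; j < B; j += S)
                M[i][j] = src[i * B + j];
        __syncthreads();

        // Each thread computes its (B/S)^2 elements of Acc * M.
        float tmp[B / S][B / S];
        for (int i = 0; i < B / S; i++)
            for (int j = 0; j < B / S; j++) {
                float sum = 0.0f;
                for (int q = 0; q < B; q++)
                    sum += Acc[i * S + ty][q] * M[q][j * S + tx];
                tmp[i][j] = sum;
            }
        __syncthreads();  // everyone is done reading the old Acc
        for (int i = 0; i < B / S; i++)
            for (int j = 0; j < B / S; j++)
                Acc[i * S + ty][j * S + tx] = tmp[i][j];
        __syncthreads();
    }

    // Write the block's BxB reduction to its slot in R1.
    float *dst = R1 + (size_t)g * B * B;
    for (int i = ty; i < B; i += S)
        for (int j = tx; j < B; j += S)
            dst[i * B + j] = Acc[i][j];
}
```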

Phase 2 Kernel

We expect that this kernel will be called with a grid of exactly one block of SxS threads. All the threads execute a loop with G iterations, computing the scan of R1_GPU and writing the result into R2_GPU. The code is very similar to that in Phase_1, except that the accumulated answer is copied to global memory in each iteration.
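A sketch, assuming an inclusive scan (so R2[g] holds the product R1[0] * ... * R1[g]) and the same accumulator pattern as Phase_1:

```cuda
// Sketch of Phase_2, launched as Phase_2<<<1, dim3(S, S)>>>(R1_GPU, R2_GPU).
__global__ void Phase_2(const float *R1, float *R2) {
    __shared__ float M[B][B], Acc[B][B];
    int tx = threadIdx.x, ty = threadIdx.y;

    for (int i = ty; i < B; i += S)
        for (int j = tx; j < B; j += S)
            Acc[i][j] = (i == j) ? 1.0f : 0.0f;  // identity
    __syncthreads();

    for (int g = 0; g < G; g++) {
        const float *src = R1 + (size_t)g * B * B;
        for (int i = ty; i < B; i += S)
            for (int j = tx; j < B; j += S)
                M[i][j] = src[i * B + j];
        __syncthreads();

        float tmp[B / S][B / S];
        for (int i = 0; i < B / S; i++)
            for (int j = 0; j < B / S; j++) {
                float sum = 0.0f;
                for (int q = 0; q < B; q++)
                    sum += Acc[i * S + ty][q] * M[q][j * S + tx];
                tmp[i][j] = sum;
            }
        __syncthreads();

        // Unlike Phase_1: copy the accumulated answer out every iteration.
        float *dst = R2 + (size_t)g * B * B;
        for (int i = 0; i < B / S; i++)
            for (int j = 0; j < B / S; j++) {
                Acc[i * S + ty][j * S + tx] = tmp[i][j];
                dst[(i * S + ty) * B + (j * S + tx)] = tmp[i][j];
            }
        __syncthreads();
    }
}
```

With a single block, this phase does only G matrix products; its cost is usually small next to Phases 1 and 3.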

Phase 3 Kernel

This kernel is executed by G blocks, where the g-th block computes a scan of the g-th section of X and writes the result into the appropriate section of Y. You may therefore expect the code and logic of this kernel to be nearly identical to those of the Phase_2 kernel, with one caveat: in the first iteration, each block uses the appropriate element of R2 to start off the scan.
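A sketch of the caveat, assuming R2_GPU holds the inclusive scan from Phase_2, so block g seeds its accumulator with R2[g-1] (the identity for block 0); the loop body is the same product-then-copy pattern as Phase_2, so it is elided to a comment here:

```cuda
// Sketch of Phase_3, launched as Phase_3<<<G, dim3(S, S)>>>.
__global__ void Phase_3(const float *X, const float *R2, float *Y) {
    __shared__ float M[B][B], Acc[B][B];
    int tx = threadIdx.x, ty = threadIdx.y;
    int g = blockIdx.x, n = N / G;

    // Seed the accumulator: identity for block 0, R2[g-1] otherwise.
    for (int i = ty; i < B; i += S)
        for (int j = tx; j < B; j += S)
            Acc[i][j] = (g == 0)
                ? ((i == j) ? 1.0f : 0.0f)
                : R2[(size_t)(g - 1) * B * B + i * B + j];
    __syncthreads();

    for (int k = 0; k < n; k++) {
        size_t off = (size_t)(n * g + k) * B * B;
        // As in Phase_2: stage X + off into M, multiply into Acc via a
        // tmp register array, and copy Acc into Y + off each iteration.
    }
}
```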

Report

  1. 10 points
  2. Run your experiments on the planets machines: mercury, venus, earth, mars, jupiter, saturn, uranus, and neptune
  3. Report the best execution time you get with N=500,000, B=32 and with N=100,000, B=64
  4. Experimentally determine the best values of G and S (gridDim.x and blockDim.x) and report these values
  5. Report how G and S influence performance for both problem sizes

Submission

Submit a PA5.tar file through Checkin.

NOTE: Set your output file name to MMScanCUDA. Use the flags as you did in PA2.