CS475 PA5: CUDA Matrix Multiply

Introduction

The purpose of this exercise is for you to learn how to write programs using the CUDA programming interface, how to run such programs using an NVIDIA graphics processor, and how to think about the factors that govern the performance of programs running within the CUDA environment.

For this assignment, you will modify two provided CUDA kernels. The first is part of a program that performs vector addition; the second is part of a program that performs matrix multiplication. The provided vector addition program does not coalesce memory accesses, and you will modify it so that it does. You will then modify the matrix multiplication program to investigate its performance and how it can be optimized by changing how the task is parallelized: you will create an improved version of the program, empirically measure the time it takes to run, and analyze the results of your timing tests.

You will turn in all code you have written and used for this assignment, a makefile that will compile your code, and a README file documenting your experiments and results. We need to be able to compile your code by executing "make".

Resources that may be useful for this assignment include:

Compiling and Running a CUDA Program

For this assignment, you will need to use a Linux machine with an NVIDIA graphics card. In the CSU labs, the machines in rooms 215, 225, and 325 have such a graphics card. In general, the machines in 325 are preferred for use with CUDA. These machines are: dates, figs, grapes, huckleberries, kiwis, lemons, melons, nectarines, peaches, pears, raspberries, kumquats, bananas, coconuts, and oranges.

As the Tesla 1060 machines are currently out of order, we will run all experiments on the machines listed above, using smaller problem sizes. You can use ssh to work on the lab machines remotely, since your work with CUDA will not require access to the main console of the machine you are using (you will not be doing any graphics). To find other machines suitable for this assignment (those in rooms 215 and 225), consult the list of CS department machines noted above. Note that the machines in room 120 (the capital machines) do not have NVIDIA graphics cards and are not suitable for this assignment.

To use CUDA on the lab machines at CSU, you will need to set the right environment variables. It's convenient to edit your .cshrc or your .profile file (depending on whether you are using tcsh or bash) to set them when you log in. You should add

  • /usr/local/cuda/lib64 to LD_LIBRARY_PATH
  • /usr/local/cuda/man to MANPATH
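For example, bash users might add lines like the following to .profile (tcsh users would instead use setenv in .cshrc); the paths are the ones listed above:

  export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
  export MANPATH=/usr/local/cuda/man:$MANPATH
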
For detailed instructions, see the CUDA FAQ mentioned above.

Download and untar the CUDA1.tar file.

To compile a CUDA program, use the CUDA compiler nvcc. You should use the provided makefile as a starting point for this assignment. Initially, you will use the provided makefile to verify that your environment is correctly configured to use CUDA.
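As a rough sketch of the workflow (whether the tar file unpacks into a subdirectory, and the exact program arguments, are assumptions; check the provided makefile and sources):

  tar xf CUDA1.tar
  make
  ./vecadd00 500

Here 500 would be the number of values per thread, matching the timing experiments described below.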

Vector add: coalescing memory accesses

As discussed in class, vecadd is a microbenchmark for determining the effectiveness of coalescing. You are provided with a non-coalescing version; your job is to create a coalescing version and to measure the difference in performance between the two codes.

Here is the set of files provided in CUDA1.tar:

  • Makefile
  • vecadd.cu
  • vecaddKernel.h
  • vecaddKernel00.cu
  • timer.h
  • timer.cu

Compile and run the provided program vecadd00, and collect data on the time it takes with each of the following numbers of values per thread: 500, 1000, and 2000.

Include a short comment in your makefile describing what your various programs do, and fully describe and analyse the experiments you performed in your README document.

In vecadd01 you will use a new device kernel, called vecaddKernel01.cu, to perform vector addition using coalesced memory reads. Change the makefile to compile a program called vecadd01 using your kernel rather than the provided kernel. Also modify the makefile so that the clean and tar targets handle any files you have added.
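
To make the access pattern concrete, here is a minimal sketch of a coalesced indexing scheme (the kernel name, signature, and the meaning of N are assumptions; match your actual kernel to the interface in vecaddKernel.h):

  // Sketch only: each thread performs N additions, stepping by the total
  // thread count, so consecutive threads touch consecutive elements on
  // every iteration and each warp's loads and stores can coalesce.
  __global__ void AddVectors(const float* A, const float* B, float* C, int N)
  {
      int tid    = blockIdx.x * blockDim.x + threadIdx.x;
      int stride = gridDim.x * blockDim.x;    // total number of threads

      for (int i = 0; i < N; ++i) {
          int idx = tid + i * stride;         // strided by thread count
          C[idx] = A[idx] + B[idx];
      }
  }

The provided non-coalescing kernel presumably has each thread walk through its own contiguous chunk of the arrays instead, so threads in a warp access addresses far apart; comparing the two indexing schemes is the point of the experiment.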

Test your new program and compare the time it takes to perform the same work as the original. Note and analyse your results, and document your observations in the README file.

Shared/shared CUDA Matrix Multiply

For the next part of this assignment, you will use matmultKernel00 as the basis for another kernel, with which you will investigate the performance of matrix multiplication using a GPU and the CUDA programming environment. Your kernel should be called matmultKernel01.cu. Your code files must include internal documentation on the nature of each program, and your README file must include notes on the nature of the experiment. You may develop intermediate stages of this new kernel, e.g. with and without unrolling, but we will grade your final improved version of the code and its documentation.

Here is the set of files provided in CUDA1.tar for matmult:

  • Makefile
  • matmult.cu
  • matmultKernel.h
  • matmultKernel00.cu
  • timer.h
  • timer.cu

You should investigate how each of the following factors influences performance of matrix multiplication in CUDA:

  • (1) The size of matrices to be multiplied
  • (2) The size of the block computed by each thread block
  • (3) Any other performance-enhancing modifications you find
You should time the initial code provided using square matrices of each of the following sizes: 256, 512, 1024.

When run with a single parameter, the provided code multiplies that parameter by FOOTPRINT_SIZE (set to 16 in matmult00) and creates square matrices of the resulting size. This avoids nasty padding issues: the data blocks always fit the grid exactly. Here is an example of running matmult00 on figs:

figs 46 # matmult00 32
Data dimensions: 512x512 
Grid Dimensions: 32x32 
Block Dimensions: 16x16 
Footprint Dimensions: 16x16 
Time: 0.016696 (sec), nFlops: 268435456, GFlopsS: 16.077853

Notice that the parameter value 32 creates a square matrix problem of size 32 * 16 = 512. (The reported flop count, 268435456, is 2 * 512^3: each of the 512^2 output elements takes 512 multiplications and 512 additions.)

In your new kernel, each thread computes four values of the resulting C block rather than one, so FOOTPRINT_SIZE becomes 32. (Notice that this is taken care of by the Makefile.) You will time the execution of this new program with matrices of the sizes listed above to document how your changes affect performance. To get good performance, you will need to make sure the new program coalesces its reads from and writes to global memory. You will also need to unroll any loops you might be inclined to insert into your code.
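
As a starting point, here is a sketch of one possible thread-to-output mapping, in which each thread of a 16x16 block computes a 2x2 group of C elements inside a 32x32 footprint. The Matrix layout is assumed to match matmultKernel.h, and the shared-memory tiling and loop unrolling that you need for good performance are deliberately omitted:

  // Sketch only: global-memory version, no tiling or unrolling.
  // Matrix layout assumed to match matmultKernel.h (as in the CUDA
  // Programming Guide samples):
  typedef struct { int width; int height; int stride; float* elements; } Matrix;

  #ifndef FOOTPRINT_SIZE
  #define FOOTPRINT_SIZE 32   // normally defined by the Makefile
  #endif

  __global__ void MatMulKernel(Matrix A, Matrix B, Matrix C)
  {
      // Top-left corner of this block's 32x32 footprint in C.
      int blockRow = blockIdx.y * FOOTPRINT_SIZE;
      int blockCol = blockIdx.x * FOOTPRINT_SIZE;

      // Each thread owns a 2x2 group of output elements.
      int row = blockRow + 2 * threadIdx.y;
      int col = blockCol + 2 * threadIdx.x;

      float c00 = 0, c01 = 0, c10 = 0, c11 = 0;
      for (int k = 0; k < A.width; ++k) {
          float a0 = A.elements[row * A.stride + k];
          float a1 = A.elements[(row + 1) * A.stride + k];
          float b0 = B.elements[k * B.stride + col];
          float b1 = B.elements[k * B.stride + col + 1];
          c00 += a0 * b0;  c01 += a0 * b1;
          c10 += a1 * b0;  c11 += a1 * b1;
      }
      C.elements[row * C.stride + col]           = c00;
      C.elements[row * C.stride + col + 1]       = c01;
      C.elements[(row + 1) * C.stride + col]     = c10;
      C.elements[(row + 1) * C.stride + col + 1] = c11;
  }

Note that this particular 2x2 grouping has adjacent threads writing to addresses two elements apart; a mapping in which a thread's four elements are separated by blockDim.x instead keeps adjacent threads on adjacent addresses and coalesces better. Choosing and justifying the mapping, together with tiling and unrolling, is part of the assignment.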

Here is an example of running my matmult01 on figs:

figs 28>matmult01 16
Data dimensions: 512x512 
Grid Dimensions: 16x16 
Block Dimensions: 16x16 
Footprint Dimensions: 32x32 
Time: 0.012185 (sec), nFlops: 268435456, GFlopsS: 22.030248

Notice that the parameter value 16 now creates a square matrix problem of size 16 * 32 = 512, because the footprint has become 32x32. Also notice the nice speedup.

Note the results of each experiment in your Lab Notes. The type of code you have been asked to write can obtain speeds of about 220 GFlops/sec on a Tesla 1060. Can you reach that figure?

Turning in Your Results

Finish your lab report in README, explaining the results of your experiments. Provide any general conclusions you have reached about CUDA programming. Can you formulate any rules of thumb for obtaining good performance when using CUDA? To turn in your assignment, use tar to package your code (you may use the make tar command from the makefile), your makefile, and your Lab Notes README file. Your tar file should be named CUDA1.tar.