CS475 Lab 5: Introduction to GPUs and CUDA: Coalescing

Introduction

The purpose of this exercise is for you to learn how to write programs using the CUDA programming interface, how to run such programs using an NVIDIA graphics processor, and how to think about the factors that govern the performance of programs running within the CUDA environment.

In this lab, you will modify a CUDA kernel and empirically compare its performance with that of the given code. The given program performs vector addition but does not coalesce its memory accesses. You will modify it so that its memory accesses are coalesced.

Resources that may be useful for this assignment include:

Compiling and Running a CUDA Program

You will need to use a Linux machine with an NVIDIA graphics card. The machines in the CSU 325 lab have such a graphics card. These machines are named after fish (anchovy, ...); to find them, see the list of department machines and grep for 325.

*** Exception: Do not use wahoo for your CUDA experiments. ***

You'll need to take a few additional steps in order to use CUDA on the department machines. Minimally, you need to load the CUDA module:
module load cuda

Download and untar the CUDAL5.tar file. To compile a CUDA program, use the CUDA compiler nvcc. You should use the provided Makefile as a starting point for this lab and for your CUDA assignments. The Makefile shows which gcc to use.

Vector add: coalescing memory accesses

As discussed in class, vecadd is a microbenchmark for determining the effectiveness of coalescing. You are provided with a non-coalescing version. Your job is to create a coalescing version and to measure the difference in performance between the two. For this lab you must create a new kernel file called vecaddKernel01.cu, which implements the coalesced version of the vector addition.

To implement coalescing, you need to change which part of the computation each thread performs. In the base version, each thread computes N values that are all consecutive in the array, so at any given step neighboring threads access elements that are N apart. In a coalesced version, consecutive threads access consecutive elements of the array, and each thread then advances its index by the number of threads to reach its next element. You must change the code to compute the result this way.
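The two indexing schemes described above can be sketched side by side. This is only an illustration under stated assumptions, not the provided code or the expected solution: the kernel names addChunked and addStrided, the helper coalescedMatchesChunked, and the small problem size are all hypothetical, and the stride used here is the total number of threads in the grid (striding by the block size within a per-block chunk is another reasonable reading of the text).

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical baseline: each thread adds valuesPerThread CONSECUTIVE elements.
// At any loop step, neighboring threads touch addresses valuesPerThread apart,
// so the accesses of a warp are not coalesced.
__global__ void addChunked(const float* a, const float* b, float* c, int valuesPerThread)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int start = tid * valuesPerThread;
    for (int i = start; i < start + valuesPerThread; ++i)
        c[i] = a[i] + b[i];
}

// Hypothetical coalesced variant: at step k, thread tid handles element
// tid + k * totalThreads, so neighboring threads touch neighboring addresses.
__global__ void addStrided(const float* a, const float* b, float* c, int valuesPerThread)
{
    int totalThreads = gridDim.x * blockDim.x;
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    for (int k = 0; k < valuesPerThread; ++k) {
        int i = tid + k * totalThreads;
        c[i] = a[i] + b[i];
    }
}

// Runs both kernels on the same small input and checks that each produces a[i] + b[i].
bool coalescedMatchesChunked()
{
    const int blocks = 4, threads = 64, valuesPerThread = 8;  // illustrative sizes
    const int n = blocks * threads * valuesPerThread;
    const size_t bytes = n * sizeof(float);

    std::vector<float> hA(n), hB(n), hChunked(n, -1.0f), hStrided(n, -1.0f);
    for (int i = 0; i < n; ++i) { hA[i] = float(i); hB[i] = float(2 * i); }

    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    if (cudaMalloc((void**)&dA, bytes) != cudaSuccess) return false;
    if (cudaMalloc((void**)&dB, bytes) != cudaSuccess) { cudaFree(dA); return false; }
    if (cudaMalloc((void**)&dC, bytes) != cudaSuccess) { cudaFree(dA); cudaFree(dB); return false; }
    cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice);

    addChunked<<<blocks, threads>>>(dA, dB, dC, valuesPerThread);
    cudaMemcpy(hChunked.data(), dC, bytes, cudaMemcpyDeviceToHost);

    addStrided<<<blocks, threads>>>(dA, dB, dC, valuesPerThread);
    cudaMemcpy(hStrided.data(), dC, bytes, cudaMemcpyDeviceToHost);

    cudaFree(dA); cudaFree(dB); cudaFree(dC);

    for (int i = 0; i < n; ++i)
        if (hChunked[i] != float(3 * i) || hStrided[i] != float(3 * i)) return false;
    return true;
}

int main()
{
    bool ok = coalescedMatchesChunked();
    printf("%s\n", ok ? "results match" : "results differ");
    return ok ? 0 : 1;
}
```

Both kernels assign every element to exactly one thread; only the order in which a thread visits its elements differs, which is why the results should agree while the memory access pattern changes.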

Here is a set of files provided in CUDAL5.tar for Vecadd:

Compile and run the provided program vecadd00 and collect data on the time the program takes with 60,000 values per thread. Then test your new program and compare the time it takes with the time the original takes to perform the same work.
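If you want to time a kernel yourself, one common approach is to bracket the launch with CUDA events. The sketch below is a generic illustration, not the timing code in the provided program (which may already report times by other means); the kernel scaleKernel and the helper timeKernelMs are hypothetical stand-ins.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel; substitute the kernel you want to time.
__global__ void scaleKernel(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Returns the elapsed time of one kernel launch in milliseconds,
// or a negative value if device memory could not be allocated.
float timeKernelMs(int n)
{
    float* d = nullptr;
    if (cudaMalloc((void**)&d, n * sizeof(float)) != cudaSuccess) return -1.0f;
    cudaMemset(d, 0, n * sizeof(float));

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    // Warm-up launch so one-time startup costs are not included in the measurement.
    scaleKernel<<<blocks, threads>>>(d, n);
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    scaleKernel<<<blocks, threads>>>(d, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until the kernel and the stop event have completed

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d);
    return ms;
}

int main()
{
    float ms = timeKernelMs(1 << 20);
    printf("kernel time: %.3f ms\n", ms);
    return ms >= 0.0f ? 0 : 1;
}
```

Kernel launches are asynchronous, so the call to cudaEventSynchronize is what makes the measurement cover the whole kernel rather than just the launch. When comparing the two versions, run each several times and use the same problem size and launch configuration for both.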