CS475 Lab 8: Introduction to GPUs and CUDA: Coalescing

Introduction

The purpose of this exercise is for you to learn how to write programs using the CUDA programming interface, how to run such programs using an NVIDIA graphics processor, and how to think about the factors that govern the performance of programs running within the CUDA environment.

In this lab, you will modify a CUDA kernel and empirically compare its performance with that of the given code. The given program performs vector addition, but it does not coalesce its memory accesses. You will modify it so that it does.

Resources that may be useful for this assignment include:

The NVIDIA CUDA Developer web site
The NVIDIA CUDA Zone web site
Documentation for GNU make
Most importantly, check out the latest version of the NVIDIA CUDA Programming Guide.

Compiling and Running a CUDA Program

You will need a Linux machine with an NVIDIA graphics card. The machines in the CSU 325 lab have such a card. They are named after fish (anchovy .. ); to find them, see these machines and grep for 325.

*** Exception: Do not use wahoo for your CUDA experiments. ***

To use CUDA on the lab machines at CSU, you will need to set the right environment variables. It's convenient to edit your .cshrc or your .profile file (depending on whether you are using tcsh or bash) to set them when you log in. You should add

/usr/local/cuda/bin to PATH
/usr/local/cuda/lib64 to LD_LIBRARY_PATH
/usr/local/cuda/man to MANPATH

Download and untar the CUDAL8.tar file. To compile a CUDA program, use the CUDA compiler nvcc. You should use the provided Makefile as a starting point for this lab and for your CUDA assignments. The Makefile shows you which gcc is to be used.

Vector add: coalescing memory accesses

As discussed in class, vecadd is a microbenchmark for determining the effectiveness of coalescing. You are provided with a non-coalescing version. Your job is to create a coalescing version and to measure the difference in performance between the two codes.
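The difference between the two access patterns can be sketched as follows. This is a minimal illustration, not the provided kernel: the names addChunked, addInterleaved, and runVecadd are hypothetical, and it assumes the non-coalescing version gives each thread its own contiguous chunk of N values, which may differ from what vecaddKernel00.cu actually does.

```cuda
#include <cuda_runtime.h>
#include <cmath>
#include <vector>

// Hypothetical non-coalesced kernel: thread t owns elements [t*N, (t+1)*N).
// On any iteration, neighboring threads touch addresses N floats apart.
__global__ void addChunked(const float* A, const float* B, float* C, int N) {
    size_t t = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    size_t start = t * N;
    for (int k = 0; k < N; ++k)
        C[start + k] = A[start + k] + B[start + k];
}

// Hypothetical coalesced kernel: on iteration k, thread t touches element
// k*total + t, so neighboring threads read neighboring floats.
__global__ void addInterleaved(const float* A, const float* B, float* C, int N) {
    size_t total = (size_t)gridDim.x * blockDim.x;
    size_t t = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    for (int k = 0; k < N; ++k) {
        size_t i = (size_t)k * total + t;
        C[i] = A[i] + B[i];
    }
}

// Host helper (hypothetical): runs one variant and returns the largest
// absolute difference from the CPU result, or -1 if a CUDA call failed.
float runVecadd(bool coalesced, int blocks, int threads, int N) {
    size_t n = (size_t)blocks * threads * N;
    size_t bytes = n * sizeof(float);
    std::vector<float> hA(n), hB(n), hC(n);
    for (size_t i = 0; i < n; ++i) {
        hA[i] = (float)(i % 100);
        hB[i] = (float)(i % 7);
    }
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    bool ok = cudaMalloc((void**)&dA, bytes) == cudaSuccess
           && cudaMalloc((void**)&dB, bytes) == cudaSuccess
           && cudaMalloc((void**)&dC, bytes) == cudaSuccess
           && cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess
           && cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        if (coalesced) addInterleaved<<<blocks, threads>>>(dA, dB, dC, N);
        else           addChunked<<<blocks, threads>>>(dA, dB, dC, N);
        // cudaMemcpy waits for the kernel and reports any launch failure.
        ok = cudaMemcpy(hC.data(), dC, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    if (!ok) return -1.0f;
    float maxErr = 0.0f;
    for (size_t i = 0; i < n; ++i)
        maxErr = std::fmax(maxErr, std::fabs(hC[i] - (hA[i] + hB[i])));
    return maxErr;
}
```

For example, runVecadd(true, 4, 64, 8) launches 4 blocks of 64 threads with 8 values per thread. Both variants compute the same sums; only the order in which each thread walks through memory differs, and that order is what determines whether a warp's loads and stores can be combined.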

Here is the set of files provided in CUDAL8.tar for vecadd:

Makefile
vecadd.cu
vecaddKernel.h
vecaddKernel00.cu
timer.h
timer.cu

Compile and run the provided program vecadd00, and record the time it takes with 60,000 values per thread. Then test your new program and compare the time it takes to perform the same work as the original. Discuss your results in discussion D8.
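One way to time a kernel is with CUDA runtime events. The sketch below is an assumption-laden illustration: the interface of the provided timer.h is not shown here, so it uses cudaEvent calls instead, and the kernel scale and the helper timeScaleMs are hypothetical stand-ins for your own vecadd kernel and driver.

```cuda
#include <cuda_runtime.h>

// Trivial stand-in kernel so the sketch is self-contained (hypothetical).
__global__ void scale(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

// Returns the GPU time in milliseconds for one launch of scale over n floats,
// or -1 if device memory could not be allocated.
float timeScaleMs(int n) {
    float* d = nullptr;
    if (cudaMalloc((void**)&d, n * sizeof(float)) != cudaSuccess) return -1.0f;
    cudaMemset(d, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;

    // Warm-up launch so one-time startup costs are not included in the timing.
    scale<<<blocks, threads>>>(d, n);
    cudaDeviceSynchronize();

    cudaEventRecord(start);             // marks the point just before the launch
    scale<<<blocks, threads>>>(d, n);
    cudaEventRecord(stop);              // marks the point just after the kernel
    cudaEventSynchronize(stop);         // wait until the kernel has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d);
    return ms;
}
```

Calling timeScaleMs(1 << 20) times one launch over about a million floats. When comparing the two vecadd versions, keep the grid size, block size, and values per thread identical, and consider averaging several runs, since single measurements can vary.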