The purpose of this exercise is for you to learn how to write programs using the CUDA programming interface, how to run such programs on an NVIDIA graphics processor (GPGPU: a general-purpose Graphics Processing Unit), and how to think about the factors that govern the performance of programs running within the CUDA environment.

Compiling and Running a CUDA Program

See lab 8 for information on compiling and running CUDA Programs. *** Do not use wahoo for your CUDA experiments. ***

For this assignment, you will modify two provided CUDA kernels and empirically test and analyze their performance.

Download PA5.tar. This contains a general Makefile, starter and timing files for your exercises.

Vector MAX: coalescing memory accesses

The provided kernel is part of a program that performs vector MAX reduction. Notice that the MAX operator is commutative ( MAX(a,b) = MAX(b,a) ) and associative ( MAX(a,MAX(b,c)) = MAX(MAX(a,b),c) ).
This means that you can execute MAXs in any order, so you can apply the coalescing thread transformation you experimented with in Lab 8: Vector add. In the provided kernel, each thread computes the maximum of one contiguous partition of the input array and writes it to global memory. The host then copies the result array back and computes the max of all the maxes.
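The provided kernel's per-thread work can be pictured with a small host-side sketch (the function name and parameters here are hypothetical, not the actual kernel code): each "thread" t reduces the contiguous slice [t*chunk, (t+1)*chunk) of the input to a single partial max.

```c
#include <assert.h>

/* Host-side sketch of what the provided kernel does (names hypothetical):
   thread t reduces the contiguous slice [t*chunk, (t+1)*chunk) of the input
   to one partial max; the host later reduces all the partial maxes. */
static float contiguous_thread_max(const float *in, int t, int chunk) {
    float m = in[t * chunk];            /* first element of this thread's slice */
    for (int i = 1; i < chunk; ++i) {
        float v = in[t * chunk + i];
        if (v > m) m = v;
    }
    return m;
}
```

Note that in any one iteration, adjacent threads read addresses that are chunk elements apart, which is what the coalescing transformation below will fix.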

Given a 1D grid of 80 thread blocks and a 1D thread block of 128 threads, and a problem size n = 1,280,000,000, each thread computes the maximum of 125,000 elements. The input array is of size n, whereas the result array is of size n/125,000, in this case 80 * 128 = 10,240.

In your modified kernel, each thread will read, in a coalesced fashion, n / (80 * 128) interleaved global array elements. In the case of n = 1,280,000,000, each thread again reads 125,000 elements. Leave the grid and thread block dimensions the same. Again, each thread computes one maximum, now of an interleaved partition. The intermediate maxes computed by the GPU threads will differ from the ones computed by the original kernel; however, the max of maxes computed by the host will be the same as the original max of maxes.
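The interleaved access pattern can be sketched the same way (again a host-side analogue with hypothetical names, not the kernel itself): with nthreads total threads, thread t reads elements t, t + nthreads, t + 2*nthreads, ..., so in each iteration consecutive threads touch consecutive addresses and the global reads coalesce on the GPU.

```c
#include <assert.h>

/* Host-side sketch of the coalesced variant (names hypothetical):
   thread t strides through the array with step nthreads, so adjacent
   threads read adjacent elements in every iteration. */
static float interleaved_thread_max(const float *in, long n, int t, int nthreads) {
    float m = in[t];
    for (long i = t + nthreads; i < n; i += nthreads) {
        if (in[i] > m) m = in[i];
    }
    return m;
}
```

The partial maxes differ from the contiguous version, but the max over all partials is identical, which is why the host-side final reduction is unchanged.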

Measure and report the difference in performance of the two codes.

Additional files, relevant to this part of the PA are:

  • vecMaxKernel.h

You will write a new device kernel that computes vector maxima using coalesced memory reads. Change the Makefile to compile a program called vecMax01 using your kernel rather than the provided kernel.

Shared / Shared CUDA Matrix Multiply

The first exercise made you aware of coalescing and its performance benefit. The main exercise performs matrix multiplication. You will be given matmultKernel00 as the basis for a new kernel you will write.

Additional files for Matmult are:

  • matmultKernel.h

Investigate how each of the following factors influences performance of matrix multiplication in CUDA:

  • The size of matrices to be multiplied
  • The size of the block computed by each thread block
You should time the initial code provided using square matrices of each of the following sizes: 1600, 3200, 6400

When run with a single parameter, the provided code multiplies that parameter by BLOCK_SIZE (set to 16 in matmult00) and creates square matrices of the resulting size. This avoids nasty padding issues: data blocks always fit the grid exactly. Here is an example of running matmult00 on anchovy:

cs475@anchovy [15:10] ~...SOL 239>matmult00 200
Data dimensions: 3200x3200 
Grid Dimensions: 200x200 
Block Dimensions: 16x16 
Footprint Dimensions: 16x16 
Time: 0.122176 (sec), nFlops: 65536000000, GFlopsS: 536.406794

Notice that the parameter value 200 creates a 16 * 200 = 3200 sized square-matrix problem.
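The numbers in the sample output fit together as follows (a sketch of the arithmetic inferred from the transcript, not the program's actual code): the parameter p gives n = p * BLOCK_SIZE, a dense n x n multiply performs 2*n^3 floating-point operations (one multiply and one add per inner-product term), and GFlop/s is flops divided by time in nanoseconds' worth of scale.

```c
#include <assert.h>

#define BLOCK_SIZE 16   /* matches matmult00 per the handout */

/* Flop count for an n x n dense multiply, n = p * BLOCK_SIZE. */
static long long matmult_flops(int p) {
    long long n = (long long)p * BLOCK_SIZE;
    return 2LL * n * n * n;
}

/* Convert a flop count and elapsed seconds to GFlop/s. */
static double gflops(long long flops, double sec) {
    return (double)flops / (sec * 1e9);
}
```

With p = 200 this reproduces the nFlops = 65,536,000,000 and roughly 536 GFlop/s shown in the transcript above.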

In your new kernel, each thread computes four values in the resulting C block rather than one, so the footprint of the C block computed by a thread block becomes 32x32. Time the execution of this new program with matrices of the sizes listed above to document how your changes affect performance. To get good performance, you will need to make sure the new program coalesces its reads from and writes to global memory. You will also need to unroll any loops that you might be inclined to insert into your code.
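One common way to assign the four values (an assumption for illustration; your kernel may map threads to outputs differently, as long as the accesses coalesce) is for thread (tx, ty) of the 16x16 block to own the footprint cells (ty, tx), (ty, tx+16), (ty+16, tx), and (ty+16, tx+16). A quick host-side check confirms this mapping covers every cell of the 32x32 footprint exactly once:

```c
#include <assert.h>
#include <string.h>

#define BLOCK_SIZE 16
#define FOOTPRINT  32   /* 2 * BLOCK_SIZE */

/* Tally the four footprint cells owned by thread (tx, ty) under the
   hypothetical mapping described above. */
static void mark_cells(int tx, int ty, int cover[FOOTPRINT][FOOTPRINT]) {
    cover[ty][tx]                           += 1;
    cover[ty][tx + BLOCK_SIZE]              += 1;
    cover[ty + BLOCK_SIZE][tx]              += 1;
    cover[ty + BLOCK_SIZE][tx + BLOCK_SIZE] += 1;
}
```

Keeping the offsets at multiples of BLOCK_SIZE (rather than, say, having each thread own a 2x2 patch) preserves the property that adjacent threads write adjacent columns of C, which is what keeps the global writes coalesced.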

Here is an example of running my matmult01 on anchovy:

cs475@anchovy [15:31] ~...SOL 254>matmult01 100
Data dimensions: 3200x3200 
Grid Dimensions: 100x100 
Block Dimensions: 16x16 
Footprint Dimensions: 32x32 
Time: 0.103993 (sec), nFlops: 65536000000, GFlopsS: 630.196633
Notice that the parameter value 100 now creates a 32 * 100 = 3200 sized square-matrix problem, because the footprint has become 32x32. Also notice a nice speedup.


Use any GPU in the CS325 lab.
  1. Collect and compare the time vecMax00 and vecMax01 take with problem size n = 1,280,000,000.
  2. Collect and compare the time matmult00 and matmult01 take for square matrices of size: 1600x1600, 3200x3200, 6400x6400.
  3. For matmult00 and matmult01, investigate how each of the following factors influences performance:
    • The size of matrices to be multiplied
    • The size of the block computed by each thread block
  4. Can you formulate any rules of thumb for obtaining good performance when using CUDA?


Submit a PA5.tar file through Checkin, containing:
  • report.pdf
  • Makefile
  • matmultKernel.h
  • vecMaxKernel.h
Use "make tar" to make the tar file.