The purpose of this exercise is for you to learn how to write programs using the CUDA programming interface, how to run such programs on an NVIDIA graphics processor (GPGPU: a general-purpose Graphics Processing Unit), and how to think about the factors that govern the performance of programs running within the CUDA environment.

Compiling and Running a CUDA Program

See lab 8 for information on compiling and running CUDA Programs. *** Do not use wahoo for your CUDA experiments. ***

For this assignment, you will modify two provided CUDA kernels and empirically test and analyze their performance.

Download PA5.tar. This contains a general Makefile, starter and timing files for your exercises.

Vector MAX: coalescing memory accesses

The provided kernel is part of a program that performs vector MAX reduction. Notice that the MAX operator is commutative ( MAX(a,b) = MAX(b,a) ) and associative ( MAX(a,MAX(b,c)) = MAX(MAX(a,b),c) ).
This means that you can execute MAXs in any order, so you can apply the coalescing thread transformation you experimented with in Lab 8: Vector add. In the provided kernel, each thread computes the maximum of one contiguous partition of the input array and writes it to global memory. The host then copies the result array back and computes the max of all the maxes.
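The provided kernel's per-thread work can be pictured with a small host-side sketch (the function name and parameters here are hypothetical, not the actual kernel code): each "thread" t reduces the contiguous slice [t*chunk, (t+1)*chunk) of the input to a single partial max.

```c
#include <assert.h>

/* Host-side sketch of what the provided kernel does (names hypothetical):
   thread t reduces the contiguous slice [t*chunk, (t+1)*chunk) of the input
   to one partial max; the host later reduces all the partial maxes. */
static float contiguous_thread_max(const float *in, int t, int chunk) {
    float m = in[t * chunk];            /* first element of this thread's slice */
    for (int i = 1; i < chunk; ++i) {
        float v = in[t * chunk + i];
        if (v > m) m = v;
    }
    return m;
}
```

Note that in any one iteration, adjacent threads read addresses that are chunk elements apart, which is what the coalescing transformation below will fix.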

Given a 1D grid of 80 thread blocks and a 1D thread block of 128 threads, and a problem size n = 1,280,000,000, each thread computes the maximum of 125,000 elements. The input array is of size n, whereas the result array is of size n/125,000, in this case 80 * 128 = 10,240.

In your modified kernel, each thread will read, in a coalesced fashion, n / (80 * 128) interleaved global array elements. In the case of n = 1,280,000,000, each thread again reads 125,000 elements. Leave the grid and thread block dimensions the same. Again, each thread computes one maximum, now of an interleaved partition. The intermediate maxes computed by the GPU threads will differ from the ones computed by the original kernel; however, the max of maxes computed by the host will be the same as the original max of maxes.
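The interleaved access pattern can be sketched the same way (again a host-side analogue with hypothetical names, not the kernel itself): with nthreads total threads, thread t reads elements t, t + nthreads, t + 2*nthreads, ..., so in each iteration consecutive threads touch consecutive addresses and the global reads coalesce on the GPU.

```c
#include <assert.h>

/* Host-side sketch of the coalesced variant (names hypothetical):
   thread t strides through the array with step nthreads, so adjacent
   threads read adjacent elements in every iteration. */
static float interleaved_thread_max(const float *in, long n, int t, int nthreads) {
    float m = in[t];
    for (long i = t + nthreads; i < n; i += nthreads) {
        if (in[i] > m) m = in[i];
    }
    return m;
}
```

The partial maxes differ from the contiguous version, but the max over all partials is identical, which is why the host-side final reduction is unchanged.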

Measure and report the difference in performance of the two codes.

Additional files, relevant to this part of the PA are:

  • vecMaxKernel.h

You will write a new device kernel that computes vector maxima using coalesced memory reads. Change the Makefile to compile a program called vecMax01 using your kernel rather than the provided kernel.

Shared / Shared CUDA Matrix Multiply

The first exercise made you aware of coalescing and its performance benefit. The main exercise performs matrix multiplication. You will be given matmultKernel00 as the basis for a new kernel you will write.

Additional files for Matmult are:

  • matmultKernel.h

Investigate how each of the following factors influences performance of matrix multiplication in CUDA:

  • The size of matrices to be multiplied
  • The size of the block computed by each thread block
You should time the initial code provided using square matrices of each of the following sizes: 1600, 3200, 6400

When run with a single parameter, the provided code multiplies that parameter by BLOCK_SIZE (set to 16 in matmult00) and creates square matrices of the resulting size. This avoids nasty padding issues: data blocks always fit the grid exactly. Here is an example of running matmult00 on anchovy:

cs475@anchovy [15:10] ~...SOL 239>matmult00 200
Data dimensions: 3200x3200 
Grid Dimensions: 200x200 
Block Dimensions: 16x16 
Footprint Dimensions: 16x16 
Time: 0.122176 (sec), nFlops: 65536000000, GFlopsS: 536.406794

Notice that the parameter value 200 creates a 16 * 200 = 3200 sized square-matrix problem.
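The numbers in the sample output fit together as follows (a sketch of the arithmetic inferred from the transcript, not the program's actual code): the parameter p gives n = p * BLOCK_SIZE, a dense n x n multiply performs 2*n^3 floating-point operations (one multiply and one add per inner-product term), and GFlop/s is flops divided by time in nanoseconds' worth of scale.

```c
#include <assert.h>

#define BLOCK_SIZE 16   /* matches matmult00 per the handout */

/* Flop count for an n x n dense multiply, n = p * BLOCK_SIZE. */
static long long matmult_flops(int p) {
    long long n = (long long)p * BLOCK_SIZE;
    return 2LL * n * n * n;
}

/* Convert a flop count and elapsed seconds to GFlop/s. */
static double gflops(long long flops, double sec) {
    return (double)flops / (sec * 1e9);
}
```

With p = 200 this reproduces the nFlops = 65,536,000,000 and roughly 536 GFlop/s shown in the transcript above.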

In your new kernel, each thread computes four values in the resulting C block rather than one, so the footprint of the C block computed by a thread block becomes 32x32. Time the execution of this new program with matrices of the sizes listed above to document how your changes affect performance. To get good performance, you will need to make sure the new program coalesces its reads from and writes to global memory. You will also need to unroll any loops that you might be inclined to insert into your code.
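One common way to assign the four values (an assumption for illustration; your kernel may map threads to outputs differently, as long as the accesses coalesce) is for thread (tx, ty) of the 16x16 block to own the footprint cells (ty, tx), (ty, tx+16), (ty+16, tx), and (ty+16, tx+16). A quick host-side check confirms this mapping covers every cell of the 32x32 footprint exactly once:

```c
#include <assert.h>
#include <string.h>

#define BLOCK_SIZE 16
#define FOOTPRINT  32   /* 2 * BLOCK_SIZE */

/* Tally the four footprint cells owned by thread (tx, ty) under the
   hypothetical mapping described above. */
static void mark_cells(int tx, int ty, int cover[FOOTPRINT][FOOTPRINT]) {
    cover[ty][tx]                           += 1;
    cover[ty][tx + BLOCK_SIZE]              += 1;
    cover[ty + BLOCK_SIZE][tx]              += 1;
    cover[ty + BLOCK_SIZE][tx + BLOCK_SIZE] += 1;
}
```

Keeping the offsets at multiples of BLOCK_SIZE (rather than, say, having each thread own a 2x2 patch) preserves the property that adjacent threads write adjacent columns of C, which is what keeps the global writes coalesced.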

Here is an example of running my matmult01 on anchovy:

cs475@anchovy [15:31] ~...SOL 254>matmult01 100
Data dimensions: 3200x3200 
Grid Dimensions: 100x100 
Block Dimensions: 16x16 
Footprint Dimensions: 32x32 
Time: 0.103993 (sec), nFlops: 65536000000, GFlopsS: 630.196633
Notice that the parameter value 100 now creates a 32 * 100 = 3200 sized square-matrix problem, because the footprint has become 32x32. Also notice a nice speedup.


Use any GPU in the CS325 lab.
  1. Collect and compare the time vecMax00 and vecMax01 take with problem size n = 1,280,000,000.
  2. Collect and compare the time matmult00 and matmult01 take for square matrices of size: 1600x1600, 3200x3200, 6400x6400.
  3. For matmult00 and matmult01, investigate how each of the following factors influences performance:
    • The size of matrices to be multiplied
    • The size of the block computed by each thread block
  4. Can you formulate any rules of thumb for obtaining good performance when using CUDA?


Submit a PA5.tar file through Checkin, containing:
  • report.pdf
  • Makefile
  • matmultKernel.h
  • vecMaxKernel.h
Use "make tar" to make the tar file.