For this assignment, you will perform an experiment to learn about an additional issue governing CUDA performance, and you will write a fast CUDA matrix multiplication program using a different grid and block structure from the ones you used in the first CUDA programming assignment. To perform the experiment, you will use a provided kernel, together with a minor modification of that kernel, to investigate the performance difference between shared memory and registers, and you will analyze the results. To write the fast matrix multiply, you will modify the matrix multiply code from the first CUDA assignment to achieve greater speed, then empirically test the program and analyze the results of your tests.
You will turn in all code you have written for this assignment, a makefile that will compile your code, and a LabNote document in PDF format recording your experiments and results. If you elect to do the assignment with a partner, name the document after your group, in the form GroupnameCUDA2.pdf. If you elect to do the assignment by yourself, name it using your first and last names, in the form FirstnameLastnameCUDA2.pdf. Your grade will be based on this LabNote document, the performance of your code, and your analysis of your results. Your analysis must state clearly what gave you improved performance.
For resources that may be useful for this assignment, and information on how to run CUDA programs, see CUDA Assignment 1.
Here is the set of provided files:
Create a directory for this assignment, copy the provided files into it, then use the makefile to compile the provided sample source code. Assuming you have not changed the name of the makefile, you can compile the provided program by typing
make sharedshared
Modify the sharedshared.cu kernel for the multiply-add microexperiment to use one register (local) memory location as the B operand in the multiply-add operation in the innermost loop. The A operand still comes from shared memory. This is exactly the memory fetch we expect in the improved matrix multiply. Do not change any other feature of the program. Name the new kernel regshared.cu. Modify the makefile to compile the new kernel. Test the performance of the sharedshared program and the regshared program, and analyze the results.
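As a concrete illustration, the change amounts to hoisting the B operand out of shared memory into a per-thread register before the repetition loop. The fragment below is only a sketch; the names As, Bglobal, b, cij, and the loop bounds are assumptions for illustration, not the provided source.

```cuda
// Hypothetical inner-loop fragment of regshared.cu (names are illustrative).
__shared__ float As[64];
float b = Bglobal[threadIdx.x];   // B operand now lives in a register
float cij = 0.0f;
for (int rep = 0; rep < NUMREPS; ++rep)
    for (int k = 0; k < 64; ++k)
        cij += As[k] * b;         // A from shared memory, B from a register
```

The multiply-add itself is unchanged; only the source of the B operand differs from sharedshared.cu.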
Run the kernels with NUMREPS set to 0 or 1 (in the kernel header file) to obtain baseline data on the time taken by the IO for this program. Then run with NUMREPS set to 100000 to obtain timing data. Subtract the baseline number from the timing data to obtain data on actual computation time. Be sure to account for the baseline data in your analysis.
You have been provided with two simplified fragments of the assembly produced when the CUDA compiler compiles the inner loop of sharedshared.cu and of our version of regshared.cu. The assembly code has been cleaned up to make it easier to read. Analyze the assembly code and use what you learn to explain the difference in performance between the two programs. Describe and analyze the experiments you performed in your LabNote document.
Write a new matrix multiplication program whose inner loop performs the update Cij += Aik * Bkj, with the Aik operand fetched from shared memory and the Bkj operand held in a register, as in the microexperiment above. Since this program uses 64 threads in each thread block, you will need to be clever when copying into shared memory so that all 64 threads do useful work and no conditionals are needed. Investigate how the performance of the new program compares to the performance of the matrix multiplication programs you used in the first CUDA assignment.
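One common structure that satisfies these constraints is a tiled kernel with an 8×8 thread block (64 threads), in which every thread copies exactly one element of each tile per iteration, so all 64 threads participate and no conditionals are needed. The sketch below assumes the matrix dimension n is a multiple of the tile width; the kernel and variable names are illustrative, not the required solution.

```cuda
#define TILE 8  // 8x8 = 64 threads per block, matching the assignment

// Tiled matrix multiply: C = A * B for n x n row-major matrices,
// assuming n is a multiple of TILE so no bounds checks are needed.
__global__ void matmul_tiled(const float *A, const float *B, float *C, int n)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float cij = 0.0f;

    for (int t = 0; t < n / TILE; ++t) {
        // Each of the 64 threads copies one element of each tile.
        As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();

        for (int k = 0; k < TILE; ++k)
            cij += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * n + col] = cij;
}
```

Note that the inner loop reads As from shared memory while cij accumulates in a register; further register use for the B operand, as in the microexperiment, is the optimization you are asked to explore.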
You should time your code using square matrices of the following sizes: 128, 256, 512, 1024, 2048, and 4096. When run with a single parameter, your code should work in the same way that the matrix multiply programs for the first CUDA assignment worked, multiplying the parameter by an appropriate factor, and creating square matrices of the resulting size. You must indicate in your submission how to run your programs to get matrices of the sizes listed above.
Note the results of each experiment in your LabNote document. The type of code you have been asked to write can reach speeds above 300 GFLOPS on a Tesla 1060. Can you reach that figure?
Last Modified: April 2013 WB