For this assignment, you will perform an experiment to learn about an additional issue governing CUDA performance, and you will write a fast CUDA matrix multiplication program using a grid and block structure different from the ones you used in the first CUDA programming assignment. To perform the experiment, you will use a provided kernel, and a minor modification of that kernel, to investigate the performance difference between shared memory and registers. You will analyze the results of the experiment. To write the fast matrix multiply, you will start from the matrix multiply code from the first CUDA assignment and modify it to achieve greater speed. You will test this program empirically and analyze the results of your tests.
You will turn in all code you have written for this assignment, a makefile that will compile your code, and a PDF LabNote document documenting your experiments and results. If you elect to do the assignment with a partner, name your document after your group, in the form GroupnameCUDA2.pdf. If you elect to do the assignment by yourself, name your document with your first and last names, in the form FirstnameLastnameCUDA2.pdf. Your grade will be based on this LabNote document, the performance of your code, and your analysis of your results. Your analysis must state clearly what gave you improved performance.
For resources that may be useful for this assignment, and information on how to run CUDA programs, see CUDA Assignment 1.
Here is a set of provided files
Create a directory for this assignment, copy the provided files to that directory, then use the makefile to compile the provided sample source code. Assuming you have not changed the name of the makefile, you can compile the provided program by typing
make
Then use ls to verify that an executable program was generated, and run it by typing
sharedshared
Modify the sharedshared.cu kernel for the multiply-add microexperiment to use one register (local) memory location as the B operand in the multiply-add operation in the innermost loop. The A operand still comes from shared memory. This is exactly the memory fetch we expect in the improved matrix multiply. Do not change any other feature of the program. Name the new kernel registershared.cu. Modify the makefile to compile the new kernel. Test the performance of the sharedshared program and the registershared program, and analyze the results.
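For orientation, here is a minimal sketch of the change, assuming the provided inner loop looks broadly like the first fragment below; the names sA, sB, b, and sum are illustrative, and your actual kernel will differ in detail:

    // sharedshared style: both operands of the multiply-add come from
    // shared memory on every repetition
    for (int rep = 0; rep < NUMREPS; rep++)
        sum += sA[threadIdx.x] * sB[threadIdx.x];

    // registershared style: the B operand is copied into a register once
    // and reused, so each repetition touches shared memory only once
    float b = sB[threadIdx.x];
    for (int rep = 0; rep < NUMREPS; rep++)
        sum += sA[threadIdx.x] * b;

If your change is correct, the provided .asm fragments should let you check whether the register version eliminates a shared-memory access per iteration.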
You have also been provided with two simplified fragments of assembly produced when CUDA compiles the inner loop of sharedshared.cu and our version of registershared.cu. The assembly code has been cleaned up to make it easier to read and to eliminate irrelevant complexities. You should analyze the assembly code and use what you learn to explain the difference in performance between the two programs.
Run the kernels with NUMREPS set to 0 or 1 (in the kernel header file) to obtain a baseline measurement of the time taken by the I/O for this program. Then run with NUMREPS set to 100000 to obtain timing data. Subtract the baseline from the timing data to obtain the actual computation time. Be sure to account for the baseline in your analysis.
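For example, with made-up numbers: if the NUMREPS = 100000 run reports 12.40 seconds and the NUMREPS = 0 baseline reports 0.35 seconds, the actual computation time is 12.40 - 0.35 = 12.05 seconds, or about 120 microseconds per repetition.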
Include a short comment in your makefile describing how you performed your experiments, and fully describe and analyze the experiments you performed in your LabNote document. Use the .asm files to explain your performance results.
Here is a set of provided files
For the second part of this assignment, you will write fmatmult, an improved matrix multiply program, and you will measure its performance and compare it to the performance of the matrix multiply programs you tested in the first CUDA programming assignment. The new_matmult.cu driver creates input for matrix multiply that differs from the input for CUDA Assignment 1. If you are interested, see Cormen et al., Appendix A, on telescoping series.
Write your new program using the code from the first CUDA assignment as a starting point. Proceed by stepwise refinement. Your first step should be a program that uses a 1x16 thread block to compute a 16x16 portion of the result. Each thread will compute a 16x1 column of the result. A thread block will copy a 16x16 portion of one matrix (A) into shared memory, and will read a 1x16 portion of the other matrix (B) into registers as needed. Each thread will use only one element of the other matrix (B), so the local copy of B should be a single float for each thread.
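A minimal sketch of this first step follows, assuming row-major n x n matrices with n a multiple of 16 and a grid of (n/16) x (n/16) blocks of 16 threads each; the names TILE, As, cval, and fmatmult00 are illustrative, not a required interface:

    #define TILE 16

    // One 1x16 thread block computes one 16x16 tile of C = A * B.
    // Each thread owns one column of the tile and accumulates 16 results.
    __global__ void fmatmult00(const float *A, const float *B, float *C, int n)
    {
        __shared__ float As[TILE][TILE];       // 16x16 tile of A
        int tx   = threadIdx.x;                // 0..15: this thread's column
        int row0 = blockIdx.y * TILE;          // top row of this block's C tile
        int col  = blockIdx.x * TILE + tx;     // this thread's column in C

        float cval[TILE];                      // 16x1 column of partial results
        for (int r = 0; r < TILE; r++)
            cval[r] = 0.0f;

        for (int kt = 0; kt < n; kt += TILE) {
            // the 16 threads cooperate to copy a 16x16 tile of A
            for (int r = 0; r < TILE; r++)
                As[r][tx] = A[(row0 + r) * n + (kt + tx)];
            __syncthreads();

            for (int k = 0; k < TILE; k++) {
                float b = B[(kt + k) * n + col];  // single B element per thread
                #pragma unroll
                for (int r = 0; r < TILE; r++)
                    cval[r] += As[r][k] * b;      // shared operand * register operand
            }
            __syncthreads();
        }

        for (int r = 0; r < TILE; r++)
            C[(row0 + r) * n + col] = cval[r];
    }

The launch would use dim3 grid(n/TILE, n/TILE) with 16 threads per block; the #pragma unroll on the innermost loop is what lets the compiler keep cval in registers.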
Your second step should be a program that uses a 1x64 thread block and computes a 16x64 portion of the result. A thread block copies a 16x16 portion of one matrix into shared memory, and reads a 1x64 portion of the second matrix into registers as needed. Each thread computes a 16x1 column within the result. Since this program uses 64 threads in each thread block, you will need to be more clever in copying into shared memory in order to use all 64 threads and avoid conditionals.
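One way to organize the conditional-free copy, sketched under the same illustrative names as above: with 64 threads and 256 tile elements, each thread copies exactly four elements, four rows apart, so no if-test is needed:

    int tid = threadIdx.x;        // 0..63
    int c   = tid % TILE;         // column this thread copies: 0..15
    // rows tid/16, tid/16 + 4, tid/16 + 8, tid/16 + 12: every thread
    // performs exactly four copies, covering the whole 16x16 tile
    for (int r = tid / TILE; r < TILE; r += 4)
        As[r][c] = A[(row0 + r) * n + (kt + c)];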
You may include other steps as you proceed from the code provided for the first CUDA assignment to the code required for this assignment. Each working version of your code should be named with a sequential number after the name "fmatmult", for example fmatmult00, fmatmult01, etc. Your code files must include internal documentation on the nature of your program, your makefile must include notes on the nature of your experiments, and your makefile must compile all programs you submit. It is particularly important that you indicate clearly which program you wish to be regarded as your final best effort.
You should investigate how the performance of the new program compares to the performance of the matrix multiplication programs you used in the first CUDA assignment.
You should time your code using square matrices of each of the following sizes: 128, 256, 512, 1024, 2048, 4096, and 8192. When run with a single parameter, your code should work in the same way that the matrix multiply programs for the first CUDA assignment worked, multiplying the parameter by an appropriate factor, and creating square matrices of the resulting size. You must indicate in your submission how to run your programs to get matrices of the sizes listed above.
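For example, if your scale factor were 16 (a hypothetical choice; yours may differ, and fmatmult05 is an illustrative program name), parameters of 8, 16, 32, 64, 128, 256, and 512 would produce the seven sizes above, and the smallest case would be run by typing

fmatmult05 8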
Note the results of each experiment in your LabNote document. The type of code you have been asked to write can reach speeds of about 290 GFLOPS on a Tesla C1060. Can you reach that figure?
Last Modified: 4 April 2011