CS575 CUDA Assignment Two: Improving Performance

Introduction

The purpose of this exercise is for you to gain additional experience with the CUDA programming environment, to learn about some additional factors governing CUDA performance, and to apply that knowledge to the matrix multiplication problem, producing a matrix multiply program with better performance than the ones you worked with in the first CUDA programming exercise.

For this assignment, you will perform an experiment to learn about an additional issue governing CUDA performance, and you will write a fast CUDA matrix multiplication program using a grid and block structure different from those you used in the first CUDA programming assignment. For the experiment, you will use a provided kernel and a minor modification of that kernel to investigate the performance difference between shared memory and registers, and you will analyze the results. For the fast matrix multiply, you will start from the matrix multiply code of the first CUDA assignment and modify it for greater speed; you will then test the program empirically and analyze the results of your tests.

You will turn in all code you have written for this assignment, a makefile that will compile your code, and a LabNote document in PDF form describing your experiments and results. If you elect to do the assignment with a partner, name your document after your group in the form GroupnameCUDA2.pdf. If you elect to do the assignment by yourself, name your document using your first and last names in the form FirstnameLastnameCUDA2.pdf. Your grade will be based on this LabNote document, the performance of your code, and your analysis of your results. Your analysis must state clearly what gave you improved performance.

For resources that may be useful for this assignment, and information on how to run CUDA programs, see CUDA Assignment 1.

2a: Multiply-Add

Here is a set of provided files

Create a directory for this assignment, copy the provided files to that directory, then use the makefile to compile the provided sample source code. Assuming you have not changed the name of the makefile, you can compile the provided program by typing

	make

Then use ls to verify that an executable program was generated, and run it by typing

	sharedshared

RegisterShared

The sharedshared code performs the global memory, shared memory, and local (register) reads of the A and B operand vectors. The intent is for sharedshared to generate exactly the same memory traffic as registershared, so that the two approaches to fetching operands for the inner product are the only difference between the two codes, and the difference in performance can therefore be measured.

Modify the sharedshared.cu kernel for the multiply-add microexperiment to use one register (local) memory location as the B operand in the multiply-add operation in the innermost loop. The A operand still comes from shared memory. This is exactly the memory fetch we expect in the improved matrix multiply. Do not change any other feature of the program. Name the new kernel registershared.cu. Modify the makefile to compile the new kernel. Test the performance of the sharedshared program and the registershared program, and analyze the results.
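
As a rough illustration of the intended change, the fragment below contrasts the two inner loops. This is a hedged sketch only: the names sA, sB, sum, and TILE are illustrative assumptions, and the provided sharedshared.cu will differ in detail.

	// sharedshared: both operands are read from shared memory on every iteration
	for (int i = 0; i < TILE; i++)
		sum += sA[i] * sB[i];		// two shared-memory reads per multiply-add

	// registershared: the B operand is hoisted into a register (local) variable,
	// so the innermost loop reads only A from shared memory
	float bReg = sB[threadIdx.x];		// one-time load into a register
	for (int i = 0; i < TILE; i++)
		sum += sA[i] * bReg;		// one shared-memory read per multiply-add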

You have also been provided with two simplified fragments of assembly produced when CUDA compiles the inner loop of sharedshared.cu and our version of registershared.cu. The assembly code has been cleaned up to make it easier to read and to eliminate irrelevant complexities. You should analyze the assembly code and use what you learn to explain the difference in performance between the two programs.

Run the kernels with NUMREPS set to 0 or 1 (in the kernel header file) to obtain baseline data on the time taken by the IO for this program. Then run with NUMREPS set to 100000 to obtain timing data. Subtract the baseline number from the timing data to obtain data on actual computation time. Be sure to account for the baseline data in your analysis.
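
For example, the net computation time can be derived along these lines (the variable names are illustrative, not from the provided code):

	// t_baseline: time with NUMREPS = 0 or 1; t_full: time with NUMREPS = 100000
	double compute_ms = t_full - t_baseline;	// IO and setup cost removed
	double per_rep_ms = compute_ms / 100000.0;	// average time per repetition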

Include a short comment in your makefile describing how you performed your experiments, and fully describe and analyze the experiments in your LabNote document. Use the .asm files to explain your performance results.

2b: Faster Matrix Multiply

Here is a set of provided files

For the second part of this assignment, you will write fmatmult, an improved matrix multiply program, and you will measure its performance and compare it to the performance of the matrix multiply programs you tested in the first CUDA programming assignment. The new_matmult.cu driver creates input for matrix multiply that differs from the input for CUDA Assignment 1. If you are interested, see Cormen et al., Appendix A, Telescoping Series.

Write your new program using the code from the first CUDA assignment as a starting point, and proceed by stepwise refinement. Your first step should be a program that uses a 1x16 thread block to compute a 16x16 portion of the result, with each thread computing a 16x1 column of the result. A thread block will copy a 16x16 portion of one matrix (A) into shared memory and will read a 1x16 portion of the other matrix (B) into registers as needed. Each thread uses only one element of B at a time, so the local copy of B should be a single float per thread.
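
A minimal sketch of such a step-one kernel follows. It assumes row-major n x n matrices with n a multiple of 16, a grid of (n/16, n/16) blocks of 16 threads each, and illustrative names throughout (fmatmult00, TILE, sA, acc); it is a starting-point sketch, not the required solution.

	#define TILE 16

	__global__ void fmatmult00(const float *A, const float *B, float *C, int n)
	{
		__shared__ float sA[TILE][TILE];	// 16x16 tile of A, shared by the block
		int col     = blockIdx.x * TILE + threadIdx.x;	// this thread's column of C
		int rowBase = blockIdx.y * TILE;	// top row of the 16x16 C tile

		float acc[TILE];			// the 16x1 column this thread computes
		for (int i = 0; i < TILE; i++)
			acc[i] = 0.0f;

		for (int t = 0; t < n; t += TILE) {
			// the 16 threads cooperatively copy a 16x16 tile of A:
			// each thread loads one column of the tile
			for (int i = 0; i < TILE; i++)
				sA[i][threadIdx.x] = A[(rowBase + i) * n + (t + threadIdx.x)];
			__syncthreads();

			for (int k = 0; k < TILE; k++) {
				float b = B[(t + k) * n + col];	// one B element per thread, in a register
				for (int i = 0; i < TILE; i++)
					acc[i] += sA[i][k] * b;	// A from shared memory, B from a register
			}
			__syncthreads();
		}

		for (int i = 0; i < TILE; i++)
			C[(rowBase + i) * n + col] = acc[i];
	}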

Your second step should be a program that uses a 1x64 thread block and computes a 16x64 portion of the result. A thread block copies a 16x16 portion of one matrix into shared memory, and reads a 1x64 portion of the second matrix into registers as needed. Each thread computes a 16x1 column within the result. Since this program uses 64 threads in each thread block, you will need to be more clever in copying into shared memory in order to use all 64 threads and avoid conditionals.
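
One way to keep all 64 threads busy during the copy, sketched below under the same illustrative assumptions as the step-one fragment, is to give each thread a strided share of the 256 tile elements:

	// 16x16 tile = 256 elements, 64 threads: 4 elements per thread, no conditionals
	for (int e = threadIdx.x; e < TILE * TILE; e += 64) {
		int r = e / TILE, c = e % TILE;		// position within the tile
		sA[r][c] = A[(rowBase + r) * n + (t + c)];
	}
	__syncthreads();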

You may include other steps as you proceed from the code provided for the first CUDA assignment to the code required for this assignment. Each working version of your code should be named with a sequential number after the name "fmatmult", for example fmatmult00, fmatmult01, etc. Your code files must include internal documentation on the nature of your program, your makefile must include notes on the nature of your experiments, and your makefile must compile all programs you submit. It is particularly important that you indicate clearly which program you wish to be regarded as your final best effort.

You should investigate how the performance of the new program compares to the performance of the matrix multiplication programs you used in the first CUDA assignment.

You should time your code using square matrices of each of the following sizes: 128, 256, 512, 1024, 2048, 4096, and 8192. When run with a single parameter, your code should work in the same way that the matrix multiply programs for the first CUDA assignment worked, multiplying the parameter by an appropriate factor, and creating square matrices of the resulting size. You must indicate in your submission how to run your programs to get matrices of the sizes listed above.
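
For instance, if your program multiplies its parameter by 16 (an illustrative assumption; substitute whatever factor and program name your code actually uses), the smallest and largest sizes above would be obtained by typing

	fmatmult02 8

and

	fmatmult02 512

since 8 x 16 = 128 and 512 x 16 = 8192.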

Note the results of each experiment in your Lab Notes. The type of code you have been asked to write can obtain speeds of about 290 GFlops/sec on Tesla1060. Can you reach that figure?
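
When comparing against that figure, recall that multiplying two n x n matrices takes 2n^3 floating-point operations (n^3 multiplies and n^3 adds), so the achieved rate can be computed along these lines (variable names illustrative):

	double flops  = 2.0 * (double)n * n * n;	// n^3 multiplies + n^3 adds
	double gflops = flops / (elapsed_ms * 1.0e6);	// milliseconds to GFlops/sec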

Turning in Your Results

Finish your lab report by explaining the results of your experiments and providing any general conclusions you have reached about CUDA programming. Can you formulate any rules of thumb for obtaining good performance when using CUDA? If you have done the project with a partner, name your pdf file using your group name (use the form GroupNameCUDA2.pdf). If you have done the project without a partner, name your pdf file using your first and last name (use the form FirstnameLastnameCUDA2.pdf, for example JohnDoeCUDA2.pdf). To turn in your assignment, use tar to package your code (you may use the make tar command from the makefile), your makefile, and your Lab Notes file. Name your tar file in the same manner as your lab report, but with the tar extension in place of the pdf extension.

Last Modified: 4 April 2011