CS575 CUDA Assignment Two: Improving Performance

Introduction

The purpose of this exercise is for you to gain additional experience with the CUDA programming environment, to learn more about CUDA performance, and to apply your knowledge to the matrix multiplication problem, producing a matrix multiply program with better performance than the programs you worked with in the first CUDA programming exercise.

For this assignment, you will perform an experiment to learn about an additional issue governing CUDA performance, and you will write a fast CUDA matrix multiplication program using a different grid and block structure from those you used in the first CUDA programming assignment. To perform the experiment, you will use a provided kernel and a minor modification of that kernel to investigate the performance difference between shared memory and registers. You will analyze the results of the experiment. To write the fast matrix multiply operation, you will start from the matrix multiply code from the first CUDA assignment and modify it to achieve greater speed. You will empirically test this program and analyze the results of your empirical tests.

You will turn in all code you have written for this assignment, a makefile that will compile your code, and a pdf LabNote document documenting your experiments and results. If you elect to do the assignment with a partner, your document will be named by the name you have given your group, in the form GroupnameCUDA2.pdf. If you elect to do the assignment by yourself, your document will be named using your first and last names in the form FirstnameLastnameCUDA2.pdf. Your grade will be based on this LabNote document, the performance of your code and your analysis of your results. Your analysis must state clearly what gave you improved performance.

For resources that may be useful for this assignment, and information on how to run CUDA programs, see CUDA Assignment 1.

2a: Multiply-Add

Here is a set of provided files.

Create a directory for this assignment, copy the provided files to that directory, then use the makefile to compile the provided sample source code. Assuming you have not changed the name of the makefile, you can compile the provided program by typing

	make sharedshared

RegShared

The sharedshared code performs the global-memory, shared-memory, and local (register) reads of the A and B operand vectors. The intent here is to have exactly the same memory traffic for sharedshared as for regshared, the code you will create, so that the approach to fetching operands for the inner product is the only difference between the two codes, and the difference in performance can therefore be measured properly.

Modify the sharedshared.cu kernel for the multiply-add microexperiment to use one register (local) memory location as the B operand in the multiply-add operation in the innermost loop. The A operand still comes from shared memory. This is exactly the memory fetch we expect in the improved matrix multiply. Do not change any other feature of the program. Name the new kernel regshared.cu. Modify the makefile to compile the new kernel. Test the performance of the sharedshared program and the regshared program, and analyze the results.
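Because the provided kernel is not reproduced here, the following is only a hypothetical sketch of the shape of the change: the kernel name, tile size, and loop structure are all assumptions, and the real sharedshared.cu will differ in detail. The essential point is that the B operand becomes a local variable, which the compiler places in a register.

```cuda
#define TILE 16   // illustrative tile width, not necessarily the provided one

// Hypothetical regshared kernel: the A operand still comes from shared
// memory, but the B operand is held in a local variable (a register).
__global__ void regshared(const float *A, const float *B, float *C, int nReps)
{
    __shared__ float As[TILE];
    if (threadIdx.x < TILE)
        As[threadIdx.x] = A[threadIdx.x];   // stage the A operand in shared memory
    __syncthreads();

    float bReg = B[threadIdx.x];            // B operand read once into a register
    float sum  = 0.0f;
    for (int r = 0; r < nReps; r++)         // NUMREPS repetitions of the work
        for (int k = 0; k < TILE; k++)
            sum += As[k] * bReg;            // shared read * register read
    C[threadIdx.x] = sum;
}
```

In the sharedshared version, the multiply-add in the innermost loop would instead read both operands from shared memory.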

Run the kernels with NUMREPS set to 0 or 1 (in the kernel header file) to obtain baseline data on the time taken by the I/O for this program. Then run with NUMREPS set to 100000 to obtain timing data. Subtract the baseline number from the timing data to obtain the actual computation time. Be sure to account for the baseline data in your analysis.
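The baseline subtraction itself is simple arithmetic; one host-side sketch of it follows. The function and variable names here are illustrative, not taken from the provided code.

```cuda
// Illustrative host-side helper: subtract the I/O baseline and convert
// the remaining compute time to GFLOPS.  'multiplyAdds' is the total
// number of multiply-add operations performed in the timed run.
static double effectiveGflops(double tTotal, double tBaseline, double multiplyAdds)
{
    double computeTime = tTotal - tBaseline;          // seconds of actual computation
    return 2.0 * multiplyAdds / computeTime / 1.0e9;  // 2 flops per multiply-add
}
```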

You have been provided with two simplified fragments of assembly produced when CUDA compiles the inner loop of sharedshared.cu and our version of regshared.cu. The assembly code has been cleaned up to make it easier to read. Analyze the assembly code and use what you learn to explain the difference in performance between the two programs. Describe and analyze the experiments you performed in your LabNote document.

2b: Faster Matrix Multiply

For the second part of this assignment, you will write an improved matrix multiply program, and you will measure its performance and compare it to the performance of the matrix multiply programs you tested in the first CUDA programming assignment. The source and header file structure are not specified here. You can start from scratch, or choose to start with the initial code from the first CUDA assignment. Your new program will use a 1x64 thread block and will compute a 16x64 portion of the result. A thread block copies a 16x16 portion of the A matrix into shared memory, and reads a 1x64 portion of the B matrix into registers, i.e. each thread reads one B element at a time into a register. Each thread updates a 16x1 column of C, using this one B value and 16 shared A values. Remember the k matching rule in matrix multiply:
        Cij += Aik * Bkj

Since this program uses 64 threads in each thread block, you will need to be clever in copying into shared memory in order to use all 64 threads and avoid conditionals. Investigate how the performance of the new program compares to the performance of the matrix multiplication programs you used in the first CUDA assignment.
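One way the kernel described above might be organized is sketched below. This is an assumed layout, not the required one: it presumes square row-major n x n matrices with n a multiple of 64, and the kernel name, macros, and launch configuration are all illustrative.

```cuda
#define TW 16   // A tile is TW x TW; each thread computes a TW x 1 column of C
#define TB 64   // threads per block; the C tile is TW x TB

__global__ void matmulRegShared(const float *A, const float *B, float *C, int n)
{
    __shared__ float As[TW][TW];

    int t  = threadIdx.x;              // 0..63
    int j  = blockIdx.x * TB + t;      // this thread's C column
    int i0 = blockIdx.y * TW;          // first C row for this block

    float c[TW];                       // 16x1 column of C, held in registers
    for (int i = 0; i < TW; i++)
        c[i] = 0.0f;

    for (int k0 = 0; k0 < n; k0 += TW) {
        // Cooperatively copy the 16x16 A tile: 256 elements, 64 threads,
        // 4 elements per thread -- every thread is used, no conditionals.
        for (int e = t; e < TW * TW; e += TB)
            As[e / TW][e % TW] = A[(i0 + e / TW) * n + (k0 + e % TW)];
        __syncthreads();

        for (int kk = 0; kk < TW; kk++) {
            float b = B[(k0 + kk) * n + j];   // one B element in a register
            for (int i = 0; i < TW; i++)
                c[i] += As[i][kk] * b;        // Cij += Aik * Bkj, k indices match
        }
        __syncthreads();
    }

    for (int i = 0; i < TW; i++)
        C[(i0 + i) * n + j] = c[i];
}
```

Under these assumptions the kernel would be launched with one block per 16x64 C tile, e.g. `matmulRegShared<<<dim3(n / 64, n / 16), 64>>>(dA, dB, dC, n);`.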

You should time your code using square matrices of the following sizes: 128, 256, 512, 1024, 2048, and 4096. When run with a single parameter, your code should work in the same way that the matrix multiply programs for the first CUDA assignment worked, multiplying the parameter by an appropriate factor, and creating square matrices of the resulting size. You must indicate in your submission how to run your programs to get matrices of the sizes listed above.

Note the results of each experiment in your Lab Notes. The type of code you have been asked to write can obtain speeds above 300 GFLOPS on a Tesla 1060. Can you reach that figure?

Turning in Your Results

Finish your lab report explaining the results of your experiments. Provide any general conclusions you have reached about CUDA programming. Can you formulate any rules of thumb for obtaining good performance when using CUDA? If you have done the project with a partner, name your pdf file using your group name (use the form GroupNameCUDA2.pdf). If you have done the project without a partner, name your pdf file using your first and last name (use the form FirstnameLastnameCUDA2.pdf, for example, JohnDoeCUDA2.pdf). To turn in your assignment, use tar to package your code (you may use the make tar command from the makefile), your makefile, and your Lab Notes file. Your tar file should be named in a manner similar to your lab report, but with the tar extension in place of the pdf extension.

Last Modified: April 2013 WB