Discussion 6: Introduction to GPU and CUDA

Objective: The objective of this discussion is to teach yourself and your group how to write parallel code for GPUs using CUDA.

Write a short summary of what you learnt in Lab 6 (limit to 5 sentences).

Basic questions:
  1. What is CUDA, and which compiler are we using?
  2. Which is the host and which is the device?
  3. What is the difference between a normal C function and a CUDA kernel function?
  4. What is __global__ and what does it indicate?
  5. What is SIMT? Were you able to observe it in HelloWorld.cu?
  6. What happened when you uncommented the line "//if(threadIdx.x==0)" in HelloWorld.cu?
  7. What happened when you commented the line "cudaDeviceSynchronize();" in HelloWorld.cu? What does that tell you?
  8. What is the syntax to invoke a CUDA kernel? (A minimal kernel-and-launch sketch follows this list.)
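
For reference while discussing questions 4-8, here is a minimal sketch of a __global__ kernel and the <<<...>>> launch syntax, modeled loosely on HelloWorld.cu (the kernel name and launch configuration are illustrative assumptions, not the actual lab file):

  #include <cstdio>

  // __global__ marks a function as a CUDA kernel: it executes on the
  // device (GPU) but is launched from the host (CPU).
  __global__ void helloKernel() {
      // Every thread runs this same code (SIMT); threadIdx.x and
      // blockIdx.x tell each thread which thread it is.
      printf("Hello from thread %d of block %d\n", threadIdx.x, blockIdx.x);
  }

  int main() {
      // Launch syntax: kernelName<<<numBlocks, threadsPerBlock>>>(args);
      helloKernel<<<2, 4>>>();

      // Kernel launches are asynchronous; without this call the host
      // program may exit before the device finishes printing.
      cudaDeviceSynchronize();
      return 0;
  }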
Moderate questions:
  1. What is a Streaming Multiprocessor (SM)? How many do we have in the 3 GPUs?
  2. What is a grid, a block and a thread?
  3. Explain the memory model that CUDA exposes to the programmer.
  4. Look into matmultKernel00.cu in PA5.tar. What does the __shared__ keyword tell you? (A small shared-memory sketch follows this list.)
  5. How is the CUDA memory model different from the standard memory model on CPU?
  6. With respect to the previous 3 questions, what advantage comes with CPU programming? (Hint: shared memory in the GPU is equivalent to ____ of the CPU)
  7. What are blockIdx.x, threadIdx.y, blockDim.z? What do they tell you?
  8. Assume a 1D grid and 1D block architecture: 4 blocks, each with 8 threads. How many threads do we have in total? How do you calculate the global thread ID? Show the calculation for global threadID=26. (The index-calculation sketch after this list covers the usual conventions for questions 8-10.)
  9. Assume a 1D grid and 2D block architecture: 4 blocks, each with 2 threads in the x direction and 4 threads in the y direction. How many threads do we have in total? How do you calculate the global thread ID? Show the calculation for global threadID=26.
  10. Assume a 2D grid and 3D block architecture: 2 blocks in the x direction and 2 blocks in the y direction, each block with 2 threads in the x direction, 2 threads in the y direction, and 2 threads in the z direction. How many threads do we have in total? How do you calculate the global thread ID? Show the calculation for global threadID=26.
  11. What were your observations on varying block and grid sizes in HelloWorld.cu? Show how you changed the print statements to reflect this.
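
For moderate question 4, here is a minimal sketch of how the __shared__ keyword is typically used. This is an illustrative toy kernel, not the actual matmultKernel00.cu from PA5:

  #define TILE 16

  // Launch with blockDim.x == TILE for this sketch to be correct.
  __global__ void reverseWithinBlock(const float *in, float *out) {
      // __shared__ places the array in fast on-chip memory visible to
      // all threads of the same block (but not to other blocks).
      __shared__ float tile[TILE];

      int i = blockIdx.x * blockDim.x + threadIdx.x;
      tile[threadIdx.x] = in[i];               // each thread loads one element
      __syncthreads();                         // wait for the whole tile
      out[i] = tile[TILE - 1 - threadIdx.x];   // read a neighbor's element
  }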
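For moderate questions 8-10, the sketch below shows one common way to linearize thread indices into a global thread ID (x varies fastest, then y, then z). Treat these formulas as an assumed convention to check against your course materials; the arithmetic for threadID=26 is left for you to work out:

  // 1D grid, 1D block:
  //   globalId = blockIdx.x * blockDim.x + threadIdx.x
  __device__ int globalId_1Dgrid_1Dblock() {
      return blockIdx.x * blockDim.x + threadIdx.x;
  }

  // 1D grid, 2D block:
  //   localId  = threadIdx.y * blockDim.x + threadIdx.x
  //   globalId = blockIdx.x * (blockDim.x * blockDim.y) + localId
  __device__ int globalId_1Dgrid_2Dblock() {
      int localId = threadIdx.y * blockDim.x + threadIdx.x;
      return blockIdx.x * (blockDim.x * blockDim.y) + localId;
  }

  // 2D grid, 3D block:
  //   blockId  = blockIdx.y * gridDim.x + blockIdx.x
  //   localId  = threadIdx.z * (blockDim.x * blockDim.y)
  //            + threadIdx.y * blockDim.x + threadIdx.x
  //   globalId = blockId * (blockDim.x * blockDim.y * blockDim.z) + localId
  __device__ int globalId_2Dgrid_3Dblock() {
      int blockId = blockIdx.y * gridDim.x + blockIdx.x;
      int localId = threadIdx.z * (blockDim.x * blockDim.y)
                  + threadIdx.y * blockDim.x + threadIdx.x;
      return blockId * (blockDim.x * blockDim.y * blockDim.z) + localId;
  }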
Advanced questions: (You may refer to the PA5 tarball and/or the Internet)
  1. Is the memory shared between the CPU and the GPU? Describe how data is transferred between the CPU and the GPU. (See the vector-add sketch after this list.)
  2. Assume a 1D grid and 1D block architecture: 64 blocks, each with 64 threads. How many threads do we have in total? Do all the threads execute in parallel?
  3. What is a warp? What is the warp size, and what scheduler is found in the 3 GPUs?
  4. From the above 2 questions, how many threads can run in parallel at a given time step on the 3 GPUs?
  5. What is the theoretical peak performance that the 3 GPUs can achieve in terms of GFLOPS? How is this derived?
  6. How is the given vecaddKernel (in PA5) different from the HelloWorld kernel? (The vector-add sketch after this list shows a typical structure for comparison.)
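
For advanced questions 1 and 6, here is a sketch of a typical end-to-end vector addition: explicit cudaMalloc/cudaMemcpy transfers between the separate host and device memories, plus a simple kernel. The kernel is a generic example and may well differ from the actual vecaddKernel in PA5:

  #include <cstdlib>
  #include <cuda_runtime.h>

  // Generic vector-add kernel: one thread per element.
  __global__ void vecAdd(const float *a, const float *b, float *c, int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) c[i] = a[i] + b[i];   // guard against extra threads
  }

  int main() {
      const int n = 1 << 20;
      const size_t bytes = n * sizeof(float);

      // Host and device have separate memories by default, so data
      // must be copied explicitly (typically across PCIe).
      float *hA = (float *)malloc(bytes);
      float *hB = (float *)malloc(bytes);
      float *hC = (float *)malloc(bytes);
      for (int i = 0; i < n; i++) { hA[i] = 1.0f; hB[i] = 2.0f; }

      float *dA, *dB, *dC;
      cudaMalloc(&dA, bytes);
      cudaMalloc(&dB, bytes);
      cudaMalloc(&dC, bytes);

      cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);   // CPU -> GPU
      cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

      int threads = 256;
      int blocks = (n + threads - 1) / threads;   // round up
      vecAdd<<<blocks, threads>>>(dA, dB, dC, n);

      cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);   // GPU -> CPU

      cudaFree(dA); cudaFree(dB); cudaFree(dC);
      free(hA); free(hB); free(hC);
      return 0;
  }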

Submit your individual answers to the above 25 questions on Piazza.

Group Report

From the above questions (and the answers that you individually obtained and discussed in your group), were you able to get an overview of how we can write parallel code on GPUs using CUDA?
Come to a group consensus and form a single group report containing answers to the above questions and a short summary of the lessons learnt from this lab/discussion.