# GPU Computing is everywhere!



# GPU in a HPC system



# <section-header><section-header><image>

# GPU in a HPC system



# <section-header><section-header><image><image>

# GPU in a HPC system





## Difference between a CPU and GPU

• Both cater to different needs  $\rightarrow$  Low Latency or High Throughput?







# HPC Machines in CS Labs

| Name        | GPUs                                            |
|-------------|-------------------------------------------------|
| bentley     | GF119 [NVS 315] and GM204 [GeForce GTX 980]     |
| bugatti     | GF119 [NVS 315] and GM204 [GeForce GTX 980]     |
| ferrari     | GF119 [NVS 315] and GM204 [GeForce GTX 980]     |
| jaguar      | GF119 [NVS 315] and GM204 [GeForce GTX 980]     |
| lamborghini | GF119 [NVS 315] and GM200 [GeForce GTX TITAN X] |
| lotus       | GF119 [NVS 315] and GM200 [GeForce GTX TITAN X] |
| maserati    | GF119 [NVS 315] and GM200 [GeForce GTX TITAN X] |
| porsche     | GF119 [NVS 315] and GM200 [GeForce GTX TITAN X] |
| raspberries | G96GL [Quadro FX 580]                           |

# Features of the Graphics card **GeForce GTX 980**

| GPU Architecture | Maxwel     |
|------------------|------------|
| GPU Name         | GM204      |
| CUDA Cores       | 2048       |
| Clock Speed      | 1126 MHz   |
| VRAM             | 4 GB GDDR5 |
| Memory Bus       | 256-bit    |
| Memory Clock     | 7.0 GHz    |
| Memory Bandwidth | 224.0 GB/s |
| Power Connectors | Two 6-Pin  |
|                  |            |









# GPU Architecture

# GPU Architecture : Two Main Components

### 1. Global Memory:

-Accessible by both GPU and CPU -Analogous to RAM in a CPU server

### 2. Streaming Multiprocessors (SMs):

-Perform the actual computations Each SM has its own: Control units, registers, execution pipelines, caches



### GPU Architecture – Fermi: Streaming **Multiprocessor** Instruction Cache er File Core Core Core Core 32 CUDA Cores per SM Core Core Core Core 32 fp32 ops/clock 16 fp64 ops/clock Core Core Core Core 32 int32 ops/clock Core Core Core Core 2 warp schedulers Core Core Core Core Up to 1536 threads concurrently Core Core Core Core 4 special-function units 64KB shared mem+ L1 cache Core Core Core Core 32K 32-bit registers Core Core Core Core **Register File** Load/Store Units x 16 Special Func Units x 4 Interconnect Network 64K Configurable Cache/Shared Mem Uniform Cache

## GPU Architecture – Fermi : CUDA Core Floating

### Core

- Floating point & Integer unit
  - IEEE 754-2008 floating-point standard Fused multiply-add (FMA) instruction for both single and double precision
- Logic unit
- Move, compare unit
- Branch unit



### Memory System -- Architecture

### Shared memory(L1)

- User-managed scratch-pad
- Hardware will not evict until threads overwrite
- 16 or 48KB / SM (64KB total is split between Shared and L1)
- Aggregate bandwidth per GPU: 1.03 TB/s

ECC Protection can be enabled.

### CUDA Programming Abstractions Shivani Dave

### CUDA Application structure:

Serial code executes in a host (CPU) thread

Parallel code executes in many device (GPU) threads across multiple processing elements





# Threads are grouped into blocks



• A kernel is executed as a grid of blocks of threads

### Kernel Execution



Each thread is executed by a core CUDA

--Each block is executed by one SM and does not migrate --Several concurrent blocks can reside on one SM depending on the blocks' memory requirements and the SM's memory resources

--Each kernel is executed on one device

--Multiple kernels can execute on a device at one time in an asynchronous way



# Memory Model (contd ...)

- Global memory is the slowest memory on the GPU
- Coalescing improves memory performance; it occurs when multiple (row major order) consecutive threads (IDs) read / write consecutive data items from / to global memory
- 16 (half a warp) global array elements are accessed at once: coalescing produces vectorized reads / writes that are much faster than element wise reads / writes.
- A warp in CUDA, is a group of 32 threads, which is the minimum size of the data processed in SIMD fashion by a CUDA multiprocessor.



### Parallelism and Threads

- To process operations in parallel, the operations must be independent of each other i.e., no data dependencies
- A thread is mapped to a single processor which executes in parallel with the remaining threads.
- A CUDA kernel is executed by a grid (array) of threads with each thread with a unique index id in a specific block.
- A grid is organized as a 2D array of blocks (gridDim.x and gridDim.y)
- Each block is organized as 3D array of threads (blockDim.x, blockDim.y, and blockDim.z)



gridDim = (3,2) blockDim = (2,2,1)

### **Thread Allocation**

- A thread block can be allocated on any stream multiprocessor and thread blocks must be independent of each other, i.e., cannot communicate with each other at all.
  - pro: now the computation can run on any number of SMs
  - con: this makes programming a GPU harder
- Multiple thread blocks can be scheduled on one multiprocessor, if resources allow it. They still are independent of each other.

### Thread Synchronization

- Threads inside one thread block can synchronize \_syncthreads() command
- host can synchronize kernel calls either explicitly through cudaThreadSynchronize() – or implicitly through memcpy()-s

### Parameterizing the thread code

- The programmer's job is to write the code so that when a collection of threadblocks, each with a collection of threads executes in concert, the collective work done solves the problem.
- But there is only one piece of code (what it does is a function of the grid and thread "coordinates."
  - GridIdx.x, GridIdx.y
  - threadIdx.x, threadIdx.y, threadIdx.z
- In our first example, there is just one block which has 256 threads in it: all but threadIdx.x are zero, and unused.
- So, we will have all values from 0 to 255
- Analogy with OpenMP/MPI:
  - Dim is line num\_threads(), and MPI\_Size
  - Ids is like thread\_num() and MPI\_rank



### Multiple Blocks



### Multiple block in the Grid.

Each block will have an equal number of threads as defined in the program.

In the above example: blockldx.x keeps on increasing from 0 to 255 similar to the threadId.x. The variable blockDim.x will be 256 which is equal to the number of blocks.

©https://class.coursera.org/hetero-004/lecture/10





### Parallel Code $\rightarrow$ Vector Addition

#include<cuda.h>
void main(){
int n=256; int b[n]; int a[n],c[n]; //initialize the arrays too
int \*dev\_b; int \*dev\_a, int \*dev\_c;
cudaMalloc((void\*\*)&dev\_a, n\*sizeof(int));
cudaMalloc((void\*\*)&dev\_b, n\*sizeof(int));
cudaMalloc((void\*\*)&dev\_c, n\*sizeof(int));
cudaMalloc((void\*\*)&dev\_c, n\*sizeof(int));
cudaMemcpy(dev\_a, &a, n\*sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(dev\_b, &b, n\*sizeof(int), cudaMemcpyHostToDevice);
dim3 gridDim = 1;//Number of blocks in the grid
dim3 blockDim =n;//Number of threads in each block
Addition\_Vector<<<gridDim,blockDim>>>(dev\_a,dev\_b,dev\_c);// Call to the device
cudaMemcpy(&c, dev\_c, n\*sizeof(int), cudaMemcpyDeviceToHost);
}

### Kernel definition

{

}

\_\_global\_\_\_ void AddIntegers(int \*a, int \*b,int \*c)

c[threadIdx.x]=a[threadIdx.x]+b[threadIdx.x];

\_\_global\_\_ lets the compiler know that it is a kernel function and accordingly appropriate actions will be taken

Call to the kernel from the host is made using the below format: <<<number of blocks, number of threads>>>

### **Built-in Variables**

- In the kernel a set of built-in variables specifies the grid and block dimensions (Dim) and indices (Idx).
- These can be used to determine the thread ID
  - gridDim contains .x and .y grid dimensions (sizes)
  - blockIdx contains block indices .x and .y in the grid
- blockDim contains the thread block .x, .y, .z dimensions (sizes)
   threadIdx contains .x, .y and .z thread block indices
- 1D thread block: ID = threadIdx.x
- 2D thread block: ID = threadIdx.x + threadIdx.y\*blockDim.x
- 3D thread block: ID = threadIdx.x + threadIdx.y\*blockDim.x + threadIdx.z\*blockDim.x\*blockDim.y

### API functions in CUDA

- Device Memory Allocation function
- Host-Device Data /transfer

### **Device Memory Allocation function**

### <u>cudaMalloc()</u>

- Similar to Malloc function in C
- It will allocate memory in the GPU's global memory.
- Two parameters
- 1. Address of a pointer to the allocated object
- 2. Size of allocated object in terms of bytes

### cudaFree()

- Frees object from device global memory
- Parameter Pointer to freed object

### Host-Device Data /transfer

### cudaMemcpy()

--Used for memory data transfer Parameters:

- 1. Pointer to destination
- 2. Pointer to source
- 3. Number of bytes copied
- 4. Type of transfer

There are 4 types of data transfer:

-Host to Host

- -Host to Device
- -Device to Host
- -Device to Device
- All these data transfers are asynchronous.

# Programming Examples

# 2D Grid Block

- Cuda allows us to create 1-D,2-D,3-D grid blocks
- 2-D Kernel Launch:
- Image processing tasks typically impose a regular 2D raster (an image) over the problem domain. Computational Fluid dynamics tasks might be most naturally expressed by partitioning a volume over a 3D grid.

### 2-D Representation Program

### • 2-D mapping

\_global\_\_ void kernel(int \*array)
{
 int index\_x = blockIdx.x \* blockDim.x + threadIdx.x;
 int index\_y = blockIdx.y \* blockDim.y + threadIdx.y;
 // map the two 2D indices to a single linear, 1D index
 int grid\_width = gridDim.x \* blockDim.x;
 int index = index\_x \* grid\_width + index\_y;
 // map the two 2D block indices to a single linear, 1D block index
 int result = blockIdx.x \* gridDim.x + blockIdx.y;
 // write out the result
 array[index] = result;
}







k

ty

### 

# Programming Example-Jacobi1D.cu

24

### References:

- http://www.cc.gatech.edu/~vetter/keeneland/ tutorial-2011-04-14/02-cuda-overview.pdf
- https://class.coursera.org
- Programming Massively Parallel Processors David B. Kirk and Wenmei W. Hwu
- <u>http://docs.nvidia.com/cuda/parallel-thread-execution/</u> <u>#axzz3p0rtlrXV</u>
- <u>https://code.google.com/p/stanford-cs193g-sp2010/wiki/</u> <u>TutorialMultidimensionalKernelLaunch</u>