
IRGPU: Getting started with CUDA

Link to the HackMD note

CUDA overview

What is CUDA?

A product

  • It enables the use of NVidia GPUs for computation

A C/C++ variant

  • Mostly compatible with modern C++ (C++14/C++17), with extensions
  • and also some restrictions!

An SDK

  • A set of compilers and toolchains for various architectures
  • Performance analysis tools

A runtime

  • An assembly specification
  • Computation libraries (linear algebra, etc.)

A new industry standard

  • Used by every major deep learning framework
  • Replacing OpenCL, just as Vulkan is replacing OpenGL

The CUDA ecosystem (2021)

Libraries or Compiler Directives or Programming Language?

CUDA is mostly based on a “new” programming language: CUDA C (or C++, or Fortran)

This grants a lot of flexibility and performance

But it also exposes much of the GPU's goodness through libraries

And it supports a few compiler directives to facilitate some constructs

The big idea: Kernels instead of loops

Arrays of parallel threads

A CUDA kernel is executed by a grid (array) of threads

  • All threads in the grid run the same kernel code (Single Program Multiple Data)
  • Each thread has indices that are used to compute memory addresses and make control decisions (a minimal kernel sketch follows)
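
As a minimal sketch of the idea (the kernel name, array names and the bounds check are our own, not from the course), the classic CPU loop and its CUDA counterpart look like this:

```cpp
// CPU version: an explicit loop over the data.
void vector_add_cpu(const float* a, const float* b, float* c, int n) {
    for (int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}

// CUDA version: the loop disappears; each thread handles one index.
__global__ void vector_add(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard against extra threads
        c[i] = a[i] + b[i];
}
```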

Thread blocks

Threads are grouped into thread blocks

  • Threads within a block cooperate (see the sketch after this list) via
    • shared memory
    • atomic operations
    • barrier synchronization
  • Threads in different blocks do not interact
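
A hedged illustration of this cooperation, assuming a hypothetical per-block sum with a block size of 256 threads (a power of two):

```cpp
// Each block sums its own slice of `in` into one entry of `partial`.
__global__ void block_sum(const float* in, float* partial, int n) {
    __shared__ float buf[256];          // shared memory: visible to the whole block
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;

    buf[tid] = (i < n) ? in[i] : 0.0f;  // assumes blockDim.x == 256
    __syncthreads();                    // barrier: wait for every thread of the block

    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            buf[tid] += buf[tid + stride];
        __syncthreads();
    }
    if (tid == 0)
        partial[blockIdx.x] = buf[0];   // one result per block; blocks never talk to each other
    // (atomic operations on shared or global memory are the other cooperation primitive)
}
```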

A multidimensional grid of computation threads

Each thread uses indices to decide what data to work on:

Each index has $x$, $y$ and $z$ attributes

Grid and blocks can have different dimensions, but they are usually two levels of the same work decomposition
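
For instance, a hedged 2D sketch (the image layout and kernel name are assumptions):

```cpp
__global__ void process_image(const unsigned char* in, unsigned char* out,
                              int width, int height) {
    // 2D grid of 2D blocks: each index has .x, .y (and .z) attributes.
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // column
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // row
    if (x < width && y < height)
        out[y * width + x] = 255 - in[y * width + x];  // e.g. invert the pixel
}

// Host side: two levels of the same decomposition.
// dim3 block(16, 16);
// dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
// process_image<<<grid, block>>>(d_in, d_out, width, height);
```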

Examples

Block decomposition enables automatic scalability

Architecture

Programming modeling

Block: a set of threads that cooperate:

  • Synchronisation
  • Shared memory
  • Block ID = ID in a grid

Grid: an array of blocks executing the same kernel

  • Access to global GPU memory
  • Synchronization by stopping and starting a new kernel

Mapping the programming model to hardware

The SMs

Zoom on the SM

  • The SM organizes blocks into warps
  • 1 warp = group of 32 threads

GTX 920:

  • 128 cores = 4 x 32 cores
  • Quad warp scheduler selects 4 warps (TLP)
  • And 2 independent instructions per warp can be dispatched each cycle (ILP)

Ex: 1 (logical) block of 96 threads maps to 3 (physical) warps of 32 threads

Zoom on the CUDA cores

  • A warp executes 32 threads on the 32 CUDA cores
  • The threads execute the same instructions (DLP)
  • All instructions are SIMD (width = 32) instructions

Each core:

  • Floating-point & integer unit
  • Fused multiply-add (FMA) instruction
  • Logic unit
  • Move, compare unit
  • Branch unit
  • The first IF/ID stages of the pipeline are handled by the SM

The SIMT Execution Model on CUDA cores

SIMT: you program as if each thread processed a single data element, but behind the scenes this maps to SIMD instructions (a + b becomes the sum of vector a with vector b)

  • Divergent code paths (branching) pile up!

If/else: all the threads step through both the if and the else branches together (threads not taking a branch are masked)

If/else

What is the latency of this code in the best and worst case? (A sketch of the kind of branching code in question follows the list below.)

  • Best case: $a\gt0$ is false for every thread. For all threads: inst-d
  • Worst case: $a\gt0$ and $b\gt0$ are true for some but not all threads. For all threads: inst-a, inst-b, inst-c, inst-d
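
The slide's code is not reproduced here; a plausible reconstruction of the branching pattern being discussed, with inst-a … inst-d as placeholders, is:

```cpp
__global__ void divergent(const float* A, const float* B, float* R) {
    float a = A[threadIdx.x];
    float b = B[threadIdx.x];
    if (a > 0) {
        if (b > 0) {
            R[threadIdx.x] = a + b;   // inst-a
        } else {
            R[threadIdx.x] = a - b;   // inst-b
        }
        R[threadIdx.x] *= 2.0f;       // inst-c
    }
    R[threadIdx.x] += 1.0f;           // inst-d
}
// If the warp's threads disagree on `a > 0` / `b > 0`, the warp executes every
// reached path one after the other with the non-taking threads masked, so the
// worst-case latency is inst-a + inst-b + inst-c + inst-d.
```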

Loops

Final note about terminology

GPU memory model

Computation cost vs. memory cost

Power measurements on NVIDIA GT200

With the same amount of energy:

  • Load 1 word from external memory (DRAM)
  • Compute 44 flops

External memory: discrete GPU

Classical CPU-GPU model

  • Split memory space
  • Highest bandwidth from GPU memory
  • Transfers to main memory are slower

Intel i7 4770 / GTX 780

External memory: embedded GPU

Most GPUs today:

  • Same memory
  • May support memory coherence (GPU can read directly from CPU caches)
  • More contention on external memory

GPU: on-chip memory

Cache area in CPU vs GPU:

But if we include registers:

Memory model hierarchy

Hardware

Cache hierarchy:

  • Keep frequently-accessed data close to the cores
  • Reduce throughput demand on main memory
  • Managed by hardware (L1, L2) or software (shared memory)

On CPU, caches are designed to avoid memory latency. On GPU, multi-threading deals with memory latency.

Software

Building and running a simple program

What you need to get started

  • NVidia GPU hardware
  • NVidia GPU drivers, properly loaded
  • CUDA runtime libraries
  • CUDA SDK (NVCC compiler in particular)

Summary

  • Host vs Device $\leftrightarrow$ Separate memory
    • GPUs are computation units that require explicit usage, as opposed to a CPU
    • Need to load data and fetch result from device
  • Replace loops with kernels
    • Kernel = Function computed in relative isolation on small chunks of data
  • Divide the work
  • Compile and run using CUDA SDK

Host view of GPU computation

Sequential and parallel sections

  • We use the GPU(s) as co-processor(s)

CUDA memory primitives

Why 2D and 3D variants?

  • Strong alignment requirements in device memory
    • Enables correct loading of memory chunks to SM caches (correct bank alignment)
  • Proper striding management in an automated fashion (see the pitched-allocation sketch below)
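
A hedged sketch of pitched allocation (function and variable names are ours):

```cpp
#include <cuda_runtime.h>

// Upload a tightly packed width x height float image into a pitched
// (properly aligned) device buffer.
float* upload_image(const float* h_img, int width, int height, size_t* pitch) {
    float* d_img = nullptr;
    // The runtime pads each row so that *pitch (in bytes) satisfies the
    // alignment requirements; *pitch >= width * sizeof(float).
    cudaMallocPitch(reinterpret_cast<void**>(&d_img), pitch,
                    width * sizeof(float), height);
    cudaMemcpy2D(d_img, *pitch,                   // destination + its row stride
                 h_img, width * sizeof(float),    // source + its row stride
                 width * sizeof(float), height,   // row width in bytes, row count
                 cudaMemcpyHostToDevice);
    return d_img;  // in a kernel, row y starts at (float*)((char*)d_img + y * *pitch)
}
```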

Host $\leftrightarrow$ Device memory transfer

Almost complete code
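
The slide's code is not reproduced here; a minimal hedged reconstruction of the usual host-side sequence (allocate, upload, launch, download, free), reusing the hypothetical vector_add kernel sketched earlier:

```cpp
#include <cuda_runtime.h>
#include <vector>

__global__ void vector_add(const float* a, const float* b, float* c, int n);

int main() {
    const int n = 1 << 20;
    std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n);
    const size_t bytes = n * sizeof(float);

    float *d_a, *d_b, *d_c;                        // device pointers
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);

    cudaMemcpy(d_a, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b.data(), bytes, cudaMemcpyHostToDevice);

    const int block = 256;
    const int grid  = (n + block - 1) / block;
    vector_add<<<grid, block>>>(d_a, d_b, d_c, n); // asynchronous launch

    cudaMemcpy(c.data(), d_c, bytes, cudaMemcpyDeviceToHost);  // implicit sync

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;      // (error checking omitted; see the next section)
}
```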

Checking errors

In practice, check for API errors
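
A common pattern is a small checking macro (this macro is our own convention, not a CUDA API):

```cpp
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Wrap every runtime call: print the error string and abort on failure.
#define CUDA_CHECK(call)                                                      \
    do {                                                                      \
        cudaError_t err_ = (call);                                            \
        if (err_ != cudaSuccess) {                                            \
            std::fprintf(stderr, "CUDA error %s at %s:%d\n",                  \
                         cudaGetErrorString(err_), __FILE__, __LINE__);       \
            std::exit(EXIT_FAILURE);                                          \
        }                                                                     \
    } while (0)

// Usage:
//   CUDA_CHECK(cudaMalloc(&d_a, bytes));
//   my_kernel<<<grid, block>>>(...);
//   CUDA_CHECK(cudaGetLastError());      // catches launch errors
//   CUDA_CHECK(cudaDeviceSynchronize()); // catches asynchronous execution errors
```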

Intermission: Can I use memory management functions inside kernels?

No: cudaMalloc(), cudaMemcpy() and cudaFree() must be called from the host only

However, kernels may allocate, use and reclaim memory dynamically using regular malloc()
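
A minimal sketch of device-side allocation (the kernel and its workload are hypothetical):

```cpp
__global__ void scratch_demo(int m) {
    // Each thread allocates its own scratch buffer from the device heap.
    int* tmp = static_cast<int*>(malloc(m * sizeof(int)));
    if (tmp != nullptr) {              // the device heap is limited: always check
        for (int i = 0; i < m; ++i)
            tmp[i] = threadIdx.x + i;
        free(tmp);                     // reclaim before the thread finishes
    }
}

// Host side, if more heap is needed before launching:
//   cudaDeviceSetLimit(cudaLimitMallocHeapSize, 64 << 20);
```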

Fix the kernel invocation line

We want to fix this line:

Kernel invocation syntax:
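
The general form, illustrated with the hypothetical vector_add kernel from earlier (the optional arguments are shown in the comment):

```cpp
// General form: kernel<<<gridDim, blockDim [, sharedMemBytes [, stream]]>>>(arguments);
__global__ void vector_add(const float* a, const float* b, float* c, int n);

void launch_vector_add(const float* d_a, const float* d_b, float* d_c, int n) {
    dim3 block(256);                                // threads per block
    dim3 grid((n + block.x - 1) / block.x);         // blocks per grid (ceiling division)
    vector_add<<<grid, block>>>(d_a, d_b, d_c, n);  // asynchronous launch
}

// gridDim and blockDim may be plain integers (1D) or dim3 values (up to 3D):
//   my_kernel<<<dim3(gx, gy), dim3(bx, by)>>>(...);
```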

How to set gridDim and blockDim properly?

Lvl 0: Naive trial with as many threads as possible

  • Will fail with large vectors
    • Hardware limitation on the maximum number of threads per block (1024 for compute capability 3.0-7.5)
  • Will fail with vectors whose size is not a multiple of the warp size

Lvl 1: It works with just enough blocks

Lvl 2: Tune block size given the kernel requirements and hardware constraints
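
One common way to do this (a sketch of a general technique, not necessarily the course's choice) combines a grid-stride loop with the runtime's occupancy helper:

```cpp
#include <cuda_runtime.h>

// Grid-stride loop: correct for any n, whatever grid size we pick.
__global__ void vector_add_gs(const float* a, const float* b, float* c, int n) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        c[i] = a[i] + b[i];
}

void launch(const float* d_a, const float* d_b, float* d_c, int n) {
    int min_grid = 0, block = 0;
    // Ask the runtime for a block size that maximizes occupancy of this kernel.
    cudaOccupancyMaxPotentialBlockSize(&min_grid, &block, vector_add_gs);
    int grid = (n + block - 1) / block;             // just enough blocks
    vector_add_gs<<<grid, block>>>(d_a, d_b, d_c, n);
}
```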

But wait…

On the device, kernel invocations are strictly sequential (unless you schedule them on different streams)

Intermission: Can I launch kernels inside kernels?

Yes. This is the basis of dynamic parallelism

Some restrictions on the stack size apply. Remember that the device runtime is a functional subset of the host runtime, i.e. you can perform device management, kernel launching, device memcpy, etc., but with some restrictions

The compiler may inline some of those calls.
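
A minimal hedged sketch of a parent kernel launching a child kernel (kernel names are ours; this must be compiled with relocatable device code, e.g. nvcc -rdc=true):

```cpp
__global__ void child(int* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2;
}

__global__ void parent(int* data, int n) {
    // A single thread of the parent grid launches a nested grid.
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        int block = 256;
        child<<<(n + block - 1) / block, block>>>(data, n);
        cudaDeviceSynchronize();   // device-runtime call: wait for the child grid
    }
}
```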

Conclusion about the host-only view

A host-only view of the computation is sufficient in most cases:

  1. upload data to the device
  2. fire a kernel
  3. download output data from the device

Advanced CUDA requires making sure we saturate the SMs, and may require studying the kernel to determine the best:

  • number of threads per block
  • number of blocks per grid
  • work per thread (if applicable)

This depends on:

  • hardware specifications: maximum gridDim and blockDim, etc.
  • kernel code: amount of registers and shared memory used by each thread

Kernel programming

Several API levels

We now want to program kernels. There are several APIs available:

  • PTX assembly
  • Driver API (C)
  • Runtime C++ API $\leftarrow$ let’s use this one

Function Execution Space Specifiers

  • __global__ defines a kernel function
    • Each __ consists of 2 underscore characters
    • A kernel function must return void
    • It may be called from another kernel for devices of compute capability 3.2 or higher (Dynamic Parallelism support)
  • __device__ and __host__ can be used together
  • __host__ is optional if used alone (see the sketch below)
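
A short sketch showing the three specifiers together (the helper functions are hypothetical):

```cpp
__device__ float square(float x) { return x * x; }         // device-only helper

__host__ __device__ float lerp(float a, float b, float t)  // compiled for both CPU and GPU
{ return a + t * (b - a); }

__global__ void apply(float* v, int n) {                   // kernel: returns void, launched with <<< >>>
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] = lerp(0.0f, square(v[i]), 0.5f);
}
```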

Built-in Vector Types

They make it easy to work with data like images. Alignment must be respected in all operations

They all come with a constructor function of the form make_<type name>

The 1st, 2nd, 3rd and 4th components are accessible through the fields $x$, $y$, $z$ and $w$ respectively
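
For instance, a hypothetical RGBA-to-grayscale kernel using uchar4 and the make_<type name> constructors:

```cpp
__global__ void to_gray(const uchar4* rgba, unsigned char* gray, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        uchar4 p = rgba[i];                       // one aligned 4-byte load
        gray[i] = (unsigned char)(0.299f * p.x + 0.587f * p.y + 0.114f * p.z);
    }
}

// Constructor functions of the form make_<type name>:
//   float4 v = make_float4(1.0f, 2.0f, 3.0f, 4.0f);   // v.x, v.y, v.z, v.w
```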

Built-in variables

Example

Memory hierarchy

Types of Memory

  • Registers
    • Used to store parameters, local variables, etc.
    • Very fast
    • Private to each thread
    • Lots of threads $\Rightarrow$ little memory per thread
  • Shared
    • Used to store temp data
    • Very fast
    • Shared among all threads in a block
  • Constant
    • A special cache for read-only values
  • Global
    • Large and slow
  • Caches
    • Transparent use
  • Local

Salient features of Device Memory

Cost to access memory

Variable Memory Space Specifiers

How to declare CUDA variables

Remarks:

  • __device__ is optional when used with __shared__ or __constant__
  • Automatic variables reside in a register (see the declaration sketch below)
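
A hedged declaration sketch (the kernel and variable names are ours):

```cpp
__constant__ float coeffs[16];          // constant memory: host writes it with cudaMemcpyToSymbol
__device__   int   hit_count;           // global (device) memory, visible to all kernels

__global__ void filter(const float* in, float* out, int n) {
    __shared__ float tile[256];         // shared memory: one copy per block (assumes blockDim.x <= 256)
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float x = (i < n) ? in[i] : 0.0f;   // automatic variable: lives in a register
    tile[threadIdx.x] = x;
    __syncthreads();
    if (i < n) out[i] = tile[threadIdx.x] * coeffs[0];
}

// Host side: cudaMemcpyToSymbol(coeffs, h_coeffs, 16 * sizeof(float));
```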

Where to declare variables? Can the host access it?

  • Yes: global and constant variables; declare them outside of any function
  • No: register and shared variables; use or declare them in the kernel

What can be shared by whom?

Possible memory access:

  • Among threads in the same grid (a kernel invocation)
    • Global memory
  • Among threads in the same block
    • Global memory
    • Shared memory

Relaxed consistency memory model

The CUDA programming model assumes a device with a weakly-ordered memory model, that is, the order in which a CUDA thread writes data to shared memory or global memory is not necessarily the order in which another CUDA or host thread observes the data being written

Example

Possible outcomes for thread 2? (A sketch of the classic pattern is given below.)
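
The slide's code is not reproduced here; the classic two-thread pattern it refers to looks roughly like this (initial values assumed to be 0):

```cpp
__device__ int X = 0, Y = 0;

__device__ void thread1() {          // executed by one thread
    X = 10;
    Y = 20;
}

__device__ void thread2(int* a, int* b) {   // executed by another thread
    *a = Y;   // may observe Y == 20 ...
    *b = X;   // ... and still observe X == 0: writes may be seen out of order
}
// Without a fence, thread 2 may see (a, b) = (0, 0), (0, 10), (20, 10) or even (20, 0).
```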

Memory Fence Functions

Memory fence functions can be used to enforce some ordering on memory accesses

__threadfence_block() ensures that:

  • All writes to all memory made by the calling thread before the call are observed by all threads in its block as occurring before all writes made after the call
  • All reads from all memory made by the calling thread before the call are ordered before all reads made after the call

Synchronization functions

Atomic functions

Most of the atomic functions are available for all the numerical types
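
For example, a hypothetical histogram kernel using atomicAdd():

```cpp
__global__ void histogram(const unsigned char* img, int n, unsigned int* bins) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&bins[img[i]], 1u);   // read-modify-write done atomically in global memory
}
// Other atomics exist: atomicSub, atomicMin, atomicMax, atomicExch, atomicCAS, ...
```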

Arithmetic functions

Debugging, performance analysis and profiling

printf

Possible since Fermi devices (Compute Capability 2.x and higher)

Limited number of lines:

  • circular buffer flushed at particular times
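
A quick sketch of in-kernel printf (the kernel is hypothetical):

```cpp
__global__ void debug_kernel(const float* v, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Print from only a few threads: the printf buffer is a limited circular buffer.
    if (i < 4 && i < n)
        printf("thread %d: v = %f\n", i, v[i]);
}
// The buffer is flushed on cudaDeviceSynchronize(), at the next kernel launch, etc.
```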

Global memory write

To dump and then inspect a larger amount of intermediate data. Analysis code should be removed for production

Example
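
The slide's example is not shown here; a hedged sketch of the idea, with an extra debug buffer passed only while analysing:

```cpp
__global__ void step(const float* in, float* out, int n,
                     float* debug /* analysis only */) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float tmp = in[i] * in[i] - 1.0f;   // some intermediate value of interest
        if (debug) debug[i] = tmp;          // dump it to global memory for later inspection
        out[i] = tmp > 0.0f ? tmp : 0.0f;
    }
}
// Host: allocate `debug` with cudaMalloc, copy it back with cudaMemcpy, then inspect it.
// Remove the debug parameter (or pass nullptr) for production builds.
```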

CUDA tools

The complete compilation trajectory
