Link to the HackMD note
CUDA overview
What is CUDA ?
A product
- It enables the use of NVidia GPUs for computation
A C/C++ variant
- Mostly compatible with modern C++ (C++14/C++17), with extensions
- and also some restrictions !
A SDK
- A set of compilers and toolchains for various architectures
- Performance analysis tools
A runtime
- An assembly specification
- Computation libraries (linear algebra, etc.)
A new industry standard
- Used by every major deep learning framework
- Replacing OpenCL, as Vulkan is replacing OpenGL
The CUDA ecosystem (2021)
Libraries or Compiler Directives or Programming Language ?
CUDA is mostly based on a “new” programming language: CUDA C (or C++, or Fortran)
This grants much flexibility and performance
But it also exposes much of the GPU's capabilities through libraries
And it supports a few compiler directives to facilitate some constructs
The big idea: Kernels instead of loops
No more `for` loops !
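As a sketch of the idea (the `add_cpu`/`add_gpu` functions below are hypothetical, not taken from the original slides), the sequential loop disappears and each GPU thread handles one element:

```cpp
// Sequential CPU version: a single thread iterates over all elements.
void add_cpu(const float* a, const float* b, float* c, int n) {
    for (int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}

// CUDA version: no loop; each thread processes the element matching its index.
__global__ void add_gpu(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard against extra threads
        c[i] = a[i] + b[i];
}
```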
Arrays of parallel threads
A CUDA kernel is executed by a grid (array) of threads
- All threads in the grid run the same kernel code (Single Program Multiple Data)
- Each thread has indices that are used to compute memory addresses and make control decisions
Thread blocks
Threads are grouped into thread blocks
- Threads within a block cooperate via
- shared memory
- atomic operations
- barrier synchronization
- Threads in different blocks do not interact
A multidimensional grid of computation threads
Each thread uses indices to decide what data to work on:
Each index has $x$, $y$ and $z$ attributes
The grid and its blocks can have different dimensions, but they are usually 2 levels of the same work decomposition
Examples
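For instance, a minimal sketch (hypothetical `invert` kernel over a `width` × `height` image stored row-major) of how each thread derives the data it works on from its block and thread indices:

```cpp
// Each thread computes its 2D coordinates from the built-in indices,
// then maps them to a linear offset in the image buffer.
__global__ void invert(unsigned char* img, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // column
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // row
    if (x < width && y < height)
        img[y * width + x] = 255 - img[y * width + x];
}

// Host side: a 2D grid of 2D blocks covering the whole image.
// dim3 block(16, 16);
// dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
// invert<<<grid, block>>>(d_img, width, height);
```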
Block decomposition enables automatic scalability
Architecture
Programming model
Block: a set of threads that cooperate:
- Synchronization
- Shared memory
- Block ID = ID in a grid
Grid: an array of blocks executing the same kernel
- Access to global GPU memory
- Synchronization by stopping a kernel and starting a new one
Mapping the programming model to hardware: the SMs
Zoom on the SM
Warp: 32 compute units
- The SM organizes blocks into warps
- 1 warp = group of 32 threads
GTX 920:
- 128 cores = 4 x 32 cores
- Quad warp scheduler selects 4 warps (TLP)
- And 2 independent instructions per warp can be dispatched each cycle (ILP)
Ex: 1 (logical) block of 96 threads maps to: 3 (physical) warps of 32 threads
Zoom on the CUDA cores
- A warp executes 32 threads on the 32 CUDA cores
- The threads execute the same instruction (DLP)
- All instructions are SIMD (width = 32) instructions
Each core:
- Floating-point & integer unit
- Fused multiply-add (FMA) instruction
- Logic unit
- Move, compare unit
- Branch unit
- The first IF/ID stages of the pipeline are handled by the SM
SIMT allows specifying the execution at the level of individual threads
The SIMT Execution Model on CUDA cores
SIMT: we program as if a single thread processed a single data element, but this hides SIMD instructions underneath (a + b becomes the sum of vector a with vector b)
Each thread executes the same kernel and the same instructions
- Divergent code paths (branching) pile up!
If/else: all the threads step through both the if and the else parts at the same time (inactive threads are masked)
If/else
What is the latency of this code in the best and worst case ?
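The code under discussion is not reproduced in these notes; a plausible reconstruction consistent with the cases below, with `inst-a` … `inst-d` modelled as simple assignments, might be:

```cpp
__global__ void branchy(const int* a, const int* b, int* out) {
    int i = threadIdx.x;
    int r = 0;
    if (a[i] > 0) {      // the warp as a whole enters if any of its threads does
        r += 1;          // inst-a
        if (b[i] > 0)
            r += 2;      // inst-b
        else
            r += 4;      // inst-c
    }
    out[i] = r;          // inst-d
}
```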
- Best case: $a\gt0$ is false for every thread. All threads execute only: `inst-d`
- Worst case: $a\gt0$ and $b\gt0$ are true for some but not all threads. All threads step through: `inst-a`, `inst-b`, `inst-c`, `inst-d`
Loops
Final note about terminology
GPU memory model
Computation cost vs. memory cost
Power measurements on NVIDIA GT200
With the same amount of energy:
- Load 1 word from external memory (DRAM)
- Compute 44 flops
Must optimize memory first
External memory: discrete GPU
Classical CPU-GPU model
- Split memory space
- Highest bandwidth from GPU memory
- Transfers to main memory are slower
Intel i7 4770 / GTX 780
External memory: embedded GPU
Most GPUs today:
- Same memory
- May support memory coherence (GPU can read directly from CPU caches)
- More contention on external memory
GPU: on-chip memory
Cache area in CPU vs GPU:
But if we include registers:
The GPU has many more registers, but they are made of simpler memory
Memory model hierarchy
Hardware
Cache hierarchy:
- Keep frequently-accessed data close to the cores
- Reduce throughput demand on main memory
- Managed by hardware (L1, L2) or software (shared memory)
On the CPU, caches are designed to avoid memory latency. On the GPU, multi-threading deals with memory latency.
Software
Building and running a simple program
What you need to get started
- NVidia GPU hardware
- NVidia GPU drivers, properly loaded
- CUDA runtime libraries
- CUDA SDK (NVCC compiler in particular)
Summary
- Host vs Device $\leftrightarrow$ Separate memory
- GPUs are computation units which require explicit usage, as opposed to a CPU
- Need to load data and fetch result from device
- Replace loops with kernels
- Kernel = Function computed in relative isolation on small chunks of data
- Divide the work
- Compile and run using CUDA SDK
Host view of GPU computation
Sequential and parallel sections
- We use the GPU(s) as co-processor(s)
CUDA memory primitives
Why 2D and 3D variants ?
- Strong alignment requirements in device memory
- Enables correct loading of memory chunks to SM caches (correct bank alignment)
- Proper striding management in automated fashion
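As a sketch of the 2D variants (hypothetical image sizes; `cudaMallocPitch` lets the runtime pick a well-aligned row stride):

```cpp
#include <cuda_runtime.h>

// Allocate a pitched (row-aligned) 2D buffer on the device and copy a
// host image into it; the runtime chooses the pitch for good alignment.
void upload_image(const float* h_img, int width, int height) {
    float* d_img = nullptr;
    size_t pitch = 0;  // row stride in bytes, chosen by the runtime

    cudaMallocPitch((void**)&d_img, &pitch, width * sizeof(float), height);
    cudaMemcpy2D(d_img, pitch,                    // destination + its stride
                 h_img, width * sizeof(float),    // source + its stride
                 width * sizeof(float), height,   // logical width (bytes) and height
                 cudaMemcpyHostToDevice);
    // ... launch kernels using d_img and pitch ...
    cudaFree(d_img);
}
```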
Host $\leftrightarrow$ Device memory transfer
Almost complete code
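The complete listing from the original material is not reproduced here; a minimal sketch of the usual host-side sequence (hypothetical `addKernel` and sizes) looks like this:

```cpp
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

__global__ void addKernel(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    // Host buffers
    float *h_a = (float*)malloc(bytes), *h_b = (float*)malloc(bytes), *h_c = (float*)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_a[i] = 1.f; h_b[i] = 2.f; }

    // Device buffers
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);

    // Host -> Device
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // Launch and wait
    int block = 256, grid = (n + block - 1) / block;
    addKernel<<<grid, block>>>(d_a, d_b, d_c, n);
    cudaDeviceSynchronize();

    // Device -> Host
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", h_c[0]);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}
```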
Checking errors
In practice check for API errors
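A common pattern is to wrap API calls in a checking macro; this is a sketch (the `CUDA_CHECK` name is just a convention, not part of the CUDA API):

```cpp
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Wrap every CUDA runtime call to surface errors immediately.
#define CUDA_CHECK(call)                                                     \
    do {                                                                     \
        cudaError_t err = (call);                                            \
        if (err != cudaSuccess) {                                            \
            fprintf(stderr, "CUDA error %s at %s:%d\n",                      \
                    cudaGetErrorString(err), __FILE__, __LINE__);            \
            exit(EXIT_FAILURE);                                              \
        }                                                                    \
    } while (0)

// Usage:
//   CUDA_CHECK(cudaMalloc(&d_a, bytes));
//   myKernel<<<grid, block>>>(...);
//   CUDA_CHECK(cudaGetLastError());        // launch configuration errors
//   CUDA_CHECK(cudaDeviceSynchronize());   // asynchronous execution errors
```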
Intermission: Can I use memory management functions inside kernels ?
No: `cudaMalloc()`, `cudaMemcpy()` and `cudaFree()` shall be called from the host only
However, kernels may allocate, use and reclaim memory dynamically using regular `malloc()` and `free()`
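A minimal sketch (hypothetical kernel) of device-side dynamic allocation:

```cpp
// Each thread allocates a small scratch buffer from the device heap,
// uses it, and frees it before returning.
__global__ void scratch_demo(int* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    int* tmp = (int*)malloc(4 * sizeof(int));  // device-side malloc
    if (tmp == nullptr) return;                // the device heap may be exhausted
    for (int k = 0; k < 4; ++k) tmp[k] = i + k;
    out[i] = tmp[0] + tmp[1] + tmp[2] + tmp[3];
    free(tmp);                                 // device-side free
}

// The device heap size can be raised from the host if needed:
//   cudaDeviceSetLimit(cudaLimitMallocHeapSize, 16 * 1024 * 1024);
```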
Fix the kernel invocation line
We want to fix this line:
Kernel invocation syntax:
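The general form is the triple-chevron launch; the shared-memory and stream arguments are optional (the names below are placeholders):

```cpp
// kernelName<<<gridDim, blockDim, sharedMemBytes, stream>>>(arg0, arg1, ...);
// Most launches only need the first two configuration arguments:
myKernel<<<numBlocks, threadsPerBlock>>>(d_data, n);
```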
How to set `gridDim` and `blockDim` properly ?
Lvl 0: Naive trial with as many threads as possible
- Will fail with large vectors
- Hardware limitation on the maximum number of threads per block (1024 for compute capability 3.0-7.5)
- Will fail with vectors whose size is not a multiple of the warp size
Lvl 1: It works with just enough blocks
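A sketch of that pattern (hypothetical names): round the block count up and let a bounds check in the kernel mask the extra threads.

```cpp
int threadsPerBlock = 256;
// Ceiling division: enough blocks to cover n elements even when n is not
// a multiple of the block size.
int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_c, n);
```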
Lvl 2: Tune block size given the kernel requirements and hardware constraints
But wait…
This code prints nothing !
Kernel invocation is asynchronous
Host code synchronization requires `cudaDeviceSynchronize()` because kernel invocation is asynchronous from the host's perspective.
On the device, kernel invocations are strictly sequential (unless you schedule them on different streams)
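A sketch showing why the call matters (hypothetical `hello` kernel):

```cpp
#include <cstdio>

__global__ void hello() {
    printf("Hello from thread %d of block %d\n", threadIdx.x, blockIdx.x);
}

int main() {
    hello<<<2, 4>>>();
    // The launch only queues work: without this call the host may exit
    // before the kernel runs and nothing gets printed.
    cudaDeviceSynchronize();
    return 0;
}
```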
Intermission: Can I make kernels inside kernels ?
Yes. This is the basis of dynamic parallelism
Some restrictions on the stack size apply. Remember that the device runtime is a functional subset of the host runtime, i.e. you can perform device management, kernel launching, device `memcpy`, etc., but with some restrictions.
The compiler may inline some of those calls.
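A minimal sketch of dynamic parallelism (hypothetical kernels; compiling it requires compute capability 3.5+ and relocatable device code, e.g. `nvcc -rdc=true`):

```cpp
__global__ void child(int* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2;
}

__global__ void parent(int* data, int n) {
    // One thread launches a child grid from the device; child grids are
    // guaranteed to complete before the parent grid is considered finished.
    if (blockIdx.x == 0 && threadIdx.x == 0)
        child<<<(n + 255) / 256, 256>>>(data, n);
}
```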
Conclusion about the host-only view
A host-only view of the computation is sufficient for most of the cases:
- upload data to the device
- fire a kernel
- download output data from the device
Advanced CUDA requires making sure we saturate the SMs, and may imply studying the kernel to determine the best:
- number of threads per block
- number of blocks per grid
- work per thread (if applicable)
- …
This depends on:
- hardware specifications: maximum `gridDim` and `blockDim`, etc.
- kernel code: amount of registers and shared memory used by each thread
Kernel programming
Several API levels
We now want to program kernels. There are several APIs available:
- PTX assembly
- Driver API (C)
- Runtime C++ API $\leftarrow$ let’s use this one
Function Execution Space Specifiers
- `__global__` defines a kernel function
  - Each `__` consists of 2 underscore characters
  - A kernel function must return `void`
  - It may be called from another kernel for devices of compute capability 3.2 or higher (Dynamic Parallelism support)
- `__device__` and `__host__` can be used together
- `__host__` is optional if used alone
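A small sketch (hypothetical functions) combining the three specifiers:

```cpp
// __device__: callable from device code only.
__device__ float square(float x) { return x * x; }

// __host__ __device__: the same function compiled for both CPU and GPU.
__host__ __device__ float lerp(float a, float b, float t) { return a + t * (b - a); }

// __global__: a kernel, launched from the host with <<<...>>> (must return void).
__global__ void scaleSquares(float* v, float k, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] = k * square(v[i]);
}
```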
Built-in Vector Types
They make it easy to work with data like images. Alignment must be respected in all operations.
They all are structures
They all come with a constructor function of the form make_<type name>
The 1st, 2nd, 3rd and 4th components are accessible through the fields $x$, $y$, $z$ and $w$ respectively
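A sketch (hypothetical `to_gray` kernel) using a built-in vector type and its constructor function:

```cpp
// Built-in vector types are plain structs with x, y, z, w fields and a
// make_<type name> constructor function.
__global__ void to_gray(const uchar4* rgba, float* gray, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        uchar4 p = rgba[i];  // one aligned 4-byte load
        gray[i] = 0.299f * p.x + 0.587f * p.y + 0.114f * p.z;
    }
}

// Construction on host or device:
// float4 v = make_float4(1.f, 2.f, 3.f, 4.f);   // v.x .. v.w
```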
Built-in variable
Example
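The original example is not reproduced here; a sketch (hypothetical `whoami` kernel) using the built-in variables might be:

```cpp
__global__ void whoami(int* out) {
    // gridDim and blockDim are of type dim3; blockIdx and threadIdx are uint3.
    int threadsPerBlock = blockDim.x * blockDim.y * blockDim.z;
    int blockId  = blockIdx.x + blockIdx.y * gridDim.x;    // position in a 2D grid
    int threadId = threadIdx.x + threadIdx.y * blockDim.x; // position in a 2D block
    out[blockId * threadsPerBlock + threadId] = threadId;  // flattened global index
}
```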
Memory hierarchy
Types of Memory
- Registers
- Used to store parameters, local variables, etc.
- Very fast
- Private to each thread
- Lots of threads $\Rightarrow$ little memory per thread
- Shared
- Used to store temp data
- Very fast
- Shared among all threads in a block
- Constant
- A special cache for read-only values
- Global
- Large and slow
- Caches
- Transparent uses
- Local
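To make the register / shared / global distinction above concrete, here is a minimal sketch of a per-block sum (hypothetical kernel, block size assumed to be 256, a power of two):

```cpp
#define BLOCK 256

__global__ void block_sum(const float* in, float* out, int n) {
    __shared__ float buf[BLOCK];          // shared: visible to the whole block

    int tid = threadIdx.x;                // registers: private to each thread
    int i   = blockIdx.x * blockDim.x + tid;

    buf[tid] = (i < n) ? in[i] : 0.f;     // global -> shared
    __syncthreads();                      // barrier before reading neighbours

    // Tree reduction inside the block.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) buf[tid] += buf[tid + s];
        __syncthreads();
    }

    if (tid == 0) out[blockIdx.x] = buf[0];  // one result per block -> global
}
```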
Salient features of Device Memory
Cost to access memory
Variable Memory Space Specifiers
How to declare CUDA variables
Remarks:
- `__device__` is optional when used with `__shared__` or `__constant__`
- Automatic variables reside in a register
Where to declare variables ? Can the host access it ?

| Yes | No |
|---|---|
| global and constant: declare outside of any function | register and shared: declare in the kernel |
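A sketch (hypothetical names) of where each specifier goes:

```cpp
// Declared outside any function: visible to all kernels, reachable from
// the host through the runtime API.
__constant__ float coeffs[16];       // constant memory (read-only in kernels)
__device__   int   counter;          // global memory

// Assumed to be launched with 256 threads per block.
__global__ void kernelExample(float* data) {
    __shared__ float tile[256];      // shared: one copy per block
    float x = data[threadIdx.x];     // automatic variable: lives in a register
    tile[threadIdx.x] = x * coeffs[0];
    __syncthreads();
    data[threadIdx.x] = tile[255 - threadIdx.x];
}

// Host side (hypothetical): the host reaches such symbols via the API, e.g.
//   cudaMemcpyToSymbol(coeffs, h_coeffs, sizeof(coeffs));
```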
Who can be shared by who ?
Possible memory access:
- Among threads in the same grid (a kernel invocation)
- Global memory
- Among threads in the same block
- Global memory
- Shared memory
Relaxed consistency memory model
The CUDA programming model assumes a device with a weakly-ordered memory model: the order in which a CUDA thread writes data to shared memory or global memory is not necessarily the order in which the data is observed being written by another CUDA or host thread.
Example
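The code from the original slide is not reproduced here; the classic illustration of weak ordering (hypothetical device functions, following the pattern used in the CUDA programming guide) is:

```cpp
__device__ int X = 1, Y = 2;

// Executed by thread 1
__device__ void writeXY() {
    X = 10;
    Y = 20;
}

// Executed by thread 2
__device__ void readXY() {
    int B = Y;
    int A = X;
    // With a weakly-ordered memory model, thread 2 may observe the two
    // writes in either order: (A, B) can be (1, 2), (10, 2), (10, 20),
    // or even (1, 20).
}
```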
Possible outcomes for thread 2 ?
Memory Fence Functions
Memory fence functions can be used to enforce some ordering on memory accesses
`__threadfence_block()` ensures that:
- All writes to all memory made by the calling thread before the call are observed by all threads in its block as occurring before all writes made by the calling thread after the call
- All reads from all memory made by the calling thread before the call are ordered before all reads made by the calling thread after the call
Synch functions
`__syncthreads()` is stronger than `__threadfence()` because it also synchronizes the execution
Atomic functions
Atomic functions perform a read-modify-write atomic operation on one 32-bit or 64-bit word residing in global or shared memory
Most of the atomic functions are available for all the numerical types
Arithmetic functions
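A minimal sketch (hypothetical histogram kernel) of an arithmetic atomic:

```cpp
// Many threads may increment the same bin concurrently; atomicAdd makes
// the read-modify-write indivisible.
__global__ void histogram(const unsigned char* data, int n, unsigned int* bins) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&bins[data[i]], 1u);
}
```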
Debugging, performance analysis and profiling
printf
Possible since Fermi devices (Compute Capability 2.x and higher)
Limited number of lines:
- circular buffer flushed at particular times
Global memory write
To dump and then inspect a larger amount of intermediate data. Analysis code should be removed for production.
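A sketch of the technique (the `dbg` buffer is hypothetical and added purely for inspection):

```cpp
// Each thread dumps an intermediate value into a dedicated debug buffer,
// which the host copies back and inspects. Remove for production builds.
__global__ void compute(const float* in, float* out, float* dbg, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float intermediate = in[i] * in[i] + 1.f;
    dbg[i] = intermediate;            // debug-only global memory write
    out[i] = sqrtf(intermediate);
}

// Host side: cudaMemcpy(h_dbg, d_dbg, n * sizeof(float), cudaMemcpyDeviceToHost);
```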