Link to the HackMD note
IRGPU: Patterns for massively parallel programming
Programming patterns & memory optimizations
The programming patterns
- Map
- Map + Local reduction
- Reduction
- Scan
The IP algorithms
- LUT applications
- Local features extraction
- Histogram
- Integral images
Map pattern
Map pattern overview
- One pixel in, one pixel out
- No dependency between pixels
Map replicates a function over every element of an index set. The computation of each pixel is independent of the others.
out(x,y) = f(in(x,y))
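A minimal sketch of the map pattern as a CUDA kernel; the kernel name and the example function `f` (a simple negative) are illustrative, not from the course material.

```cuda
// Map pattern: each thread produces exactly one output pixel,
// independently of all the other pixels.
__device__ unsigned char f(unsigned char v) { return 255 - v; } // example: negative

__global__ void map_kernel(const unsigned char* in, unsigned char* out,
                           int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        out[y * width + x] = f(in[y * width + x]);
}
```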
What do you think about k's impact on performance? The first kernel accesses threadIdx.x + k, the other threadIdx.x * k.
- On the left, thread i and thread i+1 access consecutive memory addresses: the accesses are contiguous (linear)
- On the right: strided accesses
- Linear sequential access with an offset is fine
- Strided access is bad
Strided access pattern
On kernel 2, a large share of the time is spent on memory handling (poorly organized memory accesses).
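A hedged sketch of the two access patterns being compared (kernel names and signatures are assumptions): the first reads with a constant offset k, the second with a stride k.

```cuda
// Offset access: consecutive threads read consecutive addresses (coalesced).
__global__ void offset_access(const float* in, float* out, int k, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i + k < n)
        out[i] = in[i + k];
}

// Strided access: consecutive threads read addresses k elements apart (uncoalesced).
__global__ void strided_access(const float* in, float* out, int k, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * k < n)
        out[i] = in[i * k];
}
```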
Memory performance
Memory bandwidth
How you think memory works
Reality:
Memory access hierarchy
- L2 memory: intermediate cache
- On-chip low-latency memory:
- Registers
- Shared memory
- L1 (same physical area as the shared memory)
- Either let the processor manage this memory, or manage it ourselves
- In that case we configure the shared memory ourselves
- L1 serves as a cache between L2 and the shared memory
Cached Loads from L1 low-latency
Two things to take into account:
- whether memory accesses are aligned or not
- whether memory accesses are coalesced (vs. strided) or not
2 types of global memory loads: cached or uncached
Aligned vs Misaligned
A load is aligned if the first address of the memory access is a multiple of 32 bytes
- Memory addresses must be type-aligned (aligned on the size of the type), otherwise performance is poor
cudaMalloc returns memory aligned on at least 256 bytes
Coalesced vs uncoalesced
A load is coalesced if the 32 threads of a warp access a contiguous range of memory
Misaligned cached loads from L1
We need a load strategy:
- The 32 threads of a warp each access a 32-bit word = 128 bytes in total
- 128 bytes = the L1 bus width (single load, bus utilization = 100%)
- Permuting the accesses within the warp has no (or very low) overhead
- If the data is not 128-byte aligned, 2 loads are required
Addresses 96-224 are required... but 0-256 are loaded
- If the data is accessed with a stride, even more of the loaded bytes go unused
Can we not just widen the bus? No, the bus width is fixed (hardware).
Loads from global (uncached) memory
Same idea but memory is split in segments of 32 bytes
Coalesced Memory Access (summary)
How memory works for real
- DRAM is organized in a 2D core array
- Each DRAM core array has about 16M bits
Example
- A 4x4 memory cell
- With 4 bits pin interface width
DRAM Burst
Memory is slow
- DDR = 1/2 interface speed
- DDR2 = 1/4 interface speed
- DDR3 = 1/8 interface speed
Solution: Bursting
Load N x interface width of the same row
Instead of getting a single word at 1/4 of the interface speed, a burst also returns the following words of the same row.
Better, but not enough to saturate the memory bus
Summary
- Use coalesced (contiguous and aligned) accesses to memory
How to make coalesced loads with 2D arrays ?
This is what the pitch is for.
Pitch: padded row size, chosen so that the start of every row falls on a multiple of 32.
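A sketch of how the pitch is obtained with the CUDA runtime (the small helper function is only for illustration):

```cuda
#include <cuda_runtime.h>

// Allocate a 2D float image whose rows are padded so that every row start
// keeps the required alignment. cudaMallocPitch returns the padded row
// size (the pitch) in bytes.
float* alloc_image(int width, int height, size_t* pitch)
{
    float* d_img = nullptr;
    cudaMallocPitch((void**)&d_img, pitch, width * sizeof(float), height);
    return d_img;
}

// Inside a kernel, row y then starts at (char*)d_img + y * pitch:
//   const float* row = (const float*)((const char*)d_img + y * pitch);
//   float v = row[x];
```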
Using shared memory
Transposition
Where are the non-coalesced accesses? Within a warp, the x values are linear (0-31). Here, what is linear along x is the read from in. The 32 threads then write 32 non-contiguous elements of a[x][y].
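A sketch of the naive transpose, illustrating where the non-coalesced accesses come from (kernel name is an assumption):

```cuda
// Naive transpose: the read from `in` is coalesced (x varies fastest within
// a warp), but the 32 threads of a warp write 32 elements that are `height`
// floats apart in `out`: a strided, non-coalesced write.
__global__ void transpose_naive(const float* in, float* out,
                                int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        out[x * height + y] = in[y * width + x];
}
```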
Tiling and memory privatization
We split the work into sub-blocks (tiles).
This works well on images!
For each block:
- Read the tile from global to private block memory
- Process the block (using shared memory)
- Write the tile from the private (shared) block memory back to global memory
Collaborative loading and writing when BLOCKDIM = TILEDIM
- All threads load one or more data
- Access must be coalesced
- Use barrier synch to make sure that all threads are ready to start the phase
Is writing into the tile linear? For each row y, x varies fastest: yes, it is linear.
Tiled transposition in shared memory
The shared-memory transposition algorithm first copies the tile into shared memory, transposes it within shared memory, and writes it back with coalesced accesses to global memory.
When are the accesses aligned? The reads from global memory are aligned; the writes are aligned everywhere except for the last row, which is an unaligned access.
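A sketch of the tiled transposition, assuming a square TILE_DIM x TILE_DIM block and image dimensions that are multiples of TILE_DIM (border handling omitted):

```cuda
#define TILE_DIM 32

// Tiled transpose: both the global read and the global write are coalesced;
// the transposition itself happens through the shared-memory tile.
__global__ void transpose_tiled(const float* in, float* out,
                                int width, int height)
{
    __shared__ float tile[TILE_DIM][TILE_DIM];

    // Coalesced read of the input tile.
    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];

    __syncthreads(); // wait until the whole tile is loaded

    // Coalesced write of the transposed tile (block indices swapped).
    x = blockIdx.y * TILE_DIM + threadIdx.x;
    y = blockIdx.x * TILE_DIM + threadIdx.y;
    out[y * height + x] = tile[threadIdx.x][threadIdx.y];
}
```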
Performance (GB/s on TESLA K40)
Roughly a 2x speed-up between the version that uses shared memory and the one that does not.
About shared memory
How can we close the (remaining) gap with the TESLA's peak GB/s?
DRAM banks
- Bursting: access multiple locations of a line in the DRAM core array (horizontal parallelism)
This lets us use the memory closer to its full capacity.
Bank conflicts in shared memory
- If 2 threads try to perform 2 different loads in the same bank $\to$ Bank conflict
- Every bank can provide 64 bits every cycle
- Only 2 modes:
- Change after 32 bits
- Change after 64 bits
Two different threads requesting the same address is not a problem (the value is broadcast).
Consecutive addresses are not in the same bank.
2-way conflicts:
Two loads from shared memory have to be issued.
A conflict means serialized accesses, which is bad.
Concrete example for shared memory
- Bank size: 4B = 4 uint8
- 32 Banks - Many channels
- Warp size = 32 threads
For the first row of the table: are there threads requesting different addresses within the same bank? No.
How many loads (memory requests) will be performed? A single memory request.
op | Items | Banks used | Conflict free | #Loads |
---|---|---|---|---|
load M[tid.x] | $[0,1,\dots,31]$ | $[0,1,\dots,7]$ | Yes | 1 |
load M[tid.x % 4] | $[0,1,2,3,0,1,2,3,\dots]$ | $[0]$ | Yes | 1 |
load M[tid.x + 1] | $[1,2,3,\dots,32]$ | $[0,1,\dots,8]$ | Yes | 1 |
load M[tid.x * 2] | $[0,2,4,\dots,62]$ | $[0,1,\dots,15]$ | Yes | 1 |
load M[tid.x * 8] | $[0,8,\dots,248]$ | $[0,2,\dots,30]$ | No (conflicts on every bank used) | 2 |
load M[tid.x * 12] | $[0,12,\dots,372]$ | $[0,1,\dots,31]$ | Yes | 1 |
load M[tid.x * 8]: 2-way conflicts, each bank used holds two different words.
load M[tid.x * 12]: conflict free, every bank is used exactly once.
Bank conflicts in Transpose
With 32 banks, only 2 of the 32 banks are used and we get plenty of conflicts.
Reading a column may cause bank conflict
Solution to bank conflicts
With padding (to WARP_SIZE + 1)
How does reading a column work then? With the padding, each row is shifted by one bank. When reading the first column, every access falls in a different bank, which avoids the conflicts.
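A sketch of the padded version; the only change with respect to the tiled transpose sketched earlier is the "+ 1" padding column.

```cuda
#define TILE_DIM 32

// Same tiled transpose as before; the "+ 1" padding column shifts
// consecutive rows of the tile by one bank, so reading a column of the
// tile touches 32 different banks and is conflict free.
__global__ void transpose_tiled_padded(const float* in, float* out,
                                       int width, int height)
{
    __shared__ float tile[TILE_DIM][TILE_DIM + 1];

    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];
    __syncthreads();

    x = blockIdx.y * TILE_DIM + threadIdx.x;
    y = blockIdx.x * TILE_DIM + threadIdx.y;
    out[y * height + x] = tile[threadIdx.x][threadIdx.y];
}
```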
Index mapping function
Performances (GB/s on TESLA K40)
Shared memory (summary)
- Super fast access (almost as fast as registers)
- But limited resources
Occupancy
Stencil Pattern
Use case:
- Dilation/erosion
- Box (Mean) / Convolution Filters
- Bilateral Filter
- Gaussian Filter
- Sobel Filter
There is a dependency between pixels.
Naive Stencil Implem
Local average with a rectangle of radius $r$ (Ignoring border problems for now)
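A sketch of the naive local average (box filter of radius r) with every read going to global memory; as stated, border handling is ignored, and the kernel name is illustrative.

```cuda
// Naive box filter: each thread reads (2r+1)^2 pixels from global memory
// and writes one averaged pixel. No data reuse through shared memory.
__global__ void box_filter_naive(const float* in, float* out,
                                 int width, int height, int r)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < r || y < r || x >= width - r || y >= height - r)
        return; // borders ignored for now

    float sum = 0.f;
    for (int dy = -r; dy <= r; ++dy)
        for (int dx = -r; dx <= r; ++dx)
            sum += in[(y + dy) * width + (x + dx)];

    out[y * width + x] = sum / ((2 * r + 1) * (2 * r + 1));
}
```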
Naive Stencil Performance
Say we have this GPU:
- Peak compute: 1500 GFLOPS and memory bandwidth: 200 GB/s
All threads access global memory
- 1 memory access for 1 FP addition
- Sustaining 1500 GFLOPS would require 1500 G x sizeof(float) = 6 TB/s of data
- But there is only 200 GB/s of memory bandwidth $\to$ 50 GFLOPS (about 3% of the peak)
Problem: too many access to global memory
Solution: tiling, i.e. copy the data from global memory to shared memory.
Handling Border
- Add border to the image to have in-memory access
- Copy tile + border to shared memory
The bad way
Each thread copies one value and border threads are then idle
The good way
A thread may copy several pixels
Stencil pattern with tiling performance
Reduction Pattern
Intuition for reduction pattern
Reduction combines every element in a collection into one element using an associative operator.
Reduction pattern: solution 1
Is this correct? No, we will get concurrent accesses (a data race).
Data race
Several parts of the program try to access the same data with no predefined order.
We need to ensure that each of those read-compute-write sequences are atomic.
Atomics reminder
Atomics
- Read, modify, write in 1 operation
- Cannot be mixed with accesses from other threads
- On global memory and shared memory
- Atomic operations to the same address are serialized
Operations
Reduction Pattern Corrected
Accumulation in global memory
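A sketch of this accumulation in global memory (kernel name and integer data type are assumptions): every thread adds its element to a single global accumulator with atomicAdd.

```cuda
// Every thread atomically adds one element to the same global accumulator.
// Correct, but all the atomics on *result are serialized.
__global__ void reduce_global_atomic(const int* in, int* result, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(result, in[i]);
}
```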
Analysis
Time: 5.619 ms. The result is correct, but there is high contention on the global atomic variable.
The execution is actually sequential!
Global atomics: is this really parallel ?
This version will produce the right result. However, is it really parallel ?
How our global atomic instruction is executed:
- lock memory cell
- read old value
- compute new value
- write new value
- release the memory cell
Memory cell = cache line burst
Our kernel generates a lot of collisions on global memory
Leverage Shared Memory
Atomic operations are much faster on shared memory than on global memory.
Motivation for output privatization
Using shared memory
Reduction pattern V2: Output privatization
With sync
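A sketch of what this privatized version might look like, assuming an integer sum and one shared accumulator per block: threads only contend on shared memory, and a single global atomic is issued per block.

```cuda
// Output privatization: threads of a block contend only on a shared-memory
// accumulator; one global atomic is issued per block.
__global__ void reduce_privatized(const int* in, int* result, int n)
{
    __shared__ int block_sum;
    if (threadIdx.x == 0)
        block_sum = 0;
    __syncthreads(); // make sure block_sum is initialized

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&block_sum, in[i]); // cheap shared-memory atomic

    __syncthreads(); // wait for all local additions
    if (threadIdx.x == 0)
        atomicAdd(result, block_sum); // one global atomic per block
}
```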
Reduction functions and trees
Several fragments can be reduced in parallel.
Complexity in steps and operations
The tree parallel version is:
- work efficient
- not resource efficient
- Average number of threads $(N-1)/\log_2(N) \ll$ peak requirement ($N/2$)
Proof of number of operations
Reduction pattern: tree reduction without atomics
- Use a local sum without atomics
- Map reduction tree to compute units (threads)
- Add to a global atomic once for each block
What is happening ? The (naive) tree version is slower than the locally sequential version
SP starvation
In each iteration, 2 control flow paths will be sequentially traversed for each warp
- Threads that perform addition and threads that do not
- Threads that do not perform addition still consume execution resources
Resource efficient version
All threads are used except at the end. Only active warps remain: at each iteration, half of the warps are freed.
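A sketch of this resource-efficient tree reduction in shared memory (BLOCK_SIZE is illustrative and assumed to equal the block size, a power of two): active threads stay packed at the low indices, so whole warps drop out together as the stride shrinks.

```cuda
#define BLOCK_SIZE 256 // assumed to match blockDim.x, power of two

__global__ void reduce_tree(const int* in, int* result, int n)
{
    __shared__ int sdata[BLOCK_SIZE];

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[threadIdx.x] = (i < n) ? in[i] : 0;
    __syncthreads();

    // Active threads are always 0..stride-1: full warps retire together,
    // so there is no divergence until stride drops below the warp size.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            sdata[threadIdx.x] += sdata[threadIdx.x + stride];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        atomicAdd(result, sdata[0]); // one global atomic per block
}
```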
A quick analysis
For a 1024 thread block
- No divergence on the first 5 steps
- 1024, 512, 256, 128, 64, 32 consecutive threads are active in each step
- All threads in each warp either all active or all inactive
- The final 5 steps will still have divergence
- Can use warp-level optimization then (warp shuffle)
Limit global collision
What happens with very large input arrays? Lots of global atomics.
How to avoid this ? Global array, one cell for each block
- No more locks
- But requires a second level of reduction
More work per thread
- Just fire enough blocks to hide latency
- Sequential reduction, then tree reduction
- “algorithm cascading”
Algorithm cascading
Perform first reduction during the collaborative loading
Warning: kernel launch parameters must be scaled accordingly !
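A sketch of cascading: each thread first reduces several input elements sequentially while loading, then the block finishes with the usual tree reduction. ELEMS_PER_THREAD is an illustrative assumption, not a value from the course.

```cuda
#define BLOCK_SIZE 256
#define ELEMS_PER_THREAD 8 // illustrative value

__global__ void reduce_cascaded(const int* in, int* result, int n)
{
    __shared__ int sdata[BLOCK_SIZE];

    // Sequential reduction during the collaborative load:
    // each thread sums ELEMS_PER_THREAD elements, strided by blockDim.x
    // so that the loads stay coalesced.
    int base = blockIdx.x * blockDim.x * ELEMS_PER_THREAD + threadIdx.x;
    int sum = 0;
    for (int k = 0; k < ELEMS_PER_THREAD; ++k) {
        int i = base + k * blockDim.x;
        if (i < n)
            sum += in[i];
    }
    sdata[threadIdx.x] = sum;
    __syncthreads();

    // Then the usual tree reduction on BLOCK_SIZE partial sums.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            sdata[threadIdx.x] += sdata[threadIdx.x + stride];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        atomicAdd(result, sdata[0]);
}
```

This is where the launch-parameter warning applies: the grid now needs only about n / (BLOCK_SIZE * ELEMS_PER_THREAD) blocks.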
Last optimizations
Loop unrolling
- Unroll tree reduction loop for the last warp (less sync needed)
- Unroll all tree reduction loops (need to know block size)
- Unroll the sequential reduction loop (knowing the work per thread)
Histogram computation
Mandelbrot practice session
During the practice session, you will have to compute the cumulated histogram of the image.
- Compute the histogram $H$
- Count the number of occurrences of each value
- Compute the cumulated histogram $C$
Inefficient, non-coalesced memory access
First sample code
What is the issue? Add a data array.
Parallel algorithm using output privatization
Local histogram
Initialization
Shared memory must be initialized. This can be done with the “comb-like” pattern.
We need synchronization after this stage
Computation
Like previous code, but with local atomics
We need synchronization after this stage
Commit to global memory
int n = blockDim.x * threadIdx.y + threadIdx.x
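Putting the three stages together, a sketch of the privatized histogram kernel; 256 bins and a 1D block are assumed here for brevity (with a 2D block, the comb index is the n = blockDim.x * threadIdx.y + threadIdx.x above).

```cuda
#define N_BINS 256

__global__ void histogram_privatized(const unsigned char* data, int n,
                                     unsigned int* global_hist)
{
    __shared__ unsigned int local_hist[N_BINS];

    // Initialization with the comb-like pattern.
    for (int b = threadIdx.x; b < N_BINS; b += blockDim.x)
        local_hist[b] = 0;
    __syncthreads();

    // Computation, like the global version but with local atomics.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&local_hist[data[i]], 1u);
    __syncthreads();

    // Commit to global memory: one global atomic per bin per block.
    for (int b = threadIdx.x; b < N_BINS; b += blockDim.x)
        atomicAdd(&global_hist[b], local_hist[b]);
}
```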
Summary
Performance boosters:
- Coalesced accesses
- Output privatization
Requirements:
- atomics
- synchronization
Scan pattern
What is a scan ?
Scan computes all partial reductions of a collection
Usage:
- Integration (cumulated histogram)
- Resource allocation (memory to parallel threads, camping spots…)
- Base building block for many algorithms (sorts, strings comparisons)
Performance baselines
Sequential version
Naive parallel version
Have every thread add up all the input elements needed for its output element.
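Sketches of both baselines for an inclusive sum scan (names and types are assumptions):

```cuda
// Sequential baseline (on the host): O(N) additions.
void scan_sequential(const int* in, int* out, int n)
{
    int acc = 0;
    for (int i = 0; i < n; ++i) {
        acc += in[i];
        out[i] = acc;
    }
}

// Naive parallel version: thread i re-adds all elements 0..i,
// i.e. O(N^2) work in total.
__global__ void scan_naive(const int* in, int* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    int acc = 0;
    for (int k = 0; k <= i; ++k)
        acc += in[k];
    out[i] = acc;
}
```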
Scan pattern at the warp or block level
Kogge-Stone
- Number of steps: $\log N$ (good)
- Resource efficiency (good)
- Work efficiency: $\sim N\log N$ (bad)
(a block-level Kogge-Stone sketch is given after this comparison)
Brent-Kung
- Number of steps: $2\log N$
- Resource efficiency: all warps remain active till the end (bad)
- Work efficiency: $2N$ (good)
Sklansky
- Number of steps: $\log N$
- Resource efficiency: good
- Work efficiency: $\frac{N}{2}\log N$ (good)
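A sketch of Kogge-Stone at the block level (inclusive scan; the block size is assumed to equal BLOCK_SIZE and to be a power of two; __syncthreads is used instead of warp-synchronous tricks):

```cuda
#define BLOCK_SIZE 256

// Kogge-Stone inclusive scan of one block: log2(N) steps, but ~N log N
// additions in total (every active thread adds at every step).
__global__ void scan_kogge_stone(const int* in, int* out, int n)
{
    __shared__ int buf[BLOCK_SIZE];

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    buf[threadIdx.x] = (i < n) ? in[i] : 0;
    __syncthreads();

    for (int stride = 1; stride < blockDim.x; stride *= 2) {
        int v = 0;
        if (threadIdx.x >= stride)
            v = buf[threadIdx.x - stride];
        __syncthreads();          // finish all reads before overwriting
        if (threadIdx.x >= stride)
            buf[threadIdx.x] += v;
        __syncthreads();
    }

    if (i < n)
        out[i] = buf[threadIdx.x];
}
```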
Scan Pattern at the Block or Grid Level
The patterns before can be applied:
- At the warp level (no explicit sync needed before Volta)
- At the block level (thread sync)
At the global level: a multi-level kernel approach working through global memory
- Scan then propagate
- Reduce then scan