Link to the HackMD note
IRGPU: Patterns for massively parallel programming
Programming patterns & memory optimizations
The programming patterns
- Map
- Map + Local reduction
- Reduction
- Scan
The IP algorithms
- LUT applications
- Local features extraction
- Histogram
- Integral images
Map pattern
Map pattern overview
- One pixel in, one pixel out
- No dependency between pixels
Map replicates a function over every element of an index set. The computation of each pixel is independent of the others.
out(x,y) = f(in(x,y))
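A minimal sketch of the map pattern as a CUDA kernel; the kernel name and the example function `f` (a simple negative) are illustrative, not from the course material.

```cuda
// Map pattern: each thread produces exactly one output pixel,
// independently of all the other pixels.
__device__ unsigned char f(unsigned char v) { return 255 - v; } // example: negative

__global__ void map_kernel(const unsigned char* in, unsigned char* out,
                           int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        out[y * width + x] = f(in[y * width + x]);
}
```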
What do you think about k's impact on performance? The first kernel accesses threadIdx.x + k, the other threadIdx.x * k.
- On the left, thread i and thread i+1 access consecutive memory addresses: the accesses are contiguous (linear)
- On the right: strided accesses
- Linear sequential access with an offset is fine
- Strided access is bad
Strided access pattern
On kernel 2, a large share of the time is spent on memory handling (poorly organized memory accesses).
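A hedged sketch of the two access patterns being compared (kernel names and signatures are assumptions): the first reads with a constant offset k, the second with a stride k.

```cuda
// Offset access: consecutive threads read consecutive addresses (coalesced).
__global__ void offset_access(const float* in, float* out, int k, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i + k < n)
        out[i] = in[i + k];
}

// Strided access: consecutive threads read addresses k elements apart (uncoalesced).
__global__ void strided_access(const float* in, float* out, int k, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * k < n)
        out[i] = in[i * k];
}
```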
Memory performance
Memory bandwidth
How you think memory works
Reality:
Memory access hierarchy
- L2 memory: intermediate cache
- On-chip low-latency memory:
- Registers
- Shared memory
- L1 (same physical area as the shared memory)
- Either let the processor manage this memory, or manage it ourselves
- In that case we configure the shared memory ourselves
- L1 serves as a cache between L2 and the shared memory
Cached Loads from L1 low-latency
Two things to take into account:
- whether memory accesses are aligned or not
- whether memory accesses are coalesced (vs. strided) or not
2 types of global memory loads: cached or uncached
Aligned vs Misaligned
A load is aligned if the first address of the memory access is a multiple of 32 bytes
- Memory addresses must be type-aligned (aligned on the size of the type), otherwise performance is poor
cudaMalloc returns memory aligned on at least 256 bytes
Coalesced vs uncoalesced
A load is coalesced if the 32 threads of a warp access a contiguous range of memory
Misaligned cached loads from L1
We need a load strategy:
- The 32 threads of a warp each access a 32-bit word = 128 bytes in total
- 128 bytes = the L1 bus width (single load, bus utilization = 100%)
- Permuting the accesses within the warp has no (or very low) overhead
- If the data is not 128-byte aligned, 2 loads are required
Addresses 96-224 are required... but 0-256 are loaded
- If the data is accessed with a stride, even more of the loaded bytes go unused
Can we not just widen the bus? No, the bus width is fixed (hardware).
Loads from global (uncached) memory
Same idea but memory is split in segments of 32 bytes
Coalesced Memory Access (summary)
How memory works for real
- DRAM is organized in a 2D core array
- Each DRAM core array has about 16M bits
Example
- A 4x4 memory cell
- With 4 bits pin interface width
DRAM Burst
Memory is slow
- DDR = 1/2 interface speed
- DDR2 = 1/4 interface speed
- DDR3 = 1/8 interface speed
Solution: Bursting
Load N x interface width of the same row
Instead of getting a single word at 1/4 of the interface speed, a burst also returns the following words of the same row.
Better, but not enough to saturate the memory bus
Summary
- Use coalesced (contiguous and aligned) accesses to memory
How to make coalesced loads with 2D arrays ?
This is what the pitch is for.
Pitch: padded row size, chosen so that the start of every row falls on a multiple of 32.
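A sketch of how the pitch is obtained with the CUDA runtime (the small helper function is only for illustration):

```cuda
#include <cuda_runtime.h>

// Allocate a 2D float image whose rows are padded so that every row start
// keeps the required alignment. cudaMallocPitch returns the padded row
// size (the pitch) in bytes.
float* alloc_image(int width, int height, size_t* pitch)
{
    float* d_img = nullptr;
    cudaMallocPitch((void**)&d_img, pitch, width * sizeof(float), height);
    return d_img;
}

// Inside a kernel, row y then starts at (char*)d_img + y * pitch:
//   const float* row = (const float*)((const char*)d_img + y * pitch);
//   float v = row[x];
```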
Using shared memory
Transposition
Where are the non-coalesced accesses? Within a warp, the x values are linear (0-31). Here, what is linear along x is the read from in. The 32 threads then write 32 non-contiguous elements of a[x][y].
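A sketch of the naive transpose, illustrating where the non-coalesced accesses come from (kernel name is an assumption):

```cuda
// Naive transpose: the read from `in` is coalesced (x varies fastest within
// a warp), but the 32 threads of a warp write 32 elements that are `height`
// floats apart in `out`: a strided, non-coalesced write.
__global__ void transpose_naive(const float* in, float* out,
                                int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        out[x * height + y] = in[y * width + x];
}
```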
Tiling and memory privatization
We split the work into sub-blocks (tiles).
This works well on images!
For each block:
- Read the tile from global to private block memory
- Process the block (using shared memory)
- Write the tile from the private (shared) block memory back to global memory
Collaborative loading and writing when BLOCKDIM = TILEDIM
- All threads load one or more data
- Access must be coalesced
- Use barrier synch to make sure that all threads are ready to start the phase
Is writing into the tile linear? For each row y, x varies fastest: yes, it is linear.
Tiled transposition in shared memory
The shared-memory transposition algorithm first copies the tile into shared memory, transposes it within shared memory, and writes it back with coalesced accesses to global memory.
When are the accesses aligned? The reads from global memory are aligned; the writes are aligned everywhere except for the last row, which is an unaligned access.
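A sketch of the tiled transposition, assuming a square TILE_DIM x TILE_DIM block and image dimensions that are multiples of TILE_DIM (border handling omitted):

```cuda
#define TILE_DIM 32

// Tiled transpose: both the global read and the global write are coalesced;
// the transposition itself happens through the shared-memory tile.
__global__ void transpose_tiled(const float* in, float* out,
                                int width, int height)
{
    __shared__ float tile[TILE_DIM][TILE_DIM];

    // Coalesced read of the input tile.
    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];

    __syncthreads(); // wait until the whole tile is loaded

    // Coalesced write of the transposed tile (block indices swapped).
    x = blockIdx.y * TILE_DIM + threadIdx.x;
    y = blockIdx.x * TILE_DIM + threadIdx.y;
    out[y * height + x] = tile[threadIdx.x][threadIdx.y];
}
```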
Performance (GB/s on TESLA K40)
Roughly a 2x speed-up between the version that uses shared memory and the one that does not.
About shared memory
How can we close the (remaining) gap with the TESLA's peak GB/s?
DRAM banks
- Bursting: access multiple locations of a line in the DRAM core array (horizontal parallelism)
This lets us use the memory closer to its full capacity.
Bank conflicts in shared memory
- If 2 threads try to perform 2 different loads in the same bank $\to$ Bank conflict
- Every bank can provide 64 bits every cycle
- Only 2 modes:
- Change after 32 bits
- Change after 64 bits
Two different threads requesting the same address is not a problem (the value is broadcast).
Consecutive addresses are not in the same bank.
2-way conflicts:
Two loads from shared memory have to be issued.
A conflict means serialized accesses, which is bad.
Concrete example for shared memory
- Bank size: 4B = 4 uint8
- 32 Banks - Many channels
- Warp size = 32 threads
For the first row of the table: are there threads requesting different addresses within the same bank? No.
How many loads (memory requests) will be performed? A single memory request.
op | Items | Banks used | Conflict free | #Loads |
---|---|---|---|---|
load M[tid.x] | $[0,1,\dots,31]$ | $[0,1,\dots,7]$ | Yes | 1 |
load M[tid.x % 4] | $[0,1,2,3,0,1,2,3,\dots]$ | $[0]$ | Yes | 1 |
load M[tid.x + 1] | $[1,2,3,\dots,32]$ | $[0,1,\dots,8]$ | Yes | 1 |
load M[tid.x * 2] | $[0,2,4,\dots,62]$ | $[0,1,\dots,15]$ | Yes | 1 |
load M[tid.x * 8] | $[0,8,\dots,248]$ | $[0,2,\dots,30]$ | No (conflicts on every bank used) | 2 |
load M[tid.x * 12] | $[0,12,\dots,372]$ | $[0,1,\dots,31]$ | Yes | 1 |
load M[tid.x * 8]: 2-way conflicts, each bank used holds two different words.
load M[tid.x * 12]: conflict free, every bank is used exactly once.
Bank conflicts in Transpose
With 32 banks, only 2 of the 32 banks are used and we get plenty of conflicts.
Reading a column may cause bank conflict
Solution to bank conflicts
With padding (to WARP_SIZE + 1)
How does reading a column work then? With the padding, each row is shifted by one bank. When reading the first column, every access falls in a different bank, which avoids the conflicts.
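A sketch of the padded version; the only change with respect to the tiled transpose sketched earlier is the "+ 1" padding column.

```cuda
#define TILE_DIM 32

// Same tiled transpose as before; the "+ 1" padding column shifts
// consecutive rows of the tile by one bank, so reading a column of the
// tile touches 32 different banks and is conflict free.
__global__ void transpose_tiled_padded(const float* in, float* out,
                                       int width, int height)
{
    __shared__ float tile[TILE_DIM][TILE_DIM + 1];

    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];
    __syncthreads();

    x = blockIdx.y * TILE_DIM + threadIdx.x;
    y = blockIdx.x * TILE_DIM + threadIdx.y;
    out[y * height + x] = tile[threadIdx.x][threadIdx.y];
}
```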
Index mapping function
Performances (GB/s on TESLA K40)
Shared memory (summary)
- Super fast access (almost as fast as registers)
- But limited resources
Occupancy
Stencil Pattern
Use case:
- Dilation/erosion
- Box (Mean) / Convolution Filters
- Bilateral Filter
- Gaussian Filter
- Sobel Filter
There is a dependency between pixels.
Naive Stencil Implem
Local average with a rectangle of radius $r$ (Ignoring border problems for now)
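A sketch of the naive local average (box filter of radius r) with every read going to global memory; as stated, border handling is ignored, and the kernel name is illustrative.

```cuda
// Naive box filter: each thread reads (2r+1)^2 pixels from global memory
// and writes one averaged pixel. No data reuse through shared memory.
__global__ void box_filter_naive(const float* in, float* out,
                                 int width, int height, int r)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < r || y < r || x >= width - r || y >= height - r)
        return; // borders ignored for now

    float sum = 0.f;
    for (int dy = -r; dy <= r; ++dy)
        for (int dx = -r; dx <= r; ++dx)
            sum += in[(y + dy) * width + (x + dx)];

    out[y * width + x] = sum / ((2 * r + 1) * (2 * r + 1));
}
```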
Naive Stencil Performance
Say we have this GPU:
- Peak compute: 1500 GFLOPS and memory bandwidth: 200 GB/s
All threads access global memory
- 1 memory access for 1 FP addition
- Sustaining 1500 GFLOPS would require 1500 G x sizeof(float) = 6 TB/s of data
- But there is only 200 GB/s of memory bandwidth $\to$ 50 GFLOPS (about 3% of the peak)
Problem: too many access to global memory
Solution: tiling, i.e. copy the data from global memory to shared memory.
Handling Border
- Add border to the image to have in-memory access
- Copy tile + border to shared memory
The bad way
Each thread copies one value and border threads are then idle
The good way
A thread may copy several pixels
Stencil pattern with tiling performance
Reduction Pattern
Intuition for reduction pattern
Reduction combines every element in a collection into one element using an associative operator.
Reduction pattern: solution 1
Is this correct? No, we will get concurrent accesses (a data race).
Data race
Several parts of the program try to access the same data with no predefined order.
We need to ensure that each of those read-compute-write sequences are atomic.
Atomics reminder
Atomics
- Read, modify, write in 1 operation
- Cannot be mixed with accesses from other threads
- On global memory and shared memory
- Atomic operations to the same address are serialized
Operations
Reduction Pattern Corrected
Accumulation in global memory
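A sketch of this accumulation in global memory (kernel name and integer data type are assumptions): every thread adds its element to a single global accumulator with atomicAdd.

```cuda
// Every thread atomically adds one element to the same global accumulator.
// Correct, but all the atomics on *result are serialized.
__global__ void reduce_global_atomic(const int* in, int* result, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(result, in[i]);
}
```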
Analysis
Time: 5.619 ms. The result is correct, but there is high contention on the global atomic variable.
The execution is actually sequential!
Global atomics: is this really parallel ?
This version will produce the right result. However, is it really parallel ?
How our global atomic instruction is executed:
- lock memory cell
- read old value
- compute new value
- write new value
- release the memory cell
Memory cell = cache line burst
Our kernel generates a lot of collisions on global memory
Leverage Shared Memory
Atomic operations are much faster on shared memory than on global memory.
Motivation for output privatization
Using shared memory
Reduction pattern V2: Output privatization
With sync
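A sketch of what this privatized version might look like, assuming an integer sum and one shared accumulator per block: threads only contend on shared memory, and a single global atomic is issued per block.

```cuda
// Output privatization: threads of a block contend only on a shared-memory
// accumulator; one global atomic is issued per block.
__global__ void reduce_privatized(const int* in, int* result, int n)
{
    __shared__ int block_sum;
    if (threadIdx.x == 0)
        block_sum = 0;
    __syncthreads(); // make sure block_sum is initialized

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&block_sum, in[i]); // cheap shared-memory atomic

    __syncthreads(); // wait for all local additions
    if (threadIdx.x == 0)
        atomicAdd(result, block_sum); // one global atomic per block
}
```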
Reduction functions and trees
Several fragments can be reduced in parallel.
Complexity in steps and operations
The tree parallel version is:
- work efficient
- not resource efficient
- Average number of threads $(N-1)/\log_2(N) \ll$ peak requirement ($N/2$)
Proof of number of operations
Reduction pattern: tree reduction without atomics
- Use a local sum without atomics
- Map reduction tree to compute units (threads)
- Add to a global atomic once for each block
What is happening ? The (naive) tree version is slower than the locally sequential version
SP starvation
In each iteration, 2 control flow paths will be sequentially traversed for each warp
- Threads that perform addition and threads that do not
- Threads that do not perform addition still consume execution resources
Resource efficient version
All threads are used except at the end. Only active warps remain: at each iteration, half of the warps are freed.
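A sketch of this resource-efficient tree reduction in shared memory (BLOCK_SIZE is illustrative and assumed to equal the block size, a power of two): active threads stay packed at the low indices, so whole warps drop out together as the stride shrinks.

```cuda
#define BLOCK_SIZE 256 // assumed to match blockDim.x, power of two

__global__ void reduce_tree(const int* in, int* result, int n)
{
    __shared__ int sdata[BLOCK_SIZE];

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[threadIdx.x] = (i < n) ? in[i] : 0;
    __syncthreads();

    // Active threads are always 0..stride-1: full warps retire together,
    // so there is no divergence until stride drops below the warp size.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            sdata[threadIdx.x] += sdata[threadIdx.x + stride];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        atomicAdd(result, sdata[0]); // one global atomic per block
}
```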
A quick analysis
For a 1024 thread block
- No divergence on the first 5 steps
- 1024, 512, 256, 128, 64, 32 consecutive threads are active in each step
- All threads in each warp either all active or all inactive
- The final 5 steps will still have divergence
- Can use warp-level optimization then (warp shuffle)
Limit global collision
What happens with very large input arrays? Lots of global atomics.
How to avoid this ? Global array, one cell for each block
- No more locks
- But requires a second level of reduction
More work per thread
- Just fire enough blocks to hide latency
- Sequential reduction, then tree reduction
- “algorithm cascading”
Algorithm cascading
Perform first reduction during the collaborative loading
Warning: kernel launch parameters must be scaled accordingly !
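A sketch of cascading: each thread first reduces several input elements sequentially while loading, then the block finishes with the usual tree reduction. ELEMS_PER_THREAD is an illustrative assumption, not a value from the course.

```cuda
#define BLOCK_SIZE 256
#define ELEMS_PER_THREAD 8 // illustrative value

__global__ void reduce_cascaded(const int* in, int* result, int n)
{
    __shared__ int sdata[BLOCK_SIZE];

    // Sequential reduction during the collaborative load:
    // each thread sums ELEMS_PER_THREAD elements, strided by blockDim.x
    // so that the loads stay coalesced.
    int base = blockIdx.x * blockDim.x * ELEMS_PER_THREAD + threadIdx.x;
    int sum = 0;
    for (int k = 0; k < ELEMS_PER_THREAD; ++k) {
        int i = base + k * blockDim.x;
        if (i < n)
            sum += in[i];
    }
    sdata[threadIdx.x] = sum;
    __syncthreads();

    // Then the usual tree reduction on BLOCK_SIZE partial sums.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            sdata[threadIdx.x] += sdata[threadIdx.x + stride];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        atomicAdd(result, sdata[0]);
}
```

This is where the launch-parameter warning applies: the grid now needs only about n / (BLOCK_SIZE * ELEMS_PER_THREAD) blocks.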
Last optimizations
Loop unrolling
- Unroll tree reduction loop for the last warp (less sync needed)
- Unroll all tree reduction loops (need to know block size)
- Unroll the sequential reduction loop (knowing the work per thread)
Histogram computation
Mandelbrot practice session
During the practice session, you will have to compute the cumulated histogram of the image.
- Compute the histogram $H$
- Count the number of occurrences of each value
- Compute the cumulated histogram $C$
Inefficient, non-coalesced memory access
First sample code
What is the issue? Add a data array.
Parallel algorithm using output privatization
Local histogram
Initialization
Shared memory must be initialized. This can be done with the “comb-like” pattern.
We need synchronization after this stage
Computation
Like previous code, but with local atomics
We need synchronization after this stage
Commit to global memory
int n = blockDim.x * threadIdx.y + threadIdx.x
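Putting the three stages together, a sketch of the privatized histogram kernel; 256 bins and a 1D block are assumed here for brevity (with a 2D block, the comb index is the n = blockDim.x * threadIdx.y + threadIdx.x above).

```cuda
#define N_BINS 256

__global__ void histogram_privatized(const unsigned char* data, int n,
                                     unsigned int* global_hist)
{
    __shared__ unsigned int local_hist[N_BINS];

    // Initialization with the comb-like pattern.
    for (int b = threadIdx.x; b < N_BINS; b += blockDim.x)
        local_hist[b] = 0;
    __syncthreads();

    // Computation, like the global version but with local atomics.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&local_hist[data[i]], 1u);
    __syncthreads();

    // Commit to global memory: one global atomic per bin per block.
    for (int b = threadIdx.x; b < N_BINS; b += blockDim.x)
        atomicAdd(&global_hist[b], local_hist[b]);
}
```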
Summary
Performance boosters:
- Coalesced accesses
- Output privatization
Requirements:
- atomics
- synchronization
Scan pattern
What is a scan ?
Scan computes all partial reductions of a collection
Usage:
- Integration (cumulated histogram)
- Resource allocation (memory to parallel threads, camping spots…)
- Base building block for many algorithms (sorts, strings comparisons)
Performance baselines
Sequential version
Naive parallel version
Have every thread add up all the input elements needed for its output element.
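Sketches of both baselines for an inclusive sum scan (names and types are assumptions):

```cuda
// Sequential baseline (on the host): O(N) additions.
void scan_sequential(const int* in, int* out, int n)
{
    int acc = 0;
    for (int i = 0; i < n; ++i) {
        acc += in[i];
        out[i] = acc;
    }
}

// Naive parallel version: thread i re-adds all elements 0..i,
// i.e. O(N^2) work in total.
__global__ void scan_naive(const int* in, int* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    int acc = 0;
    for (int k = 0; k <= i; ++k)
        acc += in[k];
    out[i] = acc;
}
```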
Scan pattern at the warp or block level
Kogge-Stone
- Number of steps: $\log N$ (good)
- Resource efficiency (good)
- Work efficiency: $\sim N\log N$ (bad)
(a block-level Kogge-Stone sketch is given after this comparison)
Brent-Kung
- Number of steps: $2\log N$
- Resource efficiency: all warps remain active till the end (bad)
- Work efficiency: $2N$ (good)
Sklansky
- Number of steps: $\log N$
- Resource efficiency: good
- Work efficiency: $\frac{N}{2}\log N$ (good)
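A sketch of Kogge-Stone at the block level (inclusive scan; the block size is assumed to equal BLOCK_SIZE and to be a power of two; __syncthreads is used instead of warp-synchronous tricks):

```cuda
#define BLOCK_SIZE 256

// Kogge-Stone inclusive scan of one block: log2(N) steps, but ~N log N
// additions in total (every active thread adds at every step).
__global__ void scan_kogge_stone(const int* in, int* out, int n)
{
    __shared__ int buf[BLOCK_SIZE];

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    buf[threadIdx.x] = (i < n) ? in[i] : 0;
    __syncthreads();

    for (int stride = 1; stride < blockDim.x; stride *= 2) {
        int v = 0;
        if (threadIdx.x >= stride)
            v = buf[threadIdx.x - stride];
        __syncthreads();          // finish all reads before overwriting
        if (threadIdx.x >= stride)
            buf[threadIdx.x] += v;
        __syncthreads();
    }

    if (i < n)
        out[i] = buf[threadIdx.x];
}
```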
Scan Pattern at the Block or Grid Level
The patterns before can be applied:
- At the warp level (no explicit sync needed before Volta)
- At the block level (thread sync)
At the global level: a multi-level kernel approach working through global memory
- Scan then propagate
- Reduce then scan