
IRGPU: Patterns for massively parallel programming

Link to the HackMD note


Programming patterns & memory optimizations

The programming patterns

  • Map
  • Map + Local reduction
  • Reduction
  • Scan

The image processing (IP) algorithms

  • LUT applications
  • Local features extraction
  • Histogram
  • Integral images

Map pattern

Map pattern overview

  • One pixel in, one pixel out
  • No dependency between pixels

out(x,y) = f(in(x,y))
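As a minimal sketch of the pattern (the kernel name, the row-major layout and the example f are assumptions, not from the course), a CUDA map kernel looks like this:

```cpp
// Map pattern sketch: one thread per pixel, no dependency between pixels.
// `lut_f` is just an example of f (here: invert the value).
__device__ unsigned char lut_f(unsigned char v) { return 255 - v; }

__global__ void map_kernel(const unsigned char* in, unsigned char* out,
                           int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        out[y * width + x] = lut_f(in[y * width + x]);  // out(x,y) = f(in(x,y))
}
```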

What do you think about k’s impact on performance? The first kernel indexes with threadIdx.x + k, the other with threadIdx.x * k.

  • On the left, thread i and thread i+1 access consecutive memory addresses: the access is contiguous (linear)
  • On the right: strided access (stepped)

Strided access pattern

In kernel number 2, a large part of the time is spent on memory handling (badly formed memory accesses).
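A hypothetical reconstruction of the two kernels compared above (names and types are assumptions); only the indexing differs:

```cpp
// Kernel 1: consecutive threads read consecutive addresses (coalesced).
__global__ void copy_offset(const float* in, float* out, int k)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i + k];   // offset access: still contiguous within a warp
}

// Kernel 2: consecutive threads read addresses k elements apart (strided).
__global__ void copy_stride(const float* in, float* out, int k)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i * k];   // strided access: bus utilization drops as k grows
}
```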

Memory performance

Memory bandwidth

How you think memory works

Reality:

Memory access hierarchy

  • L2 memory: intermediate cache
  • On-chip low-latency memory:
    • Registers
    • Shared memory
    • L1 (same area as the shared memory)
      • Either the processor manages this memory for us, or we manage it ourselves
      • In the latter case we configure the shared memory ourselves
      • Serves as a cache between L2 and the shared memory

Cached Loads from L1 low-latency

2 things to take into account:

  1. aligned vs. misaligned memory accesses
  2. coalesced vs. strided memory accesses

2 types of global memory loads: cached or uncached

Aligned vs Misaligned

A load is aligned if the first address of the memory access is a multiple of 32 bytes.

  • Memory addresses must be type-aligned (aligned on the size of the accessed type)
  • Otherwise poor performance
  • cudaMalloc = alignment on at least 256 bytes

Coalesced vs uncoalesced

A load is coalesced if the warp accesses a contiguous chunk of memory (no stride between threads).

Misaligned cached loads from L1

We need a load strategy:

  • The 32 threads of a warp each access a 32-bit word = 128 bytes in total
  • 128 bytes = the L1 bus width (a single load, bus utilization = 100%)
  • Access permutations have no (or very low) overhead

  • If the data is not 128-byte aligned, 2 loads are required

Addresses 96-224 are requested, but 0-256 are loaded.

  • If the data is accessed with a stride

Can’t we just widen the bus? No, the bus width is fixed (hardware).

Loads from global (uncached) memory

Same idea but memory is split in segments of 32 bytes

Coalesced Memory Access (summary)

How memory works for real

  • DRAM is organised in 2D Core array
  • Each DRAM core array has about 16M bits

Example

  • A 4x4 memory cell array
  • With a 4-bit pin interface width

DRAM Burst

Memory is slow.

  • DDR = 1/2 interface speed
  • DDR2 = 1/4 interface speed
  • DDR3 = 1/8 interface speed

Load N x interface width of the same row

Instead of getting 1/4, we send back 3 × 1/4.

Better, but not enough to saturate the memory bus

Summary

  • Use coalesced (contiguous and aligned) accesses to memory

How to make coalesced loads with 2D arrays ?

Pitch: the row size is chosen so that the start of each row falls on a multiple of 32.
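A small sketch with cudaMallocPitch, which picks that padded row size for us (the helper name and the float image are assumptions):

```cpp
#include <cuda_runtime.h>

// Allocate a 2D float image with padded rows so each row starts on an
// aligned address; cudaMallocPitch returns the actual row size in bytes.
float* alloc_image(int width, int height, size_t* pitch)
{
    float* d_img = nullptr;
    cudaMallocPitch(&d_img, pitch, width * sizeof(float), height);
    return d_img;
}

// In a kernel, rows must be addressed through the pitch (given in bytes):
__global__ void touch(float* img, size_t pitch, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        float* row = (float*)((char*)img + y * pitch);
        row[x] += 1.0f;
    }
}
```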

Using shared memory

Transposition

Where are the non-coalesced accesses? Within a warp, the x values are linear (0 to 31). Here, what is linear along x is the read from in: the 32 threads will write 32 non-contiguous elements.
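A minimal sketch of this naive transpose (layout and names are assumptions): the read is coalesced, the write is not:

```cpp
// Naive transpose: reads from `in` are coalesced along x, but the 32 threads
// of a warp write 32 elements that are `height` apart in `out`.
__global__ void transpose_naive(const float* in, float* out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        out[x * height + y] = in[y * width + x];   // non-coalesced write
}
```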

Tiling and memory privatization

It works very well on images!

For each block:

  • Read the tile from global memory to private (shared) block memory
  • Process the block (using shared memory)
  • Write the tile from the private (shared) block memory back to global memory

Collaborative loading and writing (here BLOCKDIM = TILEDIM):

  • All threads load one or more data
  • Access must be coalesced
  • Use barrier synch to make sure that all threads are ready to start the phase

Is writing into the tile linear? For each row y, x varies fastest: yes, it is linear.

Tiled transposition in shared memory

The algorithm for transposing with shared memory starts by copying the tile into shared memory, transposes it within shared memory, and then writes it back to global memory with coalesced accesses.

When do we make aligned accesses in shared memory? We read from global memory in an aligned way; writes are aligned everywhere except the last row, where the access is misaligned (see the sketch below).
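A sketch of that tiled transpose, assuming TILE_DIM = 32 and square tiles (the bank-conflict padding comes later):

```cpp
#define TILE_DIM 32

// Tiled transpose: coalesced read of a tile into shared memory, then
// coalesced write of the transposed tile to global memory.
__global__ void transpose_tiled(const float* in, float* out, int width, int height)
{
    __shared__ float tile[TILE_DIM][TILE_DIM];

    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];     // coalesced read

    __syncthreads();

    // Swap the block coordinates so that writes are contiguous along x.
    int tx = blockIdx.y * TILE_DIM + threadIdx.x;
    int ty = blockIdx.x * TILE_DIM + threadIdx.y;
    if (tx < height && ty < width)
        out[ty * height + tx] = tile[threadIdx.x][threadIdx.y]; // coalesced write
}
```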

Performance (GB/s on TESLA K40)

Speedup of about 2× between the version that uses shared memory and the one that does not.

About shared memory

How can we close the remaining gap with the TESLA’s peak bandwidth (GB/s)?

DRAM banks

  • Bursting: access multiple locations of a line in the DRAM core array (horizontal parallelism)

Bank conflicts in shared memory

  • If 2 threads try to perform 2 different loads in the same bank $\to$ Bank conflict
  • Every bank can provide 64 bits every cycle
  • Only 2 modes:
    1. Change after 32 bits
    2. Change after 64 bits

Consecutive addresses are not in the same bank.

2-way conflicts:

Two loads to shared memory are performed (serialized).

Concrete example for shared memory

  • Bank size: 4B = 4 uint8
  • 32 Banks - Many channels
  • Warp size = 32 threads

For the first case of the table: are there threads requesting different addresses within the same bank? No.

How many loads (memory requests) will be performed? A single memory request.

| op | Items | Bank used | Conflict free | #Loads |
|----|-------|-----------|---------------|--------|
| load M[tid.x] | $[0,1,\dots,31]$ | $[0,1,\dots,7]$ | Yes | 1 |
| load M[tid.x % 4] | $[0,1,2,3,0,1,2,3,\dots]$ | $[0]$ | Yes | 1 |
| load M[tid.x + 1] | $[1,2,3,\dots,32]$ | $[0,1,\dots,8]$ | Yes | 1 |
| load M[tid.x * 2] | $[0,2,4,\dots,62]$ | $[0,1,\dots,15]$ | Yes | 1 |
| load M[tid.x * 8] | $[0,8,\dots,248]$ | $[0,2,\dots,30]$ | No (conflicts on every bank used) | 2 |
| load M[tid.x * 12] | $[0,12,\dots,372]$ | $[0,1,\dots,31]$ | Yes | 1 |

load M[tid.x * 8]:

load M[tid.x * 12]:

Bank conflicts in Transpose

With 32 banks, we only use 2 of the 32 banks and we get lots of conflicts.

Solution to bank conflicts

With padding (to WARP_SIZE + 1)

How does reading the columns work? With the padding, each row is shifted by one (we shift by one bank). When reading the first column, we always land in a different bank, which avoids the conflicts.
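The usual way to express this padding, assuming a TILE_DIM x TILE_DIM float tile:

```cpp
// Padding the shared tile by one column shifts each row by one bank, so
// reading a column of the tile hits 32 different banks: no conflicts.
__shared__ float tile[TILE_DIM][TILE_DIM + 1];
```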

Index mapping function

Performances (GB/s on TESLA K40)

Shared memory (summary)

  • Super fast access (almost as fast as registers)
  • But limited resources

Occupancy

Stencil Pattern

Use case:

  • Dilation/erosion
  • Box (Mean) / Convolution Filters
  • Bilateral Filter
  • Gaussian Filter
  • Sobel Filter

Naive Stencil Implementation

Local average with a rectangle of radius $r$ (Ignoring border problems for now)
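A minimal sketch of the naive version (names are assumptions; border pixels are simply skipped, since the slide ignores borders):

```cpp
// Naive local average: every thread reads (2r+1)^2 pixels straight from
// global memory.
__global__ void box_filter_naive(const float* in, float* out,
                                 int width, int height, int r)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < r || y < r || x >= width - r || y >= height - r)
        return;                                   // ignore borders for now

    float sum = 0.f;
    for (int dy = -r; dy <= r; ++dy)
        for (int dx = -r; dx <= r; ++dx)
            sum += in[(y + dy) * width + (x + dx)];
    out[y * width + x] = sum / ((2 * r + 1) * (2 * r + 1));
}
```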

Naive Stencil Performance

Say we have this GPU:

  • Peak compute: 1500 GFLOPS and memory bandwidth: 200 GB/s

All threads access global memory

  • 1 memory access for 1 FP addition
  • Requires 1500 × sizeof(float) = 6 TB/s of data
  • But only 200 GB/s of memory bandwidth $\to$ 50 GFLOPS (about 3% of the peak)

Handling Border

  1. Add a border to the image so that every access stays inside allocated memory
  2. Copy tile + border to shared memory

The bad way

Each thread copies one value and border threads are then idle

The good way

A thread may copy several pixels
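A sketch of this collaborative load, assuming a BLOCK_DIM x BLOCK_DIM block, dynamic shared memory, and clamping at the image borders:

```cpp
#define BLOCK_DIM 16

// The BLOCK_DIM x BLOCK_DIM threads cooperatively load a
// (BLOCK_DIM + 2r) x (BLOCK_DIM + 2r) tile (tile + border), so some threads
// copy more than one pixel ("comb-like" loop over the tile).
__global__ void stencil_tiled(const float* in, float* out,
                              int width, int height, int r)
{
    extern __shared__ float tile[];          // (BLOCK_DIM + 2r)^2 floats
    int tile_dim = BLOCK_DIM + 2 * r;

    int x0 = blockIdx.x * BLOCK_DIM - r;     // top-left corner of the tile
    int y0 = blockIdx.y * BLOCK_DIM - r;

    for (int i = threadIdx.y * blockDim.x + threadIdx.x;
         i < tile_dim * tile_dim;
         i += blockDim.x * blockDim.y)
    {
        int tx = i % tile_dim, ty = i / tile_dim;
        int gx = min(max(x0 + tx, 0), width - 1);    // clamp at the borders
        int gy = min(max(y0 + ty, 0), height - 1);
        tile[ty * tile_dim + tx] = in[gy * width + gx];
    }
    __syncthreads();

    // ... process the pixel using
    // tile[(threadIdx.y + r) * tile_dim + (threadIdx.x + r)] and its neighbors
}
```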

Stencil pattern with tiling performance

Reduction Pattern

Intuition for reduction pattern

Reduction combines every element of a collection into one element using an associative operator.

Reduction pattern: solution 1

Is this correct? No, we will have concurrent accesses (a data race).

Data race

We need to ensure that each of those read-compute-write sequences are atomic.

Atomics reminder

Atomics

  • Read, modify, write in 1 operation
  • Cannot be mixed with accesses from other threads
  • On global memory and shared memory
  • Atomic operations to the same address are serialized

Operations
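The list from the slide is not reproduced here; as a reminder, CUDA provides atomicAdd, atomicSub, atomicMin, atomicMax, atomicExch, atomicCAS, and a few others. A tiny illustration (names are assumptions):

```cpp
// A few of the atomic operations provided by CUDA (all read-modify-write):
__global__ void atomics_demo(int* counter, int* maximum, const int* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        atomicAdd(counter, 1);        // serialized increments, always correct
        atomicMax(maximum, data[i]);  // global maximum without a data race
    }
}
```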

Reduction Pattern Corrected

Accumulation in global memory
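A minimal sketch of this corrected version, with a single accumulator in global memory (names and the float type are assumptions):

```cpp
// Every thread adds its element to one accumulator in global memory.
// Correct, but every addition is serialized on *result.
__global__ void reduce_global_atomic(const float* in, float* result, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(result, in[i]);     // high contention on the single variable
}
```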

Analysis

Time: 5.619 ms. Correct result, but high contention on the global atomic variable.

Global atomics: is this really parallel ?

This version will produce the right result. However, is it really parallel ?

How our global atomic instruction is executed:

  1. lock memory cell
  2. read old value
  3. compute new value
  4. write new value
  5. release the memory cell

The locked memory cell is a whole cache line (burst).

Our kernel generates a lot of collisions on global memory

Leverage Shared Memory

Atomic operations are much faster in shared memory than in global memory.

Motivation for output privatization

Using shared memory

Reduction pattern V2: Output privatization

With sync
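A minimal sketch of this privatized version (block-level accumulator in shared memory, then a single global atomic per block; names are assumptions):

```cpp
// Output privatization: each block accumulates into a private sum in shared
// memory (local atomics), then commits once to the global accumulator.
__global__ void reduce_privatized(const float* in, float* result, int n)
{
    __shared__ float block_sum;
    if (threadIdx.x == 0)
        block_sum = 0.f;
    __syncthreads();                    // "with sync": accumulator initialized

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&block_sum, in[i]);   // contention limited to the block
    __syncthreads();

    if (threadIdx.x == 0)
        atomicAdd(result, block_sum);   // one global atomic per block
}
```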

Reduction functions and trees

Several fragments can be reduced in parallel.

Complexity in steps and operations

The tree parallel version is:

  • work efficient
  • not resource efficient
    • Average number of threads $(N-1)/\log_2(N) \ll$ peak requirement ($N/2$)

Proof of number of operations: each addition combines two partial results into one, so reducing $N$ values takes exactly $N-1$ operations.

Reduction pattern: tree reduction without atomics

  • Use a local sum without atomics
    • Map reduction tree to compute units (threads)
  • Add to a global atomic once for each block

What is happening ? The (naive) tree version is slower than the locally sequential version

SP starvation

In each iteration, 2 control flow paths will be sequentially traversed for each warp

  • Threads that perform addition and threads that do not
  • Threads that do not perform addition still consume execution resources

Resource efficient version

All the threads are used except near the end. We only have fully active warps; at each iteration half of the warps are freed.
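A sketch of this resource-efficient version with sequential addressing (names and the dynamic shared memory are assumptions):

```cpp
// Active threads stay packed at the bottom of the block, so whole warps
// become idle together and divergence only appears in the last steps.
__global__ void reduce_tree(const float* in, float* result, int n)
{
    extern __shared__ float sdata[];              // blockDim.x floats
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    sdata[tid] = (i < n) ? in[i] : 0.f;           // collaborative load
    __syncthreads();

    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            sdata[tid] += sdata[tid + stride];    // half the threads drop out each step
        __syncthreads();
    }

    if (tid == 0)
        atomicAdd(result, sdata[0]);              // one global atomic per block
}
```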

A quick analysis

For a 1024 thread block

  • No divergence on the first 5 steps
    • 1024, 512, 256, 128, 64, 32 consecutive threads are active in each step
    • All threads in each warp either all active or all inactive
  • The final 5 steps will still have divergence
    • Can use warp-level optimizations there (warp shuffle)

Limit global collision

What happens with very large input arrays? Lots of global atomics

How to avoid this ? Global array, one cell for each block

  • No more locks
  • But requires a second level of reduction

More work per thread

  • Just fire enough blocks to hide latency
  • Sequential reduction, then tree reduction
  • “algorithm cascading”

Algorithm cascading

Perform first reduction during the collaborative loading
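A sketch of algorithm cascading (grid-stride sequential reduction during the load, then the tree reduction from above; names are assumptions):

```cpp
// Each thread first reduces several elements sequentially while loading,
// then the block finishes with the tree reduction.
__global__ void reduce_cascaded(const float* in, float* result, int n)
{
    extern __shared__ float sdata[];
    int tid = threadIdx.x;

    float local = 0.f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += gridDim.x * blockDim.x)             // fire just enough blocks
        local += in[i];                           // sequential reduction

    sdata[tid] = local;
    __syncthreads();

    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            sdata[tid] += sdata[tid + stride];    // tree reduction
        __syncthreads();
    }
    if (tid == 0)
        atomicAdd(result, sdata[0]);
}
```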

Last optimizations

Loop unrolling

  • Unroll the tree reduction loop for the last warp (less synchronization needed; see the warp-shuffle sketch below)
  • Unroll all tree reduction loops (need to know block size)
  • Unroll the sequential reduction loop (knowing the work per thread)
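For the last-warp case, a sketch with warp shuffles (available since CUDA 9 as __shfl_down_sync; the helper name is an assumption):

```cpp
// Reduce the 32 partial sums of a (fully active) warp without shared memory
// and without __syncthreads().
__device__ float warp_reduce(float val)
{
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;   // lane 0 of the warp holds the warp's sum
}
```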

Histogram computation

Mandelbrot practice session

During the practice session, you will have to compute the cumulated histogram of the image.

  1. Compute the histogram $H$
    • Count the number of occurrences of each value
  2. Compute the cumulated histogram $C$

First sample code

What is the issue? All the threads update the same histogram data array concurrently (data races and contention).

Parallel algorithm using output privatization

Local histogram

Initialization

Shared memory must be initialized. This can be done with the “comb-like” pattern.

Computation

Like previous code, but with local atomics

Commit to global memory

int n = blockDim.x * threadIdx.y + threadIdx.x;  // linear thread id in the block
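Putting the three phases together, a sketch of the privatized histogram kernel (256 bins and 8-bit input values are assumptions):

```cpp
// Privatized histogram: local histogram in shared memory with local atomics,
// then a comb-like commit to global memory.
__global__ void histogram_privatized(const unsigned char* in, int size,
                                     unsigned int* hist)
{
    __shared__ unsigned int local[256];
    int n = blockDim.x * threadIdx.y + threadIdx.x;   // linear thread id
    int nthreads = blockDim.x * blockDim.y;

    for (int b = n; b < 256; b += nthreads)           // comb-like initialization
        local[b] = 0;
    __syncthreads();

    for (int i = blockIdx.x * nthreads + n; i < size; i += gridDim.x * nthreads)
        atomicAdd(&local[in[i]], 1u);                 // local atomics
    __syncthreads();

    for (int b = n; b < 256; b += nthreads)           // commit to global memory
        atomicAdd(&hist[b], local[b]);
}
```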

Summary

Performance boosters:

  • Coalesced accesses
  • Output privatization

Requirements:

  • atomics
  • synchronization

Scan pattern

What is a scan ?

Usage:

  • Integration (cumulated histogram)
  • Resource allocation (memory to parallel threads, camping spots…)
  • Base building block for many algorithms (sorts, strings comparisons)

Performance baselines

Sequential version
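For reference, a sequential inclusive scan on the host (a minimal sketch, names are assumptions):

```cpp
// Sequential (inclusive) scan baseline, on the CPU.
void scan_inclusive(const int* in, int* out, int n)
{
    int acc = 0;
    for (int i = 0; i < n; ++i) {
        acc += in[i];
        out[i] = acc;       // out[i] = in[0] + ... + in[i]
    }
}
```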

Naive parallel version

Have every thread add up all the input elements needed for its output element.

Scan pattern at the warp or block level

Kogge-Stone

  • Number of steps: $\log N$ (good)
  • Resource efficiency (good)
  • Work efficiency $\sim N\log N$ (bad); see the sketch after this list
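A block-level Kogge-Stone sketch (inclusive scan, single block, n ≤ blockDim.x; names are assumptions):

```cpp
// Kogge-Stone inclusive scan at the block level:
// log(N) steps, but ~N log N additions in total (not work efficient).
__global__ void scan_kogge_stone(const int* in, int* out, int n)
{
    extern __shared__ int tmp[];                 // blockDim.x ints
    int tid = threadIdx.x;
    tmp[tid] = (tid < n) ? in[tid] : 0;
    __syncthreads();

    for (int stride = 1; stride < blockDim.x; stride *= 2) {
        int val = 0;
        if (tid >= stride)
            val = tmp[tid - stride];             // read before anyone overwrites
        __syncthreads();
        if (tid >= stride)
            tmp[tid] += val;
        __syncthreads();
    }
    if (tid < n)
        out[tid] = tmp[tid];
}
```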

Brent-Kung

  • Number of steps: $2\log N$
  • Resource efficiency: all warps remain active till the end (bad)
  • Work efficiency: $2N$ (good)

Sklansky

  • Number of steps: $\log N$
  • Resource efficiency: good
  • Work efficiency: $\frac{N}{2}\log N$ (good)

Scan Pattern at the Block or Grid Level

The patterns before can be applied:

  • At the warp level (no sync until Volta)
  • At the block level (thread sync)

At the global level: a multi-level kernel approach in global memory

  • Scan then propagate
  • Reduce then scan

Scan then propagate

Reduce then scan

Summary

This post is licensed under CC BY 4.0 by the author.