
IRGPU: Introduction

Link to the HackMD note

Agenda

  1. GPU and architectures (2h)
  2. Programming GPUs with CUDA (2h)
  3. TP 00 CUDA (3h)
  4. Efficient programming with GPU (2h)
  5. TP 01 CUDA

GPU and architectures

Why use GPUs?

A fast program is a program that consumes less energy.

Example: our smartphones consume an enormous amount of energy; programs that consume as little as possible help preserve battery life.

We are now in the era of big data: we want to process very large volumes of data quickly. Without that compute capability, we would never have gotten technologies like neural networks.

We want programs to execute within a bounded time.

We are not GISTRE students (the embedded real-time major)… but still:

With safety-critical embedded systems (cars, rockets, etc.), we do not want an answer that takes an hour.

Power Consumption on Smartphones

The CPU is a major source of power consumption in smartphones (even in graphics-oriented apps).

A large share of the battery is consumed by the CPU and GPU.

Today, we try to offload as much as possible to the GPU, because it consumes less than the CPU.

Power Consumption of Some Processors

What do we notice about the price per gigaflop?

Scientific Computing

A bit of history - The first GPU

The creation of GPUs was motivated by medicine and other scientific fields (even if gaming later took the lead).

  • Back in the 70s, GPUs were for image synthesis
  • First GPU: Ikonas RDS-3000
    • At the time: very difficult to program a GPU
  • N. England & M. Whitton founded Ikonas Graphics Systems

The first GPGPU

First programmable GPU:

  • Vertex shaders: programmable vertex transforms, 32-bit floats
    • Graphics pipeline
    • Shaders: stages of the pipeline that could be replaced by custom programs
    • Opened the way to scientific computing
  • Data-dependent, configurable texturing + register combiners

Enabled early GPGPU results:

  • Hoff (1999): Voronoi diagrams on NVIDIA TNT2
  • Larsen & McAllister (2001): first GPU matrix multiplication (8-bit)
  • Rumpf & Strzodka (2001): first GPU PDEs

GPGPU for physics simulation on GeForce 3

Approximate simulation of natural phenomena

GeForce FX (2003): floating point

True programmability enabled broader simulation research

  • Ray Tracing
  • Radiosity
  • PDE solvers
  • Physically-based simulation
  • FFT (2003)
  • High-level language: Brook for GPU (2004)

GPGPU becomes a trend (2006)

Two factors drove the massive surge in GPGPU development:

  • The NVIDIA G80 architecture
    • Dedicated computing mode - threads rather than pixels/vertices
    • General, byte-addressable memory architecture
  • Software support
    • C and C++ languages and compilers for GPUs (spoiler: it’s CUDA)

The chart on the right is meaningless (it’s marketing).

2010’s

Accelerating discoveries

And data centers gave birth to deep learning

Neural networks already existed in the 80s, but we did not have the computing power until the 2010s.

Embedded systems - The real-time constraints

We need the best of both worlds:

  • Ultra-high-performance computing
  • With limited resources

GPU vs CPU for parallelism

How to get things done quicker:

  1. Do less work
  2. Do some work better (i.e., the most time-consuming part)
  3. Do some work at the same time
  4. Distribute work between different workers

For 1 and 2: choose the most adapted algorithms (avoid re-computing things) and the most adapted data structures. 3 and 4 are parallelism.

Why parallelism?

  • The processor is getting smarter
    • Out-of-order execution / dynamic register renaming
    • Speculative execution with branch prediction
  • And the processor is getting superscalar
    • Executing several instructions at the same time
    • Our CPUs are superscalar processors

Toward data-oriented programming

The burger factory assembly line

How to make several sandwiches as fast as possible?

  • Have several people working at the same time on the same sandwich, executing the independent tasks in parallel, with a master worker in charge of assembling everything
  • Have several workers working at the same time on different sandwiches
  • A mix of the two previous strategies: a worker can work on one sandwich or on several at the same time
  • Pipelining: one worker performs one step, then passes the sandwich to the next worker
  • One worker with several arms

Things to take into account:

  • Latency
  • Throughput

With the 1st strategy: 2 cycles

  • Optimizes both latency and throughput
  • Not the most efficient, because of the synchronization step

With the 2nd strategy:

  • Optimizes throughput but not latency

With the 3rd strategy:

  • Very heavy in terms of synchronization
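
A back-of-the-envelope illustration of the trade-off (the numbers are assumed, not from the course): say one sandwich takes $s = 4$ unit steps. With the 2nd strategy (four workers on four different sandwiches), the latency of each sandwich stays $s$, but throughput becomes 4 sandwiches per $s$. With the 1st strategy (four workers collaborating on one sandwich), the ideal latency drops to $s/4$, but the final assembly step forces synchronization, so the real latency is somewhat above $s/4$.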

Data-oriented programming parallelism

Flynn’s Taxonomy

  • SISD: no parallelism
  • SIMD: same instruction on a group of data (vector)
  • MISD: rare, mostly used for fault-tolerant code
  • MIMD: usual parallel mode (multithreading)
  • SPMT: Single Program Multiple Threads
    • All threads execute the same program (see the sketch below)
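
A minimal CUDA sketch of the SPMT idea: every thread runs the exact same program (the kernel) and uses its thread index to pick its own data element. The `saxpy` kernel and the launch configuration below are illustrative, not taken from the course material.

```cuda
#include <cstdio>

// SPMT in practice: one program, launched on many threads.
// Each thread computes y[i] = a * x[i] + y[i] for its own index i.
__global__ void saxpy(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // unique per-thread index
    if (i < n)                                      // guard the tail block
        y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));  // unified memory, for brevity
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    // One thread per element: the same program runs on ~1M threads.
    saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y);
    cudaDeviceSynchronize();

    printf("y[0] = %f\n", y[0]);  // expect 4.0
    cudaFree(x);
    cudaFree(y);
    return 0;
}
```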

Optimize for latency (MIMD with collaborative workers)

4 super-workers (4 CPU cores) collaborate to make 1 sandwich

  • Manu gets the bread and waits for the others

Time to make 4 sandwiches: $s$ (400% speed-up)

Optimize for throughput (MIMD Horizontal with multiple jobs)

Time to make 4 sandwiches: $s$ (400% speed-up)

Optimize for throughput (MIMD Vertical Pipelining)

Optimize for throughput (SIMD DLP)

More cores is trendy

Data-oriented design has changed the way we make processors (even CPUs):

  • Lower clock rates
  • Larger vector sizes, more vector-oriented ISAs
  • More cores (processing units)

Since 2005/2006, we have had “fake” cores (logical cores) for multi-threading.

CPU vs GPU performance

And you see it with HPC apps:

Towards Heterogeneous Architectures

But don’t forget, you may need to optimize both latency and throughput

What is the bound on the speedup attainable on a parallel machine for a program of which a fraction $P$ is parallelizable (i.e., the remaining $(1-P)$ must run sequentially)?
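
This question is answered by Amdahl’s law: with $N$ processors, the parallel fraction $P$ shrinks to $P/N$ while the sequential fraction $(1-P)$ is untouched, so

$$S(N) = \frac{1}{(1-P) + \frac{P}{N}}, \qquad \lim_{N \to \infty} S(N) = \frac{1}{1-P}$$

For example, a program that is 90% parallelizable ($P = 0.9$) can never run more than $10\times$ faster, no matter how many cores are added.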

GPU vs CPU architectures

It’s all about the data… The CPU:

  • optimized for low-latency access (many memory caches)
  • Control logic for out-of-order and speculative execution

It’s all about data… The GPU:

Hiding latency with thread parallelism & pipelining

So… you want to hide the latency of getting data from global memory… how?

Compare 1 CPU core with 1 GPU SM (Streaming Multiprocessor):

CPU:

  • low-latency memory to get data ready
  • each thread context switch has a cost

GPU:

  • memory latency hidden by pipelining
  • context switch is free
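
A sketch of how a GPU exploits this in CUDA: by launching far more threads than there are execution units, the SM always has ready warps to switch to (for free) while other warps wait on global memory. The grid-stride loop below is a common pattern for this; the kernel name and shapes are illustrative.

```cuda
// Grid-stride copy: each thread handles several elements.  While some
// warps wait on their global-memory loads, the SM's scheduler switches
// (at zero cost) to warps whose data has arrived, hiding the latency.
__global__ void copy(int n, const float* in, float* out) {
    int stride = blockDim.x * gridDim.x;  // total threads in the grid
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        out[i] = in[i];                   // latency overlapped across warps
}
```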

Latency hiding:

  • = do other operations when waiting for data
  • = having a lot of parallelism
  • = having a lot of data
  • will run faster
  • but not faster than the peak
  • what is the peak, by the way?

It’s all about data… Little’s law

The latency is typically the length of the pipeline.
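
Little’s law makes this quantitative: the concurrency needed to keep a unit busy is its latency times its throughput,

$$\text{concurrency} = \text{latency} \times \text{throughput}$$

As an assumed example: with a 400-cycle memory latency and a throughput of one load per cycle, roughly 400 loads must be in flight at all times; that is the parallelism the thousands of GPU threads provide.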

Hiding latency

With thread parallelism & pipelining

Note that pipelines also exist on CPUs (the Von Neumann fetch-decode-execute cycle).

More about forms of parallelism (the why!)

More about forms of parallelism (the how!)

Why do we have horizontal TLP on CPUs? Multicore.

Why do we have vertical TLP on CPUs? Hyper-threading (logical cores). These are not real cores, but hardware threads that can be switched onto the physical cores.

Extracting parallelism

Parallel architectures and parallelism

  • All processors use hardware to turn parallelism into performance