
IRGPU: Getting started with CUDA

Link to the HackMD note

CUDA overview

What is CUDA?

A product

  • It enables the use of NVidia GPUs for computation

A C/C++ variant

  • Mostly compatible with modern C++ (C++14/C++17), with extensions
  • and also some restrictions!

An SDK

  • A set of compilers and toolchains for various architectures
  • Performance analysis tools

A runtime

  • An assembly specification
  • Computation libraries (linear algebra, etc.)

A new industry standard

  • Used by every major deep learning framework
  • Replacing OpenCL, just as Vulkan is replacing OpenGL

The CUDA ecosystem (2021)

Libraries or Compiler Directives or Programming Language?

CUDA is mostly based on a “new” programming language: CUDA C (or C++, or Fortran)

This grants a lot of flexibility and performance

But it also exposes much of the GPU's goodness through libraries

And it supports a few compiler directives to facilitate some constructs

The big idea: Kernels instead of loops

Arrays of parallel threads

A CUDA kernel is executed by a grid (array) of threads

  • All threads in the grid run the same kernel code (Single Program Multiple Data)
  • Each thread has indices that are used to compute memory addresses and make control decisions (a minimal kernel sketch follows)
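
As a minimal sketch of the idea (the kernel name, array names and the bounds check are our own, not from the course), the classic CPU loop and its CUDA counterpart look like this:

```cpp
// CPU version: an explicit loop over the data.
void vector_add_cpu(const float* a, const float* b, float* c, int n) {
    for (int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}

// CUDA version: the loop disappears; each thread handles one index.
__global__ void vector_add(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard against extra threads
        c[i] = a[i] + b[i];
}
```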

Thread blocks

Threads are grouped into thread blocks

  • Threads within a block cooperate (see the sketch after this list) via
    • shared memory
    • atomic operations
    • barrier synchronization
  • Threads in different blocks do not interact
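
A hedged illustration of this cooperation, assuming a hypothetical per-block sum with a block size of 256 threads (a power of two):

```cpp
// Each block sums its own slice of `in` into one entry of `partial`.
__global__ void block_sum(const float* in, float* partial, int n) {
    __shared__ float buf[256];          // shared memory: visible to the whole block
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;

    buf[tid] = (i < n) ? in[i] : 0.0f;  // assumes blockDim.x == 256
    __syncthreads();                    // barrier: wait for every thread of the block

    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            buf[tid] += buf[tid + stride];
        __syncthreads();
    }
    if (tid == 0)
        partial[blockIdx.x] = buf[0];   // one result per block; blocks never talk to each other
    // (atomic operations on shared or global memory are the other cooperation primitive)
}
```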

A multidimensional grid of computation threads

Each thread uses indices to decide what data to work on:

Each index has $x$, $y$ and $z$ attributes

Grid and blocks can have different dimensions, but they are usually two levels of the same work decomposition
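
For instance, a hedged 2D sketch (the image layout and kernel name are assumptions):

```cpp
__global__ void process_image(const unsigned char* in, unsigned char* out,
                              int width, int height) {
    // 2D grid of 2D blocks: each index has .x, .y (and .z) attributes.
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // column
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // row
    if (x < width && y < height)
        out[y * width + x] = 255 - in[y * width + x];  // e.g. invert the pixel
}

// Host side: two levels of the same decomposition.
// dim3 block(16, 16);
// dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
// process_image<<<grid, block>>>(d_in, d_out, width, height);
```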

Examples

Block decomposition enables automatic scalability

Architecture

Programming modeling

Block: a set of threads that cooperate:

  • Synchronisation
  • Shared memory
  • Block ID = ID in a grid

Grid: an array of blocks executing the same kernel

  • Access to global GPU memory
  • Synchronization by stopping and starting a new kernel

Mapping the programming model to hardware

The SMs

Zoom on the SM

  • The SM organizes blocks into warps
  • 1 warp = group of 32 threads

GTX 920:

  • 128 cores = 4 x 32 cores
  • Quad warp scheduler selects 4 warps (TLP)
  • And 2 independent instructions per warp can be dispatched each cycle (ILP)

Ex: 1 (logical) block of 96 threads maps to 3 (physical) warps of 32 threads

Zoom on the CUDA cores

  • A warp executes 32 threads on the 32 CUDA cores
  • The threads execute the same instructions (DLP)
  • All instructions are SIMD (width = 32) instructions

Each core:

  • Floating-point & integer unit
  • Fused multiply-add (FMA) instruction
  • Logic unit
  • Move, compare unit
  • Branch unit
  • The first IF/ID stages of the pipeline are handled by the SM

The SIMT Execution Model on CUDA cores

SIMT: you program as if each thread processed a single data element, but behind the scenes this maps to SIMD instructions (a + b becomes the sum of vector a with vector b)

  • Divergent code paths (branching) pile up!

If/else: all the threads step through both the if and the else branches together (threads not taking a branch are masked)

If/else

What is the latency of this code in the best and worst case? (A sketch of the kind of branching code in question follows the list below.)

  • Best case: $a\gt0$ is false for every thread. For all threads: inst-d
  • Worst case: $a\gt0$ and $b\gt0$ are true for some but not all threads. For all threads: inst-a, inst-b, inst-c, inst-d
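
The slide's code is not reproduced here; a plausible reconstruction of the branching pattern being discussed, with inst-a … inst-d as placeholders, is:

```cpp
__global__ void divergent(const float* A, const float* B, float* R) {
    float a = A[threadIdx.x];
    float b = B[threadIdx.x];
    if (a > 0) {
        if (b > 0) {
            R[threadIdx.x] = a + b;   // inst-a
        } else {
            R[threadIdx.x] = a - b;   // inst-b
        }
        R[threadIdx.x] *= 2.0f;       // inst-c
    }
    R[threadIdx.x] += 1.0f;           // inst-d
}
// If the warp's threads disagree on `a > 0` / `b > 0`, the warp executes every
// reached path one after the other with the non-taking threads masked, so the
// worst-case latency is inst-a + inst-b + inst-c + inst-d.
```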

Loops

Final note about terminology

GPU memory model

Computation cost vs. memory cost

Power measurements on NVIDIA GT200

With the same amount of energy:

  • Load 1 word from external memory (DRAM)
  • Compute 44 flops

External memory: discrete GPU

Classical CPU-GPU model

  • Split memory space
  • Highest bandwidth from GPU memory
  • Transfers to main memory are slower

Intel i7 4770 / GTX 780

External memory: embedded GPU

Most GPUs today:

  • Same memory
  • May support memory coherence (GPU can read directly from CPU caches)
  • More contention on external memory

GPU: on-chip memory

Cache area in CPU vs GPU:

But if we include registers:

Memory model hierarchy

Hardware

Cache hierarchy:

  • Keep frequently-accessed data close to the cores
  • Reduce throughput demand on main memory
  • Managed by hardware (L1, L2) or software (shared memory)

On CPU, caches are designed to avoid memory latency. On GPU, multi-threading deals with memory latency.

Software

Building and running a simple program

What you need to get started

  • NVidia GPU hardware
  • NVidia GPU drivers, properly loaded
  • CUDA runtime libraries
  • CUDA SDK (NVCC compiler in particular)

Summary

  • Host vs Device $\leftrightarrow$ Separate memory
    • GPUs are computation units that require explicit usage, as opposed to a CPU
    • Need to load data and fetch result from device
  • Replace loops with kernels
    • Kernel = Function computed in relative isolation on small chunks of data
  • Divide the work
  • Compile and run using CUDA SDK

Host view of GPU computation

Sequential and parallel sections

  • We use the GPU(s) as co-processor(s)

CUDA memory primitives

Why 2D and 3D variants?

  • Strong alignment requirements in device memory
    • Enables correct loading of memory chunks to SM caches (correct bank alignment)
  • Proper striding management in an automated fashion (see the pitched-allocation sketch below)
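
A hedged sketch of pitched allocation (function and variable names are ours):

```cpp
#include <cuda_runtime.h>

// Upload a tightly packed width x height float image into a pitched
// (properly aligned) device buffer.
float* upload_image(const float* h_img, int width, int height, size_t* pitch) {
    float* d_img = nullptr;
    // The runtime pads each row so that *pitch (in bytes) satisfies the
    // alignment requirements; *pitch >= width * sizeof(float).
    cudaMallocPitch(reinterpret_cast<void**>(&d_img), pitch,
                    width * sizeof(float), height);
    cudaMemcpy2D(d_img, *pitch,                   // destination + its row stride
                 h_img, width * sizeof(float),    // source + its row stride
                 width * sizeof(float), height,   // row width in bytes, row count
                 cudaMemcpyHostToDevice);
    return d_img;  // in a kernel, row y starts at (float*)((char*)d_img + y * *pitch)
}
```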

Host $\leftrightarrow$ Device memory transfer

Almost complete code
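
The slide's code is not reproduced here; a minimal hedged reconstruction of the usual host-side sequence (allocate, upload, launch, download, free), reusing the hypothetical vector_add kernel sketched earlier:

```cpp
#include <cuda_runtime.h>
#include <vector>

__global__ void vector_add(const float* a, const float* b, float* c, int n);

int main() {
    const int n = 1 << 20;
    std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n);
    const size_t bytes = n * sizeof(float);

    float *d_a, *d_b, *d_c;                        // device pointers
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);

    cudaMemcpy(d_a, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b.data(), bytes, cudaMemcpyHostToDevice);

    const int block = 256;
    const int grid  = (n + block - 1) / block;
    vector_add<<<grid, block>>>(d_a, d_b, d_c, n); // asynchronous launch

    cudaMemcpy(c.data(), d_c, bytes, cudaMemcpyDeviceToHost);  // implicit sync

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;      // (error checking omitted; see the next section)
}
```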

Checking errors

In practice, check for API errors
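
A common pattern is a small checking macro (this macro is our own convention, not a CUDA API):

```cpp
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Wrap every runtime call: print the error string and abort on failure.
#define CUDA_CHECK(call)                                                      \
    do {                                                                      \
        cudaError_t err_ = (call);                                            \
        if (err_ != cudaSuccess) {                                            \
            std::fprintf(stderr, "CUDA error %s at %s:%d\n",                  \
                         cudaGetErrorString(err_), __FILE__, __LINE__);       \
            std::exit(EXIT_FAILURE);                                          \
        }                                                                     \
    } while (0)

// Usage:
//   CUDA_CHECK(cudaMalloc(&d_a, bytes));
//   my_kernel<<<grid, block>>>(...);
//   CUDA_CHECK(cudaGetLastError());      // catches launch errors
//   CUDA_CHECK(cudaDeviceSynchronize()); // catches asynchronous execution errors
```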

Intermission: Can I use memory management functions inside kernels?

No: cudaMalloc(), cudaMemcpy() and cudaFree() must be called from the host only

However, kernels may allocate, use and reclaim memory dynamically using regular malloc()
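
A minimal sketch of device-side allocation (the kernel and its workload are hypothetical):

```cpp
__global__ void scratch_demo(int m) {
    // Each thread allocates its own scratch buffer from the device heap.
    int* tmp = static_cast<int*>(malloc(m * sizeof(int)));
    if (tmp != nullptr) {              // the device heap is limited: always check
        for (int i = 0; i < m; ++i)
            tmp[i] = threadIdx.x + i;
        free(tmp);                     // reclaim before the thread finishes
    }
}

// Host side, if more heap is needed before launching:
//   cudaDeviceSetLimit(cudaLimitMallocHeapSize, 64 << 20);
```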

Fix the kernel invocation line

We want to fix this line:

Kernel invocation syntax:
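
The general form, illustrated with the hypothetical vector_add kernel from earlier (the optional arguments are shown in the comment):

```cpp
// General form: kernel<<<gridDim, blockDim [, sharedMemBytes [, stream]]>>>(arguments);
__global__ void vector_add(const float* a, const float* b, float* c, int n);

void launch_vector_add(const float* d_a, const float* d_b, float* d_c, int n) {
    dim3 block(256);                                // threads per block
    dim3 grid((n + block.x - 1) / block.x);         // blocks per grid (ceiling division)
    vector_add<<<grid, block>>>(d_a, d_b, d_c, n);  // asynchronous launch
}

// gridDim and blockDim may be plain integers (1D) or dim3 values (up to 3D):
//   my_kernel<<<dim3(gx, gy), dim3(bx, by)>>>(...);
```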

How to set gridDim and blockDim properly?

Lvl 0: Naive trial with as many threads as possible

  • Will fail with large vectors
    • Hardware limitation on the maximum number of threads per block (1024 for compute capability 3.0-7.5)
  • Will fail with vectors whose size is not a multiple of the warp size

Lvl 1: It works with just enough blocks

Lvl 2: Tune block size given the kernel requirements and hardware constraints
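
One common way to do this (a sketch of a general technique, not necessarily the course's choice) combines a grid-stride loop with the runtime's occupancy helper:

```cpp
#include <cuda_runtime.h>

// Grid-stride loop: correct for any n, whatever grid size we pick.
__global__ void vector_add_gs(const float* a, const float* b, float* c, int n) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        c[i] = a[i] + b[i];
}

void launch(const float* d_a, const float* d_b, float* d_c, int n) {
    int min_grid = 0, block = 0;
    // Ask the runtime for a block size that maximizes occupancy of this kernel.
    cudaOccupancyMaxPotentialBlockSize(&min_grid, &block, vector_add_gs);
    int grid = (n + block - 1) / block;             // just enough blocks
    vector_add_gs<<<grid, block>>>(d_a, d_b, d_c, n);
}
```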

But wait…

On the device, kernel invocations are strictly sequential (unless you schedule them on different streams)

Intermission: Can I launch kernels inside kernels?

Yes. This is the basis of dynamic parallelism

Some restrictions on the stack size apply. Remember that the device runtime is a functional subset of the host runtime, i.e. you can perform device management, kernel launching, device memcpy, etc., but with some restrictions

The compiler may inline some of those calls.
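
A minimal hedged sketch of a parent kernel launching a child kernel (kernel names are ours; this must be compiled with relocatable device code, e.g. nvcc -rdc=true):

```cpp
__global__ void child(int* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2;
}

__global__ void parent(int* data, int n) {
    // A single thread of the parent grid launches a nested grid.
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        int block = 256;
        child<<<(n + block - 1) / block, block>>>(data, n);
        cudaDeviceSynchronize();   // device-runtime call: wait for the child grid
    }
}
```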

Conclusion about the host-only view

A host-only view of the computation is sufficient in most cases:

  1. upload data to the device
  2. fire a kernel
  3. download output data from the device

Advanced CUDA requires making sure we saturate the SMs, and may require studying the kernel to determine the best:

  • number of threads per block
  • number of blocks per grid
  • work per thread (if applicable)

This depends on:

  • hardware specifications: maximum gridDim and blockDim, etc.
  • kernel code: amount of registers and shared memory used by each thread

Kernel programming

Several API levels

We now want to program kernels. There are several APIs available:

  • PTX assembly
  • Driver API (C)
  • Runtime C++ API $\leftarrow$ let’s use this one

Function Execution Space Specifiers

  • __global__ defines a kernel function
    • Each __ consists of 2 underscore characters
    • A kernel function must return void
    • It may be called from another kernel for devices of compute capability 3.2 or higher (Dynamic Parallelism support)
  • __device__ and __host__ can be used together
  • __host__ is optional if used alone (see the sketch below)
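
A short sketch showing the three specifiers together (the helper functions are hypothetical):

```cpp
__device__ float square(float x) { return x * x; }         // device-only helper

__host__ __device__ float lerp(float a, float b, float t)  // compiled for both CPU and GPU
{ return a + t * (b - a); }

__global__ void apply(float* v, int n) {                   // kernel: returns void, launched with <<< >>>
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] = lerp(0.0f, square(v[i]), 0.5f);
}
```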

Built-in Vector Types

They make it easy to work with data like images. Alignment must be respected in all operations

They all come with a constructor function of the form make_<type name>

The 1st, 2nd, 3rd and 4th components are accessible through the fields $x$, $y$, $z$ and $w$ respectively
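
For instance, a hypothetical RGBA-to-grayscale kernel using uchar4 and the make_<type name> constructors:

```cpp
__global__ void to_gray(const uchar4* rgba, unsigned char* gray, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        uchar4 p = rgba[i];                       // one aligned 4-byte load
        gray[i] = (unsigned char)(0.299f * p.x + 0.587f * p.y + 0.114f * p.z);
    }
}

// Constructor functions of the form make_<type name>:
//   float4 v = make_float4(1.0f, 2.0f, 3.0f, 4.0f);   // v.x, v.y, v.z, v.w
```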

Built-in variables

Example

Memory hierarchy

Types of Memory

  • Registers
    • Used to store parameters, local variables, etc.
    • Very fast
    • Private to each thread
    • Lots of threads $\Rightarrow$ little memory per thread
  • Shared
    • Used to store temp data
    • Very fast
    • Shared among all threads in a block
  • Constant
    • A special cache for read-only values
  • Global
    • Large and slow
  • Caches
    • Transparent use
  • Local

Salient features of Device Memory

Cost to access memory

Variable Memory Space Specifiers

How to declare CUDA variables

Remarks:

  • __device__ is optional when used with __shared__ or __constant__
  • Automatic variables reside in a register (see the declaration sketch below)
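
A hedged declaration sketch (the kernel and variable names are ours):

```cpp
__constant__ float coeffs[16];          // constant memory: host writes it with cudaMemcpyToSymbol
__device__   int   hit_count;           // global (device) memory, visible to all kernels

__global__ void filter(const float* in, float* out, int n) {
    __shared__ float tile[256];         // shared memory: one copy per block (assumes blockDim.x <= 256)
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float x = (i < n) ? in[i] : 0.0f;   // automatic variable: lives in a register
    tile[threadIdx.x] = x;
    __syncthreads();
    if (i < n) out[i] = tile[threadIdx.x] * coeffs[0];
}

// Host side: cudaMemcpyToSymbol(coeffs, h_coeffs, 16 * sizeof(float));
```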

Where to declare variables? Can the host access it?

  • Yes: global and constant variables; declare them outside of any function
  • No: register and shared variables; use or declare them in the kernel

What can be shared by whom?

Possible memory access:

  • Among threads in the same grid (a kernel invocation)
    • Global memory
  • Among threads in the same block
    • Global memory
    • Shared memory

Relaxed consistency memory model

The CUDA programming model assumes a device with a weakly-ordered memory model, that is, the order in which a CUDA thread writes data to shared memory or global memory is not necessarily the order in which another CUDA or host thread observes the data being written

Example

Possible outcomes for thread 2? (A sketch of the classic pattern is given below.)
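
The slide's code is not reproduced here; the classic two-thread pattern it refers to looks roughly like this (initial values assumed to be 0):

```cpp
__device__ int X = 0, Y = 0;

__device__ void thread1() {          // executed by one thread
    X = 10;
    Y = 20;
}

__device__ void thread2(int* a, int* b) {   // executed by another thread
    *a = Y;   // may observe Y == 20 ...
    *b = X;   // ... and still observe X == 0: writes may be seen out of order
}
// Without a fence, thread 2 may see (a, b) = (0, 0), (0, 10), (20, 10) or even (20, 0).
```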

Memory Fence Functions

Memory fence functions can be used to enforce some ordering on memory accesses

__threadfence_block() ensures that:

  • All writes to all memory made by the calling thread before the call are observed by all threads in its block as occurring before all writes made after the call
  • All reads from all memory made by the calling thread before the call are ordered before all reads made after the call

Synchronization functions

Atomic functions

Most of the atomic functions are available for all the numerical types
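
For example, a hypothetical histogram kernel using atomicAdd():

```cpp
__global__ void histogram(const unsigned char* img, int n, unsigned int* bins) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&bins[img[i]], 1u);   // read-modify-write done atomically in global memory
}
// Other atomics exist: atomicSub, atomicMin, atomicMax, atomicExch, atomicCAS, ...
```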

Arithmetic functions

Debugging, performance analysis and profiling

printf

Possible since Fermi devices (Compute Capability 2.x and higher)

Limited number of lines:

  • circular buffer flushed at particular times
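
A quick sketch of in-kernel printf (the kernel is hypothetical):

```cpp
__global__ void debug_kernel(const float* v, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Print from only a few threads: the printf buffer is a limited circular buffer.
    if (i < 4 && i < n)
        printf("thread %d: v = %f\n", i, v[i]);
}
// The buffer is flushed on cudaDeviceSynchronize(), at the next kernel launch, etc.
```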

Global memory write

To dump and then inspect a larger amount of intermediate data. Analysis code should be removed for production

Example
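
The slide's example is not shown here; a hedged sketch of the idea, with an extra debug buffer passed only while analysing:

```cpp
__global__ void step(const float* in, float* out, int n,
                     float* debug /* analysis only */) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float tmp = in[i] * in[i] - 1.0f;   // some intermediate value of interest
        if (debug) debug[i] = tmp;          // dump it to global memory for later inspection
        out[i] = tmp > 0.0f ? tmp : 0.0f;
    }
}
// Host: allocate `debug` with cudaMalloc, copy it back with cudaMemcpy, then inspect it.
// Remove the debug parameter (or pass nullptr) for production builds.
```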

CUDA tools

The complete compilation trajectory
