A GPU (Graphics Processing Unit) is a specialized processor designed to accelerate parallel workloads, especially operations on large matrices and vectors — which is exactly what modern deep learning models need.

Originally built for 3D graphics and rendering, GPUs are now heavily used in:

  • Machine Learning (training/inference)
  • Scientific computing
| Feature | CPU | GPU |
| --- | --- | --- |
| Designed for | General-purpose tasks | Parallel computation (e.g., matrices) |
| Cores | Few (2–64 high-performance cores) | Thousands of smaller cores |
| Memory | Low-latency, low-bandwidth | High-bandwidth, high-latency |
| Control flow | Optimized for branching logic | Optimized for SIMD (Single Instruction, Multiple Data) |

📌 In ML, think:

  • CPU → control logic, training orchestration
  • GPU → number crunching for tensors (matrix mult, conv, activation…)
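
As a rough sanity check of this division of labor, here's a minimal sketch (assuming PyTorch and a CUDA-capable GPU; sizes and timings are illustrative) that times the same matmul on both devices:

```python
# Minimal sketch: compare a large matmul on CPU vs GPU.
import time
import torch

a = torch.randn(4096, 4096)
b = torch.randn(4096, 4096)

t0 = time.perf_counter()
c_cpu = a @ b                          # runs on the CPU cores
cpu_time = time.perf_counter() - t0

if torch.cuda.is_available():
    a_gpu, b_gpu = a.cuda(), b.cuda()  # copy tensors into GPU VRAM
    torch.cuda.synchronize()           # wait for the copies to finish
    t0 = time.perf_counter()
    c_gpu = a_gpu @ b_gpu              # thousands of cores crunch the matmul
    torch.cuda.synchronize()           # GPU calls are async; sync before timing
    gpu_time = time.perf_counter() - t0
    print(f"CPU: {cpu_time:.3f}s  GPU: {gpu_time:.3f}s")
```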

Architecture of a GPU (Simplified)

+-----------------------------------------------------+
|                     GPU Die                         |
| +----------------+  +-----------------------------+ |
| |  Control Logic |  |    SIMD Cores (hundreds)    | |
| |  (few cores)   |  |  ALUs, Registers, etc.      | |
| +----------------+  +-----------------------------+ |
| +-------------------------------------------------+ |
| | High-BW Memory Controller (e.g. GDDR6, HBM2)    | |
| +-------------------------------------------------+ |
+-----------------------------------------------------+

GPU internal blocks:

  • SMs (Streaming Multiprocessors)
  • CUDA cores
  • Tensor Cores
  • Cache hierarchy
  • Memory (HBM, GDDR)

SM

Each SM has:

  • CUDA cores (integer/floating point ALUs)
  • Tensor Cores (for matrix ops)
  • Registers, shared memory
  • Scheduler (handles warps — groups of 32 threads in NVIDIA)

CUDA Cores

  • The basic arithmetic units in NVIDIA GPUs.
  • They do simple math: add, multiply, move data.
  • Operate in SIMT model: Single Instruction, Multiple Threads.
  • Threads run in warps (32 threads).

Tensor Cores

  • Special cores for mixed-precision matrix math.
  • Optimized for deep learning → multiply-add operations (matrix-matrix).
  • Boost performance for FP16, INT8, TF32.
  • First introduced with the NVIDIA Volta generation.

  • Tensor Cores do the matrix multiplications.
  • CUDA Cores handle general arithmetic and supporting work, such as loading the needed weights from global memory so the Tensor Cores can multiply them.
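
A hedged sketch of how you typically reach these units from PyTorch: enabling TF32 and using autocast routes eligible matmuls onto Tensor Cores (assumes an NVIDIA GPU, Ampere or newer).

```python
# Hedged sketch: nudge PyTorch to use Tensor Cores via TF32 and autocast.
import torch

torch.backends.cuda.matmul.allow_tf32 = True   # let FP32 matmuls use TF32 Tensor Cores
torch.backends.cudnn.allow_tf32 = True

x = torch.randn(1024, 1024, device="cuda")
w = torch.randn(1024, 1024, device="cuda")

with torch.autocast(device_type="cuda", dtype=torch.float16):
    y = x @ w        # executed in FP16 on Tensor Cores where supported
print(y.dtype)       # torch.float16 inside the autocast region
```

GPU memory hierarchy (simplified):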
+----------------------+
|       VRAM (DRAM)    |
+----------------------+
           |
           v
+----------------------+
|  Memory Controller   |
+----------------------+
           |
           v
+----------------------+
|      L2 Cache        |  <-- Large SRAM
+----------------------+
           |
           v
+-----------------------------------------------------------+
|                       GPU Chip                           |
|                                                           |
| +------------------+   +------------------+               |
| | SM (Streaming MP)|   | SM (Streaming MP)|               |
| |------------------|   |------------------|               |
| | +--------------+ |   | +--------------+ |               |
| | | Warp Sched   | |   | | Warp Sched   | |  <-- Schedulers |
| | +--------------+ |   | +--------------+ |               |
| | +--------------+ |   | +--------------+ |               |
| | | Registers    | |   | | Registers    | |  <-- SRAM     |
| | +--------------+ |   | +--------------+ |               |
| | +--------------+ |   | +--------------+ |               |
| | | L1 Cache     | |   | | L1 Cache     | |  <-- SRAM     |
| | +--------------+ |   | +--------------+ |               |
| | +--------------+ |   | +--------------+ |               |
| | | Shared Mem   | |   | | Shared Mem   | |  <-- SRAM     |
| | +--------------+ |   | +--------------+ |               |
| | +--------------+ |   | +--------------+ |               |
| | | CUDA Cores   | |   | | CUDA Cores   | |               |
| | +--------------+ |   | +--------------+ |               |
| | +--------------+ |   | +--------------+ |               |
| | | Tensor Cores | |   | | Tensor Cores | |               |
| | +--------------+ |   | +--------------+ |               |
| +------------------+   +------------------+               |
|                                                           |
|     ... more SMs ...                                      |
|                                                           |
+-----------------------------------------------------------+

The entire LLM is loaded into VRAM (Video Random Access Memory), but the actual computations happen in a much smaller, faster part of the GPU: on-chip SRAM.

  • The full model weights (e.g., 7B parameters) are loaded into VRAM — this is your GPU’s GDDR6X, HBM, etc.
  • Why? Because VRAM is big enough (GBs) to hold all those weights.
  • VRAM acts as the main store for the model and any big activations or tensors.

Actual compute → on-chip SRAM (caches, registers, shared memory)

  • When you run a prompt, the GPU:
    • Loads chunks of weights from VRAM into on-chip L2 Cache (SRAM).
    • Then pushes data down into L1 Cache, Shared Memory, and Registers inside the SMs.
    • The CUDA Cores or Tensor Cores then do the matrix math on these chunks.

Major GPU Manufacturers

NVIDIA

  • King of ML/AI
  • CUDA, cuDNN, TensorRT
  • Best framework support (PyTorch, TensorFlow, etc.)
  • Dominates data center (H100, A100, L40, etc.)

AMD

  • Competitive hardware (Radeon, Instinct)
  • Uses ROCm (alternative to CUDA) for ML
  • PyTorch ROCm support exists, but ecosystem is smaller

Intel

  • New in discrete GPUs (Intel Arc, Xe)
  • Has oneAPI, XPU, and Level Zero
  • Still maturing in ML support

Apple

  • M1/M2/M3 chips have built-in GPUs
  • Uses Metal Performance Shaders (MPS) backend in PyTorch

GPU in ML Ecosystem

To use GPU in ML code:

  • You need framework-level support: PyTorch, TensorFlow, etc.
  • Those frameworks need backend support:
    • torch.cuda → uses CUDA (NVIDIA)
    • torch.xpu → uses XPU/Level Zero (Intel)
    • torch.mps → uses Metal (Apple)
    • AMD ROCm builds reuse the "cuda" device string (HIP maps CUDA calls onto AMD hardware)

So just having a GPU isn’t enough. You need the right driver + runtime + compiler stack.
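
As a small, hedged sketch of what that means in practice, the snippet below probes which backends this particular PyTorch build actually supports and falls back to the CPU otherwise:

```python
# Hedged sketch: pick whichever accelerator backend this PyTorch build supports.
# (ROCm builds expose AMD GPUs through the "cuda" device string.)
import torch

def pick_device() -> torch.device:
    if torch.cuda.is_available():                      # NVIDIA CUDA or AMD ROCm build
        return torch.device("cuda")
    if getattr(torch.backends, "mps", None) and torch.backends.mps.is_available():
        return torch.device("mps")                     # Apple Metal
    if hasattr(torch, "xpu") and torch.xpu.is_available():
        return torch.device("xpu")                     # Intel XPU / Level Zero
    return torch.device("cpu")

device = pick_device()
x = torch.randn(8, 8, device=device)
print(device, x.device)
```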

What we need

  • GPU hardware: e.g., NVIDIA RTX 4060, AMD Radeon RX 6700, Intel Arc A750
  • GPU driver: translates OS/system calls (e.g., from PyTorch) into instructions the GPU understands. Example: nvidia-driver-550 (Ubuntu package)
  • GPU runtime API: the library/API layer that ML frameworks use to control the GPU
  • Compiler/kernel stack: GPU kernels are the GPU's "functions"; you need a compiler to build and dispatch them

| Layer | What happens |
| --- | --- |
| ML framework | torch.tensor → creates a tensor on the CPU; .to("cuda") moves it to the GPU |
| PyTorch build | Calls into torch._C (C++/CUDA bindings from the PyTorch + cu121 build) |
| Compiler | Uses cuBLAS/cuDNN kernels for matmul, or JITs a custom CUDA kernel |
| Runtime API | Calls cudaMalloc, cudaMemcpy, cudaLaunchKernel |
| Driver | Accepts the kernel dispatch, manages memory on the GPU |
| GPU | Executes the parallel matmul on thousands of cores |

Python → PyTorch → C++ ops / kernels → CUDA API → NVIDIA Driver → GPU
(high-level)        (compiled ops)       (runtime)                 (hardware)

| Vendor | Runtime | Role |
| --- | --- | --- |
| NVIDIA | CUDA | Handles memory allocation, kernel execution, tensor ops |
| Intel | oneAPI Level Zero, DPC++ | Cross-device dispatch |
| AMD | ROCm | Includes HSA, HIP, and drivers |
| Apple | Metal Performance Shaders (MPS) | On Apple Silicon |

Key GPU Specs That Matter (for ML)

| Spec | Meaning |
| --- | --- |
| CUDA Cores / Stream Processors | Parallel execution units |
| Memory size (VRAM) | How large your tensors can be |
| Memory type | GDDR6 / HBM2 / shared (affects bandwidth) |
| Tensor Cores | NVIDIA-only: optimized for matrix ops |
| FP32 / FP16 / BF16 | Floating-point precision support |
| Bandwidth | Speed of data movement |
| ECC memory | Error checking, important for training stability |

What Happens When You Use GPU in PyTorch?

When you do:

tensor = tensor.to("cuda")

This happens:

  1. PyTorch copies the tensor from CPU RAM to GPU VRAM
  2. A GPU kernel (written in CUDA C++) runs your operation (e.g., matmul)
  3. Results stay in GPU memory unless you explicitly move them back

This is where CUDA / XPU / ROCm / MPS drivers come into play.
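
A hedged sketch of that round trip (assuming an NVIDIA GPU; the shapes are arbitrary):

```python
# Hedged sketch of the CPU → GPU → CPU round trip described above.
import torch

x = torch.randn(2048, 2048)          # lives in CPU RAM
x = x.to("cuda")                     # 1. copied into GPU VRAM
y = x @ x                            # 2. a CUDA kernel (cuBLAS matmul) runs on the GPU
print(y.device)                      # cuda:0 -- 3. result stays in VRAM...
y_cpu = y.cpu()                      # ...until you explicitly copy it back
```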

CUDA Graph

Instead of launching kernels and memory copies one by one (like a regular program does), CUDA Graphs record the entire sequence of operations, and then replay them as a single unit.

CUDA Graphs are DAGs (Directed Acyclic Graphs). Each node in the graph is:

  • A kernel launch
  • A memory operation
  • A synchronization operation

The edges represent dependencies (e.g., op B must wait for op A to finish).

A decoder step in a transformer model like LLaMA 3 runs something like:

  1. Load KV cache
  2. Run attention kernel
  3. Run MLP kernel
  4. Write back to KV cache
  5. Generate logits

Without CUDA Graph:

Each of those steps is launched individually, with a small CPU→GPU launch overhead for each. This gets expensive for small batches or low-latency apps.

With CUDA Graph:

We record the entire sequence once, for a specific batch size (say 8), and then next time:

cudaGraphLaunch(graphExec, stream);  // ONE CALL replaces five kernel launches

CUDA Graphs are batch-size-specific. If you record a graph for batch size 8, you can’t reuse it for batch size 12 without re-recording.

Hence, frameworks (like TensorRT-LLM or vLLM) use a CUDA_GRAPH_MAX_BATCH_SIZE parameter. If your incoming request size ≤ that, you can use the prebuilt graph.

If we go over that size, the graph isn’t used, and you fall back to traditional launch mode (with higher latency).

Note: SGLang uses this technique.
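
PyTorch exposes the same mechanism as torch.cuda.CUDAGraph. A hedged sketch, with a toy nn.Linear standing in for the decoder step and a fixed batch size of 8:

```python
# Hedged sketch: capture a fixed-shape forward pass once, then replay it.
import torch

model = torch.nn.Linear(1024, 1024).cuda().eval()
static_input = torch.randn(8, 1024, device="cuda")    # the graph is tied to this shape

# Warm up on a side stream so one-time initialization isn't baked into the graph.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s), torch.no_grad():
    for _ in range(3):
        model(static_input)
torch.cuda.current_stream().wait_stream(s)

# Record once: the kernel sequence is captured, not executed.
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g), torch.no_grad():
    static_output = model(static_input)

# Replay for every new batch of the same shape: one launch for the whole DAG.
static_input.copy_(torch.randn(8, 1024, device="cuda"))
g.replay()
print(static_output.shape)
```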

nvidia-smi

Shows NVIDIA GPU details: driver version, GPU name, memory usage, etc.

nvidia-smi

Query GPU memory, name, and utilization only

nvidia-smi --query-gpu=gpu_name,memory.total,memory.used,utilization.gpu --format=csv

See detailed hardware + performance data

nvidia-smi -q
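
The same basic facts can also be read from inside PyTorch; a hedged sketch (assumes a CUDA build of PyTorch):

```python
# Hedged sketch: query the visible GPU from Python instead of nvidia-smi.
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    free_b, total_b = torch.cuda.mem_get_info(0)       # free/total VRAM in bytes
    print(props.name)                                   # GPU name
    print(props.multi_processor_count, "SMs")           # number of streaming multiprocessors
    print(f"{total_b / 1e9:.1f} GB total, {free_b / 1e9:.1f} GB free")
```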

LLMs like GPT-3 or GPT-4 are way too big for a single GPU.
You must learn:

  • Data Parallelism: Split data across GPUs.
  • Model Parallelism: Split model layers across GPUs.
  • Pipeline Parallelism: Split forward/backward pass stages.

Frameworks: DeepSpeed, Megatron-LM, NCCL (NVIDIA’s lib for multi-GPU comms).
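
A minimal, hedged sketch of the simplest of these, data parallelism with PyTorch DDP over NCCL (the script name, model, and shapes are placeholders):

```python
# Hedged sketch of data parallelism with PyTorch DDP (one process per GPU).
# Launch with: torchrun --nproc_per_node=NUM_GPUS ddp_sketch.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group("nccl")                   # NCCL handles GPU-to-GPU comms
    rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(rank)

    model = torch.nn.Linear(1024, 1024).cuda(rank)
    model = DDP(model, device_ids=[rank])             # gradients are all-reduced across GPUs

    opt = torch.optim.SGD(model.parameters(), lr=1e-3)
    x = torch.randn(32, 1024, device=f"cuda:{rank}")  # each rank sees a different data shard
    loss = model(x).sum()
    loss.backward()                                   # DDP overlaps all-reduce with backward
    opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```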

How Much Space Do Model Parameters Need?

Each parameter is usually a 32-bit float → 4 bytes. Say the model has 82M parameters:

82 million × 4 bytes ≈ 328 MB
  • Activations, intermediate buffers, and other overhead come on top.
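
For a model you already have in memory, the same estimate can be read off directly; a hedged sketch with a placeholder two-layer model (swap in your own):

```python
# Hedged sketch: measure parameter memory for any torch.nn.Module.
import torch

model = torch.nn.Sequential(               # placeholder model
    torch.nn.Linear(4096, 4096),
    torch.nn.Linear(4096, 4096),
)

n_params = sum(p.numel() for p in model.parameters())
n_bytes = sum(p.numel() * p.element_size() for p in model.parameters())  # 4 bytes/param in FP32
print(f"{n_params / 1e6:.1f}M params, {n_bytes / 1e6:.1f} MB of weights")

model.half()                               # FP16 halves the footprint (2 bytes/param)
n_bytes_fp16 = sum(p.numel() * p.element_size() for p in model.parameters())
print(f"{n_bytes_fp16 / 1e6:.1f} MB in FP16")
```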

Model size (number of parameters). Larger models need more memory. Models with tens or hundreds of billions of parameters usually require high-end GPUs like NVIDIA H100 or H200.

Bit precision. The precision used (e.g., FP16, FP8, INT8) affects memory consumption. Lower-precision formats can significantly reduce the memory footprint, but may cost some accuracy.

A rough formula to estimate how much memory is needed to load an LLM is:

Memory (GB) = P × (Q / 8) × (1 + Overhead)

  • P: number of parameters (in billions)
  • Q: bit precision (e.g., 16, 32); dividing by 8 converts bits to bytes
  • Overhead: additional or temporary memory used during inference (e.g., KV cache, activation buffers, optimizer states)

For example, to load a 70B model in FP16 with 20% overhead, you need around 168 GB of GPU memory:

Memory = 70 × (16 / 8) × 1.2 = 168 GB
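
The same rule of thumb as a small helper function (the name and defaults are just for illustration):

```python
# Hedged sketch of the rule-of-thumb formula above.
def estimate_llm_memory_gb(params_billions: float, bits: int = 16, overhead: float = 0.20) -> float:
    """Memory (GB) ~= P * (Q / 8) * (1 + overhead)."""
    return params_billions * (bits / 8) * (1 + overhead)

print(estimate_llm_memory_gb(70, bits=16, overhead=0.20))   # ~168 GB for a 70B model in FP16
```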

NVIDIA GPU Types

NVIDIA names its datacenter/server GPUs using:

  • Letter → often indicates the product family or target market:
    • T → Turing generation, general-purpose inference (e.g., T4)
    • A → Ampere generation, datacenter & workstation (e.g., A100, A10G, A6000)
    • H → Hopper generation, next-gen datacenter (e.g., H100, H200)
    • B → Blackwell generation, future datacenter flagship (e.g., B200)
    • L → Ada Lovelace generation for datacenter inference/visualization (e.g., L4, L40S)
    • RTX → consumer cards (e.g., RTX 4090, 5090), designed for gaming, also used for prosumer AI
    • Quadro (older) → pro workstation branding, replaced by the A-series (e.g., A6000)
| Code name | Year | Typical family | Special feature |
| --- | --- | --- | --- |
| Turing | 2018 | T4, Turing RTX cards | 1st gen Tensor Cores |
| Ampere | 2020 | A100, A10G, RTX 30 series | 3rd gen Tensor Cores, TF32 |
| Ada Lovelace | 2022/23 | L4, L40S, RTX 40 series | 4th gen Tensor Cores, better efficiency |
| Hopper | 2022 | H100, H200 | Transformer Engine (FP8) |
| Blackwell | 2024+ | B200 | Next Transformer Engine, FP4 |

| Architecture | Example GPUs | Tensor Core gen | Special precision / quantization | Approx. Tensor TFLOPS | Other notable features |
| --- | --- | --- | --- | --- | --- |
| Turing | T4, RTX 20 series | 1st gen | FP16, INT8 | ~65 TFLOPS (T4, FP16) | First Tensor Cores, RT cores for RTX |
| Ampere | A100, A10G, RTX 30 series | 3rd gen | TF32, FP16, BF16, INT8 | A100: ~312 TFLOPS (FP16) | TF32 precision mode for training |
| Ada Lovelace | L4, L40S, RTX 40 series | 4th gen | FP16, BF16, INT8, sparsity | L40S: ~180 TFLOPS (FP16) | Improved efficiency, higher clocks |
| Hopper | H100, H200 | Transformer Engine | FP8, FP16, BF16, INT8 | H100: ~1000 TFLOPS (FP8) | Dynamic mixed precision, FP8 |
| Blackwell | B200 | Next Transformer Engine | FP4, FP8, FP16, INT8 | B200: ~2000 TFLOPS (FP4/FP8 est.) | Even lower precision for LLMs |

| GPU | Arch | FP32 TFLOPs | Tensor TFLOPs (FP16/TF32/FP8) | VRAM | Special features |
| --- | --- | --- | --- | --- | --- |
| T4 | Turing | ~8 | ~65 (FP16/INT8) | 16 GB GDDR6 | Low-power (70 W), PCIe, 1st gen Tensor Cores, great for inference |
| A10G | Ampere | ~31 | ~125 (FP16) | 24 GB GDDR6 | 3rd gen Tensor Cores, PCIe, better for medium-sized inference/training |
| A100 | Ampere | ~19.5 (single chip) | ~312 (TF32/FP16) | 40 or 80 GB HBM2e | SXM & PCIe, HBM, 3rd gen Tensor Cores, Multi-Instance GPU (MIG) for partitioning |
| H100 | Hopper | ~60 | ~1000–1500 (FP8) | 80 GB HBM3 | Transformer Engine → native FP8/FP16, 4th gen Tensor Cores, NVLink/NVSwitch |
| H200 | Hopper+ | ~60 | ~1500–2000 (FP8) | 141 GB HBM3e | Faster HBM3e, same Hopper base, larger VRAM for massive LLMs |
| B200 | Blackwell | ~75 | ~4500 (FP4/FP8) | 192–228 GB HBM3e | Next-gen Transformer Engine, FP4 precision for ultra-large LLMs, NVLink 5, huge scale |

Ada Lovelace Architecture

Training LLMs on GPU Clusters

An open-source book covers this: starting from the basics, it walks you through the knowledge necessary to scale the training of large language models (LLMs) from one GPU to tens, hundreds, and even thousands of GPUs, illustrating theory with practical code examples and reproducible benchmarks.

check here

Resources

https://leetgpu.com/

Write, run and benchmark GPU code to solve 50+ challenges with free access to T4, A100, H100, H200 and B200 GPUs.

under the hood

GPU Inference provider