A GPU (Graphics Processing Unit) is a specialized processor designed to accelerate parallel workloads, especially operations on large matrices and vectors — which is exactly what modern deep learning models need.
Originally built for 3D graphics and rendering, GPUs are now heavily used in:
- Machine Learning (training/inference)
- Scientific computing
| Feature | CPU | GPU |
|---|---|---|
| Designed for | General-purpose tasks | Parallel computation (e.g., matrices) |
| Cores | Few (2–64 high-perf cores) | Thousands of smaller cores |
| Memory | Low-latency, low-bandwidth | High-bandwidth, high-latency |
| Control flow | Optimized for branching logic | Optimized for SIMD (Single Instruction Multiple Data) |
📌 In ML, think:
- CPU → control logic, training orchestration
- GPU → number crunching for tensors (matrix mult, conv, activation…); a quick timing comparison follows below
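To make that split concrete, here is a rough timing sketch (assuming PyTorch and a CUDA-capable GPU; exact numbers depend entirely on your hardware) that runs the same large matmul on CPU and GPU:

```python
# Rough timing sketch: the same matmul on CPU vs GPU (numbers are hardware-dependent).
import time
import torch

a = torch.randn(4096, 4096)
b = torch.randn(4096, 4096)

t0 = time.time()
c_cpu = a @ b                          # runs on a handful of CPU cores
print(f"CPU matmul: {time.time() - t0:.3f} s")

a_gpu, b_gpu = a.cuda(), b.cuda()
_ = a_gpu @ b_gpu                      # warm-up (CUDA context init, kernel selection)
torch.cuda.synchronize()

t0 = time.time()
c_gpu = a_gpu @ b_gpu                  # runs across thousands of GPU cores
torch.cuda.synchronize()               # GPU kernels are async; wait before reading the clock
print(f"GPU matmul: {time.time() - t0:.3f} s")
```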
Architecture of a GPU (Simplified)
+-----------------------------------------------------+
| GPU Die |
| +----------------+ +-----------------------------+ |
| | Control Logic | | SIMD Cores (hundreds) | |
| | (few cores) | | ALUs, Registers, etc. | |
| +----------------+ +-----------------------------+ |
| +----------------+ |
| | High-BW Memory Controller (e.g. GDDR6, HBM2) |
| +-------------------------------------------------+ |
+-----------------------------------------------------+
GPU internal blocks:
- SMs (Streaming Multiprocessors)
- CUDA cores
- Tensor Cores
- Cache hierarchy
- Memory (HBM, GDDR)
SM
Each SM has:
- CUDA cores (integer/floating point ALUs)
- Tensor Cores (for matrix ops)
- Registers, shared memory
- Scheduler (handles warps — groups of 32 threads in NVIDIA)
CUDA Cores
- The basic arithmetic units in NVIDIA GPUs.
- They do simple math: add, multiply, move data.
- Operate in SIMT model: Single Instruction, Multiple Threads.
- Threads run in warps (32 threads).
Tensor Cores
- Special cores for mixed-precision matrix math.
- Optimized for deep learning → multiply-add operations (matrix-matrix).
- Boost performance for FP16, INT8, TF32.
- E.g., NVIDIA Volta → Tensor Cores introduced.

- Tensor Cores do the matrix multiplications.
- CUDA Cores handle general arithmetic and supporting work, e.g., loading the needed weights from global memory so the Tensor Cores can multiply them (see the sketch below).
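As a small illustration (a sketch assuming an NVIDIA GPU and a recent PyTorch build), you can steer matmuls onto Tensor Cores by enabling TF32 or running under FP16 autocast:

```python
# Sketch: steering matmuls onto Tensor Cores from PyTorch (assumes an NVIDIA GPU).
import torch

# On Ampere and newer, TF32 lets FP32 matmuls run on Tensor Cores.
torch.backends.cuda.matmul.allow_tf32 = True

a = torch.randn(4096, 4096, device="cuda")
b = torch.randn(4096, 4096, device="cuda")

# Under FP16 autocast, matmuls run in half precision on Tensor Cores,
# while numerically sensitive ops are kept in FP32.
with torch.autocast(device_type="cuda", dtype=torch.float16):
    c = a @ b

print(c.dtype)   # torch.float16 inside the autocast region
```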
+----------------------+
| VRAM (DRAM) |
+----------------------+
|
v
+----------------------+
| Memory Controller |
+----------------------+
|
v
+----------------------+
| L2 Cache | <-- Large SRAM
+----------------------+
|
v
+-----------------------------------------------------------+
| GPU Chip |
| |
| +------------------+ +------------------+ |
| | SM (Streaming MP)| | SM (Streaming MP)| |
| |------------------| |------------------| |
| | +--------------+ | | +--------------+ | |
| | | Warp Sched | | | | Warp Sched | | <-- Schedulers |
| | +--------------+ | | +--------------+ | |
| | +--------------+ | | +--------------+ | |
| | | Registers | | | | Registers | | <-- SRAM |
| | +--------------+ | | +--------------+ | |
| | +--------------+ | | +--------------+ | |
| | | L1 Cache | | | | L1 Cache | | <-- SRAM |
| | +--------------+ | | +--------------+ | |
| | +--------------+ | | +--------------+ | |
| | | Shared Mem | | | | Shared Mem | | <-- SRAM |
| | +--------------+ | | +--------------+ | |
| | +--------------+ | | +--------------+ | |
| | | CUDA Cores | | | | CUDA Cores | | |
| | +--------------+ | | +--------------+ | |
| | +--------------+ | | +--------------+ | |
| | | Tensor Cores | | | | Tensor Cores | | |
| | +--------------+ | | +--------------+ | |
| +------------------+ +------------------+ |
| |
| ... more SMs ... |
| |
+-----------------------------------------------------------+
The entire LLM is loaded into VRAM (Video Random Access Memory), but the actual computations happen in much smaller, faster on-chip memory (SRAM).
- The full model weights (e.g., 7B parameters) are loaded into VRAM — this is your GPU’s GDDR6X, HBM, etc.
- Why? Because VRAM is big enough (GBs) to hold all those weights.
- VRAM acts as the main store for the model and any big activations or tensors.
Actual compute → on-chip SRAM (caches, registers, shared memory)
- When you run a prompt, the GPU:
  - Loads chunks of weights from VRAM into the on-chip L2 cache (SRAM).
  - Pushes the data down into L1 cache, shared memory, and registers inside the SMs.
  - The CUDA Cores or Tensor Cores then do the matrix math on these chunks (see the sketch below).
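A small sketch (assuming PyTorch and an NVIDIA GPU; the layer size is arbitrary) of watching the VRAM side of this from Python:

```python
# Small sketch: watching VRAM usage from PyTorch while weights live on the GPU.
import torch

device = "cuda"
layer = torch.nn.Linear(4096, 4096).to(device)   # ~67 MB of weights now sit in VRAM

x = torch.randn(64, 4096, device=device)
y = layer(x)                                     # the matmul itself runs out of on-chip SRAM/registers

print(f"allocated VRAM: {torch.cuda.memory_allocated(device) / 1e6:.1f} MB")
print(f"peak VRAM:      {torch.cuda.max_memory_allocated(device) / 1e6:.1f} MB")
```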
Major GPU Manufacturers
NVIDIA
- King of ML/AI
- CUDA, cuDNN, TensorRT
- Best framework support (PyTorch, TensorFlow, etc.)
- Dominates data center (H100, A100, L40, etc.)
AMD
- Competitive hardware (Radeon, Instinct)
- Uses ROCm (alternative to CUDA) for ML
- PyTorch ROCm support exists, but ecosystem is smaller
Intel
- New in discrete GPUs (Intel Arc, Xe)
- Has oneAPI, XPU, and Level Zero
- Still maturing in ML support
Apple
- M1/M2/M3 chips have built-in GPUs
- Uses Metal Performance Shaders (MPS) backend in PyTorch
GPU in ML Ecosystem
To use GPU in ML code:
- You need framework-level support: PyTorch, TensorFlow, etc.
- Those frameworks need backend support:
- torch.cuda → uses CUDA (NVIDIA)
- torch.xpu → uses XPU/Level Zero (Intel)
- torch.mps → uses Metal (Apple)
- AMD: ROCm builds of PyTorch reuse the torch.cuda interface (via HIP)
So just having a GPU isn’t enough. You need the right driver + runtime + compiler stack.
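A minimal sketch of picking whichever backend the installed PyTorch build actually supports (the torch.xpu check assumes a recent PyTorch with Intel GPU support):

```python
# Minimal sketch: pick whichever GPU backend this PyTorch build supports.
import torch

if torch.cuda.is_available():                               # NVIDIA CUDA, or AMD ROCm builds (exposed as "cuda")
    device = torch.device("cuda")
elif torch.backends.mps.is_available():                     # Apple Silicon (Metal)
    device = torch.device("mps")
elif hasattr(torch, "xpu") and torch.xpu.is_available():    # Intel GPUs (recent PyTorch builds)
    device = torch.device("xpu")
else:
    device = torch.device("cpu")                            # no supported GPU stack found

print(f"Using device: {device}")
x = torch.randn(2, 2, device=device)                        # tensor is created directly on that device
print(x)
```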
What we need
- GPU Hardware: e.g., NVIDIA RTX 4060, AMD Radeon RX 6700, Intel Arc A750
- GPU Driver: translates OS/system calls (like those from PyTorch) into instructions the GPU understands, e.g., nvidia-driver-550 (Ubuntu package)
- GPU Runtime API: the library/API layer that ML frameworks use to control the GPU (e.g., the CUDA runtime)
- Compiler/Kernel Stack: GPU kernels are like GPU "functions"; you need a compiler to build and dispatch them
| Layer | What Happens |
|---|---|
| ML Framework | torch.tensor → creates tensor on CPU, to("cuda") moves to GPU |
| PyTorch Build | Calls into torch._C (C++/CUDA bindings from PyTorch + cu121 build) |
| Compiler | Uses cuBLAS/cuDNN kernel for matmul, or JITs custom CUDA kernel |
| Runtime API | Calls cudaMalloc, cudaMemcpy, cudaLaunchKernel |
| Driver | Accepts kernel dispatch, manages memory on GPU |
| GPU | Executes parallel matmul on thousands of cores |
Python (high-level) → PyTorch → C++ ops / kernels (compiled ops) → CUDA API (runtime) → NVIDIA Driver → GPU (hardware)
| Vendor | Runtime | Role |
|---|---|---|
| NVIDIA | CUDA | Handles memory allocation, kernel execution, tensor ops |
| Intel | oneAPI Level Zero, DPC++ | Cross-device dispatch |
| AMD | ROCm | Includes HSA, HIP, and drivers |
| Apple | Metal Performance Shaders (MPS) | On Apple Silicon |
Key GPU Specs That Matter (for ML)
| Spec | Meaning |
|---|---|
| CUDA Cores / Stream Processors | Parallel execution units |
| Memory Size (VRAM) | How large your tensors can be |
| Memory Type | GDDR6 / HBM2 / shared (affects bandwidth) |
| Tensor Cores | NVIDIA-only: optimized for matrix ops |
| FP32 / FP16 / BF16 | Floating-point precision support |
| Bandwidth | Speed of data movement |
| ECC Memory | Error checking — important in training stability |
What Happens When You Use GPU in PyTorch?
When you do:
tensor = tensor.to("cuda")
this happens:
- PyTorch copies the tensor from CPU RAM to GPU VRAM
- The GPU kernel (written in CUDA C++) runs your operation (e.g., matmul)
- Results stay in GPU memory unless you explicitly move them back (e.g., with .cpu())
This is where CUDA / XPU / ROCm / MPS drivers come into play.
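A small sketch of that round trip in PyTorch (assumes a CUDA device is available):

```python
# Sketch of the round trip above: CPU tensor -> GPU VRAM -> matmul kernel -> back to CPU.
import torch

a = torch.randn(1024, 1024)          # lives in CPU RAM
b = torch.randn(1024, 1024)

a_gpu = a.to("cuda")                 # copied into GPU VRAM
b_gpu = b.to("cuda")

c_gpu = a_gpu @ b_gpu                # CUDA matmul kernel runs on the GPU
c_cpu = c_gpu.cpu()                  # explicit copy back to CPU RAM

print(c_gpu.device, c_cpu.device)    # cuda:0 cpu
```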
CUDA Graph
Instead of launching kernels and memory copies one by one (like a regular program does), CUDA Graphs record the entire sequence of operations, and then replay them as a single unit.
CUDA Graphs are DAGs (Directed Acyclic Graphs). Each node in the graph is:
- A kernel launch
- A memory operation
- A synchronization operation
The edges represent dependencies (e.g., op B must wait for op A to finish).
A decoding step in a transformer model like LLaMA 3 runs something like:
- Load KV cache
- Run attention kernel
- Run MLP kernel
- Write back to KV cache
- Generate logits
Without CUDA Graph:
Each of those steps is launched individually, and each launch pays a small CPU→GPU dispatch overhead. This gets expensive for small batches or low-latency apps.
With CUDA Graph:
We record this entire sequence once, for a specific batch size (say 8), and then next time:
cudaGraphLaunch(graphExec, stream); // ONE CALL replaces five kernel launches
CUDA Graphs are batch-size-specific. If you record a graph for batch size 8, you can’t reuse it for batch size 12 without re-recording.
Hence, frameworks (like TensorRT-LLM or vLLM) use a CUDA_GRAPH_MAX_BATCH_SIZE parameter. If your incoming request size ≤ that, you can use the prebuilt graph.
If we go over that size, the graph isn’t used, and you fall back to traditional launch mode (with higher latency).
Note: SGLang also uses this technique.
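For illustration, here is a minimal sketch of the record-and-replay pattern using PyTorch's torch.cuda.CUDAGraph API (available since PyTorch 1.10); the tiny linear layer and the fixed batch size of 8 are placeholders, not a real decoder:

```python
# Minimal sketch of record-and-replay with torch.cuda.CUDAGraph (PyTorch >= 1.10).
import torch

model = torch.nn.Linear(512, 512).cuda().eval()
static_input = torch.zeros(8, 512, device="cuda")    # graphs are captured for a fixed shape

# Warm-up on a side stream is required before capture.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s), torch.no_grad():
    for _ in range(3):
        static_output = model(static_input)
torch.cuda.current_stream().wait_stream(s)

# Record the whole sequence of kernel launches once.
g = torch.cuda.CUDAGraph()
with torch.no_grad(), torch.cuda.graph(g):
    static_output = model(static_input)

# Replay: copy new data into the captured input buffer, then launch the graph.
static_input.copy_(torch.randn(8, 512, device="cuda"))
g.replay()                                           # one call replays all captured kernels
print(static_output.shape)                           # results land in the captured output buffer
```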
nvidia-smi
Shows NVIDIA GPU details: driver version, GPU name, memory usage, etc.
nvidia-smi
Query GPU memory, name, and utilization only
nvidia-smi --query-gpu=name,memory.total,memory.used,utilization.gpu --format=csv
See detailed hardware + performance data
nvidia-smi -q
LLMs like GPT-3 or GPT-4 are way too big for a single GPU.
You must learn:
- Data Parallelism: Split data across GPUs.
- Model Parallelism: Split model layers across GPUs.
- Pipeline Parallelism: Split forward/backward pass stages.
Frameworks: DeepSpeed, Megatron-LM, NCCL (NVIDIA’s lib for multi-GPU comms).
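As a minimal sketch of data parallelism (the simplest of the three), here is a DistributedDataParallel training loop. The model, shapes, and the filename ddp_sketch.py are placeholders; it assumes a single node with at least two NVIDIA GPUs and is launched via torchrun:

```python
# Minimal sketch of data parallelism with DistributedDataParallel (DDP).
# Launch with:  torchrun --nproc_per_node=2 ddp_sketch.py
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")        # NCCL handles GPU-to-GPU communication
    local_rank = int(os.environ["LOCAL_RANK"])     # set by torchrun
    torch.cuda.set_device(local_rank)

    model = nn.Linear(1024, 1024).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])    # gradients are all-reduced across GPUs
    opt = torch.optim.SGD(model.parameters(), lr=1e-3)

    for _ in range(10):
        # Each rank works on its own shard of the data (random here for brevity).
        x = torch.randn(32, 1024, device=f"cuda:{local_rank}")
        loss = model(x).pow(2).mean()
        opt.zero_grad()
        loss.backward()                            # backward triggers the gradient all-reduce
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```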
How Much Space Is Needed to Load Model Parameters
Each parameter is usually a 32-bit float → 4 bytes. Say the model has 82M parameters:
82 million × 4 bytes = ~328 MB.
- Activations, intermediate buffers, and other overhead come on top.
Model size (number of parameters). Larger models need more memory. Models with tens or hundreds of billions of parameters usually require high-end GPUs like NVIDIA H100 or H200.
Bit precision. The precision used (e.g., FP16, FP8, INT8) affects memory consumption. Lower precision formats can significantly reduce memory footprint, but may have accuracy drops.
A rough formula to estimate how much memory is needed to load an LLM is:
Memory (GB) = P * (Q / 8) * (1 + Overhead)
- P: Number of parameters (in billions)
- Q: Bit precision (e.g., 16, 32), division by 8 converts bits to bytes
- Overhead (%): Additional memory or temporary usage during inference (e.g., KV cache, activation buffers, optimizer states)
For example, to load a 70B model in FP16 with 20% overhead, you need around 168 GB of GPU memory:
Memory = 70 × (16 / 8) × 1.2 = 168 GB
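A tiny helper implementing this rule of thumb (an estimate only; actual usage also depends on KV cache size, sequence length, batch size, and framework overhead):

```python
# Hedged helper for the rule-of-thumb formula above.
def estimate_llm_memory_gb(params_billion: float, bits: int, overhead: float = 0.20) -> float:
    """Approximate GPU memory (GB) needed to load a model for inference."""
    return params_billion * (bits / 8) * (1 + overhead)

print(estimate_llm_memory_gb(70, 16))                     # ~168.0 GB: 70B model in FP16, 20% overhead
print(estimate_llm_memory_gb(7, 16))                      # ~16.8 GB: 7B model in FP16
print(estimate_llm_memory_gb(0.082, 32, overhead=0.0))    # ~0.328 GB ≈ 328 MB: 82M params in FP32
```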
NVIDIA GPU Types
NVIDIA names its datacenter/server GPUs using:
- Letter → often indicates the product family or target market:
- T → Turing generation, general-purpose inference (e.g., T4)
- A → Ampere generation, datacenter & workstation (e.g., A100, A10G, A6000)
- H → Hopper generation, next-gen datacenter (e.g., H100, H200)
- B → Blackwell generation, future datacenter flagship (e.g., B200)
- L → Ada Lovelace generation for datacenter inference/visualization (e.g., L4, L40S)
- RTX → Consumer cards (e.g., RTX 4090, 5090) → designed for gaming, also used for prosumer AI.
- Quadro (older) → used for pro workstation branding → replaced by A-series (A6000).
| Code Name | Year | Typical Family | Special feature |
|---|---|---|---|
| Turing | 2018 | T4, Turing RTX cards | 2nd gen Tensor Cores (1st gen was Volta) |
| Ampere | 2020 | A100, A10G, RTX 30 series | 3rd gen Tensor cores, TF32 |
| Ada Lovelace | 2022/23 | L4, L40S, RTX 40 series | 4th gen Tensor, better efficiency |
| Hopper | 2022 | H100, H200 | Transformer Engine (FP8) |
| Blackwell | 2024+ | B200 | Next Transformer Engine, FP4 |
| Architecture | Example GPUs | Tensor Core Gen | Special Precision / Quantization | Approx. Tensor TFLOPS | Other Notable Features |
|---|---|---|---|---|---|
| Turing | T4, RTX 20 series | 2nd Gen | FP16, INT8 | ~65 TFLOPS (T4, FP16) | RT cores introduced for RTX |
| Ampere | A100, A10G, RTX 30 series | 3rd Gen | TF32, FP16, BF16, INT8 | A100: ~312 TFLOPS (FP16) | TF32 precision mode for training |
| Ada Lovelace | L4, L40S, RTX 40 series | 4th Gen | FP16, BF16, INT8, sparsity | L40S: ~180 TFLOPS (FP16) | Improved efficiency, higher clocks |
| Hopper | H100, H200 | Transformer Engine | FP8, FP16, BF16, INT8 | H100: ~1000 TFLOPS (FP8) | Dynamic mixed-precision, FP8 |
| Blackwell | B200 | Next Transformer Engine | FP4, FP8, FP16, INT8 | B200: ~2000 TFLOPS (FP4/FP8 est.) | Even lower precision for LLMs |
| GPU | Arch | FP32 TFLOPs | Tensor TFLOPs (FP16/TF32/FP8) | VRAM (GB) | Special Features |
|---|---|---|---|---|---|
| T4 | Turing | ~8 TFLOPs | ~65 TFLOPs (FP16/INT8) | 16 GB GDDR6 | Low-power (70W), PCIe, 1st gen Tensor Cores, great for inference |
| A10G | Ampere | ~31 TFLOPs FP32 | ~125 TFLOPs (FP16) | 24 GB GDDR6 | 3rd gen Tensor Cores, PCIe, better for medium-sized inference/training |
| A100 | Ampere | ~19.5 TFLOPs FP32 (single chip) | ~312 TFLOPs (TF32/FP16) | 40 or 80 GB HBM2e | SXM & PCIe, HBM, 3rd gen Tensor Cores, Multi-Instance GPU (MIG) for partitioning |
| H100 | Hopper | ~60 TFLOPs FP32 | ~1000–1500 TFLOPs (FP8) | 80 GB HBM3 | Transformer Engine → native FP8/FP16, 4th gen Tensor Cores, NVLink/NVSwitch |
| H200 | Hopper+ | ~60 TFLOPs FP32 | ~1500–2000 TFLOPs (FP8) | 141 GB HBM3e | Faster HBM3e, same Hopper base, larger VRAM for massive LLMs |
| B200 | Blackwell | ~75 TFLOPs FP32 | ~4500 TFLOPs (FP4/FP8) | 192–228 GB HBM3e | Next-gen Transformer Engine, FP4 precision for ultra-large LLMs, NVLink 5, huge scale |
![Ada Lovelace architecture]()
Training LLMs on GPU Clusters
This open-source book is here to change that. Starting from the basics, it walks you through the knowledge necessary to scale the training of large language models (LLMs) from one GPU to tens, hundreds, and even thousands of GPUs, illustrating theory with practical code examples and reproducible benchmarks.
check here
Resources
- https://apxml.com/posts/how-to-calculate-vram-requirements-for-an-llm
- https://www.bentoml.com/blog/nvidia-data-center-gpus-explained-a100-h200-b200-and-beyond
- Inside NVIDIA GPUs: Anatomy of high performance matmul kernels
- Write, run and benchmark GPU code to solve 50+ challenges with free access to T4, A100, H100, H200 and B200 GPUs.
