An inference engine is a specialized runtime designed to execute trained models efficiently on hardware. It sits between a trained model, usually stored in its framework's native format (e.g., PyTorch .pt or a TensorFlow SavedModel), and a serving framework such as NVIDIA Triton Inference Server, FastAPI, TorchServe, BentoML, or TensorFlow Serving.
An inference engine optimizes a model to balance latency, throughput, precision, and efficiency in a production scenario.
Commonly, inference engines do one or more of the following:
- Graph Optimization: analyzes the model's computational graph and applies optimizations to reduce its size and depth, fusing layers or removing redundant computations.
- Hardware-Specific Optimization: compiles models for target hardware such as CPUs, GPUs, TPUs, or custom accelerators, selecting highly tuned compute kernels for each architecture.
- Lowering Precision: reduces the memory footprint by quantizing layers’ precision (e.g., FP32 → FP16 → INT8/INT4).
- Model Pruning & Sparsity: prunes redundant weights or exploits sparsity in weight matrices.
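As a toy illustration of graph-level fusion (not from the original text): folding a BatchNorm-style normalization into a preceding linear op, so two operations collapse into one. The scalar `fuse_linear_bn` helper below is a hypothetical sketch, not any engine's real API.

```python
import math

def fuse_linear_bn(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold y = gamma * ((w*x + b) - mean) / sqrt(var + eps) + beta
    into a single affine op y = w_fused * x + b_fused (scalar sketch)."""
    scale = gamma / math.sqrt(var + eps)
    return w * scale, (b - mean) * scale + beta

# The fused op produces the same output with one multiply-add instead of two
w_f, b_f = fuse_linear_bn(w=2.0, b=1.0, gamma=0.5, beta=0.1, mean=0.2, var=1.0)
x = 3.0
fused_out = w_f * x + b_f
```

Real graph optimizers (TensorRT, ONNX Runtime, torch.compile) apply the same idea at tensor level, plus kernel fusion and dead-node elimination.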
Inference Engines and Inference Frameworks
- ONNX and ONNX Runtime
- TensorRT, TensorRT-LLM
- vLLM, vLLM + LMCache
- vLLM + Ray
- llama.cpp
- ggml (the library underlying llama.cpp and whisper.cpp): a machine learning (ML) library written in C and C++ with a focus on Transformer inference
- Ollama
- NVIDIA Triton Inference Server
- HuggingFace TGI (Text-Generation Inference)
- CoreML
- OpenVINO, OpenVINO GenAI for Intel Hardware
- WebLLM is a high-performance in-browser LLM inference engine that brings language model inference directly onto web browsers with hardware acceleration
- MLC LLM compiles and runs code on MLCEngine — a unified high-performance LLM inference engine across the platforms
Distributed GenAI Inference Frameworks
1. NVIDIA Dynamo
2. vLLM + llm-d (Kubernetes)
3. AIBrix
4. Mojo and Mojo MAX Engine
The “frontend” of the PyTorch 2 compiler (TorchDynamo) attempts to turn Python bytecode into a computation graph.

Ray Serve
Ray Serve scales each model independently based on its traffic. This applies across multiple nodes and models.
When traffic increases, Serve automatically adds replicas of models and can even start new instances to meet demand, scaling up models receiving traffic.
Crucially, Serve does not scale models that are not receiving traffic, preventing wasted GPU resources.
When traffic decreases, Serve automatically downscales models and removes extra nodes, saving GPU resources during off-peak hours. This simple optimization can lead to roughly 50% cost savings compared to allocating a fixed number of GPUs for peak traffic.
Model Layer Optimizations with Ray Serve
Imagine you have a language model that needs to generate completions for three different prompts. The prompts and their expected token lengths are as follows:
- Prompt 1: “2+2=4” → 10 tokens
- Prompt 2: “How are neutron stars formed?” → 100 tokens
- Prompt 3: “What is the capital of France?” → 12 tokens
Let’s assume the GPU can handle 2 sequences at a time. Now let’s see how static batching and continuous batching would handle this.
In static batching, you would put all three sequences into one batch. The model starts processing the batch, and it needs to wait for the longest sequence (Prompt 2, with 100 tokens) to finish.
- Prompt 1 (10 tokens) finishes quickly. But the GPU cannot process anything else yet because it’s waiting for the longest sequence.
- Prompt 3 (12 tokens) finishes next, but again, the GPU still waits for Prompt 2 to finish.
- Finally, Prompt 2 (100 tokens) finishes after a long time, and only then can the batch be considered complete.

- Continuous Batching: Instead of waiting for the entire batch to finish, in continuous batching, after each forward pass, the GPU processes one sequence and replaces it with a new sequence that’s waiting to be processed. So, after processing Prompt 1 (which finishes quickly), the GPU immediately starts processing Prompt 3.
- https://huggingface.co/blog/continuous_batching
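The walkthrough above can be sketched as a toy scheduler in plain Python, counting GPU forward passes only (no real model; `static_batching_steps` and `continuous_batching_steps` are hypothetical helpers):

```python
def static_batching_steps(lengths, capacity):
    # Each batch of `capacity` sequences runs until its longest member finishes
    steps = 0
    for i in range(0, len(lengths), capacity):
        steps += max(lengths[i:i + capacity])
    return steps

def continuous_batching_steps(lengths, capacity):
    # A finished sequence's slot is refilled immediately from the waiting queue
    pending = list(lengths)
    active = []
    steps = 0
    while pending or active:
        while pending and len(active) < capacity:
            active.append(pending.pop(0))
        steps += 1
        active = [n - 1 for n in active if n > 1]
    return steps

static_total = static_batching_steps([10, 100, 12], capacity=2)          # 112
continuous_total = continuous_batching_steps([10, 100, 12], capacity=2)  # 100
```

With the three prompts and a capacity of 2, static batching needs 112 steps (the 12-token prompt must wait for a whole new batch), while continuous batching slots it in as soon as the 10-token prompt finishes.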

- Ray Serve combines many of these open-source optimizations in a single library.
Speculative decoding
Speculative decoding is an optimization technique designed to reduce the time it takes for a language model to generate each token, making it faster and more efficient. The core idea behind speculative decoding is that predicting tokens in a sequence isn’t equally difficult for every token. Some tokens are easier to predict based on the preceding context, while others might require more computational power and time to determine.
To speed up token generation, speculative decoding uses a smaller, faster model (referred to as a “draft” model). For example, a smaller 7-billion parameter model can be up to 10 times faster than a much larger 70-billion parameter model. The smaller model is tasked with predicting a few tokens ahead (let’s say K tokens), which is much faster than waiting for the large model to generate tokens one by one.
Once the smaller model makes predictions, the larger model (which has more capacity and accuracy) steps in to verify the correctness of the predicted tokens. If the smaller model’s predictions are correct, the larger model can immediately accept those tokens and even generate multiple tokens in a single forward pass. This means that the large model doesn’t have to compute every token individually, which reduces the overall time it takes to generate the output.
Note: even though the large model is still used for verification, checking K draft tokens in a single forward pass is faster than generating them one by one.
SGLang has implemented this approach in the EAGLE-3 framework.
Example
Traditional Speculative Decoding:
- Draft model: Small model (e.g. LLaMA 1B)
- Target model: Full model (e.g. LLaMA 7B/8B)
- The draft model generates k tokens.
- The target model checks if those k tokens are correct.
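The accept/reject step can be sketched with greedy verification in plain Python (`verify_draft` is a hypothetical helper; real implementations compare token distributions and use rejection sampling, not just argmax):

```python
def verify_draft(draft_tokens, target_choices):
    """Accept draft tokens while they match the target model's greedy picks;
    on the first mismatch, take the target's token and stop."""
    accepted = []
    for draft, target in zip(draft_tokens, target_choices):
        if draft == target:
            accepted.append(draft)
        else:
            accepted.append(target)  # correction token from the target model
            break
    return accepted

# Target agrees on the first two tokens, then disagrees on the third
accepted = verify_draft(["Why", "did", "the", "chicken"],
                        ["Why", "did", "a", "chicken"])  # → ["Why", "did", "a"]
```

In the best case all K draft tokens are accepted, so one target forward pass yields K+1 tokens.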
EAGLE-3: reuse the first few layers of the target model as the draft model.
No need to load a separate model; just run:
- Layers 0–N for draft
- Layers N+1–L for verification
This saves memory and reduces latency + I/O overhead.
- A full LLaMA 8B model with 32 layers
- We configure Eagle 3 to use layers 0–7 as the draft
- We’ll decode up to 4 tokens per speculative round
User prompt: "Tell me a joke"
with torch.no_grad():
    hidden_state = model.run_layers(input_ids, layer_range=(0, 7))  # layers 0–7 (draft pass)
    logits = head(hidden_state[-1])  # LM head on the output of layer 7
    draft_tokens = sample_tokens(logits, top_k=20, num_tokens=4)
Now we have 4 speculative tokens:
["Why", "did", "the", "chicken"]
Now we pass the combined input (prompt + draft_tokens) through only layers 8–32:
hidden_state = model.run_layers(
    input_ids=concatenate(prompt, draft_tokens),
    layer_range=(8, 32),
    reuse_cache=True,
)
logits = head(hidden_state[-1])
Then we check, position by position, whether the target model’s greedy choice matches the draft:
- argmax(logits[0]) == draft_tokens[0]?
- argmax(logits[1]) == draft_tokens[1]?
┌──────────────────────────────┐
Prompt → │ LLaMA layers 0–7 (Draft pass) │
└────────────┬────────────────┘
│ Sample top-K → draft tokens: ["Why", "did", "the", "chicken"]
↓
┌──────────────────────────────┐
│ LLaMA layers 8–32 (Verify) │ ← prompt + draft tokens
└────────────┬────────────────┘
↓
Check if target logits match draft
✔ If match → accept
✘ If mismatch → reject and resume
RayLLM is built on top of vLLM.

- Floating point allows us to represent very large and very small numbers with a compact number of bits.
- The number is split into:
- Sign bit → whether it’s + or -
- Exponent bits → how big or small (the “scale”)
- Mantissa bits (or significand) → the actual significant digits
Float = (-1)^Sign × 1.Mantissa × 2^(Exponent − bias)
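To make the formula concrete, here is a standard-library-only sketch that pulls apart the bit fields of an IEEE 754 half-precision (FP16) number and puts them back together (valid for normal, non-zero values):

```python
import struct

def decompose_fp16(value):
    # Round-trip through the 'e' (binary16) format, then slice the 16 bits
    (bits,) = struct.unpack("<H", struct.pack("<e", value))
    sign = bits >> 15                # 1 sign bit
    exponent = (bits >> 10) & 0x1F   # 5 exponent bits (biased by 15)
    mantissa = bits & 0x3FF          # 10 mantissa bits
    return sign, exponent, mantissa

def recompose_fp16(sign, exponent, mantissa):
    # Normal numbers: (-1)^sign * (1 + mantissa/2^10) * 2^(exponent - 15)
    return (-1) ** sign * (1 + mantissa / 1024) * 2 ** (exponent - 15)
```

For 1.5, the fields come out as sign 0, biased exponent 15, mantissa 512, i.e. (1 + 512/1024) × 2⁰.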
Why integer?
- Integer quantization (like INT4) is simpler → no exponent → fixed range → more compact → but less flexible for values that vary a lot in scale.
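A minimal sketch of symmetric per-tensor integer quantization (INT8 shown; INT4 would map onto ±7 instead of ±127). `quantize_int8` and `dequantize` are hypothetical helper names:

```python
def quantize_int8(weights):
    # Symmetric scheme: map [-max|w|, +max|w|] onto the integer range [-127, 127]
    scale = max(abs(w) for w in weights) / 127
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    # Recover approximate floats; per-weight error is bounded by scale / 2
    return [v * scale for v in q]

q, scale = quantize_int8([0.6, -1.0, 0.25])
restored = dequantize(q, scale)
```

One shared scale per tensor is what makes integers "less flexible": a single outlier weight stretches the range and wastes precision on everything else, which is why production schemes quantize per-channel or per-group.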
| Datatype | Sign Bits | Mantissa Bits | Exponent Bits | Dynamic Range | Precision | Typical Use | Quality |
|---|---|---|---|---|---|---|---|
| Float32 (FP32) | 1 | 23 | 8 | Very wide | Very high | Training | Excellent |
| BFloat16 (BF16) | 1 | 7 | 8 | Same as FP32 | Medium | Training (TPUs, GPUs) | Excellent |
| Float16 (FP16) | 1 | 10 | 5 | Smaller than FP32 | Good | Mixed precision training & inference | Very good |
| FP8 (E5M2) | 1 | 2 | 5 | Same range as FP16 | Low | Inference (Hopper, Blackwell) | Very good |
| NF4 (Normal Float 4) | 0 | 4 | 0 | None (fixed scale) | Good | Quantized inference | Good |
| FP4 | 1 | 1 | 2 | Small | Very low | Research-stage inference | Good (suspected) |
| INT4 | 0 | 4 | 0 | None | Very low | Final quantized inference | OK |

- Different GPUs support different floating-point formats, each with its own operations-per-second throughput.
Inference Optimization
KV caching
Before a language model like GPT can understand or generate text, raw input text must be transformed into a format it understands: tokens.
Tokenization
When we give GPT a sentence like:
text = "The quick brown fox"
inputs = tokenizer(text, return_tensors="pt")
The tokenizer:
- Breaks the sentence into subword units (tokens).
- Maps them to integers using a pre-trained vocabulary (e.g., 50,257 entries in GPT-2).
- Returns:
{
    'input_ids': tensor([[464, 2068, 1212, 4417]]),
    'attention_mask': tensor([[1, 1, 1, 1]])
}
Each token ID (like 464) corresponds to a word or part of a word.
What Happens Inside the Model
Once the input is tokenized, it’s passed to the model:
outputs = model(**inputs)
logits = outputs.logits
The output contains one score (logit) for each of the 50,257 vocabulary entries at every input position.
| Text | vocab token 1 | vocab token 2 |
|---|---|---|
| input text 1 | 10.39 | 10.9 |
| input text 2 | 10.31 | 9.8 |
Output of the Model
logits shape: (batch_size, seq_len, vocab_size)
- Each position contains a vector of size equal to the vocabulary size.
- These are unnormalized scores for predicting the next token.
Example:
logits.shape = torch.Size([1, 4, 50257])
Here:
- 1 = batch size
- 4 = sequence length
- 50257 = number of possible tokens
You can get the most probable next token using:
last_logits = logits[0, -1, :]
next_token_id = last_logits.argmax()
Vocabulary Explained
The model has a fixed vocabulary (e.g., 50,257 tokens in GPT-2), learned during training.
- Tokens can represent words, subwords, punctuation, or whitespace.
- The model only generates tokens from this fixed vocabulary.
- It cannot invent new words or characters; it can only output sequences of known tokens.
tokenizer.decode(464) ➝ 'The'
tokenizer.decode(next_token_id) ➝ ' lazy'
Manual Token Generation: One-by-One
You can simulate generation manually:
for _ in range(10):
    next_token_id = generate_token(inputs)
    inputs = update_inputs(inputs, next_token_id)
    generated_tokens.append(tokenizer.decode(next_token_id))
This is auto-regressive generation: each new token depends on the full previous sequence.
Why It’s Slow Without Caching
Each time we generate a new token:
- We re-feed the entire input sequence to the model.
- Attention layers recompute everything from scratch.
- This becomes increasingly expensive as the input grows.
Key-Value Caching (KV Caching)
Each transformer attention layer internally computes:
- Keys (K) and Values (V) for every token.
These are used to compute attention over the sequence:
Attention(Q, K, V) = softmax(QKᵀ / √d) · V
The trick: once we’ve computed K and V for earlier tokens, we don’t need to recompute them.
Example with KV-Caching
# First step
next_token_id, past_key_values = generate_token_with_past(inputs)
# Following steps
inputs = {
    "input_ids": next_token_id.reshape(1, 1),
    "attention_mask": updated_mask,
    "past_key_values": past_key_values,
}
Now the model only computes attention for the new token; the old tokens are already cached.
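The savings can be counted with a back-of-the-envelope sketch (hypothetical helpers; the unit is per-token K/V computations in a single layer):

```python
def kv_cost_no_cache(prompt_len, new_tokens):
    # Every decode step re-encodes the entire sequence seen so far
    return sum(prompt_len + i for i in range(new_tokens))

def kv_cost_with_cache(prompt_len, new_tokens):
    # Prompt K/V computed once during prefill; each decode step adds one token
    return prompt_len + new_tokens

no_cache = kv_cost_no_cache(prompt_len=4, new_tokens=10)      # 85
with_cache = kv_cost_with_cache(prompt_len=4, new_tokens=10)  # 14
```

The uncached cost grows quadratically with output length, while the cached cost grows linearly, which is why KV caching is essentially mandatory for long generations.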
Prefix caching
- During prefill, the model performs a forward pass over the entire input and builds up a key-value (KV) cache for attention computation.
- During decode, the model generates output tokens one by one, using the cached states from the prefill stage. The attention mechanism computes a matrix of token interactions. The resulting KV pairs for each token are stored in GPU memory.
- For a new request with a matching prefix, you can skip the forward pass for the cached part and directly resume from the last token of the prefix.
Important: this works only when the prefix is exactly identical, including whitespace and formatting. Even a single-character difference breaks the cache.
For example, consider a chatbot with this system prompt:
You are a helpful AI writer. Please write in a professional manner.
This prompt doesn’t change from one conversation to the next. Instead of recalculating it every time, you store its KV cache once. Then, when new messages come in, you reuse this stored prefix cache, only processing the new part of the prompt.
Prefix caching can reduce compute and latency by an order of magnitude in some use cases.
- Anthropic Claude Sonnet offers prompt caching with up to 90% cost savings and 85% latency reduction for long prompts.
- Google Gemini discounts cached tokens and charges for storage separately.
- Frameworks like vLLM, TensorRT-LLM, and SGLang support automatic prefix caching for different open-source LLMs.
In agent workflows, the benefit is even more pronounced. Some use cases have input-to-output token ratios of 100:1, making the cost of reprocessing large prompts disproportionately high.
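The exact-match requirement can be illustrated with a small sketch (`split_on_cached_prefix` is a hypothetical helper; real engines match at the KV-block level rather than on raw strings):

```python
def split_on_cached_prefix(prompt, cached_prefixes):
    """Return (reused, to_compute): reuse the longest prefix that matches
    the prompt character-for-character; otherwise recompute everything."""
    best = ""
    for prefix in cached_prefixes:
        if prompt.startswith(prefix) and len(prefix) > len(best):
            best = prefix
    return best, prompt[len(best):]

system = "You are a helpful AI writer. Please write in a professional manner.\n"
cache = [system]

reused, rest = split_on_cached_prefix(system + "Summarize this article.", cache)
# Changing even the casing of the system prompt breaks the match entirely
miss, _ = split_on_cached_prefix(system.lower() + "Summarize this article.", cache)
```

In the hit case only the user message needs a forward pass; in the miss case the whole prompt is reprocessed.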
Input Batching
Batching means feeding multiple input sequences to the model at the same time instead of one-by-one.
Suppose we have multiple prompts:
prompts = [
"The quick brown fox jumped over the",
"The rain in Spain falls",
"What comes up must",
]
We tokenize them together:
inputs = tokenizer(prompts, padding=True, return_tensors="pt")
This returns a tensor of shape [batch_size, seq_len], where:
- Each row is a tokenized prompt.
- All rows are padded to match the longest prompt.
input_ids:
tensor([[ 464, 2068, 1212, 4417,  716,  262],
        [ 464, 1619,  287,  743, 5792,    0],
        [1332, 1110,  389, 5016,    0,    0]])
In transformers, each token is aware of its position in the sequence using position_ids.
With padding, we must offset position IDs so they begin at 0 after the padding:
position_ids = attention_mask.cumsum(-1) - 1
position_ids.masked_fill_(attention_mask == 0, 1)
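In plain Python, the same offsetting trick looks like this (mirroring `cumsum(-1) - 1` with padded positions filled with 1):

```python
def compute_position_ids(attention_mask):
    # Real tokens get positions 0, 1, 2, ...; padded slots are pinned to 1
    out = []
    for row in attention_mask:
        seen, ids = 0, []
        for m in row:
            seen += m
            ids.append(seen - 1 if m else 1)
        out.append(ids)
    return out

# Left-padded sequence: two pad tokens, then three real tokens
ids = compute_position_ids([[0, 0, 1, 1, 1]])  # → [[1, 1, 0, 1, 2]]
```

The value used for padded slots is arbitrary (those positions are masked out of attention anyway); what matters is that real tokens start at position 0.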
Continuous batching
Refer to the Continuous Batching discussion in the Ray Serve section above.
Dynamo
A Datacenter Scale Distributed Inference Serving Framework
TensorRT-LLM
TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that orchestrate the inference execution in a performant way.
ZML
ZML simplifies model serving, ensuring peak performance and maintainability in production.
vLLM
Paged Attention
Assume we have:
- A GPU
- One LLM
- One request
- You generate tokens one by one
For every token, at every layer, you must store two tensors:
- K (keys)
- V (values)
Because future tokens must attend to past tokens.
So after 5 tokens, memory looks like:
Token 0 → K0, V0
Token 1 → K1, V1
Token 2 → K2, V2
Token 3 → K3, V3
Token 4 → K4, V4
So far so good.
Now imagine:
- 100 users
- Each has a different prompt length
- Each generates tokens at different speeds
- All share one GPU
GPU memory must now hold KV for many sequences, all growing at different times.
The naive approach: “For each request, store KV in one contiguous tensor.”
But this breaks because:
- Sequences grow token by token
- GPU memory does not like resizing
- Copying tensors every token is very expensive
CUDA block
In vLLM, a CUDA block is simply:
A fixed-size chunk of GPU memory that can store KV for a small number of tokens
Example:
- One block stores KV for 16 tokens
- Size never changes
- All blocks are identical
Visualize GPU memory like this:
[ Block 0 ][ Block 1 ][ Block 2 ][ Block 3 ][ Block 4 ] ...
Now take one user request.
Token generation starts:
- Token 0: put its KV into Block 0
- Token 1: still space → same block
- ...
- Token 15: Block 0 is now full
- Token 16: get a new empty block and write the KV there
So the sequence now owns:
Sequence A → [ Block 0, Block 5 ]
- Blocks are not required to be adjacent
- They can be anywhere in GPU memory
Attention needs:
“Give me all past K and V for this sequence.”
But memory is not continuous.
So vLLM’s attention kernel does:
for each block in block_table:
read K, V from that block
It stitches the blocks together logically, without copying.
This is why it’s called Paged Attention.
What happens when a sequence finishes
- Returns all blocks used by that sequence
- Marks them as free
Memory reuse is instant. No cleanup cost.
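The block bookkeeping above can be sketched as a toy allocator (`PagedKVAllocator` is a hypothetical class; vLLM's real allocator additionally handles copy-on-write sharing, swapping, and eviction):

```python
class PagedKVAllocator:
    """Toy vLLM-style KV allocator: fixed-size blocks, per-sequence block tables."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free = list(range(num_blocks))   # every block starts free
        self.tables = {}                      # seq_id -> list of block ids
        self.lengths = {}                     # seq_id -> tokens stored

    def append_token(self, seq_id):
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:          # current block full (or first token)
            self.tables.setdefault(seq_id, []).append(self.free.pop(0))
        self.lengths[seq_id] = n + 1

    def finish(self, seq_id):
        # Return every block to the free pool; reuse is instant
        self.free.extend(self.tables.pop(seq_id))
        del self.lengths[seq_id]

alloc = PagedKVAllocator(num_blocks=4)
for _ in range(17):                 # 17 tokens span two 16-token blocks
    alloc.append_token("seq-A")
blocks_used = list(alloc.tables["seq-A"])
alloc.finish("seq-A")
```

The block table (`tables["seq-A"]`) is exactly what the paged attention kernel walks to stitch non-contiguous blocks together logically.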
safetensors
The .safetensors file format stores the trained neural network weights—the billions of learned parameters that define the model’s behavior.
┌─────────────────────────────────────────┐
│ 8 bytes: Header size (uint64) │
├─────────────────────────────────────────┤
│ JSON Header (metadata) │
│ { │
│ "embedding": { │
│ "shape": [50000, 4096], │
│ "dtype": "F32", │
│ "data_offsets": [0, 819200000] │
│ }, │
│ "layer.0.weight": { │
│ "shape": [4096, 4096], │
│ "dtype": "BF16", │
│ "data_offsets": [819200000, ...] │
│ } │
│ } │
├─────────────────────────────────────────┤
│ Raw Tensor Data (flat binary arrays) │
│ [0.234, -1.45, 0.891, ...] │
└─────────────────────────────────────────┘
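The 8-byte length prefix makes the header trivial to parse with the standard library alone. The sketch below writes a minimal single-tensor file and reads its metadata back (hypothetical helper names; in practice, use the real `safetensors` library):

```python
import array
import json
import struct
import tempfile

def write_minimal_safetensors(path, name, values):
    data = array.array("f", values).tobytes()          # flat FP32 payload
    header = json.dumps({name: {"dtype": "F32", "shape": [len(values)],
                                "data_offsets": [0, len(data)]}}).encode()
    with open(path, "wb") as f:
        f.write(struct.pack("<Q", len(header)))        # 8-byte header size (uint64)
        f.write(header)
        f.write(data)

def read_safetensors_header(path):
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        return json.loads(f.read(header_len))

path = tempfile.mktemp(suffix=".safetensors")
write_minimal_safetensors(path, "embedding.weight", [0.25, -1.5, 3.0])
header = read_safetensors_header(path)
```

Because the header alone describes every tensor's dtype, shape, and byte range, loaders can memory-map the file and read individual tensors lazily, with no arbitrary code execution (unlike pickle).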
safetensors files are created when we save a trained model from PyTorch. The format is a safer alternative to pickle: the same file can be saved and loaded in PyTorch, or read directly by vLLM at serving time.
import torch
from safetensors.torch import save_file
# After training, extract model weights
model_weights = {
"embedding.weight": model.embedding.weight,
"layer.0.attention.weight": model.layers[0].attention.weight,
"layer.0.ffn.weight": model.layers[0].ffn.weight,
# ... all model parameters
}
# Save as safetensors
save_file(model_weights, "model.safetensors")
Resources
- LLM Inference Handbook
- Mastering LLM Techniques: Inference Optimization
- https://www.bentoml.com/blog/the-shift-to-distributed-llm-inference
- Mastering LLM Inference Optimization From Theory to Cost Effective Deployment: Mark Moyo
- The AI Engineer’s Guide to Inference Engines and Frameworks
- Inside vLLM: Anatomy of a High-Throughput LLM Inference System
- Fast LLM Inference From Scratch
- https://multimodalai.substack.com/p/the-ai-engineers-guide-to-inference
- https://lilianweng.github.io/posts/2023-01-10-inference-optimization/
- https://multimodalai.substack.com/p/understanding-llm-inference
- https://bentoml.com/llm/inference-optimization
- https://read.theaimerge.com/p/the-ai-engineers-guide-to-inference