| Phase | What To Do | Key Questions To Ask | Output You Should Produce |
|---|---|---|---|
| 1. Understand Problem | Restate the problem clearly | What exactly are we building? Who are users? What is the main goal? | Clear problem statement in 1–2 lines |
| 2. Functional Requirements | Define what system must do | What are core features? MVP scope? CRUD only or real-time? | Bullet list of system capabilities |
| 3. Non-Functional Requirements | Define system qualities | Expected QPS? Latency target? Availability (99.9%)? Strong or eventual consistency? Multi-region? | Performance + reliability constraints |
| 4. Capacity Estimation | Rough calculations | How many users? Requests/sec? Read/write ratio? Storage growth? | Approx QPS, data size, bandwidth |
| 5. Data Modeling | Identify entities & access patterns | What are core objects? How are they queried? Any hot keys? | Basic schema + indexing plan |
| 6. High-Level Design | Draw minimal architecture | Client → LB → App → DB enough? | Simple architecture diagram |
| 7. Identify Bottlenecks | Predict what breaks first | DB overload? CPU? Network? Cache miss storm? | Identified scaling pressure points |
| 8. Scaling Strategy | Add components only if needed | Need caching? Read replicas? Sharding? Queue? | Evolved architecture |
| 9. Concurrency Handling | Prevent race conditions | Will two users update same data? Need locking or optimistic control? | Isolation strategy |
| 10. Consistency Model | Define data agreement level | Strong or eventual? Read-after-write needed? | Chosen consistency model |
| 11. Failure Handling | Design for crashes & spikes | What if node fails? Network partition? Traffic spike? | Retry, replication, rate limiting |
| 12. Resilience Patterns | Prevent cascading failures | Circuit breaker? Backpressure? Load shedding? | Stability mechanisms |
| 13. Security | Protect system & data | Auth? Encryption? Data sensitivity? | Basic security model |
| 14. Observability | Ensure visibility | How will we monitor errors, latency, replica lag? | Logging + metrics strategy |
| 15. Trade-off Discussion | Show maturity | What do we gain? What do we sacrifice? | Explicit trade-offs |
| 16. Deep Dive | Go deeper into one area | Database scaling? Cache invalidation? Partitioning? | Detailed explanation |
Step 1. Outline Requirements
You begin any system design by describing what the system must do (functional requirements) and how well it must do it (non-functional requirements).
Functional Requirements (product requirements)
Think of this as the observable behavior of the system from the user or API perspective.
They describe functions, operations, and use-cases — not performance or scale.
Examples:
- A user can upload photos.
- A user can like or comment on a post.
- The system should notify followers when someone posts.
- The service exposes an API for retrieving user profiles.
These are features.
They define “what” the system does, not how well it does it.
Non-Functional Requirements: how well the system must do it (technical goals)
These define system qualities or constraints.
They describe measurable attributes like performance, scale, latency, reliability, etc.
Examples:
- The API should respond within 200 ms for 95% of requests.
- The system should handle 1 million DAU (daily active users).
- 99.99% uptime (≈52 minutes downtime per year).
- Data must be consistent across replicas within 1 second.
- The system should recover from failure in < 5 minutes.
These are not features — they’re engineering targets.
If functional requirements tell you what to build, non-functional ones tell you how strong the system must be to support real-world use.
From those numbers, derive what must be true.
Example (Twitter):
- Reads >> writes → system is read-heavy → caching becomes essential.
- Latency < 200 ms → need CDNs or precomputed timelines.
- 10 M users → sharding or partitioning required.
This is where your system design logic emerges from requirements, not from memorized architectures.
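The read-heavy conclusion above can be checked with a quick calculation (the traffic numbers here are illustrative assumptions, not real Twitter figures):

```python
# Illustrative assumptions: 10M DAU, 50 timeline reads and 2 posts per user/day.
dau = 10_000_000
reads_per_user = 50
writes_per_user = 2
seconds_per_day = 86_400

read_qps = dau * reads_per_user / seconds_per_day    # ≈ 5,787 reads/sec
write_qps = dau * writes_per_user / seconds_per_day  # ≈ 231 writes/sec

# A ~25:1 read/write ratio is what justifies caching and precomputed timelines.
print(f"read/write ratio ≈ {read_qps / write_qps:.0f}:1")
```

The exact numbers matter less than the order of magnitude: once reads dominate writes by 10× or more, a cache in front of the database is usually the first enhancement to propose.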
Step 2. Outline Core Entities
Once requirements are clear, identify the main things (objects) your system needs to store or process; these usually correspond to database tables or data models.
Example (for a ticket-booking system):
- User
- Event
- Ticket
- Payment
Each entity has relationships (e.g., one user can book many tickets, each ticket belongs to one event).
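Sketched as data models, the entities and their relationships might look like this (field choices are illustrative assumptions; the real schema depends on the requirements):

```python
from dataclasses import dataclass


@dataclass
class User:
    id: int
    name: str


@dataclass
class Event:
    id: int
    name: str


@dataclass
class Ticket:
    id: int
    user_id: int   # one user can book many tickets
    event_id: int  # each ticket belongs to one event


@dataclass
class Payment:
    id: int
    ticket_id: int
    amount_cents: int  # store money as integer cents to avoid float rounding
```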
Step 3. Outline Basic APIs
Now that entities are known, you define how external clients (apps, users, services) will interact with your system.
Example:
- POST /users → Create a user
- GET /events → Fetch all events
- POST /tickets → Book a ticket
- GET /tickets/{id} → Get booking details
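The endpoints above can be sketched as plain in-memory handlers (a real service would sit behind an HTTP framework and a database; the function names and dict storage are stand-ins):

```python
# In-memory stores standing in for database tables.
users, events, tickets = {}, {}, {}

def create_user(user_id, name):                 # POST /users
    users[user_id] = {"id": user_id, "name": name}
    return users[user_id]

def list_events():                              # GET /events
    return list(events.values())

def book_ticket(ticket_id, user_id, event_id):  # POST /tickets
    tickets[ticket_id] = {"id": ticket_id, "user_id": user_id, "event_id": event_id}
    return tickets[ticket_id]

def get_ticket(ticket_id):                      # GET /tickets/{id}
    return tickets.get(ticket_id)
```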
Step 4. Simple High-Level Design
What this step means
Before optimizing, draw a simple block diagram showing how the components interact — focusing only on correctness, not performance.
Example:
- Client → API Gateway → Application Service → Database
- Optional: add cache, message queue, or load balancer only if necessary to meet functionality.
Step 5. Deep Dives / Enhancement
After establishing a correct design, now refine it to meet the non-functional requirements (scalability, latency, consistency, fault tolerance).
Example enhancements:
- Add caching (e.g., Redis) for low latency.
- Add database sharding for scale.
- Add replication for high availability.
- Use message queues (Kafka) for async processing.
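As an example of the first enhancement, a read-through cache with a TTL can be sketched as follows (`db_fetch` is a stand-in for a real database call; in production the cache would be Redis or similar rather than a local dict):

```python
import time

cache = {}
TTL_SECONDS = 60

def db_fetch(key):
    # Stand-in for a slow database query.
    return f"value-for-{key}"

def get(key):
    entry = cache.get(key)
    if entry is not None and time.time() - entry[1] < TTL_SECONDS:
        return entry[0]                    # cache hit: skip the database
    value = db_fetch(key)                  # cache miss: read from the database
    cache[key] = (value, time.time())      # populate the cache for later reads
    return value
```

The same read-through pattern applies when the dict is replaced by Redis; the TTL bounds how stale a cached value can get, which connects directly to the consistency discussion in step 10 of the table above.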
Also cover security at this stage: authentication, authorization, and encryption of sensitive data.
Back-of-the-envelope calculations
Back-of-the-envelope calculations are quick, approximate estimates used in system design interviews (and in engineering generally) to check whether an idea is feasible or scalable without doing detailed math or coding.
When designing systems (say a chat app, a video streaming platform, or an AI API), you need to reason about scale — how many requests, how much data, how much bandwidth, etc.
But exact numbers are usually unknown or unnecessary.
So, you make order-of-magnitude estimates — simple enough to do mentally or on a whiteboard.
Example:
“If we have 10 million users, and each sends 10 messages a day, how much data do we store per day?”
That’s a back-of-the-envelope question.
Example:
- 10 million daily active users
- Each sends 20 messages/day → 200 million messages/day
- Average message = 100 bytes (text only) → 200 million × 100 bytes = 20 GB/day
- 20 GB × 365 ≈ 7.3 TB/year
- With 3× replication → ≈ 22 TB/year total storage
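The same estimate, written out as a runnable check (the 3× replication factor is a common durability assumption, not a fixed rule):

```python
dau = 10_000_000
msgs_per_user = 20
bytes_per_msg = 100
replication = 3  # assumption: 3 copies of each message for durability

bytes_per_day = dau * msgs_per_user * bytes_per_msg
gb_per_day = bytes_per_day / 10**9          # 20 GB/day
tb_per_year = gb_per_day * 365 / 1000       # ≈ 7.3 TB/year
tb_total = tb_per_year * replication        # ≈ 22 TB/year with replication
```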
| Category | Real meaning |
|---|---|
| Storage | How many bytes must we persist |
| Bandwidth | Bytes/second moving around the network |
| Memory | Bytes kept hot in RAM |
| Throughput | Requests/second system must support |
| Latency | Queuing + processing time |
Everything must convert to per second. Why? Because machines operate per second.
So all interview calculations follow the same flow:
users → actions/day → actions/sec → data/action → bytes/sec
So the ONLY formula we truly need is:
Requests per second (RPS) =
Total events per day
-----------------------
86,400
60 seconds × 60 minutes × 24 hours = 86,400 seconds/day
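The formula reduces to a one-line helper; here it is applied to the 200 million messages/day from the earlier example:

```python
SECONDS_PER_DAY = 86_400

def rps(events_per_day):
    """Convert a daily event count into requests (or events) per second."""
    return events_per_day / SECONDS_PER_DAY

# 200 million messages/day ≈ 2,315 messages/sec
print(round(rps(200_000_000)))
```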
1 KB = 10³ bytes
1 MB = 10⁶ bytes
1 GB = 10⁹ bytes
1 TB = 10¹² bytes
1000 KB = 1 MB
1000 MB = 1 GB
1000 GB = 1 TB
Typical data sizes to memorize (approximate, for interviews):
| Type | Size |
|---|---|
| 1 character | 1 byte |
| 1 image (low quality) | ~200 KB |
| 1 image (average) | ~500 KB – 1 MB |
| JSON object (typical) | 1 – 5 KB |
| Video (1 min low quality) | 5 – 10 MB |
| Video (HD 1 min) | 30 – 60 MB |
1) Daily Active Users (DAU)
2) Actions per user per day
3) Data per action (bytes)
4) Calculate:
Total bytes/day = DAU × Actions × Size
Total bytes/year = bytes/day × 365
Let DAU = 10M
Let actions = 20/day
Let data = 2 KB
Total/day = 10,000,000 × 20 × 2 KB
= 400,000,000 KB
= 400 GB per day
Per year ≈ 400 × 365
≈ 146,000 GB
≈ 146 TB/year
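The same arithmetic as a runnable check (using the decimal units defined above, where 1 GB = 10⁹ bytes):

```python
dau = 10_000_000
actions_per_day = 20
kb_per_action = 2

kb_per_day = dau * actions_per_day * kb_per_action  # 400,000,000 KB
gb_per_day = kb_per_day / 10**6                     # 400 GB/day
tb_per_year = gb_per_day * 365 / 1000               # 146 TB/year
```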
If 100 million events per day:
RPS = 100M / 86,400 ≈ 1157 req/sec
≈ 1200 RPS
API layers
| Dimension / Need | REST + JSON | gRPC + Protobuf | GraphQL + JSON |
|---|---|---|---|
| Primary clients | Browsers, 3rd‑party partners | Internal microservices; high-performance, low-latency service-to-service RPC | Front-end apps (web/mobile) needing flexible, client-driven queries that avoid over/under-fetching |
| Transport | HTTP/1.1 or HTTP/2 | Runs over HTTP/2 with built-in bi-directional streaming. | HTTP (usually POST) |
| Data format | JSON (text) | Protobuf (binary) | JSON (text) |
| Performance / latency | Good, but heavier than binary | Very high, 3–10× faster in many tests | Moderate; better than REST on over‑fetching |
| Caching | Native HTTP caching | Needs custom/HTTP‑level caching | Harder due to flexible queries |
| Streaming / realtime | Limited (SSE, websockets nearby) | First‑class streaming (client/server/bidi) | Possible but not the primary strength |
| Schema & typing | Optional (OpenAPI etc.) | Strong, enforced via .proto | Strong typed schema with introspection |
| Browser friendliness | Excellent | Needs proxy (gRPC‑Web) | Excellent |
| Typical best use | Public CRUD/web APIs | Internal RPC, high‑perf microservices | Aggregating backends, flexible client queries |
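The "binary vs text" size difference behind the performance row can be illustrated by packing the same record both ways. Note this uses fixed-width `struct` packing only as a rough stand-in; real Protobuf uses tag/varint encoding, so the exact sizes differ, but the direction of the comparison holds:

```python
import json
import struct

record = {"user_id": 42, "score": 3.14, "active": True}

# JSON carries field names and punctuation as text.
json_bytes = json.dumps(record).encode()

# Binary packing carries only the values: uint32 + float64 + bool = 13 bytes.
# '<' means little-endian with no padding.
binary_bytes = struct.pack("<Id?", 42, 3.14, True)

print(len(json_bytes), len(binary_bytes))
```

Schemas (a `.proto` file for Protobuf, an SDL for GraphQL) are what let the binary side drop field names from the wire: both peers already know the field layout.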
For more, refer to the Communication section.