| Phase | What To Do | Key Questions To Ask | Output You Should Produce |
|---|---|---|---|
| 1. Understand Problem | Restate the problem clearly | What exactly are we building? Who are users? What is the main goal? | Clear problem statement in 1–2 lines |
| 2. Functional Requirements | Define what system must do | What are core features? MVP scope? CRUD only or real-time? | Bullet list of system capabilities |
| 3. Non-Functional Requirements | Define system qualities | Expected QPS? Latency target? Availability (99.9%)? Strong or eventual consistency? Multi-region? | Performance + reliability constraints |
| 4. Capacity Estimation | Rough calculations | How many users? Requests/sec? Read/write ratio? Storage growth? | Approx QPS, data size, bandwidth |
| 5. Data Modeling | Identify entities & access patterns | What are core objects? How are they queried? Any hot keys? | Basic schema + indexing plan |
| 6. High-Level Design | Draw minimal architecture | Client → LB → App → DB enough? | Simple architecture diagram |
| 7. Identify Bottlenecks | Predict what breaks first | DB overload? CPU? Network? Cache miss storm? | Identified scaling pressure points |
| 8. Scaling Strategy | Add components only if needed | Need caching? Read replicas? Sharding? Queue? | Evolved architecture |
| 9. Concurrency Handling | Prevent race conditions | Will two users update same data? Need locking or optimistic control? | Isolation strategy |
| 10. Consistency Model | Define data agreement level | Strong or eventual? Read-after-write needed? | Chosen consistency model |
| 11. Failure Handling | Design for crashes & spikes | What if node fails? Network partition? Traffic spike? | Retry, replication, rate limiting |
| 12. Resilience Patterns | Prevent cascading failures | Circuit breaker? Backpressure? Load shedding? | Stability mechanisms |
| 13. Security | Protect system & data | Auth? Encryption? Data sensitivity? | Basic security model |
| 14. Observability | Ensure visibility | How will we monitor errors, latency, replica lag? | Logging + metrics strategy |
| 15. Trade-off Discussion | Show maturity | What do we gain? What do we sacrifice? | Explicit trade-offs |
| 16. Deep Dive | Go deeper into one area | Database scaling? Cache invalidation? Partitioning? | Detailed explanation |

Step 1. Outline Requirements

You begin any system design by describing what the system must do (functional requirements) and how well it must do it (non-functional requirements).

Functional Requirements (product requirements)

Think of this as the observable behavior of the system from the user or API perspective.
They describe functions, operations, and use-cases — not performance or scale.

Examples:

  • A user can upload photos.
  • A user can like or comment on a post.
  • The system should notify followers when someone posts.
  • The service exposes an API for retrieving user profiles.

These are features.
They define “what” the system does, not how well it does it.

Non-Functional Requirements: how well the system must do it (technical goals)

These define system qualities or constraints.
They describe measurable attributes like performance, scale, latency, reliability, etc.

Examples:

  • The API should respond within 200 ms for 95% of requests.
  • The system should handle 1 million DAU (daily active users).
  • 99.99% uptime (≈52 minutes downtime per year).
  • Data must be consistent across replicas within 1 second.
  • The system should recover from failure in < 5 minutes.

These are not features — they’re engineering targets.

If functional requirements tell you what to build, non-functional ones tell you how strong the system must be to support real-world use.

From those numbers, derive what must be true.

Example (Twitter):

  • Reads >> writes → system is read-heavy → caching becomes essential.
  • Latency < 200 ms → need CDNs or precomputed timelines.
  • 10 M users → sharding or partitioning required.

This is where your system design logic emerges from requirements, not from memorized architectures.
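The "derive what must be true" step can be sketched as code. This is a rough illustration of the reasoning, not a rule engine: the thresholds (10:1 read/write ratio, 200 ms, 10 M users) are illustrative assumptions taken from the examples above.

```python
def design_signals(dau, reads_per_user, writes_per_user, latency_target_ms):
    """Turn requirement numbers into design signals, the way you would
    reason aloud in an interview. Thresholds are illustrative only."""
    signals = []
    if reads_per_user / writes_per_user >= 10:
        signals.append("read-heavy -> caching becomes essential")
    if latency_target_ms <= 200:
        signals.append("tight latency -> CDN / precomputed timelines")
    if dau >= 10_000_000:
        signals.append("large user base -> sharding/partitioning")
    return signals

# Twitter-like numbers from the text: reads >> writes, <200 ms, 10M users
print(design_signals(dau=10_000_000, reads_per_user=100,
                     writes_per_user=2, latency_target_ms=200))
```

The point is the direction of the arrows: architecture choices fall out of the requirement numbers, not the other way around.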

Step 2. Outline Core Entities

Once requirements are clear, identify the main things (objects) your system needs to store or process; these usually correspond to database tables or data models.

Example (for a ticket-booking system):

  • User
  • Event
  • Ticket
  • Payment

Each entity has relationships (e.g., one user can book many tickets, each ticket belongs to one event).
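These entities and relationships can be written down as minimal data models. The sketch below uses Python dataclasses; the field names are illustrative assumptions, not a prescribed schema.

```python
from dataclasses import dataclass

# Minimal entity sketch for the ticket-booking example.
# Relationships are expressed via foreign-key-style id fields:
# one User books many Tickets; each Ticket belongs to one Event.

@dataclass
class User:
    id: int
    name: str

@dataclass
class Event:
    id: int
    name: str
    capacity: int

@dataclass
class Ticket:
    id: int
    user_id: int   # many tickets -> one user
    event_id: int  # each ticket belongs to exactly one event

@dataclass
class Payment:
    id: int
    ticket_id: int
    amount_cents: int
    status: str    # e.g. "pending", "paid", "refunded"
```

Writing the entities down like this forces you to name the relationships explicitly before you touch the API or the database.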

Step 3. Outline Basic APIs

Now that entities are known, you define how external clients (apps, users, services) will interact with your system.

Example:

  • POST /users → Create a user
  • GET /events → Fetch all events
  • POST /tickets → Book a ticket
  • GET /tickets/{id} → Get booking details

Step 4. Simple High-Level Design


Before optimizing, draw a simple block diagram showing how the components interact — focusing only on correctness, not performance.

Example:

  • Client → API Gateway → Application Service → Database
  • Optional: add cache, message queue, or load balancer only if necessary to meet functionality.

Step 5. Deep Dives / Enhancement

After establishing a correct design, now refine it to meet the non-functional requirements (scalability, latency, consistency, fault tolerance).

Example enhancements:

  • Add caching (e.g., Redis) for low latency.
  • Add database sharding for scale.
  • Add replication for high availability.
  • Use message queues (Kafka) for async processing.
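As one concrete enhancement, the "add caching for low latency" item typically means the cache-aside pattern. A minimal sketch, with a plain dict standing in for Redis and the database (TTLs and invalidation omitted for brevity):

```python
cache = {}                                       # stand-in for Redis
db = {"user:1": {"id": 1, "name": "Alice"}}      # stand-in for the database
stats = {"hits": 0, "misses": 0}

def get_user(key):
    # Cache-aside: try the cache first, fall back to the DB,
    # then populate the cache for the next reader.
    if key in cache:
        stats["hits"] += 1
        return cache[key]
    stats["misses"] += 1
    value = db.get(key)
    if value is not None:
        cache[key] = value
    return value

get_user("user:1")   # miss -> loads from DB, fills cache
get_user("user:1")   # hit  -> served from cache
print(stats)         # -> {'hits': 1, 'misses': 1}
```

The design choice worth mentioning in an interview: cache-aside keeps the cache optional (the system still works if it's empty), at the cost of a stale-read window until entries expire or are invalidated.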

Also address security here: authentication, encryption, and data sensitivity.

Back-of-the-envelope calculations

Back-of-the-envelope calculations are quick, approximate estimations used in system design interviews (and engineering in general) to check whether an idea is feasible or scalable without doing detailed math or coding.

When designing systems (say a chat app, a video streaming platform, or an AI API), you need to reason about scale — how many requests, how much data, how much bandwidth, etc.

But exact numbers are usually unknown or unnecessary.
So, you make order-of-magnitude estimates — simple enough to do mentally or on a whiteboard.

Example:

“If we have 10 million users, and each sends 10 messages a day, how much data do we store per day?”

That’s a back-of-the-envelope question.

Example

  • 10 million daily active users
  • Each sends 20 messages/day → 200 million messages/day
  • Average message = 100 bytes (text only) → 200 million × 100 bytes = 20 GB/day
  • 20 GB × 365 ≈ 7.3 TB/year
  • With ~3× replication ≈ 22 TB/year total storage
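The arithmetic above, written out (the 3× factor is an assumption standing in for replication, which is how the jump from ~7 TB to ~22 TB is usually explained):

```python
dau = 10_000_000
messages_per_user = 20
bytes_per_message = 100            # text only
replication_factor = 3             # assumption: 3 copies of every message

messages_per_day = dau * messages_per_user             # 200 million
bytes_per_day = messages_per_day * bytes_per_message   # 20 GB
bytes_per_year = bytes_per_day * 365                   # ~7.3 TB
total = bytes_per_year * replication_factor            # ~22 TB

print(bytes_per_day / 1e9)               # -> 20.0   (GB/day)
print(round(bytes_per_year / 1e12, 1))   # -> 7.3    (TB/year)
print(round(total / 1e12, 1))            # -> 21.9   (TB/year, replicated)
```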

| Category | Real meaning |
|---|---|
| Storage | How many bytes must we persist |
| Bandwidth | Bytes/second moving around the network |
| Memory | Bytes kept hot in RAM |
| Throughput | Requests/second the system must support |
| Latency | Queuing + processing time |

Everything must be converted to a per-second rate. Why? Because machine capacity (CPU, network, disk) is specified per second.

So all interview calculations follow the same flow:

users → actions/day → actions/sec → data/action → bytes/sec

So the ONLY formula we truly need is:

Requests per second (RPS) =
    Total events per day
    -----------------------
         86,400

60 seconds × 60 minutes × 24 hours = 86,400 seconds/day

1 KB  = 10³ bytes
1 MB  = 10⁶ bytes
1 GB  = 10⁹ bytes
1 TB  = 10¹² bytes

1000 KB = 1 MB
1000 MB = 1 GB
1000 GB = 1 TB

In memory (approximate, for interviews):

| Type | Size |
|---|---|
| 1 character | 1 byte |
| 1 image (low quality) | ~200 KB |
| 1 image (average) | ~500 KB – 1 MB |
| JSON object (typical) | 1 – 5 KB |
| Video (1 min, low quality) | 5 – 10 MB |
| Video (1 min, HD) | 30 – 60 MB |

To estimate storage, you need:

1) Daily Active Users (DAU)
2) Actions per user per day
3) Data per action (bytes)
4) Calculate:

Total bytes/day  = DAU × Actions × Size
Total bytes/year = bytes/day × 365

Example:

Let DAU = 10M
Let actions = 20/day
Let data = 2 KB

Total/day = 10,000,000 × 20 × 2 KB
          = 400,000,000 KB
          = 400 GB per day

Per year ≈ 400 GB × 365
        ≈ 146,000 GB
        ≈ 146 TB/year


If 100 million events per day:

RPS = 100M / 86,400 ≈ 1157 req/sec

≈ 1200 RPS
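Both worked examples reduce to the same two-line calculation:

```python
SECONDS_PER_DAY = 86_400   # 60 × 60 × 24

def rps(events_per_day):
    """Requests per second from a daily event count."""
    return events_per_day / SECONDS_PER_DAY

def daily_bytes(dau, actions_per_user, bytes_per_action):
    """Total bytes/day = DAU × actions × size-per-action."""
    return dau * actions_per_user * bytes_per_action

print(round(rps(100_000_000)))                      # -> 1157  (≈ 1200 RPS)
print(daily_bytes(10_000_000, 20, 2_000) / 1e9)     # -> 400.0 (GB/day)
```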

API layers

| Dimension / Need | REST + JSON | gRPC + Protobuf | GraphQL + JSON |
|---|---|---|---|
| Primary clients | Browsers, 3rd-party partners | Internal microservices; high-performance, low-latency service-to-service RPC | Front-end apps (web/mobile) needing flexible, client-driven queries; avoids over/under-fetching |
| Transport | HTTP/1.1 or HTTP/2 | HTTP/2 with built-in bi-directional streaming | HTTP (usually POST) |
| Data format | JSON (text) | Protobuf (binary) | JSON (text) |
| Performance / latency | Good, but heavier than binary | Very high; 3–10× faster in many tests | Moderate; better than REST on over-fetching |
| Caching | Native HTTP caching | Needs custom/HTTP-level caching | Harder due to flexible queries |
| Streaming / realtime | Limited (SSE, WebSockets) | First-class streaming (client/server/bidi) | Possible but not the primary strength |
| Schema & typing | Optional (OpenAPI etc.) | Strong, enforced via .proto | Strongly typed schema with introspection |
| Browser friendliness | Excellent | Needs a proxy (gRPC-Web) | Excellent |
| Typical best use | Public CRUD/web APIs | Internal RPC, high-perf microservices | Aggregating backends, flexible client queries |
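To make the "text vs binary" contrast concrete, here is a rough size comparison. Python's `struct` packing stands in for Protobuf here; real Protobuf uses varint encoding and field tags, but the flavor of the gap is similar.

```python
import json
import struct

# The same record as JSON text vs a fixed binary layout.
record = {"user_id": 42, "event_id": 7, "amount_cents": 1999}

as_json = json.dumps(record).encode()          # text encoding, field names included
as_binary = struct.pack("<iii", 42, 7, 1999)   # three little-endian int32s

print(len(as_json))     # JSON payload size in bytes (several dozen)
print(len(as_binary))   # -> 12
```

Binary formats drop the repeated field names and digit characters, which is a large part of why gRPC/Protobuf wins on wire size and parse time.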

For more, refer to the Communication section.