Infrastructure design interviews focus on building the foundational systems that support applications at scale. Unlike product design interviews that emphasize user-facing features, infrastructure interviews dive deep into system-level components like distributed caches, message brokers, and consensus mechanisms. Here’s a comprehensive guide to help you master this interview type.

Infrastructure Design Interview Questions

Core Infrastructure Components

Distributed Storage & Caching:

  • Design a distributed cache (Redis/Memcached): Focus on consistent hashing, cache eviction policies (LRU/LFU), cache stampede prevention, hot key handling, and replication strategies
  • Design a key-value store (DynamoDB/Redis): Cover partitioning using consistent hashing, replication with quorum reads/writes, conflict resolution using vector clocks, gossip protocols for node discovery, and hinted handoff for temporary failures
  • Design a distributed file system (GFS/HDFS): Address file chunking, metadata management with master-worker architecture, replication across data nodes, and fault tolerance mechanisms

Messaging & Communication:

  • Design a message broker (Kafka/RabbitMQ): Implement partition-based scaling, leader-follower replication for durability, consumer group coordination, ordering guarantees, and offset tracking
  • Design a distributed queue: Handle FIFO ordering, message persistence, at-least-once/exactly-once delivery semantics, and dead letter queues
  • Design a pub-sub system: Manage topic-based routing, subscriber management, message filtering, and backpressure handling

Coordination & Control:

  • Design a distributed lock service (ZooKeeper/etcd): Implement mutual exclusion using consensus algorithms, deadlock prevention, fault tolerance with lease-based locks, and fencing tokens to prevent split-brain scenarios
  • Design a service discovery system (Consul/etcd): Support health checking, DNS-based service resolution, configuration storage with strong consistency, and multi-datacenter replication
  • Design a leader election system: Use Raft or Paxos for consensus, handle split votes, implement randomized timeouts, and ensure safety during network partitions

Traffic Management:

  • Design a rate limiter: Implement algorithms like token bucket, leaky bucket, or sliding window counter; handle distributed rate limiting with Redis, prevent race conditions with atomic operations
  • Design a load balancer: Cover Layer 4 vs Layer 7 balancing, implement algorithms (round robin, least connections, weighted distribution), health checks, session persistence, and global load balancing
  • Design an API gateway: Manage authentication, request routing, rate limiting, protocol translation, and response aggregation

Advanced Infrastructure Questions

Reliability & Durability:

  • Design a consensus system: Implement Paxos or Raft algorithms, handle leader election, log replication, and safety guarantees during failures
  • Design a write-ahead log (WAL): Ensure durability with sequential append-only logs, implement log sequence numbers (LSN), checkpointing, and crash recovery
  • Design a distributed transaction coordinator: Use two-phase commit (2PC) with prepare and commit phases, handle in-doubt transactions and coordinator timeouts

Observability:

  • Design a distributed logging system: Aggregate logs from multiple sources, implement log shipping, indexing for search, retention policies
  • Design a distributed tracing system: Track request flows across services, implement trace ID propagation, sampling strategies, span collection
  • Design a metrics aggregation system: Collect time-series metrics, implement downsampling, anomaly detection, and visualization

Advanced Patterns:

  • Design a CDN: Implement edge caching, cache invalidation strategies, origin shield, geo-routing with DNS
  • Design a circuit breaker: Monitor failure rates, implement open/half-open/closed states, fallback mechanisms
  • Design a distributed scheduler: Manage task distribution, implement priority queues, handle task failures and retries
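
The circuit breaker's open/half-open/closed states lend themselves to a small sketch. The following is a minimal, illustrative state machine (class name and thresholds are made up for the example, not any particular library's API):

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: closed -> open after N consecutive failures,
    open -> half-open after a cooldown, half-open -> closed on one success."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.state = "closed"
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state == "open":
            if time.monotonic() - self.opened_at >= self.reset_timeout:
                self.state = "half-open"   # let one probe request through
            else:
                raise RuntimeError("circuit open; failing fast")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.state == "half-open" or self.failures >= self.failure_threshold:
                self.state = "open"        # trip: stop hammering the dependency
                self.opened_at = time.monotonic()
            raise
        else:
            self.failures = 0
            self.state = "closed"
            return result
```

In the open state the breaker fails fast instead of piling load onto a struggling dependency; after `reset_timeout` it admits a single probe (half-open) and fully closes only if that probe succeeds.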

Communication Patterns & Approaches

Synchronous Communication

Request-Response Pattern: Direct client-server interaction with blocking calls; use for immediate feedback requirements

RPC (Remote Procedure Call): Function-like invocation across network boundaries; gRPC uses HTTP/2 for efficient binary communication with strong typing

REST API: Resource-oriented with HTTP verbs; stateless, cacheable, widely adopted for web services

Asynchronous Communication

Message Queues: Decouples producers and consumers; provides buffering, load leveling, and fault tolerance; examples include RabbitMQ, ActiveMQ

Pub-Sub Pattern: Publishers broadcast messages to topics; multiple subscribers receive copies; enables event-driven architectures; Kafka, Amazon SNS

Event Streaming: Ordered sequence of events stored in durable logs; enables replay, temporal queries; Kafka streams, Apache Flink

Gossip Protocol

Epidemic Dissemination: Nodes periodically exchange state with random peers; eventual consistency model; used in Cassandra, Consul for cluster membership

  • Decentralized with no single point of failure
  • Scales well to thousands of nodes
  • Achieves convergence in O(log N) rounds
  • Fault-tolerant through redundant message paths
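
A toy simulation makes the O(log N) convergence claim concrete. This is a sketch under simplifying assumptions (synchronous rounds, uniform random peer selection, no failures):

```python
import random

def gossip_rounds(n, seed=42):
    """Push-style gossip: each round, every node that knows the update
    forwards it to one uniformly random peer. Returns rounds until all
    n nodes have the update."""
    rng = random.Random(seed)
    informed = {0}                         # node 0 originates the update
    rounds = 0
    while len(informed) < n:
        for _ in range(len(informed)):
            informed.add(rng.randrange(n))  # random peer (possibly itself)
        rounds += 1
    return rounds

# The informed set roughly doubles each round while small, then a
# coupon-collector-like tail mops up stragglers: logarithmic overall.
```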

Essential Algorithms & Techniques

Consensus Algorithms

Paxos:

  • Quorum-based with proposers, acceptors, learners
  • Allows any server to become leader, but a newly elected leader may first need to bring its log up to date
  • Optimal message complexity but complex to implement
  • Used in Google Chubby; Apache ZooKeeper uses the closely related ZAB protocol

Raft:

  • Leader-based with explicit leader election using randomized timeouts
  • Only up-to-date servers can become leaders
  • More understandable than Paxos with similar performance
  • Used in etcd, Consul, CockroachDB

Key Differences:

  • Paxos assigns disjoint sets of term numbers to servers; Raft allows at most one candidate per term, enforced by voting
  • Paxos lets any server become leader; Raft only elects a server whose log is up to date
  • Raft leader election is lightweight because the winner's log is already current; a new Paxos leader may need to fetch missing log entries first
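
Raft's "up-to-date log" restriction is compact enough to write down. A sketch of the vote-granting comparison from the Raft paper (section 5.4.1); variable names are illustrative:

```python
def log_up_to_date(candidate_last_term, candidate_last_index,
                   voter_last_term, voter_last_index):
    """A voter grants its vote only if the candidate's log ends in a
    higher term, or the same term with at least as many entries."""
    if candidate_last_term != voter_last_term:
        return candidate_last_term > voter_last_term
    return candidate_last_index >= voter_last_index
```

Because a majority must grant votes, the winner's log is at least as complete as any committed entry, which is why Raft never needs a post-election log transfer to the leader.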

Data Distribution Algorithms

Consistent Hashing:

  • Maps both nodes and keys to hash ring (0 to 2^32-1)
  • Keys assigned to nearest clockwise node
  • Minimizes remapping when nodes added/removed (only K/N keys affected)
  • Use virtual nodes for load balancing across heterogeneous hardware
  • Implementation: keep ring positions in a sorted structure (balanced BST or sorted array) and binary-search for O(log N) lookups

Use Cases: Distributed caches (Memcached, Redis Cluster), CDNs, distributed databases (Cassandra)
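
A minimal consistent-hash ring with virtual nodes, mapping onto the 2^32 ring described above (the MD5 hash and replica count are illustrative choices, not requirements):

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Hash ring with virtual nodes; lookups binary-search the sorted
    ring positions, O(log N) per key."""

    def __init__(self, replicas=100):
        self.replicas = replicas      # virtual nodes per physical node
        self.ring = []                # sorted hash positions
        self.owners = {}              # position -> physical node

    def _hash(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16) % (2**32)

    def add_node(self, node):
        for i in range(self.replicas):
            pos = self._hash(f"{node}#{i}")
            bisect.insort(self.ring, pos)
            self.owners[pos] = node

    def remove_node(self, node):
        for i in range(self.replicas):
            pos = self._hash(f"{node}#{i}")
            self.ring.remove(pos)
            del self.owners[pos]

    def get(self, key):
        pos = self._hash(key)
        # nearest clockwise position, wrapping around the ring
        i = bisect.bisect_right(self.ring, pos) % len(self.ring)
        return self.owners[self.ring[i]]
```

Removing a node remaps only the keys whose clockwise successor was one of that node's virtual positions, roughly K/N of them; every other key keeps its owner.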

Rate Limiting Algorithms

Token Bucket:

  • Bucket holds tokens at maximum capacity
  • Tokens added at fixed rate (r tokens/second)
  • Requests consume tokens; rejected if insufficient
  • Pros: Allows bursts up to bucket capacity, memory efficient
  • Cons: Parameters (bucket size, refill rate) need tuning
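
A single-process token bucket sketch; the clock is injected so the refill logic is testable, and a distributed variant would keep this state in Redis instead:

```python
import time

class TokenBucket:
    """Token bucket: at most `capacity` tokens, refilled at `rate`
    tokens/second. A request is allowed iff enough tokens remain."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.clock = clock
        self.last = clock()

    def allow(self, cost=1):
        now = self.clock()
        # Lazy refill: credit the elapsed interval, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```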

Leaky Bucket:

  • FIFO queue processes requests at constant rate
  • New requests added to queue; dropped if full
  • Pros: Smooth output rate, simple implementation
  • Cons: Cannot handle bursts, may reject legitimate traffic
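
The leaky bucket as a bounded FIFO drained at a constant rate; a simplified single-process sketch (the injected clock is for testability):

```python
from collections import deque

class LeakyBucket:
    """Bounded queue drained at `leak_rate` requests/second; offers
    beyond capacity are dropped, which caps burst size at zero."""

    def __init__(self, capacity, leak_rate, clock):
        self.capacity = capacity
        self.leak_rate = leak_rate
        self.queue = deque()
        self.clock = clock
        self.last_leak = clock()

    def _leak(self):
        now = self.clock()
        drained = int((now - self.last_leak) * self.leak_rate)
        for _ in range(min(drained, len(self.queue))):
            self.queue.popleft()           # these requests get processed
        self.last_leak += drained / self.leak_rate  # keep fractional credit

    def offer(self, request):
        self._leak()
        if len(self.queue) >= self.capacity:
            return False                   # bucket full: request dropped
        self.queue.append(request)
        return True
```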

Sliding Window Counter:

  • Combines fixed window efficiency with sliding log accuracy
  • Weighted count: current_window_count + (previous_window_count × overlap_percentage)
  • Pros: More accurate than fixed window, efficient storage
  • Cons: Approximation may allow slight over-limit
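
The weighted-count formula above, as a runnable single-process sketch (clock injected for testability):

```python
class SlidingWindowCounter:
    """Weight the previous fixed window by how much it still overlaps
    the sliding window:
    effective = current_count + previous_count * overlap_fraction
    """

    def __init__(self, limit, window, clock):
        self.limit = limit
        self.window = window              # window length in seconds
        self.clock = clock
        self.current_start = None
        self.current = 0
        self.previous = 0

    def allow(self):
        now = self.clock()
        start = now - (now % self.window)  # fixed-window boundary
        if start != self.current_start:
            # Roll forward; counts older than one full window are dropped.
            self.previous = (self.current
                             if start - self.window == self.current_start else 0)
            self.current = 0
            self.current_start = start
        overlap = 1.0 - (now - start) / self.window
        effective = self.current + self.previous * overlap
        if effective < self.limit:
            self.current += 1
            return True
        return False
```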

Implementation Considerations:

  • Use Redis with atomic operations (INCR, EXPIRE) for distributed rate limiting
  • Local in-memory checks with eventual consistency for low latency
  • Implement request coalescing for cache stampede prevention

Load Balancing Algorithms

Round Robin:

  • Sequential distribution in cyclic order
  • Use Case: Homogeneous servers with predictable load
  • Pros: Simple, equal distribution
  • Cons: Ignores server capacity and current load

Least Connections:

  • Routes to server with fewest active connections
  • Use Case: Variable connection duration, similar server capacity
  • Pros: Dynamic load distribution, prevents overload
  • Cons: Requires connection state tracking

Weighted Least Connections:

  • Combines connection count with server capacity weights
  • Use Case: Heterogeneous server capacities
  • Pros: Capacity-aware, handles variable loads
  • Cons: Complex configuration, requires capacity profiling

IP Hash:

  • Consistent mapping of client IP to server
  • Use Case: Session persistence requirements
  • Pros: Sticky sessions without state
  • Cons: Uneven distribution with uneven IP distribution
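
Round robin and least connections side by side, as minimal picker sketches; real load balancers add health checks, weights, and locking around this state:

```python
import itertools

class RoundRobin:
    """Cycle through servers in fixed order."""
    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)

    def pick(self):
        return next(self._cycle)

class LeastConnections:
    """Route to the server with the fewest active connections; the
    caller must release() when the connection finishes."""
    def __init__(self, servers):
        self.active = {s: 0 for s in servers}

    def pick(self):
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server

    def release(self, server):
        self.active[server] -= 1
```

The contrast matches the trade-off bullets: round robin needs no state about the backends, while least connections must track every in-flight connection but adapts to uneven request durations.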

Probabilistic Data Structures

Bloom Filter:

  • Space-efficient set membership testing with false positives
  • Uses k hash functions and m-bit array
  • False Positive Rate: (1 - e^(-kn/m))^k where n = elements inserted
  • Optimal k: (m/n) × ln(2)
  • Optimal m: -(n × ln(ε)) / (ln(2))^2 for target error rate ε

Use Cases: URL deduplication (web crawlers), cache filtering (avoid disk lookups), database query optimization

Trade-offs:

  • No false negatives: a negative answer is always correct
  • Cannot remove elements (use counting Bloom filter variant)
  • Fixed size after creation (use scalable Bloom filter for dynamic workloads)
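
A Bloom filter sized directly from the optimal m and k formulas above. Deriving the k indexes from two hashes is a standard shortcut (the Kirsch-Mitzenmacher technique), not the only option:

```python
import math
import hashlib

class BloomFilter:
    """Sized from the formulas above:
    m = -(n * ln eps) / (ln 2)^2 bits, k = (m / n) * ln 2 hash functions."""

    def __init__(self, n, eps):
        self.m = max(1, int(-(n * math.log(eps)) / (math.log(2) ** 2)))
        self.k = max(1, round((self.m / n) * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, item):
        # k indexes derived from two 64-bit halves of one SHA-256 digest.
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] >> (p % 8) & 1
                   for p in self._positions(item))
```

For n = 1000 and a 1% target error rate this allocates roughly 9.6 kbits and 7 hash functions, about 1.2 bytes per element.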

Write-Ahead Log (WAL)

Core Principle: Log changes before applying them to ensure durability and crash recovery

Process:

  1. Assign Log Sequence Number (LSN) to each operation
  2. Write log entry to durable storage (fsync)
  3. Apply change to data structures (can be async)
  4. Periodic checkpointing flushes data and truncates log
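
The four steps above, sketched as an append-only JSON-lines log. Record format and class names are illustrative; production WALs use binary records, checksums, and segment files:

```python
import os
import json

class WriteAheadLog:
    """Append-only WAL sketch: each record gets a monotonically
    increasing LSN and is fsync'd before the caller applies the change."""

    def __init__(self, path):
        self.path = path
        self.lsn = 0
        self.f = open(path, "a", encoding="utf-8")

    def append(self, record):
        self.lsn += 1
        entry = {"lsn": self.lsn, "op": record}
        self.f.write(json.dumps(entry) + "\n")
        self.f.flush()
        os.fsync(self.f.fileno())   # durable on disk before we apply the change
        return self.lsn

    def replay(self, up_to_lsn=None):
        """Crash recovery / point-in-time recovery: re-read entries in order."""
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                entry = json.loads(line)
                if up_to_lsn is None or entry["lsn"] <= up_to_lsn:
                    yield entry
```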

Benefits:

  • Durability: No data loss after log write
  • Fast writes: Sequential appends faster than random writes
  • Replication: Ship log entries to replicas
  • Point-in-time recovery: Replay log to specific LSN

Used In: PostgreSQL, MySQL, MongoDB, Apache Kafka, and most other durable datastores

CAP Theorem & Consistency Models

CAP Theorem Trade-offs

CP (Consistency + Partition Tolerance):

  • Sacrifices availability during partitions
  • All nodes see same data; blocks requests if can’t guarantee consistency
  • Examples: MongoDB, Redis, HBase, ZooKeeper
  • Use Case: Financial transactions, inventory management

AP (Availability + Partition Tolerance):

  • Sacrifices strong consistency for availability
  • System remains responsive; may return stale data
  • Examples: Cassandra, DynamoDB, CouchDB
  • Use Case: Social media feeds, shopping carts, content delivery

CA (Consistency + Availability):

  • Cannot tolerate partitions (impractical for distributed systems)
  • Only viable for single-node or tightly-coupled systems

Consistency Models

Strong Consistency:

  • All reads return most recent write immediately
  • Requires coordination (Paxos/Raft) between nodes
  • Pros: Data reliability, simpler application logic
  • Cons: Higher latency, reduced availability, scalability limits
  • Examples: Google Spanner, etcd

Eventual Consistency:

  • Guarantees convergence given sufficient time without new updates
  • Allows temporary inconsistencies for better performance
  • Pros: High availability, low latency, horizontal scalability
  • Cons: Temporary stale reads, complex conflict resolution
  • Examples: Amazon DynamoDB, Cassandra, Riak

Read-Your-Writes Consistency: Guarantees clients see their own writes immediately

Monotonic Reads: Successive reads never return older values

Replication Strategies

Master-Slave (Primary-Replica):

  • Single master handles writes; replicas serve reads
  • Synchronous: Zero data loss but higher latency
  • Asynchronous: Lower latency but potential data loss
  • Use Case: Read-heavy workloads (analytics, reporting)

Multi-Master:

  • Multiple nodes accept writes; changes propagated to peers
  • Pros: High availability, write scalability, geo-distribution
  • Cons: Complex conflict resolution, potential inconsistencies
  • Use Case: Global distributed systems (Active Directory, Oracle RAC)

Infrastructure Design Interview Approach

Step 1: Clarify Requirements (10 minutes)

Functional Requirements:

  • What operations must the system support? (set/get for cache, publish/subscribe for message broker)
  • What are the performance targets? (latency, throughput)
  • What consistency guarantees are needed?

Non-Functional Requirements:

  • Scale: How many requests per second? Data volume?
  • Availability: What’s the acceptable downtime? (99.9%, 99.99%?)
  • Durability: Can we lose data? How much?
  • Latency: P50, P99, P999 targets?

Example Questions:

  • “What’s the read/write ratio?”
  • “Do we need strong consistency or is eventual consistency acceptable?”
  • “What’s the expected failure rate we need to handle?”

Step 2: Capacity Estimation

Calculate storage, memory, network bandwidth, and QPS requirements:

  • Storage: data_size × replication_factor × retention_period
  • Memory: active_dataset_size × cache_ratio
  • Bandwidth: request_size × QPS
  • Servers: total_QPS / QPS_per_server
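
These formulas are simple enough to script; the inputs below are illustrative placeholders, not targets from the text:

```python
def capacity_estimate(qps, request_bytes, replication=3,
                      retention_days=30, qps_per_server=10_000):
    """Back-of-envelope sizing from the formulas above."""
    daily_bytes = qps * request_bytes * 86_400          # seconds per day
    storage_bytes = daily_bytes * retention_days * replication
    bandwidth_bps = qps * request_bytes * 8             # bits per second
    servers = -(-qps // qps_per_server)                 # ceiling division
    return storage_bytes, bandwidth_bps, servers

storage, bandwidth, servers = capacity_estimate(qps=100_000, request_bytes=1_000)
# 100k QPS of 1 KB requests: ~8.64 TB/day of ingest before replication,
# ~800 Mbps of bandwidth, 10 servers at 10k QPS each.
```

In an interview, rounding aggressively (1 day ≈ 10^5 seconds, 1 KB ≈ 10^3 bytes) keeps the arithmetic fast while staying within the right order of magnitude.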

Step 3: High-Level Design

Start Simple: Draw basic components and data flow

  • Client → Load Balancer → Service Nodes → Database
  • Identify single points of failure

Add Layers:

  • Caching layer (Redis cluster with consistent hashing)
  • Message queue (Kafka for async processing)
  • Replication (master-slave for reads, multi-master for writes)

Step 4: Deep Dive into System Internals

Infrastructure Focus Areas:

  • Consensus: How do nodes agree on state? (Raft for leader election)
  • Durability: How do we prevent data loss? (WAL, replication, backups)
  • Partitioning: How is data distributed? (Consistent hashing, range partitioning)
  • Failure Handling: What happens when nodes crash? (Heartbeats, quorum reads)
  • Performance: How do we optimize for throughput? (Batching, compression, connection pooling)

Design Decisions to Address:

  • Choice of data structures (B-tree vs LSM-tree)
  • Network protocols (TCP vs UDP, HTTP/2 vs gRPC)
  • Serialization formats (Protocol Buffers vs JSON)
  • Storage engines (RocksDB, LevelDB)

Step 5: Discuss Trade-offs

Consistency vs Availability (CAP theorem): Strong consistency slows down writes

Latency vs Durability: Fsync every write is slow but safe

Complexity vs Performance: Complex algorithms (Paxos) offer optimality but are hard to implement

Cost vs Resilience: More replicas = higher availability but increased cost

Step 6: Operational Considerations

  • Monitoring: What metrics to track? (latency percentiles, error rates, saturation)
  • Alerting: When to page engineers? (error rate spikes, high latency)
  • Debugging: How to troubleshoot? (distributed tracing, centralized logging)
  • Deployment: Rolling updates, blue-green deployments, canary releases

Complete Preparation Roadmap

Phase 1: Foundations (2-3 weeks)

Week 1 - Core Concepts:

  • Study CAP theorem with real-world examples (DynamoDB, Cassandra, MongoDB)
  • Understand consistency models (strong, eventual, causal)
  • Learn replication strategies (master-slave, multi-master)
  • Review network fundamentals (TCP/IP, HTTP, DNS)

Week 2 - Distributed Systems Primitives:

  • Consistent hashing implementation and use cases
  • Leader election (Raft vs Paxos comparison)
  • Distributed transactions (2PC protocol)
  • Failure detection (heartbeats, health checks)

Week 3 - Data Structures & Algorithms:

  • Bloom filters with false positive calculations
  • Write-ahead logs for durability
  • Merkle trees for replication reconciliation
  • Gossip protocols for information dissemination

Phase 2: Core Infrastructure Systems (3-4 weeks)

Week 4 - Caching:

  • Redis/Memcached architecture
  • Eviction policies (LRU, LFU, TTL)
  • Cache stampede prevention strategies
  • Consistent hashing for cache distribution
  • Hot key handling techniques

Week 5 - Messaging:

  • Kafka architecture (brokers, topics, partitions)
  • Message ordering guarantees
  • Consumer groups and rebalancing
  • At-least-once vs exactly-once semantics
  • RabbitMQ vs Kafka trade-offs

Week 6 - Storage:

  • Key-value store design (DynamoDB, Redis)
  • Distributed file systems (GFS, HDFS)
  • Partitioning strategies (hash, range, directory)
  • Quorum-based replication
  • Conflict resolution (vector clocks, last-write-wins)

Week 7 - Coordination Services:

  • ZooKeeper/etcd architecture
  • Distributed locks implementation
  • Service discovery patterns
  • Configuration management
  • Leader election implementations

Phase 3: Advanced Topics (2-3 weeks)

Week 8 - Rate Limiting & Load Balancing:

  • Rate limiter algorithms (token bucket, sliding window)
  • Distributed rate limiting with Redis
  • Load balancing algorithms
  • Session persistence strategies
  • Health check mechanisms

Week 9 - Consensus & Fault Tolerance:

  • Deep dive into Raft algorithm
  • Paxos vs Raft comparison
  • Byzantine fault tolerance basics
  • Circuit breaker pattern
  • Bulkhead pattern for isolation

Week 10 - Observability:

  • Distributed tracing (Jaeger, Zipkin)
  • Log aggregation (ELK stack)
  • Metrics collection (Prometheus)
  • Alerting strategies
  • SLOs, SLIs, SLAs

Phase 4: Practice & Mock Interviews (2-3 weeks)

Week 11-12 - Design Practice: Practice designing 3-4 systems per week:

  • Start with core questions (rate limiter, cache, message broker)
  • Progress to advanced (consensus system, distributed lock)
  • Time yourself: 45 minutes per design
  • Focus on trade-off discussions and operational aspects

Week 13 - Mock Interviews:

  • Schedule mock interviews with peers or platforms
  • Practice explaining designs out loud
  • Get feedback on communication clarity
  • Work on drawing diagrams quickly
  • Time management: 10 min requirements, 20 min design, 15 min deep dive

Key Resources

Books:

  • “Designing Data-Intensive Applications” by Martin Kleppmann
  • “Database Internals” by Alex Petrov
  • “System Design Interview” by Alex Xu

Online Courses:

  • Grokking the System Design Interview (Educative)
  • System Design Primer (GitHub)
  • MIT 6.824 Distributed Systems

Practice Platforms:

  • InfraExpert (infrastructure-specific questions)
  • HelloInterview (system design practice)
  • InterviewBit (system design problems)

Key Differences: Infrastructure vs Product Design

Infrastructure Design:

  • Focus: Internal systems, performance, scalability at component level
  • Depth: Deep dive into algorithms (Paxos, consistent hashing)
  • Questions: “Design a rate limiter”, “Design Redis”, “Design Kafka”
  • Emphasis: Durability, fault tolerance, throughput optimization
  • Technologies: Consensus protocols, replication, partitioning

Product Design:

  • Focus: User-facing features, end-to-end functionality
  • Breadth: More components, lighter on each
  • Questions: “Design Instagram”, “Design Uber”, “Design Twitter”
  • Emphasis: APIs, data models, user flows, feature requirements
  • Technologies: Microservices, REST APIs, databases, caching

80% Overlap: Both cover distributed systems fundamentals, but infrastructure goes deeper into system internals while product focuses on feature delivery.