Infrastructure design interviews focus on building the foundational systems that support applications at scale. Unlike product design interviews that emphasize user-facing features, infrastructure interviews dive deep into system-level components like distributed caches, message brokers, and consensus mechanisms. Here’s a comprehensive guide to help you master this interview type.
Infrastructure Design Interview Questions
Core Infrastructure Components
Distributed Storage & Caching:
- Design a distributed cache (Redis/Memcached): Focus on consistent hashing, cache eviction policies (LRU/LFU), cache stampede prevention, hot key handling, and replication strategies
- Design a key-value store (DynamoDB/Redis): Cover partitioning using consistent hashing, replication with quorum reads/writes, conflict resolution using vector clocks, gossip protocols for node discovery, and hinted handoff for temporary failures
- Design a distributed file system (GFS/HDFS): Address file chunking, metadata management with master-worker architecture, replication across data nodes, and fault tolerance mechanisms
Messaging & Communication:
- Design a message broker (Kafka/RabbitMQ): Implement partition-based scaling, leader-follower replication for durability, consumer group coordination, ordering guarantees, and offset tracking
- Design a distributed queue: Handle FIFO ordering, message persistence, at-least-once/exactly-once delivery semantics, and dead letter queues
- Design a pub-sub system: Manage topic-based routing, subscriber management, message filtering, and backpressure handling
Coordination & Control:
- Design a distributed lock service (ZooKeeper/etcd): Implement mutual exclusion using consensus algorithms, deadlock prevention, fault tolerance with lease-based locks, and fencing tokens to reject writes from stale lock holders
- Design a service discovery system (Consul/etcd): Support health checking, DNS-based service resolution, configuration storage with strong consistency, and multi-datacenter replication
- Design a leader election system: Use Raft or Paxos for consensus, handle split votes, implement randomized timeouts, and ensure safety during network partitions
Traffic Management:
- Design a rate limiter: Implement algorithms like token bucket, leaky bucket, or sliding window counter; handle distributed rate limiting with Redis, prevent race conditions with atomic operations
- Design a load balancer: Cover Layer 4 vs Layer 7 balancing, implement algorithms (round robin, least connections, weighted distribution), health checks, session persistence, and global load balancing
- Design an API gateway: Manage authentication, request routing, rate limiting, protocol translation, and response aggregation
Advanced Infrastructure Questions
Reliability & Durability:
- Design a consensus system: Implement Paxos or Raft algorithms, handle leader election, log replication, and safety guarantees during failures
- Design a write-ahead log (WAL): Ensure durability with sequential append-only logs, implement log sequence numbers (LSN), checkpointing, and crash recovery
- Design a distributed transaction coordinator: Use two-phase commit (2PC) with prepare and commit phases, handle in-doubt transactions, timeout mechanisms
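The 2PC flow above can be sketched as a toy in-memory simulation. The `Coordinator` and `Participant` classes here are illustrative assumptions, not a real protocol implementation (no timeouts, no durable state, no network).

```python
# Minimal in-memory sketch of two-phase commit (2PC). Illustrative only:
# real coordinators persist decisions and handle timeouts/in-doubt states.

class Participant:
    def __init__(self, name, vote_yes=True):
        self.name = name
        self.vote_yes = vote_yes
        self.state = "init"

    def prepare(self):
        # Phase 1: vote yes only if this participant can commit.
        self.state = "prepared" if self.vote_yes else "aborted"
        return self.vote_yes

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"


class Coordinator:
    def __init__(self, participants):
        self.participants = participants

    def run_transaction(self):
        # Phase 1 (prepare): collect votes from every participant.
        votes = [p.prepare() for p in self.participants]
        # Phase 2 (commit/abort): commit only on unanimous yes votes.
        if all(votes):
            for p in self.participants:
                p.commit()
            return "committed"
        for p in self.participants:
            p.abort()
        return "aborted"


nodes = [Participant("a"), Participant("b")]
outcome = Coordinator(nodes).run_transaction()
```

Note what the sketch omits: if the coordinator crashes between the phases, participants are left "in doubt" until it recovers, which is exactly the weakness the bullet above asks you to discuss.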
Observability:
- Design a distributed logging system: Aggregate logs from multiple sources, implement log shipping, indexing for search, retention policies
- Design a distributed tracing system: Track request flows across services, implement trace ID propagation, sampling strategies, span collection
- Design a metrics aggregation system: Collect time-series metrics, implement downsampling, anomaly detection, and visualization
Advanced Patterns:
- Design a CDN: Implement edge caching, cache invalidation strategies, origin shield, geo-routing with DNS
- Design a circuit breaker: Monitor failure rates, implement open/half-open/closed states, fallback mechanisms
- Design a distributed scheduler: Manage task distribution, implement priority queues, handle task failures and retries
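The circuit breaker mentioned above is small enough to sketch directly. This is a minimal single-threaded version; the thresholds, timeout, and injectable clock are illustrative assumptions (production breakers also need thread safety and rolling failure windows).

```python
import time

# Toy circuit breaker with closed -> open -> half-open states.
# failure_threshold and reset_timeout are illustrative parameters.

class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.state = "closed"
        self.opened_at = 0.0

    def call(self, fn):
        if self.state == "open":
            # After the reset timeout, allow one trial request (half-open).
            if time.monotonic() - self.opened_at >= self.reset_timeout:
                self.state = "half-open"
            else:
                raise RuntimeError("circuit open: failing fast")
        try:
            result = fn()
        except Exception:
            self.failures += 1
            # A half-open trial failure, or too many failures, opens the circuit.
            if self.state == "half-open" or self.failures >= self.failure_threshold:
                self.state = "open"
                self.opened_at = time.monotonic()
            raise
        # Any success closes the circuit and resets the failure count.
        self.failures = 0
        self.state = "closed"
        return result
```

The key design point: while open, the breaker fails fast without calling the downstream service at all, which is what gives the failing dependency room to recover.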
Communication Patterns & Approaches
Synchronous Communication
Request-Response Pattern: Direct client-server interaction with blocking calls; use for immediate feedback requirements
RPC (Remote Procedure Call): Function-like invocation across network boundaries; gRPC uses HTTP/2 for efficient binary communication with strong typing
REST API: Resource-oriented with HTTP verbs; stateless, cacheable, widely adopted for web services
Asynchronous Communication
Message Queues: Decouples producers and consumers; provides buffering, load leveling, and fault tolerance; examples include RabbitMQ, ActiveMQ
Pub-Sub Pattern: Publishers broadcast messages to topics; multiple subscribers receive copies; enables event-driven architectures; Kafka, Amazon SNS
Event Streaming: Ordered sequence of events stored in durable logs; enables replay, temporal queries; Kafka streams, Apache Flink
Gossip Protocol
Epidemic Dissemination: Nodes periodically exchange state with random peers; eventual consistency model; used in Cassandra, Consul for cluster membership
- Decentralized with no single point of failure
- Scales well to thousands of nodes
- Achieves convergence in O(log N) rounds
- Fault-tolerant through redundant message paths
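A push-style gossip round is easy to simulate: each round, every node that knows a rumor tells one random peer. The node count and seed below are illustrative; the point is that the whole cluster converges in a number of rounds far smaller than the node count.

```python
import random

# Toy simulation of epidemic (push) gossip dissemination.
# num_nodes and the RNG seed are illustrative assumptions.

def gossip_rounds(num_nodes=32, seed=42):
    random.seed(seed)
    # Node 0 starts with the rumor; nobody else knows it yet.
    knows = [i == 0 for i in range(num_nodes)]
    rounds = 0
    while not all(knows):
        rounds += 1
        for node in range(num_nodes):
            if knows[node]:
                # Push the rumor to one uniformly random peer.
                knows[random.randrange(num_nodes)] = True
    return rounds

rounds_needed = gossip_rounds()
```

Because every informed node spreads the rumor each round, the informed set roughly doubles early on, which is where the O(log N) convergence claim comes from.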
Essential Algorithms & Techniques
Consensus Algorithms
Paxos:
- Quorum-based with proposers, acceptors, learners
- Allows any server to become leader, but a new leader may first need to catch up its log
- Optimal message complexity but notoriously difficult to implement correctly
- Used in Google Chubby; ZooKeeper's ZAB protocol is closely related
Raft:
- Leader-based with explicit leader election using randomized timeouts
- Only up-to-date servers can become leaders
- More understandable than Paxos with similar performance
- Used in etcd, Consul, CockroachDB
Key Differences:
- Paxos partitions the ballot-number space among servers; Raft allows at most one elected leader per term
- Paxos lets any server become leader; Raft only elects servers with an up-to-date log
- Raft leader election is lightweight; a new Multi-Paxos leader may need a costly log transfer to catch up
Data Distribution Algorithms
Consistent Hashing:
- Maps both nodes and keys to hash ring (0 to 2^32-1)
- Keys assigned to nearest clockwise node
- Minimizes remapping when nodes added/removed (only K/N keys affected)
- Use virtual nodes for load balancing across heterogeneous hardware
- Implementation: sorted ring of hashes (sorted array or balanced BST) with binary search for O(log N) lookups
Use Cases: Distributed caches (Memcached, Redis Cluster), CDNs, distributed databases (Cassandra)
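The ring described above can be sketched in a few lines. Hash function, virtual-node count, and server names here are illustrative choices, not a specific library's API.

```python
import bisect
import hashlib

# Minimal consistent-hash ring with virtual nodes. MD5 and vnodes=100
# are illustrative choices; any uniform hash works.

class HashRing:
    def __init__(self, nodes, vnodes=100):
        self._ring = []  # sorted list of (hash, node) pairs
        for node in nodes:
            for i in range(vnodes):
                self._ring.append((self._hash(f"{node}#{i}"), node))
        self._ring.sort()
        self._hashes = [h for h, _ in self._ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def get_node(self, key):
        # Nearest clockwise virtual node; modulo wraps around the ring.
        idx = bisect.bisect_right(self._hashes, self._hash(key)) % len(self._hashes)
        return self._ring[idx][1]

ring = HashRing(["cache-a", "cache-b", "cache-c"])
owner = ring.get_node("user:42")
```

Removing a node only remaps the keys that node owned: any key whose clockwise successor belonged to a surviving node keeps the same owner, which is the "only K/N keys affected" property.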
Rate Limiting Algorithms
Token Bucket:
- Bucket holds tokens at maximum capacity
- Tokens added at fixed rate (r tokens/second)
- Requests consume tokens; rejected if insufficient
- Pros: Allows bursts up to bucket capacity, memory efficient
- Cons: Parameters (bucket size, refill rate) need tuning
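A minimal token bucket follows directly from the rules above. The injectable clock is an assumption added to make the sketch testable; capacity and refill rate are the tunable parameters the cons bullet mentions.

```python
import time

# Token-bucket rate limiter sketch. capacity and refill_rate are the
# tuning parameters; the injectable clock is for testability.

class TokenBucket:
    def __init__(self, capacity, refill_rate, clock=time.monotonic):
        self.capacity = capacity        # maximum burst size
        self.refill_rate = refill_rate  # tokens added per second
        self.tokens = float(capacity)
        self.clock = clock
        self.last = clock()

    def allow(self, cost=1):
        now = self.clock()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

Storing only a token count and a timestamp per client is why the pros bullet calls this memory efficient.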
Leaky Bucket:
- FIFO queue processes requests at constant rate
- New requests added to queue; dropped if full
- Pros: Smooth output rate, simple implementation
- Cons: Cannot handle bursts, may reject legitimate traffic
Sliding Window Counter:
- Combines fixed window efficiency with sliding log accuracy
- Weighted count: current_window_count + (previous_window_count × overlap_percentage)
- Pros: More accurate than fixed window, efficient storage
- Cons: Approximation may allow slight over-limit
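The weighted-count formula above translates directly into code. Window size and limit below are illustrative numbers, and the function assumes the caller tracks per-window counters (e.g. in Redis).

```python
# Sliding-window-counter check using the weighted-count formula above.
# The caller is assumed to maintain the previous and current window counts.

def sliding_window_allowed(prev_count, curr_count, elapsed_in_window,
                           window, limit):
    # Fraction of the previous window still overlapping the sliding window.
    overlap = (window - elapsed_in_window) / window
    estimated = curr_count + prev_count * overlap
    return estimated < limit

# 15s into a 60s window: 75% of the previous window still counts.
# Estimate = 30 + 80 × 0.75 = 90 < 100, so the request is allowed.
ok = sliding_window_allowed(prev_count=80, curr_count=30,
                            elapsed_in_window=15, window=60, limit=100)
```

The approximation assumes requests in the previous window were evenly spread, which is the source of the slight over-limit mentioned in the cons.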
Implementation Considerations:
- Use Redis with atomic operations (INCR, EXPIRE) for distributed rate limiting
- Local in-memory checks with eventual consistency for low latency
- Implement request coalescing for cache stampede prevention
Load Balancing Algorithms
Round Robin:
- Sequential distribution in cyclic order
- Use Case: Homogeneous servers with predictable load
- Pros: Simple, equal distribution
- Cons: Ignores server capacity and current load
Least Connections:
- Routes to server with fewest active connections
- Use Case: Variable connection duration, similar server capacity
- Pros: Dynamic load distribution, prevents overload
- Cons: Requires connection state tracking
Weighted Least Connections:
- Combines connection count with server capacity weights
- Use Case: Heterogeneous server capacities
- Pros: Capacity-aware, handles variable loads
- Cons: Complex configuration, requires capacity profiling
IP Hash:
- Consistent mapping of client IP to server
- Use Case: Session persistence requirements
- Pros: Sticky sessions without state
- Cons: Uneven distribution with uneven IP distribution
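Of the algorithms above, least connections is the easiest to get subtly wrong, so a small sketch helps. Server names are illustrative, and this single-threaded version glosses over the connection-state-tracking cost noted in the cons.

```python
# Toy least-connections balancer. Single-threaded sketch: a real one
# needs locking and shared connection-state tracking.

class LeastConnectionsBalancer:
    def __init__(self, servers):
        self.active = {s: 0 for s in servers}

    def acquire(self):
        # Route to the server with the fewest active connections
        # (ties broken by insertion order of the server list).
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server

    def release(self, server):
        # Caller must report connection completion, or counts drift.
        self.active[server] -= 1

lb = LeastConnectionsBalancer(["s1", "s2"])
first, second = lb.acquire(), lb.acquire()
```

The `release` call is the hidden cost: unlike round robin, the balancer only works if every connection's completion is reported back.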
Probabilistic Data Structures
Bloom Filter:
- Space-efficient set membership testing with false positives
- Uses k hash functions and m-bit array
- False positive rate: (1 - e^(-kn/m))^k, where n = elements inserted
- Optimal k: (m/n) × ln(2)
- Optimal m: -(n × ln(ε)) / (ln(2))^2 for target error rate ε
Use Cases: URL deduplication (web crawlers), cache filtering (avoid disk lookups), database query optimization
Trade-offs:
- Guarantees no false negatives (an inserted element is always reported present)
- Cannot remove elements (use counting Bloom filter variant)
- Fixed size after creation (use scalable Bloom filter for dynamic workloads)
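A compact Bloom filter that sizes itself from the optimal-m and optimal-k formulas above might look like this. The double-hashing trick (deriving k positions from two base hashes) is a common implementation choice, not the only one.

```python
import hashlib
import math

# Minimal Bloom filter sized by the optimal-m/optimal-k formulas.
# Double hashing from one SHA-256 digest is an implementation choice.

class BloomFilter:
    def __init__(self, expected_items, error_rate):
        # m = -(n × ln(ε)) / (ln 2)^2 ; k = (m/n) × ln 2
        self.m = math.ceil(-(expected_items * math.log(error_rate))
                           / (math.log(2) ** 2))
        self.k = max(1, round((self.m / expected_items) * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, item):
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        # k probe positions via double hashing: h1 + i*h2 mod m.
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # True may be a false positive; False is always correct.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

bf = BloomFilter(expected_items=1000, error_rate=0.01)
bf.add("https://example.com")
```

For n = 1000 and ε = 0.01 this yields m ≈ 9586 bits (about 1.2 KB) and k = 7, a useful back-of-envelope pair to quote in an interview.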
Write-Ahead Log (WAL)
Core Principle: Log changes before applying them to ensure durability and crash recovery
Process:
- Assign Log Sequence Number (LSN) to each operation
- Write log entry to durable storage (fsync)
- Apply change to data structures (can be async)
- Periodic checkpointing flushes data and truncates log
Benefits:
- Durability: No data loss after log write
- Fast writes: Sequential appends faster than random writes
- Replication: Ship log entries to replicas
- Point-in-time recovery: Replay log to specific LSN
Used In: PostgreSQL, MySQL, MongoDB, Apache Kafka, and most other durable storage systems
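The process above can be sketched end to end: append records with LSNs, fsync, then replay the log to rebuild state after a "crash". The tab-separated file format is an illustrative assumption; real WALs use binary records with checksums.

```python
import os
import tempfile

# Minimal write-ahead log sketch: LSN-stamped appends with fsync,
# plus replay-based recovery. The text record format is illustrative.

class WriteAheadLog:
    def __init__(self, path):
        self.path = path
        self.lsn = 0

    def append(self, key, value):
        self.lsn += 1
        with open(self.path, "a") as f:
            f.write(f"{self.lsn}\t{key}\t{value}\n")
            f.flush()
            os.fsync(f.fileno())  # durability point: entry survives a crash
        return self.lsn

    def replay(self):
        # Crash recovery: rebuild in-memory state by replaying every entry.
        state = {}
        with open(self.path) as f:
            for line in f:
                _, key, value = line.rstrip("\n").split("\t")
                state[key] = value
        return state

path = os.path.join(tempfile.mkdtemp(), "wal.log")
wal = WriteAheadLog(path)
wal.append("x", "1")
wal.append("x", "2")
recovered = wal.replay()  # state as of the last durable write
```

Checkpointing, omitted here, would periodically persist `state` and truncate the log so replay stays bounded.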
CAP Theorem & Consistency Models
CAP Theorem Trade-offs
CP (Consistency + Partition Tolerance):
- Sacrifices availability during partitions
- All nodes see same data; blocks requests if can’t guarantee consistency
- Examples: MongoDB, HBase, ZooKeeper, etcd
- Use Case: Financial transactions, inventory management
AP (Availability + Partition Tolerance):
- Sacrifices strong consistency for availability
- System remains responsive; may return stale data
- Examples: Cassandra, DynamoDB, CouchDB
- Use Case: Social media feeds, shopping carts, content delivery
CA (Consistency + Availability):
- Cannot tolerate partitions (impractical for distributed systems)
- Only viable for single-node or tightly-coupled systems
Consistency Models
Strong Consistency:
- All reads return most recent write immediately
- Requires coordination (Paxos/Raft) between nodes
- Pros: Data reliability, simpler application logic
- Cons: Higher latency, reduced availability, scalability limits
- Examples: Google Spanner, etcd
Eventual Consistency:
- Guarantees convergence given sufficient time without new updates
- Allows temporary inconsistencies for better performance
- Pros: High availability, low latency, horizontal scalability
- Cons: Temporary stale reads, complex conflict resolution
- Examples: Amazon DynamoDB, Cassandra, Riak
Read-Your-Writes Consistency: Guarantees clients see their own writes immediately
Monotonic Reads: Successive reads never return older values
Replication Strategies
Master-Slave (Primary-Replica):
- Single master handles writes; replicas serve reads
- Synchronous: Zero data loss but higher latency
- Asynchronous: Lower latency but potential data loss
- Use Case: Read-heavy workloads (analytics, reporting)
Multi-Master:
- Multiple nodes accept writes; changes propagated to peers
- Pros: High availability, write scalability, geo-distribution
- Cons: Complex conflict resolution, potential inconsistencies
- Use Case: Global distributed systems (Active Directory, Oracle RAC)
Infrastructure Design Interview Approach
Step 1: Clarify Requirements (10 minutes)
Functional Requirements:
- What operations must the system support? (set/get for cache, publish/subscribe for message broker)
- What are the performance targets? (latency, throughput)
- What consistency guarantees are needed?
Non-Functional Requirements:
- Scale: How many requests per second? Data volume?
- Availability: What’s the acceptable downtime? (99.9%, 99.99%?)
- Durability: Can we lose data? How much?
- Latency: P50, P99, P999 targets?
Example Questions:
- “What’s the read/write ratio?”
- “Do we need strong consistency or is eventual consistency acceptable?”
- “What’s the expected failure rate we need to handle?”
Step 2: Capacity Estimation
Calculate storage, memory, network bandwidth, and QPS requirements:
- Storage: data_size × replication_factor × retention_period
- Memory: active_dataset_size × cache_ratio
- Bandwidth: request_size × QPS
- Servers: total_QPS / QPS_per_server
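Plugging illustrative numbers into the formulas above shows the kind of arithmetic interviewers expect. All inputs here (10k writes/sec, 1 KB records, 3x replication, 30-day retention, 2k QPS per server) are assumed values, not targets from any real system.

```python
# Worked capacity estimate using the formulas above. Every input is an
# illustrative assumption for demonstration.

qps = 10_000                       # write requests per second
record_bytes = 1_000               # ~1 KB per record
replication_factor = 3
retention_seconds = 30 * 24 * 3600 # 30-day retention

# Storage: data_size × replication_factor × retention_period
storage_tb = qps * record_bytes * replication_factor * retention_seconds / 1e12

# Bandwidth: request_size × QPS
bandwidth_mb_per_s = qps * record_bytes / 1e6

# Servers: total_QPS / QPS_per_server (rounded up)
servers = -(-qps // 2_000)  # ceiling division at 2k QPS per server
```

This works out to roughly 78 TB of storage, 10 MB/s of ingest bandwidth, and 5 servers, the order-of-magnitude figures you would carry into the high-level design.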
Step 3: High-Level Design
Start Simple: Draw basic components and data flow
- Client → Load Balancer → Service Nodes → Database
- Identify single points of failure
Add Layers:
- Caching layer (Redis cluster with consistent hashing)
- Message queue (Kafka for async processing)
- Replication (master-slave for reads, multi-master for writes)
Step 4: Deep Dive into System Internals
Infrastructure Focus Areas:
- Consensus: How do nodes agree on state? (Raft for leader election)
- Durability: How do we prevent data loss? (WAL, replication, backups)
- Partitioning: How is data distributed? (Consistent hashing, range partitioning)
- Failure Handling: What happens when nodes crash? (Heartbeats, quorum reads)
- Performance: How do we optimize for throughput? (Batching, compression, connection pooling)
Design Decisions to Address:
- Choice of data structures (B-tree vs LSM-tree)
- Network protocols (TCP vs UDP, HTTP/2 vs gRPC)
- Serialization formats (Protocol Buffers vs JSON)
- Storage engines (RocksDB, LevelDB)
Step 5: Discuss Trade-offs
Consistency vs Availability (CAP theorem): Strong consistency slows down writes
Latency vs Durability: Fsync every write is slow but safe
Complexity vs Performance: Complex algorithms (Paxos) offer optimality but are hard to implement
Cost vs Resilience: More replicas = higher availability but increased cost
Step 6: Operational Considerations
- Monitoring: What metrics to track? (latency percentiles, error rates, saturation)
- Alerting: When to page engineers? (error rate spikes, high latency)
- Debugging: How to troubleshoot? (distributed tracing, centralized logging)
- Deployment: Rolling updates, blue-green deployments, canary releases
Complete Preparation Roadmap
Phase 1: Foundations (2-3 weeks)
Week 1 - Core Concepts:
- Study CAP theorem with real-world examples (DynamoDB, Cassandra, MongoDB)
- Understand consistency models (strong, eventual, causal)
- Learn replication strategies (master-slave, multi-master)
- Review network fundamentals (TCP/IP, HTTP, DNS)
Week 2 - Distributed Systems Primitives:
- Consistent hashing implementation and use cases
- Leader election (Raft vs Paxos comparison)
- Distributed transactions (2PC protocol)
- Failure detection (heartbeats, health checks)
Week 3 - Data Structures & Algorithms:
- Bloom filters with false positive calculations
- Write-ahead logs for durability
- Merkle trees for replication reconciliation
- Gossip protocols for information dissemination
Phase 2: Core Infrastructure Systems (3-4 weeks)
Week 4 - Caching:
- Redis/Memcached architecture
- Eviction policies (LRU, LFU, TTL)
- Cache stampede prevention strategies
- Consistent hashing for cache distribution
- Hot key handling techniques
Week 5 - Messaging:
- Kafka architecture (brokers, topics, partitions)
- Message ordering guarantees
- Consumer groups and rebalancing
- At-least-once vs exactly-once semantics
- RabbitMQ vs Kafka trade-offs
Week 6 - Storage:
- Key-value store design (DynamoDB, Redis)
- Distributed file systems (GFS, HDFS)
- Partitioning strategies (hash, range, directory)
- Quorum-based replication
- Conflict resolution (vector clocks, last-write-wins)
Week 7 - Coordination Services:
- ZooKeeper/etcd architecture
- Distributed locks implementation
- Service discovery patterns
- Configuration management
- Leader election implementations
Phase 3: Advanced Topics (2-3 weeks)
Week 8 - Rate Limiting & Load Balancing:
- Rate limiter algorithms (token bucket, sliding window)
- Distributed rate limiting with Redis
- Load balancing algorithms
- Session persistence strategies
- Health check mechanisms
Week 9 - Consensus & Fault Tolerance:
- Deep dive into Raft algorithm
- Paxos vs Raft comparison
- Byzantine fault tolerance basics
- Circuit breaker pattern
- Bulkhead pattern for isolation
Week 10 - Observability:
- Distributed tracing (Jaeger, Zipkin)
- Log aggregation (ELK stack)
- Metrics collection (Prometheus)
- Alerting strategies
- SLOs, SLIs, SLAs
Phase 4: Practice & Mock Interviews (2-3 weeks)
Week 11-12 - Design Practice: Aim for 3-4 full system designs per week:
- Start with core questions (rate limiter, cache, message broker)
- Progress to advanced (consensus system, distributed lock)
- Time yourself: 45 minutes per design
- Focus on trade-off discussions and operational aspects
Week 13 - Mock Interviews:
- Schedule mock interviews with peers or platforms
- Practice explaining designs out loud
- Get feedback on communication clarity
- Work on drawing diagrams quickly
- Time management: 10 min requirements, 20 min design, 15 min deep dive
Key Resources
Books:
- “Designing Data-Intensive Applications” by Martin Kleppmann
- “Database Internals” by Alex Petrov
- “System Design Interview” by Alex Xu
Online Courses:
- Grokking the System Design Interview (Educative)
- System Design Primer (GitHub)
- MIT 6.824 Distributed Systems
Practice Platforms:
- InfraExpert (infrastructure-specific questions)
- HelloInterview (system design practice)
- InterviewBit (system design problems)
Key Differences: Infrastructure vs Product Design
Infrastructure Design:
- Focus: Internal systems, performance, scalability at component level
- Depth: Deep dive into algorithms (Paxos, consistent hashing)
- Questions: “Design a rate limiter”, “Design Redis”, “Design Kafka”
- Emphasis: Durability, fault tolerance, throughput optimization
- Technologies: Consensus protocols, replication, partitioning
Product Design:
- Focus: User-facing features, end-to-end functionality
- Breadth: More components, lighter on each
- Questions: “Design Instagram”, “Design Uber”, “Design Twitter”
- Emphasis: APIs, data models, user flows, feature requirements
- Technologies: Microservices, REST APIs, databases, caching
80% Overlap: Both cover distributed systems fundamentals, but infrastructure goes deeper into system internals while product focuses on feature delivery.