Infrastructure design interviews focus on building the foundational systems that support applications at scale. Unlike product design interviews that emphasize user-facing features, infrastructure interviews dive deep into system-level components like distributed caches, message brokers, and consensus mechanisms. Here’s a comprehensive guide to help you master this interview type.
Infrastructure Design Interview Questions
Core Infrastructure Components
Distributed Storage & Caching:
- Design a distributed cache (Redis/Memcached): Focus on consistent hashing, cache eviction policies (LRU/LFU), cache stampede prevention, hot key handling, and replication strategies
- Design a key-value store (DynamoDB/Redis): Cover partitioning using consistent hashing, replication with quorum reads/writes, conflict resolution using vector clocks, gossip protocols for node discovery, and hinted handoff for temporary failures
- Design a distributed file system (GFS/HDFS): Address file chunking, metadata management with master-worker architecture, replication across data nodes, and fault tolerance mechanisms
Messaging & Communication:
- Design a message broker (Kafka/RabbitMQ): Implement partition-based scaling, leader-follower replication for durability, consumer group coordination, ordering guarantees, and offset tracking
- Design a distributed queue: Handle FIFO ordering, message persistence, at-least-once/exactly-once delivery semantics, and dead letter queues
- Design a pub-sub system: Manage topic-based routing, subscriber management, message filtering, and backpressure handling
Coordination & Control:
- Design a distributed lock service (ZooKeeper/etcd): Implement mutual exclusion using consensus algorithms, deadlock prevention, fault tolerance with lease-based locks, and fencing tokens to reject writes from stale lock holders
- Design a service discovery system (Consul/etcd): Support health checking, DNS-based service resolution, configuration storage with strong consistency, and multi-datacenter replication
- Design a leader election system: Use Raft or Paxos for consensus, handle split votes, implement randomized timeouts, and ensure safety during network partitions
Traffic Management:
- Design a rate limiter: Implement algorithms like token bucket, leaky bucket, or sliding window counter; handle distributed rate limiting with Redis, prevent race conditions with atomic operations
- Design a load balancer: Cover Layer 4 vs Layer 7 balancing, implement algorithms (round robin, least connections, weighted distribution), health checks, session persistence, and global load balancing
- Design an API gateway: Manage authentication, request routing, rate limiting, protocol translation, and response aggregation
Advanced Infrastructure Questions
Reliability & Durability:
- Design a consensus system: Implement Paxos or Raft algorithms, handle leader election, log replication, and safety guarantees during failures
- Design a write-ahead log (WAL): Ensure durability with sequential append-only logs, implement log sequence numbers (LSN), checkpointing, and crash recovery
- Design a distributed transaction coordinator: Use two-phase commit (2PC) with prepare and commit phases, handle in-doubt transactions, timeout mechanisms
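The 2PC flow above can be sketched as a toy in-memory simulation. The `Coordinator` and `Participant` classes here are illustrative assumptions, not a real protocol implementation (no timeouts, no durable state, no network).

```python
# Minimal in-memory sketch of two-phase commit (2PC). Illustrative only:
# real coordinators persist decisions and handle timeouts/in-doubt states.

class Participant:
    def __init__(self, name, vote_yes=True):
        self.name = name
        self.vote_yes = vote_yes
        self.state = "init"

    def prepare(self):
        # Phase 1: vote yes only if this participant can commit.
        self.state = "prepared" if self.vote_yes else "aborted"
        return self.vote_yes

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"


class Coordinator:
    def __init__(self, participants):
        self.participants = participants

    def run_transaction(self):
        # Phase 1 (prepare): collect votes from every participant.
        votes = [p.prepare() for p in self.participants]
        # Phase 2 (commit/abort): commit only on unanimous yes votes.
        if all(votes):
            for p in self.participants:
                p.commit()
            return "committed"
        for p in self.participants:
            p.abort()
        return "aborted"


nodes = [Participant("a"), Participant("b")]
outcome = Coordinator(nodes).run_transaction()
```

Note what the sketch omits: if the coordinator crashes between the phases, participants are left "in doubt" until it recovers, which is exactly the weakness the bullet above asks you to discuss.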
Observability:
- Design a distributed logging system: Aggregate logs from multiple sources, implement log shipping, indexing for search, retention policies
- Design a distributed tracing system: Track request flows across services, implement trace ID propagation, sampling strategies, span collection
- Design a metrics aggregation system: Collect time-series metrics, implement downsampling, anomaly detection, and visualization
Advanced Patterns:
- Design a CDN: Implement edge caching, cache invalidation strategies, origin shield, geo-routing with DNS
- Design a circuit breaker: Monitor failure rates, implement open/half-open/closed states, fallback mechanisms
- Design a distributed scheduler: Manage task distribution, implement priority queues, handle task failures and retries
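The circuit breaker mentioned above is small enough to sketch directly. This is a minimal single-threaded version; the thresholds, timeout, and injectable clock are illustrative assumptions (production breakers also need thread safety and rolling failure windows).

```python
import time

# Toy circuit breaker with closed -> open -> half-open states.
# failure_threshold and reset_timeout are illustrative parameters.

class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.state = "closed"
        self.opened_at = 0.0

    def call(self, fn):
        if self.state == "open":
            # After the reset timeout, allow one trial request (half-open).
            if time.monotonic() - self.opened_at >= self.reset_timeout:
                self.state = "half-open"
            else:
                raise RuntimeError("circuit open: failing fast")
        try:
            result = fn()
        except Exception:
            self.failures += 1
            # A half-open trial failure, or too many failures, opens the circuit.
            if self.state == "half-open" or self.failures >= self.failure_threshold:
                self.state = "open"
                self.opened_at = time.monotonic()
            raise
        # Any success closes the circuit and resets the failure count.
        self.failures = 0
        self.state = "closed"
        return result
```

The key design point: while open, the breaker fails fast without calling the downstream service at all, which is what gives the failing dependency room to recover.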
Communication Patterns & Approaches
Synchronous Communication
Request-Response Pattern: Direct client-server interaction with blocking calls; use for immediate feedback requirements
RPC (Remote Procedure Call): Function-like invocation across network boundaries; gRPC uses HTTP/2 for efficient binary communication with strong typing
REST API: Resource-oriented with HTTP verbs; stateless, cacheable, widely adopted for web services
Asynchronous Communication
Message Queues: Decouples producers and consumers; provides buffering, load leveling, and fault tolerance; examples include RabbitMQ, ActiveMQ
Pub-Sub Pattern: Publishers broadcast messages to topics; multiple subscribers receive copies; enables event-driven architectures; Kafka, Amazon SNS
Event Streaming: Ordered sequence of events stored in durable logs; enables replay, temporal queries; Kafka streams, Apache Flink
Gossip Protocol
Epidemic Dissemination: Nodes periodically exchange state with random peers; eventual consistency model; used in Cassandra, Consul for cluster membership
- Decentralized with no single point of failure
- Scales well to thousands of nodes
- Achieves convergence in O(log N) rounds
- Fault-tolerant through redundant message paths
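A push-style gossip round is easy to simulate: each round, every node that knows a rumor tells one random peer. The node count and seed below are illustrative; the point is that the whole cluster converges in a number of rounds far smaller than the node count.

```python
import random

# Toy simulation of epidemic (push) gossip dissemination.
# num_nodes and the RNG seed are illustrative assumptions.

def gossip_rounds(num_nodes=32, seed=42):
    random.seed(seed)
    # Node 0 starts with the rumor; nobody else knows it yet.
    knows = [i == 0 for i in range(num_nodes)]
    rounds = 0
    while not all(knows):
        rounds += 1
        for node in range(num_nodes):
            if knows[node]:
                # Push the rumor to one uniformly random peer.
                knows[random.randrange(num_nodes)] = True
    return rounds

rounds_needed = gossip_rounds()
```

Because every informed node spreads the rumor each round, the informed set roughly doubles early on, which is where the O(log N) convergence claim comes from.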
Essential Algorithms & Techniques
Consensus Algorithms
Paxos:
- Quorum-based with proposers, acceptors, learners
- Allows any server to become leader, but a new leader may first need to catch up its log
- Optimal message complexity but notoriously difficult to implement correctly
- Used in Google Chubby; ZooKeeper's ZAB protocol is closely related
Raft:
- Leader-based with explicit leader election using randomized timeouts
- Only up-to-date servers can become leaders
- More understandable than Paxos with similar performance
- Used in etcd, Consul, CockroachDB
Key Differences:
- Paxos partitions the ballot-number space among servers; Raft allows at most one elected leader per term
- Paxos lets any server become leader; Raft only elects servers with an up-to-date log
- Raft leader election is lightweight; a new Multi-Paxos leader may need a costly log transfer to catch up
Data Distribution Algorithms
Consistent Hashing:
- Maps both nodes and keys to hash ring (0 to 2^32-1)
- Keys assigned to nearest clockwise node
- Minimizes remapping when nodes added/removed (only K/N keys affected)
- Use virtual nodes for load balancing across heterogeneous hardware
- Implementation: sorted ring of hashes (sorted array or balanced BST) with binary search for O(log N) lookups
Use Cases: Distributed caches (Memcached, Redis Cluster), CDNs, distributed databases (Cassandra)
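The ring described above can be sketched in a few lines. Hash function, virtual-node count, and server names here are illustrative choices, not a specific library's API.

```python
import bisect
import hashlib

# Minimal consistent-hash ring with virtual nodes. MD5 and vnodes=100
# are illustrative choices; any uniform hash works.

class HashRing:
    def __init__(self, nodes, vnodes=100):
        self._ring = []  # sorted list of (hash, node) pairs
        for node in nodes:
            for i in range(vnodes):
                self._ring.append((self._hash(f"{node}#{i}"), node))
        self._ring.sort()
        self._hashes = [h for h, _ in self._ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def get_node(self, key):
        # Nearest clockwise virtual node; modulo wraps around the ring.
        idx = bisect.bisect_right(self._hashes, self._hash(key)) % len(self._hashes)
        return self._ring[idx][1]

ring = HashRing(["cache-a", "cache-b", "cache-c"])
owner = ring.get_node("user:42")
```

Removing a node only remaps the keys that node owned: any key whose clockwise successor belonged to a surviving node keeps the same owner, which is the "only K/N keys affected" property.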
Rate Limiting Algorithms
Token Bucket:
- Bucket holds tokens at maximum capacity
- Tokens added at fixed rate (r tokens/second)
- Requests consume tokens; rejected if insufficient
- Pros: Allows bursts up to bucket capacity, memory efficient
- Cons: Parameters (bucket size, refill rate) need tuning
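A minimal token bucket follows directly from the rules above. The injectable clock is an assumption added to make the sketch testable; capacity and refill rate are the tunable parameters the cons bullet mentions.

```python
import time

# Token-bucket rate limiter sketch. capacity and refill_rate are the
# tuning parameters; the injectable clock is for testability.

class TokenBucket:
    def __init__(self, capacity, refill_rate, clock=time.monotonic):
        self.capacity = capacity        # maximum burst size
        self.refill_rate = refill_rate  # tokens added per second
        self.tokens = float(capacity)
        self.clock = clock
        self.last = clock()

    def allow(self, cost=1):
        now = self.clock()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

Storing only a token count and a timestamp per client is why the pros bullet calls this memory efficient.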
Leaky Bucket:
- FIFO queue processes requests at constant rate
- New requests added to queue; dropped if full
- Pros: Smooth output rate, simple implementation
- Cons: Cannot handle bursts, may reject legitimate traffic
Sliding Window Counter:
- Combines fixed window efficiency with sliding log accuracy
- Weighted count: current_window_count + (previous_window_count × overlap_percentage)
- Pros: More accurate than fixed window, efficient storage
- Cons: Approximation may allow slight over-limit
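The weighted-count formula above translates directly into code. Window size and limit below are illustrative numbers, and the function assumes the caller tracks per-window counters (e.g. in Redis).

```python
# Sliding-window-counter check using the weighted-count formula above.
# The caller is assumed to maintain the previous and current window counts.

def sliding_window_allowed(prev_count, curr_count, elapsed_in_window,
                           window, limit):
    # Fraction of the previous window still overlapping the sliding window.
    overlap = (window - elapsed_in_window) / window
    estimated = curr_count + prev_count * overlap
    return estimated < limit

# 15s into a 60s window: 75% of the previous window still counts.
# Estimate = 30 + 80 × 0.75 = 90 < 100, so the request is allowed.
ok = sliding_window_allowed(prev_count=80, curr_count=30,
                            elapsed_in_window=15, window=60, limit=100)
```

The approximation assumes requests in the previous window were evenly spread, which is the source of the slight over-limit mentioned in the cons.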
Implementation Considerations:
- Use Redis with atomic operations (INCR, EXPIRE) for distributed rate limiting
- Local in-memory checks with eventual consistency for low latency
- Implement request coalescing for cache stampede prevention
Load Balancing Algorithms
Round Robin:
- Sequential distribution in cyclic order
- Use Case: Homogeneous servers with predictable load
- Pros: Simple, equal distribution
- Cons: Ignores server capacity and current load
Least Connections:
- Routes to server with fewest active connections
- Use Case: Variable connection duration, similar server capacity
- Pros: Dynamic load distribution, prevents overload
- Cons: Requires connection state tracking
Weighted Least Connections:
- Combines connection count with server capacity weights
- Use Case: Heterogeneous server capacities
- Pros: Capacity-aware, handles variable loads
- Cons: Complex configuration, requires capacity profiling
IP Hash:
- Consistent mapping of client IP to server
- Use Case: Session persistence requirements
- Pros: Sticky sessions without state
- Cons: Uneven distribution with uneven IP distribution
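Of the algorithms above, least connections is the easiest to get subtly wrong, so a small sketch helps. Server names are illustrative, and this single-threaded version glosses over the connection-state-tracking cost noted in the cons.

```python
# Toy least-connections balancer. Single-threaded sketch: a real one
# needs locking and shared connection-state tracking.

class LeastConnectionsBalancer:
    def __init__(self, servers):
        self.active = {s: 0 for s in servers}

    def acquire(self):
        # Route to the server with the fewest active connections
        # (ties broken by insertion order of the server list).
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server

    def release(self, server):
        # Caller must report connection completion, or counts drift.
        self.active[server] -= 1

lb = LeastConnectionsBalancer(["s1", "s2"])
first, second = lb.acquire(), lb.acquire()
```

The `release` call is the hidden cost: unlike round robin, the balancer only works if every connection's completion is reported back.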
Probabilistic Data Structures
Bloom Filter:
- Space-efficient set membership testing with false positives
- Uses k hash functions and m-bit array
- False positive rate: (1 - e^(-kn/m))^k, where n = elements inserted
- Optimal k: (m/n) × ln(2)
- Optimal m: -(n × ln(ε)) / (ln(2))^2 for target error rate ε
Use Cases: URL deduplication (web crawlers), cache filtering (avoid disk lookups), database query optimization
Trade-offs:
- Guarantees no false negatives (an inserted element is always reported present)
- Cannot remove elements (use counting Bloom filter variant)
- Fixed size after creation (use scalable Bloom filter for dynamic workloads)
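A compact Bloom filter that sizes itself from the optimal-m and optimal-k formulas above might look like this. The double-hashing trick (deriving k positions from two base hashes) is a common implementation choice, not the only one.

```python
import hashlib
import math

# Minimal Bloom filter sized by the optimal-m/optimal-k formulas.
# Double hashing from one SHA-256 digest is an implementation choice.

class BloomFilter:
    def __init__(self, expected_items, error_rate):
        # m = -(n × ln(ε)) / (ln 2)^2 ; k = (m/n) × ln 2
        self.m = math.ceil(-(expected_items * math.log(error_rate))
                           / (math.log(2) ** 2))
        self.k = max(1, round((self.m / expected_items) * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, item):
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        # k probe positions via double hashing: h1 + i*h2 mod m.
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # True may be a false positive; False is always correct.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

bf = BloomFilter(expected_items=1000, error_rate=0.01)
bf.add("https://example.com")
```

For n = 1000 and ε = 0.01 this yields m ≈ 9586 bits (about 1.2 KB) and k = 7, a useful back-of-envelope pair to quote in an interview.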
Write-Ahead Log (WAL)
Core Principle: Log changes before applying them to ensure durability and crash recovery
Process:
- Assign Log Sequence Number (LSN) to each operation
- Write log entry to durable storage (fsync)
- Apply change to data structures (can be async)
- Periodic checkpointing flushes data and truncates log
Benefits:
- Durability: No data loss after log write
- Fast writes: Sequential appends faster than random writes
- Replication: Ship log entries to replicas
- Point-in-time recovery: Replay log to specific LSN
Used In: PostgreSQL, MySQL, MongoDB, Apache Kafka, and most other durable storage systems
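The process above can be sketched end to end: append records with LSNs, fsync, then replay the log to rebuild state after a "crash". The tab-separated file format is an illustrative assumption; real WALs use binary records with checksums.

```python
import os
import tempfile

# Minimal write-ahead log sketch: LSN-stamped appends with fsync,
# plus replay-based recovery. The text record format is illustrative.

class WriteAheadLog:
    def __init__(self, path):
        self.path = path
        self.lsn = 0

    def append(self, key, value):
        self.lsn += 1
        with open(self.path, "a") as f:
            f.write(f"{self.lsn}\t{key}\t{value}\n")
            f.flush()
            os.fsync(f.fileno())  # durability point: entry survives a crash
        return self.lsn

    def replay(self):
        # Crash recovery: rebuild in-memory state by replaying every entry.
        state = {}
        with open(self.path) as f:
            for line in f:
                _, key, value = line.rstrip("\n").split("\t")
                state[key] = value
        return state

path = os.path.join(tempfile.mkdtemp(), "wal.log")
wal = WriteAheadLog(path)
wal.append("x", "1")
wal.append("x", "2")
recovered = wal.replay()  # state as of the last durable write
```

Checkpointing, omitted here, would periodically persist `state` and truncate the log so replay stays bounded.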
CAP Theorem & Consistency Models
CAP Theorem Trade-offs
CP (Consistency + Partition Tolerance):
- Sacrifices availability during partitions
- All nodes see same data; blocks requests if can’t guarantee consistency
- Examples: MongoDB, HBase, ZooKeeper, etcd
- Use Case: Financial transactions, inventory management
AP (Availability + Partition Tolerance):
- Sacrifices strong consistency for availability
- System remains responsive; may return stale data
- Examples: Cassandra, DynamoDB, CouchDB
- Use Case: Social media feeds, shopping carts, content delivery
CA (Consistency + Availability):
- Cannot tolerate partitions (impractical for distributed systems)
- Only viable for single-node or tightly-coupled systems
Consistency Models
Strong Consistency:
- All reads return most recent write immediately
- Requires coordination (Paxos/Raft) between nodes
- Pros: Data reliability, simpler application logic
- Cons: Higher latency, reduced availability, scalability limits
- Examples: Google Spanner, etcd
Eventual Consistency:
- Guarantees convergence given sufficient time without new updates
- Allows temporary inconsistencies for better performance
- Pros: High availability, low latency, horizontal scalability
- Cons: Temporary stale reads, complex conflict resolution
- Examples: Amazon DynamoDB, Cassandra, Riak
Read-Your-Writes Consistency: Guarantees clients see their own writes immediately
Monotonic Reads: Successive reads never return older values
Replication Strategies
Master-Slave (Primary-Replica):
- Single master handles writes; replicas serve reads
- Synchronous: Zero data loss but higher latency
- Asynchronous: Lower latency but potential data loss
- Use Case: Read-heavy workloads (analytics, reporting)
Multi-Master:
- Multiple nodes accept writes; changes propagated to peers
- Pros: High availability, write scalability, geo-distribution
- Cons: Complex conflict resolution, potential inconsistencies
- Use Case: Global distributed systems (Active Directory, Oracle RAC)
Infrastructure Design Interview Approach
Step 1: Clarify Requirements (10 minutes)
Functional Requirements:
- What operations must the system support? (set/get for cache, publish/subscribe for message broker)
- What are the performance targets? (latency, throughput)
- What consistency guarantees are needed?
Non-Functional Requirements:
- Scale: How many requests per second? Data volume?
- Availability: What’s the acceptable downtime? (99.9%, 99.99%?)
- Durability: Can we lose data? How much?
- Latency: P50, P99, P999 targets?
Example Questions:
- “What’s the read/write ratio?”
- “Do we need strong consistency or is eventual consistency acceptable?”
- “What’s the expected failure rate we need to handle?”
Step 2: Capacity Estimation
Calculate storage, memory, network bandwidth, and QPS requirements:
- Storage: data_size × replication_factor × retention_period
- Memory: active_dataset_size × cache_ratio
- Bandwidth: request_size × QPS
- Servers: total_QPS / QPS_per_server
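Plugging illustrative numbers into the formulas above shows the kind of arithmetic interviewers expect. All inputs here (10k writes/sec, 1 KB records, 3x replication, 30-day retention, 2k QPS per server) are assumed values, not targets from any real system.

```python
# Worked capacity estimate using the formulas above. Every input is an
# illustrative assumption for demonstration.

qps = 10_000                       # write requests per second
record_bytes = 1_000               # ~1 KB per record
replication_factor = 3
retention_seconds = 30 * 24 * 3600 # 30-day retention

# Storage: data_size × replication_factor × retention_period
storage_tb = qps * record_bytes * replication_factor * retention_seconds / 1e12

# Bandwidth: request_size × QPS
bandwidth_mb_per_s = qps * record_bytes / 1e6

# Servers: total_QPS / QPS_per_server (rounded up)
servers = -(-qps // 2_000)  # ceiling division at 2k QPS per server
```

This works out to roughly 78 TB of storage, 10 MB/s of ingest bandwidth, and 5 servers, the order-of-magnitude figures you would carry into the high-level design.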
Step 3: High-Level Design
Start Simple: Draw basic components and data flow
- Client → Load Balancer → Service Nodes → Database
- Identify single points of failure
Add Layers:
- Caching layer (Redis cluster with consistent hashing)
- Message queue (Kafka for async processing)
- Replication (master-slave for reads, multi-master for writes)
Step 4: Deep Dive into System Internals
Infrastructure Focus Areas:
- Consensus: How do nodes agree on state? (Raft for leader election)
- Durability: How do we prevent data loss? (WAL, replication, backups)
- Partitioning: How is data distributed? (Consistent hashing, range partitioning)
- Failure Handling: What happens when nodes crash? (Heartbeats, quorum reads)
- Performance: How do we optimize for throughput? (Batching, compression, connection pooling)
Design Decisions to Address:
- Choice of data structures (B-tree vs LSM-tree)
- Network protocols (TCP vs UDP, HTTP/2 vs gRPC)
- Serialization formats (Protocol Buffers vs JSON)
- Storage engines (RocksDB, LevelDB)
Step 5: Discuss Trade-offs
Consistency vs Availability (CAP theorem): Strong consistency slows down writes
Latency vs Durability: Fsync every write is slow but safe
Complexity vs Performance: Complex algorithms (Paxos) offer optimality but are hard to implement
Cost vs Resilience: More replicas = higher availability but increased cost
Step 6: Operational Considerations
- Monitoring: What metrics to track? (latency percentiles, error rates, saturation)
- Alerting: When to page engineers? (error rate spikes, high latency)
- Debugging: How to troubleshoot? (distributed tracing, centralized logging)
- Deployment: Rolling updates, blue-green deployments, canary releases
Complete Preparation Roadmap
Phase 1: Foundations (2-3 weeks)
Week 1 - Core Concepts:
- Study CAP theorem with real-world examples (DynamoDB, Cassandra, MongoDB)
- Understand consistency models (strong, eventual, causal)
- Learn replication strategies (master-slave, multi-master)
- Review network fundamentals (TCP/IP, HTTP, DNS)
Week 2 - Distributed Systems Primitives:
- Consistent hashing implementation and use cases
- Leader election (Raft vs Paxos comparison)
- Distributed transactions (2PC protocol)
- Failure detection (heartbeats, health checks)
Week 3 - Data Structures & Algorithms:
- Bloom filters with false positive calculations
- Write-ahead logs for durability
- Merkle trees for replication reconciliation
- Gossip protocols for information dissemination
Phase 2: Core Infrastructure Systems (3-4 weeks)
Week 4 - Caching:
- Redis/Memcached architecture
- Eviction policies (LRU, LFU, TTL)
- Cache stampede prevention strategies
- Consistent hashing for cache distribution
- Hot key handling techniques
Week 5 - Messaging:
- Kafka architecture (brokers, topics, partitions)
- Message ordering guarantees
- Consumer groups and rebalancing
- At-least-once vs exactly-once semantics
- RabbitMQ vs Kafka trade-offs
Week 6 - Storage:
- Key-value store design (DynamoDB, Redis)
- Distributed file systems (GFS, HDFS)
- Partitioning strategies (hash, range, directory)
- Quorum-based replication
- Conflict resolution (vector clocks, last-write-wins)
Week 7 - Coordination Services:
- ZooKeeper/etcd architecture
- Distributed locks implementation
- Service discovery patterns
- Configuration management
- Leader election implementations
Phase 3: Advanced Topics (2-3 weeks)
Week 8 - Rate Limiting & Load Balancing:
- Rate limiter algorithms (token bucket, sliding window)
- Distributed rate limiting with Redis
- Load balancing algorithms
- Session persistence strategies
- Health check mechanisms
Week 9 - Consensus & Fault Tolerance:
- Deep dive into Raft algorithm
- Paxos vs Raft comparison
- Byzantine fault tolerance basics
- Circuit breaker pattern
- Bulkhead pattern for isolation
Week 10 - Observability:
- Distributed tracing (Jaeger, Zipkin)
- Log aggregation (ELK stack)
- Metrics collection (Prometheus)
- Alerting strategies
- SLOs, SLIs, SLAs
Phase 4: Practice & Mock Interviews (2-3 weeks)
Week 11-12 - Design Practice: Aim for 3-4 full system designs per week:
- Start with core questions (rate limiter, cache, message broker)
- Progress to advanced (consensus system, distributed lock)
- Time yourself: 45 minutes per design
- Focus on trade-off discussions and operational aspects
Week 13 - Mock Interviews:
- Schedule mock interviews with peers or platforms
- Practice explaining designs out loud
- Get feedback on communication clarity
- Work on drawing diagrams quickly
- Time management: 10 min requirements, 20 min design, 15 min deep dive
Key Resources
Books:
- “Designing Data-Intensive Applications” by Martin Kleppmann
- “Database Internals” by Alex Petrov
- “System Design Interview” by Alex Xu
Online Courses:
- Grokking the System Design Interview (Educative)
- System Design Primer (GitHub)
- MIT 6.824 Distributed Systems
Practice Platforms:
- InfraExpert (infrastructure-specific questions)
- HelloInterview (system design practice)
- InterviewBit (system design problems)
Key Differences: Infrastructure vs Product Design
Infrastructure Design:
- Focus: Internal systems, performance, scalability at component level
- Depth: Deep dive into algorithms (Paxos, consistent hashing)
- Questions: “Design a rate limiter”, “Design Redis”, “Design Kafka”
- Emphasis: Durability, fault tolerance, throughput optimization
- Technologies: Consensus protocols, replication, partitioning
Product Design:
- Focus: User-facing features, end-to-end functionality
- Breadth: More components, lighter on each
- Questions: “Design Instagram”, “Design Uber”, “Design Twitter”
- Emphasis: APIs, data models, user flows, feature requirements
- Technologies: Microservices, REST APIs, databases, caching
80% Overlap: Both cover distributed systems fundamentals, but infrastructure goes deeper into system internals while product focuses on feature delivery.