Understanding large-scale system architecture
Why do companies do system design interviews? What are they looking for?
This round often determines your level. It assesses your ability to design large-scale software systems that are scalable, reliable, and maintainable.
Can you break down a complex problem? Companies want to see if you can decompose a product or service into individual components like APIs, databases, caches, queues, etc.
Can you make trade-offs? Choosing between SQL vs. NoSQL, monolith vs. microservices, or consistency vs. availability (CAP theorem).
Can your system handle scale? Think millions of users, high throughput, low latency.
Do you understand bottlenecks and how to mitigate them? Load balancing, partitioning, replication, etc.
Can you communicate your design clearly? Explaining concepts to stakeholders (PMs, other engineers) is crucial.
Are you open to feedback and iteration? It's not just about having a perfect design — it's about refining it collaboratively.
Do you apply practical, not theoretical, knowledge? For example, designing a notification system that prioritizes user experience under high load.
Can you anticipate edge cases and failures? Handling server crashes, data loss, retries, rate limiting, etc.
Can you innovate under constraints? Budget limits, data consistency requirements, or tight latency SLAs.
How do you approach new or unfamiliar domains? They want to see structured thinking even if the domain is new to you.
In short: System design interviews test your ability to be a "tech lead" or senior engineer — someone who can design solutions that work at scale, are maintainable, and meet business goals.
Note: The bar is rising. Junior and mid-level candidates are sometimes asked these questions now.
What IS a system?
Objective: Understand the problem and what is being asked. Ask clarifying questions to make sure you know exactly what the system needs to do.
Action: If the problem is broad, narrow it down. For example, if you're asked to design a "social media platform," ask which part to focus on: "Are we focusing on the newsfeed? Messaging? Scaling?"
In a system design interview, you don't ask "What are the functional requirements?" or "What are the non-functional requirements?" These are things you as the interviewee are expected to determine and clarify during the discussion.
Your job is to identify what the system needs to do (functional) and how well it needs to perform (non-functional) through your analysis and questions about the problem.
Functional requirements describe the core behaviors and features of a system — the "what".
Examples:
Non-functional requirements describe the quality attributes of the system — the "how well" or "under what constraints".
Examples:
You can't say "the system must be available 99.99% of the time" unless you know what the system is supposed to do. For example:
Functional: "Users can submit payments"
Non-functional: "Payments must process in <1 second with 99.9% success rate"
Without knowing the function (payment submission), the performance goal (1s) is meaningless.
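To make an availability target concrete, it helps to convert the percentage into a downtime budget. The following is a small arithmetic sketch; the function name and the 365-day year are assumptions chosen for illustration.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumes a 365-day year


def allowed_downtime_seconds(availability_percent: float,
                             period_seconds: int = SECONDS_PER_YEAR) -> float:
    """Return how many seconds of downtime a target permits over the period."""
    return period_seconds * (1 - availability_percent / 100)


for target in (99.9, 99.99):
    minutes = allowed_downtime_seconds(target) / 60
    print(f"{target}% availability allows about {minutes:.1f} minutes of downtime per year")
```

Under these assumptions, 99.9% leaves roughly 526 minutes per year and 99.99% leaves roughly 53 minutes, which is why each extra "nine" changes the design conversation.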
Objective: Start sketching the system's high-level architecture.
Action: Identify major components, how they interact, and how data flows. Consider the trade-offs between components (e.g., SQL vs. NoSQL databases, microservices vs. monolithic architecture). Also, bring up any assumptions you might be making.
Objective: Dive into individual components or subsystems of the architecture.
Action: Pick one or two critical parts of the system and design them in detail. For example, if you're designing a messaging system, you could focus on message storage, delivery guarantees, and scaling. Explain the protocols, APIs, and data models you would use.
Objective: Address scalability, availability, and reliability.
Action: Discuss how your system will scale horizontally and vertically. Consider potential bottlenecks (e.g., database, network, etc.), data sharding, caching strategies, load balancing, and failover strategies.
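As one illustration of data sharding, here is a minimal sketch of hash-based shard routing. The function name `shard_for`, the key format, and the shard count of 4 are hypothetical choices for this example, not a prescribed design.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index using a stable hash.

    hashlib is used instead of the built-in hash() because the built-in
    is randomized per process for strings, so it would not route the
    same key to the same shard across servers.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# Route 1000 hypothetical user IDs across 4 shards and count the spread.
counts = [0] * 4
for i in range(1000):
    counts[shard_for(f"user-{i}", 4)] += 1
print(counts)
```

A known trade-off of plain modulo routing is that changing the number of shards remaps most keys. Consistent hashing is a common way to reduce that movement, and it is worth mentioning when discussing resharding.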
Objective: Discuss trade-offs in design decisions and how to mitigate potential bottlenecks.
Action: For instance, if you've chosen a certain database type, explain why you chose it and what trade-offs were involved (e.g., SQL vs. NoSQL). Also, talk about how you'd handle potential issues, such as system outages or data consistency challenges.
More Complex Examples: Discuss trade-offs between eventual consistency vs. strong consistency, microservices vs. monolith architecture, synchronous vs. asynchronous processing, or caching strategies (write-through vs. write-behind vs. cache-aside).
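To ground the caching trade-off, here is a toy sketch contrasting cache-aside with write-through. Plain dictionaries stand in for the database and the cache, and the class names are hypothetical; a real system would use something like Redis in front of a real database.

```python
class CacheAsideStore:
    """Cache-aside: the application fills the cache on a read miss
    and invalidates the cached entry on a write."""

    def __init__(self, database: dict):
        self.database = database  # stand-in for the real database
        self.cache = {}           # stand-in for the cache
        self.db_reads = 0         # counts how often reads hit the database

    def read(self, key):
        if key in self.cache:
            return self.cache[key]          # cache hit
        self.db_reads += 1                  # cache miss: go to the database
        value = self.database.get(key)
        if value is not None:
            self.cache[key] = value
        return value

    def write(self, key, value):
        self.database[key] = value
        self.cache.pop(key, None)           # invalidate; next read refills


class WriteThroughStore(CacheAsideStore):
    """Write-through: every write updates the database and the cache together."""

    def write(self, key, value):
        self.database[key] = value
        self.cache[key] = value


aside = CacheAsideStore({"a": 1})
aside.read("a")
aside.read("a")       # second read is served from the cache
aside.write("a", 2)   # invalidates the cached entry
print(aside.read("a"), aside.db_reads)
```

In this sketch, cache-aside pays one extra database read after each write, while write-through keeps the cache warm at the cost of doing more work on every write. Write-behind, which defers the database write, is not shown.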
Objective: Summarize your system design and answer any final questions.
Action: Be prepared to summarize your system in 2-3 sentences and have thoughtful questions ready for the interviewer. Demonstrate your understanding by asking about potential edge cases, scaling considerations, or alternative approaches they might consider.
Examples (relational/SQL databases): PostgreSQL, MySQL, Microsoft SQL Server, Oracle
Key Features:
When to Use: You need complex queries and relationships (e.g. banking, ecommerce)
Types:
When to Use (NoSQL): Unstructured data, high scalability needs
In-memory caches: Redis, Memcached
Use Cases: Session storage, page caching, database query results
Eviction Policies: LRU, LFU, FIFO
Write Policies and Invalidation: Write-through, write-back, TTL-based expiry
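Of the eviction policies listed, LRU is the one most often asked about. The following is a minimal sketch built on `collections.OrderedDict`; the class name `LRUCache` and its method names are illustrative, not the API of any particular cache product.

```python
from collections import OrderedDict


class LRUCache:
    """Evicts the least recently used entry once capacity is exceeded."""

    def __init__(self, capacity: int):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._items = OrderedDict()  # oldest entry first, newest last

    def get(self, key, default=None):
        if key not in self._items:
            return default
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key, value):
        if key in self._items:
            self._items.move_to_end(key)
        self._items[key] = value
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # drop the least recently used


cache = LRUCache(2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")       # "a" becomes the most recently used entry
cache.put("c", 3)    # capacity exceeded, so "b" is evicted
print(cache.get("b"), cache.get("a"), cache.get("c"))
```

Both `get` and `put` run in constant time here, which is the property interviewers usually look for when they ask about LRU.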
Message Brokers: RabbitMQ, Amazon SQS
Event Streaming: Apache Kafka, AWS Kinesis, Pulsar
Use Cases: Async processing, decoupling services, retry logic, data pipelines
Concepts: Producer/consumer, offset, partitioning, idempotency
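The offset and idempotency concepts can be sketched with a toy consumer reading from an in-memory list that stands in for a partition log. The message shape (`id`, `amount`) and the class name are assumptions made for this example; real brokers such as Kafka expose their own client APIs.

```python
class IdempotentConsumer:
    """Reads a log by offset and skips messages it has already applied."""

    def __init__(self):
        self.processed_ids = set()  # IDs of messages already applied
        self.next_offset = 0        # position of the next message to read
        self.total = 0              # state built from the messages

    def handle(self, message: dict) -> bool:
        """Apply a message once; return False if it was a duplicate."""
        if message["id"] in self.processed_ids:
            return False
        self.total += message["amount"]
        self.processed_ids.add(message["id"])
        return True

    def poll(self, log: list) -> None:
        """Consume every message from the current offset to the end of the log."""
        while self.next_offset < len(log):
            self.handle(log[self.next_offset])
            self.next_offset += 1


# The producer retried "m1", so it appears twice in the log.
log = [
    {"id": "m1", "amount": 10},
    {"id": "m2", "amount": 5},
    {"id": "m1", "amount": 10},
]
consumer = IdempotentConsumer()
consumer.poll(log)
print(consumer.total, consumer.next_offset)
```

Because the duplicate is ignored, redelivery after a retry does not double-count. In a real system the set of processed IDs would need durable storage.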
The CAP theorem states that a distributed system can guarantee at most two of the following three properties at the same time:
Consistency: Every read receives the most recent write or an error. All nodes see the same data at the same time.
Availability: Every request receives a non-error response, even if it's not the latest data. The system is always responsive.
Partition Tolerance: The system continues to operate despite network failures or message delays between nodes.
You must tolerate partitions in a distributed system — network failures are inevitable. So, in reality, you are forced to choose between Consistency vs. Availability when a partition occurs.
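The choice during a partition can be shown with a deliberately simplified toy model. The `Replica` class, its `mode` values, and the `Unavailable` exception are hypothetical names for this sketch; real databases expose this choice through their own configuration and consistency settings.

```python
class Unavailable(Exception):
    """Raised when a replica refuses to answer rather than risk stale data."""


class Replica:
    def __init__(self, mode: str):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.local_value = "v1"   # last value this replica saw
        self.partitioned = False  # True when cut off from the other replicas

    def read(self) -> str:
        if self.partitioned and self.mode == "CP":
            # Consistency first: cannot confirm this is the latest write.
            raise Unavailable("cannot confirm the latest value during a partition")
        # Availability first (or no partition): answer from local state,
        # which may be stale while the partition lasts.
        return self.local_value


cp, ap = Replica("CP"), Replica("AP")
cp.partitioned = ap.partitioned = True
print(ap.read())  # still answers, possibly with stale data
try:
    cp.read()
except Unavailable as error:
    print("CP replica refused:", error)
```

When no partition is present, both modes behave the same in this model, which reflects the point that the trade-off only bites while a partition is in effect.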
A transaction is a sequence of one or more operations that are treated as a single, indivisible unit of work. Either all operations complete successfully, or none of them do.
Imagine transferring $100 from Account A to Account B:
If the system crashes after step 2, the transaction rolls back to prevent data loss.
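The all-or-nothing behavior can be sketched with Python's built-in `sqlite3` module. The `accounts` table, the `transfer` function, and the `fail_after_debit` flag are assumptions for this example; the raised exception only stands in for a crash, since a real crash is handled by the database's recovery mechanism rather than by application code.

```python
import sqlite3


def transfer(conn, source, target, amount, fail_after_debit=False):
    """Move `amount` between accounts atomically; return True on success."""
    try:
        # Using the connection as a context manager commits on success
        # and rolls back if an exception escapes the block.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, source),
            )
            if fail_after_debit:
                raise RuntimeError("simulated failure between debit and credit")
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, target),
            )
    except RuntimeError:
        return False
    return True


def balances(conn):
    """Return the current balances as a {name: balance} dictionary."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER NOT NULL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("A", 100), ("B", 0)])
conn.commit()

transfer(conn, "A", "B", 100, fail_after_debit=True)
print(balances(conn))  # the debit was rolled back, so nothing changed
transfer(conn, "A", "B", 100)
print(balances(conn))  # both updates were committed together
```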
A race condition occurs when two or more operations happen concurrently, and the final outcome depends on the non-deterministic timing or order of their execution.
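Example: Two users withdraw from the same account
User 1: Read balance: $100
User 2: Read balance: $100 (same time)
User 1: Withdraw $80 → Write balance: $20
User 2: Withdraw $80 → Write balance: $20 (lost update: computed from the stale $100 read)
Expected: Only one withdrawal succeeds, leaving $100 - $80 = $20; the second is rejected for insufficient funds.
Actual: Both withdrawals succeed. $160 leaves the account while the stored balance reads $20; the true balance would be -$60.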
Solution: Use transactions with proper isolation levels, locks, or optimistic concurrency control.
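As a sketch of the lock-based fix, the following makes the balance check and the update a single critical section, so two concurrent withdrawals cannot both act on the same stale read. The `Account` class is hypothetical and works within one process; across multiple servers the equivalent would be a database-level lock or a versioned (optimistic) update.

```python
import threading


class Account:
    def __init__(self, balance: int):
        self.balance = balance
        self._lock = threading.Lock()

    def withdraw(self, amount: int) -> bool:
        """Check and update under one lock; return False if funds are short."""
        with self._lock:
            if self.balance < amount:
                return False
            self.balance -= amount
            return True


account = Account(100)
results = []


def worker():
    results.append(account.withdraw(80))


threads = [threading.Thread(target=worker) for _ in range(2)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

print(account.balance, sorted(results))
```

Whichever thread acquires the lock first succeeds. The other sees the updated balance of $20 and is rejected, so exactly one withdrawal goes through.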
A process is an independent program in execution. It has:
Think of it as: A fully isolated container running a program.
A thread is the smallest unit of execution within a process:
Think of threads as: Workers inside the same container doing different tasks but sharing tools.
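The "shared tools" idea can be seen in a short sketch: several threads write to the same list and all report the same process ID, because they live inside one process. The names `shared_log` and `worker` are made up for this example.

```python
import os
import threading

shared_log = []              # memory visible to every thread in this process
log_lock = threading.Lock()  # guards the shared list


def worker(task_name: str) -> None:
    # Each thread records its task along with the ID of the process it runs in.
    with log_lock:
        shared_log.append((task_name, os.getpid()))


threads = [threading.Thread(target=worker, args=(f"task-{i}",)) for i in range(3)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

pids = {pid for _, pid in shared_log}
print(len(shared_log), "entries from", len(pids), "process")
```

A separate process would not see `shared_log` at all. It would need an explicit channel such as a pipe, socket, or shared-memory segment to exchange data.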