Intro to System Design

Understanding large-scale system architecture

Understanding System Design Interviews

Why do companies do system design interviews? What are they looking for?

What Companies Are Looking For

This round often determines your level and assesses your ability to design large-scale, scalable, reliable, and maintainable software systems.

  • Ability to deal with ambiguity — Determine how much direction an engineer will need
  • Assess technical depth — On topics relevant to the job
  • Validate the candidate's resume — Does their demonstrated level of skill/expertise match their resume?
  • Test focus on business/customer impact — Understanding real-world constraints
  • Test communication and organizational skills — Can you explain complex systems clearly?

What They're Evaluating

1. Technical Architecture Skills

Can you break down a complex problem? Companies want to see if you can decompose a product or service into individual components like APIs, databases, caches, queues, etc.

Can you make trade-offs? Choosing between SQL vs. NoSQL, monolith vs. microservices, or consistency vs. availability (CAP theorem).

2. Scalability and Performance Awareness

Can your system handle scale? Think millions of users, high throughput, low latency.

Do you understand bottlenecks and how to mitigate them? Load balancing, partitioning, replication, etc.

3. Communication and Collaboration

Can you communicate your design clearly? Explaining concepts to stakeholders (PMs, other engineers) is crucial.

Are you open to feedback and iteration? It's not just about having a perfect design — it's about refining it collaboratively.

4. Real-World Engineering Judgment

Do you apply practical, not theoretical, knowledge? For example, designing a notification system that prioritizes user experience under high load.

Can you anticipate edge cases and failures? Handling server crashes, data loss, retries, rate limiting, etc.

5. Problem-Solving at the Systems Level

Can you innovate under constraints? Budget limits, data consistency requirements, or tight latency SLAs.

How do you approach new or unfamiliar domains? They want to see structured thinking even if the domain is new to you.

In short: System design interviews test your ability to be a "tech lead" or senior engineer — someone who can design solutions that work at scale, are maintainable, and meet business goals.

Note: The bar is rising; junior and mid-level candidates are sometimes asked these questions now.

What is different about a system design interview compared to a coding interview?

  • The output is a design document rather than a fully implemented solution
  • The problems are intentionally underspecified
  • There is often no one best answer — there is a range of good answers depending on what parameters and priorities you set during the exploration phase
  • There are pros and cons to every decision and solution

Exploration of Systems

What IS a system?

  • Application or infrastructure that achieves user or business goals
  • Complex systems are constructed through a connected network of simpler systems (components)
  • These simpler systems are made up of still simpler components

System Design Interview Steps

1. Clarify Requirements (5-10 minutes)

Objective: Understand the problem and what is being asked. Ask clarifying questions to make sure you know exactly what the system needs to do.

Action: If the problem is broad, narrow it down. For example, if you're asked to design a "social media platform," clarify things like: "Are we focusing on newsfeed? Messaging? Scaling?" etc.

Key Questions to Ask:
  • Clarify assumptions:
    • Can we assume we have a third-party authentication system?

💡 Important Note:

In a system design interview, you don't ask "What are the functional requirements?" or "What are the non-functional requirements?" These are things you as the interviewee are expected to determine and clarify during the discussion.

Your job is to identify what the system needs to do (functional) and how well it needs to perform (non-functional) through your analysis and questions about the problem.

Why Functional Requirements Must Come Before Non-Functional Requirements

1. Functional Requirements Define What the System Does

Functional requirements describe the core behaviors and features of a system — the "what".

Examples:

  • Users can upload and share images
  • Admins can ban users
  • The system sends an email when a comment is posted
  • The service allows search over product listings

2. Non-Functional Requirements Describe How Well It Does That

Non-functional requirements describe the quality attributes of the system — the "how well" or "under what constraints".

Examples:

  • The system must handle 10,000 concurrent users
  • Latency must be under 200ms for 95% of requests
  • System must be available 99.99% of the time
  • Data must be eventually consistent within 5 seconds

3. Non-Functional Requirements Are Dependent On Functional Context

You can't say "the system must be available 99.99% of the time" unless you know:

  • What services are being used?
  • What are the critical user flows?
  • What is the expected workload?
  • What operations matter most?

Example:

Functional: "Users can submit payments"

Non-functional: "Payments must process in <1 second with 99.9% success rate"

Without knowing the function (payment submission), the performance goal (1s) is meaningless.
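
Targets like the availability and latency examples above become easier to reason about once they are turned into numbers. The sketch below is a minimal illustration, not a standard tool: the function names and the latency samples are hypothetical, and the percentile uses the simple nearest-rank method.

```python
import math

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years


def downtime_budget_minutes(availability_pct: float) -> float:
    """Minutes per year the system may be down and still meet the target."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


# 99.99% availability leaves about 52.6 minutes of downtime per year.
print(round(downtime_budget_minutes(99.99), 1))

# Hypothetical latency samples in milliseconds (made up for illustration).
latencies_ms = [120, 95, 180, 210, 150, 130, 90, 170, 160, 140]

# The 95th percentile here is 210 ms, so a "under 200ms for 95%" target is missed.
print(percentile(latencies_ms, 95) < 200)
```

Stating a budget this way ("about 52 minutes of downtime per year") tends to make the follow-up discussion about redundancy and failover more concrete.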

2. High-Level Design (10-15 minutes)

Objective: Start sketching the system's high-level architecture.

Action: Identify major components, how they interact, and how data flows. Consider the trade-offs between components (e.g., SQL vs. NoSQL databases, microservices vs. monolithic architecture). Also, bring up any assumptions you might be making.

3. Low-Level Design (Component Design) (10-15 minutes)

Objective: Dive into individual components or subsystems of the architecture.

Action: Pick one or two critical parts of the system and design them in detail. For example, if you're designing a messaging system, you could focus on message storage, delivery guarantees, and scaling. Explain the protocols, APIs, and data models you would use.

4. Scaling and Performance Considerations (5-10 minutes)

Objective: Address scalability, availability, and reliability.

Action: Discuss how your system will scale horizontally and vertically. Consider potential bottlenecks (e.g., the database or the network), data sharding, caching strategies, load balancing, and failover strategies.
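
As one illustration of data sharding, the sketch below routes keys to shards with a stable hash. It assumes a fixed shard count and hypothetical `user-N` keys; `shard_for` is a name invented for this example, not part of any library.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index using a stable hash.

    Python's built-in hash() is salted per process for strings, so it is not
    suitable for routing that must agree across machines.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# Hypothetical user IDs spread over 4 shards.
user_ids = [f"user-{i}" for i in range(10_000)]
counts = [0] * 4
for uid in user_ids:
    counts[shard_for(uid, 4)] += 1

print(counts)  # roughly even, around 2,500 keys per shard
```

A trade-off worth mentioning in an interview: with plain modulo routing, changing the number of shards remaps most keys, which is why consistent hashing is a common alternative.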

5. Trade-offs (5 minutes)

Objective: Discuss trade-offs in design decisions and how to mitigate potential bottlenecks.

Action: For instance, if you've chosen a certain database type, explain why you chose it and what trade-offs were involved (e.g., SQL vs. NoSQL). Also, talk about how you'd handle potential issues, such as system outages or data consistency challenges.

More Complex Examples: Discuss trade-offs between eventual consistency vs. strong consistency, microservices vs. monolith architecture, synchronous vs. asynchronous processing, or caching strategies (write-through vs. write-behind vs. cache-aside).

6. Conclusion / Q&A (5 minutes)

Objective: Summarize your system design and answer any final questions.

Action: Be prepared to summarize your system in 2-3 sentences and have thoughtful questions ready for the interviewer. Demonstrate your understanding by asking about potential edge cases, scaling considerations, or alternative approaches they might consider.

Pro Tips

  • Prioritize: Focus on the most important aspects of the system based on the problem's complexity. You might not have time to design every component in detail, so choose wisely where to dive deeper.
  • Explain Clearly: Always try to communicate clearly and succinctly. If you're running out of time, be honest and move on to the next section, but make sure you've covered the most critical aspects of the system.
  • Think Aloud: Interviewers are not just looking for the correct answer but also your problem-solving approach. Talk through your reasoning, so they can follow your thought process.
  • Time Allocations: 45 minutes (Meta), 60 minutes (Google)
  • Mindset: Treat the interviewer as a senior colleague working through the problem with you, not as someone setting traps
  • Tools: Get good at using Excalidraw for diagrams

How to Succeed as a Candidate

  • Clarify the requirements and scope of the problem
  • Make sure no major pieces of the problem are missing
  • Remember: this interview can be gamed through practice, just like the ACT, SAT, LSAT, GRE, and LeetCode

Key System Design Concepts

Database Types

1. Relational Databases (SQL)

Examples: PostgreSQL, MySQL, Microsoft SQL Server, Oracle

Key Features:

  • Structured data with schemas
  • Strong consistency (ACID transactions)
  • Use SQL for querying
  • Support joins, foreign keys, constraints

When to Use: You need complex queries and relationships (e.g. banking, ecommerce)

2. NoSQL Databases

Types:

  • Key-Value: Redis, DynamoDB (fast lookups, caching)
  • Document: MongoDB, Couchbase (JSON documents, flexible schemas)
  • Column-Family: Cassandra, HBase (high write throughput)
  • Graph: Neo4j (relationships, social networks)

When to Use: Unstructured data, high scalability needs

Caching

In-memory caches: Redis, Memcached

Use Cases: Session storage, page caching, database query results

Eviction Policies: LRU, LFU, FIFO
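
To make LRU (least recently used) eviction concrete, here is a minimal sketch built on the standard library's `OrderedDict`. The `LRUCache` class is written for this example only; production caches such as Redis implement eviction differently.

```python
from collections import OrderedDict


class LRUCache:
    """Fixed-capacity cache that evicts the least recently used key."""

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._items = OrderedDict()  # oldest entry first, newest last

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # a read counts as a "use"
        return self._items[key]

    def put(self, key, value) -> None:
        if key in self._items:
            self._items.move_to_end(key)
        self._items[key] = value
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # drop the least recently used entry


cache = LRUCache(capacity=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")     # "a" is now the most recently used key
cache.put("c", 3)  # over capacity, so "b" is evicted
```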

Write Policies and Invalidation: Write-through, write-back (write-behind), cache-aside; TTL-based expiry and explicit invalidation on write
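
The sketch below shows how a cache-aside read path, TTL expiry, and a write-through update can fit together. It is a toy model under stated assumptions: the "database" is a plain dict, the class and method names are hypothetical, and the injectable `clock` exists only so expiry can be demonstrated without sleeping.

```python
import time


class CacheAsideStore:
    """Toy read path: check the cache, fall back to the database, then fill the cache."""

    def __init__(self, database: dict, ttl_seconds: float, clock=time.monotonic):
        self.database = database      # stand-in for a real database
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self.cache = {}               # key -> (value, expires_at)
        self.db_reads = 0             # counts cache misses that hit the database

    def read(self, key):
        entry = self.cache.get(key)
        if entry is not None and entry[1] > self.clock():
            return entry[0]           # cache hit that has not expired
        self.db_reads += 1
        value = self.database.get(key)
        self.cache[key] = (value, self.clock() + self.ttl_seconds)
        return value

    def write_through(self, key, value) -> None:
        """Update the database and the cache together."""
        self.database[key] = value
        self.cache[key] = (value, self.clock() + self.ttl_seconds)

    def write_and_invalidate(self, key, value) -> None:
        """Update the database and drop the cached copy; the next read refills it."""
        self.database[key] = value
        self.cache.pop(key, None)
```

Write-back (write-behind) is not shown; it would acknowledge the write after updating only the cache and flush to the database later, trading durability for lower write latency.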

Message Queues / Event Streaming

Message Brokers: RabbitMQ, Amazon SQS

Event Streaming: Apache Kafka, AWS Kinesis, Pulsar

Use Cases: Async processing, decoupling services, retry logic, data pipelines

Concepts: Producer/consumer, offset, partitioning, idempotency
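
These concepts can be seen together in a small in-memory model. The sketch below is not how Kafka or any real broker is implemented; `PartitionedLog` and `IdempotentConsumer` are hypothetical names, and the message fields (`user`, `amount`) are invented for the example.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class PartitionedLog:
    """Toy log: each partition is an append-only list, and an offset is a list index."""
    num_partitions: int
    partitions: list = field(init=False)

    def __post_init__(self) -> None:
        self.partitions = [[] for _ in range(self.num_partitions)]

    def produce(self, key: str, message_id: str, payload: dict) -> tuple[int, int]:
        # Hashing the key keeps all messages for one key in one partition, in order.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        partition = int.from_bytes(digest[:4], "big") % self.num_partitions
        self.partitions[partition].append((message_id, payload))
        return partition, len(self.partitions[partition]) - 1


class IdempotentConsumer:
    """Consumer that tolerates redelivery by remembering processed message IDs."""

    def __init__(self, log: PartitionedLog) -> None:
        self.log = log
        self.offsets = [0] * log.num_partitions  # next offset to read, per partition
        self.seen_ids = set()
        self.totals = {}

    def poll(self, partition: int, commit: bool = True) -> None:
        start = self.offsets[partition]
        messages = self.log.partitions[partition][start:]
        for message_id, payload in messages:
            if message_id in self.seen_ids:
                continue  # duplicate delivery: skip instead of applying twice
            self.seen_ids.add(message_id)
            user = payload["user"]
            self.totals[user] = self.totals.get(user, 0) + payload["amount"]
        if commit:
            self.offsets[partition] = start + len(messages)
```

Calling `poll(partition, commit=False)` simulates a consumer that crashes before committing its offset. The next poll re-reads the same messages, and the ID check keeps the totals correct.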

CAP Theorem

The CAP theorem states that a distributed system can guarantee at most two of the following three properties at the same time:

C - Consistency

Every read receives the most recent write or an error. All nodes see the same data at the same time.

A - Availability

Every request receives a non-error response, even if it's not the latest data. The system is always responsive.

P - Partition Tolerance

The system continues to operate despite network failures or message delays between nodes.

⚠️ The Key Constraint

You must tolerate partitions in a distributed system — network failures are inevitable. So, in reality, you are forced to choose between Consistency vs. Availability when a partition occurs.

Classic Trade-offs:
  • CP: HBase, MongoDB (w/ config) — Prioritizes consistency, may reject reads/writes during partition
  • AP: Couchbase, Cassandra — Always responds (available), may return stale data
  • CA: Only in theory (no partition) — Not practical in real-world distributed systems
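
The CP versus AP choice can be illustrated with a deliberately tiny model. This is a sketch of the trade-off only, not of how any of the databases above behave internally; the `Replica` class and its `mode` labels are assumptions made for the example.

```python
from dataclasses import dataclass


class Unavailable(Exception):
    """Raised by a CP replica that cannot confirm it has the latest data."""


@dataclass
class Replica:
    mode: str                  # "CP" or "AP" (labels for this toy model only)
    value: str = "v1"          # last value this replica has seen
    partitioned: bool = False  # True when cut off from the other replicas

    def read(self) -> str:
        if self.partitioned and self.mode == "CP":
            # Consistency first: refuse to answer rather than risk a stale read.
            raise Unavailable("cannot reach peers; refusing a possibly stale read")
        # Availability first: answer with whatever this replica has, even if stale.
        return self.value


def read_from_isolated_replica(mode: str) -> str:
    """A newer write ("v2") landed elsewhere; this replica is cut off and still holds "v1"."""
    isolated = Replica(mode=mode, partitioned=True)
    try:
        return isolated.read()
    except Unavailable:
        return "error"


print(read_from_isolated_replica("CP"))  # error
print(read_from_isolated_replica("AP"))  # v1 (stale, but the request succeeded)
```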

Transactions & ACID Properties

A transaction is a sequence of one or more operations that are treated as a single, indivisible unit of work. Either all operations complete successfully, or none of them do.

ACID Properties
  • Atomicity: All operations must complete fully or have no effect (all or nothing)
  • Consistency: Database must move from one valid state to another valid state
  • Isolation: Transactions run independently — intermediate steps not visible to others
  • Durability: Once committed, changes persist even if there's a crash

Example: Banking Transfer

Imagine transferring $100 from Account A to Account B:

  1. Read balance of A
  2. Subtract $100 from A
  3. Add $100 to B
  4. Write updated balances

If the system crashes after step 2, the transaction rolls back, so the $100 debited from A is not lost and neither balance changes.
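
The transfer above can be sketched with Python's built-in `sqlite3` module. The table layout, starting balances, and the `crash_after_debit` flag are assumptions for illustration; the flag raises an exception to stand in for a failure between the debit and the credit.

```python
import sqlite3


def make_bank() -> sqlite3.Connection:
    """Create an in-memory database with two hypothetical accounts."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER NOT NULL)")
    conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("A", 500), ("B", 0)])
    conn.commit()
    return conn


def transfer(conn, src: str, dst: str, amount: int, crash_after_debit: bool = False) -> bool:
    try:
        # Used as a context manager, the connection commits if the block
        # succeeds and rolls back if an exception escapes it.
        with conn:
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src))
            if crash_after_debit:
                raise RuntimeError("simulated failure after the debit")
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst))
        return True
    except RuntimeError:
        return False


def balances(conn) -> dict:
    return dict(conn.execute("SELECT name, balance FROM accounts ORDER BY name"))


conn = make_bank()
transfer(conn, "A", "B", 100, crash_after_debit=True)
print(balances(conn))  # {'A': 500, 'B': 0}: the debit was rolled back
transfer(conn, "A", "B", 100)
print(balances(conn))  # {'A': 400, 'B': 100}
```

An exception is only a stand-in for a crash. After a real process or machine failure, the database's own recovery mechanism is what discards the uncommitted debit.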

Race Conditions

A race condition occurs when two or more operations happen concurrently, and the final outcome depends on the non-deterministic timing or order of their execution.

Example: Two users withdraw from the same account
User 1: Read balance: $100
User 2: Read balance: $100 (at the same time)
User 1: Withdraw $80 → Writes balance: $20
User 2: Withdraw $80 → Writes balance: $20 (lost update!)

Expected: Only one withdrawal succeeds ($100 - $80 = $20); the second is rejected for insufficient funds
Actual: Both withdrawals succeed, so $160 is paid out while the stored balance reads $20 instead of -$60

Solution: Use transactions with proper isolation levels, locks, or optimistic concurrency control.
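
Of the options above, a lock is the simplest to show in a single process. The sketch below assumes an in-memory `Account` class (hypothetical, written for this example) and uses `threading.Lock` so that the balance check and the update happen as one step.

```python
import threading


class Account:
    def __init__(self, balance: int) -> None:
        self.balance = balance
        self._lock = threading.Lock()

    def withdraw(self, amount: int) -> bool:
        # The check and the update run under one lock, so no other thread
        # can read the balance in between them.
        with self._lock:
            if self.balance < amount:
                return False
            self.balance -= amount
            return True


def run_concurrent_withdrawals(account: Account, amount: int, num_threads: int) -> list:
    """Start num_threads threads that each attempt one withdrawal."""
    results = []

    def worker() -> None:
        results.append(account.withdraw(amount))

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


account = Account(100)
results = run_concurrent_withdrawals(account, 80, 2)
print(sorted(results), account.balance)  # [False, True] 20
```

Across multiple servers an in-process lock is not enough; the same idea is then expressed with database row locks, transaction isolation levels, or conditional writes.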

Process vs Thread

Process

A process is an independent program in execution. It has:

  • Its own memory space (code, data, stack, heap)
  • Its own resources (file handles, network connections, etc.)
  • At least one thread (called the main thread)

Think of it as: A fully isolated container running a program.

Thread

A thread is the smallest unit of execution within a process:

  • Threads in the same process share the same memory (code, data, heap) and resources, though each thread has its own stack
  • Multiple threads can run concurrently in the same process
  • Cheaper to create and switch between than processes

Think of threads as: Workers inside the same container doing different tasks but sharing tools.
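
A short experiment can make the "workers in the same container" picture concrete. In this sketch (the function and thread names are hypothetical), several threads append to one shared list and each records the process ID it sees.

```python
import os
import threading


def run_workers(num_workers: int) -> dict:
    shared_results = []      # one list object, visible to every thread
    lock = threading.Lock()  # guards the shared list

    def worker(worker_id: int) -> None:
        with lock:
            shared_results.append(
                (worker_id, os.getpid(), threading.current_thread().name)
            )

    threads = [
        threading.Thread(target=worker, args=(i,), name=f"worker-{i}")
        for i in range(num_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    return {
        "worker_ids": sorted(r[0] for r in shared_results),
        "process_ids": {r[1] for r in shared_results},
        "thread_names": sorted(r[2] for r in shared_results),
    }


summary = run_workers(4)
print(summary["process_ids"] == {os.getpid()})  # True: every thread ran in this process
print(summary["thread_names"])                  # ['worker-0', 'worker-1', 'worker-2', 'worker-3']
```

Separate processes would each report a different process ID and would not see each other's lists without an explicit mechanism such as pipes, sockets, or shared memory.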

Key Contexts Where Race Conditions Arise

  • Databases: SQL transactions vs NoSQL conditional writes
  • Caches: Cache stampede, race between DB write and cache set
  • Distributed Systems: Multiple services updating shared state
  • Queues: Multiple workers picking the same task
  • File Storage: Parallel uploads, version conflicts
  • Authentication: Login/logout races, token refresh
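
Several of these contexts, such as conditional writes and multiple workers picking the same task, are commonly handled with optimistic concurrency: each record carries a version, and a write succeeds only if the version is unchanged since it was read. The sketch below is a single-process toy with hypothetical names (`VersionedStore`, `put_if_version`); a real store must perform the compare and the write atomically.

```python
class VersionConflict(Exception):
    """Raised when a record changed between the caller's read and write."""


class VersionedStore:
    def __init__(self) -> None:
        self._data = {}  # key -> (value, version)

    def get(self, key):
        """Return (value, version); a missing key reads as (None, 0)."""
        return self._data.get(key, (None, 0))

    def put_if_version(self, key, value, expected_version: int) -> int:
        """Write only if the stored version still matches what the caller read."""
        _, current_version = self._data.get(key, (None, 0))
        if current_version != expected_version:
            raise VersionConflict(
                f"{key}: expected version {expected_version}, found {current_version}"
            )
        self._data[key] = (value, current_version + 1)
        return current_version + 1


store = VersionedStore()
store.put_if_version("task-1", "pending", expected_version=0)

# Two workers read the same task at the same version.
_, version_seen_by_a = store.get("task-1")
_, version_seen_by_b = store.get("task-1")

store.put_if_version("task-1", "claimed-by-A", version_seen_by_a)  # succeeds
try:
    store.put_if_version("task-1", "claimed-by-B", version_seen_by_b)
    b_claimed = True
except VersionConflict:
    b_claimed = False  # worker B must re-read and pick a different task
```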