System Design — The Data Layer
How Your Data Stays Alive
One tweet lives in 7 places simultaneously. Databases, caches, search indexes, queues, and analytics pipelines — all keeping copies of your data for different purposes. Keeping them in sync is the hardest problem in system design.
When you post a tweet, it doesn't just sit in one database. Within milliseconds, that 280-character message lives in 7 different places — a primary database, read replicas, Redis cache, CDN, Elasticsearch index, Kafka queue, and analytics pipeline. Each copy exists for a different reason, optimized for a different access pattern. Keeping them all in sync is the hardest problem in system design.
This page explores the data layer: how companies like Netflix, Twitter, and Amazon choose between SQL and NoSQL, when to shard versus replicate, why caching is 100x faster than querying, and how message queues let services fail gracefully. Every number comes from real production systems — Stack Overflow serving 1.3 billion page views on 9 servers, Redis handling 1.8 million operations per second, LinkedIn processing 7 trillion Kafka messages per day.
If you haven't seen The Network Layer yet, that covers the first half of the journey — DNS, TCP, HTTP, load balancers, CDNs. This page picks up where the request arrives at the server and needs to read or write data.
SQL vs NoSQL — The Great Database Debate
Same dataset, two different databases. Run the same five tasks against each and see which wins. Spoiler: neither wins every time.
Primary key lookup. B-tree index. Instant.
Document lookup by _id. Instant.
Simple lookups are fast everywhere. The differences appear when things get complex.
SQL wins
NoSQL wins
It's not SQL vs NoSQL — it's “which tradeoff matters more for YOUR use case?”
Quick Decision Helper
Do you need complex joins between related data?
Does your data schema change often?
Do you need ACID transactions?
Do you need 100K+ writes per second?
The real answer is “both”
Instagram uses PostgreSQL for user accounts and relationships (where data integrity is critical) and Cassandra for the activity feed (where write throughput matters more than consistency). Uber uses MySQL for trip data and Redis for real-time geolocation. Netflix uses CockroachDB for billing and Cassandra for viewing history. The question isn't “SQL or NoSQL?” — it's “which database for which access pattern?”
The ACID vs BASE tradeoff is at the heart of this. ACID (Atomicity, Consistency, Isolation, Durability) guarantees that every transaction either fully completes or fully rolls back. Your bank transfer will never leave you with money debited but not credited. BASE (Basically Available, Soft state, Eventually consistent) trades strict consistency for availability and partition tolerance. Your Instagram like count might show 1,003 on one server and 1,005 on another for a few milliseconds — and that's fine.
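Atomicity is easy to demonstrate with Python's built-in sqlite3 module. A toy sketch (account names and amounts are invented): the transfer either fully commits or fully rolls back — the failed transfer leaves no half-debited state behind.

```python
import sqlite3

# In-memory bank: two accounts, a transfer must be all-or-nothing.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

def transfer(conn, src, dst, amount):
    """Debit src and credit dst atomically; roll back on any failure."""
    try:
        with conn:  # opens a transaction; commits on success, rolls back on exception
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                         (amount, src))
            (bal,) = conn.execute("SELECT balance FROM accounts WHERE name = ?",
                                  (src,)).fetchone()
            if bal < 0:  # simulate a failure mid-transfer
                raise ValueError("insufficient funds")
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                         (amount, dst))
        return True
    except ValueError:
        return False

transfer(conn, "alice", "bob", 60)   # succeeds
transfer(conn, "alice", "bob", 500)  # fails: whole transaction rolled back
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

After both calls, `balances` shows alice at 40 and bob at 60 — the failed transfer changed nothing.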
MongoDB started as a pure document store with no joins and no transactions. By 2024, it added multi-document ACID transactions, time series collections, and a SQL-like aggregation pipeline. PostgreSQL started as a relational database and added JSONB columns, full-text search, and time-series extensions. The two worlds are converging. Pick based on your primary access pattern, then extend as needed.
Schema Design — Modeling Your Data
Picking SQL vs NoSQL is step one. Step two is designing the schema inside your database — how tables relate, which columns to index, and how to avoid the partition key mistake that takes down your service at 2 AM.
Normalization vs Denormalization — The Fundamental Tradeoff
Data lives in one place. No duplication. Updates are safe. Reads require JOINs.
users
| id | name |
|---|---|
| 1 | Jane |
| 2 | Alex |
orders
| id | user_id | product_id | qty |
|---|---|---|---|
| 101 | 1 | 501 | 2 |
| 102 | 1 | 502 | 1 |
| 103 | 2 | 501 | 3 |
products
| id | name | price |
|---|---|---|
| 501 | Widget | $9.99 |
| 502 | Gadget | $24.99 |
SELECT u.name AS customer, p.name AS product, o.qty
FROM orders o
JOIN users u ON o.user_id = u.id
JOIN products p ON o.product_id = p.id;
Banking systems, e-commerce inventory, anything where data integrity is non-negotiable. Stack Overflow uses normalized SQL Server — 2 instances serve 1.3B page views/month.
Indexing — Why Your Query Is Slow
Index on one column. The bread and butter — covers 80% of cases.
SELECT * FROM users WHERE email = 'jane@example.com'
B-tree lookup: 3 comparisons → direct row. ~0.1ms
How it works: B-tree stores sorted pointers. Binary search narrows 10M rows to 1 row in ~23 comparisons (log₂ 10M ≈ 23).
B-tree handles 90% of indexing needs. PostgreSQL creates one automatically on every primary key.
The Hot Partition Problem — When One Shard Melts
Justin Bieber posts → 100M fans write likes to the same partition
Partition key: post_id
All likes for one post hash to the same partition. DynamoDB limits: 3,000 RCU and 1,000 WCU per partition. Celebrity posts blow past this instantly.
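The standard fix is write sharding: suffix the hot key with a bucket number so writes fan out across partitions, then sum the buckets on read. A rough sketch — the partition count, bucket count, and key format are invented for illustration:

```python
import hashlib

NUM_PARTITIONS = 8
NUM_BUCKETS = 10  # spread each hot post's writes over 10 logical keys

def partition_for(key: str) -> int:
    """Hash a partition key to a physical partition (DynamoDB-style)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % NUM_PARTITIONS

# Naive key: every like for post 42 lands on ONE partition
naive = {partition_for("post-42") for _ in range(1000)}

# Sharded key: a bucket suffix fans the same writes out across partitions
sharded = {partition_for(f"post-42#{i % NUM_BUCKETS}") for i in range(1000)}

def like_count(counts: dict) -> int:
    """Reads now aggregate: total likes = sum over all bucket counters."""
    return sum(counts.get(f"post-42#{b}", 0) for b in range(NUM_BUCKETS))
```

The tradeoff is explicit: writes scale out, but reads must touch every bucket.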
The schema change that saved Instagram
In the early 2010s, Instagram faced a critical scalability challenge nicknamed the “Justin Bieber Problem.” Celebrity posts garnered millions of likes within moments, and every like triggered a COUNT(*) query across a normalized likes table. The database couldn't keep up — queries that should take milliseconds were taking seconds, slowing the platform for everyone.
The fix was elegant: denormalize. They added a LikeCount column directly to the Posts table and incremented it atomically on each like. Reads became 100x faster — one row lookup instead of a cross-table aggregation. The tradeoff? Writes became slightly more complex (update two tables instead of one), and there's a brief window where the count might be off by one. For a social media platform, that's an acceptable trade.
This is the schema design lesson: normalize for correctness, denormalize for performance. Most real systems use both — normalized tables for the source of truth, denormalized views or caches for read-heavy access patterns. Uber uses composite indexes on (trip_id, timestamp) to serve their most common query. Stack Overflow's 2 SQL Server instances handle 1.3 billion page views because they index aggressively.
Database Scaling — When One Machine Isn't Enough
Traffic keeps growing. Your single database is sweating. Four strategies to scale, in the order you should try them.
Get a bigger machine. More CPU, more RAM, more disk. Same single server.
No code changes. No distributed systems complexity. Just throw money at hardware.
There's a ceiling. The biggest AWS instance (u-24tb1.metal) has 24TB RAM, costs $218/hour. Even that has limits.
Stack Overflow runs on 2 SQL Servers. They squeezed everything from vertical scaling before touching horizontal. 1.3B page views/month.
Stack Overflow doesn't shard — and neither should you (yet)
Stack Overflow serves 1.3 billion page views per month from 9 web servers and 2 SQL Server instances. No sharding. No NoSQL. Just aggressive indexing, smart caching, and a well-designed schema. They estimate that a single modern database server can handle 10,000+ queries per second with proper indexes.
Sharding introduces real operational pain. Cross-shard queries become expensive joins across the network. Schema changes must be coordinated across every shard. Rebalancing data when you add a shard can take days. Discord learned this the hard way — their Cassandra cluster had billions of messages, but certain channels (like popular gaming communities) created hot partitions. They eventually migrated to ScyllaDB, dropping p99 latency from 200ms to 5ms.
The scaling ladder is deliberate: start with better hardware (vertical scaling is underrated — a single AWS r7g.16xlarge has 512GB RAM and 64 cores). Then add read replicas to offload reads. Add caching to reduce DB load by 99%. Only when you've exhausted these options should you consider sharding. Most applications will never reach that point.
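Part of why sharding is the last rung: rebalancing. A sketch of naive modulo sharding shows the problem — adding a fifth shard to a four-shard cluster remaps roughly four fifths of all keys, which is exactly what consistent hashing was invented to avoid. Shard counts and key names here are illustrative:

```python
import hashlib

def shard_for(user_id: str, num_shards: int) -> int:
    """Naive modulo sharding: hash the key, mod by shard count."""
    return int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % num_shards

users = [f"user-{i}" for i in range(10_000)]

before = {u: shard_for(u, 4) for u in users}
after = {u: shard_for(u, 5) for u in users}  # add one shard

# With hash % N, adding a shard remaps about (N-1)/N of all keys —
# here roughly 8,000 of 10,000 users would have to migrate.
# Consistent hashing cuts that to roughly 1/N.
moved = sum(1 for u in users if before[u] != after[u])
```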
Caching — The Speed Multiplier
The fastest database query is the one you never make. Caching stores hot data in memory — 100x faster than disk. Four strategies, each with tradeoffs.
Most common pattern. App manages cache explicitly. Data loaded into cache on first access.
Only caches what's actually requested. Resilient — if cache dies, app falls back to DB.
First request for every item is slow (cold start). Stale data if DB changes without cache invalidation.
Instagram uses cache-aside. First load of a popular post is slow. Every subsequent load is instant from cache.
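The whole pattern is a dozen lines. A minimal sketch with an in-memory dict standing in for both Redis and the database (the 10 ms latency figure is invented for illustration):

```python
import time

DB_LATENCY = 0.01  # pretend each database query costs 10 ms

database = {"post:1": "Hello world", "post:2": "Cache me"}
cache: dict = {}

def get_post(key: str) -> str:
    """Cache-aside read: check cache, fall back to DB, then populate."""
    if key in cache:            # hit: served from memory
        return cache[key]
    time.sleep(DB_LATENCY)      # miss: pay the DB round trip
    value = database[key]
    cache[key] = value          # populate so the next read is instant
    return value

def update_post(key: str, value: str) -> None:
    """Writes go to the DB and INVALIDATE the cache entry (not update it)."""
    database[key] = value
    cache.pop(key, None)        # next read repopulates from the fresh value

get_post("post:1")              # cold: slow
hit = get_post("post:1")        # warm: instant
update_post("post:1", "Edited")
fresh = get_post("post:1")      # miss again, but returns the new value
```

Invalidating on write (rather than updating the cache in place) keeps the DB as the single source of truth — at the cost of one extra cold read.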
Eviction Policies — Cache Full, What Gets Kicked?
Evict what hasn't been touched in the longest time. Recent access = stay.
Cleaning your desk — remove what you haven't used lately.
The most popular choice. Memcached uses LRU; Redis ships an approximate LRU (allkeys-lru) as its most commonly configured eviction policy — though its out-of-the-box default is actually noeviction.
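The core of LRU fits in a few lines. A sketch using Python's OrderedDict — capacity and keys are arbitrary:

```python
from collections import OrderedDict

class LRUCache:
    """Tiny LRU: on overflow, evict the least-recently-used key."""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.data: OrderedDict = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)         # mark as recently used
        return self.data[key]

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict the oldest entry

cache = LRUCache(capacity=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")       # touch "a" so it survives
cache.put("c", 3)    # cache full: evicts "b", the least recently used
```

Production Redis approximates this with random sampling rather than a strict ordering — maintaining an exact LRU list for millions of keys costs more than it saves.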
The Thundering Herd Problem
A popular cache key expires. Thousands of requests arrive simultaneously. All see "cache miss" and all slam the database at once. Pinterest saw this crash their DB when viral pins expired.
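One common defense is request coalescing: on a miss, a single request takes a per-key lock and rebuilds the value while everyone else waits for the result. A simplified sketch — the latency and key names are invented:

```python
import threading
import time

db_queries = 0
cache: dict = {}
locks: dict = {}
locks_guard = threading.Lock()

def get_with_coalescing(key: str) -> str:
    """On a miss, only ONE thread rebuilds the value; the rest wait for it."""
    if key in cache:
        return cache[key]
    with locks_guard:                       # one lock object per key
        lock = locks.setdefault(key, threading.Lock())
    with lock:
        if key in cache:                    # double-check: another thread may
            return cache[key]               # have filled it while we waited
        global db_queries
        db_queries += 1                     # exactly one DB hit per expired key
        time.sleep(0.05)                    # simulate the slow DB query
        cache[key] = f"value-for-{key}"
        return cache[key]

# 100 simultaneous requests for the same expired key → 1 DB query
threads = [threading.Thread(target=get_with_coalescing, args=("viral-pin",))
           for _ in range(100)]
for t in threads: t.start()
for t in threads: t.join()
```

Other defenses include jittered TTLs (so popular keys don't expire in lockstep) and serving slightly stale data while a background refresh runs.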
The two hardest problems in caching
Phil Karlton famously said there are only two hard things in computer science: cache invalidation and naming things. He wasn't joking about the first one. When the underlying data changes, every cached copy must be updated or deleted. Miss one, and users see stale data. Facebook maintains 75 billion cache items across their Memcached fleet — invalidating the right entries when a user updates their profile picture is a distributed coordination problem.
Twitter builds your entire home timeline in Redis before you open the app. When someone you follow tweets, that tweet is written to the timelines of all their followers in Redis — a process called “fan-out on write.” For Lady Gaga with 84 million followers, one tweet triggers 84 million cache writes. This is why Twitter's Redis deployment is one of the largest in the world.
The speed gap between cache and database is staggering. Redis responds in microseconds (0.001ms). PostgreSQL responds in milliseconds (1-10ms). That's a 1,000-10,000x difference. Netflix's EVCache handles 30 million reads per second. Without it, they'd need 100x more database capacity — hundreds of thousands of database instances instead of a few thousand.
Search — Finding Needles in Haystacks
When your database has 100 million rows, you can't scan every one. Inverted indexes flip the problem — instead of “which words are in this document?” they answer “which documents contain this word?” in microseconds. That's how every search engine on earth works.
Forward Index vs Inverted Index
Maps terms → documents. "Which documents contain 'redis'?" Instant lookup. How every search engine works.
redis → [Doc 1, Doc 3]
fast → [Doc 1, Doc 2]
kafka → [Doc 2, Doc 3]
scale → [Doc 3]
Search "redis": direct lookup → [Doc 1, Doc 3]. O(1). Same speed whether you have 100 or 100M documents.
Searching across millions of documents — the foundation of Elasticsearch, Lucene, Solr
How Elasticsearch Builds an Inverted Index
Start with a collection of documents to index.
Doc 1: "Redis is fast"
Doc 2: "Kafka is fast too"
Doc 3: "Redis and Kafka scale"
3 documents, 11 words total
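The build step is just a nested loop: tokenize each document and map every term to the set of documents containing it. A sketch using the same three documents:

```python
from collections import defaultdict

docs = {
    1: "Redis is fast",
    2: "Kafka is fast too",
    3: "Redis and Kafka scale",
}

# Build: for each document, add its ID under every term it contains
inverted = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        inverted[term].add(doc_id)

def search(term: str) -> set:
    """One dictionary lookup, regardless of corpus size."""
    return inverted.get(term.lower(), set())

search("redis")                    # {1, 3}
search("redis") & search("fast")   # AND query = set intersection: {1}
```

Real engines add tokenization rules, stemming, and per-term frequency data on top, but set intersection over posting lists is still how multi-word queries work.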
Autocomplete — Trie in Action
Type a letter. A trie (prefix tree) narrows results with each keystroke — O(M) lookup where M is the length of your query, regardless of how many terms exist. Google serves 5 billion searches/day with <50ms autocomplete.
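A minimal trie sketch — the inserted terms are arbitrary examples:

```python
class TrieNode:
    def __init__(self):
        self.children: dict = {}
        self.is_word = False

class Trie:
    """Prefix tree: lookup cost is O(M) in query length, not corpus size."""
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word: str) -> None:
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def autocomplete(self, prefix: str) -> list:
        """Walk down to the prefix node, then collect every word below it."""
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return []
            node = node.children[ch]
        results = []
        def collect(n, path):
            if n.is_word:
                results.append(prefix + path)
            for ch, child in sorted(n.children.items()):
                collect(child, path + ch)
        collect(node, "")
        return results

trie = Trie()
for term in ["redis", "replica", "replication", "kafka"]:
    trie.insert(term)
trie.autocomplete("rep")   # ['replica', 'replication']
```

Production autocomplete caches the top-k completions at each node instead of walking the subtree per keystroke, trading memory for latency.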
Search at Scale — Real Companies
GitHub — code search across 200M+ repositories
Tech: Custom search engine (replaced Elasticsearch in 2023 with Blackbird — written in Rust)
115TB of code indexed, queries return in <100ms
Netflix — title search, content discovery, personalized results
Tech: Elasticsearch cluster for catalog search + ML re-ranking
15,000+ titles across 190 countries, 30+ languages
Amazon — product search with typo tolerance, synonym expansion, faceted filters
Tech: Custom A9 search engine (evolved into OpenSearch, their Elasticsearch fork)
350M+ products indexed, handles Black Friday traffic spikes
Wikipedia — full-text search across 60M+ articles in 300+ languages
Tech: Elasticsearch (CirrusSearch) — replaced MySQL FULLTEXT in 2014
Serves 6B+ searches per month with sub-second response
GitHub rebuilt search from scratch — in Rust
For years, GitHub used Elasticsearch for code search. It worked, but searching across 200 million repositories with complex regex patterns was pushing Elasticsearch beyond its design. In 2023, they shipped Blackbird — a custom search engine written in Rust that indexes 115TB of code and returns results in under 100ms.
The architecture is clever: instead of indexing every file in full, Blackbird builds a content-addressable index where identical files across repositories share a single index entry. Since open-source projects are frequently forked, this deduplication dramatically reduces index size. The result? You can regex search across all of GitHub's public code faster than you can grep your own laptop.
Amazon took a different path. Their A9 product search engine evolved into OpenSearch — an open-source fork of Elasticsearch that now powers search for 350 million+ products. The key innovation: semantic search using vector embeddings alongside traditional inverted indexes. Type “comfy chair for reading” and it understands intent, not just keywords. Elasticsearch itself has since added native vector search for AI-powered retrieval-augmented generation (RAG) pipelines — the search world is converging on hybrid approaches.
Message Queues — Decoupling Everything
Your user clicks "Sign Up." Do you make them wait while you send an email, update analytics, and notify Slack? Or do you save to DB, say "done," and handle the rest in the background?
Queue vs Pub/Sub
Each message consumed by exactly ONE consumer. Once processed, it's gone.
Work distribution. Process this payment. Encode this video. Each job done once.
AWS SQS at Amazon. Each message processed by exactly one worker. Used for order processing, image resizing.
Real Example: YouTube Upload Pipeline
When a creator uploads a video, one Kafka event triggers 5 parallel services. The creator doesn't wait for encoding to finish — the queue handles it.
Upload service receives video
Push event to Kafka
Encode to 240p, 480p, 720p, 1080p, 4K
Generate 3 thumbnail options
Scan for copyrighted content (Content ID)
Notify subscribers
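The fan-out above can be sketched as in-process pub/sub — the topic name and handlers below are hypothetical stand-ins for the real services:

```python
from collections import defaultdict

# Minimal in-process pub/sub: one event, many independent subscribers.
subscribers = defaultdict(list)
processed = []

def subscribe(topic: str, handler) -> None:
    subscribers[topic].append(handler)

def publish(topic: str, event: dict) -> None:
    """Every subscriber gets its own copy of the event."""
    for handler in subscribers[topic]:
        handler(event)

# Hypothetical downstream services, mirroring the upload pipeline
subscribe("video.uploaded", lambda e: processed.append(("encode", e["video_id"])))
subscribe("video.uploaded", lambda e: processed.append(("thumbnails", e["video_id"])))
subscribe("video.uploaded", lambda e: processed.append(("content_id", e["video_id"])))
subscribe("video.uploaded", lambda e: processed.append(("notify", e["video_id"])))

# The upload service publishes ONE event and returns immediately
publish("video.uploaded", {"video_id": "abc123"})
```

The decoupling payoff: adding a sixth downstream service means adding a subscriber, not changing the upload service.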
How Kafka became the backbone of the modern internet
Kafka was built at LinkedIn in 2011 to solve a specific problem: they had hundreds of services that all needed to share data, and direct service-to-service communication was creating an unmaintainable web of dependencies. Jay Kreps designed Kafka as an append-only log — messages are written sequentially to disk, which is 100x faster than random writes.
Today, LinkedIn's Kafka deployment processes 7 trillion messages per day. Netflix pushes 700 billion events per day. When you press “play” on Netflix, one event triggers 8+ downstream services: recommendations update, viewing history records, bandwidth monitoring starts, A/B test metrics collect, content delivery optimizes, and billing tracks your usage — all from one Kafka event.
The key insight is that queues turn synchronous operations into asynchronous ones. Shopify's checkout during Black Friday handles 100x normal traffic by queueing order processing. The checkout is fast (payment captured, queue message sent). Inventory updates, email confirmations, and warehouse notifications all process from the queue at their own pace. If the email service goes down, no customer sees an error — the emails just send when it recovers.
Replication — Copies Keep You Alive
One database machine will eventually die. Replication copies your data to multiple machines. The question: do you wait for the copy to confirm, or trust it will get there eventually?
Write to Primary → Primary waits for ALL replicas to confirm → then tells client 'done'. Every replica has every write.
Slow — 3 network round trips before confirming to client.
Zero data loss. If primary crashes, replicas have everything.
Follower dies? Primary blocks. Writes halt until follower recovers or is removed.
Banks use synchronous replication. Wrong balance = lawsuit. Zero data loss is non-negotiable.
The Stale Read Problem
With async replication, a client might read from a replica before it gets the latest write. For 10-50ms, different replicas return different values.
Automatic Failover
When the primary dies, a follower gets promoted. With sync replication, zero data loss. With async, you might lose the last few seconds of writes.
The CAP theorem in practice
Eric Brewer's CAP theorem states that during a network partition, a distributed system must choose between consistency (every read returns the latest write) and availability (every request gets a response). You can't have both. Banks choose consistency — a transfer must be correct even if the system is briefly unavailable. Amazon chooses availability — you can always add to your cart, even if it means occasional inventory inconsistencies.
In practice, network partitions are rare but devastating. During Amazon's 2017 S3 outage, a single typo in a command took down services across the internet for 4 hours — Slack, Trello, IFTTT, and thousands of other services that depended on S3. This is why multi-region replication exists: Airbnb replicates MySQL across availability zones so that if an entire datacenter goes down, a replica in another zone takes over in seconds.
The numbers tell the story: 99.9% uptime means 8.7 hours of downtime per year. 99.99% means 52 minutes. 99.999% (“five nines”) means 5.2 minutes. Getting from four nines to five nines is exponentially harder and more expensive — it requires synchronous replication, automatic failover, and infrastructure redundancy across multiple geographic regions.
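The downtime arithmetic is worth checking yourself — each extra nine divides the error budget by ten:

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # ~525,960

def downtime_minutes(availability: float) -> float:
    """Allowed downtime per year for a given availability target."""
    return MINUTES_PER_YEAR * (1 - availability)

three_nines = downtime_minutes(0.999)    # ~526 min  (~8.8 hours)
four_nines = downtime_minutes(0.9999)    # ~53 min
five_nines = downtime_minutes(0.99999)   # ~5.3 min
```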
Object Storage — How the Cloud Stores Everything
The internet runs on object storage. S3 alone holds over 400 trillion objects. Netflix stores 19 petabytes of video. Dropbox saved $75M by building their own. Understanding block vs file vs object storage — and how erasure coding achieves 11 nines of durability — is essential for any system design that touches storage at scale.
Block vs File vs Object — Pick the Right One
Flat namespace — no directories, just keys and values. Each object includes data + metadata + a globally unique key. Built for the web: HTTP API, infinite scale, pay-per-GB.
PUT/GET by key over HTTP. No partial updates — replace entire object.
10–100ms first byte (S3), cached via CDN: <10ms
5 TB per object (S3), unlimited total storage
AWS S3, Azure Blob, Google Cloud Storage, MinIO
Good for: images, videos, backups, data lakes, static websites, ML datasets
Poor fit: databases, frequently updated files, low-latency random reads
| | Block Storage | File Storage | Object Storage |
|---|---|---|---|
| Structure | Fixed-size blocks | Hierarchical dirs | Flat key-value |
| Protocol | iSCSI / NVMe | NFS / SMB | HTTP REST API |
| Shared Access | Single server | Multiple servers | Global (internet) |
| Scalability | TB scale | PB scale | Unlimited |
| Cost/GB | $$$ | $$ | $ |
S3 — The Object Store That Runs the Internet
Launched in 2006 with a few thousand objects. Today it's the backbone of the internet — Netflix, Airbnb, NASA, and the majority of the web's static assets sit in S3.
400+ trillion
Objects Stored
S3 stores over 400 trillion objects as of 2024 — up from 100 trillion in 2021
100M+ req/sec
Peak Requests
S3 handles over 100 million requests per second at peak
99.999999999%
Durability
11 nines — store 10 million objects and expect to lose one every 10,000 years
99.99%
Availability
52 minutes of downtime per year for S3 Standard
How S3 Achieves 11 Nines — Erasure Coding vs Replication
Split data into fragments, add parity fragments using Reed-Solomon math. Reconstruct from any sufficient subset. Half the raw storage of 3x replication, with better durability.
A 6+3 scheme splits data into 6 chunks and creates 3 parity chunks. Store all 9 across different drives. Any 6 of 9 can reconstruct the original. Lose up to 3 and nothing is lost.
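Reed-Solomon math is beyond a quick sketch, but the core idea — parity that lets you rebuild lost chunks — already shows up with single XOR parity (RAID-5 style). This simplification survives exactly one lost chunk, where S3's Reed-Solomon schemes survive several:

```python
def split_with_parity(data: bytes, k: int):
    """Split data into k chunks plus one XOR parity chunk."""
    chunk_size = -(-len(data) // k)                    # ceiling division
    padded = data.ljust(chunk_size * k, b"\x00")       # pad to an even split
    chunks = [padded[i * chunk_size:(i + 1) * chunk_size] for i in range(k)]
    parity = bytearray(chunk_size)
    for chunk in chunks:                               # parity = XOR of all chunks
        for i, byte in enumerate(chunk):
            parity[i] ^= byte
    return chunks, bytes(parity)

def reconstruct(chunks, parity, lost: int) -> bytes:
    """XOR parity with the surviving chunks to rebuild the lost one."""
    rebuilt = bytearray(parity)
    for idx, chunk in enumerate(chunks):
        if idx != lost:
            for i, byte in enumerate(chunk):
                rebuilt[i] ^= byte
    return bytes(rebuilt)

chunks, parity = split_with_parity(b"object storage!!", k=4)
recovered = reconstruct(chunks, parity, lost=2)   # pretend drive 2 died
```

Store 5 fragments, lose any 1, rebuild it from the other 4 — at 1.25x the raw storage instead of replication's 2x or 3x. Reed-Solomon generalizes the XOR to tolerate multiple simultaneous losses.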
~50% storage overhead (e.g., RS 6+3 stores 1.5x the raw data, vs 3x for triple replication)
99.999999999% (11 nines)
AWS S3, Azure Blob Storage, Google Cloud Storage, MinIO, Ceph
Erasure coding: better durability at less than half the storage cost
Content-Addressable Storage — Deduplicate Everything
Instead of naming files, hash their content. The hash is the address. Same content = same hash = stored once. This is how Git, Docker, and IPFS achieve massive deduplication.
Every commit, tree, and blob is stored by its SHA-1 hash. Two identical files = one object. This is why git repos are small despite tracking every version.
git hash-object file.txt → SHA-1 hash → stored in .git/objects/
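Content addressing in miniature — SHA-256 here instead of Git's header-prefixed SHA-1, and the stored "files" are invented examples:

```python
import hashlib

store: dict = {}

def put(content: bytes) -> str:
    """Address content by its hash: identical content is stored exactly once.
    (Git hashes a 'blob <size>\\0' header plus the body; plain SHA-256 here.)"""
    key = hashlib.sha256(content).hexdigest()
    store[key] = content        # re-storing identical bytes is a no-op
    return key

a = put(b"print('hello')\n")    # original file
b = put(b"print('hello')\n")    # identical file in a forked repo
c = put(b"print('goodbye')\n")  # different content, different address

# a == b: both "files" share one stored object; only c added a second entry
```

This is the deduplication GitHub's Blackbird exploits: a file forked into a million repositories still occupies one index entry.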
Every storage decision has consequences
Block for databases. File for shared access. Object for everything on the web. Erasure coding for durability without bankruptcy. These choices compound — pick the wrong storage type and you're either overpaying 4x (replication where erasure coding works) or fighting latency you can't fix (object storage for a database workload).
You've now covered the full data layer: how to choose a database, design schemas, scale reads and writes, cache intelligently, search at speed, queue work asynchronously, replicate for safety, and store at scale. Next up — how distributed systems survive when things go wrong.
- When should I use SQL vs NoSQL?
- What's the difference between Redis and Memcached?
- When should I shard my database?
- What happens when a cache expires and thousands of requests hit the DB?
- Why use a message queue instead of direct API calls?
- What's eventual consistency and when is it OK?
- How does Kafka handle 7 trillion messages per day?
- What's the difference between replication and sharding?
- Why does one tweet live in 7 places?
- How do I choose between Kafka, RabbitMQ, and SQS?
What's Next?
You've explored how data is stored, cached, queued, replicated, and served at scale. Next, dive into how distributed systems survive chaos — circuit breakers, rate limiters, consensus algorithms, and the patterns that keep services running when everything goes wrong.
Sources
- Stack Overflow Architecture — Nick Craver (nickcraver.com)
- Discord Engineering Blog — Cassandra to ScyllaDB migration
- Netflix Tech Blog — EVCache architecture
- LinkedIn Engineering — Kafka at scale
- Twitter Engineering — Timeline caching and fan-out
- AWS Architecture Blog — Database scaling patterns
- Redis Documentation — Performance benchmarks
- PostgreSQL Documentation — Streaming replication
- MongoDB Documentation — Replica sets and sharding
- ByteByteGo — System Design resources