System Design — Page 8
How Netflix Serves 300 Million Users
The ultimate system design case study. Every concept from the previous 4 pages — DNS, CDN, microservices, Kafka, Cassandra, chaos engineering — applied to one system. Plus the content pipeline most diagrams skip.
ORIGINAL SERIES
System Design: The Show
S1 E5 · The Netflix Architecture
Netflix accounts for 15% of all downstream internet traffic worldwide. More than YouTube. More than every social media platform combined. When 300 million subscribers press Play, the request touches DNS, an API gateway, microservices, a custom CDN, multiple databases, and an event pipeline processing 2 trillion events per day. And it all happens in under 2 seconds.
This page is the capstone of the system design series. Every concept from the previous pages — DNS, load balancing, CDN, caching, databases, queues, replication, consistent hashing, circuit breakers — appears here, applied to a real system at massive scale. Netflix isn't just using these concepts. They invented several of them. Chaos Monkey, EVCache, Zuul, and Open Connect all came from Netflix engineering.
But most architecture diagrams only show the user-facing flow: you press Play, video arrives. They skip the other half — how content gets from a studio's editing suite to 18,000+ servers inside ISPs worldwide. This page covers both sides: the request journey AND the content pipeline. Plus the estimation math that system design interviews actually test.
The Request Journey — What Happens When You Press Play
One click. Seven systems. Each step maps back to a concept from the previous pages; trace the full journey below.
DNS Resolution
Where is Netflix?
Your device asks DNS: 'Where is netflix.com?' Route 53 responds with the IP of the nearest AWS region.
Latency-based routing. Mumbai → Asia Pacific. London → EU-West. Not just 'here's an IP' but 'here's the BEST IP for you.'
One click. Seven systems. Under 2 seconds.
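The latency-based routing step can be sketched in a few lines. This is a toy model of the idea behind Route 53's policy, not its implementation; the region names follow AWS conventions, but the latency numbers are invented for illustration.

```python
# Toy latency-based routing: answer DNS queries with the region that has
# the lowest measured latency for the client's location.
# All latency figures below are made-up examples.
REGION_LATENCY_MS = {
    "us-east-1":  {"New York": 12,  "London": 75,  "Mumbai": 210},
    "eu-west-1":  {"New York": 80,  "London": 10,  "Mumbai": 120},
    "ap-south-1": {"New York": 220, "London": 115, "Mumbai": 8},
}

def best_region(client_city: str) -> str:
    """Return the region with the lowest measured latency for this client."""
    return min(REGION_LATENCY_MS, key=lambda r: REGION_LATENCY_MS[r][client_city])

print(best_region("Mumbai"))   # ap-south-1
print(best_region("London"))   # eu-west-1
```

This is the "here's the BEST IP for you" behavior: the answer depends on who is asking, not just what they asked for.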
When you press Play, your device doesn't download the entire movie. It receives a manifest — a list of URLs for 4-second video chunks at various quality levels. Your device then fetches chunks one at a time from the nearest Open Connect Appliance, adapting quality per chunk based on your current bandwidth.
This is why Netflix never fully buffers. It only needs to be a few chunks ahead. If your WiFi drops, it plays the next chunk at lower quality rather than buffering. When bandwidth recovers, quality climbs back up. The user barely notices. This adaptive approach is why Netflix can stream 4K globally — they don't need consistent 25 Mbps, just enough bandwidth to stay a few chunks ahead.
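The per-chunk quality decision above can be sketched as a simple rule: pick the highest rung of the bitrate ladder that fits within a safety margin of the measured bandwidth. The ladder values and the 80% safety factor are illustrative assumptions, not Netflix's actual encodes or algorithm.

```python
# Minimal adaptive-bitrate sketch: per 4-second chunk, choose the highest
# quality whose bitrate fits under a fraction of the measured bandwidth.
# Ladder values (kbps) are illustrative, not Netflix's real encoding ladder.
LADDER_KBPS = [("480p", 1500), ("720p", 3000), ("1080p", 5000), ("4K", 15000)]

def pick_quality(measured_kbps: float, safety: float = 0.8) -> str:
    budget = measured_kbps * safety  # leave headroom so the buffer keeps growing
    best = LADDER_KBPS[0][0]         # never stall: fall back to the lowest rung
    for name, kbps in LADDER_KBPS:
        if kbps <= budget:
            best = name
    return best

print(pick_quality(20000))  # 4K
print(pick_quality(4000))   # 720p (budget 3200: 3000 fits, 5000 does not)
print(pick_quality(1000))   # 480p fallback
```

Because the decision is re-made every chunk, quality degrades and recovers gradually instead of the stream stalling.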
The investment in AV1 codec is strategic. AV1 delivers the same visual quality as H.264 at 30% less bandwidth. At Netflix's scale (500+ petabytes per day), 30% less bandwidth means hundreds of millions of dollars in CDN and ISP costs saved annually. This is why Netflix co-founded the Alliance for Open Media to develop AV1.
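The scale of that 30% saving is easy to check with the figures quoted above; the arithmetic below just multiplies the text's own numbers.

```python
# Rough AV1 savings estimate using the figures from the text:
# ~500 PB/day streamed, AV1 saving ~30% vs H.264.
daily_traffic_pb = 500
av1_savings = 0.30

saved_pb_per_day = daily_traffic_pb * av1_savings
print(f"{saved_pb_per_day:.0f} PB/day saved")        # 150 PB/day saved
print(f"{saved_pb_per_day * 365:,.0f} PB/year saved")
```

Even if only part of the catalog is AV1-encoded, tens of petabytes per day of avoided traffic is why codec R&D pays for itself.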
The Full Architecture — Every Component Connected
20 components across 6 layers. Each node is a system design concept from the previous pages, applied at Netflix scale.
Microservices
OCA Servers
Kafka Clusters
ISP Partners
The 7-year migration that changed everything
In 2007, Netflix was a monolith running in its own data center. A single database corruption incident in August 2008 took down DVD shipping for 3 days. Engineers couldn't isolate the problem because everything was entangled. That was the turning point.
The migration to AWS and microservices took 7 years (2008-2015). They didn't do a big-bang rewrite. They strangled the monolith — new features went into microservices, while they gradually extracted existing functionality. By 2015, the last monolith service was retired. Today, ~1,000 microservices communicate via gRPC and Kafka, each owned by a small team, each deployed independently 100+ times per day.
In 2025, Netflix expanded its modernization through a zero-configuration service mesh built on Envoy proxies, enabling unified resilience, routing, and observability without application-level libraries. They also adopted GraphQL Federation for more efficient API composition across hundreds of services.
The Content Pipeline — From Studio to Your Screen
Before you press Play, every title passes through a 6-stage pipeline. One movie becomes 1,200+ files, pushed to 18,000+ servers worldwide. This is the backend story most architecture diagrams skip.
1. Content Ingestion
Studio uploads master file
Studios upload original content (typically 4K or 8K master files) to Netflix's cloud storage on Amazon S3. Each upload includes metadata: title, cast, audio tracks, subtitle files, and content ratings.
A single 2-hour 4K HDR master can be 200-500 GB. Netflix receives thousands of new assets daily across 190+ countries.
Why 1,200+ Files? — The Encoding Math
6 resolutions × 3 codecs × 3 audio formats × 60+ languages → 1,200+ files per title
Stranger Things S4E1 at 4K HDR ≈ 7 GB/hour · At 480p ≈ 700 MB/hour · AV1 saves 30% vs H.264
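The fan-out above is a loose multiplication, not an exact formula; per-device packaging (DASH/HLS variants, DRM systems) multiplies the base counts further. A hedged sketch of the arithmetic, with illustrative component counts:

```python
# Loose encoding fan-out sketch. The counts are illustrative; real packaging
# rules differ per device family, which pushes the total past 1,200 files.
video_variants = 6 * 3        # resolutions x codecs = 18 video streams
audio_variants = 60 * 3       # languages x audio formats (e.g. stereo, 5.1) = 180
subtitle_variants = 60        # one text track per language

base_files = video_variants + audio_variants + subtitle_variants
print(base_files)  # 258 before per-device packaging multiplies it further
```

The takeaway for estimation: storage per title is dominated by the video variants, but file count is dominated by languages.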
End-to-End Timeline
Total: 6-24 hours from studio upload to globally available. Pre-scheduled releases (like Squid Game S2) are distributed to OCAs days in advance.
Why Netflix builds its own hardware
Open Connect Appliances (OCAs) are custom-built by Netflix. Each box packs 100-200 TB of SSD storage, optimized for sequential read throughput. Netflix gives these servers to ISPs for free. The deal is straightforward: ISPs get 95% reduction in upstream bandwidth costs (a Stranger Things binge stays on their local network), and Netflix gets single-hop delivery to users.
The economics are decisive. At 15% of global internet traffic, renting CDN capacity from Akamai at market rates would cost billions per year. Building custom hardware costs millions. The back-of-envelope math made the decision obvious — build, don't rent. Today, 18,000+ OCAs sit inside 6,000+ ISPs across 190+ countries. Nearly all video traffic stays within the ISP's network.
Content distribution is intelligent: not every OCA gets every title. Netflix's control plane predicts which titles will be popular in each region and pre-caches them during off-peak hours. A global hit like Squid Game goes everywhere. A niche documentary caches primarily in regions where it's likely to be watched — but also in cities with relevant diaspora communities.
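The placement decision can be sketched as a threshold over predicted demand. Everything here is invented for illustration: the titles, cities, scores, and the flat 10% threshold stand in for Netflix's actual (far richer) prediction models.

```python
# Toy pre-cache planner: push a title to a city's OCAs only if predicted
# regional demand clears a threshold. Titles, cities, and scores are made up.
predicted_views = {
    ("squid-game", "Seoul"): 0.95,
    ("squid-game", "Lagos"): 0.80,
    ("niche-doc",  "Seoul"): 0.02,
    ("niche-doc",  "Berlin"): 0.30,
}

def precache_plan(threshold: float = 0.1) -> list[tuple[str, str]]:
    """Return (title, city) pairs worth pushing during off-peak hours."""
    return sorted(key for key, score in predicted_views.items() if score >= threshold)

for title, city in precache_plan():
    print(f"push {title} to OCAs in {city}")
```

A global hit clears the threshold everywhere; the niche documentary gets pushed only where demand is predicted, which is exactly the "not every OCA gets every title" behavior.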
Chaos Engineering — Breaking Things on Purpose
Netflix runs Chaos Monkey in PRODUCTION. During business hours. Every weekday. Engineers have no idea which instance will die next. This forces every team to build resilient services.
Chaos Monkey
Chaos Kong — Region Killer
The Simian Army
- Chaos Monkey — kills random production instances
- Chaos Kong — kills entire AWS regions
- Latency Monkey — injects artificial network delay
- Conformity Monkey — finds instances not following best practices
- Security Monkey — finds security violations and misconfigs
- Janitor Monkey — cleans up unused resources
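At its core, Chaos Monkey's job is tiny: pick a random instance from each eligible group and terminate it. The sketch below is a toy version of that idea; the instance names are invented, and a real run would call the cloud provider's terminate API instead of printing.

```python
# Toy Chaos Monkey: choose one random victim from a service's fleet.
# The point is organizational, not technical: every team must assume any
# instance can die at any time during business hours.
import random

def pick_victim(instances: list[str], rng: random.Random) -> str:
    """Choose one instance to terminate; callers must survive its loss."""
    return rng.choice(instances)

rng = random.Random(42)  # seeded so this example is reproducible
fleet = ["api-1", "api-2", "api-3", "api-4"]
victim = pick_victim(fleet, rng)
print(f"terminating {victim}")
```

The code is trivial on purpose. The hard part is everything around it: services built so that losing the victim is a non-event.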
“The best way to avoid failure is to fail constantly — on your own terms.”
Chaos engineering is a business decision, not a technical one
Netflix's revenue is $39+ billion per year. At 300 million subscribers paying an average of $11/month, one hour of global downtime costs roughly $4.5 million in direct revenue — plus immeasurable brand damage. The 2008 DVD outage lasted 3 days. That kind of failure at today's scale would be catastrophic.
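A quick sanity check on the hourly figure, dividing annual revenue by hours in a year:

```python
# Back-of-envelope: revenue at stake per hour of global downtime.
annual_revenue = 39e9            # USD/year, figure from the text
hours_per_year = 365 * 24        # 8,760

revenue_per_hour = annual_revenue / hours_per_year
print(f"${revenue_per_hour / 1e6:.1f}M per hour")  # $4.5M per hour
```

That number is what justifies spending engineer-years deliberately breaking production.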
The Chaos Monkey investment paid off definitively on Christmas Eve 2012. AWS Elastic Load Balancing had a major outage. Services across the internet went down — Reddit, Heroku, and numerous others. Netflix stayed up. Their circuit breakers detected the ELB failures, stopped routing through affected paths, and continued streaming from healthy instances.
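The circuit breaker behavior described above — detect repeated failures, stop routing through the sick path, retry later — fits in a small class. This is a minimal sketch of the general pattern, not Netflix's Hystrix implementation; thresholds and timings are illustrative.

```python
# Minimal circuit breaker: after max_failures consecutive errors, "open" the
# circuit and serve the fallback immediately, retrying the real call only
# after reset_after seconds ("half-open"). Thresholds are illustrative.
import time

class CircuitBreaker:
    def __init__(self, max_failures: int = 3, reset_after: float = 30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed (healthy)

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                return fallback()       # fail fast: skip the unhealthy path
            self.opened_at = None       # half-open: let one real call through
            self.failures = 0
        try:
            result = fn()
            self.failures = 0           # success closes the circuit
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            return fallback()

def flaky():
    raise ConnectionError("ELB unhealthy")

cb = CircuitBreaker()
for _ in range(5):
    print(cb.call(flaky, fallback=lambda: "served from healthy region"))
```

After the third failure the breaker opens, so calls four and five never touch the broken dependency — the same shape of behavior that kept streams flowing on Christmas Eve 2012.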
Netflix now processes 38 million QoE (Quality of Experience) events per second during live events like the Jake Paul vs. Mike Tyson fight, which streamed to 60+ million concurrent viewers. Their observability platform Atlas processes 17 billion metrics and 700 billion distributed traces daily on 1.5 petabytes of log data.
Back-of-Envelope Calculator — The Math Explained
System design interviews require estimation skills. Here's how engineers calculate Netflix's infrastructure needs; each metric below comes with its formula and the reasoning behind it.
Bitrate → GB/hour Conversion
- 480p mobile: 1.5 Mbps × 3600 s ÷ 8 ÷ 1000 = 0.68 GB/hr
- 1080p HD: 5 Mbps × 3600 s ÷ 8 ÷ 1000 = 2.25 GB/hr
- 4K SDR: 15 Mbps × 3600 s ÷ 8 ÷ 1000 = 6.75 GB/hr
- 4K HDR: 25 Mbps × 3600 s ÷ 8 ÷ 1000 = 11.25 GB/hr
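The conversion is the same formula at every rung: seconds per hour, bits to bytes, megabytes to (decimal) gigabytes.

```python
def gb_per_hour(mbps: float) -> float:
    """Convert a streaming bitrate in Mbit/s to GB per hour (decimal GB)."""
    return mbps * 3600 / 8 / 1000  # x s/hr, /8 bits->bytes, /1000 MB->GB

for label, mbps in [("480p", 1.5), ("1080p", 5), ("4K SDR", 15), ("4K HDR", 25)]:
    print(f"{label}: {gb_per_hour(mbps):.2f} GB/hr")
# 480p: 0.68 GB/hr ... 4K HDR: 11.25 GB/hr
```

Memorize the shorthand: GB/hr ≈ Mbps × 0.45. It's fast enough to use mid-interview.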
Why back-of-envelope matters in interviews and architecture
System design interviews at Google, Meta, Amazon, and Netflix all include estimation questions. Not because they want exact numbers — they want to see you think about scale. Can you reason about whether a single server suffices, or whether you need a distributed system? Can you identify the bottleneck before building anything?
The calculator above shows each formula and its derivation. The key insight: start with users, derive everything else. Subscribers → DAU → concurrent streams → bandwidth → storage → API calls → events. Each step multiplies. A 2x increase in subscribers doesn't just mean 2x servers — it means 2x bandwidth, 2x storage, 2x events, and potentially 4x peak load if the growth concentrates in one timezone.
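The "start with users, derive everything else" chain can be written out directly. The multipliers below (50% daily actives, 10% peak concurrency, 5 Mbps blended bitrate) are interview-style assumptions, not Netflix's real numbers.

```python
# Estimation chain: subscribers -> DAU -> concurrent streams -> bandwidth.
# All ratios are illustrative interview assumptions, not Netflix data.
subscribers = 300_000_000
dau_ratio = 0.5            # assume half the subscribers watch on a given day
peak_concurrency = 0.1     # assume 10% of DAU stream at the same moment
avg_bitrate_mbps = 5       # blended average bitrate across devices

dau = subscribers * dau_ratio
concurrent = dau * peak_concurrency
bandwidth_tbps = concurrent * avg_bitrate_mbps / 1e6  # Mbps -> Tbps

print(f"DAU: {dau:,.0f}")                            # 150,000,000
print(f"Concurrent streams: {concurrent:,.0f}")      # 15,000,000
print(f"Peak bandwidth: {bandwidth_tbps:.0f} Tbps")  # 75 Tbps
```

The exact ratios matter less than stating them out loud: an interviewer wants to see each multiplication justified, and to see you notice that tens of Tbps rules out serving video from a handful of data centers — which is the argument for Open Connect.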
Design Decisions — The WHY Behind Every Choice
Five architectural decisions that define Netflix, and the reasoning behind each.
Every Concept — Applied at Netflix Scale
This is the consolidation. Every system design concept from the previous 4 pages appears in Netflix's architecture, each with a concrete place where Netflix applies it.
14 system design concepts from 4 pages — all used in one system.
This is why Netflix is the definitive system design case study.
- How does Netflix handle 300 million subscribers?
- Why did Netflix build their own CDN?
- What happens when you press Play on Netflix?
- Why does Netflix use Cassandra instead of PostgreSQL?
- What is Netflix's Chaos Monkey?
- How does the content pipeline work?
- How does adaptive bitrate streaming work?
- What is back-of-envelope estimation in system design?
- How does Netflix handle regional failures?
- Why does one Netflix title exist as 1,200+ files?
Series Complete
You've covered the full system design stack: from the big picture to networking, databases, distributed patterns, security, monitoring, interview technique, and a real-world case study. Every concept connects.
Sources
- Netflix Tech Blog (netflixtechblog.com)
- Netflix Open Connect (openconnect.netflix.com)
- Mastering Chaos — Netflix Guide to Microservices (InfoQ, Josh Evans)
- How Netflix Scales its API with GraphQL Federation (Netflix blog)
- Netflix: What Happens When You Press Play? (High Scalability)
- Completing the Netflix Cloud Migration (Netflix blog, 2016)
- The Netflix Simian Army (Netflix blog)
- AV1 Codec Evaluation (Netflix Research)
- Netflix Keystone Pipeline — 2 Trillion Events/Day (Netflix TechBlog)
- Atlas: Netflix Observability Platform — 17B Metrics/Day
- From On-Demand to Live: Netflix Streaming to 100M Devices (InfoQ, 2025)
- ByteByteGo — Netflix Architecture Analysis