🚨 Disaster Occurred: August 12, 2025, 2:30 PM
“This new microservices platform has perfect design.”
Just 6 hours after I declared this with complete confidence, all company web services completely stopped, and customer complaint calls wouldn’t stop ringing.
Problems that occurred:
- Communication completely severed between 200+ microservices
- Services unable to start due to IP address exhaustion
- New Pods cannot be created when Auto Scaling triggers
- Company-wide system outage including payment systems
Impact scope:
- Customer-facing website: Complete outage
- Internal systems: 80% functionality lost
- Payment processing: 5-day outage
- Estimated revenue loss: ¥300 million
This article is the record of 5 days of hell caused by naive subnet design and the complete recovery process.
💀 Origin of Design Mistake: Overconfident Subnet Planning
The Problem Design
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
|
# Disaster-inducing design
VPC: microservices-vpc (10.0.0.0/16)
Subnets:
# ❌ Fatal design mistake
container-subnet: 10.0.10.0/24 # ←Only 256 IPs available
Purpose: GKE Cluster (200 services)
Expected_Pods: "About 50?"
Reality: 1,200 Pods needed
service-mesh-subnet: 10.0.3.0/28 # ←16 IPs
Purpose: Istio Control Plane
Expected: "3 Control Plane nodes"
Reality: Istio Proxy needed for all Pods
database-subnet: 10.0.4.0/27 # ←32 IPs
Purpose: Database Services
Expected: "10 DB nodes"
Reality: Dedicated DB needed per service
|
🤦♂️ Naive Estimates
My optimistic calculation:
- Microservices: 200
- Pods per service: 1~2
- Required IPs: “500 should be plenty”
Reality:
- Pods per service: 3~15 (prod/staging/canary)
- Istio Proxy: Required for all Pods
- Database: 3~5 dedicated DB instances per service
- Total required IPs: 3,000+
🔥 Moment of Disaster: IP Exhaustion Chain Reaction
2:30 PM - Migration Work Begins
1
2
|
# Confidently starting deployment
kubectl apply -f microservices-manifests/
|
First 50 services started smoothly.
“Look, perfect design!”
3:45 PM - First Anomaly
1
2
3
4
|
Error: Pod "payment-service-7d4c8f9b-xrt2k" failed to schedule
Reason: IP address allocation failed in subnet container-subnet
Available IPs: 12
Required IPs: 45
|
“That’s strange… calculations showed plenty of room.”
4:20 PM - Cascading System Outages Begin
Due to IP exhaustion:
- New Pods cannot start
- Auto Scaling doesn’t function
- Existing Pod communication impossible via Istio
- Payment service unresponsive
5:00 PM - Complete Company System Outage
1
2
3
|
# Desperate situation check
kubectl get pods --all-namespaces | grep -v Running
# Result: 800+ Pods in Pending state
|
Emergency call from administrator:
“All customer sites are down. Fix it immediately!”
🚨 Emergency Response: 5-Day Battle
Day 1-2: Temporary Measures to Buy Time
Emergency IP Acquisition Operation
1
2
3
|
# Borrowing IPs from other subnets (temporary measure)
gcloud compute networks subnets expand-ip-range container-subnet \
--prefix-length=22 # Expand /24 → /22
|
Result: Some services restored, but not a fundamental solution
Attempting Rollback to Old System
1
2
3
4
|
# Emergency rollback to old system
kubectl rollout undo deployment/payment-service
kubectl rollout undo deployment/user-service
# ... Repeated 200 times
|
Problem: Database migration already completed, rollback impossible
Day 3-4: Fundamental Design Overhaul
Recalculating Proper IP Requirements
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
|
# Realistic design change
Container_IP_Requirements:
Microservices: 200 services
Per_Service_Pods:
Production: 5 pods
Staging: 3 pods
Canary: 2 pods
Total: 10 pods/service
Total_Application_Pods: 200 × 10 = 2,000
Istio_Proxy: 2,000 (sidecar for each Pod)
Database_Pods: 200 services × 3 replicas = 600
Monitoring_Pods: 100
Safety_Buffer: 50%
Total_Required: (2,000 + 600 + 100) × 1.5 = 4,050 IPs
|
New Subnet Configuration Design
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
|
# Post-correction design
VPC: microservices-vpc-v2 (10.0.0.0/16)
Subnets:
# ✅ Realistic design
container-subnet: 10.0.0.0/20 # 4,096 IPs
Purpose: GKE Main Cluster
Available: 4,096 - 16 = 4,080 IPs
container-staging-subnet: 10.0.16.0/22 # 1,024 IPs
Purpose: Staging Environment
service-mesh-subnet: 10.0.20.0/22 # 1,024 IPs
Purpose: Istio Control Plane + Proxies
database-subnet: 10.0.24.0/21 # 2,048 IPs
Purpose: Database Services
monitoring-subnet: 10.0.32.0/24 # 256 IPs
Purpose: Prometheus / Grafana
backup-subnet: 10.0.33.0/24 # 256 IPs
Purpose: Future Expansion
|
Day 5: Complete Reconstruction
New VPC Environment Construction
1
2
3
4
5
6
7
8
9
|
# Build new VPC in parallel
gcloud compute networks create microservices-vpc-v2 \
--subnet-mode regional
# Create appropriately sized subnets
gcloud compute networks subnets create container-subnet-v2 \
--network microservices-vpc-v2 \
--range 10.0.0.0/20 \
--region asia-northeast1
|
Database Migration (Most Difficult)
1
2
3
4
5
6
7
8
9
|
# Replication setup for zero-downtime data migration
gcloud sql instances patch main-db \
--enable-bin-log \
--backup-start-time 01:00
# Phased replication to new environment
gcloud sql instances create main-db-v2 \
--master-instance-name main-db \
--replica-type FAILOVER
|
💡 Root Cause Analysis: Why Such Design?
1. Excessive Optimistic Estimates
1
2
3
4
5
6
7
8
9
|
Wrong_Assumptions:
"1 Service = 1 Pod":
Reality: 10+ Pods for Production/Staging/Canary
"IPs are abundant":
Reality: Kubernetes clusters consume massive IPs
"Istio is lightweight":
Reality: Proxy IPs needed for all Pods
|
2. Gap Between Theory and Reality
1
2
3
4
5
|
Kubernetes_Reality:
Pod_Density: "About 50% of theoretical value"
IP_Fragmentation: "Difficult to secure continuous IPs"
Service_Mesh_Overhead: "5x expected resources"
Auto_Scaling_Burst: "Instantaneous 10x Pod startup"
|
3. Insufficient Testing Environment Validation
1
2
3
4
5
6
7
8
9
|
Test_Environment_Problems:
Scale: "Tested with only 10 services"
Load: "1/100 of real load"
Network: "Tested with simple configuration"
Real_Environment:
Scale: "200 services running simultaneously"
Load: "10x expected traffic"
Network: "Complex inter-service dependencies"
|
🛠️ Complete Solution: Enterprise-Level Subnet Design
Hierarchical Subnet Strategy
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
|
# Production-level design principles
Subnet_Design_Strategy:
Principle_1_Isolation:
- Complete environment separation (prod/staging/dev)
- Service tier separation (web/app/data)
- Security zone separation (dmz/internal/restricted)
Principle_2_Scalability:
- Reserve 5x current requirements
- Auto Scaling burst support
- Future expansion consideration (10 years ahead)
Principle_3_Security:
- Zero Trust Network design
- Service Mesh segmentation
- Network Policy enforcement
|
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
|
# Practical IP calculation formula (avoid disasters)
IP_Calculation_Formula:
Base_Requirements:
Services: N
Pods_Per_Service: P
Environments: E (prod/staging/dev/canary)
Service_Mesh_Factor: 2.0 # For Istio proxy
Database_Factor: 1.5 # For DB replicas
Monitoring_Factor: 1.2 # For Monitoring stack
Auto_Scaling_Factor: 3.0 # For Burst scaling
Safety_Buffer: 2.0 # Generous design margin
Total_IPs = N × P × E × 2.0 × 1.5 × 1.2 × 3.0 × 2.0
Example:
200 services × 3 pods × 4 envs × 2.0 × 1.5 × 1.2 × 3.0 × 2.0
= 200 × 3 × 4 × 21.6 = 51,840 IPs required
Recommended_CIDR: /14 (65,536 IPs) or larger
|
Final Design Diagram
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
|
# Design completed after disaster
VPC: microservices-enterprise-vpc (10.0.0.0/14) # 262,144 IPs
Production_Environment:
prod-ingress-subnet: 10.0.0.0/22 # 1,024 IPs
prod-app-subnet: 10.0.4.0/20 # 16,384 IPs
prod-data-subnet: 10.0.20.0/21 # 2,048 IPs
prod-mesh-subnet: 10.0.28.0/20 # 16,384 IPs
Staging_Environment:
staging-app-subnet: 10.1.0.0/21 # 2,048 IPs
staging-data-subnet: 10.1.8.0/22 # 1,024 IPs
staging-mesh-subnet: 10.1.12.0/21 # 2,048 IPs
Development_Environment:
dev-app-subnet: 10.2.0.0/22 # 1,024 IPs
dev-data-subnet: 10.2.4.0/23 # 512 IPs
dev-mesh-subnet: 10.2.6.0/22 # 1,024 IPs
Special_Purpose:
ci-cd-subnet: 10.3.0.0/22 # 1,024 IPs
monitoring-subnet: 10.3.4.0/22 # 1,024 IPs
backup-subnet: 10.3.8.0/22 # 1,024 IPs
future-expansion: 10.4.0.0/18 # 16,384 IPs
|
🚀 Recovery Work: Phased Migration Strategy
Phase 1: Emergency Recovery (Completed: Day 5)
1
2
3
4
5
6
7
|
# Critical services priority recovery
kubectl create namespace critical-services
kubectl apply -f critical-manifests/ -n critical-services
# Payment system highest priority
kubectl scale deployment payment-service --replicas=10
kubectl scale deployment user-auth-service --replicas=8
|
Phase 2: Phased Migration (Completed: Day 10)
1
2
3
4
5
6
7
8
9
10
11
|
Migration_Strategy:
Week_1: Critical services (payment, auth, user)
Week_2: Customer-facing services (web, api, mobile)
Week_3: Internal services (admin, reporting, batch)
Week_4: Development/staging environments
Risk_Mitigation:
- Blue-Green deployment per service
- Real-time health monitoring
- Immediate rollback capability
- Database replica synchronization
|
Phase 3: Monitoring & Automation Enhancement (Completed: Day 14)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
|
# Automated IP usage monitoring
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
spec:
groups:
- name: subnet-monitoring
rules:
- alert: SubnetIPUtilizationHigh
expr: subnet_ip_utilization > 0.8
labels:
severity: warning
annotations:
summary: "Subnet IP utilization above 80%"
- alert: SubnetIPUtilizationCritical
expr: subnet_ip_utilization > 0.95
labels:
severity: critical
annotations:
summary: "URGENT: Subnet running out of IPs"
|
📊 Lessons Learned from Disaster
❌ Design Patterns to Avoid
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
|
Anti_Patterns:
Underestimating_Kubernetes:
- "1 Service = 1 Pod" assumption
- Ignoring sidecar Proxy IP consumption
- Not considering Auto Scaling bursts
Insufficient_Testing:
- Small-scale environment testing only
- Insufficient load testing
- Lack of network partition testing
Poor_Capacity_Planning:
- Optimistic estimates
- No margin design
- No future expansion consideration
|
✅ Recommended Design Patterns
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
|
Best_Practices:
Realistic_Capacity_Planning:
- Realistic Pod count estimates (5-10x margin)
- Service Mesh overhead consideration (2-3x)
- Auto Scaling burst consideration (3-5x)
- Future expansion consideration (10 years ahead)
Comprehensive_Testing:
- Production-equivalent scale testing
- Load testing + network partition testing
- Risk mitigation through phased deployment
Proactive_Monitoring:
- Real-time IP usage monitoring
- Threshold alerts (80% warning, 95% critical)
- Auto-scale-out preparation
|
🎯 Business Lessons
1
2
3
4
5
6
7
8
9
10
|
Business_Lessons:
Technical_Debt_Cost:
- Improper design cost: ¥300 million revenue loss
- Recovery cost: 200 engineer hours
- Trust recovery cost: Immeasurable
Investment_Priority:
- Invest sufficient time and resources in infrastructure design
- Production-equivalent test environments
- Proactive investment in monitoring/operations automation
|
🔍 Technical Deep Dive: The Science of Subnet Design
Kubernetes Networking Reality
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
|
Kubernetes_Networking_Reality:
Pod_IP_Consumption:
- 1 Pod = 1 IP (basic)
- Istio Proxy = Additional IP per Pod
- Init Containers = Temporary IP consumption
- Failed Pods = IP fragmentation factor
Service_Discovery_Overhead:
- ClusterIP Service = Virtual IP consumption
- NodePort Service = Node IP consumption
- LoadBalancer Service = External IP consumption
- Ingress Controller = Additional IP consumption
Auto_Scaling_Burst_Pattern:
- CPU spikes: Instantaneous 5-10x Pod creation
- Memory pressure: Pod migration occurs
- Network policy changes: Pod restart cascade
|
Service Mesh Hidden Costs
1
2
3
4
5
6
7
8
9
10
11
12
|
Istio_Hidden_Costs:
Control_Plane:
- istiod: 3-5 replicas × N zones = 15 IPs
- Ingress Gateway: 3-5 replicas = 15 IPs
- Egress Gateway: 3-5 replicas = 15 IPs
Data_Plane:
- Envoy Proxy: All Pods = Application Pod count
- Mixer/Telemetry: Pod count × 0.1
- Pilot Agent: Pod count × 0.05
Total_Factor: Application Pod × 2.15
|
Large-Scale Operations Actual Data
1
2
3
4
5
6
7
8
9
10
11
12
13
14
|
Production_Data_200_Services:
Expected_vs_Reality:
Estimated_Pods: 400 (2 per service)
Actual_Pods: 3,200 (16 per service average)
Estimated_IPs: 800
Actual_IPs: 6,400
Growth_Rate: 800% over estimate
Peak_Scaling_Events:
Black_Friday: 15,000 Pods (4.7x normal)
Database_Failover: 8,000 Pods (2.5x normal)
Network_Partition: 12,000 Pods (3.8x normal)
|
🎯 Complete Recovery: System Stabilization
Final Stable Operation State
1
2
3
4
5
6
7
8
9
10
11
|
# Post-recovery situation check
kubectl get pods --all-namespaces | wc -l
# Result: 3,247 pods running
kubectl get nodes -o wide
# Result: 15 nodes, all Ready
# IP usage check
gcloud compute networks subnets describe container-subnet-v2 \
--format="value(ipCidrRange, availableIpAddressCount)"
# Result: 10.0.0.0/20, 892 available (78% utilization)
|
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
|
Performance_Test_Results:
Load_Test_Peak:
Requests_Per_Second: 50,000
Response_Time_95th: 120ms
Error_Rate: 0.02%
Pod_Count_Peak: 4,200
IP_Utilization_Peak: 84%
Stress_Test_Results:
Auto_Scale_Time: 45 seconds
New_Pod_IP_Assignment: < 5 seconds
Service_Discovery_Propagation: < 10 seconds
Disaster_Recovery_Test:
Failover_Time: 2 minutes
Data_Loss: 0 transactions
Service_Restoration: 100%
|
📈 Long-term Operations Results
6 Months Later
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
|
Six_Months_Later:
System_Stability:
Uptime: 99.97%
Major_Incidents: 0
IP_Related_Issues: 0
Capacity_Utilization:
Average_Pod_Count: 2,800
Peak_Pod_Count: 4,200
IP_Utilization: 65-85%
Headroom_Available: 30%
Cost_Impact:
Infrastructure_Cost: +40% (larger subnets)
Operational_Cost: -60% (automation)
Incident_Cost: -100% (zero outages)
Business_Impact:
Customer_Satisfaction: Restored
Revenue_Impact: +15% (improved reliability)
Team_Productivity: +30% (less firefighting)
|
🏆 Summary: Iron Rules of Disaster-Preventing Subnet Design
🎯 Absolute Rules for Design
-
Realistic Capacity Planning
- 5-10x margin on optimistic estimates
- Always calculate Service Mesh overhead (2-3x)
- Consider Auto Scaling bursts (3-5x)
-
Production-Equivalent Testing
- Scale, load, and failure testing trinity
- Network partition testing implementation
- Risk validation through phased deployment
-
Proactive Monitoring
- Constant IP usage monitoring
- Threshold alerts (80% warning, 95% critical)
- Auto-expansion preparation
💡 Continuous Improvement in Operations
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
|
Continuous_Improvement:
Monthly_Review:
- IP usage trend analysis
- Growth forecast updates
- Capacity plan updates
Quarterly_Test:
- Disaster recovery drills
- Scale testing
- Security audits
Annual_Architecture_Review:
- Technology choice reevaluation
- Subnet configuration optimization
- Cost efficiency improvements
|
🚨 Mistakes to Absolutely Avoid
- ❌ Overconfidence of “we’re small so we’re fine”
- ❌ Ignoring scale differences between dev and production environments
- ❌ Underestimating Kubernetes “hidden resource consumption”
- ❌ Fixed design without considering growth
- ❌ Production deployment without monitoring
Three years after this disaster, our system still runs stably. Those 5 days of hell definitely made our team grow. To ensure we never repeat the same mistakes, I leave this record.
To readers, to avoid causing the same disaster: Subnet design must not be taken lightly. Design with realistic, generous margins based on reality.
📅 Disaster Occurred: August 12, 2025
📅 Complete Recovery: August 17, 2025
📅 Record Created: September 14, 2025
Lesson: “About right” doesn’t exist in infrastructure design