Subnet Design Nightmare: Microservices Migration Paralyzes Company Systems for 5 Days - Complete Disaster Recovery Record

🚨 Disaster Occurred: August 12, 2025, 2:30 PM

“The design of this new microservices platform is flawless.”

Just six hours after I declared this with complete confidence, every one of the company’s web services went down, and the customer complaint calls wouldn’t stop.

Problems that occurred:

  • Communication completely severed between 200+ microservices
  • Services unable to start due to IP address exhaustion
  • New Pods could not be created when Auto Scaling triggered
  • Company-wide system outage including payment systems

Impact scope:

  • Customer-facing website: Complete outage
  • Internal systems: 80% functionality lost
  • Payment processing: 5-day outage
  • Estimated revenue loss: ¥300 million

This article is the record of 5 days of hell caused by naive subnet design and the complete recovery process.

💀 Origin of Design Mistake: Overconfident Subnet Planning

The Problem Design

# Disaster-inducing design
VPC: microservices-vpc (10.0.0.0/16)

Subnets:
  # ❌ Fatal design mistake
  container-subnet: 10.0.10.0/24  # ←Only 256 IPs available
    Purpose: GKE Cluster (200 services)
    Expected_Pods: "About 50?"
    Reality: 1,200 Pods needed
    
  service-mesh-subnet: 10.0.3.0/28  # ←16 IPs
    Purpose: Istio Control Plane
    Expected: "3 Control Plane nodes"  
    Reality: Istio Proxy needed for all Pods
    
  database-subnet: 10.0.4.0/27  # ←32 IPs  
    Purpose: Database Services
    Expected: "10 DB nodes"
    Reality: Dedicated DB needed per service

🤦‍♂️ Naive Estimates

My optimistic calculation:

  • Microservices: 200
  • Pods per service: 1~2
  • Required IPs: “500 should be plenty”

Reality:

  • Pods per service: 3~15 (prod/staging/canary)
  • Istio Proxy: Required for all Pods
  • Database: 3~5 dedicated DB instances per service
  • Total required IPs: 3,000+
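
In hindsight, even a five-line shell check against these numbers would have exposed the gap before migration day. A minimal sketch, using the figures above and this article’s one-extra-IP-per-sidecar accounting:

# Back-of-the-envelope capacity check (numbers from the estimates above)
SERVICES=200
PODS_PER_SERVICE=10            # prod + staging + canary
SIDECAR_FACTOR=2               # one extra IP per Pod for the Istio sidecar (this article's model)
CAPACITY=$((2 ** (32 - 24)))   # the /24 container-subnet: 256 addresses

REQUIRED=$((SERVICES * PODS_PER_SERVICE * SIDECAR_FACTOR))
echo "required=$REQUIRED capacity=$CAPACITY"
# Prints: required=4000 capacity=256, short by more than an order of magnitude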

🔥 Moment of Disaster: IP Exhaustion Chain Reaction

2:30 PM - Migration Work Begins

# Confidently starting deployment
kubectl apply -f microservices-manifests/

The first 50 services started smoothly. “Look, perfect design!”

3:45 PM - First Anomaly

Error: Pod "payment-service-7d4c8f9b-xrt2k" failed to schedule
Reason: IP address allocation failed in subnet container-subnet
Available IPs: 12
Required IPs: 45

“That’s strange… the calculations showed plenty of room.”

4:20 PM - Cascading System Outages Begin

Due to IP exhaustion:

  1. New Pods could not start
  2. Auto Scaling stopped functioning
  3. Istio could no longer route traffic between existing Pods
  4. The payment service became unresponsive

5:00 PM - Complete Company System Outage

# Desperate situation check
kubectl get pods --all-namespaces | grep -v Running
# Result: 800+ Pods in Pending state
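
When hundreds of Pods pile up in Pending, the scheduler’s events explain why faster than the Pod list does. Two standard kubectl queries that help here (the Pod name is the one from the earlier error):

# Show recent scheduling failures, newest last
kubectl get events --all-namespaces \
    --field-selector reason=FailedScheduling \
    --sort-by=.lastTimestamp

# Drill into a single stuck Pod
kubectl describe pod payment-service-7d4c8f9b-xrt2k | tail -n 20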

Emergency call from management: “All customer sites are down. Fix it immediately!”

🚨 Emergency Response: 5-Day Battle

Day 1-2: Temporary Measures to Buy Time

Emergency IP Acquisition Operation

# Emergency widening of the subnet's primary range (temporary measure)
gcloud compute networks subnets expand-ip-range container-subnet \
    --region asia-northeast1 \
    --prefix-length=22  # Expand /24 → /22

Result: Some services restored, but not a fundamental solution
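
One caveat about this command: GCP only lets you widen a subnet’s primary range, never shrink it back, and the widened range must not overlap any other subnet in the VPC. It is worth confirming what you actually got:

# Verify the widened range
gcloud compute networks subnets describe container-subnet \
    --region asia-northeast1 \
    --format="value(ipCidrRange)"
# Expected: 10.0.8.0/22 (the /22 that encloses the original 10.0.10.0/24)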

Attempting Rollback to Old System

# Emergency rollback to old system
kubectl rollout undo deployment/payment-service
kubectl rollout undo deployment/user-service
# ... Repeated 200 times

Problem: the database migration had already completed, so rolling back was impossible

Day 3-4: Fundamental Design Overhaul

Recalculating Proper IP Requirements

# Realistic design change
Container_IP_Requirements:
  Microservices: 200 services
  Per_Service_Pods:
    Production: 5 pods
    Staging: 3 pods  
    Canary: 2 pods
    Total: 10 pods/service
    
  Total_Application_Pods: 200 × 10 = 2,000
  
  Istio_Proxy: 2,000 (sidecar for each Pod)
  Database_Pods: 200 services × 3 replicas = 600
  Monitoring_Pods: 100
  
  Safety_Buffer: 50%
  Total_Required: (2,000 + 600 + 100) × 1.5 = 4,050 IPs

New Subnet Configuration Design

# Post-correction design
VPC: microservices-vpc-v2 (10.0.0.0/16)

Subnets:
  # ✅ Realistic design
  container-subnet: 10.0.0.0/20    # 4,096 IPs
    Purpose: GKE Main Cluster
    Available: 4,096 - 4 = 4,092 IPs (GCP reserves 4 addresses per subnet)
    
  container-staging-subnet: 10.0.16.0/22  # 1,024 IPs  
    Purpose: Staging Environment
    
  service-mesh-subnet: 10.0.20.0/22  # 1,024 IPs
    Purpose: Istio Control Plane + Proxies
    
  database-subnet: 10.0.24.0/21   # 2,048 IPs
    Purpose: Database Services
    
  monitoring-subnet: 10.0.32.0/24  # 256 IPs
    Purpose: Prometheus / Grafana
    
  backup-subnet: 10.0.33.0/24     # 256 IPs  
    Purpose: Future Expansion
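
One more detail worth calling out, because the first design also missed it: on VPC-native GKE clusters, Pod and Service IPs are allocated from secondary ranges attached to the subnet, not from its primary range, so those must be sized deliberately as well. If that applies, the subnet creation command (shown again in Day 5 below) gains secondary-range flags, roughly like this (the pods/services CIDRs and the cluster name are illustrative, not part of the plan above):

# Attach secondary ranges for Pods and Services (ranges are illustrative)
gcloud compute networks subnets create container-subnet-v2 \
    --network microservices-vpc-v2 \
    --range 10.0.0.0/20 \
    --secondary-range pods=10.0.64.0/18,services=10.0.128.0/20 \
    --region asia-northeast1

# Tell GKE to draw Pod/Service IPs from them
gcloud container clusters create main-cluster \
    --region asia-northeast1 \
    --enable-ip-alias \
    --subnetwork container-subnet-v2 \
    --cluster-secondary-range-name pods \
    --services-secondary-range-name services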

Day 5: Complete Reconstruction

New VPC Environment Construction

# Build new VPC in parallel
gcloud compute networks create microservices-vpc-v2 \
    --subnet-mode custom

# Create appropriately sized subnets
gcloud compute networks subnets create container-subnet-v2 \
    --network microservices-vpc-v2 \
    --range 10.0.0.0/20 \
    --region asia-northeast1
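
The remaining subnets from the design follow the same pattern; a small loop keeps the names and region consistent (a sketch using the v2 design’s ranges and a -v2 naming convention):

# Create the rest of the v2 subnets
for entry in \
    "container-staging-subnet-v2 10.0.16.0/22" \
    "service-mesh-subnet-v2 10.0.20.0/22" \
    "database-subnet-v2 10.0.24.0/21" \
    "monitoring-subnet-v2 10.0.32.0/24" \
    "backup-subnet-v2 10.0.33.0/24"; do
  set -- $entry   # split "name range" into $1 and $2
  gcloud compute networks subnets create "$1" \
      --network microservices-vpc-v2 \
      --range "$2" \
      --region asia-northeast1
done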

Database Migration (Most Difficult)

# Replication setup for zero-downtime data migration
gcloud sql instances patch main-db \
    --enable-bin-log \
    --backup-start-time 01:00

# Phased replication to new environment
gcloud sql instances create main-db-v2 \
    --master-instance-name main-db \
    --replica-type FAILOVER
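
If the new environment is instead fed by a read replica, the final cutover is a one-way promotion followed by repointing the applications (the DB_HOST variable is illustrative):

# One-way: the promoted instance stops replicating, so run this only at switchover
gcloud sql instances promote-replica main-db-v2

# Repoint a service at the promoted instance (env var name is illustrative)
kubectl set env deployment/payment-service DB_HOST=main-db-v2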

💡 Root Cause Analysis: Why Such Design?

1. Excessive Optimistic Estimates

Wrong_Assumptions:
  "1 Service = 1 Pod": 
    Reality: 10+ Pods for Production/Staging/Canary
    
  "IPs are abundant":
    Reality: Kubernetes clusters consume massive numbers of IPs
    
  "Istio is lightweight":
    Reality: Proxy IPs needed for all Pods

2. Gap Between Theory and Reality

Kubernetes_Reality:
  Pod_Density: "About 50% of theoretical value"
  IP_Fragmentation: "Difficult to secure continuous IPs"  
  Service_Mesh_Overhead: "5x expected resources"
  Auto_Scaling_Burst: "Instantaneous 10x Pod startup"

3. Insufficient Testing Environment Validation

Test_Environment_Problems:
  Scale: "Tested with only 10 services"
  Load: "1/100 of real load"
  Network: "Tested with simple configuration"
  
Real_Environment:
  Scale: "200 services running simultaneously"  
  Load: "10x expected traffic"
  Network: "Complex inter-service dependencies"

🛠️ Complete Solution: Enterprise-Level Subnet Design

Hierarchical Subnet Strategy

# Production-level design principles
Subnet_Design_Strategy:
  Principle_1_Isolation:
    - Complete environment separation (prod/staging/dev)
    - Service tier separation (web/app/data)
    - Security zone separation (dmz/internal/restricted)
    
  Principle_2_Scalability:
    - Reserve 5x current requirements
    - Auto Scaling burst support
    - Future expansion consideration (10 years ahead)
    
  Principle_3_Security:
    - Zero Trust Network design
    - Service Mesh segmentation  
    - Network Policy enforcement

Practical IP Calculation Formula

# Practical IP calculation formula (avoid disasters)
IP_Calculation_Formula:
  Base_Requirements:
    Services: N
    Pods_Per_Service: P
    Environments: E (prod/staging/dev/canary)
    
  Service_Mesh_Factor: 2.0  # For Istio proxy
  Database_Factor: 1.5      # For DB replicas  
  Monitoring_Factor: 1.2    # For Monitoring stack
  Auto_Scaling_Factor: 3.0  # For Burst scaling
  Safety_Buffer: 2.0        # Generous design margin
  
  Total_IPs = N × P × E × 2.0 × 1.5 × 1.2 × 3.0 × 2.0
  
Example:
  200 services × 3 pods × 4 envs × 2.0 × 1.5 × 1.2 × 3.0 × 2.0 
  = 200 × 3 × 4 × 21.6 = 51,840 IPs required
  
Recommended_CIDR: /14 (262,144 IPs) or larger
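
The same formula takes only a few lines of shell, which makes it easy to rerun whenever the inputs change. A sketch (factors expressed in tenths to stay in integer arithmetic):

# IP requirement calculator using the factors above
N=200   # services
P=3     # pods per service
E=4     # environments
FACTORS_TENTHS=216   # 2.0 x 1.5 x 1.2 x 3.0 x 2.0 = 21.6

TOTAL=$((N * P * E * FACTORS_TENTHS / 10))
PREFIX=32
while [ "$((2 ** (32 - PREFIX)))" -lt "$TOTAL" ]; do
  PREFIX=$((PREFIX - 1))
done
echo "required=$TOTAL -> smallest fitting CIDR: /$PREFIX ($((2 ** (32 - PREFIX))) IPs)"
# Prints: required=51840 -> smallest fitting CIDR: /16 (65536 IPs)
# The design below rounds up two more bits, to /14, for long-term headroom.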

Final Design Diagram

# Design completed after disaster
VPC: microservices-enterprise-vpc (10.0.0.0/14)  # 262,144 IPs

Production_Environment:
  prod-ingress-subnet: 10.0.0.0/22     # 1,024 IPs
  prod-app-subnet: 10.0.16.0/20        # 4,096 IPs
  prod-data-subnet: 10.0.32.0/21       # 2,048 IPs
  prod-mesh-subnet: 10.0.48.0/20       # 4,096 IPs

Staging_Environment:
  staging-app-subnet: 10.1.0.0/21      # 2,048 IPs
  staging-data-subnet: 10.1.8.0/22     # 1,024 IPs
  staging-mesh-subnet: 10.1.16.0/21    # 2,048 IPs

Development_Environment:
  dev-app-subnet: 10.2.0.0/22          # 1,024 IPs
  dev-data-subnet: 10.2.4.0/23         # 512 IPs
  dev-mesh-subnet: 10.2.8.0/22         # 1,024 IPs

Special_Purpose:
  ci-cd-subnet: 10.3.0.0/22            # 1,024 IPs
  monitoring-subnet: 10.3.4.0/22       # 1,024 IPs
  backup-subnet: 10.3.8.0/22           # 1,024 IPs
  future-expansion: 10.3.64.0/18       # 16,384 IPs

🚀 Recovery Work: Phased Migration Strategy

Phase 1: Emergency Recovery (Completed: Day 5)

# Critical services priority recovery
kubectl create namespace critical-services
kubectl apply -f critical-manifests/ -n critical-services

# Payment system highest priority
kubectl scale deployment payment-service --replicas=10
kubectl scale deployment user-auth-service --replicas=8

Phase 2: Phased Migration (Completed: Day 10)

Migration_Strategy:
  Week_1: Critical services (payment, auth, user)
  Week_2: Customer-facing services (web, api, mobile)  
  Week_3: Internal services (admin, reporting, batch)
  Week_4: Development/staging environments
  
Risk_Mitigation:
  - Blue-Green deployment per service
  - Real-time health monitoring
  - Immediate rollback capability
  - Database replica synchronization

Phase 3: Monitoring & Automation Enhancement (Completed: Day 14)

# Automated IP usage monitoring
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: subnet-monitoring
spec:
  groups:
  - name: subnet-monitoring
    rules:
    - alert: SubnetIPUtilizationHigh
      expr: subnet_ip_utilization > 0.8
      labels:
        severity: warning
      annotations:
        summary: "Subnet IP utilization above 80%"
        
    - alert: SubnetIPUtilizationCritical  
      expr: subnet_ip_utilization > 0.95
      labels:
        severity: critical
      annotations:
        summary: "URGENT: Subnet running out of IPs"

📊 Lessons Learned from Disaster

❌ Design Patterns to Avoid

Anti_Patterns:
  Underestimating_Kubernetes:
    - "1 Service = 1 Pod" assumption
    - Ignoring sidecar Proxy IP consumption
    - Not considering Auto Scaling bursts
    
  Insufficient_Testing:
    - Small-scale environment testing only
    - Insufficient load testing
    - Lack of network partition testing
    
  Poor_Capacity_Planning:
    - Optimistic estimates
    - No margin design
    - No future expansion consideration

✅ Design Patterns to Adopt

Best_Practices:
  Realistic_Capacity_Planning:
    - Realistic Pod count estimates (5-10x margin)
    - Service Mesh overhead consideration (2-3x)
    - Auto Scaling burst consideration (3-5x)
    - Future expansion consideration (10 years ahead)
    
  Comprehensive_Testing:
    - Production-equivalent scale testing
    - Load testing + network partition testing  
    - Risk mitigation through phased deployment
    
  Proactive_Monitoring:
    - Real-time IP usage monitoring
    - Threshold alerts (80% warning, 95% critical)
    - Auto-scale-out preparation

🎯 Business Lessons

Business_Lessons:
  Technical_Debt_Cost:
    - Improper design cost: ¥300 million revenue loss
    - Recovery cost: 200 engineer hours
    - Trust recovery cost: Immeasurable
    
  Investment_Priority:
    - Invest sufficient time and resources in infrastructure design
    - Production-equivalent test environments
    - Proactive investment in monitoring/operations automation

🔍 Technical Deep Dive: The Science of Subnet Design

Kubernetes Networking Reality

Kubernetes_Networking_Reality:
  Pod_IP_Consumption:
    - 1 Pod = 1 IP (basic)
    - Istio Proxy = Additional IP per Pod
    - Init Containers = Temporary IP consumption
    - Failed Pods = IP fragmentation factor
    
  Service_Discovery_Overhead:
    - ClusterIP Service = Virtual IP consumption
    - NodePort Service = Node IP consumption  
    - LoadBalancer Service = External IP consumption
    - Ingress Controller = Additional IP consumption
    
  Auto_Scaling_Burst_Pattern:
    - CPU spikes: Instantaneous 5-10x Pod creation
    - Memory pressure: Pod migration occurs
    - Network policy changes: Pod restart cascade
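
The per-node cap compounds all of this: every node advertises an allocatable Pod count (commonly 110), and VPC-native GKE reserves a /24 of Pod range per node by default to back it. Two quick checks:

# Allocatable Pods per node
kubectl get nodes \
    -o custom-columns=NAME:.metadata.name,PODS:.status.allocatable.pods

# Pods actually scheduled per node
kubectl get pods --all-namespaces -o wide --no-headers \
    | awk '{print $8}' | sort | uniq -c | sort -rn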

Service Mesh Hidden Costs

Istio_Hidden_Costs:
  Control_Plane:
    - istiod: 3-5 replicas × N zones = 15 IPs
    - Ingress Gateway: 3-5 replicas = 15 IPs
    - Egress Gateway: 3-5 replicas = 15 IPs
    
  Data_Plane:
    - Envoy Proxy: All Pods = Application Pod count
    - Mixer/Telemetry: Pod count × 0.1
    - Pilot Agent: Pod count × 0.05
    
  Total_Factor: Application Pod × 2.15

Large-Scale Operations Actual Data

Production_Data_200_Services:
  Expected_vs_Reality:
    Estimated_Pods: 400 (2 per service)
    Actual_Pods: 3,200 (16 per service average)
    
    Estimated_IPs: 800  
    Actual_IPs: 6,400
    
    Growth_Rate: 8x the estimate
    
  Peak_Scaling_Events:
    Black_Friday: 15,000 Pods (4.7x normal)
    Database_Failover: 8,000 Pods (2.5x normal)  
    Network_Partition: 12,000 Pods (3.8x normal)

🎯 Complete Recovery: System Stabilization

Final Stable Operation State

# Post-recovery situation check
kubectl get pods --all-namespaces --no-headers | wc -l
# Result: 3,247 pods running

kubectl get nodes -o wide
# Result: 15 nodes, all Ready

# IP usage check
gcloud compute networks subnets describe container-subnet-v2 \
    --region asia-northeast1 \
    --format="value(ipCidrRange, availableIpAddressCount)"
# Result: 10.0.0.0/20, 892 available (78% utilization)

Performance Test Results

Performance_Test_Results:
  Load_Test_Peak:
    Requests_Per_Second: 50,000
    Response_Time_95th: 120ms  
    Error_Rate: 0.02%
    Pod_Count_Peak: 4,200
    IP_Utilization_Peak: 84%
    
  Stress_Test_Results:
    Auto_Scale_Time: 45 seconds
    New_Pod_IP_Assignment: < 5 seconds
    Service_Discovery_Propagation: < 10 seconds
    
  Disaster_Recovery_Test:
    Failover_Time: 2 minutes
    Data_Loss: 0 transactions
    Service_Restoration: 100%

📈 Long-term Operations Results

6 Months Later

Six_Months_Later:
  System_Stability:
    Uptime: 99.97%
    Major_Incidents: 0
    IP_Related_Issues: 0
    
  Capacity_Utilization:
    Average_Pod_Count: 2,800
    Peak_Pod_Count: 4,200
    IP_Utilization: 65-85%
    Headroom_Available: 30%
    
  Cost_Impact:
    Infrastructure_Cost: +40% (larger subnets)
    Operational_Cost: -60% (automation)
    Incident_Cost: -100% (zero outages)
    
  Business_Impact:
    Customer_Satisfaction: Restored
    Revenue_Impact: +15% (improved reliability)
    Team_Productivity: +30% (less firefighting)

🏆 Summary: Iron Rules of Disaster-Preventing Subnet Design

🎯 Absolute Rules for Design

  1. Realistic Capacity Planning

    • 5-10x margin on optimistic estimates
    • Always calculate Service Mesh overhead (2-3x)
    • Consider Auto Scaling bursts (3-5x)
  2. Production-Equivalent Testing

    • Scale, load, and failure testing trinity
    • Network partition testing implementation
    • Risk validation through phased deployment
  3. Proactive Monitoring

    • Constant IP usage monitoring
    • Threshold alerts (80% warning, 95% critical)
    • Auto-expansion preparation

💡 Continuous Improvement in Operations

Continuous_Improvement:
  Monthly_Review:
    - IP usage trend analysis
    - Growth forecast updates
    - Capacity plan updates
    
  Quarterly_Test:
    - Disaster recovery drills
    - Scale testing
    - Security audits
    
  Annual_Architecture_Review:
    - Technology choice reevaluation
    - Subnet configuration optimization
    - Cost efficiency improvements

🚨 Mistakes to Absolutely Avoid

  • ❌ The overconfident belief that “we’re small, so we’ll be fine”
  • ❌ Ignoring scale differences between dev and production environments
  • ❌ Underestimating Kubernetes “hidden resource consumption”
  • ❌ Fixed design without considering growth
  • ❌ Production deployment without monitoring

The rebuilt system has been running stably ever since. Those 5 days of hell undeniably made our team grow. I leave this record so that we never repeat the same mistakes.

To every reader, so that you never cause the same disaster: do not take subnet design lightly. Design from real-world numbers, and with generous margins.


📅 Disaster Occurred: August 12, 2025
📅 Complete Recovery: August 17, 2025
📅 Record Created: September 14, 2025

Lesson: “About right” doesn’t exist in infrastructure design
