[Design Hell] VPC Peering Design Mistake Causes 3-Day Company-wide System Outage → Emergency PSC Migration Battle Record
Prologue: The Sweet Temptation of “Easy Connection”
Monday, June 10, 2025, 11:00 AM
“VPC connection? Just click a few times with VPC Peering. 30 minutes is enough to connect everything.”
At the design meeting for our new multi-project environment, I answered with such confidence. Little did I know that this “easy connection” would cause a catastrophic 3-day company-wide system outage…
This is the record of the hellish experience caused by naive VPC design assumptions, and the network design truths learned from it.
Chapter 1: The Trap of “Beautiful Design”
Project Overview: Multi-VPC Environment Construction
Background:
Due to rapid company growth, we decided to separate GCP projects by department and environment.
Design Requirements:
- Ensure department independence
- Separate environments (dev/staging/prod)
- Inter-department collaboration as needed
- Connection to shared services (AD, DNS, monitoring)
My “Beautiful Design”
```
[Shared Services VPC]
shared-vpc (10.0.0.0/16)
├─ AD: 10.0.1.0/24
├─ DNS: 10.0.2.0/24
└─ Monitoring: 10.0.3.0/24
↓ VPC Peering
┌─────────────────┼─────────────────┐
↓ ↓ ↓
[Sales VPC] [Dev VPC] [Finance VPC]
sales-vpc dev-vpc finance-vpc
(10.1.0.0/16) (10.2.0.0/16) (10.3.0.0/16)
↓ ↓ ↓
VPC Peering VPC Peering VPC Peering
↓ ↓ ↓
[Sales PROD] [Dev PROD] [Finance PROD]
(10.1.0.0/24) (10.2.0.0/24) (10.3.0.0/24)
↓ ↓ ↓
[Sales STAGING] [Dev STAGING] [Finance STAGING]
(10.1.1.0/24) (10.2.1.0/24) (10.3.1.0/24)
↓ ↓ ↓
[Sales DEV] [Dev DEV] [Finance DEV]
(10.1.2.0/24) (10.2.2.0/24) (10.3.2.0/24)
```
“Perfect! Departments and environments are cleanly separated. Connect necessary parts with VPC Peering, and we have a secure, manageable network!”
June 15: Optimistic Approval in Design Review
CTO: “I see, each department is independent. Good design.”
Me: “Yes, with VPC Peering for only necessary connections, security is perfect.”
Infrastructure Manager: “Are IP address conflicts okay?”
Me: “Private IP addresses, so conflicts are fine if VPCs are different.”
Security Officer: “Can unnecessary inter-department communication be blocked?”
Me: “VPC Peering only configured where needed, so control is perfect.”
Everyone: “Approved. Build it in 2 weeks.”
Implementation Start: Enthusiastic VPC Creation
```bash
# Shared services VPC
gcloud compute networks create shared-vpc \
--subnet-mode=custom \
--project=shared-services
# Create each department VPC
for dept in sales dev finance; do
gcloud compute networks create ${dept}-vpc \
--subnet-mode=custom \
--project=${dept}-project
done
# Create each environment VPC
for dept in sales dev finance; do
for env in prod staging dev; do
gcloud compute networks create ${dept}-${env}-vpc \
--subnet-mode=custom \
--project=${dept}-${env}-project
done
done
```
“Going smoothly. Next create subnets and connect with VPC Peering, then done.”
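For the record, a minimal sketch of the subnet layout as designed (the region and exact resource names are my reconstruction; the ranges come from the design diagram above). In hindsight, the overlap is already visible here:

```bash
# Department-wide subnet inside the department main VPC (as designed)
gcloud compute networks subnets create sales-main \
    --network=sales-vpc \
    --region=asia-northeast1 \
    --range=10.1.0.0/16 \
    --project=sales-project

# Production subnet in its own VPC, carved from the SAME /16
gcloud compute networks subnets create sales-prod-subnet \
    --network=sales-prod-vpc \
    --region=asia-northeast1 \
    --range=10.1.0.0/24 \
    --project=sales-prod-project
```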
Chapter 2: Beginning of Hell - IP Address Overlap Nightmare
June 20: Construction Work Begins
Start VPC Peering configuration:
```bash
# Shared services → Each department connection
gcloud compute networks peerings create shared-to-sales \
--network=shared-vpc \
--peer-project=sales-project \
--peer-network=sales-vpc
gcloud compute networks peerings create shared-to-dev \
--network=shared-vpc \
--peer-project=dev-project \
--peer-network=dev-vpc
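# NOTE (added in hindsight): a VPC peering stays INACTIVE until the
# matching peering is also created from the peer network's side,
# e.g. sales-vpc -> shared-vpc in sales-project.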
# ... Similarly create 12 more Peerings
```
First warning sign:
```
WARNING: Peering 'shared-to-sales' created, but route may conflict
WARNING: Multiple routes to 10.0.0.0/16 detected
WARNING: Possible routing loop detected
```
“Warnings are showing, but should be okay if tests pass.”
June 22: First Connection Test
Connection test to shared AD (10.0.1.10):
```bash
# From Sales PROD environment
ping 10.0.1.10
# PING 10.0.1.10: 56 data bytes
# ping: cannot resolve 10.0.1.10: Unknown host
# From Dev PROD environment
ping 10.0.1.10
# PING 10.0.1.10: 56 data bytes
# ping: sendto: No route to host
# From Finance PROD environment
ping 10.0.1.10
# PING 10.0.1.10: 56 data bytes
# Request timeout for icmp_seq 0
```
“Huh? Nothing connects…”
Routing Table Check Reveals Horror
```bash
# Shared services VPC routing table
gcloud compute routes list --project=shared-services
NAME NETWORK DEST_RANGE NEXT_HOP
shared-vpc-route shared-vpc 10.0.0.0/16 shared-vpc
sales-peer-route shared-vpc 10.1.0.0/16 peering-sales
dev-peer-route shared-vpc 10.2.0.0/16 peering-dev
finance-peer-route shared-vpc 10.3.0.0/16 peering-finance
```
Sales VPC routing table:
```bash
gcloud compute routes list --project=sales-project
NAME NETWORK DEST_RANGE NEXT_HOP
sales-vpc-route sales-vpc 10.1.0.0/16 sales-vpc
shared-peer-route sales-vpc 10.0.0.0/16 peering-shared
sales-prod-route sales-vpc 10.1.0.0/24 peering-prod
sales-stg-route sales-vpc 10.1.1.0/24 peering-staging
sales-dev-route sales-vpc 10.1.2.0/24 peering-dev
```
“Oh… the routing is completely messed up…”
June 23: The True Nature of the Problem Revealed
Late-night debugging exposed the fundamental problems:
1. Fatal IP Address Design Flaw
Actual IP address allocation for each environment:
```
Sales Department:
├─ sales-vpc: 10.1.0.0/16 (Department main)
├─ sales-prod: 10.1.0.0/24 (Production environment) ← Overlap!
├─ sales-staging: 10.1.1.0/24 (Staging)
└─ sales-dev: 10.1.2.0/24 (Development environment)
Development Department:
├─ dev-vpc: 10.2.0.0/16 (Department main)
├─ dev-prod: 10.2.0.0/24 (Production environment) ← Overlap!
├─ dev-staging: 10.2.1.0/24 (Staging)
└─ dev-dev: 10.2.2.0/24 (Development environment)
```
Problem: Subnet overlap between department main VPC and production environment VPC
2. VPC Peering Routing Limitations
VPC Peering is not “transitive”:
```
Sales PROD → Sales VPC → Shared VPC → Dev VPC → Dev PROD
↑_________________________↑
Communication via this route impossible
```
Direct communication from Sales PROD to Dev PROD requires separate Peering
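What that separate connection would have looked like: a peering pair between the two spoke VPCs themselves. A sketch (project and network names follow the naming above; both directions are needed before the peering goes ACTIVE):

```bash
# Peering must be created from BOTH sides to become ACTIVE
gcloud compute networks peerings create sales-prod-to-dev-prod \
    --network=sales-prod-vpc \
    --peer-project=dev-prod-project \
    --peer-network=dev-prod-vpc \
    --project=sales-prod-project

gcloud compute networks peerings create dev-prod-to-sales-prod \
    --network=dev-prod-vpc \
    --peer-project=sales-prod-project \
    --peer-network=sales-prod-vpc \
    --project=dev-prod-project
```

With N spoke VPCs this approach needs on the order of N² peerings, which is exactly why a hub-and-spoke topology built on plain peering does not scale.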
3. Routing Loop Occurrence
```
Lost packet route:
10.1.0.10 → sales-vpc → shared-vpc → sales-prod → sales-vpc → ...
↑__________________|
Endless loop
```
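One check that would have caught this early: dump the destination ranges from every project's routing table and look for duplicates. A rough sketch (project IDs follow the article's naming; this only catches exact duplicates, not containment overlaps):

```bash
# Flag destination ranges that appear in more than one route
for proj in shared-services sales-project dev-project finance-project; do
  gcloud compute routes list --project=$proj --format="value(destRange)"
done | sort | uniq -d
```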
“This is completely a design mistake…”
Chapter 3: June 24 - Complete System Outage Nightmare
9:00 AM: Scheduled Production Start
Planned migration schedule:
- 09:00: New network environment goes live
- 10:00: Each department system connection verification
- 11:00: Business operations begin
9:15 AM: First Alert
```
🚨 Monitoring Alert
Subject: [CRITICAL] Active Directory Authentication Failed
Body: Multiple authentication failures detected
- sales-prod: Cannot reach domain controller
- dev-prod: Authentication timeout
- finance-prod: LDAP connection failed
```
Authentication system malfunction across all departments
9:30 AM: Problem Chain Begins
Emergency call from Sales:
“Can’t log into customer management system! We have an important business meeting today!”
Slack from Development:
1
2
3
|
Dev Manager: Can't deploy to production system
Dev Manager: Monitoring system also invisible
Dev Manager: What's happening?
```
Internal call from Finance:
“Payroll and accounting systems are all erroring. Today is month-end deadline!”
10:00 AM: Emergency Response Meeting
Participants:
- CTO
- Each department heads
- Entire infrastructure team
- External consultants (emergency call)
CTO: “What’s the situation?”
Me: “There’s a problem with VPC Peering design, routing isn’t working properly.”
Sales Manager: “When will it be fixed? We have a major project presentation today!”
Me: “We’re investigating… urgently…”
Dev Manager: “Staging environment is also down, so we can’t fix production either.”
CTO: “Can we roll back everything to the old environment?”
Me: “The old environment has already been decommissioned… we can’t roll back…”
Everyone: “……” (Desperate silence)
Work assignment:
- Me: Emergency network configuration review
- Network Engineers (2): Investigation & fix of existing Peering
- System Engineers (3): Impact investigation for each system
- External Consultants (2): Alternative solution consideration
1:00 PM: Emergency Measures Attempt
Temporary fix proposal:
```bash
# Emergency fix of overlapping IP addresses
# sales-prod: 10.1.0.0/24 → 10.1.10.0/24
gcloud compute networks subnets create sales-prod-new \
    --network=sales-prod-vpc \
    --region=asia-northeast1 \
    --range=10.1.10.0/24
# dev-prod: 10.2.0.0/24 → 10.2.10.0/24
gcloud compute networks subnets create dev-prod-new \
    --network=dev-prod-vpc \
    --region=asia-northeast1 \
    --range=10.2.10.0/24
```
Result:
```
ERROR: Cannot delete subnet 'sales-prod-subnet'
ERROR: 15 instances still attached to subnet
ERROR: Database instances cannot be moved to different subnet
ERROR: Load balancer configuration requires subnet recreation
```
“Changing IP addresses requires recreating all instances… this will take days.”
3:00 PM: Alternative Solutions Consideration
Proposals from external consultants:
Option 1: VPC Peering Configuration Optimization
- Time: 1-2 days
- Risk: Unclear whether the routing problems would be fully resolved
- Impact: All systems need reconstruction
Option 2: Migration to Private Service Connect (PSC)
- Time: 2-3 days
- Risk: New technology with uncertainties
- Impact: Only shared services need reconstruction
Option 3: Complete Rollback (Reconstruct old environment)
- Time: 1 week
- Risk: Low
- Impact: Migration project completely back to square one
“All options are hell…”
4:00 PM: Decision for PSC Migration
Decision reasons:
- Leads to most fundamental solution
- High future scalability
- Only requires changes to shared services
CTO approval:
“Go with PSC. Restore within 48 hours.”
Chapter 4: Hell’s 2 Days - PSC Emergency Migration
June 24, 5:00 PM: PSC Migration Work Begins
What is Private Service Connect:
- Connects service producers and consumers without exposing their networks to each other
- Avoids IP address conflicts (each consumer chooses its own endpoint IP)
- Sidesteps VPC Peering’s transitive-routing limitation
- Enables more granular, per-service access control
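The service attachments below assume two producer-side prerequisites that already existed in our shared VPC: an internal load balancer forwarding rule per service (e.g. ad-forwarding-rule) and a dedicated PSC NAT subnet. A sketch of the NAT subnet creation (the range is an assumption):

```bash
# PSC NAT subnet in the producer VPC; the purpose flag is mandatory
gcloud compute networks subnets create ad-psc-subnet \
    --network=shared-vpc \
    --region=asia-northeast1 \
    --range=10.0.200.0/24 \
    --purpose=PRIVATE_SERVICE_CONNECT \
    --project=shared-services
```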
Step 1: Convert Shared Services to PSC Services
```bash
# PSC setup for Active Directory service
gcloud compute service-attachments create ad-service \
--region=asia-northeast1 \
--producer-forwarding-rule=ad-forwarding-rule \
--connection-preference=ACCEPT_AUTOMATIC \
--nat-subnets=ad-psc-subnet
# PSC setup for DNS service
gcloud compute service-attachments create dns-service \
--region=asia-northeast1 \
--producer-forwarding-rule=dns-forwarding-rule \
--connection-preference=ACCEPT_AUTOMATIC \
--nat-subnets=dns-psc-subnet
# PSC setup for monitoring service
gcloud compute service-attachments create monitoring-service \
--region=asia-northeast1 \
--producer-forwarding-rule=monitoring-forwarding-rule \
--connection-preference=ACCEPT_AUTOMATIC \
--nat-subnets=monitoring-psc-subnet
```
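A quick way to verify an attachment before moving on (names follow the commands above):

```bash
# Confirm the attachment exists and inspect its connected endpoints
gcloud compute service-attachments describe ad-service \
    --region=asia-northeast1 \
    --project=shared-services
```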
Step 2: Create PSC Endpoints from Each Department VPC
```bash
# Sales department connection to AD service
gcloud compute addresses create sales-ad-psc-ip \
--region=asia-northeast1 \
--subnet=sales-psc-subnet \
--project=sales-project
gcloud compute forwarding-rules create sales-ad-endpoint \
--region=asia-northeast1 \
--network=sales-vpc \
--address=sales-ad-psc-ip \
--target-service-attachment=projects/shared-services/regions/asia-northeast1/serviceAttachments/ad-service \
--project=sales-project
```
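After creating an endpoint, its PSC connection status can be read off the forwarding rule; ACCEPTED means the producer side has admitted the connection (a sketch using the names above):

```bash
# Check whether the producer accepted this endpoint's connection
gcloud compute forwarding-rules describe sales-ad-endpoint \
    --region=asia-northeast1 \
    --project=sales-project \
    --format="value(pscConnectionStatus)"
```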
June 24, 11:00 PM: First PSC Connection Test
```bash
# Sales environment to AD connection test
ping 10.1.100.10 # PSC endpoint IP
# PING 10.1.100.10: 56 data bytes
# 64 bytes from 10.1.100.10: icmp_seq=0 ttl=64 time=2.3 ms
# 64 bytes from 10.1.100.10: icmp_seq=1 ttl=64 time=1.8 ms
```
“Yes! It’s connected!”
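Ping alone proves little for a forwarded service, so a TCP probe against the actual LDAP port is the stronger smoke test (port 636 for LDAPS is an assumption about the producer configuration):

```bash
# TCP probe of the AD service through the PSC endpoint (LDAP, port 389)
nc -zv 10.1.100.10 389
# LDAPS, if enabled on the producer side (assumption)
nc -zv 10.1.100.10 636
```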
June 25, 3:00 AM: Configuration Changes at Each Department
Connection destination changes in each system:
```
# Sales system configuration change
# /etc/ldap/ldap.conf
URI ldap://10.1.100.10/ # Via PSC endpoint
BASE dc=company,dc=local
# Development system configuration change
# /etc/sssd/sssd.conf
ldap_uri = ldap://10.2.100.10/ # Dev department PSC endpoint
# Finance system configuration change
# application.properties
ldap.server.url=ldap://10.3.100.10/ # Finance department PSC endpoint
```
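Hard-coding per-department endpoint IPs works, but it invites drift. One refinement worth noting (a hedged sketch, not what we ran that night; zone and record names are assumptions) is a private Cloud DNS zone per VPC, so every department resolves the same hostname to its own local PSC endpoint:

```bash
# Private zone visible only to sales-vpc (hypothetical names)
gcloud dns managed-zones create company-local \
    --dns-name="company.local." \
    --visibility=private \
    --networks=sales-vpc \
    --description="Shared-service names resolved to local PSC endpoints" \
    --project=sales-project

# Point the AD hostname at the sales department's PSC endpoint
gcloud dns record-sets create ad.company.local. \
    --zone=company-local \
    --type=A --ttl=300 \
    --rrdatas=10.1.100.10 \
    --project=sales-project
```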
June 25, 6:00 AM: System Connection Verification
Operation test at each department:
```bash
# Sales system verification
systemctl status sales-app
● sales-app.service - Sales Management Application
Loaded: loaded
Active: active (running)
Status: "Connected to AD via PSC endpoint"
# Development system verification
kubectl get pods -n production
NAME READY STATUS RESTARTS AGE
api-server-7d4f8b9c 1/1 Running 0 2h
database-proxy-5c7d9 1/1 Running 0 2h
monitoring-agent-3x8k 1/1 Running 0 2h
```
“All systems are operating normally…”
June 25, 8:00 AM: Business Resumption Verification
Reports from each department:
Sales:
```
Sales Manager: Customer management system normal
Sales Manager: Email & calendar also restored
Sales Manager: Made it in time for today's presentation!
```
Development:
```
Dev Manager: Both production & staging environments normal
Dev Manager: CI/CD pipeline restored
Dev Manager: Monitoring system also no problems
```
Finance:
```
Finance Manager: Payroll & accounting systems restored
Finance Manager: Can start month-end closing process
Finance Manager: Thank you for your hard work!
```
“We did it… everything’s restored…”
Chapter 5: Post-Incident Analysis and Lessons Learned
Stable Operation After Recovery
PSC migration monitoring results after 1 week:
```
Network connection status:
- Connection success rate: 99.98%
- Average response time: 1.2ms (previously 2.5ms)
- Error rate: 0.02% (previously 3.2%)
System operation status:
- Active Directory authentication: 100% success
- DNS resolution: 99.99% success
- Monitoring system: All metrics normal
Inter-department communication:
- Unnecessary inter-department communication: Completely blocked
- Necessary shared service access: 100% success
- Security policies: Fully applied
```
Root Cause Analysis
1. Insufficient Understanding in Design Phase
Wrong understanding:
```
× VPC Peering is an "easy and universal" connection method
× IP address conflicts are "no problem if the VPCs are different"
× Routing is "automatically optimized"
```
Correct understanding:
```
○ VPC Peering is a connection method suited to specific use cases
○ IP address design requires careful planning of the overall address space
○ Routing requires explicit design and management
```
2. Insufficient Understanding of Connection Method Applications
VPC Peering Application Scenarios:
```
Suitable cases:
- Direct connection within same organization
- Full mesh connection needed
- Low latency & high throughput requirements
- Minimize management costs
Unsuitable cases:
- Complex hub & spoke topology
- Service provider/consumer relationship
- Granular access control requirements
- IP address conflict environments
```
PSC Application Scenarios:
```
Suitable cases:
- Service provider/consumer relationship
- Complex multi-tenant environments
- IP address conflicts exist
- Granular access control needed
Unsuitable cases:
- Simple direct connection
- Low latency is most important
- Minimize setup/operation costs
- No protocol-level control needed
```
3. Inadequate Testing & Validation Process
Validation that should have been performed:
```
## Network Design Validation Checklist
### Design Phase
- [ ] IP address conflict check
- [ ] Routing table design
- [ ] Connectivity matrix creation
- [ ] Security policy definition
### Implementation Phase
- [ ] Phased construction (VPC by VPC)
- [ ] Connection testing at each phase
- [ ] Load testing & failure testing
- [ ] Rollback procedure confirmation
### Pre-Production Testing
- [ ] All system connection verification
- [ ] Performance measurement
- [ ] Security testing
- [ ] Operations procedure confirmation
```
Improved Final Design
PSC-based Hub & Spoke topology:
```
[Shared Services VPC]
(Service Producer)
┌─ AD Service (PSC)
├─ DNS Service (PSC)
└─ Monitor Service (PSC)
↑
PSC Service Attachment
↑
┌─────────────────────┼─────────────────────┐
↓ ↓ ↓
[Sales VPC] [Dev VPC] [Finance VPC]
PSC Endpoint PSC Endpoint PSC Endpoint
├─ AD: 10.1.100.10 ├─ AD: 10.2.100.10 ├─ AD: 10.3.100.10
├─ DNS: 10.1.100.11 ├─ DNS: 10.2.100.11 ├─ DNS: 10.3.100.11
└─ Mon: 10.1.100.12 └─ Mon: 10.2.100.12 └─ Mon: 10.3.100.12
```
Benefits:
- IP address conflict problem resolution
- Complete department isolation
- Granular access control
- Transitive routing problem resolution
- Future scalability
Cost Comparison
VPC Peering vs PSC:
| Item | VPC Peering | PSC | Difference |
| --- | --- | --- | --- |
| Connection fees | Free | $0.01/hour/connection | +$2,600/year |
| Data transfer | Standard rates | Standard rates | Same |
| Operations effort | High (complex management) | Low (simple) | -$50,000/year |
| Incident response | High (this incident) | Low | -$100,000/year |
| Total cost | High | Low | -$147,400/year |
“PSC resulted in significant cost reduction overall”
Chapter 6: Lessons for Other Organizations
VPC Connection Method Selection Flowchart
```
Network connection requirements
↓
┌─ IP address conflicts? ─ YES → Consider PSC
│ ↓ NO
├─ Service provider/consumer? ─ YES → PSC recommended
│ ↓ NO
├─ Complex control needed? ─ YES → Consider PSC or Proxy
│ ↓ NO
├─ Ultra-low latency needed? ─ YES → VPC Peering recommended
│ ↓ NO
└─ Simple direct connection? ─ YES → VPC Peering possible
```
Design Checkpoints
1. IP Address Design
```bash
#!/bin/bash
# IP address overlap check script example
echo "=== IP Address Overlap Check ==="
for vpc in $(gcloud compute networks list --format="value(name)"); do
echo "VPC: $vpc"
gcloud compute networks subnets list --network=$vpc \
--format="table(name,ipCidrRange)" --sort-by=ipCidrRange
echo ""
done
# Overlap check logic
python3 check_ip_overlap.py --vpc-list=vpc_list.txt
```
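The loop above only prints ranges per project. For a quick cross-project duplicate check, something like the sketch below works; note it only catches exact duplicate CIDRs, while containment overlaps such as 10.1.0.0/16 vs 10.1.0.0/24 still need the dedicated check (presumably what check_ip_overlap.py handles):

```bash
# Exact-duplicate CIDR check across projects (project IDs assumed)
for proj in shared-services sales-project dev-project finance-project; do
  gcloud compute networks subnets list --project=$proj \
      --format="value(ipCidrRange)"
done | sort | uniq -d
```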
2. Routing Design
```bash
# Routing table visualization
gcloud compute routes list --format="table(
name,
destRange,
nextHopInstance,
nextHopVpnTunnel,
nextHopPeering,
priority
)" --sort-by=destRange
# Routing loop detection
./detect_routing_loops.sh
```
3. Connection Testing Automation
```yaml
# Connection testing automation (Cloud Build)
steps:
- name: 'gcr.io/cloud-builders/gcloud'
  script: |
    # Reachability test from each department's project toward shared services
    for dept in sales dev finance; do
      for target_service in ad-service dns-service; do
        echo "Testing ${dept}-vpc -> ${target_service}"
        gcloud compute ssh test-vm --zone=asia-northeast1-a \
          --project=${dept}-project \
          --command="nc -zv ${target_service}.internal 389"
      done
    done
```
Operational Considerations
1. PSC Operations Best Practices
```yaml
PSC_Operations_Guidelines:
  Monitoring_Targets:
    - Endpoint connection status
    - Service health
    - Data transfer volume & costs
  Alert_Settings:
    - Connection failure rate > 1%
    - Response time > 5 seconds
    - Monthly cost > 80% of budget
  Regular_Tasks:
    - Monthly connection status report
    - Quarterly cost optimization review
    - Semi-annual security audit
```
2. Emergency Response Procedures
```
## PSC Failure Response Flowchart
### Level 1: Endpoint Failure
1. Switch to other endpoints
2. Contact service provider
3. Notify users of impact scope
### Level 2: Service Provider Failure
1. Switch to backup service
2. Enable emergency VPC Peering
3. Begin recovery work
### Level 3: Complete Failure
1. Switch back to on-premises environment
2. Execute business continuity plan
3. Request external support
```
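The "emergency VPC Peering" in Level 2 is the break-glass path: a temporary, tightly scoped peering from one consumer VPC straight to the shared VPC while PSC is impaired. A sketch (names follow the article; remember the matching peering on the shared-services side, and delete both once PSC recovers):

```bash
# Break-glass: temporary direct peering while PSC is impaired
gcloud compute networks peerings create emergency-sales-to-shared \
    --network=sales-vpc \
    --peer-project=shared-services \
    --peer-network=shared-vpc \
    --project=sales-project

# Tear it down after recovery so it doesn't become permanent
gcloud compute networks peerings delete emergency-sales-to-shared \
    --network=sales-vpc \
    --project=sales-project
```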
Summary: Escaping the Illusion of “Easy Connection”
Project Summary
Total damage:
- System outage opportunity loss: ¥15M
- Emergency response work cost: ¥5M
- External consultant fees: ¥3M
- PSC migration costs: ¥2M
Total: ¥25M
However, the value gained:
- Mastering correct network design: Priceless
- Deep understanding of PSC technology: Priceless
- Improved team cohesion: Priceless
- Enhanced incident response capability: Priceless
Most Important Lessons
1. The Danger of the Word “Easy”
Typical engineer assumption:
“VPC Peering just connects with a few clicks → Easy”
Reality:
Network design requires multi-layered consideration of IP address design, routing design, security design, and operational design.
2. Importance of Appropriate Technology Selection
Technology selection principles:
- Accurately understand requirements
- Deeply understand each technology’s characteristics
- Consider long-term operational costs
- Prioritize scalability & maintainability
3. Value of Phased Implementation and Testing
This failure’s cause:
Connecting all the VPCs at once and only testing at the end
Correct approach:
Build VPC by VPC in phases, thoroughly testing each phase
Message to Other Engineers
Network design often looks "easy," but it is deceptively deep, and design mistakes are hard to correct once systems depend on the network.
My 3 pieces of advice:
1. Stop and think when something seems "easy"
   - Question if it's really easy
   - Check for hidden complexity
   - Consider phased approaches
2. Make technology choices based on requirements
   - Don't choose based on trends or apparent simplicity
   - Consider long-term operations
   - Compare multiple options
3. Don't fear learning from failures
   - Failures are the best learning opportunities
   - Share knowledge across the entire team
   - Create systems to prevent the same failures
I hope this article serves as a reference for engineers facing similar design decisions.
And even if you make similar mistakes, I believe that by continuing to improve without giving up, you can surely reach better solutions.
Note: This article is based on actual network incident experience, but specific organization names and system details have been anonymized.