[Design Hell] VPC Peering Design Mistake Causes 3-Day Company-wide System Outage → Emergency PSC Migration Battle Record

Prologue: The Sweet Temptation of “Easy Connection”

Monday, June 10, 2025, 11:00 AM

“VPC connection? Just click a few times with VPC Peering. 30 minutes is enough to connect everything.”

At the design meeting for our new multi-project environment, I answered with such confidence. Little did I know that this “easy connection” would cause a catastrophic 3-day company-wide system outage…

This is the record of the hellish experience caused by naive VPC design assumptions, and the network design truths learned from it.

Chapter 1: The Trap of “Beautiful Design”

Project Overview: Multi-VPC Environment Construction

Background: Due to rapid company growth, we decided to separate GCP projects by department and environment.

Design Requirements:

  • Ensure department independence
  • Separate environments (dev/staging/prod)
  • Inter-department collaboration as needed
  • Connection to shared services (AD, DNS, monitoring)

My “Beautiful Design”

```text
                     [Shared Services VPC]
                    shared-vpc (10.0.0.0/16)
                   ├─ AD: 10.0.1.0/24
                   ├─ DNS: 10.0.2.0/24
                   └─ Monitoring: 10.0.3.0/24
                          ↓ VPC Peering
        ┌─────────────────┼─────────────────┐
        ↓                 ↓                 ↓
   [Sales VPC]         [Dev VPC]        [Finance VPC]
sales-vpc           dev-vpc           finance-vpc
(10.1.0.0/16)      (10.2.0.0/16)     (10.3.0.0/16)
     ↓                  ↓                  ↓
VPC Peering        VPC Peering        VPC Peering
     ↓                  ↓                  ↓
[Sales PROD]       [Dev PROD]        [Finance PROD]
(10.1.0.0/24)      (10.2.0.0/24)     (10.3.0.0/24)
     ↓                  ↓                  ↓
[Sales STAGING]    [Dev STAGING]     [Finance STAGING]
(10.1.1.0/24)      (10.2.1.0/24)     (10.3.1.0/24)
     ↓                  ↓                  ↓
[Sales DEV]        [Dev DEV]         [Finance DEV]
(10.1.2.0/24)      (10.2.2.0/24)     (10.3.2.0/24)
```

“Perfect! Departments and environments are cleanly separated. Connect necessary parts with VPC Peering, and we have a secure, manageable network!”

June 15: Optimistic Approval in Design Review

CTO: “I see, each department is independent. Good design.”

Me: “Yes, with VPC Peering for only necessary connections, security is perfect.”

Infrastructure Manager: “Are IP address conflicts okay?”

Me: “Private IP addresses, so conflicts are fine if VPCs are different.”

Security Officer: “Can unnecessary inter-department communication be blocked?”

Me: “VPC Peering only configured where needed, so control is perfect.”

Everyone: “Approved. Build it in 2 weeks.”

Implementation Start: Enthusiastic VPC Creation

```bash
# Shared services VPC
gcloud compute networks create shared-vpc \
    --subnet-mode=custom \
    --project=shared-services

# Create each department VPC
for dept in sales dev finance; do
    gcloud compute networks create ${dept}-vpc \
        --subnet-mode=custom \
        --project=${dept}-project
done

# Create each environment VPC
for dept in sales dev finance; do
    for env in prod staging dev; do
        gcloud compute networks create ${dept}-${env}-vpc \
            --subnet-mode=custom \
            --project=${dept}-${env}-project
    done
done
```

“Going smoothly. Next create subnets and connect with VPC Peering, then done.”

Chapter 2: Beginning of Hell - IP Address Overlap Nightmare

June 20: Construction Work Begins

Start VPC Peering configuration:

```bash
# Shared services → Each department connection
gcloud compute networks peerings create shared-to-sales \
    --network=shared-vpc \
    --peer-project=sales-project \
    --peer-network=sales-vpc

gcloud compute networks peerings create shared-to-dev \
    --network=shared-vpc \
    --peer-project=dev-project \
    --peer-network=dev-vpc

# ... Similarly create 12 more peerings
```

First warning sign:

```text
WARNING: Peering 'shared-to-sales' created, but route may conflict
WARNING: Multiple routes to 10.0.0.0/16 detected
WARNING: Possible routing loop detected
```

“Warnings are showing, but should be okay if tests pass.”

June 22: First Connection Test

Connection test to shared AD (10.0.1.10):

```bash
# From Sales PROD environment
ping 10.0.1.10
# PING 10.0.1.10: 56 data bytes
# ping: cannot resolve 10.0.1.10: Unknown host

# From Dev PROD environment
ping 10.0.1.10
# PING 10.0.1.10: 56 data bytes
# ping: sendto: No route to host

# From Finance PROD environment
ping 10.0.1.10
# PING 10.0.1.10: 56 data bytes
# Request timeout for icmp_seq 0
```

“Huh? Nothing connects…”

Routing Table Check Reveals Horror

```text
# Shared services VPC routing table
gcloud compute routes list --project=shared-services

NAME                NETWORK      DEST_RANGE    NEXT_HOP
shared-vpc-route    shared-vpc   10.0.0.0/16   shared-vpc
sales-peer-route    shared-vpc   10.1.0.0/16   peering-sales
dev-peer-route      shared-vpc   10.2.0.0/16   peering-dev
finance-peer-route  shared-vpc   10.3.0.0/16   peering-finance
```

Sales VPC routing table:

```text
gcloud compute routes list --project=sales-project

NAME                NETWORK      DEST_RANGE    NEXT_HOP
sales-vpc-route     sales-vpc    10.1.0.0/16   sales-vpc
shared-peer-route   sales-vpc    10.0.0.0/16   peering-shared
sales-prod-route    sales-vpc    10.1.0.0/24   peering-prod
sales-stg-route     sales-vpc    10.1.1.0/24   peering-staging
sales-dev-route     sales-vpc    10.1.2.0/24   peering-dev
```

“Oh… the routing is completely messed up…”

June 23: The True Nature of the Problem Revealed

Late-night debugging revealed the fundamental problems:

1. Fatal IP Address Design Flaw

Actual IP address allocation for each environment:

```text
Sales Department:
├─ sales-vpc: 10.1.0.0/16 (Department main)
├─ sales-prod: 10.1.0.0/24 (Production environment)    ← Overlap!
├─ sales-staging: 10.1.1.0/24 (Staging)
└─ sales-dev: 10.1.2.0/24 (Development environment)

Development Department:
├─ dev-vpc: 10.2.0.0/16 (Department main)
├─ dev-prod: 10.2.0.0/24 (Production environment)      ← Overlap!
├─ dev-staging: 10.2.1.0/24 (Staging)
└─ dev-dev: 10.2.2.0/24 (Development environment)
```

Problem: each department's main VPC (/16) and its production environment VPC (/24) use overlapping subnet ranges
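This kind of overlap is mechanical to catch ahead of time; a minimal check with Python's standard `ipaddress` module, using the ranges from the design above (the alternative range is just an illustration):

```python
import ipaddress

# The department VPC range and its own "prod" range overlap because
# 10.1.0.0/24 is a subset of 10.1.0.0/16.
dept = ipaddress.ip_network("10.1.0.0/16")
prod = ipaddress.ip_network("10.1.0.0/24")

print(dept.overlaps(prod))   # → True
print(prod.subnet_of(dept))  # → True

# A non-overlapping alternative carved from a different block (hypothetical):
alt_prod = ipaddress.ip_network("10.101.0.0/24")
print(dept.overlaps(alt_prod))  # → False
```

Running a check like this against every planned CIDR pair during the design review would have surfaced the conflict before a single VPC was created.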

2. VPC Peering Routing Limitations

VPC Peering is not “transitive”:

```text
Sales PROD → Sales VPC → Shared VPC → Dev VPC → Dev PROD
   ↑_________________________↑
   Communication via this route impossible
```

Direct communication from Sales PROD to Dev PROD requires separate Peering
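The non-transitive behavior can be modeled as a plain graph where only direct edges count. A toy sketch (the peering set below is illustrative, not the full production list):

```python
# Peerings are direct, bidirectional edges. Reachability over VPC Peering
# is NOT the transitive closure -- only a direct edge counts.
peerings = {
    frozenset({"shared-vpc", "sales-vpc"}),
    frozenset({"shared-vpc", "dev-vpc"}),
    frozenset({"sales-vpc", "sales-prod"}),
    frozenset({"dev-vpc", "dev-prod"}),
}

def can_reach(a, b):
    """True only if a and b are directly peered."""
    return frozenset({a, b}) in peerings

print(can_reach("sales-prod", "sales-vpc"))   # → True: direct peering
print(can_reach("sales-prod", "shared-vpc"))  # → False: two hops away
print(can_reach("sales-prod", "dev-prod"))    # → False: needs its own peering
```

This is exactly why the "hub" design above silently failed: every spoke-to-hub-to-service path involves two hops.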

3. Routing Loop Occurrence

```text
Lost packet route:
10.1.0.10 → sales-vpc → shared-vpc → sales-prod → sales-vpc → ...
                                        ↑__________________|
                                        Endless loop
```
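A loop like this can be caught offline by walking the next-hop table and flagging any revisited network. A toy sketch; the next-hop mapping below is a simplified model of the broken state above, not real route data:

```python
# Simplified next-hop table: which network a packet is forwarded to next.
next_hop = {
    "sales-vpc": "shared-vpc",
    "shared-vpc": "sales-prod",
    "sales-prod": "sales-vpc",  # back to the start: a loop
}

def find_loop(start, table):
    """Walk next hops; return the repeating cycle if a network recurs."""
    seen, path, node = set(), [], start
    while node in table:
        if node in seen:
            return path[path.index(node):]  # the repeating segment
        seen.add(node)
        path.append(node)
        node = table[node]
    return None  # walk terminated: no loop

print(find_loop("sales-vpc", next_hop))
# → ['sales-vpc', 'shared-vpc', 'sales-prod']
```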

“This is completely a design mistake…”

Chapter 3: June 24 - Complete System Outage Nightmare

9:00 AM: Scheduled Production Start

Planned migration schedule:

  • 09:00: New network environment goes live
  • 10:00: Each department system connection verification
  • 11:00: Business operations begin

9:15 AM: First Alert

```text
🚨 Monitoring Alert
Subject: [CRITICAL] Active Directory Authentication Failed
Body: Multiple authentication failures detected
- sales-prod: Cannot reach domain controller
- dev-prod: Authentication timeout
- finance-prod: LDAP connection failed
```

Authentication system malfunction across all departments

9:30 AM: Problem Chain Begins

Emergency call from Sales: “Can’t log into customer management system! We have an important business meeting today!”

Slack from Development:

```text
Dev Manager: Can't deploy to production system
Dev Manager: Monitoring system also invisible
Dev Manager: What's happening?
```

Internal call from Finance: “Payroll and accounting systems are all erroring. Today is month-end deadline!”

10:00 AM: Emergency Response Meeting

Participants:

  • CTO
  • Department heads
  • Entire infrastructure team
  • External consultants (emergency call)

CTO: “What’s the situation?”

Me: “There’s a problem with VPC Peering design, routing isn’t working properly.”

Sales Manager: “When will it be fixed? We have a major project presentation today!”

Me: “We’re investigating… urgently…”

Dev Manager: “Staging environment is also down, so we can’t fix production either.”

CTO: “Can we roll back everything to the old environment?”

Me: “The old environment has already been decommissioned… we can’t roll back…”

Everyone: “……” (Desperate silence)

11:00 AM: Emergency Team Formation

Work assignment:

  • Me: Emergency network configuration review
  • Network Engineers (2): Investigation & fix of existing Peering
  • System Engineers (3): Impact investigation for each system
  • External Consultants (2): Alternative solution consideration

1:00 PM: Emergency Measures Attempt

Temporary fix proposal:

```bash
# Emergency fix of overlapping IP addresses
# sales-prod: 10.1.0.0/24 → 10.1.10.0/24
gcloud compute networks subnets create sales-prod-new \
    --network=sales-prod-vpc \
    --range=10.1.10.0/24

# dev-prod: 10.2.0.0/24 → 10.2.10.0/24
gcloud compute networks subnets create dev-prod-new \
    --network=dev-prod-vpc \
    --range=10.2.10.0/24
```

Result:

```text
ERROR: Cannot delete subnet 'sales-prod-subnet'
ERROR: 15 instances still attached to subnet
ERROR: Database instances cannot be moved to different subnet
ERROR: Load balancer configuration requires subnet recreation
```

“Changing IP addresses requires recreating all instances… this will take days.”

3:00 PM: Alternative Solutions Consideration

Proposals from external consultants:

Option 1: VPC Peering Configuration Optimization

  • Time: 1-2 days
  • Risk: Unknown if routing problems completely resolved
  • Impact: All systems need reconstruction

Option 2: Migration to Private Service Connect (PSC)

  • Time: 2-3 days
  • Risk: New technology with uncertainties
  • Impact: Only shared services need reconstruction

Option 3: Complete Rollback (Reconstruct old environment)

  • Time: 1 week
  • Risk: Low
  • Impact: Migration project completely back to square one

“All options are hell…”

4:00 PM: Decision for PSC Migration

Decision reasons:

  1. Leads to most fundamental solution
  2. High future scalability
  3. Only requires changes to shared services

CTO approval: “Go with PSC. Restore within 48 hours.”

Chapter 4: Hell’s 2 Days - PSC Emergency Migration

June 24, 5:00 PM: PSC Migration Work Begins

What is Private Service Connect:

  • Safely connects service providers and consumers
  • Avoids IP address conflicts
  • Solves transitive routing problems
  • Enables more granular access control

Step 1: Convert Shared Services to PSC Services

```bash
# PSC setup for Active Directory service
gcloud compute service-attachments create ad-service \
    --region=asia-northeast1 \
    --producer-forwarding-rule=ad-forwarding-rule \
    --connection-preference=ACCEPT_AUTOMATIC \
    --nat-subnets=ad-psc-subnet

# PSC setup for DNS service
gcloud compute service-attachments create dns-service \
    --region=asia-northeast1 \
    --producer-forwarding-rule=dns-forwarding-rule \
    --connection-preference=ACCEPT_AUTOMATIC \
    --nat-subnets=dns-psc-subnet

# PSC setup for monitoring service
gcloud compute service-attachments create monitoring-service \
    --region=asia-northeast1 \
    --producer-forwarding-rule=monitoring-forwarding-rule \
    --connection-preference=ACCEPT_AUTOMATIC \
    --nat-subnets=monitoring-psc-subnet
```

Step 2: Create PSC Endpoints from Each Department VPC

```bash
# Sales department connection to AD service
gcloud compute addresses create sales-ad-psc-ip \
    --region=asia-northeast1 \
    --subnet=sales-psc-subnet \
    --project=sales-project

gcloud compute forwarding-rules create sales-ad-endpoint \
    --region=asia-northeast1 \
    --network=sales-vpc \
    --address=sales-ad-psc-ip \
    --target-service-attachment=projects/shared-services/regions/asia-northeast1/serviceAttachments/ad-service \
    --project=sales-project
```

June 24, 11:00 PM: First PSC Connection Test

```bash
# Sales environment to AD connection test
ping 10.1.100.10  # PSC endpoint IP
# PING 10.1.100.10: 56 data bytes
# 64 bytes from 10.1.100.10: icmp_seq=0 ttl=64 time=2.3 ms
# 64 bytes from 10.1.100.10: icmp_seq=1 ttl=64 time=1.8 ms
```

“Yes! It’s connected!”

June 25, 3:00 AM: Configuration Changes at Each Department

Connection destination changes in each system:

```text
# Sales system configuration change
# /etc/ldap/ldap.conf
URI ldap://10.1.100.10/  # Via PSC endpoint
BASE dc=company,dc=local

# Development system configuration change
# /etc/sssd/sssd.conf
ldap_uri = ldap://10.2.100.10/  # Dev department PSC endpoint

# Finance system configuration change
# application.properties
ldap.server.url=ldap://10.3.100.10/  # Finance department PSC endpoint
```
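Before cutting each system over, a scripted TCP check against the new endpoints is safer than trusting ping alone, since ICMP can succeed while the service port is still blocked. A minimal sketch; the endpoint IPs are the ones used in the configs above, and LDAP's standard port 389 is assumed:

```python
import socket

def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# PSC endpoint IPs from the configs above; LDAP listens on TCP 389.
for host in ("10.1.100.10", "10.2.100.10", "10.3.100.10"):
    print(host, "ldap reachable:", port_open(host, 389, timeout=1.0))
```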

June 25, 6:00 AM: System Connection Verification

Operation test at each department:

```text
# Sales system verification
systemctl status sales-app
● sales-app.service - Sales Management Application
   Loaded: loaded
   Active: active (running)
   Status: "Connected to AD via PSC endpoint"

# Development system verification
kubectl get pods -n production
NAME                    READY   STATUS    RESTARTS   AGE
api-server-7d4f8b9c     1/1     Running   0          2h
database-proxy-5c7d9    1/1     Running   0          2h
monitoring-agent-3x8k   1/1     Running   0          2h
```

“All systems are operating normally…”

June 25, 8:00 AM: Business Resumption Verification

Reports from each department:

Sales:

```text
Sales Manager: Customer management system normal
Sales Manager: Email & calendar also restored
Sales Manager: Made it in time for today's presentation!
```

Development:

```text
Dev Manager: Both production & staging environments normal
Dev Manager: CI/CD pipeline restored
Dev Manager: Monitoring system also no problems
```

Finance:

```text
Finance Manager: Payroll & accounting systems restored
Finance Manager: Can start month-end closing process
Finance Manager: Thank you for your hard work!
```

“We did it… everything’s restored…”

Chapter 5: Post-Incident Analysis and Lessons Learned

Stable Operation After Recovery

PSC migration monitoring results after 1 week:

```text
Network connection status:
- Connection success rate: 99.98%
- Average response time: 1.2ms (previously 2.5ms)
- Error rate: 0.02% (previously 3.2%)

System operation status:
- Active Directory authentication: 100% success
- DNS resolution: 99.99% success
- Monitoring system: All metrics normal

Inter-department communication:
- Unnecessary inter-department communication: Completely blocked
- Necessary shared service access: 100% success
- Security policies: Fully applied
```

Root Cause Analysis

1. Insufficient Understanding in Design Phase

Wrong understanding:

```text
× VPC Peering is an "easy and universal" connection method
× IP address conflicts are "no problem as long as the VPCs are different"
× Routing is "automatically optimized"
```

Correct understanding:

```text
○ VPC Peering is a connection method suited to specific purposes
○ IP address design requires careful consideration of the overall picture
○ Routing requires explicit design & management
```

2. Insufficient Understanding of Connection Method Applications

VPC Peering Application Scenarios:

```text
Suitable cases:
  - Direct connection within same organization
  - Full mesh connection needed
  - Low latency & high throughput requirements
  - Minimize management costs

Unsuitable cases:
  - Complex hub & spoke topology
  - Service provider/consumer relationship
  - Granular access control requirements
  - IP address conflict environments
```

PSC Application Scenarios:

```text
Suitable cases:
  - Service provider/consumer relationship
  - Complex multi-tenant environments
  - IP address conflicts exist
  - Granular access control needed

Unsuitable cases:
  - Simple direct connection
  - Low latency is most important
  - Minimize setup/operation costs
  - No protocol-level control needed
```

3. Inadequate Testing & Validation Process

Validation that should have been performed:

```markdown
## Network Design Validation Checklist

### Design Phase
- [ ] IP address conflict check
- [ ] Routing table design
- [ ] Connectivity matrix creation
- [ ] Security policy definition

### Implementation Phase
- [ ] Phased construction (VPC by VPC)
- [ ] Connection testing at each phase
- [ ] Load testing & failure testing
- [ ] Rollback procedure confirmation

### Pre-Production Testing
- [ ] All system connection verification
- [ ] Performance measurement
- [ ] Security testing
- [ ] Operations procedure confirmation
```

Improved Final Design

PSC-based Hub & Spoke topology:

```text
                [Shared Services VPC]
                 (Service Producer)
                ┌─ AD Service (PSC)
                ├─ DNS Service (PSC)
                └─ Monitor Service (PSC)
                PSC Service Attachment
    ┌─────────────────────┼─────────────────────┐
    ↓                     ↓                     ↓
[Sales VPC]           [Dev VPC]             [Finance VPC]
PSC Endpoint          PSC Endpoint          PSC Endpoint
├─ AD: 10.1.100.10    ├─ AD: 10.2.100.10    ├─ AD: 10.3.100.10
├─ DNS: 10.1.100.11   ├─ DNS: 10.2.100.11   ├─ DNS: 10.3.100.11
└─ Mon: 10.1.100.12   └─ Mon: 10.2.100.12   └─ Mon: 10.3.100.12
```

Benefits:

  • IP address conflict problem resolution
  • Complete department isolation
  • Granular access control
  • Transitive routing problem resolution
  • Future scalability

Cost Comparison

VPC Peering vs PSC:

| Item | VPC Peering | PSC | Difference |
|---|---|---|---|
| Connection fees | Free | $0.01/hour/connection | +$2,600/year |
| Data transfer | Standard rates | Standard rates | Same |
| Operations effort | High (complex management) | Low (simple) | -$50,000/year |
| Incident response | High (this case example) | Low | -$100,000/year |
| Total cost | High | Low | -$147,400/year |

“PSC resulted in significant cost reduction overall”
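As a sanity check, the "+$2,600/year" line is consistent with the $0.01/hour rate at roughly 30 consumer endpoints; the endpoint count here is an assumption for illustration, not taken from actual billing data:

```python
RATE_PER_HOUR = 0.01   # per PSC consumer endpoint, from the table above
HOURS_PER_YEAR = 24 * 365

def yearly_endpoint_cost(n_endpoints):
    """Flat hourly endpoint fee scaled to a year."""
    return n_endpoints * RATE_PER_HOUR * HOURS_PER_YEAR

print(yearly_endpoint_cost(1))   # ≈ 87.6 per endpoint per year
print(yearly_endpoint_cost(30))  # ≈ 2628, close to the +$2,600/year above
```

The fee scales linearly with endpoint count, so it stays small even as departments and services grow.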

Chapter 6: Lessons for Other Organizations

VPC Connection Method Selection Flowchart

```text
Network connection requirements
┌─ IP address conflicts? ─ YES → Consider PSC
│        ↓ NO
├─ Service provider/consumer? ─ YES → PSC recommended
│        ↓ NO
├─ Complex control needed? ─ YES → Consider PSC or Proxy
│        ↓ NO
├─ Ultra-low latency needed? ─ YES → VPC Peering recommended
│        ↓ NO
└─ Simple direct connection? ─ YES → VPC Peering possible
```
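The flowchart reads naturally as a decision function; a direct transcription (the question order mirrors the chart, and the return strings are its labels):

```python
def choose_connection(ip_conflicts, provider_consumer, complex_control,
                      ultra_low_latency, simple_direct):
    """Transcription of the selection flowchart above, top to bottom."""
    if ip_conflicts:
        return "Consider PSC"
    if provider_consumer:
        return "PSC recommended"
    if complex_control:
        return "Consider PSC or Proxy"
    if ultra_low_latency:
        return "VPC Peering recommended"
    if simple_direct:
        return "VPC Peering possible"
    return "Re-examine requirements"

# This article's case: conflicting IPs and a provider/consumer relationship.
print(choose_connection(True, True, True, False, False))  # → Consider PSC
```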

Design Checkpoints

1. IP Address Design

```bash
#!/bin/bash
# IP address overlap check script example
echo "=== IP Address Overlap Check ==="
for vpc in $(gcloud compute networks list --format="value(name)"); do
    echo "VPC: $vpc"
    gcloud compute networks subnets list --network=$vpc \
        --format="table(name,ipCidrRange)" --sort-by=ipCidrRange
    echo ""
done

# Overlap check logic
python3 check_ip_overlap.py --vpc-list=vpc_list.txt
```

2. Routing Design

```bash
# Routing table visualization
gcloud compute routes list --format="table(
    name,
    destRange,
    nextHopInstance,
    nextHopVpnTunnel,
    nextHopPeering,
    priority
)" --sort-by=destRange

# Routing loop detection
./detect_routing_loops.sh
```

3. Connection Testing Automation

```yaml
# Connection testing automation (Cloud Build)
steps:
- name: 'gcr.io/cloud-builders/gcloud'
  script: |
    # Reachability testing from each VPC
    for source_vpc in sales-vpc dev-vpc finance-vpc; do
      for target_service in ad-service dns-service; do
        echo "Testing $source_vpc -> $target_service"
        gcloud compute ssh test-vm --zone=asia-northeast1-a \
          --project=${source_vpc}-project \
          --command="nc -zv ${target_service}.internal 389"
      done
    done
```

Operational Considerations

1. PSC Operations Best Practices

```yaml
PSC Operations Guidelines:
  Monitoring_Targets:
    - Endpoint connection status
    - Service health
    - Data transfer volume & costs

  Alert_Settings:
    - Connection failure rate > 1%
    - Response time > 5 seconds
    - Monthly cost > 80% of budget

  Regular_Tasks:
    - Monthly connection status report
    - Quarterly cost optimization review
    - Semi-annual security audit
```

2. Emergency Response Procedures

```markdown
## PSC Failure Response Flowchart

### Level 1: Endpoint Failure
1. Switch to other endpoints
2. Contact service provider
3. Notify users of impact scope

### Level 2: Service Provider Failure
1. Switch to backup service
2. Enable emergency VPC Peering
3. Begin recovery work

### Level 3: Complete Failure
1. Switch back to on-premises environment
2. Execute business continuity plan
3. Request external support
```

Summary: Escaping the Illusion of “Easy Connection”

Project Summary

Total damage:

  • System outage opportunity loss: ¥15M
  • Emergency response work cost: ¥5M
  • External consultant fees: ¥3M
  • PSC migration costs: ¥2M

Total: ¥25M

However, the value gained:

  • Mastering correct network design: Priceless
  • Deep understanding of PSC technology: Priceless
  • Improved team cohesion: Priceless
  • Enhanced incident response capability: Priceless

Most Important Lessons

1. The Danger of the Word “Easy”

Typical engineer assumption: “VPC Peering just connects with a few clicks → Easy”

Reality: Network design requires multi-layered consideration of IP address design, routing design, security design, and operational design.

2. Importance of Appropriate Technology Selection

Technology selection principles:

  • Accurately understand requirements
  • Deeply understand each technology’s characteristics
  • Consider long-term operational costs
  • Prioritize scalability & maintainability

3. Value of Phased Implementation and Testing

This failure’s cause: connecting all VPCs at once and only testing afterward

Correct approach: Build VPC by VPC in phases, thoroughly testing each phase

Message to Other Engineers

Network design often appears “easy,” but it’s actually very deep, and once you make design mistakes, it’s a difficult area to correct.

My 3 pieces of advice:

  1. Stop and think when something seems “easy”

    • Question if it’s really easy
    • Check for hidden complexity
    • Consider phased approaches
  2. Make technology choices based on requirements

    • Don’t choose based on trends or apparent simplicity
    • Consider long-term operations
    • Compare multiple options
  3. Don’t fear learning from failures

    • Failures are the best learning opportunities
    • Share knowledge across the entire team
    • Create systems to prevent the same failures

I hope this article serves as a reference for engineers facing similar design decisions.

And even if you make similar mistakes, I believe that by continuing to improve without giving up, you can surely reach better solutions.


Note: This article is based on actual network incident experience, but specific organization names and system details have been anonymized.
