Disaster Recovery War Game — Copilot CLI Ideas

$ copilot --idea "Disaster Recovery War Game"

Advanced · ⏱ 3-5 hours · Azure Projects

→ Simulate regional failures and build automated failover with Copilot CLI as your ops partner

The Problem

Everyone has a DR plan. Nobody tests it. When the actual disaster hits, the "runbook" is a 2-year-old Confluence page that references infrastructure that no longer exists, and you find out that your geo-replication has been misconfigured for the last 6 months.

What You'll Build

A complete disaster recovery simulation framework:

- Multi-region architecture with automated failover

- Chaos engineering tests using Azure Chaos Studio

- A tested, executable runbook (not a dusty document)

- Recovery time measurements and SLA validation

Step-by-Step Walkthrough

Phase 1: Assess Current Architecture

$ "Analyse the resources in my resource group and identify:

1. Which services have geo-redundancy configured

2. Which are single-region with no failover

3. Current RTO/RPO for each data service

4. Single points of failure in the architecture"
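To sanity-check what the assessment reports, it can help to classify a small inventory by hand. The sketch below covers storage accounts only and relies on the Azure Storage SKU naming, where the GRS, RA-GRS, GZRS and RA-GZRS variants replicate to a second region while LRS and ZRS stay inside one. The inventory shape and the account names are hypothetical, not something the CLI is guaranteed to produce.

```python
# Storage SKU suffixes that replicate data to a secondary region.
GEO_REDUNDANT_SUFFIXES = ("_GRS", "_RAGRS", "_GZRS", "_RAGZRS")


def split_by_redundancy(accounts):
    """Return (geo_redundant, single_region) lists of account names."""
    geo, single = [], []
    for account in accounts:
        sku = account["sku"].upper()
        if sku.endswith(GEO_REDUNDANT_SUFFIXES):
            geo.append(account["name"])
        else:
            # LRS and ZRS keep every copy inside one region.
            single.append(account["name"])
    return geo, single


# Hypothetical inventory, shaped like a trimmed-down resource listing.
inventory = [
    {"name": "stprodlogs", "sku": "Standard_LRS"},
    {"name": "stprodorders", "sku": "Standard_RAGRS"},
    {"name": "stprodmedia", "sku": "Standard_ZRS"},
]

geo, single = split_by_redundancy(inventory)
print("geo-redundant:", geo)
print("single-region:", single)
```

Anything in the single-region list is a candidate for item 2 of the prompt. Other services (SQL, Cosmos DB, Container Apps) need their own checks.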

Phase 2: Design the Failover Architecture

$ "Design a multi-region failover architecture for this application.

Primary: UK South, Secondary: North Europe.

Include:

- Azure Front Door for global load balancing

- SQL geo-replication with auto-failover groups

- Cosmos DB multi-region writes

- Container Apps in both regions

Generate the Bicep for the secondary region."

Phase 3: Build the Runbook

$ "Create an executable DR runbook as a PowerShell script with these steps:

1. Detect primary region health (probe endpoints every 30s)

2. After 3 consecutive failures, initiate failover

3. Switch SQL failover group to secondary

4. Update Front Door to route 100% to secondary

5. Send alerts via Azure Monitor Action Groups

6. Log all actions with timestamps for post-mortem

Include a -DryRun switch and a -Failback switch."
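The detection and dry-run logic requested above is small enough to reason about before any script is generated. This is a minimal Python sketch of the idea, not the PowerShell runbook itself: the class and function names are made up for illustration, the probe results are simulated rather than fetched from an endpoint, and the failover actions are no-op placeholders.

```python
from datetime import datetime, timezone


class FailoverTrigger:
    """Fires once after `threshold` consecutive failed probes."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.failures = 0
        self.fired = False

    def record(self, healthy):
        """Record one probe result; return True only on the probe that trips failover."""
        if healthy:
            self.failures = 0  # any success resets the streak
            return False
        self.failures += 1
        if self.failures >= self.threshold and not self.fired:
            self.fired = True
            return True
        return False


def run_step(name, action, dry_run, log):
    """Log the step with a UTC timestamp; skip the action in dry-run mode."""
    stamp = datetime.now(timezone.utc).isoformat()
    log.append(f"{stamp} {'DRY-RUN' if dry_run else 'EXECUTE'} {name}")
    if not dry_run:
        action()


# Simulated probe results; a real runbook would poll an endpoint every 30 s.
probes = [True, False, False, True, False, False, False, False]
trigger = FailoverTrigger(threshold=3)
log = []
fired_at = None

for index, healthy in enumerate(probes):
    if trigger.record(healthy):
        fired_at = index
        # Placeholder actions: the real steps would call Azure tooling.
        run_step("switch SQL failover group to secondary", lambda: None, True, log)
        run_step("route Front Door traffic to secondary", lambda: None, True, log)

print("failover triggered at probe index:", fired_at)
for entry in log:
    print(entry)
```

Resetting the counter on any healthy probe is what keeps a single blip from causing a failover, and the `fired` flag stops the runbook from re-triggering while the primary stays down.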

Phase 4: Chaos Engineering

$ "Create Azure Chaos Studio experiments that simulate:

1. Primary region network latency spike (500ms added)

2. SQL database connection pool exhaustion

3. Container App instance crash (kill 2 of 3 replicas)

4. DNS resolution failure for Key Vault

Run each experiment and verify the app degrades gracefully."
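"Degrades gracefully" is easier to verify with a number attached. One possible check, sketched below, is to sample requests while an experiment runs and compare the server error rate and 95th-percentile latency against limits you choose. The thresholds, the sample data and the function names here are assumptions for illustration; pick limits that match your own SLOs.

```python
def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile using integer arithmetic (pct is a whole number)."""
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # ceiling of pct/100 * n
    return ordered[max(rank, 1) - 1]


def degraded_gracefully(samples, max_error_rate=0.05, max_p95_ms=1500):
    """samples: list of (http_status, latency_ms) collected during the experiment."""
    errors = sum(1 for status, _ in samples if status >= 500)
    error_rate = errors / len(samples)
    p95 = nearest_rank_percentile([latency for _, latency in samples], 95)
    passed = error_rate <= max_error_rate and p95 <= max_p95_ms
    return passed, error_rate, p95


# Hypothetical samples taken during the 500 ms latency experiment.
samples = [(200, 620)] * 36 + [(200, 900)] * 3 + [(503, 1400)]

ok, error_rate, p95 = degraded_gracefully(samples)
print(f"pass={ok} error_rate={error_rate:.3f} p95={p95}ms")
```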

Phase 5: Measure and Report

$ "Run a full failover drill:

1. Start the chaos experiment on the primary region

2. Measure time from failure detection to full failover

3. Verify all data is accessible in the secondary region

4. Measure failback time

5. Generate a DR drill report with pass/fail criteria"
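The pass/fail part of the report comes down to subtracting timestamps and comparing against targets. A rough sketch, assuming you capture four timestamps during the drill (the event names, the sample times and the 5-minute targets are all hypothetical):

```python
from datetime import datetime, timedelta


def drill_report(events, rto_target, failback_target):
    """Compute failover and failback durations and compare them to targets."""
    failover = events["failover_complete"] - events["failure_detected"]
    failback = events["failback_complete"] - events["failback_started"]
    return {
        "failover_seconds": failover.total_seconds(),
        "failback_seconds": failback.total_seconds(),
        "failover_pass": failover <= rto_target,
        "failback_pass": failback <= failback_target,
    }


# Hypothetical drill timeline.
start = datetime(2024, 1, 1, 10, 0, 0)
events = {
    "failure_detected": start + timedelta(seconds=90),
    "failover_complete": start + timedelta(seconds=330),
    "failback_started": start + timedelta(minutes=30),
    "failback_complete": start + timedelta(minutes=37),
}

report = drill_report(
    events,
    rto_target=timedelta(minutes=5),
    failback_target=timedelta(minutes=5),
)
for key, value in report.items():
    print(f"{key}: {value}")
```

In this made-up timeline the failover meets its target while the failback does not, which is exactly the kind of finding a drill should surface.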

Pro Tips

• Use the research agent to pull current Azure SLA numbers for your services

• The azure-diagnostics skill can analyse failures in real time during the drill

• Start with a single service failover before attempting full-stack DR

• Record the drill: the timeline is invaluable for the post-mortem

What You'll Learn

• Multi-region Azure architecture patterns

• Azure Front Door, Traffic Manager, and failover groups

• Chaos engineering principles and Azure Chaos Studio

• Incident response and runbook automation