Part 4 · Reliability and Security

Disaster Recovery

Planning for the worst case: fire, earthquake, region-wide outage.

In July 2022, a fire broke out at the data center of a major ISP in Bangladesh. Thousands of websites were down for hours. Companies that had prepared failed over to a backup region and stayed online; those that had not went down hard. This is why Disaster Recovery matters.

What is Disaster Recovery (DR)?

Disaster Recovery = a pre-planned strategy for restoring business continuity after a catastrophic failure (fire, flood, earthquake, region-wide outage, cyber attack).

Disaster Types

  • Natural: Earthquake, flood, hurricane.
  • Hardware: Server failure, disk crash.
  • Software: Bug, corrupt deployment.
  • Network: ISP outage, BGP misconfiguration.
  • Human: Accidental deletion, misconfiguration.
  • Cyber: Ransomware, DDoS, breach.
  • Power: Grid failure.

RPO and RTO: the two key metrics

RPO (Recovery Point Objective)

"How much data loss is acceptable?" Measured as the time between the last good backup and the disaster.

  • RPO 1 hour = at most 1 hour of data loss.
  • RPO 5 minutes = near-realtime backup.
  • RPO 0 = no data loss (synchronous replication).

RTO (Recovery Time Objective)

"How quickly must the service be back up?" Measured as the time from the disaster until operations resume.

  • RTO 24 hours = back up the next day.
  • RTO 1 hour = back up within 1 hour.
  • RTO 0 = instant (active-active).

[Last backup]──────[DISASTER]──────[Recovered]
      ←───── RPO ─────→←───── RTO ────→
          data lost        downtime
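
The timeline above can be turned into a small check. The sketch below is a minimal illustration, not a monitoring tool: the function names and the incident timestamps are hypothetical, chosen only to show how the actual data-loss window and downtime compare against the RPO and RTO targets.

```python
from datetime import datetime, timedelta


def measure_recovery(last_backup, disaster, recovered):
    """Return (data_loss_window, downtime) as timedeltas."""
    if not (last_backup <= disaster <= recovered):
        raise ValueError("expected last_backup <= disaster <= recovered")
    return disaster - last_backup, recovered - disaster


def meets_objectives(last_backup, disaster, recovered, rpo, rto):
    """Compare what actually happened against the RPO and RTO targets."""
    data_loss, downtime = measure_recovery(last_backup, disaster, recovered)
    return {
        "data_loss": data_loss,
        "downtime": downtime,
        "rpo_met": data_loss <= rpo,
        "rto_met": downtime <= rto,
    }


# Hypothetical incident: backup at 02:00, disaster at 02:45, recovered at 04:15.
result = meets_objectives(
    last_backup=datetime(2024, 1, 10, 2, 0),
    disaster=datetime(2024, 1, 10, 2, 45),
    recovered=datetime(2024, 1, 10, 4, 15),
    rpo=timedelta(hours=1),
    rto=timedelta(hours=1),
)
print(result["rpo_met"], result["rto_met"])  # True False
```

Here 45 minutes of data were lost, which is within the 1-hour RPO, but the 90 minutes of downtime miss the 1-hour RTO.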

DR Strategies

1. Backup & Restore (Cold)

  • Periodic backups to an off-site location or the cloud.
  • On disaster: provision new infrastructure, then restore the data.
  • RPO: hours to days. RTO: hours to days.
  • Cheapest option. Common for small and medium businesses (SMBs).

2. Pilot Light

  • Minimal infrastructure stays running at the DR site (for example, a DB replica).
  • On disaster: start and scale up the application servers.
  • RPO: minutes. RTO: tens of minutes.
  • Moderate cost.

3. Warm Standby

  • A scaled-down copy of production runs at the DR site.
  • On disaster: scale up and fail over.
  • RPO: seconds. RTO: minutes.
  • Higher cost.

4. Hot Standby / Active-Active

  • Every region carries the full production load.
  • On disaster: redirect traffic (DNS or load balancer).
  • RPO: 0 with synchronous replication, otherwise near zero. RTO: seconds.
  • Highest cost: about 2× the infrastructure.

Strategy Comparison

Backup & Restore

  • Cheapest
  • Hours-days RPO/RTO
  • Manual recovery
  • Small business

Pilot Light

  • Moderate cost
  • Tens of minutes RTO
  • Database replica + minimal infra
  • Medium business

Warm Standby

  • Higher cost
  • Minutes RTO
  • Scaled-down running copy
  • Critical apps

Active-Active

  • Highest cost
  • Seconds RTO, RPO 0
  • Full multi-region
  • Mission-critical
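
The comparison above amounts to picking the cheapest strategy that still meets the RPO and RTO targets. A minimal sketch of that decision follows. The numeric floors are illustrative assumptions that turn "hours", "minutes", and "seconds" into concrete values; they are not vendor guarantees, and `cheapest_strategy` is a hypothetical helper.

```python
from datetime import timedelta

# Ordered cheapest first: (name, best achievable RPO, best achievable RTO).
# The floors are assumed values that stand in for the ranges in the text.
STRATEGIES = [
    ("Backup & Restore", timedelta(hours=1), timedelta(hours=1)),
    ("Pilot Light", timedelta(minutes=1), timedelta(minutes=10)),
    ("Warm Standby", timedelta(seconds=1), timedelta(minutes=1)),
    ("Active-Active", timedelta(0), timedelta(0)),
]


def cheapest_strategy(rpo_target, rto_target):
    """Return the first (cheapest) strategy whose floors fit both targets."""
    for name, best_rpo, best_rto in STRATEGIES:
        if best_rpo <= rpo_target and best_rto <= rto_target:
            return name
    raise ValueError("targets must be non-negative durations")


print(cheapest_strategy(timedelta(hours=24), timedelta(hours=24)))
print(cheapest_strategy(timedelta(minutes=5), timedelta(minutes=30)))
print(cheapest_strategy(timedelta(seconds=30), timedelta(minutes=5)))
print(cheapest_strategy(timedelta(0), timedelta(seconds=5)))
```

With these assumed floors the four calls select Backup & Restore, Pilot Light, Warm Standby, and Active-Active respectively. In practice the floors should come from measured recovery drills rather than a table.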

Backup Strategies

3-2-1 Rule

  • 3 copies of the data.
  • 2 different media types.
  • 1 copy off-site.
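
The rule is easy to check mechanically against a backup inventory. The sketch below assumes the inventory is a plain list that includes the production copy; `BackupCopy` and `check_3_2_1` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCopy:
    name: str
    media: str      # e.g. "disk", "tape", "object-storage"
    offsite: bool   # True if stored away from the primary site


def check_3_2_1(copies):
    """Report which parts of the 3-2-1 rule the inventory satisfies."""
    return {
        "three_copies": len(copies) >= 3,
        "two_media": len({c.media for c in copies}) >= 2,
        "one_offsite": any(c.offsite for c in copies),
    }


inventory = [
    BackupCopy("production", "disk", offsite=False),
    BackupCopy("local-nas", "disk", offsite=False),
    BackupCopy("cloud-bucket", "object-storage", offsite=True),
]
report = check_3_2_1(inventory)
print(report, all(report.values()))
```

Removing the cloud copy from this inventory would fail all three checks at once, which is why the off-site copy is usually the first gap to close.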

Backup Types

  • Full: A complete copy of the data. Slow and large.
  • Incremental: Only the changes since the last backup of any type. Fast, but a restore depends on the whole chain.
  • Differential: All changes since the last full backup. A middle ground.
  • Snapshot: A point-in-time view (database, filesystem).
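
The practical difference between incremental and differential backups shows up at restore time, in how many backup sets must be applied. The following sketch computes that restore chain under simplified assumptions: backups are identified by a day number, and `Backup` and `restore_chain` are hypothetical names, not part of any backup tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backup:
    day: int
    kind: str  # "full", "incremental", or "differential"


def restore_chain(backups, target_day):
    """Return the backups, oldest first, needed to restore to target_day."""
    history = sorted((b for b in backups if b.day <= target_day),
                     key=lambda b: b.day)
    fulls = [b for b in history if b.kind == "full"]
    if not fulls:
        raise ValueError("no full backup at or before the target day")
    base = fulls[-1]
    later = [b for b in history if b.day > base.day]
    chain, start = [base], base.day
    # A differential already contains every change since the full backup,
    # so only the most recent one is needed.
    diffs = [b for b in later if b.kind == "differential"]
    if diffs:
        chain.append(diffs[-1])
        start = diffs[-1].day
    # Every incremental after that point must be applied, in order.
    chain.extend(b for b in later if b.kind == "incremental" and b.day > start)
    return chain


incremental_plan = [Backup(0, "full")] + [Backup(d, "incremental") for d in (1, 2, 3)]
differential_plan = [Backup(0, "full")] + [Backup(d, "differential") for d in (1, 2, 3)]
print([b.day for b in restore_chain(incremental_plan, 3)])   # [0, 1, 2, 3]
print([b.day for b in restore_chain(differential_plan, 3)])  # [0, 3]
```

The incremental plan needs four sets, and losing any one of them breaks the restore. The differential plan needs only two, at the price of larger daily backups.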

Best Practices

  • Encrypt backups.
  • Automate backups and test them.
  • Keep copies geographically separated.
  • Define a retention policy.
  • Test restoration: taking a backup is not enough.
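
One simple way to test restoration is to record a checksum when the backup is taken and compare it with the restored file. The sketch below uses plain file copies as stand-ins for the real backup and restore steps; the file names and helper functions are hypothetical.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path):
    """Hash a file in chunks so large backups do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_restore(expected_digest, restored_path):
    """True if the restored file matches the digest recorded at backup time."""
    return sha256_of(restored_path) == expected_digest


with tempfile.TemporaryDirectory() as tmp:
    workdir = Path(tmp)
    original = workdir / "orders.db"
    original.write_bytes(b"order-1\norder-2\n")

    backup = workdir / "orders.db.bak"
    shutil.copyfile(original, backup)        # stand-in for "take backup"
    recorded_digest = sha256_of(backup)      # stored alongside the backup

    restored = workdir / "restored.db"
    shutil.copyfile(backup, restored)        # stand-in for "restore"
    good_restore = verify_restore(recorded_digest, restored)

    restored.write_bytes(b"order-1\n")       # simulate a truncated restore
    bad_restore = verify_restore(recorded_digest, restored)

print(good_restore, bad_restore)  # True False
```

A checksum match only proves the bytes came back intact. A full restore test would also start the application against the restored data.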

Multi-Region Architecture

Active-Passive

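The primary region serves traffic; the secondary stays on standby. On failover, DNS or the load balancer switches traffic to the secondary.

Active-Active

Both regions handle traffic. Keeping state in sync across regions is the hard part.

Geo-routing

Each user is routed to the nearest region, which lowers latency.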

Failover Mechanisms

  • DNS-based: Route 53 health checks with failover routing.
  • BGP: Anycast IP, so routing shifts automatically.
  • Application-level: The code retries against the secondary.
  • Manual: Operator-triggered.
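
Of the mechanisms above, application-level failover is the one that lives in code. A minimal sketch follows, assuming each region is represented by a callable that raises `ConnectionError` when unreachable. The `primary` and `secondary` functions are hypothetical stand-ins for real network clients.

```python
def call_with_failover(endpoints, request, attempts_per_endpoint=2):
    """Try each endpoint in priority order; return (name, response)."""
    last_error = None
    for name, handler in endpoints:
        for _ in range(attempts_per_endpoint):
            try:
                return name, handler(request)
            except ConnectionError as exc:
                last_error = exc  # remember the failure, then retry or move on
    raise ConnectionError("all endpoints failed") from last_error


def primary(request):
    # Simulates a region that is down.
    raise ConnectionError("primary region unreachable")


def secondary(request):
    return f"ok:{request}"


served_by, response = call_with_failover(
    [("primary", primary), ("secondary", secondary)], "GET /health"
)
print(served_by, response)  # secondary ok:GET /health
```

A production client would usually add timeouts and backoff between attempts, and would retry only requests that are safe to repeat.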

DR Testing

An untested DR plan = no DR plan.

  • Tabletop exercise: A discussion-based run through a scenario.
  • Walkthrough: Verify each step of the plan.
  • Simulation: Run the plan in a test environment.
  • Game day: A controlled disaster in production (for example, Netflix Chaos Monkey).
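
The idea behind a game day, deliberately injecting failures to see whether the system copes, can be tried at small scale in code. The sketch below is a generic fault-injection wrapper, not how Chaos Monkey itself works; `with_fault_injection`, `InjectedFailure`, and `lookup_order` are hypothetical names.

```python
import random


class InjectedFailure(Exception):
    """Raised by the wrapper to simulate a dependency outage."""


def with_fault_injection(func, failure_rate, rng=None):
    """Wrap func so that a fraction of calls fail before reaching it."""
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be between 0 and 1")
    rng = rng or random.Random()

    def wrapper(*args, **kwargs):
        # rng.random() is in [0.0, 1.0), so 0.0 never fails and 1.0 always does.
        if rng.random() < failure_rate:
            raise InjectedFailure(f"injected failure in {func.__name__}")
        return func(*args, **kwargs)

    return wrapper


def lookup_order(order_id):
    return {"id": order_id, "status": "shipped"}


flaky_lookup = with_fault_injection(lookup_order, failure_rate=0.3,
                                    rng=random.Random(42))
failures = 0
for order_id in range(100):
    try:
        flaky_lookup(order_id)
    except InjectedFailure:
        failures += 1
print(f"{failures} of 100 calls failed by injection")
```

Running callers against `flaky_lookup` in a test environment shows whether their retry and fallback paths actually work before a real outage tests them.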

Real-World Examples

  • Netflix Chaos Engineering: Injects random failures in production to verify resilience.
  • AWS Multi-Region: Active-active across regions such as us-east and us-west.
  • Banks: Multiple data centers are often required by regulation.
  • Cloudflare: Global anycast, so a region failure is largely invisible to users.
  • 2017 AWS S3 outage: Many services went down; multi-region designs became far more common afterward.

Business Continuity Plan (BCP)

DR = technical recovery. BCP = the broader plan covering people, communication, customer notification, and regulatory reporting.

  • Communication tree.
  • Status page updates.
  • Customer notification.
  • Regulatory reporting.
  • Post-mortem.

Common Misconceptions

  1. "Backup = DR": A backup is only the data; DR is the whole recovery process.
  2. "The cloud is automatically disaster-proof": Region outages do happen; critical services need multi-region.
  3. "Set up once, good forever": The plan needs quarterly testing and updates.
  4. "RPO 0 is always the goal": Synchronous replication comes at a massive cost.

Best Practices

  • Define RPO/RTO per service, based on criticality.
  • Follow the 3-2-1 backup rule.
  • Test restoration quarterly.
  • Use multi-region for critical services.
  • Keep a documented runbook.
  • Practice chaos engineering as proactive testing.
  • Have a communication plan, including a status page.
  • Review insurance and legal aspects.

📌 Chapter Summary

  • DR = business continuity after a catastrophic failure.
  • RPO = data loss tolerance; RTO = downtime tolerance.
  • Strategies: Backup → Pilot Light → Warm → Active-Active.
  • 3-2-1 backup rule.
  • An untested plan = no plan; practice chaos engineering.