Disaster recovery

A disaster-recovery program that runs every day

A real DR program is not a binder. It is a rhythm of detection, containment, recovery, communication, and learning that runs whether or not anything is going wrong. This is ours.

The DR loop

Five phases, repeating

A reliability program is a closed loop. Skipping a phase is how outages get longer.

Phase 1

Detect

Synthetic health checks from three U.S. regions hit every Worker every 60 seconds. An anomaly detector reviews the audit log hourly. Either signal can open an incident automatically.
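As an illustration of how the two signals could be combined, here is a minimal sketch. It assumes that a Worker is flagged when probes from at least two regions fail in the same 60-second round; that quorum, the region labels, and the names `ProbeResult`, `workers_to_page`, and `should_open_incident` are hypothetical and are not a description of our production detector.

```python
from dataclasses import dataclass

FAILURE_QUORUM = 2  # assumption: failures from 2 of 3 regions flag a Worker


@dataclass(frozen=True)
class ProbeResult:
    region: str   # hypothetical label such as "us-east"
    worker: str
    ok: bool


def workers_to_page(results: list[ProbeResult]) -> set[str]:
    """Return Workers whose failed probes in one round reach the quorum."""
    failures: dict[str, set[str]] = {}
    for r in results:
        if not r.ok:
            failures.setdefault(r.worker, set()).add(r.region)
    return {w for w, regions in failures.items() if len(regions) >= FAILURE_QUORUM}


def should_open_incident(results: list[ProbeResult], audit_anomaly: bool) -> bool:
    """Either signal alone is enough to open an incident."""
    return audit_anomaly or bool(workers_to_page(results))
```

Requiring agreement between regions is one common way to keep a single flaky probe from opening an incident; a different quorum would change only the constant.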

Phase 2

Contain

Per-tenant feature flags let us disable a misbehaving feature for one customer without redeploying. Per-provider circuit breakers route around failing dependencies in seconds.
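The circuit-breaker idea can be sketched as follows. This is a generic illustration, not our implementation: the failure threshold, the cooldown, and the names `CircuitBreaker` and `pick_provider` are assumptions made for the example.

```python
import time


class CircuitBreaker:
    """Opens after consecutive failures, then allows a trial call after a cooldown."""

    def __init__(self, failure_threshold=3, cooldown_seconds=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # assumed value
        self.cooldown_seconds = cooldown_seconds    # assumed value
        self.clock = clock
        self.consecutive_failures = 0
        self.opened_at = None  # None means the breaker is closed

    def allow_request(self):
        if self.opened_at is None:
            return True
        # Half-open: permit a trial request once the cooldown has elapsed.
        return self.clock() - self.opened_at >= self.cooldown_seconds

    def record_success(self):
        self.consecutive_failures = 0
        self.opened_at = None

    def record_failure(self):
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            self.opened_at = self.clock()


def pick_provider(breakers):
    """Return the first provider, in priority order, whose breaker allows traffic."""
    for name, breaker in breakers.items():
        if breaker.allow_request():
            return name
    return None
```

With one breaker per provider, a failing dependency stops receiving traffic as soon as its breaker opens, and traffic returns to it only after a trial request succeeds.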

Phase 3

Recover

D1 Time Travel restores any tenant database to any point in the last 30 days. A break-glass admin tool restores a single tenant in under five minutes without operator scripting.
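Before a restore is requested, the chosen moment has to fall inside the 30-day window. A minimal sketch of that check, assuming timezone-aware timestamps; the function name `validate_restore_point` is hypothetical and is not part of D1 or of our admin tool:

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=30)  # the 30-day window stated above


def validate_restore_point(requested: datetime, now: datetime) -> datetime:
    """Return the restore point in UTC, or raise if it is outside the window."""
    if requested.tzinfo is None or now.tzinfo is None:
        raise ValueError("timestamps must be timezone-aware")
    requested = requested.astimezone(timezone.utc)
    now = now.astimezone(timezone.utc)
    if requested > now:
        raise ValueError("restore point is in the future")
    if now - requested > RETENTION:
        raise ValueError("restore point is older than the 30-day window")
    return requested
```

Rejecting naive timestamps up front avoids restoring to the wrong moment when an operator works in local time.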

Phase 4

Communicate

Incidents post updates to the public status page in real time. RSS, JSON, and email subscriptions are available, and customers can subscribe per component.

Phase 5

Learn

Every incident with customer-visible impact of fifteen minutes or more receives a public post-mortem within five business days. Remediation items are tracked and linked.
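The post-mortem rule is simple enough to express directly. The sketch below assumes that business days are Monday through Friday, that public holidays are ignored, and that the day of resolution is not counted; those details and the function names are assumptions for illustration.

```python
from datetime import date, timedelta

IMPACT_THRESHOLD_MINUTES = 15   # from the rule above
BUSINESS_DAYS_TO_PUBLISH = 5    # from the rule above


def needs_public_postmortem(customer_impact_minutes: float) -> bool:
    return customer_impact_minutes >= IMPACT_THRESHOLD_MINUTES


def postmortem_due(resolved_on: date) -> date:
    """Count five business days forward, skipping Saturdays and Sundays."""
    day = resolved_on
    remaining = BUSINESS_DAYS_TO_PUBLISH
    while remaining > 0:
        day += timedelta(days=1)
        if day.weekday() < 5:  # Monday is 0, Friday is 4
            remaining -= 1
    return day
```

Under these assumptions, an incident resolved on Wednesday 5 June 2024 would have its post-mortem due by Wednesday 12 June 2024.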

RPO / RTO by data class

Different data, different recovery profile

Authentication recovers before everything else, because nothing works without it. The audit log recovers second, because forensic continuity must not be lost.

Data class             RPO target   RTO target
Authentication         ≤ 30s        ≤ 2m
Tenant operations      ≤ 60s        ≤ 5m
Business records       ≤ 60s        ≤ 5m
Communications         ≤ 60s        ≤ 10m
Audit log              ≤ 5s         ≤ 5m
Customer files (R2)    ≤ 60s        Continuous

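The table above can be kept as data and used to score a drill or an incident. This is a minimal sketch: the names `RecoveryTarget`, `TARGETS`, and `met_targets` are hypothetical, and treating a "Continuous" RTO as having no downtime limit to check is an assumption made only for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RecoveryTarget:
    rpo_seconds: int
    rto_seconds: Optional[int]  # None stands for "Continuous" in the table


TARGETS = {
    "Authentication": RecoveryTarget(30, 120),
    "Tenant operations": RecoveryTarget(60, 300),
    "Business records": RecoveryTarget(60, 300),
    "Communications": RecoveryTarget(60, 600),
    "Audit log": RecoveryTarget(5, 300),
    "Customer files (R2)": RecoveryTarget(60, None),
}


def met_targets(data_class: str, data_loss_seconds: float, downtime_seconds: float) -> bool:
    """Compare one measured recovery against the targets for its data class."""
    target = TARGETS[data_class]
    rpo_ok = data_loss_seconds <= target.rpo_seconds
    # Assumption: "Continuous" means there is no downtime limit to compare against.
    rto_ok = target.rto_seconds is None or downtime_seconds <= target.rto_seconds
    return rpo_ok and rto_ok
```

For example, an audit-log recovery that lost 8 seconds of entries would miss its 5-second RPO even if it finished well inside the 5-minute RTO.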
Tooling

What we actually use during an incident

Point-in-time restore

Any tenant database, any moment in the last 30 days, with one click in the break-glass admin tool.

Queue replay

Dead-letter queues capture failed messages with full payloads. One click replays a checkpoint or a single message.
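A minimal sketch of the two replay modes, assuming that "replaying a checkpoint" means replaying every dead-lettered message at or after a given position. The `sequence` field, the `DeadLetter` type, and `select_for_replay` are hypothetical names for illustration.

```python
from dataclasses import dataclass


@dataclass
class DeadLetter:
    sequence: int     # hypothetical position of the message in the dead-letter queue
    message_id: str
    payload: dict     # full payload, as captured at failure time


def select_for_replay(dlq, message_id=None, from_checkpoint=None):
    """Pick messages to replay: one by id, or everything at or after a checkpoint."""
    if (message_id is None) == (from_checkpoint is None):
        raise ValueError("pass exactly one of message_id or from_checkpoint")
    if message_id is not None:
        return [m for m in dlq if m.message_id == message_id]
    # Replay in original order so downstream consumers see a consistent sequence.
    return sorted(
        (m for m in dlq if m.sequence >= from_checkpoint),
        key=lambda m: m.sequence,
    )
```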

Deploy rollback

Versioned deploys let us pin traffic back to the previous working version of any Worker in seconds.
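Choosing the version to pin back to can be sketched as a search through deploy history. The `healthy` flag and the names `Deploy` and `rollback_target` are assumptions for this example; how a version is actually judged "working" is not specified here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deploy:
    version: str
    healthy: bool  # hypothetical flag recorded by post-deploy checks


def rollback_target(history, current_version):
    """history is ordered oldest to newest; return the newest healthy version before current."""
    versions = [d.version for d in history]
    idx = versions.index(current_version)  # raises ValueError if the version is unknown
    for deploy in reversed(history[:idx]):
        if deploy.healthy:
            return deploy.version
    return None  # nothing safe to roll back to
```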

Tenant freeze

A super-admin can freeze a tenant during an investigation, preventing new writes while the audit log is preserved.
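A freeze behaves like a write guard in front of the tenant's data. The sketch below uses an in-memory stand-in rather than a real database; recording rejected writes in the audit log is an assumption of this example, and `TenantStore` and `TenantFrozenError` are hypothetical names.

```python
class TenantFrozenError(Exception):
    pass


class TenantStore:
    """In-memory stand-in for a tenant database plus its audit log."""

    def __init__(self):
        self.frozen = set()
        self.records = {}
        self.audit_log = []

    def freeze(self, tenant_id, admin):
        self.frozen.add(tenant_id)
        self.audit_log.append(("freeze", tenant_id, admin))

    def write(self, tenant_id, key, value):
        if tenant_id in self.frozen:
            # Assumption: rejected writes are audited so the investigation record stays complete.
            self.audit_log.append(("write_rejected", tenant_id, key))
            raise TenantFrozenError(tenant_id)
        self.records[(tenant_id, key)] = value
        self.audit_log.append(("write", tenant_id, key))

    def read(self, tenant_id, key):
        # Reads stay available while a tenant is frozen.
        return self.records.get((tenant_id, key))
```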

Per-tenant feature flags

Disable a misbehaving feature for one customer or globally without redeploying. The flag table lives in the auth database, the most resilient of the five databases.
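Flag resolution with both a global switch and a per-tenant switch might look like the sketch below. The in-memory dictionary stands in for the flag table, and the key layout and the name `is_enabled` are assumptions for illustration.

```python
def is_enabled(flags, feature, tenant_id):
    """A feature runs only if it is on globally and not switched off for this tenant.

    flags maps (feature, tenant_id) to a bool; a tenant_id of None holds the global switch.
    """
    globally_on = flags.get((feature, None), False)   # unknown features default to off
    tenant_on = flags.get((feature, tenant_id), True)  # tenants inherit unless overridden
    return globally_on and tenant_on
```

With this layout, writing one row disables a feature for a single customer, and flipping the global row disables it for everyone, with no deploy in either case.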

Status incident workflow

Open, update, and resolve incidents from the admin console. The public status page reflects changes in real time.
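The open, update, resolve lifecycle can be modeled as a small state machine. This is a sketch only: the `Incident` class, its fields, and the shape returned by `public_view` are hypothetical and do not describe the admin console's data model.

```python
from dataclasses import dataclass, field


@dataclass
class Incident:
    title: str
    component: str
    status: str = "open"
    updates: list = field(default_factory=list)

    def update(self, message):
        if self.status == "resolved":
            raise ValueError("cannot update a resolved incident")
        self.updates.append(message)

    def resolve(self, message):
        # The resolution note is recorded as the final update.
        self.update(message)
        self.status = "resolved"

    def public_view(self):
        """What a status page could render: status plus the latest update."""
        latest = self.updates[-1] if self.updates else None
        return {
            "title": self.title,
            "component": self.component,
            "status": self.status,
            "latest_update": latest,
        }
```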

Targets

The numbers behind the program

These targets come from the same source of truth that drives the marketing site, the runbooks, and the synthetic alerts.

Metric                            Target
Uptime                            99.99%
Read latency (p95, US)            < 50 ms
Write latency (p95, US)           < 150 ms
Recovery Point Objective (RPO)    ≤ 60 seconds
Recovery Time Objective (RTO)     ≤ 5 minutes
Hot backup retention              30 days
Cold archive retention            7 years
Backup durability                 11 nines
US edge presence                  30+ POPs
DDoS / WAF                        Always on
Encryption                        AES-256 + TLS 1.3
Admin MFA                         Mandatory
DR drills                         Quarterly, public
Public post-mortems               ≥ 15 minutes impact

Targets describe the engineering posture of the CloudIP platform during the current development phase. They are stated as engineering goals rather than as a contractual service-level agreement. Customers requiring binding SLAs, custom RPO/RTO guarantees, dedicated infrastructure, or cross-cloud cold backups should contact CloudIP Professional Services for a custom engagement.
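To make the uptime target concrete, the arithmetic below converts a percentage into an allowed-downtime budget. The function name is hypothetical and the 30-day and 365-day periods are chosen only for illustration.

```python
def downtime_budget_minutes(uptime_percent: float, period_days: float) -> float:
    """Minutes of downtime permitted by an uptime target over a period."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - uptime_percent / 100)


monthly = downtime_budget_minutes(99.99, 30)    # about 4.32 minutes
yearly = downtime_budget_minutes(99.99, 365)    # about 52.56 minutes
```

A 99.99% target therefore leaves roughly 4.3 minutes of downtime in a 30-day month and roughly 52.6 minutes in a year.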
FAQ

Common questions about disaster recovery

In a worst-case region failure, the maximum amount of data potentially unrecoverable is the work performed in the last 60 seconds. For most data classes, the observed window is far smaller because replication runs continuously.

Want to participate in our next DR drill?

Customers on enterprise plans can opt into hands-on participation in a quarterly disaster-recovery drill, run jointly with CloudIP engineering.