What does RTO 5 minutes mean in practice?

For a critical-path incident in production, the engineering target for service restoration is five minutes from detection. The break-glass admin tool is designed so a single super-admin can perform a tenant restore in under five minutes without writing scripts.

How often do you actually run a drill?

A scripted disaster-recovery drill is run quarterly. The drill restores a full sandbox tenant from cold backup, validates application behavior, measures observed RTO and RPO, and publishes the results in the changelog.

What about a Cloudflare-wide outage?

Cross-cloud cold backup to a non-Cloudflare provider (such as Backblaze B2 or AWS Glacier) is available as a CloudIP Professional Services engagement. It is not enabled by default because it adds operational complexity and cost; customers requiring this level of provider redundancy can request it.

Can a customer trigger a restore themselves?

Yes. Tenant administrators can request a point-in-time restore from the in-app reliability dashboard for any moment in the last 30 days. Restores that require older data, partial-table replay, or queue replay are performed by CloudIP support using the break-glass tooling.
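
As an illustration of the split described above, the following minimal sketch routes a restore request either to the self-service dashboard or to CloudIP support. The function name, the `needs_replay` parameter, and the return labels are hypothetical and are not part of the CloudIP product; only the 30-day window comes from the text.

```python
from datetime import datetime, timedelta, timezone

# Self-service restore window stated in the text.
SELF_SERVICE_WINDOW = timedelta(days=30)

def route_restore_request(target, now, needs_replay=False):
    """Decide who handles a restore (hypothetical helper).

    target       -- timezone-aware moment the tenant wants to restore to
    now          -- timezone-aware current time
    needs_replay -- True for partial-table replay or queue replay
    """
    if target > now:
        raise ValueError("restore target is in the future")
    # Older data and any kind of replay go through support.
    if needs_replay or now - target > SELF_SERVICE_WINDOW:
        return "support"
    return "self-service"

now = datetime(2024, 6, 30, tzinfo=timezone.utc)
print(route_restore_request(datetime(2024, 6, 15, tzinfo=timezone.utc), now))
print(route_restore_request(datetime(2024, 5, 1, tzinfo=timezone.utc), now))
```

With these inputs the first request falls inside the window and the second does not, so they are routed differently.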

Disaster recovery

A disaster-recovery program that runs every day

What does RPO 60 seconds mean in practice?

In a worst-case region failure, the maximum amount of data potentially unrecoverable is the work performed in the last 60 seconds. For most data classes, the observed window is far smaller because replication runs continuously.

A real DR program is not a binder. It is a rhythm of detection, containment, recovery, communication, and learning that runs whether or not anything is going wrong. This is ours.

The DR loop

Five phases, repeating

A reliability program is a closed loop. Skipping a phase is how outages get longer.

Phase 1

Detect

Synthetic health checks from three U.S. regions probe every Worker every 60 seconds. An anomaly detector scans the audit log every hour. Either signal can open an incident automatically.
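
To make the detection step concrete, here is a minimal sketch of how probe results from several regions might be turned into an escalation decision. The `ProbeResult` type, the region labels, and the rule that a Worker must fail in at least two regions are assumptions made for illustration; the text does not state the actual threshold.

```python
from dataclasses import dataclass

@dataclass
class ProbeResult:
    region: str    # hypothetical region label, e.g. "us-east"
    worker: str    # name of the Worker that was probed
    healthy: bool  # True when the probe received the expected response

def workers_to_escalate(results, min_failing_regions=2):
    """Return Workers that failed in at least min_failing_regions regions.

    Requiring more than one region is an assumed noise filter so that a
    single flaky vantage point does not open an incident on its own.
    """
    failing = {}
    for r in results:
        if not r.healthy:
            failing.setdefault(r.worker, set()).add(r.region)
    return sorted(w for w, regions in failing.items()
                  if len(regions) >= min_failing_regions)

results = [
    ProbeResult("us-east", "api", False),
    ProbeResult("us-west", "api", False),
    ProbeResult("us-central", "api", True),
    ProbeResult("us-east", "auth", False),
    ProbeResult("us-west", "auth", True),
    ProbeResult("us-central", "auth", True),
]
print(workers_to_escalate(results))
```

Here only `api` is escalated, because `auth` failed from a single region.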

Phase 2

Contain

Per-tenant feature flags let us disable a misbehaving feature for one customer without redeploying. Per-provider circuit breakers route around failing dependencies in seconds.
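
A per-provider circuit breaker can be sketched in a few lines. This is a generic version of the pattern, not CloudIP's implementation; the failure threshold, the cool-down period, and the provider names are assumed values.

```python
import time

class CircuitBreaker:
    """Generic circuit breaker: opens after repeated failures, retries after a cool-down."""

    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # assumed value
        self.reset_after_s = reset_after_s          # assumed value
        self.clock = clock                          # injectable for testing
        self.failures = 0
        self.opened_at = None                       # None means the circuit is closed

    def allow(self):
        """Return True if a call to the provider may be attempted."""
        if self.opened_at is None:
            return True
        # After the cool-down, allow a trial call (half-open state).
        return self.clock() - self.opened_at >= self.reset_after_s

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()

def pick_provider(providers, breakers):
    """Return the first provider whose breaker allows traffic, or None."""
    for name in providers:
        if breakers[name].allow():
            return name
    return None

breakers = {"primary": CircuitBreaker(), "fallback": CircuitBreaker()}
for _ in range(3):
    breakers["primary"].record_failure()
print(pick_provider(["primary", "fallback"], breakers))
```

After three recorded failures the primary breaker is open, so traffic is routed to the fallback until the cool-down elapses.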

Phase 3

Recover

D1 Time Travel restores any tenant database to any point in the last 30 days. A break-glass admin tool restores a single tenant in under five minutes without operator scripting.

Phase 4

Communicate

Incident updates are posted to the public status page in real time. RSS, JSON, and email subscriptions are available, and customers can subscribe per component.

Phase 5

Learn

Every incident with customer-visible impact of fifteen minutes or more receives a public post-mortem within five business days. Remediation items are tracked and linked.
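
The post-mortem rule can be expressed as a small calculation. This sketch assumes the five business days are counted from the day the incident is resolved and that only weekends are skipped; the text does not specify the starting point or how holidays are handled.

```python
from datetime import date, timedelta

def needs_public_postmortem(impact_minutes):
    """Customer-visible impact of fifteen minutes or more triggers a public post-mortem."""
    return impact_minutes >= 15

def postmortem_due(resolved_on, business_days=5):
    """Return the date that is `business_days` weekdays after `resolved_on`.

    Assumption: weekends are skipped, public holidays are not considered.
    """
    d = resolved_on
    remaining = business_days
    while remaining > 0:
        d += timedelta(days=1)
        if d.weekday() < 5:  # Monday=0 ... Friday=4
            remaining -= 1
    return d

# An incident resolved on Wednesday 2024-06-05 is due the following Wednesday.
print(postmortem_due(date(2024, 6, 5)))  # 2024-06-12
```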

RPO / RTO by data class

Different data, different recovery profile

Authentication recovers before everything else, because nothing works without it. The audit log recovers second, because forensic continuity must not be lost.

Data class	RPO target	RTO target	Notes
Authentication	≤ 30s	≤ 2m	Sessions are stored in KV with global replication. The auth database is the smallest of the five and recovers quickly.
Tenant operations	≤ 60s	≤ 5m	Companies, users, plans, and billing settings. Restore order: brought online after authentication and the audit log.
Business records	≤ 60s	≤ 5m	Accounting, CRM, HR, e-commerce, and POS data. The largest of the five databases; carries the bulk of customer state.
Communications	≤ 60s	≤ 10m	Call history, voicemails, transcripts. Real-time delivery resumes within 60s of restore.
Audit log	≤ 5s	≤ 5m	Append-only, replicated continuously. Restored immediately after authentication and ahead of the remaining databases to preserve forensic continuity.
Customer files (R2)	≤ 60s	Continuous	Object storage with cross-region replication and 11 nines of durability. Object-lock retention is enforced at the storage layer.
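
One way to use the table above is to compare the RPO and RTO observed in a drill against the targets for each data class. The sketch below encodes the database targets in seconds; the function name and the sample observations are made up for illustration, and customer files are left out because their RTO is listed as continuous rather than as a duration.

```python
# (RPO seconds, RTO seconds) per data class, taken from the table above.
TARGETS = {
    "Authentication": (30, 120),
    "Tenant operations": (60, 300),
    "Business records": (60, 300),
    "Communications": (60, 600),
    "Audit log": (5, 300),
}

def missed_targets(observed):
    """Return (data class, metric) pairs where the observed value exceeds the target.

    observed -- dict of data class -> (observed RPO seconds, observed RTO seconds)
    """
    misses = []
    for name, (rpo, rto) in observed.items():
        rpo_target, rto_target = TARGETS[name]
        if rpo > rpo_target:
            misses.append((name, "RPO"))
        if rto > rto_target:
            misses.append((name, "RTO"))
    return misses

# Hypothetical drill measurements, not real results.
drill = {
    "Authentication": (12, 95),
    "Communications": (40, 660),
    "Audit log": (6, 200),
}
print(missed_targets(drill))  # [('Communications', 'RTO'), ('Audit log', 'RPO')]
```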

Tooling

What we actually use during an incident

Point-in-time restore

Any tenant database, any moment in the last 30 days, with one click in the break-glass admin tool.

Queue replay

Dead-letter queues capture failed messages with full payloads. One click replays a checkpoint or a single message.

Deploy rollback

Versioned deploys let us pin traffic back to the previous working version of any Worker in seconds.

Tenant freeze

A super-admin can freeze a tenant during an investigation, preventing new writes while the audit log is preserved.

Per-tenant feature flags

Disable a misbehaving feature for one customer or globally without redeploying. The flag table lives in the auth database, the most resilient of the five.
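
The lookup behind such a flag can be very small. This sketch assumes a global set of disabled flags plus a per-tenant set; the data shapes, flag names, and tenant identifiers are hypothetical and do not describe the actual flag table.

```python
def feature_enabled(flag, tenant_id, globally_disabled, tenant_disabled):
    """Return True unless the flag is switched off globally or for this tenant."""
    if flag in globally_disabled:
        return False
    return flag not in tenant_disabled.get(tenant_id, set())

# Hypothetical flag state: one global kill switch and one per-tenant override.
globally_disabled = {"bulk-export"}
tenant_disabled = {"tenant-42": {"voicemail-transcripts"}}

print(feature_enabled("voicemail-transcripts", "tenant-42", globally_disabled, tenant_disabled))  # False
print(feature_enabled("voicemail-transcripts", "tenant-7", globally_disabled, tenant_disabled))   # True
print(feature_enabled("bulk-export", "tenant-7", globally_disabled, tenant_disabled))             # False
```

Because the check is a plain data lookup, changing a flag takes effect on the next request without a redeploy.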

Status incident workflow

Open, update, and resolve incidents from the admin console. The public status page reflects changes in real time.

Targets

The numbers behind the program

The same source of truth that drives the marketing site, the runbooks, and the synthetic alerts.

Metric	Target	What it means
Uptime	99.99%	Target uptime across the platform. Roughly 4.4 minutes of monthly downtime budget.
Read latency (p95, US)	< 50 ms	Read p95 measured coast-to-coast within the United States, served from the nearest edge replica.
Write latency (p95, US)	< 150 ms	Write p95 measured against the regional primary database, including queue acknowledgements.
Recovery Point Objective (RPO)	≤ 60 seconds	Maximum window of data potentially lost in a worst-case region failure for critical tables.
Recovery Time Objective (RTO)	≤ 5 minutes	Maximum time to restore the service for a regional incident affecting a critical workload.
Hot backup retention	30 days	Time-Travel-style point-in-time restore window covering every tenant database.
Cold archive retention	7 years	Object-locked archive in geographically redundant storage with retention enforced at the storage layer.
Backup durability	11 nines	Backups are stored on R2 with eleven nines of annual durability, replicated cross-region.
US edge presence	30+ POPs	Compute and cache run on Cloudflare’s anycast network with more than thirty US points of presence.
DDoS / WAF	Always on	Layer 3, 4, and 7 protection plus a managed WAF rule pack are enabled for every customer.
Encryption	AES-256 + TLS 1.3	Customer data is encrypted at rest with AES-256 and in transit with TLS 1.3 on every connection.
Admin MFA	Mandatory	Administrators must enroll TOTP or WebAuthn before performing privileged operations.
DR drills	Quarterly, public	A scripted disaster-recovery game day is run every quarter and the results are posted publicly.
Public post-mortems	≥ 15 minutes impact	Any incident with customer-visible impact of fifteen minutes or more is documented in a public post-mortem.
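
The downtime budget quoted for the uptime target follows directly from the percentage. A short worked calculation, assuming an average month of 365.25 / 12 days:

```python
def downtime_budget_minutes(uptime_pct, days=365.25 / 12):
    """Minutes of allowed downtime for a given uptime percentage over `days` days."""
    return (1 - uptime_pct / 100) * days * 24 * 60

print(round(downtime_budget_minutes(99.99), 1))           # 4.4 (average month)
print(round(downtime_budget_minutes(99.99, days=30), 2))  # 4.32 (30-day month)
```

The exact figure depends on the month length used, which is why the table states the budget as approximate.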

Targets describe the engineering posture of the CloudIP platform during the current development phase. They are stated as engineering goals rather than as a contractual service-level agreement. Customers requiring binding SLAs, custom RPO/RTO guarantees, dedicated infrastructure, or cross-cloud cold backups should contact CloudIP Professional Services for a custom engagement.

FAQ

Common questions about disaster recovery

Want to participate in our next DR drill?

Customers on enterprise plans can opt into hands-on participation in a quarterly disaster-recovery drill, run jointly with CloudIP engineering.

See past drill reports and post-mortems in the changelog