What Cloudflare’s “Fail Small” program gets right about resilience work
The interesting part is not the apology. It is the shift in rollout mechanics, fallback behavior, emergency access, and enforcement.
Most outage writeups are useful only up to a point. They explain what broke, apologize, and promise improvement. The harder question is what actually changed in the system afterward.
That is what makes Cloudflare’s recent “Code Orange: Fail Small” writeup worth reading. The article is not just a retrospective about reliability culture. It describes concrete changes Cloudflare says it made after its global outages on November 18, 2025, and December 5, 2025: safer configuration rollout, smaller failure radius, broader break-glass access, drilled incident response, and a rules system meant to stop old mistakes from sneaking back in.
For teams building products and platforms, the lesson is not “be more serious about reliability.” The lesson is much more specific: move safety into the mechanics of change, not just the language of postmortems.
1) The strongest move is treating configuration like code during rollout
Cloudflare says that, in most cases, internal configuration changes no longer hit the network instantly. Instead, those changes are rolled out progressively with real-time health monitoring. In the same writeup, Cloudflare describes Snapstone as the internal system it built to bring progressive rollout, health mediation, and automated rollback to configuration changes by default.
That matters because many teams still treat configuration as the “safe” part of the system. Code goes through staged rollout, canaries, and rollback plans. Configuration often does not. But many of the highest-risk production changes are configuration changes: feature flags, policy rules, classifiers, routing switches, pricing toggles, and traffic-shaping controls.
The useful takeaway here is not that every company needs Snapstone. It is that configuration deserves the same deployment discipline as software. If a change can affect customer traffic, it should have:
- a defined release path,
- real health signals,
- an automated stop condition,
- and a fast path back to the last known good state.
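The four properties above can be sketched in a few lines. This is a minimal illustration, not a description of Snapstone or any Cloudflare system: the stage names, the `apply_config` and `is_healthy` callbacks, and the `max_features` field are all hypothetical stand-ins for whatever a team already uses to push config and read health.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class RolloutResult:
    completed: bool
    stages_applied: List[str]
    reason: str


def progressive_rollout(
    stages: List[str],
    apply_config: Callable[[str, Dict], None],
    is_healthy: Callable[[str], bool],
    new_config: Dict,
    last_known_good: Dict,
) -> RolloutResult:
    """Apply new_config one stage at a time; revert touched stages on bad health."""
    applied: List[str] = []
    for stage in stages:  # defined release path
        apply_config(stage, new_config)
        applied.append(stage)
        if not is_healthy(stage):  # real health signal, automated stop condition
            for touched in reversed(applied):  # fast path back
                apply_config(touched, last_known_good)
            return RolloutResult(False, applied, f"health check failed at {stage}")
    return RolloutResult(True, applied, "all stages healthy")


# Hypothetical usage: an in-memory "fleet" and a health check that
# rejects configs with too many features.
state: Dict[str, Dict] = {}


def apply_config(stage: str, cfg: Dict) -> None:
    state[stage] = cfg


def is_healthy(stage: str) -> bool:
    return state[stage].get("max_features", 0) <= 200


good = {"max_features": 100}
bad = {"max_features": 500}
result = progressive_rollout(
    ["canary", "region-a", "global"], apply_config, is_healthy, bad, good
)
```

In this run the bad config reaches only the `canary` stage, is reverted there, and never touches `region-a` or `global`. A production version would also wait for a soak period before reading health, which this sketch leaves out.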
Cloudflare’s older writeup on Health Mediated Deployments makes this even clearer. It describes a rollout system that continues, pauses, or automatically reverts changes based on service health signals rather than operator intuition alone. That is the real upgrade: safety stops being a hero move and becomes part of the release mechanism.
2) Good resilience work shrinks blast radius before it tries to eliminate failure
Cloudflare also says it reviewed critical product failure modes, removed non-essential runtime dependencies, and adopted fail-stale, fail-open, or fail-close behavior depending on the service and the scenario.
That is a more mature frame than “prevent every outage.” Real systems fail. The better question is what happens next.
In the example Cloudflare gives, a future bad Bot Management configuration should no longer become an immediate global event. If the new data cannot be read, the system should refuse the new configuration and keep using the old one. If the old configuration is unavailable, the system should fail open so customer traffic can still be served. Cloudflare also says the same class of failure should now be caught during early deployment stages before it reaches more than a small percentage of traffic.
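The fallback order in that example (refuse bad data, keep the old config, fail open if there is none) might look roughly like the sketch below. It assumes a JSON config with a `block_below` score threshold and a `features` list, plus a `MAX_FEATURES` limit; all of these are invented for illustration and are not Cloudflare’s actual format. Failing open is chosen here only because it matches the Bot Management example; other services may need to fail closed.

```python
import json
from typing import Callable, Optional

MAX_FEATURES = 200  # hypothetical limit on config size


def validate(cfg: object) -> None:
    """Raise ValueError if the candidate config is not safe to activate."""
    if not isinstance(cfg, dict):
        raise ValueError("config must be a JSON object")
    if not isinstance(cfg.get("block_below"), int):
        raise ValueError("block_below must be an integer")
    if len(cfg.get("features", [])) > MAX_FEATURES:
        raise ValueError("too many features")


class ConfigHolder:
    def __init__(self, validator: Callable[[object], None]) -> None:
        self._validate = validator
        self._active: Optional[dict] = None  # last known good config

    def try_update(self, raw: str) -> bool:
        """Activate a new config only if it parses and validates."""
        try:
            candidate = json.loads(raw)
            self._validate(candidate)
        except (ValueError, TypeError):
            return False  # fail stale: keep whatever was active before
        self._active = candidate
        return True

    def should_block(self, request_score: int) -> bool:
        if self._active is None:
            return False  # fail open: no usable config, so serve the traffic
        return request_score < self._active["block_below"]


holder = ConfigHolder(validate)
before_any_config = holder.should_block(5)
accepted = holder.try_update('{"block_below": 30, "features": ["a"]}')
rejected = holder.try_update('{"block_below": ')  # truncated file
after_bad_update = holder.should_block(5)
```

The key design choice is that parsing and validation happen before the active config is replaced, so a bad file can never leave the service without a working config.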
The second part of the same pattern is segmentation. Cloudflare says its Workers runtime is now split into independent cohorts of traffic, including a segment that handles free-customer traffic first, with deployment pacing that varies by criticality. That is a practical reliability move because it changes the shape of failure. A bad release no longer has to mean “everybody feels it at once.”
Smaller teams can borrow this pattern without copying Cloudflare’s exact architecture. You do not need global traffic engineering to apply the principle. Sometimes a “small cohort” is just:
- one region before all regions,
- internal users before customers,
- free plans before enterprise,
- a single queue consumer before the whole fleet,
- or one feature-flag audience before the rest of production.
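One way to implement the cohort idea from the list above is a stable hash per account combined with an ordered list of cohorts. The cohort names, their order, and the `receives_change` function are assumptions made for this sketch; nothing here reflects how Cloudflare segments its Workers runtime.

```python
import hashlib

# Hypothetical cohort order, least critical first.
COHORT_ORDER = ["internal", "free", "paid", "enterprise"]


def bucket(account_id: str, buckets: int = 100) -> int:
    """Map an account to a stable bucket in [0, buckets)."""
    digest = hashlib.sha256(account_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


def receives_change(
    cohort: str, account_id: str, active_cohort: str, percent: int
) -> bool:
    """Return True if this account should already see the new release.

    Cohorts earlier than active_cohort are fully rolled out, the active
    cohort is rolled out to `percent` of its accounts, later cohorts wait.
    """
    position = COHORT_ORDER.index(cohort)
    active = COHORT_ORDER.index(active_cohort)
    if position < active:
        return True
    if position > active:
        return False
    return bucket(account_id) < percent
```

Because the bucket is derived from a hash rather than a random draw, an account that received the change at 10 percent still has it at 50 percent, which keeps the affected population stable while the rollout widens.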
The point is for a bad change to cost minutes of contained impact, not the company’s reputation.
3) Incident response is not only about code paths
One of the more useful sections in Cloudflare’s writeup is not about deployment at all. It is about break-glass access and communication.
Cloudflare says it audited the tools required for visibility, debugging, and production change, then built backup authorization pathways for 18 key services. It also says it ran an engineering-wide drill involving more than 200 team members. That matters because many organizations quietly depend on the same systems they would need to repair during an outage.
This is a classic failure pattern: the identity system, remote access layer, internal observability surface, or chat-based workflow becomes unavailable at the exact moment the company needs it most.
The practical lesson is simple. Emergency access should not exist only on a diagram. It should be tested by people who are not part of the tiny inner circle. And the customer communication path should be treated as part of incident response, not as an afterthought once engineers have already stabilized the system.
Cloudflare explicitly says it paired incident responders with a dedicated communications function and drilled both pathways. That is the right model. In a serious incident, clarity is operational work.
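The circular dependency described in this section can be checked mechanically before an incident. The sketch below assumes a hand-maintained dependency map; every tool and service name in it is hypothetical. It reports which emergency tools would stop working if a given service failed.

```python
from typing import Dict, List, Set


def transitive_deps(tool: str, deps: Dict[str, List[str]]) -> Set[str]:
    """Return everything `tool` depends on, directly or indirectly."""
    seen: Set[str] = set()
    stack = list(deps.get(tool, []))
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(deps.get(node, []))
    return seen


def unusable_tools(
    failed_service: str, emergency_tools: List[str], deps: Dict[str, List[str]]
) -> List[str]:
    """List emergency tools that rely on the service that just failed."""
    return sorted(
        tool
        for tool in emergency_tools
        if tool == failed_service or failed_service in transitive_deps(tool, deps)
    )


# Hypothetical dependency map: tool or service -> what it needs to work.
deps = {
    "dashboard": ["sso"],
    "deploy-cli": ["vpn", "sso"],
    "vpn": ["sso"],
    "sso": ["edge-proxy"],
    "backup-shell": ["hardware-key"],
}
emergency_tools = ["dashboard", "deploy-cli", "backup-shell"]
blocked = unusable_tools("edge-proxy", emergency_tools, deps)
```

In this made-up map, an `edge-proxy` outage takes out the dashboard and the deploy CLI through single sign-on, leaving only `backup-shell`. A check like this does not replace a drill, but it shows where a backup authorization path is needed.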
4) The final step is turning lessons into enforced defaults
The most ambitious part of the writeup is Cloudflare’s internal Codex. Cloudflare says the Codex is mandatory for engineering and product teams, and that AI code reviews enforce rules derived from RFCs across the codebase.
There is one especially important line in the article: the goal is to build institutional memory that enforces itself.
That is a strong standard. Most teams do collect lessons after incidents, but those lessons often remain trapped in postmortems, senior engineers’ heads, or one sprint’s worth of temporary caution. A rule that is not encoded somewhere eventually degrades into folklore.
Cloudflare gives concrete examples of what that codification looks like, including a rule against using .unwrap() outside of tests and build scripts and a broader principle that services must validate upstream dependencies before processing.
You do not need AI review to learn from that. The underlying pattern is bigger than the enforcement tool:
- write the rule down,
- tie it to a technical rationale,
- make it reviewable,
- and put it directly in the path of future changes.
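As a rough illustration of putting a rule in the path of future changes, the sketch below flags `.unwrap()` calls in Rust source outside tests and build scripts, the example rule the writeup mentions. It is a line-based heuristic written for this article, not Cloudflare’s enforcement mechanism: it assumes test code lives under a `tests` directory or after a `#[cfg(test)]` attribute at the bottom of the file. In a real Rust codebase, Clippy’s `unwrap_used` lint is the more usual tool.

```python
import re
from pathlib import PurePosixPath
from typing import List

UNWRAP = re.compile(r"\.unwrap\(\)")


def is_exempt(path: str) -> bool:
    """Build scripts and files under a tests directory may use unwrap."""
    p = PurePosixPath(path)
    return p.name == "build.rs" or "tests" in p.parts


def find_violations(path: str, source: str) -> List[int]:
    """Return 1-based line numbers of .unwrap() calls in non-test code."""
    if is_exempt(path):
        return []
    violations: List[int] = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        stripped = line.strip()
        if stripped.startswith("#[cfg(test)]"):
            break  # heuristic: everything after this is the test module
        if stripped.startswith("//"):
            continue  # ignore line comments
        if UNWRAP.search(line):
            violations.append(lineno)
    return violations


# Hypothetical Rust file used to exercise the check.
SAMPLE = "\n".join(
    [
        "fn load() -> u32 {",
        "    let n = read().unwrap();",
        "    n",
        "}",
        "",
        "#[cfg(test)]",
        "mod tests {",
        "    fn t() { read().unwrap(); }",
        "}",
    ]
)
violations = find_violations("src/lib.rs", SAMPLE)
```

Run in CI with a non-zero exit code when `violations` is non-empty, a check like this turns the written rule into something a merge has to pass.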
If incident learnings do not change merge behavior, release behavior, or runtime behavior, then they are still only lessons.
What this means for teams that are not Cloudflare
Most companies are not running a network at Cloudflare’s scale. That does not make the article less useful. It makes the work of abstracting its patterns more important.
The transferable patterns are straightforward:
- Roll out risky config progressively, not instantly.
- Prefer last-known-good behavior over brittle all-or-nothing failure.
- Segment traffic so failure hits a smaller surface first.
- Test emergency access before you need it.
- Turn postmortem lessons into rules that future changes must pass through.
What should not be copied blindly is the form factor. Teams do not need Cloudflare’s internal systems, data scale, or organizational structure to apply the idea. They need their own version of the same discipline.
Bottom line
The most credible resilience work is visible in system mechanics. It shows up in how changes move, how failures degrade, who can act during an incident, and what future code is no longer allowed to do.
That is what makes Cloudflare’s “Fail Small” writeup worth paying attention to. The valuable part is not that the company says it learned from outages. It is that the company describes changes to rollout safety, blast-radius control, incident access, and standards enforcement that should make the next bad change smaller than the last one.
That is the right target for reliability work.
Sources