~/petro

your deploy strategy matters less than you think

(last month) · 12 min read
Table of contents
  1. Recreate — Kill everything. Start over.
  2. Rolling — Change the tire while the car is moving.
  3. Blue/Green — The mirrored building.
  4. the myth of the rollback
  5. Canary — Three tables in the back to see if anyone throws up.
  6. the math nobody wants to do
  7. Shadow — Dirty work in the dark.
  8. the blast radius
  9. the database nightmare
  10. tactical checklist

Friday, 3:12 PM. Routine deploy. We changed the user_data.sh on the instance — nothing exotic. Rolling update, like we always did.

The Auto Scaling Group’s warmup time was set to 5 minutes. But with the new dependencies we added, the instance took 9 minutes to boot. The ASG, impatient, marked the instance as unhealthy. Killed it. Spun up another. Which also didn’t boot in time. Which was also marked unhealthy.

In just a few minutes, a cascading failure. 2 instances became 20. None healthy. Latency exploded and the screen was painted with 5xx errors everywhere.

We had an elegant Rolling deploy on paper, but a catastrophic Recreate in practice.


The root cause wasn’t the strategy. It was an ignored warmup. A secondary script with no load testing. A default value we accepted blindly.
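
The loop that ate our Friday fits in a few lines. This is a minimal sketch, not how any real Auto Scaling Group is implemented: it assumes each instance is judged exactly once, when its grace period ends, and the names `simulate_replacements`, `grace_s`, and `max_launches` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LoopResult:
    launched: int        # instances the group started in total
    healthy: int         # instances that passed their first health check
    wasted_seconds: int  # boot time thrown away on killed instances


def simulate_replacements(boot_s: int, grace_s: int, desired: int, max_launches: int) -> LoopResult:
    """Launch instances until `desired` are healthy or the launch budget runs out.

    Assumption: an instance is checked once, `grace_s` seconds after launch.
    Still booting at that moment means it is killed and replaced.
    """
    launched = healthy = wasted = 0
    while healthy < desired and launched < max_launches:
        launched += 1
        if boot_s <= grace_s:
            healthy += 1       # finished booting before the check
        else:
            wasted += grace_s  # killed mid-boot; the replacement starts from zero
    return LoopResult(launched, healthy, wasted)


# The Friday numbers: 9-minute boot, 5-minute grace, 2 instances wanted.
friday = simulate_replacements(boot_s=540, grace_s=300, desired=2, max_launches=20)
# Same fleet, grace period raised above the boot time.
fixed = simulate_replacements(boot_s=540, grace_s=600, desired=2, max_launches=20)
print(friday)
print(fixed)
```

With the grace period below the boot time, the launch budget is spent and nothing ever turns healthy. Raise it above the boot time and the same fleet settles after two launches.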

Teams spend weeks debating whiteboard architectures as if they were magic shields against production outages. They aren’t.

The strategy is just the envelope. What dictates whether your system soars or sinks is the set of silent parameters: the readiness delay, the warmup time, the connection draining, the batch size. Having a perfect Canary rollout with a generic health check is like jumping out of a plane with a state-of-the-art parachute tied with a square knot. The design is flawless, but the execution will be fatal.

To understand the true physics of the deploy, think of your infrastructure like renovating a very busy restaurant. You have three fundamental goals:

  • Zero downtime: The restaurant never closes; customers keep eating.
  • Zero extra cost: You don’t rent a second building or buy spare stoves.
  • Zero risk: If the new kitchen fails, nobody’s food gets burned.

The physics of engineering dictates you can only pick two. The third will always cost you something — and that’s the trade-off where each strategy positions itself.

[Figure: the deploy triangle. Pick two, the third always costs you. Corners: Zero Downtime, Zero Extra Cost, Zero Risk. Strategies plotted inside: Recreate, Rolling, Blue/Green, Canary, Shadow.]

To prove this, each strategy below has an interactive simulation you can (and should) break. The point isn’t to pick the “winner” — it’s to understand why none of them save you on their own.


Recreate — Kill everything. Start over.

The old version (v1) is completely shut down. The environment hits zero. The new version (v2) boots from scratch. No overlaps. No coexistence.

In Practice: You nail a sign to the door: “Closed for renovations”. You turn off the instances. For 30 days, nobody is served. When it reopens, the operation is 100% new and cohesive. No mixing old state with new schemas.

How it works on the Server:

  1. Kill all v1 traffic.
  2. Wait for full shutdown.
  3. Start the v2 instances.
  4. Turn traffic back on.
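
The four steps above can be sketched as a timeline. This is an illustration under stated assumptions, not the simulation's actual code: `recreate_timeline` and `boot_parallelism` are hypothetical names, instances are assumed to boot in waves, and traffic is assumed to stay off until every v2 instance is up and warmed.

```python
import math


def recreate_timeline(instances: int, drain_s: int, boot_s: int, warmup_s: int, boot_parallelism: int):
    """Return (events, downtime_seconds) for a Recreate deploy.

    Assumption: boots happen in waves of `boot_parallelism` instances.
    """
    if instances < 1 or boot_parallelism < 1:
        raise ValueError("instances and boot_parallelism must be >= 1")
    t = 0
    events = [(t, "1. kill all v1 traffic")]
    t += drain_s
    events.append((t, "2. v1 fully shut down"))
    waves = math.ceil(instances / boot_parallelism)
    t += waves * boot_s
    events.append((t, "3. v2 instances started"))
    t += warmup_s
    events.append((t, "4. traffic back on"))
    return events, t  # downtime runs from step 1 to step 4


# 2s drain, boot, and warmup mirror the simulation defaults in this post.
for n in (1, 4):
    _, downtime = recreate_timeline(n, drain_s=2, boot_s=2, warmup_s=2, boot_parallelism=1)
    print(f"{n} instance(s): {downtime}s of downtime")
```

If your platform boots everything in parallel, set `boot_parallelism` to the fleet size and the downtime stops growing with the cluster. Boot one at a time and it grows linearly.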

Where it makes sense: When your environment rejects concurrency. A structural DB change that fatally breaks the previous version.

The Cost: Absolute downtime. No crying.

Pros | Cons
Zero architectural risk | Inevitable downtime
No version collision | Users completely locked out
Lowest complexity | Rollback requires immediate re-deploy
[Interactive simulation: Recreate. Defaults: 1 instance, boot time 2s, LB warmup 2s, drain before kill 2s, request rate 3 req/s. Hint shown while the config is safe: keeping LB warmup at or above boot time, together with a good drain setting, helps keep deployments downtime-free.]

Break the simulation: Increase to 4 instances and watch the cruel math. The larger your cluster, the longer your downtime window. The scale of your infrastructure becomes the scale of your unavailability.


Rolling — Change the tire while the car is moving.

v2 strangles v1 gradually. One out, one in. Piece by piece until the environment is 100% updated. The system refuses to fully die.

In Practice: The service never stops. You isolate one machine from the pool, update it, and bring it back up. Half the traffic hits the old version, half hits the new. It’s temporary friction, but the revenue doesn’t choke.

The Catch: Your database is forced to handle v1 and v2 requests against the exact same schema at the same time. The toll is backward compatibility, and it is collected upfront: both versions must be able to run side by side before the first instance is swapped.

The Cost: Capacity or speed. You give up one.

Pros | Cons
Zero downtime (on paper) | Painfully dragged out
The Industry Gold Standard | Requires severe backward compatibility
Low aggregate cost | Drain errors are fatal
[Interactive simulation: Rolling. Defaults: 4 instances, boot time 2s, LB warmup 2s, drain before kill 2s, request rate 9 req/s. Strategy parameters: batch size 1, min active 2, max instances 12.]

Break the simulation: Zero the minActive and max out the batchSize. Boom. You actively crushed capacity and your refined Rolling deploy just became a Recreate wearing a trench coat. Increase the bootTime to outlast the health check and watch the screen paint itself red — that’s the exact screen that ruined our Friday.
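
The same collapse can be checked on paper. The sketch below assumes an in-place rolling update with no surge capacity (an instance being replaced serves nothing until its replacement is healthy). `rolling_batches` and `lowest_capacity` are hypothetical helpers, and the way `min_active` caps the batch is a simplification of what a real orchestrator does.

```python
def rolling_batches(instances: int, batch_size: int, min_active: int):
    """Yield (batch_number, taken_out, still_serving) for each rollout step.

    Assumption: no surge instances are added, so capacity drops by the batch.
    """
    if not 0 <= min_active <= instances:
        raise ValueError("min_active must be between 0 and instances")
    # Never take out more instances than the min_active floor allows.
    step = min(batch_size, instances - min_active)
    if step < 1:
        raise ValueError("min_active leaves no room to replace anything")
    remaining, batch = instances, 0
    while remaining > 0:
        batch += 1
        out = min(step, remaining)
        yield batch, out, instances - out
        remaining -= out


def lowest_capacity(instances: int, batch_size: int, min_active: int) -> int:
    """Smallest number of serving instances seen at any point in the rollout."""
    return min(serving for _, _, serving in rolling_batches(instances, batch_size, min_active))


print(lowest_capacity(4, batch_size=1, min_active=2))  # the simulation defaults
print(lowest_capacity(4, batch_size=4, min_active=2))  # big batch, floor still holds
print(lowest_capacity(4, batch_size=4, min_active=0))  # floor removed
```

With the floor removed and the batch equal to the fleet, the lowest capacity is zero. That is the trench coat coming off.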


Blue/Green — The mirrored building.

High cost. Minimal risk. Version B (Green) rises in a parallel, shielded, identical infrastructure. Tested in the dark. Once it crosses the safety line, you press a button. The router flips 100% of traffic to the new place.

In Practice: You never touch the main infra. You spin up another environment, bring up replicas, and test in secret. All clear? You flip the switch (DNS/Load Balancer) in one move. The client keeps browsing without noticing.

The Collateral Limit: Fresh environments need warmed-up caches and a database both versions can safely share. Cold caches or a brute-force shared schema will lock the door the instant traffic flips.

The Cost: Cloud costs double with a single stroke of the pen.

Pros | Cons
Zero downtime (theoretical) | Double the monthly invoice
The famed “Red Button” Rollback | Stagnant DB dependencies
Robust pre-live smoke testing | Cold caches kill production
[Interactive simulation: Blue/Green. Defaults: 2 instances per environment, boot time 2s, LB warmup 2s, drain before kill 2s, request rate 8 req/s. Strategy parameters: Cache Pre-warming 2. Blue starts with two v1 instances; green starts empty.]

Break the simulation: Open the strategy parameters and decrease the Cache Pre-warming to zero. On the switch, behold the storm. This is the Thundering Herd. With empty caches, every request falls straight through to the database and drowns it. The LB flips 100%, and perfectly healthy machines shoot 504s in your face because they are all stuck waiting on the same choked database. You saved the code, but the user's perception of you went down anyway.
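
The herd is plain arithmetic: whatever misses the cache lands on the database. A rough sketch, where the hit ratio and the database capacity are made-up numbers for illustration and only the 8 req/s comes from the simulation defaults:

```python
def db_queries_per_second(request_rate: float, cache_hit_ratio: float) -> float:
    """Requests that miss the cache fall through to the database."""
    if not 0.0 <= cache_hit_ratio <= 1.0:
        raise ValueError("cache_hit_ratio must be within [0, 1]")
    return request_rate * (1.0 - cache_hit_ratio)


def survives_switch(request_rate: float, cache_hit_ratio: float, db_capacity_qps: float) -> bool:
    """True if the database can absorb the load right after the flip."""
    return db_queries_per_second(request_rate, cache_hit_ratio) <= db_capacity_qps


# Hypothetical database that comfortably handles 3 queries per second.
warm = survives_switch(request_rate=8, cache_hit_ratio=0.75, db_capacity_qps=3)
cold = survives_switch(request_rate=8, cache_hit_ratio=0.0, db_capacity_qps=3)
print("warm caches:", warm)
print("cold caches:", cold)
```

Nothing about the code or the instances changed between the two calls. Only the cache did.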

the myth of the rollback

Blue/Green is celebrated for the dream: I hit Undo in 5 seconds. But it’s an illusion. You undo code. You never undo state.

  • The Event: Kafka posted the notification. 12 APIs consumed it. No Ctrl+Z.
  • Side-Effects: The machine ran webhooks. Email notified the users. The payment gateway pulled funds from credit cards.
  • The DB Altered: Did you drop that dead table in the DB while asynchronous jobs still queried user_id_legacy? The bleeding has started.

Distributed systems don’t accept “Ctrl+Z”. Choosing the precision of failing in small pieces (e.g., Canary) is safer than slamming on an invisible emergency brake. The infrastructure rolls back, but the data wound remains open. Assume there is no turning back — design your deploys to push forward (fix-forward) in smaller fractions.


Canary — Three tables in the back to see if anyone throws up.

If Blue/Green proved that you can’t undo state, Canary asks a sharper question: what if you never need to? Fail so small that the damage is irrelevant.

The crown jewel of risk mitigation. Drips a microscopic share of traffic (1–5%) to v2. Silently watches the screens in cold blood. Tolerances fine? The faucet opens to 10, 25, and done: 100%.

In Practice: Nobody throws an oppressive update to 100% right away. You funnel the new version to a tiny edge node. Made a mistake? The log spits garbage without burning the overall operation.

The standard imposed by big techs — minimizing bugs on a microscopic scale before they sink global revenue.
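
The drip-and-watch loop can be sketched as a gate between steps. This is a hedged illustration, not any vendor's rollout controller: `run_canary`, the `observe` callback, `max_error_rate`, and `min_requests` are all hypothetical, and the fake observers below stand in for a real metrics query.

```python
from typing import Callable, Sequence, Tuple

# observe(percent) -> (requests_seen, errors_seen) for the canary at that traffic share
Observe = Callable[[int], Tuple[int, int]]


def run_canary(steps: Sequence[int], observe: Observe,
               max_error_rate: float, min_requests: int) -> Tuple[str, int]:
    """Walk through `steps`; return (outcome, percent where the rollout ended)."""
    for percent in steps:
        requests, errors = observe(percent)
        if requests < min_requests:
            return "held: not enough traffic to judge", percent
        if errors / requests > max_error_rate:
            return "aborted: error rate above threshold", percent
    return "promoted", steps[-1]


def healthy(percent: int) -> Tuple[int, int]:
    return percent * 1000, 0


def leaky(percent: int) -> Tuple[int, int]:
    # Fake build that starts failing 2% of requests once it gets real load.
    return percent * 1000, percent * 20 if percent >= 10 else 0


def quiet(percent: int) -> Tuple[int, int]:
    # Fake low-traffic service: too few requests to say anything.
    return percent * 10, 0


steps = [5, 10, 25, 100]
for name, obs in (("healthy", healthy), ("leaky", leaky), ("quiet", quiet)):
    print(name, run_canary(steps, obs, max_error_rate=0.01, min_requests=1000))
```

The `quiet` case matters as much as the `leaky` one: a gate that promotes on silence is not a gate.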

The Cost: Deep operational anxiety and pure, complex mathematics.

Pros | Cons
Blast radius contained by drip | Demands bulletproof automation
Revenue and data on real grounds | Meaningless samples over short windows (read on)
Visually driven culture and metrics | Deploys last for eras
[Interactive simulation: Canary. Defaults: 4 instances, boot time 2s, LB warmup 2s, drain before kill 2s, request rate 9 req/s. Strategy parameters: 4 steps.]

Anti-madness warning: Don’t confuse A/B testing with Canary. A/B answers a product question (which button color pays?). Canary protects engineering and server integrity (is a memory leak in the new build quietly taking the servers down?). They run on the same road, but in opposing directions.


the math nobody wants to do

Teams buy the hype of The Canary with pitiful “15 minutes at 5%” pauses before the blind jump to 100%.

If there is an imperceptible memory leak that crashes a meager 0.1% of requests, you will need roughly 30,000 requests directed to the Canary (about 30 expected failures) just to generate a statistically valid alert in Datadog. Leaving the traffic dripping at 5% for 15 minutes outside the scale of a big tech company is not technical rigor, it’s security theater. It’s trying to calm your mind looking at charts while a silent failure accumulates in the cluster.
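
Here is the math, done once. The sketch assumes the 30,000 figure comes from wanting about 30 observed failures before trusting an alert. That threshold is a rule of thumb assumed for this example, not a universal constant, and the traffic volumes are hypothetical.

```python
def requests_needed(one_in: int, min_failures: int) -> int:
    """Requests required to expect `min_failures` failures when 1 in `one_in` fails."""
    return one_in * min_failures


def minutes_to_collect(requests: int, total_rps: float, canary_percent: int) -> float:
    """Minutes the canary must run to see `requests` at its share of traffic."""
    canary_rps = total_rps * canary_percent / 100
    return requests / canary_rps / 60


# 0.1% failure rate means 1 failing request in 1000.
n = requests_needed(one_in=1000, min_failures=30)
print(n, "requests on the canary")
for total_rps in (10, 100, 1000):  # hypothetical total traffic levels
    minutes = minutes_to_collect(n, total_rps, canary_percent=5)
    print(f"{total_rps} req/s total -> {minutes:.0f} minutes at 5%")
```

Under these assumptions a service doing 100 req/s needs about 100 minutes at 5%, and one doing 10 req/s needs about 1,000. The 15-minute pause only starts to mean something around 1,000 req/s.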


Shadow — Dirty work in the dark.

Duplication into the shadows. The live version (v1) keeps serving. Every single request is also copied to the hidden version (v2), whose latency, CPU, and raw logs are audited against the live response. Nothing v2 produces ever reaches the end client’s screen.

In Practice: A siloed zombie infrastructure where robots test metrics against live fire. The pressure of the real moment, in parallel, without the client bleeding any extra lag.

The Atomic “Side-Effect” Trap: Without injecting impenetrable mocks into these shadows, the ghost version will send the emails and bill your production user a second time. Double execution means double side effects and double writes to the database.
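
One way to keep the ghost from touching the real world is to hand it stubbed clients. A minimal sketch with in-memory stand-ins: the handlers, `RealMailer`, `RecordingMailer`, and `serve_with_shadow` are all hypothetical, and a real mirror would run asynchronously so the shadow adds no latency to the live path.

```python
class RealMailer:
    def __init__(self):
        self.outbox = []  # stands in for a real email provider

    def send(self, to, subject):
        self.outbox.append((to, subject))


class RecordingMailer:
    """Shadow-side stub: records what would have been sent, sends nothing."""

    def __init__(self):
        self.suppressed = []

    def send(self, to, subject):
        self.suppressed.append((to, subject))


def handle_order_v1(request, mailer):
    mailer.send(request["email"], "Order confirmed")
    return {"status": 200, "total": request["amount"]}


def handle_order_v2(request, mailer):
    mailer.send(request["email"], "Order confirmed")
    return {"status": 200, "total": round(request["amount"], 2)}  # behavior change under test


def serve_with_shadow(request, live_mailer, shadow_mailer):
    """Serve from v1, replay on v2 with stubbed side effects, report any mismatch."""
    live = handle_order_v1(request, live_mailer)
    try:
        shadow = handle_order_v2(dict(request), shadow_mailer)  # copy: never share live input
        mismatch = shadow != live
    except Exception:
        mismatch = True  # a shadow crash is a finding, never a user-facing error
    return live, mismatch


live_mailer, shadow_mailer = RealMailer(), RecordingMailer()
response, mismatch = serve_with_shadow(
    {"email": "a@example.com", "amount": 19.999}, live_mailer, shadow_mailer
)
print(response, "mismatch:", mismatch)
print("real emails:", len(live_mailer.outbox), "suppressed:", len(shadow_mailer.suppressed))
```

The user gets exactly one email and the v1 response, while the difference in v2's output is still caught. Every external client the shadow can reach (payments, webhooks, queues) needs the same treatment, which is where the complexity bill comes from.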

The Cost: Maximum immaculate precision. Highest Cloud financial tariff available.

Pros | Cons
Zero risks towards the online client | Budget overload (2x)
Absolutely precise performance monitoring | Unreal complexity in Ghost-scope Mocks
[Interactive simulation: Shadow. Defaults: 3 instances, boot time 2s, LB warmup 2s, drain before kill 2s, request rate 9 req/s. Prod starts with three v1 instances; the mirror starts empty.]

Break the simulation: Just stare actively at the yellow counter. Watch the double-billing of the machines hitting the account for nothing offered externally. That is total paranoia bought in dollars and flawlessly shielded from the users.


the blast radius

You’ve seen five strategies, each optimizing a different edge of the same triangle. But strategy is a what. The more dangerous question is where.

Deploys don’t just happen at the server level. The scope of the change — your blast radius — scales from a single CPU thread to intercontinental DNS routing:

[Figure: the strategy spectrum, from Simple (lower cost / higher risk) to Complex (higher cost / lower risk).]

  • Recreate: Total downtime
  • Rolling: Gradual update
  • Blue/Green: Exact copy, doubles cost
  • Canary: Percentage routing
  • Shadow: Mirrored traffic

Layer | Control Boundary | Common Technologies
OS Processes | Ports / IPC | Gunicorn, PM2
Pods / Containers | Ingress / Service Mesh | Kubernetes, Envoy
Instances / VMs | Load Balancers (ALB/NLB) | AWS EC2, Azure VMs
Global Traffic | Edge Routing / DNS | Cloudflare, Route53

Saying you “do Canary” in a standup is vague. Are you isolating 2% of requests in a local load balancer, or recklessly draining 10% of Asian traffic via Cloudflare? The edge of your control is the edge of your deploy: never open the floodgates where you can’t measure the collateral damage.


the database nightmare

Blast radius tells you how far the damage spreads. The database tells you where it stays.

If there is a brutal force capable of crushing your perfect Blue/Green or Rolling architecture, it’s the relational database.

The old version conflicts with the new. If a deploy alters or drops a column that v1 instances are still desperately trying to read, every one of those queries starts failing and the application goes down with them. The database is the shared state cemetery where elegant deployment strategies go to die.

To survive complex schema changes, senior engineers rely on the Expand & Contract pattern (Parallel Change):

  1. The Expansion: Create the new columns/tables. Update the code so it writes to both the legacy and the new columns for days, while reads behave exactly as v1 expects.
  2. The Migration: Have background scripts quietly shuttle the historical data volume from the old schema over to the newest mold.
  3. The Contraction: Weeks or even months later, when v1 is utterly purged from reading anywhere in the cluster, cut the root of the legacy field off. That one single safe deploy took an entire month from start to finish.
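
The three phases above can be walked through end to end against an in-memory SQLite database. This is a toy under stated assumptions: the `users` table, the legacy `name` column, and the new `display_name` column are hypothetical, and in production each phase is its own deploy, separated by days or weeks.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
db.execute("INSERT INTO users (name) VALUES ('ada'), ('linus')")  # rows written by v1

# 1. Expansion: add the new column. v1 keeps reading `name`, untouched.
db.execute("ALTER TABLE users ADD COLUMN display_name TEXT")


def create_user_dual_write(name):
    # Transitional code: write both columns so v1 and v2 readers both work.
    db.execute("INSERT INTO users (name, display_name) VALUES (?, ?)", (name, name))


create_user_dual_write("grace")

# 2. Migration: backfill the rows that predate the dual write.
db.execute("UPDATE users SET display_name = name WHERE display_name IS NULL")
missing = db.execute("SELECT COUNT(*) FROM users WHERE display_name IS NULL").fetchone()[0]

# 3. Contraction: only once nothing reads `name` any more. Done as a table
#    rebuild so the sketch also runs on SQLite builds without DROP COLUMN.
db.execute("CREATE TABLE users_v2 (id INTEGER PRIMARY KEY, display_name TEXT)")
db.execute("INSERT INTO users_v2 SELECT id, display_name FROM users")
db.execute("DROP TABLE users")
db.execute("ALTER TABLE users_v2 RENAME TO users")

columns = [row[1] for row in db.execute("PRAGMA table_info(users)")]
rows = db.execute("SELECT display_name FROM users ORDER BY id").fetchall()
print(missing, columns, rows)
```

The check on `missing` before the contraction is the part worth copying: the legacy column only goes away after you have proven nothing is left behind in it.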

tactical checklist

The Friday that started this post — we had the right strategy and the wrong parameter. Every simulation you broke above proved the same thing from a different angle.

Before applying the next infrastructure change, validate:

  1. Peripheral parameters dictate the deploy. A short timeout or a rigid batch limit will collapse the most elaborate Canary strategy into a forced Recreate.
  2. State never rolls back. Your database has no emergency undo button. Design flows assuming failures will leak into persistence and prioritize continuous correction (fix-forward).
  3. Statistical samples are unforgiving. Running a Canary with 5% traffic for 10 minutes under low volume is just generating metric theater. If the volume doesn’t reach statistical significance, the alert will be blind.

At the end of the day, architecture merely reflects intent. Actual production behavior is rigorously dictated by the hard limits written into the configuration files.


No victims yet. Go back and break something.