~/petro

Your Deploy Strategy Matters Less Than You Think

11 min read · devops · infra · interactive

Most engineering teams spend weeks debating which deploy strategy to adopt — Rolling, Blue/Green, Canary — as if the choice alone were the deciding factor between a smooth deploy and a 3am Friday firefight in production.

It isn’t.

The strategy is the envelope. What determines whether your deploy works or blows up are the parameters: readiness delay, warmup time, connection draining, batch size. Details most teams configure on autopilot — or worse, leave at default.

This guide covers every major strategy, each with an interactive simulation you can tweak and break. But the point isn’t to pick the “best” one — it’s to understand why none of them save you on their own.

Important: the strategy alone does not guarantee zero downtime. Readiness/health checks, warmup, and connection draining settings can still drop requests even with Rolling, Blue/Green, or Canary if they are configured too aggressively.


Recreate

The old version (v1) is completely shut down before the new version (v2) starts. No overlap, no coexistence. Just stop everything and start again.

How it works:

  1. Stop all v1 instances
  2. Wait for full shutdown
  3. Start all v2 instances
  4. Resume traffic

Where it makes sense:
When you can’t have two versions running simultaneously — maybe due to database schema changes, license constraints, or stateful processes that don’t support concurrent versions.
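The four steps above can be sketched as a toy timeline. The function and parameter names here are illustrative, not any real orchestrator's API, but they make the structural problem visible: while there are zero serving instances, the downtime clock is running.

```python
# Toy model of a Recreate deploy. Downtime is everything between
# "all v1 instances stopped" and "all v2 instances passing health checks".

def recreate_downtime(shutdown_s: float, boot_s: float, readiness_s: float) -> float:
    """Total window with zero serving instances, in seconds."""
    # v1 is drained and stopped first, then v2 boots and must pass
    # readiness checks before traffic resumes.
    return shutdown_s + boot_s + readiness_s

# Even with fast instances, the user-facing gap is never zero.
gap = recreate_downtime(shutdown_s=2.0, boot_s=2.0, readiness_s=2.0)
print(f"downtime window: {gap:.0f}s")  # 6s of hard downtime
```

No parameter tuning can eliminate the gap here; it can only shrink it. That is the defining trade of Recreate.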

Pros | Cons
Maximum simplicity | Causes downtime
No version conflicts | Users lose access during deploy
Clean state transition | Rollback requires re-deploy
[Interactive simulation: configure instances, boot time, LB warmup, and drain time, then watch the request flow for dropped traffic. Keeping LB warmup at or above boot time, together with a good drain setting, helps keep deployments downtime-free.]

Rolling

Version B replaces version A gradually, instance by instance (or in batches), until the entire environment is updated.

How it works:

  1. Take one (or a batch of) v1 instances out of the load balancer
  2. Replace them with v2
  3. Route traffic to the new instances
  4. Repeat until all instances are v2

The catch: Your application (especially the database) must support two different versions running simultaneously. API contracts, database schemas, feature flags — all need to be backwards compatible during the transition window.

Parameters like batch size, max unavailable, and minimum active instances matter too. Take too many instances out of rotation at once and you create capacity bottlenecks, or even downtime, despite using a rolling update.
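One way to see the batch-size constraint is to plan the batches up front. A minimal sketch, with hypothetical names (this is not a Kubernetes or ECS API):

```python
# Sketch of a rolling-update planner: replace instances in batches,
# never letting serving capacity fall below a minimum.

def rolling_plan(total: int, batch_size: int, min_active: int) -> list[list[int]]:
    """Return batches of instance indices to replace, in order."""
    # While a batch is being replaced, only (total - batch_size)
    # instances are serving; refuse plans that violate the floor.
    if total - batch_size < min_active:
        raise ValueError("batch too large: would violate min_active")
    batches, i = [], 0
    while i < total:
        step = min(batch_size, total - i)
        batches.append(list(range(i, i + step)))
        i += step
    return batches

print(rolling_plan(total=4, batch_size=1, min_active=2))
# [[0], [1], [2], [3]]: one instance replaced at a time
```

With `batch_size=3` and `min_active=2` on four instances, the planner refuses: the deploy would briefly run on a single instance, which is exactly the bottleneck the paragraph above describes.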

Pros | Cons
Zero downtime | Slow deploy process
Industry standard | Requires backward compatibility
Low infrastructure cost | Slow rollback
[Interactive simulation: the same deploy controls, plus batch size and minimum active instances as strategy parameters.]

Blue/Green

Version B (Green) is deployed to a parallel environment, isolated from but identical to the one running version A (Blue). After testing on Green, traffic is switched instantly at the load balancer from A to B.

How it works:

  1. Deploy v2 to the Green environment (Blue keeps serving traffic)
  2. Run smoke tests against Green
  3. Switch load balancer from Blue → Green
  4. If something breaks, switch back instantly

The expensive part: You’re paying for double infrastructure throughout the process. And database schema changes are a nightmare — both environments need to work with the same data layer.
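The cutover itself is a single atomic flip, which is what makes rollback instant. A minimal sketch, assuming a hypothetical `smoke_test` callable standing in for your pre-switch validation:

```python
# Sketch of a blue/green cutover: smoke-test green, flip the router
# atomically, keep blue warm so rollback is just flipping back.

def cutover(router: dict, smoke_test) -> str:
    """Point the router at green only if the smoke test passes."""
    if smoke_test("green"):
        router["active"] = "green"  # atomic flip at the LB/DNS layer
    # on failure we simply never flip; blue keeps serving untouched
    return router["active"]

router = {"active": "blue"}
print(cutover(router, smoke_test=lambda env: True))   # green
router = {"active": "blue"}
print(cutover(router, smoke_test=lambda env: False))  # blue: nothing moved
```

Note what the sketch cannot capture: both environments still point at the same database, so the flip is only safe if the schema works for v1 and v2 alike.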

Pros | Cons
Zero downtime | Double infrastructure cost
Instant rollback | DB migrations are complex
Full pre-production testing | Sync between environments
[Interactive simulation: two identical environments; flip the load balancer from Blue to Green and back.]

Canary

Version B is released to a small subset of users (e.g., 5%). If health metrics (error rate, latency, CPU) look normal, traffic is gradually increased in steps until reaching 100%.

How it works:

  1. Deploy v2 alongside v1
  2. Route 5% of traffic to v2
  3. Monitor metrics (errors, latency, saturation)
  4. If healthy, increase to 25% → 50% → 100%
  5. If metrics degrade, roll back to 0%

This is how most large-scale systems deploy today. It limits blast radius — if v2 has a bug, only 5% of users are affected before you catch it.
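The traffic split is usually done with stable per-user hashing, so a given user doesn't flip between versions on every request. A sketch of that idea (the bucket scheme and percentages are illustrative):

```python
# Sketch of canary routing: hash the user id into 100 buckets and send
# the lowest buckets to v2. Deterministic, so a user sticks to one version.
import hashlib

def route(user_id: str, canary_percent: int) -> str:
    """Stable per-user routing: same user always lands on the same version."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "v2" if bucket < canary_percent else "v1"

# At 5%, roughly 5 in 100 users see v2, and always the same ones.
hits = sum(route(f"user-{i}", 5) == "v2" for i in range(1000))
print(f"v2 share: {hits / 10:.1f}%")
```

Promotion to 25%, 50%, and 100% is then just raising `canary_percent`; users already on v2 stay there, and new buckets join them.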

Pros | Cons
Real production testing | High complexity
Limited blast radius | Requires observability tools
Data-driven confidence | Slow total deploy time
[Interactive simulation: step traffic from v1 to v2 in configurable increments.]

Canary ≠ A/B Testing: The infrastructure is similar — splitting traffic between versions — but the goals are opposite. Canary validates the technical health of the new version (error rates, latency, saturation). A/B testing validates a product hypothesis (which variant converts more, engages more, retains more). Canary is an engineering decision; A/B is a product decision. They often run on the same infrastructure but answer completely different questions.


Shadow

Version B runs in parallel and receives a copy (mirror) of all traffic going to version A. Version B’s responses are monitored by developers but ignored by end users — they always see v1’s response.

How it works:

  1. Deploy v2 alongside v1
  2. Mirror all incoming requests to both
  3. Users receive responses only from v1
  4. Compare v2’s responses, latency, and behavior
  5. When confident, promote v2 via another strategy

Critical detail: You must mock external side effects. If v2 processes a mirrored payment request, it should not actually charge the user’s card. Same for emails, webhooks, and any write operations to shared databases.
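The core of the pattern fits in a few lines. A sketch with stand-in handlers (in a real system, `v2_handler` runs against mocked side effects, and the mirrored call is fire-and-forget rather than inline):

```python
# Sketch of traffic mirroring: users always get v1's response; v2's
# response is only recorded for offline comparison.

def shadow(request, v1_handler, v2_handler, diffs: list) -> str:
    primary = v1_handler(request)       # the only response users ever see
    try:
        mirrored = v2_handler(request)  # asynchronous in production
        if mirrored != primary:
            diffs.append((request, primary, mirrored))
    except Exception as exc:            # v2 crashing must never hurt the user
        diffs.append((request, primary, repr(exc)))
    return primary

diffs = []
resp = shadow("GET /price", lambda r: "100", lambda r: "101", diffs)
print(resp, len(diffs))  # users saw "100"; one divergence logged for review
```

The `diffs` log is the whole payoff: a list of real production requests where v2 would have behaved differently, collected at zero user risk.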

Pros | Cons
Zero user risk | Very complex and expensive
Real load testing | Must mock all side effects
Catch regressions before production | Double infrastructure
[Interactive simulation: production traffic is mirrored to the shadow instances while users only ever see v1.]

This isn’t about tools

The simulations above show a load balancer routing traffic to instances. But that’s just one level of abstraction.

The same patterns apply at completely different scales:

Level | “Load Balancer” | “Instance” | Example
Process | Process manager | Worker/thread | PM2, Gunicorn, Puma
Container | Service mesh / Ingress | Container | Kubernetes, ECS, Docker Swarm
Instance | ALB / Nginx / HAProxy | VM or server | AWS EC2, GCP Compute, bare metal
Region | DNS / Global LB | Entire region | Route 53, Cloudflare, GCP GLB

A “Canary deploy” can mean:

  • routing 10% of threads to the new code within a single machine
  • routing 10% of pods to the new version within a Kubernetes cluster
  • routing 10% of global traffic to an entire region running the new version

The mechanics are the same. What changes is the blast radius — how much of the system is affected if something goes wrong.

In the simulations, we use “instances” and “load balancer” as generic vocabulary. Mentally substitute the abstraction level that makes sense for your context.


Configuration matters more than strategy

The difference between a safe deploy and a disaster can be three seconds of readiness delay.

Here’s the central thesis of this article: the belief that choosing Rolling or Canary automatically protects against downtime is dangerous. It doesn’t. With aggressive settings, even the “safest” strategy will drop requests.

Consider these scenarios:

Rolling with readinessDelay = 0: You remove a v1 instance, bring up v2, and the load balancer routes traffic immediately — before the application is actually ready. Result: 502 errors during the startup window. Multiply that by each batch and you get a cascade of silent degradation.

Blue/Green with warmupTime = 0: Traffic switches instantly to the Green environment, but Green is still warming up its connection pool, JIT, or cache. In the first few seconds, latency spikes and timeouts start appearing. The “instant rollback” helps — but the requests in that window are already lost.

Canary with very short bootTime: The canary pod joins the pool before it can respond correctly. Even at just 5% of traffic, a system handling millions of requests per day turns that startup window into thousands of failed requests.

The good news: you can test this right now. Go back to the simulations above, set the safety parameters (readiness delay, warmup, drain time) to aggressive values, and watch what happens to the requests. This is the best way to internalize why configuration matters more than the choice of strategy.
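The same point can be made in a few lines of arithmetic. A deliberately crude model, with made-up numbers, of the first Rolling scenario above:

```python
# Toy model of the thesis: same Rolling strategy, different readiness delay.
# An instance added to the LB before it can serve returns errors until
# its boot completes.

def failed_requests(boot_s: float, readiness_delay_s: float, rps: int) -> int:
    """Requests hitting a not-yet-ready instance during one batch swap."""
    not_ready = max(0.0, boot_s - readiness_delay_s)
    return int(not_ready * rps)

print(failed_requests(boot_s=3, readiness_delay_s=0, rps=100))  # 300 errors per batch
print(failed_requests(boot_s=3, readiness_delay_s=3, rps=100))  # 0: same strategy, safer config
```

Same strategy, same traffic, same instances. The only thing that changed between 300 errors per batch and zero was one configuration value.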


Comparison Table

Strategy | Downtime? | Rollback Speed | Infrastructure Cost | Complexity
Recreate | Yes | Slow | Low | Low
Rolling | No | Slow | Low | Medium
Blue/Green | No | Instant | High (2x) | Medium-High
Canary | No | Fast | Medium | High
Shadow | No | N/A (no prod impact) | High (2x) | Very High

The elephant in the room: database migrations

Every deploy strategy that keeps two versions running simultaneously (Rolling, Blue/Green, Canary) hits the same wall: the database must be compatible with both versions.

Adding a column? v1 must work with it present. Renaming a field? v1 will break if the old column disappears. Dropping a table? Impossible until v1 is completely gone.

The standard solution is the expand-contract pattern (also called “parallel change”):

Phase 1 — Expand: add the new structure (column, table, index) without removing the old one. Deploy v2 writing to both. At this point, v1 and v2 coexist without conflicts.

Phase 2 — Migrate: backfill data from the old structure to the new one. Ensure the new structure is the source of truth.

Phase 3 — Contract: remove the old structure. This is only safe when v1 no longer exists anywhere.

In practice, each phase is a separate deploy. A “simple” column rename becomes three sequential deploys. This is exactly why mature teams invest in feature flags and reversible migrations — because the deploy itself is only part of the story.
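Concretely, a column rename under expand-contract might look like the following. The SQL and table names are invented for illustration; the point is the strict ordering between deploys:

```python
# Sketch of expand-contract as three separate deploys. Each phase ships
# only after the previous one is fully rolled out everywhere.

PHASES = [
    # 1. Expand: additive only. v1 keeps working; v2 writes to both columns.
    "ALTER TABLE users ADD COLUMN full_name TEXT;",
    # 2. Migrate: backfill, then make the new column the source of truth.
    "UPDATE users SET full_name = name WHERE full_name IS NULL;",
    # 3. Contract: destructive. Only safe once no v1 instance exists anywhere.
    "ALTER TABLE users DROP COLUMN name;",
]

def next_phase(completed: int) -> str:
    """Phases are strictly sequential; you cannot skip ahead."""
    return PHASES[completed] if completed < len(PHASES) else "done"

print(next_phase(0))
```

Phase 3 is the one that bites teams: running it while any v1 instance survives turns a routine rename into an outage.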


The deploy spectrum

The five strategies aren’t isolated choices — they form a spectrum. From left to right, operational complexity and infrastructure cost increase, but risk to end users decreases.

Simple → Complex (lower cost, higher risk → higher cost, lower risk):

  • Recreate: total downtime
  • Rolling: gradual update
  • Blue/Green: exact copy, doubles cost
  • Canary: percentage routing
  • Shadow: mirrored traffic

No strategy is universally better. The sweet spot depends on your team size, service criticality, and observability maturity.


Which one should you use?

Start simple. If you’re a small team with a single service, Rolling updates will serve you well. As your system grows in complexity and user base, graduate to Canary releases.

Blue/Green shines when you need instant rollback guarantees — critical for financial systems or healthcare applications where even seconds of degraded service matter.

Shadow deploys are for major refactors — swapping databases, rewriting core services, or migrating to a new architecture. The investment in infrastructure and mocking is worth it when the stakes are highest.

How the best teams combine strategies

In practice, mature teams don’t use just one strategy. The typical pipeline combines several, each optimized for a different type of change:

Type of change | Strategy | Why
New feature (code) | Canary → gradual promotion | Limits blast radius; fast rollback based on metrics
Critical hotfix | Rolling with batch of 1 | Speed + sequential validation
Database migration | Blue/Green + expand-contract | Instant rollback if the migration fails
Core service rewrite | Shadow → Canary | Validate with real traffic at zero risk, then promote gradually
Infrastructure change | Regional Blue/Green | Switch entire regions with immediate fallback

A real-world pipeline might look like:

  1. Dev pushes → CI runs tests
  2. Automatic deploy to Canary 5% in production
  3. Automated monitoring for 10 minutes (error rate, p99 latency, saturation)
  4. If healthy: gradual promotion 25% → 50% → 100%
  5. If degraded: automatic rollback to v1
  6. For database migrations: separate Blue/Green, with the expand phase deployed days earlier via Canary
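Steps 3 to 5 of that pipeline reduce to a small promotion loop. A minimal sketch, where `metrics_ok` is a placeholder for whatever your observability stack exposes:

```python
# Sketch of automated canary promotion: advance through traffic steps
# while metrics stay healthy; roll back to v1 on the first bad reading.

def promote(steps: list[int], metrics_ok) -> int:
    """Return the final v2 traffic percentage (0 means rolled back)."""
    traffic = 0
    for pct in steps:            # e.g. 5 → 25 → 50 → 100
        if not metrics_ok(pct):
            return 0             # automatic rollback to v1
        traffic = pct
    return traffic

print(promote([5, 25, 50, 100], metrics_ok=lambda p: True))    # 100
print(promote([5, 25, 50, 100], metrics_ok=lambda p: p < 50))  # 0
```

Everything interesting hides inside `metrics_ok`: which signals it watches, how long it waits at each step, and what counts as "degraded" are exactly the configuration details this article argues matter more than the strategy name.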

Your deploy strategy matters less than you think. Execution matters more than you imagine. The secret isn’t choosing the “right” strategy — it’s having the operational maturity to make any strategy work.
