Oct 20, 2024
Zero-Downtime Deployments Explained
A practical breakdown of rolling deployments, health checks, drains, and rollback discipline.
Deployments break systems far more often than code does. Most outages in growing teams are not caused by exotic distributed systems bugs. They come from rushed releases, bad startup behavior, incomplete health checks, and no rollback discipline.
I learned this the hard way while working on product systems that had payments, enrolments, and admin workflows tied together. A deployment was never just “push the code.” A deployment could stop payments, break course access, or confuse support teams within minutes.
The simplest way to think about zero-downtime deployment is this: the user should never notice that a release happened.
What usually breaks
The biggest mistakes are boring:
- new containers start before the app is actually ready
- old containers stop before connections are drained
- migrations run in a risky order
- background workers and web apps are deployed with no sequencing
- rollbacks exist only in theory
Zero-downtime starts with admitting that “deployment successful” is not the same thing as “system healthy.”
My practical deployment checklist
I treat every deployment as a controlled handoff.
- Build artifact once. The image or artifact promoted to production should be the exact one validated in earlier environments.
- Run checks before rollout. Linting and tests are basic. More importantly, run startup validation, environment validation, and secret checks.
- Health checks must reflect reality. A readiness check should answer: can this instance serve real traffic right now? A liveness check should answer: is this process stuck or broken?
- Drain old instances cleanly. If you terminate before active requests finish, users experience random failures even though the release looks green.
- Separate schema risk from app risk. Database changes cause more pain than frontend or API releases. Expand-migrate-contract style changes are safer than destructive ones.
- Define rollback before deploy. Rollback is not a paragraph in documentation. It is a known sequence with a tested previous image, known config, and a decision rule.
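The readiness, liveness, and drain items above can be sketched as one small state machine. This is a minimal illustration, not a framework API: `InstanceHealth` and its method names are invented here, and in a real service `ready()` and `alive()` would back the HTTP endpoints your load balancer or orchestrator polls.

```python
import threading
import time

class InstanceHealth:
    """Tracks readiness, liveness, and in-flight requests for one instance."""

    def __init__(self, startup_done=False):
        self._lock = threading.Lock()
        self.startup_done = startup_done  # dependencies verified (DB, cache, config)
        self.draining = False             # set when the shutdown signal arrives
        self.in_flight = 0                # requests currently being served

    def ready(self):
        # Readiness: "can this instance serve real traffic right now?"
        # False during startup AND during drain, so the load balancer
        # stops routing new requests before the process exits.
        return self.startup_done and not self.draining

    def alive(self):
        # Liveness: "is this process stuck or broken?" A draining
        # instance is still alive; killing it would drop requests.
        return True

    def request_started(self):
        with self._lock:
            self.in_flight += 1

    def request_finished(self):
        with self._lock:
            self.in_flight -= 1

    def drain(self, timeout=30.0, poll=0.1):
        """Stop accepting new work, then wait for active requests to finish."""
        self.draining = True
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            with self._lock:
                if self.in_flight == 0:
                    return True   # safe to exit
            time.sleep(poll)
        return False              # timed out; remaining requests may be cut off
```

The key design point is that readiness and liveness diverge during shutdown: a draining instance reports not-ready (take me out of rotation) while still reporting alive (do not kill me yet).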
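The expand-migrate-contract item deserves a concrete shape. Below is a hypothetical rename of a `users.fullname` column to `display_name`, split into phases so old and new application code can run side by side during the rollout; the table and column names are invented for illustration.

```python
# Hypothetical three-phase rename: users.fullname -> users.display_name.
# Each phase ships in its own deploy; nothing destructive happens until
# no running code depends on the old column.

# Phase 1 (EXPAND): purely additive, safe to run before the new code ships.
EXPAND = [
    "ALTER TABLE users ADD COLUMN display_name TEXT",
    "UPDATE users SET display_name = fullname WHERE display_name IS NULL",
]

# Phase 2 (MIGRATE): deploy app code that writes both columns and reads
# only display_name. Old instances still reading fullname keep working.

# Phase 3 (CONTRACT): the destructive step, run only after every reader
# and writer of the old column is gone.
CONTRACT = [
    "ALTER TABLE users DROP COLUMN fullname",
]
```

A destructive change squeezed into one deploy forces the app and schema to flip at the same instant; splitting it like this means no single step can strand traffic.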
Where zero-downtime really matters
For a normal CRUD app, people often say a little downtime is acceptable. That sounds fine until the app contains:
- payment webhooks
- login and session flows
- enrolment or purchase confirmation
- dashboards used by support or operations
These are the exact flows where a 30-second gap causes real business damage.
The hidden dependency problem
A lot of teams think they are deploying one service. In reality they are touching:
- the application
- reverse proxy
- database migrations
- background jobs
- webhooks
- feature flags
- external integrations
If one part starts slower than the others, the system enters a temporary half-broken state. That state is where bad user experiences happen.
A safer release model
The release flow I trust most is:
- build immutable image
- validate config
- deploy new instances
- wait for readiness
- shift traffic gradually
- watch key metrics
- keep previous version warm long enough to rollback
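The traffic-shifting and rollback steps in that flow reduce to one loop. This is a minimal sketch under assumptions: `set_weight` stands in for whatever your router exposes (an Nginx upstream weight or a service-mesh setting in practice), and `healthy` stands in for a check over your metrics; both names are made up for illustration.

```python
def shift_traffic(set_weight, healthy, steps=(5, 25, 50, 100)):
    """Gradually raise the new version's traffic share, rolling back on
    the first unhealthy reading.

    set_weight(pct): route pct% of traffic to the new version.
    healthy():       True if error rate, latency, and queue depth look normal.
    Returns the final weight: 100 if fully promoted, 0 if rolled back.
    """
    for pct in steps:
        set_weight(pct)
        if not healthy():
            set_weight(0)  # previous version is still warm: instant rollback
            return 0
    return 100
```

The loop only works because the previous version is kept warm: setting the weight back to zero is cheap precisely when there is a known-good fleet still running behind it.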
The metrics that matter most right after deploy:
- error rate
- p95 and p99 latency
- queue backlog
- webhook failures
- database connection pressure
If any of these move sharply, stop pretending the release is fine.
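One rough way to turn "move sharply" into a decision is to compare a post-deploy window against a pre-deploy baseline. The metric names and thresholds below are illustrative assumptions, not recommendations; in practice the numbers would come from your monitoring system.

```python
def release_is_suspect(baseline, current,
                       error_ratio=2.0, latency_ratio=1.5):
    """Return True if post-deploy metrics moved sharply against baseline.

    baseline/current: dicts with error_rate (fraction), p99_ms, queue_backlog.
    Thresholds are illustrative; tune them per service, and keep an absolute
    floor so a near-zero baseline does not make the check trip on noise.
    """
    if current["error_rate"] > max(baseline["error_rate"] * error_ratio, 0.01):
        return True
    if current["p99_ms"] > baseline["p99_ms"] * latency_ratio:
        return True
    if current["queue_backlog"] > baseline["queue_backlog"] * 2 + 100:
        return True
    return False
```

A check like this pairs naturally with the gradual traffic shift: evaluate it at each weight step, and roll back the moment it returns True instead of debating the dashboard in the moment.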
Final thought
Zero-downtime is not a tool. It is a discipline. No amount of Docker, Kubernetes, ECS, Nginx, or systemd saves careless release engineering. What saves you is sequencing, observability, and rollback clarity.
Teams do not become reliable when they deploy more often. They become reliable when they can deploy without fear.