MervCodes

Tech Reviews From A Programmer

Database Migrations for Zero Downtime Deployments

1 min read

Database Migrations for Zero Downtime Deployments

Shipping code without downtime is now a baseline expectation. Users assume your service is available 24/7, and a maintenance window that takes the database offline is increasingly hard to justify. The hardest part of continuous delivery is rarely the application code — it's the database. Schema changes are stateful, often irreversible, and a single careless ALTER TABLE can lock a table for minutes and bring your service to its knees.

This guide walks through the patterns and discipline required to evolve your schema while keeping the lights on.

Why Database Migrations Are the Hard Part

When you deploy stateless application servers, you can spin up new instances, drain the old ones, and roll back instantly if something breaks. The database has none of these luxuries. It holds the single source of truth, it can't be casually duplicated, and many changes acquire locks that block reads or writes.

The core challenge is that during a rolling deployment, old and new versions of your application run simultaneously against the same database. If version N expects a column that version N+1 just dropped, requests served by the old instances will start throwing errors. Your migration strategy has to keep the schema compatible with both versions at every moment of the rollout.

The Expand-Contract Pattern

The single most important concept for zero downtime migrations is expand-contract (also called parallel change). Instead of making a breaking change in one step, you split it into phases that are each individually safe:

  1. Expand — Add the new schema elements without removing anything. The database now supports both the old and new shapes.
  2. Migrate — Backfill data and switch application code to use the new shape, deploying gradually.
  3. Contract — Once no code references the old shape, remove it.

Each phase is its own deployment. This is slower than a single migration, but every intermediate state is safe to run in production with mixed application versions.

Example: Renaming a Column

A naive ALTER TABLE users RENAME COLUMN email TO email_address breaks every running instance the moment it executes. The expand-contract version looks like this:

-- Phase 1 (Expand): add the new column
ALTER TABLE users ADD COLUMN email_address VARCHAR(255);

-- Backfill in batches (see below), and have the app
-- write to BOTH columns during this transition.

-- Phase 3 (Contract): after all code uses email_address
ALTER TABLE users DROP COLUMN email;

Between phases, your application writes to both email and email_address and reads from whichever is authoritative. Only after every instance is updated and the backfill is complete do you drop the old column.

Make Migrations Backward Compatible

The golden rule: a migration must be compatible with the application version currently running in production. Translate that into concrete habits:

  • Adding is safe; removing and renaming are not. New nullable columns, new tables, and new indexes (created concurrently) don't break existing code.
  • Never rename in place. Renames are two operations disguised as one — add the new name, migrate, drop the old.
  • Don't change a column's type directly. Add a new column with the new type, backfill, and swap. In-place type changes often rewrite the entire table under a lock.
  • Decouple schema changes from code changes. Deploy the schema change first, verify it, then deploy the code that depends on it.

Watch Out for Locks

The reason a "simple" migration causes an outage is almost always locking. Different databases behave differently, so know your engine:

  • PostgreSQL: CREATE INDEX takes a lock that blocks writes — always use CREATE INDEX CONCURRENTLY. Adding a column with a non-volatile default is fast in modern Postgres (11+), but adding a NOT NULL constraint can require a full table scan. Use NOT VALID constraints followed by a separate VALIDATE.
  • MySQL: Many DDL operations support online, in-place changes via ALGORITHM=INPLACE, LOCK=NONE, but not all. For large tables, tools like pt-online-schema-change or gh-ost create a shadow table and copy rows in the background to avoid long locks.
  • Set a lock timeout. Before running DDL, set a short lock_timeout (e.g. a few seconds) so a migration that can't acquire its lock fails fast instead of queuing behind a long transaction and stalling every query that follows it.

Backfilling Data Safely

When you need to populate a new column for millions of existing rows, never do it in a single UPDATE. One giant statement holds locks, bloats the transaction log, and can run for hours.

Instead, backfill in small batches:

UPDATE users
SET email_address = email
WHERE email_address IS NULL
LIMIT 1000;
-- Repeat until zero rows are affected, with a brief pause between batches.

Run batches with a small sleep between them so you don't saturate I/O or replication. Make the operation idempotent (the WHERE email_address IS NULL clause ensures re-runs are safe) and monitor replication lag on your read replicas as you go.

Decouple Migrations from Application Deploys

A common mistake is bundling the migration into the application startup sequence. If the migration is slow or fails, it blocks or crashes your deploy. Better practice:

  • Run migrations as a separate, explicit step in your pipeline, before or after the code deploy depending on the phase.
  • Make every migration forward-only and idempotent so re-running it is harmless.
  • Keep migrations small and single-purpose — one logical change per migration file makes failures easy to diagnose and roll back.

Rollbacks and Feature Flags

Down migrations are a nice idea that rarely survive contact with production data — once you've dropped a column, the data is gone. Rather than relying on reversible migrations, prefer a roll-forward philosophy: if something is wrong, write a new migration to fix it.

For the application side, feature flags let you decouple "the schema and code exist" from "the feature is active." You can deploy the new code path dark, flip the flag for a small percentage of traffic, watch your metrics, and roll back instantly by flipping the flag — no deploy required.

A Practical Checklist

Before merging any migration, ask:

  • Does this work with the currently deployed application version?
  • Does it acquire a long-running lock? Have I set a lock timeout?
  • For new indexes, am I creating them concurrently / online?
  • Is the backfill batched and idempotent?
  • Have I split renames, type changes, and column drops into expand-contract phases?
  • Can I deploy this independently of the code that uses it?
  • What's my plan if it fails halfway?

FAQ

Q: What exactly is the expand-contract pattern? It's a technique for making breaking schema changes safely by splitting them into non-breaking steps. You first expand the schema to support both the old and new shape, migrate data and code while both work, then contract by removing the old shape once nothing uses it.

Q: Should migrations run before or after the application deploys? It depends on the phase. Additive (expand) migrations run before the code that uses them, so the schema is ready. Destructive (contract) migrations run after the last code referencing the old schema is fully rolled out. The guiding principle is that the database must always be compatible with the running app.

Q: How do I add a NOT NULL column without downtime? Add the column as nullable (or with a safe default) first, backfill existing rows in batches, then add the NOT NULL constraint as a separate validated step. Adding a NOT NULL column directly can force a blocking table rewrite or scan on large tables.

Q: Why can't I just rename a column in one migration? Because during a rolling deploy, old instances still reference the original name. The instant you rename, those instances start failing. Use expand-contract: add the new column, write to both, migrate reads, then drop the old one.

Q: What about down migrations and rollbacks? Treat migrations as roll-forward by default. Destructive changes can't truly be undone because the data is gone. If you need to revert, write a new corrective migration. Use feature flags to control behavior at the application layer for instant, deploy-free rollbacks.

Q: How do I avoid locking a huge table when adding an index? Use your database's online/concurrent index creation — CREATE INDEX CONCURRENTLY in PostgreSQL, or online DDL tools like gh-ost and pt-online-schema-change in MySQL. These build the index without blocking writes, at the cost of taking longer.

Q: Do I need all this for a small project? If you have low traffic and can tolerate a brief maintenance window, a simple migration is fine. These patterns pay off when downtime is unacceptable, tables are large, or you deploy continuously. Adopt the discipline early enough that it's a habit before scale forces it on you.

Conclusion

Zero downtime migrations come down to one mindset: never put the database in a state that's incompatible with running code. Split breaking changes into expand-contract phases, avoid long locks, backfill in batches, and lean on feature flags for instant rollback. It takes more steps than a single ALTER TABLE, but the reward is the ability to evolve your schema continuously while your users never notice a thing.

Sources

Related Articles

Model Context Protocol (MCP): Give AI Access to Systems

Web Workers Practical Guide: Offload Heavy Tasks

AI Code Review: The Complete Guide for Engineering Teams (2026)

A definitive, practical guide to AI code review in 2026 — how it works, where it helps and where it doesn't, how to roll it out, prompt and config patterns, security trade-offs, and the metrics that prove it's working.