The frontend team shipped fast. Some weeks we pushed three releases in a day. Every deploy created the same support message: "the site is down again."
It was not down. It was rebuilding.
The Problem
The app is a Next.js frontend running on a single AWS EC2 instance inside Docker. The deployment looked simple enough:
git pull
docker build -t web:latest .
docker stop web && docker rm web
docker run -d --name web -p 3000:3000 web:latest
That script works. It also creates the outage.
The build step takes a while, especially on a cold cache. The old container gets stopped. Nginx loses its backend and starts returning 502s. Then the new container starts, binds the port, and warms up. That whole sequence usually cost us 2 to 4 minutes of downtime.
Users felt it most when they were already in the middle of something, like a form submission or checkout flow.
"We deploy at night" is not a fix. It just changes who gets to see the outage. The deploy itself was the outage.
What I Picked, and Why
Three real options were on the table. Blue-green on the existing instance is the one that won out.
Blue-Green means two copies of the app are alive at once. On a single instance with limited RAM, you need enough headroom for both containers. For this Next.js app, around 350 MB per container was fine. For something heavier, it might not be.
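If you are not sure whether your instance has that headroom, measure what the current container actually uses before committing to two slots. A quick check with standard Docker and Linux tooling:
# Per-container memory usage right now, plus what the host has left over.
docker stats --no-stream --format "table {{.Name}}\t{{.MemUsage}}"
free -h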
What Blue-Green Actually Is
Two identical environments, side by side. One handles traffic. The other sits there waiting for the next release. Build the new version in the idle slot, check it, then flip the proxy. The old slot stays around long enough to make rollback easy.
The important bit is the reload. Nginx is not restarted. It rereads the config, starts new workers, and lets the old workers finish their in-flight requests before they exit. That is why the switch feels atomic from the outside.
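You can watch the mechanism on the host itself: the master PID stays the same across a reload while the worker PIDs are replaced.
ps -C nginx -o pid,ppid,etime,cmd   # note the master and its current workers
sudo nginx -s reload
ps -C nginx -o pid,ppid,etime,cmd   # same master, new workers; old workers exit once their requests finish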
The Implementation
This section walks through every change you need to turn a single-container Docker deploy into a blue-green deploy. By the end of it, the host will run two container slots on fixed ports, the Nginx config will be split into a stable file and a swappable file, and a single Bash script will orchestrate every release.
Throughout the steps, replace app.example.com, the image name web, and the route /api/health with the names that match your own application.
Read the whole section before changing anything in production. The pieces depend on each other — the Nginx split (Step 2) has to be in place before the deploy script (Step 4) will work, and the script will only succeed once the application exposes a health endpoint (Step 3).
Step 1 — Reserve a fixed port for each color
Choose two unused ports on the host: one for the blue slot, one for the green slot. Any pair of high ports works. The rest of this guide uses 3001 for blue and 3002 for green.
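Before settling on a pair, confirm nothing else is already listening on them (ss ships with iproute2 on most Linux hosts):
sudo ss -ltn | grep ':300[12] ' || echo "ports 3001 and 3002 are free"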
Update your docker run command so the container can be started against either port. The application keeps listening on its standard internal port (Next.js defaults to 3000); only the host-side mapping changes:
# blue slot
docker run -d --name web-blue -p 3001:3000 web:current
# green slot — started later by the deploy script
docker run -d --name web-green -p 3002:3000 web:next
Pinning one port per color removes a whole class of race conditions. The script never has to discover or allocate a free port, and the Nginx config always knows where each color lives.
Step 2 — Split the Nginx config into two files
The deploy needs to change one line of Nginx config and reload — nothing more. To make that possible, separate the parts of the config that change every release from the parts that never do.
Create the swappable file. Add a new file at /etc/nginx/conf.d/upstream.conf. This is the only file the deploy script will ever rewrite:
upstream app {
    server 127.0.0.1:3001;   # blue (currently live)
    # server 127.0.0.1:3002; # green (idle)
}
Update the server block. Open your existing site config (commonly /etc/nginx/sites-available/app.conf or /etc/nginx/conf.d/app.conf). Replace any hard-coded proxy_pass http://127.0.0.1:3000 with a reference to the upstream defined above — the proxy_pass line is the only one that has to change:
server {
    listen 443 ssl http2;
    server_name app.example.com;

    # SSL certs, gzip, and security headers omitted for brevity.

    location / {
        proxy_pass http://app;
        proxy_http_version 1.1;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_read_timeout 60s;
        proxy_connect_timeout 5s;
    }
}
On most installs the stock nginx.conf already pulls in every *.conf file under /etc/nginx/conf.d/, so the upstream block is visible here without an explicit include; if yours does not, add include /etc/nginx/conf.d/upstream.conf; at the http level. Including the file twice makes nginx -t fail with a duplicate upstream error.
Validate the new layout. Run the following on the host and confirm both commands succeed and the site still serves traffic:
sudo nginx -t
sudo nginx -s reload
From this point on, the server block never changes. The deploy only ever touches upstream.conf.
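As a final sanity check that traffic still flows through the new upstream block, one request through the proxy should come back with a success status — substitute your own hostname:
curl -fsSI https://app.example.com/ | head -n 1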
Step 3 — Expose a meaningful health endpoint
The deploy script promotes the new container only after /api/health returns 200. That endpoint must therefore answer one question precisely: can this container serve real user traffic right now?
A health endpoint that returns 200 the moment the process starts is worse than no endpoint at all — it promotes containers that cannot actually serve users. Make the endpoint exercise every dependency a real request would touch (the database, the cache, any required outbound API). If any of those fail, return a 5xx.
Add the route to your application. The example below uses Next.js App Router; adapt the imports to match your framework:
// app/api/health/route.ts
import { NextResponse } from "next/server";
import { db } from "@/lib/db";
export async function GET() {
  try {
    await db.raw("SELECT 1");
    return NextResponse.json({ ok: true });
  } catch {
    return NextResponse.json({ ok: false }, { status: 503 });
  }
}
Verify the endpoint behaves correctly before wiring it into the deploy script. Run the application locally, stop the database, and confirm that /api/health returns 503 — not 200, and not an uncaught 500.
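A quick way to check, assuming the app is running locally on its default port 3000 and its database is stopped:
curl -si http://localhost:3000/api/health | head -n 1
# Expected: HTTP/1.1 503 Service Unavailable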
Step 4 — Build the deploy script in stages
Create a new file at deploy/deploy.sh in your repository. This section walks through the script one stage at a time so you understand each block before you commit the whole thing. The full, ready-to-run version follows at the end.
Open the file with this header:
#!/usr/bin/env bash
set -euo pipefail
UPSTREAM_FILE=/etc/nginx/conf.d/upstream.conf
HEALTH_PATH=/api/health
HEALTH_TIMEOUT=90
DRAIN_SECONDS=30
set -euo pipefail makes the script abort on the first failure, refuse to read undefined variables, and surface errors anywhere in a pipeline. Without it, a partial config rewrite could quietly continue and leave the system in a broken state.
4.1 Detect which color is currently live
Read upstream.conf and determine which port is uncommented. The live color is the deployment source; the other color is the deployment target.
if grep -q "^ *server 127.0.0.1:3001;" "$UPSTREAM_FILE"; then
  LIVE_COLOR=blue;  LIVE_PORT=3001
  NEXT_COLOR=green; NEXT_PORT=3002
else
  LIVE_COLOR=green; LIVE_PORT=3002
  NEXT_COLOR=blue;  NEXT_PORT=3001
fi
echo "Live: $LIVE_COLOR → Deploying: $NEXT_COLOR"
The echo line gives you confidence at a glance that the script picked the right slot before it does anything destructive.
4.2 Build the image and start the idle container
Build the image once and tag it web:next. Then remove any leftover container in the idle slot from a prior failed deploy and start the new one:
docker build -t web:next .
docker rm -f "web-$NEXT_COLOR" 2>/dev/null || true
docker run -d --name "web-$NEXT_COLOR" -p "$NEXT_PORT:3000" web:nextThe live color is untouched throughout this stage. Users continue to receive 200s from the previous version.
4.3 Probe the health endpoint until the new container is ready
Poll the new container directly on its host port. Stop the deploy if the container never goes healthy:
for i in $(seq 1 "$HEALTH_TIMEOUT"); do
  if curl -fsS "http://127.0.0.1:$NEXT_PORT$HEALTH_PATH" >/dev/null; then
    echo "Healthy after ${i}s"
    break
  fi
  sleep 1
  if [ "$i" -eq "$HEALTH_TIMEOUT" ]; then
    echo "Health check failed — aborting deploy"
    docker rm -f "web-$NEXT_COLOR"
    exit 1
  fi
done
Two properties matter here. First, the script polls the new container directly, not through Nginx, so the live color stays untouched even while the new container is still warming up. Second, on timeout, the script removes the failing container and exits non-zero — Nginx is never reloaded, and users never see anything.
4.4 Swap the upstream and validate the new config
Back up the live config, then rewrite it. Validate the new file with nginx -t before you signal the running master process:
cp "$UPSTREAM_FILE" "${UPSTREAM_FILE}.bak"
sed -i \
  -e "s|^\( *\)server 127.0.0.1:$LIVE_PORT;|\1# server 127.0.0.1:$LIVE_PORT;|" \
  -e "s|^\( *\)# server 127.0.0.1:$NEXT_PORT;|\1server 127.0.0.1:$NEXT_PORT;|" \
  "$UPSTREAM_FILE"
if ! nginx -t; then
  mv "${UPSTREAM_FILE}.bak" "$UPSTREAM_FILE"
  docker rm -f "web-$NEXT_COLOR"
  exit 1
fi
If nginx -t reports any error, the script restores the backup, removes the new container, and exits. The system is now in exactly the state it was in before the deploy started.
4.5 Reload Nginx — the actual flip
Tell the running Nginx master process to reread its config:
nginx -s reload
This is the only line in the entire script that changes user-visible behaviour. The master spawns new worker processes that read the rewritten upstream and signals the old workers to stop accepting new connections. In-flight requests on the old workers complete normally; new requests route to the new color from the moment the reload returns.
4.6 Drain the old color, then recycle it
Wait for the old workers to finish, then stop the previous container and re-tag the image so a rollback is one command away:
sleep "$DRAIN_SECONDS"
docker stop "web-$LIVE_COLOR" && docker rm "web-$LIVE_COLOR"
docker tag web:next web:current
echo "Deploy complete — $NEXT_COLOR is live"Set DRAIN_SECONDS to a value comfortably above the slowest response your application produces. Thirty seconds is a safe default for typical web apps; raise it for endpoints that stream long responses or hold open WebSocket-style connections.
The full script
Save the whole file as deploy/deploy.sh:
#!/usr/bin/env bash
set -euo pipefail
UPSTREAM_FILE=/etc/nginx/conf.d/upstream.conf
HEALTH_PATH=/api/health
HEALTH_TIMEOUT=90
DRAIN_SECONDS=30
# 1. Detect which color is currently live.
if grep -q "^ *server 127.0.0.1:3001;" "$UPSTREAM_FILE"; then
  LIVE_COLOR=blue;  LIVE_PORT=3001
  NEXT_COLOR=green; NEXT_PORT=3002
else
  LIVE_COLOR=green; LIVE_PORT=3002
  NEXT_COLOR=blue;  NEXT_PORT=3001
fi
echo "Live: $LIVE_COLOR → Deploying: $NEXT_COLOR"
# 2. Build and start the idle color.
docker build -t web:next .
docker rm -f "web-$NEXT_COLOR" 2>/dev/null || true
docker run -d --name "web-$NEXT_COLOR" -p "$NEXT_PORT:3000" web:next
# 3. Wait for the new container to go healthy.
for i in $(seq 1 "$HEALTH_TIMEOUT"); do
  if curl -fsS "http://127.0.0.1:$NEXT_PORT$HEALTH_PATH" >/dev/null; then
    echo "Healthy after ${i}s"
    break
  fi
  sleep 1
  if [ "$i" -eq "$HEALTH_TIMEOUT" ]; then
    echo "Health check failed — aborting deploy"
    docker rm -f "web-$NEXT_COLOR"
    exit 1
  fi
done
# 4. Swap the upstream and validate.
cp "$UPSTREAM_FILE" "${UPSTREAM_FILE}.bak"
sed -i \
  -e "s|^\( *\)server 127.0.0.1:$LIVE_PORT;|\1# server 127.0.0.1:$LIVE_PORT;|" \
  -e "s|^\( *\)# server 127.0.0.1:$NEXT_PORT;|\1server 127.0.0.1:$NEXT_PORT;|" \
  "$UPSTREAM_FILE"
if ! nginx -t; then
  mv "${UPSTREAM_FILE}.bak" "$UPSTREAM_FILE"
  docker rm -f "web-$NEXT_COLOR"
  exit 1
fi
# 5. Reload Nginx — the actual flip.
nginx -s reload
# 6. Drain the old color, then recycle it.
sleep "$DRAIN_SECONDS"
docker stop "web-$LIVE_COLOR" && docker rm "web-$LIVE_COLOR"
docker tag web:next web:current
echo "Deploy complete — $NEXT_COLOR is live"Mark the script executable so the host can run it directly:
chmod +x deploy/deploy.shA single deploy is now sudo ./deploy/deploy.sh.
Step 5 — Walk the script end-to-end before you run it
Before the first real run, trace the flow on paper, including the two abort branches: detect the live color, build the image, start the idle container, and poll its health endpoint. If the probe never succeeds, the new container is removed and the script exits with the live color untouched. If it does succeed, upstream.conf is rewritten and validated with nginx -t; a validation failure restores the backup, removes the new container, and exits. Only after both gates pass does the reload flip traffic, and only then is the old color drained and recycled.
Step 6 — Test the deploy on a non-production host first
Run the full script on a staging host before you point real traffic at it. At minimum, exercise three scenarios and confirm each behaves as expected:
- Happy path. Run ./deploy/deploy.sh. Confirm that the live color flips, that no requests fail during the run (the curl loop after this list is one way to watch), and that the previous container is removed after the drain window.
- Health failure. Temporarily break a dependency the new image needs (for example, point the database URL at an unreachable host). Run the script. Confirm that the health probe times out, that the new container is removed, and that Nginx still routes to the old color.
- Config validation failure. Introduce a typo into the server block. Run the script. Confirm that nginx -t fails, that upstream.conf is restored from its backup, and that the new container is removed.
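For the happy-path run, a loop like this in a second terminal makes any dropped request obvious; the hostname is a placeholder for your staging host:
# Prints a timestamp and status code twice a second; every line should stay 200 through the whole deploy.
while true; do
  printf '%s %s\n' "$(date +%T)" "$(curl -s -o /dev/null -w '%{http_code}' https://staging.example.com/)"
  sleep 0.5
done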
Once those three pass on staging, schedule the first production deploy for a low-traffic window and watch the logs without pressure. After that, return to your normal release schedule.
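Watching the logs can be as simple as tailing both sides of the proxy — the paths assume a stock Nginx install, and the container name should match whichever color is being deployed:
sudo tail -f /var/log/nginx/access.log /var/log/nginx/error.log
docker logs -f web-green   # or web-blue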
The Request Path, Before and After
With the single-container deploy, requests start failing the moment the old container is stopped and keep failing until the new one has built, started, and warmed up. With blue-green, requests keep landing on the old color right up to the reload and on the new color immediately after it; at no point is there a window with nothing behind the proxy.
The difference is not that the deploy got a little faster. The difference is that the user never sees the handoff at all. Deploy and service stopped being the same event.
Rollback
Rollback is basically the same flip in reverse, only without the rebuild. Because the old container stays alive during the drain window, the fastest rollback is usually just pointing Nginx back at the previous color. That takes less than a second.
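As a sketch, flipping back from green to blue by hand is the same edit the deploy script makes, just in the other direction — swap the ports if you are rolling the other way:
sudo sed -i \
  -e 's|^\( *\)server 127.0.0.1:3002;|\1# server 127.0.0.1:3002;|' \
  -e 's|^\( *\)# server 127.0.0.1:3001;|\1server 127.0.0.1:3001;|' \
  /etc/nginx/conf.d/upstream.conf
sudo nginx -t && sudo nginx -s reload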
Even after the drain window, the previous image is still on the host — it loses the web:current tag to the new build, but its layers stay in the local image cache. So if I need the previous version back later, it is still one docker run against that image plus one nginx -s reload. No rebuild, no waiting for the cache, no hunting through old commits.
After two months in production, there were no user-reported deploy outages. When rollback was needed, it usually took under 5 seconds. Same EC2 instance, same monthly bill.
What I Would Do Differently
- Move the upstream switch off the file system. Rewriting upstream.conf with sed works, but it is the weakest part of the setup. A tiny key-value store would make the flip a single write instead of a file rewrite plus reload.
- Add a real smoke test before the flip. Health checks tell me the container started. They do not tell me the homepage renders or the important routes still work. A few HTTP checks would catch more bad builds (a sketch follows this list).
- Build in CI instead of on the EC2 box. Right now the production host is also paying for the build. That is acceptable for a small app, but a prebuilt image would take pressure off the live machine.
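A minimal version of that smoke test, dropped into the script just before the reload — the routes here are placeholders for whatever pages matter in your app:
# Hypothetical routes; abort the deploy if any of them fail on the new container.
for path in / /login /checkout /api/health; do
  curl -fsS -o /dev/null "http://127.0.0.1:$NEXT_PORT$path" \
    || { echo "Smoke test failed on $path — aborting"; docker rm -f "web-$NEXT_COLOR"; exit 1; }
done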
I would keep the strategy itself. For a single-host Next.js app that needed zero downtime without moving platforms, blue-green behind Nginx was the smallest fix that actually solved the problem.