Docker · Nginx · AWS · Next.js · Bash · DevOps · Blue-Green

Zero Downtime Frontend Deploys with Blue-Green on a Single EC2

A real blue-green setup on one AWS EC2 instance with Docker and Nginx that removed the downtime caused by rebuilds and container restarts.

May 2026 · Saad Hasan

The frontend team shipped fast. Some weeks we pushed three releases in a day. Every deploy created the same support message: "the site is down again."

It was not down. It was rebuilding.

  • Downtime per deploy (before): ~2-4 min
  • Downtime per deploy (after): 0 s
  • Deploys per day: up to 3
  • Infra cost change: $0

The Problem

The app is a Next.js frontend running on a single AWS EC2 instance inside Docker. The deployment looked simple enough:

git pull
docker build -t web:latest .
docker stop web && docker rm web
docker run -d --name web -p 3000:3000 web:latest

That script works. It also creates the outage.

The build step takes a while, especially on a cold cache. The old container gets stopped. Nginx loses its backend and starts returning 502s. Then the new container starts, binds the port, and warms up. That whole sequence usually cost us 2 to 4 minutes of downtime.

Users felt it most when they were already in the middle of something, like a form submission or checkout flow.

Watch out

"We deploy at night" is not a fix. It just changes who gets to see the outage. The deploy itself was the outage.


What I Picked, and Why

Three real options were on the table, and blue-green on the existing instance won out.

Trade-off

Blue-Green means two copies of the app are alive at once. On a single instance with limited RAM, you need enough headroom for both containers. For this Next.js app, around 350 MB per container was fine. For something heavier, it might not be.


What Blue-Green Actually Is

Two identical environments, side by side. One handles traffic. The other sits there waiting for the next release. Build the new version in the idle slot, check it, then flip the proxy. The old slot stays around long enough to make rollback easy.


The important bit is the reload. Nginx is not restarted. It rereads the config, starts new workers, and lets the old workers finish their in-flight requests before they exit. That is why the switch feels atomic from the outside.
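
The same graceful reload can be expressed directly as a signal. A small sketch (the pid file path comes from the `pid` directive in nginx.conf, so it may differ on your host):

```shell
# "-s reload" simply sends SIGHUP to the running nginx master process
sudo nginx -s reload

# same effect, spelled out via the pid file
sudo kill -HUP "$(cat /var/run/nginx.pid)"
```

Either way, no listening socket is ever closed, which is why clients never see a connection refused during the flip.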


The Implementation

This section walks through every change you need to turn a single-container Docker deploy into a blue-green deploy. By the end of it, the host will run two container slots on fixed ports, the Nginx config will be split into a stable file and a swappable file, and a single Bash script will orchestrate every release.

Throughout the steps, replace app.example.com, the image name web, and the route /api/health with the names that match your own application.

Insight

Read the whole section before changing anything in production. The pieces depend on each other — the Nginx split (Step 2) has to be in place before the deploy script (Step 4) will work, and the script will only succeed once the application exposes a health endpoint (Step 3).

Step 1 — Reserve a fixed port for each color

Choose two unused ports on the host: one for the blue slot, one for the green slot. Any pair of high ports works. The rest of this guide uses 3001 for blue and 3002 for green.

Update your docker run command so the container can be started against either port. The application keeps listening on its standard internal port (Next.js defaults to 3000); only the host-side mapping changes:

# blue slot
docker run -d --name web-blue  -p 3001:3000 web:current
 
# green slot — started later by the deploy script
docker run -d --name web-green -p 3002:3000 web:next

Pinning one port per color removes a whole class of race conditions. The script never has to discover or allocate a free port, and the Nginx config always knows where each color lives.
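
Before the first deploy, it is worth confirming that nothing else on the host already listens on the chosen pair (3001/3002 here; substitute your own ports):

```shell
# list listeners on the two candidate ports; empty output means both are free
sudo ss -ltn '( sport = :3001 or sport = :3002 )' | tail -n +2
```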

Step 2 — Split the Nginx config into two files

The deploy needs to change one line of Nginx config and reload — nothing more. To make that possible, separate the parts of the config that change every release from the parts that never do.

Create the swappable file. Add a new file at /etc/nginx/conf.d/upstream.conf. This is the only file the deploy script will ever rewrite, and it is whitespace-sensitive: the deploy swaps colors by matching these lines character for character, so copy them exactly. The uncommented server line is the live color (blue, to start):

upstream app {
    server 127.0.0.1:3001;     # blue
#    server 127.0.0.1:3002;    # green
}

Update the server block. Open your existing site config (commonly /etc/nginx/sites-available/app.conf or /etc/nginx/conf.d/app.conf). Replace any hard-coded proxy_pass http://127.0.0.1:3000 with a reference to the upstream defined above; the proxy_pass line is the only one that has to change. One caveat: if your main nginx.conf already includes /etc/nginx/conf.d/*.conf (the default on most installs), upstream.conf is loaded automatically and the explicit include below is redundant. Defining the upstream twice makes nginx -t fail, so keep exactly one of the two:

include /etc/nginx/conf.d/upstream.conf;
 
server {
    listen 443 ssl http2;
    server_name app.example.com;
 
    # SSL certs, gzip, and security headers omitted for brevity.
 
    location / {
        proxy_pass http://app;
        proxy_http_version 1.1;
        proxy_set_header Host              $host;
        proxy_set_header X-Real-IP         $remote_addr;
        proxy_set_header X-Forwarded-For   $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
 
        proxy_read_timeout 60s;
        proxy_connect_timeout 5s;
    }
}

Validate the new layout. Run the following on the host and confirm both commands succeed and the site still serves traffic:

sudo nginx -t
sudo nginx -s reload

From this point on, the server block never changes. The deploy only ever touches upstream.conf.
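
A useful side effect of this layout is that the live color can always be read straight off the file:

```shell
# the uncommented server line is the live color
grep '^    server' /etc/nginx/conf.d/upstream.conf
```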

Step 3 — Expose a meaningful health endpoint

The deploy script promotes the new container only after /api/health returns 200. That endpoint must therefore answer one question precisely: can this container serve real user traffic right now?

A health endpoint that returns 200 the moment the process starts is worse than no endpoint at all — it promotes containers that cannot actually serve users. Make the endpoint exercise every dependency a real request would touch (the database, the cache, any required outbound API). If any of those fail, return a 5xx.

Add the route to your application. The example below uses Next.js App Router; adapt the imports to match your framework:

// app/api/health/route.ts
import { NextResponse } from "next/server";
import { db } from "@/lib/db";
 
export async function GET() {
    try {
        await db.raw("SELECT 1");
        return NextResponse.json({ ok: true });
    } catch {
        return NextResponse.json({ ok: false }, { status: 503 });
    }
}

Verify the endpoint behaves correctly before wiring it into the deploy script. Run the application locally, stop the database, and confirm that /api/health returns 503 — not 200, and not an uncaught 500.
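
Concretely, with the app running locally on port 3000 (adjust to your setup), the two checks look like this:

```shell
# database up: expect a 200 status and {"ok":true}
curl -i http://127.0.0.1:3000/api/health

# now stop the database and repeat: expect a 503 and {"ok":false}
curl -i http://127.0.0.1:3000/api/health
```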

Step 4 — Build the deploy script in stages

Create a new file at deploy/deploy.sh in your repository. This section walks through the script one stage at a time so you understand each block before you commit the whole thing. The full, ready-to-run version follows at the end.

Open the file with this header:

#!/usr/bin/env bash
set -euo pipefail
 
UPSTREAM_FILE=/etc/nginx/conf.d/upstream.conf
HEALTH_PATH=/api/health
HEALTH_TIMEOUT=90
DRAIN_SECONDS=30

set -euo pipefail makes the script abort on the first failure, refuse to read undefined variables, and surface errors anywhere in a pipeline. Without it, a partial config rewrite could quietly continue and leave the system in a broken state.

4.1 Detect which color is currently live

Read upstream.conf and determine which port is uncommented. The live color is the deployment source; the other color is the deployment target.

if grep -q "^    server 127.0.0.1:3001;" "$UPSTREAM_FILE"; then
    LIVE_COLOR=blue;  LIVE_PORT=3001
    NEXT_COLOR=green; NEXT_PORT=3002
else
    LIVE_COLOR=green; LIVE_PORT=3002
    NEXT_COLOR=blue;  NEXT_PORT=3001
fi
 
echo "Live: $LIVE_COLOR  →  Deploying: $NEXT_COLOR"

The echo line gives you confidence at a glance that the script picked the right slot before it does anything destructive.

4.2 Build the image and start the idle container

Build the image once and tag it web:next. Then remove any leftover container in the idle slot from a prior failed deploy and start the new one:

docker build -t web:next .
docker rm -f "web-$NEXT_COLOR" 2>/dev/null || true
docker run -d --name "web-$NEXT_COLOR" -p "$NEXT_PORT:3000" web:next

The live color is untouched throughout this stage. Users continue to receive 200s from the previous version.

4.3 Probe the health endpoint until the new container is ready

Poll the new container directly on its host port. Stop the deploy if the container never goes healthy:

for i in $(seq 1 "$HEALTH_TIMEOUT"); do
    if curl -fsS "http://127.0.0.1:$NEXT_PORT$HEALTH_PATH" >/dev/null; then
        echo "Healthy after ${i}s"
        break
    fi
    sleep 1
    if [ "$i" -eq "$HEALTH_TIMEOUT" ]; then
        echo "Health check failed — aborting deploy"
        docker rm -f "web-$NEXT_COLOR"
        exit 1
    fi
done

Two properties matter here. First, the script polls the new container directly, not through Nginx, so the live color stays untouched even while the new container is still warming up. Second, on timeout, the script removes the failing container and exits non-zero — Nginx is never reloaded, and users never see anything.

4.4 Swap the upstream and validate the new config

Back up the live config, then rewrite it. Validate the new file with nginx -t before you signal the running master process:

cp "$UPSTREAM_FILE" "${UPSTREAM_FILE}.bak"
 
sed -i \
    -e "s|^    server 127.0.0.1:$LIVE_PORT;|#    server 127.0.0.1:$LIVE_PORT;|" \
    -e "s|^#    server 127.0.0.1:$NEXT_PORT;|    server 127.0.0.1:$NEXT_PORT;|" \
    "$UPSTREAM_FILE"
 
if ! nginx -t; then
    mv "${UPSTREAM_FILE}.bak" "$UPSTREAM_FILE"
    docker rm -f "web-$NEXT_COLOR"
    exit 1
fi

If nginx -t reports any error, the script restores the backup, removes the new container, and exits. The system is now in exactly the state it was in before the deploy started.
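
Because the sed swap is whitespace-sensitive, it is worth dry-running it against a scratch copy before trusting it with the real file. A self-contained sketch with blue live (3001) and green idle (3002):

```shell
#!/usr/bin/env bash
set -euo pipefail
LIVE_PORT=3001; NEXT_PORT=3002

# scratch copy standing in for /etc/nginx/conf.d/upstream.conf
TMP=$(mktemp)
printf 'upstream app {\n    server 127.0.0.1:%s;\n#    server 127.0.0.1:%s;\n}\n' \
    "$LIVE_PORT" "$NEXT_PORT" > "$TMP"

# the same two expressions the deploy script uses
sed -i \
    -e "s|^    server 127.0.0.1:$LIVE_PORT;|#    server 127.0.0.1:$LIVE_PORT;|" \
    -e "s|^#    server 127.0.0.1:$NEXT_PORT;|    server 127.0.0.1:$NEXT_PORT;|" \
    "$TMP"

cat "$TMP"   # 3001 should now be commented out, 3002 live
```

If the output does not show the colors swapped, the whitespace in your real upstream.conf does not match the sed patterns, and the deploy would silently flip nothing.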

4.5 Reload Nginx — the actual flip

Tell the running Nginx master process to reread its config:

nginx -s reload

This is the only line in the entire script that changes user-visible behaviour. The master spawns new worker processes that read the rewritten upstream and signals the old workers to stop accepting new connections. In-flight requests on the old workers complete normally; new requests route to the new color from the moment the reload returns.

4.6 Drain the old color, then recycle it

Wait for the old workers to finish, then stop the previous container and re-tag the images so a rollback stays one command away:

sleep "$DRAIN_SECONDS"
docker stop "web-$LIVE_COLOR" && docker rm "web-$LIVE_COLOR"
docker tag web:current web:previous 2>/dev/null || true
docker tag web:next web:current
echo "Deploy complete — $NEXT_COLOR is live"

The web:previous tag matters: the line after it moves web:current to the new build, so without it the outgoing image would be left dangling and reachable only by ID.

Set DRAIN_SECONDS to a value comfortably above the slowest response your application produces. Thirty seconds is a safe default for typical web apps; raise it for endpoints that stream long responses or hold open WebSocket-style connections.
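
One way to pick a grounded value instead of guessing is to read the slowest recent response time out of the access log. This assumes your log_format ends with $request_time, which is not the default, so add that field first if needed:

```shell
# slowest response (in seconds) seen in the current access log
awk '{ if ($NF + 0 > max) max = $NF + 0 } END { print max }' /var/log/nginx/access.log
```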

The full script

Save the whole file as deploy/deploy.sh:

#!/usr/bin/env bash
set -euo pipefail
 
UPSTREAM_FILE=/etc/nginx/conf.d/upstream.conf
HEALTH_PATH=/api/health
HEALTH_TIMEOUT=90
DRAIN_SECONDS=30
 
# 1. Detect which color is currently live.
if grep -q "^    server 127.0.0.1:3001;" "$UPSTREAM_FILE"; then
    LIVE_COLOR=blue;  LIVE_PORT=3001
    NEXT_COLOR=green; NEXT_PORT=3002
else
    LIVE_COLOR=green; LIVE_PORT=3002
    NEXT_COLOR=blue;  NEXT_PORT=3001
fi
echo "Live: $LIVE_COLOR  →  Deploying: $NEXT_COLOR"
 
# 2. Build and start the idle color.
docker build -t web:next .
docker rm -f "web-$NEXT_COLOR" 2>/dev/null || true
docker run -d --name "web-$NEXT_COLOR" -p "$NEXT_PORT:3000" web:next
 
# 3. Wait for the new container to go healthy.
for i in $(seq 1 "$HEALTH_TIMEOUT"); do
    if curl -fsS "http://127.0.0.1:$NEXT_PORT$HEALTH_PATH" >/dev/null; then
        echo "Healthy after ${i}s"
        break
    fi
    sleep 1
    if [ "$i" -eq "$HEALTH_TIMEOUT" ]; then
        echo "Health check failed — aborting deploy"
        docker rm -f "web-$NEXT_COLOR"
        exit 1
    fi
done
 
# 4. Swap the upstream and validate.
cp "$UPSTREAM_FILE" "${UPSTREAM_FILE}.bak"
sed -i \
    -e "s|^    server 127.0.0.1:$LIVE_PORT;|#    server 127.0.0.1:$LIVE_PORT;|" \
    -e "s|^#    server 127.0.0.1:$NEXT_PORT;|    server 127.0.0.1:$NEXT_PORT;|" \
    "$UPSTREAM_FILE"
 
if ! nginx -t; then
    mv "${UPSTREAM_FILE}.bak" "$UPSTREAM_FILE"
    docker rm -f "web-$NEXT_COLOR"
    exit 1
fi
 
# 5. Reload Nginx — the actual flip.
nginx -s reload
 
# 6. Drain the old color, then recycle it.
sleep "$DRAIN_SECONDS"
docker stop "web-$LIVE_COLOR" && docker rm "web-$LIVE_COLOR"
docker tag web:current web:previous 2>/dev/null || true
docker tag web:next web:current
echo "Deploy complete — $NEXT_COLOR is live"

Mark the script executable so the host can run it directly:

chmod +x deploy/deploy.sh

A single deploy is now sudo ./deploy/deploy.sh.

Step 5 — Walk the script end to end before you run it

Trace every path through the script, including the two abort branches: the health-check timeout in 4.3 and the failed nginx -t in 4.4. Both must leave the old color serving traffic and the host exactly as the deploy found it.

Step 6 — Test the deploy on a non-production host first

Run the full script on a staging host before you point real traffic at it. At minimum, exercise three scenarios and confirm each behaves as expected:

  • Happy path. Run ./deploy/deploy.sh. Confirm that the live color flips, that no requests fail during the run, and that the previous container is removed after the drain window.
  • Health failure. Temporarily break a dependency the new image needs (for example, point the database URL at an unreachable host). Run the script. Confirm that the health probe times out, that the new container is removed, and that Nginx still routes to the old color.
  • Config validation failure. Introduce a typo into the server block. Run the script. Confirm that nginx -t fails, that upstream.conf is restored from its backup, and that the new container is removed.
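
For the "no requests fail" part of the happy path, a crude probe from a second terminal is enough. A sketch (point it at whatever URL your staging host actually serves):

```shell
# run while ./deploy/deploy.sh executes; prints only non-200 responses
while true; do
    code=$(curl -s -o /dev/null -w '%{http_code}' http://127.0.0.1/)
    [ "$code" = "200" ] || echo "$(date +%T)  got $code"
    sleep 0.2
done
```

If this loop stays silent from start to finish of the deploy, the flip really was invisible.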

Once those three pass on staging, schedule the first production deploy for a low-traffic window and watch the logs without pressure. After that, return to your normal release schedule.


The Request Path, Before and After

With the single-container deploy, requests start failing the moment the old container is stopped and keep failing until the new one has warmed up. With blue-green, every request is answered for the entire length of the deploy.

The difference is not that the deploy got a little faster. The difference is that the user never sees the handoff at all. Deploy and service stopped being the same event.


Rollback

Rollback is basically the same flip in reverse, only without the rebuild. Because the old container stays alive during the drain window, the fastest rollback is usually just pointing Nginx back at the previous color. That takes less than a second.
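
That reverse flip can be scripted the same way as the deploy. A hedged sketch, rollback.sh, reusing the deploy script's detection and swap logic; it assumes the other color's container is still running, i.e. you are inside the drain window or have restarted it:

```shell
#!/usr/bin/env bash
# rollback.sh: point Nginx at the other color without rebuilding anything
set -euo pipefail
UPSTREAM_FILE=/etc/nginx/conf.d/upstream.conf

# whichever port is live now becomes the one we comment out
if grep -q "^    server 127.0.0.1:3001;" "$UPSTREAM_FILE"; then
    FROM_PORT=3001; TO_PORT=3002
else
    FROM_PORT=3002; TO_PORT=3001
fi

sed -i \
    -e "s|^    server 127.0.0.1:$FROM_PORT;|#    server 127.0.0.1:$FROM_PORT;|" \
    -e "s|^#    server 127.0.0.1:$TO_PORT;|    server 127.0.0.1:$TO_PORT;|" \
    "$UPSTREAM_FILE"

nginx -t && nginx -s reload
echo "Rolled back: port $TO_PORT is live"
```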

The previous image also survives the deploy on disk; tagging it (say web:previous) before the final retag keeps it addressable by name rather than only by ID. With that in place, bringing the previous version back later is one docker run plus one nginx -s reload. No rebuild, no waiting for the cache, no hunting through old commits.

Result

After two months in production, there were no user-reported deploy outages. When rollback was needed, it usually took under 5 seconds. Same EC2 instance, same monthly bill.


What I Would Do Differently

  • Move the upstream switch off the file system. Rewriting upstream.conf with sed works, but it is the weakest part of the setup. A tiny key-value store would make the flip a single write instead of a file rewrite plus reload.
  • Add a real smoke test before the flip. Health checks tell me the container started. They do not tell me the homepage renders or the important routes still work. A few HTTP checks would catch more bad builds.
  • Build in CI instead of on the EC2 box. Right now the production host is also paying for the build. That is acceptable for a small app, but a prebuilt image would take pressure off the live machine.

I would keep the strategy itself. For a single-host Next.js app that needed zero downtime without moving platforms, blue-green behind Nginx was the smallest fix that actually solved the problem.