Video is a solved problem until you try to solve it yourself.
YouTube makes it look effortless. Upload a file, it appears. Press play, it plays. Behind that simplicity is a decade of infrastructure engineering and hundreds of millions in capital. This project is about understanding what that looks like at the seam level — the part where a raw file upload becomes a stream that survives hundreds of concurrent viewers.
I built this as a deliberate proof-of-concept. Not to ship a product. To design, implement, and break a real distributed pipeline — one that covers the full lifecycle from upload to adaptive bitrate playback.
The Architecture
Five services. Each one has exactly one job. They talk through a queue (BullMQ on Redis) and a shared object store (MinIO). There are no direct HTTP calls between Upload and Transcode — the queue is the contract, and that separation is intentional.
Each service is its own Docker container. Docker Compose ties them together locally — each gets its own network namespace, its own process, and its own failure domain. The only shared resources are Redis and MinIO. Everything else is isolated.
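Because the queue is the only contract between Upload and Transcode, that contract is worth making explicit. A sketch of the job payload with a runtime guard; the field names here are hypothetical, not taken from the actual codebase:

```typescript
// Hypothetical shape of the job record that crosses the queue boundary.
// Field names are illustrative; the real codebase may differ.
interface TranscodeJob {
  videoId: string; // object key prefix in MinIO
  path: string;    // local path of the assembled upload
}

// Runtime guard: the worker should validate what it dequeues, because
// the queue is an untyped boundary between two separate processes.
function isTranscodeJob(data: unknown): data is TranscodeJob {
  const d = data as Record<string, unknown> | null;
  return typeof d?.videoId === "string" && typeof d?.path === "string";
}
```

A guard like this turns a malformed job into an explicit failure instead of a confusing FFmpeg error three steps later.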
The Pipeline, Step by Step
HLS: Why and How
HLS (HTTP Live Streaming) works by pre-slicing a video into small chunks — in this case, 4-second .ts files. The player downloads them in sequence, buffers a few ahead, and switches renditions based on available bandwidth.
The file count scales linearly with duration: at 4-second segments, a 10-minute video produces 150 .ts files per rendition, 600 across four renditions, plus the playlists.
All those files land in MinIO. The Streaming Service serves them as plain HTTP objects — zero video processing at request time. Work once at upload, serve cheaply forever.
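The segment math is simple enough to compute directly. A sketch using the 4-second segment length and four renditions described above:

```typescript
const SEGMENT_SECONDS = 4;
const RENDITION_COUNT = 4;

// Files produced for a video of a given duration: one .ts segment per
// 4 seconds per rendition, one media playlist per rendition, plus a
// single master playlist.
function hlsFileCount(durationSeconds: number): number {
  const segmentsPerRendition = Math.ceil(durationSeconds / SEGMENT_SECONDS);
  const playlists = RENDITION_COUNT + 1; // media playlists + master
  return segmentsPerRendition * RENDITION_COUNT + playlists;
}
```

A 10-minute video works out to 150 segments per rendition and 605 files in total, every one of them a plain object in MinIO.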
Why HLS and not DASH? iOS Safari does not support DASH without a plugin. HLS plays natively on every Apple device. In a system where you cannot control the client environment, that settles the argument. DASH is the more flexible specification, but browser support for it is still inconsistent enough to make HLS the safer default.
The Streaming Service does not touch video data. It is a thin proxy in front of MinIO object paths. All the heavy lifting happened once, in the Transcode Worker, when the video was uploaded — not at playback time for every viewer.
The Renditions
Four quality levels, from 360p up to 1080p, each a separate FFmpeg invocation.
Codec: libx264 + aac. CRF 23 across all renditions. veryfast is the right preset for a background worker — it trades a modest file size increase for significantly shorter encode time, and no viewer is waiting on the HTTP response.
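What one of those invocations looks like can be sketched as an argument builder. The flags mirror the settings described above (libx264 + aac, CRF 23, veryfast, 4-second segments); this is illustrative, not the codebase's actual runFFmpeg wrapper:

```typescript
interface Rendition {
  name: string;   // e.g. "720p"
  height: number; // target frame height
}

// Build the FFmpeg argument list for one HLS rendition. Real-world
// invocations usually add more (segment filename pattern, audio
// bitrate); this keeps only the settings named in the text.
function ffmpegArgs(input: string, r: Rendition, outDir: string): string[] {
  return [
    "-i", input,
    "-vf", `scale=-2:${r.height}`, // keep aspect ratio, force even width
    "-c:v", "libx264", "-preset", "veryfast", "-crf", "23",
    "-c:a", "aac",
    "-hls_time", "4",              // 4-second .ts segments
    "-hls_playlist_type", "vod",
    `${outDir}/${r.name}.m3u8`,
  ];
}
```

Because CRF is constant across renditions, the bitrate differences come entirely from the scale filter: smaller frames simply need fewer bits at the same quality target.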
All four renditions run sequentially — one await runFFmpeg() after another. Total transcode time is roughly 4× the single-rendition time. Running them with Promise.all would cut wall time to just the slowest rendition (1080p). It is a real, straightforward optimization that was left on the table.
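The parallel version is a small change. A sketch with the encoder injected as a function, standing in for the codebase's runFFmpeg:

```typescript
type Encoder = (input: string, rendition: string) => Promise<void>;

// Start every rendition immediately. Wall time is bounded by the
// slowest encode (1080p) instead of the sum of all four.
async function transcodeAll(
  encode: Encoder, // stands in for the codebase's runFFmpeg wrapper
  input: string,
  renditions: string[],
): Promise<void> {
  await Promise.all(renditions.map((r) => encode(input, r)));
}
```

One caveat worth weighing: four concurrent FFmpeg processes compete for the same CPU cores, so the real speedup depends on how many cores the worker container is allowed to use.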
There is also a naming trap in the codebase. default-renditions.ts exports two arrays from the same file:
export const DEFAULT_RENDITIONS = [RENDITIONS[0]]; // 360p only
export const renditions = RENDITIONS; // all four

The transcodeVideo function uses DEFAULT_RENDITIONS as its default parameter. The worker explicitly passes renditions — so it works correctly. But call transcodeVideo(path) anywhere else without the argument and you silently get 360p-only output. No error, no warning, no indication that three renditions were skipped.
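The trap is easy to reproduce in isolation. A sketch with the renditions reduced to plain strings (rendition names beyond 360p and 1080p are illustrative):

```typescript
const RENDITIONS = ["360p", "480p", "720p", "1080p"];
const DEFAULT_RENDITIONS = [RENDITIONS[0]]; // 360p only

// Mirrors the shape of the real function: the default parameter
// silently narrows output to one rendition when the caller omits it.
function transcodeVideo(path: string, targets = DEFAULT_RENDITIONS): string[] {
  return targets; // stand-in for the encode loop
}
```

Calling transcodeVideo("a.mp4") yields only ["360p"], while transcodeVideo("a.mp4", RENDITIONS) yields all four. Flipping the default to the full array would make single-rendition output the explicit opt-in rather than the silent fallback.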
The Hardest Architectural Decision
Making transcoding asynchronous.
The naive version: file arrives, server starts FFmpeg, client waits for the response. That breaks immediately. Four renditions of a 2-minute video at the veryfast preset take 30–90 seconds on a typical dev machine. No HTTP client waits 90 seconds, and no server should block its event loop on a CPU-heavy child process for that long.
The actual design: the Upload Service returns { uploaded: true } the moment the file is queued. A completely separate Node.js process — the Transcode Worker — picks it up from BullMQ and does the heavy work. The client polls or gets notified when the video transitions to "Ready."
BullMQ on Redis was the right tool here. Job persistence, configurable retry semantics, concurrency limits, and a UI for inspecting queue state — all built in. If the worker crashes mid-transcode, the job stays in Redis. It does not vanish.
The queue is not just a performance optimization. It is the boundary that lets the Upload Service and Transcode Worker scale, fail, and deploy independently. You could run ten worker containers processing jobs in parallel while the Upload Service stays at a single instance. The contract between them is just a job record in Redis.
What Actually Broke
Building this surfaced several real issues. None of these are theoretical — they are visible in the code.
1. Race Condition in Chunk Assembly
When all chunks arrive, the Upload Service assembles them like this:
const writeableStream = fs.createWriteStream(outputPath);
for (const chunkPath of chunkPaths) {
  const readable = fs.createReadStream(chunkPath);
  readable.pipe(writeableStream, { end: false });
}
writeableStream.end();
// BullMQ job is added here — immediately after end()
await queue.add("transcode", { path: outputPath });

The problem: writeableStream.end() schedules the stream to close — it does not block until the file is fully flushed to disk. The BullMQ job gets queued before the finish event fires. If the Transcode Worker picks up the job quickly and opens the file, it may read a partially-written file. (The loop hides a second ordering hazard: every chunk is piped at once, so data from different chunks can interleave; robust assembly pipes the chunks one at a time, waiting for each to drain before starting the next.)
The fix is to queue the job only after the finish event fires:
writeableStream.on("finish", async () => {
await queue.add("transcode", { path: outputPath });
});
writeableStream.end();2. No Dead Letter Queue
BullMQ ships configurable retry policies and keeps failed jobs in a set that can serve as a dead letter queue. This codebase uses neither. A job that fails — say, FFmpeg encounters a corrupt source file, runs out of memory, or the MinIO write times out — gets marked as failed in Redis and stays there. No retry, no alert, no notification to the user.
The README lists this as a TODO. In production, it means silently lost uploads with no recovery path for the user or the operator.
A single FFmpeg crash permanently loses that transcode job. No retry fires. The video sits in a "Processing" state indefinitely. This is the failure mode that generates support tickets at 2am.
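A minimal recovery path is mostly configuration plus one hook. A sketch, assuming BullMQ's attempts/backoff job options; markVideoFailed and notifyOperator are placeholders, not codebase APIs:

```typescript
// Retry policy attached when the job is enqueued, e.g.
// queue.add("transcode", payload, retryOptions). attempts and backoff
// are real BullMQ job options; the values here are illustrative.
const retryOptions = {
  attempts: 3,
  backoff: { type: "exponential" as const, delay: 10_000 },
};

// What a failed-job hook could do once retries are exhausted.
// Dependencies are injected so the escalation logic stays testable.
function onJobFailed(
  videoId: string,
  attemptsMade: number,
  markVideoFailed: (id: string) => void,  // placeholder: flip DB status
  notifyOperator: (msg: string) => void,  // placeholder: page someone
): void {
  if (attemptsMade >= retryOptions.attempts) {
    markVideoFailed(videoId); // move the video out of "Processing" limbo
    notifyOperator(`transcode permanently failed for ${videoId}`);
  }
}
```

With this in place a corrupt upload ends in an explicit "Failed" state the user can see, instead of an eternal "Processing" spinner.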
3. Chunk State Lives on the Filesystem
The chunked upload mechanism tracks received chunks in a JSON metadata file on the Upload Service's local disk.
If the Upload Service container restarts mid-upload, that metadata file is gone. The client has no way to resume — all previously sent chunks are lost. In a production system this state belongs in Redis or a database, where it survives restarts and is accessible across replicas.
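What Redis-backed chunk tracking could look like: a sketch built on the standard Redis SADD/SCARD operations, with the client injected so the logic is testable. The RedisLike wrapper interface is hypothetical; any client exposing these two commands (ioredis, node-redis, or a stub) fits:

```typescript
// Minimal Redis surface this sketch needs.
interface RedisLike {
  sadd(key: string, member: string): Promise<number>;
  scard(key: string): Promise<number>;
}

const chunkKey = (uploadId: string) => `upload:${uploadId}:chunks`;

// Record one received chunk. The state survives Upload Service
// restarts because it lives in Redis, not on the container's disk.
// SADD is idempotent, so a retried chunk upload is harmless.
async function markChunkReceived(
  redis: RedisLike, uploadId: string, index: number,
): Promise<void> {
  await redis.sadd(chunkKey(uploadId), String(index));
}

// The upload is complete when every expected chunk index is present.
async function isUploadComplete(
  redis: RedisLike, uploadId: string, totalChunks: number,
): Promise<boolean> {
  return (await redis.scard(chunkKey(uploadId))) === totalChunks;
}
```

The same keys are readable from any replica, which is what makes horizontal scaling of the Upload Service possible at all.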
4. AES-128 Encryption — or Just the Shape of It
The Transcode Worker encrypts every .ts segment with AES-128, generating a key file per video and embedding a key URL into the .m3u8 playlist:
#EXT-X-KEY:METHOD=AES-128,URI="https://streaming.svc/keys/{videoId}",IV=0x...
This is structurally correct. The HLS player fetches the decryption key from that URI before playing any segment. Whether the encryption actually protects anything depends entirely on whether the Streaming Service enforces authentication on the /keys/ route. An open key endpoint — no token check, no session validation — means anyone who can read the .m3u8 file can also get the decryption key.
AES-128 HLS encryption is only as strong as the key delivery endpoint. An open /keys/{videoId} route makes the encryption structural — it exists in the file format, but it does not protect the content.
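The missing piece is token validation on the key route. A minimal sketch using Node's built-in crypto: an HMAC-signed, expiring token bound to one video. This scheme is illustrative, not what the codebase does:

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

const sign = (secret: string, payload: string): string =>
  createHmac("sha256", secret).update(payload).digest("hex");

// Issue a token bound to a single video with an expiry timestamp.
// Assumes videoId contains no "." characters.
function issueKeyToken(secret: string, videoId: string, expiresAtMs: number): string {
  return `${videoId}.${expiresAtMs}.${sign(secret, `${videoId}.${expiresAtMs}`)}`;
}

// The /keys/{videoId} route would call this before returning the AES key.
function verifyKeyToken(secret: string, videoId: string, token: string, nowMs: number): boolean {
  const [id, exp, mac] = token.split(".");
  if (id !== videoId || !mac || Number(exp) < nowMs) return false;
  const expected = sign(secret, `${id}.${exp}`);
  // Constant-time compare to avoid leaking the MAC byte-by-byte.
  return mac.length === expected.length &&
    timingSafeEqual(Buffer.from(mac), Buffer.from(expected));
}
```

The token rides along as a query parameter on the key URI in the playlist; the segments themselves can stay on an open route, because without the key they are ciphertext.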
What I Would Do Differently
Parallel renditions. Replace four sequential await runFFmpeg() calls with Promise.all. Total wall time drops from sum(all renditions) to max(slowest rendition). On a machine with available CPU cores, this is a significant improvement for zero architectural cost.
DLQ and retry policy from day one. BullMQ makes this a few lines of configuration, not an engineering project. Skipping it and calling it a TODO is how reliability debt accumulates silently.
Chunk state in Redis, not on disk. One change makes the Upload Service stateless and restartable without losing in-progress uploads. It also makes horizontal scaling straightforward.
Separate key delivery server. Having the Streaming Service act as both the segment server and the key server means one compromised route exposes everything. A dedicated key server — JWT validation, rate limiting, per-video expiry — is the correct separation, and it is not complicated to build.
Honest Outcome
This is a proof-of-concept. Nothing here ran under real load — concurrent streaming was never tested, transcode times were never instrumented, and several production-critical pieces (DLQ, upload resume, key server auth) are documented TODOs.
What it did prove: the architecture holds. The queue-based async design correctly decouples upload speed from transcode time. HLS segment math produces valid, playable output. The five-service separation makes each component independently debuggable and, in principle, independently scalable.
The gaps are real. Knowing exactly where they are, and why, is the whole point of building something like this before you need it in production.
End-to-end: raw video upload → BullMQ job → FFmpeg four-rendition transcode → AES-128 encrypted HLS segments in MinIO → adaptive bitrate playback in the browser. Fully containerized, no external cloud dependency. Architecture sound. Production gaps known, documented, and understood.