February 1, 2026

Building Pulse: Designing a Low-Latency Real-Time Audio Streaming System

How I built a real-time audio streaming platform using Cloudflare Durable Objects, iterated from polling to WebSockets, and designed a custom binary protocol to achieve millisecond-level latency.

real-time-systems · networking · cloudflare · durable-objects · websockets · binary-protocols

TL;DR

  • Built a browser-based real-time audio and video streaming platform for small group sessions
  • Initial store-and-forward + polling approach introduced 5–10s latency
  • Migrated to WebSockets with edge-based Durable Objects, reducing latency to milliseconds
  • Designed a custom binary protocol to support extensible real-time streams (audio, video, screen share)
  • Learned practical limits of serverless architectures for real-time systems

Project Context

Pulse is a real-time audio and video streaming platform designed as a lightweight alternative to video conferencing tools for casual, interactive sessions such as live gameplay commentary and teaching.

My role: Sole developer — architecture design, backend implementation, protocol design, and client-side audio and video handling.

Constraints:

  • Browser-based clients (no native apps)
  • Edge-first deployment (Cloudflare Workers)
  • No WebRTC (learning objective: understand lower-level streaming mechanics)
  • Low latency prioritized over perfect reliability

Motivation: From Network Theory to a Real Problem

The idea for Pulse originated during a networking lecture on the store-and-forward model used in routers. The model is elegant in theory, but I began wondering how it would behave under real-time constraints.

Around the same time, my friends and I were using Zoom to stream gameplay. The experience was frustrating: the 40-minute meeting limit constantly interrupted our sessions. I wanted a platform that could handle real-time audio and video streaming without any such limitations.

That combination — theory plus a practical annoyance — became the basis for Pulse.


First Attempt: Store-and-Forward with Polling

Architecture Overview

My initial design mirrored the store-and-forward concept:

  • Speakers
      • Capture audio using the `MediaRecorder` API
      • Slice recordings into 200ms–1000ms chunks
      • Upload chunks via HTTP POST with timestamps

  • Listeners
      • Poll the backend for new chunks
      • Append received audio data to a playback buffer

Speaker → Record → Slice → POST /room/:id?ts=X → Durable Object
                                                      ↓
Listener ← Play ← Poll /room/:id?ts=X ←───────────────┘
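
In essence, the speaker side looked something like this (a simplified sketch; the 500ms timeslice is one choice within the 200ms–1000ms range, and error handling is omitted):

async function startSpeaking(roomId) {
  // Capture microphone audio with the MediaRecorder API.
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const recorder = new MediaRecorder(stream);

  recorder.ondataavailable = async (event) => {
    if (event.data.size === 0) return;
    // Upload each chunk with a timestamp so listeners can poll in order.
    await fetch(`/room/${roomId}?ts=${Date.now()}`, {
      method: "POST",
      body: event.data,
    });
  };

  // Emit a chunk every 500ms.
  recorder.start(500);
}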

Backend Choice: Cloudflare Durable Objects

Standard serverless functions were unsuitable due to their stateless nature. I needed:

  • Persistent state across requests
  • A single coordination point per room
  • Support for concurrent clients

Cloudflare Durable Objects fit these requirements by providing:

  • Persistent, in-memory state
  • Single-threaded execution (consistency guarantees)
  • Native WebSocket support
  • Automatic edge routing
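
A minimal sketch of what the per-room object looked like in the polling design (class name, response shapes, and the header are illustrative, not the actual Pulse code):

export class Room {
  constructor(state, env) {
    // Chunks live in instance memory for the lifetime of the object.
    this.chunks = []; // [{ ts, data }]
  }

  async fetch(request) {
    const url = new URL(request.url);
    const ts = Number(url.searchParams.get("ts") ?? 0);

    if (request.method === "POST") {
      // Speaker uploads a timestamped chunk.
      this.chunks.push({ ts, data: await request.arrayBuffer() });
      return new Response("ok");
    }

    // Listener polls for the next chunk newer than its last timestamp.
    const next = this.chunks.find((chunk) => chunk.ts > ts);
    if (!next) return new Response(null, { status: 204 });
    return new Response(next.data, {
      headers: { "X-Chunk-Ts": String(next.ts) },
    });
  }
}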

Unexpected Friction: MediaSource API

Playback reliability quickly became an issue.

While sound in theory, the MediaSource approach proved fragile in practice:

  • Silent playback failures
  • Inconsistent buffer behavior
  • Codec and MIME-type edge cases

After extensive debugging, I replaced MediaSource with a PCM-based audio pipeline using a dedicated PCM player library. This simplified the playback path and significantly improved stability.
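
The core mechanism is simple enough to sketch with the raw Web Audio API (the real pipeline goes through the player library; the mono channel and 48kHz sample rate here are assumptions):

const ctx = new AudioContext();
let playHead = ctx.currentTime;

function playPcmChunk(float32Samples, sampleRate = 48000) {
  // Wrap raw PCM samples in an AudioBuffer — no container, no codec,
  // none of MediaSource's MIME-type edge cases.
  const buffer = ctx.createBuffer(1, float32Samples.length, sampleRate);
  buffer.copyToChannel(float32Samples, 0);

  const source = ctx.createBufferSource();
  source.buffer = buffer;
  source.connect(ctx.destination);

  // Schedule chunks back to back so playback stays gapless.
  playHead = Math.max(playHead, ctx.currentTime);
  source.start(playHead);
  playHead += buffer.duration;
}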


The Core Issue: Latency

Even with audio stored purely in memory, the system suffered from 5–10 seconds of end-to-end latency.

The causes were structural:

  • Polling intervals added unavoidable delay
  • HTTP request/response cycles increased overhead
  • No server push mechanism

This was the key realization:

Store-and-forward works well for throughput, but not for interactivity.

For real-time communication, latency mattered more than buffering elegance.


The Pivot: WebSockets and Push-Based Streaming

To eliminate polling delays, I redesigned the system around WebSockets.

Revised Architecture

  • Speakers stream audio frames directly over WebSockets
  • Durable Objects act as real-time relays
  • Audio frames are pushed instantly to all connected listeners
  • No intermediate storage required

Speaker → WebSocket → Durable Object → WebSocket → Listeners
              (instant relay at the edge)
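
The relay itself is short. A sketch using the Workers `WebSocketPair` API (illustrative; error handling and the protocol framing described below are omitted):

export class Room {
  constructor(state, env) {
    this.sockets = new Set();
  }

  async fetch(request) {
    // Accept the WebSocket upgrade and keep the server end.
    const [client, server] = Object.values(new WebSocketPair());
    server.accept();
    this.sockets.add(server);

    server.addEventListener("message", (event) => {
      // Push each frame to every other connection immediately — no storage.
      for (const ws of this.sockets) {
        if (ws !== server) ws.send(event.data);
      }
    });
    server.addEventListener("close", () => this.sockets.delete(server));

    return new Response(null, { status: 101, webSocket: client });
  }
}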

This change reduced perceived latency from seconds to milliseconds, making real-time interaction viable.


Scaling the Idea: Designing a Custom Binary Protocol

With low-latency audio working, I began exploring extensibility:

  • Video
  • Screen sharing
  • Presence updates

Using JSON for everything quickly became limiting. When receiving raw binary data, clients had no reliable way to determine:

  • Who sent the data
  • What type of stream it represented

Protocol Design

export const STREAM_TYPES = {
  AUDIO: 1,
  VIDEO: 2,
  USER_LIST_UPDATE: 3,
  JOIN_REQUEST: 4,
  SCREEN_SHARE: 5,
  SCREEN_SHARE_STOP: 6,
  SCREEN_SHARE_START: 7
};

export const USER_ID_LENGTH = 36;
export const STREAM_TYPE_LENGTH = 1;

Message Layout:

┌──────────────────────┬───────────────┬─────────────────┐
│ User ID (36 bytes)   │ Type (1 byte) │ Payload (n)     │
└──────────────────────┴───────────────┴─────────────────┘

This structure allows the client to deterministically parse each message and route it to the correct handler.
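
Framing and parsing against this layout come down to a few lines of `Uint8Array` arithmetic. A sketch (function names are mine; it assumes the user ID is a 36-character ASCII string):

export function encodeMessage(userId, streamType, payload) {
  const bytes = new Uint8Array(
    USER_ID_LENGTH + STREAM_TYPE_LENGTH + payload.byteLength
  );
  bytes.set(new TextEncoder().encode(userId), 0);   // bytes 0–35: sender ID
  bytes[USER_ID_LENGTH] = streamType;               // byte 36: stream type
  bytes.set(new Uint8Array(payload), USER_ID_LENGTH + STREAM_TYPE_LENGTH);
  return bytes;
}

export function decodeMessage(buffer) {
  const bytes = new Uint8Array(buffer);
  return {
    userId: new TextDecoder().decode(bytes.subarray(0, USER_ID_LENGTH)),
    streamType: bytes[USER_ID_LENGTH],
    payload: bytes.subarray(USER_ID_LENGTH + STREAM_TYPE_LENGTH),
  };
}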

What I Learned

  • Binary data handling using Uint8Array
  • Manual buffer slicing and offset management
  • Designing protocols with future extensibility in mind
  • Appreciation for the complexity handled by protocols like WebRTC and QUIC

Debugging Screen Share: The Black Screen Problem

Adding screen sharing proved significantly harder than audio. My initial attempt using a hardcoded VP9 codec resulted in a persistent black screen, despite valid data transmission.

The Missing Piece: Initialization Segments

After three days of debugging, I discovered that unlike simple audio streams, video requires an initialization segment (init segment) before any media data. This segment configures the decoder.

The Solution:

  1. Dynamic Codec Selection: The browser now iterates through a list of codecs to find a supported one, rather than forcing VP9.
  2. Protocol Update: When a screen share starts (SCREEN_SHARE_START), the client first sends the specific initialization segment.
  3. Stream Handling: Frame data is only processed after this init segment is digested.
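
The codec probe from step 1 might look like the following (the candidate list is illustrative; a type is only usable if both the capture and playback sides support it):

const CODEC_CANDIDATES = [
  'video/webm; codecs="vp9"',
  'video/webm; codecs="vp8"',
  'video/mp4; codecs="avc1.42E01E"',
];

function pickSupportedCodec() {
  // MediaRecorder must be able to produce the type,
  // and MediaSource must be able to play it back.
  return CODEC_CANDIDATES.find(
    (type) =>
      MediaRecorder.isTypeSupported(type) && MediaSource.isTypeSupported(type)
  );
}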

Fighting Lag with Dynamic Playback

With the MediaSource API, network lag caused playback to drift behind the live edge while buffered data piled up. To fix this, I implemented dynamic playback speed:

  • If the buffer lags behind, playback speed increases.
  • Once caught up, it normalizes.
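
In code, the idea fits in one small function polled on a timer (thresholds and rates here are illustrative, and `videoElement` is assumed to be the screen-share video element):

function adjustPlaybackSpeed(video) {
  const buffered = video.buffered;
  if (buffered.length === 0) return;

  // How far the playhead sits behind the newest buffered data.
  const lag = buffered.end(buffered.length - 1) - video.currentTime;
  video.playbackRate = lag > 1.5 ? 1.25 : 1.0; // speed up, then normalize
}

setInterval(() => adjustPlaybackSpeed(videoElement), 1000);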

Result: ~3 seconds of end-to-end latency in real-world tests between Russia and Sri Lanka.


Current State

Implemented:

  • Real-time audio and video streaming
  • Screen sharing with dynamic codec negotiation (New!)
  • Presence tracking and user identification
  • Custom binary protocol for stream multiplexing
  • Edge-based deployment

In progress:

  • Synchronization of stream state across reconnects
  • Chat functionality
  • Waiting room for admitting participants

Key Takeaways

  1. Theory behaves differently under real-time constraints.
  2. Serverless architectures have clear boundaries.
  3. Polling is simple but fundamentally incompatible with low latency.
  4. Binary protocols are approachable with clear structure.
  5. Edge computing enables architectures that traditional backends struggle with.

Pulse began as a curiosity-driven experiment and evolved into a hands-on lesson in real-time systems design. It reinforced that meaningful learning often comes from building something imperfect, measuring its failures, and iterating based on evidence.