February 1, 2026

Building Pulse: Designing a Low-Latency Real-Time Audio Streaming System

How I built a real-time audio streaming platform using Cloudflare Durable Objects, iterated from polling to WebSockets, and designed a custom binary protocol to achieve millisecond-level latency.

real-time-systems · networking · cloudflare · durable-objects · websockets · binary-protocols

TL;DR

  • Built a browser-based real-time audio and video streaming platform for small group sessions
  • Initial store-and-forward + polling approach introduced 5–10s latency
  • Migrated to WebSockets with edge-based Durable Objects, reducing latency to milliseconds
  • Designed a custom binary protocol to support extensible real-time streams (audio, video, screen share)
  • Learned practical limits of serverless architectures for real-time systems

Project Context

Pulse is a real-time audio and video streaming platform designed as a lightweight alternative to video conferencing tools for casual, interactive sessions such as live gameplay commentary and teaching.

My role: Sole developer — architecture design, backend implementation, protocol design, and client-side audio and video handling.

Constraints:

  • Browser-based clients (no native apps)
  • Edge-first deployment (Cloudflare Workers)
  • No WebRTC (learning objective: understand lower-level streaming mechanics)
  • Low latency prioritized over perfect reliability

Motivation: From Network Theory to a Real Problem

The idea for Pulse originated during a networking lecture on the store-and-forward model used in routers. The model is elegant in theory, but I began wondering how it would behave under real-time constraints.

Around the same time, my friends and I were using Zoom to stream gameplay. The experience was frustrating: the 40-minute meeting limit constantly interrupted our sessions. I wanted a platform that could handle real-time audio and video streaming without any such limitations.

That combination — theory plus a practical annoyance — became the basis for Pulse.


First Attempt: Store-and-Forward with Polling

Architecture Overview

My initial design mirrored the store-and-forward concept:

  • Speakers
      • Capture audio using the `MediaRecorder` API
      • Slice recordings into 200ms–1000ms chunks
      • Upload chunks via HTTP POST with timestamps

  • Listeners
      • Poll the backend for new chunks
      • Append received audio data to a playback buffer

Speaker → Record → Slice → POST /room/:id?ts=X → Durable Object
                                                      ↓
Listener ← Play ← Poll /room/:id?ts=X ←───────────────┘
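
In essence, the speaker side looked something like this (a simplified sketch; the 500ms timeslice is one choice within the 200ms–1000ms range, and error handling is omitted):

async function startSpeaking(roomId) {
  // Capture microphone audio with the MediaRecorder API.
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const recorder = new MediaRecorder(stream);

  recorder.ondataavailable = async (event) => {
    if (event.data.size === 0) return;
    // Upload each chunk with a timestamp so listeners can poll in order.
    await fetch(`/room/${roomId}?ts=${Date.now()}`, {
      method: "POST",
      body: event.data,
    });
  };

  // Emit a chunk every 500ms.
  recorder.start(500);
}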

Backend Choice: Cloudflare Durable Objects

Standard serverless functions were unsuitable due to their stateless nature. I needed:

  • Persistent state across requests
  • A single coordination point per room
  • Support for concurrent clients

Cloudflare Durable Objects fit these requirements by providing:

  • Persistent, in-memory state
  • Single-threaded execution (consistency guarantees)
  • Native WebSocket support
  • Automatic edge routing
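
A minimal sketch of what the per-room object looked like in the polling design (class name, response shapes, and the header are illustrative, not the actual Pulse code):

export class Room {
  constructor(state, env) {
    // Chunks live in instance memory for the lifetime of the object.
    this.chunks = []; // [{ ts, data }]
  }

  async fetch(request) {
    const url = new URL(request.url);
    const ts = Number(url.searchParams.get("ts") ?? 0);

    if (request.method === "POST") {
      // Speaker uploads a timestamped chunk.
      this.chunks.push({ ts, data: await request.arrayBuffer() });
      return new Response("ok");
    }

    // Listener polls for the next chunk newer than its last timestamp.
    const next = this.chunks.find((chunk) => chunk.ts > ts);
    if (!next) return new Response(null, { status: 204 });
    return new Response(next.data, {
      headers: { "X-Chunk-Ts": String(next.ts) },
    });
  }
}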

Unexpected Friction: MediaSource API

Playback reliability quickly became an issue.

While sound in theory, the MediaSource approach proved fragile in practice:

  • Silent playback failures
  • Inconsistent buffer behavior
  • Codec and MIME-type edge cases

After extensive debugging, I replaced MediaSource with a PCM-based audio pipeline using a dedicated PCM player library. This simplified the playback path and significantly improved stability.
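
The core mechanism is simple enough to sketch with the raw Web Audio API (the real pipeline goes through the player library; the mono channel and 48kHz sample rate here are assumptions):

const ctx = new AudioContext();
let playHead = ctx.currentTime;

function playPcmChunk(float32Samples, sampleRate = 48000) {
  // Wrap raw PCM samples in an AudioBuffer — no container, no codec,
  // none of MediaSource's MIME-type edge cases.
  const buffer = ctx.createBuffer(1, float32Samples.length, sampleRate);
  buffer.copyToChannel(float32Samples, 0);

  const source = ctx.createBufferSource();
  source.buffer = buffer;
  source.connect(ctx.destination);

  // Schedule chunks back to back so playback stays gapless.
  playHead = Math.max(playHead, ctx.currentTime);
  source.start(playHead);
  playHead += buffer.duration;
}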


The Core Issue: Latency

Even with audio stored purely in memory, the system suffered from 5–10 seconds of end-to-end latency.

The causes were structural:

  • Polling intervals added unavoidable delay
  • HTTP request/response cycles increased overhead
  • No server push mechanism

This was the key realization:

Store-and-forward works well for throughput, but not for interactivity.

For real-time communication, latency mattered more than buffering elegance.


The Pivot: WebSockets and Push-Based Streaming

To eliminate polling delays, I redesigned the system around WebSockets.

Revised Architecture

  • Speakers stream audio frames directly over WebSockets
  • Durable Objects act as real-time relays
  • Audio frames are pushed instantly to all connected listeners
  • No intermediate storage required

Speaker → WebSocket → Durable Object → WebSocket → Listeners
              (instant relay at the edge)
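
The relay itself is short. A sketch using the Workers `WebSocketPair` API (illustrative; error handling and the protocol framing described below are omitted):

export class Room {
  constructor(state, env) {
    this.sockets = new Set();
  }

  async fetch(request) {
    // Accept the WebSocket upgrade and keep the server end.
    const [client, server] = Object.values(new WebSocketPair());
    server.accept();
    this.sockets.add(server);

    server.addEventListener("message", (event) => {
      // Push each frame to every other connection immediately — no storage.
      for (const ws of this.sockets) {
        if (ws !== server) ws.send(event.data);
      }
    });
    server.addEventListener("close", () => this.sockets.delete(server));

    return new Response(null, { status: 101, webSocket: client });
  }
}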

This change reduced perceived latency from seconds to milliseconds, making real-time interaction viable.


Scaling the Idea: Designing a Custom Binary Protocol

With low-latency audio working, I began exploring extensibility:

  • Video
  • Screen sharing
  • Presence updates

Using JSON for everything quickly became limiting. When receiving raw binary data, clients had no reliable way to determine:

  • Who sent the data
  • What type of stream it represented

Protocol Design

export const STREAM_TYPES = {
  AUDIO: 1,
  VIDEO: 2,
  USER_LIST_UPDATE: 3,
  JOIN_REQUEST: 4,
  SCREEN_SHARE: 5,
  SCREEN_SHARE_STOP: 6,
  SCREEN_SHARE_START: 7
};

export const USER_ID_LENGTH = 36;
export const STREAM_TYPE_LENGTH = 1;

Message Layout:

┌──────────────────────┬───────────────┬─────────────────┐
│ User ID (36 bytes)   │ Type (1 byte) │ Payload (n)     │
└──────────────────────┴───────────────┴─────────────────┘

This structure allows the client to deterministically parse each message and route it to the correct handler.
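
Framing and parsing against this layout come down to a few lines of `Uint8Array` arithmetic. A sketch (function names are mine; it assumes the user ID is a 36-character ASCII string):

export function encodeMessage(userId, streamType, payload) {
  const bytes = new Uint8Array(
    USER_ID_LENGTH + STREAM_TYPE_LENGTH + payload.byteLength
  );
  bytes.set(new TextEncoder().encode(userId), 0);   // bytes 0–35: sender ID
  bytes[USER_ID_LENGTH] = streamType;               // byte 36: stream type
  bytes.set(new Uint8Array(payload), USER_ID_LENGTH + STREAM_TYPE_LENGTH);
  return bytes;
}

export function decodeMessage(buffer) {
  const bytes = new Uint8Array(buffer);
  return {
    userId: new TextDecoder().decode(bytes.subarray(0, USER_ID_LENGTH)),
    streamType: bytes[USER_ID_LENGTH],
    payload: bytes.subarray(USER_ID_LENGTH + STREAM_TYPE_LENGTH),
  };
}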

What I Learned

  • Binary data handling using Uint8Array
  • Manual buffer slicing and offset management
  • Designing protocols with future extensibility in mind
  • Appreciation for the complexity handled by protocols like WebRTC and QUIC

Debugging Screen Share: The Black Screen Problem

Adding screen sharing proved significantly harder than audio. My initial attempt using a hardcoded VP9 codec resulted in a persistent black screen, despite valid data transmission.

The Missing Piece: Initialization Segments

After three days of debugging, I discovered that unlike simple audio streams, video requires an initialization segment (init segment) before any media data. This segment configures the decoder.

The Solution:

  1. Dynamic Codec Selection: The browser now iterates through a list of codecs to find a supported one, rather than forcing VP9.
  2. Protocol Update: When a screen share starts (SCREEN_SHARE_START), the client first sends the specific initialization segment.
  3. Stream Handling: Frame data is only processed after this init segment is digested.
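
The codec probe from step 1 might look like the following (the candidate list is illustrative; a type is only usable if both the capture and playback sides support it):

const CODEC_CANDIDATES = [
  'video/webm; codecs="vp9"',
  'video/webm; codecs="vp8"',
  'video/mp4; codecs="avc1.42E01E"',
];

function pickSupportedCodec() {
  // MediaRecorder must be able to produce the type,
  // and MediaSource must be able to play it back.
  return CODEC_CANDIDATES.find(
    (type) =>
      MediaRecorder.isTypeSupported(type) && MediaSource.isTypeSupported(type)
  );
}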

Fighting Lag with Dynamic Playback

With the MediaSource API, network lag caused playback to drift behind the live edge while buffered data piled up. To fix this, I implemented dynamic playback speed:

  • If the buffer lags behind, playback speed increases.
  • Once caught up, it normalizes.
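
In code, the idea fits in one small function polled on a timer (thresholds and rates here are illustrative, and `videoElement` is assumed to be the screen-share video element):

function adjustPlaybackSpeed(video) {
  const buffered = video.buffered;
  if (buffered.length === 0) return;

  // How far the playhead sits behind the newest buffered data.
  const lag = buffered.end(buffered.length - 1) - video.currentTime;
  video.playbackRate = lag > 1.5 ? 1.25 : 1.0; // speed up, then normalize
}

setInterval(() => adjustPlaybackSpeed(videoElement), 1000);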

Result: ~3 seconds of end-to-end latency in real-world tests between Russia and Sri Lanka.


Current State

Implemented:

  • Real-time audio and video streaming
  • Screen sharing with dynamic codec negotiation (New!)
  • Presence tracking and user identification
  • Custom binary protocol for stream multiplexing
  • Edge-based deployment

In progress:

  • Synchronization of stream state across reconnects
  • Chat functionality
  • Waiting room for admitting participants

Key Takeaways

  1. Theory behaves differently under real-time constraints.
  2. Serverless architectures have clear boundaries.
  3. Polling is simple but fundamentally incompatible with low latency.
  4. Binary protocols are approachable with clear structure.
  5. Edge computing enables architectures that traditional backends struggle with.

Pulse began as a curiosity-driven experiment and evolved into a hands-on lesson in real-time systems design. It reinforced that meaningful learning often comes from building something imperfect, measuring its failures, and iterating based on evidence.