TL;DR
- Built a browser-based real-time audio and video streaming platform for small group sessions
- Initial store-and-forward + polling approach introduced 5–10s latency
- Migrated to WebSockets with edge-based Durable Objects, reducing latency to milliseconds
- Designed a custom binary protocol to support extensible real-time streams (audio, video, screen share)
- Learned practical limits of serverless architectures for real-time systems
Project Context
Pulse is a real-time audio and video streaming platform designed as a lightweight alternative to video conferencing tools for casual, interactive sessions such as live gameplay commentary and teaching.
My role: Sole developer — architecture design, backend implementation, protocol design, and client-side audio and video handling.
Constraints:
- Browser-based clients (no native apps)
- Edge-first deployment (Cloudflare Workers)
- No WebRTC (learning objective: understand lower-level streaming mechanics)
- Low latency prioritized over perfect reliability
Motivation: From Network Theory to a Real Problem
The idea for Pulse originated during a networking lecture on the store-and-forward model used in routers. The model is elegant in theory, but I began to wonder how it would behave under real-time constraints.
Around the same time, my friends and I were using Zoom to stream gameplay. The experience was frustrating: the 40-minute meeting limit constantly interrupted our sessions. I therefore wanted a platform that could handle real-time streaming of audio and video without any such limitations.
That combination — theory plus a practical annoyance — became the basis for Pulse.
First Attempt: Store-and-Forward with Polling
Architecture Overview
My initial design mirrored the store-and-forward concept:
- Speakers
  - Capture audio using the `MediaRecorder` API
  - Slice recordings into 200ms–1000ms chunks
  - Upload chunks via HTTP POST with timestamps
- Listeners
  - Poll the backend for new chunks
  - Append received audio data to a playback buffer
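In code, the speaker loop looked roughly like the sketch below (the `roomId` variable, the 500 ms timeslice, and the lack of error handling are assumptions; the endpoint shape matches the diagram that follows):

```js
// Rough sketch of the speaker side of the polling design.
const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
const recorder = new MediaRecorder(stream);

recorder.ondataavailable = async (event) => {
  // Each chunk is uploaded with a timestamp so listeners can ask for "everything after ts".
  await fetch(`/room/${roomId}?ts=${Date.now()}`, {
    method: "POST",
    body: event.data,
  });
};

recorder.start(500); // emit a chunk roughly every 500 ms (within the 200–1000 ms range)
```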
```
Speaker → Record → Slice → POST /room/:id?ts=X → Durable Object
                                                       ↓
Listener ← Play ← Poll /room/:id?ts=X ←───────────────┘
```

Backend Choice: Cloudflare Durable Objects
Standard serverless functions were unsuitable due to their stateless nature. I needed:
- Persistent state across requests
- A single coordination point per room
- Support for concurrent clients
Cloudflare Durable Objects fit these requirements by providing:
- Persistent, in-memory state
- Single-threaded execution (consistency guarantees)
- Native WebSocket support
- Automatic edge routing
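A stripped-down sketch of what the store-and-forward Durable Object amounted to (class name, response shape, and headers are illustrative, not Pulse's actual code):

```js
// One Durable Object instance per room: accepts chunk uploads and serves polls.
export class Room {
  constructor(state, env) {
    this.chunks = []; // { ts, data } held purely in memory
  }

  async fetch(request) {
    const url = new URL(request.url);
    const ts = Number(url.searchParams.get("ts") ?? 0);

    if (request.method === "POST") {
      // Speaker uploads one recorded chunk tagged with its timestamp.
      this.chunks.push({ ts, data: await request.arrayBuffer() });
      return new Response("ok");
    }

    // Listener polls for the next chunk newer than the last one it played.
    const next = this.chunks.find((chunk) => chunk.ts > ts);
    if (!next) return new Response(null, { status: 204 });
    return new Response(next.data, {
      headers: { "Content-Type": "audio/webm", "X-Chunk-Ts": String(next.ts) },
    });
  }
}
```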
Unexpected Friction: MediaSource API
Playback reliability quickly became an issue.
While theoretically correct, the MediaSource API proved fragile:
- Silent playback failures
- Inconsistent buffer behavior
- Codec and MIME-type edge cases
After extensive debugging, I replaced MediaSource with a PCM-based audio pipeline using a dedicated PCM player library. This simplified the playback path and significantly improved stability.
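The underlying idea, sketched with the Web Audio API directly rather than the library Pulse actually uses (sample rate and mono channel layout are assumptions):

```js
// Schedule incoming Float32 PCM frames back to back on the AudioContext timeline.
const ctx = new AudioContext();
let playCursor = 0; // where the next chunk should start playing

function playPcmChunk(float32Samples, sampleRate = 48000) {
  const buffer = ctx.createBuffer(1, float32Samples.length, sampleRate);
  buffer.copyToChannel(float32Samples, 0);

  const source = ctx.createBufferSource();
  source.buffer = buffer;
  source.connect(ctx.destination);

  // Start each chunk where the previous one ends so playback stays gapless.
  playCursor = Math.max(playCursor, ctx.currentTime);
  source.start(playCursor);
  playCursor += buffer.duration;
}
```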
The Core Issue: Latency
Even with audio stored purely in memory, the system suffered from 5–10 seconds of end-to-end latency.
The causes were structural:
- Polling intervals added unavoidable delay
- HTTP request/response cycles increased overhead
- No server push mechanism
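As a rough, illustrative budget (timings assumed, chunk sizes from above): a 1 s chunk has to finish recording before it can even be uploaded, the listener then waits up to one full polling interval before asking for it, and any chunks already queued in the playback buffer add their own duration on top. Three or four chunks stacked up this way already lands in the 5–10 s range observed above.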
This was the key realization:
Store-and-forward works well for throughput, but not for interactivity.
For real-time communication, latency mattered more than buffering elegance.
The Pivot: WebSockets and Push-Based Streaming
To eliminate polling delays, I redesigned the system around WebSockets.
Revised Architecture
- Speakers stream audio frames directly over WebSockets
- Durable Objects act as real-time relays
- Audio frames are pushed instantly to all connected listeners
- No intermediate storage required
```
Speaker → WebSocket → Durable Object → WebSocket → Listeners
                 (instant relay at the edge)
```

This change reduced perceived latency from seconds to milliseconds, making real-time interaction viable.
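A minimal sketch of the room object rewritten as a pure relay (session tracking and error handling are simplified, and the names are mine, not Pulse's actual code):

```js
// Durable Object that fans each incoming WebSocket frame out to every other client.
export class Room {
  constructor(state, env) {
    this.sessions = new Set();
  }

  async fetch(request) {
    if (request.headers.get("Upgrade") !== "websocket") {
      return new Response("Expected WebSocket", { status: 426 });
    }

    const { 0: client, 1: server } = new WebSocketPair();
    server.accept();
    this.sessions.add(server);

    // Push every incoming frame straight to all other connected listeners.
    server.addEventListener("message", (event) => {
      for (const socket of this.sessions) {
        if (socket !== server) socket.send(event.data);
      }
    });
    server.addEventListener("close", () => this.sessions.delete(server));

    return new Response(null, { status: 101, webSocket: client });
  }
}
```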
Scaling the Idea: Designing a Custom Binary Protocol
With low-latency audio working, I began exploring extensibility:
- Video
- Screen sharing
- Presence updates
Using JSON for everything quickly became limiting. When receiving raw binary data, clients had no reliable way to determine:
- Who sent the data
- What type of stream it represented
Protocol Design
```js
export const STREAM_TYPES = {
  AUDIO: 1,
  VIDEO: 2,
  USER_LIST_UPDATE: 3,
  JOIN_REQUEST: 4,
  SCREEN_SHARE: 5,
  SCREEN_SHARE_STOP: 6,
  SCREEN_SHARE_START: 7
};

export const USER_ID_LENGTH = 36;
export const STREAM_TYPE_LENGTH = 1;
```

Message Layout:
```
┌──────────────────────┬───────────────┬─────────────────┐
│  User ID (36 bytes)  │ Type (1 byte) │   Payload (n)   │
└──────────────────────┴───────────────┴─────────────────┘
```

This structure allows the client to deterministically parse each message and route it to the correct handler.
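Packing and unpacking this layout is straightforward with `Uint8Array`. A sketch using the constants defined above (the helper names `encodeMessage`/`decodeMessage` are mine):

```js
const textEncoder = new TextEncoder();
const textDecoder = new TextDecoder();

// Build one wire message: [36-byte user ID][1-byte stream type][payload].
export function encodeMessage(userId, streamType, payload) {
  const body = new Uint8Array(payload);
  const message = new Uint8Array(USER_ID_LENGTH + STREAM_TYPE_LENGTH + body.byteLength);
  message.set(textEncoder.encode(userId), 0); // a UUID string is exactly 36 ASCII bytes
  message[USER_ID_LENGTH] = streamType;
  message.set(body, USER_ID_LENGTH + STREAM_TYPE_LENGTH);
  return message;
}

// Parse a received ArrayBuffer back into its three fields.
export function decodeMessage(buffer) {
  const bytes = new Uint8Array(buffer);
  return {
    userId: textDecoder.decode(bytes.subarray(0, USER_ID_LENGTH)),
    streamType: bytes[USER_ID_LENGTH],
    payload: bytes.subarray(USER_ID_LENGTH + STREAM_TYPE_LENGTH),
  };
}
```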
What I Learned
- Binary data handling using `Uint8Array`
- Manual buffer slicing and offset management
- Designing protocols with future extensibility in mind
- Appreciation for the complexity handled by protocols like WebRTC and QUIC
Debugging Screen Share: The Black Screen Problem
Adding screen sharing proved significantly harder than audio. My initial attempt using a hardcoded VP9 codec resulted in a persistent black screen, despite valid data transmission.
The Missing Piece: Initialization Segments
After three days of debugging, I discovered that unlike simple audio streams, video requires an initialization segment (init segment) before any media data. This segment configures the decoder.
The Solution:
- Dynamic Codec Selection: The client now iterates through a list of codecs and picks the first one the browser supports, rather than forcing VP9.
- Protocol Update: When a screen share starts (`SCREEN_SHARE_START`), the client first sends the specific initialization segment.
- Stream Handling: Frame data is only processed after this init segment is digested.
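A sketch of what the sender side of this can look like, assuming the first `MediaRecorder` chunk carries the init segment and reusing the hypothetical `encodeMessage` helper from above:

```js
// Codec candidates tried in order; the first one the browser supports wins.
const CANDIDATE_CODECS = [
  "video/webm;codecs=vp9",
  "video/webm;codecs=vp8",
  "video/webm;codecs=h264",
];

async function startScreenShare(socket, userId) {
  const stream = await navigator.mediaDevices.getDisplayMedia({ video: true });
  // Dynamic codec selection instead of hardcoding VP9.
  const mimeType = CANDIDATE_CODECS.find((type) => MediaRecorder.isTypeSupported(type));
  const recorder = new MediaRecorder(stream, { mimeType });

  let sentInit = false;
  recorder.ondataavailable = async (event) => {
    const payload = await event.data.arrayBuffer();
    // The first chunk carries the init segment; tag it SCREEN_SHARE_START so
    // listeners can configure their decoder before any frame data arrives.
    const type = sentInit ? STREAM_TYPES.SCREEN_SHARE : STREAM_TYPES.SCREEN_SHARE_START;
    socket.send(encodeMessage(userId, type, payload));
    sentInit = true;
  };
  recorder.start(200); // emit a chunk roughly every 200 ms
}
```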
Fighting Lag with Dynamic Playback
With the MediaSource API, network lag caused playback to fall behind the live edge as buffered data accumulated. To fix this, I implemented dynamic playback speed:
- If the buffer lags behind, playback speed increases.
- Once caught up, it normalizes.
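A minimal version of that loop (the `video` element, thresholds, and the 500 ms check interval are assumptions):

```js
const MAX_LAG_SECONDS = 1.0; // how far behind the live edge we tolerate
const CATCH_UP_RATE = 1.25;  // slightly faster than real time

setInterval(() => {
  if (video.buffered.length === 0) return;
  const liveEdge = video.buffered.end(video.buffered.length - 1);
  const lag = liveEdge - video.currentTime;
  // Speed up while lagging, return to normal speed once caught up.
  video.playbackRate = lag > MAX_LAG_SECONDS ? CATCH_UP_RATE : 1.0;
}, 500);
```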
Result: Achieved ~3s latency in real-world tests between Russia and Sri Lanka.
Current State
Implemented:
- Real-time audio and video streaming
- Screen sharing with dynamic codec negotiation (New!)
- Presence tracking and user identification
- Custom binary protocol for stream multiplexing
- Edge-based deployment
In progress:
- Synchronization of stream state across reconnects
- Chat functionality
- Waiting room for admitting participants
Key Takeaways
- Theory behaves differently under real-time constraints.
- Serverless architectures have clear boundaries for real-time workloads.
- Polling is simple but fundamentally incompatible with low latency.
- Binary protocols are approachable with clear structure.
- Edge computing enables architectures that traditional backends struggle with.
Pulse began as a curiosity-driven experiment and evolved into a hands-on lesson in real-time systems design. It reinforced that meaningful learning often comes from building something imperfect, measuring its failures, and iterating based on evidence.