# Phase 4: System Audio & Mic Capture → ASR → RAG - Implementation Plan

**Created:** 2026-05-09  
**Updated:** 2026-05-15  
**Status:** ✅ Complete  
**Depends on:** Phase 1 (Complete), Phase 2 (Complete), Phase 3 (Complete)

---

## 1. Overview

Phase 4 adds two new live audio sources in the LTTPage, alongside file Upload:

1. **System Audio Capture**: captures audio output from any application on the user's computer (browser tab, Spotify, Zoom, system sounds) via `getDisplayMedia()`.
2. **Listen Mic**: captures microphone input (user's voice, room audio) via `getUserMedia({ audio: true })`.

Both pipe audio through the existing WebSocket → DashScope realtime ASR → RAG pipeline.

### System Audio — Use Cases

- Watching a YouTube video in a regular browser tab (no proxy needed; just share that tab's audio)
- Listening to a podcast, lecture, or meeting and getting real-time transcript + RAG
- Transcribing any audio playing on the computer without needing to download files

### Listen Mic — Use Cases

- Recording a live meeting or lecture through the computer's microphone
- Dictating questions or notes verbally and getting RAG answers
- Transcribing spoken Cantonese in real time without a video source

### How They Work

```
[System Audio]
User clicks "System Audio" → "Start Capture"
  → Browser shows permission dialog (screen/tab picker)
  → User selects tab/window/screen (with audio)
  → getDisplayMedia() returns MediaStream (with audio track)
  → AudioContext.createMediaStreamSource(stream)
  → ScriptProcessorNode → WebSocket → DashScope ASR → Transcript → RAG

[Listen Mic]
User clicks "Listen Mic" → "Start Listening"
  → Browser shows microphone permission prompt
  → getUserMedia({ audio: true }) returns MediaStream
  → AudioContext.createMediaStreamSource(stream)
  → ScriptProcessorNode → WebSocket → DashScope ASR → Transcript → RAG
```

### Audio Routing (vs Existing Sources)

| Source | Audio Input | SourceNode Type | Start/Stop Trigger |
|--------|-------------|-----------------|-------------------|
| Upload | `