diff --git a/.plans/phase4_system_audio_plan.md b/.plans/phase4_system_audio_plan.md
index ad31065..cad664c 100644
--- a/.plans/phase4_system_audio_plan.md
+++ b/.plans/phase4_system_audio_plan.md
@@ -1,7 +1,7 @@
-# Phase 4: System Audio Capture → ASR → RAG — Implementation Plan
+# Phase 4: System Audio & Mic Capture → ASR → RAG — Implementation Plan
 
 **Created:** 2026-05-09
-**Updated:** 2026-05-09
+**Updated:** 2026-05-14
 **Status:** 📋 Draft (Not Started)
 **Depends on:** Phase 1 (Complete), Phase 2 (Complete), Phase 3 (Complete)
 
@@ -9,24 +9,40 @@
 
 ## 1. Overview
 
-Phase 4 adds **system audio capture** as a third audio source in the LTTPage, alongside file Upload and YouTube. Instead of playing a video in the browser, the user captures audio output from any application on their computer (browser tab, Spotify, Zoom, system sounds) and pipes it through the existing ASR → RAG pipeline.
+Phase 4 adds two new live audio sources in the LTTPage, alongside file Upload:
 
-**Use cases:**
+1. **System Audio Capture** — captures audio output from any application on the user's computer (browser tab, Spotify, Zoom, system sounds) via `getDisplayMedia()`.
+2. **Listen Mic** — captures microphone input (user's voice, room audio) via `getUserMedia({ audio: true })`.
+
+Both pipe audio through the existing WebSocket → DashScope realtime ASR → RAG pipeline.
+
+### System Audio — Use Cases
 - Watching a YouTube video in a regular browser tab (no proxy needed — just share that tab's audio)
 - Listening to a podcast, lecture, or meeting and getting real-time transcript + RAG
 - Transcribing any audio playing on the computer without needing to download files
 
-### How It Works
+### Listen Mic — Use Cases
+- Recording a live meeting or lecture through the computer's microphone
+- Dictating questions or notes verbally and getting RAG answers
+- Transcribing spoken Cantonese in real time without a video source
+
+### How They Work
 
 ```
-User clicks "System Audio" → clicks "Start Capture"
-  → Browser shows permission dialog (screen/tab picker)
-  → User selects tab/window/screen (with audio)
-  → getDisplayMedia() returns MediaStream (with audio track)
-  → AudioContext.createMediaStreamSource(stream)
-  → ScriptProcessorNode (Float32 PCM, mono 16kHz)
-  → WebSocket → FastAPI → DashScope realtime ASR
-  → transcript → QueryInput → RAG Pipeline
+[System Audio]
+  User clicks "System Audio" → "Start Capture"
+    → Browser shows permission dialog (screen/tab picker)
+    → User selects tab/window/screen (with audio)
+    → getDisplayMedia() returns MediaStream (with audio track)
+    → AudioContext.createMediaStreamSource(stream)
+    → ScriptProcessorNode → WebSocket → DashScope ASR → Transcript → RAG
+
+[Listen Mic]
+  User clicks "Listen Mic" → "Start Listening"
+    → Browser shows microphone permission prompt
+    → getUserMedia({ audio: true }) returns MediaStream
+    → AudioContext.createMediaStreamSource(stream)
+    → ScriptProcessorNode → WebSocket → DashScope ASR → Transcript → RAG
 ```
 
 ### Audio Routing (vs Existing Sources)
@@ -34,59 +50,85 @@ User clicks "System Audio" → clicks "Start Capture"
 | Source | Audio Input | SourceNode Type | Start/Stop Trigger |
 |--------|-------------|-----------------|-------------------|
 | Upload | `