diff --git a/.plans/phase4_system_audio_plan.md b/.plans/phase4_system_audio_plan.md
new file mode 100644
index 0000000..ad31065
--- /dev/null
+++ b/.plans/phase4_system_audio_plan.md
@@ -0,0 +1,453 @@
+# Phase 4: System Audio Capture → ASR → RAG — Implementation Plan
+
+**Created:** 2026-05-09
+**Updated:** 2026-05-09
+**Status:** 📋 Draft (Not Started)
+**Depends on:** Phase 1 (Complete), Phase 2 (Complete), Phase 3 (Complete)
+
+---
+
+## 1. Overview
+
+Phase 4 adds **system audio capture** as a third audio source in the LTTPage, alongside file Upload and YouTube. Instead of playing a video in the browser, the user captures audio output from any application on their computer (browser tab, Spotify, Zoom, system sounds) and pipes it through the existing ASR → RAG pipeline.
+
+**Use cases:**
+- Watching a YouTube video in a regular browser tab (no proxy needed — just share that tab's audio)
+- Listening to a podcast, lecture, or meeting and getting real-time transcript + RAG
+- Transcribing any audio playing on the computer without needing to download files
+
+### How It Works
+
+```
+User clicks "System Audio" → clicks "Start Capture"
+  → Browser shows permission dialog (screen/tab picker)
+  → User selects tab/window/screen (with audio)
+  → getDisplayMedia() returns MediaStream (with audio track)
+  → AudioContext.createMediaStreamSource(stream)
+  → ScriptProcessorNode (Float32 PCM, mono 16kHz)
+  → WebSocket → FastAPI → DashScope realtime ASR
+  → transcript → QueryInput → RAG Pipeline
+```
+
+### Audio Routing (vs Existing Sources)
+
+| Source | Audio Input | SourceNode Type | Start/Stop Trigger |
+|--------|-------------|-----------------|-------------------|
+| Upload | `