From 7bff4308b74c7c92994425698f3ee6501d097b67 Mon Sep 17 00:00:00 2001
From: Woody
Date: Thu, 14 May 2026 22:55:06 +0800
Subject: [PATCH] =?UTF-8?q?feat:=20Phase=204=20=E2=80=94=20System=20Audio?=
 =?UTF-8?q?=20&=20Listen=20Mic=20capture=20into=20ASR=20=E2=86=92=20RAG?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Adds two new live audio sources alongside file Upload:
- System Audio: getDisplayMedia() captures system/tab audio output,
  pipes through WebSocket → DashScope realtime ASR → RAG.
- Listen Mic: getUserMedia() captures microphone input via the same
  audio pipeline (shared useMediaStreamASR hook).

Backend: feature toggles (system_audio_enabled, mic_enabled) in
config.py, source query param gating in ws_asr.py, 10 config tests.

Bug fix: getDisplayMedia() rejected video:false per W3C spec — changed
to video:true then stop video tracks to allow audio-only capture on
Windows/macOS Chrome.
---
 .plans/phase4_system_audio_plan.md         | 436 +++++++++++++-----
 backend/app/core/config.py                 |   4 +
 backend/app/routers/ws_asr.py              |  18 +-
 backend/app/test/test_phase4_config.py     | 140 ++++++
 frontend/.pnpmrc                           |   2 +
 frontend/package-lock.json                 |  13 -
 frontend/package.json                      |   5 +
 frontend/pnpm-workspace.yaml               |  11 +
 frontend/src/components/MicCapture.tsx     |  80 ++++
 frontend/src/components/SourceSelector.tsx |  42 ++
 .../src/components/SystemAudioCapture.tsx  |  86 ++++
 frontend/src/hooks/useMediaStreamASR.ts    | 188 ++++++++
 frontend/src/hooks/useMicASR.ts            |  85 ++++
 frontend/src/hooks/useSystemAudioASR.ts    |  91 ++++
 frontend/src/pages/LTTPage.tsx             | 116 +++--
 frontend/src/types/index.ts                |  26 ++
 16 files changed, 1174 insertions(+), 169 deletions(-)
 create mode 100644 backend/app/test/test_phase4_config.py
 create mode 100644 frontend/.pnpmrc
 create mode 100644 frontend/pnpm-workspace.yaml
 create mode 100644 frontend/src/components/MicCapture.tsx
 create mode 100644 frontend/src/components/SourceSelector.tsx
 create mode 100644 frontend/src/components/SystemAudioCapture.tsx
 create mode 100644 frontend/src/hooks/useMediaStreamASR.ts
 create mode 100644 frontend/src/hooks/useMicASR.ts
 create mode 100644 frontend/src/hooks/useSystemAudioASR.ts

diff --git a/.plans/phase4_system_audio_plan.md b/.plans/phase4_system_audio_plan.md
index ad31065..cad664c 100644
--- a/.plans/phase4_system_audio_plan.md
+++ b/.plans/phase4_system_audio_plan.md
@@ -1,7 +1,7 @@
-# Phase 4: System Audio Capture → ASR → RAG — Implementation Plan
+# Phase 4: System Audio & Mic Capture → ASR → RAG — Implementation Plan
 
 **Created:** 2026-05-09
-**Updated:** 2026-05-09
+**Updated:** 2026-05-14
 **Status:** 📋 Draft (Not Started)
 **Depends on:** Phase 1 (Complete), Phase 2 (Complete), Phase 3 (Complete)
 
@@ -9,24 +9,40 @@
 
 ## 1. Overview
 
-Phase 4 adds **system audio capture** as a third audio source in the LTTPage, alongside file Upload and YouTube. Instead of playing a video in the browser, the user captures audio output from any application on their computer (browser tab, Spotify, Zoom, system sounds) and pipes it through the existing ASR → RAG pipeline.
+Phase 4 adds two new live audio sources in the LTTPage, alongside file Upload:
 
-**Use cases:**
+1. **System Audio Capture** — captures audio output from any application on the user's computer (browser tab, Spotify, Zoom, system sounds) via `getDisplayMedia()`.
+2. **Listen Mic** — captures microphone input (user's voice, room audio) via `getUserMedia({ audio: true })`.
+
+Both pipe audio through the existing WebSocket → DashScope realtime ASR → RAG pipeline.
+
+### System Audio — Use Cases
 - Watching a YouTube video in a regular browser tab (no proxy needed — just share that tab's audio)
 - Listening to a podcast, lecture, or meeting and getting real-time transcript + RAG
 - Transcribing any audio playing on the computer without needing to download files
 
-### How It Works
+### Listen Mic — Use Cases
+- Recording a live meeting or lecture through the computer's microphone
+- Dictating questions or notes verbally and getting RAG answers
+- Transcribing spoken Cantonese in real time without a video source
+
+### How They Work
 
 ```
-User clicks "System Audio" → clicks "Start Capture"
-  → Browser shows permission dialog (screen/tab picker)
-  → User selects tab/window/screen (with audio)
-  → getDisplayMedia() returns MediaStream (with audio track)
-  → AudioContext.createMediaStreamSource(stream)
-  → ScriptProcessorNode (Float32 PCM, mono 16kHz)
-  → WebSocket → FastAPI → DashScope realtime ASR
-  → transcript → QueryInput → RAG Pipeline
+[System Audio]
+  User clicks "System Audio" → "Start Capture"
+    → Browser shows permission dialog (screen/tab picker)
+    → User selects tab/window/screen (with audio)
+    → getDisplayMedia() returns MediaStream (with audio track)
+    → AudioContext.createMediaStreamSource(stream)
+    → ScriptProcessorNode → WebSocket → DashScope ASR → Transcript → RAG
+
+[Listen Mic]
+  User clicks "Listen Mic" → "Start Listening"
+    → Browser shows microphone permission prompt
+    → getUserMedia({ audio: true }) returns MediaStream
+    → AudioContext.createMediaStreamSource(stream)
+    → ScriptProcessorNode → WebSocket → DashScope ASR → Transcript → RAG
 ```
 
 ### Audio Routing (vs Existing Sources)
@@ -34,59 +50,85 @@ User clicks "System Audio" → clicks "Start Capture"
 | Source | Audio Input | SourceNode Type | Start/Stop Trigger |
 |--------|-------------|-----------------|-------------------|
 | Upload | `