Phase 4: System Audio Capture → ASR → RAG — Implementation Plan
Created: 2026-05-09
Updated: 2026-05-09
Status: 📋 Draft (Not Started)
Depends on: Phase 1 (Complete), Phase 2 (Complete), Phase 3 (Complete)
1. Overview
Phase 4 adds system audio capture as a third audio source in the LTTPage, alongside file Upload and YouTube. Instead of playing a video in the browser, the user captures audio output from any application on their computer (browser tab, Spotify, Zoom, system sounds) and pipes it through the existing ASR → RAG pipeline.
Use cases:
- Watching a YouTube video in a regular browser tab (no proxy needed — just share that tab's audio)
- Listening to a podcast, lecture, or meeting and getting real-time transcript + RAG
- Transcribing any audio playing on the computer without needing to download files
How It Works
User clicks "System Audio" → clicks "Start Capture"
→ Browser shows permission dialog (screen/tab picker)
→ User selects tab/window/screen (with audio)
→ getDisplayMedia() returns MediaStream (with audio track)
→ AudioContext.createMediaStreamSource(stream)
→ ScriptProcessorNode (Float32 PCM, mono 16kHz)
→ WebSocket → FastAPI → DashScope realtime ASR
→ transcript → QueryInput → RAG Pipeline
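The "mono 16kHz" step in the flow above implies a sample-rate conversion, because an AudioContext commonly runs at 44.1 or 48 kHz. A minimal sketch of that step, assuming the audio is already mono; `downsampleTo16k` is a hypothetical helper name, and the averaging approach is an illustration rather than what the existing hooks necessarily do:

```typescript
// Hypothetical helper: reduce a mono Float32 buffer from the AudioContext's
// native rate to the 16 kHz the ASR stream expects.
function downsampleTo16k(input: Float32Array, inputRate: number): Float32Array {
  const targetRate = 16000
  if (inputRate === targetRate) return input
  if (inputRate < targetRate) {
    throw new RangeError(`Cannot upsample from ${inputRate} Hz to ${targetRate} Hz`)
  }
  const ratio = inputRate / targetRate
  const outputLength = Math.floor(input.length / ratio)
  const output = new Float32Array(outputLength)
  for (let i = 0; i < outputLength; i++) {
    // Average every source sample that falls into this output slot. This acts
    // as a crude low-pass filter and aliases less than plain decimation.
    const start = Math.floor(i * ratio)
    const end = Math.min(Math.floor((i + 1) * ratio), input.length)
    let sum = 0
    for (let j = start; j < end; j++) sum += input[j]
    output[i] = sum / (end - start)
  }
  return output
}
```

An alternative worth evaluating is constructing the context with `new AudioContext({ sampleRate: 16000 })` and letting the browser resample, where the target browsers honor that option.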
Audio Routing (vs Existing Sources)
| Source | Audio Input | SourceNode Type | Start/Stop Trigger |
|---|---|---|---|
| Upload | <video> element | createMediaElementSource | play/pause events |
| YouTube | <audio> element | createMediaElementSource | play/pause events on <video> |
| System Audio | MediaStream from getDisplayMedia() | createMediaStreamSource | Manual Start/Stop button + track ended event |
Why New Hook (Not Reuse Existing)
The existing useVideoASR and useYouTubeASR hooks depend on HTML media elements (<video>, <audio>) for both the audio source and play/pause lifecycle. System audio capture uses a MediaStream object (no DOM element), and its lifecycle is controlled by user permission (grant/revoke) and manual start/stop, not DOM events. A new hook is architecturally cleaner than overloading the existing ones with branching logic.
2. User Flow
- User selects "System Audio" tab (third option alongside Upload / YouTube)
- UI shows a "Start Capture" button with browser compatibility info
- User clicks "Start Capture"
- Browser opens permission dialog (screen/tab picker)
- User selects a browser tab (e.g., "YouTube — Live Stream") or "Entire Screen"
- User checks "Share audio" if available
- On approval: capture starts — status indicator shows "Capturing" with a live audio level meter
- Real-time ASR transcription flows into QueryInput (same as Upload/YouTube)
- User can edit transcript while capturing continues
- User clicks "Stop Capture" to end — transcript stays in QueryInput
- User submits query → RAG pipeline processes it
- "Full Transcript" button hidden (streaming ASR only, same as YouTube)
Permission Denied Flow
- If user clicks "Cancel" in permission dialog → error state: "Permission denied — system audio capture requires your explicit permission"
- If user revokes permission (Chrome "Stop sharing") → capture stops gracefully, status: "Capture stopped"
- If no audio track in the stream → error: "No audio track found in the shared content"
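The failure cases above can be told apart from two inputs: the name of the exception thrown by getDisplayMedia() (a dismissed picker rejects with `NotAllowedError`) and the number of audio tracks on the returned stream. A sketch under those assumptions; `CaptureOutcome`, `classifyCaptureResult`, and the generic fallback message are hypothetical and not part of the plan, and the message wording is adapted from the list above:

```typescript
// Hypothetical result type; the plan does not define one.
type CaptureOutcome =
  | { ok: true }
  | { ok: false; message: string }

// errorName: the `name` of the exception thrown by getDisplayMedia(), or null
// if the call resolved. audioTrackCount: stream.getAudioTracks().length.
function classifyCaptureResult(errorName: string | null, audioTrackCount: number): CaptureOutcome {
  if (errorName === 'NotAllowedError') {
    // The user dismissed the picker, or the browser blocked the request.
    return {
      ok: false,
      message: 'Permission denied: system audio capture requires your explicit permission',
    }
  }
  if (errorName !== null) {
    // Assumed fallback for any other failure; not specified in the plan.
    return { ok: false, message: `Capture failed (${errorName})` }
  }
  if (audioTrackCount === 0) {
    // The user shared a surface without ticking "Share audio".
    return { ok: false, message: 'No audio track found in the shared content' }
  }
  return { ok: true }
}
```

Revocation through Chrome's "Stop sharing" control is not an error in this scheme; it arrives later as a track ended event and is handled by the stop path.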
3. Architecture
3.1 Component Tree (LTTPage — System Audio Mode)
LTTPage
├── SourceSelector (tabs: Upload | YouTube | System Audio)
├── [source === 'system-audio']
│ ├── SystemAudioCapture
│ │ ├── Start/Stop button
│ │ ├── Status indicator (idle | requesting | capturing | error)
│ │ ├── Audio level meter (optional, nice-to-have)
│ │ └── Browser compatibility note (non-Chrome users)
│ └── (no video player — audio-only capture)
├── QueryInput (receives transcript from useSystemAudioASR)
├── ExtractedQuestionsDisplay
└── RAG Response Panel
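For the optional audio level meter in the tree above, one simple option is the root-mean-square of each audio buffer. A minimal sketch; `computeRmsLevel` is a hypothetical helper, and the plan does not specify how the meter is driven:

```typescript
// Hypothetical helper for the optional level meter: root-mean-square of one
// audio buffer. For samples in [-1, 1] the result lies in [0, 1].
function computeRmsLevel(samples: Float32Array): number {
  if (samples.length === 0) return 0
  let sumOfSquares = 0
  for (let i = 0; i < samples.length; i++) {
    sumOfSquares += samples[i] * samples[i]
  }
  return Math.sqrt(sumOfSquares / samples.length)
}
```

The value could be computed inside the same audio callback that feeds the WebSocket, so the meter needs no additional audio nodes.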
3.2 Data Flow
SystemAudioCapture (UI)
│
├── "Start Capture" click → calls startCapture() from hook
│
▼
useSystemAudioASR hook
│
├── getDisplayMedia({ video: true, audio: true, systemAudio: 'include' })
│ └── User picks tab/window → returns MediaStream
│
├── AudioContext.createMediaStreamSource(stream)
│ └── MediaStreamAudioSourceNode
│
├── ScriptProcessorNode (4096 buffer, mono 16kHz)
│ └── onaudioprocess: convert Float32 → Int16 PCM
│
├── WebSocket → ws://host/ws/asr/{uuid}?language=yue
│ └── Sends binary PCM frames
│
└── Returns: { status, transcript, partialTranscript, startCapture, stopCapture }
│
▼
LTTPage unifies: const asr = source === 'system-audio' ? systemAudioASR : ...
│
▼
QueryInput receives asr.partialTranscript
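The Float32 → Int16 PCM step in the data flow above is a clamp-and-scale operation. A minimal sketch; `floatTo16BitPCM` is a hypothetical name, and it assumes the backend expects native-endian (in practice little-endian) 16-bit samples, which the plan does not state explicitly:

```typescript
// Convert Web Audio samples (floats in [-1, 1]) to signed 16-bit PCM.
function floatTo16BitPCM(input: Float32Array): Int16Array {
  const output = new Int16Array(input.length)
  for (let i = 0; i < input.length; i++) {
    // Clamp first: samples can slightly exceed [-1, 1] after mixing.
    const s = Math.max(-1, Math.min(1, input[i]))
    // The negative range reaches -32768 while the positive range stops at 32767.
    output[i] = s < 0 ? Math.round(s * 0x8000) : Math.round(s * 0x7fff)
  }
  return output
}
```

Inside the audio callback, the resulting buffer would be sent as one binary frame, for example `socket.send(floatTo16BitPCM(chunk).buffer)`, where `socket` and `chunk` stand for the hook's WebSocket and the current (already resampled) buffer.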
3.3 Backend Changes
Minimal. The existing WebSocket ASR endpoint (ws_asr.py) already accepts audio from any source. The only addition is handling a UUID-based video_id for system audio sessions (no real video file).
| Change | File | Description |
|---|---|---|
| Allow UUID video_id | backend/app/routers/ws_asr.py | Accept non-file-based video IDs (already accepts any string) |
| Transcript persistence | backend/app/services/history_service.py | Store system audio transcripts with UUID session ID (optional, nice-to-have) |
| Config | backend/app/core/config.py | Add SYSTEM_AUDIO_ENABLED toggle (default: true) |
No changes needed to:
- DashScope ASR client (receives PCM, doesn't care about source)
- WebSocket protocol (same binary PCM format)
- RAG pipeline (consumes transcript text)
3.4 Frontend Files
| File | Status | Description |
|---|---|---|
| frontend/src/hooks/useSystemAudioASR.ts | New | Hook: getDisplayMedia → AudioContext → WebSocket |
| frontend/src/components/SystemAudioCapture.tsx | New | UI: Start/Stop button, status, compatibility note |
| frontend/src/pages/LTTPage.tsx | Modified | Add "System Audio" tab, wire hook, unify ASR |
| frontend/src/types/index.ts | Modified | Add SystemAudioStatus type |
| frontend/src/components/SourceSelector.tsx | Refactor | Extract source tabs into reusable component (optional; can inline in LTTPage) |
4. Sub-Phases
| Sub-Phase | Description | Effort | Depends On | Status |
|---|---|---|---|---|
| 4.1 | Config & Infrastructure | 0.5 day | — | 📋 Draft |
| 4.2 | System Audio Capture Hook (useSystemAudioASR) | 1 day | 4.1 | 📋 Draft |
| 4.3 | SystemAudioCapture UI Component | 0.5 day | 4.2 | 📋 Draft |
| 4.4 | LTTPage Integration | 0.5 day | 4.2, 4.3 | 📋 Draft |
| 4.5 | Backend Adjustments | 0.5 day | 4.1 | 📋 Draft |
| 4.6 | Integration & Acceptance Tests | 1 day | 4.4, 4.5 | 📋 Draft |
| 4.7 | Polish & Documentation | 0.5 day | 4.6 | 📋 Draft |
| Total | | 4.5 days | | |
Phase 4.1 — Config & Infrastructure (0.5 day)
Objective: Add system audio feature toggle, define types, establish UUID generation.
Tasks:
- Add SYSTEM_AUDIO_ENABLED to backend/app/core/config.py (default: True)
- Add SystemAudioStatus type to frontend/src/types/index.ts: type SystemAudioStatus = 'idle' | 'requesting' | 'capturing' | 'stopping' | 'error'
- Add SystemAudioASRState interface to types
- Add video_id UUID generation helper (frontend-side: crypto.randomUUID())
- Verify WebSocket ASR endpoint accepts arbitrary video_id strings (it does; confirm with a quick test)
Test Files: backend/app/test/test_phase4_config.py
Phase 4.2 — System Audio Capture Hook (1 day)
Objective: Create useSystemAudioASR.ts hook that captures system audio and streams it to the ASR WebSocket.
Key Design:
interface UseSystemAudioASRProps {
  wsUrl: string // e.g., ws://localhost:8000/ws/asr/{uuid}?language=yue
}
interface UseSystemAudioASRReturn {
  status: 'idle' | 'requesting' | 'capturing' | 'stopping' | 'error'
  transcript: string
  partialTranscript: string
  error: string | null
  startCapture: () => Promise<void>
  stopCapture: () => void
}
Implementation Details:
- startCapture(): calls navigator.mediaDevices.getDisplayMedia({ video: true, audio: true, systemAudio: 'include' }). Video must be requested because the call rejects video: false with a TypeError, and systemAudio is a top-level option rather than an audio constraint.
  - On success: creates AudioContext, createMediaStreamSource(stream), connects ScriptProcessor → WebSocket
  - On user cancel: sets status to 'idle', sets error "Permission denied"
  - On no audio track: sets status to 'error', sets error "No audio track found"
- stopCapture(): stops all tracks in the MediaStream, closes AudioContext, closes WebSocket
- Auto-stop: listens for track.onended (user clicks "Stop sharing" in Chrome) → calls stopCapture
- Audio processing: identical to useVideoASR: ScriptProcessorNode(4096), convert Float32 → Int16 PCM, send via WebSocket
- WebSocket lifecycle: connect on capture start, close on capture stop
- Cleanup: useEffect return closes AudioContext, WebSocket, and stops tracks
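The status transitions described in the list above can be modeled as a small pure reducer, which keeps the hook's state logic testable without a browser. This is a sketch only: the event names, `CaptureState`, and `captureReducer` are hypothetical, and the plan names only the resulting statuses and error strings:

```typescript
type SystemAudioStatus = 'idle' | 'requesting' | 'capturing' | 'stopping' | 'error'

interface CaptureState {
  status: SystemAudioStatus
  error: string | null
}

// Event names are hypothetical; the plan only names the resulting statuses.
type CaptureEvent =
  | { type: 'START_REQUESTED' }
  | { type: 'STREAM_READY' }
  | { type: 'USER_CANCELLED' }
  | { type: 'NO_AUDIO_TRACK' }
  | { type: 'STOP_REQUESTED' } // manual stop or track ended
  | { type: 'RESOURCES_CLOSED' }

const initialCaptureState: CaptureState = { status: 'idle', error: null }

function captureReducer(state: CaptureState, event: CaptureEvent): CaptureState {
  switch (event.type) {
    case 'START_REQUESTED':
      // Ignore a second start while a capture is already underway.
      return state.status === 'idle' || state.status === 'error'
        ? { status: 'requesting', error: null }
        : state
    case 'STREAM_READY':
      return state.status === 'requesting' ? { status: 'capturing', error: null } : state
    case 'USER_CANCELLED':
      return { status: 'idle', error: 'Permission denied' }
    case 'NO_AUDIO_TRACK':
      return { status: 'error', error: 'No audio track found' }
    case 'STOP_REQUESTED':
      return state.status === 'capturing' ? { status: 'stopping', error: null } : state
    case 'RESOURCES_CLOSED':
      return state.status === 'stopping' ? { status: 'idle', error: null } : state
  }
}
```

Guarding each transition on the current status is one way to make the "multiple rapid start/stop cycles" case deterministic: out-of-order events leave the state unchanged instead of corrupting it.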
Pattern to Follow:
- AudioContext setup: follow useVideoASR.ts lines 45-143 (AudioContext, ScriptProcessor, sample rate conversion)
- WebSocket handling: follow useYouTubeASR.ts lines 35-100
- State management: combine patterns from both hooks, adapting for MediaStream source
Test Files: frontend/src/test/test_phase4_useSystemAudioASR.test.ts
Phase 4.3 — SystemAudioCapture UI Component (0.5 day)
Objective: Create the SystemAudioCapture.tsx component with Start/Stop button, status display, and browser compatibility info.
Component Props:
interface SystemAudioCaptureProps {
  status: SystemAudioStatus
  error: string | null
  onStart: () => void
  onStop: () => void
}
UI States:
- Idle: "Start Capture" button (blue, prominent) + compatibility note
- Requesting: "Waiting for permission..." (loading spinner)
- Capturing: "Stop Capture" button (red) + pulsing green dot + "Capturing system audio..."
- Error: Red banner with error message + "Try Again" button
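One way to keep the component thin is to derive what it renders from the status with a pure function, so the UI states above can be unit-tested without rendering. A sketch under assumptions: `CaptureViewModel` and `describeCaptureUi` are hypothetical names, and the handling of 'stopping' is a guess because that status is not listed among the UI states:

```typescript
type SystemAudioStatus = 'idle' | 'requesting' | 'capturing' | 'stopping' | 'error'

interface CaptureViewModel {
  buttonLabel: string | null // null means no button is shown
  message: string
  showCompatibilityNote: boolean
}

function describeCaptureUi(status: SystemAudioStatus, error: string | null): CaptureViewModel {
  switch (status) {
    case 'idle':
      return { buttonLabel: 'Start Capture', message: '', showCompatibilityNote: true }
    case 'requesting':
      return { buttonLabel: null, message: 'Waiting for permission...', showCompatibilityNote: false }
    case 'capturing':
      return { buttonLabel: 'Stop Capture', message: 'Capturing system audio...', showCompatibilityNote: false }
    case 'stopping':
      // Not listed among the UI states; assumed to show a neutral message.
      return { buttonLabel: null, message: 'Stopping capture...', showCompatibilityNote: false }
    case 'error':
      return { buttonLabel: 'Try Again', message: error ?? 'Unknown error', showCompatibilityNote: false }
  }
}
```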
Browser Compatibility Note:
⚠️ System audio capture works best in Chrome/Edge on Windows/macOS.
Firefox and Safari do not support this feature.
On Linux, only tab audio is available (not full system audio).
Test Files: frontend/src/test/test_phase4_SystemAudioCapture.test.tsx
Phase 4.4 — LTTPage Integration (0.5 day)
Objective: Wire the System Audio source into LTTPage, adding it as the third tab alongside Upload and YouTube.
Changes to LTTPage.tsx:
- Extend SourceType from 'upload' | 'youtube' to 'upload' | 'youtube' | 'system-audio'
- Add third tab button (icon: AudioLines from lucide-react) in the source selector
- Initialize useSystemAudioASR hook with a UUID-based WebSocket URL
- Update asr variable: const asr = source === 'youtube' ? youtubeASR : source === 'system-audio' ? systemAudioASR : uploadASR
- Conditional rendering: {source === 'upload' && <VideoUploader />} {source === 'youtube' && <YouTubeMode />} {source === 'system-audio' && <SystemAudioCapture />}
- WebSocket URL: ws://host/ws/asr/{crypto.randomUUID()}?language=yue
- Full Transcript button: hidden for system-audio (same as YouTube)
- QueryInput: remains editable during capture (same behavior as other sources)
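The WebSocket URL and the ASR selection from the list above can be sketched as two small helpers. `buildAsrWsUrl` and `selectAsr` are hypothetical names, and the `secure` flag is an addition for pages served over HTTPS, where browsers block plain ws:// connections to non-local hosts:

```typescript
type SourceType = 'upload' | 'youtube' | 'system-audio'

// sessionId would come from crypto.randomUUID() in the browser. It is a
// parameter here so the helper stays deterministic and easy to test.
function buildAsrWsUrl(host: string, sessionId: string, language = 'yue', secure = false): string {
  const scheme = secure ? 'wss' : 'ws'
  return `${scheme}://${host}/ws/asr/${encodeURIComponent(sessionId)}?language=${encodeURIComponent(language)}`
}

// A lookup table replaces the nested ternary. It is generic over the hook
// result type so nothing about the hooks' internals is assumed.
function selectAsr<T>(source: SourceType, hooks: Record<SourceType, T>): T {
  return hooks[source]
}
```

The session ID should probably be generated once per capture session (for example when the user clicks "Start Capture") rather than during render, since a new UUID on every render would change the URL passed to the hook each time.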
Test Files: frontend/src/test/test_phase4_LTTPage_integration.test.tsx
Phase 4.5 — Backend Adjustments (0.5 day)
Objective: Ensure backend handles system audio sessions correctly.
Tasks:
- Verify ws_asr.py WebSocket endpoint works with arbitrary video_id (UUID format); likely no changes needed
- Add SYSTEM_AUDIO_ENABLED config validation in the router (return 503 if disabled)
- Handle system audio sessions in transcript history (optional: store with source: 'system-audio' metadata)
- Verify the ASR client handles system audio PCM identically to video audio
No new endpoints needed. The existing WebSocket and ASR infrastructure is source-agnostic.
Test Files: backend/app/test/test_phase4_config.py
Phase 4.6 — Integration & Acceptance Tests (1 day)
Objective: Comprehensive tests for the system audio capture flow.
Backend Integration Tests (backend/app/test/test_integration_phase4.py):
- WebSocket accepts UUID video_id
- ASR processes audio from system audio session
- Config toggle disables feature
Frontend Tests:
- Hook tests (test_phase4_useSystemAudioASR.test.ts): ~10 tests
  - Mock getDisplayMedia → successful capture
  - Mock getDisplayMedia → user cancels (permission denied)
  - Mock getDisplayMedia → no audio track
  - AudioContext setup and teardown
  - WebSocket connection lifecycle
  - PCM conversion and sending
  - track.onended triggers auto-stop
  - stopCapture cleanup
  - Multiple rapid start/stop cycles
- Component tests (test_phase4_SystemAudioCapture.test.tsx): ~5 tests
  - All UI states render correctly (idle, requesting, capturing, error)
  - Start button calls onStart
  - Stop button calls onStop
  - Error state shows message and retry button
  - Compatibility note visible for non-Chrome (optional)
- Integration tests (test_phase4_LTTPage_integration.test.tsx): ~5 tests
  - System Audio tab renders and switches correctly
  - ASR variable selects systemAudioASR when source is system-audio
  - Full Transcript button hidden for system audio
  - QueryInput receives transcript from system audio
  - Source switching preserves transcript
Acceptance Tests (backend/app/test/acceptance/test_acceptance_phase4.py):
- Real getDisplayMedia with actual browser (manual; requires human interaction)
- Real DashScope ASR with system audio stream
- End-to-end: capture → ASR → transcript → RAG answer
Phase 4.7 — Polish & Documentation (0.5 day)
Tasks:
- Update README.md: add System Audio Capture section with usage instructions, browser compatibility table, and limitations
- Update development_plan.md: add Phase 4 row to timeline, mark status
- Add browser detection helper for compatibility warning
- Verify production build (npm run build)
- Run full CI regression (pytest + vitest)
- Final commit
5. Design Decisions
| Decision | Rationale |
|---|---|
| New hook (useSystemAudioASR) rather than modifying existing | MediaStream source requires createMediaStreamSource (not createMediaElementSource), and lifecycle is permission-based (not play/pause events). Separate hook avoids branching complexity. |
| UUID-based video_id | No actual video file for system audio. crypto.randomUUID() generates unique session IDs. Backend WebSocket already accepts arbitrary strings. |
| Manual Start/Stop (not auto) | getDisplayMedia() requires explicit user action (browser policy). Cannot auto-start. |
| No video display in System Audio mode | User watches content in another tab/window. Only capture status and audio controls shown. |
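| video: true in getDisplayMedia, video track unused | getDisplayMedia() rejects video: false with a TypeError, so video must be requested. Only the audio track is processed and sent over the WebSocket, which keeps bandwidth low. |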
| Hide Full Transcript button for system audio | Same as YouTube — streaming ASR only. Full transcript would require recording and batch processing (future Phase 5). |
| Browser compatibility note in UI | getDisplayMedia with audio is Chrome/Edge-only. Non-supporting browsers get clear messaging. |
getDisplayMedia Options
const stream = await navigator.mediaDevices.getDisplayMedia({
  video: true, // Required: the call rejects video: false with a TypeError
  audio: {
    echoCancellation: false, // Don't filter audio
    noiseSuppression: false, // Don't filter audio
    autoGainControl: false, // Don't adjust volume
  },
  // Top-level option (not an audio constraint): offer system audio in the
  // picker where supported. Older TypeScript DOM typings may need a cast.
  systemAudio: 'include',
})
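Note on video: getDisplayMedia() rejects with a TypeError when video is false, so the request must include video even though only audio is used. The browser permission dialog always shows screen/tab selection (there is no audio-only picker), and the user must select a tab or screen to share; this is a browser limitation, not ours. After the stream is returned, the hook reads only the audio track; the video track is never rendered or sent to the backend.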
6. Browser Compatibility
| Platform / Browser | Tab Audio | System Audio | Works? |
|---|---|---|---|
| Chrome/Edge (Windows) | ✅ | ✅ | Best — full support |
| Chrome/Edge (macOS 14.2+) | ✅ | ✅ | Good |
| Chrome/Edge (Linux) | ✅ | ❌ | Works, tab audio only |
| Firefox | ❌ | ❌ | Audio ignored |
| Safari | ❌ | ❌ | Audio not supported |
| Mobile browsers | ❌ | ❌ | Not supported |
Detection helper:
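function isSystemAudioSupported(): boolean {
  // getDisplayMedia must exist at all (it is missing on most mobile browsers)
  if (!navigator.mediaDevices?.getDisplayMedia) return false
  const isChromium = 'chrome' in window || navigator.userAgent.includes('Chrome')
  // Firefox and Safari don't support audio in getDisplayMedia
  return isChromium && !navigator.userAgent.includes('Firefox')
}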
7. Test Strategy
Test Files
| File | Type | Count | Description |
|---|---|---|---|
| test_phase4_config.py | Backend integration | 3 | Config toggle, WebSocket accepts UUID |
| test_phase4_useSystemAudioASR.test.ts | Frontend unit | ~10 | Hook behavior: capture, permission, audio, WS |
| test_phase4_SystemAudioCapture.test.tsx | Frontend component | ~5 | UI states: idle, requesting, capturing, error |
| test_phase4_LTTPage_integration.test.tsx | Frontend integration | ~5 | Tab switching, ASR unification, Full Transcript |
| test_integration_phase4.py | Backend integration | 4 | Config toggle, WebSocket, ASR client |
| test_acceptance_phase4.py | Acceptance | 3 | Real browser + real DashScope ASR |
Mocking Strategy
- getDisplayMedia: Mock with vi.fn() (the frontend suite runs on vitest) returning a synthetic MediaStream with an audio track
- AudioContext: Use a manual mock for AudioContext and ScriptProcessorNode (test DOM environments such as jsdom do not implement the Web Audio API)
- WebSocket: Mock via vitest WebSocket mock (same pattern as Phase 2/3 tests)
- DashScope ASR: Mock in CI; real in acceptance tests
8. File Manifest
New Files
frontend/src/hooks/useSystemAudioASR.ts
frontend/src/components/SystemAudioCapture.tsx
frontend/src/test/test_phase4_useSystemAudioASR.test.ts
frontend/src/test/test_phase4_SystemAudioCapture.test.tsx
frontend/src/test/test_phase4_LTTPage_integration.test.tsx
backend/app/test/test_phase4_config.py
backend/app/test/test_integration_phase4.py
backend/app/test/acceptance/test_acceptance_phase4.py
.plans/phase4_system_audio_plan.md ← this file
Modified Files
frontend/src/pages/LTTPage.tsx ← add "System Audio" tab, wire hook
frontend/src/types/index.ts ← add SystemAudioStatus, SystemAudioASRState
backend/app/core/config.py ← add SYSTEM_AUDIO_ENABLED
development_plan.md ← add Phase 4 row
README.md ← add System Audio Capture section
9. Acceptance Criteria
- User can select "System Audio" tab in LTTPage
- Clicking "Start Capture" opens browser permission dialog
- On permission grant, audio streams through WebSocket to DashScope ASR
- Real-time transcript flows into QueryInput
- User can edit transcript while capture continues
- "Stop Capture" properly closes MediaStream, AudioContext, WebSocket
- Permission denied shows clear error message
- Browser compatibility note shown for non-Chrome browsers
- All CI tests pass (no regressions)
- Acceptance tests pass with real DashScope ASR
- npm run build produces clean production build
File Information
- Filename: phase4_system_audio_plan.md
- Created: 2026-05-09
- Status: Draft — awaiting review before Phase 4.1 implementation begins