Phase 4: System Audio Capture → ASR → RAG — Implementation Plan

Created: 2026-05-09
Updated: 2026-05-09
Status: 📋 Draft (Not Started)
Depends on: Phase 1 (Complete), Phase 2 (Complete), Phase 3 (Complete)

1. Overview

Phase 4 adds system audio capture as a third audio source in LTTPage, alongside file Upload and YouTube. Instead of playing a video inside the app, the user captures the audio output of any application on their computer (browser tab, Spotify, Zoom, system sounds) and pipes it through the existing ASR → RAG pipeline.

Use cases:

- Watching a YouTube video in a regular browser tab (no proxy needed; just share that tab's audio)
- Listening to a podcast, lecture, or meeting and getting a real-time transcript + RAG
- Transcribing any audio playing on the computer without needing to download files

How It Works

User clicks "System Audio" → clicks "Start Capture"
  → Browser shows permission dialog (screen/tab picker)
  → User selects tab/window/screen (with audio)
  → getDisplayMedia() returns MediaStream (with audio track)
  → AudioContext.createMediaStreamSource(stream)
  → ScriptProcessorNode (Float32 → Int16 PCM, mono 16kHz)
  → WebSocket → FastAPI → DashScope realtime ASR
  → transcript → QueryInput → RAG Pipeline

Audio Routing (vs Existing Sources)

Source	Audio Input	SourceNode Type	Start/Stop Trigger
Upload	`<video>` element	`createMediaElementSource`	play/pause events
YouTube	`<audio>` element	`createMediaElementSource`	play/pause events on `<video>`
System Audio	MediaStream from `getDisplayMedia()`	`createMediaStreamSource`	Manual Start/Stop button + track ended event

Why New Hook (Not Reuse Existing)

The existing useVideoASR and useYouTubeASR hooks depend on HTML media elements (<video>, <audio>) for both the audio source and play/pause lifecycle. System audio capture uses a MediaStream object (no DOM element), and its lifecycle is controlled by user permission (grant/revoke) and manual start/stop, not DOM events. A new hook is architecturally cleaner than overloading the existing ones with branching logic.

2. User Flow

1. User selects the "System Audio" tab (third option alongside Upload / YouTube)
2. UI shows a "Start Capture" button with browser compatibility info
3. User clicks "Start Capture"
4. Browser opens permission dialog (screen/tab picker)
   - User selects a browser tab (e.g., "YouTube - Live Stream") or "Entire Screen"
   - User checks "Share audio" if available
5. On approval: capture starts, and the status indicator shows "Capturing" (with a live audio level meter if that optional component is built)
6. Real-time ASR transcription flows into QueryInput (same as Upload/YouTube)
7. User can edit the transcript while capturing continues
8. User clicks "Stop Capture" to end; the transcript stays in QueryInput
9. User submits query → RAG pipeline processes it
10. "Full Transcript" button is hidden (streaming ASR only, same as YouTube)

Permission Denied Flow

If user clicks "Cancel" in permission dialog → error state: "Permission denied — system audio capture requires your explicit permission"
If user revokes permission (Chrome "Stop sharing") → capture stops gracefully, status: "Capture stopped"
If no audio track in the stream → error: "No audio track found in the shared content"
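
The three failure cases above could be told apart with a small classifier. This is a minimal sketch, not the hook's actual code: `classifyCaptureFailure` and the `CaptureFailureKind` names are hypothetical. It relies on the fact that `getDisplayMedia()` rejects with a `NotAllowedError` when the user dismisses or denies the picker; anything else is treated as unknown.

```typescript
// Hypothetical helper: names are illustrative, not part of the existing codebase.
type CaptureFailureKind = 'permission-denied' | 'no-audio-track' | 'unknown'

// Classify a getDisplayMedia() outcome.
// `error` is the rejection reason (if the promise rejected);
// `audioTrackCount` is the number of audio tracks in the stream (if it resolved).
function classifyCaptureFailure(
  error: unknown,
  audioTrackCount: number | null,
): CaptureFailureKind | null {
  if (error !== null && error !== undefined) {
    const name =
      typeof error === 'object' && 'name' in error
        ? String((error as { name: unknown }).name)
        : ''
    // NotAllowedError: the user cancelled the picker or permission was denied.
    return name === 'NotAllowedError' ? 'permission-denied' : 'unknown'
  }
  // The call succeeded but the user did not share audio.
  if (audioTrackCount === 0) return 'no-audio-track'
  return null // no failure
}
```

The UI can then map each kind to the message listed above, keeping the wording in one place.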

3. Architecture

3.1 Component Tree (LTTPage — System Audio Mode)

LTTPage
├── SourceSelector (tabs: Upload | YouTube | System Audio)
├── [source === 'system-audio']
│   ├── SystemAudioCapture
│   │   ├── Start/Stop button
│   │   ├── Status indicator (idle | requesting | capturing | stopping | error)
│   │   ├── Audio level meter (optional, nice-to-have)
│   │   └── Browser compatibility note (non-Chrome users)
│   └── (no video player — audio-only capture)
├── QueryInput (receives transcript from useSystemAudioASR)
├── ExtractedQuestionsDisplay
└── RAG Response Panel

3.2 Data Flow

SystemAudioCapture (UI)
  │
  ├── "Start Capture" click → calls startCapture() from hook
  │
  ▼
useSystemAudioASR hook
  │
  ├── getDisplayMedia({ video: true, audio: true, systemAudio: 'include' })
  │     └── User picks tab/window → returns MediaStream
  │
  ├── AudioContext.createMediaStreamSource(stream)
  │     └── MediaStreamAudioSourceNode
  │
  ├── ScriptProcessorNode (4096 buffer, mono 16kHz)
  │     └── onaudioprocess: convert Float32 → Int16 PCM
  │
  ├── WebSocket → ws://host/ws/asr/{uuid}?language=yue
  │     └── Sends binary PCM frames
  │
  └── Returns: { status, transcript, partialTranscript, startCapture, stopCapture }
        │
        ▼
LTTPage unifies: const asr = source === 'system-audio' ? systemAudioASR : ...
  │
  ▼
QueryInput receives asr.partialTranscript
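
The "Float32 → Int16 PCM, mono 16kHz" step in the flow above can be sketched as two pure functions. This is an illustration under assumptions: the real conversion lives in useVideoASR.ts, which is not reproduced here, and the block-averaging downsampler is a simple stand-in rather than the project's resampling method. The function names are hypothetical.

```typescript
// Downsample mono Float32 audio by averaging each block of input samples.
// Assumption: a simple stand-in for whatever resampling useVideoASR.ts performs.
function downsample(input: Float32Array, inputRate: number, targetRate = 16000): Float32Array {
  if (targetRate >= inputRate) return input // nothing to do
  const ratio = inputRate / targetRate
  const outLength = Math.floor(input.length / ratio)
  const out = new Float32Array(outLength)
  for (let i = 0; i < outLength; i++) {
    const start = Math.floor(i * ratio)
    const end = Math.min(input.length, Math.floor((i + 1) * ratio))
    let sum = 0
    for (let j = start; j < end; j++) sum += input[j]
    out[i] = sum / (end - start)
  }
  return out
}

// Convert Float32 samples in [-1, 1] to 16-bit signed PCM, clamping out-of-range values.
function float32ToInt16(input: Float32Array): Int16Array {
  const out = new Int16Array(input.length)
  for (let i = 0; i < input.length; i++) {
    const s = Math.max(-1, Math.min(1, input[i]))
    out[i] = s < 0 ? Math.round(s * 0x8000) : Math.round(s * 0x7fff)
  }
  return out
}
```

Inside `onaudioprocess`, the channel data would pass through `downsample` and then `float32ToInt16`, and the resulting `Int16Array`'s `buffer` is what gets sent as a binary WebSocket frame.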

3.3 Backend Changes

Minimal. The existing WebSocket ASR endpoint (ws_asr.py) already accepts audio from any source. System audio sessions pass a UUID as video_id (there is no real video file), and a config toggle gates the feature.

Change	File	Description
Allow UUID video_id	`backend/app/routers/ws_asr.py`	Accept non-file-based video IDs (already accepts any string)
Transcript persistence	`backend/app/services/history_service.py`	Store system audio transcripts with UUID session ID (optional — nice-to-have)
Config	`backend/app/core/config.py`	Add `SYSTEM_AUDIO_ENABLED` toggle (default: true)

No changes needed to:

DashScope ASR client (receives PCM, doesn't care about source)
WebSocket protocol (same binary PCM format)
RAG pipeline (consumes transcript text)

3.4 Frontend Files

File	Status	Description
`frontend/src/hooks/useSystemAudioASR.ts`	New	Hook: getDisplayMedia → AudioContext → WebSocket
`frontend/src/components/SystemAudioCapture.tsx`	New	UI: Start/Stop button, status, compatibility note
`frontend/src/pages/LTTPage.tsx`	Modified	Add "System Audio" tab, wire hook, unify ASR
`frontend/src/types/index.ts`	Modified	Add SystemAudioStatus type
`frontend/src/components/SourceSelector.tsx`	Refactor	Extract source tabs into reusable component (optional — can inline in LTTPage)

4. Sub-Phases

Sub-Phase	Description	Effort	Depends On	Status
4.1	Config & Infrastructure	0.5 day	—	📋 Draft
4.2	System Audio Capture Hook (`useSystemAudioASR`)	1 day	4.1	📋 Draft
4.3	SystemAudioCapture UI Component	0.5 day	4.2	📋 Draft
4.4	LTTPage Integration	0.5 day	4.2, 4.3	📋 Draft
4.5	Backend Adjustments	0.5 day	4.1	📋 Draft
4.6	Integration & Acceptance Tests	1 day	4.4, 4.5	📋 Draft
4.7	Polish & Documentation	0.5 day	4.6	📋 Draft
Total		4.5 days

Phase 4.1 — Config & Infrastructure (0.5 day)

Objective: Add system audio feature toggle, define types, establish UUID generation.

Tasks:

Add SYSTEM_AUDIO_ENABLED to backend/app/core/config.py (default: True)

Add SystemAudioStatus type to frontend/src/types/index.ts:

type SystemAudioStatus = 'idle' | 'requesting' | 'capturing' | 'stopping' | 'error'

Add SystemAudioASRState interface to types
Add video_id UUID generation helper (frontend-side: crypto.randomUUID())
Verify WebSocket ASR endpoint accepts arbitrary video_id strings (it does — confirm with a quick test)
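
A possible shape for the UUID helper and the URL it feeds, as a hedged sketch: `isUuid` and `buildAsrWsUrl` are hypothetical names, and the path layout is taken from the `ws://host/ws/asr/{uuid}?language=yue` pattern used elsewhere in this plan.

```typescript
// Hypothetical helpers: names are illustrative, not existing project code.
const UUID_RE = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i

// True when `value` has the canonical 8-4-4-4-12 hexadecimal UUID layout.
function isUuid(value: string): boolean {
  return UUID_RE.test(value)
}

// Build the ASR WebSocket URL for a system audio session.
// `host` includes the port if any, e.g. "localhost:8000".
function buildAsrWsUrl(host: string, sessionId: string, language = 'yue', secure = false): string {
  if (!isUuid(sessionId)) throw new Error(`Not a UUID: ${sessionId}`)
  const scheme = secure ? 'wss' : 'ws'
  return `${scheme}://${host}/ws/asr/${sessionId}?language=${encodeURIComponent(language)}`
}
```

In the browser the session ID would come from `crypto.randomUUID()`, which is only available in secure contexts (HTTPS or localhost); a page served over HTTPS also needs `wss`, hence the `secure` flag.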

Test Files: backend/app/test/test_phase4_config.py

Phase 4.2 — System Audio Capture Hook (1 day)

Objective: Create useSystemAudioASR.ts hook that captures system audio and streams it to the ASR WebSocket.

Key Design:

interface UseSystemAudioASRProps {
  wsUrl: string   // e.g., ws://localhost:8000/ws/asr/{uuid}?language=yue
}

interface UseSystemAudioASRReturn {
  status: 'idle' | 'requesting' | 'capturing' | 'stopping' | 'error'
  transcript: string
  partialTranscript: string
  error: string | null
  startCapture: () => Promise<void>
  stopCapture: () => void
}

Implementation Details:

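startCapture(): calls navigator.mediaDevices.getDisplayMedia({ video: true, audio: true, systemAudio: 'include' }). Video must be requested because getDisplayMedia() rejects video: false with a TypeError; the hook stops the video track once the stream arrives. systemAudio is a top-level option, not an audio constraint.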
- On success: creates AudioContext, createMediaStreamSource(stream), connects ScriptProcessor → WebSocket
- On user cancel (NotAllowedError): sets status to 'error', sets error "Permission denied" (matches the Permission Denied Flow in Section 2)
- On no audio track: sets status to 'error', sets error "No audio track found"
stopCapture(): stops all tracks in the MediaStream, closes AudioContext, closes WebSocket
Auto-stop: listens for track.onended (user clicks "Stop sharing" in Chrome) → calls stopCapture
Audio processing: identical to useVideoASR — ScriptProcessorNode(4096), convert Float32 → Int16 PCM, send via WebSocket
WebSocket lifecycle: connect on capture start, close on capture stop
Cleanup: useEffect return closes AudioContext, WebSocket, and stops tracks
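
The status transitions implied by the list above can be written as a reducer, which also makes rapid start/stop cycles easier to reason about because out-of-order events are ignored. This is a sketch under assumptions: the event names and `captureReducer` are hypothetical, and the hook may well manage state differently.

```typescript
type SystemAudioStatus = 'idle' | 'requesting' | 'capturing' | 'stopping' | 'error'

// Hypothetical event names, chosen for illustration.
type CaptureEvent =
  | { type: 'START_REQUESTED' }            // user clicked "Start Capture"
  | { type: 'STREAM_READY' }               // getDisplayMedia resolved with an audio track
  | { type: 'STOP_REQUESTED' }             // Stop button, or the track's "ended" event
  | { type: 'STOPPED' }                    // tracks, AudioContext and WebSocket are closed
  | { type: 'FAILED'; message: string }    // permission denied, no audio track, ...

interface CaptureState {
  status: SystemAudioStatus
  error: string | null
}

const initialCaptureState: CaptureState = { status: 'idle', error: null }

// Events that do not apply to the current status leave the state unchanged.
function captureReducer(state: CaptureState, event: CaptureEvent): CaptureState {
  switch (event.type) {
    case 'START_REQUESTED':
      return state.status === 'idle' || state.status === 'error'
        ? { status: 'requesting', error: null }
        : state
    case 'STREAM_READY':
      return state.status === 'requesting' ? { status: 'capturing', error: null } : state
    case 'STOP_REQUESTED':
      return state.status === 'capturing' ? { status: 'stopping', error: null } : state
    case 'STOPPED':
      return state.status === 'stopping' ? { status: 'idle', error: null } : state
    case 'FAILED':
      return { status: 'error', error: event.message }
  }
}
```

With React, this would plug into `useReducer(captureReducer, initialCaptureState)`; a second "Start Capture" click while already requesting is then a no-op rather than a second permission dialog.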

Pattern to Follow:

AudioContext setup: follow useVideoASR.ts lines 45-143 (AudioContext, ScriptProcessor, sample rate conversion)
WebSocket handling: follow useYouTubeASR.ts lines 35-100
State management: combine patterns from both hooks, adapting for MediaStream source

Test Files: frontend/src/test/test_phase4_useSystemAudioASR.test.ts

Phase 4.3 — SystemAudioCapture UI Component (0.5 day)

Objective: Create the SystemAudioCapture.tsx component with Start/Stop button, status display, and browser compatibility info.

Component Props:

interface SystemAudioCaptureProps {
  status: SystemAudioStatus
  error: string | null
  onStart: () => void
  onStop: () => void
}

UI States:

Idle: "Start Capture" button (blue, prominent) + compatibility note
Requesting: "Waiting for permission..." (loading spinner)
Capturing: "Stop Capture" button (red) + pulsing green dot + "Capturing system audio..."
Error: Red banner with error message + "Try Again" button

Browser Compatibility Note:

⚠️ System audio capture works best in Chrome/Edge on Windows/macOS.
Firefox and Safari do not support this feature.
On Linux, only tab audio is available (not full system audio).

Test Files: frontend/src/test/test_phase4_SystemAudioCapture.test.tsx

Phase 4.4 — LTTPage Integration (0.5 day)

Objective: Wire the System Audio source into LTTPage, adding it as the third tab alongside Upload and YouTube.

Changes to LTTPage.tsx:

Extend SourceType from 'upload' | 'youtube' to 'upload' | 'youtube' | 'system-audio'
Add third tab button (icon: AudioLines from lucide-react) in the source selector
Initialize useSystemAudioASR hook with a UUID-based WebSocket URL

Update asr variable:

const asr = source === 'youtube' ? youtubeASR 
  : source === 'system-audio' ? systemAudioASR 
  : uploadASR

Conditional rendering:

{source === 'upload' && <VideoUploader />}
{source === 'youtube' && <YouTubeMode />}
{source === 'system-audio' && <SystemAudioCapture />}

WebSocket URL: ws://host/ws/asr/{crypto.randomUUID()}?language=yue
Full Transcript button: hidden for system-audio (same as YouTube)
QueryInput: remains editable during capture (same behavior as other sources)
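
As the number of sources grows, the nested ternary can be replaced by a lookup keyed on SourceType, so that adding a fourth source without wiring its hook becomes a compile-time error. A minimal sketch, assuming hypothetical helper names `selectBySource` and `showFullTranscript`:

```typescript
type SourceType = 'upload' | 'youtube' | 'system-audio'

// Pick the value registered for the active source.
// Record<SourceType, T> forces every source to be present at compile time.
function selectBySource<T>(source: SourceType, bySource: Record<SourceType, T>): T {
  return bySource[source]
}

// The Full Transcript button is hidden for streaming-only sources
// (YouTube and System Audio), so only Upload shows it.
function showFullTranscript(source: SourceType): boolean {
  return source === 'upload'
}
```

In LTTPage this would read `const asr = selectBySource(source, { upload: uploadASR, youtube: youtubeASR, 'system-audio': systemAudioASR })`, using the hook results already named in this section.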

Test Files: frontend/src/test/test_phase4_LTTPage_integration.test.tsx

Phase 4.5 — Backend Adjustments (0.5 day)

Objective: Ensure backend handles system audio sessions correctly.

Tasks:

Verify ws_asr.py WebSocket endpoint works with arbitrary video_id (UUID format) — likely no changes needed
Add SYSTEM_AUDIO_ENABLED config validation in the router (return 503 if disabled)
Handle system audio sessions in transcript history (optional — store with source: 'system-audio' metadata)
Verify the ASR client handles system audio PCM identically to video audio

No new endpoints needed. The existing WebSocket and ASR infrastructure is source-agnostic.

Test Files: backend/app/test/test_phase4_config.py

Phase 4.6 — Integration & Acceptance Tests (1 day)

Objective: Comprehensive tests for the system audio capture flow.

Backend Integration Tests (backend/app/test/test_integration_phase4.py):

WebSocket accepts UUID video_id
ASR processes audio from system audio session
Config toggle disables feature

Frontend Tests:

Hook tests (test_phase4_useSystemAudioASR.test.ts): ~10 tests
- Mock getDisplayMedia → successful capture
- Mock getDisplayMedia → user cancels (permission denied)
- Mock getDisplayMedia → no audio track
- AudioContext setup and teardown
- WebSocket connection lifecycle
- PCM conversion and sending
- track.onended triggers auto-stop
- stopCapture cleanup
- Multiple rapid start/stop cycles
Component tests (test_phase4_SystemAudioCapture.test.tsx): ~5 tests
- All UI states render correctly (idle, requesting, capturing, error)
- Start button calls onStart
- Stop button calls onStop
- Error state shows message and retry button
- Compatibility note visible for non-Chrome (optional)
Integration tests (test_phase4_LTTPage_integration.test.tsx): ~5 tests
- System Audio tab renders and switches correctly
- ASR variable selects systemAudioASR when source is system-audio
- Full Transcript button hidden for system audio
- QueryInput receives transcript from system audio
- Source switching preserves transcript

Acceptance Tests (backend/app/test/acceptance/test_acceptance_phase4.py):

Real getDisplayMedia with actual browser (manual — requires human interaction)
Real DashScope ASR with system audio stream
End-to-end: capture → ASR → transcript → RAG answer

Phase 4.7 — Polish & Documentation (0.5 day)

Tasks:

Update README.md — add System Audio Capture section with usage instructions, browser compatibility table, and limitations
Update development_plan.md — add Phase 4 row to timeline, mark status
Add browser detection helper for compatibility warning
Verify production build (npm run build)
Run full CI regression (pytest + vitest)
Final commit

5. Design Decisions

Decision	Rationale
New hook (`useSystemAudioASR`) rather than modifying existing	MediaStream source requires `createMediaStreamSource` (not `createMediaElementSource`), and lifecycle is permission-based (not play/pause events). Separate hook avoids branching complexity.
UUID-based `video_id`	No actual video file for system audio. `crypto.randomUUID()` generates unique session IDs. Backend WebSocket already accepts arbitrary strings.
Manual Start/Stop (not auto)	`getDisplayMedia()` requires explicit user action (browser policy). Cannot auto-start.
No video display in System Audio mode	User watches content in another tab/window. Only capture status and audio controls shown.
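`video: true` in getDisplayMedia, video track discarded	`getDisplayMedia()` rejects `video: false` with a TypeError, so video has to be requested. The hook stops the video track as soon as the stream arrives; only the audio track feeds the ASR pipeline.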
Hide Full Transcript button for system audio	Same as YouTube — streaming ASR only. Full transcript would require recording and batch processing (future Phase 5).
Browser compatibility note in UI	`getDisplayMedia` with audio is Chrome/Edge-only. Non-supporting browsers get clear messaging.

getDisplayMedia Options

const stream = await navigator.mediaDevices.getDisplayMedia({
  video: true,                         // Required: video: false rejects with a TypeError
  audio: {
    echoCancellation: false,           // Don't filter audio
    noiseSuppression: false,           // Don't filter audio
    autoGainControl: false,            // Don't adjust volume
  },
  systemAudio: 'include',              // Top-level option: offer system audio where supported
})
// Only audio is needed, so release the video track right away
stream.getVideoTracks().forEach((track) => track.stop())

Note on video: getDisplayMedia() rejects with a TypeError when video is false, so the call requests video and the hook stops the video track as soon as the stream arrives. The browser permission dialog always shows screen/tab selection (there is no audio-only picker), and the user must select a tab or screen to share. This is a browser limitation, not ours.
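
If the returned stream carries a video track alongside the audio (for instance because video had to be requested for the call to succeed), the hook can drop it and keep only the audio. The sketch below uses small structural interfaces instead of the DOM `MediaStream` type so it can run outside a browser; `keepAudioOnly`, `StreamLike`, and `TrackLike` are hypothetical names. It assumes that stopping the video track leaves the audio track live, which is worth verifying in each target browser.

```typescript
// Minimal structural stand-ins for MediaStreamTrack / MediaStream,
// so the logic can be unit-tested without a browser.
interface TrackLike {
  kind: string
  stop(): void
}
interface StreamLike {
  getAudioTracks(): TrackLike[]
  getVideoTracks(): TrackLike[]
}

// Stop any video tracks and return the audio tracks.
// Throws when the user shared a surface without audio.
function keepAudioOnly(stream: StreamLike): TrackLike[] {
  const audioTracks = stream.getAudioTracks()
  // Video is never used by the ASR pipeline, so release it in every case.
  stream.getVideoTracks().forEach((track) => track.stop())
  if (audioTracks.length === 0) {
    throw new Error('No audio track found in the shared content')
  }
  return audioTracks
}
```

A real `MediaStream` satisfies `StreamLike`, so the result of `getDisplayMedia()` can be passed in directly before calling `createMediaStreamSource`.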

6. Browser Compatibility

Platform / Browser	Tab Audio	System Audio	Works?
Chrome/Edge (Windows)	✅	✅	Best — full support
Chrome/Edge (macOS 14.2+)	✅	✅	Good
Chrome/Edge (Linux)	✅	❌	Works, tab audio only
Firefox	❌	❌	Audio ignored
Safari	❌	❌	Audio not supported
Mobile browsers	❌	❌	Not supported

Detection helper:

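function isSystemAudioSupported(): boolean {
  // getDisplayMedia must exist at all (it is missing on most mobile browsers)
  if (typeof navigator.mediaDevices?.getDisplayMedia !== 'function') return false
  const ua = navigator.userAgent
  // Chromium-based browsers (Chrome, Edge) include "Chrome" in the UA string
  const isChromium = 'chrome' in window || ua.includes('Chrome')
  // Firefox and Safari don't support audio in getDisplayMedia
  return isChromium && !ua.includes('Firefox')
}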

7. Test Strategy

Test Files

File	Type	Count	Description
`test_phase4_config.py`	Backend integration	3	Config toggle, WebSocket accepts UUID
`test_phase4_useSystemAudioASR.test.ts`	Frontend unit	~10	Hook behavior: capture, permission, audio, WS
`test_phase4_SystemAudioCapture.test.tsx`	Frontend component	~5	UI states: idle, requesting, capturing, error
`test_phase4_LTTPage_integration.test.tsx`	Frontend integration	~5	Tab switching, ASR unification, Full Transcript
`test_integration_phase4.py`	Backend integration	4	Config toggle, WebSocket, ASR client
`test_acceptance_phase4.py`	Acceptance	3	Real browser + real DashScope ASR

Mocking Strategy

getDisplayMedia: Mock with vi.fn() returning a synthetic MediaStream with one audio MediaStreamTrack
AudioContext: Use a manual mock (e.g., installed with vi.stubGlobal) for AudioContext and ScriptProcessorNode
WebSocket: Mock via vitest WebSocket mock (same pattern as Phase 2/3 tests)
DashScope ASR: Mock in CI; real in acceptance tests
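
Where the existing Phase 2/3 WebSocket mock is not convenient, a hand-written fake is enough to assert that binary PCM frames were sent and that the socket was closed. This is a sketch, not the project's mock: `FakeWebSocket` is a hypothetical class, and it deliberately throws on send-after-close (a real WebSocket silently drops such data) so that cleanup-order bugs surface in tests.

```typescript
// Hypothetical test double for the browser WebSocket; records what the hook sends.
class FakeWebSocket {
  static readonly OPEN = 1
  static readonly CLOSED = 3

  readonly url: string
  readyState: number = FakeWebSocket.OPEN
  readonly sent: ArrayBufferLike[] = []

  constructor(url: string) {
    this.url = url
  }

  // Record a binary frame. Stricter than the real API on purpose:
  // sending after close() throws instead of being silently discarded.
  send(data: ArrayBufferLike): void {
    if (this.readyState !== FakeWebSocket.OPEN) {
      throw new Error('send() called on a closed socket')
    }
    this.sent.push(data)
  }

  close(): void {
    this.readyState = FakeWebSocket.CLOSED
  }
}
```

A hook test could install the class in place of the global WebSocket, drive one audio buffer through the processing callback, and then check `sent.length` and each frame's `byteLength` (two bytes per Int16 sample).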

8. File Manifest

New Files

frontend/src/hooks/useSystemAudioASR.ts
frontend/src/components/SystemAudioCapture.tsx
frontend/src/test/test_phase4_useSystemAudioASR.test.ts
frontend/src/test/test_phase4_SystemAudioCapture.test.tsx
frontend/src/test/test_phase4_LTTPage_integration.test.tsx
backend/app/test/test_phase4_config.py
backend/app/test/test_integration_phase4.py
backend/app/test/acceptance/test_acceptance_phase4.py
.plans/phase4_system_audio_plan.md              ← this file

Modified Files

frontend/src/pages/LTTPage.tsx                    ← add "System Audio" tab, wire hook
frontend/src/types/index.ts                       ← add SystemAudioStatus, SystemAudioASRState
backend/app/core/config.py                        ← add SYSTEM_AUDIO_ENABLED
backend/app/routers/ws_asr.py                     ← return 503 when SYSTEM_AUDIO_ENABLED is off (Phase 4.5)
development_plan.md                               ← add Phase 4 row
README.md                                         ← add System Audio Capture section

9. Acceptance Criteria

User can select "System Audio" tab in LTTPage
Clicking "Start Capture" opens browser permission dialog
On permission grant, audio streams through WebSocket to DashScope ASR
Real-time transcript flows into QueryInput
User can edit transcript while capture continues
"Stop Capture" properly closes MediaStream, AudioContext, WebSocket
Permission denied shows clear error message
Browser compatibility note shown for non-Chrome browsers
All CI tests pass (no regressions)
Acceptance tests pass with real DashScope ASR
npm run build produces clean production build

File Information

Filename: phase4_system_audio_plan.md
Created: 2026-05-09
Status: Draft — awaiting review before Phase 4.1 implementation begins
