legco_ai_assistant/.plans/phase4_system_audio_plan.md

Phase 4: System Audio Capture → ASR → RAG — Implementation Plan

Created: 2026-05-09
Updated: 2026-05-09
Status: 📋 Draft (Not Started)
Depends on: Phase 1 (Complete), Phase 2 (Complete), Phase 3 (Complete)


1. Overview

Phase 4 adds system audio capture as a third audio source in LTTPage, alongside file upload and YouTube. Instead of playing media inside the app, the user captures the audio output of another application on their computer (a browser tab, Spotify, Zoom, system sounds) and pipes it through the existing ASR → RAG pipeline. What can actually be captured depends on the browser and operating system (see section 6, Browser Compatibility).

Use cases:

  • Watching a YouTube video in a regular browser tab (no proxy needed — just share that tab's audio)
  • Listening to a podcast, lecture, or meeting and getting real-time transcript + RAG
  • Transcribing any audio playing on the computer without needing to download files

How It Works

User clicks "System Audio" → clicks "Start Capture"
  → Browser shows permission dialog (screen/tab picker)
  → User selects tab/window/screen (with audio)
  → getDisplayMedia() returns MediaStream (with audio track)
  → AudioContext.createMediaStreamSource(stream)
  → ScriptProcessorNode (Float32 samples in, converted to Int16 PCM, mono 16kHz)
  → WebSocket → FastAPI → DashScope realtime ASR
  → transcript → QueryInput → RAG Pipeline

Audio Routing (vs Existing Sources)

| Source | Audio Input | SourceNode Type | Start/Stop Trigger |
|---|---|---|---|
| Upload | <video> element | createMediaElementSource | play/pause events |
| YouTube | <audio> element | createMediaElementSource | play/pause events on <video> |
| System Audio | MediaStream from getDisplayMedia() | createMediaStreamSource | Manual Start/Stop button + track ended event |

Why New Hook (Not Reuse Existing)

The existing useVideoASR and useYouTubeASR hooks depend on HTML media elements (<video>, <audio>) for both the audio source and play/pause lifecycle. System audio capture uses a MediaStream object (no DOM element), and its lifecycle is controlled by user permission (grant/revoke) and manual start/stop, not DOM events. A new hook is architecturally cleaner than overloading the existing ones with branching logic.


2. User Flow

  1. User selects "System Audio" tab (third option alongside Upload / YouTube)
  2. UI shows a "Start Capture" button with browser compatibility info
  3. User clicks "Start Capture"
  4. Browser opens permission dialog (screen/tab picker)
    • User selects a browser tab (e.g., "YouTube — Live Stream") or "Entire Screen"
    • User checks "Share audio" if available
  5. On approval: capture starts — status indicator shows "Capturing" with a live audio level meter
  6. Real-time ASR transcription flows into QueryInput (same as Upload/YouTube)
  7. User can edit transcript while capturing continues
  8. User clicks "Stop Capture" to end — transcript stays in QueryInput
  9. User submits query → RAG pipeline processes it
  10. "Full Transcript" button hidden (streaming ASR only, same as YouTube)
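
The live audio level meter in step 5 can be driven from the same sample buffers that feed the ASR stream. The following is a minimal sketch, assuming the meter displays a root-mean-square level per buffer; rmsLevel is a hypothetical helper name and is not part of the existing hooks.

```typescript
// Hypothetical helper: reduces one buffer of Float32 samples to a 0..1 level
// that a meter component could render.
function rmsLevel(samples: Float32Array): number {
  if (samples.length === 0) return 0;
  let sumSquares = 0;
  for (let i = 0; i < samples.length; i++) {
    sumSquares += samples[i] * samples[i];
  }
  // Web Audio samples are nominally in [-1, 1], so the RMS falls in [0, 1].
  // Clamp anyway in case a source delivers out-of-range values.
  return Math.min(1, Math.sqrt(sumSquares / samples.length));
}
```

Calling this once per audio callback and throttling the resulting state updates would probably be enough for a simple meter; smoothing or a decibel scale are possible refinements.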

Permission Denied Flow

  1. If user clicks "Cancel" in permission dialog → error state: "Permission denied — system audio capture requires your explicit permission"
  2. If user revokes permission (Chrome "Stop sharing") → capture stops gracefully, status: "Capture stopped"
  3. If no audio track in the stream → error: "No audio track found in the shared content"
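
The three cases above can be told apart by the name of the error that reaches the hook. The sketch below assumes that a cancelled or denied picker rejects with a DOMException named NotAllowedError (the standard behaviour for getDisplayMedia) and that the hook itself throws a marker error when the stream has no audio track. NoAudioTrackError, classifyCaptureError, and CAPTURE_ERROR_MESSAGES are hypothetical names used only for illustration.

```typescript
type CaptureFailure = 'permission-denied' | 'no-audio-track' | 'unknown';

// Hypothetical marker error thrown by our own code, not by the browser.
class NoAudioTrackError extends Error {
  constructor() {
    super('No audio track found');
    this.name = 'NoAudioTrackError';
  }
}

// Classify by the error's name so the check also works for DOMException,
// which is not guaranteed to share a prototype chain with our own errors.
function classifyCaptureError(err: unknown): CaptureFailure {
  const name =
    typeof err === 'object' && err !== null && 'name' in err
      ? String((err as { name: unknown }).name)
      : '';
  if (name === 'NotAllowedError') return 'permission-denied';
  if (name === 'NoAudioTrackError') return 'no-audio-track';
  return 'unknown';
}

// Short user-facing messages; the wording is an assumption based on this plan.
const CAPTURE_ERROR_MESSAGES: Record<CaptureFailure, string> = {
  'permission-denied': 'Permission denied',
  'no-audio-track': 'No audio track found',
  unknown: 'System audio capture failed',
};
```

Revoking permission through Chrome's "Stop sharing" control is not an error in this scheme: it surfaces as an ended event on the track and leads to a graceful stop.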

3. Architecture

3.1 Component Tree (LTTPage — System Audio Mode)

LTTPage
├── SourceSelector (tabs: Upload | YouTube | System Audio)
├── [source === 'system-audio']
│   ├── SystemAudioCapture
│   │   ├── Start/Stop button
│   │   ├── Status indicator (idle | requesting | capturing | stopping | error)
│   │   ├── Audio level meter (optional, nice-to-have)
│   │   └── Browser compatibility note (non-Chrome users)
│   └── (no video player — audio-only capture)
├── QueryInput (receives transcript from useSystemAudioASR)
├── ExtractedQuestionsDisplay
└── RAG Response Panel

3.2 Data Flow

SystemAudioCapture (UI)
  │
  ├── "Start Capture" click → calls startCapture() from hook
  │
  ▼
useSystemAudioASR hook
  │
  ├── getDisplayMedia({ video: true, audio: true, systemAudio: 'include' })
  │     └── User picks tab/window → returns MediaStream
  │
  ├── AudioContext.createMediaStreamSource(stream)
  │     └── MediaStreamAudioSourceNode
  │
  ├── ScriptProcessorNode (4096 buffer, mono 16kHz)
  │     └── onaudioprocess: convert Float32 → Int16 PCM
  │
  ├── WebSocket → ws://host/ws/asr/{uuid}?language=yue
  │     └── Sends binary PCM frames
  │
  └── Returns: { status, transcript, partialTranscript, error, startCapture, stopCapture }
        │
        ▼
LTTPage unifies: const asr = source === 'system-audio' ? systemAudioASR : ...
  │
  ▼
QueryInput receives asr.partialTranscript
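
The Float32 → Int16 step in the flow above is small enough to show in full. This is a sketch of the usual conversion, assuming the backend expects 16-bit signed PCM in the platform's native byte order (little-endian on common hardware); floatTo16BitPCM is a hypothetical name, and the actual code in useVideoASR may differ.

```typescript
// Converts Web Audio samples (nominally -1..1) to 16-bit signed PCM.
function floatTo16BitPCM(input: Float32Array): Int16Array {
  const output = new Int16Array(input.length);
  for (let i = 0; i < input.length; i++) {
    // Clamp first: out-of-range values would otherwise wrap around in Int16.
    const s = Math.max(-1, Math.min(1, input[i]));
    // Negative and positive halves scale differently because the Int16
    // range is asymmetric (-32768..32767).
    output[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return output;
}
```

In the hook, the resulting buffer could be sent as a binary frame, for example with ws.send(pcm.buffer), once the samples have been reduced to mono 16kHz.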

3.3 Backend Changes

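Minimal. The existing WebSocket ASR endpoint (ws_asr.py) already accepts audio from any source and any string as video_id, so a UUID-based session ID (there is no real video file) should only need verification. The actual additions are the SYSTEM_AUDIO_ENABLED toggle and, optionally, transcript persistence for system audio sessions.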

| Change | File | Description |
|---|---|---|
| Allow UUID video_id | backend/app/routers/ws_asr.py | Accept non-file-based video IDs (already accepts any string) |
| Transcript persistence | backend/app/services/history_service.py | Store system audio transcripts with UUID session ID (optional, nice-to-have) |
| Config | backend/app/core/config.py | Add SYSTEM_AUDIO_ENABLED toggle (default: true) |

No changes needed to:

  • DashScope ASR client (receives PCM, doesn't care about source)
  • WebSocket protocol (same binary PCM format)
  • RAG pipeline (consumes transcript text)

3.4 Frontend Files

| File | Status | Description |
|---|---|---|
| frontend/src/hooks/useSystemAudioASR.ts | New | Hook: getDisplayMedia → AudioContext → WebSocket |
| frontend/src/components/SystemAudioCapture.tsx | New | UI: Start/Stop button, status, compatibility note |
| frontend/src/pages/LTTPage.tsx | Modified | Add "System Audio" tab, wire hook, unify ASR |
| frontend/src/types/index.ts | Modified | Add SystemAudioStatus type |
| frontend/src/components/SourceSelector.tsx | Refactor | Extract source tabs into reusable component (optional, can be inlined in LTTPage) |

4. Sub-Phases

| Sub-Phase | Description | Effort | Depends On | Status |
|---|---|---|---|---|
| 4.1 | Config & Infrastructure | 0.5 day | | 📋 Draft |
| 4.2 | System Audio Capture Hook (useSystemAudioASR) | 1 day | 4.1 | 📋 Draft |
| 4.3 | SystemAudioCapture UI Component | 0.5 day | 4.2 | 📋 Draft |
| 4.4 | LTTPage Integration | 0.5 day | 4.2, 4.3 | 📋 Draft |
| 4.5 | Backend Adjustments | 0.5 day | 4.1 | 📋 Draft |
| 4.6 | Integration & Acceptance Tests | 1 day | 4.4, 4.5 | 📋 Draft |
| 4.7 | Polish & Documentation | 0.5 day | 4.6 | 📋 Draft |
| Total | | 4.5 days | | |

Phase 4.1 — Config & Infrastructure (0.5 day)

Objective: Add system audio feature toggle, define types, establish UUID generation.

Tasks:

  1. Add SYSTEM_AUDIO_ENABLED to backend/app/core/config.py (default: True)
  2. Add SystemAudioStatus type to frontend/src/types/index.ts:
    type SystemAudioStatus = 'idle' | 'requesting' | 'capturing' | 'stopping' | 'error'
    
  3. Add SystemAudioASRState interface to types
  4. Add video_id UUID generation helper (frontend-side: crypto.randomUUID())
  5. Verify WebSocket ASR endpoint accepts arbitrary video_id strings (it does — confirm with a quick test)
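
Tasks 4 and 5 come together in the WebSocket URL. A minimal sketch follows, assuming the URL shape ws://host/ws/asr/{uuid}?language=yue used elsewhere in this plan; buildAsrWsUrl and its secure flag are hypothetical and only illustrate the idea.

```typescript
// Hypothetical helper: builds the ASR WebSocket URL for one capture session.
function buildAsrWsUrl(
  host: string,
  sessionId: string,
  language: string = 'yue',
  secure: boolean = false,
): string {
  // Pages served over HTTPS generally need wss:// to avoid mixed-content blocking.
  const scheme = secure ? 'wss' : 'ws';
  const id = encodeURIComponent(sessionId);
  const lang = encodeURIComponent(language);
  return `${scheme}://${host}/ws/asr/${id}?language=${lang}`;
}
```

In the browser the session ID would come from crypto.randomUUID(), which is only available in secure contexts (HTTPS or localhost). Generating the ID once per capture session, rather than on every render, keeps the URL stable while the socket is open.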

Test Files: backend/app/test/test_phase4_config.py

Phase 4.2 — System Audio Capture Hook (1 day)

Objective: Create useSystemAudioASR.ts hook that captures system audio and streams it to the ASR WebSocket.

Key Design:

interface UseSystemAudioASRProps {
  wsUrl: string   // e.g., ws://localhost:8000/ws/asr/{uuid}?language=yue
}

interface UseSystemAudioASRReturn {
  status: 'idle' | 'requesting' | 'capturing' | 'stopping' | 'error'
  transcript: string
  partialTranscript: string
  error: string | null
  startCapture: () => Promise<void>
  stopCapture: () => void
}

Implementation Details:

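  • startCapture(): calls navigator.mediaDevices.getDisplayMedia({ video: true, audio: true, systemAudio: 'include' }). systemAudio is a top-level option rather than an audio constraint, and a request with video: false is rejected with a TypeError, so a video track is requested and left unused.
    • On success: creates AudioContext, createMediaStreamSource(stream), connects ScriptProcessor → WebSocket
    • On user cancel (the promise rejects with NotAllowedError): sets status to 'error', sets error "Permission denied" (consistent with the Permission Denied Flow in section 2)
    • On no audio track: stops the stream's tracks, sets status to 'error', sets error "No audio track found"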
  • stopCapture(): stops all tracks in the MediaStream, closes AudioContext, closes WebSocket
  • Auto-stop: listens for track.onended (user clicks "Stop sharing" in Chrome) → calls stopCapture
  • Audio processing: identical to useVideoASR — ScriptProcessorNode(4096), convert Float32 → Int16 PCM, send via WebSocket
  • WebSocket lifecycle: connect on capture start, close on capture stop
  • Cleanup: useEffect return closes AudioContext, WebSocket, and stops tracks
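
The lifecycle described above can be written down as a small transition table, which also makes the "multiple rapid start/stop cycles" case easy to reason about. This is a sketch under assumptions: the event names are hypothetical, and a cancelled picker is treated as 'error', following the Permission Denied Flow in section 2.

```typescript
type SystemAudioStatus = 'idle' | 'requesting' | 'capturing' | 'stopping' | 'error';

// Hypothetical event names for the things that can happen to a capture session.
type CaptureEvent =
  | 'START_REQUESTED' // user clicked Start Capture / Try Again
  | 'STREAM_READY'    // getDisplayMedia resolved with an audio track
  | 'CAPTURE_FAILED'  // permission denied, no audio track, or WebSocket error
  | 'STOP_REQUESTED'  // user clicked Stop Capture
  | 'TRACK_ENDED'     // user clicked "Stop sharing" in the browser UI
  | 'CLEANUP_DONE';   // tracks stopped, AudioContext and WebSocket closed

const TRANSITIONS: Record<SystemAudioStatus, Partial<Record<CaptureEvent, SystemAudioStatus>>> = {
  idle: { START_REQUESTED: 'requesting' },
  error: { START_REQUESTED: 'requesting' },
  requesting: { STREAM_READY: 'capturing', CAPTURE_FAILED: 'error' },
  capturing: { STOP_REQUESTED: 'stopping', TRACK_ENDED: 'stopping', CAPTURE_FAILED: 'error' },
  stopping: { CLEANUP_DONE: 'idle' },
};

// Events that are not valid in the current state are ignored, so a second
// Start click while requesting or capturing has no effect.
function nextStatus(current: SystemAudioStatus, event: CaptureEvent): SystemAudioStatus {
  return TRANSITIONS[current][event] ?? current;
}
```

Inside the hook, a function like this could back a useReducer call, keeping the side effects (stopping tracks, closing the socket) separate from the status bookkeeping.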

Pattern to Follow:

  • AudioContext setup: follow useVideoASR.ts lines 45-143 (AudioContext, ScriptProcessor, sample rate conversion)
  • WebSocket handling: follow useYouTubeASR.ts lines 35-100
  • State management: combine patterns from both hooks, adapting for MediaStream source
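
The sample rate conversion mentioned above matters because an AudioContext usually runs at the device rate (often 44.1 or 48kHz) while the ASR stream is mono 16kHz. The code in useVideoASR.ts is not reproduced here; the following is an independent sketch of one simple approach (averaging blocks of input samples), with downsampleTo16k as a hypothetical name.

```typescript
// Reduces the sample rate by averaging each group of input samples that
// maps onto one output sample. Simple, but not a high-quality resampler.
function downsampleTo16k(
  input: Float32Array,
  inputRate: number,
  targetRate: number = 16000,
): Float32Array {
  if (inputRate === targetRate) return input;
  if (inputRate < targetRate) {
    throw new Error('Upsampling is not supported by this sketch');
  }
  const ratio = inputRate / targetRate;
  const outLength = Math.floor(input.length / ratio);
  const output = new Float32Array(outLength);
  for (let i = 0; i < outLength; i++) {
    const start = Math.floor(i * ratio);
    const end = Math.min(input.length, Math.floor((i + 1) * ratio));
    let sum = 0;
    for (let j = start; j < end; j++) sum += input[j];
    // ratio > 1 guarantees end > start, so the divisor is never zero.
    output[i] = sum / (end - start);
  }
  return output;
}
```

An alternative worth evaluating is constructing the context with new AudioContext({ sampleRate: 16000 }) and letting the browser resample, which would make this step unnecessary where that option is honoured.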

Test Files: frontend/src/test/test_phase4_useSystemAudioASR.test.ts

Phase 4.3 — SystemAudioCapture UI Component (0.5 day)

Objective: Create the SystemAudioCapture.tsx component with Start/Stop button, status display, and browser compatibility info.

Component Props:

interface SystemAudioCaptureProps {
  status: SystemAudioStatus
  error: string | null
  onStart: () => void
  onStop: () => void
}

UI States:

  1. Idle: "Start Capture" button (blue, prominent) + compatibility note
  2. Requesting: "Waiting for permission..." (loading spinner)
  3. Capturing: "Stop Capture" button (red) + pulsing green dot + "Capturing system audio..."
  4. Error: Red banner with error message + "Try Again" button
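
One way to keep the component thin is to derive everything it renders from status and error in a pure function. The sketch below is illustrative only: captureView and CaptureView are hypothetical names, and the 'stopping' row is an assumption, since the list above does not describe that state.

```typescript
type SystemAudioStatus = 'idle' | 'requesting' | 'capturing' | 'stopping' | 'error';

interface CaptureView {
  button: { label: string; action: 'start' | 'stop' } | null;
  message: string | null;
  showSpinner: boolean;
}

// Maps hook state to what the component should show.
function captureView(status: SystemAudioStatus, error: string | null): CaptureView {
  const views: Record<SystemAudioStatus, CaptureView> = {
    idle: { button: { label: 'Start Capture', action: 'start' }, message: null, showSpinner: false },
    requesting: { button: null, message: 'Waiting for permission...', showSpinner: true },
    capturing: {
      button: { label: 'Stop Capture', action: 'stop' },
      message: 'Capturing system audio...',
      showSpinner: false,
    },
    // Assumption: while cleanup runs, no button is offered.
    stopping: { button: null, message: 'Stopping capture...', showSpinner: true },
    error: {
      button: { label: 'Try Again', action: 'start' },
      message: error ?? 'System audio capture failed',
      showSpinner: false,
    },
  };
  return views[status];
}
```

A function like this can be unit-tested without rendering, which leaves the component tests to cover only styling and event wiring.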

Browser Compatibility Note:

⚠️ System audio capture works best in Chrome/Edge on Windows/macOS.
Firefox and Safari do not support this feature.
On Linux, only tab audio is available (not full system audio).

Test Files: frontend/src/test/test_phase4_SystemAudioCapture.test.tsx

Phase 4.4 — LTTPage Integration (0.5 day)

Objective: Wire the System Audio source into LTTPage, adding it as the third tab alongside Upload and YouTube.

Changes to LTTPage.tsx:

  1. Extend SourceType from 'upload' | 'youtube' to 'upload' | 'youtube' | 'system-audio'
  2. Add third tab button (icon: AudioLines from lucide-react) in the source selector
  3. Initialize useSystemAudioASR hook with a UUID-based WebSocket URL
  4. Update asr variable:
    const asr = source === 'youtube' ? youtubeASR 
      : source === 'system-audio' ? systemAudioASR 
      : uploadASR
    
  5. Conditional rendering:
    {source === 'upload' && <VideoUploader />}
    {source === 'youtube' && <YouTubeMode />}
    {source === 'system-audio' && <SystemAudioCapture />}
    
  6. WebSocket URL: ws://host/ws/asr/{crypto.randomUUID()}?language=yue
  7. Full Transcript button: hidden for system-audio (same as YouTube)
  8. QueryInput: remains editable during capture (same behavior as other sources)
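
Items 4 and 7 can be expressed as two small pure functions, which keeps the selection logic exhaustive as source types are added. This is a sketch with hypothetical names (selectAsr, showFullTranscript, AsrResult); the three hooks themselves still have to be called unconditionally on every render, and only the choice of which result to use is conditional.

```typescript
type SourceType = 'upload' | 'youtube' | 'system-audio';

// Minimal shape shared by the three ASR hooks, as far as this plan describes it.
interface AsrResult {
  transcript: string;
  partialTranscript: string;
}

// Using a Record forces a compile error if a new SourceType has no hook result.
function selectAsr(source: SourceType, bySource: Record<SourceType, AsrResult>): AsrResult {
  return bySource[source];
}

// Full Transcript is hidden for the streaming-only sources (YouTube, System Audio).
function showFullTranscript(source: SourceType): boolean {
  return source === 'upload';
}
```

In LTTPage this would be called as selectAsr(source, { upload: uploadASR, youtube: youtubeASR, 'system-audio': systemAudioASR }), which is equivalent to the nested ternary in item 4.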

Test Files: frontend/src/test/test_phase4_LTTPage_integration.test.tsx

Phase 4.5 — Backend Adjustments (0.5 day)

Objective: Ensure backend handles system audio sessions correctly.

Tasks:

  1. Verify ws_asr.py WebSocket endpoint works with arbitrary video_id (UUID format) — likely no changes needed
  2. Add SYSTEM_AUDIO_ENABLED config validation in the router (return 503 if disabled)
  3. Handle system audio sessions in transcript history (optional — store with source: 'system-audio' metadata)
  4. Verify the ASR client handles system audio PCM identically to video audio

No new endpoints needed. The existing WebSocket and ASR infrastructure is source-agnostic.

Test Files: backend/app/test/test_phase4_config.py

Phase 4.6 — Integration & Acceptance Tests (1 day)

Objective: Comprehensive tests for the system audio capture flow.

Backend Integration Tests (backend/app/test/test_integration_phase4.py):

  1. WebSocket accepts UUID video_id
  2. ASR processes audio from system audio session
  3. Config toggle disables feature

Frontend Tests:

  1. Hook tests (test_phase4_useSystemAudioASR.test.ts): ~10 tests

    • Mock getDisplayMedia → successful capture
    • Mock getDisplayMedia → user cancels (permission denied)
    • Mock getDisplayMedia → no audio track
    • AudioContext setup and teardown
    • WebSocket connection lifecycle
    • PCM conversion and sending
    • track.onended triggers auto-stop
    • stopCapture cleanup
    • Multiple rapid start/stop cycles
  2. Component tests (test_phase4_SystemAudioCapture.test.tsx): ~5 tests

    • All UI states render correctly (idle, requesting, capturing, error)
    • Start button calls onStart
    • Stop button calls onStop
    • Error state shows message and retry button
    • Compatibility note visible for non-Chrome (optional)
  3. Integration tests (test_phase4_LTTPage_integration.test.tsx): ~5 tests

    • System Audio tab renders and switches correctly
    • ASR variable selects systemAudioASR when source is system-audio
    • Full Transcript button hidden for system audio
    • QueryInput receives transcript from system audio
    • Source switching preserves transcript

Acceptance Tests (backend/app/test/acceptance/test_acceptance_phase4.py):

  • Real getDisplayMedia with actual browser (manual — requires human interaction)
  • Real DashScope ASR with system audio stream
  • End-to-end: capture → ASR → transcript → RAG answer

Phase 4.7 — Polish & Documentation (0.5 day)

Tasks:

  1. Update README.md — add System Audio Capture section with usage instructions, browser compatibility table, and limitations
  2. Update development_plan.md — add Phase 4 row to timeline, mark status
  3. Add browser detection helper for compatibility warning
  4. Verify production build (npm run build)
  5. Run full CI regression (pytest + vitest)
  6. Final commit

5. Design Decisions

| Decision | Rationale |
|---|---|
| New hook (useSystemAudioASR) rather than modifying existing | A MediaStream source requires createMediaStreamSource (not createMediaElementSource), and the lifecycle is permission-based (not play/pause events). A separate hook avoids branching complexity. |
| UUID-based video_id | There is no actual video file for system audio. crypto.randomUUID() generates unique session IDs. The backend WebSocket already accepts arbitrary strings. |
| Manual Start/Stop (not auto) | getDisplayMedia() requires an explicit user action (browser policy), so capture cannot auto-start. |
| No video display in System Audio mode | The user watches content in another tab/window. Only capture status and audio controls are shown. |
| video: true in getDisplayMedia, video track unused | getDisplayMedia() rejects with a TypeError when video is false, so a video track is always part of the request. Only the audio track is routed into the AudioContext; no video is rendered or sent to the backend. |
| Hide Full Transcript button for system audio | Same as YouTube: streaming ASR only. A full transcript would require recording and batch processing (future Phase 5). |
| Browser compatibility note in UI | Audio capture through getDisplayMedia is effectively limited to desktop Chrome/Edge. Non-supporting browsers get clear messaging. |

getDisplayMedia Options

const stream = await navigator.mediaDevices.getDisplayMedia({
  video: true,                         // Required: the call rejects with TypeError if video is false
  audio: {
    echoCancellation: false,           // Don't filter audio
    noiseSuppression: false,           // Don't filter audio
    autoGainControl: false,            // Don't adjust volume
  },
  systemAudio: 'include',              // Top-level option: offer system audio where supported
})

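Note on the video option: getDisplayMedia() always captures a display surface, so it cannot be called with video: false (the promise rejects with a TypeError), and the returned stream includes a video track. The hook ignores that track and routes only the audio track into the AudioContext. The browser permission dialog still shows screen/tab selection (there is no "audio-only picker"), so the user must select a tab or screen and enable audio sharing. This is a browser limitation, not ours.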
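
To make the request testable outside a browser, the call can be wrapped behind a small interface. The sketch below is an assumption about how the hook might be structured, not existing code: TrackLike, StreamLike, DisplayMediaProvider, systemAudioOptions, ensureAudioTrack, and requestSystemAudio are all hypothetical names, and the structural interfaces stand in for the DOM types.

```typescript
// Structural stand-ins for MediaStreamTrack / MediaStream / MediaDevices.
interface TrackLike {
  kind: string;
  stop(): void;
}
interface StreamLike {
  getTracks(): TrackLike[];
  getAudioTracks(): TrackLike[];
}
interface DisplayMediaProvider {
  getDisplayMedia(options: Record<string, unknown>): Promise<StreamLike>;
}

// Options for the capture request. video must not be false, and
// systemAudio sits at the top level rather than inside audio.
function systemAudioOptions(): Record<string, unknown> {
  return {
    video: true,
    audio: { echoCancellation: false, noiseSuppression: false, autoGainControl: false },
    systemAudio: 'include',
  };
}

// Rejects streams without audio (for example, the user left "Share audio"
// unchecked) and releases the capture before reporting the problem.
function ensureAudioTrack(stream: StreamLike): StreamLike {
  if (stream.getAudioTracks().length === 0) {
    stream.getTracks().forEach((track) => track.stop());
    throw new Error('No audio track found');
  }
  return stream;
}

async function requestSystemAudio(provider: DisplayMediaProvider): Promise<StreamLike> {
  const stream = await provider.getDisplayMedia(systemAudioOptions());
  return ensureAudioTrack(stream);
}
```

In the browser, navigator.mediaDevices would be passed as the provider; in unit tests a fake object with a stubbed getDisplayMedia can take its place. Whether the unused video track can be stopped early without affecting the audio track is something to verify in Chrome before relying on it.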


6. Browser Compatibility

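| Platform / Browser | Tab Audio | System Audio | Works? |
|---|---|---|---|
| Chrome/Edge (Windows) | Yes | Yes | Best: full support |
| Chrome/Edge (macOS 14.2+) | Yes | Yes | Good |
| Chrome/Edge (Linux) | Yes | No | Works, tab audio only |
| Firefox | No | No | Audio ignored |
| Safari | No | No | Audio not supported |
| Mobile browsers | No | No | Not supported |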

Detection helper:

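function isSystemAudioSupported(): boolean {
  // getDisplayMedia is missing on mobile browsers and in insecure (non-HTTPS) contexts
  const hasDisplayMedia = typeof navigator.mediaDevices?.getDisplayMedia === 'function'
  const ua = navigator.userAgent
  // Chrome and Edge both include "Chrome" in the UA; Firefox and Safari don't capture audio in getDisplayMedia
  const isChromium = 'chrome' in window || ua.includes('Chrome')
  return hasDisplayMedia && isChromium && !ua.includes('Firefox')
}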

7. Test Strategy

Test Files

| File | Type | Count | Description |
|---|---|---|---|
| test_phase4_config.py | Backend integration | 3 | Config toggle, WebSocket accepts UUID |
| test_phase4_useSystemAudioASR.test.ts | Frontend unit | ~10 | Hook behavior: capture, permission, audio, WS |
| test_phase4_SystemAudioCapture.test.tsx | Frontend component | ~5 | UI states: idle, requesting, capturing, error |
| test_phase4_LTTPage_integration.test.tsx | Frontend integration | ~5 | Tab switching, ASR unification, Full Transcript |
| test_integration_phase4.py | Backend integration | 4 | Config toggle, WebSocket, ASR client |
| test_acceptance_phase4.py | Acceptance | 3 | Real browser + real DashScope ASR |

Mocking Strategy

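  • getDisplayMedia: Mock with vi.fn() (the frontend suite runs on vitest) returning a synthetic stream object that exposes an audio track
  • AudioContext: Use a manual mock (for example via vi.stubGlobal) for AudioContext and ScriptProcessorNode, since test DOM environments such as jsdom do not implement the Web Audio API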
  • WebSocket: Mock via vitest WebSocket mock (same pattern as Phase 2/3 tests)
  • DashScope ASR: Mock in CI; real in acceptance tests
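
A synthetic stream for the hook tests does not need a mocking library. The following is a minimal sketch of hand-written fakes, assuming the hook only touches getTracks(), getAudioTracks(), stop(), and onended; FakeAudioTrack and FakeMediaStream are hypothetical names.

```typescript
// Fake audio track that records whether it was stopped and lets a test
// simulate the user clicking "Stop sharing".
class FakeAudioTrack {
  readonly kind: string = 'audio';
  readyState: 'live' | 'ended' = 'live';
  onended: (() => void) | null = null;

  // Like the real API, calling stop() ends the track without firing onended.
  stop(): void {
    this.readyState = 'ended';
  }

  // Test-only helper: the browser ended the track, so onended fires.
  simulateUserStoppedSharing(): void {
    this.readyState = 'ended';
    if (this.onended) this.onended();
  }
}

class FakeMediaStream {
  private readonly tracks: FakeAudioTrack[];

  constructor(tracks: FakeAudioTrack[]) {
    this.tracks = tracks;
  }

  getTracks(): FakeAudioTrack[] {
    return [...this.tracks];
  }

  getAudioTracks(): FakeAudioTrack[] {
    return this.tracks.filter((track) => track.kind === 'audio');
  }
}
```

A test could then stub getDisplayMedia to resolve with new FakeMediaStream([new FakeAudioTrack()]), or with an empty track list to exercise the "no audio track" path, and call simulateUserStoppedSharing() to check the auto-stop behaviour.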

8. File Manifest

New Files

frontend/src/hooks/useSystemAudioASR.ts
frontend/src/components/SystemAudioCapture.tsx
frontend/src/test/test_phase4_useSystemAudioASR.test.ts
frontend/src/test/test_phase4_SystemAudioCapture.test.tsx
frontend/src/test/test_phase4_LTTPage_integration.test.tsx
backend/app/test/test_phase4_config.py
backend/app/test/test_integration_phase4.py
backend/app/test/acceptance/test_acceptance_phase4.py
.plans/phase4_system_audio_plan.md              ← this file

Modified Files

frontend/src/pages/LTTPage.tsx                    ← add "System Audio" tab, wire hook
frontend/src/types/index.ts                       ← add SystemAudioStatus, SystemAudioASRState
backend/app/core/config.py                        ← add SYSTEM_AUDIO_ENABLED
backend/app/routers/ws_asr.py                     ← check SYSTEM_AUDIO_ENABLED (Phase 4.5)
development_plan.md                               ← add Phase 4 row
README.md                                         ← add System Audio Capture section

9. Acceptance Criteria

  • User can select "System Audio" tab in LTTPage
  • Clicking "Start Capture" opens browser permission dialog
  • On permission grant, audio streams through WebSocket to DashScope ASR
  • Real-time transcript flows into QueryInput
  • User can edit transcript while capture continues
  • "Stop Capture" properly closes MediaStream, AudioContext, WebSocket
  • Permission denied shows clear error message
  • Browser compatibility note shown for non-Chrome browsers
  • All CI tests pass (no regressions)
  • Acceptance tests pass with real DashScope ASR
  • npm run build produces clean production build

File Information

  • Filename: phase4_system_audio_plan.md
  • Created: 2026-05-09
  • Status: Draft — awaiting review before Phase 4.1 implementation begins