Phase 4: System Audio & Mic Capture → ASR → RAG — Implementation Plan

Created: 2026-05-09
Updated: 2026-05-14
Status: 📋 Draft (Not Started)
Depends on: Phase 1 (Complete), Phase 2 (Complete), Phase 3 (Complete)


1. Overview

Phase 4 adds two new live audio sources to the LTTPage, alongside file Upload:

  1. System Audio Capture — captures audio from a shared browser tab or, where the browser and OS allow it, the whole system (a YouTube tab, Spotify, Zoom, system sounds) via getDisplayMedia().
  2. Listen Mic — captures microphone input (user's voice, room audio) via getUserMedia({ audio: true }).

Both pipe audio through the existing WebSocket → DashScope realtime ASR → RAG pipeline.

System Audio — Use Cases

  • Watching a YouTube video in a regular browser tab (no proxy needed — just share that tab's audio)
  • Listening to a podcast, lecture, or meeting and getting real-time transcript + RAG
  • Transcribing any audio playing on the computer without needing to download files

Listen Mic — Use Cases

  • Recording a live meeting or lecture through the computer's microphone
  • Dictating questions or notes verbally and getting RAG answers
  • Transcribing spoken Cantonese in real time without a video source

How They Work

[System Audio]
  User clicks "System Audio" → "Start Capture"
    → Browser shows permission dialog (screen/tab picker)
    → User selects a tab or the entire screen and enables audio sharing
    → getDisplayMedia() returns MediaStream (with audio track)
    → AudioContext.createMediaStreamSource(stream)
    → ScriptProcessorNode → WebSocket → DashScope ASR → Transcript → RAG

[Listen Mic]
  User clicks "Listen Mic" → "Start Listening"
    → Browser shows microphone permission prompt
    → getUserMedia({ audio: true }) returns MediaStream
    → AudioContext.createMediaStreamSource(stream)
    → ScriptProcessorNode → WebSocket → DashScope ASR → Transcript → RAG

Audio Routing (vs Existing Sources)

| Source | Audio Input | SourceNode Type | Start/Stop Trigger |
|---|---|---|---|
| Upload | <video> element | createMediaElementSource | play/pause events |
| System Audio | MediaStream from getDisplayMedia() | createMediaStreamSource | Manual Start/Stop button + track ended event |
| Listen Mic | MediaStream from getUserMedia({ audio: true }) | createMediaStreamSource | Manual Start/Stop button + track ended event |

Why New Hooks (Not Reuse Existing)

The existing useVideoASR hook depends on HTML media elements (<video>) for both the audio source and play/pause lifecycle. Both new sources use MediaStream objects (no DOM element), and their lifecycle is controlled by user permission (grant/revoke) and manual start/stop, not DOM events.

System Audio and Listen Mic share the same audio processing pipeline (MediaStream → AudioContext → ScriptProcessorNode → WebSocket) but differ in their capture API. A shared internal audio processing utility (useMediaStreamASR or similar) should be extracted to avoid code duplication between the two hooks.


2. User Flow

2.1 System Audio

  1. User selects "System Audio" tab (second option alongside Upload / Listen Mic)
  2. UI shows a "Start Capture" button with browser compatibility info
  3. User clicks "Start Capture"
  4. Browser opens permission dialog (screen/tab picker)
    • User selects a browser tab (e.g., "YouTube — Live Stream") or "Entire Screen"
    • User checks "Share audio" if available
  5. On approval: capture starts — status indicator shows "Capturing" with a live audio level meter
  6. Real-time ASR transcription flows into QueryInput (same as Upload)
  7. User can edit transcript while capturing continues
  8. User clicks "Stop Capture" to end — transcript stays in QueryInput
  9. User submits query → RAG pipeline processes it
  10. "Full Transcript" button hidden (streaming ASR only — no batch transcription for live sources)

Permission Denied Flow

  • User clicks "Cancel" in permission dialog → error: "Permission denied — system audio capture requires your explicit permission"
  • User revokes permission (Chrome "Stop sharing") → capture stops gracefully, status: "Capture stopped"
  • No audio track in the stream → error: "No audio track found in the shared content"

2.2 Listen Mic

  1. User selects "Listen Mic" tab (third option)
  2. UI shows a "Start Listening" button (no browser compatibility warning — widely supported)
  3. User clicks "Start Listening"
  4. Browser shows microphone permission prompt (first time only)
  5. On approval: listening starts — status indicator shows "Listening" with a live audio level meter
  6. Real-time ASR transcription flows into QueryInput
  7. User can edit transcript while listening continues
  8. User clicks "Stop Listening" to end — transcript stays in QueryInput
  9. User submits query → RAG pipeline processes it
  10. "Full Transcript" button hidden (streaming ASR only)

Permission Denied Flow

  • User clicks "Block" in mic permission prompt → error: "Microphone access denied — please allow microphone access in your browser settings"
  • User revokes permission via browser UI → listening stops, status: "Microphone disconnected"
  • No audio track → error: "No microphone input detected"
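
The two permission-denied flows could be handled by one small mapping function. The sketch below is an assumption about how that might look, not the plan's decided design: describeRejection and describeMissingTrack are hypothetical names, the messages are the short forms used in Phases 4.2 and 4.4, and the generic fallback message is invented for illustration. It relies on browsers rejecting both getDisplayMedia() and getUserMedia() with a NotAllowedError when the user cancels or blocks.

```typescript
// Hypothetical helper: turns a failed capture attempt into a status + message.
type CaptureSource = 'system-audio' | 'mic'

interface CaptureFailure {
  status: 'idle' | 'error'
  message: string
}

// Maps the rejection of getDisplayMedia()/getUserMedia() to a failure description.
function describeRejection(source: CaptureSource, err: { name?: string }): CaptureFailure {
  if (err.name === 'NotAllowedError') {
    // User cancelled the picker or blocked the microphone: back to idle with a message.
    return {
      status: 'idle',
      message: source === 'mic' ? 'Microphone access denied' : 'Permission denied',
    }
  }
  if (err.name === 'NotFoundError' && source === 'mic') {
    // No microphone device is available.
    return { status: 'error', message: 'No microphone input detected' }
  }
  // Fallback text is an assumption, not taken from the plan.
  return { status: 'error', message: 'Audio capture failed' }
}

// Used when the promise resolved but stream.getAudioTracks() is empty.
function describeMissingTrack(source: CaptureSource): CaptureFailure {
  return {
    status: 'error',
    message: source === 'mic' ? 'No microphone input detected' : 'No audio track found',
  }
}
```

Keeping the mapping outside the hooks would let both hooks share it and lets it be unit-tested without a browser.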

3. Architecture

3.1 Component Tree (LTTPage — All Sources)

LTTPage
├── SourceSelector (tabs: Upload | System Audio | Listen Mic)
├── [source === 'system-audio']
│   └── SystemAudioCapture
│       ├── Start/Stop button
│       ├── Status indicator (idle | requesting | capturing | error)
│       ├── Audio level meter (optional, nice-to-have)
│       └── Browser compatibility note (non-Chrome users)
├── [source === 'mic']
│   └── MicCapture
│       ├── Start/Stop button
│       ├── Status indicator (idle | requesting | listening | error)
│       └── Audio level meter (optional, nice-to-have)
├── QueryInput (receives transcript from active ASR hook)
├── ExtractedQuestionsDisplay
└── RAG Response Panel
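
If the optional audio level meter is built, one simple approach is to compute the RMS level of each audio buffer that already passes through the processing callback. The following is a minimal sketch under that assumption; computeLevel is a hypothetical name and the plan does not prescribe how the meter is calculated.

```typescript
// Hypothetical helper for the optional level meter.
// Returns the RMS level of one buffer of samples, clamped to the range 0..1.
function computeLevel(samples: Float32Array): number {
  if (samples.length === 0) return 0
  let sumSquares = 0
  for (let i = 0; i < samples.length; i++) {
    sumSquares += samples[i] * samples[i]
  }
  return Math.min(1, Math.sqrt(sumSquares / samples.length))
}
```

The result can drive a CSS width or opacity directly; because it reuses buffers the pipeline already receives, it needs no extra audio nodes.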

3.2 Data Flow

System Audio

SystemAudioCapture (UI)
  │
  ├── "Start Capture" click → calls startCapture() from hook
  │
  ▼
useSystemAudioASR hook
  │
  ├── getDisplayMedia({ video: true, audio: true, systemAudio: 'include' })
  │     └── User picks tab/window → returns MediaStream
  │
  ├── AudioContext.createMediaStreamSource(stream)
  │
  ├── ScriptProcessorNode (4096 buffer, mono 16kHz)
  │
  ├── WebSocket → ws://host/ws/asr/{uuid}?language=yue
  │
  └── Returns: { status, transcript, partialTranscript, startCapture, stopCapture }

Listen Mic

MicCapture (UI)
  │
  ├── "Start Listening" click → calls startListening() from hook
  │
  ▼
useMicASR hook
  │
  ├── getUserMedia({ audio: true })
  │     └── Browser shows mic permission prompt → returns MediaStream
  │
  ├── AudioContext.createMediaStreamSource(stream)
  │
  ├── ScriptProcessorNode (4096 buffer, mono 16kHz)
  │
  ├── WebSocket → ws://host/ws/asr/{uuid}?language=yue
  │
  └── Returns: { status, transcript, partialTranscript, startListening, stopListening }
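
Both flows list "mono 16kHz", but an AudioContext usually runs at the hardware rate (commonly 44.1 or 48 kHz), so each buffer has to be resampled before it is sent. The sketch below shows one simple way to do that by averaging input samples per output sample. It is an illustration only: downsample is a hypothetical name, and the actual conversion in useVideoASR.ts may differ.

```typescript
// Hypothetical helper: reduces the sample rate by averaging each group of
// input samples that falls into one output sample. The averaging acts as a
// crude low-pass filter; a production resampler may want a proper filter.
function downsample(input: Float32Array, inputRate: number, targetRate = 16000): Float32Array {
  if (targetRate > inputRate) {
    throw new RangeError('targetRate must not exceed inputRate')
  }
  if (targetRate === inputRate) return input.slice()

  const ratio = inputRate / targetRate
  const output = new Float32Array(Math.floor(input.length / ratio))
  for (let i = 0; i < output.length; i++) {
    const start = Math.floor(i * ratio)
    const end = Math.min(Math.floor((i + 1) * ratio), input.length)
    let sum = 0
    for (let j = start; j < end; j++) sum += input[j]
    output[i] = sum / (end - start)
  }
  return output
}
```

For example, a 4096-sample buffer captured at 48 kHz becomes 1365 samples at 16 kHz.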

LTTPage Unification

const asr = source === 'system-audio' ? systemAudioASR
  : source === 'mic' ? micASR
  : uploadASR
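
The same selection can be written as a typed lookup so the compiler flags a missing source if SourceType is extended later. This is a sketch under assumptions: selectASR and ASRView are hypothetical names, and ASRView lists only fields that all three hooks are expected to share.

```typescript
type SourceType = 'upload' | 'system-audio' | 'mic'

// Fields assumed to be common to every ASR hook's return value.
interface ASRView {
  transcript: string
  partialTranscript: string
  error: string | null
}

// Record<SourceType, T> forces the caller to supply one entry per source.
function selectASR<T extends ASRView>(source: SourceType, hooks: Record<SourceType, T>): T {
  return hooks[source]
}
```

In LTTPage all three hooks would still be called unconditionally on every render, as React's rules of hooks require; only the selection of which result to display is conditional.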

3.3 Backend Changes

Minimal. The existing WebSocket ASR endpoint (ws_asr.py) already accepts audio from any source. The only additions are UUID-based video_id handling and feature toggles.

| Change | File | Description |
|---|---|---|
| Allow UUID video_id | backend/app/routers/ws_asr.py | Accept non-file-based video IDs (already accepts any string) |
| Transcript persistence | backend/app/services/history_service.py | Store system audio & mic transcripts with UUID session ID (optional — nice-to-have) |
| Config | backend/app/core/config.py | Add SYSTEM_AUDIO_ENABLED and MIC_ENABLED toggles (default: true) |

No changes needed to:

  • DashScope ASR client (receives PCM, doesn't care about source)
  • WebSocket protocol (same binary PCM format)
  • RAG pipeline (consumes transcript text)

3.4 Frontend Files

| File | Status | Description |
|---|---|---|
| frontend/src/components/SourceSelector.tsx | New | Reusable tab bar component (Upload \| System Audio \| Listen Mic) |
| frontend/src/hooks/useSystemAudioASR.ts | New | Hook: getDisplayMedia → AudioContext → WebSocket |
| frontend/src/hooks/useMicASR.ts | New | Hook: getUserMedia → AudioContext → WebSocket |
| frontend/src/components/SystemAudioCapture.tsx | New | UI: Start/Stop, status, compatibility note |
| frontend/src/components/MicCapture.tsx | New | UI: Start/Stop, status |
| frontend/src/pages/LTTPage.tsx | Modified | Add source selector, wire hooks, unify ASR, conditional rendering |
| frontend/src/types/index.ts | Modified | Add SourceType, SystemAudioStatus, MicStatus types |

4. Sub-Phases

| Sub-Phase | Description | Effort | Depends On | Status |
|---|---|---|---|---|
| 4.1 | Config & Infrastructure | 0.5 day | - | 📋 Draft |
| 4.2 | System Audio Capture Hook (useSystemAudioASR) | 1 day | 4.1 | 📋 Draft |
| 4.3 | SystemAudioCapture UI Component | 0.5 day | 4.2 | 📋 Draft |
| 4.4 | Mic Capture Hook (useMicASR) | 0.5 day | 4.1 | 📋 Draft |
| 4.5 | MicCapture UI Component | 0.5 day | 4.4 | 📋 Draft |
| 4.6 | LTTPage Integration (all 3 sources) | 0.5 day | 4.2, 4.3, 4.4, 4.5 | 📋 Draft |
| 4.7 | Backend Adjustments | 0.5 day | 4.1 | 📋 Draft |
| 4.8 | Integration & Acceptance Tests | 1 day | 4.6, 4.7 | 📋 Draft |
| 4.9 | Polish & Documentation | 0.5 day | 4.8 | 📋 Draft |
| Total | | 5.5 days | | |

Phase 4.1 — Config & Infrastructure (0.5 day)

Objective: Add feature toggles, define types, establish UUID generation.

Tasks:

  1. Add SYSTEM_AUDIO_ENABLED and MIC_ENABLED to backend/app/core/config.py (default: True)
  2. Add SystemAudioStatus type to frontend/src/types/index.ts:
    type SystemAudioStatus = 'idle' | 'requesting' | 'capturing' | 'stopping' | 'error'
    
  3. Add MicStatus type:
    type MicStatus = 'idle' | 'requesting' | 'listening' | 'stopping' | 'error'
    
  4. Add SystemAudioASRState and MicASRState interfaces to types
  5. Add video_id UUID generation helper (frontend-side: crypto.randomUUID())
  6. Verify WebSocket ASR endpoint accepts arbitrary video_id strings (it does — confirm with a quick test)
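
The status types above imply a small state machine. One way to keep the transitions explicit, and to ignore events that arrive in the wrong state (for example a second Start click while the permission dialog is open), is a transition table. The sketch below is an assumption, not part of the plan: the event names and nextStatus are hypothetical, and MicStatus would follow the same shape with 'listening' in place of 'capturing'.

```typescript
type SystemAudioStatus = 'idle' | 'requesting' | 'capturing' | 'stopping' | 'error'

// Hypothetical event names, chosen for illustration.
type CaptureAction = 'START' | 'GRANTED' | 'DENIED' | 'FAILED' | 'STOP' | 'STOPPED'

// Allowed transitions per status; anything not listed is ignored.
const transitions: Record<SystemAudioStatus, Partial<Record<CaptureAction, SystemAudioStatus>>> = {
  idle: { START: 'requesting' },
  requesting: { GRANTED: 'capturing', DENIED: 'idle', FAILED: 'error' },
  capturing: { STOP: 'stopping', FAILED: 'error' },
  stopping: { STOPPED: 'idle' },
  error: { START: 'requesting' }, // "Try Again" restarts the request
}

function nextStatus(current: SystemAudioStatus, action: CaptureAction): SystemAudioStatus {
  return transitions[current][action] ?? current
}
```

Because nextStatus is a pure function, it could back a useReducer call in either hook and be tested without mocking any browser API.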

Test Files: backend/app/test/test_phase4_config.py

Phase 4.2 — System Audio Capture Hook (1 day)

Objective: Create useSystemAudioASR.ts hook that captures system audio and streams it to the ASR WebSocket.

Key Design:

interface UseSystemAudioASRProps {
  wsUrl: string   // e.g., ws://localhost:8000/ws/asr/{uuid}?language=yue
}

interface UseSystemAudioASRReturn {
  status: 'idle' | 'requesting' | 'capturing' | 'stopping' | 'error'
  transcript: string
  partialTranscript: string
  error: string | null
  startCapture: () => Promise<void>
  stopCapture: () => void
}

Implementation Details:

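  • startCapture(): calls navigator.mediaDevices.getDisplayMedia({ video: true, audio: true, systemAudio: 'include' }). The call rejects with a TypeError if video is false, and systemAudio is a top-level option rather than an audio constraint.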
    • On success: creates AudioContext, createMediaStreamSource(stream), connects ScriptProcessor → WebSocket
    • On user cancel: sets status to 'idle', sets error "Permission denied"
    • On no audio track: sets status to 'error', sets error "No audio track found"
  • stopCapture(): stops all tracks in the MediaStream, closes AudioContext, closes WebSocket
  • Auto-stop: listens for track.onended (user clicks "Stop sharing" in Chrome) → calls stopCapture
  • Audio processing: identical to useVideoASR — ScriptProcessorNode(4096), convert Float32 → Int16 PCM, send via WebSocket
  • WebSocket lifecycle: connect on capture start, close on capture stop
  • Cleanup: useEffect return closes AudioContext, WebSocket, and stops tracks
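
The Float32 → Int16 PCM step mentioned above can be sketched as follows. This is a common formulation rather than a copy of what useVideoASR.ts does; floatTo16BitPCM is a hypothetical name.

```typescript
// Converts Web Audio samples (floats in -1..1) to 16-bit signed PCM.
// Out-of-range samples are clamped so they cannot wrap around.
function floatTo16BitPCM(input: Float32Array): Int16Array {
  const output = new Int16Array(input.length)
  for (let i = 0; i < input.length; i++) {
    const s = Math.max(-1, Math.min(1, input[i]))
    // Negative values scale to -32768, positive values to 32767.
    output[i] = s < 0 ? s * 0x8000 : s * 0x7fff
  }
  return output
}
```

In the hook, the resulting array's underlying buffer would be what is passed to the WebSocket's send() as a binary frame.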

Pattern to Follow:

  • AudioContext setup: follow useVideoASR.ts lines 45-143 (AudioContext, ScriptProcessor, sample rate conversion)
  • WebSocket handling: follow useVideoASR.ts lines 35-100
  • State management: reuse the state-management pattern from useVideoASR.ts, adapted for a MediaStream source

Test Files: frontend/src/test/test_phase4_useSystemAudioASR.test.ts

Phase 4.3 — SystemAudioCapture UI Component (0.5 day)

Objective: Create the SystemAudioCapture.tsx component with Start/Stop button, status display, and browser compatibility info.

Component Props:

interface SystemAudioCaptureProps {
  status: SystemAudioStatus
  error: string | null
  onStart: () => void
  onStop: () => void
}

UI States:

  1. Idle: "Start Capture" button (blue, prominent) + compatibility note
  2. Requesting: "Waiting for permission..." (loading spinner)
  3. Capturing: "Stop Capture" button (red) + pulsing green dot + "Capturing system audio..."
  4. Error: Red banner with error message + "Try Again" button

Browser Compatibility Note:

⚠️ System audio capture works best in Chrome/Edge on Windows/macOS.
Firefox and Safari do not support this feature.
On Linux, only tab audio is available (not full system audio).

Test Files: frontend/src/test/test_phase4_SystemAudioCapture.test.tsx

Phase 4.4 — Mic Capture Hook (0.5 day)

Objective: Create useMicASR.ts hook that captures microphone input and streams it to the ASR WebSocket.

Key Design:

interface UseMicASRProps {
  wsUrl: string   // e.g., ws://localhost:8000/ws/asr/{uuid}?language=yue
}

interface UseMicASRReturn {
  status: 'idle' | 'requesting' | 'listening' | 'stopping' | 'error'
  transcript: string
  partialTranscript: string
  error: string | null
  startListening: () => Promise<void>
  stopListening: () => void
}

Implementation Details:

  • startListening(): calls navigator.mediaDevices.getUserMedia({ audio: true, video: false })
    • On success: creates AudioContext, createMediaStreamSource(stream), connects ScriptProcessor → WebSocket
    • On user deny: sets status to 'idle', sets error "Microphone access denied"
    • On no audio track: sets status to 'error', sets error "No microphone input detected"
  • stopListening(): stops all tracks in the MediaStream, closes AudioContext, closes WebSocket
  • Auto-stop: listens for track.onended (user revokes permission) → calls stopListening
  • Audio processing: identical to useSystemAudioASR — ScriptProcessorNode(4096), convert Float32 → Int16 PCM, send via WebSocket
  • WebSocket lifecycle: connect on listening start, close on listening stop
  • Cleanup: useEffect return closes AudioContext, WebSocket, and stops tracks

Code Sharing: Extract shared audio processing logic (MediaStream → AudioContext → ScriptProcessorNode → WebSocket) into a reusable internal utility (useMediaStreamASR or audioPipeline.ts) to avoid duplication between useSystemAudioASR and useMicASR.
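
One piece of the shared utility that both hooks need is teardown that is safe to call more than once, since stop can be triggered by the Stop button, by track.onended, and by the useEffect cleanup. The sketch below shows that idea with minimal structural types so it does not depend on the DOM; createCaptureSession and the interface names are hypothetical.

```typescript
// Minimal shapes matched by MediaStreamTrack, AudioContext and WebSocket.
interface Stoppable { stop(): void }
interface Closable { close(): unknown }

interface CaptureResources {
  tracks: Stoppable[]
  audioContext: Closable
  socket: Closable
}

// Wraps the resources of one capture so they are released exactly once.
function createCaptureSession(resources: CaptureResources) {
  let active = true
  return {
    isActive: (): boolean => active,
    stop(): void {
      if (!active) return // already stopped: a second call is a no-op
      active = false
      resources.tracks.forEach((track) => track.stop())
      resources.audioContext.close()
      resources.socket.close()
    },
  }
}
```

A hook would create one session per successful capture and call session.stop() from every stop path, which also covers the "multiple rapid start/stop cycles" test case.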

Test Files: frontend/src/test/test_phase4_useMicASR.test.ts

Phase 4.5 — MicCapture UI Component (0.5 day)

Objective: Create the MicCapture.tsx component with Start/Stop button and status display.

Component Props:

interface MicCaptureProps {
  status: MicStatus
  error: string | null
  onStart: () => void
  onStop: () => void
}

UI States:

  1. Idle: "Start Listening" button (blue, prominent) — no compatibility warning needed (mic is universally supported)
  2. Requesting: "Waiting for microphone permission..." (loading spinner)
  3. Listening: "Stop Listening" button (red) + pulsing green dot + "Listening..."
  4. Error: Red banner with error message + "Try Again" button

Test Files: frontend/src/test/test_phase4_MicCapture.test.tsx

Phase 4.6 — LTTPage Integration (0.5 day)

Objective: Create the SourceSelector tab bar component and wire both new sources into LTTPage.

New Component — SourceSelector.tsx:

interface SourceSelectorProps {
  activeSource: SourceType
  onSelect: (source: SourceType) => void
}

  • Three tabs: Upload (📁), System Audio (🔊), Listen Mic (🎤)
  • Active tab highlighted with blue background, inactive tabs gray
  • Icons from lucide-react: Upload, MonitorSpeaker, Mic

Changes to LTTPage.tsx:

  1. Add SourceType state: const [source, setSource] = useState<SourceType>('upload')
  2. Render <SourceSelector activeSource={source} onSelect={setSource} /> above the panels
  3. Extend SourceType to 'upload' | 'system-audio' | 'mic'
  4. Initialize useSystemAudioASR and useMicASR hooks with session-scoped UUIDs (generated once when tab selected, reused across Start/Stop cycles)
  5. Update asr variable:
    const asr = source === 'system-audio' ? systemAudioASR
      : source === 'mic' ? micASR
      : uploadASR
    
  6. Conditional rendering:
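    {source === 'upload' && <VideoUpload />}
    {source === 'system-audio' && <SystemAudioCapture status={systemAudioASR.status} error={systemAudioASR.error}
      onStart={systemAudioASR.startCapture} onStop={systemAudioASR.stopCapture} />}
    {source === 'mic' && <MicCapture status={micASR.status} error={micASR.error}
      onStart={micASR.startListening} onStop={micASR.stopListening} />}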
    
  7. WebSocket URL: ws://host/ws/asr/{sessionUUID}?language=yue (UUID stable per session, regenerated only on source switch)
  8. Full Transcript button: hidden for system-audio AND mic (streaming ASR only)
  9. QueryInput: remains editable during capture/listening
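
The session-ID rule in steps 4 and 7 (stable across Start/Stop, regenerated only on source switch) can be sketched as two small pure functions. The names selectSource, buildAsrUrl and LiveSession are hypothetical; the ID generator is injected so the logic can be tested, and in the browser it would be () => crypto.randomUUID() as noted in Phase 4.1.

```typescript
type SourceType = 'upload' | 'system-audio' | 'mic'

interface LiveSession {
  source: SourceType
  sessionId: string
}

// Keeps the current session when the same source is re-selected,
// and creates a new ID only when the source actually changes.
function selectSource(
  current: LiveSession | null,
  next: SourceType,
  generateId: () => string,
): LiveSession {
  if (current !== null && current.source === next) return current
  return { source: next, sessionId: generateId() }
}

// Builds the ASR endpoint URL in the shape used throughout this plan.
function buildAsrUrl(host: string, sessionId: string, language = 'yue'): string {
  return `ws://${host}/ws/asr/${encodeURIComponent(sessionId)}?language=${encodeURIComponent(language)}`
}
```

If the page is served over HTTPS, the scheme would need to be wss:// instead of ws://, since browsers block insecure WebSocket connections from secure pages.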

Test Files: frontend/src/test/test_phase4_LTTPage_integration.test.tsx

Phase 4.7 — Backend Adjustments (0.5 day)

Objective: Ensure backend handles both system audio and mic sessions correctly.

Tasks:

  1. Verify ws_asr.py WebSocket endpoint works with arbitrary video_id (UUID format) — likely no changes needed
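  2. Add SYSTEM_AUDIO_ENABLED and MIC_ENABLED config validation in the router (return 503 if disabled; for a WebSocket route this means rejecting the connection before the upgrade completes). Because the endpoint is otherwise source-agnostic, the client needs to identify its source (for example with a query parameter) so the router knows which toggle applies.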
  3. Handle system audio and mic sessions in transcript history (optional — store with source: 'system-audio' / source: 'mic' metadata)
  4. Verify the ASR client handles audio from both sources identically

No new endpoints needed. The existing WebSocket and ASR infrastructure is source-agnostic.

Test Files: backend/app/test/test_phase4_config.py

Phase 4.8 — Integration & Acceptance Tests (1 day)

Objective: Comprehensive tests for both capture flows.

Backend Integration Tests (backend/app/test/test_integration_phase4.py):

  1. WebSocket accepts UUID video_id
  2. ASR processes audio from system audio session
  3. ASR processes audio from mic session
  4. Config toggles disable features

Frontend Tests:

  1. System Audio Hook tests (test_phase4_useSystemAudioASR.test.ts): ~10 tests

    • Mock getDisplayMedia → successful capture
    • Mock getDisplayMedia → user cancels (permission denied)
    • Mock getDisplayMedia → no audio track
    • AudioContext setup and teardown
    • WebSocket connection lifecycle
    • PCM conversion and sending
    • track.onended triggers auto-stop
    • stopCapture cleanup
    • Multiple rapid start/stop cycles
  2. System Audio Component tests (test_phase4_SystemAudioCapture.test.tsx): ~5 tests

    • All UI states render correctly (idle, requesting, capturing, error)
    • Start button calls onStart
    • Stop button calls onStop
    • Error state shows message and retry button
    • Compatibility note visible for non-Chrome (optional)
  3. Mic Hook tests (test_phase4_useMicASR.test.ts): ~8 tests

    • Mock getUserMedia → successful capture
    • Mock getUserMedia → user denies (permission denied)
    • Mock getUserMedia → no audio track
    • AudioContext setup and teardown
    • WebSocket connection lifecycle
    • track.onended triggers auto-stop
    • stopListening cleanup
    • PCM conversion and sending
  4. Mic Component tests (test_phase4_MicCapture.test.tsx): ~4 tests

    • All UI states render correctly (idle, requesting, listening, error)
    • Start button calls onStart
    • Stop button calls onStop
    • Error state shows message and retry button
  5. LTTPage Integration tests (test_phase4_LTTPage_integration.test.tsx): ~8 tests

    • System Audio tab renders and switches correctly
    • Listen Mic tab renders and switches correctly
    • ASR variable selects systemAudioASR when source is system-audio
    • ASR variable selects micASR when source is mic
    • Full Transcript button hidden for system audio and mic
    • QueryInput receives transcript from system audio
    • QueryInput receives transcript from mic
    • Source switching preserves transcript

Acceptance Tests (backend/app/test/acceptance/test_acceptance_phase4.py):

  • Real getDisplayMedia with actual browser (manual — requires human interaction)
  • Real getUserMedia with actual microphone (manual — requires human interaction)
  • Real DashScope ASR with system audio stream
  • Real DashScope ASR with microphone stream
  • End-to-end: capture → ASR → transcript → RAG answer (both sources)

Phase 4.9 — Polish & Documentation (0.5 day)

Tasks:

  1. Update README.md — add System Audio Capture and Listen Mic sections with usage instructions, browser compatibility table, and limitations
  2. Update development_plan.md — add Phase 4 row to timeline, mark status
  3. Add browser detection helper for system audio compatibility warning
  4. Verify production build (npm run build)
  5. Run full CI regression (pytest + vitest)
  6. Final commit

5. Design Decisions

| Decision | Rationale |
|---|---|
| New hooks rather than modifying existing | MediaStream source requires createMediaStreamSource (not createMediaElementSource), and lifecycle is permission-based (not play/pause events). Separate hooks avoid branching complexity. |
| Two separate hooks + shared audio utility | System Audio and Mic share identical audio processing (MediaStream → PCM → WebSocket) but differ in capture API (getDisplayMedia vs getUserMedia) and UX. Extract shared pipeline to avoid duplication. |
| UUID-based video_id (per-session) | No actual video file for live audio. UUID generated once when source tab is selected, reused across Start/Stop cycles within the same session. Regenerated only when switching between sources. Backend WebSocket already accepts arbitrary strings. |
| Manual Start/Stop (not auto) | Both getDisplayMedia() and getUserMedia() require explicit user action (browser policy). Cannot auto-start. |
| No video display in System Audio or Mic mode | User watches/listens to content elsewhere. Only capture status and audio controls shown. |
| video: true in getDisplayMedia, video track unused | getDisplayMedia() rejects with a TypeError when video is false, so a video track must be requested. Only the audio track is connected to the AudioContext, so no video data is processed or sent. |
| Hide Full Transcript button for both new sources | Streaming ASR only — no video file to batch transcribe. Full transcript would require audio recording (future Phase 5). |
| Browser compatibility note only for System Audio | Mic (getUserMedia) is universally supported in all modern browsers. System Audio (getDisplayMedia with audio) is Chrome/Edge-only. |
| Mic uses getUserMedia({ audio: true, video: false }) | Audio-only capture — no camera needed. |

getDisplayMedia Options (System Audio)

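const stream = await navigator.mediaDevices.getDisplayMedia({
  video: true,                  // required: video: false rejects with a TypeError
  audio: {
    echoCancellation: false,
    noiseSuppression: false,
    autoGainControl: false,
  },
  systemAudio: 'include',       // top-level option, not an audio constraint
})

Note on video: getDisplayMedia() always requires a video track, so video cannot be set to false. The hook requests video but uses only the audio track; the video track can be stopped once the audio track is confirmed to be live. The browser permission dialog still shows screen/tab selection (there's no "audio-only picker"), so the user must select a tab or screen to share. This is a browser limitation.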

getUserMedia Options (Listen Mic)

const stream = await navigator.mediaDevices.getUserMedia({
  audio: {
    echoCancellation: false,    // Don't filter audio (pass raw mic input)
    noiseSuppression: false,    // Don't filter audio
    autoGainControl: false,     // Don't adjust volume
  },
  video: false,
})

6. Browser Compatibility

System Audio (getDisplayMedia)

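| Platform / Browser | Tab Audio | System Audio | Works? |
|---|---|---|---|
| Chrome/Edge (Windows) | Yes | Yes | Best — full support |
| Chrome/Edge (macOS 14.2+) | Yes | Yes | Good |
| Chrome/Edge (Linux) | Yes | No | Works, tab audio only |
| Firefox | No | No | Audio ignored |
| Safari | No | No | Audio not supported |
| Mobile browsers | No | No | Not supported |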

Detection helper:

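function isSystemAudioSupported(): boolean {
  // Feature-detect the API first: mobile Chrome reports "Chrome" but lacks getDisplayMedia
  const hasDisplayCapture = typeof navigator.mediaDevices?.getDisplayMedia === 'function'
  // Audio capture through getDisplayMedia is limited to Chromium-based desktop browsers
  const isChromium = 'chrome' in window || navigator.userAgent.includes('Chrome')
  return hasDisplayCapture && isChromium && !navigator.userAgent.includes('Firefox')
}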

Listen Mic (getUserMedia)

| Platform / Browser | Microphone | Works? |
|---|---|---|
| Chrome/Edge | Yes | Full support |
| Firefox | Yes | Full support |
| Safari | Yes | Full support |
| Mobile browsers | Yes | Full support |

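Mic capture is supported in all modern browsers, so no compatibility warning is needed. Note that getUserMedia() is only available in secure contexts (HTTPS or localhost).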


7. Test Strategy

Test Files

| File | Type | Count | Description |
|---|---|---|---|
| test_phase4_config.py | Backend integration | 4 | Config toggles, WebSocket accepts UUID |
| test_phase4_useSystemAudioASR.test.ts | Frontend unit | ~10 | Hook: capture, permission, audio, WS |
| test_phase4_SystemAudioCapture.test.tsx | Frontend component | ~5 | UI states: idle, requesting, capturing, error |
| test_phase4_useMicASR.test.ts | Frontend unit | ~8 | Hook: capture, permission, audio, WS |
| test_phase4_MicCapture.test.tsx | Frontend component | ~4 | UI states: idle, requesting, listening, error |
| test_phase4_LTTPage_integration.test.tsx | Frontend integration | ~8 | Tab switching, ASR unification, Full Transcript |
| test_integration_phase4.py | Backend integration | 4 | Config toggles, WebSocket, ASR client |
| test_acceptance_phase4.py | Acceptance | 5 | Real browser + real mic + real DashScope ASR |

Mocking Strategy

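  • getDisplayMedia: Mock with vi.fn() (the frontend suite runs on vitest) returning a synthetic MediaStream with one audio track
  • getUserMedia: Mock with vi.fn() returning a synthetic MediaStream with one audio track
  • AudioContext: Use a manual mock (for example installed with vi.stubGlobal) for AudioContext and ScriptProcessorNode, since jsdom-style test environments do not implement the Web Audio API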
  • WebSocket: Mock via vitest WebSocket mock (same pattern as Phase 2/3 tests)
  • DashScope ASR: Mock in CI; real in acceptance tests
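
A synthetic stream for these mocks can be a plain object that implements only the members the hooks touch. The sketch below is one possible shape, with hypothetical names (createFakeTrack, createFakeStream, simulateEnded). It mirrors one detail of the real API that matters for the auto-stop tests: calling stop() on a track does not fire its ended event, so the fake exposes a separate simulateEnded() to imitate the user clicking "Stop sharing".

```typescript
// Hand-rolled stand-in for MediaStreamTrack, limited to what the hooks use.
interface FakeTrack {
  kind: 'audio' | 'video'
  readyState: 'live' | 'ended'
  onended: (() => void) | null
  stop(): void
  simulateEnded(): void // test-only: imitates the browser ending the track
}

function createFakeTrack(kind: 'audio' | 'video'): FakeTrack {
  const track: FakeTrack = {
    kind,
    readyState: 'live',
    onended: null,
    stop() {
      // Like the real API, stop() ends the track without firing onended.
      track.readyState = 'ended'
    },
    simulateEnded() {
      if (track.readyState === 'ended') return
      track.readyState = 'ended'
      track.onended?.()
    },
  }
  return track
}

// Hand-rolled stand-in for MediaStream.
function createFakeStream(tracks: FakeTrack[]) {
  return {
    getTracks: (): FakeTrack[] => [...tracks],
    getAudioTracks: (): FakeTrack[] => tracks.filter((t) => t.kind === 'audio'),
  }
}
```

In a vitest test the capture API could then be replaced with something like vi.fn().mockResolvedValue(createFakeStream([createFakeTrack('audio')])), and the "no audio track" case covered by passing a stream that contains only a video track.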

8. File Manifest

New Files

frontend/src/hooks/useSystemAudioASR.ts
frontend/src/hooks/useMicASR.ts
frontend/src/components/SystemAudioCapture.tsx
frontend/src/components/MicCapture.tsx
frontend/src/components/SourceSelector.tsx
frontend/src/test/test_phase4_useSystemAudioASR.test.ts
frontend/src/test/test_phase4_SystemAudioCapture.test.tsx
frontend/src/test/test_phase4_useMicASR.test.ts
frontend/src/test/test_phase4_MicCapture.test.tsx
frontend/src/test/test_phase4_LTTPage_integration.test.tsx
backend/app/test/test_phase4_config.py
backend/app/test/test_integration_phase4.py
backend/app/test/acceptance/test_acceptance_phase4.py
.plans/phase4_system_audio_plan.md              ← this file

Modified Files

frontend/src/pages/LTTPage.tsx                    ← add "System Audio" + "Listen Mic" tabs, wire hooks
frontend/src/types/index.ts                       ← add SourceType, SystemAudioStatus, MicStatus, ASRState types
backend/app/core/config.py                        ← add SYSTEM_AUDIO_ENABLED, MIC_ENABLED
development_plan.md                               ← add Phase 4 row
README.md                                         ← add System Audio + Listen Mic sections

9. Acceptance Criteria

  • User can select "System Audio" tab in LTTPage
  • User can select "Listen Mic" tab in LTTPage
  • Clicking "Start Capture" (System Audio) opens browser permission dialog
  • Clicking "Start Listening" (Listen Mic) opens microphone permission prompt
  • On permission grant, audio streams through WebSocket to DashScope ASR (both sources)
  • Real-time transcript flows into QueryInput (both sources)
  • User can edit transcript while capture/listening continues
  • "Stop Capture" properly closes MediaStream, AudioContext, WebSocket
  • "Stop Listening" properly closes MediaStream, AudioContext, WebSocket
  • Permission denied shows clear error message (both sources)
  • Browser compatibility note shown for System Audio on non-Chrome browsers
  • No compatibility warning for Listen Mic (universally supported)
  • All CI tests pass (no regressions)
  • Acceptance tests pass with real DashScope ASR
  • npm run build produces clean production build

File Information

  • Filename: phase4_system_audio_plan.md
  • Created: 2026-05-09
  • Updated: 2026-05-14 — Added Listen Mic as third source; removed YouTube
  • Status: Draft — awaiting review before Phase 4.1 implementation begins