Phase 4: System Audio & Mic Capture → ASR → RAG — Implementation Plan

Created: 2026-05-09
Updated: 2026-05-15
Status: ✅ Complete
Depends on: Phase 1 (Complete), Phase 2 (Complete), Phase 3 (Complete)

1. Overview

Phase 4 adds two new live audio sources to LTTPage, alongside file Upload:

System Audio Capture — captures audio output from any application on the user's computer (browser tab, Spotify, Zoom, system sounds) via getDisplayMedia().
Listen Mic — captures microphone input (user's voice, room audio) via getUserMedia({ audio: true }).

Both pipe audio through the existing WebSocket → DashScope realtime ASR → RAG pipeline.

System Audio — Use Cases

Watching a YouTube video in a regular browser tab (no proxy needed — just share that tab's audio)
Listening to a podcast, lecture, or meeting and getting real-time transcript + RAG
Transcribing any audio playing on the computer without needing to download files

Listen Mic — Use Cases

Recording a live meeting or lecture through the computer's microphone
Dictating questions or notes verbally and getting RAG answers
Transcribing spoken Cantonese in real time without a video source

How They Work

[System Audio]
  User clicks "System Audio" → "Start Capture"
    → Browser shows permission dialog (screen/tab picker)
    → User selects tab/window/screen (with audio)
    → getDisplayMedia() returns MediaStream (with audio track)
    → AudioContext.createMediaStreamSource(stream)
    → ScriptProcessorNode → WebSocket → DashScope ASR → Transcript → RAG

[Listen Mic]
  User clicks "Listen Mic" → "Start Listening"
    → Browser shows microphone permission prompt
    → getUserMedia({ audio: true }) returns MediaStream
    → AudioContext.createMediaStreamSource(stream)
    → ScriptProcessorNode → WebSocket → DashScope ASR → Transcript → RAG

Audio Routing (vs Existing Sources)

Source	Audio Input	SourceNode Type	Start/Stop Trigger
Upload	`<video>` element	`createMediaElementSource`	play/pause events
System Audio	MediaStream from `getDisplayMedia()`	`createMediaStreamSource`	Manual Start/Stop button + track ended event
Listen Mic	MediaStream from `getUserMedia({ audio: true })`	`createMediaStreamSource`	Manual Start/Stop button + track ended event

Why New Hooks (Not Reuse Existing)

The existing useVideoASR hook depends on HTML media elements (<video>) for both the audio source and play/pause lifecycle. Both new sources use MediaStream objects (no DOM element), and their lifecycle is controlled by user permission (grant/revoke) and manual start/stop, not DOM events.

System Audio and Listen Mic share the same audio processing pipeline (MediaStream → AudioContext → ScriptProcessorNode → WebSocket) but differ in their capture API. A shared internal audio processing utility (useMediaStreamASR or similar) should be extracted to avoid code duplication between the two hooks.

2. User Flow

2.1 System Audio

User selects "System Audio" tab (second option alongside Upload / Listen Mic)
UI shows a "Start Capture" button with browser compatibility info
User clicks "Start Capture"
Browser opens permission dialog (screen/tab picker)
- User selects a browser tab (e.g., "YouTube — Live Stream") or "Entire Screen"
- User checks "Share audio" if available
On approval: capture starts — status indicator shows "Capturing" with a live audio level meter
Real-time ASR transcription flows into QueryInput (same as Upload)
User can edit transcript while capturing continues
User clicks "Stop Capture" to end — transcript stays in QueryInput
User submits query → RAG pipeline processes it
"Full Transcript" button hidden (streaming ASR only — no batch transcription for live sources)

Permission Denied Flow

User clicks "Cancel" in permission dialog → error: "Permission denied — system audio capture requires your explicit permission"
User revokes permission (Chrome "Stop sharing") → capture stops gracefully, status: "Capture stopped"
No audio track in the stream → error: "No audio track found in the shared content"

2.2 Listen Mic

User selects "Listen Mic" tab (third option)
UI shows a "Start Listening" button (no browser compatibility warning — widely supported)
User clicks "Start Listening"
Browser shows microphone permission prompt (first time only)
On approval: listening starts — status indicator shows "Listening" with a live audio level meter
Real-time ASR transcription flows into QueryInput
User can edit transcript while listening continues
User clicks "Stop Listening" to end — transcript stays in QueryInput
User submits query → RAG pipeline processes it
"Full Transcript" button hidden (streaming ASR only)

Permission Denied Flow

User clicks "Block" in mic permission prompt → error: "Microphone access denied — please allow microphone access in your browser settings"
User revokes permission via browser UI → listening stops, status: "Microphone disconnected"
No audio track → error: "No microphone input detected"
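
The two Permission Denied flows can share one classification step before the UI picks its message. The sketch below is a hypothetical helper (`classifyCaptureError` and `CaptureErrorKind` are not names from the codebase). It assumes the rejected promise carries a DOMException-style `name`, where `NotAllowedError` signals a cancelled picker or blocked prompt and `NotFoundError` signals that no matching input was available.

```typescript
// Hypothetical helper: reduces a getDisplayMedia()/getUserMedia() rejection
// to a small set of kinds that the UI can map to the messages listed above.
type CaptureErrorKind = 'permission-denied' | 'no-audio-input' | 'unknown'

function classifyCaptureError(err: unknown): CaptureErrorKind {
  // Only the `name` field is inspected, so any DOMException-like object works.
  const errName =
    typeof err === 'object' && err !== null && 'name' in err
      ? String((err as { name: unknown }).name)
      : ''
  if (errName === 'NotAllowedError') return 'permission-denied' // cancel / block
  if (errName === 'NotFoundError') return 'no-audio-input' // no device or track
  return 'unknown'
}
```

Each hook would then choose its own wording per kind, so the System Audio and Listen Mic messages can differ while the detection logic stays in one place.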

3. Architecture

3.1 Component Tree (LTTPage — All Sources)

LTTPage
├── SourceSelector (tabs: Upload | System Audio | Listen Mic)
├── [source === 'system-audio']
│   └── SystemAudioCapture
│       ├── Start/Stop button
│       ├── Status indicator (idle | requesting | capturing | error)
│       ├── Audio level meter (optional, nice-to-have)
│       └── Browser compatibility note (non-Chrome users)
├── [source === 'mic']
│   └── MicCapture
│       ├── Start/Stop button
│       ├── Status indicator (idle | requesting | listening | error)
│       └── Audio level meter (optional, nice-to-have)
├── QueryInput (receives transcript from active ASR hook)
├── ExtractedQuestionsDisplay
└── RAG Response Panel
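
The optional audio level meter in both capture components could be driven by a root-mean-square value computed from each audio buffer. This is a minimal sketch; `computeRmsLevel` is a hypothetical name, and it assumes samples arrive as Float32 values in the range -1 to 1, as Web Audio buffers do.

```typescript
// Hypothetical helper: RMS level of one audio buffer, clamped to [0, 1].
// 0 means silence; values near 1 mean the signal is close to full scale.
function computeRmsLevel(samples: Float32Array): number {
  if (samples.length === 0) return 0
  const sumSquares = samples.reduce((acc, s) => acc + s * s, 0)
  return Math.min(1, Math.sqrt(sumSquares / samples.length))
}
```

The result can be used directly as a width percentage for a meter bar, for example `computeRmsLevel(buffer) * 100`.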

3.2 Data Flow

System Audio

SystemAudioCapture (UI)
  │
  ├── "Start Capture" click → calls startCapture() from hook
  │
  ▼
useSystemAudioASR hook
  │
  ├── getDisplayMedia({ video: true, audio: { systemAudio: 'include' } })
  │     └── User picks tab/window → returns MediaStream
  │
  ├── AudioContext.createMediaStreamSource(stream)
  │
  ├── ScriptProcessorNode (4096 buffer, mono 16kHz)
  │
  ├── WebSocket → ws://host/ws/asr/{uuid}?language=yue
  │
  └── Returns: { status, transcript, partialTranscript, startCapture, stopCapture }

Listen Mic

MicCapture (UI)
  │
  ├── "Start Listening" click → calls startListening() from hook
  │
  ▼
useMicASR hook
  │
  ├── getUserMedia({ audio: true })
  │     └── Browser shows mic permission prompt → returns MediaStream
  │
  ├── AudioContext.createMediaStreamSource(stream)
  │
  ├── ScriptProcessorNode (4096 buffer, mono 16kHz)
  │
  ├── WebSocket → ws://host/ws/asr/{uuid}?language=yue
  │
  └── Returns: { status, transcript, partialTranscript, startListening, stopListening }

LTTPage Unification

const asr = source === 'system-audio' ? systemAudioASR
  : source === 'mic' ? micASR
  : uploadASR

3.3 Backend Changes

Minimal. The existing WebSocket ASR endpoint (ws_asr.py) already accepts audio from any source. The only additions are UUID-based video_id handling and feature toggles.

Change	File	Description
Allow UUID video_id	`backend/app/routers/ws_asr.py`	Accept non-file-based video IDs (already accepts any string)
Transcript persistence	`backend/app/services/history_service.py`	Store system audio & mic transcripts with UUID session ID (optional — nice-to-have)
Config	`backend/app/core/config.py`	Add `SYSTEM_AUDIO_ENABLED` and `MIC_ENABLED` toggles (default: true)

No changes needed to:

DashScope ASR client (receives PCM, doesn't care about source)
WebSocket protocol (same binary PCM format)
RAG pipeline (consumes transcript text)

3.4 Frontend Files

File	Status	Description
`frontend/src/components/SourceSelector.tsx`	New	Reusable tab bar component (Upload \| System Audio \| Listen Mic)
`frontend/src/hooks/useSystemAudioASR.ts`	New	Hook: getDisplayMedia → AudioContext → WebSocket
`frontend/src/hooks/useMicASR.ts`	New	Hook: getUserMedia → AudioContext → WebSocket
`frontend/src/components/SystemAudioCapture.tsx`	New	UI: Start/Stop, status, compatibility note
`frontend/src/components/MicCapture.tsx`	New	UI: Start/Stop, status
`frontend/src/pages/LTTPage.tsx`	Modified	Add source selector, wire hooks, unify ASR, conditional rendering
`frontend/src/types/index.ts`	Modified	Add SourceType, SystemAudioStatus, MicStatus types

4. Sub-Phases

Sub-Phase	Description	Effort	Depends On	Status
4.1	Config & Infrastructure	0.5 day	—	✅ Complete
4.2	System Audio Capture Hook (`useSystemAudioASR`)	1 day	4.1	✅ Complete
4.3	SystemAudioCapture UI Component	0.5 day	4.2	✅ Complete
4.4	Mic Capture Hook (`useMicASR`)	0.5 day	4.1	✅ Complete
4.5	MicCapture UI Component	0.5 day	4.4	✅ Complete
4.6	LTTPage Integration (all 3 sources)	0.5 day	4.2, 4.3, 4.4, 4.5	✅ Complete
4.7	Backend Adjustments	0.5 day	4.1	✅ Complete
4.8	Integration & Acceptance Tests	1 day	4.6, 4.7	✅ Complete
4.9	Polish & Documentation	0.5 day	4.8	✅ Complete
Total		5.5 days

Phase 4.1 — Config & Infrastructure (0.5 day)

Objective: Add feature toggles, define types, establish UUID generation.

Tasks:

Add SYSTEM_AUDIO_ENABLED and MIC_ENABLED to backend/app/core/config.py (default: True)

Add SystemAudioStatus type to frontend/src/types/index.ts:

type SystemAudioStatus = 'idle' | 'requesting' | 'capturing' | 'stopping' | 'error'

Add MicStatus type:

type MicStatus = 'idle' | 'requesting' | 'listening' | 'stopping' | 'error'

Add SystemAudioASRState and MicASRState interfaces to types
Add video_id UUID generation helper (frontend-side: crypto.randomUUID())
Verify WebSocket ASR endpoint accepts arbitrary video_id strings (it does — confirm with a quick test)
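
The status union implies a small state machine. One way to keep transitions explicit is a lookup table, sketched below for `SystemAudioStatus`. The action names (`start`, `granted`, `cancelled`, and so on), `CaptureAction`, and `nextStatus` are assumptions made for illustration; the plan itself only defines the status values.

```typescript
type SystemAudioStatus = 'idle' | 'requesting' | 'capturing' | 'stopping' | 'error'

// Hypothetical action names; the plan does not define these.
type CaptureAction = 'start' | 'granted' | 'cancelled' | 'failed' | 'stop' | 'closed'

// Allowed transitions per status. Anything not listed leaves the status unchanged.
const TRANSITIONS: Record<SystemAudioStatus, Partial<Record<CaptureAction, SystemAudioStatus>>> = {
  idle: { start: 'requesting' },
  requesting: { granted: 'capturing', cancelled: 'idle', failed: 'error' },
  capturing: { stop: 'stopping', failed: 'error' },
  stopping: { closed: 'idle' },
  error: { start: 'requesting' }, // "Try Again"
}

function nextStatus(current: SystemAudioStatus, action: CaptureAction): SystemAudioStatus {
  return TRANSITIONS[current][action] ?? current
}
```

The same table shape would work for `MicStatus` with `capturing` renamed to `listening`.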

Test Files: backend/app/test/test_phase4_config.py

Phase 4.2 — System Audio Capture Hook (1 day)

Objective: Create useSystemAudioASR.ts hook that captures system audio and streams it to the ASR WebSocket.

Key Design:

interface UseSystemAudioASRProps {
  wsUrl: string   // e.g., ws://localhost:8000/ws/asr/{uuid}?language=yue
}

interface UseSystemAudioASRReturn {
  status: 'idle' | 'requesting' | 'capturing' | 'stopping' | 'error'
  transcript: string
  partialTranscript: string
  error: string | null
  startCapture: () => Promise<void>
  stopCapture: () => void
}

Implementation Details:

startCapture(): calls navigator.mediaDevices.getDisplayMedia({ video: true, audio: { systemAudio: 'include' } }) (video: true is required; the video tracks are stopped immediately and only audio is used)
- On success: creates AudioContext, createMediaStreamSource(stream), connects ScriptProcessor → WebSocket
- On user cancel: sets status to 'idle', sets error "Permission denied"
- On no audio track: sets status to 'error', sets error "No audio track found"
stopCapture(): stops all tracks in the MediaStream, closes AudioContext, closes WebSocket
Auto-stop: listens for track.onended (user clicks "Stop sharing" in Chrome) → calls stopCapture
Audio processing: identical to useVideoASR — ScriptProcessorNode(4096), convert Float32 → Int16 PCM, send via WebSocket
WebSocket lifecycle: connect on capture start, close on capture stop
Cleanup: useEffect return closes AudioContext, WebSocket, and stops tracks
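
The Float32 → Int16 PCM step named above can be sketched as a pure function. This is an illustration rather than the code in useVideoASR; `floatTo16BitPcm` is a hypothetical name, and it assumes the ASR endpoint expects signed 16-bit samples.

```typescript
// Hypothetical helper: converts Web Audio samples (-1..1) to signed 16-bit PCM.
function floatTo16BitPcm(input: Float32Array): Int16Array {
  const output = new Int16Array(input.length)
  input.forEach((sample, i) => {
    // Clamp first so out-of-range samples do not wrap around.
    const clamped = Math.max(-1, Math.min(1, sample))
    // Negative samples scale by 32768 and positive by 32767,
    // so both extremes fit in the Int16 range.
    output[i] = clamped < 0 ? Math.round(clamped * 0x8000) : Math.round(clamped * 0x7fff)
  })
  return output
}
```

Inside the ScriptProcessorNode callback, the resulting `Int16Array`'s `buffer` could be passed to `WebSocket.send()`, which accepts an `ArrayBuffer`.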

Pattern to Follow:

AudioContext setup: follow useVideoASR.ts lines 45-143 (AudioContext, ScriptProcessor, sample rate conversion)
WebSocket handling: follow useVideoASR.ts lines 35-100
State management: combine patterns from both hooks, adapting for MediaStream source
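
For the sample rate conversion step, a minimal sketch is shown below. It assumes the AudioContext runs at a higher rate (commonly 44.1 or 48 kHz) than the 16 kHz target, and it uses simple block averaging, which is a crude stand-in for a proper low-pass filter. `downsampleTo` is a hypothetical name and may differ from what useVideoASR does.

```typescript
// Hypothetical helper: reduces the sample rate by averaging blocks of input samples.
// Returns the input unchanged if it is already at or below the target rate.
function downsampleTo(input: Float32Array, inputRate: number, targetRate: number): Float32Array {
  if (targetRate >= inputRate) return input
  const ratio = inputRate / targetRate
  const outLength = Math.floor(input.length / ratio)
  const output = new Float32Array(outLength)
  for (let i = 0; i < outLength; i++) {
    // Average the block of input samples that maps onto output sample i.
    const start = Math.floor(i * ratio)
    const end = Math.min(input.length, Math.floor((i + 1) * ratio))
    const block = input.subarray(start, end)
    const sum = block.reduce((acc, s) => acc + s, 0)
    output[i] = block.length > 0 ? sum / block.length : 0
  }
  return output
}
```

With a 4096-sample buffer at 48 kHz, `downsampleTo(buffer, 48000, 16000)` yields 1365 samples per callback.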

Test Files: frontend/src/test/test_phase4_useSystemAudioASR.test.ts

Phase 4.3 — SystemAudioCapture UI Component (0.5 day)

Objective: Create the SystemAudioCapture.tsx component with Start/Stop button, status display, and browser compatibility info.

Component Props:

interface SystemAudioCaptureProps {
  status: SystemAudioStatus
  error: string | null
  onStart: () => void
  onStop: () => void
}

UI States:

Idle: "Start Capture" button (blue, prominent) + compatibility note
Requesting: "Waiting for permission..." (loading spinner)
Capturing: "Stop Capture" button (red) + pulsing green dot + "Capturing system audio..."
Error: Red banner with error message + "Try Again" button

Browser Compatibility Note:

⚠️ System audio capture works best in Chrome/Edge on Windows/macOS.
Firefox and Safari do not support this feature.
On Linux, only tab audio is available (not full system audio).

Test Files: frontend/src/test/test_phase4_SystemAudioCapture.test.tsx

Phase 4.4 — Mic Capture Hook (0.5 day)

Objective: Create useMicASR.ts hook that captures microphone input and streams it to the ASR WebSocket.

Key Design:

interface UseMicASRProps {
  wsUrl: string   // e.g., ws://localhost:8000/ws/asr/{uuid}?language=yue
}

interface UseMicASRReturn {
  status: 'idle' | 'requesting' | 'listening' | 'stopping' | 'error'
  transcript: string
  partialTranscript: string
  error: string | null
  startListening: () => Promise<void>
  stopListening: () => void
}

Implementation Details:

startListening(): calls navigator.mediaDevices.getUserMedia({ audio: true, video: false })
- On success: creates AudioContext, createMediaStreamSource(stream), connects ScriptProcessor → WebSocket
- On user deny: sets status to 'idle', sets error "Microphone access denied"
- On no audio track: sets status to 'error', sets error "No microphone input detected"
stopListening(): stops all tracks in the MediaStream, closes AudioContext, closes WebSocket
Auto-stop: listens for track.onended (user revokes permission) → calls stopListening
Audio processing: identical to useSystemAudioASR — ScriptProcessorNode(4096), convert Float32 → Int16 PCM, send via WebSocket
WebSocket lifecycle: connect on listening start, close on listening stop
Cleanup: useEffect return closes AudioContext, WebSocket, and stops tracks
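
Because stopListening, the track.onended handler, and the useEffect cleanup can all fire for the same session, the teardown benefits from being idempotent. The sketch below shows one way to do that. `createTeardown` and `CaptureResources` are hypothetical names, and the close callbacks stand in for `audioContext.close()` and `socket.close()`.

```typescript
interface StoppableTrack {
  stop(): void
}

// Hypothetical bundle of everything one capture session must release.
interface CaptureResources {
  tracks: StoppableTrack[]
  closeAudioContext: () => void
  closeSocket: () => void
}

// Returns a function that releases the resources exactly once,
// no matter how many times it is called.
function createTeardown(resources: CaptureResources): () => void {
  let done = false
  return () => {
    if (done) return // e.g. track.onended firing after a manual stop
    done = true
    resources.tracks.forEach((track) => track.stop())
    resources.closeAudioContext()
    resources.closeSocket()
  }
}
```

The same helper would apply unchanged to useSystemAudioASR, which makes it a candidate for the shared utility described below.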

Code Sharing: Extract shared audio processing logic (MediaStream → AudioContext → ScriptProcessorNode → WebSocket) into a reusable internal utility (useMediaStreamASR or audioPipeline.ts) to avoid duplication between useSystemAudioASR and useMicASR.

Test Files: frontend/src/test/test_phase4_useMicASR.test.ts

Phase 4.5 — MicCapture UI Component (0.5 day)

Objective: Create the MicCapture.tsx component with Start/Stop button and status display.

Component Props:

interface MicCaptureProps {
  status: MicStatus
  error: string | null
  onStart: () => void
  onStop: () => void
}

UI States:

Idle: "Start Listening" button (blue, prominent) — no compatibility warning needed (mic is universally supported)
Requesting: "Waiting for microphone permission..." (loading spinner)
Listening: "Stop Listening" button (red) + pulsing green dot + "Listening..."
Error: Red banner with error message + "Try Again" button
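
The UI states above map cleanly onto a pure function from status to what the component shows, which also makes the component tests simpler. This is a sketch only: `MicView` and `describeMicView` are hypothetical names, and the `stopping` case is an assumption because the list above does not define it.

```typescript
type MicStatus = 'idle' | 'requesting' | 'listening' | 'stopping' | 'error'

// Hypothetical view model: what MicCapture would render for a given status.
interface MicView {
  buttonLabel: string | null // null means no button is shown
  message: string | null
}

function describeMicView(micStatus: MicStatus, error: string | null): MicView {
  switch (micStatus) {
    case 'idle':
      return { buttonLabel: 'Start Listening', message: null }
    case 'requesting':
      return { buttonLabel: null, message: 'Waiting for microphone permission...' }
    case 'listening':
      return { buttonLabel: 'Stop Listening', message: 'Listening...' }
    case 'stopping':
      // Not listed in the UI states above; assumed to show a passive message.
      return { buttonLabel: null, message: 'Stopping...' }
    case 'error':
      return { buttonLabel: 'Try Again', message: error ?? 'Unknown error' }
  }
}
```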

Test Files: frontend/src/test/test_phase4_MicCapture.test.tsx

Phase 4.6 — LTTPage Integration (0.5 day)

Objective: Create the SourceSelector tab bar component and wire both new sources into LTTPage.

New Component — SourceSelector.tsx:

interface SourceSelectorProps {
  activeSource: SourceType
  onSelect: (source: SourceType) => void
}

Three tabs: Upload (📁), System Audio (🔊), Listen Mic (🎤)
Active tab highlighted with blue background, inactive tabs gray
Icons from lucide-react: Upload, MonitorSpeaker, Mic

Changes to LTTPage.tsx:

Add SourceType state: const [source, setSource] = useState<SourceType>('upload')
Render <SourceSelector activeSource={source} onSelect={setSource} /> above the panels
Extend SourceType to 'upload' | 'system-audio' | 'mic'
Initialize useSystemAudioASR and useMicASR hooks with session-scoped UUIDs (generated once when tab selected, reused across Start/Stop cycles)

Update asr variable:

const asr = source === 'system-audio' ? systemAudioASR
  : source === 'mic' ? micASR
  : uploadASR

Conditional rendering:

{source === 'upload' && <VideoUpload />}
{source === 'system-audio' && <SystemAudioCapture />}
{source === 'mic' && <MicCapture />}

WebSocket URL: ws://host/ws/asr/{sessionUUID}?language=yue (UUID stable per session, regenerated only on source switch)
Full Transcript button: hidden for system-audio AND mic (streaming ASR only)
QueryInput: remains editable during capture/listening
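
The WebSocket URL format above can be built by a small helper so that both hooks receive a consistent `wsUrl`. In this sketch `buildAsrUrl` and its `secure` flag are hypothetical. The flag is an assumption for pages served over HTTPS, where browsers block plain `ws://` connections.

```typescript
// Hypothetical helper: builds the ASR WebSocket URL for one capture session.
function buildAsrUrl(host: string, sessionId: string, language = 'yue', secure = false): string {
  const scheme = secure ? 'wss' : 'ws'
  // Encode both dynamic parts so unexpected characters cannot break the URL.
  const path = `/ws/asr/${encodeURIComponent(sessionId)}`
  return `${scheme}://${host}${path}?language=${encodeURIComponent(language)}`
}
```

In LTTPage the `sessionId` would come from `crypto.randomUUID()`, generated when the source tab is selected and held in state so it survives Start/Stop cycles.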

Test Files: frontend/src/test/test_phase4_LTTPage_integration.test.tsx

Phase 4.7 — Backend Adjustments (0.5 day)

Objective: Ensure backend handles both system audio and mic sessions correctly.

Tasks:

Verify ws_asr.py WebSocket endpoint works with arbitrary video_id (UUID format) — likely no changes needed
Add SYSTEM_AUDIO_ENABLED and MIC_ENABLED config validation in the router (return 503 if disabled)
Handle system audio and mic sessions in transcript history (optional — store with source: 'system-audio' / source: 'mic' metadata)
Verify the ASR client handles audio from both sources identically

No new endpoints needed. The existing WebSocket and ASR infrastructure is source-agnostic.

Test Files: backend/app/test/test_phase4_config.py

Phase 4.8 — Integration & Acceptance Tests (1 day)

Objective: Comprehensive tests for both capture flows.

Backend Integration Tests (backend/app/test/test_integration_phase4.py):

WebSocket accepts UUID video_id
ASR processes audio from system audio session
ASR processes audio from mic session
Config toggles disable features

Frontend Tests:

System Audio Hook tests (test_phase4_useSystemAudioASR.test.ts): ~10 tests
- Mock getDisplayMedia → successful capture
- Mock getDisplayMedia → user cancels (permission denied)
- Mock getDisplayMedia → no audio track
- AudioContext setup and teardown
- WebSocket connection lifecycle
- PCM conversion and sending
- track.onended triggers auto-stop
- stopCapture cleanup
- Multiple rapid start/stop cycles
System Audio Component tests (test_phase4_SystemAudioCapture.test.tsx): ~5 tests
- All UI states render correctly (idle, requesting, capturing, error)
- Start button calls onStart
- Stop button calls onStop
- Error state shows message and retry button
- Compatibility note visible for non-Chrome (optional)
Mic Hook tests (test_phase4_useMicASR.test.ts): ~8 tests
- Mock getUserMedia → successful capture
- Mock getUserMedia → user denies (permission denied)
- Mock getUserMedia → no audio track
- AudioContext setup and teardown
- WebSocket connection lifecycle
- track.onended triggers auto-stop
- stopListening cleanup
- PCM conversion and sending
Mic Component tests (test_phase4_MicCapture.test.tsx): ~4 tests
- All UI states render correctly (idle, requesting, listening, error)
- Start button calls onStart
- Stop button calls onStop
- Error state shows message and retry button
LTTPage Integration tests (test_phase4_LTTPage_integration.test.tsx): ~8 tests
- System Audio tab renders and switches correctly
- Listen Mic tab renders and switches correctly
- ASR variable selects systemAudioASR when source is system-audio
- ASR variable selects micASR when source is mic
- Full Transcript button hidden for system audio and mic
- QueryInput receives transcript from system audio
- QueryInput receives transcript from mic
- Source switching preserves transcript

Acceptance Tests (backend/app/test/acceptance/test_acceptance_phase4.py):

Real getDisplayMedia with actual browser (manual — requires human interaction)
Real getUserMedia with actual microphone (manual — requires human interaction)
Real DashScope ASR with system audio stream
Real DashScope ASR with microphone stream
End-to-end: capture → ASR → transcript → RAG answer (both sources)

Phase 4.9 — Polish & Documentation (0.5 day)

Tasks:

Update README.md — add System Audio Capture and Listen Mic sections with usage instructions, browser compatibility table, and limitations
Update development_plan.md — add Phase 4 row to timeline, mark status
Add browser detection helper for system audio compatibility warning
Verify production build (npm run build)
Run full CI regression (pytest + vitest)
Final commit

5. Design Decisions

Decision	Rationale
New hooks rather than modifying existing	MediaStream source requires `createMediaStreamSource` (not `createMediaElementSource`), and lifecycle is permission-based (not play/pause events). Separate hooks avoid branching complexity.
Two separate hooks + shared audio utility	System Audio and Mic share identical audio processing (MediaStream → PCM → WebSocket) but differ in capture API (`getDisplayMedia` vs `getUserMedia`) and UX. Extract shared pipeline to avoid duplication.
UUID-based `video_id` (per-session)	No actual video file for live audio. UUID generated once when source tab is selected, reused across Start/Stop cycles within the same session. Regenerated only when switching between sources. Backend WebSocket already accepts arbitrary strings.
Manual Start/Stop (not auto)	Both `getDisplayMedia()` and `getUserMedia()` require explicit user action (browser policy). Cannot auto-start.
No video display in System Audio or Mic mode	User watches/listens to content elsewhere. Only capture status and audio controls shown.
`video: true` in getDisplayMedia	Chrome requires `video: true` (throws TypeError with `video: false`). Video tracks are captured but immediately discarded — only audio is used.
Hide Full Transcript button for both new sources	Streaming ASR only — no video file to batch transcribe. Full transcript would require audio recording (future Phase 5).
Browser compatibility note only for System Audio	Mic (`getUserMedia`) is universally supported in all modern browsers. System Audio (`getDisplayMedia` with audio) is Chrome/Edge-only.
Mic uses `getUserMedia({ audio: true, video: false })`	Audio-only capture — no camera needed.

getDisplayMedia Options (System Audio)

const stream = await navigator.mediaDevices.getDisplayMedia({
  video: true,   // REQUIRED: Chrome throws TypeError with video:false.
                 // Video tracks are captured but immediately discarded — only audio is used.
  audio: {
    systemAudio: 'include',
    echoCancellation: false,
    noiseSuppression: false,
    autoGainControl: false,
  },
})

Note on video: true: Chrome requires video: true — setting video: false causes a TypeError. The video tracks are immediately stopped via stream.getVideoTracks().forEach(t => t.stop()) so only audio is processed. The browser permission dialog always shows screen/tab selection (no "audio-only picker" exists). The user must select a tab or screen to share.
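
Discarding the video tracks and checking for an audio track can be done in one step right after getDisplayMedia() resolves. The sketch below uses minimal structural interfaces (`TrackLike`, `StreamLike`) so it can be exercised without a browser; a real `MediaStream` satisfies them. `keepAudioOnly` is a hypothetical name. The audio check matters because a share without audio still resolves successfully, with video tracks only.

```typescript
// Minimal structural types; a real MediaStream / MediaStreamTrack satisfies these.
interface TrackLike {
  kind: string
  stop(): void
}

interface StreamLike {
  getAudioTracks(): TrackLike[]
  getVideoTracks(): TrackLike[]
}

// Hypothetical helper: stops every video track, then verifies audio is present.
function keepAudioOnly<T extends StreamLike>(stream: T): T {
  // Video is requested only to satisfy the browser; release it immediately.
  stream.getVideoTracks().forEach((track) => track.stop())
  if (stream.getAudioTracks().length === 0) {
    // The user shared a surface without ticking "Share audio".
    throw new Error('No audio track found in the shared content')
  }
  return stream
}
```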

getUserMedia Options (Listen Mic)

const stream = await navigator.mediaDevices.getUserMedia({
  audio: {
    echoCancellation: false,    // Don't filter audio (pass raw mic input)
    noiseSuppression: false,    // Don't filter audio
    autoGainControl: false,     // Don't adjust volume
  },
  video: false,
})

6. Browser Compatibility

System Audio (`getDisplayMedia`)

Platform / Browser	Tab Audio	System Audio	Works?
Chrome/Edge (Windows)	✅	✅	Best — full support
Chrome/Edge (macOS 14.2+)	✅	✅	Good
Chrome/Edge (Linux)	✅	❌	Works, tab audio only
Firefox	❌	❌	Audio ignored
Safari	❌	❌	Audio not supported
Mobile browsers	❌	❌	Not supported

Detection helper:

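function isSystemAudioSupported(): boolean {
  // Feature-detect the API first, then restrict to desktop Chromium (Chrome/Edge).
  const hasDisplayMedia = typeof navigator.mediaDevices?.getDisplayMedia === 'function'
  const ua = navigator.userAgent
  const isChromium = 'chrome' in window || ua.includes('Chrome')
  const isMobile = /Android|iPhone|iPad/.test(ua) // mobile browsers are unsupported
  return hasDisplayMedia && isChromium && !isMobile && !ua.includes('Firefox')
}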

Listen Mic (`getUserMedia`)

Platform / Browser	Microphone	Works?
Chrome/Edge	✅	Full support
Firefox	✅	Full support
Safari	✅	Full support
Mobile browsers	✅	Full support

Mic capture is universally supported — no compatibility warning needed.

7. Test Strategy

Test Files

File	Type	Count	Description
`test_phase4_config.py`	Backend integration	4	Config toggles, WebSocket accepts UUID
`test_phase4_useSystemAudioASR.test.ts`	Frontend unit	~10	Hook: capture, permission, audio, WS
`test_phase4_SystemAudioCapture.test.tsx`	Frontend component	~5	UI states: idle, requesting, capturing, error
`test_phase4_useMicASR.test.ts`	Frontend unit	~8	Hook: capture, permission, audio, WS
`test_phase4_MicCapture.test.tsx`	Frontend component	~4	UI states: idle, requesting, listening, error
`test_phase4_LTTPage_integration.test.tsx`	Frontend integration	~8	Tab switching, ASR unification, Full Transcript
`test_integration_phase4.py`	Backend integration	4	Config toggles, WebSocket, ASR client
`test_acceptance_phase4.py`	Acceptance	5	Real browser + real mic + real DashScope ASR

Mocking Strategy

getDisplayMedia: Mock with vi.fn() returning a synthetic MediaStream with an audio track
getUserMedia: Mock with vi.fn() returning a synthetic MediaStream with an audio track
AudioContext: Use a manual mock for AudioContext and ScriptProcessorNode (the test DOM environment does not provide the Web Audio API)
WebSocket: Mock via vitest WebSocket mock (same pattern as Phase 2/3 tests)
DashScope ASR: Mock in CI; real in acceptance tests

8. File Manifest

New Files

frontend/src/hooks/useMediaStreamASR.ts          ← shared audio pipeline (MediaStream → AudioContext → WebSocket)
frontend/src/hooks/useSystemAudioASR.ts
frontend/src/hooks/useMicASR.ts
frontend/src/components/SourceSelector.tsx        ← tab bar (Upload | System Audio | Listen Mic)
frontend/src/components/SystemAudioCapture.tsx
frontend/src/components/MicCapture.tsx
frontend/src/lib/browser.ts                       ← isSystemAudioSupported() detection helper
frontend/src/test/test_phase4_useSystemAudioASR.test.ts
frontend/src/test/test_phase4_SystemAudioCapture.test.tsx
frontend/src/test/test_phase4_useMicASR.test.ts
frontend/src/test/test_phase4_MicCapture.test.tsx
frontend/src/test/test_phase4_LTTPage_integration.test.tsx
backend/app/test/test_phase4_config.py
backend/app/test/test_integration_phase4.py
backend/app/test/acceptance/test_acceptance_phase4.py
.plans/phase4_system_audio_plan.md              ← this file

Modified Files

frontend/src/pages/LTTPage.tsx                    ← add "System Audio" + "Listen Mic" tabs, wire hooks
frontend/src/types/index.ts                       ← add SourceType, SystemAudioStatus, MicStatus, ASRState types
backend/app/core/config.py                        ← add SYSTEM_AUDIO_ENABLED, MIC_ENABLED
development_plan.md                               ← add Phase 4 row
README.md                                         ← add System Audio + Listen Mic sections

9. Acceptance Criteria

User can select "System Audio" tab in LTTPage
User can select "Listen Mic" tab in LTTPage
Clicking "Start Capture" (System Audio) opens browser permission dialog
Clicking "Start Listening" (Listen Mic) opens microphone permission prompt
On permission grant, audio streams through WebSocket to DashScope ASR (both sources; verified via manual tests)
Real-time transcript flows into QueryInput (both sources; verified via manual tests)
User can edit transcript while capture/listening continues
"Stop Capture" properly closes MediaStream, AudioContext, WebSocket
"Stop Listening" properly closes MediaStream, AudioContext, WebSocket
Permission denied shows clear error message (both sources)
Browser compatibility note shown for System Audio
No compatibility warning for Listen Mic (universally supported)
All CI tests pass (46 front-end + 14 back-end)
Acceptance tests pass with real DashScope ASR (requires real browser/mic — skipped in CI)
npm run build produces clean production build

File Information

Filename: phase4_system_audio_plan.md
Created: 2026-05-09
Updated: 2026-05-15 — Phase 4 complete (all 9 sub-phases)
Status: Complete — 46 frontend tests + 14 backend tests passing, production build verified
