# Phase 4: System Audio & Mic Capture → ASR → RAG — Implementation Plan

**Created:** 2026-05-09
**Updated:** 2026-05-15
**Status:** ✅ Complete
**Depends on:** Phase 1 (Complete), Phase 2 (Complete), Phase 3 (Complete)

---

## 1. Overview

Phase 4 adds two new live audio sources to the LTTPage, alongside file upload:

1. **System Audio Capture** — captures audio output from any application on the user's computer (browser tab, Spotify, Zoom, system sounds) via `getDisplayMedia()`.
2. **Listen Mic** — captures microphone input (user's voice, room audio) via `getUserMedia({ audio: true })`.

Both sources pipe audio through the existing WebSocket → DashScope realtime ASR → RAG pipeline.

### System Audio — Use Cases

- Watching a YouTube video in a regular browser tab (no proxy needed — just share that tab's audio)
- Listening to a podcast, lecture, or meeting and getting real-time transcript + RAG
- Transcribing any audio playing on the computer without needing to download files

### Listen Mic — Use Cases

- Recording a live meeting or lecture through the computer's microphone
- Dictating questions or notes verbally and getting RAG answers
- Transcribing spoken Cantonese in real time without a video source

### How They Work

```
[System Audio]
User clicks "System Audio" → "Start Capture"
  → Browser shows permission dialog (screen/tab picker)
  → User selects tab/window/screen (with audio)
  → getDisplayMedia() returns MediaStream (with audio track)
  → AudioContext.createMediaStreamSource(stream)
  → ScriptProcessorNode → WebSocket → DashScope ASR → Transcript → RAG

[Listen Mic]
User clicks "Listen Mic" → "Start Listening"
  → Browser shows microphone permission prompt
  → getUserMedia({ audio: true }) returns MediaStream
  → AudioContext.createMediaStreamSource(stream)
  → ScriptProcessorNode → WebSocket → DashScope ASR → Transcript → RAG
```

### Audio Routing (vs Existing Sources)

| Source | Audio Input | SourceNode Type | Start/Stop Trigger |
|--------|-------------|-----------------|-------------------|
| Upload | `<video>` element | `createMediaElementSource` | play/pause events |
| **System Audio** | MediaStream from `getDisplayMedia()` | `createMediaStreamSource` | Manual Start/Stop button + track ended event |
| **Listen Mic** | MediaStream from `getUserMedia({ audio: true })` | `createMediaStreamSource` | Manual Start/Stop button + track ended event |

### Why New Hooks (Not Reuse Existing)

The existing `useVideoASR` hook depends on HTML media elements (`<video>`) for both the audio source and play/pause lifecycle. Both new sources use **MediaStream** objects (no DOM element), and their lifecycle is controlled by user permission (grant/revoke) and manual start/stop, not DOM events.

**System Audio** and **Listen Mic** share the same audio processing pipeline (`MediaStream → AudioContext → ScriptProcessorNode → WebSocket`) but differ in their capture API. A shared internal audio processing utility (`useMediaStreamASR` or similar) should be extracted to avoid code duplication between the two hooks.

---

## 2. User Flow

### 2.1 System Audio

1. User selects **"System Audio"** tab (second option alongside Upload / Listen Mic)
2. UI shows a **"Start Capture"** button with browser compatibility info
3. User clicks **"Start Capture"**
4. Browser opens **permission dialog** (screen/tab picker)
   - User selects a browser tab (e.g., "YouTube — Live Stream") or "Entire Screen"
   - User checks "Share audio" if available
5. On approval: capture starts — status indicator shows "Capturing" with a live audio level meter
6. Real-time ASR transcription flows into **QueryInput** (same as Upload)
7. User can **edit transcript while capturing** continues
8. User clicks **"Stop Capture"** to end — transcript stays in QueryInput
9. User submits query → RAG pipeline processes it
10. **"Full Transcript" button hidden** (streaming ASR only — no batch transcription for live sources)

#### Permission Denied Flow

- User clicks "Cancel" in permission dialog → error: "Permission denied — system audio capture requires your explicit permission"
- User revokes permission (Chrome "Stop sharing") → capture stops gracefully, status: "Capture stopped"
- No audio track in the stream → error: "No audio track found in the shared content"

### 2.2 Listen Mic

1. User selects **"Listen Mic"** tab (third option)
2. UI shows a **"Start Listening"** button (no browser compatibility warning — widely supported)
3. User clicks **"Start Listening"**
4. Browser shows **microphone permission prompt** (first time only)
5. On approval: listening starts — status indicator shows "Listening" with a live audio level meter
6. Real-time ASR transcription flows into **QueryInput**
7. User can **edit transcript while listening** continues
8. User clicks **"Stop Listening"** to end — transcript stays in QueryInput
9. User submits query → RAG pipeline processes it
10. **"Full Transcript" button hidden** (streaming ASR only)

#### Permission Denied Flow

- User clicks "Block" in mic permission prompt → error: "Microphone access denied — please allow microphone access in your browser settings"
- User revokes permission via browser UI → listening stops, status: "Microphone disconnected"
- No audio track → error: "No microphone input detected"

---

## 3. Architecture

### 3.1 Component Tree (LTTPage — All Sources)

```
LTTPage
├── SourceSelector (tabs: Upload | System Audio | Listen Mic)
├── [source === 'system-audio']
│   └── SystemAudioCapture
│       ├── Start/Stop button
│       ├── Status indicator (idle | requesting | capturing | error)
│       ├── Audio level meter (optional, nice-to-have)
│       └── Browser compatibility note (non-Chrome users)
├── [source === 'mic']
│   └── MicCapture
│       ├── Start/Stop button
│       ├── Status indicator (idle | requesting | listening | error)
│       └── Audio level meter (optional, nice-to-have)
├── QueryInput (receives transcript from active ASR hook)
├── ExtractedQuestionsDisplay
└── RAG Response Panel
```

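The optional audio level meter in the tree above needs one number per audio buffer. A minimal sketch, assuming the meter displays the root-mean-square (RMS) level of each `Float32Array` buffer handed to the audio callback; `computeRmsLevel` is a hypothetical helper name and is not defined elsewhere in this plan:

```typescript
// Hypothetical helper (assumption): RMS level of one audio buffer, clamped to [0, 1].
// Samples are expected in the Web Audio range of -1.0 to 1.0.
function computeRmsLevel(samples: Float32Array): number {
  if (samples.length === 0) return 0
  let sumSquares = 0
  for (let i = 0; i < samples.length; i++) {
    sumSquares += samples[i] * samples[i]
  }
  // Clamp so that out-of-range input cannot overflow the meter.
  return Math.min(1, Math.sqrt(sumSquares / samples.length))
}
```

The result could drive the width of a meter bar directly, or be smoothed across buffers if the raw value proves too jumpy.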
### 3.2 Data Flow

#### System Audio

```
SystemAudioCapture (UI)
    │
    ├── "Start Capture" click → calls startCapture() from hook
    │
    ▼
useSystemAudioASR hook
    │
    ├── getDisplayMedia({ video: true, audio: {...}, systemAudio: 'include' })
    │     └── User picks tab/window → returns MediaStream
    │
    ├── AudioContext.createMediaStreamSource(stream)
    │
    ├── ScriptProcessorNode (4096 buffer, mono 16kHz)
    │
    ├── WebSocket → ws://host/ws/asr/{uuid}?language=yue
    │
    └── Returns: { status, transcript, partialTranscript, startCapture, stopCapture }
```

#### Listen Mic

```
MicCapture (UI)
    │
    ├── "Start Listening" click → calls startListening() from hook
    │
    ▼
useMicASR hook
    │
    ├── getUserMedia({ audio: true })
    │     └── Browser shows mic permission prompt → returns MediaStream
    │
    ├── AudioContext.createMediaStreamSource(stream)
    │
    ├── ScriptProcessorNode (4096 buffer, mono 16kHz)
    │
    ├── WebSocket → ws://host/ws/asr/{uuid}?language=yue
    │
    └── Returns: { status, transcript, partialTranscript, startListening, stopListening }
```

#### LTTPage Unification

```typescript
const asr = source === 'system-audio' ? systemAudioASR
  : source === 'mic' ? micASR
  : uploadASR
```

### 3.3 Backend Changes

**Minimal.** The existing WebSocket ASR endpoint (`ws_asr.py`) already accepts audio from any source. The only additions are UUID-based `video_id` handling and feature toggles.

| Change | File | Description |
|--------|------|-------------|
| Allow UUID video_id | `backend/app/routers/ws_asr.py` | Accept non-file-based video IDs (already accepts any string) |
| Transcript persistence | `backend/app/services/history_service.py` | Store system audio & mic transcripts with UUID session ID (optional — nice-to-have) |
| Config | `backend/app/core/config.py` | Add `SYSTEM_AUDIO_ENABLED` and `MIC_ENABLED` toggles (default: true) |

**No changes needed to:**

- DashScope ASR client (receives PCM, doesn't care about source)
- WebSocket protocol (same binary PCM format)
- RAG pipeline (consumes transcript text)

### 3.4 Frontend Files

| File | Status | Description |
|------|--------|-------------|
| `frontend/src/components/SourceSelector.tsx` | **New** | Reusable tab bar component (Upload \| System Audio \| Listen Mic) |
| `frontend/src/hooks/useMediaStreamASR.ts` | **New** | Shared audio pipeline: MediaStream → AudioContext → WebSocket |
| `frontend/src/hooks/useSystemAudioASR.ts` | **New** | Hook: getDisplayMedia → AudioContext → WebSocket |
| `frontend/src/hooks/useMicASR.ts` | **New** | Hook: getUserMedia → AudioContext → WebSocket |
| `frontend/src/components/SystemAudioCapture.tsx` | **New** | UI: Start/Stop, status, compatibility note |
| `frontend/src/components/MicCapture.tsx` | **New** | UI: Start/Stop, status |
| `frontend/src/lib/browser.ts` | **New** | `isSystemAudioSupported()` detection helper |
| `frontend/src/pages/LTTPage.tsx` | **Modified** | Add source selector, wire hooks, unify ASR, conditional rendering |
| `frontend/src/types/index.ts` | **Modified** | Add SourceType, SystemAudioStatus, MicStatus types |

---

## 4. Sub-Phases

| Sub-Phase | Description | Effort | Depends On | Status |
|-----------|-------------|--------|------------|--------|
| 4.1 | Config & Infrastructure | 0.5 day | — | ✅ Complete |
| 4.2 | System Audio Capture Hook (`useSystemAudioASR`) | 1 day | 4.1 | ✅ Complete |
| 4.3 | SystemAudioCapture UI Component | 0.5 day | 4.2 | ✅ Complete |
| 4.4 | Mic Capture Hook (`useMicASR`) | 0.5 day | 4.1 | ✅ Complete |
| 4.5 | MicCapture UI Component | 0.5 day | 4.4 | ✅ Complete |
| 4.6 | LTTPage Integration (all 3 sources) | 0.5 day | 4.2, 4.3, 4.4, 4.5 | ✅ Complete |
| 4.7 | Backend Adjustments | 0.5 day | 4.1 | ✅ Complete |
| 4.8 | Integration & Acceptance Tests | 1 day | 4.6, 4.7 | ✅ Complete |
| 4.9 | Polish & Documentation | 0.5 day | 4.8 | ✅ Complete |
| **Total** | | **5.5 days** | | |

### Phase 4.1 — Config & Infrastructure (0.5 day)

**Objective:** Add feature toggles, define types, establish UUID generation.

**Tasks:**

1. Add `SYSTEM_AUDIO_ENABLED` and `MIC_ENABLED` to `backend/app/core/config.py` (default: `True`)
2. Add `SystemAudioStatus` type to `frontend/src/types/index.ts`:
   ```typescript
   type SystemAudioStatus = 'idle' | 'requesting' | 'capturing' | 'stopping' | 'error'
   ```
3. Add `MicStatus` type:
   ```typescript
   type MicStatus = 'idle' | 'requesting' | 'listening' | 'stopping' | 'error'
   ```
4. Add `SystemAudioASRState` and `MicASRState` interfaces to types
5. Add `video_id` UUID generation helper (frontend-side: `crypto.randomUUID()`)
6. Verify WebSocket ASR endpoint accepts arbitrary `video_id` strings (it does — confirm with a quick test)

**Test Files:** `backend/app/test/test_phase4_config.py`

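Tasks 5 and 6 above amount to turning a session UUID into the `ws://host/ws/asr/{uuid}?language=yue` URL used throughout this plan. A minimal sketch, assuming a hypothetical helper named `buildAsrWsUrl`; the `secure` flag and the default language argument are illustrative additions, not part of the plan:

```typescript
// Hypothetical helper (assumption): builds the ASR WebSocket URL for one capture session.
// `sessionId` would typically come from crypto.randomUUID() in the browser.
function buildAsrWsUrl(
  host: string,
  sessionId: string,
  language: string = 'yue',
  secure: boolean = false, // assumption: use wss:// when the page is served over HTTPS
): string {
  const scheme = secure ? 'wss' : 'ws'
  // Encode both values so that unexpected characters cannot break the path or query.
  return `${scheme}://${host}/ws/asr/${encodeURIComponent(sessionId)}?language=${encodeURIComponent(language)}`
}

// Usage in the browser (not executed here):
//   const wsUrl = buildAsrWsUrl('localhost:8000', crypto.randomUUID())
```

Keeping URL construction in one function means the UUID path format only needs a single test.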
### Phase 4.2 — System Audio Capture Hook (1 day)

**Objective:** Create `useSystemAudioASR.ts` hook that captures system audio and streams it to the ASR WebSocket.

**Key Design:**

```typescript
interface UseSystemAudioASRProps {
  wsUrl: string // e.g., ws://localhost:8000/ws/asr/{uuid}?language=yue
}

interface UseSystemAudioASRReturn {
  status: 'idle' | 'requesting' | 'capturing' | 'stopping' | 'error'
  transcript: string
  partialTranscript: string
  error: string | null
  startCapture: () => Promise<void>
  stopCapture: () => void
}
```

**Implementation Details:**

- `startCapture()`: calls `navigator.mediaDevices.getDisplayMedia({ video: true, audio: { ... }, systemAudio: 'include' })` (`video: true` is required; see Section 5) and stops the returned video tracks immediately
- On success: creates AudioContext, `createMediaStreamSource(stream)`, connects ScriptProcessor → WebSocket
- On user cancel (the promise rejects with `NotAllowedError`): sets status to `'idle'`, sets error "Permission denied"
- On no audio track: sets status to `'error'`, sets error "No audio track found"
- `stopCapture()`: stops all tracks in the MediaStream, closes AudioContext, closes WebSocket
- Auto-stop: listens for `track.onended` (user clicks "Stop sharing" in Chrome) → calls stopCapture
- Audio processing: identical to useVideoASR — `ScriptProcessorNode(4096)`, convert Float32 → Int16 PCM, send via WebSocket
- WebSocket lifecycle: connect on capture start, close on capture stop
- Cleanup: useEffect return closes AudioContext, WebSocket, and stops tracks

**Pattern to Follow:**

- AudioContext setup: follow `useVideoASR.ts` lines 45-143 (AudioContext, ScriptProcessor, sample rate conversion)
- WebSocket handling: follow `useVideoASR.ts` lines 35-100
- State management: combine patterns from both hooks, adapting for MediaStream source

**Test Files:** `frontend/src/test/test_phase4_useSystemAudioASR.test.ts`

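The "convert Float32 → Int16 PCM" and sample rate conversion steps listed under Implementation Details can be sketched as two pure functions. This is an illustration under stated assumptions, not the code in `useVideoASR.ts`: the function names are hypothetical, and the averaging downsampler is one simple choice among several.

```typescript
// Hypothetical helper (assumption): downsample mono audio by averaging each
// group of input samples that maps onto one output sample.
function downsampleTo16k(
  input: Float32Array,
  inputRate: number,
  targetRate: number = 16000,
): Float32Array {
  if (inputRate <= targetRate) return input // nothing to do; upsampling is out of scope
  const ratio = inputRate / targetRate
  const outLength = Math.floor(input.length / ratio)
  const output = new Float32Array(outLength)
  for (let i = 0; i < outLength; i++) {
    const start = Math.floor(i * ratio)
    const end = Math.min(Math.floor((i + 1) * ratio), input.length)
    let sum = 0
    for (let j = start; j < end; j++) sum += input[j]
    output[i] = end > start ? sum / (end - start) : 0
  }
  return output
}

// Hypothetical helper (assumption): convert Web Audio floats (-1.0..1.0)
// to signed 16-bit PCM, clamping out-of-range samples first.
function floatTo16BitPCM(samples: Float32Array): Int16Array {
  const pcm = new Int16Array(samples.length)
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i]))
    pcm[i] = s < 0 ? s * 0x8000 : s * 0x7fff
  }
  return pcm
}
```

Inside the audio callback, the resulting `Int16Array`'s `buffer` is what would be passed to `WebSocket.send()` as binary PCM.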
### Phase 4.3 — SystemAudioCapture UI Component (0.5 day)

**Objective:** Create the `SystemAudioCapture.tsx` component with Start/Stop button, status display, and browser compatibility info.

**Component Props:**

```typescript
interface SystemAudioCaptureProps {
  status: SystemAudioStatus
  error: string | null
  onStart: () => void
  onStop: () => void
}
```

**UI States:**

1. **Idle**: "Start Capture" button (blue, prominent) + compatibility note
2. **Requesting**: "Waiting for permission..." (loading spinner)
3. **Capturing**: "Stop Capture" button (red) + pulsing green dot + "Capturing system audio..."
4. **Error**: Red banner with error message + "Try Again" button

**Browser Compatibility Note:**

```
⚠️ System audio capture works best in Chrome/Edge on Windows/macOS.
Firefox and Safari do not support this feature.
On Linux, only tab audio is available (not full system audio).
```

**Test Files:** `frontend/src/test/test_phase4_SystemAudioCapture.test.tsx`

### Phase 4.4 — Mic Capture Hook (0.5 day)

**Objective:** Create `useMicASR.ts` hook that captures microphone input and streams it to the ASR WebSocket.

**Key Design:**

```typescript
interface UseMicASRProps {
  wsUrl: string // e.g., ws://localhost:8000/ws/asr/{uuid}?language=yue
}

interface UseMicASRReturn {
  status: 'idle' | 'requesting' | 'listening' | 'stopping' | 'error'
  transcript: string
  partialTranscript: string
  error: string | null
  startListening: () => Promise<void>
  stopListening: () => void
}
```

**Implementation Details:**

- `startListening()`: calls `navigator.mediaDevices.getUserMedia({ audio: true, video: false })`
- On success: creates AudioContext, `createMediaStreamSource(stream)`, connects ScriptProcessor → WebSocket
- On user deny (the promise rejects with `NotAllowedError`): sets status to `'idle'`, sets error "Microphone access denied"
- On no audio track: sets status to `'error'`, sets error "No microphone input detected"
- `stopListening()`: stops all tracks in the MediaStream, closes AudioContext, closes WebSocket
- Auto-stop: listens for `track.onended` (user revokes permission) → calls stopListening
- Audio processing: identical to useSystemAudioASR — `ScriptProcessorNode(4096)`, convert Float32 → Int16 PCM, send via WebSocket
- WebSocket lifecycle: connect on listening start, close on listening stop
- Cleanup: useEffect return closes AudioContext, WebSocket, and stops tracks

**Code Sharing:** Extract shared audio processing logic (`MediaStream → AudioContext → ScriptProcessorNode → WebSocket`) into a reusable internal utility (`useMediaStreamASR` or `audioPipeline.ts`) to avoid duplication between `useSystemAudioASR` and `useMicASR`.

**Test Files:** `frontend/src/test/test_phase4_useMicASR.test.ts`

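Both hooks translate a rejected capture promise into a `status` plus an error message. A minimal sketch of that mapping, assuming it is keyed on the `name` of the rejection. `mapCaptureError` is a hypothetical helper, and treating `NotFoundError` as the "no audio input" case is an assumption rather than something this plan specifies:

```typescript
type CaptureSource = 'system-audio' | 'mic'

interface CaptureFailure {
  status: 'idle' | 'error'
  error: string
}

// Hypothetical helper (assumption): maps a DOMException name to hook state.
function mapCaptureError(source: CaptureSource, errorName: string): CaptureFailure {
  if (errorName === 'NotAllowedError') {
    // User cancelled the picker or blocked the prompt: return to idle, keep a message.
    return {
      status: 'idle',
      error: source === 'mic' ? 'Microphone access denied' : 'Permission denied',
    }
  }
  if (errorName === 'NotFoundError') {
    // Assumption: no matching input device is reported like a missing audio track.
    return {
      status: 'error',
      error: source === 'mic' ? 'No microphone input detected' : 'No audio track found',
    }
  }
  // Anything else is surfaced as a generic error with the original name for debugging.
  return { status: 'error', error: `Capture failed (${errorName})` }
}
```

In a hook this would be called from the `catch` around `getDisplayMedia()` or `getUserMedia()`, passing `err.name` when the caught value is a `DOMException`.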
### Phase 4.5 — MicCapture UI Component (0.5 day)

**Objective:** Create the `MicCapture.tsx` component with Start/Stop button and status display.

**Component Props:**

```typescript
interface MicCaptureProps {
  status: MicStatus
  error: string | null
  onStart: () => void
  onStop: () => void
}
```

**UI States:**

1. **Idle**: "Start Listening" button (blue, prominent) — no compatibility warning needed (mic is universally supported)
2. **Requesting**: "Waiting for microphone permission..." (loading spinner)
3. **Listening**: "Stop Listening" button (red) + pulsing green dot + "Listening..."
4. **Error**: Red banner with error message + "Try Again" button

**Test Files:** `frontend/src/test/test_phase4_MicCapture.test.tsx`

### Phase 4.6 — LTTPage Integration (0.5 day)

**Objective:** Create the `SourceSelector` tab bar component and wire both new sources into LTTPage.

**New Component — `SourceSelector.tsx`:**

```typescript
interface SourceSelectorProps {
  activeSource: SourceType
  onSelect: (source: SourceType) => void
}
```

- Three tabs: Upload (📁), System Audio (🔊), Listen Mic (🎤)
- Active tab highlighted with blue background, inactive tabs gray
- Icons from lucide-react: `Upload`, `MonitorSpeaker`, `Mic`

**Changes to `LTTPage.tsx`:**

1. Add `SourceType` state: `const [source, setSource] = useState<SourceType>('upload')`
2. Render `<SourceSelector activeSource={source} onSelect={setSource} />` above the panels
3. Extend `SourceType` to `'upload' | 'system-audio' | 'mic'`
4. Initialize `useSystemAudioASR` and `useMicASR` hooks with session-scoped UUIDs (generated once when tab selected, reused across Start/Stop cycles)
5. Update `asr` variable:
   ```typescript
   const asr = source === 'system-audio' ? systemAudioASR
     : source === 'mic' ? micASR
     : uploadASR
   ```
6. Conditional rendering:
   ```tsx
   {source === 'upload' && <VideoUpload />}
   {source === 'system-audio' && <SystemAudioCapture />}
   {source === 'mic' && <MicCapture />}
   ```
7. WebSocket URL: `ws://host/ws/asr/{sessionUUID}?language=yue` (UUID stable per session, regenerated only on source switch)
8. Full Transcript button: hidden for system-audio AND mic (streaming ASR only)
9. QueryInput: remains editable during capture/listening

**Test Files:** `frontend/src/test/test_phase4_LTTPage_integration.test.tsx`

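Steps 5 and 8 above both branch on the active source. One way to keep that branching exhaustive as `SourceType` grows is sketched below; `selectASR`, `showFullTranscript`, and the `ASRLike` shape are hypothetical names introduced for illustration, and the plan itself uses the inline ternary shown in step 5.

```typescript
type SourceType = 'upload' | 'system-audio' | 'mic'

// Assumption: the minimal shape that all three ASR hooks have in common.
interface ASRLike {
  transcript: string
  partialTranscript: string
}

// Hypothetical helper: picks the hook result for the active source.
// The `never` assignment makes the compiler flag any SourceType that is not handled.
function selectASR<T extends ASRLike>(
  source: SourceType,
  hooks: { upload: T; systemAudio: T; mic: T },
): T {
  switch (source) {
    case 'system-audio':
      return hooks.systemAudio
    case 'mic':
      return hooks.mic
    case 'upload':
      return hooks.upload
    default: {
      const unreachable: never = source
      return unreachable
    }
  }
}

// Hypothetical helper: Full Transcript needs a file to batch-transcribe,
// so it is only offered for uploads.
function showFullTranscript(source: SourceType): boolean {
  return source === 'upload'
}
```

With these in place, `LTTPage` would call `selectASR(source, { upload: uploadASR, systemAudio: systemAudioASR, mic: micASR })` once per render.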
### Phase 4.7 — Backend Adjustments (0.5 day)

**Objective:** Ensure backend handles both system audio and mic sessions correctly.

**Tasks:**

1. Verify `ws_asr.py` WebSocket endpoint works with arbitrary `video_id` (UUID format) — likely no changes needed
2. Add `SYSTEM_AUDIO_ENABLED` and `MIC_ENABLED` config validation in the router (return 503 if disabled)
3. Handle system audio and mic sessions in transcript history (optional — store with `source: 'system-audio'` / `source: 'mic'` metadata)
4. Verify the ASR client handles audio from both sources identically

**No new endpoints needed.** The existing WebSocket and ASR infrastructure is source-agnostic.

**Test Files:** `backend/app/test/test_phase4_config.py`

### Phase 4.8 — Integration & Acceptance Tests (1 day)

**Objective:** Comprehensive tests for both capture flows.

**Backend Integration Tests** (`backend/app/test/test_integration_phase4.py`):

1. WebSocket accepts UUID video_id
2. ASR processes audio from system audio session
3. ASR processes audio from mic session
4. Config toggles disable features

**Frontend Tests:**

1. **System Audio Hook tests** (`test_phase4_useSystemAudioASR.test.ts`): ~10 tests
   - Mock `getDisplayMedia` → successful capture
   - Mock `getDisplayMedia` → user cancels (permission denied)
   - Mock `getDisplayMedia` → no audio track
   - AudioContext setup and teardown
   - WebSocket connection lifecycle
   - PCM conversion and sending
   - `track.onended` triggers auto-stop
   - `stopCapture` cleanup
   - Multiple rapid start/stop cycles

2. **System Audio Component tests** (`test_phase4_SystemAudioCapture.test.tsx`): ~5 tests
   - All UI states render correctly (idle, requesting, capturing, error)
   - Start button calls onStart
   - Stop button calls onStop
   - Error state shows message and retry button
   - Compatibility note visible for non-Chrome (optional)

3. **Mic Hook tests** (`test_phase4_useMicASR.test.ts`): ~8 tests
   - Mock `getUserMedia` → successful capture
   - Mock `getUserMedia` → user denies (permission denied)
   - Mock `getUserMedia` → no audio track
   - AudioContext setup and teardown
   - WebSocket connection lifecycle
   - `track.onended` triggers auto-stop
   - `stopListening` cleanup
   - PCM conversion and sending

4. **Mic Component tests** (`test_phase4_MicCapture.test.tsx`): ~4 tests
   - All UI states render correctly (idle, requesting, listening, error)
   - Start button calls onStart
   - Stop button calls onStop
   - Error state shows message and retry button

5. **LTTPage Integration tests** (`test_phase4_LTTPage_integration.test.tsx`): ~8 tests
   - System Audio tab renders and switches correctly
   - Listen Mic tab renders and switches correctly
   - ASR variable selects systemAudioASR when source is system-audio
   - ASR variable selects micASR when source is mic
   - Full Transcript button hidden for system audio and mic
   - QueryInput receives transcript from system audio
   - QueryInput receives transcript from mic
   - Source switching preserves transcript

**Acceptance Tests** (`backend/app/test/acceptance/test_acceptance_phase4.py`):

- Real `getDisplayMedia` with actual browser (manual — requires human interaction)
- Real `getUserMedia` with actual microphone (manual — requires human interaction)
- Real DashScope ASR with system audio stream
- Real DashScope ASR with microphone stream
- End-to-end: capture → ASR → transcript → RAG answer (both sources)

### Phase 4.9 — Polish & Documentation (0.5 day)

**Tasks:**

1. Update `README.md` — add System Audio Capture and Listen Mic sections with usage instructions, browser compatibility table, and limitations
2. Update `development_plan.md` — add Phase 4 row to timeline, mark status
3. Add browser detection helper for system audio compatibility warning
4. Verify production build (`npm run build`)
5. Run full CI regression (`pytest` + `vitest`)
6. Final commit

---

## 5. Design Decisions

| Decision | Rationale |
|----------|-----------|
| New hooks rather than modifying existing | MediaStream source requires `createMediaStreamSource` (not `createMediaElementSource`), and lifecycle is permission-based (not play/pause events). Separate hooks avoid branching complexity. |
| Two separate hooks + shared audio utility | System Audio and Mic share identical audio processing (MediaStream → PCM → WebSocket) but differ in capture API (`getDisplayMedia` vs `getUserMedia`) and UX. Extract shared pipeline to avoid duplication. |
| UUID-based `video_id` (per-session) | No actual video file for live audio. UUID generated once when source tab is selected, reused across Start/Stop cycles within the same session. Regenerated only when switching between sources. Backend WebSocket already accepts arbitrary strings. |
| Manual Start/Stop (not auto) | Both `getDisplayMedia()` and `getUserMedia()` require explicit user action (browser policy). Cannot auto-start. |
| No video display in System Audio or Mic mode | User watches/listens to content elsewhere. Only capture status and audio controls shown. |
| `video: true` in getDisplayMedia | Chrome requires `video: true` (throws TypeError with `video: false`). Video tracks are captured but immediately discarded — only audio is used. |
| Hide Full Transcript button for both new sources | Streaming ASR only — no video file to batch transcribe. Full transcript would require audio recording (future Phase 5). |
| Browser compatibility note only for System Audio | Mic (`getUserMedia`) is universally supported in all modern browsers. System Audio (`getDisplayMedia` with audio) is Chrome/Edge-only. |
| Mic uses `getUserMedia({ audio: true, video: false })` | Audio-only capture — no camera needed. |

### getDisplayMedia Options (System Audio)

```javascript
const stream = await navigator.mediaDevices.getDisplayMedia({
  video: true, // REQUIRED: Chrome rejects the call with a TypeError when video is false.
               // Video tracks are captured but immediately discarded — only audio is used.
  audio: {
    echoCancellation: false,
    noiseSuppression: false,
    autoGainControl: false,
  },
  systemAudio: 'include', // top-level option of getDisplayMedia(), not an audio constraint
})
```

**Note on `video: true`:** Chrome requires `video: true`; with `video: false` the returned promise rejects with a `TypeError`. The video tracks are immediately stopped via `stream.getVideoTracks().forEach(t => t.stop())` so only audio is processed. The browser permission dialog always shows screen/tab selection (no "audio-only picker" exists). The user must select a tab or screen to share.

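The "discard video, keep audio" step from the note above, combined with the "No audio track found" check from the permission flow, can be sketched as a single function. This is an illustration under assumptions: `keepAudioOnly` is a hypothetical name, and it accepts a minimal stream-like interface (a subset of `MediaStream`) so that it can be exercised without a browser.

```typescript
// Assumption: the subset of MediaStreamTrack / MediaStream that this sketch relies on.
interface TrackLike {
  stop(): void
}

interface StreamLike {
  getAudioTracks(): TrackLike[]
  getVideoTracks(): TrackLike[]
}

// Hypothetical helper: stops every video track and returns the audio tracks.
// Throws when the user shared content without audio.
function keepAudioOnly(stream: StreamLike): TrackLike[] {
  // Video is only requested because getDisplayMedia() needs it; release it right away.
  stream.getVideoTracks().forEach((track) => track.stop())

  const audioTracks = stream.getAudioTracks()
  if (audioTracks.length === 0) {
    throw new Error('No audio track found in the shared content')
  }
  return audioTracks
}
```

A real `MediaStream` satisfies `StreamLike` structurally, so the stream returned by `getDisplayMedia()` could be passed in directly.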
### getUserMedia Options (Listen Mic)

```javascript
const stream = await navigator.mediaDevices.getUserMedia({
  audio: {
    echoCancellation: false, // Don't filter audio (pass raw mic input)
    noiseSuppression: false, // Don't filter audio
    autoGainControl: false,  // Don't adjust volume
  },
  video: false,
})
```

---

## 6. Browser Compatibility

### System Audio (`getDisplayMedia`)

| Platform / Browser | Tab Audio | System Audio | Works? |
|--------------------|-----------|-------------|--------|
| Chrome/Edge (Windows) | ✅ | ✅ | **Best — full support** |
| Chrome/Edge (macOS 14.2+) | ✅ | ✅ | **Good** |
| Chrome/Edge (Linux) | ✅ | ❌ | Works, tab audio only |
| Firefox | ❌ | ❌ | Audio ignored |
| Safari | ❌ | ❌ | Audio not supported |
| Mobile browsers | ❌ | ❌ | Not supported |

**Detection helper:**

```typescript
function isSystemAudioSupported(): boolean {
  const isChromium = 'chrome' in window || navigator.userAgent.includes('Chrome')
  return isChromium && !navigator.userAgent.includes('Firefox')
}
```

### Listen Mic (`getUserMedia`)

| Platform / Browser | Microphone | Works? |
|--------------------|-----------|--------|
| Chrome/Edge | ✅ | **Full support** |
| Firefox | ✅ | **Full support** |
| Safari | ✅ | **Full support** |
| Mobile browsers | ✅ | **Full support** |

Mic capture is universally supported, so no compatibility warning is needed. Note that `getUserMedia` is only available in secure contexts (HTTPS or `localhost`).

---

## 7. Test Strategy

### Test Files

| File | Type | Count | Description |
|------|------|-------|-------------|
| `test_phase4_config.py` | Backend integration | 4 | Config toggles, WebSocket accepts UUID |
| `test_phase4_useSystemAudioASR.test.ts` | Frontend unit | ~10 | Hook: capture, permission, audio, WS |
| `test_phase4_SystemAudioCapture.test.tsx` | Frontend component | ~5 | UI states: idle, requesting, capturing, error |
| `test_phase4_useMicASR.test.ts` | Frontend unit | ~8 | Hook: capture, permission, audio, WS |
| `test_phase4_MicCapture.test.tsx` | Frontend component | ~4 | UI states: idle, requesting, listening, error |
| `test_phase4_LTTPage_integration.test.tsx` | Frontend integration | ~8 | Tab switching, ASR unification, Full Transcript |
| `test_integration_phase4.py` | Backend integration | 4 | Config toggles, WebSocket, ASR client |
| `test_acceptance_phase4.py` | Acceptance | 5 | Real browser + real mic + real DashScope ASR |

### Mocking Strategy

- **`getDisplayMedia`**: Mock with `vi.fn()` returning a synthetic MediaStream with an audio track
- **`getUserMedia`**: Mock with `vi.fn()` returning a synthetic MediaStream with an audio track
- **AudioContext**: Use a manual mock for `AudioContext` and `ScriptProcessorNode` (for example, registered with `vi.stubGlobal`)
- **WebSocket**: Mock via `vitest` WebSocket mock (same pattern as Phase 2/3 tests)
- **DashScope ASR**: Mock in CI; real in acceptance tests

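The "synthetic MediaStream with an audio track" mentioned above can be a plain object, since the hooks only touch a few methods. A minimal sketch, assuming hypothetical factory names `createFakeTrack` and `createFakeStream`. In a Vitest test the stream would be returned from a `vi.fn()` installed as `navigator.mediaDevices.getUserMedia` or `getDisplayMedia`. Because a real `stop()` does not fire the `ended` event, the fake exposes a separate `simulateEnded()` (also an assumption) to exercise the auto-stop path.

```typescript
// Assumption: only the track members the hooks are expected to use.
interface FakeTrack {
  kind: 'audio' | 'video'
  readyState: 'live' | 'ended'
  onended: (() => void) | null
  stop(): void
  simulateEnded(): void
}

// Hypothetical factory: a track that records whether it was stopped.
function createFakeTrack(kind: 'audio' | 'video'): FakeTrack {
  const track: FakeTrack = {
    kind,
    readyState: 'live',
    onended: null,
    // Mirrors the browser: stop() ends the track without dispatching `ended`.
    stop() {
      track.readyState = 'ended'
    },
    // Test-only: mimics the user clicking "Stop sharing" or revoking the mic.
    simulateEnded() {
      track.readyState = 'ended'
      track.onended?.()
    },
  }
  return track
}

// Hypothetical factory: a stream-like object wrapping the given tracks.
function createFakeStream(tracks: FakeTrack[]) {
  return {
    getTracks: () => [...tracks],
    getAudioTracks: () => tracks.filter((t) => t.kind === 'audio'),
    getVideoTracks: () => tracks.filter((t) => t.kind === 'video'),
  }
}
```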
---

## 8. File Manifest

### New Files

```
frontend/src/hooks/useMediaStreamASR.ts       ← shared audio pipeline (MediaStream → AudioContext → WebSocket)
frontend/src/hooks/useSystemAudioASR.ts
frontend/src/hooks/useMicASR.ts
frontend/src/components/SourceSelector.tsx    ← tab bar (Upload | System Audio | Listen Mic)
frontend/src/components/SystemAudioCapture.tsx
frontend/src/components/MicCapture.tsx
frontend/src/lib/browser.ts                   ← isSystemAudioSupported() detection helper
frontend/src/test/test_phase4_useSystemAudioASR.test.ts
frontend/src/test/test_phase4_SystemAudioCapture.test.tsx
frontend/src/test/test_phase4_useMicASR.test.ts
frontend/src/test/test_phase4_MicCapture.test.tsx
frontend/src/test/test_phase4_LTTPage_integration.test.tsx
backend/app/test/test_phase4_config.py
backend/app/test/test_integration_phase4.py
backend/app/test/acceptance/test_acceptance_phase4.py
.plans/phase4_system_audio_plan.md            ← this file
```

### Modified Files

```
frontend/src/pages/LTTPage.tsx    ← add "System Audio" + "Listen Mic" tabs, wire hooks
frontend/src/types/index.ts       ← add SystemAudioStatus, MicStatus, ASRState types
backend/app/core/config.py        ← add SYSTEM_AUDIO_ENABLED, MIC_ENABLED
development_plan.md               ← add Phase 4 row
README.md                         ← add System Audio + Listen Mic sections
```

---

## 9. Acceptance Criteria

- [x] User can select "System Audio" tab in LTTPage
- [x] User can select "Listen Mic" tab in LTTPage
- [x] Clicking "Start Capture" (System Audio) opens browser permission dialog
- [x] Clicking "Start Listening" (Listen Mic) opens microphone permission prompt
- [x] On permission grant, audio streams through WebSocket to DashScope ASR (both sources) *(verified via manual tests)*
- [x] Real-time transcript flows into QueryInput (both sources) *(verified via manual tests)*
- [x] User can edit transcript while capture/listening continues
- [x] "Stop Capture" properly closes MediaStream, AudioContext, WebSocket
- [x] "Stop Listening" properly closes MediaStream, AudioContext, WebSocket
- [x] Permission denied shows clear error message (both sources)
- [x] Browser compatibility note shown for System Audio
- [x] No compatibility warning for Listen Mic (universally supported)
- [x] All CI tests pass (46 front-end + 14 back-end)
- [ ] Acceptance tests pass with real DashScope ASR *(requires real browser/mic — skipped in CI)*
- [x] `npm run build` produces clean production build

---

**File Information**

- Filename: `phase4_system_audio_plan.md`
- Created: 2026-05-09
- Updated: 2026-05-15 — Phase 4 complete (all 9 sub-phases)
- Status: Complete — 46 frontend tests + 14 backend tests passing, production build verified