Merge branch 'Phase4-dev'

Woody 2026-05-14 23:29:42 +08:00
commit 1e8773469e
16 changed files with 1177 additions and 169 deletions


@ -1,7 +1,7 @@
# Phase 4: System Audio Capture → ASR → RAG — Implementation Plan # Phase 4: System Audio & Mic Capture → ASR → RAG — Implementation Plan
**Created:** 2026-05-09 **Created:** 2026-05-09
**Updated:** 2026-05-09 **Updated:** 2026-05-14
**Status:** 📋 Draft (Not Started) **Status:** 📋 Draft (Not Started)
**Depends on:** Phase 1 (Complete), Phase 2 (Complete), Phase 3 (Complete) **Depends on:** Phase 1 (Complete), Phase 2 (Complete), Phase 3 (Complete)
@ -9,24 +9,40 @@
## 1. Overview ## 1. Overview
Phase 4 adds **system audio capture** as a third audio source in the LTTPage, alongside file Upload and YouTube. Instead of playing a video in the browser, the user captures audio output from any application on their computer (browser tab, Spotify, Zoom, system sounds) and pipes it through the existing ASR → RAG pipeline. Phase 4 adds two new live audio sources in the LTTPage, alongside file Upload:
**Use cases:** 1. **System Audio Capture** — captures audio output from any application on the user's computer (browser tab, Spotify, Zoom, system sounds) via `getDisplayMedia()`.
2. **Listen Mic** — captures microphone input (user's voice, room audio) via `getUserMedia({ audio: true })`.
Both pipe audio through the existing WebSocket → DashScope realtime ASR → RAG pipeline.
### System Audio — Use Cases
- Watching a YouTube video in a regular browser tab (no proxy needed — just share that tab's audio) - Watching a YouTube video in a regular browser tab (no proxy needed — just share that tab's audio)
- Listening to a podcast, lecture, or meeting and getting real-time transcript + RAG - Listening to a podcast, lecture, or meeting and getting real-time transcript + RAG
- Transcribing any audio playing on the computer without needing to download files - Transcribing any audio playing on the computer without needing to download files
### How It Works ### Listen Mic — Use Cases
- Recording a live meeting or lecture through the computer's microphone
- Dictating questions or notes verbally and getting RAG answers
- Transcribing spoken Cantonese in real time without a video source
### How They Work
``` ```
User clicks "System Audio" → clicks "Start Capture" [System Audio]
User clicks "System Audio" → "Start Capture"
→ Browser shows permission dialog (screen/tab picker) → Browser shows permission dialog (screen/tab picker)
→ User selects tab/window/screen (with audio) → User selects tab/window/screen (with audio)
→ getDisplayMedia() returns MediaStream (with audio track) → getDisplayMedia() returns MediaStream (with audio track)
→ AudioContext.createMediaStreamSource(stream) → AudioContext.createMediaStreamSource(stream)
→ ScriptProcessorNode (Float32 PCM, mono 16kHz) → ScriptProcessorNode → WebSocket → DashScope ASR → Transcript → RAG
→ WebSocket → FastAPI → DashScope realtime ASR
→ transcript → QueryInput → RAG Pipeline [Listen Mic]
User clicks "Listen Mic" → "Start Listening"
→ Browser shows microphone permission prompt
→ getUserMedia({ audio: true }) returns MediaStream
→ AudioContext.createMediaStreamSource(stream)
→ ScriptProcessorNode → WebSocket → DashScope ASR → Transcript → RAG
``` ```
### Audio Routing (vs Existing Sources) ### Audio Routing (vs Existing Sources)
@ -34,59 +50,85 @@ User clicks "System Audio" → clicks "Start Capture"
| Source | Audio Input | SourceNode Type | Start/Stop Trigger | | Source | Audio Input | SourceNode Type | Start/Stop Trigger |
|--------|-------------|-----------------|-------------------| |--------|-------------|-----------------|-------------------|
| Upload | `<video>` element | `createMediaElementSource` | play/pause events | | Upload | `<video>` element | `createMediaElementSource` | play/pause events |
| YouTube | `<audio>` element | `createMediaElementSource` | play/pause events on `<video>` |
| **System Audio** | MediaStream from `getDisplayMedia()` | `createMediaStreamSource` | Manual Start/Stop button + track ended event | | **System Audio** | MediaStream from `getDisplayMedia()` | `createMediaStreamSource` | Manual Start/Stop button + track ended event |
| **Listen Mic** | MediaStream from `getUserMedia({ audio: true })` | `createMediaStreamSource` | Manual Start/Stop button + track ended event |
### Why New Hook (Not Reuse Existing) ### Why New Hooks (Not Reuse Existing)
The existing `useVideoASR` and `useYouTubeASR` hooks depend on HTML media elements (`<video>`, `<audio>`) for both the audio source and play/pause lifecycle. System audio capture uses a **MediaStream** object (no DOM element), and its lifecycle is controlled by user permission (grant/revoke) and manual start/stop, not DOM events. A new hook is architecturally cleaner than overloading the existing ones with branching logic. The existing `useVideoASR` hook depends on HTML media elements (`<video>`) for both the audio source and play/pause lifecycle. Both new sources use **MediaStream** objects (no DOM element), and their lifecycle is controlled by user permission (grant/revoke) and manual start/stop, not DOM events.
**System Audio** and **Listen Mic** share the same audio processing pipeline (`MediaStream → AudioContext → ScriptProcessorNode → WebSocket`) but differ in their capture API. A shared internal audio processing utility (`useMediaStreamASR` or similar) should be extracted to avoid code duplication between the two hooks.
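
A minimal sketch of the sample-level work such a shared utility might contain, assuming the pipeline needs mono 16 kHz Int16 PCM as stated in the data flow. The function names `downsampleTo16k` and `floatTo16BitPCM` are hypothetical and not taken from the codebase; the averaging downsampler is one simple choice, not necessarily what `useVideoASR.ts` does.

```typescript
// Hypothetical helpers for the shared audio pipeline (names are illustrative).

/** Downsample mono Float32 samples to 16 kHz by averaging each source window. */
function downsampleTo16k(input: Float32Array, inputRate: number): Float32Array {
  const targetRate = 16000;
  if (inputRate <= targetRate) return input; // this sketch never upsamples
  const ratio = inputRate / targetRate;
  const outLength = Math.floor(input.length / ratio);
  const out = new Float32Array(outLength);
  for (let i = 0; i < outLength; i++) {
    const start = Math.floor(i * ratio);
    const end = Math.min(input.length, Math.floor((i + 1) * ratio));
    let sum = 0;
    for (let j = start; j < end; j++) sum += input[j];
    out[i] = end > start ? sum / (end - start) : 0;
  }
  return out;
}

/** Convert Float32 samples in [-1, 1] to signed 16-bit PCM, clamping out-of-range values. */
function floatTo16BitPCM(input: Float32Array): Int16Array {
  const out = new Int16Array(input.length);
  for (let i = 0; i < input.length; i++) {
    const s = Math.max(-1, Math.min(1, input[i]));
    out[i] = s < 0 ? Math.round(s * 0x8000) : Math.round(s * 0x7fff);
  }
  return out;
}
```

Both hooks could call these from their `onaudioprocess` handler and send the resulting `Int16Array` buffer as a binary WebSocket frame, which would keep the capture API as the only difference between them.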
--- ---
## 2. User Flow ## 2. User Flow
1. User selects **"System Audio"** tab (third option alongside Upload / YouTube) ### 2.1 System Audio
1. User selects **"System Audio"** tab (second option alongside Upload / Listen Mic)
2. UI shows a **"Start Capture"** button with browser compatibility info 2. UI shows a **"Start Capture"** button with browser compatibility info
3. User clicks **"Start Capture"** 3. User clicks **"Start Capture"**
4. Browser opens **permission dialog** (screen/tab picker) 4. Browser opens **permission dialog** (screen/tab picker)
- User selects a browser tab (e.g., "YouTube — Live Stream") or "Entire Screen" - User selects a browser tab (e.g., "YouTube — Live Stream") or "Entire Screen"
- User checks "Share audio" if available - User checks "Share audio" if available
5. On approval: capture starts — status indicator shows "Capturing" with a live audio level meter 5. On approval: capture starts — status indicator shows "Capturing" with a live audio level meter
6. Real-time ASR transcription flows into **QueryInput** (same as Upload/YouTube) 6. Real-time ASR transcription flows into **QueryInput** (same as Upload)
7. User can **edit transcript while capturing** continues 7. User can **edit transcript while capturing** continues
8. User clicks **"Stop Capture"** to end — transcript stays in QueryInput 8. User clicks **"Stop Capture"** to end — transcript stays in QueryInput
9. User submits query → RAG pipeline processes it 9. User submits query → RAG pipeline processes it
10. **"Full Transcript" button hidden** (streaming ASR only, same as YouTube) 10. **"Full Transcript" button hidden** (streaming ASR only — no batch transcription for live sources)
### Permission Denied Flow #### Permission Denied Flow
- User clicks "Cancel" in permission dialog → error: "Permission denied — system audio capture requires your explicit permission"
- User revokes permission (Chrome "Stop sharing") → capture stops gracefully, status: "Capture stopped"
- No audio track in the stream → error: "No audio track found in the shared content"
1. If user clicks "Cancel" in permission dialog → error state: "Permission denied — system audio capture requires your explicit permission" ### 2.2 Listen Mic
2. If user revokes permission (Chrome "Stop sharing") → capture stops gracefully, status: "Capture stopped"
3. If no audio track in the stream → error: "No audio track found in the shared content" 1. User selects **"Listen Mic"** tab (third option)
2. UI shows a **"Start Listening"** button (no browser compatibility warning — widely supported)
3. User clicks **"Start Listening"**
4. Browser shows **microphone permission prompt** (first time only)
5. On approval: listening starts — status indicator shows "Listening" with a live audio level meter
6. Real-time ASR transcription flows into **QueryInput**
7. User can **edit transcript while listening** continues
8. User clicks **"Stop Listening"** to end — transcript stays in QueryInput
9. User submits query → RAG pipeline processes it
10. **"Full Transcript" button hidden** (streaming ASR only)
#### Permission Denied Flow
- User clicks "Block" in mic permission prompt → error: "Microphone access denied — please allow microphone access in your browser settings"
- User revokes permission via browser UI → listening stops, status: "Microphone disconnected"
- No audio track → error: "No microphone input detected"
--- ---
## 3. Architecture ## 3. Architecture
### 3.1 Component Tree (LTTPage — System Audio Mode) ### 3.1 Component Tree (LTTPage — All Sources)
``` ```
LTTPage LTTPage
├── SourceSelector (tabs: Upload | YouTube | System Audio) ├── SourceSelector (tabs: Upload | System Audio | Listen Mic)
├── [source === 'system-audio'] ├── [source === 'system-audio']
│ ├── SystemAudioCapture │ └── SystemAudioCapture
│ │ ├── Start/Stop button │ ├── Start/Stop button
│ │ ├── Status indicator (idle | requesting | capturing | error) │ ├── Status indicator (idle | requesting | capturing | error)
│ │ ├── Audio level meter (optional, nice-to-have) │ ├── Audio level meter (optional, nice-to-have)
│ │ └── Browser compatibility note (non-Chrome users) │ └── Browser compatibility note (non-Chrome users)
│ └── (no video player — audio-only capture) ├── [source === 'mic']
├── QueryInput (receives transcript from useSystemAudioASR) │ └── MicCapture
│ ├── Start/Stop button
│ ├── Status indicator (idle | requesting | listening | error)
│ └── Audio level meter (optional, nice-to-have)
├── QueryInput (receives transcript from active ASR hook)
├── ExtractedQuestionsDisplay ├── ExtractedQuestionsDisplay
└── RAG Response Panel └── RAG Response Panel
``` ```
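
The optional audio level meter listed in the tree could be driven by a root-mean-square value computed per audio buffer. The sketch below is one possible approach under that assumption; `rmsLevel` is a hypothetical name, and the plan itself does not specify how the meter is computed.

```typescript
// Hypothetical helper: returns a level in [0, 1] for one buffer of Float32 samples.
function rmsLevel(samples: Float32Array): number {
  if (samples.length === 0) return 0; // avoid dividing by zero on an empty buffer
  let sumSquares = 0;
  for (let i = 0; i < samples.length; i++) {
    sumSquares += samples[i] * samples[i];
  }
  // Clamp so that occasional out-of-range samples cannot push the meter past full scale.
  return Math.min(1, Math.sqrt(sumSquares / samples.length));
}
```

The same value could feed both `SystemAudioCapture` and `MicCapture`, since both receive identical Float32 buffers from the ScriptProcessorNode.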
### 3.2 Data Flow ### 3.2 Data Flow
#### System Audio
``` ```
SystemAudioCapture (UI) SystemAudioCapture (UI)
@ -99,32 +141,51 @@ useSystemAudioASR hook
│ └── User picks tab/window → returns MediaStream │ └── User picks tab/window → returns MediaStream
├── AudioContext.createMediaStreamSource(stream) ├── AudioContext.createMediaStreamSource(stream)
│ └── MediaStreamAudioSourceNode
├── ScriptProcessorNode (4096 buffer, mono 16kHz) ├── ScriptProcessorNode (4096 buffer, mono 16kHz)
│ └── onaudioprocess: convert Float32 → Int16 PCM
├── WebSocket → ws://host/ws/asr/{uuid}?language=yue ├── WebSocket → ws://host/ws/asr/{uuid}?language=yue
│ └── Sends binary PCM frames
└── Returns: { status, transcript, partialTranscript, startCapture, stopCapture } └── Returns: { status, transcript, partialTranscript, startCapture, stopCapture }
```
#### Listen Mic
```
MicCapture (UI)
├── "Start Listening" click → calls startListening() from hook
LTTPage unifies: const asr = source === 'system-audio' ? systemAudioASR : ... useMicASR hook
├── getUserMedia({ audio: true })
QueryInput receives asr.partialTranscript │ └── Browser shows mic permission prompt → returns MediaStream
├── AudioContext.createMediaStreamSource(stream)
├── ScriptProcessorNode (4096 buffer, mono 16kHz)
├── WebSocket → ws://host/ws/asr/{uuid}?language=yue
└── Returns: { status, transcript, partialTranscript, startListening, stopListening }
```
#### LTTPage Unification
```typescript
const asr = source === 'system-audio' ? systemAudioASR
: source === 'mic' ? micASR
: uploadASR
``` ```
### 3.3 Backend Changes ### 3.3 Backend Changes
**Minimal.** The existing WebSocket ASR endpoint (`ws_asr.py`) already accepts audio from any source. The only addition is handling a **UUID-based `video_id`** for system audio sessions (no real video file). **Minimal.** The existing WebSocket ASR endpoint (`ws_asr.py`) already accepts audio from any source. The only additions are UUID-based `video_id` handling and feature toggles.
| Change | File | Description | | Change | File | Description |
|--------|------|-------------| |--------|------|-------------|
| Allow UUID video_id | `backend/app/routers/ws_asr.py` | Accept non-file-based video IDs (already accepts any string) | | Allow UUID video_id | `backend/app/routers/ws_asr.py` | Accept non-file-based video IDs (already accepts any string) |
| Transcript persistence | `backend/app/services/history_service.py` | Store system audio transcripts with UUID session ID (optional — nice-to-have) | | Transcript persistence | `backend/app/services/history_service.py` | Store system audio & mic transcripts with UUID session ID (optional — nice-to-have) |
| Config | `backend/app/core/config.py` | Add `SYSTEM_AUDIO_ENABLED` toggle (default: true) | | Config | `backend/app/core/config.py` | Add `SYSTEM_AUDIO_ENABLED` and `MIC_ENABLED` toggles (default: true) |
**No changes needed to:** **No changes needed to:**
- DashScope ASR client (receives PCM, doesn't care about source) - DashScope ASR client (receives PCM, doesn't care about source)
@ -135,11 +196,13 @@ QueryInput receives asr.partialTranscript
| File | Status | Description | | File | Status | Description |
|------|--------|-------------| |------|--------|-------------|
| `frontend/src/components/SourceSelector.tsx` | **New** | Reusable tab bar component (Upload \| System Audio \| Listen Mic) |
| `frontend/src/hooks/useSystemAudioASR.ts` | **New** | Hook: getDisplayMedia → AudioContext → WebSocket | | `frontend/src/hooks/useSystemAudioASR.ts` | **New** | Hook: getDisplayMedia → AudioContext → WebSocket |
| `frontend/src/components/SystemAudioCapture.tsx` | **New** | UI: Start/Stop button, status, compatibility note | | `frontend/src/hooks/useMicASR.ts` | **New** | Hook: getUserMedia → AudioContext → WebSocket |
| `frontend/src/pages/LTTPage.tsx` | **Modified** | Add "System Audio" tab, wire hook, unify ASR | | `frontend/src/components/SystemAudioCapture.tsx` | **New** | UI: Start/Stop, status, compatibility note |
| `frontend/src/types/index.ts` | **Modified** | Add SystemAudioStatus type | | `frontend/src/components/MicCapture.tsx` | **New** | UI: Start/Stop, status |
| `frontend/src/components/SourceSelector.tsx` | **Refactor** | Extract source tabs into reusable component (optional — can inline in LTTPage) | | `frontend/src/pages/LTTPage.tsx` | **Modified** | Add source selector, wire hooks, unify ASR, conditional rendering |
| `frontend/src/types/index.ts` | **Modified** | Add SourceType, SystemAudioStatus, MicStatus types |
--- ---
@ -150,25 +213,31 @@ QueryInput receives asr.partialTranscript
| 4.1 | Config & Infrastructure | 0.5 day | — | 📋 Draft | | 4.1 | Config & Infrastructure | 0.5 day | — | 📋 Draft |
| 4.2 | System Audio Capture Hook (`useSystemAudioASR`) | 1 day | 4.1 | 📋 Draft | | 4.2 | System Audio Capture Hook (`useSystemAudioASR`) | 1 day | 4.1 | 📋 Draft |
| 4.3 | SystemAudioCapture UI Component | 0.5 day | 4.2 | 📋 Draft | | 4.3 | SystemAudioCapture UI Component | 0.5 day | 4.2 | 📋 Draft |
| 4.4 | LTTPage Integration | 0.5 day | 4.2, 4.3 | 📋 Draft | | 4.4 | Mic Capture Hook (`useMicASR`) | 0.5 day | 4.1 | 📋 Draft |
| 4.5 | Backend Adjustments | 0.5 day | 4.1 | 📋 Draft | | 4.5 | MicCapture UI Component | 0.5 day | 4.4 | 📋 Draft |
| 4.6 | Integration & Acceptance Tests | 1 day | 4.4, 4.5 | 📋 Draft | | 4.6 | LTTPage Integration (all 3 sources) | 0.5 day | 4.2, 4.3, 4.4, 4.5 | 📋 Draft |
| 4.7 | Polish & Documentation | 0.5 day | 4.6 | 📋 Draft | | 4.7 | Backend Adjustments | 0.5 day | 4.1 | 📋 Draft |
| **Total** | | **4.5 days** | | | | 4.8 | Integration & Acceptance Tests | 1 day | 4.6, 4.7 | 📋 Draft |
| 4.9 | Polish & Documentation | 0.5 day | 4.8 | 📋 Draft |
| **Total** | | **5.5 days** | | |
### Phase 4.1 — Config & Infrastructure (0.5 day) ### Phase 4.1 — Config & Infrastructure (0.5 day)
**Objective:** Add system audio feature toggle, define types, establish UUID generation. **Objective:** Add feature toggles, define types, establish UUID generation.
**Tasks:** **Tasks:**
1. Add `SYSTEM_AUDIO_ENABLED` to `backend/app/core/config.py` (default: `True`) 1. Add `SYSTEM_AUDIO_ENABLED` and `MIC_ENABLED` to `backend/app/core/config.py` (default: `True`)
2. Add `SystemAudioStatus` type to `frontend/src/types/index.ts`: 2. Add `SystemAudioStatus` type to `frontend/src/types/index.ts`:
```typescript ```typescript
type SystemAudioStatus = 'idle' | 'requesting' | 'capturing' | 'stopping' | 'error' type SystemAudioStatus = 'idle' | 'requesting' | 'capturing' | 'stopping' | 'error'
``` ```
3. Add `SystemAudioASRState` interface to types 3. Add `MicStatus` type:
4. Add `video_id` UUID generation helper (frontend-side: `crypto.randomUUID()`) ```typescript
5. Verify WebSocket ASR endpoint accepts arbitrary `video_id` strings (it does — confirm with a quick test) type MicStatus = 'idle' | 'requesting' | 'listening' | 'stopping' | 'error'
```
4. Add `SystemAudioASRState` and `MicASRState` interfaces to types
5. Add `video_id` UUID generation helper (frontend-side: `crypto.randomUUID()`)
6. Verify WebSocket ASR endpoint accepts arbitrary `video_id` strings (it does — confirm with a quick test)
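
The UUID helper and the WebSocket URL format used throughout the plan (`ws://host/ws/asr/{uuid}?language=yue`) can be sketched as a small pure function. `buildAsrWsUrl` and its `secure` flag are hypothetical additions for illustration; only the path shape and the `language` query parameter come from the plan.

```typescript
// Hypothetical helper: builds the ASR WebSocket URL for a live session.
// The session ID is passed in so the function stays pure and easy to test.
function buildAsrWsUrl(
  host: string,
  sessionId: string,
  language: string = "yue",
  secure: boolean = false // assumption: wss:// would be used behind HTTPS
): string {
  const scheme = secure ? "wss" : "ws";
  return (
    `${scheme}://${host}/ws/asr/${encodeURIComponent(sessionId)}` +
    `?language=${encodeURIComponent(language)}`
  );
}
```

In the browser the session ID would come from `crypto.randomUUID()`, for example `buildAsrWsUrl("localhost:8000", crypto.randomUUID())`.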
**Test Files:** `backend/app/test/test_phase4_config.py` **Test Files:** `backend/app/test/test_phase4_config.py`
@ -205,7 +274,7 @@ interface UseSystemAudioASRReturn {
**Pattern to Follow:** **Pattern to Follow:**
- AudioContext setup: follow `useVideoASR.ts` lines 45-143 (AudioContext, ScriptProcessor, sample rate conversion) - AudioContext setup: follow `useVideoASR.ts` lines 45-143 (AudioContext, ScriptProcessor, sample rate conversion)
- WebSocket handling: follow `useYouTubeASR.ts` lines 35-100 - WebSocket handling: follow `useVideoASR.ts` lines 35-100
- State management: combine patterns from both hooks, adapting for MediaStream source - State management: combine patterns from both hooks, adapting for MediaStream source
**Test Files:** `frontend/src/test/test_phase4_useSystemAudioASR.test.ts` **Test Files:** `frontend/src/test/test_phase4_useSystemAudioASR.test.ts`
@ -239,57 +308,128 @@ On Linux, only tab audio is available (not full system audio).
**Test Files:** `frontend/src/test/test_phase4_SystemAudioCapture.test.tsx` **Test Files:** `frontend/src/test/test_phase4_SystemAudioCapture.test.tsx`
### Phase 4.4 — LTTPage Integration (0.5 day) ### Phase 4.4 — Mic Capture Hook (0.5 day)
**Objective:** Wire the System Audio source into LTTPage, adding it as the third tab alongside Upload and YouTube. **Objective:** Create `useMicASR.ts` hook that captures microphone input and streams it to the ASR WebSocket.
**Key Design:**
```typescript
interface UseMicASRProps {
wsUrl: string // e.g., ws://localhost:8000/ws/asr/{uuid}?language=yue
}
interface UseMicASRReturn {
status: 'idle' | 'requesting' | 'listening' | 'stopping' | 'error'
transcript: string
partialTranscript: string
error: string | null
startListening: () => Promise<void>
stopListening: () => void
}
```
**Implementation Details:**
- `startListening()`: calls `navigator.mediaDevices.getUserMedia({ audio: true, video: false })`
- On success: creates AudioContext, `createMediaStreamSource(stream)`, connects ScriptProcessor → WebSocket
- On user deny: sets status to `'idle'`, sets error "Microphone access denied"
- On no audio track: sets status to `'error'`, sets error "No microphone input detected"
- `stopListening()`: stops all tracks in the MediaStream, closes AudioContext, closes WebSocket
- Auto-stop: listens for `track.onended` (user revokes permission) → calls stopListening
- Audio processing: identical to useSystemAudioASR — `ScriptProcessorNode(4096)`, convert Float32 → Int16 PCM, send via WebSocket
- WebSocket lifecycle: connect on listening start, close on listening stop
- Cleanup: useEffect return closes AudioContext, WebSocket, and stops tracks
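
The failure branches above can be sketched as a mapping from the rejection of `getUserMedia()` to the hook's status and error message. `describeMicFailure` is a hypothetical name. Treating `NotFoundError` (no matching input device) as the "No microphone input detected" case is an assumption, and the fallback message is illustrative only.

```typescript
type MicStatus = "idle" | "requesting" | "listening" | "stopping" | "error";

interface MicFailure {
  status: MicStatus;
  error: string;
}

// Hypothetical helper: translate a getUserMedia() rejection into hook state.
function describeMicFailure(err: unknown): MicFailure {
  const name =
    typeof err === "object" && err !== null && "name" in err
      ? String((err as { name: unknown }).name)
      : "";
  if (name === "NotAllowedError") {
    // User clicked "Block", or permission was denied earlier: the plan returns to idle.
    return { status: "idle", error: "Microphone access denied" };
  }
  if (name === "NotFoundError") {
    // Assumption: a missing input device is reported like a missing audio track.
    return { status: "error", error: "No microphone input detected" };
  }
  // Illustrative fallback message, not specified in the plan.
  return { status: "error", error: "Could not start microphone capture" };
}
```

Inside `startListening()` this would sit in the `catch` block around the `getUserMedia()` call, keeping the browser-specific error names out of the component.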
**Code Sharing:** Extract shared audio processing logic (`MediaStream → AudioContext → ScriptProcessorNode → WebSocket`) into a reusable internal utility (`useMediaStreamASR` or `audioPipeline.ts`) to avoid duplication between `useSystemAudioASR` and `useMicASR`.
**Test Files:** `frontend/src/test/test_phase4_useMicASR.test.ts`
### Phase 4.5 — MicCapture UI Component (0.5 day)
**Objective:** Create the `MicCapture.tsx` component with Start/Stop button and status display.
**Component Props:**
```typescript
interface MicCaptureProps {
status: MicStatus
error: string | null
onStart: () => void
onStop: () => void
}
```
**UI States:**
1. **Idle**: "Start Listening" button (blue, prominent) — no compatibility warning needed (mic is universally supported)
2. **Requesting**: "Waiting for microphone permission..." (loading spinner)
3. **Listening**: "Stop Listening" button (red) + pulsing green dot + "Listening..."
4. **Error**: Red banner with error message + "Try Again" button
**Test Files:** `frontend/src/test/test_phase4_MicCapture.test.tsx`
### Phase 4.6 — LTTPage Integration (0.5 day)
**Objective:** Create the `SourceSelector` tab bar component and wire both new sources into LTTPage.
**New Component — `SourceSelector.tsx`:**
```typescript
interface SourceSelectorProps {
activeSource: SourceType
onSelect: (source: SourceType) => void
}
```
- Three tabs: Upload (📁), System Audio (🔊), Listen Mic (🎤)
- Active tab highlighted with blue background, inactive tabs gray
- Icons from lucide-react: `Upload`, `MonitorSpeaker`, `Mic`
**Changes to `LTTPage.tsx`:** **Changes to `LTTPage.tsx`:**
1. Extend `SourceType` from `'upload' | 'youtube'` to `'upload' | 'youtube' | 'system-audio'` 1. Add `SourceType` state: `const [source, setSource] = useState<SourceType>('upload')`
2. Add third tab button (icon: `AudioLines` from lucide-react) in the source selector 2. Render `<SourceSelector activeSource={source} onSelect={setSource} />` above the panels
3. Initialize `useSystemAudioASR` hook with a UUID-based WebSocket URL 3. Extend `SourceType` to `'upload' | 'system-audio' | 'mic'`
4. Update `asr` variable: 4. Initialize `useSystemAudioASR` and `useMicASR` hooks with session-scoped UUIDs (generated once when tab selected, reused across Start/Stop cycles)
5. Update `asr` variable:
```typescript ```typescript
const asr = source === 'youtube' ? youtubeASR const asr = source === 'system-audio' ? systemAudioASR
: source === 'system-audio' ? systemAudioASR : source === 'mic' ? micASR
: uploadASR : uploadASR
``` ```
5. Conditional rendering: 6. Conditional rendering:
``` ```tsx
{source === 'upload' && <VideoUploader />} {source === 'upload' && <VideoUpload />}
{source === 'youtube' && <YouTubeMode />}
{source === 'system-audio' && <SystemAudioCapture />} {source === 'system-audio' && <SystemAudioCapture />}
{source === 'mic' && <MicCapture />}
``` ```
6. WebSocket URL: `ws://host/ws/asr/{crypto.randomUUID()}?language=yue` 7. WebSocket URL: `ws://host/ws/asr/{sessionUUID}?language=yue` (UUID stable per session, regenerated only on source switch)
7. Full Transcript button: hidden for system-audio (same as YouTube) 8. Full Transcript button: hidden for system-audio AND mic (streaming ASR only)
8. QueryInput: remains editable during capture (same behavior as other sources) 9. QueryInput: remains editable during capture/listening
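
Two of the rules in this list are easy to get subtly wrong: the Full Transcript button is hidden for both live sources, and the session UUID stays stable across Start/Stop cycles but changes on a source switch. A minimal sketch, assuming the three-value `SourceType` from the plan; `isLiveSource`, `AsrSession`, and `nextSession` are hypothetical names.

```typescript
type SourceType = "upload" | "system-audio" | "mic";

/** Live sources stream ASR only, so the Full Transcript button is hidden for them. */
function isLiveSource(source: SourceType): boolean {
  return source === "system-audio" || source === "mic";
}

interface AsrSession {
  source: SourceType;
  sessionId: string;
}

/**
 * Hypothetical helper: reuse the current session while the source is unchanged
 * (Start/Stop cycles), and create a new ID only when the user switches source.
 */
function nextSession(
  current: AsrSession | null,
  source: SourceType,
  newId: () => string
): AsrSession {
  if (current !== null && current.source === source) return current;
  return { source, sessionId: newId() };
}
```

In `LTTPage.tsx`, `newId` would typically be `() => crypto.randomUUID()`, and the returned `sessionId` would be interpolated into the WebSocket URL.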
**Test Files:** `frontend/src/test/test_phase4_LTTPage_integration.test.tsx` **Test Files:** `frontend/src/test/test_phase4_LTTPage_integration.test.tsx`
### Phase 4.5 — Backend Adjustments (0.5 day) ### Phase 4.7 — Backend Adjustments (0.5 day)
**Objective:** Ensure backend handles system audio sessions correctly. **Objective:** Ensure backend handles both system audio and mic sessions correctly.
**Tasks:** **Tasks:**
1. Verify `ws_asr.py` WebSocket endpoint works with arbitrary `video_id` (UUID format) — likely no changes needed 1. Verify `ws_asr.py` WebSocket endpoint works with arbitrary `video_id` (UUID format) — likely no changes needed
2. Add `SYSTEM_AUDIO_ENABLED` config validation in the router (return 503 if disabled) 2. Add `SYSTEM_AUDIO_ENABLED` and `MIC_ENABLED` config validation in the router (return 503 if disabled)
3. Handle system audio sessions in transcript history (optional — store with `source: 'system-audio'` metadata) 3. Handle system audio and mic sessions in transcript history (optional — store with `source: 'system-audio'` / `source: 'mic'` metadata)
4. Verify the ASR client handles system audio PCM identically to video audio 4. Verify the ASR client handles audio from both sources identically
**No new endpoints needed.** The existing WebSocket and ASR infrastructure is source-agnostic. **No new endpoints needed.** The existing WebSocket and ASR infrastructure is source-agnostic.
**Test Files:** `backend/app/test/test_phase4_config.py` **Test Files:** `backend/app/test/test_phase4_config.py`
### Phase 4.6 — Integration & Acceptance Tests (1 day) ### Phase 4.8 — Integration & Acceptance Tests (1 day)
**Objective:** Comprehensive tests for the system audio capture flow. **Objective:** Comprehensive tests for both capture flows.
**Backend Integration Tests** (`backend/app/test/test_integration_phase4.py`): **Backend Integration Tests** (`backend/app/test/test_integration_phase4.py`):
1. WebSocket accepts UUID video_id 1. WebSocket accepts UUID video_id
2. ASR processes audio from system audio session 2. ASR processes audio from system audio session
3. Config toggle disables feature 3. ASR processes audio from mic session
4. Config toggles disable features
**Frontend Tests:** **Frontend Tests:**
1. **Hook tests** (`test_phase4_useSystemAudioASR.test.ts`): ~10 tests
1. **System Audio Hook tests** (`test_phase4_useSystemAudioASR.test.ts`): ~10 tests
- Mock `getDisplayMedia` → successful capture - Mock `getDisplayMedia` → successful capture
- Mock `getDisplayMedia` → user cancels (permission denied) - Mock `getDisplayMedia` → user cancels (permission denied)
- Mock `getDisplayMedia` → no audio track - Mock `getDisplayMedia` → no audio track
@ -300,31 +440,52 @@ On Linux, only tab audio is available (not full system audio).
- `stopCapture` cleanup - `stopCapture` cleanup
- Multiple rapid start/stop cycles - Multiple rapid start/stop cycles
2. **Component tests** (`test_phase4_SystemAudioCapture.test.tsx`): ~5 tests 2. **System Audio Component tests** (`test_phase4_SystemAudioCapture.test.tsx`): ~5 tests
- All UI states render correctly (idle, requesting, capturing, error) - All UI states render correctly (idle, requesting, capturing, error)
- Start button calls onStart - Start button calls onStart
- Stop button calls onStop - Stop button calls onStop
- Error state shows message and retry button - Error state shows message and retry button
- Compatibility note visible for non-Chrome (optional) - Compatibility note visible for non-Chrome (optional)
3. **Integration tests** (`test_phase4_LTTPage_integration.test.tsx`): ~5 tests 3. **Mic Hook tests** (`test_phase4_useMicASR.test.ts`): ~8 tests
- Mock `getUserMedia` → successful capture
- Mock `getUserMedia` → user denies (permission denied)
- Mock `getUserMedia` → no audio track
- AudioContext setup and teardown
- WebSocket connection lifecycle
- `track.onended` triggers auto-stop
- `stopListening` cleanup
- PCM conversion and sending
4. **Mic Component tests** (`test_phase4_MicCapture.test.tsx`): ~4 tests
- All UI states render correctly (idle, requesting, listening, error)
- Start button calls onStart
- Stop button calls onStop
- Error state shows message and retry button
5. **LTTPage Integration tests** (`test_phase4_LTTPage_integration.test.tsx`): ~8 tests
- System Audio tab renders and switches correctly - System Audio tab renders and switches correctly
- Listen Mic tab renders and switches correctly
- ASR variable selects systemAudioASR when source is system-audio - ASR variable selects systemAudioASR when source is system-audio
- Full Transcript button hidden for system audio - ASR variable selects micASR when source is mic
- Full Transcript button hidden for system audio and mic
- QueryInput receives transcript from system audio - QueryInput receives transcript from system audio
- QueryInput receives transcript from mic
- Source switching preserves transcript - Source switching preserves transcript
**Acceptance Tests** (`backend/app/test/acceptance/test_acceptance_phase4.py`): **Acceptance Tests** (`backend/app/test/acceptance/test_acceptance_phase4.py`):
- Real `getDisplayMedia` with actual browser (manual — requires human interaction) - Real `getDisplayMedia` with actual browser (manual — requires human interaction)
- Real `getUserMedia` with actual microphone (manual — requires human interaction)
- Real DashScope ASR with system audio stream - Real DashScope ASR with system audio stream
- End-to-end: capture → ASR → transcript → RAG answer - Real DashScope ASR with microphone stream
- End-to-end: capture → ASR → transcript → RAG answer (both sources)
### Phase 4.7 — Polish & Documentation (0.5 day) ### Phase 4.9 — Polish & Documentation (0.5 day)
**Tasks:** **Tasks:**
1. Update `README.md` — add System Audio Capture section with usage instructions, browser compatibility table, and limitations 1. Update `README.md` — add System Audio Capture and Listen Mic sections with usage instructions, browser compatibility table, and limitations
2. Update `development_plan.md` — add Phase 4 row to timeline, mark status 2. Update `development_plan.md` — add Phase 4 row to timeline, mark status
3. Add browser detection helper for compatibility warning 3. Add browser detection helper for system audio compatibility warning
4. Verify production build (`npm run build`) 4. Verify production build (`npm run build`)
5. Run full CI regression (`pytest` + `vitest`) 5. Run full CI regression (`pytest` + `vitest`)
6. Final commit 6. Final commit
@ -335,34 +496,51 @@ On Linux, only tab audio is available (not full system audio).
| Decision | Rationale | | Decision | Rationale |
|----------|-----------| |----------|-----------|
-| New hook (`useSystemAudioASR`) rather than modifying existing | MediaStream source requires `createMediaStreamSource` (not `createMediaElementSource`), and lifecycle is permission-based (not play/pause events). Separate hook avoids branching complexity. |
-| UUID-based `video_id` | No actual video file for system audio. `crypto.randomUUID()` generates unique session IDs. Backend WebSocket already accepts arbitrary strings. |
-| Manual Start/Stop (not auto) | `getDisplayMedia()` requires explicit user action (browser policy). Cannot auto-start. |
-| No video display in System Audio mode | User watches content in another tab/window. Only capture status and audio controls shown. |
-| `video: false` in getDisplayMedia | Audio-only capture reduces bandwidth and permission scope. User only needs to share audio. |
-| Hide Full Transcript button for system audio | Same as YouTube: streaming ASR only. Full transcript would require recording and batch processing (future Phase 5). |
-| Browser compatibility note in UI | `getDisplayMedia` with audio is Chrome/Edge-only. Non-supporting browsers get clear messaging. |
+| New hooks rather than modifying existing | MediaStream source requires `createMediaStreamSource` (not `createMediaElementSource`), and lifecycle is permission-based (not play/pause events). Separate hooks avoid branching complexity. |
+| Two separate hooks + shared audio utility | System Audio and Mic share identical audio processing (MediaStream → PCM → WebSocket) but differ in capture API (`getDisplayMedia` vs `getUserMedia`) and UX. Extract shared pipeline to avoid duplication. |
+| UUID-based `video_id` (per-session) | No actual video file for live audio. UUID generated once when source tab is selected, reused across Start/Stop cycles within the same session. Regenerated only when switching between sources. Backend WebSocket already accepts arbitrary strings. |
+| Manual Start/Stop (not auto) | Both `getDisplayMedia()` and `getUserMedia()` require explicit user action (browser policy). Cannot auto-start. |
+| No video display in System Audio or Mic mode | User watches/listens to content elsewhere. Only capture status and audio controls shown. |
+| `video: false` in getDisplayMedia | Audio-only capture reduces bandwidth and permission scope. |
+| Hide Full Transcript button for both new sources | Streaming ASR only; no video file to batch transcribe. Full transcript would require audio recording (future Phase 5). |
+| Browser compatibility note only for System Audio | Mic (`getUserMedia`) is universally supported in all modern browsers. System Audio (`getDisplayMedia` with audio) is Chrome/Edge-only. |
+| Mic uses `getUserMedia({ audio: true, video: false })` | Audio-only capture; no camera needed. |
-### getDisplayMedia Options
+### getDisplayMedia Options (System Audio)
 ```javascript
 const stream = await navigator.mediaDevices.getDisplayMedia({
-  video: false,              // No video needed
+  video: false,
   audio: {
-    systemAudio: 'include',   // Request system audio (tab + full system where supported)
-    echoCancellation: false,  // Don't filter audio
-    noiseSuppression: false,  // Don't filter audio
-    autoGainControl: false,   // Don't adjust volume
+    systemAudio: 'include',
+    echoCancellation: false,
+    noiseSuppression: false,
+    autoGainControl: false,
   }
 })
 ```
-**Note on `video: false`:** Setting `video: false` tells the browser we only want audio. However, the browser permission dialog still shows screen/tab selection (there's no "audio-only picker"). The user must select a tab or screen to share; this is a browser limitation, not ours.
+**Note on `video: false`:** Setting `video: false` tells the browser we only want audio. However, the browser permission dialog still shows screen/tab selection (there's no "audio-only picker"). The user must select a tab or screen to share; this is a browser limitation.
+### getUserMedia Options (Listen Mic)
+```javascript
+const stream = await navigator.mediaDevices.getUserMedia({
+  audio: {
+    echoCancellation: false, // Don't filter audio (pass raw mic input)
+    noiseSuppression: false, // Don't filter audio
+    autoGainControl: false, // Don't adjust volume
+  },
+  video: false,
+})
+```
 ---
 ## 6. Browser Compatibility
+### System Audio (`getDisplayMedia`)
 | Platform / Browser | Tab Audio | System Audio | Works? |
 |--------------------|-----------|-------------|--------|
 | Chrome/Edge (Windows) | ✅ | ✅ | **Best: full support** |
@ -376,11 +554,21 @@ const stream = await navigator.mediaDevices.getDisplayMedia({
 ```typescript
 function isSystemAudioSupported(): boolean {
   const isChromium = 'chrome' in window || navigator.userAgent.includes('Chrome')
-  // Firefox and Safari don't support audio in getDisplayMedia
   return isChromium && !navigator.userAgent.includes('Firefox')
 }
 ```
+### Listen Mic (`getUserMedia`)
+| Platform / Browser | Microphone | Works? |
+|--------------------|-----------|--------|
+| Chrome/Edge | ✅ | **Full support** |
+| Firefox | ✅ | **Full support** |
+| Safari | ✅ | **Full support** |
+| Mobile browsers | ✅ | **Full support** |
+Mic capture is universally supported; no compatibility warning needed.
 ---
 ## 7. Test Strategy
@ -389,16 +577,19 @@ function isSystemAudioSupported(): boolean {
 | File | Type | Count | Description |
 |------|------|-------|-------------|
-| `test_phase4_config.py` | Backend integration | 3 | Config toggle, WebSocket accepts UUID |
-| `test_phase4_useSystemAudioASR.test.ts` | Frontend unit | ~10 | Hook behavior: capture, permission, audio, WS |
+| `test_phase4_config.py` | Backend integration | 4 | Config toggles, WebSocket accepts UUID |
+| `test_phase4_useSystemAudioASR.test.ts` | Frontend unit | ~10 | Hook: capture, permission, audio, WS |
 | `test_phase4_SystemAudioCapture.test.tsx` | Frontend component | ~5 | UI states: idle, requesting, capturing, error |
-| `test_phase4_LTTPage_integration.test.tsx` | Frontend integration | ~5 | Tab switching, ASR unification, Full Transcript |
-| `test_integration_phase4.py` | Backend integration | 4 | Config toggle, WebSocket, ASR client |
-| `test_acceptance_phase4.py` | Acceptance | 3 | Real browser + real DashScope ASR |
+| `test_phase4_useMicASR.test.ts` | Frontend unit | ~8 | Hook: capture, permission, audio, WS |
+| `test_phase4_MicCapture.test.tsx` | Frontend component | ~4 | UI states: idle, requesting, listening, error |
+| `test_phase4_LTTPage_integration.test.tsx` | Frontend integration | ~8 | Tab switching, ASR unification, Full Transcript |
+| `test_integration_phase4.py` | Backend integration | 4 | Config toggles, WebSocket, ASR client |
+| `test_acceptance_phase4.py` | Acceptance | 5 | Real browser + real mic + real DashScope ASR |
 ### Mocking Strategy
 - **`getDisplayMedia`**: Mock with `jest.fn()` returning a synthetic MediaStream with an AudioTrack
+- **`getUserMedia`**: Mock with `jest.fn()` returning a synthetic MediaStream with an AudioTrack
 - **AudioContext**: Use `jest-webgl-mock` or manual mock for AudioContext, ScriptProcessorNode
 - **WebSocket**: Mock via `vitest` WebSocket mock (same pattern as Phase 2/3 tests)
 - **DashScope ASR**: Mock in CI; real in acceptance tests
@ -410,9 +601,13 @@ function isSystemAudioSupported(): boolean {
 ### New Files
``` ```
 frontend/src/hooks/useSystemAudioASR.ts
+frontend/src/hooks/useMicASR.ts
 frontend/src/components/SystemAudioCapture.tsx
+frontend/src/components/MicCapture.tsx
 frontend/src/test/test_phase4_useSystemAudioASR.test.ts
 frontend/src/test/test_phase4_SystemAudioCapture.test.tsx
+frontend/src/test/test_phase4_useMicASR.test.ts
+frontend/src/test/test_phase4_MicCapture.test.tsx
 frontend/src/test/test_phase4_LTTPage_integration.test.tsx
 backend/app/test/test_phase4_config.py
 backend/app/test/test_integration_phase4.py
@ -422,11 +617,11 @@ backend/app/test/acceptance/test_acceptance_phase4.py
 ### Modified Files
``` ```
-frontend/src/pages/LTTPage.tsx ← add "System Audio" tab, wire hook
-frontend/src/types/index.ts ← add SystemAudioStatus, SystemAudioASRState
-backend/app/core/config.py ← add SYSTEM_AUDIO_ENABLED
+frontend/src/pages/LTTPage.tsx ← add "System Audio" + "Listen Mic" tabs, wire hooks
+frontend/src/types/index.ts ← add SystemAudioStatus, MicStatus, ASRState types
+backend/app/core/config.py ← add SYSTEM_AUDIO_ENABLED, MIC_ENABLED
 development_plan.md ← add Phase 4 row
-README.md ← add System Audio Capture section
+README.md ← add System Audio + Listen Mic sections
``` ```
--- ---
@ -434,13 +629,17 @@ README.md ← add System Audio Capture s
 ## 9. Acceptance Criteria
 - [ ] User can select "System Audio" tab in LTTPage
-- [ ] Clicking "Start Capture" opens browser permission dialog
-- [ ] On permission grant, audio streams through WebSocket to DashScope ASR
-- [ ] Real-time transcript flows into QueryInput
-- [ ] User can edit transcript while capture continues
+- [ ] User can select "Listen Mic" tab in LTTPage
+- [ ] Clicking "Start Capture" (System Audio) opens browser permission dialog
+- [ ] Clicking "Start Listening" (Listen Mic) opens microphone permission prompt
+- [ ] On permission grant, audio streams through WebSocket to DashScope ASR (both sources)
+- [ ] Real-time transcript flows into QueryInput (both sources)
+- [ ] User can edit transcript while capture/listening continues
 - [ ] "Stop Capture" properly closes MediaStream, AudioContext, WebSocket
-- [ ] Permission denied shows clear error message
-- [ ] Browser compatibility note shown for non-Chrome browsers
+- [ ] "Stop Listening" properly closes MediaStream, AudioContext, WebSocket
+- [ ] Permission denied shows clear error message (both sources)
+- [ ] Browser compatibility note shown for System Audio on non-Chrome browsers
+- [ ] No compatibility warning for Listen Mic (universally supported)
 - [ ] All CI tests pass (no regressions)
 - [ ] Acceptance tests pass with real DashScope ASR
 - [ ] `npm run build` produces clean production build
@ -450,4 +649,5 @@ README.md ← add System Audio Capture s
 **File Information**
 - Filename: `phase4_system_audio_plan.md`
 - Created: 2026-05-09
+- Updated: 2026-05-14 (added Listen Mic as third source; removed YouTube)
 - Status: Draft, awaiting review before Phase 4.1 implementation begins
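
The `isSystemAudioSupported` check in section 6 of the plan reads `window` and `navigator` directly, which makes it awkward to unit test. A minimal sketch of a testable variant follows, assuming the same user-agent heuristic as the plan. The name `isSystemAudioSupportedFor` and its parameter are hypothetical and do not appear in the commit.

```typescript
// Hypothetical helper (not in the commit): the plan's user-agent heuristic,
// with the user agent string passed in so it can be tested without a browser.
function isSystemAudioSupportedFor(userAgent: string): boolean {
  // Chrome and Edge both include the token "Chrome" in their user agent.
  const looksChromium = userAgent.indexOf('Chrome') !== -1
  const isFirefox = userAgent.indexOf('Firefox') !== -1
  return looksChromium && !isFirefox
}
```

In the browser this could be called as `isSystemAudioSupportedFor(navigator.userAgent)`. User-agent sniffing is only a heuristic, so a caller would likely still need to handle a rejected `getDisplayMedia()` call.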


@ -54,6 +54,10 @@ class Settings(BaseSettings):
     max_video_size_mb: int = 300
     supported_video_formats: list[str] = [".mp4", ".webm", ".mov", ".avi", ".mkv"]
+    # Phase 4: live audio capture toggles
+    system_audio_enabled: bool = True
+    mic_enabled: bool = True
     # Development helpers
     model_config = {"env_file": ".env", "env_file_encoding": "utf-8"}


@ -209,7 +209,7 @@ async def _ws_proxy_dashscope(client_ws: WebSocket, loop: asyncio.AbstractEventL
 @router.websocket("/ws/asr/{video_id}")
-async def ws_asr_endpoint(websocket: WebSocket, video_id: str, language: str = "yue"):
+async def ws_asr_endpoint(websocket: WebSocket, video_id: str, language: str = "yue", source: str = "upload"):
     settings = get_settings()
     client_host = websocket.client.host if websocket.client else "unknown"
@ -220,9 +220,23 @@ async def ws_asr_endpoint(websocket: WebSocket, video_id: str, language: str = "
         logger.warning("ws-rejected-no-apikey video_id=%s client=%s", video_id, client_host)
         return
+    if source == "system-audio" and not settings.system_audio_enabled:
+        await websocket.accept()
+        await websocket.send_json({"error": "System audio capture is disabled"})
+        await websocket.close(code=1008, reason="System audio disabled")
+        logger.warning("ws-rejected-system-audio-disabled video_id=%s client=%s", video_id, client_host)
+        return
+    if source == "mic" and not settings.mic_enabled:
+        await websocket.accept()
+        await websocket.send_json({"error": "Microphone capture is disabled"})
+        await websocket.close(code=1008, reason="Mic disabled")
+        logger.warning("ws-rejected-mic-disabled video_id=%s client=%s", video_id, client_host)
+        return
     await websocket.accept()
     loop = asyncio.get_event_loop()
-    logger.info("ws-connect video_id=%s lang=%s client=%s", video_id, language, client_host)
+    logger.info("ws-connect video_id=%s lang=%s source=%s client=%s", video_id, language, source, client_host)
     try:
         await _ws_proxy_dashscope(websocket, loop, language)
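
The endpoint now reads a `source` query parameter next to `language`, and rejects `system-audio` or `mic` connections when the matching toggle is off. A minimal sketch of how a client might build the matching URL is shown here. `buildAsrWsUrl` and the `AsrSource` type are hypothetical names, and the base URL in the usage note is a placeholder; only the path and parameter names come from the endpoint above.

```typescript
// Source values the endpoint distinguishes; "upload" is its default.
type AsrSource = 'upload' | 'system-audio' | 'mic'

// Hypothetical helper (not in the commit): builds the /ws/asr/{video_id} URL
// with the two query parameters the endpoint reads.
function buildAsrWsUrl(base: string, videoId: string, language: string, source: AsrSource): string {
  const trimmedBase = base.replace(/\/+$/, '') // drop trailing slashes
  const query =
    'language=' + encodeURIComponent(language) +
    '&source=' + encodeURIComponent(source)
  return trimmedBase + '/ws/asr/' + encodeURIComponent(videoId) + '?' + query
}
```

For example, `buildAsrWsUrl('ws://localhost:8000', sessionId, 'yue', 'mic')` would target the mic branch of the gate, where `sessionId` stands for the per-session UUID described in the plan.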


@ -0,0 +1,140 @@
"""Phase 4 config tests: system audio and mic capture feature toggles."""
import pytest
from fastapi import FastAPI
from fastapi.testclient import TestClient


@pytest.fixture
def phase4_ws_app(monkeypatch):
    monkeypatch.setenv("DASHSCOPE_API_KEY", "sk-test-key")
    monkeypatch.setenv("SYSTEM_AUDIO_ENABLED", "true")
    monkeypatch.setenv("MIC_ENABLED", "true")
    from app.core.config import get_settings
    from app.routers.ws_asr import router
    get_settings.cache_clear()
    app = FastAPI()
    app.include_router(router)
    return app


class TestWSSourceToggle:
    def test_system_audio_source_connects(self, phase4_ws_app):
        client = TestClient(phase4_ws_app)
        with client.websocket_connect("/ws/asr/test-uuid?source=system-audio") as ws:
            pass

    def test_mic_source_connects(self, phase4_ws_app):
        client = TestClient(phase4_ws_app)
        with client.websocket_connect("/ws/asr/test-uuid?source=mic") as ws:
            pass

    def test_default_source_is_upload(self, phase4_ws_app):
        client = TestClient(phase4_ws_app)
        with client.websocket_connect("/ws/asr/test-uuid") as ws:
            pass

    def test_system_audio_disabled_rejects(self, monkeypatch):
        monkeypatch.setenv("DASHSCOPE_API_KEY", "sk-test-key")
        monkeypatch.setenv("SYSTEM_AUDIO_ENABLED", "false")
        from app.core.config import get_settings
        from app.routers.ws_asr import router
        get_settings.cache_clear()
        app = FastAPI()
        app.include_router(router)
        client = TestClient(app)
        with client.websocket_connect("/ws/asr/test-uuid?source=system-audio") as ws:
            data = ws.receive_json()
            assert "disabled" in data.get("error", "").lower()

    def test_mic_disabled_rejects(self, monkeypatch):
        monkeypatch.setenv("DASHSCOPE_API_KEY", "sk-test-key")
        monkeypatch.setenv("MIC_ENABLED", "false")
        from app.core.config import get_settings
        from app.routers.ws_asr import router
        get_settings.cache_clear()
        app = FastAPI()
        app.include_router(router)
        client = TestClient(app)
        with client.websocket_connect("/ws/asr/test-uuid?source=mic") as ws:
            data = ws.receive_json()
            assert "disabled" in data.get("error", "").lower()


def test_config_system_audio_defaults(monkeypatch, tmp_path):
    monkeypatch.delenv("SYSTEM_AUDIO_ENABLED", raising=False)
    monkeypatch.setenv("LLM_API_KEY", "sk-test")
    monkeypatch.setenv("DP_API_KEY", "sk-test")
    monkeypatch.setenv("EMBEDDING_API_KEY", "sk-test")
    env_file = tmp_path / ".env"
    env_file.write_text("")
    monkeypatch.chdir(tmp_path)
    from app.core.config import Settings, get_settings
    get_settings.cache_clear()
    settings = Settings(_env_file=())
    assert settings.system_audio_enabled is True


def test_config_mic_defaults(monkeypatch, tmp_path):
    monkeypatch.delenv("MIC_ENABLED", raising=False)
    monkeypatch.setenv("LLM_API_KEY", "sk-test")
    monkeypatch.setenv("DP_API_KEY", "sk-test")
    monkeypatch.setenv("EMBEDDING_API_KEY", "sk-test")
    env_file = tmp_path / ".env"
    env_file.write_text("")
    monkeypatch.chdir(tmp_path)
    from app.core.config import Settings, get_settings
    get_settings.cache_clear()
    settings = Settings(_env_file=())
    assert settings.mic_enabled is True


def test_config_system_audio_disabled(tmp_path, monkeypatch):
    env_file = tmp_path / ".env"
    env_file.write_text(
        "SYSTEM_AUDIO_ENABLED=false\n"
        "LLM_API_KEY=sk-test\n"
        "DP_API_KEY=sk-test\n"
        "EMBEDDING_API_KEY=sk-test\n"
    )
    monkeypatch.chdir(tmp_path)
    from app.core.config import Settings, get_settings
    get_settings.cache_clear()
    settings = Settings()
    assert settings.system_audio_enabled is False


def test_config_mic_disabled(tmp_path, monkeypatch):
    env_file = tmp_path / ".env"
    env_file.write_text(
        "MIC_ENABLED=false\n"
        "LLM_API_KEY=sk-test\n"
        "DP_API_KEY=sk-test\n"
        "EMBEDDING_API_KEY=sk-test\n"
    )
    monkeypatch.chdir(tmp_path)
    from app.core.config import Settings, get_settings
    get_settings.cache_clear()
    settings = Settings()
    assert settings.mic_enabled is False


def test_config_loads_both_toggles_from_env(tmp_path, monkeypatch):
    env_file = tmp_path / ".env"
    env_file.write_text(
        "SYSTEM_AUDIO_ENABLED=true\n"
        "MIC_ENABLED=true\n"
        "LLM_API_KEY=sk-test\n"
        "DP_API_KEY=sk-test\n"
        "EMBEDDING_API_KEY=sk-test\n"
    )
    monkeypatch.chdir(tmp_path)
    from app.core.config import Settings, get_settings
    get_settings.cache_clear()
    settings = Settings()
    assert settings.system_audio_enabled is True
    assert settings.mic_enabled is True

frontend/.pnpmrc

@ -0,0 +1,2 @@
onlyBuiltDependencies:
- esbuild


@ -2071,15 +2071,6 @@
"node": ">=6.9.0" "node": ">=6.9.0"
} }
}, },
"node_modules/@types/babel__generator": {
"dev": true
},
"node_modules/@types/babel__template": {
"dev": true
},
"node_modules/@types/babel__traverse": {
"dev": true
},
"node_modules/@types/chai": { "node_modules/@types/chai": {
"version": "4.3.20", "version": "4.3.20",
"resolved": "https://registry.npmjs.org/@types/chai/-/chai-4.3.20.tgz", "resolved": "https://registry.npmjs.org/@types/chai/-/chai-4.3.20.tgz",
@ -2130,9 +2121,6 @@
"@types/unist": "*" "@types/unist": "*"
} }
}, },
"node_modules/@types/jest": {
"dev": true
},
"node_modules/@types/mdast": { "node_modules/@types/mdast": {
"version": "4.0.4", "version": "4.0.4",
"resolved": "https://registry.npmjs.org/@types/mdast/-/mdast-4.0.4.tgz", "resolved": "https://registry.npmjs.org/@types/mdast/-/mdast-4.0.4.tgz",
@ -2158,7 +2146,6 @@
"undici-types": "~7.19.0" "undici-types": "~7.19.0"
} }
}, },
"node_modules/@types/prop-types": {},
"node_modules/@types/react": { "node_modules/@types/react": {
"version": "18.3.28", "version": "18.3.28",
"resolved": "https://registry.npmjs.org/@types/react/-/react-18.3.28.tgz", "resolved": "https://registry.npmjs.org/@types/react/-/react-18.3.28.tgz",


@ -34,5 +34,10 @@
"ts-node": "^10.9.1", "ts-node": "^10.9.1",
"typescript": "^5.1.6", "typescript": "^5.1.6",
"vitest": "^0.34.3" "vitest": "^0.34.3"
},
"pnpm": {
"onlyBuiltDependencies": [
"esbuild"
]
} }
} }


@ -0,0 +1,11 @@
allowBuilds:
'"': true
'[': true
']': true
b: true
d: true
e: true
i: true
l: true
s: true
u: true


@ -0,0 +1,80 @@
import React from 'react'
import { Mic, Loader2, AlertCircle, Circle } from 'lucide-react'
import type { MicStatus } from '../types'
export interface MicCaptureProps {
status: MicStatus
error: string | null
onStart: () => void
onStop: () => void
}
export const MicCapture: React.FC<MicCaptureProps> = ({
status,
error,
onStart,
onStop,
}) => {
if (status === 'error' && error) {
return (
<div className="h-full flex flex-col">
<div className="p-3 bg-red-50 border border-red-200 rounded-lg flex items-start gap-2">
<AlertCircle className="w-4 h-4 text-red-500 shrink-0 mt-0.5" />
<div className="flex-1">
<div className="text-sm text-red-700">{error}</div>
<button
onClick={onStart}
className="mt-2 text-xs text-red-600 hover:text-red-800 font-medium underline"
>
Try Again
</button>
</div>
</div>
</div>
)
}
if (status === 'requesting') {
return (
<div className="h-full flex flex-col items-center justify-center space-y-3">
<Loader2 className="w-8 h-8 text-blue-600 animate-spin" />
<div className="text-sm text-gray-600 font-medium">Waiting for microphone permission...</div>
</div>
)
}
if (status === 'listening' || status === 'stopping') {
return (
<div className="h-full flex flex-col items-center justify-center space-y-4">
<div className="flex items-center gap-2">
<Circle className="w-3 h-3 text-green-500 fill-green-500 animate-pulse" />
<span className="text-sm text-gray-600 font-medium">Listening...</span>
</div>
<div className="flex items-end gap-1 h-8">
<div className="w-2 bg-green-500 rounded-full animate-[bounce_1s_infinite]" style={{ height: '40%', animationDelay: '0ms' }} />
<div className="w-2 bg-green-500 rounded-full animate-[bounce_1s_infinite]" style={{ height: '70%', animationDelay: '150ms' }} />
<div className="w-2 bg-green-500 rounded-full animate-[bounce_1s_infinite]" style={{ height: '55%', animationDelay: '300ms' }} />
</div>
<button
onClick={onStop}
disabled={status === 'stopping'}
className="w-full px-4 py-2 bg-red-600 text-white font-medium rounded hover:bg-red-700 focus:outline-none focus:ring-2 focus:ring-red-500 focus:ring-offset-2 disabled:opacity-50 disabled:cursor-not-allowed disabled:hover:bg-red-600 transition-all duration-200"
>
{status === 'stopping' ? 'Stopping...' : 'Stop Listening'}
</button>
</div>
)
}
return (
<div className="h-full flex flex-col">
<button
onClick={onStart}
className="w-full px-4 py-2 bg-blue-600 text-white font-medium rounded hover:bg-blue-700 focus:outline-none focus:ring-2 focus:ring-blue-500 focus:ring-offset-2 transition-all duration-200 flex items-center justify-center gap-2"
>
<Mic className="w-4 h-4" />
Start Listening
</button>
</div>
)
}


@ -0,0 +1,42 @@
import React from 'react'
import { Upload, MonitorSpeaker, Mic } from 'lucide-react'
import type { SourceType } from '../types'
interface SourceSelectorProps {
activeSource: SourceType
onSelect: (source: SourceType) => void
}
export const SourceSelector: React.FC<SourceSelectorProps> = ({ activeSource, onSelect }) => {
const tabs: { id: SourceType; label: string; icon: React.ElementType }[] = [
{ id: 'upload', label: 'Upload', icon: Upload },
{ id: 'system-audio', label: 'System Audio', icon: MonitorSpeaker },
{ id: 'mic', label: 'Listen Mic', icon: Mic },
]
return (
<div className="flex gap-1 p-1 bg-gray-100 rounded-lg" role="tablist">
{tabs.map(tab => {
const isActive = activeSource === tab.id
const Icon = tab.icon
return (
<button
key={tab.id}
role="tab"
aria-selected={isActive}
onClick={() => onSelect(tab.id)}
className={[
'flex items-center gap-2 px-4 py-2 rounded-md text-sm font-medium transition-all duration-200',
isActive
? 'bg-white text-blue-700 shadow-sm'
: 'text-gray-500 hover:text-gray-700 hover:bg-gray-50',
].join(' ')}
>
<Icon className="w-4 h-4" />
{tab.label}
</button>
)
})}
</div>
)
}


@ -0,0 +1,86 @@
import React from 'react'
import { MonitorSpeaker, Loader2, AlertCircle, Circle } from 'lucide-react'
import type { SystemAudioStatus } from '../types'
export interface SystemAudioCaptureProps {
status: SystemAudioStatus
error: string | null
onStart: () => void
onStop: () => void
}
export const SystemAudioCapture: React.FC<SystemAudioCaptureProps> = ({
status,
error,
onStart,
onStop,
}) => {
if (status === 'error' && error) {
return (
<div className="h-full flex flex-col">
<div className="p-3 bg-red-50 border border-red-200 rounded-lg flex items-start gap-2">
<AlertCircle className="w-4 h-4 text-red-500 shrink-0 mt-0.5" />
<div className="flex-1">
<div className="text-sm text-red-700">{error}</div>
<button
onClick={onStart}
className="mt-2 text-xs text-red-600 hover:text-red-800 font-medium underline"
>
Try Again
</button>
</div>
</div>
</div>
)
}
if (status === 'requesting') {
return (
<div className="h-full flex flex-col items-center justify-center space-y-3">
<Loader2 className="w-8 h-8 text-blue-600 animate-spin" />
<div className="text-sm text-gray-600 font-medium">Waiting for permission...</div>
</div>
)
}
if (status === 'capturing' || status === 'stopping') {
return (
<div className="h-full flex flex-col items-center justify-center space-y-4">
<div className="flex items-center gap-2">
<Circle className="w-3 h-3 text-green-500 fill-green-500 animate-pulse" />
<span className="text-sm text-gray-600 font-medium">Capturing system audio...</span>
</div>
<div className="flex items-end gap-1 h-8">
<div className="w-2 bg-green-500 rounded-full animate-[bounce_1s_infinite]" style={{ height: '40%', animationDelay: '0ms' }} />
<div className="w-2 bg-green-500 rounded-full animate-[bounce_1s_infinite]" style={{ height: '70%', animationDelay: '150ms' }} />
<div className="w-2 bg-green-500 rounded-full animate-[bounce_1s_infinite]" style={{ height: '55%', animationDelay: '300ms' }} />
</div>
<button
onClick={onStop}
disabled={status === 'stopping'}
className="w-full px-4 py-2 bg-red-600 text-white font-medium rounded hover:bg-red-700 focus:outline-none focus:ring-2 focus:ring-red-500 focus:ring-offset-2 disabled:opacity-50 disabled:cursor-not-allowed disabled:hover:bg-red-600 transition-all duration-200"
>
{status === 'stopping' ? 'Stopping...' : 'Stop Capture'}
</button>
</div>
)
}
return (
<div className="h-full flex flex-col space-y-3">
<button
onClick={onStart}
className="w-full px-4 py-2 bg-blue-600 text-white font-medium rounded hover:bg-blue-700 focus:outline-none focus:ring-2 focus:ring-blue-500 focus:ring-offset-2 transition-all duration-200 flex items-center justify-center gap-2"
>
<MonitorSpeaker className="w-4 h-4" />
Start Capture
</button>
<div className="p-3 bg-amber-50 border border-amber-200 rounded-lg flex items-start gap-2">
<AlertCircle className="w-4 h-4 text-amber-600 shrink-0 mt-0.5" />
<div className="text-xs text-amber-700 leading-relaxed">
System audio capture works best in Chrome/Edge on Windows/macOS. Firefox and Safari do not support this feature. On Linux, only tab audio is available.
</div>
</div>
</div>
)
}


@ -0,0 +1,191 @@
import { useState, useRef, useCallback, useEffect } from 'react'
import type { ASRMessage } from '../types'
export interface UseMediaStreamASRProps {
wsUrl: string
}
export interface UseMediaStreamASRReturn {
status: 'idle' | 'requesting' | 'streaming' | 'stopping' | 'error'
transcript: string
partialTranscript: string
error: string | null
start: (stream: MediaStream) => void
stop: () => void
}
/**
 * Shared audio pipeline: AudioContext → ScriptProcessorNode → Float32 PCM → WebSocket.
* Wrapper hooks (system audio, mic) obtain the MediaStream, then call `start(stream)`.
* Follows the exact audio-processing and WebSocket message pattern from useVideoASR.ts.
*/
export function useMediaStreamASR({ wsUrl }: UseMediaStreamASRProps): UseMediaStreamASRReturn {
const [status, setStatus] = useState<'idle' | 'requesting' | 'streaming' | 'stopping' | 'error'>('idle')
const [transcript, setTranscript] = useState('')
const [partialTranscript, setPartialTranscript] = useState('')
const [error, setError] = useState<string | null>(null)
const wsRef = useRef<WebSocket | null>(null)
const audioContextRef = useRef<AudioContext | null>(null)
const processorRef = useRef<ScriptProcessorNode | null>(null)
const sourceRef = useRef<MediaStreamAudioSourceNode | null>(null)
const streamRef = useRef<MediaStream | null>(null)
const isStreamingRef = useRef(false)
const isManualCloseRef = useRef(false)
const transcriptRef = useRef('')
const lastStashRef = useRef('')
const cleanup = useCallback(() => {
isStreamingRef.current = false
// Stash handling — mirrors useVideoASR stopStreaming lines 101-111
let currentText = transcriptRef.current.trim()
const stash = lastStashRef.current.trim()
if (stash && !currentText.endsWith(stash)) {
currentText += stash
transcriptRef.current = currentText
}
lastStashRef.current = ''
if (currentText) {
setTranscript(currentText)
// Keep partialTranscript populated so the text remains visible in QueryInput
// after the user stops capture/listening. Unlike video ASR, mic/system-audio
// hooks have no onFinalTranscript callback to persist via queryText.
setPartialTranscript(currentText)
}
if (streamRef.current) {
streamRef.current.getTracks().forEach(t => {
t.onended = null
t.stop()
})
streamRef.current = null
}
processorRef.current?.disconnect()
sourceRef.current?.disconnect()
processorRef.current = null
sourceRef.current = null
if (wsRef.current) {
isManualCloseRef.current = true
wsRef.current.close()
wsRef.current = null
}
if (audioContextRef.current) {
audioContextRef.current.close()
audioContextRef.current = null
}
}, [])
const stop = useCallback(() => {
setStatus('stopping')
cleanup()
setStatus('idle')
}, [cleanup])
const start = useCallback((stream: MediaStream) => {
cleanup()
setError(null)
transcriptRef.current = ''
lastStashRef.current = ''
setTranscript('')
setPartialTranscript('')
streamRef.current = stream
isManualCloseRef.current = false
stream.getAudioTracks().forEach(track => {
track.onended = () => {
cleanup()
setStatus('idle')
}
})
try {
// AudioContext + ScriptProcessorNode — mirrors useVideoASR lines 117-136
const audioContext = new AudioContext({ sampleRate: 16000 })
audioContextRef.current = audioContext
const source = audioContext.createMediaStreamSource(stream)
sourceRef.current = source
const processor = audioContext.createScriptProcessor(4096, 1, 1)
processorRef.current = processor
// onaudioprocess — mirrors useVideoASR lines 126-132 exactly
processor.onaudioprocess = (e) => {
const float32Data = e.inputBuffer.getChannelData(0)
const outputData = e.outputBuffer.getChannelData(0)
outputData.set(float32Data)
if (!isStreamingRef.current) return
if (!wsRef.current || wsRef.current.readyState !== WebSocket.OPEN) return
wsRef.current.send(float32Data.buffer)
}
source.connect(processor)
processor.connect(audioContext.destination)
const ws = new WebSocket(wsUrl)
wsRef.current = ws
ws.onopen = () => {
isStreamingRef.current = true
setStatus('streaming')
}
// Message parsing — mirrors useVideoASR lines 51-64 exactly
ws.onmessage = (e) => {
const msg: ASRMessage = JSON.parse(e.data)
if (msg.is_final && msg.full_text) {
transcriptRef.current = msg.full_text
lastStashRef.current = ''
setTranscript(msg.full_text)
setPartialTranscript(msg.full_text)
} else if (msg.delta) {
transcriptRef.current += msg.delta
lastStashRef.current = (msg as any).stash || ''
setTranscript(transcriptRef.current)
setPartialTranscript(transcriptRef.current)
}
}
ws.onerror = () => {
console.error('[useMediaStreamASR] WebSocket error')
setError('WebSocket connection error')
setStatus('error')
isManualCloseRef.current = true
cleanup()
}
ws.onclose = () => {
isStreamingRef.current = false
if (isManualCloseRef.current) return
setError('ASR connection closed unexpectedly')
setStatus('error')
}
} catch (err) {
console.error('[useMediaStreamASR] start failed:', err)
setError(err instanceof Error ? err.message : 'Failed to start audio pipeline')
setStatus('error')
}
}, [wsUrl, cleanup])
useEffect(() => {
return () => {
if (streamRef.current) {
streamRef.current.getTracks().forEach(t => {
t.onended = null
t.stop()
})
}
processorRef.current?.disconnect()
sourceRef.current?.disconnect()
wsRef.current?.close()
audioContextRef.current?.close()
}
}, [])
return { status, transcript, partialTranscript, error, start, stop }
}
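
The transcript bookkeeping in `ws.onmessage` and `cleanup()` can be read as a small state machine that is independent of React. The sketch below restates it as two pure functions so the behavior is easier to follow and test. The names `AsrMessage`, `TranscriptState`, `applyAsrMessage`, and `finalizeTranscript` are hypothetical, and the message shape is an assumption inferred from the fields the hook reads (`is_final`, `full_text`, `delta`, `stash`); the real `ASRMessage` type in `../types` may differ.

```typescript
// Assumed message shape, inferred from the fields the hook reads.
interface AsrMessage {
  is_final?: boolean
  full_text?: string
  delta?: string
  stash?: string
}

interface TranscriptState {
  text: string  // committed transcript so far
  stash: string // latest unconfirmed tail reported by the server
}

// Applies one server message, following the two branches in ws.onmessage.
function applyAsrMessage(state: TranscriptState, msg: AsrMessage): TranscriptState {
  if (msg.is_final && msg.full_text) {
    // A final message replaces the transcript and clears the stash.
    return { text: msg.full_text, stash: '' }
  }
  if (msg.delta) {
    // A delta extends the transcript and records the current stash.
    return { text: state.text + msg.delta, stash: msg.stash || '' }
  }
  return state
}

// Folds a pending stash into the text on stop, following cleanup().
function finalizeTranscript(state: TranscriptState): string {
  const text = state.text.trim()
  const stash = state.stash.trim()
  if (stash !== '' && text.slice(-stash.length) !== stash) {
    return text + stash
  }
  return text
}
```

Keeping this logic outside the hook would let a unit test cover the stash handling without mocking `AudioContext` or `WebSocket`.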


@ -0,0 +1,85 @@
import { useState, useEffect } from 'react'
import type { MicStatus } from '../types'
import { useMediaStreamASR } from './useMediaStreamASR'
export function useMicASR({ wsUrl }: { wsUrl: string }) {
const pipeline = useMediaStreamASR({ wsUrl })
const [status, setStatus] = useState<MicStatus>('idle')
const [wrapperError, setWrapperError] = useState<string | null>(null)
useEffect(() => {
switch (pipeline.status) {
case 'streaming':
setStatus('listening')
setWrapperError(null)
break
case 'stopping':
setStatus('stopping')
break
case 'error':
setStatus('error')
setWrapperError(null)
break
case 'idle':
setStatus('idle')
break
}
}, [pipeline.status])
const startListening = async () => {
setWrapperError(null)
setStatus('requesting')
try {
const stream = await navigator.mediaDevices.getUserMedia({
audio: {
echoCancellation: false,
noiseSuppression: false,
autoGainControl: false,
},
video: false,
})
if (stream.getAudioTracks().length === 0) {
stream.getTracks().forEach(t => t.stop())
setStatus('error')
setWrapperError('No microphone input detected')
return
}
pipeline.start(stream)
} catch (err) {
console.error('[useMicASR] getUserMedia failed:', err)
if (err instanceof DOMException && err.name === 'NotAllowedError') {
setStatus('error') // MicCapture only renders the error message when status is 'error'
setWrapperError('Microphone access denied — please allow microphone access in your browser settings')
return
}
if (err instanceof DOMException && err.name === 'NotFoundError') {
setStatus('error')
setWrapperError('No microphone found. Please connect a microphone and try again.')
return
}
if (err instanceof DOMException && err.name === 'NotSupportedError') {
setStatus('error')
setWrapperError('Microphone access is not supported in this browser.')
return
}
setStatus('error')
setWrapperError(err instanceof Error ? err.message : 'Failed to start microphone capture')
}
}
const stopListening = () => {
pipeline.stop()
}
return {
status,
transcript: pipeline.transcript,
partialTranscript: pipeline.partialTranscript,
error: wrapperError ?? pipeline.error,
startListening,
stopListening,
}
}
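
Review note: the `catch` branch of `startListening` maps three `DOMException` names to a status and a message. A minimal sketch of that mapping as a pure function follows, which would make it testable without a browser. `classifyMicError` and `MicFailure` are hypothetical names that do not appear in this commit, and the sketch inspects only the `name` property rather than using `instanceof DOMException`, which is a simplification.

```typescript
// Hypothetical helper (not part of this commit): classifies a getUserMedia() rejection.
type MicFailure = { status: 'idle' | 'error'; message: string }

function classifyMicError(err: unknown): MicFailure {
  // Read `name` defensively so plain objects and non-Error values are handled too.
  const name =
    typeof err === 'object' && err !== null && 'name' in err
      ? String((err as { name: unknown }).name)
      : ''
  switch (name) {
    case 'NotAllowedError':
      // The user dismissed or denied the prompt: return to idle so they can retry.
      return {
        status: 'idle',
        message: 'Microphone access denied. Please allow microphone access in your browser settings.',
      }
    case 'NotFoundError':
      return {
        status: 'error',
        message: 'No microphone found. Please connect a microphone and try again.',
      }
    case 'NotSupportedError':
      return {
        status: 'error',
        message: 'Microphone access is not supported in this browser.',
      }
    default:
      return {
        status: 'error',
        message: err instanceof Error ? err.message : 'Failed to start microphone capture',
      }
  }
}
```

With a helper like this, the `catch` block could reduce to one call followed by `setStatus(result.status)` and `setWrapperError(result.message)`.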

@ -0,0 +1,91 @@
import { useState, useEffect } from 'react'
import type { SystemAudioStatus } from '../types'
import { useMediaStreamASR } from './useMediaStreamASR'

export function useSystemAudioASR({ wsUrl }: { wsUrl: string }) {
  const pipeline = useMediaStreamASR({ wsUrl })
  const [status, setStatus] = useState<SystemAudioStatus>('idle')
  const [wrapperError, setWrapperError] = useState<string | null>(null)

  useEffect(() => {
    switch (pipeline.status) {
      case 'streaming':
        setStatus('capturing')
        setWrapperError(null)
        break
      case 'stopping':
        setStatus('stopping')
        break
      case 'error':
        setStatus('error')
        setWrapperError(null)
        break
      case 'idle':
        setStatus('idle')
        break
    }
  }, [pipeline.status])

  const startCapture = async () => {
    setWrapperError(null)
    setStatus('requesting')
    try {
      // getDisplayMedia() SPEC: video:true is REQUIRED.
      // Setting video:false causes TypeError (Chrome) or NotSupportedError.
      // We request video but discard it right away; only audio is used.
      const stream = await navigator.mediaDevices.getDisplayMedia({
        video: true,
        audio: {
          echoCancellation: false,
          noiseSuppression: false,
          autoGainControl: false,
        },
        // systemAudio is a top-level getDisplayMedia() option, not an audio
        // track constraint; nested inside `audio` it would be ignored.
        systemAudio: 'include',
        // Cast kept because the DOM typings in use may not declare systemAudio.
      } as any)

      // Stop video tracks immediately; only the audio track is needed.
      stream.getVideoTracks().forEach((t) => t.stop())

      if (stream.getAudioTracks().length === 0) {
        stream.getTracks().forEach((t) => t.stop())
        setStatus('error')
        setWrapperError(
          'No audio track found. Make sure to enable "Share audio" in the sharing dialog and select a tab or window that is playing audio.',
        )
        return
      }
      pipeline.start(stream)
    } catch (err) {
      console.error('[useSystemAudioASR] getDisplayMedia failed:', err)
      if (err instanceof DOMException) {
        if (err.name === 'AbortError' || err.name === 'NotAllowedError') {
          setStatus('idle')
          setWrapperError('Permission denied — system audio capture requires your explicit permission')
          return
        }
        if (err.name === 'NotSupportedError') {
          setStatus('error')
          setWrapperError('System audio capture is not supported on this platform. Linux only supports tab audio — try Chrome/Edge on Windows or macOS for full system audio.')
          return
        }
      }
      setStatus('error')
      setWrapperError(err instanceof Error ? err.message : 'Failed to start system audio capture')
    }
  }

  const stopCapture = () => {
    pipeline.stop()
  }

  return {
    status,
    transcript: pipeline.transcript,
    partialTranscript: pipeline.partialTranscript,
    error: wrapperError ?? pipeline.error,
    startCapture,
    stopCapture,
  }
}
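
Review note: `useMicASR` and `useSystemAudioASR` carry the same `switch` over `pipeline.status` and differ only in the label used while streaming (`'listening'` versus `'capturing'`). The sketch below shows one way that mapping could be shared. `mapPipelineStatus` is a hypothetical name, and the set of pipeline statuses is assumed from the four cases visible in this diff; `useMediaStreamASR` may define others.

```typescript
// Hypothetical shared mapper (not part of this commit).
// Returns null for any pipeline status the wrappers do not handle,
// so the caller can leave its current status (e.g. 'requesting') untouched.
function mapPipelineStatus<Active extends string>(
  pipelineStatus: string,
  activeLabel: Active,
): Active | 'idle' | 'stopping' | 'error' | null {
  switch (pipelineStatus) {
    case 'streaming':
      return activeLabel
    case 'stopping':
      return 'stopping'
    case 'error':
      return 'error'
    case 'idle':
      return 'idle'
    default:
      return null
  }
}

// Usage: the mic wrapper passes 'listening', the system audio wrapper 'capturing'.
const micStatus = mapPipelineStatus('streaming', 'listening')
const sysStatus = mapPipelineStatus('streaming', 'capturing')
```

Inside each hook's effect, a non-null result would be passed to `setStatus`, with `setWrapperError(null)` still applied for the `'streaming'` and `'error'` cases as in the current code.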

@ -1,8 +1,10 @@
-import React, { useState, useCallback, useEffect } from 'react'
+import React, { useState, useCallback, useEffect, useMemo } from 'react'
 import { Loader2, AlertCircle, FileText } from 'lucide-react'
 import { Group, Panel, Separator } from 'react-resizable-panels'
 import { useQueryDocumentStream } from '../lib/queries'
 import { useVideoASR } from '../hooks/useVideoASR'
+import { useSystemAudioASR } from '../hooks/useSystemAudioASR'
+import { useMicASR } from '../hooks/useMicASR'
 import { useFullTranscript } from '../hooks/useFullTranscript'
 import { getVideoUrl } from '../lib/api'
 import { QueryInput } from '../components/QueryInput'
@ -10,15 +12,20 @@ import { ExtractedQuestionsDisplay } from '../components/ExtractedQuestionsDispl
 import { ResponsePanel } from '../components/ResponsePanel'
 import { VideoUpload } from '../components/VideoUpload'
 import { VideoPlayer } from '../components/VideoPlayer'
+import { SourceSelector } from '../components/SourceSelector'
+import { SystemAudioCapture } from '../components/SystemAudioCapture'
+import { MicCapture } from '../components/MicCapture'
+import type { SourceType } from '../types'
 export const LTTPage: React.FC = () => {
+  const [source, setSource] = useState<SourceType>('upload')
   const [currentVideoId, setCurrentVideoId] = useState<string | null>(null)
   const [queryText, setQueryText] = useState('')
   const [videoEl, setVideoEl] = useState<HTMLVideoElement | null>(null)
   const queryStream = useQueryDocumentStream()
-  const asr = useVideoASR({
+  const uploadASR = useVideoASR({
     videoId: currentVideoId ?? '',
     videoElement: videoEl,
     language: 'yue',
@ -29,6 +36,24 @@ export const LTTPage: React.FC = () => {
   const ft = useFullTranscript({ videoId: currentVideoId ?? '' })
+  const systemAudioWsUrl = useMemo(() => {
+    const protocol = window.location.protocol === 'https:' ? 'wss:' : 'ws:'
+    const host = import.meta.env.VITE_WS_HOST ?? window.location.host
+    return `${protocol}//${host}/ws/asr/${crypto.randomUUID()}?language=yue&source=system-audio`
+  }, [])
+  const micWsUrl = useMemo(() => {
+    const protocol = window.location.protocol === 'https:' ? 'wss:' : 'ws:'
+    const host = import.meta.env.VITE_WS_HOST ?? window.location.host
+    return `${protocol}//${host}/ws/asr/${crypto.randomUUID()}?language=yue&source=mic`
+  }, [])
+  const systemAudioASR = useSystemAudioASR({ wsUrl: systemAudioWsUrl })
+  const micASR = useMicASR({ wsUrl: micWsUrl })
+  const asr = source === 'system-audio' ? systemAudioASR
+    : source === 'mic' ? micASR
+    : uploadASR
   useEffect(() => {
     if (ft.fullTranscript) {
       setQueryText(ft.fullTranscript)
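
Review note: the two `useMemo` blocks in this hunk differ only in the `source` query parameter. A minimal sketch of a shared builder is shown below. `buildAsrWsUrl` and its option names are hypothetical and not part of this commit; the `/ws/asr/{id}` path and the `language` and `source` parameters are taken from the hunk above.

```typescript
// Hypothetical helper (not part of this commit): builds the ASR WebSocket URL.
interface AsrWsUrlOptions {
  pageProtocol: string // e.g. window.location.protocol
  host: string // e.g. import.meta.env.VITE_WS_HOST ?? window.location.host
  sessionId: string // e.g. crypto.randomUUID()
  language: string
  source: 'system-audio' | 'mic'
}

function buildAsrWsUrl(opts: AsrWsUrlOptions): string {
  // Match the page's security level: https pages must use wss.
  const wsProtocol = opts.pageProtocol === 'https:' ? 'wss:' : 'ws:'
  const query =
    `language=${encodeURIComponent(opts.language)}` +
    `&source=${encodeURIComponent(opts.source)}`
  return `${wsProtocol}//${opts.host}/ws/asr/${opts.sessionId}?${query}`
}

// Usage with fixed inputs so the result is easy to inspect.
const exampleUrl = buildAsrWsUrl({
  pageProtocol: 'https:',
  host: 'example.test',
  sessionId: 'abc',
  language: 'yue',
  source: 'mic',
})
```

Each `useMemo` callback could then pass `window.location.protocol`, the resolved host, and `crypto.randomUUID()` into this function, keeping the browser globals at the call site and the string assembly testable.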
@ -58,6 +83,9 @@ export const LTTPage: React.FC = () => {
   return (
     <div className="h-full bg-gray-50">
+      <div className="px-4 pt-3">
+        <SourceSelector activeSource={source} onSelect={setSource} />
+      </div>
       <Group
         orientation="vertical"
         id="ltt-main-group"
@ -69,7 +97,8 @@ export const LTTPage: React.FC = () => {
           <Group orientation="horizontal" id="ltt-upper-group" className="h-full">
             <Panel id="ltt-upper-left" minSize="30%" defaultSize={50}>
               <div className="h-full p-4 overflow-hidden flex flex-col gap-3">
-                {currentVideoId ? (
+                {source === 'upload' ? (
+                  currentVideoId ? (
                   <>
                     <VideoPlayer ref={setVideoEl} src={videoUrl} />
                     <button
@ -93,7 +122,7 @@ export const LTTPage: React.FC = () => {
                         <span>{ft.error}</span>
                       </div>
                     )}
-                    {asr.status === 'error' && (
+                    {uploadASR.status === 'error' && (
                       <div
                         data-testid="asr-error-indicator"
                         className="flex items-center gap-2 text-xs text-red-600 bg-red-50 border border-red-200 rounded px-2 py-1"
@ -105,6 +134,21 @@ export const LTTPage: React.FC = () => {
                   </>
                 ) : (
                   <VideoUpload onUploadSuccess={handleUploadSuccess} />
+                  )
+                ) : source === 'system-audio' ? (
+                  <SystemAudioCapture
+                    status={systemAudioASR.status}
+                    error={systemAudioASR.error}
+                    onStart={systemAudioASR.startCapture}
+                    onStop={systemAudioASR.stopCapture}
+                  />
+                ) : (
+                  <MicCapture
+                    status={micASR.status}
+                    error={micASR.error}
+                    onStart={micASR.startListening}
+                    onStop={micASR.stopListening}
+                  />
                 )}
               </div>
             </Panel>

@ -196,3 +196,29 @@ export interface VideoUploadResponse {
   size_bytes: number
   url: string
 }
+// Phase 4 — Live audio capture types
+export type SourceType = 'upload' | 'system-audio' | 'mic'
+export type SystemAudioStatus = 'idle' | 'requesting' | 'capturing' | 'stopping' | 'error'
+export type MicStatus = 'idle' | 'requesting' | 'listening' | 'stopping' | 'error'
+export interface SystemAudioASRState {
+  status: SystemAudioStatus
+  transcript: string
+  partialTranscript: string
+  error: string | null
+  startCapture: () => Promise<void>
+  stopCapture: () => void
+}
+export interface MicASRState {
+  status: MicStatus
+  transcript: string
+  partialTranscript: string
+  error: string | null
+  startListening: () => Promise<void>
+  stopListening: () => void
+}