feat: Phase 4 — System Audio & Listen Mic capture into ASR → RAG
Adds two new live audio sources alongside file Upload:

- System Audio: getDisplayMedia() captures system/tab audio output, pipes through WebSocket → DashScope realtime ASR → RAG.
- Listen Mic: getUserMedia() captures microphone input via the same audio pipeline (shared useMediaStreamASR hook).

Backend: feature toggles (system_audio_enabled, mic_enabled) in config.py, source query param gating in ws_asr.py, 10 config tests.

Bug fix: getDisplayMedia() rejected video:false per W3C spec; changed to video:true, then stop the video tracks, to allow audio-only capture on Windows/macOS Chrome.
parent 64a7a8a46b
commit 7bff4308b7
|
|
@@ -1,7 +1,7 @@
|
||||||
# Phase 4: System Audio Capture → ASR → RAG — Implementation Plan
|
# Phase 4: System Audio & Mic Capture → ASR → RAG — Implementation Plan
|
||||||
|
|
||||||
**Created:** 2026-05-09
|
**Created:** 2026-05-09
|
||||||
**Updated:** 2026-05-09
|
**Updated:** 2026-05-14
|
||||||
**Status:** 📋 Draft (Not Started)
|
**Status:** 📋 Draft (Not Started)
|
||||||
**Depends on:** Phase 1 (Complete), Phase 2 (Complete), Phase 3 (Complete)
|
**Depends on:** Phase 1 (Complete), Phase 2 (Complete), Phase 3 (Complete)
|
||||||
|
|
||||||
|
|
@@ -9,24 +9,40 @@
|
||||||
|
|
||||||
## 1. Overview
|
## 1. Overview
|
||||||
|
|
||||||
Phase 4 adds **system audio capture** as a third audio source in the LTTPage, alongside file Upload and YouTube. Instead of playing a video in the browser, the user captures audio output from any application on their computer (browser tab, Spotify, Zoom, system sounds) and pipes it through the existing ASR → RAG pipeline.
|
Phase 4 adds two new live audio sources in the LTTPage, alongside file Upload:
|
||||||
|
|
||||||
**Use cases:**
|
1. **System Audio Capture** — captures audio output from any application on the user's computer (browser tab, Spotify, Zoom, system sounds) via `getDisplayMedia()`.
|
||||||
|
2. **Listen Mic** — captures microphone input (user's voice, room audio) via `getUserMedia({ audio: true })`.
|
||||||
|
|
||||||
|
Both pipe audio through the existing WebSocket → DashScope realtime ASR → RAG pipeline.
|
||||||
|
|
||||||
|
### System Audio — Use Cases
|
||||||
- Watching a YouTube video in a regular browser tab (no proxy needed — just share that tab's audio)
|
- Watching a YouTube video in a regular browser tab (no proxy needed — just share that tab's audio)
|
||||||
- Listening to a podcast, lecture, or meeting and getting real-time transcript + RAG
|
- Listening to a podcast, lecture, or meeting and getting real-time transcript + RAG
|
||||||
- Transcribing any audio playing on the computer without needing to download files
|
- Transcribing any audio playing on the computer without needing to download files
|
||||||
|
|
||||||
### How It Works
|
### Listen Mic — Use Cases
|
||||||
|
- Recording a live meeting or lecture through the computer's microphone
|
||||||
|
- Dictating questions or notes verbally and getting RAG answers
|
||||||
|
- Transcribing spoken Cantonese in real time without a video source
|
||||||
|
|
||||||
|
### How They Work
|
||||||
|
|
||||||
```
|
```
|
||||||
User clicks "System Audio" → clicks "Start Capture"
|
[System Audio]
|
||||||
|
User clicks "System Audio" → "Start Capture"
|
||||||
→ Browser shows permission dialog (screen/tab picker)
|
→ Browser shows permission dialog (screen/tab picker)
|
||||||
→ User selects tab/window/screen (with audio)
|
→ User selects tab/window/screen (with audio)
|
||||||
→ getDisplayMedia() returns MediaStream (with audio track)
|
→ getDisplayMedia() returns MediaStream (with audio track)
|
||||||
→ AudioContext.createMediaStreamSource(stream)
|
→ AudioContext.createMediaStreamSource(stream)
|
||||||
→ ScriptProcessorNode (Float32 PCM, mono 16kHz)
|
→ ScriptProcessorNode → WebSocket → DashScope ASR → Transcript → RAG
|
||||||
→ WebSocket → FastAPI → DashScope realtime ASR
|
|
||||||
→ transcript → QueryInput → RAG Pipeline
|
[Listen Mic]
|
||||||
|
User clicks "Listen Mic" → "Start Listening"
|
||||||
|
→ Browser shows microphone permission prompt
|
||||||
|
→ getUserMedia({ audio: true }) returns MediaStream
|
||||||
|
→ AudioContext.createMediaStreamSource(stream)
|
||||||
|
→ ScriptProcessorNode → WebSocket → DashScope ASR → Transcript → RAG
|
||||||
```
|
```
|
||||||
|
|
||||||
### Audio Routing (vs Existing Sources)
|
### Audio Routing (vs Existing Sources)
|
||||||
|
|
@@ -34,59 +50,85 @@ User clicks "System Audio" → clicks "Start Capture"
|
||||||
| Source | Audio Input | SourceNode Type | Start/Stop Trigger |
|
| Source | Audio Input | SourceNode Type | Start/Stop Trigger |
|
||||||
|--------|-------------|-----------------|-------------------|
|
|--------|-------------|-----------------|-------------------|
|
||||||
| Upload | `<video>` element | `createMediaElementSource` | play/pause events |
|
| Upload | `<video>` element | `createMediaElementSource` | play/pause events |
|
||||||
| YouTube | `<audio>` element | `createMediaElementSource` | play/pause events on `<video>` |
|
|
||||||
| **System Audio** | MediaStream from `getDisplayMedia()` | `createMediaStreamSource` | Manual Start/Stop button + track ended event |
|
| **System Audio** | MediaStream from `getDisplayMedia()` | `createMediaStreamSource` | Manual Start/Stop button + track ended event |
|
||||||
|
| **Listen Mic** | MediaStream from `getUserMedia({ audio: true })` | `createMediaStreamSource` | Manual Start/Stop button + track ended event |
|
||||||
|
|
||||||
### Why New Hook (Not Reuse Existing)
|
### Why New Hooks (Not Reuse Existing)
|
||||||
|
|
||||||
The existing `useVideoASR` and `useYouTubeASR` hooks depend on HTML media elements (`<video>`, `<audio>`) for both the audio source and play/pause lifecycle. System audio capture uses a **MediaStream** object (no DOM element), and its lifecycle is controlled by user permission (grant/revoke) and manual start/stop, not DOM events. A new hook is architecturally cleaner than overloading the existing ones with branching logic.
|
The existing `useVideoASR` hook depends on an HTML media element (`<video>`) for both the audio source and the play/pause lifecycle. Both new sources use **MediaStream** objects (no DOM element), and their lifecycle is controlled by user permission (grant/revoke) and manual start/stop, not DOM events.
|
||||||
|
|
||||||
|
**System Audio** and **Listen Mic** share the same audio processing pipeline (`MediaStream → AudioContext → ScriptProcessorNode → WebSocket`) but differ in their capture API. A shared internal audio processing utility (`useMediaStreamASR` or similar) should be extracted to avoid code duplication between the two hooks.
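
The split between a shared pipeline and a source-specific capture call could be sketched as a small controller that receives the capture function as a parameter. This is a minimal illustration under stated assumptions: `createCaptureController`, `StreamLike`, `TrackLike`, and the `startPipeline` callback are hypothetical names and do not describe the project's actual `useMediaStreamASR` API.

```typescript
// Hypothetical sketch: every name here is an assumption, not the project's API.
type CaptureStatus = 'idle' | 'requesting' | 'capturing' | 'stopping' | 'error'

// Minimal structural stand-ins for the browser's MediaStream and its tracks.
interface TrackLike { stop(): void }
interface StreamLike { getTracks(): TrackLike[] }

interface CaptureController {
  getStatus(): CaptureStatus
  getError(): string | null
  start(): Promise<void>
  stop(): void
}

function createCaptureController(
  // Source-specific part: wraps getDisplayMedia() or getUserMedia().
  acquire: () => Promise<StreamLike>,
  // Shared part: wires the stream to the audio pipeline, returns a teardown.
  startPipeline: (stream: StreamLike) => () => void,
): CaptureController {
  let status: CaptureStatus = 'idle'
  let error: string | null = null
  let stream: StreamLike | null = null
  let teardown: (() => void) | null = null

  return {
    getStatus: () => status,
    getError: () => error,
    async start() {
      if (status === 'requesting' || status === 'capturing') return
      status = 'requesting'
      error = null
      let acquired: StreamLike | null = null
      try {
        acquired = await acquire()
        teardown = startPipeline(acquired)
        stream = acquired
        status = 'capturing'
      } catch (e) {
        // Release the stream if capture succeeded but the pipeline failed.
        if (acquired !== null) acquired.getTracks().forEach((t) => t.stop())
        error = e instanceof Error ? e.message : String(e)
        status = 'error'
      }
    },
    stop() {
      const active = stream
      if (active === null) return
      status = 'stopping'
      if (teardown !== null) teardown()
      active.getTracks().forEach((t) => t.stop())
      stream = null
      teardown = null
      status = 'idle'
    },
  }
}
```

With this shape, the two hooks would differ only in the `acquire` argument they pass, which is the property the paragraph above asks for.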
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## 2. User Flow
|
## 2. User Flow
|
||||||
|
|
||||||
1. User selects **"System Audio"** tab (third option alongside Upload / YouTube)
|
### 2.1 System Audio
|
||||||
|
|
||||||
|
1. User selects **"System Audio"** tab (second option alongside Upload / Listen Mic)
|
||||||
2. UI shows a **"Start Capture"** button with browser compatibility info
|
2. UI shows a **"Start Capture"** button with browser compatibility info
|
||||||
3. User clicks **"Start Capture"**
|
3. User clicks **"Start Capture"**
|
||||||
4. Browser opens **permission dialog** (screen/tab picker)
|
4. Browser opens **permission dialog** (screen/tab picker)
|
||||||
- User selects a browser tab (e.g., "YouTube — Live Stream") or "Entire Screen"
|
- User selects a browser tab (e.g., "YouTube — Live Stream") or "Entire Screen"
|
||||||
- User checks "Share audio" if available
|
- User checks "Share audio" if available
|
||||||
5. On approval: capture starts — status indicator shows "Capturing" with a live audio level meter
|
5. On approval: capture starts — status indicator shows "Capturing" with a live audio level meter
|
||||||
6. Real-time ASR transcription flows into **QueryInput** (same as Upload/YouTube)
|
6. Real-time ASR transcription flows into **QueryInput** (same as Upload)
|
||||||
7. User can **edit transcript while capturing** continues
|
7. User can **edit transcript while capturing** continues
|
||||||
8. User clicks **"Stop Capture"** to end — transcript stays in QueryInput
|
8. User clicks **"Stop Capture"** to end — transcript stays in QueryInput
|
||||||
9. User submits query → RAG pipeline processes it
|
9. User submits query → RAG pipeline processes it
|
||||||
10. **"Full Transcript" button hidden** (streaming ASR only, same as YouTube)
|
10. **"Full Transcript" button hidden** (streaming ASR only — no batch transcription for live sources)
|
||||||
|
|
||||||
### Permission Denied Flow
|
#### Permission Denied Flow
|
||||||
|
- User clicks "Cancel" in permission dialog → error: "Permission denied — system audio capture requires your explicit permission"
|
||||||
|
- User revokes permission (Chrome "Stop sharing") → capture stops gracefully, status: "Capture stopped"
|
||||||
|
- No audio track in the stream → error: "No audio track found in the shared content"
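
A hedged sketch of the capture step and the no-audio-track check follows. It assumes the approach described in the commit message (request video together with audio, then stop the video tracks). `captureSystemAudio` and the `*Like` interfaces are hypothetical names introduced for illustration; they only mirror the parts of the browser API that the sketch touches.

```typescript
// Hypothetical sketch: names are illustrative, not taken from the project.
interface DisplayTrackLike { kind: string; stop(): void }
interface DisplayStreamLike {
  getVideoTracks(): DisplayTrackLike[]
  getAudioTracks(): DisplayTrackLike[]
  getTracks(): DisplayTrackLike[]
}
interface DisplayMediaLike {
  getDisplayMedia(constraints: { video: boolean; audio: boolean }): Promise<DisplayStreamLike>
}

async function captureSystemAudio(mediaDevices: DisplayMediaLike): Promise<DisplayStreamLike> {
  // Request video as well, because an audio-only request may be rejected.
  const stream = await mediaDevices.getDisplayMedia({ video: true, audio: true })

  // Only the audio is needed, so the video tracks are stopped right away.
  stream.getVideoTracks().forEach((track) => track.stop())

  // The user may have shared a source without ticking "Share audio".
  if (stream.getAudioTracks().length === 0) {
    stream.getTracks().forEach((track) => track.stop())
    throw new Error('No audio track found in the shared content')
  }
  return stream
}
```

In a browser the function would presumably be called as `captureSystemAudio(navigator.mediaDevices)`; passing the dependency in keeps the logic testable with a fake.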
|
||||||
|
|
||||||
1. If user clicks "Cancel" in permission dialog → error state: "Permission denied — system audio capture requires your explicit permission"
|
### 2.2 Listen Mic
|
||||||
2. If user revokes permission (Chrome "Stop sharing") → capture stops gracefully, status: "Capture stopped"
|
|
||||||
3. If no audio track in the stream → error: "No audio track found in the shared content"
|
1. User selects **"Listen Mic"** tab (third option)
|
||||||
|
2. UI shows a **"Start Listening"** button (no browser compatibility warning — widely supported)
|
||||||
|
3. User clicks **"Start Listening"**
|
||||||
|
4. Browser shows **microphone permission prompt** (first time only)
|
||||||
|
5. On approval: listening starts — status indicator shows "Listening" with a live audio level meter
|
||||||
|
6. Real-time ASR transcription flows into **QueryInput**
|
||||||
|
7. User can **edit transcript while listening** continues
|
||||||
|
8. User clicks **"Stop Listening"** to end — transcript stays in QueryInput
|
||||||
|
9. User submits query → RAG pipeline processes it
|
||||||
|
10. **"Full Transcript" button hidden** (streaming ASR only)
|
||||||
|
|
||||||
|
#### Permission Denied Flow
|
||||||
|
- User clicks "Block" in mic permission prompt → error: "Microphone access denied — please allow microphone access in your browser settings"
|
||||||
|
- User revokes permission via browser UI → listening stops, status: "Microphone disconnected"
|
||||||
|
- No audio track → error: "No microphone input detected"
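
The mapping from a rejected `getUserMedia()` call to a user-facing message might look like the sketch below. `describeMicError` is a hypothetical helper, and the message strings are shortened forms of the ones listed above. The `NotAllowedError` and `NotFoundError` names are the standard `DOMException` names for a denied permission and a missing input device.

```typescript
// Hypothetical helper: maps a getUserMedia() rejection to a display message.
function describeMicError(err: unknown): string {
  // DOMException and Error both expose a string `name` property.
  const name =
    typeof err === 'object' && err !== null && 'name' in err
      ? String((err as { name: unknown }).name)
      : ''

  switch (name) {
    case 'NotAllowedError': // user clicked "Block" or permission was revoked
      return 'Microphone access denied'
    case 'NotFoundError': // no microphone available
      return 'No microphone input detected'
    default:
      return err instanceof Error ? err.message : 'Unknown microphone error'
  }
}
```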
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## 3. Architecture
|
## 3. Architecture
|
||||||
|
|
||||||
### 3.1 Component Tree (LTTPage — System Audio Mode)
|
### 3.1 Component Tree (LTTPage — All Sources)
|
||||||
|
|
||||||
```
|
```
|
||||||
LTTPage
|
LTTPage
|
||||||
├── SourceSelector (tabs: Upload | YouTube | System Audio)
|
├── SourceSelector (tabs: Upload | System Audio | Listen Mic)
|
||||||
├── [source === 'system-audio']
|
├── [source === 'system-audio']
|
||||||
│ ├── SystemAudioCapture
|
│ └── SystemAudioCapture
|
||||||
│ │ ├── Start/Stop button
|
│ ├── Start/Stop button
|
||||||
│ │ ├── Status indicator (idle | requesting | capturing | error)
|
│ ├── Status indicator (idle | requesting | capturing | error)
|
||||||
│ │ ├── Audio level meter (optional, nice-to-have)
|
│ ├── Audio level meter (optional, nice-to-have)
|
||||||
│ │ └── Browser compatibility note (non-Chrome users)
|
│ └── Browser compatibility note (non-Chrome users)
|
||||||
│ └── (no video player — audio-only capture)
|
├── [source === 'mic']
|
||||||
├── QueryInput (receives transcript from useSystemAudioASR)
|
│ └── MicCapture
|
||||||
|
│ ├── Start/Stop button
|
||||||
|
│ ├── Status indicator (idle | requesting | listening | error)
|
||||||
|
│ └── Audio level meter (optional, nice-to-have)
|
||||||
|
├── QueryInput (receives transcript from active ASR hook)
|
||||||
├── ExtractedQuestionsDisplay
|
├── ExtractedQuestionsDisplay
|
||||||
└── RAG Response Panel
|
└── RAG Response Panel
|
||||||
```
|
```
|
||||||
|
|
||||||
### 3.2 Data Flow
|
### 3.2 Data Flow
|
||||||
|
|
||||||
|
#### System Audio
|
||||||
```
|
```
|
||||||
SystemAudioCapture (UI)
|
SystemAudioCapture (UI)
|
||||||
│
|
│
|
||||||
|
|
@@ -99,32 +141,51 @@ useSystemAudioASR hook
|
||||||
│ └── User picks tab/window → returns MediaStream
|
│ └── User picks tab/window → returns MediaStream
|
||||||
│
|
│
|
||||||
├── AudioContext.createMediaStreamSource(stream)
|
├── AudioContext.createMediaStreamSource(stream)
|
||||||
│ └── MediaStreamAudioSourceNode
|
|
||||||
│
|
│
|
||||||
├── ScriptProcessorNode (4096 buffer, mono 16kHz)
|
├── ScriptProcessorNode (4096 buffer, mono 16kHz)
|
||||||
│ └── onaudioprocess: convert Float32 → Int16 PCM
|
|
||||||
│
|
│
|
||||||
├── WebSocket → ws://host/ws/asr/{uuid}?language=yue
|
├── WebSocket → ws://host/ws/asr/{uuid}?language=yue
|
||||||
│ └── Sends binary PCM frames
|
|
||||||
│
|
│
|
||||||
└── Returns: { status, transcript, partialTranscript, startCapture, stopCapture }
|
└── Returns: { status, transcript, partialTranscript, startCapture, stopCapture }
|
||||||
|
```
|
||||||
|
|
||||||
|
#### Listen Mic
|
||||||
|
```
|
||||||
|
MicCapture (UI)
|
||||||
|
│
|
||||||
|
├── "Start Listening" click → calls startListening() from hook
|
||||||
│
|
│
|
||||||
▼
|
▼
|
||||||
LTTPage unifies: const asr = source === 'system-audio' ? systemAudioASR : ...
|
useMicASR hook
|
||||||
│
|
│
|
||||||
▼
|
├── getUserMedia({ audio: true })
|
||||||
QueryInput receives asr.partialTranscript
|
│ └── Browser shows mic permission prompt → returns MediaStream
|
||||||
|
│
|
||||||
|
├── AudioContext.createMediaStreamSource(stream)
|
||||||
|
│
|
||||||
|
├── ScriptProcessorNode (4096 buffer, mono 16kHz)
|
||||||
|
│
|
||||||
|
├── WebSocket → ws://host/ws/asr/{uuid}?language=yue
|
||||||
|
│
|
||||||
|
└── Returns: { status, transcript, partialTranscript, startListening, stopListening }
|
||||||
|
```
|
||||||
|
|
||||||
|
#### LTTPage Unification
|
||||||
|
```typescript
|
||||||
|
const asr = source === 'system-audio' ? systemAudioASR
|
||||||
|
: source === 'mic' ? micASR
|
||||||
|
: uploadASR
|
||||||
```
|
```
|
||||||
|
|
||||||
### 3.3 Backend Changes
|
### 3.3 Backend Changes
|
||||||
|
|
||||||
**Minimal.** The existing WebSocket ASR endpoint (`ws_asr.py`) already accepts audio from any source. The only addition is handling a **UUID-based `video_id`** for system audio sessions (no real video file).
|
**Minimal.** The existing WebSocket ASR endpoint (`ws_asr.py`) already accepts audio from any source. The only additions are UUID-based `video_id` handling and feature toggles.
|
||||||
|
|
||||||
| Change | File | Description |
|
| Change | File | Description |
|
||||||
|--------|------|-------------|
|
|--------|------|-------------|
|
||||||
| Allow UUID video_id | `backend/app/routers/ws_asr.py` | Accept non-file-based video IDs (already accepts any string) |
|
| Allow UUID video_id | `backend/app/routers/ws_asr.py` | Accept non-file-based video IDs (already accepts any string) |
|
||||||
| Transcript persistence | `backend/app/services/history_service.py` | Store system audio transcripts with UUID session ID (optional — nice-to-have) |
|
| Transcript persistence | `backend/app/services/history_service.py` | Store system audio & mic transcripts with UUID session ID (optional — nice-to-have) |
|
||||||
| Config | `backend/app/core/config.py` | Add `SYSTEM_AUDIO_ENABLED` toggle (default: true) |
|
| Config | `backend/app/core/config.py` | Add `SYSTEM_AUDIO_ENABLED` and `MIC_ENABLED` toggles (default: true) |
|
||||||
|
|
||||||
**No changes needed to:**
|
**No changes needed to:**
|
||||||
- DashScope ASR client (receives PCM, doesn't care about source)
|
- DashScope ASR client (receives PCM, doesn't care about source)
|
||||||
|
|
@@ -135,11 +196,13 @@ QueryInput receives asr.partialTranscript
|
||||||
|
|
||||||
| File | Status | Description |
|
| File | Status | Description |
|
||||||
|------|--------|-------------|
|
|------|--------|-------------|
|
||||||
|
| `frontend/src/components/SourceSelector.tsx` | **New** | Reusable tab bar component (Upload \| System Audio \| Listen Mic) |
|
||||||
| `frontend/src/hooks/useSystemAudioASR.ts` | **New** | Hook: getDisplayMedia → AudioContext → WebSocket |
|
| `frontend/src/hooks/useSystemAudioASR.ts` | **New** | Hook: getDisplayMedia → AudioContext → WebSocket |
|
||||||
| `frontend/src/components/SystemAudioCapture.tsx` | **New** | UI: Start/Stop button, status, compatibility note |
|
| `frontend/src/hooks/useMicASR.ts` | **New** | Hook: getUserMedia → AudioContext → WebSocket |
|
||||||
| `frontend/src/pages/LTTPage.tsx` | **Modified** | Add "System Audio" tab, wire hook, unify ASR |
|
| `frontend/src/components/SystemAudioCapture.tsx` | **New** | UI: Start/Stop, status, compatibility note |
|
||||||
| `frontend/src/types/index.ts` | **Modified** | Add SystemAudioStatus type |
|
| `frontend/src/components/MicCapture.tsx` | **New** | UI: Start/Stop, status |
|
||||||
| `frontend/src/components/SourceSelector.tsx` | **Refactor** | Extract source tabs into reusable component (optional — can inline in LTTPage) |
|
| `frontend/src/pages/LTTPage.tsx` | **Modified** | Add source selector, wire hooks, unify ASR, conditional rendering |
|
||||||
|
| `frontend/src/types/index.ts` | **Modified** | Add SourceType, SystemAudioStatus, MicStatus types |
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
|
|
@@ -150,25 +213,31 @@ QueryInput receives asr.partialTranscript
|
||||||
| 4.1 | Config & Infrastructure | 0.5 day | — | 📋 Draft |
|
| 4.1 | Config & Infrastructure | 0.5 day | — | 📋 Draft |
|
||||||
| 4.2 | System Audio Capture Hook (`useSystemAudioASR`) | 1 day | 4.1 | 📋 Draft |
|
| 4.2 | System Audio Capture Hook (`useSystemAudioASR`) | 1 day | 4.1 | 📋 Draft |
|
||||||
| 4.3 | SystemAudioCapture UI Component | 0.5 day | 4.2 | 📋 Draft |
|
| 4.3 | SystemAudioCapture UI Component | 0.5 day | 4.2 | 📋 Draft |
|
||||||
| 4.4 | LTTPage Integration | 0.5 day | 4.2, 4.3 | 📋 Draft |
|
| 4.4 | Mic Capture Hook (`useMicASR`) | 0.5 day | 4.1 | 📋 Draft |
|
||||||
| 4.5 | Backend Adjustments | 0.5 day | 4.1 | 📋 Draft |
|
| 4.5 | MicCapture UI Component | 0.5 day | 4.4 | 📋 Draft |
|
||||||
| 4.6 | Integration & Acceptance Tests | 1 day | 4.4, 4.5 | 📋 Draft |
|
| 4.6 | LTTPage Integration (all 3 sources) | 0.5 day | 4.2, 4.3, 4.4, 4.5 | 📋 Draft |
|
||||||
| 4.7 | Polish & Documentation | 0.5 day | 4.6 | 📋 Draft |
|
| 4.7 | Backend Adjustments | 0.5 day | 4.1 | 📋 Draft |
|
||||||
| **Total** | | **4.5 days** | | |
|
| 4.8 | Integration & Acceptance Tests | 1 day | 4.6, 4.7 | 📋 Draft |
|
||||||
|
| 4.9 | Polish & Documentation | 0.5 day | 4.8 | 📋 Draft |
|
||||||
|
| **Total** | | **5.5 days** | | |
|
||||||
|
|
||||||
### Phase 4.1 — Config & Infrastructure (0.5 day)
|
### Phase 4.1 — Config & Infrastructure (0.5 day)
|
||||||
|
|
||||||
**Objective:** Add system audio feature toggle, define types, establish UUID generation.
|
**Objective:** Add feature toggles, define types, establish UUID generation.
|
||||||
|
|
||||||
**Tasks:**
|
**Tasks:**
|
||||||
1. Add `SYSTEM_AUDIO_ENABLED` to `backend/app/core/config.py` (default: `True`)
|
1. Add `SYSTEM_AUDIO_ENABLED` and `MIC_ENABLED` to `backend/app/core/config.py` (default: `True`)
|
||||||
2. Add `SystemAudioStatus` type to `frontend/src/types/index.ts`:
|
2. Add `SystemAudioStatus` type to `frontend/src/types/index.ts`:
|
||||||
```typescript
|
```typescript
|
||||||
type SystemAudioStatus = 'idle' | 'requesting' | 'capturing' | 'stopping' | 'error'
|
type SystemAudioStatus = 'idle' | 'requesting' | 'capturing' | 'stopping' | 'error'
|
||||||
```
|
```
|
||||||
3. Add `SystemAudioASRState` interface to types
|
3. Add `MicStatus` type:
|
||||||
4. Add `video_id` UUID generation helper (frontend-side: `crypto.randomUUID()`)
|
```typescript
|
||||||
5. Verify WebSocket ASR endpoint accepts arbitrary `video_id` strings (it does — confirm with a quick test)
|
type MicStatus = 'idle' | 'requesting' | 'listening' | 'stopping' | 'error'
|
||||||
|
```
|
||||||
|
4. Add `SystemAudioASRState` and `MicASRState` interfaces to types
|
||||||
|
5. Add `video_id` UUID generation helper (frontend-side: `crypto.randomUUID()`)
|
||||||
|
6. Verify WebSocket ASR endpoint accepts arbitrary `video_id` strings (it does — confirm with a quick test)
|
||||||
|
|
||||||
**Test Files:** `backend/app/test/test_phase4_config.py`
|
**Test Files:** `backend/app/test/test_phase4_config.py`
|
||||||
|
|
||||||
|
|
@@ -205,7 +274,7 @@ interface UseSystemAudioASRReturn {
|
||||||
|
|
||||||
**Pattern to Follow:**
|
**Pattern to Follow:**
|
||||||
- AudioContext setup: follow `useVideoASR.ts` lines 45-143 (AudioContext, ScriptProcessor, sample rate conversion)
|
- AudioContext setup: follow `useVideoASR.ts` lines 45-143 (AudioContext, ScriptProcessor, sample rate conversion)
|
||||||
- WebSocket handling: follow `useYouTubeASR.ts` lines 35-100
|
- WebSocket handling: follow `useVideoASR.ts` lines 35-100
|
||||||
- State management: combine patterns from both hooks, adapting for MediaStream source
|
- State management: follow the same hook's pattern, adapting for a MediaStream source
|
||||||
|
|
||||||
**Test Files:** `frontend/src/test/test_phase4_useSystemAudioASR.test.ts`
|
**Test Files:** `frontend/src/test/test_phase4_useSystemAudioASR.test.ts`
|
||||||
|
|
@@ -239,57 +308,128 @@ On Linux, only tab audio is available (not full system audio).
|
||||||
|
|
||||||
**Test Files:** `frontend/src/test/test_phase4_SystemAudioCapture.test.tsx`
|
**Test Files:** `frontend/src/test/test_phase4_SystemAudioCapture.test.tsx`
|
||||||
|
|
||||||
### Phase 4.4 — LTTPage Integration (0.5 day)
|
### Phase 4.4 — Mic Capture Hook (0.5 day)
|
||||||
|
|
||||||
**Objective:** Wire the System Audio source into LTTPage, adding it as the third tab alongside Upload and YouTube.
|
**Objective:** Create `useMicASR.ts` hook that captures microphone input and streams it to the ASR WebSocket.
|
||||||
|
|
||||||
|
**Key Design:**
|
||||||
|
```typescript
|
||||||
|
interface UseMicASRProps {
|
||||||
|
wsUrl: string // e.g., ws://localhost:8000/ws/asr/{uuid}?language=yue
|
||||||
|
}
|
||||||
|
|
||||||
|
interface UseMicASRReturn {
|
||||||
|
status: 'idle' | 'requesting' | 'listening' | 'stopping' | 'error'
|
||||||
|
transcript: string
|
||||||
|
partialTranscript: string
|
||||||
|
error: string | null
|
||||||
|
startListening: () => Promise<void>
|
||||||
|
stopListening: () => void
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
**Implementation Details:**
|
||||||
|
- `startListening()`: calls `navigator.mediaDevices.getUserMedia({ audio: true, video: false })`
|
||||||
|
- On success: creates AudioContext, `createMediaStreamSource(stream)`, connects ScriptProcessor → WebSocket
|
||||||
|
- On user deny: sets status to `'idle'`, sets error "Microphone access denied"
|
||||||
|
- On no audio track: sets status to `'error'`, sets error "No microphone input detected"
|
||||||
|
- `stopListening()`: stops all tracks in the MediaStream, closes AudioContext, closes WebSocket
|
||||||
|
- Auto-stop: listens for `track.onended` (user revokes permission) → calls stopListening
|
||||||
|
- Audio processing: identical to useSystemAudioASR — `ScriptProcessorNode(4096)`, convert Float32 → Int16 PCM, send via WebSocket
|
||||||
|
- WebSocket lifecycle: connect on listening start, close on listening stop
|
||||||
|
- Cleanup: useEffect return closes AudioContext, WebSocket, and stops tracks
|
||||||
|
|
||||||
|
**Code Sharing:** Extract shared audio processing logic (`MediaStream → AudioContext → ScriptProcessorNode → WebSocket`) into a reusable internal utility (`useMediaStreamASR` or `audioPipeline.ts`) to avoid duplication between `useSystemAudioASR` and `useMicASR`.
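
As an illustration of what such a shared utility could contain, the sketch below shows the two pure steps of the pipeline: reducing the sample rate to 16 kHz and converting Float32 samples to Int16 PCM. The function names are hypothetical, and the box-filter averaging in `downsample` is a simple stand-in chosen for brevity; the plan does not say which resampling method the existing hook uses.

```typescript
// Hypothetical sketch of the pure audio steps; names are assumptions.

// Reduce the sample rate by averaging each group of input samples.
function downsample(input: Float32Array, inRate: number, outRate: number): Float32Array {
  if (outRate >= inRate) return input
  const ratio = inRate / outRate
  const outLength = Math.floor(input.length / ratio)
  const out = new Float32Array(outLength)
  for (let i = 0; i < outLength; i++) {
    const start = Math.floor(i * ratio)
    const end = Math.min(input.length, Math.floor((i + 1) * ratio))
    let sum = 0
    for (let j = start; j < end; j++) sum += input[j]
    out[i] = sum / (end - start)
  }
  return out
}

// Clamp each sample to [-1, 1] and scale it to a signed 16-bit integer.
function floatTo16BitPCM(input: Float32Array): Int16Array {
  const out = new Int16Array(input.length)
  for (let i = 0; i < input.length; i++) {
    const s = Math.max(-1, Math.min(1, input[i]))
    out[i] = s < 0 ? s * 0x8000 : s * 0x7fff
  }
  return out
}
```

Inside an `onaudioprocess` handler, the channel data would presumably pass through `downsample` and then `floatTo16BitPCM`, and the resulting `Int16Array`'s buffer would be sent over the WebSocket as a binary frame.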
|
||||||
|
|
||||||
|
**Test Files:** `frontend/src/test/test_phase4_useMicASR.test.ts`
|
||||||
|
|
||||||
|
### Phase 4.5 — MicCapture UI Component (0.5 day)
|
||||||
|
|
||||||
|
**Objective:** Create the `MicCapture.tsx` component with Start/Stop button and status display.
|
||||||
|
|
||||||
|
**Component Props:**
|
||||||
|
```typescript
|
||||||
|
interface MicCaptureProps {
|
||||||
|
status: MicStatus
|
||||||
|
error: string | null
|
||||||
|
onStart: () => void
|
||||||
|
onStop: () => void
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
**UI States:**
|
||||||
|
1. **Idle**: "Start Listening" button (blue, prominent); no compatibility warning needed (microphone capture is widely supported)
|
||||||
|
2. **Requesting**: "Waiting for microphone permission..." (loading spinner)
|
||||||
|
3. **Listening**: "Stop Listening" button (red) + pulsing green dot + "Listening..."
|
||||||
|
4. **Error**: Red banner with error message + "Try Again" button
|
||||||
|
|
||||||
|
**Test Files:** `frontend/src/test/test_phase4_MicCapture.test.tsx`
|
||||||
|
|
||||||
|
### Phase 4.6 — LTTPage Integration (0.5 day)
|
||||||
|
|
||||||
|
**Objective:** Create the `SourceSelector` tab bar component and wire both new sources into LTTPage.
|
||||||
|
|
||||||
|
**New Component — `SourceSelector.tsx`:**
|
||||||
|
```typescript
|
||||||
|
interface SourceSelectorProps {
|
||||||
|
activeSource: SourceType
|
||||||
|
onSelect: (source: SourceType) => void
|
||||||
|
}
|
||||||
|
```
|
||||||
|
- Three tabs: Upload (📁), System Audio (🔊), Listen Mic (🎤)
|
||||||
|
- Active tab highlighted with blue background, inactive tabs gray
|
||||||
|
- Icons from lucide-react: `Upload`, `MonitorSpeaker`, `Mic`
|
||||||
|
|
||||||
**Changes to `LTTPage.tsx`:**
|
**Changes to `LTTPage.tsx`:**
|
||||||
1. Extend `SourceType` from `'upload' | 'youtube'` to `'upload' | 'youtube' | 'system-audio'`
|
1. Add `SourceType` state: `const [source, setSource] = useState<SourceType>('upload')`
|
||||||
2. Add third tab button (icon: `AudioLines` from lucide-react) in the source selector
|
2. Render `<SourceSelector activeSource={source} onSelect={setSource} />` above the panels
|
||||||
3. Initialize `useSystemAudioASR` hook with a UUID-based WebSocket URL
|
3. Extend `SourceType` to `'upload' | 'system-audio' | 'mic'`
|
||||||
4. Update `asr` variable:
|
4. Initialize `useSystemAudioASR` and `useMicASR` hooks with session-scoped UUIDs (generated once when tab selected, reused across Start/Stop cycles)
|
||||||
|
5. Update `asr` variable:
|
||||||
```typescript
|
```typescript
|
||||||
const asr = source === 'youtube' ? youtubeASR
|
const asr = source === 'system-audio' ? systemAudioASR
|
||||||
: source === 'system-audio' ? systemAudioASR
|
: source === 'mic' ? micASR
|
||||||
: uploadASR
|
: uploadASR
|
||||||
```
|
```
|
||||||
5. Conditional rendering:
|
6. Conditional rendering:
|
||||||
```
|
```tsx
|
||||||
{source === 'upload' && <VideoUploader />}
|
{source === 'upload' && <VideoUpload />}
|
||||||
{source === 'youtube' && <YouTubeMode />}
|
|
||||||
{source === 'system-audio' && <SystemAudioCapture />}
|
{source === 'system-audio' && <SystemAudioCapture />}
|
||||||
|
{source === 'mic' && <MicCapture />}
|
||||||
```
|
```
|
||||||
6. WebSocket URL: `ws://host/ws/asr/{crypto.randomUUID()}?language=yue`
|
7. WebSocket URL: `ws://host/ws/asr/{sessionUUID}?language=yue` (UUID stable per session, regenerated only on source switch)
|
||||||
7. Full Transcript button: hidden for system-audio (same as YouTube)
|
8. Full Transcript button: hidden for system-audio AND mic (streaming ASR only)
|
||||||
8. QueryInput: remains editable during capture (same behavior as other sources)
|
9. QueryInput: remains editable during capture/listening
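
The session-scoped UUID and URL construction from steps 4 and 7 could be sketched as follows. All names are hypothetical. The `source` query parameter is an assumption based on the commit message's mention of source query param gating; its exact name and accepted values are not shown in this plan.

```typescript
// Hypothetical sketch: names and the `source` parameter values are assumptions.
type LiveSource = 'system-audio' | 'mic'

// Returns a lookup that keeps one ID per active source and only
// generates a new one when the source changes.
function createSessionTracker(makeId: () => string): (source: LiveSource) => string {
  let current: { source: LiveSource; id: string } | null = null
  return (source) => {
    const existing = current
    if (existing !== null && existing.source === source) return existing.id
    const id = makeId()
    current = { source, id }
    return id
  }
}

// Builds the ASR WebSocket URL in the shape shown in step 7.
function buildAsrUrl(
  host: string,
  sessionId: string,
  language: string,
  source: LiveSource,
  secure = false,
): string {
  const params = new URLSearchParams({ language, source })
  const scheme = secure ? 'wss' : 'ws'
  return `${scheme}://${host}/ws/asr/${encodeURIComponent(sessionId)}?${params.toString()}`
}
```

In the page, `makeId` would presumably be `() => crypto.randomUUID()`, so repeated Start/Stop cycles on the same tab reuse one URL while switching tabs produces a fresh session.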
|
||||||
|
|
||||||
**Test Files:** `frontend/src/test/test_phase4_LTTPage_integration.test.tsx`
|
**Test Files:** `frontend/src/test/test_phase4_LTTPage_integration.test.tsx`
|
||||||
|
|
||||||
### Phase 4.5 — Backend Adjustments (0.5 day)
|
### Phase 4.7 — Backend Adjustments (0.5 day)
|
||||||
|
|
||||||
**Objective:** Ensure backend handles system audio sessions correctly.
|
**Objective:** Ensure backend handles both system audio and mic sessions correctly.
|
||||||
|
|
||||||
**Tasks:**
|
**Tasks:**
|
||||||
1. Verify `ws_asr.py` WebSocket endpoint works with arbitrary `video_id` (UUID format) — likely no changes needed
|
1. Verify `ws_asr.py` WebSocket endpoint works with arbitrary `video_id` (UUID format) — likely no changes needed
|
||||||
2. Add `SYSTEM_AUDIO_ENABLED` config validation in the router (return 503 if disabled)
|
2. Add `SYSTEM_AUDIO_ENABLED` and `MIC_ENABLED` config validation in the router (return 503 if disabled)
|
||||||
3. Handle system audio sessions in transcript history (optional — store with `source: 'system-audio'` metadata)
|
3. Handle system audio and mic sessions in transcript history (optional — store with `source: 'system-audio'` / `source: 'mic'` metadata)
|
||||||
4. Verify the ASR client handles system audio PCM identically to video audio
|
4. Verify the ASR client handles audio from both sources identically
|
||||||
|
|
||||||
**No new endpoints needed.** The existing WebSocket and ASR infrastructure is source-agnostic.
|
**No new endpoints needed.** The existing WebSocket and ASR infrastructure is source-agnostic.
|
||||||
|
|
||||||
**Test Files:** `backend/app/test/test_phase4_config.py`
|
**Test Files:** `backend/app/test/test_phase4_config.py`
|
||||||
|
|
||||||
### Phase 4.6 — Integration & Acceptance Tests (1 day)
|
### Phase 4.8 — Integration & Acceptance Tests (1 day)
|
||||||
|
|
||||||
**Objective:** Comprehensive tests for the system audio capture flow.
|
**Objective:** Comprehensive tests for both capture flows.
|
||||||
|
|
||||||
**Backend Integration Tests** (`backend/app/test/test_integration_phase4.py`):
|
**Backend Integration Tests** (`backend/app/test/test_integration_phase4.py`):
|
||||||
1. WebSocket accepts UUID video_id
|
1. WebSocket accepts UUID video_id
|
||||||
2. ASR processes audio from system audio session
|
2. ASR processes audio from system audio session
|
||||||
3. Config toggle disables feature
|
3. ASR processes audio from mic session
|
||||||
|
4. Config toggles disable features
|
||||||
|
|
||||||
**Frontend Tests:**
|
**Frontend Tests:**
|
||||||
1. **Hook tests** (`test_phase4_useSystemAudioASR.test.ts`): ~10 tests
|
|
||||||
|
1. **System Audio Hook tests** (`test_phase4_useSystemAudioASR.test.ts`): ~10 tests
|
||||||
- Mock `getDisplayMedia` → successful capture
|
- Mock `getDisplayMedia` → successful capture
|
||||||
- Mock `getDisplayMedia` → user cancels (permission denied)
|
- Mock `getDisplayMedia` → user cancels (permission denied)
|
||||||
- Mock `getDisplayMedia` → no audio track
|
- Mock `getDisplayMedia` → no audio track
|
||||||
|
|
@@ -300,31 +440,52 @@ On Linux, only tab audio is available (not full system audio).
|
||||||
- `stopCapture` cleanup
|
- `stopCapture` cleanup
|
||||||
- Multiple rapid start/stop cycles
|
- Multiple rapid start/stop cycles
|
||||||
|
|
||||||
2. **Component tests** (`test_phase4_SystemAudioCapture.test.tsx`): ~5 tests
|
2. **System Audio Component tests** (`test_phase4_SystemAudioCapture.test.tsx`): ~5 tests
|
||||||
- All UI states render correctly (idle, requesting, capturing, error)
|
- All UI states render correctly (idle, requesting, capturing, error)
|
||||||
- Start button calls onStart
|
- Start button calls onStart
|
||||||
- Stop button calls onStop
|
- Stop button calls onStop
|
||||||
- Error state shows message and retry button
|
- Error state shows message and retry button
|
||||||
- Compatibility note visible for non-Chrome (optional)
|
- Compatibility note visible for non-Chrome (optional)
|
||||||
|
|
||||||
3. **Integration tests** (`test_phase4_LTTPage_integration.test.tsx`): ~5 tests
|
3. **Mic Hook tests** (`test_phase4_useMicASR.test.ts`): ~8 tests
|
||||||
|
- Mock `getUserMedia` → successful capture
|
||||||
|
- Mock `getUserMedia` → user denies (permission denied)
|
||||||
|
- Mock `getUserMedia` → no audio track
|
||||||
|
- AudioContext setup and teardown
|
||||||
|
- WebSocket connection lifecycle
|
||||||
|
- `track.onended` triggers auto-stop
|
||||||
|
- `stopListening` cleanup
|
||||||
|
- PCM conversion and sending
|
||||||
|
|
||||||
|
4. **Mic Component tests** (`test_phase4_MicCapture.test.tsx`): ~4 tests
|
||||||
|
- All UI states render correctly (idle, requesting, listening, error)
|
||||||
|
- Start button calls onStart
|
||||||
|
- Stop button calls onStop
|
||||||
|
- Error state shows message and retry button
|
||||||
|
|
||||||
|
5. **LTTPage Integration tests** (`test_phase4_LTTPage_integration.test.tsx`): ~8 tests
|
||||||
- System Audio tab renders and switches correctly
|
- System Audio tab renders and switches correctly
|
||||||
|
- Listen Mic tab renders and switches correctly
|
||||||
- ASR variable selects systemAudioASR when source is system-audio
|
- ASR variable selects systemAudioASR when source is system-audio
|
||||||
- Full Transcript button hidden for system audio
|
- ASR variable selects micASR when source is mic
|
||||||
|
- Full Transcript button hidden for system audio and mic
|
||||||
- QueryInput receives transcript from system audio
|
- QueryInput receives transcript from system audio
|
||||||
|
- QueryInput receives transcript from mic
|
||||||
- Source switching preserves transcript
|
- Source switching preserves transcript
|
||||||
|
|
||||||
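The "PCM conversion and sending" item in the Mic Hook tests implies a step that turns Web Audio `Float32Array` samples into PCM before they go over the WebSocket. A minimal sketch of that conversion, assuming 16-bit signed little-endian PCM (this plan does not state the bit depth or sample rate); `floatTo16BitPCM` is a hypothetical name, not a function confirmed by the source:

```typescript
// Hypothetical helper: convert Web Audio Float32 samples (nominal range -1..1)
// to 16-bit signed little-endian PCM. The 16-bit format is an assumption.
function floatTo16BitPCM(samples: Float32Array): ArrayBuffer {
  const buffer = new ArrayBuffer(samples.length * 2)
  const view = new DataView(buffer)
  for (let i = 0; i < samples.length; i++) {
    // Clamp first so out-of-range input cannot overflow the Int16 range.
    const s = Math.max(-1, Math.min(1, samples[i]))
    // Int16 is asymmetric: negatives scale by 32768, positives by 32767.
    const v = s < 0 ? s * 0x8000 : s * 0x7fff
    view.setInt16(i * 2, Math.round(v), true) // true = little-endian
  }
  return buffer
}
```

A hook test could feed a known `Float32Array` through this function and compare the bytes handed to the mocked WebSocket `send`.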
**Acceptance Tests** (`backend/app/test/acceptance/test_acceptance_phase4.py`):
|
**Acceptance Tests** (`backend/app/test/acceptance/test_acceptance_phase4.py`):
|
||||||
- Real `getDisplayMedia` with actual browser (manual — requires human interaction)
|
- Real `getDisplayMedia` with actual browser (manual — requires human interaction)
|
||||||
|
- Real `getUserMedia` with actual microphone (manual — requires human interaction)
|
||||||
- Real DashScope ASR with system audio stream
|
- Real DashScope ASR with system audio stream
|
||||||
- End-to-end: capture → ASR → transcript → RAG answer
|
- Real DashScope ASR with microphone stream
|
||||||
|
- End-to-end: capture → ASR → transcript → RAG answer (both sources)
|
||||||
|
|
||||||
### Phase 4.7 — Polish & Documentation (0.5 day)
|
### Phase 4.9 — Polish & Documentation (0.5 day)
|
||||||
|
|
||||||
**Tasks:**
|
**Tasks:**
|
||||||
1. Update `README.md` — add System Audio Capture section with usage instructions, browser compatibility table, and limitations
|
1. Update `README.md` — add System Audio Capture and Listen Mic sections with usage instructions, browser compatibility table, and limitations
|
||||||
2. Update `development_plan.md` — add Phase 4 row to timeline, mark status
|
2. Update `development_plan.md` — add Phase 4 row to timeline, mark status
|
||||||
3. Add browser detection helper for compatibility warning
|
3. Add browser detection helper for system audio compatibility warning
|
||||||
4. Verify production build (`npm run build`)
|
4. Verify production build (`npm run build`)
|
||||||
5. Run full CI regression (`pytest` + `vitest`)
|
5. Run full CI regression (`pytest` + `vitest`)
|
||||||
6. Final commit
|
6. Final commit
|
||||||
|
|
@ -335,34 +496,51 @@ On Linux, only tab audio is available (not full system audio).
|
||||||
|
|
||||||
| Decision | Rationale |
|
| Decision | Rationale |
|
||||||
|----------|-----------|
|
|----------|-----------|
|
||||||
| New hook (`useSystemAudioASR`) rather than modifying existing | MediaStream source requires `createMediaStreamSource` (not `createMediaElementSource`), and lifecycle is permission-based (not play/pause events). Separate hook avoids branching complexity. |
|
| New hooks rather than modifying existing | MediaStream source requires `createMediaStreamSource` (not `createMediaElementSource`), and lifecycle is permission-based (not play/pause events). Separate hooks avoid branching complexity. |
|
||||||
| UUID-based `video_id` | No actual video file for system audio. `crypto.randomUUID()` generates unique session IDs. Backend WebSocket already accepts arbitrary strings. |
|
| Two separate hooks + shared audio utility | System Audio and Mic share identical audio processing (MediaStream → PCM → WebSocket) but differ in capture API (`getDisplayMedia` vs `getUserMedia`) and UX. Extract shared pipeline to avoid duplication. |
|
||||||
| Manual Start/Stop (not auto) | `getDisplayMedia()` requires explicit user action (browser policy). Cannot auto-start. |
|
| UUID-based `video_id` (per-session) | No actual video file for live audio. UUID generated once when source tab is selected, reused across Start/Stop cycles within the same session. Regenerated only when switching between sources. Backend WebSocket already accepts arbitrary strings. |
|
||||||
| No video display in System Audio mode | User watches content in another tab/window. Only capture status and audio controls shown. |
|
| Manual Start/Stop (not auto) | Both `getDisplayMedia()` and `getUserMedia()` require explicit user action (browser policy). Cannot auto-start. |
|
||||||
| `video: false` in getDisplayMedia | Audio-only capture reduces bandwidth and permission scope. User only needs to share audio. |
|
| No video display in System Audio or Mic mode | User watches/listens to content elsewhere. Only capture status and audio controls shown. |
|
||||||
| Hide Full Transcript button for system audio | Same as YouTube — streaming ASR only. Full transcript would require recording and batch processing (future Phase 5). |
|
| `video: false` in getDisplayMedia | Audio-only capture reduces bandwidth and permission scope. |
|
||||||
| Browser compatibility note in UI | `getDisplayMedia` with audio is Chrome/Edge-only. Non-supporting browsers get clear messaging. |
|
| Hide Full Transcript button for both new sources | Streaming ASR only — no video file to batch transcribe. Full transcript would require audio recording (future Phase 5). |
|
||||||
|
| Browser compatibility note only for System Audio | Mic (`getUserMedia`) is universally supported in all modern browsers. System Audio (`getDisplayMedia` with audio) is Chrome/Edge-only. |
|
||||||
|
| Mic uses `getUserMedia({ audio: true, video: false })` | Audio-only capture — no camera needed. |
|
||||||
|
|
||||||
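The per-session `video_id` rule in the table above (generate once per selected source, reuse across Start/Stop, regenerate on source switch) can be sketched as a small holder. `SessionIdStore` and `LiveSource` are hypothetical names for illustration, and the generator is injected so the sketch runs outside a browser:

```typescript
type LiveSource = 'system-audio' | 'mic'

// Hypothetical session-ID holder: one ID per selected source, reused across
// Start/Stop cycles, replaced only when the active source changes.
class SessionIdStore {
  private current: { source: LiveSource; id: string } | null = null
  private readonly generate: () => string

  // In the browser, `generate` would typically be `() => crypto.randomUUID()`.
  constructor(generate: () => string) {
    this.generate = generate
  }

  idFor(source: LiveSource): string {
    if (this.current === null || this.current.source !== source) {
      this.current = { source, id: this.generate() }
    }
    return this.current.id
  }
}
```

Calling `idFor` on every Start keeps the ID stable within a source and gives a fresh one after a switch, including a switch back to a previously used source.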
### getDisplayMedia Options
|
### getDisplayMedia Options (System Audio)
|
||||||
|
|
||||||
```javascript
|
```javascript
|
||||||
const stream = await navigator.mediaDevices.getDisplayMedia({
|
const stream = await navigator.mediaDevices.getDisplayMedia({
|
||||||
video: false, // No video needed
|
video: false,
|
||||||
audio: {
|
audio: {
|
||||||
systemAudio: 'include', // Request system audio (tab + full system where supported)
|
systemAudio: 'include',
|
||||||
echoCancellation: false, // Don't filter audio
|
echoCancellation: false,
|
||||||
noiseSuppression: false, // Don't filter audio
|
noiseSuppression: false,
|
||||||
autoGainControl: false, // Don't adjust volume
|
autoGainControl: false,
|
||||||
}
|
}
|
||||||
})
|
})
|
||||||
```
|
```
|
||||||
|
|
||||||
**Note on `video: false`:** Setting `video: false` tells the browser we only want audio. However, the browser permission dialog still shows screen/tab selection (there's no "audio-only picker"). The user must select a tab or screen to share — this is a browser limitation, not ours.
|
**Note on `video: false`:** Setting `video: false` tells the browser we only want audio. However, the browser permission dialog still shows screen/tab selection (there's no "audio-only picker"). The user must select a tab or screen to share — this is a browser limitation.
|
||||||
|
|
||||||
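The commit message for this change reports that `getDisplayMedia()` rejected `video: false` and that the fix was to request `video: true` and then stop the video tracks. A minimal sketch of that track handling follows. It uses structural stand-ins for `MediaStream` and `MediaStreamTrack` so it can run outside a browser; `dropVideoTracks` is a hypothetical name, not the hook's actual code:

```typescript
// Minimal structural stand-ins for the browser's MediaStreamTrack / MediaStream.
interface TrackLike {
  kind: string
  stop(): void
}
interface StreamLike {
  getAudioTracks(): TrackLike[]
  getVideoTracks(): TrackLike[]
  removeTrack(track: TrackLike): void
}

// Hypothetical helper: stop and detach every video track, then report whether
// an audio track remains (the user may have shared a source without audio).
function dropVideoTracks(stream: StreamLike): boolean {
  for (const track of stream.getVideoTracks()) {
    track.stop()
    stream.removeTrack(track)
  }
  return stream.getAudioTracks().length > 0
}
```

A `false` return corresponds to the "no audio track" case listed in the hook tests, where the capture should surface an error rather than open the WebSocket.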
|
### getUserMedia Options (Listen Mic)
|
||||||
|
|
||||||
|
```javascript
|
||||||
|
const stream = await navigator.mediaDevices.getUserMedia({
|
||||||
|
audio: {
|
||||||
|
echoCancellation: false, // Don't filter audio (pass raw mic input)
|
||||||
|
noiseSuppression: false, // Don't filter audio
|
||||||
|
autoGainControl: false, // Don't adjust volume
|
||||||
|
},
|
||||||
|
video: false,
|
||||||
|
})
|
||||||
|
```
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## 6. Browser Compatibility
|
## 6. Browser Compatibility
|
||||||
|
|
||||||
|
### System Audio (`getDisplayMedia`)
|
||||||
|
|
||||||
| Platform / Browser | Tab Audio | System Audio | Works? |
|
| Platform / Browser | Tab Audio | System Audio | Works? |
|
||||||
|--------------------|-----------|-------------|--------|
|
|--------------------|-----------|-------------|--------|
|
||||||
| Chrome/Edge (Windows) | ✅ | ✅ | **Best — full support** |
|
| Chrome/Edge (Windows) | ✅ | ✅ | **Best — full support** |
|
||||||
|
|
@ -376,11 +554,21 @@ const stream = await navigator.mediaDevices.getDisplayMedia({
|
||||||
```typescript
|
```typescript
|
||||||
function isSystemAudioSupported(): boolean {
|
function isSystemAudioSupported(): boolean {
|
||||||
const isChromium = 'chrome' in window || navigator.userAgent.includes('Chrome')
|
const isChromium = 'chrome' in window || navigator.userAgent.includes('Chrome')
|
||||||
// Firefox and Safari don't support audio in getDisplayMedia
|
|
||||||
return isChromium && !navigator.userAgent.includes('Firefox')
|
return isChromium && !navigator.userAgent.includes('Firefox')
|
||||||
}
|
}
|
||||||
```
|
```
|
||||||
|
|
||||||
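The helper above reads `window` and `navigator` directly, which makes it awkward to unit test. One possible testable variant passes the inputs in instead; `supportsSystemAudio` and its parameters are hypothetical and only mirror the user-agent check shown in the plan:

```typescript
// Hypothetical, testable variant: the user agent and API presence are passed
// in instead of being read from `window` / `navigator`.
function supportsSystemAudio(userAgent: string, hasGetDisplayMedia: boolean): boolean {
  // Chrome and Edge both include "Chrome" in their user agent string;
  // Firefox and desktop Safari do not.
  const looksChromium = userAgent.includes('Chrome')
  return hasGetDisplayMedia && looksChromium && !userAgent.includes('Firefox')
}
```

In the page this could be called as `supportsSystemAudio(navigator.userAgent, 'getDisplayMedia' in navigator.mediaDevices)`. User-agent sniffing is a heuristic, so the result is best used for a warning rather than for hard-blocking the feature.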
|
### Listen Mic (`getUserMedia`)
|
||||||
|
|
||||||
|
| Platform / Browser | Microphone | Works? |
|
||||||
|
|--------------------|-----------|--------|
|
||||||
|
| Chrome/Edge | ✅ | **Full support** |
|
||||||
|
| Firefox | ✅ | **Full support** |
|
||||||
|
| Safari | ✅ | **Full support** |
|
||||||
|
| Mobile browsers | ✅ | **Full support** |
|
||||||
|
|
||||||
|
Mic capture is supported in all modern browsers when the page is served from a secure context (HTTPS or localhost), so no compatibility warning is needed.
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## 7. Test Strategy
|
## 7. Test Strategy
|
||||||
|
|
@ -389,16 +577,19 @@ function isSystemAudioSupported(): boolean {
|
||||||
|
|
||||||
| File | Type | Count | Description |
|
| File | Type | Count | Description |
|
||||||
|------|------|-------|-------------|
|
|------|------|-------|-------------|
|
||||||
| `test_phase4_config.py` | Backend integration | 3 | Config toggle, WebSocket accepts UUID |
|
| `test_phase4_config.py` | Backend integration | 4 | Config toggles, WebSocket accepts UUID |
|
||||||
| `test_phase4_useSystemAudioASR.test.ts` | Frontend unit | ~10 | Hook behavior: capture, permission, audio, WS |
|
| `test_phase4_useSystemAudioASR.test.ts` | Frontend unit | ~10 | Hook: capture, permission, audio, WS |
|
||||||
| `test_phase4_SystemAudioCapture.test.tsx` | Frontend component | ~5 | UI states: idle, requesting, capturing, error |
|
| `test_phase4_SystemAudioCapture.test.tsx` | Frontend component | ~5 | UI states: idle, requesting, capturing, error |
|
||||||
| `test_phase4_LTTPage_integration.test.tsx` | Frontend integration | ~5 | Tab switching, ASR unification, Full Transcript |
|
| `test_phase4_useMicASR.test.ts` | Frontend unit | ~8 | Hook: capture, permission, audio, WS |
|
||||||
| `test_integration_phase4.py` | Backend integration | 4 | Config toggle, WebSocket, ASR client |
|
| `test_phase4_MicCapture.test.tsx` | Frontend component | ~4 | UI states: idle, requesting, listening, error |
|
||||||
| `test_acceptance_phase4.py` | Acceptance | 3 | Real browser + real DashScope ASR |
|
| `test_phase4_LTTPage_integration.test.tsx` | Frontend integration | ~8 | Tab switching, ASR unification, Full Transcript |
|
||||||
|
| `test_integration_phase4.py` | Backend integration | 4 | Config toggles, WebSocket, ASR client |
|
||||||
|
| `test_acceptance_phase4.py` | Acceptance | 5 | Real browser + real mic + real DashScope ASR |
|
||||||
|
|
||||||
### Mocking Strategy
|
### Mocking Strategy
|
||||||
|
|
||||||
- **`getDisplayMedia`**: Mock with `vi.fn()` returning a synthetic MediaStream with an audio track
|
- **`getDisplayMedia`**: Mock with `vi.fn()` returning a synthetic MediaStream with an audio track
|
||||||
|
- **`getUserMedia`**: Mock with `vi.fn()` returning a synthetic MediaStream with an audio track
|
||||||
- **AudioContext**: Use a manual mock for AudioContext and ScriptProcessorNode
|
- **AudioContext**: Use a manual mock for AudioContext and ScriptProcessorNode
|
||||||
- **WebSocket**: Mock via `vitest` WebSocket mock (same pattern as Phase 2/3 tests)
|
- **WebSocket**: Mock via `vitest` WebSocket mock (same pattern as Phase 2/3 tests)
|
||||||
- **DashScope ASR**: Mock in CI; real in acceptance tests
|
- **DashScope ASR**: Mock in CI; real in acceptance tests
|
||||||
|
|
@ -410,9 +601,13 @@ function isSystemAudioSupported(): boolean {
|
||||||
### New Files
|
### New Files
|
||||||
```
|
```
|
||||||
frontend/src/hooks/useSystemAudioASR.ts
|
frontend/src/hooks/useSystemAudioASR.ts
|
||||||
|
frontend/src/hooks/useMicASR.ts
|
||||||
frontend/src/components/SystemAudioCapture.tsx
|
frontend/src/components/SystemAudioCapture.tsx
|
||||||
|
frontend/src/components/MicCapture.tsx
|
||||||
frontend/src/test/test_phase4_useSystemAudioASR.test.ts
|
frontend/src/test/test_phase4_useSystemAudioASR.test.ts
|
||||||
frontend/src/test/test_phase4_SystemAudioCapture.test.tsx
|
frontend/src/test/test_phase4_SystemAudioCapture.test.tsx
|
||||||
|
frontend/src/test/test_phase4_useMicASR.test.ts
|
||||||
|
frontend/src/test/test_phase4_MicCapture.test.tsx
|
||||||
frontend/src/test/test_phase4_LTTPage_integration.test.tsx
|
frontend/src/test/test_phase4_LTTPage_integration.test.tsx
|
||||||
backend/app/test/test_phase4_config.py
|
backend/app/test/test_phase4_config.py
|
||||||
backend/app/test/test_integration_phase4.py
|
backend/app/test/test_integration_phase4.py
|
||||||
|
|
@ -422,11 +617,11 @@ backend/app/test/acceptance/test_acceptance_phase4.py
|
||||||
|
|
||||||
### Modified Files
|
### Modified Files
|
||||||
```
|
```
|
||||||
frontend/src/pages/LTTPage.tsx ← add "System Audio" tab, wire hook
|
frontend/src/pages/LTTPage.tsx ← add "System Audio" + "Listen Mic" tabs, wire hooks
|
||||||
frontend/src/types/index.ts ← add SystemAudioStatus, SystemAudioASRState
|
frontend/src/types/index.ts ← add SystemAudioStatus, MicStatus, ASRState types
|
||||||
backend/app/core/config.py ← add SYSTEM_AUDIO_ENABLED
|
backend/app/core/config.py ← add SYSTEM_AUDIO_ENABLED, MIC_ENABLED
|
||||||
development_plan.md ← add Phase 4 row
|
development_plan.md ← add Phase 4 row
|
||||||
README.md ← add System Audio Capture section
|
README.md ← add System Audio + Listen Mic sections
|
||||||
```
|
```
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
@ -434,13 +629,17 @@ README.md ← add System Audio Capture s
|
||||||
## 9. Acceptance Criteria
|
## 9. Acceptance Criteria
|
||||||
|
|
||||||
- [ ] User can select "System Audio" tab in LTTPage
|
- [ ] User can select "System Audio" tab in LTTPage
|
||||||
- [ ] Clicking "Start Capture" opens browser permission dialog
|
- [ ] User can select "Listen Mic" tab in LTTPage
|
||||||
- [ ] On permission grant, audio streams through WebSocket to DashScope ASR
|
- [ ] Clicking "Start Capture" (System Audio) opens browser permission dialog
|
||||||
- [ ] Real-time transcript flows into QueryInput
|
- [ ] Clicking "Start Listening" (Listen Mic) opens microphone permission prompt
|
||||||
- [ ] User can edit transcript while capture continues
|
- [ ] On permission grant, audio streams through WebSocket to DashScope ASR (both sources)
|
||||||
|
- [ ] Real-time transcript flows into QueryInput (both sources)
|
||||||
|
- [ ] User can edit transcript while capture/listening continues
|
||||||
- [ ] "Stop Capture" properly closes MediaStream, AudioContext, WebSocket
|
- [ ] "Stop Capture" properly closes MediaStream, AudioContext, WebSocket
|
||||||
- [ ] Permission denied shows clear error message
|
- [ ] "Stop Listening" properly closes MediaStream, AudioContext, WebSocket
|
||||||
- [ ] Browser compatibility note shown for non-Chrome browsers
|
- [ ] Permission denied shows clear error message (both sources)
|
||||||
|
- [ ] Browser compatibility note shown for System Audio on non-Chrome browsers
|
||||||
|
- [ ] No compatibility warning for Listen Mic (universally supported)
|
||||||
- [ ] All CI tests pass (no regressions)
|
- [ ] All CI tests pass (no regressions)
|
||||||
- [ ] Acceptance tests pass with real DashScope ASR
|
- [ ] Acceptance tests pass with real DashScope ASR
|
||||||
- [ ] `npm run build` produces clean production build
|
- [ ] `npm run build` produces clean production build
|
||||||
|
|
@ -450,4 +649,5 @@ README.md ← add System Audio Capture s
|
||||||
**File Information**
|
**File Information**
|
||||||
- Filename: `phase4_system_audio_plan.md`
|
- Filename: `phase4_system_audio_plan.md`
|
||||||
- Created: 2026-05-09
|
- Created: 2026-05-09
|
||||||
|
- Updated: 2026-05-14 — Added Listen Mic as third source; removed YouTube
|
||||||
- Status: Draft — awaiting review before Phase 4.1 implementation begins
|
- Status: Draft — awaiting review before Phase 4.1 implementation begins
|
||||||
|
|
|
||||||
|
|
@ -54,6 +54,10 @@ class Settings(BaseSettings):
|
||||||
max_video_size_mb: int = 300
|
max_video_size_mb: int = 300
|
||||||
supported_video_formats: list[str] = [".mp4", ".webm", ".mov", ".avi", ".mkv"]
|
supported_video_formats: list[str] = [".mp4", ".webm", ".mov", ".avi", ".mkv"]
|
||||||
|
|
||||||
|
# Phase 4 — Live audio capture toggles
|
||||||
|
system_audio_enabled: bool = True
|
||||||
|
mic_enabled: bool = True
|
||||||
|
|
||||||
# Development helpers
|
# Development helpers
|
||||||
model_config = {"env_file": ".env", "env_file_encoding": "utf-8"}
|
model_config = {"env_file": ".env", "env_file_encoding": "utf-8"}
|
||||||
|
|
||||||
|
|
|
||||||
|
|
@ -209,7 +209,7 @@ async def _ws_proxy_dashscope(client_ws: WebSocket, loop: asyncio.AbstractEventL
|
||||||
|
|
||||||
|
|
||||||
@router.websocket("/ws/asr/{video_id}")
|
@router.websocket("/ws/asr/{video_id}")
|
||||||
async def ws_asr_endpoint(websocket: WebSocket, video_id: str, language: str = "yue"):
|
async def ws_asr_endpoint(websocket: WebSocket, video_id: str, language: str = "yue", source: str = "upload"):
|
||||||
settings = get_settings()
|
settings = get_settings()
|
||||||
client_host = websocket.client.host if websocket.client else "unknown"
|
client_host = websocket.client.host if websocket.client else "unknown"
|
||||||
|
|
||||||
|
|
@ -220,9 +220,23 @@ async def ws_asr_endpoint(websocket: WebSocket, video_id: str, language: str = "
|
||||||
logger.warning("ws-rejected-no-apikey video_id=%s client=%s", video_id, client_host)
|
logger.warning("ws-rejected-no-apikey video_id=%s client=%s", video_id, client_host)
|
||||||
return
|
return
|
||||||
|
|
||||||
|
if source == "system-audio" and not settings.system_audio_enabled:
|
||||||
|
await websocket.accept()
|
||||||
|
await websocket.send_json({"error": "System audio capture is disabled"})
|
||||||
|
await websocket.close(code=1008, reason="System audio disabled")
|
||||||
|
logger.warning("ws-rejected-system-audio-disabled video_id=%s client=%s", video_id, client_host)
|
||||||
|
return
|
||||||
|
|
||||||
|
if source == "mic" and not settings.mic_enabled:
|
||||||
|
await websocket.accept()
|
||||||
|
await websocket.send_json({"error": "Microphone capture is disabled"})
|
||||||
|
await websocket.close(code=1008, reason="Mic disabled")
|
||||||
|
logger.warning("ws-rejected-mic-disabled video_id=%s client=%s", video_id, client_host)
|
||||||
|
return
|
||||||
|
|
||||||
await websocket.accept()
|
await websocket.accept()
|
||||||
loop = asyncio.get_event_loop()
|
loop = asyncio.get_event_loop()
|
||||||
logger.info("ws-connect video_id=%s lang=%s client=%s", video_id, language, client_host)
|
logger.info("ws-connect video_id=%s lang=%s source=%s client=%s", video_id, language, source, client_host)
|
||||||
|
|
||||||
try:
|
try:
|
||||||
await _ws_proxy_dashscope(websocket, loop, language)
|
await _ws_proxy_dashscope(websocket, loop, language)
|
||||||
|
|
|
||||||
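The `ws_asr_endpoint` signature in this hunk takes `language` (default `"yue"`) and `source` (default `"upload"`) as query parameters. A minimal client-side sketch of building a matching URL; `buildAsrUrl`, `AsrSource`, and the base address in the usage note are assumptions for illustration, not code from this commit:

```typescript
type AsrSource = 'upload' | 'system-audio' | 'mic'

// Hypothetical client-side helper: builds the URL for the
// `/ws/asr/{video_id}` endpoint with `language` and `source` query parameters.
function buildAsrUrl(base: string, videoId: string, language: string, source: AsrSource): string {
  const url = new URL(`/ws/asr/${encodeURIComponent(videoId)}`, base)
  url.searchParams.set('language', language)
  url.searchParams.set('source', source)
  return url.toString()
}
```

For example, `buildAsrUrl('ws://localhost:8000', sessionId, 'yue', 'mic')` targets the mic branch of the gating logic, which replies with an error payload and closes with code 1008 when `mic_enabled` is false.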
|
|
@ -0,0 +1,140 @@
|
||||||
|
"""Phase 4 config tests: system audio and mic capture feature toggles."""
|
||||||
|
import pytest
|
||||||
|
from fastapi import FastAPI
|
||||||
|
from fastapi.testclient import TestClient
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.fixture
|
||||||
|
def phase4_ws_app(monkeypatch):
|
||||||
|
monkeypatch.setenv("DASHSCOPE_API_KEY", "sk-test-key")
|
||||||
|
monkeypatch.setenv("SYSTEM_AUDIO_ENABLED", "true")
|
||||||
|
monkeypatch.setenv("MIC_ENABLED", "true")
|
||||||
|
from app.core.config import get_settings
|
||||||
|
from app.routers.ws_asr import router
|
||||||
|
get_settings.cache_clear()
|
||||||
|
app = FastAPI()
|
||||||
|
app.include_router(router)
|
||||||
|
return app
|
||||||
|
|
||||||
|
|
||||||
|
class TestWSSourceToggle:
|
||||||
|
def test_system_audio_source_connects(self, phase4_ws_app):
|
||||||
|
client = TestClient(phase4_ws_app)
|
||||||
|
with client.websocket_connect("/ws/asr/test-uuid?source=system-audio") as ws:
|
||||||
|
pass
|
||||||
|
|
||||||
|
def test_mic_source_connects(self, phase4_ws_app):
|
||||||
|
client = TestClient(phase4_ws_app)
|
||||||
|
with client.websocket_connect("/ws/asr/test-uuid?source=mic") as ws:
|
||||||
|
pass
|
||||||
|
|
||||||
|
def test_default_source_is_upload(self, phase4_ws_app):
|
||||||
|
client = TestClient(phase4_ws_app)
|
||||||
|
with client.websocket_connect("/ws/asr/test-uuid") as ws:
|
||||||
|
pass
|
||||||
|
|
||||||
|
def test_system_audio_disabled_rejects(self, monkeypatch):
|
||||||
|
monkeypatch.setenv("DASHSCOPE_API_KEY", "sk-test-key")
|
||||||
|
monkeypatch.setenv("SYSTEM_AUDIO_ENABLED", "false")
|
||||||
|
from app.core.config import get_settings
|
||||||
|
from app.routers.ws_asr import router
|
||||||
|
get_settings.cache_clear()
|
||||||
|
app = FastAPI()
|
||||||
|
app.include_router(router)
|
||||||
|
client = TestClient(app)
|
||||||
|
with client.websocket_connect("/ws/asr/test-uuid?source=system-audio") as ws:
|
||||||
|
data = ws.receive_json()
|
||||||
|
assert "disabled" in data.get("error", "").lower()
|
||||||
|
|
||||||
|
def test_mic_disabled_rejects(self, monkeypatch):
|
||||||
|
monkeypatch.setenv("DASHSCOPE_API_KEY", "sk-test-key")
|
||||||
|
monkeypatch.setenv("MIC_ENABLED", "false")
|
||||||
|
from app.core.config import get_settings
|
||||||
|
from app.routers.ws_asr import router
|
||||||
|
get_settings.cache_clear()
|
||||||
|
app = FastAPI()
|
||||||
|
app.include_router(router)
|
||||||
|
client = TestClient(app)
|
||||||
|
with client.websocket_connect("/ws/asr/test-uuid?source=mic") as ws:
|
||||||
|
data = ws.receive_json()
|
||||||
|
assert "disabled" in data.get("error", "").lower()
|
||||||
|
|
||||||
|
|
||||||
|
def test_config_system_audio_defaults(monkeypatch, tmp_path):
|
||||||
|
monkeypatch.delenv("SYSTEM_AUDIO_ENABLED", raising=False)
|
||||||
|
monkeypatch.setenv("LLM_API_KEY", "sk-test")
|
||||||
|
monkeypatch.setenv("DP_API_KEY", "sk-test")
|
||||||
|
monkeypatch.setenv("EMBEDDING_API_KEY", "sk-test")
|
||||||
|
env_file = tmp_path / ".env"
|
||||||
|
env_file.write_text("")
|
||||||
|
monkeypatch.chdir(tmp_path)
|
||||||
|
|
||||||
|
from app.core.config import Settings, get_settings
|
||||||
|
get_settings.cache_clear()
|
||||||
|
settings = Settings(_env_file=())
|
||||||
|
assert settings.system_audio_enabled is True
|
||||||
|
|
||||||
|
|
||||||
|
def test_config_mic_defaults(monkeypatch, tmp_path):
|
||||||
|
monkeypatch.delenv("MIC_ENABLED", raising=False)
|
||||||
|
monkeypatch.setenv("LLM_API_KEY", "sk-test")
|
||||||
|
monkeypatch.setenv("DP_API_KEY", "sk-test")
|
||||||
|
monkeypatch.setenv("EMBEDDING_API_KEY", "sk-test")
|
||||||
|
env_file = tmp_path / ".env"
|
||||||
|
env_file.write_text("")
|
||||||
|
monkeypatch.chdir(tmp_path)
|
||||||
|
|
||||||
|
from app.core.config import Settings, get_settings
|
||||||
|
get_settings.cache_clear()
|
||||||
|
settings = Settings(_env_file=())
|
||||||
|
assert settings.mic_enabled is True
|
||||||
|
|
||||||
|
|
||||||
|
def test_config_system_audio_disabled(tmp_path, monkeypatch):
|
||||||
|
env_file = tmp_path / ".env"
|
||||||
|
env_file.write_text(
|
||||||
|
"SYSTEM_AUDIO_ENABLED=false\n"
|
||||||
|
"LLM_API_KEY=sk-test\n"
|
||||||
|
"DP_API_KEY=sk-test\n"
|
||||||
|
"EMBEDDING_API_KEY=sk-test\n"
|
||||||
|
)
|
||||||
|
monkeypatch.chdir(tmp_path)
|
||||||
|
from app.core.config import Settings, get_settings
|
||||||
|
get_settings.cache_clear()
|
||||||
|
|
||||||
|
settings = Settings()
|
||||||
|
assert settings.system_audio_enabled is False
|
||||||
|
|
||||||
|
|
||||||
|
def test_config_mic_disabled(tmp_path, monkeypatch):
|
||||||
|
env_file = tmp_path / ".env"
|
||||||
|
env_file.write_text(
|
||||||
|
"MIC_ENABLED=false\n"
|
||||||
|
"LLM_API_KEY=sk-test\n"
|
||||||
|
"DP_API_KEY=sk-test\n"
|
||||||
|
"EMBEDDING_API_KEY=sk-test\n"
|
||||||
|
)
|
||||||
|
monkeypatch.chdir(tmp_path)
|
||||||
|
from app.core.config import Settings, get_settings
|
||||||
|
get_settings.cache_clear()
|
||||||
|
|
||||||
|
settings = Settings()
|
||||||
|
assert settings.mic_enabled is False
|
||||||
|
|
||||||
|
|
||||||
|
def test_config_loads_both_toggles_from_env(tmp_path, monkeypatch):
|
||||||
|
env_file = tmp_path / ".env"
|
||||||
|
env_file.write_text(
|
||||||
|
"SYSTEM_AUDIO_ENABLED=true\n"
|
||||||
|
"MIC_ENABLED=true\n"
|
||||||
|
"LLM_API_KEY=sk-test\n"
|
||||||
|
"DP_API_KEY=sk-test\n"
|
||||||
|
"EMBEDDING_API_KEY=sk-test\n"
|
||||||
|
)
|
||||||
|
monkeypatch.chdir(tmp_path)
|
||||||
|
from app.core.config import Settings, get_settings
|
||||||
|
get_settings.cache_clear()
|
||||||
|
|
||||||
|
settings = Settings()
|
||||||
|
assert settings.system_audio_enabled is True
|
||||||
|
assert settings.mic_enabled is True
|
||||||
|
|
@ -0,0 +1,2 @@
|
||||||
|
onlyBuiltDependencies:
|
||||||
|
- esbuild
|
||||||
|
|
@ -2071,15 +2071,6 @@
|
||||||
"node": ">=6.9.0"
|
"node": ">=6.9.0"
|
||||||
}
|
}
|
||||||
},
|
},
|
||||||
"node_modules/@types/babel__generator": {
|
|
||||||
"dev": true
|
|
||||||
},
|
|
||||||
"node_modules/@types/babel__template": {
|
|
||||||
"dev": true
|
|
||||||
},
|
|
||||||
"node_modules/@types/babel__traverse": {
|
|
||||||
"dev": true
|
|
||||||
},
|
|
||||||
"node_modules/@types/chai": {
|
"node_modules/@types/chai": {
|
||||||
"version": "4.3.20",
|
"version": "4.3.20",
|
||||||
"resolved": "https://registry.npmjs.org/@types/chai/-/chai-4.3.20.tgz",
|
"resolved": "https://registry.npmjs.org/@types/chai/-/chai-4.3.20.tgz",
|
||||||
|
|
@ -2130,9 +2121,6 @@
|
||||||
"@types/unist": "*"
|
"@types/unist": "*"
|
||||||
}
|
}
|
||||||
},
|
},
|
||||||
"node_modules/@types/jest": {
|
|
||||||
"dev": true
|
|
||||||
},
|
|
||||||
"node_modules/@types/mdast": {
|
"node_modules/@types/mdast": {
|
||||||
"version": "4.0.4",
|
"version": "4.0.4",
|
||||||
"resolved": "https://registry.npmjs.org/@types/mdast/-/mdast-4.0.4.tgz",
|
"resolved": "https://registry.npmjs.org/@types/mdast/-/mdast-4.0.4.tgz",
|
||||||
|
|
@ -2158,7 +2146,6 @@
|
||||||
"undici-types": "~7.19.0"
|
"undici-types": "~7.19.0"
|
||||||
}
|
}
|
||||||
},
|
},
|
||||||
"node_modules/@types/prop-types": {},
|
|
||||||
"node_modules/@types/react": {
|
"node_modules/@types/react": {
|
||||||
"version": "18.3.28",
|
"version": "18.3.28",
|
||||||
"resolved": "https://registry.npmjs.org/@types/react/-/react-18.3.28.tgz",
|
"resolved": "https://registry.npmjs.org/@types/react/-/react-18.3.28.tgz",
|
||||||
|
|
|
||||||
|
|
@ -34,5 +34,10 @@
|
||||||
"ts-node": "^10.9.1",
|
"ts-node": "^10.9.1",
|
||||||
"typescript": "^5.1.6",
|
"typescript": "^5.1.6",
|
||||||
"vitest": "^0.34.3"
|
"vitest": "^0.34.3"
|
||||||
|
},
|
||||||
|
"pnpm": {
|
||||||
|
"onlyBuiltDependencies": [
|
||||||
|
"esbuild"
|
||||||
|
]
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
|
||||||
|
|
@ -0,0 +1,11 @@
|
||||||
|
allowBuilds:
|
||||||
|
'"': true
|
||||||
|
'[': true
|
||||||
|
']': true
|
||||||
|
b: true
|
||||||
|
d: true
|
||||||
|
e: true
|
||||||
|
i: true
|
||||||
|
l: true
|
||||||
|
s: true
|
||||||
|
u: true
|
||||||
|
|
@ -0,0 +1,80 @@
|
||||||
|
import React from 'react'
|
||||||
|
import { Mic, Loader2, AlertCircle, Circle } from 'lucide-react'
|
||||||
|
import type { MicStatus } from '../types'
|
||||||
|
|
||||||
|
export interface MicCaptureProps {
|
||||||
|
status: MicStatus
|
||||||
|
error: string | null
|
||||||
|
onStart: () => void
|
||||||
|
onStop: () => void
|
||||||
|
}
|
||||||
|
|
||||||
|
export const MicCapture: React.FC<MicCaptureProps> = ({
|
||||||
|
status,
|
||||||
|
error,
|
||||||
|
onStart,
|
||||||
|
onStop,
|
||||||
|
}) => {
|
||||||
|
if (status === 'error' && error) {
|
||||||
|
return (
|
||||||
|
<div className="h-full flex flex-col">
|
||||||
|
<div className="p-3 bg-red-50 border border-red-200 rounded-lg flex items-start gap-2">
|
||||||
|
<AlertCircle className="w-4 h-4 text-red-500 shrink-0 mt-0.5" />
|
||||||
|
<div className="flex-1">
|
||||||
|
<div className="text-sm text-red-700">{error}</div>
|
||||||
|
<button
|
||||||
|
onClick={onStart}
|
||||||
|
className="mt-2 text-xs text-red-600 hover:text-red-800 font-medium underline"
|
||||||
|
>
|
||||||
|
Try Again
|
||||||
|
</button>
|
||||||
|
</div>
|
||||||
|
</div>
|
||||||
|
</div>
|
||||||
|
)
|
||||||
|
}
|
||||||
|
|
||||||
|
if (status === 'requesting') {
|
||||||
|
return (
|
||||||
|
<div className="h-full flex flex-col items-center justify-center space-y-3">
|
||||||
|
<Loader2 className="w-8 h-8 text-blue-600 animate-spin" />
|
||||||
|
<div className="text-sm text-gray-600 font-medium">Waiting for microphone permission...</div>
|
||||||
|
</div>
|
||||||
|
)
|
||||||
|
}
|
||||||
|
|
||||||
|
if (status === 'listening' || status === 'stopping') {
|
||||||
|
return (
|
||||||
|
<div className="h-full flex flex-col items-center justify-center space-y-4">
|
||||||
|
<div className="flex items-center gap-2">
|
||||||
|
<Circle className="w-3 h-3 text-green-500 fill-green-500 animate-pulse" />
|
||||||
|
<span className="text-sm text-gray-600 font-medium">Listening...</span>
|
||||||
|
</div>
|
||||||
|
<div className="flex items-end gap-1 h-8">
|
||||||
|
<div className="w-2 bg-green-500 rounded-full animate-[bounce_1s_infinite]" style={{ height: '40%', animationDelay: '0ms' }} />
|
||||||
|
<div className="w-2 bg-green-500 rounded-full animate-[bounce_1s_infinite]" style={{ height: '70%', animationDelay: '150ms' }} />
|
||||||
|
<div className="w-2 bg-green-500 rounded-full animate-[bounce_1s_infinite]" style={{ height: '55%', animationDelay: '300ms' }} />
|
||||||
|
</div>
|
||||||
|
<button
|
||||||
|
onClick={onStop}
|
||||||
|
disabled={status === 'stopping'}
|
||||||
|
className="w-full px-4 py-2 bg-red-600 text-white font-medium rounded hover:bg-red-700 focus:outline-none focus:ring-2 focus:ring-red-500 focus:ring-offset-2 disabled:opacity-50 disabled:cursor-not-allowed disabled:hover:bg-red-600 transition-all duration-200"
|
||||||
|
>
|
||||||
|
{status === 'stopping' ? 'Stopping...' : 'Stop Listening'}
|
||||||
|
</button>
|
||||||
|
</div>
|
||||||
|
)
|
||||||
|
}
|
||||||
|
|
||||||
|
return (
|
||||||
|
<div className="h-full flex flex-col">
|
||||||
|
<button
|
||||||
|
onClick={onStart}
|
||||||
|
className="w-full px-4 py-2 bg-blue-600 text-white font-medium rounded hover:bg-blue-700 focus:outline-none focus:ring-2 focus:ring-blue-500 focus:ring-offset-2 transition-all duration-200 flex items-center justify-center gap-2"
|
||||||
|
>
|
||||||
|
<Mic className="w-4 h-4" />
|
||||||
|
Start Listening
|
||||||
|
</button>
|
||||||
|
</div>
|
||||||
|
)
|
||||||
|
}
|
||||||
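The branch order in `MicCapture` above decides which view is rendered for each status. That mapping can be written as a pure function, which is easier to cover in the component tests. `micViewFor` and `MicView` are hypothetical names, and the `MicStatus` union is an assumption inferred from the statuses the component checks plus the `idle` state named in the plan:

```typescript
// Assumed shape of MicStatus; the real type lives in `../types`.
type MicStatus = 'idle' | 'requesting' | 'listening' | 'stopping' | 'error'
type MicView = 'error' | 'requesting' | 'listening' | 'idle'

// Hypothetical helper mirroring the component's branch order.
function micViewFor(status: MicStatus, error: string | null): MicView {
  // The error panel needs both the error status and a non-empty message.
  if (status === 'error' && error) return 'error'
  if (status === 'requesting') return 'requesting'
  // 'stopping' keeps the listening view, with the Stop button disabled.
  if (status === 'listening' || status === 'stopping') return 'listening'
  return 'idle'
}
```

One consequence of this order is that an `error` status without a message falls through to the idle view with the Start Listening button.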
|
|
@ -0,0 +1,42 @@
import React from 'react'
import { Upload, MonitorSpeaker, Mic } from 'lucide-react'
import type { SourceType } from '../types'

interface SourceSelectorProps {
  activeSource: SourceType
  onSelect: (source: SourceType) => void
}

export const SourceSelector: React.FC<SourceSelectorProps> = ({ activeSource, onSelect }) => {
  const tabs: { id: SourceType; label: string; icon: React.ElementType }[] = [
    { id: 'upload', label: 'Upload', icon: Upload },
    { id: 'system-audio', label: 'System Audio', icon: MonitorSpeaker },
    { id: 'mic', label: 'Listen Mic', icon: Mic },
  ]

  return (
    <div className="flex gap-1 p-1 bg-gray-100 rounded-lg" role="tablist">
      {tabs.map(tab => {
        const isActive = activeSource === tab.id
        const Icon = tab.icon
        return (
          <button
            key={tab.id}
            role="tab"
            aria-selected={isActive}
            onClick={() => onSelect(tab.id)}
            className={[
              'flex items-center gap-2 px-4 py-2 rounded-md text-sm font-medium transition-all duration-200',
              isActive
                ? 'bg-white text-blue-700 shadow-sm'
                : 'text-gray-500 hover:text-gray-700 hover:bg-gray-50',
            ].join(' ')}
          >
            <Icon className="w-4 h-4" />
            {tab.label}
          </button>
        )
      })}
    </div>
  )
}
@ -0,0 +1,86 @@
import React from 'react'
import { MonitorSpeaker, Loader2, AlertCircle, Circle } from 'lucide-react'
import type { SystemAudioStatus } from '../types'

export interface SystemAudioCaptureProps {
  status: SystemAudioStatus
  error: string | null
  onStart: () => void
  onStop: () => void
}

export const SystemAudioCapture: React.FC<SystemAudioCaptureProps> = ({
  status,
  error,
  onStart,
  onStop,
}) => {
  if (status === 'error' && error) {
    return (
      <div className="h-full flex flex-col">
        <div className="p-3 bg-red-50 border border-red-200 rounded-lg flex items-start gap-2">
          <AlertCircle className="w-4 h-4 text-red-500 shrink-0 mt-0.5" />
          <div className="flex-1">
            <div className="text-sm text-red-700">{error}</div>
            <button
              onClick={onStart}
              className="mt-2 text-xs text-red-600 hover:text-red-800 font-medium underline"
            >
              Try Again
            </button>
          </div>
        </div>
      </div>
    )
  }

  if (status === 'requesting') {
    return (
      <div className="h-full flex flex-col items-center justify-center space-y-3">
        <Loader2 className="w-8 h-8 text-blue-600 animate-spin" />
        <div className="text-sm text-gray-600 font-medium">Waiting for permission...</div>
      </div>
    )
  }

  if (status === 'capturing' || status === 'stopping') {
    return (
      <div className="h-full flex flex-col items-center justify-center space-y-4">
        <div className="flex items-center gap-2">
          <Circle className="w-3 h-3 text-green-500 fill-green-500 animate-pulse" />
          <span className="text-sm text-gray-600 font-medium">Capturing system audio...</span>
        </div>
        <div className="flex items-end gap-1 h-8">
          <div className="w-2 bg-green-500 rounded-full animate-[bounce_1s_infinite]" style={{ height: '40%', animationDelay: '0ms' }} />
          <div className="w-2 bg-green-500 rounded-full animate-[bounce_1s_infinite]" style={{ height: '70%', animationDelay: '150ms' }} />
          <div className="w-2 bg-green-500 rounded-full animate-[bounce_1s_infinite]" style={{ height: '55%', animationDelay: '300ms' }} />
        </div>
        <button
          onClick={onStop}
          disabled={status === 'stopping'}
          className="w-full px-4 py-2 bg-red-600 text-white font-medium rounded hover:bg-red-700 focus:outline-none focus:ring-2 focus:ring-red-500 focus:ring-offset-2 disabled:opacity-50 disabled:cursor-not-allowed disabled:hover:bg-red-600 transition-all duration-200"
        >
          {status === 'stopping' ? 'Stopping...' : 'Stop Capture'}
        </button>
      </div>
    )
  }

  return (
    <div className="h-full flex flex-col space-y-3">
      <button
        onClick={onStart}
        className="w-full px-4 py-2 bg-blue-600 text-white font-medium rounded hover:bg-blue-700 focus:outline-none focus:ring-2 focus:ring-blue-500 focus:ring-offset-2 transition-all duration-200 flex items-center justify-center gap-2"
      >
        <MonitorSpeaker className="w-4 h-4" />
        Start Capture
      </button>
      <div className="p-3 bg-amber-50 border border-amber-200 rounded-lg flex items-start gap-2">
        <AlertCircle className="w-4 h-4 text-amber-600 shrink-0 mt-0.5" />
        <div className="text-xs text-amber-700 leading-relaxed">
          System audio capture works best in Chrome/Edge on Windows/macOS. Firefox and Safari do not support this feature. On Linux, only tab audio is available.
        </div>
      </div>
    </div>
  )
}
@ -0,0 +1,188 @@
import { useState, useRef, useCallback, useEffect } from 'react'
import type { ASRMessage } from '../types'

export interface UseMediaStreamASRProps {
  wsUrl: string
}

export interface UseMediaStreamASRReturn {
  status: 'idle' | 'requesting' | 'streaming' | 'stopping' | 'error'
  transcript: string
  partialTranscript: string
  error: string | null
  start: (stream: MediaStream) => void
  stop: () => void
}

/**
 * Shared audio pipeline: AudioContext → ScriptProcessorNode → Float32 PCM → WebSocket.
 * Wrapper hooks (system audio, mic) obtain the MediaStream, then call `start(stream)`.
 * Follows the exact audio-processing and WebSocket message pattern from useVideoASR.ts.
 */
export function useMediaStreamASR({ wsUrl }: UseMediaStreamASRProps): UseMediaStreamASRReturn {
  const [status, setStatus] = useState<'idle' | 'requesting' | 'streaming' | 'stopping' | 'error'>('idle')
  const [transcript, setTranscript] = useState('')
  const [partialTranscript, setPartialTranscript] = useState('')
  const [error, setError] = useState<string | null>(null)

  const wsRef = useRef<WebSocket | null>(null)
  const audioContextRef = useRef<AudioContext | null>(null)
  const processorRef = useRef<ScriptProcessorNode | null>(null)
  const sourceRef = useRef<MediaStreamAudioSourceNode | null>(null)
  const streamRef = useRef<MediaStream | null>(null)
  const isStreamingRef = useRef(false)
  const isManualCloseRef = useRef(false)
  const transcriptRef = useRef('')
  const lastStashRef = useRef('')

  const cleanup = useCallback(() => {
    isStreamingRef.current = false

    // Stash handling — mirrors useVideoASR stopStreaming lines 101-111
    let currentText = transcriptRef.current.trim()
    const stash = lastStashRef.current.trim()
    if (stash && !currentText.endsWith(stash)) {
      currentText += stash
      transcriptRef.current = currentText
    }
    lastStashRef.current = ''
    if (currentText) {
      setTranscript(currentText)
      setPartialTranscript('')
    }

    if (streamRef.current) {
      streamRef.current.getTracks().forEach(t => {
        t.onended = null
        t.stop()
      })
      streamRef.current = null
    }

    processorRef.current?.disconnect()
    sourceRef.current?.disconnect()
    processorRef.current = null
    sourceRef.current = null

    if (wsRef.current) {
      isManualCloseRef.current = true
      wsRef.current.close()
      wsRef.current = null
    }

    if (audioContextRef.current) {
      audioContextRef.current.close()
      audioContextRef.current = null
    }
  }, [])

  const stop = useCallback(() => {
    setStatus('stopping')
    cleanup()
    setStatus('idle')
  }, [cleanup])

  const start = useCallback((stream: MediaStream) => {
    cleanup()

    setError(null)
    transcriptRef.current = ''
    lastStashRef.current = ''
    setTranscript('')
    setPartialTranscript('')
    streamRef.current = stream
    isManualCloseRef.current = false

    stream.getAudioTracks().forEach(track => {
      track.onended = () => {
        cleanup()
        setStatus('idle')
      }
    })

    try {
      // AudioContext + ScriptProcessorNode — mirrors useVideoASR lines 117-136
      const audioContext = new AudioContext({ sampleRate: 16000 })
      audioContextRef.current = audioContext

      const source = audioContext.createMediaStreamSource(stream)
      sourceRef.current = source

      const processor = audioContext.createScriptProcessor(4096, 1, 1)
      processorRef.current = processor

      // onaudioprocess — mirrors useVideoASR lines 126-132 exactly
      processor.onaudioprocess = (e) => {
        const float32Data = e.inputBuffer.getChannelData(0)
        const outputData = e.outputBuffer.getChannelData(0)
        outputData.set(float32Data)
        if (!isStreamingRef.current) return
        if (!wsRef.current || wsRef.current.readyState !== WebSocket.OPEN) return
        wsRef.current.send(float32Data.buffer)
      }

      source.connect(processor)
      processor.connect(audioContext.destination)

      const ws = new WebSocket(wsUrl)
      wsRef.current = ws

      ws.onopen = () => {
        isStreamingRef.current = true
        setStatus('streaming')
      }

      // Message parsing — mirrors useVideoASR lines 51-64 exactly
      ws.onmessage = (e) => {
        const msg: ASRMessage = JSON.parse(e.data)
        if (msg.is_final && msg.full_text) {
          transcriptRef.current = msg.full_text
          lastStashRef.current = ''
          setTranscript(msg.full_text)
          setPartialTranscript('')
        } else if (msg.delta) {
          transcriptRef.current += msg.delta
          lastStashRef.current = (msg as any).stash || ''
          setTranscript(transcriptRef.current)
          setPartialTranscript(transcriptRef.current)
        }
      }

      ws.onerror = () => {
        console.error('[useMediaStreamASR] WebSocket error')
        setError('WebSocket connection error')
        setStatus('error')
        isManualCloseRef.current = true
        cleanup()
      }

      ws.onclose = () => {
        isStreamingRef.current = false
        if (isManualCloseRef.current) return
        setError('ASR connection closed unexpectedly')
        setStatus('error')
      }
    } catch (err) {
      console.error('[useMediaStreamASR] start failed:', err)
      setError(err instanceof Error ? err.message : 'Failed to start audio pipeline')
      setStatus('error')
    }
  }, [wsUrl, cleanup])

  useEffect(() => {
    return () => {
      if (streamRef.current) {
        streamRef.current.getTracks().forEach(t => {
          t.onended = null
          t.stop()
        })
      }
      processorRef.current?.disconnect()
      sourceRef.current?.disconnect()
      wsRef.current?.close()
      audioContextRef.current?.close()
    }
  }, [])

  return { status, transcript, partialTranscript, error, start, stop }
}
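The transcript bookkeeping in `useMediaStreamASR` (a final message replaces the text, a delta appends to it, and the last stash is flushed on stop) can be read as a small pure reducer. The following is a minimal sketch under that reading, not the hook's actual code: `AsrMessageSketch`, `TranscriptState`, `applyMessage`, and `finalize` are hypothetical names, and the message shape is inferred only from the fields the hook reads (`is_final`, `full_text`, `delta`, `stash`); the real `ASRMessage` type in `../types` may differ.

```typescript
// Hypothetical message shape, inferred from the fields the hook reads.
interface AsrMessageSketch {
  is_final?: boolean
  full_text?: string
  delta?: string
  stash?: string
}

// Mirrors the two refs the hook keeps: transcriptRef and lastStashRef.
interface TranscriptState {
  transcript: string
  lastStash: string
}

// Fold one server message into the state, following the onmessage branches.
function applyMessage(state: TranscriptState, msg: AsrMessageSketch): TranscriptState {
  if (msg.is_final && msg.full_text) {
    // A final message carries the whole text and clears the stash.
    return { transcript: msg.full_text, lastStash: '' }
  }
  if (msg.delta) {
    // A delta is appended; the stash is remembered for a later flush.
    return { transcript: state.transcript + msg.delta, lastStash: msg.stash || '' }
  }
  return state
}

// On stop, append the pending stash unless the transcript already ends with it.
function finalize(state: TranscriptState): string {
  const text = state.transcript.trim()
  const stash = state.lastStash.trim()
  return stash && !text.endsWith(stash) ? text + stash : text
}
```

Keeping this logic free of React state would make the stash edge cases testable without a browser, although the commit itself keeps it inline in the hook.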
@ -0,0 +1,85 @@
import { useState, useEffect } from 'react'
import type { MicStatus } from '../types'
import { useMediaStreamASR } from './useMediaStreamASR'

export function useMicASR({ wsUrl }: { wsUrl: string }) {
  const pipeline = useMediaStreamASR({ wsUrl })
  const [status, setStatus] = useState<MicStatus>('idle')
  const [wrapperError, setWrapperError] = useState<string | null>(null)

  useEffect(() => {
    switch (pipeline.status) {
      case 'streaming':
        setStatus('listening')
        setWrapperError(null)
        break
      case 'stopping':
        setStatus('stopping')
        break
      case 'error':
        setStatus('error')
        setWrapperError(null)
        break
      case 'idle':
        setStatus('idle')
        break
    }
  }, [pipeline.status])

  const startListening = async () => {
    setWrapperError(null)
    setStatus('requesting')

    try {
      const stream = await navigator.mediaDevices.getUserMedia({
        audio: {
          echoCancellation: false,
          noiseSuppression: false,
          autoGainControl: false,
        },
        video: false,
      })

      if (stream.getAudioTracks().length === 0) {
        stream.getTracks().forEach(t => t.stop())
        setStatus('error')
        setWrapperError('No microphone input detected')
        return
      }

      pipeline.start(stream)
    } catch (err) {
      console.error('[useMicASR] getUserMedia failed:', err)
      if (err instanceof DOMException && err.name === 'NotAllowedError') {
        setStatus('idle')
        setWrapperError('Microphone access denied — please allow microphone access in your browser settings')
        return
      }
      if (err instanceof DOMException && err.name === 'NotFoundError') {
        setStatus('error')
        setWrapperError('No microphone found. Please connect a microphone and try again.')
        return
      }
      if (err instanceof DOMException && err.name === 'NotSupportedError') {
        setStatus('error')
        setWrapperError('Microphone access is not supported in this browser.')
        return
      }
      setStatus('error')
      setWrapperError(err instanceof Error ? err.message : 'Failed to start microphone capture')
    }
  }

  const stopListening = () => {
    pipeline.stop()
  }

  return {
    status,
    transcript: pipeline.transcript,
    partialTranscript: pipeline.partialTranscript,
    error: wrapperError ?? pipeline.error,
    startListening,
    stopListening,
  }
}
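The `catch` block in `useMicASR` maps the `name` of a `getUserMedia()` failure to a status and a user-facing message. A minimal sketch of that mapping as a pure function follows; `classifyMicError` and `MicFailure` are hypothetical names introduced for illustration, the messages are abbreviated versions of the ones in the hook, and the function takes the error name as a plain string so it can run outside a browser.

```typescript
// Hypothetical result type: the status the wrapper would set plus the message.
interface MicFailure {
  status: 'idle' | 'error'
  message: string
}

// Map a getUserMedia() error name to the wrapper state, following the
// branch order in the hook's catch block.
function classifyMicError(
  name: string,
  fallback: string = 'Failed to start microphone capture',
): MicFailure {
  switch (name) {
    case 'NotAllowedError':
      // The hook returns to 'idle' here rather than 'error'.
      return { status: 'idle', message: 'Microphone access denied' }
    case 'NotFoundError':
      return { status: 'error', message: 'No microphone found' }
    case 'NotSupportedError':
      return { status: 'error', message: 'Microphone access is not supported in this browser' }
    default:
      return { status: 'error', message: fallback }
  }
}
```

One thing this makes visible: a denied permission leaves the status at `'idle'`, so a component that only renders `error` when the status is `'error'` (as `SystemAudioCapture` does) would not show that message.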
@ -0,0 +1,91 @@
import { useState, useEffect } from 'react'
import type { SystemAudioStatus } from '../types'
import { useMediaStreamASR } from './useMediaStreamASR'

export function useSystemAudioASR({ wsUrl }: { wsUrl: string }) {
  const pipeline = useMediaStreamASR({ wsUrl })
  const [status, setStatus] = useState<SystemAudioStatus>('idle')
  const [wrapperError, setWrapperError] = useState<string | null>(null)

  useEffect(() => {
    switch (pipeline.status) {
      case 'streaming':
        setStatus('capturing')
        setWrapperError(null)
        break
      case 'stopping':
        setStatus('stopping')
        break
      case 'error':
        setStatus('error')
        setWrapperError(null)
        break
      case 'idle':
        setStatus('idle')
        break
    }
  }, [pipeline.status])

  const startCapture = async () => {
    setWrapperError(null)
    setStatus('requesting')

    try {
      // getDisplayMedia() SPEC: video:true is REQUIRED.
      // Setting video:false causes TypeError (Chrome) or NotSupportedError.
      // We capture video but immediately discard it — only audio is used.
      const stream = await navigator.mediaDevices.getDisplayMedia({
        video: true,
        audio: {
          systemAudio: 'include',
          echoCancellation: false,
          noiseSuppression: false,
          autoGainControl: false,
        },
      } as any)

      // Stop video tracks immediately — we only need audio
      stream.getVideoTracks().forEach((t) => t.stop())

      if (stream.getAudioTracks().length === 0) {
        stream.getTracks().forEach((t) => t.stop())
        setStatus('error')
        setWrapperError(
          'No audio track found. Make sure to enable "Share audio" in the sharing dialog and select a tab or window that is playing audio.',
        )
        return
      }

      pipeline.start(stream)
    } catch (err) {
      console.error('[useSystemAudioASR] getDisplayMedia failed:', err)
      if (err instanceof DOMException) {
        if (err.name === 'AbortError' || err.name === 'NotAllowedError') {
          setStatus('idle')
          setWrapperError('Permission denied — system audio capture requires your explicit permission')
          return
        }
        if (err.name === 'NotSupportedError') {
          setStatus('error')
          setWrapperError('System audio capture is not supported on this platform. Linux only supports tab audio — try Chrome/Edge on Windows or macOS for full system audio.')
          return
        }
      }
      setStatus('error')
      setWrapperError(err instanceof Error ? err.message : 'Failed to start system audio capture')
    }
  }

  const stopCapture = () => {
    pipeline.stop()
  }

  return {
    status,
    transcript: pipeline.transcript,
    partialTranscript: pipeline.partialTranscript,
    error: wrapperError ?? pipeline.error,
    startCapture,
    stopCapture,
  }
}
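Both wrapper hooks translate the shared pipeline status into a source-specific status inside a `useEffect` switch. The sketch below restates that translation for the system-audio wrapper as a pure function, assuming the status unions shown in this commit; `toSystemAudioStatus` and the local type aliases are hypothetical names, not part of the diff.

```typescript
// Local copies of the unions from this commit, redeclared so the sketch
// is self-contained.
type PipelineStatus = 'idle' | 'requesting' | 'streaming' | 'stopping' | 'error'
type SystemAudioStatusSketch = 'idle' | 'requesting' | 'capturing' | 'stopping' | 'error'

// Translate a pipeline status into the wrapper status. The effect in the
// hook has no case for 'requesting', so the wrapper's current value is kept.
function toSystemAudioStatus(
  pipeline: PipelineStatus,
  current: SystemAudioStatusSketch,
): SystemAudioStatusSketch {
  switch (pipeline) {
    case 'streaming':
      return 'capturing'
    case 'stopping':
      return 'stopping'
    case 'error':
      return 'error'
    case 'idle':
      return 'idle'
    default:
      return current
  }
}
```

In the hook, `'requesting'` is set by the wrapper itself before the permission prompt, which is why the translation leaves it alone. The mic wrapper follows the same shape with `'listening'` in place of `'capturing'`.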
@ -1,8 +1,10 @@
|
||||||
import React, { useState, useCallback, useEffect } from 'react'
|
import React, { useState, useCallback, useEffect, useMemo } from 'react'
|
||||||
import { Loader2, AlertCircle, FileText } from 'lucide-react'
|
import { Loader2, AlertCircle, FileText } from 'lucide-react'
|
||||||
import { Group, Panel, Separator } from 'react-resizable-panels'
|
import { Group, Panel, Separator } from 'react-resizable-panels'
|
||||||
import { useQueryDocumentStream } from '../lib/queries'
|
import { useQueryDocumentStream } from '../lib/queries'
|
||||||
import { useVideoASR } from '../hooks/useVideoASR'
|
import { useVideoASR } from '../hooks/useVideoASR'
|
||||||
|
import { useSystemAudioASR } from '../hooks/useSystemAudioASR'
|
||||||
|
import { useMicASR } from '../hooks/useMicASR'
|
||||||
import { useFullTranscript } from '../hooks/useFullTranscript'
|
import { useFullTranscript } from '../hooks/useFullTranscript'
|
||||||
import { getVideoUrl } from '../lib/api'
|
import { getVideoUrl } from '../lib/api'
|
||||||
import { QueryInput } from '../components/QueryInput'
|
import { QueryInput } from '../components/QueryInput'
|
||||||
|
|
@ -10,15 +12,20 @@ import { ExtractedQuestionsDisplay } from '../components/ExtractedQuestionsDispl
|
||||||
import { ResponsePanel } from '../components/ResponsePanel'
|
import { ResponsePanel } from '../components/ResponsePanel'
|
||||||
import { VideoUpload } from '../components/VideoUpload'
|
import { VideoUpload } from '../components/VideoUpload'
|
||||||
import { VideoPlayer } from '../components/VideoPlayer'
|
import { VideoPlayer } from '../components/VideoPlayer'
|
||||||
|
import { SourceSelector } from '../components/SourceSelector'
|
||||||
|
import { SystemAudioCapture } from '../components/SystemAudioCapture'
|
||||||
|
import { MicCapture } from '../components/MicCapture'
|
||||||
|
import type { SourceType } from '../types'
|
||||||
|
|
||||||
export const LTTPage: React.FC = () => {
|
export const LTTPage: React.FC = () => {
|
||||||
|
const [source, setSource] = useState<SourceType>('upload')
|
||||||
const [currentVideoId, setCurrentVideoId] = useState<string | null>(null)
|
const [currentVideoId, setCurrentVideoId] = useState<string | null>(null)
|
||||||
const [queryText, setQueryText] = useState('')
|
const [queryText, setQueryText] = useState('')
|
||||||
const [videoEl, setVideoEl] = useState<HTMLVideoElement | null>(null)
|
const [videoEl, setVideoEl] = useState<HTMLVideoElement | null>(null)
|
||||||
|
|
||||||
const queryStream = useQueryDocumentStream()
|
const queryStream = useQueryDocumentStream()
|
||||||
|
|
||||||
const asr = useVideoASR({
|
const uploadASR = useVideoASR({
|
||||||
videoId: currentVideoId ?? '',
|
videoId: currentVideoId ?? '',
|
||||||
videoElement: videoEl,
|
videoElement: videoEl,
|
||||||
language: 'yue',
|
language: 'yue',
|
||||||
|
|
@ -29,6 +36,24 @@ export const LTTPage: React.FC = () => {
|
||||||
|
|
||||||
const ft = useFullTranscript({ videoId: currentVideoId ?? '' })
|
const ft = useFullTranscript({ videoId: currentVideoId ?? '' })
|
||||||
|
|
||||||
|
const systemAudioWsUrl = useMemo(() => {
|
||||||
|
const protocol = window.location.protocol === 'https:' ? 'wss:' : 'ws:'
|
||||||
|
const host = import.meta.env.VITE_WS_HOST ?? window.location.host
|
||||||
|
return `${protocol}//${host}/ws/asr/${crypto.randomUUID()}?language=yue&source=system-audio`
|
||||||
|
}, [])
|
||||||
|
const micWsUrl = useMemo(() => {
|
||||||
|
const protocol = window.location.protocol === 'https:' ? 'wss:' : 'ws:'
|
||||||
|
const host = import.meta.env.VITE_WS_HOST ?? window.location.host
|
||||||
|
return `${protocol}//${host}/ws/asr/${crypto.randomUUID()}?language=yue&source=mic`
|
||||||
|
}, [])
|
||||||
|
|
||||||
|
const systemAudioASR = useSystemAudioASR({ wsUrl: systemAudioWsUrl })
|
||||||
|
const micASR = useMicASR({ wsUrl: micWsUrl })
|
||||||
|
|
||||||
|
const asr = source === 'system-audio' ? systemAudioASR
|
||||||
|
: source === 'mic' ? micASR
|
||||||
|
: uploadASR
|
||||||
|
|
||||||
useEffect(() => {
|
useEffect(() => {
|
||||||
if (ft.fullTranscript) {
|
if (ft.fullTranscript) {
|
||||||
setQueryText(ft.fullTranscript)
|
setQueryText(ft.fullTranscript)
|
||||||
|
|
@ -54,6 +79,9 @@ export const LTTPage: React.FC = () => {
|
||||||
|
|
||||||
return (
|
return (
|
||||||
<div className="h-full bg-gray-50">
|
<div className="h-full bg-gray-50">
|
||||||
|
<div className="px-4 pt-3">
|
||||||
|
<SourceSelector activeSource={source} onSelect={setSource} />
|
||||||
|
</div>
|
||||||
<Group
|
<Group
|
||||||
orientation="vertical"
|
orientation="vertical"
|
||||||
id="ltt-main-group"
|
id="ltt-main-group"
|
||||||
|
|
@ -65,7 +93,8 @@ export const LTTPage: React.FC = () => {
|
||||||
<Group orientation="horizontal" id="ltt-upper-group" className="h-full">
|
<Group orientation="horizontal" id="ltt-upper-group" className="h-full">
|
||||||
<Panel id="ltt-upper-left" minSize="30%" defaultSize={50}>
|
<Panel id="ltt-upper-left" minSize="30%" defaultSize={50}>
|
||||||
<div className="h-full p-4 overflow-hidden flex flex-col gap-3">
|
<div className="h-full p-4 overflow-hidden flex flex-col gap-3">
|
||||||
{currentVideoId ? (
|
{source === 'upload' ? (
|
||||||
|
currentVideoId ? (
|
||||||
<>
|
<>
|
||||||
<VideoPlayer ref={setVideoEl} src={videoUrl} />
|
<VideoPlayer ref={setVideoEl} src={videoUrl} />
|
||||||
<button
|
<button
|
||||||
|
|
@ -89,7 +118,7 @@ export const LTTPage: React.FC = () => {
|
||||||
<span>{ft.error}</span>
|
<span>{ft.error}</span>
|
||||||
</div>
|
</div>
|
||||||
)}
|
)}
|
||||||
{asr.status === 'error' && (
|
{uploadASR.status === 'error' && (
|
||||||
<div
|
<div
|
||||||
data-testid="asr-error-indicator"
|
data-testid="asr-error-indicator"
|
||||||
className="flex items-center gap-2 text-xs text-red-600 bg-red-50 border border-red-200 rounded px-2 py-1"
|
className="flex items-center gap-2 text-xs text-red-600 bg-red-50 border border-red-200 rounded px-2 py-1"
|
||||||
|
|
@ -101,6 +130,21 @@ export const LTTPage: React.FC = () => {
|
||||||
</>
|
</>
|
||||||
) : (
|
) : (
|
||||||
<VideoUpload onUploadSuccess={handleUploadSuccess} />
|
<VideoUpload onUploadSuccess={handleUploadSuccess} />
|
||||||
|
)
|
||||||
|
) : source === 'system-audio' ? (
|
||||||
|
<SystemAudioCapture
|
||||||
|
status={systemAudioASR.status}
|
||||||
|
error={systemAudioASR.error}
|
||||||
|
onStart={systemAudioASR.startCapture}
|
||||||
|
onStop={systemAudioASR.stopCapture}
|
||||||
|
/>
|
||||||
|
) : (
|
||||||
|
<MicCapture
|
||||||
|
status={micASR.status}
|
||||||
|
error={micASR.error}
|
||||||
|
onStart={micASR.startListening}
|
||||||
|
onStop={micASR.stopListening}
|
||||||
|
/>
|
||||||
)}
|
)}
|
||||||
</div>
|
</div>
|
||||||
</Panel>
|
</Panel>
|
||||||
|
|
|
||||||
|
|
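`LTTPage` builds one WebSocket URL per live source, choosing `wss:` when the page is served over HTTPS and tagging the request with `language` and `source` query parameters. A minimal sketch of that construction as a reusable helper follows; `buildAsrWsUrl` and `LiveSource` are hypothetical names, and the session id is passed in as an argument (the page uses `crypto.randomUUID()`) so the function stays deterministic.

```typescript
// The two live sources added in this commit.
type LiveSource = 'system-audio' | 'mic'

// Build the ASR WebSocket URL in the same shape the page uses:
//   {ws|wss}://{host}/ws/asr/{sessionId}?language=...&source=...
function buildAsrWsUrl(
  pageProtocol: string, // e.g. window.location.protocol
  host: string,         // e.g. VITE_WS_HOST or window.location.host
  sessionId: string,    // the page generates this with crypto.randomUUID()
  source: LiveSource,
  language: string = 'yue',
): string {
  const protocol = pageProtocol === 'https:' ? 'wss:' : 'ws:'
  const query = `language=${encodeURIComponent(language)}&source=${encodeURIComponent(source)}`
  return `${protocol}//${host}/ws/asr/${encodeURIComponent(sessionId)}?${query}`
}

// Example: an HTTPS page selecting the microphone source.
const exampleUrl = buildAsrWsUrl('https:', 'example.com', 'abc-123', 'mic')
// exampleUrl === 'wss://example.com/ws/asr/abc-123?language=yue&source=mic'
```

A shared helper like this would remove the duplication between the two `useMemo` blocks; whether that is worth it for two call sites is a judgment call.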
@ -195,3 +195,29 @@ export interface VideoUploadResponse {
  size_bytes: number
  url: string
}

// Phase 4 — Live audio capture types

export type SourceType = 'upload' | 'system-audio' | 'mic'

export type SystemAudioStatus = 'idle' | 'requesting' | 'capturing' | 'stopping' | 'error'

export type MicStatus = 'idle' | 'requesting' | 'listening' | 'stopping' | 'error'

export interface SystemAudioASRState {
  status: SystemAudioStatus
  transcript: string
  partialTranscript: string
  error: string | null
  startCapture: () => Promise<void>
  stopCapture: () => void
}

export interface MicASRState {
  status: MicStatus
  transcript: string
  partialTranscript: string
  error: string | null
  startListening: () => Promise<void>
  stopListening: () => void
}