# Phase 2: Video Upload + Video Audio ASR → RAG — Implementation Plan

**Created:** 2026-05-06
**Updated:** 2026-05-06 (video audio capture via createMediaElementSource; Full Transcript batch mode)
**Status:** Planning — Not Started
**Depends on:** Phase 1 (Complete)

---

## 1. Overview

Phase 2 adds video upload/playback and ASR transcription of the **video's audio track** (not the microphone). While the video plays, the browser captures the video's audio output and streams it to Alibaba Cloud DashScope for real-time transcription. A "Full Transcript" button requests batch (non-streaming) transcription of the complete audio track, which the backend extracts from the uploaded file with ffmpeg.

### Two ASR Modes

**Mode A — Streaming (real-time, auto on play):**
```
<video> → AudioContext.createMediaElementSource(video)
  → ScriptProcessor(4096, 1, 1) → Float32 PCM
  → WebSocket → FastAPI → DashScope realtime API
  → transcript JSON → QueryInput (in real time)
```
Auto-starts when the video plays and stops on pause, seek, or end. Partial transcripts flow into QueryInput as they arrive.

**Mode B — Full Transcript (batch, manual button):**
```
User clicks "Full Transcript" under video player
  → POST /api/v1/video/{id}/transcribe
  → Backend: ffmpeg extracts audio from uploaded video
  → DashScope OpenAI-compatible API (non-streaming)
  → Complete transcript of entire video → QueryInput
```
Audio extraction happens server-side via ffmpeg; the browser is not involved.
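
The server-side extraction step in Mode B could be sketched roughly as follows. This is a minimal sketch, not the planned implementation: the function names `build_ffmpeg_args` and `extract_audio` are illustrative, and the 16 kHz mono PCM16LE target is taken from task 2.3.5 later in this plan. It assumes an `ffmpeg` binary is available on the server's `PATH`.

```python
import asyncio
from pathlib import Path


def build_ffmpeg_args(video_path: Path, wav_path: Path) -> list[str]:
    """Build an ffmpeg command that writes 16 kHz mono PCM16LE audio as WAV."""
    return [
        "ffmpeg",
        "-nostdin",              # never wait on stdin when run from a server
        "-y",                    # overwrite the output file if it exists
        "-i", str(video_path),
        "-vn",                   # drop the video stream
        "-acodec", "pcm_s16le",  # signed 16-bit little-endian PCM
        "-ar", "16000",          # 16 kHz sample rate
        "-ac", "1",              # mono
        str(wav_path),
    ]


async def extract_audio(video_path: Path, wav_path: Path) -> Path:
    """Run ffmpeg without blocking the event loop; raise if it fails."""
    proc = await asyncio.create_subprocess_exec(
        *build_ffmpeg_args(video_path, wav_path),
        stdout=asyncio.subprocess.DEVNULL,
        stderr=asyncio.subprocess.PIPE,
    )
    _, stderr = await proc.communicate()
    if proc.returncode != 0:
        # Keep only the tail of stderr, which usually holds the actual error.
        raise RuntimeError(f"ffmpeg failed: {stderr.decode(errors='replace')[-500:]}")
    return wav_path
```

Keeping the argument list in its own function makes the command unit-testable without invoking ffmpeg.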

### Changes From Previous Versions

| Aspect | V2 (mic capture) | V3 (video audio capture) |
|---|---|---|
| Audio source | getUserMedia() microphone | createMediaElementSource(video) |
| Trigger | Manual record button | Auto on video play, stop on pause/end |
| Permissions | Microphone permission required | None |
| Batch mode | No | Yes — "Full Transcript" button |
| Backend ffmpeg | Not needed | For Full Transcript mode |

---

## 2. User Flow

1. User uploads video → appears in left panel player
2. User presses play → browser captures video audio → streams to DashScope → transcript in QueryInput
3. User pauses or seeks, or the video ends → streaming stops, accumulated transcript stays in QueryInput
4. User edits transcript in QueryInput and clicks Submit → Phase 1 RAG pipeline
5. **Full Transcript:** user clicks the "Full Transcript" button → server extracts audio → batch ASR → complete transcript fills QueryInput

---

## 3. Sub-Phases

### Phase 2.1 — Configuration & Infrastructure Setup (0.5 day)

Config fields, directory structure, service/router/model skeletons, register routers.

**Test:** `test_phase2_config.py`

**Tasks:**

| # | Task | File |
|---|------|------|
| 2.1.1 | Add 6 config fields: `dashscope_api_key`, `asr_model_name`, `asr_realtime_model_name`, `video_upload_dir`, `max_video_size_mb`, `supported_video_formats` | `core/config.py` |
| 2.1.2 | Update `.env.example` | `.env.example` |
| 2.1.3 | Add deps: `dashscope>=0.4.0`, `openai>=1.52.0`, `zhconv>=1.4.0`, `python-multipart`, `aiofiles` | `requirements.txt` |
| 2.1.4 | Create `models/video.py` — `VideoUploadResponse`, `FullTranscriptResponse` | `models/video.py` |
| 2.1.5 | Create `models/asr.py` — `ASRTranscriptEvent` | `models/asr.py` |
| 2.1.6 | Create `services/video_service.py`, `services/asr_client.py` stubs | `services/` |
| 2.1.7 | Create `routers/video.py` stub: `POST /upload`, `GET /{id}`, `POST /{id}/transcribe` | `routers/video.py` |
| 2.1.8 | Create `routers/ws_asr.py` stub: `WS /ws/asr/{video_id}?language=yue` | `routers/ws_asr.py` |
| 2.1.9 | Register routers in `main.py` | `main.py` |
| 2.1.10 | Write and pass `test_phase2_config.py` | `app/test/` |

---

### Phase 2.2 — Video Upload Backend (0.5 day)

Streaming upload with size/format validation. Reuses `routers/ingest.py` pattern.

**Test:** `test_phase2_video_upload.py` (implement 4 existing stubs)

**Tasks:**

| # | Task | File |
|---|------|------|
| 2.2.1 | Write tests — implement all 4 stubs | `test_phase2_video_upload.py` |
| 2.2.2 | Implement `VideoService.validate_video()`, `save_video()` (streaming, aiofiles) | `services/video_service.py` |
| 2.2.3 | Implement `VideoService.get_video_path()`, `delete_video()` | `services/video_service.py` |
| 2.2.4 | Implement `POST /api/v1/video/upload` route | `routers/video.py` |
| 2.2.5 | Implement `GET /api/v1/video/{video_id}` route (FileResponse) | `routers/video.py` |
| 2.2.6 | Run tests → pass → commit | — |
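
The validation and streaming-save logic in tasks 2.2.2 and 2.2.3 might look roughly like the sketch below. It is an illustration under stated assumptions, not the planned `VideoService`: the names `VideoValidationError`, `validate_extension`, and `save_stream` are hypothetical, the limits mirror the config defaults in section 6, and a synchronous file object stands in for the `aiofiles`-based streaming the plan calls for. The point it shows is enforcing the size limit while writing, so an oversized upload is rejected without buffering the whole file in memory.

```python
from pathlib import Path
from typing import BinaryIO, Iterable

MAX_VIDEO_SIZE_MB = 300  # mirrors max_video_size_mb
SUPPORTED_VIDEO_FORMATS = (".mp4", ".webm", ".mov", ".avi", ".mkv")
CHUNK_SIZE = 1024 * 1024  # read 1 MiB at a time


class VideoValidationError(ValueError):
    """Raised when an upload has an unsupported extension or is too large."""


def validate_extension(filename: str, allowed: Iterable[str] = SUPPORTED_VIDEO_FORMATS) -> str:
    """Return the lower-cased extension, or raise if it is not supported."""
    ext = Path(filename).suffix.lower()
    if ext not in allowed:
        raise VideoValidationError(f"unsupported format: {ext or '(none)'}")
    return ext


def save_stream(src: BinaryIO, dest: Path, max_mb: int = MAX_VIDEO_SIZE_MB) -> int:
    """Copy src to dest chunk by chunk, aborting once the size limit is exceeded."""
    limit = max_mb * 1024 * 1024
    written = 0
    try:
        with dest.open("wb") as out:
            while chunk := src.read(CHUNK_SIZE):
                written += len(chunk)
                if written > limit:
                    raise VideoValidationError(f"file exceeds {max_mb} MB")
                out.write(chunk)
    except VideoValidationError:
        dest.unlink(missing_ok=True)  # do not leave a partial file behind
        raise
    return written
```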

---

### Phase 2.3 — ASR WebSocket Proxy + Full Transcript Backend (1 day)

Two backend ASR paths: real-time streaming (WebSocket proxy to DashScope) and batch (ffmpeg extract → DashScope non-streaming API).

**Reference:** `.examples/alibaba_asr_backend.py`

**Tests:** `test_phase2_asr_client.py` (3 stubs), `test_phase2_ws_asr.py` (3 stubs), `test_phase2_ws_protocol.py` (new), `test_phase2_full_transcript.py` (new)

**Acceptance Criteria:**
- WebSocket `/ws/asr/{video_id}?language=yue` → Float32 PCM → S16_LE base64 → DashScope realtime
- `transcription.text` events → `{"full_text": "...", "is_final": false}` to browser
- `transcription.completed` events → `{"full_text": "...", "is_final": true}` to browser
- Language: `yue` (Cantonese), `zh`, `en`, `auto`
- Traditional Chinese via `zhconv`
- `POST /api/v1/video/{video_id}/transcribe` → ffmpeg extract audio → DashScope batch → `{"text": "..."}`
- `DASHSCOPE_API_KEY` not set → clear error
- Client disconnect → DashScope session closed cleanly
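
The audio conversion and browser message shape in the criteria above can be sketched as follows. This is a minimal sketch using only the standard library: `float32_to_s16le` is named in task 2.3.2, while `encode_audio_chunk` and `browser_message` are hypothetical helper names. It assumes the browser sends little-endian Float32 samples in the range -1.0 to 1.0, and it clamps out-of-range samples rather than letting them overflow.

```python
import base64
import json
import struct


def float32_to_s16le(float32_bytes: bytes) -> bytes:
    """Convert little-endian Float32 samples to signed 16-bit little-endian PCM."""
    count = len(float32_bytes) // 4  # ignore any trailing partial sample
    samples = struct.unpack(f"<{count}f", float32_bytes[: count * 4])
    # Clamp to [-1.0, 1.0] before scaling so loud samples do not overflow int16.
    ints = (int(max(-1.0, min(1.0, s)) * 32767) for s in samples)
    return struct.pack(f"<{count}h", *ints)


def encode_audio_chunk(float32_bytes: bytes) -> str:
    """Return the S16_LE audio as a base64 string, ready to embed in a JSON frame."""
    return base64.b64encode(float32_to_s16le(float32_bytes)).decode("ascii")


def browser_message(full_text: str, is_final: bool) -> str:
    """Serialize a transcript update in the shape the browser expects."""
    # ensure_ascii=False keeps Chinese text readable on the wire.
    return json.dumps({"full_text": full_text, "is_final": is_final}, ensure_ascii=False)
```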

**Tasks:**

| # | Task | File |
|---|------|------|
| 2.3.1 | Write tests first | `app/test/` |
| 2.3.2 | `float32_to_s16le()`, `build_display_text()`, `_to_traditional()` | `services/asr_client.py` |
| 2.3.3 | `DashScopeCallback` (sync SDK → asyncio.Queue bridge) + `_ws_proxy_dashscope()` | `routers/ws_asr.py` |
| 2.3.4 | WebSocket endpoint | `routers/ws_asr.py` |
| 2.3.5 | `VideoService.extract_audio()` — ffmpeg async subprocess: PCM16LE 16kHz mono | `services/video_service.py` |
| 2.3.6 | `ASRClient.transcribe_full()` — batch: WAV → DashScope OpenAI-compatible API | `services/asr_client.py` |
| 2.3.7 | `POST /api/v1/video/{video_id}/transcribe` route | `routers/video.py` |
| 2.3.8 | Enhance `conftest.py` mock_asr_client | `conftest.py` |
| 2.3.9 | Run tests → pass → commit | — |
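
The "sync SDK → asyncio.Queue bridge" in task 2.3.3 rests on one standard-library mechanism: a callback running on the SDK's thread must hand events to the event loop via `loop.call_soon_threadsafe`, because `asyncio.Queue` is not thread-safe. The sketch below shows only that mechanism. `QueueBridge`, `on_event`, `on_close`, and `drain` are hypothetical names, not the DashScope SDK's callback interface; the real `DashScopeCallback` would subclass whatever base class the SDK provides.

```python
import asyncio
from typing import Any


class QueueBridge:
    """Hand events from a callback running on another thread to an asyncio.Queue."""

    def __init__(self, loop: asyncio.AbstractEventLoop) -> None:
        self._loop = loop
        self.queue: asyncio.Queue = asyncio.Queue()

    def on_event(self, event: Any) -> None:
        # Called from the SDK's thread: schedule the put on the loop's own thread.
        self._loop.call_soon_threadsafe(self.queue.put_nowait, event)

    def on_close(self) -> None:
        # None acts as a sentinel telling the consumer that the session ended.
        self._loop.call_soon_threadsafe(self.queue.put_nowait, None)


async def drain(bridge: QueueBridge) -> list:
    """Collect events until the sentinel arrives (a real proxy would forward each one)."""
    received = []
    while (event := await bridge.queue.get()) is not None:
        received.append(event)
    return received
```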

---

### Phase 2.4 — Transcript → QueryInput + Full Transcript Button (0.5 day)

Wire up real-time transcript from streaming ASR into QueryInput. Full Transcript button wiring.

**Tests:** `test_phase2_useVideoASR.test.ts`, `test_phase2_useFullTranscript.test.ts`, `test_phase2_QueryInput_integration.test.tsx`

**Acceptance Criteria:**
- `useVideoASR` hook: auto-starts on video `play`, stops on `pause`/`ended`
- `useVideoASR` exposes `transcript`, `partialTranscript`, `isStreaming`, `status`
- `useFullTranscript` hook: `requestFullTranscript()` → loading → transcript → error
- QueryInput shows transcript (grey italic = partial, black = final)
- QueryInput accepts `partialText` prop

**Tasks:**

| # | Task | File |
|---|------|------|
| 2.4.1 | Write tests first | `src/test/` |
| 2.4.2 | Create `hooks/useVideoASR.ts` (see `.examples/alibaba_asr_frontend_react.tsx`) | `hooks/useVideoASR.ts` |
| 2.4.3 | Create `hooks/useFullTranscript.ts` | `hooks/useFullTranscript.ts` |
| 2.4.4 | Update `types/index.ts` — `ASRMessage`, `ASRStatus`, `FullTranscriptResponse` | `types/index.ts` |
| 2.4.5 | Update `QueryInput.tsx` — add `partialText` prop | `components/QueryInput.tsx` |
| 2.4.6 | Run tests → pass → commit | — |

---

### Phase 2.5 — Frontend: Video Player + Buttons + Layout (1.5 days)

Replace `VideoPlaceholder` with video upload + player. ASR auto on play. Full Transcript button.

```
┌─────────────────────┬──────────────────────────┐
│ VideoUpload /       │ QueryInput               │  ← Upper Panel (30%)
│ VideoPlayer         │ (transcript flows here   │
│                     │ from video audio ASR)    │
│ [Full Transcript]   │ [Submit]                 │
├─────────────────────┴──────────────────────────┤
│ ResponsePanel                                  │  ← Lower Panel (70%)
└────────────────────────────────────────────────┘
```

**Tests:** `test_phase2_VideoUpload.test.tsx`, `test_phase2_VideoPlayer.test.tsx`, `test_phase2_LTTPage_integration.test.tsx`

**Acceptance Criteria:**
- Drag-and-drop video upload with progress bar (native HTML5)
- Video player with native `<video controls>` exposing ref
- ASR auto on play → transcript in QueryInput; stops on pause/end
- "Full Transcript" button → loading spinner → fills QueryInput with full transcript
- Error states: upload fails, ASR fails, Full Transcript fails → clear messages

**Tasks:**

| # | Task | File |
|---|------|------|
| 2.5.1 | Write all 3 tests first | `src/test/` |
| 2.5.2 | Create `VideoUpload.tsx` — native drag-drop, axios progress | `components/VideoUpload.tsx` |
| 2.5.3 | Create `VideoPlayer.tsx` — native `<video controls>`, forwardRef | `components/VideoPlayer.tsx` |
| 2.5.4 | Update `types/index.ts` | `types/index.ts` |
| 2.5.5 | Update `lib/api.ts` — `uploadVideo()`, `getVideoUrl()`, `requestFullTranscript()` | `lib/api.ts` |
| 2.5.6 | Update `lib/queries.tsx` — `useVideoUpload()` | `lib/queries.tsx` |
| 2.5.7 | Refactor `LTTPage.tsx` — replace VideoPlaceholder, wire hooks + QueryInput | `pages/LTTPage.tsx` |
| 2.5.8 | Update `QueryInput.tsx` — transcript value + partial text styling | `components/QueryInput.tsx` |
| 2.5.9 | Run tests → pass → commit | — |

---

### Phase 2.6 — Integration & Acceptance Testing (1 day)

**Tests:** `test_integration_phase2.py`, `test_acceptance_phase2_video.py`, `test_acceptance_phase2_asr.py`, `test_acceptance_integration_phase2.py`

**Tasks:**

| # | Task |
|---|------|
| 2.6.1 | Implement integration test (mocked DashScope, real ChromaDB + file I/O) |
| 2.6.2 | Implement acceptance: real video upload + Full Transcript |
| 2.6.3 | Implement acceptance: real DashScope streaming + batch |
| 2.6.4 | Implement E2E acceptance |
| 2.6.5 | Full regression run |
| 2.6.6 | Fix failures, final commit |

---

### Phase 2.7 — Polish & Deployment (0.5 day)

| # | Task |
|---|------|
| 2.7.1 | Structured logging for DashScope proxy + full transcript events |
| 2.7.2 | Update `nginx.conf` — `client_max_body_size` 350M |
| 2.7.3 | Verify production build |
| 2.7.4 | Update `README.md` |
| 2.7.5 | Final commit |

---

## 4. Timeline

| Sub-Phase | Description | Effort | Depends On |
|---|---|---|---|
| 2.1 | Config & Infrastructure | 0.5 day | — |
| 2.2 | Video Upload Backend | 0.5 day | 2.1 |
| 2.3 | ASR Proxy + Full Transcript | 1 day | 2.1 |
| 2.4 | Transcript → QueryInput | 0.5 day | 2.3 |
| 2.5 | Frontend: Layout + Buttons | 1.5 days | 2.2, 2.3 |
| 2.6 | Integration & Acceptance | 1 day | 2.4, 2.5 |
| 2.7 | Polish & Deployment | 0.5 day | 2.6 |
| **Total** | | **5.5 days** | |

Sub-phases 2.2 (upload) and 2.3 (ASR) both depend only on 2.1, so they can run concurrently. The total above is summed effort, not elapsed time.
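
As a worked check of the table, the short script below computes both the summed effort and the shortest possible elapsed time implied by the "Depends On" column. The elapsed figure assumes independent sub-phases are actually staffed in parallel, which this plan does not state; with a single developer the elapsed time equals the summed effort.

```python
from functools import lru_cache

# Effort in days and dependencies, copied from the timeline table.
EFFORT = {"2.1": 0.5, "2.2": 0.5, "2.3": 1.0, "2.4": 0.5, "2.5": 1.5, "2.6": 1.0, "2.7": 0.5}
DEPENDS_ON = {
    "2.1": [],
    "2.2": ["2.1"],
    "2.3": ["2.1"],
    "2.4": ["2.3"],
    "2.5": ["2.2", "2.3"],
    "2.6": ["2.4", "2.5"],
    "2.7": ["2.6"],
}


@lru_cache(maxsize=None)
def earliest_finish(phase: str) -> float:
    """A sub-phase can start only after all of its dependencies have finished."""
    start = max((earliest_finish(dep) for dep in DEPENDS_ON[phase]), default=0.0)
    return start + EFFORT[phase]


total_effort = sum(EFFORT.values())
elapsed = max(earliest_finish(phase) for phase in EFFORT)
print(total_effort, elapsed)  # 5.5 4.5 (critical path: 2.1 → 2.3 → 2.5 → 2.6 → 2.7)
```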

---

## 5. Dependencies

**Backend:** `dashscope>=0.4.0`, `openai>=1.52.0`, `zhconv>=1.4.0`, `python-multipart`, `aiofiles`
**Frontend:** None (native Web APIs: `AudioContext.createMediaElementSource`, `ScriptProcessorNode`, `<video>`, HTML5 drag-and-drop)
**System:** ffmpeg on server (for Full Transcript audio extraction)

---

## 6. Config Fields

```python
dashscope_api_key: str = ""
asr_model_name: str = "qwen3-asr-flash"                    # Batch API
asr_realtime_model_name: str = "qwen3-asr-flash-realtime"  # Streaming
video_upload_dir: str = "./uploads"
max_video_size_mb: int = 300
supported_video_formats: list[str] = [".mp4", ".webm", ".mov", ".avi", ".mkv"]
```

---

## 7. Key Design Decisions

| Decision | Choice | Why |
|---|---|---|
| Audio source | `createMediaElementSource(video)` | Captures video audio during playback. No mic permission. |
| ASR auto trigger | Video `play` event | Transcript appears as user watches. Natural UX. |
| ASR stop trigger | Video `pause`/`ended` events | Clean lifecycle. New session on next play/seek. |
| Full Transcript | Manual button + server ffmpeg | User explicitly requests. Server has the file. |
| Full Transcript ASR | DashScope OpenAI-compatible API | Standard `/v1/chat/completions` with `input_audio`. WAV format. |
| ASR streaming | DashScope realtime SDK | `OmniRealtimeConversation` + callback → asyncio.Queue bridge |
| Transcript display | QueryInput textarea | Editable. Same box for typing or ASR output. |
| SSL | Required | Chrome may block `createMediaElementSource` without secure context. |
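
For the "WAV format" requirement in the Full Transcript ASR row, the sketch below wraps raw PCM16LE samples in a WAV container with the standard-library `wave` module and base64-encodes the result. It applies only if audio extraction yields raw PCM rather than a finished WAV file. The helper names `pcm16_to_wav` and `wav_to_base64` are hypothetical, and how the encoded audio is placed in the request body is deliberately left out because this plan does not specify the payload.

```python
import base64
import io
import wave


def pcm16_to_wav(pcm: bytes, sample_rate: int = 16000, channels: int = 1) -> bytes:
    """Wrap raw signed 16-bit little-endian PCM in a WAV (RIFF) container."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buf.getvalue()


def wav_to_base64(wav_bytes: bytes) -> str:
    """Base64-encode WAV bytes so they can travel inside a JSON request."""
    return base64.b64encode(wav_bytes).decode("ascii")
```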

---

## 8. File Manifest

### New Files
```
backend/
  app/routers/video.py
  app/routers/ws_asr.py
  app/services/video_service.py
  app/services/asr_client.py
  app/models/video.py
  app/models/asr.py
  app/test/test_phase2_config.py
  app/test/test_phase2_ws_protocol.py
  app/test/test_phase2_full_transcript.py
  app/test/test_phase2_transcript_to_rag.py

frontend/src/
  components/VideoUpload.tsx
  components/VideoPlayer.tsx
  hooks/useVideoASR.ts
  hooks/useFullTranscript.ts
  test/test_phase2_VideoUpload.test.tsx
  test/test_phase2_VideoPlayer.test.tsx
  test/test_phase2_useVideoASR.test.ts
  test/test_phase2_useFullTranscript.test.ts
  test/test_phase2_QueryInput_integration.test.tsx
  test/test_phase2_LTTPage_integration.test.tsx
```

### Modified Files
```
backend/app/core/config.py, main.py, test/conftest.py, .env.example, requirements.txt
frontend/src/pages/LTTPage.tsx, components/QueryInput.tsx, lib/api.ts, lib/queries.tsx, types/index.ts
```

---

## 9. Reference Code (`.examples/`)

| File | Content |
|---|---|
| `alibaba_asr_backend.py` | DashScope WebSocket proxy + non-streaming fallback (FastAPI) |
| `alibaba_asr_frontend_vanilla.html` | Browser audio capture (vanilla JS, original) |
| `alibaba_asr_frontend_react.tsx` | React/TS: `useVideoASR` (streaming) + `useFullTranscript` (batch) hooks |
| `README.md` | Architecture overview + dependency notes |