17 KiB
Phase 2: Video Upload + Video Audio ASR → RAG — Implementation Plan
Created: 2026-05-06 Updated: 2026-05-06 (all sub-phases complete; 11 bugs resolved) Status: Complete Depends on: Phase 1 (Complete)
1. Overview
Phase 2 adds video upload/playback and ASR transcription of the video's audio track (not microphone). When the video plays, browser captures the video audio output and streams it to Alibaba Cloud DashScope for real-time transcription. A "Full Transcript" button sends the complete video audio for batch (non-streaming) transcription via backend ffmpeg extraction.
Two ASR Modes
Mode A — Streaming (real-time, auto on play):
<video> → AudioContext.createMediaElementSource(video)
→ ScriptProcessor(4096, 1, 1) → Float32 PCM
→ WebSocket → FastAPI → DashScope realtime API
→ transcript JSON → QueryInput (in real time)
Auto-starts when video plays, stops on pause/seek/end. Partial transcript flows into QueryInput.
Mode B — Full Transcript (batch, manual button):
User clicks "Full Transcript" under video player
→ POST /api/v1/video/{id}/transcribe
→ Backend: ffmpeg extracts audio from uploaded video
→ DashScope OpenAI-compatible API (non-streaming)
→ Complete transcript of entire video → QueryInput
Server-side audio extraction via ffmpeg. No browser involvement.
Changes From Previous Versions
| Aspect | V2 (mic capture) | V3 (video audio capture) |
|---|---|---|
| Audio source | getUserMedia() microphone | createMediaElementSource(video) |
| Trigger | Manual record button | Auto on video play, stop on pause/end |
| Permissions | Microphone permission required | None |
| Batch mode | No | Yes — "Full Transcript" button |
| Backend ffmpeg | Not needed | For Full Transcript mode |
2. User Flow
- User uploads video → appears in left panel player
- User presses play → browser captures video audio → streams to DashScope → transcript in QueryInput
- User pauses/seeks/ends → streaming stops, accumulated transcript stays in QueryInput
- User edits transcript in QueryInput and clicks Submit → Phase 1 RAG pipeline
- Full Transcript: clicks "Full Transcript" button → server extracts audio → batch ASR → complete transcript fills QueryInput
3. Sub-Phases
Phase 2.1 — Configuration & Infrastructure Setup (0.5 day)
Config fields, directory structure, service/router/model skeletons, register routers.
Test: test_phase2_config.py
Tasks:
| # | Task | File |
|---|---|---|
| 2.1.1 | Add 6 config fields: dashscope_api_key, asr_model_name, asr_realtime_model_name, video_upload_dir, max_video_size_mb, supported_video_formats |
core/config.py |
| 2.1.2 | Update .env.example |
.env.example |
| 2.1.3 | Add deps: dashscope>=0.4.0, openai>=1.52.0, zhconv>=1.4.0, python-multipart, aiofiles |
requirements.txt |
| 2.1.4 | Create models/video.py — VideoUploadResponse, FullTranscriptResponse |
models/video.py |
| 2.1.5 | Create models/asr.py — ASRTranscriptEvent |
models/asr.py |
| 2.1.6 | Create services/video_service.py, services/asr_client.py stubs |
services/ |
| 2.1.7 | Create routers/video.py stub: POST /upload, GET /{id}, POST /{id}/transcribe |
routers/video.py |
| 2.1.8 | Create routers/ws_asr.py stub: WS /ws/asr/{video_id}?language=yue |
routers/ws_asr.py |
| 2.1.9 | Register routers in main.py |
main.py |
| 2.1.10 | Write and pass test_phase2_config.py |
app/test/ |
Phase 2.2 — Video Upload Backend (0.5 day)
Streaming upload with size/format validation. Reuses routers/ingest.py pattern.
Test: test_phase2_video_upload.py (implement 4 existing stubs)
Tasks:
| # | Task | File |
|---|---|---|
| 2.2.1 | Write tests — implement all 4 stubs | test_phase2_video_upload.py |
| 2.2.2 | Implement VideoService.validate_video(), save_video() (streaming, aiofiles) |
services/video_service.py |
| 2.2.3 | Implement VideoService.get_video_path(), delete_video() |
services/video_service.py |
| 2.2.4 | Implement POST /api/v1/video/upload route |
routers/video.py |
| 2.2.5 | Implement GET /api/v1/video/{video_id} route (FileResponse) |
routers/video.py |
| 2.2.6 | Run tests → pass → commit | — |
Phase 2.3 — ASR WebSocket Proxy + Full Transcript Backend (1 day)
Two backend ASR paths: real-time streaming (WebSocket proxy to DashScope) and batch (ffmpeg extract → DashScope non-streaming API).
Reference: .examples/alibaba_asr_backend.py
Tests: test_phase2_asr_client.py (3 stubs), test_phase2_ws_asr.py (3 stubs), test_phase2_ws_protocol.py (new), test_phase2_full_transcript.py (new)
Acceptance Criteria:
- WebSocket
/ws/asr/{video_id}?language=yue→ Float32 PCM → S16_LE base64 → DashScope realtime transcription.textevents →{"full_text": "...", "is_final": false}to browsertranscription.completedevents →{"full_text": "...", "is_final": true}to browser- Language:
yue(Cantonese),zh,en,auto - Traditional Chinese via
zhconv POST /api/v1/video/{video_id}/transcribe→ ffmpeg extract audio → DashScope batch →{"text": "..."}DASHSCOPE_API_KEYnot set → clear error- Client disconnect → DashScope session closed cleanly
Tasks:
| # | Task | File |
|---|---|---|
| 2.3.1 | Write tests first | app/test/ |
| 2.3.2 | float32_to_s16le(), build_display_text(), _to_traditional() |
services/asr_client.py |
| 2.3.3 | DashScopeCallback (sync SDK → asyncio.Queue bridge) + _ws_proxy_dashscope() |
routers/ws_asr.py |
| 2.3.4 | WebSocket endpoint | routers/ws_asr.py |
| 2.3.5 | VideoService.extract_audio() — ffmpeg async subprocess: PCM16LE 16kHz mono |
services/video_service.py |
| 2.3.6 | ASRClient.transcribe_full() — batch: WAV → DashScope OpenAI-compatible API |
services/asr_client.py |
| 2.3.7 | POST /api/v1/video/{video_id}/transcribe route |
routers/video.py |
| 2.3.8 | Enhance conftest.py mock_asr_client |
conftest.py |
| 2.3.9 | Run tests → pass → commit | — |
Phase 2.4 — Transcript → QueryInput + Full Transcript Button (0.5 day)
Wire up real-time transcript from streaming ASR into QueryInput. Full Transcript button wiring.
Tests: test_phase2_useVideoASR.test.ts, test_phase2_useFullTranscript.test.ts, test_phase2_QueryInput_integration.test.tsx
Acceptance Criteria:
useVideoASRhook: auto-starts on videoplay, stops onpause/endeduseVideoASRexposestranscript,partialTranscript,isStreaming,statususeFullTranscripthook:requestFullTranscript()→ loading → transcript → error- QueryInput shows transcript (grey italic = partial, black = final)
- QueryInput accepts
partialTextprop
Tasks:
| # | Task | File |
|---|---|---|
| 2.4.1 | Write tests first | src/test/ |
| 2.4.2 | Create hooks/useVideoASR.ts (see .examples/alibaba_asr_frontend_react.tsx) |
hooks/useVideoASR.ts |
| 2.4.3 | Create hooks/useFullTranscript.ts |
hooks/useFullTranscript.ts |
| 2.4.4 | Update types/index.ts — ASRMessage, ASRStatus, FullTranscriptResponse |
types/index.ts |
| 2.4.5 | Update QueryInput.tsx — add partialText prop |
components/QueryInput.tsx |
| 2.4.6 | Run tests → pass → commit | — |
Phase 2.5 — Frontend: Video Player + Buttons + Layout (1.5 days)
Replace VideoPlaceholder with video upload + player. ASR auto on play. Full Transcript button.
┌─────────────────────┬──────────────────────────┐
│ VideoUpload / │ QueryInput │ ← Upper Panel (30%)
│ VideoPlayer │ (transcript flows here │
│ │ from video audio ASR) │
│ [Full Transcript] │ [Submit] │
├─────────────────────┴──────────────────────────┤
│ ResponsePanel │ ← Lower Panel (70%)
└────────────────────────────────────────────────┘
Tests: test_phase2_VideoUpload.test.tsx, test_phase2_VideoPlayer.test.tsx, test_phase2_LTTPage_integration.test.tsx
Acceptance Criteria:
- Drag-and-drop video upload with progress bar (native HTML5)
- Video player with native
<video controls>exposing ref - ASR auto on play → transcript in QueryInput; stops on pause/end
- "Full Transcript" button → loading spinner → fills QueryInput with full transcript
- Error states: upload fails, ASR fails, Full Transcript fails → clear messages
Tasks:
| # | Task | File |
|---|---|---|
| 2.5.1 | Write all 3 tests first | src/test/ |
| 2.5.2 | Create VideoUpload.tsx — native drag-drop, axios progress |
components/VideoUpload.tsx |
| 2.5.3 | Create VideoPlayer.tsx — native <video controls>, forwardRef |
components/VideoPlayer.tsx |
| 2.5.4 | Update types/index.ts |
types/index.ts |
| 2.5.5 | Update lib/api.ts — uploadVideo(), getVideoUrl(), requestFullTranscript() |
lib/api.ts |
| 2.5.6 | Update lib/queries.tsx — useVideoUpload() |
lib/queries.tsx |
| 2.5.7 | Refactor LTTPage.tsx — replace VideoPlaceholder, wire hooks + QueryInput |
pages/LTTPage.tsx |
| 2.5.8 | Update QueryInput.tsx — transcript value + partial text styling |
components/QueryInput.tsx |
| 2.5.9 | Run tests → pass → commit | — |
Phase 2.6 — Integration & Acceptance Testing (1 day)
Tests: test_integration_phase2.py, test_acceptance_phase2_video.py, test_acceptance_phase2_asr.py, test_acceptance_integration_phase2.py
Tasks:
| # | Task |
|---|---|
| 2.6.1 | Implement integration test (mocked DashScope, real ChromaDB + file I/O) |
| 2.6.2 | Implement acceptance: real video upload + Full Transcript |
| 2.6.3 | Implement acceptance: real DashScope streaming + batch |
| 2.6.4 | Implement E2E acceptance |
| 2.6.5 | Full regression run |
| 2.6.6 | Fix failures, final commit |
Phase 2.7 — Polish & Deployment (0.5 day)
| # | Task |
|---|---|
| 2.7.1 | Structured logging for DashScope proxy + full transcript events |
| 2.7.2 | Update nginx.conf — client_max_body_size 350M |
| 2.7.3 | Verify production build |
| 2.7.4 | Update README.md |
| 2.7.5 | Final commit |
4. Timeline
| Sub-Phase | Description | Effort | Depends On |
|---|---|---|---|
| 2.1 | Config & Infrastructure | 0.5 day | — |
| 2.2 | Video Upload Backend | 0.5 day | 2.1 |
| 2.3 | ASR Proxy + Full Transcript | 1 day | 2.1 |
| 2.4 | Transcript → QueryInput | 0.5 day | 2.3 |
| 2.5 | Frontend: Layout + Buttons | 1.5 days | 2.2, 2.3 |
| 2.6 | Integration & Acceptance | 1 day | 2.4, 2.5 |
| 2.7 | Polish & Deployment | 0.5 day | 2.6 |
| Total | 5.5 days |
2.2 (upload) and 2.3 (ASR) run concurrently.
5. Dependencies
Backend: dashscope>=0.4.0, openai>=1.52.0, zhconv>=1.4.0, python-multipart, aiofiles
Frontend: None (native Web APIs: AudioContext.createMediaElementSource, ScriptProcessorNode, <video>, HTML5 drag-and-drop)
System: ffmpeg on server (for Full Transcript audio extraction)
6. Config Fields
dashscope_api_key: str = ""
asr_model_name: str = "qwen3-asr-flash" # Batch API
asr_realtime_model_name: str = "qwen3-asr-flash-realtime" # Streaming
video_upload_dir: str = "./uploads"
max_video_size_mb: int = 300
supported_video_formats: list[str] = [".mp4", ".webm", ".mov", ".avi", ".mkv"]
7. Key Design Decisions
| Decision | Choice | Why |
|---|---|---|
| Audio source | createMediaElementSource(video) |
Captures video audio during playback. No mic permission. |
| ASR auto trigger | Video play event |
Transcript appears as user watches. Natural UX. |
| ASR stop trigger | Video pause/ended events |
Clean lifecycle. New session on next play/seek. |
| Full Transcript | Manual button + server ffmpeg | User explicitly requests. Server has the file. |
| Full Transcript ASR | DashScope OpenAI-compatible API | Standard /v1/chat/completions with input_audio. WAV format. |
| ASR streaming | DashScope realtime SDK | OmniRealtimeConversation + callback → asyncio.Queue bridge |
| Transcript display | QueryInput textarea | Editable. Same box for typing or ASR output. |
| SSL | Required | Chrome may block createMediaElementSource without secure context. |
8. File Manifest
New Files
backend/
app/routers/video.py
app/routers/ws_asr.py
app/services/video_service.py
app/services/asr_client.py
app/models/video.py
app/models/asr.py
app/test/test_phase2_config.py
app/test/test_phase2_ws_protocol.py
app/test/test_phase2_full_transcript.py
app/test/test_phase2_transcript_to_rag.py
frontend/src/
components/VideoUpload.tsx
components/VideoPlayer.tsx
hooks/useVideoASR.ts
hooks/useFullTranscript.ts
test/test_phase2_VideoUpload.test.tsx
test/test_phase2_VideoPlayer.test.tsx
test/test_phase2_useVideoASR.test.ts
test/test_phase2_useFullTranscript.test.ts
test/test_phase2_QueryInput_integration.test.tsx
test/test_phase2_LTTPage_integration.test.tsx
Modified Files
backend/app/core/config.py, main.py, test/conftest.py, .env.example, requirements.txt
frontend/src/pages/LTTPage.tsx, components/QueryInput.tsx, lib/api.ts, lib/queries.tsx, types/index.ts
10. Bugs Resolved (11 Total)
| # | Bug | Root Cause | Fix |
|---|---|---|---|
| 1 | Vite proxy missing | /api and /ws routes not proxied to backend port 8000 |
Added proxy config in vite.config.ts |
| 2 | crossOrigin missing on <video> |
Cross-origin video blocked createMediaElementSource |
Added crossOrigin="anonymous" to VideoPlayer.tsx |
| 3 | Audio silent (consumed but not passed through) | onaudioprocess didn't copy input to output |
Added outputData.set(float32Data) |
| 4 | Audio graph recreated every play/pause | createMediaElementSource can only be called once per element |
Setup audio graph once per video lifetime (graphSetupRef) |
| 5 | AudioContext suspended on play |
Autoplay policy requires user gesture | Added audioContext.resume() on play |
| 6 | dashscope package not installed |
Missing from requirements.txt |
Installed dashscope>=0.4.0 |
| 7 | api_key not passed to DashScope |
OmniRealtimeConversation needs explicit api_key |
Added api_key=settings.dashscope_api_key |
| 8 | data_url MIME type wrong |
data:audio/wav;base64 instead of data:audio/wav;base64, |
Fixed data URL format in transcribe_full() |
| 9 | extra_body sent when language="auto" |
DashScope rejects extra_body with auto-detect |
Omitted extra_body entirely when language="auto" |
| 10 | Text accumulation broken — stashes are sliding windows | DashScope stash is a rolling ~7-char window, not cumulative. Each partial event replaces the previous. Completed events rarely sent. Old code showed partials as-is (text jumped). Simplification silently dropped partials (nothing appeared). |
Added _merge_stash() — finds overlapping suffix between successive stashes and appends only the new chars. Partial events now contribute to a growing partial_buffer. full_text = accumulated + partial_buffer. |
| 11 | Text disappeared on pause | stopStreaming set partialTranscript = '' and onFinalTranscript never fired (no completed events). |
Removed blind clear; added transcriptRef to avoid stale closures; stopStreaming now fires onFinalTranscript(currentText) then clears partial. |
Text Accumulation Architecture (Bug #10 Detail)
Raw DashScope stashes (sliding window, ~7 chars each):
"系多謝主席" → "主席咁咧呢個" → "呢個古洞北" → "三百二十五億"
_merge_stash() reconstruction:
"" + "系多謝主席" → "系多謝主席"
"系多謝主席" + overlap("主席")="咁咧呢個" → "系多謝主席咁咧呢個"
"系多謝主席咁咧呢個" + overlap("呢個")="古洞北" → "系多謝主席咁咧呢個古洞北"
"系多謝主席咁咧呢個古洞北" + no overlap → append with space → "系多謝主席咁咧呢個古洞北 三百二十五億"
Sent to frontend: {"full_text": accumulated + partial_buffer, "is_final": false}
On pause: onFinalTranscript fires with current text, partialTranscript cleared
11. Reference Code (.examples/)
| File | Content |
|---|---|
alibaba_asr_backend.py |
DashScope WebSocket proxy + non-streaming fallback (FastAPI) |
alibaba_asr_frontend_vanilla.html |
Browser audio capture (vanilla JS, original) |
alibaba_asr_frontend_react.tsx |
React/TS: useVideoASR (streaming) + useFullTranscript (batch) hooks |
README.md |
Architecture overview + dependency notes |