Phase 2: Video Upload + Video Audio ASR → RAG — Implementation Plan

Created: 2026-05-06
Updated: 2026-05-06 (all sub-phases complete; 11 bugs resolved)
Status: Complete
Depends on: Phase 1 (Complete)

1. Overview

Phase 2 adds video upload/playback and ASR transcription of the video's audio track (not the microphone). While the video plays, the browser captures the video's audio output and streams it through the backend to Alibaba Cloud DashScope for real-time transcription. A "Full Transcript" button triggers batch (non-streaming) transcription of the complete audio track, which the backend extracts from the uploaded file with ffmpeg.

Two ASR Modes

Mode A — Streaming (real-time, auto on play):

<video> → AudioContext.createMediaElementSource(video)
  → ScriptProcessor(4096, 1, 1) → Float32 PCM
  → WebSocket → FastAPI → DashScope realtime API
  → transcript JSON → QueryInput (in real time)

Auto-starts when video plays, stops on pause/seek/end. Partial transcript flows into QueryInput.

Mode B — Full Transcript (batch, manual button):

User clicks "Full Transcript" under video player
  → POST /api/v1/video/{id}/transcribe
  → Backend: ffmpeg extracts audio from uploaded video
  → DashScope OpenAI-compatible API (non-streaming)
  → Complete transcript of entire video → QueryInput

Server-side audio extraction via ffmpeg. No browser involvement.

Changes From Previous Versions

Aspect	V2 (mic capture)	V3 (video audio capture)
Audio source	getUserMedia() microphone	createMediaElementSource(video)
Trigger	Manual record button	Auto on video play, stop on pause/end
Permissions	Microphone permission required	None
Batch mode	No	Yes — "Full Transcript" button
Backend ffmpeg	Not needed	For Full Transcript mode

2. User Flow

User uploads video → appears in left panel player
User presses play → browser captures video audio → streams to DashScope → transcript in QueryInput
User pauses/seeks/ends → streaming stops, accumulated transcript stays in QueryInput
User edits transcript in QueryInput and clicks Submit → Phase 1 RAG pipeline
Full Transcript: clicks "Full Transcript" button → server extracts audio → batch ASR → complete transcript fills QueryInput

3. Sub-Phases

Phase 2.1 — Configuration & Infrastructure Setup (0.5 day)

Config fields, directory structure, service/router/model skeletons, register routers.

Test: test_phase2_config.py

Tasks:

#	Task	File
2.1.1	Add 6 config fields: `dashscope_api_key`, `asr_model_name`, `asr_realtime_model_name`, `video_upload_dir`, `max_video_size_mb`, `supported_video_formats`	`core/config.py`
2.1.2	Update `.env.example`	`.env.example`
2.1.3	Add deps: `dashscope>=0.4.0`, `openai>=1.52.0`, `zhconv>=1.4.0`, `python-multipart`, `aiofiles`	`requirements.txt`
2.1.4	Create `models/video.py` — `VideoUploadResponse`, `FullTranscriptResponse`	`models/video.py`
2.1.5	Create `models/asr.py` — `ASRTranscriptEvent`	`models/asr.py`
2.1.6	Create `services/video_service.py`, `services/asr_client.py` stubs	`services/`
2.1.7	Create `routers/video.py` stub: `POST /upload`, `GET /{id}`, `POST /{id}/transcribe`	`routers/video.py`
2.1.8	Create `routers/ws_asr.py` stub: `WS /ws/asr/{video_id}?language=yue`	`routers/ws_asr.py`
2.1.9	Register routers in `main.py`	`main.py`
2.1.10	Write and pass `test_phase2_config.py`	`app/test/`

Phase 2.2 — Video Upload Backend (0.5 day)

Streaming upload with size/format validation. Reuses routers/ingest.py pattern.

Test: test_phase2_video_upload.py (implement 4 existing stubs)

Tasks:

#	Task	File
2.2.1	Write tests — implement all 4 stubs	`test_phase2_video_upload.py`
2.2.2	Implement `VideoService.validate_video()`, `save_video()` (streaming, aiofiles)	`services/video_service.py`
2.2.3	Implement `VideoService.get_video_path()`, `delete_video()`	`services/video_service.py`
2.2.4	Implement `POST /api/v1/video/upload` route	`routers/video.py`
2.2.5	Implement `GET /api/v1/video/{video_id}` route (FileResponse)	`routers/video.py`
2.2.6	Run tests → pass → commit	—
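
The size/format validation in 2.2.2 can be pictured roughly as below. This is a minimal sketch, not the project's `VideoService.validate_video()`: the function signature and the `VideoValidationError` type are assumptions made for illustration, and the limits simply mirror the defaults listed under "Config Fields".

```python
from pathlib import Path

# Defaults copied from the "Config Fields" section of this plan.
MAX_VIDEO_SIZE_MB = 300
SUPPORTED_VIDEO_FORMATS = (".mp4", ".webm", ".mov", ".avi", ".mkv")


class VideoValidationError(ValueError):
    """Hypothetical error type; the real route may raise an HTTP error instead."""


def validate_video(filename: str, size_bytes: int,
                   max_size_mb: int = MAX_VIDEO_SIZE_MB,
                   formats: tuple = SUPPORTED_VIDEO_FORMATS) -> str:
    """Return the normalized extension, or raise if the upload is not acceptable."""
    ext = Path(filename).suffix.lower()
    if ext not in formats:
        raise VideoValidationError(f"unsupported format: {ext or '(none)'}")
    if size_bytes <= 0:
        raise VideoValidationError("empty upload")
    if size_bytes > max_size_mb * 1024 * 1024:
        raise VideoValidationError(f"file exceeds {max_size_mb} MB")
    return ext
```

In a streaming upload the final size is not known up front, so the same size check would presumably run against the running byte count as chunks arrive.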

Phase 2.3 — ASR WebSocket Proxy + Full Transcript Backend (1 day)

Two backend ASR paths: real-time streaming (WebSocket proxy to DashScope) and batch (ffmpeg extract → DashScope non-streaming API).

Reference: .examples/alibaba_asr_backend.py

Tests: test_phase2_asr_client.py (3 stubs), test_phase2_ws_asr.py (3 stubs), test_phase2_ws_protocol.py (new), test_phase2_full_transcript.py (new)

Acceptance Criteria:

WebSocket /ws/asr/{video_id}?language=yue → Float32 PCM → S16_LE base64 → DashScope realtime
transcription.text events → {"full_text": "...", "is_final": false} to browser
transcription.completed events → {"full_text": "...", "is_final": true} to browser
Language: yue (Cantonese), zh, en, auto
Traditional Chinese via zhconv
POST /api/v1/video/{video_id}/transcribe → ffmpeg extract audio → DashScope batch → {"text": "..."}
DASHSCOPE_API_KEY not set → clear error
Client disconnect → DashScope session closed cleanly
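
The Float32 PCM → S16_LE base64 step in the first criterion can be sketched with the standard library alone. This is an illustrative version under stated assumptions (little-endian input, samples nominally in [-1.0, 1.0], scaling by 32767); the project's `float32_to_s16le()` may differ, and `encode_chunk` is a hypothetical helper name.

```python
import base64
import struct


def float32_to_s16le(float32_bytes: bytes) -> bytes:
    """Convert little-endian Float32 PCM samples to signed 16-bit little-endian."""
    count = len(float32_bytes) // 4            # ignore a trailing partial sample
    samples = struct.unpack(f"<{count}f", float32_bytes[: count * 4])
    ints = []
    for s in samples:
        s = max(-1.0, min(1.0, s))             # clamp so scaling cannot overflow int16
        ints.append(int(s * 32767))
    return struct.pack(f"<{count}h", *ints)


def encode_chunk(float32_bytes: bytes) -> str:
    """Base64-encode one converted chunk for a JSON/WebSocket payload."""
    return base64.b64encode(float32_to_s16le(float32_bytes)).decode("ascii")
```

Clamping before scaling matters because browser audio can briefly exceed the nominal range, and an out-of-range value would otherwise make `struct.pack` raise.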

Tasks:

#	Task	File
2.3.1	Write tests first	`app/test/`
2.3.2	`float32_to_s16le()`, `build_display_text()`, `_to_traditional()`	`services/asr_client.py`
2.3.3	`DashScopeCallback` (sync SDK → asyncio.Queue bridge) + `_ws_proxy_dashscope()`	`routers/ws_asr.py`
2.3.4	WebSocket endpoint	`routers/ws_asr.py`
2.3.5	`VideoService.extract_audio()` — ffmpeg async subprocess: PCM16LE 16kHz mono	`services/video_service.py`
2.3.6	`ASRClient.transcribe_full()` — batch: WAV → DashScope OpenAI-compatible API	`services/asr_client.py`
2.3.7	`POST /api/v1/video/{video_id}/transcribe` route	`routers/video.py`
2.3.8	Enhance `conftest.py` mock_asr_client	`conftest.py`
2.3.9	Run tests → pass → commit	—
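
For task 2.3.5, the ffmpeg invocation might look like the sketch below. The flags are standard ffmpeg options for 16 kHz mono PCM16LE WAV output; the function names, the error handling, and the choice to write a WAV file rather than pipe raw PCM are assumptions, not the project's actual `VideoService.extract_audio()`.

```python
import asyncio


def build_ffmpeg_args(video_path: str, wav_path: str) -> list:
    """Arguments for extracting the audio track as PCM16LE, 16 kHz, mono WAV."""
    return [
        "ffmpeg", "-y",          # overwrite the output file if it exists
        "-i", video_path,
        "-vn",                   # drop the video stream
        "-acodec", "pcm_s16le",  # signed 16-bit little-endian PCM
        "-ar", "16000",          # 16 kHz sample rate
        "-ac", "1",              # mono
        "-f", "wav",
        wav_path,
    ]


async def extract_audio(video_path: str, wav_path: str) -> str:
    """Run ffmpeg without blocking the event loop; raise on a non-zero exit code."""
    proc = await asyncio.create_subprocess_exec(
        *build_ffmpeg_args(video_path, wav_path),
        stdout=asyncio.subprocess.DEVNULL,
        stderr=asyncio.subprocess.PIPE,
    )
    _, stderr = await proc.communicate()
    if proc.returncode != 0:
        raise RuntimeError("ffmpeg failed: " + stderr.decode(errors="replace")[-500:])
    return wav_path
```

Keeping the argument list in its own function makes it testable without ffmpeg installed.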

Phase 2.4 — Transcript → QueryInput + Full Transcript Button (0.5 day)

Wire up real-time transcript from streaming ASR into QueryInput. Full Transcript button wiring.

Tests: test_phase2_useVideoASR.test.ts, test_phase2_useFullTranscript.test.ts, test_phase2_QueryInput_integration.test.tsx

Acceptance Criteria:

useVideoASR hook: auto-starts on video play, stops on pause/ended
useVideoASR exposes transcript, partialTranscript, isStreaming, status
useFullTranscript hook: requestFullTranscript() → loading → transcript → error
QueryInput shows transcript (grey italic = partial, black = final)
QueryInput accepts partialText prop

Tasks:

#	Task	File
2.4.1	Write tests first	`src/test/`
2.4.2	Create `hooks/useVideoASR.ts` (see `.examples/alibaba_asr_frontend_react.tsx`)	`hooks/useVideoASR.ts`
2.4.3	Create `hooks/useFullTranscript.ts`	`hooks/useFullTranscript.ts`
2.4.4	Update `types/index.ts` — `ASRMessage`, `ASRStatus`, `FullTranscriptResponse`	`types/index.ts`
2.4.5	Update `QueryInput.tsx` — add `partialText` prop	`components/QueryInput.tsx`
2.4.6	Run tests → pass → commit	—

Phase 2.5 — Frontend: Video Player + Buttons + Layout (1.5 days)

Replace VideoPlaceholder with video upload + player. ASR auto on play. Full Transcript button.

┌─────────────────────┬──────────────────────────┐
│ VideoUpload /       │ QueryInput               │  ← Upper Panel (30%)
│ VideoPlayer         │   (transcript flows here  │
│                     │    from video audio ASR)  │
│ [Full Transcript]   │   [Submit]                │
├─────────────────────┴──────────────────────────┤
│ ResponsePanel                                   │  ← Lower Panel (70%)
└────────────────────────────────────────────────┘

Tests: test_phase2_VideoUpload.test.tsx, test_phase2_VideoPlayer.test.tsx, test_phase2_LTTPage_integration.test.tsx

Acceptance Criteria:

Drag-and-drop video upload with progress bar (native HTML5)
Video player with native <video controls> exposing ref
ASR auto on play → transcript in QueryInput; stops on pause/end
"Full Transcript" button → loading spinner → fills QueryInput with full transcript
Error states: upload fails, ASR fails, Full Transcript fails → clear messages

Tasks:

#	Task	File
2.5.1	Write all 3 tests first	`src/test/`
2.5.2	Create `VideoUpload.tsx` — native drag-drop, axios progress	`components/VideoUpload.tsx`
2.5.3	Create `VideoPlayer.tsx` — native `<video controls>`, forwardRef	`components/VideoPlayer.tsx`
2.5.4	Update `types/index.ts`	`types/index.ts`
2.5.5	Update `lib/api.ts` — `uploadVideo()`, `getVideoUrl()`, `requestFullTranscript()`	`lib/api.ts`
2.5.6	Update `lib/queries.tsx` — `useVideoUpload()`	`lib/queries.tsx`
2.5.7	Refactor `LTTPage.tsx` — replace VideoPlaceholder, wire hooks + QueryInput	`pages/LTTPage.tsx`
2.5.8	Update `QueryInput.tsx` — transcript value + partial text styling	`components/QueryInput.tsx`
2.5.9	Run tests → pass → commit	—

Phase 2.6 — Integration & Acceptance Testing (1 day)

Tests: test_integration_phase2.py, test_acceptance_phase2_video.py, test_acceptance_phase2_asr.py, test_acceptance_integration_phase2.py

Tasks:

#	Task
2.6.1	Implement integration test (mocked DashScope, real ChromaDB + file I/O)
2.6.2	Implement acceptance: real video upload + Full Transcript
2.6.3	Implement acceptance: real DashScope streaming + batch
2.6.4	Implement E2E acceptance
2.6.5	Full regression run
2.6.6	Fix failures, final commit

Phase 2.7 — Polish & Deployment (0.5 day)

#	Task
2.7.1	Structured logging for DashScope proxy + full transcript events
2.7.2	Update `nginx.conf` — `client_max_body_size` 350M
2.7.3	Verify production build
2.7.4	Update `README.md`
2.7.5	Final commit

4. Timeline

Sub-Phase	Description	Effort	Depends On
2.1	Config & Infrastructure	0.5 day	—
2.2	Video Upload Backend	0.5 day	2.1
2.3	ASR Proxy + Full Transcript	1 day	2.1
2.4	Transcript → QueryInput	0.5 day	2.3
2.5	Frontend: Layout + Buttons	1.5 days	2.2, 2.3
2.6	Integration & Acceptance	1 day	2.4, 2.5
2.7	Polish & Deployment	0.5 day	2.6
Total		5.5 days

2.2 (upload) and 2.3 (ASR) run concurrently.
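
The 5.5-day total is the sum of the efforts in the table. As a rough check of what the concurrency buys, the snippet below computes the earliest finish of each sub-phase from the "Depends On" column, assuming a sub-phase can start as soon as all of its dependencies finish and that enough people are available to work in parallel (neither assumption is stated in the plan).

```python
# Effort (days) and dependencies copied from the timeline table.
EFFORT = {"2.1": 0.5, "2.2": 0.5, "2.3": 1.0, "2.4": 0.5,
          "2.5": 1.5, "2.6": 1.0, "2.7": 0.5}
DEPENDS_ON = {"2.1": [], "2.2": ["2.1"], "2.3": ["2.1"], "2.4": ["2.3"],
              "2.5": ["2.2", "2.3"], "2.6": ["2.4", "2.5"], "2.7": ["2.6"]}


def earliest_finish(phase: str) -> float:
    """Earliest finish = latest finish among dependencies + own effort."""
    start = max((earliest_finish(dep) for dep in DEPENDS_ON[phase]), default=0.0)
    return start + EFFORT[phase]


total_effort = sum(EFFORT.values())      # 5.5 days of work
elapsed = earliest_finish("2.7")         # 4.5 days along 2.1 → 2.3 → 2.5 → 2.6 → 2.7
```

Under those assumptions the critical path is 2.1 → 2.3 → 2.5 → 2.6 → 2.7, so 2.2 and 2.4 add effort but not elapsed time.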

5. Dependencies

Backend: dashscope>=0.4.0, openai>=1.52.0, zhconv>=1.4.0, python-multipart, aiofiles
Frontend: None (native Web APIs: AudioContext.createMediaElementSource, ScriptProcessorNode, <video>, HTML5 drag-and-drop)
System: ffmpeg on server (for Full Transcript audio extraction)

6. Config Fields

dashscope_api_key: str = ""
asr_model_name: str = "qwen3-asr-flash"                  # Batch API
asr_realtime_model_name: str = "qwen3-asr-flash-realtime" # Streaming
video_upload_dir: str = "./uploads"
max_video_size_mb: int = 300
supported_video_formats: list[str] = [".mp4", ".webm", ".mov", ".avi", ".mkv"]

7. Key Design Decisions

Decision	Choice	Why
Audio source	`createMediaElementSource(video)`	Captures video audio during playback. No mic permission.
ASR auto trigger	Video `play` event	Transcript appears as user watches. Natural UX.
ASR stop trigger	Video `pause`/`ended` events	Clean lifecycle. New session on next play/seek.
Full Transcript	Manual button + server ffmpeg	User explicitly requests. Server has the file.
Full Transcript ASR	DashScope OpenAI-compatible API	Standard `/v1/chat/completions` with `input_audio`. WAV format.
ASR streaming	DashScope realtime SDK	`OmniRealtimeConversation` + callback → asyncio.Queue bridge
Transcript display	QueryInput textarea	Editable. Same box for typing or ASR output.
SSL	Required	Chrome may block `createMediaElementSource` without secure context.
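
The "callback → asyncio.Queue bridge" in the streaming row refers to a common pattern: the SDK invokes callbacks on its own thread, and `asyncio.Queue` is not thread-safe, so each event is handed to the event loop with `call_soon_threadsafe`. The sketch below shows the pattern in isolation. `QueueBridge`, the `on_event` method name, and the event fields are illustrative stand-ins, not the DashScope SDK's actual callback interface.

```python
import asyncio
import threading


class QueueBridge:
    """Hypothetical stand-in for DashScopeCallback: forwards events to asyncio."""

    def __init__(self, loop: asyncio.AbstractEventLoop, queue: asyncio.Queue):
        self._loop = loop
        self._queue = queue

    def on_event(self, event: dict) -> None:
        # Runs on the SDK's worker thread; schedule the put on the loop thread.
        self._loop.call_soon_threadsafe(self._queue.put_nowait, event)


async def demo() -> list:
    loop = asyncio.get_running_loop()
    queue: asyncio.Queue = asyncio.Queue()
    bridge = QueueBridge(loop, queue)

    def fake_sdk_thread() -> None:
        # Simulates the synchronous SDK delivering two events from another thread.
        bridge.on_event({"type": "transcription.text", "stash": "你好"})
        bridge.on_event({"type": "transcription.completed", "transcript": "你好"})

    threading.Thread(target=fake_sdk_thread).start()
    # The WebSocket handler would await the queue and forward each event.
    return [await queue.get(), await queue.get()]
```

Running `asyncio.run(demo())` returns the two events in the order they were emitted, because `call_soon_threadsafe` schedules callbacks first-in, first-out.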

8. File Manifest

New Files

backend/
  app/routers/video.py
  app/routers/ws_asr.py
  app/services/video_service.py
  app/services/asr_client.py
  app/models/video.py
  app/models/asr.py
  app/test/test_phase2_config.py
  app/test/test_phase2_ws_protocol.py
  app/test/test_phase2_full_transcript.py
  app/test/test_phase2_transcript_to_rag.py

frontend/src/
  components/VideoUpload.tsx
  components/VideoPlayer.tsx
  hooks/useVideoASR.ts
  hooks/useFullTranscript.ts
  test/test_phase2_VideoUpload.test.tsx
  test/test_phase2_VideoPlayer.test.tsx
  test/test_phase2_useVideoASR.test.ts
  test/test_phase2_useFullTranscript.test.ts
  test/test_phase2_QueryInput_integration.test.tsx
  test/test_phase2_LTTPage_integration.test.tsx

Modified Files

backend/app/core/config.py, main.py, test/conftest.py, .env.example, requirements.txt
frontend/src/pages/LTTPage.tsx, components/QueryInput.tsx, lib/api.ts, lib/queries.tsx, types/index.ts

10. Bugs Resolved (11 Total)

#	Bug	Root Cause	Fix
1	Vite proxy missing	`/api` and `/ws` routes not proxied to backend port 8000	Added proxy config in `vite.config.ts`
2	`crossOrigin` missing on `<video>`	Cross-origin video blocked `createMediaElementSource`	Added `crossOrigin="anonymous"` to `VideoPlayer.tsx`
3	Audio silent (consumed but not passed through)	`onaudioprocess` didn't copy input to output	Added `outputData.set(float32Data)`
4	Audio graph recreated every play/pause	`createMediaElementSource` can only be called once per element	Setup audio graph once per video lifetime (`graphSetupRef`)
5	`AudioContext` suspended on play	Autoplay policy requires user gesture	Added `audioContext.resume()` on play
6	`dashscope` package not installed	Missing from `requirements.txt`	Installed `dashscope>=0.4.0`
7	`api_key` not passed to DashScope	`OmniRealtimeConversation` needs explicit `api_key`	Added `api_key=settings.dashscope_api_key`
8	`data_url` malformed	Prefix was `data:audio/wav;base64` with no trailing comma; the required form is `data:audio/wav;base64,` followed by the payload	Fixed data URL format in `transcribe_full()`
9	`extra_body` sent when `language="auto"`	DashScope rejects `extra_body` with auto-detect	Omitted `extra_body` entirely when `language="auto"`
10	Text accumulation broken: stashes are sliding windows	DashScope `stash` is a rolling ~7-char window, not cumulative: each partial event replaces the previous one, and completed events are rarely sent. The old code displayed partials as-is, so the text jumped; a subsequent simplification silently dropped partials, so nothing appeared.	Added `_merge_stash()`, which finds the overlap between the end of the accumulated text and the start of the next stash and appends only the new chars. Partial events now contribute to a growing `partial_buffer`. `full_text = accumulated + partial_buffer`.
11	Text disappeared on pause	`stopStreaming` set `partialTranscript = ''` and `onFinalTranscript` never fired (no completed events).	Removed blind clear; added `transcriptRef` to avoid stale closures; `stopStreaming` now fires `onFinalTranscript(currentText)` then clears partial.
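
Bugs #8 and #9 both concern how the batch request is assembled. The sketch below shows the two fixes in isolation. The function names are illustrative, and the `asr_options` key inside `extra_body` is a placeholder assumption, since this plan does not spell out the payload.

```python
import base64


def build_audio_data_url(wav_bytes: bytes) -> str:
    """Bug #8: the comma after 'base64' separates the header from the payload."""
    return "data:audio/wav;base64," + base64.b64encode(wav_bytes).decode("ascii")


def build_request_options(language: str) -> dict:
    """Bug #9: return extra keyword arguments for the batch call.

    For language="auto" nothing is returned, so no `extra_body` is sent at all.
    The "asr_options" key is an assumed placeholder, not confirmed by this plan.
    """
    if language == "auto":
        return {}
    return {"extra_body": {"asr_options": {"language": language}}}
```

Returning an empty dict for `auto` lets the caller write `client_call(..., **build_request_options(language))` without a special case, where `client_call` stands for whatever request function is in use.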

Text Accumulation Architecture (Bug #10 Detail)

Raw DashScope stashes (sliding window, ~7 chars each):
  "系多謝主席" → "主席咁咧呢個" → "呢個古洞北" → "三百二十五億"

_merge_stash() reconstruction (buffer + next stash → new buffer):
  "" + "系多謝主席"                          → "系多謝主席"
  "系多謝主席" + "主席咁咧呢個"               → "系多謝主席咁咧呢個"                     (overlap "主席", append "咁咧呢個")
  "系多謝主席咁咧呢個" + "呢個古洞北"          → "系多謝主席咁咧呢個古洞北"                (overlap "呢個", append "古洞北")
  "系多謝主席咁咧呢個古洞北" + "三百二十五億"   → "系多謝主席咁咧呢個古洞北 三百二十五億"    (no overlap, append with space)

Sent to frontend: {"full_text": accumulated + partial_buffer, "is_final": false}
On pause: onFinalTranscript fires with current text, partialTranscript cleared
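
The reconstruction above can be reproduced with a few lines of Python. This is a sketch of the overlap-merge idea under the rules described here (longest overlap between the end of the buffer and the start of the stash; a space when there is none), not the project's actual `_merge_stash()`, whose edge-case handling may differ.

```python
def merge_stash(buffer: str, stash: str) -> str:
    """Append only the part of `stash` that `buffer` does not already end with."""
    if not stash:
        return buffer
    if not buffer:
        return stash
    # Try the longest possible overlap first, then shrink.
    for k in range(min(len(buffer), len(stash)), 0, -1):
        if buffer.endswith(stash[:k]):
            return buffer + stash[k:]
    # No overlap: treat the stash as a new segment, separated by a space.
    return buffer + " " + stash


stashes = ["系多謝主席", "主席咁咧呢個", "呢個古洞北", "三百二十五億"]
partial_buffer = ""
for stash in stashes:
    partial_buffer = merge_stash(partial_buffer, stash)
```

After the loop, `partial_buffer` holds "系多謝主席咁咧呢個古洞北 三百二十五億", matching the last line of the walkthrough. A repeated stash is absorbed entirely by the overlap, so duplicate partial events do not duplicate text.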

11. Reference Code (`.examples/`)

File	Content
`alibaba_asr_backend.py`	DashScope WebSocket proxy + non-streaming fallback (FastAPI)
`alibaba_asr_frontend_vanilla.html`	Browser audio capture (vanilla JS, original)
`alibaba_asr_frontend_react.tsx`	React/TS: `useVideoASR` (streaming) + `useFullTranscript` (batch) hooks
`README.md`	Architecture overview + dependency notes
