legco_ai_assistant/.plans/phase2_implementation_plan.md

Phase 2: Video Upload + Video Audio ASR → RAG — Implementation Plan

Created: 2026-05-06
Updated: 2026-05-06 (all sub-phases complete; 11 bugs resolved)
Status: Complete
Depends on: Phase 1 (Complete)


1. Overview

Phase 2 adds video upload/playback and ASR transcription of the video's audio track (not the microphone). When the video plays, the browser captures the video's audio output and streams it to Alibaba Cloud DashScope for real-time transcription. A "Full Transcript" button instead sends the video's complete audio track for batch (non-streaming) transcription, using audio extracted by ffmpeg on the backend.

Two ASR Modes

Mode A — Streaming (real-time, auto on play):

<video> → AudioContext.createMediaElementSource(video)
  → ScriptProcessor(4096, 1, 1) → Float32 PCM
  → WebSocket → FastAPI → DashScope realtime API
  → transcript JSON → QueryInput (in real time)

Auto-starts when video plays, stops on pause/seek/end. Partial transcript flows into QueryInput.

Mode B — Full Transcript (batch, manual button):

User clicks "Full Transcript" under video player
  → POST /api/v1/video/{id}/transcribe
  → Backend: ffmpeg extracts audio from uploaded video
  → DashScope OpenAI-compatible API (non-streaming)
  → Complete transcript of entire video → QueryInput

Server-side audio extraction via ffmpeg. No browser involvement.
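
A minimal sketch of the server-side extraction step, assuming the PCM16LE 16kHz mono target listed for VideoService.extract_audio() in Phase 2.3. The helper name build_ffmpeg_args and the exact flag set are illustrative assumptions, not the project's actual implementation.

```python
import asyncio
from pathlib import Path


def build_ffmpeg_args(video_path: Path, wav_path: Path) -> list[str]:
    """Arguments for extracting a 16 kHz mono PCM16LE WAV (hypothetical helper)."""
    return [
        "ffmpeg",
        "-y",                    # overwrite the output file if it exists
        "-i", str(video_path),   # input video
        "-vn",                   # drop the video stream
        "-ac", "1",              # mono
        "-ar", "16000",          # 16 kHz sample rate
        "-acodec", "pcm_s16le",  # signed 16-bit little-endian PCM
        str(wav_path),
    ]


async def extract_audio(video_path: Path, wav_path: Path) -> Path:
    """Run ffmpeg as an async subprocess and raise if it exits non-zero."""
    proc = await asyncio.create_subprocess_exec(
        *build_ffmpeg_args(video_path, wav_path),
        stdout=asyncio.subprocess.DEVNULL,
        stderr=asyncio.subprocess.PIPE,
    )
    _, stderr = await proc.communicate()
    if proc.returncode != 0:
        # Keep only the tail of stderr, where ffmpeg reports the failure.
        tail = stderr.decode(errors="replace")[-500:]
        raise RuntimeError(f"ffmpeg failed: {tail}")
    return wav_path
```

Keeping the argument list in its own function lets the command be unit-tested without ffmpeg installed.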

Changes From Previous Versions

| Aspect | V2 (mic capture) | V3 (video audio capture) |
|---|---|---|
| Audio source | getUserMedia() microphone | createMediaElementSource(video) |
| Trigger | Manual record button | Auto on video play, stop on pause/end |
| Permissions | Microphone permission required | None |
| Batch mode | No | Yes, via the "Full Transcript" button |
| Backend ffmpeg | Not needed | For Full Transcript mode |

2. User Flow

  1. User uploads video → appears in left panel player
  2. User presses play → browser captures video audio → streams to DashScope → transcript in QueryInput
  3. User pauses/seeks/ends → streaming stops, accumulated transcript stays in QueryInput
  4. User edits transcript in QueryInput and clicks Submit → Phase 1 RAG pipeline
  5. Full Transcript: user clicks the "Full Transcript" button → server extracts audio → batch ASR → complete transcript fills QueryInput

3. Sub-Phases

Phase 2.1 — Configuration & Infrastructure Setup (0.5 day)

Config fields, directory structure, service/router/model skeletons, register routers.

Test: test_phase2_config.py

Tasks:

| # | Task | File |
|---|---|---|
| 2.1.1 | Add 6 config fields: dashscope_api_key, asr_model_name, asr_realtime_model_name, video_upload_dir, max_video_size_mb, supported_video_formats | core/config.py |
| 2.1.2 | Update .env.example | .env.example |
| 2.1.3 | Add deps: dashscope>=0.4.0, openai>=1.52.0, zhconv>=1.4.0, python-multipart, aiofiles | requirements.txt |
| 2.1.4 | Create models/video.py: VideoUploadResponse, FullTranscriptResponse | models/video.py |
| 2.1.5 | Create models/asr.py: ASRTranscriptEvent | models/asr.py |
| 2.1.6 | Create services/video_service.py, services/asr_client.py stubs | services/ |
| 2.1.7 | Create routers/video.py stub: POST /upload, GET /{id}, POST /{id}/transcribe | routers/video.py |
| 2.1.8 | Create routers/ws_asr.py stub: WS /ws/asr/{video_id}?language=yue | routers/ws_asr.py |
| 2.1.9 | Register routers in main.py | main.py |
| 2.1.10 | Write and pass test_phase2_config.py | app/test/ |

Phase 2.2 — Video Upload Backend (0.5 day)

Streaming upload with size/format validation. Reuses routers/ingest.py pattern.

Test: test_phase2_video_upload.py (implement 4 existing stubs)

Tasks:

| # | Task | File |
|---|---|---|
| 2.2.1 | Write tests: implement all 4 stubs | test_phase2_video_upload.py |
| 2.2.2 | Implement VideoService.validate_video(), save_video() (streaming, aiofiles) | services/video_service.py |
| 2.2.3 | Implement VideoService.get_video_path(), delete_video() | services/video_service.py |
| 2.2.4 | Implement POST /api/v1/video/upload route | routers/video.py |
| 2.2.5 | Implement GET /api/v1/video/{video_id} route (FileResponse) | routers/video.py |
| 2.2.6 | Run tests → pass → commit | |
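
The size/format validation for a streaming upload might look like the sketch below. It uses only the standard library; the limits mirror the Config Fields section, while VideoValidationError, validate_extension, check_running_size, and consume are hypothetical names chosen for illustration.

```python
from pathlib import Path

# Values mirror the Config Fields section; the names are illustrative.
MAX_VIDEO_SIZE_MB = 300
SUPPORTED_VIDEO_FORMATS = [".mp4", ".webm", ".mov", ".avi", ".mkv"]


class VideoValidationError(ValueError):
    """Raised when an upload fails a format or size check (hypothetical name)."""


def validate_extension(filename: str) -> str:
    """Return the lowercased extension, or raise if it is not supported."""
    ext = Path(filename).suffix.lower()
    if ext not in SUPPORTED_VIDEO_FORMATS:
        raise VideoValidationError(f"unsupported video format: {ext or '(none)'}")
    return ext


def check_running_size(total_bytes: int, max_mb: int = MAX_VIDEO_SIZE_MB) -> None:
    """Call after each chunk so an oversized upload is rejected mid-stream."""
    if total_bytes > max_mb * 1024 * 1024:
        raise VideoValidationError(f"video exceeds the {max_mb} MB limit")


def consume(chunks, max_mb: int = MAX_VIDEO_SIZE_MB) -> int:
    """Walk the upload chunk by chunk, enforcing the size limit as it grows."""
    total = 0
    for chunk in chunks:
        total += len(chunk)
        check_running_size(total, max_mb)
        # A real implementation would write the chunk to disk here.
    return total
```

Checking the running total per chunk avoids buffering the whole file before rejecting it.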

Phase 2.3 — ASR WebSocket Proxy + Full Transcript Backend (1 day)

Two backend ASR paths: real-time streaming (WebSocket proxy to DashScope) and batch (ffmpeg extract → DashScope non-streaming API).

Reference: .examples/alibaba_asr_backend.py

Tests: test_phase2_asr_client.py (3 stubs), test_phase2_ws_asr.py (3 stubs), test_phase2_ws_protocol.py (new), test_phase2_full_transcript.py (new)

Acceptance Criteria:

  • WebSocket /ws/asr/{video_id}?language=yue → Float32 PCM → S16_LE base64 → DashScope realtime
  • transcription.text events → {"full_text": "...", "is_final": false} to browser
  • transcription.completed events → {"full_text": "...", "is_final": true} to browser
  • Language: yue (Cantonese), zh, en, auto
  • Traditional Chinese via zhconv
  • POST /api/v1/video/{video_id}/transcribe → ffmpeg extract audio → DashScope batch → {"text": "..."}
  • DASHSCOPE_API_KEY not set → clear error
  • Client disconnect → DashScope session closed cleanly
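
The Float32 PCM → S16_LE base64 step in the first criterion can be sketched with the standard library. This assumes the browser sends little-endian Float32 samples nominally in [-1.0, 1.0]; scaling by 32767 is one common convention, and to_base64_chunk is a hypothetical helper name.

```python
import base64
import struct


def float32_to_s16le(float32_bytes: bytes) -> bytes:
    """Convert little-endian Float32 PCM samples to signed 16-bit LE PCM."""
    count = len(float32_bytes) // 4  # ignore any trailing partial sample
    samples = struct.unpack(f"<{count}f", float32_bytes[: count * 4])
    # Clamp to [-1.0, 1.0] before scaling so out-of-range input cannot overflow.
    ints = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    return struct.pack(f"<{count}h", *ints)


def to_base64_chunk(float32_bytes: bytes) -> str:
    """Base64-encode the converted PCM for a JSON/WebSocket payload."""
    return base64.b64encode(float32_to_s16le(float32_bytes)).decode("ascii")
```

For example, the samples 0.0, 1.0, -1.0, and an out-of-range 2.0 convert to 0, 32767, -32767, and 32767.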

Tasks:

| # | Task | File |
|---|---|---|
| 2.3.1 | Write tests first | app/test/ |
| 2.3.2 | float32_to_s16le(), build_display_text(), _to_traditional() | services/asr_client.py |
| 2.3.3 | DashScopeCallback (sync SDK → asyncio.Queue bridge) + _ws_proxy_dashscope() | routers/ws_asr.py |
| 2.3.4 | WebSocket endpoint | routers/ws_asr.py |
| 2.3.5 | VideoService.extract_audio(): ffmpeg async subprocess, PCM16LE 16kHz mono | services/video_service.py |
| 2.3.6 | ASRClient.transcribe_full(): batch, WAV → DashScope OpenAI-compatible API | services/asr_client.py |
| 2.3.7 | POST /api/v1/video/{video_id}/transcribe route | routers/video.py |
| 2.3.8 | Enhance conftest.py mock_asr_client | conftest.py |
| 2.3.9 | Run tests → pass → commit | |
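
The sync SDK → asyncio.Queue bridge in task 2.3.3 can be illustrated with the standard library alone. In this sketch QueueBridge, on_event, and the fake SDK thread are hypothetical stand-ins; DashScope's actual callback interface is not reproduced here.

```python
import asyncio
import threading


class QueueBridge:
    """Forwards events from a synchronous callback thread into an asyncio.Queue."""

    def __init__(self, loop: asyncio.AbstractEventLoop) -> None:
        self._loop = loop
        self.queue: asyncio.Queue = asyncio.Queue()

    def on_event(self, event: dict) -> None:
        # Called from a non-asyncio thread. asyncio.Queue is not thread-safe,
        # so the put is scheduled onto the event loop's own thread.
        self._loop.call_soon_threadsafe(self.queue.put_nowait, event)


async def demo() -> list:
    bridge = QueueBridge(asyncio.get_running_loop())

    def fake_sdk_thread() -> None:
        # Stand-in for the SDK invoking its callback from a worker thread.
        bridge.on_event({"full_text": "hello", "is_final": False})
        bridge.on_event({"full_text": "hello world", "is_final": True})

    threading.Thread(target=fake_sdk_thread).start()

    received = []
    while True:
        event = await asyncio.wait_for(bridge.queue.get(), timeout=2)
        received.append(event)
        if event["is_final"]:
            return received
```

Running asyncio.run(demo()) collects both events in the order the thread emitted them, because call_soon_threadsafe schedules callbacks first-in, first-out.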

Phase 2.4 — Transcript → QueryInput + Full Transcript Button (0.5 day)

Wire up real-time transcript from streaming ASR into QueryInput. Full Transcript button wiring.

Tests: test_phase2_useVideoASR.test.ts, test_phase2_useFullTranscript.test.ts, test_phase2_QueryInput_integration.test.tsx

Acceptance Criteria:

  • useVideoASR hook: auto-starts on video play, stops on pause/ended
  • useVideoASR exposes transcript, partialTranscript, isStreaming, status
  • useFullTranscript hook: requestFullTranscript() → loading → transcript or error
  • QueryInput shows transcript (grey italic = partial, black = final)
  • QueryInput accepts partialText prop

Tasks:

| # | Task | File |
|---|---|---|
| 2.4.1 | Write tests first | src/test/ |
| 2.4.2 | Create hooks/useVideoASR.ts (see .examples/alibaba_asr_frontend_react.tsx) | hooks/useVideoASR.ts |
| 2.4.3 | Create hooks/useFullTranscript.ts | hooks/useFullTranscript.ts |
| 2.4.4 | Update types/index.ts: ASRMessage, ASRStatus, FullTranscriptResponse | types/index.ts |
| 2.4.5 | Update QueryInput.tsx: add partialText prop | components/QueryInput.tsx |
| 2.4.6 | Run tests → pass → commit | |

Phase 2.5 — Frontend: Video Player + Buttons + Layout (1.5 days)

Replace VideoPlaceholder with video upload + player. ASR auto on play. Full Transcript button.

┌─────────────────────┬──────────────────────────┐
│ VideoUpload /       │ QueryInput               │  ← Upper Panel (30%)
│ VideoPlayer         │   (transcript flows here │
│                     │    from video audio ASR) │
│ [Full Transcript]   │   [Submit]               │
├─────────────────────┴──────────────────────────┤
│ ResponsePanel                                  │  ← Lower Panel (70%)
└────────────────────────────────────────────────┘

Tests: test_phase2_VideoUpload.test.tsx, test_phase2_VideoPlayer.test.tsx, test_phase2_LTTPage_integration.test.tsx

Acceptance Criteria:

  • Drag-and-drop video upload with progress bar (native HTML5)
  • Video player with native <video controls> exposing ref
  • ASR auto on play → transcript in QueryInput; stops on pause/end
  • "Full Transcript" button → loading spinner → fills QueryInput with full transcript
  • Error states: upload fails, ASR fails, Full Transcript fails → clear messages

Tasks:

| # | Task | File |
|---|---|---|
| 2.5.1 | Write all 3 tests first | src/test/ |
| 2.5.2 | Create VideoUpload.tsx: native drag-drop, axios progress | components/VideoUpload.tsx |
| 2.5.3 | Create VideoPlayer.tsx: native <video controls>, forwardRef | components/VideoPlayer.tsx |
| 2.5.4 | Update types/index.ts | types/index.ts |
| 2.5.5 | Update lib/api.ts: uploadVideo(), getVideoUrl(), requestFullTranscript() | lib/api.ts |
| 2.5.6 | Update lib/queries.tsx: useVideoUpload() | lib/queries.tsx |
| 2.5.7 | Refactor LTTPage.tsx: replace VideoPlaceholder, wire hooks + QueryInput | pages/LTTPage.tsx |
| 2.5.8 | Update QueryInput.tsx: transcript value + partial text styling | components/QueryInput.tsx |
| 2.5.9 | Run tests → pass → commit | |

Phase 2.6 — Integration & Acceptance Testing (1 day)

Tests: test_integration_phase2.py, test_acceptance_phase2_video.py, test_acceptance_phase2_asr.py, test_acceptance_integration_phase2.py

Tasks:

| # | Task |
|---|---|
| 2.6.1 | Implement integration test (mocked DashScope, real ChromaDB + file I/O) |
| 2.6.2 | Implement acceptance: real video upload + Full Transcript |
| 2.6.3 | Implement acceptance: real DashScope streaming + batch |
| 2.6.4 | Implement E2E acceptance |
| 2.6.5 | Full regression run |
| 2.6.6 | Fix failures, final commit |

Phase 2.7 — Polish & Deployment (0.5 day)

| # | Task |
|---|---|
| 2.7.1 | Structured logging for DashScope proxy + full transcript events |
| 2.7.2 | Update nginx.conf: client_max_body_size 350M |
| 2.7.3 | Verify production build |
| 2.7.4 | Update README.md |
| 2.7.5 | Final commit |

4. Timeline

| Sub-Phase | Description | Effort | Depends On |
|---|---|---|---|
| 2.1 | Config & Infrastructure | 0.5 day | |
| 2.2 | Video Upload Backend | 0.5 day | 2.1 |
| 2.3 | ASR Proxy + Full Transcript | 1 day | 2.1 |
| 2.4 | Transcript → QueryInput | 0.5 day | 2.3 |
| 2.5 | Frontend: Layout + Buttons | 1.5 days | 2.2, 2.3 |
| 2.6 | Integration & Acceptance | 1 day | 2.4, 2.5 |
| 2.7 | Polish & Deployment | 0.5 day | 2.6 |
| Total | | 5.5 days | |

2.2 (upload) and 2.3 (ASR) both depend only on 2.1, so they can run concurrently.
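
As a worked check on the timeline, the sketch below sums the effort figures and computes the earliest finish along the dependency chain. It assumes each sub-phase starts as soon as its dependencies finish and that enough people are available to work in parallel; the dictionary and function names are illustrative.

```python
from functools import lru_cache

# Effort (days) and dependencies, copied from the timeline table.
EFFORT_DAYS = {
    "2.1": 0.5, "2.2": 0.5, "2.3": 1.0, "2.4": 0.5,
    "2.5": 1.5, "2.6": 1.0, "2.7": 0.5,
}
DEPENDS_ON = {
    "2.1": [], "2.2": ["2.1"], "2.3": ["2.1"], "2.4": ["2.3"],
    "2.5": ["2.2", "2.3"], "2.6": ["2.4", "2.5"], "2.7": ["2.6"],
}


@lru_cache(maxsize=None)
def earliest_finish(sub_phase: str) -> float:
    """Earliest finish time in days, assuming unlimited parallel capacity."""
    start = max((earliest_finish(dep) for dep in DEPENDS_ON[sub_phase]), default=0.0)
    return start + EFFORT_DAYS[sub_phase]


total_effort = sum(EFFORT_DAYS.values())  # person-days of work
elapsed = earliest_finish("2.7")          # calendar days on the longest chain
```

Under those assumptions the total effort is 5.5 days, while the longest dependency chain (2.1 → 2.3 → 2.5 → 2.6 → 2.7) takes 4.5 days.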


5. Dependencies

Backend: dashscope>=0.4.0, openai>=1.52.0, zhconv>=1.4.0, python-multipart, aiofiles
Frontend: None (native Web APIs: AudioContext.createMediaElementSource, ScriptProcessorNode, <video>, HTML5 drag-and-drop)
System: ffmpeg on server (for Full Transcript audio extraction)


6. Config Fields

dashscope_api_key: str = ""
asr_model_name: str = "qwen3-asr-flash"                  # Batch API
asr_realtime_model_name: str = "qwen3-asr-flash-realtime" # Streaming
video_upload_dir: str = "./uploads"
max_video_size_mb: int = 300
supported_video_formats: list[str] = [".mp4", ".webm", ".mov", ".avi", ".mkv"]

7. Key Design Decisions

| Decision | Choice | Why |
|---|---|---|
| Audio source | createMediaElementSource(video) | Captures video audio during playback. No mic permission. |
| ASR auto trigger | Video play event | Transcript appears as user watches. Natural UX. |
| ASR stop trigger | Video pause/ended events | Clean lifecycle. New session on next play/seek. |
| Full Transcript | Manual button + server ffmpeg | User explicitly requests. Server has the file. |
| Full Transcript ASR | DashScope OpenAI-compatible API | Standard /v1/chat/completions with input_audio. WAV format. |
| ASR streaming | DashScope realtime SDK | OmniRealtimeConversation + callback → asyncio.Queue bridge |
| Transcript display | QueryInput textarea | Editable. Same box for typing or ASR output. |
| SSL | Required | Chrome may block createMediaElementSource without secure context. |

8. File Manifest

New Files

backend/
  app/routers/video.py
  app/routers/ws_asr.py
  app/services/video_service.py
  app/services/asr_client.py
  app/models/video.py
  app/models/asr.py
  app/test/test_phase2_config.py
  app/test/test_phase2_ws_protocol.py
  app/test/test_phase2_full_transcript.py
  app/test/test_phase2_transcript_to_rag.py

frontend/src/
  components/VideoUpload.tsx
  components/VideoPlayer.tsx
  hooks/useVideoASR.ts
  hooks/useFullTranscript.ts
  test/test_phase2_VideoUpload.test.tsx
  test/test_phase2_VideoPlayer.test.tsx
  test/test_phase2_useVideoASR.test.ts
  test/test_phase2_useFullTranscript.test.ts
  test/test_phase2_QueryInput_integration.test.tsx
  test/test_phase2_LTTPage_integration.test.tsx

Modified Files

backend/app/core/config.py, main.py, test/conftest.py, .env.example, requirements.txt
frontend/src/pages/LTTPage.tsx, components/QueryInput.tsx, lib/api.ts, lib/queries.tsx, types/index.ts

10. Bugs Resolved (11 Total)

| # | Bug | Root Cause | Fix |
|---|---|---|---|
| 1 | Vite proxy missing | /api and /ws routes not proxied to backend port 8000 | Added proxy config in vite.config.ts |
| 2 | crossOrigin missing on <video> | Cross-origin video blocked createMediaElementSource | Added crossOrigin="anonymous" to VideoPlayer.tsx |
| 3 | Audio silent (consumed but not passed through) | onaudioprocess didn't copy input to output | Added outputData.set(float32Data) |
| 4 | Audio graph recreated every play/pause | createMediaElementSource can only be called once per element | Set up audio graph once per video lifetime (graphSetupRef) |
| 5 | AudioContext suspended on play | Autoplay policy requires user gesture | Added audioContext.resume() on play |
| 6 | dashscope package not installed | Missing from requirements.txt | Installed dashscope>=0.4.0 |
| 7 | api_key not passed to DashScope | OmniRealtimeConversation needs explicit api_key | Added api_key=settings.dashscope_api_key |
| 8 | data_url malformed | Prefix was data:audio/wav;base64 without the trailing comma; the correct prefix is data:audio/wav;base64, | Fixed data URL format in transcribe_full() |
| 9 | extra_body sent when language="auto" | DashScope rejects extra_body with auto-detect | Omitted extra_body entirely when language="auto" |
| 10 | Text accumulation broken: stashes are sliding windows | DashScope stash is a rolling ~7-char window, not cumulative. Each partial event replaces the previous. Completed events rarely sent. Old code showed partials as-is (text jumped). Simplification silently dropped partials (nothing appeared). | Added _merge_stash(), which finds the overlapping suffix between successive stashes and appends only the new chars. Partial events now contribute to a growing partial_buffer. full_text = accumulated + partial_buffer. |
| 11 | Text disappeared on pause | stopStreaming set partialTranscript = '' and onFinalTranscript never fired (no completed events). | Removed blind clear; added transcriptRef to avoid stale closures; stopStreaming now fires onFinalTranscript(currentText) then clears partial. |
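
Bugs #8 and #9 both concern how the batch request is assembled. A minimal sketch of the two fixes follows; build_audio_data_url and optional_extra_body are hypothetical helper names, and the extra_body payload shape is a placeholder because the real DashScope option names are not given in this plan.

```python
import base64


def build_audio_data_url(wav_bytes: bytes) -> str:
    """Build a WAV data URL; the comma after 'base64' separates header and payload."""
    return "data:audio/wav;base64," + base64.b64encode(wav_bytes).decode("ascii")


def optional_extra_body(language: str) -> dict:
    """Return request kwargs, leaving out extra_body entirely for auto-detect."""
    if language == "auto":
        return {}
    # Placeholder payload shape (assumption): the real option names may differ.
    return {"extra_body": {"language": language}}
```

Returning an empty dict lets the caller unpack the result with ** into the request call, so the auto-detect path sends no extra_body key at all.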

Text Accumulation Architecture (Bug #10 Detail)

Raw DashScope stashes (sliding window, ~7 chars each):
  "系多謝主席" → "主席咁咧呢個" → "呢個古洞北" → "三百二十五億"

_merge_stash() reconstruction (accumulated + next stash → result):
  "" + "系多謝主席"                                                              → "系多謝主席"
  "系多謝主席" + "主席咁咧呢個" (overlap "主席", append "咁咧呢個")              → "系多謝主席咁咧呢個"
  "系多謝主席咁咧呢個" + "呢個古洞北" (overlap "呢個", append "古洞北")          → "系多謝主席咁咧呢個古洞北"
  "系多謝主席咁咧呢個古洞北" + "三百二十五億" (no overlap, append with space)    → "系多謝主席咁咧呢個古洞北 三百二十五億"

Sent to frontend: {"full_text": accumulated + partial_buffer, "is_final": false}
On pause: onFinalTranscript fires with current text, partialTranscript cleared
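
The reconstruction above can be sketched as a longest suffix/prefix overlap search. This is an illustrative version under the behaviour described in this section, not the project's actual _merge_stash(); the function name merge_stash and the event shape at the end are assumptions drawn from the text.

```python
from functools import reduce


def merge_stash(accumulated: str, stash: str) -> str:
    """Append only the part of `stash` that does not overlap `accumulated`."""
    if not stash:
        return accumulated
    if not accumulated:
        return stash
    # Try the longest possible overlap first: a suffix of the accumulated
    # text that is also a prefix of the new stash.
    for size in range(min(len(accumulated), len(stash)), 0, -1):
        if accumulated.endswith(stash[:size]):
            return accumulated + stash[size:]
    # No overlap at all: treat the stash as a new segment.
    return accumulated + " " + stash


stashes = ["系多謝主席", "主席咁咧呢個", "呢個古洞北", "三百二十五億"]
full_text = reduce(merge_stash, stashes, "")
event = {"full_text": full_text, "is_final": False}
```

With the four stashes shown, full_text ends up as "系多謝主席咁咧呢個古洞北 三百二十五億". Note that in this sketch a coincidental single-character match counts as overlap; requiring a minimum overlap length is one possible guard.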

11. Reference Code (.examples/)

| File | Content |
|---|---|
| alibaba_asr_backend.py | DashScope WebSocket proxy + non-streaming fallback (FastAPI) |
| alibaba_asr_frontend_vanilla.html | Browser audio capture (vanilla JS, original) |
| alibaba_asr_frontend_react.tsx | React/TS: useVideoASR (streaming) + useFullTranscript (batch) hooks |
| README.md | Architecture overview + dependency notes |