Phase 2: Video Upload + Video Audio ASR → RAG — Implementation Plan

Created: 2026-05-06
Updated: 2026-05-06 (video audio capture via createMediaElementSource; Full Transcript batch mode)
Status: Planning (Not Started)
Depends on: Phase 1 (Complete)


1. Overview

Phase 2 adds video upload and playback, plus ASR transcription of the video's audio track (not the microphone). While the video plays, the browser captures the video's audio output and streams it to Alibaba Cloud DashScope for real-time transcription. A "Full Transcript" button instead requests batch (non-streaming) transcription of the complete video audio, which the backend extracts with ffmpeg.

Two ASR Modes

Mode A — Streaming (real-time, auto on play):

<video> → AudioContext.createMediaElementSource(video)
  → ScriptProcessor(4096, 1, 1) → Float32 PCM
  → WebSocket → FastAPI → DashScope realtime API
  → transcript JSON → QueryInput (in real time)

Auto-starts when video plays, stops on pause/seek/end. Partial transcript flows into QueryInput.

Mode B — Full Transcript (batch, manual button):

User clicks "Full Transcript" under video player
  → POST /api/v1/video/{id}/transcribe
  → Backend: ffmpeg extracts audio from uploaded video
  → DashScope OpenAI-compatible API (non-streaming)
  → Complete transcript of entire video → QueryInput

Server-side audio extraction via ffmpeg. No browser involvement.
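
A minimal sketch of the server-side extraction step, assuming ffmpeg is on the server's PATH. The output format (PCM16LE, 16 kHz, mono) comes from task 2.3.5; `build_ffmpeg_cmd` is a hypothetical helper name, and the error handling is illustrative only.

```python
import asyncio
from pathlib import Path


def build_ffmpeg_cmd(video_path: Path, wav_path: Path) -> list[str]:
    """Build an ffmpeg argv that writes 16 kHz mono PCM16LE audio as WAV."""
    return [
        "ffmpeg",
        "-y",                    # overwrite the output file if it already exists
        "-i", str(video_path),   # input video
        "-vn",                   # drop the video stream
        "-ac", "1",              # downmix to mono
        "-ar", "16000",          # resample to 16 kHz
        "-acodec", "pcm_s16le",  # signed 16-bit little-endian PCM
        str(wav_path),
    ]


async def extract_audio(video_path: Path, wav_path: Path) -> Path:
    """Run ffmpeg as an async subprocess so the event loop is not blocked."""
    proc = await asyncio.create_subprocess_exec(
        *build_ffmpeg_cmd(video_path, wav_path),
        stdout=asyncio.subprocess.DEVNULL,
        stderr=asyncio.subprocess.PIPE,
    )
    _, stderr = await proc.communicate()
    if proc.returncode != 0:
        # Surface the tail of ffmpeg's stderr so the route can return a clear error.
        tail = stderr.decode(errors="replace")[-500:]
        raise RuntimeError(f"ffmpeg failed with code {proc.returncode}: {tail}")
    return wav_path
```

Keeping the argv builder separate from the subprocess call lets the command be unit-tested without ffmpeg installed.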

Changes From Previous Versions

| Aspect | V2 (mic capture) | V3 (video audio capture) |
|---|---|---|
| Audio source | getUserMedia() microphone | createMediaElementSource(video) |
| Trigger | Manual record button | Auto on video play, stop on pause/end |
| Permissions | Microphone permission required | None |
| Batch mode | No | Yes ("Full Transcript" button) |
| Backend ffmpeg | Not needed | For Full Transcript mode |

2. User Flow

  1. User uploads video → appears in left panel player
  2. User presses play → browser captures video audio → streams to DashScope → transcript in QueryInput
  3. User pauses/seeks/ends → streaming stops, accumulated transcript stays in QueryInput
  4. User edits transcript in QueryInput and clicks Submit → Phase 1 RAG pipeline
  5. Full Transcript: clicks "Full Transcript" button → server extracts audio → batch ASR → complete transcript fills QueryInput

3. Sub-Phases

Phase 2.1 — Configuration & Infrastructure Setup (0.5 day)

Config fields, directory structure, service/router/model skeletons, register routers.

Test: test_phase2_config.py

Tasks:

| # | Task | File |
|---|---|---|
| 2.1.1 | Add 6 config fields: dashscope_api_key, asr_model_name, asr_realtime_model_name, video_upload_dir, max_video_size_mb, supported_video_formats | core/config.py |
| 2.1.2 | Update .env.example | .env.example |
| 2.1.3 | Add deps: dashscope>=0.4.0, openai>=1.52.0, zhconv>=1.4.0, python-multipart, aiofiles | requirements.txt |
| 2.1.4 | Create models/video.py: VideoUploadResponse, FullTranscriptResponse | models/video.py |
| 2.1.5 | Create models/asr.py: ASRTranscriptEvent | models/asr.py |
| 2.1.6 | Create services/video_service.py, services/asr_client.py stubs | services/ |
| 2.1.7 | Create routers/video.py stub: POST /upload, GET /{id}, POST /{id}/transcribe | routers/video.py |
| 2.1.8 | Create routers/ws_asr.py stub: WS /ws/asr/{video_id}?language=yue | routers/ws_asr.py |
| 2.1.9 | Register routers in main.py | main.py |
| 2.1.10 | Write and pass test_phase2_config.py | app/test/ |

Phase 2.2 — Video Upload Backend (0.5 day)

Streaming upload with size/format validation. Reuses routers/ingest.py pattern.

Test: test_phase2_video_upload.py (implement 4 existing stubs)

Tasks:

| # | Task | File |
|---|---|---|
| 2.2.1 | Write tests: implement all 4 stubs | test_phase2_video_upload.py |
| 2.2.2 | Implement VideoService.validate_video(), save_video() (streaming, aiofiles) | services/video_service.py |
| 2.2.3 | Implement VideoService.get_video_path(), delete_video() | services/video_service.py |
| 2.2.4 | Implement POST /api/v1/video/upload route | routers/video.py |
| 2.2.5 | Implement GET /api/v1/video/{video_id} route (FileResponse) | routers/video.py |
| 2.2.6 | Run tests → pass → commit | |
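
The validation and streaming-save tasks might look roughly like the sketch below. It is standard-library only so it runs as-is: the plan calls for aiofiles, which would replace the plain file write here. `VideoValidationError` and the chunk-iterator signature are assumptions, and the limits mirror the defaults in Config Fields.

```python
from pathlib import Path
from typing import AsyncIterator

SUPPORTED_FORMATS = {".mp4", ".webm", ".mov", ".avi", ".mkv"}
MAX_VIDEO_BYTES = 300 * 1024 * 1024  # max_video_size_mb = 300


class VideoValidationError(ValueError):
    """Hypothetical error type for rejected uploads."""


def validate_video(filename: str) -> str:
    """Return the normalized extension, or raise if the format is unsupported."""
    suffix = Path(filename).suffix.lower()
    if suffix not in SUPPORTED_FORMATS:
        raise VideoValidationError(f"Unsupported video format: {suffix or '(none)'}")
    return suffix


async def save_video(
    chunks: AsyncIterator[bytes], dest: Path, max_bytes: int = MAX_VIDEO_BYTES
) -> int:
    """Write chunks to dest as they arrive, aborting once the size cap is exceeded."""
    written = 0
    try:
        # Assumption: plain blocking writes for brevity; aiofiles would go here.
        with dest.open("wb") as fh:
            async for chunk in chunks:
                written += len(chunk)
                if written > max_bytes:
                    raise VideoValidationError(f"Video exceeds {max_bytes} bytes")
                fh.write(chunk)
    except VideoValidationError:
        dest.unlink(missing_ok=True)  # do not leave a partial file behind
        raise
    return written
```

Checking the size while streaming, rather than trusting a Content-Length header, means an oversized upload is rejected without ever being fully written to disk.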

Phase 2.3 — ASR WebSocket Proxy + Full Transcript Backend (1 day)

Two backend ASR paths: real-time streaming (WebSocket proxy to DashScope) and batch (ffmpeg extract → DashScope non-streaming API).

Reference: .examples/alibaba_asr_backend.py

Tests: test_phase2_asr_client.py (3 stubs), test_phase2_ws_asr.py (3 stubs), test_phase2_ws_protocol.py (new), test_phase2_full_transcript.py (new)

Acceptance Criteria:

  • WebSocket /ws/asr/{video_id}?language=yue → Float32 PCM → S16_LE base64 → DashScope realtime
  • transcription.text events → {"full_text": "...", "is_final": false} to browser
  • transcription.completed events → {"full_text": "...", "is_final": true} to browser
  • Language: yue (Cantonese), zh, en, auto
  • Traditional Chinese via zhconv
  • POST /api/v1/video/{video_id}/transcribe → ffmpeg extract audio → DashScope batch → {"text": "..."}
  • DASHSCOPE_API_KEY not set → clear error
  • Client disconnect → DashScope session closed cleanly
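
The Float32 → S16_LE → base64 step in the first criterion can be sketched with the standard library alone. This assumes the browser sends little-endian Float32 samples in the range [-1.0, 1.0] and that scaling by 32767 is acceptable; `encode_audio_chunk` is a hypothetical helper name.

```python
import base64
import struct


def float32_to_s16le(pcm_f32: bytes) -> bytes:
    """Convert little-endian Float32 PCM to signed 16-bit little-endian PCM."""
    count = len(pcm_f32) // 4  # ignore any trailing partial sample
    samples = struct.unpack(f"<{count}f", pcm_f32[: count * 4])
    # Clamp to [-1.0, 1.0] before scaling so out-of-range samples cannot overflow int16.
    ints = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    return struct.pack(f"<{count}h", *ints)


def encode_audio_chunk(pcm_f32: bytes) -> str:
    """Base64-encode the converted chunk for forwarding to the realtime session."""
    return base64.b64encode(float32_to_s16le(pcm_f32)).decode("ascii")
```

A 4096-sample ScriptProcessor buffer is 16384 bytes of Float32 and becomes 8192 bytes of S16_LE before base64 encoding.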

Tasks:

| # | Task | File |
|---|---|---|
| 2.3.1 | Write tests first | app/test/ |
| 2.3.2 | float32_to_s16le(), build_display_text(), _to_traditional() | services/asr_client.py |
| 2.3.3 | DashScopeCallback (sync SDK → asyncio.Queue bridge) + _ws_proxy_dashscope() | routers/ws_asr.py |
| 2.3.4 | WebSocket endpoint | routers/ws_asr.py |
| 2.3.5 | VideoService.extract_audio(): ffmpeg async subprocess producing PCM16LE 16 kHz mono | services/video_service.py |
| 2.3.6 | ASRClient.transcribe_full(): batch, WAV → DashScope OpenAI-compatible API | services/asr_client.py |
| 2.3.7 | POST /api/v1/video/{video_id}/transcribe route | routers/video.py |
| 2.3.8 | Enhance conftest.py mock_asr_client | conftest.py |
| 2.3.9 | Run tests → pass → commit | |
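
The "sync SDK → asyncio.Queue bridge" in task 2.3.3 could follow the pattern sketched below. It does not use the DashScope SDK: `QueueBridgeCallback`, its `on_event` method, and the event dict keys (`type`, `text`) are stand-ins chosen for illustration. Only the event names and the browser message shape come from the acceptance criteria.

```python
import asyncio
import threading
from typing import Any


class QueueBridgeCallback:
    """Hypothetical callback: forwards events from an SDK thread to an asyncio.Queue."""

    def __init__(self, loop: asyncio.AbstractEventLoop, queue: asyncio.Queue) -> None:
        self._loop = loop
        self._queue = queue

    def on_event(self, event: dict[str, Any]) -> None:
        # asyncio.Queue is not thread-safe, so hand the event to the loop thread.
        self._loop.call_soon_threadsafe(self._queue.put_nowait, event)


def to_browser_message(event: dict[str, Any]) -> dict[str, Any]:
    """Map an (assumed) SDK event to the JSON shape sent to the browser."""
    return {
        "full_text": event.get("text", ""),
        "is_final": event.get("type") == "transcription.completed",
    }


async def demo() -> list[dict[str, Any]]:
    """Simulate an SDK worker thread and collect the resulting browser messages."""
    queue: asyncio.Queue = asyncio.Queue()
    callback = QueueBridgeCallback(asyncio.get_running_loop(), queue)

    def fake_sdk_thread() -> None:
        callback.on_event({"type": "transcription.text", "text": "hel"})
        callback.on_event({"type": "transcription.completed", "text": "hello"})

    threading.Thread(target=fake_sdk_thread).start()

    messages = []
    while True:
        message = to_browser_message(await queue.get())
        messages.append(message)
        if message["is_final"]:
            return messages


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In the real proxy, the loop that drains the queue would send each message over the WebSocket instead of collecting it in a list.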

Phase 2.4 — Transcript → QueryInput + Full Transcript Button (0.5 day)

Wire the real-time transcript from streaming ASR into QueryInput, and wire up the Full Transcript button.

Tests: test_phase2_useVideoASR.test.ts, test_phase2_useFullTranscript.test.ts, test_phase2_QueryInput_integration.test.tsx

Acceptance Criteria:

  • useVideoASR hook: auto-starts on video play, stops on pause/ended
  • useVideoASR exposes transcript, partialTranscript, isStreaming, status
  • useFullTranscript hook: requestFullTranscript() → loading → transcript or error
  • QueryInput shows transcript (grey italic = partial, black = final)
  • QueryInput accepts partialText prop

Tasks:

| # | Task | File |
|---|---|---|
| 2.4.1 | Write tests first | src/test/ |
| 2.4.2 | Create hooks/useVideoASR.ts (see .examples/alibaba_asr_frontend_react.tsx) | hooks/useVideoASR.ts |
| 2.4.3 | Create hooks/useFullTranscript.ts | hooks/useFullTranscript.ts |
| 2.4.4 | Update types/index.ts: ASRMessage, ASRStatus, FullTranscriptResponse | types/index.ts |
| 2.4.5 | Update QueryInput.tsx: add partialText prop | components/QueryInput.tsx |
| 2.4.6 | Run tests → pass → commit | |

Phase 2.5 — Frontend: Video Player + Buttons + Layout (1.5 days)

Replace VideoPlaceholder with video upload + player. ASR auto on play. Full Transcript button.

┌─────────────────────┬──────────────────────────┐
│ VideoUpload /       │ QueryInput               │  ← Upper Panel (30%)
│ VideoPlayer         │   (transcript flows here │
│                     │    from video audio ASR) │
│ [Full Transcript]   │   [Submit]               │
├─────────────────────┴──────────────────────────┤
│ ResponsePanel                                  │  ← Lower Panel (70%)
└────────────────────────────────────────────────┘

Tests: test_phase2_VideoUpload.test.tsx, test_phase2_VideoPlayer.test.tsx, test_phase2_LTTPage_integration.test.tsx

Acceptance Criteria:

  • Drag-and-drop video upload with progress bar (native HTML5)
  • Video player with native <video controls> exposing ref
  • ASR auto on play → transcript in QueryInput; stops on pause/end
  • "Full Transcript" button → loading spinner → fills QueryInput with full transcript
  • Error states: upload fails, ASR fails, Full Transcript fails → clear messages

Tasks:

| # | Task | File |
|---|---|---|
| 2.5.1 | Write all 3 tests first | src/test/ |
| 2.5.2 | Create VideoUpload.tsx: native drag-drop, axios progress | components/VideoUpload.tsx |
| 2.5.3 | Create VideoPlayer.tsx: native <video controls>, forwardRef | components/VideoPlayer.tsx |
| 2.5.4 | Update types/index.ts | types/index.ts |
| 2.5.5 | Update lib/api.ts: uploadVideo(), getVideoUrl(), requestFullTranscript() | lib/api.ts |
| 2.5.6 | Update lib/queries.tsx: useVideoUpload() | lib/queries.tsx |
| 2.5.7 | Refactor LTTPage.tsx: replace VideoPlaceholder, wire hooks + QueryInput | pages/LTTPage.tsx |
| 2.5.8 | Update QueryInput.tsx: transcript value + partial text styling | components/QueryInput.tsx |
| 2.5.9 | Run tests → pass → commit | |

Phase 2.6 — Integration & Acceptance Testing (1 day)

Tests: test_integration_phase2.py, test_acceptance_phase2_video.py, test_acceptance_phase2_asr.py, test_acceptance_integration_phase2.py

Tasks:

| # | Task |
|---|---|
| 2.6.1 | Implement integration test (mocked DashScope, real ChromaDB + file I/O) |
| 2.6.2 | Implement acceptance: real video upload + Full Transcript |
| 2.6.3 | Implement acceptance: real DashScope streaming + batch |
| 2.6.4 | Implement E2E acceptance |
| 2.6.5 | Full regression run |
| 2.6.6 | Fix failures, final commit |

Phase 2.7 — Polish & Deployment (0.5 day)

| # | Task |
|---|---|
| 2.7.1 | Structured logging for DashScope proxy + full transcript events |
| 2.7.2 | Update nginx.conf: client_max_body_size 350M |
| 2.7.3 | Verify production build |
| 2.7.4 | Update README.md |
| 2.7.5 | Final commit |

4. Timeline

| Sub-Phase | Description | Effort | Depends On |
|---|---|---|---|
| 2.1 | Config & Infrastructure | 0.5 day | |
| 2.2 | Video Upload Backend | 0.5 day | 2.1 |
| 2.3 | ASR Proxy + Full Transcript | 1 day | 2.1 |
| 2.4 | Transcript → QueryInput | 0.5 day | 2.3 |
| 2.5 | Frontend: Layout + Buttons | 1.5 days | 2.2, 2.3 |
| 2.6 | Integration & Acceptance | 1 day | 2.4, 2.5 |
| 2.7 | Polish & Deployment | 0.5 day | 2.6 |
| Total | | 5.5 days | |

2.2 (upload) and 2.3 (ASR) both depend only on 2.1, so they can run concurrently.
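
The 5.5-day total is summed effort. Elapsed time along the dependency chain can be shorter when independent sub-phases overlap. The sketch below computes both figures from the Effort and Depends On columns; it assumes enough people to work on every unblocked sub-phase at once, which the plan does not state.

```python
from functools import lru_cache

# Effort in days and dependencies, copied from the timeline table.
EFFORT = {"2.1": 0.5, "2.2": 0.5, "2.3": 1.0, "2.4": 0.5, "2.5": 1.5, "2.6": 1.0, "2.7": 0.5}
DEPENDS_ON = {
    "2.1": [],
    "2.2": ["2.1"],
    "2.3": ["2.1"],
    "2.4": ["2.3"],
    "2.5": ["2.2", "2.3"],
    "2.6": ["2.4", "2.5"],
    "2.7": ["2.6"],
}


@lru_cache(maxsize=None)
def earliest_finish(phase: str) -> float:
    """Earliest finish day: latest dependency finish plus this phase's effort."""
    start = max((earliest_finish(dep) for dep in DEPENDS_ON[phase]), default=0.0)
    return start + EFFORT[phase]


total_effort = sum(EFFORT.values())
critical_path = max(earliest_finish(phase) for phase in EFFORT)

print(f"Total effort: {total_effort} days")    # 5.5
print(f"Critical path: {critical_path} days")  # 4.5
```

Under that assumption the critical path is 2.1 → 2.3 → 2.5 → 2.6 → 2.7, or 4.5 days. With a single developer the 5.5-day total applies.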


5. Dependencies

Backend: dashscope>=0.4.0, openai>=1.52.0, zhconv>=1.4.0, python-multipart, aiofiles
Frontend: None (native Web APIs: AudioContext.createMediaElementSource, ScriptProcessorNode, <video>, HTML5 drag-and-drop)
System: ffmpeg on server (for Full Transcript audio extraction)


6. Config Fields

dashscope_api_key: str = ""
asr_model_name: str = "qwen3-asr-flash"                  # Batch API
asr_realtime_model_name: str = "qwen3-asr-flash-realtime" # Streaming
video_upload_dir: str = "./uploads"
max_video_size_mb: int = 300
supported_video_formats: list[str] = [".mp4", ".webm", ".mov", ".avi", ".mkv"]

7. Key Design Decisions

| Decision | Choice | Why |
|---|---|---|
| Audio source | createMediaElementSource(video) | Captures video audio during playback. No mic permission. |
| ASR auto trigger | Video play event | Transcript appears as user watches. Natural UX. |
| ASR stop trigger | Video pause/ended events | Clean lifecycle. New session on next play/seek. |
| Full Transcript | Manual button + server ffmpeg | User explicitly requests. Server has the file. |
| Full Transcript ASR | DashScope OpenAI-compatible API | Standard /v1/chat/completions with input_audio. WAV format. |
| ASR streaming | DashScope realtime SDK | OmniRealtimeConversation + callback → asyncio.Queue bridge |
| Transcript display | QueryInput textarea | Editable. Same box for typing or ASR output. |
| SSL | Required | Chrome may block createMediaElementSource without secure context. |

8. File Manifest

New Files

backend/
  app/routers/video.py
  app/routers/ws_asr.py
  app/services/video_service.py
  app/services/asr_client.py
  app/models/video.py
  app/models/asr.py
  app/test/test_phase2_config.py
  app/test/test_phase2_ws_protocol.py
  app/test/test_phase2_full_transcript.py
  app/test/test_phase2_transcript_to_rag.py

frontend/src/
  components/VideoUpload.tsx
  components/VideoPlayer.tsx
  hooks/useVideoASR.ts
  hooks/useFullTranscript.ts
  test/test_phase2_VideoUpload.test.tsx
  test/test_phase2_VideoPlayer.test.tsx
  test/test_phase2_useVideoASR.test.ts
  test/test_phase2_useFullTranscript.test.ts
  test/test_phase2_QueryInput_integration.test.tsx
  test/test_phase2_LTTPage_integration.test.tsx

Modified Files

backend/app/core/config.py, main.py, test/conftest.py, .env.example, requirements.txt
frontend/src/pages/LTTPage.tsx, components/QueryInput.tsx, lib/api.ts, lib/queries.tsx, types/index.ts

9. Reference Code (.examples/)

| File | Content |
|---|---|
| alibaba_asr_backend.py | DashScope WebSocket proxy + non-streaming fallback (FastAPI) |
| alibaba_asr_frontend_vanilla.html | Browser audio capture (vanilla JS, original) |
| alibaba_asr_frontend_react.tsx | React/TS: useVideoASR (streaming) + useFullTranscript (batch) hooks |
| README.md | Architecture overview + dependency notes |