legco_ai_assistant/.plans/phase2_implementation_plan.md

# Phase 2: Video Upload + Video Audio ASR → RAG — Implementation Plan

**Created:** 2026-05-06
**Updated:** 2026-05-06 (video audio capture via createMediaElementSource; Full Transcript batch mode)
**Status:** Planning — Not Started
**Depends on:** Phase 1 (Complete)

---

## 1. Overview

Phase 2 adds video upload/playback and ASR transcription of the **video's audio track** (not the microphone). While the video plays, the browser captures the video's audio output and streams it through the backend to Alibaba Cloud DashScope for real-time transcription. A "Full Transcript" button instead asks the backend to extract the complete audio track with ffmpeg and submit it for batch (non-streaming) transcription.

### Two ASR Modes

**Mode A — Streaming (real-time, auto on play):**
```
<video> → AudioContext.createMediaElementSource(video)
  → ScriptProcessor(4096, 1, 1) → Float32 PCM
  → WebSocket → FastAPI → DashScope realtime API
  → transcript JSON → QueryInput (in real time)
```
Starts automatically when the video plays and stops on pause, seek, or end. Partial transcripts flow into QueryInput as they arrive.

**Mode B — Full Transcript (batch, manual button):**
```
User clicks "Full Transcript" under video player
  → POST /api/v1/video/{id}/transcribe
  → Backend: ffmpeg extracts audio from uploaded video
  → DashScope OpenAI-compatible API (non-streaming)
  → Complete transcript of entire video → QueryInput
```
Audio is extracted server-side with ffmpeg. The browser only sends the request and displays the result.
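
The extraction step in Mode B could be sketched as below. This is a minimal sketch, not the project's implementation: the function names `build_ffmpeg_args` and `extract_audio` (as a free function) are illustrative, and the output format (16 kHz mono PCM16LE WAV) is taken from task 2.3.5 later in this plan. It assumes `ffmpeg` is on the server's `PATH`.

```python
import asyncio
from pathlib import Path


def build_ffmpeg_args(video_path: Path, wav_path: Path) -> list[str]:
    """Arguments for extracting a 16 kHz mono PCM16LE WAV track from a video."""
    return [
        "ffmpeg",
        "-y",                   # overwrite the output file if it exists
        "-i", str(video_path),  # input video
        "-vn",                  # drop the video stream
        "-ac", "1",             # mono
        "-ar", "16000",         # 16 kHz sample rate
        "-c:a", "pcm_s16le",    # signed 16-bit little-endian PCM
        "-f", "wav",
        str(wav_path),
    ]


async def extract_audio(video_path: Path, wav_path: Path) -> Path:
    """Run ffmpeg as an async subprocess so the event loop is not blocked."""
    proc = await asyncio.create_subprocess_exec(
        *build_ffmpeg_args(video_path, wav_path),
        stdout=asyncio.subprocess.DEVNULL,
        stderr=asyncio.subprocess.PIPE,
    )
    _, stderr = await proc.communicate()
    if proc.returncode != 0:
        # Surface the tail of ffmpeg's stderr so the route can report a clear error.
        raise RuntimeError(f"ffmpeg failed: {stderr.decode(errors='replace')[-500:]}")
    return wav_path
```

Keeping the argument list in its own function makes it unit-testable without ffmpeg installed; only `extract_audio` needs the binary.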

### Changes From Previous Versions

| Aspect | V2 (mic capture) | V3 (video audio capture) |
|---|---|---|
| Audio source | getUserMedia() microphone | createMediaElementSource(video) |
| Trigger | Manual record button | Auto on video play, stop on pause/end |
| Permissions | Microphone permission required | None |
| Batch mode | No | Yes — "Full Transcript" button |
| Backend ffmpeg | Not needed | For Full Transcript mode |

---

## 2. User Flow

1. User uploads video → appears in left panel player
2. User presses play → browser captures video audio → streams to DashScope → transcript in QueryInput
3. User pauses or seeks, or the video ends → streaming stops; the accumulated transcript stays in QueryInput
4. User edits transcript in QueryInput and clicks Submit → Phase 1 RAG pipeline
5. **Full Transcript**: user clicks the "Full Transcript" button → server extracts audio → batch ASR → complete transcript fills QueryInput

---

## 3. Sub-Phases

### Phase 2.1 — Configuration & Infrastructure Setup (0.5 day)

Config fields, directory structure, service/router/model skeletons, register routers.

**Test:** `test_phase2_config.py`

**Tasks:**
| # | Task | File |
|---|------|------|
| 2.1.1 | Add 6 config fields: `dashscope_api_key`, `asr_model_name`, `asr_realtime_model_name`, `video_upload_dir`, `max_video_size_mb`, `supported_video_formats` | `core/config.py` |
| 2.1.2 | Update `.env.example` | `.env.example` |
| 2.1.3 | Add deps: `dashscope>=0.4.0`, `openai>=1.52.0`, `zhconv>=1.4.0`, `python-multipart`, `aiofiles` | `requirements.txt` |
| 2.1.4 | Create `models/video.py` — `VideoUploadResponse`, `FullTranscriptResponse` | `models/video.py` |
| 2.1.5 | Create `models/asr.py` — `ASRTranscriptEvent` | `models/asr.py` |
| 2.1.6 | Create `services/video_service.py`, `services/asr_client.py` stubs | `services/` |
| 2.1.7 | Create `routers/video.py` stub: `POST /upload`, `GET /{id}`, `POST /{id}/transcribe` | `routers/video.py` |
| 2.1.8 | Create `routers/ws_asr.py` stub: `WS /ws/asr/{video_id}?language=yue` | `routers/ws_asr.py` |
| 2.1.9 | Register routers in `main.py` | `main.py` |
| 2.1.10 | Write and pass `test_phase2_config.py` | `app/test/` |

---

### Phase 2.2 — Video Upload Backend (0.5 day)

Streaming upload with size/format validation. Reuses the pattern from `routers/ingest.py`.

**Test:** `test_phase2_video_upload.py` (implement 4 existing stubs)

**Tasks:**
| # | Task | File |
|---|------|------|
| 2.2.1 | Write tests — implement all 4 stubs | `test_phase2_video_upload.py` |
| 2.2.2 | Implement `VideoService.validate_video()`, `save_video()` (streaming, aiofiles) | `services/video_service.py` |
| 2.2.3 | Implement `VideoService.get_video_path()`, `delete_video()` | `services/video_service.py` |
| 2.2.4 | Implement `POST /api/v1/video/upload` route | `routers/video.py` |
| 2.2.5 | Implement `GET /api/v1/video/{video_id}` route (FileResponse) | `routers/video.py` |
| 2.2.6 | Run tests → pass → commit | — |
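
As an illustration of the validation and streaming-save tasks above, here is a minimal standard-library sketch. It is an assumption about shape, not the planned code: the real `VideoService` is expected to use `aiofiles` and FastAPI's upload object, whereas this version takes a plain iterable of byte chunks, and `VideoValidationError` is a hypothetical error type. The limits mirror the config defaults in section 6.

```python
from pathlib import Path
from typing import Iterable

SUPPORTED_VIDEO_FORMATS = {".mp4", ".webm", ".mov", ".avi", ".mkv"}
MAX_VIDEO_SIZE_MB = 300


class VideoValidationError(ValueError):
    """Hypothetical error type for rejected uploads."""


def validate_video(filename: str) -> str:
    """Return the lowercased extension, or raise if the format is unsupported."""
    ext = Path(filename).suffix.lower()
    if ext not in SUPPORTED_VIDEO_FORMATS:
        raise VideoValidationError(f"Unsupported video format: {ext or '(none)'}")
    return ext


def save_video(
    chunks: Iterable[bytes],
    dest: Path,
    max_bytes: int = MAX_VIDEO_SIZE_MB * 1024 * 1024,
) -> int:
    """Write chunks to dest; abort and delete the partial file past the limit."""
    written = 0
    try:
        with dest.open("wb") as fh:
            for chunk in chunks:
                written += len(chunk)
                if written > max_bytes:
                    raise VideoValidationError(f"Video exceeds {max_bytes} bytes")
                fh.write(chunk)
    except VideoValidationError:
        dest.unlink(missing_ok=True)  # do not leave a truncated upload behind
        raise
    return written
```

Checking the size while streaming, rather than after the fact, means an oversized upload is rejected without ever holding the whole file in memory or on disk.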

---

### Phase 2.3 — ASR WebSocket Proxy + Full Transcript Backend (1 day)

Two backend ASR paths: real-time streaming (WebSocket proxy to DashScope) and batch (ffmpeg extract → DashScope non-streaming API).

**Reference:** `.examples/alibaba_asr_backend.py`

**Tests:** `test_phase2_asr_client.py` (3 stubs), `test_phase2_ws_asr.py` (3 stubs), `test_phase2_ws_protocol.py` (new), `test_phase2_full_transcript.py` (new)

**Acceptance Criteria:**
- WebSocket `/ws/asr/{video_id}?language=yue` → Float32 PCM → S16_LE base64 → DashScope realtime
- `transcription.text` events → `{"full_text": "...", "is_final": false}` to browser
- `transcription.completed` events → `{"full_text": "...", "is_final": true}` to browser
- Language: `yue` (Cantonese), `zh`, `en`, `auto`
- Traditional Chinese via `zhconv`
- `POST /api/v1/video/{video_id}/transcribe` → ffmpeg extract audio → DashScope batch → `{"text": "..."}`
- `DASHSCOPE_API_KEY` not set → clear error
- Client disconnect → DashScope session closed cleanly
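
The first three criteria can be sketched with the standard library alone. This is a hedged illustration: `float32_to_s16le` is named in the task list, but its body here is an assumption, and `encode_chunk` and `to_browser_message` are hypothetical helpers. The sketch also assumes the browser sends little-endian Float32 samples and that the text passed to `to_browser_message` is already the full display text. The event names are the shortened ones used in the criteria above.

```python
import base64
import struct
from typing import Optional


def float32_to_s16le(pcm_f32: bytes) -> bytes:
    """Convert little-endian Float32 samples to signed 16-bit little-endian PCM."""
    count = len(pcm_f32) // 4  # ignore any trailing partial sample
    samples = struct.unpack(f"<{count}f", pcm_f32[: count * 4])
    clipped = (max(-1.0, min(1.0, s)) for s in samples)  # guard against overshoot
    return struct.pack(f"<{count}h", *(int(round(s * 32767)) for s in clipped))


def encode_chunk(pcm_f32: bytes) -> str:
    """Base64 text of the converted chunk, ready to embed in a JSON message."""
    return base64.b64encode(float32_to_s16le(pcm_f32)).decode("ascii")


def to_browser_message(event_type: str, text: str) -> Optional[dict]:
    """Map a recognizer event to the JSON shape the browser expects."""
    if event_type == "transcription.text":
        return {"full_text": text, "is_final": False}
    if event_type == "transcription.completed":
        return {"full_text": text, "is_final": True}
    return None  # other event types are not forwarded in this sketch
```

Clamping before scaling keeps samples slightly outside [-1.0, 1.0] from overflowing the 16-bit range.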

**Tasks:**
| # | Task | File |
|---|------|------|
| 2.3.1 | Write tests first | `app/test/` |
| 2.3.2 | `float32_to_s16le()`, `build_display_text()`, `_to_traditional()` | `services/asr_client.py` |
| 2.3.3 | `DashScopeCallback` (sync SDK → asyncio.Queue bridge) + `_ws_proxy_dashscope()` | `routers/ws_asr.py` |
| 2.3.4 | WebSocket endpoint | `routers/ws_asr.py` |
| 2.3.5 | `VideoService.extract_audio()` — ffmpeg async subprocess: PCM16LE 16kHz mono | `services/video_service.py` |
| 2.3.6 | `ASRClient.transcribe_full()` — batch: WAV → DashScope OpenAI-compatible API | `services/asr_client.py` |
| 2.3.7 | `POST /api/v1/video/{video_id}/transcribe` route | `routers/video.py` |
| 2.3.8 | Enhance `conftest.py` mock_asr_client | `conftest.py` |
| 2.3.9 | Run tests → pass → commit | — |
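
Task 2.3.3 bridges a synchronous SDK callback into asyncio. One common way to do that is `loop.call_soon_threadsafe`, sketched below without the DashScope SDK. The names `QueueBridge`, `fake_sdk_worker`, and `relay` are hypothetical, and a plain thread stands in for the SDK's callback thread; the real `DashScopeCallback` would implement whatever callback interface the SDK defines.

```python
import asyncio
import threading
from typing import Optional


class QueueBridge:
    """Hands events from a worker thread to an asyncio.Queue."""

    def __init__(self, loop: asyncio.AbstractEventLoop) -> None:
        self._loop = loop
        self.queue: asyncio.Queue = asyncio.Queue()

    def on_event(self, event: Optional[dict]) -> None:
        # Called from a non-asyncio thread. asyncio.Queue is not thread-safe,
        # so the put is scheduled on the loop's own thread.
        self._loop.call_soon_threadsafe(self.queue.put_nowait, event)


def fake_sdk_worker(bridge: QueueBridge) -> None:
    """Stand-in for the SDK's callback thread (assumption for this sketch)."""
    bridge.on_event({"full_text": "hello", "is_final": False})
    bridge.on_event({"full_text": "hello world", "is_final": True})
    bridge.on_event(None)  # sentinel: session closed


async def relay() -> list:
    """Drain the queue until the sentinel arrives, as the WebSocket proxy would."""
    bridge = QueueBridge(asyncio.get_running_loop())
    threading.Thread(target=fake_sdk_worker, args=(bridge,), daemon=True).start()
    received = []
    while (event := await bridge.queue.get()) is not None:
        received.append(event)  # the real proxy would send this to the browser
    return received
```

Running `asyncio.run(relay())` returns the two transcript events in order. The `None` sentinel gives the consumer a clean way to stop when the upstream session ends.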

---

### Phase 2.4 — Transcript → QueryInput + Full Transcript Button (0.5 day)

Wire the real-time transcript from streaming ASR into QueryInput, and wire up the Full Transcript button.

**Tests:** `test_phase2_useVideoASR.test.ts`, `test_phase2_useFullTranscript.test.ts`, `test_phase2_QueryInput_integration.test.tsx`

**Acceptance Criteria:**
- `useVideoASR` hook: auto-starts on video `play`, stops on `pause`/`ended`
- `useVideoASR` exposes `transcript`, `partialTranscript`, `isStreaming`, `status`
- `useFullTranscript` hook: `requestFullTranscript()` → loading → transcript or error
- QueryInput shows transcript (grey italic = partial, black = final)
- QueryInput accepts `partialText` prop

**Tasks:**
| # | Task | File |
|---|------|------|
| 2.4.1 | Write tests first | `src/test/` |
| 2.4.2 | Create `hooks/useVideoASR.ts` (see `.examples/alibaba_asr_frontend_react.tsx`) | `hooks/useVideoASR.ts` |
| 2.4.3 | Create `hooks/useFullTranscript.ts` | `hooks/useFullTranscript.ts` |
| 2.4.4 | Update `types/index.ts` — `ASRMessage`, `ASRStatus`, `FullTranscriptResponse` | `types/index.ts` |
| 2.4.5 | Update `QueryInput.tsx` — add `partialText` prop | `components/QueryInput.tsx` |
| 2.4.6 | Run tests → pass → commit | — |

---

### Phase 2.5 — Frontend: Video Player + Buttons + Layout (1.5 days)

Replace `VideoPlaceholder` with video upload and a player. ASR starts automatically on play. The Full Transcript button sits under the player.

```
┌─────────────────────┬──────────────────────────┐
│ VideoUpload /       │ QueryInput               │  ← Upper Panel (30%)
│ VideoPlayer         │   (transcript flows here  │
│                     │    from video audio ASR)  │
│ [Full Transcript]   │   [Submit]                │
├─────────────────────┴──────────────────────────┤
│ ResponsePanel                                   │  ← Lower Panel (70%)
└────────────────────────────────────────────────┘
```

**Tests:** `test_phase2_VideoUpload.test.tsx`, `test_phase2_VideoPlayer.test.tsx`, `test_phase2_LTTPage_integration.test.tsx`

**Acceptance Criteria:**
- Drag-and-drop video upload with progress bar (native HTML5)
- Video player with native `<video controls>` exposing ref
- ASR auto on play → transcript in QueryInput; stops on pause/end
- "Full Transcript" button → loading spinner → fills QueryInput with full transcript
- Error states: upload fails, ASR fails, Full Transcript fails → clear messages

**Tasks:**
| # | Task | File |
|---|------|------|
| 2.5.1 | Write all 3 tests first | `src/test/` |
| 2.5.2 | Create `VideoUpload.tsx` — native drag-drop, axios progress | `components/VideoUpload.tsx` |
| 2.5.3 | Create `VideoPlayer.tsx` — native `<video controls>`, forwardRef | `components/VideoPlayer.tsx` |
| 2.5.4 | Update `types/index.ts` | `types/index.ts` |
| 2.5.5 | Update `lib/api.ts` — `uploadVideo()`, `getVideoUrl()`, `requestFullTranscript()` | `lib/api.ts` |
| 2.5.6 | Update `lib/queries.tsx` — `useVideoUpload()` | `lib/queries.tsx` |
| 2.5.7 | Refactor `LTTPage.tsx` — replace VideoPlaceholder, wire hooks + QueryInput | `pages/LTTPage.tsx` |
| 2.5.8 | Update `QueryInput.tsx` — transcript value + partial text styling | `components/QueryInput.tsx` |
| 2.5.9 | Run tests → pass → commit | — |

---

### Phase 2.6 — Integration & Acceptance Testing (1 day)

**Tests:** `test_integration_phase2.py`, `test_acceptance_phase2_video.py`, `test_acceptance_phase2_asr.py`, `test_acceptance_integration_phase2.py`

**Tasks:**
| # | Task |
|---|------|
| 2.6.1 | Implement integration test (mocked DashScope, real ChromaDB + file I/O) |
| 2.6.2 | Implement acceptance: real video upload + Full Transcript |
| 2.6.3 | Implement acceptance: real DashScope streaming + batch |
| 2.6.4 | Implement E2E acceptance |
| 2.6.5 | Full regression run |
| 2.6.6 | Fix failures, final commit |

---

### Phase 2.7 — Polish & Deployment (0.5 day)

| # | Task |
|---|------|
| 2.7.1 | Structured logging for DashScope proxy + full transcript events |
| 2.7.2 | Update `nginx.conf` — `client_max_body_size` 350M |
| 2.7.3 | Verify production build |
| 2.7.4 | Update `README.md` |
| 2.7.5 | Final commit |

---

## 4. Timeline

| Sub-Phase | Description | Effort | Depends On |
|---|---|---|---|
| 2.1 | Config & Infrastructure | 0.5 day | — |
| 2.2 | Video Upload Backend | 0.5 day | 2.1 |
| 2.3 | ASR Proxy + Full Transcript | 1 day | 2.1 |
| 2.4 | Transcript → QueryInput | 0.5 day | 2.3 |
| 2.5 | Frontend: Layout + Buttons | 1.5 days | 2.2, 2.3 |
| 2.6 | Integration & Acceptance | 1 day | 2.4, 2.5 |
| 2.7 | Polish & Deployment | 0.5 day | 2.6 |
| **Total** | | **5.5 days** | |

2.2 (upload) and 2.3 (ASR) both depend only on 2.1, so they can run concurrently.
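
The table's 5.5 days is total effort. If independent sub-phases really do run in parallel (which assumes enough people to staff them), the elapsed time is the longest dependency chain instead. A short calculation over the efforts and dependencies exactly as listed in the table:

```python
EFFORT = {"2.1": 0.5, "2.2": 0.5, "2.3": 1.0, "2.4": 0.5,
          "2.5": 1.5, "2.6": 1.0, "2.7": 0.5}
DEPENDS_ON = {
    "2.1": [], "2.2": ["2.1"], "2.3": ["2.1"], "2.4": ["2.3"],
    "2.5": ["2.2", "2.3"], "2.6": ["2.4", "2.5"], "2.7": ["2.6"],
}


def earliest_finish(effort: dict, depends_on: dict) -> dict:
    """Earliest finish day of each sub-phase, given unlimited parallelism."""
    finish: dict = {}

    def visit(phase: str) -> float:
        if phase not in finish:
            # A sub-phase starts once its slowest dependency has finished.
            start = max((visit(dep) for dep in depends_on[phase]), default=0.0)
            finish[phase] = start + effort[phase]
        return finish[phase]

    for phase in effort:
        visit(phase)
    return finish


total_effort = sum(EFFORT.values())
elapsed = max(earliest_finish(EFFORT, DEPENDS_ON).values())
print(total_effort, elapsed)  # 5.5 4.5
```

The critical path is 2.1 → 2.3 → 2.5 → 2.6 → 2.7, so parallel execution saves one day relative to doing everything serially.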

---

## 5. Dependencies

- **Backend:** `dashscope>=0.4.0`, `openai>=1.52.0`, `zhconv>=1.4.0`, `python-multipart`, `aiofiles`
- **Frontend:** None (native Web APIs: `AudioContext.createMediaElementSource`, `ScriptProcessorNode`, `<video>`, HTML5 drag-and-drop)
- **System:** ffmpeg on server (for Full Transcript audio extraction)

---

## 6. Config Fields

```python
dashscope_api_key: str = ""
asr_model_name: str = "qwen3-asr-flash"                  # Batch API
asr_realtime_model_name: str = "qwen3-asr-flash-realtime" # Streaming
video_upload_dir: str = "./uploads"
max_video_size_mb: int = 300
supported_video_formats: list[str] = [".mp4", ".webm", ".mov", ".avi", ".mkv"]
```
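
For illustration, the sketch below shows how two of these fields might be read from the environment and how the "`DASHSCOPE_API_KEY` not set → clear error" criterion could be enforced. It is a standard-library stand-in, not the project's `core/config.py`: `Phase2Settings`, `load_settings`, and `require_api_key` are hypothetical names, only a subset of the fields is included, and the `MAX_VIDEO_SIZE_MB` environment variable name is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Phase2Settings:
    """Subset of the Phase 2 config fields, with the same defaults."""
    dashscope_api_key: str = ""
    max_video_size_mb: int = 300

    @property
    def max_video_size_bytes(self) -> int:
        return self.max_video_size_mb * 1024 * 1024


def load_settings(env: Mapping[str, str] = os.environ) -> Phase2Settings:
    """Build settings from environment variables, falling back to defaults."""
    return Phase2Settings(
        dashscope_api_key=env.get("DASHSCOPE_API_KEY", ""),
        max_video_size_mb=int(env.get("MAX_VIDEO_SIZE_MB", "300")),  # assumed name
    )


def require_api_key(settings: Phase2Settings) -> str:
    """Fail fast with a clear message when the key is missing."""
    if not settings.dashscope_api_key:
        raise RuntimeError("DASHSCOPE_API_KEY is not set; ASR endpoints are unavailable.")
    return settings.dashscope_api_key
```

Passing the environment as a mapping keeps the loader easy to test: `load_settings({"DASHSCOPE_API_KEY": "k"})` works without touching the real environment.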

---

## 7. Key Design Decisions

| Decision | Choice | Why |
|---|---|---|
| Audio source | `createMediaElementSource(video)` | Captures video audio during playback. No mic permission. |
| ASR auto trigger | Video `play` event | Transcript appears as user watches. Natural UX. |
| ASR stop trigger | Video `pause`/`ended` events | Clean lifecycle. New session on next play/seek. |
| Full Transcript | Manual button + server ffmpeg | User explicitly requests. Server has the file. |
| Full Transcript ASR | DashScope OpenAI-compatible API | Standard `/v1/chat/completions` with `input_audio`. WAV format. |
| ASR streaming | DashScope realtime SDK | `OmniRealtimeConversation` + callback → asyncio.Queue bridge |
| Transcript display | QueryInput textarea | Editable. Same box for typing or ASR output. |
| SSL | Required | Chrome may block `createMediaElementSource` without secure context. |

---

## 8. File Manifest

### New Files
```
backend/
  app/routers/video.py
  app/routers/ws_asr.py
  app/services/video_service.py
  app/services/asr_client.py
  app/models/video.py
  app/models/asr.py
  app/test/test_phase2_config.py
  app/test/test_phase2_ws_protocol.py
  app/test/test_phase2_full_transcript.py
  app/test/test_phase2_transcript_to_rag.py

frontend/src/
  components/VideoUpload.tsx
  components/VideoPlayer.tsx
  hooks/useVideoASR.ts
  hooks/useFullTranscript.ts
  test/test_phase2_VideoUpload.test.tsx
  test/test_phase2_VideoPlayer.test.tsx
  test/test_phase2_useVideoASR.test.ts
  test/test_phase2_useFullTranscript.test.ts
  test/test_phase2_QueryInput_integration.test.tsx
  test/test_phase2_LTTPage_integration.test.tsx
```

### Modified Files
```
backend/app/core/config.py, main.py, test/conftest.py, .env.example, requirements.txt
frontend/src/pages/LTTPage.tsx, components/QueryInput.tsx, lib/api.ts, lib/queries.tsx, types/index.ts
```

---

## 9. Reference Code (`.examples/`)

| File | Content |
|---|---|
| `alibaba_asr_backend.py` | DashScope WebSocket proxy + non-streaming fallback (FastAPI) |
| `alibaba_asr_frontend_vanilla.html` | Browser audio capture (vanilla JS, original) |
| `alibaba_asr_frontend_react.tsx` | React/TS: `useVideoASR` (streaming) + `useFullTranscript` (batch) hooks |
| `README.md` | Architecture overview + dependency notes |