legco_ai_assistant/development_plan.md

# RAG Video Q&A Web Application - Development Plan

**Project Overview**
Web-based application, originally planned in two phases. Phase 4 (live audio capture) was added later; see the Development Timeline.
- **Phase 1**: Text question → query decomposition → RAG retrieval → relevance filtering → point-form answer (strictly from database)
- **Phase 2**: Video upload + player → real-time audio streaming → ASR transcription → question extraction → Phase 1 RAG flow

**Tech Stack**
- **Backend**: Python + FastAPI (REST + WebSocket)
- **Frontend**: TypeScript + React 18 (Vite) + shadcn/ui + Tailwind CSS
- **Server**: Linux Ubuntu 22.04
- **RAG Database**: ChromaDB (persistent)
- **LLM/ASR Integration**: Dynamic via `.env` (supports local vLLM, OpenRouter, Alibaba Cloud)
    - Alibaba Cloud reference: https://modelstudio.console.alibabacloud.com/ap-southeast-1?switchAgent=101503&tab=doc&productCode=p_efm&switchUserType=3#/doc/?type=model&url=2989727

- **Models**:
  - Embedding: `qwen/qwen3-embedding-4b` (via sentence-transformers, provider-switchable via `.env`)
  - LLM: `qwen/qwen3.5-35b-a3b` (OpenRouter for dev, local vLLM for prod)
  - ASR: `Qwen/Qwen3-ASR-1.7B`

**Deployment**
- Development: Simple commands (`uvicorn` + `npm run dev`)
- Production: Docker + Nginx

---

## Project Structure (Monorepo)
```
app/
├── backend/                  # FastAPI
│   ├── app/
│   │   ├── main.py
│   │   ├── routers/          # query.py, ingest.py, video.py, ws_asr.py
│   │   ├── services/         # rag.py, llm_client.py, asr_client.py, video_service.py
│   │   ├── models/           # Pydantic schemas
│   │   ├── core/             # config.py, database.py
│   │   └── utils/            # chunking, metadata extraction
│   ├── uploads/              # video storage (max 300MB)
│   ├── requirements.txt
│   └── .env.example
├── frontend/                 # React + TypeScript (Vite)
│   ├── src/
│   │   ├── components/
│   │   ├── pages/
│   │   ├── lib/              # api.ts
│   │   └── App.tsx
│   ├── package.json
│   └── vite.config.ts
├── chroma_db/                # Persistent vector store
├── Dockerfile
├── docker-compose.yml
├── nginx.conf
└── deploy.sh
```


---

## Key Requirements Incorporated

- **LLM/ASR Configuration**: Backend reads from `.env` for easy switching between development (OpenRouter / Alibaba Cloud) and production (local vLLM).
- **RAG Database**: ChromaDB with metadata support (filename + extracted content metadata).
- **Embedding Model**: `qwen/qwen3-embedding-4b` via sentence-transformers, provider-switchable via `.env` (OpenRouter for dev, local vLLM for prod).
- **Document Ingestion**: Via UI (project-based demo, no user authentication). Supported formats: DOCX, PDF.
- **Chunking Strategy**: 1000 tokens per chunk, 200 token overlap. Strategy abstracted for future replacement.
- **Video**: MP4 and common formats, maximum 300MB.
- **ASR Flow**: Both **automatic** (on transcript updates) and **manual** "Ask from Video" button.
- **UI Layout** (Phase 2 grid, pre-allocated in Phase 1):
  - Top-Left: Video player (empty in Phase 1)
  - Top-Right: Text input box + extracted keywords display
  - Bottom Half: RAG response (bullet points with source metadata)
- **Authentication**: Public demo (no login required).
- **Mobile**: Not required at this stage.
- **CORS**: Standard FastAPI CORS middleware for frontend-backend communication.
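
The chunking requirement above (1000 tokens per chunk, 200 token overlap, replaceable strategy) can be illustrated with a minimal sketch. It assumes a "token" is a whitespace-separated word; the real `utils/chunking.py` would presumably count tokens with the embedding model's tokenizer. The names `chunk_text` and `whitespace_tokenize` are hypothetical and not taken from the codebase.

```python
from typing import Callable, List

# Assumption: a "token" is a whitespace-separated word. Swap in a real
# tokenizer by passing a different `tokenize` callable.
Tokenizer = Callable[[str], List[str]]


def whitespace_tokenize(text: str) -> List[str]:
    return text.split()


def chunk_text(
    text: str,
    chunk_size: int = 1000,
    overlap: int = 200,
    tokenize: Tokenizer = whitespace_tokenize,
) -> List[str]:
    """Split text into windows of `chunk_size` tokens sharing `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    tokens = tokenize(text)
    step = chunk_size - overlap  # 800 new tokens per chunk with the defaults
    chunks: List[str] = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks
```

Passing the tokenizer as a parameter is one way to keep the strategy swappable: a sentence-aware or model-specific splitter can replace the default without touching callers.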

---

## Phase 1: Text Question → RAG → Point-Form Answer (5-7 days)

### RAG Pipeline (3-Step LLM Workflow)

```
User Question
    ↓
[LLM Call 1] Extract key questions + keywords from user input
    ↓                ← keywords shown to user in UI
[ChromaDB] Retrieve chunks using extracted keywords
    ↓
[LLM Call 2] Single batch relevance filter — evaluate all chunks, drop irrelevant ones
    ↓
[LLM Call 3] Generate bullet-point response from filtered chunks only
```
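
The flow above can be sketched as a small orchestration function. This is only an illustration: the four stages are injected as plain callables so the control flow is visible without committing to any client library, and `run_pipeline`, `QueryResult`, and the fallback message are hypothetical names and behavior rather than the project's actual code.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class QueryResult:
    keywords: List[str]  # shown to the user in the UI
    sources: List[str]   # chunks that survived relevance filtering
    answer: str          # bullet-point response


def run_pipeline(
    question: str,
    decompose: Callable[[str], List[str]],                   # LLM call 1
    retrieve: Callable[[List[str]], List[str]],              # ChromaDB lookup
    filter_relevant: Callable[[str, List[str]], List[str]],  # LLM call 2
    generate: Callable[[str, List[str]], str],               # LLM call 3
) -> QueryResult:
    keywords = decompose(question)
    chunks = retrieve(keywords)
    relevant = filter_relevant(question, chunks)
    if not relevant:
        # Assumed behavior: with no supporting chunks, skip the final call
        # instead of letting the model answer from its own knowledge.
        return QueryResult(keywords, [], "No relevant information found in the database.")
    return QueryResult(keywords, relevant, generate(question, relevant))
```

Injecting the stages also makes the flow easy to unit-test with stub functions in place of real LLM and database calls.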

- **Query Decomposition** (`services/query_decomposer.py`): The LLM extracts key questions and search keywords from the user's natural-language question. Keywords are displayed to the user for transparency.
- **Relevance Filtering** (`services/relevance_filter.py`): A single batch LLM call receives all retrieved chunks plus the original question and returns a relevance verdict for each chunk. Irrelevant chunks are discarded before response generation.
- **Strict RAG Prompt**: The final LLM call generates a bullet-point answer using ONLY the filtered, relevant chunks. No external knowledge is allowed. The response format is enforced via prompt engineering.
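
One possible shape for the batch relevance filter is sketched below. It assumes the LLM is asked to reply with a JSON array of relevant chunk numbers; the prompt wording, the reply format, and the function names `build_filter_prompt` and `apply_verdicts` are all illustrative assumptions, not the contents of `services/relevance_filter.py`.

```python
import json
from typing import List


def build_filter_prompt(question: str, chunks: List[str]) -> str:
    """Put every retrieved chunk into one prompt so a single LLM call suffices."""
    numbered = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks))
    return (
        "Decide which of the numbered chunks help answer the question.\n"
        "Reply with a JSON array of the relevant chunk numbers only, e.g. [0, 2].\n\n"
        f"Question: {question}\n\nChunks:\n{numbered}"
    )


def apply_verdicts(llm_reply: str, chunks: List[str]) -> List[str]:
    """Keep only the chunks the LLM marked as relevant, in original order."""
    try:
        indices = json.loads(llm_reply)
    except json.JSONDecodeError:
        return []  # assumed policy: fail closed on an unparseable reply
    if not isinstance(indices, list):
        return []
    keep = {
        i for i in indices
        if isinstance(i, int) and not isinstance(i, bool) and 0 <= i < len(chunks)
    }
    return [chunk for i, chunk in enumerate(chunks) if i in keep]
```

Failing closed means a malformed reply yields no chunks, which fits the strict "database only" rule better than passing unfiltered chunks through.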

### Backend (FastAPI)
- Dynamic configuration via `.env` (LLM base URL, API key, model names, embedding provider).
- `services/rag.py`: Persistent ChromaDB + Qwen embedding + metadata extraction (filename, upload date, content summary).
- `services/llm_client.py`: OpenAI-compatible client for Qwen LLM.
- `services/query_decomposer.py`: LLM-based keyword/question extraction.
- `services/relevance_filter.py`: LLM-based batch relevance scoring.
- `utils/chunking.py`: DOCX parsing + text chunking (1000 tokens, 200 overlap). Strategy abstracted for future replacement.
- Endpoints:
  - `POST /api/v1/ingest` – DOCX upload, parsing, chunking, embedding, and ingestion with metadata.
  - `POST /api/v1/query` – Full pipeline (3 LLM calls): decompose → retrieve → filter → respond. Returns bullet-point answer + extracted keywords + source metadata.
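
As a rough illustration of the `.env`-driven configuration listed above, the sketch below builds a settings object from environment variables. The variable names (`LLM_BASE_URL`, `LLM_API_KEY`, `LLM_MODEL`, `EMBEDDING_PROVIDER`) and the `"local"` default are assumptions; the plan does not name them. It also assumes the `.env` file has already been loaded into the process environment, for example by a dotenv loader.

```python
import os
from dataclasses import dataclass
from typing import Mapping

# Hypothetical variable names; the plan only says these values live in `.env`.
REQUIRED_VARS = ("LLM_BASE_URL", "LLM_API_KEY", "LLM_MODEL")


@dataclass(frozen=True)
class Settings:
    llm_base_url: str
    llm_api_key: str
    llm_model: str
    embedding_provider: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from environment variables, failing fast on gaps."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("missing required settings: " + ", ".join(missing))
    return Settings(
        llm_base_url=env["LLM_BASE_URL"].rstrip("/"),
        llm_api_key=env["LLM_API_KEY"],
        llm_model=env["LLM_MODEL"],
        # Assumed default: compute embeddings locally unless told otherwise.
        embedding_provider=env.get("EMBEDDING_PROVIDER", "local"),
    )
```

With this shape, switching from OpenRouter in development to local vLLM in production would only require different values in `.env`, since both are reached through an OpenAI-compatible base URL.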

### Frontend (React + TS) ✅ Complete
- Phase 2 grid layout pre-allocated: Top-Left video area (empty/hidden), Top-Right input area, Bottom response area.
- Type-safe API calls using TanStack Query.
- Display extracted keywords to user (shown before final answer arrives).
- Display answer as clean bullet list with source metadata.
- Collapsible source cards, copy-to-clipboard button, enhanced skeleton loaders.
- PipelineProgress component (4-stage stepper, ready for streaming API).
- Integration tests: full query flow, error handling, ingest flow.
- **62 tests, TypeScript clean, production build verified.**

---

## Phase 2: Video Upload + Real-Time ASR → RAG (7-9 days)

### Backend Additions
- Video upload (`POST /api/v1/upload-video`) with size/format validation (max 300MB).
- Static file serving for videos.
- WebSocket `/ws/asr/{video_id}` for real-time audio chunk streaming.
- ASR integration with `Qwen/Qwen3-ASR-1.7B` (file upload or audio content).
- Question extraction via LLM, then trigger Phase 1 RAG (auto + manual support).
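
The size/format validation for uploads could look roughly like the sketch below. It is kept free of framework code so it can be called from any upload handler. The extension allowlist and the reading of "300MB" as 300 MiB are assumptions, since the plan only says "MP4 and common formats"; `validate_video` and `UploadRejected` are hypothetical names.

```python
from pathlib import Path

MAX_VIDEO_BYTES = 300 * 1024 * 1024  # assumption: "300MB" means 300 MiB
# Assumption: which "common formats" are accepted is not specified in the plan.
ALLOWED_EXTENSIONS = {".mp4", ".webm", ".mov", ".mkv"}


class UploadRejected(ValueError):
    """Raised when an uploaded file fails validation."""


def validate_video(filename: str, size_bytes: int) -> str:
    """Return the normalized extension, or raise UploadRejected."""
    ext = Path(filename).suffix.lower()
    if ext not in ALLOWED_EXTENSIONS:
        raise UploadRejected(f"unsupported video format: {ext or 'none'}")
    if size_bytes <= 0:
        raise UploadRejected("empty upload")
    if size_bytes > MAX_VIDEO_BYTES:
        raise UploadRejected("video exceeds the 300MB limit")
    return ext
```

An extension check is only a first line of defense; a production handler would likely also enforce the size limit while streaming the body, rather than trusting a client-declared length.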

### Frontend Additions
- Drag & drop video upload + progress.
- Video player (`<video controls>`).
- Live transcript display (scrolling box).
- Top-Left: Video player | Top-Right: Live transcript + manual input.
- Bottom: RAG response panel.
- Support both automatic "Ask" on transcript updates and the manual "Ask from Video" button.

---

## Development Timeline

| Phase                        | Duration     | Key Deliverables | Status |
|-----------------------------|--------------|------------------|--------|
| Setup + Phase 1 Backend     | 3-4 days     | FastAPI + Chroma + Metadata + LLM client | ✅ Complete |
| Phase 1 Frontend            | 2-3 days     | UI layout + text query flow | ✅ Complete |
| Phase 2 Backend             | 4-5 days     | Video upload + WebSocket ASR + question extraction | ✅ Complete |
| Phase 2 Frontend            | 3-4 days     | Video player + live transcript + auto/manual flow | ✅ Complete |
| Phase 4 System Audio & Mic  | 5.5 days     | System Audio capture + Listen Mic + real-time ASR → RAG | ✅ Complete |
| Testing & Polish            | 1-2 days     | End-to-end testing + deployment scripts | ⬜ Pending |

**Total Estimated Effort**: 19-23 developer days (3-4 weeks)

> **Note:** Phase 3 (YouTube Live Stream Proxy → ASR) was implemented (5.5 days, 7 sub-phases) and later reverted in favor of Phase 4's more versatile System Audio Capture approach using `getDisplayMedia()`.
>
> **Phase 4** adds System Audio Capture (`getDisplayMedia`) and Listen Mic (`getUserMedia`) as live audio sources alongside video upload. Both pipe audio through the existing WebSocket → DashScope realtime ASR → RAG pipeline. Implementation complete with 46 frontend + 14 backend tests. See `.plans/phase4_system_audio_plan.md` for details.

---

## Deployment Strategy

**Development**:
- Backend: `cd backend && uvicorn app.main:app --reload --port 8000`
- Frontend: `cd frontend && npm run dev`

**Production**:
- Use `docker-compose up -d` (includes backend, built frontend, Nginx reverse proxy).
- Simple `deploy.sh` script for building and restarting.


---

**File Information**
- Filename: `development_plan.md`
- Last Updated: May 2026
- Status: Phase 1 ✅, Phase 2 ✅, Phase 4 ✅ — System Audio Capture & Listen Mic complete