# RAG Video Q&A Web Application - Development Plan

**Project Overview**  
Web-based application built in two phases.  
- **Phase 1**: Text question → RAG retrieval → Point-form answer (strictly from database)  
- **Phase 2**: Video upload + player → real-time audio streaming → ASR transcription → question extraction → Phase 1 RAG flow  

**Tech Stack**  
- **Backend**: Python + FastAPI (REST + WebSocket)  
- **Frontend**: TypeScript + React 18 (Vite) + shadcn/ui + Tailwind CSS  
- **Server**: Linux Ubuntu 22.04  
- **RAG Database**: ChromaDB (persistent)  
- **LLM/ASR Integration**: Dynamic via `.env` (supports local vLLM, OpenRouter, Alibaba Cloud)  
    - Alibaba Cloud reference: https://modelstudio.console.alibabacloud.com/ap-southeast-1?switchAgent=101503&tab=doc&productCode=p_efm&switchUserType=3#/doc/?type=model&url=2989727

- **Models**:  
  - Embedding: `qwen/qwen3-embedding-4b`  
  - LLM: `qwen/qwen3.5-35b-a3b`  
  - ASR: `Qwen/Qwen3-ASR-1.7B`  

**Deployment**  
- Development: Simple commands (`uvicorn` + `npm run dev`)  
- Production: Docker + Nginx  

---

## Project Structure (Monorepo)
app/
├── backend/                  # FastAPI
│   ├── app/
│   │   ├── main.py
│   │   ├── routers/          # query.py, ingest.py, video.py, ws_asr.py
│   │   ├── services/         # rag.py, llm_client.py, asr_client.py, video_service.py
│   │   ├── models/           # Pydantic schemas
│   │   ├── core/             # config.py, database.py
│   │   └── utils/            # chunking, metadata extraction
│   ├── uploads/              # video storage (max 300MB)
│   ├── requirements.txt
│   └── .env.example
├── frontend/                 # React + TypeScript (Vite)
│   ├── src/
│   │   ├── components/
│   │   ├── pages/
│   │   ├── lib/              # api.ts
│   │   └── App.tsx
│   ├── package.json
│   └── vite.config.ts
├── chroma_db/                # Persistent vector store
├── Dockerfile
├── docker-compose.yml
├── nginx.conf
└── deploy.sh


---

## Key Requirements Incorporated

- **LLM/ASR Configuration**: Backend reads from `.env` for easy switching between development (OpenRouter / Alibaba Cloud) and production (local vLLM).  
- **RAG Database**: ChromaDB with metadata support (filename + extracted content metadata).  
- **Embedding Model**: `qwen/qwen3-embedding-4b` via sentence-transformers.  
- **Document Ingestion**: Via UI (project-based demo, no user authentication).  
- **Video**: MP4 and common formats, maximum 300MB.  
- **ASR Flow**: Both **automatic** (on transcript updates) and **manual** “Ask from Video” button.  
- **UI Layout**:  
  - Top-Left: Video player  
  - Top-Right: Real-time transcript + text input box  
  - Bottom Half: RAG response (bullet points with source metadata)  
- **Authentication**: Public demo (no login required).  
- **Mobile**: Not required at this stage.  

---

## Phase 1: Text Question → RAG → Point-Form Answer (5-7 days)

### Backend (FastAPI)
- Dynamic configuration via `.env` (LLM base URL, API key, model names).  
- `services/rag.py`: Persistent ChromaDB + Qwen embedding + metadata extraction (filename, upload date, content summary).  
- `services/llm_client.py`: OpenAI-compatible client for Qwen LLM with **strict RAG prompt** (only use retrieved context).  
- Endpoints:  
  - `POST /api/v1/ingest` – Document upload and ingestion with metadata.  
  - `POST /api/v1/query` – Question → retrieve → LLM → bullet-point response.

### Frontend (React + TS)
- Clean layout: Top-right input box, bottom response area.  
- Type-safe API calls using TanStack Query.  
- Display answer as clean bullet list with source metadata.

---

## Phase 2: Video Upload + Real-Time ASR → RAG (8-10 days)

### Backend Additions
- Video upload (`POST /api/v1/upload-video`) with size/format validation (<300MB).  
- Static file serving for videos.  
- WebSocket `/ws/asr/{video_id}` for real-time audio chunk streaming.  
- ASR integration with `Qwen/Qwen3-ASR-1.7B` (file upload or audio content).  
- Question extraction via LLM, then trigger Phase 1 RAG (auto + manual support).

### Frontend Additions
- Drag & drop video upload + progress.  
- Video player (`<video controls>`).  
- Live transcript display (scrolling box).  
- Top-Left: Video player | Top-Right: Live transcript + manual input.  
- Bottom: RAG response panel.  
- Support both automatic “Ask” on transcript updates and manual button.

---

## Development Timeline

| Phase                        | Duration     | Key Deliverables |
|-----------------------------|--------------|------------------|
| Setup + Phase 1 Backend     | 3-4 days     | FastAPI + Chroma + Metadata + LLM client |
| Phase 1 Frontend            | 2-3 days     | UI layout + text query flow |
| Phase 2 Backend             | 4-5 days     | Video upload + WebSocket ASR + question extraction |
| Phase 2 Frontend            | 3-4 days     | Video player + live transcript + auto/manual flow |
| Testing & Polish            | 1-2 days     | End-to-end testing + deployment scripts |

**Total Estimated Effort**: 13-17 developer days (2-3 weeks)

---

## Deployment Strategy

**Development**:
- Backend: `cd backend && uvicorn app.main:app --reload --port 8000`
- Frontend: `cd frontend && npm run dev`

**Production**:
- Use `docker-compose up -d` (includes backend, built frontend, Nginx reverse proxy).
- Simple `deploy.sh` script for building and restarting.


---

**File Information**  
- Filename: `development_plan.md`  
- Last Updated: April 2026  
- Status: Ready for implementation