# RAG Video Q&A Web Application - Development Plan

**Project Overview**  
A web-based application built in two phases:  
- **Phase 1**: Text question → query decomposition → RAG retrieval → relevance filtering → point-form answer (strictly from database)  
- **Phase 2**: Video upload + player → real-time audio streaming → ASR transcription → question extraction → Phase 1 RAG flow  

**Tech Stack**  
- **Backend**: Python + FastAPI (REST + WebSocket)  
- **Frontend**: TypeScript + React 18 (Vite) + shadcn/ui + Tailwind CSS  
- **Server**: Linux Ubuntu 22.04  
- **RAG Database**: ChromaDB (persistent)  
- **LLM/ASR Integration**: Dynamic via `.env` (supports local vLLM, OpenRouter, Alibaba Cloud)  
    - Alibaba Cloud reference: https://modelstudio.console.alibabacloud.com/ap-southeast-1?switchAgent=101503&tab=doc&productCode=p_efm&switchUserType=3#/doc/?type=model&url=2989727

- **Models**:  
  - Embedding: `qwen/qwen3-embedding-4b` (via sentence-transformers, provider-switchable via `.env`)  
  - LLM: `qwen/qwen3.5-35b-a3b` (OpenRouter for dev, local vLLM for prod)  
  - ASR: `Qwen/Qwen3-ASR-1.7B`  
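
As a rough illustration of the `.env`-driven switching described above, `core/config.py` could look something like the sketch below. The variable names (`LLM_BASE_URL`, `LLM_API_KEY`, and so on) and the placeholder base URL are assumptions, not part of the plan, and the sketch reads the process environment, so it assumes the `.env` file has already been loaded or exported.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    llm_base_url: str
    llm_api_key: str
    llm_model: str
    embedding_model: str
    asr_model: str


def load_settings(env=None) -> Settings:
    """Build Settings from environment variables (all variable names are hypothetical)."""
    env = os.environ if env is None else env
    return Settings(
        # Placeholder default; point this at OpenRouter, Alibaba Cloud, or local vLLM.
        llm_base_url=env.get("LLM_BASE_URL", "http://localhost:8001/v1"),
        llm_api_key=env.get("LLM_API_KEY", ""),
        # Model defaults are the names listed in the Tech Stack section.
        llm_model=env.get("LLM_MODEL", "qwen/qwen3.5-35b-a3b"),
        embedding_model=env.get("EMBEDDING_MODEL", "qwen/qwen3-embedding-4b"),
        asr_model=env.get("ASR_MODEL", "Qwen/Qwen3-ASR-1.7B"),
    )
```

With this shape, moving from a hosted provider to local vLLM would only mean changing the values in `.env`; no service code needs to know which provider is active.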

**Deployment**  
- Development: Simple commands (`uvicorn` + `npm run dev`)  
- Production: Docker + Nginx  

---

## Project Structure (Monorepo)

    app/
    ├── backend/                  # FastAPI
    │   ├── app/
    │   │   ├── main.py
    │   │   ├── routers/          # query.py, ingest.py, video.py, ws_asr.py
    │   │   ├── services/         # rag.py, llm_client.py, query_decomposer.py, relevance_filter.py, asr_client.py, video_service.py
    │   │   ├── models/           # Pydantic schemas
    │   │   ├── core/             # config.py, database.py
    │   │   └── utils/            # chunking, metadata extraction
    │   ├── uploads/              # video storage (max 300MB)
    │   ├── requirements.txt
    │   └── .env.example
    ├── frontend/                 # React + TypeScript (Vite)
    │   ├── src/
    │   │   ├── components/
    │   │   ├── pages/
    │   │   ├── lib/              # api.ts
    │   │   └── App.tsx
    │   ├── package.json
    │   └── vite.config.ts
    ├── chroma_db/                # Persistent vector store
    ├── Dockerfile
    ├── docker-compose.yml
    ├── nginx.conf
    └── deploy.sh


---

## Key Requirements Incorporated

- **LLM/ASR Configuration**: Backend reads from `.env` for easy switching between development (OpenRouter / Alibaba Cloud) and production (local vLLM).  
- **RAG Database**: ChromaDB with metadata support (filename + extracted content metadata).  
- **Embedding Model**: `qwen/qwen3-embedding-4b` via sentence-transformers, provider-switchable via `.env` (OpenRouter for dev, local vLLM for prod).  
- **Document Ingestion**: Via UI (project-based demo, no user authentication). Supported formats: DOCX.  
- **Chunking Strategy**: 1000 tokens per chunk, 200 token overlap. Strategy abstracted for future replacement.  
- **Video**: MP4 and other common formats, maximum 300MB.  
- **ASR Flow**: Both **automatic** (on transcript updates) and **manual** "Ask from Video" button.  
- **UI Layout** (Phase 2 grid, pre-allocated in Phase 1):  
  - Top-Left: Video player (empty in Phase 1)  
  - Top-Right: Text input box + extracted keywords display  
  - Bottom Half: RAG response (bullet points with source metadata)  
- **Authentication**: Public demo (no login required).  
- **Mobile**: Not required at this stage.  
- **CORS**: Standard FastAPI CORS middleware for frontend-backend communication.
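
The chunking requirement above (1000 tokens per chunk, 200 token overlap, strategy replaceable later) can be sketched as a sliding window behind a plain callable type. This is a minimal sketch under one loud assumption: whitespace-separated words stand in for model tokens, since the plan does not name a tokenizer. `Chunker` and `sliding_window_chunker` are illustrative names.

```python
from typing import Callable, List

# Any function with this shape can replace the default strategy later.
Chunker = Callable[[str], List[str]]


def sliding_window_chunker(chunk_size: int = 1000, overlap: int = 200) -> Chunker:
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # each window starts 800 tokens after the previous one

    def chunk(text: str) -> List[str]:
        tokens = text.split()  # assumption: whitespace words approximate tokens
        chunks = []
        for start in range(0, len(tokens), step):
            chunks.append(" ".join(tokens[start:start + chunk_size]))
            if start + chunk_size >= len(tokens):
                break  # this window already reaches the end of the text
        return chunks

    return chunk
```

For a 2500-token document this yields three chunks starting at tokens 0, 800, and 1600, with each consecutive pair sharing 200 tokens.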

---

## Phase 1: Text Question → RAG → Point-Form Answer (5-7 days)

### RAG Pipeline (3-Step LLM Workflow)

```
User Question
    ↓
[LLM Call 1] Extract key questions + keywords from user input
    ↓                ← keywords shown to user in UI
[ChromaDB] Retrieve chunks using extracted keywords
    ↓
[LLM Call 2] Single batch relevance filter — evaluate all chunks, drop irrelevant ones
    ↓
[LLM Call 3] Generate bullet-point response from filtered chunks only
```
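
The flow in the diagram can be sketched as a small orchestration function that receives each stage as a callable, which keeps the LLM and ChromaDB details out of the control flow. All names here (`QueryResult`, `run_pipeline`, the parameter names) are hypothetical, and the early return when no chunk survives filtering is an assumption about how "strictly from database" might be enforced.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class QueryResult:
    keywords: List[str]
    sources: List[str]
    answer: str


def run_pipeline(
    question: str,
    decompose: Callable[[str], List[str]],                  # LLM call 1
    retrieve: Callable[[List[str]], List[str]],             # ChromaDB lookup
    keep_relevant: Callable[[str, List[str]], List[str]],   # LLM call 2
    respond: Callable[[str, List[str]], str],               # LLM call 3
) -> QueryResult:
    keywords = decompose(question)
    candidates = retrieve(keywords)
    relevant = keep_relevant(question, candidates)
    if not relevant:
        # Assumption: with no supporting chunks, skip generation instead of
        # letting the model answer from its own knowledge.
        return QueryResult(keywords, [], "- No relevant information found in the database.")
    return QueryResult(keywords, relevant, respond(question, relevant))
```

Passing the stages in as functions also makes the pipeline easy to exercise with stubs before any model endpoint is available.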

- **Query Decomposition** (`services/query_decomposer.py`): LLM extracts key questions and search keywords from user's natural language question. Keywords are displayed to the user for transparency.  
- **Relevance Filtering** (`services/relevance_filter.py`): Single batch LLM call receives all retrieved chunks + original question. Returns relevance verdict for each chunk. Irrelevant chunks are discarded before response generation.  
- **Strict RAG Prompt**: Final LLM call generates bullet-point answer using ONLY filtered relevant chunks. No external knowledge allowed. Response format enforced via prompt engineering.  
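
One way the single-batch relevance filter might work is to number the chunks in the prompt and ask the model for the indices worth keeping. The sketch below assumes a JSON reply of the form `{"relevant": [0, 2]}`; that reply format, the prompt wording, and both function names are illustrative choices, not something the plan specifies.

```python
import json
from typing import List


def build_filter_prompt(question: str, chunks: List[str]) -> str:
    """Put every retrieved chunk into one prompt so a single LLM call judges them all."""
    numbered = "\n\n".join(f"[{i}] {c}" for i, c in enumerate(chunks))
    return (
        "Decide which chunks help answer the question.\n"
        'Reply with JSON only, e.g. {"relevant": [0, 2]}.\n\n'
        f"Question: {question}\n\nChunks:\n{numbered}"
    )


def apply_verdicts(chunks: List[str], llm_reply: str) -> List[str]:
    """Keep only the chunks whose index the model marked as relevant."""
    try:
        indices = json.loads(llm_reply)["relevant"]
        keep = {
            i for i in indices
            if isinstance(i, int) and not isinstance(i, bool) and 0 <= i < len(chunks)
        }
    except (json.JSONDecodeError, KeyError, TypeError):
        return []  # assumption: an unparseable reply keeps nothing (strict mode)
    return [c for i, c in enumerate(chunks) if i in keep]
```

Dropping everything on a malformed reply is the conservative choice for a strict RAG setup; keeping all chunks instead would trade strictness for recall.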

### Backend (FastAPI)
- Dynamic configuration via `.env` (LLM base URL, API key, model names, embedding provider).  
- `services/rag.py`: Persistent ChromaDB + Qwen embedding + metadata extraction (filename, upload date, content summary).  
- `services/llm_client.py`: OpenAI-compatible client for Qwen LLM.  
- `services/query_decomposer.py`: LLM-based keyword/question extraction.  
- `services/relevance_filter.py`: LLM-based batch relevance scoring.  
- `utils/chunking.py`: DOCX parsing + text chunking (1000 tokens, 200 overlap). Strategy abstracted for future replacement.  
- Endpoints:  
  - `POST /api/v1/ingest` – DOCX upload, parsing, chunking, embedding, and ingestion with metadata.  
  - `POST /api/v1/query` – Full pipeline with three LLM calls: decompose → retrieve → filter → respond. Returns bullet-point answer + extracted keywords + source metadata.  
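
Because `services/llm_client.py` targets an OpenAI-compatible API, the same request shape should work against any of the configured providers. The sketch below only builds the HTTP request for the chat completions route using the standard library; `build_chat_request` is a hypothetical helper, and a real client would more likely use an HTTP or SDK library with timeouts and retries.

```python
import json
import urllib.request
from typing import Dict, List


def build_chat_request(base_url: str, api_key: str, model: str,
                       messages: List[Dict[str, str]]) -> urllib.request.Request:
    """Build (but do not send) a POST to the OpenAI-compatible chat completions route."""
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )
```

Sending the request with `urllib.request.urlopen(req)` and reading `choices[0].message.content` from the JSON response would complete the call; the base URL, key, and model name would come from `.env`.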

### Frontend (React + TS)
- Phase 2 grid layout pre-allocated: Top-Left video area (empty/hidden), Top-Right input area, Bottom response area.  
- Type-safe API calls using TanStack Query.  
- Display extracted keywords to user (shown before final answer arrives).  
- Display answer as clean bullet list with source metadata.

---

## Phase 2: Video Upload + Real-Time ASR → RAG (8-10 days)

### Backend Additions
- Video upload (`POST /api/v1/upload-video`) with size/format validation (maximum 300MB).  
- Static file serving for videos.  
- WebSocket `/ws/asr/{video_id}` for real-time audio chunk streaming.  
- ASR integration with `Qwen/Qwen3-ASR-1.7B` (file upload or audio content).  
- Question extraction via LLM, then trigger Phase 1 RAG (auto + manual support).
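
The size/format validation for uploads could be a small pure function that the upload route calls before writing anything to `uploads/`. This is a sketch under stated assumptions: "300MB" is read as 300 MiB, and the extension allow-list beyond `.mp4` is hypothetical because the plan only names MP4 explicitly.

```python
from pathlib import Path

MAX_VIDEO_BYTES = 300 * 1024 * 1024  # assumption: "300MB" interpreted as 300 MiB
ALLOWED_EXTENSIONS = {".mp4", ".webm", ".mov"}  # hypothetical list; only MP4 is named in the plan


def validate_video(filename: str, size_bytes: int) -> None:
    """Raise ValueError if the upload should be rejected; return None otherwise."""
    ext = Path(filename).suffix.lower()
    if ext not in ALLOWED_EXTENSIONS:
        raise ValueError(f"unsupported video format: {ext or '(none)'}")
    if size_bytes <= 0:
        raise ValueError("empty upload")
    if size_bytes > MAX_VIDEO_BYTES:
        raise ValueError("video exceeds the 300MB limit")
```

Checking the extension alone does not prove the file is a valid video, so a production version might also inspect the container header.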

### Frontend Additions
- Drag & drop video upload + progress.  
- Video player (`<video controls>`).  
- Live transcript display (scrolling box).  
- Top-Left: Video player | Top-Right: Live transcript + manual input.  
- Bottom: RAG response panel.  
- Support both an automatic “Ask” triggered by transcript updates and a manual “Ask from Video” button.

---

## Development Timeline

| Phase                        | Duration     | Key Deliverables |
|-----------------------------|--------------|------------------|
| Setup + Phase 1 Backend     | 3-4 days     | FastAPI + Chroma + Metadata + LLM client |
| Phase 1 Frontend            | 2-3 days     | UI layout + text query flow |
| Phase 2 Backend             | 4-5 days     | Video upload + WebSocket ASR + question extraction |
| Phase 2 Frontend            | 3-4 days     | Video player + live transcript + auto/manual flow |
| Testing & Polish            | 1-2 days     | End-to-end testing + deployment scripts |

**Total Estimated Effort**: 13-18 developer days (roughly 3 weeks)

---

## Deployment Strategy

**Development**:
- Backend: `cd backend && uvicorn app.main:app --reload --port 8000`
- Frontend: `cd frontend && npm run dev`

**Production**:
- Use `docker-compose up -d` (includes backend, built frontend, Nginx reverse proxy).
- Simple `deploy.sh` script for building and restarting.


---

**File Information**  
- Filename: `development_plan.md`  
- Last Updated: April 2026  
- Status: Phase 1 clarified, ready for sub-phase planning