# RAG Video Q&A Web Application - Development Plan

**Project Overview**

Web-based application built in two phases.

- **Phase 1**: Text question → query decomposition → RAG retrieval → relevance filtering → point-form answer (strictly from database)
- **Phase 2**: Video upload + player → real-time audio streaming → ASR transcription → question extraction → Phase 1 RAG flow

**Tech Stack**

- **Backend**: Python + FastAPI (REST + WebSocket)
- **Frontend**: TypeScript + React 18 (Vite) + shadcn/ui + Tailwind CSS
- **Server**: Linux Ubuntu 22.04
- **RAG Database**: ChromaDB (persistent)
- **LLM/ASR Integration**: Dynamic via `.env` (supports local vLLM, OpenRouter, Alibaba Cloud)
  - Alibaba Cloud reference: https://modelstudio.console.alibabacloud.com/ap-southeast-1?switchAgent=101503&tab=doc&productCode=p_efm&switchUserType=3#/doc/?type=model&url=2989727
- **Models**:
  - Embedding: `qwen/qwen3-embedding-4b` (via sentence-transformers, provider-switchable via `.env`)
  - LLM: `qwen/qwen3.5-35b-a3b` (OpenRouter for dev, local vLLM for prod)
  - ASR: `Qwen/Qwen3-ASR-1.7B`

**Deployment**

- Development: Simple commands (`uvicorn` + `npm run dev`)
- Production: Docker + Nginx

---

## Project Structure (Monorepo)

    app/
    ├── backend/                  # FastAPI
    │   ├── app/
    │   │   ├── main.py
    │   │   ├── routers/          # query.py, ingest.py, video.py, ws_asr.py
    │   │   ├── services/         # rag.py, llm_client.py, asr_client.py, video_service.py
    │   │   ├── models/           # Pydantic schemas
    │   │   ├── core/             # config.py, database.py
    │   │   └── utils/            # chunking, metadata extraction
    │   ├── uploads/              # video storage (max 300MB)
    │   ├── requirements.txt
    │   └── .env.example
    ├── frontend/                 # React + TypeScript (Vite)
    │   ├── src/
    │   │   ├── components/
    │   │   ├── pages/
    │   │   ├── lib/              # api.ts
    │   │   └── App.tsx
    │   ├── package.json
    │   └── vite.config.ts
    ├── chroma_db/                # Persistent vector store
    ├── Dockerfile
    ├── docker-compose.yml
    ├── nginx.conf
    └── deploy.sh

---

## Key Requirements Incorporated

- **LLM/ASR Configuration**: Backend reads from `.env` for easy switching
between development (OpenRouter / Alibaba Cloud) and production (local vLLM).
- **RAG Database**: ChromaDB with metadata support (filename + extracted content metadata).
- **Embedding Model**: `qwen/qwen3-embedding-4b` via sentence-transformers, provider-switchable via `.env` (OpenRouter for dev, local vLLM for prod).
- **Document Ingestion**: Via UI (project-based demo, no user authentication). Supported formats: DOCX.
- **Chunking Strategy**: 1000 tokens per chunk, 200 token overlap. Strategy abstracted for future replacement.
- **Video**: MP4 and common formats, maximum 300MB.
- **ASR Flow**: Both **automatic** (on transcript updates) and **manual** "Ask from Video" button.
- **UI Layout** (Phase 2 grid, pre-allocated in Phase 1):
  - Top-Left: Video player (empty in Phase 1)
  - Top-Right: Text input box + extracted keywords display
  - Bottom Half: RAG response (bullet points with source metadata)
- **Authentication**: Public demo (no login required).
- **Mobile**: Not required at this stage.
- **CORS**: Standard FastAPI CORS middleware for frontend-backend communication.

---

## Phase 1: Text Question → RAG → Point-Form Answer (5-7 days)

### RAG Pipeline (3-Step LLM Workflow)

```
User Question
    ↓
[LLM Call 1] Extract key questions + keywords from user input
    ↓  ← keywords shown to user in UI
[ChromaDB] Retrieve chunks using extracted keywords
    ↓
[LLM Call 2] Single batch relevance filter: evaluate all chunks, drop irrelevant ones
    ↓
[LLM Call 3] Generate bullet-point response from filtered chunks only
```

- **Query Decomposition** (`services/query_decomposer.py`): LLM extracts key questions and search keywords from user's natural language question. Keywords are displayed to the user for transparency.
- **Relevance Filtering** (`services/relevance_filter.py`): Single batch LLM call receives all retrieved chunks + original question. Returns relevance verdict for each chunk. Irrelevant chunks are discarded before response generation.
- **Strict RAG Prompt**: Final LLM call generates bullet-point answer using ONLY filtered relevant chunks. No external knowledge allowed. Response format enforced via prompt engineering.

### Backend (FastAPI)

- Dynamic configuration via `.env` (LLM base URL, API key, model names, embedding provider).
- `services/rag.py`: Persistent ChromaDB + Qwen embedding + metadata extraction (filename, upload date, content summary).
- `services/llm_client.py`: OpenAI-compatible client for Qwen LLM.
- `services/query_decomposer.py`: LLM-based keyword/question extraction.
- `services/relevance_filter.py`: LLM-based batch relevance scoring.
- `utils/chunking.py`: DOCX parsing + text chunking (1000 tokens, 200 overlap). Strategy abstracted for future replacement.
- Endpoints:
  - `POST /api/v1/ingest` – DOCX upload, parsing, chunking, embedding, and ingestion with metadata.
  - `POST /api/v1/query` – Full 3-step pipeline: decompose → retrieve → filter → respond. Returns bullet-point answer + extracted keywords + source metadata.

### Frontend (React + TS)

- Phase 2 grid layout pre-allocated: Top-Left video area (empty/hidden), Top-Right input area, Bottom response area.
- Type-safe API calls using TanStack Query.
- Display extracted keywords to user (shown before final answer arrives).
- Display answer as clean bullet list with source metadata.

---

## Phase 2: Video Upload + Real-Time ASR → RAG (8-10 days)

### Backend Additions

- Video upload (`POST /api/v1/upload-video`) with size/format validation (<300MB).
- Static file serving for videos.
- WebSocket `/ws/asr/{video_id}` for real-time audio chunk streaming.
- ASR integration with `Qwen/Qwen3-ASR-1.7B` (file upload or audio content).
- Question extraction via LLM, then trigger Phase 1 RAG (auto + manual support).

### Frontend Additions

- Drag & drop video upload + progress.
- Video player (`