RAG Video Q&A Web Application - Development Plan

Project Overview
Web-based RAG question-answering application, planned in two core phases (Phase 4, added later, is described under Development Timeline).

Phase 1: Text question → query decomposition → RAG retrieval → relevance filtering → point-form answer (strictly from database)
Phase 2: Video upload + player → real-time audio streaming → ASR transcription → question extraction → Phase 1 RAG flow

Tech Stack

Backend: Python + FastAPI (REST + WebSocket)
Frontend: TypeScript + React 18 (Vite) + shadcn/ui + Tailwind CSS
Server: Linux Ubuntu 22.04
RAG Database: ChromaDB (persistent)
LLM/ASR Integration: Dynamic via .env (supports local vLLM, OpenRouter, Alibaba Cloud)
- Alibaba Cloud reference: https://modelstudio.console.alibabacloud.com/ap-southeast-1?switchAgent=101503&tab=doc&productCode=p_efm&switchUserType=3#/doc/?type=model&url=2989727
Models:
- Embedding: qwen/qwen3-embedding-4b (via sentence-transformers, provider-switchable via .env)
- LLM: qwen/qwen3.5-35b-a3b (OpenRouter for dev, local vLLM for prod)
- ASR: Alibaba Cloud DashScope (qwen3-asr-flash for batch, qwen3-asr-flash-realtime for streaming)

Deployment

Development: Simple commands (uvicorn + npm run dev)
Production: Docker + Nginx

Project Structure (Monorepo)

app/
├── backend/                  # FastAPI
│   ├── app/
│   │   ├── main.py
│   │   ├── routers/          # query.py, ingest.py, video.py, ws_asr.py
│   │   ├── services/         # rag.py, llm_client.py, asr_client.py, video_service.py
│   │   ├── models/           # Pydantic schemas
│   │   ├── core/             # config.py, database.py
│   │   └── utils/            # chunking, metadata extraction
│   ├── uploads/              # video storage (max 300MB)
│   ├── requirements.txt
│   └── .env.example
├── frontend/                 # React + TypeScript (Vite)
│   ├── src/
│   │   ├── components/
│   │   ├── pages/
│   │   ├── lib/              # api.ts
│   │   └── App.tsx
│   ├── package.json
│   └── vite.config.ts
├── chroma_db/                # Persistent vector store
├── Dockerfile
├── docker-compose.yml
├── nginx.conf
└── deploy.sh

Key Requirements Incorporated

LLM/ASR Configuration: Backend reads from .env for easy switching between development (OpenRouter / Alibaba Cloud) and production (local vLLM).
RAG Database: ChromaDB with metadata support (filename + extracted content metadata).
Embedding Model: qwen/qwen3-embedding-4b via sentence-transformers, provider-switchable via .env (OpenRouter for dev, local vLLM for prod).
Document Ingestion: Via UI (project-based demo, no user authentication). Supported formats: DOCX, PDF.
Chunking Strategy: 1000 tokens per chunk, 200 token overlap. Strategy abstracted for future replacement.
Video: MP4 and common formats, maximum 300MB.
ASR Flow: Supports both automatic triggering (on transcript updates) and a manual "Ask from Video" button.
UI Layout (Phase 2 grid, pre-allocated in Phase 1):
- Top-Left: Video player (empty in Phase 1)
- Top-Right: Text input box + extracted keywords display
- Bottom Half: RAG response (bullet points with source metadata)
Authentication: Public demo (no login required).
Mobile: Not required at this stage.
CORS: Standard FastAPI CORS middleware for frontend-backend communication.
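
The chunking rule listed above (1000 tokens per chunk, 200 token overlap) can be sketched as a sliding window. This is a minimal illustration, not the project's utils/chunking.py: it assumes whitespace-separated words stand in for tokens (a real tokenizer would replace `str.split`), and the name `chunk_tokens` is hypothetical.

```python
from typing import List


def chunk_tokens(tokens: List[str], size: int = 1000, overlap: int = 200) -> List[List[str]]:
    """Split tokens into windows of `size` that share `overlap` tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    step = size - overlap  # with the defaults, each window starts 800 tokens later
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the text
    return chunks


# Whitespace splitting stands in for a real tokenizer (assumption).
words = ("lorem ipsum " * 1100).split()  # 2200 pseudo-tokens
chunks = chunk_tokens(words)
```

With 2200 pseudo-tokens the windows start at 0, 800, and 1600, so consecutive chunks share 200 tokens and the final chunk is shorter. Keeping the window logic in one small function is one way to leave the strategy replaceable later.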

Phase 1: Text Question → RAG → Point-Form Answer (5-7 days)

RAG Pipeline (3-Step LLM Workflow)

User Question
    ↓
[LLM Call 1] Extract key questions + keywords from user input
    ↓                ← keywords shown to user in UI
[ChromaDB] Retrieve chunks using extracted keywords
    ↓
[LLM Call 2] Single batch relevance filter — evaluate all chunks, drop irrelevant ones
    ↓
[LLM Call 3] Generate bullet-point response from filtered chunks only
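
The flow above can be sketched as a single orchestration function that receives each stage as a callable. This is a hedged sketch under stated assumptions: `Chunk`, `Answer`, and `answer_question` are hypothetical names, the stages are stubbed with lambdas instead of real LLM or ChromaDB calls, and skipping generation when no chunk survives filtering is one possible reading of "strictly from database".

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Chunk:
    text: str
    source: str  # e.g. the filename stored as metadata


@dataclass
class Answer:
    keywords: List[str]
    bullets: List[str]
    sources: List[str]


def answer_question(
    question: str,
    decompose: Callable[[str], List[str]],                      # LLM call 1
    retrieve: Callable[[List[str]], List[Chunk]],               # vector store lookup
    keep_relevant: Callable[[str, List[Chunk]], List[Chunk]],   # LLM call 2
    generate: Callable[[str, List[Chunk]], List[str]],          # LLM call 3
) -> Answer:
    keywords = decompose(question)
    candidates = retrieve(keywords)
    relevant = keep_relevant(question, candidates)
    if not relevant:
        # Assumption: with no supporting chunks, return no bullets at all.
        return Answer(keywords, [], [])
    bullets = generate(question, relevant)
    sources = sorted({c.source for c in relevant})
    return Answer(keywords, bullets, sources)


# Stub stages so the sketch runs without any external service.
docs = [Chunk("Refunds take 5 days.", "policy.docx"),
        Chunk("Office hours are 9-5.", "faq.docx")]
result = answer_question(
    "How long do refunds take?",
    decompose=lambda q: ["refunds"],
    retrieve=lambda kws: docs,
    keep_relevant=lambda q, cs: [c for c in cs if "refund" in c.text.lower()],
    generate=lambda q, cs: [c.text for c in cs],
)
```

Passing the stages in as callables keeps the three LLM calls independently testable and lets the keywords be returned to the UI even when the answer is empty.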

Query Decomposition (services/query_decomposer.py): LLM extracts key questions and search keywords from user's natural language question. Keywords are displayed to the user for transparency.
Relevance Filtering (services/relevance_filter.py): Single batch LLM call receives all retrieved chunks + original question. Returns relevance verdict for each chunk. Irrelevant chunks are discarded before response generation.
Strict RAG Prompt: Final LLM call generates bullet-point answer using ONLY filtered relevant chunks. No external knowledge allowed. Response format enforced via prompt engineering.
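
A single-call batch relevance filter might look like the sketch below: every chunk is numbered in one prompt, and the reply is parsed back into a filtered list. The prompt wording, the JSON reply shape, and the function names `build_filter_prompt` and `apply_verdicts` are all assumptions for illustration, not the contents of services/relevance_filter.py.

```python
import json
from typing import List


def build_filter_prompt(question: str, chunks: List[str]) -> str:
    """Number every chunk so one LLM call can return a verdict per chunk."""
    numbered = "\n".join(f"[{i}] {text}" for i, text in enumerate(chunks))
    return (
        "Decide whether each chunk helps answer the question.\n"
        f"Question: {question}\n"
        f"Chunks:\n{numbered}\n"
        'Reply with JSON only, e.g. {"relevant": [0, 2]}.'
    )


def apply_verdicts(raw_reply: str, chunks: List[str]) -> List[str]:
    """Keep only the chunks whose indices the model marked relevant."""
    try:
        indices = json.loads(raw_reply)["relevant"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return []  # unparseable reply: keep nothing rather than guess
    if not isinstance(indices, list):
        return []
    # Ignore anything that is not a valid chunk index.
    keep = {i for i in indices if isinstance(i, int) and 0 <= i < len(chunks)}
    return [text for i, text in enumerate(chunks) if i in keep]
```

Returning an empty list on a malformed reply is a conservative choice that fits the strict-RAG goal; a real service might retry the call instead.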

Backend (FastAPI)

Dynamic configuration via .env (LLM base URL, API key, model names, embedding provider).
services/rag.py: Persistent ChromaDB + Qwen embedding + metadata extraction (filename, upload date, content summary).
services/llm_client.py: OpenAI-compatible client for Qwen LLM.
services/query_decomposer.py: LLM-based keyword/question extraction.
services/relevance_filter.py: LLM-based batch relevance scoring.
utils/chunking.py: DOCX parsing + text chunking (1000 tokens, 200 overlap). Strategy abstracted for future replacement.
Endpoints:
- POST /api/v1/ingest – DOCX upload, parsing, chunking, embedding, and ingestion with metadata.
- POST /api/v1/query – Full pipeline (three LLM calls): decompose → retrieve → filter → respond. Returns bullet-point answer + extracted keywords + source metadata.
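
The .env-driven configuration could be read roughly as follows. This is a sketch only: the variable names (`LLM_BASE_URL`, `LLM_API_KEY`, `LLM_MODEL`, `EMBEDDING_PROVIDER`) and the `LLMSettings` class are hypothetical, the project's .env.example remains the authority on the real keys, and the local URL and port are placeholders. It assumes the .env file has already been loaded into the process environment.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class LLMSettings:
    base_url: str
    api_key: str
    model: str
    embedding_provider: str


def load_settings(env: Mapping[str, str] = os.environ) -> LLMSettings:
    """Build settings from environment variables; the names are illustrative."""
    missing = [k for k in ("LLM_BASE_URL", "LLM_MODEL") if not env.get(k)]
    if missing:
        raise RuntimeError(f"missing required settings: {', '.join(missing)}")
    return LLMSettings(
        base_url=env["LLM_BASE_URL"].rstrip("/"),
        api_key=env.get("LLM_API_KEY", ""),  # a local server may not need a key
        model=env["LLM_MODEL"],
        embedding_provider=env.get("EMBEDDING_PROVIDER", "local"),
    )


# Development profile: hosted OpenAI-compatible endpoint.
dev = load_settings({
    "LLM_BASE_URL": "https://openrouter.ai/api/v1/",
    "LLM_API_KEY": "sk-example",
    "LLM_MODEL": "qwen/qwen3.5-35b-a3b",
    "EMBEDDING_PROVIDER": "openrouter",
})
# Production profile: local server (placeholder host and port).
prod = load_settings({
    "LLM_BASE_URL": "http://localhost:8001/v1",
    "LLM_MODEL": "qwen/qwen3.5-35b-a3b",
})
```

Because both profiles resolve to the same settings object, switching providers is a matter of editing the environment rather than the client code.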

Frontend (React + TS) ✅ Complete

Phase 2 grid layout pre-allocated: Top-Left video area (empty/hidden), Top-Right input area, Bottom response area.
Type-safe API calls using TanStack Query.
Display extracted keywords to user (shown before final answer arrives).
Display answer as clean bullet list with source metadata.
Collapsible source cards, copy-to-clipboard button, enhanced skeleton loaders.
PipelineProgress component (4-stage stepper, ready for streaming API).
Integration tests: full query flow, error handling, ingest flow.
62 tests, TypeScript clean, production build verified.

Phase 2: Video Upload + Real-Time ASR → RAG (8-10 days)

Backend Additions

Video upload (POST /api/v1/upload-video) with size/format validation (maximum 300MB).
Static file serving for videos.
WebSocket /ws/asr/{video_id} for real-time audio chunk streaming.
ASR integration with Qwen/Qwen3-ASR-1.7B (file upload or audio content).
Question extraction via LLM, then trigger Phase 1 RAG (auto + manual support).
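
The size/format validation for video uploads can be illustrated with a small standalone check. This is a sketch under assumptions: the plan says only "MP4 and common formats", so the extension list here is a guess, the 300MB limit is interpreted as 300 × 1024 × 1024 bytes, and `validate_video` and `check` are hypothetical names rather than functions from video_service.py.

```python
from pathlib import Path

MAX_VIDEO_BYTES = 300 * 1024 * 1024  # 300MB limit from the plan (binary MB assumed)
ALLOWED_EXTENSIONS = {".mp4", ".webm", ".mov", ".mkv"}  # assumed "common formats"


def validate_video(filename: str, size_bytes: int) -> None:
    """Raise ValueError when an upload breaks the format or size rules."""
    ext = Path(filename).suffix.lower()
    if ext not in ALLOWED_EXTENSIONS:
        raise ValueError(f"unsupported video format: {ext or 'none'}")
    if size_bytes <= 0:
        raise ValueError("empty upload")
    if size_bytes > MAX_VIDEO_BYTES:
        raise ValueError("video exceeds the 300MB limit")


def check(filename: str, size_bytes: int) -> str:
    """Return 'ok' or the validation message, for quick experimentation."""
    try:
        validate_video(filename, size_bytes)
        return "ok"
    except ValueError as exc:
        return str(exc)
```

In an HTTP handler the `ValueError` would typically be translated into a 4xx response; checking the extension alone does not prove the file is a playable video, so a real service may also inspect the content.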

Frontend Additions

Drag & drop video upload + progress.
Video player (<video controls>).
Live transcript display (scrolling box).
Top-Left: Video player | Top-Right: Live transcript + manual input.
Bottom: RAG response panel.
Support both an automatic “Ask” on transcript updates and a manual button.

Development Timeline

Phase	Duration	Key Deliverables	Status
Setup + Phase 1 Backend	3-4 days	FastAPI + Chroma + Metadata + LLM client	✅ Complete
Phase 1 Frontend	2-3 days	UI layout + text query flow	✅ Complete
Phase 2 Backend	4-5 days	Video upload + WebSocket ASR + question extraction	✅ Complete
Phase 2 Frontend	3-4 days	Video player + live transcript + auto/manual flow	✅ Complete
Phase 4 System Audio & Mic	5.5 days	System Audio capture + Listen Mic + real-time ASR → RAG	✅ Complete
Testing & Polish	1-2 days	End-to-end testing + deployment scripts	✅ Complete

Total Estimated Effort: 19-23 developer days (3-4 weeks)

Note: Phase 3 (YouTube Live Stream Proxy → ASR) was implemented (5.5 days, 7 sub-phases) and later reverted in favor of Phase 4's more versatile System Audio Capture approach using getDisplayMedia().

Phase 4 adds System Audio Capture (getDisplayMedia) and Listen Mic (getUserMedia) as live audio sources alongside video upload. Both pipe audio through the existing WebSocket → DashScope realtime ASR → RAG pipeline. Implementation is complete, with 46 frontend + 14 backend tests. See .plans/phase4_system_audio_plan.md for details.

Deployment Strategy

Development:

Backend: cd backend && uvicorn app.main:app --reload --port 8000
Frontend: cd frontend && npm run dev

Production:

Use docker-compose up -d (includes backend, built frontend, Nginx reverse proxy).
Simple deploy.sh script for building and restarting.

File Information

Filename: development_plan.md
Last Updated: May 2026
Status: Phase 1 ✅, Phase 2 ✅, Phase 4 ✅ — All phases complete
