RAG Video Q&A Web Application - Development Plan
Project Overview
Web-based application built in three phases.
- Phase 1: Text question → query decomposition → RAG retrieval → relevance filtering → point-form answer (strictly from database)
- Phase 2: Video upload + player → real-time audio streaming → ASR transcription → question extraction → Phase 1 RAG flow
- Phase 3: YouTube live stream or VOD URL → backend HLS proxy → audio routed into the existing ASR pipeline
Tech Stack
- Backend: Python + FastAPI (REST + WebSocket)
- Frontend: TypeScript + React 18 (Vite) + shadcn/ui + Tailwind CSS
- Server: Linux Ubuntu 22.04
- RAG Database: ChromaDB (persistent)
- LLM/ASR Integration: Dynamic via `.env` (supports local vLLM, OpenRouter, Alibaba Cloud)
- Models:
  - Embedding: `qwen/qwen3-embedding-4b` (via sentence-transformers, provider-switchable via `.env`)
  - LLM: `qwen/qwen3.5-35b-a3b` (OpenRouter for dev, local vLLM for prod)
  - ASR: `Qwen/Qwen3-ASR-1.7B`
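The provider switching described above could be sketched roughly as follows. The variable names (`LLM_PROVIDER`, `LLM_BASE_URL`, `LLM_API_KEY`, `LLM_MODEL`) and the default base URLs are assumptions for illustration; the plan does not name the actual `.env` keys.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Assumed defaults; real deployments would set LLM_BASE_URL explicitly in .env.
DEFAULT_BASE_URLS = {
    "openrouter": "https://openrouter.ai/api/v1",
    "vllm": "http://localhost:8000/v1",
}


@dataclass(frozen=True)
class LLMSettings:
    provider: str
    base_url: str
    api_key: str
    model: str


def load_llm_settings(env: Optional[Mapping[str, str]] = None) -> LLMSettings:
    """Build LLM settings from environment-style key/value pairs."""
    env = os.environ if env is None else env
    provider = env.get("LLM_PROVIDER", "openrouter").lower()
    # An explicit base URL wins; otherwise fall back to the assumed default.
    base_url = env.get("LLM_BASE_URL") or DEFAULT_BASE_URLS.get(provider)
    if base_url is None:
        raise ValueError(f"LLM_BASE_URL is required for provider {provider!r}")
    return LLMSettings(
        provider=provider,
        base_url=base_url,
        api_key=env.get("LLM_API_KEY", ""),
        model=env.get("LLM_MODEL", "qwen/qwen3.5-35b-a3b"),
    )
```

Passing a plain dict instead of `os.environ` keeps the loader easy to test; any other provider (for example Alibaba Cloud) works as long as `LLM_BASE_URL` is set.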
Deployment
- Development: Simple commands (`uvicorn` + `npm run dev`)
- Production: Docker + Nginx
Project Structure (Monorepo)
app/
├── backend/                  # FastAPI
│   ├── app/
│   │   ├── main.py
│   │   ├── routers/          # query.py, ingest.py, video.py, ws_asr.py
│   │   ├── services/         # rag.py, llm_client.py, asr_client.py, video_service.py
│   │   ├── models/           # Pydantic schemas
│   │   ├── core/             # config.py, database.py
│   │   └── utils/            # chunking, metadata extraction
│   ├── uploads/              # video storage (max 300MB)
│   ├── requirements.txt
│   └── .env.example
├── frontend/                 # React + TypeScript (Vite)
│   ├── src/
│   │   ├── components/
│   │   ├── pages/
│   │   ├── lib/              # api.ts
│   │   └── App.tsx
│   ├── package.json
│   └── vite.config.ts
├── chroma_db/                # Persistent vector store
├── Dockerfile
├── docker-compose.yml
├── nginx.conf
└── deploy.sh
Key Requirements Incorporated
- LLM/ASR Configuration: Backend reads from `.env` for easy switching between development (OpenRouter / Alibaba Cloud) and production (local vLLM).
- RAG Database: ChromaDB with metadata support (filename + extracted content metadata).
- Embedding Model: `qwen/qwen3-embedding-4b` via sentence-transformers, provider-switchable via `.env` (OpenRouter for dev, local vLLM for prod).
- Document Ingestion: Via UI (project-based demo, no user authentication). Supported formats: DOCX, PDF.
- Chunking Strategy: 1000 tokens per chunk, 200 token overlap. Strategy abstracted for future replacement.
- Video: MP4 and common formats, maximum 300MB.
- ASR Flow: Both automatic (on transcript updates) and manual "Ask from Video" button.
- UI Layout (Phase 2 grid, pre-allocated in Phase 1):
  - Top-Left: Video player (empty in Phase 1)
  - Top-Right: Text input box + extracted keywords display
  - Bottom Half: RAG response (bullet points with source metadata)
- Authentication: Public demo (no login required).
- Mobile: Not required at this stage.
- CORS: Standard FastAPI CORS middleware for frontend-backend communication.
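The chunking requirement above (1000 tokens per chunk, 200 token overlap) amounts to a sliding window that advances 800 tokens at a time. A minimal sketch, assuming whitespace splitting as a stand-in tokenizer (the plan does not say which tokenizer is used) and hypothetical function names:

```python
from typing import List


def chunk_tokens(tokens: List[str], chunk_size: int = 1000,
                 overlap: int = 200) -> List[List[str]]:
    """Split tokens into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("require chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # 800 with the plan's defaults
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks


def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> List[str]:
    """Whitespace tokenization is an assumption; swap in the real tokenizer."""
    return [" ".join(c) for c in chunk_tokens(text.split(), chunk_size, overlap)]
```

Keeping the window logic separate from tokenization is one way to leave the strategy replaceable, as the requirement asks.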
Phase 1: Text Question → RAG → Point-Form Answer (5-7 days)
RAG Pipeline (3-Step LLM Workflow)
User Question
↓
[LLM Call 1] Extract key questions + keywords from user input
↓ ← keywords shown to user in UI
[ChromaDB] Retrieve chunks using extracted keywords
↓
[LLM Call 2] Single batch relevance filter — evaluate all chunks, drop irrelevant ones
↓
[LLM Call 3] Generate bullet-point response from filtered chunks only
- Query Decomposition (`services/query_decomposer.py`): LLM extracts key questions and search keywords from user's natural language question. Keywords are displayed to the user for transparency.
- Relevance Filtering (`services/relevance_filter.py`): Single batch LLM call receives all retrieved chunks + original question. Returns relevance verdict for each chunk. Irrelevant chunks are discarded before response generation.
- Strict RAG Prompt: Final LLM call generates bullet-point answer using ONLY filtered relevant chunks. No external knowledge allowed. Response format enforced via prompt engineering.
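The three LLM calls and the retrieval step can be wired together as plain function composition. The sketch below is illustrative only: the dataclasses, the `run_pipeline` name, and the fallback message for an empty result are assumptions, and each stage is injected as a callable so the real LLM and ChromaDB clients can be dropped in.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Chunk:
    text: str
    source: str  # e.g. the originating filename


@dataclass
class Decomposition:
    questions: List[str]
    keywords: List[str]


@dataclass
class QueryResult:
    keywords: List[str]
    bullets: List[str]
    sources: List[str]


# Assumed fallback text for when no chunk survives filtering.
NO_ANSWER = "No relevant information found in the database."


def run_pipeline(
    question: str,
    decompose: Callable[[str], Decomposition],
    retrieve: Callable[[List[str]], List[Chunk]],
    filter_relevant: Callable[[str, List[Chunk]], List[Chunk]],
    generate: Callable[[str, List[Chunk]], List[str]],
) -> QueryResult:
    decomposition = decompose(question)            # LLM call 1
    chunks = retrieve(decomposition.keywords)      # vector store lookup
    relevant = filter_relevant(question, chunks)   # LLM call 2 (single batch)
    if not relevant:
        # Skip generation entirely so the answer cannot draw on outside knowledge.
        return QueryResult(decomposition.keywords, [NO_ANSWER], [])
    bullets = generate(question, relevant)         # LLM call 3
    return QueryResult(decomposition.keywords, bullets,
                       [c.source for c in relevant])
```

Because the keywords are available as soon as `decompose` returns, a streaming variant could send them to the UI before the later stages finish.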
Backend (FastAPI)
- Dynamic configuration via `.env` (LLM base URL, API key, model names, embedding provider).
- `services/rag.py`: Persistent ChromaDB + Qwen embedding + metadata extraction (filename, upload date, content summary).
- `services/llm_client.py`: OpenAI-compatible client for Qwen LLM.
- `services/query_decomposer.py`: LLM-based keyword/question extraction.
- `services/relevance_filter.py`: LLM-based batch relevance scoring.
- `utils/chunking.py`: DOCX parsing + text chunking (1000 tokens, 200 overlap). Strategy abstracted for future replacement.
- Endpoints:
  - `POST /api/v1/ingest` – DOCX upload, parsing, chunking, embedding, and ingestion with metadata.
  - `POST /api/v1/query` – Full 3-step pipeline: decompose → retrieve → filter → respond. Returns bullet-point answer + extracted keywords + source metadata.
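For the batch relevance scoring in `services/relevance_filter.py`, one plausible approach is to number the chunks in a single prompt and ask the model for a JSON list of relevant indices. The prompt wording, the JSON shape, and both function names below are assumptions, not the project's actual implementation.

```python
import json
from typing import List


def build_filter_prompt(question: str, chunks: List[str]) -> str:
    """Put every retrieved chunk into one prompt so a single LLM call suffices."""
    numbered = "\n\n".join(f"[{i}] {text}" for i, text in enumerate(chunks))
    return (
        "Decide which passages help answer the question.\n"
        f"Question: {question}\n\nPassages:\n{numbered}\n\n"
        'Reply with JSON only, for example {"relevant": [0, 2]}.'
    )


def parse_verdicts(reply: str, n_chunks: int) -> List[int]:
    """Return sorted, de-duplicated, in-range indices from the model reply.

    Assumption: on any malformed reply we fail closed (keep nothing), which
    matches the plan's rule that answers come strictly from the database.
    """
    try:
        indices = json.loads(reply)["relevant"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return []
    if not isinstance(indices, list):
        return []
    return sorted({
        i for i in indices
        if isinstance(i, int) and not isinstance(i, bool) and 0 <= i < n_chunks
    })
```

Validating indices against `n_chunks` guards against the model referring to passages that were never sent.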
Frontend (React + TS) ✅ Complete
- Phase 2 grid layout pre-allocated: Top-Left video area (empty/hidden), Top-Right input area, Bottom response area.
- Type-safe API calls using TanStack Query.
- Display extracted keywords to user (shown before final answer arrives).
- Display answer as clean bullet list with source metadata.
- Collapsible source cards, copy-to-clipboard button, enhanced skeleton loaders.
- PipelineProgress component (4-stage stepper, ready for streaming API).
- Integration tests: full query flow, error handling, ingest flow.
- 62 tests, TypeScript clean, production build verified.
Phase 2: Video Upload + Real-Time ASR → RAG (8-10 days)
Backend Additions
- Video upload (`POST /api/v1/upload-video`) with size/format validation (<300MB).
- Static file serving for videos.
- WebSocket `/ws/asr/{video_id}` for real-time audio chunk streaming.
- ASR integration with `Qwen/Qwen3-ASR-1.7B` (file upload or audio content).
- Question extraction via LLM, then trigger Phase 1 RAG (auto + manual support).
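The size/format validation for uploads can be kept as a small pure function that the upload route calls before writing anything to disk. In this sketch the extension list is an assumption (the plan only says "MP4 and common formats"), 300MB is interpreted as 300 MiB, and a file of exactly that size is accepted; adjust if the limit is meant to be strict.

```python
from pathlib import PurePath

MAX_UPLOAD_BYTES = 300 * 1024 * 1024  # assumption: "300MB" means 300 MiB
ALLOWED_EXTENSIONS = {".mp4", ".webm", ".mov", ".mkv"}  # assumed format list


def validate_upload(filename: str, size_bytes: int) -> None:
    """Raise ValueError if the upload should be rejected; return None otherwise."""
    ext = PurePath(filename).suffix.lower()
    if ext not in ALLOWED_EXTENSIONS:
        raise ValueError(f"unsupported format: {ext or '(none)'}")
    if size_bytes <= 0:
        raise ValueError("empty file")
    if size_bytes > MAX_UPLOAD_BYTES:
        raise ValueError("file exceeds the 300MB limit")
```

An extension check alone does not prove the content is a video, so a production route would likely also inspect the container format.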
Frontend Additions
- Drag & drop video upload + progress.
- Video player (`<video controls>`).
- Live transcript display (scrolling box).
- Top-Left: Video player | Top-Right: Live transcript + manual input.
- Bottom: RAG response panel.
- Support both automatic “Ask” on transcript updates and manual button.
Phase 3: YouTube Live Stream Proxy → ASR (5-6 days) ✅ Complete
Overview
Proxy YouTube live streams and VODs through the backend, route audio into the existing ASR pipeline.
Backend Additions
- YouTube URL extraction via yt-dlp (`POST /api/v1/youtube/extract`)
- Format selection: video-only ≤480p + best audio (VOD), combined HLS (live)
- HLS manifest proxy with line-by-line rewriting (`GET /api/v1/youtube/proxy/manifest.m3u8`)
- TS segment proxying with CORS headers (`GET /api/v1/youtube/proxy/segment.ts`)
- In-memory caching: 5 min TTL (live), 30 min TTL (VOD)
- PO token expiration detection with cache invalidation
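The in-memory caching with separate TTLs, plus invalidation when an expired PO token is detected, could look roughly like this. The class and method names are hypothetical; only the 5 minute and 30 minute TTLs come from the plan.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple

LIVE_TTL_SECONDS = 5 * 60    # live stream URLs go stale quickly
VOD_TTL_SECONDS = 30 * 60


class TTLCache:
    """Tiny in-memory cache; the clock is injectable so expiry can be tested."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def set(self, key: str, value: Any, is_live: bool) -> None:
        ttl = LIVE_TTL_SECONDS if is_live else VOD_TTL_SECONDS
        self._entries[key] = (self._clock() + ttl, value)

    def get(self, key: str) -> Optional[Any]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # drop expired entries lazily on read
            return None
        return value

    def invalidate(self, key: str) -> None:
        """Call when the upstream signals an expired token for this entry."""
        self._entries.pop(key, None)
```

A process-local dict like this is lost on restart and is not shared between workers, which is usually acceptable for short-lived stream URLs.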
Frontend Additions
- YouTubeInput component: URL validation, extraction, loading/error states
- YouTubeVideoPlayer component: dual hls.js (video + hidden audio), thumbnail placeholder, LIVE badge
- useYouTubeASR hook: AudioContext from audio element → WebSocket → DashScope ASR
- LTTPage source toggle: Upload / YouTube tabs
- hls.js integration with dynamic import and quality capping (≤480p)
Key Design Decisions
- No iOS client needed (default yt-dlp extractor handles both VOD and live)
- Dual-element architecture: `<video muted>` for display, `<audio hidden>` for AudioContext capture
- HLS proxy rewrites all URLs (segments, sub-manifests, EXT-X-KEY URIs)
- Upstream status checked BEFORE streaming (avoids "response already started" errors)
- Both useVideoASR and useYouTubeASR return identical shapes for transparent integration
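The line-by-line manifest rewriting mentioned in the design decisions can be illustrated with a short function. The two proxy paths come from the plan; the `url` query parameter, the routing of key URIs through the segment endpoint, and the function names are assumptions.

```python
import re
from urllib.parse import quote, urljoin

MANIFEST_PROXY = "/api/v1/youtube/proxy/manifest.m3u8"
SEGMENT_PROXY = "/api/v1/youtube/proxy/segment.ts"
URI_ATTR = re.compile(r'URI="([^"]+)"')


def _proxied(url: str, base_url: str) -> str:
    """Resolve url against the manifest location and wrap it in a proxy URL."""
    absolute = urljoin(base_url, url)
    is_manifest = absolute.split("?")[0].endswith(".m3u8")
    endpoint = MANIFEST_PROXY if is_manifest else SEGMENT_PROXY
    # Assumed convention: the upstream URL travels in a `url` query parameter.
    return f"{endpoint}?url={quote(absolute, safe='')}"


def rewrite_manifest(manifest: str, base_url: str) -> str:
    out = []
    for line in manifest.splitlines():
        stripped = line.strip()
        if not stripped:
            out.append(line)
        elif stripped.startswith("#"):
            # Tags such as EXT-X-KEY carry their URL in a URI="..." attribute.
            out.append(URI_ATTR.sub(
                lambda m: f'URI="{_proxied(m.group(1), base_url)}"', line))
        else:
            # Any non-tag line is a segment or sub-manifest URL.
            out.append(_proxied(stripped, base_url))
    return "\n".join(out)
```

Resolving relative URLs against the upstream manifest location before wrapping them matters because the player would otherwise resolve them against the proxy path.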
Architecture
YouTube URL → yt-dlp extract → HLS proxy → hls.js (video + audio)
↓
AudioContext → WebSocket → DashScope ASR → transcript
Development Timeline
| Phase | Duration | Key Deliverables | Status |
|---|---|---|---|
| Setup + Phase 1 Backend | 3-4 days | FastAPI + Chroma + Metadata + LLM client | ✅ Complete |
| Phase 1 Frontend | 2-3 days | UI layout + text query flow | ✅ Complete |
| Phase 2 Backend | 4-5 days | Video upload + WebSocket ASR + question extraction | ✅ Complete |
| Phase 2 Frontend | 3-4 days | Video player + live transcript + auto/manual flow | ✅ Complete |
| Phase 3 YouTube Proxy | 5-6 days | yt-dlp extraction + HLS proxy + YouTube ASR | ✅ Complete |
| Testing & Polish | 1-2 days | End-to-end testing + deployment scripts | ⬜ Pending |
Total Estimated Effort: 18-24 developer days (roughly 4-5 weeks), the sum of the six rows above
Deployment Strategy
Development:
- Backend: `cd backend && uvicorn app.main:app --reload --port 8000`
- Frontend: `cd frontend && npm run dev`
Production:
- Use `docker-compose up -d` (includes backend, built frontend, Nginx reverse proxy).
- Simple `deploy.sh` script for building and restarting.
File Information
- Filename: `development_plan.md`
- Last Updated: May 2026
- Status: Phase 1-3 Complete — YouTube proxy feature live