RAG Video Q&A Web Application - Development Plan

Project Overview
Web-based application built in three phases.

Phase 1: Text question → query decomposition → RAG retrieval → relevance filtering → point-form answer (strictly from database)
Phase 2: Video upload + player → real-time audio streaming → ASR transcription → question extraction → Phase 1 RAG flow
Phase 3: YouTube URL (live stream or VOD) → yt-dlp extraction → HLS proxy → existing ASR pipeline

Tech Stack

Backend: Python + FastAPI (REST + WebSocket)
Frontend: TypeScript + React 18 (Vite) + shadcn/ui + Tailwind CSS
Server: Linux Ubuntu 22.04
RAG Database: ChromaDB (persistent)
LLM/ASR Integration: Dynamic via .env (supports local vLLM, OpenRouter, Alibaba Cloud)
- Alibaba Cloud reference: https://modelstudio.console.alibabacloud.com/ap-southeast-1?switchAgent=101503&tab=doc&productCode=p_efm&switchUserType=3#/doc/?type=model&url=2989727
Models:
- Embedding: qwen/qwen3-embedding-4b (via sentence-transformers, provider-switchable via .env)
- LLM: qwen/qwen3.5-35b-a3b (OpenRouter for dev, local vLLM for prod)
- ASR: Qwen/Qwen3-ASR-1.7B
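
The provider switching described above could be handled by a small settings loader. The following is a minimal sketch, not the project's actual core/config.py: the environment variable names (`LLM_BASE_URL`, `LLM_API_KEY`, and so on) and the placeholder local URL are assumptions, while the default model names are taken from the list above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Provider settings resolved from environment variables."""
    llm_base_url: str
    llm_api_key: str
    llm_model: str
    embedding_model: str
    asr_model: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from a mapping (defaults to os.environ).

    All variable names here are hypothetical; the plan does not list them.
    """
    env = os.environ if env is None else env
    return Settings(
        # Placeholder URL for a local OpenAI-compatible server (assumption).
        llm_base_url=env.get("LLM_BASE_URL", "http://localhost:8001/v1"),
        llm_api_key=env.get("LLM_API_KEY", ""),
        llm_model=env.get("LLM_MODEL", "qwen/qwen3.5-35b-a3b"),
        embedding_model=env.get("EMBEDDING_MODEL", "qwen/qwen3-embedding-4b"),
        asr_model=env.get("ASR_MODEL", "Qwen/Qwen3-ASR-1.7B"),
    )
```

Switching from a hosted provider to local vLLM would then only require changing the base URL and API key in .env, with no code changes.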

Deployment

Development: Simple commands (uvicorn + npm run dev)
Production: Docker + Nginx

Project Structure (Monorepo)

app/
├── backend/                 # FastAPI
│   ├── app/
│   │   ├── main.py
│   │   ├── routers/         # query.py, ingest.py, video.py, ws_asr.py
│   │   ├── services/        # rag.py, llm_client.py, asr_client.py, video_service.py
│   │   ├── models/          # Pydantic schemas
│   │   ├── core/            # config.py, database.py
│   │   └── utils/           # chunking, metadata extraction
│   ├── uploads/             # video storage (max 300MB)
│   ├── requirements.txt
│   └── .env.example
├── frontend/                # React + TypeScript (Vite)
│   ├── src/
│   │   ├── components/
│   │   ├── pages/
│   │   ├── lib/             # api.ts
│   │   └── App.tsx
│   ├── package.json
│   └── vite.config.ts
├── chroma_db/               # Persistent vector store
├── Dockerfile
├── docker-compose.yml
├── nginx.conf
└── deploy.sh

Key Requirements Incorporated

LLM/ASR Configuration: Backend reads from .env for easy switching between development (OpenRouter / Alibaba Cloud) and production (local vLLM).
RAG Database: ChromaDB with metadata support (filename + extracted content metadata).
Embedding Model: qwen/qwen3-embedding-4b via sentence-transformers, provider-switchable via .env (OpenRouter for dev, local vLLM for prod).
Document Ingestion: Via UI (project-based demo, no user authentication). Supported formats: DOCX, PDF.
Chunking Strategy: 1000 tokens per chunk, 200 token overlap. Strategy abstracted for future replacement.
Video: MP4 and common formats, maximum 300MB.
ASR Flow: Both automatic (on transcript updates) and manual "Ask from Video" button.
UI Layout (Phase 2 grid, pre-allocated in Phase 1):
- Top-Left: Video player (empty in Phase 1)
- Top-Right: Text input box + extracted keywords display
- Bottom Half: RAG response (bullet points with source metadata)
Authentication: Public demo (no login required).
Mobile: Not required at this stage.
CORS: Standard FastAPI CORS middleware for frontend-backend communication.
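
The chunking requirement above (1000 tokens per chunk, 200-token overlap) amounts to a sliding window with a stride of 800 tokens. A minimal sketch follows; it assumes the text has already been tokenized into a list, and the function name `chunk_tokens` is illustrative rather than the project's actual API.

```python
from typing import List, Sequence


def chunk_tokens(
    tokens: Sequence[str], size: int = 1000, overlap: int = 200
) -> List[List[str]]:
    """Split tokens into windows of `size` that share `overlap` tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    step = size - overlap  # 800 with the defaults
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + size]))
        # Stop once a window reaches the end, so no chunk is pure overlap.
        if start + size >= len(tokens):
            break
    return chunks
```

Keeping the window logic in one function is one way to satisfy the "strategy abstracted for future replacement" requirement: a sentence-aware or heading-aware splitter could replace it behind the same signature.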

Phase 1: Text Question → RAG → Point-Form Answer (5-7 days)

RAG Pipeline (3-Step LLM Workflow)

User Question
    ↓
[LLM Call 1] Extract key questions + keywords from user input
    ↓                ← keywords shown to user in UI
[ChromaDB] Retrieve chunks using extracted keywords
    ↓
[LLM Call 2] Single batch relevance filter — evaluate all chunks, drop irrelevant ones
    ↓
[LLM Call 3] Generate bullet-point response from filtered chunks only
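
The flow above can be sketched as a single orchestration function with each stage injected as a callable. This is an illustrative outline only: the `Chunk` and `QueryResult` types, the parameter names, and the early return when no chunk survives filtering are assumptions, not the project's confirmed design.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Chunk:
    text: str
    source: str  # e.g. the originating filename


@dataclass
class QueryResult:
    keywords: List[str]
    bullets: List[str]
    sources: List[str]


def run_pipeline(
    question: str,
    decompose: Callable[[str], List[str]],
    retrieve: Callable[[List[str]], List[Chunk]],
    filter_relevant: Callable[[str, List[Chunk]], List[Chunk]],
    generate: Callable[[str, List[Chunk]], List[str]],
) -> QueryResult:
    keywords = decompose(question)                     # LLM call 1
    candidates = retrieve(keywords)                    # ChromaDB lookup
    relevant = filter_relevant(question, candidates)   # LLM call 2
    if not relevant:
        # Assumption: with nothing relevant in the database, skip generation
        # so the answer cannot draw on outside knowledge.
        return QueryResult(keywords, [], [])
    bullets = generate(question, relevant)             # LLM call 3
    sources = sorted({chunk.source for chunk in relevant})
    return QueryResult(keywords, bullets, sources)
```

Passing the stages in as callables keeps the function testable with stubs and lets each LLM-backed step be swapped independently.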

Query Decomposition (services/query_decomposer.py): LLM extracts key questions and search keywords from user's natural language question. Keywords are displayed to the user for transparency.
Relevance Filtering (services/relevance_filter.py): Single batch LLM call receives all retrieved chunks + original question. Returns relevance verdict for each chunk. Irrelevant chunks are discarded before response generation.
Strict RAG Prompt: Final LLM call generates bullet-point answer using ONLY filtered relevant chunks. No external knowledge allowed. Response format enforced via prompt engineering.
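
One possible shape for the batch relevance filter is to number the chunks in a single prompt and ask the model for the indices worth keeping. The sketch below assumes a JSON reply of the form `{"relevant": [...]}`; that reply format, the prompt wording, and both function names are hypothetical.

```python
import json
from typing import List


def build_filter_prompt(question: str, chunks: List[str]) -> str:
    """Render all retrieved chunks into one prompt for a single LLM call."""
    numbered = "\n".join(f"[{i}] {text}" for i, text in enumerate(chunks))
    return (
        "Decide which chunks help answer the question.\n"
        f"Question: {question}\n"
        f"Chunks:\n{numbered}\n"
        'Reply with JSON only, e.g. {"relevant": [0, 2]}.'
    )


def apply_verdicts(chunks: List[str], llm_reply: str) -> List[str]:
    """Keep only the chunks whose indices the model marked as relevant."""
    try:
        indices = json.loads(llm_reply)["relevant"]
        keep = {
            i for i in indices
            if isinstance(i, int) and not isinstance(i, bool)
            and 0 <= i < len(chunks)
        }
    except (json.JSONDecodeError, KeyError, TypeError):
        # Assumption: an unparseable reply drops every chunk rather than
        # passing unfiltered text to the final answer step.
        return []
    return [text for i, text in enumerate(chunks) if i in keep]
```

Out-of-range indices are ignored, so a hallucinated chunk number cannot cause an error or smuggle in content that was never retrieved.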

Backend (FastAPI)

Dynamic configuration via .env (LLM base URL, API key, model names, embedding provider).
services/rag.py: Persistent ChromaDB + Qwen embedding + metadata extraction (filename, upload date, content summary).
services/llm_client.py: OpenAI-compatible client for Qwen LLM.
services/query_decomposer.py: LLM-based keyword/question extraction.
services/relevance_filter.py: LLM-based batch relevance scoring.
utils/chunking.py: DOCX parsing + text chunking (1000 tokens, 200 overlap). Strategy abstracted for future replacement.
Endpoints:
- POST /api/v1/ingest – DOCX upload, parsing, chunking, embedding, and ingestion with metadata.
- POST /api/v1/query – Full pipeline (3 LLM calls): decompose → retrieve → filter → respond. Returns bullet-point answer + extracted keywords + source metadata.
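
For the ingest endpoint, each chunk needs a stable ID and a metadata record before it is written to the vector store. The sketch below prepares those records; the ID scheme, the `chunk_index` field, and the function name `build_records` are assumptions, and the content-summary metadata mentioned above is omitted because it would need an LLM call.

```python
import hashlib
from datetime import datetime, timezone
from typing import Dict, List


def build_records(
    filename: str, chunks: List[str], uploaded_at: datetime
) -> Dict[str, list]:
    """Prepare ids, documents and metadatas for one uploaded document."""
    ids: List[str] = []
    metadatas: List[dict] = []
    for index, text in enumerate(chunks):
        # Hash filename, position and content so re-ingesting the same
        # file yields the same IDs (hypothetical scheme).
        payload = f"{filename}:{index}:{text}".encode("utf-8")
        digest = hashlib.sha256(payload).hexdigest()[:16]
        ids.append(f"{filename}-{index}-{digest}")
        metadatas.append({
            "filename": filename,
            "chunk_index": index,
            "uploaded_at": uploaded_at.isoformat(),
        })
    return {"ids": ids, "documents": list(chunks), "metadatas": metadatas}
```

The returned keys match the keyword arguments of ChromaDB's `collection.add`, so the result could be passed as `collection.add(**records)`, with embeddings supplied either explicitly or by the collection's embedding function.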

Frontend (React + TS) ✅ Complete

Phase 2 grid layout pre-allocated: Top-Left video area (empty/hidden), Top-Right input area, Bottom response area.
Type-safe API calls using TanStack Query.
Display extracted keywords to user (shown before final answer arrives).
Display answer as clean bullet list with source metadata.
Collapsible source cards, copy-to-clipboard button, enhanced skeleton loaders.
PipelineProgress component (4-stage stepper, ready for streaming API).
Integration tests: full query flow, error handling, ingest flow.
62 tests, TypeScript clean, production build verified.

Phase 2: Video Upload + Real-Time ASR → RAG (8-10 days)

Backend Additions

Video upload (POST /api/v1/upload-video) with size/format validation (maximum 300MB).
Static file serving for videos.
WebSocket /ws/asr/{video_id} for real-time audio chunk streaming.
ASR integration with Qwen/Qwen3-ASR-1.7B (file upload or audio content).
Question extraction via LLM, then trigger Phase 1 RAG (auto + manual support).
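
The size and format validation for uploads could look like the following. It is a sketch under stated assumptions: the plan only says "MP4 and common formats", so the extension list is a guess, and treating 300MB as 300 × 1024 × 1024 bytes with an inclusive limit is likewise an interpretation.

```python
from pathlib import Path

MAX_VIDEO_BYTES = 300 * 1024 * 1024  # assumption: 300MB read as MiB
# Assumption: the "common formats" are not enumerated in the plan.
ALLOWED_EXTENSIONS = {".mp4", ".webm", ".mov", ".mkv"}


def validate_video(filename: str, size_bytes: int) -> None:
    """Raise ValueError if the upload should be rejected."""
    extension = Path(filename).suffix.lower()
    if extension not in ALLOWED_EXTENSIONS:
        raise ValueError(f"unsupported video format: {extension or '(none)'}")
    if size_bytes <= 0:
        raise ValueError("empty upload")
    if size_bytes > MAX_VIDEO_BYTES:
        raise ValueError(f"file exceeds {MAX_VIDEO_BYTES} bytes")
```

In a FastAPI route, a `ValueError` from this check would typically be translated into an HTTP 4xx response before any bytes are written to uploads/.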

Frontend Additions

Drag & drop video upload + progress.
Video player (<video controls>).
Live transcript display (scrolling box).
Top-Left: Video player | Top-Right: Live transcript + manual input.
Bottom: RAG response panel.
Support both automatic “Ask” on transcript updates and manual button.

Phase 3: YouTube Live Stream Proxy → ASR (5-6 days) ✅ Complete

Overview

Proxy YouTube live streams and VODs through the backend, route audio into the existing ASR pipeline.

Backend Additions

YouTube URL extraction via yt-dlp (POST /api/v1/youtube/extract)
Format selection: video-only ≤480p + best audio (VOD), combined HLS (live)
HLS manifest proxy with line-by-line rewriting (GET /api/v1/youtube/proxy/manifest.m3u8)
TS segment proxying with CORS headers (GET /api/v1/youtube/proxy/segment.ts)
In-memory caching: 5 min TTL (live), 30 min TTL (VOD)
PO token expiration detection with cache invalidation
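
The in-memory caching rule above (5 minutes for live streams, 30 minutes for VOD, with explicit invalidation when a token expires) can be sketched as a small TTL cache. The class and method names are illustrative, and the injectable clock is an addition made here for testability.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple

LIVE_TTL_SECONDS = 5 * 60
VOD_TTL_SECONDS = 30 * 60


class TTLCache:
    """In-memory cache whose entry lifetime depends on the stream type."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def set(self, key: str, value: Any, is_live: bool) -> None:
        ttl = LIVE_TTL_SECONDS if is_live else VOD_TTL_SECONDS
        self._entries[key] = (self._clock() + ttl, value)

    def get(self, key: str) -> Optional[Any]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # lazily evict expired entries
            return None
        return value

    def invalidate(self, key: str) -> None:
        """Drop an entry early, e.g. when an expired token is detected."""
        self._entries.pop(key, None)
```

A process-local dictionary like this is lost on restart and not shared between workers, which is usually acceptable for short-lived extraction results.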

Frontend Additions

YouTubeInput component: URL validation, extraction, loading/error states
YouTubeVideoPlayer component: dual hls.js (video + hidden audio), thumbnail placeholder, LIVE badge
useYouTubeASR hook: AudioContext from audio element → WebSocket → DashScope ASR
LTTPage source toggle: Upload / YouTube tabs
hls.js integration with dynamic import and quality capping (≤480p)

Key Design Decisions

No iOS client needed (default yt-dlp extractor handles both VOD and live)
Dual-element architecture: <video muted> for display, <audio hidden> for AudioContext capture
HLS proxy rewrites all URLs (segments, sub-manifests, EXT-X-KEY URIs)
Upstream status checked BEFORE streaming (avoids "response already started" errors)
Both useVideoASR and useYouTubeASR return identical shapes for transparent integration
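
The line-by-line manifest rewriting mentioned above can be illustrated as follows. This is a simplified sketch, not the project's proxy code: the `url` query parameter is a hypothetical name, routing is decided purely by the `.m3u8` suffix, and key URIs are sent through the segment endpoint as an assumption.

```python
import re
from urllib.parse import quote, urljoin, urlparse

MANIFEST_PROXY = "/api/v1/youtube/proxy/manifest.m3u8"
SEGMENT_PROXY = "/api/v1/youtube/proxy/segment.ts"
_URI_ATTR = re.compile(r'URI="([^"]+)"')


def _proxied(url: str, base_url: str) -> str:
    """Resolve url against the manifest URL and wrap it in a proxy URL."""
    absolute = urljoin(base_url, url)
    is_manifest = urlparse(absolute).path.endswith(".m3u8")
    endpoint = MANIFEST_PROXY if is_manifest else SEGMENT_PROXY
    # "url" is a hypothetical query parameter name.
    return f"{endpoint}?url={quote(absolute, safe='')}"


def rewrite_manifest(manifest: str, base_url: str) -> str:
    """Point every URL in an HLS playlist at the backend proxy."""
    out = []
    for line in manifest.splitlines():
        stripped = line.strip()
        if not stripped:
            out.append(line)
        elif stripped.startswith("#"):
            # Tags such as EXT-X-KEY carry URLs in a URI="..." attribute.
            out.append(_URI_ATTR.sub(
                lambda m: f'URI="{_proxied(m.group(1), base_url)}"', line))
        else:
            # Any non-tag, non-blank line is a segment or sub-manifest URL.
            out.append(_proxied(stripped, base_url))
    return "\n".join(out)
```

Resolving relative URLs against the upstream manifest URL before encoding them means the proxy endpoints never need to know which playlist a segment came from.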

Architecture

YouTube URL → yt-dlp extract → HLS proxy → hls.js (video + audio)
                                                 ↓
                                          AudioContext → WebSocket → DashScope ASR → transcript

Development Timeline

Phase	Duration	Key Deliverables	Status
Setup + Phase 1 Backend	3-4 days	FastAPI + Chroma + Metadata + LLM client	✅ Complete
Phase 1 Frontend	2-3 days	UI layout + text query flow	✅ Complete
Phase 2 Backend	4-5 days	Video upload + WebSocket ASR + question extraction	✅ Complete
Phase 2 Frontend	3-4 days	Video player + live transcript + auto/manual flow	✅ Complete
Phase 3 YouTube Proxy	5-6 days	yt-dlp extraction + HLS proxy + YouTube ASR	✅ Complete
Testing & Polish	1-2 days	End-to-end testing + deployment scripts	⬜ Pending

Total Estimated Effort: 18-24 developer days (4-5 weeks), summing the rows above

Deployment Strategy

Development:

Backend: cd backend && uvicorn app.main:app --reload --port 8000
Frontend: cd frontend && npm run dev

Production:

Use docker-compose up -d (includes backend, built frontend, Nginx reverse proxy).
Simple deploy.sh script for building and restarting.

File Information

Filename: development_plan.md
Last Updated: May 2026
Status: Phase 1-3 Complete — YouTube proxy feature live
