# RAG Video Q&A Web Application - Development Plan

**Project Overview**

Web-based application built in two phases.

- **Phase 1**: Text question → query decomposition → RAG retrieval → relevance filtering → point-form answer (strictly from database)
- **Phase 2**: Video upload + player → real-time audio streaming → ASR transcription → question extraction → Phase 1 RAG flow

**Tech Stack**
- **Backend**: Python + FastAPI (REST + WebSocket)
- **Frontend**: TypeScript + React 18 (Vite) + shadcn/ui + Tailwind CSS
- **Server**: Linux Ubuntu 22.04
- **RAG Database**: ChromaDB (persistent)
- **LLM/ASR Integration**: Dynamic via `.env` (supports local vLLM, OpenRouter, Alibaba Cloud)
  - Alibaba Cloud reference: https://modelstudio.console.alibabacloud.com/ap-southeast-1?switchAgent=101503&tab=doc&productCode=p_efm&switchUserType=3#/doc/?type=model&url=2989727
- **Models**:
  - Embedding: `qwen/qwen3-embedding-4b` (via sentence-transformers, provider-switchable via `.env`)
  - LLM: `qwen/qwen3.5-35b-a3b` (OpenRouter for dev, local vLLM for prod)
  - ASR: `Qwen/Qwen3-ASR-1.7B`

**Deployment**
- Development: Simple commands (`uvicorn` + `npm run dev`)
- Production: Docker + Nginx

---

## Project Structure (Monorepo)

    app/
    ├── backend/                  # FastAPI
    │   ├── app/
    │   │   ├── main.py
    │   │   ├── routers/          # query.py, ingest.py, video.py, ws_asr.py
    │   │   ├── services/         # rag.py, llm_client.py, asr_client.py, video_service.py
    │   │   ├── models/           # Pydantic schemas
    │   │   ├── core/             # config.py, database.py
    │   │   └── utils/            # chunking, metadata extraction
    │   ├── uploads/              # video storage (max 300MB)
    │   ├── requirements.txt
    │   └── .env.example
    ├── frontend/                 # React + TypeScript (Vite)
    │   ├── src/
    │   │   ├── components/
    │   │   ├── pages/
    │   │   ├── lib/              # api.ts
    │   │   └── App.tsx
    │   ├── package.json
    │   └── vite.config.ts
    ├── chroma_db/                # Persistent vector store
    ├── Dockerfile
    ├── docker-compose.yml
    ├── nginx.conf
    └── deploy.sh

---

## Key Requirements Incorporated

- **LLM/ASR Configuration**: Backend reads from `.env` for easy switching between development (OpenRouter / Alibaba Cloud) and production (local vLLM).
- **RAG Database**: ChromaDB with metadata support (filename + extracted content metadata).
- **Embedding Model**: `qwen/qwen3-embedding-4b` via sentence-transformers, provider-switchable via `.env` (OpenRouter for dev, local vLLM for prod).
- **Document Ingestion**: Via UI (project-based demo, no user authentication). Supported formats: DOCX, PDF.
- **Chunking Strategy**: 1000 tokens per chunk with a 200-token overlap. Strategy abstracted for future replacement.
- **Video**: MP4 and other common formats, maximum 300MB.
- **ASR Flow**: Both **automatic** (on transcript updates) and **manual** ("Ask from Video" button).
- **UI Layout** (Phase 2 grid, pre-allocated in Phase 1):
  - Top-Left: Video player (empty in Phase 1)
  - Top-Right: Text input box + extracted keywords display
  - Bottom Half: RAG response (bullet points with source metadata)
- **Authentication**: Public demo (no login required).
- **Mobile**: Not required at this stage.
- **CORS**: Standard FastAPI CORS middleware for frontend-backend communication.

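
The chunking requirement above (1000 tokens per chunk, 200-token overlap, strategy swappable) could look roughly like the sketch below. It is illustrative only: it assumes a "token" is a whitespace-separated word, whereas the real pipeline may count model tokens, and the names `ChunkingStrategy`, `sliding_window_chunks`, and `chunk_document` are hypothetical. Passing the strategy as a plain callable is one simple way to keep it replaceable later.

```python
from typing import Callable, List

# Hypothetical alias: a chunking strategy maps raw text to a list of chunks.
ChunkingStrategy = Callable[[str], List[str]]


def sliding_window_chunks(text: str, chunk_size: int = 1000, overlap: int = 200) -> List[str]:
    """Split text into overlapping chunks of whitespace-delimited tokens.

    Assumption: a "token" is a whitespace-separated word, not a model token.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("require chunk_size > 0 and 0 <= overlap < chunk_size")
    tokens = text.split()
    step = chunk_size - overlap  # each new chunk starts this many tokens later
    chunks: List[str] = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks


def chunk_document(text: str, strategy: ChunkingStrategy = sliding_window_chunks) -> List[str]:
    """Entry point that keeps the chunking strategy swappable."""
    return strategy(text)
```
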
---

## Phase 1: Text Question → RAG → Point-Form Answer (5-7 days)

### RAG Pipeline (3-Step LLM Workflow)

```
User Question
    ↓
[LLM Call 1] Extract key questions + keywords from user input
    ↓  ← keywords shown to user in UI
[ChromaDB] Retrieve chunks using extracted keywords
    ↓
[LLM Call 2] Single batch relevance filter: evaluate all chunks, drop irrelevant ones
    ↓
[LLM Call 3] Generate bullet-point response from filtered chunks only
```

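
The flow in the diagram above can be sketched as a single orchestration function. This is a minimal sketch, not the project's actual code: the names `Chunk`, `QueryResult`, and `answer_question` are hypothetical, the three LLM calls and the vector lookup are injected as plain callables so the example runs without any model, and skipping generation when no chunk survives the filter is an assumed policy rather than something the plan states.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Chunk:
    text: str
    source: str  # e.g. the filename stored as metadata


@dataclass
class QueryResult:
    keywords: List[str]
    bullets: List[str]
    sources: List[str]


def answer_question(
    question: str,
    decompose: Callable[[str], List[str]],                        # LLM call 1
    retrieve: Callable[[List[str]], List[Chunk]],                 # vector store lookup
    filter_relevant: Callable[[str, List[Chunk]], List[Chunk]],   # LLM call 2
    generate: Callable[[str, List[Chunk]], List[str]],            # LLM call 3
) -> QueryResult:
    keywords = decompose(question)
    candidates = retrieve(keywords)
    relevant = filter_relevant(question, candidates)
    if not relevant:
        # Assumed policy: with no supporting chunks, skip generation entirely.
        return QueryResult(keywords, [], [])
    bullets = generate(question, relevant)
    sources = sorted({chunk.source for chunk in relevant})
    return QueryResult(keywords, bullets, sources)
```

Injecting the steps as callables also makes each stage easy to stub in tests.
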
- **Query Decomposition** (`services/query_decomposer.py`): The LLM extracts key questions and search keywords from the user's natural-language question. Keywords are displayed to the user for transparency.
- **Relevance Filtering** (`services/relevance_filter.py`): A single batch LLM call receives all retrieved chunks plus the original question and returns a relevance verdict for each chunk. Irrelevant chunks are discarded before response generation.
- **Strict RAG Prompt**: The final LLM call generates a bullet-point answer using ONLY the filtered relevant chunks. No external knowledge allowed. Response format enforced via prompt engineering.

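
One way the single-call relevance filter described above might be structured is to number the chunks in one prompt and ask for a JSON list of relevant indices. The sketch below is an assumption-laden illustration: the prompt wording, the `{"relevant": [...]}` reply format, and the function names `build_filter_prompt` and `parse_filter_reply` are all hypothetical. The parser fails closed, so a malformed reply drops every chunk instead of letting unverified text through.

```python
import json
from typing import List


def build_filter_prompt(question: str, chunks: List[str]) -> str:
    """Put every retrieved chunk into one prompt so a single LLM call suffices."""
    numbered = "\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks))
    return (
        "Decide which chunks help answer the question.\n"
        f"Question: {question}\n"
        f"Chunks:\n{numbered}\n"
        'Reply with JSON only, e.g. {"relevant": [0, 2]}.'
    )


def parse_filter_reply(reply: str, chunks: List[str]) -> List[str]:
    """Keep only the chunks whose index the model marked as relevant.

    Malformed replies or out-of-range indices yield fewer chunks, never more.
    """
    try:
        indices = json.loads(reply).get("relevant", [])
    except (json.JSONDecodeError, AttributeError):
        return []  # not JSON, or JSON that is not an object
    if not isinstance(indices, list):
        return []
    keep = {
        i for i in indices
        if isinstance(i, int) and not isinstance(i, bool) and 0 <= i < len(chunks)
    }
    return [chunk for i, chunk in enumerate(chunks) if i in keep]
```
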
### Backend (FastAPI)
- Dynamic configuration via `.env` (LLM base URL, API key, model names, embedding provider).
- `services/rag.py`: Persistent ChromaDB + Qwen embedding + metadata extraction (filename, upload date, content summary).
- `services/llm_client.py`: OpenAI-compatible client for Qwen LLM.
- `services/query_decomposer.py`: LLM-based keyword/question extraction.
- `services/relevance_filter.py`: LLM-based batch relevance scoring.
- `utils/chunking.py`: DOCX parsing + text chunking (1000 tokens, 200 overlap). Strategy abstracted for future replacement.
- Endpoints:
  - `POST /api/v1/ingest` – DOCX upload, parsing, chunking, embedding, and ingestion with metadata.
  - `POST /api/v1/query` – Full pipeline (three LLM calls): decompose → retrieve → filter → respond. Returns bullet-point answer + extracted keywords + source metadata.

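
As a rough illustration of the `.env`-driven configuration listed above, the sketch below builds a settings object from environment variables. The variable names (`LLM_BASE_URL`, `LLM_API_KEY`, `LLM_MODEL`, `EMBEDDING_PROVIDER`), the `LLMSettings` class, and the `"local"` default are hypothetical and not taken from the project. It also assumes the `.env` file has already been loaded into the process environment by some other mechanism.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class LLMSettings:
    base_url: str
    api_key: str
    model: str
    embedding_provider: str


def load_settings(env: Mapping[str, str] = os.environ) -> LLMSettings:
    """Build settings from environment variables (names are hypothetical)."""
    missing = [name for name in ("LLM_BASE_URL", "LLM_API_KEY") if not env.get(name)]
    if missing:
        raise RuntimeError(f"missing required settings: {', '.join(missing)}")
    return LLMSettings(
        base_url=env["LLM_BASE_URL"].rstrip("/"),  # normalise a trailing slash
        api_key=env["LLM_API_KEY"],
        model=env.get("LLM_MODEL", "qwen/qwen3.5-35b-a3b"),
        embedding_provider=env.get("EMBEDDING_PROVIDER", "local"),
    )
```

Because the provider is only a base URL, key, and model name, switching between a hosted endpoint in development and a local vLLM server in production would then be a matter of editing `.env`, with no code change.
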
### Frontend (React + TS) ✅ Complete
- Phase 2 grid layout pre-allocated: Top-Left video area (empty/hidden), Top-Right input area, Bottom response area.
- Type-safe API calls using TanStack Query.
- Display extracted keywords to user (shown before final answer arrives).
- Display answer as clean bullet list with source metadata.
- Collapsible source cards, copy-to-clipboard button, enhanced skeleton loaders.
- PipelineProgress component (4-stage stepper, ready for streaming API).
- Integration tests: full query flow, error handling, ingest flow.
- **62 tests, TypeScript clean, production build verified.**

---

## Phase 2: Video Upload + Real-Time ASR → RAG (8-10 days)

### Backend Additions
- Video upload (`POST /api/v1/upload-video`) with size/format validation (max 300MB).
- Static file serving for videos.
- WebSocket `/ws/asr/{video_id}` for real-time audio chunk streaming.
- ASR integration with `Qwen/Qwen3-ASR-1.7B` (file upload or audio content).
- Question extraction via LLM, then trigger Phase 1 RAG (auto + manual support).

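
The size/format validation for video uploads could be isolated in a small pure function like the one below, which a FastAPI handler might call before writing the file to `uploads/`. This is a sketch under stated assumptions: "300MB" is read as 300 MiB, the extension allow-list is hypothetical (the plan only says "MP4 and common formats"), and `validate_video` is not a name from the project.

```python
from pathlib import Path

MAX_VIDEO_BYTES = 300 * 1024 * 1024  # assumes "300MB" means 300 MiB
# Hypothetical allow-list; the plan only says "MP4 and common formats".
ALLOWED_EXTENSIONS = {".mp4", ".webm", ".mov", ".mkv"}


def validate_video(filename: str, size_bytes: int) -> None:
    """Raise ValueError if the upload should be rejected."""
    ext = Path(filename).suffix.lower()
    if ext not in ALLOWED_EXTENSIONS:
        raise ValueError(f"unsupported video format: {ext or '(none)'}")
    if size_bytes <= 0:
        raise ValueError("empty upload")
    if size_bytes > MAX_VIDEO_BYTES:
        raise ValueError("video exceeds the 300MB limit")
```

Checking the extension alone does not prove the file is a playable video; inspecting the container with a media library would be a stricter, heavier option.
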
### Frontend Additions
- Drag & drop video upload + progress.
- Video player (`<video controls>`).
- Live transcript display (scrolling box).
- Top-Left: Video player | Top-Right: Live transcript + manual input.
- Bottom: RAG response panel.
- Support both an automatic “Ask” on transcript updates and a manual button.

---

## Development Timeline

| Phase                      | Duration | Key Deliverables | Status |
|----------------------------|----------|------------------|--------|
| Setup + Phase 1 Backend    | 3-4 days | FastAPI + Chroma + Metadata + LLM client | ✅ Complete |
| Phase 1 Frontend           | 2-3 days | UI layout + text query flow | ✅ Complete |
| Phase 2 Backend            | 4-5 days | Video upload + WebSocket ASR + question extraction | ✅ Complete |
| Phase 2 Frontend           | 3-4 days | Video player + live transcript + auto/manual flow | ✅ Complete |
| Phase 4 System Audio & Mic | 5.5 days | System Audio capture + Listen Mic + real-time ASR → RAG | ✅ Complete |
| Testing & Polish           | 1-2 days | End-to-end testing + deployment scripts | ⬜ Pending |

**Total Estimated Effort**: 18.5-23.5 developer days (roughly 4-5 weeks)

> **Note:** Phase 3 (YouTube Live Stream Proxy → ASR) was implemented (5.5 days, 7 sub-phases) and later reverted in favor of Phase 4's more versatile System Audio Capture approach using `getDisplayMedia()`.
>
> **Phase 4** adds System Audio Capture (`getDisplayMedia`) and Listen Mic (`getUserMedia`) as live audio sources alongside video upload. Both pipe audio through the existing WebSocket → DashScope realtime ASR → RAG pipeline. Implementation is complete, with 46 frontend + 14 backend tests. See `.plans/phase4_system_audio_plan.md` for details.

---

## Deployment Strategy

**Development**:
- Backend: `cd backend && uvicorn app.main:app --reload --port 8000`
- Frontend: `cd frontend && npm run dev`

**Production**:
- Use `docker-compose up -d` (includes backend, built frontend, Nginx reverse proxy).
- Simple `deploy.sh` script for building and restarting.

---

**File Information**
- Filename: `development_plan.md`
- Last Updated: May 2026
- Status: Phase 1 ✅, Phase 2 ✅, Phase 4 ✅ (System Audio Capture & Listen Mic complete)