# RAG Video Q&A Web Application - Development Plan
**Project Overview**
A web-based application built in two phases:
- **Phase 1**: Text question → query decomposition → RAG retrieval → relevance filtering → point-form answer (strictly from database)
- **Phase 2**: Video upload + player → real-time audio streaming → ASR transcription → question extraction → Phase 1 RAG flow
**Tech Stack**
- **Backend**: Python + FastAPI (REST + WebSocket)
- **Frontend**: TypeScript + React 18 (Vite) + shadcn/ui + Tailwind CSS
- **Server**: Linux Ubuntu 22.04
- **RAG Database**: ChromaDB (persistent)
- **LLM/ASR Integration**: Dynamic via `.env` (supports local vLLM, OpenRouter, Alibaba Cloud)
  - Alibaba Cloud reference: https://modelstudio.console.alibabacloud.com/ap-southeast-1?switchAgent=101503&tab=doc&productCode=p_efm&switchUserType=3#/doc/?type=model&url=2989727
- **Models**:
  - Embedding: `qwen/qwen3-embedding-4b` (via sentence-transformers, provider-switchable via `.env`)
  - LLM: `qwen/qwen3.5-35b-a3b` (OpenRouter for dev, local vLLM for prod)
  - ASR: `Qwen/Qwen3-ASR-1.7B`
**Deployment**
- Development: Simple commands (`uvicorn` + `npm run dev`)
- Production: Docker + Nginx
---
## Project Structure (Monorepo)
app/
├── backend/ # FastAPI
│ ├── app/
│ │ ├── main.py
│ │ ├── routers/ # query.py, ingest.py, video.py, ws_asr.py
│ │ ├── services/ # rag.py, llm_client.py, asr_client.py, video_service.py
│ │ ├── models/ # Pydantic schemas
│ │ ├── core/ # config.py, database.py
│ │ └── utils/ # chunking, metadata extraction
│ ├── uploads/ # video storage (max 300MB)
│ ├── requirements.txt
│ └── .env.example
├── frontend/ # React + TypeScript (Vite)
│ ├── src/
│ │ ├── components/
│ │ ├── pages/
│ │ ├── lib/ # api.ts
│ │ └── App.tsx
│ ├── package.json
│ └── vite.config.ts
├── chroma_db/ # Persistent vector store
├── Dockerfile
├── docker-compose.yml
├── nginx.conf
└── deploy.sh
---
## Key Requirements Incorporated
- **LLM/ASR Configuration**: Backend reads from `.env` for easy switching between development (OpenRouter / Alibaba Cloud) and production (local vLLM).
- **RAG Database**: ChromaDB with metadata support (filename + extracted content metadata).
- **Embedding Model**: `qwen/qwen3-embedding-4b` via sentence-transformers, provider-switchable via `.env` (OpenRouter for dev, local vLLM for prod).
- **Document Ingestion**: Via UI (project-based demo, no user authentication). Supported formats: DOCX, PDF.
- **Chunking Strategy**: 1000 tokens per chunk, 200 token overlap. Strategy abstracted for future replacement.
- **Video**: MP4 and common formats, maximum 300MB.
- **ASR Flow**: Both **automatic** (on transcript updates) and **manual** "Ask from Video" button.
- **UI Layout** (Phase 2 grid, pre-allocated in Phase 1):
  - Top-Left: Video player (empty in Phase 1)
  - Top-Right: Text input box + extracted keywords display
  - Bottom Half: RAG response (bullet points with source metadata)
- **Authentication**: Public demo (no login required).
- **Mobile**: Not required at this stage.
- **CORS**: Standard FastAPI CORS middleware for frontend-backend communication.
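
The `.env`-driven switching described under **LLM/ASR Configuration** could be sketched roughly as below. This is a minimal illustration, not the project's actual `core/config.py`: the variable names (`LLM_BASE_URL`, `LLM_API_KEY`, `LLM_MODEL`, `EMBEDDING_MODEL`, `ASR_MODEL`) and the `Settings` / `load_settings` names are assumptions, and the sketch assumes the `.env` file has already been loaded into the process environment by whatever loader the project adopts. Only the default model names come from this plan. Switching from OpenRouter to a local vLLM server would then only require changing the base URL and key in `.env`.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    """Provider settings read from the environment (field names are hypothetical)."""
    llm_base_url: str
    llm_api_key: str
    llm_model: str
    embedding_model: str
    asr_model: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from environment variables, falling back to the plan's model names."""
    base_url = env.get("LLM_BASE_URL")
    if not base_url:
        # No default here: dev and prod point at different providers.
        raise ValueError("LLM_BASE_URL must be set")
    return Settings(
        llm_base_url=base_url.rstrip("/"),
        llm_api_key=env.get("LLM_API_KEY", ""),
        llm_model=env.get("LLM_MODEL", "qwen/qwen3.5-35b-a3b"),
        embedding_model=env.get("EMBEDDING_MODEL", "qwen/qwen3-embedding-4b"),
        asr_model=env.get("ASR_MODEL", "Qwen/Qwen3-ASR-1.7B"),
    )
```
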
---
## Phase 1: Text Question → RAG → Point-Form Answer (5-7 days)
### RAG Pipeline (3-Step LLM Workflow)
```
User Question
      ↓
[LLM Call 1] Extract key questions + keywords from user input
      ↓  ← keywords shown to user in UI
[ChromaDB] Retrieve chunks using extracted keywords
      ↓
[LLM Call 2] Single batch relevance filter: evaluate all chunks, drop irrelevant ones
      ↓
[LLM Call 3] Generate bullet-point response from filtered chunks only
```
- **Query Decomposition** (`services/query_decomposer.py`): The LLM extracts key questions and search keywords from the user's natural-language question. Keywords are displayed to the user for transparency.
- **Relevance Filtering** (`services/relevance_filter.py`): A single batch LLM call receives all retrieved chunks plus the original question and returns a relevance verdict for each chunk. Irrelevant chunks are discarded before response generation.
- **Strict RAG Prompt**: The final LLM call generates a bullet-point answer using ONLY the filtered, relevant chunks. No external knowledge is allowed. The response format is enforced via prompt engineering.
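
The control flow of the three LLM calls can be sketched as a single orchestration function. This is only an illustration under stated assumptions: `Chunk`, `RagAnswer`, and `answer_question` are hypothetical names, and the four callables stand in for the real query decomposer, ChromaDB retrieval, batch relevance filter, and strict-RAG generator, none of which are specified in detail here.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Chunk:
    text: str
    source: str  # filename metadata stored alongside the chunk


@dataclass
class RagAnswer:
    keywords: List[str]   # shown to the user for transparency
    bullets: List[str]    # point-form answer
    sources: List[str]    # source metadata for the kept chunks


def answer_question(
    question: str,
    decompose: Callable[[str], List[str]],                 # LLM call 1
    retrieve: Callable[[List[str]], List[Chunk]],          # vector store lookup
    judge: Callable[[str, Sequence[Chunk]], List[bool]],   # LLM call 2 (one batch)
    generate: Callable[[str, Sequence[Chunk]], List[str]], # LLM call 3
) -> RagAnswer:
    keywords = decompose(question)
    candidates = retrieve(keywords)
    if not candidates:
        # Nothing retrieved: answer strictly from the database means no answer.
        return RagAnswer(keywords, [], [])

    verdicts = judge(question, candidates)
    if len(verdicts) != len(candidates):
        raise ValueError("relevance filter must return one verdict per chunk")

    kept = [chunk for chunk, ok in zip(candidates, verdicts) if ok]
    if not kept:
        # Skip the generation call rather than let the model answer unsupported.
        return RagAnswer(keywords, [], [])

    bullets = generate(question, kept)
    sources = sorted({chunk.source for chunk in kept})
    return RagAnswer(keywords, bullets, sources)
```

Passing the stages in as callables keeps the sketch testable with stubs and leaves each stage free to be swapped later. Returning an empty answer when no chunk survives filtering is one possible way to honour the "strictly from database" rule.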
### Backend (FastAPI)
- Dynamic configuration via `.env` (LLM base URL, API key, model names, embedding provider).
- `services/rag.py`: Persistent ChromaDB + Qwen embedding + metadata extraction (filename, upload date, content summary).
- `services/llm_client.py`: OpenAI-compatible client for Qwen LLM.
- `services/query_decomposer.py`: LLM-based keyword/question extraction.
- `services/relevance_filter.py`: LLM-based batch relevance scoring.
- `utils/chunking.py`: DOCX/PDF parsing + text chunking (1000 tokens, 200 overlap). Strategy abstracted for future replacement.
- Endpoints:
  - `POST /api/v1/ingest`: DOCX/PDF upload, parsing, chunking, embedding, and ingestion with metadata.
  - `POST /api/v1/query`: Full pipeline with three LLM calls (decompose → retrieve → filter → respond). Returns bullet-point answer + extracted keywords + source metadata.
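
The chunking rule (1000 tokens per chunk, 200 token overlap) amounts to a sliding window that advances 800 tokens at a time. The sketch below shows one way it might look. It assumes whitespace-separated words as a stand-in for real tokenizer output, and `chunk_tokens` / `chunk_text` are hypothetical names rather than the actual contents of `utils/chunking.py`.

```python
from typing import List, Sequence


def chunk_tokens(
    tokens: Sequence[str], chunk_size: int = 1000, overlap: int = 200
) -> List[List[str]]:
    """Split tokens into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("require chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # 800 with the plan's defaults
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> List[str]:
    """Whitespace-token stand-in; a real tokenizer would replace text.split()."""
    return [" ".join(c) for c in chunk_tokens(text.split(), chunk_size, overlap)]
```

With these defaults, a 2,500-token document yields three chunks starting at tokens 0, 800, and 1600, and the last 200 tokens of each chunk repeat as the first 200 of the next. Keeping the window logic separate from tokenization is one way to leave the strategy replaceable.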
### Frontend (React + TS)
- Phase 2 grid layout pre-allocated: Top-Left video area (empty/hidden), Top-Right input area, Bottom response area.
- Type-safe API calls using TanStack Query.
- Display extracted keywords to user (shown before final answer arrives).
- Display answer as clean bullet list with source metadata.
---
## Phase 2: Video Upload + Real-Time ASR → RAG (7-9 days)
### Backend Additions
- Video upload (`POST /api/v1/upload-video`) with size/format validation (max 300MB).
- Static file serving for videos.
- WebSocket `/ws/asr/{video_id}` for real-time audio chunk streaming.
- ASR integration with `Qwen/Qwen3-ASR-1.7B` (file upload or audio content).
- Question extraction via LLM, then trigger Phase 1 RAG (auto + manual support).
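
The size/format check for uploads might be sketched as a small pure function that the upload endpoint calls before storing the file. Several details are assumptions here: the plan only says "MP4 and common formats", so the extension list is illustrative, "300MB" is interpreted as 300 MiB, and `validate_video` is a hypothetical helper name.

```python
from pathlib import PurePath

# Assumption: "300MB" is read as 300 MiB.
MAX_VIDEO_BYTES = 300 * 1024 * 1024
# Assumption: illustrative set; the plan names only MP4 explicitly.
ALLOWED_EXTENSIONS = {".mp4", ".mov", ".webm", ".mkv"}


def validate_video(filename: str, size_bytes: int) -> None:
    """Raise ValueError if the upload has an unsupported extension or size."""
    extension = PurePath(filename).suffix.lower()
    if extension not in ALLOWED_EXTENSIONS:
        raise ValueError(f"unsupported video format: {extension or '(none)'}")
    if size_bytes <= 0:
        raise ValueError("empty upload")
    if size_bytes > MAX_VIDEO_BYTES:
        raise ValueError("video exceeds the 300MB limit")
```

An extension check alone does not prove the file is a playable video, so a production endpoint would likely also inspect the content and enforce the size limit while streaming rather than trusting a client-reported size.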
### Frontend Additions
- Drag & drop video upload + progress.
- Video player (`<video controls>`).
- Live transcript display (scrolling box).
- Top-Left: Video player | Top-Right: Live transcript + manual input.
- Bottom: RAG response panel.
- Support both the automatic "Ask" on transcript updates and the manual "Ask from Video" button.
---
## Development Timeline
| Phase | Duration | Key Deliverables |
|-----------------------------|--------------|------------------|
| Setup + Phase 1 Backend | 3-4 days | FastAPI + Chroma + Metadata + LLM client |
| Phase 1 Frontend | 2-3 days | UI layout + text query flow |
| Phase 2 Backend | 4-5 days | Video upload + WebSocket ASR + question extraction |
| Phase 2 Frontend | 3-4 days | Video player + live transcript + auto/manual flow |
| Testing & Polish | 1-2 days | End-to-end testing + deployment scripts |
**Total Estimated Effort**: 13-18 developer days (2-3 weeks)
---
## Deployment Strategy
**Development**:
- Backend: `cd backend && uvicorn app.main:app --reload --port 8000`
- Frontend: `cd frontend && npm run dev`
**Production**:
- Use `docker-compose up -d` (includes backend, built frontend, Nginx reverse proxy).
- Simple `deploy.sh` script for building and restarting.
---
**File Information**
- Filename: `development_plan.md`
- Last Updated: April 2026
- Status: Phase 1 clarified, ready for sub-phase planning