legco_ai_assistant/development_plan.md

141 lines
5.8 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# RAG Video Q&A Web Application - Development Plan
**Project Overview**
Web-based application built in two phases.
- **Phase 1**: Text question → RAG retrieval → Point-form answer (strictly from database)
- **Phase 2**: Video upload + player → real-time audio streaming → ASR transcription → question extraction → Phase 1 RAG flow
**Tech Stack**
- **Backend**: Python + FastAPI (REST + WebSocket)
- **Frontend**: TypeScript + React 18 (Vite) + shadcn/ui + Tailwind CSS
- **Server**: Linux Ubuntu 22.04
- **RAG Database**: ChromaDB (persistent)
- **LLM/ASR Integration**: Dynamic via `.env` (supports local vLLM, OpenRouter, Alibaba Cloud)
- Alibaba Cloud reference: https://modelstudio.console.alibabacloud.com/ap-southeast-1?switchAgent=101503&tab=doc&productCode=p_efm&switchUserType=3#/doc/?type=model&url=2989727
- **Models**:
- Embedding: `qwen/qwen3-embedding-4b`
- LLM: `qwen/qwen3.5-35b-a3b`
- ASR: `Qwen/Qwen3-ASR-1.7B`
**Deployment**
- Development: Simple commands (`uvicorn` + `npm run dev`)
- Production: Docker + Nginx
---
## Project Structure (Monorepo)
app/
├── backend/ # FastAPI
│ ├── app/
│ │ ├── main.py
│ │ ├── routers/ # query.py, ingest.py, video.py, ws_asr.py
│ │ ├── services/ # rag.py, llm_client.py, asr_client.py, video_service.py
│ │ ├── models/ # Pydantic schemas
│ │ ├── core/ # config.py, database.py
│ │ └── utils/ # chunking, metadata extraction
│ ├── uploads/ # video storage (max 300MB)
│ ├── requirements.txt
│ └── .env.example
├── frontend/ # React + TypeScript (Vite)
│ ├── src/
│ │ ├── components/
│ │ ├── pages/
│ │ ├── lib/ # api.ts
│ │ └── App.tsx
│ ├── package.json
│ └── vite.config.ts
├── chroma_db/ # Persistent vector store
├── Dockerfile
├── docker-compose.yml
├── nginx.conf
└── deploy.sh
---
## Key Requirements Incorporated
- **LLM/ASR Configuration**: Backend reads from `.env` for easy switching between development (OpenRouter / Alibaba Cloud) and production (local vLLM).
- **RAG Database**: ChromaDB with metadata support (filename + extracted content metadata).
- **Embedding Model**: `qwen/qwen3-embedding-4b` via sentence-transformers.
- **Document Ingestion**: Via UI (project-based demo, no user authentication).
- **Video**: MP4 and common formats, maximum 300MB.
- **ASR Flow**: Both **automatic** (on transcript updates) and **manual** “Ask from Video” button.
- **UI Layout**:
- Top-Left: Video player
- Top-Right: Real-time transcript + text input box
- Bottom Half: RAG response (bullet points with source metadata)
- **Authentication**: Public demo (no login required).
- **Mobile**: Not required at this stage.
---
## Phase 1: Text Question → RAG → Point-Form Answer (5-7 days)
### Backend (FastAPI)
- Dynamic configuration via `.env` (LLM base URL, API key, model names).
- `services/rag.py`: Persistent ChromaDB + Qwen embedding + metadata extraction (filename, upload date, content summary).
- `services/llm_client.py`: OpenAI-compatible client for Qwen LLM with **strict RAG prompt** (only use retrieved context).
- Endpoints:
- `POST /api/v1/ingest` Document upload and ingestion with metadata.
- `POST /api/v1/query` Question → retrieve → LLM → bullet-point response.
### Frontend (React + TS)
- Clean layout: Top-right input box, bottom response area.
- Type-safe API calls using TanStack Query.
- Display answer as clean bullet list with source metadata.
---
## Phase 2: Video Upload + Real-Time ASR → RAG (8-10 days)
### Backend Additions
- Video upload (`POST /api/v1/upload-video`) with size/format validation (<300MB).
- Static file serving for videos.
- WebSocket `/ws/asr/{video_id}` for real-time audio chunk streaming.
- ASR integration with `Qwen/Qwen3-ASR-1.7B` (file upload or audio content).
- Question extraction via LLM, then trigger Phase 1 RAG (auto + manual support).
### Frontend Additions
- Drag & drop video upload + progress.
- Video player (`<video controls>`).
- Live transcript display (scrolling box).
- Top-Left: Video player | Top-Right: Live transcript + manual input.
- Bottom: RAG response panel.
- Support both automatic “Ask” on transcript updates and manual button.
---
## Development Timeline
| Phase | Duration | Key Deliverables |
|-----------------------------|--------------|------------------|
| Setup + Phase 1 Backend | 3-4 days | FastAPI + Chroma + Metadata + LLM client |
| Phase 1 Frontend | 2-3 days | UI layout + text query flow |
| Phase 2 Backend | 4-5 days | Video upload + WebSocket ASR + question extraction |
| Phase 2 Frontend | 3-4 days | Video player + live transcript + auto/manual flow |
| Testing & Polish | 1-2 days | End-to-end testing + deployment scripts |
**Total Estimated Effort**: 13-17 developer days (2-3 weeks)
---
## Deployment Strategy
**Development**:
- Backend: `cd backend && uvicorn app.main:app --reload --port 8000`
- Frontend: `cd frontend && npm run dev`
**Production**:
- Use `docker-compose up -d` (includes backend, built frontend, Nginx reverse proxy).
- Simple `deploy.sh` script for building and restarting.
---
**File Information**
- Filename: `development_plan.md`
- Last Updated: April 2026
- Status: Ready for implementation