legco_ai_assistant/development_plan.md


RAG Video Q&A Web Application - Development Plan

Project Overview
Web-based application built in two phases.

  • Phase 1: Text question → query decomposition → RAG retrieval → relevance filtering → point-form answer (strictly from database)
  • Phase 2: Video upload + player → real-time audio streaming → ASR transcription → question extraction → Phase 1 RAG flow

Tech Stack

Deployment

  • Development: Simple commands (uvicorn + npm run dev)
  • Production: Docker + Nginx

Project Structure (Monorepo)

app/
├── backend/                 # FastAPI
│   ├── app/
│   │   ├── main.py
│   │   ├── routers/         # query.py, ingest.py, video.py, ws_asr.py
│   │   ├── services/        # rag.py, llm_client.py, asr_client.py, video_service.py
│   │   ├── models/          # Pydantic schemas
│   │   ├── core/            # config.py, database.py
│   │   └── utils/           # chunking, metadata extraction
│   ├── uploads/             # video storage (max 300MB)
│   ├── requirements.txt
│   └── .env.example
├── frontend/                # React + TypeScript (Vite)
│   ├── src/
│   │   ├── components/
│   │   ├── pages/
│   │   ├── lib/             # api.ts
│   │   └── App.tsx
│   ├── package.json
│   └── vite.config.ts
├── chroma_db/               # Persistent vector store
├── Dockerfile
├── docker-compose.yml
├── nginx.conf
└── deploy.sh


Key Requirements Incorporated

  • LLM/ASR Configuration: Backend reads from .env for easy switching between development (OpenRouter / Alibaba Cloud) and production (local vLLM).
  • RAG Database: ChromaDB with metadata support (filename + extracted content metadata).
  • Embedding Model: qwen/qwen3-embedding-4b via sentence-transformers, provider-switchable via .env (OpenRouter for dev, local vLLM for prod).
  • Document Ingestion: Via UI (project-based demo, no user authentication). Supported formats: DOCX, PDF.
  • Chunking Strategy: 1000 tokens per chunk, 200 token overlap. Strategy abstracted for future replacement.
  • Video: MP4 and common formats, maximum 300MB.
  • ASR Flow: Both automatic (on transcript updates) and manual "Ask from Video" button.
  • UI Layout (Phase 2 grid, pre-allocated in Phase 1):
    • Top-Left: Video player (empty in Phase 1)
    • Top-Right: Text input box + extracted keywords display
    • Bottom Half: RAG response (bullet points with source metadata)
  • Authentication: Public demo (no login required).
  • Mobile: Not required at this stage.
  • CORS: Standard FastAPI CORS middleware for frontend-backend communication.
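The .env-driven provider switch described above could be sketched roughly as follows. This is a minimal illustration, not the project's actual core/config.py: the variable names LLM_BASE_URL, LLM_API_KEY, and LLM_MODEL and the ProviderSettings class are hypothetical, since the plan only states that these values are read from .env.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ProviderSettings:
    """Connection settings for one OpenAI-compatible endpoint."""
    base_url: str
    api_key: str
    model: str


def load_llm_settings(env: Optional[Mapping[str, str]] = None) -> ProviderSettings:
    """Build LLM settings from environment variables.

    The variable names used here are assumptions for illustration;
    the plan does not list the real keys.
    """
    env = os.environ if env is None else env
    missing = [k for k in ("LLM_BASE_URL", "LLM_MODEL") if not env.get(k)]
    if missing:
        raise RuntimeError(f"Missing required settings: {', '.join(missing)}")
    return ProviderSettings(
        # Strip a trailing slash so paths can be appended consistently.
        base_url=env["LLM_BASE_URL"].rstrip("/"),
        # A local vLLM server may not need a key, so default to empty.
        api_key=env.get("LLM_API_KEY", ""),
        model=env["LLM_MODEL"],
    )
```

Under this sketch, switching from a hosted development provider to a local vLLM server would only require changing the values in .env, with no code changes.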

Phase 1: Text Question → RAG → Point-Form Answer (5-7 days)

RAG Pipeline (3-Step LLM Workflow)

User Question
    ↓
[LLM Call 1] Extract key questions + keywords from user input
    ↓                ← keywords shown to user in UI
[ChromaDB] Retrieve chunks using extracted keywords
    ↓
[LLM Call 2] Single batch relevance filter — evaluate all chunks, drop irrelevant ones
    ↓
[LLM Call 3] Generate bullet-point response from filtered chunks only

  • Query Decomposition (services/query_decomposer.py): The LLM extracts key questions and search keywords from the user's natural-language question. Keywords are displayed to the user for transparency.
  • Relevance Filtering (services/relevance_filter.py): Single batch LLM call receives all retrieved chunks + original question. Returns relevance verdict for each chunk. Irrelevant chunks are discarded before response generation.
  • Strict RAG Prompt: Final LLM call generates bullet-point answer using ONLY filtered relevant chunks. No external knowledge allowed. Response format enforced via prompt engineering.
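The three-call workflow above can be sketched as a single orchestration function. This is an illustration under stated assumptions, not the code in services/rag.py: the Chunk and Answer types and the four injected callables are hypothetical stand-ins for the decomposer, the ChromaDB retriever, the relevance filter, and the generator. Returning an empty answer when no chunk survives filtering is one possible reading of "strictly from database", not something the plan specifies.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Chunk:
    text: str
    source: str  # e.g. the filename stored as metadata


@dataclass
class Answer:
    keywords: List[str]
    bullets: List[str]
    sources: List[str]


def answer_question(
    question: str,
    decompose: Callable[[str], List[str]],
    retrieve: Callable[[List[str]], List[Chunk]],
    filter_relevant: Callable[[str, Sequence[Chunk]], List[bool]],
    generate: Callable[[str, Sequence[Chunk]], List[str]],
) -> Answer:
    keywords = decompose(question)                # LLM call 1
    chunks = retrieve(keywords)                   # vector store lookup
    verdicts = filter_relevant(question, chunks)  # LLM call 2, one batch
    if len(verdicts) != len(chunks):
        raise ValueError("relevance filter must return one verdict per chunk")
    kept = [c for c, ok in zip(chunks, verdicts) if ok]
    if not kept:
        # Assumption: with no supporting chunks, skip generation entirely
        # so the model cannot fall back on outside knowledge.
        return Answer(keywords, [], [])
    bullets = generate(question, kept)            # LLM call 3
    sources = sorted({c.source for c in kept})
    return Answer(keywords, bullets, sources)
```

Passing the stages in as callables keeps the flow testable with fakes and leaves room to swap any stage without touching the orchestration.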

Backend (FastAPI)

  • Dynamic configuration via .env (LLM base URL, API key, model names, embedding provider).
  • services/rag.py: Persistent ChromaDB + Qwen embedding + metadata extraction (filename, upload date, content summary).
  • services/llm_client.py: OpenAI-compatible client for Qwen LLM.
  • services/query_decomposer.py: LLM-based keyword/question extraction.
  • services/relevance_filter.py: LLM-based batch relevance scoring.
  • utils/chunking.py: DOCX parsing + text chunking (1000 tokens, 200 overlap). Strategy abstracted for future replacement.
  • Endpoints:
    • POST /api/v1/ingest: DOCX upload, parsing, chunking, embedding, and ingestion with metadata.
    • POST /api/v1/query: Full pipeline with three LLM calls: decompose → retrieve → filter → respond. Returns bullet-point answer + extracted keywords + source metadata.
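The chunking rule mentioned above (1000 tokens per chunk, 200 tokens of overlap, behind a replaceable strategy) could look roughly like this. It is a sketch only: the ChunkingStrategy protocol and SlidingWindowChunker class are hypothetical names, and whitespace splitting stands in for the real tokenizer, which the plan does not specify.

```python
from typing import Callable, List, Protocol


class ChunkingStrategy(Protocol):
    """Any object with a split() method can replace the default chunker."""

    def split(self, text: str) -> List[str]: ...


class SlidingWindowChunker:
    def __init__(
        self,
        chunk_size: int = 1000,
        overlap: int = 200,
        # Assumption: whitespace tokens approximate model tokens.
        tokenize: Callable[[str], List[str]] = str.split,
    ) -> None:
        if not 0 <= overlap < chunk_size:
            raise ValueError("overlap must be >= 0 and smaller than chunk_size")
        self.chunk_size = chunk_size
        self.overlap = overlap
        self.tokenize = tokenize

    def split(self, text: str) -> List[str]:
        tokens = self.tokenize(text)
        step = self.chunk_size - self.overlap  # 800 with the defaults
        chunks: List[str] = []
        for start in range(0, len(tokens), step):
            chunks.append(" ".join(tokens[start:start + self.chunk_size]))
            # Stop once a window has reached the end of the text, so no
            # trailing chunk made purely of overlap is produced.
            if start + self.chunk_size >= len(tokens):
                break
        return chunks
```

With the defaults, each chunk starts 800 tokens after the previous one, so the last 200 tokens of one chunk reappear as the first 200 of the next.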

Frontend (React + TS): Complete

  • Phase 2 grid layout pre-allocated: Top-Left video area (empty/hidden), Top-Right input area, Bottom response area.
  • Type-safe API calls using TanStack Query.
  • Display extracted keywords to user (shown before final answer arrives).
  • Display answer as clean bullet list with source metadata.
  • Collapsible source cards, copy-to-clipboard button, enhanced skeleton loaders.
  • PipelineProgress component (4-stage stepper, ready for streaming API).
  • Integration tests: full query flow, error handling, ingest flow.
  • 62 tests, TypeScript clean, production build verified.

Phase 2: Video Upload + Real-Time ASR → RAG (8-10 days)

Backend Additions

  • Video upload (POST /api/v1/upload-video) with size/format validation (maximum 300MB).
  • Static file serving for videos.
  • WebSocket /ws/asr/{video_id} for real-time audio chunk streaming.
  • ASR integration with Qwen/Qwen3-ASR-1.7B (file upload or audio content).
  • Question extraction via LLM, then trigger Phase 1 RAG (auto + manual support).
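The size/format validation for uploads could be sketched as a small pure function that the upload route calls before saving the file. The names here are hypothetical, the extension whitelist is an assumption (the plan only says "MP4 and common formats"), and "300MB" is interpreted as 300 MiB with the limit itself allowed.

```python
from pathlib import PurePosixPath
from typing import Optional

# Assumption: "300MB" is read as 300 MiB.
MAX_VIDEO_BYTES = 300 * 1024 * 1024
# Hypothetical whitelist; the plan does not enumerate the accepted formats.
ALLOWED_EXTENSIONS = {".mp4", ".mov", ".webm", ".mkv"}


def validate_video(filename: str, size_bytes: int) -> Optional[str]:
    """Return an error message, or None when the upload is acceptable."""
    suffix = PurePosixPath(filename).suffix.lower()
    if suffix not in ALLOWED_EXTENSIONS:
        return f"Unsupported video format: {suffix or '(none)'}"
    if size_bytes <= 0:
        return "Empty upload"
    if size_bytes > MAX_VIDEO_BYTES:
        return "Video exceeds the 300MB limit"
    return None
```

Checking the extension alone does not prove the file is a real video; a stricter version might also inspect the content type or probe the container.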

Frontend Additions

  • Drag & drop video upload + progress.
  • Video player (<video controls>).
  • Live transcript display (scrolling box).
  • Top-Left: Video player | Top-Right: Live transcript + manual input.
  • Bottom: RAG response panel.
  • Support both automatic “Ask” on transcript updates and manual button.

Development Timeline

| Phase | Duration | Key Deliverables | Status |
|---|---|---|---|
| Setup + Phase 1 Backend | 3-4 days | FastAPI + Chroma + Metadata + LLM client | Complete |
| Phase 1 Frontend | 2-3 days | UI layout + text query flow | Complete |
| Phase 2 Backend | 4-5 days | Video upload + WebSocket ASR + question extraction | Complete |
| Phase 2 Frontend | 3-4 days | Video player + live transcript + auto/manual flow | Complete |
| Phase 4 System Audio & Mic | 5.5 days | System Audio capture + Listen Mic + real-time ASR → RAG | Complete |
| Testing & Polish | 1-2 days | End-to-end testing + deployment scripts | Complete |

Total Estimated Effort: about 19-23 developer days (3-4 weeks); the durations listed above sum to 18.5-23.5 days.

Note: Phase 3 (YouTube Live Stream Proxy → ASR) was implemented (5.5 days, 7 sub-phases) and later reverted in favor of Phase 4's more versatile System Audio Capture approach using getDisplayMedia().

Phase 4 adds System Audio Capture (getDisplayMedia) and Listen Mic (getUserMedia) as live audio sources alongside video upload. Both pipe audio through the existing WebSocket → DashScope realtime ASR → RAG pipeline. Implementation is complete, with 46 frontend and 14 backend tests. See .plans/phase4_system_audio_plan.md for details.


Deployment Strategy

Development:

  • Backend: cd backend && uvicorn app.main:app --reload --port 8000
  • Frontend: cd frontend && npm run dev

Production:

  • Use docker-compose up -d (includes backend, built frontend, Nginx reverse proxy).
  • Simple deploy.sh script for building and restarting.

File Information

  • Filename: development_plan.md
  • Last Updated: May 2026
  • Status: Phase 1, Phase 2, and Phase 4 are all complete