docs: update AGENTS.md with per-sub-question pipeline architecture

Replace flat 3-step LLM workflow with per-sub-question architecture diagram. Document per-sub-question retrieval, filtering (single LLM call with sub-q grouping), and response generation with ## headers. Update CODE MAP to reflect completed implementation status. Add SSE event sequence with generating_subquestion events, history XML format with <sub_q> wrappers, and sources as list-of-lists JSON.

Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
2026-04-26 23:30:19 +08:00 · 2026-04-26 23:30:19 +08:00 · a72d5b0773
parent 3f292abe1b
1 changed file with 25 additions and 1 deletion
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -51,7 +51,9 @@ app/
 | UI components | `frontend/src/components/` | shadcn/ui + Tailwind |
 ## CODE MAP
-*Greenfield — no code yet. See development_plan.md for full specification.*
+- **Backend**: FastAPI app with routers (query, ingest, video, ws_asr, prompts, history), services (rag, llm_client, asr_client, video_service, query_decomposer, relevance_filter, prompt_service, history_service), Pydantic models
 - **Frontend**: React 18 + TypeScript + Vite with react-resizable-panels layout, TanStack Query, SSE streaming via `queryDocumentStream()`, shadcn/ui + Tailwind components
 - **Pipeline**: 3-step LLM workflow (decompose → retrieve → filter → generate; retrieval is a ChromaDB query, not an LLM call) with per-sub-question organization
 ## CONVENTIONS
 - **Backend**: `snake_case` files; routers thin, services thick; `.env` for all LLM/ASR config
@@ -60,6 +62,28 @@ app/
 - **RAG**: Strict prompt — answer ONLY from retrieved context; bullet-point format
 - **Metadata**: Every doc chunk must have `filename`, `upload_date`, `content_summary`
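A minimal sketch of checking the metadata rule before a chunk is stored, assuming metadata travels as a plain mapping. `validate_chunk_metadata` and `REQUIRED_METADATA_KEYS` are hypothetical names, not taken from the codebase, and treating an empty value as missing is a slightly stricter reading than the convention states.

```python
from typing import Any, Mapping

# Keys required by the Metadata convention (constant name is hypothetical).
REQUIRED_METADATA_KEYS = ("filename", "upload_date", "content_summary")


def validate_chunk_metadata(metadata: Mapping[str, Any]) -> None:
    """Raise ValueError if any required key is absent or has an empty value."""
    missing = [key for key in REQUIRED_METADATA_KEYS if not metadata.get(key)]
    if missing:
        raise ValueError(f"chunk metadata missing required keys: {', '.join(missing)}")
```

Running this in the ingest service before chunks reach ChromaDB would make a malformed chunk fail at upload time rather than surface later as a source with no filename.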
 ### RAG Pipeline (3-Step LLM Workflow — Per-Sub-Question)
 ```
 User Question
    ↓
 [LLM Call 1] QueryDecomposer — extract 2-5 sub-questions
    ↓
 [ChromaDB] Retrieve per sub-question — each sub-q independently queries ChromaDB
    ↓
 [LLM Call 2] RelevanceFilter (single call) — chunks grouped by sub-q, each scored against its own sub-q
    ↓
 [LLM Call 3] ResponseGeneration — markdown sections per sub-question with ## headers
 ```
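The diagram can be sketched as a generator that yields `(event_name, payload)` pairs in the order listed under SSE Events. This is a minimal sketch under stated assumptions: `run_pipeline` and the four callables are hypothetical stand-ins for `QueryDecomposer`, `RAGService` and `RelevanceFilter`, their signatures are guesses, and chunks are plain strings. The text does not say whether step 3 is one LLM call or one per sub-question; this sketch calls `generate` once per sub-question and adds the `##` header itself, which is also an assumption.

```python
from typing import Callable, Iterator

# Hypothetical signatures standing in for the real services.
Decompose = Callable[[str], list[str]]                                    # LLM call 1
Retrieve = Callable[[str], list[str]]                                     # one ChromaDB query per sub-question
FilterGrouped = Callable[[list[tuple[str, list[str]]]], list[list[str]]]  # LLM call 2, all groups at once
Generate = Callable[[str, list[str]], str]                                # section body for one sub-question

Event = tuple[str, object]


def run_pipeline(
    question: str,
    decompose: Decompose,
    retrieve: Retrieve,
    filter_grouped: FilterGrouped,
    generate: Generate,
) -> Iterator[Event]:
    """Yield (event_name, payload) pairs in the documented SSE order."""
    # Empty decomposition falls back to the original question (Decision #13).
    sub_questions = decompose(question) or [question]
    yield ("decomposed", sub_questions)

    # Each sub-question queries the vector store independently.
    yield ("retrieving", None)
    groups = [(sq, retrieve(sq)) for sq in sub_questions]

    # Every group is passed in one call; the callable returns kept chunks per group, in order.
    yield ("filtering", None)
    kept = filter_grouped(groups)

    yield ("generating", None)
    sections = []
    for sq, chunks in zip(sub_questions, kept):
        yield ("generating_subquestion", sq)
        sections.append(f"## {sq}\n\n{generate(sq, chunks)}")

    yield ("completed", "\n\n".join(sections))
```

Because `filter_grouped` receives every group together, filtering stays a single call however many sub-questions there are; the callable is expected to judge each group against its own sub-question, as the diagram describes.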
 **Per-Sub-Question Organization**:
 - Retrieval: `RAGService.retrieve_per_subquestion()` queries ChromaDB once per sub-question
 - Filtering: `RelevanceFilter.filter_per_subquestion()` single LLM call with sub-q grouping
 - Response: `RAGService.generate_response_per_subquestion()` produces markdown sections with grouped sources
 - SSE Events: `decomposed → retrieving → filtering → generating → generating_subquestion (per sub-q) → completed`
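 - History: retrieved chunks stored as XML, with one `<sub_q>` element wrapping each sub-question's chunks; sources stored as list-of-lists JSON (one inner list per sub-question)
 - Empty decomposition fallback (Decision #13): if the decomposer returns `[]`, the pipeline uses `[original_question]`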
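A minimal sketch of the history format, using only the standard library. The `<sub_q>` wrapper and the list-of-lists sources come from the bullets above; the `<chunks>` root, the `question` attribute, the `<chunk>` child elements, the `serialize_history` name and the assumption that sources are plain filename strings are all hypothetical.

```python
import json
import xml.etree.ElementTree as ET


def serialize_history(
    sub_questions: list[str],
    chunks_per_subq: list[list[str]],
    sources_per_subq: list[list[str]],
) -> tuple[str, str]:
    """Return (chunks_xml, sources_json) with one group per sub-question, in order."""
    root = ET.Element("chunks")  # hypothetical root element
    for sq, chunks in zip(sub_questions, chunks_per_subq):
        # One <sub_q> wrapper per sub-question; the attribute name is hypothetical.
        group = ET.SubElement(root, "sub_q", {"question": sq})
        for text in chunks:
            ET.SubElement(group, "chunk").text = text
    chunks_xml = ET.tostring(root, encoding="unicode")
    # Outer list index matches the sub-question index.
    sources_json = json.dumps(sources_per_subq)
    return chunks_xml, sources_json
```

Keeping both structures in sub-question order lets index `i` of the sources list be matched to the `i`-th `<sub_q>` element when a history entry is reloaded.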
 ## ANTI-PATTERNS (THIS PROJECT)
 - Hardcode LLM URLs/keys — always `.env`
 - Business logic in routers — belongs in `services/`