From 091fa8444302696f4433cfd1145f8cd4792fff63 Mon Sep 17 00:00:00 2001
From: Woody
Date: Tue, 28 Apr 2026 16:43:38 +0800
Subject: [PATCH] docs: update Phase 5 plan with deferred/planned sub-phases

Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)

Co-authored-by: Sisyphus
---
 .plans/package5_enhancement_plan.md | 156 +++++++++++++++++++++++++++-
 1 file changed, 153 insertions(+), 3 deletions(-)

diff --git a/.plans/package5_enhancement_plan.md b/.plans/package5_enhancement_plan.md
index bb96267..b82133e 100644
--- a/.plans/package5_enhancement_plan.md
+++ b/.plans/package5_enhancement_plan.md
@@ -4,7 +4,7 @@
 **Scope**:
 - Phase 5.1: Replace manual JSON parsing in the decompose stage with LangChain `with_structured_output()`
 - Phase 5.2: Fix missing PDF links in citations and improve citation robustness
-**Status**: ✅ Complete — Both phases implemented (2026-04-28)
+**Status**: Phases 5.1 ✅, 5.2 ✅ — 5.3 Deferred, 5.4 Planned (2026-04-28)
 
 **LangChain version**: 1.2.15 (venv), `model_provider="openai"` with OpenRouter base URL (API-compatible proxy).
 
@@ -245,6 +245,156 @@ If `with_structured_output()` causes issues in production:
 
 ---
 
+---
+
+## Phase 5.3 — DOCX/TXT PDF Generation (DEFERRED)
+
+Generate per-chunk PDF files for DOCX/TXT documents at ingestion time so they have the same `chunk_file_path` → PDF viewer flow as PDF documents.
+
+**Status**: Deferred. Phase 5.2 fallback links (`/rag-database?document=xxx`) are sufficient. Revisit after Phase 5.4 if plain-text chunk views are still needed alongside highlighted views.
+
+---
+
+## Phase 5.4 — Sentence-Level Highlighting (PLANNED)
+
+### Problem
+
+When a user clicks a citation link to view a cited chunk, they see the full chunk text (up to ~1000 tokens). They have to manually scan to find which sentences actually drove the relevance. This is especially painful for long, dense chunks.
+
+### Solution
+
+**On-the-fly highlighted HTML chunk views** served by the backend. When a citation link is clicked, the frontend passes the sub-question that retrieved that chunk. The backend splits the chunk into sentences, computes embedding similarity of each sentence to the sub-question, and returns a styled HTML page with relevant sentences highlighted.
+
+### Why HTML, not PDF?
+
+| Approach | Complexity | Works for all doc types? | Preserves original formatting? |
+|---|---|---|---|
+| Highlighted HTML page | **Low** | ✅ Yes (uses chunk text) | ❌ Plain text only |
+| Highlighted PDF via reportlab | Medium | ✅ Yes (new PDF) | ❌ Plain text only |
+| Overlay highlights on existing PDF | High | ⚠️ PDF only | ✅ Yes |
+
+**Recommendation: HTML page.** Simple, fast, works uniformly for PDF/DOCX/TXT chunks. Original formatting is preserved in the existing PDF viewer (`chunk_file_path` link) — the highlighted HTML view is a **supplementary** view reached via a separate button/link. The two views coexist: "View Original PDF" vs "View Highlighted Text".
+
+### How It Works (No LLM Needed)
+
+```
+User clicks citation [NEC4 ACC, chunk 3]
+        │
+        ▼
+Frontend sends: GET /api/v1/chunks/highlight?document_id=abc&chunk_index=2&sub_question=...
+        │
+        ▼
+Backend:
+  1. Fetch chunk text from ChromaDB                          [chromadb get()]
+  2. Split into sentences                                    [nltk.sent_tokenize or regex]
+  3. Embed sub-question                                      [existing embedding model]
+  4. Embed each sentence (batch, parallel)                   [same model]
+  5. Compute cosine similarity per sentence vs sub-question  [numpy]
+  6. Return HTML with yellow background on sentences > threshold
+        │
+        ▼
+Frontend renders HTML in an iframe or new tab
+```
+
+### What Gets Highlighted
+
+```
+┌──────────────────────────────────────────────────────────┐
+│ Chunk: NEC4 ACC, page 12           [View Original PDF →] │
+├──────────────────────────────────────────────────────────┤
+│                                                          │
+│ The programme shall be prepared in a form acceptable to  │
+│ the Project Manager. It shall include:                   │
+│                                                          │
+│ ████████████████████████████████████████████████████████ │
+│ █ The starting date, access dates, and Key Dates.      █ │  ← High similarity
+│ ████████████████████████████████████████████████████████ │
+│                                                          │
+│ The Contractor shall submit a first programme within     │
+│ ████████████████████████████████████████████████████████ │
+│ █ two weeks of the starting date.                      █ │  ← High similarity
+│ ████████████████████████████████████████████████████████ │
+│                                                          │
+│ The Project Manager may instruct the Contractor to       │
+│ submit a revised programme showing the effects of a      │
+│ compensation event. This does not affect the Contractor's │
+│ right to be paid for preparing the programme.            │  ← Low similarity (no highlight)
+│                                                          │
+└──────────────────────────────────────────────────────────┘
+```
+
+### Key Design Decisions
+
+| # | Decision | Rationale |
+|---|---|---|
+| 1 | HTML page, not PDF | Zero dependency (`reportlab` not needed). Faster to generate. CSS-based highlighting is more flexible. Original PDF view remains available separately. |
+| 2 | Embedding similarity, not LLM | No API cost, no latency. The embedding model is already running. Cosine similarity is cheap. |
+| 3 | Sentence-level granularity | Paragraph-level is too coarse (whole paragraph might be dimly relevant). Word/phrase-level is too noisy. Sentences are the natural unit of meaning. |
+| 4 | Embed sentences in batch | A 1000-token chunk has ~8-12 sentences. One batch embedding call is fast (single API round-trip). |
+| 5 | Configurable threshold (env var) | `HIGHLIGHT_SIMILARITY_THRESHOLD` (default 0.5). Tune per embedding model. |
+| 6 | Cache sentence embeddings per chunk | A chunk may be cited in multiple queries. Cache sentence embeddings in ChromaDB metadata or SQLite to avoid recomputation. |
+| 7 | Graceful degradation | If embedding fails → return plain text chunk view. If sentence splitting fails → highlight entire chunk. |
+| 8 | Frontend: "View Highlighted" link alongside "View PDF" | The existing PDF viewer link (`chunk_file_path`) stays. A second link opens the highlighted HTML view. Both visible, user chooses. |
+
+### Implementation Tasks
+
+#### Task 5.4.1: Backend — Sentence splitting utility
+
+- [ ] Create `backend/app/utils/sentence_splitter.py`
+- [ ] Function `split_sentences(text: str) -> list[dict]` returns `[{text, start_char, end_char}, ...]`
+- [ ] Use `nltk.sent_tokenize` with fallback to regex (`re.split(r'(?<=[.!?])\s+')`)
+- [ ] NLTK punkt data auto-downloaded on first use (or bundled)
+- [ ] Handle edge cases: empty text, single sentence, lists/bullets
+- **Test file**: `test_phase5_sentence_splitter.py`
+
+#### Task 5.4.2: Backend — Highlighted chunk endpoint
+
+- [ ] New endpoint: `GET /api/v1/chunks/highlight`
+- [ ] Query params: `document_id`, `chunk_index`, `sub_question`
+- [ ] Returns `text/html` (not JSON)
+- [ ] Logic in `backend/app/services/chunk_highlight_service.py`:
+  1. Fetch chunk from ChromaDB by `document_id` + `chunk_index`
+  2. Split into sentences via `split_sentences()`
+  3. Get embedding for `sub_question` via existing embedding model
+  4. Get embeddings for all sentences in one batch call
+  5. Compute cosine similarity: `np.dot(q_emb, s_emb) / (norm(q) * norm(s))`
+  6. Mark sentences with similarity > threshold as highlighted
+  7. Render HTML template with inline CSS (yellow background, subtle border)
+- **Test file**: `test_phase5_chunk_highlight.py`
+
+#### Task 5.4.3: Frontend — "View Highlighted" link in citations and sources
+
+- [ ] In `citationParser.ts` and `ResponsePanel.tsx`, add a "🔍" or "View Highlighted" link next to each source
+- [ ] Link target: `/api/v1/chunks/highlight?document_id=...&chunk_index=...&sub_question=...`
+- [ ] The sub-question is the one that retrieved this chunk (already available in the sources structure: `source.sub_question_index` → look up sub-question text)
+- [ ] Open in new tab or modal
+- **Test file**: Update `citationParser.test.ts` and `ResponsePanel.test.tsx`
+
+#### Task 5.4.4: Integration testing
+
+- [ ] Verify highlight endpoint returns 200 with valid HTML for all doc types (PDF, DOCX, TXT)
+- [ ] Verify sentence highlighting is proportional to relevance (spot-check manually)
+- [ ] Verify caching works (second request for same chunk is faster)
+- [ ] Verify graceful degradation (embedding API down → plain text still served)
+- [ ] Run full test suite
+
+### Test Files
+
+| # | Test File | Coverage |
+|---|-----------|----------|
+| T5.4.1 | `backend/app/test/test_phase5_sentence_splitter.py` | Sentence splitting: English, mixed punctuation, empty, single sentence, bullet lists |
+| T5.4.2 | `backend/app/test/test_phase5_chunk_highlight.py` | Highlight endpoint: valid request → HTML with highlights, threshold filtering, no sentences above threshold → all plain, missing document/chunk → 404, embedding failure → fallback plain text |
+| T5.4.3 | `frontend/src/test/utils/citationParser.test.ts` (update) | Citation links include highlight URL when sub-question context available |
+| T5.4.4 | `frontend/src/test/components/ResponsePanel.test.tsx` (update) | Sources section renders "View Highlighted" link alongside "View PDF" |
+
+### Acceptance Tests
+
+| # | Test File | Coverage |
+|---|-----------|----------|
+| AT5.4.1 | `backend/app/test/acceptance/test_acceptance_phase5_highlight.py` | Real LLM query → real embeddings → open highlighted view → verify yellow spans exist on relevant sentences |
+
+---
+
 ## Commit Plan
 
 | Commit | Message | Scope |
@@ -254,5 +404,5 @@ If `with_structured_output()` causes issues in production:
 | 3 | `feat: refactor QueryDecomposer to use structured output with fallback` | Task 5.1.4 + tests |
 | 4 | `chore: update decompose seed prompt for structured output` | Task 5.1.5 |
 | 5 | `feat: add document_id to SourceMetadata model` | Task 5.2.1 + tests |
-| 6 | `feat: generate PDFs for DOCX/TXT documents on ingest` | Task 5.2.2 + tests |
-| 7 | `feat: fuzzy citation matching and document fallback links` | Tasks 5.2.3 + 5.2.4 + tests |
+| 6 | `feat: fuzzy citation matching and document fallback links` | Tasks 5.2.3 + 5.2.4 + tests |
+| 7 | `feat: sentence-level chunk highlighting via embedding similarity` | Phase 5.4 (all tasks) |
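
For illustration only, steps 2 to 6 of the Phase 5.4 flow (split, embed, cosine similarity, threshold, render) could look roughly like the sketch below. It is not part of the patch. The names `render_highlighted_html`, `cosine_similarity`, and `toy_embed` are hypothetical, the `embed` callable stands in for the existing embedding model (whose real interface the plan does not specify), and only the regex fallback splitter is shown rather than `nltk.sent_tokenize`.

```python
import html
import math
import os
import re

# Threshold name and default come from the plan; reading it here is an assumption.
DEFAULT_THRESHOLD = float(os.environ.get("HIGHLIGHT_SIMILARITY_THRESHOLD", "0.5"))


def split_sentences(text: str) -> list[dict]:
    """Regex fallback splitter returning [{text, start_char, end_char}, ...]."""
    spans, start = [], 0
    for match in re.finditer(r"(?<=[.!?])\s+", text):
        spans.append((start, match.start()))
        start = match.end()
    spans.append((start, len(text)))
    return [
        {"text": text[a:b], "start_char": a, "end_char": b}
        for a, b in spans
        if text[a:b].strip()
    ]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Plain-Python cosine similarity; returns 0.0 for a zero vector."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def render_highlighted_html(chunk_text, sub_question, embed, threshold=DEFAULT_THRESHOLD):
    """Return an HTML paragraph with sentences above the threshold wrapped in <mark>."""
    sentences = split_sentences(chunk_text)
    plain = f"<p>{html.escape(chunk_text)}</p>"
    if not sentences:
        return plain
    try:
        # One batch call: the sub-question first, then every sentence.
        vectors = embed([sub_question] + [s["text"] for s in sentences])
    except Exception:
        return plain  # graceful degradation: serve the plain chunk text
    question_vec, sentence_vecs = vectors[0], vectors[1:]
    parts = []
    for sentence, vec in zip(sentences, sentence_vecs):
        escaped = html.escape(sentence["text"])
        if cosine_similarity(question_vec, vec) > threshold:
            parts.append(f'<mark style="background: yellow">{escaped}</mark>')
        else:
            parts.append(escaped)
    return "<p>" + " ".join(parts) + "</p>"


def toy_embed(texts: list[str]) -> list[list[float]]:
    """Stand-in bag-of-words embedder for the demo; NOT the real embedding model."""
    tokens = [re.findall(r"[a-z]+", t.lower()) for t in texts]
    vocab = sorted({w for words in tokens for w in words})
    return [[float(words.count(w)) for w in vocab] for words in tokens]


if __name__ == "__main__":
    chunk = "The Contractor submits a programme. Payment is due monthly."
    question = "When does the Contractor submit a programme?"
    print(render_highlighted_html(chunk, question, toy_embed, threshold=0.5))
```

Passing the embedder in as a callable keeps the similarity and rendering logic testable without a live embedding API, which also makes the "embedding failure → plain text" path easy to exercise.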