From 091fa8444302696f4433cfd1145f8cd4792fff63 Mon Sep 17 00:00:00 2001
From: Woody <woody.ck.tse@gmail.com>
Date: Tue, 28 Apr 2026 16:43:38 +0800
Subject: [PATCH] docs: update Phase 5 plan with deferred/planned sub-phases

Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
---
 .plans/package5_enhancement_plan.md | 156 +++++++++++++++++++++++++++-
 1 file changed, 153 insertions(+), 3 deletions(-)

diff --git a/.plans/package5_enhancement_plan.md b/.plans/package5_enhancement_plan.md
index bb96267..b82133e 100644
--- a/.plans/package5_enhancement_plan.md
+++ b/.plans/package5_enhancement_plan.md
@@ -4,7 +4,7 @@
 **Scope**: 
 - Phase 5.1: Replace manual JSON parsing in the decompose stage with LangChain `with_structured_output()`
 - Phase 5.2: Fix missing PDF links in citations and improve citation robustness
-**Status**: ✅ Complete — Both phases implemented (2026-04-28)
+**Status**: Phases 5.1 ✅, 5.2 ✅ — 5.3 Deferred, 5.4 Planned (2026-04-28)
 
 **LangChain version**: 1.2.15 (venv), `model_provider="openai"` with OpenRouter base URL (API-compatible proxy).
 
@@ -245,6 +245,156 @@ If `with_structured_output()` causes issues in production:
 
 ---
 
+---
+
+## Phase 5.3 — DOCX/TXT PDF Generation (DEFERRED)
+
+Generate per-chunk PDF files for DOCX/TXT documents at ingestion time so they have the same `chunk_file_path` → PDF viewer flow as PDF documents.
+
+**Status**: Deferred. Phase 5.2 fallback links (`/rag-database?document=xxx`) are sufficient. Revisit after Phase 5.4 if plain-text chunk views are still needed alongside highlighted views.
+
+---
+
+## Phase 5.4 — Sentence-Level Highlighting (PLANNED)
+
+### Problem
+
+When a user clicks a citation link to view a cited chunk, they see the full chunk text (up to ~1000 tokens). They must scan it manually to find the sentences that made the chunk relevant. This is especially painful for long, dense chunks.
+
+### Solution
+
+**On-the-fly highlighted HTML chunk views** served by the backend. When a citation link is clicked, the frontend passes the sub-question that retrieved that chunk. The backend splits the chunk into sentences, computes embedding similarity of each sentence to the sub-question, and returns a styled HTML page with relevant sentences highlighted.
+
+### Why HTML, not PDF?
+
+| Approach | Complexity | Works for all doc types? | Preserves original formatting? |
+|---|---|---|---|
+| Highlighted HTML page | **Low** | ✅ Yes (uses chunk text) | ❌ Plain text only |
+| Highlighted PDF via reportlab | Medium | ✅ Yes (new PDF) | ❌ Plain text only |
+| Overlay highlights on existing PDF | High | ⚠️ PDF only | ✅ Yes |
+
+**Recommendation: HTML page.** Simple, fast, works uniformly for PDF/DOCX/TXT chunks. Original formatting is preserved in the existing PDF viewer (`chunk_file_path` link) — the highlighted HTML view is a **supplementary** view reached via a separate button/link. The two views coexist: "View Original PDF" vs "View Highlighted Text".
+
+### How It Works (No LLM Needed)
+
+```
+User clicks citation [NEC4 ACC, chunk 3]
+       │
+       ▼
+Frontend sends: GET /api/v1/chunks/highlight?document_id=abc&chunk_index=2&sub_question=...
+       │
+       ▼
+Backend:
+  1. Fetch chunk text from ChromaDB                          [chromadb get()]
+  2. Split into sentences                                    [nltk.sent_tokenize or regex]
+  3. Embed sub-question                                      [existing embedding model]
+  4. Embed each sentence (batch, parallel)                   [same model]
+  5. Compute cosine similarity per sentence vs sub-question  [numpy]
+  6. Return HTML with yellow background on sentences > threshold
+       │
+       ▼
+Frontend renders HTML in an iframe or new tab
+```
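+
+Steps 5 and 6 of the flow above can be sketched as follows. This is a minimal illustration, not the planned implementation: the plan names numpy for this step, while the sketch uses pure Python so it stays self-contained, and the function names `cosine_similarity` and `flag_sentences` are hypothetical.
+
+```python
+import math
+
+
+def cosine_similarity(a, b):
+    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
+    dot = sum(x * y for x, y in zip(a, b))
+    norm_a = math.sqrt(sum(x * x for x in a))
+    norm_b = math.sqrt(sum(y * y for y in b))
+    if norm_a == 0.0 or norm_b == 0.0:
+        return 0.0
+    return dot / (norm_a * norm_b)
+
+
+def flag_sentences(question_emb, sentence_embs, threshold=0.5):
+    """Return one boolean per sentence: True when similarity exceeds the threshold."""
+    return [cosine_similarity(question_emb, s) > threshold for s in sentence_embs]
+
+
+# Toy 2-D vectors stand in for real embeddings (assumption for illustration).
+flags = flag_sentences([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
+```
+
+The zero-vector guard avoids a division by zero if the embedding model ever returns an empty or all-zero vector for a sentence.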
+
+### What Gets Highlighted
+
+```
+┌──────────────────────────────────────────────────────────┐
+│ Chunk: NEC4 ACC, page 12          [View Original PDF →]  │
+├──────────────────────────────────────────────────────────┤
+│                                                            │
+│ The programme shall be prepared in a form acceptable to   │
+│ the Project Manager. It shall include:                    │
+│                                                            │
+│ ████████████████████████████████████████████████████████ │
+│ █ The starting date, access dates, and Key Dates.       █ │  ← High similarity
+│ ████████████████████████████████████████████████████████ │
+│                                                            │
+│ ████████████████████████████████████████████████████████ │
+│ █ The Contractor shall submit a first programme within  █ │
+│ █ two weeks of the starting date.                       █ │  ← High similarity
+│ ████████████████████████████████████████████████████████ │
+│                                                            │
+│ The Project Manager may instruct the Contractor to        │
+│ submit a revised programme showing the effects of a       │
+│ compensation event. This does not affect the Contractor's │
+│ right to be paid for preparing the programme.             │  ← Low similarity (no highlight)
+│                                                            │
+└──────────────────────────────────────────────────────────┘
+```
+
+### Key Design Decisions
+
+| # | Decision | Rationale |
+|---|---|---|
+| 1 | HTML page, not PDF | No extra dependency (`reportlab` is not needed). Faster to generate. CSS-based highlighting is more flexible. Original PDF view remains available separately. |
+| 2 | Embedding similarity, not LLM | No LLM API cost and far lower latency than an LLM call. The embedding model is already in use, and cosine similarity is cheap to compute. |
+| 3 | Sentence-level granularity | Paragraph-level is too coarse (a whole paragraph might be only marginally relevant). Word/phrase-level is too noisy. Sentences are the natural unit of meaning. |
+| 4 | Embed sentences in batch | A 1000-token chunk has ~8-12 sentences. One batch embedding call is fast (single API round-trip). |
+| 5 | Configurable threshold (env var) | `HIGHLIGHT_SIMILARITY_THRESHOLD` (default 0.5). Tune per embedding model. |
+| 6 | Cache sentence embeddings per chunk | A chunk may be cited in multiple queries. Cache sentence embeddings in ChromaDB metadata or SQLite to avoid recomputation. |
+| 7 | Graceful degradation | If embedding fails → return plain text chunk view. If sentence splitting fails → highlight entire chunk. |
+| 8 | Frontend: "View Highlighted" link alongside "View PDF" | The existing PDF viewer link (`chunk_file_path`) stays. A second link opens the highlighted HTML view. Both visible, user chooses. |
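+
+Decisions 5 and 6 could look roughly like the sketch below. Only the variable name `HIGHLIGHT_SIMILARITY_THRESHOLD` and the 0.5 default come from this plan; `get_highlight_threshold` and `SentenceEmbeddingCache` are hypothetical names, and the in-memory dict is a stand-in for the ChromaDB metadata or SQLite store that decision 6 mentions.
+
+```python
+import os
+
+DEFAULT_THRESHOLD = 0.5  # default stated in decision 5
+
+
+def get_highlight_threshold(env=None):
+    """Read HIGHLIGHT_SIMILARITY_THRESHOLD, falling back to the default on bad input."""
+    env = os.environ if env is None else env
+    raw = env.get("HIGHLIGHT_SIMILARITY_THRESHOLD")
+    if raw is None:
+        return DEFAULT_THRESHOLD
+    try:
+        return float(raw)
+    except ValueError:
+        return DEFAULT_THRESHOLD
+
+
+class SentenceEmbeddingCache:
+    """In-memory stand-in for the persistent cache; keyed by (document_id, chunk_index)."""
+
+    def __init__(self):
+        self._store = {}
+
+    def get_or_compute(self, document_id, chunk_index, compute):
+        """Return cached sentence embeddings, calling compute() only on a miss."""
+        key = (document_id, chunk_index)
+        if key not in self._store:
+            self._store[key] = compute()
+        return self._store[key]
+```
+
+Keying the cache on the chunk alone, and not on the sub-question, is what lets a second query that cites the same chunk reuse the sentence embeddings.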
+
+### Implementation Tasks
+
+#### Task 5.4.1: Backend — Sentence splitting utility
+
+- [ ] Create `backend/app/utils/sentence_splitter.py`
+- [ ] Function `split_sentences(text: str) -> list[dict]` returns `[{text, start_char, end_char}, ...]`
+- [ ] Use `nltk.sent_tokenize` with fallback to regex (`re.split(r'(?<=[.!?])\s+', text)`)
+- [ ] Auto-download NLTK Punkt data on first use, or bundle it (`punkt`, or `punkt_tab` on recent NLTK releases)
+- [ ] Handle edge cases: empty text, single sentence, lists/bullets
+- **Test file**: `test_phase5_sentence_splitter.py`
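+
+A minimal sketch of the regex fallback path only, assuming the return shape listed above. It uses `finditer` on the same boundary pattern instead of `re.split` so that character offsets can be tracked; the `nltk.sent_tokenize` path and bullet-list handling are left out, and the helper `_append` is a hypothetical name.
+
+```python
+import re
+
+# A sentence ends at ".", "!" or "?" followed by whitespace (regex fallback only).
+_BOUNDARY = re.compile(r"(?<=[.!?])\s+")
+
+
+def _append(sentences, text, start, end):
+    """Trim surrounding whitespace from text[start:end] and skip empty spans."""
+    segment = text[start:end]
+    stripped = segment.strip()
+    if not stripped:
+        return
+    offset = start + (len(segment) - len(segment.lstrip()))
+    sentences.append(
+        {"text": stripped, "start_char": offset, "end_char": offset + len(stripped)}
+    )
+
+
+def split_sentences(text):
+    """Return [{'text', 'start_char', 'end_char'}, ...] with offsets into text."""
+    sentences = []
+    start = 0
+    for match in _BOUNDARY.finditer(text):
+        _append(sentences, text, start, match.start())
+        start = match.end()
+    _append(sentences, text, start, len(text))
+    return sentences
+```
+
+Each returned span satisfies `text[start_char:end_char] == sentence["text"]`, so the renderer can wrap highlights around the original chunk text without re-searching for the sentence.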
+
+#### Task 5.4.2: Backend — Highlighted chunk endpoint
+
+- [ ] New endpoint: `GET /api/v1/chunks/highlight`
+- [ ] Query params: `document_id`, `chunk_index`, `sub_question`
+- [ ] Returns `text/html` (not JSON)
+- [ ] Logic in `backend/app/services/chunk_highlight_service.py`:
+  1. Fetch chunk from ChromaDB by `document_id` + `chunk_index`
+  2. Split into sentences via `split_sentences()`
+  3. Get embedding for `sub_question` via existing embedding model
+  4. Get embeddings for all sentences in one batch call
+  5. Compute cosine similarity: `np.dot(q_emb, s_emb) / (norm(q_emb) * norm(s_emb))`
+  6. Mark sentences with similarity > threshold as highlighted
+  7. Render HTML template with inline CSS (yellow background, subtle border)
+- **Test file**: `test_phase5_chunk_highlight.py`
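+
+Steps 6 and 7, plus the plain-text fallback from the design decisions, might be rendered along these lines. This is a sketch under assumptions: the function names, the `<mark>` element, and the `#fff59d` colour are illustrative choices, not part of the plan. The one point that should carry over is escaping the chunk text before it goes into HTML.
+
+```python
+from html import escape
+
+
+def render_highlighted_html(sentences, scores, threshold=0.5):
+    """Build a minimal HTML page, wrapping sentences that score above threshold in <mark>."""
+    parts = []
+    for sentence, score in zip(sentences, scores):
+        safe = escape(sentence)  # chunk text is untrusted input; never emit it raw
+        if score > threshold:
+            parts.append(f'<mark style="background: #fff59d;">{safe}</mark>')
+        else:
+            parts.append(safe)
+    body = " ".join(parts)
+    return f"<!DOCTYPE html><html><body><p>{body}</p></body></html>"
+
+
+def render_plain_html(chunk_text):
+    """Fallback view: the escaped chunk text with no highlights."""
+    return f"<!DOCTYPE html><html><body><p>{escape(chunk_text)}</p></body></html>"
+```
+
+If the embedding call raises, the service can return `render_plain_html(chunk_text)` with a 200 status, which matches the "embedding failure → fallback plain text" case in the test table.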
+
+#### Task 5.4.3: Frontend — "View Highlighted" link in citations and sources
+
+- [ ] In `citationParser.ts` and `ResponsePanel.tsx`, add a "🔍" or "View Highlighted" link next to each source
+- [ ] Link target: `/api/v1/chunks/highlight?document_id=...&chunk_index=...&sub_question=...` (URL-encode `sub_question`, e.g. with `encodeURIComponent`)
+- [ ] The sub-question is the one that retrieved this chunk (already available in the sources structure: `source.sub_question_index` → look up sub-question text)
+- [ ] Open in new tab or modal
+- **Test file**: Update `citationParser.test.ts` and `ResponsePanel.test.tsx`
+
+#### Task 5.4.4: Integration testing
+
+- [ ] Verify highlight endpoint returns 200 with valid HTML for all doc types (PDF, DOCX, TXT)
+- [ ] Verify highlighted sentences are the ones relevant to the sub-question (spot-check manually)
+- [ ] Verify caching works (second request for same chunk is faster)
+- [ ] Verify graceful degradation (embedding API down → plain text still served)
+- [ ] Run full test suite
+
+### Test Files
+
+| # | Test File | Coverage |
+|---|-----------|----------|
+| T5.4.1 | `backend/app/test/test_phase5_sentence_splitter.py` | Sentence splitting: English, mixed punctuation, empty, single sentence, bullet lists |
+| T5.4.2 | `backend/app/test/test_phase5_chunk_highlight.py` | Highlight endpoint: valid request → HTML with highlights, threshold filtering, no sentences above threshold → all plain, missing document/chunk → 404, embedding failure → fallback plain text |
+| T5.4.3 | `frontend/src/test/utils/citationParser.test.ts` (update) | Citation links include highlight URL when sub-question context available |
+| T5.4.4 | `frontend/src/test/components/ResponsePanel.test.tsx` (update) | Sources section renders "View Highlighted" link alongside "View PDF" |
+
+### Acceptance Tests
+
+| # | Test File | Coverage |
+|---|-----------|----------|
+| AT5.4.1 | `backend/app/test/acceptance/test_acceptance_phase5_highlight.py` | Real LLM query → real embeddings → open highlighted view → verify yellow spans exist on relevant sentences |
+
+---
+
 ## Commit Plan
 
 | Commit | Message | Scope |
@@ -254,5 +404,5 @@ If `with_structured_output()` causes issues in production:
 | 3 | `feat: refactor QueryDecomposer to use structured output with fallback` | Task 5.1.4 + tests |
 | 4 | `chore: update decompose seed prompt for structured output` | Task 5.1.5 |
 | 5 | `feat: add document_id to SourceMetadata model` | Task 5.2.1 + tests |
-| 6 | `feat: generate PDFs for DOCX/TXT documents on ingest` | Task 5.2.2 + tests |
-| 7 | `feat: fuzzy citation matching and document fallback links` | Tasks 5.2.3 + 5.2.4 + tests |
+| 6 | `feat: fuzzy citation matching and document fallback links` | Tasks 5.2.3 + 5.2.4 + tests |
+| 7 | `feat: sentence-level chunk highlighting via embedding similarity` | Phase 5.4 (all tasks) |