docs: update Phase 5 plan with deferred/planned sub-phases

Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)
Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
2026-04-28 16:43:38 +08:00 · 091fa84443
parent c43cb372e9
commit 091fa84443
1 changed file with 153 additions and 3 deletions
--- a/.plans/package5_enhancement_plan.md
+++ b/.plans/package5_enhancement_plan.md
@@ -4,7 +4,7 @@
 **Scope**: 
 - Phase 5.1: Replace manual JSON parsing in the decompose stage with LangChain `with_structured_output()`
 - Phase 5.2: Fix missing PDF links in citations and improve citation robustness
-**Status**: ✅ Complete — Both phases implemented (2026-04-28)
+**Status**: Phases 5.1 ✅, 5.2 ✅ — 5.3 Deferred, 5.4 Planned (2026-04-28)
 **LangChain version**: 1.2.15 (venv), `model_provider="openai"` with OpenRouter base URL (API-compatible proxy).
@@ -245,6 +245,156 @@ If `with_structured_output()` causes issues in production:
 ---
 ---
 ## Phase 5.3 — DOCX/TXT PDF Generation (DEFERRED)
 Generate per-chunk PDF files for DOCX/TXT documents at ingestion time so they have the same `chunk_file_path` → PDF viewer flow as PDF documents.
 **Status**: Deferred. Phase 5.2 fallback links (`/rag-database?document=xxx`) are sufficient. Revisit after Phase 5.4 if plain-text chunk views are still needed alongside highlighted views.
 ---
 ## Phase 5.4 — Sentence-Level Highlighting (PLANNED)
 ### Problem
 When a user clicks a citation link to view a cited chunk, they see the full chunk text (up to ~1000 tokens) and must scan it manually to find the sentences that actually made it relevant. This is especially painful for long, dense chunks.
 ### Solution
 **On-the-fly highlighted HTML chunk views** served by the backend. When a citation link is clicked, the frontend passes the sub-question that retrieved that chunk. The backend splits the chunk into sentences, computes embedding similarity of each sentence to the sub-question, and returns a styled HTML page with relevant sentences highlighted.
 ### Why HTML, not PDF?
 | Approach | Complexity | Works for all doc types? | Preserves original formatting? |
 |---|---|---|---|
 | Highlighted HTML page | **Low** | ✅ Yes (uses chunk text) | ❌ Plain text only |
 | Highlighted PDF via reportlab | Medium | ✅ Yes (new PDF) | ❌ Plain text only |
 | Overlay highlights on existing PDF | High | ⚠️ PDF only | ✅ Yes |
 **Recommendation: HTML page.** Simple, fast, works uniformly for PDF/DOCX/TXT chunks. Original formatting is preserved in the existing PDF viewer (`chunk_file_path` link) — the highlighted HTML view is a **supplementary** view reached via a separate button/link. The two views coexist: "View Original PDF" vs "View Highlighted Text".
 ### How It Works (No LLM Needed)
 ```
 User clicks citation [NEC4 ACC, chunk 3]
       │
       ▼
 Frontend sends: GET /api/v1/chunks/highlight?document_id=abc&chunk_index=2&sub_question=...
       │
       ▼
 Backend:
  1. Fetch chunk text from ChromaDB                          [chromadb get()]
  2. Split into sentences                                    [nltk.sent_tokenize or regex]
  3. Embed sub-question                                      [existing embedding model]
  4. Embed each sentence (batch, parallel)                   [same model]
  5. Compute cosine similarity per sentence vs sub-question  [numpy]
  6. Return HTML with yellow background on sentences > threshold
       │
       ▼
 Frontend renders HTML in an iframe or new tab
 ```
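
Steps 5 and 6 of the flow above can be sketched as follows. This is a minimal illustration, not the planned implementation: it uses plain Python instead of numpy so it runs anywhere, the helper names `cosine_similarity` and `score_sentences` are hypothetical, and the toy 2-D vectors stand in for real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def score_sentences(question_embedding, sentence_embeddings, threshold=0.5):
    """Return one (similarity, highlighted) pair per sentence, in input order."""
    results = []
    for embedding in sentence_embeddings:
        similarity = cosine_similarity(question_embedding, embedding)
        results.append((similarity, similarity > threshold))
    return results


# Toy vectors (assumption): real embeddings come from the existing model.
question = [1.0, 0.0]
sentences = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = score_sentences(question, sentences)
```

With real embeddings the same comparison would likely be vectorised with numpy, as the plan suggests; the per-sentence loop here only makes the thresholding step explicit.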
 ### What Gets Highlighted
 ```
 ┌──────────────────────────────────────────────────────────┐
 │ Chunk: NEC4 ACC, page 12          [View Original PDF →]  │
 ├──────────────────────────────────────────────────────────┤
 │                                                            │
 │ The programme shall be prepared in a form acceptable to   │
 │ the Project Manager. It shall include:                    │
 │                                                            │
 │ ████████████████████████████████████████████████████████ │
 │ █ The starting date, access dates, and Key Dates.       █ │  ← High similarity
 │ ████████████████████████████████████████████████████████ │
 │                                                            │
 │ The Contractor shall submit a first programme within      │
 │ ████████████████████████████████████████████████████████ │
 │ █ two weeks of the starting date.                       █ │  ← High similarity
 │ ████████████████████████████████████████████████████████ │
 │                                                            │
 │ The Project Manager may instruct the Contractor to        │
 │ submit a revised programme showing the effects of a       │
 │ compensation event. This does not affect the Contractor's │
 │ right to be paid for preparing the programme.             │  ← Low similarity (no highlight)
 │                                                            │
 └──────────────────────────────────────────────────────────┘
 ```
 ### Key Design Decisions
 | # | Decision | Rationale |
 |---|---|---|
 | 1 | HTML page, not PDF | No extra dependency (`reportlab` not needed). Faster to generate. CSS-based highlighting is more flexible. Original PDF view remains available separately. |
 | 2 | Embedding similarity, not LLM | No LLM API cost and no generation latency. The embedding model is already running, and cosine similarity is cheap to compute. |
 | 3 | Sentence-level granularity | Paragraph-level is too coarse (a whole paragraph might be only marginally relevant). Word/phrase-level is too noisy. Sentences are the natural unit of meaning. |
 | 4 | Embed sentences in batch | A 1000-token chunk has ~8-12 sentences. One batch embedding call is fast (single API round-trip). |
 | 5 | Configurable threshold (env var) | `HIGHLIGHT_SIMILARITY_THRESHOLD` (default 0.5). Tune per embedding model. |
 | 6 | Cache sentence embeddings per chunk | A chunk may be cited in multiple queries. Cache sentence embeddings in ChromaDB metadata or SQLite to avoid recomputation. |
 | 7 | Graceful degradation | If embedding fails → return plain text chunk view. If sentence splitting fails → highlight entire chunk. |
 | 8 | Frontend: "View Highlighted" link alongside "View PDF" | The existing PDF viewer link (`chunk_file_path`) stays. A second link opens the highlighted HTML view. Both visible, user chooses. |
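
Decision 5 could be read from the environment roughly like this. A sketch under stated assumptions: only the variable name `HIGHLIGHT_SIMILARITY_THRESHOLD` and the default of 0.5 come from the plan; the function name `load_highlight_threshold` and the choice to fall back to the default on invalid input are hypothetical.

```python
import os

DEFAULT_THRESHOLD = 0.5  # default stated in the plan


def load_highlight_threshold(environ=None):
    """Read the threshold from the environment, falling back to the default.

    Assumption: unparsable or out-of-range values are ignored rather than
    raising, so a bad setting never breaks the chunk view.
    """
    env = os.environ if environ is None else environ
    raw = env.get("HIGHLIGHT_SIMILARITY_THRESHOLD")
    if raw is None:
        return DEFAULT_THRESHOLD
    try:
        value = float(raw)
    except ValueError:
        return DEFAULT_THRESHOLD
    # Cosine similarity lies in [-1, 1]; anything else (including NaN) is rejected.
    if not -1.0 <= value <= 1.0:
        return DEFAULT_THRESHOLD
    return value
```

Passing `environ` explicitly keeps the function easy to test without touching the real process environment.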
 ### Implementation Tasks
 #### Task 5.4.1: Backend — Sentence splitting utility
 - [ ] Create `backend/app/utils/sentence_splitter.py`
 - [ ] Function `split_sentences(text: str) -> list[dict]` returns `[{text, start_char, end_char}, ...]`
 - [ ] Use `nltk.sent_tokenize` with fallback to regex (`re.split(r'(?<=[.!?])\s+', text)`)
 - [ ] NLTK punkt data auto-downloaded on first use (or bundled)
 - [ ] Handle edge cases: empty text, single sentence, lists/bullets
 - **Test file**: `test_phase5_sentence_splitter.py`
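
The regex fallback for Task 5.4.1 might look like the sketch below. It covers only the fallback path (no NLTK), and the helper `_append_span` is a hypothetical name; the return shape follows the `[{text, start_char, end_char}, ...]` contract listed above. It uses `finditer` rather than `re.split` so character offsets into the original text are preserved.

```python
import re

# Sentence boundary: whitespace that follows ., ! or ? (same pattern as the plan).
_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def _append_span(sentences, text, start, end):
    """Append text[start:end] as a sentence, trimmed of surrounding whitespace."""
    segment = text[start:end]
    stripped = segment.strip()
    if not stripped:
        return  # skip empty or whitespace-only segments
    lead = len(segment) - len(segment.lstrip())
    begin = start + lead
    sentences.append(
        {"text": stripped, "start_char": begin, "end_char": begin + len(stripped)}
    )


def split_sentences(text: str) -> list[dict]:
    """Split text into sentences with offsets into the original string."""
    sentences = []
    start = 0
    for match in _BOUNDARY.finditer(text):
        _append_span(sentences, text, start, match.start())
        start = match.end()
    _append_span(sentences, text, start, len(text))
    return sentences
```

Abbreviations such as "e.g." and numbered bullets will be over-split by this pattern, which is one reason the plan prefers `nltk.sent_tokenize` when it is available.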
 #### Task 5.4.2: Backend — Highlighted chunk endpoint
 - [ ] New endpoint: `GET /api/v1/chunks/highlight`
 - [ ] Query params: `document_id`, `chunk_index`, `sub_question`
 - [ ] Returns `text/html` (not JSON)
 - [ ] Logic in `backend/app/services/chunk_highlight_service.py`:
  1. Fetch chunk from ChromaDB by `document_id` + `chunk_index`
  2. Split into sentences via `split_sentences()`
  3. Get embedding for `sub_question` via existing embedding model
  4. Get embeddings for all sentences in one batch call
  5. Compute cosine similarity: `np.dot(q_emb, s_emb) / (np.linalg.norm(q_emb) * np.linalg.norm(s_emb))`
  6. Mark sentences with similarity > threshold as highlighted
  7. Render HTML template with inline CSS (yellow background, subtle border)
 - **Test file**: `test_phase5_chunk_highlight.py`
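
Step 7 of the service logic (rendering) could be sketched as below. The function name `render_highlighted_html`, the exact colours, and the page skeleton are assumptions for illustration; the plan only specifies inline CSS with a yellow background and a subtle border. Chunk text is escaped before being embedded, since it comes from user-supplied documents.

```python
from html import escape

# Inline style (assumed values): yellow background with a subtle border.
HIGHLIGHT_STYLE = "background-color: #fff3a3; border-bottom: 1px solid #e0c800;"


def render_highlighted_html(sentences, title="Chunk"):
    """Render (text, highlighted) pairs as a small standalone HTML page."""
    parts = []
    for text, highlighted in sentences:
        safe = escape(text)  # never trust chunk text as markup
        if highlighted:
            parts.append(f'<span style="{HIGHLIGHT_STYLE}">{safe}</span>')
        else:
            parts.append(safe)
    body = " ".join(parts)
    return (
        "<!DOCTYPE html>\n"
        f'<html><head><meta charset="utf-8"><title>{escape(title)}</title></head>\n'
        f"<body><p>{body}</p></body></html>"
    )


page = render_highlighted_html(
    [("Key Dates <b>.", True), ("Other & more.", False)]
)
```

If no sentence clears the threshold, every pair carries `False` and the same function yields the plain-text view, which matches the graceful-degradation decision.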
 #### Task 5.4.3: Frontend — "View Highlighted" link in citations and sources
 - [ ] In `citationParser.ts` and `ResponsePanel.tsx`, add a "🔍" or "View Highlighted" link next to each source
 - [ ] Link target: `/api/v1/chunks/highlight?document_id=...&chunk_index=...&sub_question=...`
 - [ ] The sub-question is the one that retrieved this chunk (already available in the sources structure: `source.sub_question_index` → look up sub-question text)
 - [ ] Open in new tab or modal
 - **Test file**: Update `citationParser.test.ts` and `ResponsePanel.test.tsx`
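
The link target needs the sub-question URL-encoded, since it is free text that may contain `?`, `&`, or spaces. The sketch below shows the expected encoding in Python for use in backend tests; `build_highlight_url` is a hypothetical helper, and the frontend would do the equivalent with `URLSearchParams` or `encodeURIComponent`.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_highlight_url(document_id, chunk_index, sub_question,
                        base="/api/v1/chunks/highlight"):
    """Build the highlight link with all query parameters percent-encoded."""
    query = urlencode(
        {
            "document_id": document_id,
            "chunk_index": chunk_index,
            "sub_question": sub_question,
        }
    )
    return f"{base}?{query}"


url = build_highlight_url("abc", 2, "What is a Key Date?")
# Round-trip check: the backend should recover the original sub-question.
params = parse_qs(urlsplit(url).query)
```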
 #### Task 5.4.4: Integration testing
 - [ ] Verify highlight endpoint returns 200 with valid HTML for all doc types (PDF, DOCX, TXT)
 - [ ] Verify highlighted sentences match the most relevant ones (spot-check manually)
 - [ ] Verify caching works (second request for same chunk is faster)
 - [ ] Verify graceful degradation (embedding API down → plain text still served)
 - [ ] Run full test suite
 ### Test Files
 | # | Test File | Coverage |
 |---|-----------|----------|
 | T5.4.1 | `backend/app/test/test_phase5_sentence_splitter.py` | Sentence splitting: English, mixed punctuation, empty, single sentence, bullet lists |
 | T5.4.2 | `backend/app/test/test_phase5_chunk_highlight.py` | Highlight endpoint: valid request → HTML with highlights, threshold filtering, no sentences above threshold → all plain, missing document/chunk → 404, embedding failure → fallback plain text |
 | T5.4.3 | `frontend/src/test/utils/citationParser.test.ts` (update) | Citation links include highlight URL when sub-question context available |
 | T5.4.4 | `frontend/src/test/components/ResponsePanel.test.tsx` (update) | Sources section renders "View Highlighted" link alongside "View PDF" |
 ### Acceptance Tests
 | # | Test File | Coverage |
 |---|-----------|----------|
 | AT5.4.1 | `backend/app/test/acceptance/test_acceptance_phase5_highlight.py` | Real LLM query → real embeddings → open highlighted view → verify yellow spans exist on relevant sentences |
 ---
 ## Commit Plan
 | Commit | Message | Scope |
@@ -254,5 +404,5 @@ If `with_structured_output()` causes issues in production:
 | 3 | `feat: refactor QueryDecomposer to use structured output with fallback` | Task 5.1.4 + tests |
 | 4 | `chore: update decompose seed prompt for structured output` | Task 5.1.5 |
 | 5 | `feat: add document_id to SourceMetadata model` | Task 5.2.1 + tests |
-| 6 | `feat: generate PDFs for DOCX/TXT documents on ingest` | Task 5.2.2 + tests |
+| 6 | `feat: fuzzy citation matching and document fallback links` | Tasks 5.2.3 + 5.2.4 + tests |
-| 7 | `feat: fuzzy citation matching and document fallback links` | Tasks 5.2.3 + 5.2.4 + tests |
+| 7 | `feat: sentence-level chunk highlighting via embedding similarity` | Phase 5.4 (all tasks) |