docs: update Phase 5 plan with deferred/planned sub-phases

Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
Woody 2026-04-28 16:43:38 +08:00
parent c43cb372e9
commit 091fa84443
1 changed files with 153 additions and 3 deletions


@ -4,7 +4,7 @@
**Scope**: **Scope**:
- Phase 5.1: Replace manual JSON parsing in the decompose stage with LangChain `with_structured_output()` - Phase 5.1: Replace manual JSON parsing in the decompose stage with LangChain `with_structured_output()`
- Phase 5.2: Fix missing PDF links in citations and improve citation robustness - Phase 5.2: Fix missing PDF links in citations and improve citation robustness
**Status**: ✅ Complete — Both phases implemented (2026-04-28) **Status**: Phases 5.1 ✅, 5.2 ✅ — 5.3 Deferred, 5.4 Planned (2026-04-28)
**LangChain version**: 1.2.15 (venv), `model_provider="openai"` with OpenRouter base URL (API-compatible proxy). **LangChain version**: 1.2.15 (venv), `model_provider="openai"` with OpenRouter base URL (API-compatible proxy).
@ -245,6 +245,156 @@ If `with_structured_output()` causes issues in production:
--- ---
---
## Phase 5.3 — DOCX/TXT PDF Generation (DEFERRED)
Generate per-chunk PDF files for DOCX/TXT documents at ingestion time so they have the same `chunk_file_path` → PDF viewer flow as PDF documents.
**Status**: Deferred. Phase 5.2 fallback links (`/rag-database?document=xxx`) are sufficient. Revisit after Phase 5.4 if plain-text chunk views are still needed alongside highlighted views.
---
## Phase 5.4 — Sentence-Level Highlighting (PLANNED)
### Problem
When a user clicks a citation link to view a cited chunk, they see the full chunk text (up to ~1000 tokens). They must scan it manually to find the sentences that actually made the chunk relevant. This is especially painful for long, dense chunks.
### Solution
**On-the-fly highlighted HTML chunk views** served by the backend. When a citation link is clicked, the frontend passes the sub-question that retrieved that chunk. The backend splits the chunk into sentences, computes embedding similarity of each sentence to the sub-question, and returns a styled HTML page with relevant sentences highlighted.
### Why HTML, not PDF?
| Approach | Complexity | Works for all doc types? | Preserves original formatting? |
|---|---|---|---|
| Highlighted HTML page | **Low** | ✅ Yes (uses chunk text) | ❌ Plain text only |
| Highlighted PDF via reportlab | Medium | ✅ Yes (new PDF) | ❌ Plain text only |
| Overlay highlights on existing PDF | High | ⚠️ PDF only | ✅ Yes |
**Recommendation: HTML page.** Simple, fast, works uniformly for PDF/DOCX/TXT chunks. Original formatting is preserved in the existing PDF viewer (`chunk_file_path` link) — the highlighted HTML view is a **supplementary** view reached via a separate button/link. The two views coexist: "View Original PDF" vs "View Highlighted Text".
### How It Works (No LLM Needed)
```
User clicks citation [NEC4 ACC, chunk 3]
Frontend sends: GET /api/v1/chunks/highlight?document_id=abc&chunk_index=2&sub_question=...
Backend:
1. Fetch chunk text from ChromaDB [chromadb get()]
2. Split into sentences [nltk.sent_tokenize or regex]
3. Embed sub-question [existing embedding model]
4. Embed each sentence (batch, parallel) [same model]
5. Compute cosine similarity per sentence vs sub-question [numpy]
6. Return HTML with a yellow background on sentences whose similarity exceeds the threshold
Frontend renders HTML in an iframe or new tab
```
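Steps 5 and 6 of the flow above can be sketched as follows. This is a minimal illustration, assuming embeddings arrive as plain lists of floats; the function names `cosine_similarities` and `select_highlights` are hypothetical and not part of the existing codebase.

```python
import numpy as np


def cosine_similarities(question_emb, sentence_embs):
    """Cosine similarity of one question vector against each sentence vector."""
    q = np.asarray(question_emb, dtype=float)
    s = np.asarray(sentence_embs, dtype=float)
    if s.size == 0:
        return np.zeros(0)
    # One norm per sentence row, multiplied by the question norm.
    denom = np.linalg.norm(q) * np.linalg.norm(s, axis=1)
    # Avoid dividing by zero for all-zero vectors; those get similarity 0.
    safe = np.where(denom == 0, 1.0, denom)
    sims = (s @ q) / safe
    sims[denom == 0] = 0.0
    return sims


def select_highlights(sims, threshold=0.5):
    """Indices of sentences whose similarity is strictly above the threshold."""
    return [i for i, value in enumerate(sims) if value > threshold]
```

Computing all similarities in one matrix-vector product keeps the per-request cost small compared with the embedding call itself.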
### What Gets Highlighted
```
┌──────────────────────────────────────────────────────────┐
│ Chunk: NEC4 ACC, page 12 [View Original PDF →] │
├──────────────────────────────────────────────────────────┤
│ │
│ The programme shall be prepared in a form acceptable to │
│ the Project Manager. It shall include: │
│ │
│ ████████████████████████████████████████████████████████ │
│ █ The starting date, access dates, and Key Dates. █ │ ← High similarity
│ ████████████████████████████████████████████████████████ │
│ │
│ ████████████████████████████████████████████████████████ │
│ █ The Contractor shall submit a first programme within █ │
│ █ two weeks of the starting date. █ │ ← High similarity
│ ████████████████████████████████████████████████████████ │
│ │
│ The Project Manager may instruct the Contractor to │
│ submit a revised programme showing the effects of a │
│ compensation event. This does not affect the Contractor's │
│ right to be paid for preparing the programme. │ ← Low similarity (no highlight)
│ │
└──────────────────────────────────────────────────────────┘
```
### Key Design Decisions
| # | Decision | Rationale |
|---|---|---|
| 1 | HTML page, not PDF | No extra dependency (`reportlab` not needed). Faster to generate. CSS-based highlighting is more flexible. The original PDF view remains available separately. |
| 2 | Embedding similarity, not LLM | No LLM API cost and typically much lower latency than an LLM call. The embedding model is already in use, and cosine similarity is cheap to compute. |
| 3 | Sentence-level granularity | Paragraph-level is too coarse (a whole paragraph may be only marginally relevant). Word/phrase-level is too noisy. Sentences are the natural unit of meaning. |
| 4 | Embed sentences in batch | A 1000-token chunk has ~8-12 sentences. One batch embedding call is fast (single API round-trip). |
| 5 | Configurable threshold (env var) | `HIGHLIGHT_SIMILARITY_THRESHOLD` (default 0.5). Tune per embedding model. |
| 6 | Cache sentence embeddings per chunk | A chunk may be cited in multiple queries. Cache sentence embeddings in ChromaDB metadata or SQLite to avoid recomputation. |
| 7 | Graceful degradation | If embedding fails → return plain text chunk view. If sentence splitting fails → highlight entire chunk. |
| 8 | Frontend: "View Highlighted" link alongside "View PDF" | The existing PDF viewer link (`chunk_file_path`) stays. A second link opens the highlighted HTML view. Both visible, user chooses. |
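Decision 6 could look roughly like the sketch below, which takes the SQLite option. The table name, the key format, and the `SentenceEmbeddingCache` and `get_or_compute` names are assumptions made for illustration; `embed_batch` stands in for whatever batch call the existing embedding model exposes.

```python
import hashlib
import json
import sqlite3


class SentenceEmbeddingCache:
    """Hypothetical per-chunk cache of sentence embeddings, stored as JSON."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS sentence_embeddings ("
            "chunk_key TEXT PRIMARY KEY, embeddings TEXT NOT NULL)"
        )

    @staticmethod
    def make_key(document_id, chunk_index, chunk_text):
        # Hashing the text invalidates the entry if the chunk is re-ingested.
        digest = hashlib.sha256(chunk_text.encode("utf-8")).hexdigest()
        return f"{document_id}:{chunk_index}:{digest}"

    def get(self, key):
        row = self.conn.execute(
            "SELECT embeddings FROM sentence_embeddings WHERE chunk_key = ?",
            (key,),
        ).fetchone()
        return json.loads(row[0]) if row else None

    def put(self, key, embeddings):
        self.conn.execute(
            "INSERT OR REPLACE INTO sentence_embeddings VALUES (?, ?)",
            (key, json.dumps(embeddings)),
        )
        self.conn.commit()


def get_or_compute(cache, key, sentences, embed_batch):
    """Return cached embeddings, calling embed_batch only on a cache miss."""
    cached = cache.get(key)
    if cached is not None:
        return cached
    embeddings = embed_batch(sentences)
    cache.put(key, embeddings)
    return embeddings
```

Only the sentence embeddings are cached here; the sub-question embedding differs per query, so it would still be computed on each request.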
### Implementation Tasks
#### Task 5.4.1: Backend — Sentence splitting utility
- [ ] Create `backend/app/utils/sentence_splitter.py`
- [ ] Function `split_sentences(text: str) -> list[dict]` returns `[{text, start_char, end_char}, ...]`
- [ ] Use `nltk.sent_tokenize` with fallback to regex (`re.split(r'(?<=[.!?])\s+', text)`)
- [ ] NLTK Punkt data downloaded automatically on first use, or bundled (note that newer NLTK releases look for the `punkt_tab` resource rather than `punkt`)
- [ ] Handle edge cases: empty text, single sentence, lists/bullets
- **Test file**: `test_phase5_sentence_splitter.py`
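A possible shape for the regex fallback in Task 5.4.1 is sketched below. It uses `re.finditer` instead of the `re.split` call named in the task, because `re.split` discards the character offsets that the return type requires. The helper `_append_span` and the pattern `_SENTENCE_END` are hypothetical names, and the pattern is deliberately naive: abbreviations such as "e.g." will be split incorrectly, which is one reason to prefer NLTK when it is available.

```python
import re

# A sentence ends at one or more of . ! ? followed by whitespace or end of text.
_SENTENCE_END = re.compile(r"[.!?]+(?=\s|$)")


def _append_span(sentences, text, start, end):
    """Append the trimmed text[start:end] with its offsets, skipping blanks."""
    segment = text[start:end]
    stripped = segment.strip()
    if not stripped:
        return
    begin = start + (len(segment) - len(segment.lstrip()))
    sentences.append(
        {"text": stripped, "start_char": begin, "end_char": begin + len(stripped)}
    )


def split_sentences(text: str) -> list[dict]:
    """Regex-only fallback: returns [{text, start_char, end_char}, ...]."""
    sentences = []
    start = 0
    for match in _SENTENCE_END.finditer(text):
        _append_span(sentences, text, start, match.end())
        start = match.end()
    # Keep any trailing text that has no terminating punctuation.
    _append_span(sentences, text, start, len(text))
    return sentences
```

Each returned span satisfies `text[start_char:end_char] == sentence["text"]`, so the offsets can be used to wrap the original chunk text without re-searching for each sentence.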
#### Task 5.4.2: Backend — Highlighted chunk endpoint
- [ ] New endpoint: `GET /api/v1/chunks/highlight`
- [ ] Query params: `document_id`, `chunk_index`, `sub_question`
- [ ] Returns `text/html` (not JSON)
- [ ] Logic in `backend/app/services/chunk_highlight_service.py`:
1. Fetch chunk from ChromaDB by `document_id` + `chunk_index`
2. Split into sentences via `split_sentences()`
3. Get embedding for `sub_question` via existing embedding model
4. Get embeddings for all sentences in one batch call
5. Compute cosine similarity: `np.dot(q_emb, s_emb) / (norm(q_emb) * norm(s_emb))`
6. Mark sentences with similarity > threshold as highlighted
7. Render HTML template with inline CSS (yellow background, subtle border)
- **Test file**: `test_phase5_chunk_highlight.py`
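Steps 6 and 7 of Task 5.4.2, together with the plain-text fallback, might be sketched like this. The function names `read_threshold` and `render_highlighted_html`, the exact CSS colour, and the use of `<mark>` are illustrative assumptions; only the `HIGHLIGHT_SIMILARITY_THRESHOLD` variable and its 0.5 default come from the plan.

```python
import html
import os


def read_threshold(default=0.5):
    """Read HIGHLIGHT_SIMILARITY_THRESHOLD, falling back to the default."""
    raw = os.environ.get("HIGHLIGHT_SIMILARITY_THRESHOLD")
    try:
        return float(raw) if raw is not None else default
    except ValueError:
        return default


def render_highlighted_html(sentences, similarities, threshold):
    """Render sentences as HTML, marking those above the threshold.

    Passing similarities=None (for example after an embedding failure)
    yields the plain-text view with no highlights.
    """
    if similarities is None:
        similarities = [float("-inf")] * len(sentences)
    parts = []
    for sentence, score in zip(sentences, similarities):
        # Escape chunk text so document content cannot inject markup.
        escaped = html.escape(sentence)
        if score > threshold:
            parts.append(f'<mark style="background: #fff3a3;">{escaped}</mark>')
        else:
            parts.append(f"<span>{escaped}</span>")
    body = " ".join(parts)
    return f"<!DOCTYPE html><html><body><p>{body}</p></body></html>"
```

Escaping matters here because the endpoint returns `text/html` built from document content that the service does not control.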
#### Task 5.4.3: Frontend — "View Highlighted" link in citations and sources
- [ ] In `citationParser.ts` and `ResponsePanel.tsx`, add a "🔍" or "View Highlighted" link next to each source
- [ ] Link target: `/api/v1/chunks/highlight?document_id=...&chunk_index=...&sub_question=...`
- [ ] The sub-question is the one that retrieved this chunk (already available in the sources structure: `source.sub_question_index` → look up sub-question text)
- [ ] Open in new tab or modal
- **Test file**: Update `citationParser.test.ts` and `ResponsePanel.test.tsx`
#### Task 5.4.4: Integration testing
- [ ] Verify highlight endpoint returns 200 with valid HTML for all doc types (PDF, DOCX, TXT)
- [ ] Verify the highlighted sentences are the ones most relevant to the sub-question (spot-check manually)
- [ ] Verify caching works (second request for same chunk is faster)
- [ ] Verify graceful degradation (embedding API down → plain text still served)
- [ ] Run full test suite
### Test Files
| # | Test File | Coverage |
|---|-----------|----------|
| T5.4.1 | `backend/app/test/test_phase5_sentence_splitter.py` | Sentence splitting: English, mixed punctuation, empty, single sentence, bullet lists |
| T5.4.2 | `backend/app/test/test_phase5_chunk_highlight.py` | Highlight endpoint: valid request → HTML with highlights, threshold filtering, no sentences above threshold → all plain, missing document/chunk → 404, embedding failure → fallback plain text |
| T5.4.3 | `frontend/src/test/utils/citationParser.test.ts` (update) | Citation links include highlight URL when sub-question context available |
| T5.4.4 | `frontend/src/test/components/ResponsePanel.test.tsx` (update) | Sources section renders "View Highlighted" link alongside "View PDF" |
### Acceptance Tests
| # | Test File | Coverage |
|---|-----------|----------|
| AT5.4.1 | `backend/app/test/acceptance/test_acceptance_phase5_highlight.py` | Real LLM query → real embeddings → open highlighted view → verify yellow spans exist on relevant sentences |
---
## Commit Plan ## Commit Plan
| Commit | Message | Scope | | Commit | Message | Scope |
@ -254,5 +404,5 @@ If `with_structured_output()` causes issues in production:
| 3 | `feat: refactor QueryDecomposer to use structured output with fallback` | Task 5.1.4 + tests | | 3 | `feat: refactor QueryDecomposer to use structured output with fallback` | Task 5.1.4 + tests |
| 4 | `chore: update decompose seed prompt for structured output` | Task 5.1.5 | | 4 | `chore: update decompose seed prompt for structured output` | Task 5.1.5 |
| 5 | `feat: add document_id to SourceMetadata model` | Task 5.2.1 + tests | | 5 | `feat: add document_id to SourceMetadata model` | Task 5.2.1 + tests |
| 6 | `feat: generate PDFs for DOCX/TXT documents on ingest` | Task 5.2.2 + tests | | 6 | `feat: fuzzy citation matching and document fallback links` | Tasks 5.2.3 + 5.2.4 + tests |
| 7 | `feat: fuzzy citation matching and document fallback links` | Tasks 5.2.3 + 5.2.4 + tests | | 7 | `feat: sentence-level chunk highlighting via embedding similarity` | Phase 5.4 (all tasks) |