docs: update Phase 5 plan with deferred/planned sub-phases

Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
parent c43cb372e9
commit 091fa84443

@@ -4,7 +4,7 @@
**Scope**:
- Phase 5.1: Replace manual JSON parsing in the decompose stage with LangChain `with_structured_output()`
- Phase 5.2: Fix missing PDF links in citations and improve citation robustness

-**Status**: ✅ Complete, both phases implemented (2026-04-28)
+**Status**: Phases 5.1 ✅, 5.2 ✅; 5.3 Deferred, 5.4 Planned (2026-04-28)

**LangChain version**: 1.2.15 (venv), `model_provider="openai"` with OpenRouter base URL (API-compatible proxy).

@@ -245,6 +245,156 @@ If `with_structured_output()` causes issues in production:

---

## Phase 5.3: DOCX/TXT PDF Generation (DEFERRED)

Generate per-chunk PDF files for DOCX/TXT documents at ingestion time so they follow the same `chunk_file_path` → PDF viewer flow as PDF documents.

**Status**: Deferred. The Phase 5.2 fallback links (`/rag-database?document=xxx`) are sufficient for now. Revisit after Phase 5.4 if plain-text chunk views are still needed alongside highlighted views.

---

## Phase 5.4: Sentence-Level Highlighting (PLANNED)

### Problem

When a user clicks a citation link to view a cited chunk, they see the full chunk text (up to ~1000 tokens) and must scan it manually to find the sentences that made the chunk relevant. This is especially painful for long, dense chunks.

### Solution

**On-the-fly highlighted HTML chunk views** served by the backend. When a citation link is clicked, the frontend passes the sub-question that retrieved that chunk. The backend splits the chunk into sentences, computes the embedding similarity of each sentence to the sub-question, and returns a styled HTML page with the relevant sentences highlighted.

### Why HTML, not PDF?

| Approach | Complexity | Works for all doc types? | Preserves original formatting? |
|---|---|---|---|
| Highlighted HTML page | **Low** | ✅ Yes (uses chunk text) | ❌ Plain text only |
| Highlighted PDF via reportlab | Medium | ✅ Yes (new PDF) | ❌ Plain text only |
| Overlay highlights on existing PDF | High | ⚠️ PDF only | ✅ Yes |

**Recommendation: HTML page.** Simple, fast, and uniform across PDF/DOCX/TXT chunks. Original formatting remains available in the existing PDF viewer (`chunk_file_path` link); the highlighted HTML view is a **supplementary** view reached via a separate button or link. The two views coexist: "View Original PDF" and "View Highlighted Text".

### How It Works (No LLM Needed)

```
User clicks citation [NEC4 ACC, chunk 3]
        │
        ▼
Frontend sends: GET /api/v1/chunks/highlight?document_id=abc&chunk_index=2&sub_question=...
        │
        ▼
Backend:
  1. Fetch chunk text from ChromaDB                          [chromadb get()]
  2. Split into sentences                                    [nltk.sent_tokenize or regex]
  3. Embed sub-question                                      [existing embedding model]
  4. Embed each sentence (batch, parallel)                   [same model]
  5. Compute cosine similarity per sentence vs sub-question  [numpy]
  6. Return HTML with yellow background on sentences > threshold
        │
        ▼
Frontend renders HTML in an iframe or new tab
```
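
Steps 5 and 6 of the flow can be sketched as follows. This is a minimal illustration, not the planned service code: it assumes embeddings arrive as plain lists of floats, the names `cosine_similarity` and `score_sentences` are hypothetical, and the standard library stands in for the numpy call the plan names so the sketch has no dependencies.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def score_sentences(
    question_emb: Sequence[float],
    sentence_embs: Sequence[Sequence[float]],
    threshold: float = 0.5,
) -> List[Tuple[float, bool]]:
    """Return (similarity, highlighted) per sentence.

    A sentence is highlighted when its similarity is strictly above the threshold.
    """
    results = []
    for emb in sentence_embs:
        sim = cosine_similarity(question_emb, emb)
        results.append((sim, sim > threshold))
    return results


# Toy 2-d vectors standing in for real embeddings (illustrative only).
question = [1.0, 0.0]
sentences = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
flags = [hit for _, hit in score_sentences(question, sentences)]
```

With real embeddings the useful threshold depends on the model, which is why the plan makes it configurable.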

### What Gets Highlighted

```
┌──────────────────────────────────────────────────────────┐
│ Chunk: NEC4 ACC, page 12           [View Original PDF →] │
├──────────────────────────────────────────────────────────┤
│                                                          │
│ The programme shall be prepared in a form acceptable to  │
│ the Project Manager. It shall include:                   │
│                                                          │
│ ████████████████████████████████████████████████████████ │
│ █ The starting date, access dates, and Key Dates.      █ │  ← High similarity
│ ████████████████████████████████████████████████████████ │
│                                                          │
│ ████████████████████████████████████████████████████████ │
│ █ The Contractor shall submit a first programme within █ │
│ █ two weeks of the starting date.                      █ │  ← High similarity
│ ████████████████████████████████████████████████████████ │
│                                                          │
│ The Project Manager may instruct the Contractor to       │
│ submit a revised programme showing the effects of a      │
│ compensation event. This does not affect the             │
│ Contractor's right to be paid for preparing the          │
│ programme.                                               │  ← Low similarity (no highlight)
│                                                          │
└──────────────────────────────────────────────────────────┘
```

### Key Design Decisions

| # | Decision | Rationale |
|---|---|---|
| 1 | HTML page, not PDF | No extra dependency (`reportlab` is not needed). Faster to generate. CSS-based highlighting is more flexible. The original PDF view remains available separately. |
| 2 | Embedding similarity, not LLM | No LLM API cost and no LLM latency. The embedding model is already running, and cosine similarity is cheap. |
| 3 | Sentence-level granularity | Paragraph-level is too coarse (a whole paragraph may be only marginally relevant). Word/phrase-level is too noisy. Sentences are the natural unit of meaning. |
| 4 | Embed sentences in batch | A 1000-token chunk has ~8-12 sentences. One batch embedding call is fast (single API round-trip). |
| 5 | Configurable threshold (env var) | `HIGHLIGHT_SIMILARITY_THRESHOLD` (default 0.5). Tune per embedding model. |
| 6 | Cache sentence embeddings per chunk | A chunk may be cited in multiple queries. Cache sentence embeddings in ChromaDB metadata or SQLite to avoid recomputation. |
| 7 | Graceful degradation | If embedding fails → return the plain-text chunk view. If sentence splitting fails → highlight the entire chunk. |
| 8 | Frontend: "View Highlighted" link alongside "View PDF" | The existing PDF viewer link (`chunk_file_path`) stays. A second link opens the highlighted HTML view. Both are visible and the user chooses. |
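
Decisions 5 and 6 might look roughly like the sketch below. Only the variable name `HIGHLIGHT_SIMILARITY_THRESHOLD` and its default of 0.5 come from the plan. The names `load_threshold` and `SentenceEmbeddingCache` are hypothetical, and the in-memory dict is a stand-in for the ChromaDB-metadata or SQLite store the plan mentions.

```python
import os
from typing import Callable, Dict, List, Mapping, Optional, Tuple

DEFAULT_THRESHOLD = 0.5  # default stated in the plan


def load_threshold(env: Optional[Mapping[str, str]] = None) -> float:
    """Read HIGHLIGHT_SIMILARITY_THRESHOLD, falling back to the default on bad input."""
    env = os.environ if env is None else env
    raw = env.get("HIGHLIGHT_SIMILARITY_THRESHOLD")
    if raw is None:
        return DEFAULT_THRESHOLD
    try:
        value = float(raw)
    except ValueError:
        return DEFAULT_THRESHOLD
    # Cosine similarity lies in [-1, 1]; anything outside is treated as misconfiguration.
    return value if -1.0 <= value <= 1.0 else DEFAULT_THRESHOLD


class SentenceEmbeddingCache:
    """In-memory cache keyed by (document_id, chunk_index).

    Assumes a chunk's text never changes once ingested; a persistent store
    would replace the dict in a real deployment.
    """

    def __init__(self, embed_batch: Callable[[List[str]], List[List[float]]]):
        self._embed_batch = embed_batch
        self._store: Dict[Tuple[str, int], List[List[float]]] = {}

    def get(self, document_id: str, chunk_index: int, sentences: List[str]) -> List[List[float]]:
        key = (document_id, chunk_index)
        if key not in self._store:
            # One batch call per chunk; later citations of the same chunk reuse it.
            self._store[key] = self._embed_batch(sentences)
        return self._store[key]
```

Only sentence embeddings are cached here. The sub-question embedding changes per query, so it is computed on every request.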

### Implementation Tasks

#### Task 5.4.1: Backend sentence splitting utility

- [ ] Create `backend/app/utils/sentence_splitter.py`
- [ ] Function `split_sentences(text: str) -> list[dict]` returns `[{text, start_char, end_char}, ...]`
- [ ] Use `nltk.sent_tokenize` with fallback to regex (`re.split(r'(?<=[.!?])\s+', text)`)
- [ ] NLTK punkt data auto-downloaded on first use (or bundled)
- [ ] Handle edge cases: empty text, single sentence, lists/bullets
- **Test file**: `test_phase5_sentence_splitter.py`
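
A minimal sketch of what `split_sentences` could look like, assuming the fallback order described in the task. The `use_nltk` flag and the helper names are hypothetical additions for illustration, and character offsets are recovered by searching the original text, which is one possible approach rather than a settled design.

```python
import re
from typing import Dict, List, Optional, Union

# Split after sentence-ending punctuation that is followed by whitespace.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+")

Span = Dict[str, Union[str, int]]


def _regex_sentences(text: str) -> List[str]:
    return [s.strip() for s in _BOUNDARY.split(text) if s.strip()]


def _nltk_sentences(text: str) -> List[str]:
    from nltk.tokenize import sent_tokenize  # ImportError if NLTK is not installed

    return sent_tokenize(text)  # LookupError if the punkt data is missing


def _with_offsets(text: str, sentences: List[str]) -> List[Span]:
    spans: List[Span] = []
    cursor = 0
    for sentence in sentences:
        start = text.find(sentence, cursor)
        if start == -1:
            continue  # tokenizer altered the text; skip rather than guess offsets
        end = start + len(sentence)
        spans.append({"text": sentence, "start_char": start, "end_char": end})
        cursor = end
    return spans


def split_sentences(text: str, use_nltk: bool = True) -> List[Span]:
    """Split text into sentences with character offsets into the original text."""
    if not text or not text.strip():
        return []
    sentences: Optional[List[str]] = None
    if use_nltk:
        try:
            sentences = _nltk_sentences(text)
        except (ImportError, LookupError):
            sentences = None  # fall through to the regex splitter
    if sentences is None:
        sentences = _regex_sentences(text)
    return _with_offsets(text, sentences)
```

The regex fallback will split after abbreviations such as "e.g." and does nothing special for bullet lists, so those edge cases still need dedicated handling and tests.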

#### Task 5.4.2: Backend highlighted chunk endpoint

- [ ] New endpoint: `GET /api/v1/chunks/highlight`
- [ ] Query params: `document_id`, `chunk_index`, `sub_question`
- [ ] Returns `text/html` (not JSON)
- [ ] Logic in `backend/app/services/chunk_highlight_service.py`:
  1. Fetch chunk from ChromaDB by `document_id` + `chunk_index`
  2. Split into sentences via `split_sentences()`
  3. Get embedding for `sub_question` via existing embedding model
  4. Get embeddings for all sentences in one batch call
  5. Compute cosine similarity: `np.dot(q_emb, s_emb) / (norm(q_emb) * norm(s_emb))`
  6. Mark sentences with similarity > threshold as highlighted
  7. Render HTML template with inline CSS (yellow background, subtle border)
- **Test file**: `test_phase5_chunk_highlight.py`
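
Step 7, together with the embedding-failure fallback from the design decisions, could be sketched like this. The function names, the colour values, and the use of `<mark>` elements are assumptions made for illustration. The part that matters in any implementation is HTML-escaping the chunk text, since document content is untrusted.

```python
from html import escape
from typing import Callable, Iterable, List, Sequence, Tuple

# Inline style for highlighted sentences (illustrative colours).
HIGHLIGHT_STYLE = "background-color: #fff3a3; border-bottom: 1px solid #e0c200;"


def render_chunk_html(title: str, scored: Iterable[Tuple[str, bool]]) -> str:
    """Render (sentence, highlighted) pairs as a small standalone HTML page."""
    parts: List[str] = []
    for sentence, highlighted in scored:
        safe = escape(sentence)  # chunk text is untrusted input
        if highlighted:
            parts.append(f'<mark style="{HIGHLIGHT_STYLE}">{safe}</mark>')
        else:
            parts.append(f"<span>{safe}</span>")
    body = " ".join(parts)
    return (
        "<!DOCTYPE html>\n"
        f'<html><head><meta charset="utf-8"><title>{escape(title)}</title></head>\n'
        f"<body><p>{body}</p></body></html>"
    )


def render_with_fallback(
    title: str,
    sentences: Sequence[str],
    score_fn: Callable[[Sequence[str]], Iterable[bool]],
) -> str:
    """Highlight via score_fn; if scoring fails, serve the plain-text view instead."""
    try:
        flags = list(score_fn(sentences))
        if len(flags) != len(sentences):
            raise ValueError("score_fn must return one flag per sentence")
    except Exception:
        flags = [False] * len(sentences)  # graceful degradation: no highlights
    return render_chunk_html(title, zip(sentences, flags))
```

Here `score_fn` stands for whatever wraps the embedding calls and threshold comparison. Catching the failure at this boundary keeps the endpoint returning a usable page when the embedding API is down.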

#### Task 5.4.3: Frontend "View Highlighted" link in citations and sources

- [ ] In `citationParser.ts` and `ResponsePanel.tsx`, add a "🔍" or "View Highlighted" link next to each source
- [ ] Link target: `/api/v1/chunks/highlight?document_id=...&chunk_index=...&sub_question=...`
- [ ] The sub-question is the one that retrieved this chunk (already available in the sources structure: `source.sub_question_index` → look up sub-question text)
- [ ] Open in new tab or modal
- **Test file**: Update `citationParser.test.ts` and `ResponsePanel.test.tsx`
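
Because the sub-question is free text, the link target needs URL-encoding. The sketch below shows the expected URL shape in Python, for example as a helper in backend tests. The function name `build_highlight_url` is hypothetical, and the real frontend code would do the equivalent in TypeScript, for instance with `URLSearchParams`.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_highlight_url(
    document_id: str,
    chunk_index: int,
    sub_question: str,
    base: str = "/api/v1/chunks/highlight",
) -> str:
    """Build the highlight link with all query parameters percent-encoded."""
    query = urlencode(
        {
            "document_id": document_id,
            "chunk_index": chunk_index,
            "sub_question": sub_question,
        }
    )
    return f"{base}?{query}"


url = build_highlight_url("abc", 2, "What must the programme include & when?")
params = parse_qs(urlsplit(url).query)
```

Without encoding, an `&` or `?` inside the sub-question would be parsed as a parameter separator and the backend would receive a truncated question.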

#### Task 5.4.4: Integration testing

- [ ] Verify highlight endpoint returns 200 with valid HTML for all doc types (PDF, DOCX, TXT)
- [ ] Verify the highlighted sentences are the ones relevant to the sub-question (spot-check manually)
- [ ] Verify caching works (second request for same chunk is faster)
- [ ] Verify graceful degradation (embedding API down → plain text still served)
- [ ] Run full test suite

### Test Files

| # | Test File | Coverage |
|---|-----------|----------|
| T5.4.1 | `backend/app/test/test_phase5_sentence_splitter.py` | Sentence splitting: English, mixed punctuation, empty, single sentence, bullet lists |
| T5.4.2 | `backend/app/test/test_phase5_chunk_highlight.py` | Highlight endpoint: valid request → HTML with highlights, threshold filtering, no sentences above threshold → all plain, missing document/chunk → 404, embedding failure → fallback plain text |
| T5.4.3 | `frontend/src/test/utils/citationParser.test.ts` (update) | Citation links include highlight URL when sub-question context available |
| T5.4.4 | `frontend/src/test/components/ResponsePanel.test.tsx` (update) | Sources section renders "View Highlighted" link alongside "View PDF" |

### Acceptance Tests

| # | Test File | Coverage |
|---|-----------|----------|
| AT5.4.1 | `backend/app/test/acceptance/test_acceptance_phase5_highlight.py` | Real LLM query → real embeddings → open highlighted view → verify yellow spans exist on relevant sentences |

---

## Commit Plan

| Commit | Message | Scope |

@@ -254,5 +404,5 @@ If `with_structured_output()` causes issues in production:
| 3 | `feat: refactor QueryDecomposer to use structured output with fallback` | Task 5.1.4 + tests |
| 4 | `chore: update decompose seed prompt for structured output` | Task 5.1.5 |
| 5 | `feat: add document_id to SourceMetadata model` | Task 5.2.1 + tests |
-| 6 | `feat: generate PDFs for DOCX/TXT documents on ingest` | Task 5.2.2 + tests |
-| 7 | `feat: fuzzy citation matching and document fallback links` | Tasks 5.2.3 + 5.2.4 + tests |
+| 6 | `feat: fuzzy citation matching and document fallback links` | Tasks 5.2.3 + 5.2.4 + tests |
+| 7 | `feat: sentence-level chunk highlighting via embedding similarity` | Phase 5.4 (all tasks) |