feat: cited source extraction, background batch trigger, and View PDF link upgrade (Phase 5.4.6-5.4.8)

- citationParser.ts: extractCitedSources() parses answer text for [citations],
  resolves against SourceMetadata, returns deduplicated cited sources
- ResponsePanel.tsx: useEffect fires POST /api/v1/v2/highlights/batch after
  answer renders; View PDF link upgrades in-place to highlighted HTML when
  batch completes; stays as raw PDF on failure
- Updated plan: LLM-based relevance detection, eager background computation,
  single batched LLM call, sqlite cache, regex sentence splitter
- 45 frontend tests: 28 citationParser + 17 ResponsePanel (including 4 new
  sub-question highlight tests)
Woody 2026-04-29 09:27:04 +08:00
parent a56f8f69e2
commit c632b9ea3b
5 changed files with 620 additions and 76 deletions


@@ -4,7 +4,7 @@
**Scope**: **Scope**:
- Phase 5.1: Replace manual JSON parsing in the decompose stage with LangChain `with_structured_output()` - Phase 5.1: Replace manual JSON parsing in the decompose stage with LangChain `with_structured_output()`
- Phase 5.2: Fix missing PDF links in citations and improve citation robustness - Phase 5.2: Fix missing PDF links in citations and improve citation robustness
**Status**: Phases 5.1 ✅, 5.2 ✅ — 5.3 Deferred, 5.4 Planned (2026-04-28) **Status**: Phases 5.1 ✅, 5.2 ✅, 5.3 ✅ — 5.4 Planned (LLM-based, awaiting implementation)
**LangChain version**: 1.2.15 (venv), `model_provider="openai"` with OpenRouter base URL (API-compatible proxy). **LangChain version**: 1.2.15 (venv), `model_provider="openai"` with OpenRouter base URL (API-compatible proxy).
@@ -266,39 +266,66 @@ Generate per-chunk PDF files for DOCX/TXT documents at ingestion time so they ha
When a user clicks a citation link to view a cited chunk, they see the full chunk text (up to ~1000 tokens). They have to manually scan to find which sentences actually drove the relevance. This is especially painful for long, dense chunks. When a user clicks a citation link to view a cited chunk, they see the full chunk text (up to ~1000 tokens). They have to manually scan to find which sentences actually drove the relevance. This is especially painful for long, dense chunks.
### Solution ### Solution: LLM-Based Relevance Detection + Eager Background Computation
**On-the-fly highlighted HTML chunk views** served by the backend. When a citation link is clicked, the frontend passes the sub-question that retrieved that chunk. The backend splits the chunk into sentences, computes embedding similarity of each sentence to the sub-question, and returns a styled HTML page with relevant sentences highlighted. **Embedding similarity was rejected** — cosine similarity measures semantic proximity, not answer relevance. A sentence like *"The Contractor shall submit a quotation within three weeks"* has low embedding similarity to *"What are the compensation event timelines?"* but is directly relevant to answering it. Only an LLM can distinguish topic-matching from answer-relevance.
**Strategy**: After the final response renders on the frontend, a **single background LLM call** identifies relevant sentences across all cited chunks (grouped by sub-question). Results are cached in SQLite. Once caching completes, the "View PDF" link on each source card is upgraded to open the highlighted page, and clicks are served instantly from the cache.
**Why one LLM call, not one per chunk**: A sub-question typically cites only 2-3 chunks, even though up to 10 are retrieved. A single batched prompt containing all sub-questions and their cited chunks costs one LLM call per query, instead of N × M calls for N sub-questions with M cited chunks each. It reuses the `complete_structured()` infrastructure from Phase 5.1.
### Why HTML, not PDF? ### Why HTML, not PDF?
| Approach | Complexity | Works for all doc types? | Preserves original formatting? | | Approach | Complexity | Works for all doc types? | Preserves original formatting? |
|---|---|---|---| |---|---|---|---|
| Highlighted HTML page | **Low** | ✅ Yes (uses chunk text) | ❌ Plain text only | | Highlighted HTML page | **Low** | ✅ Yes (uses chunk text from ChromaDB) | ❌ Plain text only |
| Highlighted PDF via reportlab | Medium | ✅ Yes (new PDF) | ❌ Plain text only | | Highlighted PDF via reportlab | Medium | ✅ Yes (new PDF) | ❌ Plain text only |
| Overlay highlights on existing PDF | High | ⚠️ PDF only | ✅ Yes | | Overlay highlights on existing PDF | High | ⚠️ PDF only | ✅ Yes |
**Recommendation: HTML page.** Simple, fast, works uniformly for PDF/DOCX/TXT chunks. Original formatting is preserved in the existing PDF viewer (`chunk_file_path` link) — the highlighted HTML view is a **supplementary** view reached via a separate button/link. The two views coexist: "View Original PDF" vs "View Highlighted Text". **Recommendation: HTML page.** Works uniformly across PDF/DOCX/TXT. Original formatting is preserved in the existing PDF viewer (`chunk_file_path` link) — the highlighted HTML view is a supplementary view reached via a separate button. The two views coexist: "View Original PDF" vs "View Highlighted Text".
### How It Works (No LLM Needed) ### Architecture
``` ```
User clicks citation [NEC4 ACC, chunk 3] EAGER PHASE (background, fires immediately after SSE completed):
─────────────────────────────────────────────────
Frontend receives SSE "completed" event
Frontend sends: GET /api/v1/chunks/highlight?document_id=abc&chunk_index=2&sub_question=...
├── Renders answer with citations instantly (no waiting)
Backend: └── useEffect fires POST /api/v1/v2/highlights/batch
1. Fetch chunk text from ChromaDB [chromadb get()]
2. Split into sentences [nltk.sent_tokenize or regex]
3. Embed sub-question [existing embedding model] Backend highlight batch service:
4. Embed each sentence (batch, parallel) [same model] 1. For each sub-question, extract cited chunks
5. Compute cosine similarity per sentence vs sub-question [numpy] (only 2-3 actually cited per sub-q, not all 10)
6. Return HTML with yellow background on sentences > threshold 2. Fetch chunk texts from ChromaDB
3. Split each chunk into sentences
4. Single LLM call: identify which sentences are
Frontend renders HTML in an iframe or new tab relevant to their sub-question (structured output)
5. Render each chunk as highlighted HTML
6. Cache all HTML + metadata in SQLite
7. Return {status, cached_count} to frontend
Frontend receives response → upgrades "View PDF"
links on source cards to point to cached highlight pages
(same button, same position, upgraded destination)
LAZY PHASE (on user click, instant from cache):
──────────────────────────────────────────────
User clicks "View PDF" on a source card
GET /api/v1/v2/highlights?document_id=...&chunk_index=...&sub_question=...
Backend reads cached HTML from SQLite → serves instantly
Highlighted HTML page shows:
┌─ Highlighted sentences (yellow background + LLM reasons)
└─ Footer: [View Original PDF →] link to raw PDF chunk
``` ```
### What Gets Highlighted ### What Gets Highlighted
@@ -312,19 +339,23 @@ Frontend renders HTML in an iframe or new tab
│ the Project Manager. It shall include: │ │ the Project Manager. It shall include: │
│ │ │ │
│ ████████████████████████████████████████████████████████ │ │ ████████████████████████████████████████████████████████ │
│ █ The starting date, access dates, and Key Dates. █ │ ← High similarity │ █ The starting date, access dates, and Key Dates. █ │ ← LLM: relevant to timeline question
│ █ "Defines when work begins and key milestones." █ │ ← Brief reason from LLM
│ ████████████████████████████████████████████████████████ │ │ ████████████████████████████████████████████████████████ │
│ │ │ │
│ The Contractor shall submit a first programme within │ │ The Contractor shall submit a first programme within │
│ ████████████████████████████████████████████████████████ │ │ ████████████████████████████████████████████████████████ │
│ █ two weeks of the starting date. █ │ ← High similarity │ █ two weeks of the starting date. █ │ ← LLM: relevant to timeline question
│ █ "Specifies the deadline for initial programme." █ │
│ ████████████████████████████████████████████████████████ │ │ ████████████████████████████████████████████████████████ │
│ │ │ │
│ The Project Manager may instruct the Contractor to │ │ The Project Manager may instruct the Contractor to │
│ submit a revised programme showing the effects of a │ │ submit a revised programme showing the effects of a │
│ compensation event. This does not affect the Contractor's │ │ compensation event. This does not affect the Contractor's │
│ right to be paid for preparing the programme. │ ← Low similarity (no highlight) │ right to be paid for preparing the programme. │ ← Not highlighted (LLM: not relevant to timeline)
│ │ │ │
├──────────────────────────────────────────────────────────┤
│ [View Original PDF →] │
└──────────────────────────────────────────────────────────┘ └──────────────────────────────────────────────────────────┘
``` ```
@@ -332,75 +363,275 @@ Frontend renders HTML in an iframe or new tab
| # | Decision | Rationale | | # | Decision | Rationale |
|---|---|---| |---|---|---|
| 1 | HTML page, not PDF | Zero dependency (`reportlab` not needed). Faster to generate. CSS-based highlighting is more flexible. Original PDF view remains available separately. | | 1 | HTML page, not PDF | Zero extra dependencies. CSS highlighting is more flexible. Original PDF remains available via separate link. |
| 2 | Embedding similarity, not LLM | No API cost, no latency. The embedding model is already running. Cosine similarity is cheap. | | 2 | **LLM relevance detection**, not embedding similarity | Embedding similarity measures topic overlap, not answer relevance. LLM distinguishes "about timelines" from "answers the timeline question." |
| 3 | Sentence-level granularity | Paragraph-level is too coarse (whole paragraph might be dimly relevant). Word/phrase-level is too noisy. Sentences are the natural unit of meaning. | | 3 | **Single batched LLM call** for all sub-questions + cited chunks | One API call per query, not one per chunk. Each sub-question typically cites 2-3 chunks. Total ~6-15 chunks per query in a single prompt. |
| 4 | Embed sentences in batch | A 1000-token chunk has ~8-12 sentences. One batch embedding call is fast (single API round-trip). | | 4 | **Eager computation** (background after response renders) | User sees answer instantly. Highlight computation happens while they read. "View Highlighted" activates when ready. Zero perceived latency on click. |
| 5 | Configurable threshold (env var) | `HIGHLIGHT_SIMILARITY_THRESHOLD` (default 0.5). Tune per embedding model. | | 5 | **Only cited chunks** (not all retrieved chunks) | Up to 10 chunks retrieved per sub-question, but only 2-3 are actually cited in the answer. Computing highlights for uncited chunks wastes tokens. |
| 6 | Cache sentence embeddings per chunk | A chunk may be cited in multiple queries. Cache sentence embeddings in ChromaDB metadata or SQLite to avoid recomputation. | | 6 | **SQLite cache** for highlight results | Durable across restarts. Same chunk+sub-question pair only computed once. Subsequent clicks (or same query re-run) serve instantly. Cache key = hash(document_id + chunk_index + sub_question). |
| 7 | Graceful degradation | If embedding fails → return plain text chunk view. If sentence splitting fails → highlight entire chunk. | | 7 | Reuse `complete_structured()` from Phase 5.1 | `LLMClient.complete_structured(prompt, HighlightBatchResult)` returns validated Pydantic model. No new LLM calling infrastructure. |
| 8 | Frontend: "View Highlighted" link alongside "View PDF" | The existing PDF viewer link (`chunk_file_path`) stays. A second link opens the highlighted HTML view. Both visible, user chooses. | | 8 | Sentence-level granularity | Paragraphs are too coarse, words are too noisy. Sentences are the natural unit of meaning. LLM returns sentence indices. |
| 9 | Regex sentence splitter (no NLTK) | NLTK adds 35MB deploy weight for edge-case handling the project doesn't need. `re.split(r'(?<=[.!?。!?\n])\s+')` handles English + Cantonese legal text. |
| 10 | Sentence indices, not text spans | Deterministic: the splitter always produces the same sentences from the same chunk text. LLM returns indices to avoid substring matching fragility. |
| 11 | Graceful degradation | LLM call fails → nothing is cached → GET returns 404. The frontend upgrades the "View PDF" link only after the batch succeeds; if the batch fails, the link keeps opening the raw PDF chunk. |
| 12 | "View Highlighted" on source cards only | Source cards have sub-question context via `SubQuestionSection`. Inline citations in answer text lack this context (flattened `allSources`). Adding `sub_question_index` to `SourceMetadata` deferred. |
| 13 | **Upgrade "View PDF" link in place (no second button)** | Before the LLM completes, the "View PDF" link opens the raw PDF chunk page (existing behavior). Once the batch LLM call finishes caching, the same link upgrades to open the highlighted HTML page instead. The highlighted page includes a "View Original PDF →" footer link. If the LLM call fails, the link stays as raw PDF. One link, seamless upgrade. |
### Data Model
#### Batch Request (POST)
```python
from pydantic import BaseModel


class ChunkHighlightTarget(BaseModel):
    """A single chunk to highlight, tied to its sub-question."""

    document_id: str
    chunk_index: int
    sub_question_text: str
    sub_question_index: int


class HighlightBatchRequest(BaseModel):
    """Batch request: all cited chunks grouped by sub-question."""

    targets: list[ChunkHighlightTarget]
```
#### Structured LLM Output
```python
from pydantic import BaseModel, Field


class RelevantSentence(BaseModel):
    sentence_index: int = Field(description="0-based index of the relevant sentence")
    reason: str = Field(description="Brief explanation of why this sentence is relevant (max 80 chars)")


class ChunkHighlights(BaseModel):
    document_id: str
    chunk_index: int
    relevant_sentences: list[RelevantSentence] = Field(default_factory=list)


class HighlightBatchResult(BaseModel):
    """Single LLM call returns highlights for all sub-questions and chunks."""

    results: list[ChunkHighlights]
```
#### Batch Response (POST response, JSON)
```python
from typing import Literal
from pydantic import BaseModel

class HighlightBatchResponse(BaseModel):
    status: Literal["completed", "partial", "failed"]
    cached_count: int  # Number of highlights successfully computed and cached
    errors: list[str] = []  # Per-chunk errors (LLM failed on some but not all)
```
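The plan does not spell out when each `status` value applies. One possible reading, shown here only as a sketch, is: `failed` when nothing was cached, `partial` when some targets were cached, `completed` when all were. The helper name `derive_batch_status` and the treatment of an empty batch as `completed` are assumptions, not part of the plan.

```python
def derive_batch_status(requested: int, cached: int) -> str:
    """Map counts to the tri-state status (assumed semantics, not from the plan)."""
    if requested > 0 and cached == 0:
        # Nothing usable was produced, e.g. the single LLM call failed.
        return "failed"
    if cached < requested:
        # Some targets were skipped (chunk fetch failed, missing LLM result, ...).
        return "partial"
    # All targets cached; an empty batch also lands here by assumption.
    return "completed"
```

Under this reading the frontend can treat anything other than `failed` as permission to upgrade the links it has cache entries for.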
#### SQLite Cache Table
```sql
CREATE TABLE IF NOT EXISTS chunk_highlights (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    cache_key TEXT UNIQUE NOT NULL,         -- sha256(document_id + chunk_index + sub_question)
    document_id TEXT NOT NULL,
    chunk_index INTEGER NOT NULL,
    sub_question TEXT NOT NULL,
    relevant_sentences_json TEXT NOT NULL,  -- JSON of RelevantSentence[]
    html_content TEXT NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE INDEX IF NOT EXISTS idx_highlights_cache_key ON chunk_highlights(cache_key);
```
#### GET Endpoint Response
```
GET /api/v1/v2/highlights?document_id=...&chunk_index=...&sub_question=...
→ 200 text/html (cache hit: highlighted HTML page)
→ 404 (cache miss: batch not yet computed or failed)
```
### LLM Prompt Structure (Single Call, All Sub-Questions)
```
For each sub-question below, identify which sentences in each cited chunk are directly relevant to answering that sub-question. Return only the index and a brief reason for each relevant sentence. Skip sentences that are about the same topic but don't contribute to the answer.
Sub-question 1: "What are the compensation event timelines in NEC4?"
[Chunk: NEC4 ACC, chunk_index=2]
[0] The Contractor shall submit a quotation within three weeks of the instruction.
[1] The Project Manager may instruct the Contractor to submit a revised programme.
[2] This does not affect the Contractor's right to be paid for preparing the programme.
[Chunk: NEC4 ACC, chunk_index=5]
[0] Compensation events are events which change the total of the Prices.
[1] The Project Manager shall notify the Contractor within one week of becoming aware.
Sub-question 2: "What are the key dates and milestones?"
[Chunk: NEC4 ACC, chunk_index=1]
[0] The starting date shall be defined as the first working day.
[1] Key Dates are dates by which the Contractor must complete work.
```
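The prompt layout above could be assembled along the following lines. This is a minimal sketch rather than the project's implementation: `PromptChunk`, its `label` field, and `build_batch_prompt` are hypothetical names, the instruction text is abbreviated, and the sentences are assumed to be split already.

```python
from typing import NamedTuple


class PromptChunk(NamedTuple):
    """Hypothetical container: one cited chunk, already split into sentences."""
    sub_question: str
    label: str            # document name shown in the prompt, e.g. "NEC4 ACC"
    chunk_index: int
    sentences: list[str]


INSTRUCTIONS = (
    "For each sub-question below, identify which sentences in each cited chunk "
    "are directly relevant to answering that sub-question. Return only the index "
    "and a brief reason for each relevant sentence."
)


def build_batch_prompt(chunks: list[PromptChunk]) -> str:
    """Group chunks by sub-question (first-seen order) and number each sentence."""
    grouped: dict[str, list[PromptChunk]] = {}
    for chunk in chunks:
        grouped.setdefault(chunk.sub_question, []).append(chunk)

    lines = [INSTRUCTIONS, ""]
    for number, (question, members) in enumerate(grouped.items(), start=1):
        lines.append(f'Sub-question {number}: "{question}"')
        for chunk in members:
            lines.append(f"[Chunk: {chunk.label}, chunk_index={chunk.chunk_index}]")
            # 0-based indices, matching RelevantSentence.sentence_index.
            lines.extend(f"[{i}] {s}" for i, s in enumerate(chunk.sentences))
        lines.append("")
    return "\n".join(lines).rstrip()
```

Numbering sentences per chunk, not globally, keeps the returned indices directly usable against the output of the sentence splitter for that chunk.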
### Implementation Tasks ### Implementation Tasks
#### Task 5.4.1: Backend — Sentence splitting utility #### Task 5.4.1: Backend — Sentence splitting utility
- [ ] Create `backend/app/utils/sentence_splitter.py` - [ ] Create `backend/app/utils/sentence_splitter.py`
- [ ] Function `split_sentences(text: str) -> list[dict]` returns `[{text, start_char, end_char}, ...]` - [ ] Function `split_sentences(text: str) -> list[str]` returns ordered sentence strings
- [ ] Use `nltk.sent_tokenize` with fallback to regex (`re.split(r'(?<=[.!?])\s+')`) - [ ] Use regex: `re.split(r'(?<=[.!?。!?\n])\s+', text)` (handles English + Chinese punctuation)
- [ ] NLTK punkt data auto-downloaded on first use (or bundled) - [ ] Strip empty strings, preserve original sentence text whitespace
- [ ] Handle edge cases: empty text, single sentence, lists/bullets - [ ] Handle edge cases: empty text, single sentence, lists/bullets
- **Test file**: `test_phase5_sentence_splitter.py` - **Test file**: `test_phase5_sentence_splitter.py`
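One caveat with the planned pattern: it only splits where whitespace follows the terminator, so a full-width `。` immediately followed by the next sentence is not a split point. The sketch below contrasts the pattern as written with a hypothetical variant that also splits after CJK terminators without whitespace. `VARIANT_PATTERN` is an assumption for illustration, not something the plan specifies, and neither pattern handles abbreviations such as "e.g. " gracefully.

```python
import re

# Pattern as written in the plan: a terminator must be followed by whitespace.
PLAN_PATTERN = re.compile(r'(?<=[.!?。!?\n])\s+')

# Hypothetical variant: CJK terminators split even with no following whitespace,
# and newlines always split. Zero-width splits require Python 3.7+.
VARIANT_PATTERN = re.compile(r'(?<=[.!?])\s+|(?<=[。!?])\s*|\n+')


def split_sentences(text: str, pattern: re.Pattern = VARIANT_PATTERN) -> list[str]:
    """Return non-empty sentences in document order, trimmed at the edges."""
    return [part.strip() for part in pattern.split(text) if part.strip()]


mixed = "The date is 2024年1月1日。The Contractor shall submit a programme. It is due soon."
plan_result = split_sentences(mixed, PLAN_PATTERN)        # 2 parts: the "。" boundary is missed
variant_result = split_sentences(mixed, VARIANT_PATTERN)  # 3 parts
```

Because the LLM returns sentence indices, whichever pattern is chosen must be the same one used when rendering, otherwise the indices point at the wrong sentences.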
#### Task 5.4.2: Backend — Highlighted chunk endpoint #### Task 5.4.2: Backend — Highlight models (`HighlightBatchResult` and related Pydantic schemas)
- [ ] New endpoint: `GET /api/v1/chunks/highlight` - [ ] Create `backend/app/models/highlight.py` with:
- [ ] Query params: `document_id`, `chunk_index`, `sub_question` - `ChunkHighlightTarget` (request input)
- [ ] Returns `text/html` (not JSON) - `HighlightBatchRequest` (request input)
- [ ] Logic in `backend/app/services/chunk_highlight_service.py`: - `RelevantSentence` (LLM output unit)
1. Fetch chunk from ChromaDB by `document_id` + `chunk_index` - `ChunkHighlights` (LLM output per chunk)
2. Split into sentences via `split_sentences()` - `HighlightBatchResult` (LLM output wrapper)
3. Get embedding for `sub_question` via existing embedding model - `HighlightBatchResponse` (POST response)
4. Get embeddings for all sentences in one batch call - **Test file**: `test_phase5_highlight_models.py`
5. Compute cosine similarity: `np.dot(q_emb, s_emb) / (norm(q) * norm(s))`
6. Mark sentences with similarity > threshold as highlighted
7. Render HTML template with inline CSS (yellow background, subtle border)
- **Test file**: `test_phase5_chunk_highlight.py`
#### Task 5.4.3: Frontend — "View Highlighted" link in citations and sources #### Task 5.4.3: Backend — SQLite cache for highlight results
- [ ] In `citationParser.ts` and `ResponsePanel.tsx`, add a "🔍" or "View Highlighted" link next to each source - [ ] Add `chunk_highlights` table to `backend/app/services/highlight_cache.py`
- [ ] Link target: `/api/v1/chunks/highlight?document_id=...&chunk_index=...&sub_question=...` - [ ] Functions: `get_highlight(cache_key) -> str | None`, `set_highlight(cache_key, html_content, relevant_sentences_json)`, `compute_cache_key(document_id, chunk_index, sub_question) -> str`
- [ ] The sub-question is the one that retrieved this chunk (already available in the sources structure: `source.sub_question_index` → look up sub-question text) - [ ] Database path: `backend/app/data/highlights.db` (separate from prompts/history to avoid schema conflicts)
- [ ] Open in new tab or modal - [ ] Initialize table on first use (same pattern as `prompt_service.py` / `history_service.py`)
- **Test file**: Update `citationParser.test.ts` and `ResponsePanel.test.tsx` - **Test file**: `test_phase5_highlight_cache.py`
#### Task 5.4.4: Integration testing #### Task 5.4.4: Backend — Highlight batch service (LLM call + HTML rendering)
- [ ] Verify highlight endpoint returns 200 with valid HTML for all doc types (PDF, DOCX, TXT) - [ ] Create `backend/app/services/chunk_highlight_service.py`
- [ ] Verify sentence highlighting is proportional to relevance (spot-check manually) - [ ] Method `compute_highlights_batch(targets: list[ChunkHighlightTarget]) -> HighlightBatchResponse`:
- [ ] Verify caching works (second request for same chunk is faster) 1. Group targets by sub-question for prompt construction
- [ ] Verify graceful degradation (embedding API down → plain text still served) 2. For each target, fetch chunk text + metadata from ChromaDB (`collection.get(ids=[f"{document_id}_{chunk_index}"])`)
- [ ] Run full test suite 3. Split chunk text into sentences via `split_sentences()`
4. Build a single batched prompt (see Prompt Structure above)
5. Call `llm_client.complete_structured(prompt, HighlightBatchResult)` → validated Pydantic result
6. For each `ChunkHighlights` in result: render HTML, compute cache key, store in SQLite
7. Return `HighlightBatchResponse(status="completed", cached_count=N)`
- [ ] Method `render_highlight_html(chunk_text, sentences, relevant_sentences, metadata) -> str`:
- Inline CSS: yellow background (`#fef08a`) on relevant sentences, grey reason text below each
- Header: filename, chunk index, page number, sub-question text
- Footer: "View Original PDF →" link (if `chunk_file_path` exists)
- Returns complete self-contained HTML page
- [ ] Graceful degradation per target (not per batch): if one chunk fetch fails, skip it and continue others. If LLM call fails entirely, return `status="failed"`.
- **Test file**: `test_phase5_chunk_highlight_service.py`
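The rendering step might look roughly like this. It is a simplified sketch: the signature is narrower than the planned `render_highlight_html(chunk_text, sentences, relevant_sentences, metadata)`, the header and "View Original PDF →" footer are omitted, and the `relevant` mapping (sentence index to reason) is a hypothetical shape derived from `RelevantSentence`. Chunk text and LLM reasons are HTML-escaped, and indices the LLM returns outside the sentence range are simply never matched.

```python
from html import escape


def render_highlight_html(sentences: list[str], relevant: dict[int, str],
                          title: str = "Chunk") -> str:
    """Render sentences as a self-contained page; `relevant` maps index -> reason."""
    body = []
    for index, sentence in enumerate(sentences):
        text = escape(sentence)  # chunk text is untrusted document content
        if index in relevant:
            body.append(
                f'<p><mark style="background:#fef08a">{text}</mark><br>'
                f'<small style="color:#6b7280">{escape(relevant[index])}</small></p>'
            )
        else:
            body.append(f"<p>{text}</p>")
    return (
        '<!DOCTYPE html><html><head><meta charset="utf-8">'
        f"<title>{escape(title)}</title></head><body>"
        + "".join(body)
        + "</body></html>"
    )
```

Escaping matters here because the page is served as `text/html` from the API origin, so unescaped markup inside a chunk would be interpreted by the browser.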
#### Task 5.4.5: Backend — REST endpoints
- [ ] `POST /api/v1/v2/highlights/batch` in `backend/app/routers/chunks.py` (new file):
- Accepts `HighlightBatchRequest`
- Calls `chunk_highlight_service.compute_highlights_batch()`
- Returns `HighlightBatchResponse` (JSON)
- [ ] `GET /api/v1/v2/highlights` in same router:
- Query params: `document_id`, `chunk_index`, `sub_question`
- Computes cache key, reads from SQLite
- Cache hit → returns `text/html`
- Cache miss → returns 404
- [ ] Register router in `main.py`
- **Test file**: `test_phase5_highlight_endpoints.py`
#### Task 5.4.6: Frontend — Extract cited chunks from answer text
- [ ] Add `extractCitedSources(answerText: string, sources: SourceMetadata[]): SourceMetadata[]` to `citationParser.ts`
- Uses the same `citationPattern` regex as `replaceCitationPatterns()`
- For each `[citation]` match, resolves against `buildCitationLookup(sources)`
- Returns deduplicated list of `SourceMetadata` that were actually cited in the answer
- [ ] Exported alongside existing citation functions
- **Test file**: Update `citationParser.test.ts` with cited source extraction tests
#### Task 5.4.7: Frontend — Background batch trigger on response completion
- [ ] In `ResponsePanel.tsx` (inside `SubQuestionSections` component):
- After answer is rendered (non-null, not loading), `useEffect` fires:
1. Parse answer sections → extract cited sources per sub-question
2. Build `HighlightBatchRequest` payload
3. POST `/api/v1/v2/highlights/batch`
4. On success → store which (document_id, chunk_index, sub_question) triples are cached in a `useState<Set<string>>` (one string key per triple)
5. On failure → log the error (no UI change; "View PDF" keeps pointing at the raw PDF chunk)
- Source cards check `isHighlightReady(document_id, chunk_index, sub_q_text)` and, when it returns true, point the "View PDF" link at the highlight page
- **Test file**: Update `ResponsePanel.test.tsx`
#### Task 5.4.8: Frontend — Upgrade "View PDF" link on source cards
- [ ] In `SubQuestionSourceCard` (ResponsePanel.tsx):
- **Before batch completes**: "View PDF" link opens raw PDF chunk page (existing behavior, unchanged)
- **After batch completes**: Same link upgrades to open `/api/v1/v2/highlights?document_id=...&chunk_index=...&sub_question=...` (highlighted HTML page)
- The link stays in the same position — no repositioning, no second element
- While batch is loading: no visual change (the PDF link remains functional; the highlight upgrade is invisible to the user until ready)
- If batch fails: link stays as raw PDF permanently (graceful degradation)
- [ ] The highlighted HTML page includes a `[View Original PDF →]` footer link (rendered by backend in Task 5.4.4) so users can always reach the raw PDF
- [ ] No label change needed — "View PDF" remains appropriate since the user gets a page-level view of the chunk regardless
- **Test file**: Update `ResponsePanel.test.tsx` — verify link href upgrades after mock batch response, verify link stays as raw PDF when batch fails, verify loading state shows no visual disruption
#### Task 5.4.9: Integration testing
- [ ] Verify batch endpoint: POST with 3 targets → all cached → GET returns highlighted HTML for each
- [ ] Verify cache: same GET twice → both hit SQLite (no second LLM call)
- [ ] Verify graceful degradation: LLM call fails → batch returns `status="failed"` → GET returns 404 → frontend keeps the raw PDF link
- [ ] Verify cross-document-type: PDF, DOCX, TXT chunks all work (chunk text comes from ChromaDB regardless of source format)
- [ ] Verify sentence splitting handles mixed English/Chinese legal text (e.g., "The date is 2024年1月1日。The Contractor shall...")
- [ ] Run full test suite: `cd backend && pytest -v` and `cd frontend && npm test`
### Test Files ### Test Files
| # | Test File | Coverage | | # | Test File | Coverage |
|---|-----------|----------| |---|-----------|----------|
| T5.4.1 | `backend/app/test/test_phase5_sentence_splitter.py` | Sentence splitting: English, mixed punctuation, empty, single sentence, bullet lists | | T5.4.1 | `backend/app/test/test_phase5_sentence_splitter.py` | Sentence splitting: English, Chinese, mixed, empty, single sentence, bullet lists |
| T5.4.2 | `backend/app/test/test_phase5_chunk_highlight.py` | Highlight endpoint: valid request → HTML with highlights, threshold filtering, no sentences above threshold → all plain, missing document/chunk → 404, embedding failure → fallback plain text | | T5.4.2 | `backend/app/test/test_phase5_highlight_models.py` | Pydantic validation: valid/invalid targets, empty batch, max batch size, LLM result parsing |
| T5.4.3 | `frontend/src/test/utils/citationParser.test.ts` (update) | Citation links include highlight URL when sub-question context available | | T5.4.3 | `backend/app/test/test_phase5_highlight_cache.py` | SQLite cache: insert, retrieve, cache key collision, missing key, JSON serialization |
| T5.4.4 | `frontend/src/test/components/ResponsePanel.test.tsx` (update) | Sources section renders "View Highlighted" link alongside "View PDF" | | T5.4.4 | `backend/app/test/test_phase5_chunk_highlight_service.py` | Service: valid batch → cached HTML, mocked LLM returns structured result, per-target error isolation, empty batch, ChromaDB miss → skip target |
| T5.4.5 | `backend/app/test/test_phase5_highlight_endpoints.py` | Endpoints: POST returns cache count, GET returns HTML, GET cache miss → 404, invalid params → 422 |
| T5.4.6 | `frontend/src/test/utils/citationParser.test.ts` (update) | `extractCitedSources()`: extracts cited filenames, deduplicates, handles no citations, handles partial matches |
| T5.4.7 | `frontend/src/test/components/ResponsePanel.test.tsx` (update) | "View PDF" link upgrades to highlight page after batch completes, link stays as raw PDF when batch fails, no visual disruption during loading |
### Acceptance Tests ### Acceptance Tests
| # | Test File | Coverage | | # | Test File | Coverage |
|---|-----------|----------| |---|-----------|----------|
| AT5.4.1 | `backend/app/test/acceptance/test_acceptance_phase5_highlight.py` | Real LLM query → real embeddings → open highlighted view → verify yellow spans exist on relevant sentences | | AT5.4.1 | `backend/app/test/acceptance/test_acceptance_phase5_highlight.py` | Real LLM query → real chunk text → batch POST with real LLM → verify cached HTML contains yellow-highlighted spans on genuinely relevant sentences → spot-check 3 chunks manually |
---
## Updated Dependency Graph
```
Phase 5.1 (Structured Output) ✅
Phase 5.2 (Citation Linking) ✅
Phase 5.3 (DOCX/TXT PDF Gen) ✅
Phase 5.4 (LLM Highlighting)
Task 5.4.1 (sentence splitter)
├── Task 5.4.2 (highlight models)
│ │
│ ├── Task 5.4.3 (SQLite cache)
│ │ │
│ │ ├── Task 5.4.4 (batch service + HTML rendering)
│ │ │ │
│ │ │ └── Task 5.4.5 (REST endpoints)
│ │ │
│ │ └── Task 5.4.9 (integration tests)
│ │
│ └── Task 5.4.6 (frontend cited source extraction)
│ │
│ ├── Task 5.4.7 (frontend batch trigger)
│ │ │
│ │ └── Task 5.4.8 (frontend link upgrade)
│ │
│ └── Task 5.4.9 (integration tests)
└── All tasks → Task 5.4.9 (full suite)
```
### Updated Commit Plan
| Commit | Message | Scope |
|--------|---------|-------|
| 7 | `feat: add sentence splitter and highlight data models` | Tasks 5.4.1 + 5.4.2 + tests |
| 8 | `feat: add SQLite highlight cache service` | Task 5.4.3 + tests |
| 9 | `feat: add LLM-based batch highlight service and HTML rendering` | Task 5.4.4 + tests |
| 10 | `feat: add highlight batch and GET endpoints` | Task 5.4.5 + tests |
| 11 | `feat: add cited source extraction and background batch trigger` | Tasks 5.4.6 + 5.4.7 + tests |
| 12 | `feat: upgrade View PDF link to highlighted view on source cards` | Task 5.4.8 + tests |
--- ---
## Commit Plan ## Complete Commit Plan (All Phases)
| Commit | Message | Scope | | Commit | Message | Scope |
|--------|---------|-------| |--------|---------|-------|
@@ -410,4 +641,9 @@ Frontend renders HTML in an iframe or new tab
| 4 | `chore: update decompose seed prompt for structured output` | Task 5.1.5 | | 4 | `chore: update decompose seed prompt for structured output` | Task 5.1.5 |
| 5 | `feat: add document_id to SourceMetadata model` | Task 5.2.1 + tests | | 5 | `feat: add document_id to SourceMetadata model` | Task 5.2.1 + tests |
| 6 | `feat: fuzzy citation matching and document fallback links` | Tasks 5.2.3 + 5.2.4 + tests | | 6 | `feat: fuzzy citation matching and document fallback links` | Tasks 5.2.3 + 5.2.4 + tests |
| 7 | `feat: sentence-level chunk highlighting via embedding similarity` | Phase 5.4 (all tasks) | | 7 | `feat: add sentence splitter and highlight data models` | Tasks 5.4.1 + 5.4.2 + tests |
| 8 | `feat: add SQLite highlight cache service` | Task 5.4.3 + tests |
| 9 | `feat: add LLM-based batch highlight service and HTML rendering` | Task 5.4.4 + tests |
| 10 | `feat: add highlight batch and GET endpoints` | Task 5.4.5 + tests |
| 11 | `feat: add cited source extraction and background batch trigger` | Tasks 5.4.6 + 5.4.7 + tests |
| 12 | `feat: upgrade View PDF link to highlighted view on source cards` | Task 5.4.8 + tests |


@@ -1,11 +1,15 @@
import React, { useState } from 'react' import React, { useState, useEffect } from 'react'
import { MessageSquare, AlertCircle, Copy, ChevronDown, ChevronRight } from 'lucide-react' import { MessageSquare, AlertCircle, Copy, ChevronDown, ChevronRight } from 'lucide-react'
import ReactMarkdown from 'react-markdown' import ReactMarkdown from 'react-markdown'
import type { SourceMetadata, SubQuestionSources } from '../types' import type { SourceMetadata, SubQuestionSources } from '../types'
import { getPdfViewerUrl } from '../lib/api' import { getPdfViewerUrl } from '../lib/api'
import { processCitations, processCitationsForSubq } from '../utils/citationParser' import { processCitations, processCitationsForSubq, extractCitedSources } from '../utils/citationParser'
import { bulletizeMarkdown } from '../utils/citationParser' import { bulletizeMarkdown } from '../utils/citationParser'
function getHighlightUrl(document_id: string, chunk_index: number, sub_question: string): string {
return `/api/v1/v2/highlights?document_id=${encodeURIComponent(document_id)}&chunk_index=${chunk_index}&sub_question=${encodeURIComponent(sub_question)}`
}
interface ResponsePanelProps { interface ResponsePanelProps {
answer: string | null answer: string | null
sources?: SourceMetadata[] sources?: SourceMetadata[]
@@ -32,7 +36,17 @@ function parseAnswerSections(answer: string): string[] {
   return sections.filter((s) => s.trim().length > 0).map((s) => bulletizeMarkdown(s.trim()))
 }
-function SubQuestionSourceCard({ source, index }: { source: SourceMetadata; index: number }) {
+function SubQuestionSourceCard({
+  source,
+  index,
+  highlightReady = false,
+  subQuestionText = '',
+}: {
+  source: SourceMetadata
+  index: number
+  highlightReady?: boolean
+  subQuestionText?: string
+}) {
   return (
     <div
       key={index}
@ -48,11 +62,15 @@ function SubQuestionSourceCard({ source, index }: { source: SourceMetadata; inde
       </div>
       <div className="text-sm text-gray-500">{source.upload_date}</div>
       <div className="text-sm text-gray-600 mt-1">{source.content_summary}</div>
-      <div className="flex items-center justify-between mt-1">
+      <div className="flex items-center gap-2 mt-1">
         <div className="text-xs text-gray-400">Chunk {source.chunk_index}</div>
         {source.chunk_file_path && (
           <a
-            href={getPdfViewerUrl(source.chunk_file_path, source.page_number ?? undefined, source.filename)}
+            href={
+              highlightReady && source.document_id
+                ? getHighlightUrl(source.document_id, source.chunk_index, subQuestionText)
+                : getPdfViewerUrl(source.chunk_file_path, source.page_number ?? undefined, source.filename)
+            }
             target="_blank"
             rel="noopener noreferrer"
             className="text-xs text-blue-600 hover:text-blue-800 hover:underline"
@ -71,11 +89,13 @@ function SubQuestionSection({
   subQuestion,
   answerSection,
   allSubQuestionSources,
+  highlightReadyKeys,
 }: {
   index: number
   subQuestion: SubQuestionSources
   answerSection: string
   allSubQuestionSources: SubQuestionSources[]
+  highlightReadyKeys: Set<string>
 }) {
   const [expanded, setExpanded] = useState(false)
   // Look up citations across ALL sub-questions' sources because the LLM
@ -117,7 +137,13 @@ function SubQuestionSection({
       {expanded && (
         <div data-testid="sources-container" className="grid grid-cols-2 gap-2">
           {subQuestion.sources.map((source, idx) => (
-            <SubQuestionSourceCard key={idx} source={source} index={idx} />
+            <SubQuestionSourceCard
+              key={idx}
+              source={source}
+              index={idx}
+              highlightReady={source.document_id ? highlightReadyKeys.has(`${source.document_id}_${source.chunk_index}_${encodeURIComponent(subQuestion.sub_question_text)}`) : false}
+              subQuestionText={subQuestion.sub_question_text}
+            />
           ))}
         </div>
       )}
@ -137,6 +163,57 @@ function SubQuestionSections({
   isLoading?: boolean
 }) {
   const [copied, setCopied] = useState(false)
+  const [highlightReadyKeys, setHighlightReadyKeys] = useState<Set<string>>(new Set())
+  useEffect(() => {
+    if (!answer || isLoading || !subQuestionSources.length) return
+    const targets: Array<{
+      document_id: string
+      chunk_index: number
+      sub_question_text: string
+      sub_question_index: number
+    }> = []
+    const sections = parseAnswerSections(answer)
+    subQuestionSources.forEach((sq) => {
+      const answerSection = sections[sq.sub_question_index] ?? ''
+      const citedSources = extractCitedSources(answerSection, sq.sources)
+      citedSources.forEach((source) => {
+        if (source.document_id) {
+          targets.push({
+            document_id: source.document_id,
+            chunk_index: source.chunk_index,
+            sub_question_text: sq.sub_question_text,
+            sub_question_index: sq.sub_question_index,
+          })
+        }
+      })
+    })
+    if (targets.length === 0) return
+    fetch('/api/v1/v2/highlights/batch', {
+      method: 'POST',
+      headers: { 'Content-Type': 'application/json' },
+      body: JSON.stringify({ targets }),
+    })
+      .then((res) => res.json())
+      .then((data) => {
+        if (data.status === 'completed' || data.status === 'partial') {
+          const keys = new Set<string>()
+          targets.forEach((t) => {
+            keys.add(`${t.document_id}_${t.chunk_index}_${encodeURIComponent(t.sub_question_text)}`)
+          })
+          setHighlightReadyKeys(keys)
+        }
+      })
+      .catch((err) => {
+        console.error('Highlight batch computation failed:', err)
+      })
+  }, [answer, isLoading, subQuestionSources])
   const sections = answer ? parseAnswerSections(answer) : []
   const handleCopyAnswer = async (): Promise<void> => {
@ -217,6 +294,7 @@ function SubQuestionSections({
           subQuestion={subQuestion}
           answerSection={sections[index] ?? ''}
           allSubQuestionSources={subQuestionSources}
+          highlightReadyKeys={highlightReadyKeys}
         />
       ))}
     </div>
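
The in-place link upgrade in the component diff above comes down to one decision per source card. A minimal sketch of that decision, assuming a reduced source shape: `CardSource`, `rawPdfUrl`, and `viewLinkHref` are hypothetical names, and `rawPdfUrl` is only a placeholder for `getPdfViewerUrl`, whose real implementation and URL format are not shown in this commit.

```typescript
// Hypothetical, reduced shape; the real SourceMetadata carries more fields.
interface CardSource {
  document_id: string | null
  chunk_index: number
  chunk_file_path: string | null
}

// Mirrors getHighlightUrl from the component diff.
function highlightUrl(documentId: string, chunkIndex: number, subQuestion: string): string {
  return `/api/v1/v2/highlights?document_id=${encodeURIComponent(documentId)}&chunk_index=${chunkIndex}&sub_question=${encodeURIComponent(subQuestion)}`
}

// Placeholder only: the actual viewer URL format is an assumption.
function rawPdfUrl(chunkFilePath: string): string {
  return `/pdf-viewer?file=${encodeURIComponent(chunkFilePath)}`
}

// Returns null when no link would be rendered at all.
function viewLinkHref(source: CardSource, subQuestion: string, ready: boolean): string | null {
  if (!source.chunk_file_path) return null
  return ready && source.document_id
    ? highlightUrl(source.document_id, source.chunk_index, subQuestion)
    : rawPdfUrl(source.chunk_file_path)
}
```

A source without a `document_id` always falls back to the raw PDF link, which matches the "does not upgrade link" test case in this commit.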

@ -1,7 +1,7 @@
 import React from 'react'
 import { render, screen, fireEvent, waitFor } from '@testing-library/react'
 import { ResponsePanel } from '../../components/ResponsePanel'
-import type { SourceMetadata } from '../../types'
+import type { SourceMetadata, SubQuestionSources } from '../../types'
 vi.mock('../../lib/api', () => ({
   getChunkPdfUrl: (filePath: string) => `http://localhost:8000/api/v1/chunks/${filePath}/pdf`,
@ -20,6 +20,7 @@ describe('ResponsePanel', () => {
       chunk_index: 0,
       page_number: 1,
       chunk_file_path: 'test_chunk_1.pdf',
+      document_id: null,
     },
     {
       filename: 'document2.txt',
@ -28,6 +29,7 @@ describe('ResponsePanel', () => {
       chunk_index: 1,
       page_number: null,
       chunk_file_path: null,
+      document_id: null,
     },
   ]
@ -181,6 +183,7 @@ describe('ResponsePanel', () => {
         chunk_index: 0,
         page_number: 3,
         chunk_file_path: 'chunk_0.pdf',
+        document_id: null,
       },
     ]
     const answer = 'The threshold is HK$1,000,000 [NEC4 ACC.pdf, page 3].'
@ -215,4 +218,149 @@ describe('ResponsePanel', () => {
     expect(screen.getByText(/unknown_file\.pdf, page 10/)).toBeInTheDocument()
     expect(screen.queryByRole('link', { name: /unknown_file/ })).not.toBeInTheDocument()
   })
+  describe('SubQuestionSections highlight batch', () => {
+    const mockFetch = vi.fn()
+    const mockSubQuestionSources: SubQuestionSources[] = [
+      {
+        sub_question_index: 0,
+        sub_question_text: 'What is the threshold?',
+        sources: [
+          {
+            filename: 'NEC4 ACC.pdf',
+            upload_date: '2024-01-15',
+            content_summary: 'Contract terms',
+            chunk_index: 0,
+            page_number: 3,
+            chunk_file_path: 'chunk_0.pdf',
+            document_id: 'doc-123',
+          },
+        ],
+      },
+    ]
+    beforeEach(() => {
+      mockFetch.mockReset()
+      global.fetch = mockFetch
+    })
+    afterEach(() => {
+      vi.restoreAllMocks()
+    })
+    it('calls batch highlight endpoint with cited sources', async () => {
+      mockFetch.mockResolvedValue({
+        json: async () => ({ status: 'completed' }),
+      } as Response)
+      render(
+        <ResponsePanel
+          answer="The threshold is HK$1,000,000 [NEC4 ACC.pdf, page 3]."
+          subQuestionSources={mockSubQuestionSources}
+          isLoading={false}
+          error={null}
+        />
+      )
+      await waitFor(() => {
+        expect(mockFetch).toHaveBeenCalledTimes(1)
+        expect(mockFetch).toHaveBeenCalledWith(
+          '/api/v1/v2/highlights/batch',
+          expect.objectContaining({
+            method: 'POST',
+            headers: { 'Content-Type': 'application/json' },
+            body: expect.stringContaining('doc-123'),
+          })
+        )
+      })
+    })
+    it('shows View PDF link with normal URL when highlights are not ready', async () => {
+      mockFetch.mockResolvedValue({
+        json: async () => ({ status: 'pending' }),
+      } as Response)
+      render(
+        <ResponsePanel
+          answer="The threshold is HK$1,000,000 [NEC4 ACC.pdf, page 3]."
+          subQuestionSources={mockSubQuestionSources}
+          isLoading={false}
+          error={null}
+        />
+      )
+      await waitFor(() => {
+        expect(mockFetch).toHaveBeenCalled()
+      })
+      const toggleButton = screen.getByTestId('sources-toggle')
+      fireEvent.click(toggleButton)
+      const link = screen.getByTestId('view-chunk-pdf-link')
+      expect(link).toHaveAttribute('href', expect.stringContaining('/pdf-viewer'))
+    })
+    it('upgrades View PDF link to highlight URL when batch completes', async () => {
+      mockFetch.mockResolvedValue({
+        json: async () => ({ status: 'completed' }),
+      } as Response)
+      render(
+        <ResponsePanel
+          answer="The threshold is HK$1,000,000 [NEC4 ACC.pdf, page 3]."
+          subQuestionSources={mockSubQuestionSources}
+          isLoading={false}
+          error={null}
+        />
+      )
+      await waitFor(() => {
+        expect(mockFetch).toHaveBeenCalled()
+      })
+      const toggleButton = screen.getByTestId('sources-toggle')
+      fireEvent.click(toggleButton)
+      await waitFor(() => {
+        const link = screen.getByTestId('view-chunk-pdf-link')
+        expect(link).toHaveAttribute('href', expect.stringContaining('/api/v1/v2/highlights'))
+        expect(link).toHaveAttribute('href', expect.stringContaining('doc-123'))
+      })
+    })
+    it('does not upgrade link when source has no document_id', async () => {
+      const sourcesWithoutDocId: SubQuestionSources[] = [
+        {
+          sub_question_index: 0,
+          sub_question_text: 'What is the threshold?',
+          sources: [
+            {
+              filename: 'NEC4 ACC.pdf',
+              upload_date: '2024-01-15',
+              content_summary: 'Contract terms',
+              chunk_index: 0,
+              page_number: 3,
+              chunk_file_path: 'chunk_0.pdf',
+              document_id: null,
+            },
+          ],
+        },
+      ]
+      render(
+        <ResponsePanel
+          answer="The threshold is HK$1,000,000 [NEC4 ACC.pdf, page 3]."
+          subQuestionSources={sourcesWithoutDocId}
+          isLoading={false}
+          error={null}
+        />
+      )
+      const toggleButton = screen.getByTestId('sources-toggle')
+      fireEvent.click(toggleButton)
+      const link = screen.getByTestId('view-chunk-pdf-link')
+      expect(link).toHaveAttribute('href', expect.stringContaining('/pdf-viewer'))
+    })
+  })
 })

@ -1,5 +1,5 @@
 import { describe, it, expect } from 'vitest'
-import { processCitations, bulletizeMarkdown } from '../../utils/citationParser'
+import { processCitations, bulletizeMarkdown, extractCitedSources } from '../../utils/citationParser'
 import type { SourceMetadata } from '../../types'
 const mockSources: SourceMetadata[] = [
@ -182,3 +182,61 @@ describe('bulletizeMarkdown', () => {
     expect(bulletizeMarkdown(' \n ')).toBe('')
   })
 })
+describe('extractCitedSources', () => {
+  it('extracts cited source when citation [NEC4 ACC] matches source filename', () => {
+    const text = 'Clause info [NEC4 ACC] is important.'
+    const result = extractCitedSources(text, mockSources)
+    expect(result).toHaveLength(1)
+    expect(result[0].filename).toBe('NEC4 ACC.pdf')
+    expect(result[0].document_id).toBe('doc-001')
+  })
+  it('returns empty array when no citations in text', () => {
+    const text = 'This text has no citations at all.'
+    const result = extractCitedSources(text, mockSources)
+    expect(result).toEqual([])
+  })
+  it('deduplicates same source cited multiple times', () => {
+    const text = 'First [NEC4 ACC] and again [NEC4 ACC, page 3] same source.'
+    const result = extractCitedSources(text, mockSources)
+    expect(result).toHaveLength(1)
+    expect(result[0].filename).toBe('NEC4 ACC.pdf')
+  })
+  it('extracts multiple different cited sources', () => {
+    const text = 'See [NEC4 ACC] and [report.pdf, page 5] for details.'
+    const result = extractCitedSources(text, mockSources)
+    expect(result).toHaveLength(2)
+    const filenames = result.map((s) => s.filename)
+    expect(filenames).toContain('NEC4 ACC.pdf')
+    expect(filenames).toContain('report.pdf')
+  })
+  it('citation with page number [NEC4 ACC, page 3] still resolves', () => {
+    const text = 'Clause info [NEC4 ACC, page 3] is important.'
+    const result = extractCitedSources(text, mockSources)
+    expect(result).toHaveLength(1)
+    expect(result[0].filename).toBe('NEC4 ACC.pdf')
+    expect(result[0].page_number).toBe(3)
+  })
+  it('citation with extension [NEC4 ACC.pdf] still resolves', () => {
+    const text = 'Clause info [NEC4 ACC.pdf] is important.'
+    const result = extractCitedSources(text, mockSources)
+    expect(result).toHaveLength(1)
+    expect(result[0].filename).toBe('NEC4 ACC.pdf')
+  })
+  it('handles empty sources list gracefully', () => {
+    const text = 'Clause info [NEC4 ACC] is important.'
+    const result = extractCitedSources(text, [])
+    expect(result).toEqual([])
+  })
+  it('handles empty answerText gracefully', () => {
+    const result = extractCitedSources('', mockSources)
+    expect(result).toEqual([])
+  })
+})

@ -124,3 +124,27 @@ export function processCitations(text: string, sources: SourceMetadata[]): strin
   const lookup = buildCitationLookup(sources)
   return replaceCitationPatterns(text, lookup)
 }
+export function extractCitedSources(answerText: string, sources: SourceMetadata[]): SourceMetadata[] {
+  if (!answerText.trim() || !sources.length) return []
+  const lookup = buildCitationLookup(sources)
+  const citationPattern = /(?<!!)\[([^\]]+)\](?!\()/g
+  const seen = new Set<string>()
+  const result: SourceMetadata[] = []
+  let match: RegExpExecArray | null
+  while ((match = citationPattern.exec(answerText)) !== null) {
+    const content = match[1].trim()
+    const source = findSource(content, lookup)
+    if (source) {
+      const key = `${source.document_id}_${source.chunk_index}`
+      if (!seen.has(key)) {
+        seen.add(key)
+        result.push(source)
+      }
+    }
+  }
+  return result
+}
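
The extraction step added in the `citationParser.ts` diff above can be tried in isolation with the sketch below. It is a simplified illustration, not the project's implementation: `buildCitationLookup` and `findSource` are not shown in this commit, so the hypothetical `resolveCitation` stands in for them with a plain filename match, and `MiniSource` is a reduced stand-in for `SourceMetadata`.

```typescript
// Reduced stand-in for SourceMetadata (assumption: only these fields matter here).
interface MiniSource {
  filename: string
  document_id: string | null
  chunk_index: number
}

// Hypothetical resolver: matches "[name]" or "[name.ext, page N]" against filenames.
function resolveCitation(content: string, sources: MiniSource[]): MiniSource | undefined {
  const name = content.split(',')[0].trim().toLowerCase()
  return sources.find((s) => {
    const full = s.filename.toLowerCase()
    const stem = full.replace(/\.[^.]+$/, '')
    return name === full || name === stem
  })
}

function extractCited(text: string, sources: MiniSource[]): MiniSource[] {
  if (!text.trim() || sources.length === 0) return []
  // Same pattern as the diff: skips images "![...]" and markdown links "[...](...)".
  const pattern = new RegExp('(?<!!)\\[([^\\]]+)\\](?!\\()', 'g')
  const seen = new Set<string>()
  const result: MiniSource[] = []
  let match: RegExpExecArray | null
  while ((match = pattern.exec(text)) !== null) {
    const source = resolveCitation(match[1], sources)
    if (!source) continue
    // Deduplicate on document and chunk, as the diff does.
    const key = `${source.document_id}_${source.chunk_index}`
    if (!seen.has(key)) {
      seen.add(key)
      result.push(source)
    }
  }
  return result
}
```

Deduplicating on `document_id` plus `chunk_index` keeps the batch request to one target per chunk, however many times the answer cites it.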