feat: cited source extraction, background batch trigger, and View PDF link upgrade (Phase 5.4.6-5.4.8)

- citationParser.ts: extractCitedSources() parses answer text for [citations], resolves against SourceMetadata, returns deduplicated cited sources
- ResponsePanel.tsx: useEffect fires POST /api/v1/v2/highlights/batch after answer renders; View PDF link upgrades in-place to highlighted HTML when batch completes; stays as raw PDF on failure
- Updated plan: LLM-based relevance detection, eager background computation, single batched LLM call, sqlite cache, regex sentence splitter
- 45 frontend tests: 28 citationParser + 17 ResponsePanel (including 4 new sub-question highlight tests)
2026-04-29 09:27:04 +08:00 · 2026-04-29 09:27:04 +08:00 · c632b9ea3b
parent a56f8f69e2
commit c632b9ea3b
5 changed files with 620 additions and 76 deletions
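
Reviewer sketch, not part of the commit: the commit message mentions a regex sentence splitter in the updated plan, and the plan diff below quotes the pattern `(?<=[.!?。！？\n])\s+`. The following is a minimal illustration of how that pattern behaves, assuming a standalone `split_sentences()` helper as named in Task 5.4.1. `CJK_AWARE_PATTERN` is a hypothetical variant that does not appear in the plan; it is included only to show that the quoted pattern requires whitespace after the punctuation, which CJK text usually lacks.

```python
import re

# Pattern quoted in the plan: split on whitespace that follows ASCII or
# full-width sentence punctuation, or a newline.
PLAN_PATTERN = re.compile(r'(?<=[.!?。！？\n])\s+')

# Hypothetical variant (an assumption, not from the plan): additionally split
# right after full-width punctuation even when no whitespace follows it.
CJK_AWARE_PATTERN = re.compile(r'(?<=[.!?\n])\s+|(?<=[。！？])\s*')


def split_sentences(text: str, pattern: re.Pattern = PLAN_PATTERN) -> list:
    """Return the ordered sentence strings, dropping empty pieces."""
    return [part for part in pattern.split(text) if part.strip()]


if __name__ == "__main__":
    mixed = "The date is 2024年1月1日。The Contractor shall submit."
    print(split_sentences(mixed))                     # plan pattern
    print(split_sentences(mixed, CJK_AWARE_PATTERN))  # hypothetical variant
```

With the plan's pattern the mixed English/Chinese example stays a single sentence, because no whitespace follows `。`. If sentence indices are meant to line up across languages, that case may be worth covering in `test_phase5_sentence_splitter.py`.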
--- a/.plans/package5_enhancement_plan.md
+++ b/.plans/package5_enhancement_plan.md
@@ -4,7 +4,7 @@
 **Scope**: 
 - Phase 5.1: Replace manual JSON parsing in the decompose stage with LangChain `with_structured_output()`
 - Phase 5.2: Fix missing PDF links in citations and improve citation robustness
-**Status**: Phases 5.1 ✅, 5.2 ✅ — 5.3 Deferred, 5.4 Planned (2026-04-28)
+**Status**: Phases 5.1 ✅, 5.2 ✅, 5.3 ✅ — 5.4 Planned (LLM-based, awaiting implementation)

 **LangChain version**: 1.2.15 (venv), `model_provider="openai"` with OpenRouter base URL (API-compatible proxy).

@@ -266,39 +266,66 @@ Generate per-chunk PDF files for DOCX/TXT documents at ingestion time so they ha

 When a user clicks a citation link to view a cited chunk, they see the full chunk text (up to ~1000 tokens). They have to manually scan to find which sentences actually drove the relevance. This is especially painful for long, dense chunks.

-### Solution
+### Solution: LLM-Based Relevance Detection + Eager Background Computation

-**On-the-fly highlighted HTML chunk views** served by the backend. When a citation link is clicked, the frontend passes the sub-question that retrieved that chunk. The backend splits the chunk into sentences, computes embedding similarity of each sentence to the sub-question, and returns a styled HTML page with relevant sentences highlighted.
+**Embedding similarity was rejected** — cosine similarity measures semantic proximity, not answer relevance. A sentence like *"The Contractor shall submit a quotation within three weeks"* has low embedding similarity to *"What are the compensation event timelines?"* but is directly relevant to answering it. Only an LLM can distinguish topic-matching from answer-relevance.
+
+**Strategy**: After the final response renders on the frontend, a **single background LLM call** identifies relevant sentences across all cited chunks (grouped by sub-question). Results are cached in SQLite. The "View Highlighted" button activates once caching completes. Clicks serve instantly from cache.
+
+**Why one LLM call, not one per chunk**: A sub-question typically cites only 2-3 chunks (even though up to 10 are retrieved). A single batched prompt with all sub-questions and their cited chunks costs ~1 LLM call per query — not N × M per chunk. Reuses the `complete_structured()` infrastructure from Phase 5.1.

 ### Why HTML, not PDF?

 | Approach | Complexity | Works for all doc types? | Preserves original formatting? |
 |---|---|---|---|
-| Highlighted HTML page | **Low** | ✅ Yes (uses chunk text) | ❌ Plain text only |
+| Highlighted HTML page | **Low** | ✅ Yes (uses chunk text from ChromaDB) | ❌ Plain text only |
 | Highlighted PDF via reportlab | Medium | ✅ Yes (new PDF) | ❌ Plain text only |
 | Overlay highlights on existing PDF | High | ⚠️ PDF only | ✅ Yes |

-**Recommendation: HTML page.** Simple, fast, works uniformly for PDF/DOCX/TXT chunks. Original formatting is preserved in the existing PDF viewer (`chunk_file_path` link) — the highlighted HTML view is a **supplementary** view reached via a separate button/link. The two views coexist: "View Original PDF" vs "View Highlighted Text".
+**Recommendation: HTML page.** Works uniformly across PDF/DOCX/TXT. Original formatting is preserved in the existing PDF viewer (`chunk_file_path` link) — the highlighted HTML view is a supplementary view reached via a separate button. The two views coexist: "View Original PDF" vs "View Highlighted Text".

-### How It Works (No LLM Needed)
+### Architecture

 ```
-User clicks citation [NEC4 ACC, chunk 3]
+EAGER PHASE (background, fires immediately after SSE completed):
+─────────────────────────────────────────────────
+Frontend receives SSE "completed" event
+    │
+    ├── Renders answer with citations instantly (no waiting)
+    │
+    └── useEffect fires POST /api/v1/v2/highlights/batch
              │
              ▼
-Frontend sends: GET /api/v1/chunks/highlight?document_id=abc&chunk_index=2&sub_question=...
+        Backend highlight batch service:
+          1. For each sub-question, extract cited chunks
+             (only 2-3 actually cited per sub-q, not all 10)
+          2. Fetch chunk texts from ChromaDB
+          3. Split each chunk into sentences
+          4. Single LLM call: identify which sentences are
+             relevant to their sub-question (structured output)
+          5. Render each chunk as highlighted HTML
+          6. Cache all HTML + metadata in SQLite
+          7. Return {status, cached_count} to frontend
               │
               ▼
-Backend:
-  1. Fetch chunk text from ChromaDB                          [chromadb get()]
-  2. Split into sentences                                    [nltk.sent_tokenize or regex]
-  3. Embed sub-question                                      [existing embedding model]
-  4. Embed each sentence (batch, parallel)                   [same model]
-  5. Compute cosine similarity per sentence vs sub-question  [numpy]
-  6. Return HTML with yellow background on sentences > threshold
+         Frontend receives response → upgrades "View PDF"
+         links on source cards to point to cached highlight pages
+         (same button, same position, upgraded destination)
+
+LAZY PHASE (on user click, instant from cache):
+──────────────────────────────────────────────
+User clicks "View PDF" on a source card
    │
    ▼
-Frontend renders HTML in an iframe or new tab
+GET /api/v1/v2/highlights?document_id=...&chunk_index=...&sub_question=...
+    │
+    ▼
+Backend reads cached HTML from SQLite → serves instantly
+    │
+    ▼
+Highlighted HTML page shows:
+  ┌─ Highlighted sentences (yellow background + LLM reasons)
+  └─ Footer: [View Original PDF →] link to raw PDF chunk
 ```

 ### What Gets Highlighted
@@ -312,19 +339,23 @@ Frontend renders HTML in an iframe or new tab
 │ the Project Manager. It shall include:                    │
 │                                                            │
 │ ████████████████████████████████████████████████████████ │
-│ █ The starting date, access dates, and Key Dates.       █ │  ← High similarity
+│ █ The starting date, access dates, and Key Dates.       █ │  ← LLM: relevant to timeline question
+│ █ "Defines when work begins and key milestones."        █ │  ← Brief reason from LLM
 │ ████████████████████████████████████████████████████████ │
 │                                                            │
 │ The Contractor shall submit a first programme within      │
 │ ████████████████████████████████████████████████████████ │
-│ █ two weeks of the starting date.                       █ │  ← High similarity
+│ █ two weeks of the starting date.                       █ │  ← LLM: relevant to timeline question
+│ █ "Specifies the deadline for initial programme."       █ │
 │ ████████████████████████████████████████████████████████ │
 │                                                            │
 │ The Project Manager may instruct the Contractor to        │
 │ submit a revised programme showing the effects of a       │
 │ compensation event. This does not affect the Contractor's │
-│ right to be paid for preparing the programme.             │  ← Low similarity (no highlight)
+│ right to be paid for preparing the programme.             │  ← Not highlighted (LLM: not relevant to timeline)
 │                                                            │
+├──────────────────────────────────────────────────────────┤
+│                                   [View Original PDF →]   │
 └──────────────────────────────────────────────────────────┘
 ```

@@ -332,75 +363,275 @@ Frontend renders HTML in an iframe or new tab

 | # | Decision | Rationale |
 |---|---|---|
-| 1 | HTML page, not PDF | Zero dependency (`reportlab` not needed). Faster to generate. CSS-based highlighting is more flexible. Original PDF view remains available separately. |
-| 2 | Embedding similarity, not LLM | No API cost, no latency. The embedding model is already running. Cosine similarity is cheap. |
-| 3 | Sentence-level granularity | Paragraph-level is too coarse (whole paragraph might be dimly relevant). Word/phrase-level is too noisy. Sentences are the natural unit of meaning. |
-| 4 | Embed sentences in batch | A 1000-token chunk has ~8-12 sentences. One batch embedding call is fast (single API round-trip). |
-| 5 | Configurable threshold (env var) | `HIGHLIGHT_SIMILARITY_THRESHOLD` (default 0.5). Tune per embedding model. |
-| 6 | Cache sentence embeddings per chunk | A chunk may be cited in multiple queries. Cache sentence embeddings in ChromaDB metadata or SQLite to avoid recomputation. |
-| 7 | Graceful degradation | If embedding fails → return plain text chunk view. If sentence splitting fails → highlight entire chunk. |
-| 8 | Frontend: "View Highlighted" link alongside "View PDF" | The existing PDF viewer link (`chunk_file_path`) stays. A second link opens the highlighted HTML view. Both visible, user chooses. |
+| 1 | HTML page, not PDF | Zero extra dependencies. CSS highlighting is more flexible. Original PDF remains available via separate link. |
+| 2 | **LLM relevance detection**, not embedding similarity | Embedding similarity measures topic overlap, not answer relevance. LLM distinguishes "about timelines" from "answers the timeline question." |
+| 3 | **Single batched LLM call** for all sub-questions + cited chunks | One API call per query, not one per chunk. Each sub-question typically cites 2-3 chunks. Total ~6-15 chunks per query in a single prompt. |
+| 4 | **Eager computation** (background after response renders) | User sees answer instantly. Highlight computation happens while they read. "View Highlighted" activates when ready. Zero perceived latency on click. |
+| 5 | **Only cited chunks** (not all retrieved chunks) | Up to 10 chunks retrieved per sub-question, but only 2-3 are actually cited in the answer. Computing highlights for uncited chunks wastes tokens. |
+| 6 | **SQLite cache** for highlight results | Durable across restarts. Same chunk+sub-question pair only computed once. Subsequent clicks (or same query re-run) serve instantly. Cache key = hash(document_id + chunk_index + sub_question). |
+| 7 | Reuse `complete_structured()` from Phase 5.1 | `LLMClient.complete_structured(prompt, HighlightBatchResult)` returns validated Pydantic model. No new LLM calling infrastructure. |
+| 8 | Sentence-level granularity | Paragraphs are too coarse, words are too noisy. Sentences are the natural unit of meaning. LLM returns sentence indices. |
+| 9 | Regex sentence splitter (no NLTK) | NLTK adds 35MB deploy weight for edge-case handling the project doesn't need. `re.split(r'(?<=[.!?。！？\n])\s+')` handles English + Cantonese legal text. |
+| 10 | Sentence indices, not text spans | Deterministic: the splitter always produces the same sentences from the same chunk text. LLM returns indices to avoid substring matching fragility. |
+| 11 | Graceful degradation | LLM call fails → cache miss → GET returns 404. Frontend keeps "View Highlighted" disabled/loading until batch succeeds. If batch fails, button stays hidden. |
+| 12 | "View Highlighted" on source cards only | Source cards have sub-question context via `SubQuestionSection`. Inline citations in answer text lack this context (flattened `allSources`). Adding `sub_question_index` to `SourceMetadata` deferred. |
+| 13 | **Upgrade "View PDF" link in place (no second button)** | Before the LLM completes, the "View PDF" link opens the raw PDF chunk page (existing behavior). Once the batch LLM call finishes caching, the same link upgrades to open the highlighted HTML page instead. The highlighted page includes a "View Original PDF →" footer link. If the LLM call fails, the link stays as raw PDF. One link, seamless upgrade. |
+
+### Data Model
+
+#### Batch Request (POST)
+
+```python
+class ChunkHighlightTarget(BaseModel):
+    """A single chunk to highlight, tied to its sub-question."""
+    document_id: str
+    chunk_index: int
+    sub_question_text: str
+    sub_question_index: int
+
+class HighlightBatchRequest(BaseModel):
+    """Batch request: all cited chunks grouped by sub-question."""
+    targets: list[ChunkHighlightTarget]
+```
+
+#### Structured LLM Output
+
+```python
+class RelevantSentence(BaseModel):
+    sentence_index: int = Field(description="0-based index of the relevant sentence")
+    reason: str = Field(description="Brief explanation of why this sentence is relevant (max 80 chars)")
+
+class ChunkHighlights(BaseModel):
+    document_id: str
+    chunk_index: int
+    relevant_sentences: list[RelevantSentence] = Field(default_factory=list)
+
+class HighlightBatchResult(BaseModel):
+    """Single LLM call returns highlights for all sub-questions and chunks."""
+    results: list[ChunkHighlights]
+```
+
+#### Batch Response (POST response, JSON)
+
+```python
+class HighlightBatchResponse(BaseModel):
+    status: Literal["completed", "partial", "failed"]
+    cached_count: int  # Number of highlights successfully computed and cached
+    errors: list[str] = []  # Per-chunk errors (LLM failed on some but not all)
+```
+
+#### SQLite Cache Table
+
+```sql
+CREATE TABLE IF NOT EXISTS chunk_highlights (
+    id INTEGER PRIMARY KEY AUTOINCREMENT,
+    cache_key TEXT UNIQUE NOT NULL,  -- sha256(document_id + chunk_index + sub_question)
+    document_id TEXT NOT NULL,
+    chunk_index INTEGER NOT NULL,
+    sub_question TEXT NOT NULL,
+    relevant_sentences_json TEXT NOT NULL,  -- JSON of RelevantSentence[]
+    html_content TEXT NOT NULL,
+    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
+);
+CREATE INDEX IF NOT EXISTS idx_highlights_cache_key ON chunk_highlights(cache_key);
+```
+
+#### GET Endpoint Response
+
+```
+GET /api/v1/v2/highlights?document_id=...&chunk_index=...&sub_question=...
+→ 200 text/html  (cache hit: highlighted HTML page)
+→ 404            (cache miss: batch not yet computed or failed)
+```
+
+### LLM Prompt Structure (Single Call, All Sub-Questions)
+
+```
+For each sub-question below, identify which sentences in each cited chunk are directly relevant to answering that sub-question. Return only the index and a brief reason for each relevant sentence. Skip sentences that are about the same topic but don't contribute to the answer.
+
+Sub-question 1: "What are the compensation event timelines in NEC4?"
+  [Chunk: NEC4 ACC, chunk_index=2]
+    [0] The Contractor shall submit a quotation within three weeks of the instruction.
+    [1] The Project Manager may instruct the Contractor to submit a revised programme.
+    [2] This does not affect the Contractor's right to be paid for preparing the programme.
+  [Chunk: NEC4 ACC, chunk_index=5]
+    [0] Compensation events are events which change the total of the Prices.
+    [1] The Project Manager shall notify the Contractor within one week of becoming aware.
+
+Sub-question 2: "What are the key dates and milestones?"
+  [Chunk: NEC4 ACC, chunk_index=1]
+    [0] The starting date shall be defined as the first working day.
+    [1] Key Dates are dates by which the Contractor must complete work.
+```

 ### Implementation Tasks

 #### Task 5.4.1: Backend — Sentence splitting utility

 - [ ] Create `backend/app/utils/sentence_splitter.py`
- [ ] Function `split_sentences(text: str) -> list[dict]` returns `[{text, start_char, end_char}, ...]`
- [ ] Use `nltk.sent_tokenize` with fallback to regex (`re.split(r'(?<=[.!?])\s+')`)
- [ ] NLTK punkt data auto-downloaded on first use (or bundled)
+- [ ] Function `split_sentences(text: str) -> list[str]` returns ordered sentence strings
+- [ ] Use regex: `re.split(r'(?<=[.!?。！？\n])\s+', text)` (handles English + Chinese punctuation)
+- [ ] Strip empty strings, preserve original sentence text whitespace
 - [ ] Handle edge cases: empty text, single sentence, lists/bullets
 - **Test file**: `test_phase5_sentence_splitter.py`

-#### Task 5.4.2: Backend — Highlighted chunk endpoint
+#### Task 5.4.2: Backend — Highlight models (`HighlightBatchResult` and related Pydantic schemas)

- [ ] New endpoint: `GET /api/v1/chunks/highlight`
- [ ] Query params: `document_id`, `chunk_index`, `sub_question`
- [ ] Returns `text/html` (not JSON)
- [ ] Logic in `backend/app/services/chunk_highlight_service.py`:
-  1. Fetch chunk from ChromaDB by `document_id` + `chunk_index`
-  2. Split into sentences via `split_sentences()`
-  3. Get embedding for `sub_question` via existing embedding model
-  4. Get embeddings for all sentences in one batch call
-  5. Compute cosine similarity: `np.dot(q_emb, s_emb) / (norm(q) * norm(s))`
-  6. Mark sentences with similarity > threshold as highlighted
-  7. Render HTML template with inline CSS (yellow background, subtle border)
- **Test file**: `test_phase5_chunk_highlight.py`
+- [ ] Create `backend/app/models/highlight.py` with:
+  - `ChunkHighlightTarget` (request input)
+  - `HighlightBatchRequest` (request input)
+  - `RelevantSentence` (LLM output unit)
+  - `ChunkHighlights` (LLM output per chunk)
+  - `HighlightBatchResult` (LLM output wrapper)
+  - `HighlightBatchResponse` (POST response)
+- **Test file**: `test_phase5_highlight_models.py`

-#### Task 5.4.3: Frontend — "View Highlighted" link in citations and sources
+#### Task 5.4.3: Backend — SQLite cache for highlight results

- [ ] In `citationParser.ts` and `ResponsePanel.tsx`, add a "🔍" or "View Highlighted" link next to each source
- [ ] Link target: `/api/v1/chunks/highlight?document_id=...&chunk_index=...&sub_question=...`
- [ ] The sub-question is the one that retrieved this chunk (already available in the sources structure: `source.sub_question_index` → look up sub-question text)
- [ ] Open in new tab or modal
- **Test file**: Update `citationParser.test.ts` and `ResponsePanel.test.tsx`
+- [ ] Add `chunk_highlights` table to `backend/app/services/highlight_cache.py`
+- [ ] Functions: `get_highlight(cache_key) -> str | None`, `set_highlight(cache_key, html_content, relevant_sentences_json)`, `compute_cache_key(document_id, chunk_index, sub_question) -> str`
+- [ ] Database path: `backend/app/data/highlights.db` (separate from prompts/history to avoid schema conflicts)
+- [ ] Initialize table on first use (same pattern as `prompt_service.py` / `history_service.py`)
+- **Test file**: `test_phase5_highlight_cache.py`

-#### Task 5.4.4: Integration testing
+#### Task 5.4.4: Backend — Highlight batch service (LLM call + HTML rendering)

- [ ] Verify highlight endpoint returns 200 with valid HTML for all doc types (PDF, DOCX, TXT)
- [ ] Verify sentence highlighting is proportional to relevance (spot-check manually)
- [ ] Verify caching works (second request for same chunk is faster)
- [ ] Verify graceful degradation (embedding API down → plain text still served)
- [ ] Run full test suite
+- [ ] Create `backend/app/services/chunk_highlight_service.py`
+- [ ] Method `compute_highlights_batch(targets: list[ChunkHighlightTarget]) -> HighlightBatchResponse`:
+  1. Group targets by sub-question for prompt construction
+  2. For each target, fetch chunk text + metadata from ChromaDB (`collection.get(ids=[f"{document_id}_{chunk_index}"])`)
+  3. Split chunk text into sentences via `split_sentences()`
+  4. Build a single batched prompt (see Prompt Structure above)
+  5. Call `llm_client.complete_structured(prompt, HighlightBatchResult)` → validated Pydantic result
+  6. For each `ChunkHighlights` in result: render HTML, compute cache key, store in SQLite
+  7. Return `HighlightBatchResponse(status="completed", cached_count=N)`
+- [ ] Method `render_highlight_html(chunk_text, sentences, relevant_sentences, metadata) -> str`:
+  - Inline CSS: yellow background (`#fef08a`) on relevant sentences, grey reason text below each
+  - Header: filename, chunk index, page number, sub-question text
+  - Footer: "View Original PDF →" link (if `chunk_file_path` exists)
+  - Returns complete self-contained HTML page
+- [ ] Graceful degradation per target (not per batch): if one chunk fetch fails, skip it and continue others. If LLM call fails entirely, return `status="failed"`.
+- **Test file**: `test_phase5_chunk_highlight_service.py`
+
+#### Task 5.4.5: Backend — REST endpoints
+
+- [ ] `POST /api/v1/v2/highlights/batch` in `backend/app/routers/chunks.py` (new file):
+  - Accepts `HighlightBatchRequest`
+  - Calls `chunk_highlight_service.compute_highlights_batch()`
+  - Returns `HighlightBatchResponse` (JSON)
+- [ ] `GET /api/v1/v2/highlights` in same router:
+  - Query params: `document_id`, `chunk_index`, `sub_question`
+  - Computes cache key, reads from SQLite
+  - Cache hit → returns `text/html`
+  - Cache miss → returns 404
+- [ ] Register router in `main.py`
+- **Test file**: `test_phase5_highlight_endpoints.py`
+
+#### Task 5.4.6: Frontend — Extract cited chunks from answer text
+
+- [ ] Add `extractCitedSources(answerText: string, sources: SourceMetadata[]): SourceMetadata[]` to `citationParser.ts`
+  - Uses the same `citationPattern` regex as `replaceCitationPatterns()`
+  - For each `[citation]` match, resolves against `buildCitationLookup(sources)`
+  - Returns deduplicated list of `SourceMetadata` that were actually cited in the answer
+- [ ] Exported alongside existing citation functions
+- **Test file**: Update `citationParser.test.ts` with cited source extraction tests
+
+#### Task 5.4.7: Frontend — Background batch trigger on response completion
+
+- [ ] In `ResponsePanel.tsx` (inside `SubQuestionSections` component):
+  - After answer is rendered (non-null, not loading), `useEffect` fires:
+    1. Parse answer sections → extract cited sources per sub-question
+    2. Build `HighlightBatchRequest` payload
+    3. POST `/api/v1/v2/highlights/batch`
+    4. On success → store which (document_id, chunk_index, sub_question) pairs are cached in a `useState(Set<string>)`
+    5. On failure → log error (no UI change — "View Highlighted" stays hidden)
+  - Source cards check `isHighlightReady(document_id, chunk_index, sub_q_text)` → shows/hides "View Highlighted" button
+- **Test file**: Update `ResponsePanel.test.tsx`
+
+#### Task 5.4.8: Frontend — Upgrade "View PDF" link on source cards
+
+- [ ] In `SubQuestionSourceCard` (ResponsePanel.tsx):
+  - **Before batch completes**: "View PDF" link opens raw PDF chunk page (existing behavior, unchanged)
+  - **After batch completes**: Same link upgrades to open `/api/v1/v2/highlights?document_id=...&chunk_index=...&sub_question=...` (highlighted HTML page)
+  - The link stays in the same position — no repositioning, no second element
+  - While batch is loading: no visual change (the PDF link remains functional; the highlight upgrade is invisible to the user until ready)
+  - If batch fails: link stays as raw PDF permanently (graceful degradation)
+- [ ] The highlighted HTML page includes a `[View Original PDF →]` footer link (rendered by backend in Task 5.4.4) so users can always reach the raw PDF
+- [ ] No label change needed — "View PDF" remains appropriate since the user gets a page-level view of the chunk regardless
+- **Test file**: Update `ResponsePanel.test.tsx` — verify link href upgrades after mock batch response, verify link stays as raw PDF when batch fails, verify loading state shows no visual disruption
+
+#### Task 5.4.9: Integration testing
+
+- [ ] Verify batch endpoint: POST with 3 targets → all cached → GET returns highlighted HTML for each
+- [ ] Verify cache: same GET twice → both hit SQLite (no second LLM call)
+- [ ] Verify graceful degradation: LLM call fails → batch returns `status="failed"` → GET returns 404 → frontend hides buttons
+- [ ] Verify cross-document-type: PDF, DOCX, TXT chunks all work (chunk text comes from ChromaDB regardless of source format)
+- [ ] Verify sentence splitting handles mixed English/Chinese legal text (e.g., "The date is 2024年1月1日。The Contractor shall...")
+- [ ] Run full test suite: `cd backend && pytest -v` and `cd frontend && npm test`

 ### Test Files

 | # | Test File | Coverage |
 |---|-----------|----------|
-| T5.4.1 | `backend/app/test/test_phase5_sentence_splitter.py` | Sentence splitting: English, mixed punctuation, empty, single sentence, bullet lists |
-| T5.4.2 | `backend/app/test/test_phase5_chunk_highlight.py` | Highlight endpoint: valid request → HTML with highlights, threshold filtering, no sentences above threshold → all plain, missing document/chunk → 404, embedding failure → fallback plain text |
-| T5.4.3 | `frontend/src/test/utils/citationParser.test.ts` (update) | Citation links include highlight URL when sub-question context available |
-| T5.4.4 | `frontend/src/test/components/ResponsePanel.test.tsx` (update) | Sources section renders "View Highlighted" link alongside "View PDF" |
+| T5.4.1 | `backend/app/test/test_phase5_sentence_splitter.py` | Sentence splitting: English, Chinese, mixed, empty, single sentence, bullet lists |
+| T5.4.2 | `backend/app/test/test_phase5_highlight_models.py` | Pydantic validation: valid/invalid targets, empty batch, max batch size, LLM result parsing |
+| T5.4.3 | `backend/app/test/test_phase5_highlight_cache.py` | SQLite cache: insert, retrieve, cache key collision, missing key, JSON serialization |
+| T5.4.4 | `backend/app/test/test_phase5_chunk_highlight_service.py` | Service: valid batch → cached HTML, mocked LLM returns structured result, per-target error isolation, empty batch, ChromaDB miss → skip target |
+| T5.4.5 | `backend/app/test/test_phase5_highlight_endpoints.py` | Endpoints: POST returns cache count, GET returns HTML, GET cache miss → 404, invalid params → 422 |
+| T5.4.6 | `frontend/src/test/utils/citationParser.test.ts` (update) | `extractCitedSources()`: extracts cited filenames, deduplicates, handles no citations, handles partial matches |
+| T5.4.7 | `frontend/src/test/components/ResponsePanel.test.tsx` (update) | "View PDF" link upgrades to highlight page after batch completes, link stays as raw PDF when batch fails, no visual disruption during loading |

 ### Acceptance Tests

 | # | Test File | Coverage |
 |---|-----------|----------|
-| AT5.4.1 | `backend/app/test/acceptance/test_acceptance_phase5_highlight.py` | Real LLM query → real embeddings → open highlighted view → verify yellow spans exist on relevant sentences |
+| AT5.4.1 | `backend/app/test/acceptance/test_acceptance_phase5_highlight.py` | Real LLM query → real chunk text → batch POST with real LLM → verify cached HTML contains yellow-highlighted spans on genuinely relevant sentences → spot-check 3 chunks manually |
+
+---
+## Updated Dependency Graph
+
+```
+Phase 5.1 (Structured Output) ✅
+Phase 5.2 (Citation Linking)   ✅
+Phase 5.3 (DOCX/TXT PDF Gen)   ✅
+Phase 5.4 (LLM Highlighting)
+   Task 5.4.1 (sentence splitter)
+        │
+        ├── Task 5.4.2 (highlight models)
+        │        │
+        │        ├── Task 5.4.3 (SQLite cache)
+        │        │        │
+        │        │        ├── Task 5.4.4 (batch service + HTML rendering)
+        │        │        │        │
+        │        │        │        └── Task 5.4.5 (REST endpoints)
+        │        │        │
+        │        │        └── Task 5.4.9 (integration tests)
+        │        │
+        │        └── Task 5.4.6 (frontend cited source extraction)
+        │                 │
+        │                 ├── Task 5.4.7 (frontend batch trigger)
+        │                 │        │
+        │                 │        └── Task 5.4.8 (frontend link upgrade)
+        │                 │
+        │                 └── Task 5.4.9 (integration tests)
+        │
+        └── All tasks → Task 5.4.9 (full suite)
+```
+
+### Updated Commit Plan
+
+| Commit | Message | Scope |
+|--------|---------|-------|
+| 7 | `feat: add sentence splitter and highlight data models` | Tasks 5.4.1 + 5.4.2 + tests |
+| 8 | `feat: add SQLite highlight cache service` | Task 5.4.3 + tests |
+| 9 | `feat: add LLM-based batch highlight service and HTML rendering` | Task 5.4.4 + tests |
+| 10 | `feat: add highlight batch and GET endpoints` | Task 5.4.5 + tests |
+| 11 | `feat: add cited source extraction and background batch trigger` | Tasks 5.4.6 + 5.4.7 + tests |
+| 12 | `feat: upgrade View PDF link to highlighted view on source cards` | Task 5.4.8 + tests |

 ---

-## Commit Plan
+## Complete Commit Plan (All Phases)

 | Commit | Message | Scope |
 |--------|---------|-------|
@@ -410,4 +641,9 @@ Frontend renders HTML in an iframe or new tab
 | 4 | `chore: update decompose seed prompt for structured output` | Task 5.1.5 |
 | 5 | `feat: add document_id to SourceMetadata model` | Task 5.2.1 + tests |
 | 6 | `feat: fuzzy citation matching and document fallback links` | Tasks 5.2.3 + 5.2.4 + tests |
-| 7 | `feat: sentence-level chunk highlighting via embedding similarity` | Phase 5.4 (all tasks) |
+| 7 | `feat: add sentence splitter and highlight data models` | Tasks 5.4.1 + 5.4.2 + tests |
+| 8 | `feat: add SQLite highlight cache service` | Task 5.4.3 + tests |
+| 9 | `feat: add LLM-based batch highlight service and HTML rendering` | Task 5.4.4 + tests |
+| 10 | `feat: add highlight batch and GET endpoints` | Task 5.4.5 + tests |
+| 11 | `feat: add cited source extraction and background batch trigger` | Tasks 5.4.6 + 5.4.7 + tests |
+| 12 | `feat: upgrade View PDF link to highlighted view on source cards` | Task 5.4.8 + tests |
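
Reviewer sketch, not part of the commit: design decision 6 in the plan above defines the cache key as `hash(document_id + chunk_index + sub_question)`, and the SQL comment names `sha256`. A minimal sketch of `compute_cache_key()` under that description follows. The `\x1f` field separator is an assumption added here, not something the plan specifies; it is included because plain concatenation can map two different (document, chunk) pairs to the same string.

```python
import hashlib

# Assumed separator (not in the plan): ASCII unit separator, unlikely to
# appear in document IDs or sub-question text.
FIELD_SEPARATOR = "\x1f"


def naive_key_material(document_id: str, chunk_index: int, sub_question: str) -> str:
    """Plain concatenation, as the plan's shorthand literally reads."""
    return f"{document_id}{chunk_index}{sub_question}"


def compute_cache_key(document_id: str, chunk_index: int, sub_question: str) -> str:
    """SHA-256 hex digest over the three fields joined with a separator."""
    material = FIELD_SEPARATOR.join([document_id, str(chunk_index), sub_question])
    return hashlib.sha256(material.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    q = "What are the key dates and milestones?"
    # Two different targets whose naive concatenations are identical.
    print(naive_key_material("doc1", 23, q) == naive_key_material("doc12", 3, q))
    print(compute_cache_key("doc1", 23, q) == compute_cache_key("doc12", 3, q))
```

The frontend diff below builds its readiness keys as `document_id`, `chunk_index`, and the URL-encoded sub-question joined with underscores. That key only needs to be consistent within the browser session, so it does not have to match the backend hash.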
--- a/frontend/src/components/ResponsePanel.tsx
+++ b/frontend/src/components/ResponsePanel.tsx
@@ -1,11 +1,15 @@
-import React, { useState } from 'react'
+import React, { useState, useEffect } from 'react'
 import { MessageSquare, AlertCircle, Copy, ChevronDown, ChevronRight } from 'lucide-react'
 import ReactMarkdown from 'react-markdown'
 import type { SourceMetadata, SubQuestionSources } from '../types'
 import { getPdfViewerUrl } from '../lib/api'
-import { processCitations, processCitationsForSubq } from '../utils/citationParser'
+import { processCitations, processCitationsForSubq, extractCitedSources } from '../utils/citationParser'
 import { bulletizeMarkdown } from '../utils/citationParser'

+function getHighlightUrl(document_id: string, chunk_index: number, sub_question: string): string {
+  return `/api/v1/v2/highlights?document_id=${encodeURIComponent(document_id)}&chunk_index=${chunk_index}&sub_question=${encodeURIComponent(sub_question)}`
+}
+
 interface ResponsePanelProps {
  answer: string | null
  sources?: SourceMetadata[]
@@ -32,7 +36,17 @@ function parseAnswerSections(answer: string): string[] {
  return sections.filter((s) => s.trim().length > 0).map((s) => bulletizeMarkdown(s.trim()))
 }

-function SubQuestionSourceCard({ source, index }: { source: SourceMetadata; index: number }) {
+function SubQuestionSourceCard({ 
+  source, 
+  index, 
+  highlightReady = false,
+  subQuestionText = '',
+}: { 
+  source: SourceMetadata
+  index: number
+  highlightReady?: boolean
+  subQuestionText?: string 
+}) {
  return (
    <div
      key={index}
@@ -48,11 +62,15 @@ function SubQuestionSourceCard({ source, index }: { source: SourceMetadata; inde
      </div>
      <div className="text-sm text-gray-500">{source.upload_date}</div>
      <div className="text-sm text-gray-600 mt-1">{source.content_summary}</div>
-      <div className="flex items-center justify-between mt-1">
+      <div className="flex items-center gap-2 mt-1">
        <div className="text-xs text-gray-400">Chunk {source.chunk_index}</div>
        {source.chunk_file_path && (
          <a
-            href={getPdfViewerUrl(source.chunk_file_path, source.page_number ?? undefined, source.filename)}
+            href={
+              highlightReady && source.document_id
+                ? getHighlightUrl(source.document_id, source.chunk_index, subQuestionText)
+                : getPdfViewerUrl(source.chunk_file_path, source.page_number ?? undefined, source.filename)
+            }
            target="_blank"
            rel="noopener noreferrer"
            className="text-xs text-blue-600 hover:text-blue-800 hover:underline"
@ -71,11 +89,13 @@ function SubQuestionSection({
  subQuestion,
  answerSection,
  allSubQuestionSources,
+  highlightReadyKeys,
 }: {
  index: number
  subQuestion: SubQuestionSources
  answerSection: string
  allSubQuestionSources: SubQuestionSources[]
+  highlightReadyKeys: Set<string>
 }) {
  const [expanded, setExpanded] = useState(false)
  // Look up citations across ALL sub-questions' sources because the LLM
@ -117,7 +137,13 @@ function SubQuestionSection({
          {expanded && (
            <div data-testid="sources-container" className="grid grid-cols-2 gap-2">
              {subQuestion.sources.map((source, idx) => (
-                <SubQuestionSourceCard key={idx} source={source} index={idx} />
+                <SubQuestionSourceCard
+                  key={idx}
+                  source={source}
+                  index={idx}
+                  highlightReady={source.document_id ? highlightReadyKeys.has(`${source.document_id}_${source.chunk_index}_${encodeURIComponent(subQuestion.sub_question_text)}`) : false}
+                  subQuestionText={subQuestion.sub_question_text}
+                />
              ))}
            </div>
          )}
@ -137,6 +163,57 @@ function SubQuestionSections({
  isLoading?: boolean
 }) {
  const [copied, setCopied] = useState(false)
+  const [highlightReadyKeys, setHighlightReadyKeys] = useState<Set<string>>(new Set())
+
+  useEffect(() => {
+    if (!answer || isLoading || !subQuestionSources.length) return
+
+    const targets: Array<{
+      document_id: string
+      chunk_index: number
+      sub_question_text: string
+      sub_question_index: number
+    }> = []
+
+    const sections = parseAnswerSections(answer)
+
+    subQuestionSources.forEach((sq) => {
+      const answerSection = sections[sq.sub_question_index] ?? ''
+      const citedSources = extractCitedSources(answerSection, sq.sources)
+      citedSources.forEach((source) => {
+        if (source.document_id) {
+          targets.push({
+            document_id: source.document_id,
+            chunk_index: source.chunk_index,
+            sub_question_text: sq.sub_question_text,
+            sub_question_index: sq.sub_question_index,
+          })
+        }
+      })
+    })
+
+    if (targets.length === 0) return
+
+    fetch('/api/v1/v2/highlights/batch', {
+      method: 'POST',
+      headers: { 'Content-Type': 'application/json' },
+      body: JSON.stringify({ targets }),
+    })
+      .then((res) => res.json())
+      .then((data) => {
+        if (data.status === 'completed' || data.status === 'partial') {
+          const keys = new Set<string>()
+          targets.forEach((t) => {
+            keys.add(`${t.document_id}_${t.chunk_index}_${encodeURIComponent(t.sub_question_text)}`)
+          })
+          setHighlightReadyKeys(keys)
+        }
+      })
+      .catch((err) => {
+        console.error('Highlight batch computation failed:', err)
+      })
+  }, [answer, isLoading, subQuestionSources])
+
  const sections = answer ? parseAnswerSections(answer) : []

  const handleCopyAnswer = async (): Promise<void> => {
@ -217,6 +294,7 @@ function SubQuestionSections({
            subQuestion={subQuestion}
            answerSection={sections[index] ?? ''}
            allSubQuestionSources={subQuestionSources}
+            highlightReadyKeys={highlightReadyKeys}
          />
        ))}
      </div>
--- a/frontend/src/test/components/ResponsePanel.test.tsx
+++ b/frontend/src/test/components/ResponsePanel.test.tsx
@ -1,7 +1,7 @@
 import React from 'react'
 import { render, screen, fireEvent, waitFor } from '@testing-library/react'
 import { ResponsePanel } from '../../components/ResponsePanel'
-import type { SourceMetadata } from '../../types'
+import type { SourceMetadata, SubQuestionSources } from '../../types'

 vi.mock('../../lib/api', () => ({
  getChunkPdfUrl: (filePath: string) => `http://localhost:8000/api/v1/chunks/${filePath}/pdf`,
@ -20,6 +20,7 @@ describe('ResponsePanel', () => {
      chunk_index: 0,
      page_number: 1,
      chunk_file_path: 'test_chunk_1.pdf',
+      document_id: null,
    },
    {
      filename: 'document2.txt',
@ -28,6 +29,7 @@ describe('ResponsePanel', () => {
      chunk_index: 1,
      page_number: null,
      chunk_file_path: null,
+      document_id: null,
    },
  ]

@ -181,6 +183,7 @@ describe('ResponsePanel', () => {
        chunk_index: 0,
        page_number: 3,
        chunk_file_path: 'chunk_0.pdf',
+        document_id: null,
      },
    ]
    const answer = 'The threshold is HK$1,000,000 [NEC4 ACC.pdf, page 3].'
@ -215,4 +218,149 @@ describe('ResponsePanel', () => {
    expect(screen.getByText(/unknown_file\.pdf, page 10/)).toBeInTheDocument()
    expect(screen.queryByRole('link', { name: /unknown_file/ })).not.toBeInTheDocument()
  })
+
+  describe('SubQuestionSections highlight batch', () => {
+    const mockFetch = vi.fn()
+    const mockSubQuestionSources: SubQuestionSources[] = [
+      {
+        sub_question_index: 0,
+        sub_question_text: 'What is the threshold?',
+        sources: [
+          {
+            filename: 'NEC4 ACC.pdf',
+            upload_date: '2024-01-15',
+            content_summary: 'Contract terms',
+            chunk_index: 0,
+            page_number: 3,
+            chunk_file_path: 'chunk_0.pdf',
+            document_id: 'doc-123',
+          },
+        ],
+      },
+    ]
+
+    beforeEach(() => {
+      mockFetch.mockReset()
+      vi.stubGlobal('fetch', mockFetch)
+    })
+
+    afterEach(() => {
+      vi.unstubAllGlobals()
+    })
+
+    it('calls batch highlight endpoint with cited sources', async () => {
+      mockFetch.mockResolvedValue({
+        json: async () => ({ status: 'completed' }),
+      } as Response)
+
+      render(
+        <ResponsePanel
+          answer="The threshold is HK$1,000,000 [NEC4 ACC.pdf, page 3]."
+          subQuestionSources={mockSubQuestionSources}
+          isLoading={false}
+          error={null}
+        />
+      )
+
+      await waitFor(() => {
+        expect(mockFetch).toHaveBeenCalledTimes(1)
+        expect(mockFetch).toHaveBeenCalledWith(
+          '/api/v1/v2/highlights/batch',
+          expect.objectContaining({
+            method: 'POST',
+            headers: { 'Content-Type': 'application/json' },
+            body: expect.stringContaining('doc-123'),
+          })
+        )
+      })
+    })
+
+    it('shows View PDF link with normal URL when highlights are not ready', async () => {
+      mockFetch.mockResolvedValue({
+        json: async () => ({ status: 'pending' }),
+      } as Response)
+
+      render(
+        <ResponsePanel
+          answer="The threshold is HK$1,000,000 [NEC4 ACC.pdf, page 3]."
+          subQuestionSources={mockSubQuestionSources}
+          isLoading={false}
+          error={null}
+        />
+      )
+
+      await waitFor(() => {
+        expect(mockFetch).toHaveBeenCalled()
+      })
+
+      const toggleButton = screen.getByTestId('sources-toggle')
+      fireEvent.click(toggleButton)
+
+      const link = screen.getByTestId('view-chunk-pdf-link')
+      expect(link).toHaveAttribute('href', expect.stringContaining('/pdf-viewer'))
+    })
+
+    it('upgrades View PDF link to highlight URL when batch completes', async () => {
+      mockFetch.mockResolvedValue({
+        json: async () => ({ status: 'completed' }),
+      } as Response)
+
+      render(
+        <ResponsePanel
+          answer="The threshold is HK$1,000,000 [NEC4 ACC.pdf, page 3]."
+          subQuestionSources={mockSubQuestionSources}
+          isLoading={false}
+          error={null}
+        />
+      )
+
+      await waitFor(() => {
+        expect(mockFetch).toHaveBeenCalled()
+      })
+
+      const toggleButton = screen.getByTestId('sources-toggle')
+      fireEvent.click(toggleButton)
+
+      await waitFor(() => {
+        const link = screen.getByTestId('view-chunk-pdf-link')
+        expect(link).toHaveAttribute('href', expect.stringContaining('/api/v1/v2/highlights'))
+        expect(link).toHaveAttribute('href', expect.stringContaining('doc-123'))
+      })
+    })
+
+    it('does not upgrade link when source has no document_id', async () => {
+      const sourcesWithoutDocId: SubQuestionSources[] = [
+        {
+          sub_question_index: 0,
+          sub_question_text: 'What is the threshold?',
+          sources: [
+            {
+              filename: 'NEC4 ACC.pdf',
+              upload_date: '2024-01-15',
+              content_summary: 'Contract terms',
+              chunk_index: 0,
+              page_number: 3,
+              chunk_file_path: 'chunk_0.pdf',
+              document_id: null,
+            },
+          ],
+        },
+      ]
+
+      render(
+        <ResponsePanel
+          answer="The threshold is HK$1,000,000 [NEC4 ACC.pdf, page 3]."
+          subQuestionSources={sourcesWithoutDocId}
+          isLoading={false}
+          error={null}
+        />
+      )
+
+      const toggleButton = screen.getByTestId('sources-toggle')
+      fireEvent.click(toggleButton)
+
+      const link = screen.getByTestId('view-chunk-pdf-link')
+      expect(link).toHaveAttribute('href', expect.stringContaining('/pdf-viewer'))
+    })
+  })
 })
--- a/frontend/src/test/utils/citationParser.test.ts
+++ b/frontend/src/test/utils/citationParser.test.ts
@ -1,5 +1,5 @@
 import { describe, it, expect } from 'vitest'
-import { processCitations, bulletizeMarkdown } from '../../utils/citationParser'
+import { processCitations, bulletizeMarkdown, extractCitedSources } from '../../utils/citationParser'
 import type { SourceMetadata } from '../../types'

 const mockSources: SourceMetadata[] = [
@ -182,3 +182,61 @@ describe('bulletizeMarkdown', () => {
    expect(bulletizeMarkdown('  \n  ')).toBe('')
  })
 })
+
+describe('extractCitedSources', () => {
+  it('extracts cited source when citation [NEC4 ACC] matches source filename', () => {
+    const text = 'Clause info [NEC4 ACC] is important.'
+    const result = extractCitedSources(text, mockSources)
+    expect(result).toHaveLength(1)
+    expect(result[0].filename).toBe('NEC4 ACC.pdf')
+    expect(result[0].document_id).toBe('doc-001')
+  })
+
+  it('returns empty array when no citations in text', () => {
+    const text = 'This text has no citations at all.'
+    const result = extractCitedSources(text, mockSources)
+    expect(result).toEqual([])
+  })
+
+  it('deduplicates same source cited multiple times', () => {
+    const text = 'First [NEC4 ACC] and again [NEC4 ACC, page 3] same source.'
+    const result = extractCitedSources(text, mockSources)
+    expect(result).toHaveLength(1)
+    expect(result[0].filename).toBe('NEC4 ACC.pdf')
+  })
+
+  it('extracts multiple different cited sources', () => {
+    const text = 'See [NEC4 ACC] and [report.pdf, page 5] for details.'
+    const result = extractCitedSources(text, mockSources)
+    expect(result).toHaveLength(2)
+    const filenames = result.map((s) => s.filename)
+    expect(filenames).toContain('NEC4 ACC.pdf')
+    expect(filenames).toContain('report.pdf')
+  })
+
+  it('citation with page number [NEC4 ACC, page 3] still resolves', () => {
+    const text = 'Clause info [NEC4 ACC, page 3] is important.'
+    const result = extractCitedSources(text, mockSources)
+    expect(result).toHaveLength(1)
+    expect(result[0].filename).toBe('NEC4 ACC.pdf')
+    expect(result[0].page_number).toBe(3)
+  })
+
+  it('citation with extension [NEC4 ACC.pdf] still resolves', () => {
+    const text = 'Clause info [NEC4 ACC.pdf] is important.'
+    const result = extractCitedSources(text, mockSources)
+    expect(result).toHaveLength(1)
+    expect(result[0].filename).toBe('NEC4 ACC.pdf')
+  })
+
+  it('handles empty sources list gracefully', () => {
+    const text = 'Clause info [NEC4 ACC] is important.'
+    const result = extractCitedSources(text, [])
+    expect(result).toEqual([])
+  })
+
+  it('handles empty answerText gracefully', () => {
+    const result = extractCitedSources('', mockSources)
+    expect(result).toEqual([])
+  })
+})
--- a/frontend/src/utils/citationParser.ts
+++ b/frontend/src/utils/citationParser.ts
@ -124,3 +124,27 @@ export function processCitations(text: string, sources: SourceMetadata[]): strin
  const lookup = buildCitationLookup(sources)
  return replaceCitationPatterns(text, lookup)
 }
+
+export function extractCitedSources(answerText: string, sources: SourceMetadata[]): SourceMetadata[] {
+  if (!answerText.trim() || !sources.length) return []
+
+  const lookup = buildCitationLookup(sources)
+  const citationPattern = /(?<!!)\[([^\]]+)\](?!\()/g
+  const seen = new Set<string>()
+  const result: SourceMetadata[] = []
+
+  let match: RegExpExecArray | null
+  while ((match = citationPattern.exec(answerText)) !== null) {
+    const content = match[1].trim()
+    const source = findSource(content, lookup)
+    if (source) {
+      const key = `${source.document_id}_${source.chunk_index}`
+      if (!seen.has(key)) {
+        seen.add(key)
+        result.push(source)
+      }
+    }
+  }
+
+  return result
+}
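
Review note: the readiness key is built inline in two places in `ResponsePanel.tsx` (the `highlightReady` prop in `SubQuestionSection` and the batch `useEffect`), and the two template strings have to stay identical for the link upgrade to work. A minimal sketch of a shared helper follows, assuming the key format shown in the diff. The names `HighlightTarget`, `highlightKey` and `readyKeysFor` are hypothetical and not part of this commit.

```typescript
// Hypothetical shape: mirrors the fields the batch effect pushes into `targets`.
interface HighlightTarget {
  document_id: string
  chunk_index: number
  sub_question_text: string
}

// Builds the same key format the diff uses inline:
// `${document_id}_${chunk_index}_${encodeURIComponent(sub_question_text)}`
function highlightKey(t: HighlightTarget): string {
  return `${t.document_id}_${t.chunk_index}_${encodeURIComponent(t.sub_question_text)}`
}

// Collects the keys for a batch of targets; duplicate targets collapse in the Set.
function readyKeysFor(targets: HighlightTarget[]): Set<string> {
  return new Set(targets.map(highlightKey))
}

// Usage: the same source cited twice under one sub-question yields a single key.
const keys = readyKeysFor([
  { document_id: 'doc-123', chunk_index: 0, sub_question_text: 'What is the threshold?' },
  { document_id: 'doc-123', chunk_index: 0, sub_question_text: 'What is the threshold?' },
])
console.log(Array.from(keys))
```

With a helper like this, both call sites would derive the key from one function, so a change to the format could not leave the producer and the consumer out of sync.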