legco_ai_assistant/.plans/package5_enhancement_plan.md

# Package 5 Enhancement Plan — Structured Output + Robust Citation Linking
**Source**: User request (2026-04-28)
**Scope**:
- Phase 5.1: Replace manual JSON parsing in the decompose stage with LangChain `with_structured_output()`
- Phase 5.2: Fix missing PDF links in citations and improve citation robustness
**Status**: Phases 5.1 ✅, 5.2 ✅, 5.3 ✅ — 5.4 Planned (LLM-based, awaiting implementation)
**LangChain version**: 1.2.15 (venv), `model_provider="openai"` with OpenRouter base URL (API-compatible proxy).
**Test results**:
- Backend: 115 passed, 0 failed (Phase 5.1 + Phase 5.2 + all integration/regression tests)
- Frontend: 187 passed, 1 failed (pre-existing e2e test failure unrelated to these changes)
---
## Objective
1. **Decompose structured output**: Eliminate `json.JSONDecodeError` failures in `QueryDecomposer.decompose()` by integrating LangChain's `with_structured_output()` to enforce a Pydantic schema at the API level. The LLM response is guaranteed to be a valid `SubQuestions` object — no manual `json.loads()`, no regex markdown stripping, no silent failures.
2. **Robust citation linking**: Fix the citation→PDF link pipeline so that:
- `document_id` flows through to the frontend for fallback document-level links
- `chunk_file_path` is always available (generate per-chunk PDFs for DOCX/TXT too, or provide a document-level PDF fallback)
- Citation matching in `citationParser.ts` handles fuzzy filename matching (strips extensions, tolerates whitespace variations)
- Frontend provides fallback "View Document" links when chunk-level PDF is unavailable
---
## Decision Register
| # | Decision | Rationale |
|---|----------|-----------|
| 1 | Use LangChain `with_structured_output()` (not OpenAI `response_format` directly) | User explicitly chose Option B. Provides a cleaner API, Pydantic validation of the response (retry on validation failure can be layered on inside `complete_structured()`), and future flexibility for other pipeline stages (filter, generate). |
| 2 | Add `langchain` + `langchain-openai` to `requirements.txt` | Required dependencies for `init_chat_model()` and `with_structured_output()`. `langchain` ~0.3.x for stable API. |
| 3 | Define `SubQuestions` Pydantic model with `questions: list[str]` | LangChain's `with_structured_output()` requires a wrapper Pydantic model — bare `list[str]` is unsupported by provider-native schema enforcement. |
| 4 | Keep `LLMClient` as the central LLM access layer, add LangChain-based `complete_structured()` method | Minimizes refactoring. `QueryDecomposer` calls `llm_client.complete_structured(prompt, SubQuestions)` instead of `llm_client.complete(prompt)`. Other callers (filter, generate) remain unchanged. |
| 5 | Run decomposition at `temperature=0.0` (was `0.7`) | Structured output benefits from deterministic behavior. Lower temperature = more reliable schema compliance. |
| 6 | Add `document_id` to `SourceMetadata` Pydantic model and frontend type | `document_id` is already stored in ChromaDB metadata (`metadata.py:70`) but is discarded during serialization. Adding it enables document-level fallback links. |
| 7 | ~~Generate **monolithic** PDFs for DOCX/TXT documents~~ **DEFERRED** | More complex than needed for this phase. Instead, use fallback document-level links via `document_id` when `chunk_file_path` is null. DOCX/TXT PDF generation is deferred to Phase 5.3. |
| 8 | Fuzzy citation matching: strip extensions, trim whitespace | `citationParser.ts` currently requires exact filename match. LLM may shorten `NEC4 ACC.pdf` to `NEC4 ACC` in citations. |
| 9 | Fallback "View Document" link when `chunk_file_path` is null | Even after Decision #7, network failures or edge cases may leave null paths. The frontend should show a document-level PDF link as fallback. |
| 10 | Keep `_extract_json_from_markdown()` as a fallback for backward compatibility | During a transition period (or if `with_structured_output()` fails), the existing regex-based extraction serves as a safety net. Log a warning when fallback is used. |
| 11 | Add `logger.warning` for JSON parse failures before returning empty | The biggest blind spot today: JSON parse failures are silent. Log the raw LLM response (truncated) so operators can debug. |
| 12 | Keep `QueryDecomposer.decompose()` return type as `Tuple[List[str], str]` | Existing callers unpack the tuple. Adding `Tuple[List[str], str, SubQuestions | None]` would break tests unnecessarily. The Pydantic model is internal to `complete_structured()`. |
| 13 | Spike-test LangChain structured output with OpenRouter BEFORE implementation | 2-minute test calling `init_chat_model().with_structured_output().ainvoke()` through OpenRouter to confirm `response_format={"type": "json_schema"}` is proxied correctly. If not, fall back to `method="function_calling"`. |
| 14 | Tighten `generate_per_subq` prompt alongside frontend fuzzy matching | Add "Copy the exact bracket labels shown in the document chunks — do not modify filenames or add/remove extensions." to seed template. Two-layer defense: prompt reduces hallucinations + fuzzy matching catches remaining cases. No separate task — folded into Task 5.2.3. |
---
## Phase 5.1 — Structured Output for Decompose
### Test Files (write BEFORE implementation)
| # | Test File | Coverage |
|---|-----------|----------|
| T5.1.1 | `backend/app/test/test_phase5_llm_client_structured.py` | `LLMClient.complete_structured()` with mock LangChain model. Tests: valid Pydantic return, validation error → retry, empty questions list, non-JSON fallback. |
| T5.1.2 | `backend/app/test/test_phase5_query_decomposer_structured.py` | `QueryDecomposer.decompose()` using `MockLLMClient.complete_structured()`. Tests: valid SubQuestions, empty questions, LLM error fallback, prompt service integration. |
| T5.1.3 | `backend/app/test/test_phase5_subquestions_model.py` | `SubQuestions` Pydantic model validation. Tests: valid input, empty list, too many questions, non-string items rejected. |
| T5.1.4 | `backend/app/test/test_phase5_decompose_logging.py` | Verify `logger.warning` is emitted when JSON parse fallback is triggered (backward-compat path). |
### Acceptance Tests
| # | Test File | Coverage |
|---|-----------|----------|
| AT5.1.1 | `backend/app/test/acceptance/test_acceptance_phase5_structured_decompose.py` | Real LLM call with structured output. Tests: Cantonese question → valid sub-questions, English question → valid sub-questions, very short question → 1 sub-question, very long question → ≤5 sub-questions. |
### Implementation Tasks
#### Task 5.1.1: Add LangChain dependencies
- [ ] Add `langchain>=0.3.0,<0.4.0` and `langchain-openai>=0.3.0,<0.4.0` to `backend/requirements.txt`
- [ ] Run `pip install -r backend/requirements.txt` in dev venv
- **Test file**: `test_phase5_subquestions_model.py` (can run immediately after install)
#### Task 5.1.2: Define `SubQuestions` Pydantic model
- [ ] Create `backend/app/models/decompose.py` with:
```python
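from pydantic import BaseModel, Field


class SubQuestions(BaseModel):
    """Wrapper model so the sub-question list can be enforced as a schema."""

    questions: list[str] = Field(
        description="2-5 simplified sub-questions, each focused on one aspect",
        min_length=1,
        max_length=5,
    )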
```
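- [ ] Add `min_length=1` and `max_length=5` Pydantic constraints. `max_length=5` matches the upper bound of the decompose prompt's "2-5"; `min_length=1` is looser than the prompt so that a very short question can still yield a single sub-question (see AT5.1.1)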
- **Test file**: `test_phase5_subquestions_model.py`
#### Task 5.1.3: Add `complete_structured()` method to `LLMClient`
- [ ] In `llm_client.py`, import `init_chat_model` from `langchain.chat_models`
- [ ] Add `self._langchain_model` attribute (lazy-init from settings)
- [ ] Add `async complete_structured(prompt, pydantic_model, step_name) -> BaseModel` method:
1. Calls `self._langchain_model.with_structured_output(pydantic_model, method="json_schema").ainvoke(prompt)`
2. Returns the validated Pydantic model instance
3. Logs timing (same pattern as existing `complete()`)
4. Wraps errors in `LLMClientError`
- [ ] Use `temperature=0.0` via model config for structured calls
- **Test file**: `test_phase5_llm_client_structured.py`
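
The steps in Task 5.1.3 could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's actual `llm_client.py`: the chat model is injected so the example runs without network access, `LLMClientError` is redefined here as a stand-in, and the `_Fake*` classes and the log message format are hypothetical. In the real client the injected object would presumably come from `init_chat_model(...)` configured with `temperature=0.0`.

```python
import asyncio
import logging
import time
from typing import Any

logger = logging.getLogger("llm_client")


class LLMClientError(Exception):
    """Stand-in for the project's existing exception type."""


class LLMClient:
    def __init__(self, chat_model: Any) -> None:
        # Assumption: `chat_model` exposes LangChain's
        # `with_structured_output(schema, method=...)`, whose result has `ainvoke`.
        self._langchain_model = chat_model

    async def complete_structured(
        self, prompt: str, pydantic_model: Any, step_name: str = "LLM"
    ) -> Any:
        started = time.perf_counter()
        try:
            runnable = self._langchain_model.with_structured_output(
                pydantic_model, method="json_schema"
            )
            result = await runnable.ainvoke(prompt)
        except Exception as exc:
            # Same exception type as complete(), so callers need one except clause.
            raise LLMClientError(f"{step_name}: structured call failed: {exc}") from exc
        logger.info("%s structured call took %.2fs", step_name, time.perf_counter() - started)
        return result


# --- Hypothetical fakes so the sketch runs offline ---
class _FakeRunnable:
    def __init__(self, fail: bool) -> None:
        self._fail = fail

    async def ainvoke(self, prompt: str) -> dict:
        if self._fail:
            raise ValueError("schema validation failed")
        return {"questions": [f"What does '{prompt}' mean?"]}


class _FakeChatModel:
    def __init__(self, fail: bool = False) -> None:
        self._fail = fail

    def with_structured_output(self, schema: Any, method: str = "json_schema") -> _FakeRunnable:
        return _FakeRunnable(self._fail)


ok_result = asyncio.run(
    LLMClient(_FakeChatModel()).complete_structured("NEC4", dict, "QueryDecomposer")
)

try:
    asyncio.run(LLMClient(_FakeChatModel(fail=True)).complete_structured("NEC4", dict))
    wrapped = False
except LLMClientError:
    wrapped = True
```

Injecting the model also keeps T5.1.1 simple, since the test can pass a mock instead of patching `init_chat_model`.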
#### Task 5.1.4: Refactor `QueryDecomposer.decompose()` to use structured output
- [ ] Change `decompose()` to call `self.llm_client.complete_structured(prompt, SubQuestions, step_name="QueryDecomposer")`
- [ ] Add fallback path: if `complete_structured()` raises → log warning → attempt legacy `complete()` + `json.loads()` → if that works, log info "structured output failed, fallback succeeded"
- [ ] Add `logger.warning("Decompose JSON parse failed, raw response (first 500 chars): %s", response[:500])` when both paths fail
- [ ] Keep return type `Tuple[List[str], str]` unchanged
- [ ] Keep `_extract_json_from_markdown()` for backward-compat fallback path
- **Test file**: `test_phase5_query_decomposer_structured.py` and `test_phase5_decompose_logging.py`
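
The two-stage fallback described in Task 5.1.4 might look like the sketch below. It is illustrative only: `SubQuestions` is a dataclass stand-in for the Pydantic model, `extract_json_from_markdown` approximates the existing helper, and `_FallbackOnlyClient` is a hypothetical test double. The function returns only the question list; the second element of the real `Tuple[List[str], str]` is omitted because the plan does not say what it carries.

```python
import asyncio
import json
import logging
import re
from dataclasses import dataclass
from typing import Any, List

logger = logging.getLogger("query_decomposer")
FENCE = "`" * 3  # built indirectly to avoid a literal code fence in this example


@dataclass
class SubQuestions:
    """Stand-in for the Pydantic model from Task 5.1.2."""
    questions: List[str]


def extract_json_from_markdown(text: str) -> str:
    """Stand-in for `_extract_json_from_markdown()`: unwrap a fenced JSON block."""
    match = re.search(FENCE + r"(?:json)?\s*(.*?)" + FENCE, text, re.DOTALL)
    return match.group(1).strip() if match else text.strip()


async def decompose_with_fallback(client: Any, prompt: str) -> List[str]:
    # 1. Preferred path: schema-enforced structured output.
    try:
        result = await client.complete_structured(
            prompt, SubQuestions, step_name="QueryDecomposer"
        )
        return list(result.questions)
    except Exception as exc:
        logger.warning("Structured decompose failed (%s); trying legacy path", exc)

    # 2. Legacy path: plain completion plus manual JSON parsing.
    response = ""
    try:
        response = await client.complete(prompt)
        parsed = json.loads(extract_json_from_markdown(response))
        if not (isinstance(parsed, list) and all(isinstance(q, str) for q in parsed)):
            raise ValueError("expected a JSON array of strings")
        logger.info("structured output failed, fallback succeeded")
        return parsed
    except Exception:
        # 3. Both paths failed: log the truncated raw response, then return empty.
        logger.warning(
            "Decompose JSON parse failed, raw response (first 500 chars): %s",
            response[:500],
        )
        return []


class _FallbackOnlyClient:
    """Hypothetical client whose structured path always fails."""

    def __init__(self, legacy_response: str) -> None:
        self._legacy_response = legacy_response

    async def complete_structured(self, prompt: str, model: Any, step_name: str = "") -> Any:
        raise RuntimeError("json_schema not supported by proxy")

    async def complete(self, prompt: str) -> str:
        return self._legacy_response


fenced = FENCE + 'json\n["What is X?", "When does X apply?"]\n' + FENCE
recovered = asyncio.run(decompose_with_fallback(_FallbackOnlyClient(fenced), "q"))
gave_up = asyncio.run(decompose_with_fallback(_FallbackOnlyClient("not json"), "q"))
```

Here `recovered` holds the two parsed questions and `gave_up` is an empty list, with a warning logged in both runs.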
#### Task 5.1.5: Update prompt template for structured output
- [ ] Update `_SEED_DECOMPOSE` in `sqlite_db.py` to instruct the LLM about the expected structure
- [ ] New seed prompt: mention that output will be validated against a schema — more explicit about JSON array of strings requirement
- [ ] Run `seed_default_profiles()` to backfill existing profiles
- **Test file**: Existing `test_phase3_prompt_service.py` should continue to pass
#### Task 5.1.6: Integration test — end-to-end query pipeline
- [ ] Verify existing integration tests still pass (`test_integration_phase1.py`, `test_phase4_integration_query_pipeline.py`)
- [ ] Verify acceptance test passes with real LLM (`test_acceptance_phase1_rag_query.py`)
- [ ] Run full test suite: `cd backend && pytest app/test/test_phase5*.py app/test/test_phase4*.py app/test/test_phase3*.py -v`
---
## Phase 5.2 — Robust Citation Linking
### Test Files (write BEFORE implementation)
| # | Test File | Coverage |
|---|-----------|----------|
| T5.2.1 | `backend/app/test/test_phase5_source_metadata.py` | `SourceMetadata` model with `document_id`. Tests: serialization includes document_id, backward compat (old data without document_id). |
| T5.2.2 | `backend/app/test/test_phase5_docx_pdf_generation.py` | DOCX/TXT ingestion now sets `chunk_file_path`. Tests: DOCX ingestion produces chunk PDFs, TXT ingestion produces chunk PDFs, PDF generation errors are handled gracefully. |
| T5.2.3 | `frontend/src/test/utils/test_phase5_citation_parser_fuzzy.test.ts` | Fuzzy citation matching. Tests: citation `[NEC4 ACC]` matches source `NEC4 ACC.pdf`, citation `[nec4 acc.pdf, page 3]` matches after whitespace trim, citation `[NEC4 ACC.PDF]` matches case-insensitively, fallback "View Document" link shown when `chunk_file_path` is null. |
| T5.2.4 | `frontend/src/test/utils/test_phase5_citation_fallback_link.test.ts` | Fallback document link rendering. Tests: chunk with `chunk_file_path: null` but `document_id` present → renders "View Document" link, chunk with both null → remains plain text, chunk with `chunk_file_path` → renders page-level PDF link. |
### Acceptance Tests
| # | Test File | Coverage |
|---|-----------|----------|
| AT5.2.1 | `backend/app/test/acceptance/test_acceptance_phase5_citation_links.py` | Real LLM query with DOCX and PDF documents. Verify citations in the answer are clickable in the SSE response (sources include document_id and chunk_file_path). |
### Implementation Tasks
#### Task 5.2.1: Add `document_id` to `SourceMetadata` model
- [ ] In `backend/app/models/common.py`, add `document_id: Optional[str] = None` to `SourceMetadata`
- [ ] In `backend/app/routers/query.py` lines 310-319, include `document_id=meta.get("document_id")` when building `SourceMetadata` objects
- [ ] In `frontend/src/types/index.ts`, add `document_id: string | null` to `SourceMetadata` interface
- **Test file**: `test_phase5_source_metadata.py`
#### Task 5.2.2: Generate PDFs for DOCX/TXT documents during ingestion
- [ ] Add `reportlab` to `backend/requirements.txt` (lightweight, pure Python PDF generation, no external binaries)
- [ ] In `backend/app/routers/ingest.py` DOCX and TXT branches, add PDF generation logic:
1. After chunking, generate a single PDF from the full text (one page per chunk)
2. Store `chunk_filename = f"{stem}_chunk_{idx}.pdf"` for each chunk
3. Set `chunk_file_paths` list and pass to `extract_metadata()`
- [ ] Add error handling: if PDF generation fails, `chunk_file_path` stays `None` (graceful degradation)
- [ ] Use `logger.warning` on generation failure
- **Test file**: `test_phase5_docx_pdf_generation.py`
#### Task 5.2.3: Improve `citationParser.ts` with fuzzy matching
- [ ] Add extension-stripping helper: `stripExtension(filename: string): string` — removes `.pdf`, `.docx`, `.txt`
- [ ] Modify `buildCitationLookup()` to register both `filename` and `stripExtension(filename)` as lookup keys
- [ ] Add trim-whitespace normalization on citation text before lookup
- [ ] Add test for LLM-common variations: `NEC4 ACC.pdf` vs `NEC4 ACC` vs `NEC4_acc.pdf`
- **Test file**: `test_phase5_citation_parser_fuzzy.test.ts`
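
The matching rules in Task 5.2.3 can be illustrated with a short sketch. It is written in Python to match the other code in this plan, although the real change lives in TypeScript in `citationParser.ts`. `normalize_label` and `resolve_citation` are hypothetical names, and the `[filename, page N]` citation shape is assumed from the T5.2.3 test cases. Underscore variants such as `NEC4_acc.pdf` are deliberately not matched here.

```python
import re
from typing import Dict, List, Optional

KNOWN_EXTENSIONS = (".pdf", ".docx", ".txt")


def strip_extension(filename: str) -> str:
    """Remove a known document extension, case-insensitively."""
    lowered = filename.lower()
    for ext in KNOWN_EXTENSIONS:
        if lowered.endswith(ext):
            return filename[: -len(ext)]
    return filename


def normalize_label(label: str) -> str:
    """Collapse runs of whitespace, trim the ends, and ignore case."""
    return re.sub(r"\s+", " ", label).strip().casefold()


def build_citation_lookup(filenames: List[str]) -> Dict[str, str]:
    """Register each source under both its full name and its extension-less name."""
    lookup: Dict[str, str] = {}
    for name in filenames:
        lookup[normalize_label(name)] = name
        lookup[normalize_label(strip_extension(name))] = name
    return lookup


def resolve_citation(citation: str, lookup: Dict[str, str]) -> Optional[str]:
    """Map a bracketed citation such as '[NEC4 ACC, page 3]' to a source filename."""
    label = citation.strip("[] ").split(",")[0]  # drop an optional ', page N' suffix
    return lookup.get(normalize_label(label)) or lookup.get(
        normalize_label(strip_extension(label))
    )


lookup = build_citation_lookup(["NEC4 ACC.pdf"])
shortened = resolve_citation("[NEC4 ACC]", lookup)
lowercase_with_page = resolve_citation("[nec4  acc.pdf, page 3]", lookup)
uppercase_ext = resolve_citation("[NEC4 ACC.PDF]", lookup)
unknown = resolve_citation("[Some Other Doc]", lookup)
```

Splitting on the first comma is a simplification: a filename that itself contains a comma would need a stricter pattern for the page suffix.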
#### Task 5.2.4: Add fallback "View Document" link in frontend
- [ ] In `citationParser.ts` `replaceCitationPatterns()`, when `source?.chunk_file_path` is null but `source?.document_id` exists:
1. Build a URL to the document chunk list page: `/rag-database?document=${source.document_id}` (`?document=` is the param `RAGDatabasePage` reads to auto-expand the document)
2. Return `[${trimmed}](${url})` with a different CSS class (e.g., `text-green-600` for document-level vs `text-blue-600` for page-level)
- [ ] In `ResponsePanel.tsx`, update `CitationLink` component to accept a `variant` prop for visual differentiation
- **Test file**: `test_phase5_citation_fallback_link.test.ts`
#### Task 5.2.5: Integration and regression testing
- [ ] Verify all existing citation parser tests still pass: `cd frontend && npx vitest run src/test/utils/citationParser.test.ts`
- [ ] Verify ResponsePanel tests still pass: `npx vitest run src/test/components/ResponsePanel.test.tsx`
- [ ] Run full frontend test suite: `npm test`
- [ ] Verify SSE streaming integration: query with a mix of PDF and DOCX documents, confirm citations are clickable
---
## Dependency Graph
```
Phase 5.1 (Structured Output)
  Task 5.1.1 (add deps) ──┬── Task 5.1.2 (SubQuestions model) ── Task 5.1.3 (complete_structured)
                          │                                                     │
                          │                                                     ▼
                          │                                      Task 5.1.4 (refactor decompose)
                          │                                                     │
                          │                                      Task 5.1.5 (update prompt template)
                          │                                                     │
                          │                                                     ▼
                          │                                      Task 5.1.6 (integration tests)

Phase 5.2 (Citation Linking): independent, can run in parallel with 5.1
  Task 5.2.1 (document_id in model) ──┬── Task 5.2.3 (fuzzy matching)
  Task 5.2.2 (DOCX/TXT PDF gen) ──────┤
                                      ├── Task 5.2.4 (fallback link)
                                          Task 5.2.5 (integration tests)
```
---
## Acceptance Criteria
### Phase 5.1 Completion Checklist
- [x] `LLMClient.complete_structured()` returns validated `SubQuestions` Pydantic model — no `json.JSONDecodeError` possible
- [x] `QueryDecomposer.decompose()` never returns `[]` due to JSON parse failure
- [x] Fallback path (legacy `json.loads()`) logs a warning when triggered
- [x] Existing decompose tests pass (`test_phase1_query_decomposer.py`)
- [x] New structured output tests pass (`test_phase5_*.py`) — 33 tests
- [x] Spike test passed: Cantonese + English → valid sub-questions
- [x] `SQLite` seed templates updated and backfilled to all profiles
- [x] `langchain` and `langchain-openai` installed in venv (1.2.x)
### Phase 5.2 Completion Checklist
- [x] `SourceMetadata` includes `document_id` in both backend and frontend types
- [ ] ~~DOCX/TXT ingestion generates per-chunk PDF files~~ **DEFERRED** to Phase 5.3
- [x] `citationParser.ts` matches `[NEC4 ACC]` to source `NEC4 ACC.pdf` (fuzzy matching)
- [x] `citationParser.ts` renders fallback link to `/rag-database?document=xxx` when `chunk_file_path` is null but `document_id` exists
- [x] `RAGDatabasePage` auto-expands document from `?document=` URL param
- [x] All existing citation parser tests pass (14 tests)
- [x] All existing ResponsePanel tests pass
- [x] `generate_per_subq` seed prompt tightened: "Copy the exact bracket labels shown"
---
## Rollback Plan
If `with_structured_output()` causes issues in production:
1. The `complete_structured()` method wraps errors in `LLMClientError` — same exception type as existing `complete()`
2. `QueryDecomposer.decompose()` has a fallback to legacy `complete()` + `json.loads()` path
3. The `_extract_json_from_markdown()` function is preserved for backward compatibility
4. If LangChain is a complete failure, revert `requirements.txt` and `llm_client.py` changes (3 files), keeping the Pydantic model and improved logging
---
## Phase 5.3 — DOCX/TXT PDF Generation ✅
Generate per-chunk PDF files for DOCX/TXT documents at ingestion time so they have the same `chunk_file_path` → PDF viewer flow as PDF documents.
**Status**: Complete (2026-04-28). Implemented in commit `25b26c9`.
- `reportlab==4.2.5` added to `requirements.txt`
- New `backend/app/utils/text_to_pdf.py`: renders chunk text as simple PDFs with word wrapping
- `ingest.py` DOCX/TXT branches: generates `{stem}_chunk_{idx}.pdf` per chunk, passes `chunk_file_paths` to `extract_metadata()`
- Graceful degradation: `chunk_file_path` stays `None` on generation failure (logged as warning)
- Tests: `test_phase5_docx_pdf_generation.py` (5 tests), updated `test_phase1_ingest_page_aware.py` (2 assertions)
---
## Phase 5.4 — Sentence-Level Highlighting (PLANNED)
### Problem
When a user clicks a citation link to view a cited chunk, they see the full chunk text (up to ~1000 tokens). They have to manually scan to find which sentences actually drove the relevance. This is especially painful for long, dense chunks.
### Solution: LLM-Based Relevance Detection + Eager Background Computation
**Embedding similarity was rejected** — cosine similarity measures semantic proximity, not answer relevance. A sentence like *"The Contractor shall submit a quotation within three weeks"* has low embedding similarity to *"What are the compensation event timelines?"* but is directly relevant to answering it. Only an LLM can distinguish topic-matching from answer-relevance.
**Strategy**: After the final response renders on the frontend, a **single background LLM call** identifies relevant sentences across all cited chunks (grouped by sub-question). Results are cached in SQLite. The "View Highlighted" button activates once caching completes. Clicks serve instantly from cache.
**Why one LLM call, not one per chunk**: A sub-question typically cites only 2-3 chunks (even though up to 10 are retrieved). A single batched prompt containing every sub-question and its cited chunks costs one LLM call per query, rather than one call per cited chunk (N sub-questions × M chunks each). It reuses the `complete_structured()` infrastructure from Phase 5.1.
### Why HTML, not PDF?
| Approach | Complexity | Works for all doc types? | Preserves original formatting? |
|---|---|---|---|
| Highlighted HTML page | **Low** | ✅ Yes (uses chunk text from ChromaDB) | ❌ Plain text only |
| Highlighted PDF via reportlab | Medium | ✅ Yes (new PDF) | ❌ Plain text only |
| Overlay highlights on existing PDF | High | ⚠️ PDF only | ✅ Yes |
**Recommendation: HTML page.** Works uniformly across PDF/DOCX/TXT. Original formatting is preserved in the existing PDF viewer (`chunk_file_path` link) — the highlighted HTML view is a supplementary view reached via a separate button. The two views coexist: "View Original PDF" vs "View Highlighted Text".
### Architecture
```
EAGER PHASE (background, fires immediately after SSE completed):
─────────────────────────────────────────────────
Frontend receives SSE "completed" event
  ├── Renders answer with citations instantly (no waiting)
  └── useEffect fires POST /api/v1/v2/highlights/batch

Backend highlight batch service:
  1. For each sub-question, extract cited chunks
     (only 2-3 actually cited per sub-q, not all 10)
  2. Fetch chunk texts from ChromaDB
  3. Split each chunk into sentences
  4. Single LLM call: identify which sentences are
     relevant to their sub-question (structured output)
  5. Render each chunk as highlighted HTML
  6. Cache all HTML + metadata in SQLite
  7. Return {status, cached_count} to frontend

Frontend receives response → upgrades "View PDF"
  links on source cards to point to cached highlight pages
  (same button, same position, upgraded destination)

LAZY PHASE (on user click, instant from cache):
──────────────────────────────────────────────
User clicks "View PDF" on a source card
  GET /api/v1/v2/highlights?document_id=...&chunk_index=...&sub_question=...
  Backend reads cached HTML from SQLite → serves instantly

Highlighted HTML page shows:
  ┌─ Highlighted sentences (yellow background + LLM reasons)
  └─ Footer: [View Original PDF →] link to raw PDF chunk
```
### What Gets Highlighted
```
┌──────────────────────────────────────────────────────────┐
│ Chunk: NEC4 ACC, page 12 [View Original PDF →] │
├──────────────────────────────────────────────────────────┤
│ │
│ The programme shall be prepared in a form acceptable to │
│ the Project Manager. It shall include: │
│ │
│ ████████████████████████████████████████████████████████ │
│ █ The starting date, access dates, and Key Dates. █ │ ← LLM: relevant to timeline question
│ █ "Defines when work begins and key milestones." █ │ ← Brief reason from LLM
│ ████████████████████████████████████████████████████████ │
│ │
│ The Contractor shall submit a first programme within │
│ ████████████████████████████████████████████████████████ │
│ █ two weeks of the starting date. █ │ ← LLM: relevant to timeline question
│ █ "Specifies the deadline for initial programme." █ │
│ ████████████████████████████████████████████████████████ │
│ │
│ The Project Manager may instruct the Contractor to │
│ submit a revised programme showing the effects of a │
│ compensation event. This does not affect the Contractor's │
│ right to be paid for preparing the programme. │ ← Not highlighted (LLM: not relevant to timeline)
│ │
├──────────────────────────────────────────────────────────┤
│ [View Original PDF →] │
└──────────────────────────────────────────────────────────┘
```
### Key Design Decisions
| # | Decision | Rationale |
|---|---|---|
| 1 | HTML page, not PDF | Zero extra dependencies. CSS highlighting is more flexible. Original PDF remains available via separate link. |
| 2 | **LLM relevance detection**, not embedding similarity | Embedding similarity measures topic overlap, not answer relevance. LLM distinguishes "about timelines" from "answers the timeline question." |
| 3 | **Single batched LLM call** for all sub-questions + cited chunks | One API call per query, not one per chunk. Each sub-question typically cites 2-3 chunks. Total ~6-15 chunks per query in a single prompt. |
| 4 | **Eager computation** (background after response renders) | User sees answer instantly. Highlight computation happens while they read. "View Highlighted" activates when ready. Zero perceived latency on click. |
| 5 | **Only cited chunks** (not all retrieved chunks) | Up to 10 chunks retrieved per sub-question, but only 2-3 are actually cited in the answer. Computing highlights for uncited chunks wastes tokens. |
| 6 | **SQLite cache** for highlight results | Durable across restarts. Same chunk+sub-question pair only computed once. Subsequent clicks (or same query re-run) serve instantly. Cache key = hash(document_id + chunk_index + sub_question). |
| 7 | Reuse `complete_structured()` from Phase 5.1 | `LLMClient.complete_structured(prompt, HighlightBatchResult)` returns validated Pydantic model. No new LLM calling infrastructure. |
| 8 | Sentence-level granularity | Paragraphs are too coarse, words are too noisy. Sentences are the natural unit of meaning. LLM returns sentence indices. |
| 9 | Regex sentence splitter (no NLTK) | NLTK adds 35MB deploy weight for edge-case handling the project doesn't need. `re.split(r'(?<=[.!?。！？\n])\s+', text)` handles English + Cantonese legal text. |
| 10 | Sentence indices, not text spans | Deterministic: the splitter always produces the same sentences from the same chunk text. LLM returns indices to avoid substring matching fragility. |
| 11 | Graceful degradation | LLM call fails → cache miss → GET returns 404. Frontend keeps "View Highlighted" disabled/loading until batch succeeds. If batch fails, button stays hidden. |
| 12 | "View Highlighted" on source cards only | Source cards have sub-question context via `SubQuestionSection`. Inline citations in answer text lack this context (flattened `allSources`). Adding `sub_question_index` to `SourceMetadata` deferred. |
| 13 | **Upgrade "View PDF" link in place (no second button)** | Before the LLM completes, the "View PDF" link opens the raw PDF chunk page (existing behavior). Once the batch LLM call finishes caching, the same link upgrades to open the highlighted HTML page instead. The highlighted page includes a "View Original PDF →" footer link. If the LLM call fails, the link stays as raw PDF. One link, seamless upgrade. |
### Data Model
#### Batch Request (POST)
```python
from pydantic import BaseModel


class ChunkHighlightTarget(BaseModel):
    """A single chunk to highlight, tied to its sub-question."""

    document_id: str
    chunk_index: int
    sub_question_text: str
    sub_question_index: int


class HighlightBatchRequest(BaseModel):
    """Batch request: all cited chunks grouped by sub-question."""

    targets: list[ChunkHighlightTarget]
```
#### Structured LLM Output
```python
from pydantic import BaseModel, Field


class RelevantSentence(BaseModel):
    sentence_index: int = Field(description="0-based index of the relevant sentence")
    reason: str = Field(description="Brief explanation of why this sentence is relevant (max 80 chars)")


class ChunkHighlights(BaseModel):
    document_id: str
    chunk_index: int
    relevant_sentences: list[RelevantSentence] = Field(default_factory=list)


class HighlightBatchResult(BaseModel):
    """Single LLM call returns highlights for all sub-questions and chunks."""

    results: list[ChunkHighlights]
```
#### Batch Response (POST response, JSON)
```python
from typing import Literal
from pydantic import BaseModel

class HighlightBatchResponse(BaseModel):
    status: Literal["completed", "partial", "failed"]
    cached_count: int  # Number of highlights successfully computed and cached
    errors: list[str] = []  # Per-chunk errors (LLM failed on some but not all)
```
#### SQLite Cache Table
```sql
CREATE TABLE IF NOT EXISTS chunk_highlights (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    cache_key TEXT UNIQUE NOT NULL,          -- sha256(document_id + chunk_index + sub_question)
    document_id TEXT NOT NULL,
    chunk_index INTEGER NOT NULL,
    sub_question TEXT NOT NULL,
    relevant_sentences_json TEXT NOT NULL,   -- JSON of RelevantSentence[]
    html_content TEXT NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE INDEX IF NOT EXISTS idx_highlights_cache_key ON chunk_highlights(cache_key);
```
#### GET Endpoint Response
```
GET /api/v1/v2/highlights?document_id=...&chunk_index=...&sub_question=...
→ 200 text/html (cache hit: highlighted HTML page)
→ 404 (cache miss: batch not yet computed or failed)
```
### LLM Prompt Structure (Single Call, All Sub-Questions)
```
For each sub-question below, identify which sentences in each cited chunk are directly relevant to answering that sub-question. Return only the index and a brief reason for each relevant sentence. Skip sentences that are about the same topic but don't contribute to the answer.
Sub-question 1: "What are the compensation event timelines in NEC4?"
[Chunk: NEC4 ACC, chunk_index=2]
[0] The Contractor shall submit a quotation within three weeks of the instruction.
[1] The Project Manager may instruct the Contractor to submit a revised programme.
[2] This does not affect the Contractor's right to be paid for preparing the programme.
[Chunk: NEC4 ACC, chunk_index=5]
[0] Compensation events are events which change the total of the Prices.
[1] The Project Manager shall notify the Contractor within one week of becoming aware.
Sub-question 2: "What are the key dates and milestones?"
[Chunk: NEC4 ACC, chunk_index=1]
[0] The starting date shall be defined as the first working day.
[1] Key Dates are dates by which the Contractor must complete work.
```
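
The prompt layout above could be assembled with a small builder such as the one below. This is a sketch, not the planned service code: `CitedChunk` and `build_highlight_prompt` are hypothetical names, and the input is assumed to be already grouped by sub-question with sentences pre-split.

```python
from dataclasses import dataclass
from typing import Dict, List

INSTRUCTIONS = (
    "For each sub-question below, identify which sentences in each cited chunk "
    "are directly relevant to answering that sub-question. Return only the index "
    "and a brief reason for each relevant sentence. Skip sentences that are about "
    "the same topic but don't contribute to the answer."
)


@dataclass
class CitedChunk:
    label: str            # e.g. the source filename shown to the LLM
    chunk_index: int
    sentences: List[str]  # output of the sentence splitter, in order


def build_highlight_prompt(groups: Dict[str, List[CitedChunk]]) -> str:
    """Render one batched prompt; dict order determines sub-question numbering."""
    lines = [INSTRUCTIONS]
    for number, (sub_question, chunks) in enumerate(groups.items(), start=1):
        lines.append(f'Sub-question {number}: "{sub_question}"')
        for chunk in chunks:
            lines.append(f"[Chunk: {chunk.label}, chunk_index={chunk.chunk_index}]")
            # Sentence indices are 0-based so they line up with RelevantSentence.
            lines.extend(f"[{i}] {s}" for i, s in enumerate(chunk.sentences))
    return "\n".join(lines)


prompt = build_highlight_prompt(
    {
        "What are the key dates and milestones?": [
            CitedChunk(
                label="NEC4 ACC",
                chunk_index=1,
                sentences=[
                    "The starting date shall be defined as the first working day.",
                    "Key Dates are dates by which the Contractor must complete work.",
                ],
            )
        ]
    }
)
prompt_lines = prompt.splitlines()
```

Because the indices printed in the prompt come from the same splitter output that the renderer later uses, the LLM's `sentence_index` values can be mapped back without substring matching.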
### Implementation Tasks
#### Task 5.4.1: Backend — Sentence splitting utility
- [ ] Create `backend/app/utils/sentence_splitter.py`
- [ ] Function `split_sentences(text: str) -> list[str]` returns ordered sentence strings
- [ ] Use regex: `re.split(r'(?<=[.!?。！？\n])\s+', text)` (handles English + Chinese punctuation)
- [ ] Strip empty strings, preserve original sentence text whitespace
- [ ] Handle edge cases: empty text, single sentence, lists/bullets
- **Test file**: `test_phase5_sentence_splitter.py`
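
A possible shape for `split_sentences()` is sketched below. It deliberately deviates from the regex in the task list in one respect, which is an assumption on my part rather than part of the plan: Chinese text usually has no whitespace after `。`, `！` or `？`, so a pattern that requires `\s+` would not split there. The sketch therefore allows a zero-width split after full-width punctuation (supported by `re.split` since Python 3.7). It also trims the edges of each sentence, which is one reading of "preserve original sentence text whitespace".

```python
import re
from typing import List

# Three boundary kinds, tried in order at each position:
#   1. ASCII terminator followed by whitespace (avoids splitting "3.5")
#   2. full-width terminator, with or without trailing whitespace
#   3. one or more newlines (bullets and list items on separate lines)
_BOUNDARY = re.compile(r"(?<=[.!?])\s+|(?<=[。！？])\s*|\n+")


def split_sentences(text: str) -> List[str]:
    """Return sentences in document order, dropping empty fragments."""
    pieces = _BOUNDARY.split(text)
    return [piece.strip() for piece in pieces if piece and piece.strip()]


mixed = split_sentences(
    "The Contractor shall submit a quotation. 承包商須提交報價。工程師可作出指示！\nFinal line"
)
bullets = split_sentences("- first item\n- second item")
decimal = split_sentences("Payment is due within 3.5 weeks of the assessment date.")
```

Abbreviations such as "e.g. " still produce a false split under this pattern; whether that matters for legal text is something the tests in `test_phase5_sentence_splitter.py` would need to decide.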
#### Task 5.4.2: Backend — Highlight models (`HighlightBatchResult` and related Pydantic schemas)
- [ ] Create `backend/app/models/highlight.py` with:
- `ChunkHighlightTarget` (request input)
- `HighlightBatchRequest` (request input)
- `RelevantSentence` (LLM output unit)
- `ChunkHighlights` (LLM output per chunk)
- `HighlightBatchResult` (LLM output wrapper)
- `HighlightBatchResponse` (POST response)
- **Test file**: `test_phase5_highlight_models.py`
#### Task 5.4.3: Backend — SQLite cache for highlight results
- [ ] Add `chunk_highlights` table to `backend/app/services/highlight_cache.py`
- [ ] Functions: `get_highlight(cache_key) -> str | None`, `set_highlight(cache_key, html_content, relevant_sentences_json)`, `compute_cache_key(document_id, chunk_index, sub_question) -> str`
- [ ] Database path: `backend/app/data/highlights.db` (separate from prompts/history to avoid schema conflicts)
- [ ] Initialize table on first use (same pattern as `prompt_service.py` / `history_service.py`)
- **Test file**: `test_phase5_highlight_cache.py`
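
The cache functions from Task 5.4.3 might be sketched as follows, using only the standard library. Two points differ from the task list and are assumptions of this sketch: `set_highlight` takes `document_id`, `chunk_index` and `sub_question` as well, because those columns are `NOT NULL` in the table, and the cache key joins its parts with a separator so that, for example, `("a1", 2)` and `("a", 12)` cannot hash to the same value. The connection is passed in explicitly and an in-memory database is used for the demonstration.

```python
import hashlib
import sqlite3
from typing import Optional

SCHEMA = """
CREATE TABLE IF NOT EXISTS chunk_highlights (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    cache_key TEXT UNIQUE NOT NULL,
    document_id TEXT NOT NULL,
    chunk_index INTEGER NOT NULL,
    sub_question TEXT NOT NULL,
    relevant_sentences_json TEXT NOT NULL,
    html_content TEXT NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
"""


def compute_cache_key(document_id: str, chunk_index: int, sub_question: str) -> str:
    # Unit-separator keeps field boundaries unambiguous before hashing.
    raw = "\x1f".join([document_id, str(chunk_index), sub_question])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def set_highlight(
    conn: sqlite3.Connection,
    document_id: str,
    chunk_index: int,
    sub_question: str,
    html_content: str,
    relevant_sentences_json: str,
) -> str:
    """Insert or overwrite the cached highlight and return its cache key."""
    key = compute_cache_key(document_id, chunk_index, sub_question)
    conn.execute(
        "INSERT OR REPLACE INTO chunk_highlights "
        "(cache_key, document_id, chunk_index, sub_question, "
        "relevant_sentences_json, html_content) VALUES (?, ?, ?, ?, ?, ?)",
        (key, document_id, chunk_index, sub_question, relevant_sentences_json, html_content),
    )
    conn.commit()
    return key


def get_highlight(conn: sqlite3.Connection, cache_key: str) -> Optional[str]:
    """Return cached HTML, or None on a cache miss."""
    row = conn.execute(
        "SELECT html_content FROM chunk_highlights WHERE cache_key = ?", (cache_key,)
    ).fetchone()
    return row[0] if row else None


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
key = set_highlight(conn, "doc-1", 2, "What are the timelines?", "<p>v1</p>", "[]")
set_highlight(conn, "doc-1", 2, "What are the timelines?", "<p>v2</p>", "[]")
hit = get_highlight(conn, key)
miss = get_highlight(conn, compute_cache_key("doc-1", 3, "What are the timelines?"))
```

SQLite already builds an index for a `UNIQUE` column, so lookups by `cache_key` are indexed even without a separate `CREATE INDEX` statement.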
#### Task 5.4.4: Backend — Highlight batch service (LLM call + HTML rendering)
- [ ] Create `backend/app/services/chunk_highlight_service.py`
- [ ] Method `compute_highlights_batch(targets: list[ChunkHighlightTarget]) -> HighlightBatchResponse`:
1. Group targets by sub-question for prompt construction
2. For each target, fetch chunk text + metadata from ChromaDB (`collection.get(ids=[f"{document_id}_{chunk_index}"])`)
3. Split chunk text into sentences via `split_sentences()`
4. Build a single batched prompt (see Prompt Structure above)
5. Call `llm_client.complete_structured(prompt, HighlightBatchResult)` → validated Pydantic result
6. For each `ChunkHighlights` in result: render HTML, compute cache key, store in SQLite
7. Return `HighlightBatchResponse(status="completed", cached_count=N)`
- [ ] Method `render_highlight_html(chunk_text, sentences, relevant_sentences, metadata) -> str`:
- Inline CSS: yellow background (`#fef08a`) on relevant sentences, grey reason text below each
- Header: filename, chunk index, page number, sub-question text
- Footer: "View Original PDF →" link (if `chunk_file_path` exists)
- Returns complete self-contained HTML page
- [ ] Graceful degradation per target (not per batch): if one chunk fetch fails, skip it and continue others. If LLM call fails entirely, return `status="failed"`.
- **Test file**: `test_phase5_chunk_highlight_service.py`
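
A simplified version of `render_highlight_html` could look like this. The signature is reduced for illustration (the planned one also takes `chunk_text` and `metadata`), `reasons_by_index` is a hypothetical flattening of the `RelevantSentence` list, and the grey `#6b7280` is an assumed colour; only the yellow `#fef08a` comes from the plan. All text is escaped because chunk content and LLM reasons are untrusted input to an HTML page.

```python
import html
from typing import Dict, List, Optional


def render_highlight_html(
    sentences: List[str],
    reasons_by_index: Dict[int, str],
    title: str,
    original_pdf_url: Optional[str] = None,
) -> str:
    """Return a self-contained HTML page with relevant sentences highlighted."""
    body = []
    # Iterating over the real sentences means out-of-range indices
    # returned by the LLM are ignored rather than raising.
    for index, sentence in enumerate(sentences):
        text = html.escape(sentence)
        reason = reasons_by_index.get(index)
        if reason is None:
            body.append(f"<p>{text}</p>")
        else:
            body.append(
                f'<p><mark style="background:#fef08a">{text}</mark><br>'
                f'<small style="color:#6b7280">{html.escape(reason)}</small></p>'
            )
    footer = ""
    if original_pdf_url:
        href = html.escape(original_pdf_url, quote=True)
        footer = f'<footer><a href="{href}">View Original PDF →</a></footer>'
    return (
        '<!DOCTYPE html><html><head><meta charset="utf-8">'
        f"<title>{html.escape(title)}</title></head><body>"
        f"<h1>{html.escape(title)}</h1>{''.join(body)}{footer}</body></html>"
    )


page = render_highlight_html(
    ["Key Dates are <fixed>.", "Unrelated sentence."],
    {0: "Defines milestones.", 7: "out of range, ignored"},
    title="NEC4 ACC, chunk 1",
)
page_with_footer = render_highlight_html(["A."], {}, "T", original_pdf_url="/files/a.pdf")
```

Omitting the footer when no URL is given matches the plan's rule that the "View Original PDF →" link appears only if `chunk_file_path` exists.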
#### Task 5.4.5: Backend — REST endpoints
- [ ] `POST /api/v1/v2/highlights/batch` in `backend/app/routers/chunks.py` (new file):
- Accepts `HighlightBatchRequest`
- Calls `chunk_highlight_service.compute_highlights_batch()`
- Returns `HighlightBatchResponse` (JSON)
- [ ] `GET /api/v1/v2/highlights` in same router:
- Query params: `document_id`, `chunk_index`, `sub_question`
- Computes cache key, reads from SQLite
- Cache hit → returns `text/html`
- Cache miss → returns 404
- [ ] Register router in `main.py`
- **Test file**: `test_phase5_highlight_endpoints.py`
#### Task 5.4.6: Frontend — Extract cited chunks from answer text
- [ ] Add `extractCitedSources(answerText: string, sources: SourceMetadata[]): SourceMetadata[]` to `citationParser.ts`
- Uses the same `citationPattern` regex as `replaceCitationPatterns()`
- For each `[citation]` match, resolves against `buildCitationLookup(sources)`
- Returns deduplicated list of `SourceMetadata` that were actually cited in the answer
- [ ] Exported alongside existing citation functions
- **Test file**: Update `citationParser.test.ts` with cited source extraction tests
#### Task 5.4.7: Frontend — Background batch trigger on response completion
- [ ] In `ResponsePanel.tsx` (inside `SubQuestionSections` component):
- After the answer is rendered (non-null, not loading), a `useEffect` fires:
1. Parse answer sections → extract cited sources per sub-question
2. Build `HighlightBatchRequest` payload
3. POST `/api/v1/v2/highlights/batch`
4. On success → store which (document_id, chunk_index, sub_question) triples are cached in a `useState<Set<string>>`
5. On failure → log error (no UI change; the "View PDF" link keeps pointing at the raw PDF)
- Source cards check `isHighlightReady(document_id, chunk_index, sub_q_text)` to decide whether the "View PDF" link targets the highlighted page or the raw PDF (see Task 5.4.8)
- **Test file**: Update `ResponsePanel.test.tsx`
#### Task 5.4.8: Frontend — Upgrade "View PDF" link on source cards
- [ ] In `SubQuestionSourceCard` (ResponsePanel.tsx):
- **Before batch completes**: "View PDF" link opens raw PDF chunk page (existing behavior, unchanged)
- **After batch completes**: Same link upgrades to open `/api/v1/v2/highlights?document_id=...&chunk_index=...&sub_question=...` (highlighted HTML page)
- The link stays in the same position — no repositioning, no second element
- While batch is loading: no visual change (the PDF link remains functional; the highlight upgrade is invisible to the user until ready)
- If batch fails: link stays as raw PDF permanently (graceful degradation)
- [ ] The highlighted HTML page includes a `[View Original PDF →]` footer link (rendered by backend in Task 5.4.4) so users can always reach the raw PDF
- [ ] No label change needed — "View PDF" remains appropriate since the user gets a page-level view of the chunk regardless
- **Test file**: Update `ResponsePanel.test.tsx` — verify link href upgrades after mock batch response, verify link stays as raw PDF when batch fails, verify loading state shows no visual disruption
#### Task 5.4.9: Integration testing
- [ ] Verify batch endpoint: POST with 3 targets → all cached → GET returns highlighted HTML for each
- [ ] Verify cache: same GET twice → both hit SQLite (no second LLM call)
- [ ] Verify graceful degradation: LLM call fails → batch returns `status="failed"` → GET returns 404 → frontend keeps the raw PDF link
- [ ] Verify cross-document-type: PDF, DOCX, TXT chunks all work (chunk text comes from ChromaDB regardless of source format)
- [ ] Verify sentence splitting handles mixed English/Chinese legal text (e.g., "The date is 2024年1月1日。The Contractor shall...")
- [ ] Run full test suite: `cd backend && pytest -v` and `cd frontend && npm test`
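
For the mixed English/Chinese check, a small regex-based splitter can serve as a reference when writing expectations. The function below is a hypothetical stand-in, not the splitter from Task 5.4.1: it splits after CJK terminators (`。！？`) even without trailing whitespace, and after ASCII terminators (`.!?`) only when whitespace follows, so decimals such as `3.5` stay intact.

```python
import re

# Boundary after a CJK terminator (whitespace optional) or after an ASCII
# terminator followed by whitespace. Requires Python 3.7+ so that re.split
# also splits on the empty match after "。".
_SENTENCE_BOUNDARY = re.compile(r"(?<=[。！？])\s*|(?<=[.!?])\s+")


def split_sentences(text: str) -> list:
    """Split text into sentences, keeping the terminator with each sentence."""
    pieces = _SENTENCE_BOUNDARY.split(text)
    return [piece.strip() for piece in pieces if piece and piece.strip()]


sample = "The date is 2024年1月1日。The Contractor shall comply. Done."
for sentence in split_sentences(sample):
    print(sentence)
```

A rule this simple still splits after abbreviations such as "No. 5", so the integration test should include a few such cases to show how the real splitter treats them.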
### Test Files
| # | Test File | Coverage |
|---|-----------|----------|
| T5.4.1 | `backend/app/test/test_phase5_sentence_splitter.py` | Sentence splitting: English, Chinese, mixed, empty, single sentence, bullet lists |
| T5.4.2 | `backend/app/test/test_phase5_highlight_models.py` | Pydantic validation: valid/invalid targets, empty batch, max batch size, LLM result parsing |
| T5.4.3 | `backend/app/test/test_phase5_highlight_cache.py` | SQLite cache: insert, retrieve, cache key collision, missing key, JSON serialization |
| T5.4.4 | `backend/app/test/test_phase5_chunk_highlight_service.py` | Service: valid batch → cached HTML, mocked LLM returns structured result, per-target error isolation, empty batch, ChromaDB miss → skip target |
| T5.4.5 | `backend/app/test/test_phase5_highlight_endpoints.py` | Endpoints: POST returns `cached_count`, GET returns HTML, GET cache miss → 404, invalid params → 422 |
| T5.4.6 | `frontend/src/test/utils/citationParser.test.ts` (update) | `extractCitedSources()`: extracts cited filenames, deduplicates, handles no citations, handles partial matches |
| T5.4.7 | `frontend/src/test/components/ResponsePanel.test.tsx` (update) | "View PDF" link upgrades to highlight page after batch completes, link stays as raw PDF when batch fails, no visual disruption during loading |
### Acceptance Tests
| # | Test File | Coverage |
|---|-----------|----------|
| AT5.4.1 | `backend/app/test/acceptance/test_acceptance_phase5_highlight.py` | Real LLM query → real chunk text → batch POST with real LLM → verify cached HTML contains yellow-highlighted spans on genuinely relevant sentences → spot-check 3 chunks manually |
---
## Updated Dependency Graph
```
Phase 5.1 (Structured Output) ✅
Phase 5.2 (Citation Linking) ✅
Phase 5.3 (DOCX/TXT PDF Gen) ✅
Phase 5.4 (LLM Highlighting)
Task 5.4.1 (sentence splitter)
├── Task 5.4.2 (highlight models)
│ │
│ ├── Task 5.4.3 (SQLite cache)
│ │ │
│ │ ├── Task 5.4.4 (batch service + HTML rendering)
│ │ │ │
│ │ │ └── Task 5.4.5 (REST endpoints)
│ │ │
│ │ └── Task 5.4.9 (integration tests)
│ │
│ └── Task 5.4.6 (frontend cited source extraction)
│ │
│ ├── Task 5.4.7 (frontend batch trigger)
│ │ │
│ │ └── Task 5.4.8 (frontend link upgrade)
│ │
│ └── Task 5.4.9 (integration tests)
└── All tasks → Task 5.4.9 (full suite)
```
### Updated Commit Plan
| Commit | Message | Scope |
|--------|---------|-------|
| 7 | `feat: add sentence splitter and highlight data models` | Tasks 5.4.1 + 5.4.2 + tests |
| 8 | `feat: add SQLite highlight cache service` | Task 5.4.3 + tests |
| 9 | `feat: add LLM-based batch highlight service and HTML rendering` | Task 5.4.4 + tests |
| 10 | `feat: add highlight batch and GET endpoints` | Task 5.4.5 + tests |
| 11 | `feat: add cited source extraction and background batch trigger` | Tasks 5.4.6 + 5.4.7 + tests |
| 12 | `feat: upgrade View PDF link to highlighted view on source cards` | Task 5.4.8 + tests |
---
## Complete Commit Plan (All Phases)
| Commit | Message | Scope |
|--------|---------|-------|
| 1 | `feat: add LangChain deps and SubQuestions Pydantic model` | Tasks 5.1.1 + 5.1.2 + tests |
| 2 | `feat: add LLMClient.complete_structured() with LangChain` | Task 5.1.3 + tests |
| 3 | `feat: refactor QueryDecomposer to use structured output with fallback` | Task 5.1.4 + tests |
| 4 | `chore: update decompose seed prompt for structured output` | Task 5.1.5 |
| 5 | `feat: add document_id to SourceMetadata model` | Task 5.2.1 + tests |
| 6 | `feat: fuzzy citation matching and document fallback links` | Tasks 5.2.3 + 5.2.4 + tests |
| 7 | `feat: add sentence splitter and highlight data models` | Tasks 5.4.1 + 5.4.2 + tests |
| 8 | `feat: add SQLite highlight cache service` | Task 5.4.3 + tests |
| 9 | `feat: add LLM-based batch highlight service and HTML rendering` | Task 5.4.4 + tests |
| 10 | `feat: add highlight batch and GET endpoints` | Task 5.4.5 + tests |
| 11 | `feat: add cited source extraction and background batch trigger` | Tasks 5.4.6 + 5.4.7 + tests |
| 12 | `feat: upgrade View PDF link to highlighted view on source cards` | Task 5.4.8 + tests |