Package 5 Enhancement Plan — Structured Output + Robust Citation Linking

Source: User request (2026-04-28)

Scope:

Phase 5.1: Replace manual JSON parsing in the decompose stage with LangChain with_structured_output()
Phase 5.2: Fix missing PDF links in citations and improve citation robustness
Phase 5.3: Generate per-chunk PDFs for DOCX/TXT documents at ingestion time
Phase 5.4: LLM-based sentence-level highlighting with eager background batch computation

Status: Phases 5.1 ✅, 5.2 ✅, 5.3 ✅, 5.4 ✅, 5.5 ✅. All complete (2026-04-29).

Phase 5.4 implemented: LLM-based sentence-level highlighting with eager background batch computation, SQLite cache, inline citation link upgrades, and toast notification.

Test results (Phase 5 final):

Backend: 108 passed, 0 failed (all integration + all Phase 5 tests)
Frontend: 45 passed, 0 failed (citation parser + ResponsePanel + highlight batch)
Total: 153 tests, 0 failures

Objective

Decompose structured output: Eliminate json.JSONDecodeError failures in QueryDecomposer.decompose() by integrating LangChain's with_structured_output() to enforce a Pydantic schema at the API level. The LLM response is guaranteed to be a valid SubQuestions object — no manual json.loads(), no regex markdown stripping, no silent failures.
Robust citation linking: Fix the citation→PDF link pipeline so that:
- document_id flows through to the frontend for fallback document-level links
- chunk_file_path is always available (generate per-chunk PDFs for DOCX/TXT too, or provide a document-level PDF fallback)
- Citation matching in citationParser.ts handles fuzzy filename matching (strips extensions, tolerates whitespace variations)
- Frontend provides fallback "View Document" links when chunk-level PDF is unavailable

Decision Register

#	Decision	Rationale
1	Use LangChain `with_structured_output()` (not OpenAI `response_format` directly)	User explicitly chose Option B. Provides cleaner API, auto-retry on validation failure, and future flexibility for other pipeline stages (filter, generate).
2	Add `langchain` + `langchain-openai` to `requirements.txt`	Required dependencies for `init_chat_model()` and `with_structured_output()`. `langchain` ~0.3.x for stable API.
3	Define `SubQuestions` Pydantic model with `questions: list[str]`	LangChain's `with_structured_output()` requires a wrapper Pydantic model — bare `list[str]` is unsupported by provider-native schema enforcement.
4	Keep `LLMClient` as the central LLM access layer, add LangChain-based `complete_structured()` method	Minimizes refactoring. `QueryDecomposer` calls `llm_client.complete_structured(prompt, SubQuestions)` instead of `llm_client.complete(prompt)`. Other callers (filter, generate) remain unchanged.
5	Run decomposition at `temperature=0.0` (was `0.7`)	Structured output benefits from deterministic behavior. Lower temperature = more reliable schema compliance.
6	Add `document_id` to `SourceMetadata` Pydantic model and frontend type	`document_id` is already stored in ChromaDB metadata (`metadata.py:70`) but is discarded during serialization. Adding it enables document-level fallback links.
7	~~Generate monolithic PDFs for DOCX/TXT documents~~ → DEFERRED	More complex than needed. Instead, use fallback document-level links via `document_id` when `chunk_file_path` is null. DOCX/TXT PDF generation deferred to Phase 5.3.
8	Fuzzy citation matching: strip extensions, trim whitespace	`citationParser.ts` currently requires exact filename match. LLM may shorten `NEC4 ACC.pdf` to `NEC4 ACC` in citations.
9	Fallback "View Document" link when `chunk_file_path` is null	Even after Decision #7, network failures or edge cases may leave null paths. The frontend should show a document-level PDF link as fallback.
10	Keep `_extract_json_from_markdown()` as a fallback for backward compatibility	During a transition period (or if `with_structured_output()` fails), the existing regex-based extraction serves as a safety net. Log a warning when fallback is used.
11	Add `logger.warning` for JSON parse failures before returning empty	The biggest blind spot today: JSON parse failures are silent. Log the raw LLM response (truncated) so operators can debug.
12	Keep `QueryDecomposer.decompose()` return type as `Tuple[List[str], str]`	Existing callers unpack the tuple. Adding `Tuple[List[str], str, SubQuestions
13	Spike-test LangChain structured output with OpenRouter BEFORE implementation	2-minute test calling `init_chat_model().with_structured_output().ainvoke()` through OpenRouter to confirm `response_format={"type": "json_schema"}` is proxied correctly. If not, fall back to `method="function_calling"`.
14	Tighten `generate_per_subq` prompt alongside frontend fuzzy matching	Add "Copy the exact bracket labels shown in the document chunks — do not modify filenames or add/remove extensions." to seed template. Two-layer defense: prompt reduces hallucinations + fuzzy matching catches remaining cases. No separate task — folded into Task 5.2.3.

Phase 5.1 — Structured Output for Decompose

Test Files (write BEFORE implementation)

#	Test File	Coverage
T5.1.1	`backend/app/test/test_phase5_llm_client_structured.py`	`LLMClient.complete_structured()` with mock LangChain model. Tests: valid Pydantic return, validation error → retry, empty questions list, non-JSON fallback.
T5.1.2	`backend/app/test/test_phase5_query_decomposer_structured.py`	`QueryDecomposer.decompose()` using `MockLLMClient.complete_structured()`. Tests: valid SubQuestions, empty questions, LLM error fallback, prompt service integration.
T5.1.3	`backend/app/test/test_phase5_subquestions_model.py`	`SubQuestions` Pydantic model validation. Tests: valid input, empty list, too many questions, non-string items rejected.
T5.1.4	`backend/app/test/test_phase5_decompose_logging.py`	Verify `logger.warning` is emitted when JSON parse fallback is triggered (backward-compat path).

Acceptance Tests

#	Test File	Coverage
AT5.1.1	`backend/app/test/acceptance/test_acceptance_phase5_structured_decompose.py`	Real LLM call with structured output. Tests: Cantonese question → valid sub-questions, English question → valid sub-questions, very short question → 1 sub-question, very long question → ≤5 sub-questions.

Implementation Tasks

Task 5.1.1: Add LangChain dependencies

Add langchain>=0.3.0,<0.4.0 and langchain-openai>=0.3.0,<0.4.0 to backend/requirements.txt
Run pip install -r backend/requirements.txt in dev venv
Test file: test_phase5_subquestions_model.py (can run immediately after install)

Task 5.1.2: Define `SubQuestions` Pydantic model

Create backend/app/models/decompose.py with:

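from pydantic import BaseModel, Field


class SubQuestions(BaseModel):
    """Wrapper schema for the decompose stage's structured output."""

    questions: list[str] = Field(
        description="2-5 simplified sub-questions, each focused on one aspect",
        min_length=1,
        max_length=5,
    )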

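Add min_length=1 and max_length=5 Pydantic constraints. The upper bound matches the decompose prompt's "2-5"; the lower bound of 1 is looser than the prompt so that a very short question can yield a single sub-question (see AT5.1.1).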
Test file: test_phase5_subquestions_model.py

Task 5.1.3: Add `complete_structured()` method to `LLMClient`

In llm_client.py, import init_chat_model from langchain.chat_models
Add self._langchain_model attribute (lazy-init from settings)
Add async complete_structured(prompt, pydantic_model, step_name) -> BaseModel method:
1. Calls self._langchain_model.with_structured_output(pydantic_model, method="json_schema").ainvoke(prompt)
2. Returns the validated Pydantic model instance
3. Logs timing (same pattern as existing complete())
4. Wraps errors in LLMClientError
Use temperature=0.0 via model config for structured calls
Test file: test_phase5_llm_client_structured.py
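
The four steps above can be sketched as follows. This is a minimal illustration, not the project's implementation: `StubChatModel` and `StubStructuredRunnable` are hypothetical stand-ins for the LangChain chat model (which the plan obtains via `init_chat_model`), and a dataclass stands in for the Pydantic `SubQuestions` model so the sketch runs with the standard library only.

```python
import asyncio
import logging
import time
from dataclasses import dataclass

logger = logging.getLogger("llm_client")


class LLMClientError(Exception):
    """Error type the plan says complete() and complete_structured() share."""


@dataclass
class SubQuestions:
    # Stand-in for the Pydantic model (assumption, keeps the sketch stdlib-only).
    questions: list[str]


class StubStructuredRunnable:
    """Hypothetical stand-in for what with_structured_output() returns."""

    def __init__(self, schema, fail=False):
        self.schema, self.fail = schema, fail

    async def ainvoke(self, prompt):
        if self.fail:
            raise RuntimeError("provider rejected the schema")
        return self.schema(questions=[f"What does '{prompt}' mean?"])


class StubChatModel:
    """Hypothetical stand-in for the lazily initialised LangChain model."""

    def __init__(self, fail=False):
        self.fail = fail

    def with_structured_output(self, schema, method="json_schema"):
        return StubStructuredRunnable(schema, self.fail)


class LLMClient:
    def __init__(self, model):
        self._langchain_model = model

    async def complete_structured(self, prompt, pydantic_model, step_name="unknown"):
        start = time.perf_counter()
        try:
            runnable = self._langchain_model.with_structured_output(
                pydantic_model, method="json_schema"
            )
            result = await runnable.ainvoke(prompt)
        except Exception as exc:
            # Wrap every failure so callers only handle one exception type.
            raise LLMClientError(f"{step_name}: structured call failed: {exc}") from exc
        logger.info("%s structured call took %.3fs", step_name, time.perf_counter() - start)
        return result


client = LLMClient(StubChatModel())
ok = asyncio.run(client.complete_structured("NEC4", SubQuestions, "QueryDecomposer"))
```

Because the stub mirrors only the two calls named in step 1, swapping in a real model should not change the shape of `complete_structured()`; whether OpenRouter proxies `method="json_schema"` correctly is exactly what the spike test in Decision 13 is meant to confirm.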

Task 5.1.4: Refactor `QueryDecomposer.decompose()` to use structured output

Change decompose() to call self.llm_client.complete_structured(prompt, SubQuestions, step_name="QueryDecomposer")
Add fallback path: if complete_structured() raises → log warning → attempt legacy complete() + json.loads() → if that works, log info "structured output failed, fallback succeeded"
Add logger.warning("Decompose JSON parse failed, raw response (first 500 chars): %s", response[:500]) when both paths fail
Keep return type Tuple[List[str], str] unchanged
Keep _extract_json_from_markdown() for backward-compat fallback path
Test file: test_phase5_query_decomposer_structured.py and test_phase5_decompose_logging.py
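
A sketch of the fallback chain described in this task, under stated assumptions: `FakeLLMClient` is a hypothetical test double whose structured path always fails, the helper returns only the question list (decompose() would wrap it in its existing tuple), and the `_extract_json_from_markdown()` step is omitted for brevity.

```python
import asyncio
import json
import logging
from dataclasses import dataclass
from typing import List

logger = logging.getLogger("query_decomposer")


@dataclass
class SubQuestions:
    # Stand-in for the Pydantic model (assumption, keeps the sketch stdlib-only).
    questions: List[str]


class FakeLLMClient:
    """Hypothetical double: structured path fails, legacy path returns raw text."""

    def __init__(self, legacy_response: str):
        self.legacy_response = legacy_response

    async def complete_structured(self, prompt, model, step_name=""):
        raise RuntimeError("structured output unavailable")

    async def complete(self, prompt):
        return self.legacy_response


async def get_sub_questions(llm_client, prompt: str) -> List[str]:
    # Preferred path: schema-validated structured output.
    try:
        result = await llm_client.complete_structured(
            prompt, SubQuestions, step_name="QueryDecomposer"
        )
        return result.questions
    except Exception as exc:
        logger.warning("Structured decompose failed (%s); trying legacy path", exc)

    # Legacy path: plain completion followed by manual JSON parsing.
    response = await llm_client.complete(prompt)
    try:
        questions = json.loads(response)
        if not (isinstance(questions, list) and all(isinstance(q, str) for q in questions)):
            raise ValueError("expected a JSON array of strings")
    except ValueError:  # json.JSONDecodeError is a ValueError subclass
        logger.warning(
            "Decompose JSON parse failed, raw response (first 500 chars): %s",
            response[:500],
        )
        return []
    logger.info("structured output failed, fallback succeeded")
    return questions


recovered = asyncio.run(get_sub_questions(FakeLLMClient('["a", "b"]'), "prompt"))
empty = asyncio.run(get_sub_questions(FakeLLMClient("not json"), "prompt"))
```

The empty list is returned only after both paths fail, and only after the raw response has been logged, which is the blind spot Decision 11 targets.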

Task 5.1.5: Update prompt template for structured output

Update _SEED_DECOMPOSE in sqlite_db.py to instruct the LLM about the expected structure
New seed prompt: mention that output will be validated against a schema — more explicit about JSON array of strings requirement
Run seed_default_profiles() to backfill existing profiles
Test file: Existing test_phase3_prompt_service.py should continue to pass

Task 5.1.6: Integration test — end-to-end query pipeline

Verify existing integration tests still pass (test_integration_phase1.py, test_phase4_integration_query_pipeline.py)
Verify acceptance test passes with real LLM (test_acceptance_phase1_rag_query.py)
Run full test suite: cd backend && pytest app/test/test_phase5*.py app/test/test_phase4*.py app/test/test_phase3*.py -v

Phase 5.2 — Robust Citation Linking

Test Files (write BEFORE implementation)

#	Test File	Coverage
T5.2.1	`backend/app/test/test_phase5_source_metadata.py`	`SourceMetadata` model with `document_id`. Tests: serialization includes document_id, backward compat (old data without document_id).
T5.2.2	`backend/app/test/test_phase5_docx_pdf_generation.py`	DOCX/TXT ingestion now sets `chunk_file_path`. Tests: DOCX ingestion produces chunk PDFs, TXT ingestion produces chunk PDFs, PDF generation errors are handled gracefully.
T5.2.3	`frontend/src/test/utils/test_phase5_citation_parser_fuzzy.test.ts`	Fuzzy citation matching. Tests: citation `[NEC4 ACC]` matches source `NEC4 ACC.pdf`, citation `[nec4 acc.pdf, page 3]` matches after whitespace trim, citation `[NEC4 ACC.PDF]` matches case-insensitively, fallback "View Document" link shown when `chunk_file_path` is null.
T5.2.4	`frontend/src/test/utils/test_phase5_citation_fallback_link.test.ts`	Fallback document link rendering. Tests: chunk with `chunk_file_path: null` but `document_id` present → renders "View Document" link, chunk with both null → remains plain text, chunk with `chunk_file_path` → renders page-level PDF link.

Acceptance Tests

#	Test File	Coverage
AT5.2.1	`backend/app/test/acceptance/test_acceptance_phase5_citation_links.py`	Real LLM query with DOCX and PDF documents. Verify citations in the answer are clickable in the SSE response (sources include document_id and chunk_file_path).

Implementation Tasks

Task 5.2.1: Add `document_id` to `SourceMetadata` model

In backend/app/models/common.py, add document_id: Optional[str] = None to SourceMetadata
In backend/app/routers/query.py lines 310-319, include document_id=meta.get("document_id") when building SourceMetadata objects
In frontend/src/types/index.ts, add document_id: string | null to SourceMetadata interface
Test file: test_phase5_source_metadata.py

Task 5.2.2: Generate PDFs for DOCX/TXT documents during ingestion (deferred per Decision 7; implemented in Phase 5.3)

Add reportlab to backend/requirements.txt (lightweight, pure Python PDF generation, no external binaries)
In backend/app/routers/ingest.py DOCX and TXT branches, add PDF generation logic:
1. After chunking, render each chunk's text to its own PDF (one file per chunk)
2. Store chunk_filename = f"{stem}_chunk_{idx}.pdf" for each chunk
3. Set chunk_file_paths list and pass to extract_metadata()
Add error handling: if PDF generation fails, chunk_file_path stays None (graceful degradation)
Use logger.warning on generation failure
Test file: test_phase5_docx_pdf_generation.py
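
The graceful-degradation rule in this task can be sketched as below. This is an illustration under assumptions: the renderer is injected as a callable so the example does not depend on reportlab, and `generate_chunk_pdfs` and `fake_render` are hypothetical names, not functions from ingest.py.

```python
import logging
import tempfile
from pathlib import Path
from typing import Callable, List, Optional

logger = logging.getLogger("ingest")


def generate_chunk_pdfs(
    stem: str,
    chunks: List[str],
    out_dir: Path,
    render_pdf: Callable[[str, Path], None],
) -> List[Optional[str]]:
    """Render one PDF per chunk; a failed chunk yields None instead of aborting."""
    paths: List[Optional[str]] = []
    for idx, text in enumerate(chunks):
        target = out_dir / f"{stem}_chunk_{idx}.pdf"
        try:
            render_pdf(text, target)
            paths.append(str(target))
        except Exception as exc:
            # chunk_file_path stays None for this chunk; ingestion continues.
            logger.warning("PDF generation failed for %s: %s", target.name, exc)
            paths.append(None)
    return paths


def fake_render(text: str, target: Path) -> None:
    """Hypothetical renderer used only for this demo."""
    if "boom" in text:
        raise ValueError("cannot render chunk")
    target.write_bytes(b"%PDF-1.4 placeholder")


with tempfile.TemporaryDirectory() as tmp:
    chunk_file_paths = generate_chunk_pdfs(
        "contract", ["ok", "boom", "ok"], Path(tmp), fake_render
    )
```

Keeping the list index-aligned with the chunks means a `None` entry maps directly to the chunk whose citation should fall back to the document-level link.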

Task 5.2.3: Improve `citationParser.ts` with fuzzy matching

Add extension-stripping helper: stripExtension(filename: string): string — removes .pdf, .docx, .txt
Modify buildCitationLookup() to register both filename and stripExtension(filename) as lookup keys
Add trim-whitespace normalization on citation text before lookup
Add test for LLM-common variations: NEC4 ACC.pdf vs NEC4 ACC vs NEC4_acc.pdf
Test file: test_phase5_citation_parser_fuzzy.test.ts
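
The lookup strategy is independent of the frontend language, so it is sketched here in Python for consistency with the backend examples; the real code lives in citationParser.ts. The function names mirror the task but are illustrative. Two details are assumptions: labels are lowercased (T5.2.3 asks for case-insensitive matching), and anything after the first comma, such as ", page 3", is dropped before lookup.

```python
import re
from typing import Dict, List, Optional

_EXTENSION = re.compile(r"\.(pdf|docx|txt)$", re.IGNORECASE)


def strip_extension(filename: str) -> str:
    """Remove a trailing .pdf, .docx, or .txt."""
    return _EXTENSION.sub("", filename)


def normalize(label: str) -> str:
    """Trim, collapse internal whitespace runs, and lowercase."""
    return re.sub(r"\s+", " ", label.strip()).lower()


def build_citation_lookup(filenames: List[str]) -> Dict[str, str]:
    """Register each source under both its full and extension-less name."""
    lookup: Dict[str, str] = {}
    for name in filenames:
        lookup.setdefault(normalize(name), name)
        lookup.setdefault(normalize(strip_extension(name)), name)
    return lookup


def resolve(citation: str, lookup: Dict[str, str]) -> Optional[str]:
    """Map a bracketed citation to a source filename, or None if unknown."""
    label = citation.strip("[] ").split(",")[0]  # assumption: comma starts the page part
    return lookup.get(normalize(label))


lookup = build_citation_lookup(["NEC4 ACC.pdf"])
match = resolve("[NEC4 ACC]", lookup)
```

Whether a variant such as NEC4_acc.pdf should also match is left open here, since the task lists it as a test case without stating the expected outcome.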

Task 5.2.4: Add fallback "View Document" link in frontend

In citationParser.ts replaceCitationPatterns(), when source?.chunk_file_path is null but source?.document_id exists:
1. Build a URL to the document chunk list page: /rag-database?document_id=${source.document_id}
2. Return [${trimmed}](${url}) with a different CSS class (e.g., text-green-600 for document-level vs text-blue-600 for page-level)
In ResponsePanel.tsx, update CitationLink component to accept a variant prop for visual differentiation
Test file: test_phase5_citation_fallback_link.test.ts

Task 5.2.5: Integration and regression testing

Verify all existing citation parser tests still pass: cd frontend && npx vitest run src/test/utils/citationParser.test.ts
Verify ResponsePanel tests still pass: npx vitest run src/test/components/ResponsePanel.test.tsx
Run full frontend test suite: npm test
Verify SSE streaming integration: query with a mix of PDF and DOCX documents, confirm citations are clickable

Dependency Graph

Phase 5.1 (Structured Output)
  Task 5.1.1 (add deps) ──┬── Task 5.1.2 (SubQuestions model) ── Task 5.1.3 (complete_structured)
                           │                                           │
                           │                                           ▼
                           │                              Task 5.1.4 (refactor decompose)
                           │                                           │
                           │                              Task 5.1.5 (update prompt template)
                           │                                           │
                           │                                           ▼
                           │                              Task 5.1.6 (integration tests)
                           │
Phase 5.2 (Citation Linking) — independent, can run in parallel with 5.1
  Task 5.2.1 (document_id in model) ──┬── Task 5.2.3 (fuzzy matching)
  Task 5.2.2 (DOCX/TXT PDF gen)    ──┤
                                      ├── Task 5.2.4 (fallback link)
                                      │
                                      ▼
                              Task 5.2.5 (integration tests)

Acceptance Criteria

Phase 5.1 Completion Checklist

LLMClient.complete_structured() returns validated SubQuestions Pydantic model — no json.JSONDecodeError possible
QueryDecomposer.decompose() never returns [] due to JSON parse failure
Fallback path (legacy json.loads()) logs a warning when triggered
Existing decompose tests pass (test_phase1_query_decomposer.py)
New structured output tests pass (test_phase5_*.py) — 33 tests
Spike test passed: Cantonese + English → valid sub-questions
SQLite seed templates updated and backfilled to all profiles
langchain and langchain-openai installed in venv (1.2.x)

Phase 5.2 Completion Checklist

SourceMetadata includes document_id in both backend and frontend types
~~DOCX/TXT ingestion generates per-chunk PDF files~~ → DEFERRED to Phase 5.3
citationParser.ts matches [NEC4 ACC] to source NEC4 ACC.pdf (fuzzy matching)
citationParser.ts renders fallback link to /rag-database?document=xxx when chunk_file_path is null but document_id exists
RAGDatabasePage auto-expands document from ?document= URL param
All existing citation parser tests pass (14 tests)
All existing ResponsePanel tests pass
generate_per_subq seed prompt tightened: "Copy the exact bracket labels shown"

Rollback Plan

If with_structured_output() causes issues in production:

The complete_structured() method wraps errors in LLMClientError — same exception type as existing complete()
QueryDecomposer.decompose() has a fallback to legacy complete() + json.loads() path
The _extract_json_from_markdown() function is preserved for backward compatibility
If LangChain is a complete failure, revert requirements.txt and llm_client.py changes (3 files), keeping the Pydantic model and improved logging

Phase 5.3 — DOCX/TXT PDF Generation ✅

Generate per-chunk PDF files for DOCX/TXT documents at ingestion time so they have the same chunk_file_path → PDF viewer flow as PDF documents.

Status: Complete (2026-04-28). Implemented in commit 25b26c9.

reportlab==4.2.5 added to requirements.txt
New backend/app/utils/text_to_pdf.py: renders chunk text as simple PDFs with word wrapping
ingest.py DOCX/TXT branches: generates {stem}_chunk_{idx}.pdf per chunk, passes chunk_file_paths to extract_metadata()
Graceful degradation: chunk_file_path stays None on generation failure (logged as warning)
Tests: test_phase5_docx_pdf_generation.py (5 tests), updated test_phase1_ingest_page_aware.py (2 assertions)

Phase 5.4 — Sentence-Level Highlighting ✅

Problem

When a user clicks a citation link to view a cited chunk, they see the full chunk text (up to ~1000 tokens). They have to manually scan to find which sentences actually drove the relevance. This is especially painful for long, dense chunks.

Solution: LLM-Based Relevance Detection + Eager Background Computation

Embedding similarity was rejected — cosine similarity measures semantic proximity, not answer relevance. A sentence like "The Contractor shall submit a quotation within three weeks" has low embedding similarity to "What are the compensation event timelines?" but is directly relevant to answering it. Only an LLM can distinguish topic-matching from answer-relevance.

Strategy: After the final response renders on the frontend, a single background LLM call identifies relevant sentences across all cited chunks (grouped by sub-question). Results are cached in SQLite. The "View Highlighted" button activates once caching completes. Clicks serve instantly from cache.

Why one LLM call, not one per chunk: A sub-question typically cites only 2-3 chunks (even though up to 10 are retrieved). A single batched prompt containing all sub-questions and their cited chunks costs about one LLM call per query, rather than one call per (sub-question, chunk) pair. It reuses the complete_structured() infrastructure from Phase 5.1.

Why HTML, not PDF?

Approach	Complexity	Works for all doc types?	Preserves original formatting?
Highlighted HTML page	Low	✅ Yes (uses chunk text from ChromaDB)	❌ Plain text only
Highlighted PDF via reportlab	Medium	✅ Yes (new PDF)	❌ Plain text only
Overlay highlights on existing PDF	High	⚠️ PDF only	✅ Yes

Recommendation: HTML page. Works uniformly across PDF/DOCX/TXT. Original formatting is preserved in the existing PDF viewer (chunk_file_path link) — the highlighted HTML view is a supplementary view reached via a separate button. The two views coexist: "View Original PDF" vs "View Highlighted Text".

Architecture

EAGER PHASE (background, fires immediately after SSE completed):
─────────────────────────────────────────────────
Frontend receives SSE "completed" event
    │
    ├── Renders answer with citations instantly (no waiting)
    │
    └── useEffect fires POST /api/v1/v2/highlights/batch
              │
              ▼
        Backend highlight batch service:
          1. For each sub-question, extract cited chunks
             (only 2-3 actually cited per sub-q, not all 10)
          2. Fetch chunk texts from ChromaDB
          3. Split each chunk into sentences
          4. Single LLM call: identify which sentences are
             relevant to their sub-question (structured output)
          5. Render each chunk as highlighted HTML
          6. Cache all HTML + metadata in SQLite
          7. Return {status, cached_count} to frontend
               │
               ▼
         Frontend receives response → upgrades "View PDF"
         links on source cards to point to cached highlight pages
         (same button, same position, upgraded destination)

LAZY PHASE (on user click, instant from cache):
──────────────────────────────────────────────
User clicks "View PDF" on a source card
    │
    ▼
GET /api/v1/v2/highlights?document_id=...&chunk_index=...&sub_question=...
    │
    ▼
Backend reads cached HTML from SQLite → serves instantly
    │
    ▼
Highlighted HTML page shows:
  ┌─ Highlighted sentences (yellow background + LLM reasons)
  └─ Footer: [View Original PDF →] link to raw PDF chunk

What Gets Highlighted

┌──────────────────────────────────────────────────────────┐
│ Chunk: NEC4 ACC, page 12          [View Original PDF →]  │
├──────────────────────────────────────────────────────────┤
│                                                            │
│ The programme shall be prepared in a form acceptable to   │
│ the Project Manager. It shall include:                    │
│                                                            │
│ ████████████████████████████████████████████████████████ │
│ █ The starting date, access dates, and Key Dates.       █ │  ← LLM: relevant to timeline question
│ █ "Defines when work begins and key milestones."        █ │  ← Brief reason from LLM
│ ████████████████████████████████████████████████████████ │
│                                                            │
│ The Contractor shall submit a first programme within      │
│ ████████████████████████████████████████████████████████ │
│ █ two weeks of the starting date.                       █ │  ← LLM: relevant to timeline question
│ █ "Specifies the deadline for initial programme."       █ │
│ ████████████████████████████████████████████████████████ │
│                                                            │
│ The Project Manager may instruct the Contractor to        │
│ submit a revised programme showing the effects of a       │
│ compensation event. This does not affect the Contractor's │
│ right to be paid for preparing the programme.             │  ← Not highlighted (LLM: not relevant to timeline)
│                                                            │
├──────────────────────────────────────────────────────────┤
│                                   [View Original PDF →]   │
└──────────────────────────────────────────────────────────┘

Key Design Decisions

#	Decision	Rationale
1	HTML page, not PDF	Zero extra dependencies. CSS highlighting is more flexible. Original PDF remains available via separate link.
2	LLM relevance detection, not embedding similarity	Embedding similarity measures topic overlap, not answer relevance. LLM distinguishes "about timelines" from "answers the timeline question."
3	Single batched LLM call for all sub-questions + cited chunks	One API call per query, not one per chunk. Each sub-question typically cites 2-3 chunks. Total ~6-15 chunks per query in a single prompt.
4	Eager computation (background after response renders)	User sees answer instantly. Highlight computation happens while they read. "View Highlighted" activates when ready. Zero perceived latency on click.
5	Only cited chunks (not all retrieved chunks)	Up to 10 chunks retrieved per sub-question, but only 2-3 are actually cited in the answer. Computing highlights for uncited chunks wastes tokens.
6	SQLite cache for highlight results	Durable across restarts. Same chunk+sub-question pair only computed once. Subsequent clicks (or same query re-run) serve instantly. Cache key = hash(document_id + chunk_index + sub_question).
7	Reuse `complete_structured()` from Phase 5.1	`LLMClient.complete_structured(prompt, HighlightBatchResult)` returns validated Pydantic model. No new LLM calling infrastructure.
8	Sentence-level granularity	Paragraphs are too coarse, words are too noisy. Sentences are the natural unit of meaning. LLM returns sentence indices.
9	Regex sentence splitter (no NLTK)	NLTK adds 35 MB of deploy weight for edge-case handling the project doesn't need. `re.split(r'(?<=[.!?。！？\n])\s+', text)` is intended to cover English + Cantonese legal text.
10	Sentence indices, not text spans	Deterministic: the splitter always produces the same sentences from the same chunk text. LLM returns indices to avoid substring matching fragility.
11	Graceful degradation	LLM call fails → cache miss → GET returns 404. Frontend keeps "View Highlighted" disabled/loading until batch succeeds. If batch fails, button stays hidden.
12	"View Highlighted" on source cards and inline citations	Decision expanded during implementation: inline citation links in the answer text also upgrade to highlighted view via `buildCitationUrl()` checking `highlightKeys`. Source cards have `SubQuestionSource` context; inline citations route via `sub_question_text` added to `SourceMetadata`.
13	Upgrade "View PDF" link in place (no second button)	Before the LLM completes, the "View PDF" link opens the raw PDF chunk page (existing behavior). Once the batch LLM call finishes caching, the same link upgrades to open the highlighted HTML page instead. The highlighted page includes a "View Original PDF →" footer link. If the LLM call fails, the link stays as raw PDF. One link, seamless upgrade.

Data Model

Batch Request (POST)

from pydantic import BaseModel, Field


class ChunkHighlightTarget(BaseModel):
    """A single chunk to highlight, tied to its sub-question."""
    document_id: str
    chunk_index: int
    sub_question_text: str
    sub_question_index: int

class HighlightBatchRequest(BaseModel):
    """Batch request: all cited chunks grouped by sub-question."""
    targets: list[ChunkHighlightTarget]

Structured LLM Output

class RelevantSentence(BaseModel):
    sentence_index: int = Field(description="0-based index of the relevant sentence")
    reason: str = Field(description="Brief explanation of why this sentence is relevant (max 80 chars)")

class ChunkHighlights(BaseModel):
    document_id: str
    chunk_index: int
    relevant_sentences: list[RelevantSentence] = Field(default_factory=list)

class HighlightBatchResult(BaseModel):
    """Single LLM call returns highlights for all sub-questions and chunks."""
    results: list[ChunkHighlights]

Batch Response (POST response, JSON)

from typing import Literal

class HighlightBatchResponse(BaseModel):
    status: Literal["completed", "partial", "failed"]
    cached_count: int  # Number of highlights successfully computed and cached
    errors: list[str] = Field(default_factory=list)  # Per-chunk errors (LLM failed on some but not all)

SQLite Cache Table

CREATE TABLE IF NOT EXISTS chunk_highlights (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    cache_key TEXT UNIQUE NOT NULL,  -- sha256(document_id + chunk_index + sub_question)
    document_id TEXT NOT NULL,
    chunk_index INTEGER NOT NULL,
    sub_question TEXT NOT NULL,
    relevant_sentences_json TEXT NOT NULL,  -- JSON of RelevantSentence[]
    html_content TEXT NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX IF NOT EXISTS idx_highlights_cache_key ON chunk_highlights(cache_key);

GET Endpoint Response

GET /api/v1/v2/highlights?document_id=...&chunk_index=...&sub_question=...
→ 200 text/html  (cache hit: highlighted HTML page)
→ 404            (cache miss: batch not yet computed or failed)

LLM Prompt Structure (Single Call, All Sub-Questions)

For each sub-question below, identify which sentences in each cited chunk are directly relevant to answering that sub-question. Return only the index and a brief reason for each relevant sentence. Skip sentences that are about the same topic but don't contribute to the answer.

Sub-question 1: "What are the compensation event timelines in NEC4?"
  [Chunk: NEC4 ACC, chunk_index=2]
    [0] The Contractor shall submit a quotation within three weeks of the instruction.
    [1] The Project Manager may instruct the Contractor to submit a revised programme.
    [2] This does not affect the Contractor's right to be paid for preparing the programme.
  [Chunk: NEC4 ACC, chunk_index=5]
    [0] Compensation events are events which change the total of the Prices.
    [1] The Project Manager shall notify the Contractor within one week of becoming aware.

Sub-question 2: "What are the key dates and milestones?"
  [Chunk: NEC4 ACC, chunk_index=1]
    [0] The starting date shall be defined as the first working day.
    [1] Key Dates are dates by which the Contractor must complete work.
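
A prompt in this layout could be assembled as sketched below. `PromptChunk` and `build_batch_prompt` are hypothetical names, and the instruction text is abbreviated; the sketch only illustrates grouping by sub-question and numbering sentences from 0 within each chunk.

```python
from dataclasses import dataclass
from typing import Dict, List

INSTRUCTIONS = (
    "For each sub-question below, identify which sentences in each cited "
    "chunk are directly relevant to answering that sub-question."
)


@dataclass
class PromptChunk:
    """Hypothetical helper type: one cited chunk, already split into sentences."""
    sub_question_text: str
    label: str
    chunk_index: int
    sentences: List[str]


def build_batch_prompt(chunks: List[PromptChunk]) -> str:
    # Group chunks by sub-question, keeping first-seen order (dicts are ordered).
    grouped: Dict[str, List[PromptChunk]] = {}
    for chunk in chunks:
        grouped.setdefault(chunk.sub_question_text, []).append(chunk)

    lines = [INSTRUCTIONS, ""]
    for number, (question, members) in enumerate(grouped.items(), start=1):
        lines.append(f'Sub-question {number}: "{question}"')
        for chunk in members:
            lines.append(f"  [Chunk: {chunk.label}, chunk_index={chunk.chunk_index}]")
            # Indices restart at 0 per chunk so they match the splitter output.
            for idx, sentence in enumerate(chunk.sentences):
                lines.append(f"    [{idx}] {sentence}")
        lines.append("")
    return "\n".join(lines).rstrip()


prompt = build_batch_prompt([
    PromptChunk("Q1", "NEC4 ACC", 2, ["A.", "B."]),
    PromptChunk("Q2", "NEC4 ACC", 1, ["C."]),
    PromptChunk("Q1", "NEC4 ACC", 5, ["D."]),
])
```

Since the model answers with indices rather than text, the indices printed here must come from the same splitter run that later renders the highlights.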

Implementation Tasks

Task 5.4.1: Backend — Sentence splitting utility

Create backend/app/utils/sentence_splitter.py
Function split_sentences(text: str) -> list[str] returns ordered sentence strings
Use regex: re.split(r'(?<=[.!?。！？\n])\s+', text) (handles English + Chinese punctuation)
Strip empty strings, preserve original sentence text whitespace
Handle edge cases: empty text, single sentence, lists/bullets
Test file: test_phase5_sentence_splitter.py
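
A sketch of the splitter follows. One caveat worth checking in the tests: the regex quoted above only splits where whitespace follows the terminator, and Chinese full stops are normally written without a trailing space, so a string like "2024年1月1日。The Contractor..." stays in one piece. `CJK_AWARE_PATTERN` is an assumed variant, not part of the plan, that also splits directly after 。！？.

```python
import re
from typing import List

# Regex quoted in the plan: split on whitespace that follows a terminator.
PLAN_PATTERN = re.compile(r"(?<=[.!?。！？\n])\s+")

# Assumed variant (not from the plan): CJK terminators usually have no
# trailing space, so split right after them even when \s* matches nothing.
CJK_AWARE_PATTERN = re.compile(r"(?<=[.!?\n])\s+|(?<=[。！？])\s*")


def split_sentences(text: str, pattern: "re.Pattern[str]" = CJK_AWARE_PATTERN) -> List[str]:
    """Return non-empty sentences in document order, without trimming them."""
    return [part for part in pattern.split(text) if part.strip()]


mixed = "The date is 2024年1月1日。The Contractor shall submit."
with_plan_regex = split_sentences(mixed, PLAN_PATTERN)
with_variant = split_sentences(mixed)
```

Neither pattern handles abbreviations such as "cl." followed by a space; whether that matters for the corpus is a judgment the tests for this task can settle. Splitting on an empty match relies on Python 3.7 or later.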

Task 5.4.2: Backend — Highlight models (`HighlightBatchResult` and related Pydantic schemas)

Create backend/app/models/highlight.py with:
- ChunkHighlightTarget (request input)
- HighlightBatchRequest (request input)
- RelevantSentence (LLM output unit)
- ChunkHighlights (LLM output per chunk)
- HighlightBatchResult (LLM output wrapper)
- HighlightBatchResponse (POST response)
Test file: test_phase5_highlight_models.py

Task 5.4.3: Backend — SQLite cache for highlight results

Add chunk_highlights table to backend/app/services/highlight_cache.py
Functions: get_highlight(cache_key) -> str | None, set_highlight(cache_key, html_content, relevant_sentences_json), compute_cache_key(document_id, chunk_index, sub_question) -> str
Database path: backend/app/data/highlights.db (separate from prompts/history to avoid schema conflicts)
Initialize table on first use (same pattern as prompt_service.py / history_service.py)
Test file: test_phase5_highlight_cache.py
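
A minimal sketch of the cache functions against an in-memory database. Two points are assumptions rather than part of the plan: the key is hashed over delimiter-joined fields (plain concatenation would make ("doc1", 12) and ("doc11", 2) collide), and `set_highlight` takes the three identifying fields because the table declares them NOT NULL.

```python
import hashlib
import sqlite3
from typing import Optional

SCHEMA = """
CREATE TABLE IF NOT EXISTS chunk_highlights (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    cache_key TEXT UNIQUE NOT NULL,
    document_id TEXT NOT NULL,
    chunk_index INTEGER NOT NULL,
    sub_question TEXT NOT NULL,
    relevant_sentences_json TEXT NOT NULL,
    html_content TEXT NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
)
"""


def compute_cache_key(document_id: str, chunk_index: int, sub_question: str) -> str:
    # Assumption: a unit-separator delimiter keeps adjacent fields unambiguous.
    raw = "\x1f".join([document_id, str(chunk_index), sub_question])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def set_highlight(
    conn: sqlite3.Connection,
    document_id: str,
    chunk_index: int,
    sub_question: str,
    html_content: str,
    relevant_sentences_json: str,
) -> str:
    key = compute_cache_key(document_id, chunk_index, sub_question)
    # INSERT OR REPLACE overwrites an existing row with the same cache_key.
    conn.execute(
        "INSERT OR REPLACE INTO chunk_highlights "
        "(cache_key, document_id, chunk_index, sub_question, "
        "relevant_sentences_json, html_content) VALUES (?, ?, ?, ?, ?, ?)",
        (key, document_id, chunk_index, sub_question, relevant_sentences_json, html_content),
    )
    conn.commit()
    return key


def get_highlight(conn: sqlite3.Connection, cache_key: str) -> Optional[str]:
    row = conn.execute(
        "SELECT html_content FROM chunk_highlights WHERE cache_key = ?", (cache_key,)
    ).fetchone()
    return row[0] if row else None


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
key = set_highlight(conn, "doc1", 12, "Q?", "<p>hi</p>", "[]")
```

A `None` from `get_highlight` corresponds to the 404 branch of the GET endpoint.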

Task 5.4.4: Backend — Highlight batch service (LLM call + HTML rendering)

Create backend/app/services/chunk_highlight_service.py
Method compute_highlights_batch(targets: list[ChunkHighlightTarget]) -> HighlightBatchResponse:
1. Group targets by sub-question for prompt construction
2. For each target, fetch chunk text + metadata from ChromaDB (collection.get(ids=[f"{document_id}_{chunk_index}"]))
3. Split chunk text into sentences via split_sentences()
4. Build a single batched prompt (see Prompt Structure above)
5. Call llm_client.complete_structured(prompt, HighlightBatchResult) → validated Pydantic result
6. For each ChunkHighlights in result: render HTML, compute cache key, store in SQLite
7. Return HighlightBatchResponse(status="completed", cached_count=N)
Method render_highlight_html(chunk_text, sentences, relevant_sentences, metadata) -> str:
- Inline CSS: yellow background (#fef08a) on relevant sentences, grey reason text below each
- Header: filename, chunk index, page number, sub-question text
- Footer: "View Original PDF →" link (if chunk_file_path exists)
- Returns complete self-contained HTML page
Graceful degradation per target (not per batch): if one chunk fetch fails, skip it and continue others. If LLM call fails entirely, return status="failed".
Test file: test_phase5_chunk_highlight_service.py
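
The rendering step might look like the sketch below. The signature is simplified relative to the one named in this task (metadata is passed as individual arguments and the raw chunk text is omitted), and the markup is an assumption; only the #fef08a highlight colour, the reason text, and the footer link come from the plan.

```python
import html
from typing import Dict, List, Optional


def render_highlight_html(
    sentences: List[str],
    reasons_by_index: Dict[int, str],
    filename: str,
    chunk_index: int,
    sub_question: str,
    chunk_file_url: Optional[str] = None,
) -> str:
    """Build a self-contained HTML page with relevant sentences highlighted."""
    body = []
    for idx, sentence in enumerate(sentences):
        text = html.escape(sentence)  # chunk text is untrusted, always escape
        if idx in reasons_by_index:
            reason = html.escape(reasons_by_index[idx])
            body.append(
                f'<p><mark style="background:#fef08a">{text}</mark>'
                f'<br><small style="color:#6b7280">{reason}</small></p>'
            )
        else:
            body.append(f"<p>{text}</p>")

    footer = ""
    if chunk_file_url:
        href = html.escape(chunk_file_url, quote=True)
        footer = f'<footer><a href="{href}">View Original PDF →</a></footer>'

    title = html.escape(filename)
    return (
        '<!DOCTYPE html><html><head><meta charset="utf-8">'
        f"<title>{title}</title></head><body>"
        f"<header><h1>{title}, chunk {chunk_index}</h1>"
        f"<p>{html.escape(sub_question)}</p></header>"
        + "".join(body) + footer + "</body></html>"
    )


page = render_highlight_html(
    ["A <b>.", "B."], {0: "why", 7: "out of range"}, "NEC4 ACC", 2, "Q?"
)
```

Looking reasons up by index means an out-of-range `sentence_index` from the LLM is silently ignored rather than raising, which fits the per-target degradation rule.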

Task 5.4.5: Backend — REST endpoints

POST /api/v1/v2/highlights/batch in backend/app/routers/chunks.py (new file):
- Accepts HighlightBatchRequest
- Calls chunk_highlight_service.compute_highlights_batch()
- Returns HighlightBatchResponse (JSON)
GET /api/v1/v2/highlights in same router:
- Query params: document_id, chunk_index, sub_question
- Computes cache key, reads from SQLite
- Cache hit → returns text/html
- Cache miss → returns 404
Register router in main.py
Test file: test_phase5_highlight_endpoints.py

Task 5.4.6: Frontend — Extract cited chunks from answer text

Add extractCitedSources(answerText: string, sources: SourceMetadata[]): SourceMetadata[] to citationParser.ts
- Uses the same citationPattern regex as replaceCitationPatterns()
- For each [citation] match, resolves against buildCitationLookup(sources)
- Returns deduplicated list of SourceMetadata that were actually cited in the answer
Exported alongside existing citation functions
Test file: Update citationParser.test.ts with cited source extraction tests

Task 5.4.7: Frontend — Background batch trigger on response completion

In ResponsePanel.tsx (inside SubQuestionSections component):
- After answer is rendered (non-null, not loading), useEffect fires:
  1. Parse answer sections → extract cited sources per sub-question
  2. Build HighlightBatchRequest payload
  3. POST /api/v1/v2/highlights/batch
  4. On success → store which (document_id, chunk_index, sub_question) triples are cached in a useState(Set<string>)
  5. On failure → log error (no UI change — "View Highlighted" stays hidden)
- Source cards call isHighlightReady(document_id, chunk_index, sub_q_text) to show or hide the "View Highlighted" button
Test file: Update ResponsePanel.test.tsx

Task 5.4.8: Frontend — Upgrade "View PDF" link on source cards

In SubQuestionSourceCard (ResponsePanel.tsx):
- Before batch completes: "View PDF" link opens raw PDF chunk page (existing behavior, unchanged)
- After batch completes: Same link upgrades to open /api/v1/v2/highlights?document_id=...&chunk_index=...&sub_question=... (highlighted HTML page)
- The link stays in the same position — no repositioning, no second element
- While batch is loading: no visual change (the PDF link remains functional; the highlight upgrade is invisible to the user until ready)
- If batch fails: link stays as raw PDF permanently (graceful degradation)
The highlighted HTML page includes a [View Original PDF →] footer link (rendered by backend in Task 5.4.4) so users can always reach the raw PDF
No label change needed — "View PDF" remains appropriate since the user gets a page-level view of the chunk regardless
Test file: Update ResponsePanel.test.tsx — verify link href upgrades after mock batch response, verify link stays as raw PDF when batch fails, verify loading state shows no visual disruption

Task 5.4.9: Integration testing

Verify batch endpoint: POST with 3 targets → all cached → GET returns highlighted HTML for each
Verify cache: same GET twice → both hit SQLite (no second LLM call)
Verify graceful degradation: LLM call fails → batch returns status="failed" → GET returns 404 → frontend hides buttons
Verify cross-document-type: PDF, DOCX, TXT chunks all work (chunk text comes from ChromaDB regardless of source format)
Verify sentence splitting handles mixed English/Chinese legal text (e.g., "The date is 2024年1月1日。The Contractor shall...")
Run full test suite: cd backend && pytest -v and cd frontend && npm test
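For the mixed English/Chinese check, a minimal regex-based splitter is sketched below. It is not the contents of sentence_splitter.py; the function name `split_sentences` and the boundary rules are assumptions chosen to handle the example sentence above.

```python
import re

# Split right after CJK terminators (。！？), with or without trailing space,
# and after ASCII terminators (.!?) only when whitespace follows.
_SENTENCE_BOUNDARY = re.compile(r"(?<=[。！？])|(?<=[.!?])\s+")


def split_sentences(text: str) -> list[str]:
    """Split text into sentences, keeping each terminator with its sentence."""
    pieces = _SENTENCE_BOUNDARY.split(text)
    return [piece.strip() for piece in pieces if piece and piece.strip()]


print(split_sentences("The date is 2024年1月1日。The Contractor shall pay. Done!"))
```

Requiring whitespace after an ASCII period keeps decimals such as "3.5" intact, but abbreviations like "e.g. " would still be split; a production splitter would need an exception list for those.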

Test Files

#	Test File	Coverage
T5.4.1	`backend/app/test/test_phase5_sentence_splitter.py`	Sentence splitting: English, Chinese, mixed, empty, single sentence, bullet lists
T5.4.2	`backend/app/test/test_phase5_highlight_models.py`	Pydantic validation: valid/invalid targets, empty batch, max batch size, LLM result parsing
T5.4.3	`backend/app/test/test_phase5_highlight_cache.py`	SQLite cache: insert, retrieve, cache key collision, missing key, JSON serialization
T5.4.4	`backend/app/test/test_phase5_chunk_highlight_service.py`	Service: valid batch → cached HTML, mocked LLM returns structured result, per-target error isolation, empty batch, ChromaDB miss → skip target
T5.4.5	`backend/app/test/test_phase5_highlight_endpoints.py`	Endpoints: POST returns cache count, GET returns HTML, GET cache miss → 404, invalid params → 422
T5.4.6	`frontend/src/test/utils/citationParser.test.ts` (update)	`extractCitedSources()`: extracts cited filenames, deduplicates, handles no citations, handles partial matches
T5.4.7	`frontend/src/test/components/ResponsePanel.test.tsx` (update)	"View PDF" link upgrades to highlight page after batch completes, link stays as raw PDF when batch fails, no visual disruption during loading

Acceptance Tests

#	Test File	Coverage
AT5.4.1	`backend/app/test/acceptance/test_acceptance_phase5_highlight.py`	Real LLM query → real chunk text → batch POST with real LLM → verify cached HTML contains yellow-highlighted spans on genuinely relevant sentences → spot-check 3 chunks manually

Updated Dependency Graph

Phase 5.1 (Structured Output) ✅
Phase 5.2 (Citation Linking)   ✅
Phase 5.3 (DOCX/TXT PDF Gen)   ✅
Phase 5.4 (LLM Highlighting)
   Task 5.4.1 (sentence splitter)
        │
        ├── Task 5.4.2 (highlight models)
        │        │
        │        ├── Task 5.4.3 (SQLite cache)
        │        │        │
        │        │        ├── Task 5.4.4 (batch service + HTML rendering)
        │        │        │        │
        │        │        │        └── Task 5.4.5 (REST endpoints)
        │        │        │
        │        │        └── Task 5.4.9 (integration tests)
        │        │
        │        └── Task 5.4.6 (frontend cited source extraction)
        │                 │
        │                 ├── Task 5.4.7 (frontend batch trigger)
        │                 │        │
        │                 │        └── Task 5.4.8 (frontend link upgrade)
        │                 │
        │                 └── Task 5.4.9 (integration tests)
        │
        └── All tasks → Task 5.4.9 (full suite)

Updated Commit Plan (ACTUAL)

Commit	Message	Scope
`b11d31e`	`feat: add sentence splitter and highlight data models (Phase 5.4.1-5.4.2)`	Tasks 5.4.1 + 5.4.2 + tests (13 tests)
`bdbc8ea`	`feat: add SQLite highlight cache service (Phase 5.4.3)`	Task 5.4.3 + tests (13 tests)
`c6d4a38`	`feat: add LLM-based batch highlight service and HTML rendering (Phase 5.4.4)`	Task 5.4.4 + tests (20 tests)
`a56f8f6`	`feat: add highlight batch and GET endpoints (Phase 5.4.5)`	Task 5.4.5 + tests (16 tests)
`c632b9e`	`feat: cited source extraction, background batch trigger, and View PDF link upgrade (Phase 5.4.6-5.4.8)`	Tasks 5.4.6 + 5.4.7 + 5.4.8 + tests (25 frontend tests)
`1c490ce`	`fix: inline citations now upgrade to highlighted view (Phase 5.4)`	Inline citation links in answer text also upgrade to highlighted view via `buildCitationUrl()` + `highlightKeys`
`bcf4a85`	`feat: add highlight status toast notification (Phase 5.4)`	Toast in top-left: amber spinner "Preparing highlights..." → green "Highlights ready — citations upgraded"
`b47e37f`	`fix: use absolute backend URL for highlight API calls`	Vite dev server doesn't proxy `/api/v1/v2/` paths. Changed `fetch` URL and `getHighlightUrl` to use `http://localhost:8000`.
`523b27b`	`test: update batch URL assertion to match absolute backend URL`	Fixed test that expected relative URL path.

Complete Commit Plan (All Phases — ACTUAL)

Commit	Message	Scope
`48e15f8`	`feat(llm): log structured LLM response and extra_body`	LLMClient logging improvements
`4058c7d`	`fix(citations): use all sub-question sources for citation lookup`	Citation matching fix
`091fa84`	`docs: update Phase 5 plan with deferred/planned sub-phases`	Plan documentation
`25b26c9`	`feat(ingest): generate per-chunk PDFs for DOCX/TXT documents (Phase 5.3)`	Phase 5.3 DOCX/TXT PDF generation
`ec3b5a4`	`docs: mark Phase 5.3 complete in enhancement plan`	Plan update
`b11d31e`	`feat: add sentence splitter and highlight data models (Phase 5.4.1-5.4.2)`	Sentence splitter + 6 Pydantic models (13 tests)
`bdbc8ea`	`feat: add SQLite highlight cache service (Phase 5.4.3)`	HighlightCache + cache_key hash (13 tests)
`c6d4a38`	`feat: add LLM-based batch highlight service and HTML rendering (Phase 5.4.4)`	ChunkHighlightService + render_highlight_html (20 tests)
`a56f8f6`	`feat: add highlight batch and GET endpoints (Phase 5.4.5)`	POST /batch + GET /highlights endpoints (16 tests)
`c632b9e`	`feat: cited source extraction, background batch trigger, and View PDF link upgrade (Phase 5.4.6-5.4.8)`	extractCitedSources, useEffect batch trigger, link upgrade (25 frontend tests)
`1c490ce`	`fix: inline citations now upgrade to highlighted view (Phase 5.4)`	`buildCitationUrl()` upgraded for inline citation links
`bcf4a85`	`feat: add highlight status toast notification (Phase 5.4)`	Toast: loading → done states
`b47e37f`	`fix: use absolute backend URL for highlight API calls`	Hardcoded `http://localhost:8000` for Vite proxy bypass
`523b27b`	`test: update batch URL assertion to match absolute backend URL`	Test fix
—	(Phase 5.1-5.2 commits omitted for brevity — see git log)

Phase 5.4 Completion Checklist

Sentence splitter utility (sentence_splitter.py) handles English + Chinese + mixed text
Highlight Pydantic models (6 schemas: target, request, response, LLM output)
SQLite highlight cache (highlight_cache.py) with sha256 cache keys
LLM-based batch highlight service (chunk_highlight_service.py) — single batched call for all cited chunks
HTML page rendering (render_highlight_html()) with .highlighted CSS spans and "View Original PDF →" footer
POST /api/v1/v2/highlights/batch endpoint — accepts targets, returns cached_count
GET /api/v1/v2/highlights endpoint — serves cached highlighted HTML (200) or 404 on cache miss
extractCitedSources() in citationParser.ts — extracts cited sources from answer text per sub-question
useEffect batch trigger in SubQuestionSections — fires after answer renders, POSTs batch to backend
Source card "View PDF" links upgrade to highlighted HTML after batch completes
Inline citation links in answer text also upgrade to highlighted view (buildCitationUrl + highlightKeys)
Toast notification: "Preparing highlights..." (amber spinner) → "Highlights ready — citations upgraded" (green)
Backend test suite: 108 passed, 0 failed
Frontend test suite: 45 passed, 0 failed
Manual verification: batch endpoint returns cached_count: 1 with real document, GET returns valid HTML with .highlighted spans

Phase 5.5 — Highlight History Tracking ✅

Source: User request (2026-04-29)
Status: Complete

Track highlight generation data in the query history database:

highlight_time_ms — duration of the highlight batch LLM call (excluded from total_time_ms)
highlight_prompt — the full prompt sent to the LLM for highlight batch
highlight_response — the LLM's structured response (HighlightBatchResult as JSON)

Architecture

SSE Stream: ... → phase: "completed"  (includes history_id)
    │
    ├── Frontend captures historyId from SSE event
    │
    └── Frontend useEffect: POST /api/v1/v2/highlights/batch?history_id={id}
              │
              ▼
        Backend chunk_highlight_service:
          1. Measures highlight_time_ms via time.perf_counter()
          2. Captures highlight_prompt (full LLM prompt)
          3. Captures highlight_response_json (llm_result.model_dump_json())
          4. Returns all 3 in HighlightBatchResponse
              │
              ▼
        Backend chunks.py endpoint:
          If history_id provided → HistoryService.update_highlights()
          UPDATE query_history SET highlight_prompt=?, highlight_response=?, highlight_time_ms=?
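The timing and capture steps in the service could be sketched as below. `timed_highlight_call` and `FakeLLMResult` are hypothetical names; the fake class only stands in for the Pydantic result object so the example runs without extra dependencies.

```python
import time


class FakeLLMResult:
    """Stand-in for the Pydantic HighlightBatchResult (hypothetical shape)."""

    def model_dump_json(self) -> str:
        return '{"chunks": []}'


def timed_highlight_call(call_llm, prompt: str) -> dict:
    """Measure only the LLM call and capture the three history fields."""
    start = time.perf_counter()
    llm_result = call_llm(prompt)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return {
        "highlight_time_ms": elapsed_ms,
        "highlight_prompt": prompt,
        "highlight_response_json": llm_result.model_dump_json(),
    }


fields = timed_highlight_call(lambda prompt: FakeLLMResult(), "example prompt")
print(sorted(fields))
```

`time.perf_counter()` is a monotonic clock, so the measured duration is not affected by system clock adjustments during the call.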

Key Decisions

#	Decision	Rationale
1	Synchronous history record at completion (was fire-and-forget)	Need `history_id` for the SSE completed event payload. A single-row SQLite insert is fast enough that the added latency is negligible.
2	`update_highlights()` separate from `record()`	Highlights computed AFTER the main query pipeline. Update existing row rather than inserting a new one.
3	Highlight time excluded from `total_time_ms`	`total_time_ms` is measured before highlights are computed. The highlight time is tracked separately.
4	Pass `history_id` via SSE completed event → frontend → batch POST query param	Avoids restructuring the SSE pipeline. Frontend already triggers batch POST — just adds a query param.
5	`_record_history` returns `-1` on failure	Legacy call sites (no-results, all-filtered) are unaffected. The completed site checks `if history_id > 0`.
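Decisions 2 and 5 can be read together as an update-in-place that is skipped when no valid history row exists. The sketch below assumes a primary key column named `id` on `query_history` and places the `history_id > 0` guard inside the function for brevity; the real method on HistoryService may be structured differently.

```python
import sqlite3


def update_highlights(conn: sqlite3.Connection, history_id: int,
                      prompt: str, response_json: str, time_ms: float) -> bool:
    """Attach highlight data to an existing history row (illustrative sketch)."""
    if history_id <= 0:
        # -1 signals that the history insert failed, so there is no row to update.
        return False
    cursor = conn.execute(
        "UPDATE query_history "
        "SET highlight_prompt = ?, highlight_response = ?, highlight_time_ms = ? "
        "WHERE id = ?",  # assumption: the primary key column is named id
        (prompt, response_json, time_ms, history_id),
    )
    conn.commit()
    return cursor.rowcount == 1
```

Returning a boolean lets the endpoint log a missed update without turning it into a failed batch response.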

Files Changed

File	Change
`backend/app/core/sqlite_db.py`	+3 ALTER TABLE columns: `highlight_prompt`, `highlight_response`, `highlight_time_ms`
`backend/app/services/history_service.py`	+3 columns in `_INSERT_COLUMNS`, +`update_highlights()` method
`backend/app/models/history.py`	+3 fields in `QueryHistoryRecord` and `QueryHistoryDetail`
`backend/app/models/highlight.py`	+3 optional fields in `HighlightBatchResponse`
`backend/app/services/chunk_highlight_service.py`	+`import time`, timing around LLM call, capture `highlight_response_json`
`backend/app/routers/query.py`	`_record_history` returns int, sync call at completed, `history_id` in SSE event
`backend/app/routers/chunks.py`	+`history_id` query param, calls `update_highlights()` on success
`frontend/src/lib/queries.tsx`	+`historyId` in `QueryStreamState`, capture from completed event
`frontend/src/components/ResponsePanel.tsx`	+`historyId` prop, pass in batch POST URL
`frontend/src/pages/LTTPage.tsx`	pass `historyId` from query state to `ResponsePanel`
`backend/app/test/test_phase5_highlight_models.py`	updated expected dict for new default fields

Test Results

Backend: 108 passed, 0 failed
Frontend: 45 passed, 0 failed
Total: 153 tests, 0 failures

Production Notes

Vite Proxy Limitation

The Vite dev server proxy (defined in vite.config.ts) is configured for paths under /api/v1/ and does not forward requests to the highlight endpoints at /api/v1/v2/. As a workaround, the frontend calls the full backend URL http://localhost:8000/api/v1/v2/highlights/... directly instead of relying on the proxy.

To fix properly in production: use import.meta.env.VITE_API_BASE_URL (set to /api/v1 in prod, http://localhost:8000/api/v1 in dev) and construct URLs dynamically instead of hardcoding http://localhost:8000.

Port Conflict on Restart

If you restart the Vite dev server and the browser doesn't see changes, check for stale processes:

fuser -k 5173/tcp  # kill old Vite on port 5173
fuser -k 5174/tcp  # kill old Vite on port 5174 (auto-switched when 5173 busy)

Cache Location

Highlight SQLite cache lives at backend/app/data/highlights.db (auto-computed from PROMPTS_DB_PATH prefix: data/prompts.db → data/highlights.db). No new .env setting needed.
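One way to derive the cache path from the prompts database path is shown below. The helper name `highlight_db_path` is hypothetical; the plan specifies only the mapping data/prompts.db → data/highlights.db.

```python
from pathlib import Path


def highlight_db_path(prompts_db_path: str) -> Path:
    """Place highlights.db in the same directory as the prompts database."""
    return Path(prompts_db_path).with_name("highlights.db")


print(highlight_db_path("data/prompts.db"))
```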

LLM Provider Compatibility

The batch highlight service reuses LLMClient.complete_structured() from Phase 5.1, which supports both OpenRouter (OpenAI-compatible) and vLLM (via VLLM_ENGINE=true).
