legco_ai_assistant/.plans/package5_enhancement_plan.md

Package 5 Enhancement Plan — Structured Output + Robust Citation Linking

Source: User request (2026-04-28)

Scope:

  • Phase 5.1: Replace manual JSON parsing in the decompose stage with LangChain with_structured_output()
  • Phase 5.2: Fix missing PDF links in citations and improve citation robustness
  • Phase 5.3: Generate per-chunk PDFs for DOCX/TXT documents at ingestion time
  • Phase 5.4: LLM-based sentence-level highlighting with eager background batch computation

Status: Phases 5.1, 5.2, 5.3, 5.4, and 5.5 all complete (2026-04-29). Phase 5.4 implemented: LLM-based sentence-level highlighting with eager background batch computation, SQLite cache, inline citation link upgrades, and toast notification.

Test results (Phase 5 final):

  • Backend: 108 passed, 0 failed (all integration + all Phase 5 tests)
  • Frontend: 45 passed, 0 failed (citation parser + ResponsePanel + highlight batch)
  • Total: 153 tests, 0 failures

Objective

  1. Decompose structured output: Eliminate json.JSONDecodeError failures in QueryDecomposer.decompose() by integrating LangChain's with_structured_output() to enforce a Pydantic schema at the API level. The LLM response is guaranteed to be a valid SubQuestions object — no manual json.loads(), no regex markdown stripping, no silent failures.

  2. Robust citation linking: Fix the citation→PDF link pipeline so that:

    • document_id flows through to the frontend for fallback document-level links
    • chunk_file_path is always available (generate per-chunk PDFs for DOCX/TXT too, or provide a document-level PDF fallback)
    • Citation matching in citationParser.ts handles fuzzy filename matching (strips extensions, tolerates whitespace variations)
    • Frontend provides fallback "View Document" links when chunk-level PDF is unavailable

Decision Register

| # | Decision | Rationale |
|---|----------|-----------|
| 1 | Use LangChain with_structured_output() (not OpenAI response_format directly) | User explicitly chose Option B. Provides cleaner API, auto-retry on validation failure, and future flexibility for other pipeline stages (filter, generate). |
| 2 | Add langchain + langchain-openai to requirements.txt | Required dependencies for init_chat_model() and with_structured_output(). langchain ~0.3.x for stable API. |
| 3 | Define SubQuestions Pydantic model with questions: list[str] | LangChain's with_structured_output() requires a wrapper Pydantic model; a bare list[str] is unsupported by provider-native schema enforcement. |
| 4 | Keep LLMClient as the central LLM access layer, add LangChain-based complete_structured() method | Minimizes refactoring. QueryDecomposer calls llm_client.complete_structured(prompt, SubQuestions) instead of llm_client.complete(prompt). Other callers (filter, generate) remain unchanged. |
| 5 | Run decomposition at temperature=0.0 (was 0.7) | Structured output benefits from deterministic behavior. Lower temperature = more reliable schema compliance. |
| 6 | Add document_id to SourceMetadata Pydantic model and frontend type | document_id is already stored in ChromaDB metadata (metadata.py:70) but is discarded during serialization. Adding it enables document-level fallback links. |
| 7 | Generate monolithic PDFs for DOCX/TXT documents (DEFERRED) | More complex than needed. Instead, use fallback document-level links via document_id when chunk_file_path is null. DOCX/TXT PDF generation deferred to Phase 5.3. |
| 8 | Fuzzy citation matching: strip extensions, trim whitespace | citationParser.ts currently requires exact filename match. LLM may shorten NEC4 ACC.pdf to NEC4 ACC in citations. |
| 9 | Fallback "View Document" link when chunk_file_path is null | Even after Decision #7, network failures or edge cases may leave null paths. The frontend should show a document-level PDF link as fallback. |
| 10 | Keep _extract_json_from_markdown() as a fallback for backward compatibility | During a transition period (or if with_structured_output() fails), the existing regex-based extraction serves as a safety net. Log a warning when fallback is used. |
| 11 | Add logger.warning for JSON parse failures before returning empty | The biggest blind spot today: JSON parse failures are silent. Log the raw LLM response (truncated) so operators can debug. |
| 12 | Keep QueryDecomposer.decompose() return type as Tuple[List[str], str] | Existing callers unpack the tuple. Adding `Tuple[List[str], str, SubQuestions |
| 13 | Spike-test LangChain structured output with OpenRouter BEFORE implementation | 2-minute test calling init_chat_model().with_structured_output().ainvoke() through OpenRouter to confirm response_format={"type": "json_schema"} is proxied correctly. If not, fall back to method="function_calling". |
| 14 | Tighten generate_per_subq prompt alongside frontend fuzzy matching | Add "Copy the exact bracket labels shown in the document chunks; do not modify filenames or add/remove extensions." to seed template. Two-layer defense: prompt reduces hallucinations + fuzzy matching catches remaining cases. No separate task; folded into Task 5.2.3. |

Phase 5.1 — Structured Output for Decompose

Test Files (write BEFORE implementation)

| # | Test File | Coverage |
|---|-----------|----------|
| T5.1.1 | backend/app/test/test_phase5_llm_client_structured.py | LLMClient.complete_structured() with mock LangChain model. Tests: valid Pydantic return, validation error → retry, empty questions list, non-JSON fallback. |
| T5.1.2 | backend/app/test/test_phase5_query_decomposer_structured.py | QueryDecomposer.decompose() using MockLLMClient.complete_structured(). Tests: valid SubQuestions, empty questions, LLM error fallback, prompt service integration. |
| T5.1.3 | backend/app/test/test_phase5_subquestions_model.py | SubQuestions Pydantic model validation. Tests: valid input, empty list, too many questions, non-string items rejected. |
| T5.1.4 | backend/app/test/test_phase5_decompose_logging.py | Verify logger.warning is emitted when JSON parse fallback is triggered (backward-compat path). |

Acceptance Tests

| # | Test File | Coverage |
|---|-----------|----------|
| AT5.1.1 | backend/app/test/acceptance/test_acceptance_phase5_structured_decompose.py | Real LLM call with structured output. Tests: Cantonese question → valid sub-questions, English question → valid sub-questions, very short question → 1 sub-question, very long question → ≤5 sub-questions. |

Implementation Tasks

Task 5.1.1: Add LangChain dependencies

  • Add langchain>=0.3.0,<0.4.0 and langchain-openai>=0.3.0,<0.4.0 to backend/requirements.txt
  • Run pip install -r backend/requirements.txt in dev venv
  • Test file: test_phase5_subquestions_model.py (can run immediately after install)

Task 5.1.2: Define SubQuestions Pydantic model

  • Create backend/app/models/decompose.py with:
    from pydantic import BaseModel, Field

    class SubQuestions(BaseModel):
        questions: list[str] = Field(
            description="2-5 simplified sub-questions, each focused on one aspect",
            min_length=1,
            max_length=5,
        )
    
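  • Constrain the list with Pydantic's min_length=1 and max_length=5. The upper bound matches the decompose prompt's "2-5" range; the lower bound is 1 rather than 2 so that a very short question can still yield a single sub-question (see AT5.1.1).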
  • Test file: test_phase5_subquestions_model.py

Task 5.1.3: Add complete_structured() method to LLMClient

  • In llm_client.py, import init_chat_model from langchain.chat_models
  • Add self._langchain_model attribute (lazy-init from settings)
  • Add async complete_structured(prompt, pydantic_model, step_name) -> BaseModel method:
    1. Calls self._langchain_model.with_structured_output(pydantic_model, method="json_schema").ainvoke(prompt)
    2. Returns the validated Pydantic model instance
    3. Logs timing (same pattern as existing complete())
    4. Wraps errors in LLMClientError
  • Use temperature=0.0 via model config for structured calls
  • Test file: test_phase5_llm_client_structured.py
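
The four steps of complete_structured() can be sketched as below. This is a minimal sketch, not the project's implementation: the model is injected rather than lazy-initialised from settings, LLMClientError is a stand-in for the existing exception class, and the only thing assumed about the model is that it exposes with_structured_output(...).ainvoke(...) in the way LangChain chat models do.

```python
import asyncio
import logging
import time
from typing import Any

logger = logging.getLogger(__name__)


class LLMClientError(Exception):
    """Stand-in for the project's existing LLMClientError."""


class LLMClient:
    def __init__(self, langchain_model: Any) -> None:
        # Assumption: the model is passed in; the plan lazy-inits it from settings.
        self._langchain_model = langchain_model

    async def complete_structured(
        self, prompt: str, pydantic_model: type, step_name: str = "LLM"
    ) -> Any:
        started = time.perf_counter()
        try:
            # Bind the schema, then run the prompt against the schema-bound runnable.
            runnable = self._langchain_model.with_structured_output(
                pydantic_model, method="json_schema"
            )
            return await runnable.ainvoke(prompt)
        except Exception as exc:
            # Same exception type as complete(), so callers need one except clause.
            raise LLMClientError(f"{step_name}: structured call failed: {exc}") from exc
        finally:
            # Timing is logged on both the success and the failure path.
            logger.info("%s took %.2fs", step_name, time.perf_counter() - started)
```

Because every failure surfaces as LLMClientError, a caller such as QueryDecomposer can wrap the call in a single try/except and decide whether to fall back to the legacy path.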

Task 5.1.4: Refactor QueryDecomposer.decompose() to use structured output

  • Change decompose() to call self.llm_client.complete_structured(prompt, SubQuestions, step_name="QueryDecomposer")
  • Add fallback path: if complete_structured() raises → log warning → attempt legacy complete() + json.loads() → if that works, log info "structured output failed, fallback succeeded"
  • Add logger.warning("Decompose JSON parse failed, raw response (first 500 chars): %s", response[:500]) when both paths fail
  • Keep return type Tuple[List[str], str] unchanged
  • Keep _extract_json_from_markdown() for backward-compat fallback path
  • Test file: test_phase5_query_decomposer_structured.py and test_phase5_decompose_logging.py
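
The fallback chain in this task might look roughly like the sketch below. The two callables are hypothetical stand-ins for llm_client.complete_structured() and llm_client.complete(); the fence-stripping regex is one plausible shape for _extract_json_from_markdown(), not a copy of the existing helper; and the function returns only the question list rather than the full Tuple[List[str], str].

```python
import asyncio
import json
import logging
import re
from typing import Awaitable, Callable, List

logger = logging.getLogger(__name__)

# Matches a fenced block with an optional "json" tag (three backticks on each side).
_FENCE_RE = re.compile(r"`{3}(?:json)?\s*(.*?)\s*`{3}", re.DOTALL)


def extract_json_from_markdown(text: str) -> str:
    """Return the body of the first fenced block, or the stripped text if none."""
    match = _FENCE_RE.search(text)
    return match.group(1) if match else text.strip()


async def decompose_with_fallback(
    prompt: str,
    structured_call: Callable[[str], Awaitable[List[str]]],
    legacy_call: Callable[[str], Awaitable[str]],
) -> List[str]:
    # Path 1: schema-enforced output.
    try:
        return await structured_call(prompt)
    except Exception as exc:
        logger.warning("Structured decompose failed, trying legacy path: %s", exc)

    # Path 2: legacy completion plus manual JSON parsing.
    response = ""
    try:
        response = await legacy_call(prompt)
        parsed = json.loads(extract_json_from_markdown(response))
    except Exception as exc:
        logger.warning(
            "Decompose fallback failed (%s), raw response (first 500 chars): %s",
            exc,
            response[:500],
        )
        return []

    if isinstance(parsed, list) and all(isinstance(q, str) for q in parsed):
        logger.info("structured output failed, fallback succeeded")
        return parsed
    logger.warning("Decompose fallback returned unexpected JSON shape: %r", parsed)
    return []
```

An empty list is returned only after both paths have failed, and each failure leaves a log line, which addresses the silent-failure blind spot named in Decision 11.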

Task 5.1.5: Update prompt template for structured output

  • Update _SEED_DECOMPOSE in sqlite_db.py to instruct the LLM about the expected structure
  • New seed prompt: mention that output will be validated against a schema — more explicit about JSON array of strings requirement
  • Run seed_default_profiles() to backfill existing profiles
  • Test file: Existing test_phase3_prompt_service.py should continue to pass

Task 5.1.6: Integration test — end-to-end query pipeline

  • Verify existing integration tests still pass (test_integration_phase1.py, test_phase4_integration_query_pipeline.py)
  • Verify acceptance test passes with real LLM (test_acceptance_phase1_rag_query.py)
  • Run full test suite: cd backend && pytest app/test/test_phase5*.py app/test/test_phase4*.py app/test/test_phase3*.py -v

Phase 5.2 — Robust Citation Linking

Test Files (write BEFORE implementation)

| # | Test File | Coverage |
|---|-----------|----------|
| T5.2.1 | backend/app/test/test_phase5_source_metadata.py | SourceMetadata model with document_id. Tests: serialization includes document_id, backward compat (old data without document_id). |
| T5.2.2 | backend/app/test/test_phase5_docx_pdf_generation.py | DOCX/TXT ingestion now sets chunk_file_path. Tests: DOCX ingestion produces chunk PDFs, TXT ingestion produces chunk PDFs, PDF generation errors are handled gracefully. |
| T5.2.3 | frontend/src/test/utils/test_phase5_citation_parser_fuzzy.test.ts | Fuzzy citation matching. Tests: citation [NEC4 ACC] matches source NEC4 ACC.pdf, citation [nec4 acc.pdf, page 3] matches after whitespace trim, citation [NEC4 ACC.PDF] matches case-insensitively, fallback "View Document" link shown when chunk_file_path is null. |
| T5.2.4 | frontend/src/test/utils/test_phase5_citation_fallback_link.test.ts | Fallback document link rendering. Tests: chunk with chunk_file_path: null but document_id present → renders "View Document" link, chunk with both null → remains plain text, chunk with chunk_file_path → renders page-level PDF link. |

Acceptance Tests

| # | Test File | Coverage |
|---|-----------|----------|
| AT5.2.1 | backend/app/test/acceptance/test_acceptance_phase5_citation_links.py | Real LLM query with DOCX and PDF documents. Verify citations in the answer are clickable in the SSE response (sources include document_id and chunk_file_path). |

Implementation Tasks

Task 5.2.1: Add document_id to SourceMetadata model

  • In backend/app/models/common.py, add document_id: Optional[str] = None to SourceMetadata
  • In backend/app/routers/query.py lines 310-319, include document_id=meta.get("document_id") when building SourceMetadata objects
  • In frontend/src/types/index.ts, add document_id: string | null to SourceMetadata interface
  • Test file: test_phase5_source_metadata.py

Task 5.2.2: Generate PDFs for DOCX/TXT documents during ingestion

  • Add reportlab to backend/requirements.txt (lightweight, pure Python PDF generation, no external binaries)
  • In backend/app/routers/ingest.py DOCX and TXT branches, add PDF generation logic:
    1. After chunking, generate a single PDF from the full text (one page per chunk)
    2. Store chunk_filename = f"{stem}_chunk_{idx}.pdf" for each chunk
    3. Set chunk_file_paths list and pass to extract_metadata()
  • Add error handling: if PDF generation fails, chunk_file_path stays None (graceful degradation)
  • Use logger.warning on generation failure
  • Test file: test_phase5_docx_pdf_generation.py

Task 5.2.3: Improve citationParser.ts with fuzzy matching

  • Add extension-stripping helper: stripExtension(filename: string): string — removes .pdf, .docx, .txt
  • Modify buildCitationLookup() to register both filename and stripExtension(filename) as lookup keys
  • Add trim-whitespace normalization on citation text before lookup
  • Add test for LLM-common variations: NEC4 ACC.pdf vs NEC4 ACC vs NEC4_acc.pdf
  • Test file: test_phase5_citation_parser_fuzzy.test.ts

Task 5.2.4: Fallback document link when chunk_file_path is null

  • In citationParser.ts replaceCitationPatterns(), when source?.chunk_file_path is null but source?.document_id exists:
    1. Build a URL to the document chunk list page: /rag-database?document_id=${source.document_id}
    2. Return [${trimmed}](${url}) with a different CSS class (e.g., text-green-600 for document-level vs text-blue-600 for page-level)
  • In ResponsePanel.tsx, update CitationLink component to accept a variant prop for visual differentiation
  • Test file: test_phase5_citation_fallback_link.test.ts
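
The matching and fallback rules above live in TypeScript (citationParser.ts); the sketch below restates the same logic in Python only to keep one language across the examples in this plan. It is illustrative, not a port of the real code. Two details are assumptions: underscores are treated like spaces to cover the NEC4_acc.pdf variation, and the filename is taken to be the part of a citation before the first comma. It also normalises both sides to one key instead of registering two lookup keys per source, which has the same effect for the cases listed.

```python
import re
from typing import Dict, List

_EXTENSIONS = (".pdf", ".docx", ".txt")


def strip_extension(filename: str) -> str:
    """Remove a trailing .pdf/.docx/.txt, compared case-insensitively."""
    lowered = filename.lower()
    for ext in _EXTENSIONS:
        if lowered.endswith(ext):
            return filename[: -len(ext)]
    return filename


def normalize(label: str) -> str:
    # Assumption: underscores and whitespace runs collapse to one space; case is ignored.
    stem = strip_extension(label.strip())
    return re.sub(r"[\s_]+", " ", stem).strip().lower()


def build_citation_lookup(sources: List[dict]) -> Dict[str, dict]:
    lookup: Dict[str, dict] = {}
    for source in sources:
        # First source wins if two filenames normalise to the same key.
        lookup.setdefault(normalize(source["filename"]), source)
    return lookup


def link_variant(citation: str, lookup: Dict[str, dict]) -> str:
    """Return 'page', 'document', or 'plain' for a bracket label."""
    # Assumption: "NEC4 ACC.pdf, page 3" carries the filename before the comma.
    source = lookup.get(normalize(citation.split(",", 1)[0]))
    if source is None:
        return "plain"
    if source.get("chunk_file_path"):
        return "page"  # page-level PDF link
    if source.get("document_id"):
        return "document"  # document-level fallback link
    return "plain"
```

The three return values map onto the three cases in T5.2.4: a page-level PDF link, a "View Document" fallback, and plain text when neither identifier is available.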

Task 5.2.5: Integration and regression testing

  • Verify all existing citation parser tests still pass: cd frontend && npx vitest run src/test/utils/citationParser.test.ts
  • Verify ResponsePanel tests still pass: npx vitest run src/test/components/ResponsePanel.test.tsx
  • Run full frontend test suite: npm test
  • Verify SSE streaming integration: query with a mix of PDF and DOCX documents, confirm citations are clickable

Dependency Graph

Phase 5.1 (Structured Output)
  Task 5.1.1 (add deps) ──┬── Task 5.1.2 (SubQuestions model) ── Task 5.1.3 (complete_structured)
                           │                                           │
                           │                                           ▼
                           │                              Task 5.1.4 (refactor decompose)
                           │                                           │
                           │                              Task 5.1.5 (update prompt template)
                           │                                           │
                           │                                           ▼
                           │                              Task 5.1.6 (integration tests)
                           │
Phase 5.2 (Citation Linking) — independent, can run in parallel with 5.1
  Task 5.2.1 (document_id in model) ──┬── Task 5.2.3 (fuzzy matching)
  Task 5.2.2 (DOCX/TXT PDF gen)    ──┤
                                      ├── Task 5.2.4 (fallback link)
                                      │
                                      ▼
                              Task 5.2.5 (integration tests)

Acceptance Criteria

Phase 5.1 Completion Checklist

  • LLMClient.complete_structured() returns validated SubQuestions Pydantic model — no json.JSONDecodeError possible
  • QueryDecomposer.decompose() never returns [] due to JSON parse failure
  • Fallback path (legacy json.loads()) logs a warning when triggered
  • Existing decompose tests pass (test_phase1_query_decomposer.py)
  • New structured output tests pass (test_phase5_*.py) — 33 tests
  • Spike test passed: Cantonese + English → valid sub-questions
  • SQLite seed templates updated and backfilled to all profiles
  • langchain and langchain-openai installed in venv (1.2.x)

Phase 5.2 Completion Checklist

  • SourceMetadata includes document_id in both backend and frontend types
  • DOCX/TXT ingestion generates per-chunk PDF files: DEFERRED to Phase 5.3
  • citationParser.ts matches [NEC4 ACC] to source NEC4 ACC.pdf (fuzzy matching)
  • citationParser.ts renders fallback link to /rag-database?document=xxx when chunk_file_path is null but document_id exists
  • RAGDatabasePage auto-expands document from ?document= URL param
  • All existing citation parser tests pass (14 tests)
  • All existing ResponsePanel tests pass
  • generate_per_subq seed prompt tightened: "Copy the exact bracket labels shown"

Rollback Plan

If with_structured_output() causes issues in production:

  1. The complete_structured() method wraps errors in LLMClientError — same exception type as existing complete()
  2. QueryDecomposer.decompose() has a fallback to legacy complete() + json.loads() path
  3. The _extract_json_from_markdown() function is preserved for backward compatibility
  4. If LangChain is a complete failure, revert requirements.txt and llm_client.py changes (3 files), keeping the Pydantic model and improved logging


Phase 5.3 — DOCX/TXT PDF Generation

Generate per-chunk PDF files for DOCX/TXT documents at ingestion time so they have the same chunk_file_path → PDF viewer flow as PDF documents.

Status: Complete (2026-04-28). Implemented in commit 25b26c9.

  • reportlab==4.2.5 added to requirements.txt
  • New backend/app/utils/text_to_pdf.py: renders chunk text as simple PDFs with word wrapping
  • ingest.py DOCX/TXT branches: generates {stem}_chunk_{idx}.pdf per chunk, passes chunk_file_paths to extract_metadata()
  • Graceful degradation: chunk_file_path stays None on generation failure (logged as warning)
  • Tests: test_phase5_docx_pdf_generation.py (5 tests), updated test_phase1_ingest_page_aware.py (2 assertions)

Phase 5.4 — Sentence-Level Highlighting

Problem

When a user clicks a citation link to view a cited chunk, they see the full chunk text (up to ~1000 tokens). They have to manually scan to find which sentences actually drove the relevance. This is especially painful for long, dense chunks.

Solution: LLM-Based Relevance Detection + Eager Background Computation

Embedding similarity was rejected — cosine similarity measures semantic proximity, not answer relevance. A sentence like "The Contractor shall submit a quotation within three weeks" has low embedding similarity to "What are the compensation event timelines?" but is directly relevant to answering it. Only an LLM can distinguish topic-matching from answer-relevance.

Strategy: After the final response renders on the frontend, a single background LLM call identifies relevant sentences across all cited chunks (grouped by sub-question). Results are cached in SQLite. The "View Highlighted" button activates once caching completes. Clicks serve instantly from cache.

Why one LLM call, not one per chunk: A sub-question typically cites only 2-3 chunks (even though up to 10 are retrieved). A single batched prompt with all sub-questions and their cited chunks costs ~1 LLM call per query — not N × M per chunk. Reuses the complete_structured() infrastructure from Phase 5.1.

Why HTML, not PDF?

| Approach | Complexity | Works for all doc types? | Preserves original formatting? |
|----------|------------|--------------------------|--------------------------------|
| Highlighted HTML page | Low | Yes (uses chunk text from ChromaDB) | Plain text only |
| Highlighted PDF via reportlab | Medium | Yes (new PDF) | Plain text only |
| Overlay highlights on existing PDF | High | ⚠️ PDF only | Yes |

Recommendation: HTML page. Works uniformly across PDF/DOCX/TXT. Original formatting is preserved in the existing PDF viewer (chunk_file_path link) — the highlighted HTML view is a supplementary view reached via a separate button. The two views coexist: "View Original PDF" vs "View Highlighted Text".

Architecture

EAGER PHASE (background, fires immediately after SSE completed):
─────────────────────────────────────────────────
Frontend receives SSE "completed" event
    │
    ├── Renders answer with citations instantly (no waiting)
    │
    └── useEffect fires POST /api/v1/v2/highlights/batch
              │
              ▼
        Backend highlight batch service:
          1. For each sub-question, extract cited chunks
             (only 2-3 actually cited per sub-q, not all 10)
          2. Fetch chunk texts from ChromaDB
          3. Split each chunk into sentences
          4. Single LLM call: identify which sentences are
             relevant to their sub-question (structured output)
          5. Render each chunk as highlighted HTML
          6. Cache all HTML + metadata in SQLite
          7. Return {status, cached_count} to frontend
               │
               ▼
         Frontend receives response → upgrades "View PDF"
         links on source cards to point to cached highlight pages
         (same button, same position, upgraded destination)

LAZY PHASE (on user click, instant from cache):
──────────────────────────────────────────────
User clicks "View PDF" on a source card
    │
    ▼
GET /api/v1/v2/highlights?document_id=...&chunk_index=...&sub_question=...
    │
    ▼
Backend reads cached HTML from SQLite → serves instantly
    │
    ▼
Highlighted HTML page shows:
  ┌─ Highlighted sentences (yellow background + LLM reasons)
  └─ Footer: [View Original PDF →] link to raw PDF chunk

What Gets Highlighted

┌──────────────────────────────────────────────────────────┐
│ Chunk: NEC4 ACC, page 12          [View Original PDF →]  │
├──────────────────────────────────────────────────────────┤
│                                                            │
│ The programme shall be prepared in a form acceptable to   │
│ the Project Manager. It shall include:                    │
│                                                            │
│ ████████████████████████████████████████████████████████ │
│ █ The starting date, access dates, and Key Dates.       █ │  ← LLM: relevant to timeline question
│ █ "Defines when work begins and key milestones."        █ │  ← Brief reason from LLM
│ ████████████████████████████████████████████████████████ │
│                                                            │
│ The Contractor shall submit a first programme within      │
│ ████████████████████████████████████████████████████████ │
│ █ two weeks of the starting date.                       █ │  ← LLM: relevant to timeline question
│ █ "Specifies the deadline for initial programme."       █ │
│ ████████████████████████████████████████████████████████ │
│                                                            │
│ The Project Manager may instruct the Contractor to        │
│ submit a revised programme showing the effects of a       │
│ compensation event. This does not affect the Contractor's │
│ right to be paid for preparing the programme.             │  ← Not highlighted (LLM: not relevant to timeline)
│                                                            │
├──────────────────────────────────────────────────────────┤
│                                   [View Original PDF →]   │
└──────────────────────────────────────────────────────────┘

Key Design Decisions

| # | Decision | Rationale |
|---|----------|-----------|
| 1 | HTML page, not PDF | Zero extra dependencies. CSS highlighting is more flexible. Original PDF remains available via separate link. |
| 2 | LLM relevance detection, not embedding similarity | Embedding similarity measures topic overlap, not answer relevance. LLM distinguishes "about timelines" from "answers the timeline question." |
| 3 | Single batched LLM call for all sub-questions + cited chunks | One API call per query, not one per chunk. Each sub-question typically cites 2-3 chunks. Total ~6-15 chunks per query in a single prompt. |
| 4 | Eager computation (background after response renders) | User sees answer instantly. Highlight computation happens while they read. "View Highlighted" activates when ready. Zero perceived latency on click. |
| 5 | Only cited chunks (not all retrieved chunks) | Up to 10 chunks retrieved per sub-question, but only 2-3 are actually cited in the answer. Computing highlights for uncited chunks wastes tokens. |
| 6 | SQLite cache for highlight results | Durable across restarts. Same chunk+sub-question pair only computed once. Subsequent clicks (or same query re-run) serve instantly. Cache key = hash(document_id + chunk_index + sub_question). |
| 7 | Reuse complete_structured() from Phase 5.1 | LLMClient.complete_structured(prompt, HighlightBatchResult) returns validated Pydantic model. No new LLM calling infrastructure. |
| 8 | Sentence-level granularity | Paragraphs are too coarse, words are too noisy. Sentences are the natural unit of meaning. LLM returns sentence indices. |
| 9 | Regex sentence splitter (no NLTK) | NLTK adds 35MB deploy weight for edge-case handling the project doesn't need. re.split(r'(?<=[.!?。!?\n])\s+') handles English + Cantonese legal text. |
| 10 | Sentence indices, not text spans | Deterministic: the splitter always produces the same sentences from the same chunk text. LLM returns indices to avoid substring matching fragility. |
| 11 | Graceful degradation | LLM call fails → cache miss → GET returns 404. Frontend keeps "View Highlighted" disabled/loading until batch succeeds. If batch fails, button stays hidden. |
| 12 | "View Highlighted" on source cards and inline citations | Decision expanded during implementation: inline citation links in the answer text also upgrade to highlighted view via buildCitationUrl() checking highlightKeys. Source cards have SubQuestionSource context; inline citations route via sub_question_text added to SourceMetadata. |
| 13 | Upgrade "View PDF" link in place (no second button) | Before the LLM completes, the "View PDF" link opens the raw PDF chunk page (existing behavior). Once the batch LLM call finishes caching, the same link upgrades to open the highlighted HTML page instead. The highlighted page includes a "View Original PDF →" footer link. If the LLM call fails, the link stays as raw PDF. One link, seamless upgrade. |
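
Decisions 9 and 10 depend on the splitter being deterministic, so that an index returned by the LLM always points at the same sentence. A minimal sketch follows; it assumes the second "!?" pair in the plan's character class stands for the full-width forms used in Chinese text, and sentences_at() is a hypothetical helper added only to show index resolution.

```python
import re
from typing import List

# Split on whitespace that follows sentence-ending punctuation or a newline.
# Assumption: the pattern includes full-width ! and ? alongside the ideographic full stop.
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?。!?\n])\s+")


def split_sentences(text: str) -> List[str]:
    """Return sentences in document order, dropping empty fragments."""
    return [part for part in _SENTENCE_BOUNDARY.split(text) if part.strip()]


def sentences_at(text: str, indices: List[int]) -> List[str]:
    """Resolve LLM-returned 0-based indices, ignoring any that are out of range."""
    sentences = split_sentences(text)
    return [sentences[i] for i in indices if 0 <= i < len(sentences)]
```

One property worth knowing: the pattern needs whitespace after the punctuation mark, so Chinese text written without spaces between sentences comes back as a single sentence. Abbreviations followed by a space (for example "e.g. ") are split as well.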

Data Model

Batch Request (POST)

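# Imports shared by all models in this section (they live in one module).
from typing import Literal

from pydantic import BaseModel, Field

class ChunkHighlightTarget(BaseModel):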
    """A single chunk to highlight, tied to its sub-question."""
    document_id: str
    chunk_index: int
    sub_question_text: str
    sub_question_index: int

class HighlightBatchRequest(BaseModel):
    """Batch request: all cited chunks grouped by sub-question."""
    targets: list[ChunkHighlightTarget]

Structured LLM Output

class RelevantSentence(BaseModel):
    sentence_index: int = Field(description="0-based index of the relevant sentence")
    reason: str = Field(description="Brief explanation of why this sentence is relevant (max 80 chars)")

class ChunkHighlights(BaseModel):
    document_id: str
    chunk_index: int
    relevant_sentences: list[RelevantSentence] = Field(default_factory=list)

class HighlightBatchResult(BaseModel):
    """Single LLM call returns highlights for all sub-questions and chunks."""
    results: list[ChunkHighlights]

Batch Response (POST response, JSON)

class HighlightBatchResponse(BaseModel):
    status: Literal["completed", "partial", "failed"]
    cached_count: int  # Number of highlights successfully computed and cached
    errors: list[str] = []  # Per-chunk errors (LLM failed on some but not all)

SQLite Cache Table

CREATE TABLE IF NOT EXISTS chunk_highlights (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    cache_key TEXT UNIQUE NOT NULL,  -- sha256(document_id + chunk_index + sub_question)
    document_id TEXT NOT NULL,
    chunk_index INTEGER NOT NULL,
    sub_question TEXT NOT NULL,
    relevant_sentences_json TEXT NOT NULL,  -- JSON of RelevantSentence[]
    html_content TEXT NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX IF NOT EXISTS idx_highlights_cache_key ON chunk_highlights(cache_key);

GET Endpoint Response

GET /api/v1/v2/highlights?document_id=...&chunk_index=...&sub_question=...
→ 200 text/html  (cache hit: highlighted HTML page)
→ 404            (cache miss: batch not yet computed or failed)

LLM Prompt Structure (Single Call, All Sub-Questions)

For each sub-question below, identify which sentences in each cited chunk are directly relevant to answering that sub-question. Return only the index and a brief reason for each relevant sentence. Skip sentences that are about the same topic but don't contribute to the answer.

Sub-question 1: "What are the compensation event timelines in NEC4?"
  [Chunk: NEC4 ACC, chunk_index=2]
    [0] The Contractor shall submit a quotation within three weeks of the instruction.
    [1] The Project Manager may instruct the Contractor to submit a revised programme.
    [2] This does not affect the Contractor's right to be paid for preparing the programme.
  [Chunk: NEC4 ACC, chunk_index=5]
    [0] Compensation events are events which change the total of the Prices.
    [1] The Project Manager shall notify the Contractor within one week of becoming aware.

Sub-question 2: "What are the key dates and milestones?"
  [Chunk: NEC4 ACC, chunk_index=1]
    [0] The starting date shall be defined as the first working day.
    [1] Key Dates are dates by which the Contractor must complete work.
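
A prompt in the layout shown above could be assembled along these lines. This is a sketch under stated assumptions: PromptChunk is a hypothetical container that already holds the split sentences, and the instruction text is copied from the paragraph at the top of this section.

```python
from dataclasses import dataclass
from typing import Dict, List

INSTRUCTIONS = (
    "For each sub-question below, identify which sentences in each cited chunk "
    "are directly relevant to answering that sub-question. Return only the index "
    "and a brief reason for each relevant sentence. Skip sentences that are about "
    "the same topic but don't contribute to the answer."
)


@dataclass
class PromptChunk:
    """Hypothetical container: one cited chunk, already split into sentences."""
    sub_question_text: str
    filename: str
    chunk_index: int
    sentences: List[str]


def build_batch_prompt(chunks: List[PromptChunk]) -> str:
    # Group chunks by sub-question; dicts keep first-seen order.
    grouped: Dict[str, List[PromptChunk]] = {}
    for chunk in chunks:
        grouped.setdefault(chunk.sub_question_text, []).append(chunk)

    lines = [INSTRUCTIONS, ""]
    for number, (question, group) in enumerate(grouped.items(), start=1):
        lines.append(f'Sub-question {number}: "{question}"')
        for chunk in group:
            lines.append(f"  [Chunk: {chunk.filename}, chunk_index={chunk.chunk_index}]")
            # 0-based indices, matching RelevantSentence.sentence_index.
            for index, sentence in enumerate(chunk.sentences):
                lines.append(f"    [{index}] {sentence}")
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"
```

Printing the sentence indices in the prompt is what lets the model answer with indices alone, so the response stays small even when the chunks are long.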

Implementation Tasks

Task 5.4.1: Backend — Sentence splitting utility

  • Create backend/app/utils/sentence_splitter.py
  • Function split_sentences(text: str) -> list[str] returns ordered sentence strings
  • Use regex: re.split(r'(?<=[.!?。!?\n])\s+', text) (handles English + Chinese punctuation)
  • Strip empty strings, preserve original sentence text whitespace
  • Handle edge cases: empty text, single sentence, lists/bullets
  • Test file: test_phase5_sentence_splitter.py

Task 5.4.2: Backend highlight models (Pydantic)

  • Create backend/app/models/highlight.py with:
    • ChunkHighlightTarget (request input)
    • HighlightBatchRequest (request input)
    • RelevantSentence (LLM output unit)
    • ChunkHighlights (LLM output per chunk)
    • HighlightBatchResult (LLM output wrapper)
    • HighlightBatchResponse (POST response)
  • Test file: test_phase5_highlight_models.py

Task 5.4.3: Backend — SQLite cache for highlight results

  • Add chunk_highlights table to backend/app/services/highlight_cache.py
  • Functions: get_highlight(cache_key) -> str | None, set_highlight(cache_key, html_content, relevant_sentences_json), compute_cache_key(document_id, chunk_index, sub_question) -> str
  • Database path: backend/app/data/highlights.db (separate from prompts/history to avoid schema conflicts)
  • Initialize table on first use (same pattern as prompt_service.py / history_service.py)
  • Test file: test_phase5_highlight_cache.py
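
The cache functions could be sketched as follows, using only the standard library. Two choices here are assumptions rather than part of the plan: the key is derived from a JSON-encoded triple so that field boundaries stay unambiguous (plain concatenation would give "a1" + 23 and "a12" + 3 the same input), and set_highlight() takes document_id, chunk_index, and sub_question explicitly because the table declares those columns NOT NULL.

```python
import hashlib
import json
import sqlite3
from typing import Optional

SCHEMA = """
CREATE TABLE IF NOT EXISTS chunk_highlights (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    cache_key TEXT UNIQUE NOT NULL,
    document_id TEXT NOT NULL,
    chunk_index INTEGER NOT NULL,
    sub_question TEXT NOT NULL,
    relevant_sentences_json TEXT NOT NULL,
    html_content TEXT NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
)
"""


def compute_cache_key(document_id: str, chunk_index: int, sub_question: str) -> str:
    # Assumption: JSON-encode the triple before hashing to avoid boundary collisions.
    payload = json.dumps([document_id, chunk_index, sub_question], ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def init_cache(conn: sqlite3.Connection) -> None:
    conn.execute(SCHEMA)
    conn.commit()


def set_highlight(
    conn: sqlite3.Connection,
    document_id: str,
    chunk_index: int,
    sub_question: str,
    html_content: str,
    relevant_sentences_json: str,
) -> str:
    """Insert or overwrite one cached highlight and return its cache key."""
    key = compute_cache_key(document_id, chunk_index, sub_question)
    conn.execute(
        "INSERT OR REPLACE INTO chunk_highlights "
        "(cache_key, document_id, chunk_index, sub_question, "
        "relevant_sentences_json, html_content) VALUES (?, ?, ?, ?, ?, ?)",
        (key, document_id, chunk_index, sub_question, relevant_sentences_json, html_content),
    )
    conn.commit()
    return key


def get_highlight(conn: sqlite3.Connection, cache_key: str) -> Optional[str]:
    """Return cached HTML, or None on a cache miss (the GET endpoint maps this to 404)."""
    row = conn.execute(
        "SELECT html_content FROM chunk_highlights WHERE cache_key = ?", (cache_key,)
    ).fetchone()
    return row[0] if row else None
```

The UNIQUE constraint on cache_key already gives SQLite an index for the lookup in get_highlight().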

Task 5.4.4: Backend — Highlight batch service (LLM call + HTML rendering)

  • Create backend/app/services/chunk_highlight_service.py
  • Method compute_highlights_batch(targets: list[ChunkHighlightTarget]) -> HighlightBatchResponse:
    1. Group targets by sub-question for prompt construction
    2. For each target, fetch chunk text + metadata from ChromaDB (collection.get(ids=[f"{document_id}_{chunk_index}"]))
    3. Split chunk text into sentences via split_sentences()
    4. Build a single batched prompt (see Prompt Structure above)
    5. Call llm_client.complete_structured(prompt, HighlightBatchResult) → validated Pydantic result
    6. For each ChunkHighlights in result: render HTML, compute cache key, store in SQLite
    7. Return HighlightBatchResponse(status="completed", cached_count=N)
  • Method render_highlight_html(chunk_text, sentences, relevant_sentences, metadata) -> str:
    • Inline CSS: yellow background (#fef08a) on relevant sentences, grey reason text below each
    • Header: filename, chunk index, page number, sub-question text
    • Footer: "View Original PDF →" link (if chunk_file_path exists)
    • Returns complete self-contained HTML page
  • Graceful degradation per target (not per batch): if one chunk fetch fails, skip it and continue others. If LLM call fails entirely, return status="failed".
  • Test file: test_phase5_chunk_highlight_service.py
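
The HTML rendering step can be illustrated with the standard library alone. The sketch below simplifies the planned signature: it takes a title string and a mapping from sentence index to reason instead of the chunk_text, relevant_sentences, and metadata arguments. The grey shade used for the reason text is an assumption; the yellow is the value given in the task.

```python
import html
from typing import Dict, List, Optional

HIGHLIGHT_STYLE = "background:#fef08a"  # yellow, as specified in the task
REASON_STYLE = "color:#6b7280"  # grey; the exact shade is an assumption


def render_highlight_html(
    title: str,
    sentences: List[str],
    reasons_by_index: Dict[int, str],
    original_pdf_url: Optional[str] = None,
) -> str:
    """Return a self-contained HTML page with relevant sentences highlighted."""
    parts = []
    for index, sentence in enumerate(sentences):
        text = html.escape(sentence)  # chunk text is untrusted, so escape it
        reason = reasons_by_index.get(index)
        if reason is None:
            parts.append(f"<p>{text}</p>")
        else:
            parts.append(
                f'<p><mark style="{HIGHLIGHT_STYLE}">{text}</mark><br>'
                f'<small style="{REASON_STYLE}">{html.escape(reason)}</small></p>'
            )

    footer = ""
    if original_pdf_url:
        # The footer link is omitted when the chunk has no PDF.
        href = html.escape(original_pdf_url)
        footer = f'<footer><a href="{href}">View Original PDF →</a></footer>'

    return (
        "<!DOCTYPE html>\n"
        f'<html><head><meta charset="utf-8"><title>{html.escape(title)}</title></head>\n'
        f"<body><h1>{html.escape(title)}</h1>\n"
        + "\n".join(parts)
        + f"\n{footer}</body></html>"
    )
```

Escaping every piece of chunk text and every LLM-written reason matters here because the page is served as text/html straight from the cache.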

Task 5.4.5: Backend — REST endpoints

  • POST /api/v1/v2/highlights/batch in backend/app/routers/chunks.py (new file):
    • Accepts HighlightBatchRequest
    • Calls chunk_highlight_service.compute_highlights_batch()
    • Returns HighlightBatchResponse (JSON)
  • GET /api/v1/v2/highlights in same router:
    • Query params: document_id, chunk_index, sub_question
    • Computes cache key, reads from SQLite
    • Cache hit → returns text/html
    • Cache miss → returns 404
  • Register router in main.py
  • Test file: test_phase5_highlight_endpoints.py

Task 5.4.6: Frontend — Extract cited chunks from answer text

  • Add extractCitedSources(answerText: string, sources: SourceMetadata[]): SourceMetadata[] to citationParser.ts
    • Uses the same citationPattern regex as replaceCitationPatterns()
    • For each [citation] match, resolves against buildCitationLookup(sources)
    • Returns deduplicated list of SourceMetadata that were actually cited in the answer
  • Exported alongside existing citation functions
  • Test file: Update citationParser.test.ts with cited source extraction tests

Task 5.4.7: Frontend — Background batch trigger on response completion

  • In ResponsePanel.tsx (inside SubQuestionSections component):
    • After answer is rendered (non-null, not loading), useEffect fires:
      1. Parse answer sections → extract cited sources per sub-question
      2. Build HighlightBatchRequest payload
      3. POST /api/v1/v2/highlights/batch
      4. On success → store which (document_id, chunk_index, sub_question) keys are cached in a useState(Set<string>)
      5. On failure → log error (no UI change; links keep pointing at the raw PDF)
    • Source cards check isHighlightReady(document_id, chunk_index, sub_q_text) → decides whether the "View PDF" link opens the highlighted page or the raw PDF
  • Test file: Update ResponsePanel.test.tsx

Task 5.4.8: Frontend link upgrade to highlighted view

  • In SubQuestionSourceCard (ResponsePanel.tsx):
    • Before batch completes: "View PDF" link opens raw PDF chunk page (existing behavior, unchanged)
    • After batch completes: Same link upgrades to open /api/v1/v2/highlights?document_id=...&chunk_index=...&sub_question=... (highlighted HTML page)
    • The link stays in the same position — no repositioning, no second element
    • While batch is loading: no visual change (the PDF link remains functional; the highlight upgrade is invisible to the user until ready)
    • If batch fails: link stays as raw PDF permanently (graceful degradation)
  • The highlighted HTML page includes a [View Original PDF →] footer link (rendered by backend in Task 5.4.4) so users can always reach the raw PDF
  • No label change needed — "View PDF" remains appropriate since the user gets a page-level view of the chunk regardless
  • Test file: Update ResponsePanel.test.tsx — verify link href upgrades after mock batch response, verify link stays as raw PDF when batch fails, verify loading state shows no visual disruption

Task 5.4.9: Integration testing

  • Verify batch endpoint: POST with 3 targets → all cached → GET returns highlighted HTML for each
  • Verify cache: same GET twice → both hit SQLite (no second LLM call)
  • Verify graceful degradation: LLM call fails → batch returns status="failed" → GET returns 404 → frontend keeps the raw PDF links
  • Verify cross-document-type: PDF, DOCX, TXT chunks all work (chunk text comes from ChromaDB regardless of source format)
  • Verify sentence splitting handles mixed English/Chinese legal text (e.g., "The date is 2024年1月1日。The Contractor shall...")
  • Run full test suite: cd backend && pytest -v and cd frontend && npm test
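
For the mixed English/Chinese case above, a regex-based splitter could be sketched as below. This is an illustrative approach, not the contents of the project's sentence_splitter.py. It assumes that full-width terminators (。！？) always end a sentence, and that ASCII terminators end one only when followed by whitespace. Abbreviations such as "Mr. Smith" would still be split incorrectly.

```python
import re

# Full-width terminators end a sentence even without trailing whitespace.
_CJK_END = re.compile(r"([。！？])")
# ASCII terminators only count when followed by whitespace, so "3.5" survives.
_LATIN_END = re.compile(r"([.!?])\s+")


def split_sentences(text: str) -> list[str]:
    """Split mixed English/Chinese text into sentences (heuristic sketch)."""
    marked = _CJK_END.sub(r"\1\n", text)
    marked = _LATIN_END.sub(r"\1\n", marked)
    # Existing newlines in the input also act as sentence boundaries here.
    return [part.strip() for part in marked.split("\n") if part.strip()]
```

The plan's own example, "The date is 2024年1月1日。The Contractor shall...", splits after the full-width stop even though no space follows it.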

Test Files

| # | Test File | Coverage |
|---|---|---|
| T5.4.1 | backend/app/test/test_phase5_sentence_splitter.py | Sentence splitting: English, Chinese, mixed, empty, single sentence, bullet lists |
| T5.4.2 | backend/app/test/test_phase5_highlight_models.py | Pydantic validation: valid/invalid targets, empty batch, max batch size, LLM result parsing |
| T5.4.3 | backend/app/test/test_phase5_highlight_cache.py | SQLite cache: insert, retrieve, cache key collision, missing key, JSON serialization |
| T5.4.4 | backend/app/test/test_phase5_chunk_highlight_service.py | Service: valid batch → cached HTML, mocked LLM returns structured result, per-target error isolation, empty batch, ChromaDB miss → skip target |
| T5.4.5 | backend/app/test/test_phase5_highlight_endpoints.py | Endpoints: POST returns cache count, GET returns HTML, GET cache miss → 404, invalid params → 422 |
| T5.4.6 | frontend/src/test/utils/citationParser.test.ts (update) | extractCitedSources(): extracts cited filenames, deduplicates, handles no citations, handles partial matches |
| T5.4.7 | frontend/src/test/components/ResponsePanel.test.tsx (update) | "View PDF" link upgrades to highlight page after batch completes, link stays as raw PDF when batch fails, no visual disruption during loading |

Acceptance Tests

| # | Test File | Coverage |
|---|---|---|
| AT5.4.1 | backend/app/test/acceptance/test_acceptance_phase5_highlight.py | Real LLM query → real chunk text → batch POST with real LLM → verify cached HTML contains yellow-highlighted spans on genuinely relevant sentences → spot-check 3 chunks manually |

Updated Dependency Graph

Phase 5.1 (Structured Output) ✅
Phase 5.2 (Citation Linking)   ✅
Phase 5.3 (DOCX/TXT PDF Gen)   ✅
Phase 5.4 (LLM Highlighting)
   Task 5.4.1 (sentence splitter)
        │
        ├── Task 5.4.2 (highlight models)
        │        │
        │        ├── Task 5.4.3 (SQLite cache)
        │        │        │
        │        │        ├── Task 5.4.4 (batch service + HTML rendering)
        │        │        │        │
        │        │        │        └── Task 5.4.5 (REST endpoints)
        │        │        │
        │        │        └── Task 5.4.9 (integration tests)
        │        │
        │        └── Task 5.4.6 (frontend cited source extraction)
        │                 │
        │                 ├── Task 5.4.7 (frontend batch trigger)
        │                 │        │
        │                 │        └── Task 5.4.8 (frontend link upgrade)
        │                 │
        │                 └── Task 5.4.9 (integration tests)
        │
        └── All tasks → Task 5.4.9 (full suite)

Updated Commit Plan (ACTUAL)

| Commit | Message | Scope |
|---|---|---|
| b11d31e | feat: add sentence splitter and highlight data models (Phase 5.4.1-5.4.2) | Tasks 5.4.1 + 5.4.2 + tests (13 tests) |
| bdbc8ea | feat: add SQLite highlight cache service (Phase 5.4.3) | Task 5.4.3 + tests (13 tests) |
| c6d4a38 | feat: add LLM-based batch highlight service and HTML rendering (Phase 5.4.4) | Task 5.4.4 + tests (20 tests) |
| a56f8f6 | feat: add highlight batch and GET endpoints (Phase 5.4.5) | Task 5.4.5 + tests (16 tests) |
| c632b9e | feat: cited source extraction, background batch trigger, and View PDF link upgrade (Phase 5.4.6-5.4.8) | Tasks 5.4.6 + 5.4.7 + 5.4.8 + tests (25 frontend tests) |
| 1c490ce | fix: inline citations now upgrade to highlighted view (Phase 5.4) | Inline citation links in answer text also upgrade to highlighted view via buildCitationUrl() + highlightKeys |
| bcf4a85 | feat: add highlight status toast notification (Phase 5.4) | Toast in top-left: amber spinner "Preparing highlights..." → green "Highlights ready — citations upgraded" |
| b47e37f | fix: use absolute backend URL for highlight API calls | Vite dev server doesn't proxy /api/v1/v2/ paths. Changed fetch URL and getHighlightUrl to use http://localhost:8000. |
| 523b27b | test: update batch URL assertion to match absolute backend URL | Fixed test that expected relative URL path. |

Complete Commit Plan (All Phases — ACTUAL)

| Commit | Message | Scope |
|---|---|---|
| 48e15f8 | feat(llm): log structured LLM response and extra_body | LLMClient logging improvements |
| 4058c7d | fix(citations): use all sub-question sources for citation lookup | Citation matching fix |
| 091fa84 | docs: update Phase 5 plan with deferred/planned sub-phases | Plan documentation |
| 25b26c9 | feat(ingest): generate per-chunk PDFs for DOCX/TXT documents (Phase 5.3) | Phase 5.3 DOCX/TXT PDF generation |
| ec3b5a4 | docs: mark Phase 5.3 complete in enhancement plan | Plan update |
| b11d31e | feat: add sentence splitter and highlight data models (Phase 5.4.1-5.4.2) | Sentence splitter + 6 Pydantic models (13 tests) |
| bdbc8ea | feat: add SQLite highlight cache service (Phase 5.4.3) | HighlightCache + cache_key hash (13 tests) |
| c6d4a38 | feat: add LLM-based batch highlight service and HTML rendering (Phase 5.4.4) | ChunkHighlightService + render_highlight_html (20 tests) |
| a56f8f6 | feat: add highlight batch and GET endpoints (Phase 5.4.5) | POST /batch + GET /highlights endpoints (16 tests) |
| c632b9e | feat: cited source extraction, background batch trigger, and View PDF link upgrade (Phase 5.4.6-5.4.8) | extractCitedSources, useEffect batch trigger, link upgrade (25 frontend tests) |
| 1c490ce | fix: inline citations now upgrade to highlighted view (Phase 5.4) | buildCitationUrl() upgraded for inline citation links |
| bcf4a85 | feat: add highlight status toast notification (Phase 5.4) | Toast: loading → done states |
| b47e37f | fix: use absolute backend URL for highlight API calls | Hardcoded http://localhost:8000 for Vite proxy bypass |
| 523b27b | test: update batch URL assertion to match absolute backend URL | Test fix |
(Phase 5.1-5.2 commits omitted for brevity — see git log)

Phase 5.4 Completion Checklist

  • Sentence splitter utility (sentence_splitter.py) handles English + Chinese + mixed text
  • Highlight Pydantic models (6 schemas: target, request, response, LLM output)
  • SQLite highlight cache (highlight_cache.py) with sha256 cache keys
  • LLM-based batch highlight service (chunk_highlight_service.py) — single batched call for all cited chunks
  • HTML page rendering (render_highlight_html()) with .highlighted CSS spans and "View Original PDF →" footer
  • POST /api/v1/v2/highlights/batch endpoint — accepts targets, returns cached_count
  • GET /api/v1/v2/highlights endpoint — serves cached highlighted HTML (200) or 404 on cache miss
  • extractCitedSources() in citationParser.ts — extracts cited sources from answer text per sub-question
  • useEffect batch trigger in SubQuestionSections — fires after answer renders, POSTs batch to backend
  • Source card "View PDF" links upgrade to highlighted HTML after batch completes
  • Inline citation links in answer text also upgrade to highlighted view (buildCitationUrl + highlightKeys)
  • Toast notification: "Preparing highlights..." (amber spinner) → "Highlights ready — citations upgraded" (green)
  • Backend test suite: 108 passed, 0 failed
  • Frontend test suite: 45 passed, 0 failed
  • Manual verification: batch endpoint returns cached_count: 1 with real document, GET returns valid HTML with .highlighted spans

Phase 5.5 — Highlight History Tracking

Source: User request (2026-04-29)
Status: Complete

Track highlight generation data in the query history database:

  1. highlight_time_ms — duration of the highlight batch LLM call (excluded from total_time_ms)
  2. highlight_prompt — the full prompt sent to the LLM for highlight batch
  3. highlight_response — the LLM's structured response (HighlightBatchResult as JSON)

Architecture

SSE Stream: ... → phase: "completed"  (includes history_id)
    │
    ├── Frontend captures historyId from SSE event
    │
    └── Frontend useEffect: POST /api/v1/v2/highlights/batch?history_id={id}
              │
              ▼
        Backend chunk_highlight_service:
          1. Measures highlight_time_ms via time.perf_counter()
          2. Captures highlight_prompt (full LLM prompt)
          3. Captures highlight_response_json (llm_result.model_dump_json())
          4. Returns all 3 in HighlightBatchResponse
              │
              ▼
        Backend chunks.py endpoint:
          If history_id provided → HistoryService.update_highlights()
          UPDATE query_history SET highlight_prompt=?, highlight_response=?, highlight_time_ms=?
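
The timing and update steps in the flow above could be sketched like this. Only the table name `query_history`, the three highlight columns, and the use of `time.perf_counter()` come from the plan. The `id` primary-key column, the `timed_call` helper, and the function signatures are assumptions made for illustration.

```python
import sqlite3
import time


def timed_call(fn, *args):
    """Run fn(*args) and return (result, elapsed milliseconds)."""
    start = time.perf_counter()
    result = fn(*args)
    elapsed_ms = int((time.perf_counter() - start) * 1000)
    return result, elapsed_ms


def update_highlights(conn, history_id, prompt, response_json, time_ms) -> bool:
    """Attach highlight data to an existing history row.

    Returns False when history_id is not a valid row id (for example the
    -1 sentinel returned when recording history failed) or no row matched.
    """
    if history_id <= 0:
        return False
    cursor = conn.execute(
        "UPDATE query_history "
        "SET highlight_prompt = ?, highlight_response = ?, highlight_time_ms = ? "
        "WHERE id = ?",  # assumes the primary key column is named id
        (prompt, response_json, time_ms, history_id),
    )
    conn.commit()
    return cursor.rowcount == 1
```

Updating the existing row keeps total_time_ms untouched, which matches the decision to track highlight time separately.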

Key Decisions

| # | Decision | Rationale |
|---|---|---|
| 1 | Synchronous history record at completion (was fire-and-forget) | Need history_id for SSE completed event payload. A SQLite insert takes microseconds, so there is no noticeable latency impact. |
| 2 | update_highlights() separate from record() | Highlights are computed AFTER the main query pipeline. Update the existing row rather than inserting a new one. |
| 3 | Highlight time excluded from total_time_ms | total_time_ms is measured before highlights are computed. The highlight time is tracked separately. |
| 4 | Pass history_id via SSE completed event → frontend → batch POST query param | Avoids restructuring the SSE pipeline. Frontend already triggers batch POST and just adds a query param. |
| 5 | _record_history returns -1 on failure | Legacy call sites (no-results, all-filtered) are unaffected. The completed site checks if history_id > 0. |

Files Changed

| File | Change |
|---|---|
| backend/app/core/sqlite_db.py | +3 ALTER TABLE columns: highlight_prompt, highlight_response, highlight_time_ms |
| backend/app/services/history_service.py | +3 columns in _INSERT_COLUMNS, +update_highlights() method |
| backend/app/models/history.py | +3 fields in QueryHistoryRecord and QueryHistoryDetail |
| backend/app/models/highlight.py | +3 optional fields in HighlightBatchResponse |
| backend/app/services/chunk_highlight_service.py | +import time, timing around LLM call, capture highlight_response_json |
| backend/app/routers/query.py | _record_history returns int, sync call at completed, history_id in SSE event |
| backend/app/routers/chunks.py | +history_id query param, calls update_highlights() on success |
| frontend/src/lib/queries.tsx | +historyId in QueryStreamState, capture from completed event |
| frontend/src/components/ResponsePanel.tsx | +historyId prop, pass in batch POST URL |
| frontend/src/pages/LTTPage.tsx | pass historyId from query state to ResponsePanel |
| backend/app/test/test_phase5_highlight_models.py | updated expected dict for new default fields |

Test Results

  • Backend: 108 passed, 0 failed
  • Frontend: 45 passed, 0 failed
  • Total: 153 tests, 0 failures

Production Notes

Vite Proxy Limitation

Vite dev server only proxies paths under /api/v1/ (defined in vite.config.ts). The highlight endpoints live at /api/v1/v2/. As a workaround, the frontend calls the full backend URL http://localhost:8000/api/v1/v2/highlights/... directly rather than relying on the proxy.

To fix properly in production: use import.meta.env.VITE_API_BASE_URL (set to /api/v1 in prod, http://localhost:8000/api/v1 in dev) and construct URLs dynamically instead of hardcoding http://localhost:8000.

Port Conflict on Restart

If you restart the Vite dev server and the browser doesn't see changes, check for stale processes:

fuser -k 5173/tcp  # kill old Vite on port 5173
fuser -k 5174/tcp  # kill old Vite on port 5174 (auto-switched when 5173 busy)

Cache Location

Highlight SQLite cache lives at backend/app/data/highlights.db (auto-computed from PROMPTS_DB_PATH prefix: data/prompts.db → data/highlights.db). No new .env setting needed.
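
One way the path derivation above could work is to swap the file name while keeping the directory of PROMPTS_DB_PATH. The helper name `highlight_db_path` is hypothetical, and the project may derive the path differently.

```python
from pathlib import Path


def highlight_db_path(prompts_db_path: str) -> Path:
    # Keep the directory of the prompts database, replace only the file name.
    return Path(prompts_db_path).with_name("highlights.db")
```

With this rule, data/prompts.db maps to data/highlights.db, so the cache always sits next to the prompts database.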

LLM Provider Compatibility

The batch highlight service reuses LLMClient.complete_structured() from Phase 5.1, which supports both OpenRouter (OpenAI-compatible) and vLLM (via VLLM_ENGINE=true).