Package 5 Enhancement Plan — Structured Output + Robust Citation Linking
Source: User request (2026-04-28)
Scope:
- Phase 5.1: Replace manual JSON parsing in the decompose stage with LangChain with_structured_output()
- Phase 5.2: Fix missing PDF links in citations and improve citation robustness
Status: Phases 5.1 ✅, 5.2 ✅; 5.3 Deferred, 5.4 Planned (2026-04-28)
LangChain version: 1.2.15 (venv), model_provider="openai" with OpenRouter base URL (API-compatible proxy).
Test results:
- Backend: 115 passed, 0 failed (Phase 5.1 + Phase 5.2 + all integration/regression tests)
- Frontend: 187 passed, 1 failed (pre-existing e2e test failure unrelated to these changes)
Objective
- Decompose structured output: Eliminate json.JSONDecodeError failures in QueryDecomposer.decompose() by integrating LangChain's with_structured_output() to enforce a Pydantic schema at the API level. The LLM response is validated as a SubQuestions object, so the primary path needs no manual json.loads() and no regex markdown stripping, and it cannot fail silently.
- Robust citation linking: Fix the citation→PDF link pipeline so that:
  - document_id flows through to the frontend for fallback document-level links
  - chunk_file_path is always available (generate per-chunk PDFs for DOCX/TXT too, or provide a document-level PDF fallback)
  - Citation matching in citationParser.ts handles fuzzy filename matching (strips extensions, tolerates whitespace variations)
  - Frontend provides fallback "View Document" links when chunk-level PDF is unavailable
Decision Register
| # | Decision | Rationale |
|---|---|---|
| 1 | Use LangChain with_structured_output() (not OpenAI response_format directly) | User explicitly chose Option B. Provides cleaner API, auto-retry on validation failure, and future flexibility for other pipeline stages (filter, generate). |
| 2 | Add langchain + langchain-openai to requirements.txt | Required dependencies for init_chat_model() and with_structured_output(). langchain ~0.3.x for stable API. |
| 3 | Define SubQuestions Pydantic model with questions: list[str] | LangChain's with_structured_output() requires a wrapper Pydantic model; bare list[str] is unsupported by provider-native schema enforcement. |
| 4 | Keep LLMClient as the central LLM access layer, add LangChain-based complete_structured() method | Minimizes refactoring. QueryDecomposer calls llm_client.complete_structured(prompt, SubQuestions) instead of llm_client.complete(prompt). Other callers (filter, generate) remain unchanged. |
| 5 | Run decomposition at temperature=0.0 (was 0.7) | Structured output benefits from deterministic behavior. Lower temperature = more reliable schema compliance. |
| 6 | Add document_id to SourceMetadata Pydantic model and frontend type | document_id is already stored in ChromaDB metadata (metadata.py:70) but is discarded during serialization. Adding it enables document-level fallback links. |
| 7 | Do not generate per-chunk PDFs for DOCX/TXT in this phase; use fallback document-level links via document_id when chunk_file_path is null | Per-chunk PDF generation is more complex than needed. DOCX/TXT PDF generation deferred to Phase 5.3. |
| 8 | Fuzzy citation matching: strip extensions, trim whitespace | citationParser.ts currently requires exact filename match. LLM may shorten NEC4 ACC.pdf to NEC4 ACC in citations. |
| 9 | Fallback "View Document" link when chunk_file_path is null | Even after Decision #7, network failures or edge cases may leave null paths. The frontend should show a document-level PDF link as fallback. |
| 10 | Keep _extract_json_from_markdown() as a fallback for backward compatibility | During a transition period (or if with_structured_output() fails), the existing regex-based extraction serves as a safety net. Log a warning when fallback is used. |
| 11 | Add logger.warning for JSON parse failures before returning empty | The biggest blind spot today: JSON parse failures are silent. Log the raw LLM response (truncated) so operators can debug. |
| 12 | Keep QueryDecomposer.decompose() return type as Tuple[List[str], str] | Existing callers unpack the tuple. Adding `Tuple[List[str], str, SubQuestions
| 13 | Spike-test LangChain structured output with OpenRouter BEFORE implementation | 2-minute test calling init_chat_model().with_structured_output().ainvoke() through OpenRouter to confirm response_format={"type": "json_schema"} is proxied correctly. If not, fall back to method="function_calling". |
| 14 | Tighten generate_per_subq prompt alongside frontend fuzzy matching | Add "Copy the exact bracket labels shown in the document chunks — do not modify filenames or add/remove extensions." to seed template. Two-layer defense: prompt reduces hallucinations + fuzzy matching catches remaining cases. No separate task; folded into Task 5.2.3. |
Phase 5.1 — Structured Output for Decompose
Test Files (write BEFORE implementation)
| # | Test File | Coverage |
|---|---|---|
| T5.1.1 | backend/app/test/test_phase5_llm_client_structured.py | LLMClient.complete_structured() with mock LangChain model. Tests: valid Pydantic return, validation error → retry, empty questions list, non-JSON fallback. |
| T5.1.2 | backend/app/test/test_phase5_query_decomposer_structured.py | QueryDecomposer.decompose() using MockLLMClient.complete_structured(). Tests: valid SubQuestions, empty questions, LLM error fallback, prompt service integration. |
| T5.1.3 | backend/app/test/test_phase5_subquestions_model.py | SubQuestions Pydantic model validation. Tests: valid input, empty list, too many questions, non-string items rejected. |
| T5.1.4 | backend/app/test/test_phase5_decompose_logging.py | Verify logger.warning is emitted when JSON parse fallback is triggered (backward-compat path). |
Acceptance Tests
| # | Test File | Coverage |
|---|---|---|
| AT5.1.1 | backend/app/test/acceptance/test_acceptance_phase5_structured_decompose.py | Real LLM call with structured output. Tests: Cantonese question → valid sub-questions, English question → valid sub-questions, very short question → 1 sub-question, very long question → ≤5 sub-questions. |
Implementation Tasks
Task 5.1.1: Add LangChain dependencies
- Add langchain>=0.3.0,<0.4.0 and langchain-openai>=0.3.0,<0.4.0 to backend/requirements.txt
- Run pip install -r backend/requirements.txt in dev venv
- Test file: test_phase5_subquestions_model.py (can run immediately after install)
Task 5.1.2: Define SubQuestions Pydantic model
- Create backend/app/models/decompose.py with:

      from pydantic import BaseModel, Field

      class SubQuestions(BaseModel):
          questions: list[str] = Field(
              description="2-5 simplified sub-questions, each focused on one aspect",
              min_length=1,
              max_length=5,
          )

- Add min_length=1 and max_length=5 Pydantic constraints (the decompose prompt asks for 2-5; the lower bound of 1 accommodates very short questions that yield a single sub-question)
- Test file: test_phase5_subquestions_model.py
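The constraints above can be exercised directly. The following is a minimal sketch, assuming Pydantic v2 (where min_length/max_length apply to lists and model_validate() exists); the is_valid() helper is hypothetical and only illustrates which payloads the schema accepts.

```python
from pydantic import BaseModel, Field, ValidationError


class SubQuestions(BaseModel):
    """Wrapper model for the decompose stage output."""

    questions: list[str] = Field(
        description="2-5 simplified sub-questions, each focused on one aspect",
        min_length=1,
        max_length=5,
    )


def is_valid(payload: dict) -> bool:
    """Hypothetical helper: True when the payload satisfies SubQuestions."""
    try:
        SubQuestions.model_validate(payload)
        return True
    except ValidationError:
        return False


# One question is inside the 1..5 bounds.
ok = is_valid({"questions": ["What is a Key Date?"]})
# An empty list violates min_length=1.
empty = is_valid({"questions": []})
# Six items violate max_length=5.
too_many = is_valid({"questions": [f"q{i}" for i in range(6)]})
# A dict item is not a string and is rejected.
non_string = is_valid({"questions": [{"text": "q"}]})
```

These four cases mirror the coverage listed for T5.1.3 (valid input, empty list, too many questions, non-string items).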
Task 5.1.3: Add complete_structured() method to LLMClient
- In llm_client.py, import init_chat_model from langchain.chat_models
- Add self._langchain_model attribute (lazy-init from settings)
- Add async complete_structured(prompt, pydantic_model, step_name) -> BaseModel method:
  - Calls self._langchain_model.with_structured_output(pydantic_model, method="json_schema").ainvoke(prompt)
  - Returns the validated Pydantic model instance
  - Logs timing (same pattern as existing complete())
  - Wraps errors in LLMClientError
- Use temperature=0.0 via model config for structured calls
- Test file: test_phase5_llm_client_structured.py
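The timing and error-wrapping behaviour described for complete_structured() might look roughly like the sketch below. It is not the project's LLMClient: FakeStructuredModel is a hypothetical stub that only mimics the with_structured_output(...).ainvoke(...) call shape, LLMClientError is a local stand-in, and the function takes the model as an argument instead of reading self._langchain_model.

```python
import asyncio
import logging
import time

logger = logging.getLogger("llm_client")


class LLMClientError(Exception):
    """Local stand-in for the project's LLMClientError."""


class FakeStructuredModel:
    """Hypothetical stub with the same call shape as a LangChain chat model."""

    def __init__(self, result=None, error=None):
        self._result, self._error = result, error

    def with_structured_output(self, schema, method="json_schema"):
        # The real method returns a new runnable; the stub returns itself.
        return self

    async def ainvoke(self, prompt):
        if self._error is not None:
            raise self._error
        return self._result


async def complete_structured(model, prompt, schema, step_name="unknown"):
    """Invoke the model, log elapsed time, wrap any failure in LLMClientError."""
    started = time.perf_counter()
    try:
        runnable = model.with_structured_output(schema, method="json_schema")
        return await runnable.ainvoke(prompt)
    except Exception as exc:
        raise LLMClientError(f"{step_name}: structured call failed: {exc}") from exc
    finally:
        # Runs on both success and failure, so every call is timed.
        logger.info("%s took %.3fs", step_name, time.perf_counter() - started)
```

Because every failure surfaces as LLMClientError, callers such as QueryDecomposer can handle structured and legacy calls with a single except clause.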
Task 5.1.4: Refactor QueryDecomposer.decompose() to use structured output
- Change decompose() to call self.llm_client.complete_structured(prompt, SubQuestions, step_name="QueryDecomposer")
- Add fallback path: if complete_structured() raises → log warning → attempt legacy complete() + json.loads() → if that works, log info "structured output failed, fallback succeeded"
- Add logger.warning("Decompose JSON parse failed, raw response (first 500 chars): %s", response[:500]) when both paths fail
- Keep return type Tuple[List[str], str] unchanged
- Keep _extract_json_from_markdown() for backward-compat fallback path
- Test files: test_phase5_query_decomposer_structured.py and test_phase5_decompose_logging.py
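The fallback chain in this task can be sketched as follows. Several things here are assumptions made for illustration: StubClient is hypothetical, the structured result is modelled as a plain dict instead of a SubQuestions instance, and the second tuple element is used as a path label because the plan does not state what that string carries.

```python
import asyncio
import json
import logging
import re
from typing import List, Tuple

logger = logging.getLogger("query_decomposer")


class LLMClientError(Exception):
    """Local stand-in for the project's LLMClientError."""


def extract_json_from_markdown(text: str) -> str:
    """Return the body of a triple-backtick fence if present, else the text."""
    match = re.search(r"`{3}(?:json)?\s*(.*?)`{3}", text, re.DOTALL)
    return match.group(1).strip() if match else text.strip()


async def decompose(client, prompt: str) -> Tuple[List[str], str]:
    """Try structured output first, then the legacy parse; never raise."""
    try:
        parsed = await client.complete_structured(prompt)
        return list(parsed["questions"]), "structured"
    except LLMClientError as exc:
        logger.warning("Structured output failed, trying legacy path: %s", exc)

    response = ""
    try:
        response = await client.complete(prompt)
        questions = json.loads(extract_json_from_markdown(response))
        if not isinstance(questions, list):
            raise TypeError("expected a JSON array of strings")
        logger.info("structured output failed, fallback succeeded")
        return [str(q) for q in questions], "fallback"
    except (LLMClientError, json.JSONDecodeError, TypeError):
        logger.warning(
            "Decompose JSON parse failed, raw response (first 500 chars): %s",
            response[:500],
        )
        return [], "failed"


class StubClient:
    """Hypothetical client whose structured call always fails."""

    def __init__(self, raw: str):
        self.raw = raw

    async def complete_structured(self, prompt):
        raise LLMClientError("schema not supported")

    async def complete(self, prompt):
        return self.raw
```

With this shape, the only way to get an empty list is for both paths to fail, and that case always leaves a warning containing the truncated raw response.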
Task 5.1.5: Update prompt template for structured output
- Update _SEED_DECOMPOSE in sqlite_db.py to instruct the LLM about the expected structure
- New seed prompt: mention that output will be validated against a schema, and be more explicit about the JSON array of strings requirement
- Run seed_default_profiles() to backfill existing profiles
- Test file: Existing test_phase3_prompt_service.py should continue to pass
Task 5.1.6: Integration test — end-to-end query pipeline
- Verify existing integration tests still pass (test_integration_phase1.py, test_phase4_integration_query_pipeline.py)
- Verify acceptance test passes with real LLM (test_acceptance_phase1_rag_query.py)
- Run full test suite: cd backend && pytest app/test/test_phase5*.py app/test/test_phase4*.py app/test/test_phase3*.py -v
Phase 5.2 — Robust Citation Linking
Test Files (write BEFORE implementation)
| # | Test File | Coverage |
|---|---|---|
| T5.2.1 | backend/app/test/test_phase5_source_metadata.py | SourceMetadata model with document_id. Tests: serialization includes document_id, backward compat (old data without document_id). |
| T5.2.2 | backend/app/test/test_phase5_docx_pdf_generation.py | DOCX/TXT ingestion now sets chunk_file_path. Tests: DOCX ingestion produces chunk PDFs, TXT ingestion produces chunk PDFs, PDF generation errors are handled gracefully. |
| T5.2.3 | frontend/src/test/utils/test_phase5_citation_parser_fuzzy.test.ts | Fuzzy citation matching. Tests: citation [NEC4 ACC] matches source NEC4 ACC.pdf, citation [nec4 acc.pdf, page 3] matches after whitespace trim, citation [NEC4 ACC.PDF] matches case-insensitively, fallback "View Document" link shown when chunk_file_path is null. |
| T5.2.4 | frontend/src/test/utils/test_phase5_citation_fallback_link.test.ts | Fallback document link rendering. Tests: chunk with chunk_file_path: null but document_id present → renders "View Document" link, chunk with both null → remains plain text, chunk with chunk_file_path → renders page-level PDF link. |
Acceptance Tests
| # | Test File | Coverage |
|---|---|---|
| AT5.2.1 | backend/app/test/acceptance/test_acceptance_phase5_citation_links.py | Real LLM query with DOCX and PDF documents. Verify citations in the answer are clickable in the SSE response (sources include document_id and chunk_file_path). |
Implementation Tasks
Task 5.2.1: Add document_id to SourceMetadata model
- In backend/app/models/common.py, add document_id: Optional[str] = None to SourceMetadata
- In backend/app/routers/query.py lines 310-319, include document_id=meta.get("document_id") when building SourceMetadata objects
- In frontend/src/types/index.ts, add document_id: string | null to SourceMetadata interface
- Test file: test_phase5_source_metadata.py
Task 5.2.2: Generate PDFs for DOCX/TXT documents during ingestion
- Add reportlab to backend/requirements.txt (lightweight, pure Python PDF generation, no external binaries)
- In backend/app/routers/ingest.py DOCX and TXT branches, add PDF generation logic:
  - After chunking, generate a single PDF from the full text (one page per chunk)
  - Store chunk_filename = f"{stem}_chunk_{idx}.pdf" for each chunk
  - Set chunk_file_paths list and pass to extract_metadata()
- Add error handling: if PDF generation fails, chunk_file_path stays None (graceful degradation)
- Use logger.warning on generation failure
- Test file: test_phase5_docx_pdf_generation.py
Task 5.2.3: Improve citationParser.ts with fuzzy matching
- Add extension-stripping helper: stripExtension(filename: string): string, which removes .pdf, .docx, .txt
- Modify buildCitationLookup() to register both filename and stripExtension(filename) as lookup keys
- Add trim-whitespace normalization on citation text before lookup
- Add test for LLM-common variations: NEC4 ACC.pdf vs NEC4 ACC vs NEC4_acc.pdf
- Test file: test_phase5_citation_parser_fuzzy.test.ts
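The production change lives in citationParser.ts; the Python sketch below only illustrates the normalization rule (trim, collapse whitespace, strip the extension, compare case-insensitively, as covered by T5.2.3). All function names are hypothetical, and the sketch assumes filenames contain no commas and does not treat underscores as spaces.

```python
import re
from typing import Dict, List, Optional

_EXTENSIONS = (".pdf", ".docx", ".txt")


def strip_extension(filename: str) -> str:
    """Remove a trailing .pdf/.docx/.txt, ignoring case."""
    lowered = filename.lower()
    for ext in _EXTENSIONS:
        if lowered.endswith(ext):
            return filename[: -len(ext)]
    return filename


def normalize(label: str) -> str:
    """Trim, collapse internal whitespace, drop the extension, lowercase."""
    collapsed = re.sub(r"\s+", " ", label.strip())
    return strip_extension(collapsed).lower()


def build_lookup(filenames: List[str]) -> Dict[str, str]:
    """Map each normalized key back to the original source filename."""
    return {normalize(name): name for name in filenames}


def match(citation: str, lookup: Dict[str, str]) -> Optional[str]:
    """Resolve a label such as 'NEC4 ACC, page 3' to a source filename."""
    label = citation.split(",", 1)[0]  # drop a trailing ', page N' part
    return lookup.get(normalize(label))


lookup = build_lookup(["NEC4 ACC.pdf"])
```

Normalizing both sides to one key is a slight simplification of registering two keys per file, but it accepts the same citation variants.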
Task 5.2.4: Add fallback "View Document" link in frontend
- In citationParser.ts replaceCitationPatterns(), when source?.chunk_file_path is null but source?.document_id exists:
  - Build a URL to the document chunk list page: /rag-database?document_id=${source.document_id}
  - Return [${trimmed}](${url}) with a different CSS class (e.g., text-green-600 for document-level vs text-blue-600 for page-level)
- In ResponsePanel.tsx, update CitationLink component to accept a variant prop for visual differentiation
- Test file: test_phase5_citation_fallback_link.test.ts
Task 5.2.5: Integration and regression testing
- Verify all existing citation parser tests still pass: cd frontend && npx vitest run src/test/utils/citationParser.test.ts
- Verify ResponsePanel tests still pass: npx vitest run src/test/components/ResponsePanel.test.tsx
- Run full frontend test suite: npm test
- Verify SSE streaming integration: query with a mix of PDF and DOCX documents, confirm citations are clickable
Dependency Graph
Phase 5.1 (Structured Output)
Task 5.1.1 (add deps) ──┬── Task 5.1.2 (SubQuestions model) ── Task 5.1.3 (complete_structured)
│ │
│ ▼
│ Task 5.1.4 (refactor decompose)
│ │
│ Task 5.1.5 (update prompt template)
│ │
│ ▼
│ Task 5.1.6 (integration tests)
│
Phase 5.2 (Citation Linking) — independent, can run in parallel with 5.1
Task 5.2.1 (document_id in model) ──┬── Task 5.2.3 (fuzzy matching)
Task 5.2.2 (DOCX/TXT PDF gen) ──┤
├── Task 5.2.4 (fallback link)
│
▼
Task 5.2.5 (integration tests)
Acceptance Criteria
Phase 5.1 Completion Checklist
- LLMClient.complete_structured() returns validated SubQuestions Pydantic model; no json.JSONDecodeError possible
- QueryDecomposer.decompose() never returns [] due to JSON parse failure
- Fallback path (legacy json.loads()) logs a warning when triggered
- Existing decompose tests pass (test_phase1_query_decomposer.py)
- New structured output tests pass (test_phase5_*.py): 33 tests
- Spike test passed: Cantonese + English → valid sub-questions
- SQLite seed templates updated and backfilled to all profiles
- langchain and langchain-openai installed in venv (1.2.x)
Phase 5.2 Completion Checklist
- SourceMetadata includes document_id in both backend and frontend types
- DOCX/TXT ingestion generates per-chunk PDF files → DEFERRED to Phase 5.3
- citationParser.ts matches [NEC4 ACC] to source NEC4 ACC.pdf (fuzzy matching)
- citationParser.ts renders fallback link to /rag-database?document=xxx when chunk_file_path is null but document_id exists
- RAGDatabasePage auto-expands document from ?document= URL param
- All existing citation parser tests pass (14 tests)
- All existing ResponsePanel tests pass
- generate_per_subq seed prompt tightened: "Copy the exact bracket labels shown"
Rollback Plan
If with_structured_output() causes issues in production:
- The complete_structured() method wraps errors in LLMClientError, the same exception type as existing complete()
- QueryDecomposer.decompose() has a fallback to legacy complete() + json.loads() path
- The _extract_json_from_markdown() function is preserved for backward compatibility
- If LangChain is a complete failure, revert requirements.txt and llm_client.py changes (3 files), keeping the Pydantic model and improved logging
Phase 5.3 — DOCX/TXT PDF Generation (DEFERRED)
Generate per-chunk PDF files for DOCX/TXT documents at ingestion time so they have the same chunk_file_path → PDF viewer flow as PDF documents.
Status: Deferred. Phase 5.2 fallback links (/rag-database?document=xxx) are sufficient. Revisit after Phase 5.4 if plain-text chunk views are still needed alongside highlighted views.
Phase 5.4 — Sentence-Level Highlighting (PLANNED)
Problem
When a user clicks a citation link to view a cited chunk, they see the full chunk text (up to ~1000 tokens). They have to manually scan to find which sentences actually drove the relevance. This is especially painful for long, dense chunks.
Solution
On-the-fly highlighted HTML chunk views served by the backend. When a citation link is clicked, the frontend passes the sub-question that retrieved that chunk. The backend splits the chunk into sentences, computes embedding similarity of each sentence to the sub-question, and returns a styled HTML page with relevant sentences highlighted.
Why HTML, not PDF?
| Approach | Complexity | Works for all doc types? | Preserves original formatting? |
|---|---|---|---|
| Highlighted HTML page | Low | ✅ Yes (uses chunk text) | ❌ Plain text only |
| Highlighted PDF via reportlab | Medium | ✅ Yes (new PDF) | ❌ Plain text only |
| Overlay highlights on existing PDF | High | ⚠️ PDF only | ✅ Yes |
Recommendation: HTML page. Simple, fast, works uniformly for PDF/DOCX/TXT chunks. Original formatting is preserved in the existing PDF viewer (chunk_file_path link) — the highlighted HTML view is a supplementary view reached via a separate button/link. The two views coexist: "View Original PDF" vs "View Highlighted Text".
How It Works (No LLM Needed)
User clicks citation [NEC4 ACC, chunk 3]
│
▼
Frontend sends: GET /api/v1/chunks/highlight?document_id=abc&chunk_index=2&sub_question=...
│
▼
Backend:
1. Fetch chunk text from ChromaDB [chromadb get()]
2. Split into sentences [nltk.sent_tokenize or regex]
3. Embed sub-question [existing embedding model]
4. Embed each sentence (batch, parallel) [same model]
5. Compute cosine similarity per sentence vs sub-question [numpy]
6. Return HTML with yellow background on sentences > threshold
│
▼
Frontend renders HTML in an iframe or new tab
What Gets Highlighted
┌──────────────────────────────────────────────────────────┐
│ Chunk: NEC4 ACC, page 12 [View Original PDF →] │
├──────────────────────────────────────────────────────────┤
│ │
│ The programme shall be prepared in a form acceptable to │
│ the Project Manager. It shall include: │
│ │
│ ████████████████████████████████████████████████████████ │
│ █ The starting date, access dates, and Key Dates. █ │ ← High similarity
│ ████████████████████████████████████████████████████████ │
│ │
│ The Contractor shall submit a first programme within │
│ ████████████████████████████████████████████████████████ │
│ █ two weeks of the starting date. █ │ ← High similarity
│ ████████████████████████████████████████████████████████ │
│ │
│ The Project Manager may instruct the Contractor to │
│ submit a revised programme showing the effects of a │
│ compensation event. This does not affect the Contractor's │
│ right to be paid for preparing the programme. │ ← Low similarity (no highlight)
│ │
└──────────────────────────────────────────────────────────┘
Key Design Decisions
| # | Decision | Rationale |
|---|---|---|
| 1 | HTML page, not PDF | Zero dependency (reportlab not needed). Faster to generate. CSS-based highlighting is more flexible. Original PDF view remains available separately. |
| 2 | Embedding similarity, not LLM | No LLM API cost and much lower latency than an LLM call. The embedding model is already running. Cosine similarity is cheap. |
| 3 | Sentence-level granularity | Paragraph-level is too coarse (whole paragraph might be dimly relevant). Word/phrase-level is too noisy. Sentences are the natural unit of meaning. |
| 4 | Embed sentences in batch | A 1000-token chunk has ~8-12 sentences. One batch embedding call is fast (single API round-trip). |
| 5 | Configurable threshold (env var) | HIGHLIGHT_SIMILARITY_THRESHOLD (default 0.5). Tune per embedding model. |
| 6 | Cache sentence embeddings per chunk | A chunk may be cited in multiple queries. Cache sentence embeddings in ChromaDB metadata or SQLite to avoid recomputation. |
| 7 | Graceful degradation | If embedding fails → return plain text chunk view. If sentence splitting fails → highlight entire chunk. |
| 8 | Frontend: "View Highlighted" link alongside "View PDF" | The existing PDF viewer link (chunk_file_path) stays. A second link opens the highlighted HTML view. Both visible, user chooses. |
Implementation Tasks
Task 5.4.1: Backend — Sentence splitting utility
- Create backend/app/utils/sentence_splitter.py
- Function split_sentences(text: str) -> list[dict] returns [{text, start_char, end_char}, ...]
- Use nltk.sent_tokenize with fallback to regex (re.split(r'(?<=[.!?])\s+'))
- NLTK punkt data auto-downloaded on first use (or bundled)
- Handle edge cases: empty text, single sentence, lists/bullets
- Test file: test_phase5_sentence_splitter.py
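A minimal sketch of the regex fallback only, using the same boundary pattern as above; the NLTK path is omitted. It uses finditer() in place of re.split() so character offsets can be tracked, and the _append() helper is a hypothetical name. Abbreviations and bullet lists are not handled here.

```python
import re
from typing import Dict, List, Union

Span = Dict[str, Union[str, int]]

# Whitespace that directly follows sentence-ending punctuation.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def _append(spans: List[Span], text: str, start: int, end: int) -> None:
    """Record text[start:end] unless it is empty or whitespace only."""
    segment = text[start:end]
    stripped = segment.strip()
    if not stripped:
        return
    begin = start + (len(segment) - len(segment.lstrip()))
    spans.append(
        {"text": stripped, "start_char": begin, "end_char": begin + len(stripped)}
    )


def split_sentences(text: str) -> List[Span]:
    """Split on the boundary pattern, keeping offsets into the original text."""
    spans: List[Span] = []
    start = 0
    for boundary in _BOUNDARY.finditer(text):
        _append(spans, text, start, boundary.start())
        start = boundary.end()
    _append(spans, text, start, len(text))
    return spans
```

Keeping start_char and end_char lets the highlight renderer slice the original chunk text, so whitespace between sentences is preserved in the output.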
Task 5.4.2: Backend — Highlighted chunk endpoint
- New endpoint: GET /api/v1/chunks/highlight
- Query params: document_id, chunk_index, sub_question
- Returns text/html (not JSON)
- Logic in backend/app/services/chunk_highlight_service.py:
  - Fetch chunk from ChromaDB by document_id + chunk_index
  - Split into sentences via split_sentences()
  - Get embedding for sub_question via existing embedding model
  - Get embeddings for all sentences in one batch call
  - Compute cosine similarity: np.dot(q_emb, s_emb) / (norm(q_emb) * norm(s_emb))
  - Mark sentences with similarity > threshold as highlighted
  - Render HTML template with inline CSS (yellow background, subtle border)
- Test file: test_phase5_chunk_highlight.py
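The similarity and rendering steps can be sketched without any embedding service. This sketch swaps NumPy for plain Python arithmetic to stay dependency-free, takes precomputed embeddings as inputs, and uses the default threshold of 0.5 from Design Decision 5; the render_highlighted() name and the use of mark tags are assumptions, not the planned HTML template.

```python
import html
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """dot(a, b) / (|a| * |b|); returns 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def render_highlighted(
    sentences: List[str],
    sentence_embs: List[Sequence[float]],
    question_emb: Sequence[float],
    threshold: float = 0.5,
) -> str:
    """Wrap sentences whose similarity exceeds the threshold in mark tags."""
    parts = []
    for sentence, emb in zip(sentences, sentence_embs):
        safe = html.escape(sentence)  # chunk text is untrusted input
        if cosine(question_emb, emb) > threshold:
            parts.append(f'<mark style="background: yellow">{safe}</mark>')
        else:
            parts.append(safe)
    return "<p>" + " ".join(parts) + "</p>"
```

Escaping each sentence before wrapping it matters because the endpoint returns text/html built from document content.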
Task 5.4.3: Frontend — "View Highlighted" link in citations and sources
- In citationParser.ts and ResponsePanel.tsx, add a "🔍" or "View Highlighted" link next to each source
- Link target: /api/v1/chunks/highlight?document_id=...&chunk_index=...&sub_question=...
- The sub-question is the one that retrieved this chunk (already available in the sources structure: source.sub_question_index → look up sub-question text)
- Open in new tab or modal
- Test files: Update citationParser.test.ts and ResponsePanel.test.tsx
Task 5.4.4: Integration testing
- Verify highlight endpoint returns 200 with valid HTML for all doc types (PDF, DOCX, TXT)
- Verify sentence highlighting is proportional to relevance (spot-check manually)
- Verify caching works (second request for same chunk is faster)
- Verify graceful degradation (embedding API down → plain text still served)
- Run full test suite
Test Files
| # | Test File | Coverage |
|---|---|---|
| T5.4.1 | backend/app/test/test_phase5_sentence_splitter.py | Sentence splitting: English, mixed punctuation, empty, single sentence, bullet lists |
| T5.4.2 | backend/app/test/test_phase5_chunk_highlight.py | Highlight endpoint: valid request → HTML with highlights, threshold filtering, no sentences above threshold → all plain, missing document/chunk → 404, embedding failure → fallback plain text |
| T5.4.3 | frontend/src/test/utils/citationParser.test.ts (update) | Citation links include highlight URL when sub-question context available |
| T5.4.4 | frontend/src/test/components/ResponsePanel.test.tsx (update) | Sources section renders "View Highlighted" link alongside "View PDF" |
Acceptance Tests
| # | Test File | Coverage |
|---|---|---|
| AT5.4.1 | backend/app/test/acceptance/test_acceptance_phase5_highlight.py | Real LLM query → real embeddings → open highlighted view → verify yellow spans exist on relevant sentences |
Commit Plan
| Commit | Message | Scope |
|---|---|---|
| 1 | feat: add LangChain deps and SubQuestions Pydantic model | Tasks 5.1.1 + 5.1.2 + tests |
| 2 | feat: add LLMClient.complete_structured() with LangChain | Task 5.1.3 + tests |
| 3 | feat: refactor QueryDecomposer to use structured output with fallback | Task 5.1.4 + tests |
| 4 | chore: update decompose seed prompt for structured output | Task 5.1.5 |
| 5 | feat: add document_id to SourceMetadata model | Task 5.2.1 + tests |
| 6 | feat: fuzzy citation matching and document fallback links | Tasks 5.2.3 + 5.2.4 + tests |
| 7 | feat: sentence-level chunk highlighting via embedding similarity | Phase 5.4 (all tasks) |