Package 5 Enhancement Plan — Structured Output + Robust Citation Linking

Source: User request (2026-04-28)

Scope:

Phase 5.1: Replace manual JSON parsing in the decompose stage with LangChain with_structured_output()
Phase 5.2: Fix missing PDF links in citations and improve citation robustness

Status: ✅ Complete. Both phases implemented (2026-04-28)

LangChain version: 1.2.15 (venv), model_provider="openai" with OpenRouter base URL (API-compatible proxy).

Test results:

Backend: 115 passed, 0 failed (Phase 5.1 + Phase 5.2 + all integration/regression tests)
Frontend: 187 passed, 1 failed (pre-existing e2e test failure unrelated to these changes)

Objective

Decompose structured output: Eliminate json.JSONDecodeError failures in QueryDecomposer.decompose() by using LangChain's with_structured_output() to enforce a Pydantic schema at the API level. On the structured path the LLM response arrives as a validated SubQuestions object, with no manual json.loads() and no regex markdown stripping. If the structured call itself fails, the legacy parsing path takes over and every failure is logged, so nothing fails silently (see Decisions #10 and #11).
Robust citation linking: Fix the citation→PDF link pipeline so that:
- document_id flows through to the frontend for fallback document-level links
- chunk_file_path is always available (generate per-chunk PDFs for DOCX/TXT too, or provide a document-level PDF fallback)
- Citation matching in citationParser.ts handles fuzzy filename matching (strips extensions, tolerates whitespace variations)
- Frontend provides fallback "View Document" links when chunk-level PDF is unavailable

Decision Register

#	Decision	Rationale
1	Use LangChain `with_structured_output()` (not OpenAI `response_format` directly)	User explicitly chose Option B. Provides cleaner API, auto-retry on validation failure, and future flexibility for other pipeline stages (filter, generate).
2	Add `langchain` + `langchain-openai` to `requirements.txt`	Required dependencies for `init_chat_model()` and `with_structured_output()`. `langchain` ~0.3.x for stable API.
3	Define `SubQuestions` Pydantic model with `questions: list[str]`	LangChain's `with_structured_output()` requires a wrapper Pydantic model — bare `list[str]` is unsupported by provider-native schema enforcement.
4	Keep `LLMClient` as the central LLM access layer, add LangChain-based `complete_structured()` method	Minimizes refactoring. `QueryDecomposer` calls `llm_client.complete_structured(prompt, SubQuestions)` instead of `llm_client.complete(prompt)`. Other callers (filter, generate) remain unchanged.
5	Run decomposition at `temperature=0.0` (was `0.7`)	Structured output benefits from deterministic behavior. Lower temperature = more reliable schema compliance.
6	Add `document_id` to `SourceMetadata` Pydantic model and frontend type	`document_id` is already stored in ChromaDB metadata (`metadata.py:70`) but is discarded during serialization. Adding it enables document-level fallback links.
7	~~Generate monolithic PDFs for DOCX/TXT documents~~ → DEFERRED	More complex than needed. Instead, use fallback document-level links via `document_id` when `chunk_file_path` is null. DOCX/TXT PDF generation deferred to Phase 5.3.
8	Fuzzy citation matching: strip extensions, trim whitespace	`citationParser.ts` currently requires exact filename match. LLM may shorten `NEC4 ACC.pdf` to `NEC4 ACC` in citations.
9	Fallback "View Document" link when `chunk_file_path` is null	With Decision #7 deferred, DOCX/TXT chunks have no chunk-level PDF, and network failures or edge cases may leave other paths null as well. The frontend should show a document-level link as fallback.
10	Keep `_extract_json_from_markdown()` as a fallback for backward compatibility	During a transition period (or if `with_structured_output()` fails), the existing regex-based extraction serves as a safety net. Log a warning when fallback is used.
11	Add `logger.warning` for JSON parse failures before returning empty	The biggest blind spot today: JSON parse failures are silent. Log the raw LLM response (truncated) so operators can debug.
12	Keep `QueryDecomposer.decompose()` return type as `Tuple[List[str], str]`	Existing callers unpack the tuple. Adding `Tuple[List[str], str, SubQuestions
13	Spike-test LangChain structured output with OpenRouter BEFORE implementation	2-minute test calling `init_chat_model().with_structured_output().ainvoke()` through OpenRouter to confirm `response_format={"type": "json_schema"}` is proxied correctly. If not, fall back to `method="function_calling"`.
14	Tighten `generate_per_subq` prompt alongside frontend fuzzy matching	Add "Copy the exact bracket labels shown in the document chunks — do not modify filenames or add/remove extensions." to seed template. Two-layer defense: prompt reduces hallucinations + fuzzy matching catches remaining cases. No separate task — folded into Task 5.2.3.

Phase 5.1 — Structured Output for Decompose

Test Files (write BEFORE implementation)

#	Test File	Coverage
T5.1.1	`backend/app/test/test_phase5_llm_client_structured.py`	`LLMClient.complete_structured()` with mock LangChain model. Tests: valid Pydantic return, validation error → retry, empty questions list, non-JSON fallback.
T5.1.2	`backend/app/test/test_phase5_query_decomposer_structured.py`	`QueryDecomposer.decompose()` using `MockLLMClient.complete_structured()`. Tests: valid SubQuestions, empty questions, LLM error fallback, prompt service integration.
T5.1.3	`backend/app/test/test_phase5_subquestions_model.py`	`SubQuestions` Pydantic model validation. Tests: valid input, empty list, too many questions, non-string items rejected.
T5.1.4	`backend/app/test/test_phase5_decompose_logging.py`	Verify `logger.warning` is emitted when JSON parse fallback is triggered (backward-compat path).

Acceptance Tests

#	Test File	Coverage
AT5.1.1	`backend/app/test/acceptance/test_acceptance_phase5_structured_decompose.py`	Real LLM call with structured output. Tests: Cantonese question → valid sub-questions, English question → valid sub-questions, very short question → 1 sub-question, very long question → ≤5 sub-questions.

Implementation Tasks

Task 5.1.1: Add LangChain dependencies

Add langchain>=0.3.0,<0.4.0 and langchain-openai>=0.3.0,<0.4.0 to backend/requirements.txt
Run pip install -r backend/requirements.txt in dev venv
Test file: test_phase5_subquestions_model.py (can run immediately after install)

Task 5.1.2: Define `SubQuestions` Pydantic model

Create backend/app/models/decompose.py with:

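from pydantic import BaseModel, Field


class SubQuestions(BaseModel):
    """Schema the decompose stage's LLM response is validated against."""

    questions: list[str] = Field(
        description="2-5 simplified sub-questions, each focused on one aspect",
        min_length=1,  # a very short query may yield a single sub-question
        max_length=5,
    )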

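Add min_length=1 and max_length=5 Pydantic constraints: max_length=5 matches the upper bound of the decompose prompt's "2-5", while min_length=1 still accepts a single sub-question for very short queries (see AT5.1.1)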
Test file: test_phase5_subquestions_model.py
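As a rough illustration of what these constraints enforce, the sketch below validates a few payloads against the model. It assumes Pydantic v2, where `min_length`/`max_length` on a `list` field bound the number of items. The `is_valid()` helper is hypothetical and exists only for this example; the model is repeated so the snippet runs on its own.

```python
from pydantic import BaseModel, Field, ValidationError


class SubQuestions(BaseModel):
    """Wrapper model for the decompose stage's structured output."""

    questions: list[str] = Field(
        description="2-5 simplified sub-questions, each focused on one aspect",
        min_length=1,
        max_length=5,
    )


def is_valid(payload: dict) -> bool:
    """Hypothetical helper: True if payload validates as SubQuestions."""
    try:
        SubQuestions.model_validate(payload)
        return True
    except ValidationError:
        return False


if __name__ == "__main__":
    print(is_valid({"questions": ["What is the defects date?"]}))  # single item
    print(is_valid({"questions": []}))                              # empty list
    print(is_valid({"questions": ["q"] * 6}))                       # six items
    print(is_valid({"questions": [1, 2]}))                          # non-string items
```

These four cases correspond to the scenarios listed for T5.1.3 (valid input, empty list, too many questions, non-string items).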

Task 5.1.3: Add `complete_structured()` method to `LLMClient`

In llm_client.py, import init_chat_model from langchain.chat_models
Add self._langchain_model attribute (lazy-init from settings)
Add async complete_structured(prompt, pydantic_model, step_name) -> BaseModel method:
1. Calls self._langchain_model.with_structured_output(pydantic_model, method="json_schema").ainvoke(prompt)
2. Returns the validated Pydantic model instance
3. Logs timing (same pattern as existing complete())
4. Wraps errors in LLMClientError
Use temperature=0.0 via model config for structured calls
Test file: test_phase5_llm_client_structured.py
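The steps above might look roughly like the following sketch. It is not the project's actual `LLMClient`: `StructuredLLMClient`, `FakeChatModel`, `FakeRunnable`, and `FakeSubQuestions` are hypothetical names, and the fake model stands in for whatever `init_chat_model()` returns so the snippet runs without LangChain or network access. Only the `with_structured_output(schema, method="json_schema")` and `ainvoke(prompt)` calls mirror the plan.

```python
import asyncio
import logging
import time
from dataclasses import dataclass
from typing import Any, List

logger = logging.getLogger("llm_client_sketch")


class LLMClientError(Exception):
    """Stand-in for the project's existing LLMClientError."""


class StructuredLLMClient:
    """Hypothetical slice of LLMClient showing only complete_structured()."""

    def __init__(self, chat_model: Any) -> None:
        # Assumption: the real client builds this lazily from settings, e.g. with
        # langchain.chat_models.init_chat_model(..., temperature=0.0).
        self._langchain_model = chat_model

    async def complete_structured(self, prompt: str, pydantic_model: type, step_name: str = "") -> Any:
        start = time.perf_counter()
        try:
            runnable = self._langchain_model.with_structured_output(
                pydantic_model, method="json_schema"
            )
            result = await runnable.ainvoke(prompt)
        except Exception as exc:
            # Surface every failure as the same error type complete() raises.
            raise LLMClientError(f"{step_name}: structured call failed: {exc}") from exc
        logger.info("%s structured call took %.2fs", step_name, time.perf_counter() - start)
        return result


@dataclass
class FakeSubQuestions:
    """Dataclass stand-in for the SubQuestions Pydantic model."""
    questions: List[str]


class FakeRunnable:
    def __init__(self, schema: type, fail: bool) -> None:
        self._schema = schema
        self._fail = fail

    async def ainvoke(self, prompt: str) -> Any:
        if self._fail:
            raise RuntimeError("provider rejected the request")
        return self._schema(questions=[f"What does '{prompt}' cover?"])


class FakeChatModel:
    """Duck-typed replacement for a LangChain chat model."""

    def __init__(self, fail: bool = False) -> None:
        self._fail = fail

    def with_structured_output(self, schema: type, method: str = "json_schema") -> FakeRunnable:
        return FakeRunnable(schema, self._fail)


if __name__ == "__main__":
    client = StructuredLLMClient(FakeChatModel())
    print(asyncio.run(client.complete_structured("retention", FakeSubQuestions, "QueryDecomposer")))
```

Injecting the chat model keeps the method easy to test with a mock, which is the setup T5.1.1 describes. If the spike test in Decision #13 shows that `json_schema` is not proxied correctly, the `method` argument is the place to switch to `"function_calling"`.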

Task 5.1.4: Refactor `QueryDecomposer.decompose()` to use structured output

Change decompose() to call self.llm_client.complete_structured(prompt, SubQuestions, step_name="QueryDecomposer")
Add fallback path: if complete_structured() raises → log warning → attempt legacy complete() + json.loads() → if that works, log info "structured output failed, fallback succeeded"
Add logger.warning("Decompose JSON parse failed, raw response (first 500 chars): %s", response[:500]) when both paths fail
Keep return type Tuple[List[str], str] unchanged
Keep _extract_json_from_markdown() for backward-compat fallback path
Test file: test_phase5_query_decomposer_structured.py and test_phase5_decompose_logging.py
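A minimal sketch of the three-step fallback chain described in this task, under stated assumptions: `StubClient` and `SubQuestionsStub` are hypothetical test doubles, `_extract_json_from_markdown()` is a simplified stand-in for the existing helper, and the second tuple element is assumed to be the prompt (the plan does not say what that string holds).

```python
import asyncio
import json
import logging
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple

logger = logging.getLogger("decompose_sketch")

FENCE = "`" * 3  # markdown code-fence marker, built indirectly
_FENCED_JSON = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def _extract_json_from_markdown(text: str) -> str:
    """Simplified stand-in: unwrap a fenced JSON block if one is present."""
    match = _FENCED_JSON.search(text)
    return match.group(1).strip() if match else text.strip()


@dataclass
class SubQuestionsStub:
    """Dataclass stand-in for the SubQuestions Pydantic model."""
    questions: List[str]


class StubClient:
    """Test double; the real LLMClient talks to the model provider."""

    def __init__(self, structured: Optional[List[str]] = None, legacy: str = "") -> None:
        self._structured = structured
        self._legacy = legacy

    async def complete_structured(self, prompt, pydantic_model, step_name=""):
        if self._structured is None:
            raise RuntimeError("structured output unavailable")
        return pydantic_model(questions=self._structured)

    async def complete(self, prompt):
        return self._legacy


async def decompose(llm_client, prompt: str) -> Tuple[List[str], str]:
    # 1. Preferred path: schema-validated structured output.
    try:
        result = await llm_client.complete_structured(
            prompt, SubQuestionsStub, step_name="QueryDecomposer"
        )
        return list(result.questions), prompt
    except Exception as exc:
        logger.warning("Structured decompose failed, trying legacy path: %s", exc)

    # 2. Legacy path: free-text completion parsed as a JSON array of strings.
    response = ""
    try:
        response = await llm_client.complete(prompt)
        parsed = json.loads(_extract_json_from_markdown(response))
        if isinstance(parsed, list) and all(isinstance(q, str) for q in parsed):
            logger.info("structured output failed, fallback succeeded")
            return parsed, prompt
    except Exception:
        pass  # fall through to the warning below

    # 3. Both paths failed: log the raw response instead of failing silently.
    logger.warning(
        "Decompose JSON parse failed, raw response (first 500 chars): %s", response[:500]
    )
    return [], prompt
```

The return shape stays a two-element tuple in every branch, so existing callers that unpack it keep working.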

Task 5.1.5: Update prompt template for structured output

Update _SEED_DECOMPOSE in sqlite_db.py to instruct the LLM about the expected structure
New seed prompt: mention that output will be validated against a schema — more explicit about JSON array of strings requirement
Run seed_default_profiles() to backfill existing profiles
Test file: Existing test_phase3_prompt_service.py should continue to pass

Task 5.1.6: Integration test — end-to-end query pipeline

Verify existing integration tests still pass (test_integration_phase1.py, test_phase4_integration_query_pipeline.py)
Verify acceptance test passes with real LLM (test_acceptance_phase1_rag_query.py)
Run full test suite: cd backend && pytest app/test/test_phase5*.py app/test/test_phase4*.py app/test/test_phase3*.py -v

Phase 5.2 — Robust Citation Linking

Test Files (write BEFORE implementation)

#	Test File	Coverage
T5.2.1	`backend/app/test/test_phase5_source_metadata.py`	`SourceMetadata` model with `document_id`. Tests: serialization includes document_id, backward compat (old data without document_id).
T5.2.2	`backend/app/test/test_phase5_docx_pdf_generation.py`	DOCX/TXT ingestion now sets `chunk_file_path`. Tests: DOCX ingestion produces chunk PDFs, TXT ingestion produces chunk PDFs, PDF generation errors are handled gracefully.
T5.2.3	`frontend/src/test/utils/test_phase5_citation_parser_fuzzy.test.ts`	Fuzzy citation matching. Tests: citation `[NEC4 ACC]` matches source `NEC4 ACC.pdf`, citation `[nec4 acc.pdf, page 3]` matches after whitespace trim, citation `[NEC4 ACC.PDF]` matches case-insensitively, fallback "View Document" link shown when `chunk_file_path` is null.
T5.2.4	`frontend/src/test/utils/test_phase5_citation_fallback_link.test.ts`	Fallback document link rendering. Tests: chunk with `chunk_file_path: null` but `document_id` present → renders "View Document" link, chunk with both null → remains plain text, chunk with `chunk_file_path` → renders page-level PDF link.

Acceptance Tests

#	Test File	Coverage
AT5.2.1	`backend/app/test/acceptance/test_acceptance_phase5_citation_links.py`	Real LLM query with DOCX and PDF documents. Verify citations in the answer are clickable in the SSE response (sources include document_id and chunk_file_path).

Implementation Tasks

Task 5.2.1: Add `document_id` to `SourceMetadata` model

In backend/app/models/common.py, add document_id: Optional[str] = None to SourceMetadata
In backend/app/routers/query.py lines 310-319, include document_id=meta.get("document_id") when building SourceMetadata objects
In frontend/src/types/index.ts, add document_id: string | null to SourceMetadata interface
Test file: test_phase5_source_metadata.py
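The backend half of this change could look something like the sketch below, assuming Pydantic v2. Only `document_id` comes from the task; the other two fields are a reduced, assumed subset of the real `SourceMetadata`, and `build_source()` and the `"filename"` metadata key are hypothetical.

```python
from typing import Optional

from pydantic import BaseModel


class SourceMetadata(BaseModel):
    # Assumed subset of the real model; document_id is the new field.
    filename: str
    chunk_file_path: Optional[str] = None
    document_id: Optional[str] = None


def build_source(meta: dict) -> SourceMetadata:
    """Hypothetical mapper from a stored chunk metadata dict to the API model."""
    return SourceMetadata(
        filename=meta.get("filename", ""),
        chunk_file_path=meta.get("chunk_file_path"),
        # .get() returns None when older metadata lacks the key.
        document_id=meta.get("document_id"),
    )


if __name__ == "__main__":
    print(build_source({"filename": "NEC4 ACC.pdf", "document_id": "doc-1"}).model_dump())
    print(build_source({"filename": "notes.docx"}).model_dump())
```

Defaulting the field to `None` keeps older records serializable, which is the backward-compatibility case T5.2.1 covers.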

Task 5.2.2: ~~Generate PDFs for DOCX/TXT documents during ingestion~~ → DEFERRED to Phase 5.3 (see Decision #7)

Add reportlab to backend/requirements.txt (lightweight, pure Python PDF generation, no external binaries)
In backend/app/routers/ingest.py DOCX and TXT branches, add PDF generation logic:
1. After chunking, generate a single PDF from the full text (one page per chunk)
2. Store chunk_filename = f"{stem}_chunk_{idx}.pdf" for each chunk
3. Set chunk_file_paths list and pass to extract_metadata()
Add error handling: if PDF generation fails, chunk_file_path stays None (graceful degradation)
Use logger.warning on generation failure
Test file: test_phase5_docx_pdf_generation.py

Task 5.2.3: Improve `citationParser.ts` with fuzzy matching

Add extension-stripping helper: stripExtension(filename: string): string — removes .pdf, .docx, .txt
Modify buildCitationLookup() to register both filename and stripExtension(filename) as lookup keys
Add trim-whitespace normalization on citation text before lookup
Add test for LLM-common variations: NEC4 ACC.pdf vs NEC4 ACC vs NEC4_acc.pdf
Test file: test_phase5_citation_parser_fuzzy.test.ts
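The real change lives in citationParser.ts. The lookup logic is sketched here in Python purely for illustration. Function names are Python-style renderings of the helpers named above, `normalize()` and `resolve()` are hypothetical, and the case-insensitive comparison follows the T5.2.3 test description. Splitting on the first comma to drop a ", page N" suffix is an assumption that would misbehave for filenames containing commas.

```python
import re
from typing import Dict, Iterable, Optional

_EXT_RE = re.compile(r"\.(pdf|docx|txt)$", re.IGNORECASE)


def strip_extension(filename: str) -> str:
    """Remove a trailing .pdf, .docx, or .txt (any letter case)."""
    return _EXT_RE.sub("", filename)


def normalize(label: str) -> str:
    """Trim, collapse runs of whitespace, and lowercase."""
    return re.sub(r"\s+", " ", label.strip()).lower()


def build_citation_lookup(filenames: Iterable[str]) -> Dict[str, str]:
    """Register each filename under both its full and extension-less key."""
    lookup: Dict[str, str] = {}
    for name in filenames:
        lookup[normalize(name)] = name
        lookup[normalize(strip_extension(name))] = name
    return lookup


def resolve(citation: str, lookup: Dict[str, str]) -> Optional[str]:
    """Map a bracketed citation such as '[NEC4 ACC, page 3]' to a source filename."""
    label = citation.strip().strip("[]").split(",")[0]
    return lookup.get(normalize(label))


if __name__ == "__main__":
    lookup = build_citation_lookup(["NEC4 ACC.pdf"])
    for text in ["[NEC4 ACC]", "[nec4 acc.pdf, page 3]", "[NEC4 ACC.PDF]", "[NEC4_acc.pdf]"]:
        print(text, "->", resolve(text, lookup))
```

Note that the underscore variant `NEC4_acc.pdf` does not resolve with extension stripping and whitespace trimming alone; matching it would need an extra normalization step, for example treating underscores as spaces.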

Task 5.2.4: Add fallback "View Document" link in frontend

In citationParser.ts replaceCitationPatterns(), when source?.chunk_file_path is null but source?.document_id exists:
1. Build a URL to the document chunk list page: /rag-database?document=${source.document_id} (the same ?document= param that RAGDatabasePage reads to auto-expand the document)
2. Return [${trimmed}](${url}) with a different CSS class (e.g., text-green-600 for document-level vs text-blue-600 for page-level)
In ResponsePanel.tsx, update CitationLink component to accept a variant prop for visual differentiation
Test file: test_phase5_citation_fallback_link.test.ts
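The link-selection rule can be summarized in a few lines. This is a Python illustration of logic that belongs in citationParser.ts, not the frontend code itself: `citation_markdown()` is a hypothetical name, using the stored chunk path directly as the href is an assumption, and the CSS-class distinction is left out.

```python
from typing import Optional
from urllib.parse import quote


def citation_markdown(
    label: str, chunk_file_path: Optional[str], document_id: Optional[str]
) -> str:
    """Return a markdown link for a citation label, or plain text if unlinkable."""
    if chunk_file_path:
        # Page-level link to the chunk PDF (href format is assumed).
        return f"[{label}]({quote(chunk_file_path)})"
    if document_id:
        # Document-level fallback; the query key follows the completion checklist.
        return f"[{label}](/rag-database?document={quote(document_id, safe='')})"
    # Neither identifier is available: leave the citation as plain text.
    return label


if __name__ == "__main__":
    print(citation_markdown("NEC4 ACC, page 3", "chunks/nec4_chunk_0.pdf", "doc-1"))
    print(citation_markdown("NEC4 ACC", None, "doc-1"))
    print(citation_markdown("NEC4 ACC", None, None))
```

The three branches line up with the three cases T5.2.4 tests: chunk path present, only `document_id` present, and both null.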

Task 5.2.5: Integration and regression testing

Verify all existing citation parser tests still pass: cd frontend && npx vitest run src/test/utils/citationParser.test.ts
Verify ResponsePanel tests still pass: npx vitest run src/test/components/ResponsePanel.test.tsx
Run full frontend test suite: npm test
Verify SSE streaming integration: query with a mix of PDF and DOCX documents, confirm citations are clickable

Dependency Graph

Phase 5.1 (Structured Output)
  Task 5.1.1 (add deps) ──┬── Task 5.1.2 (SubQuestions model) ── Task 5.1.3 (complete_structured)
                           │                                           │
                           │                                           ▼
                           │                              Task 5.1.4 (refactor decompose)
                           │                                           │
                           │                              Task 5.1.5 (update prompt template)
                           │                                           │
                           │                                           ▼
                           │                              Task 5.1.6 (integration tests)
                           │
Phase 5.2 (Citation Linking) — independent, can run in parallel with 5.1
  Task 5.2.1 (document_id in model) ──┬── Task 5.2.3 (fuzzy matching)
  Task 5.2.2 (DOCX/TXT PDF gen)    ──┤
                                      ├── Task 5.2.4 (fallback link)
                                      │
                                      ▼
                              Task 5.2.5 (integration tests)

Acceptance Criteria

Phase 5.1 Completion Checklist

LLMClient.complete_structured() returns validated SubQuestions Pydantic model — no json.JSONDecodeError possible
QueryDecomposer.decompose() never returns [] due to JSON parse failure
Fallback path (legacy json.loads()) logs a warning when triggered
Existing decompose tests pass (test_phase1_query_decomposer.py)
New structured output tests pass (test_phase5_*.py) — 33 tests
Spike test passed: Cantonese + English → valid sub-questions
SQLite seed templates updated and backfilled to all profiles
langchain and langchain-openai installed in venv (1.2.x)

Phase 5.2 Completion Checklist

SourceMetadata includes document_id in both backend and frontend types
~~DOCX/TXT ingestion generates per-chunk PDF files~~ → DEFERRED to Phase 5.3
citationParser.ts matches [NEC4 ACC] to source NEC4 ACC.pdf (fuzzy matching)
citationParser.ts renders fallback link to /rag-database?document=xxx when chunk_file_path is null but document_id exists
RAGDatabasePage auto-expands document from ?document= URL param
All existing citation parser tests pass (14 tests)
All existing ResponsePanel tests pass
generate_per_subq seed prompt tightened: "Copy the exact bracket labels shown"

Rollback Plan

If with_structured_output() causes issues in production:

The complete_structured() method wraps errors in LLMClientError — same exception type as existing complete()
QueryDecomposer.decompose() has a fallback to legacy complete() + json.loads() path
The _extract_json_from_markdown() function is preserved for backward compatibility
If LangChain is a complete failure, revert requirements.txt and llm_client.py changes (3 files), keeping the Pydantic model and improved logging

Commit Plan

Commit	Message	Scope
1	`feat: add LangChain deps and SubQuestions Pydantic model`	Tasks 5.1.1 + 5.1.2 + tests
2	`feat: add LLMClient.complete_structured() with LangChain`	Task 5.1.3 + tests
3	`feat: refactor QueryDecomposer to use structured output with fallback`	Task 5.1.4 + tests
4	`chore: update decompose seed prompt for structured output`	Task 5.1.5
5	`feat: add document_id to SourceMetadata model`	Task 5.2.1 + tests
6	`feat: generate PDFs for DOCX/TXT documents on ingest`	Task 5.2.2 + tests
7	`feat: fuzzy citation matching and document fallback links`	Tasks 5.2.3 + 5.2.4 + tests
