legco_ai_assistant/.plans/package5_enhancement_plan.md

Package 5 Enhancement Plan — Structured Output + Robust Citation Linking

Source: User request (2026-04-28)

Scope:

  • Phase 5.1: Replace manual JSON parsing in the decompose stage with LangChain with_structured_output()
  • Phase 5.2: Fix missing PDF links in citations and improve citation robustness

Status: Complete. Both phases implemented (2026-04-28)

LangChain version: 1.2.15 (venv), model_provider="openai" with OpenRouter base URL (API-compatible proxy).

Test results:

  • Backend: 115 passed, 0 failed (Phase 5.1 + Phase 5.2 + all integration/regression tests)
  • Frontend: 187 passed, 1 failed (pre-existing e2e test failure unrelated to these changes)

Objective

  1. Decompose structured output: Eliminate json.JSONDecodeError failures in QueryDecomposer.decompose() by integrating LangChain's with_structured_output() to enforce a Pydantic schema at the API level. The LLM response is guaranteed to be a valid SubQuestions object — no manual json.loads(), no regex markdown stripping, no silent failures.

  2. Robust citation linking: Fix the citation→PDF link pipeline so that:

    • document_id flows through to the frontend for fallback document-level links
    • chunk_file_path is always available (generate per-chunk PDFs for DOCX/TXT too, or provide a document-level PDF fallback)
    • Citation matching in citationParser.ts handles fuzzy filename matching (strips extensions, tolerates whitespace variations)
    • Frontend provides fallback "View Document" links when chunk-level PDF is unavailable

Decision Register

| # | Decision | Rationale |
|---|----------|-----------|
| 1 | Use LangChain with_structured_output() (not OpenAI response_format directly) | User explicitly chose Option B. Provides cleaner API, auto-retry on validation failure, and future flexibility for other pipeline stages (filter, generate). |
| 2 | Add langchain + langchain-openai to requirements.txt | Required dependencies for init_chat_model() and with_structured_output(). langchain ~0.3.x for stable API. |
| 3 | Define SubQuestions Pydantic model with questions: list[str] | LangChain's with_structured_output() requires a wrapper Pydantic model; bare list[str] is unsupported by provider-native schema enforcement. |
| 4 | Keep LLMClient as the central LLM access layer, add LangChain-based complete_structured() method | Minimizes refactoring. QueryDecomposer calls llm_client.complete_structured(prompt, SubQuestions) instead of llm_client.complete(prompt). Other callers (filter, generate) remain unchanged. |
| 5 | Run decomposition at temperature=0.0 (was 0.7) | Structured output benefits from deterministic behavior. Lower temperature = more reliable schema compliance. |
| 6 | Add document_id to SourceMetadata Pydantic model and frontend type | document_id is already stored in ChromaDB metadata (metadata.py:70) but is discarded during serialization. Adding it enables document-level fallback links. |
| 7 | Generate monolithic PDFs for DOCX/TXT documents (DEFERRED) | More complex than needed. Instead, use fallback document-level links via document_id when chunk_file_path is null. DOCX/TXT PDF generation deferred to Phase 5.3. |
| 8 | Fuzzy citation matching: strip extensions, trim whitespace | citationParser.ts currently requires exact filename match. LLM may shorten NEC4 ACC.pdf to NEC4 ACC in citations. |
| 9 | Fallback "View Document" link when chunk_file_path is null | Even after Decision #7, network failures or edge cases may leave null paths. The frontend should show a document-level PDF link as fallback. |
| 10 | Keep _extract_json_from_markdown() as a fallback for backward compatibility | During a transition period (or if with_structured_output() fails), the existing regex-based extraction serves as a safety net. Log a warning when fallback is used. |
| 11 | Add logger.warning for JSON parse failures before returning empty | The biggest blind spot today: JSON parse failures are silent. Log the raw LLM response (truncated) so operators can debug. |
| 12 | Keep QueryDecomposer.decompose() return type as Tuple[List[str], str] | Existing callers unpack the tuple. Adding `Tuple[List[str], str, SubQuestions |
| 13 | Spike-test LangChain structured output with OpenRouter BEFORE implementation | 2-minute test calling init_chat_model().with_structured_output().ainvoke() through OpenRouter to confirm response_format={"type": "json_schema"} is proxied correctly. If not, fall back to method="function_calling". |
| 14 | Tighten generate_per_subq prompt alongside frontend fuzzy matching | Add "Copy the exact bracket labels shown in the document chunks — do not modify filenames or add/remove extensions." to seed template. Two-layer defense: prompt reduces hallucinations + fuzzy matching catches remaining cases. No separate task; folded into Task 5.2.3. |

Phase 5.1 — Structured Output for Decompose

Test Files (write BEFORE implementation)

| # | Test File | Coverage |
|---|-----------|----------|
| T5.1.1 | backend/app/test/test_phase5_llm_client_structured.py | LLMClient.complete_structured() with mock LangChain model. Tests: valid Pydantic return, validation error → retry, empty questions list, non-JSON fallback. |
| T5.1.2 | backend/app/test/test_phase5_query_decomposer_structured.py | QueryDecomposer.decompose() using MockLLMClient.complete_structured(). Tests: valid SubQuestions, empty questions, LLM error fallback, prompt service integration. |
| T5.1.3 | backend/app/test/test_phase5_subquestions_model.py | SubQuestions Pydantic model validation. Tests: valid input, empty list, too many questions, non-string items rejected. |
| T5.1.4 | backend/app/test/test_phase5_decompose_logging.py | Verify logger.warning is emitted when JSON parse fallback is triggered (backward-compat path). |

Acceptance Tests

| # | Test File | Coverage |
|---|-----------|----------|
| AT5.1.1 | backend/app/test/acceptance/test_acceptance_phase5_structured_decompose.py | Real LLM call with structured output. Tests: Cantonese question → valid sub-questions, English question → valid sub-questions, very short question → 1 sub-question, very long question → ≤5 sub-questions. |

Implementation Tasks

Task 5.1.1: Add LangChain dependencies

  • Add langchain>=0.3.0,<0.4.0 and langchain-openai>=0.3.0,<0.4.0 to backend/requirements.txt
  • Run pip install -r backend/requirements.txt in dev venv
  • Test file: test_phase5_subquestions_model.py (can run immediately after install)

Task 5.1.2: Define SubQuestions Pydantic model

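  • Create backend/app/models/decompose.py with:
    from pydantic import BaseModel, Field

    class SubQuestions(BaseModel):
        # Wrapper model: structured output needs an object schema, not a bare list.
        questions: list[str] = Field(
            description="2-5 simplified sub-questions, each focused on one aspect",
            min_length=1,
            max_length=5,
        )
    
  • The max_length=5 constraint matches the upper bound of the decompose prompt's "2-5"; min_length=1 is deliberately looser than the prompt so that a very short question can yield a single sub-question (see AT5.1.1)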
  • Test file: test_phase5_subquestions_model.py
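
The behaviour of these constraints can be sketched as follows. This assumes Pydantic v2, where min_length and max_length apply to list fields; the model is repeated so the snippet runs on its own, and the is_valid helper and sample questions are illustrative only.

```python
from pydantic import BaseModel, Field, ValidationError


class SubQuestions(BaseModel):
    questions: list[str] = Field(
        description="2-5 simplified sub-questions, each focused on one aspect",
        min_length=1,
        max_length=5,
    )


def is_valid(questions: list[str]) -> bool:
    """Return True when the list satisfies the SubQuestions schema (hypothetical helper)."""
    try:
        SubQuestions(questions=questions)
        return True
    except ValidationError:
        return False


# One question is accepted, matching the "very short question" acceptance case.
print(is_valid(["What is the notice period?"]))
# An empty list violates min_length=1.
print(is_valid([]))
# Six questions violate max_length=5.
print(is_valid([f"Question {i}?" for i in range(6)]))
```

These three cases correspond to the "valid input", "empty list", and "too many questions" checks listed for T5.1.3.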

Task 5.1.3: Add complete_structured() method to LLMClient

  • In llm_client.py, import init_chat_model from langchain.chat_models
  • Add self._langchain_model attribute (lazy-init from settings)
  • Add async complete_structured(prompt, pydantic_model, step_name) -> BaseModel method:
    1. Calls self._langchain_model.with_structured_output(pydantic_model, method="json_schema").ainvoke(prompt)
    2. Returns the validated Pydantic model instance
    3. Logs timing (same pattern as existing complete())
    4. Wraps errors in LLMClientError
  • Use temperature=0.0 via model config for structured calls
  • Test file: test_phase5_llm_client_structured.py
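
The method described above could look roughly like the sketch below. It is not the project's actual llm_client.py: FakeChatModel and FakeStructuredRunnable are hypothetical stand-ins that mimic the with_structured_output(...).ainvoke(...) call chain so the snippet runs without LangChain or network access, and the log format is an assumption.

```python
import asyncio
import logging
import time
from dataclasses import dataclass
from typing import Any

logger = logging.getLogger("llm_client")


class LLMClientError(Exception):
    """Stand-in for the project's existing exception type."""


@dataclass
class FakeSubQuestions:
    """Stand-in for the SubQuestions Pydantic model."""
    questions: list


class FakeStructuredRunnable:
    async def ainvoke(self, prompt: str) -> FakeSubQuestions:
        return FakeSubQuestions(questions=[f"Sub-question about: {prompt}"])


class FakeChatModel:
    """Mimics the LangChain calls used below; a real model would come
    from init_chat_model(...) configured with temperature=0.0."""

    def with_structured_output(self, schema: Any, method: str = "json_schema"):
        return FakeStructuredRunnable()


class LLMClient:
    def __init__(self, model: Any = None):
        self._langchain_model = model  # lazily created from settings in the real client

    async def complete_structured(
        self, prompt: str, pydantic_model: Any, step_name: str = "LLM"
    ) -> Any:
        started = time.perf_counter()
        try:
            runnable = self._langchain_model.with_structured_output(
                pydantic_model, method="json_schema"
            )
            result = await runnable.ainvoke(prompt)
        except Exception as exc:
            # Callers only ever need to catch the client's own error type.
            raise LLMClientError(f"{step_name}: structured call failed: {exc}") from exc
        logger.info("%s structured call took %.2fs", step_name,
                    time.perf_counter() - started)
        return result


result = asyncio.run(
    LLMClient(FakeChatModel()).complete_structured(
        "contract termination", FakeSubQuestions, step_name="QueryDecomposer"
    )
)
print(result.questions)
```

Wrapping every failure in LLMClientError keeps the error contract identical to complete(), which is what the rollback plan relies on.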

Task 5.1.4: Refactor QueryDecomposer.decompose() to use structured output

  • Change decompose() to call self.llm_client.complete_structured(prompt, SubQuestions, step_name="QueryDecomposer")
  • Add fallback path: if complete_structured() raises → log warning → attempt legacy complete() + json.loads() → if that works, log info "structured output failed, fallback succeeded"
  • Add logger.warning("Decompose JSON parse failed, raw response (first 500 chars): %s", response[:500]) when both paths fail
  • Keep return type Tuple[List[str], str] unchanged
  • Keep _extract_json_from_markdown() for backward-compat fallback path
  • Test file: test_phase5_query_decomposer_structured.py and test_phase5_decompose_logging.py
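
The two-step fallback in this task can be sketched as below. Several details are assumptions: the dataclass stands in for the Pydantic model, extract_json_from_markdown is an illustrative regex and not the project's _extract_json_from_markdown(), and the second tuple element is assumed to be the prompt text because the plan does not say what it holds.

```python
import asyncio
import json
import logging
import re
from dataclasses import dataclass
from typing import List, Tuple

logger = logging.getLogger("query_decomposer")
FENCE = "`" * 3  # markdown code fence, built indirectly to keep this snippet readable


@dataclass
class SubQuestions:
    """Stand-in for the Pydantic model from Task 5.1.2."""
    questions: List[str]


def extract_json_from_markdown(text: str) -> str:
    """Strip a fenced json wrapper if present (illustrative regex)."""
    match = re.search(FENCE + r"(?:json)?\s*(.*?)" + FENCE, text, re.DOTALL)
    return match.group(1).strip() if match else text.strip()


async def decompose(llm_client, prompt: str) -> Tuple[List[str], str]:
    # Preferred path: schema-validated structured output.
    try:
        result = await llm_client.complete_structured(
            prompt, SubQuestions, step_name="QueryDecomposer"
        )
        return result.questions, prompt
    except Exception as exc:
        logger.warning("Structured decompose failed, trying legacy path: %s", exc)

    # Legacy path: free-text completion parsed as a JSON array of strings.
    response = ""
    try:
        response = await llm_client.complete(prompt)
        questions = json.loads(extract_json_from_markdown(response))
        if not isinstance(questions, list):
            raise ValueError("expected a JSON array")
        logger.info("structured output failed, fallback succeeded")
        return [str(q) for q in questions], prompt
    except Exception:
        logger.warning(
            "Decompose JSON parse failed, raw response (first 500 chars): %s",
            response[:500],
        )
        return [], prompt


class FlakyClient:
    """Test double: structured call fails, legacy call returns fenced JSON."""

    async def complete_structured(self, prompt, model, step_name=""):
        raise RuntimeError("schema not supported by proxy")

    async def complete(self, prompt):
        return f'{FENCE}json\n["What is X?", "What is Y?"]\n{FENCE}'


questions, _ = asyncio.run(decompose(FlakyClient(), "Explain X and Y"))
print(questions)
```

Only the final failure returns an empty list, and it always logs the truncated raw response first, so a silent empty result is no longer possible.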

Task 5.1.5: Update prompt template for structured output

  • Update _SEED_DECOMPOSE in sqlite_db.py to instruct the LLM about the expected structure
  • New seed prompt: mention that output will be validated against a schema — more explicit about JSON array of strings requirement
  • Run seed_default_profiles() to backfill existing profiles
  • Test file: Existing test_phase3_prompt_service.py should continue to pass

Task 5.1.6: Integration test — end-to-end query pipeline

  • Verify existing integration tests still pass (test_integration_phase1.py, test_phase4_integration_query_pipeline.py)
  • Verify acceptance test passes with real LLM (test_acceptance_phase1_rag_query.py)
  • Run full test suite: cd backend && pytest app/test/test_phase5*.py app/test/test_phase4*.py app/test/test_phase3*.py -v

Phase 5.2 — Robust Citation Linking

Test Files (write BEFORE implementation)

| # | Test File | Coverage |
|---|-----------|----------|
| T5.2.1 | backend/app/test/test_phase5_source_metadata.py | SourceMetadata model with document_id. Tests: serialization includes document_id, backward compat (old data without document_id). |
| T5.2.2 | backend/app/test/test_phase5_docx_pdf_generation.py | DOCX/TXT ingestion now sets chunk_file_path. Tests: DOCX ingestion produces chunk PDFs, TXT ingestion produces chunk PDFs, PDF generation errors are handled gracefully. |
| T5.2.3 | frontend/src/test/utils/test_phase5_citation_parser_fuzzy.test.ts | Fuzzy citation matching. Tests: citation [NEC4 ACC] matches source NEC4 ACC.pdf, citation [nec4 acc.pdf, page 3] matches after whitespace trim, citation [NEC4 ACC.PDF] matches case-insensitively, fallback "View Document" link shown when chunk_file_path is null. |
| T5.2.4 | frontend/src/test/utils/test_phase5_citation_fallback_link.test.ts | Fallback document link rendering. Tests: chunk with chunk_file_path: null but document_id present → renders "View Document" link, chunk with both null → remains plain text, chunk with chunk_file_path → renders page-level PDF link. |

Acceptance Tests

| # | Test File | Coverage |
|---|-----------|----------|
| AT5.2.1 | backend/app/test/acceptance/test_acceptance_phase5_citation_links.py | Real LLM query with DOCX and PDF documents. Verify citations in the answer are clickable in the SSE response (sources include document_id and chunk_file_path). |

Implementation Tasks

Task 5.2.1: Add document_id to SourceMetadata model

  • In backend/app/models/common.py, add document_id: Optional[str] = None to SourceMetadata
  • In backend/app/routers/query.py lines 310-319, include document_id=meta.get("document_id") when building SourceMetadata objects
  • In frontend/src/types/index.ts, add document_id: string | null to SourceMetadata interface
  • Test file: test_phase5_source_metadata.py
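
A minimal sketch of the backend half of this change, assuming Pydantic v2. Only document_id and chunk_file_path are named in this plan; the filename field, the metadata keys, and the build_source helper are hypothetical and stand in for whatever query.py actually does.

```python
from typing import Optional

from pydantic import BaseModel


class SourceMetadata(BaseModel):
    filename: str                          # assumed existing field
    chunk_file_path: Optional[str] = None  # named in the plan; type assumed
    document_id: Optional[str] = None      # new in Task 5.2.1


def build_source(meta: dict) -> SourceMetadata:
    """Build the API model from a vector-store metadata dict (hypothetical helper)."""
    return SourceMetadata(
        filename=meta.get("filename", ""),
        chunk_file_path=meta.get("chunk_file_path"),
        # .get() returns None for chunks ingested before document_id existed.
        document_id=meta.get("document_id"),
    )


new_chunk = build_source({"filename": "NEC4 ACC.pdf", "document_id": "doc-123"})
old_chunk = build_source({"filename": "legacy.docx"})

print(new_chunk.model_dump())
print(old_chunk.document_id)
```

Defaulting the field to None is what keeps old metadata deserializable, which is the backward-compatibility case in T5.2.1.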

Task 5.2.2: Generate PDFs for DOCX/TXT documents during ingestion (DEFERRED to Phase 5.3, see Decision 7)

  • Add reportlab to backend/requirements.txt (lightweight, pure Python PDF generation, no external binaries)
  • In backend/app/routers/ingest.py DOCX and TXT branches, add PDF generation logic:
    1. After chunking, generate a single PDF from the full text (one page per chunk)
    2. Store chunk_filename = f"{stem}_chunk_{idx}.pdf" for each chunk
    3. Set chunk_file_paths list and pass to extract_metadata()
  • Add error handling: if PDF generation fails, chunk_file_path stays None (graceful degradation)
  • Use logger.warning on generation failure
  • Test file: test_phase5_docx_pdf_generation.py

Task 5.2.3: Improve citationParser.ts with fuzzy matching

  • Add extension-stripping helper: stripExtension(filename: string): string — removes .pdf, .docx, .txt
  • Modify buildCitationLookup() to register both filename and stripExtension(filename) as lookup keys
  • Add trim-whitespace normalization on citation text before lookup
  • Add test for LLM-common variations: NEC4 ACC.pdf vs NEC4 ACC vs NEC4_acc.pdf
  • Test file: test_phase5_citation_parser_fuzzy.test.ts
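
The matching rules above can be sketched in a few lines. The real code lives in citationParser.ts; this Python version only illustrates the logic, and the function names, the whitespace-collapsing step, and the lowercase comparison are assumptions drawn from the T5.2.3 test descriptions.

```python
import re
from typing import Dict, Iterable, Optional

KNOWN_EXTENSIONS = (".pdf", ".docx", ".txt")


def strip_extension(filename: str) -> str:
    """Remove a trailing .pdf/.docx/.txt, ignoring case."""
    lowered = filename.lower()
    for ext in KNOWN_EXTENSIONS:
        if lowered.endswith(ext):
            return filename[: -len(ext)]
    return filename


def normalize(label: str) -> str:
    """Trim, collapse inner whitespace, and lowercase for comparison."""
    return re.sub(r"\s+", " ", label.strip()).lower()


def build_citation_lookup(filenames: Iterable[str]) -> Dict[str, str]:
    """Register each source under both its full name and its extension-less name."""
    lookup: Dict[str, str] = {}
    for name in filenames:
        lookup[normalize(name)] = name
        lookup[normalize(strip_extension(name))] = name
    return lookup


def match_citation(citation: str, lookup: Dict[str, str]) -> Optional[str]:
    """Return the source filename a citation label refers to, if any."""
    return lookup.get(normalize(citation))


lookup = build_citation_lookup(["NEC4 ACC.pdf"])
print(match_citation("NEC4 ACC", lookup))
print(match_citation("  nec4   acc.PDF ", lookup))
print(match_citation("NEC4_acc.pdf", lookup))
```

In this sketch the underscore variant does not match, because only whitespace, case, and extensions are normalized. Two sources that differ only by extension would also collide on the extension-less key, so the last one registered wins.
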

Task 5.2.4: Fallback document link when chunk_file_path is null

  • In citationParser.ts replaceCitationPatterns(), when source?.chunk_file_path is null but source?.document_id exists:
    1. Build a URL to the document chunk list page: /rag-database?document_id=${source.document_id}
    2. Return [${trimmed}](${url}) with a different CSS class (e.g., text-green-600 for document-level vs text-blue-600 for page-level)
  • In ResponsePanel.tsx, update CitationLink component to accept a variant prop for visual differentiation
  • Test file: test_phase5_citation_fallback_link.test.ts
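
The three rendering outcomes can be summarized in a short sketch. It is written in Python for consistency with the other examples, although the real logic belongs in citationParser.ts. Using chunk_file_path directly as the link target and the document_id query parameter name (taken from the task text) are both assumptions, and CSS classes are omitted.

```python
from typing import Optional
from urllib.parse import quote


def render_citation(
    label: str, chunk_file_path: Optional[str], document_id: Optional[str]
) -> str:
    """Return markdown for one citation label (hypothetical helper)."""
    trimmed = label.strip()
    if chunk_file_path:
        # Page-level link: the chunk PDF is available.
        return f"[{trimmed}]({chunk_file_path})"
    if document_id:
        # Document-level fallback: link to the document's chunk list page.
        return f"[{trimmed}](/rag-database?document_id={quote(document_id)})"
    # Nothing to link to: leave the citation as plain bracketed text.
    return f"[{trimmed}]"


print(render_citation("NEC4 ACC, page 3", "/chunks/nec4_acc_p3.pdf", "doc-1"))
print(render_citation(" NEC4 ACC ", None, "doc-1"))
print(render_citation("NEC4 ACC", None, None))
```

Checking chunk_file_path first means the page-level link always wins when both values are present, matching the three cases listed for T5.2.4.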

Task 5.2.5: Integration and regression testing

  • Verify all existing citation parser tests still pass: cd frontend && npx vitest run src/test/utils/citationParser.test.ts
  • Verify ResponsePanel tests still pass: npx vitest run src/test/components/ResponsePanel.test.tsx
  • Run full frontend test suite: npm test
  • Verify SSE streaming integration: query with a mix of PDF and DOCX documents, confirm citations are clickable

Dependency Graph

Phase 5.1 (Structured Output)
  Task 5.1.1 (add deps) ──┬── Task 5.1.2 (SubQuestions model) ── Task 5.1.3 (complete_structured)
                           │                                           │
                           │                                           ▼
                           │                              Task 5.1.4 (refactor decompose)
                           │                                           │
                           │                              Task 5.1.5 (update prompt template)
                           │                                           │
                           │                                           ▼
                           │                              Task 5.1.6 (integration tests)
                           │
Phase 5.2 (Citation Linking) — independent, can run in parallel with 5.1
  Task 5.2.1 (document_id in model) ──┬── Task 5.2.3 (fuzzy matching)
  Task 5.2.2 (DOCX/TXT PDF gen)    ──┤
                                      ├── Task 5.2.4 (fallback link)
                                      │
                                      ▼
                              Task 5.2.5 (integration tests)

Acceptance Criteria

Phase 5.1 Completion Checklist

  • LLMClient.complete_structured() returns validated SubQuestions Pydantic model — no json.JSONDecodeError possible
  • QueryDecomposer.decompose() never returns [] due to JSON parse failure
  • Fallback path (legacy json.loads()) logs a warning when triggered
  • Existing decompose tests pass (test_phase1_query_decomposer.py)
  • New structured output tests pass (test_phase5_*.py) — 33 tests
  • Spike test passed: Cantonese + English → valid sub-questions
  • SQLite seed templates updated and backfilled to all profiles
  • langchain and langchain-openai installed in venv (1.2.x)

Phase 5.2 Completion Checklist

  • SourceMetadata includes document_id in both backend and frontend types
  • DOCX/TXT ingestion generates per-chunk PDF files: DEFERRED to Phase 5.3
  • citationParser.ts matches [NEC4 ACC] to source NEC4 ACC.pdf (fuzzy matching)
  • citationParser.ts renders fallback link to /rag-database?document=xxx when chunk_file_path is null but document_id exists
  • RAGDatabasePage auto-expands document from ?document= URL param
  • All existing citation parser tests pass (14 tests)
  • All existing ResponsePanel tests pass
  • generate_per_subq seed prompt tightened: "Copy the exact bracket labels shown"

Rollback Plan

If with_structured_output() causes issues in production:

  1. The complete_structured() method wraps errors in LLMClientError — same exception type as existing complete()
  2. QueryDecomposer.decompose() has a fallback to legacy complete() + json.loads() path
  3. The _extract_json_from_markdown() function is preserved for backward compatibility
  4. If LangChain is a complete failure, revert requirements.txt and llm_client.py changes (3 files), keeping the Pydantic model and improved logging

Commit Plan

| Commit | Message | Scope |
|--------|---------|-------|
| 1 | feat: add LangChain deps and SubQuestions Pydantic model | Tasks 5.1.1 + 5.1.2 + tests |
| 2 | feat: add LLMClient.complete_structured() with LangChain | Task 5.1.3 + tests |
| 3 | feat: refactor QueryDecomposer to use structured output with fallback | Task 5.1.4 + tests |
| 4 | chore: update decompose seed prompt for structured output | Task 5.1.5 |
| 5 | feat: add document_id to SourceMetadata model | Task 5.2.1 + tests |
| 6 | feat: generate PDFs for DOCX/TXT documents on ingest | Task 5.2.2 + tests |
| 7 | feat: fuzzy citation matching and document fallback links | Tasks 5.2.3 + 5.2.4 + tests |