Package 5 Enhancement Plan — Structured Output + Robust Citation Linking

Source: User request (2026-04-28)

Scope:

  • Phase 5.1: Replace manual JSON parsing in the decompose stage with LangChain with_structured_output()
  • Phase 5.2: Fix missing PDF links in citations and improve citation robustness

Status: Phases 5.1 and 5.2 complete; 5.3 Deferred; 5.4 Planned (2026-04-28)

LangChain version: 1.2.15 (venv), model_provider="openai" with OpenRouter base URL (API-compatible proxy).

Test results:

  • Backend: 115 passed, 0 failed (Phase 5.1 + Phase 5.2 + all integration/regression tests)
  • Frontend: 187 passed, 1 failed (pre-existing e2e test failure unrelated to these changes)

Objective

  1. Decompose structured output: Eliminate json.JSONDecodeError failures in QueryDecomposer.decompose() by integrating LangChain's with_structured_output() to enforce a Pydantic schema at the API level. The LLM response is guaranteed to be a valid SubQuestions object — no manual json.loads(), no regex markdown stripping, no silent failures.

  2. Robust citation linking: Fix the citation→PDF link pipeline so that:

    • document_id flows through to the frontend for fallback document-level links
    • chunk_file_path is always available (generate per-chunk PDFs for DOCX/TXT too, or provide a document-level PDF fallback)
    • Citation matching in citationParser.ts handles fuzzy filename matching (strips extensions, tolerates whitespace variations)
    • Frontend provides fallback "View Document" links when chunk-level PDF is unavailable

Decision Register

| # | Decision | Rationale |
|---|----------|-----------|
| 1 | Use LangChain with_structured_output() (not OpenAI response_format directly) | User explicitly chose Option B. Provides cleaner API, auto-retry on validation failure, and future flexibility for other pipeline stages (filter, generate). |
| 2 | Add langchain + langchain-openai to requirements.txt | Required dependencies for init_chat_model() and with_structured_output(). langchain ~0.3.x for stable API. |
| 3 | Define SubQuestions Pydantic model with questions: list[str] | LangChain's with_structured_output() requires a wrapper Pydantic model; bare list[str] is unsupported by provider-native schema enforcement. |
| 4 | Keep LLMClient as the central LLM access layer, add LangChain-based complete_structured() method | Minimizes refactoring. QueryDecomposer calls llm_client.complete_structured(prompt, SubQuestions) instead of llm_client.complete(prompt). Other callers (filter, generate) remain unchanged. |
| 5 | Run decomposition at temperature=0.0 (was 0.7) | Structured output benefits from deterministic behavior. Lower temperature = more reliable schema compliance. |
| 6 | Add document_id to SourceMetadata Pydantic model and frontend type | document_id is already stored in ChromaDB metadata (metadata.py:70) but is discarded during serialization. Adding it enables document-level fallback links. |
| 7 | Generate monolithic PDFs for DOCX/TXT documents (DEFERRED) | More complex than needed. Instead, use fallback document-level links via document_id when chunk_file_path is null. DOCX/TXT PDF generation deferred to Phase 5.3. |
| 8 | Fuzzy citation matching: strip extensions, trim whitespace | citationParser.ts currently requires exact filename match. LLM may shorten NEC4 ACC.pdf to NEC4 ACC in citations. |
| 9 | Fallback "View Document" link when chunk_file_path is null | Even after Decision #7, network failures or edge cases may leave null paths. The frontend should show a document-level PDF link as fallback. |
| 10 | Keep _extract_json_from_markdown() as a fallback for backward compatibility | During a transition period (or if with_structured_output() fails), the existing regex-based extraction serves as a safety net. Log a warning when fallback is used. |
| 11 | Add logger.warning for JSON parse failures before returning empty | The biggest blind spot today: JSON parse failures are silent. Log the raw LLM response (truncated) so operators can debug. |
| 12 | Keep QueryDecomposer.decompose() return type as Tuple[List[str], str] | Existing callers unpack the tuple. Adding `Tuple[List[str], str, SubQuestions |
| 13 | Spike-test LangChain structured output with OpenRouter BEFORE implementation | 2-minute test calling init_chat_model().with_structured_output().ainvoke() through OpenRouter to confirm response_format={"type": "json_schema"} is proxied correctly. If not, fall back to method="function_calling". |
| 14 | Tighten generate_per_subq prompt alongside frontend fuzzy matching | Add "Copy the exact bracket labels shown in the document chunks — do not modify filenames or add/remove extensions." to seed template. Two-layer defense: prompt reduces hallucinations + fuzzy matching catches remaining cases. No separate task; folded into Task 5.2.3. |

Phase 5.1 — Structured Output for Decompose

Test Files (write BEFORE implementation)

| # | Test File | Coverage |
|---|-----------|----------|
| T5.1.1 | backend/app/test/test_phase5_llm_client_structured.py | LLMClient.complete_structured() with mock LangChain model. Tests: valid Pydantic return, validation error → retry, empty questions list, non-JSON fallback. |
| T5.1.2 | backend/app/test/test_phase5_query_decomposer_structured.py | QueryDecomposer.decompose() using MockLLMClient.complete_structured(). Tests: valid SubQuestions, empty questions, LLM error fallback, prompt service integration. |
| T5.1.3 | backend/app/test/test_phase5_subquestions_model.py | SubQuestions Pydantic model validation. Tests: valid input, empty list, too many questions, non-string items rejected. |
| T5.1.4 | backend/app/test/test_phase5_decompose_logging.py | Verify logger.warning is emitted when JSON parse fallback is triggered (backward-compat path). |

Acceptance Tests

| # | Test File | Coverage |
|---|-----------|----------|
| AT5.1.1 | backend/app/test/acceptance/test_acceptance_phase5_structured_decompose.py | Real LLM call with structured output. Tests: Cantonese question → valid sub-questions, English question → valid sub-questions, very short question → 1 sub-question, very long question → ≤5 sub-questions. |

Implementation Tasks

Task 5.1.1: Add LangChain dependencies

  • Add langchain>=0.3.0,<0.4.0 and langchain-openai>=0.3.0,<0.4.0 to backend/requirements.txt
  • Run pip install -r backend/requirements.txt in dev venv
  • Test file: test_phase5_subquestions_model.py (can run immediately after install)

Task 5.1.2: Define SubQuestions Pydantic model

  • Create backend/app/models/decompose.py with:
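    from pydantic import BaseModel, Field


    class SubQuestions(BaseModel):
        """Schema enforced on the decompose stage's LLM output."""

        questions: list[str] = Field(
            description="2-5 simplified sub-questions, each focused on one aspect",
            min_length=1,
            max_length=5,
        )
    
  • Add min_length=1 and max_length=5 Pydantic constraints (max_length=5 matches the upper bound of the decompose prompt's "2-5"; min_length=1 still allows the single sub-question expected for a very short question)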
  • Test file: test_phase5_subquestions_model.py

Task 5.1.3: Add complete_structured() method to LLMClient

  • In llm_client.py, import init_chat_model from langchain.chat_models
  • Add self._langchain_model attribute (lazy-init from settings)
  • Add async complete_structured(prompt, pydantic_model, step_name) -> BaseModel method:
    1. Calls self._langchain_model.with_structured_output(pydantic_model, method="json_schema").ainvoke(prompt)
    2. Returns the validated Pydantic model instance
    3. Logs timing (same pattern as existing complete())
    4. Wraps errors in LLMClientError
  • Use temperature=0.0 via model config for structured calls
  • Test file: test_phase5_llm_client_structured.py
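
The four steps of complete_structured() can be sketched as below. This is a minimal sketch, not the project's code: StructuredLLMClient is a hypothetical slice of LLMClient, LLMClientError is a stand-in for the existing exception class, and the model is injected so that any object exposing with_structured_output(...).ainvoke(...) works (which is also how a mock model can be used in T5.1.1).

```python
import logging
import time
from typing import Any

logger = logging.getLogger("llm_client_sketch")


class LLMClientError(Exception):
    """Stand-in for the project's existing LLMClientError."""


class StructuredLLMClient:
    """Hypothetical slice of LLMClient showing only complete_structured()."""

    def __init__(self, langchain_model: Any) -> None:
        # Assumption: the real client lazily builds this model from settings
        # with temperature=0.0; here it is simply passed in.
        self._langchain_model = langchain_model

    async def complete_structured(
        self, prompt: str, pydantic_model: type, step_name: str = "LLM"
    ) -> Any:
        started = time.perf_counter()
        try:
            runnable = self._langchain_model.with_structured_output(
                pydantic_model, method="json_schema"
            )
            # On success this is an instance of pydantic_model.
            return await runnable.ainvoke(prompt)
        except Exception as exc:
            # Same exception type as complete(), so callers need one handler.
            raise LLMClientError(f"{step_name}: structured call failed: {exc}") from exc
        finally:
            # Timing is logged on both the success and the failure path.
            logger.info("%s took %.2fs", step_name, time.perf_counter() - started)
```

Injecting the model keeps the method testable without network access; the production client would pass the object returned by init_chat_model().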

Task 5.1.4: Refactor QueryDecomposer.decompose() to use structured output

  • Change decompose() to call self.llm_client.complete_structured(prompt, SubQuestions, step_name="QueryDecomposer")
  • Add fallback path: if complete_structured() raises → log warning → attempt legacy complete() + json.loads() → if that works, log info "structured output failed, fallback succeeded"
  • Add logger.warning("Decompose JSON parse failed, raw response (first 500 chars): %s", response[:500]) when both paths fail
  • Keep return type Tuple[List[str], str] unchanged
  • Keep _extract_json_from_markdown() for backward-compat fallback path
  • Test file: test_phase5_query_decomposer_structured.py and test_phase5_decompose_logging.py
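
The fallback order described in this task can be sketched as follows. The sketch makes several assumptions that the plan does not confirm: SubQuestions is replaced by a stdlib dataclass, extract_json_from_markdown() is a hypothetical equivalent of _extract_json_from_markdown(), the second tuple element is treated as the raw (or serialized) response text, and an empty list is returned only after both paths have failed and been logged.

```python
import json
import logging
import re
from dataclasses import dataclass
from typing import List, Tuple

logger = logging.getLogger("query_decomposer_sketch")

# Matches a Markdown code fence (three backticks), optionally tagged "json".
_FENCED_JSON = re.compile(r"`{3}(?:json)?\s*(.*?)`{3}", re.DOTALL)


@dataclass
class SubQuestions:
    """Stand-in for the Pydantic SubQuestions model from Task 5.1.2."""
    questions: List[str]


def extract_json_from_markdown(text: str) -> str:
    """Hypothetical equivalent of _extract_json_from_markdown()."""
    match = _FENCED_JSON.search(text)
    return (match.group(1) if match else text).strip()


async def decompose(llm_client, prompt: str) -> Tuple[List[str], str]:
    # Path 1: schema-enforced structured output.
    try:
        result = await llm_client.complete_structured(
            prompt, SubQuestions, step_name="QueryDecomposer"
        )
        return list(result.questions), json.dumps(result.questions)
    except Exception as exc:
        logger.warning("Structured decompose failed, trying legacy path: %s", exc)

    # Path 2: legacy complete() + json.loads().
    response = ""
    try:
        response = await llm_client.complete(prompt)
        questions = json.loads(extract_json_from_markdown(response))
        if not (isinstance(questions, list) and all(isinstance(q, str) for q in questions)):
            raise ValueError("expected a JSON array of strings")
        logger.info("structured output failed, fallback succeeded")
        return questions, response
    except Exception:
        logger.warning(
            "Decompose JSON parse failed, raw response (first 500 chars): %s",
            response[:500],
        )
        return [], response
```

The return type stays Tuple[List[str], str] on every path, so existing callers that unpack the tuple are unaffected.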

Task 5.1.5: Update prompt template for structured output

  • Update _SEED_DECOMPOSE in sqlite_db.py to instruct the LLM about the expected structure
  • New seed prompt: mention that output will be validated against a schema — more explicit about JSON array of strings requirement
  • Run seed_default_profiles() to backfill existing profiles
  • Test file: Existing test_phase3_prompt_service.py should continue to pass

Task 5.1.6: Integration test — end-to-end query pipeline

  • Verify existing integration tests still pass (test_integration_phase1.py, test_phase4_integration_query_pipeline.py)
  • Verify acceptance test passes with real LLM (test_acceptance_phase1_rag_query.py)
  • Run full test suite: cd backend && pytest app/test/test_phase5*.py app/test/test_phase4*.py app/test/test_phase3*.py -v

Phase 5.2 — Robust Citation Linking

Test Files (write BEFORE implementation)

| # | Test File | Coverage |
|---|-----------|----------|
| T5.2.1 | backend/app/test/test_phase5_source_metadata.py | SourceMetadata model with document_id. Tests: serialization includes document_id, backward compat (old data without document_id). |
| T5.2.2 | backend/app/test/test_phase5_docx_pdf_generation.py | DOCX/TXT ingestion now sets chunk_file_path. Tests: DOCX ingestion produces chunk PDFs, TXT ingestion produces chunk PDFs, PDF generation errors are handled gracefully. |
| T5.2.3 | frontend/src/test/utils/test_phase5_citation_parser_fuzzy.test.ts | Fuzzy citation matching. Tests: citation [NEC4 ACC] matches source NEC4 ACC.pdf, citation [nec4 acc.pdf, page 3] matches after whitespace trim, citation [NEC4 ACC.PDF] matches case-insensitively, fallback "View Document" link shown when chunk_file_path is null. |
| T5.2.4 | frontend/src/test/utils/test_phase5_citation_fallback_link.test.ts | Fallback document link rendering. Tests: chunk with chunk_file_path: null but document_id present → renders "View Document" link, chunk with both null → remains plain text, chunk with chunk_file_path → renders page-level PDF link. |

Acceptance Tests

| # | Test File | Coverage |
|---|-----------|----------|
| AT5.2.1 | backend/app/test/acceptance/test_acceptance_phase5_citation_links.py | Real LLM query with DOCX and PDF documents. Verify citations in the answer are clickable in the SSE response (sources include document_id and chunk_file_path). |

Implementation Tasks

Task 5.2.1: Add document_id to SourceMetadata model

  • In backend/app/models/common.py, add document_id: Optional[str] = None to SourceMetadata
  • In backend/app/routers/query.py lines 310-319, include document_id=meta.get("document_id") when building SourceMetadata objects
  • In frontend/src/types/index.ts, add document_id: string | null to SourceMetadata interface
  • Test file: test_phase5_source_metadata.py

Task 5.2.2: Generate PDFs for DOCX/TXT documents during ingestion

  • Add reportlab to backend/requirements.txt (lightweight, pure Python PDF generation, no external binaries)
  • In backend/app/routers/ingest.py DOCX and TXT branches, add PDF generation logic:
    1. After chunking, generate a single PDF from the full text (one page per chunk)
    2. Store chunk_filename = f"{stem}_chunk_{idx}.pdf" for each chunk
    3. Set chunk_file_paths list and pass to extract_metadata()
  • Add error handling: if PDF generation fails, chunk_file_path stays None (graceful degradation)
  • Use logger.warning on generation failure
  • Test file: test_phase5_docx_pdf_generation.py
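
The graceful-degradation rule for this task can be sketched independently of the PDF library. In the sketch below, generate_chunk_pdfs() and the injected render_pdf callable are hypothetical names; the real rendering (reportlab, per this plan) would be supplied as render_pdf. Only the naming scheme {stem}_chunk_{idx}.pdf and the None-on-failure behavior come from the plan.

```python
import logging
from pathlib import Path
from typing import Callable, List, Optional

logger = logging.getLogger("ingest_sketch")


def generate_chunk_pdfs(
    source_name: str,
    chunks: List[str],
    out_dir: Path,
    render_pdf: Callable[[str, Path], None],
) -> List[Optional[str]]:
    """Return one path per chunk; None marks a chunk whose PDF could not be written."""
    stem = Path(source_name).stem
    chunk_file_paths: List[Optional[str]] = []
    for idx, text in enumerate(chunks):
        target = out_dir / f"{stem}_chunk_{idx}.pdf"
        try:
            render_pdf(text, target)
            chunk_file_paths.append(str(target))
        except Exception as exc:
            # Ingestion continues; this chunk simply has no page-level link.
            logger.warning(
                "PDF generation failed for %s chunk %d: %s", source_name, idx, exc
            )
            chunk_file_paths.append(None)
    return chunk_file_paths
```

Keeping the list aligned with the chunk order means it can be passed to extract_metadata() as-is, with None entries falling back to the document-level link on the frontend.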

Task 5.2.3: Improve citationParser.ts with fuzzy matching

  • Add extension-stripping helper: stripExtension(filename: string): string — removes .pdf, .docx, .txt
  • Modify buildCitationLookup() to register both filename and stripExtension(filename) as lookup keys
  • Add trim-whitespace normalization on citation text before lookup
  • Add test for LLM-common variations: NEC4 ACC.pdf vs NEC4 ACC vs NEC4_acc.pdf
  • Test file: test_phase5_citation_parser_fuzzy.test.ts

Task 5.2.4: Fallback document link

  • In citationParser.ts replaceCitationPatterns(), when source?.chunk_file_path is null but source?.document_id exists:
    1. Build a URL to the document chunk list page: /rag-database?document_id=${source.document_id}
    2. Return [${trimmed}](${url}) with a different CSS class (e.g., text-green-600 for document-level vs text-blue-600 for page-level)
  • In ResponsePanel.tsx, update CitationLink component to accept a variant prop for visual differentiation
  • Test file: test_phase5_citation_fallback_link.test.ts
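
The fuzzy matching and fallback-link rules above are implemented in TypeScript in citationParser.ts; the sketch below restates the same logic in Python purely to make the lookup order concrete. The source field name "filename", the helper names, and the use of chunk_file_path directly as the link target are assumptions. The fallback URL follows the /rag-database?document_id=... form given in this task.

```python
import re
from typing import Dict, List, Optional

KNOWN_EXTENSIONS = (".pdf", ".docx", ".txt")


def normalize(label: str) -> str:
    """Trim, collapse inner whitespace, and lowercase for case-insensitive lookup."""
    return re.sub(r"\s+", " ", label.strip()).lower()


def strip_extension(filename: str) -> str:
    """Remove a trailing .pdf/.docx/.txt, ignoring case."""
    lowered = filename.lower()
    for ext in KNOWN_EXTENSIONS:
        if lowered.endswith(ext):
            return filename[: -len(ext)]
    return filename


def build_citation_lookup(sources: List[dict]) -> Dict[str, dict]:
    """Register each source under its full filename and its extension-less name."""
    lookup: Dict[str, dict] = {}
    for source in sources:
        key = normalize(source["filename"])
        for candidate in (key, strip_extension(key)):
            lookup.setdefault(candidate, source)
    return lookup


def resolve_link(citation: str, lookup: Dict[str, dict]) -> Optional[str]:
    """Return a page-level link, a document-level fallback, or None (plain text)."""
    # Citations look like "NEC4 ACC.pdf, page 3"; match on the part before the comma.
    key = normalize(citation.split(",")[0])
    source = lookup.get(key) or lookup.get(strip_extension(key))
    if source is None:
        return None
    if source.get("chunk_file_path"):
        return source["chunk_file_path"]
    if source.get("document_id"):
        return f"/rag-database?document_id={source['document_id']}"
    return None
```

This sketch does not attempt underscore or other character substitutions (for example NEC4_acc.pdf); whether those should match is left to the tests for this task.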

Task 5.2.5: Integration and regression testing

  • Verify all existing citation parser tests still pass: cd frontend && npx vitest run src/test/utils/citationParser.test.ts
  • Verify ResponsePanel tests still pass: npx vitest run src/test/components/ResponsePanel.test.tsx
  • Run full frontend test suite: npm test
  • Verify SSE streaming integration: query with a mix of PDF and DOCX documents, confirm citations are clickable

Dependency Graph

Phase 5.1 (Structured Output)
  Task 5.1.1 (add deps) ──┬── Task 5.1.2 (SubQuestions model) ── Task 5.1.3 (complete_structured)
                           │                                           │
                           │                                           ▼
                           │                              Task 5.1.4 (refactor decompose)
                           │                                           │
                           │                              Task 5.1.5 (update prompt template)
                           │                                           │
                           │                                           ▼
                           │                              Task 5.1.6 (integration tests)
                           │
Phase 5.2 (Citation Linking) — independent, can run in parallel with 5.1
  Task 5.2.1 (document_id in model) ──┬── Task 5.2.3 (fuzzy matching)
  Task 5.2.2 (DOCX/TXT PDF gen)    ──┤
                                      ├── Task 5.2.4 (fallback link)
                                      │
                                      ▼
                              Task 5.2.5 (integration tests)

Acceptance Criteria

Phase 5.1 Completion Checklist

  • LLMClient.complete_structured() returns validated SubQuestions Pydantic model — no json.JSONDecodeError possible
  • QueryDecomposer.decompose() never returns [] due to JSON parse failure
  • Fallback path (legacy json.loads()) logs a warning when triggered
  • Existing decompose tests pass (test_phase1_query_decomposer.py)
  • New structured output tests pass (test_phase5_*.py) — 33 tests
  • Spike test passed: Cantonese + English → valid sub-questions
  • SQLite seed templates updated and backfilled to all profiles
  • langchain and langchain-openai installed in venv (1.2.x)

Phase 5.2 Completion Checklist

  • SourceMetadata includes document_id in both backend and frontend types
  • DOCX/TXT ingestion generates per-chunk PDF files (DEFERRED to Phase 5.3)
  • citationParser.ts matches [NEC4 ACC] to source NEC4 ACC.pdf (fuzzy matching)
  • citationParser.ts renders fallback link to /rag-database?document=xxx when chunk_file_path is null but document_id exists
  • RAGDatabasePage auto-expands document from ?document= URL param
  • All existing citation parser tests pass (14 tests)
  • All existing ResponsePanel tests pass
  • generate_per_subq seed prompt tightened: "Copy the exact bracket labels shown"

Rollback Plan

If with_structured_output() causes issues in production:

  1. The complete_structured() method wraps errors in LLMClientError — same exception type as existing complete()
  2. QueryDecomposer.decompose() has a fallback to legacy complete() + json.loads() path
  3. The _extract_json_from_markdown() function is preserved for backward compatibility
  4. If LangChain is a complete failure, revert requirements.txt and llm_client.py changes (3 files), keeping the Pydantic model and improved logging


Phase 5.3 — DOCX/TXT PDF Generation

Generate per-chunk PDF files for DOCX/TXT documents at ingestion time so they have the same chunk_file_path → PDF viewer flow as PDF documents.

Status: Complete (2026-04-28). Implemented in commit 25b26c9.

  • reportlab==4.2.5 added to requirements.txt
  • New backend/app/utils/text_to_pdf.py: renders chunk text as simple PDFs with word wrapping
  • ingest.py DOCX/TXT branches: generates {stem}_chunk_{idx}.pdf per chunk, passes chunk_file_paths to extract_metadata()
  • Graceful degradation: chunk_file_path stays None on generation failure (logged as warning)
  • Tests: test_phase5_docx_pdf_generation.py (5 tests), updated test_phase1_ingest_page_aware.py (2 assertions)

Phase 5.4 — Sentence-Level Highlighting (PLANNED)

Problem

When a user clicks a citation link to view a cited chunk, they see the full chunk text (up to ~1000 tokens). They have to manually scan to find which sentences actually drove the relevance. This is especially painful for long, dense chunks.

Solution

On-the-fly highlighted HTML chunk views served by the backend. When a citation link is clicked, the frontend passes the sub-question that retrieved that chunk. The backend splits the chunk into sentences, computes embedding similarity of each sentence to the sub-question, and returns a styled HTML page with relevant sentences highlighted.

Why HTML, not PDF?

| Approach | Complexity | Works for all doc types? | Preserves original formatting? |
|----------|------------|--------------------------|--------------------------------|
| Highlighted HTML page | Low | Yes (uses chunk text) | Plain text only |
| Highlighted PDF via reportlab | Medium | Yes (new PDF) | Plain text only |
| Overlay highlights on existing PDF | High | ⚠️ PDF only | Yes |

Recommendation: HTML page. Simple, fast, works uniformly for PDF/DOCX/TXT chunks. Original formatting is preserved in the existing PDF viewer (chunk_file_path link) — the highlighted HTML view is a supplementary view reached via a separate button/link. The two views coexist: "View Original PDF" vs "View Highlighted Text".

How It Works (No LLM Needed)

User clicks citation [NEC4 ACC, chunk 3]
       │
       ▼
Frontend sends: GET /api/v1/chunks/highlight?document_id=abc&chunk_index=2&sub_question=...
       │
       ▼
Backend:
  1. Fetch chunk text from ChromaDB                          [chromadb get()]
  2. Split into sentences                                    [nltk.sent_tokenize or regex]
  3. Embed sub-question                                      [existing embedding model]
  4. Embed each sentence (batch, parallel)                   [same model]
  5. Compute cosine similarity per sentence vs sub-question  [numpy]
  6. Return HTML with yellow background on sentences > threshold
       │
       ▼
Frontend renders HTML in an iframe or new tab
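
Steps 5 and 6 of the flow above can be sketched as follows. This is an illustration under stated assumptions: embeddings are passed in already computed, cosine similarity is written in pure Python instead of numpy to keep the sketch dependency-free, and render_highlighted_html() and the exact markup are hypothetical. Only the threshold variable name and its 0.5 default come from this plan.

```python
import html
import math
import os
from typing import List, Sequence

# Plan default is 0.5; the env var name is taken from the design decisions.
DEFAULT_THRESHOLD = float(os.environ.get("HIGHLIGHT_SIMILARITY_THRESHOLD", "0.5"))


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """dot(a, b) / (|a| * |b|), returning 0.0 for a zero-length vector."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / norm


def render_highlighted_html(
    sentences: List[str],
    sentence_embeddings: List[Sequence[float]],
    question_embedding: Sequence[float],
    threshold: float = DEFAULT_THRESHOLD,
) -> str:
    """Wrap sentences whose similarity exceeds the threshold in a yellow <mark>."""
    parts = []
    for sentence, embedding in zip(sentences, sentence_embeddings):
        text = html.escape(sentence)  # chunk text is untrusted input
        if cosine_similarity(question_embedding, embedding) > threshold:
            text = f'<mark style="background: yellow">{text}</mark>'
        parts.append(text)
    return "<html><body><p>" + " ".join(parts) + "</p></body></html>"
```

Escaping each sentence before wrapping it matters because the endpoint returns text/html built from document content.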

What Gets Highlighted

┌──────────────────────────────────────────────────────────┐
│ Chunk: NEC4 ACC, page 12          [View Original PDF →]  │
├──────────────────────────────────────────────────────────┤
│                                                            │
│ The programme shall be prepared in a form acceptable to   │
│ the Project Manager. It shall include:                    │
│                                                            │
│ ████████████████████████████████████████████████████████ │
│ █ The starting date, access dates, and Key Dates.       █ │  ← High similarity
│ ████████████████████████████████████████████████████████ │
│                                                            │
│ The Contractor shall submit a first programme within      │
│ ████████████████████████████████████████████████████████ │
│ █ two weeks of the starting date.                       █ │  ← High similarity
│ ████████████████████████████████████████████████████████ │
│                                                            │
│ The Project Manager may instruct the Contractor to        │
│ submit a revised programme showing the effects of a       │
│ compensation event. This does not affect the Contractor's │
│ right to be paid for preparing the programme.             │  ← Low similarity (no highlight)
│                                                            │
└──────────────────────────────────────────────────────────┘

Key Design Decisions

| # | Decision | Rationale |
|---|----------|-----------|
| 1 | HTML page, not PDF | Zero dependency (reportlab not needed). Faster to generate. CSS-based highlighting is more flexible. Original PDF view remains available separately. |
| 2 | Embedding similarity, not LLM | No LLM call, so no generation cost or latency. The embedding model is already running. Cosine similarity is cheap. |
| 3 | Sentence-level granularity | Paragraph-level is too coarse (whole paragraph might be dimly relevant). Word/phrase-level is too noisy. Sentences are the natural unit of meaning. |
| 4 | Embed sentences in batch | A 1000-token chunk has ~8-12 sentences. One batch embedding call is fast (single API round-trip). |
| 5 | Configurable threshold (env var) | HIGHLIGHT_SIMILARITY_THRESHOLD (default 0.5). Tune per embedding model. |
| 6 | Cache sentence embeddings per chunk | A chunk may be cited in multiple queries. Cache sentence embeddings in ChromaDB metadata or SQLite to avoid recomputation. |
| 7 | Graceful degradation | If embedding fails → return plain text chunk view. If sentence splitting fails → highlight entire chunk. |
| 8 | Frontend: "View Highlighted" link alongside "View PDF" | The existing PDF viewer link (chunk_file_path) stays. A second link opens the highlighted HTML view. Both visible, user chooses. |
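
Decision 6 leaves the cache location open (ChromaDB metadata or SQLite). One possible shape for the SQLite variant is sketched below; the class name, table name, and column layout are all hypothetical, and embed_batch stands in for the existing batch embedding call.

```python
import json
import sqlite3
from typing import Callable, List


class SentenceEmbeddingCache:
    """Caches per-chunk sentence embeddings keyed by (document_id, chunk_index)."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self._conn = conn
        conn.execute(
            "CREATE TABLE IF NOT EXISTS sentence_embeddings ("
            " document_id TEXT NOT NULL,"
            " chunk_index INTEGER NOT NULL,"
            " embeddings_json TEXT NOT NULL,"
            " PRIMARY KEY (document_id, chunk_index))"
        )

    def get_or_compute(
        self,
        document_id: str,
        chunk_index: int,
        sentences: List[str],
        embed_batch: Callable[[List[str]], List[List[float]]],
    ) -> List[List[float]]:
        row = self._conn.execute(
            "SELECT embeddings_json FROM sentence_embeddings"
            " WHERE document_id = ? AND chunk_index = ?",
            (document_id, chunk_index),
        ).fetchone()
        if row is not None:
            return json.loads(row[0])  # cache hit: no embedding call
        embeddings = embed_batch(sentences)  # one batch call per chunk
        self._conn.execute(
            "INSERT OR REPLACE INTO sentence_embeddings VALUES (?, ?, ?)",
            (document_id, chunk_index, json.dumps(embeddings)),
        )
        self._conn.commit()
        return embeddings
```

A real implementation would also need to invalidate entries when a document is re-ingested or the embedding model changes; the sketch does not cover that.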

Implementation Tasks

Task 5.4.1: Backend — Sentence splitting utility

  • Create backend/app/utils/sentence_splitter.py
  • Function split_sentences(text: str) -> list[dict] returns [{text, start_char, end_char}, ...]
  • Use nltk.sent_tokenize with fallback to regex (re.split(r'(?<=[.!?])\s+'))
  • NLTK punkt data auto-downloaded on first use (or bundled)
  • Handle edge cases: empty text, single sentence, lists/bullets
  • Test file: test_phase5_sentence_splitter.py
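
The regex fallback path of split_sentences() can be sketched as below, using the pattern named in this task and returning the {text, start_char, end_char} shape. The NLTK path is omitted, and the helper _append() is a hypothetical name. Note that this simple pattern will over-split abbreviations such as "e.g. this", which is one reason to prefer nltk.sent_tokenize when it is available.

```python
import re
from typing import Dict, List, Union

# Split on whitespace that follows sentence-ending punctuation.
_SENTENCE_BREAK = re.compile(r"(?<=[.!?])\s+")

Sentence = Dict[str, Union[str, int]]


def _append(sentences: List[Sentence], text: str, start: int, end: int) -> None:
    """Add text[start:end] as a sentence, trimmed, skipping empty segments."""
    segment = text[start:end]
    stripped = segment.strip()
    if not stripped:
        return
    offset = start + (len(segment) - len(segment.lstrip()))
    sentences.append(
        {"text": stripped, "start_char": offset, "end_char": offset + len(stripped)}
    )


def split_sentences(text: str) -> List[Sentence]:
    """Regex-only sentence splitter that preserves character offsets."""
    sentences: List[Sentence] = []
    start = 0
    for match in _SENTENCE_BREAK.finditer(text):
        _append(sentences, text, start, match.start())
        start = match.end()
    _append(sentences, text, start, len(text))
    return sentences
```

Keeping start_char and end_char lets the highlight renderer slice the original chunk text, so text[start_char:end_char] always equals the sentence's text field.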

Task 5.4.2: Backend — Highlighted chunk endpoint

  • New endpoint: GET /api/v1/chunks/highlight
  • Query params: document_id, chunk_index, sub_question
  • Returns text/html (not JSON)
  • Logic in backend/app/services/chunk_highlight_service.py:
    1. Fetch chunk from ChromaDB by document_id + chunk_index
    2. Split into sentences via split_sentences()
    3. Get embedding for sub_question via existing embedding model
    4. Get embeddings for all sentences in one batch call
    5. Compute cosine similarity: np.dot(q_emb, s_emb) / (norm(q_emb) * norm(s_emb))
    6. Mark sentences with similarity > threshold as highlighted
    7. Render HTML template with inline CSS (yellow background, subtle border)
  • Test file: test_phase5_chunk_highlight.py

Task 5.4.3: Frontend highlighted-view link

  • In citationParser.ts and ResponsePanel.tsx, add a "🔍" or "View Highlighted" link next to each source
  • Link target: /api/v1/chunks/highlight?document_id=...&chunk_index=...&sub_question=...
  • The sub-question is the one that retrieved this chunk (already available in the sources structure: source.sub_question_index → look up sub-question text)
  • Open in new tab or modal
  • Test file: Update citationParser.test.ts and ResponsePanel.test.tsx

Task 5.4.4: Integration testing

  • Verify highlight endpoint returns 200 with valid HTML for all doc types (PDF, DOCX, TXT)
  • Verify sentence highlighting is proportional to relevance (spot-check manually)
  • Verify caching works (second request for same chunk is faster)
  • Verify graceful degradation (embedding API down → plain text still served)
  • Run full test suite

Test Files

| # | Test File | Coverage |
|---|-----------|----------|
| T5.4.1 | backend/app/test/test_phase5_sentence_splitter.py | Sentence splitting: English, mixed punctuation, empty, single sentence, bullet lists |
| T5.4.2 | backend/app/test/test_phase5_chunk_highlight.py | Highlight endpoint: valid request → HTML with highlights, threshold filtering, no sentences above threshold → all plain, missing document/chunk → 404, embedding failure → fallback plain text |
| T5.4.3 | frontend/src/test/utils/citationParser.test.ts (update) | Citation links include highlight URL when sub-question context available |
| T5.4.4 | frontend/src/test/components/ResponsePanel.test.tsx (update) | Sources section renders "View Highlighted" link alongside "View PDF" |

Acceptance Tests

| # | Test File | Coverage |
|---|-----------|----------|
| AT5.4.1 | backend/app/test/acceptance/test_acceptance_phase5_highlight.py | Real LLM query → real embeddings → open highlighted view → verify yellow spans exist on relevant sentences |

Commit Plan

| Commit | Message | Scope |
|--------|---------|-------|
| 1 | feat: add LangChain deps and SubQuestions Pydantic model | Tasks 5.1.1 + 5.1.2 + tests |
| 2 | feat: add LLMClient.complete_structured() with LangChain | Task 5.1.3 + tests |
| 3 | feat: refactor QueryDecomposer to use structured output with fallback | Task 5.1.4 + tests |
| 4 | chore: update decompose seed prompt for structured output | Task 5.1.5 |
| 5 | feat: add document_id to SourceMetadata model | Task 5.2.1 + tests |
| 6 | feat: fuzzy citation matching and document fallback links | Tasks 5.2.3 + 5.2.4 + tests |
| 7 | feat: sentence-level chunk highlighting via embedding similarity | Phase 5.4 (all tasks) |