legco_ai_assistant/.plans/package4_enhancement_plan.md

Package 4 Enhancement Plan — Per-Sub-Question RAG Pipeline

Source: User request (2026-04-26)
Scope: Refactor the 3-step RAG query pipeline so retrieval, filtering, and response generation are organized per sub-question instead of batch-flattened.
Status: Complete — All 7 sub-phases implemented (2026-04-26). Phase 4a Prompt Integration added (2026-04-27).


Objective

Restructure the POST /api/v1/query pipeline so that:

  1. Retrieval per sub-question: Each sub-question independently retrieves n_results chunks from ChromaDB (instead of joining all sub-questions into one query string).
  2. Filtering per sub-question: Each chunk is evaluated for relevance against its own originating sub-question (not the original user question). One LLM call handles all filtering — the prompt is redesigned to group chunks by sub-question.
  3. Final answer organized by sub-question: Each sub-question gets its own bullet-point answer with its own sources. The frontend renders answer sections per sub-question rather than one monolithic bullet list.

Decision Register

| # | Decision | Rationale |
|---|----------|-----------|
| 1 | Keep QueryDecomposer unchanged | Input/output contract is identical: decomposition still produces a flat list of sub-questions |
| 2 | Single LLM call for filtering | User explicitly requested one call. Prompt redesigned to carry sub-question context for each chunk group |
| 3 | Keep RAGService.retrieve() signature | Call it N times (once per sub-question) externally in the orchestrator rather than changing its internal contract |
| 4 | Add retrieve_per_subquestion() to RAGService | New method that iterates over sub-questions, calls retrieve() per question, returns grouped results |
| 5 | Redesign generate_response() signature | Accepts structured sub_questions: List[SubQuestionContext] instead of flat chunk lists |
| 6 | SSE events: add generating_subquestion phase | Progressive streaming: frontend sees which sub-question is being answered |
| 7 | History: change XML/JSON formats in-place | Add <sub_q> wrappers to chunks_retrieved/chunks_filtered XML. Add sub-question grouping to sources JSON. No new DB columns for these fields; only the two optional per-sub-question count columns from Task 4.4.2 are added. |
| 8 | Final answer format: markdown sections | ## Sub-question 1 headers with inline citations. Backward-compatible with existing ReactMarkdown rendering |
| 9 | Deduplicate chunks within a sub-question only | The same chunk may be retrieved by multiple sub-questions. Keep those duplicates, because each sub-question needs independent evaluation. ChromaDB query() may naturally return the same doc for different queries; this is acceptable. |
| 10 | Prompt template: add generate placeholders | New placeholder {context_sections} replaces single {context}. Filter template unchanged (sub-question injected at call site). Decompose template unchanged. |
| 11 | Progressive SSE events | Emit generating_subquestion event as each sub-question's answer section is generated. Frontend renders sections one by one. |
| 12 | retrieval_n_results | Global: same value for all sub-questions. Use existing settings.retrieval_n_results config. |
| 13 | Empty decomposition fallback | Treat original user question as single sub-question. Pipeline runs as the 1-sub-q case: single retrieval, no filtering needed (one sub-q = no ambiguity), flat answer with ## header. |

Pipeline: Before vs After

Before (Current — Flat Batch)

User Question: "What are NEC4 time extension clauses?"
         │
    ┌────▼─────┐
    │ Decompose│  LLM Call 1
    │ → ["What are time extensions?", 
    │    "What notice is required?"]
    └────┬─────┘
         │ joined: "What are time extensions? What notice is required?"
    ┌────▼─────┐
    │ Retrieve │  1 ChromaDB query → 10 chunks (flat, no sub-q association)
    └────┬─────┘
         │ 10 chunks
    ┌────▼─────┐
    │  Filter  │  LLM Call 2 — all chunks scored against ORIGINAL question
    │          │  Score > 7 → keep (flat, no sub-q association)
    └────┬─────┘
         │ N filtered chunks
    ┌────▼─────┐
    │ Generate │  LLM Call 3 — flat answer from ALL filtered chunks
    │          │  "• Time extensions require notice [NEC4 ACC.pdf, p3]
    │          │   • The project manager must acknowledge [NEC4, p7]
    │          │   • Notice is defined as..."  (sources from all sub-qs mixed)
    └────┬─────┘
         │ single SSE completed event
    ┌────▼─────┐
    │ Frontend │  1 ReactMarkdown block, 1 flat sources list
    └──────────┘

After (Per-Sub-Question)

User Question: "What are NEC4 time extension clauses?"
         │
    ┌────▼─────┐
    │ Decompose│  LLM Call 1 (UNCHANGED)
    │ → ["What are time extensions?",
    │    "What notice is required?"]
    └────┬─────┘
         │ sub_q1                    sub_q2
    ┌────▼─────┐              ┌────▼─────┐
    │ Retrieve │              │ Retrieve │   2 ChromaDB queries → 10 chunks each
    │ q1 → 10  │              │ q2 → 10  │   chunks tagged with sub-q index
    └────┬─────┘              └────┬─────┘
         │                         │
         └─────────┬───────────────┘
                   │ grouped: {sub_q0: [chunks 0-9], sub_q1: [chunks 10-19]}
              ┌────▼─────┐
              │  Filter  │  LLM Call 2 (SINGLE CALL — redesigned prompt)
              │          │  Each chunk scored against its OWN sub-question
              │          │  Returns grouped scores → filtered per sub-q
              └────┬─────┘
                   │ filtered_by_subq: {0: [chunk_a, chunk_b], 1: [chunk_c]}
              ┌────▼─────┐
              │ Generate │  LLM Call 3 (redesigned prompt with per-sub-q context)
              │          │  ┌─────────────────────────────────────┐
              │          │  │ ## What are time extensions?         │
              │          │  │ - Time extensions must be notified   │
              │          │  │   [NEC4 ACC.pdf, page 3]             │
              │          │  │ - The project manager has 2 weeks    │
              │          │  │   [NEC4 Contract.pdf, page 12]       │
              │          │  │                                      │
              │          │  │ ## What notice is required?          │
              │          │  │ - Written notice must be given       │
              │          │  │   [NEC4 ACC.pdf, page 7]             │
              │          │  └─────────────────────────────────────┘
              └────┬─────┘
                   │ SSE events: generating_subquestion (per sub-q) → completed
              ┌────▼─────┐
              │ Frontend │  Sections per sub-question, sources grouped per section
              └──────────┘

Current State (Pre-Enhancement)

Backend

| Component | File | Current Behavior |
|-----------|------|------------------|
| Decomposer | services/query_decomposer.py | decompose(question) -> (List[str], prompt); returns 2-5 sub-questions |
| Retrieval | services/rag.py:retrieve() | query_text = " ".join(query_keywords); joins all sub-qs into ONE string, single ChromaDB query → flat chunk list |
| Filter | services/relevance_filter.py | filter(question, chunks); ALL chunks scored against ORIGINAL question, single LLM call, flat output |
| Generate | services/rag.py:generate_response() | generate_response(question, chunks, metadata); flat chunks → flat bullet answer |
| Orchestrator | routers/query.py:_query_stream() | Linear 4-stage pipeline: decompose → retrieve → filter → generate |
| SSE Events | routers/query.py | decomposed → retrieving → filtering → generating → completed; flat answer + sources in completed |
| History | services/history_service.py | Flat XML for chunks_retrieved/chunks_filtered. Flat JSON for sources. Single timing per stage. |
| Prompt templates | prompt_service.py + sqlite_db.py | 3 steps (decompose, filter, generate). Placeholders: {question}, {chunks}, {context} |
| Config | core/config.py | retrieval_n_results=10, relevance_threshold=7.0 |

Frontend

| Component | File | Current Behavior |
|-----------|------|------------------|
| Types | types/index.ts | QueryStreamEvent.phase, flat extracted_questions: string[], flat answer: string, flat sources: SourceMetadata[] |
| SSE Client | lib/api.ts | queryDocumentStream(): generic JSON.parse per data: line, no sub-question awareness |
| State | lib/queries.tsx | QueryStreamState with flat answer/sources/extractedQuestions |
| Response | components/ResponsePanel.tsx | Single ReactMarkdown block for answer. Flat 2-column grid for sources. No sub-question grouping. |
| Questions | components/ExtractedQuestionsDisplay.tsx | <ol> list of question strings. No sources attached. |
| Citations | utils/citationParser.ts | Flat sources lookup: buildCitationLookup(sources) returns global map |
| Progress | components/PipelineProgress.tsx | 4-step stepper (NOT currently wired in LTTPage) |

Key Test Files

| File | Lines | Status |
|------|-------|--------|
| test_phase1_query_decomposer.py | 76 | Unchanged: decomposer contract stays |
| test_phase1_rag_service.py | 139 | 🔴 Needs update: retrieve(), generate_response() signatures change |
| test_phase1_relevance_filter.py | 93 | 🟡 Needs update: one-call pattern changes to per-sub-q grouping |
| test_phase1_query.py | 97 | 🟢 Already skipped (SSE migration); may un-skip later |
| test_phase3_query_history_integration.py | 608 | 🔴 Major rewrite: pipeline simulation mirrors _query_stream 1:1 |
| test_phase3_prompt_injection.py | 238 | 🟡 Moderate: new generate template placeholder |
| test_acceptance_phase1_rag_query.py | 101 | 🔴 Full rewrite: already broken (SSE vs JSON), new response shape |
| conftest.py | 94 | 🟡 Low: may add per-sub-q mock helpers |

Implementation Tasks

Sub-Phase 4.1: Backend — Per-Sub-Question Retrieval

Test files to write first:

  • test_phase4_retrieve_per_subquestion.py — Tests RAGService.retrieve_per_subquestion()
  • test_phase4_query_router_retrieval.py — Tests _query_stream retrieval stage produces per-sub-q chunks

Task 4.1.1: Add retrieve_per_subquestion() to RAGService

File: backend/app/services/rag.py

New method signature:

def retrieve_per_subquestion(
    self,
    sub_questions: List[str],
    n_results: int = 10,
) -> List[Tuple[str, List[Tuple[str, Dict[str, Any], float]]]]:
    """Retrieve chunks for each sub-question independently.

    Args:
        sub_questions: List of decomposed sub-questions.
        n_results: Number of chunks per sub-question.

    Returns:
        List of (sub_question, chunks) tuples.
        chunks is the standard retrieve() output: [(text, metadata, distance), ...].
    """

Implementation:

  • Call self.retrieve([sub_q], n_results) for each sub-question
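  • Return list of (sub_question, chunks). Chunks are unique within each sub-question because a single ChromaDB query returns each document ID at most once; the same chunk may still appear under several sub-questions (see Decision 9)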
  • Existing retrieve() method is NOT modified — it continues to work as before
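The loop described above can be sketched as follows. This is a minimal illustration rather than the project's implementation: it is written as a free function that takes the retrieval callable as a parameter, and `fake_retrieve` is a hypothetical stand-in for RAGService.retrieve() so the example runs without ChromaDB.

```python
from typing import Any, Callable, Dict, List, Tuple

Chunk = Tuple[str, Dict[str, Any], float]  # (text, metadata, distance)


def retrieve_per_subquestion(
    retrieve: Callable[[List[str], int], List[Chunk]],
    sub_questions: List[str],
    n_results: int = 10,
) -> List[Tuple[str, List[Chunk]]]:
    """Call the existing retrieve() once per sub-question and keep results grouped."""
    grouped: List[Tuple[str, List[Chunk]]] = []
    for sub_q in sub_questions:
        # Passing a one-element list means the joined query text is just this sub-question.
        grouped.append((sub_q, retrieve([sub_q], n_results)))
    return grouped


def fake_retrieve(query_keywords: List[str], n_results: int) -> List[Chunk]:
    """Hypothetical stand-in for RAGService.retrieve(); no vector store involved."""
    query_text = " ".join(query_keywords)
    return [
        (f"chunk {i} for {query_text}", {"page": i}, 0.1 * i)
        for i in range(n_results)
    ]


results = retrieve_per_subquestion(
    fake_retrieve,
    ["What are time extensions?", "What notice is required?"],
    n_results=2,
)
```

Because the grouping happens outside retrieve(), the existing method and its tests stay untouched, which is the point of Decision 3.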

Task 4.1.2: Update _query_stream() retrieval stage

File: backend/app/routers/query.py

Changes:

  • Replace rag.retrieve(extracted_questions, n_results) with rag.retrieve_per_subquestion(extracted_questions, n_results)
  • Track per-sub-question retrieval timing (new field or combined timing)
  • Format chunks_retrieved XML with sub-question wrappers

New chunks_retrieved XML format:

<sub_q idx="0" question="What are time extensions?">
<chunk_1>
Filename: NEC4 ACC.pdf
Page: 3
Content: Clause 61.3 states that...
</chunk_1>
<chunk_2>
...
</chunk_2>
</sub_q>
<sub_q idx="1" question="What notice is required?">
<chunk_1>
Filename: NEC4 Contract.pdf
Page: 12
Content: Notice must be given...
</chunk_1>
...
</sub_q>
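A helper that emits the grouped layout above might look like the sketch below. The metadata keys `filename` and `page` are assumptions about the chunk metadata, and only the question attribute is escaped; chunk content is written verbatim, as in the sample.

```python
from typing import Any, Dict, List, Tuple
from xml.sax.saxutils import quoteattr

Chunk = Tuple[str, Dict[str, Any], float]  # (text, metadata, distance)


def format_chunks_retrieved_per_subq(results: List[Tuple[str, List[Chunk]]]) -> str:
    """Render grouped retrieval results as <sub_q> blocks with numbered chunks."""
    parts: List[str] = []
    for idx, (sub_q, chunks) in enumerate(results):
        # quoteattr adds the surrounding quotes and escapes &, < and > in the question.
        parts.append(f'<sub_q idx="{idx}" question={quoteattr(sub_q)}>')
        # Chunk numbering restarts at 1 inside every sub-question group.
        for n, (text, meta, _distance) in enumerate(chunks, start=1):
            parts.append(f"<chunk_{n}>")
            parts.append(f"Filename: {meta.get('filename', 'unknown')}")  # assumed key
            parts.append(f"Page: {meta.get('page', 'unknown')}")  # assumed key
            parts.append(f"Content: {text}")
            parts.append(f"</chunk_{n}>")
        parts.append("</sub_q>")
    return "\n".join(parts)


xml_out = format_chunks_retrieved_per_subq(
    [
        (
            "What are time extensions?",
            [("Clause 61.3 states that...", {"filename": "NEC4 ACC.pdf", "page": 3}, 0.12)],
        ),
        ("What notice is required?", []),
    ]
)
```

A sub-question with no chunks still produces an empty <sub_q> wrapper, so group indices in the stored XML line up with the sub-question list.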

Task 4.1.3: Format helpers

File: backend/app/routers/query.py

New functions:

def format_chunks_retrieved_per_subq(results: List[Tuple[str, List]]) -> str:
    """Format per-sub-question retrieved chunks as XML."""
    
def format_chunks_filtered_per_subq(results: List[Tuple[str, List]]) -> str:
    """Format per-sub-question filtered chunks as XML with relevance scores."""

Commit: "feat: Phase 4.1 per-sub-question retrieval with grouped chunk XML"

Sub-Phase 4.2: Backend — Per-Sub-Question Filtering (Single LLM Call)

Test files to write first:

  • test_phase4_relevance_filter_per_subq.py — Tests RelevanceFilter.filter_per_subquestion() with grouped chunks
  • test_phase4_query_router_filter.py — Tests filter stage with per-sub-q chunk groups

Task 4.2.1: Add filter_per_subquestion() to RelevanceFilter

File: backend/app/services/relevance_filter.py

New method signature:

async def filter_per_subquestion(
    self,
    sub_questions: List[str],
    sub_chunks: List[List[Tuple[str, Dict]]],
    threshold: float = 7.0,
) -> Tuple[List[Tuple[str, List[Tuple[str, Dict]]]], str]:
    """Filter chunks per sub-question in a single LLM call.

    Args:
        sub_questions: List of decomposed sub-questions.
        sub_chunks: List of chunk lists (one per sub-question).
        threshold: Minimum relevance score.

    Returns:
        Tuple of (filtered_results, prompt).
        filtered_results: List of (sub_question, filtered_chunks_for_that_q).
    """

Prompt design (single LLM call):

Evaluate each chunk for relevance to its associated sub-question.

Sub-question 0: "{sub_q_0}"
Chunk 0: {chunk_0_text}
Chunk 1: {chunk_1_text}
...

Sub-question 1: "{sub_q_1}"
Chunk 0: {chunk_0_text}
Chunk 1: {chunk_1_text}
...

For each chunk, rate relevance 0-10 considering ONLY its associated sub-question.
Return a JSON object mapping sub-question indices to arrays of scores:
{"0": [8.5, 3.2, 9.0], "1": [7.0, 6.5, 9.1]}

Key rules:

  • Each chunk is evaluated against its own sub-question (not the original user question)
  • JSON keys are stringified sub-question indices ("0", "1", ...)
  • Score arrays MUST match chunk count for each sub-question
  • Same JSON extraction/markdown stripping logic as existing filter()

Existing filter() method is preserved — not modified, not deprecated. The new method is additive.
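To make the key rules concrete, here is a minimal sketch of how the grouped score object could be applied once the LLM reply has been reduced to a JSON string. The function name `apply_grouped_scores` is hypothetical. Two behaviours are assumptions: scores are kept when they are greater than or equal to the threshold, and a missing or mismatched score array is handled like a parse failure (all chunks for that sub-question are kept).

```python
import json
from typing import Any, Dict, List, Tuple

ChunkPair = Tuple[str, Dict[str, Any]]  # (text, metadata)


def apply_grouped_scores(
    raw_reply: str,
    sub_questions: List[str],
    sub_chunks: List[List[ChunkPair]],
    threshold: float = 7.0,
) -> List[Tuple[str, List[ChunkPair]]]:
    """Filter each chunk group using the scores stored under its stringified index."""
    try:
        parsed = json.loads(raw_reply)
    except json.JSONDecodeError:
        parsed = None
    scores_by_idx = parsed if isinstance(parsed, dict) else {}

    filtered: List[Tuple[str, List[ChunkPair]]] = []
    for idx, (sub_q, chunks) in enumerate(zip(sub_questions, sub_chunks)):
        scores = scores_by_idx.get(str(idx))
        usable = (
            isinstance(scores, list)
            and len(scores) == len(chunks)
            and all(isinstance(s, (int, float)) for s in scores)
        )
        if not usable:
            # Assumed fallback: keep every chunk for this sub-question unfiltered.
            filtered.append((sub_q, list(chunks)))
            continue
        kept = [chunk for chunk, score in zip(chunks, scores) if score >= threshold]
        filtered.append((sub_q, kept))
    return filtered


questions = ["q0", "q1"]
groups = [[("a", {}), ("b", {})], [("c", {})]]
good = apply_grouped_scores('{"0": [8.5, 3.2], "1": [9.1]}', questions, groups)
bad = apply_grouped_scores("not json", questions, groups)
partial = apply_grouped_scores('{"0": [8.5], "1": [2.0]}', questions, groups)
```

Validating the array length per group keeps one malformed group from discarding the scores of the others.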

Task 4.2.2: Update _query_stream() filter stage

File: backend/app/routers/query.py

Changes:

  • Call relevance_filter.filter_per_subquestion(extracted_questions, chunks_for_filter, threshold) instead of relevance_filter.filter(question, chunks, threshold)
  • Build chunks_for_filter from per-sub-question retrieval results
  • Track filter_prompt (the redesigned prompt)
  • Format chunks_filtered XML with sub-question wrappers and Relevance: scores

New chunks_filtered XML format:

<sub_q idx="0" question="What are time extensions?">
<chunk_1>
Filename: NEC4 ACC.pdf
Page: 3
Relevance: 8.5
Content: Clause 61.3 states that...
</chunk_1>
</sub_q>
<sub_q idx="1" question="What notice is required?">
<chunk_1>
Filename: NEC4 Contract.pdf
Page: 12
Relevance: 9.0
Content: Notice must be given...
</chunk_1>
</sub_q>

Commit: "feat: Phase 4.2 per-sub-question filtering with single LLM call"

Sub-Phase 4.3: Backend — Sub-Question-Organized Response Generation

Test files to write first:

  • test_phase4_generate_per_subq.py — Tests RAGService.generate_response_per_subquestion()
  • test_phase4_response_format.py — Tests the final answer matches expected format

Task 4.3.1: Redesign generate_response() → generate_response_per_subquestion()

File: backend/app/services/rag.py

New method signature:

async def generate_response_per_subquestion(
    self,
    sub_questions: List[str],
    sub_chunks: List[List[str]],
    sub_metadata: List[List[Dict[str, Any]]],
) -> Tuple[str, str, List[List[SourceMetadata]]]:
    """Generate sub-question-organized RAG response.

    Args:
        sub_questions: List of decomposed sub-questions.
        sub_chunks: List of chunk text lists (one per sub-question).
        sub_metadata: List of metadata dict lists (one per sub-question).

    Returns:
        Tuple of (answer, prompt, grouped_sources).
        answer: Markdown string with sections per sub-question.
        prompt: The rendered LLM prompt.
        grouped_sources: List of SourceMetadata lists (one per sub-question).
    """

New prompt template (replaces generate):

You must answer each sub-question using ONLY the document chunks provided for it.
Do not use any external knowledge.
Format your answer as markdown sections — one section per sub-question.
Each section should start with "## Sub-question N: <the question>"
Each section should contain 1-5 bullet points.
Cite your sources inline using bracket labels, e.g. [filename, page N].
Place the citation at the end of each relevant bullet point.

{context_sections}

Answer:

Context format (replaces {context}):

### Context for Sub-question 0: "What are time extensions?"
[NEC4 ACC.pdf, page 3] Source: NEC4 ACC.pdf
Summary: Clause 61.3 discusses time extensions...
Content: Clause 61.3 states that the project manager...

[NEC4 Contract.pdf, page 12] Source: NEC4 Contract.pdf
Summary: Notice requirements for time extensions...
Content: Written notice must be given within...

### Context for Sub-question 1: "What notice is required?"
[NEC4 ACC.pdf, page 7] Source: NEC4 ACC.pdf
Summary: Notice requirements...
Content: The contractor shall notify the project manager in writing...

Expected answer format:

## Sub-question 1: What are time extensions?
- Time extensions must be notified to the project manager within 2 weeks [NEC4 ACC.pdf, page 3]
- The project manager must acknowledge the notice within 1 week [NEC4 Contract.pdf, page 12]

## Sub-question 2: What notice is required?
- Written notice must be given [NEC4 ACC.pdf, page 7]

Existing generate_response() is preserved — not modified, not deprecated.
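The context format above could be assembled roughly as follows before it is substituted for {context_sections}. This is a sketch under assumptions: the metadata keys `filename`, `page` and `summary` are guesses at the chunk metadata, and the placeholder line for an empty group is illustrative only.

```python
from typing import Any, Dict, List


def build_context_sections(
    sub_questions: List[str],
    sub_chunks: List[List[str]],
    sub_metadata: List[List[Dict[str, Any]]],
) -> str:
    """Build one '### Context for Sub-question N' block per sub-question."""
    sections: List[str] = []
    for idx, (question, chunks, metas) in enumerate(
        zip(sub_questions, sub_chunks, sub_metadata)
    ):
        header = f'### Context for Sub-question {idx}: "{question}"'
        entries: List[str] = []
        for text, meta in zip(chunks, metas):
            filename = meta.get("filename", "unknown")  # assumed key
            page = meta.get("page", "?")  # assumed key
            lines = [f"[{filename}, page {page}] Source: {filename}"]
            if meta.get("summary"):  # assumed key; omitted when absent
                lines.append(f"Summary: {meta['summary']}")
            lines.append(f"Content: {text}")
            entries.append("\n".join(lines))
        # Illustrative placeholder so an empty group is still visible to the LLM.
        body = "\n\n".join(entries) if entries else "No relevant chunks were retrieved."
        sections.append(f"{header}\n{body}")
    return "\n\n".join(sections)


context_sections = build_context_sections(
    ["What are time extensions?", "What notice is required?"],
    [["Clause 61.3 states that..."], []],
    [[{"filename": "NEC4 ACC.pdf", "page": 3, "summary": "Clause 61.3 discusses time extensions..."}], []],
)
```

The bracket label at the start of each entry matches the citation style the prompt asks the model to reuse, which is what lets the frontend resolve citations against the grouped sources.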

Task 4.3.2: Update _query_stream() generate stage

File: backend/app/routers/query.py

Changes:

  • Call rag.generate_response_per_subquestion(extracted_questions, chunk_texts_by_subq, metadata_by_subq)
  • New SSE event: generating_subquestion — emitted before each sub-question's section (lets frontend show progressive build)
  • completed SSE event includes both answer (markdown string) and sub_question_sources (grouped sources)

New SSE event sequence:

{"phase": "decomposed", "extracted_questions": ["q1", "q2"]}
{"phase": "retrieving"}
{"phase": "filtering"}
{"phase": "generating"}
{"phase": "completed", "answer": "## Sub-question 1: ...\n\n...", "sub_question_sources": [{"sub_question_index": 0, "sub_question_text": "q1", "sources": [SourceMetadata, ...]}, ...]}
{"phase": "error", "message": "..."}

New QueryResponse model:

File: backend/app/models/query.py

class SubQuestionSources(BaseModel):
    sub_question_index: int
    sub_question_text: str
    sources: List[SourceMetadata]

class QueryResponse(BaseModel):
    extracted_questions: List[str]
    answer: str                          # Markdown with ## sections
    sub_question_sources: List[SubQuestionSources]  # Grouped sources
    # Backward compat:
    sources: List[SourceMetadata]        # Flattened version (all sources)
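The backward-compatible sources field can be derived from the grouped structure. The sketch below uses plain dataclasses instead of the Pydantic models so it runs without dependencies; `Source` is a hypothetical stand-in for SourceMetadata, and the optional de-duplication is an assumption, not something the plan specifies.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass(frozen=True)
class Source:
    """Hypothetical stand-in for SourceMetadata (frozen so it is hashable)."""
    filename: str
    page: int


@dataclass
class SubQuestionSources:
    sub_question_index: int
    sub_question_text: str
    sources: List[Source] = field(default_factory=list)


def flatten_sources(groups: List[SubQuestionSources], dedupe: bool = False) -> List[Source]:
    """Concatenate grouped sources in sub-question order, optionally dropping repeats."""
    flat: List[Source] = []
    seen: Set[Source] = set()
    for group in groups:
        for source in group.sources:
            if dedupe and source in seen:
                continue
            seen.add(source)
            flat.append(source)
    return flat


groups = [
    SubQuestionSources(0, "What are time extensions?", [Source("NEC4 ACC.pdf", 3), Source("NEC4 Contract.pdf", 12)]),
    SubQuestionSources(1, "What notice is required?", [Source("NEC4 ACC.pdf", 3)]),
]
all_sources = flatten_sources(groups)
unique_sources = flatten_sources(groups, dedupe=True)
```

Whether the flat list should repeat a source cited under two sub-questions depends on how existing clients consume it, so the flag leaves both options open.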

Commit: "feat: Phase 4.3 sub-question-organized response generation"

Sub-Phase 4.4: Backend — History & Prompt Template Updates

Test files to write first:

  • test_phase4_history_format.py — Tests new XML/JSON history formats
  • test_phase4_prompt_templates.py — Tests new generate template with {context_sections}

Task 4.4.1: Update history recording

File: backend/app/routers/query.py (the _schedule_history / _record_history helpers)

Changes:

  • chunks_retrieved: Store new grouped XML format (with <sub_q> wrappers)
  • chunks_filtered: Store new grouped XML format (with <sub_q> wrappers and Relevance: scores)
  • sources: Store grouped JSON: json.dumps([[SourceMetadata_dict, ...], [...]]) (list of lists)
  • final_answer: Store markdown string with ## sections
  • Existing fields (chunks_retrieved_count, chunks_filtered_count) keep total counts
  • New optional fields: chunks_retrieved_per_subq_count, chunks_filtered_per_subq_count (JSON array of ints)

Task 4.4.2: Update history DB schema (minimal)

File: backend/app/core/sqlite_db.py

Add two new columns (optional, NULL-able):

ALTER TABLE query_history ADD COLUMN chunks_retrieved_per_subq_count TEXT DEFAULT NULL;
ALTER TABLE query_history ADD COLUMN chunks_filtered_per_subq_count TEXT DEFAULT NULL;

These store JSON arrays like [10, 8] — one count per sub-question. NULL for pre-Package-4 records.
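Since SQLite raises an error when an ALTER TABLE adds a column that already exists, a startup migration usually checks first. The following is a minimal sketch using the standard sqlite3 module and an in-memory database; the helper name and the simplified query_history schema are assumptions for illustration.

```python
import json
import sqlite3
from typing import List

NEW_COLUMNS = ("chunks_retrieved_per_subq_count", "chunks_filtered_per_subq_count")


def add_per_subq_count_columns(conn: sqlite3.Connection) -> List[str]:
    """Add the two count columns if missing; return the names actually added."""
    # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
    existing = {row[1] for row in conn.execute("PRAGMA table_info(query_history)")}
    added: List[str] = []
    for column in NEW_COLUMNS:
        if column not in existing:
            conn.execute(f"ALTER TABLE query_history ADD COLUMN {column} TEXT DEFAULT NULL")
            added.append(column)
    conn.commit()
    return added


conn = sqlite3.connect(":memory:")
# Simplified stand-in for the real query_history table.
conn.execute("CREATE TABLE query_history (id INTEGER PRIMARY KEY, question TEXT)")
first_run = add_per_subq_count_columns(conn)
second_run = add_per_subq_count_columns(conn)  # nothing left to add

# Counts are stored as a JSON array string, one entry per sub-question.
conn.execute(
    "INSERT INTO query_history (question, chunks_retrieved_per_subq_count) VALUES (?, ?)",
    ("What are NEC4 time extension clauses?", json.dumps([10, 8])),
)
stored = json.loads(
    conn.execute("SELECT chunks_retrieved_per_subq_count FROM query_history").fetchone()[0]
)
```

Running the helper on every start is safe because the second call finds both columns and does nothing.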

Task 4.4.3: Update history Pydantic models

File: backend/app/models/history.py

Add optional fields to QueryHistoryRecord and QueryHistoryDetail:

chunks_retrieved_per_subq_count: Optional[str] = None  # JSON array string
chunks_filtered_per_subq_count: Optional[str] = None    # JSON array string

Task 4.4.4: Update prompt templates

File: backend/app/core/sqlite_db.py (seed data)

New generate template:

"generate": (
    "You must answer each sub-question using ONLY the document chunks provided for it.\n"
    "Do not use any external knowledge.\n"
    "Format your answer as markdown sections — one section per sub-question.\n"
    "Each section should start with \"## Sub-question N: <the question>\"\n"
    "Each section should contain 1-5 bullet points.\n"
    "Cite your sources inline using bracket labels, e.g. [filename, page N].\n"
    "Place the citation at the end of each relevant bullet point.\n\n"
    "{context_sections}\n\n"
    "Answer:"
)

decompose and filter templates remain unchanged (they still use {question} placeholder — the orchestrator injects the right value at call time).

Task 4.4.5: Update PromptService to handle new template placeholder

File: backend/app/services/prompt_service.py

  • Add context_sections as a known placeholder for the generate step (optional, since str.replace simply leaves unknown placeholders untouched)
  • The reset_to_defaults() method must include the new generate template

Task 4.4.6: Update history detail API response

File: backend/app/routers/history.py

GET /api/v1/history/{id} response now includes chunks_retrieved_per_subq_count and chunks_filtered_per_subq_count when they are not NULL. Backward-compatible (older records return null for these fields).

Commit: "feat: Phase 4.4 history schema, prompt templates, and Pydantic model updates"

Sub-Phase 4.5: Frontend — Types & State Management

Test files to write first:

  • test_phase4_stream_state.test.tsx — Tests QueryStreamState handles new response shape
  • test_phase4_types.test.ts — Tests type compatibility

Task 4.5.1: Update TypeScript types

File: frontend/src/types/index.ts

New types:

interface SubQuestionSources {
  sub_question_index: number;
  sub_question_text: string;
  sources: SourceMetadata[];
}

interface QueryStreamCompletedEvent {
  phase: 'completed';
  answer: string;                              // Markdown with ## sections
  sub_question_sources: SubQuestionSources[];  // Grouped sources
}

interface QueryStreamDecomposedEvent {
  phase: 'decomposed';
  extracted_questions: string[];
}

type QueryStreamEvent = 
  | QueryStreamDecomposedEvent
  | { phase: 'retrieving' | 'filtering' | 'generating' }
  | QueryStreamCompletedEvent
  | { phase: 'error'; message: string };

Task 4.5.2: Update QueryStreamState and mutation handler

File: frontend/src/lib/queries.tsx

Changes:

interface QueryStreamState {
  extractedQuestions: string[] | null;
  answer: string | null;                        // Full markdown
  subQuestionSources: SubQuestionSources[] | null;  // NEW — grouped sources
  phase: 'idle' | 'decomposing' | 'retrieving' | 'filtering' | 'generating' | 'completed' | 'error';
  error: Error | null;
}

In the completed case:

case 'completed':
  setState(prev => ({
    ...prev,
    answer: event.answer,
    subQuestionSources: event.sub_question_sources,
    phase: 'completed',
  }));
  break;

Commit: "feat: Phase 4.5 frontend types and state management for per-sub-q responses"

Sub-Phase 4.6: Frontend — ResponsePanel & ExtractedQuestionsDisplay

Test files to write first:

  • test_phase4_response_panel.test.tsx — Tests per-sub-question section rendering
  • test_phase4_citation_parser.test.ts — Tests per-sub-question citation lookup

Task 4.6.1: Redesign ResponsePanel for sub-question sections

File: frontend/src/components/ResponsePanel.tsx

Current: single ReactMarkdown block + flat sources grid.

New layout:

┌─────────────────────────────────────────────────────┐
│  📋 Response                           [Copy All]   │
├─────────────────────────────────────────────────────┤
│                                                      │
│  ┌─ Sub-question 1: What are time extensions? ─────┐│
│  │                                                    │
│  │  • Time extensions must be notified...             │
│  │    [NEC4 ACC.pdf, page 3]                          │
│  │  • The project manager must acknowledge...         │
│  │    [NEC4 Contract.pdf, page 12]                    │
│  │                                                    │
│  │  Sources (2)                          [Expand ▼]  │
│  │  ┌──────────────────────────────────────────────┐ │
│  │  │ NEC4 ACC.pdf, Page 3  │ NEC4 Contract, p12 │ │
│  │  │ "Clause 61.3 states.." │ "Notice must be..." │ │
│  │  └──────────────────────────────────────────────┘ │
│  └────────────────────────────────────────────────────┘│
│                                                      │
│  ┌─ Sub-question 2: What notice is required? ───────┐│
│  │                                                    │
│  │  • Written notice must be given...                  │
│  │    [NEC4 ACC.pdf, page 7]                           │
│  │                                                    │
│  │  Sources (1)                          [Expand ▼]  │
│  └────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────┘

Implementation approach:

  1. Parse the answer markdown into sections using ## Sub-question N: headers
  2. Map each section to its SubQuestionSources by matching index
  3. Render each section as an accordion/card with:
    • Header: sub-question text (from SubQuestionSources)
    • Body: ReactMarkdown for bullet points (with inline citation links)
    • Footer: collapsible sources grid (only sources belonging to this sub-question)
  4. Keep the existing citation link behavior (clickable [filename, page N] → PDF viewer)

Task 4.6.2: Update citationParser.ts for per-sub-question lookup

File: frontend/src/utils/citationParser.ts

Current: buildCitationLookup(sources: SourceMetadata[]) — returns a single global map.

New: buildCitationLookup(subQuestionSources: SubQuestionSources[]) — returns a map scoped to the correct sources for each section. The citation [filename, page N] match is looked up in the relevant sub-question's source list.

Task 4.6.3: Update ExtractedQuestionsDisplay for anchors

File: frontend/src/components/ExtractedQuestionsDisplay.tsx

Minor enhancement:

  • Make each extracted question a clickable anchor that scrolls to its corresponding section in the answer
  • Add id="subq-{index}" to each section header in ResponsePanel
  • Keep existing skeleton loading behavior

Commit: "feat: Phase 4.6 frontend per-sub-question response rendering"

Sub-Phase 4.7: Testing & Polish

Test files to write:

  • test_phase4_integration_query_pipeline.py — Full integration test simulating per-sub-q pipeline
  • test_phase4_acceptance_query.py — Acceptance test with real LLM (manual run)
  • test_phase4_e2e_query_flow.test.tsx — Frontend e2e test with mocked SSE stream

Task 4.7.1: Backend unit tests

  • Run pytest backend/app/test/test_phase4_*.py -v — all must pass
  • Verify no regressions in existing Phase 1 and Phase 3 tests
  • Update test_phase1_rag_service.py for new method signatures
  • Update test_phase1_relevance_filter.py for per-sub-q behavior
  • Rewrite test_phase3_query_history_integration.py for new pipeline flow
  • Update test_phase3_prompt_injection.py for new generate template

Task 4.7.2: Backend acceptance tests

  • test_phase4_acceptance_query.py — real LLM, real ChromaDB
  • Verify: answer contains ## Sub-question headers, sources grouped by sub-question index
  • Verify: each sub-question section has 1-5 bullet points
  • Verify: inline citations match the correct sub-question's source list
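The three checks above could be automated with a small parser over the answer markdown. This is a sketch, not the project's test code: the regular expressions assume the exact "## Sub-question N: ..." header and "[filename, page N]" citation formats shown earlier in this plan.

```python
import re
from typing import Any, Dict, List

HEADER_RE = re.compile(r"^## Sub-question (\d+): (.+)$", re.MULTILINE)
CITATION_RE = re.compile(r"\[([^\[\]]+), page (\d+)\]")


def split_answer_sections(answer: str) -> List[Dict[str, Any]]:
    """Split the answer on section headers and collect bullets and citations."""
    matches = list(HEADER_RE.finditer(answer))
    sections: List[Dict[str, Any]] = []
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(answer)
        body = answer[match.end():end].strip()
        bullets = [
            line.strip()
            for line in body.splitlines()
            if line.lstrip().startswith(("- ", "* "))
        ]
        citations = [
            (name, int(page)) for name, page in CITATION_RE.findall(body)
        ]
        sections.append(
            {
                "number": int(match.group(1)),
                "question": match.group(2).strip(),
                "bullets": bullets,
                "citations": citations,
            }
        )
    return sections


answer = (
    "## Sub-question 1: What are time extensions?\n"
    "- Time extensions must be notified to the project manager within 2 weeks [NEC4 ACC.pdf, page 3]\n"
    "- The project manager must acknowledge the notice within 1 week [NEC4 Contract.pdf, page 12]\n"
    "\n"
    "## Sub-question 2: What notice is required?\n"
    "- Written notice must be given [NEC4 ACC.pdf, page 7]\n"
)
sections = split_answer_sections(answer)
```

An acceptance test can then assert that every section has between 1 and 5 bullets and that each (filename, page) pair appears in the source list for the same sub-question index.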

Task 4.7.3: Frontend tests

  • test_phase4_response_panel.test.tsx — renders per-sub-question sections, expandable sources
  • test_phase4_citation_parser.test.ts — per-sub-question lookup returns correct source
  • test_phase4_e2e_query_flow.test.tsx — mocks SSE with new event format, verifies section rendering
  • Update existing ResponsePanel.test.tsx and citationParser.test.ts for new API

Task 4.7.4: Frontend build verification

  • npm run build — no TypeScript errors
  • npm test — all 62 existing tests pass + new Phase 4 tests
  • Verify manual flow: ask question → see extracted questions → see per-sub-question answer sections → expand sources per section

Task 4.7.5: Error handling

  • Empty decomposition: if decompose() returns [], fall back to using original question as single sub-question
  • Empty retrieval for some sub-questions: that sub-question gets no chunks → section shows "No relevant information found"
  • Filter failure (all chunks below threshold): that sub-question gets no answer → graceful empty section
  • JSON parse failure in filter: fall back to including all chunks (no filtering) for that sub-question
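The first two fallbacks are small enough to sketch directly. Both helper names are hypothetical, and the exact wording and layout of the empty section is an assumption based on the message quoted above.

```python
from typing import List


def resolve_sub_questions(original_question: str, decomposed: List[str]) -> List[str]:
    """Fall back to the original question when decomposition yields nothing usable."""
    cleaned = [q.strip() for q in decomposed if q and q.strip()]
    return cleaned if cleaned else [original_question]


def empty_section(number: int, sub_question: str) -> str:
    """Render a placeholder section for a sub-question that has no chunks."""
    return f"## Sub-question {number}: {sub_question}\n- No relevant information found"


fallback = resolve_sub_questions("What are NEC4 time extension clauses?", [])
normal = resolve_sub_questions("ignored", ["  What notice is required? ", ""])
placeholder = empty_section(2, "What notice is required?")
```

Keeping the placeholder in the same header format means the frontend section parser needs no special case for empty sub-questions.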

Task 4.7.6: Documentation

  • Update AGENTS.md with new pipeline architecture section
  • Add docstrings to all new methods (retrieve_per_subquestion, filter_per_subquestion, generate_response_per_subquestion)
  • Update prompt template documentation in system prompts page

Commit: "feat: Phase 4.7 testing, error handling, and polish for per-sub-q pipeline"


Phase 4a: Prompt Service Integration for Per-Sub-Q Filter (2026-04-27)

Root issue: filter_per_subquestion() in relevance_filter.py had a hardcoded prompt (_build_per_subq_prompt()) — completely bypassing PromptService. Users could not edit the per-sub-q filter prompt on the System Prompts page, unlike the flat filter step which was already prompt-service-driven.

Solution: Broke the per-sub-q filter prompt into 3 composable pieces, each a separately editable step on the System Prompts page:

| Step Name | Label | Placeholders | Default |
|-----------|-------|--------------|---------|
| filter_intro | Step 2.1: Filter Intro (Preamble) | (none) | "Evaluate each chunk for relevance to its associated sub-question only." |
| filter_section | Step 2.2: Filter Section (Per Sub-Q) | {subq_idx}, {subq_question}, {chunks} | 'Sub-question {subq_idx}: "{subq_question}"\n{chunks}' |
| filter_outro | Step 2.3: Filter Outro (Format) | (none) | JSON format instructions + example |

The RelevanceFilter._build_per_subq_prompt() now composes them at runtime:

filter_intro + [filter_section.replace(...) for each sub-q] + filter_outro

Falls back to built-in defaults when PromptService is unavailable.
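A minimal sketch of that composition follows. The intro and section defaults are taken from the table above; the outro text is a stand-in borrowed from the Sub-Phase 4.2 prompt design, and both the "Chunk N:" line format and the blank-line separator between pieces are assumptions.

```python
from typing import Dict, List, Optional

DEFAULTS: Dict[str, str] = {
    "filter_intro": "Evaluate each chunk for relevance to its associated sub-question only.",
    "filter_section": 'Sub-question {subq_idx}: "{subq_question}"\n{chunks}',
    # Stand-in outro; the real default is only described as "JSON format instructions + example".
    "filter_outro": (
        "For each chunk, rate relevance 0-10 considering ONLY its associated sub-question.\n"
        "Return a JSON object mapping sub-question indices to arrays of scores:\n"
        '{"0": [8.5, 3.2, 9.0], "1": [7.0, 6.5, 9.1]}'
    ),
}


def build_per_subq_prompt(
    sub_questions: List[str],
    sub_chunks: List[List[str]],
    templates: Optional[Dict[str, str]] = None,
) -> str:
    """Compose intro + one rendered section per sub-question + outro."""
    # User-edited templates override the built-in defaults key by key.
    active = {**DEFAULTS, **(templates or {})}
    sections: List[str] = []
    for idx, (question, chunks) in enumerate(zip(sub_questions, sub_chunks)):
        chunk_lines = "\n".join(f"Chunk {i}: {text}" for i, text in enumerate(chunks))
        sections.append(
            active["filter_section"]
            .replace("{subq_idx}", str(idx))
            .replace("{subq_question}", question)
            .replace("{chunks}", chunk_lines)
        )
    return "\n\n".join([active["filter_intro"], *sections, active["filter_outro"]])


prompt = build_per_subq_prompt(
    ["What are time extensions?", "What notice is required?"],
    [["a", "b"], ["c"]],
)
```

Passing `templates=None` reproduces the fallback path, while a partial dictionary lets a profile override a single piece without redefining the other two.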

Bugs Fixed

  1. generate_per_subq not seeded: rag.py called get_prompt_template("generate_per_subq") but this step name was never added to _VALID_STEPS, _SEED_STEPS, or _SEED_TEMPLATES — would crash at runtime with ValueError. Now properly seeded with {context_sections} placeholder.

  2. _SEED_GENERATE placeholder mismatch from Package 4: The flat generate_response() expects {question}/{context} placeholders, but Package 4 changed the seed template to use {context_sections} (intended for per-sub-q generate). Restored flat template; generate_per_subq now holds {context_sections}.

Database Backfill Migration

The existing seed_default_profiles() only inserted steps for NEWLY created profiles. Added a backfill loop that iterates ALL existing profiles and INSERT OR IGNOREs any missing step names. This ensures existing A/B/C profiles pick up filter_intro, filter_section, filter_outro, and generate_per_subq on restart.
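The backfill idea can be sketched with the standard sqlite3 module. The two-table schema, the UNIQUE (profile_id, step) constraint and the template strings below are assumptions made for the example; INSERT OR IGNORE relies on such a constraint to skip rows that already exist.

```python
import sqlite3
from typing import Dict

# Placeholder seed texts; the real templates live in sqlite_db.py.
SEED_TEMPLATES: Dict[str, str] = {
    "decompose": "seed decompose",
    "filter": "seed filter",
    "generate": "seed generate",
    "filter_intro": "seed filter_intro",
    "filter_section": "seed filter_section",
    "filter_outro": "seed filter_outro",
    "generate_per_subq": "seed generate_per_subq",
}


def backfill_missing_steps(conn: sqlite3.Connection, seed: Dict[str, str]) -> int:
    """Insert any missing (profile, step) rows; return how many rows were added."""
    count_sql = "SELECT COUNT(*) FROM prompts"
    before = conn.execute(count_sql).fetchone()[0]
    profile_ids = [row[0] for row in conn.execute("SELECT id FROM profiles")]
    for profile_id in profile_ids:
        for step, template in seed.items():
            # Existing rows hit the UNIQUE constraint and are left untouched.
            conn.execute(
                "INSERT OR IGNORE INTO prompts (profile_id, step, template) VALUES (?, ?, ?)",
                (profile_id, step, template),
            )
    conn.commit()
    return conn.execute(count_sql).fetchone()[0] - before


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE profiles (id TEXT PRIMARY KEY)")
conn.execute(
    "CREATE TABLE prompts (profile_id TEXT, step TEXT, template TEXT, UNIQUE (profile_id, step))"
)
for profile_id in ("A", "B", "C"):
    conn.execute("INSERT INTO profiles VALUES (?)", (profile_id,))
    for step in ("decompose", "filter", "generate"):
        conn.execute("INSERT INTO prompts VALUES (?, ?, ?)", (profile_id, step, f"user-edited {step}"))

first_run = backfill_missing_steps(conn, SEED_TEMPLATES)
second_run = backfill_missing_steps(conn, SEED_TEMPLATES)
total = conn.execute("SELECT COUNT(*) FROM prompts").fetchone()[0]
edited = conn.execute(
    "SELECT template FROM prompts WHERE profile_id = 'A' AND step = 'decompose'"
).fetchone()[0]
```

In this example three profiles with three steps each grow from 9 to 21 prompt rows, and templates a user has already edited are preserved because their rows are skipped.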

System Prompts UI Restructured

The flat filter and generate steps were removed from the UI (they're unused by the current pipeline). The page now shows 5 steps:

| UI Order | Label | Step Key |
|----------|-------|----------|
| 1 | Step 1: Query Decomposition | decompose |
| 2 | Step 2.1: Filter Intro (Preamble) | filter_intro |
| 3 | Step 2.2: Filter Section (Per Sub-Q) | filter_section |
| 4 | Step 2.3: Filter Outro (Format) | filter_outro |
| 5 | Step 3: Generate (Per-Sub-Question) | generate_per_subq |

The old filter and generate templates remain in the DB (for API backward compatibility) but are hidden from the UI.

Files Changed

| File | Change |
|------|--------|
| backend/app/core/sqlite_db.py | 3 new seed templates + generate_per_subq seed; backfill migration; restored _SEED_GENERATE to {question}/{context} |
| backend/app/services/prompt_service.py | Added 4 step names to _VALID_STEPS |
| backend/app/routers/prompts.py | Added 4 step names to _VALID_STEPS |
| backend/app/services/relevance_filter.py | Refactored _build_per_subq_prompt() to use PromptService + built-in fallback constants |
| frontend/src/components/PromptEditor.tsx | Replaced unused flat steps with 5-step per-sub-q layout (Step 2.1-2.3 + Step 3) |
| frontend/src/components/PlaceholderDocs.tsx | Added {context_sections}, {subq_idx}, {subq_question} docs |
| backend/app/test/conftest.py | Added 4 new templates to mock |
| backend/app/test/test_phase3_sqlite_db.py | Updated counts (9→21 prompts) and placeholder assertions |
| backend/app/test/test_phase3_prompt_service.py | Updated step set + placeholder assertions |
| backend/app/test/test_phase3_prompts_router.py | Updated step set assertion |
| backend/app/test/test_phase4_prompt_templates.py | Updated for split generate/generate_per_subq |
| frontend/src/test/components/PromptEditor.test.tsx | Updated to 5 textareas, new labels, new placeholder layout |
| frontend/src/test/components/PlaceholderDocs.test.tsx | Updated to 6 placeholders |

Test Results (Post-Phase 4a)

  • Backend: 295 passed, 5 skipped (pre-existing)
  • Frontend: 182 passed, 1 pre-existing failure (unrelated file-input e2e)

Sub-Phase Summary

| Sub-Phase | Scope | Backend | Frontend | Tests | Status |
| --- | --- | --- | --- | --- | --- |
| 4.1 | Per-sub-q retrieval | rag.py, query.py, format helpers | None | test_phase4_retrieve_per_subquestion.py, test_phase4_query_router_retrieval.py | Complete |
| 4.2 | Per-sub-q filtering (1 LLM call) | relevance_filter.py, query.py | None | test_phase4_relevance_filter_per_subq.py, test_phase4_query_router_filter.py | Complete |
| 4.3 | Sub-q-organized response generation | rag.py, query.py, models/query.py | None | test_phase4_generate_per_subq.py, test_phase4_response_format.py | Complete |
| 4.4 | History schema, prompts, models | sqlite_db.py, history.py (router + models), prompt_service.py | None | test_phase4_history_format.py, test_phase4_prompt_templates.py | Complete |
| 4.5 | Frontend types + state | None | types/index.ts, lib/queries.tsx | test_phase4_stream_state.test.tsx, test_phase4_types.test.ts | Complete |
| 4.6 | Frontend rendering | None | ResponsePanel.tsx, citationParser.ts, ExtractedQuestionsDisplay.tsx | test_phase4_response_panel.test.tsx, test_phase4_citation_parser.test.ts | Complete |
| 4.7 | Testing & polish | All affected files | All affected files | Integration + acceptance + e2e tests | Complete |
| 4a | Prompt service integration for filter_per_subq | sqlite_db.py, prompt_service.py, prompts.py, relevance_filter.py | PromptEditor.tsx, PlaceholderDocs.tsx | Updated 7 test files, 13 total files changed | Complete |

Implementation Sequence & Dependencies

4.1 (Retrieval) ──┐
                  ├──► 4.2 (Filtering) ──► 4.3 (Generate) ──► 4.4 (History/Prompts)
                  │                                                    │
                  │                                                    ▼
                  │                                         4.5 (Frontend Types/State)
                  │                                                    │
                  │                                                    ▼
                  │                                         4.6 (Frontend Rendering)
                  │                                                    │
                  └─────────────────────────────────────────────────────▼
                                                              4.7 (Testing & Polish)
  • 4.1 → 4.2 sequential: Filtering needs per-sub-q chunk structure from retrieval
  • 4.2 → 4.3 sequential: Generation needs filtered chunks from filtering stage
  • 4.3 → 4.4 sequential: History recording and prompt templates need final data shapes
  • 4.4 → 4.5 parallel: Backend prompt/history changes don't block frontend type definitions
  • 4.5 → 4.6 sequential: Rendering needs types and state management
  • 4.7 blocked by all: Integration tests need everything wired together

Parallelization opportunity: 4.5 (frontend types) could start as soon as 4.3 defines the SSE contract, but it's safer to start after 4.4 confirms the final data shapes.


Affected Files — Complete Inventory

Backend — New Files

| File | Sub-Phase | Purpose |
| --- | --- | --- |
| backend/app/test/test_phase4_retrieve_per_subquestion.py | 4.1 | Unit test: retrieve_per_subquestion() |
| backend/app/test/test_phase4_query_router_retrieval.py | 4.1 | Unit test: retrieval stage in _query_stream |
| backend/app/test/test_phase4_relevance_filter_per_subq.py | 4.2 | Unit test: filter_per_subquestion() |
| backend/app/test/test_phase4_query_router_filter.py | 4.2 | Unit test: filter stage in _query_stream |
| backend/app/test/test_phase4_generate_per_subq.py | 4.3 | Unit test: generate_response_per_subquestion() |
| backend/app/test/test_phase4_response_format.py | 4.3 | Unit test: answer format validation |
| backend/app/test/test_phase4_history_format.py | 4.4 | Unit test: new XML/JSON history formats |
| backend/app/test/test_phase4_prompt_templates.py | 4.4 | Unit test: new generate template |
| backend/app/test/test_phase4_integration_query_pipeline.py | 4.7 | Integration test: full per-sub-q pipeline |
| backend/app/test/acceptance/test_phase4_acceptance_query.py | 4.7 | Acceptance test: real LLM |

Backend — Modified Files

| File | Sub-Phase | Changes |
| --- | --- | --- |
| backend/app/services/rag.py | 4.1, 4.3 | Add retrieve_per_subquestion(), generate_response_per_subquestion() |
| backend/app/services/relevance_filter.py | 4.2 | Add filter_per_subquestion() |
| backend/app/routers/query.py | 4.1–4.4 | Refactor _query_stream(), add per-sub-q format helpers, update history recording |
| backend/app/models/query.py | 4.3 | Add SubQuestionSources model, update QueryResponse |
| backend/app/models/history.py | 4.4 | Add optional per-sub-q count fields |
| backend/app/core/sqlite_db.py | 4.4 | Add new columns, update seed generate template |
| backend/app/services/prompt_service.py | 4.4 | Update reset_to_defaults() generate template |
| backend/app/routers/history.py | 4.4 | Include new fields in detail response |
| backend/app/core/config.py | 4.1 | (Maybe) Add retrieval_n_results_per_subq setting |

Backend — Tests Needing Update

| File | Sub-Phase | Changes |
| --- | --- | --- |
| backend/app/test/test_phase1_rag_service.py | 4.7 | Add tests for new methods; existing tests unaffected |
| backend/app/test/test_phase1_relevance_filter.py | 4.7 | Add tests for filter_per_subquestion() |
| backend/app/test/test_phase3_query_history_integration.py | 4.7 | Rewrite pipeline simulation for per-sub-q flow |
| backend/app/test/test_phase3_prompt_injection.py | 4.7 | Add tests for new generate template |
| backend/app/test/acceptance/test_acceptance_phase1_rag_query.py | 4.7 | Rewrite: SSE parsing + new response shape |
| backend/app/test/conftest.py | 4.7 | Add per-sub-q mock helpers |

Frontend — New Files

| File | Sub-Phase | Purpose |
| --- | --- | --- |
| frontend/src/test/components/test_phase4_response_panel.test.tsx | 4.7 | Component test: per-sub-q sections |
| frontend/src/test/utils/test_phase4_citation_parser.test.ts | 4.7 | Unit test: per-sub-q citation lookup |
| frontend/src/test/e2e/test_phase4_query_flow.test.tsx | 4.7 | E2E test: mocked SSE with new format |
| frontend/src/test/lib/test_phase4_stream_state.test.tsx | 4.5 | State test: new event shapes |
| frontend/src/test/lib/test_phase4_types.test.ts | 4.5 | Type test: type compatibility |

Frontend — Modified Files

| File | Sub-Phase | Changes |
| --- | --- | --- |
| frontend/src/types/index.ts | 4.5 | Add SubQuestionSources, update QueryStreamEvent |
| frontend/src/lib/queries.tsx | 4.5 | Update QueryStreamState, completed event handler |
| frontend/src/components/ResponsePanel.tsx | 4.6 | Redesign: per-sub-question sections with grouped sources |
| frontend/src/utils/citationParser.ts | 4.6 | Update buildCitationLookup() for per-sub-q |
| frontend/src/components/ExtractedQuestionsDisplay.tsx | 4.6 | Add anchor links to answer sections |
| frontend/src/pages/LTTPage.tsx | 4.6 | Pass new props to children |

Risk Register

| Risk | Likelihood | Impact | Mitigation |
| --- | --- | --- | --- |
| LLM struggles with per-sub-q filtering prompt format | Medium | High: all chunks dropped | Use strong prompt constraints, validate JSON, fall back to including all chunks on parse failure |
| LLM generates answer not matching ## Sub-question N: format | Medium | Medium: frontend can't parse sections | Fall back to rendering as single block if parsing fails. Prompt engineering tuned for format compliance |
| Same chunk retrieved by multiple sub-questions → duplicated in context | High | Low: slightly larger prompt but acceptable | Accept duplicates. ChromaDB naturally returns the same doc if it is relevant to multiple queries. Each sub-q's evaluation is independent |
| Per-sub-q retrieval = more ChromaDB queries = slower | Medium | Medium: N × retrieval latency | ChromaDB retrieval is fast (~10-50 ms per query), so 5 sub-questions cost roughly 50-250 ms in total. Acceptable trade-off for better relevance. |
| History DB migration fails for existing records | Low | Low: new columns are NULL-able | ALTER TABLE ADD COLUMN ... DEFAULT NULL is safe. Existing records work as before; chunks_retrieved/chunks_filtered still have flat XML. |
| Frontend rendering breaks on older history records | Low | Low: answer format differs | ResponsePanel renders per-sub-q sections only when subQuestionSources is non-null. Older history records show flat answer as before. |
| Prompt template migration breaks user-customized prompts | Medium | Medium: users lose their generate template | Warn in docs. The generate template changes fundamentally (single {context} → {context_sections}). Users must re-customize. |
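
The first mitigation (validate the filter JSON and keep all chunks when it cannot be used) could be sketched as below. This is an illustration under assumptions, not the project's `filter_per_subquestion()`: the function name, the 0.5 threshold, the 0-1 score scale, and the exact response shape (`{"1": [scores], "2": [scores]}`) are hypothetical.

```python
import json
from typing import Dict, List


def select_relevant(
    raw_response: str,
    chunks_by_subq: Dict[int, List[str]],
    threshold: float = 0.5,  # assumed cut-off; the real threshold may differ
) -> Dict[int, List[str]]:
    """Keep chunks scoring >= threshold; keep all chunks where scores are unusable."""
    try:
        parsed = json.loads(raw_response)
    except json.JSONDecodeError:
        parsed = None
    if not isinstance(parsed, dict):
        parsed = {}  # whole response unusable: every sub-question falls back below

    kept: Dict[int, List[str]] = {}
    for idx, chunks in chunks_by_subq.items():
        scores = parsed.get(str(idx))  # JSON object keys are always strings
        valid = (
            isinstance(scores, list)
            and len(scores) == len(chunks)
            and all(
                isinstance(s, (int, float)) and not isinstance(s, bool)
                for s in scores
            )
        )
        if not valid:
            kept[idx] = list(chunks)  # degrade per sub-question, not globally
            continue
        kept[idx] = [c for c, s in zip(chunks, scores) if s >= threshold]
    return kept
```

Validating each sub-question separately means one malformed score array does not force every other sub-question back to its unfiltered chunk list.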

Acceptance Criteria

Backend

  • POST /api/v1/query retrieves chunks per sub-question (verified by history XML showing <sub_q> wrappers)
  • Filtering uses single LLM call evaluating chunks against their originating sub-question (verified by filter prompt)
  • Response answer is organized by sub-question with ## Sub-question N: headers
  • sub_question_sources in SSE completed event is grouped by sub-question index
  • History records include new grouped XML formats for chunks_retrieved and chunks_filtered
  • History records include grouped sources JSON (list of lists)
  • History records include per-sub-q chunk counts
  • New generate prompt template uses {context_sections} placeholder
  • Prompt service reset_to_defaults() includes new generate template
  • Existing decompose, filter (old), generate_response (old) methods are unchanged
  • All Phase 1, Phase 3, and new Phase 4 unit tests pass (312 passed, 4 skipped)
  • All acceptance tests pass with real LLM (manual run)
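
The grouped history XML referred to in these criteria might be produced along the lines of the sketch below. The `<sub_q idx="N" question="...">` wrapper follows this plan; the `<chunks>` root element, the `<chunk>` child element, and the function name are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple


def chunks_to_grouped_xml(groups: List[Tuple[str, List[str]]]) -> str:
    """Serialize (question, chunk texts) pairs as <sub_q>-wrapped XML."""
    root = ET.Element("chunks")  # hypothetical root element
    for idx, (question, chunks) in enumerate(groups, start=1):
        # ElementTree escapes special characters in attributes and text.
        sub_q = ET.SubElement(root, "sub_q", {"idx": str(idx), "question": question})
        for text in chunks:
            ET.SubElement(sub_q, "chunk").text = text  # hypothetical child element
    return ET.tostring(root, encoding="unicode")
```

Building the tree with ElementTree rather than string concatenation keeps the output well-formed even when a sub-question contains quotes or ampersands.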

Frontend

  • QueryStreamState includes subQuestionSources field
  • ResponsePanel renders per-sub-question sections with expandable source grids
  • Each section's sources are scoped to that sub-question (no cross-contamination)
  • Inline citations [filename, page N] link to the correct PDF viewer page
  • ExtractedQuestionsDisplay shows clickable anchors to answer sections
  • Copy button copies all answer text including section headers
  • Loading states: skeleton per section during generation
  • Empty state: "No relevant information found" per sub-question (not entire response)
  • All 62+ existing frontend tests still pass (183 passed)
  • All new Phase 4 frontend tests pass
  • npm run build succeeds with zero TypeScript errors
  • Manual verification: full query flow works end-to-end

New Dependencies

None. All changes use existing libraries (FastAPI, ChromaDB, OpenAI SDK, React, ReactMarkdown, TanStack Query).


Decisions (All Confirmed)

| # | Topic | Decision |
| --- | --- | --- |
| 1 | Single vs multiple filter LLM calls | Single call; user explicitly requested this |
| 2 | Filter prompt design | Group chunks by sub-question in one prompt. JSON response maps sub-q indices to score arrays |
| 3 | Answer format | Markdown with ## Sub-question N: <question> headers |
| 4 | Sources grouping | sub_question_sources: [{index, text, sources}, ...] in SSE + frontend |
| 5 | History XML format | Add <sub_q idx="N" question="..."> wrappers around chunk groups |
| 6 | History DB migration | Add 2 new NULL-able columns. No data migration needed. |
| 7 | Backward compatibility | Preserve old retrieve(), filter(), generate_response() methods. New methods are additive. |
| 8 | Deduplication | None. Same chunk may appear in multiple sub-questions. Each sub-q evaluates independently. |
| 9 | Error handling | Per-sub-question graceful degradation. Filter failure → include all chunks for that sub-q. Generate failure → "Unable to generate answer for this sub-question." |
| 10 | Frontend rendering engine | Keep ReactMarkdown. Parse sections client-side by splitting on ## Sub-question N: headers. |
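
The section-splitting logic in decision 10 runs client-side in the frontend; the sketch below shows the same idea in Python purely to illustrate the parsing and its fallback. The regular expression and the function name are assumptions, and returning `None` stands in for "render the answer as a single block".

```python
import re
from typing import List, Optional, Tuple

# Matches header lines such as "## Sub-question 2: Why?".
HEADER = re.compile(r"^## Sub-question (\d+):[ \t]*(.*)$", re.MULTILINE)


def split_sections(answer: str) -> Optional[List[Tuple[int, str, str]]]:
    """Return (index, question, body) per section, or None if no header is found."""
    matches = list(HEADER.finditer(answer))
    if not matches:
        return None  # caller falls back to rendering one monolithic block
    sections = []
    for i, m in enumerate(matches):
        # A section body runs until the next header or the end of the answer.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(answer)
        body = answer[m.end():end].strip()
        sections.append((int(m.group(1)), m.group(2).strip(), body))
    return sections
```

Note that this sketch drops any text that precedes the first header; a real renderer may prefer to show it as a preamble.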

Open Questions

None — all resolved.

| # | Question | Resolution |
| --- | --- | --- |
| 1 | Progressive SSE events? | Yes: emit generating_subquestion as each sub-question's answer is generated. Frontend renders sections progressively. |
| 2 | retrieval_n_results per sub-question or global? | Global: same value for all sub-questions. Simpler config, one setting. |
| 3 | Fallback when decomposition returns 0 sub-questions? | Fall back to original question: treat it as a single sub-question. Pipeline runs as 1-sub-q case (retrieval via original question, no filtering needed for single sub-q, flat answer). |
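
Resolution 3 amounts to a small normalization step before retrieval. A minimal sketch, assuming a hypothetical helper name and assuming that blank strings from the decomposer should be treated the same as missing sub-questions:

```python
from typing import List


def effective_sub_questions(original_question: str, decomposed: List[str]) -> List[str]:
    """Return the usable sub-questions, or the original question as the only one."""
    # Dropping empty/whitespace-only entries is an assumption, not a documented rule.
    cleaned = [q.strip() for q in decomposed if q and q.strip()]
    return cleaned or [original_question.strip()]
```

With this in place the rest of the pipeline never has to special-case an empty list; the zero-sub-question case simply becomes the one-sub-question case.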

Test Plan Summary

Backend (New Tests)

| File | Tests | Coverage |
| --- | --- | --- |
| test_phase4_retrieve_per_subquestion.py | ~6 | Per-sub-q retrieval, empty input, single sub-q, dedup behavior |
| test_phase4_query_router_retrieval.py | ~4 | SSE events during retrieval, chunk XML format |
| test_phase4_relevance_filter_per_subq.py | ~6 | Per-sub-q filtering, JSON response parsing, threshold behavior |
| test_phase4_query_router_filter.py | ~4 | SSE events during filtering, filtered XML format |
| test_phase4_generate_per_subq.py | ~5 | Per-sub-q generate, prompt construction, answer format |
| test_phase4_response_format.py | ~4 | Answer has ## headers, citations in correct sections |
| test_phase4_history_format.py | ~5 | New XML/JSON formats, per-sub-q counts |
| test_phase4_prompt_templates.py | ~3 | New generate template, {context_sections} placeholder |
| test_phase4_integration_query_pipeline.py | ~5 | Full pipeline simulation |
| test_phase4_acceptance_query.py | ~3 | Real LLM end-to-end (manual) |

Frontend (New Tests)

| File | Tests | Coverage |
| --- | --- | --- |
| test_phase4_stream_state.test.tsx | ~4 | State updates for new event shapes |
| test_phase4_types.test.ts | ~2 | Type compatibility checks |
| test_phase4_response_panel.test.tsx | ~6 | Section rendering, source grouping, copy, loading |
| test_phase4_citation_parser.test.ts | ~4 | Per-sub-q lookup, cross-section isolation |
| test_phase4_e2e_query_flow.test.tsx | ~3 | Full SSE flow with mocked stream |