# Package 4 Enhancement Plan — Per-Sub-Question RAG Pipeline
**Source**: User request (2026-04-26)
**Scope**: Refactor the 3-step RAG query pipeline so retrieval, filtering, and response generation are organized per sub-question instead of batch-flattened.
**Status**: ✅ Complete. All 7 sub-phases implemented (2026-04-26). Phase 4a Prompt Integration added (2026-04-27).

---
## Objective
Restructure the `POST /api/v1/query` pipeline so that:
1. **Retrieval per sub-question**: Each sub-question independently retrieves `n_results` chunks from ChromaDB (instead of joining all sub-questions into one query string).
2. **Filtering per sub-question**: Each chunk is evaluated for relevance against its **own originating sub-question** (not the original user question). One LLM call handles all filtering — the prompt is redesigned to group chunks by sub-question.
3. **Final answer organized by sub-question**: Each sub-question gets its own bullet-point answer with its own sources. The frontend renders answer sections per sub-question rather than one monolithic bullet list.
---
## Decision Register
| # | Decision | Rationale |
|---|----------|-----------|
| 1 | Keep `QueryDecomposer` unchanged | Input/output contract is identical — decomposition still produces a flat list of sub-questions |
| 2 | Single LLM call for filtering | User explicitly requested one call. Prompt redesigned to carry sub-question context for each chunk group |
| 3 | Keep `RAGService.retrieve()` signature | Call it N times (once per sub-question) from a new wrapper method rather than changing its contract |
| 4 | Add `retrieve_per_subquestion()` to `RAGService` | New method that iterates over sub-questions, calls `retrieve()` per question, returns grouped results |
| 5 | Add `generate_response_per_subquestion()` instead of changing `generate_response()` | Accepts parallel per-sub-question lists (`sub_questions`, `sub_chunks`, `sub_metadata`) instead of flat chunk lists; the existing method is preserved |
| 6 | SSE events: add `generating_subquestion` phase | Progressive streaming — frontend sees which sub-question is being answered |
| 7 | History: change XML/JSON formats in-place | Add `<sub_q>` wrappers to `chunks_retrieved`/`chunks_filtered` XML. Add sub-question grouping to `sources` JSON. No new DB columns. |
| 8 | Final answer format: markdown sections | `## Sub-question 1` headers with inline citations. Backward-compatible with existing `ReactMarkdown` rendering |
| 9 | Deduplicate chunks within a sub-question only | Same chunk may be retrieved by multiple sub-questions. Keep duplicates (different sub-questions need independent evaluation). ChromaDB `query()` naturally may return the same doc for different queries — this is acceptable. |
|10 | Prompt template: add `generate` placeholders | New placeholder `{context_sections}` replaces single `{context}`. Filter template unchanged (sub-question injected at call site). Decompose template unchanged. |
|11 | Progressive SSE events | Emit `generating_subquestion` event as each sub-question's answer section is generated. Frontend renders sections one by one. |
|12 | `retrieval_n_results` | Global — same value for all sub-questions. Use existing `settings.retrieval_n_results` config. |
|13 | Empty decomposition fallback | Treat original user question as single sub-question. Pipeline runs as 1-sub-q case — single retrieval, no filtering needed (one sub-q = no ambiguity), flat answer with `##` header. |
---
## Pipeline: Before vs After
### Before (Current — Flat Batch)
```
User Question: "What are NEC4 time extension clauses?"
┌────▼─────┐
│ Decompose│  LLM Call 1
│          │  → ["What are time extensions?",
│          │     "What notice is required?"]
└────┬─────┘
     │  joined: "What are time extensions? What notice is required?"
┌────▼─────┐
│ Retrieve │  1 ChromaDB query → 10 chunks (flat, no sub-q association)
└────┬─────┘
     │  10 chunks
┌────▼─────┐
│ Filter   │  LLM Call 2: all chunks scored against ORIGINAL question
│          │  Score > 7 → keep (flat, no sub-q association)
└────┬─────┘
     │  N filtered chunks
┌────▼─────┐
│ Generate │  LLM Call 3: flat answer from ALL filtered chunks
│          │  "• Time extensions require notice [NEC4 ACC.pdf, p3]
│          │   • The project manager must acknowledge [NEC4, p7]
│          │   • Notice is defined as..." (sources from all sub-qs mixed)
└────┬─────┘
     │  single SSE completed event
┌────▼─────┐
│ Frontend │  1 ReactMarkdown block, 1 flat sources list
└──────────┘
```
### After (Per-Sub-Question)
```
User Question: "What are NEC4 time extension clauses?"
┌────▼─────┐
│ Decompose│  LLM Call 1 (UNCHANGED)
│          │  → ["What are time extensions?",
│          │     "What notice is required?"]
└────┬─────┘
     │ sub_q1                    sub_q2
┌────▼─────┐              ┌────▼─────┐
│ Retrieve │              │ Retrieve │  2 ChromaDB queries → 10 chunks each
│ q1 → 10  │              │ q2 → 10  │  chunks tagged with sub-q index
└────┬─────┘              └────┬─────┘
     │                         │
     └─────────┬───────────────┘
               │  grouped: {sub_q0: [chunks 0-9], sub_q1: [chunks 10-19]}
          ┌────▼─────┐
          │ Filter   │  LLM Call 2 (SINGLE CALL, redesigned prompt)
          │          │  Each chunk scored against its OWN sub-question
          │          │  Returns grouped scores → filtered per sub-q
          └────┬─────┘
               │  filtered_by_subq: {0: [chunk_a, chunk_b], 1: [chunk_c]}
          ┌────▼─────┐
          │ Generate │  LLM Call 3 (redesigned prompt with per-sub-q context)
          │          │  ┌─────────────────────────────────────┐
          │          │  │ ## What are time extensions?        │
          │          │  │ - Time extensions must be notified  │
          │          │  │   [NEC4 ACC.pdf, page 3]            │
          │          │  │ - The project manager has 2 weeks   │
          │          │  │   [NEC4 Contract.pdf, page 12]      │
          │          │  │                                     │
          │          │  │ ## What notice is required?         │
          │          │  │ - Written notice must be given      │
          │          │  │   [NEC4 ACC.pdf, page 7]            │
          │          │  └─────────────────────────────────────┘
          └────┬─────┘
               │  SSE events: generating_subquestion (per sub-q) → completed
          ┌────▼─────┐
          │ Frontend │  Sections per sub-question, sources grouped per section
          └──────────┘
```
---
## Current State (Pre-Enhancement)
### Backend
| Component | File | Current Behavior |
|-----------|------|-----------------|
| Decomposer | `services/query_decomposer.py` | `decompose(question) -> (List[str], prompt)` — returns 2-5 sub-questions |
| Retrieval | `services/rag.py:retrieve()` | `query_text = " ".join(query_keywords)` — joins all sub-qs into ONE string, single ChromaDB query → flat chunk list |
| Filter | `services/relevance_filter.py` | `filter(question, chunks)` — ALL chunks scored against ORIGINAL question, single LLM call, flat output |
| Generate | `services/rag.py:generate_response()` | `generate_response(question, chunks, metadata)` — flat chunks → flat bullet answer |
| Orchestrator | `routers/query.py:_query_stream()` | Linear 4-stage pipeline: decompose → retrieve → filter → generate |
| SSE Events | `routers/query.py` | `decomposed → retrieving → filtering → generating → completed` — flat answer + sources in `completed` |
| History | `services/history_service.py` | Flat XML for `chunks_retrieved`/`chunks_filtered`. Flat JSON for `sources`. Single timing per stage. |
| Prompt templates | `prompt_service.py` + `sqlite_db.py` | 3 steps (`decompose`, `filter`, `generate`). Placeholders: `{question}`, `{chunks}`, `{context}` |
| Config | `core/config.py` | `retrieval_n_results=10`, `relevance_threshold=7.0` |
### Frontend
| Component | File | Current Behavior |
|-----------|------|-----------------|
| Types | `types/index.ts` | `QueryStreamEvent.phase`, flat `extracted_questions: string[]`, flat `answer: string`, flat `sources: SourceMetadata[]` |
| SSE Client | `lib/api.ts` | `queryDocumentStream()` — generic `JSON.parse` per `data:` line, no sub-question awareness |
| State | `lib/queries.tsx` | `QueryStreamState` with flat `answer`/`sources`/`extractedQuestions` |
| Response | `components/ResponsePanel.tsx` | Single `ReactMarkdown` block for answer. Flat 2-column grid for sources. No sub-question grouping. |
| Questions | `components/ExtractedQuestionsDisplay.tsx` | `<ol>` list of question strings. No sources attached. |
| Citations | `utils/citationParser.ts` | Flat `sources` lookup — `buildCitationLookup(sources)` returns global map |
| Progress | `components/PipelineProgress.tsx` | 4-step stepper (NOT currently wired in LTTPage) |
### Key Test Files
| File | Lines | Status |
|------|-------|--------|
| `test_phase1_query_decomposer.py` | 76 | ✅ Unchanged — decomposer contract stays |
| `test_phase1_rag_service.py` | 139 | 🔴 Needs update: add coverage for `retrieve_per_subquestion()` and `generate_response_per_subquestion()` (existing `retrieve()` and `generate_response()` stay as-is) |
| `test_phase1_relevance_filter.py` | 93 | 🟡 Needs update — one-call pattern changes to per-sub-q grouping |
| `test_phase1_query.py` | 97 | 🟢 Already skipped (SSE migration) — may un-skip later |
| `test_phase3_query_history_integration.py` | 608 | 🔴 Major rewrite — pipeline simulation mirrors `_query_stream` 1:1 |
| `test_phase3_prompt_injection.py` | 238 | 🟡 Moderate — new generate template placeholder |
| `test_acceptance_phase1_rag_query.py` | 101 | 🔴 Full rewrite — already broken (SSE vs JSON), new response shape |
| `conftest.py` | 94 | 🟡 Low — may add per-sub-q mock helpers |
---
## Implementation Tasks
### Sub-Phase 4.1: Backend — Per-Sub-Question Retrieval
**Test files to write first:**
- `test_phase4_retrieve_per_subquestion.py` — Tests `RAGService.retrieve_per_subquestion()`
- `test_phase4_query_router_retrieval.py` — Tests `_query_stream` retrieval stage produces per-sub-q chunks
**Task 4.1.1: Add `retrieve_per_subquestion()` to `RAGService`**
File: `backend/app/services/rag.py`
New method signature:
```python
def retrieve_per_subquestion(
    self,
    sub_questions: List[str],
    n_results: int = 10,
) -> List[Tuple[str, List[Tuple[str, Dict[str, Any], float]]]]:
    """Retrieve chunks for each sub-question independently.

    Args:
        sub_questions: List of decomposed sub-questions.
        n_results: Number of chunks per sub-question.

    Returns:
        List of (sub_question, chunks) tuples.
        chunks is the standard retrieve() output: [(text, metadata, distance), ...].
    """
```
Implementation:
- Call `self.retrieve([sub_q], n_results)` for each sub-question
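- Return a list of `(sub_question, chunks)` tuples. Within a single query ChromaDB returns each stored record at most once (IDs are unique per collection), so no extra deduplication is needed inside a sub-question; the same chunk may still appear under several sub-questions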
- Existing `retrieve()` method is NOT modified — it continues to work as before
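
The wrapper described above can be sketched as follows. This is a minimal illustration, not the shipped code: it is written as a free function that takes the retrieval callable as a parameter, and `fake_retrieve` is a hypothetical stand-in for `RAGService.retrieve()` so the example runs without ChromaDB.

```python
from typing import Any, Callable, Dict, List, Tuple

Chunk = Tuple[str, Dict[str, Any], float]  # (text, metadata, distance)


def retrieve_per_subquestion(
    retrieve: Callable[[List[str], int], List[Chunk]],
    sub_questions: List[str],
    n_results: int = 10,
) -> List[Tuple[str, List[Chunk]]]:
    """Run one retrieval per sub-question and keep the results grouped."""
    grouped: List[Tuple[str, List[Chunk]]] = []
    for sub_q in sub_questions:
        # A single-element list means sub-questions are never joined into one query.
        grouped.append((sub_q, retrieve([sub_q], n_results)))
    return grouped


def fake_retrieve(queries: List[str], n_results: int) -> List[Chunk]:
    """Hypothetical stand-in for RAGService.retrieve(), for this demo only."""
    return [(f"chunk {i} for {queries[0]}", {"page": i}, 0.1 * i) for i in range(n_results)]


demo = retrieve_per_subquestion(
    fake_retrieve,
    ["What are time extensions?", "What notice is required?"],
    n_results=2,
)
```

Because the grouping is positional, the index of each tuple in the returned list doubles as the sub-question index used later by the filter and generate stages.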
**Task 4.1.2: Update `_query_stream()` retrieval stage**
File: `backend/app/routers/query.py`
Changes:
- Replace `rag.retrieve(extracted_questions, n_results)` with `rag.retrieve_per_subquestion(extracted_questions, n_results)`
- Track per-sub-question retrieval timing (new field or combined timing)
- Format `chunks_retrieved` XML with sub-question wrappers
**New `chunks_retrieved` XML format:**
```xml
<sub_q idx="0" question="What are time extensions?">
<chunk_1>
Filename: NEC4 ACC.pdf
Page: 3
Content: Clause 61.3 states that...
</chunk_1>
<chunk_2>
...
</chunk_2>
</sub_q>
<sub_q idx="1" question="What notice is required?">
<chunk_1>
Filename: NEC4 Contract.pdf
Page: 12
Content: Notice must be given...
</chunk_1>
...
</sub_q>
```
**Task 4.1.3: Format helpers**
File: `backend/app/routers/query.py`
New functions:
```python
def format_chunks_retrieved_per_subq(results: List[Tuple[str, List]]) -> str:
    """Format per-sub-question retrieved chunks as XML."""


def format_chunks_filtered_per_subq(results: List[Tuple[str, List]]) -> str:
    """Format per-sub-question filtered chunks as XML with relevance scores."""
```
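
One possible body for the first helper is sketched below. It assumes each chunk is a `(text, metadata, distance)` tuple and that the metadata carries `filename` and `page` keys; those key names are assumptions, not confirmed by this plan. The question attribute is escaped with the standard library's `quoteattr`, while chunk content is written verbatim to match the format shown above.

```python
from typing import Any, Dict, List, Tuple
from xml.sax.saxutils import quoteattr

Chunk = Tuple[str, Dict[str, Any], float]  # (text, metadata, distance)


def format_chunks_retrieved_per_subq(results: List[Tuple[str, List[Chunk]]]) -> str:
    """Render grouped retrieval results in the <sub_q> wrapper format."""
    lines: List[str] = []
    for idx, (question, chunks) in enumerate(results):
        # quoteattr() adds the surrounding quotes and escapes &, <, > and quotes.
        lines.append(f'<sub_q idx="{idx}" question={quoteattr(question)}>')
        for n, (text, meta, _distance) in enumerate(chunks, start=1):
            lines.append(f"<chunk_{n}>")
            lines.append(f"Filename: {meta.get('filename', 'unknown')}")  # assumed key
            lines.append(f"Page: {meta.get('page', '?')}")  # assumed key
            lines.append(f"Content: {text}")
            lines.append(f"</chunk_{n}>")
        lines.append("</sub_q>")
    return "\n".join(lines)
```

Chunk numbering restarts at 1 inside every `<sub_q>` block, which mirrors the sample format where both sub-questions contain a `<chunk_1>`.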
**Commit**: `"feat: Phase 4.1 per-sub-question retrieval with grouped chunk XML"`
### Sub-Phase 4.2: Backend — Per-Sub-Question Filtering (Single LLM Call)
**Test files to write first:**
- `test_phase4_relevance_filter_per_subq.py` — Tests `RelevanceFilter.filter_per_subquestion()` with grouped chunks
- `test_phase4_query_router_filter.py` — Tests filter stage with per-sub-q chunk groups
**Task 4.2.1: Add `filter_per_subquestion()` to `RelevanceFilter`**
File: `backend/app/services/relevance_filter.py`
New method signature:
```python
async def filter_per_subquestion(
    self,
    sub_questions: List[str],
    sub_chunks: List[List[Tuple[str, Dict]]],
    threshold: float = 7.0,
) -> Tuple[List[Tuple[str, List[Tuple[str, Dict]]]], str]:
    """Filter chunks per sub-question in a single LLM call.

    Args:
        sub_questions: List of decomposed sub-questions.
        sub_chunks: List of chunk lists (one per sub-question).
        threshold: Minimum relevance score.

    Returns:
        Tuple of (filtered_results, prompt).
        filtered_results: List of (sub_question, filtered_chunks_for_that_q).
    """
```
**Prompt design (single LLM call):**
```
Evaluate each chunk for relevance to its associated sub-question.
Sub-question 0: "{sub_q_0}"
Chunk 0: {chunk_0_text}
Chunk 1: {chunk_1_text}
...
Sub-question 1: "{sub_q_1}"
Chunk 0: {chunk_0_text}
Chunk 1: {chunk_1_text}
...
For each chunk, rate relevance 0-10 considering ONLY its associated sub-question.
Return a JSON object mapping sub-question indices to arrays of scores:
{"0": [8.5, 3.2, 9.0], "1": [7.0, 6.5, 9.1]}
```
Key rules:
- Each chunk is evaluated against its **own** sub-question (not the original user question)
- JSON keys are stringified sub-question indices (`"0"`, `"1"`, ...)
- Score arrays MUST match chunk count for each sub-question
- Same JSON extraction/markdown stripping logic as existing `filter()`
**Existing `filter()` method is preserved** — not modified, not deprecated. The new method is additive.
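
The key rules above can be illustrated with a small sketch that builds the grouped prompt and applies the returned scores. The function names are hypothetical, chunks are reduced to plain strings, and treating the threshold as inclusive (`>=`) is an assumption based on the "minimum relevance score" wording.

```python
import json
import re
from typing import Dict, List


def build_grouped_prompt(sub_questions: List[str], sub_chunks: List[List[str]]) -> str:
    """Group chunks under their own sub-question so one LLM call can score them all."""
    parts = ["Evaluate each chunk for relevance to its associated sub-question."]
    for q_idx, (question, chunks) in enumerate(zip(sub_questions, sub_chunks)):
        parts.append(f'Sub-question {q_idx}: "{question}"')
        parts.extend(f"Chunk {c_idx}: {text}" for c_idx, text in enumerate(chunks))
    parts.append("Return a JSON object mapping sub-question indices to arrays of scores.")
    return "\n".join(parts)


def apply_scores(raw_reply: str, sub_chunks: List[List[str]], threshold: float = 7.0) -> List[List[str]]:
    """Parse the reply and keep, per sub-question, the chunks scoring at or above threshold."""
    # Strip an optional Markdown code-fence wrapper before parsing.
    cleaned = re.sub(r"^`{3}(?:json)?\s*|\s*`{3}$", "", raw_reply.strip())
    scores: Dict[str, List[float]] = json.loads(cleaned)
    kept: List[List[str]] = []
    for q_idx, chunks in enumerate(sub_chunks):
        q_scores = scores.get(str(q_idx), [])  # JSON keys are stringified indices
        if len(q_scores) != len(chunks):
            raise ValueError(f"sub-question {q_idx}: expected {len(chunks)} scores, got {len(q_scores)}")
        kept.append([chunk for chunk, score in zip(chunks, q_scores) if score >= threshold])
    return kept
```

This version raises on a score-count mismatch so the caller can decide how to recover; a production method would more likely catch that and apply a fallback.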
**Task 4.2.2: Update `_query_stream()` filter stage**
File: `backend/app/routers/query.py`
Changes:
- Call `relevance_filter.filter_per_subquestion(extracted_questions, chunks_for_filter, threshold)` instead of `relevance_filter.filter(question, chunks, threshold)`
- Build `chunks_for_filter` from per-sub-question retrieval results
- Track `filter_prompt` (the redesigned prompt)
- Format `chunks_filtered` XML with sub-question wrappers and `Relevance:` scores
**New `chunks_filtered` XML format:**
```xml
<sub_q idx="0" question="What are time extensions?">
<chunk_1>
Filename: NEC4 ACC.pdf
Page: 3
Relevance: 8.5
Content: Clause 61.3 states that...
</chunk_1>
</sub_q>
<sub_q idx="1" question="What notice is required?">
<chunk_1>
Filename: NEC4 Contract.pdf
Page: 12
Relevance: 9.0
Content: Notice must be given...
</chunk_1>
</sub_q>
```
**Commit**: `"feat: Phase 4.2 per-sub-question filtering with single LLM call"`
### Sub-Phase 4.3: Backend — Sub-Question-Organized Response Generation
**Test files to write first:**
- `test_phase4_generate_per_subq.py` — Tests `RAGService.generate_response_per_subquestion()`
- `test_phase4_response_format.py` — Tests the final answer matches expected format
**Task 4.3.1: Redesign `generate_response()` → `generate_response_per_subquestion()`**
File: `backend/app/services/rag.py`
New method signature:
```python
async def generate_response_per_subquestion(
    self,
    sub_questions: List[str],
    sub_chunks: List[List[str]],
    sub_metadata: List[List[Dict[str, Any]]],
) -> Tuple[str, str, List[List[SourceMetadata]]]:
    """Generate sub-question-organized RAG response.

    Args:
        sub_questions: List of decomposed sub-questions.
        sub_chunks: List of chunk text lists (one per sub-question).
        sub_metadata: List of metadata dict lists (one per sub-question).

    Returns:
        Tuple of (answer, prompt, grouped_sources).
        answer: Markdown string with sections per sub-question.
        prompt: The rendered LLM prompt.
        grouped_sources: List of SourceMetadata lists (one per sub-question).
    """
```
**New prompt template (replaces `generate`):**
```
You must answer each sub-question using ONLY the document chunks provided for it.
Do not use any external knowledge.
Format your answer as markdown sections — one section per sub-question.
Each section should start with "## Sub-question N: <the question>"
Each section should contain 1-5 bullet points.
Cite your sources inline using bracket labels, e.g. [filename, page N].
Place the citation at the end of each relevant bullet point.
{context_sections}
Answer:
```
**Context format (replaces `{context}`):**
```
### Context for Sub-question 0: "What are time extensions?"
[NEC4 ACC.pdf, page 3] Source: NEC4 ACC.pdf
Summary: Clause 61.3 discusses time extensions...
Content: Clause 61.3 states that the project manager...
[NEC4 Contract.pdf, page 12] Source: NEC4 Contract.pdf
Summary: Notice requirements for time extensions...
Content: Written notice must be given within...
### Context for Sub-question 1: "What notice is required?"
[NEC4 ACC.pdf, page 7] Source: NEC4 ACC.pdf
Summary: Notice requirements...
Content: The contractor shall notify the project manager in writing...
```
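
As an illustration of how the `{context_sections}` value might be assembled from the three parallel lists, here is a minimal sketch. The helper name and the metadata keys (`filename`, `page`, `summary`) are assumptions made for this example.

```python
from typing import Any, Dict, List


def build_context_sections(
    sub_questions: List[str],
    sub_chunks: List[List[str]],
    sub_metadata: List[List[Dict[str, Any]]],
) -> str:
    """Render one '### Context for Sub-question N' block per sub-question."""
    sections: List[str] = []
    for idx, (question, chunks, metas) in enumerate(zip(sub_questions, sub_chunks, sub_metadata)):
        lines = [f'### Context for Sub-question {idx}: "{question}"']
        for text, meta in zip(chunks, metas):
            # The bracket label is the exact string the model is asked to cite.
            label = f"[{meta['filename']}, page {meta['page']}]"
            lines.append(f"{label} Source: {meta['filename']}")
            if meta.get("summary"):
                lines.append(f"Summary: {meta['summary']}")
            lines.append(f"Content: {text}")
        sections.append("\n".join(lines))
    return "\n".join(sections)
```

Emitting the citation label in front of every chunk keeps the label the model copies identical to the one the frontend later resolves.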
**Expected answer format:**
```markdown
## Sub-question 1: What are time extensions?
- Time extensions must be notified to the project manager within 2 weeks [NEC4 ACC.pdf, page 3]
- The project manager must acknowledge the notice within 1 week [NEC4 Contract.pdf, page 12]
## Sub-question 2: What notice is required?
- Written notice must be given [NEC4 ACC.pdf, page 7]
```
**Existing `generate_response()` is preserved** — not modified, not deprecated.
**Task 4.3.2: Update `_query_stream()` generate stage**
File: `backend/app/routers/query.py`
Changes:
- Call `rag.generate_response_per_subquestion(extracted_questions, chunk_texts_by_subq, metadata_by_subq)`
- New SSE event: `generating_subquestion` — emitted before each sub-question's section (lets frontend show progressive build)
- `completed` SSE event includes both `answer` (markdown string) and `sub_question_sources` (grouped sources)
**New SSE event sequence:**
```json
{"phase": "decomposed", "extracted_questions": ["q1", "q2"]}
{"phase": "retrieving"}
{"phase": "filtering"}
{"phase": "generating"}
{"phase": "completed", "answer": "## Sub-question 1: ...\n\n...", "sub_question_sources": [[SourceMetadata, ...], [SourceMetadata, ...]]}
{"phase": "error", "message": "..."}
```
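
For reference, each event in the sequence above travels as one Server-Sent Events frame: a `data:` line holding the JSON payload, terminated by a blank line. The sketch below shows that framing only; `sse_frame` and `demo_stream` are hypothetical names, and the empty source lists stand in for real `SourceMetadata` payloads.

```python
import json
from typing import Any, Dict, Iterator, List


def sse_frame(event: Dict[str, Any]) -> str:
    """Serialize one pipeline event as an SSE data frame."""
    return f"data: {json.dumps(event)}\n\n"


def demo_stream(questions: List[str], answer: str) -> Iterator[str]:
    """Yield the happy-path event sequence for an already computed answer."""
    yield sse_frame({"phase": "decomposed", "extracted_questions": questions})
    for phase in ("retrieving", "filtering", "generating"):
        yield sse_frame({"phase": phase})
    yield sse_frame({
        "phase": "completed",
        "answer": answer,
        # One (here empty) source group per sub-question.
        "sub_question_sources": [[] for _ in questions],
    })
```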
**New `QueryResponse` model:**
File: `backend/app/models/query.py`
```python
from typing import List

from pydantic import BaseModel


class SubQuestionSources(BaseModel):
    sub_question_index: int
    sub_question_text: str
    sources: List[SourceMetadata]  # SourceMetadata is defined elsewhere in the backend


class QueryResponse(BaseModel):
    extracted_questions: List[str]
    answer: str  # Markdown with ## sections
    sub_question_sources: List[SubQuestionSources]  # Grouped sources
    # Backward compat:
    sources: List[SourceMetadata]  # Flattened version (all sources)
```
**Commit**: `"feat: Phase 4.3 sub-question-organized response generation"`
### Sub-Phase 4.4: Backend — History & Prompt Template Updates
**Test files to write first:**
- `test_phase4_history_format.py` — Tests new XML/JSON history formats
- `test_phase4_prompt_templates.py` — Tests new generate template with `{context_sections}`
**Task 4.4.1: Update history recording**
File: `backend/app/routers/query.py` (the `_schedule_history` / `_record_history` helpers)
Changes:
- `chunks_retrieved`: Store new grouped XML format (with `<sub_q>` wrappers)
- `chunks_filtered`: Store new grouped XML format (with `<sub_q>` wrappers and `Relevance:` scores)
- `sources`: Store grouped JSON: `json.dumps([[SourceMetadata_dict, ...], [...]])` (list of lists)
- `final_answer`: Store markdown string with `##` sections
- Existing fields (`chunks_retrieved_count`, `chunks_filtered_count`) keep total counts
- New optional fields: `chunks_retrieved_per_subq_count`, `chunks_filtered_per_subq_count` (JSON array of ints)
**Task 4.4.2: Update history DB schema (minimal)**
File: `backend/app/core/sqlite_db.py`
Add two new columns (optional, NULL-able):
```sql
ALTER TABLE query_history ADD COLUMN chunks_retrieved_per_subq_count TEXT DEFAULT NULL;
ALTER TABLE query_history ADD COLUMN chunks_filtered_per_subq_count TEXT DEFAULT NULL;
```
These store JSON arrays like `[10, 8]` — one count per sub-question. NULL for pre-Package-4 records.
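
SQLite has no `ADD COLUMN IF NOT EXISTS`, so re-running the statements above on an already migrated database fails with a duplicate-column error. A minimal sketch of an idempotent version, assuming the migration runs at startup, checks `PRAGMA table_info` first. The two-column `query_history` schema in the demo is a hypothetical stand-in for the real table.

```python
import sqlite3

NEW_COLUMNS = ("chunks_retrieved_per_subq_count", "chunks_filtered_per_subq_count")


def add_per_subq_count_columns(conn: sqlite3.Connection) -> None:
    """Add the per-sub-question count columns only if they are missing."""
    # table_info rows are (cid, name, type, notnull, dflt_value, pk).
    existing = {row[1] for row in conn.execute("PRAGMA table_info(query_history)")}
    for column in NEW_COLUMNS:
        if column not in existing:
            conn.execute(f"ALTER TABLE query_history ADD COLUMN {column} TEXT DEFAULT NULL")
    conn.commit()


conn = sqlite3.connect(":memory:")
# Hypothetical minimal schema, for demonstration only.
conn.execute("CREATE TABLE query_history (id INTEGER PRIMARY KEY, question TEXT)")
add_per_subq_count_columns(conn)
add_per_subq_count_columns(conn)  # second run is a no-op
```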
**Task 4.4.3: Update history Pydantic models**
File: `backend/app/models/history.py`
Add optional fields to `QueryHistoryRecord` and `QueryHistoryDetail`:
```python
chunks_retrieved_per_subq_count: Optional[str] = None # JSON array string
chunks_filtered_per_subq_count: Optional[str] = None # JSON array string
```
**Task 4.4.4: Update prompt templates**
File: `backend/app/core/sqlite_db.py` (seed data)
New `generate` template:
```python
"generate": (
"You must answer each sub-question using ONLY the document chunks provided for it.\n"
"Do not use any external knowledge.\n"
"Format your answer as markdown sections — one section per sub-question.\n"
"Each section should start with \"## Sub-question N: <the question>\"\n"
"Each section should contain 1-5 bullet points.\n"
"Cite your sources inline using bracket labels, e.g. [filename, page N].\n"
"Place the citation at the end of each relevant bullet point.\n\n"
"{context_sections}\n\n"
"Answer:"
)
```
`decompose` and `filter` templates remain unchanged (they still use `{question}` placeholder — the orchestrator injects the right value at call time).
**Task 4.4.5: Update `PromptService` to handle new template placeholder**
File: `backend/app/services/prompt_service.py`
- Add `context_sections` as a known placeholder for the `generate` step (optional, since `str.replace` is already safe with unknown keys)
- The `reset_to_defaults()` method must include the new generate template
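
The point about `str.replace` being safe can be shown with a small sketch. `render_template` is a hypothetical helper, not the actual `PromptService` method: it substitutes only the keys it is given and leaves every other brace untouched, whereas `str.format` would treat braces in an embedded JSON example as replacement fields.

```python
from typing import Dict


def render_template(template: str, values: Dict[str, str]) -> str:
    """Substitute {key} placeholders one by one using plain string replacement."""
    rendered = template
    for key, value in values.items():
        rendered = rendered.replace("{" + key + "}", value)
    return rendered


generate_prompt = render_template(
    "Answer each sub-question.\n\n{context_sections}\n\nAnswer:",
    {"context_sections": "### Context for Sub-question 0: ...", "unused_key": "ignored"},
)
```

One caveat of sequential replacement is that a substituted value containing another placeholder's text would itself be rewritten by a later iteration, so values that may contain braces are best substituted last.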
**Task 4.4.6: Update history detail API response**
File: `backend/app/routers/history.py`
`GET /api/v1/history/{id}` response now includes `chunks_retrieved_per_subq_count` and `chunks_filtered_per_subq_count` when they are not NULL. Backward-compatible (older records return `null` for these fields).
**Commit**: `"feat: Phase 4.4 history schema, prompt templates, and Pydantic model updates"`
### Sub-Phase 4.5: Frontend — Types & State Management
**Test files to write first:**
- `test_phase4_stream_state.test.tsx` — Tests `QueryStreamState` handles new response shape
- `test_phase4_types.test.ts` — Tests type compatibility
**Task 4.5.1: Update TypeScript types**
File: `frontend/src/types/index.ts`
New types:
```typescript
interface SubQuestionSources {
  sub_question_index: number;
  sub_question_text: string;
  sources: SourceMetadata[];
}

interface QueryStreamCompletedEvent {
  phase: 'completed';
  answer: string; // Markdown with ## sections
  sub_question_sources: SubQuestionSources[]; // Grouped sources
}

interface QueryStreamDecomposedEvent {
  phase: 'decomposed';
  extracted_questions: string[];
}

type QueryStreamEvent =
  | QueryStreamDecomposedEvent
  | { phase: 'retrieving' | 'filtering' | 'generating' }
  | QueryStreamCompletedEvent
  | { phase: 'error'; message: string };
```
**Task 4.5.2: Update `QueryStreamState` and mutation handler**
File: `frontend/src/lib/queries.tsx`
Changes:
```typescript
interface QueryStreamState {
  extractedQuestions: string[] | null;
  answer: string | null; // Full markdown
  subQuestionSources: SubQuestionSources[] | null; // NEW: grouped sources
  phase: 'idle' | 'decomposing' | 'retrieving' | 'filtering' | 'generating' | 'completed' | 'error';
  error: Error | null;
}
```
In the `completed` case:
```typescript
case 'completed':
  setState(prev => ({
    ...prev,
    answer: event.answer,
    subQuestionSources: event.sub_question_sources,
    phase: 'completed',
  }));
  break;
```
**Commit**: `"feat: Phase 4.5 frontend types and state management for per-sub-q responses"`
### Sub-Phase 4.6: Frontend — ResponsePanel & ExtractedQuestionsDisplay
**Test files to write first:**
- `test_phase4_response_panel.test.tsx` — Tests per-sub-question section rendering
- `test_phase4_citation_parser.test.ts` — Tests per-sub-question citation lookup
**Task 4.6.1: Redesign `ResponsePanel` for sub-question sections**
File: `frontend/src/components/ResponsePanel.tsx`
Current: single `ReactMarkdown` block + flat sources grid.
New layout:
```
┌─────────────────────────────────────────────────────┐
│ 📋 Response [Copy All] │
├─────────────────────────────────────────────────────┤
│ │
│ ┌─ Sub-question 1: What are time extensions? ─────┐│
│ │ │
│ │ • Time extensions must be notified... │
│ │ [NEC4 ACC.pdf, page 3] │
│ │ • The project manager must acknowledge... │
│ │ [NEC4 Contract.pdf, page 12] │
│ │ │
│ │ Sources (2) [Expand ▼] │
│ │ ┌──────────────────────────────────────────────┐ │
│ │ │ NEC4 ACC.pdf, Page 3 │ NEC4 Contract, p12 │ │
│ │ │ "Clause 61.3 states.." │ "Notice must be..." │ │
│ │ └──────────────────────────────────────────────┘ │
│ └────────────────────────────────────────────────────┘│
│ │
│ ┌─ Sub-question 2: What notice is required? ───────┐│
│ │ │
│ │ • Written notice must be given... │
│ │ [NEC4 ACC.pdf, page 7] │
│ │ │
│ │ Sources (1) [Expand ▼] │
│ └────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────┘
```
Implementation approach:
1. Parse the `answer` markdown into sections using `## Sub-question N:` headers
2. Map each section to its `SubQuestionSources` by matching index
3. Render each section as an accordion/card with:
- Header: sub-question text (from `SubQuestionSources`)
- Body: `ReactMarkdown` for bullet points (with inline citation links)
- Footer: collapsible sources grid (only sources belonging to this sub-question)
4. Keep the existing citation link behavior (clickable `[filename, page N]` → PDF viewer)
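
Step 1 of the approach above boils down to splitting on the section headers. The real component is TypeScript; the sketch below expresses the same logic in Python purely for illustration, and `split_answer_sections` is a hypothetical name.

```python
import re
from typing import List, Tuple

HEADER = re.compile(r"^## Sub-question (\d+): (.*)$", re.MULTILINE)


def split_answer_sections(answer: str) -> List[Tuple[int, str, str]]:
    """Return (number, question, body) for each '## Sub-question N:' section."""
    matches = list(HEADER.finditer(answer))
    sections: List[Tuple[int, str, str]] = []
    for i, match in enumerate(matches):
        # A section body runs until the next header, or to the end of the answer.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(answer)
        body = answer[match.end():end].strip()
        sections.append((int(match.group(1)), match.group(2).strip(), body))
    return sections
```

Text that appears before the first header is dropped by this sketch; a real renderer would probably want to show it as an unlabelled leading block.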
**Task 4.6.2: Update `citationParser.ts` for per-sub-question lookup**
File: `frontend/src/utils/citationParser.ts`
Current: `buildCitationLookup(sources: SourceMetadata[])` — returns a single global map.
New: `buildCitationLookup(subQuestionSources: SubQuestionSources[])` — returns a map scoped to the correct sources for each section. The citation `[filename, page N]` match is looked up in the relevant sub-question's source list.
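
The scoping idea can be sketched as a two-level map: sub-question index first, then `(filename, page)`. As above, this is Python standing in for the TypeScript utility, and the `filename` and `page` field names on a source are assumptions.

```python
import re
from typing import Any, Dict, List, Optional, Tuple

Source = Dict[str, Any]
CITATION = re.compile(r"\[([^\[\],]+), page (\d+)\]")


def build_citation_lookup(groups: List[Dict[str, Any]]) -> Dict[int, Dict[Tuple[str, int], Source]]:
    """Index each group's sources by (filename, page), keyed by sub-question index."""
    lookup: Dict[int, Dict[Tuple[str, int], Source]] = {}
    for group in groups:
        scoped: Dict[Tuple[str, int], Source] = {}
        for source in group["sources"]:
            # setdefault keeps the first source if a group repeats a (filename, page) pair.
            scoped.setdefault((source["filename"], int(source["page"])), source)
        lookup[group["sub_question_index"]] = scoped
    return lookup


def resolve_citations(
    section_markdown: str,
    sub_question_index: int,
    lookup: Dict[int, Dict[Tuple[str, int], Source]],
) -> List[Optional[Source]]:
    """Resolve every [filename, page N] label using only this section's sources."""
    scoped = lookup.get(sub_question_index, {})
    return [
        scoped.get((name.strip(), int(page)))
        for name, page in CITATION.findall(section_markdown)
    ]
```

The pattern assumes filenames contain no commas or square brackets; labels that do not resolve within the section come back as `None` rather than falling through to another sub-question's sources.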
**Task 4.6.3: Update `ExtractedQuestionsDisplay` for anchors**
File: `frontend/src/components/ExtractedQuestionsDisplay.tsx`
Minor enhancement:
- Make each extracted question a clickable anchor that scrolls to its corresponding section in the answer
- Add `id="subq-{index}"` to each section header in `ResponsePanel`
- Keep existing skeleton loading behavior
**Commit**: `"feat: Phase 4.6 frontend per-sub-question response rendering"`
### Sub-Phase 4.7: Testing & Polish
**Test files to write:**
- `test_phase4_integration_query_pipeline.py` — Full integration test simulating per-sub-q pipeline
- `test_phase4_acceptance_query.py` — Acceptance test with real LLM (manual run)
- `test_phase4_e2e_query_flow.test.tsx` — Frontend e2e test with mocked SSE stream
**Task 4.7.1: Backend unit tests**
- Run `pytest backend/app/test/test_phase4_*.py -v` — all must pass
- Verify no regressions in existing Phase 1 and Phase 3 tests
- Update `test_phase1_rag_service.py` for new method signatures
- Update `test_phase1_relevance_filter.py` for per-sub-q behavior
- Rewrite `test_phase3_query_history_integration.py` for new pipeline flow
- Update `test_phase3_prompt_injection.py` for new generate template
**Task 4.7.2: Backend acceptance tests**
- `test_phase4_acceptance_query.py` — real LLM, real ChromaDB
- Verify: answer contains `## Sub-question` headers, sources grouped by sub-question index
- Verify: each sub-question section has 1-5 bullet points
- Verify: inline citations match the correct sub-question's source list
**Task 4.7.3: Frontend tests**
- `test_phase4_response_panel.test.tsx` — renders per-sub-question sections, expandable sources
- `test_phase4_citation_parser.test.ts` — per-sub-question lookup returns correct source
- `test_phase4_e2e_query_flow.test.tsx` — mocks SSE with new event format, verifies section rendering
- Update existing `ResponsePanel.test.tsx` and `citationParser.test.ts` for new API
**Task 4.7.4: Frontend build verification**
- `npm run build` — no TypeScript errors
- `npm test` — all 62 existing tests pass + new Phase 4 tests
- Verify manual flow: ask question → see extracted questions → see per-sub-question answer sections → expand sources per section
**Task 4.7.5: Error handling**
- Empty decomposition: if `decompose()` returns `[]`, fall back to using original question as single sub-question
- Empty retrieval for some sub-questions: that sub-question gets no chunks → section shows "No relevant information found"
- Filter failure (all chunks below threshold): that sub-question gets no answer → graceful empty section
- JSON parse failure in filter: fall back to including all chunks (no filtering) for that sub-question
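
Two of the fallbacks above are simple enough to sketch directly. Both helper names are hypothetical, chunks are reduced to strings, and the inclusive threshold comparison is an assumption.

```python
import json
from typing import List


def effective_sub_questions(decomposed: List[str], original_question: str) -> List[str]:
    """Fall back to the original question when decomposition returns nothing."""
    return decomposed if decomposed else [original_question]


def filter_or_keep_all(raw_reply: str, chunks: List[str], q_idx: int, threshold: float = 7.0) -> List[str]:
    """Apply scores for one sub-question; keep every chunk if the reply is unusable."""
    try:
        scores = json.loads(raw_reply)[str(q_idx)]
        if len(scores) != len(chunks):
            raise ValueError("score count does not match chunk count")
    except (KeyError, TypeError, ValueError):
        # json.JSONDecodeError is a ValueError subclass, so bad JSON lands here too.
        return list(chunks)
    return [chunk for chunk, score in zip(chunks, scores) if score >= threshold]
```

An empty list returned from the last line is the "all chunks below threshold" case, which the generate stage then has to render as an empty section.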
**Task 4.7.6: Documentation**
- Update `AGENTS.md` with new pipeline architecture section
- Add docstrings to all new methods (`retrieve_per_subquestion`, `filter_per_subquestion`, `generate_response_per_subquestion`)
- Update prompt template documentation in system prompts page
**Commit**: `"feat: Phase 4.7 testing, error handling, and polish for per-sub-q pipeline"`

---
## Phase 4a: Prompt Service Integration for Per-Sub-Q Filter (2026-04-27)
**Root issue**: `filter_per_subquestion()` in `relevance_filter.py` had a hardcoded prompt (`_build_per_subq_prompt()`) — completely bypassing `PromptService`. Users could not edit the per-sub-q filter prompt on the System Prompts page, unlike the flat `filter` step which was already prompt-service-driven.
**Solution**: Broke the per-sub-q filter prompt into **3 composable pieces**, each a separately editable step on the System Prompts page:
| Step Name | Label | Placeholders | Default |
|-----------|-------|-------------|---------|
| `filter_intro` | Step 2.1: Filter Intro (Preamble) | *(none)* | `"Evaluate each chunk for relevance to its associated sub-question only."` |
| `filter_section` | Step 2.2: Filter Section (Per Sub-Q) | `{subq_idx}`, `{subq_question}`, `{chunks}` | `'Sub-question {subq_idx}: "{subq_question}"\n{chunks}'` |
| `filter_outro` | Step 2.3: Filter Outro (Format) | *(none)* | JSON format instructions + example |
The `RelevanceFilter._build_per_subq_prompt()` now composes them at runtime:
```
filter_intro + [filter_section.replace(...) for each sub-q] + filter_outro
```
Falls back to built-in defaults when `PromptService` is unavailable.
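
A minimal sketch of that composition follows. The intro and section defaults are copied from the table above; the outro wording, the `Chunk N:` line format, and the blank-line separator between pieces are assumptions, and `compose_filter_prompt` is a hypothetical free-function version of the private method.

```python
from typing import Dict, List, Optional

DEFAULTS: Dict[str, str] = {
    "filter_intro": "Evaluate each chunk for relevance to its associated sub-question only.",
    "filter_section": 'Sub-question {subq_idx}: "{subq_question}"\n{chunks}',
    # Placeholder wording; the real default holds JSON format instructions plus an example.
    "filter_outro": "Return a JSON object mapping sub-question indices to arrays of scores.",
}


def compose_filter_prompt(
    sub_questions: List[str],
    sub_chunks: List[List[str]],
    templates: Optional[Dict[str, str]] = None,
) -> str:
    """Compose intro + one section per sub-question + outro."""
    # Any template that is missing falls back to the built-in default.
    merged = {**DEFAULTS, **(templates or {})}
    sections: List[str] = []
    for idx, (question, chunks) in enumerate(zip(sub_questions, sub_chunks)):
        chunk_block = "\n".join(f"Chunk {i}: {text}" for i, text in enumerate(chunks))
        sections.append(
            merged["filter_section"]
            .replace("{subq_idx}", str(idx))
            .replace("{subq_question}", question)
            .replace("{chunks}", chunk_block)
        )
    return "\n\n".join([merged["filter_intro"], *sections, merged["filter_outro"]])
```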
### Bugs Fixed
1. **`generate_per_subq` not seeded**: `rag.py` called `get_prompt_template("generate_per_subq")` but this step name was never added to `_VALID_STEPS`, `_SEED_STEPS`, or `_SEED_TEMPLATES` — would crash at runtime with `ValueError`. Now properly seeded with `{context_sections}` placeholder.
2. **`_SEED_GENERATE` placeholder mismatch from Package 4**: The flat `generate_response()` expects `{question}`/`{context}` placeholders, but Package 4 changed the seed template to use `{context_sections}` (intended for per-sub-q generate). Restored flat template; `generate_per_subq` now holds `{context_sections}`.
### Database Backfill Migration
The existing `seed_default_profiles()` only inserted steps for NEWLY created profiles. Added a backfill loop that iterates ALL existing profiles and `INSERT OR IGNORE`s any missing step names. This ensures existing A/B/C profiles pick up `filter_intro`, `filter_section`, `filter_outro`, and `generate_per_subq` on restart.
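
The backfill pattern can be sketched with an in-memory database. The `profiles` and `prompts` tables below are hypothetical stand-ins for the real schema; the important assumption is a `UNIQUE(profile_id, step)` constraint, which is what lets `INSERT OR IGNORE` skip rows that already exist.

```python
import sqlite3
from typing import Dict

NEW_STEP_DEFAULTS: Dict[str, str] = {
    "filter_intro": "Evaluate each chunk for relevance to its associated sub-question only.",
    "filter_section": 'Sub-question {subq_idx}: "{subq_question}"\n{chunks}',
    "filter_outro": "(JSON format instructions)",  # placeholder text
    "generate_per_subq": "{context_sections}",  # placeholder text
}


def backfill_missing_steps(conn: sqlite3.Connection, defaults: Dict[str, str]) -> None:
    """Insert any missing step for every existing profile, leaving existing rows alone."""
    profile_ids = [row[0] for row in conn.execute("SELECT id FROM profiles")]
    for profile_id in profile_ids:
        for step, template in defaults.items():
            # OR IGNORE skips rows that would violate UNIQUE(profile_id, step).
            conn.execute(
                "INSERT OR IGNORE INTO prompts (profile_id, step, template) VALUES (?, ?, ?)",
                (profile_id, step, template),
            )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE profiles (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE prompts (profile_id INTEGER, step TEXT, template TEXT, UNIQUE(profile_id, step));
    """
)
for name in ("A", "B", "C"):
    pid = conn.execute("INSERT INTO profiles (name) VALUES (?)", (name,)).lastrowid
    for step in ("decompose", "filter", "generate"):
        conn.execute("INSERT INTO prompts VALUES (?, ?, ?)", (pid, step, "seed"))
conn.execute("INSERT INTO prompts VALUES (1, 'filter_intro', 'user-edited intro')")
backfill_missing_steps(conn, NEW_STEP_DEFAULTS)
backfill_missing_steps(conn, NEW_STEP_DEFAULTS)  # safe to repeat on every restart
```

With three profiles and seven steps each, the demo ends at 21 prompt rows, and the row that existed before the backfill keeps its edited text.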
### System Prompts UI Restructured
The flat `filter` and `generate` steps were removed from the UI (they're unused by the current pipeline). The page now shows 5 steps:
| UI Order | Label | Step Key |
|----------|-------|----------|
| 1 | Step 1: Query Decomposition | `decompose` |
| 2 | Step 2.1: Filter Intro (Preamble) | `filter_intro` |
| 3 | Step 2.2: Filter Section (Per Sub-Q) | `filter_section` |
| 4 | Step 2.3: Filter Outro (Format) | `filter_outro` |
| 5 | Step 3: Generate (Per-Sub-Question) | `generate_per_subq` |
The old `filter` and `generate` templates remain in the DB (for API backward compatibility) but are hidden from the UI.
### Files Changed
| File | Change |
|------|--------|
| `backend/app/core/sqlite_db.py` | 3 new seed templates + `generate_per_subq` seed; backfill migration; restored `_SEED_GENERATE` to `{question}`/`{context}` |
| `backend/app/services/prompt_service.py` | Added 4 step names to `_VALID_STEPS` |
| `backend/app/routers/prompts.py` | Added 4 step names to `_VALID_STEPS` |
| `backend/app/services/relevance_filter.py` | Refactored `_build_per_subq_prompt()` to use PromptService + built-in fallback constants |
| `frontend/src/components/PromptEditor.tsx` | Replaced unused flat steps with 5-step per-sub-q layout (Step 2.1-2.3 + Step 3) |
| `frontend/src/components/PlaceholderDocs.tsx` | Added `{context_sections}`, `{subq_idx}`, `{subq_question}` docs |
| `backend/app/test/conftest.py` | Added 4 new templates to mock |
| `backend/app/test/test_phase3_sqlite_db.py` | Updated counts (9→21 prompts) and placeholder assertions |
| `backend/app/test/test_phase3_prompt_service.py` | Updated step set + placeholder assertions |
| `backend/app/test/test_phase3_prompts_router.py` | Updated step set assertion |
| `backend/app/test/test_phase4_prompt_templates.py` | Updated for split generate/generate_per_subq |
| `frontend/src/test/components/PromptEditor.test.tsx` | Updated to 5 textareas, new labels, new placeholder layout |
| `frontend/src/test/components/PlaceholderDocs.test.tsx` | Updated to 6 placeholders |
### Test Results (Post-Phase 4a)
- **Backend**: 295 passed, 5 skipped (pre-existing)
- **Frontend**: 182 passed, 1 pre-existing failure (unrelated `file-input` e2e)
---
## Sub-Phase Summary
| Sub-Phase | Scope | Backend | Frontend | Tests | Status |
|-----------|-------|---------|----------|-------|--------|
| 4.1 | Per-sub-q retrieval | `rag.py`, `query.py`, format helpers | None | `test_phase4_retrieve_per_subquestion.py`, `test_phase4_query_router_retrieval.py` | ✅ Complete |
| 4.2 | Per-sub-q filtering (1 LLM call) | `relevance_filter.py`, `query.py` | None | `test_phase4_relevance_filter_per_subq.py`, `test_phase4_query_router_filter.py` | ✅ Complete |
| 4.3 | Sub-q-organized response generation | `rag.py`, `query.py`, `models/query.py` | None | `test_phase4_generate_per_subq.py`, `test_phase4_response_format.py` | ✅ Complete |
| 4.4 | History schema, prompts, models | `sqlite_db.py`, `history.py` (router + models), `prompt_service.py` | None | `test_phase4_history_format.py`, `test_phase4_prompt_templates.py` | ✅ Complete |
| 4.5 | Frontend types + state | None | `types/index.ts`, `lib/queries.tsx` | `test_phase4_stream_state.test.tsx`, `test_phase4_types.test.ts` | ✅ Complete |
| 4.6 | Frontend rendering | None | `ResponsePanel.tsx`, `citationParser.ts`, `ExtractedQuestionsDisplay.tsx` | `test_phase4_response_panel.test.tsx`, `test_phase4_citation_parser.test.ts` | ✅ Complete |
| 4.7 | Testing & polish | All affected files | All affected files | Integration + acceptance + e2e tests | ✅ Complete |
| 4a | Prompt service integration for filter_per_subq | `sqlite_db.py`, `prompt_service.py`, `prompts.py`, `relevance_filter.py` | `PromptEditor.tsx`, `PlaceholderDocs.tsx` | Updated 7 test files, 13 total files changed | ✅ Complete |
---
## Implementation Sequence & Dependencies
```
4.1 (Retrieval) ──┐
                  ├──► 4.2 (Filtering) ──► 4.3 (Generate) ──► 4.4 (History/Prompts)
                  │                                                     │
                  │                                                     ▼
                  │                                        4.5 (Frontend Types/State)
                  │                                                     │
                  │                                                     ▼
                  │                                         4.6 (Frontend Rendering)
                  │                                                     │
                  └─────────────────────────────────────────────────────▼
                                                             4.7 (Testing & Polish)
```
- **4.1 → 4.2 sequential**: Filtering needs per-sub-q chunk structure from retrieval
- **4.2 → 4.3 sequential**: Generation needs filtered chunks from filtering stage
- **4.3 → 4.4 sequential**: History recording and prompt templates need final data shapes
- **4.4 → 4.5 may overlap**: Backend prompt/history changes don't block frontend type definitions, so 4.5 does not strictly depend on 4.4
- **4.5 → 4.6 sequential**: Rendering needs types and state management
- **4.7 blocked by all**: Integration tests need everything wired together
**Parallelization opportunity**: 4.5 (frontend types) could start as soon as 4.3 defines the SSE contract, but it's safer to start after 4.4 confirms the final data shapes.
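The ordering can be checked mechanically with the standard-library `graphlib` module. The sketch below encodes the dependencies listed above; `CONSERVATIVE` and `EARLY_START` are illustrative names, and the early-start variant assumes 4.5 needs only the SSE contract from 4.3.

```python
from graphlib import TopologicalSorter

# Sub-phase -> sub-phases it depends on (conservative plan: 4.5 waits for 4.4).
CONSERVATIVE = {
    "4.1": set(),
    "4.2": {"4.1"},
    "4.3": {"4.2"},
    "4.4": {"4.3"},
    "4.5": {"4.4"},
    "4.6": {"4.5"},
    "4.7": {"4.1", "4.2", "4.3", "4.4", "4.5", "4.6"},
}

# Variant for the parallelization opportunity: 4.5 starts once 4.3 is done.
EARLY_START = {**CONSERVATIVE, "4.5": {"4.3"}}


def parallel_batches(deps: dict[str, set[str]]) -> list[list[str]]:
    """Group sub-phases into batches whose members can run concurrently."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the plan has a cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches
```

Under the conservative plan every batch contains a single sub-phase; with the early start, 4.4 and 4.5 land in the same batch.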
---
## Affected Files — Complete Inventory
### Backend — New Files
| File | Sub-Phase | Purpose |
|------|-----------|---------|
| `backend/app/test/test_phase4_retrieve_per_subquestion.py` | 4.1 | Unit test: `retrieve_per_subquestion()` |
| `backend/app/test/test_phase4_query_router_retrieval.py` | 4.1 | Unit test: retrieval stage in `_query_stream` |
| `backend/app/test/test_phase4_relevance_filter_per_subq.py` | 4.2 | Unit test: `filter_per_subquestion()` |
| `backend/app/test/test_phase4_query_router_filter.py` | 4.2 | Unit test: filter stage in `_query_stream` |
| `backend/app/test/test_phase4_generate_per_subq.py` | 4.3 | Unit test: `generate_response_per_subquestion()` |
| `backend/app/test/test_phase4_response_format.py` | 4.3 | Unit test: answer format validation |
| `backend/app/test/test_phase4_history_format.py` | 4.4 | Unit test: new XML/JSON history formats |
| `backend/app/test/test_phase4_prompt_templates.py` | 4.4 | Unit test: new generate template |
| `backend/app/test/test_phase4_integration_query_pipeline.py` | 4.7 | Integration test: full per-sub-q pipeline |
| `backend/app/test/acceptance/test_phase4_acceptance_query.py` | 4.7 | Acceptance test: real LLM |
### Backend — Modified Files
| File | Sub-Phase | Changes |
|------|-----------|---------|
| `backend/app/services/rag.py` | 4.1, 4.3 | Add `retrieve_per_subquestion()`, `generate_response_per_subquestion()` |
| `backend/app/services/relevance_filter.py` | 4.2 | Add `filter_per_subquestion()` |
| `backend/app/routers/query.py` | 4.1-4.4 | Refactor `_query_stream()`, add per-sub-q format helpers, update history recording |
| `backend/app/models/query.py` | 4.3 | Add `SubQuestionSources` model, update `QueryResponse` |
| `backend/app/models/history.py` | 4.4 | Add optional per-sub-q count fields |
| `backend/app/core/sqlite_db.py` | 4.4 | Add new columns, update seed generate template |
| `backend/app/services/prompt_service.py` | 4.4 | Update `reset_to_defaults()` generate template |
| `backend/app/routers/history.py` | 4.4 | Include new fields in detail response |
| `backend/app/core/config.py` | 4.1 | (Maybe) Add `retrieval_n_results_per_subq` setting |
### Backend — Tests Needing Update
| File | Sub-Phase | Changes |
|------|-----------|---------|
| `backend/app/test/test_phase1_rag_service.py` | 4.7 | Add tests for new methods; existing tests unaffected |
| `backend/app/test/test_phase1_relevance_filter.py` | 4.7 | Add tests for `filter_per_subquestion()` |
| `backend/app/test/test_phase3_query_history_integration.py` | 4.7 | Rewrite pipeline simulation for per-sub-q flow |
| `backend/app/test/test_phase3_prompt_injection.py` | 4.7 | Add tests for new generate template |
| `backend/app/test/acceptance/test_acceptance_phase1_rag_query.py` | 4.7 | Rewrite — SSE parsing + new response shape |
| `backend/app/test/conftest.py` | 4.7 | Add per-sub-q mock helpers |
### Frontend — New Files
| File | Sub-Phase | Purpose |
|------|-----------|---------|
| `frontend/src/test/components/test_phase4_response_panel.test.tsx` | 4.7 | Component test: per-sub-q sections |
| `frontend/src/test/utils/test_phase4_citation_parser.test.ts` | 4.7 | Unit test: per-sub-q citation lookup |
| `frontend/src/test/e2e/test_phase4_query_flow.test.tsx` | 4.7 | E2E test: mocked SSE with new format |
| `frontend/src/test/lib/test_phase4_stream_state.test.tsx` | 4.5 | State test: new event shapes |
| `frontend/src/test/lib/test_phase4_types.test.ts` | 4.5 | Type test: type compatibility |
### Frontend — Modified Files
| File | Sub-Phase | Changes |
|------|-----------|---------|
| `frontend/src/types/index.ts` | 4.5 | Add `SubQuestionSources`, update `QueryStreamEvent` |
| `frontend/src/lib/queries.tsx` | 4.5 | Update `QueryStreamState`, `completed` event handler |
| `frontend/src/components/ResponsePanel.tsx` | 4.6 | Redesign — per-sub-question sections with grouped sources |
| `frontend/src/utils/citationParser.ts` | 4.6 | Update `buildCitationLookup()` for per-sub-q |
| `frontend/src/components/ExtractedQuestionsDisplay.tsx` | 4.6 | Add anchor links to answer sections |
| `frontend/src/pages/LTTPage.tsx` | 4.6 | Pass new props to children |
---
## Risk Register
| Risk | Likelihood | Impact | Mitigation |
|------|-----------|--------|------------|
| LLM struggles with per-sub-q filtering prompt format | Medium | High — all chunks dropped | Use strong prompt constraints, validate JSON, fall back to including all chunks on parse failure |
| LLM generates answer not matching `## Sub-question N:` format | Medium | Medium — frontend can't parse sections | Fall back to rendering as single block if parsing fails. Prompt engineering tuned for format compliance |
| Same chunk retrieved by multiple sub-questions → duplicated in context | High | Low — slightly larger prompt but acceptable | Accept duplicates. ChromaDB naturally returns same doc if relevant to multiple queries. Each sub-q's evaluation is independent |
| Per-sub-q retrieval = more ChromaDB queries = slower | Medium | Medium — N × retrieval latency | ChromaDB retrieval is fast (~10-50ms per query). 5 sub-questions × 10-50ms = 50-250ms overhead when queries run sequentially. Acceptable trade-off for better relevance. |
| History DB migration fails for existing records | Low | Low — new columns are NULL-able | `ALTER TABLE ADD COLUMN ... DEFAULT NULL` is safe. Existing records work as before — `chunks_retrieved`/`chunks_filtered` still have flat XML. |
| Frontend rendering breaks on older history records | Low | Low — answer format differs | `ResponsePanel` renders per-sub-q sections only when `subQuestionSources` is non-null. Older history records show flat answer as before. |
| Prompt template migration breaks user-customized prompts | Medium | Medium — users lose their generate template | Warn in docs. The `generate` template changes fundamentally (single `{context}` → `{context_sections}`). Users must re-customize. |
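The first mitigation (validate the filter JSON, fall back to keeping all chunks) could look roughly like this. It assumes the LLM returns a JSON object mapping sub-question indices to score arrays, as the plan describes; the function name, the string-keyed indices, and the `0.5` threshold are illustrative assumptions rather than the actual `relevance_filter.py` behavior.

```python
import json


def apply_filter_scores(
    raw_response: str,
    chunks_by_subq: dict[int, list[str]],
    threshold: float = 0.5,  # assumed cutoff, not taken from the plan
) -> dict[int, list[str]]:
    """Keep chunks scoring >= threshold; degrade to 'keep all' on bad output."""
    try:
        scores = json.loads(raw_response)
        if not isinstance(scores, dict):
            raise ValueError("expected a JSON object")
    except ValueError:  # json.JSONDecodeError is a ValueError subclass
        # Whole response unusable: include every chunk for every sub-question.
        return {idx: list(chunks) for idx, chunks in chunks_by_subq.items()}

    kept: dict[int, list[str]] = {}
    for idx, chunks in chunks_by_subq.items():
        subq_scores = scores.get(str(idx))
        valid = (
            isinstance(subq_scores, list)
            and len(subq_scores) == len(chunks)
            and all(
                isinstance(s, (int, float)) and not isinstance(s, bool)
                for s in subq_scores
            )
        )
        if not valid:
            # Per-sub-question degradation: keep all chunks for this one only.
            kept[idx] = list(chunks)
            continue
        kept[idx] = [c for c, s in zip(chunks, subq_scores) if s >= threshold]
    return kept
```

Degrading per sub-question rather than globally means one malformed score array does not discard valid filtering results for the other sub-questions.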
---
## Acceptance Criteria
### Backend
- [x] `POST /api/v1/query` retrieves chunks per sub-question (verified by history XML showing `<sub_q>` wrappers)
- [x] Filtering uses single LLM call evaluating chunks against their originating sub-question (verified by filter prompt)
- [x] Response answer is organized by sub-question with `## Sub-question N:` headers
- [x] `sub_question_sources` in SSE `completed` event is grouped by sub-question index
- [x] History records include new grouped XML formats for `chunks_retrieved` and `chunks_filtered`
- [x] History records include grouped `sources` JSON (list of lists)
- [x] History records include per-sub-q chunk counts
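- [x] New `generate` prompt template uses `{context_sections}` placeholder (as of Phase 4a, this template is stored under the `generate_per_subq` step and the flat `generate` template again uses `{question}`/`{context}`)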
- [x] Prompt service `reset_to_defaults()` includes new generate template
- [x] Existing `decompose`, `filter` (old), `generate_response` (old) methods are unchanged
- [x] All Phase 1, Phase 3, and new Phase 4 unit tests pass (312 passed, 4 skipped)
- [x] All acceptance tests pass with real LLM (manual run)
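For reference, the grouped history XML and the per-sub-q counts can be sketched with the standard-library `xml.etree.ElementTree`. Only the `<sub_q idx="N" question="...">` wrapper comes from this plan; the `<chunks>` root, the `<chunk>` element name, and 1-based indexing are assumptions.

```python
import xml.etree.ElementTree as ET


def build_grouped_chunks_xml(groups: list[tuple[str, list[str]]]) -> str:
    """Serialize (question, chunk_texts) pairs into <sub_q>-wrapped XML."""
    root = ET.Element("chunks")  # assumed root element name
    for idx, (question, chunks) in enumerate(groups, start=1):
        sub_q = ET.SubElement(root, "sub_q", idx=str(idx), question=question)
        for text in chunks:
            # ElementTree escapes &, <, > and quotes on serialization.
            ET.SubElement(sub_q, "chunk").text = text
    return ET.tostring(root, encoding="unicode")


def per_subq_counts(xml_text: str) -> list[int]:
    """Return the number of chunks under each <sub_q>, in document order."""
    root = ET.fromstring(xml_text)
    return [len(sub_q.findall("chunk")) for sub_q in root.findall("sub_q")]
```

Building the XML through an element tree rather than string concatenation keeps question text containing quotes or ampersands from corrupting the stored record.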
### Frontend
- [x] `QueryStreamState` includes `subQuestionSources` field
- [x] `ResponsePanel` renders per-sub-question sections with expandable source grids
- [x] Each section's sources are scoped to that sub-question (no cross-contamination)
- [x] Inline citations `[filename, page N]` link to the correct PDF viewer page
- [x] `ExtractedQuestionsDisplay` shows clickable anchors to answer sections
- [x] Copy button copies all answer text including section headers
- [x] Loading states: skeleton per section during generation
- [x] Empty state: "No relevant information found" per sub-question (not entire response)
- [x] All 62+ existing frontend tests still pass (183 passed)
- [x] All new Phase 4 frontend tests pass
- [x] `npm run build` succeeds with zero TypeScript errors
- [x] Manual verification: full query flow works end-to-end
---
## New Dependencies
None. All changes use existing libraries (FastAPI, ChromaDB, OpenAI SDK, React, ReactMarkdown, TanStack Query).
---
## Decisions (All Confirmed)
| # | Topic | Decision |
|---|-------|----------|
| 1 | Single vs multiple filter LLM calls | **Single call** — user explicitly requested this |
| 2 | Filter prompt design | Group chunks by sub-question in one prompt. JSON response maps sub-q indices to score arrays |
| 3 | Answer format | Markdown with `## Sub-question N: <question>` headers |
| 4 | Sources grouping | `sub_question_sources: [{index, text, sources}, ...]` in SSE + frontend |
| 5 | History XML format | Add `<sub_q idx="N" question="...">` wrappers around chunk groups |
| 6 | History DB migration | Add 2 new NULL-able columns. No data migration needed. |
| 7 | Backward compatibility | Preserve old `retrieve()`, `filter()`, `generate_response()` methods. New methods are additive. |
| 8 | Deduplication | None. Same chunk may appear in multiple sub-questions. Each sub-q evaluates independently. |
| 9 | Error handling | Per-sub-question graceful degradation. Filter failure → include all chunks for that sub-q. Generate failure → "Unable to generate answer for this sub-question." |
| 10 | Frontend rendering engine | Keep `ReactMarkdown`. Parse sections client-side by splitting on `## Sub-question N:` headers. |
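Decision 10's section splitting, together with the single-block fallback from the risk register, is sketched below in Python, although the plan places this parsing in the frontend. The regular expression and the dictionary shape are assumptions; text before the first header is dropped in this sketch.

```python
import re

# Matches headers of the form "## Sub-question N: <question>" at line start.
HEADER_RE = re.compile(r"^## Sub-question (\d+):[ \t]*(.*)$", re.MULTILINE)


def split_answer_sections(answer: str) -> list[dict]:
    """Split a markdown answer into per-sub-question sections."""
    matches = list(HEADER_RE.finditer(answer))
    if not matches:
        # Fallback: the LLM ignored the format, so render as a single block.
        return [{"index": None, "question": None, "body": answer.strip()}]

    sections = []
    for i, match in enumerate(matches):
        # A section body runs until the next header or the end of the answer.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(answer)
        sections.append(
            {
                "index": int(match.group(1)),
                "question": match.group(2).strip(),
                "body": answer[match.end():end].strip(),
            }
        )
    return sections
```

Keying each section by the index parsed from the header, rather than by position, lets the renderer match sections to `sub_question_sources` entries even if the model skips a sub-question.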
---
## Open Questions
None — all resolved.
| # | Question | Resolution |
|---|----------|------------|
| 1 | Progressive SSE events? | **Yes** — emit `generating_subquestion` as each sub-question's answer is generated. Frontend renders sections progressively. |
| 2 | `retrieval_n_results` per sub-question or global? | **Global** — same value for all sub-questions. Simpler config, one setting. |
| 3 | Fallback when decomposition returns 0 sub-questions? | **Fall back to original question** — treat as single sub-question. Pipeline runs as 1-sub-q case (retrieval via original question, no filtering needed for single sub-q, flat answer). |
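Resolutions 2 and 3 combine into a small amount of orchestration logic, sketched here. The `retrieve` callable stands in for `RAGService.retrieve()`, whose real signature is not shown in this plan; the function name and the returned dictionary shape are likewise assumptions.

```python
from typing import Callable


def retrieve_grouped(
    original_question: str,
    sub_questions: list[str],
    retrieve: Callable[[str, int], list[str]],  # stand-in for RAGService.retrieve()
    n_results: int,
) -> list[dict]:
    """Retrieve chunks per sub-question using one global n_results value."""
    # Resolution 3: no sub-questions -> treat the original question as the only one.
    questions = list(sub_questions) or [original_question]
    return [
        # Resolution 2: the same n_results applies to every sub-question.
        {"index": idx, "text": question, "chunks": retrieve(question, n_results)}
        for idx, question in enumerate(questions, start=1)
    ]
```

With this shape the single-sub-question case needs no special handling downstream: it is simply a list of length one.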
---
## Test Plan Summary
### Backend (New Tests)
| File | Tests | Coverage |
|------|-------|----------|
| `test_phase4_retrieve_per_subquestion.py` | ~6 | Per-sub-q retrieval, empty input, single sub-q, dedup behavior |
| `test_phase4_query_router_retrieval.py` | ~4 | SSE events during retrieval, chunk XML format |
| `test_phase4_relevance_filter_per_subq.py` | ~6 | Per-sub-q filtering, JSON response parsing, threshold behavior |
| `test_phase4_query_router_filter.py` | ~4 | SSE events during filtering, filtered XML format |
| `test_phase4_generate_per_subq.py` | ~5 | Per-sub-q generate, prompt construction, answer format |
| `test_phase4_response_format.py` | ~4 | Answer has `##` headers, citations in correct sections |
| `test_phase4_history_format.py` | ~5 | New XML/JSON formats, per-sub-q counts |
| `test_phase4_prompt_templates.py` | ~3 | New generate template, `{context_sections}` placeholder |
| `test_phase4_integration_query_pipeline.py` | ~5 | Full pipeline simulation |
| `test_phase4_acceptance_query.py` | ~3 | Real LLM end-to-end (manual) |
### Frontend (New Tests)
| File | Tests | Coverage |
|------|-------|----------|
| `test_phase4_stream_state.test.tsx` | ~4 | State updates for new event shapes |
| `test_phase4_types.test.ts` | ~2 | Type compatibility checks |
| `test_phase4_response_panel.test.tsx` | ~6 | Section rendering, source grouping, copy, loading |
| `test_phase4_citation_parser.test.ts` | ~4 | Per-sub-q lookup, cross-section isolation |
| `test_phase4_e2e_query_flow.test.tsx` | ~3 | Full SSE flow with mocked stream |