# Package 4 Enhancement Plan — Per-Sub-Question RAG Pipeline
**Source**: User request (2026-04-26)
**Scope**: Refactor the 3-step RAG query pipeline so retrieval, filtering, and response generation are organized per sub-question instead of batch-flattened.
**Status**: ✅ Complete. All 7 sub-phases implemented (2026-04-26). Phase 4a Prompt Integration added (2026-04-27). Phase PX Profile Export/Import planned (2026-04-27); see end of file.

---
## Objective
Restructure the `POST /api/v1/query` pipeline so that:
1. **Retrieval per sub-question**: Each sub-question independently retrieves `n_results` chunks from ChromaDB (instead of joining all sub-questions into one query string).
2. **Filtering per sub-question**: Each chunk is evaluated for relevance against its **own originating sub-question** (not the original user question). One LLM call handles all filtering — the prompt is redesigned to group chunks by sub-question.
3. **Final answer organized by sub-question**: Each sub-question gets its own bullet-point answer with its own sources. The frontend renders answer sections per sub-question rather than one monolithic bullet list.
---
## Decision Register
| # | Decision | Rationale |
|---|----------|-----------|
| 1 | Keep `QueryDecomposer` unchanged | Input/output contract is identical — decomposition still produces a flat list of sub-questions |
| 2 | Single LLM call for filtering | User explicitly requested one call. Prompt redesigned to carry sub-question context for each chunk group |
| 3 | Keep `RAGService.retrieve()` signature | Call it N times (once per sub-question) from the new `retrieve_per_subquestion()` wrapper (Decision 4) rather than changing its internal contract |
| 4 | Add `retrieve_per_subquestion()` to `RAGService` | New method that iterates over sub-questions, calls `retrieve()` per question, returns grouped results |
| 5 | Redesign `generate_response()` signature | Accepts structured `sub_questions: List[SubQuestionContext]` instead of flat chunk lists |
| 6 | SSE events: add `generating_subquestion` phase | Progressive streaming — frontend sees which sub-question is being answered |
| 7 | History: change XML/JSON formats in-place | Add `<sub_q>` wrappers to `chunks_retrieved`/`chunks_filtered` XML. Add sub-question grouping to `sources` JSON. No new DB columns. |
| 8 | Final answer format: markdown sections | `## Sub-question 1` headers with inline citations. Backward-compatible with existing `ReactMarkdown` rendering |
| 9 | Deduplicate chunks within a sub-question only | Same chunk may be retrieved by multiple sub-questions. Keep duplicates (different sub-questions need independent evaluation). ChromaDB `query()` naturally may return the same doc for different queries — this is acceptable. |
|10 | Prompt template: add `generate` placeholders | New placeholder `{context_sections}` replaces single `{context}`. Filter template unchanged (sub-question injected at call site). Decompose template unchanged. |
|11 | Progressive SSE events | Emit `generating_subquestion` event as each sub-question's answer section is generated. Frontend renders sections one by one. |
|12 | `retrieval_n_results` | Global — same value for all sub-questions. Use existing `settings.retrieval_n_results` config. |
|13 | Empty decomposition fallback | Treat original user question as single sub-question. Pipeline runs as 1-sub-q case — single retrieval, no filtering needed (one sub-q = no ambiguity), flat answer with `##` header. |
---
## Pipeline: Before vs After
### Before (Current — Flat Batch)
```
User Question: "What are NEC4 time extension clauses?"
┌────▼─────┐
│ Decompose│  LLM Call 1
│          │  → ["What are time extensions?",
│          │     "What notice is required?"]
└────┬─────┘
     │ joined: "What are time extensions? What notice is required?"
┌────▼─────┐
│ Retrieve │  1 ChromaDB query → 10 chunks (flat, no sub-q association)
└────┬─────┘
     │ 10 chunks
┌────▼─────┐
│ Filter   │  LLM Call 2: all chunks scored against ORIGINAL question
│          │  Score > 7 → keep (flat, no sub-q association)
└────┬─────┘
     │ N filtered chunks
┌────▼─────┐
│ Generate │  LLM Call 3: flat answer from ALL filtered chunks
│          │  "• Time extensions require notice [NEC4 ACC.pdf, p3]
│          │   • The project manager must acknowledge [NEC4, p7]
│          │   • Notice is defined as..." (sources from all sub-qs mixed)
└────┬─────┘
     │ single SSE completed event
┌────▼─────┐
│ Frontend │  1 ReactMarkdown block, 1 flat sources list
└──────────┘
```
### After (Per-Sub-Question)
```
User Question: "What are NEC4 time extension clauses?"
┌────▼─────┐
│ Decompose│  LLM Call 1 (UNCHANGED)
│          │  → ["What are time extensions?",
│          │     "What notice is required?"]
└────┬─────┘
     │ sub_q1                  sub_q2
┌────▼─────┐              ┌────▼─────┐
│ Retrieve │              │ Retrieve │  2 ChromaDB queries → 10 chunks each
│ q1 → 10  │              │ q2 → 10  │  chunks tagged with sub-q index
└────┬─────┘              └────┬─────┘
     │                         │
     └─────────┬───────────────┘
               │ grouped: {sub_q0: [chunks 0-9], sub_q1: [chunks 10-19]}
          ┌────▼─────┐
          │ Filter   │  LLM Call 2 (SINGLE CALL, redesigned prompt)
          │          │  Each chunk scored against its OWN sub-question
          │          │  Returns grouped scores → filtered per sub-q
          └────┬─────┘
               │ filtered_by_subq: {0: [chunk_a, chunk_b], 1: [chunk_c]}
          ┌────▼─────┐
          │ Generate │  LLM Call 3 (redesigned prompt with per-sub-q context)
          │          │  ┌─────────────────────────────────────┐
          │          │  │ ## What are time extensions?        │
          │          │  │ - Time extensions must be notified  │
          │          │  │   [NEC4 ACC.pdf, page 3]            │
          │          │  │ - The project manager has 2 weeks   │
          │          │  │   [NEC4 Contract.pdf, page 12]      │
          │          │  │                                     │
          │          │  │ ## What notice is required?         │
          │          │  │ - Written notice must be given      │
          │          │  │   [NEC4 ACC.pdf, page 7]            │
          │          │  └─────────────────────────────────────┘
          └────┬─────┘
               │ SSE events: generating_subquestion (per sub-q) → completed
          ┌────▼─────┐
          │ Frontend │  Sections per sub-question, sources grouped per section
          └──────────┘
```
---
## Current State (Pre-Enhancement)
### Backend
| Component | File | Current Behavior |
|-----------|------|-----------------|
| Decomposer | `services/query_decomposer.py` | `decompose(question) -> (List[str], prompt)` — returns 2-5 sub-questions |
| Retrieval | `services/rag.py:retrieve()` | `query_text = " ".join(query_keywords)` — joins all sub-qs into ONE string, single ChromaDB query → flat chunk list |
| Filter | `services/relevance_filter.py` | `filter(question, chunks)` — ALL chunks scored against ORIGINAL question, single LLM call, flat output |
| Generate | `services/rag.py:generate_response()` | `generate_response(question, chunks, metadata)` — flat chunks → flat bullet answer |
| Orchestrator | `routers/query.py:_query_stream()` | Linear 4-stage pipeline: decompose → retrieve → filter → generate |
| SSE Events | `routers/query.py` | `decomposed → retrieving → filtering → generating → completed` — flat answer + sources in `completed` |
| History | `services/history_service.py` | Flat XML for `chunks_retrieved`/`chunks_filtered`. Flat JSON for `sources`. Single timing per stage. |
| Prompt templates | `prompt_service.py` + `sqlite_db.py` | 3 steps (`decompose`, `filter`, `generate`). Placeholders: `{question}`, `{chunks}`, `{context}` |
| Config | `core/config.py` | `retrieval_n_results=10`, `relevance_threshold=7.0` |
### Frontend
| Component | File | Current Behavior |
|-----------|------|-----------------|
| Types | `types/index.ts` | `QueryStreamEvent.phase`, flat `extracted_questions: string[]`, flat `answer: string`, flat `sources: SourceMetadata[]` |
| SSE Client | `lib/api.ts` | `queryDocumentStream()` — generic `JSON.parse` per `data:` line, no sub-question awareness |
| State | `lib/queries.tsx` | `QueryStreamState` with flat `answer`/`sources`/`extractedQuestions` |
| Response | `components/ResponsePanel.tsx` | Single `ReactMarkdown` block for answer. Flat 2-column grid for sources. No sub-question grouping. |
| Questions | `components/ExtractedQuestionsDisplay.tsx` | `<ol>` list of question strings. No sources attached. |
| Citations | `utils/citationParser.ts` | Flat `sources` lookup — `buildCitationLookup(sources)` returns global map |
| Progress | `components/PipelineProgress.tsx` | 4-step stepper (NOT currently wired in LTTPage) |
### Key Test Files
| File | Lines | Status |
|------|-------|--------|
| `test_phase1_query_decomposer.py` | 76 | ✅ Unchanged — decomposer contract stays |
| `test_phase1_rag_service.py` | 139 | 🔴 Needs update: add coverage for `retrieve_per_subquestion()` and `generate_response_per_subquestion()` (existing `retrieve()` and `generate_response()` are preserved) |
| `test_phase1_relevance_filter.py` | 93 | 🟡 Needs update — one-call pattern changes to per-sub-q grouping |
| `test_phase1_query.py` | 97 | 🟢 Already skipped (SSE migration) — may un-skip later |
| `test_phase3_query_history_integration.py` | 608 | 🔴 Major rewrite — pipeline simulation mirrors `_query_stream` 1:1 |
| `test_phase3_prompt_injection.py` | 238 | 🟡 Moderate — new generate template placeholder |
| `test_acceptance_phase1_rag_query.py` | 101 | 🔴 Full rewrite — already broken (SSE vs JSON), new response shape |
| `conftest.py` | 94 | 🟡 Low — may add per-sub-q mock helpers |
---
## Implementation Tasks
### Sub-Phase 4.1: Backend — Per-Sub-Question Retrieval
**Test files to write first:**
- `test_phase4_retrieve_per_subquestion.py` — Tests `RAGService.retrieve_per_subquestion()`
- `test_phase4_query_router_retrieval.py` — Tests `_query_stream` retrieval stage produces per-sub-q chunks
**Task 4.1.1: Add `retrieve_per_subquestion()` to `RAGService`**
File: `backend/app/services/rag.py`
New method signature:
```python
def retrieve_per_subquestion(
    self,
    sub_questions: List[str],
    n_results: int = 10,
) -> List[Tuple[str, List[Tuple[str, Dict[str, Any], float]]]]:
    """Retrieve chunks for each sub-question independently.

    Args:
        sub_questions: List of decomposed sub-questions.
        n_results: Number of chunks per sub-question.

    Returns:
        List of (sub_question, chunks) tuples.
        chunks is the standard retrieve() output: [(text, metadata, distance), ...].
    """
```
Implementation:
- Call `self.retrieve([sub_q], n_results)` for each sub-question
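- Return list of `(sub_question, chunks)`. A single ChromaDB query returns each stored ID at most once, so one sub-question's chunk list contains no duplicates; the same chunk may still appear under several sub-questions (see Decision 9)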
- Existing `retrieve()` method is NOT modified — it continues to work as before
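
A minimal sketch of the wrapper described above, assuming `retrieve()` accepts a list of query strings and returns `(text, metadata, distance)` tuples as the docstring states. `FakeRAGService` and its canned chunks are hypothetical stand-ins so the example runs without ChromaDB; only the shape of `retrieve_per_subquestion()` is the point.

```python
from typing import Any, Dict, List, Tuple

Chunk = Tuple[str, Dict[str, Any], float]


class FakeRAGService:
    """Hypothetical stand-in for RAGService; retrieve() returns canned chunks."""

    def retrieve(self, query_keywords: List[str], n_results: int = 10) -> List[Chunk]:
        query_text = " ".join(query_keywords)
        # Pretend every query matches n_results chunks tagged with the query text.
        return [
            (f"chunk {i} for {query_text!r}", {"page": i + 1}, 0.1 * i)
            for i in range(n_results)
        ]

    def retrieve_per_subquestion(
        self, sub_questions: List[str], n_results: int = 10
    ) -> List[Tuple[str, List[Chunk]]]:
        # One independent retrieve() call per sub-question; input order is preserved.
        return [(sub_q, self.retrieve([sub_q], n_results)) for sub_q in sub_questions]


grouped = FakeRAGService().retrieve_per_subquestion(
    ["What are time extensions?", "What notice is required?"], n_results=3
)
```

Passing `[sub_q]` as a one-element list keeps the existing `" ".join(...)` behaviour inside `retrieve()` intact, which is why its signature does not need to change.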
**Task 4.1.2: Update `_query_stream()` retrieval stage**
File: `backend/app/routers/query.py`
Changes:
- Replace `rag.retrieve(extracted_questions, n_results)` with `rag.retrieve_per_subquestion(extracted_questions, n_results)`
- Track per-sub-question retrieval timing (new field or combined timing)
- Format `chunks_retrieved` XML with sub-question wrappers
**New `chunks_retrieved` XML format:**
```xml
<sub_q idx="0" question="What are time extensions?">
<chunk_1>
Filename: NEC4 ACC.pdf
Page: 3
Content: Clause 61.3 states that...
</chunk_1>
<chunk_2>
...
</chunk_2>
</sub_q>
<sub_q idx="1" question="What notice is required?">
<chunk_1>
Filename: NEC4 Contract.pdf
Page: 12
Content: Notice must be given...
</chunk_1>
...
</sub_q>
```
**Task 4.1.3: Format helpers**
File: `backend/app/routers/query.py`
New functions:
```python
def format_chunks_retrieved_per_subq(results: List[Tuple[str, List]]) -> str:
    """Format per-sub-question retrieved chunks as XML."""


def format_chunks_filtered_per_subq(results: List[Tuple[str, List]]) -> str:
    """Format per-sub-question filtered chunks as XML with relevance scores."""
```
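
One possible body for the first helper, shown as a sketch rather than the actual implementation. It assumes each chunk is a `(text, metadata, distance)` tuple and that the metadata keys are named `filename` and `page`; both key names are assumptions. The question is escaped so that quotes inside it cannot break the `question="..."` attribute.

```python
from html import escape
from typing import Any, Dict, List, Tuple

Chunk = Tuple[str, Dict[str, Any], float]


def format_chunks_retrieved_per_subq(results: List[Tuple[str, List[Chunk]]]) -> str:
    """Render grouped retrieval results in the <sub_q>/<chunk_N> layout."""
    lines: List[str] = []
    for idx, (sub_q, chunks) in enumerate(results):
        lines.append(f'<sub_q idx="{idx}" question="{escape(sub_q, quote=True)}">')
        # Chunk numbering restarts at 1 inside every sub-question group.
        for n, (text, meta, _distance) in enumerate(chunks, start=1):
            lines.append(f"<chunk_{n}>")
            # "filename" and "page" are assumed metadata keys.
            lines.append(f"Filename: {meta.get('filename', 'unknown')}")
            lines.append(f"Page: {meta.get('page', '?')}")
            lines.append(f"Content: {text}")
            lines.append(f"</chunk_{n}>")
        lines.append("</sub_q>")
    return "\n".join(lines)


xml_out = format_chunks_retrieved_per_subq(
    [
        (
            "What are time extensions?",
            [("Clause 61.3 states that...", {"filename": "NEC4 ACC.pdf", "page": 3}, 0.12)],
        )
    ]
)
```

The filtered variant would differ only by carrying a score per chunk and emitting a `Relevance:` line before `Content:`.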
**Commit**: `"feat: Phase 4.1 per-sub-question retrieval with grouped chunk XML"`
### Sub-Phase 4.2: Backend — Per-Sub-Question Filtering (Single LLM Call)
**Test files to write first:**
- `test_phase4_relevance_filter_per_subq.py` — Tests `RelevanceFilter.filter_per_subquestion()` with grouped chunks
- `test_phase4_query_router_filter.py` — Tests filter stage with per-sub-q chunk groups
**Task 4.2.1: Add `filter_per_subquestion()` to `RelevanceFilter`**
File: `backend/app/services/relevance_filter.py`
New method signature:
```python
async def filter_per_subquestion(
    self,
    sub_questions: List[str],
    sub_chunks: List[List[Tuple[str, Dict]]],
    threshold: float = 7.0,
) -> Tuple[List[Tuple[str, List[Tuple[str, Dict]]]], str]:
    """Filter chunks per sub-question in a single LLM call.

    Args:
        sub_questions: List of decomposed sub-questions.
        sub_chunks: List of chunk lists (one per sub-question).
        threshold: Minimum relevance score.

    Returns:
        Tuple of (filtered_results, prompt).
        filtered_results: List of (sub_question, filtered_chunks_for_that_q).
    """
```
**Prompt design (single LLM call):**
```
Evaluate each chunk for relevance to its associated sub-question.
Sub-question 0: "{sub_q_0}"
Chunk 0: {chunk_0_text}
Chunk 1: {chunk_1_text}
...
Sub-question 1: "{sub_q_1}"
Chunk 0: {chunk_0_text}
Chunk 1: {chunk_1_text}
...
For each chunk, rate relevance 0-10 considering ONLY its associated sub-question.
Return a JSON object mapping sub-question indices to arrays of scores:
{"0": [8.5, 3.2, 9.0], "1": [7.0, 6.5, 9.1]}
```
Key rules:
- Each chunk is evaluated against its **own** sub-question (not the original user question)
- JSON keys are stringified sub-question indices (`"0"`, `"1"`, ...)
- Score arrays MUST match chunk count for each sub-question
- Same JSON extraction/markdown stripping logic as existing `filter()`
**Existing `filter()` method is preserved** — not modified, not deprecated. The new method is additive.
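
The key rules above can be sketched as a small parsing step. This is an illustration under stated assumptions, not the real `RelevanceFilter` code: `apply_grouped_scores` is a hypothetical helper name, the comparison uses `>=` (switch to `>` if a score equal to the threshold should be dropped), and a malformed reply or a score array of the wrong length keeps all chunks for the affected group.

```python
import json
import re
from typing import Dict, List, Tuple

Chunk = Tuple[str, Dict]
FENCE = "`" * 3  # markdown code-fence marker that LLM replies are sometimes wrapped in


def apply_grouped_scores(
    raw_reply: str,
    sub_chunks: List[List[Chunk]],
    threshold: float = 7.0,
) -> List[List[Chunk]]:
    """Keep chunks whose score meets the threshold, one group per sub-question."""
    # Strip an optional markdown fence around the JSON before parsing.
    cleaned = re.sub(rf"^{FENCE}(?:json)?\s*|\s*{FENCE}$", "", raw_reply.strip())
    try:
        scores = json.loads(cleaned)
    except json.JSONDecodeError:
        scores = None
    if not isinstance(scores, dict):
        return [list(chunks) for chunks in sub_chunks]  # fallback: no filtering

    filtered: List[List[Chunk]] = []
    for idx, chunks in enumerate(sub_chunks):
        sub_scores = scores.get(str(idx))  # keys are stringified indices
        valid = (
            isinstance(sub_scores, list)
            and len(sub_scores) == len(chunks)
            and all(isinstance(s, (int, float)) for s in sub_scores)
        )
        if not valid:
            filtered.append(list(chunks))  # malformed group: keep everything
            continue
        filtered.append([c for c, s in zip(chunks, sub_scores) if s >= threshold])
    return filtered


demo_chunks = [[("a", {}), ("b", {}), ("c", {})], [("d", {}), ("e", {}), ("f", {})]]
demo_kept = apply_grouped_scores('{"0": [8.5, 3.2, 9.0], "1": [7.0, 6.5, 9.1]}', demo_chunks)
```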
**Task 4.2.2: Update `_query_stream()` filter stage**
File: `backend/app/routers/query.py`
Changes:
- Call `relevance_filter.filter_per_subquestion(extracted_questions, chunks_for_filter, threshold)` instead of `relevance_filter.filter(question, chunks, threshold)`
- Build `chunks_for_filter` from per-sub-question retrieval results
- Track `filter_prompt` (the redesigned prompt)
- Format `chunks_filtered` XML with sub-question wrappers and `Relevance:` scores
**New `chunks_filtered` XML format:**
```xml
<sub_q idx="0" question="What are time extensions?">
<chunk_1>
Filename: NEC4 ACC.pdf
Page: 3
Relevance: 8.5
Content: Clause 61.3 states that...
</chunk_1>
</sub_q>
<sub_q idx="1" question="What notice is required?">
<chunk_1>
Filename: NEC4 Contract.pdf
Page: 12
Relevance: 9.0
Content: Notice must be given...
</chunk_1>
</sub_q>
```
**Commit**: `"feat: Phase 4.2 per-sub-question filtering with single LLM call"`
### Sub-Phase 4.3: Backend — Sub-Question-Organized Response Generation
**Test files to write first:**
- `test_phase4_generate_per_subq.py` — Tests `RAGService.generate_response_per_subquestion()`
- `test_phase4_response_format.py` — Tests the final answer matches expected format
**Task 4.3.1: Redesign `generate_response()` → `generate_response_per_subquestion()`**
File: `backend/app/services/rag.py`
New method signature:
```python
async def generate_response_per_subquestion(
    self,
    sub_questions: List[str],
    sub_chunks: List[List[str]],
    sub_metadata: List[List[Dict[str, Any]]],
) -> Tuple[str, str, List[List[SourceMetadata]]]:
    """Generate sub-question-organized RAG response.

    Args:
        sub_questions: List of decomposed sub-questions.
        sub_chunks: List of chunk text lists (one per sub-question).
        sub_metadata: List of metadata dict lists (one per sub-question).

    Returns:
        Tuple of (answer, prompt, grouped_sources).
        answer: Markdown string with sections per sub-question.
        prompt: The rendered LLM prompt.
        grouped_sources: List of SourceMetadata lists (one per sub-question).
    """
```
**New prompt template (replaces `generate`):**
```
You must answer each sub-question using ONLY the document chunks provided for it.
Do not use any external knowledge.
Format your answer as markdown sections — one section per sub-question.
Each section should start with "## Sub-question N: <the question>"
Each section should contain 1-5 bullet points.
Cite your sources inline using bracket labels, e.g. [filename, page N].
Place the citation at the end of each relevant bullet point.
{context_sections}
Answer:
```
**Context format (replaces `{context}`):**
```
### Context for Sub-question 0: "What are time extensions?"
[NEC4 ACC.pdf, page 3] Source: NEC4 ACC.pdf
Summary: Clause 61.3 discusses time extensions...
Content: Clause 61.3 states that the project manager...
[NEC4 Contract.pdf, page 12] Source: NEC4 Contract.pdf
Summary: Notice requirements for time extensions...
Content: Written notice must be given within...
### Context for Sub-question 1: "What notice is required?"
[NEC4 ACC.pdf, page 7] Source: NEC4 ACC.pdf
Summary: Notice requirements...
Content: The contractor shall notify the project manager in writing...
```
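
A sketch of how the `{context_sections}` value could be assembled from the per-sub-question inputs. `build_context_sections` is a hypothetical helper name, and the metadata keys `filename`, `page`, and `summary` are assumptions chosen to match the layout shown above.

```python
from typing import Any, Dict, List


def build_context_sections(
    sub_questions: List[str],
    sub_chunks: List[List[str]],
    sub_metadata: List[List[Dict[str, Any]]],
) -> str:
    """Build one '### Context for Sub-question N' block per sub-question."""
    sections: List[str] = []
    for idx, (sub_q, chunks, metas) in enumerate(
        zip(sub_questions, sub_chunks, sub_metadata)
    ):
        lines = [f'### Context for Sub-question {idx}: "{sub_q}"']
        for text, meta in zip(chunks, metas):
            # "filename", "page" and "summary" are assumed metadata keys.
            name = meta.get("filename", "unknown")
            page = meta.get("page", "?")
            lines.append(f"[{name}, page {page}] Source: {name}")
            if meta.get("summary"):
                lines.append(f"Summary: {meta['summary']}")
            lines.append(f"Content: {text}")
        sections.append("\n".join(lines))
    return "\n".join(sections)


context_sections = build_context_sections(
    ["What notice is required?"],
    [["The contractor shall notify the project manager in writing..."]],
    [[{"filename": "NEC4 ACC.pdf", "page": 7, "summary": "Notice requirements..."}]],
)
```

The bracket label at the start of each chunk is the same string the model is asked to cite, which keeps citations matchable against the grouped sources.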
**Expected answer format:**
```markdown
## Sub-question 1: What are time extensions?
- Time extensions must be notified to the project manager within 2 weeks [NEC4 ACC.pdf, page 3]
- The project manager must acknowledge the notice within 1 week [NEC4 Contract.pdf, page 12]
## Sub-question 2: What notice is required?
- Written notice must be given [NEC4 ACC.pdf, page 7]
```
**Existing `generate_response()` is preserved** — not modified, not deprecated.
**Task 4.3.2: Update `_query_stream()` generate stage**
File: `backend/app/routers/query.py`
Changes:
- Call `rag.generate_response_per_subquestion(extracted_questions, chunk_texts_by_subq, metadata_by_subq)`
- New SSE event: `generating_subquestion` — emitted before each sub-question's section (lets frontend show progressive build)
- `completed` SSE event includes both `answer` (markdown string) and `sub_question_sources` (grouped sources)
**New SSE event sequence:**
```json
{"phase": "decomposed", "extracted_questions": ["q1", "q2"]}
{"phase": "retrieving"}
{"phase": "filtering"}
{"phase": "generating"}
{"phase": "completed", "answer": "## Sub-question 1: ...\n\n...", "sub_question_sources": [{"sub_question_index": 0, "sub_question_text": "q1", "sources": [SourceMetadata, ...]}, ...]}
{"phase": "error", "message": "..."}
```
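
For reference, a sketch of how these events could be framed on the wire, assuming the stream uses standard Server-Sent Events `data:` lines carrying one JSON object each (the frontend parses each `data:` line with `JSON.parse`). `sse_line` and `happy_path_events` are hypothetical names; in the real orchestrator each event would be emitted as its stage starts, and an `error` event would replace `completed` on failure.

```python
import json
from typing import Any, Dict, Iterator, List


def sse_line(event: Dict[str, Any]) -> str:
    """Serialize one event as an SSE frame: 'data: <json>' plus a blank line."""
    return f"data: {json.dumps(event)}\n\n"


def happy_path_events(
    questions: List[str], answer: str, grouped_sources: List[Any]
) -> Iterator[str]:
    """Yield the success-path sequence: decomposed, three stage markers, completed."""
    yield sse_line({"phase": "decomposed", "extracted_questions": questions})
    for phase in ("retrieving", "filtering", "generating"):
        yield sse_line({"phase": phase})
    yield sse_line(
        {"phase": "completed", "answer": answer, "sub_question_sources": grouped_sources}
    )


frames = list(happy_path_events(["q1", "q2"], "## Sub-question 1: q1", []))
```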
**New `QueryResponse` model:**
File: `backend/app/models/query.py`
```python
class SubQuestionSources(BaseModel):
    sub_question_index: int
    sub_question_text: str
    sources: List[SourceMetadata]


class QueryResponse(BaseModel):
    extracted_questions: List[str]
    answer: str  # Markdown with ## sections
    sub_question_sources: List[SubQuestionSources]  # Grouped sources
    # Backward compat:
    sources: List[SourceMetadata]  # Flattened version (all sources)
```
**Commit**: `"feat: Phase 4.3 sub-question-organized response generation"`
### Sub-Phase 4.4: Backend — History & Prompt Template Updates
**Test files to write first:**
- `test_phase4_history_format.py` — Tests new XML/JSON history formats
- `test_phase4_prompt_templates.py` — Tests new generate template with `{context_sections}`
**Task 4.4.1: Update history recording**
File: `backend/app/routers/query.py` (the `_schedule_history` / `_record_history` helpers)
Changes:
- `chunks_retrieved`: Store new grouped XML format (with `<sub_q>` wrappers)
- `chunks_filtered`: Store new grouped XML format (with `<sub_q>` wrappers and `Relevance:` scores)
- `sources`: Store grouped JSON: `json.dumps([[SourceMetadata_dict, ...], [...]])` (list of lists)
- `final_answer`: Store markdown string with `##` sections
- Existing fields (`chunks_retrieved_count`, `chunks_filtered_count`) keep total counts
- New optional fields: `chunks_retrieved_per_subq_count`, `chunks_filtered_per_subq_count` (JSON array of ints)
**Task 4.4.2: Update history DB schema (minimal)**
File: `backend/app/core/sqlite_db.py`
Add two new columns (optional, NULL-able):
```sql
ALTER TABLE query_history ADD COLUMN chunks_retrieved_per_subq_count TEXT DEFAULT NULL;
ALTER TABLE query_history ADD COLUMN chunks_filtered_per_subq_count TEXT DEFAULT NULL;
```
These store JSON arrays like `[10, 8]` — one count per sub-question. NULL for pre-Package-4 records.
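
SQLite rejects `ADD COLUMN` for a column that already exists, so the migration needs a guard if it runs on every startup. The sketch below shows one way to make it idempotent with the standard-library `sqlite3` module; the two-column toy table and the helper name `add_per_subq_count_columns` are assumptions for illustration, not the real schema or code.

```python
import json
import sqlite3

NEW_COLUMNS = ("chunks_retrieved_per_subq_count", "chunks_filtered_per_subq_count")


def add_per_subq_count_columns(conn: sqlite3.Connection) -> None:
    """Add the two count columns only if they are not already present."""
    # PRAGMA table_info yields one row per column; index 1 holds the column name.
    existing = {row[1] for row in conn.execute("PRAGMA table_info(query_history)")}
    for column in NEW_COLUMNS:
        if column not in existing:
            conn.execute(f"ALTER TABLE query_history ADD COLUMN {column} TEXT DEFAULT NULL")
    conn.commit()


conn = sqlite3.connect(":memory:")
# Toy schema for illustration; the real query_history table has more columns.
conn.execute("CREATE TABLE query_history (id INTEGER PRIMARY KEY, question TEXT)")
add_per_subq_count_columns(conn)
add_per_subq_count_columns(conn)  # second run is a no-op

conn.execute(
    "INSERT INTO query_history (question, chunks_retrieved_per_subq_count, "
    "chunks_filtered_per_subq_count) VALUES (?, ?, ?)",
    ("demo", json.dumps([10, 10]), json.dumps([10, 8])),
)
stored = conn.execute(
    "SELECT chunks_filtered_per_subq_count FROM query_history"
).fetchone()[0]
per_subq_counts = json.loads(stored)  # one count per sub-question
```

Summing the decoded array gives the value that the existing total-count columns keep storing.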
**Task 4.4.3: Update history Pydantic models**
File: `backend/app/models/history.py`
Add optional fields to `QueryHistoryRecord` and `QueryHistoryDetail`:
```python
chunks_retrieved_per_subq_count: Optional[str] = None # JSON array string
chunks_filtered_per_subq_count: Optional[str] = None # JSON array string
```
**Task 4.4.4: Update prompt templates**
File: `backend/app/core/sqlite_db.py` (seed data)
New `generate` template:
```python
"generate": (
"You must answer each sub-question using ONLY the document chunks provided for it.\n"
"Do not use any external knowledge.\n"
"Format your answer as markdown sections — one section per sub-question.\n"
"Each section should start with \"## Sub-question N: <the question>\"\n"
"Each section should contain 1-5 bullet points.\n"
"Cite your sources inline using bracket labels, e.g. [filename, page N].\n"
"Place the citation at the end of each relevant bullet point.\n\n"
"{context_sections}\n\n"
"Answer:"
)
```
`decompose` and `filter` templates remain unchanged (they still use `{question}` placeholder — the orchestrator injects the right value at call time).
**Task 4.4.5: Update `PromptService` to handle new template placeholder**
File: `backend/app/services/prompt_service.py`
- Add `context_sections` as a known placeholder for the `generate` step (optional — `str.replace` already safe with unknown keys)
- The `reset_to_defaults()` method must include the new generate template
**Task 4.4.6: Update history detail API response**
File: `backend/app/routers/history.py`
`GET /api/v1/history/{id}` response now includes `chunks_retrieved_per_subq_count` and `chunks_filtered_per_subq_count` when they are not NULL. Backward-compatible (older records return `null` for these fields).
**Commit**: `"feat: Phase 4.4 history schema, prompt templates, and Pydantic model updates"`
### Sub-Phase 4.5: Frontend — Types & State Management
**Test files to write first:**
- `test_phase4_stream_state.test.tsx` — Tests `QueryStreamState` handles new response shape
- `test_phase4_types.test.ts` — Tests type compatibility
**Task 4.5.1: Update TypeScript types**
File: `frontend/src/types/index.ts`
New types:
```typescript
interface SubQuestionSources {
  sub_question_index: number;
  sub_question_text: string;
  sources: SourceMetadata[];
}

interface QueryStreamCompletedEvent {
  phase: 'completed';
  answer: string; // Markdown with ## sections
  sub_question_sources: SubQuestionSources[]; // Grouped sources
}

interface QueryStreamDecomposedEvent {
  phase: 'decomposed';
  extracted_questions: string[];
}

type QueryStreamEvent =
  | QueryStreamDecomposedEvent
  | { phase: 'retrieving' | 'filtering' | 'generating' }
  | QueryStreamCompletedEvent
  | { phase: 'error'; message: string };
```
**Task 4.5.2: Update `QueryStreamState` and mutation handler**
File: `frontend/src/lib/queries.tsx`
Changes:
```typescript
interface QueryStreamState {
  extractedQuestions: string[] | null;
  answer: string | null; // Full markdown
  subQuestionSources: SubQuestionSources[] | null; // NEW: grouped sources
  phase: 'idle' | 'decomposing' | 'retrieving' | 'filtering' | 'generating' | 'completed' | 'error';
  error: Error | null;
}
```
In the `completed` case:
```typescript
case 'completed':
  setState(prev => ({
    ...prev,
    answer: event.answer,
    subQuestionSources: event.sub_question_sources,
    phase: 'completed',
  }));
  break;
```
**Commit**: `"feat: Phase 4.5 frontend types and state management for per-sub-q responses"`
### Sub-Phase 4.6: Frontend — ResponsePanel & ExtractedQuestionsDisplay
**Test files to write first:**
- `test_phase4_response_panel.test.tsx` — Tests per-sub-question section rendering
- `test_phase4_citation_parser.test.ts` — Tests per-sub-question citation lookup
**Task 4.6.1: Redesign `ResponsePanel` for sub-question sections**
File: `frontend/src/components/ResponsePanel.tsx`
Current: single `ReactMarkdown` block + flat sources grid.
New layout:
```
┌─────────────────────────────────────────────────────┐
│ 📋 Response [Copy All] │
├─────────────────────────────────────────────────────┤
│ │
│ ┌─ Sub-question 1: What are time extensions? ─────┐│
│ │ │
│ │ • Time extensions must be notified... │
│ │ [NEC4 ACC.pdf, page 3] │
│ │ • The project manager must acknowledge... │
│ │ [NEC4 Contract.pdf, page 12] │
│ │ │
│ │ Sources (2) [Expand ▼] │
│ │ ┌──────────────────────────────────────────────┐ │
│ │ │ NEC4 ACC.pdf, Page 3 │ NEC4 Contract, p12 │ │
│ │ │ "Clause 61.3 states.." │ "Notice must be..." │ │
│ │ └──────────────────────────────────────────────┘ │
│ └────────────────────────────────────────────────────┘│
│ │
│ ┌─ Sub-question 2: What notice is required? ───────┐│
│ │ │
│ │ • Written notice must be given... │
│ │ [NEC4 ACC.pdf, page 7] │
│ │ │
│ │ Sources (1) [Expand ▼] │
│ └────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────┘
```
Implementation approach:
1. Parse the `answer` markdown into sections using `## Sub-question N:` headers
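2. Map each section to its `SubQuestionSources` by matching index (the `## Sub-question N` headers in the expected answer are 1-based, while `idx` values elsewhere in this plan are 0-based)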
3. Render each section as an accordion/card with:
- Header: sub-question text (from `SubQuestionSources`)
- Body: `ReactMarkdown` for bullet points (with inline citation links)
- Footer: collapsible sources grid (only sources belonging to this sub-question)
4. Keep the existing citation link behavior (clickable `[filename, page N]` → PDF viewer)
**Task 4.6.2: Update `citationParser.ts` for per-sub-question lookup**
File: `frontend/src/utils/citationParser.ts`
Current: `buildCitationLookup(sources: SourceMetadata[])` — returns a single global map.
New: `buildCitationLookup(subQuestionSources: SubQuestionSources[])` — returns a map scoped to the correct sources for each section. The citation `[filename, page N]` match is looked up in the relevant sub-question's source list.
**Task 4.6.3: Update `ExtractedQuestionsDisplay` for anchors**
File: `frontend/src/components/ExtractedQuestionsDisplay.tsx`
Minor enhancement:
- Make each extracted question a clickable anchor that scrolls to its corresponding section in the answer
- Add `id="subq-{index}"` to each section header in `ResponsePanel`
- Keep existing skeleton loading behavior
**Commit**: `"feat: Phase 4.6 frontend per-sub-question response rendering"`
### Sub-Phase 4.7: Testing & Polish
**Test files to write:**
- `test_phase4_integration_query_pipeline.py` — Full integration test simulating per-sub-q pipeline
- `test_phase4_acceptance_query.py` — Acceptance test with real LLM (manual run)
- `test_phase4_e2e_query_flow.test.tsx` — Frontend e2e test with mocked SSE stream
**Task 4.7.1: Backend unit tests**
- Run `pytest backend/app/test/test_phase4_*.py -v` — all must pass
- Verify no regressions in existing Phase 1 and Phase 3 tests
- Update `test_phase1_rag_service.py` for new method signatures
- Update `test_phase1_relevance_filter.py` for per-sub-q behavior
- Rewrite `test_phase3_query_history_integration.py` for new pipeline flow
- Update `test_phase3_prompt_injection.py` for new generate template
**Task 4.7.2: Backend acceptance tests**
- `test_phase4_acceptance_query.py` — real LLM, real ChromaDB
- Verify: answer contains `## Sub-question` headers, sources grouped by sub-question index
- Verify: each sub-question section has 1-5 bullet points
- Verify: inline citations match the correct sub-question's source list
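
The structural checks above (header presence and 1-5 bullets per section) could be expressed roughly as follows. This is a sketch, not the acceptance test itself: `split_answer_sections` and `check_answer_format` are hypothetical helper names, and it assumes bullets are written with a leading `- ` as in the expected answer format.

```python
import re
from typing import Any, Dict, List

HEADER = re.compile(r"^## Sub-question (\d+): (.+)$")


def split_answer_sections(answer: str) -> List[Dict[str, Any]]:
    """Split the generated markdown into sections with their bullet points."""
    sections: List[Dict[str, Any]] = []
    for raw in answer.splitlines():
        line = raw.strip()
        match = HEADER.match(line)
        if match:
            sections.append(
                {"number": int(match.group(1)), "question": match.group(2), "bullets": []}
            )
        elif sections and line.startswith("- "):
            sections[-1]["bullets"].append(line[2:])
    return sections


def check_answer_format(answer: str, expected_sections: int) -> bool:
    """True when every expected section exists and has between 1 and 5 bullets."""
    sections = split_answer_sections(answer)
    return len(sections) == expected_sections and all(
        1 <= len(s["bullets"]) <= 5 for s in sections
    )


sample = (
    "## Sub-question 1: What are time extensions?\n"
    "- Time extensions must be notified to the project manager within 2 weeks [NEC4 ACC.pdf, page 3]\n"
    "- The project manager must acknowledge the notice within 1 week [NEC4 Contract.pdf, page 12]\n"
    "## Sub-question 2: What notice is required?\n"
    "- Written notice must be given [NEC4 ACC.pdf, page 7]\n"
)
sample_ok = check_answer_format(sample, expected_sections=2)
```

A section that legitimately contains only "No relevant information found" has zero bullets and would fail this check, so a real test would need to allow for that case.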
**Task 4.7.3: Frontend tests**
- `test_phase4_response_panel.test.tsx` — renders per-sub-question sections, expandable sources
- `test_phase4_citation_parser.test.ts` — per-sub-question lookup returns correct source
- `test_phase4_e2e_query_flow.test.tsx` — mocks SSE with new event format, verifies section rendering
- Update existing `ResponsePanel.test.tsx` and `citationParser.test.ts` for new API
**Task 4.7.4: Frontend build verification**
- `npm run build` — no TypeScript errors
- `npm test` — all 62 existing tests pass + new Phase 4 tests
- Verify manual flow: ask question → see extracted questions → see per-sub-question answer sections → expand sources per section
**Task 4.7.5: Error handling**
- Empty decomposition: if `decompose()` returns `[]`, fall back to using original question as single sub-question
- Empty retrieval for some sub-questions: that sub-question gets no chunks → section shows "No relevant information found"
- Filter failure (all chunks below threshold): that sub-question gets no answer → graceful empty section
- JSON parse failure in filter: fall back to including all chunks (no filtering) for that sub-question
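
Two of the fallbacks listed above are simple enough to sketch directly. Both helper names are hypothetical, and treating whitespace-only sub-questions as empty is an added assumption.

```python
from typing import List

NO_INFO = "No relevant information found"


def ensure_sub_questions(question: str, decomposed: List[str]) -> List[str]:
    """Fall back to the original question when decomposition yields nothing usable."""
    cleaned = [q.strip() for q in decomposed if q and q.strip()]
    return cleaned or [question]


def section_body(bullets: List[str]) -> str:
    """Render a section's bullets, or the placeholder text when there are none."""
    return "\n".join(f"- {b}" for b in bullets) if bullets else NO_INFO
```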
**Task 4.7.6: Documentation**
- Update `AGENTS.md` with new pipeline architecture section
- Add docstrings to all new methods (`retrieve_per_subquestion`, `filter_per_subquestion`, `generate_response_per_subquestion`)
- Update prompt template documentation in system prompts page
**Commit**: `"feat: Phase 4.7 testing, error handling, and polish for per-sub-q pipeline"`

---
## Phase 4a: Prompt Service Integration for Per-Sub-Q Filter (2026-04-27)
**Root issue**: `filter_per_subquestion()` in `relevance_filter.py` had a hardcoded prompt (`_build_per_subq_prompt()`) — completely bypassing `PromptService`. Users could not edit the per-sub-q filter prompt on the System Prompts page, unlike the flat `filter` step which was already prompt-service-driven.
**Solution**: Broke the per-sub-q filter prompt into **3 composable pieces**, each a separately editable step on the System Prompts page:
| Step Name | Label | Placeholders | Default |
|-----------|-------|-------------|---------|
| `filter_intro` | Step 2.1: Filter Intro (Preamble) | *(none)* | `"Evaluate each chunk for relevance to its associated sub-question only."` |
| `filter_section` | Step 2.2: Filter Section (Per Sub-Q) | `{subq_idx}`, `{subq_question}`, `{chunks}` | `'Sub-question {subq_idx}: "{subq_question}"\n{chunks}'` |
| `filter_outro` | Step 2.3: Filter Outro (Format) | *(none)* | JSON format instructions + example |
The `RelevanceFilter._build_per_subq_prompt()` now composes them at runtime:
```
filter_intro + [filter_section.replace(...) for each sub-q] + filter_outro
```
Falls back to built-in defaults when `PromptService` is unavailable.
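
A sketch of that composition, assuming templates arrive as a plain dict keyed by step name. `build_per_subq_prompt` is a hypothetical free-function version of `_build_per_subq_prompt()`, the `Chunk N:` line format follows the prompt design in Sub-Phase 4.2, and `DEFAULT_OUTRO` is an abbreviated stand-in because the real default text is not reproduced in this plan.

```python
from typing import Dict, List, Optional

DEFAULT_INTRO = "Evaluate each chunk for relevance to its associated sub-question only."
DEFAULT_SECTION = 'Sub-question {subq_idx}: "{subq_question}"\n{chunks}'
# Abbreviated stand-in; the real default also carries a JSON example.
DEFAULT_OUTRO = "Return a JSON object mapping sub-question indices to arrays of scores."


def build_per_subq_prompt(
    sub_questions: List[str],
    sub_chunk_texts: List[List[str]],
    templates: Optional[Dict[str, str]] = None,
) -> str:
    """Compose intro + one section per sub-question + outro."""
    templates = templates or {}  # empty dict means: use built-in defaults
    intro = templates.get("filter_intro", DEFAULT_INTRO)
    section_tpl = templates.get("filter_section", DEFAULT_SECTION)
    outro = templates.get("filter_outro", DEFAULT_OUTRO)

    sections: List[str] = []
    for idx, (sub_q, texts) in enumerate(zip(sub_questions, sub_chunk_texts)):
        chunk_block = "\n".join(f"Chunk {i}: {t}" for i, t in enumerate(texts))
        # str.replace leaves any other braces in a user-edited template untouched.
        sections.append(
            section_tpl.replace("{subq_idx}", str(idx))
            .replace("{subq_question}", sub_q)
            .replace("{chunks}", chunk_block)
        )
    return "\n\n".join([intro, *sections, outro])


prompt = build_per_subq_prompt(
    ["What are time extensions?", "What notice is required?"],
    [["Clause 61.3 states that..."], ["Notice must be given...", "The contractor shall..."]],
)
```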
### Bugs Fixed
1. **`generate_per_subq` not seeded**: `rag.py` called `get_prompt_template("generate_per_subq")` but this step name was never added to `_VALID_STEPS`, `_SEED_STEPS`, or `_SEED_TEMPLATES` — would crash at runtime with `ValueError`. Now properly seeded with `{context_sections}` placeholder.
2. **`_SEED_GENERATE` placeholder mismatch from Package 4**: The flat `generate_response()` expects `{question}`/`{context}` placeholders, but Package 4 changed the seed template to use `{context_sections}` (intended for per-sub-q generate). Restored flat template; `generate_per_subq` now holds `{context_sections}`.
### Database Backfill Migration
The existing `seed_default_profiles()` only inserted steps for NEWLY created profiles. Added a backfill loop that iterates ALL existing profiles and `INSERT OR IGNORE`s any missing step names. This ensures existing A/B/C profiles pick up `filter_intro`, `filter_section`, `filter_outro`, and `generate_per_subq` on restart.
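
The backfill idea can be illustrated with an in-memory database. Everything about the schema here is hypothetical (table names, columns, and the `UNIQUE (profile_id, step)` constraint that `INSERT OR IGNORE` depends on), and the template strings are placeholders; only the step names and the 9 → 21 row counts come from this plan.

```python
import sqlite3

STEPS = (
    "decompose", "filter", "generate",
    "filter_intro", "filter_section", "filter_outro", "generate_per_subq",
)
SEED_TEMPLATES = {step: f"<default {step} template>" for step in STEPS}  # placeholder text

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE profiles (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE prompts (
        profile_id INTEGER, step TEXT, template TEXT,
        UNIQUE (profile_id, step)  -- INSERT OR IGNORE relies on this constraint
    );
    INSERT INTO profiles (name) VALUES ('A'), ('B'), ('C');
    """
)
# Pre-backfill state: 3 profiles x 3 steps = 9 prompts, one of them user-edited.
for pid in (1, 2, 3):
    for step in STEPS[:3]:
        conn.execute("INSERT INTO prompts VALUES (?, ?, ?)", (pid, step, SEED_TEMPLATES[step]))
conn.execute("UPDATE prompts SET template = 'custom' WHERE profile_id = 1 AND step = 'decompose'")


def backfill_missing_steps(conn: sqlite3.Connection) -> None:
    """Insert any missing (profile, step) rows without touching existing ones."""
    profile_ids = [row[0] for row in conn.execute("SELECT id FROM profiles")]
    for pid in profile_ids:
        for step, template in SEED_TEMPLATES.items():
            conn.execute(
                "INSERT OR IGNORE INTO prompts (profile_id, step, template) VALUES (?, ?, ?)",
                (pid, step, template),
            )
    conn.commit()


backfill_missing_steps(conn)
total = conn.execute("SELECT COUNT(*) FROM prompts").fetchone()[0]
```

Because conflicting rows are ignored rather than replaced, user-edited templates survive the backfill and re-running it on every restart is harmless.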
### System Prompts UI Restructured
The flat `filter` and `generate` steps were removed from the UI (they're unused by the current pipeline). The page now shows 5 steps:
| UI Order | Label | Step Key |
|----------|-------|----------|
| 1 | Step 1: Query Decomposition | `decompose` |
| 2 | Step 2.1: Filter Intro (Preamble) | `filter_intro` |
| 3 | Step 2.2: Filter Section (Per Sub-Q) | `filter_section` |
| 4 | Step 2.3: Filter Outro (Format) | `filter_outro` |
| 5 | Step 3: Generate (Per-Sub-Question) | `generate_per_subq` |
The old `filter` and `generate` templates remain in the DB (for API backward compatibility) but are hidden from the UI.
### Files Changed
| File | Change |
|------|--------|
| `backend/app/core/sqlite_db.py` | 3 new seed templates + `generate_per_subq` seed; backfill migration; restored `_SEED_GENERATE` to `{question}`/`{context}` |
| `backend/app/services/prompt_service.py` | Added 4 step names to `_VALID_STEPS` |
| `backend/app/routers/prompts.py` | Added 4 step names to `_VALID_STEPS` |
| `backend/app/services/relevance_filter.py` | Refactored `_build_per_subq_prompt()` to use PromptService + built-in fallback constants |
| `frontend/src/components/PromptEditor.tsx` | Replaced unused flat steps with 5-step per-sub-q layout (Step 2.1-2.3 + Step 3) |
| `frontend/src/components/PlaceholderDocs.tsx` | Added `{context_sections}`, `{subq_idx}`, `{subq_question}` docs |
| `backend/app/test/conftest.py` | Added 4 new templates to mock |
| `backend/app/test/test_phase3_sqlite_db.py` | Updated counts (9→21 prompts) and placeholder assertions |
| `backend/app/test/test_phase3_prompt_service.py` | Updated step set + placeholder assertions |
| `backend/app/test/test_phase3_prompts_router.py` | Updated step set assertion |
| `backend/app/test/test_phase4_prompt_templates.py` | Updated for split generate/generate_per_subq |
| `frontend/src/test/components/PromptEditor.test.tsx` | Updated to 5 textareas, new labels, new placeholder layout |
| `frontend/src/test/components/PlaceholderDocs.test.tsx` | Updated to 6 placeholders |
### Test Results (Post-Phase 4a)
- **Backend**: 295 passed, 5 skipped (pre-existing)
- **Frontend**: 182 passed, 1 pre-existing failure (unrelated `file-input` e2e)
---
## Sub-Phase Summary
| Sub-Phase | Scope | Backend | Frontend | Tests | Status |
|-----------|-------|---------|----------|-------|--------|
| 4.1 | Per-sub-q retrieval | `rag.py`, `query.py`, format helpers | None | `test_phase4_retrieve_per_subquestion.py`, `test_phase4_query_router_retrieval.py` | ✅ Complete |
| 4.2 | Per-sub-q filtering (1 LLM call) | `relevance_filter.py`, `query.py` | None | `test_phase4_relevance_filter_per_subq.py`, `test_phase4_query_router_filter.py` | ✅ Complete |
| 4.3 | Sub-q-organized response generation | `rag.py`, `query.py`, `models/query.py` | None | `test_phase4_generate_per_subq.py`, `test_phase4_response_format.py` | ✅ Complete |
| 4.4 | History schema, prompts, models | `sqlite_db.py`, `history.py` (router + models), `prompt_service.py` | None | `test_phase4_history_format.py`, `test_phase4_prompt_templates.py` | ✅ Complete |
| 4.5 | Frontend types + state | None | `types/index.ts`, `lib/queries.tsx` | `test_phase4_stream_state.test.tsx`, `test_phase4_types.test.ts` | ✅ Complete |
| 4.6 | Frontend rendering | None | `ResponsePanel.tsx`, `citationParser.ts`, `ExtractedQuestionsDisplay.tsx` | `test_phase4_response_panel.test.tsx`, `test_phase4_citation_parser.test.ts` | ✅ Complete |
| 4.7 | Testing & polish | All affected files | All affected files | Integration + acceptance + e2e tests | ✅ Complete |
| 4a | Prompt service integration for filter_per_subq | `sqlite_db.py`, `prompt_service.py`, `prompts.py`, `relevance_filter.py` | `PromptEditor.tsx`, `PlaceholderDocs.tsx` | Updated 7 test files, 13 total files changed | ✅ Complete |
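As a rough illustration of sub-phase 4.1, the sketch below calls a retrieval function once per sub-question and keeps the results grouped. The signature shown here, the `{"index", "text", "chunks"}` group shape, and the 1-based index are assumptions, not the actual `retrieve_per_subquestion()` in `rag.py`. The fallback to the original question mirrors the resolution recorded under Open Questions.

```python
from typing import Callable

Chunk = dict  # hypothetical chunk shape, e.g. {"text": ..., "source": ...}


def retrieve_per_subquestion(
    question: str,
    sub_questions: list[str],
    retrieve: Callable[[str, int], list[Chunk]],
    n_results: int,
) -> list[dict]:
    """Run one retrieval per sub-question and keep the results grouped."""
    # If decomposition produced nothing, treat the original question
    # as the only sub-question.
    queries = sub_questions or [question]
    groups = []
    for index, text in enumerate(queries, start=1):
        groups.append(
            {"index": index, "text": text, "chunks": retrieve(text, n_results)}
        )
    return groups
```

The same `n_results` is passed for every sub-question, matching the "global" setting chosen for `retrieval_n_results`.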
---
## Implementation Sequence & Dependencies
```
4.1 (Retrieval) ──┐
                  ├──► 4.2 (Filtering) ──► 4.3 (Generate) ──► 4.4 (History/Prompts)
                  │                                                     │
                  │                                                     ▼
                  │                                        4.5 (Frontend Types/State)
                  │                                                     │
                  │                                                     ▼
                  │                                         4.6 (Frontend Rendering)
                  │                                                     │
                  └─────────────────────────────────────────────────────▼
                                                             4.7 (Testing & Polish)
```
- **4.1 → 4.2 sequential**: Filtering needs per-sub-q chunk structure from retrieval
- **4.2 → 4.3 sequential**: Generation needs filtered chunks from filtering stage
- **4.3 → 4.4 sequential**: History recording and prompt templates need final data shapes
- **4.4 → 4.5 parallel**: Backend prompt/history changes don't block frontend type definitions
- **4.5 → 4.6 sequential**: Rendering needs types and state management
- **4.7 blocked by all**: Integration tests need everything wired together
**Parallelization opportunity**: 4.5 (frontend types) could start as soon as 4.3 defines the SSE contract, but it's safer to start after 4.4 confirms the final data shapes.
---
## Affected Files — Complete Inventory
### Backend — New Files
| File | Sub-Phase | Purpose |
|------|-----------|---------|
| `backend/app/test/test_phase4_retrieve_per_subquestion.py` | 4.1 | Unit test: `retrieve_per_subquestion()` |
| `backend/app/test/test_phase4_query_router_retrieval.py` | 4.1 | Unit test: retrieval stage in `_query_stream` |
| `backend/app/test/test_phase4_relevance_filter_per_subq.py` | 4.2 | Unit test: `filter_per_subquestion()` |
| `backend/app/test/test_phase4_query_router_filter.py` | 4.2 | Unit test: filter stage in `_query_stream` |
| `backend/app/test/test_phase4_generate_per_subq.py` | 4.3 | Unit test: `generate_response_per_subquestion()` |
| `backend/app/test/test_phase4_response_format.py` | 4.3 | Unit test: answer format validation |
| `backend/app/test/test_phase4_history_format.py` | 4.4 | Unit test: new XML/JSON history formats |
| `backend/app/test/test_phase4_prompt_templates.py` | 4.4 | Unit test: new generate template |
| `backend/app/test/test_phase4_integration_query_pipeline.py` | 4.7 | Integration test: full per-sub-q pipeline |
| `backend/app/test/acceptance/test_phase4_acceptance_query.py` | 4.7 | Acceptance test: real LLM |
### Backend — Modified Files
| File | Sub-Phase | Changes |
|------|-----------|---------|
| `backend/app/services/rag.py` | 4.1, 4.3 | Add `retrieve_per_subquestion()`, `generate_response_per_subquestion()` |
| `backend/app/services/relevance_filter.py` | 4.2 | Add `filter_per_subquestion()` |
| `backend/app/routers/query.py` | 4.1-4.4 | Refactor `_query_stream()`, add per-sub-q format helpers, update history recording |
| `backend/app/models/query.py` | 4.3 | Add `SubQuestionSources` model, update `QueryResponse` |
| `backend/app/models/history.py` | 4.4 | Add optional per-sub-q count fields |
| `backend/app/core/sqlite_db.py` | 4.4 | Add new columns, update seed generate template |
| `backend/app/services/prompt_service.py` | 4.4 | Update `reset_to_defaults()` generate template |
| `backend/app/routers/history.py` | 4.4 | Include new fields in detail response |
| `backend/app/core/config.py` | 4.1 | (Maybe) Add `retrieval_n_results_per_subq` setting |
### Backend — Tests Needing Update
| File | Sub-Phase | Changes |
|------|-----------|---------|
| `backend/app/test/test_phase1_rag_service.py` | 4.7 | Add tests for new methods; existing tests unaffected |
| `backend/app/test/test_phase1_relevance_filter.py` | 4.7 | Add tests for `filter_per_subquestion()` |
| `backend/app/test/test_phase3_query_history_integration.py` | 4.7 | Rewrite pipeline simulation for per-sub-q flow |
| `backend/app/test/test_phase3_prompt_injection.py` | 4.7 | Add tests for new generate template |
| `backend/app/test/acceptance/test_acceptance_phase1_rag_query.py` | 4.7 | Rewrite — SSE parsing + new response shape |
| `backend/app/test/conftest.py` | 4.7 | Add per-sub-q mock helpers |
### Frontend — New Files
| File | Sub-Phase | Purpose |
|------|-----------|---------|
| `frontend/src/test/components/test_phase4_response_panel.test.tsx` | 4.7 | Component test: per-sub-q sections |
| `frontend/src/test/utils/test_phase4_citation_parser.test.ts` | 4.7 | Unit test: per-sub-q citation lookup |
| `frontend/src/test/e2e/test_phase4_query_flow.test.tsx` | 4.7 | E2E test: mocked SSE with new format |
| `frontend/src/test/lib/test_phase4_stream_state.test.tsx` | 4.5 | State test: new event shapes |
| `frontend/src/test/lib/test_phase4_types.test.ts` | 4.5 | Type test: type compatibility |
### Frontend — Modified Files
| File | Sub-Phase | Changes |
|------|-----------|---------|
| `frontend/src/types/index.ts` | 4.5 | Add `SubQuestionSources`, update `QueryStreamEvent` |
| `frontend/src/lib/queries.tsx` | 4.5 | Update `QueryStreamState`, `completed` event handler |
| `frontend/src/components/ResponsePanel.tsx` | 4.6 | Redesign — per-sub-question sections with grouped sources |
| `frontend/src/utils/citationParser.ts` | 4.6 | Update `buildCitationLookup()` for per-sub-q |
| `frontend/src/components/ExtractedQuestionsDisplay.tsx` | 4.6 | Add anchor links to answer sections |
| `frontend/src/pages/LTTPage.tsx` | 4.6 | Pass new props to children |
---
## Risk Register
| Risk | Likelihood | Impact | Mitigation |
|------|-----------|--------|------------|
| LLM struggles with per-sub-q filtering prompt format | Medium | High — all chunks dropped | Use strong prompt constraints, validate JSON, fall back to including all chunks on parse failure |
| LLM generates answer not matching `## Sub-question N:` format | Medium | Medium — frontend can't parse sections | Fall back to rendering as single block if parsing fails. Prompt engineering tuned for format compliance |
| Same chunk retrieved by multiple sub-questions → duplicated in context | High | Low — slightly larger prompt but acceptable | Accept duplicates. ChromaDB naturally returns same doc if relevant to multiple queries. Each sub-q's evaluation is independent |
| Per-sub-q retrieval = more ChromaDB queries = slower | Medium | Medium (N × retrieval latency) | ChromaDB retrieval is fast (~10-50ms per query). 5 sub-questions × 10-50ms = 50-250ms overhead when run sequentially. Acceptable trade-off for better relevance. |
| History DB migration fails for existing records | Low | Low — new columns are NULL-able | `ALTER TABLE ADD COLUMN ... DEFAULT NULL` is safe. Existing records work as before — `chunks_retrieved`/`chunks_filtered` still have flat XML. |
| Frontend rendering breaks on older history records | Low | Low — answer format differs | `ResponsePanel` renders per-sub-q sections only when `subQuestionSources` is non-null. Older history records show flat answer as before. |
| Prompt template migration breaks user-customized prompts | Medium | Medium (users lose their generate template) | Warn in docs. The `generate` template changes fundamentally (single `{context}` → `{context_sections}`). Users must re-customize. |
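The first mitigation in the table (validate the filter JSON and fall back to keeping all chunks) might be sketched as follows. The response shape (`{"1": [8, 2], "2": [5]}`, one score per chunk keyed by sub-question index), the function name `apply_filter_scores`, and the threshold semantics are assumptions for illustration only.

```python
import json


def apply_filter_scores(
    raw_response: str, groups: dict[str, list[str]], threshold: float
) -> dict[str, list[str]]:
    """Keep chunks scoring >= threshold; keep everything if scores are unusable."""
    try:
        scores = json.loads(raw_response)
        if not isinstance(scores, dict):
            raise ValueError("expected a JSON object")
    except ValueError:  # json.JSONDecodeError is a ValueError subclass
        # Whole response unusable: keep every chunk for every sub-question.
        return {key: list(chunks) for key, chunks in groups.items()}

    kept = {}
    for key, chunks in groups.items():
        group_scores = scores.get(key)
        valid = (
            isinstance(group_scores, list)
            and len(group_scores) == len(chunks)
            and all(
                isinstance(s, (int, float)) and not isinstance(s, bool)
                for s in group_scores
            )
        )
        if not valid:
            # Degrade per sub-question: keep all of this group's chunks.
            kept[key] = list(chunks)
        else:
            kept[key] = [c for c, s in zip(chunks, group_scores) if s >= threshold]
    return kept
```

Checking each group separately means one malformed score array only widens the context for that sub-question instead of discarding or keeping everything.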
---
## Acceptance Criteria
### Backend
- [x] `POST /api/v1/query` retrieves chunks per sub-question (verified by history XML showing `<sub_q>` wrappers)
- [x] Filtering uses single LLM call evaluating chunks against their originating sub-question (verified by filter prompt)
- [x] Response answer is organized by sub-question with `## Sub-question N:` headers
- [x] `sub_question_sources` in SSE `completed` event is grouped by sub-question index
- [x] History records include new grouped XML formats for `chunks_retrieved` and `chunks_filtered`
- [x] History records include grouped `sources` JSON (list of lists)
- [x] History records include per-sub-q chunk counts
- [x] New `generate` prompt template uses `{context_sections}` placeholder
- [x] Prompt service `reset_to_defaults()` includes new generate template
- [x] Existing `decompose`, `filter` (old), `generate_response` (old) methods are unchanged
- [x] All Phase 1, Phase 3, and new Phase 4 unit tests pass (312 passed, 4 skipped)
- [x] All acceptance tests pass with real LLM (manual run)
### Frontend
- [x] `QueryStreamState` includes `subQuestionSources` field
- [x] `ResponsePanel` renders per-sub-question sections with expandable source grids
- [x] Each section's sources are scoped to that sub-question (no cross-contamination)
- [x] Inline citations `[filename, page N]` link to the correct PDF viewer page
- [x] `ExtractedQuestionsDisplay` shows clickable anchors to answer sections
- [x] Copy button copies all answer text including section headers
- [x] Loading states: skeleton per section during generation
- [x] Empty state: "No relevant information found" per sub-question (not entire response)
- [x] All 62+ existing frontend tests still pass (183 passed)
- [x] All new Phase 4 frontend tests pass
- [x] `npm run build` succeeds with zero TypeScript errors
- [x] Manual verification: full query flow works end-to-end
---
## New Dependencies
None. All changes use existing libraries (FastAPI, ChromaDB, OpenAI SDK, React, ReactMarkdown, TanStack Query).
---
## Decisions (All Confirmed)
| # | Topic | Decision |
|---|-------|----------|
| 1 | Single vs multiple filter LLM calls | **Single call** — user explicitly requested this |
| 2 | Filter prompt design | Group chunks by sub-question in one prompt. JSON response maps sub-q indices to score arrays |
| 3 | Answer format | Markdown with `## Sub-question N: <question>` headers |
| 4 | Sources grouping | `sub_question_sources: [{index, text, sources}, ...]` in SSE + frontend |
| 5 | History XML format | Add `<sub_q idx="N" question="...">` wrappers around chunk groups |
| 6 | History DB migration | Add 2 new NULL-able columns. No data migration needed. |
| 7 | Backward compatibility | Preserve old `retrieve()`, `filter()`, `generate_response()` methods. New methods are additive. |
| 8 | Deduplication | None. Same chunk may appear in multiple sub-questions. Each sub-q evaluates independently. |
| 9 | Error handling | Per-sub-question graceful degradation. Filter failure → include all chunks for that sub-q. Generate failure → "Unable to generate answer for this sub-question." |
| 10 | Frontend rendering engine | Keep `ReactMarkdown`. Parse sections client-side by splitting on `## Sub-question N:` headers. |
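Decision 5 (history XML with `<sub_q idx="N" question="...">` wrappers) could be produced along these lines. The `<chunks>` root element, the `<chunk>` child element, and the input group shape are assumptions; only the `<sub_q>` wrapper and its two attributes come from the decision above.

```python
import xml.etree.ElementTree as ET


def chunks_to_grouped_xml(groups: list[dict]) -> str:
    """Serialize chunk groups, one <sub_q> wrapper per sub-question."""
    root = ET.Element("chunks")  # hypothetical root element
    for group in groups:
        sub_q = ET.SubElement(
            root, "sub_q", idx=str(group["index"]), question=group["text"]
        )
        for chunk_text in group["chunks"]:
            # ElementTree escapes text and attribute values on output.
            ET.SubElement(sub_q, "chunk").text = chunk_text
    return ET.tostring(root, encoding="unicode")
```

Building the tree with `ElementTree` rather than string formatting keeps quotes and ampersands in a sub-question from breaking the `question` attribute.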
---
## Open Questions
None — all resolved.
| # | Question | Resolution |
|---|----------|------------|
| 1 | Progressive SSE events? | **Yes** — emit `generating_subquestion` as each sub-question's answer is generated. Frontend renders sections progressively. |
| 2 | `retrieval_n_results` per sub-question or global? | **Global** — same value for all sub-questions. Simpler config, one setting. |
| 3 | Fallback when decomposition returns 0 sub-questions? | **Fall back to original question** — treat as single sub-question. Pipeline runs as 1-sub-q case (retrieval via original question, no filtering needed for single sub-q, flat answer). |
---
## Test Plan Summary
### Backend (New Tests)
| File | Tests | Coverage |
|------|-------|----------|
| `test_phase4_retrieve_per_subquestion.py` | ~6 | Per-sub-q retrieval, empty input, single sub-q, dedup behavior |
| `test_phase4_query_router_retrieval.py` | ~4 | SSE events during retrieval, chunk XML format |
| `test_phase4_relevance_filter_per_subq.py` | ~6 | Per-sub-q filtering, JSON response parsing, threshold behavior |
| `test_phase4_query_router_filter.py` | ~4 | SSE events during filtering, filtered XML format |
| `test_phase4_generate_per_subq.py` | ~5 | Per-sub-q generate, prompt construction, answer format |
| `test_phase4_response_format.py` | ~4 | Answer has `##` headers, citations in correct sections |
| `test_phase4_history_format.py` | ~5 | New XML/JSON formats, per-sub-q counts |
| `test_phase4_prompt_templates.py` | ~3 | New generate template, `{context_sections}` placeholder |
| `test_phase4_integration_query_pipeline.py` | ~5 | Full pipeline simulation |
| `test_phase4_acceptance_query.py` | ~3 | Real LLM end-to-end (manual) |
### Frontend (New Tests)
| File | Tests | Coverage |
|------|-------|----------|
| `test_phase4_stream_state.test.tsx` | ~4 | State updates for new event shapes |
| `test_phase4_types.test.ts` | ~2 | Type compatibility checks |
| `test_phase4_response_panel.test.tsx` | ~6 | Section rendering, source grouping, copy, loading |
| `test_phase4_citation_parser.test.ts` | ~4 | Per-sub-q lookup, cross-section isolation |
| `test_phase4_e2e_query_flow.test.tsx` | ~3 | Full SSE flow with mocked stream |
---
## Phase PX: Profile Export/Import (2026-04-27)
**Source**: User request — "add an export and import function for setting a profile. The format is json."
**Scope**: Add JSON export/import capability to the System Prompts page. Users can download a profile's prompt configuration as a `.json` file and import it into another profile (or the same one) to transfer or back up their prompt settings.
**Status**: 🟡 Planned — not yet implemented.
---
### Objective
Let users:
1. **Export** a single profile's prompt templates as a downloadable JSON file
2. **Import** a previously exported JSON file to overwrite a profile's prompt templates
3. Optionally, **export all** profiles at once for full configuration backup
---
### Decision Register
| # | Decision | Rationale |
|---|----------|-----------|
| P1 | Export single profiles, not all-at-once by default | User asked "for setting a profile" — per-profile export/import is more practical for sharing individual configurations. Add "Export All" as secondary option. |
| P2 | Import overwrites ALL prompt steps for target profile | Simplest mental model. Import = full replace (not merge). User gets confirmation dialog before proceeding. |
| P3 | Export JSON includes all 7 steps (including legacy `filter`, `generate`) | Even though UI hides these, the DB stores them. Export should be a complete snapshot — import restores all 7. |
| P4 | Do NOT export auto-increment IDs | `id` fields are not portable between databases. Import inserts new rows; joins on `(name, step_name)` uniqueness. |
| P5 | `created_at`/`updated_at` reset on import | Imported profiles get fresh timestamps (`datetime('now')`). Original export timestamp preserved in file metadata only. |
| P6 | Active profile state NOT imported | `is_active` is deployment-specific. The user sets active profile separately via the existing dropdown. Import only touches `prompt_template` content. |
| P7 | Validate profile name on import | Only A, B, C allowed. Import into non-existent name = rejected. |
| P8 | JSON schema versioned | `"format": "legco-reranker-profile/v1"` for future-proofing. Reject unknown versions on import. |
---
### JSON Format Specification
#### Single Profile Export
```json
{
"format": "legco-reranker-profile/v1",
"profile_name": "A",
"exported_at": "2026-04-27T12:00:00Z",
"prompts": {
"decompose": "Given this question: '{question}'\n\nBreak it down into 2-5 simplified sub-questions...",
"filter": "Given question '{question}' and these document chunks:\n\n{chunks}\n\n...",
"generate": "Question: {question}\n\nContext:\n{context}\n\n...",
"generate_per_subq": "Answer each sub-question using ONLY its document chunks...",
"filter_intro": "Evaluate each chunk for relevance to its associated sub-question only.",
"filter_section": "\nSub-question {subq_idx}: \"{subq_question}\"\n{chunks}",
"filter_outro": "\nFor each chunk, rate its relevance 0-10..."
}
}
```
#### Full Backup Export (All Profiles)
```json
{
"format": "legco-reranker-profile/v1",
"exported_at": "2026-04-27T12:00:00Z",
"active_profile": "A",
"profiles": {
"A": {
"prompts": { ... }
},
"B": {
"prompts": { ... }
},
"C": {
"prompts": { ... }
}
}
}
```
#### Import Request Format
```http
POST /api/v1/prompts/profiles/{name}/import
Content-Type: application/json

{
"format": "legco-reranker-profile/v1",
"profile_name": "A",
"exported_at": "2026-04-27T12:00:00Z",
"prompts": {
"decompose": "...",
...
}
}
```
**Response**:
```json
{
"status": "ok",
"profile": "B",
"imported_steps": 7,
"source_profile": "A"
}
```
---
### Sub-Phase Structure
| Sub-Phase | Scope | Components | Test Files |
|-----------|-------|------------|------------|
| PX.1 | Backend — Export endpoint | `routers/prompts.py`, `models/prompts.py` | `test_phaseX_export.py` |
| PX.2 | Backend — Import endpoint | `routers/prompts.py`, `models/prompts.py`, `prompt_service.py` | `test_phaseX_import.py` |
| PX.3 | Frontend — Export/Import UI | `SystemPromptsPage.tsx`, `ProfileList.tsx`, `lib/api.ts`, `lib/queries.tsx`, `types/index.ts` | `test_phaseX_export_import.test.tsx` |
| PX.4 | Testing & Polish | All affected files | Integration + acceptance tests |
---
### Sub-Phase PX.1: Backend — Single Profile Export Endpoint
**Test files to write first:**
- `backend/app/test/test_phaseX_export.py` — Tests export endpoint, JSON schema validation, empty profile handling
**Task PX.1.1: Add Pydantic models**
File: `backend/app/models/prompts.py`
```python
from pydantic import BaseModel


class ProfileExportResponse(BaseModel):
    format: str = "legco-reranker-profile/v1"
    profile_name: str
    exported_at: str
    prompts: dict[str, str]


class AllProfilesExportResponse(BaseModel):
    format: str = "legco-reranker-profile/v1"
    exported_at: str
    active_profile: str
    # profile_name -> {"prompts": {step: text}}
    profiles: dict[str, dict[str, dict[str, str]]]
```
**Task PX.1.2: Add `GET /api/v1/prompts/profiles/{name}/export` endpoint**
File: `backend/app/routers/prompts.py`
- Reads all 7 `system_prompts` rows for the given profile
- Returns `ProfileExportResponse` with `Content-Disposition: attachment; filename="legco-profile-{name}.json"`
- Uses `application/json` content type
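A framework-free sketch of what the export handler would assemble is shown below. The helper names `build_export_payload` and `export_filename`, and the `(step_name, prompt_template)` row shape, are hypothetical; the field names and the filename pattern follow the spec above.

```python
from datetime import datetime, timezone

FORMAT_VERSION = "legco-reranker-profile/v1"


def build_export_payload(profile_name: str, rows: list[tuple[str, str]]) -> dict:
    """Build the single-profile export body from (step_name, template) rows."""
    exported_at = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "format": FORMAT_VERSION,
        "profile_name": profile_name,
        "exported_at": exported_at,
        # Only step names and template text are exported; DB ids stay behind.
        "prompts": {step_name: template for step_name, template in rows},
    }


def export_filename(profile_name: str) -> str:
    """Filename used in the Content-Disposition header."""
    return f"legco-profile-{profile_name}.json"
```

Keeping the payload builder separate from the router makes the "no internal IDs" rule easy to unit-test without starting the app.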
**Task PX.1.3: Add `GET /api/v1/prompts/export/all` endpoint (optional)**
- Reads all 3 profiles + all 21 prompt rows
- Returns `AllProfilesExportResponse`
- For full backup/restore scenarios
**Commit**: `"feat(prompts): add single-profile and full JSON export endpoints"`
---
### Sub-Phase PX.2: Backend — Single Profile Import Endpoint
**Test files to write first:**
- `backend/app/test/test_phaseX_import.py` — Tests import endpoint, validation, error cases
**Task PX.2.1: Add request model**
File: `backend/app/models/prompts.py`
```python
from pydantic import BaseModel


class ProfileImportRequest(BaseModel):
    format: str  # must be "legco-reranker-profile/v1"
    profile_name: str  # source profile name (informational)
    exported_at: str | None = None  # informational timestamp
    prompts: dict[str, str]  # step_name -> template_text
```
**Task PX.2.2: Add `POST /api/v1/prompts/profiles/{name}/import` endpoint**
File: `backend/app/routers/prompts.py`
Validation steps:
1. Check target `{name}` is A, B, or C → 400 if not
2. Check `request.format == "legco-reranker-profile/v1"` → 400 if not
3. Validate that all 7 required step keys (`decompose`, `filter`, `generate`, `generate_per_subq`, `filter_intro`, `filter_section`, `filter_outro`) are present in `request.prompts` → 400 with list of missing keys if not
4. Validate that `request.prompts` contains no extra/unknown step keys → 400 with list of unknown keys if it does
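The four checks above can be expressed as one pure function that the endpoint maps to a 400 response. This is a sketch under assumptions: the function name `validate_import`, the error-message wording, and the choice to return a message rather than raise are all illustrative.

```python
from typing import Optional

FORMAT_VERSION = "legco-reranker-profile/v1"
VALID_PROFILES = {"A", "B", "C"}
REQUIRED_STEPS = {
    "decompose", "filter", "generate", "generate_per_subq",
    "filter_intro", "filter_section", "filter_outro",
}


def validate_import(target_name: str, payload: dict) -> Optional[str]:
    """Return an error message for the first failed check, or None if valid."""
    if target_name not in VALID_PROFILES:
        return f"Unknown profile: {target_name}"
    if payload.get("format") != FORMAT_VERSION:
        return f"Unsupported format: {payload.get('format')}"
    prompts = payload.get("prompts")
    if not isinstance(prompts, dict):
        return "Missing prompts object"
    missing = sorted(REQUIRED_STEPS - set(prompts))
    if missing:
        return "Missing step keys: " + ", ".join(missing)
    unknown = sorted(set(prompts) - REQUIRED_STEPS)
    if unknown:
        return "Unknown step keys: " + ", ".join(unknown)
    return None
```

In the router, a non-`None` result would be turned into an HTTP 400 whose detail carries the message, so the client sees exactly which keys are missing or unknown.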
Implementation:
- Uses `PromptService._update_all_prompts()` (the existing internal batch-update method) to overwrite all 7 steps
- Each step gets fresh `created_at`/`updated_at` timestamps (DB defaults)
- Returns `{"status": "ok", "profile": name, "imported_steps": len(prompts), "source_profile": request.profile_name}`
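For illustration, a full-replace import at the SQL level might look like the sketch below. It does not show `PromptService._update_all_prompts()` itself; the function name `overwrite_profile_prompts` and the table layout (`profiles(id, name)` and `system_prompts(profile_id, step_name, prompt_template, created_at, updated_at)`) are assumptions.

```python
import sqlite3


def overwrite_profile_prompts(
    conn: sqlite3.Connection, profile_name: str, prompts: dict[str, str]
) -> int:
    """Replace every prompt for one profile atomically; return steps written."""
    row = conn.execute(
        "SELECT id FROM profiles WHERE name = ?", (profile_name,)
    ).fetchone()
    if row is None:
        raise ValueError(f"Unknown profile: {profile_name}")
    profile_id = row[0]
    # `with conn` commits on success and rolls back if any statement raises,
    # so a failed import never leaves the profile half-overwritten.
    with conn:
        conn.execute(
            "DELETE FROM system_prompts WHERE profile_id = ?", (profile_id,)
        )
        conn.executemany(
            "INSERT INTO system_prompts "
            "(profile_id, step_name, prompt_template, created_at, updated_at) "
            "VALUES (?, ?, ?, datetime('now'), datetime('now'))",
            [(profile_id, step, text) for step, text in prompts.items()],
        )
    return len(prompts)
```

Delete-then-insert gives every imported row fresh timestamps and new row ids, in line with decisions P4 and P5, and never touches the profile's `is_active` flag.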
**Task PX.2.3: Add `POST /api/v1/prompts/import/all` endpoint (optional)**
- Accepts `AllProfilesExportResponse` format
- Imports all 3 profiles at once
- Does NOT change the active profile by default (changes it only if explicitly requested)
**Commit**: `"feat(prompts): add single-profile JSON import endpoint with full validation"`
---
### Sub-Phase PX.3: Frontend — Export/Import UI
**Test files to write first:**
- `frontend/src/test/components/test_phaseX_export_import.test.tsx` — Tests export/import buttons, file download, file upload
**Task PX.3.1: Add TypeScript types**
File: `frontend/src/types/index.ts`
```typescript
export interface ProfileExportData {
  format: string
  profile_name: string
  exported_at: string
  prompts: Record<string, string>
}

export interface ProfileImportResponse {
  status: string
  profile: string
  imported_steps: number
  source_profile: string
}
```
**Task PX.3.2: Add API client functions**
File: `frontend/src/lib/api.ts`
```typescript
// Download a profile as JSON blob for browser-side save
export const exportProfile = async (name: string): Promise<ProfileExportData> => {
  const resp = await apiClient.get<ProfileExportData>(`/prompts/profiles/${name}/export`)
  return resp.data
}

// Import a profile from JSON
export const importProfile = async (
  name: string,
  data: ProfileExportData,
): Promise<ProfileImportResponse> => {
  const resp = await apiClient.post<ProfileImportResponse>(`/prompts/profiles/${name}/import`, data)
  return resp.data
}
```
**Task PX.3.3: Add TanStack Query mutation for import**
File: `frontend/src/lib/queries.tsx`
```typescript
export const useImportProfile = () => {
  const queryClient = useQueryClient()
  return useMutation({
    mutationFn: ({ name, data }: { name: string; data: ProfileExportData }) =>
      importProfile(name, data),
    onSuccess: () => {
      // Refetch prompt data so the UI reflects the imported templates
      queryClient.invalidateQueries({ queryKey: ['prompts'] })
    },
  })
}
```
**Task PX.3.4: Add Export button to ProfileList cards**
File: `frontend/src/components/ProfileList.tsx`
- Add export icon button (e.g., `Download` from lucide-react) next to the "Edit" button on each card
- On click: calls `exportProfile(name)` via `fetch` → creates blob → triggers browser download via `URL.createObjectURL` + `<a>` click
- Filename: `legco-profile-{name}-{date}.json`
**Task PX.3.5: Add Import button and dialog to SystemPromptsPage**
File: `frontend/src/pages/SystemPromptsPage.tsx`
- Add "Import" button in the top bar (next to "Active Profile" dropdown)
- On click: opens a modal/dialog with:
- File input (accept `.json`) — hidden `<input type="file">` triggered by styled button
- After file selected: parse JSON client-side, show preview (source profile name, export date, step count)
- Target profile selector (dropdown: A, B, C) — defaults to source profile name if valid
- "Import" button → confirmation dialog ("This will overwrite all prompts for Profile {target}. Continue?")
- On confirm: calls `importProfileMutation.mutate()`
- Success: show toast "Profile {target} imported successfully ({n} steps from Profile {source})"
- Error: show inline error message with details
**Task PX.3.6: Add Export All button (optional)**
File: `frontend/src/pages/SystemPromptsPage.tsx`
- "Export All" button in top bar
- Downloads all 3 profiles as `legco-profiles-{date}.json`
**Commit**: `"feat(prompts): add export/import UI with file download, upload dialog, and validation"`
---
### Sub-Phase PX.4: Testing & Polish
**Test files:**
- `backend/app/test/test_phaseX_export.py` — Export endpoint: valid profile, invalid name, JSON schema validation
- `backend/app/test/test_phaseX_import.py` — Import endpoint: valid import, missing steps, extra steps, invalid format version, invalid target name
- `frontend/src/test/components/test_phaseX_export_import.test.tsx` — Export button click → download, Import dialog flow → file upload → preview → confirm → success/error
**Task PX.4.1: Backend unit tests**
- `test_export_profile_valid` — GET export/A returns all 7 steps with correct format version
- `test_export_profile_invalid_name` — GET export/X returns 400
- `test_export_all` — GET export/all returns 3 profiles, 21 prompts total
- `test_import_valid` — POST import/B with valid JSON → 200, verify all 7 steps updated
- `test_import_overwrites_existing` — POST import/B → verify old content replaced
- `test_import_missing_required_step` — POST import with only 6 steps → 400 with missing key listed
- `test_import_unknown_step_key` — POST import with extra step → 400
- `test_import_invalid_format_version` — POST import with format: "v2" → 400
- `test_import_invalid_target_name` — POST import/X → 400
- `test_import_does_not_change_active` — import into inactive profile → active profile unchanged
**Task PX.4.2: Frontend tests**
- Export button visible on each profile card
- Click export → fetch called, download triggered
- Import dialog opens on button click
- File selection → JSON parsed, preview shown
- Invalid JSON file → error message shown
- Target profile selector defaults to source profile
- Confirm import → mutation called, success toast
- Import error → inline error message
- Export All downloads all profiles
**Task PX.4.3: Integration verification**
- `npm run build` — no TypeScript errors
- `npm test` — all frontend tests pass
- `pytest backend/app/test/test_phaseX_*.py -v` — all backend tests pass
- Manual flow: export Profile A → edit Profile B → import exported file into B → verify B's prompts match A's original
**Commit**: `"test(prompts): add unit, integration tests for export/import"`
---
### Files Affected — Complete Inventory
#### Backend — New Files
| File | Sub-Phase | Purpose |
|------|-----------|---------|
| `backend/app/test/test_phaseX_export.py` | PX.4 | Unit tests for export endpoint |
| `backend/app/test/test_phaseX_import.py` | PX.4 | Unit tests for import endpoint |
#### Backend — Modified Files
| File | Sub-Phase | Changes |
|------|-----------|---------|
| `backend/app/models/prompts.py` | PX.1, PX.2 | Add `ProfileExportResponse`, `AllProfilesExportResponse`, `ProfileImportRequest`, `ProfileImportResponse` |
| `backend/app/routers/prompts.py` | PX.1, PX.2 | Add `GET /export`, `GET /export/all`, `POST /import` endpoints |
#### Frontend — New Files
| File | Sub-Phase | Purpose |
|------|-----------|---------|
| `frontend/src/test/components/test_phaseX_export_import.test.tsx` | PX.4 | Component tests for export/import UI |
#### Frontend — Modified Files
| File | Sub-Phase | Changes |
|------|-----------|---------|
| `frontend/src/types/index.ts` | PX.3 | Add `ProfileExportData`, `ProfileImportResponse` types |
| `frontend/src/lib/api.ts` | PX.3 | Add `exportProfile()`, `importProfile()` API functions |
| `frontend/src/lib/queries.tsx` | PX.3 | Add `useImportProfile()` mutation hook |
| `frontend/src/components/ProfileList.tsx` | PX.3 | Add Export button per profile card |
| `frontend/src/pages/SystemPromptsPage.tsx` | PX.3 | Add Import/Export All buttons, import dialog/modal |
---
### Acceptance Criteria
#### Backend
- [ ] `GET /api/v1/prompts/profiles/A/export` returns JSON with all 7 steps, correct format version
- [ ] `GET /api/v1/prompts/profiles/X/export` returns 400 (invalid profile name)
- [ ] `GET /api/v1/prompts/export/all` returns all 3 profiles, active profile marker
- [ ] `POST /api/v1/prompts/profiles/B/import` with valid payload overwrites all 7 steps for Profile B
- [ ] Import rejects payload with missing required step keys (400 + key names)
- [ ] Import rejects payload with unknown step keys (400 + key names)
- [ ] Import rejects payload with unknown format version (400)
- [ ] Import does NOT change `is_active` flag on target profile
- [ ] Exported JSON does NOT contain internal DB IDs (`id`/`profile_id`)
- [ ] All existing prompt API endpoints still work unchanged
#### Frontend
- [ ] Export button visible on each profile card in ProfileList
- [ ] Clicking Export downloads a `.json` file with correct naming (`legco-profile-A-2026-04-27.json`)
- [ ] Import button visible on SystemPromptsPage top bar
- [ ] Clicking Import opens a modal with: file input, JSON preview, target profile selector, confirm button
- [ ] Selecting invalid JSON file shows error message
- [ ] Importing into a valid profile shows success confirmation with step count
- [ ] Import error from backend shows inline error message
- [ ] After successful import, profile data refreshes (query invalidation)
- [ ] All existing System Prompts functionality still works unchanged
---
### Risk Register
| Risk | Likelihood | Impact | Mitigation |
|------|-----------|--------|------------|
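| JSON file too large to upload | Low | Low (7 prompts × ~2KB = ~14KB) | Add a 1MB limit on the import endpoint, e.g. by rejecting requests whose `Content-Length` exceeds the limit (FastAPI's `Body(max_length=...)` constrains string values, not total payload size) |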
| User imports into wrong profile by mistake | Medium | Medium — overwrites their existing config | Confirmation dialog with source/target profile names clearly displayed before import |
| Exported file missing legacy `filter`/`generate` steps | Medium | Medium — import would fail validation | Always export all 7 steps (even hidden ones). Import validates all 7 are present. |
| Browser download API differences | Low | Low | Use standard `Blob` + `URL.createObjectURL` approach, tested across Chrome/Firefox |
| Import endpoint receives malformed JSON | Low | Low — Pydantic validation catches this | `ProfileImportRequest` model validates format string, dict keys, value types |
| User exports from one deployment and imports into another with different profile names | Low | Low — only 3 names (A/B/C) | Import only into A/B/C — if source was "D", user must choose target manually |
---
### New Dependencies
None. All changes use existing libraries (FastAPI, Pydantic, React, TanStack Query, lucide-react icons).
---
### Implementation Sequence
```
PX.1 (Backend Export) ──► PX.2 (Backend Import)
                                    │
                                    ▼
                           PX.3 (Frontend UI)
                                    │
                                    ▼
                             PX.4 (Testing)
```
PX.1 and PX.2 can be done together (both in `routers/prompts.py`). PX.3 depends on knowing the exact API contracts from PX.1/PX.2. PX.4 runs after everything is wired.