legco_ai_assistant/.plans/package4_enhancement_plan.md

# Package 4 Enhancement Plan — Per-Sub-Question RAG Pipeline

**Source**: User request (2026-04-26)
**Scope**: Refactor the 3-step RAG query pipeline so retrieval, filtering, and response generation are organized per sub-question instead of batch-flattened.
**Status**: ✅ Complete — All 7 sub-phases implemented (2026-04-26). Phase 4a Prompt Integration added (2026-04-27). Phase PX Profile Export/Import planned (2026-04-27) — see end of file.

---

## Objective

Restructure the `POST /api/v1/query` pipeline so that:

1. **Retrieval per sub-question**: Each sub-question independently retrieves `n_results` chunks from ChromaDB (instead of joining all sub-questions into one query string).
2. **Filtering per sub-question**: Each chunk is evaluated for relevance against its **own originating sub-question** (not the original user question). One LLM call handles all filtering — the prompt is redesigned to group chunks by sub-question.
3. **Final answer organized by sub-question**: Each sub-question gets its own bullet-point answer with its own sources. The frontend renders answer sections per sub-question rather than one monolithic bullet list.

---

## Decision Register

| # | Decision | Rationale |
|---|----------|-----------|
| 1 | Keep `QueryDecomposer` unchanged | Input/output contract is identical — decomposition still produces a flat list of sub-questions |
| 2 | Single LLM call for filtering | User explicitly requested one call. Prompt redesigned to carry sub-question context for each chunk group |
| 3 | Keep `RAGService.retrieve()` signature | Call it N times (once per sub-question) externally in the orchestrator rather than changing its internal contract |
| 4 | Add `retrieve_per_subquestion()` to `RAGService` | New method that iterates over sub-questions, calls `retrieve()` per question, returns grouped results |
| 5 | Add `generate_response_per_subquestion()` alongside `generate_response()` | Accepts per-sub-question inputs (`sub_questions`, `sub_chunks`, `sub_metadata`) instead of flat chunk lists; the existing method is preserved (Task 4.3.1) |
| 6 | SSE events: add `generating_subquestion` phase | Progressive streaming — frontend sees which sub-question is being answered |
| 7 | History: change XML/JSON formats in-place | Add `<sub_q>` wrappers to `chunks_retrieved`/`chunks_filtered` XML. Add sub-question grouping to `sources` JSON. No new DB columns for these payloads; Task 4.4.2 adds two optional per-sub-question count columns. |
| 8 | Final answer format: markdown sections | `## Sub-question 1` headers with inline citations. Backward-compatible with existing `ReactMarkdown` rendering |
| 9 | Deduplicate chunks within a sub-question only | The same chunk may be retrieved by more than one sub-question. These cross-sub-question duplicates are kept, because each sub-question needs independent evaluation. ChromaDB `query()` may return the same document for different queries, which is acceptable. |
|10 | Prompt template: add `generate` placeholders | New placeholder `{context_sections}` replaces single `{context}`. Filter template unchanged (sub-question injected at call site). Decompose template unchanged. |
|11 | Progressive SSE events | Emit `generating_subquestion` event as each sub-question's answer section is generated. Frontend renders sections one by one. |
|12 | `retrieval_n_results` | Global — same value for all sub-questions. Use existing `settings.retrieval_n_results` config. |
|13 | Empty decomposition fallback | Treat original user question as single sub-question. Pipeline runs as 1-sub-q case — single retrieval, no filtering needed (one sub-q = no ambiguity), flat answer with `##` header. |

---

## Pipeline: Before vs After

### Before (Current — Flat Batch)

```
User Question: "What are NEC4 time extension clauses?"
         │
    ┌────▼─────┐
    │ Decompose│  LLM Call 1
    │ → ["What are time extensions?",
    │    "What notice is required?"]
    └────┬─────┘
         │ joined: "What are time extensions? What notice is required?"
    ┌────▼─────┐
    │ Retrieve │  1 ChromaDB query → 10 chunks (flat, no sub-q association)
    └────┬─────┘
         │ 10 chunks
    ┌────▼─────┐
    │  Filter  │  LLM Call 2 — all chunks scored against ORIGINAL question
    │          │  Score > 7 → keep (flat, no sub-q association)
    └────┬─────┘
         │ N filtered chunks
    ┌────▼─────┐
    │ Generate │  LLM Call 3 — flat answer from ALL filtered chunks
    │          │  "• Time extensions require notice [NEC4 ACC.pdf, p3]
    │          │   • The project manager must acknowledge [NEC4, p7]
    │          │   • Notice is defined as..."  (sources from all sub-qs mixed)
    └────┬─────┘
         │ single SSE completed event
    ┌────▼─────┐
    │ Frontend │  1 ReactMarkdown block, 1 flat sources list
    └──────────┘
```

### After (Per-Sub-Question)

```
User Question: "What are NEC4 time extension clauses?"
         │
    ┌────▼─────┐
    │ Decompose│  LLM Call 1 (UNCHANGED)
    │ → ["What are time extensions?",
    │    "What notice is required?"]
    └────┬─────┘
         │ sub_q0                    sub_q1
    ┌────▼─────┐              ┌────▼─────┐
    │ Retrieve │              │ Retrieve │   2 ChromaDB queries → 10 chunks each
    │ q0 → 10  │              │ q1 → 10  │   chunks tagged with sub-q index
    └────┬─────┘              └────┬─────┘
         │                         │
         └─────────┬───────────────┘
                   │ grouped: {sub_q0: [chunks 0-9], sub_q1: [chunks 10-19]}
              ┌────▼─────┐
              │  Filter  │  LLM Call 2 (SINGLE CALL — redesigned prompt)
              │          │  Each chunk scored against its OWN sub-question
              │          │  Returns grouped scores → filtered per sub-q
              └────┬─────┘
                   │ filtered_by_subq: {0: [chunk_a, chunk_b], 1: [chunk_c]}
              ┌────▼─────┐
              │ Generate │  LLM Call 3 (redesigned prompt with per-sub-q context)
              │          │  ┌─────────────────────────────────────┐
              │          │  │ ## What are time extensions?         │
              │          │  │ - Time extensions must be notified   │
              │          │  │   [NEC4 ACC.pdf, page 3]             │
              │          │  │ - The project manager has 2 weeks    │
              │          │  │   [NEC4 Contract.pdf, page 12]       │
              │          │  │                                      │
              │          │  │ ## What notice is required?          │
              │          │  │ - Written notice must be given       │
              │          │  │   [NEC4 ACC.pdf, page 7]             │
              │          │  └─────────────────────────────────────┘
              └────┬─────┘
                   │ SSE events: generating_subquestion (per sub-q) → completed
              ┌────▼─────┐
              │ Frontend │  Sections per sub-question, sources grouped per section
              └──────────┘
```

---

## Current State (Pre-Enhancement)

### Backend

| Component | File | Current Behavior |
|-----------|------|-----------------|
| Decomposer | `services/query_decomposer.py` | `decompose(question) -> (List[str], prompt)` — returns 2-5 sub-questions |
| Retrieval | `services/rag.py:retrieve()` | `query_text = " ".join(query_keywords)` — joins all sub-qs into ONE string, single ChromaDB query → flat chunk list |
| Filter | `services/relevance_filter.py` | `filter(question, chunks)` — ALL chunks scored against ORIGINAL question, single LLM call, flat output |
| Generate | `services/rag.py:generate_response()` | `generate_response(question, chunks, metadata)` — flat chunks → flat bullet answer |
| Orchestrator | `routers/query.py:_query_stream()` | Linear 4-stage pipeline: decompose → retrieve → filter → generate |
| SSE Events | `routers/query.py` | `decomposed → retrieving → filtering → generating → completed` — flat answer + sources in `completed` |
| History | `services/history_service.py` | Flat XML for `chunks_retrieved`/`chunks_filtered`. Flat JSON for `sources`. Single timing per stage. |
| Prompt templates | `prompt_service.py` + `sqlite_db.py` | 3 steps (`decompose`, `filter`, `generate`). Placeholders: `{question}`, `{chunks}`, `{context}` |
| Config | `core/config.py` | `retrieval_n_results=10`, `relevance_threshold=7.0` |

### Frontend

| Component | File | Current Behavior |
|-----------|------|-----------------|
| Types | `types/index.ts` | `QueryStreamEvent.phase`, flat `extracted_questions: string[]`, flat `answer: string`, flat `sources: SourceMetadata[]` |
| SSE Client | `lib/api.ts` | `queryDocumentStream()` — generic `JSON.parse` per `data:` line, no sub-question awareness |
| State | `lib/queries.tsx` | `QueryStreamState` with flat `answer`/`sources`/`extractedQuestions` |
| Response | `components/ResponsePanel.tsx` | Single `ReactMarkdown` block for answer. Flat 2-column grid for sources. No sub-question grouping. |
| Questions | `components/ExtractedQuestionsDisplay.tsx` | `<ol>` list of question strings. No sources attached. |
| Citations | `utils/citationParser.ts` | Flat `sources` lookup — `buildCitationLookup(sources)` returns global map |
| Progress | `components/PipelineProgress.tsx` | 4-step stepper (NOT currently wired in LTTPage) |

### Key Test Files

| File | Lines | Status |
|------|-------|--------|
| `test_phase1_query_decomposer.py` | 76 | ✅ Unchanged — decomposer contract stays |
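| `test_phase1_rag_service.py` | 139 | 🔴 Needs update: add coverage for `retrieve_per_subquestion()` and `generate_response_per_subquestion()`; the existing `retrieve()` and `generate_response()` signatures are preserved |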
| `test_phase1_relevance_filter.py` | 93 | 🟡 Needs update — one-call pattern changes to per-sub-q grouping |
| `test_phase1_query.py` | 97 | 🟢 Already skipped (SSE migration) — may un-skip later |
| `test_phase3_query_history_integration.py` | 608 | 🔴 Major rewrite — pipeline simulation mirrors `_query_stream` 1:1 |
| `test_phase3_prompt_injection.py` | 238 | 🟡 Moderate — new generate template placeholder |
| `test_acceptance_phase1_rag_query.py` | 101 | 🔴 Full rewrite — already broken (SSE vs JSON), new response shape |
| `conftest.py` | 94 | 🟡 Low — may add per-sub-q mock helpers |

---

## Implementation Tasks

### Sub-Phase 4.1: Backend — Per-Sub-Question Retrieval

**Test files to write first:**
- `test_phase4_retrieve_per_subquestion.py` — Tests `RAGService.retrieve_per_subquestion()`
- `test_phase4_query_router_retrieval.py` — Tests `_query_stream` retrieval stage produces per-sub-q chunks

**Task 4.1.1: Add `retrieve_per_subquestion()` to `RAGService`**

File: `backend/app/services/rag.py`

New method signature:
```python
def retrieve_per_subquestion(
    self,
    sub_questions: List[str],
    n_results: int = 10,
) -> List[Tuple[str, List[Tuple[str, Dict[str, Any], float]]]]:
    """Retrieve chunks for each sub-question independently.

    Args:
        sub_questions: List of decomposed sub-questions.
        n_results: Number of chunks per sub-question.

    Returns:
        List of (sub_question, chunks) tuples.
        chunks is the standard retrieve() output: [(text, metadata, distance), ...].
    """
```

Implementation:
- Call `self.retrieve([sub_q], n_results)` for each sub-question
- Return list of `(sub_question, chunks)`. Within one sub-question's results each chunk appears once, because a ChromaDB collection stores each ID once; the same chunk may still appear under several sub-questions (Decision 9)
- Existing `retrieve()` method is NOT modified — it continues to work as before
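
The loop described above can be sketched as follows. This is a minimal illustration rather than the actual `RAGService` code: `retrieve_fn` and `fake_retrieve` are hypothetical stand-ins for the existing `retrieve()` method so that the example runs without ChromaDB, and the metadata keys are assumed.

```python
from typing import Any, Callable, Dict, List, Tuple

Chunk = Tuple[str, Dict[str, Any], float]  # (text, metadata, distance)


def retrieve_per_subquestion(
    retrieve_fn: Callable[[List[str], int], List[Chunk]],
    sub_questions: List[str],
    n_results: int = 10,
) -> List[Tuple[str, List[Chunk]]]:
    """Call the existing retrieve() once per sub-question and keep results grouped."""
    grouped: List[Tuple[str, List[Chunk]]] = []
    for sub_q in sub_questions:
        # A one-element list means the join inside retrieve() yields just this sub-question.
        grouped.append((sub_q, retrieve_fn([sub_q], n_results)))
    return grouped


def fake_retrieve(keywords: List[str], n_results: int) -> List[Chunk]:
    # Hypothetical stand-in for RAGService.retrieve(); returns one canned chunk.
    query = " ".join(keywords)
    return [(f"chunk for: {query}", {"filename": "demo.pdf", "page": 1}, 0.1)][:n_results]


results = retrieve_per_subquestion(
    fake_retrieve, ["What are time extensions?", "What notice is required?"]
)
```

Passing the retrieval function in as a parameter is only a convenience for the sketch; in the plan the method lives on `RAGService` and calls `self.retrieve()` directly.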

**Task 4.1.2: Update `_query_stream()` retrieval stage**

File: `backend/app/routers/query.py`

Changes:
- Replace `rag.retrieve(extracted_questions, n_results)` with `rag.retrieve_per_subquestion(extracted_questions, n_results)`
- Track per-sub-question retrieval timing (new field or combined timing)
- Format `chunks_retrieved` XML with sub-question wrappers

**New `chunks_retrieved` XML format:**
```xml
<sub_q idx="0" question="What are time extensions?">
<chunk_1>
Filename: NEC4 ACC.pdf
Page: 3
Content: Clause 61.3 states that...
</chunk_1>
<chunk_2>
...
</chunk_2>
</sub_q>
<sub_q idx="1" question="What notice is required?">
<chunk_1>
Filename: NEC4 Contract.pdf
Page: 12
Content: Notice must be given...
</chunk_1>
...
</sub_q>
```

**Task 4.1.3: Format helpers**

File: `backend/app/routers/query.py`

New functions:
```python
def format_chunks_retrieved_per_subq(results: List[Tuple[str, List]]) -> str:
    """Format per-sub-question retrieved chunks as XML."""

def format_chunks_filtered_per_subq(results: List[Tuple[str, List]]) -> str:
    """Format per-sub-question filtered chunks as XML with relevance scores."""
```
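
One possible body for the first helper is sketched below, following the `<sub_q>` layout from Task 4.1.2. The metadata keys `filename` and `page` are assumptions, and escaping the `question` attribute is a choice made for this sketch; the plan does not say whether the stored XML is ever parsed strictly.

```python
from html import escape
from typing import Any, Dict, List, Tuple

Chunk = Tuple[str, Dict[str, Any], float]  # (text, metadata, distance)


def format_chunks_retrieved_per_subq(results: List[Tuple[str, List[Chunk]]]) -> str:
    """Render grouped chunks using the <sub_q> wrapper layout."""
    lines: List[str] = []
    for idx, (question, chunks) in enumerate(results):
        # Escape quotes and angle brackets so the attribute value stays well-formed.
        lines.append(f'<sub_q idx="{idx}" question="{escape(question, quote=True)}">')
        for n, (text, meta, _distance) in enumerate(chunks, start=1):
            lines.append(f"<chunk_{n}>")
            lines.append(f"Filename: {meta.get('filename', 'unknown')}")  # assumed key
            lines.append(f"Page: {meta.get('page', '?')}")  # assumed key
            lines.append(f"Content: {text}")
            lines.append(f"</chunk_{n}>")
        lines.append("</sub_q>")
    return "\n".join(lines)


xml_text = format_chunks_retrieved_per_subq(
    [
        (
            "What are time extensions?",
            [("Clause 61.3 states that...", {"filename": "NEC4 ACC.pdf", "page": 3}, 0.2)],
        )
    ]
)
```

Chunk numbering restarts at `<chunk_1>` inside every `<sub_q>` group, matching the example format. The filtered variant would add a `Relevance:` line in the same way.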

**Commit**: `"feat: Phase 4.1 per-sub-question retrieval with grouped chunk XML"`

### Sub-Phase 4.2: Backend — Per-Sub-Question Filtering (Single LLM Call)

**Test files to write first:**
- `test_phase4_relevance_filter_per_subq.py` — Tests `RelevanceFilter.filter_per_subquestion()` with grouped chunks
- `test_phase4_query_router_filter.py` — Tests filter stage with per-sub-q chunk groups

**Task 4.2.1: Add `filter_per_subquestion()` to `RelevanceFilter`**

File: `backend/app/services/relevance_filter.py`

New method signature:
```python
async def filter_per_subquestion(
    self,
    sub_questions: List[str],
    sub_chunks: List[List[Tuple[str, Dict]]],
    threshold: float = 7.0,
) -> Tuple[List[Tuple[str, List[Tuple[str, Dict]]]], str]:
    """Filter chunks per sub-question in a single LLM call.

    Args:
        sub_questions: List of decomposed sub-questions.
        sub_chunks: List of chunk lists (one per sub-question).
        threshold: Minimum relevance score.

    Returns:
        Tuple of (filtered_results, prompt).
        filtered_results: List of (sub_question, filtered_chunks_for_that_q).
    """
```

**Prompt design (single LLM call):**
```
Evaluate each chunk for relevance to its associated sub-question.

Sub-question 0: "{sub_q_0}"
Chunk 0: {chunk_0_text}
Chunk 1: {chunk_1_text}
...

Sub-question 1: "{sub_q_1}"
Chunk 0: {chunk_0_text}
Chunk 1: {chunk_1_text}
...

For each chunk, rate relevance 0-10 considering ONLY its associated sub-question.
Return a JSON object mapping sub-question indices to arrays of scores:
{"0": [8.5, 3.2, 9.0], "1": [7.0, 6.5, 9.1]}
```

Key rules:
- Each chunk is evaluated against its **own** sub-question (not the original user question)
- JSON keys are stringified sub-question indices (`"0"`, `"1"`, ...)
- Score arrays MUST match chunk count for each sub-question
- Same JSON extraction/markdown stripping logic as existing `filter()`
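
The rules above could be applied to the LLM response roughly as follows. This is a sketch under stated assumptions, not the code in `relevance_filter.py`: the function name is hypothetical, the strict `>` comparison is borrowed from the "Score > 7" rule of the flat filter, and keeping all chunks on malformed output follows the fallback listed in Task 4.7.5.

```python
import json
import re
from typing import Any, Dict, List, Tuple

ChunkPair = Tuple[str, Dict[str, Any]]  # (text, metadata)
FENCE = "`" * 3  # markdown code fence marker


def apply_grouped_scores(
    raw_response: str,
    sub_chunks: List[List[ChunkPair]],
    threshold: float = 7.0,
) -> List[List[ChunkPair]]:
    """Filter each sub-question's chunks using the grouped score JSON."""
    keep_all = [list(chunks) for chunks in sub_chunks]
    # Strip an optional markdown code fence wrapped around the JSON object.
    pattern = rf"^{FENCE}(?:json)?\s*|\s*{FENCE}$"
    cleaned = re.sub(pattern, "", raw_response.strip())
    try:
        scores = json.loads(cleaned)
    except json.JSONDecodeError:
        return keep_all  # unparseable response: no filtering at all
    if not isinstance(scores, dict):
        return keep_all
    filtered: List[List[ChunkPair]] = []
    for idx, chunks in enumerate(sub_chunks):
        group = scores.get(str(idx))  # keys are stringified indices
        if not isinstance(group, list) or len(group) != len(chunks):
            filtered.append(list(chunks))  # score array must match chunk count
            continue
        # Assumption: strict ">" mirrors the "Score > 7" rule of the flat filter.
        filtered.append([c for c, s in zip(chunks, group) if float(s) > threshold])
    return filtered


demo_chunks = [[("a", {}), ("b", {})], [("c", {})]]
demo_response = FENCE + 'json\n{"0": [8.5, 3.2], "1": [9.1]}\n' + FENCE
demo_out = apply_grouped_scores(demo_response, demo_chunks)
```

The sketch assumes every score is numeric; a production version would also need to handle non-numeric entries inside a score array.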

**Existing `filter()` method is preserved** — not modified, not deprecated. The new method is additive.

**Task 4.2.2: Update `_query_stream()` filter stage**

File: `backend/app/routers/query.py`

Changes:
- Call `relevance_filter.filter_per_subquestion(extracted_questions, chunks_for_filter, threshold)` instead of `relevance_filter.filter(question, chunks, threshold)`
- Build `chunks_for_filter` from per-sub-question retrieval results
- Track `filter_prompt` (the redesigned prompt)
- Format `chunks_filtered` XML with sub-question wrappers and `Relevance:` scores

**New `chunks_filtered` XML format:**
```xml
<sub_q idx="0" question="What are time extensions?">
<chunk_1>
Filename: NEC4 ACC.pdf
Page: 3
Relevance: 8.5
Content: Clause 61.3 states that...
</chunk_1>
</sub_q>
<sub_q idx="1" question="What notice is required?">
<chunk_1>
Filename: NEC4 Contract.pdf
Page: 12
Relevance: 9.0
Content: Notice must be given...
</chunk_1>
</sub_q>
```

**Commit**: `"feat: Phase 4.2 per-sub-question filtering with single LLM call"`

### Sub-Phase 4.3: Backend — Sub-Question-Organized Response Generation

**Test files to write first:**
- `test_phase4_generate_per_subq.py` — Tests `RAGService.generate_response_per_subquestion()`
- `test_phase4_response_format.py` — Tests the final answer matches expected format

**Task 4.3.1: Add `generate_response_per_subquestion()` alongside `generate_response()`**

File: `backend/app/services/rag.py`

New method signature:
```python
async def generate_response_per_subquestion(
    self,
    sub_questions: List[str],
    sub_chunks: List[List[str]],
    sub_metadata: List[List[Dict[str, Any]]],
) -> Tuple[str, str, List[List[SourceMetadata]]]:
    """Generate sub-question-organized RAG response.

    Args:
        sub_questions: List of decomposed sub-questions.
        sub_chunks: List of chunk text lists (one per sub-question).
        sub_metadata: List of metadata dict lists (one per sub-question).

    Returns:
        Tuple of (answer, prompt, grouped_sources).
        answer: Markdown string with sections per sub-question.
        prompt: The rendered LLM prompt.
        grouped_sources: List of SourceMetadata lists (one per sub-question).
    """
```

**New prompt template (replaces `generate`):**
```
You must answer each sub-question using ONLY the document chunks provided for it.
Do not use any external knowledge.
Format your answer as markdown sections — one section per sub-question.
Each section should start with "## Sub-question N: <the question>"
Each section should contain 1-5 bullet points.
Cite your sources inline using bracket labels, e.g. [filename, page N].
Place the citation at the end of each relevant bullet point.

{context_sections}

Answer:
```

**Context format (replaces `{context}`):**
```
### Context for Sub-question 0: "What are time extensions?"
[NEC4 ACC.pdf, page 3] Source: NEC4 ACC.pdf
Summary: Clause 61.3 discusses time extensions...
Content: Clause 61.3 states that the project manager...

[NEC4 Contract.pdf, page 12] Source: NEC4 Contract.pdf
Summary: Notice requirements for time extensions...
Content: Written notice must be given within...

### Context for Sub-question 1: "What notice is required?"
[NEC4 ACC.pdf, page 7] Source: NEC4 ACC.pdf
Summary: Notice requirements...
Content: The contractor shall notify the project manager in writing...
```
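
A string in this shape could be assembled from the three parallel lists roughly as shown below. The helper name and the metadata keys (`filename`, `page`, `summary`) are assumptions made for illustration; the plan only fixes the output layout.

```python
from typing import Any, Dict, List


def build_context_sections(
    sub_questions: List[str],
    sub_chunks: List[List[str]],
    sub_metadata: List[List[Dict[str, Any]]],
) -> str:
    """Build the {context_sections} value: one headed block per sub-question."""
    sections: List[str] = []
    for idx, question in enumerate(sub_questions):
        entries: List[str] = []
        for text, meta in zip(sub_chunks[idx], sub_metadata[idx]):
            name = meta.get("filename", "unknown")  # assumed metadata key
            label = f"[{name}, page {meta.get('page', '?')}]"  # assumed metadata key
            entries.append(
                f"{label} Source: {name}\n"
                f"Summary: {meta.get('summary', '')}\n"  # assumed metadata key
                f"Content: {text}"
            )
        header = f'### Context for Sub-question {idx}: "{question}"'
        body = "\n\n".join(entries)
        # A sub-question without chunks keeps its header so the LLM still sees it.
        sections.append(header + ("\n" + body if body else ""))
    return "\n\n".join(sections)


ctx = build_context_sections(
    ["What notice is required?"],
    [["The contractor shall notify the project manager in writing..."]],
    [[{"filename": "NEC4 ACC.pdf", "page": 7, "summary": "Notice requirements..."}]],
)
```

The bracket label is built the same way the prompt asks the model to cite (`[filename, page N]`), which keeps citations in the answer matchable against the sources.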

**Expected answer format:**
```markdown
## Sub-question 1: What are time extensions?
- Time extensions must be notified to the project manager within 2 weeks [NEC4 ACC.pdf, page 3]
- The project manager must acknowledge the notice within 1 week [NEC4 Contract.pdf, page 12]

## Sub-question 2: What notice is required?
- Written notice must be given [NEC4 ACC.pdf, page 7]
```

**Existing `generate_response()` is preserved** — not modified, not deprecated.

**Task 4.3.2: Update `_query_stream()` generate stage**

File: `backend/app/routers/query.py`

Changes:
- Call `rag.generate_response_per_subquestion(extracted_questions, chunk_texts_by_subq, metadata_by_subq)`
- New SSE event: `generating_subquestion` — emitted before each sub-question's section (lets frontend show progressive build)
- `completed` SSE event includes both `answer` (markdown string) and `sub_question_sources` (grouped sources)

**New SSE event sequence:**
```json
{"phase": "decomposed", "extracted_questions": ["q1", "q2"]}
{"phase": "retrieving"}
{"phase": "filtering"}
{"phase": "generating"}
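{"phase": "completed", "answer": "## Sub-question 1: ...\n\n...", "sub_question_sources": [{"sub_question_index": 0, "sub_question_text": "q1", "sources": [SourceMetadata, ...]}, {"sub_question_index": 1, "sub_question_text": "q2", "sources": [SourceMetadata, ...]}]}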
{"phase": "error", "message": "..."}
```
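
For reference, the happy-path sequence can be framed as Server-Sent Events along these lines. The function names are hypothetical and the payloads are passed through untouched; the sketch only shows the `data:` framing that the frontend client parses line by line.

```python
import json
from typing import Any, Dict, Iterator, List


def sse_line(event: Dict[str, Any]) -> str:
    """Serialize one event as a Server-Sent Events data frame."""
    return f"data: {json.dumps(event)}\n\n"


def stream_events(
    questions: List[str], answer: str, grouped_sources: List[Any]
) -> Iterator[str]:
    """Yield the phases in pipeline order, ending with the completed payload."""
    yield sse_line({"phase": "decomposed", "extracted_questions": questions})
    for phase in ("retrieving", "filtering", "generating"):
        yield sse_line({"phase": phase})
    yield sse_line(
        {"phase": "completed", "answer": answer, "sub_question_sources": grouped_sources}
    )


frames = list(stream_events(["q1", "q2"], "## Sub-question 1: q1", []))
```

An `error` frame would be emitted in place of the remaining phases when a stage fails; that branch is left out of the sketch.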

**New `QueryResponse` model:**

File: `backend/app/models/query.py`

```python
from typing import List

from pydantic import BaseModel

# SourceMetadata is the existing source model already used by the flat response.


class SubQuestionSources(BaseModel):
    sub_question_index: int
    sub_question_text: str
    sources: List[SourceMetadata]


class QueryResponse(BaseModel):
    extracted_questions: List[str]
    answer: str                          # Markdown with ## sections
    sub_question_sources: List[SubQuestionSources]  # Grouped sources
    # Backward compat:
    sources: List[SourceMetadata]        # Flattened version (all sources)
```
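
The backward-compatible `sources` field could be derived from the grouped structure as sketched here. Plain dicts stand in for the Pydantic models so the example has no third-party dependency, and the `filename`/`page` fields on each source are assumed.

```python
from typing import Any, Dict, List


def flatten_sources(groups: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Concatenate per-sub-question sources into one flat list, in sub-question order."""
    ordered = sorted(groups, key=lambda g: g["sub_question_index"])
    return [source for group in ordered for source in group["sources"]]


groups = [
    {
        "sub_question_index": 1,
        "sub_question_text": "What notice is required?",
        "sources": [{"filename": "NEC4 ACC.pdf", "page": 7}],
    },
    {
        "sub_question_index": 0,
        "sub_question_text": "What are time extensions?",
        "sources": [
            {"filename": "NEC4 ACC.pdf", "page": 3},
            {"filename": "NEC4 Contract.pdf", "page": 12},
        ],
    },
]
flat = flatten_sources(groups)
```

The sketch keeps a source that appears under two sub-questions twice, in line with Decision 9; whether the flat list should be deduplicated is not specified in the plan.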

**Commit**: `"feat: Phase 4.3 sub-question-organized response generation"`

### Sub-Phase 4.4: Backend — History & Prompt Template Updates

**Test files to write first:**
- `test_phase4_history_format.py` — Tests new XML/JSON history formats
- `test_phase4_prompt_templates.py` — Tests new generate template with `{context_sections}`

**Task 4.4.1: Update history recording**

File: `backend/app/routers/query.py` (the `_schedule_history` / `_record_history` helpers)

Changes:
- `chunks_retrieved`: Store new grouped XML format (with `<sub_q>` wrappers)
- `chunks_filtered`: Store new grouped XML format (with `<sub_q>` wrappers and `Relevance:` scores)
- `sources`: Store grouped JSON: `json.dumps([[SourceMetadata_dict, ...], [...]])` (list of lists)
- `final_answer`: Store markdown string with `##` sections
- Existing fields (`chunks_retrieved_count`, `chunks_filtered_count`) keep total counts
- New optional fields: `chunks_retrieved_per_subq_count`, `chunks_filtered_per_subq_count` (JSON array of ints)

**Task 4.4.2: Update history DB schema (minimal)**

File: `backend/app/core/sqlite_db.py`

Add two new columns (optional, NULL-able):
```sql
ALTER TABLE query_history ADD COLUMN chunks_retrieved_per_subq_count TEXT DEFAULT NULL;
ALTER TABLE query_history ADD COLUMN chunks_filtered_per_subq_count TEXT DEFAULT NULL;
```

These store JSON arrays like `[10, 8]` — one count per sub-question. NULL for pre-Package-4 records.
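
SQLite's `ALTER TABLE ... ADD COLUMN` has no `IF NOT EXISTS` form, so a migration that runs on every startup needs its own guard. The sketch below is one way to do that with the standard `sqlite3` module; the function name and the simplified `query_history` table are illustrative assumptions.

```python
import json
import sqlite3

NEW_COLUMNS = ("chunks_retrieved_per_subq_count", "chunks_filtered_per_subq_count")


def add_per_subq_count_columns(conn: sqlite3.Connection) -> None:
    """Add the two count columns only if they are not already present."""
    # PRAGMA table_info returns one row per column; index 1 is the column name.
    existing = {row[1] for row in conn.execute("PRAGMA table_info(query_history)")}
    for column in NEW_COLUMNS:
        if column not in existing:
            conn.execute(f"ALTER TABLE query_history ADD COLUMN {column} TEXT DEFAULT NULL")
    conn.commit()


conn = sqlite3.connect(":memory:")
# Simplified stand-in for the real table definition.
conn.execute("CREATE TABLE query_history (id INTEGER PRIMARY KEY, question TEXT)")
add_per_subq_count_columns(conn)
add_per_subq_count_columns(conn)  # second call finds both columns and does nothing
conn.execute(
    "INSERT INTO query_history (question, chunks_retrieved_per_subq_count) VALUES (?, ?)",
    ("What are NEC4 time extension clauses?", json.dumps([10, 8])),
)
stored = conn.execute(
    "SELECT chunks_retrieved_per_subq_count, chunks_filtered_per_subq_count FROM query_history"
).fetchone()
```

The row written here has a JSON array in one column and `NULL` in the other, which is also what pre-Package-4 records look like for both columns.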

**Task 4.4.3: Update history Pydantic models**

File: `backend/app/models/history.py`

Add optional fields to `QueryHistoryRecord` and `QueryHistoryDetail`:
```python
chunks_retrieved_per_subq_count: Optional[str] = None  # JSON array string
chunks_filtered_per_subq_count: Optional[str] = None    # JSON array string
```

**Task 4.4.4: Update prompt templates**

File: `backend/app/core/sqlite_db.py` (seed data)

New `generate` template:
```python
"generate": (
    "You must answer each sub-question using ONLY the document chunks provided for it.\n"
    "Do not use any external knowledge.\n"
    "Format your answer as markdown sections — one section per sub-question.\n"
    "Each section should start with \"## Sub-question N: <the question>\"\n"
    "Each section should contain 1-5 bullet points.\n"
    "Cite your sources inline using bracket labels, e.g. [filename, page N].\n"
    "Place the citation at the end of each relevant bullet point.\n\n"
    "{context_sections}\n\n"
    "Answer:"
)
```

`decompose` and `filter` templates remain unchanged (they still use `{question}` placeholder — the orchestrator injects the right value at call time).

**Task 4.4.5: Update `PromptService` to handle new template placeholder**

File: `backend/app/services/prompt_service.py`

- Add `context_sections` as a known placeholder for the `generate` step (optional, since `str.replace` leaves unknown placeholders untouched)
- The `reset_to_defaults()` method must include the new generate template
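
The `str.replace` behaviour mentioned above can be illustrated with a small rendering helper. This is an assumed shape for the substitution step, not the `PromptService` code itself.

```python
from typing import Dict


def render_template(template: str, values: Dict[str, str]) -> str:
    """Substitute {name} placeholders with str.replace; unknown ones are left as-is."""
    rendered = template
    for name, value in values.items():
        rendered = rendered.replace("{" + name + "}", value)
    return rendered


rendered = render_template(
    "Answer using:\n{context_sections}\n\nAnswer: {unused}",
    {"context_sections": "### Context for Sub-question 0"},
)
```

Unlike `str.format`, this approach does not raise on placeholders it has no value for, and literal braces in a template (for example a JSON example in the filter instructions) pass through unchanged.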

**Task 4.4.6: Update history detail API response**

File: `backend/app/routers/history.py`

`GET /api/v1/history/{id}` response now includes `chunks_retrieved_per_subq_count` and `chunks_filtered_per_subq_count` when they are not NULL. Backward-compatible (older records return `null` for these fields).

**Commit**: `"feat: Phase 4.4 history schema, prompt templates, and Pydantic model updates"`

### Sub-Phase 4.5: Frontend — Types & State Management

**Test files to write first:**
- `test_phase4_stream_state.test.tsx` — Tests `QueryStreamState` handles new response shape
- `test_phase4_types.test.ts` — Tests type compatibility

**Task 4.5.1: Update TypeScript types**

File: `frontend/src/types/index.ts`

New types:
```typescript
interface SubQuestionSources {
  sub_question_index: number;
  sub_question_text: string;
  sources: SourceMetadata[];
}

interface QueryStreamCompletedEvent {
  phase: 'completed';
  answer: string;                              // Markdown with ## sections
  sub_question_sources: SubQuestionSources[];  // Grouped sources
}

interface QueryStreamDecomposedEvent {
  phase: 'decomposed';
  extracted_questions: string[];
}

type QueryStreamEvent =
  | QueryStreamDecomposedEvent
  | { phase: 'retrieving' | 'filtering' | 'generating' }
  | QueryStreamCompletedEvent
  | { phase: 'error'; message: string };
```

**Task 4.5.2: Update `QueryStreamState` and mutation handler**

File: `frontend/src/lib/queries.tsx`

Changes:
```typescript
interface QueryStreamState {
  extractedQuestions: string[] | null;
  answer: string | null;                        // Full markdown
  subQuestionSources: SubQuestionSources[] | null;  // NEW — grouped sources
  phase: 'idle' | 'decomposing' | 'retrieving' | 'filtering' | 'generating' | 'completed' | 'error';
  error: Error | null;
}
```

In the `completed` case:
```typescript
case 'completed':
  setState(prev => ({
    ...prev,
    answer: event.answer,
    subQuestionSources: event.sub_question_sources,
    phase: 'completed',
  }));
  break;
```

**Commit**: `"feat: Phase 4.5 frontend types and state management for per-sub-q responses"`

### Sub-Phase 4.6: Frontend — ResponsePanel & ExtractedQuestionsDisplay

**Test files to write first:**
- `test_phase4_response_panel.test.tsx` — Tests per-sub-question section rendering
- `test_phase4_citation_parser.test.ts` — Tests per-sub-question citation lookup

**Task 4.6.1: Redesign `ResponsePanel` for sub-question sections**

File: `frontend/src/components/ResponsePanel.tsx`

Current: single `ReactMarkdown` block + flat sources grid.

New layout:
```
┌─────────────────────────────────────────────────────┐
│  📋 Response                           [Copy All]   │
├─────────────────────────────────────────────────────┤
│                                                      │
│  ┌─ Sub-question 1: What are time extensions? ─────┐│
│  │                                                    │
│  │  • Time extensions must be notified...             │
│  │    [NEC4 ACC.pdf, page 3]                          │
│  │  • The project manager must acknowledge...         │
│  │    [NEC4 Contract.pdf, page 12]                    │
│  │                                                    │
│  │  Sources (2)                          [Expand ▼]  │
│  │  ┌──────────────────────────────────────────────┐ │
│  │  │ NEC4 ACC.pdf, Page 3  │ NEC4 Contract, p12 │ │
│  │  │ "Clause 61.3 states.." │ "Notice must be..." │ │
│  │  └──────────────────────────────────────────────┘ │
│  └────────────────────────────────────────────────────┘│
│                                                      │
│  ┌─ Sub-question 2: What notice is required? ───────┐│
│  │                                                    │
│  │  • Written notice must be given...                  │
│  │    [NEC4 ACC.pdf, page 7]                           │
│  │                                                    │
│  │  Sources (1)                          [Expand ▼]  │
│  └────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────┘
```

Implementation approach:
1. Parse the `answer` markdown into sections using `## Sub-question N:` headers
2. Map each section to its `SubQuestionSources` by matching index
3. Render each section as an accordion/card with:
   - Header: sub-question text (from `SubQuestionSources`)
   - Body: `ReactMarkdown` for bullet points (with inline citation links)
   - Footer: collapsible sources grid (only sources belonging to this sub-question)
4. Keep the existing citation link behavior (clickable `[filename, page N]` → PDF viewer)

**Task 4.6.2: Update `citationParser.ts` for per-sub-question lookup**

File: `frontend/src/utils/citationParser.ts`

Current: `buildCitationLookup(sources: SourceMetadata[])` — returns a single global map.

New: `buildCitationLookup(subQuestionSources: SubQuestionSources[])` — returns a map scoped to the correct sources for each section. The citation `[filename, page N]` match is looked up in the relevant sub-question's source list.

**Task 4.6.3: Update `ExtractedQuestionsDisplay` for anchors**

File: `frontend/src/components/ExtractedQuestionsDisplay.tsx`

Minor enhancement:
- Make each extracted question a clickable anchor that scrolls to its corresponding section in the answer
- Add `id="subq-{index}"` to each section header in `ResponsePanel`
- Keep existing skeleton loading behavior

**Commit**: `"feat: Phase 4.6 frontend per-sub-question response rendering"`

### Sub-Phase 4.7: Testing & Polish

**Test files to write:**
- `test_phase4_integration_query_pipeline.py` — Full integration test simulating per-sub-q pipeline
- `test_phase4_acceptance_query.py` — Acceptance test with real LLM (manual run)
- `test_phase4_e2e_query_flow.test.tsx` — Frontend e2e test with mocked SSE stream

**Task 4.7.1: Backend unit tests**

- Run `pytest backend/app/test/test_phase4_*.py -v` — all must pass
- Verify no regressions in existing Phase 1 and Phase 3 tests
- Update `test_phase1_rag_service.py` for new method signatures
- Update `test_phase1_relevance_filter.py` for per-sub-q behavior
- Rewrite `test_phase3_query_history_integration.py` for new pipeline flow
- Update `test_phase3_prompt_injection.py` for new generate template

**Task 4.7.2: Backend acceptance tests**

- `test_phase4_acceptance_query.py` — real LLM, real ChromaDB
- Verify: answer contains `## Sub-question` headers, sources grouped by sub-question index
- Verify: each sub-question section has 1-5 bullet points
- Verify: inline citations match the correct sub-question's source list
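
The three checks above could be expressed with helpers like the ones below. The function names are hypothetical, and the sample answer is the expected-format example from Task 4.3.1. Note that headers are numbered from 1 while `sub_question_index` starts at 0, so header N is assumed to correspond to index N-1.

```python
import re
from typing import Dict, Set, Tuple

HEADER_RE = re.compile(r"^## Sub-question (\d+):\s*(.*)$", re.MULTILINE)
CITATION_RE = re.compile(r"\[([^\[\],]+), page (\d+)\]")


def split_sections(answer: str) -> Dict[int, str]:
    """Map each 1-based header number to the body text of its section."""
    matches = list(HEADER_RE.finditer(answer))
    sections: Dict[int, str] = {}
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(answer)
        sections[int(match.group(1))] = answer[match.end():end].strip()
    return sections


def check_section(body: str, allowed: Set[Tuple[str, int]]) -> bool:
    """True if the section has 1-5 bullets and cites only its own sources."""
    bullets = [ln for ln in body.splitlines() if ln.lstrip().startswith("- ")]
    cited = {(name.strip(), int(page)) for name, page in CITATION_RE.findall(body)}
    return 1 <= len(bullets) <= 5 and cited <= allowed


answer = (
    "## Sub-question 1: What are time extensions?\n"
    "- Time extensions must be notified to the project manager within 2 weeks [NEC4 ACC.pdf, page 3]\n"
    "- The project manager must acknowledge the notice within 1 week [NEC4 Contract.pdf, page 12]\n"
    "\n"
    "## Sub-question 2: What notice is required?\n"
    "- Written notice must be given [NEC4 ACC.pdf, page 7]\n"
)
sections = split_sections(answer)
```

The citation pattern assumes filenames contain no commas or square brackets; filenames that do would need a stricter label format.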

**Task 4.7.3: Frontend tests**

- `test_phase4_response_panel.test.tsx` — renders per-sub-question sections, expandable sources
- `test_phase4_citation_parser.test.ts` — per-sub-question lookup returns correct source
- `test_phase4_e2e_query_flow.test.tsx` — mocks SSE with new event format, verifies section rendering
- Update existing `ResponsePanel.test.tsx` and `citationParser.test.ts` for new API

**Task 4.7.4: Frontend build verification**

- `npm run build` — no TypeScript errors
- `npm test` — all 62 existing tests pass + new Phase 4 tests
- Verify manual flow: ask question → see extracted questions → see per-sub-question answer sections → expand sources per section

**Task 4.7.5: Error handling**

- Empty decomposition: if `decompose()` returns `[]`, fall back to using original question as single sub-question
- Empty retrieval for some sub-questions: that sub-question gets no chunks → section shows "No relevant information found"
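- No chunks pass the threshold for a sub-question: that sub-question gets no answer → graceful empty section
- JSON parse failure in filter: fall back to including all chunks (no filtering). Because filtering is a single LLM call, an unparseable response affects every sub-question; a missing or wrong-length score array affects only that sub-question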
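
The first two fallbacks can be sketched as small helpers. Both names are hypothetical, dropping blank sub-questions is an extra assumption of this sketch, and rendering the placeholder text as a single bullet is one possible reading of "section shows".

```python
from typing import List

NO_INFO_TEXT = "No relevant information found"


def effective_sub_questions(original_question: str, decomposed: List[str]) -> List[str]:
    """Fall back to the original question when decomposition yields nothing usable."""
    cleaned = [q.strip() for q in decomposed if q and q.strip()]
    return cleaned or [original_question]


def empty_section(number: int, question: str) -> str:
    """Markdown section for a sub-question that ended up with no chunks."""
    return f"## Sub-question {number}: {question}\n- {NO_INFO_TEXT}"


fallback = effective_sub_questions("What are NEC4 time extension clauses?", [])
placeholder = empty_section(2, "What notice is required?")
```

Emitting the placeholder section from the backend keeps the answer's section count equal to the number of sub-questions, which simplifies matching sections to sources by index.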

**Task 4.7.6: Documentation**

- Update `AGENTS.md` with new pipeline architecture section
- Add docstrings to all new methods (`retrieve_per_subquestion`, `filter_per_subquestion`, `generate_response_per_subquestion`)
- Update prompt template documentation in system prompts page

**Commit**: `"feat: Phase 4.7 testing, error handling, and polish for per-sub-q pipeline"`

---

## Phase 4a: Prompt Service Integration for Per-Sub-Q Filter (2026-04-27)

**Root issue**: `filter_per_subquestion()` in `relevance_filter.py` had a hardcoded prompt (`_build_per_subq_prompt()`) — completely bypassing `PromptService`. Users could not edit the per-sub-q filter prompt on the System Prompts page, unlike the flat `filter` step which was already prompt-service-driven.

**Solution**: Broke the per-sub-q filter prompt into **3 composable pieces**, each a separately editable step on the System Prompts page:

| Step Name | Label | Placeholders | Default |
|-----------|-------|-------------|---------|
| `filter_intro` | Step 2.1: Filter Intro (Preamble) | *(none)* | `"Evaluate each chunk for relevance to its associated sub-question only."` |
| `filter_section` | Step 2.2: Filter Section (Per Sub-Q) | `{subq_idx}`, `{subq_question}`, `{chunks}` | `'Sub-question {subq_idx}: "{subq_question}"\n{chunks}'` |
| `filter_outro` | Step 2.3: Filter Outro (Format) | *(none)* | JSON format instructions + example |

The `RelevanceFilter._build_per_subq_prompt()` now composes them at runtime:
```
filter_intro + [filter_section.replace(...) for each sub-q] + filter_outro
```
Falls back to built-in defaults when `PromptService` is unavailable.
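
A runnable version of that composition might look like the following. The intro and section defaults are copied from the table above; `DEFAULT_OUTRO` is a shortened, hypothetical stand-in for the real format instructions, and joining the pieces with blank lines is an assumption.

```python
from typing import List

DEFAULT_INTRO = "Evaluate each chunk for relevance to its associated sub-question only."
DEFAULT_SECTION = 'Sub-question {subq_idx}: "{subq_question}"\n{chunks}'
# Hypothetical outro; the real default holds the JSON format instructions and example.
DEFAULT_OUTRO = "Return a JSON object mapping sub-question indices to arrays of scores."


def build_per_subq_prompt(
    sub_questions: List[str],
    sub_chunk_texts: List[List[str]],
    intro: str = DEFAULT_INTRO,
    section: str = DEFAULT_SECTION,
    outro: str = DEFAULT_OUTRO,
) -> str:
    """Compose intro + one rendered section per sub-question + outro."""
    parts = [intro]
    for idx, (question, texts) in enumerate(zip(sub_questions, sub_chunk_texts)):
        # Chunks are numbered from 0 inside each sub-question, as in the prompt design.
        chunk_block = "\n".join(f"Chunk {n}: {text}" for n, text in enumerate(texts))
        parts.append(
            section.replace("{subq_idx}", str(idx))
            .replace("{subq_question}", question)
            .replace("{chunks}", chunk_block)
        )
    parts.append(outro)
    return "\n\n".join(parts)


prompt = build_per_subq_prompt(["q0", "q1"], [["a", "b"], ["c"]])
```

Because the three pieces are plain parameters with defaults, the same function covers both the `PromptService`-driven path and the built-in fallback.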

### Bugs Fixed

1. **`generate_per_subq` not seeded**: `rag.py` called `get_prompt_template("generate_per_subq")` but this step name was never added to `_VALID_STEPS`, `_SEED_STEPS`, or `_SEED_TEMPLATES` — would crash at runtime with `ValueError`. Now properly seeded with `{context_sections}` placeholder.

2. **`_SEED_GENERATE` placeholder mismatch from Package 4**: The flat `generate_response()` expects `{question}`/`{context}` placeholders, but Package 4 changed the seed template to use `{context_sections}` (intended for per-sub-q generate). Restored flat template; `generate_per_subq` now holds `{context_sections}`.

### Database Backfill Migration

The existing `seed_default_profiles()` only inserted steps for NEWLY created profiles. Added a backfill loop that iterates ALL existing profiles and `INSERT OR IGNORE`s any missing step names. This ensures existing A/B/C profiles pick up `filter_intro`, `filter_section`, `filter_outro`, and `generate_per_subq` on restart.
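
The backfill idea can be illustrated with the standard `sqlite3` module. The schema below (table and column names, the `UNIQUE (profile_id, step_name)` constraint) and the placeholder template texts are assumptions for illustration; the real definitions live in `sqlite_db.py`.

```python
import sqlite3

# Hypothetical schema: table and column names are assumptions for illustration.
SCHEMA = """
CREATE TABLE IF NOT EXISTS profiles (
    id INTEGER PRIMARY KEY,
    name TEXT UNIQUE NOT NULL
);
CREATE TABLE IF NOT EXISTS system_prompts (
    id INTEGER PRIMARY KEY,
    profile_id INTEGER NOT NULL REFERENCES profiles(id),
    step_name TEXT NOT NULL,
    prompt_template TEXT NOT NULL,
    UNIQUE (profile_id, step_name)  -- lets INSERT OR IGNORE skip existing rows
);
"""

# Placeholder defaults; the real seed templates are defined elsewhere.
SEED_TEMPLATES = {
    "filter_intro": "<default intro>",
    "filter_section": "<default section>",
    "filter_outro": "<default outro>",
    "generate_per_subq": "<default per-sub-q generate>",
}


def backfill_missing_steps(conn: sqlite3.Connection) -> None:
    """Give every existing profile any step it lacks, without overwriting edits."""
    profile_ids = [row[0] for row in conn.execute("SELECT id FROM profiles")]
    for profile_id in profile_ids:
        conn.executemany(
            "INSERT OR IGNORE INTO system_prompts "
            "(profile_id, step_name, prompt_template) VALUES (?, ?, ?)",
            [(profile_id, step, text) for step, text in SEED_TEMPLATES.items()],
        )
    conn.commit()
```

Running the function on every restart is safe under these assumptions: rows that already exist (including user-customized ones) hit the uniqueness constraint and are left untouched.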

### System Prompts UI Restructured

The flat `filter` and `generate` steps were removed from the UI (they're unused by the current pipeline). The page now shows 5 steps:

| UI Order | Label | Step Key |
|----------|-------|----------|
| 1 | Step 1: Query Decomposition | `decompose` |
| 2 | Step 2.1: Filter Intro (Preamble) | `filter_intro` |
| 3 | Step 2.2: Filter Section (Per Sub-Q) | `filter_section` |
| 4 | Step 2.3: Filter Outro (Format) | `filter_outro` |
| 5 | Step 3: Generate (Per-Sub-Question) | `generate_per_subq` |

The old `filter` and `generate` templates remain in the DB (for API backward compatibility) but are hidden from the UI.

### Files Changed

| File | Change |
|------|--------|
| `backend/app/core/sqlite_db.py` | 3 new seed templates + `generate_per_subq` seed; backfill migration; restored `_SEED_GENERATE` to `{question}`/`{context}` |
| `backend/app/services/prompt_service.py` | Added 4 step names to `_VALID_STEPS` |
| `backend/app/routers/prompts.py` | Added 4 step names to `_VALID_STEPS` |
| `backend/app/services/relevance_filter.py` | Refactored `_build_per_subq_prompt()` to use PromptService + built-in fallback constants |
| `frontend/src/components/PromptEditor.tsx` | Replaced unused flat steps with 5-step per-sub-q layout (Step 2.1-2.3 + Step 3) |
| `frontend/src/components/PlaceholderDocs.tsx` | Added `{context_sections}`, `{subq_idx}`, `{subq_question}` docs |
| `backend/app/test/conftest.py` | Added 4 new templates to mock |
| `backend/app/test/test_phase3_sqlite_db.py` | Updated counts (9→21 prompts) and placeholder assertions |
| `backend/app/test/test_phase3_prompt_service.py` | Updated step set + placeholder assertions |
| `backend/app/test/test_phase3_prompts_router.py` | Updated step set assertion |
| `backend/app/test/test_phase4_prompt_templates.py` | Updated for split generate/generate_per_subq |
| `frontend/src/test/components/PromptEditor.test.tsx` | Updated to 5 textareas, new labels, new placeholder layout |
| `frontend/src/test/components/PlaceholderDocs.test.tsx` | Updated to 6 placeholders |

### Test Results (Post-Phase 4a)
- **Backend**: 295 passed, 5 skipped (pre-existing)
- **Frontend**: 182 passed, 1 pre-existing failure (unrelated `file-input` e2e)

---

## Sub-Phase Summary

| Sub-Phase | Scope | Backend | Frontend | Tests | Status |
|-----------|-------|---------|----------|-------|--------|
| 4.1 | Per-sub-q retrieval | `rag.py`, `query.py`, format helpers | None | `test_phase4_retrieve_per_subquestion.py`, `test_phase4_query_router_retrieval.py` | ✅ Complete |
| 4.2 | Per-sub-q filtering (1 LLM call) | `relevance_filter.py`, `query.py` | None | `test_phase4_relevance_filter_per_subq.py`, `test_phase4_query_router_filter.py` | ✅ Complete |
| 4.3 | Sub-q-organized response generation | `rag.py`, `query.py`, `models/query.py` | None | `test_phase4_generate_per_subq.py`, `test_phase4_response_format.py` | ✅ Complete |
| 4.4 | History schema, prompts, models | `sqlite_db.py`, `history.py` (router + models), `prompt_service.py` | None | `test_phase4_history_format.py`, `test_phase4_prompt_templates.py` | ✅ Complete |
| 4.5 | Frontend types + state | None | `types/index.ts`, `lib/queries.tsx` | `test_phase4_stream_state.test.tsx`, `test_phase4_types.test.ts` | ✅ Complete |
| 4.6 | Frontend rendering | None | `ResponsePanel.tsx`, `citationParser.ts`, `ExtractedQuestionsDisplay.tsx` | `test_phase4_response_panel.test.tsx`, `test_phase4_citation_parser.test.ts` | ✅ Complete |
| 4.7 | Testing & polish | All affected files | All affected files | Integration + acceptance + e2e tests | ✅ Complete |
| 4a | Prompt service integration for filter_per_subq | `sqlite_db.py`, `prompt_service.py`, `prompts.py`, `relevance_filter.py` | `PromptEditor.tsx`, `PlaceholderDocs.tsx` | Updated 7 test files, 13 total files changed | ✅ Complete |

---

## Implementation Sequence & Dependencies

```
4.1 (Retrieval) ──┐
                  ├──► 4.2 (Filtering) ──► 4.3 (Generate) ──► 4.4 (History/Prompts)
                  │                                                    │
                  │                                                    ▼
                  │                                         4.5 (Frontend Types/State)
                  │                                                    │
                  │                                                    ▼
                  │                                         4.6 (Frontend Rendering)
                  │                                                    │
                  └─────────────────────────────────────────────────────▼
                                                              4.7 (Testing & Polish)
```

- **4.1 → 4.2 sequential**: Filtering needs per-sub-q chunk structure from retrieval
- **4.2 → 4.3 sequential**: Generation needs filtered chunks from filtering stage
- **4.3 → 4.4 sequential**: History recording and prompt templates need final data shapes
- **4.4 → 4.5 can overlap**: Backend prompt/history changes don't block frontend type definitions; the diagram orders them sequentially only as the safer default
- **4.5 → 4.6 sequential**: Rendering needs types and state management
- **4.7 blocked by all**: Integration tests need everything wired together

**Parallelization opportunity**: 4.5 (frontend types) could start as soon as 4.3 defines the SSE contract, but it's safer to start after 4.4 confirms the final data shapes.

---

## Affected Files — Complete Inventory

### Backend — New Files
| File | Sub-Phase | Purpose |
|------|-----------|---------|
| `backend/app/test/test_phase4_retrieve_per_subquestion.py` | 4.1 | Unit test: `retrieve_per_subquestion()` |
| `backend/app/test/test_phase4_query_router_retrieval.py` | 4.1 | Unit test: retrieval stage in `_query_stream` |
| `backend/app/test/test_phase4_relevance_filter_per_subq.py` | 4.2 | Unit test: `filter_per_subquestion()` |
| `backend/app/test/test_phase4_query_router_filter.py` | 4.2 | Unit test: filter stage in `_query_stream` |
| `backend/app/test/test_phase4_generate_per_subq.py` | 4.3 | Unit test: `generate_response_per_subquestion()` |
| `backend/app/test/test_phase4_response_format.py` | 4.3 | Unit test: answer format validation |
| `backend/app/test/test_phase4_history_format.py` | 4.4 | Unit test: new XML/JSON history formats |
| `backend/app/test/test_phase4_prompt_templates.py` | 4.4 | Unit test: new generate template |
| `backend/app/test/test_phase4_integration_query_pipeline.py` | 4.7 | Integration test: full per-sub-q pipeline |
| `backend/app/test/acceptance/test_phase4_acceptance_query.py` | 4.7 | Acceptance test: real LLM |

### Backend — Modified Files
| File | Sub-Phase | Changes |
|------|-----------|---------|
| `backend/app/services/rag.py` | 4.1, 4.3 | Add `retrieve_per_subquestion()`, `generate_response_per_subquestion()` |
| `backend/app/services/relevance_filter.py` | 4.2 | Add `filter_per_subquestion()` |
| `backend/app/routers/query.py` | 4.1–4.4 | Refactor `_query_stream()`, add per-sub-q format helpers, update history recording |
| `backend/app/models/query.py` | 4.3 | Add `SubQuestionSources` model, update `QueryResponse` |
| `backend/app/models/history.py` | 4.4 | Add optional per-sub-q count fields |
| `backend/app/core/sqlite_db.py` | 4.4 | Add new columns, update seed generate template |
| `backend/app/services/prompt_service.py` | 4.4 | Update `reset_to_defaults()` generate template |
| `backend/app/routers/history.py` | 4.4 | Include new fields in detail response |
| `backend/app/core/config.py` | 4.1 | (Maybe) Add `retrieval_n_results_per_subq` setting |

### Backend — Tests Needing Update
| File | Sub-Phase | Changes |
|------|-----------|---------|
| `backend/app/test/test_phase1_rag_service.py` | 4.7 | Add tests for new methods; existing tests unaffected |
| `backend/app/test/test_phase1_relevance_filter.py` | 4.7 | Add tests for `filter_per_subquestion()` |
| `backend/app/test/test_phase3_query_history_integration.py` | 4.7 | Rewrite pipeline simulation for per-sub-q flow |
| `backend/app/test/test_phase3_prompt_injection.py` | 4.7 | Add tests for new generate template |
| `backend/app/test/acceptance/test_acceptance_phase1_rag_query.py` | 4.7 | Rewrite — SSE parsing + new response shape |
| `backend/app/test/conftest.py` | 4.7 | Add per-sub-q mock helpers |

### Frontend — New Files
| File | Sub-Phase | Purpose |
|------|-----------|---------|
| `frontend/src/test/components/test_phase4_response_panel.test.tsx` | 4.7 | Component test: per-sub-q sections |
| `frontend/src/test/utils/test_phase4_citation_parser.test.ts` | 4.7 | Unit test: per-sub-q citation lookup |
| `frontend/src/test/e2e/test_phase4_query_flow.test.tsx` | 4.7 | E2E test: mocked SSE with new format |
| `frontend/src/test/lib/test_phase4_stream_state.test.tsx` | 4.5 | State test: new event shapes |
| `frontend/src/test/lib/test_phase4_types.test.ts` | 4.5 | Type test: type compatibility |

### Frontend — Modified Files
| File | Sub-Phase | Changes |
|------|-----------|---------|
| `frontend/src/types/index.ts` | 4.5 | Add `SubQuestionSources`, update `QueryStreamEvent` |
| `frontend/src/lib/queries.tsx` | 4.5 | Update `QueryStreamState`, `completed` event handler |
| `frontend/src/components/ResponsePanel.tsx` | 4.6 | Redesign — per-sub-question sections with grouped sources |
| `frontend/src/utils/citationParser.ts` | 4.6 | Update `buildCitationLookup()` for per-sub-q |
| `frontend/src/components/ExtractedQuestionsDisplay.tsx` | 4.6 | Add anchor links to answer sections |
| `frontend/src/pages/LTTPage.tsx` | 4.6 | Pass new props to children |

---

## Risk Register

| Risk | Likelihood | Impact | Mitigation |
|------|-----------|--------|------------|
| LLM struggles with per-sub-q filtering prompt format | Medium | High — all chunks dropped | Use strong prompt constraints, validate JSON, fall back to including all chunks on parse failure |
| LLM generates answer not matching `## Sub-question N:` format | Medium | Medium — frontend can't parse sections | Fall back to rendering as single block if parsing fails. Prompt engineering tuned for format compliance |
| Same chunk retrieved by multiple sub-questions → duplicated in context | High | Low — slightly larger prompt but acceptable | Accept duplicates. ChromaDB naturally returns same doc if relevant to multiple queries. Each sub-q's evaluation is independent |
| Per-sub-q retrieval = more ChromaDB queries = slower | Medium | Medium: N × retrieval latency | ChromaDB retrieval is fast (~10-50ms per query). 5 sub-questions × 10-50ms = 50-250ms of total retrieval latency when queries run sequentially. Acceptable trade-off for better relevance. |
| History DB migration fails for existing records | Low | Low — new columns are NULL-able | `ALTER TABLE ADD COLUMN ... DEFAULT NULL` is safe. Existing records work as before — `chunks_retrieved`/`chunks_filtered` still have flat XML. |
| Frontend rendering breaks on older history records | Low | Low — answer format differs | `ResponsePanel` renders per-sub-q sections only when `subQuestionSources` is non-null. Older history records show flat answer as before. |
| Prompt template migration breaks user-customized prompts | Medium | Medium — users lose their generate template | Warn in docs. The `generate` template changes fundamentally (single `{context}` → `{context_sections}`). Users must re-customize. |

---

## Acceptance Criteria

### Backend
- [x] `POST /api/v1/query` retrieves chunks per sub-question (verified by history XML showing `<sub_q>` wrappers)
- [x] Filtering uses single LLM call evaluating chunks against their originating sub-question (verified by filter prompt)
- [x] Response answer is organized by sub-question with `## Sub-question N:` headers
- [x] `sub_question_sources` in SSE `completed` event is grouped by sub-question index
- [x] History records include new grouped XML formats for `chunks_retrieved` and `chunks_filtered`
- [x] History records include grouped `sources` JSON (list of lists)
- [x] History records include per-sub-q chunk counts
- [x] New `generate_per_subq` prompt template uses `{context_sections}` placeholder (the flat `generate` template keeps `{question}`/`{context}`, as restored in Phase 4a)
- [x] Prompt service `reset_to_defaults()` includes new generate template
- [x] Existing `decompose`, `filter` (old), `generate_response` (old) methods are unchanged
- [x] All Phase 1, Phase 3, and new Phase 4 unit tests pass (312 passed, 4 skipped)
- [x] All acceptance tests pass with real LLM (manual run)

### Frontend
- [x] `QueryStreamState` includes `subQuestionSources` field
- [x] `ResponsePanel` renders per-sub-question sections with expandable source grids
- [x] Each section's sources are scoped to that sub-question (no cross-contamination)
- [x] Inline citations `[filename, page N]` link to the correct PDF viewer page
- [x] `ExtractedQuestionsDisplay` shows clickable anchors to answer sections
- [x] Copy button copies all answer text including section headers
- [x] Loading states: skeleton per section during generation
- [x] Empty state: "No relevant information found" per sub-question (not entire response)
- [x] All 62+ existing frontend tests still pass (183 passed)
- [x] All new Phase 4 frontend tests pass
- [x] `npm run build` succeeds with zero TypeScript errors
- [x] Manual verification: full query flow works end-to-end

---

## New Dependencies

None. All changes use existing libraries (FastAPI, ChromaDB, OpenAI SDK, React, ReactMarkdown, TanStack Query).

---

## Decisions (All Confirmed)

| # | Topic | Decision |
|---|-------|----------|
| 1 | Single vs multiple filter LLM calls | **Single call** — user explicitly requested this |
| 2 | Filter prompt design | Group chunks by sub-question in one prompt. JSON response maps sub-q indices to score arrays |
| 3 | Answer format | Markdown with `## Sub-question N: <question>` headers |
| 4 | Sources grouping | `sub_question_sources: [{index, text, sources}, ...]` in SSE + frontend |
| 5 | History XML format | Add `<sub_q idx="N" question="...">` wrappers around chunk groups |
| 6 | History DB migration | Add 2 new NULL-able columns. No data migration needed. |
| 7 | Backward compatibility | Preserve old `retrieve()`, `filter()`, `generate_response()` methods. New methods are additive. |
| 8 | Deduplication | None. Same chunk may appear in multiple sub-questions. Each sub-q evaluates independently. |
| 9 | Error handling | Per-sub-question graceful degradation. Filter failure → include all chunks for that sub-q. Generate failure → "Unable to generate answer for this sub-question." |
| 10 | Frontend rendering engine | Keep `ReactMarkdown`. Parse sections client-side by splitting on `## Sub-question N:` headers. |
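
Decision 5 can be illustrated with the standard library's `xml.etree.ElementTree`. This is a sketch only: the `<chunks>` root and `<chunk>` child element names, the function name, and 1-based indexing are assumptions. Only the `<sub_q idx="N" question="...">` wrapper comes from the decision above.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple


def chunks_to_grouped_xml(groups: List[Tuple[str, List[str]]]) -> str:
    """Serialize (question, chunk texts) groups with one <sub_q> wrapper each."""
    root = ET.Element("chunks")  # hypothetical root element name
    for idx, (question, chunks) in enumerate(groups, start=1):
        # ElementTree escapes quotes and ampersands in attributes and text.
        sub_q = ET.SubElement(root, "sub_q", idx=str(idx), question=question)
        for text in chunks:
            ET.SubElement(sub_q, "chunk").text = text  # hypothetical child name
    return ET.tostring(root, encoding="unicode")
```

Building the tree with a library rather than string concatenation keeps the history XML well-formed even when a sub-question contains quotes or `&`.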

---

## Open Questions

None — all resolved.

| # | Question | Resolution |
|---|----------|------------|
| 1 | Progressive SSE events? | **Yes** — emit `generating_subquestion` as each sub-question's answer is generated. Frontend renders sections progressively. |
| 2 | `retrieval_n_results` per sub-question or global? | **Global** — same value for all sub-questions. Simpler config, one setting. |
| 3 | Fallback when decomposition returns 0 sub-questions? | **Fall back to original question** — treat as single sub-question. Pipeline runs as 1-sub-q case (retrieval via original question, no filtering needed for single sub-q, flat answer). |
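
Resolution 3 amounts to a small guard before retrieval. A minimal sketch, assuming the decomposer returns a list of strings; the helper name `effective_sub_questions` is hypothetical, and treating whitespace-only entries as empty is an added assumption.

```python
from typing import List


def effective_sub_questions(original_question: str, decomposed: List[str]) -> List[str]:
    """Return usable sub-questions, or the original question when none survive."""
    cleaned = [q.strip() for q in decomposed if q and q.strip()]
    return cleaned if cleaned else [original_question]
```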

---

## Test Plan Summary

### Backend (New Tests)

| File | Tests | Coverage |
|------|-------|----------|
| `test_phase4_retrieve_per_subquestion.py` | ~6 | Per-sub-q retrieval, empty input, single sub-q, dedup behavior |
| `test_phase4_query_router_retrieval.py` | ~4 | SSE events during retrieval, chunk XML format |
| `test_phase4_relevance_filter_per_subq.py` | ~6 | Per-sub-q filtering, JSON response parsing, threshold behavior |
| `test_phase4_query_router_filter.py` | ~4 | SSE events during filtering, filtered XML format |
| `test_phase4_generate_per_subq.py` | ~5 | Per-sub-q generate, prompt construction, answer format |
| `test_phase4_response_format.py` | ~4 | Answer has `##` headers, citations in correct sections |
| `test_phase4_history_format.py` | ~5 | New XML/JSON formats, per-sub-q counts |
| `test_phase4_prompt_templates.py` | ~3 | New generate template, `{context_sections}` placeholder |
| `test_phase4_integration_query_pipeline.py` | ~5 | Full pipeline simulation |
| `test_phase4_acceptance_query.py` | ~3 | Real LLM end-to-end (manual) |

### Frontend (New Tests)

| File | Tests | Coverage |
|------|-------|----------|
| `test_phase4_stream_state.test.tsx` | ~4 | State updates for new event shapes |
| `test_phase4_types.test.ts` | ~2 | Type compatibility checks |
| `test_phase4_response_panel.test.tsx` | ~6 | Section rendering, source grouping, copy, loading |
| `test_phase4_citation_parser.test.ts` | ~4 | Per-sub-q lookup, cross-section isolation |
| `test_phase4_query_flow.test.tsx` | ~3 | Full SSE flow with mocked stream |

---

## Phase PX: Profile Export/Import (2026-04-27)

**Source**: User request — "add an export and import function for setting a profile. The format is json."

**Scope**: Add JSON export/import capability to the System Prompts page. Users can download a profile's prompt configuration as a `.json` file and import it into another profile (or the same one) to transfer or back up their prompt settings.

**Status**: 🟡 Planned — not yet implemented.

---

### Objective

Let users:
1. **Export** a single profile's prompt templates as a downloadable JSON file
2. **Import** a previously exported JSON file to overwrite a profile's prompt templates
3. Optionally, **export all** profiles at once for full configuration backup

---

### Decision Register

| # | Decision | Rationale |
|---|----------|-----------|
| P1 | Export single profiles, not all-at-once by default | User asked "for setting a profile" — per-profile export/import is more practical for sharing individual configurations. Add "Export All" as secondary option. |
| P2 | Import overwrites ALL prompt steps for target profile | Simplest mental model. Import = full replace (not merge). User gets confirmation dialog before proceeding. |
| P3 | Export JSON includes all 7 steps (including legacy `filter`, `generate`) | Even though UI hides these, the DB stores them. Export should be a complete snapshot — import restores all 7. |
| P4 | Do NOT export auto-increment IDs | `id` fields are not portable between databases. Import matches rows on the unique `(name, step_name)` pair rather than on `id`. |
| P5 | `created_at`/`updated_at` reset on import | Imported profiles get fresh timestamps (`datetime('now')`). Original export timestamp preserved in file metadata only. |
| P6 | Active profile state NOT imported | `is_active` is deployment-specific. The user sets active profile separately via the existing dropdown. Import only touches `prompt_template` content. |
| P7 | Validate profile name on import | Only A, B, C allowed. Import into non-existent name = rejected. |
| P8 | JSON schema versioned | `"format": "legco-reranker-profile/v1"` for future-proofing. Reject unknown versions on import. |

---

### JSON Format Specification

#### Single Profile Export

```json
{
  "format": "legco-reranker-profile/v1",
  "profile_name": "A",
  "exported_at": "2026-04-27T12:00:00Z",
  "prompts": {
    "decompose": "Given this question: '{question}'\n\nBreak it down into 2-5 simplified sub-questions...",
    "filter": "Given question '{question}' and these document chunks:\n\n{chunks}\n\n...",
    "generate": "Question: {question}\n\nContext:\n{context}\n\n...",
    "generate_per_subq": "Answer each sub-question using ONLY its document chunks...",
    "filter_intro": "Evaluate each chunk for relevance to its associated sub-question only.",
    "filter_section": "\nSub-question {subq_idx}: \"{subq_question}\"\n{chunks}",
    "filter_outro": "\nFor each chunk, rate its relevance 0-10..."
  }
}
```

#### Full Backup Export (All Profiles)

```json
{
  "format": "legco-reranker-profile/v1",
  "exported_at": "2026-04-27T12:00:00Z",
  "active_profile": "A",
  "profiles": {
    "A": {
      "prompts": { ... }
    },
    "B": {
      "prompts": { ... }
    },
    "C": {
      "prompts": { ... }
    }
  }
}
```

#### Import Request Format

```http
POST /api/v1/prompts/profiles/{name}/import
Content-Type: application/json

{
  "format": "legco-reranker-profile/v1",
  "profile_name": "A",
  "exported_at": "2026-04-27T12:00:00Z",
  "prompts": {
    "decompose": "...",
    ...
  }
}
```

**Response**:
```json
{
  "status": "ok",
  "profile": "B",
  "imported_steps": 7,
  "source_profile": "A"
}
```

---

### Sub-Phase Structure

| Sub-Phase | Scope | Components | Test Files |
|-----------|-------|------------|------------|
| PX.1 | Backend — Export endpoint | `routers/prompts.py`, `models/prompts.py` | `test_phaseX_export.py` |
| PX.2 | Backend — Import endpoint | `routers/prompts.py`, `models/prompts.py`, `prompt_service.py` | `test_phaseX_import.py` |
| PX.3 | Frontend — Export/Import UI | `SystemPromptsPage.tsx`, `ProfileList.tsx`, `lib/api.ts`, `lib/queries.tsx`, `types/index.ts` | `test_phaseX_export_import.test.tsx` |
| PX.4 | Testing & Polish | All affected files | Integration + acceptance tests |

---

### Sub-Phase PX.1: Backend — Single Profile Export Endpoint

**Test files to write first:**
- `backend/app/test/test_phaseX_export.py` — Tests export endpoint, JSON schema validation, empty profile handling

**Task PX.1.1: Add Pydantic models**

File: `backend/app/models/prompts.py`

```python
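from pydantic import BaseModel


class ProfileExportResponse(BaseModel):
    """Single-profile export payload (no DB ids, no `is_active`)."""
    format: str = "legco-reranker-profile/v1"
    profile_name: str
    exported_at: str  # ISO 8601 UTC timestamp, e.g. "2026-04-27T12:00:00Z"
    prompts: dict[str, str]  # step_name -> template_text


class AllProfilesExportResponse(BaseModel):
    """Full backup of all profiles plus the active-profile marker."""
    format: str = "legco-reranker-profile/v1"
    exported_at: str
    active_profile: str
    profiles: dict[str, dict[str, dict[str, str]]]  # profile_name -> {"prompts": {step: text}}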
```

**Task PX.1.2: Add `GET /api/v1/prompts/profiles/{name}/export` endpoint**

File: `backend/app/routers/prompts.py`

- Reads all 7 `system_prompts` rows for the given profile
- Returns `ProfileExportResponse` with `Content-Disposition: attachment; filename="legco-profile-{name}.json"`
- Uses `application/json` content type
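
A framework-free sketch of what the export handler might assemble, assuming the rows arrive as `(step_name, prompt_template)` pairs. The function name `build_profile_export` and the row shape are hypothetical; wiring this into the FastAPI router is left out.

```python
import json
from datetime import datetime, timezone
from typing import Dict, Iterable, Tuple

EXPORT_FORMAT = "legco-reranker-profile/v1"


def build_profile_export(
    name: str, rows: Iterable[Tuple[str, str]]
) -> Tuple[str, Dict[str, str]]:
    """Return (JSON body, response headers) for a single-profile export."""
    payload = {
        "format": EXPORT_FORMAT,
        "profile_name": name,
        # UTC timestamp in the same shape as the format specification example.
        "exported_at": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        # Only step names and template text: no DB ids, no is_active flag.
        "prompts": {step: template for step, template in rows},
    }
    headers = {
        "Content-Type": "application/json",
        "Content-Disposition": f'attachment; filename="legco-profile-{name}.json"',
    }
    return json.dumps(payload, ensure_ascii=False, indent=2), headers
```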

**Task PX.1.3: Add `GET /api/v1/prompts/export/all` endpoint (optional)**

- Reads all 3 profiles + all 21 prompt rows
- Returns `AllProfilesExportResponse`
- For full backup/restore scenarios

**Commit**: `"feat(prompts): add single-profile and full JSON export endpoints"`

---

### Sub-Phase PX.2: Backend — Single Profile Import Endpoint

**Test files to write first:**
- `backend/app/test/test_phaseX_import.py` — Tests import endpoint, validation, error cases

**Task PX.2.1: Add request model**

File: `backend/app/models/prompts.py`

```python
class ProfileImportRequest(BaseModel):
    format: str                                          # must be "legco-reranker-profile/v1"
    profile_name: str                                    # source profile name (informational)
    exported_at: str | None = None                       # informational timestamp
    prompts: dict[str, str]                              # step_name -> template_text
```

**Task PX.2.2: Add `POST /api/v1/prompts/profiles/{name}/import` endpoint**

File: `backend/app/routers/prompts.py`

Validation steps:
1. Check target `{name}` is A, B, or C → 400 if not
2. Check `request.format == "legco-reranker-profile/v1"` → 400 if not
3. Validate that all 7 required step keys (`decompose`, `filter`, `generate`, `generate_per_subq`, `filter_intro`, `filter_section`, `filter_outro`) are present in `request.prompts` → 400 with list of missing keys if not
4. Validate that `request.prompts` contains no extra/unknown step keys → 400 with list of unknown keys if it does
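
The four checks can be sketched as one plain function, independent of the web framework. The name `validate_import` and the `(status_code, detail)` return convention are hypothetical; the endpoint would translate a non-`None` result into an HTTP error.

```python
from typing import Optional, Tuple

VALID_PROFILES = {"A", "B", "C"}
SUPPORTED_FORMAT = "legco-reranker-profile/v1"
REQUIRED_STEPS = {
    "decompose",
    "filter",
    "generate",
    "generate_per_subq",
    "filter_intro",
    "filter_section",
    "filter_outro",
}


def validate_import(target_name: str, payload: dict) -> Optional[Tuple[int, str]]:
    """Return None when the import is acceptable, else (status_code, detail)."""
    if target_name not in VALID_PROFILES:
        return 400, f"Invalid profile name: {target_name!r}"
    if payload.get("format") != SUPPORTED_FORMAT:
        return 400, f"Unsupported format: {payload.get('format')!r}"
    prompts = payload.get("prompts")
    if not isinstance(prompts, dict):
        return 400, "'prompts' must be an object"
    missing = sorted(REQUIRED_STEPS - set(prompts))
    if missing:
        return 400, "Missing step keys: " + ", ".join(missing)
    unknown = sorted(set(prompts) - REQUIRED_STEPS)
    if unknown:
        return 400, "Unknown step keys: " + ", ".join(unknown)
    return None
```

Checking in this order means the caller sees the most fundamental problem first (wrong target, then wrong format version) before any key-level detail.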

Implementation:
- Uses `PromptService._update_all_prompts()` (the existing internal batch update) to overwrite all 7 steps
- Each step gets fresh `created_at`/`updated_at` timestamps (DB defaults)
- Returns `{"status": "ok", "profile": name, "imported_steps": len(prompts), "source_profile": request.profile_name}`

**Task PX.2.3: Add `POST /api/v1/prompts/import/all` endpoint (optional)**

- Accepts `AllProfilesExportResponse` format
- Imports all 3 profiles at once
- Does NOT change the active profile unless the request explicitly asks for it

**Commit**: `"feat(prompts): add single-profile JSON import endpoint with full validation"`

---

### Sub-Phase PX.3: Frontend — Export/Import UI

**Test files to write first:**
- `frontend/src/test/components/test_phaseX_export_import.test.tsx` — Tests export/import buttons, file download, file upload

**Task PX.3.1: Add TypeScript types**

File: `frontend/src/types/index.ts`

```typescript
export interface ProfileExportData {
  format: string
  profile_name: string
  exported_at: string
  prompts: Record<string, string>
}

export interface ProfileImportResponse {
  status: string
  profile: string
  imported_steps: number
  source_profile: string
}
```

**Task PX.3.2: Add API client functions**

File: `frontend/src/lib/api.ts`

```typescript
// Fetch a profile's export payload; the caller wraps it in a Blob for browser-side save
export const exportProfile = async (name: string): Promise<ProfileExportData> => {
  const resp = await apiClient.get<ProfileExportData>(`/prompts/profiles/${name}/export`)
  return resp.data
}

// Import a profile from JSON
export const importProfile = async (name: string, data: ProfileExportData): Promise<ProfileImportResponse> => {
  const resp = await apiClient.post<ProfileImportResponse>(`/prompts/profiles/${name}/import`, data)
  return resp.data
}
```

**Task PX.3.3: Add TanStack Query mutation for import**

File: `frontend/src/lib/queries.tsx`

```typescript
export const useImportProfile = () => {
  const queryClient = useQueryClient()
  return useMutation({
    mutationFn: ({ name, data }: { name: string; data: ProfileExportData }) =>
      importProfile(name, data),
    onSuccess: () => {
      queryClient.invalidateQueries({ queryKey: ['prompts'] })
    },
  })
}
```

**Task PX.3.4: Add Export button to ProfileList cards**

File: `frontend/src/components/ProfileList.tsx`

- Add export icon button (e.g., `Download` from lucide-react) next to the "Edit" button on each card
- On click: calls `exportProfile(name)` → serializes the returned data into a JSON `Blob` → triggers browser download via `URL.createObjectURL` + `<a>` click
- Filename: `legco-profile-{name}-{date}.json`

**Task PX.3.5: Add Import button and dialog to SystemPromptsPage**

File: `frontend/src/pages/SystemPromptsPage.tsx`

- Add "Import" button in the top bar (next to "Active Profile" dropdown)
- On click: opens a modal/dialog with:
  - File input (accept `.json`) — hidden `<input type="file">` triggered by styled button
  - After file selected: parse JSON client-side, show preview (source profile name, export date, step count)
  - Target profile selector (dropdown: A, B, C) — defaults to source profile name if valid
  - "Import" button → confirmation dialog ("This will overwrite all prompts for Profile {target}. Continue?")
  - On confirm: calls `importProfileMutation.mutate()`
  - Success: show toast "Profile {target} imported successfully ({n} steps from Profile {source})"
  - Error: show inline error message with details

**Task PX.3.6: Add Export All button (optional)**

File: `frontend/src/pages/SystemPromptsPage.tsx`

- "Export All" button in top bar
- Downloads all 3 profiles as `legco-profiles-{date}.json`

**Commit**: `"feat(prompts): add export/import UI with file download, upload dialog, and validation"`

---

### Sub-Phase PX.4: Testing & Polish

**Test files:**
- `backend/app/test/test_phaseX_export.py` — Export endpoint: valid profile, invalid name, JSON schema validation
- `backend/app/test/test_phaseX_import.py` — Import endpoint: valid import, missing steps, extra steps, invalid format version, invalid target name
- `frontend/src/test/components/test_phaseX_export_import.test.tsx` — Export button click → download, Import dialog flow → file upload → preview → confirm → success/error

**Task PX.4.1: Backend unit tests**

- `test_export_profile_valid` — GET export/A returns all 7 steps with correct format version
- `test_export_profile_invalid_name` — GET export/X returns 400
- `test_export_all` — GET export/all returns 3 profiles, 21 prompts total
- `test_import_valid` — POST import/B with valid JSON → 200, verify all 7 steps updated
- `test_import_overwrites_existing` — POST import/B → verify old content replaced
- `test_import_missing_required_step` — POST import with only 6 steps → 400 with missing key listed
- `test_import_unknown_step_key` — POST import with extra step → 400
- `test_import_invalid_format_version` — POST import with format: "v2" → 400
- `test_import_invalid_target_name` — POST import/X → 400
- `test_import_does_not_change_active` — import into inactive profile → active profile unchanged

**Task PX.4.2: Frontend tests**

- Export button visible on each profile card
- Click export → `exportProfile` called, download triggered
- Import dialog opens on button click
- File selection → JSON parsed, preview shown
- Invalid JSON file → error message shown
- Target profile selector defaults to source profile
- Confirm import → mutation called, success toast
- Import error → inline error message
- Export All downloads all profiles

**Task PX.4.3: Integration verification**

- `npm run build` — no TypeScript errors
- `npm test` — all frontend tests pass
- `pytest backend/app/test/test_phaseX_*.py -v` — all backend tests pass
- Manual flow: export Profile A → edit Profile B → import exported file into B → verify B's prompts match A's original

**Commit**: `"test(prompts): add unit, integration tests for export/import"`

---

### Files Affected — Complete Inventory

#### Backend — New Files
| File | Sub-Phase | Purpose |
|------|-----------|---------|
| `backend/app/test/test_phaseX_export.py` | PX.4 | Unit tests for export endpoint |
| `backend/app/test/test_phaseX_import.py` | PX.4 | Unit tests for import endpoint |

#### Backend — Modified Files
| File | Sub-Phase | Changes |
|------|-----------|---------|
| `backend/app/models/prompts.py` | PX.1, PX.2 | Add `ProfileExportResponse`, `AllProfilesExportResponse`, `ProfileImportRequest`, `ProfileImportResponse` |
| `backend/app/routers/prompts.py` | PX.1, PX.2 | Add `GET /export`, `GET /export/all`, `POST /import` endpoints |

#### Frontend — New Files
| File | Sub-Phase | Purpose |
|------|-----------|---------|
| `frontend/src/test/components/test_phaseX_export_import.test.tsx` | PX.4 | Component tests for export/import UI |

#### Frontend — Modified Files
| File | Sub-Phase | Changes |
|------|-----------|---------|
| `frontend/src/types/index.ts` | PX.3 | Add `ProfileExportData`, `ProfileImportResponse` types |
| `frontend/src/lib/api.ts` | PX.3 | Add `exportProfile()`, `importProfile()` API functions |
| `frontend/src/lib/queries.tsx` | PX.3 | Add `useImportProfile()` mutation hook |
| `frontend/src/components/ProfileList.tsx` | PX.3 | Add Export button per profile card |
| `frontend/src/pages/SystemPromptsPage.tsx` | PX.3 | Add Import/Export All buttons, import dialog/modal |

---

### Acceptance Criteria

#### Backend
- [ ] `GET /api/v1/prompts/profiles/A/export` returns JSON with all 7 steps, correct format version
- [ ] `GET /api/v1/prompts/profiles/X/export` returns 400 (invalid profile name)
- [ ] `GET /api/v1/prompts/export/all` returns all 3 profiles, active profile marker
- [ ] `POST /api/v1/prompts/profiles/B/import` with valid payload overwrites all 7 steps for Profile B
- [ ] Import rejects payload with missing required step keys (400 + key names)
- [ ] Import rejects payload with unknown step keys (400 + key names)
- [ ] Import rejects payload with unknown format version (400)
- [ ] Import does NOT change `is_active` flag on target profile
- [ ] Exported JSON does NOT contain internal DB IDs (`id`/`profile_id`)
- [ ] All existing prompt API endpoints still work unchanged
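The validation rules in the backend criteria above can be sketched in plain Python. This is a minimal sketch, not the planned implementation: the format string, most of the step keys, the payload field names (`format`, `profile`, `steps`), and the row field names (`step`, `prompt`) are assumptions made for illustration. Only `filter` and `generate` are step names taken from this plan.

```python
from typing import Any, Dict, List

# Hypothetical format identifier; the plan does not spell out the real string.
EXPORT_FORMAT = "legco-prompt-profile/v1"

# Seven step keys. Only "filter" and "generate" appear in this plan;
# the remaining names are placeholders.
REQUIRED_STEPS = frozenset(
    {"filter", "generate", "step_3", "step_4", "step_5", "step_6", "step_7"}
)


class ImportValidationError(ValueError):
    """A payload the import endpoint would reject with HTTP 400."""


def build_export(profile_name: str, rows: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Build an export payload from DB-like rows (assumed row shape)."""
    return {
        "format": EXPORT_FORMAT,
        "profile": profile_name,
        # Copy only the step key and prompt text, so internal
        # `id` / `profile_id` columns never reach the exported JSON.
        "steps": {row["step"]: row["prompt"] for row in rows},
    }


def validate_import(payload: Dict[str, Any]) -> Dict[str, str]:
    """Return the validated step prompts, or raise ImportValidationError."""
    if payload.get("format") != EXPORT_FORMAT:
        raise ImportValidationError(
            f"unknown format version: {payload.get('format')!r}"
        )
    steps = payload.get("steps")
    if not isinstance(steps, dict):
        raise ImportValidationError("'steps' must be an object")

    missing = sorted(REQUIRED_STEPS - set(steps))
    unknown = sorted(set(steps) - REQUIRED_STEPS)
    if missing:
        raise ImportValidationError(f"missing step keys: {missing}")
    if unknown:
        raise ImportValidationError(f"unknown step keys: {unknown}")

    non_text = sorted(k for k, v in steps.items() if not isinstance(v, str))
    if non_text:
        raise ImportValidationError(f"non-string prompt for steps: {non_text}")
    return dict(steps)
```

Because the validator never reads or writes an `is_active` field, an import built this way leaves the target profile's active flag untouched. In the real endpoint the same checks would more likely live in the `ProfileImportRequest` Pydantic model.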

#### Frontend
- [ ] Export button visible on each profile card in ProfileList
- [ ] Clicking Export downloads a `.json` file with correct naming (`legco-profile-A-2026-04-27.json`)
- [ ] Import button visible on SystemPromptsPage top bar
- [ ] Clicking Import opens a modal with: file input, JSON preview, target profile selector, confirm button
- [ ] Selecting invalid JSON file shows error message
- [ ] Importing into a valid profile shows success confirmation with step count
- [ ] Import error from backend shows inline error message
- [ ] After successful import, profile data refreshes (query invalidation)
- [ ] All existing System Prompts functionality still works unchanged

---

### Risk Register

| Risk | Likelihood | Impact | Mitigation |
|------|-----------|--------|------------|
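| JSON file too large to upload | Low | Low (7 prompts × ~2KB ≈ 14KB) | Enforce a 1MB limit on the import endpoint, e.g. by checking `Content-Length` and the received body size. Note that FastAPI's `Body(max_length=...)` validates string length and does not cap total request size |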
| User imports into wrong profile by mistake | Medium | Medium — overwrites their existing config | Confirmation dialog with source/target profile names clearly displayed before import |
| Exported file missing legacy `filter`/`generate` steps | Medium | Medium — import would fail validation | Always export all 7 steps (even hidden ones). Import validates all 7 are present. |
| Browser download API differences | Low | Low | Use standard `Blob` + `URL.createObjectURL` approach, tested across Chrome/Firefox |
| Import endpoint receives malformed JSON | Low | Low — Pydantic validation catches this | `ProfileImportRequest` model validates format string, dict keys, value types |
| User exports from one deployment and imports into another with different profile names | Low | Low (only 3 names: A/B/C) | Import targets are restricted to A/B/C. If the source profile was named "D", the user must choose the target manually |
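
The 1MB upload limit in the first risk row could be enforced with a small size check before the JSON is parsed. The sketch below assumes a hypothetical helper name and signature, and uses only the standard library so it stays independent of how the endpoint is eventually wired.

```python
from typing import Optional

# 1MB cap from the risk register; the constant name is an assumption.
MAX_IMPORT_BYTES = 1 * 1024 * 1024


def exceeds_import_limit(
    declared_length: Optional[int],
    body: bytes,
    limit: int = MAX_IMPORT_BYTES,
) -> bool:
    """Return True when an import upload should be rejected as too large.

    `declared_length` is the parsed Content-Length header, or None if absent.
    Both values are checked because the header can be missing or understated.
    """
    if declared_length is not None and declared_length > limit:
        return True
    return len(body) > limit
```

In a FastAPI route this might be called from a dependency that reads `request.headers.get("content-length")` and responds with HTTP 413 when the helper returns `True`. That wiring is an assumption and is not specified in this plan.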

---

### New Dependencies

None. All changes use existing libraries (FastAPI, Pydantic, React, TanStack Query, lucide-react icons).

---

### Implementation Sequence

```
PX.1 (Backend Export) ──► PX.2 (Backend Import)
                              │
                              ▼
                         PX.3 (Frontend UI)
                              │
                              ▼
                         PX.4 (Testing)
```

PX.1 and PX.2 can be done together (both in `routers/prompts.py`). PX.3 depends on knowing the exact API contracts from PX.1/PX.2. PX.4 runs after everything is wired.