# Package 7 Enhancement Plan: Response Highlighting & Compact Citations

**Source**: User request (2026-05-15)
**Scope**: Two enhancements to the final RAG response: (1) yellow-highlight figures/dates in the answer using LLM-identified terms, (2) replace verbose `[filename.pdf, page N]` citations with compact sequentially-numbered `[1] [2] [3]` clickable links.
**Status**: Draft

---

## Objective

### Feature 1: Figure & Date Highlighting
Ask the LLM to identify monetary figures, percentages, statistics, and dates in the final response, so that the frontend can wrap them in yellow `<mark>` highlights. As originally requested, the LLM would return these terms as a separate list and the frontend would search the rendered answer for matches; the Target State section compares that with inline markers and adopts the inline approach.

### Feature 2: Compact Citation Links
Replace the current inline citation display `[document_file_name.pdf, page N]` with compact sequentially-numbered links: `[1]`, `[2]`, `[3]`, etc. The full source details remain visible in the collapsible source cards below each sub-question section.

**Non-goals**: The underlying citation URL logic (PDF viewer vs highlight page routing) remains unchanged. Source cards are not modified.

---

## Current State

### Response Generation Flow (Backend)
```
RAGService.generate_response_per_subquestion()    # rag.py:174-273
  ├── Builds context per sub-q with [filename, page N] labels
  ├── Renders prompt template (generate_per_subq)
  ├── Calls LLM (temperature=0.3)
  └── Returns (answer, prompt, grouped_sources_meta)

query.py _query_stream()                          # query.py:290-385
  ├── Yield {"phase": "generating"}
  ├── Yield {"phase": "generating_subquestion", ...} per sub-q
  └── Yield {"phase": "completed", answer, sub_question_sources, sources, history_id}
```

### Current Citation Display (Frontend)
```
LLM returns: "...according to the report [NEC4 ACC.pdf, page 3]..."
    ↓
citationParser.processCitations()    # citationParser.ts:133-138
    → Regex: /(?…
    → Output: [trimmed](url) markdown link
    ↓
ReactMarkdown renders the link as an <a> tag via CitationLink component
```

### Current Markdown Rendering
- Library: `react-markdown` v10.1.0
- Custom components: `{ a: CitationLink }` only (ResponsePanel.tsx:122)
- Styling: Tailwind `prose prose-sm max-w-none text-gray-800`

---

## Target State

### Feature 1: Figure & Date Highlighting

Two viable approaches identified:

| | Approach A: Inline Markers | Approach B: Separate Highlights List |
|---|---|---|
| **LLM output** | `==HK$1,000,000==` embedded in answer | Answer text + `---HIGHLIGHTS---\n["HK$1,000,000", ...]` |
| **Backend parsing** | None needed | Split on marker, parse JSON list |
| **SSE change** | None | Add `highlights: string[]` to `completed` event |
| **Frontend work** | Regex `==term==` → `<mark>term</mark>` | Receive list, search-and-wrap in answer text |
| **Reliability** | High: markers are simple, no structured parsing | Medium: LLM may produce malformed JSON |
| **Extensibility** | Harder to add metadata (e.g., highlight type) | Easy to add per-term metadata later |

**Decision**: Use Approach A (inline `==term==` markers) for simplicity and reliability. No backend code changes are needed to parse the LLM response; only the prompt template and the frontend rendering change.

```
LLM prompt (updated generate_per_subq template):
  "For monetary figures (e.g., HK$1,000,000), percentages (e.g., 35%), statistics, and dates (e.g., 1 January 2024, 2023), wrap them in ==double equals markers== for highlighting."

LLM answer:
  "The budget was ==HK$1,000,000== (a ==35%== increase from ==2023==)..."
    ↓
Frontend pre-processing (new highlightTerms() function):
    → Regex: /==(.+?)==/g
    → Replace: ==term== → <mark>term</mark>
    ↓
ReactMarkdown with custom mark component:
    → <mark>HK$1,000,000</mark>
```

### Feature 2: Compact Citation Links

```
Current:  "...according to the report [NEC4 ACC.pdf, page 3]..."
Desired:  "...according to the report [1] and further noted in [2]..."

Implementation:
  In replaceCitationPatterns() (citationParser.ts:105-131), add a closure counter.
  Each matched citation gets a sequential number:
    [trimmed](url) → [1](url), [2](url), [3](url) ...
```

The citation URL, source lookup, and "View PDF" source cards are unchanged. Each `[N]` label is clickable and opens the same linked page as before.

---

## Decision Register

| # | Decision | Rationale |
|---|----------|-----------|
| 1 | **Inline `==term==` markers** (not separate highlights list) | LLMs reliably produce inline markup. Inline markers preserve the relationship between highlighted terms and their surrounding context (which citation they belong to, which bullet point). No backend parsing, no SSE schema change. Simple regex on frontend: `==(.+?)==` → `<mark>$1</mark>`. |
| 2 | **Prompt-only change for LLM** (no backend code change for parsing) | The highlight instruction is added to the `generate_per_subq` prompt template only. `rag.py` requires zero changes: the `==term==` markers are part of the answer string and transparent to existing code. |
| 3 | **Custom `<mark>` component in ReactMarkdown** (not raw HTML injection) | Mapping the element via `components={{ mark: HighlightMark }}` is the idiomatic React approach. Note that `ReactMarkdown` does not render raw HTML by default, so the `<mark>` tags emitted by `highlightTerms()` only reach the custom component if raw HTML parsing is enabled (for example with the `rehype-raw` plugin); this should be confirmed in Task 7.4. Styling via Tailwind classes: `bg-yellow-200 rounded px-0.5`. |
| 4 | **`==term==` syntax choice** | `==...==` is used in several wiki/markdown dialects for highlighting (Obsidian, Markdown-it-mark). It is visually distinct from `**bold**`, `*italic*`, and `~~strikethrough~~`, so the risk of colliding with existing markdown in LLM output is low. |
| 5 | **Sequential numbering `[1]` `[2]` `[3]`** (not a single `[ref]` label) | User requested sequential numbering for better visual clarity. Each citation in the answer gets a unique number (`[1]`, `[2]`, `[3]`...), making it easy to distinguish multiple references at a glance. The source cards below still provide full details. Implemented via a closure counter in `replaceCitationPatterns()`. |
| 6 | **Pure frontend change for compact citations** (no backend changes) | Citations are parsed purely on the frontend in `citationParser.ts`. The LLM still produces `[filename, page N]`; the frontend converts these to sequential `[1]`, `[2]`, `[3]` during `replaceCitationPatterns()`. Backend is untouched. |
| 7 | **`processCitations` before highlight preprocessing** | Run citation processing first (convert `[filename]` to markdown links), then highlight preprocessing (convert `==term==` to `<mark>`). This order ensures that `==` markers do not interfere with the citation regex, and that citation brackets do not interfere with the highlight regex. |
| 8 | **Seed template update only** (not database migration) | The `_SEED_GENERATE_PER_SUBQ` template in `sqlite_db.py` is updated. Existing databases will NOT be auto-migrated: users must reset prompts or manually update via the API. This matches the existing pattern (all prompt changes are seed-only). |
| 9 | **Yellow highlight color: `bg-yellow-200`** | Tailwind's `yellow-200` (`#FEF08A`) provides a soft, readable yellow that works on light backgrounds and is distinct from the `text-blue-600` citation links. Add `rounded px-0.5` for visual polish. |

---

## Files to Create

| # | File | Purpose |
|---|------|---------|
| F1 | `frontend/src/test/utils/highlightParser.test.ts` | Unit tests for `highlightTerms()` function |
| F2 | `frontend/src/test/utils/citationCompactRef.test.ts` | Unit tests for compact sequential `[1] [2] [3]` citation format |
| F3 | `frontend/src/test/components/ResponsePanel_highlights.test.tsx` | Integration: ResponsePanel renders highlights and compact refs |

---

## Files to Modify

| # | File | Change |
|---|------|--------|
| M1 | `backend/app/core/sqlite_db.py` | Update `_SEED_GENERATE_PER_SUBQ` template (lines 42–53): add `==term==` instruction for figures/dates |
| M2 | `frontend/src/utils/citationParser.ts` | (a) Add closure counter in `replaceCitationPatterns()` to output `[1](url)`, `[2](url)` instead of `[trimmed](url)`. (b) Add `highlightTerms(markdown: string): string` function: regex `==(.+?)==` → `<mark>$1</mark>` |
| M3 | `frontend/src/components/ResponsePanel.tsx` | (a) Add `HighlightMark` component. (b) Add `highlightTerms()` call in `SubQuestionSection` and `FlatResponse` before ReactMarkdown. (c) Add `mark` to ReactMarkdown `components`. |
| M4 | `frontend/src/styles.css` | Add `.prose mark { background-color: #FEF08A; border-radius: 0.125rem; padding: 0 0.125rem; }` |
| M5 | `frontend/src/test/utils/citationParser.test.ts` | Update existing citation tests to expect sequential `[1](url)`, `[2](url)` output format |

---

## Implementation Tasks

### Task 7.1: Update prompt template (seed)

- [ ] In `backend/app/core/sqlite_db.py`, update `_SEED_GENERATE_PER_SUBQ` (lines 42–53):
  ```python
  _SEED_GENERATE_PER_SUBQ = (
      "Answer each sub-question using ONLY its document chunks.\n"
      "Format as markdown sections with ## Sub-question N: headers.\n"
      "Under each ## heading, write your answer as bullet points using - (one bullet point per finding or fact).\n"
      "Cite your sources inline using bracket labels, e.g. [filename, page N].\n"
      "Copy the exact bracket labels shown in the document chunks — do not modify filenames or add/remove extensions.\n"
      "Place the citation at the end of each bullet point.\n"
      # New instruction: ask the LLM to mark figures and dates for highlighting
      "For monetary figures (e.g., HK$1,000,000), percentages (e.g., 35%), key statistics, "
      "and dates (e.g., 1 January 2024, 2023), wrap them in ==double equals markers== for highlighting.\n"
      "{context_sections}\n\n"
      "Answer:"
  )
  ```
- [ ] No migration needed: the template is seeded on first DB init. Existing DBs must reset prompts or update via the API.
- **No separate test file**: verified via existing prompt service tests plus the acceptance test.
### Task 7.2: Add `highlightTerms()` to citationParser

- [ ] In `frontend/src/utils/citationParser.ts`, add:
  ```typescript
  export function highlightTerms(markdown: string): string {
    // Split on fenced code blocks (```...```) and inline code (`...`); the capture
    // group keeps those segments in the array at odd indices so they stay untouched.
    const parts = markdown.split(/(```[\s\S]*?```|`[^`\n]*`)/g)
    // Convert ==term== to <mark>term</mark> in the non-code segments only
    return parts
      .map((part, i) => (i % 2 === 1 ? part : part.replace(/==(.+?)==/g, '<mark>$1</mark>')))
      .join('')
  }
  ```
- [ ] Export from `citationParser.ts` (add to existing exports)
- [ ] **Test file**: `frontend/src/test/utils/highlightParser.test.ts`

### Task 7.3: Change citation output to sequential `[1] [2] [3]`

- [ ] In `frontend/src/utils/citationParser.ts`, `replaceCitationPatterns()` (lines 105–131):
  - Add a `let refCounter = 0` before the `.replace()` call
  - Increment the counter on each matched citation and output `[${refCounter}](${url})`:
  ```typescript
  // Body of replaceCitationPatterns(text, lookup, highlightKeys).
  // The function signature and the existing citationPattern regex are unchanged.
  let refCounter = 0 // new: counts resolved citations within this call
  return text.replace(citationPattern, (fullMatch, content) => {
    const trimmed = content.trim()
    const source = findSource(trimmed, lookup)
    if (source) {
      let isReady = false
      if (highlightKeys && source.document_id && source.sub_question_text) {
        isReady = highlightKeys.has(
          `${source.document_id}_${source.chunk_index}_${encodeURIComponent(source.sub_question_text)}`
        )
      }
      const url = buildCitationUrl(source, isReady)
      if (url) {
        refCounter++
        return `[${refCounter}](${url})` // was: `[${trimmed}](${url})`
      }
    }
    return fullMatch // unresolved citations are left as written
  })
  ```
- [ ] Update existing tests in `frontend/src/test/utils/citationParser.test.ts` to expect sequential `[1](url)`, `[2](url)` output
- [ ] **Test file**: `frontend/src/test/utils/citationCompactRef.test.ts` (optional, since the existing tests cover this after the update)

### Task 7.4: Wire highlighting into ResponsePanel

- [ ] In `frontend/src/components/ResponsePanel.tsx`:
  - Add `HighlightMark` component:
    ```tsx
    const HighlightMark = ({ children }: { children?: React.ReactNode }) => (
      <mark className="bg-yellow-200 rounded px-0.5">{children}</mark>
    )
    ```
  - In `SubQuestionSection` (line ~109), add `highlightTerms()` after `processCitations()`:
    ```tsx
    const processedAnswer = processCitations(answerSection, allSources, highlightReadyKeys)
    const highlightedAnswer = highlightTerms(processedAnswer)
    ```
  - In `FlatResponse` (line ~455), same pattern
  - Update `ReactMarkdown` components to include `mark`:
    ```tsx
    <ReactMarkdown components={{ a: CitationLink, mark: HighlightMark }}>
      {highlightedAnswer}
    </ReactMarkdown>
    ```
  - Confirm that the `<mark>` tags actually reach `HighlightMark`: `react-markdown` does not parse raw HTML by default, so this may require enabling raw HTML parsing (for example with the `rehype-raw` plugin)
- [ ] Add `import { highlightTerms } from '../utils/citationParser'` to imports
- [ ] **Test file**: `frontend/src/test/components/ResponsePanel_highlights.test.tsx`

### Task 7.5: Add CSS for mark styling

- [ ] In `frontend/src/styles.css`, add:
  ```css
  .prose mark {
    background-color: #FEF08A; /* yellow-200 */
    border-radius: 0.125rem;
    padding: 0 0.125rem;
  }
  ```
- **No separate test file**: verified via visual inspection and ResponsePanel integration tests.

### Task 7.6: Run tests & verify

- [ ] Run frontend tests: `cd frontend && pnpm test`
- [ ] Run backend tests: `cd backend && pytest app/test/ -v` (no regressions)
- [ ] Verify existing citation tests pass with sequential `[1](url)` `[2](url)` output format
- [ ] Verify new highlight tests pass
- [ ] Manual visual test: ask a question with figures/dates, verify yellow highlights appear and sequential `[1] [2]` links work

---

## Test Files Summary

| # | Test File | Type | Coverage |
|---|-----------|------|----------|
| T7.2 | `frontend/src/test/utils/highlightParser.test.ts` | Unit | `highlightTerms()`: basic `==term==` → `<mark>`, multiple highlights, no false positives on `==` in code, edge cases (empty, no markers, adjacent markers) |
| T7.3 | `frontend/src/test/utils/citationParser.test.ts` (update) | Unit | Existing 16 tests updated to expect sequential `[1](url)`, `[2](url)` output. Add test: multiple citations render as `[1]`, `[2]`, `[3]` |
| T7.4 | `frontend/src/test/components/ResponsePanel_highlights.test.tsx` | Integration | Full `ResponsePanel` with mock answer containing `==figure==` markers and `[citation]` brackets: verifies yellow `<mark>` elements render, `[1] [2]` links are clickable, source cards unchanged |

---

## Dependency Graph

```
Task 7.1 (update seed template)
    │
    │  (independent: prompt template change tested implicitly)
    │
Task 7.2 (highlightTerms function)     Task 7.3 (sequential citation refs)
    │                                      │
    └──────────────┬───────────────────────┘
                   │
         Task 7.4 (wire into ResponsePanel)
                   │
         Task 7.5 (CSS styling)
                   │
         Task 7.6 (run tests & verify)
```

Tasks 7.2 and 7.3 are independent and can run in parallel.

---

## Acceptance Criteria

### Feature 1: Highlighting
- [ ] LLM prompt template includes `==term==` instruction for figures, percentages, statistics, and dates
- [ ] `highlightTerms()` correctly converts all `==term==` patterns to `<mark>term</mark>`
- [ ] Yellow highlights render in the browser for monetary amounts (e.g., `HK$1,000,000`), percentages (e.g., `35%`), and dates (e.g., `1 January 2024`)
- [ ] Highlights do NOT appear inside code blocks or inline code
- [ ] Highlights work correctly alongside citation links (`[1] [2] [3]`)
- [ ] Highlights work in both sub-question mode and flat response mode
- [ ] No regressions in existing tests

### Feature 2: Compact Citations
- [ ] All inline citations display as sequential `[1]`, `[2]`, `[3]` instead of `[filename.pdf, page N]`
- [ ] Sequential numbers increment correctly within each answer section and reset for each sub-question section
- [ ] `[1]` `[2]` links are clickable and navigate to the correct PDF viewer or highlight page
- [ ] Source cards below each section still show full filename, page, date, and summary
- [ ] Existing citation tests pass with updated sequential `[1](url)` output format
- [ ] No regressions in existing tests

---

## Rollback Plan

### Feature 1 (Highlighting):
1. Remove the `==term==` instruction line from the `_SEED_GENERATE_PER_SUBQ` seed template in `sqlite_db.py`
2. Remove `highlightTerms()` call from `ResponsePanel.tsx` (revert to passing `processedAnswer` directly)
3. Remove `HighlightMark` component and `mark` from ReactMarkdown components
4. Remove CSS rule for `.prose mark`
5. `highlightTerms()` function can remain in `citationParser.ts` (no harm)

### Feature 2 (Compact Citations):
1. Revert `citationParser.ts` line 125 from `[${refCounter}](${url})` back to `[${trimmed}](${url})` and remove the counter
2. Update test expectations back to full citation text

The two features are independent: either can be rolled back without affecting the other.

---

## Non-Goals (explicitly out of scope)

- ❌ Do NOT change the LLM response parsing in `rag.py` (no backend code changes for highlighting)
- ❌ Do NOT change the SSE event schema (no new fields in `completed` event)
- ❌ Do NOT change the citation URL routing logic (`buildCitationUrl()` stays as-is)
- ❌ Do NOT modify source cards (`SubQuestionSourceCard`); they still show full details
- ❌ Do NOT add tooltips or popovers on `[N]` links (future enhancement)
- ❌ Do NOT add per-term highlight metadata (type: figure vs date, color coding)
- ❌ Do NOT add configuration UI for highlight colors
- ❌ Do NOT modify the non-sub-question fallback `generate_response()` (legacy flat mode; highlight markers work from the prompt template alone)

---

## Reference: Key Code Locations

| What | Where | Lines |
|------|-------|-------|
| Response generation | `backend/app/services/rag.py` | 174–273 |
| SSE completed event | `backend/app/routers/query.py` | 379–385 |
| Prompt seed template | `backend/app/core/sqlite_db.py` | 42–53 |
| Citation → link parser | `frontend/src/utils/citationParser.ts` | 105–131 |
| Response rendering | `frontend/src/components/ResponsePanel.tsx` | 89–158, 338–517 |
| ReactMarkdown usage | `frontend/src/components/ResponsePanel.tsx` | 122–125, 456–459 |
| Frontend SSE handler | `frontend/src/lib/queries.tsx` | 51–89 |
| Frontend types | `frontend/src/types/index.ts` | 1–37 |
| Custom CSS | `frontend/src/styles.css` | 1–12 |
| Citation parser tests | `frontend/src/test/utils/citationParser.test.ts` | 1–242 |
| ResponsePanel tests | `frontend/src/test/components/ResponsePanel.test.tsx` | 1–366 |
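
## Appendix: Sketch of the Two String Transforms

The plan relies on two string transforms applied in a fixed order: sequential citation numbering first, then highlight conversion. The following is a minimal, self-contained sketch of that pipeline, intended for experimenting with inputs outside the app. It is not the code in `citationParser.ts`: `numberCitations`, `renderAnswer`, the `CitationLookup` type, the bracket regex, and the example URL are all assumptions made for illustration, and the real parser resolves sources through `findSource()` and `buildCitationUrl()` instead of a plain map.

```typescript
// Hypothetical stand-in for the real source lookup: citation label -> URL.
type CitationLookup = Map<string, string>

// Closure-counter numbering (Task 7.3). The bracket pattern below is an
// assumption for this sketch, not the pattern used in citationParser.ts.
function numberCitations(text: string, lookup: CitationLookup): string {
  let refCounter = 0 // starts at 0 on every call, so numbering resets per section
  return text.replace(/\[([^\[\]]+)\](?!\()/g, (fullMatch: string, content: string) => {
    const url = lookup.get(content.trim())
    if (!url) return fullMatch // unresolved labels stay as written
    refCounter++
    return `[${refCounter}](${url})`
  })
}

// Highlight conversion (Task 7.2), skipping fenced and inline code.
function highlightTerms(markdown: string): string {
  // The capture group makes split() keep code segments at odd indices.
  const parts = markdown.split(/(```[\s\S]*?```|`[^`\n]*`)/g)
  return parts
    .map((part, i) => (i % 2 === 1 ? part : part.replace(/==(.+?)==/g, '<mark>$1</mark>')))
    .join('')
}

// Ordering from the Decision Register: citations first, then highlights.
function renderAnswer(answer: string, lookup: CitationLookup): string {
  return highlightTerms(numberCitations(answer, lookup))
}

const demoLookup: CitationLookup = new Map([
  ['NEC4 ACC.pdf, page 3', '/viewer?doc=nec4&page=3'], // made-up URL for illustration
])
console.log(
  renderAnswer('- The budget was ==HK$1,000,000== [NEC4 ACC.pdf, page 3]', demoLookup)
)
```

Because the counter lives inside the function call, numbering restarts at `[1]` each time `numberCitations` is invoked, which corresponds to the "reset per sub-question section" acceptance criterion as long as each section is processed separately. A citation whose label cannot be resolved is left untouched and does not consume a number.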