Package 7 Enhancement Plan — Response Highlighting & Compact Citations

Source: User request (2026-05-15)
Scope: Two enhancements to the final RAG response: (1) yellow-highlight figures/dates in the answer using LLM-identified terms; (2) replace verbose [filename.pdf, page N] citations with compact, sequentially numbered, clickable [1] [2] [3] links.
Status: Draft

Objective

Feature 1: Figure & Date Highlighting

Ask the LLM to wrap monetary figures, percentages, statistics, and dates in the final response in inline ==term== markers. The frontend then converts each marked term into a yellow <mark> highlight when rendering the answer.

Feature 2: Compact Citation Links

Replace the current inline citation display [document_file_name.pdf, page N] with compact sequentially-numbered links: [1], [2], [3], etc. The full source details remain visible in the collapsible source cards below each sub-question section.

Non-goals: The underlying citation URL logic (PDF viewer vs highlight page routing) remains unchanged. Source cards are not modified.

Current State

Response Generation Flow (Backend)

RAGService.generate_response_per_subquestion()        # rag.py:174-273
  ├── Builds context per sub-q with [filename, page N] labels
  ├── Renders prompt template (generate_per_subq)
  ├── Calls LLM (temperature=0.3)
  └── Returns (answer, prompt, grouped_sources_meta)

query.py _query_stream()                               # query.py:290-385
  ├── Yield {"phase": "generating"}
  ├── Yield {"phase": "generating_subquestion", ...} per sub-q
  └── Yield {"phase": "completed", answer, sub_question_sources, sources, history_id}

Current Citation Display (Frontend)

LLM returns:  "...according to the report [NEC4 ACC.pdf, page 3]..."
                   ↓
citationParser.processCitations()                      # citationParser.ts:133-138
  → Regex: /(?<!!)\[([^\]]+)\](?!\()/g
  → Lookup source by filename in buildCitationLookup()
  → Replace: [filename, page N] → [filename, page N](pdf-viewer-url)
                   ↓
ReactMarkdown renders [filename, page N] as clickable <a> tag via CitationLink component
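To make the regex above concrete, here is a minimal sketch of what it accepts and rejects. The helper name `findCitationLabels` and the sample string are assumptions for illustration only; the real parser does the lookup and replacement inside `citationParser.ts`.

```typescript
// Hypothetical helper: returns the inner text of every bracket the citation regex accepts.
function findCitationLabels(text: string): string[] {
  // Same pattern as quoted above: skip image syntax (![...]) and existing links ([...](...)).
  const pattern = /(?<!!)\[([^\]]+)\](?!\()/g
  const labels: string[] = []
  let m: RegExpExecArray | null
  while ((m = pattern.exec(text)) !== null) {
    labels.push(m[1].trim())
  }
  return labels
}

const sample =
  'Stated in the report [NEC4 ACC.pdf, page 3], next to ![logo] and [docs](https://example.com).'

// Only the bare bracket label is picked up; the image and the existing link are left alone.
console.log(findCitationLabels(sample))
```

The negative lookbehind and lookahead are what keep the parser from rewriting markdown that is already an image or a link.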

Current Markdown Rendering

Library: react-markdown v10.1.0
Custom components: { a: CitationLink } only (ResponsePanel.tsx:122)
Styling: Tailwind prose prose-sm max-w-none text-gray-800

Target State

Feature 1: Figure & Date Highlighting

Two viable approaches identified:

Aspect	Approach A: Inline Markers	Approach B: Separate Highlights List
LLM output	`==HK$1,000,000==` embedded in answer	Answer text + `---HIGHLIGHTS---\n["HK$1,000,000", ...]`
Backend parsing	None needed	Split on marker, parse JSON list
SSE change	None	Add `highlights: string[]` to `completed` event
Frontend work	Regex `==term==` → `<mark>term</mark>`	Receive list, search-and-wrap in answer text
Reliability	High — markers are simple, no structured parsing	Medium — LLM may malformat JSON
Extensibility	Harder to add metadata (e.g., highlight type)	Easy to add per-term metadata later

Decision: Use Approach A (inline ==term== markers) for simplicity and reliability. No backend changes needed for the LLM response parsing — only prompt template + frontend rendering.
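For comparison, the parsing step that the rejected Approach B would have needed might look roughly like the sketch below. `splitHighlights`, `ParsedAnswer`, and the fallback behaviour are hypothetical; the plan only specifies the `---HIGHLIGHTS---` marker followed by a JSON list.

```typescript
// Hypothetical parser for Approach B; not part of the chosen design.
const HIGHLIGHTS_MARKER = '---HIGHLIGHTS---'

interface ParsedAnswer {
  answer: string
  highlights: string[]
}

function splitHighlights(raw: string): ParsedAnswer {
  const idx = raw.indexOf(HIGHLIGHTS_MARKER)
  if (idx === -1) return { answer: raw.trim(), highlights: [] }

  const answer = raw.slice(0, idx).trim()
  const tail = raw.slice(idx + HIGHLIGHTS_MARKER.length).trim()
  try {
    const parsed: unknown = JSON.parse(tail)
    // Keep only string entries; anything else is treated as malformed output.
    const highlights = Array.isArray(parsed)
      ? parsed.filter((t): t is string => typeof t === 'string')
      : []
    return { answer, highlights }
  } catch {
    // Malformed JSON from the LLM: keep the answer, drop the highlights.
    return { answer, highlights: [] }
  }
}
```

The `try`/`catch` branch is the "LLM may malformat JSON" risk from the table: when it triggers, the user silently gets an answer with no highlights, which is the failure mode the inline markers avoid.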

LLM prompt (updated generate_per_subq template):
  "For monetary figures (e.g., HK$1,000,000), percentages (e.g., 35%),
   statistics, and dates (e.g., 1 January 2024, 2023), wrap them in
   ==double equals markers== for highlighting."

LLM answer:  "The budget was ==HK$1,000,000== (a ==35%== increase from ==2023==)..."
                   ↓
Frontend pre-processing (new highlightTerms() function):
  → Regex: /==(.+?)==/g
  → Replace: ==term== → <mark class="bg-yellow-200 rounded px-0.5">term</mark>
                   ↓
ReactMarkdown with custom mark component:
  → <mark class="bg-yellow-200 rounded px-0.5">HK$1,000,000</mark>

Feature 2: Compact Citation Links

Current:  "...according to the report [NEC4 ACC.pdf, page 3]..."
Desired:   "...according to the report [1] and further noted in [2]..."

Implementation: In replaceCitationPatterns() (citationParser.ts:105-131),
add a closure counter. Each matched citation gets a sequential number:
  [trimmed](url) → [1](url), [2](url), [3](url) ...

The citation URL, source lookup, and "View PDF" source cards are unchanged. Each [N] label is clickable and opens the same linked page as before.

Decision Register

#	Decision	Rationale
1	Inline `==term==` markers (not separate highlights list)	LLMs reliably produce inline markup. Inline markers preserve the relationship between highlighted terms and their surrounding context (which citation they belong to, which bullet point). No backend parsing, no SSE schema change. Simple regex on frontend: `==(.+?)==` → `<mark>$1</mark>`.
2	Prompt-only change for LLM (no backend code change for parsing)	The highlight instruction is added to the `generate_per_subq` prompt template only. `rag.py` requires zero changes — the `==term==` markers are part of the answer string and transparent to existing code.
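3	Custom `<mark>` component in ReactMarkdown (not raw HTML injection)	`ReactMarkdown` does not render raw HTML in the markdown source by default; it shows it as escaped text. The `<mark>` tags emitted by `highlightTerms()` therefore only reach `components={{ mark: HighlightMark }}` when raw HTML parsing is enabled, for example with the `rehype-raw` plugin (a new dependency, ideally paired with sanitization since the HTML originates from LLM output). The component mapping keeps styling in one place via Tailwind classes: `bg-yellow-200 rounded px-0.5`.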
4	`==term==` syntax choice	`==...==` is used in many wiki/markdown dialects for highlighting (Obsidian, Markdown-it-mark). It's visually distinct from `**bold**`, `*italic*`, and `~~strikethrough~~`. Low risk of colliding with existing markdown in LLM output; the known edge cases are `==` comparisons inside code and long `=====` setext heading underlines.
5	Sequential numbering `[1]` `[2]` `[3]` (not a single `[ref]` label)	User requested sequential numbering for better visual clarity. Each citation in the answer gets a unique number (`[1]`, `[2]`, `[3]`...), making it easy to distinguish multiple references at a glance. The source cards below still provide full details. Implemented via a closure counter in `replaceCitationPatterns()`.
6	Pure frontend change for compact citations (no backend changes)	Citations are parsed purely on the frontend in `citationParser.ts`. The LLM still produces `[filename, page N]` — the frontend converts to sequential `[1]`, `[2]`, `[3]` during `replaceCitationPatterns()`. Backend is untouched.
7	`processCitations` before highlight preprocessing	Run citation processing first (convert `[filename]` to markdown links), then highlight preprocessing (convert `==term==` to `<mark>`). This order ensures `==` markers inside citation brackets don't interfere with citation regex and vice versa.
8	Seed template update only (not database migration)	The `_SEED_GENERATE_PER_SUBQ` template in `sqlite_db.py` is updated. Existing databases will NOT be auto-migrated — users must reset prompts or manually update via the API. This matches the existing pattern (all prompt changes are seed-only).
9	Yellow highlight color: `bg-yellow-200`	Tailwind's `yellow-200` (`#FEF08A`) provides a soft, readable yellow that works on light backgrounds and is distinct from the `text-blue-600` citation links. Add `rounded px-0.5` for visual polish.

Files to Create

#	File	Purpose
F1	`frontend/src/test/utils/highlightParser.test.ts`	Unit tests for `highlightTerms()` function
F2	`frontend/src/test/utils/citationCompactRef.test.ts`	Unit tests for compact sequential `[1] [2] [3]` citation format
F3	`frontend/src/test/components/ResponsePanel_highlights.test.tsx`	Integration: ResponsePanel renders highlights and compact refs

Files to Modify

#	File	Change
M1	`backend/app/core/sqlite_db.py`	Update `_SEED_GENERATE_PER_SUBQ` template (lines 42–53): add `==term==` instruction for figures/dates
M2	`frontend/src/utils/citationParser.ts`	(a) Add closure counter in `replaceCitationPatterns()` to output `[1](url)`, `[2](url)` instead of `[trimmed](url)`. (b) Add `highlightTerms(markdown: string): string` function: regex `==(.+?)==` → `<mark>$1</mark>`
M3	`frontend/src/components/ResponsePanel.tsx`	(a) Add `HighlightMark` component. (b) Add `highlightTerms()` call in `SubQuestionSection` and `FlatResponse` before ReactMarkdown. (c) Add `mark` to ReactMarkdown `components`.
M4	`frontend/src/styles.css`	Add `.prose mark { background-color: #FEF08A; border-radius: 0.25rem; padding: 0 0.125rem; }` (radius matches Tailwind `rounded`, padding matches `px-0.5`)
M5	`frontend/src/test/utils/citationParser.test.ts`	Update existing citation tests to expect sequential `[1](url)`, `[2](url)` output format

Implementation Tasks

Task 7.1: Update prompt template (seed)

In backend/app/core/sqlite_db.py, update _SEED_GENERATE_PER_SUBQ (lines 42–53):

_SEED_GENERATE_PER_SUBQ = (
    "Answer each sub-question using ONLY its document chunks.\n"
    "Format as markdown sections with ## Sub-question N: headers.\n"
    "Under each ## heading, write your answer as bullet points using - (one bullet point per finding or fact).\n"
    "Cite your sources inline using bracket labels, e.g. [filename, page N].\n"
    "Copy the exact bracket labels shown in the document chunks — do not modify filenames or add/remove extensions.\n"
    "Place the citation at the end of each bullet point.\n"
    "For monetary figures (e.g., HK$1,000,000), percentages (e.g., 35%), key statistics, "
    "and dates (e.g., 1 January 2024, 2023), wrap them in ==double equals markers== for highlighting.\n"
    "{context_sections}\n\n"
    "Answer:"
)

No migration needed — template is seeded on first DB init. Existing DBs must reset prompts or update via API.
No separate test file — verified via existing prompt service tests + acceptance test.

Task 7.2: Add `highlightTerms()` to citationParser

In frontend/src/utils/citationParser.ts, add:

export function highlightTerms(markdown: string): string {
  // Fenced blocks (```...```) and inline code (`...`) are matched first and returned
  // unchanged, so only ==term== markers outside code are converted to <mark>term</mark>.
  return markdown.replace(
    /(```[\s\S]*?```|`[^`\n]*`)|==(.+?)==/g,
    (match: string, code?: string, term?: string) =>
      code !== undefined ? match : `<mark class="bg-yellow-200 rounded px-0.5">${term}</mark>`
  )
}

Export from citationParser.ts (add to existing exports)
Test file: frontend/src/test/utils/highlightParser.test.ts
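The cases planned for the unit tests can be sketched as a small table of inputs and expected outputs. `highlightTermsSketch` is a stand-in defined here so the example runs on its own (in the app the function would be imported from `citationParser.ts`); it is assumed to leave code spans untouched, as the acceptance criteria require, and it omits the Tailwind classes for brevity.

```typescript
// Stand-in for highlightTerms(): code spans are matched first and returned unchanged.
function highlightTermsSketch(markdown: string): string {
  return markdown.replace(
    /(```[\s\S]*?```|`[^`\n]*`)|==(.+?)==/g,
    (match: string, code?: string, term?: string) =>
      code !== undefined ? match : `<mark>${term}</mark>`
  )
}

// [input, expected output] pairs mirroring the planned test coverage.
const cases: Array<[string, string]> = [
  ['Budget was ==HK$1,000,000==.', 'Budget was <mark>HK$1,000,000</mark>.'], // basic
  ['==35%== since ==2023==', '<mark>35%</mark> since <mark>2023</mark>'], // multiple
  ['==a====b==', '<mark>a</mark><mark>b</mark>'], // adjacent markers
  ['no markers here', 'no markers here'], // no markers
  ['', ''], // empty input
  ['inline `a == b == c` stays', 'inline `a == b == c` stays'], // == inside inline code
]

for (const [input, expected] of cases) {
  const actual = highlightTermsSketch(input)
  console.log(actual === expected ? 'ok  ' : 'FAIL', JSON.stringify(input))
}
```

Using a replacer function rather than a `$1` replacement string also means a term such as `HK$1,000,000` is inserted literally, with no special treatment of the `$` character.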

Task 7.3: Change citation output to sequential `[1] [2] [3]`

In frontend/src/utils/citationParser.ts, replaceCitationPatterns() (lines 105–131):

Add a let refCounter = 0 before the .replace() call
Increment counter on each matched citation and output [${refCounter}](${url}):

function replaceCitationPatterns(
  text: string,
  lookup: Map<string, SourceMetadata>,
  highlightKeys?: Set<string>
): string {
  const citationPattern = /(?<!!)\[([^\]]+)\](?!\()/g
  let refCounter = 0

  return text.replace(citationPattern, (fullMatch, content: string) => {
    const trimmed = content.trim()
    const source = findSource(trimmed, lookup)

    if (source) {
      let isReady = false
      if (highlightKeys && source.document_id && source.sub_question_text) {
        isReady = highlightKeys.has(
          `${source.document_id}_${source.chunk_index}_${encodeURIComponent(source.sub_question_text)}`
        )
      }
      const url = buildCitationUrl(source, isReady)
      if (url) {
        refCounter++
        return `[${refCounter}](${url})`
      }
    }

    return fullMatch
  })
}

Update existing tests in frontend/src/test/utils/citationParser.test.ts to expect sequential [1](url), [2](url) output
Test file: frontend/src/test/utils/citationCompactRef.test.ts (optional — existing tests cover this after update)
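The numbering behaviour can be tried in isolation with the sketch below. The `urlByLabel` map and its URLs are hypothetical stand-ins for `findSource()` and `buildCitationUrl()`, and `numberCitations` is an illustrative name, not the real function.

```typescript
// Hypothetical stand-in for the real source lookup and URL builder.
const urlByLabel = new Map<string, string>([
  ['NEC4 ACC.pdf, page 3', '/pdf/nec4-acc?page=3'],
  ['Budget 2024.pdf, page 12', '/pdf/budget-2024?page=12'],
])

function numberCitations(text: string): string {
  let refCounter = 0 // fresh counter per call, so each section starts again at [1]
  return text.replace(/(?<!!)\[([^\]]+)\](?!\()/g, (fullMatch: string, content: string) => {
    const url = urlByLabel.get(content.trim())
    // Unknown label: keep the original text and do not consume a number.
    if (!url) return fullMatch
    refCounter++
    return `[${refCounter}](${url})`
  })
}

const sectionA =
  '- Fee is fixed [NEC4 ACC.pdf, page 3]\n' +
  '- Not found [Unknown.pdf, page 1]\n' +
  '- Up 35% [Budget 2024.pdf, page 12]'
const sectionB = '- Same clause again [NEC4 ACC.pdf, page 3]'

console.log(numberCitations(sectionA)) // resolved citations become [1] and [2]
console.log(numberCitations(sectionB)) // numbering restarts at [1] for the next section
```

Because the counter lives inside the function call, calling it once per sub-question section gives the per-section reset listed in the acceptance criteria. A source cited twice in the same section receives two different numbers, which matches Decision 5 (one number per citation, not per source).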

Task 7.4: Wire highlighting into ResponsePanel

In frontend/src/components/ResponsePanel.tsx:

Add HighlightMark component:

const HighlightMark = ({ children }: { children?: React.ReactNode }) => (
  <mark className="bg-yellow-200 rounded px-0.5">{children}</mark>
)

In SubQuestionSection (line ~109), add highlightTerms() after processCitations():

const processedAnswer = processCitations(answerSection, allSources, highlightReadyKeys)
const highlightedAnswer = highlightTerms(processedAnswer)

In FlatResponse (line ~455), same pattern

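Update ReactMarkdown components to include mark. react-markdown escapes raw HTML by default, so the <mark> tags from highlightTerms() also need raw HTML parsing enabled; the snippet below assumes the rehype-raw plugin is added as a dependency (import rehypeRaw from 'rehype-raw'):

<ReactMarkdown
  components={{ a: CitationLink, mark: HighlightMark }}
  rehypePlugins={[rehypeRaw]}
>
  {highlightedAnswer}
</ReactMarkdown>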

Add import { highlightTerms } from '../utils/citationParser' to imports
Test file: frontend/src/test/components/ResponsePanel_highlights.test.tsx

Task 7.5: Add CSS for mark styling

In frontend/src/styles.css, add:

.prose mark {
  background-color: #FEF08A; /* yellow-200 */
  border-radius: 0.25rem; /* matches Tailwind rounded */
  padding: 0 0.125rem;
}

No separate test file — verified via visual inspection and ResponsePanel integration tests.

Task 7.6: Run tests & verify

Run frontend tests: cd frontend && pnpm test
Run backend tests: cd backend && pytest app/test/ -v (no regressions)
Verify existing citation tests pass with sequential [1](url) [2](url) output format
Verify new highlight tests pass
Visual manual test: ask a question with figures/dates, verify yellow highlights appear and sequential [1] [2] links work

Test Files Summary

#	Test File	Type	Coverage
T7.2	`frontend/src/test/utils/highlightParser.test.ts`	Unit	`highlightTerms()`: basic `==term==` → `<mark>`, multiple highlights, no false positives on `==` in code, edge cases (empty, no markers, adjacent markers)
T7.3	`frontend/src/test/utils/citationParser.test.ts` (update)	Unit	Existing 16 tests updated to expect sequential `[1](url)`, `[2](url)` output. Add test: multiple citations render as `[1]`, `[2]`, `[3]`
T7.4	`frontend/src/test/components/ResponsePanel_highlights.test.tsx`	Integration	Full `ResponsePanel` with mock answer containing `==figure==` markers and `[citation]` brackets: verifies yellow `<mark>` elements render, `[1] [2]` links are clickable, source cards unchanged

Dependency Graph

Task 7.1 (update seed template)
    │
    │  (independent — prompt template change tested implicitly)
    │
Task 7.2 (highlightTerms function)    Task 7.3 (sequential citation refs)
    │                                      │
    └──────────────┬───────────────────────┘
                   │
            Task 7.4 (wire into ResponsePanel)
                   │
            Task 7.5 (CSS styling)
                   │
            Task 7.6 (run tests & verify)

Tasks 7.2 and 7.3 are independent and can run in parallel.

Acceptance Criteria

Feature 1: Highlighting

LLM prompt template includes ==term== instruction for figures, percentages, statistics, and dates
highlightTerms() correctly converts all ==term== patterns to <mark class="bg-yellow-200 rounded px-0.5">term</mark>
Yellow highlights render in the browser for monetary amounts (e.g., HK$1,000,000), percentages (e.g., 35%), and dates (e.g., 1 January 2024)
Highlights do NOT appear inside code blocks or inline code
Highlights work correctly alongside citation links ([1] [2] [3])
Highlights work in both sub-question mode and flat response mode
No regressions in existing tests

Feature 2: Compact Citations

All inline citations display as sequential [1], [2], [3] instead of [filename.pdf, page N]
Sequential numbers increment correctly per answer section (reset per sub-question section)
[1] [2] links are clickable and navigate to the correct PDF viewer or highlight page
Source cards below each section still show full filename, page, date, and summary
Existing citation tests pass with updated sequential [1](url) output format
No regressions in existing tests

Rollback Plan

Feature 1 (Highlighting):

Remove the ==term== instruction line from the _SEED_GENERATE_PER_SUBQ seed template in sqlite_db.py
Remove highlightTerms() call from ResponsePanel.tsx (revert to passing processedAnswer directly)
Remove HighlightMark component and mark from ReactMarkdown components
Remove CSS rule for .prose mark
highlightTerms() function can remain in citationParser.ts (no harm)

Feature 2 (Compact Citations):

Revert citationParser.ts line 125 from [${refCounter}](${url}) back to [${trimmed}](${url}) and remove the counter
Update test expectations back to full citation text

Both features are independent — can roll back one without affecting the other.

Non-Goals (explicitly out of scope)

❌ Do NOT change the LLM response parsing in rag.py (no backend code changes for highlighting)
❌ Do NOT change the SSE event schema (no new fields in completed event)
❌ Do NOT change the citation URL routing logic (buildCitationUrl() stays as-is)
❌ Do NOT modify source cards (SubQuestionSourceCard) — they still show full details
❌ Do NOT add tooltips or popovers on [N] links (future enhancement)
❌ Do NOT add per-term highlight metadata (type: figure vs date, color coding)
❌ Do NOT add configuration UI for highlight colors
❌ Do NOT modify the non-sub-question fallback generate_response() (legacy flat mode — highlight markers work from prompt template alone)

Reference: Key Code Locations

What	Where	Lines
Response generation	`backend/app/services/rag.py`	174–273
SSE completed event	`backend/app/routers/query.py`	379–385
Prompt seed template	`backend/app/core/sqlite_db.py`	42–53
Citation → link parser	`frontend/src/utils/citationParser.ts`	105–131
Response rendering	`frontend/src/components/ResponsePanel.tsx`	89–158, 338–517
ReactMarkdown usage	`frontend/src/components/ResponsePanel.tsx`	122–125, 456–459
Frontend SSE handler	`frontend/src/lib/queries.tsx`	51–89
Frontend types	`frontend/src/types/index.ts`	1–37
Custom CSS	`frontend/src/styles.css`	1–12
Citation parser tests	`frontend/src/test/utils/citationParser.test.ts`	1–242
ResponsePanel tests	`frontend/src/test/components/ResponsePanel.test.tsx`	1–366
