
Package 8 Enhancement Plan — Q&A-Pair Chunking Strategy

Source: User request (2026-05-15)
Scope: New "chunked by question" RAG algorithm that splits LegCo meeting documents into Q&A (問/答) pairs using LLM-assisted boundary detection, with vision-model-based table-to-markdown extraction. It coexists with the existing "chunked by token" strategy; the user selects a strategy on the RAG Database page.
Status: Draft

Objective

Feature 1: Q&A-Pair Chunking Algorithm

Split LegCo meeting transcript PDFs by question-answer pair boundaries rather than fixed token windows. Each Q&A pair becomes one ChromaDB chunk. The page number reference points to the question (問) location. This preserves the natural semantic unit of legislative documents where questions and answers form self-contained knowledge units.

Feature 2: Table-to-Markdown Extraction

Answers in LegCo documents frequently contain complex tables (budget figures, capacity tables, allowance rates). These tables must be converted to markdown so the LLM can read their contents correctly during retrieval. A vision-capable LLM (GPT-4o or compatible) identifies tables in answer sections and renders them as markdown tables.

Feature 3: Strategy Selection UI

Allow users to choose between "Chunked by token" (existing) and "Chunked by question" (new) on the RAG Database page. Strategy metadata is stored per-document and displayed in the document/chunk list views.

Non-goals: Does NOT modify the existing TokenChunkingStrategy. Does NOT change the ChromaDB storage schema beyond adding metadata fields. Does NOT affect the retrieval/query pipeline. Does NOT replace the existing per-chunk PDF generation (continues for both strategies).

Current State

Existing Chunking Pipeline

PDF → parse_pdf_by_page() [pypdf extract_text per page]
    → chunk_pages() [TokenChunkingStrategy: 1 chunk/page, 200-token overlap]
    → extract_metadata() [filename, date, summary, page_number, chunk_file_path]
    → rag.ingest_document() → collection.add() → ChromaDB

Chunking Strategy Abstraction

File: backend/app/utils/chunking.py

ChunkingStrategy ABC (line 13): defines chunk(text: str) -> List[str]
TokenChunkingStrategy(ChunkingStrategy) (line 28): token-based sliding window
chunk_pages(pages, overlap_tokens) (line 75): page-aware variant (PDF only)
NO strategy selection mechanism — TokenChunkingStrategy is hardcoded in the ingest router

Ingest Router

File: backend/app/routers/ingest.py

Line 44: from app.utils.chunking import TokenChunkingStrategy
Line 71: chunker = TokenChunkingStrategy(chunk_size=settings.chunk_size, overlap=settings.chunk_overlap)
Line 39: async def ingest_document(file: UploadFile = File(...)) — no strategy param
Three format branches (PDF/DOCX/TXT), each instantiates the same chunker

Config

File: backend/app/core/config.py

chunk_size: int = 1000 (line 41)
chunk_overlap: int = 200 (line 42)
No strategy config exists

Frontend

File: frontend/src/pages/RAGDatabasePage.tsx

handleUpload() calls ingestDocumentMutation.mutateAsync(file) — no strategy param
No strategy selector UI component exists

File: frontend/src/components/DocumentList.tsx

Displays filename, chunk_count, upload_date — no strategy info

File: frontend/src/components/ChunkList.tsx

Shows Page: {chunk.page_number} — no concept of question index

Document Analysis — LTT Test Materials

Four LegCo supplementary papers in test materials/LTT/:

File A: `詳細數字一覽表` (18 pages) — No Q&A format

Pure data tables appendix (17/18 pages are tables)
8 tables covering development statistics, costs, compensation stats
Will fall back to token-based chunking (Q&A strategy produces 0 pairs)

File B: `古洞北粉嶺北新發展區概覽` (8 pages) — 3 Q&A pairs

發言要點 (speaking points) section at start
備用問答 section at end with 3 pairs: 問 B1/答 B1, 問 B2/答 B2, 問 B3/答 B3
Contains milestone tables and planning review result tables

File E: `基礎設施` (27 pages) — 12 Chinese Q&A + 3 "Others"

發言要點 (pages 1-2): narrative briefing points
備用問答 (pages 2-23): Q&A organized by topic sections:
- (A) 排水系統 (Drainage): 問 A1-A3 / 答 A1-A3
- (B) 污水收集 (Sewage): 問 B1 / 答 B1
- (C) 供水系統 (Water supply): 問 C1 / 答 C1
- (D) 公用設施共同溝 (Common utility enclosure): 問 D1-D7 / 答 D1-D7
Others section (pages 23-27): O1-O4 (combined Q&A without explicit 問/答 markers)
Contains tables within answers (treatment plant capacity table, common utility comparison)
Contains [內部參考...] / [內部補充資料...] internal reference brackets
Contains [如被追問...] followup answer patterns

File L: `補償及安置安排` (36 pages) — 24 English Q&A

Table of contents + 8 narrative policy sections (pages 1-13)
備用問題 (pages 14-36): Q&A organized by topic:
- 住戶／鄉村遷置: Q1-Q6
- 業務經營者: Q7-Q16
- 務農人士: Q17-Q21
- 墳墓／金塔: Q22-Q24
No explicit "A" answer marker — answer ends at next Q or section header
Contains tables within answers (allowance rate tables, compensation amount tables)
Contains [如被追問...] / [如追問...] followup patterns
Contains [內部參考...] brackets

Key Structural Findings

Two distinct Q&A formats:

Style	Files	Q Pattern	A Pattern	Boundary Rule
Chinese 問/答	B, E	`問\s[A-Z]\d+\s[︰：]`	`答\s[A-Z]\d+\s[︰：]`	Q ends at 答; A ends at next 問 or section header
English Q-only	L	`^Q\d+\s`	None (implicit)	Q ends at end of Q text; A ends at next `^Q\d+` or topic line

Page layout conventions:

Page headers: (letter) on first page
Page footers: {letter}-{page_number}\n{YYYY-MM-DD} — these appear interspersed in extracted text at page boundaries
Section headers: (A) Section Name within Q&A blocks — serve as grouping markers
備用問答 / 備用問題: section headers marking Q&A start

Answer content complexity:

Multi-paragraph answers (spanning 2-7 pages for complex topics)
Tables embedded mid-answer (capacity tables, financial tables)
Internal reference brackets [內部參考...] with technical data
Followup Q&A [如被追問...] — nested Q&A within answers
Bullet lists with various markers (⚫, ➢, -, i. ii. iii.)

Page boundary interference: Q&A pairs frequently span page boundaries. The pypdf extracted text retains page number markers like E-2, E-3 mid-text. These must be stripped or treated as boundaries.

Research Findings — External Approaches

Q&A Chunking Approaches (from librarian research)

1. FAQ-RAG (Pro-GenAI/FAQ-RAG): Treats each Q&A as an atomic retrieval unit. Stores dual embeddings per pair (question + answer). Gold standard for QA-native documents. Adaptation: Instead of generating FAQs, detect existing Q&A pairs in LegCo documents.

2. LLM Boundary Detection (LlamaChunk / semchunk): Uses LLM logprobs or prompt-based boundary detection to identify semantic breakpoints. semchunk AI mode achieves 37.7% correctness vs 33.3% for fixed-size chunking on legal RAG benchmarks. Relevance: Primary detection approach — LLM identifies all section types (Q&A, narrative, speaking notes, tables) in one call. Handles format variations that regex cannot (section headings without Q&A markers, mixed formats).

3. Regex + LLM Hybrid: Used as optimization — regex fast-pass for documents with clear markers (問/答, Q1/Q2), falling back to LLM for everything else. LLM handles all format variations (narrative sections, speaking notes, section headings without Q&A markers) that regex cannot.

4. Index-RAG: Multi-embedding per chunk (question text + answer summary + topic + keywords). Optional enhancement — store multiple embeddings per Q&A pair for richer retrieval.

Vision Table Extraction (from librarian research)

1. gptpdf (CosmosShadow/gptpdf — 3,550 stars): 293 lines of Python. Uses PyMuPDF to detect non-text regions → marks with colored rectangles → GPT-4o converts page images to markdown. Cost: ~$0.013/page. Best fit: Lightweight, handles Chinese well, supports multiple vision LLMs.

2. llm-based-ocr (yigitkonur/llm-based-ocr — 890 stars): FastAPI service. PDF → Markdown with table preservation. Parallel processing (50-page PDF in seconds). Batch API cost: ~$4/1000 pages. Best fit for production: Cost-effective at scale.

3. Instructor + GPT-4o: Structured table extraction as Pydantic models with MarkdownDataFrame. Provides typed, structured table data directly usable in RAG. Best fit: When tables need to be queryable as structured data.

4. markdrop: Comprehensive PDF → Markdown with AI descriptions for images and tables. Supports 6 LLM providers (OpenAI, Gemini, Claude, Groq, OpenRouter, LiteLLM). Best fit: Full-pipeline solution but heavier than gptpdf.

Recommended Approach (synthesized)

PDF Transcript
    │
    ▼
[Preprocess] Concatenate pages, strip footers/headers, normalize
    │
    ▼
[LLM Structure Detection]  ← Single call: identify ALL section boundaries
    │                         Classify each as qa / narrative / speaking_notes / table
    │                         Same model as everything else (Qwen3.5-35B-A3B)
    ▼
[Regex fast-pass]  ← Optional: skip LLM for documents with 100%-regex-detectable format
    │
    ▼
[Chunk Builder]  ← For QA: one chunk per pair
    │               For narrative: paragraph-level split with heading prepended
    │               For speaking_notes: one chunk per bullet with topic prepended
    │               For table: standalone chunk or merged into parent section
    ▼
[Table Extraction]  ← Vision LLM (same model, base64 images) for pages with tables
    │                 → markdown tables injected into chunk text
    ▼
Q&A/Narrative Chunks → extract_metadata(section_type, heading, qa_id?, ...)
    │
    ▼
ChromaDB (with strategy_type="question" in metadata)

Target State

Architecture

POST /api/v1/ingest?strategy=token|question
    │
    ├─ strategy="token" → TokenChunkingStrategy (existing, unchanged)
    │
    └─ strategy="question" → QuestionChunkingStrategy (NEW)
         │
         ├─ [Preprocess] Concatenate pages, strip footers/headers, normalize
         │
         ├─ [LLM Structure Detection] Send full document text to LLM
         │   │  Same model as everything else (Qwen3.5-35B-A3B)
         │   │  Prompt: "Identify all sections. Classify each as Q&A pair,
         │   │           narrative section, table section, or speaking notes.
         │   │           For each Q&A, output question text, answer text,
         │   │           start/end boundaries."
         │   └─ Returns: [{ type: "qa"|"narrative"|"table"|"speaking_notes",
         │                   qa_id?, question?, answer?, heading?, start_page, end_page }, ...]
         │
         ├─ [Regex fallback] Skip LLM when disabled or for known patterns
         │   └─ 問/答 markers and Q1/Q2 lines as fast-pass optimization
         │
         ├─ [Size check] Oversized Q&A answers → recursive split on \n\n → \n
         │   └─ Question text prepended to each sub-chunk for context
         │
         ├─ [Table extraction] Vision LLM per page with tables → markdown
         │   └─ Same model: base64 page images via standard Chat Completions API
         │
         ├─ [Narrative sections] Sections with no Q&A (發言要點, policy text)
         │   └─ Chunk by section heading boundaries with heading+parent prepended
         │
         └─ [Metadata] Per-chunk:
              ├─ strategy_type: "question"
              ├─ section_type: "qa" | "narrative" | "speaking_notes"
              ├─ question_index: int (0-based, Q&A only)
              ├─ question_id: str ("A1", "Q3", etc., Q&A only)
              ├─ question_text: str (full question text, Q&A only)
              ├─ section_heading: str (e.g. "(A) 排水系統", "(1) 住戶的安置補償")
              ├─ answer_contains_table: bool
              └─ source_page_range: [int, int]
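The per-chunk metadata listed above could be assembled along these lines. This is a minimal sketch, not the project's extract_metadata(): the function name build_chunk_metadata is hypothetical, and the split of source_page_range into source_page_start / source_page_end is an assumption made on the premise that the ChromaDB version in use only accepts scalar metadata values (str, int, float, bool). page_number is set to the first page of the range so the reference lands on the question.

```python
from typing import Dict, Optional, Tuple, Union

Scalar = Union[str, int, float, bool]


def build_chunk_metadata(
    section_type: str,
    heading: str,
    page_range: Tuple[int, int],
    question_index: Optional[int] = None,
    question_id: Optional[str] = None,
    question_text: Optional[str] = None,
    contains_table: bool = False,
) -> Dict[str, Scalar]:
    """Build a flat metadata dict for one chunk (hypothetical helper)."""
    metadata: Dict[str, Scalar] = {
        "strategy_type": "question",
        "section_type": section_type,
        "section_heading": heading,
        "answer_contains_table": contains_table,
        # Assumption: the page range is stored as two scalars, not a list.
        "source_page_start": page_range[0],
        "source_page_end": page_range[1],
        # The page reference points at where the section (question) starts.
        "page_number": page_range[0],
    }
    if section_type == "qa":
        qa_fields = {
            "question_index": question_index,
            "question_id": question_id,
            "question_text": question_text,
        }
        # Skip unset fields so no None values reach the vector store.
        metadata.update({k: v for k, v in qa_fields.items() if v is not None})
    return metadata
```

Narrative and speaking-notes chunks simply omit the question_* keys, so a consumer can test for their presence rather than for a null value.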

Algorithm Detail: LLM Structure Detection (Primary)

LegCo documents use multiple formats — explicit 問/答 markers, Q1/Q2 numbering, section headings like (1) 住戶的安置補償, 發言要點 bullet points, and pure table pages. Regex alone cannot reliably classify all these. The LLM can.

Step 1 — Preprocess extracted text:

Concatenate all pages from parse_pdf_by_page() with \n separator
Strip page footer markers: regex ^[A-Z]-\d+\s*\n\d{4}-\d{2}-\d{2}$ and ^[A-Z]-\d+$
Strip page header letters: regex ^\([A-Z]\)\s*$ (keep on page 1 as doc identifier)
Normalize colons: ： → :, ︰ → :
Insert page boundary markers: [PAGE_BREAK: N] at each page boundary so the LLM knows where pages start/end
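The preprocessing steps above can be sketched as follows. This is an illustration under assumptions, not the final preprocess_text(): the header pattern is written as a literal parenthesised capital letter (matching the "(letter)" page header described under the layout conventions), the header is kept only on the first page supplied, and runs of blank lines left behind by stripping are collapsed, which the plan does not specify.

```python
import re
from typing import List, Tuple

# Footer forms: "E-2" followed by a date line, or a bare "E-2" line.
FOOTER_WITH_DATE = re.compile(r"^[A-Z]-\d+\s*\n\d{4}-\d{2}-\d{2}[ \t]*$", re.MULTILINE)
FOOTER_BARE = re.compile(r"^[A-Z]-\d+[ \t]*$", re.MULTILINE)
# Page header: a single capital letter in parentheses on its own line.
HEADER_LETTER = re.compile(r"^\([A-Z]\)[ \t]*$", re.MULTILINE)
# Full-width and presentation-form colons both become ASCII ":".
COLON_MAP = str.maketrans({"：": ":", "︰": ":"})


def preprocess_text(pages: List[Tuple[int, str]]) -> str:
    """Strip footers/headers, normalize colons, and mark page starts."""
    parts = []
    for index, (page_number, text) in enumerate(pages):
        text = FOOTER_WITH_DATE.sub("", text)
        text = FOOTER_BARE.sub("", text)
        if index > 0:  # keep the header letter on the first page only
            text = HEADER_LETTER.sub("", text)
        text = text.translate(COLON_MAP)
        text = re.sub(r"\n{3,}", "\n\n", text).strip()
        parts.append(f"[PAGE_BREAK: {page_number}]\n{text}")
    return "\n".join(parts)
```

Because each marker carries the original page number, a later step can recover the page of any question by finding the last [PAGE_BREAK: N] marker that precedes it.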

Step 2 — LLM Structure Detection (single call):

Send the preprocessed text to the LLM (same model as everything else: LLM_MODEL_NAME). Ask it to identify every section boundary and classify each section:

You are analyzing a Hong Kong Legislative Council document.
The text has page markers like [PAGE_BREAK: N] showing where pages begin.

For each distinct section in this document, identify:
1. The section type:
   - "qa": a question-and-answer pair (問/答 or Q1/Q2 format)
   - "narrative": policy text, explanatory paragraphs, section content with bullets
   - "speaking_notes": briefing points (發言要點) with bullet markers
   - "table": standalone data tables (not embedded in answers)
   - "toc": table of contents
   - "heading_only": a section heading with no following content

2. For "qa" sections:
   - The question text (exact)
   - The answer text (exact, including tables, bullet lists, and [內部參考] content)
   - The question ID if present (e.g. "A1", "Q3")
   - The start page and end page

3. For all sections:
   - The section heading (e.g. "(A) 排水系統", "(1) 住戶的安置補償")
   - The start page and end page
   - Whether the section contains tables

Return JSON:
{
  "sections": [
    {
      "type": "qa",
      "heading": "(A) 排水系統",
      "qa_id": "A1",
      "question": "古洞北和粉嶺北新發展區的設計是否能抵禦氣候變化...",
      "answer": "研究顧問已為古洞北和粉嶺北新發展區研究範圍進行了...",
      "start_page": 2,
      "end_page": 3,
      "has_table": true,
      "parent_topic": "排水系統"
    },
    {
      "type": "narrative",
      "heading": "(1) 住戶的安置補償",
      "content": "合資格住戶可選擇(i)「須通過經濟狀況審查」安置...",
      "start_page": 2,
      "end_page": 5,
      "has_table": false
    },
    {
      "type": "speaking_notes",
      "heading": "發言要點",
      "content": "⚫ 古洞北／粉嶺北新發展區（新發展區）是北部都會區內...",
      "start_page": 1,
      "end_page": 2,
      "has_table": false
    }
  ]
}
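LLM output of this shape should be validated before it drives chunking. The sketch below shows one way parse_llm_structure_response() might do that. The field defaults, the set of accepted types, and the policy of silently dropping malformed entries are assumptions for illustration; the real implementation may prefer to log or retry instead.

```python
from dataclasses import dataclass
from typing import List, Optional

VALID_TYPES = {"qa", "narrative", "speaking_notes", "table", "toc", "heading_only"}


@dataclass
class Section:
    type: str
    start_page: int
    end_page: int
    heading: str = ""
    qa_id: Optional[str] = None
    question: Optional[str] = None
    answer: Optional[str] = None
    content: Optional[str] = None
    has_table: bool = False
    parent_topic: Optional[str] = None


def parse_llm_structure_response(response: dict) -> List[Section]:
    """Convert the decoded JSON into Section objects, skipping bad entries."""
    sections: List[Section] = []
    for raw in response.get("sections") or []:
        if not isinstance(raw, dict) or raw.get("type") not in VALID_TYPES:
            continue  # unknown or missing section type
        if raw["type"] == "qa" and not (raw.get("question") and raw.get("answer")):
            continue  # a Q&A entry needs both halves
        try:
            start, end = int(raw["start_page"]), int(raw["end_page"])
        except (KeyError, TypeError, ValueError):
            continue  # page numbers missing or not numeric
        if end < start:
            continue  # inverted page range
        sections.append(Section(
            type=raw["type"], start_page=start, end_page=end,
            heading=raw.get("heading") or "",
            qa_id=raw.get("qa_id"), question=raw.get("question"),
            answer=raw.get("answer"), content=raw.get("content"),
            has_table=bool(raw.get("has_table", False)),
            parent_topic=raw.get("parent_topic"),
        ))
    return sections
```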

Step 3 — Chunk building from LLM output:

For each section in the JSON response:

Section type	Chunking rule
`qa`	One chunk = one Q&A pair. Question text + answer text concatenated.
`narrative`	Split into chunks at paragraph boundaries (`\n\n`), respecting `qa_max_chunk_tokens`. Each chunk prepended with section heading for context.
`speaking_notes`	Each bullet point (⚫) becomes a chunk, with topic heading prepended.
`table`	Each table becomes a chunk. If adjacent to a narrative section, merged into that section's chunks.
`toc` / `heading_only`	Skipped (not stored as chunks — no retrievable content).
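The chunking rules in the table above could be implemented roughly as shown here. This is a simplified sketch: the Section class is a trimmed stand-in, token counting defaults to character length (a crude placeholder for a real tokenizer), an oversized Q&A answer is split on paragraph breaks only (the second-level split on single newlines is omitted), and merging tables into adjacent narrative sections is not attempted.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Section:  # trimmed stand-in for the plan's Section dataclass
    type: str
    start_page: int
    end_page: int
    heading: str = ""
    question: Optional[str] = None
    answer: Optional[str] = None
    content: Optional[str] = None


def pack_paragraphs(text: str, prefix: str, max_tokens: int,
                    count: Callable[[str], int]) -> List[str]:
    """Greedily pack paragraphs into chunks, each starting with prefix."""
    head = [prefix] if prefix else []
    chunks: List[str] = []
    current: List[str] = []
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        candidate = "\n\n".join(head + current + [para])
        if current and count(candidate) > max_tokens:
            chunks.append("\n\n".join(head + current))
            current = [para]
        else:
            current.append(para)
    if current:
        chunks.append("\n\n".join(head + current))
    return chunks


def build_chunks_from_sections(
    sections: List[Section], max_tokens: int,
    count: Callable[[str], int] = len,  # assumption: char count as token proxy
) -> List[Tuple[str, int, dict]]:
    """Return (chunk_text, page_number, metadata) triples."""
    out: List[Tuple[str, int, dict]] = []
    for s in sections:
        if s.type in ("toc", "heading_only"):
            continue  # no retrievable content
        if s.type == "qa":
            # One chunk per pair; oversized answers split with the question repeated.
            texts = pack_paragraphs(s.answer or "", s.question or "", max_tokens, count)
        elif s.type == "speaking_notes":
            bullets = [b.strip() for b in (s.content or "").split("⚫") if b.strip()]
            texts = [f"{s.heading}\n\n⚫ {b}" for b in bullets]
        else:  # narrative and standalone table sections
            texts = pack_paragraphs(s.content or "", s.heading, max_tokens, count)
        meta = {"section_type": s.type, "section_heading": s.heading}
        out.extend((t, s.start_page, dict(meta)) for t in texts)
    return out
```

Every chunk carries the section's start page, so a Q&A pair that is split into sub-chunks still points back to the page where the question appears.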

Step 4 — Regex fast-pass (optional optimization):

When LLM structure detection is disabled (qa_structure_model set to "none") or to reduce cost, use regex as a fast pre-filter. The LLM call is skipped for documents where regex confidently identifies all boundaries. The LLM is called only when qa_structure_model is not "none" and at least one of the following holds:

No regex pattern matches (unknown format)
Regex produces < 2 sections (likely misdetection)
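A regex fast-pass for the Chinese 問/答 style might look like the sketch below. It assumes the text has already been preprocessed so that colons are ASCII ":" and each marker starts a line. Treating any line of the form "(X) title" as a section header that ends the preceding answer is a heuristic, and needs_llm() is a hypothetical helper that encodes the conditions listed above, taking "none" as the value that disables the LLM.

```python
import re
from typing import Dict, List

Q_MARK = re.compile(r"^問\s*([A-Z]\d+)\s*:", re.MULTILINE)
A_MARK = re.compile(r"^答\s*([A-Z]\d+)\s*:", re.MULTILINE)
# Heuristic: "(B) 污水收集" style topic line closes the preceding answer.
SECTION_HEADER = re.compile(r"^\([A-Z]\)[ \t]*\S.*$", re.MULTILINE)


def split_chinese_qa(text: str) -> List[Dict[str, str]]:
    """Return question/answer pairs found via 問 X: / 答 X: markers."""
    q_hits = list(Q_MARK.finditer(text))
    pairs: List[Dict[str, str]] = []
    for i, q in enumerate(q_hits):
        # A pair extends up to the next question marker (or end of text).
        end = q_hits[i + 1].start() if i + 1 < len(q_hits) else len(text)
        a = A_MARK.search(text, q.end(), end)
        if a is None or a.group(1) != q.group(1):
            continue  # no matching answer marker: leave this one to the LLM
        answer = text[a.end():end]
        header = SECTION_HEADER.search(answer)
        if header:
            answer = answer[:header.start()]
        pairs.append({
            "qa_id": q.group(1),
            "question": text[q.end():a.start()].strip(),
            "answer": answer.strip(),
        })
    return pairs


def needs_llm(pairs: List[Dict[str, str]], structure_model: str) -> bool:
    """Decide whether to fall through to LLM structure detection."""
    if structure_model == "none":
        return False  # LLM explicitly disabled
    return len(pairs) < 2  # no match, or too few sections to trust
```

The English Q-only style would need a separate splitter, since its answers have no marker and end only at the next Q line or topic line.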

Algorithm Detail: Table-to-Markdown

Option A — Vision LLM per page (Primary for PDF):

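Render each page of the Q&A section as an image (300 DPI). pypdf does not rasterize pages, so this step needs a renderer such as PyMuPDF or pdf2image, with PIL for any post-processing.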

Send to the existing LLM model (LLM_MODEL_NAME, currently qwen/qwen3.5-35b-a3b) which is a native vision-language model that accepts base64 images via the standard OpenAI Chat Completions API:

{
  "role": "user",
  "content": [
    { "type": "text", "text": "Convert this page to Markdown. For any tables, use |---|---| syntax..." },
    { "type": "image_url", "image_url": { "url": "data:image/png;base64,..." } }
  ]
}

No separate vision model or API key needed — uses the existing LLM_BASE_URL/LLM_API_KEY config

Prompt:

Convert this page to Markdown. For any tables:
- Use proper markdown table syntax with |---|---| alignment
- Preserve all column headers and row labels
- Do not modify or translate the content
- If a table spans multiple pages, note it
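Building the request payload for one rendered page can be sketched as below. The function name build_vision_messages is hypothetical, and the sketch only constructs the messages list in the Chat Completions image_url format described above; sending it through whatever client the backend already uses is left out.

```python
import base64
from typing import Any, Dict, List

TABLE_PROMPT = (
    "Convert this page to Markdown. For any tables:\n"
    "- Use proper markdown table syntax with |---|---| alignment\n"
    "- Preserve all column headers and row labels\n"
    "- Do not modify or translate the content\n"
    "- If a table spans multiple pages, note it"
)


def build_vision_messages(png_bytes: bytes, prompt: str = TABLE_PROMPT) -> List[Dict[str, Any]]:
    """Wrap a PNG page image and the prompt into a single user message."""
    encoded = base64.b64encode(png_bytes).decode("ascii")
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            # The image travels inline as a base64 data URL.
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{encoded}"}},
        ],
    }]
```

Keeping payload construction separate from the network call makes it easy to unit-test without a live model.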

Option B — Table detection + LLM extraction (Fallback for non-PDF):

Detect table regions in extracted text using heuristics (rows of aligned numbers, pipe-delimited data)

Send detected table text blocks to LLM with prompt:

Convert this raw table text extracted from a PDF into a markdown table.
Preserve all data exactly. Detect column boundaries and alignment.

No vision API cost — uses text-only LLM
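The heuristic detection step in Option B could start from something as simple as the sketch below. The rule used here (a line counts as tabular if it has at least two numeric tokens or at least two pipes, and a table is two or more such lines in a row) is an assumption chosen for illustration. It will miss header rows that contain no numbers and can misfire on lines that merely contain dates, so its output is meant to be handed to the LLM for conversion, not trusted directly.

```python
import re
from typing import List

# A numeric token: digits with optional thousands separators, decimals, or %.
NUMERIC = re.compile(r"\d[\d,.]*%?")


def looks_tabular(line: str) -> bool:
    """Crude row test: pipe-delimited, or at least two numeric cells."""
    return line.count("|") >= 2 or len(NUMERIC.findall(line)) >= 2


def find_table_blocks(text: str, min_rows: int = 2) -> List[str]:
    """Return runs of consecutive tabular-looking lines as candidate tables."""
    blocks: List[str] = []
    current: List[str] = []
    for line in text.splitlines():
        if looks_tabular(line):
            current.append(line)
            continue
        if len(current) >= min_rows:
            blocks.append("\n".join(current))
        current = []
    if len(current) >= min_rows:
        blocks.append("\n".join(current))
    return blocks
```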

Option C — Hybrid (Recommended):

Use vision LLM (same model) for pages with complex multi-column/multi-row tables
Use text-based LLM for simple tables detected via regex
Cache vision results per page to avoid re-processing on re-ingest
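Per-page caching can be keyed on a hash of the rendered image, so re-ingesting an unchanged PDF never repeats a vision call. The sketch below is one possible shape: the function name cached_page_markdown, the one-file-per-page layout, and the convert callback (standing in for the actual vision call) are all assumptions.

```python
import hashlib
from pathlib import Path
from typing import Callable, Union


def cached_page_markdown(
    png_bytes: bytes,
    cache_dir: Union[str, Path],
    convert: Callable[[bytes], str],
) -> str:
    """Return markdown for a page image, calling convert only on a cache miss."""
    key = hashlib.sha256(png_bytes).hexdigest()
    path = Path(cache_dir) / f"{key}.md"
    if path.exists():
        return path.read_text(encoding="utf-8")
    markdown = convert(png_bytes)  # the expensive vision LLM call
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(markdown, encoding="utf-8")
    return markdown
```

Hashing the image bytes means a change in render DPI or page content invalidates the entry automatically. A change to the prompt does not, so the prompt version may need to be folded into the key.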

Config Additions

# backend/app/core/config.py
class Settings(BaseSettings):
    # Existing...
    chunk_size: int = 1000
    chunk_overlap: int = 200

    # NEW: Chunking strategy selection
    default_chunking_strategy: str = "token"  # "token" | "question"

    # NEW: Q&A chunking config
    qa_vision_enabled: bool = True   # Toggle vision-based table extraction (uses existing LLM_MODEL_NAME)
    qa_max_chunk_tokens: int = 3000  # Max tokens per Q&A chunk (before forced split)
    qa_structure_model: str = ""  # LLM for structure detection (empty = use LLM_MODEL_NAME)
    qa_include_internal_refs: bool = True  # Include [內部參考] in chunks
    qa_cache_vision_results: bool = True   # Cache vision results per page

    # Env var mappings
    # DEFAULT_CHUNKING_STRATEGY=token
    # QA_VISION_ENABLED=true
    # QA_MAX_CHUNK_TOKENS=3000
    # QA_STRUCTURE_MODEL=
    # QA_INCLUDE_INTERNAL_REFS=true
    # QA_CACHE_VISION_RESULTS=true

    # NOTE: Vision table extraction uses the existing LLM_MODEL_NAME
    # (qwen/qwen3.5-35b-a3b on OpenRouter — a native vision-language model).
    # No separate vision model config needed. Base64 images are sent
    # via the standard OpenAI Chat Completions API image_url format.

Implementation Plan

Sub-Phase 8.0: Config & Enums (Foundation)

Objective: Add all config settings and type definitions needed by subsequent sub-phases.

Test Files:

backend/app/test/test_phase1_config.py — add: test_qa_chunking_config_defaults
frontend/src/test/ (TBD) — type-level validation

Implementation Tasks:

Add Q&A chunking config fields to backend/app/core/config.py
Define ChunkingStrategyType = Literal["token", "question"] in backend/app/models/ingest.py
Add VALID_CHUNKING_STRATEGIES = {"token", "question"} constant
Update backend/.env.example with new env vars
Add strategy fields to frontend types (ChunkingStrategy type, extend IngestResponse, DocumentInfo, ChunkInfo)

Sub-Phase 8.1: Q&A Chunking Strategy Class

Objective: Implement QuestionChunkingStrategy as a new ChunkingStrategy subclass.

Test Files:

backend/app/test/test_phase8_qa_chunking.py — new file
- test_llm_structure_detection — LLM correctly identifies Q&A, narrative, speaking_notes sections
- test_llm_handles_mixed_formats — document with 問/答 + section headings + narrative correctly classified
- test_llm_handles_no_qa_format — narrative-only documents (File L pages 1-13) correctly chunked by section
- test_llm_handles_speaking_notes — 發言要點 sections chunked by bullet point
- test_regex_fastpass_chinese — known 問/答 format detected without LLM (when configured)
- test_regex_fastpass_english — known Q1/Q2 format detected without LLM
- test_multi_page_sections — sections spanning page boundaries tracked correctly
- test_abc_contract — satisfies ChunkingStrategy ABC
- test_page_number_reference_question — page ref points to question location for Q&A chunks
- test_size_limit — oversized sections get recursive split with heading preserved

Implementation Tasks:

Create backend/app/utils/qa_chunking.py:
- preprocess_text(pages: List[Tuple[int, str]]) -> str — strip footers/headers, normalize colons, insert page markers, reconstruct full text
- build_structure_detection_prompt(text: str) -> str — construct the LLM prompt for section classification
- parse_llm_structure_response(response: dict) -> List[Section] — parse the JSON returned by the LLM
- Section dataclass: {type, heading, qa_id?, question?, answer?, content?, start_page, end_page, has_table, parent_topic}
- split_chinese_qa(text: str) -> List[Section] — regex fast-pass for known 問/答 format
- split_english_qa(text: str) -> List[Section] — regex fast-pass for known Q-number format
- build_chunks_from_sections(sections: List[Section], max_tokens: int) -> List[Tuple[str, int, dict]] — build chunk texts + page refs + metadata from sections
Create QuestionChunkingStrategy(ChunkingStrategy) in backend/app/utils/chunking.py:
- chunk(text: str) -> List[str] — entry point for DOCX/TXT
- chunk_pages(pages: List[Tuple[int, str]]) -> List[Tuple[str, int]] — entry point for PDF
- Returns chunks with page number = question start page
- Handles fallback: if no Q&A detected, falls back to token-based or section-based
Create backend/app/utils/table_extraction.py:
- extract_tables_vision(page_images: List[Image]) -> List[str] — vision LLM → markdown tables
- extract_tables_text(text: str) -> List[str] — heuristic table detection → LLM markdown conversion
- inject_tables_into_answer(answer: str, tables_md: List[str]) -> str — replace raw tables with markdown
- cache_vision_result(page_hash: str) -> Optional[str] — disk cache for vision results
Add get_chunking_strategy(name: str, settings: Settings) -> ChunkingStrategy factory function in backend/app/utils/chunking.py
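The factory named in the last task might be as small as the sketch below. The Settings dataclass and both strategy classes here are trivial stand-ins so the example runs on its own; only the get_chunking_strategy() signature and the two strategy names come from this plan. Raising ValueError for an unknown name is an assumption that lets the router translate the failure into a 400 response.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List

VALID_CHUNKING_STRATEGIES = {"token", "question"}


@dataclass
class Settings:  # stand-in for the real pydantic settings object
    chunk_size: int = 1000
    chunk_overlap: int = 200
    qa_max_chunk_tokens: int = 3000


class ChunkingStrategy(ABC):
    @abstractmethod
    def chunk(self, text: str) -> List[str]:
        ...


class TokenChunkingStrategy(ChunkingStrategy):  # placeholder body
    def __init__(self, chunk_size: int, overlap: int) -> None:
        self.chunk_size, self.overlap = chunk_size, overlap

    def chunk(self, text: str) -> List[str]:
        return [text] if text else []


class QuestionChunkingStrategy(ChunkingStrategy):  # placeholder body
    def __init__(self, max_chunk_tokens: int) -> None:
        self.max_chunk_tokens = max_chunk_tokens

    def chunk(self, text: str) -> List[str]:
        return [p for p in text.split("\n\n") if p.strip()]


def get_chunking_strategy(name: str, settings: Settings) -> ChunkingStrategy:
    """Map a strategy name to a configured strategy instance."""
    if name not in VALID_CHUNKING_STRATEGIES:
        valid = ", ".join(sorted(VALID_CHUNKING_STRATEGIES))
        raise ValueError(f"Unknown chunking strategy {name!r}; expected one of: {valid}")
    if name == "question":
        return QuestionChunkingStrategy(settings.qa_max_chunk_tokens)
    return TokenChunkingStrategy(settings.chunk_size, settings.chunk_overlap)
```

Centralizing the lookup keeps the three format branches in the ingest router free of strategy-specific construction code.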

Sub-Phase 8.2: Metadata Enrichment

Objective: Extend extract_metadata() to support Q&A-specific fields.

Test Files:

backend/app/test/test_phase8_metadata.py — new file
- test_qa_metadata_fields — strategy_type, question_index, question_id, question_text
- test_qa_metadata_topic_section
- test_token_metadata_unchanged — existing metadata unchanged for token strategy
- test_page_number_from_question — page ref points to question location

Implementation Tasks:

Extend extract_metadata() in backend/app/utils/metadata.py:
- New param: strategy_type: str = "token"
- New param: qa_pairs: Optional[List[QAPair]] = None
- When strategy_type == "question": populate question_id, question_index, question_text, topic_section, qa_style, source_page_range, has_table
- Maintain backward compatibility: all existing fields still populated
Update backend/app/models/documents.py:
- ChunkInfo: add strategy_type: str, question_index: Optional[int] = None, question_id: Optional[str] = None, question_text: Optional[str] = None, topic_section: Optional[str] = None
Update backend/app/models/ingest.py:
- IngestResponse: add strategy: str
- Add IngestRequest model with strategy: str = "token"

Sub-Phase 8.3: Ingest API Integration

Objective: Wire the strategy selection into the ingest endpoint.

Test Files:

backend/app/test/test_phase8_ingest.py — new file
- test_ingest_with_strategy_token — existing behavior unchanged
- test_ingest_with_strategy_question — Q&A chunking applied
- test_ingest_invalid_strategy_rejected — 400 on unknown strategy
- test_ingest_question_strategy_pdf — PDF with Chinese Q&A
- test_ingest_question_strategy_docx — DOCX with Q&A format
- test_ingest_question_strategy_no_qa_fallback — document without Q&A falls back
- test_ingest_response_includes_strategy — IngestResponse has strategy field

Implementation Tasks:

Update backend/app/routers/ingest.py:
- Line 39: accept strategy: str = Query("token") param
- Validate against VALID_CHUNKING_STRATEGIES
- Lines 70-73: replace hardcoded TokenChunkingStrategy with get_chunking_strategy(strategy, settings)
- Lines 108-115, 148-151, 183-186: pass strategy_type=strategy and QA metadata to extract_metadata()
- Line 192: include strategy in IngestResponse
Update backend/app/services/rag.py:
- list_documents(): extract strategy_type from first chunk's metadata per document
- list_chunks(): return strategy_type, question_index, question_id, question_text, topic_section from metadata
Update backend/app/routers/documents.py:
- Map new fields from RAG service to API response models

Sub-Phase 8.4: Frontend Strategy Selector

Objective: Add strategy selection UI on the RAG Database page, display strategy info in document/chunk lists.

Test Files:

frontend/src/test/components/ChunkList.test.tsx — add Q&A chunk rendering tests
frontend/src/test/components/DocumentList.test.tsx — add strategy badge tests
frontend/src/test/pages/RAGDatabasePage.test.tsx (new) — strategy selector integration

Implementation Tasks:

Update frontend/src/types/index.ts:
- Add ChunkingStrategy type
- Extend IngestResponse, DocumentInfo, ChunkInfo with new fields
Update frontend/src/lib/api.ts:
- ingestDocument(file, strategy = "token"): append strategy as query param
Update frontend/src/lib/queries.tsx:
- useIngestDocument: accept { file: File; strategy: ChunkingStrategy }
Update frontend/src/pages/RAGDatabasePage.tsx:
- Add chunkingStrategy state (default: "token")
- Add strategy selector UI (two radio buttons or toggle):
  - "Chunk by Token" (icon: text icon, description: "Split by token windows with overlap")
  - "Chunk by Question" (icon: Q&A icon, description: "Detect Q&A pairs, extract tables")
- Pass strategy to handleUpload
- Show strategy info in success toast
Update frontend/src/components/DocumentList.tsx:
- Add strategy badge: chunked by token or chunked by question
- Color-code: token = gray, question = blue
Update frontend/src/components/ChunkList.tsx:
- For Q&A chunks: show Q: {question_id} — {question_text} instead of page number
- Show Topic: {topic_section} if available
- Show Pages {start}-{end} for the page range
- Show ⚠ Contains table badge if has_table is true
Update frontend/src/components/DocumentUpload.tsx:
- Disable upload button when no file selected (if not already)
- Show strategy selector inline or accept selectedStrategy prop

Sub-Phase 8.5: Integration & Acceptance Testing

Objective: End-to-end verification with real LegCo PDFs from test materials/LTT/.

Test Files:

backend/app/test/acceptance/test_acceptance_phase8_qa_chunking.py — new file
- test_real_qa_chunking_fileE — File E produces 12 Chinese Q&A pairs + 3 Others
- test_real_qa_chunking_fileL — File L produces 24 English Q&A pairs
- test_real_qa_chunking_fileB — File B produces 3 Chinese Q&A pairs + narrative sections
- test_real_qa_chunking_fileA — File A falls back (no Q&A, should not error)
- test_table_extraction_fileE — Tables in File E answers converted to markdown
- test_table_extraction_fileL — Tables in File L answers converted to markdown
- test_qa_page_references — Each Q&A chunk's page number points to question location
- test_full_pipeline_question_strategy — Full ingest → retrieve → query with Q&A chunks

Implementation Tasks:

Run Q&A chunking on all 4 LTT PDFs, verify pair counts and content
Verify table extraction quality (spot-check markdown tables)
Verify page number references point to question start
Run pytest backend/app/test/test_phase8*.py -v — all pass
Run pytest backend/app/test/acceptance/test_acceptance_phase8*.py -v -m acceptance — all pass
Test frontend: upload File E with "chunk by question" → verify chunks displayed with Q&A metadata
Verify strategy coexistence: upload File E with both strategies → both appear in document list with correct badges

Sub-Phase 8.6: Polish & Edge Cases

Objective: Handle remaining edge cases and polish UI.

Tasks:

Handle Others section in File E (O1-O4 without explicit 問/答)
Handle [如被追問] nested Q&A: include as part of answer or split as sub-chunks
Handle [內部參考] inclusion/exclusion toggle via config
Add chunking strategy filter to query page (optional — skip if out of scope)
Add loading state for vision table extraction (can take seconds for large docs)
Track chunking strategy in query history (bonus: show which strategy produced results)

File Change Summary

New Files

File	Purpose
`backend/app/utils/qa_chunking.py`	Q&A boundary detection, splitting, QAPair dataclass
`backend/app/utils/table_extraction.py`	Vision LLM table extraction, heuristic fallback, caching
`backend/app/test/test_phase8_qa_chunking.py`	Tests for Q&A chunking logic
`backend/app/test/test_phase8_metadata.py`	Tests for Q&A metadata enrichment
`backend/app/test/test_phase8_ingest.py`	Tests for strategy selection in ingest API
`backend/app/test/acceptance/test_acceptance_phase8_qa_chunking.py`	Acceptance tests with real LTT PDFs

Modified Files

File	Changes
`backend/app/utils/chunking.py`	Add `QuestionChunkingStrategy` class + `get_chunking_strategy()` factory
`backend/app/core/config.py`	Add Q&A chunking config fields (5 new — vision uses existing LLM model)
`backend/app/utils/metadata.py`	Add `strategy_type` and Q&A fields to metadata output
`backend/app/models/ingest.py`	Add `strategy` to `IngestResponse`, add `IngestRequest`
`backend/app/models/documents.py`	Extend `ChunkInfo` with Q&A fields, `DocumentInfo` with `chunking_strategy`
`backend/app/routers/ingest.py`	Accept `strategy` query param, use factory, pass strategy to metadata
`backend/app/services/rag.py`	`list_documents()`/`list_chunks()` return strategy info
`backend/app/routers/documents.py`	Map new fields to API responses
`backend/.env.example`	Add new env vars
`frontend/src/types/index.ts`	Add `ChunkingStrategy` type, extend interfaces
`frontend/src/lib/api.ts`	`ingestDocument()` accepts strategy param
`frontend/src/lib/queries.tsx`	`useIngestDocument()` mutation signature updated
`frontend/src/pages/RAGDatabasePage.tsx`	Add strategy selector state + UI
`frontend/src/components/DocumentUpload.tsx`	Accept strategy prop / integrate selector
`frontend/src/components/DocumentList.tsx`	Display strategy badge
`frontend/src/components/ChunkList.tsx`	Conditional Q&A display (question_id instead of page_number)

Test Files (Modified)

File	Changes
`backend/app/test/test_phase1_chunking.py`	Add test for strategy factory
`backend/app/test/test_phase1_ingest.py`	Add strategy param test
`backend/app/test/test_phase1_documents_router.py`	Add strategy field verification
`backend/app/test/test_phase1_metadata.py`	Add Q&A metadata field tests
`backend/app/test/test_phase1_config.py`	Add Q&A config tests
`frontend/src/test/components/ChunkList.test.tsx`	Add Q&A chunk display tests
`frontend/src/test/components/DocumentList.test.tsx`	Add strategy badge test
`frontend/src/test/components/IngestPanel.test.tsx`	Update for strategy selection

Acceptance Criteria

AC1: User can select "Chunk by token" or "Chunk by question" on the RAG Database page before uploading
AC2: "Chunk by token" uploads produce chunks identical to current behavior (no regression)
AC3: "Chunk by question" upload of File E produces 12+ Chinese Q&A chunks with correct question_id, question_text, topic_section metadata
AC4: "Chunk by question" upload of File L produces 24 English Q&A chunks with correct boundaries
AC5: Each Q&A chunk's page number references the question (問) page, not the answer
AC6: Tables within answers are converted to markdown table syntax (vision LLM or heuristic)
AC7: Document list shows chunking strategy badge per document
AC8: Chunk list shows question_id/question_text for Q&A chunks instead of page_number
AC9: Documents without Q&A format (File A, narrative sections) fall back to token/section-based chunking without error
AC10: Both strategies coexist — documents stored with different strategies appear side-by-side in document list
AC11: Strategy param is validated — invalid values return 400
AC12: All backend tests pass (pytest backend/app/test/test_phase8*.py -v)
AC13: All acceptance tests pass (pytest backend/app/test/acceptance/test_acceptance_phase8*.py -v -m acceptance)
AC14: Frontend tests pass (pnpm test in frontend/)
AC15: Existing tests still pass (no regression)

Risks & Mitigations

| Risk | Severity | Mitigation |
| --- | --- | --- |
| Vision LLM cost: Per-page image processing adds latency and token cost for large docs | Medium | Cache vision results per page hash. Skip vision for pages without detected tables. `qa_vision_enabled` toggle allows disabling vision and using heuristic-only extraction. Vision uses the same model/API key, so there is no additional provider cost. |
| Q&A boundary detection: LegCo documents use multiple formats (問/答, Q1/Q2, section headings, 發言要點) that regex alone cannot handle | Medium | LLM structure detection as primary approach handles all format variations in one call. Regex fast-pass only used for optimization on known patterns. Fall back to section-based chunking if LLM detects no Q&A content. |
| Table extraction quality: Vision LLM may misinterpret complex Chinese tables | Medium | Flag `has_table=True` so retrieval can surface context. Allow user to inspect raw chunk content. Provide config to disable vision and use heuristic-only extraction. |
| Performance: Vision LLM + LLM verification adds latency to ingest | Medium | Make vision extraction async (background task). Show progress in frontend. Cache results. Allow skipping verification for trusted documents. |
| Page boundary interference: Page footer numbers appear mid-Q&A text | Low | Preprocessing step strips known footer patterns. Test with all 4 LTT PDFs. |
| LegCo format drift: Future documents may use different formats or section structures | Low | LLM structure detection adapts to new formats without code changes: the LLM prompt describes what to look for, not how. Add new format examples to the prompt if needed. |
| Chunk size: Some Q&A pairs are very long (7+ pages) | Medium | Apply max chunk token limit (configurable, default 3000). Recursive split on `\n\n` → `\n` with question text prepended to each sub-chunk for context. |
| DOCX/TXT Q&A: Non-PDF formats may have different Q&A markers | Low | Use same regex detection on concatenated text. Skip vision table extraction (text-based only). |
| vLLM vision compatibility: vLLM may not support vision API for Qwen3.5-35B-A3B depending on version and how the model is served | Medium | Test with a single vision call against your vLLM instance before implementation. Set `QA_VISION_ENABLED=false` and use heuristic fallback (Option B) if unsupported. See vLLM compatibility note in Dependencies below. |
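
The "Chunk size" mitigation (recursive split on `\n\n`, then `\n`, with the question text prepended to each sub-chunk) could look roughly like the sketch below. The function names and the `count_tokens` parameter are hypothetical; the default counter simply counts characters and stands in for whatever tokenizer the pipeline actually uses.

```python
from typing import Callable, List, Sequence


def _split(text: str, budget: int, count_tokens: Callable[[str], int],
           separators: Sequence[str]) -> List[str]:
    """Greedily pack pieces up to `budget`, recursing to finer separators."""
    if count_tokens(text) <= budget or not separators:
        # Fits, or nothing finer to split on (piece may then exceed the budget).
        return [text]
    sep, finer = separators[0], separators[1:]
    chunks: List[str] = []
    current = ""
    for part in text.split(sep):
        candidate = part if not current else current + sep + part
        if count_tokens(candidate) <= budget:
            current = candidate
            continue
        if current:
            chunks.append(current)
        current = ""
        if count_tokens(part) <= budget:
            current = part
        else:
            # The part alone is too large: retry with the next separator.
            chunks.extend(_split(part, budget, count_tokens, finer))
    if current:
        chunks.append(current)
    return chunks


def split_long_answer(question: str, answer: str, max_tokens: int = 3000,
                      count_tokens: Callable[[str], int] = len,
                      separators: Sequence[str] = ("\n\n", "\n")) -> List[str]:
    """Split an over-long Q&A pair; every sub-chunk starts with the question."""
    budget = max_tokens - count_tokens(question) - count_tokens("\n\n")
    if budget <= 0:
        raise ValueError("question alone exceeds max_tokens")
    return [f"{question}\n\n{piece}"
            for piece in _split(answer, budget, count_tokens, separators)]
```

Reserving part of the budget for the question keeps each sub-chunk under the limit after the prefix is added, at least when the token counter is roughly additive.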

Dependencies

Vision LLM API: The existing LLM_MODEL_NAME (qwen/qwen3.5-35b-a3b) is a native vision-language model that accepts base64 images via the standard OpenAI Chat Completions API. No separate vision model, API key, or endpoint needed. If vision is unavailable or disabled (QA_VISION_ENABLED=false), fall back to heuristic table detection (text-only).

vLLM compatibility (when VLLM_ENGINE=true): Vision table extraction requires vLLM v0.6.0+ with the model served as multimodal (vision encoder loaded). Verify with a quick smoke test before implementation:

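import asyncio
import base64
import io
from openai import AsyncOpenAI
from PIL import Image
# `settings` is the application's existing config object; import it from wherever the project defines it.
async def vision_smoke_test():
    # Send a real, decodable PNG: a bare PNG signature can be rejected even when vision is supported.
    buf = io.BytesIO()
    Image.new("RGB", (64, 64), "white").save(buf, format="PNG")
    image_b64 = base64.b64encode(buf.getvalue()).decode("ascii")
    client = AsyncOpenAI(base_url=settings.llm_base_url, api_key=settings.llm_api_key)
    resp = await client.chat.completions.create(
        model=settings.llm_model_name,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe what you see."},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content
print(asyncio.run(vision_smoke_test()))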

If this returns a valid response → vision works. If it errors (400/422) → set QA_VISION_ENABLED=false and use Option B (heuristic text-based table extraction). The heuristic fallback works identically regardless of provider (OpenRouter or vLLM).
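
For orientation, a text-only heuristic of the kind the fallback refers to might be sketched as follows. This is an assumption about how such a heuristic could work, not the plan's actual Option B: it treats two or more consecutive lines that split into the same number of columns (on tabs or runs of two or more spaces) as a table, and uses the first such line as the header. The names `tables_to_markdown` and `_cells` are hypothetical.

```python
import re
from typing import List, Tuple

# Columns in pypdf-extracted text are assumed to be separated by tabs or 2+ spaces.
_COLUMN_GAP = re.compile(r"\t+| {2,}")


def _cells(line: str) -> List[str]:
    """Split one text line into non-empty, trimmed column cells."""
    return [c.strip() for c in _COLUMN_GAP.split(line.strip()) if c.strip()]


def tables_to_markdown(text: str) -> str:
    """Rewrite runs of aligned multi-column lines as markdown tables."""
    out: List[str] = []
    block: List[Tuple[str, List[str]]] = []  # (raw line, cells) of candidate table

    def flush() -> None:
        if len(block) >= 2:
            header = block[0][1]
            out.append("| " + " | ".join(header) + " |")
            out.append("|" + " --- |" * len(header))
            out.extend("| " + " | ".join(cells) + " |" for _, cells in block[1:])
        else:
            # A single multi-column line is not treated as a table.
            out.extend(raw for raw, _ in block)
        block.clear()

    for raw in text.splitlines():
        cells = _cells(raw)
        same_width = not block or len(cells) == len(block[0][1])
        if len(cells) >= 2 and same_width:
            block.append((raw, cells))
            continue
        flush()
        if len(cells) >= 2:
            block.append((raw, cells))
        else:
            out.append(raw)
    flush()
    return "\n".join(out)
```

A heuristic like this will miss tables with merged or wrapped cells, which is the case the vision path is meant to cover.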

New Python packages: Pillow (likely already installed for image rendering). pypdf already installed. No pymupdf needed — vision extraction sends raw page images (PNG) directly to the LLM; the LLM itself identifies table regions.
Existing codebase: No breaking changes. Strategy is additive — existing TokenChunkingStrategy unchanged.
ChromaDB: No schema change. Metadata fields are flexible (ChromaDB accepts arbitrary dict keys).

Cost Estimate (per document)

| Component | Model | Cost | Notes |
| --- | --- | --- | --- |
| PDF text extraction | pypdf | Free | Existing parser |
| Q&A regex splitting | N/A | Free | Pure Python |
| Vision table extraction (PDF) | Same as `LLM_MODEL_NAME` | Image token cost only | Only pages with tables; no separate model cost |
| LLM boundary verification | Same as `LLM_MODEL_NAME` | ~$0.001/document | Edge cases only |
| Embedding | qwen3-embedding-4b | ~$0.00002/page | Existing pipeline |

Single model, single API key. Vision table extraction and boundary verification both use the existing LLM_MODEL_NAME (qwen/qwen3.5-35b-a3b) via the same LLM_BASE_URL/LLM_API_KEY. The only incremental cost is image tokens for pages sent to the vision model — estimated at a few cents per document for typical LegCo papers.
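
Because image tokens are the only incremental cost, the per-page-hash cache listed under Risks & Mitigations is the main lever for keeping re-ingestion cheap. A minimal in-memory sketch, assuming a hypothetical `extract` callable that wraps the vision call and returns markdown for one page image:

```python
import hashlib
from typing import Callable, Dict


class VisionResultCache:
    """Memoize vision output by the SHA-256 of the rendered page bytes."""

    def __init__(self, extract: Callable[[bytes], str]) -> None:
        self._extract = extract          # hypothetical wrapper around the vision LLM call
        self._store: Dict[str, str] = {}
        self.misses = 0                  # number of pages actually sent to the model

    def get_markdown(self, page_png: bytes) -> str:
        key = hashlib.sha256(page_png).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._extract(page_png)
        return self._store[key]
```

A persistent store keyed the same way would let the cache survive restarts; the in-memory dict is only for illustration.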

References

Internal

backend/app/utils/chunking.py — Current chunking strategy
backend/app/routers/ingest.py — Ingestion pipeline
backend/app/utils/metadata.py — Metadata extraction
test materials/LTT/ — Test documents (4 LegCo PDFs)

External

FAQ-RAG (QA-native chunking): https://github.com/Pro-GenAI/FAQ-RAG
gptpdf (vision PDF→Markdown): https://github.com/CosmosShadow/gptpdf
llm-based-ocr (production PDF→MD): https://github.com/yigitkonur/llm-based-ocr
semchunk (AI-powered chunking): https://github.com/isaacus-dev/semchunk
2026 Document Chunking Benchmark (36 strategies): https://arxiv.org/html/2603.06976
EACL 2026 RAG Error Taxonomy: https://wiki.charleschen.ai (29.7% errors from chunking)
LumberChunker (LLM semantic boundary detection): https://aclanthology.org/2024.findings-emnlp.377
MultiDocFusion (layout+LLM hierarchy): https://aclanthology.org/2025.emnlp-main.1062
Hong Kong LegCo transcript dataset: https://github.com/Swithord/hong-kong-legco-transcript
LegCo Open Data API: https://www.legco.gov.hk/en/open-legco/open-data/hansard-database.html
LegCo Questions API: https://www.legco.gov.hk/en/open-legco/open-data/questions-at-council-meetings.html
