docs: Package 8 — switch to LLM-first structure detection (not regex-first)

LegCo documents use multiple formats (問/答 markers, Q1/Q2 numbering, section headings like '(1) 住戶的安置補償', 發言要點 bullet points, and pure table pages). Regex alone cannot reliably classify all these.

Changes:
- Primary detection: LLM call identifies ALL section types in one pass (qa, narrative, speaking_notes, table, toc, heading_only)
- Regex: downgraded to optional fast-pass optimization for known patterns
- Architecture diagram, algorithm detail, risks, and test plan all updated
- Single model handles structure detection + table extraction + verification
2026-05-15 11:34:24 +08:00 · 2026-05-15 11:34:24 +08:00 · 6bf04cedb1
parent 29b4713f22
commit 6bf04cedb1
1 changed file with 132 additions and 108 deletions
--- a/.plans/package8_enhancement_plan.md
+++ b/.plans/package8_enhancement_plan.md
@@ -144,12 +144,9 @@ Four LegCo supplementary papers in `test materials/LTT/`:

 **1. FAQ-RAG (Pro-GenAI/FAQ-RAG)**: Treats each Q&A as an atomic retrieval unit. Stores dual embeddings per pair (question + answer). Gold standard for QA-native documents. **Adaptation**: Instead of generating FAQs, detect existing Q&A pairs in LegCo documents.

-**2. LLM Boundary Detection (LlamaChunk / semchunk)**: Uses LLM logprobs or prompt-based boundary detection to identify semantic breakpoints. semchunk AI mode achieves 37.7% correctness vs 33.3% for fixed-size chunking on legal RAG benchmarks. **Relevance**: For edge cases where regex fails (nested Q&A, tables spanning boundaries), use an LLM verification pass.
+**2. LLM Boundary Detection (LlamaChunk / semchunk)**: Uses LLM logprobs or prompt-based boundary detection to identify semantic breakpoints. semchunk AI mode achieves 37.7% correctness vs 33.3% for fixed-size chunking on legal RAG benchmarks. **Relevance**: Primary detection approach — LLM identifies all section types (Q&A, narrative, speaking notes, tables) in one call. Handles format variations that regex cannot (section headings without Q&A markers, mixed formats).

-**3. Regex + LLM Hybrid**: Two-pass approach recommended by multiple sources:
- Pass 1: Regex-based Q&A boundary detection (free, fast, 95%+ accurate for marked documents)
- Pass 2: LLM verification for edge cases (multi-paragraph answers, embedded tables, nested Q&A)
- Validated by 2026 EACL RAG error taxonomy: 29.7% of RAG errors from chunking mistakes
+**3. Regex + LLM Hybrid**: Used as optimization — regex fast-pass for documents with clear markers (問/答, Q1/Q2), falling back to LLM for everything else. LLM handles all format variations (narrative sections, speaking notes, section headings without Q&A markers) that regex cannot.

 **4. Index-RAG**: Multi-embedding per chunk (question text + answer summary + topic + keywords). Optional enhancement — store multiple embeddings per Q&A pair for richer retrieval.

@@ -169,19 +166,25 @@ Four LegCo supplementary papers in `test materials/LTT/`:
 PDF Transcript
    │
    ▼
-[gptpdf / llm-ocr]  ← Vision LLM converts page images to markdown
-    │                  Tables rendered as markdown tables inline
-    ▼
-Clean Markdown with 問/答 markers preserved
+[Preprocess] Concatenate pages, strip footers/headers, normalize
    │
    ▼
-[Regex Splitter]  ← Primary: split on 問...答 boundaries
-    │               Fallback: split on Q1, Q2... boundaries
+[LLM Structure Detection]  ← Single call: identify ALL section boundaries
+    │                         Classify each as qa / narrative / speaking_notes / table
+    │                         Same model as everything else (Qwen3.5-35B-A3B)
    ▼
-[LLM Boundary Verifier]  ← Edge cases: nested Q&A, tables at boundaries
+[Regex fast-pass]  ← Optional: skip LLM for documents with 100%-regex-detectable format
    │
    ▼
-Q&A Pair Chunks → extract_metadata(qa_style, question_index, ...)
+[Chunk Builder]  ← For QA: one chunk per pair
+    │               For narrative: paragraph-level split with heading prepended
+    │               For speaking_notes: one chunk per bullet with topic prepended
+    │               For table: standalone chunk or merged into parent section
+    ▼
+[Table Extraction]  ← Vision LLM (same model, base64 images) for pages with tables
+    │                 → markdown tables injected into chunk text
+    ▼
+Q&A/Narrative Chunks → extract_metadata(section_type, heading, qa_id?, ...)
    │
    ▼
 ChromaDB (with strategy_type="question" in metadata)
@@ -200,113 +203,132 @@ POST /api/v1/ingest?strategy=token|question
    │
    └─ strategy="question" → QuestionChunkingStrategy (NEW)
         │
-          ├─ [PDF only] Render pages as images → LLM vision (base64) → markdown
-          │   └─ Uses existing Qwen3.5-35B-A3B (native vision model, no separate API)
-          │   └─ Tables converted to markdown table syntax inline
+         ├─ [Preprocess] Concatenate pages, strip footers/headers, normalize
         │
-         ├─ [Detect Q&A style] Chinese 問/答 vs English Q-only
+         ├─ [LLM Structure Detection] Send full document text to LLM
+         │   │  Same model as everything else (Qwen3.5-35B-A3B)
+         │   │  Prompt: "Identify all sections. Classify each as Q&A pair,
+         │   │           narrative section, table section, or speaking notes.
+         │   │           For each Q&A, output question text, answer text,
+         │   │           start/end boundaries."
+         │   └─ Returns: [{ type: "qa"|"narrative"|"table"|"speaking_notes",
+         │                   qa_id?, question?, answer?, heading?, start_page, end_page }, ...]
         │
-         ├─ [Primary split] Regex-based Q&A boundary detection
-         │   ├─ Chinese: split on `問\s*[A-Z]\d+\s*[︰：]` → `答\s*([A-Z]\d+)`
-         │   └─ English: split on `^Q\d+\s` → next Q/section = answer end
+         ├─ [Regex fast-pass] Skip LLM when it is disabled or the document matches known patterns
+         │   └─ 問/答 markers and Q1/Q2 lines as fast-pass optimization
         │
-         ├─ [LLM verification] Edge case handling:
-         │   ├─ Multi-paragraph answers exceeding max chunk size
-         │   ├─ Tables spanning Q&A boundaries (keep with answer)
-         │   ├─ Nested Q&A (如被追問 within answers)
-         │   └─ Ambiguous boundaries (O1-O4 style without explicit 問/答)
+         ├─ [Size check] Oversized Q&A answers → recursive split on \n\n → \n
+         │   └─ Question text prepended to each sub-chunk for context
         │
-         ├─ [Size check] Oversized pairs → recursive split on \n\n → \n
-         │   └─ Split answer portions only; keep question + answer prefix together
+         ├─ [Table extraction] Vision LLM per page with tables → markdown
+         │   └─ Same model: base64 page images via standard Chat Completions API
         │
-         ├─ [Fallback] Sections with no Q&A (發言要點, narrative policy)
-         │   └─ Chunk by section headers (章節) or fall back to token-based
+         ├─ [Narrative sections] Sections with no Q&A (發言要點, policy text)
+         │   └─ Chunk by section heading boundaries with heading+parent prepended
         │
         └─ [Metadata] Per-chunk:
              ├─ strategy_type: "question"
-              ├─ qa_style: "chinese" | "english"
-              ├─ question_index: int (0-based within document)
-              ├─ question_id: str ("A1", "Q3", etc.)
-              ├─ question_text: str (full question text)
+              ├─ section_type: "qa" | "narrative" | "speaking_notes" | "table"
+              ├─ question_index: int (0-based, Q&A only)
+              ├─ question_id: str ("A1", "Q3", etc., Q&A only)
+              ├─ question_text: str (full question text, Q&A only)
+              ├─ section_heading: str (e.g. "(A) 排水系統", "(1) 住戶的安置補償")
              ├─ answer_contains_table: bool
-              ├─ topic_section: str (e.g. "(A) 排水系統")
-              └─ source_page_range: [int, int] (page range of Q&A pair)
+              └─ source_page_range: [int, int]
 ```

-### Algorithm Detail: Q&A Boundary Detection
+### Algorithm Detail: LLM Structure Detection (Primary)
+
+LegCo documents use multiple formats — explicit 問/答 markers, Q1/Q2 numbering, section headings like `(1) 住戶的安置補償`, 發言要點 bullet points, and pure table pages. Regex alone cannot reliably classify all these. The LLM can.

 **Step 1 — Preprocess extracted text**:
 1. Concatenate all pages from `parse_pdf_by_page()` with `\n` separator
 2. Strip page footer markers: regex `^[A-Z]-\d+\s*\n\d{4}-\d{2}-\d{2}$` and `^[A-Z]-\d+$`
 3. Strip page header letters `^(\([A-Z]\))\s*$` (keep on page 1 as doc identifier)
 4. Normalize colons: `：` → `:`, `︰` → `:`
+5. Insert page boundary markers: `[PAGE_BREAK: N]` at each page boundary so the LLM knows where pages start/end
+
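Reviewer sketch (not part of the commit): one way Step 1 could look in Python, assuming `parse_pdf_by_page()` yields `(page_number, text)` tuples as the planned `preprocess_text` signature suggests. The compiled pattern names are hypothetical, and "page 1" is approximated as the first tuple in the list.

```python
import re
from typing import List, Tuple

# Footer such as "A-12" followed by a date line, then the bare "A-12" form (Step 1.2).
FOOTER_WITH_DATE = re.compile(r"^[A-Z]-\d+\s*\n\d{4}-\d{2}-\d{2}$", re.MULTILINE)
FOOTER_ONLY = re.compile(r"^[A-Z]-\d+$", re.MULTILINE)
# A header letter such as "(A)" alone on a line (Step 1.3).
HEADER_LETTER = re.compile(r"^\([A-Z]\)\s*$", re.MULTILINE)


def preprocess_text(pages: List[Tuple[int, str]]) -> str:
    """Strip footers/headers, normalize colons, and mark page boundaries."""
    parts = []
    for index, (page_no, text) in enumerate(pages):
        text = FOOTER_WITH_DATE.sub("", text)
        text = FOOTER_ONLY.sub("", text)
        if index > 0:
            # Keep the header letter on the first page as the document identifier.
            text = HEADER_LETTER.sub("", text)
        # Normalize both full-width colon variants to ASCII ":".
        text = text.replace("：", ":").replace("︰", ":")
        parts.append(f"[PAGE_BREAK: {page_no}]\n{text.strip()}")
    return "\n".join(parts)
```

Because colons are normalized before any later stage runs, any regex applied to this output has to accept the ASCII `:` as well as the full-width forms.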
+**Step 2 — LLM Structure Detection (single call)**:
+
+Send the preprocessed text to the LLM (same model as everything else: `LLM_MODEL_NAME`). Ask it to identify every section boundary and classify each section:

-**Step 2 — Detect Q&A style**:
-```python
-def detect_qa_style(text: str) -> str:
-    chinese_pattern = r'問\s*[A-Z]\d+\s*[︰：]'
-    english_pattern = r'^Q\d+\s'
-    if re.search(chinese_pattern, text, re.MULTILINE):
-        return "chinese"
-    if re.search(english_pattern, text, re.MULTILINE):
-        return "english"
-    return "none"  # fall back to token-based
 ```
+You are analyzing a Hong Kong Legislative Council document.
+The text has page markers like [PAGE_BREAK: N] showing where pages begin.

-**Step 3 — Split by Q&A (Chinese style)**:
-```python
-QA_PATTERN_CHINESE = re.compile(
-    r'(?:^|\n)\s*問\s*([A-Z]\d+)\s*[︰：]\s*(.*?)\s*'
-    r'(?:\n\s*答\s*\1\s*[︰：]\s*)'  # matched answer marker
-    r'(.*?)'
-    r'(?=\n\s*(?:問\s*[A-Z]\d+\s*[︰：]|\([A-Z]\)|$))',
-    re.DOTALL
-)
-# Group 1: question_id (e.g. "A1")
-# Group 2: question_text
-# Group 3: answer_text (everything until next 問 or section header or EOF)
-```
+For each distinct section in this document, identify:
+1. The section type:
+   - "qa": a question-and-answer pair (問/答 or Q1/Q2 format)
+   - "narrative": policy text, explanatory paragraphs, section content with bullets
+   - "speaking_notes": briefing points (發言要點) with bullet markers
+   - "table": standalone data tables (not embedded in answers)
+   - "toc": table of contents
+   - "heading_only": a section heading with no following content

-**Step 3b — Split by Q&A (English style)**:
-```python
-QA_PATTERN_ENGLISH = re.compile(
-    r'(?:^|\n)\s*(Q\d+)\s+(.*?)\s*'
-    r'(?=\n\s*(?:Q\d+|[A-Z]\d+\s|'
-    r'[\u4e00-\u9fff]{2,}\s*$|'  # Chinese-only topic line
-    r'$))',
-    re.DOTALL
-)
-# Group 1: question_id (e.g. "Q1")
-# Group 2: full text (question + answer combined, since no explicit A marker)
-```
+2. For "qa" sections:
+   - The question text (exact)
+   - The answer text (exact, including tables, bullet lists, and [內部參考] content)
+   - The question ID if present (e.g. "A1", "Q3")
+   - The start page and end page

-**Step 4 — LLM verification pass** (for edge cases only):
-For each Q&A pair, if any of these conditions are true, send pair to LLM for verification:
- Answer length > 2000 tokens (may contain multiple Q&A or tables)
- Contains `[如被追問` or `[如追問` (nested Q&A)
- Contains `[內部參考` or `[內部補充` (supplementary content to handle)
- Answer text contains regex pattern for table rows
-
-LLM prompt:
-```
-You are processing a Hong Kong Legislative Council document. 
-Analyze this Q&A pair and determine:
-
-1. Does the answer contain tables? If yes, output them as markdown tables.
-2. Are there nested followup Q&A (追問) within the answer? If yes, separate them.
-3. Does the answer contain [內部參考] supplementary content? Flag it.
-4. Is the Q&A boundary correct? If not, suggest the correct boundary.
+3. For all sections:
+   - The section heading (e.g. "(A) 排水系統", "(1) 住戶的安置補償")
+   - The start page and end page
+   - Whether the section contains tables

 Return JSON:
 {
-  "correct_boundary": true/false,
-  "tables_markdown": ["table1...", "table2..."],
-  "nested_qa": [{"q": "...", "a": "..."}],
-  "internal_refs": ["ref1...", "ref2..."],
-  "clean_answer": "answer text with tables replaced by markdown"
+  "sections": [
+    {
+      "type": "qa",
+      "heading": "(A) 排水系統",
+      "qa_id": "A1",
+      "question": "古洞北和粉嶺北新發展區的設計是否能抵禦氣候變化...",
+      "answer": "研究顧問已為古洞北和粉嶺北新發展區研究範圍進行了...",
+      "start_page": 2,
+      "end_page": 3,
+      "has_table": true,
+      "parent_topic": "排水系統"
+    },
+    {
+      "type": "narrative",
+      "heading": "(1) 住戶的安置補償",
+      "content": "合資格住戶可選擇(i)「須通過經濟狀況審查」安置...",
+      "start_page": 2,
+      "end_page": 5,
+      "has_table": false
+    },
+    {
+      "type": "speaking_notes",
+      "heading": "發言要點",
+      "content": "⚫ 古洞北／粉嶺北新發展區（新發展區）是北部都會區內...",
+      "start_page": 1,
+      "end_page": 2,
+      "has_table": false
+    }
+  ]
 }
 ```
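
Reviewer sketch (not part of the commit): a possible shape for `parse_llm_structure_response`, assuming the LLM reply has already been decoded into a `dict` matching the JSON above. The field defaults, the `ALLOWED_TYPES` name, and the policy of silently skipping malformed entries are assumptions; the real implementation may prefer to log or raise.

```python
from dataclasses import dataclass
from typing import List, Optional

ALLOWED_TYPES = {"qa", "narrative", "speaking_notes", "table", "toc", "heading_only"}


@dataclass
class Section:
    type: str
    start_page: int
    end_page: int
    heading: Optional[str] = None
    qa_id: Optional[str] = None
    question: Optional[str] = None
    answer: Optional[str] = None
    content: Optional[str] = None
    has_table: bool = False
    parent_topic: Optional[str] = None


def parse_llm_structure_response(response: dict) -> List[Section]:
    """Convert the decoded JSON into Section objects, dropping unusable entries."""
    sections: List[Section] = []
    for raw in response.get("sections", []):
        kind = raw.get("type")
        if kind not in ALLOWED_TYPES:
            continue  # unknown label: skip instead of guessing
        if kind == "qa" and not (raw.get("question") and raw.get("answer")):
            continue  # a Q&A section needs both halves
        try:
            start = int(raw["start_page"])
            end = int(raw.get("end_page", start))
        except (KeyError, TypeError, ValueError):
            continue  # page range missing or not numeric
        sections.append(
            Section(
                type=kind,
                start_page=start,
                end_page=max(start, end),  # guard against an inverted range
                heading=raw.get("heading"),
                qa_id=raw.get("qa_id"),
                question=raw.get("question"),
                answer=raw.get("answer"),
                content=raw.get("content"),
                has_table=bool(raw.get("has_table", False)),
                parent_topic=raw.get("parent_topic"),
            )
        )
    return sections
```

Validating here keeps the chunk builder free of defensive checks, since LLM output is not guaranteed to follow the requested schema.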

+**Step 3 — Chunk building from LLM output**:
+
+For each section in the JSON response:
+
+| Section type | Chunking rule |
+|-------------|---------------|
+| `qa` | One chunk = one Q&A pair. Question text + answer text concatenated. |
+| `narrative` | Split into chunks at paragraph boundaries (`\n\n`), respecting `qa_max_chunk_tokens`. Each chunk prepended with section heading for context. |
+| `speaking_notes` | Each bullet point (⚫) becomes a chunk, with topic heading prepended. |
+| `table` | Each table becomes a chunk. If adjacent to a narrative section, merged into that section's chunks. |
+| `toc` / `heading_only` | Skipped (not stored as chunks — no retrievable content). |
+
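Reviewer sketch (not part of the commit): the chunking rules in the table above, written against plain dicts shaped like the LLM JSON rather than the planned `Section` dataclass. `approx_tokens` is a crude character-count stand-in for a real tokenizer, a `content` field on `table` sections is assumed, and two rules are deliberately left out: merging tables into adjacent narrative sections and recursive splitting of oversized Q&A answers.

```python
from typing import Dict, List, Tuple

SKIPPED_TYPES = {"toc", "heading_only"}


def approx_tokens(text: str) -> int:
    # Hypothetical stand-in: one character counts as one token.
    return len(text)


def split_paragraphs(text: str, heading: str, max_tokens: int) -> List[str]:
    """Greedily pack paragraphs (split on blank lines) up to max_tokens."""
    chunks: List[str] = []
    current = ""
    for para in text.split("\n\n"):
        candidate = f"{current}\n\n{para}" if current else para
        if current and approx_tokens(candidate) > max_tokens:
            chunks.append(current)
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    # Prepend the section heading to every chunk for retrieval context.
    return [f"{heading}\n{chunk}" for chunk in chunks]


def build_chunks_from_sections(
    sections: List[Dict], max_tokens: int
) -> List[Tuple[str, int, Dict]]:
    """Return (chunk_text, start_page, metadata) tuples."""
    out: List[Tuple[str, int, Dict]] = []
    for sec in sections:
        kind = sec["type"]
        if kind in SKIPPED_TYPES:
            continue  # no retrievable content
        heading = sec.get("heading", "")
        meta = {
            "strategy_type": "question",
            "section_type": kind,
            "section_heading": heading,
            "source_page_range": [sec["start_page"], sec["end_page"]],
        }
        if kind == "qa":
            texts = [f"{sec['question']}\n{sec['answer']}"]
        elif kind == "speaking_notes":
            bullets = [b.strip() for b in sec["content"].split("⚫") if b.strip()]
            texts = [f"{heading}\n⚫ {b}" for b in bullets]
        elif kind == "table":
            texts = [f"{heading}\n{sec['content']}"]
        else:  # narrative
            texts = split_paragraphs(sec["content"], heading, max_tokens)
        out.extend((text, sec["start_page"], dict(meta)) for text in texts)
    return out
```

Each chunk gets its own copy of the metadata dict so that later per-chunk fields (for example `question_index`) can be added without affecting sibling chunks.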
+**Step 4 — Regex fast-pass (optional optimization)**:
+
+If `qa_vision_enabled=false`, or for cost optimization, use regex as a fast pre-filter. The LLM call is skipped for documents where regex confidently identifies all boundaries. It is still made when `qa_verification_model` is not set to `"none"` and either of the following holds:
+- No regex pattern matches (unknown format)
+- Regex produces < 2 sections (likely misdetection)
+
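Reviewer sketch (not part of the commit): one reading of the fast-pass decision in Step 4. The patterns are adapted from the regexes this commit removes, widened to accept the ASCII `:` that Step 1 normalization produces. The function names and the `"default"` placeholder value are hypothetical, and treating `qa_verification_model="none"` as "never call the LLM" is an interpretation of the plan, not something it states outright.

```python
import re

# Question markers for the two known formats; ":" is accepted because
# preprocessing normalizes full-width colons to ASCII.
CHINESE_Q = re.compile(r"^\s*問\s*[A-Z]\d+\s*[:：︰]", re.MULTILINE)
ENGLISH_Q = re.compile(r"^Q\d+\s", re.MULTILINE)


def count_regex_sections(text: str) -> int:
    """Number of question markers found by the better-matching pattern."""
    return max(len(CHINESE_Q.findall(text)), len(ENGLISH_Q.findall(text)))


def should_call_llm(text: str, qa_verification_model: str = "default") -> bool:
    """Decide whether the LLM structure-detection call is needed."""
    if qa_verification_model == "none":
        return False  # LLM disabled: the regex result is used as is
    # Zero matches means an unknown format; one match is likely a misdetection.
    return count_regex_sections(text) < 2
```

A document with two or more recognised question markers takes the regex path; everything else, including narrative-only and speaking-notes documents, goes to the LLM.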
 ### Algorithm Detail: Table-to-Markdown

 **Option A — Vision LLM per page (Primary for PDF)**:
@@ -403,24 +425,27 @@ class Settings(BaseSettings):

 **Test Files**:
 - `backend/app/test/test_phase8_qa_chunking.py` — new file
-  - `test_chinese_qa_detection` — detect and split Chinese 問/答 pairs
-  - `test_english_qa_detection` — detect and split English Q-only pairs
-  - `test_no_qa_fallback` — section-based chunking when no Q&A found
-  - `test_multi_page_qa` — Q&A spanning page boundaries
-  - `test_nested_qa_handling` — [如被追問] patterns
-  - `test_table_preservation` — tables embedded in answers
+  - `test_llm_structure_detection` — LLM correctly identifies Q&A, narrative, speaking_notes sections
+  - `test_llm_handles_mixed_formats` — document with 問/答 + section headings + narrative correctly classified
+  - `test_llm_handles_no_qa_format` — narrative-only documents (File L pages 1-13) correctly chunked by section
+  - `test_llm_handles_speaking_notes` — 發言要點 sections chunked by bullet point
+  - `test_regex_fastpass_chinese` — known 問/答 format detected without LLM (when configured)
+  - `test_regex_fastpass_english` — known Q1/Q2 format detected without LLM
+  - `test_multi_page_sections` — sections spanning page boundaries tracked correctly
  - `test_abc_contract` — satisfies ChunkingStrategy ABC
-  - `test_page_number_reference_question` — page ref points to 問 location
-  - `test_size_limit` — oversized pairs get recursive split
+  - `test_page_number_reference_question` — page ref points to question location for Q&A chunks
+  - `test_size_limit` — oversized sections get recursive split with heading preserved

 **Implementation Tasks**:

 1. **Create** `backend/app/utils/qa_chunking.py`:
-   - `detect_qa_style(text: str) -> str` — returns "chinese", "english", or "none"
-   - `preprocess_text(pages: List[Tuple[int, str]]) -> str` — strip footers/headers, normalize colons, reconstruct full text
-   - `split_chinese_qa(text: str) -> List[QAPair]` — regex-based splitting for 問/答 format
-   - `split_english_qa(text: str) -> List[QAPair]` — regex-based splitting for Q-only format
-   - `QAPair` dataclass: `{qa_id, question, answer, start_page, end_page, topic_section, has_table, internal_refs}`
+   - `preprocess_text(pages: List[Tuple[int, str]]) -> str` — strip footers/headers, normalize colons, insert page markers, reconstruct full text
+   - `build_structure_detection_prompt(text: str) -> str` — construct the LLM prompt for section classification
+   - `parse_llm_structure_response(response: dict) -> List[Section]` — parse the JSON returned by the LLM
+   - `Section` dataclass: `{type, heading, qa_id?, question?, answer?, content?, start_page, end_page, has_table, parent_topic}`
+   - `split_chinese_qa(text: str) -> List[Section]` — regex fast-pass for known 問/答 format
+   - `split_english_qa(text: str) -> List[Section]` — regex fast-pass for known Q-number format
+   - `build_chunks_from_sections(sections: List[Section], max_tokens: int) -> List[Tuple[str, int, dict]]` — build chunk texts + page refs + metadata from sections

 2. **Create** `QuestionChunkingStrategy(ChunkingStrategy)` in `backend/app/utils/chunking.py`:
   - `chunk(text: str) -> List[str]` — entry point for DOCX/TXT
@@ -647,12 +672,11 @@ class Settings(BaseSettings):
 | Risk | Severity | Mitigation |
 |------|----------|------------|
 | **Vision LLM cost**: Per-page image processing adds latency and token cost for large docs | Medium | Cache vision results per page hash. Skip vision for pages without detected tables. `qa_vision_enabled` toggle allows disabling vision and using heuristic-only extraction. Vision uses the same model/API key — no additional provider cost. |
-| **Q&A boundary false positives**: Regex matches non-Q&A text as Q&A | High | Use `detect_qa_style()` first — only apply Q&A splitting if document has sufficient Q&A density. LLM verification pass for ambiguous boundaries. Fall back to section-based chunking if < 3 pairs detected. |
-| **Q&A boundary false negatives**: Regex misses some Q&A pairs (O1-O4 style) | Medium | LLM verification pass catches these. Log missed pairs for manual review. |
+| **Q&A boundary detection**: LegCo documents use multiple formats (問/答, Q1/Q2, section headings, 發言要點) that regex alone cannot handle | Medium | LLM structure detection as primary approach handles all format variations in one call. Regex fast-pass only used for optimization on known patterns. Fall back to section-based chunking if LLM detects no Q&A content. |
 | **Table extraction quality**: Vision LLM may misinterpret complex Chinese tables | Medium | Flag `has_table=True` so retrieval can surface context. Allow user to inspect raw chunk content. Provide config to disable vision and use heuristic-only extraction. |
 | **Performance**: Vision LLM + LLM verification adds latency to ingest | Medium | Make vision extraction async (background task). Show progress in frontend. Cache results. Allow skipping verification for trusted documents. |
 | **Page boundary interference**: Page footer numbers appear mid-Q&A text | Low | Preprocessing step strips known footer patterns. Test with all 4 LTT PDFs. |
-| **LegCo format drift**: Future documents may use different Q&A markers | Low | Detection is regex-based — easy to add new patterns. LLM verification catches novel formats. Log format detection results for monitoring. |
+| **LegCo format drift**: Future documents may use different formats or section structures | Low | LLM structure detection adapts to new formats without code changes — the LLM prompt describes what to look for, not how. Add new format examples to the prompt if needed. |
 | **Chunk size**: Some Q&A pairs are very long (7+ pages) | Medium | Apply max chunk token limit (configurable, default 3000). Recursive split on `\n\n` → `\n` with question text prepended to each sub-chunk for context. |
 | **DOCX/TXT Q&A**: Non-PDF formats may have different Q&A markers | Low | Use same regex detection on concatenated text. Skip vision table extraction (text-based only). |
 | **vLLM vision compatibility**: vLLM may not support vision API for Qwen3.5-35B-A3B depending on version and how the model is served | Medium | Test with a single vision call against your vLLM instance before implementation. Set `QA_VISION_ENABLED=false` and use heuristic fallback (Option B) if unsupported. See vLLM compatibility note in Dependencies below. |