docs: Package 8 — switch to LLM-first structure detection (not regex-first)

LegCo documents use multiple formats (問/答 markers, Q1/Q2 numbering,
section headings like '(1) 住戶的安置補償', 發言要點 bullet points,
and pure table pages). Regex alone cannot reliably classify all these.

Changes:
- Primary detection: LLM call identifies ALL section types in one pass
  (qa, narrative, speaking_notes, table, toc, heading_only)
- Regex: downgraded to optional fast-pass optimization for known patterns
- Architecture diagram, algorithm detail, risks, and test plan all updated
- Single model handles structure detection + table extraction + verification
Woody 2026-05-15 11:34:24 +08:00
parent 29b4713f22
commit 6bf04cedb1
1 changed file with 132 additions and 108 deletions


@ -144,12 +144,9 @@ Four LegCo supplementary papers in `test materials/LTT/`:
**1. FAQ-RAG (Pro-GenAI/FAQ-RAG)**: Treats each Q&A as an atomic retrieval unit. Stores dual embeddings per pair (question + answer). Gold standard for QA-native documents. **Adaptation**: Instead of generating FAQs, detect existing Q&A pairs in LegCo documents. **1. FAQ-RAG (Pro-GenAI/FAQ-RAG)**: Treats each Q&A as an atomic retrieval unit. Stores dual embeddings per pair (question + answer). Gold standard for QA-native documents. **Adaptation**: Instead of generating FAQs, detect existing Q&A pairs in LegCo documents.
**2. LLM Boundary Detection (LlamaChunk / semchunk)**: Uses LLM logprobs or prompt-based boundary detection to identify semantic breakpoints. semchunk AI mode achieves 37.7% correctness vs 33.3% for fixed-size chunking on legal RAG benchmarks. **Relevance**: For edge cases where regex fails (nested Q&A, tables spanning boundaries), use an LLM verification pass. **2. LLM Boundary Detection (LlamaChunk / semchunk)**: Uses LLM logprobs or prompt-based boundary detection to identify semantic breakpoints. semchunk AI mode achieves 37.7% correctness vs 33.3% for fixed-size chunking on legal RAG benchmarks. **Relevance**: Primary detection approach — LLM identifies all section types (Q&A, narrative, speaking notes, tables) in one call. Handles format variations that regex cannot (section headings without Q&A markers, mixed formats).
**3. Regex + LLM Hybrid**: Two-pass approach recommended by multiple sources: **3. Regex + LLM Hybrid**: Used as optimization — regex fast-pass for documents with clear markers (問/答, Q1/Q2), falling back to LLM for everything else. LLM handles all format variations (narrative sections, speaking notes, section headings without Q&A markers) that regex cannot.
- Pass 1: Regex-based Q&A boundary detection (free, fast, 95%+ accurate for marked documents)
- Pass 2: LLM verification for edge cases (multi-paragraph answers, embedded tables, nested Q&A)
- Validated by 2026 EACL RAG error taxonomy: 29.7% of RAG errors from chunking mistakes
**4. Index-RAG**: Multi-embedding per chunk (question text + answer summary + topic + keywords). Optional enhancement — store multiple embeddings per Q&A pair for richer retrieval. **4. Index-RAG**: Multi-embedding per chunk (question text + answer summary + topic + keywords). Optional enhancement — store multiple embeddings per Q&A pair for richer retrieval.
@ -169,19 +166,25 @@ Four LegCo supplementary papers in `test materials/LTT/`:
PDF Transcript PDF Transcript
[gptpdf / llm-ocr] ← Vision LLM converts page images to markdown [Preprocess] Concatenate pages, strip footers/headers, normalize
│ Tables rendered as markdown tables inline
Clean Markdown with 問/答 markers preserved
[Regex Splitter] ← Primary: split on 問...答 boundaries [LLM Structure Detection] ← Single call: identify ALL section boundaries
│ Fallback: split on Q1, Q2... boundaries │ Classify each as qa / narrative / speaking_notes / table
│ Same model as everything else (Qwen3.5-35B-A3B)
[LLM Boundary Verifier] ← Edge cases: nested Q&A, tables at boundaries [Regex fast-pass] ← Optional: skip LLM for documents with 100%-regex-detectable format
Q&A Pair Chunks → extract_metadata(qa_style, question_index, ...) [Chunk Builder] ← For QA: one chunk per pair
│ For narrative: paragraph-level split with heading prepended
│ For speaking_notes: one chunk per bullet with topic prepended
│ For table: standalone chunk or merged into parent section
[Table Extraction] ← Vision LLM (same model, base64 images) for pages with tables
│ → markdown tables injected into chunk text
Q&A/Narrative Chunks → extract_metadata(section_type, heading, qa_id?, ...)
ChromaDB (with strategy_type="question" in metadata) ChromaDB (with strategy_type="question" in metadata)
@ -200,113 +203,132 @@ POST /api/v1/ingest?strategy=token|question
└─ strategy="question" → QuestionChunkingStrategy (NEW) └─ strategy="question" → QuestionChunkingStrategy (NEW)
├─ [PDF only] Render pages as images → LLM vision (base64) → markdown ├─ [Preprocess] Concatenate pages, strip footers/headers, normalize
│ └─ Uses existing Qwen3.5-35B-A3B (native vision model, no separate API)
│ └─ Tables converted to markdown table syntax inline
├─ [Detect Q&A style] Chinese 問/答 vs English Q-only ├─ [LLM Structure Detection] Send full document text to LLM
│ │ Same model as everything else (Qwen3.5-35B-A3B)
│ │ Prompt: "Identify all sections. Classify each as Q&A pair,
│ │ narrative section, table section, or speaking notes.
│ │ For each Q&A, output question text, answer text,
│ │ start/end boundaries."
│ └─ Returns: [{ type: "qa"|"narrative"|"table"|"speaking_notes",
│ qa_id?, question?, answer?, heading?, start_page, end_page }, ...]
├─ [Primary split] Regex-based Q&A boundary detection ├─ [Regex fallback] Skip LLM when disabled or for known patterns
│ ├─ Chinese: split on `問\s*[A-Z]\d+\s*[：:]` → `答\s*([A-Z]\d+)` │ └─ 問/答 markers and Q1/Q2 lines as fast-pass optimization
│ └─ English: split on `^Q\d+\s` → next Q/section = answer end
├─ [LLM verification] Edge case handling: ├─ [Size check] Oversized Q&A answers → recursive split on \n\n → \n
│ ├─ Multi-paragraph answers exceeding max chunk size │ └─ Question text prepended to each sub-chunk for context
│ ├─ Tables spanning Q&A boundaries (keep with answer)
│ ├─ Nested Q&A (如被追問 within answers)
│ └─ Ambiguous boundaries (O1-O4 style without explicit 問/答)
├─ [Size check] Oversized pairs → recursive split on \n\n → \n ├─ [Table extraction] Vision LLM per page with tables → markdown
│ └─ Split answer portions only; keep question + answer prefix together │ └─ Same model: base64 page images via standard Chat Completions API
├─ [Fallback] Sections with no Q&A (發言要點, narrative policy) ├─ [Narrative sections] Sections with no Q&A (發言要點, policy text)
│ └─ Chunk by section headers (章節) or fall back to token-based │ └─ Chunk by section heading boundaries with heading+parent prepended
└─ [Metadata] Per-chunk: └─ [Metadata] Per-chunk:
├─ strategy_type: "question" ├─ strategy_type: "question"
├─ qa_style: "chinese" | "english" ├─ section_type: "qa" | "narrative" | "speaking_notes"
├─ question_index: int (0-based within document) ├─ question_index: int (0-based, Q&A only)
├─ question_id: str ("A1", "Q3", etc.) ├─ question_id: str ("A1", "Q3", etc., Q&A only)
├─ question_text: str (full question text) ├─ question_text: str (full question text, Q&A only)
├─ section_heading: str (e.g. "(A) 排水系統", "(1) 住戶的安置補償")
├─ answer_contains_table: bool ├─ answer_contains_table: bool
├─ topic_section: str (e.g. "(A) 排水系統") └─ source_page_range: [int, int]
└─ source_page_range: [int, int] (page range of Q&A pair)
``` ```
### Algorithm Detail: Q&A Boundary Detection ### Algorithm Detail: LLM Structure Detection (Primary)
LegCo documents use multiple formats: explicit 問/答 (question/answer) markers, Q1/Q2 numbering, section headings like `(1) 住戶的安置補償`, 發言要點 (speaking notes) bullet points, and pure table pages. Regex alone cannot reliably classify all of these, so the LLM is used as the primary classifier.
**Step 1 — Preprocess extracted text**: **Step 1 — Preprocess extracted text**:
1. Concatenate all pages from `parse_pdf_by_page()` with `\n` separator 1. Concatenate all pages from `parse_pdf_by_page()` with `\n` separator
2. Strip page footer markers: regex `^[A-Z]-\d+\s*\n\d{4}-\d{2}-\d{2}$` and `^[A-Z]-\d+$` 2. Strip page footer markers: regex `^[A-Z]-\d+\s*\n\d{4}-\d{2}-\d{2}$` and `^[A-Z]-\d+$`
3. Strip page header letters `^(\([A-Z]\))\s*$` (keep on page 1 as doc identifier) 3. Strip page header letters `^(\([A-Z]\))\s*$` (keep on page 1 as doc identifier)
4. Normalize colons: map full-width colon variants such as `：` to ASCII `:`
5. Insert page boundary markers: `[PAGE_BREAK: N]` at each page boundary so the LLM knows where pages start/end
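The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the committed implementation: it reuses the footer and header regexes quoted in the list and the `preprocess_text` signature from the implementation tasks, and it assumes that the only colon variant to normalize is the full-width colon and that each marker is placed at the start of its page.

```python
import re
from typing import List, Tuple

# Footer and header patterns quoted from the preprocessing steps.
FOOTER_WITH_DATE = re.compile(r'^[A-Z]-\d+\s*\n\d{4}-\d{2}-\d{2}$', re.MULTILINE)
FOOTER_ONLY = re.compile(r'^[A-Z]-\d+$', re.MULTILINE)
HEADER_LETTER = re.compile(r'^(\([A-Z]\))\s*$', re.MULTILINE)


def preprocess_text(pages: List[Tuple[int, str]]) -> str:
    """Strip footers/headers, normalize colons, and insert page markers."""
    parts = []
    for index, (page_no, text) in enumerate(pages):
        # Remove the two-line footer first so its page code is not left half-stripped.
        text = FOOTER_WITH_DATE.sub('', text)
        text = FOOTER_ONLY.sub('', text)
        if index > 0:
            # The header letter is kept on the first page as the doc identifier.
            text = HEADER_LETTER.sub('', text)
        # Assumption: only the full-width colon (U+FF1A) needs normalizing.
        text = text.replace('\uff1a', ':')
        parts.append(f'[PAGE_BREAK: {page_no}]\n{text.strip()}')
    # Pages are concatenated with a single newline separator.
    return '\n'.join(parts)
```

The date-bearing footer is stripped before the bare footer because the second pattern would otherwise remove only the page code and leave the date line behind.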
**Step 2 — LLM Structure Detection (single call)**:
Send the preprocessed text to the LLM (same model as everything else: `LLM_MODEL_NAME`). Ask it to identify every section boundary and classify each section:
**Step 2 — Detect Q&A style**:
```python
import re

def detect_qa_style(text: str) -> str:
    # Chinese style: "問 A1:" followed by a full-width or ASCII colon
    chinese_pattern = r'問\s*[A-Z]\d+\s*[：:]'
    # English style: a line that starts with "Q1 ", "Q2 ", ...
    english_pattern = r'^Q\d+\s'
    if re.search(chinese_pattern, text, re.MULTILINE):
        return "chinese"
    if re.search(english_pattern, text, re.MULTILINE):
        return "english"
    return "none"  # fall back to token-based
``` ```
You are analyzing a Hong Kong Legislative Council document.
The text has page markers like [PAGE_BREAK: N] showing where pages begin.
**Step 3 — Split by Q&A (Chinese style)**: For each distinct section in this document, identify:
```python 1. The section type:
QA_PATTERN_CHINESE = re.compile( - "qa": a question-and-answer pair (問/答 or Q1/Q2 format)
r'(?:^|\n)\s*問\s*([A-Z]\d+)\s*[：:]\s*(.*?)\s*' - "narrative": policy text, explanatory paragraphs, section content with bullets
r'(?:\n\s*答\s*\1\s*[：:]\s*)' # matched answer marker - "speaking_notes": briefing points (發言要點) with bullet markers
r'(.*?)' - "table": standalone data tables (not embedded in answers)
r'(?=\n\s*(?:問\s*[A-Z]\d+\s*[：:]|\([A-Z]\)|$))', - "toc": table of contents
re.DOTALL - "heading_only": a section heading with no following content
)
# Group 1: question_id (e.g. "A1")
# Group 2: question_text
# Group 3: answer_text (everything until next 問 or section header or EOF)
```
**Step 3b — Split by Q&A (English style)**: 2. For "qa" sections:
```python - The question text (exact)
QA_PATTERN_ENGLISH = re.compile( - The answer text (exact, including tables, bullet lists, and [內部參考] content)
r'(?:^|\n)\s*(Q\d+)\s+(.*?)\s*' - The question ID if present (e.g. "A1", "Q3")
r'(?=\n\s*(?:Q\d+|[A-Z]\d+\s|' - The start page and end page
r'[\u4e00-\u9fff]{2,}\s*$|' # Chinese-only topic line
r'$))',
re.DOTALL
)
# Group 1: question_id (e.g. "Q1")
# Group 2: full text (question + answer combined, since no explicit A marker)
```
**Step 4 — LLM verification pass** (for edge cases only): 3. For all sections:
For each Q&A pair, if any of these conditions are true, send pair to LLM for verification: - The section heading (e.g. "(A) 排水系統", "(1) 住戶的安置補償")
- Answer length > 2000 tokens (may contain multiple Q&A or tables) - The start page and end page
- Contains `[如被追問` or `[如追問` (nested Q&A) - Whether the section contains tables
- Contains `[內部參考` or `[內部補充` (supplementary content to handle)
- Answer text contains regex pattern for table rows
LLM prompt:
```
You are processing a Hong Kong Legislative Council document.
Analyze this Q&A pair and determine:
1. Does the answer contain tables? If yes, output them as markdown tables.
2. Are there nested followup Q&A (追問) within the answer? If yes, separate them.
3. Does the answer contain [內部參考] supplementary content? Flag it.
4. Is the Q&A boundary correct? If not, suggest the correct boundary.
Return JSON: Return JSON:
{ {
"correct_boundary": true/false, "sections": [
"tables_markdown": ["table1...", "table2..."], {
"nested_qa": [{"q": "...", "a": "..."}], "type": "qa",
"internal_refs": ["ref1...", "ref2..."], "heading": "(A) 排水系統",
"clean_answer": "answer text with tables replaced by markdown" "qa_id": "A1",
"question": "古洞北和粉嶺北新發展區的設計是否能抵禦氣候變化...",
"answer": "研究顧問已為古洞北和粉嶺北新發展區研究範圍進行了...",
"start_page": 2,
"end_page": 3,
"has_table": true,
"parent_topic": "排水系統"
},
{
"type": "narrative",
"heading": "(1) 住戶的安置補償",
"content": "合資格住戶可選擇(i)「須通過經濟狀況審查」安置...",
"start_page": 2,
"end_page": 5,
"has_table": false
},
{
"type": "speaking_notes",
"heading": "發言要點",
"content": "⚫ 古洞北/粉嶺北新發展區(新發展區)是北部都會區內...",
"start_page": 1,
"end_page": 2,
"has_table": false
}
]
} }
``` ```
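One way to turn the JSON shape above into typed records is sketched below. It follows the `Section` fields and the `parse_llm_structure_response` signature listed in the implementation tasks. The validation rules are assumptions made for illustration: entries with an unknown type, a missing `start_page`, or an incomplete Q&A pair are skipped, and a missing `end_page` defaults to `start_page`.

```python
from dataclasses import dataclass
from typing import List, Optional

VALID_TYPES = {"qa", "narrative", "speaking_notes", "table", "toc", "heading_only"}


@dataclass
class Section:
    type: str
    start_page: int
    end_page: int
    heading: str = ""
    qa_id: Optional[str] = None
    question: Optional[str] = None
    answer: Optional[str] = None
    content: Optional[str] = None
    has_table: bool = False
    parent_topic: Optional[str] = None


def parse_llm_structure_response(response: dict) -> List[Section]:
    """Convert the decoded JSON object into Section records, dropping unusable entries."""
    sections: List[Section] = []
    for item in response.get("sections", []):
        kind = item.get("type")
        if kind not in VALID_TYPES or "start_page" not in item:
            continue  # assumption: skip entries the chunk builder cannot place
        if kind == "qa" and not (item.get("question") and item.get("answer")):
            continue  # a Q&A entry needs both halves
        start = int(item["start_page"])
        sections.append(Section(
            type=kind,
            start_page=start,
            end_page=int(item.get("end_page", start)),  # assumption: single-page default
            heading=item.get("heading", ""),
            qa_id=item.get("qa_id"),
            question=item.get("question"),
            answer=item.get("answer"),
            content=item.get("content"),
            has_table=bool(item.get("has_table", False)),
            parent_topic=item.get("parent_topic"),
        ))
    return sections
```

Skipping malformed entries rather than raising keeps a single bad section from failing the whole ingest. Whether that is the right trade-off here is a design choice the source does not settle.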
**Step 3 — Chunk building from LLM output**:
For each section in the JSON response:
| Section type | Chunking rule |
|-------------|---------------|
| `qa` | One chunk = one Q&A pair. Question text + answer text concatenated. |
| `narrative` | Split into chunks at paragraph boundaries (`\n\n`), respecting `qa_max_chunk_tokens`. Each chunk prepended with section heading for context. |
| `speaking_notes` | Each bullet point (⚫) becomes a chunk, with topic heading prepended. |
| `table` | Each table becomes a chunk. If adjacent to a narrative section, merged into that section's chunks. |
| `toc` / `heading_only` | Skipped (not stored as chunks — no retrievable content). |
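The chunking rules in the table can be sketched as below. This is a simplified illustration under stated assumptions: sections are plain dicts shaped like the LLM JSON rather than `Section` objects, the size budget is a character count (`max_chars`) standing in for `qa_max_chunk_tokens`, `table` sections are assumed to carry their text in `content`, and merging a table into an adjacent narrative section and splitting oversized Q&A answers are both left out.

```python
from typing import Dict, List, Tuple

BULLET = "\u26ab"  # the bullet character used in speaking-notes sections


def split_paragraphs(text: str, max_chars: int) -> List[str]:
    """Greedily pack blank-line-separated paragraphs into pieces of at most max_chars."""
    pieces, current = [], ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            pieces.append(current)
            current = para
        else:
            current = candidate
    if current:
        pieces.append(current)
    return pieces


def build_chunks_from_sections(sections: List[Dict], max_chars: int) -> List[Tuple[str, int, Dict]]:
    """Return (chunk_text, start_page, metadata) tuples, one or more per section."""
    chunks = []
    for sec in sections:
        kind = sec["type"]
        heading = sec.get("heading", "")
        if kind in ("toc", "heading_only"):
            continue  # no retrievable content
        meta = {"section_type": kind, "section_heading": heading}
        if kind == "qa":
            texts = [f'{sec["question"]}\n{sec["answer"]}']
            meta["question_id"] = sec.get("qa_id")
        elif kind == "speaking_notes":
            bullets = [b.strip() for b in sec["content"].split(BULLET) if b.strip()]
            texts = [f"{heading}\n{BULLET} {b}" for b in bullets]
        elif kind == "table":
            texts = [f'{heading}\n{sec["content"]}']  # standalone table chunk
        else:  # narrative: paragraph-level split with the heading prepended
            texts = [f"{heading}\n{p}" for p in split_paragraphs(sec["content"], max_chars)]
        chunks.extend((t, sec["start_page"], dict(meta)) for t in texts)
    return chunks
```

Each chunk gets its own copy of the metadata dict so that later per-chunk fields, such as a sub-chunk index, can be added without affecting sibling chunks.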
**Step 4 — Regex fast-pass (optional optimization)**:
If `qa_vision_enabled=false` or for cost optimization, use regex as a fast pre-filter. The LLM call is skipped for documents where regex confidently identifies all boundaries, and used only when:
- No regex pattern matches (unknown format)
- Regex produces < 2 sections (likely misdetection)
- `qa_verification_model` is not set to `"none"`
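The fast-pass decision could look roughly like the sketch below. It is one reading of the conditions above, not a confirmed design: `qa_verification_model == "none"` is treated as a gate that disables the LLM entirely, and otherwise the LLM is called whenever regex finds fewer than two Q&A markers, which also covers the case where no pattern matches. The helper names are hypothetical, and the colon character class is an assumption (full-width plus ASCII colon).

```python
import re

# Marker patterns for the two known formats; the colon class is assumed.
CHINESE_MARKER = re.compile(r'問\s*[A-Z]\d+\s*[：:]')
ENGLISH_MARKER = re.compile(r'^Q\d+\s', re.MULTILINE)


def count_regex_sections(text: str) -> int:
    """Count question markers for whichever known format matches more often."""
    return max(len(CHINESE_MARKER.findall(text)), len(ENGLISH_MARKER.findall(text)))


def should_call_llm(text: str, qa_verification_model: str) -> bool:
    """Decide whether the LLM structure-detection call is needed for this document."""
    if qa_verification_model == "none":
        return False  # assumption: "none" disables the LLM path altogether
    # Zero matches means an unknown format; one match is a likely misdetection.
    return count_regex_sections(text) < 2
```

A document made only of narrative sections has no markers and therefore always goes to the LLM under this reading, which matches the intent that regex serve only as an optimization for known patterns.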
### Algorithm Detail: Table-to-Markdown ### Algorithm Detail: Table-to-Markdown
**Option A — Vision LLM per page (Primary for PDF)**: **Option A — Vision LLM per page (Primary for PDF)**:
@ -403,24 +425,27 @@ class Settings(BaseSettings):
**Test Files**: **Test Files**:
- `backend/app/test/test_phase8_qa_chunking.py` — new file - `backend/app/test/test_phase8_qa_chunking.py` — new file
- `test_chinese_qa_detection` — detect and split Chinese 問/答 pairs - `test_llm_structure_detection` — LLM correctly identifies Q&A, narrative, speaking_notes sections
- `test_english_qa_detection` — detect and split English Q-only pairs - `test_llm_handles_mixed_formats` — document with 問/答 + section headings + narrative correctly classified
- `test_no_qa_fallback` — section-based chunking when no Q&A found - `test_llm_handles_no_qa_format` — narrative-only documents (File L pages 1-13) correctly chunked by section
- `test_multi_page_qa` — Q&A spanning page boundaries - `test_llm_handles_speaking_notes` — 發言要點 sections chunked by bullet point
- `test_nested_qa_handling` — [如被追問] patterns - `test_regex_fastpass_chinese` — known 問/答 format detected without LLM (when configured)
- `test_table_preservation` — tables embedded in answers - `test_regex_fastpass_english` — known Q1/Q2 format detected without LLM
- `test_multi_page_sections` — sections spanning page boundaries tracked correctly
- `test_abc_contract` — satisfies ChunkingStrategy ABC - `test_abc_contract` — satisfies ChunkingStrategy ABC
- `test_page_number_reference_question` — page ref points to 問 location - `test_page_number_reference_question` — page ref points to question location for Q&A chunks
- `test_size_limit` — oversized pairs get recursive split - `test_size_limit` — oversized sections get recursive split with heading preserved
**Implementation Tasks**: **Implementation Tasks**:
1. **Create** `backend/app/utils/qa_chunking.py`: 1. **Create** `backend/app/utils/qa_chunking.py`:
- `detect_qa_style(text: str) -> str` — returns "chinese", "english", or "none" - `preprocess_text(pages: List[Tuple[int, str]]) -> str` — strip footers/headers, normalize colons, insert page markers, reconstruct full text
- `preprocess_text(pages: List[Tuple[int, str]]) -> str` — strip footers/headers, normalize colons, reconstruct full text - `build_structure_detection_prompt(text: str) -> str` — construct the LLM prompt for section classification
- `split_chinese_qa(text: str) -> List[QAPair]` — regex-based splitting for 問/答 format - `parse_llm_structure_response(response: dict) -> List[Section]` — parse the JSON returned by the LLM
- `split_english_qa(text: str) -> List[QAPair]` — regex-based splitting for Q-only format - `Section` dataclass: `{type, heading, qa_id?, question?, answer?, content?, start_page, end_page, has_table, parent_topic}`
- `QAPair` dataclass: `{qa_id, question, answer, start_page, end_page, topic_section, has_table, internal_refs}` - `split_chinese_qa(text: str) -> List[Section]` — regex fast-pass for known 問/答 format
- `split_english_qa(text: str) -> List[Section]` — regex fast-pass for known Q-number format
- `build_chunks_from_sections(sections: List[Section], max_tokens: int) -> List[Tuple[str, int, dict]]` — build chunk texts + page refs + metadata from sections
2. **Create** `QuestionChunkingStrategy(ChunkingStrategy)` in `backend/app/utils/chunking.py`: 2. **Create** `QuestionChunkingStrategy(ChunkingStrategy)` in `backend/app/utils/chunking.py`:
- `chunk(text: str) -> List[str]` — entry point for DOCX/TXT - `chunk(text: str) -> List[str]` — entry point for DOCX/TXT
@ -647,12 +672,11 @@ class Settings(BaseSettings):
| Risk | Severity | Mitigation | | Risk | Severity | Mitigation |
|------|----------|------------| |------|----------|------------|
| **Vision LLM cost**: Per-page image processing adds latency and token cost for large docs | Medium | Cache vision results per page hash. Skip vision for pages without detected tables. `qa_vision_enabled` toggle allows disabling vision and using heuristic-only extraction. Vision uses the same model/API key — no additional provider cost. | | **Vision LLM cost**: Per-page image processing adds latency and token cost for large docs | Medium | Cache vision results per page hash. Skip vision for pages without detected tables. `qa_vision_enabled` toggle allows disabling vision and using heuristic-only extraction. Vision uses the same model/API key — no additional provider cost. |
| **Q&A boundary false positives**: Regex matches non-Q&A text as Q&A | High | Use `detect_qa_style()` first — only apply Q&A splitting if document has sufficient Q&A density. LLM verification pass for ambiguous boundaries. Fall back to section-based chunking if < 3 pairs detected. | | **Q&A boundary detection**: LegCo documents use multiple formats (問/答, Q1/Q2, section headings, 發言要點) that regex alone cannot handle | Medium | LLM structure detection as primary approach handles all format variations in one call. Regex fast-pass only used for optimization on known patterns. Fall back to section-based chunking if LLM detects no Q&A content. |
| **Q&A boundary false negatives**: Regex misses some Q&A pairs (O1-O4 style) | Medium | LLM verification pass catches these. Log missed pairs for manual review. |
| **Table extraction quality**: Vision LLM may misinterpret complex Chinese tables | Medium | Flag `has_table=True` so retrieval can surface context. Allow user to inspect raw chunk content. Provide config to disable vision and use heuristic-only extraction. | | **Table extraction quality**: Vision LLM may misinterpret complex Chinese tables | Medium | Flag `has_table=True` so retrieval can surface context. Allow user to inspect raw chunk content. Provide config to disable vision and use heuristic-only extraction. |
| **Performance**: Vision LLM + LLM verification adds latency to ingest | Medium | Make vision extraction async (background task). Show progress in frontend. Cache results. Allow skipping verification for trusted documents. | | **Performance**: Vision LLM + LLM verification adds latency to ingest | Medium | Make vision extraction async (background task). Show progress in frontend. Cache results. Allow skipping verification for trusted documents. |
| **Page boundary interference**: Page footer numbers appear mid-Q&A text | Low | Preprocessing step strips known footer patterns. Test with all 4 LTT PDFs. | | **Page boundary interference**: Page footer numbers appear mid-Q&A text | Low | Preprocessing step strips known footer patterns. Test with all 4 LTT PDFs. |
| **LegCo format drift**: Future documents may use different Q&A markers | Low | Detection is regex-based — easy to add new patterns. LLM verification catches novel formats. Log format detection results for monitoring. | | **LegCo format drift**: Future documents may use different formats or section structures | Low | LLM structure detection adapts to new formats without code changes — the LLM prompt describes what to look for, not how. Add new format examples to the prompt if needed. |
| **Chunk size**: Some Q&A pairs are very long (7+ pages) | Medium | Apply max chunk token limit (configurable, default 3000). Recursive split on `\n\n` → `\n` with question text prepended to each sub-chunk for context. | | **Chunk size**: Some Q&A pairs are very long (7+ pages) | Medium | Apply max chunk token limit (configurable, default 3000). Recursive split on `\n\n` → `\n` with question text prepended to each sub-chunk for context. |
| **DOCX/TXT Q&A**: Non-PDF formats may have different Q&A markers | Low | Use same regex detection on concatenated text. Skip vision table extraction (text-based only). | | **DOCX/TXT Q&A**: Non-PDF formats may have different Q&A markers | Low | Use same regex detection on concatenated text. Skip vision table extraction (text-based only). |
| **vLLM vision compatibility**: vLLM may not support vision API for Qwen3.5-35B-A3B depending on version and how the model is served | Medium | Test with a single vision call against your vLLM instance before implementation. Set `QA_VISION_ENABLED=false` and use heuristic fallback (Option B) if unsupported. See vLLM compatibility note in Dependencies below. | | **vLLM vision compatibility**: vLLM may not support vision API for Qwen3.5-35B-A3B depending on version and how the model is served | Medium | Test with a single vision call against your vLLM instance before implementation. Set `QA_VISION_ENABLED=false` and use heuristic fallback (Option B) if unsupported. See vLLM compatibility note in Dependencies below. |