docs: Package 8 — switch to LLM-first structure detection (not regex-first)

LegCo documents use multiple formats (問/答 markers, Q1/Q2 numbering,
section headings like '(1) 住戶的安置補償', 發言要點 bullet points,
and pure table pages). Regex alone cannot reliably classify all these.

Changes:
- Primary detection: LLM call identifies ALL section types in one pass
  (qa, narrative, speaking_notes, table, toc, heading_only)
- Regex: downgraded to optional fast-pass optimization for known patterns
- Architecture diagram, algorithm detail, risks, and test plan all updated
- Single model handles structure detection + table extraction + verification
Woody 2026-05-15 11:34:24 +08:00
parent 29b4713f22
commit 6bf04cedb1
1 changed files with 132 additions and 108 deletions

@@ -144,12 +144,9 @@ Four LegCo supplementary papers in `test materials/LTT/`:
**1. FAQ-RAG (Pro-GenAI/FAQ-RAG)**: Treats each Q&A as an atomic retrieval unit. Stores dual embeddings per pair (question + answer). Gold standard for QA-native documents. **Adaptation**: Instead of generating FAQs, detect existing Q&A pairs in LegCo documents.
**2. LLM Boundary Detection (LlamaChunk / semchunk)**: Uses LLM logprobs or prompt-based boundary detection to identify semantic breakpoints. semchunk AI mode achieves 37.7% correctness vs 33.3% for fixed-size chunking on legal RAG benchmarks. **Relevance**: For edge cases where regex fails (nested Q&A, tables spanning boundaries), use an LLM verification pass.
**2. LLM Boundary Detection (LlamaChunk / semchunk)**: Uses LLM logprobs or prompt-based boundary detection to identify semantic breakpoints. semchunk AI mode achieves 37.7% correctness vs 33.3% for fixed-size chunking on legal RAG benchmarks. **Relevance**: Primary detection approach — LLM identifies all section types (Q&A, narrative, speaking notes, tables) in one call. Handles format variations that regex cannot (section headings without Q&A markers, mixed formats).
**3. Regex + LLM Hybrid**: Two-pass approach recommended by multiple sources:
- Pass 1: Regex-based Q&A boundary detection (free, fast, 95%+ accurate for marked documents)
- Pass 2: LLM verification for edge cases (multi-paragraph answers, embedded tables, nested Q&A)
- Validated by 2026 EACL RAG error taxonomy: 29.7% of RAG errors from chunking mistakes
**3. Regex + LLM Hybrid**: Used as optimization — regex fast-pass for documents with clear markers (問/答, Q1/Q2), falling back to LLM for everything else. LLM handles all format variations (narrative sections, speaking notes, section headings without Q&A markers) that regex cannot.
**4. Index-RAG**: Multi-embedding per chunk (question text + answer summary + topic + keywords). Optional enhancement — store multiple embeddings per Q&A pair for richer retrieval.
@@ -169,19 +166,25 @@ Four LegCo supplementary papers in `test materials/LTT/`:
PDF Transcript
[gptpdf / llm-ocr] ← Vision LLM converts page images to markdown
│ Tables rendered as markdown tables inline
Clean Markdown with 問/答 markers preserved
[Preprocess] Concatenate pages, strip footers/headers, normalize
[Regex Splitter] ← Primary: split on 問...答 boundaries
│ Fallback: split on Q1, Q2... boundaries
[LLM Structure Detection] ← Single call: identify ALL section boundaries
│ Classify each as qa / narrative / speaking_notes / table
│ Same model as everything else (Qwen3.5-35B-A3B)
[LLM Boundary Verifier] ← Edge cases: nested Q&A, tables at boundaries
[Regex fast-pass] ← Optional: skip LLM for documents with 100%-regex-detectable format
Q&A Pair Chunks → extract_metadata(qa_style, question_index, ...)
[Chunk Builder] ← For QA: one chunk per pair
│ For narrative: paragraph-level split with heading prepended
│ For speaking_notes: one chunk per bullet with topic prepended
│ For table: standalone chunk or merged into parent section
[Table Extraction] ← Vision LLM (same model, base64 images) for pages with tables
│ → markdown tables injected into chunk text
Q&A/Narrative Chunks → extract_metadata(section_type, heading, qa_id?, ...)
ChromaDB (with strategy_type="question" in metadata)
@@ -200,113 +203,132 @@ POST /api/v1/ingest?strategy=token|question
└─ strategy="question" → QuestionChunkingStrategy (NEW)
├─ [PDF only] Render pages as images → LLM vision (base64) → markdown
│ └─ Uses existing Qwen3.5-35B-A3B (native vision model, no separate API)
│ └─ Tables converted to markdown table syntax inline
├─ [Preprocess] Concatenate pages, strip footers/headers, normalize
├─ [Detect Q&A style] Chinese 問/答 vs English Q-only
├─ [LLM Structure Detection] Send full document text to LLM
│ │ Same model as everything else (Qwen3.5-35B-A3B)
│ │ Prompt: "Identify all sections. Classify each as Q&A pair,
│ │ narrative section, table section, or speaking notes.
│ │ For each Q&A, output question text, answer text,
│ │ start/end boundaries."
│ └─ Returns: [{ type: "qa"|"narrative"|"table"|"speaking_notes",
│ qa_id?, question?, answer?, heading?, start_page, end_page }, ...]
├─ [Primary split] Regex-based Q&A boundary detection
│ ├─ Chinese: split on `問\s*[A-Z]\d+\s*[：:]` → `答\s*([A-Z]\d+)`
│ └─ English: split on `^Q\d+\s` → next Q/section = answer end
├─ [Regex fallback] Skip LLM when disabled or for known patterns
│ └─ 問/答 markers and Q1/Q2 lines as fast-pass optimization
├─ [LLM verification] Edge case handling:
│ ├─ Multi-paragraph answers exceeding max chunk size
│ ├─ Tables spanning Q&A boundaries (keep with answer)
│ ├─ Nested Q&A (如被追問 within answers)
│ └─ Ambiguous boundaries (O1-O4 style without explicit 問/答)
├─ [Size check] Oversized Q&A answers → recursive split on \n\n → \n
│ └─ Question text prepended to each sub-chunk for context
├─ [Size check] Oversized pairs → recursive split on \n\n → \n
│ └─ Split answer portions only; keep question + answer prefix together
├─ [Table extraction] Vision LLM per page with tables → markdown
│ └─ Same model: base64 page images via standard Chat Completions API
├─ [Fallback] Sections with no Q&A (發言要點, narrative policy)
│ └─ Chunk by section headers (章節) or fall back to token-based
├─ [Narrative sections] Sections with no Q&A (發言要點, policy text)
│ └─ Chunk by section heading boundaries with heading+parent prepended
└─ [Metadata] Per-chunk:
├─ strategy_type: "question"
├─ qa_style: "chinese" | "english"
├─ question_index: int (0-based within document)
├─ question_id: str ("A1", "Q3", etc.)
├─ question_text: str (full question text)
├─ section_type: "qa" | "narrative" | "speaking_notes"
├─ question_index: int (0-based, Q&A only)
├─ question_id: str ("A1", "Q3", etc., Q&A only)
├─ question_text: str (full question text, Q&A only)
├─ section_heading: str (e.g. "(A) 排水系統", "(1) 住戶的安置補償")
├─ answer_contains_table: bool
├─ topic_section: str (e.g. "(A) 排水系統")
└─ source_page_range: [int, int] (page range of Q&A pair)
└─ source_page_range: [int, int]
```
### Algorithm Detail: Q&A Boundary Detection
### Algorithm Detail: LLM Structure Detection (Primary)
LegCo documents use multiple formats — explicit 問/答 markers, Q1/Q2 numbering, section headings like `(1) 住戶的安置補償`, 發言要點 bullet points, and pure table pages. Regex alone cannot reliably classify all these. The LLM can.
**Step 1 — Preprocess extracted text**:
1. Concatenate all pages from `parse_pdf_by_page()` with `\n` separator
2. Strip page footer markers: regex `^[A-Z]-\d+\s*\n\d{4}-\d{2}-\d{2}$` and `^[A-Z]-\d+$`
3. Strip page header letters `^(\([A-Z]\))\s*$` (keep on page 1 as doc identifier)
4. Normalize colons: full-width variants (e.g. `：`) → `:`
5. Insert page boundary markers: `[PAGE_BREAK: N]` at each page boundary so the LLM knows where pages start/end
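The preprocessing steps above could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: it assumes `pages` is a list of `(page_number, text)` tuples, that the page marker goes at the start of each page, and that only the full-width colon `：` needs normalizing (the other colon variant in the original list is not recoverable here).

```python
import re
from typing import List, Tuple

# Footer and header patterns taken from Step 1. The date-stamped footer is
# removed first so the plain footer pattern does not leave the date behind.
FOOTER_WITH_DATE = re.compile(r'^[A-Z]-\d+\s*\n\d{4}-\d{2}-\d{2}$', re.MULTILINE)
FOOTER_PLAIN = re.compile(r'^[A-Z]-\d+$', re.MULTILINE)
HEADER_LETTER = re.compile(r'^(\([A-Z]\))\s*$', re.MULTILINE)


def preprocess_text(pages: List[Tuple[int, str]]) -> str:
    """Strip footers/headers, normalize colons, and insert page markers."""
    parts = []
    for index, (page_no, text) in enumerate(pages):
        text = FOOTER_WITH_DATE.sub('', text)
        text = FOOTER_PLAIN.sub('', text)
        if index > 0:
            # Keep the header letter on the first page as the doc identifier.
            text = HEADER_LETTER.sub('', text)
        # Assumption: only the full-width colon is normalized in this sketch.
        text = text.replace('：', ':')
        # Assumption: the marker is placed at the start of each page.
        parts.append(f'[PAGE_BREAK: {page_no}]\n{text.strip()}')
    return '\n'.join(parts)
```

Placing the marker at the start of each page matches the prompt wording ("showing where pages begin"), but the exact placement is a design choice.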
**Step 2 — LLM Structure Detection (single call)**:
Send the preprocessed text to the LLM (same model as everything else: `LLM_MODEL_NAME`). Ask it to identify every section boundary and classify each section:
**Step 2 — Detect Q&A style**:
```python
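import re

def detect_qa_style(text: str) -> str:
    """Classify the document's Q&A marker style."""
    # Assumed character class: the brackets were empty in the source (an
    # invalid regex); full-width and ASCII colons are the likely intent.
    chinese_pattern = r'問\s*[A-Z]\d+\s*[：:]'
    english_pattern = r'^Q\d+\s'
    if re.search(chinese_pattern, text, re.MULTILINE):
        return "chinese"
    if re.search(english_pattern, text, re.MULTILINE):
        return "english"
    return "none"  # fall back to token-based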
```
You are analyzing a Hong Kong Legislative Council document.
The text has page markers like [PAGE_BREAK: N] showing where pages begin.
**Step 3 — Split by Q&A (Chinese style)**:
```python
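import re

# Colon class [：:] is assumed; the brackets were empty in the source.
QA_PATTERN_CHINESE = re.compile(
    r'(?:^|\n)\s*問\s*([A-Z]\d+)\s*[：:]\s*(.*?)\s*'
    r'(?:\n\s*答\s*\1\s*[：:]\s*)'  # matched answer marker
    r'(.*?)'
    # Stop at the next 問, a section header, or end of text (no trailing
    # newline required).
    r'(?=\n\s*(?:問\s*[A-Z]\d+\s*[：:]|\([A-Z]\))|\s*\Z)',
    re.DOTALL
)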
# Group 1: question_id (e.g. "A1")
# Group 2: question_text
# Group 3: answer_text (everything until next 問 or section header or EOF)
```
For each distinct section in this document, identify:
1. The section type:
- "qa": a question-and-answer pair (問/答 or Q1/Q2 format)
- "narrative": policy text, explanatory paragraphs, section content with bullets
- "speaking_notes": briefing points (發言要點) with bullet markers
- "table": standalone data tables (not embedded in answers)
- "toc": table of contents
- "heading_only": a section heading with no following content
**Step 3b — Split by Q&A (English style)**:
```python
import re

QA_PATTERN_ENGLISH = re.compile(
    r'(?:^|\n)\s*(Q\d+)\s+(.*?)\s*'
    r'(?=\n\s*(?:Q\d+|[A-Z]\d+\s|'
    r'[\u4e00-\u9fff]{2,}[ \t]*(?:\n|\Z))|'  # Chinese-only topic line
    r'\s*\Z)',  # end of text, with or without a trailing newline
    re.DOTALL
)
# Group 1: question_id (e.g. "Q1")
# Group 2: full text (question + answer combined, since no explicit A marker)
```
2. For "qa" sections:
- The question text (exact)
- The answer text (exact, including tables, bullet lists, and [內部參考] content)
- The question ID if present (e.g. "A1", "Q3")
- The start page and end page
**Step 4 — LLM verification pass** (for edge cases only):
For each Q&A pair, if any of these conditions are true, send pair to LLM for verification:
- Answer length > 2000 tokens (may contain multiple Q&A or tables)
- Contains `[如被追問` or `[如追問` (nested Q&A)
- Contains `[內部參考` or `[內部補充` (supplementary content to handle)
- Answer text contains regex pattern for table rows
LLM prompt:
```
You are processing a Hong Kong Legislative Council document.
Analyze this Q&A pair and determine:
1. Does the answer contain tables? If yes, output them as markdown tables.
2. Are there nested followup Q&A (追問) within the answer? If yes, separate them.
3. Does the answer contain [內部參考] supplementary content? Flag it.
4. Is the Q&A boundary correct? If not, suggest the correct boundary.
3. For all sections:
- The section heading (e.g. "(A) 排水系統", "(1) 住戶的安置補償")
- The start page and end page
- Whether the section contains tables
Return JSON:
{
"correct_boundary": true/false,
"tables_markdown": ["table1...", "table2..."],
"nested_qa": [{"q": "...", "a": "..."}],
"internal_refs": ["ref1...", "ref2..."],
"clean_answer": "answer text with tables replaced by markdown"
"sections": [
{
"type": "qa",
"heading": "(A) 排水系統",
"qa_id": "A1",
"question": "古洞北和粉嶺北新發展區的設計是否能抵禦氣候變化...",
"answer": "研究顧問已為古洞北和粉嶺北新發展區研究範圍進行了...",
"start_page": 2,
"end_page": 3,
"has_table": true,
"parent_topic": "排水系統"
},
{
"type": "narrative",
"heading": "(1) 住戶的安置補償",
"content": "合資格住戶可選擇(i)「須通過經濟狀況審查」安置...",
"start_page": 2,
"end_page": 5,
"has_table": false
},
{
"type": "speaking_notes",
"heading": "發言要點",
"content": "⚫ 古洞北/粉嶺北新發展區(新發展區)是北部都會區內...",
"start_page": 1,
"end_page": 2,
"has_table": false
}
]
}
```
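A hedged sketch of how the JSON above might be parsed into `Section` objects. The field names follow the `Section` dataclass listed in the implementation tasks; the validation rules (dropping unknown types and Q&A entries missing a question or answer) are assumptions added for illustration, not behaviour the source specifies.

```python
from dataclasses import dataclass
from typing import List, Optional

VALID_TYPES = {"qa", "narrative", "speaking_notes", "table", "toc", "heading_only"}


@dataclass
class Section:
    type: str
    start_page: int
    end_page: int
    heading: str = ""
    qa_id: Optional[str] = None
    question: Optional[str] = None
    answer: Optional[str] = None
    content: Optional[str] = None
    has_table: bool = False
    parent_topic: Optional[str] = None


def parse_llm_structure_response(response: dict) -> List[Section]:
    """Convert the decoded LLM JSON into Section objects, skipping bad entries."""
    sections: List[Section] = []
    for item in response.get("sections", []):
        kind = item.get("type")
        if kind not in VALID_TYPES:
            continue  # assumption: unknown section types are dropped
        if kind == "qa" and not (item.get("question") and item.get("answer")):
            continue  # assumption: incomplete Q&A pairs are dropped
        sections.append(Section(
            type=kind,
            start_page=int(item.get("start_page", 0)),
            end_page=int(item.get("end_page", 0)),
            heading=item.get("heading", ""),
            qa_id=item.get("qa_id"),
            question=item.get("question"),
            answer=item.get("answer"),
            content=item.get("content"),
            has_table=bool(item.get("has_table", False)),
            parent_topic=item.get("parent_topic"),
        ))
    return sections
```

Dropping malformed entries silently keeps ingestion running; logging them instead may be preferable when reviewing detection quality.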
**Step 3 — Chunk building from LLM output**:
For each section in the JSON response:
| Section type | Chunking rule |
|-------------|---------------|
| `qa` | One chunk = one Q&A pair. Question text + answer text concatenated. |
| `narrative` | Split into chunks at paragraph boundaries (`\n\n`), respecting `qa_max_chunk_tokens`. Each chunk prepended with section heading for context. |
| `speaking_notes` | Each bullet point (⚫) becomes a chunk, with topic heading prepended. |
| `table` | Each table becomes a chunk. If adjacent to a narrative section, merged into that section's chunks. |
| `toc` / `heading_only` | Skipped (not stored as chunks — no retrievable content). |
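The chunking rules in the table could be sketched as below. This is a simplified illustration under several assumptions: sections are plain dicts shaped like the JSON above rather than `Section` objects, `count_tokens` is a hypothetical token counter (defaulting to character length), and two rules are omitted for brevity (merging a table into an adjacent narrative section, and recursive splitting of oversized Q&A pairs).

```python
from typing import Callable, List, Tuple

Chunk = Tuple[str, int, dict]  # (chunk text, page reference, metadata)


def build_chunks_from_sections(
    sections: List[dict],
    max_tokens: int,
    count_tokens: Callable[[str], int] = len,  # hypothetical token counter
) -> List[Chunk]:
    chunks: List[Chunk] = []
    for s in sections:
        kind = s["type"]
        heading = s.get("heading", "")
        page = s["start_page"]
        meta = {"section_type": kind, "section_heading": heading}
        if kind in ("toc", "heading_only"):
            continue  # no retrievable content
        if kind == "qa":
            # One chunk per Q&A pair: question and answer concatenated.
            chunks.append((f'{s["question"]}\n{s["answer"]}', page, dict(meta)))
        elif kind == "speaking_notes":
            # One chunk per bullet, with the topic heading prepended.
            for bullet in s["content"].split("⚫"):
                if bullet.strip():
                    chunks.append((f"{heading}\n⚫ {bullet.strip()}", page, dict(meta)))
        elif kind == "narrative":
            # Greedily pack paragraphs until the token limit would be exceeded.
            buf = ""
            for para in s["content"].split("\n\n"):
                candidate = f"{buf}\n\n{para}" if buf else para
                if buf and count_tokens(f"{heading}\n{candidate}") > max_tokens:
                    chunks.append((f"{heading}\n{buf}", page, dict(meta)))
                    buf = para
                else:
                    buf = candidate
            if buf:
                chunks.append((f"{heading}\n{buf}", page, dict(meta)))
        else:  # "table": stored as a standalone chunk in this sketch
            chunks.append((s.get("content", ""), page, dict(meta)))
    return chunks
```

A single paragraph longer than `max_tokens` is emitted as-is here; a fuller version would fall back to splitting on `\n`.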
**Step 4 — Regex fast-pass (optional optimization)**:
When `qa_vision_enabled=false`, or to reduce cost, regex can run first as a fast pre-filter. The LLM call is skipped for documents where regex confidently identifies all boundaries, and is made only when:
- No regex pattern matches (unknown format)
- Regex produces < 2 sections (likely misdetection)
- `qa_verification_model` is not set to `"none"`
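One possible reading of these conditions, sketched as a small decision function. The interpretation is an assumption: `qa_verification_model == "none"` is treated as "LLM disabled" (consistent with the "Skip LLM when disabled" note in the flow above), and "no regex pattern matches" is folded into the fewer-than-two-sections check. The function name `needs_llm_detection` is hypothetical.

```python
from typing import Sequence


def needs_llm_detection(
    regex_sections: Sequence[object],
    qa_verification_model: str,
) -> bool:
    """Decide whether to call the LLM after the regex fast-pass has run."""
    # Assumption: "none" means the LLM is disabled, so regex output is final.
    if qa_verification_model == "none":
        return False
    # Zero sections (unknown format) or a single section (likely
    # misdetection) both trigger the LLM structure-detection call.
    return len(regex_sections) < 2
```

For example, a document where regex finds three 問/答 pairs would skip the LLM, while a narrative-only document with no matches would not.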
### Algorithm Detail: Table-to-Markdown
**Option A — Vision LLM per page (Primary for PDF)**:
@@ -403,24 +425,27 @@ class Settings(BaseSettings):
**Test Files**:
- `backend/app/test/test_phase8_qa_chunking.py` — new file
- `test_chinese_qa_detection` — detect and split Chinese 問/答 pairs
- `test_english_qa_detection` — detect and split English Q-only pairs
- `test_no_qa_fallback` — section-based chunking when no Q&A found
- `test_multi_page_qa` — Q&A spanning page boundaries
- `test_nested_qa_handling` — [如被追問] patterns
- `test_table_preservation` — tables embedded in answers
- `test_llm_structure_detection` — LLM correctly identifies Q&A, narrative, speaking_notes sections
- `test_llm_handles_mixed_formats` — document with 問/答 + section headings + narrative correctly classified
- `test_llm_handles_no_qa_format` — narrative-only documents (File L pages 1-13) correctly chunked by section
- `test_llm_handles_speaking_notes` — 發言要點 sections chunked by bullet point
- `test_regex_fastpass_chinese` — known 問/答 format detected without LLM (when configured)
- `test_regex_fastpass_english` — known Q1/Q2 format detected without LLM
- `test_multi_page_sections` — sections spanning page boundaries tracked correctly
- `test_abc_contract` — satisfies ChunkingStrategy ABC
- `test_page_number_reference_question` — page ref points to 問 location
- `test_size_limit` — oversized pairs get recursive split
- `test_page_number_reference_question` — page ref points to question location for Q&A chunks
- `test_size_limit` — oversized sections get recursive split with heading preserved
**Implementation Tasks**:
1. **Create** `backend/app/utils/qa_chunking.py`:
- `detect_qa_style(text: str) -> str` — returns "chinese", "english", or "none"
- `preprocess_text(pages: List[Tuple[int, str]]) -> str` — strip footers/headers, normalize colons, reconstruct full text
- `split_chinese_qa(text: str) -> List[QAPair]` — regex-based splitting for 問/答 format
- `split_english_qa(text: str) -> List[QAPair]` — regex-based splitting for Q-only format
- `QAPair` dataclass: `{qa_id, question, answer, start_page, end_page, topic_section, has_table, internal_refs}`
- `preprocess_text(pages: List[Tuple[int, str]]) -> str` — strip footers/headers, normalize colons, insert page markers, reconstruct full text
- `build_structure_detection_prompt(text: str) -> str` — construct the LLM prompt for section classification
- `parse_llm_structure_response(response: dict) -> List[Section]` — parse the JSON returned by the LLM
- `Section` dataclass: `{type, heading, qa_id?, question?, answer?, content?, start_page, end_page, has_table, parent_topic}`
- `split_chinese_qa(text: str) -> List[Section]` — regex fast-pass for known 問/答 format
- `split_english_qa(text: str) -> List[Section]` — regex fast-pass for known Q-number format
- `build_chunks_from_sections(sections: List[Section], max_tokens: int) -> List[Tuple[str, int, dict]]` — build chunk texts + page refs + metadata from sections
2. **Create** `QuestionChunkingStrategy(ChunkingStrategy)` in `backend/app/utils/chunking.py`:
- `chunk(text: str) -> List[str]` — entry point for DOCX/TXT
@@ -647,12 +672,11 @@ class Settings(BaseSettings):
| Risk | Severity | Mitigation |
|------|----------|------------|
| **Vision LLM cost**: Per-page image processing adds latency and token cost for large docs | Medium | Cache vision results per page hash. Skip vision for pages without detected tables. `qa_vision_enabled` toggle allows disabling vision and using heuristic-only extraction. Vision uses the same model/API key — no additional provider cost. |
| **Q&A boundary false positives**: Regex matches non-Q&A text as Q&A | High | Use `detect_qa_style()` first — only apply Q&A splitting if document has sufficient Q&A density. LLM verification pass for ambiguous boundaries. Fall back to section-based chunking if < 3 pairs detected. |
| **Q&A boundary false negatives**: Regex misses some Q&A pairs (O1-O4 style) | Medium | LLM verification pass catches these. Log missed pairs for manual review. |
| **Q&A boundary detection**: LegCo documents use multiple formats (問/答, Q1/Q2, section headings, 發言要點) that regex alone cannot handle | Medium | LLM structure detection as primary approach handles all format variations in one call. Regex fast-pass only used for optimization on known patterns. Fall back to section-based chunking if LLM detects no Q&A content. |
| **Table extraction quality**: Vision LLM may misinterpret complex Chinese tables | Medium | Flag `has_table=True` so retrieval can surface context. Allow user to inspect raw chunk content. Provide config to disable vision and use heuristic-only extraction. |
| **Performance**: Vision LLM + LLM verification adds latency to ingest | Medium | Make vision extraction async (background task). Show progress in frontend. Cache results. Allow skipping verification for trusted documents. |
| **Page boundary interference**: Page footer numbers appear mid-Q&A text | Low | Preprocessing step strips known footer patterns. Test with all 4 LTT PDFs. |
| **LegCo format drift**: Future documents may use different Q&A markers | Low | Detection is regex-based — easy to add new patterns. LLM verification catches novel formats. Log format detection results for monitoring. |
| **LegCo format drift**: Future documents may use different formats or section structures | Low | LLM structure detection adapts to new formats without code changes — the LLM prompt describes what to look for, not how. Add new format examples to the prompt if needed. |
| **Chunk size**: Some Q&A pairs are very long (7+ pages) | Medium | Apply max chunk token limit (configurable, default 3000). Recursive split on `\n\n` → `\n` with question text prepended to each sub-chunk for context. |
| **DOCX/TXT Q&A**: Non-PDF formats may have different Q&A markers | Low | Use same regex detection on concatenated text. Skip vision table extraction (text-based only). |
| **vLLM vision compatibility**: vLLM may not support vision API for Qwen3.5-35B-A3B depending on version and how the model is served | Medium | Test with a single vision call against your vLLM instance before implementation. Set `QA_VISION_ENABLED=false` and use heuristic fallback (Option B) if unsupported. See vLLM compatibility note in Dependencies below. |