docs: Package 8 — switch to LLM-first structure detection (not regex-first)
LegCo documents use multiple formats (問/答 markers, Q1/Q2 numbering, section headings like '(1) 住戶的安置補償', 發言要點 bullet points, and pure table pages). Regex alone cannot reliably classify all these.

Changes:
- Primary detection: LLM call identifies ALL section types in one pass (qa, narrative, speaking_notes, table, toc, heading_only)
- Regex: downgraded to optional fast-pass optimization for known patterns
- Architecture diagram, algorithm detail, risks, and test plan all updated
- Single model handles structure detection + table extraction + verification
parent 29b4713f22
commit 6bf04cedb1
@@ -144,12 +144,9 @@ Four LegCo supplementary papers in `test materials/LTT/`:
**1. FAQ-RAG (Pro-GenAI/FAQ-RAG)**: Treats each Q&A as an atomic retrieval unit. Stores dual embeddings per pair (question + answer). Gold standard for QA-native documents. **Adaptation**: Instead of generating FAQs, detect existing Q&A pairs in LegCo documents.

**2. LLM Boundary Detection (LlamaChunk / semchunk)**: Uses LLM logprobs or prompt-based boundary detection to identify semantic breakpoints. semchunk AI mode achieves 37.7% correctness vs 33.3% for fixed-size chunking on legal RAG benchmarks. **Relevance**: For edge cases where regex fails (nested Q&A, tables spanning boundaries), use an LLM verification pass.
**2. LLM Boundary Detection (LlamaChunk / semchunk)**: Uses LLM logprobs or prompt-based boundary detection to identify semantic breakpoints. semchunk AI mode achieves 37.7% correctness vs 33.3% for fixed-size chunking on legal RAG benchmarks. **Relevance**: Primary detection approach — LLM identifies all section types (Q&A, narrative, speaking notes, tables) in one call. Handles format variations that regex cannot (section headings without Q&A markers, mixed formats).

**3. Regex + LLM Hybrid**: Two-pass approach recommended by multiple sources:
- Pass 1: Regex-based Q&A boundary detection (free, fast, 95%+ accurate for marked documents)
- Pass 2: LLM verification for edge cases (multi-paragraph answers, embedded tables, nested Q&A)
- Validated by 2026 EACL RAG error taxonomy: 29.7% of RAG errors from chunking mistakes
**3. Regex + LLM Hybrid**: Used as optimization — regex fast-pass for documents with clear markers (問/答, Q1/Q2), falling back to LLM for everything else. LLM handles all format variations (narrative sections, speaking notes, section headings without Q&A markers) that regex cannot.

**4. Index-RAG**: Multi-embedding per chunk (question text + answer summary + topic + keywords). Optional enhancement — store multiple embeddings per Q&A pair for richer retrieval.

@@ -169,19 +166,25 @@ Four LegCo supplementary papers in `test materials/LTT/`:
PDF Transcript
│
▼
[gptpdf / llm-ocr] ← Vision LLM converts page images to markdown
│ Tables rendered as markdown tables inline
▼
Clean Markdown with 問/答 markers preserved
[Preprocess] Concatenate pages, strip footers/headers, normalize
│
▼
[Regex Splitter] ← Primary: split on 問...答 boundaries
│ Fallback: split on Q1, Q2... boundaries
[LLM Structure Detection] ← Single call: identify ALL section boundaries
│ Classify each as qa / narrative / speaking_notes / table
│ Same model as everything else (Qwen3.5-35B-A3B)
▼
[LLM Boundary Verifier] ← Edge cases: nested Q&A, tables at boundaries
[Regex fast-pass] ← Optional: skip LLM for documents with 100%-regex-detectable format
│
▼
Q&A Pair Chunks → extract_metadata(qa_style, question_index, ...)
[Chunk Builder] ← For QA: one chunk per pair
│ For narrative: paragraph-level split with heading prepended
│ For speaking_notes: one chunk per bullet with topic prepended
│ For table: standalone chunk or merged into parent section
▼
[Table Extraction] ← Vision LLM (same model, base64 images) for pages with tables
│ → markdown tables injected into chunk text
▼
Q&A/Narrative Chunks → extract_metadata(section_type, heading, qa_id?, ...)
│
▼
ChromaDB (with strategy_type="question" in metadata)

@@ -200,113 +203,132 @@ POST /api/v1/ingest?strategy=token|question
│
└─ strategy="question" → QuestionChunkingStrategy (NEW)
│
├─ [PDF only] Render pages as images → LLM vision (base64) → markdown
│ └─ Uses existing Qwen3.5-35B-A3B (native vision model, no separate API)
│ └─ Tables converted to markdown table syntax inline
├─ [Preprocess] Concatenate pages, strip footers/headers, normalize
│
├─ [Detect Q&A style] Chinese 問/答 vs English Q-only
├─ [LLM Structure Detection] Send full document text to LLM
│ │ Same model as everything else (Qwen3.5-35B-A3B)
│ │ Prompt: "Identify all sections. Classify each as Q&A pair,
│ │ narrative section, table section, or speaking notes.
│ │ For each Q&A, output question text, answer text,
│ │ start/end boundaries."
│ └─ Returns: [{ type: "qa"|"narrative"|"table"|"speaking_notes",
│ qa_id?, question?, answer?, heading?, start_page, end_page }, ...]
│
├─ [Primary split] Regex-based Q&A boundary detection
│ ├─ Chinese: split on `問\s*[A-Z]\d+\s*[︰:]` → `答\s*([A-Z]\d+)`
│ └─ English: split on `^Q\d+\s` → next Q/section = answer end
├─ [Regex fallback] Skip LLM when disabled or for known patterns
│ └─ 問/答 markers and Q1/Q2 lines as fast-pass optimization
│
├─ [LLM verification] Edge case handling:
│ ├─ Multi-paragraph answers exceeding max chunk size
│ ├─ Tables spanning Q&A boundaries (keep with answer)
│ ├─ Nested Q&A (如被追問 within answers)
│ └─ Ambiguous boundaries (O1-O4 style without explicit 問/答)
├─ [Size check] Oversized Q&A answers → recursive split on \n\n → \n
│ └─ Question text prepended to each sub-chunk for context
│
├─ [Size check] Oversized pairs → recursive split on \n\n → \n
│ └─ Split answer portions only; keep question + answer prefix together
├─ [Table extraction] Vision LLM per page with tables → markdown
│ └─ Same model: base64 page images via standard Chat Completions API
│
├─ [Fallback] Sections with no Q&A (發言要點, narrative policy)
│ └─ Chunk by section headers (章節) or fall back to token-based
├─ [Narrative sections] Sections with no Q&A (發言要點, policy text)
│ └─ Chunk by section heading boundaries with heading+parent prepended
│
└─ [Metadata] Per-chunk:
├─ strategy_type: "question"
├─ qa_style: "chinese" | "english"
├─ question_index: int (0-based within document)
├─ question_id: str ("A1", "Q3", etc.)
├─ question_text: str (full question text)
├─ section_type: "qa" | "narrative" | "speaking_notes"
├─ question_index: int (0-based, Q&A only)
├─ question_id: str ("A1", "Q3", etc., Q&A only)
├─ question_text: str (full question text, Q&A only)
├─ section_heading: str (e.g. "(A) 排水系統", "(1) 住戶的安置補償")
├─ answer_contains_table: bool
├─ topic_section: str (e.g. "(A) 排水系統")
└─ source_page_range: [int, int] (page range of Q&A pair)
└─ source_page_range: [int, int]
```
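
The `[Size check]` step in the tree above (recursive split on `\n\n`, then `\n`, with the question text prepended to each sub-chunk) might look roughly like the sketch below. The function names `split_answer` and `split_oversized_pair` are hypothetical, and `len` stands in for a real token counter, which is an assumption made only to keep the example self-contained.

```python
from typing import Callable, List, Sequence


def split_answer(answer: str, budget: int,
                 count_tokens: Callable[[str], int] = len,
                 separators: Sequence[str] = ("\n\n", "\n")) -> List[str]:
    """Recursively split `answer` into pieces of at most `budget` tokens."""
    if count_tokens(answer) <= budget or not separators:
        # Either it fits, or no separator is left (piece may stay oversized).
        return [answer]
    head, *rest = separators
    pieces: List[str] = []
    buffer = ""
    for part in answer.split(head):
        candidate = f"{buffer}{head}{part}" if buffer else part
        if count_tokens(candidate) <= budget:
            buffer = candidate  # keep packing parts into the current piece
            continue
        if buffer:
            pieces.append(buffer)
            buffer = ""
        if count_tokens(part) <= budget:
            buffer = part
        else:
            # Part alone is too large: retry with the next, finer separator.
            pieces.extend(split_answer(part, budget, count_tokens, tuple(rest)))
    if buffer:
        pieces.append(buffer)
    return pieces


def split_oversized_pair(question: str, answer: str, max_tokens: int,
                         count_tokens: Callable[[str], int] = len) -> List[str]:
    """Split only the answer; prepend the question to every sub-chunk."""
    budget = max_tokens - count_tokens(question) - 1  # 1 for the joining newline
    if budget <= 0:
        raise ValueError("question alone exceeds max_tokens")
    return [f"{question}\n{piece}"
            for piece in split_answer(answer, budget, count_tokens)]
```

Reserving room for the question inside the budget keeps every emitted sub-chunk under the configured limit, at the cost of slightly smaller answer pieces.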

### Algorithm Detail: Q&A Boundary Detection
### Algorithm Detail: LLM Structure Detection (Primary)

LegCo documents use multiple formats — explicit 問/答 markers, Q1/Q2 numbering, section headings like `(1) 住戶的安置補償`, 發言要點 bullet points, and pure table pages. Regex alone cannot reliably classify all these. The LLM can.

**Step 1 — Preprocess extracted text**:
1. Concatenate all pages from `parse_pdf_by_page()` with `\n` separator
2. Strip page footer markers: regex `^[A-Z]-\d+\s*\n\d{4}-\d{2}-\d{2}$` and `^[A-Z]-\d+$`
3. Strip page header letters `^(\([A-Z]\))\s*$` (keep on page 1 as doc identifier)
4. Normalize colons: `：` → `:`, `︰` → `:`
5. Insert page boundary markers: `[PAGE_BREAK: N]` at each page boundary so the LLM knows where pages start/end
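
The preprocessing steps above could be sketched as follows. This is a minimal illustration, not the project's implementation: it assumes `pages` is a list of `(page_number, text)` tuples, cleans each page before joining (so the page 1 header letter is easy to keep), and places the `[PAGE_BREAK: N]` marker at the start of each page. The fullwidth colon (U+FF1A) as the first normalization source is also an assumption.

```python
import re
from typing import List, Tuple

# Footer patterns taken from the step list above.
FOOTER_WITH_DATE = re.compile(r'^[A-Z]-\d+\s*\n\d{4}-\d{2}-\d{2}$', re.MULTILINE)
FOOTER_ONLY = re.compile(r'^[A-Z]-\d+$', re.MULTILINE)
# A line holding only a bracketed letter such as "(A)".
HEADER_LETTER = re.compile(r'^\([A-Z]\)\s*$', re.MULTILINE)


def preprocess_text(pages: List[Tuple[int, str]]) -> str:
    """Strip footers/headers, normalize colons, and insert page markers."""
    parts = []
    for index, (page_no, text) in enumerate(pages):
        text = FOOTER_WITH_DATE.sub('', text)
        text = FOOTER_ONLY.sub('', text)
        if index > 0:
            # Keep the header letter on the first page as the doc identifier.
            text = HEADER_LETTER.sub('', text)
        # U+FF1A (fullwidth colon) and U+FE30 (vertical two-dot leader) -> ':'
        text = text.replace('\uff1a', ':').replace('\ufe30', ':')
        parts.append(f'[PAGE_BREAK: {page_no}]\n{text.strip()}')
    return '\n'.join(parts)
```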

**Step 2 — LLM Structure Detection (single call)**:

Send the preprocessed text to the LLM (same model as everything else: `LLM_MODEL_NAME`). Ask it to identify every section boundary and classify each section:

**Step 2 — Detect Q&A style**:
```python
def detect_qa_style(text: str) -> str:
    chinese_pattern = r'問\s*[A-Z]\d+\s*[︰:]'
    english_pattern = r'^Q\d+\s'
    if re.search(chinese_pattern, text, re.MULTILINE):
        return "chinese"
    if re.search(english_pattern, text, re.MULTILINE):
        return "english"
    return "none" # fall back to token-based
```
You are analyzing a Hong Kong Legislative Council document.
The text has page markers like [PAGE_BREAK: N] showing where pages begin.

**Step 3 — Split by Q&A (Chinese style)**:
```python
QA_PATTERN_CHINESE = re.compile(
    r'(?:^|\n)\s*問\s*([A-Z]\d+)\s*[︰:]\s*(.*?)\s*'
    r'(?:\n\s*答\s*\1\s*[︰:]\s*)' # matched answer marker
    r'(.*?)'
    r'(?=\n\s*(?:問\s*[A-Z]\d+\s*[︰:]|\([A-Z]\)|$))',
    re.DOTALL
)
# Group 1: question_id (e.g. "A1")
# Group 2: question_text
# Group 3: answer_text (everything until next 問 or section header or EOF)
```
For each distinct section in this document, identify:
1. The section type:
   - "qa": a question-and-answer pair (問/答 or Q1/Q2 format)
   - "narrative": policy text, explanatory paragraphs, section content with bullets
   - "speaking_notes": briefing points (發言要點) with bullet markers
   - "table": standalone data tables (not embedded in answers)
   - "toc": table of contents
   - "heading_only": a section heading with no following content

**Step 3b — Split by Q&A (English style)**:
```python
QA_PATTERN_ENGLISH = re.compile(
    r'(?:^|\n)\s*(Q\d+)\s+(.*?)\s*'
    r'(?=\n\s*(?:Q\d+|[A-Z]\d+\s|'
    r'[\u4e00-\u9fff]{2,}\s*$|' # Chinese-only topic line
    r'$))',
    re.DOTALL
)
# Group 1: question_id (e.g. "Q1")
# Group 2: full text (question + answer combined, since no explicit A marker)
```
2. For "qa" sections:
   - The question text (exact)
   - The answer text (exact, including tables, bullet lists, and [內部參考] content)
   - The question ID if present (e.g. "A1", "Q3")
   - The start page and end page

**Step 4 — LLM verification pass** (for edge cases only):
For each Q&A pair, if any of these conditions are true, send pair to LLM for verification:
- Answer length > 2000 tokens (may contain multiple Q&A or tables)
- Contains `[如被追問` or `[如追問` (nested Q&A)
- Contains `[內部參考` or `[內部補充` (supplementary content to handle)
- Answer text contains regex pattern for table rows

LLM prompt:
```
You are processing a Hong Kong Legislative Council document.
Analyze this Q&A pair and determine:

1. Does the answer contain tables? If yes, output them as markdown tables.
2. Are there nested followup Q&A (追問) within the answer? If yes, separate them.
3. Does the answer contain [內部參考] supplementary content? Flag it.
4. Is the Q&A boundary correct? If not, suggest the correct boundary.
3. For all sections:
   - The section heading (e.g. "(A) 排水系統", "(1) 住戶的安置補償")
   - The start page and end page
   - Whether the section contains tables

Return JSON:
{
  "correct_boundary": true/false,
  "tables_markdown": ["table1...", "table2..."],
  "nested_qa": [{"q": "...", "a": "..."}],
  "internal_refs": ["ref1...", "ref2..."],
  "clean_answer": "answer text with tables replaced by markdown"
  "sections": [
    {
      "type": "qa",
      "heading": "(A) 排水系統",
      "qa_id": "A1",
      "question": "古洞北和粉嶺北新發展區的設計是否能抵禦氣候變化...",
      "answer": "研究顧問已為古洞北和粉嶺北新發展區研究範圍進行了...",
      "start_page": 2,
      "end_page": 3,
      "has_table": true,
      "parent_topic": "排水系統"
    },
    {
      "type": "narrative",
      "heading": "(1) 住戶的安置補償",
      "content": "合資格住戶可選擇(i)「須通過經濟狀況審查」安置...",
      "start_page": 2,
      "end_page": 5,
      "has_table": false
    },
    {
      "type": "speaking_notes",
      "heading": "發言要點",
      "content": "⚫ 古洞北/粉嶺北新發展區(新發展區)是北部都會區內...",
      "start_page": 1,
      "end_page": 2,
      "has_table": false
    }
  ]
}
```
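
As a rough illustration of how the `sections` array in the JSON above might be turned into typed objects, here is a minimal sketch of the `Section` dataclass and `parse_llm_structure_response` named in the implementation tasks. The field defaults, the tolerance for string page numbers, and the choice to skip unknown section types are assumptions, not confirmed behavior.

```python
from dataclasses import dataclass
from typing import List, Optional

VALID_TYPES = {"qa", "narrative", "speaking_notes", "table", "toc", "heading_only"}


@dataclass
class Section:
    type: str
    heading: str
    start_page: int
    end_page: int
    has_table: bool = False
    qa_id: Optional[str] = None
    question: Optional[str] = None
    answer: Optional[str] = None
    content: Optional[str] = None
    parent_topic: Optional[str] = None


def parse_llm_structure_response(response: dict) -> List[Section]:
    """Convert the LLM's JSON (already decoded to a dict) into Section objects."""
    sections: List[Section] = []
    for raw in response.get("sections", []):
        if raw.get("type") not in VALID_TYPES:
            continue  # assumption: drop unknown types instead of failing the ingest
        sections.append(Section(
            type=raw["type"],
            heading=raw.get("heading", ""),
            start_page=int(raw.get("start_page", 0)),  # tolerate "2" as well as 2
            end_page=int(raw.get("end_page", 0)),
            has_table=bool(raw.get("has_table", False)),
            qa_id=raw.get("qa_id"),
            question=raw.get("question"),
            answer=raw.get("answer"),
            content=raw.get("content"),
            parent_topic=raw.get("parent_topic"),
        ))
    return sections
```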

**Step 3 — Chunk building from LLM output**:

For each section in the JSON response:

| Section type | Chunking rule |
|-------------|---------------|
| `qa` | One chunk = one Q&A pair. Question text + answer text concatenated. |
| `narrative` | Split into chunks at paragraph boundaries (`\n\n`), respecting `qa_max_chunk_tokens`. Each chunk prepended with section heading for context. |
| `speaking_notes` | Each bullet point (⚫) becomes a chunk, with topic heading prepended. |
| `table` | Each table becomes a chunk. If adjacent to a narrative section, merged into that section's chunks. |
| `toc` / `heading_only` | Skipped (not stored as chunks — no retrievable content). |
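
The chunking rules in the table above could be sketched as below. This is an illustrative simplification: it works on plain dicts shaped like the LLM's JSON rather than the `Section` dataclass, the name `build_chunks` is hypothetical, `len` is a stand-in for a real token counter, and merging tables into adjacent narrative sections is left out.

```python
from typing import Callable, List, Tuple

BULLET = "\u26ab"  # the bullet marker used in speaking notes


def build_chunks(sections: List[dict], max_tokens: int,
                 count_tokens: Callable[[str], int] = len
                 ) -> List[Tuple[str, int, dict]]:
    """Return (chunk_text, page_ref, metadata) tuples per the rules table."""
    chunks: List[Tuple[str, int, dict]] = []
    for sec in sections:
        kind = sec["type"]
        heading = sec.get("heading", "")
        page = sec.get("start_page", 0)
        meta = {"section_type": kind, "section_heading": heading}
        if kind in ("toc", "heading_only"):
            continue  # no retrievable content
        if kind == "qa":
            qa_meta = {**meta, "question_id": sec.get("qa_id"),
                       "question_text": sec["question"]}
            chunks.append((f'{sec["question"]}\n{sec["answer"]}', page, qa_meta))
        elif kind == "narrative":
            buffer = ""
            for para in sec["content"].split("\n\n"):
                candidate = f"{buffer}\n\n{para}" if buffer else para
                if buffer and count_tokens(candidate) > max_tokens:
                    chunks.append((f"{heading}\n{buffer}", page, meta))
                    buffer = para
                else:
                    buffer = candidate
            if buffer:
                chunks.append((f"{heading}\n{buffer}", page, meta))
        elif kind == "speaking_notes":
            for bullet in sec["content"].split(BULLET):
                if bullet.strip():
                    chunks.append((f"{heading}\n{BULLET} {bullet.strip()}", page, meta))
        elif kind == "table":
            chunks.append((sec.get("content", ""), page, meta))
    return chunks
```

Note that the narrative budget here applies to the paragraph text only; a stricter version would also count the prepended heading.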

**Step 4 — Regex fast-pass (optional optimization)**:

If `qa_vision_enabled=false` or for cost optimization, use regex as a fast pre-filter. The LLM call is skipped for documents where regex confidently identifies all boundaries, and used only when:
- No regex pattern matches (unknown format)
- Regex produces < 2 sections (likely misdetection)
- `qa_verification_model` is not set to `"none"`
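
One possible reading of the conditions above is that the `qa_verification_model` setting gates the LLM call, while a missing or weak regex match triggers it. The sketch below encodes that reading only; the helper names `count_regex_sections` and `should_call_llm` are hypothetical, and the patterns are the question-marker patterns quoted elsewhere in this document.

```python
import re

# Question markers for the two known formats.
CHINESE_Q = re.compile(r'問\s*[A-Z]\d+\s*[︰:]')
ENGLISH_Q = re.compile(r'^Q\d+\s', re.MULTILINE)


def count_regex_sections(text: str) -> int:
    """Count question markers; Chinese markers take priority if present."""
    chinese = CHINESE_Q.findall(text)
    if chinese:
        return len(chinese)
    return len(ENGLISH_Q.findall(text))


def should_call_llm(regex_section_count: int, qa_verification_model: str) -> bool:
    """Decide whether to fall through from the regex fast-pass to the LLM."""
    if qa_verification_model == "none":
        return False  # LLM explicitly disabled
    # Zero matches (unknown format) and a single match (likely misdetection)
    # both count as "regex was not confident".
    return regex_section_count < 2
```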

### Algorithm Detail: Table-to-Markdown

**Option A — Vision LLM per page (Primary for PDF)**:

@@ -403,24 +425,27 @@ class Settings(BaseSettings):
**Test Files**:
- `backend/app/test/test_phase8_qa_chunking.py` — new file
  - `test_chinese_qa_detection` — detect and split Chinese 問/答 pairs
  - `test_english_qa_detection` — detect and split English Q-only pairs
  - `test_no_qa_fallback` — section-based chunking when no Q&A found
  - `test_multi_page_qa` — Q&A spanning page boundaries
  - `test_nested_qa_handling` — [如被追問] patterns
  - `test_table_preservation` — tables embedded in answers
  - `test_llm_structure_detection` — LLM correctly identifies Q&A, narrative, speaking_notes sections
  - `test_llm_handles_mixed_formats` — document with 問/答 + section headings + narrative correctly classified
  - `test_llm_handles_no_qa_format` — narrative-only documents (File L pages 1-13) correctly chunked by section
  - `test_llm_handles_speaking_notes` — 發言要點 sections chunked by bullet point
  - `test_regex_fastpass_chinese` — known 問/答 format detected without LLM (when configured)
  - `test_regex_fastpass_english` — known Q1/Q2 format detected without LLM
  - `test_multi_page_sections` — sections spanning page boundaries tracked correctly
  - `test_abc_contract` — satisfies ChunkingStrategy ABC
  - `test_page_number_reference_question` — page ref points to 問 location
  - `test_size_limit` — oversized pairs get recursive split
  - `test_page_number_reference_question` — page ref points to question location for Q&A chunks
  - `test_size_limit` — oversized sections get recursive split with heading preserved

**Implementation Tasks**:

1. **Create** `backend/app/utils/qa_chunking.py`:
   - `detect_qa_style(text: str) -> str` — returns "chinese", "english", or "none"
   - `preprocess_text(pages: List[Tuple[int, str]]) -> str` — strip footers/headers, normalize colons, reconstruct full text
   - `split_chinese_qa(text: str) -> List[QAPair]` — regex-based splitting for 問/答 format
   - `split_english_qa(text: str) -> List[QAPair]` — regex-based splitting for Q-only format
   - `QAPair` dataclass: `{qa_id, question, answer, start_page, end_page, topic_section, has_table, internal_refs}`
   - `preprocess_text(pages: List[Tuple[int, str]]) -> str` — strip footers/headers, normalize colons, insert page markers, reconstruct full text
   - `build_structure_detection_prompt(text: str) -> str` — construct the LLM prompt for section classification
   - `parse_llm_structure_response(response: dict) -> List[Section]` — parse the JSON returned by the LLM
   - `Section` dataclass: `{type, heading, qa_id?, question?, answer?, content?, start_page, end_page, has_table, parent_topic}`
   - `split_chinese_qa(text: str) -> List[Section]` — regex fast-pass for known 問/答 format
   - `split_english_qa(text: str) -> List[Section]` — regex fast-pass for known Q-number format
   - `build_chunks_from_sections(sections: List[Section], max_tokens: int) -> List[Tuple[str, int, dict]]` — build chunk texts + page refs + metadata from sections

2. **Create** `QuestionChunkingStrategy(ChunkingStrategy)` in `backend/app/utils/chunking.py`:
   - `chunk(text: str) -> List[str]` — entry point for DOCX/TXT

@@ -647,12 +672,11 @@ class Settings(BaseSettings):
| Risk | Severity | Mitigation |
|------|----------|------------|
| **Vision LLM cost**: Per-page image processing adds latency and token cost for large docs | Medium | Cache vision results per page hash. Skip vision for pages without detected tables. `qa_vision_enabled` toggle allows disabling vision and using heuristic-only extraction. Vision uses the same model/API key — no additional provider cost. |
| **Q&A boundary false positives**: Regex matches non-Q&A text as Q&A | High | Use `detect_qa_style()` first — only apply Q&A splitting if document has sufficient Q&A density. LLM verification pass for ambiguous boundaries. Fall back to section-based chunking if < 3 pairs detected. |
| **Q&A boundary false negatives**: Regex misses some Q&A pairs (O1-O4 style) | Medium | LLM verification pass catches these. Log missed pairs for manual review. |
| **Q&A boundary detection**: LegCo documents use multiple formats (問/答, Q1/Q2, section headings, 發言要點) that regex alone cannot handle | Medium | LLM structure detection as primary approach handles all format variations in one call. Regex fast-pass only used for optimization on known patterns. Fall back to section-based chunking if LLM detects no Q&A content. |
| **Table extraction quality**: Vision LLM may misinterpret complex Chinese tables | Medium | Flag `has_table=True` so retrieval can surface context. Allow user to inspect raw chunk content. Provide config to disable vision and use heuristic-only extraction. |
| **Performance**: Vision LLM + LLM verification adds latency to ingest | Medium | Make vision extraction async (background task). Show progress in frontend. Cache results. Allow skipping verification for trusted documents. |
| **Page boundary interference**: Page footer numbers appear mid-Q&A text | Low | Preprocessing step strips known footer patterns. Test with all 4 LTT PDFs. |
| **LegCo format drift**: Future documents may use different Q&A markers | Low | Detection is regex-based — easy to add new patterns. LLM verification catches novel formats. Log format detection results for monitoring. |
| **LegCo format drift**: Future documents may use different formats or section structures | Low | LLM structure detection adapts to new formats without code changes — the LLM prompt describes what to look for, not how. Add new format examples to the prompt if needed. |
| **Chunk size**: Some Q&A pairs are very long (7+ pages) | Medium | Apply max chunk token limit (configurable, default 3000). Recursive split on `\n\n` → `\n` with question text prepended to each sub-chunk for context. |
| **DOCX/TXT Q&A**: Non-PDF formats may have different Q&A markers | Low | Use same regex detection on concatenated text. Skip vision table extraction (text-based only). |
| **vLLM vision compatibility**: vLLM may not support vision API for Qwen3.5-35B-A3B depending on version and how the model is served | Medium | Test with a single vision call against your vLLM instance before implementation. Set `QA_VISION_ENABLED=false` and use heuristic fallback (Option B) if unsupported. See vLLM compatibility note in Dependencies below. |