From 6bf04cedb106e21e2e17e18330da2e23ae67b6a1 Mon Sep 17 00:00:00 2001
From: Woody
Date: Fri, 15 May 2026 11:34:24 +0800
Subject: [PATCH] docs: Package 8 — switch to LLM-first structure detection (not regex-first)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

LegCo documents use multiple formats (問/答 markers, Q1/Q2 numbering, section headings like '(1) 住戶的安置補償', 發言要點 bullet points, and pure table pages). Regex alone cannot reliably classify all these.

Changes:
- Primary detection: LLM call identifies ALL section types in one pass (qa, narrative, speaking_notes, table, toc, heading_only)
- Regex: downgraded to optional fast-pass optimization for known patterns
- Architecture diagram, algorithm detail, risks, and test plan all updated
- Single model handles structure detection + table extraction + verification
---
 .plans/package8_enhancement_plan.md | 240 +++++++++++++++-------------
 1 file changed, 132 insertions(+), 108 deletions(-)

diff --git a/.plans/package8_enhancement_plan.md b/.plans/package8_enhancement_plan.md
index 3cfed38..79391e3 100644
--- a/.plans/package8_enhancement_plan.md
+++ b/.plans/package8_enhancement_plan.md
@@ -144,12 +144,9 @@ Four LegCo supplementary papers in `test materials/LTT/`:

 **1. FAQ-RAG (Pro-GenAI/FAQ-RAG)**: Treats each Q&A as an atomic retrieval unit. Stores dual embeddings per pair (question + answer). Gold standard for QA-native documents. **Adaptation**: Instead of generating FAQs, detect existing Q&A pairs in LegCo documents.

-**2. LLM Boundary Detection (LlamaChunk / semchunk)**: Uses LLM logprobs or prompt-based boundary detection to identify semantic breakpoints. semchunk AI mode achieves 37.7% correctness vs 33.3% for fixed-size chunking on legal RAG benchmarks. **Relevance**: For edge cases where regex fails (nested Q&A, tables spanning boundaries), use an LLM verification pass.
+**2. LLM Boundary Detection (LlamaChunk / semchunk)**: Uses LLM logprobs or prompt-based boundary detection to identify semantic breakpoints. semchunk AI mode achieves 37.7% correctness vs 33.3% for fixed-size chunking on legal RAG benchmarks. **Relevance**: Primary detection approach — LLM identifies all section types (Q&A, narrative, speaking notes, tables) in one call. Handles format variations that regex cannot (section headings without Q&A markers, mixed formats).

-**3. Regex + LLM Hybrid**: Two-pass approach recommended by multiple sources:
-- Pass 1: Regex-based Q&A boundary detection (free, fast, 95%+ accurate for marked documents)
-- Pass 2: LLM verification for edge cases (multi-paragraph answers, embedded tables, nested Q&A)
-- Validated by 2026 EACL RAG error taxonomy: 29.7% of RAG errors from chunking mistakes
+**3. Regex + LLM Hybrid**: Used as optimization — regex fast-pass for documents with clear markers (問/答, Q1/Q2), falling back to LLM for everything else. LLM handles all format variations (narrative sections, speaking notes, section headings without Q&A markers) that regex cannot.

 **4. Index-RAG**: Multi-embedding per chunk (question text + answer summary + topic + keywords). Optional enhancement — store multiple embeddings per Q&A pair for richer retrieval.

@@ -169,19 +166,25 @@ Four LegCo supplementary papers in `test materials/LTT/`:
 PDF Transcript
 │
 ▼
-[gptpdf / llm-ocr] ← Vision LLM converts page images to markdown
- │ Tables rendered as markdown tables inline
- ▼
-Clean Markdown with 問/答 markers preserved
+[Preprocess] Concatenate pages, strip footers/headers, normalize
 │
 ▼
-[Regex Splitter] ← Primary: split on 問...答 boundaries
- │ Fallback: split on Q1, Q2... boundaries
+[LLM Structure Detection] ← Single call: identify ALL section boundaries
+ │ Classify each as qa / narrative / speaking_notes / table
+ │ Same model as everything else (Qwen3.5-35B-A3B)
 ▼
-[LLM Boundary Verifier] ← Edge cases: nested Q&A, tables at boundaries
+[Regex fast-pass] ← Optional: skip LLM for documents with 100%-regex-detectable format
 │
 ▼
-Q&A Pair Chunks → extract_metadata(qa_style, question_index, ...)
+[Chunk Builder] ← For QA: one chunk per pair
+ │ For narrative: paragraph-level split with heading prepended
+ │ For speaking_notes: one chunk per bullet with topic prepended
+ │ For table: standalone chunk or merged into parent section
+ ▼
+[Table Extraction] ← Vision LLM (same model, base64 images) for pages with tables
+ │ → markdown tables injected into chunk text
+ ▼
+Q&A/Narrative Chunks → extract_metadata(section_type, heading, qa_id?, ...)
 │
 ▼
 ChromaDB (with strategy_type="question" in metadata)
@@ -200,113 +203,132 @@ POST /api/v1/ingest?strategy=token|question
 │
 └─ strategy="question" → QuestionChunkingStrategy (NEW)
 │
- ├─ [PDF only] Render pages as images → LLM vision (base64) → markdown
- │ └─ Uses existing Qwen3.5-35B-A3B (native vision model, no separate API)
- │ └─ Tables converted to markdown table syntax inline
+ ├─ [Preprocess] Concatenate pages, strip footers/headers, normalize
 │
- ├─ [Detect Q&A style] Chinese 問/答 vs English Q-only
+ ├─ [LLM Structure Detection] Send full document text to LLM
+ │ │ Same model as everything else (Qwen3.5-35B-A3B)
+ │ │ Prompt: "Identify all sections. Classify each as Q&A pair,
+ │ │ narrative section, table section, or speaking notes.
+ │ │ For each Q&A, output question text, answer text,
+ │ │ start/end boundaries."
+ │ └─ Returns: [{ type: "qa"|"narrative"|"table"|"speaking_notes",
+ │ qa_id?, question?, answer?, heading?, start_page, end_page }, ...]
 │
- ├─ [Primary split] Regex-based Q&A boundary detection
- │ ├─ Chinese: split on `問\s*[A-Z]\d+\s*[︰:]` → `答\s*([A-Z]\d+)`
- │ └─ English: split on `^Q\d+\s` → next Q/section = answer end
+ ├─ [Regex fallback] Skip LLM when disabled or for known patterns
+ │ └─ 問/答 markers and Q1/Q2 lines as fast-pass optimization
 │
- ├─ [LLM verification] Edge case handling:
- │ ├─ Multi-paragraph answers exceeding max chunk size
- │ ├─ Tables spanning Q&A boundaries (keep with answer)
- │ ├─ Nested Q&A (如被追問 within answers)
- │ └─ Ambiguous boundaries (O1-O4 style without explicit 問/答)
+ ├─ [Size check] Oversized Q&A answers → recursive split on \n\n → \n
+ │ └─ Question text prepended to each sub-chunk for context
 │
- ├─ [Size check] Oversized pairs → recursive split on \n\n → \n
- │ └─ Split answer portions only; keep question + answer prefix together
+ ├─ [Table extraction] Vision LLM per page with tables → markdown
+ │ └─ Same model: base64 page images via standard Chat Completions API
 │
- ├─ [Fallback] Sections with no Q&A (發言要點, narrative policy)
- │ └─ Chunk by section headers (章節) or fall back to token-based
+ ├─ [Narrative sections] Sections with no Q&A (發言要點, policy text)
+ │ └─ Chunk by section heading boundaries with heading+parent prepended
 │
 └─ [Metadata] Per-chunk:
 ├─ strategy_type: "question"
- ├─ qa_style: "chinese" | "english"
- ├─ question_index: int (0-based within document)
- ├─ question_id: str ("A1", "Q3", etc.)
- ├─ question_text: str (full question text)
+ ├─ section_type: "qa" | "narrative" | "speaking_notes"
+ ├─ question_index: int (0-based, Q&A only)
+ ├─ question_id: str ("A1", "Q3", etc., Q&A only)
+ ├─ question_text: str (full question text, Q&A only)
+ ├─ section_heading: str (e.g. "(A) 排水系統", "(1) 住戶的安置補償")
 ├─ answer_contains_table: bool
- ├─ topic_section: str (e.g. "(A) 排水系統")
- └─ source_page_range: [int, int] (page range of Q&A pair)
+ └─ source_page_range: [int, int]
 ```

-### Algorithm Detail: Q&A Boundary Detection
+### Algorithm Detail: LLM Structure Detection (Primary)
+
+LegCo documents use multiple formats — explicit 問/答 markers, Q1/Q2 numbering, section headings like `(1) 住戶的安置補償`, 發言要點 bullet points, and pure table pages. Regex alone cannot reliably classify all these. The LLM can.

 **Step 1 — Preprocess extracted text**:
 1. Concatenate all pages from `parse_pdf_by_page()` with `\n` separator
 2. Strip page footer markers: regex `^[A-Z]-\d+\s*\n\d{4}-\d{2}-\d{2}$` and `^[A-Z]-\d+$`
 3. Strip page header letters `^(\([A-Z]\))\s*$` (keep on page 1 as doc identifier)
 4. Normalize colons: `：` → `:`, `︰` → `:`
+5. Insert page boundary markers: `[PAGE_BREAK: N]` at each page boundary so the LLM knows where pages start/end
+
+**Step 2 — LLM Structure Detection (single call)**:
+
+Send the preprocessed text to the LLM (same model as everything else: `LLM_MODEL_NAME`). Ask it to identify every section boundary and classify each section:

-**Step 2 — Detect Q&A style**:
-```python
-def detect_qa_style(text: str) -> str:
-    chinese_pattern = r'問\s*[A-Z]\d+\s*[︰:]'
-    english_pattern = r'^Q\d+\s'
-    if re.search(chinese_pattern, text, re.MULTILINE):
-        return "chinese"
-    if re.search(english_pattern, text, re.MULTILINE):
-        return "english"
-    return "none" # fall back to token-based
 ```
+You are analyzing a Hong Kong Legislative Council document.
+The text has page markers like [PAGE_BREAK: N] showing where pages begin.

-**Step 3 — Split by Q&A (Chinese style)**:
-```python
-QA_PATTERN_CHINESE = re.compile(
-    r'(?:^|\n)\s*問\s*([A-Z]\d+)\s*[︰:]\s*(.*?)\s*'
-    r'(?:\n\s*答\s*\1\s*[︰:]\s*)' # matched answer marker
-    r'(.*?)'
-    r'(?=\n\s*(?:問\s*[A-Z]\d+\s*[︰:]|\([A-Z]\)|$))',
-    re.DOTALL
-)
-# Group 1: question_id (e.g. "A1")
-# Group 2: question_text
-# Group 3: answer_text (everything until next 問 or section header or EOF)
-```
+For each distinct section in this document, identify:
+1. The section type:
+   - "qa": a question-and-answer pair (問/答 or Q1/Q2 format)
+   - "narrative": policy text, explanatory paragraphs, section content with bullets
+   - "speaking_notes": briefing points (發言要點) with bullet markers
+   - "table": standalone data tables (not embedded in answers)
+   - "toc": table of contents
+   - "heading_only": a section heading with no following content

-**Step 3b — Split by Q&A (English style)**:
-```python
-QA_PATTERN_ENGLISH = re.compile(
-    r'(?:^|\n)\s*(Q\d+)\s+(.*?)\s*'
-    r'(?=\n\s*(?:Q\d+|[A-Z]\d+\s|'
-    r'[\u4e00-\u9fff]{2,}\s*$|' # Chinese-only topic line
-    r'$))',
-    re.DOTALL
-)
-# Group 1: question_id (e.g. "Q1")
-# Group 2: full text (question + answer combined, since no explicit A marker)
-```
+2. For "qa" sections:
+   - The question text (exact)
+   - The answer text (exact, including tables, bullet lists, and [內部參考] content)
+   - The question ID if present (e.g. "A1", "Q3")
+   - The start page and end page

-**Step 4 — LLM verification pass** (for edge cases only):
-For each Q&A pair, if any of these conditions are true, send pair to LLM for verification:
-- Answer length > 2000 tokens (may contain multiple Q&A or tables)
-- Contains `[如被追問` or `[如追問` (nested Q&A)
-- Contains `[內部參考` or `[內部補充` (supplementary content to handle)
-- Answer text contains regex pattern for table rows
-
-LLM prompt:
-```
-You are processing a Hong Kong Legislative Council document.
-Analyze this Q&A pair and determine:
-
-1. Does the answer contain tables? If yes, output them as markdown tables.
-2. Are there nested followup Q&A (追問) within the answer? If yes, separate them.
-3. Does the answer contain [內部參考] supplementary content? Flag it.
-4. Is the Q&A boundary correct? If not, suggest the correct boundary.
+3. For all sections:
+   - The section heading (e.g. "(A) 排水系統", "(1) 住戶的安置補償")
+   - The start page and end page
+   - Whether the section contains tables

 Return JSON:
 {
-  "correct_boundary": true/false,
-  "tables_markdown": ["table1...", "table2..."],
-  "nested_qa": [{"q": "...", "a": "..."}],
-  "internal_refs": ["ref1...", "ref2..."],
-  "clean_answer": "answer text with tables replaced by markdown"
+  "sections": [
+    {
+      "type": "qa",
+      "heading": "(A) 排水系統",
+      "qa_id": "A1",
+      "question": "古洞北和粉嶺北新發展區的設計是否能抵禦氣候變化...",
+      "answer": "研究顧問已為古洞北和粉嶺北新發展區研究範圍進行了...",
+      "start_page": 2,
+      "end_page": 3,
+      "has_table": true,
+      "parent_topic": "排水系統"
+    },
+    {
+      "type": "narrative",
+      "heading": "(1) 住戶的安置補償",
+      "content": "合資格住戶可選擇(i)「須通過經濟狀況審查」安置...",
+      "start_page": 2,
+      "end_page": 5,
+      "has_table": false
+    },
+    {
+      "type": "speaking_notes",
+      "heading": "發言要點",
+      "content": "⚫ 古洞北/粉嶺北新發展區(新發展區)是北部都會區內...",
+      "start_page": 1,
+      "end_page": 2,
+      "has_table": false
+    }
+  ]
 }
 ```

+**Step 3 — Chunk building from LLM output**:
+
+For each section in the JSON response:
+
+| Section type | Chunking rule |
+|-------------|---------------|
+| `qa` | One chunk = one Q&A pair. Question text + answer text concatenated. |
+| `narrative` | Split into chunks at paragraph boundaries (`\n\n`), respecting `qa_max_chunk_tokens`. Each chunk prepended with section heading for context. |
+| `speaking_notes` | Each bullet point (⚫) becomes a chunk, with topic heading prepended. |
+| `table` | Each table becomes a chunk. If adjacent to a narrative section, merged into that section's chunks. |
+| `toc` / `heading_only` | Skipped (not stored as chunks — no retrievable content). |
+
+**Step 4 — Regex fast-pass (optional optimization)**:
+
+If `qa_vision_enabled=false` or for cost optimization, use regex as a fast pre-filter. The LLM call is skipped for documents where regex confidently identifies all boundaries, and used only when:
+- No regex pattern matches (unknown format)
+- Regex produces < 2 sections (likely misdetection)
+- `qa_verification_model` is not set to `"none"`
+
 ### Algorithm Detail: Table-to-Markdown

 **Option A — Vision LLM per page (Primary for PDF)**:
@@ -403,24 +425,27 @@ class Settings(BaseSettings):

 **Test Files**:
 - `backend/app/test/test_phase8_qa_chunking.py` — new file
-  - `test_chinese_qa_detection` — detect and split Chinese 問/答 pairs
-  - `test_english_qa_detection` — detect and split English Q-only pairs
-  - `test_no_qa_fallback` — section-based chunking when no Q&A found
-  - `test_multi_page_qa` — Q&A spanning page boundaries
-  - `test_nested_qa_handling` — [如被追問] patterns
-  - `test_table_preservation` — tables embedded in answers
+  - `test_llm_structure_detection` — LLM correctly identifies Q&A, narrative, speaking_notes sections
+  - `test_llm_handles_mixed_formats` — document with 問/答 + section headings + narrative correctly classified
+  - `test_llm_handles_no_qa_format` — narrative-only documents (File L pages 1-13) correctly chunked by section
+  - `test_llm_handles_speaking_notes` — 發言要點 sections chunked by bullet point
+  - `test_regex_fastpass_chinese` — known 問/答 format detected without LLM (when configured)
+  - `test_regex_fastpass_english` — known Q1/Q2 format detected without LLM
+  - `test_multi_page_sections` — sections spanning page boundaries tracked correctly
   - `test_abc_contract` — satisfies ChunkingStrategy ABC
-  - `test_page_number_reference_question` — page ref points to 問 location
-  - `test_size_limit` — oversized pairs get recursive split
+  - `test_page_number_reference_question` — page ref points to question location for Q&A chunks
+  - `test_size_limit` — oversized sections get recursive split with heading preserved

 **Implementation Tasks**:

 1. **Create** `backend/app/utils/qa_chunking.py`:
-   - `detect_qa_style(text: str) -> str` — returns "chinese", "english", or "none"
-   - `preprocess_text(pages: List[Tuple[int, str]]) -> str` — strip footers/headers, normalize colons, reconstruct full text
-   - `split_chinese_qa(text: str) -> List[QAPair]` — regex-based splitting for 問/答 format
-   - `split_english_qa(text: str) -> List[QAPair]` — regex-based splitting for Q-only format
-   - `QAPair` dataclass: `{qa_id, question, answer, start_page, end_page, topic_section, has_table, internal_refs}`
+   - `preprocess_text(pages: List[Tuple[int, str]]) -> str` — strip footers/headers, normalize colons, insert page markers, reconstruct full text
+   - `build_structure_detection_prompt(text: str) -> str` — construct the LLM prompt for section classification
+   - `parse_llm_structure_response(response: dict) -> List[Section]` — parse the JSON returned by the LLM
+   - `Section` dataclass: `{type, heading, qa_id?, question?, answer?, content?, start_page, end_page, has_table, parent_topic}`
+   - `split_chinese_qa(text: str) -> List[Section]` — regex fast-pass for known 問/答 format
+   - `split_english_qa(text: str) -> List[Section]` — regex fast-pass for known Q-number format
+   - `build_chunks_from_sections(sections: List[Section], max_tokens: int) -> List[Tuple[str, int, dict]]` — build chunk texts + page refs + metadata from sections

 2. **Create** `QuestionChunkingStrategy(ChunkingStrategy)` in `backend/app/utils/chunking.py`:
    - `chunk(text: str) -> List[str]` — entry point for DOCX/TXT
@@ -647,12 +672,11 @@ class Settings(BaseSettings):
 | Risk | Severity | Mitigation |
 |------|----------|------------|
 | **Vision LLM cost**: Per-page image processing adds latency and token cost for large docs | Medium | Cache vision results per page hash. Skip vision for pages without detected tables. `qa_vision_enabled` toggle allows disabling vision and using heuristic-only extraction. Vision uses the same model/API key — no additional provider cost. |
-| **Q&A boundary false positives**: Regex matches non-Q&A text as Q&A | High | Use `detect_qa_style()` first — only apply Q&A splitting if document has sufficient Q&A density. LLM verification pass for ambiguous boundaries. Fall back to section-based chunking if < 3 pairs detected. |
-| **Q&A boundary false negatives**: Regex misses some Q&A pairs (O1-O4 style) | Medium | LLM verification pass catches these. Log missed pairs for manual review. |
+| **Q&A boundary detection**: LegCo documents use multiple formats (問/答, Q1/Q2, section headings, 發言要點) that regex alone cannot handle | Medium | LLM structure detection as primary approach handles all format variations in one call. Regex fast-pass only used for optimization on known patterns. Fall back to section-based chunking if LLM detects no Q&A content. |
 | **Table extraction quality**: Vision LLM may misinterpret complex Chinese tables | Medium | Flag `has_table=True` so retrieval can surface context. Allow user to inspect raw chunk content. Provide config to disable vision and use heuristic-only extraction. |
 | **Performance**: Vision LLM + LLM verification adds latency to ingest | Medium | Make vision extraction async (background task). Show progress in frontend. Cache results. Allow skipping verification for trusted documents. |
 | **Page boundary interference**: Page footer numbers appear mid-Q&A text | Low | Preprocessing step strips known footer patterns. Test with all 4 LTT PDFs. |
-| **LegCo format drift**: Future documents may use different Q&A markers | Low | Detection is regex-based — easy to add new patterns. LLM verification catches novel formats. Log format detection results for monitoring. |
+| **LegCo format drift**: Future documents may use different formats or section structures | Low | LLM structure detection adapts to new formats without code changes — the LLM prompt describes what to look for, not how. Add new format examples to the prompt if needed. |
 | **Chunk size**: Some Q&A pairs are very long (7+ pages) | Medium | Apply max chunk token limit (configurable, default 3000). Recursive split on `\n\n` → `\n` with question text prepended to each sub-chunk for context. |
 | **DOCX/TXT Q&A**: Non-PDF formats may have different Q&A markers | Low | Use same regex detection on concatenated text. Skip vision table extraction (text-based only). |
 | **vLLM vision compatibility**: vLLM may not support vision API for Qwen3.5-35B-A3B depending on version and how the model is served | Medium | Test with a single vision call against your vLLM instance before implementation. Set `QA_VISION_ENABLED=false` and use heuristic fallback (Option B) if unsupported. See vLLM compatibility note in Dependencies below. |
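
Reviewer sketch for Step 3 of the algorithm detail (chunk building from the LLM's JSON). This is a minimal illustration under stated assumptions, not the planned implementation: `Section`, `parse_llm_structure_response`, and `build_chunks_from_sections` take their names and signatures from the implementation tasks in this patch, but the bodies are guesses at one reasonable reading of the chunking table. `estimate_tokens` and `split_paragraphs` are hypothetical helpers, and one token per character is only a crude stand-in for a real tokenizer. Only the `\n\n` level of the recursive split is shown, and merging tables into adjacent narrative sections is left out.

```python
from dataclasses import dataclass, fields
from typing import List, Optional, Tuple

SKIPPED_TYPES = {"toc", "heading_only"}  # no retrievable content, so never stored


@dataclass
class Section:
    type: str
    heading: str = ""
    qa_id: Optional[str] = None
    question: Optional[str] = None
    answer: Optional[str] = None
    content: Optional[str] = None
    start_page: int = 1
    end_page: int = 1
    has_table: bool = False
    parent_topic: Optional[str] = None


def parse_llm_structure_response(response: dict) -> List[Section]:
    """Map the {"sections": [...]} JSON onto Section objects, dropping unknown keys."""
    known = {f.name for f in fields(Section)}
    return [
        Section(**{k: v for k, v in raw.items() if k in known})
        for raw in response.get("sections", [])
    ]


def estimate_tokens(text: str) -> int:
    # Assumption: one token per character, a crude stand-in for a real tokenizer.
    return len(text)


def split_paragraphs(prefix: str, body: str, max_tokens: int) -> List[str]:
    """Greedily pack blank-line-separated paragraphs; the prefix goes on every chunk.

    A single paragraph larger than max_tokens is emitted as-is in this sketch.
    """
    chunks: List[str] = []
    current = ""
    for para in (p.strip() for p in body.split("\n\n")):
        if not para:
            continue
        candidate = f"{current}\n\n{para}" if current else para
        if current and estimate_tokens(prefix + candidate) > max_tokens:
            chunks.append(f"{prefix}\n{current}")
            current = para
        else:
            current = candidate
    if current:
        chunks.append(f"{prefix}\n{current}")
    return chunks


def build_chunks_from_sections(
    sections: List[Section], max_tokens: int
) -> List[Tuple[str, int, dict]]:
    """Return (chunk_text, page_ref, metadata) tuples following the Step 3 table."""
    out: List[Tuple[str, int, dict]] = []
    qa_index = 0
    for s in sections:
        if s.type in SKIPPED_TYPES:
            continue
        meta = {
            "strategy_type": "question",
            "section_type": s.type,
            "section_heading": s.heading,
            "answer_contains_table": s.has_table,
            "source_page_range": [s.start_page, s.end_page],
        }
        if s.type == "qa":
            # One chunk per pair; oversized answers split with the question prepended.
            texts = split_paragraphs(s.question or "", s.answer or "", max_tokens)
            meta.update(
                question_index=qa_index,
                question_id=s.qa_id or "",
                question_text=s.question or "",
            )
            qa_index += 1
        elif s.type == "speaking_notes":
            # One chunk per bullet, with the topic heading prepended.
            bullets = [b.strip() for b in (s.content or "").split("⚫") if b.strip()]
            texts = [f"{s.heading}\n{b}" for b in bullets]
        else:  # "narrative" and "table": heading prepended to each chunk
            texts = split_paragraphs(s.heading, s.content or "", max_tokens)
        out.extend((t, s.start_page, dict(meta)) for t in texts)
    return out
```

Using `start_page` as the page reference matches the `test_page_number_reference_question` expectation that a Q&A chunk points at the question's location. Whether sub-chunks of a long answer should instead carry their own page is a design choice this sketch does not settle.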