docs: Package 8 enhancement plan - Q&A-pair chunking strategy with vision table extraction

- New QuestionChunkingStrategy splits by 問/答 and Q1/Q2 boundaries
- Vision-based table-to-markdown using existing Qwen3.5-35B-A3B (native vision model)
- Strategy selector UI on RAG Database page (token vs question)
- Hybrid approach: regex primary split + LLM verification for edge cases
- Single-model architecture: no separate vision API needed
- 6 sub-phases with test-first delivery, 7 new files, 15+ modified files

parent c3392989dc
commit dbae9411c6

@@ -0,0 +1,703 @@

# Package 8 Enhancement Plan: Q&A-Pair Chunking Strategy

**Source**: User request (2026-05-15)
**Scope**: New "chunked by question" RAG algorithm that splits LegCo meeting documents into Q&A (問/答) pairs using LLM-assisted boundary detection, with vision-model-based table-to-markdown extraction. It coexists with the existing "chunked by token" strategy; the user selects a strategy on the RAG Database page.
**Status**: Draft

---

## Objective

### Feature 1: Q&A-Pair Chunking Algorithm

Split LegCo meeting transcript PDFs at question-answer pair boundaries rather than fixed token windows. Each Q&A pair becomes one ChromaDB chunk, and its page number reference points to the **question (問) location**. This preserves the natural semantic unit of legislative documents, where each question and its answer form a self-contained unit of knowledge.

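
The page reference requirement can be met by recording where each page starts in the concatenated text, so that a question's character offset maps back to a page. The following is a minimal sketch, assuming pages arrive as `(page_number, text)` tuples as `parse_pdf_by_page()` is described to produce. The helper names `build_page_index` and `page_for_offset` are hypothetical and do not exist in the codebase.

```python
from bisect import bisect_right
from typing import List, Tuple


def build_page_index(pages: List[Tuple[int, str]]) -> Tuple[str, List[int], List[int]]:
    """Join page texts and remember the character offset where each page starts."""
    starts: List[int] = []
    numbers: List[int] = []
    parts: List[str] = []
    offset = 0
    for page_number, text in pages:
        starts.append(offset)
        numbers.append(page_number)
        parts.append(text)
        offset += len(text) + 1  # +1 for the "\n" separator added by join
    return "\n".join(parts), starts, numbers


def page_for_offset(offset: int, starts: List[int], numbers: List[int]) -> int:
    """Return the page number that contains the given character offset."""
    return numbers[bisect_right(starts, offset) - 1]


full_text, starts, numbers = build_page_index(
    [(1, "intro"), (2, "問 A1:why?"), (3, "答 A1:because")]
)
question_page = page_for_offset(full_text.index("問 A1"), starts, numbers)
```

Offsets only stay valid if footers and headers are stripped from each page before the pages are joined, not afterwards.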
### Feature 2: Table-to-Markdown Extraction

Answers in LegCo documents frequently contain complex tables (budget figures, capacity tables, allowance rates). These tables must be converted to markdown so that the LLM can read their contents correctly during retrieval. Use a vision LLM (the existing Qwen3.5-35B-A3B model, or any GPT-4o-compatible vision model) to identify tables in answer sections and render them as markdown tables.

### Feature 3: Strategy Selection UI

Allow users to choose between "Chunked by token" (existing) and "Chunked by question" (new) on the RAG Database page. Strategy metadata is stored per document and displayed in the document and chunk list views.

**Non-goals**: This package does NOT modify the existing TokenChunkingStrategy, does NOT change the ChromaDB storage schema beyond adding metadata fields, does NOT affect the retrieval/query pipeline, and does NOT replace the existing per-chunk PDF generation (which continues for both strategies).

---

## Current State

### Existing Chunking Pipeline

```
PDF → parse_pdf_by_page()   [pypdf extract_text per page]
    → chunk_pages()         [TokenChunkingStrategy: 1 chunk/page, 200-token overlap]
    → extract_metadata()    [filename, date, summary, page_number, chunk_file_path]
    → rag.ingest_document() → collection.add() → ChromaDB
```

### Chunking Strategy Abstraction

**File**: `backend/app/utils/chunking.py`

- `ChunkingStrategy` ABC (line 13): defines `chunk(text: str) -> List[str]`
- `TokenChunkingStrategy(ChunkingStrategy)` (line 28): token-based sliding window
- `chunk_pages(pages, overlap_tokens)` (line 75): page-aware variant (PDF only)
- NO strategy selection mechanism: `TokenChunkingStrategy` is hardcoded in the ingest router

### Ingest Router

**File**: `backend/app/routers/ingest.py`

- Line 44: `from app.utils.chunking import TokenChunkingStrategy`
- Line 71: `chunker = TokenChunkingStrategy(chunk_size=settings.chunk_size, overlap=settings.chunk_overlap)`
- Line 39: `async def ingest_document(file: UploadFile = File(...))` (**no strategy param**)
- Three format branches (PDF/DOCX/TXT), each instantiating the same chunker

### Config

**File**: `backend/app/core/config.py`
- `chunk_size: int = 1000` (line 41)
- `chunk_overlap: int = 200` (line 42)
- No strategy config exists

### Frontend

**File**: `frontend/src/pages/RAGDatabasePage.tsx`
- `handleUpload()` calls `ingestDocumentMutation.mutateAsync(file)` with no strategy param
- No strategy selector UI component exists

**File**: `frontend/src/components/DocumentList.tsx`
- Displays `filename`, `chunk_count`, `upload_date`, with no strategy info

**File**: `frontend/src/components/ChunkList.tsx`
- Shows `Page: {chunk.page_number}`, with no concept of a question index

---

## Document Analysis: LTT Test Materials

Four LegCo supplementary papers in `test materials/LTT/`:

### File A: `詳細數字一覽表` ("Detailed Figures at a Glance", 18 pages), **No Q&A format**
- Pure data tables appendix (17 of 18 pages are tables)
- 8 tables covering development statistics, costs, compensation stats
- Will fall back to token-based chunking (Q&A strategy produces 0 pairs)

### File B: `古洞北粉嶺北新發展區概覽` ("Overview of the Kwu Tung North and Fanling North New Development Areas", 8 pages), **3 Q&A pairs**
- 發言要點 (speaking points) section at start
- 備用問答 (standby Q&A) section at end with 3 pairs: 問 B1/答 B1, 問 B2/答 B2, 問 B3/答 B3
- Contains milestone tables and planning review result tables

### File E: `基礎設施` ("Infrastructure", 27 pages), **12 Chinese Q&A + 3 "Others"**
- 發言要點 (speaking points, pages 1-2): narrative briefing points
- 備用問答 (standby Q&A, pages 2-23): Q&A organized by topic sections:
  - `(A) 排水系統` (Drainage): 問 A1-A3 / 答 A1-A3
  - `(B) 污水收集` (Sewage): 問 B1 / 答 B1
  - `(C) 供水系統` (Water supply): 問 C1 / 答 C1
  - `(D) 公用設施共同溝` (Common utility enclosure): 問 D1-D7 / 答 D1-D7
- Others section (pages 23-27): O1-O4 (combined Q&A without explicit 問/答 markers)
- Contains **tables within answers** (treatment plant capacity table, common utility comparison)
- Contains `[內部參考...]` (internal reference) / `[內部補充資料...]` (internal supplementary information) brackets
- Contains `[如被追問...]` (if pressed further) followup answer patterns

### File L: `補償及安置安排` ("Compensation and Rehousing Arrangements", 36 pages), **24 English Q&A**
- Table of contents + 8 narrative policy sections (pages 1-13)
- 備用問題 (standby questions, pages 14-36): Q&A organized by topic:
  - 住戶/鄉村遷置 (Households / village resettlement): Q1-Q6
  - 業務經營者 (Business operators): Q7-Q16
  - 務農人士 (Farmers): Q17-Q21
  - 墳墓/金塔 (Graves / urns): Q22-Q24
- **No explicit "A" answer marker**: the answer ends at the next `Q` or section header
- Contains tables within answers (allowance rate tables, compensation amount tables)
- Contains `[如被追問...]` / `[如追問...]` followup patterns
- Contains `[內部參考...]` brackets

### Key Structural Findings

**Two distinct Q&A formats**:

| Style | Files | Q Pattern | A Pattern | Boundary Rule |
|-------|-------|-----------|-----------|---------------|
| Chinese 問/答 | B, E | `問\s*[A-Z]\d+\s*[︰:]` | `答\s*[A-Z]\d+\s*[︰:]` | Q ends at 答; A ends at next 問 or section header |
| English Q-only | L | `^Q\d+\s` | None (implicit) | Q ends at end of Q text; A ends at next `^Q\d+` or topic line |

**Page layout conventions**:
- Page headers: `(letter)` on first page
- Page footers: `{letter}-{page_number}\n{YYYY-MM-DD}`, which appear interspersed in the extracted text at page boundaries
- Section headers: `(A) Section Name` within Q&A blocks, serving as grouping markers
- 備用問答 / 備用問題: section headers marking Q&A start

**Answer content complexity**:
- Multi-paragraph answers (spanning 2-7 pages for complex topics)
- Tables embedded mid-answer (capacity tables, financial tables)
- Internal reference brackets `[內部參考...]` with technical data
- Followup Q&A `[如被追問...]`: nested Q&A within answers
- Bullet lists with various markers (⚫, ➢, -, i. ii. iii.)

**Page boundary interference**: Q&A pairs frequently span page boundaries. The text extracted by `pypdf` retains page number markers such as `E-2` and `E-3` mid-text. These must be stripped or treated as boundaries.

---

## Research Findings: External Approaches

### Q&A Chunking Approaches (from librarian research)

**1. FAQ-RAG (Pro-GenAI/FAQ-RAG)**: Treats each Q&A as an atomic retrieval unit. Stores dual embeddings per pair (question + answer). Gold standard for QA-native documents. **Adaptation**: Instead of generating FAQs, detect existing Q&A pairs in LegCo documents.

**2. LLM Boundary Detection (LlamaChunk / semchunk)**: Uses LLM logprobs or prompt-based boundary detection to identify semantic breakpoints. semchunk AI mode achieves 37.7% correctness vs 33.3% for fixed-size chunking on legal RAG benchmarks. **Relevance**: For edge cases where regex fails (nested Q&A, tables spanning boundaries), use an LLM verification pass.

**3. Regex + LLM Hybrid**: Two-pass approach recommended by multiple sources:
- Pass 1: Regex-based Q&A boundary detection (free, fast, 95%+ accurate for marked documents)
- Pass 2: LLM verification for edge cases (multi-paragraph answers, embedded tables, nested Q&A)
- Validated by 2026 EACL RAG error taxonomy: 29.7% of RAG errors come from chunking mistakes

**4. Index-RAG**: Multi-embedding per chunk (question text + answer summary + topic + keywords). Optional enhancement: store multiple embeddings per Q&A pair for richer retrieval.

### Vision Table Extraction (from librarian research)

**1. gptpdf (CosmosShadow/gptpdf, 3,550 stars)**: 293 lines of Python. Uses PyMuPDF to detect non-text regions → marks them with colored rectangles → GPT-4o converts page images to markdown. Cost: ~$0.013/page. **Best fit**: Lightweight, handles Chinese well, supports multiple vision LLMs.

**2. llm-based-ocr (yigitkonur/llm-based-ocr, 890 stars)**: FastAPI service. PDF → Markdown with table preservation. Parallel processing (50-page PDF in seconds). Batch API cost: ~$4/1000 pages. **Best fit for production**: Cost-effective at scale.

**3. Instructor + GPT-4o**: Structured table extraction as Pydantic models with MarkdownDataFrame. Provides typed, structured table data directly usable in RAG. **Best fit**: When tables need to be queryable as structured data.

**4. markdrop**: Comprehensive PDF → Markdown with AI descriptions for images and tables. Supports 6 LLM providers (OpenAI, Gemini, Claude, Groq, OpenRouter, LiteLLM). **Best fit**: Full-pipeline solution, but heavier than gptpdf.

### Recommended Approach (synthesized)

```
PDF Transcript
    │
    ▼
[gptpdf / llm-ocr]        ← Vision LLM converts page images to markdown
    │                       Tables rendered as markdown tables inline
    ▼
Clean Markdown with 問/答 markers preserved
    │
    ▼
[Regex Splitter]          ← Primary: split on 問...答 boundaries
    │                       Fallback: split on Q1, Q2... boundaries
    ▼
[LLM Boundary Verifier]   ← Edge cases: nested Q&A, tables at boundaries
    │
    ▼
Q&A Pair Chunks → extract_metadata(qa_style, question_index, ...)
    │
    ▼
ChromaDB (with strategy_type="question" in metadata)
```

---

## Target State

### Architecture

```
POST /api/v1/ingest?strategy=token|question
  │
  ├─ strategy="token" → TokenChunkingStrategy (existing, unchanged)
  │
  └─ strategy="question" → QuestionChunkingStrategy (NEW)
       │
       ├─ [PDF only] Render pages as images → LLM vision (base64) → markdown
       │    ├─ Uses existing Qwen3.5-35B-A3B (native vision model, no separate API)
       │    └─ Tables converted to markdown table syntax inline
       │
       ├─ [Detect Q&A style] Chinese 問/答 vs English Q-only
       │
       ├─ [Primary split] Regex-based Q&A boundary detection
       │    ├─ Chinese: split on `問\s*[A-Z]\d+\s*[︰:]` → `答\s*([A-Z]\d+)`
       │    └─ English: split on `^Q\d+\s` → next Q/section = answer end
       │
       ├─ [LLM verification] Edge case handling:
       │    ├─ Multi-paragraph answers exceeding max chunk size
       │    ├─ Tables spanning Q&A boundaries (keep with answer)
       │    ├─ Nested Q&A (如被追問 within answers)
       │    └─ Ambiguous boundaries (O1-O4 style without explicit 問/答)
       │
       ├─ [Size check] Oversized pairs → recursive split on \n\n → \n
       │    └─ Split answer portions only; keep question + answer prefix together
       │
       ├─ [Fallback] Sections with no Q&A (發言要點, narrative policy)
       │    └─ Chunk by section headers (章節) or fall back to token-based
       │
       └─ [Metadata] Per-chunk:
            ├─ strategy_type: "question"
            ├─ qa_style: "chinese" | "english"
            ├─ question_index: int (0-based within document)
            ├─ question_id: str ("A1", "Q3", etc.)
            ├─ question_text: str (full question text)
            ├─ answer_contains_table: bool
            ├─ topic_section: str (e.g. "(A) 排水系統")
            └─ source_page_range: [int, int] (page range of Q&A pair)
```

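
The size-check step in the diagram could be sketched as below. This is only an illustration under stated assumptions: the plan does not name a tokenizer, so `count_tokens` defaults to character count as a stand-in, and the sketch repeats the question at the top of every piece, which is one possible reading of "keep question + answer prefix together". The function name `split_oversized_answer` is hypothetical.

```python
from typing import Callable, List


def split_oversized_answer(
    question: str,
    answer: str,
    max_tokens: int,
    count_tokens: Callable[[str], int] = len,  # stand-in: characters, not real tokens
) -> List[str]:
    """Split an answer on blank lines, then on single newlines, repeating the question."""

    def split(text: str, separators: List[str]) -> List[str]:
        if count_tokens(question) + count_tokens(text) <= max_tokens or not separators:
            return [text]
        head, *rest = separators
        pieces = [p for p in text.split(head) if p.strip()]
        merged: List[str] = []
        for piece in pieces:
            candidate = f"{merged[-1]}{head}{piece}" if merged else piece
            if merged and count_tokens(question) + count_tokens(candidate) <= max_tokens:
                merged[-1] = candidate  # greedily pack pieces into the current chunk
            else:
                merged.append(piece)
        # Recurse with the finer separator for any piece that is still too large.
        return [part for m in merged for part in split(m, rest)]

    return [f"{question}\n{part}" for part in split(answer, ["\n\n", "\n"])]


chunks = split_oversized_answer("Q", "aaaa\n\nbbbb\n\ncccc", max_tokens=11)
```

A piece that contains no separator at all is returned as is, even if it exceeds the limit, so a real implementation would still need a last-resort token split.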
### Algorithm Detail: Q&A Boundary Detection

**Step 1: Preprocess extracted text**
1. Concatenate all pages from `parse_pdf_by_page()` with a `\n` separator
2. Strip page footer markers: regex `^[A-Z]-\d+\s*\n\d{4}-\d{2}-\d{2}$` and `^[A-Z]-\d+$`
3. Strip page header letters `^(\([A-Z]\))\s*$` (keep the one on page 1 as the document identifier)
4. Normalize colons: `:` (full-width, U+FF1A) → `:`, `︰` (U+FE30) → `:`

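
The four preprocessing steps might look like the sketch below. It is not the project's implementation: it assumes the footer and header patterns listed above, applies them per page before joining so that page 1 can keep its header letter, and uses escape sequences for the two non-ASCII colons. The constant names are hypothetical.

```python
import re
from typing import List, Tuple

# Footer such as "E-2", optionally followed by a date line such as "2026-05-15".
FOOTER_RE = re.compile(r'^[A-Z]-\d+[ \t]*(?:\n\d{4}-\d{2}-\d{2}[ \t]*)?$', re.MULTILINE)
# Header letter such as "(E)" alone on a line; "(A) Section Name" is left untouched.
HEADER_RE = re.compile(r'^\([A-Z]\)[ \t]*$', re.MULTILINE)
# Full-width colon (U+FF1A) and presentation-form colon (U+FE30) both map to ":".
COLON_TABLE = str.maketrans({'\uff1a': ':', '\ufe30': ':'})


def preprocess_text(pages: List[Tuple[int, str]]) -> str:
    cleaned: List[str] = []
    for index, (_page_number, text) in enumerate(pages):
        text = FOOTER_RE.sub('', text)
        if index > 0:  # keep the header letter on the first page as the document identifier
            text = HEADER_RE.sub('', text)
        cleaned.append(text.translate(COLON_TABLE).strip())
    return "\n".join(cleaned)


sample_pages = [
    (1, "(E)\n問 A1\ufe30why?\nE-1\n2026-05-15"),
    (2, "(E)\n答 A1\uff1abecause\nE-2"),
]
clean_text = preprocess_text(sample_pages)
```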
**Step 2: Detect Q&A style**
```python
import re


def detect_qa_style(text: str) -> str:
    """Return "chinese", "english", or "none" based on the first marker style found."""
    chinese_pattern = r'問\s*[A-Z]\d+\s*[︰:]'
    english_pattern = r'^Q\d+\s'
    # Chinese markers take priority when a document contains both styles.
    if re.search(chinese_pattern, text, re.MULTILINE):
        return "chinese"
    if re.search(english_pattern, text, re.MULTILINE):
        return "english"
    return "none"  # fall back to token-based
```

**Step 3: Split by Q&A (Chinese style)**
```python
import re

QA_PATTERN_CHINESE = re.compile(
    r'(?:^|\n)\s*問\s*([A-Z]\d+)\s*[︰:]\s*(.*?)\s*'
    r'(?:\n\s*答\s*\1\s*[︰:]\s*)'  # answer marker must carry the same ID as the question
    r'(.*?)'
    # Stop before the next 問 marker or "(A)"-style section header, or at end of text.
    # \Z sits outside the "\n\s*" branch so the last pair matches without a trailing newline.
    r'(?=\n\s*(?:問\s*[A-Z]\d+\s*[︰:]|\([A-Z]\))|\s*\Z)',
    re.DOTALL,
)
# Group 1: question_id (e.g. "A1")
# Group 2: question_text
# Group 3: answer_text (everything until next 問 or section header or EOF)
```

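
As an illustration of how a pattern of this shape could feed `split_chinese_qa`, the sketch below iterates over the matches and collects one record per pair. The pattern is repeated so the example runs on its own; the dictionary keys follow the metadata names in the architecture diagram, and `question_offset` is a hypothetical extra field intended for a later page lookup.

```python
import re
from typing import Dict, List

PAIR_RE = re.compile(
    r'(?:^|\n)\s*問\s*([A-Z]\d+)\s*[︰:]\s*(.*?)\s*'
    r'\n\s*答\s*\1\s*[︰:]\s*'
    r'(.*?)'
    r'(?=\n\s*(?:問\s*[A-Z]\d+\s*[︰:]|\([A-Z]\))|\s*\Z)',
    re.DOTALL,
)


def split_chinese_qa(text: str) -> List[Dict[str, object]]:
    pairs: List[Dict[str, object]] = []
    for index, match in enumerate(PAIR_RE.finditer(text)):
        pairs.append({
            "question_index": index,              # 0-based within the document
            "question_id": match.group(1),
            "question_text": match.group(2).strip(),
            "answer_text": match.group(3).strip(),
            "question_offset": match.start(1),    # character offset of the question ID
        })
    return pairs


sample = "(A) 排水系統\n問 A1:why?\n答 A1:because\nmore detail\n問 A2:how?\n答 A2:like so"
pairs = split_chinese_qa(sample)
```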
**Step 3b: Split by Q&A (English style)**
```python
import re

QA_PATTERN_ENGLISH = re.compile(
    r'(?:^|\n)\s*(Q\d+)\s+(.*?)\s*'
    r'(?=\n\s*(?:Q\d+|[A-Z]\d+\s|'
    # Chinese-only topic line; "/" is allowed because topics such as 住戶/鄉村遷置 use it.
    # The line end is matched explicitly because "$" only means end of text without re.MULTILINE.
    r'[\u4e00-\u9fff/]{2,}[ \t]*(?:\n|\Z))'
    r'|\s*\Z)',
    re.DOTALL,
)
# Group 1: question_id (e.g. "Q1")
# Group 2: full text (question + answer combined, since no explicit A marker)
```

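
Because the English-style pattern captures question and answer as one block, something still has to separate them. The plan does not say how, so the sketch below rests on an assumption: the question ends at the first question mark (ASCII or full-width), and if there is none, at the first line break. `split_question_answer` is a hypothetical helper name.

```python
import re
from typing import Tuple

# Assumption: the question ends at the first question mark (ASCII "?" or full-width U+FF1F).
QUESTION_END_RE = re.compile(r'[?\uff1f]')


def split_question_answer(block: str) -> Tuple[str, str]:
    """Split a combined Q+A block into (question, answer)."""
    match = QUESTION_END_RE.search(block)
    if match:
        cut = match.end()
    else:
        newline = block.find("\n")
        cut = newline if newline != -1 else len(block)  # no line break: all question
    return block[:cut].strip(), block[cut:].strip()


question, answer = split_question_answer("What is the rate?\nThe rate is 5.")
```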
**Step 4: LLM verification pass** (for edge cases only)
For each Q&A pair, send the pair to the LLM for verification if any of these conditions is true:
- Answer length > 2000 tokens (may contain multiple Q&A or tables)
- Contains `[如被追問` or `[如追問` (nested Q&A)
- Contains `[內部參考` or `[內部補充` (supplementary content to handle)
- Answer text matches a table-row regex (for example, lines of aligned numeric columns)

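
These trigger conditions could be gathered into a single predicate, as in the sketch below. The marker strings and the 2000-token threshold come from the list above; the table-row heuristic (at least three whitespace-separated cells, two of them numeric) is an assumption standing in for the unspecified table-row regex, and the function names are hypothetical.

```python
import re

FOLLOWUP_MARKERS = ("[如被追問", "[如追問")
INTERNAL_MARKERS = ("[內部參考", "[內部補充")
# A cell counts as numeric if it looks like 1,200 or 3.5% or $400.
NUMERIC_CELL_RE = re.compile(r'\$?\d[\d,.]*%?')


def looks_like_table_row(line: str) -> bool:
    """Assumed heuristic: three or more cells, at least two of them numeric."""
    cells = line.split()
    numeric = sum(1 for cell in cells if NUMERIC_CELL_RE.fullmatch(cell))
    return len(cells) >= 3 and numeric >= 2


def needs_llm_verification(answer: str, token_count: int, max_tokens: int = 2000) -> bool:
    if token_count > max_tokens:
        return True
    if any(marker in answer for marker in FOLLOWUP_MARKERS + INTERNAL_MARKERS):
        return True
    return any(looks_like_table_row(line) for line in answer.splitlines())


flagged = needs_llm_verification("Plant A 1,200 3,400", token_count=10)
```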
LLM prompt:
```
You are processing a Hong Kong Legislative Council document.
Analyze this Q&A pair and determine:

1. Does the answer contain tables? If yes, output them as markdown tables.
2. Are there nested followup Q&A (追問) within the answer? If yes, separate them.
3. Does the answer contain [內部參考] supplementary content? Flag it.
4. Is the Q&A boundary correct? If not, suggest the correct boundary.

Return JSON:
{
  "correct_boundary": true/false,
  "tables_markdown": ["table1...", "table2..."],
  "nested_qa": [{"q": "...", "a": "..."}],
  "internal_refs": ["ref1...", "ref2..."],
  "clean_answer": "answer text with tables replaced by markdown"
}
```

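
Model replies to a prompt like this are not guaranteed to be valid JSON, so the caller probably wants a defensive parser. The sketch below is one possible approach, not the planned implementation: it extracts the outermost braces, checks the five keys from the prompt against expected types, and falls back to the regex result on any problem. The name `parse_verification_reply` is hypothetical.

```python
import json
from typing import Any, Dict

EXPECTED_TYPES = {
    "correct_boundary": bool,
    "tables_markdown": list,
    "nested_qa": list,
    "internal_refs": list,
    "clean_answer": str,
}


def parse_verification_reply(reply: str, original_answer: str) -> Dict[str, Any]:
    """Parse the verifier's JSON; on any problem, keep the regex result untouched."""
    fallback: Dict[str, Any] = {
        "correct_boundary": True,
        "tables_markdown": [],
        "nested_qa": [],
        "internal_refs": [],
        "clean_answer": original_answer,
    }
    # Models often wrap JSON in prose or code fences, so take the outermost braces.
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end <= start:
        return fallback
    try:
        data = json.loads(reply[start:end + 1])
    except json.JSONDecodeError:
        return fallback
    for key, expected in EXPECTED_TYPES.items():
        if not isinstance(data.get(key), expected):
            return fallback
    return data


reply = (
    'Here is the result: {"correct_boundary": true, "tables_markdown": [], '
    '"nested_qa": [], "internal_refs": [], "clean_answer": "x"} Done.'
)
parsed = parse_verification_reply(reply, original_answer="orig")
```

Falling back silently keeps ingestion running when the verifier misbehaves; logging the rejected reply would make such cases visible.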
### Algorithm Detail: Table-to-Markdown

**Option A: Vision LLM per page (Primary for PDF)**
1. Render each page of the Q&A section as an image (300 DPI). pypdf does not rasterize pages, so rendering needs a library such as PyMuPDF (the approach gptpdf takes), with PIL for any image post-processing.
2. Send the image to the **existing LLM model** (`LLM_MODEL_NAME`, currently `qwen/qwen3.5-35b-a3b`), which is a native vision-language model that accepts base64 images via the standard OpenAI Chat Completions API:
   ```json
   {
     "role": "user",
     "content": [
       { "type": "text", "text": "Convert this page to Markdown. For any tables, use |---|---| syntax..." },
       { "type": "image_url", "image_url": { "url": "data:image/png;base64,..." } }
     ]
   }
   ```
3. **No separate vision model or API key needed**: this uses the existing `LLM_BASE_URL`/`LLM_API_KEY` config
4. Prompt:
   ```
   Convert this page to Markdown. For any tables:
   - Use proper markdown table syntax with |---|---| alignment
   - Preserve all column headers and row labels
   - Do not modify or translate the content
   - If a table spans multiple pages, note it
   ```

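
The message in step 2 can be assembled from raw PNG bytes with the standard library alone. The sketch below shows only that assembly step; rendering the page to PNG and sending the request are left to the caller, and the helper name `build_vision_message` is hypothetical.

```python
import base64
from typing import Any, Dict

VISION_PROMPT = (
    "Convert this page to Markdown. For any tables:\n"
    "- Use proper markdown table syntax with |---|---| alignment\n"
    "- Preserve all column headers and row labels\n"
    "- Do not modify or translate the content\n"
    "- If a table spans multiple pages, note it"
)


def build_vision_message(png_bytes: bytes, prompt: str = VISION_PROMPT) -> Dict[str, Any]:
    """Build one Chat Completions user message carrying a prompt and a base64 PNG."""
    encoded = base64.b64encode(png_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{encoded}"}},
        ],
    }


message = build_vision_message(b"\x89PNG placeholder bytes")
```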
**Option B: Table detection + LLM extraction (Fallback for non-PDF)**
1. Detect table regions in extracted text using heuristics (rows of aligned numbers, pipe-delimited data)
2. Send detected table text blocks to the LLM with this prompt:
   ```
   Convert this raw table text extracted from a PDF into a markdown table.
   Preserve all data exactly. Detect column boundaries and alignment.
   ```
3. No vision API cost: uses the text-only LLM

**Option C: Hybrid (Recommended)**
- Use the vision LLM (same model) for pages with complex multi-column/multi-row tables
- Use the text-based LLM for simple tables detected via regex
- Cache vision results per page to avoid re-processing on re-ingest

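
Per-page caching could be keyed on a hash of the page image together with the prompt and model name, so that changing either invalidates old entries. The sketch below makes that assumption explicit; the plan itself only mentions a `page_hash`, and the function names, the `.md` file layout, and the `convert` callback are all hypothetical.

```python
import hashlib
from pathlib import Path
from typing import Callable


def page_cache_key(png_bytes: bytes, prompt: str, model: str) -> str:
    """Hash the model, prompt, and image so a change to any of them misses the cache."""
    digest = hashlib.sha256()
    for part in (model.encode("utf-8"), prompt.encode("utf-8"), png_bytes):
        digest.update(len(part).to_bytes(8, "big"))  # length prefix avoids ambiguous joins
        digest.update(part)
    return digest.hexdigest()


def cached_page_markdown(
    cache_dir: Path,
    png_bytes: bytes,
    prompt: str,
    model: str,
    convert: Callable[[bytes], str],
) -> str:
    """Return cached markdown for a page, calling `convert` only on a cache miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"{page_cache_key(png_bytes, prompt, model)}.md"
    if path.exists():
        return path.read_text(encoding="utf-8")
    markdown = convert(png_bytes)
    path.write_text(markdown, encoding="utf-8")
    return markdown
```

Here `convert` stands for whatever function sends the page to the vision model, which keeps the cache logic testable without network access.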
### Config Additions

```python
# backend/app/core/config.py
class Settings(BaseSettings):
    # Existing...
    chunk_size: int = 1000
    chunk_overlap: int = 200

    # NEW: Chunking strategy selection
    default_chunking_strategy: str = "token"  # "token" | "question"

    # NEW: Q&A chunking config
    qa_vision_enabled: bool = True        # Toggle vision-based table extraction (uses existing LLM_MODEL_NAME)
    qa_max_chunk_tokens: int = 3000       # Max tokens per Q&A chunk (before forced split)
    qa_verification_model: str = ""       # LLM for boundary verification (empty = use LLM_MODEL_NAME)
    qa_include_internal_refs: bool = True  # Include [內部參考] in chunks
    qa_cache_vision_results: bool = True  # Cache vision results per page

# Env var mappings
# DEFAULT_CHUNKING_STRATEGY=token
# QA_VISION_ENABLED=true
# QA_MAX_CHUNK_TOKENS=3000
# QA_VERIFICATION_MODEL=
# QA_INCLUDE_INTERNAL_REFS=true
# QA_CACHE_VISION_RESULTS=true

# NOTE: Vision table extraction uses the existing LLM_MODEL_NAME
# (qwen/qwen3.5-35b-a3b on OpenRouter, a native vision-language model).
# No separate vision model config needed. Base64 images are sent
# via the standard OpenAI Chat Completions API image_url format.
```

---

## Implementation Plan

### Sub-Phase 8.0: Config & Enums (Foundation)

**Objective**: Add all config settings and type definitions needed by subsequent sub-phases.

**Test Files**:
- `backend/app/test/test_phase1_config.py`: add `test_qa_chunking_config_defaults`
- `frontend/src/test/` (TBD): type-level validation

**Implementation Tasks**:
1. Add Q&A chunking config fields to `backend/app/core/config.py`
2. Define `ChunkingStrategyType = Literal["token", "question"]` in `backend/app/models/ingest.py`
3. Add `VALID_CHUNKING_STRATEGIES = {"token", "question"}` constant
4. Update `backend/.env.example` with new env vars
5. Add strategy fields to frontend types (`ChunkingStrategy` type, extend `IngestResponse`, `DocumentInfo`, `ChunkInfo`)

### Sub-Phase 8.1: Q&A Chunking Strategy Class

**Objective**: Implement `QuestionChunkingStrategy` as a new `ChunkingStrategy` subclass.

**Test Files**:
- `backend/app/test/test_phase8_qa_chunking.py` (new file)
  - `test_chinese_qa_detection`: detect and split Chinese 問/答 pairs
  - `test_english_qa_detection`: detect and split English Q-only pairs
  - `test_no_qa_fallback`: section-based chunking when no Q&A found
  - `test_multi_page_qa`: Q&A spanning page boundaries
  - `test_nested_qa_handling`: [如被追問] patterns
  - `test_table_preservation`: tables embedded in answers
  - `test_abc_contract`: satisfies ChunkingStrategy ABC
  - `test_page_number_reference_question`: page ref points to 問 location
  - `test_size_limit`: oversized pairs get recursive split

**Implementation Tasks**:

1. **Create** `backend/app/utils/qa_chunking.py`:
   - `detect_qa_style(text: str) -> str`: returns "chinese", "english", or "none"
   - `preprocess_text(pages: List[Tuple[int, str]]) -> str`: strip footers/headers, normalize colons, reconstruct full text
   - `split_chinese_qa(text: str) -> List[QAPair]`: regex-based splitting for 問/答 format
   - `split_english_qa(text: str) -> List[QAPair]`: regex-based splitting for Q-only format
   - `QAPair` dataclass: `{qa_id, question, answer, start_page, end_page, topic_section, has_table, internal_refs}`

2. **Add** `QuestionChunkingStrategy(ChunkingStrategy)` to `backend/app/utils/chunking.py`:
   - `chunk(text: str) -> List[str]`: entry point for DOCX/TXT
   - `chunk_pages(pages: List[Tuple[int, str]]) -> List[Tuple[str, int]]`: entry point for PDF
   - Returns chunks with page number = question start page
   - Handles fallback: if no Q&A is detected, falls back to token-based or section-based chunking

3. **Create** `backend/app/utils/table_extraction.py`:
   - `extract_tables_vision(page_images: List[Image]) -> List[str]`: vision LLM → markdown tables
   - `extract_tables_text(text: str) -> List[str]`: heuristic table detection → LLM markdown conversion
   - `inject_tables_into_answer(answer: str, tables_md: List[str]) -> str`: replace raw tables with markdown
   - `cache_vision_result(page_hash: str) -> Optional[str]`: disk cache for vision results

4. **Add** `get_chunking_strategy(name: str, settings: Settings) -> ChunkingStrategy` factory function in `backend/app/utils/chunking.py`

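
The factory in task 4 might take a shape like the following sketch. Every class here is a simplified stand-in so the example runs on its own: the real `ChunkingStrategy`, `TokenChunkingStrategy`, and `Settings` live in the codebase, and the constructor signature of `QuestionChunkingStrategy` is an assumption, since the plan does not define it.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List

VALID_CHUNKING_STRATEGIES = {"token", "question"}


@dataclass
class Settings:  # stand-in for app.core.config.Settings
    chunk_size: int = 1000
    chunk_overlap: int = 200
    qa_max_chunk_tokens: int = 3000


class ChunkingStrategy(ABC):  # stand-in for the existing ABC
    @abstractmethod
    def chunk(self, text: str) -> List[str]:
        ...


class TokenChunkingStrategy(ChunkingStrategy):
    def __init__(self, chunk_size: int, overlap: int) -> None:
        self.chunk_size = chunk_size
        self.overlap = overlap

    def chunk(self, text: str) -> List[str]:
        return [text]  # real sliding-window logic omitted


class QuestionChunkingStrategy(ChunkingStrategy):
    def __init__(self, max_chunk_tokens: int) -> None:  # signature is an assumption
        self.max_chunk_tokens = max_chunk_tokens

    def chunk(self, text: str) -> List[str]:
        return [text]  # real Q&A splitting omitted


def get_chunking_strategy(name: str, settings: Settings) -> ChunkingStrategy:
    """Map a strategy name to a configured strategy instance, rejecting unknown names."""
    if name not in VALID_CHUNKING_STRATEGIES:
        allowed = sorted(VALID_CHUNKING_STRATEGIES)
        raise ValueError(f"Unknown chunking strategy {name!r}; expected one of {allowed}")
    if name == "question":
        return QuestionChunkingStrategy(max_chunk_tokens=settings.qa_max_chunk_tokens)
    return TokenChunkingStrategy(chunk_size=settings.chunk_size, overlap=settings.chunk_overlap)


strategy = get_chunking_strategy("question", Settings())
```

Raising `ValueError` in the factory lets the router translate an unknown name into a 400 response in one place.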
### Sub-Phase 8.2: Metadata Enrichment

**Objective**: Extend `extract_metadata()` to support Q&A-specific fields.

**Test Files**:
- `backend/app/test/test_phase8_metadata.py` (new file)
  - `test_qa_metadata_fields`: strategy_type, question_index, question_id, question_text
  - `test_qa_metadata_topic_section`
  - `test_token_metadata_unchanged`: existing metadata unchanged for token strategy
  - `test_page_number_from_question`: page ref points to question location

**Implementation Tasks**:

1. Extend `extract_metadata()` in `backend/app/utils/metadata.py`:
   - New param: `strategy_type: str = "token"`
   - New param: `qa_pairs: Optional[List[QAPair]] = None`
   - When `strategy_type == "question"`: populate `question_id` (from `QAPair.qa_id`), `question_index`, `question_text`, `topic_section`, `qa_style`, `source_page_range`, and `answer_contains_table` (from `QAPair.has_table`), matching the per-chunk metadata names in the target architecture
   - Maintain backward compatibility: all existing fields are still populated

2. Update `backend/app/models/documents.py`:
   - `ChunkInfo`: add `strategy_type: str`, `question_index: Optional[int] = None`, `question_id: Optional[str] = None`, `question_text: Optional[str] = None`, `topic_section: Optional[str] = None`

3. Update `backend/app/models/ingest.py`:
   - `IngestResponse`: add `strategy: str`
   - Add `IngestRequest` model with `strategy: str = "token"`

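
A sketch of how a `QAPair` could be flattened into per-chunk metadata follows. It deviates from the plan in one labeled way: Chroma has historically restricted metadata values to scalars (str, int, float, bool), so the page range is stored here as two integers instead of a list. That restriction is worth confirming against the installed Chroma version. The `QAPair` below is a reduced stand-in, and `qa_metadata` is a hypothetical helper.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Union

Scalar = Union[str, int, float, bool]


@dataclass
class QAPair:  # reduced stand-in for the dataclass described in Sub-Phase 8.1
    qa_id: str
    question: str
    answer: str
    start_page: int
    end_page: int
    topic_section: Optional[str] = None
    has_table: bool = False


def qa_metadata(pair: QAPair, question_index: int, qa_style: str) -> Dict[str, Scalar]:
    """Build scalar-only metadata for one Q&A chunk."""
    metadata: Dict[str, Scalar] = {
        "strategy_type": "question",
        "qa_style": qa_style,
        "question_index": question_index,
        "question_id": pair.qa_id,
        "question_text": pair.question,
        "answer_contains_table": pair.has_table,
        "page_number": pair.start_page,        # page reference points at the question
        "source_page_start": pair.start_page,  # assumption: range stored as two ints
        "source_page_end": pair.end_page,
    }
    if pair.topic_section:  # omit the key instead of storing None
        metadata["topic_section"] = pair.topic_section
    return metadata


meta = qa_metadata(QAPair("A1", "why?", "because", 2, 4, "(A) 排水系統", True), 0, "chinese")
```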
### Sub-Phase 8.3: Ingest API Integration

**Objective**: Wire the strategy selection into the ingest endpoint.

**Test Files**:
- `backend/app/test/test_phase8_ingest.py` (new file)
  - `test_ingest_with_strategy_token`: existing behavior unchanged
  - `test_ingest_with_strategy_question`: Q&A chunking applied
  - `test_ingest_invalid_strategy_rejected`: 400 on unknown strategy
  - `test_ingest_question_strategy_pdf`: PDF with Chinese Q&A
  - `test_ingest_question_strategy_docx`: DOCX with Q&A format
  - `test_ingest_question_strategy_no_qa_fallback`: document without Q&A falls back
  - `test_ingest_response_includes_strategy`: IngestResponse has strategy field

**Implementation Tasks**:

1. Update `backend/app/routers/ingest.py`:
   - Line 39: accept `strategy: str = Query("token")` param
   - Validate against `VALID_CHUNKING_STRATEGIES`
   - Lines 70-73: replace hardcoded `TokenChunkingStrategy` with `get_chunking_strategy(strategy, settings)`
   - Lines 108-115, 148-151, 183-186: pass `strategy_type=strategy` and QA metadata to `extract_metadata()`
   - Line 192: include `strategy` in `IngestResponse`

2. Update `backend/app/services/rag.py`:
   - `list_documents()`: extract `strategy_type` from the first chunk's metadata per document
   - `list_chunks()`: return `strategy_type`, `question_index`, `question_id`, `question_text`, `topic_section` from metadata

3. Update `backend/app/routers/documents.py`:
   - Map new fields from the RAG service to API response models

### Sub-Phase 8.4: Frontend Strategy Selector

**Objective**: Add strategy selection UI on the RAG Database page, and display strategy info in the document and chunk lists.

**Test Files**:
- `frontend/src/test/components/ChunkList.test.tsx`: add Q&A chunk rendering tests
- `frontend/src/test/components/DocumentList.test.tsx`: add strategy badge tests
- `frontend/src/test/pages/RAGDatabasePage.test.tsx` (new): strategy selector integration

**Implementation Tasks**:

1. Update `frontend/src/types/index.ts`:
   - Add `ChunkingStrategy` type
   - Extend `IngestResponse`, `DocumentInfo`, `ChunkInfo` with new fields

2. Update `frontend/src/lib/api.ts`:
   - `ingestDocument(file, strategy = "token")`: append strategy as query param

3. Update `frontend/src/lib/queries.tsx`:
   - `useIngestDocument`: accept `{ file: File; strategy: ChunkingStrategy }`

4. Update `frontend/src/pages/RAGDatabasePage.tsx`:
   - Add `chunkingStrategy` state (default: "token")
   - Add strategy selector UI (two radio buttons or toggle):
     - "Chunk by Token" (icon: text icon, description: "Split by token windows with overlap")
     - "Chunk by Question" (icon: Q&A icon, description: "Detect Q&A pairs, extract tables")
   - Pass strategy to `handleUpload`
   - Show strategy info in success toast

5. Update `frontend/src/components/DocumentList.tsx`:
   - Add strategy badge: `chunked by token` or `chunked by question`
   - Color-code: token = gray, question = blue

6. Update `frontend/src/components/ChunkList.tsx`:
   - For Q&A chunks: show `Q: {question_id} - {question_text}` instead of page number
   - Show `Topic: {topic_section}` if available
   - Show `Pages {start}-{end}` for the page range
   - Show `⚠ Contains table` badge if `has_table` is true

7. Update `frontend/src/components/DocumentUpload.tsx`:
   - Disable upload button when no file is selected (if not already)
   - Show strategy selector inline or accept `selectedStrategy` prop

### Sub-Phase 8.5: Integration & Acceptance Testing

**Objective**: End-to-end verification with real LegCo PDFs from `test materials/LTT/`.

**Test Files**:
- `backend/app/test/acceptance/test_acceptance_phase8_qa_chunking.py` (new file)
  - `test_real_qa_chunking_fileE`: File E produces 12 Chinese Q&A pairs + 3 Others
  - `test_real_qa_chunking_fileL`: File L produces 24 English Q&A pairs
  - `test_real_qa_chunking_fileB`: File B produces 3 Chinese Q&A pairs + narrative sections
  - `test_real_qa_chunking_fileA`: File A falls back (no Q&A, should not error)
  - `test_table_extraction_fileE`: Tables in File E answers converted to markdown
  - `test_table_extraction_fileL`: Tables in File L answers converted to markdown
  - `test_qa_page_references`: Each Q&A chunk's page number points to question location
  - `test_full_pipeline_question_strategy`: Full ingest → retrieve → query with Q&A chunks

**Implementation Tasks**:
1. Run Q&A chunking on all 4 LTT PDFs, verify pair counts and content
2. Verify table extraction quality (spot-check markdown tables)
3. Verify page number references point to question start
4. Run `pytest backend/app/test/test_phase8*.py -v` and confirm all tests pass
5. Run `pytest backend/app/test/acceptance/test_acceptance_phase8*.py -v -m acceptance` and confirm all tests pass
6. Test frontend: upload File E with "chunk by question" → verify chunks are displayed with Q&A metadata
7. Verify strategy coexistence: upload File E with both strategies → both appear in the document list with correct badges

### Sub-Phase 8.6: Polish & Edge Cases
|
||||||
|
|
||||||
|
**Objective**: Handle remaining edge cases and polish UI.
|
||||||
|
|
||||||
|
**Tasks**:
|
||||||
|
1. Handle `Others` section in File E (O1-O04 without explicit 問/答)
|
||||||
|
2. Handle `[如被追問]` nested Q&A: include as part of answer or split as sub-chunks
|
||||||
|
3. Handle `[內部參考]` inclusion/exclusion toggle via config
|
||||||
|
4. Add chunking strategy filter to query page (optional — skip if out of scope)
|
||||||
|
5. Add loading state for vision table extraction (can take seconds for large docs)
|
||||||
|
6. Track chunking strategy in query history (bonus: show which strategy produced results)
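Tasks 2 and 3 amount to a small policy applied to each answer before it is stored. A minimal sketch follows; the function name, both flags, and the rule that an internal-reference block runs from its marker to the next `[如被追問]` marker or the end of the answer are all assumptions made for illustration.

```python
import re

INTERNAL_MARKER = "[內部參考]"   # "internal reference"
FOLLOW_UP_MARKER = "[如被追問]"  # "if pressed further"


def apply_marker_policy(answer: str,
                        include_internal: bool = False,
                        split_follow_ups: bool = False) -> list[str]:
    """Return one or more chunk bodies for a single answer."""
    if not include_internal:
        # Assumed extent of an internal block: marker up to the next follow-up
        # marker, or to the end of the answer if there is none.
        pattern = (re.escape(INTERNAL_MARKER) + r".*?(?="
                   + re.escape(FOLLOW_UP_MARKER) + r"|\Z)")
        answer = re.sub(pattern, "", answer, flags=re.DOTALL)
    if not split_follow_ups:
        return [answer.strip()]
    # Split into the main answer plus one sub-chunk per follow-up block.
    parts = answer.split(FOLLOW_UP_MARKER)
    chunks = [parts[0].strip()]
    chunks += [FOLLOW_UP_MARKER + part.rstrip() for part in parts[1:]]
    return [chunk for chunk in chunks if chunk]
```

Keeping both decisions behind flags lets the "include or split" question in task 2 stay open until it is settled with real documents.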

---

## File Change Summary

### New Files

| File | Purpose |
|------|---------|
| `backend/app/utils/qa_chunking.py` | Q&A boundary detection, splitting, `QAPair` dataclass |
| `backend/app/utils/table_extraction.py` | Vision LLM table extraction, heuristic fallback, caching |
| `backend/app/test/test_phase8_qa_chunking.py` | Tests for Q&A chunking logic |
| `backend/app/test/test_phase8_metadata.py` | Tests for Q&A metadata enrichment |
| `backend/app/test/test_phase8_ingest.py` | Tests for strategy selection in ingest API |
| `backend/app/test/acceptance/test_acceptance_phase8_qa_chunking.py` | Acceptance tests with real LTT PDFs |
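To make the role of `qa_chunking.py` concrete, here is a rough sketch of a `QAPair` dataclass and a regex-only primary split. The field names mirror the metadata keys used elsewhere in this plan (`question_id`, `question_text`); the marker patterns and the `split_qa_pairs` helper are assumptions, and only the Chinese 問/答 style is covered.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass
class QAPair:
    question_id: str
    question_text: str
    answer_text: str


# Assumed marker shapes: 問 or 答 at line start, optional number, then a
# full-width or half-width colon.
_QUESTION = re.compile(r"^問\s*(\d*)\s*[::]", re.MULTILINE)
_ANSWER = re.compile(r"^答\s*\d*\s*[::]", re.MULTILINE)


def split_qa_pairs(text: str) -> list[QAPair]:
    """Split text into Q&A pairs at 問 markers; each pair needs a 答 marker."""
    starts = list(_QUESTION.finditer(text))
    pairs: list[QAPair] = []
    for i, match in enumerate(starts):
        end = starts[i + 1].start() if i + 1 < len(starts) else len(text)
        block = text[match.end():end]
        answer = _ANSWER.search(block)
        if answer is None:
            continue  # no answer marker: leave for the LLM verification pass
        pairs.append(QAPair(
            question_id=f"Q{match.group(1) or i + 1}",
            question_text=block[:answer.start()].strip(),
            answer_text=block[answer.end():].strip(),
        ))
    return pairs
```

The English Q1/Q2 style and the unmarked `Others` entries would need further patterns or the LLM pass; they are deliberately left out of this sketch.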

### Modified Files

| File | Changes |
|------|---------|
| `backend/app/utils/chunking.py` | Add `QuestionChunkingStrategy` class + `get_chunking_strategy()` factory |
| `backend/app/core/config.py` | Add Q&A chunking config fields (5 new; vision uses the existing LLM model) |
| `backend/app/utils/metadata.py` | Add `strategy_type` and Q&A fields to metadata output |
| `backend/app/models/ingest.py` | Add `strategy` to `IngestResponse`, add `IngestRequest` |
| `backend/app/models/documents.py` | Extend `ChunkInfo` with Q&A fields, `DocumentInfo` with `chunking_strategy` |
| `backend/app/routers/ingest.py` | Accept `strategy` query param, use factory, pass strategy to metadata |
| `backend/app/services/rag.py` | `list_documents()`/`list_chunks()` return strategy info |
| `backend/app/routers/documents.py` | Map new fields to API responses |
| `backend/.env.example` | Add new env vars |
| `frontend/src/types/index.ts` | Add `ChunkingStrategy` type, extend interfaces |
| `frontend/src/lib/api.ts` | `ingestDocument()` accepts strategy param |
| `frontend/src/lib/queries.tsx` | `useIngestDocument()` mutation signature updated |
| `frontend/src/pages/RAGDatabasePage.tsx` | Add strategy selector state + UI |
| `frontend/src/components/DocumentUpload.tsx` | Accept strategy prop / integrate selector |
| `frontend/src/components/DocumentList.tsx` | Display strategy badge |
| `frontend/src/components/ChunkList.tsx` | Conditional Q&A display (`question_id` instead of `page_number`) |
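The `get_chunking_strategy()` factory named for `chunking.py` could look roughly like this. The two classes below are simplified stand-ins, not the real implementations; the `chunk()` method, the `ChunkingStrategy` protocol, and the registry dict are assumptions made so the sketch runs on its own.

```python
from __future__ import annotations

import re
from typing import Protocol


class ChunkingStrategy(Protocol):
    def chunk(self, text: str) -> list[str]: ...


class TokenChunkingStrategy:
    """Stand-in: fixed windows of whitespace-separated words."""

    def __init__(self, window: int = 200) -> None:
        self.window = window

    def chunk(self, text: str) -> list[str]:
        words = text.split()
        return [" ".join(words[i:i + self.window])
                for i in range(0, len(words), self.window)]


class QuestionChunkingStrategy:
    """Stand-in: start a new chunk at every line beginning with 問."""

    def chunk(self, text: str) -> list[str]:
        parts = re.split(r"(?m)^(?=問)", text)
        return [part.strip() for part in parts if part.strip()]


_STRATEGIES = {"token": TokenChunkingStrategy, "question": QuestionChunkingStrategy}


def get_chunking_strategy(name: str = "token") -> ChunkingStrategy:
    """Return a strategy instance; unknown names raise ValueError."""
    try:
        return _STRATEGIES[name]()
    except KeyError:
        raise ValueError(f"Unknown chunking strategy: {name!r}") from None
```

Defaulting to `"token"` keeps existing callers unchanged, which matches the plan's claim that the new strategy is purely additive.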

### Test Files (Modified)

| File | Changes |
|------|---------|
| `backend/app/test/test_phase1_chunking.py` | Add test for strategy factory |
| `backend/app/test/test_phase1_ingest.py` | Add strategy param test |
| `backend/app/test/test_phase1_documents_router.py` | Add strategy field verification |
| `backend/app/test/test_phase1_metadata.py` | Add Q&A metadata field tests |
| `backend/app/test/test_phase1_config.py` | Add Q&A config tests |
| `frontend/src/test/components/ChunkList.test.tsx` | Add Q&A chunk display tests |
| `frontend/src/test/components/DocumentList.test.tsx` | Add strategy badge test |
| `frontend/src/test/components/IngestPanel.test.tsx` | Update for strategy selection |

---

## Acceptance Criteria

- [ ] **AC1**: User can select "Chunk by token" or "Chunk by question" on the RAG Database page before uploading
- [ ] **AC2**: "Chunk by token" uploads produce chunks identical to current behavior (no regression)
- [ ] **AC3**: "Chunk by question" upload of File E produces 12+ Chinese Q&A chunks with correct `question_id`, `question_text`, and `topic_section` metadata
- [ ] **AC4**: "Chunk by question" upload of File L produces 24 English Q&A chunks with correct boundaries
- [ ] **AC5**: Each Q&A chunk's page number references the question (問) page, not the answer
- [ ] **AC6**: Tables within answers are converted to markdown table syntax (vision LLM or heuristic)
- [ ] **AC7**: Document list shows a chunking strategy badge per document
- [ ] **AC8**: Chunk list shows `question_id`/`question_text` for Q&A chunks instead of `page_number`
- [ ] **AC9**: Documents without Q&A format (File A, narrative sections) fall back to token/section-based chunking without error
- [ ] **AC10**: Both strategies coexist: documents stored with different strategies appear side-by-side in the document list
- [ ] **AC11**: Strategy param is validated: invalid values return 400
- [ ] **AC12**: All backend tests pass (`pytest backend/app/test/test_phase8*.py -v`)
- [ ] **AC13**: All acceptance tests pass (`pytest backend/app/test/acceptance/test_acceptance_phase8*.py -v -m acceptance`)
- [ ] **AC14**: Frontend tests pass (`pnpm test` in `frontend/`)
- [ ] **AC15**: Existing tests still pass (no regression)
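AC11 is the one criterion that pins down an exact status code, so it helps to see the check in isolation. This is a framework-free sketch: `ChunkingStrategyName`, `check_strategy_param`, the `"token"` default for a missing parameter, and the case-insensitive normalisation are all assumptions.

```python
from __future__ import annotations

from enum import Enum


class ChunkingStrategyName(str, Enum):
    TOKEN = "token"
    QUESTION = "question"


def check_strategy_param(raw: str | None) -> tuple[int, dict]:
    """Return (status_code, body) for the `strategy` query parameter."""
    if raw is None:
        # Assumed default: a missing param keeps today's token behaviour (AC2).
        return 200, {"strategy": ChunkingStrategyName.TOKEN.value}
    try:
        strategy = ChunkingStrategyName(raw.strip().lower())
    except ValueError:
        allowed = ", ".join(s.value for s in ChunkingStrategyName)
        return 400, {"detail": f"Invalid strategy '{raw}'. Allowed: {allowed}"}
    return 200, {"strategy": strategy.value}
```

If the backend is FastAPI (an assumption based on the `routers/` layout), declaring the parameter as an enum type would make the framework reject bad values with 422 by default, so meeting AC11's 400 likely needs an explicit check of this kind in the router.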

---

## Risks & Mitigations

| Risk | Severity | Mitigation |
|------|----------|------------|
| **Vision LLM cost**: Per-page image processing adds latency and token cost for large docs | Medium | Cache vision results per page hash. Skip vision for pages without detected tables. The `qa_vision_enabled` toggle allows disabling vision and using heuristic-only extraction. Vision uses the same model and API key, so there is no additional provider cost. |
| **Q&A boundary false positives**: Regex matches non-Q&A text as Q&A | High | Run `detect_qa_style()` first and apply Q&A splitting only if the document has sufficient Q&A density. Run an LLM verification pass for ambiguous boundaries. Fall back to section-based chunking if fewer than 3 pairs are detected. |
| **Q&A boundary false negatives**: Regex misses some Q&A pairs (O1-O4 style) | Medium | The LLM verification pass catches these. Log missed pairs for manual review. |
| **Table extraction quality**: Vision LLM may misinterpret complex Chinese tables | Medium | Flag `has_table=True` so retrieval can surface context. Allow the user to inspect raw chunk content. Provide config to disable vision and use heuristic-only extraction. |
| **Performance**: Vision LLM + LLM verification adds latency to ingest | Medium | Make vision extraction async (background task). Show progress in the frontend. Cache results. Allow skipping verification for trusted documents. |
| **Page boundary interference**: Page footer numbers appear mid-Q&A text | Low | A preprocessing step strips known footer patterns. Test with all 4 LTT PDFs. |
| **LegCo format drift**: Future documents may use different Q&A markers | Low | Detection is regex-based, so new patterns are easy to add. LLM verification catches novel formats. Log format detection results for monitoring. |
| **Chunk size**: Some Q&A pairs are very long (7+ pages) | Medium | Apply a max chunk token limit (configurable, default 3000). Split recursively on `\n\n` → `\n`, with the question text prepended to each sub-chunk for context. |
| **DOCX/TXT Q&A**: Non-PDF formats may have different Q&A markers | Low | Use the same regex detection on concatenated text. Skip vision table extraction (text-based only). |
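The chunk-size mitigation (recursive split on `\n\n`, then `\n`, with the question prepended) can be sketched as below. The function names are hypothetical, and the default `count=len` measures characters only as a stand-in; a real tokenizer would be passed in instead.

```python
from __future__ import annotations

from typing import Callable


def _split(text: str, budget: int, count: Callable[[str], int],
           separators: tuple[str, ...]) -> list[str]:
    """Greedily pack parts under budget; recurse with finer separators if needed."""
    if count(text) <= budget or not separators:
        return [text]  # fits, or nothing finer to split on (may still be too long)
    head, rest = separators[0], separators[1:]
    out: list[str] = []
    current = ""
    for part in text.split(head):
        candidate = part if not current else current + head + part
        if count(candidate) <= budget:
            current = candidate
            continue
        if current:
            out.append(current)
        if count(part) <= budget:
            current = part
        else:
            out.extend(_split(part, budget, count, rest))
            current = ""
    if current:
        out.append(current)
    return out


def split_long_answer(question: str, answer: str, max_tokens: int = 3000,
                      count: Callable[[str], int] = len) -> list[str]:
    """Split an oversized Q&A pair, prepending the question to every sub-chunk."""
    budget = max_tokens - count(question) - 1  # reserve room for question + newline
    if budget <= 0:
        raise ValueError("question alone exceeds max_tokens")
    pieces = _split(answer, budget, count, ("\n\n", "\n"))
    return [f"{question}\n{piece}" for piece in pieces]
```

Reserving the question's size up front means each sub-chunk stays within the limit after the question is prepended, except when a single line is itself longer than the budget.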

---

## Dependencies

- **Vision LLM API**: The existing `LLM_MODEL_NAME` (`qwen/qwen3.5-35b-a3b`) is a native vision-language model that accepts base64-encoded images through OpenRouter's OpenAI-compatible Chat Completions API. No separate vision model, API key, or endpoint is needed. If vision is unavailable or disabled (`QA_VISION_ENABLED=false`), fall back to heuristic table detection (text-only).
- **New Python packages**: `Pillow` (likely already installed) for image handling; `pypdf` is already installed. `pymupdf` is not needed for table-region detection, because vision extraction sends whole page images (PNG) directly to the LLM and the LLM itself identifies table regions. Open point: neither `pypdf` nor `Pillow` rasterizes PDF pages to images, so the page-to-PNG rendering step still needs a renderer to be confirmed.
- **Existing codebase**: No breaking changes. The strategy is additive; the existing `TokenChunkingStrategy` is unchanged.
- **ChromaDB**: No schema change. Metadata fields are flexible (ChromaDB accepts arbitrary dict keys).
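For the vision dependency, the request body is an ordinary Chat Completions payload with an `image_url` content part carrying a base64 data URL. The sketch below only builds that payload; the function name and the prompt wording are assumptions, and sending the request (with `LLM_BASE_URL` and `LLM_API_KEY`) is left out.

```python
import base64


def build_table_extraction_request(page_png: bytes,
                                   model: str = "qwen/qwen3.5-35b-a3b") -> dict:
    """Build a Chat Completions payload asking the model to render tables as markdown."""
    image_b64 = base64.b64encode(page_png).decode("ascii")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                # Hypothetical prompt; the real wording would be tuned on LTT pages.
                {"type": "text",
                 "text": "Convert every table on this page to a markdown table. "
                         "Reply with NONE if the page has no table."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    }
```

Because the payload is a plain dict, the same builder can be unit-tested without network access and reused for every page that the table heuristic flags.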

---

## Cost Estimate (per document)

| Component | Model | Cost | Notes |
|-----------|-------|------|-------|
| PDF text extraction | pypdf | Free | Existing parser |
| Q&A regex splitting | N/A | Free | Pure Python |
| Vision table extraction (PDF) | Same as `LLM_MODEL_NAME` | Image token cost only | Only pages with tables; no separate model cost |
| LLM boundary verification | Same as `LLM_MODEL_NAME` | ~$0.001/document | Edge cases only |
| Embedding | qwen3-embedding-4b | ~$0.00002/page | Existing pipeline |

**Single model, single API key.** Vision table extraction and boundary verification both use the existing `LLM_MODEL_NAME` (`qwen/qwen3.5-35b-a3b`) via the same `LLM_BASE_URL`/`LLM_API_KEY`. The only incremental cost is image tokens for pages sent to the vision model, estimated at a few cents per document for typical LegCo papers.
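Since image tokens are the only incremental cost, the per-page-hash cache listed under Risks & Mitigations is what keeps re-ingests from paying twice. A minimal in-memory sketch, with a hypothetical `PageTableCache` class and an injected `extract` callable standing in for the vision call:

```python
from __future__ import annotations

import hashlib
from typing import Callable


class PageTableCache:
    """Cache vision output keyed by a hash of the rendered page bytes."""

    def __init__(self) -> None:
        self._store: dict[str, str] = {}

    @staticmethod
    def key(page_png: bytes) -> str:
        return hashlib.sha256(page_png).hexdigest()

    def get_or_extract(self, page_png: bytes,
                       extract: Callable[[bytes], str]) -> str:
        """Return cached markdown for this page, calling `extract` only on a miss."""
        cache_key = self.key(page_png)
        if cache_key not in self._store:
            self._store[cache_key] = extract(page_png)
        return self._store[cache_key]
```

Whether the cache should persist across restarts (for example on disk) is not specified in the plan; this sketch keeps it in memory only.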

---

## References

### Internal
- `backend/app/utils/chunking.py`: Current chunking strategy
- `backend/app/routers/ingest.py`: Ingestion pipeline
- `backend/app/utils/metadata.py`: Metadata extraction
- `test materials/LTT/`: Test documents (4 LegCo PDFs)

### External
- FAQ-RAG (QA-native chunking): https://github.com/Pro-GenAI/FAQ-RAG
- gptpdf (vision PDF→Markdown): https://github.com/CosmosShadow/gptpdf
- llm-based-ocr (production PDF→MD): https://github.com/yigitkonur/llm-based-ocr
- semchunk (AI-powered chunking): https://github.com/isaacus-dev/semchunk
- 2026 Document Chunking Benchmark (36 strategies): https://arxiv.org/html/2603.06976
- EACL 2026 RAG Error Taxonomy (29.7% errors from chunking): https://wiki.charleschen.ai
- LumberChunker (LLM semantic boundary detection): https://aclanthology.org/2024.findings-emnlp.377
- MultiDocFusion (layout+LLM hierarchy): https://aclanthology.org/2025.emnlp-main.1062
- Hong Kong LegCo transcript dataset: https://github.com/Swithord/hong-kong-legco-transcript
- LegCo Open Data API: https://www.legco.gov.hk/en/open-legco/open-data/hansard-database.html
- LegCo Questions API: https://www.legco.gov.hk/en/open-legco/open-data/questions-at-council-meetings.html