legco_ai_assistant/backend/app/utils
Woody 14423c773a feat: Sub-Phases 8.1-8.4 — Q&A-pair chunking strategy
8.1 — Core algorithm (test-first):
- qa_chunking.py: preprocess_text, build_structure_detection_prompt,
  parse_llm_structure_response, Section dataclass, split_chinese_qa,
  split_english_qa, build_chunks_from_sections with recursive size split
- QuestionChunkingStrategy in chunking.py with _chunk_metadata tracking
- get_chunking_strategy() factory function
- table_extraction.py: vision LLM extraction, heuristic text fallback,
  disk cache, inject_tables_into_answer
- 18/18 tests pass (LLM parse, regex fast-pass, multi-page, ABC contract,
  size limit, chunk building, preprocess)

8.2 — Metadata enrichment:
- extract_metadata() accepts strategy_type + chunk_metadata params
- Q&A fields (question_id, question_index, section_heading, etc.)
  merged into ChromaDB metadata entries
- DocumentInfo.chunking_strategy + ChunkInfo Q&A fields in models
- 6/6 metadata tests pass

8.3 — Ingest API integration:
- POST /api/v1/ingest accepts ?strategy=token|question
- validate strategy against VALID_CHUNKING_STRATEGIES
- factory creates correct chunker; _chunk_metadata passed to extract_metadata
- 6/6 ingest integration tests pass, zero regressions on existing tests

8.4 — Frontend strategy selector:
- Radio button selector (Token / Question) on RAG Database page
- Strategy passed to ingest mutation via api.ts
- DocumentList: strategy badge (gray/blue)
- ChunkList: Q&A display with question_id, question_text, page range, table badge
- tsc --noEmit clean, vite build successful
2026-05-15 12:44:04 +08:00
..
__init__.py feat: Phase 1.1 project setup with config, database, and models 2026-04-22 16:13:52 +08:00
chunking.py feat: Sub-Phases 8.1-8.4 — Q&A-pair chunking strategy 2026-05-15 12:44:04 +08:00
docx_parser.py feat: rewrite DOCX parser with table extraction 2026-04-28 16:42:41 +08:00
metadata.py feat: Sub-Phases 8.1-8.4 — Q&A-pair chunking strategy 2026-05-15 12:44:04 +08:00
pdf_extractor.py feat(backend): add PDF page extractor and chunk PDF storage config 2026-04-24 10:52:57 +08:00
pdf_parser.py feat(backend): add page-aware PDF parsing with per-page text extraction 2026-04-24 10:30:04 +08:00
qa_chunking.py feat: Sub-Phases 8.1-8.4 — Q&A-pair chunking strategy 2026-05-15 12:44:04 +08:00
sentence_splitter.py feat: add sentence splitter and highlight data models (Phase 5.4.1-5.4.2) 2026-04-29 09:26:06 +08:00
table_extraction.py feat: Sub-Phases 8.1-8.4 — Q&A-pair chunking strategy 2026-05-15 12:44:04 +08:00
text_to_pdf.py feat(ingest): generate per-chunk PDFs for DOCX/TXT documents (Phase 5.3) 2026-04-28 17:32:22 +08:00