legco_ai_assistant

Woody 14423c773a	2026-05-15 12:44:04 +08:00

feat: Sub-Phases 8.1-8.4 — Q&A-pair chunking strategy

8.1 — Core algorithm (test-first):
- qa_chunking.py: preprocess_text, build_structure_detection_prompt, parse_llm_structure_response, Section dataclass, split_chinese_qa, split_english_qa, build_chunks_from_sections with recursive size split
- QuestionChunkingStrategy in chunking.py with _chunk_metadata tracking
- get_chunking_strategy() factory function
- table_extraction.py: vision LLM extraction, heuristic text fallback, disk cache, inject_tables_into_answer
- 18/18 tests pass (LLM parse, regex fast-pass, multi-page, ABC contract, size limit, chunk building, preprocess)

8.2 — Metadata enrichment:
- extract_metadata() accepts strategy_type + chunk_metadata params
- Q&A fields (question_id, question_index, section_heading, etc.) merged into ChromaDB metadata entries
- DocumentInfo.chunking_strategy + ChunkInfo Q&A fields in models
- 6/6 metadata tests pass

8.3 — Ingest API integration:
- POST /api/v1/ingest accepts ?strategy=token|question
- validate strategy against VALID_CHUNKING_STRATEGIES
- factory creates correct chunker; _chunk_metadata passed to extract_metadata
- 6/6 ingest integration tests pass, zero regressions on existing tests

8.4 — Frontend strategy selector:
- Radio button selector (Token / Question) on RAG Database page
- Strategy passed to ingest mutation via api.ts
- DocumentList: strategy badge (gray/blue)
- ChunkList: Q&A display with question_id, question_text, page range, table badge
- tsc --noEmit clean, vite build successful
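
The strategy selection described in 8.1 and 8.3 (a `get_chunking_strategy()` factory whose input is validated against `VALID_CHUNKING_STRATEGIES`) could look roughly like the sketch below. Only the names `get_chunking_strategy`, `VALID_CHUNKING_STRATEGIES`, `QuestionChunkingStrategy`, `_chunk_metadata`, `question_index`, and the `token`/`question` values come from the commit message. The `ChunkingStrategy` base class, `TokenChunkingStrategy`, the `chunk()` signature, and both splitting bodies are assumptions made for illustration and are not taken from the repository.

```python
from abc import ABC, abstractmethod

# Strategy names as listed in the commit message ("?strategy=token|question").
VALID_CHUNKING_STRATEGIES = ("token", "question")


class ChunkingStrategy(ABC):
    """Hypothetical base class; the real ABC contract is not shown in the commit."""

    @abstractmethod
    def chunk(self, text: str) -> list[str]:
        """Split text into chunks."""


class TokenChunkingStrategy(ChunkingStrategy):
    """Hypothetical stand-in: fixed-size windows of whitespace-separated words."""

    def __init__(self, max_tokens: int = 200) -> None:
        self.max_tokens = max_tokens

    def chunk(self, text: str) -> list[str]:
        words = text.split()
        return [
            " ".join(words[i : i + self.max_tokens])
            for i in range(0, len(words), self.max_tokens)
        ]


class QuestionChunkingStrategy(ChunkingStrategy):
    """Class name from the commit; this placeholder body splits on blank lines."""

    def __init__(self) -> None:
        # Per-chunk metadata, later handed to metadata extraction (assumed shape).
        self._chunk_metadata: list[dict] = []

    def chunk(self, text: str) -> list[str]:
        parts = [p.strip() for p in text.split("\n\n") if p.strip()]
        self._chunk_metadata = [{"question_index": i} for i in range(len(parts))]
        return parts


def get_chunking_strategy(name: str = "token") -> ChunkingStrategy:
    """Return a chunker for the given strategy name, rejecting unknown names."""
    if name not in VALID_CHUNKING_STRATEGIES:
        raise ValueError(
            f"Unknown chunking strategy {name!r}; "
            f"expected one of {VALID_CHUNKING_STRATEGIES}"
        )
    if name == "question":
        return QuestionChunkingStrategy()
    return TokenChunkingStrategy()
```

In an ingest handler, the value of the `strategy` query parameter would be passed to the factory, and a `ValueError` would presumably be translated into a client error response; how the actual endpoint reports invalid values is not stated in the commit.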
__init__.py	feat: Phase 1.1 project setup with config, database, and models	2026-04-22 16:13:52 +08:00
chunking.py	feat: Sub-Phases 8.1-8.4 — Q&A-pair chunking strategy	2026-05-15 12:44:04 +08:00
docx_parser.py	feat: rewrite DOCX parser with table extraction	2026-04-28 16:42:41 +08:00
metadata.py	feat: Sub-Phases 8.1-8.4 — Q&A-pair chunking strategy	2026-05-15 12:44:04 +08:00
pdf_extractor.py	feat(backend): add PDF page extractor and chunk PDF storage config	2026-04-24 10:52:57 +08:00
pdf_parser.py	feat(backend): add page-aware PDF parsing with per-page text extraction	2026-04-24 10:30:04 +08:00
qa_chunking.py	feat: Sub-Phases 8.1-8.4 — Q&A-pair chunking strategy	2026-05-15 12:44:04 +08:00
sentence_splitter.py	feat: add sentence splitter and highlight data models (Phase 5.4.1-5.4.2)	2026-04-29 09:26:06 +08:00
table_extraction.py	feat: Sub-Phases 8.1-8.4 — Q&A-pair chunking strategy	2026-05-15 12:44:04 +08:00
text_to_pdf.py	feat(ingest): generate per-chunk PDFs for DOCX/TXT documents (Phase 5.3)	2026-04-28 17:32:22 +08:00