Commit Graph

4 Commits

Author SHA1 Message Date
Woody 73c1789698 fix: Q\&A chunking always fell back to token — LLM never called, missing API fields
Three bugs caused 'Chunk by Question' to silently produce token chunks:

1. QuestionChunkingStrategy.chunk_pages() had a broken event-loop check
   that always skipped LLM structure detection in FastAPI's async context.
   Fixed by making chunk_pages() async and removing the is_running() guard.

2. get_chunking_strategy() factory never passed an LLMClient to
   QuestionChunkingStrategy. Fixed by creating LLMClient in the factory
   with graceful fallback to regex-only when config is incomplete.

3. rag.list_documents() and list_chunks() didn't extract strategy_type
   or Q&A fields from ChromaDB metadata, so the frontend always showed
   chunking_strategy='token' and null Q&A fields. Fixed by reading
   these fields from ChromaDB and routing them through the API.

Also: TokenChunkingStrategy.chunk_pages() made async for consistency
with the question strategy; ingest router updated to await it.
Tests updated (asyncio.run() for sync tests, async mock chunk_pages).
2026-05-15 14:46:45 +08:00
Woody 14423c773a feat: Sub-Phases 8.1-8.4 — Q&A-pair chunking strategy
8.1 — Core algorithm (test-first):
- qa_chunking.py: preprocess_text, build_structure_detection_prompt,
  parse_llm_structure_response, Section dataclass, split_chinese_qa,
  split_english_qa, build_chunks_from_sections with recursive size split
- QuestionChunkingStrategy in chunking.py with _chunk_metadata tracking
- get_chunking_strategy() factory function
- table_extraction.py: vision LLM extraction, heuristic text fallback,
  disk cache, inject_tables_into_answer
- 18/18 tests pass (LLM parse, regex fast-pass, multi-page, ABC contract,
  size limit, chunk building, preprocess)

8.2 — Metadata enrichment:
- extract_metadata() accepts strategy_type + chunk_metadata params
- Q&A fields (question_id, question_index, section_heading, etc.)
  merged into ChromaDB metadata entries
- DocumentInfo.chunking_strategy + ChunkInfo Q&A fields in models
- 6/6 metadata tests pass

8.3 — Ingest API integration:
- POST /api/v1/ingest accepts ?strategy=token|question
- validate strategy against VALID_CHUNKING_STRATEGIES
- factory creates correct chunker; _chunk_metadata passed to extract_metadata
- 6/6 ingest integration tests pass, zero regressions on existing tests

8.4 — Frontend strategy selector:
- Radio button selector (Token / Question) on RAG Database page
- Strategy passed to ingest mutation via api.ts
- DocumentList: strategy badge (gray/blue)
- ChunkList: Q&A display with question_id, question_text, page range, table badge
- tsc --noEmit clean, vite build successful
2026-05-15 12:44:04 +08:00
Woody 0995c685fa feat(backend): add page-aware chunking with adjacent-page overlap
Add chunk_pages() to TokenChunkingStrategy: one chunk per page with 200-token overlap from adjacent pages. Uses original page text for main content, decoded tokens for overlap. Never splits a page regardless of size.

Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
2026-04-24 10:30:18 +08:00
Woody d94abaac77 feat: Phase 1.2 ingestion pipeline with chunking and metadata
- Add document parsers (DOCX, PDF) with lazy imports
- Add TokenChunkingStrategy with ABC for future replacement
- Add metadata extraction (filename, upload_date, content_summary)
- Add RAGService for ChromaDB ingestion/retrieval/response generation
- Add POST /api/v1/ingest endpoint with file validation
- Test-first: 20 passed, 2 skipped (python-docx not installed)
2026-04-22 16:49:52 +08:00