Commit Graph

12 Commits

Author SHA1 Message Date
Woody b11d31e2d1 feat: add sentence splitter and highlight data models (Phase 5.4.1-5.4.2)
- sentence_splitter.py: regex-based sentence splitting for English + Chinese punctuation
- highlight.py: 6 Pydantic models (ChunkHighlightTarget, HighlightBatchRequest,
  RelevantSentence, ChunkHighlights, HighlightBatchResult, HighlightBatchResponse)
- 43 tests: 13 sentence splitter + 30 model validation
2026-04-29 09:26:06 +08:00
Woody 25b26c9b48 feat(ingest): generate per-chunk PDFs for DOCX/TXT documents (Phase 5.3)
DOCX and TXT ingestion now produces chunk_file_path + per-chunk PDF files matching the PDF ingestion flow. Uses reportlab to render chunk text as simple PDFs with automatic text wrapping.

- Add reportlab==4.2.5 to requirements.txt
- New utils/text_to_pdf.py: generate_text_pdf() renders chunk text as PDF
- Ingest router DOCX/TXT branches: generate chunk_N.pdf per chunk, store in chunk_file_paths
- Graceful degradation: chunk_file_path stays None if PDF generation fails
- Update test_phase1_ingest_page_aware.py assertions: DOCX chunks now HAVE chunk_file_path
- New test_phase5_docx_pdf_generation.py: 5 tests (DOCX PDF gen, TXT PDF gen, PDF regression, file count, graceful degradation)
- 361 backend tests pass (4 pre-existing embedding failures unrelated)

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
2026-04-28 17:32:22 +08:00
Woody 136c25ae38 feat: rewrite DOCX parser with table extraction
Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
2026-04-28 16:42:41 +08:00
Woody 2656f9ca08 refactor(test): rewrite tests to comply with integration-first rules
Replace mocked DB/internal-services with real ChromaDB/SQLite via tmp_path.
Only mock truly external APIs (LLM, embedding for deterministic vectors).

13 test files rewritten (314 pass, 0 fail):
- Route tests: use TestClient + real ChromaDB, seed test data
- Service tests: use real PersistentClient/SQLite instances
- Pipeline tests: TestClient hits SSE /query endpoint, verify history
- Converted unittest.TestCase to pytest where applicable

Plus: fix metadata.py to filter None values from ChromaDB metadata
(pre-existing bug caught by real-DB ingestion tests)
2026-04-27 11:46:58 +08:00
Woody 8c84062996 feat(backend): add PDF page extractor and chunk PDF storage config
New pdf_extractor.py with extract_page_as_pdf() and extract_pages_as_pdf() for extracting individual PDF pages as separate files. Adds document_chunk_path setting to config and document_chunk/ to .gitignore.

Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
2026-04-24 10:52:57 +08:00
Woody b97264c66a feat(backend): add page_number, chunk_file_path, document_id to chunk metadata
Enhance extract_metadata() with three new optional fields for page-aware chunking support. Validates list length mismatches. Fully backward compatible — existing callers unaffected.

Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
2026-04-24 10:30:40 +08:00
Woody 0995c685fa feat(backend): add page-aware chunking with adjacent-page overlap
Add chunk_pages() to TokenChunkingStrategy: one chunk per page with 200-token overlap from adjacent pages. Uses original page text for main content, decoded tokens for overlap. Never splits a page regardless of size.

Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
2026-04-24 10:30:18 +08:00
Woody f4fa577fb0 feat(backend): add page-aware PDF parsing with per-page text extraction
Add parse_pdf_by_page() that returns List[Tuple[int, str]] with 1-indexed page numbers. Pages with no extractable text are skipped. Follows same error handling as existing parse_pdf().

Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
2026-04-24 10:30:04 +08:00
Woody b48c23001e fix(backend): preserve original filename in chunk metadata instead of temp file name
When uploading files, the backend passes them through NamedTemporaryFile, causing os.path.basename to return temp names like 'tmp90i7xqa8.pdf'. Added original_filename parameter to extract_metadata() so the actual upload filename is stored.

Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
2026-04-24 10:14:58 +08:00
Woody 44028ebd6e refactor(backend): update document parsers for DOCX and PDF
Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
2026-04-23 13:27:08 +08:00
Woody d94abaac77 feat: Phase 1.2 ingestion pipeline with chunking and metadata
- Add document parsers (DOCX, PDF) with lazy imports
- Add TokenChunkingStrategy with ABC for future replacement
- Add metadata extraction (filename, upload_date, content_summary)
- Add RAGService for ChromaDB ingestion/retrieval/response generation
- Add POST /api/v1/ingest endpoint with file validation
- Test-first: 20 passed, 2 skipped (python-docx not installed)
2026-04-22 16:49:52 +08:00
Woody 3712397d64 feat: Phase 1.1 project setup with config, database, and models
- Add requirements.txt with all dependencies
- Add .env.example with required environment variables
- Add Pydantic Settings (config.py) with .env loading
- Add ChromaDB persistent client (database.py)
- Add Pydantic schemas (ingest.py) for request/response
- Add FastAPI main.py with CORS middleware
- Add package __init__.py files
- Add tests: test_phase1_config.py, test_phase1_database.py
- All 5 tests pass
2026-04-22 16:13:52 +08:00