diff --git a/.plans/phase1_enhancement_plan.md b/.plans/phase1_enhancement_plan.md
index 38bd0b6..562a3fa 100644
--- a/.plans/phase1_enhancement_plan.md
+++ b/.plans/phase1_enhancement_plan.md
@@ -2,7 +2,7 @@
 
 **Source**: User request (2026-04-23)
 **Scope**: Frontend navigation + RAG Database management page + page-aware chunking with chunk PDFs
-**Status**: 🔄 In Progress — Feature 1 ✅ Complete, Features 2-3 pending
+**Status**: 🔄 In Progress — Feature 1 ✅ Complete, Feature 2 backend ✅ Complete, Feature 2 frontend & Feature 3 pending
 
 ---
 
@@ -33,8 +33,8 @@ Enhance the existing Phase 1 application with three features:
 ### What's Missing (Gaps This Plan Fills)
 
 - ~~No routing or multi-page support~~ ✅ Done in Feature 1
-- No way to view what's stored in ChromaDB
-- No way to delete documents or chunks
+- ~~No way to view what's stored in ChromaDB~~ ✅ Backend CRUD done (sub-phase 1.5.2)
+- ~~No way to delete documents or chunks~~ ✅ Backend CRUD done (sub-phase 1.5.2)
 - No page-level awareness in chunking (all pages flattened before token splitting)
 - No persistent chunk files (chunks only exist as ChromaDB document text)
 - No clickable links in RAG responses to view source chunks
@@ -202,9 +202,9 @@ class DeleteResponse(BaseModel):
 
 ### 2.5 Acceptance Criteria
 
-- [ ] `GET /api/v1/documents` returns all documents with chunk counts
-- [ ] `DELETE /api/v1/documents/{document_id}` removes all chunks from ChromaDB + associated chunk PDFs
-- [ ] `DELETE /api/v1/chunks/{chunk_id}` removes a single chunk
+- [x] `GET /api/v1/documents` returns all documents with chunk counts
+- [x] `DELETE /api/v1/documents/{document_id}` removes all chunks from ChromaDB + associated chunk PDFs
+- [x] `DELETE /api/v1/chunks/{chunk_id}` removes a single chunk
 - [ ] RAG Database page shows all documents with chunk counts
 - [ ] User can expand a document to see its chunks
 - [ ] User can delete a document (with confirmation)
@@ -530,7 +530,7 @@ Feature 3 (Page-Aware Chunking) ← Modifies ingestion pipeline
 | Sub-Phase | Feature | Scope | Backend | Frontend | Status |
 |-----------|---------|-------|---------|----------|--------|
 | 1.5.1 | 1 | Nav bar + routing + page scaffold | None | NavBar, LTTPage, RAGDatabasePage, App.tsx refactor | ✅ Complete |
-| 1.5.2 | 2 | Backend CRUD for documents/chunks | documents router, RAGService methods, schemas | None | 📋 Pending |
+| 1.5.2 | 2 | Backend CRUD for documents/chunks | documents router, RAGService methods, schemas | None | ✅ Complete |
 | 1.5.3 | 2 | Frontend RAG Database page | None | RAGDatabasePage, DocumentList, ChunkList, DocumentUpload, API hooks | 📋 Pending |
 | 1.5.4 | 3 | Page-aware parsing & chunking | pdf_parser, chunking, metadata enhancements | None | 📋 Pending |
 | 1.5.5 | 3 | Chunk PDF generation & storage | pdf_extractor, config, ingest pipeline refactor | None | 📋 Pending |
@@ -580,12 +580,14 @@ document_chunk/
 
 | Risk | Impact | Mitigation |
 |------|--------|------------|
-| ChromaDB has no native "group by document" query | `list_documents()` needs manual grouping from all metadata | Group by `filename` + `document_id` in metadata. Cache result if slow. |
+| ~~ChromaDB has no native "group by document" query~~ | ~~`list_documents()` needs manual grouping from all metadata~~ | ✅ Resolved: Groups by document_id extracted from chunk IDs via `rsplit("_", 1)` |
 | Large PDFs → many chunk PDF files | Disk usage grows | One PDF per unique page (not per chunk). Pages shared by chunks reuse same file. |
 | Chunk spans multiple pages | Ambiguous page assignment | Tag chunk with STARTING page only. Note in UI. |
 | DOCX has no page numbers | `page_number` is None for DOCX chunks | Graceful degradation — show "N/A" or hide page info for DOCX. |
 | Deleting documents must clean up chunk files | Orphan files if deletion fails | Delete files after successful ChromaDB deletion. Log failures for manual cleanup. |
 | Path traversal in chunk PDF endpoint | Security risk | Validate `file_path` doesn't contain `..` or absolute paths. Use whitelist of known files. |
+| ChromaDB 1.5.8 requires `name()` on embedding functions | `_EmbeddingFunctionWrapper` crashes on `collection.get()` | ✅ Fixed: Added `name()` method returning `"custom_embedding_wrapper"` |
+| Existing ChromaDB data corrupted (HNSW segment error) | Endpoints return 500 against existing `chroma_db/` | Pre-existing issue. Works with fresh DB. May need `chroma_db` reset for production. |
 
 ---
 
@@ -613,7 +615,7 @@ None — all resolved.
 
 | File | Coverage |
 |------|----------|
-| `test_phase1_documents_router.py` | GET /documents, DELETE /documents/{id}, DELETE /chunks/{id} |
+| `test_phase1_documents_router.py` | ✅ GET /documents, DELETE /documents/{id}, DELETE /chunks/{id} (8 tests, all pass) |
 | `test_phase1_pdf_parser_pages.py` | parse_pdf_by_page() — multi-page PDFs, single-page, empty |
 | `test_phase1_page_aware_chunking.py` | chunk_with_pages() — cross-page chunks, single-page chunks |
 | `test_phase1_pdf_extractor.py` | extract_page_as_pdf() — valid page, out-of-range, corrupt PDF |
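
Review note: the risk table says documents are grouped by a `document_id` taken from each chunk ID via `rsplit("_", 1)`. The sketch below illustrates that grouping. It is not the code from this change: it assumes chunk IDs have the form `<document_id>_<chunk_index>`, and the function names `group_chunk_ids` and `summarize_documents` are hypothetical.

```python
from collections import defaultdict


def group_chunk_ids(chunk_ids):
    """Group chunk IDs by document ID.

    Assumption: each chunk ID looks like '<document_id>_<chunk_index>'.
    """
    groups = defaultdict(list)
    for chunk_id in chunk_ids:
        # Split once from the right so underscores inside the
        # document ID itself are preserved.
        document_id = chunk_id.rsplit("_", 1)[0]
        groups[document_id].append(chunk_id)
    return dict(groups)


def summarize_documents(chunk_ids):
    """Return one summary dict per document with its chunk count."""
    return [
        {"document_id": document_id, "chunk_count": len(ids)}
        for document_id, ids in group_chunk_ids(chunk_ids).items()
    ]
```

Splitting from the right matters when a document ID contains underscores: `"report_v2_0"` maps to `report_v2`, where a left split would yield `report`. An ID with no underscore at all falls into a group named after the whole ID, so a real implementation may want to reject or log such IDs.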