docs: update enhancement plan with sub-phase 1.5.2 completion status
Mark sub-phase 1.5.2 (backend CRUD) as complete. Update acceptance criteria, risk mitigations, and test plan.

Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
parent c6abe5c335
commit 9a7329c5f8
@@ -2,7 +2,7 @@
 **Source**: User request (2026-04-23)
 **Scope**: Frontend navigation + RAG Database management page + page-aware chunking with chunk PDFs
-**Status**: 🔄 In Progress — Feature 1 ✅ Complete, Features 2-3 pending
+**Status**: 🔄 In Progress — Feature 1 ✅ Complete, Feature 2 backend ✅ Complete, Feature 2 frontend & Feature 3 pending
 
 ---
 
 
@@ -33,8 +33,8 @@ Enhance the existing Phase 1 application with three features:
 ### What's Missing (Gaps This Plan Fills)
 - ~~No routing or multi-page support~~ ✅ Done in Feature 1
-- No way to view what's stored in ChromaDB
-- No way to delete documents or chunks
+- ~~No way to view what's stored in ChromaDB~~ ✅ Backend CRUD done (sub-phase 1.5.2)
+- ~~No way to delete documents or chunks~~ ✅ Backend CRUD done (sub-phase 1.5.2)
 - No page-level awareness in chunking (all pages flattened before token splitting)
 - No persistent chunk files (chunks only exist as ChromaDB document text)
 - No clickable links in RAG responses to view source chunks
 
@@ -202,9 +202,9 @@ class DeleteResponse(BaseModel):
 ### 2.5 Acceptance Criteria
 
-- [ ] `GET /api/v1/documents` returns all documents with chunk counts
-- [ ] `DELETE /api/v1/documents/{document_id}` removes all chunks from ChromaDB + associated chunk PDFs
-- [ ] `DELETE /api/v1/chunks/{chunk_id}` removes a single chunk
+- [x] `GET /api/v1/documents` returns all documents with chunk counts
+- [x] `DELETE /api/v1/documents/{document_id}` removes all chunks from ChromaDB + associated chunk PDFs
+- [x] `DELETE /api/v1/chunks/{chunk_id}` removes a single chunk
 - [ ] RAG Database page shows all documents with chunk counts
 - [ ] User can expand a document to see its chunks
 - [ ] User can delete a document (with confirmation)
 
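The first acceptance criterion in this hunk (documents listed with chunk counts) depends on grouping chunks by document. The risk table later in this diff records that the grouping key is the `document_id` extracted from each chunk ID via `rsplit("_", 1)`. Below is a minimal sketch of that grouping. It assumes chunk IDs have the shape `<document_id>_<chunk_index>`; the function name `count_chunks_per_document` and the sample IDs are hypothetical, not taken from the codebase.

```python
from collections import Counter


def count_chunks_per_document(chunk_ids: list[str]) -> dict[str, int]:
    """Group chunk IDs by document ID and count the chunks of each document.

    Assumption (not confirmed by the source): a chunk ID looks like
    "<document_id>_<chunk_index>". rsplit("_", 1) splits on the LAST
    underscore only, so document IDs that contain underscores stay intact.
    """
    counts: Counter[str] = Counter()
    for chunk_id in chunk_ids:
        document_id = chunk_id.rsplit("_", 1)[0]
        counts[document_id] += 1
    return dict(counts)


if __name__ == "__main__":
    ids = ["report_2024_0", "report_2024_1", "notes_0"]
    print(count_chunks_per_document(ids))  # {'report_2024': 2, 'notes': 1}
```

Splitting from the right is what keeps `report_2024` together as one document; a plain `split("_")` would cut it at the first underscore.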
@@ -530,7 +530,7 @@ Feature 3 (Page-Aware Chunking) ← Modifies ingestion pipeline
 | Sub-Phase | Feature | Scope | Backend | Frontend | Status |
 |-----------|---------|-------|---------|----------|--------|
 | 1.5.1 | 1 | Nav bar + routing + page scaffold | None | NavBar, LTTPage, RAGDatabasePage, App.tsx refactor | ✅ Complete |
-| 1.5.2 | 2 | Backend CRUD for documents/chunks | documents router, RAGService methods, schemas | None | 📋 Pending |
+| 1.5.2 | 2 | Backend CRUD for documents/chunks | documents router, RAGService methods, schemas | None | ✅ Complete |
 | 1.5.3 | 2 | Frontend RAG Database page | None | RAGDatabasePage, DocumentList, ChunkList, DocumentUpload, API hooks | 📋 Pending |
 | 1.5.4 | 3 | Page-aware parsing & chunking | pdf_parser, chunking, metadata enhancements | None | 📋 Pending |
 | 1.5.5 | 3 | Chunk PDF generation & storage | pdf_extractor, config, ingest pipeline refactor | None | 📋 Pending |
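Document deletion in sub-phase 1.5.2 has to remove both the ChromaDB chunks and the chunk PDFs. The plan's risk table asks for files to be deleted only after the ChromaDB deletion succeeds, with failures logged for manual cleanup. Here is a rough sketch of that ordering. The `delete_chunks` callable is a hypothetical stand-in for whatever the real `RAGService` calls, and `delete_document` is an illustrative name rather than the project's actual method.

```python
import logging
from pathlib import Path
from typing import Callable, Iterable

logger = logging.getLogger(__name__)


def delete_document(
    chunk_ids: list[str],
    chunk_files: Iterable[Path],
    delete_chunks: Callable[[list[str]], None],
) -> list[Path]:
    """Delete a document's chunks first, then its chunk PDF files.

    delete_chunks is a hypothetical stand-in for the vector-store call.
    If it raises, no file is touched, so the store and the disk stay
    consistent. Returns the files that could not be removed, so they
    can be cleaned up manually.
    """
    delete_chunks(chunk_ids)  # may raise; files are left alone in that case

    failed: list[Path] = []
    for path in chunk_files:
        try:
            path.unlink(missing_ok=True)  # an already-missing file is not an error
        except OSError as exc:
            logger.warning("Could not delete chunk file %s: %s", path, exc)
            failed.append(path)
    return failed
```

Doing the store deletion first means the worst case is an orphan file on disk, which is harmless and easy to sweep up, rather than a chunk record pointing at a file that no longer exists.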
@@ -580,12 +580,14 @@ document_chunk/
 | Risk | Impact | Mitigation |
 |------|--------|------------|
-| ChromaDB has no native "group by document" query | `list_documents()` needs manual grouping from all metadata | Group by `filename` + `document_id` in metadata. Cache result if slow. |
+| ~~ChromaDB has no native "group by document" query~~ | ~~`list_documents()` needs manual grouping from all metadata~~ | ✅ Resolved: Groups by document_id extracted from chunk IDs via `rsplit("_", 1)` |
 | Large PDFs → many chunk PDF files | Disk usage grows | One PDF per unique page (not per chunk). Pages shared by chunks reuse same file. |
 | Chunk spans multiple pages | Ambiguous page assignment | Tag chunk with STARTING page only. Note in UI. |
 | DOCX has no page numbers | `page_number` is None for DOCX chunks | Graceful degradation — show "N/A" or hide page info for DOCX. |
 | Deleting documents must clean up chunk files | Orphan files if deletion fails | Delete files after successful ChromaDB deletion. Log failures for manual cleanup. |
 | Path traversal in chunk PDF endpoint | Security risk | Validate `file_path` doesn't contain `..` or absolute paths. Use whitelist of known files. |
+| ChromaDB 1.5.8 requires `name()` on embedding functions | `_EmbeddingFunctionWrapper` crashes on `collection.get()` | ✅ Fixed: Added `name()` method returning `"custom_embedding_wrapper"` |
+| Existing ChromaDB data corrupted (HNSW segment error) | Endpoints return 500 against existing `chroma_db/` | Pre-existing issue. Works with fresh DB. May need `chroma_db` reset for production. |
 
 ---
 
 
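The path-traversal row in the risk table above describes its mitigation only in words. One possible shape for that check is sketched below using only the standard library. The function name `resolve_chunk_pdf` and the example file names are hypothetical, and the "whitelist of known files" part of the mitigation is not shown.

```python
from pathlib import Path


def resolve_chunk_pdf(base_dir: Path, file_path: str) -> Path:
    """Return the resolved location of a chunk PDF, or raise ValueError.

    Rejects absolute paths and ".." segments up front, then confirms that
    the fully resolved path (symlinks followed) is still inside base_dir.
    """
    candidate = Path(file_path)
    if candidate.is_absolute() or ".." in candidate.parts:
        raise ValueError(f"unsafe chunk path: {file_path!r}")

    base = base_dir.resolve()
    resolved = (base / candidate).resolve()
    if not resolved.is_relative_to(base):  # Path.is_relative_to needs Python 3.9+
        raise ValueError(f"path escapes chunk directory: {file_path!r}")
    return resolved


if __name__ == "__main__":
    base = Path("document_chunk")
    print(resolve_chunk_pdf(base, "doc1/page_3.pdf"))
    try:
        resolve_chunk_pdf(base, "../secrets.pdf")
    except ValueError as exc:
        print("rejected:", exc)
```

The second check after `resolve()` is deliberate: a string test for `..` alone would miss a symlink inside the chunk directory that points somewhere else.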
@@ -613,7 +615,7 @@ None — all resolved.
 | File | Coverage |
 |------|----------|
-| `test_phase1_documents_router.py` | GET /documents, DELETE /documents/{id}, DELETE /chunks/{id} |
+| `test_phase1_documents_router.py` | ✅ GET /documents, DELETE /documents/{id}, DELETE /chunks/{id} (8 tests, all pass) |
 | `test_phase1_pdf_parser_pages.py` | parse_pdf_by_page() — multi-page PDFs, single-page, empty |
 | `test_phase1_page_aware_chunking.py` | chunk_with_pages() — cross-page chunks, single-page chunks |
 | `test_phase1_pdf_extractor.py` | extract_page_as_pdf() — valid page, out-of-range, corrupt PDF |
 