docs: update enhancement plan with sub-phase 1.5.2 completion status

Mark sub-phase 1.5.2 (backend CRUD) as complete. Update acceptance criteria, risk mitigations, and test plan.

Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
Woody 2026-04-23 19:05:01 +08:00
parent c6abe5c335
commit 9a7329c5f8
1 changed file with 11 additions and 9 deletions

@@ -2,7 +2,7 @@
 **Source**: User request (2026-04-23)
 **Scope**: Frontend navigation + RAG Database management page + page-aware chunking with chunk PDFs
-**Status**: 🔄 In Progress — Feature 1 ✅ Complete, Features 2-3 pending
+**Status**: 🔄 In Progress — Feature 1 ✅ Complete, Feature 2 backend ✅ Complete, Feature 2 frontend & Feature 3 pending
 ---
@@ -33,8 +33,8 @@ Enhance the existing Phase 1 application with three features:
 ### What's Missing (Gaps This Plan Fills)
 - ~~No routing or multi-page support~~ ✅ Done in Feature 1
-- No way to view what's stored in ChromaDB
-- No way to delete documents or chunks
+- ~~No way to view what's stored in ChromaDB~~ ✅ Backend CRUD done (sub-phase 1.5.2)
+- ~~No way to delete documents or chunks~~ ✅ Backend CRUD done (sub-phase 1.5.2)
 - No page-level awareness in chunking (all pages flattened before token splitting)
 - No persistent chunk files (chunks only exist as ChromaDB document text)
 - No clickable links in RAG responses to view source chunks
@@ -202,9 +202,9 @@ class DeleteResponse(BaseModel):
 ### 2.5 Acceptance Criteria
-- [ ] `GET /api/v1/documents` returns all documents with chunk counts
-- [ ] `DELETE /api/v1/documents/{document_id}` removes all chunks from ChromaDB + associated chunk PDFs
-- [ ] `DELETE /api/v1/chunks/{chunk_id}` removes a single chunk
+- [x] `GET /api/v1/documents` returns all documents with chunk counts
+- [x] `DELETE /api/v1/documents/{document_id}` removes all chunks from ChromaDB + associated chunk PDFs
+- [x] `DELETE /api/v1/chunks/{chunk_id}` removes a single chunk
 - [ ] RAG Database page shows all documents with chunk counts
 - [ ] User can expand a document to see its chunks
 - [ ] User can delete a document (with confirmation)
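The document-deletion criterion in this hunk pairs ChromaDB chunk removal with chunk PDF cleanup, and the plan's risk table asks that files be deleted only after the ChromaDB deletion succeeds, with failures logged. A minimal sketch of that ordering follows. The function name `delete_document`, the `delete_chunks` callback, and the `chunk_files` argument are all hypothetical stand-ins, not names taken from the project.

```python
import logging
from pathlib import Path
from typing import Callable, Iterable, List

logger = logging.getLogger(__name__)


def delete_document(
    document_id: str,
    delete_chunks: Callable[[str], None],  # hypothetical: wraps the vector-store delete
    chunk_files: Iterable[Path],           # hypothetical: chunk PDFs owned by the document
) -> List[Path]:
    """Delete a document's chunks first, then its chunk PDFs.

    Returns the files that could not be removed, for manual cleanup.
    """
    # Step 1: remove the chunks from the store. If this raises, no file
    # has been touched yet, so the store and the disk stay consistent.
    delete_chunks(document_id)

    # Step 2: remove the chunk PDFs. Log and collect failures instead of
    # aborting, so one bad file does not block the rest.
    failed: List[Path] = []
    for path in chunk_files:
        try:
            path.unlink(missing_ok=True)
        except OSError as exc:
            logger.warning("Could not delete %s: %s", path, exc)
            failed.append(path)
    return failed
```

Passing the store operation in as a callback keeps the sketch independent of any particular ChromaDB client call; a real service would presumably bind it to its own collection.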
@@ -530,7 +530,7 @@ Feature 3 (Page-Aware Chunking) ← Modifies ingestion pipeline
 | Sub-Phase | Feature | Scope | Backend | Frontend | Status |
 |-----------|---------|-------|---------|----------|--------|
 | 1.5.1 | 1 | Nav bar + routing + page scaffold | None | NavBar, LTTPage, RAGDatabasePage, App.tsx refactor | ✅ Complete |
-| 1.5.2 | 2 | Backend CRUD for documents/chunks | documents router, RAGService methods, schemas | None | 📋 Pending |
+| 1.5.2 | 2 | Backend CRUD for documents/chunks | documents router, RAGService methods, schemas | None | ✅ Complete |
 | 1.5.3 | 2 | Frontend RAG Database page | None | RAGDatabasePage, DocumentList, ChunkList, DocumentUpload, API hooks | 📋 Pending |
 | 1.5.4 | 3 | Page-aware parsing & chunking | pdf_parser, chunking, metadata enhancements | None | 📋 Pending |
 | 1.5.5 | 3 | Chunk PDF generation & storage | pdf_extractor, config, ingest pipeline refactor | None | 📋 Pending |
@@ -580,12 +580,14 @@ document_chunk/
 | Risk | Impact | Mitigation |
 |------|--------|------------|
-| ChromaDB has no native "group by document" query | `list_documents()` needs manual grouping from all metadata | Group by `filename` + `document_id` in metadata. Cache result if slow. |
+| ~~ChromaDB has no native "group by document" query~~ | ~~`list_documents()` needs manual grouping from all metadata~~ | ✅ Resolved: Groups by document_id extracted from chunk IDs via `rsplit("_", 1)` |
 | Large PDFs → many chunk PDF files | Disk usage grows | One PDF per unique page (not per chunk). Pages shared by chunks reuse same file. |
 | Chunk spans multiple pages | Ambiguous page assignment | Tag chunk with STARTING page only. Note in UI. |
 | DOCX has no page numbers | `page_number` is None for DOCX chunks | Graceful degradation — show "N/A" or hide page info for DOCX. |
 | Deleting documents must clean up chunk files | Orphan files if deletion fails | Delete files after successful ChromaDB deletion. Log failures for manual cleanup. |
 | Path traversal in chunk PDF endpoint | Security risk | Validate `file_path` doesn't contain `..` or absolute paths. Use whitelist of known files. |
+| ChromaDB 1.5.8 requires `name()` on embedding functions | `_EmbeddingFunctionWrapper` crashes on `collection.get()` | ✅ Fixed: Added `name()` method returning `"custom_embedding_wrapper"` |
+| Existing ChromaDB data corrupted (HNSW segment error) | Endpoints return 500 against existing `chroma_db/` | Pre-existing issue. Works with fresh DB. May need `chroma_db` reset for production. |
 ---
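The resolved risk in this hunk says documents are grouped by a `document_id` extracted from chunk IDs via `rsplit("_", 1)`. A minimal sketch of that grouping follows, assuming chunk IDs have the form `<document_id>_<index>`; that ID layout and the function name `count_chunks_per_document` are assumptions for illustration, not confirmed by the diff.

```python
from collections import Counter
from typing import Dict, List


def count_chunks_per_document(chunk_ids: List[str]) -> Dict[str, int]:
    """Group chunk IDs by document ID and count the chunks in each group."""
    counts: Counter = Counter()
    for chunk_id in chunk_ids:
        # rsplit("_", 1) splits on the LAST underscore only, so a document
        # ID that itself contains underscores stays intact. An ID with no
        # underscore at all is treated as its own document ID.
        document_id = chunk_id.rsplit("_", 1)[0]
        counts[document_id] += 1
    return dict(counts)


if __name__ == "__main__":
    ids = ["report_2024_0", "report_2024_1", "notes_0"]
    print(count_chunks_per_document(ids))  # {'report_2024': 2, 'notes': 1}
```

Splitting from the right is what makes this safe for document IDs with embedded underscores; a plain `split("_")` would cut `report_2024_0` into three parts.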
@@ -613,7 +615,7 @@ None — all resolved.
 | File | Coverage |
 |------|----------|
-| `test_phase1_documents_router.py` | GET /documents, DELETE /documents/{id}, DELETE /chunks/{id} |
+| `test_phase1_documents_router.py` | GET /documents, DELETE /documents/{id}, DELETE /chunks/{id} (8 tests, all pass) |
 | `test_phase1_pdf_parser_pages.py` | parse_pdf_by_page() — multi-page PDFs, single-page, empty |
 | `test_phase1_page_aware_chunking.py` | chunk_with_pages() — cross-page chunks, single-page chunks |
 | `test_phase1_pdf_extractor.py` | extract_page_as_pdf() — valid page, out-of-range, corrupt PDF |
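The risk table in this diff lists path traversal in the chunk PDF endpoint and calls for rejecting `..` and absolute paths in `file_path`. A minimal sketch of such a check follows, using only `pathlib`. The function name `resolve_chunk_pdf` and the `base_dir` parameter are hypothetical, and the whitelist of known files that the plan also mentions is not shown.

```python
from pathlib import Path


def resolve_chunk_pdf(file_path: str, base_dir: Path) -> Path:
    """Return the resolved path if it stays inside base_dir, else raise ValueError."""
    candidate = Path(file_path)

    # Reject the two cases the plan names explicitly.
    if candidate.is_absolute() or ".." in candidate.parts:
        raise ValueError(f"unsafe path: {file_path!r}")

    base = base_dir.resolve()
    resolved = (base / candidate).resolve()

    # Defence in depth: after resolving symlinks, the result must still
    # be base_dir itself or something underneath it.
    if resolved != base and base not in resolved.parents:
        raise ValueError(f"path escapes base directory: {file_path!r}")
    return resolved
```

The second check matters because a symlink inside the chunk directory could point outside it even when the requested string looks harmless.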