# Phase 1 Enhancement Plan

**Source**: User request (2026-04-23)
**Scope**: Frontend navigation + RAG Database management page + page-aware chunking with chunk PDFs
**Status**: 🔄 In Progress: Feature 1 ✅ Complete, Feature 2 backend ✅ Complete, Feature 2 frontend & Feature 3 pending

---

## Objective

Enhance the existing Phase 1 application with three features:

1. **Navigation Bar**: Top nav bar with two pages: "LTT" (current query page) and "RAG Database"
2. **RAG Database Page**: View/manage ChromaDB documents (list, delete, upload)
3. **Page-Aware Chunking**: Chunks tagged with page numbers, saved as PDFs in `document_chunk/`, with clickable links in RAG responses

---

## Current State (Pre-Enhancement)

### What Exists

- Multi-page React app with react-router-dom routing
- Nav bar with "LTT" and "RAG Database" tabs
- "LTT" page at `/` with current query interface
- "RAG Database" placeholder page at `/rag-database`
- 2 API endpoints: `POST /api/v1/ingest`, `POST /api/v1/query`
- Flat chunking: PDF text extracted page-by-page but concatenated into one string before chunking
- Metadata per chunk: `filename`, `upload_date`, `content_summary`, `chunk_index`
- ChromaDB collection `documents` with UUID-based IDs (`{document_id}_{chunk_index}`)
- Frontend: pages in `pages/`, components in `components/`, TanStack Query, react-markdown

### What's Missing (Gaps This Plan Fills)

- ~~No routing or multi-page support~~ ✅ Done in Feature 1
- ~~No way to view what's stored in ChromaDB~~ ✅ Backend CRUD done (sub-phase 1.5.2)
- ~~No way to delete documents or chunks~~ ✅ Backend CRUD done (sub-phase 1.5.2)
- No page-level awareness in chunking (all pages flattened before token splitting)
- No persistent chunk files (chunks only exist as ChromaDB document text)
- No clickable links in RAG responses to view source chunks
- ~~Upload only via IngestPanel on the main query page~~ (IngestPanel stays on LTT, upload also coming to RAG DB page)

---

## Feature 1: Navigation Bar & Multi-Page Layout ✅ COMPLETE

**Completed**: 2026-04-23

### 1.1 Changes Required

**Frontend**:

- ~~Install `react-router-dom`~~ ✅
- ~~Create `frontend/src/components/NavBar.tsx`: top navigation bar~~ ✅
- ~~Create `frontend/src/pages/LTTPage.tsx`: move current App.tsx content here~~ ✅
- ~~Create `frontend/src/pages/RAGDatabasePage.tsx`: placeholder, fleshed out in Feature 2~~ ✅
- ~~Refactor `frontend/src/App.tsx`: Router + NavBar + route definitions~~ ✅

**Backend**: None

### 1.2 Nav Bar Design

```
┌─────────────────────────────────────────────────┐
│  🔍 LTT (active)  │  📚 RAG Database            │  ← top nav bar (fixed)
├─────────────────────────────────────────────────┤
│                                                 │
│  [Page content: LTT or RAG Database]            │
│                                                 │
└─────────────────────────────────────────────────┘
```

- Fixed top bar, full width
- Two tabs: "LTT" (current query page) and "RAG Database"
- Active tab highlighted
- "LTT" route: `/` (default)
- "RAG Database" route: `/rag-database`

### 1.3 Implementation Tasks

| Task | Description | Files | Status |
|------|-------------|-------|--------|
| Install react-router-dom | `npm install react-router-dom` | `package.json` | ✅ |
| Create NavBar component | Horizontal nav with two links | `frontend/src/components/NavBar.tsx` | ✅ |
| Create LTTPage | Extract current AppContent into page component | `frontend/src/pages/LTTPage.tsx` | ✅ |
| Create RAGDatabasePage | Placeholder page (scaffold for Feature 2) | `frontend/src/pages/RAGDatabasePage.tsx` | ✅ |
| Refactor App.tsx | BrowserRouter + Routes + NavBar wrapper | `frontend/src/App.tsx` | ✅ |

### 1.4 Acceptance Criteria

- [x] Nav bar visible at top of every page
- [x] Clicking "LTT" navigates to `/` and shows current query interface
- [x] Clicking "RAG Database" navigates to `/rag-database`
- [x] Current page highlighted in nav bar
- [x] All existing functionality preserved (query, ingest, response display)
- [x] Build passes, no TypeScript errors
- [x] 62/62 frontend tests pass

---

## Feature 2: RAG Database Management Page

### 2.1 Overview

A dedicated page to view and manage all documents/chunks stored in ChromaDB.

### 2.2 Backend Changes

**New API Endpoints**:

| Method | Path | Description |
|--------|------|-------------|
| `GET` | `/api/v1/documents` | List all documents with chunk counts |
| `GET` | `/api/v1/documents/{document_id}/chunks` | List all chunks for a document |
| `DELETE` | `/api/v1/documents/{document_id}` | Delete all chunks for a document |
| `DELETE` | `/api/v1/chunks/{chunk_id}` | Delete a single chunk |

**New/Modified Files**:

| File | Change |
|------|--------|
| `backend/app/routers/documents.py` | **NEW**: CRUD endpoints for documents/chunks |
| `backend/app/services/rag.py` | Add `list_documents()`, `list_chunks()`, `delete_document()`, `delete_chunk()` methods |
| `backend/app/models/documents.py` | **NEW**: Pydantic schemas for document/chunk listing |
| `backend/app/main.py` | Register `documents` router |

**New Pydantic Schemas** (`models/documents.py`):

```python
from typing import List

from pydantic import BaseModel


class DocumentInfo(BaseModel):
    document_id: str
    filename: str
    chunk_count: int
    upload_date: str


class ChunkInfo(BaseModel):
    chunk_id: str
    chunk_index: int
    content_summary: str
    page_number: int | None = None      # Added by Feature 3
    chunk_file_path: str | None = None  # Added by Feature 3


class DocumentListResponse(BaseModel):
    documents: List[DocumentInfo]
    total_documents: int
    total_chunks: int


class DeleteResponse(BaseModel):
    deleted: bool
    message: str
```

**Implementation Notes for ChromaDB Operations**:

- `list_documents()`: ChromaDB has no native "group by document", so call `collection.get(include=["metadatas"])`, then group by `filename` and extract `document_id` from chunk IDs (format: `{document_id}_{chunk_index}`)
- `delete_document()`: Use `collection.delete(where={"filename": "..."})` or collect all chunk IDs matching the document_id prefix and call `collection.delete(ids=[...])`
- `delete_chunk()`: Use `collection.delete(ids=[chunk_id])`
- **Important**: When deleting a document, also clean up associated chunk PDF files from `document_chunk/` (Feature 3)

### 2.3 Frontend Changes

**New/Modified Files**:

| File | Change |
|------|--------|
| `frontend/src/pages/RAGDatabasePage.tsx` | Full implementation |
| `frontend/src/components/DocumentList.tsx` | **NEW**: Document table/cards |
| `frontend/src/components/ChunkList.tsx` | **NEW**: Chunk table for selected document |
| `frontend/src/components/DocumentUpload.tsx` | **NEW**: Upload form (can reuse IngestPanel logic) |
| `frontend/src/lib/api.ts` | Add `listDocuments()`, `deleteDocument()`, `deleteChunk()` |
| `frontend/src/lib/queries.tsx` | Add TanStack Query hooks for new endpoints |
| `frontend/src/types/index.ts` | Add `DocumentInfo`, `ChunkInfo`, `DeleteResponse` types |

### 2.4 Page Layout

```
┌──────────────────────────────────────────────────┐
│ RAG Database                            [Upload] │
├──────────────────────────────────────────────────┤
│ Total: 5 documents, 342 chunks                   │
│                                                  │
│ ┌──────────────────────────────────────────────┐ │
│ │ 📄 NEC4 ACC.pdf │ 101 chunks │ 2026-04-23    │ │
│ │ [View Chunks] [Delete]                       │ │
│ ├──────────────────────────────────────────────┤ │
│ │ 📄 meeting_notes.docx │ 45 chunks │ 2026-04-22│ │
│ │ [View Chunks] [Delete]                       │ │
│ ├──────────────────────────────────────────────┤ │
│ │ 📄 budget_report.txt │ 28 chunks │ 2026-04-21│ │
│ │ [View Chunks] [Delete]                       │ │
│ └──────────────────────────────────────────────┘ │
│                                                  │
│ [Expanded chunk view when "View Chunks" clicked] │
│ ┌──────────────────────────────────────────────┐ │
│ │ Chunk 0 │ p.3 │ "Discussion of budget..."    │ │
│ │         │ [View PDF] [Delete Chunk]          │ │
│ ├──────────────────────────────────────────────┤ │
│ │ Chunk 1 │ p.4 │ "Allocation for Q4..."       │ │
│ │         │ [View PDF] [Delete Chunk]          │ │
│ └──────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────┘
```

### 2.5 Acceptance Criteria

- [x] `GET /api/v1/documents` returns all documents with chunk counts
- [x] `DELETE /api/v1/documents/{document_id}` removes all chunks from ChromaDB + associated chunk PDFs
- [x] `DELETE /api/v1/chunks/{chunk_id}` removes a single chunk
- [ ] RAG Database page shows all documents with chunk counts
- [ ] User can expand a document to see its chunks
- [ ] User can delete a document (with confirmation)
- [ ] User can delete individual chunks (with confirmation)
- [ ] User can upload documents from this page
- [ ] Stats displayed: total documents, total chunks
- [ ] Uploading a file with existing filename triggers automatic replacement (old data deleted first)

---

## Feature 3: Page-Aware Chunking & Chunk PDF Storage

### 3.1 Overview

When a document is uploaded:

1. Parse it page-by-page (PDF) or section-by-section (DOCX)
2. Each chunk is tagged with its source page number
3. Each chunk's source page is saved as a PDF in `document_chunk/`
4. RAG responses include clickable links to the chunk PDF

### 3.2 Backend Changes

#### 3.2.1 Page-Aware PDF Parsing

**Current**: `parse_pdf()` concatenates all pages into one string, losing page boundaries.

**New**: `parse_pdf_by_page()` returns `List[Tuple[int, str]]`, a list of (page_number, page_text) tuples.

**Modified Files**:

| File | Change |
|------|--------|
| `backend/app/utils/pdf_parser.py` | Add `parse_pdf_by_page()` function |

```python
from typing import List, Tuple

from pypdf import PdfReader


def parse_pdf_by_page(file_path: str) -> List[Tuple[int, str]]:
    """Parse PDF and return per-page text with page numbers (1-indexed)."""
    reader = PdfReader(file_path)
    pages = []
    for i, page in enumerate(reader.pages, start=1):
        text = page.extract_text()
        # Skip pages with no extractable text (e.g. blank or image-only pages)
        if text and text.strip():
            pages.append((i, text.strip()))
    return pages
```

**DOCX Note**: DOCX files don't have true page numbers.
For DOCX, we can use paragraph-based indexing or skip page tracking. Suggested approach: chunk DOCX normally, set `page_number = None` in metadata.

#### 3.2.2 Page-Aware Chunking

**Current**: `TokenChunkingStrategy.chunk(text)` takes a flat string and splits by tokens.

**New**: Page-as-chunk-unit with overlap context from adjacent pages.

**Chunking Algorithm (confirmed)**:

```
For page N (1-indexed):
  overlap_before = last 200 tokens of page N-1 text (or empty if page 1)
  overlap_after  = first 200 tokens of page N+1 text (or empty if last page)
  chunk_text     = overlap_before + page_N_text + overlap_after
```

- One chunk per page: **never split** a page even if it exceeds 1000 tokens
- Overlap provides surrounding context for better embedding/retrieval
- The `page_number` metadata always refers to the main page (N), not the overlap pages

**Modified Files**:

| File | Change |
|------|--------|
| `backend/app/utils/chunking.py` | Add `chunk_pages()` method to `TokenChunkingStrategy` |

```python
from typing import List, Tuple


# Method of TokenChunkingStrategy
def chunk_pages(
    self,
    pages: List[Tuple[int, str]],
    overlap_tokens: int = 200,
) -> List[Tuple[str, int]]:
    """Chunk page-segmented text with overlap from adjacent pages.

    For each page, creates one chunk containing:
    [last overlap_tokens of previous page] + [full current page]
    + [first overlap_tokens of next page]

    Args:
        pages: List of (page_number, page_text) tuples. 1-indexed.
        overlap_tokens: Number of tokens to include from adjacent pages.

    Returns:
        List of (chunk_text, page_number) tuples. One chunk per page.
    """
    if not pages:
        return []

    # Tokenize all pages upfront
    tokenized = []
    for page_num, page_text in pages:
        tokens = self._encoding.encode(page_text)
        tokenized.append((page_num, tokens, page_text))

    chunks = []
    for i, (page_num, _tokens, page_text) in enumerate(tokenized):
        parts = []

        # Overlap from previous page (last N tokens). The explicit > 0 guard
        # matters because tokens[-0:] would return the whole page; slicing
        # already copes with pages shorter than overlap_tokens.
        if i > 0 and overlap_tokens > 0:
            overlap = tokenized[i - 1][1][-overlap_tokens:]
            if overlap:
                parts.append(self._encoding.decode(overlap))

        # Full current page text (use original text, not re-decoded)
        parts.append(page_text)

        # Overlap from next page (first N tokens)
        if i < len(tokenized) - 1:
            overlap = tokenized[i + 1][1][:overlap_tokens]
            if overlap:
                parts.append(self._encoding.decode(overlap))

        chunks.append(("\n".join(parts), page_num))

    return chunks
```

#### 3.2.3 Chunk PDF Generation & Storage

**New directory**: `document_chunk/` (at project root, alongside `chroma_db/`)

**Naming convention**: `{original_filename_without_ext}_page_{page_number}.pdf`

Example: `NEC4 ACC_page_3.pdf`

**One file per page**: multiple chunks never exist for a page (decision: never split a page), so deduplication is not needed.

**Content**: The actual page extracted from the source PDF, which preserves original formatting, layout, tables, and images. Not a generated text PDF.

**Modified/New Files**:

| File | Change |
|------|--------|
| `backend/app/utils/pdf_extractor.py` | **NEW**: Extract and save individual PDF pages |
| `backend/app/core/config.py` | Add `DOCUMENT_CHUNK_PATH` setting (default: `./document_chunk`) |

```python
# pdf_extractor.py
from pypdf import PdfReader, PdfWriter


def extract_page_as_pdf(source_path: str, page_number: int, output_path: str) -> str:
    """Extract a single page from a PDF and save as a new PDF file.

    Args:
        source_path: Path to original PDF
        page_number: 1-indexed page number
        output_path: Where to save the extracted page PDF

    Returns:
        The output_path of the saved PDF
    """
    reader = PdfReader(source_path)
    writer = PdfWriter()
    writer.add_page(reader.pages[page_number - 1])  # 0-indexed in reader
    with open(output_path, "wb") as f:
        writer.write(f)
    return output_path
```

**Note**: For DOCX files, chunk PDF generation is skipped (set `chunk_file_path = None` in metadata). Only PDFs support page extraction.

#### 3.2.4 Enhanced Metadata

**Current metadata**:

```python
{
    "filename": "report.pdf",
    "upload_date": "2026-04-23T...",
    "content_summary": "First 200 chars...",
    "chunk_index": 0,
}
```

**Enhanced metadata**:

```python
{
    "filename": "report.pdf",
    "upload_date": "2026-04-23T...",
    "content_summary": "First 200 chars...",
    "chunk_index": 0,
    "page_number": 3,                        # NEW
    "chunk_file_path": "report_page_3.pdf",  # NEW (relative path)
    "document_id": "uuid-string",            # NEW (for grouping)
}
```

**Modified Files**:

| File | Change |
|------|--------|
| `backend/app/utils/metadata.py` | Add `page_number`, `chunk_file_path`, `document_id` to metadata |
| `backend/app/models/common.py` | Add new fields to `SourceMetadata` |

#### 3.2.5 Chunk File Serving Endpoint

**New endpoint**:

| Method | Path | Description |
|--------|------|-------------|
| `GET` | `/api/v1/chunks/{file_path}/pdf` | Serve chunk PDF file |

**Modified Files**:

| File | Change |
|------|--------|
| `backend/app/routers/documents.py` | Add `GET /chunks/{file_path}/pdf` endpoint |

```python
@router.get("/chunks/{file_path}/pdf")
async def get_chunk_pdf(file_path: str):
    """Serve a chunk PDF file from document_chunk/ directory."""
    # Validate path to prevent directory traversal
    # Return FileResponse from DOCUMENT_CHUNK_PATH / file_path
```

#### 3.2.6 Ingestion Pipeline Refactor

The entire ingestion flow needs to be updated:

**Current flow**:

```
Upload → parse_pdf() → flat text → chunk() → metadata → store in ChromaDB
```

**New flow**:

```
Upload
  → check if filename exists → YES: delete old chunks + chunk PDFs (full replacement)
  → parse_pdf_by_page() → per-page text
  → chunk_pages() with 200-token overlap from adjacent pages
  → for each page: extract page as PDF → save to document_chunk/
  → enhanced metadata (page_number, chunk_file_path, document_id)
  → store in ChromaDB
```

**Same-filename replacement** (confirmed):

- On upload, query ChromaDB for existing chunks with matching `filename`
- If found: delete old chunk IDs from collection, delete old PDFs from `document_chunk/`
- Create new `document_id`, ingest fresh
- This ensures clean replacement without orphaned data

**Modified Files**:

| File | Change |
|------|--------|
| `backend/app/routers/ingest.py` | Refactor: page-aware parsing, chunk PDF generation, enhanced metadata, same-filename replacement |

### 3.3 Frontend Changes

**Modified Files**:

| File | Change |
|------|--------|
| `frontend/src/types/index.ts` | Add `page_number`, `chunk_file_path` to `SourceMetadata` |
| `frontend/src/components/ResponsePanel.tsx` | Render `chunk_file_path` as clickable link in sources |
| `frontend/src/components/ChunkList.tsx` | Show page number, link to chunk PDF |

**Source Card Update**:

- Current source card shows: `filename`, `upload_date`, `content_summary`, `chunk_index`
- Enhanced source card adds: `page_number` (e.g., "Page 3"), clickable "View Source" link opening chunk PDF

### 3.4 Directory Structure After Enhancement

```
legco_reranker/
├── app/
│   ├── backend/...
│   ├── frontend/...
│   └── chroma_db/          # Existing
├── document_chunk/         # NEW: chunk PDF files
│   ├── NEC4 ACC_page_1.pdf
│   ├── NEC4 ACC_page_2.pdf
│   ├── NEC4 ACC_page_3.pdf
│   └── meeting_notes_page_5.pdf
├── .plans/
└── ...
```

### 3.5 Acceptance Criteria

- [ ] PDF uploads produce page-aware chunks: 1 chunk per page with 200-token overlap from adjacent pages
- [ ] Each page is saved as a separate PDF (original page, not generated text) in `document_chunk/`
- [ ] Chunk PDF filename follows convention: `{filename}_page_{n}.pdf`
- [ ] Page numbers are sequential index (1, 2, 3...), not PDF internal labels
- [ ] Oversized pages are kept as single chunks (never split)
- [ ] `GET /api/v1/chunks/{file_path}/pdf` serves the original chunk PDF
- [ ] RAG response sources include `page_number` and `chunk_file_path`
- [ ] Frontend source cards show page number and clickable link
- [ ] Clicking source link opens/downloads the original chunk PDF
- [ ] DOCX uploads work without page numbers (graceful degradation, no chunk PDFs)
- [ ] Uploading a file with same filename replaces existing document (old chunks + PDFs deleted, new document_id)
- [ ] `document_chunk/` is `.gitignore`d
- [ ] Deleting a document also removes its chunk PDFs from `document_chunk/`

---

## Implementation Sequence

The three features have dependencies.
Recommended order:

```
Feature 1 (Nav + Routing)         ← No backend changes, enables Feature 2
    │
    ▼
Feature 2 (RAG Database Page)     ← Needs Feature 1 for page routing
    │                               But backend CRUD endpoints are independent
    ▼
Feature 3 (Page-Aware Chunking)   ← Modifies ingestion pipeline
                                    Enhances Feature 2 (chunk file links)
                                    Enhances ResponsePanel (clickable sources)
```

### Sub-Phase Breakdown

| Sub-Phase | Feature | Scope | Backend | Frontend | Status |
|-----------|---------|-------|---------|----------|--------|
| 1.5.1 | 1 | Nav bar + routing + page scaffold | None | NavBar, LTTPage, RAGDatabasePage, App.tsx refactor | ✅ Complete |
| 1.5.2 | 2 | Backend CRUD for documents/chunks | documents router, RAGService methods, schemas | None | ✅ Complete |
| 1.5.3 | 2 | Frontend RAG Database page | None | RAGDatabasePage, DocumentList, ChunkList, DocumentUpload, API hooks | 📋 Pending |
| 1.5.4 | 3 | Page-aware parsing & chunking | pdf_parser, chunking, metadata enhancements | None | 📋 Pending |
| 1.5.5 | 3 | Chunk PDF generation & storage | pdf_extractor, config, ingest pipeline refactor | None | 📋 Pending |
| 1.5.6 | 3 | Chunk file serving + frontend links | documents router endpoint | ResponsePanel clickable links, ChunkList updates | 📋 Pending |

### Parallelization Opportunities

- **1.5.1 and 1.5.2 can run in parallel**: Frontend routing changes and backend CRUD are independent
- **1.5.3 blocked by 1.5.1 + 1.5.2**: Needs both routing and backend endpoints
- **1.5.4 and 1.5.5 are sequential**: 1.5.5 depends on 1.5.4's page-aware parsing
- **1.5.6 blocked by 1.5.3 + 1.5.5**: Needs both frontend page and backend chunk serving

```
1.5.1 (Nav+Routing)  ─┐
                      ├─► 1.5.3 (RAG DB Page) ─┐
1.5.2 (Backend CRUD) ─┘                        │
                                               ├─► 1.5.6 (Links)
1.5.4 (Page-Aware) ──► 1.5.5 (Chunk PDFs) ─────┘
```

---

## New Dependencies

### Backend

| Package | Purpose | Already installed? |
|---------|---------|--------------------|
| (none) | pypdf already supports page extraction | ✅ |

### Frontend

| Package | Purpose | Already installed? |
|---------|---------|--------------------|
| `react-router-dom` | Client-side routing | ✅ Installed |

---

## .gitignore Updates

```gitignore
# Chunk PDF storage
document_chunk/
```

---

## Risks & Mitigations

| Risk | Impact | Mitigation |
|------|--------|------------|
| ~~ChromaDB has no native "group by document" query~~ | ~~`list_documents()` needs manual grouping from all metadata~~ | ✅ Resolved: Groups by document_id extracted from chunk IDs via `rsplit("_", 1)` |
| Large PDFs → many chunk PDF files | Disk usage grows | One PDF per page. Each page yields exactly one chunk (Decision 1), so the file count equals the page count. |
| Chunk text includes overlap from adjacent pages | Ambiguous page assignment | Tag chunk with its main page N only (see 3.2.2), not the overlap pages. Note in UI. |
| DOCX has no page numbers | `page_number` is None for DOCX chunks | Graceful degradation: show "N/A" or hide page info for DOCX. |
| Deleting documents must clean up chunk files | Orphan files if deletion fails | Delete files after successful ChromaDB deletion. Log failures for manual cleanup. |
| Path traversal in chunk PDF endpoint | Security risk | Validate `file_path` doesn't contain `..` or absolute paths. Use whitelist of known files. |
| ChromaDB 1.5.8 requires `name()` on embedding functions | `_EmbeddingFunctionWrapper` crashes on `collection.get()` | ✅ Fixed: Added `name()` method returning `"custom_embedding_wrapper"` |
| Existing ChromaDB data corrupted (HNSW segment error) | Endpoints return 500 against existing `chroma_db/` | Pre-existing issue. Works with fresh DB. May need `chroma_db` reset for production. |

---

## Decisions (Confirmed)

| # | Question | Decision |
|---|----------|----------|
| 1 | Chunk algorithm | **Page-as-chunk-unit**. Each chunk = `[last 200 tokens of prev page] + [full current page text] + [first 200 tokens of next page]`. One chunk per page: never split a page even if oversized. |
| 2 | DOCX chunk PDFs | **No**. Only PDFs get chunk PDFs. DOCX chunks show text preview only. |
| 3 | IngestPanel placement | **Keep on LTT page** + also add upload on RAG Database page. |
| 4 | Re-ingestion / same filename | **Full replacement**. Delete old chunks + old chunk PDFs + create new `document_id`. |
| 5 | Chunk PDF content | **Original page from source PDF**. Extract the actual page, preserving formatting, tables, images. |
| 6 | Page numbering | **Sequential index** (1, 2, 3...). Not PDF internal labels. |
| 7 | Oversized pages | **Never split**. One chunk per page regardless of token count. |

## Open Questions

None; all resolved.

---

## Test Plan

### Backend Tests (New Files)

| File | Coverage |
|------|----------|
| `test_phase1_documents_router.py` | ✅ GET /documents, DELETE /documents/{id}, DELETE /chunks/{id} (8 tests, all pass) |
| `test_phase1_pdf_parser_pages.py` | parse_pdf_by_page(): multi-page PDFs, single-page, empty |
| `test_phase1_page_aware_chunking.py` | chunk_pages(): overlap from adjacent pages, single-page documents |
| `test_phase1_pdf_extractor.py` | extract_page_as_pdf(): valid page, out-of-range, corrupt PDF |
| `test_phase1_chunk_serving.py` | GET /chunks/{path}/pdf: valid file, missing file, path traversal |

### Frontend Tests (New Files)

| File | Coverage |
|------|----------|
| `NavBar.test.tsx` | Navigation links, active state |
| `RAGDatabasePage.test.tsx` | Document list, delete, upload |
| `DocumentList.test.tsx` | Document cards, expand/collapse |
| `ChunkList.test.tsx` | Chunk table, page numbers, PDF links |

### Acceptance Tests

| File | Coverage |
|------|----------|
| `test_acceptance_phase1_documents_crud.py` | Real ChromaDB CRUD with list, delete |
| `test_acceptance_phase1_page_chunking.py` | Real PDF upload → page-aware chunks → chunk PDFs exist |
| `test_acceptance_phase1_chunk_links.py` | Full flow: upload → query → response has clickable chunk links |