Phase 1 Enhancement Plan
Source: User request (2026-04-23)
Scope: Frontend navigation + RAG Database management page + page-aware chunking with chunk PDFs
Status: 🔄 In Progress. Features 1-2 ✅ Complete; Feature 3 partially complete (sub-phase 1.5.4 done, 1.5.5 and 1.5.6 pending)
Objective
Enhance the existing Phase 1 application with three features:
- Navigation Bar: Top nav bar with two pages, "LTT" (current query page) and "RAG Database"
- RAG Database Page: View/manage ChromaDB documents (list, delete, upload)
- Page-Aware Chunking: Chunks tagged with page numbers, saved as PDFs in document_chunk/, with clickable links in RAG responses
Current State (Pre-Enhancement)
What Exists
- Multi-page React app with react-router-dom routing
- Nav bar with "LTT" and "RAG Database" tabs
- "LTT" page at / with current query interface
- "RAG Database" placeholder page at /rag-database
- 2 API endpoints: POST /api/v1/ingest, POST /api/v1/query
- Flat chunking: PDF text extracted page-by-page but concatenated into one string before chunking
- Metadata per chunk: filename, upload_date, content_summary, chunk_index
- ChromaDB collection "documents" with UUID-based IDs ({document_id}_{chunk_index})
- Frontend: pages in pages/, components in components/, TanStack Query, react-markdown
What's Missing (Gaps This Plan Fills)
- No routing or multi-page support ✅ Done in Feature 1
- No way to view what's stored in ChromaDB ✅ Backend CRUD done (sub-phase 1.5.2)
- No way to delete documents or chunks ✅ Backend CRUD done (sub-phase 1.5.2)
- No page-level awareness in chunking (all pages flattened before token splitting)
- No persistent chunk files (chunks only exist as ChromaDB document text)
- No clickable links in RAG responses to view source chunks
- Upload only via IngestPanel on the main query page (IngestPanel stays on LTT; upload is also coming to the RAG DB page)
Feature 1: Navigation Bar & Multi-Page Layout ✅ COMPLETE
Completed: 2026-04-23
1.1 Changes Required
Frontend:
- ✅ Install react-router-dom
- ✅ Create frontend/src/components/NavBar.tsx: top navigation bar
- ✅ Create frontend/src/pages/LTTPage.tsx: move current App.tsx content here
- ✅ Create frontend/src/pages/RAGDatabasePage.tsx: placeholder, fleshed out in Feature 2
- ✅ Refactor frontend/src/App.tsx: Router + NavBar + route definitions
Backend: None
1.2 Nav Bar Design
┌─────────────────────────────────────────────────┐
│ 🔍 LTT (active) │ 📚 RAG Database │ ← top nav bar (fixed)
├─────────────────────────────────────────────────┤
│ │
│ [Page content: LTT or RAG Database] │
│ │
└─────────────────────────────────────────────────┘
- Fixed top bar, full width
- Two tabs: "LTT" (current query page) and "RAG Database"
- Active tab highlighted
- "LTT" route: / (default)
- "RAG Database" route: /rag-database
1.3 Implementation Tasks
| Task | Description | Files | Status |
|---|---|---|---|
| Install react-router-dom | npm install react-router-dom | package.json | ✅ |
| Create NavBar component | Horizontal nav with two links | frontend/src/components/NavBar.tsx | ✅ |
| Create LTTPage | Extract current AppContent into page component | frontend/src/pages/LTTPage.tsx | ✅ |
| Create RAGDatabasePage | Placeholder page (scaffold for Feature 2) | frontend/src/pages/RAGDatabasePage.tsx | ✅ |
| Refactor App.tsx | BrowserRouter + Routes + NavBar wrapper | frontend/src/App.tsx | ✅ |
1.4 Acceptance Criteria
- Nav bar visible at top of every page
- Clicking "LTT" navigates to / and shows current query interface
- Clicking "RAG Database" navigates to /rag-database
- Current page highlighted in nav bar
- All existing functionality preserved (query, ingest, response display)
- Build passes, no TypeScript errors
- 62/62 frontend tests pass
Feature 2: RAG Database Management Page
2.1 Overview
A dedicated page to view and manage all documents/chunks stored in ChromaDB.
2.2 Backend Changes
New API Endpoints:
| Method | Path | Description |
|---|---|---|
| GET | /api/v1/documents | List all documents with chunk counts |
| GET | /api/v1/documents/{document_id}/chunks | List all chunks for a document |
| DELETE | /api/v1/documents/{document_id} | Delete all chunks for a document |
| DELETE | /api/v1/chunks/{chunk_id} | Delete a single chunk |
New/Modified Files:
| File | Change |
|---|---|
| backend/app/routers/documents.py | NEW: CRUD endpoints for documents/chunks |
| backend/app/services/rag.py | Add list_documents(), list_chunks(), delete_document(), delete_chunk() methods |
| backend/app/models/documents.py | NEW: Pydantic schemas for document/chunk listing |
| backend/app/main.py | Register documents router |
New Pydantic Schemas (models/documents.py):
from typing import List

from pydantic import BaseModel

class DocumentInfo(BaseModel):
    document_id: str
    filename: str
    chunk_count: int
    upload_date: str

class ChunkInfo(BaseModel):
    chunk_id: str
    chunk_index: int
    content_summary: str
    page_number: int | None = None  # Added by Feature 3
    chunk_file_path: str | None = None  # Added by Feature 3

class DocumentListResponse(BaseModel):
    documents: List[DocumentInfo]
    total_documents: int
    total_chunks: int

class DeleteResponse(BaseModel):
    deleted: bool
    message: str
Implementation Notes for ChromaDB Operations:
- list_documents(): ChromaDB has no native "group by document", so call collection.get(include=["metadatas"]), then group by filename and extract document_id from chunk IDs (format: {document_id}_{chunk_index})
- delete_document(): Use collection.delete(where={"filename": "..."}) or collect all chunk IDs matching the document_id prefix and call collection.delete(ids=[...])
- delete_chunk(): Use collection.delete(ids=[chunk_id])
- Important: When deleting a document, also clean up associated chunk PDF files from document_chunk/ (Feature 3)
2.3 Frontend Changes
New/Modified Files:
| File | Change |
|---|---|
| frontend/src/pages/RAGDatabasePage.tsx | Full implementation |
| frontend/src/components/DocumentList.tsx | NEW: Document table/cards |
| frontend/src/components/ChunkList.tsx | NEW: Chunk table for selected document |
| frontend/src/components/DocumentUpload.tsx | NEW: Upload form (can reuse IngestPanel logic) |
| frontend/src/lib/api.ts | Add listDocuments(), deleteDocument(), deleteChunk() |
| frontend/src/lib/queries.tsx | Add TanStack Query hooks for new endpoints |
| frontend/src/types/index.ts | Add DocumentInfo, ChunkInfo, DeleteResponse types |
2.4 Page Layout
┌──────────────────────────────────────────────────┐
│ RAG Database [Upload] │
├──────────────────────────────────────────────────┤
│ Total: 5 documents, 342 chunks │
│ │
│ ┌──────────────────────────────────────────────┐ │
│ │ 📄 NEC4 ACC.pdf │ 101 chunks │ 2026-04-23 │ │
│ │ [View Chunks] [Delete] │ │
│ ├──────────────────────────────────────────────┤ │
│ │ 📄 meeting_notes.docx │ 45 chunks │ 2026-04-22│ │
│ │ [View Chunks] [Delete] │ │
│ ├──────────────────────────────────────────────┤ │
│ │ 📄 budget_report.txt │ 28 chunks │ 2026-04-21│ │
│ │ [View Chunks] [Delete] │ │
│ └──────────────────────────────────────────────┘ │
│ │
│ [Expanded chunk view when "View Chunks" clicked] │
│ ┌──────────────────────────────────────────────┐ │
│ │ Chunk 0 │ p.3 │ "Discussion of budget..." │ │
│ │ │ [View PDF] [Delete Chunk] │ │
│ ├──────────────────────────────────────────────┤ │
│ │ Chunk 1 │ p.4 │ "Allocation for Q4..." │ │
│ │ │ [View PDF] [Delete Chunk] │ │
│ └──────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────┘
2.5 Acceptance Criteria
- GET /api/v1/documents returns all documents with chunk counts
- DELETE /api/v1/documents/{document_id} removes all chunks from ChromaDB + associated chunk PDFs
- DELETE /api/v1/chunks/{chunk_id} removes a single chunk
- RAG Database page shows all documents with chunk counts
- User can expand a document to see its chunks
- User can delete a document (with confirmation)
- User can delete individual chunks (with confirmation)
- User can upload documents from this page
- Stats displayed: total documents, total chunks
- Uploading a file with existing filename triggers automatic replacement (old data deleted first)
Feature 3: Page-Aware Chunking & Chunk PDF Storage
3.1 Overview
When a document is uploaded:
- Parse it page-by-page (PDF) or section-by-section (DOCX)
- Each chunk is tagged with its source page number
- Each chunk's source page is saved as a PDF in document_chunk/
- RAG responses include clickable links to the chunk PDF
3.2 Backend Changes
3.2.1 Page-Aware PDF Parsing
Current: parse_pdf() concatenates all pages into one string, losing page boundaries.
New: parse_pdf_by_page() returns List[Tuple[int, str]] — list of (page_number, page_text) tuples.
Modified Files:
| File | Change |
|---|---|
| backend/app/utils/pdf_parser.py | Add parse_pdf_by_page() function |
from typing import List, Tuple

from pypdf import PdfReader

def parse_pdf_by_page(file_path: str) -> List[Tuple[int, str]]:
    """Parse PDF and return per-page text with page numbers (1-indexed)."""
    reader = PdfReader(file_path)
    pages = []
    for i, page in enumerate(reader.pages, start=1):
        text = page.extract_text()
        # Pages with no extractable text are skipped; later pages keep their own index.
        if text and text.strip():
            pages.append((i, text.strip()))
    return pages
DOCX Note: DOCX files don't have true page numbers. For DOCX, we can use paragraph-based indexing or skip page tracking. Suggested approach: chunk DOCX normally, set page_number = None in metadata.
3.2.2 Page-Aware Chunking
Current: TokenChunkingStrategy.chunk(text) takes a flat string and splits by tokens.
New: Page-as-chunk-unit with overlap context from adjacent pages.
Chunking Algorithm (confirmed):
For page N (1-indexed):
    overlap_before = last 200 tokens of page N-1 text (or empty if page 1)
    overlap_after = first 200 tokens of page N+1 text (or empty if last page)
    chunk_text = overlap_before + page_N_text + overlap_after
- One chunk per page — never split a page even if it exceeds 1000 tokens
- Overlap provides surrounding context for better embedding/retrieval
- The page_number metadata always refers to the main page (N), not the overlap pages
Modified Files:
| File | Change |
|---|---|
| backend/app/utils/chunking.py | Add chunk_pages() method to TokenChunkingStrategy |
# Method of TokenChunkingStrategy; needs `from typing import List, Tuple`.
def chunk_pages(
    self, pages: List[Tuple[int, str]], overlap_tokens: int = 200
) -> List[Tuple[str, int]]:
    """Chunk page-segmented text with overlap from adjacent pages.

    For each page, creates one chunk containing:
    [last overlap_tokens of previous page] + [full current page] + [first overlap_tokens of next page]

    Args:
        pages: List of (page_number, page_text) tuples. 1-indexed.
        overlap_tokens: Number of tokens to include from adjacent pages.

    Returns:
        List of (chunk_text, page_number) tuples. One chunk per page.
    """
    if not pages:
        return []
    # Tokenize all pages upfront
    tokenized = []
    for page_num, page_text in pages:
        tokens = self._encoding.encode(page_text)
        tokenized.append((page_num, tokens, page_text))
    chunks = []
    for i, (page_num, _tokens, page_text) in enumerate(tokenized):
        parts = []
        # Overlap from previous page (last N tokens). The explicit > 0 guard
        # matters: a slice of [-0:] would return the whole previous page.
        if i > 0 and overlap_tokens > 0:
            overlap = tokenized[i - 1][1][-overlap_tokens:]
            if overlap:
                parts.append(self._encoding.decode(overlap))
        # Full current page text (use original text, not re-decoded)
        parts.append(page_text)
        # Overlap from next page (first N tokens)
        if i < len(tokenized) - 1 and overlap_tokens > 0:
            overlap = tokenized[i + 1][1][:overlap_tokens]
            if overlap:
                parts.append(self._encoding.decode(overlap))
        chunks.append(("\n".join(parts), page_num))
    return chunks
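To make the overlap behavior concrete, here is a toy version of the same page-as-chunk algorithm that treats whitespace-separated words as "tokens". The word splitting is a stand-in for self._encoding and is purely an assumption for illustration; real token counts will differ. The name chunk_pages_by_words is hypothetical.

```python
from typing import List, Tuple


def chunk_pages_by_words(
    pages: List[Tuple[int, str]], overlap_tokens: int = 200
) -> List[Tuple[str, int]]:
    """Toy page-as-chunk splitter: one chunk per page, words stand in for tokens."""
    tokenized = [(num, text.split()) for num, text in pages]
    chunks = []
    for i, (page_num, _words) in enumerate(tokenized):
        parts = []
        # Tail of the previous page, if there is one.
        if i > 0 and overlap_tokens > 0:
            tail = tokenized[i - 1][1][-overlap_tokens:]
            if tail:
                parts.append(" ".join(tail))
        # The full current page, unmodified.
        parts.append(pages[i][1])
        # Head of the next page, if there is one.
        if i < len(tokenized) - 1 and overlap_tokens > 0:
            head = tokenized[i + 1][1][:overlap_tokens]
            if head:
                parts.append(" ".join(head))
        chunks.append(("\n".join(parts), page_num))
    return chunks


demo_pages = [(1, "a b c d"), (2, "e f g h"), (3, "i j k l")]
demo_chunks = chunk_pages_by_words(demo_pages, overlap_tokens=2)
# demo_chunks[0] == ("a b c d\ne f", 1)        first page: no leading overlap
# demo_chunks[1] == ("c d\ne f g h\ni j", 2)   middle page: overlap on both sides
# demo_chunks[2] == ("g h\ni j k l", 3)        last page: no trailing overlap
```

Note that the page number attached to each chunk is always the middle page, even though the chunk text borrows words from its neighbors.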
3.2.3 Chunk PDF Generation & Storage
New directory: document_chunk/ (at project root, alongside chroma_db/)
Naming convention: {original_filename_without_ext}_page_{page_number}.pdf
Example: NEC4 ACC_page_3.pdf
One file per page — multiple chunks never exist for a page (decision: never split a page), so deduplication is not needed.
Content: The actual page extracted from the source PDF — preserves original formatting, layout, tables, images. Not a generated text PDF.
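A small helper for the naming convention might look like the sketch below. The function name chunk_pdf_filename is hypothetical; stripping any directory component from the uploaded filename is an extra precaution assumed here, not something the plan specifies.

```python
from pathlib import Path


def chunk_pdf_filename(original_filename: str, page_number: int) -> str:
    """Build '{original_filename_without_ext}_page_{page_number}.pdf'."""
    if page_number < 1:
        raise ValueError("page_number is 1-indexed")
    # .name drops any directory part; .stem drops only the final extension.
    stem = Path(Path(original_filename).name).stem
    return f"{stem}_page_{page_number}.pdf"


example_name = chunk_pdf_filename("NEC4 ACC.pdf", 3)  # "NEC4 ACC_page_3.pdf"
```

Because only the last extension is removed, a file such as report.v2.pdf would map to report.v2_page_1.pdf.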
Modified/New Files:
| File | Change |
|---|---|
| backend/app/utils/pdf_extractor.py | NEW: Extract and save individual PDF pages |
| backend/app/core/config.py | Add DOCUMENT_CHUNK_PATH setting (default: ./document_chunk) |
# pdf_extractor.py
from pypdf import PdfReader, PdfWriter

def extract_page_as_pdf(source_path: str, page_number: int, output_path: str) -> str:
    """Extract a single page from a PDF and save as a new PDF file.

    Args:
        source_path: Path to original PDF
        page_number: 1-indexed page number
        output_path: Where to save the extracted page PDF

    Returns:
        The output_path of the saved PDF

    Raises:
        IndexError: If page_number is outside 1..len(pages)
    """
    reader = PdfReader(source_path)
    # Without this check, page_number=0 would silently select the last page.
    if not 1 <= page_number <= len(reader.pages):
        raise IndexError(f"page_number {page_number} out of range (1-{len(reader.pages)})")
    writer = PdfWriter()
    writer.add_page(reader.pages[page_number - 1])  # 0-indexed in reader
    with open(output_path, "wb") as f:
        writer.write(f)
    return output_path
Note: For DOCX files, chunk PDF generation is skipped (set chunk_file_path = None in metadata). Only PDFs support page extraction.
3.2.4 Enhanced Metadata
Current metadata:
{
    "filename": "report.pdf",
    "upload_date": "2026-04-23T...",
    "content_summary": "First 200 chars...",
    "chunk_index": 0,
}
Enhanced metadata:
{
    "filename": "report.pdf",
    "upload_date": "2026-04-23T...",
    "content_summary": "First 200 chars...",
    "chunk_index": 0,
    "page_number": 3,  # NEW
    "chunk_file_path": "report_page_3.pdf",  # NEW (relative path)
    "document_id": "uuid-string",  # NEW (for grouping)
}
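One way to assemble the enhanced metadata is sketched below. The function name build_chunk_metadata and its signature are hypothetical. The sketch omits page_number and chunk_file_path when they are None (the DOCX case); whether None values are accepted as metadata depends on the ChromaDB version, so leaving the keys out is an assumption made here, not a decision recorded in this plan.

```python
from datetime import datetime, timezone
from typing import Any, Dict, Optional


def build_chunk_metadata(
    filename: str,
    document_id: str,
    chunk_index: int,
    chunk_text: str,
    page_number: Optional[int] = None,
    chunk_file_path: Optional[str] = None,
) -> Dict[str, Any]:
    """Build the per-chunk metadata dict, adding the Feature 3 fields when known."""
    metadata: Dict[str, Any] = {
        "filename": filename,
        "upload_date": datetime.now(timezone.utc).isoformat(),
        "content_summary": chunk_text[:200],  # first 200 characters
        "chunk_index": chunk_index,
        "document_id": document_id,
    }
    # Assumption: omit the optional keys instead of storing None (DOCX chunks).
    if page_number is not None:
        metadata["page_number"] = page_number
    if chunk_file_path is not None:
        metadata["chunk_file_path"] = chunk_file_path
    return metadata


pdf_meta = build_chunk_metadata(
    "report.pdf", "uuid-string", 0, "Page text...", 3, "report_page_3.pdf"
)
docx_meta = build_chunk_metadata("notes.docx", "uuid-string-2", 0, "Paragraph text...")
```

Readers of the metadata can then use .get("page_number"), which yields None for DOCX chunks and matches the optional fields on ChunkInfo.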
Modified Files:
| File | Change |
|---|---|
| backend/app/utils/metadata.py | Add page_number, chunk_file_path, document_id to metadata |
| backend/app/models/common.py | Add new fields to SourceMetadata |
3.2.5 Chunk File Serving Endpoint
New endpoint:
| Method | Path | Description |
|---|---|---|
| GET | /api/v1/chunks/{file_path}/pdf | Serve chunk PDF file |
Modified Files:
| File | Change |
|---|---|
| backend/app/routers/documents.py | Add GET /chunks/{file_path}/pdf endpoint |
@router.get("/chunks/{file_path}/pdf")
async def get_chunk_pdf(file_path: str):
    """Serve a chunk PDF file from document_chunk/ directory."""
    # Validate path to prevent directory traversal
    # Return FileResponse from DOCUMENT_CHUNK_PATH / file_path
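The "validate path" step in the stub above could be handled by a helper along these lines. This is a minimal sketch using only pathlib; the name resolve_chunk_path is hypothetical, and restricting the endpoint to .pdf files is an added assumption.

```python
from pathlib import Path


def resolve_chunk_path(base_dir: str, file_path: str) -> Path:
    """Resolve file_path inside base_dir, raising ValueError if it escapes."""
    base = Path(base_dir).resolve()
    # Joining with an absolute file_path discards base, which the check below catches.
    candidate = (base / file_path).resolve()
    # After resolving ".." segments, the candidate must sit strictly under base.
    if base not in candidate.parents:
        raise ValueError("path escapes the chunk directory")
    # Assumption: only PDF files are ever served from this directory.
    if candidate.suffix.lower() != ".pdf":
        raise ValueError("only .pdf files are served")
    return candidate


safe_path = resolve_chunk_path("document_chunk", "report_page_3.pdf")
```

The endpoint would then map a ValueError, or a path that is not an existing file, to a 404 and pass the resolved path to FileResponse otherwise.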
3.2.6 Ingestion Pipeline Refactor
The entire ingestion flow needs to be updated:
Current flow:
Upload → parse_pdf() → flat text → chunk() → metadata → store in ChromaDB
New flow:
Upload → check if filename exists → YES: delete old chunks + chunk PDFs (full replacement)
    → parse_pdf_by_page() → per-page text
    → chunk_pages() with 200-token overlap from adjacent pages
    → for each page: extract page as PDF → save to document_chunk/
    → enhanced metadata (page_number, chunk_file_path, document_id)
    → store in ChromaDB
Same-filename replacement (confirmed):
- On upload, query ChromaDB for existing chunks with matching filename
- If found: delete old chunk IDs from collection, delete old PDFs from document_chunk/
- Create new document_id, ingest fresh
- This ensures clean replacement without orphaned data
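The replacement steps above can be sketched as follows. The helper name remove_existing_document is hypothetical; it expects a collection object exposing get(where=..., include=...) and delete(ids=...) in the style of a ChromaDB collection, and it assumes each chunk's metadata stores chunk_file_path relative to the chunk directory.

```python
import logging
from pathlib import Path
from typing import Any, List

logger = logging.getLogger(__name__)


def remove_existing_document(collection: Any, chunk_dir: str, filename: str) -> int:
    """Delete chunks and chunk PDFs previously ingested under `filename`.

    Returns the number of chunk IDs removed from the collection.
    """
    existing = collection.get(where={"filename": filename}, include=["metadatas"])
    ids: List[str] = existing["ids"]
    if not ids:
        return 0
    # Remove the vectors first; files are only cleaned up once that succeeds.
    collection.delete(ids=ids)
    for meta in existing["metadatas"]:
        rel_path = (meta or {}).get("chunk_file_path")
        if not rel_path:
            continue  # e.g. DOCX chunks have no chunk PDF
        try:
            (Path(chunk_dir) / rel_path).unlink()
        except OSError as exc:
            # Leave a trace for manual cleanup rather than failing the upload.
            logger.warning("Could not delete chunk PDF %s: %s", rel_path, exc)
    return len(ids)
```

The ingest route would call this before generating the new document_id, so a re-upload never leaves chunks from the previous version behind.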
Modified Files:
| File | Change |
|---|---|
| backend/app/routers/ingest.py | Refactor: page-aware parsing, chunk PDF generation, enhanced metadata, same-filename replacement |
3.3 Frontend Changes
Modified Files:
| File | Change |
|---|---|
| frontend/src/types/index.ts | Add page_number, chunk_file_path to SourceMetadata |
| frontend/src/components/ResponsePanel.tsx | Render chunk_file_path as clickable link in sources |
| frontend/src/components/ChunkList.tsx | Show page number, link to chunk PDF |
Source Card Update:
Current source card shows: filename, upload_date, content_summary, chunk_index
Enhanced source card adds: page_number (e.g., "Page 3"), clickable "View Source" link opening chunk PDF
3.4 Directory Structure After Enhancement
legco_reranker/
├── app/
│ ├── backend/...
│ ├── frontend/...
│ └── chroma_db/ # Existing
├── document_chunk/ # NEW — chunk PDF files
│ ├── NEC4 ACC_page_1.pdf
│ ├── NEC4 ACC_page_2.pdf
│ ├── NEC4 ACC_page_3.pdf
│ └── meeting_notes_page_5.pdf
├── .plans/
└── ...
3.5 Acceptance Criteria
- PDF uploads produce page-aware chunks: 1 chunk per page with 200-token overlap from adjacent pages
- Each page is saved as a separate PDF (original page, not generated text) in document_chunk/
- Chunk PDF filename follows convention: {filename}_page_{n}.pdf
- Page numbers are sequential index (1, 2, 3...), not PDF internal labels
- Oversized pages are kept as single chunks (never split)
- GET /api/v1/chunks/{file_path}/pdf serves the original chunk PDF
- RAG response sources include page_number and chunk_file_path
- Frontend source cards show page number and clickable link
- Clicking source link opens/downloads the original chunk PDF
- DOCX uploads work without page numbers (graceful degradation, no chunk PDFs)
- Uploading a file with same filename replaces existing document (old chunks + PDFs deleted, new document_id)
- document_chunk/ is .gitignored
- Deleting a document also removes its chunk PDFs from document_chunk/
Implementation Sequence
The three features have dependencies. Recommended order:
Feature 1 (Nav + Routing) ← No backend changes, enables Feature 2
│
▼
Feature 2 (RAG Database Page) ← Needs Feature 1 for page routing
│ But backend CRUD endpoints are independent
▼
Feature 3 (Page-Aware Chunking) ← Modifies ingestion pipeline
Enhances Feature 2 (chunk file links)
Enhances ResponsePanel (clickable sources)
Sub-Phase Breakdown
| Sub-Phase | Feature | Scope | Backend | Frontend | Status |
|---|---|---|---|---|---|
| 1.5.1 | 1 | Nav bar + routing + page scaffold | None | NavBar, LTTPage, RAGDatabasePage, App.tsx refactor | ✅ Complete |
| 1.5.2 | 2 | Backend CRUD for documents/chunks | documents router, RAGService methods, schemas | None | ✅ Complete |
| 1.5.3 | 2 | Frontend RAG Database page | None | RAGDatabasePage, DocumentList, ChunkList, DocumentUpload, API hooks | ✅ Complete |
| 1.5.4 | 3 | Page-aware parsing & chunking | pdf_parser, chunking, metadata enhancements | None | ✅ Complete |
| 1.5.5 | 3 | Chunk PDF generation & storage | pdf_extractor, config, ingest pipeline refactor | None | 📋 Pending |
| 1.5.6 | 3 | Chunk file serving + frontend links | documents router endpoint | ResponsePanel clickable links, ChunkList updates | 📋 Pending |
Parallelization Opportunities
- 1.5.1 and 1.5.2 can run in parallel — Frontend routing changes and backend CRUD are independent
- 1.5.3 blocked by 1.5.1 + 1.5.2 — Needs both routing and backend endpoints
- 1.5.4 and 1.5.5 are sequential — 1.5.5 depends on 1.5.4's page-aware parsing
- 1.5.6 blocked by 1.5.3 + 1.5.5 — Needs both frontend page and backend chunk serving
1.5.1 (Nav+Routing) ─┐
├─► 1.5.3 (RAG DB Page) ─┐
1.5.2 (Backend CRUD) ─┘ │
├─► 1.5.6 (Links)
1.5.4 (Page-Aware) ──► 1.5.5 (Chunk PDFs) ─────┘
New Dependencies
Backend
| Package | Purpose | Already installed? |
|---|---|---|
| (none) | pypdf already supports page extraction | ✅ |
Frontend
| Package | Purpose | Already installed? |
|---|---|---|
| react-router-dom | Client-side routing | ✅ Installed |
.gitignore Updates
# Chunk PDF storage
document_chunk/
Risks & Mitigations
| Risk | Impact | Mitigation |
|---|---|---|
| list_documents() needs manual grouping from all metadata | | ✅ Resolved: Groups by document_id extracted from chunk IDs via rsplit("_", 1) |
| Large PDFs → many chunk PDF files | Disk usage grows | One PDF per page. Each page yields exactly one chunk, so no page is written twice. |
| Chunk text includes overlap from adjacent pages | Ambiguous page assignment | Tag chunk with its main page (N) only, not the overlap pages. Note in UI. |
| DOCX has no page numbers | page_number is None for DOCX chunks | Graceful degradation: show "N/A" or hide page info for DOCX. |
| Deleting documents must clean up chunk files | Orphan files if deletion fails | Delete files after successful ChromaDB deletion. Log failures for manual cleanup. |
| Path traversal in chunk PDF endpoint | Security risk | Validate file_path doesn't contain .. or absolute paths. Use whitelist of known files. |
| ChromaDB 1.5.8 requires name() on embedding functions | _EmbeddingFunctionWrapper crashes on collection.get() | ✅ Fixed: Added name() method returning "custom_embedding_wrapper" |
| Existing ChromaDB data corrupted (HNSW segment error) | Endpoints return 500 against existing chroma_db/ | Pre-existing issue. Works with fresh DB. May need chroma_db reset for production. |
Decisions (Confirmed)
| # | Question | Decision |
|---|---|---|
| 1 | Chunk algorithm | Page-as-chunk-unit. Each chunk = [last 200 tokens of prev page] + [full current page text] + [first 200 tokens of next page]. One chunk per page — never split a page even if oversized. |
| 2 | DOCX chunk PDFs | No. Only PDFs get chunk PDFs. DOCX chunks show text preview only. |
| 3 | IngestPanel placement | Keep on LTT page + also add upload on RAG Database page. |
| 4 | Re-ingestion / same filename | Full replacement. Delete old chunks + old chunk PDFs + create new document_id. |
| 5 | Chunk PDF content | Original page from source PDF. Extract actual page — preserves formatting, tables, images. |
| 6 | Page numbering | Sequential index (1, 2, 3...). Not PDF internal labels. |
| 7 | Oversized pages | Never split. One chunk per page regardless of token count. |
Open Questions
None — all resolved.
Test Plan
Backend Tests (New Files)
| File | Coverage |
|---|---|
| test_phase1_documents_router.py | ✅ GET /documents, DELETE /documents/{id}, DELETE /chunks/{id} (8 tests, all pass) |
| test_phase1_pdf_parser_pages.py | parse_pdf_by_page(): multi-page PDFs, single-page, empty |
| test_phase1_page_aware_chunking.py | chunk_pages(): cross-page chunks, single-page chunks |
| test_phase1_pdf_extractor.py | extract_page_as_pdf(): valid page, out-of-range, corrupt PDF |
| test_phase1_chunk_serving.py | GET /chunks/{path}/pdf: valid file, missing file, path traversal |
Frontend Tests (New Files)
| File | Coverage |
|---|---|
| NavBar.test.tsx | Navigation links, active state |
| RAGDatabasePage.test.tsx | Document list, delete, upload |
| DocumentList.test.tsx | Document cards, expand/collapse |
| ChunkList.test.tsx | Chunk table, page numbers, PDF links |
Acceptance Tests
| File | Coverage |
|---|---|
| test_acceptance_phase1_documents_crud.py | Real ChromaDB CRUD with list, delete |
| test_acceptance_phase1_page_chunking.py | Real PDF upload → page-aware chunks → chunk PDFs exist |
| test_acceptance_phase1_chunk_links.py | Full flow: upload → query → response has clickable chunk links |