legco_ai_assistant/.plans/phase1_enhancement_plan.md

# Phase 1 Enhancement Plan
**Source**: User request (2026-04-23)
**Scope**: Frontend navigation + RAG Database management page + page-aware chunking with chunk PDFs
**Status**: ✅ Complete: all 3 features implemented (sub-phases 1.5.1 to 1.5.6)
---
## Objective
Enhance the existing Phase 1 application with three features:
1. **Navigation Bar** — Top nav bar with two pages: "LTT" (current query page) and "RAG Database"
2. **RAG Database Page** — View/manage ChromaDB documents (list, delete, upload)
3. **Page-Aware Chunking** — Chunks tagged with page numbers, saved as PDFs in `document_chunk/`, with clickable links in RAG responses
---
## Current State (Pre-Enhancement)
### What Exists
- Multi-page React app with react-router-dom routing
- Nav bar with "LTT" and "RAG Database" tabs
- "LTT" page at `/` with current query interface
- "RAG Database" placeholder page at `/rag-database`
- 2 API endpoints: `POST /api/v1/ingest`, `POST /api/v1/query`
- Flat chunking: PDF text extracted page-by-page but concatenated into one string before chunking
- Metadata per chunk: `filename`, `upload_date`, `content_summary`, `chunk_index`
- ChromaDB collection `documents` with UUID-based IDs (`{document_id}_{chunk_index}`)
- Frontend: pages in `pages/`, components in `components/`, TanStack Query, react-markdown
### What's Missing (Gaps This Plan Fills)
- ~~No routing or multi-page support~~ ✅ Done in Feature 1
- ~~No way to view what's stored in ChromaDB~~ ✅ Backend CRUD done (sub-phase 1.5.2)
- ~~No way to delete documents or chunks~~ ✅ Backend CRUD done (sub-phase 1.5.2)
- ~~No page-level awareness in chunking (all pages flattened before token splitting)~~ ✅ Done in 1.5.4
- ~~No persistent chunk files (chunks only exist as ChromaDB document text)~~ ✅ Done in 1.5.5
- ~~No clickable links in RAG responses to view source chunks~~ ✅ Done in 1.5.6
- ~~Upload only via IngestPanel on the main query page~~ (IngestPanel stays on LTT, upload also coming to RAG DB page)
---
## Feature 1: Navigation Bar & Multi-Page Layout ✅ COMPLETE
**Completed**: 2026-04-23
### 1.1 Changes Required
**Frontend**:
- ~~Install `react-router-dom`~~ ✅
- ~~Create `frontend/src/components/NavBar.tsx` — top navigation bar~~ ✅
- ~~Create `frontend/src/pages/LTTPage.tsx` — move current App.tsx content here~~ ✅
- ~~Create `frontend/src/pages/RAGDatabasePage.tsx` — placeholder, fleshed out in Feature 2~~ ✅
- ~~Refactor `frontend/src/App.tsx` — Router + NavBar + route definitions~~ ✅
**Backend**: None
### 1.2 Nav Bar Design
```
┌─────────────────────────────────────────────────┐
│  🔍 LTT (active)  │  📚 RAG Database            │  ← top nav bar (fixed)
├─────────────────────────────────────────────────┤
│                                                 │
│       [Page content: LTT or RAG Database]       │
│                                                 │
└─────────────────────────────────────────────────┘
```
- Fixed top bar, full width
- Two tabs: "LTT" (current query page) and "RAG Database"
- Active tab highlighted
- "LTT" route: `/` (default)
- "RAG Database" route: `/rag-database`
### 1.3 Implementation Tasks
| Task | Description | Files | Status |
|------|-------------|-------|--------|
| Install react-router-dom | `npm install react-router-dom` | `package.json` | ✅ |
| Create NavBar component | Horizontal nav with two links | `frontend/src/components/NavBar.tsx` | ✅ |
| Create LTTPage | Extract current AppContent into page component | `frontend/src/pages/LTTPage.tsx` | ✅ |
| Create RAGDatabasePage | Placeholder page (scaffold for Feature 2) | `frontend/src/pages/RAGDatabasePage.tsx` | ✅ |
| Refactor App.tsx | BrowserRouter + Routes + NavBar wrapper | `frontend/src/App.tsx` | ✅ |
### 1.4 Acceptance Criteria
- [x] Nav bar visible at top of every page
- [x] Clicking "LTT" navigates to `/` and shows current query interface
- [x] Clicking "RAG Database" navigates to `/rag-database`
- [x] Current page highlighted in nav bar
- [x] All existing functionality preserved (query, ingest, response display)
- [x] Build passes, no TypeScript errors
- [x] 62/62 frontend tests pass
---
## Feature 2: RAG Database Management Page
### 2.1 Overview
A dedicated page to view and manage all documents/chunks stored in ChromaDB.
### 2.2 Backend Changes
**New API Endpoints**:
| Method | Path | Description |
|--------|------|-------------|
| `GET` | `/api/v1/documents` | List all documents with chunk counts |
| `GET` | `/api/v1/documents/{document_id}/chunks` | List all chunks for a document |
| `DELETE` | `/api/v1/documents/{document_id}` | Delete all chunks for a document |
| `DELETE` | `/api/v1/chunks/{chunk_id}` | Delete a single chunk |
**New/Modified Files**:
| File | Change |
|------|--------|
| `backend/app/routers/documents.py` | **NEW** — CRUD endpoints for documents/chunks |
| `backend/app/services/rag.py` | Add `list_documents()`, `list_chunks()`, `delete_document()`, `delete_chunk()` methods |
| `backend/app/models/documents.py` | **NEW** — Pydantic schemas for document/chunk listing |
| `backend/app/main.py` | Register `documents` router |
**New Pydantic Schemas** (`models/documents.py`):
```python
from typing import List

from pydantic import BaseModel


class DocumentInfo(BaseModel):
    document_id: str
    filename: str
    chunk_count: int
    upload_date: str


class ChunkInfo(BaseModel):
    chunk_id: str
    chunk_index: int
    content_summary: str
    page_number: int | None = None  # Added by Feature 3
    chunk_file_path: str | None = None  # Added by Feature 3


class DocumentListResponse(BaseModel):
    documents: List[DocumentInfo]
    total_documents: int
    total_chunks: int


class DeleteResponse(BaseModel):
    deleted: bool
    message: str
```
**Implementation Notes for ChromaDB Operations**:
- `list_documents()`: ChromaDB has no native "group by document" query, so call `collection.get(include=["metadatas"])`, then group the results by `filename` and extract `document_id` from the chunk IDs (format: `{document_id}_{chunk_index}`)
- `delete_document()`: Use `collection.delete(where={"filename": "..."})` or collect all chunk IDs matching the document_id prefix and call `collection.delete(ids=[...])`
- `delete_chunk()`: Use `collection.delete(ids=[chunk_id])`
- **Important**: When deleting a document, also clean up associated chunk PDF files from `document_chunk/` (Feature 3)
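The manual grouping described for `list_documents()` could look roughly like the sketch below. The function name `group_chunks_by_document` and the returned dictionary shape are illustrative assumptions, not the project's actual code; the inputs are assumed to be the `ids` and `metadatas` lists returned by `collection.get(include=["metadatas"])`.

```python
from typing import Any, Dict, List


def group_chunks_by_document(
    ids: List[str], metadatas: List[Dict[str, Any]]
) -> List[Dict[str, Any]]:
    """Group chunk IDs of the form '{document_id}_{chunk_index}' by document.

    Hypothetical helper: name and return shape are assumptions for illustration.
    """
    grouped: Dict[str, Dict[str, Any]] = {}
    for chunk_id, meta in zip(ids, metadatas):
        meta = meta or {}
        # Split from the right so underscores inside document_id are preserved
        document_id, _chunk_index = chunk_id.rsplit("_", 1)
        entry = grouped.setdefault(
            document_id,
            {
                "document_id": document_id,
                "filename": meta.get("filename", ""),
                "upload_date": meta.get("upload_date", ""),
                "chunk_count": 0,
            },
        )
        entry["chunk_count"] += 1
    return list(grouped.values())
```

Each resulting dictionary carries the same fields as `DocumentInfo`, so it could be passed to that schema directly; `total_chunks` is then simply `len(ids)`.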
### 2.3 Frontend Changes
**New/Modified Files**:
| File | Change |
|------|--------|
| `frontend/src/pages/RAGDatabasePage.tsx` | Full implementation |
| `frontend/src/components/DocumentList.tsx` | **NEW** — Document table/cards |
| `frontend/src/components/ChunkList.tsx` | **NEW** — Chunk table for selected document |
| `frontend/src/components/DocumentUpload.tsx` | **NEW** — Upload form (can reuse IngestPanel logic) |
| `frontend/src/lib/api.ts` | Add `listDocuments()`, `deleteDocument()`, `deleteChunk()` |
| `frontend/src/lib/queries.tsx` | Add TanStack Query hooks for new endpoints |
| `frontend/src/types/index.ts` | Add `DocumentInfo`, `ChunkInfo`, `DeleteResponse` types |
### 2.4 Page Layout
```
┌────────────────────────────────────────────────────┐
│ RAG Database                              [Upload] │
├────────────────────────────────────────────────────┤
│ Total: 5 documents, 342 chunks                     │
│                                                    │
│ ┌────────────────────────────────────────────────┐ │
│ │ 📄 NEC4 ACC.pdf │ 101 chunks │ 2026-04-23      │ │
│ │    [View Chunks] [Delete]                      │ │
│ ├────────────────────────────────────────────────┤ │
│ │ 📄 meeting_notes.docx │ 45 chunks │ 2026-04-22 │ │
│ │    [View Chunks] [Delete]                      │ │
│ ├────────────────────────────────────────────────┤ │
│ │ 📄 budget_report.txt │ 28 chunks │ 2026-04-21  │ │
│ │    [View Chunks] [Delete]                      │ │
│ └────────────────────────────────────────────────┘ │
│                                                    │
│ [Expanded chunk view when "View Chunks" clicked]   │
│ ┌────────────────────────────────────────────────┐ │
│ │ Chunk 0 │ p.3 │ "Discussion of budget..."      │ │
│ │         │ [View PDF] [Delete Chunk]            │ │
│ ├────────────────────────────────────────────────┤ │
│ │ Chunk 1 │ p.4 │ "Allocation for Q4..."         │ │
│ │         │ [View PDF] [Delete Chunk]            │ │
│ └────────────────────────────────────────────────┘ │
└────────────────────────────────────────────────────┘
```
### 2.5 Acceptance Criteria
- [x] `GET /api/v1/documents` returns all documents with chunk counts
- [x] `DELETE /api/v1/documents/{document_id}` removes all chunks from ChromaDB + associated chunk PDFs
- [x] `DELETE /api/v1/chunks/{chunk_id}` removes a single chunk
- [x] RAG Database page shows all documents with chunk counts
- [x] User can expand a document to see its chunks
- [x] User can delete a document (with confirmation)
- [x] User can delete individual chunks (with confirmation)
- [x] User can upload documents from this page
- [x] Stats displayed: total documents, total chunks
- [x] Uploading a file with existing filename triggers automatic replacement (old data deleted first)
---
## Feature 3: Page-Aware Chunking & Chunk PDF Storage
### 3.1 Overview
When a document is uploaded:
1. Parse it page-by-page (PDF) or section-by-section (DOCX)
2. Each chunk is tagged with its source page number
3. Each chunk's source page is saved as a PDF in `document_chunk/`
4. RAG responses include clickable links to the chunk PDF
### 3.2 Backend Changes
#### 3.2.1 Page-Aware PDF Parsing
**Current**: `parse_pdf()` concatenates all pages into one string, losing page boundaries.
**New**: `parse_pdf_by_page()` returns `List[Tuple[int, str]]` — list of (page_number, page_text) tuples.
**Modified Files**:
| File | Change |
|------|--------|
| `backend/app/utils/pdf_parser.py` | Add `parse_pdf_by_page()` function |
```python
from typing import List, Tuple

from pypdf import PdfReader


def parse_pdf_by_page(file_path: str) -> List[Tuple[int, str]]:
    """Parse PDF and return per-page text with page numbers (1-indexed)."""
    reader = PdfReader(file_path)
    pages = []
    for i, page in enumerate(reader.pages, start=1):
        text = page.extract_text()
        # Skip pages with no extractable text (e.g. blank or image-only pages)
        if text and text.strip():
            pages.append((i, text.strip()))
    return pages
```
**DOCX Note**: DOCX files don't have true page numbers. For DOCX, we can use paragraph-based indexing or skip page tracking. Suggested approach: chunk DOCX normally, set `page_number = None` in metadata.
#### 3.2.2 Page-Aware Chunking
**Current**: `TokenChunkingStrategy.chunk(text)` takes a flat string and splits by tokens.
**New**: Page-as-chunk-unit with overlap context from adjacent pages.
**Chunking Algorithm (confirmed)**:
```
For page N (1-indexed):
    overlap_before = last 200 tokens of page N-1 text (or empty if page 1)
    overlap_after  = first 200 tokens of page N+1 text (or empty if last page)
    chunk_text     = overlap_before + page_N_text + overlap_after
```
- One chunk per page — **never split** a page even if it exceeds 1000 tokens
- Overlap provides surrounding context for better embedding/retrieval
- The `page_number` metadata always refers to the main page (N), not the overlap pages
**Modified Files**:
| File | Change |
|------|--------|
| `backend/app/utils/chunking.py` | Add `chunk_pages()` method to `TokenChunkingStrategy` |
```python
# Method of TokenChunkingStrategy; needs `from typing import List, Tuple`
def chunk_pages(
    self, pages: List[Tuple[int, str]], overlap_tokens: int = 200
) -> List[Tuple[str, int]]:
    """Chunk page-segmented text with overlap from adjacent pages.

    For each page, creates one chunk containing:
    [last overlap_tokens of previous page] + [full current page] + [first overlap_tokens of next page]

    Args:
        pages: List of (page_number, page_text) tuples. 1-indexed.
        overlap_tokens: Number of tokens to include from adjacent pages.

    Returns:
        List of (chunk_text, page_number) tuples. One chunk per page.
    """
    if not pages:
        return []

    # Tokenize all pages upfront
    tokenized = []
    for page_num, page_text in pages:
        tokens = self._encoding.encode(page_text)
        tokenized.append((page_num, tokens, page_text))

    chunks = []
    for i, (page_num, _tokens, page_text) in enumerate(tokenized):
        parts = []

        # Overlap from previous page (last N tokens).
        # The > 0 guard avoids tokens[-0:], which would return the whole page.
        if i > 0 and overlap_tokens > 0:
            overlap = tokenized[i - 1][1][-overlap_tokens:]
            if overlap:
                parts.append(self._encoding.decode(overlap))

        # Full current page text (use original text, not re-decoded)
        parts.append(page_text)

        # Overlap from next page (first N tokens)
        if i < len(tokenized) - 1 and overlap_tokens > 0:
            overlap = tokenized[i + 1][1][:overlap_tokens]
            if overlap:
                parts.append(self._encoding.decode(overlap))

        chunks.append(("\n".join(parts), page_num))

    return chunks
```
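To make the overlap behaviour easier to picture, here is a toy run of the same idea. It is a simplified stand-in, not the project's method: `chunk_pages_toy` is a hypothetical name, and whitespace-separated words are used in place of real tokens so the example has no tokenizer dependency.

```python
from typing import List, Tuple


def chunk_pages_toy(
    pages: List[Tuple[int, str]], overlap_tokens: int = 2
) -> List[Tuple[str, int]]:
    """One chunk per page: tail of previous page + page + head of next page.

    Assumption for the demo: words split on whitespace stand in for tokens.
    """
    words = [text.split() for _, text in pages]
    chunks = []
    for i, (page_num, text) in enumerate(pages):
        parts = []
        if i > 0 and overlap_tokens > 0:
            tail = words[i - 1][-overlap_tokens:]
            if tail:
                parts.append(" ".join(tail))
        parts.append(text)
        if i < len(pages) - 1 and overlap_tokens > 0:
            head = words[i + 1][:overlap_tokens]
            if head:
                parts.append(" ".join(head))
        chunks.append(("\n".join(parts), page_num))
    return chunks


pages = [(1, "a b c d"), (2, "e f g h"), (3, "i j k l")]
for chunk_text, page_num in chunk_pages_toy(pages):
    print(page_num, repr(chunk_text))
# 1 'a b c d\ne f'
# 2 'c d\ne f g h\ni j'
# 3 'g h\ni j k l'
```

The first and last pages receive overlap on one side only, and the page number attached to each chunk is always that of the middle (main) page.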
#### 3.2.3 Chunk PDF Generation & Storage
**New directory**: `document_chunk/` (at project root, alongside `chroma_db/`)
**Naming convention**: `{original_filename_without_ext}_page_{page_number}.pdf`
Example: `NEC4 ACC_page_3.pdf`
**One file per page** — multiple chunks never exist for a page (decision: never split a page), so deduplication is not needed.
**Content**: The actual page extracted from the source PDF — preserves original formatting, layout, tables, images. Not a generated text PDF.
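A minimal sketch of the naming convention, assuming a hypothetical helper called `chunk_pdf_filename` (the name is not taken from the codebase):

```python
from pathlib import Path


def chunk_pdf_filename(original_filename: str, page_number: int) -> str:
    """Build '{original_filename_without_ext}_page_{page_number}.pdf'."""
    # Path.stem drops only the final extension, so spaces in the name are kept
    stem = Path(original_filename).stem
    return f"{stem}_page_{page_number}.pdf"


print(chunk_pdf_filename("NEC4 ACC.pdf", 3))  # NEC4 ACC_page_3.pdf
```

Because the name is derived from the filename alone, two different uploads with the same filename map to the same chunk files, which fits the same-filename replacement rule used during ingestion.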
**Modified/New Files**:
| File | Change |
|------|--------|
| `backend/app/utils/pdf_extractor.py` | **NEW** — Extract and save individual PDF pages |
| `backend/app/core/config.py` | Add `DOCUMENT_CHUNK_PATH` setting (default: `./document_chunk`) |
```python
# pdf_extractor.py
from pypdf import PdfReader, PdfWriter


def extract_page_as_pdf(source_path: str, page_number: int, output_path: str) -> str:
    """Extract a single page from a PDF and save as a new PDF file.

    Args:
        source_path: Path to original PDF
        page_number: 1-indexed page number
        output_path: Where to save the extracted page PDF

    Returns:
        The output_path of the saved PDF
    """
    reader = PdfReader(source_path)
    writer = PdfWriter()
    writer.add_page(reader.pages[page_number - 1])  # 0-indexed in reader
    with open(output_path, "wb") as f:
        writer.write(f)
    return output_path
```
**Note**: For DOCX files, chunk PDF generation is skipped (set `chunk_file_path = None` in metadata). Only PDFs support page extraction.
#### 3.2.4 Enhanced Metadata
**Current metadata**:
```python
{
    "filename": "report.pdf",
    "upload_date": "2026-04-23T...",
    "content_summary": "First 200 chars...",
    "chunk_index": 0,
}
```
**Enhanced metadata**:
```python
{
    "filename": "report.pdf",
    "upload_date": "2026-04-23T...",
    "content_summary": "First 200 chars...",
    "chunk_index": 0,
    "page_number": 3,  # NEW
    "chunk_file_path": "report_page_3.pdf",  # NEW (relative path)
    "document_id": "uuid-string",  # NEW (for grouping)
}
```
**Modified Files**:
| File | Change |
|------|--------|
| `backend/app/utils/metadata.py` | Add `page_number`, `chunk_file_path`, `document_id` to metadata |
| `backend/app/models/common.py` | Add new fields to `SourceMetadata` |
#### 3.2.5 Chunk File Serving Endpoint
**New endpoint**:
| Method | Path | Description |
|--------|------|-------------|
| `GET` | `/api/v1/chunks/{file_path}/pdf` | Serve chunk PDF file |
**Modified Files**:
| File | Change |
|------|--------|
| `backend/app/routers/documents.py` | Add `GET /chunks/{file_path}/pdf` endpoint |
```python
@router.get("/chunks/{file_path}/pdf")
async def get_chunk_pdf(file_path: str):
    """Serve a chunk PDF file from document_chunk/ directory."""
    # Validate path to prevent directory traversal
    # Return FileResponse from DOCUMENT_CHUNK_PATH / file_path
```
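The path validation step in the stub above could be sketched as follows. `resolve_chunk_pdf` is a hypothetical helper, and the exact rules (reject absolute paths, reject anything that resolves outside the base directory, require a `.pdf` suffix) are one reasonable reading of the plan rather than the shipped implementation.

```python
from pathlib import Path
from typing import Optional


def resolve_chunk_pdf(base_dir: str, file_path: str) -> Optional[Path]:
    """Return the resolved path if it is an existing PDF inside base_dir, else None.

    Hypothetical helper; name and rules are assumptions for illustration.
    """
    if Path(file_path).is_absolute():
        return None
    base = Path(base_dir).resolve()
    candidate = (base / file_path).resolve()
    # After resolving ".." segments and symlinks, the file must still live
    # somewhere under base_dir
    if base not in candidate.parents:
        return None
    if candidate.suffix.lower() != ".pdf" or not candidate.is_file():
        return None
    return candidate
```

In the endpoint, a `None` result would typically become a 404 via FastAPI's `HTTPException`, and a valid path would be returned with `FileResponse(path, media_type="application/pdf")`.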
#### 3.2.6 Ingestion Pipeline Refactor
The entire ingestion flow needs to be updated:
**Current flow**:
```
Upload → parse_pdf() → flat text → chunk() → metadata → store in ChromaDB
```
**New flow**:
```
Upload → check if filename exists → YES: delete old chunks + chunk PDFs (full replacement)
→ parse_pdf_by_page() → per-page text
→ chunk_pages() with 200-token overlap from adjacent pages
→ for each page: extract page as PDF → save to document_chunk/
→ enhanced metadata (page_number, chunk_file_path, document_id)
→ store in ChromaDB
```
**Same-filename replacement** (confirmed):
- On upload, query ChromaDB for existing chunks with matching `filename`
- If found: delete old chunk IDs from collection, delete old PDFs from `document_chunk/`
- Create new `document_id`, ingest fresh
- This ensures clean replacement without orphaned data
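The replacement steps above might be implemented along these lines. This is a sketch under assumptions: `delete_existing_document` is a hypothetical name, and `collection` is assumed to behave like a ChromaDB collection, using only its `get(where=..., include=[...])` and `delete(ids=[...])` methods.

```python
from pathlib import Path
from typing import Any, List


def delete_existing_document(collection: Any, filename: str, chunk_dir: str) -> int:
    """Remove all chunks stored under `filename` plus their chunk PDFs.

    Hypothetical helper. Returns the number of chunks deleted (0 if none exist).
    """
    existing = collection.get(where={"filename": filename}, include=["metadatas"])
    ids: List[str] = existing["ids"]
    if not ids:
        return 0

    # Delete from the vector store first, then remove the files, so a failed
    # store deletion never leaves chunks pointing at missing PDFs
    collection.delete(ids=ids)

    base = Path(chunk_dir)
    for meta in existing["metadatas"]:
        rel = (meta or {}).get("chunk_file_path")
        if rel:  # absent or empty for DOCX chunks
            (base / rel).unlink(missing_ok=True)
    return len(ids)
```

The ingest route would call this before generating a fresh `document_id`; a return value of 0 means the upload is a new document rather than a replacement.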
**Modified Files**:
| File | Change |
|------|--------|
| `backend/app/routers/ingest.py` | Refactor: page-aware parsing, chunk PDF generation, enhanced metadata, same-filename replacement |
### 3.3 Frontend Changes
**Modified Files**:
| File | Change |
|------|--------|
| `frontend/src/types/index.ts` | Add `page_number`, `chunk_file_path` to `SourceMetadata` |
| `frontend/src/components/ResponsePanel.tsx` | Render `chunk_file_path` as clickable link in sources |
| `frontend/src/components/ChunkList.tsx` | Show page number, link to chunk PDF |
**Source Card Update**:
Current source card shows: `filename`, `upload_date`, `content_summary`, `chunk_index`
Enhanced source card adds: `page_number` (e.g., "Page 3"), clickable "View Source" link opening chunk PDF
### 3.4 Directory Structure After Enhancement
```
legco_reranker/
├── app/
│ ├── backend/...
│ ├── frontend/...
│ └── chroma_db/ # Existing
├── document_chunk/ # NEW — chunk PDF files
│ ├── NEC4 ACC_page_1.pdf
│ ├── NEC4 ACC_page_2.pdf
│ ├── NEC4 ACC_page_3.pdf
│ └── meeting_notes_page_5.pdf
├── .plans/
└── ...
```
### 3.5 Acceptance Criteria
- [x] PDF uploads produce page-aware chunks: 1 chunk per page with 200-token overlap from adjacent pages
- [x] Each page is saved as a separate PDF (original page, not generated text) in `document_chunk/`
- [x] Chunk PDF filename follows convention: `{filename}_page_{n}.pdf`
- [x] Page numbers are sequential index (1, 2, 3...), not PDF internal labels
- [x] Oversized pages are kept as single chunks (never split)
- [x] `GET /api/v1/chunks/{file_path}/pdf` serves the original chunk PDF
- [x] RAG response sources include `page_number` and `chunk_file_path`
- [x] Frontend source cards show page number and clickable link
- [x] Clicking source link opens/downloads the original chunk PDF
- [x] DOCX uploads work without page numbers (graceful degradation, no chunk PDFs)
- [x] Uploading a file with same filename replaces existing document (old chunks + PDFs deleted, new document_id)
- [x] `document_chunk/` is `.gitignore`d
- [x] Deleting a document also removes its chunk PDFs from `document_chunk/`
---
## Implementation Sequence
The three features have dependencies. Recommended order:
```
Feature 1 (Nav + Routing)        ← No backend changes, enables Feature 2
Feature 2 (RAG Database Page)    ← Needs Feature 1 for page routing
│                                  But backend CRUD endpoints are independent
Feature 3 (Page-Aware Chunking)  ← Modifies ingestion pipeline
                                   Enhances Feature 2 (chunk file links)
                                   Enhances ResponsePanel (clickable sources)
```
### Sub-Phase Breakdown
| Sub-Phase | Feature | Scope | Backend | Frontend | Status |
|-----------|---------|-------|---------|----------|--------|
| 1.5.1 | 1 | Nav bar + routing + page scaffold | None | NavBar, LTTPage, RAGDatabasePage, App.tsx refactor | ✅ Complete |
| 1.5.2 | 2 | Backend CRUD for documents/chunks | documents router, RAGService methods, schemas | None | ✅ Complete |
| 1.5.3 | 2 | Frontend RAG Database page | None | RAGDatabasePage, DocumentList, ChunkList, DocumentUpload, API hooks | ✅ Complete |
| 1.5.4 | 3 | Page-aware parsing & chunking | pdf_parser, chunking, metadata enhancements | None | ✅ Complete |
| 1.5.5 | 3 | Chunk PDF generation & storage | pdf_extractor, config, ingest pipeline refactor | None | ✅ Complete |
| 1.5.6 | 3 | Chunk file serving + frontend links | documents router endpoint | ResponsePanel clickable links, ChunkList updates | ✅ Complete |
### Parallelization Opportunities
- **1.5.1 and 1.5.2 can run in parallel** — Frontend routing changes and backend CRUD are independent
- **1.5.3 blocked by 1.5.1 + 1.5.2** — Needs both routing and backend endpoints
- **1.5.4 and 1.5.5 are sequential** — 1.5.5 depends on 1.5.4's page-aware parsing
- **1.5.6 blocked by 1.5.3 + 1.5.5** — Needs both frontend page and backend chunk serving
```
1.5.1 (Nav+Routing)  ─┐
                      ├─► 1.5.3 (RAG DB Page) ─┐
1.5.2 (Backend CRUD) ─┘                        │
                                               ├─► 1.5.6 (Links)
1.5.4 (Page-Aware) ──► 1.5.5 (Chunk PDFs) ─────┘
```
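The same dependency graph can be checked mechanically. The sketch below is an illustration only: it encodes the blocking relationships listed above and uses the standard library's `graphlib.TopologicalSorter` (Python 3.9+) to group sub-phases into waves that can run in parallel. The names `DEPENDENCIES` and `parallel_waves` are made up for this example.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Each sub-phase maps to the sub-phases it is blocked by
DEPENDENCIES: Dict[str, Set[str]] = {
    "1.5.1": set(),
    "1.5.2": set(),
    "1.5.3": {"1.5.1", "1.5.2"},
    "1.5.4": set(),
    "1.5.5": {"1.5.4"},
    "1.5.6": {"1.5.3", "1.5.5"},
}


def parallel_waves(deps: Dict[str, Set[str]]) -> List[List[str]]:
    """Group nodes into waves; every node in a wave has all blockers done."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the plan has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


print(parallel_waves(DEPENDENCIES))
# [['1.5.1', '1.5.2', '1.5.4'], ['1.5.3', '1.5.5'], ['1.5.6']]
```

The first wave shows that 1.5.4 could also start alongside 1.5.1 and 1.5.2, since it has no blockers of its own.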
---
## New Dependencies
### Backend
| Package | Purpose | Already installed? |
|---------|---------|--------------------|
| (none) | pypdf already supports page extraction | ✅ |
### Frontend
| Package | Purpose | Already installed? |
|---------|---------|--------------------|
| `react-router-dom` | Client-side routing | ✅ Installed |
---
## .gitignore Updates
```gitignore
# Chunk PDF storage
document_chunk/
```
---
## Risks & Mitigations
| Risk | Impact | Mitigation |
|------|--------|------------|
| ~~ChromaDB has no native "group by document" query~~ | ~~`list_documents()` needs manual grouping from all metadata~~ | ✅ Resolved: Groups by document_id extracted from chunk IDs via `rsplit("_", 1)` |
| Large PDFs → many chunk PDF files | Disk usage grows | One PDF per page. Each page yields exactly one chunk, so no file is duplicated. |
| Chunk includes overlap text from adjacent pages | Ambiguous page assignment | Tag chunk with its main page (N) only, not the overlap pages. Note in UI. |
| DOCX has no page numbers | `page_number` is None for DOCX chunks | Graceful degradation — show "N/A" or hide page info for DOCX. |
| Deleting documents must clean up chunk files | Orphan files if deletion fails | Delete files after successful ChromaDB deletion. Log failures for manual cleanup. |
| Path traversal in chunk PDF endpoint | Security risk | Validate `file_path` doesn't contain `..` or absolute paths. Use whitelist of known files. |
| ChromaDB 1.5.8 requires `name()` on embedding functions | `_EmbeddingFunctionWrapper` crashes on `collection.get()` | ✅ Fixed: Added `name()` method returning `"custom_embedding_wrapper"` |
| Existing ChromaDB data corrupted (HNSW segment error) | Endpoints return 500 against existing `chroma_db/` | Pre-existing issue. Works with fresh DB. May need `chroma_db` reset for production. |
---
## Decisions (Confirmed)
| # | Question | Decision |
|---|----------|----------|
| 1 | Chunk algorithm | **Page-as-chunk-unit**. Each chunk = `[last 200 tokens of prev page] + [full current page text] + [first 200 tokens of next page]`. One chunk per page — never split a page even if oversized. |
| 2 | DOCX chunk PDFs | **No**. Only PDFs get chunk PDFs. DOCX chunks show text preview only. |
| 3 | IngestPanel placement | **Keep on LTT page** + also add upload on RAG Database page. |
| 4 | Re-ingestion / same filename | **Full replacement**. Delete old chunks + old chunk PDFs + create new `document_id`. |
| 5 | Chunk PDF content | **Original page from source PDF**. Extract actual page — preserves formatting, tables, images. |
| 6 | Page numbering | **Sequential index** (1, 2, 3...). Not PDF internal labels. |
| 7 | Oversized pages | **Never split**. One chunk per page regardless of token count. |
## Open Questions
None — all resolved.
---
## Test Plan
### Backend Tests (New Files)
| File | Coverage |
|------|----------|
| `test_phase1_documents_router.py` | ✅ GET /documents, DELETE /documents/{id}, DELETE /chunks/{id} (8 tests, all pass) |
| `test_phase1_pdf_parser_pages.py` | parse_pdf_by_page() — multi-page PDFs, single-page, empty |
| `test_phase1_page_aware_chunking.py` | chunk_pages() — chunks with adjacent-page overlap, single-page documents |
| `test_phase1_pdf_extractor.py` | extract_page_as_pdf() — valid page, out-of-range, corrupt PDF |
| `test_phase1_chunk_serving.py` | ✅ GET /chunks/{path}/pdf — valid file, missing file, path traversal, filenames with spaces (5 tests, all pass) |
### Frontend Tests (New Files)
| File | Coverage |
|------|----------|
| `NavBar.test.tsx` | Navigation links, active state |
| `RAGDatabasePage.test.tsx` | Document list, delete, upload |
| `DocumentList.test.tsx` | Document cards, expand/collapse |
| `ChunkList.test.tsx` | ✅ Chunk table, page numbers, PDF links (7 tests, all pass) |
### Acceptance Tests
| File | Coverage |
|------|----------|
| `test_acceptance_phase1_documents_crud.py` | Real ChromaDB CRUD with list, delete |
| `test_acceptance_phase1_page_chunking.py` | Real PDF upload → page-aware chunks → chunk PDFs exist |
| `test_acceptance_phase1_chunk_links.py` | Full flow: upload → query → response has clickable chunk links |