Phase 1 Enhancement Plan

Source: User request (2026-04-23)
Scope: Frontend navigation + RAG Database management page + page-aware chunking with chunk PDFs
Status: 🔄 In Progress (Features 1-2 complete; Feature 3 in progress, sub-phase 1.5.6 pending)


Objective

Enhance the existing Phase 1 application with three features:

  1. Navigation Bar — Top nav bar with two pages: "LTT" (current query page) and "RAG Database"
  2. RAG Database Page — View/manage ChromaDB documents (list, delete, upload)
  3. Page-Aware Chunking — Chunks tagged with page numbers, saved as PDFs in document_chunk/, with clickable links in RAG responses

Current State (Pre-Enhancement)

What Exists

  • Multi-page React app with react-router-dom routing
  • Nav bar with "LTT" and "RAG Database" tabs
  • "LTT" page at / with current query interface
  • "RAG Database" placeholder page at /rag-database
  • 2 API endpoints: POST /api/v1/ingest, POST /api/v1/query
  • Flat chunking: PDF text extracted page-by-page but concatenated into one string before chunking
  • Metadata per chunk: filename, upload_date, content_summary, chunk_index
  • ChromaDB collection documents with UUID-based IDs ({document_id}_{chunk_index})
  • Frontend: pages in pages/, components in components/, TanStack Query, react-markdown

What's Missing (Gaps This Plan Fills)

  • No routing or multi-page support (done in Feature 1)
  • No way to view what's stored in ChromaDB (backend CRUD done in sub-phase 1.5.2)
  • No way to delete documents or chunks (backend CRUD done in sub-phase 1.5.2)
  • No page-level awareness in chunking (all pages flattened before token splitting)
  • No persistent chunk files (chunks only exist as ChromaDB document text)
  • No clickable links in RAG responses to view source chunks
  • Upload only via IngestPanel on the main query page (IngestPanel stays on LTT, upload also coming to RAG DB page)

Feature 1: Navigation Bar & Multi-Page Layout COMPLETE

Completed: 2026-04-23

1.1 Changes Required

Frontend:

  • Install react-router-dom
  • Create frontend/src/components/NavBar.tsx — top navigation bar
  • Create frontend/src/pages/LTTPage.tsx — move current App.tsx content here
  • Create frontend/src/pages/RAGDatabasePage.tsx — placeholder, fleshed out in Feature 2
  • Refactor frontend/src/App.tsx — Router + NavBar + route definitions

Backend: None

1.2 Nav Bar Design

┌─────────────────────────────────────────────────┐
│  🔍 LTT (active)  │  📚 RAG Database           │  ← top nav bar (fixed)
├─────────────────────────────────────────────────┤
│                                                  │
│  [Page content: LTT or RAG Database]             │
│                                                  │
└─────────────────────────────────────────────────┘
  • Fixed top bar, full width
  • Two tabs: "LTT" (current query page) and "RAG Database"
  • Active tab highlighted
  • "LTT" route: / (default)
  • "RAG Database" route: /rag-database

1.3 Implementation Tasks

| Task | Description | Files | Status |
|---|---|---|---|
| Install react-router-dom | npm install react-router-dom | package.json | Complete |
| Create NavBar component | Horizontal nav with two links | frontend/src/components/NavBar.tsx | Complete |
| Create LTTPage | Extract current AppContent into page component | frontend/src/pages/LTTPage.tsx | Complete |
| Create RAGDatabasePage | Placeholder page (scaffold for Feature 2) | frontend/src/pages/RAGDatabasePage.tsx | Complete |
| Refactor App.tsx | BrowserRouter + Routes + NavBar wrapper | frontend/src/App.tsx | Complete |

1.4 Acceptance Criteria

  • Nav bar visible at top of every page
  • Clicking "LTT" navigates to / and shows current query interface
  • Clicking "RAG Database" navigates to /rag-database
  • Current page highlighted in nav bar
  • All existing functionality preserved (query, ingest, response display)
  • Build passes, no TypeScript errors
  • 62/62 frontend tests pass

Feature 2: RAG Database Management Page

2.1 Overview

A dedicated page to view and manage all documents/chunks stored in ChromaDB.

2.2 Backend Changes

New API Endpoints:

| Method | Path | Description |
|---|---|---|
| GET | /api/v1/documents | List all documents with chunk counts |
| GET | /api/v1/documents/{document_id}/chunks | List all chunks for a document |
| DELETE | /api/v1/documents/{document_id} | Delete all chunks for a document |
| DELETE | /api/v1/chunks/{chunk_id} | Delete a single chunk |

New/Modified Files:

| File | Change |
|---|---|
| backend/app/routers/documents.py | NEW: CRUD endpoints for documents/chunks |
| backend/app/services/rag.py | Add list_documents(), list_chunks(), delete_document(), delete_chunk() methods |
| backend/app/models/documents.py | NEW: Pydantic schemas for document/chunk listing |
| backend/app/main.py | Register documents router |

New Pydantic Schemas (models/documents.py):

from typing import List

from pydantic import BaseModel


class DocumentInfo(BaseModel):
    document_id: str
    filename: str
    chunk_count: int
    upload_date: str

class ChunkInfo(BaseModel):
    chunk_id: str
    chunk_index: int
    content_summary: str
    page_number: int | None = None       # Added by Feature 3
    chunk_file_path: str | None = None   # Added by Feature 3

class DocumentListResponse(BaseModel):
    documents: List[DocumentInfo]
    total_documents: int
    total_chunks: int

class DeleteResponse(BaseModel):
    deleted: bool
    message: str

Implementation Notes for ChromaDB Operations:

  • list_documents(): ChromaDB has no native "group by document" query. Call collection.get(include=["metadatas"]), then group the chunks by the document_id extracted from each chunk ID (format: {document_id}_{chunk_index}) and read filename from the chunk metadata
  • delete_document(): Use collection.delete(where={"filename": "..."}) or collect all chunk IDs matching the document_id prefix and call collection.delete(ids=[...])
  • delete_chunk(): Use collection.delete(ids=[chunk_id])
  • Important: When deleting a document, also clean up associated chunk PDF files from document_chunk/ (Feature 3)
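
The grouping step for list_documents() can be sketched as a pure function. This is a minimal illustration, not the project's actual code: the helper name group_chunks_by_document is hypothetical, and it assumes ids and metadatas are the parallel lists found under the "ids" and "metadatas" keys of the collection.get() result.

```python
from typing import Dict, List


def group_chunks_by_document(ids: List[str], metadatas: List[dict]) -> List[dict]:
    """Group chunk IDs of the form '{document_id}_{chunk_index}' by document.

    Hypothetical helper: `ids` and `metadatas` are assumed to be parallel lists.
    """
    grouped: Dict[str, dict] = {}
    for chunk_id, meta in zip(ids, metadatas):
        # Split from the right so underscores inside document_id are preserved.
        document_id, _chunk_index = chunk_id.rsplit("_", 1)
        entry = grouped.setdefault(
            document_id,
            {
                "document_id": document_id,
                "filename": meta.get("filename", ""),
                "upload_date": meta.get("upload_date", ""),
                "chunk_count": 0,
            },
        )
        entry["chunk_count"] += 1
    # Dicts keep insertion order, so documents appear in first-seen order.
    return list(grouped.values())
```

Each returned dict carries the four fields of DocumentInfo, so total_documents is the length of the list and total_chunks is the sum of the chunk_count values.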

2.3 Frontend Changes

New/Modified Files:

| File | Change |
|---|---|
| frontend/src/pages/RAGDatabasePage.tsx | Full implementation |
| frontend/src/components/DocumentList.tsx | NEW: Document table/cards |
| frontend/src/components/ChunkList.tsx | NEW: Chunk table for selected document |
| frontend/src/components/DocumentUpload.tsx | NEW: Upload form (can reuse IngestPanel logic) |
| frontend/src/lib/api.ts | Add listDocuments(), deleteDocument(), deleteChunk() |
| frontend/src/lib/queries.tsx | Add TanStack Query hooks for new endpoints |
| frontend/src/types/index.ts | Add DocumentInfo, ChunkInfo, DeleteResponse types |

2.4 Page Layout

┌──────────────────────────────────────────────────┐
│  RAG Database                         [Upload]   │
├──────────────────────────────────────────────────┤
│  Total: 5 documents, 342 chunks                  │
│                                                   │
│  ┌──────────────────────────────────────────────┐ │
│  │ 📄 NEC4 ACC.pdf    │ 101 chunks │ 2026-04-23 │ │
│  │    [View Chunks]  [Delete]                    │ │
│  ├──────────────────────────────────────────────┤ │
│  │ 📄 meeting_notes.docx │ 45 chunks │ 2026-04-22│ │
│  │    [View Chunks]  [Delete]                    │ │
│  ├──────────────────────────────────────────────┤ │
│  │ 📄 budget_report.txt │ 28 chunks │ 2026-04-21│ │
│  │    [View Chunks]  [Delete]                    │ │
│  └──────────────────────────────────────────────┘ │
│                                                   │
│  [Expanded chunk view when "View Chunks" clicked] │
│  ┌──────────────────────────────────────────────┐ │
│  │ Chunk 0 │ p.3 │ "Discussion of budget..."     │ │
│  │          │ [View PDF]  [Delete Chunk]          │ │
│  ├──────────────────────────────────────────────┤ │
│  │ Chunk 1 │ p.4 │ "Allocation for Q4..."        │ │
│  │          │ [View PDF]  [Delete Chunk]          │ │
│  └──────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────┘

2.5 Acceptance Criteria

  • GET /api/v1/documents returns all documents with chunk counts
  • DELETE /api/v1/documents/{document_id} removes all chunks from ChromaDB + associated chunk PDFs
  • DELETE /api/v1/chunks/{chunk_id} removes a single chunk
  • RAG Database page shows all documents with chunk counts
  • User can expand a document to see its chunks
  • User can delete a document (with confirmation)
  • User can delete individual chunks (with confirmation)
  • User can upload documents from this page
  • Stats displayed: total documents, total chunks
  • Uploading a file with existing filename triggers automatic replacement (old data deleted first)

Feature 3: Page-Aware Chunking & Chunk PDF Storage

3.1 Overview

When a document is uploaded:

  1. Parse it page-by-page (PDF) or section-by-section (DOCX)
  2. Each chunk is tagged with its source page number
  3. Each chunk's source page is saved as a PDF in document_chunk/
  4. RAG responses include clickable links to the chunk PDF

3.2 Backend Changes

3.2.1 Page-Aware PDF Parsing

Current: parse_pdf() concatenates all pages into one string, losing page boundaries.

New: parse_pdf_by_page() returns List[Tuple[int, str]] — list of (page_number, page_text) tuples.

Modified Files:

| File | Change |
|---|---|
| backend/app/utils/pdf_parser.py | Add parse_pdf_by_page() function |

from typing import List, Tuple

from pypdf import PdfReader


def parse_pdf_by_page(file_path: str) -> List[Tuple[int, str]]:
    """Parse PDF and return per-page text with page numbers (1-indexed)."""
    reader = PdfReader(file_path)
    pages = []
    for i, page in enumerate(reader.pages, start=1):
        text = page.extract_text()
        if text and text.strip():
            pages.append((i, text.strip()))
    return pages

DOCX Note: DOCX files don't have true page numbers. For DOCX, we can use paragraph-based indexing or skip page tracking. Suggested approach: chunk DOCX normally, set page_number = None in metadata.

3.2.2 Page-Aware Chunking

Current: TokenChunkingStrategy.chunk(text) takes a flat string and splits by tokens.

New: Page-as-chunk-unit with overlap context from adjacent pages.

Chunking Algorithm (confirmed):

For page N (1-indexed):
  overlap_before = last 200 tokens of page N-1 text (or empty if page 1)
  overlap_after  = first 200 tokens of page N+1 text (or empty if last page)
  chunk_text = overlap_before + page_N_text + overlap_after
  • One chunk per page — never split a page even if it exceeds 1000 tokens
  • Overlap provides surrounding context for better embedding/retrieval
  • The page_number metadata always refers to the main page (N), not the overlap pages

Modified Files:

| File | Change |
|---|---|
| backend/app/utils/chunking.py | Add chunk_pages() method to TokenChunkingStrategy |

def chunk_pages(
    self, pages: List[Tuple[int, str]], overlap_tokens: int = 200
) -> List[Tuple[str, int]]:
    """Chunk page-segmented text with overlap from adjacent pages.

    For each page, creates one chunk containing:
      [last overlap_tokens of previous page] + [full current page] + [first overlap_tokens of next page]

    Args:
        pages: List of (page_number, page_text) tuples. 1-indexed.
        overlap_tokens: Number of tokens to include from adjacent pages.

    Returns:
        List of (chunk_text, page_number) tuples. One chunk per page.
    """
    if not pages:
        return []

    # Tokenize all pages upfront
    tokenized = []
    for page_num, page_text in pages:
        tokens = self._encoding.encode(page_text)
        tokenized.append((page_num, tokens, page_text))

    chunks = []
    for i, (page_num, tokens, _text) in enumerate(tokenized):
        parts = []

        # Overlap from previous page (last N tokens)
        if i > 0:
            prev_tokens = tokenized[i - 1][1]
            # Guard overlap_tokens == 0: prev_tokens[-0:] would return the whole page
            overlap = prev_tokens[-overlap_tokens:] if overlap_tokens > 0 else []
            if overlap:
                parts.append(self._encoding.decode(overlap))

        # Full current page text (use original text, not re-decoded)
        parts.append(pages[i][1])

        # Overlap from next page (first N tokens)
        if i < len(tokenized) - 1:
            next_tokens = tokenized[i + 1][1]
            overlap = next_tokens[:overlap_tokens]  # slicing already handles short pages
            if overlap:
                parts.append(self._encoding.decode(overlap))

        chunk_text = "\n".join(parts)
        chunks.append((chunk_text, page_num))

    return chunks

3.2.3 Chunk PDF Generation & Storage

New directory: document_chunk/ (at project root, alongside chroma_db/)

Naming convention: {original_filename_without_ext}_page_{page_number}.pdf

Example: NEC4 ACC_page_3.pdf

One file per page — multiple chunks never exist for a page (decision: never split a page), so deduplication is not needed.

Content: The actual page extracted from the source PDF — preserves original formatting, layout, tables, images. Not a generated text PDF.
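
The naming convention above can be captured in one small helper. This is a sketch only: the function name chunk_pdf_name is hypothetical, and it assumes the extension is stripped with pathlib's stem, which removes just the final suffix.

```python
from pathlib import Path


def chunk_pdf_name(original_filename: str, page_number: int) -> str:
    """Build '{original_filename_without_ext}_page_{page_number}.pdf'.

    Hypothetical helper; page_number is the 1-indexed sequential page index.
    """
    if page_number < 1:
        raise ValueError("page_number is 1-indexed and must be >= 1")
    # Path.stem drops any directory part and the final extension only.
    stem = Path(original_filename).stem
    return f"{stem}_page_{page_number}.pdf"
```

For the example above, chunk_pdf_name("NEC4 ACC.pdf", 3) yields "NEC4 ACC_page_3.pdf". Spaces in the stem are kept as-is, so any URL built from this name needs percent-encoding on the client side.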

Modified/New Files:

| File | Change |
|---|---|
| backend/app/utils/pdf_extractor.py | NEW: Extract and save individual PDF pages |
| backend/app/core/config.py | Add DOCUMENT_CHUNK_PATH setting (default: ./document_chunk) |

# pdf_extractor.py
from pypdf import PdfReader, PdfWriter

def extract_page_as_pdf(source_path: str, page_number: int, output_path: str) -> str:
    """Extract a single page from a PDF and save as a new PDF file.
    
    Args:
        source_path: Path to original PDF
        page_number: 1-indexed page number
        output_path: Where to save the extracted page PDF
    
    Returns:
        The output_path of the saved PDF
    """
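    reader = PdfReader(source_path)
    # Reject out-of-range pages explicitly; page_number=0 would otherwise
    # index -1 and silently return the last page.
    if not 1 <= page_number <= len(reader.pages):
        raise IndexError(f"page_number {page_number} out of range (1-{len(reader.pages)})")
    writer = PdfWriter()
    writer.add_page(reader.pages[page_number - 1])  # 0-indexed in reader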
    with open(output_path, "wb") as f:
        writer.write(f)
    return output_path

Note: For DOCX files, chunk PDF generation is skipped (set chunk_file_path = None in metadata). Only PDFs support page extraction.

3.2.4 Enhanced Metadata

Current metadata:

{
    "filename": "report.pdf",
    "upload_date": "2026-04-23T...",
    "content_summary": "First 200 chars...",
    "chunk_index": 0,
}

Enhanced metadata:

{
    "filename": "report.pdf",
    "upload_date": "2026-04-23T...",
    "content_summary": "First 200 chars...",
    "chunk_index": 0,
    "page_number": 3,                              # NEW
    "chunk_file_path": "report_page_3.pdf",        # NEW (relative path)
    "document_id": "uuid-string",                   # NEW (for grouping)
}
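
A builder for the enhanced metadata might look like the sketch below. The function name build_chunk_metadata and its signature are hypothetical. It also assumes that optional fields are better omitted than stored as None, because vector-store metadata values are typically restricted to non-null scalars; this should be checked against the installed ChromaDB version.

```python
from datetime import datetime, timezone
from typing import Optional


def build_chunk_metadata(
    filename: str,
    document_id: str,
    chunk_index: int,
    chunk_text: str,
    page_number: Optional[int] = None,
    chunk_file_path: Optional[str] = None,
) -> dict:
    """Assemble per-chunk metadata (hypothetical helper)."""
    metadata = {
        "filename": filename,
        "upload_date": datetime.now(timezone.utc).isoformat(),
        "content_summary": chunk_text[:200],  # first 200 characters
        "chunk_index": chunk_index,
        "document_id": document_id,
        "page_number": page_number,
        "chunk_file_path": chunk_file_path,
    }
    # Assumption: drop None values (the DOCX case) instead of storing nulls.
    return {key: value for key, value in metadata.items() if value is not None}
```

With this shape, a DOCX chunk simply lacks the page_number and chunk_file_path keys, and the API layer can map the missing keys back to None when building ChunkInfo.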

Modified Files:

| File | Change |
|---|---|
| backend/app/utils/metadata.py | Add page_number, chunk_file_path, document_id to metadata |
| backend/app/models/common.py | Add new fields to SourceMetadata |

3.2.5 Chunk File Serving Endpoint

New endpoint:

| Method | Path | Description |
|---|---|---|
| GET | /api/v1/chunks/{file_path}/pdf | Serve chunk PDF file |

Modified Files:

| File | Change |
|---|---|
| backend/app/routers/documents.py | Add GET /chunks/{file_path}/pdf endpoint |

@router.get("/chunks/{file_path}/pdf")
async def get_chunk_pdf(file_path: str):
    """Serve a chunk PDF file from document_chunk/ directory."""
    # Validate path to prevent directory traversal
    # Return FileResponse from DOCUMENT_CHUNK_PATH / file_path
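
The path validation in the stub above could be factored into a pure function, which also makes the path-traversal test case easy to write. This is a sketch under two assumptions: document_chunk/ is a flat directory (no subfolders), and the helper name resolve_chunk_path is hypothetical.

```python
from pathlib import Path


def resolve_chunk_path(base_dir: str, file_path: str) -> Path:
    """Resolve file_path inside base_dir, rejecting traversal attempts.

    Hypothetical helper. Raises ValueError for unsafe or non-PDF paths and
    FileNotFoundError when the file does not exist.
    """
    base = Path(base_dir).resolve()
    # An absolute file_path replaces base in the join; resolve() collapses "..".
    candidate = (base / file_path).resolve()
    # Flat directory assumption: the file must sit directly inside base.
    if candidate.parent != base:
        raise ValueError(f"unsafe chunk path: {file_path!r}")
    if candidate.suffix.lower() != ".pdf":
        raise ValueError(f"not a PDF: {file_path!r}")
    if not candidate.is_file():
        raise FileNotFoundError(file_path)
    return candidate
```

Inside get_chunk_pdf(), the handler could map ValueError to a 400 response and FileNotFoundError to a 404, then return FileResponse(path, media_type="application/pdf") for the resolved path.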

3.2.6 Ingestion Pipeline Refactor

The entire ingestion flow needs to be updated:

Current flow:

Upload → parse_pdf() → flat text → chunk() → metadata → store in ChromaDB

New flow:

Upload → check if filename exists → YES: delete old chunks + chunk PDFs (full replacement)
       → parse_pdf_by_page() → per-page text
       → chunk_pages() with 200-token overlap from adjacent pages
       → for each page: extract page as PDF → save to document_chunk/
       → enhanced metadata (page_number, chunk_file_path, document_id)
       → store in ChromaDB

Same-filename replacement (confirmed):

  • On upload, query ChromaDB for existing chunks with matching filename
  • If found: delete old chunk IDs from collection, delete old PDFs from document_chunk/
  • Create new document_id, ingest fresh
  • This ensures clean replacement without orphaned data
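
The replacement steps above can be sketched as follows. The function name delete_existing_document is hypothetical. The sketch assumes collection is a ChromaDB collection, using only its get(where=..., include=...) and delete(ids=...) calls, and that each chunk's metadata stores chunk_file_path relative to the chunk directory.

```python
from pathlib import Path
from typing import List


def delete_existing_document(collection, filename: str, chunk_dir: str) -> int:
    """Delete all chunks stored under `filename` plus their chunk PDFs.

    Hypothetical helper; returns the number of chunks removed (0 if none).
    """
    existing = collection.get(where={"filename": filename}, include=["metadatas"])
    ids: List[str] = existing["ids"]
    if not ids:
        return 0

    # Delete from the vector store first; if this raises, the files stay in place.
    collection.delete(ids=ids)

    for meta in existing["metadatas"]:
        rel_path = (meta or {}).get("chunk_file_path")
        if rel_path:
            # missing_ok: tolerate files that were already removed by hand.
            (Path(chunk_dir) / rel_path).unlink(missing_ok=True)
    return len(ids)
```

The ingest route would call this before generating a new document_id, so a re-upload always starts from a clean state. The same helper could back DELETE /api/v1/documents/{document_id} if the where filter is switched to the document_id metadata field.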

Modified Files:

| File | Change |
|---|---|
| backend/app/routers/ingest.py | Refactor: page-aware parsing, chunk PDF generation, enhanced metadata, same-filename replacement |

3.3 Frontend Changes

Modified Files:

| File | Change |
|---|---|
| frontend/src/types/index.ts | Add page_number, chunk_file_path to SourceMetadata |
| frontend/src/components/ResponsePanel.tsx | Render chunk_file_path as clickable link in sources |
| frontend/src/components/ChunkList.tsx | Show page number, link to chunk PDF |

Source Card Update:

Current source card shows: filename, upload_date, content_summary, chunk_index

Enhanced source card adds: page_number (e.g., "Page 3"), clickable "View Source" link opening chunk PDF

3.4 Directory Structure After Enhancement

legco_reranker/
├── app/
│   ├── backend/...
│   ├── frontend/...
│   └── chroma_db/                    # Existing
├── document_chunk/                    # NEW — chunk PDF files
│   ├── NEC4 ACC_page_1.pdf
│   ├── NEC4 ACC_page_2.pdf
│   ├── NEC4 ACC_page_3.pdf
│   └── meeting_notes_page_5.pdf
├── .plans/
└── ...

3.5 Acceptance Criteria

  • PDF uploads produce page-aware chunks: 1 chunk per page with 200-token overlap from adjacent pages
  • Each page is saved as a separate PDF (original page, not generated text) in document_chunk/
  • Chunk PDF filename follows convention: {filename}_page_{n}.pdf
  • Page numbers are sequential index (1, 2, 3...), not PDF internal labels
  • Oversized pages are kept as single chunks (never split)
  • GET /api/v1/chunks/{file_path}/pdf serves the original chunk PDF
  • RAG response sources include page_number and chunk_file_path
  • Frontend source cards show page number and clickable link
  • Clicking source link opens/downloads the original chunk PDF
  • DOCX uploads work without page numbers (graceful degradation, no chunk PDFs)
  • Uploading a file with same filename replaces existing document (old chunks + PDFs deleted, new document_id)
  • document_chunk/ is .gitignored
  • Deleting a document also removes its chunk PDFs from document_chunk/

Implementation Sequence

The three features have dependencies. Recommended order:

Feature 1 (Nav + Routing)     ← No backend changes, enables Feature 2
    │
    ▼
Feature 2 (RAG Database Page)  ← Needs Feature 1 for page routing
    │                            But backend CRUD endpoints are independent
    ▼
Feature 3 (Page-Aware Chunking) ← Modifies ingestion pipeline
                                 Enhances Feature 2 (chunk file links)
                                 Enhances ResponsePanel (clickable sources)

Sub-Phase Breakdown

| Sub-Phase | Feature | Scope | Backend | Frontend | Status |
|---|---|---|---|---|---|
| 1.5.1 | 1 | Nav bar + routing + page scaffold | None | NavBar, LTTPage, RAGDatabasePage, App.tsx refactor | Complete |
| 1.5.2 | 2 | Backend CRUD for documents/chunks | documents router, RAGService methods, schemas | None | Complete |
| 1.5.3 | 2 | Frontend RAG Database page | None | RAGDatabasePage, DocumentList, ChunkList, DocumentUpload, API hooks | Complete |
| 1.5.4 | 3 | Page-aware parsing & chunking | pdf_parser, chunking, metadata enhancements | None | Complete |
| 1.5.5 | 3 | Chunk PDF generation & storage | pdf_extractor, config, ingest pipeline refactor | None | Complete |
| 1.5.6 | 3 | Chunk file serving + frontend links | documents router endpoint | ResponsePanel clickable links, ChunkList updates | 📋 Pending |

Parallelization Opportunities

  • 1.5.1 and 1.5.2 can run in parallel — Frontend routing changes and backend CRUD are independent
  • 1.5.3 blocked by 1.5.1 + 1.5.2 — Needs both routing and backend endpoints
  • 1.5.4 and 1.5.5 are sequential — 1.5.5 depends on 1.5.4's page-aware parsing
  • 1.5.6 blocked by 1.5.3 + 1.5.5 — Needs both frontend page and backend chunk serving
1.5.1 (Nav+Routing)  ─┐
                       ├─► 1.5.3 (RAG DB Page) ─┐
1.5.2 (Backend CRUD) ─┘                         │
                                                 ├─► 1.5.6 (Links)
1.5.4 (Page-Aware) ──► 1.5.5 (Chunk PDFs) ─────┘

New Dependencies

Backend

| Package | Purpose | Already installed? |
|---|---|---|
| (none) | pypdf already supports page extraction | |

Frontend

| Package | Purpose | Already installed? |
|---|---|---|
| react-router-dom | Client-side routing | Installed |

.gitignore Updates

# Chunk PDF storage
document_chunk/

Risks & Mitigations

| Risk | Impact | Mitigation |
|---|---|---|
| ChromaDB has no native "group by document" query | list_documents() needs manual grouping from all metadata | Resolved: Groups by document_id extracted from chunk IDs via rsplit("_", 1) |
| Large PDFs → many chunk PDF files | Disk usage grows | One PDF per page. Each page yields exactly one chunk, so no file is duplicated. |
| Chunk includes overlap text from adjacent pages | Ambiguous page assignment | Tag chunk with its main page (N) only, never the overlap pages. Note in UI. |
| DOCX has no page numbers | page_number is None for DOCX chunks | Graceful degradation: show "N/A" or hide page info for DOCX. |
| Deleting documents must clean up chunk files | Orphan files if deletion fails | Delete files after successful ChromaDB deletion. Log failures for manual cleanup. |
| Path traversal in chunk PDF endpoint | Security risk | Validate file_path doesn't contain .. or absolute paths. Use whitelist of known files. |
| ChromaDB 1.5.8 requires name() on embedding functions | _EmbeddingFunctionWrapper crashes on collection.get() | Fixed: Added name() method returning "custom_embedding_wrapper" |
| Existing ChromaDB data corrupted (HNSW segment error) | Endpoints return 500 against existing chroma_db/ | Pre-existing issue. Works with fresh DB. May need chroma_db reset for production. |

Decisions (Confirmed)

| # | Question | Decision |
|---|---|---|
| 1 | Chunk algorithm | Page-as-chunk-unit. Each chunk = [last 200 tokens of prev page] + [full current page text] + [first 200 tokens of next page]. One chunk per page; never split a page even if oversized. |
| 2 | DOCX chunk PDFs | No. Only PDFs get chunk PDFs. DOCX chunks show text preview only. |
| 3 | IngestPanel placement | Keep on LTT page + also add upload on RAG Database page. |
| 4 | Re-ingestion / same filename | Full replacement. Delete old chunks + old chunk PDFs + create new document_id. |
| 5 | Chunk PDF content | Original page from source PDF. Extract the actual page, which preserves formatting, tables, images. |
| 6 | Page numbering | Sequential index (1, 2, 3...). Not PDF internal labels. |
| 7 | Oversized pages | Never split. One chunk per page regardless of token count. |

Open Questions

None — all resolved.


Test Plan

Backend Tests (New Files)

| File | Coverage |
|---|---|
| test_phase1_documents_router.py | GET /documents, DELETE /documents/{id}, DELETE /chunks/{id} (8 tests, all pass) |
| test_phase1_pdf_parser_pages.py | parse_pdf_by_page(): multi-page PDFs, single-page, empty |
| test_phase1_page_aware_chunking.py | chunk_pages(): multi-page documents with adjacent-page overlap, single-page documents |
| test_phase1_pdf_extractor.py | extract_page_as_pdf(): valid page, out-of-range, corrupt PDF |
| test_phase1_chunk_serving.py | GET /chunks/{path}/pdf: valid file, missing file, path traversal |

Frontend Tests (New Files)

| File | Coverage |
|---|---|
| NavBar.test.tsx | Navigation links, active state |
| RAGDatabasePage.test.tsx | Document list, delete, upload |
| DocumentList.test.tsx | Document cards, expand/collapse |
| ChunkList.test.tsx | Chunk table, page numbers, PDF links |

Acceptance Tests

| File | Coverage |
|---|---|
| test_acceptance_phase1_documents_crud.py | Real ChromaDB CRUD with list, delete |
| test_acceptance_phase1_page_chunking.py | Real PDF upload → page-aware chunks → chunk PDFs exist |
| test_acceptance_phase1_chunk_links.py | Full flow: upload → query → response has clickable chunk links |