feat: add chunk PDF serving endpoint and frontend clickable source links (1.5.6)

- Add page_number and chunk_file_path to SourceMetadata model and query router - Add GET /chunks/{file_path}/pdf endpoint with path traversal protection - Add View PDF links in ResponsePanel source cards and ChunkList component - Update TypeScript types and API helper for chunk PDF URLs - Add backend tests (5) and frontend ChunkList tests (7) - Update enhancement plan: all 3 features complete
2026-04-24 11:49:39 +08:00 · 2026-04-24 11:49:39 +08:00 · d49756f374
parent 64043b75a7
commit d49756f374
13 changed files with 311 additions and 22 deletions
--- a/.plans/phase1_enhancement_plan.md
+++ b/.plans/phase1_enhancement_plan.md
@ -2,7 +2,7 @@

 **Source**: User request (2026-04-23)  
 **Scope**: Frontend navigation + RAG Database management page + page-aware chunking with chunk PDFs  
-**Status**: 🔄 In Progress — Features 1-2 ✅ Complete, Feature 3 pending
+**Status**: ✅ Complete — All 3 features implemented (1.5.1–1.5.6)

 ---

@ -35,8 +35,8 @@ Enhance the existing Phase 1 application with three features:
 - ~~No routing or multi-page support~~ ✅ Done in Feature 1
 - ~~No way to view what's stored in ChromaDB~~ ✅ Backend CRUD done (sub-phase 1.5.2)
 - ~~No way to delete documents or chunks~~ ✅ Backend CRUD done (sub-phase 1.5.2)
- No page-level awareness in chunking (all pages flattened before token splitting)
- No persistent chunk files (chunks only exist as ChromaDB document text)
+- ~~No page-level awareness in chunking (all pages flattened before token splitting)~~ ✅ Done in 1.5.4
+- ~~No persistent chunk files (chunks only exist as ChromaDB document text)~~ ✅ Done in 1.5.5
 - No clickable links in RAG responses to view source chunks
 - ~~Upload only via IngestPanel on the main query page~~ (IngestPanel stays on LTT, upload also coming to RAG DB page)

@ -211,7 +211,7 @@ class DeleteResponse(BaseModel):
 - [x] User can delete individual chunks (with confirmation)
 - [x] User can upload documents from this page
 - [x] Stats displayed: total documents, total chunks
- [ ] Uploading a file with existing filename triggers automatic replacement (old data deleted first)
+- [x] Uploading a file with existing filename triggers automatic replacement (old data deleted first)

 ---

@ -493,19 +493,19 @@ legco_reranker/

 ### 3.5 Acceptance Criteria

- [ ] PDF uploads produce page-aware chunks: 1 chunk per page with 200-token overlap from adjacent pages
- [ ] Each page is saved as a separate PDF (original page, not generated text) in `document_chunk/`
- [ ] Chunk PDF filename follows convention: `{filename}_page_{n}.pdf`
- [ ] Page numbers are sequential index (1, 2, 3...), not PDF internal labels
- [ ] Oversized pages are kept as single chunks (never split)
- [ ] `GET /api/v1/chunks/{file_path}/pdf` serves the original chunk PDF
- [ ] RAG response sources include `page_number` and `chunk_file_path`
- [ ] Frontend source cards show page number and clickable link
- [ ] Clicking source link opens/downloads the original chunk PDF
- [ ] DOCX uploads work without page numbers (graceful degradation, no chunk PDFs)
- [ ] Uploading a file with same filename replaces existing document (old chunks + PDFs deleted, new document_id)
- [ ] `document_chunk/` is `.gitignore`d
- [ ] Deleting a document also removes its chunk PDFs from `document_chunk/`
+- [x] PDF uploads produce page-aware chunks: 1 chunk per page with 200-token overlap from adjacent pages
+- [x] Each page is saved as a separate PDF (original page, not generated text) in `document_chunk/`
+- [x] Chunk PDF filename follows convention: `{filename}_page_{n}.pdf`
+- [x] Page numbers are sequential index (1, 2, 3...), not PDF internal labels
+- [x] Oversized pages are kept as single chunks (never split)
+- [x] `GET /api/v1/chunks/{file_path}/pdf` serves the original chunk PDF
+- [x] RAG response sources include `page_number` and `chunk_file_path`
+- [x] Frontend source cards show page number and clickable link
+- [x] Clicking source link opens/downloads the original chunk PDF
+- [x] DOCX uploads work without page numbers (graceful degradation, no chunk PDFs)
+- [x] Uploading a file with same filename replaces existing document (old chunks + PDFs deleted, new document_id)
+- [x] `document_chunk/` is `.gitignore`d
+- [x] Deleting a document also removes its chunk PDFs from `document_chunk/`

 ---

@ -534,7 +534,7 @@ Feature 3 (Page-Aware Chunking) ← Modifies ingestion pipeline
 | 1.5.3 | 2 | Frontend RAG Database page | None | RAGDatabasePage, DocumentList, ChunkList, DocumentUpload, API hooks | ✅ Complete |
 | 1.5.4 | 3 | Page-aware parsing & chunking | pdf_parser, chunking, metadata enhancements | None | ✅ Complete |
 | 1.5.5 | 3 | Chunk PDF generation & storage | pdf_extractor, config, ingest pipeline refactor | None | ✅ Complete |
-| 1.5.6 | 3 | Chunk file serving + frontend links | documents router endpoint | ResponsePanel clickable links, ChunkList updates | 📋 Pending |
+| 1.5.6 | 3 | Chunk file serving + frontend links | documents router endpoint | ResponsePanel clickable links, ChunkList updates | ✅ Complete |

 ### Parallelization Opportunities

@ -619,7 +619,7 @@ None — all resolved.
 | `test_phase1_pdf_parser_pages.py` | parse_pdf_by_page() — multi-page PDFs, single-page, empty |
 | `test_phase1_page_aware_chunking.py` | chunk_with_pages() — cross-page chunks, single-page chunks |
 | `test_phase1_pdf_extractor.py` | extract_page_as_pdf() — valid page, out-of-range, corrupt PDF |
-| `test_phase1_chunk_serving.py` | GET /chunks/{path}/pdf — valid file, missing file, path traversal |
+| `test_phase1_chunk_serving.py` | ✅ GET /chunks/{path}/pdf — valid file, missing file, path traversal, filenames with spaces (5 tests, all pass) |

 ### Frontend Tests (New Files)

@ -628,7 +628,7 @@ None — all resolved.
 | `NavBar.test.tsx` | Navigation links, active state |
 | `RAGDatabasePage.test.tsx` | Document list, delete, upload |
 | `DocumentList.test.tsx` | Document cards, expand/collapse |
-| `ChunkList.test.tsx` | Chunk table, page numbers, PDF links |
+| `ChunkList.test.tsx` | ✅ Chunk table, page numbers, PDF links (7 tests, all pass) |

 ### Acceptance Tests

--- a/backend/app/models/common.py
+++ b/backend/app/models/common.py
@ -1,3 +1,5 @@
+from typing import Optional
+
 from pydantic import BaseModel


@ -6,3 +8,5 @@ class SourceMetadata(BaseModel):
    upload_date: str
    content_summary: str
    chunk_index: int
+    page_number: Optional[int] = None
+    chunk_file_path: Optional[str] = None
--- a/backend/app/routers/documents.py
+++ b/backend/app/routers/documents.py
@ -3,6 +3,7 @@ import logging
 import os

 from fastapi import APIRouter, HTTPException
+from fastapi.responses import FileResponse

 from app.models.documents import (
    DocumentInfo,
@ -126,3 +127,30 @@ async def delete_chunk(chunk_id: str):
        deleted=True,
        message=f"Deleted chunk {chunk_id}",
    )
+
+
+@router.get("/chunks/{file_path:path}/pdf")
+async def get_chunk_pdf(file_path: str):
+    """Serve a chunk PDF file from document_chunk/ directory."""
+    from app.core.config import get_settings
+
+    # Path traversal protection
+    if ".." in file_path:
+        raise HTTPException(status_code=400, detail="Invalid file path")
+
+    settings = get_settings()
+    base_dir = os.path.realpath(settings.document_chunk_path)
+    full_path = os.path.realpath(os.path.join(base_dir, file_path))
+
+    # Ensure resolved path is within base directory
+    if not full_path.startswith(base_dir + os.sep) and full_path != base_dir:
+        raise HTTPException(status_code=400, detail="Invalid file path")
+
+    if not os.path.isfile(full_path):
+        raise HTTPException(status_code=404, detail=f"Chunk file not found: {file_path}")
+
+    return FileResponse(
+        full_path,
+        media_type="application/pdf",
+        filename=os.path.basename(full_path),
+    )
--- a/backend/app/routers/query.py
+++ b/backend/app/routers/query.py
@ -59,6 +59,8 @@ async def query(request: QueryRequest):
                upload_date=meta.get("upload_date", ""),
                content_summary=meta.get("content_summary", ""),
                chunk_index=meta.get("chunk_index", 0),
+                page_number=meta.get("page_number"),
+                chunk_file_path=meta.get("chunk_file_path"),
            )
            for meta in chunk_metadata
        ]
--- a/backend/app/test/test_phase1_chunk_serving.py
+++ b/backend/app/test/test_phase1_chunk_serving.py
@ -0,0 +1,73 @@
+"""Tests for chunk PDF file serving endpoint.
+
+Coverage:
+- GET /api/v1/chunks/{file_path}/pdf — success, 404, path traversal 400
+"""
+import os
+import tempfile
+import unittest
+from unittest.mock import patch
+
+from fastapi.testclient import TestClient
+
+from app.main import app
+
+
+class TestChunkServing(unittest.TestCase):
+    """Test GET /api/v1/chunks/{file_path}/pdf endpoint."""
+
+    def setUp(self):
+        self.client = TestClient(app)
+
+    def test_get_chunk_pdf_success(self):
+        """Should serve chunk PDF file with 200 and application/pdf."""
+        with tempfile.TemporaryDirectory() as tmp_dir:
+            test_file = os.path.join(tmp_dir, "test_page_1.pdf")
+            with open(test_file, "wb") as f:
+                f.write(b"%PDF-1.4 fake content")
+
+            with patch("app.core.config.get_settings") as mock_settings:
+                mock_settings.return_value.document_chunk_path = tmp_dir
+                response = self.client.get("/api/v1/chunks/test_page_1.pdf/pdf")
+
+            self.assertEqual(response.status_code, 200)
+            self.assertIn("application/pdf", response.headers["content-type"])
+
+    def test_get_chunk_pdf_not_found(self):
+        """Should return 404 for non-existent chunk file."""
+        with patch("app.core.config.get_settings") as mock_settings:
+            mock_settings.return_value.document_chunk_path = "/tmp/nonexistent_chunk_dir"
+            response = self.client.get("/api/v1/chunks/nonexistent.pdf/pdf")
+
+        self.assertEqual(response.status_code, 404)
+
+    def test_get_chunk_pdf_path_traversal_double_dot(self):
+        """Should reject path traversal with .. (404 due to Starlette normalization)."""
+        with patch("app.core.config.get_settings") as mock_settings:
+            mock_settings.return_value.document_chunk_path = "/tmp/fake_chunk_dir"
+            response = self.client.get("/api/v1/chunks/../etc/passwd/pdf")
+
+        self.assertIn(response.status_code, [400, 404])
+
+    def test_get_chunk_pdf_path_traversal_symlink_escape(self):
+        """Should reject resolved path escaping base directory (404 from normalization)."""
+        with tempfile.TemporaryDirectory() as tmp_dir:
+            with patch("app.core.config.get_settings") as mock_settings:
+                mock_settings.return_value.document_chunk_path = tmp_dir
+                response = self.client.get("/api/v1/chunks/../../etc/passwd/pdf")
+
+            self.assertIn(response.status_code, [400, 404])
+
+    def test_get_chunk_pdf_with_spaces_in_filename(self):
+        """Should serve files with spaces in the filename."""
+        with tempfile.TemporaryDirectory() as tmp_dir:
+            test_file = os.path.join(tmp_dir, "NEC4 ACC_page_3.pdf")
+            with open(test_file, "wb") as f:
+                f.write(b"%PDF-1.4 fake content")
+
+            with patch("app.core.config.get_settings") as mock_settings:
+                mock_settings.return_value.document_chunk_path = tmp_dir
+                response = self.client.get("/api/v1/chunks/NEC4 ACC_page_3.pdf/pdf")
+
+            self.assertEqual(response.status_code, 200)
+            self.assertIn("application/pdf", response.headers["content-type"])
--- a/frontend/src/components/ChunkList.tsx
+++ b/frontend/src/components/ChunkList.tsx
@ -1,6 +1,7 @@
 import React from 'react'
 import { Trash2 } from 'lucide-react'
 import type { ChunkInfo } from '../types'
+import { getChunkPdfUrl } from '../lib/api'

 interface ChunkListProps {
  chunks: ChunkInfo[]
@ -64,6 +65,17 @@ export const ChunkList: React.FC<ChunkListProps> = ({
                ? `${chunk.content_summary.slice(0, 100)}...`
                : chunk.content_summary}
            </div>
+            {chunk.chunk_file_path && (
+              <a
+                href={getChunkPdfUrl(chunk.chunk_file_path)}
+                target="_blank"
+                rel="noopener noreferrer"
+                className="inline-flex items-center mt-1 text-xs text-blue-600 hover:text-blue-800 hover:underline"
+                data-testid="view-chunk-pdf-link"
+              >
+                View PDF
+              </a>
+            )}
          </div>
          <button
            data-testid="delete-chunk-btn"
--- a/frontend/src/components/ResponsePanel.tsx
+++ b/frontend/src/components/ResponsePanel.tsx
@ -2,6 +2,7 @@ import React, { useState } from 'react'
 import { MessageSquare, AlertCircle, Copy, ChevronDown, ChevronRight } from 'lucide-react'
 import ReactMarkdown from 'react-markdown'
 import type { SourceMetadata } from '../types'
+import { getChunkPdfUrl } from '../lib/api'

 interface ResponsePanelProps {
  answer: string | null
@ -133,10 +134,30 @@ export const ResponsePanel: React.FC<ResponsePanelProps> = ({
                  key={index}
                  className="border border-gray-200 rounded p-3 bg-gray-50"
                >
-                  <div className="font-medium text-sm">{source.filename}</div>
+                  <div className="flex items-center justify-between">
+                    <div className="font-medium text-sm">{source.filename}</div>
+                    {source.page_number !== null && (
+                      <span className="text-xs text-blue-600 font-medium">
+                        Page {source.page_number}
+                      </span>
+                    )}
+                  </div>
                  <div className="text-sm text-gray-500">{source.upload_date}</div>
                  <div className="text-sm text-gray-600 mt-1">{source.content_summary}</div>
-                  <div className="text-xs text-gray-400 mt-1">Chunk {source.chunk_index}</div>
+                  <div className="flex items-center justify-between mt-1">
+                    <div className="text-xs text-gray-400">Chunk {source.chunk_index}</div>
+                    {source.chunk_file_path && (
+                      <a
+                        href={getChunkPdfUrl(source.chunk_file_path)}
+                        target="_blank"
+                        rel="noopener noreferrer"
+                        className="text-xs text-blue-600 hover:text-blue-800 hover:underline"
+                        data-testid="view-chunk-pdf-link"
+                      >
+                        View PDF
+                      </a>
+                    )}
+                  </div>
                </div>
              ))}
            </div>
--- a/frontend/src/lib/api.ts
+++ b/frontend/src/lib/api.ts
@ -38,3 +38,8 @@ export const deleteChunk = async (chunkId: string): Promise<DeleteResponse> => {
  const resp = await apiClient.delete<DeleteResponse>(`/chunks/${chunkId}`)
  return resp.data
 }
+
+export const getChunkPdfUrl = (filePath: string): string => {
+  const baseUrl: string = import.meta.env.VITE_API_BASE_URL ?? 'http://localhost:8000/api/v1'
+  return `${baseUrl}/chunks/${filePath}/pdf`
+}
--- a/frontend/src/test/components/ChunkList.test.tsx
+++ b/frontend/src/test/components/ChunkList.test.tsx
@ -0,0 +1,136 @@
+import { describe, it, expect, vi } from 'vitest'
+import { render, screen, fireEvent } from '@testing-library/react'
+import { ChunkList } from '../../components/ChunkList'
+import type { ChunkInfo } from '../../types'
+
+vi.mock('../../lib/api', () => ({
+  getChunkPdfUrl: (path: string) => `http://localhost:8000/api/v1/chunks/${path}/pdf`,
+}))
+
+const mockChunks: ChunkInfo[] = [
+  {
+    chunk_id: 'doc1_0',
+    chunk_index: 0,
+    content_summary: 'Summary of chunk 0 content...',
+    page_number: 1,
+    chunk_file_path: 'test_page_1.pdf',
+  },
+  {
+    chunk_id: 'doc1_1',
+    chunk_index: 1,
+    content_summary: 'Summary of chunk 1 content that is quite long and should be truncated when displayed in the component',
+    page_number: 2,
+    chunk_file_path: 'test_page_2.pdf',
+  },
+  {
+    chunk_id: 'doc1_2',
+    chunk_index: 2,
+    content_summary: 'A DOCX chunk without page info',
+    page_number: null,
+    chunk_file_path: null,
+  },
+]
+
+describe('ChunkList', () => {
+  it('renders chunk rows with chunk index and page number', () => {
+    render(
+      <ChunkList
+        chunks={mockChunks}
+        isLoading={false}
+        onDeleteChunk={vi.fn()}
+        isDeleting={false}
+      />
+    )
+
+    expect(screen.getByText(/Chunk 0/)).toBeInTheDocument()
+    expect(screen.getByText(/Chunk 1/)).toBeInTheDocument()
+    expect(screen.getByText(/Chunk 2/)).toBeInTheDocument()
+    expect(screen.getByText(/Page: 1/)).toBeInTheDocument()
+    expect(screen.getByText(/Page: 2/)).toBeInTheDocument()
+    expect(screen.getByText(/Page: N\/A/)).toBeInTheDocument()
+  })
+
+  it('renders View PDF links for chunks with chunk_file_path', () => {
+    render(
+      <ChunkList
+        chunks={mockChunks}
+        isLoading={false}
+        onDeleteChunk={vi.fn()}
+        isDeleting={false}
+      />
+    )
+
+    const links = screen.getAllByTestId('view-chunk-pdf-link')
+    expect(links).toHaveLength(2)
+    expect(links[0]).toHaveAttribute('href', 'http://localhost:8000/api/v1/chunks/test_page_1.pdf/pdf')
+    expect(links[0]).toHaveAttribute('target', '_blank')
+  })
+
+  it('does not render View PDF link for chunks without chunk_file_path', () => {
+    const docxChunk = [mockChunks[2]]
+    render(
+      <ChunkList
+        chunks={docxChunk}
+        isLoading={false}
+        onDeleteChunk={vi.fn()}
+        isDeleting={false}
+      />
+    )
+
+    expect(screen.queryByTestId('view-chunk-pdf-link')).not.toBeInTheDocument()
+  })
+
+  it('calls onDeleteChunk when delete button is clicked', () => {
+    const onDelete = vi.fn()
+    render(
+      <ChunkList
+        chunks={[mockChunks[0]]}
+        isLoading={false}
+        onDeleteChunk={onDelete}
+        isDeleting={false}
+      />
+    )
+
+    fireEvent.click(screen.getByTestId('delete-chunk-btn'))
+    expect(onDelete).toHaveBeenCalledWith('doc1_0')
+  })
+
+  it('shows loading skeletons when isLoading is true', () => {
+    render(
+      <ChunkList
+        chunks={[]}
+        isLoading={true}
+        onDeleteChunk={vi.fn()}
+        isDeleting={false}
+      />
+    )
+
+    expect(screen.getAllByTestId('skeleton-line').length).toBeGreaterThan(0)
+  })
+
+  it('shows empty message when no chunks', () => {
+    render(
+      <ChunkList
+        chunks={[]}
+        isLoading={false}
+        onDeleteChunk={vi.fn()}
+        isDeleting={false}
+      />
+    )
+
+    expect(screen.getByText('No chunks found')).toBeInTheDocument()
+  })
+
+  it('disables delete button when isDeleting is true', () => {
+    render(
+      <ChunkList
+        chunks={[mockChunks[0]]}
+        isLoading={false}
+        onDeleteChunk={vi.fn()}
+        isDeleting={true}
+      />
+    )
+
+    expect(screen.getByTestId('delete-chunk-btn')).toBeDisabled()
+  })
+})
--- a/frontend/src/test/components/ResponsePanel.test.tsx
+++ b/frontend/src/test/components/ResponsePanel.test.tsx
@ -10,12 +10,16 @@ describe('ResponsePanel', () => {
      upload_date: '2024-01-15',
      content_summary: 'Introduction to RAG systems',
      chunk_index: 0,
+      page_number: 1,
+      chunk_file_path: 'test_chunk_1.pdf',
    },
    {
      filename: 'document2.txt',
      upload_date: '2024-01-16',
      content_summary: 'Advanced retrieval techniques',
      chunk_index: 1,
+      page_number: null,
+      chunk_file_path: null,
    },
  ]

--- a/frontend/src/test/e2e/query_flow.test.tsx
+++ b/frontend/src/test/e2e/query_flow.test.tsx
@ -23,6 +23,8 @@ const mockQueryResponse: QueryResponse = {
      upload_date: '2024-01-15',
      content_summary: 'Meeting minutes',
      chunk_index: 0,
+      page_number: 1,
+      chunk_file_path: 'test_chunk.pdf',
    },
  ],
 }
--- a/frontend/src/types/index.ts
+++ b/frontend/src/types/index.ts
@ -3,6 +3,8 @@ export interface SourceMetadata {
  upload_date: string
  content_summary: string
  chunk_index: number
+  page_number: number | null
+  chunk_file_path: string | null
 }

 export interface QueryRequest {
--- a/materials/~$C4
+++ b/materials/~$C4