docs: add PDF support alongside DOCX in all plans

2026-04-22 15:59:55 +08:00 · 2026-04-22 15:59:55 +08:00 · abffc9cf1d
parent 1518b72969
commit abffc9cf1d
3 changed files with 17 additions and 12 deletions
--- a/.plans/phase1_backend_plan.md
+++ b/.plans/phase1_backend_plan.md
@@ -10,7 +10,7 @@
 ## Objective

 Build a complete FastAPI backend that:
-1. Accepts DOCX uploads, chunks text (1000 tokens / 200 overlap), embeds via Qwen, and stores in persistent ChromaDB with metadata
+1. Accepts DOCX and PDF uploads, chunks text (1000 tokens / 200 overlap), embeds via Qwen, and stores in persistent ChromaDB with metadata
 2. Runs a 3-step RAG pipeline: query decomposition → retrieval → relevance filtering → bullet-point response
 3. Serves API endpoints for ingestion and querying with full metadata attribution

@@ -18,7 +18,7 @@ Build a complete FastAPI backend that:

 ## Acceptance Criteria

-- [ ] `POST /api/v1/ingest` accepts DOCX, parses content, chunks at 1000/200, embeds, stores in ChromaDB with filename/upload_date/content_summary
+- [ ] `POST /api/v1/ingest` accepts DOCX and PDF, parses content, chunks at 1000/200, embeds, stores in ChromaDB with filename/upload_date/content_summary
 - [ ] `POST /api/v1/query` accepts natural language question, returns JSON with: `keywords`, `answer` (bullet points), `sources` (array of metadata objects)
 - [ ] Query pipeline executes 3 LLM calls: decomposition → relevance filter → response generation
 - [ ] All LLM/ASR configuration reads from `.env` (OpenRouter for dev)
@@ -33,6 +33,7 @@ Build a complete FastAPI backend that:

 **File**: `backend/app/test/acceptance/test_acceptance_phase1_ingest.py`
 - `test_ingest_docx_with_real_embedding()` — Upload DOCX, verify ChromaDB entries with metadata
+- `test_ingest_pdf_with_real_embedding()` — Upload PDF, verify ChromaDB entries with metadata

 **File**: `backend/app/test/acceptance/test_acceptance_phase1_rag_query.py`
 - `test_query_with_real_llm()` — Ask question, verify 3-step pipeline produces bullet answer with sources
@@ -49,7 +50,7 @@ Build a complete FastAPI backend that:
 - `test_phase1_database.py` — Test ChromaDB client initialization

 **Task 1.1.1**: Environment and dependencies
-- Create `backend/requirements.txt` with: fastapi, uvicorn[standard], pydantic, pydantic-settings, chromadb, sentence-transformers, python-docx, python-dotenv, httpx, pytest, pytest-asyncio, tiktoken
+- Create `backend/requirements.txt` with: fastapi, uvicorn[standard], pydantic, pydantic-settings, chromadb, sentence-transformers, python-docx, pypdf, python-dotenv, httpx, pytest, pytest-asyncio, tiktoken
 - Create `backend/.env.example` with: LLM_BASE_URL, LLM_API_KEY, LLM_MODEL_NAME, EMBEDDING_MODEL, EMBEDDING_BASE_URL, CHROMA_DB_PATH
 - Create `backend/app/core/config.py` — Pydantic Settings with `.env` loading

@@ -77,10 +78,11 @@ Build a complete FastAPI backend that:
 - `test_phase1_ingest.py` — Mock ChromaDB, test endpoint flow
 - `test_phase1_metadata.py` — Test metadata extraction

-**Task 1.2.1**: DOCX parsing
-- `utils/docx_parser.py`: `parse_docx(file_path) -> str`
-- Handle paragraphs, tables, headers
-- Return plain text with preserved paragraph breaks
+**Task 1.2.1**: Document parsing
+- `utils/docx_parser.py`: `parse_docx(file_path) -> str` — Extract text from DOCX
+- `utils/pdf_parser.py`: `parse_pdf(file_path) -> str` — Extract text from PDF using pypdf
+- Both return plain text with preserved paragraph breaks
+- Handle edge cases: empty docs, corrupted files, scanned PDFs (skip with warning)

 **Task 1.2.2**: Chunking abstraction
 - `utils/chunking.py`: Abstract base class `ChunkingStrategy`
@@ -101,7 +103,8 @@ Build a complete FastAPI backend that:

 **Task 1.2.5**: Ingest endpoint
 - `routers/ingest.py`: `POST /api/v1/ingest`
-- Accept `UploadFile` (DOCX only, validate extension)
+- Accept `UploadFile` (DOCX and PDF, validate extension)
+- Route to correct parser based on file extension
 - Orchestration: save temp → parse → chunk → extract metadata → embed → store → cleanup
 - Return `IngestResponse`

@@ -196,7 +199,9 @@ Build a complete FastAPI backend that:
 | Query Decomposer | `services/query_decomposer.py` | Extract keywords from question |
 | Relevance Filter | `services/relevance_filter.py` | Batch score chunk relevance |
 | RAG Service | `services/rag.py` | Embedding, retrieval, response generation |
-| DOCX Parser | `utils/docx_parser.py` | Extract text from DOCX |
+| Document Parser | `utils/document_parser.py` | Router to DOCX/PDF parsers |
+| DOCX Parser | `utils/parsers/docx_parser.py` | Extract text from DOCX |
+| PDF Parser | `utils/parsers/pdf_parser.py` | Extract text from PDF |
 | Chunking | `utils/chunking.py` | Token-based chunking with overlap |
 | Metadata | `utils/metadata.py` | Extract file metadata |

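The extension-based routing described in Task 1.2.5 and the "Document Parser" row could look roughly like the sketch below. It is an illustration under stated assumptions, not the planned implementation: `parse_document`, `UnsupportedFormatError`, and the `parsers` mapping are hypothetical names that do not appear in the plan, and treating a PDF with no extractable text as "scanned" is one possible reading of "skip with warning". The only pypdf calls used are `PdfReader(...)`, `.pages`, and `page.extract_text()`.

```python
import logging
from pathlib import Path
from typing import Callable, Dict

logger = logging.getLogger(__name__)


class UnsupportedFormatError(ValueError):
    """Hypothetical error for uploads whose extension has no registered parser."""


def parse_pdf(file_path: str) -> str:
    """Extract text from a PDF with pypdf, joining pages with blank lines."""
    # Imported lazily so the router below can be used without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(file_path)
    pages = [(page.extract_text() or "").strip() for page in reader.pages]
    text = "\n\n".join(p for p in pages if p)
    if not text:
        # Assumption: no text layer usually means a scanned, image-only PDF.
        logger.warning("No extractable text in %s; skipping", file_path)
    return text


def parse_document(file_path: str, parsers: Dict[str, Callable[[str], str]]) -> str:
    """Pick a parser by lowercased file extension and return its plain text."""
    suffix = Path(file_path).suffix.lower()
    try:
        parser = parsers[suffix]
    except KeyError:
        raise UnsupportedFormatError(
            f"Unsupported file type: {suffix or '(none)'}"
        ) from None
    return parser(file_path)
```

In the ingest endpoint, the mapping would presumably be something like `{".docx": parse_docx, ".pdf": parse_pdf}`, so adding a format later means registering one more entry rather than editing the router.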
--- a/.plans/phase1_frontend_plan.md
+++ b/.plans/phase1_frontend_plan.md
@@ -99,7 +99,7 @@ Build a React frontend that:

 4. **IngestPanel component (optional for Phase 1)**
   - `src/components/IngestPanel.tsx`
-   - Simple file upload for DOCX
+   - Simple file upload for DOCX and PDF
   - Progress indicator during upload
   - Success/error feedback

@@ -126,7 +126,7 @@ Build a React frontend that:
   - Smooth transitions between states

 3. **Integration with backend**
-   - End-to-end test: upload DOCX → ask question → verify keywords + answer + sources
+   - End-to-end test: upload DOCX/PDF → ask question → verify keywords + answer + sources
   - Verify CORS works correctly
   - Test error scenarios

--- a/development_plan.md
+++ b/development_plan.md
@@ -59,7 +59,7 @@ app/
 - **LLM/ASR Configuration**: Backend reads from `.env` for easy switching between development (OpenRouter / Alibaba Cloud) and production (local vLLM).  
 - **RAG Database**: ChromaDB with metadata support (filename + extracted content metadata).  
 - **Embedding Model**: `qwen/qwen3-embedding-4b` via sentence-transformers, provider-switchable via `.env` (OpenRouter for dev, local vLLM for prod).  
-- **Document Ingestion**: Via UI (project-based demo, no user authentication). Supported formats: DOCX.  
+- **Document Ingestion**: Via UI (project-based demo, no user authentication). Supported formats: DOCX, PDF.  
 - **Chunking Strategy**: 1000 tokens per chunk, 200 token overlap. Strategy abstracted for future replacement.  
 - **Video**: MP4 and common formats, maximum 300MB.  
 - **ASR Flow**: Both **automatic** (on transcript updates) and **manual** "Ask from Video" button.
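
The "1000 tokens per chunk, 200 token overlap" setting applies to DOCX and PDF text alike, and a minimal sketch may make the window arithmetic concrete. The function name `chunk_tokens` is hypothetical, and the sketch works on any token sequence (for example, IDs produced by a tokenizer such as tiktoken) rather than assuming a particular tokenizer. Each window starts `size - overlap` tokens after the previous one.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def chunk_tokens(tokens: Sequence[T], size: int = 1000, overlap: int = 200) -> List[List[T]]:
    """Split tokens into windows of `size` that share `overlap` tokens with the previous window."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    step = size - overlap  # 800 with the plan's defaults
    chunks: List[List[T]] = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + size]))
        # Stop once a window reaches the end, so no trailing chunk is a pure repeat.
        if start + size >= len(tokens):
            break
    return chunks
```

With the defaults, a 2500-token document yields three chunks starting at tokens 0, 800, and 1600; the last one is shorter (900 tokens) because the input runs out.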