docs: add PDF support alongside DOCX in all plans

Woody 2026-04-22 15:59:55 +08:00
parent 1518b72969
commit abffc9cf1d
3 changed files with 17 additions and 12 deletions

@@ -10,7 +10,7 @@
 ## Objective
 Build a complete FastAPI backend that:
-1. Accepts DOCX uploads, chunks text (1000 tokens / 200 overlap), embeds via Qwen, and stores in persistent ChromaDB with metadata
+1. Accepts DOCX and PDF uploads, chunks text (1000 tokens / 200 overlap), embeds via Qwen, and stores in persistent ChromaDB with metadata
 2. Runs a 3-step RAG pipeline: query decomposition → retrieval → relevance filtering → bullet-point response
 3. Serves API endpoints for ingestion and querying with full metadata attribution
@@ -18,7 +18,7 @@ Build a complete FastAPI backend that:
 ## Acceptance Criteria
-- [ ] `POST /api/v1/ingest` accepts DOCX, parses content, chunks at 1000/200, embeds, stores in ChromaDB with filename/upload_date/content_summary
+- [ ] `POST /api/v1/ingest` accepts DOCX and PDF, parses content, chunks at 1000/200, embeds, stores in ChromaDB with filename/upload_date/content_summary
 - [ ] `POST /api/v1/query` accepts natural language question, returns JSON with: `keywords`, `answer` (bullet points), `sources` (array of metadata objects)
 - [ ] Query pipeline executes 3 LLM calls: decomposition → relevance filter → response generation
 - [ ] All LLM/ASR configuration reads from `.env` (OpenRouter for dev)
@@ -33,6 +33,7 @@ Build a complete FastAPI backend that:
 **File**: `backend/app/test/acceptance/test_acceptance_phase1_ingest.py`
 - `test_ingest_docx_with_real_embedding()` — Upload DOCX, verify ChromaDB entries with metadata
+- `test_ingest_pdf_with_real_embedding()` — Upload PDF, verify ChromaDB entries with metadata
 **File**: `backend/app/test/acceptance/test_acceptance_phase1_rag_query.py`
 - `test_query_with_real_llm()` — Ask question, verify 3-step pipeline produces bullet answer with sources
@@ -49,7 +50,7 @@ Build a complete FastAPI backend that:
 - `test_phase1_database.py` — Test ChromaDB client initialization
 **Task 1.1.1**: Environment and dependencies
-- Create `backend/requirements.txt` with: fastapi, uvicorn[standard], pydantic, pydantic-settings, chromadb, sentence-transformers, python-docx, python-dotenv, httpx, pytest, pytest-asyncio, tiktoken
+- Create `backend/requirements.txt` with: fastapi, uvicorn[standard], pydantic, pydantic-settings, chromadb, sentence-transformers, python-docx, pypdf, python-dotenv, httpx, pytest, pytest-asyncio, tiktoken
 - Create `backend/.env.example` with: LLM_BASE_URL, LLM_API_KEY, LLM_MODEL_NAME, EMBEDDING_MODEL, EMBEDDING_BASE_URL, CHROMA_DB_PATH
 - Create `backend/app/core/config.py` — Pydantic Settings with `.env` loading
@@ -77,10 +78,11 @@ Build a complete FastAPI backend that:
 - `test_phase1_ingest.py` — Mock ChromaDB, test endpoint flow
 - `test_phase1_metadata.py` — Test metadata extraction
-**Task 1.2.1**: DOCX parsing
-- `utils/docx_parser.py`: `parse_docx(file_path) -> str`
-- Handle paragraphs, tables, headers
-- Return plain text with preserved paragraph breaks
+**Task 1.2.1**: Document parsing
+- `utils/docx_parser.py`: `parse_docx(file_path) -> str` — Extract text from DOCX
+- `utils/pdf_parser.py`: `parse_pdf(file_path) -> str` — Extract text from PDF using pypdf
+- Both return plain text with preserved paragraph breaks
+- Handle edge cases: empty docs, corrupted files, scanned PDFs (skip with warning)
 **Task 1.2.2**: Chunking abstraction
 - `utils/chunking.py`: Abstract base class `ChunkingStrategy`
@@ -101,7 +103,8 @@ Build a complete FastAPI backend that:
 **Task 1.2.5**: Ingest endpoint
 - `routers/ingest.py`: `POST /api/v1/ingest`
-- Accept `UploadFile` (DOCX only, validate extension)
+- Accept `UploadFile` (DOCX and PDF, validate extension)
+- Route to correct parser based on file extension
 - Orchestration: save temp → parse → chunk → extract metadata → embed → store → cleanup
 - Return `IngestResponse`
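The extension validation and parser routing in Task 1.2.5 could be sketched as below. This is only one possible shape: `select_parser`, `UnsupportedFileTypeError`, and the `Parser` alias are hypothetical names, and the parser functions are passed in as a mapping so the sketch does not depend on the real `parse_docx` / `parse_pdf` implementations.

```python
from pathlib import Path
from typing import Callable, Dict

# A parser takes a file path and returns the extracted plain text.
Parser = Callable[[str], str]


class UnsupportedFileTypeError(ValueError):
    """Raised when an upload has an extension with no registered parser."""


def select_parser(filename: str, parsers: Dict[str, Parser]) -> Parser:
    """Return the parser registered for the file's extension (case-insensitive)."""
    extension = Path(filename).suffix.lower()
    if extension not in parsers:
        allowed = ", ".join(sorted(parsers))
        raise UnsupportedFileTypeError(
            f"Unsupported file type {extension!r}; expected one of: {allowed}"
        )
    return parsers[extension]
```

In the endpoint, the mapping would presumably be something like `{".docx": parse_docx, ".pdf": parse_pdf}`, and the handler would translate `UnsupportedFileTypeError` into a client-error HTTP response before any temp file is written.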
@@ -196,7 +199,9 @@ Build a complete FastAPI backend that:
 | Query Decomposer | `services/query_decomposer.py` | Extract keywords from question |
 | Relevance Filter | `services/relevance_filter.py` | Batch score chunk relevance |
 | RAG Service | `services/rag.py` | Embedding, retrieval, response generation |
-| DOCX Parser | `utils/docx_parser.py` | Extract text from DOCX |
+| Document Parser | `utils/document_parser.py` | Router to DOCX/PDF parsers |
+| DOCX Parser | `utils/parsers/docx_parser.py` | Extract text from DOCX |
+| PDF Parser | `utils/parsers/pdf_parser.py` | Extract text from PDF |
 | Chunking | `utils/chunking.py` | Token-based chunking with overlap |
 | Metadata | `utils/metadata.py` | Extract file metadata |
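For the PDF parser row above, a minimal sketch of `parse_pdf(file_path) -> str` using pypdf might look like this. It assumes pypdf's `PdfReader` and `page.extract_text()`. The helper `join_page_texts` is a hypothetical name, and returning an empty string with a logged warning is just one interpretation of "scanned PDFs (skip with warning)"; corrupted-file handling is not shown.

```python
import logging
from typing import Iterable

logger = logging.getLogger(__name__)


def join_page_texts(page_texts: Iterable[str]) -> str:
    """Join per-page text with blank lines, dropping pages with no text."""
    kept = [text.strip() for text in page_texts if text and text.strip()]
    return "\n\n".join(kept)


def parse_pdf(file_path: str) -> str:
    """Extract plain text from a PDF; returns '' when nothing is extractable."""
    # Imported lazily so this module still loads where pypdf is not installed.
    from pypdf import PdfReader

    reader = PdfReader(file_path)
    text = join_page_texts(page.extract_text() or "" for page in reader.pages)
    if not text:
        # Assumption: image-only (scanned) PDFs yield no text, so the caller
        # can treat an empty result as "skip this document".
        logger.warning("No extractable text in %s; possibly a scanned PDF", file_path)
    return text
```

Separating the joining step from the pypdf call keeps the paragraph-break behaviour testable without a real PDF fixture.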

@@ -99,7 +99,7 @@ Build a React frontend that:
 4. **IngestPanel component (optional for Phase 1)**
 - `src/components/IngestPanel.tsx`
-- Simple file upload for DOCX
+- Simple file upload for DOCX and PDF
 - Progress indicator during upload
 - Success/error feedback
@@ -126,7 +126,7 @@ Build a React frontend that:
 - Smooth transitions between states
 3. **Integration with backend**
-- End-to-end test: upload DOCX → ask question → verify keywords + answer + sources
+- End-to-end test: upload DOCX/PDF → ask question → verify keywords + answer + sources
 - Verify CORS works correctly
 - Test error scenarios

@@ -59,7 +59,7 @@ app/
 - **LLM/ASR Configuration**: Backend reads from `.env` for easy switching between development (OpenRouter / Alibaba Cloud) and production (local vLLM).
 - **RAG Database**: ChromaDB with metadata support (filename + extracted content metadata).
 - **Embedding Model**: `qwen/qwen3-embedding-4b` via sentence-transformers, provider-switchable via `.env` (OpenRouter for dev, local vLLM for prod).
-- **Document Ingestion**: Via UI (project-based demo, no user authentication). Supported formats: DOCX.
+- **Document Ingestion**: Via UI (project-based demo, no user authentication). Supported formats: DOCX, PDF.
 - **Chunking Strategy**: 1000 tokens per chunk, 200 token overlap. Strategy abstracted for future replacement.
 - **Video**: MP4 and common formats, maximum 300MB.
 - **ASR Flow**: Both **automatic** (on transcript updates) and **manual** "Ask from Video" button.