diff --git a/.plans/phase1_backend_plan.md b/.plans/phase1_backend_plan.md
index 0ccfe0f..4bef6ad 100644
--- a/.plans/phase1_backend_plan.md
+++ b/.plans/phase1_backend_plan.md
@@ -10,7 +10,7 @@
 ## Objective
 
 Build a complete FastAPI backend that:
-1. Accepts DOCX uploads, chunks text (1000 tokens / 200 overlap), embeds via Qwen, and stores in persistent ChromaDB with metadata
+1. Accepts DOCX and PDF uploads, chunks text (1000 tokens / 200 overlap), embeds via Qwen, and stores in persistent ChromaDB with metadata
 2. Runs a 3-step RAG pipeline: query decomposition → retrieval → relevance filtering → bullet-point response
 3. Serves API endpoints for ingestion and querying with full metadata attribution
 
@@ -18,7 +18,7 @@ Build a complete FastAPI backend that:
 
 ## Acceptance Criteria
 
-- [ ] `POST /api/v1/ingest` accepts DOCX, parses content, chunks at 1000/200, embeds, stores in ChromaDB with filename/upload_date/content_summary
+- [ ] `POST /api/v1/ingest` accepts DOCX and PDF, parses content, chunks at 1000/200, embeds, stores in ChromaDB with filename/upload_date/content_summary
 - [ ] `POST /api/v1/query` accepts natural language question, returns JSON with: `keywords`, `answer` (bullet points), `sources` (array of metadata objects)
 - [ ] Query pipeline executes 3 LLM calls: decomposition → relevance filter → response generation
 - [ ] All LLM/ASR configuration reads from `.env` (OpenRouter for dev)
@@ -33,6 +33,7 @@ Build a complete FastAPI backend that:
 
 **File**: `backend/app/test/acceptance/test_acceptance_phase1_ingest.py`
 - `test_ingest_docx_with_real_embedding()` — Upload DOCX, verify ChromaDB entries with metadata
+- `test_ingest_pdf_with_real_embedding()` — Upload PDF, verify ChromaDB entries with metadata
 
 **File**: `backend/app/test/acceptance/test_acceptance_phase1_rag_query.py`
 - `test_query_with_real_llm()` — Ask question, verify 3-step pipeline produces bullet answer with sources
@@ -49,7 +50,7 @@ Build a complete FastAPI backend that:
 - `test_phase1_database.py` — Test ChromaDB client initialization
 
 **Task 1.1.1**: Environment and dependencies
-- Create `backend/requirements.txt` with: fastapi, uvicorn[standard], pydantic, pydantic-settings, chromadb, sentence-transformers, python-docx, python-dotenv, httpx, pytest, pytest-asyncio, tiktoken
+- Create `backend/requirements.txt` with: fastapi, uvicorn[standard], pydantic, pydantic-settings, chromadb, sentence-transformers, python-docx, pypdf, python-dotenv, httpx, pytest, pytest-asyncio, tiktoken
 - Create `backend/.env.example` with: LLM_BASE_URL, LLM_API_KEY, LLM_MODEL_NAME, EMBEDDING_MODEL, EMBEDDING_BASE_URL, CHROMA_DB_PATH
 - Create `backend/app/core/config.py` — Pydantic Settings with `.env` loading
 
@@ -77,10 +78,11 @@ Build a complete FastAPI backend that:
 - `test_phase1_ingest.py` — Mock ChromaDB, test endpoint flow
 - `test_phase1_metadata.py` — Test metadata extraction
 
-**Task 1.2.1**: DOCX parsing
-- `utils/docx_parser.py`: `parse_docx(file_path) -> str`
-- Handle paragraphs, tables, headers
-- Return plain text with preserved paragraph breaks
+**Task 1.2.1**: Document parsing
+- `utils/docx_parser.py`: `parse_docx(file_path) -> str` — Extract text from DOCX
+- `utils/pdf_parser.py`: `parse_pdf(file_path) -> str` — Extract text from PDF using pypdf
+- Both return plain text with preserved paragraph breaks
+- Handle edge cases: empty docs, corrupted files, scanned PDFs (skip with warning)
 
 **Task 1.2.2**: Chunking abstraction
 - `utils/chunking.py`: Abstract base class `ChunkingStrategy`
@@ -101,7 +103,8 @@ Build a complete FastAPI backend that:
 
 **Task 1.2.5**: Ingest endpoint
 - `routers/ingest.py`: `POST /api/v1/ingest`
-- Accept `UploadFile` (DOCX only, validate extension)
+- Accept `UploadFile` (DOCX and PDF, validate extension)
+- Route to correct parser based on file extension
 - Orchestration: save temp → parse → chunk → extract metadata → embed → store → cleanup
 - Return `IngestResponse`
 
@@ -196,7 +199,9 @@ Build a complete FastAPI backend that:
 | Query Decomposer | `services/query_decomposer.py` | Extract keywords from question |
 | Relevance Filter | `services/relevance_filter.py` | Batch score chunk relevance |
 | RAG Service | `services/rag.py` | Embedding, retrieval, response generation |
-| DOCX Parser | `utils/docx_parser.py` | Extract text from DOCX |
+| Document Parser | `utils/document_parser.py` | Router to DOCX/PDF parsers |
+| DOCX Parser | `utils/parsers/docx_parser.py` | Extract text from DOCX |
+| PDF Parser | `utils/parsers/pdf_parser.py` | Extract text from PDF |
 | Chunking | `utils/chunking.py` | Token-based chunking with overlap |
 | Metadata | `utils/metadata.py` | Extract file metadata |
 
diff --git a/.plans/phase1_frontend_plan.md b/.plans/phase1_frontend_plan.md
index bce0b6c..c507513 100644
--- a/.plans/phase1_frontend_plan.md
+++ b/.plans/phase1_frontend_plan.md
@@ -99,7 +99,7 @@ Build a React frontend that:
 
 4. **IngestPanel component (optional for Phase 1)**
    - `src/components/IngestPanel.tsx`
-   - Simple file upload for DOCX
+   - Simple file upload for DOCX and PDF
    - Progress indicator during upload
    - Success/error feedback
 
@@ -126,7 +126,7 @@ Build a React frontend that:
    - Smooth transitions between states
 
 3. **Integration with backend**
-   - End-to-end test: upload DOCX → ask question → verify keywords + answer + sources
+   - End-to-end test: upload DOCX/PDF → ask question → verify keywords + answer + sources
    - Verify CORS works correctly
    - Test error scenarios
 
diff --git a/development_plan.md b/development_plan.md
index a54d87a..737587e 100644
--- a/development_plan.md
+++ b/development_plan.md
@@ -59,7 +59,7 @@ app/
 - **LLM/ASR Configuration**: Backend reads from `.env` for easy switching between development (OpenRouter / Alibaba Cloud) and production (local vLLM).
 - **RAG Database**: ChromaDB with metadata support (filename + extracted content metadata).
 - **Embedding Model**: `qwen/qwen3-embedding-4b` via sentence-transformers, provider-switchable via `.env` (OpenRouter for dev, local vLLM for prod).
-- **Document Ingestion**: Via UI (project-based demo, no user authentication). Supported formats: DOCX.
+- **Document Ingestion**: Via UI (project-based demo, no user authentication). Supported formats: DOCX, PDF.
 - **Chunking Strategy**: 1000 tokens per chunk, 200 token overlap. Strategy abstracted for future replacement.
 - **Video**: MP4 and common formats, maximum 300MB.
 - **ASR Flow**: Both **automatic** (on transcript updates) and **manual** "Ask from Video" button.
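
The extension-based routing that this change adds to the ingest plan could look roughly like the sketch below. It is only an illustration and not code from the repository. The `parse_docx(file_path) -> str` and `parse_pdf(file_path) -> str` signatures come from the plan, while `parse_document`, `PARSERS`, and `UnsupportedFormatError` are hypothetical names chosen here. The DOCX branch reads paragraphs only (tables and headers are left out), and a PDF with no extractable text, such as a scanned one, is assumed to yield an empty string plus a logged warning.

```python
import logging
from pathlib import Path
from typing import Callable, Dict, Optional

logger = logging.getLogger(__name__)


class UnsupportedFormatError(ValueError):
    """Hypothetical error for uploads whose extension has no parser."""


def parse_docx(file_path: str) -> str:
    """Extract paragraph text from a DOCX file (needs python-docx)."""
    # Imported lazily so the routing logic works without the package installed.
    from docx import Document

    document = Document(file_path)
    paragraphs = [p.text for p in document.paragraphs if p.text.strip()]
    return "\n\n".join(paragraphs)


def parse_pdf(file_path: str) -> str:
    """Extract page text from a PDF file (needs pypdf)."""
    from pypdf import PdfReader

    reader = PdfReader(file_path)
    pages = [(page.extract_text() or "").strip() for page in reader.pages]
    text = "\n\n".join(page for page in pages if page)
    if not text:
        # Assumption: scanned PDFs have no text layer, so warn and return "".
        logger.warning("No extractable text in %s; skipping", file_path)
    return text


# Hypothetical registry mapping a lowercase extension to its parser.
PARSERS: Dict[str, Callable[[str], str]] = {
    ".docx": parse_docx,
    ".pdf": parse_pdf,
}


def parse_document(
    file_path: str,
    parsers: Optional[Dict[str, Callable[[str], str]]] = None,
) -> str:
    """Pick a parser from the file extension and return plain text."""
    registry = PARSERS if parsers is None else parsers
    suffix = Path(file_path).suffix.lower()
    parser = registry.get(suffix)
    if parser is None:
        supported = ", ".join(sorted(registry))
        raise UnsupportedFormatError(
            f"Unsupported extension {suffix!r}; supported: {supported}"
        )
    return parser(file_path)
```

Keeping the registry as a plain dictionary means the ingest endpoint can validate the extension and select the parser in one lookup, and tests can pass in stub parsers instead of real files.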