docs: add PDF support alongside DOCX in all plans
parent 1518b72969
commit abffc9cf1d

@@ -10,7 +10,7 @@
 ## Objective

 Build a complete FastAPI backend that:
-1. Accepts DOCX uploads, chunks text (1000 tokens / 200 overlap), embeds via Qwen, and stores in persistent ChromaDB with metadata
+1. Accepts DOCX and PDF uploads, chunks text (1000 tokens / 200 overlap), embeds via Qwen, and stores in persistent ChromaDB with metadata
 2. Runs a 3-step RAG pipeline: query decomposition → retrieval → relevance filtering → bullet-point response
 3. Serves API endpoints for ingestion and querying with full metadata attribution

@@ -18,7 +18,7 @@ Build a complete FastAPI backend that:

 ## Acceptance Criteria

-- [ ] `POST /api/v1/ingest` accepts DOCX, parses content, chunks at 1000/200, embeds, stores in ChromaDB with filename/upload_date/content_summary
+- [ ] `POST /api/v1/ingest` accepts DOCX and PDF, parses content, chunks at 1000/200, embeds, stores in ChromaDB with filename/upload_date/content_summary
 - [ ] `POST /api/v1/query` accepts natural language question, returns JSON with: `keywords`, `answer` (bullet points), `sources` (array of metadata objects)
 - [ ] Query pipeline executes 3 LLM calls: decomposition → relevance filter → response generation
 - [ ] All LLM/ASR configuration reads from `.env` (OpenRouter for dev)
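The `POST /api/v1/query` criterion above names three response fields. A minimal sketch of that shape follows, using only the standard library. The class names `Source` and `QueryResponse`, the field types, and the sample values are assumptions for illustration; the plan itself only fixes the key names `keywords`, `answer`, `sources` and the metadata keys `filename`, `upload_date`, `content_summary`.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Source:
    # Metadata keys taken from the ingest criterion; types are assumed.
    filename: str
    upload_date: str
    content_summary: str


@dataclass
class QueryResponse:
    keywords: list[str]
    answer: list[str]  # one entry per bullet point (assumed representation)
    sources: list[Source] = field(default_factory=list)


# Hypothetical sample values, not taken from the plan.
example = QueryResponse(
    keywords=["refund", "policy"],
    answer=["Refunds are issued within 30 days."],
    sources=[Source("handbook.pdf", "2024-01-01", "Customer policy handbook")],
)

# asdict() recurses into nested dataclasses, so the result is JSON-serializable.
payload = json.dumps(asdict(example))
```

In the actual backend a Pydantic model would likely replace the dataclasses, since FastAPI validates and serializes those directly.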
@@ -33,6 +33,7 @@ Build a complete FastAPI backend that:

 **File**: `backend/app/test/acceptance/test_acceptance_phase1_ingest.py`
 - `test_ingest_docx_with_real_embedding()` — Upload DOCX, verify ChromaDB entries with metadata
+- `test_ingest_pdf_with_real_embedding()` — Upload PDF, verify ChromaDB entries with metadata

 **File**: `backend/app/test/acceptance/test_acceptance_phase1_rag_query.py`
 - `test_query_with_real_llm()` — Ask question, verify 3-step pipeline produces bullet answer with sources
@@ -49,7 +50,7 @@ Build a complete FastAPI backend that:
 - `test_phase1_database.py` — Test ChromaDB client initialization

 **Task 1.1.1**: Environment and dependencies
-- Create `backend/requirements.txt` with: fastapi, uvicorn[standard], pydantic, pydantic-settings, chromadb, sentence-transformers, python-docx, python-dotenv, httpx, pytest, pytest-asyncio, tiktoken
+- Create `backend/requirements.txt` with: fastapi, uvicorn[standard], pydantic, pydantic-settings, chromadb, sentence-transformers, python-docx, pypdf, python-dotenv, httpx, pytest, pytest-asyncio, tiktoken
 - Create `backend/.env.example` with: LLM_BASE_URL, LLM_API_KEY, LLM_MODEL_NAME, EMBEDDING_MODEL, EMBEDDING_BASE_URL, CHROMA_DB_PATH
 - Create `backend/app/core/config.py` — Pydantic Settings with `.env` loading

@@ -77,10 +78,11 @@ Build a complete FastAPI backend that:
 - `test_phase1_ingest.py` — Mock ChromaDB, test endpoint flow
 - `test_phase1_metadata.py` — Test metadata extraction

-**Task 1.2.1**: DOCX parsing
-- `utils/docx_parser.py`: `parse_docx(file_path) -> str`
-- Handle paragraphs, tables, headers
-- Return plain text with preserved paragraph breaks
+**Task 1.2.1**: Document parsing
+- `utils/docx_parser.py`: `parse_docx(file_path) -> str` — Extract text from DOCX
+- `utils/pdf_parser.py`: `parse_pdf(file_path) -> str` — Extract text from PDF using pypdf
+- Both return plain text with preserved paragraph breaks
+- Handle edge cases: empty docs, corrupted files, scanned PDFs (skip with warning)

 **Task 1.2.2**: Chunking abstraction
 - `utils/chunking.py`: Abstract base class `ChunkingStrategy`
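One possible shape for the `parse_pdf` function described in Task 1.2.1 is sketched below. It assumes pypdf's `PdfReader` and `extract_text()`, and it treats a PDF that yields no text as a likely scan, which is logged and returned as an empty string. The helper `join_page_texts` and the choice to separate pages with a blank line are assumptions, not something the plan specifies. Handling of corrupted files is left out of this sketch.

```python
import logging

logger = logging.getLogger(__name__)


def join_page_texts(page_texts: list[str]) -> str:
    """Join per-page text with blank lines, dropping pages with no text."""
    cleaned = [text.strip() for text in page_texts if text and text.strip()]
    return "\n\n".join(cleaned)


def parse_pdf(file_path: str) -> str:
    """Extract plain text from a PDF; returns "" when nothing is extractable."""
    # Imported lazily so join_page_texts can be used without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(file_path)
    text = join_page_texts([page.extract_text() or "" for page in reader.pages])
    if not text:
        # Assumed heuristic: no text layer usually means a scanned document.
        logger.warning("No extractable text in %s; skipping (scanned PDF?)", file_path)
    return text
```

Keeping the joining logic in its own function makes the paragraph-break behavior testable without a real PDF fixture.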
@@ -101,7 +103,8 @@ Build a complete FastAPI backend that:

 **Task 1.2.5**: Ingest endpoint
 - `routers/ingest.py`: `POST /api/v1/ingest`
-- Accept `UploadFile` (DOCX only, validate extension)
+- Accept `UploadFile` (DOCX and PDF, validate extension)
+- Route to correct parser based on file extension
 - Orchestration: save temp → parse → chunk → extract metadata → embed → store → cleanup
 - Return `IngestResponse`

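The extension validation and parser routing added to Task 1.2.5 could look roughly like the sketch below. The `PARSERS` mapping and the `select_parser` name are hypothetical, and the two parser functions are placeholders standing in for the real `parse_docx` and `parse_pdf`.

```python
from pathlib import Path
from typing import Callable


def parse_docx(file_path: str) -> str:
    raise NotImplementedError("placeholder for utils/docx_parser.py")


def parse_pdf(file_path: str) -> str:
    raise NotImplementedError("placeholder for utils/pdf_parser.py")


# Hypothetical lookup table: lowercase extension -> parser function.
PARSERS: dict[str, Callable[[str], str]] = {
    ".docx": parse_docx,
    ".pdf": parse_pdf,
}


def select_parser(filename: str) -> Callable[[str], str]:
    """Return the parser for a filename, or raise ValueError if unsupported."""
    extension = Path(filename).suffix.lower()
    if extension not in PARSERS:
        supported = ", ".join(sorted(PARSERS))
        raise ValueError(f"Unsupported file type {extension!r}; expected one of: {supported}")
    return PARSERS[extension]
```

In the endpoint, the `ValueError` would presumably be translated into an HTTP 4xx response before any temp file is written.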
@@ -196,7 +199,9 @@ Build a complete FastAPI backend that:
 | Query Decomposer | `services/query_decomposer.py` | Extract keywords from question |
 | Relevance Filter | `services/relevance_filter.py` | Batch score chunk relevance |
 | RAG Service | `services/rag.py` | Embedding, retrieval, response generation |
-| DOCX Parser | `utils/docx_parser.py` | Extract text from DOCX |
+| Document Parser | `utils/document_parser.py` | Router to DOCX/PDF parsers |
+| DOCX Parser | `utils/parsers/docx_parser.py` | Extract text from DOCX |
+| PDF Parser | `utils/parsers/pdf_parser.py` | Extract text from PDF |
 | Chunking | `utils/chunking.py` | Token-based chunking with overlap |
 | Metadata | `utils/metadata.py` | Extract file metadata |

@@ -99,7 +99,7 @@ Build a React frontend that:

 4. **IngestPanel component (optional for Phase 1)**
 - `src/components/IngestPanel.tsx`
-- Simple file upload for DOCX
+- Simple file upload for DOCX and PDF
 - Progress indicator during upload
 - Success/error feedback

@@ -126,7 +126,7 @@ Build a React frontend that:
 - Smooth transitions between states

 3. **Integration with backend**
-- End-to-end test: upload DOCX → ask question → verify keywords + answer + sources
+- End-to-end test: upload DOCX/PDF → ask question → verify keywords + answer + sources
 - Verify CORS works correctly
 - Test error scenarios

@@ -59,7 +59,7 @@ app/
 - **LLM/ASR Configuration**: Backend reads from `.env` for easy switching between development (OpenRouter / Alibaba Cloud) and production (local vLLM).
 - **RAG Database**: ChromaDB with metadata support (filename + extracted content metadata).
 - **Embedding Model**: `qwen/qwen3-embedding-4b` via sentence-transformers, provider-switchable via `.env` (OpenRouter for dev, local vLLM for prod).
-- **Document Ingestion**: Via UI (project-based demo, no user authentication). Supported formats: DOCX.
+- **Document Ingestion**: Via UI (project-based demo, no user authentication). Supported formats: DOCX, PDF.
 - **Chunking Strategy**: 1000 tokens per chunk, 200 token overlap. Strategy abstracted for future replacement.
 - **Video**: MP4 and common formats, maximum 300MB.
 - **ASR Flow**: Both **automatic** (on transcript updates) and **manual** "Ask from Video" button.
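The chunking rule above (1000 tokens per chunk, 200 token overlap, behind a replaceable strategy) can be sketched as follows. Only the `ChunkingStrategy` class name comes from the plan; the `chunk` method signature, the `FixedTokenChunking` subclass, and the use of a pre-tokenized list are assumptions. The plan lists tiktoken as a dependency, so the real tokens would presumably come from it.

```python
from abc import ABC, abstractmethod


class ChunkingStrategy(ABC):
    """Interface so the chunking rule can be swapped out later."""

    @abstractmethod
    def chunk(self, tokens: list[str]) -> list[list[str]]:
        ...


class FixedTokenChunking(ChunkingStrategy):
    """Fixed-size windows that share `overlap` tokens with their predecessor."""

    def __init__(self, chunk_size: int = 1000, overlap: int = 200) -> None:
        if chunk_size <= 0 or not 0 <= overlap < chunk_size:
            raise ValueError("require chunk_size > 0 and 0 <= overlap < chunk_size")
        self.chunk_size = chunk_size
        self.overlap = overlap

    def chunk(self, tokens: list[str]) -> list[list[str]]:
        step = self.chunk_size - self.overlap  # 800 with the defaults
        chunks: list[list[str]] = []
        for start in range(0, len(tokens), step):
            chunks.append(tokens[start:start + self.chunk_size])
            # Stop once a window reaches the end, so no tail-only duplicate is emitted.
            if start + self.chunk_size >= len(tokens):
                break
        return chunks
```

With the defaults, each window starts 800 tokens after the previous one, so consecutive chunks share exactly 200 tokens and the final chunk may be shorter than 1000.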