251 lines
11 KiB
Markdown
251 lines
11 KiB
Markdown
# Phase 1 Backend Development Plan
|
|
|
|
**Source**: `development_plan.md`
|
|
**Scope**: FastAPI backend for text-based RAG Q&A
|
|
**Estimated Duration**: 3-4 days
|
|
**Status**: ✅ Complete (Phase 1.1, 1.2, 1.3, 1.4 all done)
|
|
|
|
---
|
|
|
|
## Objective
|
|
|
|
Build a complete FastAPI backend that:
|
|
1. Accepts DOCX and PDF uploads, chunks text (1000 tokens / 200 overlap), embeds via Qwen, and stores in persistent ChromaDB with metadata
|
|
2. Runs a 3-step RAG pipeline: query decomposition → retrieval → relevance filtering → bullet-point response
|
|
3. Serves API endpoints for ingestion and querying with full metadata attribution
|
|
|
|
---
|
|
|
|
## Acceptance Criteria
|
|
|
|
- [x] `POST /api/v1/ingest` accepts DOCX and PDF, parses content, chunks at 1000/200, embeds, stores in ChromaDB with filename/upload_date/content_summary
|
|
- [x] `POST /api/v1/query` accepts natural language question, returns JSON with: `keywords`, `answer` (bullet points), `sources` (array of metadata objects)
|
|
- [x] Query pipeline executes 3 LLM calls: decomposition → relevance filter → response generation
|
|
- [x] All LLM/ASR configuration reads from `.env` (OpenRouter for dev)
|
|
- [x] ChromaDB persists to `chroma_db/` directory
|
|
- [x] Chunking strategy is abstracted (interface/class) for future replacement
|
|
- [x] All unit tests pass (`pytest app/test/test_phase1_*.py -v`)
|
|
- [x] All acceptance tests pass (`pytest app/test/acceptance/ -v -m acceptance`)
|
|
|
|
---
|
|
|
|
## Acceptance Tests
|
|
|
|
**File**: `backend/app/test/acceptance/test_acceptance_phase1_ingest.py`
|
|
- `test_ingest_docx_with_real_embedding()` — Upload DOCX, verify ChromaDB entries with metadata
|
|
- `test_ingest_pdf_with_real_embedding()` — Upload PDF, verify ChromaDB entries with metadata
|
|
|
|
**File**: `backend/app/test/acceptance/test_acceptance_phase1_rag_query.py`
|
|
- `test_query_with_real_llm()` — Ask question, verify 3-step pipeline produces bullet answer with sources
|
|
- `test_query_keywords_displayed()` — Verify response includes extracted keywords
|
|
|
|
---
|
|
|
|
## Implementation Tasks
|
|
|
|
### Phase 1.1: Project Setup & Core Infrastructure
|
|
|
|
**Test files to write first**:
|
|
- `test_phase1_config.py` — Test config loads from .env correctly
|
|
- `test_phase1_database.py` — Test ChromaDB client initialization
|
|
|
|
**Task 1.1.1**: Environment and dependencies
|
|
- Create `backend/requirements.txt` with: fastapi, uvicorn[standard], pydantic, pydantic-settings, chromadb, sentence-transformers, python-docx, pypdf, python-dotenv, httpx, pytest, pytest-asyncio, tiktoken
|
|
- Create `backend/.env.example` with: LLM_BASE_URL, LLM_API_KEY, LLM_MODEL_NAME, EMBEDDING_MODEL, EMBEDDING_BASE_URL, CHROMA_DB_PATH
|
|
- Create `backend/app/core/config.py` — Pydantic Settings with `.env` loading
|
|
|
|
**Task 1.1.2**: Database initialization
|
|
- Create `backend/app/core/database.py` — ChromaDB persistent client
|
|
- Function: `get_chroma_client()` returns persistent client pointing to `chroma_db/`
|
|
- Function: `get_or_create_collection(name, embedding_function)`
|
|
|
|
**Task 1.1.3**: Project structure
|
|
- Create all `__init__.py` files for package structure
|
|
- Create `backend/app/main.py` with FastAPI app, CORS middleware
|
|
- Include routers: `app.include_router(ingest.router, prefix="/api/v1")`, etc.
|
|
|
|
**Task 1.1.4**: Pydantic schemas
|
|
- `models/ingest.py`: `IngestResponse` with `document_id`, `chunk_count`, `filename`
|
|
- `models/query.py`: `QueryRequest` with `question`; `QueryResponse` with `keywords`, `answer`, `sources`
|
|
- `models/common.py`: `SourceMetadata` with `filename`, `upload_date`, `content_summary`, `chunk_index`
|
|
|
|
**Commit**: "feat: Phase 1.1 project setup with config, database, and models"
|
|
|
|
**Status**: ✅ Complete
|
|
**Tests**: 5 passed (2 config, 3 database)
|
|
|
|
---
|
|
|
|
### Phase 1.2: Ingestion Pipeline
|
|
|
|
**Test files to write first**:
|
|
- `test_phase1_chunking.py` — Test 1000/200 chunking with various text sizes
|
|
- `test_phase1_ingest.py` — Mock ChromaDB, test endpoint flow
|
|
- `test_phase1_metadata.py` — Test metadata extraction
|
|
|
|
**Task 1.2.1**: Document parsing
|
|
- `utils/docx_parser.py`: `parse_docx(file_path) -> str` — Extract text from DOCX
|
|
- `utils/pdf_parser.py`: `parse_pdf(file_path) -> str` — Extract text from PDF using pypdf
|
|
- Both return plain text with preserved paragraph breaks
|
|
- Handle edge cases: empty docs, corrupted files, scanned PDFs (skip with warning)
|
|
|
|
**Task 1.2.2**: Chunking abstraction
|
|
- `utils/chunking.py`: Abstract base class `ChunkingStrategy`
|
|
- `TokenChunkingStrategy` implementation using tiktoken
|
|
- Config: chunk_size=1000, overlap=200
|
|
- Method: `chunk(text: str) -> list[str]`
|
|
|
|
**Task 1.2.3**: Metadata extraction
|
|
- `utils/metadata.py`: `extract_metadata(file_path, chunks) -> list[dict]`
|
|
- Returns list of metadata dicts matching chunk count
|
|
- Each metadata has: `filename`, `upload_date`, `content_summary` (first 200 chars of chunk)
|
|
|
|
**Task 1.2.4**: Embedding service
|
|
- `services/rag.py`: `RAGService` class
|
|
- Initialize embedding function with `qwen/qwen3-embedding-4b`
|
|
- Method: `ingest_document(file_path, chunks, metadata_list)`
|
|
- Store in ChromaDB collection "documents"
|
|
|
|
**Task 1.2.5**: Ingest endpoint
|
|
- `routers/ingest.py`: `POST /api/v1/ingest`
|
|
- Accept `UploadFile` (DOCX and PDF, validate extension)
|
|
- Route to correct parser based on file extension
|
|
- Orchestration: save temp → parse → chunk → extract metadata → embed → store → cleanup
|
|
- Return `IngestResponse`
|
|
|
|
**Commit**: "feat: Phase 1.2 ingestion pipeline with chunking and metadata"
|
|
|
|
**Status**: ✅ Complete
|
|
**Tests**: 20 passed, 2 skipped (python-docx not installed in test env)
|
|
**Coverage**: chunking (4), metadata (3), parsers (5), RAGService (6), ingest endpoint (4)
|
|
|
|
---
|
|
|
|
### Phase 1.3: Query Pipeline (3-Step)
|
|
|
|
**Test files to write first**:
|
|
- `test_phase1_llm_client.py` — Test LLM client error handling
|
|
- `test_phase1_rag_service.py` — Test retrieval and response generation
|
|
- `test_phase1_query.py` — Test full pipeline with mocked LLM calls
|
|
|
|
**Task 1.3.1**: LLM client — ✅ Done in Phase 1.1
|
|
- `services/llm_client.py`: `LLMClient` class — Implemented
|
|
- Constructor takes config from `Settings`
|
|
- Method: `complete(prompt: str, temperature: float = 0.7) -> str`
|
|
- Use httpx with OpenAI-compatible API format
|
|
- Handle errors gracefully
|
|
|
|
**Task 1.3.2**: Query decomposition
|
|
- `services/query_decomposer.py`: `QueryDecomposer` class — 🔄 Pending
|
|
- Prompt template: "Given question: '{question}', extract key search keywords as JSON array"
|
|
- Method: `decompose(question: str) -> list[str]`
|
|
- Parse LLM JSON response into list of keywords
|
|
|
|
**Task 1.3.3**: Retrieval from ChromaDB — ✅ Done in Phase 1.2
|
|
- `services/rag.py`: `retrieve(query_keywords: list[str], n_results: int = 10)` — Implemented
|
|
- Join keywords with space for query text
|
|
- Return list of `(chunk_text, metadata, distance)` tuples
|
|
|
|
**Task 1.3.4**: Relevance filtering
|
|
- `services/relevance_filter.py`: `RelevanceFilter` class — 🔄 Pending
|
|
- Prompt: "Given question '{question}' and these document chunks, rate each 0-10 for relevance. Return JSON array of scores."
|
|
- Input: list of chunks
|
|
- Output: filtered list of (chunk, metadata) with score > threshold (e.g., 7)
|
|
- Batch all chunks in single LLM call
|
|
|
|
**Task 1.3.5**: Response generation — ✅ Done in Phase 1.2
|
|
- `services/rag.py`: `generate_response(question: str, chunks: list, metadata: list) -> str` — Implemented
|
|
- Prompt: "Answer question using ONLY these document chunks. Format as bullet points. Cite sources."
|
|
- Include chunk content and metadata in context
|
|
- Enforce bullet-point format via prompt
|
|
|
|
**Task 1.3.6**: Query endpoint
|
|
- `routers/query.py`: `POST /api/v1/query` — 🔄 Pending
|
|
- Full pipeline orchestration:
|
|
1. Call `query_decomposer.decompose()` → get keywords
|
|
2. Call `rag.retrieve()` → get chunks
|
|
3. Call `relevance_filter.filter()` → filter chunks
|
|
4. Call `rag.generate_response()` → get answer
|
|
- Return `QueryResponse` with keywords, answer, sources
|
|
|
|
**Commit**: "feat: Phase 1.3 query pipeline with decomposition, relevance filter, and response"
|
|
|
|
**Status**: ✅ Complete
|
|
**Tests**: 13 passed (5 decomposer, 5 relevance filter, 3 query endpoint)
|
|
|
|
---
|
|
|
|
### Phase 1.4: Testing & Polish
|
|
|
|
**Test files to write first**:
|
|
- `test_acceptance_phase1_ingest.py` — Real embedding test
|
|
- `test_acceptance_phase1_rag_query.py` — Real LLM pipeline test
|
|
|
|
**Task 1.4.1**: Unit tests
|
|
- Run `pytest app/test/test_phase1_*.py -v` — all must pass
|
|
- Add missing test coverage for edge cases
|
|
|
|
**Task 1.4.2**: Acceptance tests
|
|
- Create real `.env` with OpenRouter credentials
|
|
- Run `test_acceptance_phase1_ingest.py` with real embedding
|
|
- Run `test_acceptance_phase1_rag_query.py` with real LLM calls
|
|
- Verify keywords appear, answer is bullet format, sources have metadata
|
|
|
|
**Task 1.4.3**: Error handling
|
|
- Add try/except in all endpoints
|
|
- Return proper HTTP status codes (400 for bad input, 500 for LLM errors)
|
|
- Log errors with context
|
|
|
|
**Task 1.4.4**: Documentation
|
|
- Update `AGENTS.md` if any conventions changed
|
|
- Add docstrings to all public methods
|
|
- Verify all imports work
|
|
|
|
**Commit**: "feat: Phase 1.4 acceptance tests, error handling, and polish"
|
|
|
|
**Status**: ✅ Complete
|
|
**Tests**: 41 unit tests passed (2 skipped), 5 acceptance tests passed
|
|
**Acceptance**: Full 3-step pipeline verified with real OpenRouter LLM calls
|
|
|
|
---
|
|
|
|
## Services Status
|
|
|
|
| Service | File | Status | Responsibility |
|
|
|---------|------|--------|----------------|
|
|
| Config | `core/config.py` | ✅ Complete | `.env` loading, Settings class |
|
|
| Database | `core/database.py` | ✅ Complete | ChromaDB persistent client |
|
|
| LLM Client | `services/llm_client.py` | ✅ Complete | OpenAI-compatible API wrapper |
|
|
| Query Decomposer | `services/query_decomposer.py` | ✅ Complete | Extract keywords from question |
|
|
| Relevance Filter | `services/relevance_filter.py` | ✅ Complete | Batch score chunk relevance |
|
|
| RAG Service | `services/rag.py` | ✅ Complete | Embedding, retrieval, response generation |
|
|
| Ingest Router | `routers/ingest.py` | ✅ Complete | POST /api/v1/ingest endpoint |
|
|
| Query Router | `routers/query.py` | ✅ Complete | POST /api/v1/query endpoint |
|
|
| DOCX Parser | `utils/docx_parser.py` | ✅ Complete | Extract text from DOCX |
|
|
| PDF Parser | `utils/pdf_parser.py` | ✅ Complete | Extract text from PDF |
|
|
| Chunking | `utils/chunking.py` | ✅ Complete | Token-based chunking with overlap |
|
|
| Metadata | `utils/metadata.py` | ✅ Complete | Extract file metadata |
|
|
|
|
---
|
|
|
|
## Environment Variables
|
|
|
|
```bash
|
|
LLM_BASE_URL=https://openrouter.ai/api/v1
|
|
LLM_API_KEY=your_openrouter_key
|
|
LLM_MODEL_NAME=qwen/qwen3.5-35b-a3b
|
|
EMBEDDING_MODEL=qwen/qwen3-embedding-4b
|
|
EMBEDDING_BASE_URL=https://openrouter.ai/api/v1
|
|
CHROMA_DB_PATH=./chroma_db
|
|
```
|
|
|
|
---
|
|
|
|
## Notes
|
|
|
|
- Chunking strategy uses ABC pattern for easy future replacement
|
|
- Relevance filtering uses single batch call for efficiency
|
|
- All LLM calls go through `LLMClient` for consistent error handling
|
|
- ChromaDB collection name: "documents"
|
|
- Metadata fields: filename, upload_date (ISO format), content_summary, chunk_index
|
|
- Response format enforced purely through prompt engineering (no JSON schema) |