11 KiB

Raw Blame History

Phase 1 Backend Development Plan

Source: development_plan.md
Scope: FastAPI backend for text-based RAG Q&A
Estimated Duration: 3-4 days
Status: ✅ Complete (Phase 1.1, 1.2, 1.3, 1.4 all done)

Objective

Build a complete FastAPI backend that:

Accepts DOCX and PDF uploads, chunks text (1000 tokens / 200 overlap), embeds via Qwen, and stores in persistent ChromaDB with metadata
Runs a 3-step RAG pipeline: query decomposition → retrieval → relevance filtering → bullet-point response
Serves API endpoints for ingestion and querying with full metadata attribution

Acceptance Criteria

POST /api/v1/ingest accepts DOCX and PDF, parses content, chunks at 1000/200, embeds, stores in ChromaDB with filename/upload_date/content_summary
POST /api/v1/query accepts natural language question, returns JSON with: keywords, answer (bullet points), sources (array of metadata objects)
Query pipeline executes 3 LLM calls: decomposition → relevance filter → response generation
All LLM/ASR configuration reads from .env (OpenRouter for dev)
ChromaDB persists to chroma_db/ directory
Chunking strategy is abstracted (interface/class) for future replacement
All unit tests pass (pytest app/test/test_phase1_*.py -v)
All acceptance tests pass (pytest app/test/acceptance/ -v -m acceptance)

Acceptance Tests

File: backend/app/test/acceptance/test_acceptance_phase1_ingest.py

test_ingest_docx_with_real_embedding() — Upload DOCX, verify ChromaDB entries with metadata
test_ingest_pdf_with_real_embedding() — Upload PDF, verify ChromaDB entries with metadata

File: backend/app/test/acceptance/test_acceptance_phase1_rag_query.py

test_query_with_real_llm() — Ask question, verify 3-step pipeline produces bullet answer with sources
test_query_keywords_displayed() — Verify response includes extracted keywords

Implementation Tasks

Phase 1.1: Project Setup & Core Infrastructure

Test files to write first:

test_phase1_config.py — Test config loads from .env correctly
test_phase1_database.py — Test ChromaDB client initialization

Task 1.1.1: Environment and dependencies

Create backend/requirements.txt with: fastapi, uvicorn[standard], pydantic, pydantic-settings, chromadb, sentence-transformers, python-docx, pypdf, python-dotenv, httpx, pytest, pytest-asyncio, tiktoken
Create backend/.env.example with: LLM_BASE_URL, LLM_API_KEY, LLM_MODEL_NAME, EMBEDDING_MODEL, EMBEDDING_BASE_URL, CHROMA_DB_PATH
Create backend/app/core/config.py — Pydantic Settings with .env loading

Task 1.1.2: Database initialization

Create backend/app/core/database.py — ChromaDB persistent client
Function: get_chroma_client() returns persistent client pointing to chroma_db/
Function: get_or_create_collection(name, embedding_function)

Task 1.1.3: Project structure

Create all __init__.py files for package structure
Create backend/app/main.py with FastAPI app, CORS middleware
Include routers: app.include_router(ingest.router, prefix="/api/v1"), etc.

Task 1.1.4: Pydantic schemas

models/ingest.py: IngestResponse with document_id, chunk_count, filename
models/query.py: QueryRequest with question; QueryResponse with keywords, answer, sources
models/common.py: SourceMetadata with filename, upload_date, content_summary, chunk_index

Commit: "feat: Phase 1.1 project setup with config, database, and models"

Status: ✅ Complete
Tests: 5 passed (2 config, 3 database)

Phase 1.2: Ingestion Pipeline

Test files to write first:

test_phase1_chunking.py — Test 1000/200 chunking with various text sizes
test_phase1_ingest.py — Mock ChromaDB, test endpoint flow
test_phase1_metadata.py — Test metadata extraction

Task 1.2.1: Document parsing

utils/docx_parser.py: parse_docx(file_path) -> str — Extract text from DOCX
utils/pdf_parser.py: parse_pdf(file_path) -> str — Extract text from PDF using pypdf
Both return plain text with preserved paragraph breaks
Handle edge cases: empty docs, corrupted files, scanned PDFs (skip with warning)

Task 1.2.2: Chunking abstraction

utils/chunking.py: Abstract base class ChunkingStrategy
TokenChunkingStrategy implementation using tiktoken
Config: chunk_size=1000, overlap=200
Method: chunk(text: str) -> list[str]

Task 1.2.3: Metadata extraction

utils/metadata.py: extract_metadata(file_path, chunks) -> list[dict]
Returns list of metadata dicts matching chunk count
Each metadata has: filename, upload_date, content_summary (first 200 chars of chunk)

Task 1.2.4: Embedding service

services/rag.py: RAGService class
Initialize embedding function with qwen/qwen3-embedding-4b
Method: ingest_document(file_path, chunks, metadata_list)
Store in ChromaDB collection "documents"

Task 1.2.5: Ingest endpoint

routers/ingest.py: POST /api/v1/ingest
Accept UploadFile (DOCX and PDF, validate extension)
Route to correct parser based on file extension
Orchestration: save temp → parse → chunk → extract metadata → embed → store → cleanup
Return IngestResponse

Commit: "feat: Phase 1.2 ingestion pipeline with chunking and metadata"

Status: ✅ Complete
Tests: 20 passed, 2 skipped (python-docx not installed in test env) Coverage: chunking (4), metadata (3), parsers (5), RAGService (6), ingest endpoint (4)

Phase 1.3: Query Pipeline (3-Step)

Test files to write first:

test_phase1_llm_client.py — Test LLM client error handling
test_phase1_rag_service.py — Test retrieval and response generation
test_phase1_query.py — Test full pipeline with mocked LLM calls

Task 1.3.1: LLM client — ✅ Done in Phase 1.1

services/llm_client.py: LLMClient class — Implemented
Constructor takes config from Settings
Method: complete(prompt: str, temperature: float = 0.7) -> str
Use httpx with OpenAI-compatible API format
Handle errors gracefully

Task 1.3.2: Query decomposition

services/query_decomposer.py: QueryDecomposer class — 🔄 Pending
Prompt template: "Given question: '{question}', extract key search keywords as JSON array"
Method: decompose(question: str) -> list[str]
Parse LLM JSON response into list of keywords

Task 1.3.3: Retrieval from ChromaDB — ✅ Done in Phase 1.2

services/rag.py: retrieve(query_keywords: list[str], n_results: int = 10) — Implemented
Join keywords with space for query text
Return list of (chunk_text, metadata, distance) tuples

Task 1.3.4: Relevance filtering

services/relevance_filter.py: RelevanceFilter class — 🔄 Pending
Prompt: "Given question '{question}' and these document chunks, rate each 0-10 for relevance. Return JSON array of scores."
Input: list of chunks
Output: filtered list of (chunk, metadata) with score > threshold (e.g., 7)
Batch all chunks in single LLM call

Task 1.3.5: Response generation — ✅ Done in Phase 1.2

services/rag.py: generate_response(question: str, chunks: list, metadata: list) -> str — Implemented
Prompt: "Answer question using ONLY these document chunks. Format as bullet points. Cite sources."
Include chunk content and metadata in context
Enforce bullet-point format via prompt

Task 1.3.6: Query endpoint

routers/query.py: POST /api/v1/query — 🔄 Pending
Full pipeline orchestration:
1. Call query_decomposer.decompose() → get keywords
2. Call rag.retrieve() → get chunks
3. Call relevance_filter.filter() → filter chunks
4. Call rag.generate_response() → get answer
Return QueryResponse with keywords, answer, sources

Commit: "feat: Phase 1.3 query pipeline with decomposition, relevance filter, and response"

Status: ✅ Complete
Tests: 13 passed (5 decomposer, 5 relevance filter, 3 query endpoint)

Phase 1.4: Testing & Polish

Test files to write first:

test_acceptance_phase1_ingest.py — Real embedding test
test_acceptance_phase1_rag_query.py — Real LLM pipeline test

Task 1.4.1: Unit tests

Run pytest app/test/test_phase1_*.py -v — all must pass
Add missing test coverage for edge cases

Task 1.4.2: Acceptance tests

Create real .env with OpenRouter credentials
Run test_acceptance_phase1_ingest.py with real embedding
Run test_acceptance_phase1_rag_query.py with real LLM calls
Verify keywords appear, answer is bullet format, sources have metadata

Task 1.4.3: Error handling

Add try/except in all endpoints
Return proper HTTP status codes (400 for bad input, 500 for LLM errors)
Log errors with context

Task 1.4.4: Documentation

Update AGENTS.md if any conventions changed
Add docstrings to all public methods
Verify all imports work

Commit: "feat: Phase 1.4 acceptance tests, error handling, and polish"

Status: ✅ Complete
Tests: 41 unit tests passed (2 skipped), 5 acceptance tests passed
Acceptance: Full 3-step pipeline verified with real OpenRouter LLM calls

Services Status

Service	File	Status	Responsibility
Config	`core/config.py`	✅ Complete	`.env` loading, Settings class
Database	`core/database.py`	✅ Complete	ChromaDB persistent client
LLM Client	`services/llm_client.py`	✅ Complete	OpenAI-compatible API wrapper
Query Decomposer	`services/query_decomposer.py`	✅ Complete	Extract keywords from question
Relevance Filter	`services/relevance_filter.py`	✅ Complete	Batch score chunk relevance
RAG Service	`services/rag.py`	✅ Complete	Embedding, retrieval, response generation
Ingest Router	`routers/ingest.py`	✅ Complete	POST /api/v1/ingest endpoint
Query Router	`routers/query.py`	✅ Complete	POST /api/v1/query endpoint
DOCX Parser	`utils/docx_parser.py`	✅ Complete	Extract text from DOCX
PDF Parser	`utils/pdf_parser.py`	✅ Complete	Extract text from PDF
Chunking	`utils/chunking.py`	✅ Complete	Token-based chunking with overlap
Metadata	`utils/metadata.py`	✅ Complete	Extract file metadata

Environment Variables

LLM_BASE_URL=https://openrouter.ai/api/v1
LLM_API_KEY=your_openrouter_key
LLM_MODEL_NAME=qwen/qwen3.5-35b-a3b
EMBEDDING_MODEL=qwen/qwen3-embedding-4b
EMBEDDING_BASE_URL=https://openrouter.ai/api/v1
CHROMA_DB_PATH=./chroma_db

Notes

Chunking strategy uses ABC pattern for easy future replacement
Relevance filtering uses single batch call for efficiency
All LLM calls go through LLMClient for consistent error handling
ChromaDB collection name: "documents"
Metadata fields: filename, upload_date (ISO format), content_summary, chunk_index
Response format enforced purely through prompt engineering (no JSON schema)

11 KiB Raw Blame History

Phase 1 Backend Development Plan

Objective

Acceptance Criteria

Acceptance Tests

Implementation Tasks

Phase 1.1: Project Setup & Core Infrastructure

Phase 1.2: Ingestion Pipeline

Phase 1.3: Query Pipeline (3-Step)

Phase 1.4: Testing & Polish

Services Status

Environment Variables

Notes

11 KiB

Raw Blame History