legco_ai_assistant/.plans/phase1_backend_plan.md

11 KiB

Phase 1 Backend Development Plan

Source: development_plan.md
Scope: FastAPI backend for text-based RAG Q&A
Estimated Duration: 3-4 days
Status: Complete (Phase 1.1, 1.2, 1.3, 1.4 all done)


Objective

Build a complete FastAPI backend that:

  1. Accepts DOCX and PDF uploads, chunks text (1000 tokens / 200 overlap), embeds via Qwen, and stores in persistent ChromaDB with metadata
  2. Runs a 3-step RAG pipeline: query decomposition → retrieval → relevance filtering → bullet-point response
  3. Serves API endpoints for ingestion and querying with full metadata attribution

Acceptance Criteria

  • POST /api/v1/ingest accepts DOCX and PDF, parses content, chunks at 1000/200, embeds, stores in ChromaDB with filename/upload_date/content_summary
  • POST /api/v1/query accepts natural language question, returns JSON with: keywords, answer (bullet points), sources (array of metadata objects)
  • Query pipeline executes 3 LLM calls: decomposition → relevance filter → response generation
  • All LLM/ASR configuration reads from .env (OpenRouter for dev)
  • ChromaDB persists to chroma_db/ directory
  • Chunking strategy is abstracted (interface/class) for future replacement
  • All unit tests pass (pytest app/test/test_phase1_*.py -v)
  • All acceptance tests pass (pytest app/test/acceptance/ -v -m acceptance)

Acceptance Tests

File: backend/app/test/acceptance/test_acceptance_phase1_ingest.py

  • test_ingest_docx_with_real_embedding() — Upload DOCX, verify ChromaDB entries with metadata
  • test_ingest_pdf_with_real_embedding() — Upload PDF, verify ChromaDB entries with metadata

File: backend/app/test/acceptance/test_acceptance_phase1_rag_query.py

  • test_query_with_real_llm() — Ask question, verify 3-step pipeline produces bullet answer with sources
  • test_query_keywords_displayed() — Verify response includes extracted keywords

Implementation Tasks

Phase 1.1: Project Setup & Core Infrastructure

Test files to write first:

  • test_phase1_config.py — Test config loads from .env correctly
  • test_phase1_database.py — Test ChromaDB client initialization

Task 1.1.1: Environment and dependencies

  • Create backend/requirements.txt with: fastapi, uvicorn[standard], pydantic, pydantic-settings, chromadb, sentence-transformers, python-docx, pypdf, python-dotenv, httpx, pytest, pytest-asyncio, tiktoken
  • Create backend/.env.example with: LLM_BASE_URL, LLM_API_KEY, LLM_MODEL_NAME, EMBEDDING_MODEL, EMBEDDING_BASE_URL, CHROMA_DB_PATH
  • Create backend/app/core/config.py — Pydantic Settings with .env loading

Task 1.1.2: Database initialization

  • Create backend/app/core/database.py — ChromaDB persistent client
  • Function: get_chroma_client() returns persistent client pointing to chroma_db/
  • Function: get_or_create_collection(name, embedding_function)

Task 1.1.3: Project structure

  • Create all __init__.py files for package structure
  • Create backend/app/main.py with FastAPI app, CORS middleware
  • Include routers: app.include_router(ingest.router, prefix="/api/v1"), etc.

Task 1.1.4: Pydantic schemas

  • models/ingest.py: IngestResponse with document_id, chunk_count, filename
  • models/query.py: QueryRequest with question; QueryResponse with keywords, answer, sources
  • models/common.py: SourceMetadata with filename, upload_date, content_summary, chunk_index

Commit: "feat: Phase 1.1 project setup with config, database, and models"

Status: Complete
Tests: 5 passed (2 config, 3 database)


Phase 1.2: Ingestion Pipeline

Test files to write first:

  • test_phase1_chunking.py — Test 1000/200 chunking with various text sizes
  • test_phase1_ingest.py — Mock ChromaDB, test endpoint flow
  • test_phase1_metadata.py — Test metadata extraction

Task 1.2.1: Document parsing

  • utils/docx_parser.py: parse_docx(file_path) -> str — Extract text from DOCX
  • utils/pdf_parser.py: parse_pdf(file_path) -> str — Extract text from PDF using pypdf
  • Both return plain text with preserved paragraph breaks
  • Handle edge cases: empty docs, corrupted files, scanned PDFs (skip with warning)

Task 1.2.2: Chunking abstraction

  • utils/chunking.py: Abstract base class ChunkingStrategy
  • TokenChunkingStrategy implementation using tiktoken
  • Config: chunk_size=1000, overlap=200
  • Method: chunk(text: str) -> list[str]

Task 1.2.3: Metadata extraction

  • utils/metadata.py: extract_metadata(file_path, chunks) -> list[dict]
  • Returns list of metadata dicts matching chunk count
  • Each metadata has: filename, upload_date, content_summary (first 200 chars of chunk)

Task 1.2.4: Embedding service

  • services/rag.py: RAGService class
  • Initialize embedding function with qwen/qwen3-embedding-4b
  • Method: ingest_document(file_path, chunks, metadata_list)
  • Store in ChromaDB collection "documents"

Task 1.2.5: Ingest endpoint

  • routers/ingest.py: POST /api/v1/ingest
  • Accept UploadFile (DOCX and PDF, validate extension)
  • Route to correct parser based on file extension
  • Orchestration: save temp → parse → chunk → extract metadata → embed → store → cleanup
  • Return IngestResponse

Commit: "feat: Phase 1.2 ingestion pipeline with chunking and metadata"

Status: Complete
Tests: 20 passed, 2 skipped (python-docx not installed in test env) Coverage: chunking (4), metadata (3), parsers (5), RAGService (6), ingest endpoint (4)


Phase 1.3: Query Pipeline (3-Step)

Test files to write first:

  • test_phase1_llm_client.py — Test LLM client error handling
  • test_phase1_rag_service.py — Test retrieval and response generation
  • test_phase1_query.py — Test full pipeline with mocked LLM calls

Task 1.3.1: LLM client — Done in Phase 1.1

  • services/llm_client.py: LLMClient class — Implemented
  • Constructor takes config from Settings
  • Method: complete(prompt: str, temperature: float = 0.7) -> str
  • Use httpx with OpenAI-compatible API format
  • Handle errors gracefully

Task 1.3.2: Query decomposition

  • services/query_decomposer.py: QueryDecomposer class — 🔄 Pending
  • Prompt template: "Given question: '{question}', extract key search keywords as JSON array"
  • Method: decompose(question: str) -> list[str]
  • Parse LLM JSON response into list of keywords

Task 1.3.3: Retrieval from ChromaDB — Done in Phase 1.2

  • services/rag.py: retrieve(query_keywords: list[str], n_results: int = 10) — Implemented
  • Join keywords with space for query text
  • Return list of (chunk_text, metadata, distance) tuples

Task 1.3.4: Relevance filtering

  • services/relevance_filter.py: RelevanceFilter class — 🔄 Pending
  • Prompt: "Given question '{question}' and these document chunks, rate each 0-10 for relevance. Return JSON array of scores."
  • Input: list of chunks
  • Output: filtered list of (chunk, metadata) with score > threshold (e.g., 7)
  • Batch all chunks in single LLM call

Task 1.3.5: Response generation — Done in Phase 1.2

  • services/rag.py: generate_response(question: str, chunks: list, metadata: list) -> str — Implemented
  • Prompt: "Answer question using ONLY these document chunks. Format as bullet points. Cite sources."
  • Include chunk content and metadata in context
  • Enforce bullet-point format via prompt

Task 1.3.6: Query endpoint

  • routers/query.py: POST /api/v1/query🔄 Pending
  • Full pipeline orchestration:
    1. Call query_decomposer.decompose() → get keywords
    2. Call rag.retrieve() → get chunks
    3. Call relevance_filter.filter() → filter chunks
    4. Call rag.generate_response() → get answer
  • Return QueryResponse with keywords, answer, sources

Commit: "feat: Phase 1.3 query pipeline with decomposition, relevance filter, and response"

Status: Complete
Tests: 13 passed (5 decomposer, 5 relevance filter, 3 query endpoint)


Phase 1.4: Testing & Polish

Test files to write first:

  • test_acceptance_phase1_ingest.py — Real embedding test
  • test_acceptance_phase1_rag_query.py — Real LLM pipeline test

Task 1.4.1: Unit tests

  • Run pytest app/test/test_phase1_*.py -v — all must pass
  • Add missing test coverage for edge cases

Task 1.4.2: Acceptance tests

  • Create real .env with OpenRouter credentials
  • Run test_acceptance_phase1_ingest.py with real embedding
  • Run test_acceptance_phase1_rag_query.py with real LLM calls
  • Verify keywords appear, answer is bullet format, sources have metadata

Task 1.4.3: Error handling

  • Add try/except in all endpoints
  • Return proper HTTP status codes (400 for bad input, 500 for LLM errors)
  • Log errors with context

Task 1.4.4: Documentation

  • Update AGENTS.md if any conventions changed
  • Add docstrings to all public methods
  • Verify all imports work

Commit: "feat: Phase 1.4 acceptance tests, error handling, and polish"

Status: Complete
Tests: 41 unit tests passed (2 skipped), 5 acceptance tests passed
Acceptance: Full 3-step pipeline verified with real OpenRouter LLM calls


Services Status

Service File Status Responsibility
Config core/config.py Complete .env loading, Settings class
Database core/database.py Complete ChromaDB persistent client
LLM Client services/llm_client.py Complete OpenAI-compatible API wrapper
Query Decomposer services/query_decomposer.py Complete Extract keywords from question
Relevance Filter services/relevance_filter.py Complete Batch score chunk relevance
RAG Service services/rag.py Complete Embedding, retrieval, response generation
Ingest Router routers/ingest.py Complete POST /api/v1/ingest endpoint
Query Router routers/query.py Complete POST /api/v1/query endpoint
DOCX Parser utils/docx_parser.py Complete Extract text from DOCX
PDF Parser utils/pdf_parser.py Complete Extract text from PDF
Chunking utils/chunking.py Complete Token-based chunking with overlap
Metadata utils/metadata.py Complete Extract file metadata

Environment Variables

LLM_BASE_URL=https://openrouter.ai/api/v1
LLM_API_KEY=your_openrouter_key
LLM_MODEL_NAME=qwen/qwen3.5-35b-a3b
EMBEDDING_MODEL=qwen/qwen3-embedding-4b
EMBEDDING_BASE_URL=https://openrouter.ai/api/v1
CHROMA_DB_PATH=./chroma_db

Notes

  • Chunking strategy uses ABC pattern for easy future replacement
  • Relevance filtering uses single batch call for efficiency
  • All LLM calls go through LLMClient for consistent error handling
  • ChromaDB collection name: "documents"
  • Metadata fields: filename, upload_date (ISO format), content_summary, chunk_index
  • Response format enforced purely through prompt engineering (no JSON schema)