legco_ai_assistant/.plans/package6_enhancement_plan.md

15 KiB

Package 6 Enhancement Plan — Deepseek API Client for Decomposed Questions

Source: User request (2026-05-04) Scope: Add a dedicated Deepseek API client (LLMClientDP) for the query decomposition step only, while the existing LLMClient continues to handle filtering and response generation via OpenRouter/vLLM. Status: Draft


Objective

Replace the LLM provider for the decompose step (Stage 1 of the 3-step RAG pipeline) with Deepseek API (deepseek-v4-pro), keeping the filter and generate stages on the existing OpenRouter/vLLM provider.

The decomposition step is the first and most critical LLM call — it shapes all downstream retrieval. Using Deepseek's deepseek-v4-pro with thinking mode disabled provides a dedicated pipeline for this stage with better structured output reliability.

Non-goal: The filter and generate steps remain unchanged (still use LLMClient with OpenRouter/vLLM).


Current Architecture

query.py _query_stream():
    settings = get_settings()
    llm_client = LLMClient(settings)          # ← single client for ALL stages

    decomposer = QueryDecomposer(llm_client, ...)       # Stage 1: uses llm_client
    ...
    relevance_filter = RelevanceFilter(llm_client, ...)  # Stage 3: uses same llm_client
    ...
    rag.generate_response_per_subquestion(...)            # Stage 4: uses same llm_client

LLMClient is configured via Settings:

  • llm_base_url → OpenRouter or vLLM endpoint
  • llm_api_key → shared API key
  • llm_model_name → e.g. qwen/qwen3.5-35b-a3b
  • vllm_engine → toggles extra_body format
  • llm_enable_thinking → controls thinking/reasoning tokens

Target Architecture

query.py _query_stream():
    settings = get_settings()

    llm_client_dp = LLMClientDP(settings)       # NEW: Deepseek client
    llm_client = LLMClient(settings)            # existing: OpenRouter/vLLM client

    decomposer = QueryDecomposer(llm_client_dp, ...)       # Stage 1: Deepseek
    ...
    relevance_filter = RelevanceFilter(llm_client, ...)     # Stage 3: OpenRouter/vLLM
    ...
    rag.generate_response_per_subquestion(...)              # Stage 4: OpenRouter/vLLM

Decision Register

# Decision Rationale
1 Separate class LLMClientDP, not a subclass of LLMClient Deepseek has different extra_body format ({"thinking": {"type": "disabled"}} vs OpenRouter's {"reasoning": {"enabled": false}}), different config keys, and no vLLM engine support. Inheritance would create tight coupling to the provider-switching logic in _build_extra_body(). A clean standalone class is simpler, more testable, and avoids accidentally breaking existing stages.
2 New config fields in Settings (dp_base_url, dp_api_key, dp_model_name) Separate from existing llm_* fields. The decompose step uses a different provider+model — they should have independent config. If dp_api_key is empty, fall back to llm_api_key (same API key for different providers is common).
3 Thinking mode disabled (extra_body={"thinking": {"type": "disabled"}}) User explicitly requested thinking mode disabled. Deepseek defaults to thinking enabled; disabling it enables temperature control and avoids reasoning token overhead for structured output.
4 Only complete() and complete_structured() needed QueryDecomposer.decompose() calls complete_structured() (primary) and complete() (fallback). No LangChain _get_langchain_model() needed — Deepseek uses the same OpenAI-compatible response_format as OpenRouter, so the existing _complete_structured_openai() pattern works.
5 Reuse AsyncOpenAI from openai package Deepseek's API is OpenAI-compatible. Same SDK, different base_url + api_key. No new dependencies.
6 New file backend/app/services/llm_client_dp.py Clean separation. Follows existing naming (llm_client.py). The _dp suffix clarifies it's the Deepseek-specific client for the decompose pipeline step.
7 QueryDecomposer requires no changes It accepts any object with complete() and complete_structured() methods (duck typing). LLMClientDP provides both.
8 Temperature 0.0 for structured decompose, 0.7 for fallback Matches existing LLMClient behavior: complete_structured() uses 0.0 for deterministic schema compliance; complete() fallback uses 0.7 for creative fallback parsing.
9 Logging matches existing LLMClient pattern Same _truncate_prompt_for_log(), same [step_name] log prefix, same timing measurement. Operational consistency.

Deepseek API Details (from docs)

Parameter Value
Base URL https://api.deepseek.com
Chat completions endpoint POST /chat/completions
Model deepseek-v4-pro
Auth Bearer token (passed as api_key to OpenAI SDK)
Thinking disable extra_body={"thinking": {"type": "disabled"}}
SDK compatibility OpenAI-compatible (AsyncOpenAI from openai package)
Rate limiting Dynamic concurrency (HTTP 429 when hit), 10-min inference timeout

Files to Create

# File Purpose
F1 backend/app/services/llm_client_dp.py New LLMClientDP class
F2 backend/app/test/test_phase6_llm_client_dp.py Unit tests for LLMClientDP
F3 backend/app/test/test_phase6_decompose_dp.py Integration: QueryDecomposer + LLMClientDP
F4 backend/app/test/acceptance/test_acceptance_phase6_dp_decompose.py Acceptance: real Deepseek API call

Files to Modify

# File Change
M1 backend/app/core/config.py Add dp_base_url, dp_api_key, dp_model_name to Settings
M2 backend/app/core/dependencies.py Add get_llm_client_dp() dependency
M3 backend/app/routers/query.py Create LLMClientDP for decompose, keep LLMClient for filter+generate

Implementation Tasks

Task 6.1: Add Deepseek config to Settings

  • Add to backend/app/core/config.py:
    # Deepseek API (decompose step only)
    dp_base_url: str = "https://api.deepseek.com"
    dp_api_key: str = ""
    dp_model_name: str = "deepseek-v4-pro"
    
  • Add fallback logic: if dp_api_key is empty, use llm_api_key (handled in LLMClientDP.__init__)
  • Update .env.example with new fields
  • No separate test file — tested implicitly via LLMClientDP instantiation tests

Task 6.2: Create LLMClientDP class

  • Create backend/app/services/llm_client_dp.py with:
    class LLMClientDP:
        """Async Deepseek API client for query decomposition step only.
    
        Uses the OpenAI-compatible SDK with Deepseek's base URL.
        Thinking mode is always disabled (extra_body={"thinking": {"type": "disabled"}}).
        """
    
        def __init__(self, settings: Settings):
            api_key = settings.dp_api_key or settings.llm_api_key
            self.model = settings.dp_model_name
            self.logger = logging.getLogger(__name__)
            self._client = AsyncOpenAI(
                base_url=settings.dp_base_url.rstrip("/"),
                api_key=api_key,
                timeout=settings.llm_timeout,  # reuse existing timeout
                http_client=httpx.AsyncClient(
                    headers={"Content-Type": "application/json"},
                ),
            )
            self._langchain_model = None
    
        async def complete(self, prompt: str, temperature: float = 0.7, step_name: str = "DP") -> str:
            """Send chat completion with thinking disabled."""
            # Same pattern as LLMClient.complete() but with Deepseek extra_body
            ...
    
        async def complete_structured(self, prompt: str, pydantic_model, step_name: str = "DP"):
            """Structured output using OpenAI-native json_schema (Deepseek compatible)."""
            # Same pattern as LLMClient._complete_structured_openai()
            ...
    
        async def close(self): ...
    
  • _build_extra_body() → always returns {"thinking": {"type": "disabled"}} (no toggle — always disabled per user request)
  • _get_langchain_model() → same init_chat_model() pattern as LLMClient, but using Deepseek base_url/api_key. No vLLM-specific logic needed.
  • Reuse _truncate_prompt_for_log() helper (or copy it — keep the class self-contained)
  • LLMClientDPError exception class (mirrors LLMClientError)
  • Test file: test_phase6_llm_client_dp.py

Task 6.3: Wire LLMClientDP into the query router

  • In backend/app/routers/query.py _query_stream():
    # Before (current):
    llm_client = LLMClient(settings)
    decomposer = QueryDecomposer(llm_client, prompt_service=prompt_service)
    
    # After:
    from app.services.llm_client_dp import LLMClientDP
    llm_client_dp = LLMClientDP(settings)
    llm_client = LLMClient(settings)
    
    decomposer = QueryDecomposer(llm_client_dp, prompt_service=prompt_service)  # uses Deepseek
    # ... filter and generate still use llm_client (OpenRouter/vLLM)
    
  • Ensure LLMClientDP is properly closed if needed (the existing llm_client is not explicitly closed in _query_stream() either — match existing pattern)
  • No changes to RelevanceFilter or RAGService instantiation
  • Test file: Existing integration tests (test_integration_phase1.py, test_phase4_integration_query_pipeline.py) must pass with updated mocks

Task 6.4: Add get_llm_client_dp() to dependencies

  • In backend/app/core/dependencies.py:
    def get_llm_client_dp():
        settings = get_settings_cached()
        from app.services.llm_client_dp import LLMClientDP
        return LLMClientDP(settings)
    
  • Update conftest.py if needed for test fixtures
  • No separate test file — used by integration tests

Task 6.5: Update tests

  • Update test_phase1_query.py: The _MockLLMClient mock replaces LLMClient — ensure it still works for both decompose and filter/generate stages. The decompose mock needs complete_structured() support (already added in Phase 5).
  • Update test_phase5_query_decomposer_structured.py: MockLLMClientStructured is a standalone mock — unaffected.
  • Update conftest.py: If mock_llm_client fixture exists, verify it doesn't conflict with the new LLMClientDP.
  • Run full backend test suite: cd backend && pytest app/test/ -v

Task 6.6: Acceptance test

  • Create backend/app/test/acceptance/test_acceptance_phase6_dp_decompose.py:
    • Requires .env with valid DP_API_KEY (Deepseek API key)
    • Tests: Cantonese question → valid sub-questions, English question → valid sub-questions, empty question → [], structured output returns SubQuestions Pydantic model
    • Mark with @pytest.mark.acceptance and @pytest.mark.slow
  • Run: cd backend && pytest app/test/acceptance/test_acceptance_phase6_dp_decompose.py -v -m acceptance

Test Files Summary

# Test File Type Coverage
T6.2 test_phase6_llm_client_dp.py Integration LLMClientDP.complete() with mock Deepseek, complete_structured() with mock LangChain model, thinking disabled in extra_body, error handling, timeout
T6.3 test_phase6_decompose_dp.py Integration QueryDecomposer + LLMClientDP end-to-end, structured output path, legacy fallback path, empty decomposition
T6.4 (update) test_phase1_query.py Integration Updated mock to support dual-client architecture
AT6.6 test_acceptance_phase6_dp_decompose.py Acceptance Real Deepseek API: Cantonese decompose, English decompose, structured output

Dependency Graph

Task 6.1 (add Settings fields)
    │
    └── Task 6.2 (create LLMClientDP class)
            │
            ├── Task 6.3 (wire into query.py)
            │       │
            │       └── Task 6.5 (update tests)
            │               │
            │               └── Task 6.6 (acceptance test)
            │
            └── Task 6.4 (add dependencies.py helper)

Tasks 6.3 and 6.4 can run in parallel after 6.2.


Acceptance Criteria

  • LLMClientDP sends requests to https://api.deepseek.com/chat/completions with deepseek-v4-pro model
  • Every request includes extra_body={"thinking": {"type": "disabled"}}
  • QueryDecomposer.decompose() returns valid sub-questions when using LLMClientDP
  • complete_structured() returns validated SubQuestions Pydantic model (no JSON parse errors)
  • complete() fallback works (legacy JSON parsing) when structured output fails
  • Filter and generate stages continue to use LLMClient (OpenRouter/vLLM) — no regression
  • All existing tests pass: pytest app/test/ -v
  • Acceptance tests pass with real Deepseek API key

Rollback Plan

If Deepseek API causes issues in the decompose step:

  1. QueryDecomposer uses duck typing — swap back to LLMClient by changing one line in query.py:
    decomposer = QueryDecomposer(llm_client, prompt_service=prompt_service)  # was llm_client_dp
    
  2. LLMClientDP class and config fields remain in codebase (no cleanup needed)
  3. No database migrations, no schema changes, no frontend changes

LLMClientDP Interface (design sketch)

class LLMClientDP:
    """Async Deepseek API client for query decomposition step only."""
    
    def __init__(self, settings: Settings):
        """Initialize with Deepseek-specific config from Settings."""
        ...
    
    async def complete(
        self, prompt: str, temperature: float = 0.7, step_name: str = "DP"
    ) -> str:
        """Send chat completion with thinking disabled.
        
        Args:
            prompt: The decompose prompt.
            temperature: 0.7 for fallback path (creative parsing).
            step_name: "QueryDecomposer" for logging.
        
        Returns:
            LLM response text (JSON string for legacy parsing).
        """
        ...
    
    async def complete_structured(
        self, prompt: str, pydantic_model, step_name: str = "DP"
    ):
        """Structured output via OpenAI-native json_schema.
        
        Args:
            prompt: The decompose prompt.
            pydantic_model: Pydantic BaseModel subclass (SubQuestions).
            step_name: "QueryDecomposer" for logging.
        
        Returns:
            Validated Pydantic model instance.
        """
        ...
    
    async def close(self):
        """Close the underlying HTTP client."""
        ...

Non-Goals (explicitly out of scope)

  • Do NOT create a provider abstraction/factory for LLM clients
  • Do NOT modify LLMClient class (risk of breaking filter/generate stages)
  • Do NOT add thinking mode toggle for Deepseek (always disabled per user request)
  • Do NOT change RelevanceFilter or RAGService (they continue using LLMClient)
  • Do NOT change QueryDecomposer (duck typing — requires no modifications)
  • Do NOT add frontend changes
  • Do NOT add streaming support for Deepseek (decompose is not streamed)