15 KiB

Raw Blame History

Package 6 Enhancement Plan — Deepseek API Client for Decomposed Questions

Source: User request (2026-05-04) Scope: Add a dedicated Deepseek API client (LLMClientDP) for the query decomposition step only, while the existing LLMClient continues to handle filtering and response generation via OpenRouter/vLLM. Status: Draft

Objective

Replace the LLM provider for the decompose step (Stage 1 of the 3-step RAG pipeline) with Deepseek API (deepseek-v4-pro), keeping the filter and generate stages on the existing OpenRouter/vLLM provider.

The decomposition step is the first and most critical LLM call — it shapes all downstream retrieval. Using Deepseek's deepseek-v4-pro with thinking mode disabled provides a dedicated pipeline for this stage with better structured output reliability.

Non-goal: The filter and generate steps remain unchanged (still use LLMClient with OpenRouter/vLLM).

Current Architecture

query.py _query_stream():
    settings = get_settings()
    llm_client = LLMClient(settings)          # ← single client for ALL stages

    decomposer = QueryDecomposer(llm_client, ...)       # Stage 1: uses llm_client
    ...
    relevance_filter = RelevanceFilter(llm_client, ...)  # Stage 3: uses same llm_client
    ...
    rag.generate_response_per_subquestion(...)            # Stage 4: uses same llm_client

LLMClient is configured via Settings:

llm_base_url → OpenRouter or vLLM endpoint
llm_api_key → shared API key
llm_model_name → e.g. qwen/qwen3.5-35b-a3b
vllm_engine → toggles extra_body format
llm_enable_thinking → controls thinking/reasoning tokens

Target Architecture

query.py _query_stream():
    settings = get_settings()

    llm_client_dp = LLMClientDP(settings)       # NEW: Deepseek client
    llm_client = LLMClient(settings)            # existing: OpenRouter/vLLM client

    decomposer = QueryDecomposer(llm_client_dp, ...)       # Stage 1: Deepseek
    ...
    relevance_filter = RelevanceFilter(llm_client, ...)     # Stage 3: OpenRouter/vLLM
    ...
    rag.generate_response_per_subquestion(...)              # Stage 4: OpenRouter/vLLM

Decision Register

#	Decision	Rationale
1	Separate class `LLMClientDP`, not a subclass of `LLMClient`	Deepseek has different `extra_body` format (`{"thinking": {"type": "disabled"}}` vs OpenRouter's `{"reasoning": {"enabled": false}}`), different config keys, and no vLLM engine support. Inheritance would create tight coupling to the provider-switching logic in `_build_extra_body()`. A clean standalone class is simpler, more testable, and avoids accidentally breaking existing stages.
2	New config fields in `Settings` (`dp_base_url`, `dp_api_key`, `dp_model_name`)	Separate from existing `llm_*` fields. The decompose step uses a different provider+model — they should have independent config. If `dp_api_key` is empty, fall back to `llm_api_key` (same API key for different providers is common).
3	Thinking mode disabled (`extra_body={"thinking": {"type": "disabled"}}`)	User explicitly requested thinking mode disabled. Deepseek defaults to thinking enabled; disabling it enables `temperature` control and avoids reasoning token overhead for structured output.
4	Only `complete()` and `complete_structured()` needed	`QueryDecomposer.decompose()` calls `complete_structured()` (primary) and `complete()` (fallback). No LangChain `_get_langchain_model()` needed — Deepseek uses the same OpenAI-compatible `response_format` as OpenRouter, so the existing `_complete_structured_openai()` pattern works.
5	Reuse `AsyncOpenAI` from `openai` package	Deepseek's API is OpenAI-compatible. Same SDK, different `base_url` + `api_key`. No new dependencies.
6	New file `backend/app/services/llm_client_dp.py`	Clean separation. Follows existing naming (`llm_client.py`). The `_dp` suffix clarifies it's the Deepseek-specific client for the decompose pipeline step.
7	`QueryDecomposer` requires no changes	It accepts any object with `complete()` and `complete_structured()` methods (duck typing). `LLMClientDP` provides both.
8	Temperature 0.0 for structured decompose, 0.7 for fallback	Matches existing `LLMClient` behavior: `complete_structured()` uses 0.0 for deterministic schema compliance; `complete()` fallback uses 0.7 for creative fallback parsing.
9	Logging matches existing `LLMClient` pattern	Same `_truncate_prompt_for_log()`, same `[step_name]` log prefix, same timing measurement. Operational consistency.

Deepseek API Details (from docs)

Parameter	Value
Base URL	`https://api.deepseek.com`
Chat completions endpoint	`POST /chat/completions`
Model	`deepseek-v4-pro`
Auth	Bearer token (passed as `api_key` to OpenAI SDK)
Thinking disable	`extra_body={"thinking": {"type": "disabled"}}`
SDK compatibility	OpenAI-compatible (`AsyncOpenAI` from `openai` package)
Rate limiting	Dynamic concurrency (HTTP 429 when hit), 10-min inference timeout

Files to Create

#	File	Purpose
F1	`backend/app/services/llm_client_dp.py`	New `LLMClientDP` class
F2	`backend/app/test/test_phase6_llm_client_dp.py`	Unit tests for `LLMClientDP`
F3	`backend/app/test/test_phase6_decompose_dp.py`	Integration: `QueryDecomposer` + `LLMClientDP`
F4	`backend/app/test/acceptance/test_acceptance_phase6_dp_decompose.py`	Acceptance: real Deepseek API call

Files to Modify

#	File	Change
M1	`backend/app/core/config.py`	Add `dp_base_url`, `dp_api_key`, `dp_model_name` to `Settings`
M2	`backend/app/core/dependencies.py`	Add `get_llm_client_dp()` dependency
M3	`backend/app/routers/query.py`	Create `LLMClientDP` for decompose, keep `LLMClient` for filter+generate

Implementation Tasks

Task 6.1: Add Deepseek config to Settings

Add to backend/app/core/config.py:

# Deepseek API (decompose step only)
dp_base_url: str = "https://api.deepseek.com"
dp_api_key: str = ""
dp_model_name: str = "deepseek-v4-pro"

Add fallback logic: if dp_api_key is empty, use llm_api_key (handled in LLMClientDP.__init__)
Update .env.example with new fields
No separate test file — tested implicitly via LLMClientDP instantiation tests

Task 6.2: Create `LLMClientDP` class

Create backend/app/services/llm_client_dp.py with:

class LLMClientDP:
    """Async Deepseek API client for query decomposition step only.

    Uses the OpenAI-compatible SDK with Deepseek's base URL.
    Thinking mode is always disabled (extra_body={"thinking": {"type": "disabled"}}).
    """

    def __init__(self, settings: Settings):
        api_key = settings.dp_api_key or settings.llm_api_key
        self.model = settings.dp_model_name
        self.logger = logging.getLogger(__name__)
        self._client = AsyncOpenAI(
            base_url=settings.dp_base_url.rstrip("/"),
            api_key=api_key,
            timeout=settings.llm_timeout,  # reuse existing timeout
            http_client=httpx.AsyncClient(
                headers={"Content-Type": "application/json"},
            ),
        )
        self._langchain_model = None

    async def complete(self, prompt: str, temperature: float = 0.7, step_name: str = "DP") -> str:
        """Send chat completion with thinking disabled."""
        # Same pattern as LLMClient.complete() but with Deepseek extra_body
        ...

    async def complete_structured(self, prompt: str, pydantic_model, step_name: str = "DP"):
        """Structured output using OpenAI-native json_schema (Deepseek compatible)."""
        # Same pattern as LLMClient._complete_structured_openai()
        ...

    async def close(self): ...

_build_extra_body() → always returns {"thinking": {"type": "disabled"}} (no toggle — always disabled per user request)
_get_langchain_model() → same init_chat_model() pattern as LLMClient, but using Deepseek base_url/api_key. No vLLM-specific logic needed.
Reuse _truncate_prompt_for_log() helper (or copy it — keep the class self-contained)
LLMClientDPError exception class (mirrors LLMClientError)
Test file: test_phase6_llm_client_dp.py

Task 6.3: Wire `LLMClientDP` into the query router

In backend/app/routers/query.py _query_stream():

# Before (current):
llm_client = LLMClient(settings)
decomposer = QueryDecomposer(llm_client, prompt_service=prompt_service)

# After:
from app.services.llm_client_dp import LLMClientDP
llm_client_dp = LLMClientDP(settings)
llm_client = LLMClient(settings)

decomposer = QueryDecomposer(llm_client_dp, prompt_service=prompt_service)  # uses Deepseek
# ... filter and generate still use llm_client (OpenRouter/vLLM)

Ensure LLMClientDP is properly closed if needed (the existing llm_client is not explicitly closed in _query_stream() either — match existing pattern)
No changes to RelevanceFilter or RAGService instantiation
Test file: Existing integration tests (test_integration_phase1.py, test_phase4_integration_query_pipeline.py) must pass with updated mocks

Task 6.4: Add `get_llm_client_dp()` to dependencies

In backend/app/core/dependencies.py:

def get_llm_client_dp():
    settings = get_settings_cached()
    from app.services.llm_client_dp import LLMClientDP
    return LLMClientDP(settings)

Update conftest.py if needed for test fixtures
No separate test file — used by integration tests

Task 6.5: Update tests

Update test_phase1_query.py: The _MockLLMClient mock replaces LLMClient — ensure it still works for both decompose and filter/generate stages. The decompose mock needs complete_structured() support (already added in Phase 5).
Update test_phase5_query_decomposer_structured.py: MockLLMClientStructured is a standalone mock — unaffected.
Update conftest.py: If mock_llm_client fixture exists, verify it doesn't conflict with the new LLMClientDP.
Run full backend test suite: cd backend && pytest app/test/ -v

Task 6.6: Acceptance test

Create backend/app/test/acceptance/test_acceptance_phase6_dp_decompose.py:
- Requires .env with valid DP_API_KEY (Deepseek API key)
- Tests: Cantonese question → valid sub-questions, English question → valid sub-questions, empty question → [], structured output returns SubQuestions Pydantic model
- Mark with @pytest.mark.acceptance and @pytest.mark.slow
Run: cd backend && pytest app/test/acceptance/test_acceptance_phase6_dp_decompose.py -v -m acceptance

Test Files Summary

#	Test File	Type	Coverage
T6.2	`test_phase6_llm_client_dp.py`	Integration	`LLMClientDP.complete()` with mock Deepseek, `complete_structured()` with mock LangChain model, thinking disabled in extra_body, error handling, timeout
T6.3	`test_phase6_decompose_dp.py`	Integration	`QueryDecomposer` + `LLMClientDP` end-to-end, structured output path, legacy fallback path, empty decomposition
T6.4	(update) `test_phase1_query.py`	Integration	Updated mock to support dual-client architecture
AT6.6	`test_acceptance_phase6_dp_decompose.py`	Acceptance	Real Deepseek API: Cantonese decompose, English decompose, structured output

Dependency Graph

Task 6.1 (add Settings fields)
    │
    └── Task 6.2 (create LLMClientDP class)
            │
            ├── Task 6.3 (wire into query.py)
            │       │
            │       └── Task 6.5 (update tests)
            │               │
            │               └── Task 6.6 (acceptance test)
            │
            └── Task 6.4 (add dependencies.py helper)

Tasks 6.3 and 6.4 can run in parallel after 6.2.

Acceptance Criteria

LLMClientDP sends requests to https://api.deepseek.com/chat/completions with deepseek-v4-pro model
Every request includes extra_body={"thinking": {"type": "disabled"}}
QueryDecomposer.decompose() returns valid sub-questions when using LLMClientDP
complete_structured() returns validated SubQuestions Pydantic model (no JSON parse errors)
complete() fallback works (legacy JSON parsing) when structured output fails
Filter and generate stages continue to use LLMClient (OpenRouter/vLLM) — no regression
All existing tests pass: pytest app/test/ -v
Acceptance tests pass with real Deepseek API key

Rollback Plan

If Deepseek API causes issues in the decompose step:

QueryDecomposer uses duck typing — swap back to LLMClient by changing one line in query.py:

decomposer = QueryDecomposer(llm_client, prompt_service=prompt_service)  # was llm_client_dp

LLMClientDP class and config fields remain in codebase (no cleanup needed)
No database migrations, no schema changes, no frontend changes

LLMClientDP Interface (design sketch)

class LLMClientDP:
    """Async Deepseek API client for query decomposition step only."""
    
    def __init__(self, settings: Settings):
        """Initialize with Deepseek-specific config from Settings."""
        ...
    
    async def complete(
        self, prompt: str, temperature: float = 0.7, step_name: str = "DP"
    ) -> str:
        """Send chat completion with thinking disabled.
        
        Args:
            prompt: The decompose prompt.
            temperature: 0.7 for fallback path (creative parsing).
            step_name: "QueryDecomposer" for logging.
        
        Returns:
            LLM response text (JSON string for legacy parsing).
        """
        ...
    
    async def complete_structured(
        self, prompt: str, pydantic_model, step_name: str = "DP"
    ):
        """Structured output via OpenAI-native json_schema.
        
        Args:
            prompt: The decompose prompt.
            pydantic_model: Pydantic BaseModel subclass (SubQuestions).
            step_name: "QueryDecomposer" for logging.
        
        Returns:
            Validated Pydantic model instance.
        """
        ...
    
    async def close(self):
        """Close the underlying HTTP client."""
        ...

Non-Goals (explicitly out of scope)

❌ Do NOT create a provider abstraction/factory for LLM clients
❌ Do NOT modify LLMClient class (risk of breaking filter/generate stages)
❌ Do NOT add thinking mode toggle for Deepseek (always disabled per user request)
❌ Do NOT change RelevanceFilter or RAGService (they continue using LLMClient)
❌ Do NOT change QueryDecomposer (duck typing — requires no modifications)
❌ Do NOT add frontend changes
❌ Do NOT add streaming support for Deepseek (decompose is not streamed)

15 KiB Raw Blame History