15 KiB
Package 6 Enhancement Plan — Deepseek API Client for Decomposed Questions
Source: User request (2026-05-04)
Scope: Add a dedicated Deepseek API client (LLMClientDP) for the query decomposition step only, while the existing LLMClient continues to handle filtering and response generation via OpenRouter/vLLM.
Status: Draft
Objective
Replace the LLM provider for the decompose step (Stage 1 of the 3-step RAG pipeline) with Deepseek API (deepseek-v4-pro), keeping the filter and generate stages on the existing OpenRouter/vLLM provider.
The decomposition step is the first and most critical LLM call — it shapes all downstream retrieval. Using Deepseek's deepseek-v4-pro with thinking mode disabled provides a dedicated pipeline for this stage with better structured output reliability.
Non-goal: The filter and generate steps remain unchanged (still use LLMClient with OpenRouter/vLLM).
Current Architecture
query.py _query_stream():
settings = get_settings()
llm_client = LLMClient(settings) # ← single client for ALL stages
decomposer = QueryDecomposer(llm_client, ...) # Stage 1: uses llm_client
...
relevance_filter = RelevanceFilter(llm_client, ...) # Stage 3: uses same llm_client
...
rag.generate_response_per_subquestion(...) # Stage 4: uses same llm_client
LLMClient is configured via Settings:
llm_base_url→ OpenRouter or vLLM endpointllm_api_key→ shared API keyllm_model_name→ e.g.qwen/qwen3.5-35b-a3bvllm_engine→ togglesextra_bodyformatllm_enable_thinking→ controls thinking/reasoning tokens
Target Architecture
query.py _query_stream():
settings = get_settings()
llm_client_dp = LLMClientDP(settings) # NEW: Deepseek client
llm_client = LLMClient(settings) # existing: OpenRouter/vLLM client
decomposer = QueryDecomposer(llm_client_dp, ...) # Stage 1: Deepseek
...
relevance_filter = RelevanceFilter(llm_client, ...) # Stage 3: OpenRouter/vLLM
...
rag.generate_response_per_subquestion(...) # Stage 4: OpenRouter/vLLM
Decision Register
| # | Decision | Rationale |
|---|---|---|
| 1 | Separate class LLMClientDP, not a subclass of LLMClient |
Deepseek has different extra_body format ({"thinking": {"type": "disabled"}} vs OpenRouter's {"reasoning": {"enabled": false}}), different config keys, and no vLLM engine support. Inheritance would create tight coupling to the provider-switching logic in _build_extra_body(). A clean standalone class is simpler, more testable, and avoids accidentally breaking existing stages. |
| 2 | New config fields in Settings (dp_base_url, dp_api_key, dp_model_name) |
Separate from existing llm_* fields. The decompose step uses a different provider+model — they should have independent config. If dp_api_key is empty, fall back to llm_api_key (same API key for different providers is common). |
| 3 | Thinking mode disabled (extra_body={"thinking": {"type": "disabled"}}) |
User explicitly requested thinking mode disabled. Deepseek defaults to thinking enabled; disabling it enables temperature control and avoids reasoning token overhead for structured output. |
| 4 | Only complete() and complete_structured() needed |
QueryDecomposer.decompose() calls complete_structured() (primary) and complete() (fallback). No LangChain _get_langchain_model() needed — Deepseek uses the same OpenAI-compatible response_format as OpenRouter, so the existing _complete_structured_openai() pattern works. |
| 5 | Reuse AsyncOpenAI from openai package |
Deepseek's API is OpenAI-compatible. Same SDK, different base_url + api_key. No new dependencies. |
| 6 | New file backend/app/services/llm_client_dp.py |
Clean separation. Follows existing naming (llm_client.py). The _dp suffix clarifies it's the Deepseek-specific client for the decompose pipeline step. |
| 7 | QueryDecomposer requires no changes |
It accepts any object with complete() and complete_structured() methods (duck typing). LLMClientDP provides both. |
| 8 | Temperature 0.0 for structured decompose, 0.7 for fallback | Matches existing LLMClient behavior: complete_structured() uses 0.0 for deterministic schema compliance; complete() fallback uses 0.7 for creative fallback parsing. |
| 9 | Logging matches existing LLMClient pattern |
Same _truncate_prompt_for_log(), same [step_name] log prefix, same timing measurement. Operational consistency. |
Deepseek API Details (from docs)
| Parameter | Value |
|---|---|
| Base URL | https://api.deepseek.com |
| Chat completions endpoint | POST /chat/completions |
| Model | deepseek-v4-pro |
| Auth | Bearer token (passed as api_key to OpenAI SDK) |
| Thinking disable | extra_body={"thinking": {"type": "disabled"}} |
| SDK compatibility | OpenAI-compatible (AsyncOpenAI from openai package) |
| Rate limiting | Dynamic concurrency (HTTP 429 when hit), 10-min inference timeout |
Files to Create
| # | File | Purpose |
|---|---|---|
| F1 | backend/app/services/llm_client_dp.py |
New LLMClientDP class |
| F2 | backend/app/test/test_phase6_llm_client_dp.py |
Unit tests for LLMClientDP |
| F3 | backend/app/test/test_phase6_decompose_dp.py |
Integration: QueryDecomposer + LLMClientDP |
| F4 | backend/app/test/acceptance/test_acceptance_phase6_dp_decompose.py |
Acceptance: real Deepseek API call |
Files to Modify
| # | File | Change |
|---|---|---|
| M1 | backend/app/core/config.py |
Add dp_base_url, dp_api_key, dp_model_name to Settings |
| M2 | backend/app/core/dependencies.py |
Add get_llm_client_dp() dependency |
| M3 | backend/app/routers/query.py |
Create LLMClientDP for decompose, keep LLMClient for filter+generate |
Implementation Tasks
Task 6.1: Add Deepseek config to Settings
- Add to
backend/app/core/config.py:# Deepseek API (decompose step only) dp_base_url: str = "https://api.deepseek.com" dp_api_key: str = "" dp_model_name: str = "deepseek-v4-pro" - Add fallback logic: if
dp_api_keyis empty, usellm_api_key(handled inLLMClientDP.__init__) - Update
.env.examplewith new fields - No separate test file — tested implicitly via
LLMClientDPinstantiation tests
Task 6.2: Create LLMClientDP class
- Create
backend/app/services/llm_client_dp.pywith:class LLMClientDP: """Async Deepseek API client for query decomposition step only. Uses the OpenAI-compatible SDK with Deepseek's base URL. Thinking mode is always disabled (extra_body={"thinking": {"type": "disabled"}}). """ def __init__(self, settings: Settings): api_key = settings.dp_api_key or settings.llm_api_key self.model = settings.dp_model_name self.logger = logging.getLogger(__name__) self._client = AsyncOpenAI( base_url=settings.dp_base_url.rstrip("/"), api_key=api_key, timeout=settings.llm_timeout, # reuse existing timeout http_client=httpx.AsyncClient( headers={"Content-Type": "application/json"}, ), ) self._langchain_model = None async def complete(self, prompt: str, temperature: float = 0.7, step_name: str = "DP") -> str: """Send chat completion with thinking disabled.""" # Same pattern as LLMClient.complete() but with Deepseek extra_body ... async def complete_structured(self, prompt: str, pydantic_model, step_name: str = "DP"): """Structured output using OpenAI-native json_schema (Deepseek compatible).""" # Same pattern as LLMClient._complete_structured_openai() ... async def close(self): ... _build_extra_body()→ always returns{"thinking": {"type": "disabled"}}(no toggle — always disabled per user request)_get_langchain_model()→ sameinit_chat_model()pattern asLLMClient, but using Deepseekbase_url/api_key. No vLLM-specific logic needed.- Reuse
_truncate_prompt_for_log()helper (or copy it — keep the class self-contained) LLMClientDPErrorexception class (mirrorsLLMClientError)- Test file:
test_phase6_llm_client_dp.py
Task 6.3: Wire LLMClientDP into the query router
- In
backend/app/routers/query.py_query_stream():# Before (current): llm_client = LLMClient(settings) decomposer = QueryDecomposer(llm_client, prompt_service=prompt_service) # After: from app.services.llm_client_dp import LLMClientDP llm_client_dp = LLMClientDP(settings) llm_client = LLMClient(settings) decomposer = QueryDecomposer(llm_client_dp, prompt_service=prompt_service) # uses Deepseek # ... filter and generate still use llm_client (OpenRouter/vLLM) - Ensure
LLMClientDPis properly closed if needed (the existingllm_clientis not explicitly closed in_query_stream()either — match existing pattern) - No changes to
RelevanceFilterorRAGServiceinstantiation - Test file: Existing integration tests (
test_integration_phase1.py,test_phase4_integration_query_pipeline.py) must pass with updated mocks
Task 6.4: Add get_llm_client_dp() to dependencies
- In
backend/app/core/dependencies.py:def get_llm_client_dp(): settings = get_settings_cached() from app.services.llm_client_dp import LLMClientDP return LLMClientDP(settings) - Update
conftest.pyif needed for test fixtures - No separate test file — used by integration tests
Task 6.5: Update tests
- Update
test_phase1_query.py: The_MockLLMClientmock replacesLLMClient— ensure it still works for both decompose and filter/generate stages. The decompose mock needscomplete_structured()support (already added in Phase 5). - Update
test_phase5_query_decomposer_structured.py:MockLLMClientStructuredis a standalone mock — unaffected. - Update
conftest.py: Ifmock_llm_clientfixture exists, verify it doesn't conflict with the newLLMClientDP. - Run full backend test suite:
cd backend && pytest app/test/ -v
Task 6.6: Acceptance test
- Create
backend/app/test/acceptance/test_acceptance_phase6_dp_decompose.py:- Requires
.envwith validDP_API_KEY(Deepseek API key) - Tests: Cantonese question → valid sub-questions, English question → valid sub-questions, empty question →
[], structured output returnsSubQuestionsPydantic model - Mark with
@pytest.mark.acceptanceand@pytest.mark.slow
- Requires
- Run:
cd backend && pytest app/test/acceptance/test_acceptance_phase6_dp_decompose.py -v -m acceptance
Test Files Summary
| # | Test File | Type | Coverage |
|---|---|---|---|
| T6.2 | test_phase6_llm_client_dp.py |
Integration | LLMClientDP.complete() with mock Deepseek, complete_structured() with mock LangChain model, thinking disabled in extra_body, error handling, timeout |
| T6.3 | test_phase6_decompose_dp.py |
Integration | QueryDecomposer + LLMClientDP end-to-end, structured output path, legacy fallback path, empty decomposition |
| T6.4 | (update) test_phase1_query.py |
Integration | Updated mock to support dual-client architecture |
| AT6.6 | test_acceptance_phase6_dp_decompose.py |
Acceptance | Real Deepseek API: Cantonese decompose, English decompose, structured output |
Dependency Graph
Task 6.1 (add Settings fields)
│
└── Task 6.2 (create LLMClientDP class)
│
├── Task 6.3 (wire into query.py)
│ │
│ └── Task 6.5 (update tests)
│ │
│ └── Task 6.6 (acceptance test)
│
└── Task 6.4 (add dependencies.py helper)
Tasks 6.3 and 6.4 can run in parallel after 6.2.
Acceptance Criteria
LLMClientDPsends requests tohttps://api.deepseek.com/chat/completionswithdeepseek-v4-promodel- Every request includes
extra_body={"thinking": {"type": "disabled"}} QueryDecomposer.decompose()returns valid sub-questions when usingLLMClientDPcomplete_structured()returns validatedSubQuestionsPydantic model (no JSON parse errors)complete()fallback works (legacy JSON parsing) when structured output fails- Filter and generate stages continue to use
LLMClient(OpenRouter/vLLM) — no regression - All existing tests pass:
pytest app/test/ -v - Acceptance tests pass with real Deepseek API key
Rollback Plan
If Deepseek API causes issues in the decompose step:
QueryDecomposeruses duck typing — swap back toLLMClientby changing one line inquery.py:decomposer = QueryDecomposer(llm_client, prompt_service=prompt_service) # was llm_client_dpLLMClientDPclass and config fields remain in codebase (no cleanup needed)- No database migrations, no schema changes, no frontend changes
LLMClientDP Interface (design sketch)
class LLMClientDP:
"""Async Deepseek API client for query decomposition step only."""
def __init__(self, settings: Settings):
"""Initialize with Deepseek-specific config from Settings."""
...
async def complete(
self, prompt: str, temperature: float = 0.7, step_name: str = "DP"
) -> str:
"""Send chat completion with thinking disabled.
Args:
prompt: The decompose prompt.
temperature: 0.7 for fallback path (creative parsing).
step_name: "QueryDecomposer" for logging.
Returns:
LLM response text (JSON string for legacy parsing).
"""
...
async def complete_structured(
self, prompt: str, pydantic_model, step_name: str = "DP"
):
"""Structured output via OpenAI-native json_schema.
Args:
prompt: The decompose prompt.
pydantic_model: Pydantic BaseModel subclass (SubQuestions).
step_name: "QueryDecomposer" for logging.
Returns:
Validated Pydantic model instance.
"""
...
async def close(self):
"""Close the underlying HTTP client."""
...
Non-Goals (explicitly out of scope)
- ❌ Do NOT create a provider abstraction/factory for LLM clients
- ❌ Do NOT modify
LLMClientclass (risk of breaking filter/generate stages) - ❌ Do NOT add thinking mode toggle for Deepseek (always disabled per user request)
- ❌ Do NOT change
RelevanceFilterorRAGService(they continue usingLLMClient) - ❌ Do NOT change
QueryDecomposer(duck typing — requires no modifications) - ❌ Do NOT add frontend changes
- ❌ Do NOT add streaming support for Deepseek (decompose is not streamed)