330 lines
15 KiB
Markdown
330 lines
15 KiB
Markdown
# Package 6 Enhancement Plan — Deepseek API Client for Decomposed Questions
|
|
|
|
**Source**: User request (2026-05-04)
|
|
**Scope**: Add a dedicated Deepseek API client (`LLMClientDP`) for the query decomposition step only, while the existing `LLMClient` continues to handle filtering and response generation via OpenRouter/vLLM.
|
|
**Status**: Draft
|
|
|
|
---
|
|
|
|
## Objective
|
|
|
|
Replace the LLM provider for the **decompose step** (Stage 1 of the 3-step RAG pipeline) with Deepseek API (`deepseek-v4-pro`), keeping the filter and generate stages on the existing OpenRouter/vLLM provider.
|
|
|
|
The decomposition step is the first and most critical LLM call — it shapes all downstream retrieval. Using Deepseek's `deepseek-v4-pro` with thinking mode disabled provides a dedicated pipeline for this stage with better structured output reliability.
|
|
|
|
**Non-goal**: The filter and generate steps remain unchanged (still use `LLMClient` with OpenRouter/vLLM).
|
|
|
|
---
|
|
|
|
## Current Architecture
|
|
|
|
```
|
|
query.py _query_stream():
|
|
settings = get_settings()
|
|
llm_client = LLMClient(settings) # ← single client for ALL stages
|
|
|
|
decomposer = QueryDecomposer(llm_client, ...) # Stage 1: uses llm_client
|
|
...
|
|
relevance_filter = RelevanceFilter(llm_client, ...) # Stage 3: uses same llm_client
|
|
...
|
|
rag.generate_response_per_subquestion(...) # Stage 4: uses same llm_client
|
|
```
|
|
|
|
`LLMClient` is configured via `Settings`:
|
|
- `llm_base_url` → OpenRouter or vLLM endpoint
|
|
- `llm_api_key` → shared API key
|
|
- `llm_model_name` → e.g. `qwen/qwen3.5-35b-a3b`
|
|
- `vllm_engine` → toggles `extra_body` format
|
|
- `llm_enable_thinking` → controls thinking/reasoning tokens
|
|
|
|
---
|
|
|
|
## Target Architecture
|
|
|
|
```
|
|
query.py _query_stream():
|
|
settings = get_settings()
|
|
|
|
llm_client_dp = LLMClientDP(settings) # NEW: Deepseek client
|
|
llm_client = LLMClient(settings) # existing: OpenRouter/vLLM client
|
|
|
|
decomposer = QueryDecomposer(llm_client_dp, ...) # Stage 1: Deepseek
|
|
...
|
|
relevance_filter = RelevanceFilter(llm_client, ...) # Stage 3: OpenRouter/vLLM
|
|
...
|
|
rag.generate_response_per_subquestion(...) # Stage 4: OpenRouter/vLLM
|
|
```
|
|
|
|
---
|
|
|
|
## Decision Register
|
|
|
|
| # | Decision | Rationale |
|
|
|---|----------|-----------|
|
|
| 1 | **Separate class `LLMClientDP`**, not a subclass of `LLMClient` | Deepseek has different `extra_body` format (`{"thinking": {"type": "disabled"}}` vs OpenRouter's `{"reasoning": {"enabled": false}}`), different config keys, and no vLLM engine support. Inheritance would create tight coupling to the provider-switching logic in `_build_extra_body()`. A clean standalone class is simpler, more testable, and avoids accidentally breaking existing stages. |
|
|
| 2 | **New config fields in `Settings`** (`dp_base_url`, `dp_api_key`, `dp_model_name`) | Separate from existing `llm_*` fields. The decompose step uses a different provider+model — they should have independent config. If `dp_api_key` is empty, fall back to `llm_api_key` (same API key for different providers is common). |
|
|
| 3 | **Thinking mode disabled** (`extra_body={"thinking": {"type": "disabled"}}`) | User explicitly requested thinking mode disabled. Deepseek defaults to thinking enabled; disabling it enables `temperature` control and avoids reasoning token overhead for structured output. |
|
|
| 4 | **Only `complete()` and `complete_structured()` needed** | `QueryDecomposer.decompose()` calls `complete_structured()` (primary) and `complete()` (fallback). No LangChain `_get_langchain_model()` needed — Deepseek uses the same OpenAI-compatible `response_format` as OpenRouter, so the existing `_complete_structured_openai()` pattern works. |
|
|
| 5 | **Reuse `AsyncOpenAI` from `openai` package** | Deepseek's API is OpenAI-compatible. Same SDK, different `base_url` + `api_key`. No new dependencies. |
|
|
| 6 | **New file `backend/app/services/llm_client_dp.py`** | Clean separation. Follows existing naming (`llm_client.py`). The `_dp` suffix clarifies it's the Deepseek-specific client for the decompose pipeline step. |
|
|
| 7 | **`QueryDecomposer` requires no changes** | It accepts any object with `complete()` and `complete_structured()` methods (duck typing). `LLMClientDP` provides both. |
|
|
| 8 | **Temperature 0.0 for structured decompose, 0.7 for fallback** | Matches existing `LLMClient` behavior: `complete_structured()` uses 0.0 for deterministic schema compliance; `complete()` fallback uses 0.7 for creative fallback parsing. |
|
|
| 9 | **Logging matches existing `LLMClient` pattern** | Same `_truncate_prompt_for_log()`, same `[step_name]` log prefix, same timing measurement. Operational consistency. |
|
|
|
|
---
|
|
|
|
## Deepseek API Details (from docs)
|
|
|
|
| Parameter | Value |
|
|
|-----------|-------|
|
|
| Base URL | `https://api.deepseek.com` |
|
|
| Chat completions endpoint | `POST /chat/completions` |
|
|
| Model | `deepseek-v4-pro` |
|
|
| Auth | Bearer token (passed as `api_key` to OpenAI SDK) |
|
|
| Thinking disable | `extra_body={"thinking": {"type": "disabled"}}` |
|
|
| SDK compatibility | OpenAI-compatible (`AsyncOpenAI` from `openai` package) |
|
|
| Rate limiting | Dynamic concurrency (HTTP 429 when hit), 10-min inference timeout |
|
|
|
|
---
|
|
|
|
## Files to Create
|
|
|
|
| # | File | Purpose |
|
|
|---|------|---------|
|
|
| F1 | `backend/app/services/llm_client_dp.py` | New `LLMClientDP` class |
|
|
| F2 | `backend/app/test/test_phase6_llm_client_dp.py` | Unit tests for `LLMClientDP` |
|
|
| F3 | `backend/app/test/test_phase6_decompose_dp.py` | Integration: `QueryDecomposer` + `LLMClientDP` |
|
|
| F4 | `backend/app/test/acceptance/test_acceptance_phase6_dp_decompose.py` | Acceptance: real Deepseek API call |
|
|
|
|
---
|
|
|
|
## Files to Modify
|
|
|
|
| # | File | Change |
|
|
|---|------|--------|
|
|
| M1 | `backend/app/core/config.py` | Add `dp_base_url`, `dp_api_key`, `dp_model_name` to `Settings` |
|
|
| M2 | `backend/app/core/dependencies.py` | Add `get_llm_client_dp()` dependency |
|
|
| M3 | `backend/app/routers/query.py` | Create `LLMClientDP` for decompose, keep `LLMClient` for filter+generate |
|
|
|
|
---
|
|
|
|
## Implementation Tasks
|
|
|
|
### Task 6.1: Add Deepseek config to Settings
|
|
|
|
- [ ] Add to `backend/app/core/config.py`:
|
|
```python
|
|
# Deepseek API (decompose step only)
|
|
dp_base_url: str = "https://api.deepseek.com"
|
|
dp_api_key: str = ""
|
|
dp_model_name: str = "deepseek-v4-pro"
|
|
```
|
|
- [ ] Add fallback logic: if `dp_api_key` is empty, use `llm_api_key` (handled in `LLMClientDP.__init__`)
|
|
- [ ] Update `.env.example` with new fields
|
|
- **No separate test file** — tested implicitly via `LLMClientDP` instantiation tests
|
|
|
|
### Task 6.2: Create `LLMClientDP` class
|
|
|
|
- [ ] Create `backend/app/services/llm_client_dp.py` with:
|
|
```python
|
|
class LLMClientDP:
|
|
"""Async Deepseek API client for query decomposition step only.
|
|
|
|
Uses the OpenAI-compatible SDK with Deepseek's base URL.
|
|
Thinking mode is always disabled (extra_body={"thinking": {"type": "disabled"}}).
|
|
"""
|
|
|
|
def __init__(self, settings: Settings):
|
|
api_key = settings.dp_api_key or settings.llm_api_key
|
|
self.model = settings.dp_model_name
|
|
self.logger = logging.getLogger(__name__)
|
|
self._client = AsyncOpenAI(
|
|
base_url=settings.dp_base_url.rstrip("/"),
|
|
api_key=api_key,
|
|
timeout=settings.llm_timeout, # reuse existing timeout
|
|
http_client=httpx.AsyncClient(
|
|
headers={"Content-Type": "application/json"},
|
|
),
|
|
)
|
|
self._langchain_model = None
|
|
|
|
async def complete(self, prompt: str, temperature: float = 0.7, step_name: str = "DP") -> str:
|
|
"""Send chat completion with thinking disabled."""
|
|
# Same pattern as LLMClient.complete() but with Deepseek extra_body
|
|
...
|
|
|
|
async def complete_structured(self, prompt: str, pydantic_model, step_name: str = "DP"):
|
|
"""Structured output using OpenAI-native json_schema (Deepseek compatible)."""
|
|
# Same pattern as LLMClient._complete_structured_openai()
|
|
...
|
|
|
|
async def close(self): ...
|
|
```
|
|
- [ ] `_build_extra_body()` → always returns `{"thinking": {"type": "disabled"}}` (no toggle — always disabled per user request)
|
|
- [ ] `_get_langchain_model()` → same `init_chat_model()` pattern as `LLMClient`, but using Deepseek `base_url`/`api_key`. No vLLM-specific logic needed.
|
|
- [ ] Reuse `_truncate_prompt_for_log()` helper (or copy it — keep the class self-contained)
|
|
- [ ] `LLMClientDPError` exception class (mirrors `LLMClientError`)
|
|
- **Test file**: `test_phase6_llm_client_dp.py`
|
|
|
|
### Task 6.3: Wire `LLMClientDP` into the query router
|
|
|
|
- [ ] In `backend/app/routers/query.py` `_query_stream()`:
|
|
```python
|
|
# Before (current):
|
|
llm_client = LLMClient(settings)
|
|
decomposer = QueryDecomposer(llm_client, prompt_service=prompt_service)
|
|
|
|
# After:
|
|
from app.services.llm_client_dp import LLMClientDP
|
|
llm_client_dp = LLMClientDP(settings)
|
|
llm_client = LLMClient(settings)
|
|
|
|
decomposer = QueryDecomposer(llm_client_dp, prompt_service=prompt_service) # uses Deepseek
|
|
# ... filter and generate still use llm_client (OpenRouter/vLLM)
|
|
```
|
|
- [ ] Ensure `LLMClientDP` is properly closed if needed (the existing `llm_client` is not explicitly closed in `_query_stream()` either — match existing pattern)
|
|
- [ ] No changes to `RelevanceFilter` or `RAGService` instantiation
|
|
- **Test file**: Existing integration tests (`test_integration_phase1.py`, `test_phase4_integration_query_pipeline.py`) must pass with updated mocks
|
|
|
|
### Task 6.4: Add `get_llm_client_dp()` to dependencies
|
|
|
|
- [ ] In `backend/app/core/dependencies.py`:
|
|
```python
|
|
def get_llm_client_dp():
|
|
settings = get_settings_cached()
|
|
from app.services.llm_client_dp import LLMClientDP
|
|
return LLMClientDP(settings)
|
|
```
|
|
- [ ] Update `conftest.py` if needed for test fixtures
|
|
- **No separate test file** — used by integration tests
|
|
|
|
### Task 6.5: Update tests
|
|
|
|
- [ ] **Update `test_phase1_query.py`**: The `_MockLLMClient` mock replaces `LLMClient` — ensure it still works for both decompose and filter/generate stages. The decompose mock needs `complete_structured()` support (already added in Phase 5).
|
|
- [ ] **Update `test_phase5_query_decomposer_structured.py`**: `MockLLMClientStructured` is a standalone mock — unaffected.
|
|
- [ ] **Update `conftest.py`**: If `mock_llm_client` fixture exists, verify it doesn't conflict with the new `LLMClientDP`.
|
|
- [ ] Run full backend test suite: `cd backend && pytest app/test/ -v`
|
|
|
|
### Task 6.6: Acceptance test
|
|
|
|
- [ ] Create `backend/app/test/acceptance/test_acceptance_phase6_dp_decompose.py`:
|
|
- Requires `.env` with valid `DP_API_KEY` (Deepseek API key)
|
|
- Tests: Cantonese question → valid sub-questions, English question → valid sub-questions, empty question → `[]`, structured output returns `SubQuestions` Pydantic model
|
|
- Mark with `@pytest.mark.acceptance` and `@pytest.mark.slow`
|
|
- [ ] Run: `cd backend && pytest app/test/acceptance/test_acceptance_phase6_dp_decompose.py -v -m acceptance`
|
|
|
|
---
|
|
|
|
## Test Files Summary
|
|
|
|
| # | Test File | Type | Coverage |
|
|
|---|-----------|------|----------|
|
|
| T6.2 | `test_phase6_llm_client_dp.py` | Integration | `LLMClientDP.complete()` with mock Deepseek, `complete_structured()` with mock LangChain model, thinking disabled in extra_body, error handling, timeout |
|
|
| T6.3 | `test_phase6_decompose_dp.py` | Integration | `QueryDecomposer` + `LLMClientDP` end-to-end, structured output path, legacy fallback path, empty decomposition |
|
|
| T6.4 | (update) `test_phase1_query.py` | Integration | Updated mock to support dual-client architecture |
|
|
| AT6.6 | `test_acceptance_phase6_dp_decompose.py` | Acceptance | Real Deepseek API: Cantonese decompose, English decompose, structured output |
|
|
|
|
---
|
|
|
|
## Dependency Graph
|
|
|
|
```
|
|
Task 6.1 (add Settings fields)
|
|
│
|
|
└── Task 6.2 (create LLMClientDP class)
|
|
│
|
|
├── Task 6.3 (wire into query.py)
|
|
│ │
|
|
│ └── Task 6.5 (update tests)
|
|
│ │
|
|
│ └── Task 6.6 (acceptance test)
|
|
│
|
|
└── Task 6.4 (add dependencies.py helper)
|
|
```
|
|
|
|
Tasks 6.3 and 6.4 can run in parallel after 6.2.
|
|
|
|
---
|
|
|
|
## Acceptance Criteria
|
|
|
|
- [ ] `LLMClientDP` sends requests to `https://api.deepseek.com/chat/completions` with `deepseek-v4-pro` model
|
|
- [ ] Every request includes `extra_body={"thinking": {"type": "disabled"}}`
|
|
- [ ] `QueryDecomposer.decompose()` returns valid sub-questions when using `LLMClientDP`
|
|
- [ ] `complete_structured()` returns validated `SubQuestions` Pydantic model (no JSON parse errors)
|
|
- [ ] `complete()` fallback works (legacy JSON parsing) when structured output fails
|
|
- [ ] Filter and generate stages continue to use `LLMClient` (OpenRouter/vLLM) — no regression
|
|
- [ ] All existing tests pass: `pytest app/test/ -v`
|
|
- [ ] Acceptance tests pass with real Deepseek API key
|
|
|
|
---
|
|
|
|
## Rollback Plan
|
|
|
|
If Deepseek API causes issues in the decompose step:
|
|
1. `QueryDecomposer` uses duck typing — swap back to `LLMClient` by changing one line in `query.py`:
|
|
```python
|
|
decomposer = QueryDecomposer(llm_client, prompt_service=prompt_service) # was llm_client_dp
|
|
```
|
|
2. `LLMClientDP` class and config fields remain in codebase (no cleanup needed)
|
|
3. No database migrations, no schema changes, no frontend changes
|
|
|
|
---
|
|
|
|
## LLMClientDP Interface (design sketch)
|
|
|
|
```python
|
|
class LLMClientDP:
|
|
"""Async Deepseek API client for query decomposition step only."""
|
|
|
|
def __init__(self, settings: Settings):
|
|
"""Initialize with Deepseek-specific config from Settings."""
|
|
...
|
|
|
|
async def complete(
|
|
self, prompt: str, temperature: float = 0.7, step_name: str = "DP"
|
|
) -> str:
|
|
"""Send chat completion with thinking disabled.
|
|
|
|
Args:
|
|
prompt: The decompose prompt.
|
|
temperature: 0.7 for fallback path (creative parsing).
|
|
step_name: "QueryDecomposer" for logging.
|
|
|
|
Returns:
|
|
LLM response text (JSON string for legacy parsing).
|
|
"""
|
|
...
|
|
|
|
async def complete_structured(
|
|
self, prompt: str, pydantic_model, step_name: str = "DP"
|
|
):
|
|
"""Structured output via OpenAI-native json_schema.
|
|
|
|
Args:
|
|
prompt: The decompose prompt.
|
|
pydantic_model: Pydantic BaseModel subclass (SubQuestions).
|
|
step_name: "QueryDecomposer" for logging.
|
|
|
|
Returns:
|
|
Validated Pydantic model instance.
|
|
"""
|
|
...
|
|
|
|
async def close(self):
|
|
"""Close the underlying HTTP client."""
|
|
...
|
|
```
|
|
|
|
---
|
|
|
|
## Non-Goals (explicitly out of scope)
|
|
|
|
- ❌ Do NOT create a provider abstraction/factory for LLM clients
|
|
- ❌ Do NOT modify `LLMClient` class (risk of breaking filter/generate stages)
|
|
- ❌ Do NOT add thinking mode toggle for Deepseek (always disabled per user request)
|
|
- ❌ Do NOT change `RelevanceFilter` or `RAGService` (they continue using `LLMClient`)
|
|
- ❌ Do NOT change `QueryDecomposer` (duck typing — requires no modifications)
|
|
- ❌ Do NOT add frontend changes
|
|
- ❌ Do NOT add streaming support for Deepseek (decompose is not streamed)
|