docs: add vLLM structured output fix plan

- Diagnose: vLLM ignores OpenAI-native response_format, causing NoneType error
- Diagnose: legacy fallback prompt lacks JSON instruction → empty questions
- Plan: use vLLM-native guided_json via extra_body instead of with_structured_output
- Plan: update _SEED_DECOMPOSE with JSON format instruction
- Plan: add diagnostic logging (exc_info, method, schema preview)

wip: temporary function_calling switch for vLLM (to be replaced by guided_json)
2026-04-29 16:42:23 +08:00 · 2aca18d30e
parent cbb958d75d
commit 2aca18d30e
2 changed files with 137 additions and 1 deletions
--- a/.plans/debug_vllm_structured_output.md
+++ b/.plans/debug_vllm_structured_output.md
@@ -0,0 +1,131 @@
 # Fix Plan: vLLM Structured Output Failure + Legacy Fallback
 **Date**: 2026-04-29
 **Source**: Server log analysis — vLLM backend with Qwen3.6-35B-A3B
 ---
 ## Diagnosis
 ### Issue 1: `argument of type 'NoneType' is not iterable` in `complete_structured()`
 **Root cause**: vLLM supports JSON-schema structured output, but on this server only through its own protocol (`guided_json` in `extra_body`), not through OpenAI's `response_format={"type": "json_schema"}`. LangChain's `with_structured_output(method="json_schema")` sends the OpenAI-native format; vLLM ignores it and returns plain text. LangChain's structured-output parser then receives a response without the schema-validated fields it expects and hits `None` where it needs an iterable, which raises the `TypeError`.
 **Evidence**:
 - vLLM official docs show `extra_body={"guided_json": schema}` (legacy) or `extra_body={"structured_outputs": {"json": schema}}` (new unified format)
 - vLLM v0.10.2+ partially supports `response_format={"type": "json_schema"}` but it is newer and may not be on the server
 - The error happens after 550ms (LLM responded) — response parsing failure, not connection failure
 - The log shows `LLM Extra Body {'chat_template_kwargs': ...}` but no guided_json / structured_outputs
 **Current code** (`llm_client.py` line 139):
 ```python
 structured = model.with_structured_output(pydantic_model, method="json_schema")
 ```
 This generates `response_format={"type": "json_schema", "json_schema": <schema>}` → vLLM ignores → plain text response → parse failure.
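To make the mismatch concrete, the sketch below builds the two sets of request fields side by side. It is only an illustration: `SCHEMA`, the `name` value, and both helper functions are hypothetical, and the field shapes are taken from the OpenAI `response_format` convention and the vLLM keys quoted in the evidence list above.

```python
import json

# Hypothetical schema for illustration; in the project it would come from
# pydantic_model.model_json_schema().
SCHEMA = {
    "type": "object",
    "properties": {"questions": {"type": "array", "items": {"type": "string"}}},
    "required": ["questions"],
}


def openai_native_fields(schema: dict, name: str = "Decomposition") -> dict:
    """Fields sent when the OpenAI-native json_schema method is used."""
    return {
        "response_format": {
            "type": "json_schema",
            "json_schema": {"name": name, "schema": schema},
        }
    }


def vllm_native_fields(schema: dict, legacy: bool = True) -> dict:
    """Fields the vLLM-specific protocol expects, passed through extra_body."""
    if legacy:
        return {"guided_json": schema}
    return {"structured_outputs": {"json": schema}}


if __name__ == "__main__":
    print(json.dumps(openai_native_fields(SCHEMA), indent=2))
    print(json.dumps(vllm_native_fields(SCHEMA), indent=2))
```

The two payloads share no top-level key, which is consistent with the log line that shows `chat_template_kwargs` but neither `guided_json` nor `structured_outputs`.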
 ### Issue 2: Legacy fallback prompt lacks JSON instruction
 **Root cause**: `_SEED_DECOMPOSE` (sqlite_db.py line 12) says "Break it down into 2-5 simplified sub-questions" and contains no JSON format instruction. The user's customized prompt in profile A is Chinese, "請你把每條問題簡化" ("Please simplify each question"), and has no JSON instruction either. When structured output fails, the legacy `_parse_legacy_json()` calls `_extract_json_from_markdown()`, which only looks for JSON code blocks or bare JSON. Natural language bullets → empty list → `extracted_questions: []`.
 **Evidence**:
 - `_SEED_DECOMPOSE` = `"Given this question: '{question}'\n\nBreak it down into 2-5 simplified sub-questions that would help search for relevant information. Each sub-question should be short and focused on one aspect."`
 - The user's customized prompt (profile A, from DB) is injected via `prompt_service.get_prompt_template("decompose")` → same issue
 - `_parse_legacy_json()` → `_extract_json_from_markdown()` → regex for fenced `json` code blocks → finds none → returns raw text → `json.loads()` fails → `[]`
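The failure chain above can be reproduced with a small stand-in. The real `_extract_json_from_markdown()` and `_parse_legacy_json()` are not shown in this plan, so the two functions below are hypothetical approximations of the described behavior, not the project's code.

```python
import json
import re

# Built indirectly so this example can sit inside a Markdown fence.
FENCE = "`" * 3
_FENCE_RE = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def extract_json_candidate(text: str) -> str:
    """Return the body of the first fenced block, or the raw text if none."""
    match = _FENCE_RE.search(text)
    return match.group(1).strip() if match else text.strip()


def parse_legacy(text: str) -> list[str]:
    """Parse a JSON array of strings; return [] on any failure."""
    try:
        data = json.loads(extract_json_candidate(text))
    except json.JSONDecodeError:
        return []
    if isinstance(data, list):
        return [item for item in data if isinstance(item, str)]
    return []


bullets = "- What is X?\n- How does Y work?"
bare_json = '["What is X?", "How does Y work?"]'
fenced = FENCE + 'json\n["What is X?"]\n' + FENCE

print(parse_legacy(bullets))    # natural-language bullets are not JSON
print(parse_legacy(bare_json))  # a bare JSON array parses
print(parse_legacy(fenced))     # so does a fenced one
```

Under these assumptions, a prompt that never asks for JSON produces bullet text, and the parser has nothing to recover.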
 ### Issue 3: Insufficient logging
 **Current gaps**:
 - `complete_structured()` error handler (line 156): `%s` for exc — only prints `str(exc)`, not the traceback
 - `_parse_legacy_json()` returns `[]` silently — no log of what text was being parsed
 - No log of which structured output method is being used per call
 - No log of the vLLM server version or capability
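The first gap is easy to demonstrate with the standard `logging` module. The sketch below logs the same failure twice, once with `%s` only and once with `exc_info=True`; the logger names, the step name `decompose`, and the `contains_questions` helper are made up for the demonstration.

```python
import io
import logging


def contains_questions(response):
    """Mimic a parser that assumes the response is a container."""
    return "questions" in response


def capture_log(use_exc_info: bool) -> str:
    """Log one failure with or without exc_info and return the log text."""
    stream = io.StringIO()
    logger = logging.getLogger(f"demo.exc_info.{use_exc_info}")
    logger.setLevel(logging.ERROR)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    logger.addHandler(handler)
    try:
        try:
            contains_questions(None)  # raises TypeError, as in Issue 1
        except TypeError as exc:
            logger.error(
                "[%s] structured call failed: %s",
                "decompose",
                exc,
                exc_info=use_exc_info,
            )
    finally:
        logger.removeHandler(handler)
    return stream.getvalue()


message_only = capture_log(False)
with_traceback = capture_log(True)
print(message_only)
print(with_traceback)
```

Only the second output contains the `Traceback` section, which is what identifies the frame inside the structured-output parser where `None` was encountered.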
 ---
 ## Fix Plan
 ### Fix A: Use vLLM-native `guided_json` for structured output (llm_client.py)
 **Approach**: Modify `_get_langchain_model()` to pass `extra_body` with `guided_json` for vLLM, bypassing LangChain's `with_structured_output()` method. Instead of:
 ```python
 structured = model.with_structured_output(pydantic_model, method="json_schema")
 result = await structured.ainvoke(prompt)
 ```
 Use direct model invocation with guided decoding:
 ```python
 schema = pydantic_model.model_json_schema()
 model = self._get_langchain_model_for_vllm(schema)
 response = await model.ainvoke(prompt)  # returns a message object, not a str
 result = pydantic_model.model_validate_json(response.content)
 ```
 **Implementation**:
 1. In `complete_structured()`, detect vLLM via `self.settings.vllm_engine`
 2. For vLLM: build the Pydantic schema, create a ChatOpenAI model with `extra_body={"guided_json": schema}` (legacy format) or `extra_body={"structured_outputs": {"json": schema}}` (new format), call `ainvoke(prompt)`, parse the JSON response with `pydantic_model.model_validate_json()`
 3. For OpenRouter: keep existing `with_structured_output(method="json_schema")` path
 4. Add a fallback: try the `"structured_outputs"` format first; if that fails, try the `"guided_json"` format
 **Fallback strategy**: If both structured output methods fail:
 1. Log the full error with traceback
 2. Fall back to `complete()` with JSON instruction appended to prompt
 3. Parse the response with `_parse_legacy_json()` (which we will also improve — see Fix C)
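The implementation steps and the fallback strategy can be sketched as one control flow. This is a minimal sketch under stated assumptions, not the project's code: `send` is a hypothetical async callable that stands in for the chat model and takes the `extra_body` dict directly, `json.loads` stands in for `pydantic_model.model_validate_json()`, and `fake_send` simulates a server that only honors the legacy key.

```python
import asyncio
import json
from typing import Awaitable, Callable

# Hypothetical transport: (prompt, extra_body) -> raw response text.
SendFn = Callable[[str, dict], Awaitable[str]]


async def decompose_with_fallback(
    prompt: str, schema: dict, send: SendFn
) -> tuple[str, object]:
    """Try structured_outputs, then guided_json, then a plain JSON prompt."""
    attempts = [
        ("structured_outputs", {"structured_outputs": {"json": schema}}),
        ("guided_json", {"guided_json": schema}),
    ]
    for name, extra_body in attempts:
        try:
            # A server that ignores the key returns prose, so parsing fails
            # here and the loop moves on to the next format.
            return name, json.loads(await send(prompt, extra_body))
        except Exception:
            continue  # real code would log with exc_info=True here
    text = await send(prompt + "\n\nReturn only JSON matching the schema.", {})
    return "legacy", json.loads(text)


async def fake_send(prompt: str, extra_body: dict) -> str:
    """Stand-in server that only honors the legacy guided_json key."""
    if "guided_json" in extra_body:
        return '{"questions": ["What is X?", "How does Y work?"]}'
    return "Sure! Here are some sub-questions: ..."


method, data = asyncio.run(
    decompose_with_fallback("Explain X and Y", {"type": "object"}, fake_send)
)
print(method, data)
```

Returning the name of the format that succeeded keeps the per-call method visible to the logging added in Fix C.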
 ### Fix B: Update seed decompose template (sqlite_db.py)
 **Approach**: Add JSON format instruction to `_SEED_DECOMPOSE` so the legacy fallback path has a chance of working.
 **New template**:
 ```
 "Given this question: '{question}'\n\n"
 "Break it down into 2-5 simplified sub-questions that would help "
 "search for relevant information. Each sub-question should be short "
 "and focused on one aspect.\n\n"
 "Return a JSON array of strings: [\"sub-question 1\", \"sub-question 2\"]"
 ```
 **Note**: This only helps NEW prompt profiles. Existing profiles with custom prompts (like profile A) still need the user to update their prompt via the System Prompts UI. However, by fixing structured output (Fix A), the legacy path won't be reached in normal operation.
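As a quick check of the new template, the sketch below renders it and parses a conforming reply. It assumes the template is filled with `str.format`, which the `{question}` placeholder suggests but the plan does not confirm; the question and the example reply are invented.

```python
import json

SEED_DECOMPOSE = (
    "Given this question: '{question}'\n\n"
    "Break it down into 2-5 simplified sub-questions that would help "
    "search for relevant information. Each sub-question should be short "
    "and focused on one aspect.\n\n"
    "Return a JSON array of strings: [\"sub-question 1\", \"sub-question 2\"]"
)

# Assumption: the prompt service renders the template with str.format.
prompt = SEED_DECOMPOSE.format(question="How does vLLM guided decoding work?")

# Invented reply of the shape the new instruction asks for.
example_reply = '["What is guided decoding?", "How does vLLM enforce a schema?"]'
sub_questions = json.loads(example_reply)

print(prompt)
print(sub_questions)
```

If `str.format` is indeed the rendering mechanism, asking for a JSON array rather than an object has a practical benefit: square brackets need no escaping, whereas a literal `{...}` example in the template would have to be written with doubled braces.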
 ### Fix C: Add diagnostic logging
 1. **`complete_structured()` error handler** (llm_client.py line 156):
   - Change `self.logger.error(...)` to `self.logger.error(..., exc_info=True)` to include full traceback
   - Add `self.logger.info("[%s] Structured output method used: %s", step_name, method)`
 2. **`_parse_legacy_json()`** (query_decomposer.py line 43):
   - Add `logger.info("Legacy JSON parse: extracted text: %s", response[:200])` before `json.loads()`
 3. **`complete_structured()` vLLM path**:
   - Log the schema being sent: `logger.info("[%s] vLLM guided_json schema: %s", step_name, json.dumps(schema)[:200])`
   - Log which format was used (`guided_json` vs `structured_outputs`)
 4. **Query decomposer** (query_decomposer.py line 96):
   - Add `logger.info("Using decompose prompt: %s", prompt[:200])` to show which prompt is in use
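The `[:200]` previews in the items above could share one helper so that truncation is visible in the log. The sketch below is one possible shape; `preview`, `log_structured_attempt`, and the logger name are hypothetical.

```python
import json
import logging

logger = logging.getLogger("llm_client.demo")  # hypothetical logger name


def preview(value: object, limit: int = 200) -> str:
    """Render value as text and truncate it, marking the cut with '...'."""
    if isinstance(value, str):
        text = value
    else:
        # ensure_ascii=False keeps non-ASCII prompts readable in the log.
        text = json.dumps(value, ensure_ascii=False)
    return text if len(text) <= limit else text[:limit] + "..."


def log_structured_attempt(step_name: str, fmt: str, schema: dict) -> None:
    """Record which vLLM format is used and a bounded view of the schema."""
    logger.info("[%s] vLLM structured output format: %s", step_name, fmt)
    logger.info("[%s] vLLM schema preview: %s", step_name, preview(schema))


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    log_structured_attempt("decompose", "guided_json", {"type": "object"})
```

Passing the values as logging arguments rather than pre-formatting the string matches the `%s` style already used in `llm_client.py`.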
 ### Fix D: Revert the `function_calling` commit
 The temporary fix (commit `cbb958d`) switched the vLLM path to `method="function_calling"`. This is unreliable: whether vLLM supports function calling depends on the model and the server configuration. Revert it in favor of Fix A (direct guided decoding).
 ---
 ## Files to Change
 | File | Changes |
 |------|---------|
 | `backend/app/services/llm_client.py` | Fix A: add vLLM-guided path in `complete_structured()`; Fix C: add `exc_info=True`, method logging |
 | `backend/app/core/sqlite_db.py` | Fix B: update `_SEED_DECOMPOSE` with JSON instruction |
 | `backend/app/services/query_decomposer.py` | Fix C: add logging in `_parse_legacy_json()`, log prompt used |
 ## Test Plan
 1. **Backend tests**: `pytest app/test/test_phase5_*.py` — must all pass (108 tests currently)
 2. **Acceptance test**: Run against vLLM backend to verify structured decomposition works
 3. **Rollback safety**: OpenRouter path unchanged; only vLLM path modified
 ## Decision Points
 | # | Decision | Rationale |
 |---|----------|-----------|
 | 1 | Use `guided_json` via `extra_body`, not `with_structured_output()`, for vLLM | LangChain's structured output sends `response_format`, which vLLM may not support. `guided_json` is vLLM-native, although newer servers expect `structured_outputs` instead (see decision 2). |
 | 2 | Try `structured_outputs` format first, fall back to `guided_json` | Newer vLLM versions standardized on `structured_outputs`. Older versions use `guided_json`. Try both. |
 | 3 | Add JSON instruction to English seed template only | Chinese template is user-customized — can't force it. Structured output fix (Fix A) makes legacy fallback less critical. |
 | 4 | Keep OpenRouter path unchanged | `method="json_schema"` already works on OpenRouter, so there is no reason to change it. |
--- a/backend/app/services/llm_client.py
+++ b/backend/app/services/llm_client.py
@@ -136,7 +136,12 @@ class LLMClient:
        try:
            model = self._get_langchain_model()
-            structured = model.with_structured_output(pydantic_model, method="json_schema")
+            # vLLM servers may not support json_schema response_format. Use
+            # function_calling instead, which is more widely supported by
+            # open-source models served through vLLM.
+            method = "function_calling" if self.settings.vllm_engine else "json_schema"
+            self.logger.info("[%s] Structured output method: %s", step_name, method)
+            structured = model.with_structured_output(pydantic_model, method=method)
            result = await structured.ainvoke(prompt)
            elapsed_ms = (time.perf_counter() - start_time) * 1000