Fix Plan: vLLM Structured Output Failure + Legacy Fallback
Date: 2026-04-29
Source: Server log analysis (vLLM backend with Qwen3.6-35B-A3B)
Diagnosis
Issue 1: argument of type 'NoneType' is not iterable in complete_structured()
Root cause: vLLM does support JSON-schema structured output, but through its own protocol (guided_json in extra_body), not through OpenAI's response_format={"type": "json_schema"}. LangChain's with_structured_output(method="json_schema") sends the OpenAI-native format. The vLLM server in use appears to ignore it and returns plain text. LangChain's structured parser then receives a response without the expected schema-validated fields and hits None where it expects an iterable, which raises the TypeError in the heading.
Evidence:
- vLLM official docs show extra_body={"guided_json": schema} (legacy) or extra_body={"structured_outputs": {"json": schema}} (new unified format)
- vLLM v0.10.2+ partially supports response_format={"type": "json_schema"}, but it is newer and may not be on the server
- The error happens after 550ms (the LLM responded), so this is a response-parsing failure, not a connection failure
- The log shows LLM Extra Body {'chat_template_kwargs': ...} but no guided_json / structured_outputs
Current code (llm_client.py line 139):
structured = model.with_structured_output(pydantic_model, method="json_schema")
This generates response_format={"type": "json_schema", "json_schema": <schema>} → vLLM ignores → plain text response → parse failure.
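To make the difference concrete, the three request shapes discussed in this section can be written out as plain dictionaries. This is a minimal sketch, not the project's code: build_structured_kwargs and SCHEMA are hypothetical names, and the extra_body field names are taken from the evidence list rather than verified against the running server.

```python
from typing import Any

# Hypothetical schema; in the real code it would come from
# pydantic_model.model_json_schema().
SCHEMA: dict[str, Any] = {
    "type": "object",
    "properties": {"questions": {"type": "array", "items": {"type": "string"}}},
    "required": ["questions"],
}


def build_structured_kwargs(schema: dict[str, Any], style: str) -> dict[str, Any]:
    """Return the extra request kwargs for one structured-output style."""
    if style == "openai_response_format":
        # What with_structured_output(method="json_schema") effectively sends.
        return {
            "response_format": {
                "type": "json_schema",
                "json_schema": {"name": "result", "schema": schema},
            }
        }
    if style == "structured_outputs":
        # Newer vLLM format, passed through extra_body.
        return {"extra_body": {"structured_outputs": {"json": schema}}}
    if style == "guided_json":
        # Legacy vLLM format, passed through extra_body.
        return {"extra_body": {"guided_json": schema}}
    raise ValueError(f"unknown style: {style}")
```

Only the last two shapes put anything in extra_body, which matches the log line that shows chat_template_kwargs but no guided_json or structured_outputs.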
Issue 2: Legacy fallback prompt lacks JSON instruction
Root cause: _SEED_DECOMPOSE (sqlite_db.py line 12) says "Break it down into 2-5 simplified sub-questions" and gives no JSON format instruction. The user's customized prompt in profile A is in Chinese, "請你把每條問題簡化" ("Please simplify each question"), and has no JSON instruction either. When structured output fails, the legacy _parse_legacy_json() calls _extract_json_from_markdown(), which only looks for JSON code blocks or bare JSON. Natural-language bullets therefore produce an empty list: extracted_questions: [].
Evidence:
- _SEED_DECOMPOSE = "Given this question: '{question}'\n\nBreak it down into 2-5 simplified sub-questions that would help search for relevant information. Each sub-question should be short and focused on one aspect."
- The user's customized prompt (profile A, from DB) is injected via prompt_service.get_prompt_template("decompose") → same issue
- _parse_legacy_json() → _extract_json_from_markdown() → regex for fenced json code blocks → finds none → returns raw text → json.loads() fails → []
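The failure chain can be reproduced in isolation. The sketch below is an assumption about how the two helpers behave, reconstructed from the description in this section; extract_json_from_markdown and parse_legacy_json are stand-ins, not the real implementations.

```python
import json
import re

FENCE = "`" * 3  # built programmatically so this example nests cleanly

# Matches a fenced block, optionally tagged "json", and captures its body.
_FENCE_RE = re.compile(
    re.escape(FENCE) + r"(?:json)?\s*(.*?)" + re.escape(FENCE), re.DOTALL
)


def extract_json_from_markdown(text: str) -> str:
    """Return the body of the first fenced block, else the raw text."""
    match = _FENCE_RE.search(text)
    return match.group(1).strip() if match else text.strip()


def parse_legacy_json(text: str) -> list[str]:
    """Parse a JSON array of strings; return [] on any parse failure."""
    try:
        data = json.loads(extract_json_from_markdown(text))
    except json.JSONDecodeError:
        return []  # silent failure: this is the gap described in Issue 3
    return [str(item) for item in data] if isinstance(data, list) else []


# A model that was never asked for JSON tends to answer in bullets:
bullets = "- What is guided decoding?\n- Which vLLM versions support it?"
fenced = FENCE + 'json\n["q1", "q2"]\n' + FENCE
```

With these inputs, parse_legacy_json(bullets) yields an empty list while parse_legacy_json(fenced) yields the two questions, which is why the prompt wording matters on the legacy path.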
Issue 3: Insufficient logging
Current gaps:
- complete_structured() error handler (line 156): %s for exc only prints str(exc), not the traceback
- _parse_legacy_json() returns [] silently, with no log of what text was being parsed
- No log of which structured output method is being used per call
- No log of the vLLM server version or capability
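The first gap is easy to see with the standard logging module. The following is a small sketch under stated assumptions: make_logger and log_failure are hypothetical helpers, and the TypeError is provoked artificially to mimic the error from Issue 1.

```python
import io
import logging


def make_logger(stream: io.StringIO) -> logging.Logger:
    """Build an isolated logger that writes to the given stream."""
    logger = logging.getLogger("llm_client_sketch")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.addHandler(logging.StreamHandler(stream))
    logger.propagate = False
    return logger


def log_failure(logger: logging.Logger, step_name: str, with_traceback: bool) -> None:
    """Trigger a NoneType membership error and log it."""
    try:
        _ = "text" in None  # raises TypeError, as in Issue 1
    except TypeError as exc:
        # "%s" alone records only str(exc); exc_info=True adds the traceback.
        logger.error(
            "[%s] Structured output failed: %s",
            step_name,
            exc,
            exc_info=with_traceback,
        )


without_tb = io.StringIO()
log_failure(make_logger(without_tb), "decompose", with_traceback=False)

with_tb = io.StringIO()
log_failure(make_logger(with_tb), "decompose", with_traceback=True)
```

The second buffer contains the "Traceback (most recent call last):" header and the failing line; the first contains only the one-line message, which is what the current handler produces.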
Fix Plan
Fix A: Use vLLM-native guided_json for structured output (llm_client.py)
Approach: Modify _get_langchain_model() to pass extra_body with guided_json for vLLM, bypassing LangChain's with_structured_output() method. Instead of:
structured = model.with_structured_output(pydantic_model, method="json_schema")
result = await structured.ainvoke(prompt)
Use direct model invocation with guided decoding:
schema = pydantic_model.model_json_schema()
model = self._get_langchain_model_for_vllm(schema)
message = await model.ainvoke(prompt)  # ainvoke returns an AIMessage, not a str
result = pydantic_model.model_validate_json(message.content)
Implementation:
- In complete_structured(), detect vLLM via self.settings.vllm_engine
- For vLLM: build the Pydantic schema, create a ChatOpenAI model with extra_body={"guided_json": schema} (legacy format) or extra_body={"structured_outputs": {"json": schema}} (new format), call ainvoke(prompt), and parse the JSON response with pydantic_model.model_validate_json()
- For OpenRouter: keep existing with_structured_output(method="json_schema") path
- Add a fallback: try "structured_outputs" format first; if it fails, try "guided_json" format
Fallback strategy: If both structured output methods fail:
- Log the full error with traceback
- Fall back to complete() with a JSON instruction appended to the prompt
- Parse the response with _parse_legacy_json() (which we will also improve; see Fix C)
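The ordering of attempts can be sketched independently of LangChain. In this illustration the network call is injected as a hypothetical send coroutine (prompt and extra_body in, raw text out), and json.loads stands in for pydantic_model.model_validate_json(); none of these names come from the codebase.

```python
import asyncio
import json
import logging
from typing import Any, Awaitable, Callable

logger = logging.getLogger(__name__)

# Hypothetical transport: (prompt, extra_body) -> raw response text.
SendFn = Callable[[str, dict[str, Any]], Awaitable[str]]


async def complete_with_fallback(
    prompt: str, schema: dict[str, Any], send: SendFn
) -> tuple[str, str]:
    """Try both vLLM formats, then the legacy prompt-only path.

    Returns (method_used, raw_text).
    """
    attempts = [
        ("structured_outputs", {"structured_outputs": {"json": schema}}),
        ("guided_json", {"guided_json": schema}),
    ]
    for method, extra_body in attempts:
        try:
            text = await send(prompt, extra_body)
            json.loads(text)  # stand-in for schema validation
            logger.info("Structured output method used: %s", method)
            return method, text
        except Exception:  # broad on purpose in this sketch
            logger.error("Structured output via %s failed", method, exc_info=True)
    # Legacy path: no guided decoding, so ask for JSON in the prompt itself.
    text = await send(prompt + "\n\nReturn only valid JSON.", {})
    return "legacy", text


async def _demo_send(prompt: str, extra_body: dict[str, Any]) -> str:
    # Simulates an older server that rejects the new format.
    if "structured_outputs" in extra_body:
        raise RuntimeError("unknown field: structured_outputs")
    return '{"questions": ["q1"]}'


demo_method, demo_text = asyncio.run(
    complete_with_fallback("Decompose: ...", {"type": "object"}, _demo_send)
)
```

Returning the method name alongside the text keeps the "which format was used" log line from Fix C trivial to emit at the call site.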
Fix B: Update seed decompose template (sqlite_db.py)
Approach: Add JSON format instruction to _SEED_DECOMPOSE so the legacy fallback path has a chance of working.
New template:
"Given this question: '{question}'\n\n"
"Break it down into 2-5 simplified sub-questions that would help "
"search for relevant information. Each sub-question should be short "
"and focused on one aspect.\n\n"
"Return a JSON array of strings: [\"sub-question 1\", \"sub-question 2\"]"
Note: This only helps NEW prompt profiles. Existing profiles with custom prompts (like profile A) still need the user to update their prompt via the System Prompts UI. However, by fixing structured output (Fix A), the legacy path won't be reached in normal operation.
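As a quick check that the new wording survives placeholder substitution, the template can be rendered on its own. This sketch assumes the {question} placeholder is filled with str.format(), which the source does not confirm; render_decompose_prompt is a hypothetical helper.

```python
# Proposed template from Fix B, copied verbatim.
SEED_DECOMPOSE = (
    "Given this question: '{question}'\n\n"
    "Break it down into 2-5 simplified sub-questions that would help "
    "search for relevant information. Each sub-question should be short "
    "and focused on one aspect.\n\n"
    'Return a JSON array of strings: ["sub-question 1", "sub-question 2"]'
)


def render_decompose_prompt(question: str) -> str:
    """Fill the placeholder, assuming str.format() is the mechanism."""
    # Square brackets are safe for str.format(); a JSON *object* example
    # would need its braces doubled ({{ and }}) to avoid a formatting error.
    return SEED_DECOMPOSE.format(question=question)


rendered = render_decompose_prompt("How does vLLM guided decoding work?")
```

Using an array example rather than an object example keeps the template free of literal braces, so it stays compatible with either str.format() or simple string replacement.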
Fix C: Add diagnostic logging
- complete_structured() error handler (llm_client.py line 156):
  - Change self.logger.error(...) to self.logger.error(..., exc_info=True) to include full traceback
  - Add self.logger.info("[%s] Structured output method used: %s", step_name, method)
- _parse_legacy_json() (query_decomposer.py line 43):
  - Add logger.info("Legacy JSON parse: extracted text: %s", response[:200]) before json.loads()
- complete_structured() vLLM path:
  - Log the schema being sent: logger.info("[%s] vLLM guided_json schema: %s", step_name, json.dumps(schema)[:200])
  - Log which format was used (guided_json vs structured_outputs)
- Query decomposer (query_decomposer.py line 96):
  - Add logger.info("Using decompose prompt: %s", prompt[:200]) to show which prompt is in use
Fix D: Revert the function_calling commit
The temporary fix (commit cbb958d) switched the vLLM path to method="function_calling". This is unreliable: whether vLLM supports function calling depends on the model and the server configuration. Revert it in favor of Fix A (direct guided decoding).
Files to Change
| File | Changes |
|---|---|
| backend/app/services/llm_client.py | Fix A: add vLLM-guided path in complete_structured(); Fix C: add exc_info=True, method logging |
| backend/app/core/sqlite_db.py | Fix B: update _SEED_DECOMPOSE with JSON instruction |
| backend/app/services/query_decomposer.py | Fix C: add logging in _parse_legacy_json(), log prompt used |
Test Plan
- Backend tests: pytest app/test/test_phase5_*.py must all pass (108 tests currently)
- Acceptance test: Run against vLLM backend to verify structured decomposition works
- Rollback safety: OpenRouter path unchanged; only vLLM path modified
Decision Points
| # | Decision | Rationale |
|---|---|---|
| 1 | Use guided_json via extra_body, not with_structured_output(), for vLLM | LangChain's structured output sends response_format, which vLLM may not support. Guided decoding via extra_body is vLLM-native and does not depend on response_format support. |
| 2 | Try structured_outputs format first, fall back to guided_json | Newer vLLM versions standardized on structured_outputs. Older versions use guided_json. Try both. |
| 3 | Add JSON instruction to English seed template only | The Chinese template is user-customized, so it can't be changed automatically. The structured output fix (Fix A) makes the legacy fallback less critical. |
| 4 | Keep OpenRouter path unchanged | method="json_schema" already works on OpenRouter. No reason to change it. |