7.6 KiB

Raw Blame History

Fix Plan: vLLM Structured Output Failure + Legacy Fallback

Date: 2026-04-29 Source: Server log analysis — vLLM backend with Qwen3.6-35B-A3B

Diagnosis

Issue 1: `argument of type 'NoneType' is not iterable` in `complete_structured()`

Root cause: vLLM DOES support JSON schema structured output, but through its own protocol (guided_json in extra_body), NOT through OpenAI's response_format={"type": "json_schema"}. LangChain's with_structured_output(method="json_schema") sends the OpenAI-native format. vLLM ignores it, returns normal text. LangChain's structured parser receives the text response without the expected schema-validated fields, hits None where it expects an iterable.

Evidence:

vLLM official docs show extra_body={"guided_json": schema} (legacy) or extra_body={"structured_outputs": {"json": schema}} (new unified format)
vLLM v0.10.2+ partially supports response_format={"type": "json_schema"} but it is newer and may not be on the server
The error happens after 550ms (LLM responded) — response parsing failure, not connection failure
The log shows LLM Extra Body {'chat_template_kwargs': ...} but no guided_json / structured_outputs

Current code (llm_client.py line 139):

structured = model.with_structured_output(pydantic_model, method="json_schema")

This generates response_format={"type": "json_schema", "json_schema": <schema>} → vLLM ignores → plain text response → parse failure.

Issue 2: Legacy fallback prompt lacks JSON instruction

Root cause: _SEED_DECOMPOSE (sqlite_db.py line 12) says "Break it down into 2-5 simplified sub-questions" — no JSON format instruction. The user's customized prompt in profile A is Chinese: "請你把每條問題簡化" — no JSON instruction either. When structured output fails, the legacy _parse_legacy_json() calls _extract_json_from_markdown() which only looks for JSON code blocks or bare JSON. Natural language bullets → empty list → extracted_questions: [].

Evidence:

_SEED_DECOMPOSE = "Given this question: '{question}'\n\nBreak it down into 2-5 simplified sub-questions that would help search for relevant information. Each sub-question should be short and focused on one aspect."
The user's customized prompt (profile A, from DB) is injected via prompt_service.get_prompt_template("decompose") → same issue
_parse_legacy_json() → _extract_json_from_markdown() → regex for ```json blocks → finds none → returns raw text → json.loads() fails → []

Issue 3: Insufficient logging

Current gaps:

complete_structured() error handler (line 156): %s for exc — only prints str(exc), not the traceback
_parse_legacy_json() returns [] silently — no log of what text was being parsed
No log of which structured output method is being used per call
No log of the vLLM server version or capability

Fix Plan

Fix A: Use vLLM-native `guided_json` for structured output (llm_client.py)

Approach: Modify _get_langchain_model() to pass extra_body with guided_json for vLLM, bypassing LangChain's with_structured_output() method. Instead of:

structured = model.with_structured_output(pydantic_model, method="json_schema")
result = await structured.ainvoke(prompt)

Use direct model invocation with guided decoding:

schema = pydantic_model.model_json_schema()
model = self._get_langchain_model_for_vllm(schema)
result_text = await model.ainvoke(prompt)
result = pydantic_model.model_validate_json(result_text)

Implementation:

In complete_structured(), detect vLLM via self.settings.vllm_engine
For vLLM: build the Pydantic schema, create a ChatOpenAI model with extra_body={"guided_json": schema} (legacy format) or extra_body={"structured_outputs": {"json": schema}} (new format), call ainvoke(prompt), parse the JSON response with pydantic_model.model_validate_json()
For OpenRouter: keep existing with_structured_output(method="json_schema") path
Add a fallback: try "structured_outputs" format first; if fails, try "guided_json" format

Fallback strategy: If both structured output methods fail:

Log the full error with traceback
Fall back to complete() with JSON instruction appended to prompt
Parse the response with _parse_legacy_json() (which we will also improve — see Fix C)

Fix B: Update seed decompose template (sqlite_db.py)

Approach: Add JSON format instruction to _SEED_DECOMPOSE so the legacy fallback path has a chance of working.

New template:

"Given this question: '{question}'\n\n"
"Break it down into 2-5 simplified sub-questions that would help "
"search for relevant information. Each sub-question should be short "
"and focused on one aspect.\n\n"
"Return a JSON array of strings: [\"sub-question 1\", \"sub-question 2\"]"

Note: This only helps NEW prompt profiles. Existing profiles with custom prompts (like profile A) still need the user to update their prompt via the System Prompts UI. However, by fixing structured output (Fix A), the legacy path won't be reached in normal operation.

Fix C: Add diagnostic logging

complete_structured() error handler (llm_client.py line 156):
- Change self.logger.error(...) to self.logger.error(..., exc_info=True) to include full traceback
- Add self.logger.info("[%s] Structured output method used: %s", step_name, method)
_parse_legacy_json() (query_decomposer.py line 43):
- Add logger.info("Legacy JSON parse: extracted text: %s", response[:200]) before json.loads()
complete_structured() vLLM path:
- Log the schema being sent: logger.info("[%s] vLLM guided_json schema: %s", step_name, json.dumps(schema)[:200])
- Log which format was used (guided_json vs structured_outputs)
Query decomposer (query_decomposer.py line 96):
- Add logger.info("Using decompose prompt: %s", prompt[:200]) to show which prompt is in use

Fix D: Revert the `function_calling` commit

The temporary fix (commit cbb958d) changed method="function_calling" for vLLM. This is unreliable — vLLM may or may not support function calling depending on the model and configuration. Revert this in favor of Fix A (direct guided decoding).

Files to Change

File	Changes
`backend/app/services/llm_client.py`	Fix A: add vLLM-guided path in `complete_structured()`; Fix C: add `exc_info=True`, method logging
`backend/app/core/sqlite_db.py`	Fix B: update `_SEED_DECOMPOSE` with JSON instruction
`backend/app/services/query_decomposer.py`	Fix C: add logging in `_parse_legacy_json()`, log prompt used

Test Plan

Backend tests: pytest app/test/test_phase5_*.py — must all pass (108 tests currently)
Acceptance test: Run against vLLM backend to verify structured decomposition works
Rollback safety: OpenRouter path unchanged; only vLLM path modified

Decision Points

#	Decision	Rationale
1	Use `guided_json` via `extra_body`, not `with_structured_output()`, for vLLM	LangChain's structured output sends `response_format` which vLLM may not support. `guided_json` is vLLM-native and guaranteed to work.
2	Try `structured_outputs` format first, fall back to `guided_json`	Newer vLLM versions standardized on `structured_outputs`. Older versions use `guided_json`. Try both.
3	Add JSON instruction to English seed template only	Chinese template is user-customized — can't force it. Structured output fix (Fix A) makes legacy fallback less critical.
4	Keep OpenRouter path unchanged	`method="json_schema"` works perfectly on OpenRouter. No reason to change it.

7.6 KiB Raw Blame History