Fix Plan: vLLM Structured Output Failure + Legacy Fallback

Date: 2026-04-29
Source: Server log analysis of the vLLM backend running Qwen3.6-35B-A3B


Diagnosis

Issue 1: argument of type 'NoneType' is not iterable in complete_structured()

Root cause: vLLM does support JSON-schema structured output, but primarily through its own protocol (guided_json in extra_body), not through OpenAI's response_format={"type": "json_schema"}. LangChain's with_structured_output(method="json_schema") sends the OpenAI-native format. The vLLM server in use appears to ignore it and returns ordinary text. LangChain's structured parser then receives a response without the expected schema-validated fields and hits None where it expects an iterable.

Evidence:

  • vLLM official docs show extra_body={"guided_json": schema} (legacy) or extra_body={"structured_outputs": {"json": schema}} (new unified format)
  • vLLM v0.10.2+ partially supports response_format={"type": "json_schema"}, but the deployed server may be running an older version
  • The error happens after 550ms (LLM responded) — response parsing failure, not connection failure
  • The log shows LLM Extra Body {'chat_template_kwargs': ...} but no guided_json / structured_outputs
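The error text in the Issue 1 heading is what Python raises for a membership test against None. The sketch below reproduces that failure mode in isolation. It is not LangChain's actual parser code; has_parsed_field and plain_text_reply_kwargs are hypothetical names used only to show how a missing structured payload can surface as this TypeError.

```python
def has_parsed_field(message_kwargs, field="parsed"):
    """Membership test of the kind a structured-output parser might run."""
    return field in message_kwargs


# Assumption: a plain-text reply carries no structured payload at all,
# so the container the parser inspects is None rather than a dict.
plain_text_reply_kwargs = None

error_message = None
try:
    has_parsed_field(plain_text_reply_kwargs)
except TypeError as exc:
    error_message = str(exc)
    print(f"TypeError: {exc}")
```

The exact wording of the message varies slightly between Python versions, but the exception type and the mention of NoneType are stable, which fits the observation that the failure happens during response parsing rather than during the request.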

Current code (llm_client.py line 139):

structured = model.with_structured_output(pydantic_model, method="json_schema")

This generates response_format={"type": "json_schema", "json_schema": <schema>} → vLLM ignores → plain text response → parse failure.
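To make the difference concrete, here is a minimal sketch of the three request shapes discussed in this issue. The example schema and the helper names (openai_native_kwargs, vllm_legacy_kwargs, vllm_unified_kwargs) are hypothetical and exist only for illustration; the key names come from the evidence listed above.

```python
import json

# Hypothetical schema standing in for pydantic_model.model_json_schema().
schema = {
    "type": "object",
    "properties": {
        "extracted_questions": {"type": "array", "items": {"type": "string"}}
    },
    "required": ["extracted_questions"],
}


def openai_native_kwargs(schema, name="decomposition"):
    """OpenAI-style request: the schema is nested under a named json_schema object."""
    return {
        "response_format": {
            "type": "json_schema",
            "json_schema": {"name": name, "schema": schema},
        }
    }


def vllm_legacy_kwargs(schema):
    """Legacy vLLM form: guided_json passed through extra_body."""
    return {"extra_body": {"guided_json": schema}}


def vllm_unified_kwargs(schema):
    """Newer vLLM form: structured_outputs passed through extra_body."""
    return {"extra_body": {"structured_outputs": {"json": schema}}}


for build in (openai_native_kwargs, vllm_legacy_kwargs, vllm_unified_kwargs):
    print(build.__name__, json.dumps(build(schema))[:80])
```

Only the first shape uses response_format; the two vLLM shapes put the schema under extra_body, which is the field the server log shows being sent without any schema in it.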

Issue 2: Legacy fallback prompt lacks JSON instruction

Root cause: _SEED_DECOMPOSE (sqlite_db.py line 12) says "Break it down into 2-5 simplified sub-questions" and gives no JSON format instruction. The user's customized prompt in profile A is in Chinese: "請你把每條問題簡化" ("Please simplify each question"), and it has no JSON instruction either. When structured output fails, the legacy _parse_legacy_json() calls _extract_json_from_markdown(), which only looks for JSON code blocks or bare JSON. Natural-language bullets therefore produce an empty list: extracted_questions: [].

Evidence:

  • _SEED_DECOMPOSE = "Given this question: '{question}'\n\nBreak it down into 2-5 simplified sub-questions that would help search for relevant information. Each sub-question should be short and focused on one aspect."
  • The user's customized prompt (profile A, from DB) is injected via prompt_service.get_prompt_template("decompose") → same issue
  • _parse_legacy_json() → _extract_json_from_markdown() → regex for ```json blocks → finds none → returns raw text → json.loads() fails → []
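The chain in the last bullet can be reproduced with a small stand-in. This is a sketch under stated assumptions, not the project's code: extract_json_from_markdown and parse_legacy_json below are hypothetical reimplementations that follow the behavior described above (look for a fenced JSON block, otherwise return the raw text, and return [] when json.loads() fails).

```python
import json
import re

# The fence marker is built programmatically so this example contains no literal fence.
FENCE = "`" * 3
_JSON_BLOCK = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def extract_json_from_markdown(text):
    """Return the body of the first fenced block, else the raw text."""
    match = _JSON_BLOCK.search(text)
    return match.group(1).strip() if match else text.strip()


def parse_legacy_json(text):
    """Return a list of strings, or [] when the text is not a JSON array."""
    try:
        data = json.loads(extract_json_from_markdown(text))
    except json.JSONDecodeError:
        return []
    if not isinstance(data, list):
        return []
    return [item for item in data if isinstance(item, str)]


# Natural-language bullets, as a model would write without a JSON instruction.
bullets = "- What is the policy?\n- Who proposed it?"
# The same content, wrapped in a fenced JSON block.
fenced = FENCE + 'json\n["What is the policy?", "Who proposed it?"]\n' + FENCE

print(parse_legacy_json(bullets))
print(parse_legacy_json(fenced))
```

The bullet reply parses to an empty list while the fenced reply yields both questions, which matches the extracted_questions: [] symptom when the prompt never asks for JSON.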

Issue 3: Insufficient logging

Current gaps:

  • complete_structured() error handler (line 156): formats exc with %s, which logs only str(exc), not the traceback
  • _parse_legacy_json() returns [] silently — no log of what text was being parsed
  • No log of which structured output method is being used per call
  • No log of the vLLM server version or capability

Fix Plan

Fix A: Use vLLM-native guided_json for structured output (llm_client.py)

Approach: Modify _get_langchain_model() to pass extra_body with guided_json for vLLM, bypassing LangChain's with_structured_output() method. Instead of:

structured = model.with_structured_output(pydantic_model, method="json_schema")
result = await structured.ainvoke(prompt)

Use direct model invocation with guided decoding:

schema = pydantic_model.model_json_schema()
model = self._get_langchain_model_for_vllm(schema)
# ainvoke() returns an AIMessage; the generated JSON text is in .content
response = await model.ainvoke(prompt)
result = pydantic_model.model_validate_json(response.content)

Implementation:

  1. In complete_structured(), detect vLLM via self.settings.vllm_engine
  2. For vLLM: build the Pydantic schema, create a ChatOpenAI model with extra_body={"guided_json": schema} (legacy format) or extra_body={"structured_outputs": {"json": schema}} (new format), call ainvoke(prompt), parse the JSON response with pydantic_model.model_validate_json()
  3. For OpenRouter: keep existing with_structured_output(method="json_schema") path
  4. Add a fallback: try the "structured_outputs" format first; if that fails, try the "guided_json" format
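Step 4 can be sketched as an ordered retry over the two extra_body formats. This is a minimal illustration, not the planned implementation: call_model is a hypothetical async callable taking (prompt, extra_body) and returning the response text, and fake_old_vllm is a stand-in for a server that only understands the legacy key.

```python
import asyncio
import json


def build_extra_bodies(schema):
    """Ordered (label, extra_body) pairs: unified format first, then legacy."""
    return [
        ("structured_outputs", {"structured_outputs": {"json": schema}}),
        ("guided_json", {"guided_json": schema}),
    ]


async def complete_with_guided_decoding(call_model, prompt, schema):
    """Try each format in order; return (label, parsed) from the first that works."""
    errors = []
    for label, extra_body in build_extra_bodies(schema):
        try:
            raw = await call_model(prompt, extra_body)
            return label, json.loads(raw)
        except Exception as exc:  # broad on purpose: server and parse errors alike
            errors.append(f"{label}: {exc!r}")
    raise RuntimeError("all structured output formats failed: " + "; ".join(errors))


async def fake_old_vllm(prompt, extra_body):
    """Hypothetical server that rejects anything except the legacy guided_json key."""
    if "guided_json" not in extra_body:
        raise ValueError("unknown field: structured_outputs")
    return '{"extracted_questions": ["q1", "q2"]}'


label, parsed = asyncio.run(
    complete_with_guided_decoding(fake_old_vllm, "decompose this", {"type": "object"})
)
print(label, parsed)
```

One caveat on the design: a server might silently ignore an unknown extra_body key instead of rejecting it. In that case the first attempt returns unconstrained text, so the JSON parse inside the loop is what triggers the move to the next format.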

Fallback strategy: If both structured output methods fail:

  1. Log the full error with traceback
  2. Fall back to complete() with JSON instruction appended to prompt
  3. Parse the response with _parse_legacy_json() (which we will also improve — see Fix C)
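The three fallback steps could look roughly like the sketch below. All names here (decompose_with_fallback, structured_call, plain_call, JSON_INSTRUCTION) are hypothetical, and the inline JSON parse is a simplified stand-in for _parse_legacy_json(); the wording of the appended instruction is an assumption, since the plan only says that one is appended.

```python
import asyncio
import json
import logging

logger = logging.getLogger("fallback_sketch")

# Hypothetical wording; the plan does not specify the exact instruction text.
JSON_INSTRUCTION = '\n\nReturn only a JSON array of strings, e.g. ["a", "b"].'


async def decompose_with_fallback(structured_call, plain_call, prompt):
    """Try structured output; on any failure, use the prompt-instructed legacy path."""
    try:
        return await structured_call(prompt)
    except Exception:
        # Step 1: keep the full traceback in the log.
        logger.error("structured output failed, using legacy path", exc_info=True)
    # Step 2: plain completion with the JSON instruction appended.
    raw = await plain_call(prompt + JSON_INSTRUCTION)
    # Step 3: tolerant parse that returns [] rather than raising.
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        logger.warning("legacy parse failed on: %s", raw[:200])
        return []
    return data if isinstance(data, list) else []


seen_prompts = []


async def failing_structured(prompt):
    """Stand-in for a structured call that fails the way Issue 1 describes."""
    raise TypeError("argument of type 'NoneType' is not iterable")


async def fake_plain(prompt):
    """Stand-in for complete(); records the prompt it was given."""
    seen_prompts.append(prompt)
    return '["What is the policy?", "Who proposed it?"]'


result = asyncio.run(decompose_with_fallback(failing_structured, fake_plain, "Q"))
print(result)
```

Appending the instruction at call time, rather than relying on the stored template, means the fallback also covers custom profiles whose prompts contain no JSON instruction.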

Fix B: Update seed decompose template (sqlite_db.py)

Approach: Add JSON format instruction to _SEED_DECOMPOSE so the legacy fallback path has a chance of working.

New template:

"Given this question: '{question}'\n\n"
"Break it down into 2-5 simplified sub-questions that would help "
"search for relevant information. Each sub-question should be short "
"and focused on one aspect.\n\n"
"Return a JSON array of strings: [\"sub-question 1\", \"sub-question 2\"]"

Note: This only helps NEW prompt profiles. Existing profiles with custom prompts (like profile A) still need the user to update their prompt via the System Prompts UI. However, by fixing structured output (Fix A), the legacy path won't be reached in normal operation.
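As a quick sanity check on the new template, the sketch below fills it in and confirms that the embedded example is itself valid JSON. It assumes the template is filled with str.format(), which the {question} placeholder suggests but the plan does not state; SEED_DECOMPOSE here is a local copy for illustration.

```python
import json

# Local copy of the proposed template text.
SEED_DECOMPOSE = (
    "Given this question: '{question}'\n\n"
    "Break it down into 2-5 simplified sub-questions that would help "
    "search for relevant information. Each sub-question should be short "
    "and focused on one aspect.\n\n"
    "Return a JSON array of strings: [\"sub-question 1\", \"sub-question 2\"]"
)

# Assumption: the placeholder is substituted via str.format().
prompt = SEED_DECOMPOSE.format(question="How is the budget approved?")

# The example shown to the model should parse as the shape the parser expects.
example = prompt.rsplit("Return a JSON array of strings: ", 1)[1]
print(json.loads(example))
```

Using a JSON array in the example keeps the template free of curly braces. If the example were ever changed to a JSON object, its braces would need to be doubled for str.format() to leave them intact.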

Fix C: Add diagnostic logging

  1. complete_structured() error handler (llm_client.py line 156):

    • Change self.logger.error(...) to self.logger.error(..., exc_info=True) to include full traceback
    • Add self.logger.info("[%s] Structured output method used: %s", step_name, method)
  2. _parse_legacy_json() (query_decomposer.py line 43):

    • Add logger.info("Legacy JSON parse: extracted text: %s", response[:200]) before json.loads()
  3. complete_structured() vLLM path:

    • Log the schema being sent: logger.info("[%s] vLLM guided_json schema: %s", step_name, json.dumps(schema)[:200])
    • Log which format was used (guided_json vs structured_outputs)
  4. Query decomposer (query_decomposer.py line 96):

    • Add logger.info("Using decompose prompt: %s", prompt[:200]) to show which prompt is in use
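The effect of adding exc_info=True in item 1 can be seen in isolation with the standard logging module. This is a self-contained sketch; the logger name, the parse helper, and the "decompose" step name are placeholders, not the project's identifiers.

```python
import io
import logging

# Capture log output in memory so the two variants can be compared.
stream = io.StringIO()
logger = logging.getLogger("structured_output_sketch")
logger.setLevel(logging.INFO)
logger.propagate = False
logger.addHandler(logging.StreamHandler(stream))


def parse(payload):
    """Raises TypeError when payload is None, mimicking the Issue 1 failure."""
    return "field" in payload


try:
    parse(None)
except TypeError as exc:
    # Current behavior: only str(exc) reaches the log.
    logger.error("[%s] structured output failed: %s", "decompose", exc)
    without_traceback = stream.getvalue()
    # Proposed behavior: the same call plus exc_info=True appends the traceback.
    logger.error("[%s] structured output failed: %s", "decompose", exc, exc_info=True)
    with_traceback = stream.getvalue()[len(without_traceback):]

print(with_traceback)
```

The second record includes the stack frames leading to the failure, which is what identifies the line inside the parser that hit None.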

Fix D: Revert the function_calling commit

The temporary fix (commit cbb958d) switched the vLLM path to method="function_calling". This is unreliable: whether vLLM supports function calling depends on the model and the server configuration. Revert it in favor of Fix A (direct guided decoding).


Files to Change

| File | Changes |
|---|---|
| backend/app/services/llm_client.py | Fix A: add vLLM-guided path in complete_structured(); Fix C: add exc_info=True, method logging |
| backend/app/core/sqlite_db.py | Fix B: update _SEED_DECOMPOSE with JSON instruction |
| backend/app/services/query_decomposer.py | Fix C: add logging in _parse_legacy_json(), log prompt used |

Test Plan

  1. Backend tests: pytest app/test/test_phase5_*.py — must all pass (108 tests currently)
  2. Acceptance test: Run against vLLM backend to verify structured decomposition works
  3. Rollback safety: OpenRouter path unchanged; only vLLM path modified

Decision Points

| # | Decision | Rationale |
|---|---|---|
| 1 | Use guided_json via extra_body, not with_structured_output(), for vLLM | LangChain's structured output sends response_format, which the vLLM server may not support. Guided decoding via extra_body is vLLM-native and does not depend on that support; the exact key varies by vLLM version (see decision 2). |
| 2 | Try structured_outputs format first, fall back to guided_json | Newer vLLM versions standardized on structured_outputs. Older versions use guided_json. Try both. |
| 3 | Add JSON instruction to English seed template only | The Chinese template is user-customized, so it cannot be changed automatically. The structured output fix (Fix A) makes the legacy fallback less critical. |
| 4 | Keep OpenRouter path unchanged | method="json_schema" already works on OpenRouter. No reason to change it. |