Fix Plan: vLLM Highlight Batch Structured Output Failure

Date: 2026-05-15 (updated 2026-05-15 with external research)
Source: User report (highlight function breaks when deployed to a local vLLM server; works fine on OpenRouter)
Status: Root cause confirmed via internet research

Symptom

When VLLM_ENGINE=true, the highlight feature fails silently:

- Citations appear as plain PDF links instead of highlighted chunk views
- Frontend console logs: "Highlight batch computation failed"
- Backend returns {"status": "failed"} from POST /api/v1/v2/highlights/batch

When VLLM_ENGINE=false (OpenRouter), highlights work correctly.

Root Cause Analysis (Confirmed)

The Highlight Pipeline

SSE stream completes
  → Frontend ResponsePanel.tsx calls POST /api/v1/v2/highlights/batch
    → Backend chunks.py creates LLMClient(settings)
      → ChunkHighlightService.compute_highlights_batch()
        → self._llm.complete_structured(prompt, HighlightBatchResult)  ← FAILS HERE

Why It Works on OpenRouter

LLMClient._complete_structured_openai() uses LangChain's with_structured_output(pydantic_model, method="json_schema"). LangChain sends the OpenAI-native response_format: {type: "json_schema", json_schema: {...}} to the API, OpenRouter proxies it to the underlying model, and the call returns a pre-validated Pydantic model.

Why It Fails on vLLM — CONFIRMED ROOT CAUSE

LLMClient._complete_structured_vllm() (lines 140-191 of llm_client.py) tries two extra_body formats:

| Attempt | Format | Status on latest vLLM |
| --- | --- | --- |
| 1 | `extra_body={"structured_outputs": {"json": schema}}` | ✅ Supported (v0.8+), but may fail due to `chat_template_kwargs` merge conflict |
| 2 | `extra_body={"guided_json": schema}` | ❌ REMOVED in v0.12.0 |

guided_json was deprecated in v0.11.0 and fully removed in v0.12.0. This was confirmed via:

- vLLM Official Docs: "⚠️ If you are still using the following deprecated API fields which were removed in v0.12.0: guided_json → {"structured_outputs": {"json": ...}}"
- vLLM RFC #19097: The vLLM team confirmed the migration path and is moving all structured output constraints into response_format.

When attempt 1 (structured_outputs) fails, for example because the {**base_extra, **body} merge sends chat_template_kwargs alongside the structured output params, the code falls through to the guided_json attempt, which vLLM v0.12+ silently rejects. The result is an LLMClientError and status: "failed".
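The merge behaviour described above can be illustrated with plain dictionaries. This is a minimal sketch, assuming `base_extra` carries a `chat_template_kwargs` entry; the `enable_thinking` key and the trimmed schema are hypothetical, since the real output of `_build_extra_body()` is not shown in this plan.

```python
# Hypothetical stand-ins for the values merged in _complete_structured_vllm().
base_extra = {"chat_template_kwargs": {"enable_thinking": False}}  # assumed contents
schema = {"type": "object", "properties": {"results": {"type": "array"}}}

body = {"structured_outputs": {"json": schema}}

# Dict unpacking is a shallow merge: later keys win, all other keys ride along.
merged = {**base_extra, **body}

# The structured output request now also carries the chat-template settings.
print(sorted(merged))  # ['chat_template_kwargs', 'structured_outputs']

# Sending only the structured output parameters keeps the two concerns apart.
isolated = dict(body)
print(sorted(isolated))  # ['structured_outputs']
```

Whether the extra `chat_template_kwargs` key actually breaks a given request depends on the model and server configuration; the sketch only shows that the two concerns end up in one payload.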

What Latest vLLM Actually Supports

Research confirmed two supported methods on latest vLLM (v0.12+); guided_json is listed only to record its removal:

| # | Method | Format | Since | Evidence |
| --- | --- | --- | --- | --- |
| 1 | `response_format` | `{"type": "json_schema", "json_schema": {"name": "...", "schema": schema}}` | v0.6.4 | vLLM protocol.py `to_sampling_params()`, PR #7654, PR #9530 |
| 2 | `extra_body` | `{"structured_outputs": {"json": schema}}` | v0.8+ | vLLM test_chat.py (extensive test coverage), third-party repos: `encourage`, `mellea` |
| 3 | ~~`guided_json`~~ | ❌ | n/a | Removed v0.12.0 per official docs |
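The two supported request shapes from the table can be produced by one small helper. This is a sketch only: `build_structured_kwargs` and its `method` selector are hypothetical names, not part of vLLM or of the existing `LLMClient`.

```python
from typing import Any


def build_structured_kwargs(schema: dict[str, Any], name: str, method: str) -> dict[str, Any]:
    """Return extra keyword arguments for chat.completions.create().

    `method` is a hypothetical selector, not a vLLM parameter:
      "response_format"    -> OpenAI-style json_schema response format
      "structured_outputs" -> vLLM-specific field passed through extra_body
    """
    if method == "response_format":
        return {
            "response_format": {
                "type": "json_schema",
                "json_schema": {"name": name, "schema": schema},
            }
        }
    if method == "structured_outputs":
        return {"extra_body": {"structured_outputs": {"json": schema}}}
    raise ValueError(f"unsupported structured output method: {method!r}")


example_schema = {"type": "object", "properties": {"results": {"type": "array"}}}
tier1 = build_structured_kwargs(example_schema, "HighlightBatchResult", "response_format")
tier2 = build_structured_kwargs(example_schema, "HighlightBatchResult", "structured_outputs")
```

The returned dict would be unpacked into the request, for example `create(model=..., messages=..., **tier1)`.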

`$defs` / Nested Schema Support

The HighlightBatchResult schema produces $defs from Pydantic's model_json_schema():

{
  "$defs": {
    "ChunkHighlights": { "properties": { ... } },
    "RelevantSentence": { "properties": { ... } }
  },
  "properties": {
    "results": { "items": { "$ref": "#/$defs/ChunkHighlights" }, "type": "array" }
  }
}

$defs are fully supported on latest vLLM with the auto backend (default):

- xgrammar: Native $defs support via xgr.Grammar.from_json_schema(); $defs is NOT in has_xgrammar_unsupported_json_features()
- guidance: Native support via llguidance.LLMatcher.grammar_from_json_schema()
- auto backend: Tries xgrammar → guidance → outlines, handling fallback transparently
- PR #15627: Specifically addressed $defs for the Outlines backend (closed when V0 removed; xgrammar/guidance already handled it)
- vLLM test fixtures: sample_definition_json_schema fixture tests $defs + $ref schemas in production
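Since the backends resolve `$ref` against `$defs`, a quick pre-flight check that every reference in the schema points at an existing definition can rule out a malformed schema as a cause. A minimal sketch follows; the helper names and the trimmed sample schema (including the `sentences` and `text` fields) are assumptions for illustration, not the real HighlightBatchResult schema.

```python
from typing import Any, Iterator


def iter_refs(node: Any) -> Iterator[str]:
    """Yield every "$ref" string found anywhere in a JSON schema."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "$ref" and isinstance(value, str):
                yield value
            else:
                yield from iter_refs(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_refs(item)


def unresolved_refs(schema: dict[str, Any]) -> list[str]:
    """Return refs that are not "#/$defs/<name>" with <name> present in $defs."""
    defs = schema.get("$defs", {})
    prefix = "#/$defs/"
    missing = {
        ref
        for ref in iter_refs(schema)
        if not ref.startswith(prefix) or ref[len(prefix):] not in defs
    }
    return sorted(missing)


# Trimmed, hypothetical version of the schema shape shown above.
sample = {
    "$defs": {
        "RelevantSentence": {"type": "object", "properties": {"text": {"type": "string"}}},
        "ChunkHighlights": {
            "type": "object",
            "properties": {
                "sentences": {"type": "array", "items": {"$ref": "#/$defs/RelevantSentence"}}
            },
        },
    },
    "type": "object",
    "properties": {
        "results": {"type": "array", "items": {"$ref": "#/$defs/ChunkHighlights"}}
    },
}

print(unresolved_refs(sample))  # []
```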

Why QueryDecomposer Works But Highlights Don't

QueryDecomposer uses LLMClientDP (Deepseek), which has its own complete_structured() with Deepseek JSON mode (response_format: {type: "json_object"}) — entirely different code path, unaffected by vLLM's guided_json removal.

ChunkHighlightService uses the main LLMClient and calls complete_structured() which routes to _complete_structured_vllm() → dead guided_json fallback.

Fix Plan (Revised — 2026-05-15)

Fix A (PRIMARY): Update `_complete_structured_vllm()` to use supported formats

Approach: Replace the current two-attempt loop (structured_outputs, then guided_json) with a two-tier approach using formats confirmed to work on latest vLLM.

Implementation (llm_client.py, replace _complete_structured_vllm() lines 140-191):

async def _complete_structured_vllm(self, prompt, pydantic_model, step_name, start_time):
    """Use vLLM-native structured output via response_format or structured_outputs extra_body."""
    schema = pydantic_model.model_json_schema()
    model_name = pydantic_model.__name__

    # Tier 1: OpenAI-native response_format (vLLM v0.6.4+, portable to OpenRouter/OpenAI)
    # This is the format vLLM's own docs recommend and RFC #19097 confirms as future direction.
    try:
        response = await self._client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0.0,
            response_format={
                "type": "json_schema",
                "json_schema": {"name": model_name, "schema": schema},
            },
        )
        content = response.choices[0].message.content or ""
        elapsed_ms = (time.perf_counter() - start_time) * 1000
        self.logger.info(
            "[%s] vLLM structured succeeded via response_format in %.2fms", step_name, elapsed_ms
        )
        return pydantic_model.model_validate_json(content)
    except Exception as exc:
        self.logger.warning(
            "[%s] response_format failed: %s. Trying structured_outputs fallback.", step_name, exc
        )

    # Tier 2: extra_body structured_outputs (vLLM v0.8+, most battle-tested fallback)
    try:
        response = await self._client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0.0,
            extra_body={"structured_outputs": {"json": schema}},
        )
        content = response.choices[0].message.content or ""
        elapsed_ms = (time.perf_counter() - start_time) * 1000
        self.logger.info(
            "[%s] vLLM structured succeeded via structured_outputs in %.2fms", step_name, elapsed_ms
        )
        return pydantic_model.model_validate_json(content)
    except Exception as exc:
        elapsed_ms = (time.perf_counter() - start_time) * 1000
        self.logger.error(
            "[%s] Both response_format and structured_outputs failed after %.2fms",
            step_name, elapsed_ms, exc_info=True,
        )
        # Chain the last failure so its traceback is preserved on the raised error.
        raise LLMClientError(
            "vLLM structured output failed with all supported formats"
        ) from exc

What changes from current code:

❌ Remove guided_json loop (removed in v0.12.0 — confirmed dead code)
❌ Remove {**base_extra, **body} merge (no more conflating chat_template_kwargs with structured output — _build_extra_body() remains for complete() text calls only)
✅ Add response_format as Tier 1 (vLLM v0.6.4+, OpenRouter-compatible, future-proof per RFC #19097)
✅ Keep structured_outputs as Tier 2 (battle-tested fallback for v0.8+ deployments)
✅ Add per-tier success/failure logging for diagnostics

Fix B (DEFENSE-IN-DEPTH): Add markdown fence stripping

Approach: vLLM models may wrap JSON output in markdown code fences even with structured output. Strip them before model_validate_json().

Implementation (llm_client.py, in both Tier 1 and Tier 2 of _complete_structured_vllm(), before model_validate_json()):

# Strip markdown code fences if present
import re
match = re.search(r"```(?:json)?\s*\n?(.*?)\n?```", content, re.DOTALL)
if match:
    content = match.group(1).strip()
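To avoid duplicating the snippet in both tiers, the stripping could live in one helper. The sketch below uses the same pattern; the name `strip_markdown_fences` is hypothetical, and unlike the inline snippet it also trims surrounding whitespace when no fence is found.

```python
import re

FENCE = "`" * 3  # built at runtime so this example contains no literal fence

# Same pattern as the inline snippet: optional "json" tag, then the fenced body.
_FENCE_RE = re.compile(FENCE + r"(?:json)?\s*\n?(.*?)\n?" + FENCE, re.DOTALL)


def strip_markdown_fences(content: str) -> str:
    """Return the body of the first fenced block, or the trimmed input if none."""
    match = _FENCE_RE.search(content)
    if match:
        return match.group(1).strip()
    return content.strip()


wrapped = FENCE + 'json\n{"results": []}\n' + FENCE
print(strip_markdown_fences(wrapped))  # {"results": []}
```

Each tier would then call `content = strip_markdown_fences(content)` before `model_validate_json(content)`.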

Fix C (FALLBACK): Add plain complete() fallback to ChunkHighlightService

Approach: If both structured output tiers fail, fall back to complete() with JSON format instruction. Only needed as last-resort defense if the vLLM server is misconfigured (e.g., --structured-outputs-config.backend not set).

Implementation (chunk_highlight_service.py, in compute_highlights_batch()):

The implementation is the same as the original Fix A from the earlier revision of this plan. It is demoted to Fix C because the revised Fix A (updating the vLLM path) should resolve the issue at the LLM client layer for all structured output callers.
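As a rough illustration of what such a fallback involves (not the original Fix A code, which is not reproduced here), the sketch below builds a JSON-instruction prompt and pulls the first decodable JSON object out of free-form output. Both helper names are hypothetical.

```python
import json
from typing import Any


def build_json_prompt(prompt: str, schema: dict[str, Any]) -> str:
    """Append a plain-text JSON instruction for use without constrained decoding."""
    return (
        f"{prompt}\n\n"
        "Respond with a single JSON object matching this JSON schema, "
        "with no surrounding prose:\n"
        f"{json.dumps(schema)}"
    )


def extract_json_object(text: str) -> dict[str, Any]:
    """Return the first decodable JSON object found in free-form model output."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            value, _ = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            value = None
        if isinstance(value, dict):
            return value
        # This "{" did not start a valid object; try the next one.
        start = text.find("{", start + 1)
    raise ValueError("no JSON object found in model output")


raw = 'Sure! {"results": []} Let me know if you need more.'
print(extract_json_object(raw))  # {'results': []}
```

In the service, the extracted dict would then go through Pydantic validation, for example `HighlightBatchResult.model_validate(extract_json_object(raw))`, so that malformed output still fails loudly.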

Files to Change

| File | Changes | Priority |
| --- | --- | --- |
| `backend/app/services/llm_client.py` | Fix A: replace `_complete_structured_vllm()` with two-tier approach; Fix B: add markdown fence stripping | HIGH |
| `backend/app/services/chunk_highlight_service.py` | Fix C: add JSON fallback in `compute_highlights_batch()` (defense-in-depth) | Medium |

Test Plan

1. Reproduce locally: Set VLLM_ENGINE=true in .env, point to vLLM server, run a query, observe highlight batch failure
2. Apply Fix A+B: Update _complete_structured_vllm(), rebuild Docker image, redeploy to vLLM
3. Verify highlights work: Run a query against uploaded PDF, confirm citation links show highlighted chunk views
4. Verify logs: Check that response_format Tier 1 succeeds (should show succeeded via response_format in logs)
5. Run existing tests: cd backend && pytest app/test/test_phase5_chunk_highlight_service.py -v
6. Run existing tests: cd backend && pytest app/test/test_phase5_highlight_endpoints.py -v
7. Run existing tests: cd backend && pytest app/test/ -v (full suite, to ensure no regressions)
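The tier fallback in Fix A could additionally be covered by a unit test that needs no vLLM server. The sketch below is self-contained and therefore uses stand-ins: `FakeCompletions` is a hypothetical stub for `client.chat.completions`, and `complete_two_tier` is a simplified copy of the two-tier logic that returns a dict instead of a Pydantic model.

```python
import asyncio
import json
from types import SimpleNamespace


class FakeCompletions:
    """Stub for client.chat.completions that rejects response_format requests."""

    def __init__(self):
        self.calls = []

    async def create(self, **kwargs):
        self.calls.append(kwargs)
        if "response_format" in kwargs:
            raise RuntimeError("response_format not supported by this stub")
        message = SimpleNamespace(content=json.dumps({"results": []}))
        return SimpleNamespace(choices=[SimpleNamespace(message=message)])


async def complete_two_tier(completions, prompt, schema, name):
    """Simplified stand-in for the two-tier logic in Fix A (returns a dict)."""
    tiers = [
        {"response_format": {"type": "json_schema",
                             "json_schema": {"name": name, "schema": schema}}},
        {"extra_body": {"structured_outputs": {"json": schema}}},
    ]
    last_error = None
    for extra in tiers:
        try:
            response = await completions.create(
                messages=[{"role": "user", "content": prompt}], temperature=0.0, **extra
            )
            return json.loads(response.choices[0].message.content or "")
        except Exception as exc:  # broad on purpose, mirroring the plan's code
            last_error = exc
    raise RuntimeError("all structured output tiers failed") from last_error


stub = FakeCompletions()
result = asyncio.run(
    complete_two_tier(stub, "prompt", {"type": "object"}, "HighlightBatchResult")
)
print(result, len(stub.calls))  # {'results': []} 2
```

A real test would target `LLMClient._complete_structured_vllm()` directly with the stub injected as `self._client`; how that injection is done depends on the existing test setup.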

Research Evidence

| Source | Finding |
| --- | --- |
| vLLM structured outputs docs | `guided_json` removed in v0.12.0; migration to `structured_outputs` |
| vLLM protocol.py | `response_format.type == "json_schema"` → `StructuredOutputsParams(json=...)` since v0.6.4 |
| vLLM PR #7654 | Added `json_schema` support (Aug 2024) |
| vLLM PR #9530 | Fixed `json_schema` not being passed to backend (Oct 2024) |
| vLLM RFC #19097 | Future direction: move all structured output to `response_format` |
| vLLM test_chat.py | `extra_body=dict(structured_outputs={"json": schema})` extensively tested |
| vLLM backend_xgrammar.py | `$defs` NOT in unsupported features list; xgrammar handles them natively |
| vLLM PR #15627 | `$defs` fix for Outlines backend; xgrammar/guidance already supported it |
| encourage | Real-world use: `extra_body={"structured_outputs": {"json": model.model_json_schema()}}` |
| mellea | Real-world use: falls back from `structured_outputs` to `guided_json` for older vLLM |

Related: .plans/debug_vllm_structured_output.md, the previous vLLM structured output diagnosis (2026-04-29), which fixed the QueryDecomposer path but did not address the highlight service.

Decision Points

| # | Decision | Rationale |
| --- | --- | --- |
| 1 | Replace `_complete_structured_vllm()` rather than adding fallback to ChunkHighlightService | The root cause is in the LLM client layer. Fixing it there benefits ALL structured output callers, not just highlights. `response_format` is the recommended future direction per RFC #19097. |
| 2 | Use `response_format` as Tier 1, `structured_outputs` as Tier 2 | `response_format` is portable (works on OpenRouter/OpenAI/vLLM), future-proof (RFC #19097 direction), and has extra OpenAI-layer validation. `structured_outputs` is battle-tested fallback. |
| 3 | Remove `guided_json` entirely | Removed in v0.12.0; confirmed dead code. No reason to keep it. |
| 4 | Remove `{**base_extra, **body}` merge | `chat_template_kwargs` is independent of structured output. Merging them risks param conflicts. `_build_extra_body()` remains for `complete()` text calls only. |
| 5 | Keep `_complete_structured_openai()` separate for OpenRouter | LangChain `with_structured_output(method="json_schema")` has zero real-world usage evidence with vLLM. Keeping separate paths avoids thinking-control incompatibility (`_get_langchain_model()`'s `model_kwargs=None` for vLLM). |
| 6 | Fix C (JSON fallback) remains as defense-in-depth | Even with the fixed vLLM path, a plain `complete()` fallback protects against server misconfiguration or backend issues. |
