Fix Plan: vLLM Highlight Batch Structured Output Failure
Date: 2026-05-15 (updated 2026-05-15 with external research)
Source: User report (highlight function breaks when deployed to a local vLLM server; works fine on OpenRouter)
Status: Root cause confirmed via internet research
Symptom
When VLLM_ENGINE=true, the highlight feature fails silently:
- Citations appear as plain PDF links instead of highlighted chunk views
- Frontend console logs: "Highlight batch computation failed"
- Backend returns {"status": "failed"} from POST /api/v1/v2/highlights/batch
When VLLM_ENGINE=false (OpenRouter), highlights work correctly.
Root Cause Analysis (Confirmed)
The Highlight Pipeline
SSE stream completes
→ Frontend ResponsePanel.tsx calls POST /api/v1/v2/highlights/batch
→ Backend chunks.py creates LLMClient(settings)
→ ChunkHighlightService.compute_highlights_batch()
→ self._llm.complete_structured(prompt, HighlightBatchResult) ← FAILS HERE
Why It Works on OpenRouter
LLMClient._complete_structured_openai() uses LangChain's with_structured_output(pydantic_model, method="json_schema"). LangChain sends OpenAI-native response_format: {type: "json_schema", json_schema: {...}} to the API. OpenRouter proxies this to the underlying model properly. Returns pre-validated Pydantic model.
Why It Fails on vLLM — CONFIRMED ROOT CAUSE
LLMClient._complete_structured_vllm() (lines 140-191 of llm_client.py) tries two extra_body formats:
| Attempt | Format | Status on latest vLLM |
|---|---|---|
| 1 | extra_body={"structured_outputs": {"json": schema}} | ✅ Supported (v0.8+), but may fail due to chat_template_kwargs merge conflict |
| 2 | extra_body={"guided_json": schema} | ❌ REMOVED in v0.12.0 |
guided_json was deprecated in v0.11.0 and fully removed in v0.12.0. This was confirmed via:
- vLLM structured outputs docs, which carry this migration warning:
  ⚠️ If you are still using the following deprecated API fields which were removed in v0.12.0: guided_json → {"structured_outputs": {"json": ...}}
- vLLM RFC #19097: The vLLM team confirmed the migration path and is moving all structured output constraints into response_format.
When structured_outputs attempt 1 fails (e.g., due to the {**base_extra, **body} merge conflating chat_template_kwargs with structured output params), the guided_json fallback is silently rejected by vLLM v0.12+ — resulting in LLMClientError and status: "failed".
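The merge behaviour referred to above can be illustrated with plain dicts. This is a minimal sketch: only the {**base_extra, **body} expression comes from the plan, while the helper name merge_extra_body and the enable_thinking and backend values are hypothetical placeholders.

```python
from typing import Any


def merge_extra_body(base_extra: dict[str, Any], body: dict[str, Any]) -> dict[str, Any]:
    """Reproduce the shallow merge the current vLLM path performs."""
    return {**base_extra, **body}


# Hypothetical values: the real base_extra comes from _build_extra_body().
base = {"chat_template_kwargs": {"enable_thinking": False}}
structured = {"structured_outputs": {"json": {"type": "object"}}}

# Both concerns end up in a single extra_body dict.
merged = merge_extra_body(base, structured)

# The merge is shallow: a key present on both sides is replaced wholesale,
# so nested settings from base_extra are dropped rather than combined.
clobbered = merge_extra_body({"structured_outputs": {"backend": "auto"}}, structured)
```

Whether this merge is what actually breaks attempt 1 is not established here; the sketch only shows why keeping the two parameter groups separate removes one source of ambiguity.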
What Latest vLLM Actually Supports
Research confirmed two supported methods on latest vLLM (v0.12+); the third, guided_json, has been removed:
| # | Method | Format | Since | Evidence |
|---|---|---|---|---|
| 1 | response_format | {"type": "json_schema", "json_schema": {"name": "...", "schema": schema}} | v0.6.4 | vLLM protocol.py to_sampling_params(), PR #7654, PR #9530 |
| 2 | extra_body | {"structured_outputs": {"json": schema}} | v0.8+ | vLLM test_chat.py (extensive test coverage), third-party repos: encourage, mellea |
| 3 | guided_json | ❌ | n/a | Removed v0.12.0 per official docs |
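The two supported request shapes from the table can be built by one small helper. This is a sketch under assumptions: build_structured_kwargs is a hypothetical name, and the returned dict is meant to be splatted into chat.completions.create(...) alongside model and messages.

```python
from typing import Any


def build_structured_kwargs(schema: dict[str, Any], name: str, method: str) -> dict[str, Any]:
    """Return the extra keyword arguments for one structured-output method."""
    if method == "response_format":
        # OpenAI-native shape (row 1 of the table).
        return {
            "response_format": {
                "type": "json_schema",
                "json_schema": {"name": name, "schema": schema},
            }
        }
    if method == "structured_outputs":
        # vLLM-specific shape passed through extra_body (row 2 of the table).
        return {"extra_body": {"structured_outputs": {"json": schema}}}
    # guided_json is deliberately not offered: the table lists it as removed.
    raise ValueError(f"unsupported structured output method: {method}")


example = build_structured_kwargs({"type": "object"}, "HighlightBatchResult", "response_format")
```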
$defs / Nested Schema Support
The HighlightBatchResult schema produces $defs from Pydantic's model_json_schema():
{
"$defs": {
"ChunkHighlights": { "properties": { ... } },
"RelevantSentence": { "properties": { ... } }
},
"properties": {
"results": { "items": { "$ref": "#/$defs/ChunkHighlights" }, "type": "array" }
}
}
$defs are fully supported on latest vLLM with the auto backend (default):
- xgrammar: Native $defs support via xgr.Grammar.from_json_schema(); $defs is NOT in has_xgrammar_unsupported_json_features()
- guidance: Native support via llguidance.LLMatcher.grammar_from_json_schema()
- auto backend: Tries xgrammar → guidance → outlines, handling fallback transparently
- PR #15627: Specifically addressed $defs for the Outlines backend (closed when V0 removed; xgrammar/guidance already handled it)
- vLLM test fixtures: sample_definition_json_schema fixture tests $defs + $ref schemas in production
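To make concrete what a "$ref": "#/$defs/..." pointer resolves to, the sketch below inlines the definitions into a flat schema. It is illustrative only: per the notes above, xgrammar and guidance handle $defs natively, so this step should not be needed. The field names chunk_id, sentences and text are hypothetical, since the real properties are elided in this plan, and the helper assumes non-recursive schemas.

```python
from typing import Any

REF_PREFIX = "#/$defs/"

# Hypothetical stand-in for HighlightBatchResult.model_json_schema().
SCHEMA: dict[str, Any] = {
    "$defs": {
        "ChunkHighlights": {
            "type": "object",
            "properties": {
                "chunk_id": {"type": "string"},
                "sentences": {"type": "array", "items": {"$ref": "#/$defs/RelevantSentence"}},
            },
        },
        "RelevantSentence": {"type": "object", "properties": {"text": {"type": "string"}}},
    },
    "type": "object",
    "properties": {"results": {"type": "array", "items": {"$ref": "#/$defs/ChunkHighlights"}}},
}


def inline_refs(node: Any, defs: dict[str, Any]) -> Any:
    """Replace every local $ref with a copy of its definition (non-recursive schemas only)."""
    if isinstance(node, dict):
        ref = node.get("$ref")
        if isinstance(ref, str) and ref.startswith(REF_PREFIX):
            return inline_refs(defs[ref[len(REF_PREFIX):]], defs)
        return {key: inline_refs(value, defs) for key, value in node.items() if key != "$defs"}
    if isinstance(node, list):
        return [inline_refs(item, defs) for item in node]
    return node


flat = inline_refs(SCHEMA, SCHEMA["$defs"])
```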
Why QueryDecomposer Works But Highlights Don't
QueryDecomposer uses LLMClientDP (Deepseek), which has its own complete_structured() with Deepseek JSON mode (response_format: {type: "json_object"}) — entirely different code path, unaffected by vLLM's guided_json removal.
ChunkHighlightService uses the main LLMClient and calls complete_structured() which routes to _complete_structured_vllm() → dead guided_json fallback.
Fix Plan (Revised — 2026-05-15)
Fix A (PRIMARY): Update _complete_structured_vllm() to use supported formats
Approach: Replace the two-attempt guided_json loop with a two-tier approach using formats confirmed to work on latest vLLM.
Implementation (llm_client.py, replace _complete_structured_vllm() lines 140-191):
async def _complete_structured_vllm(self, prompt, pydantic_model, step_name, start_time):
    """Use vLLM-native structured output via response_format or structured_outputs extra_body."""
    schema = pydantic_model.model_json_schema()
    model_name = pydantic_model.__name__

    # Tier 1: OpenAI-native response_format (vLLM v0.6.4+, portable to OpenRouter/OpenAI)
    # This is the format vLLM's own docs recommend and RFC #19097 confirms as future direction.
    try:
        response = await self._client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0.0,
            response_format={
                "type": "json_schema",
                "json_schema": {"name": model_name, "schema": schema},
            },
        )
        content = response.choices[0].message.content or ""
        elapsed_ms = (time.perf_counter() - start_time) * 1000
        self.logger.info(
            "[%s] vLLM structured succeeded via response_format in %.2fms", step_name, elapsed_ms
        )
        return pydantic_model.model_validate_json(content)
    except Exception as exc:
        self.logger.warning(
            "[%s] response_format failed: %s. Trying structured_outputs fallback.", step_name, exc
        )

    # Tier 2: extra_body structured_outputs (vLLM v0.8+, most battle-tested fallback)
    try:
        response = await self._client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0.0,
            extra_body={"structured_outputs": {"json": schema}},
        )
        content = response.choices[0].message.content or ""
        elapsed_ms = (time.perf_counter() - start_time) * 1000
        self.logger.info(
            "[%s] vLLM structured succeeded via structured_outputs in %.2fms", step_name, elapsed_ms
        )
        return pydantic_model.model_validate_json(content)
    except Exception as exc:
        elapsed_ms = (time.perf_counter() - start_time) * 1000
        self.logger.error(
            "[%s] Both response_format and structured_outputs failed after %.2fms",
            step_name, elapsed_ms, exc_info=True,
        )
        # Chain the original exception so the root cause stays visible in tracebacks.
        raise LLMClientError("vLLM structured output failed with all supported formats") from exc
What changes from current code:
- ❌ Remove guided_json loop (removed in v0.12.0, confirmed dead code)
- ❌ Remove {**base_extra, **body} merge (no more conflating chat_template_kwargs with structured output; _build_extra_body() remains for complete() text calls only)
- ✅ Add response_format as Tier 1 (vLLM v0.6.4+, OpenRouter-compatible, future-proof per RFC #19097)
- ✅ Keep structured_outputs as Tier 2 (battle-tested fallback for v0.8+ deployments)
- ✅ Add per-tier success/failure logging for diagnostics
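The tiering logic itself can be exercised without a vLLM server. The sketch below is an assumption-laden stand-in, not the project's code: first_successful, AllTiersFailed and the two stub coroutines are hypothetical names, with AllTiersFailed playing the role of LLMClientError.

```python
import asyncio
from typing import Any, Awaitable, Callable, Optional

Tier = tuple[str, Callable[[], Awaitable[Any]]]


class AllTiersFailed(Exception):
    """Hypothetical stand-in for LLMClientError."""


async def first_successful(tiers: list[Tier]) -> tuple[str, Any]:
    """Run each tier in order and return (tier_name, result) for the first that succeeds."""
    last_exc: Optional[Exception] = None
    for name, attempt in tiers:
        try:
            return name, await attempt()
        except Exception as exc:  # broad on purpose, mirroring the plan's per-tier handling
            last_exc = exc
    raise AllTiersFailed("all structured output tiers failed") from last_exc


async def failing_tier() -> str:
    # Simulates a server rejecting the response_format request.
    raise RuntimeError("response_format rejected")


async def working_tier() -> str:
    # Simulates a successful structured_outputs response body.
    return '{"results": []}'


def demo() -> tuple[str, Any]:
    tiers = [("response_format", failing_tier), ("structured_outputs", working_tier)]
    return asyncio.run(first_successful(tiers))
```

Calling demo() returns the name of the tier that answered together with its raw content, which is the same information the per-tier log lines in Fix A record.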
Fix B (DEFENSE-IN-DEPTH): Add markdown fence stripping
Approach: vLLM models may wrap JSON output in markdown code fences even with structured output. Strip them before model_validate_json().
Implementation (llm_client.py, in both Tier 1 and Tier 2 of _complete_structured_vllm(), before model_validate_json()):
# Strip markdown code fences if present
import re
match = re.search(r"```(?:json)?\s*\n?(.*?)\n?```", content, re.DOTALL)
if match:
    content = match.group(1).strip()
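The same stripping step can be packaged as a helper so both tiers share one implementation. A minimal sketch, assuming a hypothetical function name strip_markdown_fences; the pattern is the one from this plan, with the fence marker built at runtime only to keep the example readable inside a fenced block.

```python
import re

# Three backticks, assembled at runtime.
FENCE = "`" * 3

# Same pattern as in Fix B: optional "json" tag, lazy capture of the fenced body.
_FENCED_BLOCK = re.compile(FENCE + r"(?:json)?\s*\n?(.*?)\n?" + FENCE, re.DOTALL)


def strip_markdown_fences(content: str) -> str:
    """Return the body of the first fenced block, or the input unchanged if none is found."""
    match = _FENCED_BLOCK.search(content)
    if match:
        return match.group(1).strip()
    return content


wrapped = FENCE + 'json\n{"results": []}\n' + FENCE
cleaned = strip_markdown_fences(wrapped)
```

The helper would be called on content immediately before model_validate_json() in each tier.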
Fix C (FALLBACK): Add plain complete() fallback to ChunkHighlightService
Approach: If both structured output tiers fail, fall back to complete() with JSON format instruction. Only needed as last-resort defense if the vLLM server is misconfigured (e.g., --structured-outputs-config.backend not set).
Implementation (chunk_highlight_service.py, in compute_highlights_batch()):
Same as original Fix A in this plan. This is now demoted to Fix C since Fix A (updating the vLLM path) should resolve the issue at the LLM client layer for all structured output callers.
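Since the original Fix A is not reproduced here, the following is only a sketch of what the parsing half of such a fallback might look like: pull the first JSON object out of free-form complete() output before validating it. JSON_INSTRUCTION and extract_first_json_object are hypothetical names, not taken from chunk_highlight_service.py.

```python
import json
from typing import Any

# Hypothetical instruction appended to the prompt before calling complete().
JSON_INSTRUCTION = "Respond with a single JSON object and no other text."


def extract_first_json_object(text: str) -> dict[str, Any]:
    """Return the first parseable JSON object embedded in text."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one value starting at `start` and ignores trailing text.
            obj, _end = decoder.raw_decode(text, start)
            return obj
        except json.JSONDecodeError:
            # Not a valid object at this brace; try the next one.
            start = text.find("{", start + 1)
    raise ValueError("no JSON object found in completion text")


parsed = extract_first_json_object('Sure! {"results": []} Hope this helps.')
```

The resulting dict could then be passed to the Pydantic model's model_validate(), with a validation failure treated the same as a failed highlight batch.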
Files to Change
| File | Changes | Priority |
|---|---|---|
| backend/app/services/llm_client.py | Fix A: replace _complete_structured_vllm() with two-tier approach; Fix B: add markdown fence stripping | HIGH |
| backend/app/services/chunk_highlight_service.py | Fix C: add JSON fallback in compute_highlights_batch() (defense-in-depth) | Medium |
Test Plan
- Reproduce locally: Set VLLM_ENGINE=true in .env, point to vLLM server, run a query, observe highlight batch failure
- Apply Fix A+B: Update _complete_structured_vllm(), rebuild Docker image, redeploy to vLLM
- Verify highlights work: Run a query against uploaded PDF, confirm citation links show highlighted chunk views
- Verify logs: Check that response_format Tier 1 succeeds (logs should show "succeeded via response_format")
- Run existing tests: cd backend && pytest app/test/test_phase5_chunk_highlight_service.py -v
- Run existing tests: cd backend && pytest app/test/test_phase5_highlight_endpoints.py -v
- Run existing tests: cd backend && pytest app/test/ -v (full suite; ensure no regressions)
Research Evidence
| Source | Finding |
|---|---|
| vLLM structured outputs docs | guided_json removed in v0.12.0; migration to structured_outputs |
| vLLM protocol.py | response_format.type == "json_schema" → StructuredOutputsParams(json=...) since v0.6.4 |
| vLLM PR #7654 | Added json_schema support (Aug 2024) |
| vLLM PR #9530 | Fixed json_schema not being passed to backend (Oct 2024) |
| vLLM RFC #19097 | Future direction: move all structured output to response_format |
| vLLM test_chat.py | extra_body=dict(structured_outputs={"json": schema}) extensively tested |
| vLLM backend_xgrammar.py | $defs NOT in unsupported features list; xgrammar handles them natively |
| vLLM PR #15627 | $defs fix for Outlines backend; xgrammar/guidance already supported it |
| encourage | Real-world use: extra_body={"structured_outputs": {"json": model.model_json_schema()}} |
| mellea | Real-world use: falls back from structured_outputs to guided_json for older vLLM |
Related Plans
- .plans/debug_vllm_structured_output.md: Previous vLLM structured output diagnosis (2026-04-29) that fixed the QueryDecomposer path but did not address the highlight service
Decision Points
| # | Decision | Rationale |
|---|---|---|
| 1 | Replace _complete_structured_vllm() rather than adding fallback to ChunkHighlightService | The root cause is in the LLM client layer. Fixing it there benefits ALL structured output callers, not just highlights. response_format is the recommended future direction per RFC #19097. |
| 2 | Use response_format as Tier 1, structured_outputs as Tier 2 | response_format is portable (works on OpenRouter/OpenAI/vLLM), future-proof (RFC #19097 direction), and has extra OpenAI-layer validation. structured_outputs is battle-tested fallback. |
| 3 | Remove guided_json entirely | Removed in v0.12.0; confirmed dead code. No reason to keep it. |
| 4 | Remove {**base_extra, **body} merge | chat_template_kwargs is independent of structured output. Merging them risks param conflicts. _build_extra_body() remains for complete() text calls only. |
| 5 | Keep _complete_structured_openai() separate for OpenRouter | LangChain with_structured_output(method="json_schema") has zero real-world usage evidence with vLLM. Keeping separate paths avoids thinking-control incompatibility (_get_langchain_model()'s model_kwargs=None for vLLM). |
| 6 | Fix C (JSON fallback) remains as defense-in-depth | Even with the fixed vLLM path, a plain complete() fallback protects against server misconfiguration or backend issues. |