docs: vLLM highlight failure fix plan — confirmed guided_json removed in v0.12.0
Root cause confirmed via vLLM docs, protocol.py source, RFC #19097, and GitHub test suite: guided_json was removed in v0.12.0. Our fallback to it after structured_outputs fails is dead code. Fix strategy: replace _complete_structured_vllm() with two-tier approach (response_format as Tier 1, structured_outputs as Tier 2), removing the dead guided_json path and the chat_template_kwargs merge that may conflict. Evidence from: vllm.ai docs, vllm-project/vllm tests/entrypoints, protocol.py to_sampling_params(), PRs #7654 #9530 #15627, RFC #19097
Parent: 53ebafc401
Commit: c3392989dc

# Fix Plan: vLLM Highlight Batch Structured Output Failure

**Date**: 2026-05-15 (updated 2026-05-15 with external research)
**Source**: User report: the highlight function breaks when deployed against a local vLLM server but works fine on OpenRouter
**Status**: Root cause confirmed via external research (vLLM docs, source, and tests)

---

## Symptom

When `VLLM_ENGINE=true`, the highlight feature fails silently:
- Citations appear as plain PDF links instead of highlighted chunk views
- Frontend console logs: `"Highlight batch computation failed"`
- Backend returns `{"status": "failed"}` from `POST /api/v1/v2/highlights/batch`

When `VLLM_ENGINE=false` (OpenRouter), highlights work correctly.

---

## Root Cause Analysis (Confirmed)

### The Highlight Pipeline

```
SSE stream completes
  → Frontend ResponsePanel.tsx calls POST /api/v1/v2/highlights/batch
    → Backend chunks.py creates LLMClient(settings)
      → ChunkHighlightService.compute_highlights_batch()
        → self._llm.complete_structured(prompt, HighlightBatchResult)  ← FAILS HERE
```

### Why It Works on OpenRouter

`LLMClient._complete_structured_openai()` uses LangChain's `with_structured_output(pydantic_model, method="json_schema")`. LangChain sends the OpenAI-native `response_format: {type: "json_schema", json_schema: {...}}` to the API, OpenRouter proxies it to the underlying model, and the call returns a pre-validated Pydantic model.

### Why It Fails on vLLM (Confirmed Root Cause)

`LLMClient._complete_structured_vllm()` (lines 140-191 of `llm_client.py`) tries two `extra_body` formats in order:

| Attempt | Format | Status on latest vLLM |
|---------|--------|----------------------|
| 1 | `extra_body={"structured_outputs": {"json": schema}}` | ✅ Supported (v0.8+), but may fail due to `chat_template_kwargs` merge conflict |
| 2 | `extra_body={"guided_json": schema}` | ❌ **REMOVED in v0.12.0** |

**`guided_json` was deprecated in v0.11.0 and fully removed in v0.12.0.** This was confirmed via:

- **[vLLM Official Docs](https://docs.vllm.ai/en/stable/features/structured_outputs/):**
  > ⚠️ If you are still using the following deprecated API fields which were removed in v0.12.0:
  > `guided_json` → `{"structured_outputs": {"json": ...}}`

- **[vLLM RFC #19097](https://github.com/vllm-project/vllm/issues/19097):** The vLLM team confirmed the migration path and is moving all structured output constraints into `response_format`.

If attempt 1 fails (e.g., because the `{**base_extra, **body}` merge mixes `chat_template_kwargs` with the structured output params), the code falls back to `guided_json`, which vLLM v0.12+ **silently rejects**. The result is an `LLMClientError` and `status: "failed"`.

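
To make the merge concrete, here is a minimal sketch of what the two attempts put on the wire. The `base_extra` contents (including the `enable_thinking` key) and the tiny schema are hypothetical stand-ins; the real values come from `_build_extra_body()` and `HighlightBatchResult`, which are not reproduced in this plan.

```python
# Hypothetical stand-in for what _build_extra_body() returns for text calls.
base_extra = {"chat_template_kwargs": {"enable_thinking": False}}

# Hypothetical, simplified schema (the real one comes from HighlightBatchResult).
schema = {"type": "object", "properties": {"results": {"type": "array"}}}

# The two attempts described in the table above, in order.
attempts = [
    {"structured_outputs": {"json": schema}},
    {"guided_json": schema},  # field removed in vLLM v0.12.0
]


def merged_bodies(base, bodies):
    """Reproduce the {**base_extra, **body} merge for every attempt."""
    return [{**base, **body} for body in bodies]


merged = merged_bodies(base_extra, attempts)
for body in merged:
    # Each request carries chat_template_kwargs next to the structured output key.
    print(sorted(body))
```

Both request bodies end up carrying `chat_template_kwargs` alongside the structured output key, and the second one still sends the removed `guided_json` field.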
### What Latest vLLM Actually Supports

Research confirmed two supported methods on latest vLLM (v0.12+); the removed `guided_json` field is listed for completeness:

| # | Method | Format | Since | Evidence |
|---|--------|--------|-------|----------|
| 1 | `response_format` | `{"type": "json_schema", "json_schema": {"name": "...", "schema": schema}}` | v0.6.4 | [vLLM protocol.py](https://github.com/vllm-project/vllm/blob/master/vllm/entrypoints/openai/chat_completion/protocol.py) `to_sampling_params()`, [PR #7654](https://github.com/vllm-project/vllm/pull/7654), [PR #9530](https://github.com/vllm-project/vllm/pull/9530) |
| 2 | `extra_body` | `{"structured_outputs": {"json": schema}}` | v0.8+ | [vLLM test_chat.py](https://github.com/vllm-project/vllm/blob/main/tests/entrypoints/openai/chat_completion/test_chat.py) (extensive test coverage), third-party repos: `encourage`, `mellea` |
| 3 | ~~`guided_json`~~ | ❌ | n/a | Removed v0.12.0 per official docs |

### `$defs` / Nested Schema Support

The `HighlightBatchResult` schema produces `$defs` from Pydantic's `model_json_schema()`:

```json
{
  "$defs": {
    "ChunkHighlights": { "properties": { ... } },
    "RelevantSentence": { "properties": { ... } }
  },
  "properties": {
    "results": { "items": { "$ref": "#/$defs/ChunkHighlights" }, "type": "array" }
  }
}
```

**`$defs` are fully supported** on latest vLLM with the `auto` backend (default):
- **xgrammar**: Native `$defs` support via `xgr.Grammar.from_json_schema()`; `$defs` is NOT in `has_xgrammar_unsupported_json_features()` ([source](https://github.com/vllm-project/vllm/blob/main/vllm/v1/structured_output/backend_xgrammar.py))
- **guidance**: Native support via `llguidance.LLMatcher.grammar_from_json_schema()`
- **`auto` backend**: Tries xgrammar → guidance → outlines, handling fallback transparently
- **[PR #15627](https://github.com/vllm-project/vllm/pull/15627)**: Specifically addressed `$defs` for the Outlines backend (closed when V0 was removed; xgrammar/guidance already handled it)
- **vLLM test fixtures**: The `sample_definition_json_schema` fixture exercises `$defs` + `$ref` schemas in the vLLM test suite

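
For readers less familiar with `$defs`, the sketch below shows what resolving a local `$ref` amounts to. It is an illustration only, not how xgrammar or guidance implement it, and the field names `sentences` and `text` are hypothetical since the real model fields are elided above. It also assumes non-recursive schemas; a self-referencing definition would recurse forever.

```python
import copy
import json

PREFIX = "#/$defs/"


def inline_refs(node, defs):
    """Return a copy of `node` with local '#/$defs/Name' references expanded."""
    if isinstance(node, dict):
        ref = node.get("$ref")
        if isinstance(ref, str) and ref.startswith(PREFIX):
            # Replace the reference with a (recursively expanded) copy of its target.
            return inline_refs(copy.deepcopy(defs[ref[len(PREFIX):]]), defs)
        # Drop the $defs table itself; everything else is expanded recursively.
        return {k: inline_refs(v, defs) for k, v in node.items() if k != "$defs"}
    if isinstance(node, list):
        return [inline_refs(item, defs) for item in node]
    return node


# Hypothetical schema shaped like the HighlightBatchResult excerpt above.
schema = {
    "$defs": {
        "RelevantSentence": {
            "type": "object",
            "properties": {"text": {"type": "string"}},
        },
        "ChunkHighlights": {
            "type": "object",
            "properties": {
                "sentences": {
                    "type": "array",
                    "items": {"$ref": "#/$defs/RelevantSentence"},
                }
            },
        },
    },
    "type": "object",
    "properties": {
        "results": {"type": "array", "items": {"$ref": "#/$defs/ChunkHighlights"}}
    },
}

flat = inline_refs(schema, schema["$defs"])
print(json.dumps(flat, indent=2))
```

A flattening step like this should not be needed on the backends listed above; it is only a possible workaround if a deployment turned out to use a backend without `$defs` support.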
### Why QueryDecomposer Works But Highlights Don't

`QueryDecomposer` uses `LLMClientDP` (DeepSeek), which has its own `complete_structured()` built on DeepSeek JSON mode (`response_format: {type: "json_object"}`). This is an entirely different code path, unaffected by vLLM's `guided_json` removal.

`ChunkHighlightService` uses the main `LLMClient` and calls `complete_structured()`, which routes to `_complete_structured_vllm()` and from there to the dead `guided_json` fallback.

---

## Fix Plan (Revised 2026-05-15)

### Fix A (PRIMARY): Update `_complete_structured_vllm()` to use supported formats

**Approach**: Replace the current two-attempt loop (`structured_outputs`, then `guided_json`) with a two-tier approach that uses only formats confirmed to work on latest vLLM.

**Implementation** (`llm_client.py`, replace `_complete_structured_vllm()` lines 140-191):

```python
# Assumes `import time` at the top of llm_client.py.
async def _complete_structured_vllm(self, prompt, pydantic_model, step_name, start_time):
    """Use vLLM-native structured output via response_format or structured_outputs extra_body."""
    schema = pydantic_model.model_json_schema()
    model_name = pydantic_model.__name__

    # Tier 1: OpenAI-native response_format (vLLM v0.6.4+, portable to OpenRouter/OpenAI)
    # This is the format vLLM's own docs recommend and RFC #19097 confirms as future direction.
    try:
        response = await self._client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0.0,
            response_format={
                "type": "json_schema",
                "json_schema": {"name": model_name, "schema": schema},
            },
        )
        content = response.choices[0].message.content or ""
        # Validate before logging so "succeeded" always means a parsed model.
        result = pydantic_model.model_validate_json(content)
        elapsed_ms = (time.perf_counter() - start_time) * 1000
        self.logger.info(
            "[%s] vLLM structured succeeded via response_format in %.2fms", step_name, elapsed_ms
        )
        return result
    except Exception as exc:
        self.logger.warning(
            "[%s] response_format failed: %s. Trying structured_outputs fallback.", step_name, exc
        )

    # Tier 2: extra_body structured_outputs (vLLM v0.8+, most battle-tested fallback)
    try:
        response = await self._client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0.0,
            extra_body={"structured_outputs": {"json": schema}},
        )
        content = response.choices[0].message.content or ""
        result = pydantic_model.model_validate_json(content)
        elapsed_ms = (time.perf_counter() - start_time) * 1000
        self.logger.info(
            "[%s] vLLM structured succeeded via structured_outputs in %.2fms", step_name, elapsed_ms
        )
        return result
    except Exception as exc:
        elapsed_ms = (time.perf_counter() - start_time) * 1000
        self.logger.error(
            "[%s] Both response_format and structured_outputs failed after %.2fms",
            step_name, elapsed_ms, exc_info=True,
        )
        # Raise inside the handler so the Tier 2 error is chained as the cause.
        raise LLMClientError("vLLM structured output failed with all supported formats") from exc
```

**What changes from current code:**
1. ❌ Remove the `guided_json` attempt (removed in v0.12.0, so it is confirmed dead code)
2. ❌ Remove the `{**base_extra, **body}` merge (no more conflating `chat_template_kwargs` with structured output; `_build_extra_body()` remains for `complete()` text calls only)
3. ✅ Add `response_format` as Tier 1 (vLLM v0.6.4+, OpenRouter-compatible, future-proof per RFC #19097)
4. ✅ Keep `structured_outputs` as Tier 2 (battle-tested fallback for v0.8+ deployments)
5. ✅ Add per-tier success/failure logging for diagnostics

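
The tier ordering can be exercised without a vLLM server. The following is a rough, standard-library-only simulation under stated assumptions: `StubServer` and `complete_two_tier` are hypothetical names invented for this sketch, and the stub only mimics a server that rejects certain request parameters. It is not the project's `LLMClient`.

```python
import asyncio
import json


class StubServer:
    """Hypothetical stand-in for a vLLM endpoint; rejects any kwarg listed in `rejects`."""

    def __init__(self, rejects):
        self.rejects = set(rejects)

    async def create(self, **kwargs):
        for key in self.rejects:
            if key in kwargs:
                raise ValueError(f"unsupported parameter: {key}")
        return json.dumps({"results": []})


async def complete_two_tier(server, schema):
    """Try response_format first, then extra_body structured_outputs."""
    tiers = [
        ("response_format", {"response_format": {
            "type": "json_schema",
            "json_schema": {"name": "Demo", "schema": schema},
        }}),
        ("structured_outputs", {"extra_body": {"structured_outputs": {"json": schema}}}),
    ]
    last_error = None
    for name, kwargs in tiers:
        try:
            content = await server.create(**kwargs)
            return name, json.loads(content)
        except Exception as exc:  # broad on purpose, mirroring the plan's code
            last_error = exc
    raise RuntimeError("all structured output tiers failed") from last_error


schema = {"type": "object", "properties": {"results": {"type": "array"}}}

# Healthy server: Tier 1 wins.
tier_ok, _ = asyncio.run(complete_two_tier(StubServer(rejects=[]), schema))
# Server that rejects response_format: Tier 2 takes over.
tier_fallback, parsed = asyncio.run(
    complete_two_tier(StubServer(rejects=["response_format"]), schema)
)
print(tier_ok, tier_fallback, parsed)
```

A stub along these lines could also back a unit test for the real method, by swapping `self._client` for an object whose `chat.completions.create` fails on the first call.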
### Fix B (DEFENSE-IN-DEPTH): Add markdown fence stripping

**Approach**: Models served through vLLM may wrap JSON output in markdown code fences even when structured output is requested. Strip the fences before `model_validate_json()`.

**Implementation** (`llm_client.py`, in both Tier 1 and Tier 2 of `_complete_structured_vllm()`, before `model_validate_json()`):

```python
import re  # place at module top, not inside the method

# Strip markdown code fences if present; content without fences is left unchanged
match = re.search(r"```(?:json)?\s*\n?(.*?)\n?```", content, re.DOTALL)
if match:
    content = match.group(1).strip()
```

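
Because the same stripping is needed in both tiers, it may be cleaner to factor it into one helper that can be unit-tested on its own. The sketch below is one way to do that; the name `strip_markdown_fences` is hypothetical, and the fence token is built with string multiplication only so the example itself contains no literal triple backticks. The pattern is otherwise the same as in Fix B.

```python
import re

FENCE = "`" * 3  # three backticks, spelled out to keep this example fence-safe

# Same pattern as Fix B: optional "json" tag, then the shortest body up to the closing fence.
_FENCE_RE = re.compile(FENCE + r"(?:json)?\s*\n?(.*?)\n?" + FENCE, re.DOTALL)


def strip_markdown_fences(content: str) -> str:
    """Return the body of the first fenced block, or the stripped input if none."""
    match = _FENCE_RE.search(content)
    return match.group(1).strip() if match else content.strip()


fenced = FENCE + 'json\n{"a": 1}\n' + FENCE
print(strip_markdown_fences(fenced))      # {"a": 1}
print(strip_markdown_fences('{"a": 1}'))  # {"a": 1}
```

Note that an uppercase `JSON` tag would not be consumed by this pattern and would remain in the extracted body; whether that case matters depends on the models in use.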
### Fix C (FALLBACK): Add plain complete() fallback to ChunkHighlightService

**Approach**: If both structured output tiers fail, fall back to `complete()` with a JSON format instruction in the prompt. This is only needed as a last-resort defense if the vLLM server is misconfigured (e.g., `--structured-outputs-config.backend` not set).

**Implementation** (`chunk_highlight_service.py`, in `compute_highlights_batch()`):

Same as the original Fix A from the earlier revision of this plan. It is now demoted to Fix C, since the revised Fix A (updating the vLLM path) should resolve the issue at the LLM client layer for all structured output callers.

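
This plan does not reproduce the earlier revision's code, so the following is only a sketch of the parsing half of such a fallback, under the assumption that `complete()` returns free-form text that may contain prose around the JSON. The helper name `extract_first_json_object` is hypothetical; the result would still need to go through the Pydantic model's validation.

```python
import json


def extract_first_json_object(text: str) -> dict:
    """Return the first JSON object embedded in free-form model output."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and ignores trailing text.
            value, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            value = None
        if isinstance(value, dict):
            return value
        # Not valid JSON at this brace; try the next one.
        start = text.find("{", start + 1)
    raise ValueError("no JSON object found in model output")


reply = 'Sure! Here is the result: {"results": []} Let me know if you need more.'
print(extract_first_json_object(reply))
```

Scanning for the first decodable object tolerates leading and trailing chatter, which is the main difference from calling `json.loads()` on the raw reply.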
---

## Files to Change

| File | Changes | Priority |
|------|---------|----------|
| `backend/app/services/llm_client.py` | Fix A: replace `_complete_structured_vllm()` with two-tier approach; Fix B: add markdown fence stripping | **HIGH** |
| `backend/app/services/chunk_highlight_service.py` | Fix C: add JSON fallback in `compute_highlights_batch()` (defense-in-depth) | Medium |

---

## Test Plan

1. **Reproduce locally**: Set `VLLM_ENGINE=true` in `.env`, point to the vLLM server, run a query, and observe the highlight batch failure
2. **Apply Fix A+B**: Update `_complete_structured_vllm()`, rebuild the Docker image, redeploy against vLLM
3. **Verify highlights work**: Run a query against an uploaded PDF and confirm citation links show highlighted chunk views
4. **Verify logs**: Check that `response_format` Tier 1 succeeds (logs should show `succeeded via response_format`)
5. **Run highlight service tests**: `cd backend && pytest app/test/test_phase5_chunk_highlight_service.py -v`
6. **Run highlight endpoint tests**: `cd backend && pytest app/test/test_phase5_highlight_endpoints.py -v`
7. **Run the full suite**: `cd backend && pytest app/test/ -v` (to ensure no regressions)

---

## Research Evidence

| Source | Finding |
|--------|---------|
| [vLLM structured outputs docs](https://docs.vllm.ai/en/stable/features/structured_outputs/) | `guided_json` removed in v0.12.0; migration to `structured_outputs` |
| [vLLM protocol.py](https://github.com/vllm-project/vllm/blob/master/vllm/entrypoints/openai/chat_completion/protocol.py) | `response_format.type == "json_schema"` → `StructuredOutputsParams(json=...)` since v0.6.4 |
| [vLLM PR #7654](https://github.com/vllm-project/vllm/pull/7654) | Added `json_schema` support (Aug 2024) |
| [vLLM PR #9530](https://github.com/vllm-project/vllm/pull/9530) | Fixed `json_schema` not being passed to backend (Oct 2024) |
| [vLLM RFC #19097](https://github.com/vllm-project/vllm/issues/19097) | Future direction: move all structured output to `response_format` |
| [vLLM test_chat.py](https://github.com/vllm-project/vllm/blob/main/tests/entrypoints/openai/chat_completion/test_chat.py) | `extra_body=dict(structured_outputs={"json": schema})` extensively tested |
| [vLLM backend_xgrammar.py](https://github.com/vllm-project/vllm/blob/main/vllm/v1/structured_output/backend_xgrammar.py) | `$defs` NOT in unsupported features list; xgrammar handles them natively |
| [vLLM PR #15627](https://github.com/vllm-project/vllm/pull/15627) | `$defs` fix for Outlines backend; xgrammar/guidance already supported it |
| [encourage](https://github.com/uhh-hcds/encourage) | Real-world use: `extra_body={"structured_outputs": {"json": model.model_json_schema()}}` |
| [mellea](https://github.com/generative-computing/mellea) | Real-world use: falls back from `structured_outputs` to `guided_json` for older vLLM |

---

## Related Plans

- `.plans/debug_vllm_structured_output.md`: Previous vLLM structured output diagnosis (2026-04-29) that fixed the QueryDecomposer path but did not address the highlight service

---

## Decision Points

| # | Decision | Rationale |
|---|----------|-----------|
| 1 | **Replace `_complete_structured_vllm()` rather than adding fallback to ChunkHighlightService** | The root cause is in the LLM client layer. Fixing it there benefits ALL structured output callers, not just highlights. `response_format` is the recommended future direction per RFC #19097. |
| 2 | Use `response_format` as Tier 1, `structured_outputs` as Tier 2 | `response_format` is portable (works on OpenRouter/OpenAI/vLLM), future-proof (RFC #19097 direction), and has extra OpenAI-layer validation. `structured_outputs` is the battle-tested fallback. |
| 3 | Remove `guided_json` entirely | Removed in v0.12.0, so the fallback is confirmed dead code. No reason to keep it. |
| 4 | Remove `{**base_extra, **body}` merge | `chat_template_kwargs` is independent of structured output. Merging them risks param conflicts. `_build_extra_body()` remains for `complete()` text calls only. |
| 5 | Keep `_complete_structured_openai()` separate for OpenRouter | We found no real-world evidence of LangChain `with_structured_output(method="json_schema")` being used with vLLM. Keeping separate paths avoids thinking-control incompatibility (`_get_langchain_model()`'s `model_kwargs=None` for vLLM). |
| 6 | Fix C (JSON fallback) remains as defense-in-depth | Even with the fixed vLLM path, a plain `complete()` fallback protects against server misconfiguration or backend issues. |