docs: Package 8 — add vLLM vision compatibility risk and smoke test to plan

- New risk: vLLM may not support the Qwen3.5-35B-A3B vision API, depending on version
- Dependencies: added vLLM compatibility note with smoke test snippet
- Heuristic fallback (Option B) works with either OpenRouter or vLLM
- qa_vision_enabled toggle provides an escape hatch
Woody 2026-05-15 11:20:20 +08:00
parent 16fbb107f4
commit 322caf1cc0
1 changed file with 20 additions and 1 deletion


@ -655,12 +655,31 @@ class Settings(BaseSettings):
| **LegCo format drift**: Future documents may use different Q&A markers | Low | Detection is regex-based — easy to add new patterns. LLM verification catches novel formats. Log format detection results for monitoring. |
| **Chunk size**: Some Q&A pairs are very long (7+ pages) | Medium | Apply a maximum chunk token limit (configurable, default 3000). Split recursively, first on `\n\n` and then on `\n`, prepending the question text to each sub-chunk for context. |
| **DOCX/TXT Q&A**: Non-PDF formats may use different Q&A markers | Low | Use the same regex detection on the concatenated text. Skip vision table extraction (text-based only). |
| **vLLM vision compatibility**: vLLM may not support the vision API for Qwen3.5-35B-A3B, depending on the vLLM version and how the model is served | Medium | Test a single vision call against your vLLM instance before implementation. If vision is unsupported, set `QA_VISION_ENABLED=false` and use the heuristic fallback (Option B). See the vLLM compatibility note under Dependencies below. |
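The chunk-size mitigation can be sketched as follows. This is a minimal sketch, not the planned code: `split_qa_pair` and `_split` are hypothetical names, and a whitespace word count stands in for the real tokenizer through the `count_tokens` parameter (an assumption; the project's own token counter would be passed instead). It packs paragraphs (`\n\n`) greedily, drops to single lines (`\n`) only for paragraphs that are still too large, and reserves room for the prepended question.

```python
from typing import Callable, List


def _split(text: str, budget: int, count_tokens: Callable[[str], int],
           separators: List[str]) -> List[str]:
    """Greedily pack pieces of `text` into chunks of at most `budget` tokens,
    recursing to the next (finer) separator when one piece is still too large."""
    if count_tokens(text) <= budget or not separators:
        # Fits, or no finer separator is left: emit as-is (may exceed budget)
        return [text]
    sep, finer = separators[0], separators[1:]
    chunks: List[str] = []
    current = ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if count_tokens(candidate) <= budget:
            current = candidate
            continue
        if current:
            chunks.append(current)
        if count_tokens(piece) <= budget:
            current = piece
        else:
            # This piece alone is too big: split it on the next separator
            chunks.extend(_split(piece, budget, count_tokens, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


def split_qa_pair(question: str, answer: str, max_tokens: int = 3000,
                  count_tokens: Callable[[str], int] = lambda s: len(s.split())) -> List[str]:
    """Hypothetical helper: split an over-long Q&A pair so that every
    sub-chunk starts with the question text."""
    header = question.strip() + "\n\n"
    # Reserve room for the prepended question so each final chunk stays within max_tokens
    budget = max(1, max_tokens - count_tokens(header))
    return [header + part for part in _split(answer, budget, count_tokens, ["\n\n", "\n"])]
```

Reserving the question's token count up front is the main design choice: prepending the question after splitting, without shrinking the budget, would let every sub-chunk overshoot the configured maximum. A single line longer than the budget is still emitted whole, because no finer separator is left to split on.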
---
## Dependencies
- **Vision LLM API**: The existing `LLM_MODEL_NAME` (`qwen/qwen3.5-35b-a3b`) is a native vision-language model that accepts base64 images via the standard OpenAI Chat Completions API. No separate vision model, API key, or endpoint needed. If vision is unavailable or disabled (`QA_VISION_ENABLED=false`), fall back to heuristic table detection (text-only).
- **vLLM compatibility** (when `VLLM_ENGINE=true`): Vision table extraction requires vLLM v0.6.0+ with the model served as multimodal (vision encoder loaded). Verify with a quick smoke test before implementation:
```python
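import asyncio, base64, io
from openai import AsyncOpenAI
from PIL import Image
from config import settings  # hypothetical path: import your Settings() instance from wherever it lives

async def main() -> None:
    # Send a real, decodable 64x64 PNG; a header-only base64 stub is not a valid image and can fail even when vision works
    buf = io.BytesIO()
    Image.new("RGB", (64, 64), "white").save(buf, format="PNG")
    data_url = "data:image/png;base64," + base64.b64encode(buf.getvalue()).decode()
    client = AsyncOpenAI(base_url=settings.llm_base_url, api_key=settings.llm_api_key)
    resp = await client.chat.completions.create(
        model=settings.llm_model_name,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe what you see."},
                {"type": "image_url", "image_url": {"url": data_url}},
            ],
        }],
    )
    print(resp.choices[0].message.content)

asyncio.run(main())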
```
If the call returns a normal completion describing the image → vision works. If it errors (typically HTTP 400 or 422) → set `QA_VISION_ENABLED=false` and use Option B (heuristic text-based table extraction). The heuristic fallback is text-only, so it works identically with either provider (OpenRouter or vLLM).
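- **New Python packages**: `Pillow` (likely already installed for image rendering); `pypdf` is already installed. No `pymupdf` is needed for table detection: vision extraction sends raw page images (PNG) directly to the LLM, which identifies table regions itself. Neither `pypdf` nor Pillow rasterizes PDF pages, so producing those PNGs still requires a PDF renderer (confirm which one the existing image rendering uses).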
- **Existing codebase**: No breaking changes. The new strategy is additive; the existing `TokenChunkingStrategy` is unchanged.
- **ChromaDB**: No schema change. Metadata fields are flexible (ChromaDB accepts arbitrary dict keys).
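The vision-or-heuristic fallback described above might be wired up roughly as below. This is a minimal illustration, not the planned implementation: `extract_tables`, `heuristic_tables`, `vision_enabled_from_env`, and the `vision_fn` callable are hypothetical names; the environment parsing is a simplified stand-in for the project's `Settings.qa_vision_enabled`; and the heuristic itself (consecutive lines that split on runs of two or more spaces into the same number of columns form a table) is only a placeholder, since this plan does not specify how Option B detects tables.

```python
import os
import re
from typing import Callable, List, Optional

# Hypothetical column gap for the placeholder heuristic: two or more spaces, or a tab
_COLUMN_GAP = re.compile(r" {2,}|\t")

Table = List[List[str]]


def vision_enabled_from_env() -> bool:
    # Simplified stand-in for Settings.qa_vision_enabled: "false", "0" or "no" disable vision
    return os.environ.get("QA_VISION_ENABLED", "true").strip().lower() not in {"false", "0", "no"}


def heuristic_tables(page_text: str, min_rows: int = 2) -> List[Table]:
    """Text-only placeholder: consecutive lines that split into the same
    number (>= 2) of columns are grouped into one table."""
    tables: List[Table] = []
    block: Table = []

    def flush() -> None:
        # Keep the current block only if it has enough rows to count as a table
        if len(block) >= min_rows:
            tables.append(list(block))
        block.clear()

    for line in page_text.splitlines():
        cells = [c for c in _COLUMN_GAP.split(line.strip()) if c]
        if len(cells) >= 2 and (not block or len(cells) == len(block[0])):
            block.append(cells)
        else:
            flush()
            if len(cells) >= 2:
                block.append(cells)  # a row with a new column count starts a new block
    flush()
    return tables


def extract_tables(page_text: str, page_png: Optional[bytes],
                   vision_fn: Optional[Callable[[bytes], List[Table]]] = None,
                   vision_enabled: Optional[bool] = None) -> List[Table]:
    """Prefer vision extraction; fall back to the heuristic when vision is
    disabled, no page image is available, or the vision call raises."""
    enabled = vision_enabled_from_env() if vision_enabled is None else vision_enabled
    if enabled and vision_fn is not None and page_png is not None:
        try:
            return vision_fn(page_png)
        except Exception:
            # e.g. an HTTP 400/422 from a server without multimodal support; use the heuristic
            pass
    return heuristic_tables(page_text)
```

Catching errors around the vision call, rather than relying on the toggle alone, lets a server without multimodal support degrade to text-only extraction instead of failing ingestion; the toggle then avoids paying for a failing call on every page.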