diff --git a/.plans/package8_enhancement_plan.md b/.plans/package8_enhancement_plan.md
index 7c05dfd..3cfed38 100644
--- a/.plans/package8_enhancement_plan.md
+++ b/.plans/package8_enhancement_plan.md
@@ -655,12 +655,31 @@ class Settings(BaseSettings):
 | **LegCo format drift**: Future documents may use different Q&A markers | Low | Detection is regex-based — easy to add new patterns. LLM verification catches novel formats. Log format detection results for monitoring. |
 | **Chunk size**: Some Q&A pairs are very long (7+ pages) | Medium | Apply max chunk token limit (configurable, default 3000). Recursive split on `\n\n` → `\n` with question text prepended to each sub-chunk for context. |
 | **DOCX/TXT Q&A**: Non-PDF formats may have different Q&A markers | Low | Use same regex detection on concatenated text. Skip vision table extraction (text-based only). |
+| **vLLM vision compatibility**: vLLM may not support vision API for Qwen3.5-35B-A3B depending on version and how the model is served | Medium | Test with a single vision call against your vLLM instance before implementation. Set `QA_VISION_ENABLED=false` and use heuristic fallback (Option B) if unsupported. See vLLM compatibility note in Dependencies below. |
 
 ---
 
 ## Dependencies
 
-- **Vision LLM API**: The existing `LLM_MODEL_NAME` (`qwen/qwen3.5-35b-a3b`) is a native vision-language model that accepts base64 images via OpenRouter's standard OpenAI Chat Completions API. No separate vision model, API key, or endpoint needed. If vision is unavailable or disabled (`QA_VISION_ENABLED=false`), fall back to heuristic table detection (text-only).
+- **Vision LLM API**: The existing `LLM_MODEL_NAME` (`qwen/qwen3.5-35b-a3b`) is a native vision-language model that accepts base64 images via the standard OpenAI Chat Completions API. No separate vision model, API key, or endpoint needed. If vision is unavailable or disabled (`QA_VISION_ENABLED=false`), fall back to heuristic table detection (text-only).
+
+- **vLLM compatibility** (when `VLLM_ENGINE=true`): Vision table extraction requires vLLM v0.6.0+ with the model served as multimodal (vision encoder loaded). Verify with a quick smoke test before implementation:
+  ```python
+  from openai import AsyncOpenAI
+  client = AsyncOpenAI(base_url=settings.llm_base_url, api_key=settings.llm_api_key)
+  resp = await client.chat.completions.create(
+      model=settings.llm_model_name,
+      messages=[{
+          "role": "user",
+          "content": [
+              {"type": "text", "text": "Describe what you see."},
+              {"type": "image_url", "image_url": {"url": "data:image/png;base64,iVBORw0KGgo="}}
+          ]
+      }]
+  )
+  ```
+  If this returns a valid response → vision works. If it errors (400/422) → set `QA_VISION_ENABLED=false` and use Option B (heuristic text-based table extraction). The heuristic fallback works identically regardless of provider (OpenRouter or vLLM).
+
 - **New Python packages**: `Pillow` (likely already installed for image rendering). `pypdf` already installed. No `pymupdf` needed — vision extraction sends raw page images (PNG) directly to the LLM; the LLM itself identifies table regions.
 - **Existing codebase**: No breaking changes. Strategy is additive — existing TokenChunkingStrategy unchanged.
 - **ChromaDB**: No schema change. Metadata fields are flexible (ChromaDB accepts arbitrary dict keys).
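
One caveat on the smoke test in the added Dependencies text: the payload `iVBORw0KGgo=` decodes to only the 8-byte PNG signature, so a server that decodes images may reject it even when vision is supported, which would look like a false "unsupported" result. The sketch below is one possible variant, not the plan's implementation. It builds a small but complete PNG with the standard library and wraps the check in a function. The names `tiny_png_data_url`, `build_vision_messages`, and `vision_supported` are hypothetical, and `client` is assumed to be an `openai.AsyncOpenAI` instance (or anything exposing the same `chat.completions.create` coroutine).

```python
import asyncio
import base64
import struct
import zlib


def tiny_png_data_url(size: int = 32) -> str:
    """Return a data URL holding a valid size x size solid-red RGB PNG."""

    def chunk(tag: bytes, data: bytes) -> bytes:
        # PNG chunk layout: 4-byte length, 4-byte type, data, CRC32 over type + data.
        crc = zlib.crc32(tag + data) & 0xFFFFFFFF
        return struct.pack(">I", len(data)) + tag + data + struct.pack(">I", crc)

    # IHDR: width, height, bit depth 8, color type 2 (RGB), then
    # compression, filter, and interlace methods all set to 0.
    ihdr = struct.pack(">IIBBBBB", size, size, 8, 2, 0, 0, 0)
    # Every scanline is a filter byte (0 = none) followed by the RGB pixels.
    row = b"\x00" + b"\xff\x00\x00" * size
    idat = zlib.compress(row * size)
    png = (
        b"\x89PNG\r\n\x1a\n"
        + chunk(b"IHDR", ihdr)
        + chunk(b"IDAT", idat)
        + chunk(b"IEND", b"")
    )
    return "data:image/png;base64," + base64.b64encode(png).decode("ascii")


def build_vision_messages(data_url: str) -> list[dict]:
    """Build the same text + image_url message shape used in the plan's smoke test."""
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe what you see."},
            {"type": "image_url", "image_url": {"url": data_url}},
        ],
    }]


async def vision_supported(client, model: str) -> bool:
    """Send one image request and report whether the endpoint accepted it.

    Assumption: any exception is treated as "vision unavailable". That also
    covers network or auth failures, so inspect the error before disabling
    vision permanently.
    """
    try:
        resp = await client.chat.completions.create(
            model=model,
            messages=build_vision_messages(tiny_png_data_url()),
            max_tokens=32,  # keep the probe cheap; the reply text is not inspected
        )
    except Exception:
        return False
    return bool(resp.choices)
```

A caller could run `asyncio.run(vision_supported(client, settings.llm_model_name))` once at startup and set `QA_VISION_ENABLED=false` when it returns `False`, matching the fallback path described in the plan.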