legco_ai_assistant/.plans/phase5_openrouter_asr.md

Phase 5: OpenRouter ASR Provider

Date: 2026-05-18
Status: Implemented (2026-05-19)
Source: User request (add OpenRouter STT, google/gemini-3.1-flash-lite, as an alternative ASR provider)
Research: OpenRouter STT docs + librarian agent (real-world code patterns) + explore agent (codebase architecture map)
Test Results: 56/56 ASR tests pass (Phase 2 + Phase 5)


1. Objective

Add OpenRouter as a second ASR provider for batch transcription (transcribe_full). The realtime WebSocket streaming mode remains DashScope-only because OpenRouter has no WebSocket STT endpoint.

Users select the provider via a single env var. The existing REST endpoint POST /api/v1/video/{video_id}/transcribe and the WebSocket endpoint /ws/asr/{video_id} are unchanged from the frontend's perspective.


2. Scope

| In Scope | Out of Scope |
| --- | --- |
| OpenRouter batch transcription (transcribe_full) | Realtime WebSocket streaming (DashScope-only) |
| ASR_PROVIDER env var switching | Frontend provider selector UI |
| Provider abstraction (protocol class) | Changing existing DashScope code behavior |
| Tests for new provider | Retraining/changing models |
| .env.example update | Docker image rebuild |

3. Architecture

3.1 Current Flow (DashScope-only)

POST /api/v1/video/{video_id}/transcribe
  → video.py router
  → VideoService.extract_audio() → WAV bytes
  → ASRClient(settings).transcribe_full(audio_bytes, language)
  → OpenAI SDK → DashScope Chat Completions API (audio input)
  → return text

3.2 New Flow (Provider-based)

POST /api/v1/video/{video_id}/transcribe
  → video.py router
  → VideoService.extract_audio() → WAV bytes
  → ASRClient(settings).transcribe_full(audio_bytes, language)
      ├── ASR_PROVIDER=dashscope → DashScopeASRProvider (existing logic)
      └── ASR_PROVIDER=openrouter → OpenRouterASRProvider (new)
  → return text

3.3 Provider Interface (Factory + Strategy Pattern)

Based on real-world multi-provider ASR patterns (DocsGPT, LiveKit, openai-agents-python), use Factory + Strategy:

from abc import ABC, abstractmethod

class ASRProvider(ABC):
    """Abstract base for all ASR providers."""
    @abstractmethod
    async def transcribe(self, audio_bytes: bytes, language: str) -> str:
        """Transcribe audio bytes to traditional Chinese text.

        Raises ASRError on any failure (network, HTTP, empty response).
        """
        ...

class ASRProviderFactory:
    """Selects ASR provider based on settings."""
    _providers: dict[str, type[ASRProvider]] = {}

    @classmethod
    def register(cls, name: str, provider_cls: type[ASRProvider]) -> None:
        cls._providers[name] = provider_cls

    @classmethod
    def create(cls, name: str, settings) -> ASRProvider:
        provider_cls = cls._providers.get(name)
        if not provider_cls:
            raise ValueError(f"Unknown ASR provider: {name}")
        return provider_cls(settings)

Why async? The video router endpoint is already async def. The existing transcribe_full is sync (blocking), which blocks the event loop during 30-60s API calls. New providers should be async. Existing DashScope can be wrapped in loop.run_in_executor() temporarily.
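
The temporary wrapping mentioned above could look roughly like the sketch below. It uses only the standard library; `blocking_transcribe` is a hypothetical stand-in for the existing sync DashScope call, not the project's real function. The small `demo` coroutine is only there to show that other work on the event loop keeps running while the blocking call executes in a worker thread.

```python
import asyncio
import time


def blocking_transcribe(audio_bytes: bytes, language: str) -> str:
    """Hypothetical stand-in for the existing sync DashScope call."""
    time.sleep(0.05)  # simulates a slow, blocking API request
    return f"transcript({len(audio_bytes)} bytes, {language})"


async def transcribe_async(audio_bytes: bytes, language: str) -> str:
    """Run the blocking call in the default thread pool so the loop stays free."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, blocking_transcribe, audio_bytes, language)


async def demo() -> tuple[str, int]:
    """Transcribe while a ticker coroutine keeps running on the same loop."""
    ticks = 0

    async def ticker() -> None:
        nonlocal ticks
        while True:
            ticks += 1
            await asyncio.sleep(0.01)

    task = asyncio.create_task(ticker())
    text = await transcribe_async(b"\x00" * 16, "yue")
    task.cancel()  # stop the ticker once the transcript is back
    return text, ticks
```

If the ticker count stays at zero in a variant of this sketch, the call is still blocking the loop, which is the symptom the async refactor is meant to remove.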

3.4 Existing Provider Pattern (LLMClient)

The codebase already has a provider-switching pattern in llm_client.py: single-class conditional branching, not an ABC/interface:

# llm_client.py pattern:
if settings.vllm_engine:
    extra_body = {"chat_template_kwargs": {"enable_thinking": False}}
else:
    extra_body = {"reasoning": {"enabled": False}}

For ASR, the same pattern would mean ASRClient checks settings.asr_provider to select the right SDK/URL. However, since DashScope and OpenRouter use fundamentally different APIs (DashScope = Chat Completions + audio input; OpenRouter = dedicated STT endpoint), the Factory+Strategy pattern (Section 3.3) is cleaner for ASR — each provider gets its own class implementing a common interface.
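
To make the contrast with conditional branching concrete, here is a minimal, self-contained sketch of the Factory + Strategy shape. `FakeSettings`, `StubDashScope`, `StubOpenRouter`, `PROVIDERS`, and `create_provider` are illustrative names assumed for this example; the stubs return fixed strings instead of calling any API.

```python
import asyncio
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class FakeSettings:
    """Hypothetical stand-in for the project's Settings object."""
    asr_provider: str = "dashscope"


class ASRProvider(ABC):
    @abstractmethod
    async def transcribe(self, audio_bytes: bytes, language: str) -> str: ...


class StubDashScope(ASRProvider):
    def __init__(self, settings: FakeSettings) -> None:
        self.settings = settings

    async def transcribe(self, audio_bytes: bytes, language: str) -> str:
        return "dashscope-result"  # a real provider would call Chat Completions


class StubOpenRouter(ASRProvider):
    def __init__(self, settings: FakeSettings) -> None:
        self.settings = settings

    async def transcribe(self, audio_bytes: bytes, language: str) -> str:
        return "openrouter-result"  # a real provider would call /audio/transcriptions


PROVIDERS = {"dashscope": StubDashScope, "openrouter": StubOpenRouter}


def create_provider(settings: FakeSettings) -> ASRProvider:
    """Look up the provider class by name; unknown names fail loudly."""
    try:
        provider_cls = PROVIDERS[settings.asr_provider]
    except KeyError:
        raise ValueError(f"Unknown ASR provider: {settings.asr_provider}") from None
    return provider_cls(settings)


async def transcribe_full(settings: FakeSettings, audio: bytes, language: str) -> str:
    """The caller never branches on the provider; it only sees the interface."""
    provider = create_provider(settings)
    return await provider.transcribe(audio, language)
```

The point of the shape is that adding a third provider means one new class and one registry entry, with no change to `transcribe_full`.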

3.5 OpenRouter SDK vs Raw httpx

| Trade-off | Raw httpx | OpenRouter SDK (pip install openrouter) |
| --- | --- | --- |
| Type safety | Manual | Pydantic models |
| Retry logic | Must implement (tenacity) | Built-in retries=RetryConfig(...) |
| Production readiness | Battle-tested | Beta (auto-generated from OpenAPI) |
| Dependencies | httpx (already installed) | SDK + Pydantic + extra deps |

Decision: Use raw httpx + tenacity for Phase 5. This matches the approach used by most production Python projects (lethe, openclaw) and avoids beta SDK risk. The official SDK can be adopted later if it stabilizes.

3.6 Retry & Error Handling

Based on production OpenRouter STT implementations (lethe, openrouter-proxy):

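import httpx
from tenacity import (
    retry, retry_if_exception, stop_after_attempt, wait_random_exponential
)

RETRIABLE_STATUS = {429, 500, 502, 503, 504}

def _is_retriable(exc: BaseException) -> bool:
    """Retry transport failures and only the HTTP statuses listed above."""
    if isinstance(exc, httpx.HTTPStatusError):
        # Other 4xx responses are client errors and must not be retried.
        return exc.response.status_code in RETRIABLE_STATUS
    # TransportError covers connection failures and timeouts.
    return isinstance(exc, httpx.TransportError)

@retry(
    reraise=True,
    stop=stop_after_attempt(4),
    wait=wait_random_exponential(multiplier=0.2, max=3.0),
    retry=retry_if_exception(_is_retriable),
)
async def _call_stt_api(self, audio_b64: str, language: str) -> dict:
    """Call OpenRouter STT with retry and exponential backoff."""
    ...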

Error categories to handle:

| Error | Meaning | Retry? |
| --- | --- | --- |
| httpx.HTTPStatusError (429) | Rate limited | Yes (backoff) |
| httpx.HTTPStatusError (5xx) | Server error | Yes (backoff) |
| httpx.HTTPStatusError (4xx, non-429) | Client error | No |
| httpx.ConnectError | Connection failed | Yes |
| httpx.TimeoutException | Timeout (>120s) | Yes |
| Empty result["text"] | No transcription | No |

Note: tenacity is NOT currently in requirements.txt. Add it as a new dependency.
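
The retry decisions in the table can be expressed as a small pure function, which is easy to unit-test without any HTTP library. The sketch below is illustrative only: `should_retry` and `backoff_ceilings` are hypothetical helper names, and the backoff helper assumes that the randomized wait is drawn below a ceiling that starts at `multiplier`, doubles on each attempt, and is capped at `max`.

```python
from typing import Optional

RETRIABLE_STATUS = {429, 500, 502, 503, 504}


def should_retry(status_code: Optional[int], transport_error: bool = False) -> bool:
    """Mirror the table: retry transport failures and the listed statuses only."""
    if transport_error:  # connection failure or timeout, no HTTP status available
        return True
    return status_code in RETRIABLE_STATUS


def backoff_ceilings(attempts: int = 4, multiplier: float = 0.2, cap: float = 3.0) -> list:
    """Upper bound of each randomized wait (assumed to double per attempt)."""
    # Only attempts - 1 waits occur, because nothing follows the final attempt.
    return [min(cap, multiplier * 2 ** n) for n in range(attempts - 1)]
```

Under that assumption, four attempts with the settings above wait at most 0.2 s, 0.4 s, and 0.8 s, so retries add well under two seconds on top of the request time itself.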

3.7 API Differences

| Aspect | DashScope | OpenRouter |
| --- | --- | --- |
| Endpoint | https://dashscope-intl.aliyuncs.com/compatible-mode/v1 | https://openrouter.ai/api/v1/audio/transcriptions |
| Method | Chat Completions (POST /chat/completions) | Dedicated STT (POST /audio/transcriptions) |
| Audio format | data:audio/wav;base64,... (data URL) | {"data": "<base64>", "format": "wav"} (raw base64) |
| Auth | DASHSCOPE_API_KEY | OPENROUTER_API_KEY (separate key for accounting flexibility) |
| Response | choices[0].message.content | {"text": "...", "usage": {...}} (no segments/timestamps/speaker labels) |
| SDK | openai.OpenAI | httpx.AsyncClient (no official SDK needed) |
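
The audio-format difference is the easiest one to get wrong, so a side-by-side sketch may help. The two helper names below are hypothetical; the shapes follow the "Audio format" entries above.

```python
import base64


def dashscope_audio_field(audio_bytes: bytes) -> str:
    """DashScope-style input: a data URL with the base64 payload embedded."""
    b64 = base64.b64encode(audio_bytes).decode("ascii")
    return f"data:audio/wav;base64,{b64}"


def openrouter_audio_field(audio_bytes: bytes) -> dict:
    """OpenRouter-style input: raw base64 plus an explicit format tag."""
    b64 = base64.b64encode(audio_bytes).decode("ascii")
    return {"data": b64, "format": "wav"}
```

For the four bytes `b"RIFF"`, the first helper yields `data:audio/wav;base64,UklGRg==` and the second yields `{"data": "UklGRg==", "format": "wav"}`; the base64 text is identical and only the wrapping differs.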

4. Configuration

4.1 New Env Vars

| Variable | Default | Description |
| --- | --- | --- |
| ASR_PROVIDER | dashscope | ASR provider: dashscope or openrouter |
| OPENROUTER_API_KEY | "" | OpenRouter API key (for STT; separate from LLM_API_KEY for accounting) |
| ASR_OPENROUTER_MODEL | google/gemini-3.1-flash-lite | OpenRouter STT model name |

4.2 Settings Changes

Add to Settings class in config.py:

# ASR provider (Phase 5)
asr_provider: str = "dashscope"  # "dashscope" or "openrouter"
openrouter_api_key: str = ""     # separate from llm_api_key for accounting
asr_openrouter_model: str = "google/gemini-3.1-flash-lite"

Note: OpenRouter STT uses:

  • openrouter_api_key — dedicated key (user preference for separate accounting)
  • llm_base_url: https://openrouter.ai/api/v1 (base URL; the STT path /audio/transcriptions is appended)

4.3 Validation

Add a startup validation in config.py or asr_client.py:

VALID_ASR_PROVIDERS = {"dashscope", "openrouter"}
if settings.asr_provider not in VALID_ASR_PROVIDERS:
    raise ValueError(f"Invalid ASR_PROVIDER: {settings.asr_provider}. Must be one of {VALID_ASR_PROVIDERS}")
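
As a standalone illustration of the same check, the sketch below reads the variable from an environment mapping and applies the default. `read_asr_provider` is a hypothetical helper, and the whitespace/case normalization is an assumption added for the example, not something the plan specifies.

```python
import os
from typing import Mapping, Optional

VALID_ASR_PROVIDERS = ("dashscope", "openrouter")


def read_asr_provider(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the configured provider (default: dashscope) or raise ValueError."""
    env = os.environ if env is None else env
    # Normalization is an assumption of this sketch, so " OpenRouter " is accepted.
    provider = env.get("ASR_PROVIDER", "dashscope").strip().lower()
    if provider not in VALID_ASR_PROVIDERS:
        allowed = ", ".join(VALID_ASR_PROVIDERS)
        raise ValueError(f"Invalid ASR_PROVIDER: {provider!r}. Must be one of: {allowed}")
    return provider
```

Passing the mapping explicitly keeps the check testable without touching the real process environment.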

5. Implementation Tasks

Task 5.1: Add config vars and validation

File: backend/app/core/config.py

  • Add asr_provider: str = "dashscope"
  • Add openrouter_api_key: str = ""
  • Add asr_openrouter_model: str = "google/gemini-3.1-flash-lite"
  • Add model_config validation or runtime check in get_settings()

Test file: backend/app/test/test_phase5_config.py

Task 5.2: Create OpenRouter ASR provider

File: backend/app/services/asr_providers.py (new)

class OpenRouterASRProvider(ASRProvider):
    def __init__(self, api_key: str, base_url: str, model: str):
        self.api_key = api_key
        # STT endpoint: base_url + /audio/transcriptions
        self.stt_url = f"{base_url.rstrip('/')}/audio/transcriptions"
        self.model = model
        self._client: httpx.AsyncClient | None = None

    async def transcribe(self, audio_bytes: bytes, language: str) -> str:
        """Transcribe using OpenRouter STT endpoint."""
        ...

OpenRouter STT Request:

import base64
import httpx

# Body of OpenRouterASRProvider.transcribe(); `client` is an open httpx.AsyncClient.
audio_b64 = base64.b64encode(audio_bytes).decode("ascii")

payload = {
    "model": self.model,
    "input_audio": {
        "data": audio_b64,       # raw base64, NOT data URL
        "format": "wav",
    },
}
if language and language != "auto":
    payload["language"] = language

response = await client.post(
    self.stt_url,
    headers={
        "Authorization": f"Bearer {self.api_key}",
        "Content-Type": "application/json",
    },
    json=payload,
    timeout=120.0,  # 60s upstream timeout + buffer
)
response.raise_for_status()
result = response.json()
return _to_traditional(result["text"])

Key design notes:

  • Uses httpx.AsyncClient (already in requirements.txt)
  • Base64 format: raw base64 string, NOT data:audio/wav;base64,... (DashScope uses a data URL; OpenRouter wants raw base64)
  • Timeout: 120s (OpenRouter docs say 60s upstream timeout; add buffer)
  • Error handling: raise custom ASRError on HTTP errors, network errors, or empty response text
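
The empty-response rule in the last note could be handled by a small parsing helper along these lines. This is a sketch under assumptions: `ASRError` here is a local stand-in for the project's custom error type, and `extract_text` is a hypothetical name; in the real provider the returned text would still pass through `_to_traditional`.

```python
class ASRError(Exception):
    """Local stand-in for the project's custom ASR error type."""


def extract_text(result: dict) -> str:
    """Pull the transcript out of an STT response, rejecting empty results."""
    text = result.get("text")
    # A missing key, a non-string value, and whitespace-only text all count as empty.
    if not isinstance(text, str) or not text.strip():
        raise ASRError("OpenRouter returned no transcription text")
    return text.strip()
```

Using `result.get("text")` rather than `result["text"]` turns a malformed response into the same `ASRError` as an empty one, instead of leaking a `KeyError` to the router.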

Test file: backend/app/test/test_phase5_openrouter_provider.py

Task 5.3: Refactor ASRClient to use provider abstraction

File: backend/app/services/asr_client.py

Changes:

  1. Define ASRProvider protocol (or ABC)
  2. Extract existing DashScope logic into DashScopeASRProvider (sync wrapper for now)
  3. ASRClient.__init__ selects provider based on settings.asr_provider
  4. ASRClient.transcribe_full delegates to provider
  5. Make transcribe_full async (minor refactor to video.py router)

Backward compatibility: Default asr_provider=dashscope means zero behavior change for existing deployments.

Test file: backend/app/test/test_phase2_asr_client.py — update existing tests to work with new provider structure; add tests for provider switching.

Task 5.4: Update video router for async transcription

File: backend/app/routers/video.py

Minimal change — the asr.transcribe_full() call becomes await asr.transcribe_full():

# Before (line 113):
text = asr.transcribe_full(audio_bytes, language=language)

# After:
text = await asr.transcribe_full(audio_bytes, language=language)

No other changes needed. The endpoint signature is already async def.

Task 5.5: Update .env.example and config documentation

File: backend/.env.example

  • Add ASR_PROVIDER, OPENROUTER_API_KEY, and ASR_OPENROUTER_MODEL with usage comments

File: AGENTS.md or development plan

  • Note the new Phase 5 capability

Task 5.6: Integration test (mock OpenRouter HTTP)

File: backend/app/test/test_phase5_integration.py

  • Test full flow: video upload → transcribe with ASR_PROVIDER=openrouter → verify text
  • Mock httpx.AsyncClient.post to return valid OpenRouter STT response

Task 5.7: Acceptance test (real OpenRouter)

File: backend/app/test/acceptance/test_acceptance_phase5_openrouter.py

  • Real OpenRouter API call with a short test audio file
  • Verify transcription quality
  • Marked @pytest.mark.acceptance and @pytest.mark.slow

6. Realtime ASR (Out of Scope — Future Consideration)

OpenRouter has no WebSocket STT endpoint. For realtime streaming, we have two possible future approaches:

  1. Chunked REST: Send accumulated audio chunks to OpenRouter REST endpoint periodically. Latency ~2-5s per chunk. Not true realtime.
  2. Stay DashScope-only: Realtime WebSocket stays on DashScope. Users who need realtime must have a DashScope key.

For Phase 5, realtime streaming stays DashScope-only. The ws_asr.py WebSocket endpoint is not modified. If ASR_PROVIDER=openrouter is set, it affects only the batch transcription path. The WebSocket endpoint continues to check dashscope_api_key independently.
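
The key requirements described above can be summarized in a few lines. The sketch below is hypothetical (`KeySettings` and `missing_key_for` are names invented for the example) and only models which API key each path needs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class KeySettings:
    """Hypothetical subset of Settings relevant to API key checks."""
    asr_provider: str = "dashscope"
    dashscope_api_key: str = ""
    openrouter_api_key: str = ""


def missing_key_for(path: str, settings: KeySettings) -> Optional[str]:
    """Return the env var that is missing for this path, or None if all is set."""
    # Realtime always needs DashScope; batch needs it only when it is the provider.
    if path == "realtime" or settings.asr_provider == "dashscope":
        return None if settings.dashscope_api_key else "DASHSCOPE_API_KEY"
    return None if settings.openrouter_api_key else "OPENROUTER_API_KEY"
```

With `ASR_PROVIDER=openrouter` and only an OpenRouter key configured, the batch path is satisfied while the realtime path still reports `DASHSCOPE_API_KEY` as missing.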


7. Test Plan

| Test File | What It Covers | Mock Strategy |
| --- | --- | --- |
| test_phase5_config.py | Config validation, invalid provider rejection | No mocks (pure config) |
| test_phase5_openrouter_provider.py | OpenRouterASRProvider unit tests | Mock httpx.AsyncClient |
| test_phase2_asr_client.py (updated) | ASRClient with both providers | Mock DashScope + OpenRouter |
| test_phase5_integration.py | Full video→transcribe with OpenRouter | Mock httpx (TestClient) |
| test_acceptance_phase5_openrouter.py | Real OpenRouter API | None (real API) |

Test-first rule: Write tests BEFORE implementation (per AGENTS.md convention). Each implementation task references its test file.
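
A mocked-client unit test in the spirit of the "Mock httpx.AsyncClient" strategy might look like the sketch below. It uses only `unittest.mock` from the standard library; `transcribe_with` is a hypothetical, stripped-down provider call written for the example, and the URL is a placeholder.

```python
import asyncio
from unittest.mock import AsyncMock, MagicMock


async def transcribe_with(client, url: str, payload: dict) -> str:
    """Hypothetical minimal provider call: POST, check status, read text."""
    response = await client.post(url, json=payload)
    response.raise_for_status()
    return response.json()["text"]


def test_transcribe_returns_text() -> None:
    # json() and raise_for_status() are sync on the response, so MagicMock fits.
    response = MagicMock()
    response.json.return_value = {"text": "測試", "usage": {}}
    client = MagicMock()
    client.post = AsyncMock(return_value=response)  # post() is awaited

    text = asyncio.run(
        transcribe_with(client, "https://example.invalid/stt", {"model": "m"})
    )

    assert text == "測試"
    client.post.assert_awaited_once()
    response.raise_for_status.assert_called_once()
```

Mixing `AsyncMock` for the awaited call with `MagicMock` for the response avoids the common mistake of making `json()` return a coroutine.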


8. Acceptance Criteria

  • ASR_PROVIDER=openrouter in .env → batch transcription uses OpenRouter STT
  • ASR_PROVIDER=dashscope (default) → same behavior as before (backward compat)
  • Invalid ASR_PROVIDER value → clear error at startup
  • Realtime WebSocket ASR is unaffected (still DashScope-only, independent of ASR_PROVIDER)
  • OpenRouter transcription returns traditional Chinese (same _to_traditional conversion)
  • Error handling: network errors, HTTP errors, empty responses → clear error messages
  • All existing tests pass unchanged (with ASR_PROVIDER=dashscope)
  • New tests pass
  • Acceptance test returns valid transcription from real OpenRouter

9. Dependencies & Risks

| Risk | Mitigation |
| --- | --- |
| OpenRouter STT latency > DashScope | Acceptable tradeoff; OpenRouter is cheaper and uses existing API key |
| OpenRouter STT not as accurate for Cantonese | Need acceptance testing to verify; may need different model |
| transcribe_full sync→async refactor could break callers | Only one caller (video.py); minimal blast radius |
| No streaming/WebSocket for OpenRouter | Documented as limitation; realtime stays DashScope |
| OpenRouter 60s timeout for long videos | Document limitation; large files may need chunking (future) |

10. Estimated Effort

| Task | Est. Time |
| --- | --- |
| 5.1 Config | 15 min |
| 5.2 OpenRouter provider | 30 min |
| 5.3 Refactor ASRClient | 20 min |
| 5.4 Update video router | 5 min |
| 5.5 Update .env.example | 5 min |
| 5.6 Integration test | 20 min |
| 5.7 Acceptance test | 15 min |
| Total | ~2 hours |

11. Implementation Notes (2026-05-19)

Decisions During Implementation

  • _to_traditional moved to asr_providers.py: the original plan placed it in asr_client.py with a cross-import, but this caused a circular import (asr_client → asr_providers → asr_client). It now lives in asr_providers.py, and asr_client.py re-exports it for backward compatibility with ws_asr.py.
  • Separate OPENROUTER_API_KEY — per user preference for independent accounting.
  • DashScopeASRProvider wraps sync OpenAI call in loop.run_in_executor() — avoids blocking the event loop without rewriting the existing DashScope client.

Files Changed

| File | Action | Details |
| --- | --- | --- |
| backend/app/core/config.py | Modified | 3 new settings + validation in get_settings() |
| backend/app/services/asr_providers.py | New | ASRProvider ABC, DashScopeASRProvider, OpenRouterASRProvider (with tenacity retry), create_asr_provider() factory |
| backend/app/services/asr_client.py | Refactored | Thin wrapper; transcribe_full now async; utility functions retained |
| backend/app/routers/video.py | Modified | await transcribe_full(); provider-aware API key validation |
| backend/.env.example | Modified | Phase 5 vars with usage comments |
| backend/requirements.txt | Modified | Added tenacity>=8.0.0 |

Test Files

| File | Tests | Status |
| --- | --- | --- |
| test_phase5_config.py | 6 | Pass |
| test_phase5_openrouter_provider.py | 14 | Pass |
| test_phase5_integration.py | 4 | Pass |
| test_phase2_asr_client.py | 19 (3 updated) | Pass |
| test_phase2_full_transcript.py | 6 (updated fixtures) | Pass |
| test_integration_phase2.py | 7 (updated fixtures) | Pass |

Pre-existing Test Failures (Unrelated)

  • Phase 3: test_phase3_history_service.py, test_phase3_prompt_injection.py, test_phase3_prompt_service.py, test_phase3_prompts_router.py — pre-existing failures in SQLite/prompt tests unrelated to ASR changes.
  • Phase 1: 1 config test — pre-existing, unrelated.