Phase 5: OpenRouter ASR Provider
Date: 2026-05-18
Status: ✅ Implemented (2026-05-19)
Source: User request — add OpenRouter STT (google/gemini-3.1-flash-lite) as alternative ASR provider
Research: OpenRouter STT docs + librarian agent (real-world code patterns) + explore agent (codebase architecture map)
Test Results: 56/56 ASR tests pass (Phase 2 + Phase 5)
1. Objective
Add OpenRouter as a second ASR provider for batch transcription (transcribe_full). The realtime WebSocket streaming mode remains DashScope-only because OpenRouter has no WebSocket STT endpoint.
Users select the provider via a single env var. The existing REST endpoint POST /api/v1/video/{video_id}/transcribe and the WebSocket endpoint /ws/asr/{video_id} are unchanged from the frontend's perspective.
2. Scope
| In Scope | Out of Scope |
|---|---|
| OpenRouter batch transcription (transcribe_full) | Realtime WebSocket streaming (DashScope-only) |
| ASR_PROVIDER env var switching | Frontend provider selector UI |
| Provider abstraction (protocol class) | Changing existing DashScope code behavior |
| Tests for new provider | Retraining/changing models |
| .env.example update | Docker image rebuild |
3. Architecture
3.1 Current Flow (DashScope-only)
POST /api/v1/video/{video_id}/transcribe
  → video.py router
    → VideoService.extract_audio() → WAV bytes
    → ASRClient(settings).transcribe_full(audio_bytes, language)
      → OpenAI SDK → DashScope Chat Completions API (audio input)
    → return text
3.2 New Flow (Provider-based)
POST /api/v1/video/{video_id}/transcribe
  → video.py router
    → VideoService.extract_audio() → WAV bytes
    → ASRClient(settings).transcribe_full(audio_bytes, language)
        ├── ASR_PROVIDER=dashscope → DashScopeASRProvider (existing logic)
        └── ASR_PROVIDER=openrouter → OpenRouterASRProvider (new)
    → return text
3.3 Provider Interface (Factory + Strategy Pattern)
Based on real-world multi-provider ASR patterns (DocsGPT, LiveKit, openai-agents-python), use Factory + Strategy:
from abc import ABC, abstractmethod

class ASRProvider(ABC):
    """Abstract base for all ASR providers."""

    @abstractmethod
    async def transcribe(self, audio_bytes: bytes, language: str) -> str:
        """Transcribe audio bytes to traditional Chinese text.

        Raises ASRError on any failure (network, HTTP, empty response).
        """
        ...

class ASRProviderFactory:
    """Selects ASR provider based on settings."""

    # Registry mapping provider name -> provider class
    _providers: dict[str, type[ASRProvider]] = {}

    @classmethod
    def register(cls, name: str, provider_cls: type[ASRProvider]) -> None:
        cls._providers[name] = provider_cls

    @classmethod
    def create(cls, name: str, settings) -> ASRProvider:
        provider_cls = cls._providers.get(name)
        if not provider_cls:
            raise ValueError(f"Unknown ASR provider: {name}")
        return provider_cls(settings)
Why async? The video router endpoint is already async def. The existing transcribe_full is sync (blocking), which blocks the event loop during 30-60s API calls. New providers should be async. Existing DashScope can be wrapped in loop.run_in_executor() temporarily.
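The temporary wrapping described above can be sketched as follows. This is a minimal illustration rather than the project's code: blocking_transcribe is a hypothetical stand-in for the existing sync DashScope call, and transcribe_async and the ticker task are illustrative names used only to show that the event loop keeps running while the blocking call executes in a worker thread.

```python
import asyncio
import functools
import time


def blocking_transcribe(audio_bytes: bytes, language: str) -> str:
    """Hypothetical stand-in for the existing sync DashScope call."""
    time.sleep(0.05)  # simulates a slow, blocking network request
    return f"transcript({len(audio_bytes)} bytes, lang={language})"


async def transcribe_async(audio_bytes: bytes, language: str) -> str:
    """Run the blocking call in the default thread pool so the loop stays free."""
    loop = asyncio.get_running_loop()
    call = functools.partial(blocking_transcribe, audio_bytes, language)
    return await loop.run_in_executor(None, call)


async def main() -> tuple[str, int]:
    ticks = 0

    async def ticker() -> None:
        # Counts loop iterations that happen while transcription is running.
        nonlocal ticks
        while True:
            ticks += 1
            await asyncio.sleep(0.005)

    task = asyncio.create_task(ticker())
    text = await transcribe_async(b"\x00" * 16, "yue")
    task.cancel()
    return text, ticks


result_text, loop_ticks = asyncio.run(main())
```

If blocking_transcribe were called directly inside main(), the ticker would not advance until the call returned, which is the event-loop stall the paragraph above warns about.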
3.4 Existing Provider Pattern (LLMClient)
The codebase already has a provider-switching pattern in llm_client.py — single-class conditional branching, not ABC/interface:
# llm_client.py pattern:
if settings.vllm_engine:
    extra_body = {"chat_template_kwargs": {"enable_thinking": False}}
else:
    extra_body = {"reasoning": {"enabled": False}}
For ASR, the same pattern would mean ASRClient checks settings.asr_provider to select the right SDK/URL. However, since DashScope and OpenRouter use fundamentally different APIs (DashScope = Chat Completions + audio input; OpenRouter = dedicated STT endpoint), the Factory+Strategy pattern (Section 3.3) is cleaner for ASR — each provider gets its own class implementing a common interface.
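To make the contrast with conditional branching concrete, here is a self-contained sketch of how the Section 3.3 factory might be wired up. The two stub providers and the dict-based settings object are hypothetical placeholders; real providers would call the DashScope and OpenRouter APIs.

```python
import asyncio
from abc import ABC, abstractmethod


class ASRProvider(ABC):
    """Common interface, mirroring the Section 3.3 definition."""

    @abstractmethod
    async def transcribe(self, audio_bytes: bytes, language: str) -> str: ...


class ASRProviderFactory:
    _providers: dict[str, type[ASRProvider]] = {}

    @classmethod
    def register(cls, name: str, provider_cls: type[ASRProvider]) -> None:
        cls._providers[name] = provider_cls

    @classmethod
    def create(cls, name: str, settings: dict) -> ASRProvider:
        provider_cls = cls._providers.get(name)
        if not provider_cls:
            raise ValueError(f"Unknown ASR provider: {name}")
        return provider_cls(settings)


class StubDashScopeProvider(ASRProvider):
    """Hypothetical stub; a real provider would call the DashScope API."""

    def __init__(self, settings: dict) -> None:
        self.settings = settings

    async def transcribe(self, audio_bytes: bytes, language: str) -> str:
        return f"dashscope:{language}:{len(audio_bytes)}"


class StubOpenRouterProvider(ASRProvider):
    """Hypothetical stub; a real provider would call the OpenRouter STT endpoint."""

    def __init__(self, settings: dict) -> None:
        self.settings = settings

    async def transcribe(self, audio_bytes: bytes, language: str) -> str:
        return f"openrouter:{language}:{len(audio_bytes)}"


# Each provider registers under the name used in ASR_PROVIDER.
ASRProviderFactory.register("dashscope", StubDashScopeProvider)
ASRProviderFactory.register("openrouter", StubOpenRouterProvider)

settings = {"asr_provider": "openrouter"}  # hypothetical settings object
provider = ASRProviderFactory.create(settings["asr_provider"], settings)
text = asyncio.run(provider.transcribe(b"\x00\x01", "yue"))
```

The caller never branches on the provider name; adding a third provider would mean one new class and one register() call.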
3.5 OpenRouter SDK vs Raw httpx
| Trade-off | Raw httpx | OpenRouter SDK (pip install openrouter) |
|---|---|---|
| Type safety | Manual | Pydantic models |
| Retry logic | Must implement (tenacity) | Built-in retries=RetryConfig(...) |
| Production readiness | Battle-tested | Beta (auto-generated from OpenAPI) |
| Dependencies | httpx (already installed) | SDK + Pydantic + extra deps |
Decision: Use raw httpx + tenacity for Phase 5. This matches the approach used by most production Python projects (lethe, openclaw) and avoids beta SDK risk. The official SDK can be adopted later if it stabilizes.
3.6 Retry & Error Handling
Based on production OpenRouter STT implementations (lethe, openrouter-proxy):
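import httpx
from tenacity import (
    retry, retry_if_exception, stop_after_attempt, wait_random_exponential,
)

RETRIABLE_STATUS = {429, 500, 502, 503, 504}

def _is_retriable(exc: BaseException) -> bool:
    """Retry transport failures and retriable HTTP statuses; other 4xx fail fast."""
    if isinstance(exc, httpx.HTTPStatusError):
        return exc.response.status_code in RETRIABLE_STATUS
    # ConnectError and TimeoutException are both subclasses of TransportError
    return isinstance(exc, httpx.TransportError)

@retry(
    reraise=True,
    stop=stop_after_attempt(4),
    wait=wait_random_exponential(multiplier=0.2, max=3.0),
    retry=retry_if_exception(_is_retriable),
)
async def _call_stt_api(self, audio_b64: str, language: str) -> dict:
    """Call OpenRouter STT with retry and exponential backoff."""
    ...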
Error categories to handle:
| Error | Response | Retry? |
|---|---|---|
| httpx.HTTPStatusError (429) | Rate limited | Yes (backoff) |
| httpx.HTTPStatusError (5xx) | Server error | Yes (backoff) |
| httpx.HTTPStatusError (4xx, non-429) | Client error | No |
| httpx.ConnectError | Connection failed | Yes |
| httpx.TimeoutException | Timeout (>120s) | Yes |
| Empty result["text"] | No transcription | No |
Note: tenacity is NOT currently in requirements.txt. Add it as a new dependency.
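The retry table above can be read as a small decision function plus a backoff loop. The sketch below uses only the standard library so that it runs without httpx or tenacity installed; FakeHTTPError, should_retry, call_with_backoff, and the delay values are hypothetical, and the plan's actual implementation relies on tenacity.

```python
import asyncio
import random

RETRIABLE_STATUS = {429, 500, 502, 503, 504}


class FakeHTTPError(Exception):
    """Hypothetical stand-in for httpx.HTTPStatusError."""

    def __init__(self, status_code: int) -> None:
        super().__init__(f"HTTP {status_code}")
        self.status_code = status_code


def should_retry(exc: Exception) -> bool:
    """Retry rate limits, server errors, and connection/timeout failures only."""
    if isinstance(exc, FakeHTTPError):
        return exc.status_code in RETRIABLE_STATUS
    return isinstance(exc, (ConnectionError, TimeoutError))


async def call_with_backoff(func, attempts: int = 4, base: float = 0.01, cap: float = 0.05):
    """Await func() up to `attempts` times with capped, jittered exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return await func()
        except Exception as exc:
            if attempt == attempts or not should_retry(exc):
                raise
            # Sleep a random amount up to the capped exponential delay.
            await asyncio.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))


calls = {"n": 0}


async def flaky() -> dict:
    # Fails twice with a retriable status, then succeeds.
    calls["n"] += 1
    if calls["n"] < 3:
        raise FakeHTTPError(503)
    return {"text": "ok"}


result = asyncio.run(call_with_backoff(flaky))
```

A non-retriable error such as FakeHTTPError(400) is re-raised on the first attempt, matching the "Client error | No" row.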
3.7 API Differences
| | DashScope | OpenRouter |
|---|---|---|
| Endpoint | https://dashscope-intl.aliyuncs.com/compatible-mode/v1 | https://openrouter.ai/api/v1/audio/transcriptions |
| Method | Chat Completions (POST /chat/completions) | Dedicated STT (POST /audio/transcriptions) |
| Audio format | data:audio/wav;base64,... (data URL) | {"data": "<base64>", "format": "wav"} (raw base64) |
| Auth | DASHSCOPE_API_KEY | OPENROUTER_API_KEY (separate key for accounting flexibility) |
| Response | choices[0].message.content | {"text": "...", "usage": {...}} (no segments/timestamps/speaker labels) |
| SDK | openai.OpenAI | httpx.AsyncClient (no official SDK needed) |
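The audio-format row is the difference most likely to cause a silent failure, so a short sketch of both encodings may help. The function names are hypothetical; the field shapes follow the table above.

```python
import base64


def dashscope_audio_field(audio_bytes: bytes) -> str:
    """Data URL form: MIME prefix followed by the base64 payload."""
    b64 = base64.b64encode(audio_bytes).decode("ascii")
    return f"data:audio/wav;base64,{b64}"


def openrouter_audio_field(audio_bytes: bytes) -> dict:
    """Raw base64 plus an explicit format field, with no data: prefix."""
    b64 = base64.b64encode(audio_bytes).decode("ascii")
    return {"data": b64, "format": "wav"}


sample = b"RIFF"  # first four bytes of any WAV file
data_url = dashscope_audio_field(sample)
raw = openrouter_audio_field(sample)
```

Both forms carry the same base64 string; only the wrapper differs, which is why reusing the DashScope helper for OpenRouter would send a malformed payload.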
4. Configuration
4.1 New Env Vars
| Variable | Default | Description |
|---|---|---|
| ASR_PROVIDER | dashscope | ASR provider: dashscope or openrouter |
| OPENROUTER_API_KEY | "" | OpenRouter API key (for STT; separate from LLM_API_KEY for accounting) |
| ASR_OPENROUTER_MODEL | google/gemini-3.1-flash-lite | OpenRouter STT model name |
4.2 Settings Changes
Add to Settings class in config.py:
# ASR provider (Phase 5)
asr_provider: str = "dashscope" # "dashscope" or "openrouter"
openrouter_api_key: str = "" # separate from llm_api_key for accounting
asr_openrouter_model: str = "google/gemini-3.1-flash-lite"
Note: OpenRouter STT uses:
- openrouter_api_key: dedicated key (user preference for separate accounting)
- llm_base_url: https://openrouter.ai/api/v1 (base URL; the STT endpoint /audio/transcriptions is appended)
4.3 Validation
Add a startup validation in config.py or asr_client.py:
VALID_ASR_PROVIDERS = {"dashscope", "openrouter"}
if settings.asr_provider not in VALID_ASR_PROVIDERS:
    raise ValueError(f"Invalid ASR_PROVIDER: {settings.asr_provider}. Must be one of {VALID_ASR_PROVIDERS}")
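For unit tests it can help to isolate this check in a pure function that takes the environment as a mapping. The sketch below is one possible shape, not the project's code: resolve_asr_provider is a hypothetical name, and trimming whitespace and lowercasing the value are assumptions that the plan does not specify.

```python
from collections.abc import Mapping

VALID_ASR_PROVIDERS = {"dashscope", "openrouter"}


def resolve_asr_provider(env: Mapping[str, str]) -> str:
    """Read ASR_PROVIDER from a mapping, default to dashscope, and validate it."""
    # Assumption: values are normalized (trimmed, lowercased) before validation.
    value = env.get("ASR_PROVIDER", "dashscope").strip().lower()
    if value not in VALID_ASR_PROVIDERS:
        # Sort the allowed names so the error message is deterministic.
        allowed = ", ".join(sorted(VALID_ASR_PROVIDERS))
        raise ValueError(f"Invalid ASR_PROVIDER: {value!r}. Must be one of: {allowed}")
    return value


default_provider = resolve_asr_provider({})
chosen_provider = resolve_asr_provider({"ASR_PROVIDER": " OpenRouter "})
```

At startup the same function could be called with os.environ, so the config tests need no mocks.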
5. Implementation Tasks
Task 5.1: Add config vars and validation
File: backend/app/core/config.py
- Add asr_provider: str = "dashscope"
- Add openrouter_api_key: str = "" (per Section 4.2)
- Add asr_openrouter_model: str = "google/gemini-3.1-flash-lite"
- Add model_config validation or runtime check in get_settings()
Test file: backend/app/test/test_phase5_config.py
Task 5.2: Create OpenRouter ASR provider
File: backend/app/services/asr_providers.py (new)
class OpenRouterASRProvider:
    def __init__(self, api_key: str, base_url: str, model: str):
        self.api_key = api_key
        # STT endpoint: base_url + /audio/transcriptions
        self.stt_url = f"{base_url.rstrip('/')}/audio/transcriptions"
        self.model = model
        self._client: httpx.AsyncClient | None = None

    async def transcribe(self, audio_bytes: bytes, language: str) -> str:
        """Transcribe using OpenRouter STT endpoint."""
        ...
OpenRouter STT Request:
import base64
import httpx

# Body of OpenRouterASRProvider.transcribe(); `client` is an httpx.AsyncClient.
audio_b64 = base64.b64encode(audio_bytes).decode("ascii")
payload = {
    "model": self.model,
    "input_audio": {
        "data": audio_b64,  # raw base64, NOT data URL
        "format": "wav",
    },
}
# Only send a language hint when the caller specified one
if language and language != "auto":
    payload["language"] = language
response = await client.post(
    self.stt_url,
    headers={
        "Authorization": f"Bearer {self.api_key}",
        "Content-Type": "application/json",
    },
    json=payload,
    timeout=120.0,  # 60s upstream timeout + buffer
)
response.raise_for_status()
result = response.json()
return _to_traditional(result["text"])
Key design notes:
- Uses httpx.AsyncClient (already in requirements.txt)
- Base64 format: raw base64 string, NOT a data:audio/wav;base64,... data URL (DashScope uses a data URL; OpenRouter wants raw base64)
- Timeout: 120s (OpenRouter docs say 60s upstream timeout; add buffer)
- Error handling: raise custom ASRError on HTTP errors, network errors, or empty response text
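The empty-response case in the last note is easy to miss because the HTTP call itself succeeds. A minimal sketch of that check follows; extract_text is a hypothetical helper, the ASRError class here is a local stand-in for the project's exception, and stripping surrounding whitespace is an assumption.

```python
class ASRError(Exception):
    """Local stand-in for the project's ASR error type."""


def extract_text(result: dict) -> str:
    """Return the transcript from an STT response dict, or raise ASRError."""
    text = result.get("text")
    # Treat a missing key, a non-string value, and whitespace-only text as empty.
    if not isinstance(text, str) or not text.strip():
        raise ASRError("OpenRouter STT returned an empty transcription")
    return text.strip()


transcript = extract_text({"text": " 你好 ", "usage": {}})
```

In the request code above, this check would sit between response.json() and the _to_traditional() call.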
Test file: backend/app/test/test_phase5_openrouter_provider.py
Task 5.3: Refactor ASRClient to use provider abstraction
File: backend/app/services/asr_client.py
Changes:
- Define ASRProvider protocol (or ABC)
- Extract existing DashScope logic into DashScopeASRProvider (sync wrapper for now)
- ASRClient.__init__ selects provider based on settings.asr_provider
- ASRClient.transcribe_full delegates to provider
- Make transcribe_full async (minor refactor to video.py router)
Backward compatibility: Default asr_provider=dashscope means zero behavior change for existing deployments.
Test file: backend/app/test/test_phase2_asr_client.py — update existing tests to work with new provider structure; add tests for provider switching.
Task 5.4: Update video router for async transcription
File: backend/app/routers/video.py
Minimal change — the asr.transcribe_full() call becomes await asr.transcribe_full():
# Before (line 113):
text = asr.transcribe_full(audio_bytes, language=language)
# After:
text = await asr.transcribe_full(audio_bytes, language=language)
No other changes needed. The endpoint signature is already async def.
Task 5.5: Update .env.example and config documentation
File: backend/.env.example
- Add ASR_PROVIDER and ASR_OPENROUTER_MODEL comments
File: AGENTS.md or development plan
- Note the new Phase 5 capability
Task 5.6: Integration test (mock OpenRouter HTTP)
File: backend/app/test/test_phase5_integration.py
- Test full flow: video upload → transcribe with ASR_PROVIDER=openrouter → verify text
- Mock httpx.AsyncClient.post to return valid OpenRouter STT response
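The mocking approach can be sketched with unittest.mock alone. Here transcribe_with is a hypothetical helper standing in for the provider's request logic, and the fake client replaces httpx.AsyncClient, so the example runs without httpx or network access.

```python
import asyncio
from unittest.mock import AsyncMock, MagicMock


async def transcribe_with(client, stt_url: str, payload: dict) -> str:
    """Hypothetical helper: POST the payload and return the transcript text."""
    response = await client.post(stt_url, json=payload, timeout=120.0)
    response.raise_for_status()
    return response.json()["text"]


# json() and raise_for_status() are synchronous on a response, so MagicMock fits.
fake_response = MagicMock()
fake_response.json.return_value = {"text": "你好", "usage": {}}
fake_response.raise_for_status.return_value = None

# post() is awaited, so it has to be an AsyncMock.
fake_client = MagicMock()
fake_client.post = AsyncMock(return_value=fake_response)

text = asyncio.run(
    transcribe_with(
        fake_client,
        "https://openrouter.ai/api/v1/audio/transcriptions",
        {"model": "google/gemini-3.1-flash-lite"},
    )
)
```

In the real integration test the same AsyncMock could be installed with unittest.mock.patch on httpx.AsyncClient.post, and the recorded call arguments can then be checked for the expected URL, payload, and timeout.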
Task 5.7: Acceptance test (real OpenRouter)
File: backend/app/test/acceptance/test_acceptance_phase5_openrouter.py
- Real OpenRouter API call with a short test audio file
- Verify transcription quality
- Marked @pytest.mark.acceptance and @pytest.mark.slow
6. Realtime ASR (Out of Scope — Future Consideration)
OpenRouter has no WebSocket STT endpoint. For realtime streaming, we have two possible future approaches:
- Chunked REST: Send accumulated audio chunks to OpenRouter REST endpoint periodically. Latency ~2-5s per chunk. Not true realtime.
- Stay DashScope-only: Realtime WebSocket stays on DashScope. Users who need realtime must have a DashScope key.
For Phase 5, realtime streaming stays DashScope-only. The ws_asr.py WebSocket endpoint is not modified. If ASR_PROVIDER=openrouter is set, it affects only the batch transcription path. The WebSocket endpoint continues to check dashscope_api_key independently.
7. Test Plan
| Test File | What It Covers | Mock Strategy |
|---|---|---|
| test_phase5_config.py | Config validation, invalid provider rejection | No mocks (pure config) |
| test_phase5_openrouter_provider.py | OpenRouterASRProvider unit tests | Mock httpx.AsyncClient |
| test_phase2_asr_client.py (updated) | ASRClient with both providers | Mock DashScope + OpenRouter |
| test_phase5_integration.py | Full video→transcribe with OpenRouter | Mock httpx (TestClient) |
| test_acceptance_phase5_openrouter.py | Real OpenRouter API | None (real API) |
Test-first rule: Write tests BEFORE implementation (per AGENTS.md convention). Each implementation task references its test file.
8. Acceptance Criteria
- ASR_PROVIDER=openrouter in .env → batch transcription uses OpenRouter STT
- ASR_PROVIDER=dashscope (default) → same behavior as before (backward compat)
- Invalid ASR_PROVIDER value → clear error at startup
- Realtime WebSocket ASR is unaffected (still DashScope-only, independent of ASR_PROVIDER)
- OpenRouter transcription returns traditional Chinese (same _to_traditional conversion)
- Error handling: network errors, HTTP errors, empty responses → clear error messages
- All existing tests pass unchanged (with ASR_PROVIDER=dashscope)
- New tests pass
- Acceptance test returns valid transcription from real OpenRouter
9. Dependencies & Risks
| Risk | Mitigation |
|---|---|
| OpenRouter STT latency > DashScope | Acceptable tradeoff; OpenRouter is cheaper and uses existing API key |
| OpenRouter STT not as accurate for Cantonese | Need acceptance testing to verify; may need different model |
| transcribe_full sync→async refactor could break callers | Only one caller (video.py); minimal blast radius |
| No streaming/WebSocket for OpenRouter | Documented as limitation; realtime stays DashScope |
| OpenRouter 60s timeout for long videos | Document limitation; large files may need chunking (future) |
10. Estimated Effort
| Task | Est. Time |
|---|---|
| 5.1 Config | 15 min |
| 5.2 OpenRouter provider | 30 min |
| 5.3 Refactor ASRClient | 20 min |
| 5.4 Update video router | 5 min |
| 5.5 Update .env.example | 5 min |
| 5.6 Integration test | 20 min |
| 5.7 Acceptance test | 15 min |
| Total | ~2 hours |
11. Implementation Notes (2026-05-19)
Decisions During Implementation
- _to_traditional moved to asr_providers.py: the original plan placed it in asr_client.py with a cross-import, but this caused a circular import (asr_client → asr_providers → asr_client). It now lives in asr_providers.py; asr_client.py re-exports it for backward compatibility with ws_asr.py.
- Separate OPENROUTER_API_KEY: per user preference for independent accounting.
- DashScopeASRProvider wraps the sync OpenAI call in loop.run_in_executor(): this avoids blocking the event loop without rewriting the existing DashScope client.
Files Changed
| File | Action | Details |
|---|---|---|
| backend/app/core/config.py | Modified | 3 new settings + validation in get_settings() |
| backend/app/services/asr_providers.py | New | ASRProvider ABC, DashScopeASRProvider, OpenRouterASRProvider (with tenacity retry), create_asr_provider() factory |
| backend/app/services/asr_client.py | Refactored | Thin wrapper; transcribe_full now async; utility functions retained |
| backend/app/routers/video.py | Modified | await transcribe_full(); provider-aware API key validation |
| backend/.env.example | Modified | Phase 5 vars with usage comments |
| backend/requirements.txt | Modified | Added tenacity>=8.0.0 |
Test Files
| File | Tests | Status |
|---|---|---|
| test_phase5_config.py | 6 | ✅ |
| test_phase5_openrouter_provider.py | 14 | ✅ |
| test_phase5_integration.py | 4 | ✅ |
| test_phase2_asr_client.py | 19 (3 updated) | ✅ |
| test_phase2_full_transcript.py | 6 (updated fixtures) | ✅ |
| test_integration_phase2.py | 7 (updated fixtures) | ✅ |
Pre-existing Test Failures (Unrelated)
- Phase 3: test_phase3_history_service.py, test_phase3_prompt_injection.py, test_phase3_prompt_service.py, test_phase3_prompts_router.py: pre-existing failures in SQLite/prompt tests, unrelated to ASR changes.
- Phase 1: 1 config test: pre-existing, unrelated.