docs: update Phase 5 plan with realtime implementation and model fix notes
Document chunked REST realtime implementation, model change to google/chirp-3, language code handling, diagnostic logging, and updated acceptance criteria.

Ultraworked with Sisyphus

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
parent f44b68812d
commit c8bcfa0487
|
|
@@ -1,10 +1,11 @@
|
||||||
# Phase 5: OpenRouter ASR Provider
|
# Phase 5: OpenRouter ASR Provider
|
||||||
|
|
||||||
**Date:** 2026-05-18
|
**Date:** 2026-05-18
|
||||||
**Status:** ✅ Implemented (2026-05-19)
|
**Status:** ✅ Implemented (2026-05-19, updated 2026-05-19)
|
||||||
**Source:** User request — add OpenRouter STT (`google/gemini-3.1-flash-lite`) as alternative ASR provider
|
**Source:** User request — add OpenRouter STT as alternative ASR provider for both batch and realtime
|
||||||
**Research:** OpenRouter STT docs + librarian agent (real-world code patterns) + explore agent (codebase architecture map)
|
**Model:** `google/chirp-3` (changed from `google/gemini-3.1-flash-lite` — gemini-3.1-flash-lite is not an STT model; OpenRouter `/audio/transcriptions` supports 8 specific models)
|
||||||
**Test Results:** 56/56 ASR tests pass (Phase 2 + Phase 5)
|
**Research:** OpenRouter STT docs + librarian agent (real-world code patterns + model compatibility verification) + explore agent (codebase architecture map)
|
||||||
|
**Test Results:** 49/49 core ASR tests pass (Phase 2 + Phase 5); 6/7 WS tests pass (1 pre-existing timeout)
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
|
|
@@ -20,11 +21,12 @@ Users select the provider via a single env var. The existing REST endpoint `POST
|
||||||
|
|
||||||
| In Scope | Out of Scope |
|
| In Scope | Out of Scope |
|
||||||
|----------|-------------|
|
|----------|-------------|
|
||||||
| OpenRouter batch transcription (`transcribe_full`) | Realtime WebSocket streaming (DashScope-only) |
|
| OpenRouter batch transcription (`transcribe_full`) | Frontend provider selector UI |
|
||||||
| `ASR_PROVIDER` env var switching | Frontend provider selector UI |
|
| OpenRouter realtime WebSocket streaming (chunked REST, ~3s chunks) | True realtime streaming (no WebSocket STT endpoint exists) |
|
||||||
| Provider abstraction (protocol class) | Changing existing DashScope code behavior |
|
| `ASR_PROVIDER` env var switching (batch + realtime) | Changing existing DashScope code behavior |
|
||||||
| Tests for new provider | Retraining/changing models |
|
| Provider abstraction (protocol class) | Retraining/changing models |
|
||||||
| `.env.example` update | Docker image rebuild |
|
| Tests for new provider | Docker image rebuild |
|
||||||
|
| `.env.example` update | |
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
|
|
@@ -322,14 +324,56 @@ No other changes needed. The endpoint signature is already `async def`.
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## 6. Realtime ASR (Out of Scope — Future Consideration)
|
## 6. Realtime ASR (Chunked REST — Implemented)
|
||||||
|
|
||||||
OpenRouter has no WebSocket STT endpoint. For realtime streaming, we have two possible future approaches:
|
OpenRouter has no WebSocket STT endpoint. For realtime streaming, we implemented **chunked REST**: send accumulated audio chunks to OpenRouter REST endpoint every ~3 seconds.
|
||||||
|
|
||||||
1. **Chunked REST**: Send accumulated audio chunks to OpenRouter REST endpoint periodically. Latency ~2-5s per chunk. Not true realtime.
|
### 6.1 Implementation (`_ws_proxy_openrouter`)
|
||||||
2. **Stay DashScope-only**: Realtime WebSocket stays on DashScope. Users who need realtime must have a DashScope key.
|
|
||||||
|
|
||||||
For Phase 5, realtime streaming **stays DashScope-only**. The `ws_asr.py` WebSocket endpoint is not modified. If `ASR_PROVIDER=openrouter` is set, it affects only the batch transcription path. The WebSocket endpoint continues to check `dashscope_api_key` independently.
|
**File:** `backend/app/routers/ws_asr.py`
|
||||||
|
|
||||||
|
```python
async def _ws_proxy_openrouter(client_ws: WebSocket, language: str = "yue"):
    """WebSocket proxy for OpenRouter ASR: chunked REST approach.

    Accumulates PCM audio from DashScope VPR server, flushes chunks ~every 3s
    to OpenRouter REST API via pcm_to_wav() conversion.
    """
```
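The docstring above refers to `pcm_to_wav()`, whose body is not shown in this plan. A minimal sketch of such a conversion using only the standard library might look like the following. It assumes 16-bit mono PCM, which the plan does not state explicitly; only the `pcm_bytes` and `sample_rate=16000` parameters come from the plan.

```python
import io
import wave


def pcm_to_wav(pcm_bytes: bytes, sample_rate: int = 16000) -> bytes:
    """Wrap raw PCM samples in a WAV container held in memory.

    Assumption (not confirmed by the plan): samples are 16-bit, mono.
    """
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)        # assumed mono
        wav.setsampwidth(2)        # assumed 16-bit samples (2 bytes each)
        wav.setframerate(sample_rate)
        wav.writeframes(pcm_bytes)
    # The wave module does not close a file object it did not open itself,
    # so the buffer is still readable here.
    return buf.getvalue()
```

With these assumptions, one second of audio is 32,000 bytes of PCM plus the WAV header, so the output is slightly larger than the input.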
|
||||||
|
|
||||||
|
**Key design:**
|
||||||
|
- `pcm_to_wav(pcm_bytes, sample_rate=16000)` — converts raw PCM to WAV header + bytes
|
||||||
|
- `flush_lock` (asyncio.Lock) — prevents concurrent API calls during chunk flush
|
||||||
|
- ~3s chunk interval → calls OpenRouter `/audio/transcriptions` REST endpoint
|
||||||
|
- PCM accumulation: receives PCM frames from DashScope VPR server, appends to buffer
|
||||||
|
- On flush: converts accumulated PCM → WAV, sends to OpenRouter, emits `delta`/`full_text` events to client via WebSocket
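The accumulate-and-flush behaviour listed above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `ws_asr.py`: the class name `ChunkedTranscriber`, the `transcribe` callback, and the byte-threshold trigger (standing in for the ~3s interval, assuming 16 kHz 16-bit mono PCM) are all hypothetical.

```python
import asyncio
from typing import Awaitable, Callable, Optional

SAMPLE_RATE = 16000     # assumed sample rate (Hz)
BYTES_PER_SAMPLE = 2    # assumed 16-bit mono PCM
CHUNK_SECONDS = 3.0     # ~3s chunk interval from the plan


class ChunkedTranscriber:
    """Hypothetical helper: buffers PCM and flushes about every CHUNK_SECONDS of audio."""

    def __init__(self, transcribe: Callable[[bytes], Awaitable[str]]):
        self._transcribe = transcribe          # e.g. PCM -> WAV -> REST call
        self._buffer = bytearray()
        self._flush_lock = asyncio.Lock()      # serialises API calls
        self.full_text = ""

    @property
    def threshold(self) -> int:
        # 16000 samples/s * 2 bytes * 3 s = 96000 bytes per chunk
        return int(SAMPLE_RATE * BYTES_PER_SAMPLE * CHUNK_SECONDS)

    async def feed(self, pcm: bytes) -> Optional[str]:
        """Append a PCM frame; flush once enough audio has accumulated."""
        self._buffer.extend(pcm)
        if len(self._buffer) < self.threshold:
            return None
        return await self.flush()

    async def flush(self) -> Optional[str]:
        """Send the buffered audio for transcription and return the new text."""
        async with self._flush_lock:
            if not self._buffer:
                return None
            chunk = bytes(self._buffer)
            self._buffer.clear()               # cleared before awaiting the API
            delta = await self._transcribe(chunk)
            self.full_text += delta
            return delta
```

In this sketch the value returned by `flush()` plays the role of the `delta` event and `full_text` the role of the `full_text` event. Clearing the buffer before awaiting the API call means frames that arrive during the request start a fresh chunk rather than being sent twice.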
|
||||||
|
|
||||||
|
### 6.2 Provider Dispatch in ws_asr
|
||||||
|
|
||||||
|
The WebSocket endpoint dispatches based on `ASR_PROVIDER`:
|
||||||
|
|
||||||
|
```python
# ws_asr.py endpoint dispatch:
if settings.asr_provider == "openrouter":
    await _ws_proxy_openrouter(websocket, language)
else:
    await _ws_proxy_dashscope(websocket, loop, language)
```
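Because the `else` branch routes every value other than `"openrouter"` to DashScope, the dispatch relies on `ASR_PROVIDER` having been validated at startup. The plan places that validation in `get_settings()` but does not show it; a minimal sketch of such a check could look like this. The function name `validate_asr_provider` and the case-insensitive normalisation are assumptions made for illustration.

```python
SUPPORTED_ASR_PROVIDERS = ("dashscope", "openrouter")


def validate_asr_provider(value: str) -> str:
    """Return the normalised provider name, or raise a clear error.

    Hypothetical helper; normalising case and whitespace is an assumption.
    """
    normalized = value.strip().lower()
    if normalized not in SUPPORTED_ASR_PROVIDERS:
        allowed = ", ".join(SUPPORTED_ASR_PROVIDERS)
        raise ValueError(
            f"Invalid ASR_PROVIDER {value!r}; expected one of: {allowed}"
        )
    return normalized
```

Failing at startup keeps a typo such as `ASR_PROVIDER=openrouterr` from silently falling through to the DashScope branch.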
|
||||||
|
|
||||||
|
### 6.3 Language Code Handling
|
||||||
|
|
||||||
|
OpenRouter STT expects ISO 639-1 language codes. `yue` (ISO 639-3) is not supported — the chunked handler omits the language parameter when `language` is `"yue"` or `"auto"`, relying on auto-detection:
|
||||||
|
|
||||||
|
```python
if language and language not in ("auto", "yue"):
    payload["language"] = language
```
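The same filtering can be expressed as a small standalone function, which makes the rule easy to unit-test. This is a sketch only: the function name `build_transcription_payload` and the `model` field in the payload are assumptions, not names taken from the codebase.

```python
from typing import Dict, Optional

# Language hints that are omitted so the provider falls back to auto-detection.
OMITTED_LANGUAGE_HINTS = ("auto", "yue")


def build_transcription_payload(model: str, language: Optional[str]) -> Dict[str, str]:
    """Build request fields, adding `language` only when it is a usable hint."""
    payload = {"model": model}
    if language and language not in OMITTED_LANGUAGE_HINTS:
        payload["language"] = language
    return payload
```

For example, `build_transcription_payload("google/chirp-3", "yue")` yields a payload without a `language` key, while passing `"en"` includes it.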
|
||||||
|
|
||||||
|
### 6.4 Limitations
|
||||||
|
|
||||||
|
- **Latency**: ~3-5s delay per chunk (accumulation + API roundtrip). Not true realtime.
|
||||||
|
- **No incremental results**: Each chunk produces a full transcription, not word-by-word streaming.
|
||||||
|
- **DashScope VPR dependency**: The WebSocket still connects to DashScope's VPR server for audio capture; only the transcription API is swapped to OpenRouter.
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
|
|
@@ -349,15 +393,16 @@ For Phase 5, realtime streaming **stays DashScope-only**. The `ws_asr.py` WebSoc
|
||||||
|
|
||||||
## 8. Acceptance Criteria
|
## 8. Acceptance Criteria
|
||||||
|
|
||||||
- [ ] `ASR_PROVIDER=openrouter` in `.env` → batch transcription uses OpenRouter STT
|
- [x] `ASR_PROVIDER=openrouter` in `.env` → batch transcription uses OpenRouter STT
|
||||||
- [ ] `ASR_PROVIDER=dashscope` (default) → same behavior as before (backward compat)
|
- [x] `ASR_PROVIDER=dashscope` (default) → same behavior as before (backward compat)
|
||||||
- [ ] Invalid `ASR_PROVIDER` value → clear error at startup
|
- [x] Invalid `ASR_PROVIDER` value → clear error at startup
|
||||||
- [ ] Realtime WebSocket ASR is unaffected (still DashScope-only, independent of `ASR_PROVIDER`)
|
- [x] Realtime WebSocket ASR dispatches to OpenRouter chunked REST when `ASR_PROVIDER=openrouter`
|
||||||
- [ ] OpenRouter transcription returns traditional Chinese (same `_to_traditional` conversion)
|
- [x] Realtime WebSocket ASR stays DashScope when `ASR_PROVIDER=dashscope` (backward compat)
|
||||||
- [ ] Error handling: network errors, HTTP errors, empty responses → clear error messages
|
- [x] OpenRouter transcription returns traditional Chinese (same `_to_traditional` conversion)
|
||||||
- [ ] All existing tests pass unchanged (with `ASR_PROVIDER=dashscope`)
|
- [x] Error handling: network errors, HTTP errors, empty responses → clear error messages
|
||||||
- [ ] New tests pass
|
- [x] All existing tests pass unchanged (with `ASR_PROVIDER=dashscope`)
|
||||||
- [ ] Acceptance test returns valid transcription from real OpenRouter
|
- [x] New tests pass
|
||||||
|
- [ ] Acceptance test returns valid transcription from real OpenRouter (pending)
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
|
|
@@ -366,10 +411,12 @@ For Phase 5, realtime streaming **stays DashScope-only**. The `ws_asr.py` WebSoc
|
||||||
| Risk | Mitigation |
|
| Risk | Mitigation |
|
||||||
|------|-----------|
|
|------|-----------|
|
||||||
| OpenRouter STT latency > DashScope | Acceptable tradeoff; OpenRouter is cheaper and uses existing API key |
|
| OpenRouter STT latency > DashScope | Acceptable tradeoff; OpenRouter is cheaper and uses existing API key |
|
||||||
| OpenRouter STT not as accurate for Cantonese | Need acceptance testing to verify; may need different model |
|
| OpenRouter STT not as accurate for Cantonese | Language auto-detection used (yue omitted); needs acceptance testing |
|
||||||
| `transcribe_full` sync→async refactor could break callers | Only one caller (`video.py`); minimal blast radius |
|
| `transcribe_full` sync→async refactor could break callers | Only one caller (`video.py`); minimal blast radius |
|
||||||
| No streaming/WebSocket for OpenRouter | Documented as limitation; realtime stays DashScope |
|
| No streaming/WebSocket for OpenRouter | Chunked REST (~3s) implemented; documented latency tradeoff |
|
||||||
| OpenRouter 60s timeout for long videos | Document limitation; large files may need chunking (future) |
|
| OpenRouter 60s timeout for long videos | Document limitation; large files may need chunking (future) |
|
||||||
|
| Wrong model selected (e.g., non-STT model) | Librarian research confirmed 8 supported models; `google/chirp-3` verified compatible |
|
||||||
|
| Cantonese language code unsupported by OpenRouter STT | `yue` omitted; relies on auto-detection |
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
|
|
@@ -395,16 +442,22 @@ For Phase 5, realtime streaming **stays DashScope-only**. The `ws_asr.py` WebSoc
|
||||||
- **`_to_traditional` moved to `asr_providers.py`** — original plan placed it in `asr_client.py` with a cross-import, but this caused a circular import (`asr_client` → `asr_providers` → `asr_client`). Moved to `asr_providers.py`; `asr_client.py` re-exports for backward compatibility with `ws_asr.py`.
|
- **`_to_traditional` moved to `asr_providers.py`** — original plan placed it in `asr_client.py` with a cross-import, but this caused a circular import (`asr_client` → `asr_providers` → `asr_client`). Moved to `asr_providers.py`; `asr_client.py` re-exports for backward compatibility with `ws_asr.py`.
|
||||||
- **Separate `OPENROUTER_API_KEY`** — per user preference for independent accounting.
|
- **Separate `OPENROUTER_API_KEY`** — per user preference for independent accounting.
|
||||||
- **`DashScopeASRProvider` wraps sync OpenAI call in `loop.run_in_executor()`** — avoids blocking the event loop without rewriting the existing DashScope client.
|
- **`DashScopeASRProvider` wraps sync OpenAI call in `loop.run_in_executor()`** — avoids blocking the event loop without rewriting the existing DashScope client.
|
||||||
|
- **Model: `google/chirp-3`** — original plan specified `google/gemini-3.1-flash-lite`, but this model is NOT in OpenRouter's supported STT model list (8 models: whisper variants, chirp-3, voxtral, qwen3-asr-flash). Changed after librarian agent verified model compatibility.
|
||||||
|
- **Realtime OpenRouter: chunked REST (~3s)** — originally out of scope ("Realtime WebSocket stays DashScope-only"). User requested OpenRouter for realtime as well. Implemented via `_ws_proxy_openrouter()`: accumulates PCM from DashScope VPR server, converts to WAV via `pcm_to_wav()`, flushes to OpenRouter REST every ~3s. Uses `flush_lock` (asyncio.Lock) to prevent concurrent API calls.
|
||||||
|
- **Language code filtering** — OpenRouter STT doesn't support ISO 639-3 codes like `yue`. The chunked handler omits the `language` parameter when `language` is `"yue"` or `"auto"`, relying on auto-detection.
|
||||||
|
- **ffmpeg binary** — replaced x86-64 binary with aarch64 static build (johnvansickle.com) for Apple Silicon Mac compatibility.
|
||||||
|
- **Diagnostic logging** — added provider selection, transcription start/complete, and error response body logging to both batch and realtime paths.
|
||||||
|
|
||||||
### Files Changed
|
### Files Changed
|
||||||
|
|
||||||
| File | Action | Details |
|
| File | Action | Details |
|
||||||
|------|--------|---------|
|
|------|--------|---------|
|
||||||
| `backend/app/core/config.py` | Modified | 3 new settings + validation in `get_settings()` |
|
| `backend/app/core/config.py` | Modified | 3 new settings + validation in `get_settings()`; default model: `google/chirp-3` |
|
||||||
| `backend/app/services/asr_providers.py` | **New** | `ASRProvider` ABC, `DashScopeASRProvider`, `OpenRouterASRProvider` (with tenacity retry), `create_asr_provider()` factory |
|
| `backend/app/services/asr_providers.py` | **New** | `ASRProvider` ABC, `DashScopeASRProvider`, `OpenRouterASRProvider` (with tenacity retry), `create_asr_provider()` factory, `_to_traditional()` |
|
||||||
| `backend/app/services/asr_client.py` | Refactored | Thin wrapper; `transcribe_full` now async; utility functions retained |
|
| `backend/app/services/asr_client.py` | Refactored | Thin wrapper; `transcribe_full` now async; re-exports `_to_traditional` for backward compat |
|
||||||
| `backend/app/routers/video.py` | Modified | `await transcribe_full()`; provider-aware API key validation |
|
| `backend/app/routers/video.py` | Modified | `await transcribe_full()`; provider-aware API key validation |
|
||||||
| `backend/.env.example` | Modified | Phase 5 vars with usage comments |
|
| `backend/app/routers/ws_asr.py` | Modified | `pcm_to_wav()`, `_ws_proxy_openrouter()` (3s chunked REST), endpoint dispatch on `ASR_PROVIDER` |
|
||||||
|
| `backend/.env.example` | Modified | Phase 5 vars with usage comments; default: `google/chirp-3` |
|
||||||
| `backend/requirements.txt` | Modified | Added `tenacity>=8.0.0` |
|
| `backend/requirements.txt` | Modified | Added `tenacity>=8.0.0` |
|
||||||
|
|
||||||
### Test Files
|
### Test Files
|
||||||
|
|
@@ -421,3 +474,4 @@ For Phase 5, realtime streaming **stays DashScope-only**. The `ws_asr.py` WebSoc
|
||||||
### Pre-existing Test Failures (Unrelated)
|
### Pre-existing Test Failures (Unrelated)
|
||||||
- Phase 3: `test_phase3_history_service.py`, `test_phase3_prompt_injection.py`, `test_phase3_prompt_service.py`, `test_phase3_prompts_router.py` — pre-existing failures in SQLite/prompt tests unrelated to ASR changes.
|
- Phase 3: `test_phase3_history_service.py`, `test_phase3_prompt_injection.py`, `test_phase3_prompt_service.py`, `test_phase3_prompts_router.py` — pre-existing failures in SQLite/prompt tests unrelated to ASR changes.
|
||||||
- Phase 1: 1 config test — pre-existing, unrelated.
|
- Phase 1: 1 config test — pre-existing, unrelated.
|
||||||
|
- Phase 2 WS: 1 `test_phase2_ws_timeout` — pre-existing timeout, unrelated.
|
||||||
|
|
|
||||||