# Phase 2 Enhancement: Delta-based transcript via existing WebSocket

**Created:** 2026-05-07
**Status:** Complete
**Depends on:** Phase 2 (Complete)

---

## 1. Goal

Replace full-text WebSocket responses with **delta-only updates**. The backend computes the difference between consecutive `text` fields and sends only the new characters; the frontend appends them client-side. Everything runs over the existing WebSocket.

## 2. Why `text` field over `stash`

Analysis of the stash logs showed that DashScope partial events carry two text fields:

| Field | Behavior | Description |
|-------|----------|-------------|
| `stash` | Sliding window, ~7-20 chars | Latest uncommitted characters |
| `text` | **Monotonically growing** (within one `item_id`) | Stable committed transcription |

`text` is the stable, cumulative transcription. It grows monotonically within each utterance (`item_id`), so the delta is `text[len(prev_text):]`: a simple suffix diff with no merge logic.

`stash` is still useful: it holds trailing characters not yet committed to `text`. It is sent alongside each delta so the frontend can append it on pause and complete the last sentence.

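The suffix diff described above can be sketched in a few lines of Python. This is a minimal illustration rather than the code in `ws_asr.py`: the helper name `suffix_delta` and the sample event texts are assumptions made for the example.

```python
def suffix_delta(prev_text: str, text: str) -> str:
    """Return the characters of `text` that come after `prev_text`.

    Falls back to the whole `text` when it does not extend `prev_text`
    (for example after an utterance boundary).
    """
    if text.startswith(prev_text):
        return text[len(prev_text):]
    return text


# Hypothetical `text` values from consecutive partial events of one item_id.
events = ["多謝", "多謝主席", "多謝主席咁啊"]

prev = ""
deltas = []
for text in events:
    deltas.append(suffix_delta(prev, text))
    prev = text

print(deltas)                         # ['多謝', '主席', '咁啊']
print("".join(deltas) == events[-1])  # True
```

Because `text` only grows within one `item_id`, concatenating the deltas reproduces the latest `text` exactly, which is why no merge logic is required on either side.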
## 3. Architecture

```
DashScope partial event:
  {"text": "多謝主席咁啊", "stash": "咁啊亦都", "item_id": "item_ABC"}

Backend computes:
  delta = text[len(prev_text):]   → "咁啊" (new chars only)
  sends: {"delta":"咁啊", "stash":"咁啊亦都", "is_final":false}

Frontend:
  transcriptRef += msg.delta      → accumulates locally
  lastStashRef = msg.stash        → stores for pause handler

On pause:
  if stash not already at end of text:
    text += stash                 → "...可以诶处理埋诶呢啲余。"
  onFinalTranscript(text)         → persists in QueryInput
```

## 4. Key Design Decisions

### 4.1 Stash-only events skipped (text empty)

Early in each utterance, `text` is `""` while `stash` already holds sliding-window content. These events are skipped because a delta cannot be computed from a sliding window. Once `text` starts populating (after ~1-2s), deltas begin to stream.

### 4.2 Utterance boundary tracking

DashScope splits speech into `item_id` segments. When `item_id` changes, `text` resets to `""`. On detecting a change:

- `prev_display` is set to `" "` (a space), so the next utterance's full text is sent as a delta, prefixed with a space
- Result: `"...上一句 融資安排方面..."`, so utterances flow continuously

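The boundary handling can be sketched as a small state machine. The class below is a hypothetical helper, not part of `ws_asr.py`: it keeps the separator in its own field (`pending_separator`, an assumed name) instead of reusing `prev_display` as a sentinel, and the `item_A`/`item_B` events are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class DeltaTracker:
    """Per-session delta state (hypothetical helper for illustration)."""

    prev_text: str = ""
    current_item_id: str = ""
    pending_separator: str = ""
    sent_anything: bool = False

    def feed(self, item_id: str, text: str) -> str:
        """Return the delta for one partial event; '' means nothing to send."""
        if item_id and item_id != self.current_item_id:
            # New utterance: `text` restarts from "", so drop the old prefix
            self.current_item_id = item_id
            self.prev_text = ""
            # Separate utterances with a space, but not before the first one
            self.pending_separator = " " if self.sent_anything else ""
        if not text.strip():
            return ""  # stash-only event, no committed text yet
        if text.startswith(self.prev_text):
            delta = text[len(self.prev_text):]
        else:
            delta = text  # text was rewritten: resend it in full
        self.prev_text = text
        if not delta:
            return ""
        delta = self.pending_separator + delta
        self.pending_separator = ""
        self.sent_anything = True
        return delta


tracker = DeltaTracker()
stream = [
    ("item_A", ""), ("item_A", "上一"), ("item_A", "上一句"),
    ("item_B", ""), ("item_B", "融資"), ("item_B", "融資安排"),
]
sent = [tracker.feed(item_id, text) for item_id, text in stream]
print("".join(sent))  # 上一句 融資安排
```

Holding the separator until the first non-empty `text` of the new item matters because the first events of an utterance are stash-only and produce no delta.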
### 4.3 Trailing stash on pause

The `text` field lags behind: the final characters of each utterance exist only in `stash`. Every WS message includes the current `stash`, and the frontend stores it. On pause, the frontend appends the stash to the text unless the text already ends with it.

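The hook in this document only checks whether the text already ends with the whole stash. Where the stash can partially overlap the tail of the text (as in the sample event in section 3, where `text` ends with `咁啊` and `stash` starts with it), an overlap-aware append may be preferable. The sketch below is a possible variant under that assumption; `append_stash` is a hypothetical name and is not part of the current implementation.

```python
def append_stash(text: str, stash: str) -> str:
    """Append `stash` to `text`, dropping any part `text` already ends with."""
    text = text.strip()
    stash = stash.strip()
    # Try the longest possible overlap first, then shorter ones
    for size in range(min(len(text), len(stash)), 0, -1):
        if text.endswith(stash[:size]):
            return text + stash[size:]
    # No overlap at all: plain concatenation
    return text + stash


print(append_stash("多謝主席咁啊", "咁啊亦都"))  # 多謝主席咁啊亦都
print(append_stash("abc", "xyz"))               # abcxyz
```

One trade-off to keep in mind: a genuinely repeated character at the boundary would be treated as overlap and collapsed, so this variant suits cases where the stash is known to restate the tail of the committed text.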
## 5. Backend Implementation (`ws_asr.py`)

```python
# Excerpt from the WebSocket handler: accumulated_text, event_queue,
# client_ws and the helper functions live in the enclosing scope.
prev_display = ""
current_item_id = ""

async def read_events():
    nonlocal accumulated_text, prev_display, current_item_id
    while True:
        event = await event_queue.get()
        result = format_transcription_event(event, accumulated_text)
        if result is None:
            continue
        if result["is_final"]:
            # Completed utterance: send full accumulated text
            transcript = event.get("transcript", "")
            if transcript and transcript.strip():
                accumulated_text = build_display_text(accumulated_text, transcript)
            prev_display = ""
            result["delta"] = ""
            result["full_text"] = _to_traditional(accumulated_text)
        else:
            text = result.pop("text", "")
            stash = result.pop("stash", "")

            # Utterance boundary: item_id change resets text to empty
            item_id = event.get("item_id", "")
            if item_id and item_id != current_item_id:
                if prev_display:
                    prev_display = " "  # sentinel: next utterance starts with a space
                current_item_id = item_id

            # text is monotonically growing within one utterance
            if text.strip():
                new_delta = ""
                if text != prev_display:
                    if prev_display and text.startswith(prev_display):
                        new_delta = text[len(prev_display):]
                    elif prev_display == " ":
                        # First text after a boundary: keep the space separator
                        new_delta = " " + text
                    else:
                        new_delta = text
                    prev_display = text
                result["delta"] = _to_traditional(new_delta) if new_delta else ""
                result["full_text"] = ""
                result["stash"] = _to_traditional(stash) if stash.strip() else ""
            else:
                # text empty (new utterance starting): skip
                continue
        await client_ws.send_json(result)
```

## 6. Frontend Implementation (`useVideoASR.ts`)

```typescript
const transcriptRef = useRef('')
const lastStashRef = useRef('')

ws.onmessage = (e) => {
  const msg = JSON.parse(e.data)
  if (msg.is_final && msg.full_text) {
    // Completed utterance: the server's accumulated text replaces local state
    transcriptRef.current = msg.full_text
    lastStashRef.current = ''
    setTranscript(msg.full_text)
    setPartialTranscript('')
    onFinalTranscriptRef.current?.(msg.full_text)
  } else if (msg.delta) {
    // Partial event: append the new characters and remember the latest stash
    transcriptRef.current += msg.delta
    lastStashRef.current = msg.stash || ''
    setTranscript(transcriptRef.current)
    setPartialTranscript(transcriptRef.current)
  }
}

const stopStreaming = useCallback(() => {
  closeWebSocket()
  let text = transcriptRef.current.trim()
  const stash = lastStashRef.current.trim()
  // Complete the last sentence with the trailing stash unless already present
  if (stash && !text.endsWith(stash)) {
    text += stash
  }
  lastStashRef.current = ''
  if (text) {
    onFinalTranscriptRef.current?.(text)
    setPartialTranscript('')
  }
}, [closeWebSocket])
```

## 7. WebSocket Message Format

```json
// Partial event (new chars only, with stash):
{"delta":"咁啊", "stash":"咁啊亦都", "full_text":"", "language":"yue", "is_final":false}

// Completed event (full accumulated text):
{"delta":"", "stash":"", "full_text":"多謝主席咁啊亦都...", "language":"yue", "is_final":true}
```

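To check the protocol without a browser, the two message kinds can be replayed with a few lines of Python. This is a rough sketch that mirrors the frontend's handling; `apply_message` is a hypothetical helper and the three messages are invented samples in the format above.

```python
import json


def apply_message(transcript: str, raw: str) -> str:
    """Return the transcript after applying one WebSocket message."""
    msg = json.loads(raw)
    if msg.get("is_final") and msg.get("full_text"):
        # Completed event: the server's accumulated text replaces local state
        return msg["full_text"]
    # Partial event: append only the new characters
    return transcript + msg.get("delta", "")


messages = [
    '{"delta":"多謝主席", "stash":"", "full_text":"", "language":"yue", "is_final":false}',
    '{"delta":"咁啊", "stash":"咁啊亦都", "full_text":"", "language":"yue", "is_final":false}',
    '{"delta":"", "stash":"", "full_text":"多謝主席咁啊亦都", "language":"yue", "is_final":true}',
]

transcript = ""
for raw in messages:
    transcript = apply_message(transcript, raw)

print(transcript)  # 多謝主席咁啊亦都
```

A replay like this could serve as a protocol-level test: after each completed event, the locally accumulated deltas should never disagree with `full_text` by more than the trailing stash.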
## 8. Files Changed

| File | Change |
|------|--------|
| `backend/app/routers/ws_asr.py` | Delta computation from `text` field, `item_id` tracking, `stash` passthrough |
| `backend/app/test/test_phase2_ws_protocol.py` | Updated partial event tests for `text`/`stash` fields |
| `frontend/src/hooks/useVideoASR.ts` | `lastStashRef`, delta accumulation, stash append on pause |
| `frontend/src/types/index.ts` | Added `stash?: string` to `ASRMessage` |

## 9. Edge Cases Handled

| Case | Handling |
|------|----------|
| `text` empty (new utterance) | Skip: a delta is not computable from the sliding `stash` |
| `item_id` changes | Reset `prev_display`, prepend space to next utterance |
| Trailing stash not in `text` | Frontend appends on pause via `lastStashRef` |
| `text` doesn't start with `prev_display` | Send entire `text` as delta (utterance boundary) |
| Multiple completed events | Each resets `prev_display` and `lastStashRef` |
| Play again after pause | New WS session, all state reset |

## 10. Lessons Learned

- **`text` is the right field**: it grows monotonically, so the delta is a simple suffix diff
- **`stash` is a sliding window**: it can't be accumulated and can't be delta-diffed
- **`item_id` matters**: DashScope splits speech into items, and `text` resets on item change
- **`_merge_stash` not needed**: `text` is already cumulative; `stash` is only needed for trailing chars
- **`replace: true` destroys cross-utterance text**: simpler to just append with a space separator
- **Stash on pause completes sentences**: `text` lags ~200ms behind, and `stash` fills the gap